Skill

beautifulsoup4-consistency

Write, review, refactor, or debug Python code that parses HTML or XML with BeautifulSoup / bs4 (find, find_all, select, get_text, extracting links, tables, attributes) using one canonical idiom set. Use this skill whenever code scrapes or cleans markup, walks a parse tree, or when the user hits "NoneType object has no attribute", different results on different machines (parser auto-detection), empty .string on nested tags, KeyError on tag attributes, or asks find_all vs select. Trigger it even when the user just says "extract the prices from this HTML" or "parse this page" in Python — without saying the words "BeautifulSoup idioms."

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/beautifulsoup4-consistency:beautifulsoup4-consistency

User invocable

Model invocable

Inline context

Default effort

SKILL.md

102 lines · ~1.5k tokens

Stats

Stars: 0

Maintenance: Good

Last Commit: Jun 12, 2026

BeautifulSoup — consistent idioms

bs4 has been stable for a decade, so the problem isn't stale APIs — it's non-determinism and style drift: code that omits the parser argument builds different trees on different machines, mixes find_all and select arbitrarily, chains attributes off calls that return None, and confuses the three text accessors. This skill pins one canonical idiom set for bs4 4.12+.

Canonical idioms — always X, never Y

| Always | Never | Why |
|---|---|---|
| `BeautifulSoup(html, "html.parser")` (or `"lxml"`, chosen deliberately) | `BeautifulSoup(html)` | Auto-detection picks the best parser that happens to be installed (and emits `GuessedAtParserWarning`); broken markup parses into different trees per machine. |
| `soup.select("div.item > a")` for structural queries | deeply nested `find` chains for structure | CSS selectors express descent and combinators readably. |
| `soup.find("a", class_="btn", href=True)` for attribute logic | `select` with attribute pseudo-gymnastics | Keyword filters, regexes, and callables belong to `find`/`find_all`. |
| guard the miss: `if (el := soup.select_one(".price")) is not None:` | `soup.find(...).text` chains | `find`/`select_one` return `None` on a miss → `AttributeError` far from the cause. |
| `tag.get("href")` (or `tag.get("href", "")`) | `tag["href"]` on optional attributes | Subscript raises `KeyError`; `.get` makes the miss explicit. |
| `el.get_text(strip=True)` (with `separator=` when joining matters) | `.string` on elements with children / bare `.text` everywhere | `.string` is `None` unless the element has exactly one child (it recurses through a lone child tag); bare `.text` concatenates without separators ("priceQty"). |
| `class_=` keyword / `attrs={"class": ...}` | `class=` (syntax error) or matching the full class string | `class` is multi-valued: `class_="btn"` matches `class="btn primary"`; string equality on the joined value doesn't. |
| `find_all` | `findAll` / `findChildren` camelCase aliases | The camelCase names are leftovers from BeautifulSoup 3 and earlier. |
| feed bytes and let bs4 detect, or decode explicitly once | `html.decode()` guesses scattered through code | bs4's encoding detection (which honors the declared charset) beats ad-hoc decodes. |
| `response.text` only after checking `response.encoding` sanity | trusting requests' ISO-8859-1 fallback | Mis-decoded input corrupts every downstream string. |
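Several rows of the table can be seen side by side in a short sketch. The sample markup and the variable names below are made up for illustration, and `html.parser` is used only so the example needs nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = """
<div class="item">
  <a class="btn primary" href="/buy">Buy</a>
  <a class="btn">No link</a>
  <p class="price"><span>$</span><span>9.99</span></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # parser pinned explicitly

# class_ matches a single class token, so "btn" also matches class="btn primary".
buttons = soup.find_all("a", class_="btn")

# .get returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in buttons]

# Guard the miss before chaining: select_one returns None when nothing matches.
price_text = None
if (price := soup.select_one("p.price")) is not None:
    price_text = price.get_text(strip=True)

missing = soup.select_one(".discount")  # no such element in the sample

# <p> has two child tags, so .string is None; separator= keeps the parts apart.
price_string = soup.select_one("p.price").string
joined = soup.select_one("p.price").get_text(" ", strip=True)

# Bytes in, text out: bs4 reads the declared charset itself.
raw = '<meta charset="utf-8"><p>café</p>'.encode("utf-8")
decoded = BeautifulSoup(raw, "html.parser").p.get_text()
```

Here `hrefs` holds one real link and one `None`, which is the "explicit miss" the table asks for; the caller decides what to do with it instead of catching a `KeyError` somewhere downstream.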

House style:

```python
from bs4 import BeautifulSoup

# `html` is the markup to parse (str or bytes), obtained elsewhere.
soup = BeautifulSoup(html, "lxml")

products = []
for card in soup.select("ul.products > li.product"):
    link = card.select_one("a.title")
    price = card.select_one("span.price")
    if link is None or price is None:
        continue  # structure changed or ad-card; skip explicitly
    products.append({
        "name": link.get_text(strip=True),
        "url": link.get("href"),
        "price": price.get_text(strip=True),
        "sku": card.get("data-sku"),
    })
```
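To see how the house-style loop behaves on imperfect input, here is a minimal sketch that wraps it in a function and runs it on a small sample. The function name `extract_products` and the sample markup are hypothetical, and the parser is swapped to `html.parser` so the sketch runs without installing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one complete card, one ad card without a title link,
# and one card without a data-sku attribute.
SAMPLE = """
<ul class="products">
  <li class="product" data-sku="A1">
    <a class="title" href="/p/a1"> Widget </a>
    <span class="price">$4.00</span>
  </li>
  <li class="product ad"><span class="price">Sponsored</span></li>
  <li class="product">
    <a class="title" href="/p/b2">Gadget</a>
    <span class="price">$7.50</span>
  </li>
</ul>
"""


def extract_products(html):
    """Apply the house-style loop to one document and return a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")  # swapped in for "lxml" (assumption)
    products = []
    for card in soup.select("ul.products > li.product"):
        link = card.select_one("a.title")
        price = card.select_one("span.price")
        if link is None or price is None:
            continue  # the ad card lands here and is skipped explicitly
        products.append({
            "name": link.get_text(strip=True),
            "url": link.get("href"),
            "price": price.get_text(strip=True),
            "sku": card.get("data-sku"),  # None when the attribute is absent
        })
    return products


products = extract_products(SAMPLE)
```

The ad card is dropped by the `None` guard rather than by an exception, and the third card keeps `"sku": None`, so a missing attribute stays visible in the output.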

Pitfalls that produce silently wrong results

- Parser differences are real: `html.parser` (stdlib, lenient), `lxml` (fast, repairs markup aggressively, requires install), `html5lib` (slowest, parses the way a browser does). Broken tables and nesting parse differently, so the same code "works here, fails there" until the parser is pinned.
- `.string` returns `None` when an element has more than one child node (even a comment counts), so the bug appears only on some rows of real data.
- Whitespace text nodes live in `.contents`/`.children`; iterating them as if they held only elements breaks. Use `tag.find_all(recursive=False)` to get element children only.
- `find_all(text="...")` (now `string=`) matches the exact full string of a text node. For partial matches use `string=re.compile(...)`, or find the element first and then check `.get_text()`.
- `select` takes CSS selectors, `find` does not: `soup.find("div.item")` looks for a literal tag named `div.item`. Conversely, `:contains` is non-standard CSS (Soup Sieve spells it `:-soup-contains`); prefer `find(string=...)` logic for text matching.
- Mutating while iterating (`decompose()` inside a `find_all` loop) is fine because `find_all` returns a list, but removing nodes while iterating `.children` skips nodes; iterate over `list(tag.children)` instead.
- Re-parsing fragments: `BeautifulSoup("<td>x</td>", "lxml")` wraps the fragment in `<html><body>` and may drop or move table elements; parse fragments with `html.parser` or parse the whole document.
- bs4 sees only the served HTML. If the data is rendered by JavaScript, no selector will find it; check `response.text` first and switch to a browser tool when the data is absent.
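A few of these pitfalls are easy to reproduce on a tiny document. The sketch below uses made-up markup and `html.parser`; it illustrates the behaviors described above and is not a test suite for bs4.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = (
    "<ul>\n<li>Price: <b>9</b></li>\n<li>Sold out</li>\n</ul>"
    "<div class='item'>x</div>"
)
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# .contents includes the "\n" text nodes; find_all(recursive=False) yields tags only.
node_count = len(ul.contents)
items = ul.find_all(recursive=False)

# .string is None for the first <li> (text node plus <b>), set for the second.
first_string = items[0].string
second_string = items[1].string

# string= compares against a whole text node, so a partial literal finds nothing.
exact = soup.find_all(string="Price")
partial = soup.find_all(string=re.compile("Price"))

# find() takes a tag name, not a selector; select_one() takes the selector.
by_find = soup.find("div.item")
by_select = soup.select_one("div.item")

# Snapshot .children with list() before removing nodes from the same parent.
p = BeautifulSoup("<p><i>a</i><i>b</i><i>c</i></p>", "html.parser").p
for child in list(p.children):
    child.decompose()
remaining = len(p.contents)
```

The `list(...)` snapshot is the design choice worth copying: the loop walks a frozen copy while `decompose()` edits the live `.contents` list, so no sibling is skipped.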

Version notes

Target bs4 4.12+. Version 4.13 keeps these idioms; it adds type hints and warns more loudly about the bs3-era aliases. The `text=` parameter has been spelled `string=` since 4.4, so always write `string=`. Pair bs4 with `lxml` for speed or with the stdlib `html.parser` for zero dependencies, and pick one per project.
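If a project wants to confirm the installed version against this target, one possible check is sketched below. The names `installed` and `meets_target` are illustrative, and the parsing assumes `bs4.__version__` starts with a numeric `major.minor` prefix.

```python
import bs4
from bs4 import BeautifulSoup

# Assumption: the version string begins with "major.minor", e.g. "4.12.3".
installed = tuple(int(part) for part in bs4.__version__.split(".")[:2])
meets_target = installed >= (4, 12)

# string= is the current spelling of the old text= argument.
soup = BeautifulSoup("<table><tr><td>x</td><td>y</td></tr></table>", "html.parser")
cell = soup.find("td", string="x")
```

With a tag name plus `string=`, `find` returns the matching tag (here the first `<td>`) rather than the text node.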

Workflow

1. Pin the parser in every `BeautifulSoup(...)` call; verify the target data exists in the raw HTML (not JS-rendered).
2. Query structure with `select`/`select_one` and attribute/text logic with `find`/`find_all`, and apply that split consistently.
3. Guard every single-result query against `None` before chaining; use `.get` for attributes.
4. Extract text with `get_text(strip=True, separator=...)`; never use `.string` unless the element is known to be text-only.
5. When reviewing, hunt for parser-less constructors, `.find(...).x` chains, `tag["attr"]` on optional attributes, camelCase methods, and `.text` joins without separators.
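Part of the review step can be automated. The sketch below is a hypothetical helper (`find_bs4_smells` is not part of bs4) that uses the standard-library `ast` module to flag three of the targets: parser-less constructors, camelCase aliases, and attribute chains hanging off a single-result query. It is a heuristic: it only recognizes a bare `BeautifulSoup(...)` name and will also flag unrelated methods that happen to be called `find`.

```python
import ast

LEGACY_NAMES = {"findAll", "findChildren"}  # the aliases named in this skill
SINGLE_RESULT = {"find", "select_one"}      # calls that may return None


def find_bs4_smells(source):
    """Return sorted (line, message) pairs for a few review targets."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id == "BeautifulSoup":
                # Pinned means a second positional argument or features=...
                pinned = len(node.args) >= 2 or any(
                    kw.arg == "features" for kw in node.keywords
                )
                if not pinned:
                    smells.append((node.lineno, "parser not pinned"))
            elif isinstance(func, ast.Attribute) and func.attr in LEGACY_NAMES:
                smells.append((node.lineno, f"legacy alias {func.attr}"))
        elif isinstance(node, ast.Attribute):
            inner = node.value
            # Matches soup.find(...).text style chains with no None guard.
            if (
                isinstance(inner, ast.Call)
                and isinstance(inner.func, ast.Attribute)
                and inner.func.attr in SINGLE_RESULT
            ):
                smells.append(
                    (node.lineno, f"unguarded chain off {inner.func.attr}()")
                )
    return sorted(smells)


# Hypothetical code under review; it is parsed, never executed.
SNIPPET = '''
soup = BeautifulSoup(html)
ok = BeautifulSoup(html, "html.parser")
rows = soup.findAll("tr")
title = soup.find("h1").text
'''

report = find_bs4_smells(SNIPPET)
```

Each entry in `report` pairs a line number in the reviewed source with a short message, so the output can be fed to whatever reporting a project already uses.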

For parser comparison details, navigation/search API reference, encoding handling, and tree-modification patterns, read references/beautifulsoup4-patterns.md.
