From beautifulsoup4-consistency
Write, review, refactor, or debug Python code that parses HTML or XML with BeautifulSoup / bs4 (find, find_all, select, get_text, extracting links, tables, attributes) using one canonical idiom set. Use this skill whenever code scrapes or cleans markup, walks a parse tree, or when the user hits "NoneType object has no attribute", different results on different machines (parser auto-detection), empty .string on nested tags, KeyError on tag attributes, or asks find_all vs select. Trigger it even when the user just says "extract the prices from this HTML" or "parse this page" in Python — without saying the words "BeautifulSoup idioms."
Slash command
/beautifulsoup4-consistency:beautifulsoup4-consistency
bs4 has been stable for a decade, so the problem isn't stale APIs — it's non-determinism
and style drift: code that omits the parser argument builds different trees on
different machines, mixes find_all and select arbitrarily, chains attributes off
calls that return None, and confuses the three text accessors. This skill pins one
canonical idiom set for bs4 4.12+.
| Always | Never | Why |
|---|---|---|
| `BeautifulSoup(html, "html.parser")` (or `"lxml"`, chosen deliberately) | `BeautifulSoup(html)` | Auto-detection picks whatever is installed; broken markup parses into different trees per machine. |
| `soup.select("div.item > a")` for structural queries | mega-nested `find` chains for structure | CSS selectors express descent/combinators readably. |
| `soup.find("a", class_="btn", href=True)` for attribute logic | `select` with attribute pseudo-gymnastics | Keyword filters, regex, and callables belong to `find`/`find_all`. |
| guard the miss: `if (el := soup.select_one(".price")) is not None:` | `soup.find(...).text` chains | `find`/`select_one` return `None` on a miss → `AttributeError` far from the cause. |
| `tag.get("href")` (or `tag.get("href", "")`) | `tag["href"]` on optional attributes | Subscript raises `KeyError`; `.get` makes the miss explicit. |
| `el.get_text(strip=True)` (with `separator=` when joining matters) | `.string` on elements with children / bare `.text` everywhere | `.string` is `None` unless exactly one text child; bare `.text` concatenates without separators (`"priceQty"`). |
| `class_=` keyword / `attrs={"class": ...}` | `class=` (syntax error) or matching the full class string | `class` is multi-valued: `class_="btn"` matches `class="btn primary"`; string-equality on the joined value doesn't. |
| `find_all` | `findAll` / `findChildren` camelCase aliases | The camelCase names are bs3 leftovers. |
| feed bytes and let bs4 detect, or decode explicitly once | `html.decode()` guesses scattered through code | bs4's encoding detection (with the declared charset) beats ad-hoc decodes. |
| `response.text` only after checking `response.encoding` sanity | trusting requests' ISO-8859-1 fallback | Mis-decoded input corrupts every downstream string. |
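
Several rows of the table can be checked in a few lines. The sketch below is illustrative only: the sample markup and variable names are made up, and it uses the stdlib `html.parser` so it runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = (
    '<div class="card">'
    '<a class="btn primary" href="/buy">Buy</a>'
    '<a class="btn">Details</a>'
    '<p><span>price</span><span>Qty</span></p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")  # parser pinned explicitly

# class_ matches one class out of several on the tag.
buttons = soup.find_all("a", class_="btn")

# href=True keeps only tags that actually carry the attribute.
linked = soup.find_all("a", class_="btn", href=True)

# .get returns None for a missing attribute; tag["href"] would raise KeyError.
hrefs = [a.get("href") for a in buttons]

# The three text accessors disagree on an element with two children.
para = soup.find("p")
as_string = para.string                                # None
as_text = para.text                                    # no separator
as_get_text = para.get_text(separator=" ", strip=True)

# Guard the miss: select_one returns None when nothing matches.
price = soup.select_one(".price")
price_text = price.get_text(strip=True) if price is not None else None

print(len(buttons), len(linked), hrefs)
print(repr(as_string), repr(as_text), repr(as_get_text), price_text)
```

Here `.text` yields `"priceQty"` while `get_text(separator=" ", strip=True)` yields `"price Qty"`, which is the difference the table warns about.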
House style:

    from bs4 import BeautifulSoup

    # html: the raw markup (str or bytes) fetched earlier
    soup = BeautifulSoup(html, "lxml")  # parser named explicitly, never auto-detected
    products = []
    for card in soup.select("ul.products > li.product"):
        link = card.select_one("a.title")
        price = card.select_one("span.price")
        if link is None or price is None:
            continue  # structure changed or ad-card; skip explicitly
        products.append({
            "name": link.get_text(strip=True),
            "url": link.get("href"),        # None if the attribute is missing
            "price": price.get_text(strip=True),
            "sku": card.get("data-sku"),    # optional attribute, so .get
        })
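
To see the house-style loop run end to end, here is a minimal sketch that wraps it in a function and feeds it sample markup. The function name `extract_products`, the `SAMPLE` string, and the choice of `html.parser` (swapped in so the example has no third-party parser dependency) are assumptions for illustration, not part of the skill itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: the second <li> is an ad card without a price.
SAMPLE = """
<ul class="products">
  <li class="product" data-sku="A1">
    <a class="title" href="/p/a1"> Kettle </a><span class="price"> 19.90 </span>
  </li>
  <li class="product"><a class="title" href="/ad">Sponsored</a></li>
  <li class="product">
    <a class="title">Mug</a><span class="price">4.50</span>
  </li>
</ul>
"""


def extract_products(html, parser="html.parser"):
    """Apply the house-style loop; `parser` is always passed explicitly."""
    soup = BeautifulSoup(html, parser)
    products = []
    for card in soup.select("ul.products > li.product"):
        link = card.select_one("a.title")
        price = card.select_one("span.price")
        if link is None or price is None:
            continue  # incomplete card (e.g. an ad); skip explicitly
        products.append({
            "name": link.get_text(strip=True),
            "url": link.get("href"),      # None when the attribute is absent
            "price": price.get_text(strip=True),
            "sku": card.get("data-sku"),  # None when the attribute is absent
        })
    return products


products = extract_products(SAMPLE)
for product in products:
    print(product)
```

The ad card is skipped by the `None` guard, and the card without `href` or `data-sku` still produces a row with explicit `None` values instead of raising `KeyError`.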
Pitfalls:

- Parsers differ: `html.parser` (stdlib, lenient), `lxml` (fast, fixes markup aggressively, requires install), `html5lib` (slowest, browser-identical). Broken tables and nesting parse differently, so the same code "works here, fails there" until the parser is pinned.
- `.string` returns `None` when an element has more than one child node (even a comment counts), so the bug appears only on some rows of real data.
- Text nodes, including whitespace-only ones, appear in `.contents`/`.children`; iterating and assuming elements-only breaks. Use `tag.find_all(recursive=False)` for element children.
- `find_all(text="...")` (now `string=`) matches the exact full string of a text node. Use `string=re.compile(...)`, or search for the element and then check `.get_text()`.
- `select` is full CSS, `find` is not: `soup.find("div.item")` looks for a literal tag named `div.item`. Conversely, `:contains` is non-standard; use `find(string=...)` logic instead.
- Modifying the tree while iterating (for example `.decompose()` inside a `find_all` loop) is fine because `find_all` returns a list, but iterating `.children` while removing skips nodes.
- Fragments: `BeautifulSoup("<td>x</td>", "lxml")` wraps the result in `<html><body>` and may drop or move table elements; parse fragments with `html.parser` or parse the whole document.
- JavaScript-rendered content is not in the fetched HTML: look for the target data in `response.text` first and switch to a browser tool when absent.

Target bs4 4.12+ (4.13 keeps the idioms; it adds typing and warns harder on bs3 aliases). The `text=` parameter has been `string=` since 4.4, so write `string=`. Pair with `lxml` for speed or stdlib `html.parser` for zero dependencies; pick one per project.
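
Most of the pitfalls above can be reproduced on a few lines of markup. The following is a sketch under stated assumptions: the sample HTML and all variable names are invented, and only `html.parser` is used, so the `lxml` fragment behavior is not shown.

```python
import re
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup, invented for this illustration.
html = """<ul id="menu">
  <li>Tea <!-- hot --></li>
  <li>Coffee</li>
</ul>
<div class="item">Price: 4.50 EUR</div>"""
soup = BeautifulSoup(html, "html.parser")

# .string is None as soon as there is more than one child (text + comment).
tea, coffee = soup.select("#menu > li")
tea_string = tea.string
coffee_string = coffee.string

# .children includes whitespace text nodes; find_all(recursive=False) does not.
menu = soup.find("ul", id="menu")
all_children = list(menu.children)
element_children = menu.find_all(recursive=False)
non_tags = [child for child in all_children if not isinstance(child, Tag)]

# string= needs the exact full text node; a compiled regex matches partially.
exact = soup.find(string="Price")
partial = soup.find(string=re.compile("Price"))

# find() treats its argument as a tag name, select_one() as a CSS selector.
as_tag_name = soup.find("div.item")
as_css = soup.select_one("div.item")

# html.parser leaves a fragment as it is, with no <html><body> wrapper.
fragment = BeautifulSoup("<td>x</td>", "html.parser")

# Removing nodes while looping over find_all is safe: it returned a list.
for li in soup.find_all("li"):
    li.decompose()
remaining = soup.find_all("li")

print(repr(tea_string), repr(coffee_string), len(all_children), len(element_children))
print(repr(exact), repr(partial), as_tag_name, as_css, str(fragment), remaining)
```
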
Checklist:

- Pin the parser in every `BeautifulSoup(...)` call; verify the target data exists in the raw HTML (not JS-rendered).
- Structural queries with `select`/`select_one`, attribute/text logic with `find`/`find_all`, consistently.
- Check for `None` before chaining; use `.get` for attributes.
- Text via `get_text(strip=True, separator=...)`; never `.string` unless the element is known text-only.
- No `.find(...).x` chains, `tag["attr"]` on optional attrs, camelCase methods, or `.text` joins without separators.

For parser comparison details, navigation/search API reference, encoding handling, and tree-modification patterns, read `references/beautifulsoup4-patterns.md`.
npx claudepluginhub guidogl/beautifulsoup4-consistency --plugin beautifulsoup4-consistency