From docs-skills
Download HTML from websites and extract article content, removing HTML bloat. Specifically designed to extract content from <article> tags with aria-live="polite" attribute. Outputs clean, readable content for documentation sites like Red Hat docs.
Slash command
/docs-skills:article-extractor
This skill downloads HTML from websites and extracts the article content, removing unnecessary HTML bloat. It's particularly useful for documentation websites that have large amounts of navigation, styling, and other non-content HTML.
The skill uses a Python script that downloads and parses HTML content.
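The extraction step can be sketched with the standard library alone. This is a minimal illustration, not the skill's actual script (which, per the dependency list, relies on requests and beautifulsoup4): the names ArticleExtractor and extract_article are hypothetical, the input is an HTML string rather than a downloaded page, and only the default article[aria-live="polite"] case is handled instead of arbitrary CSS selectors.

```python
from html import escape
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect the markup inside the first <article aria-live="polite">.

    Hypothetical stdlib-only stand-in for the skill's BeautifulSoup logic.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # <article> nesting depth while capturing
        self.found = False  # True once a matching <article> was opened
        self.done = False   # True once that <article> was closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "article":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif tag == "article" and dict(attrs).get("aria-live") == "polite":
            self.depth = 1
            self.found = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth or self.done:
            return
        if tag == "article":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth and not self.done:
            # Character references were decoded by the parser; re-escape them.
            self.parts.append(escape(data, quote=False))


def extract_article(html_text):
    """Return the inner HTML of the matching article, or None if absent."""
    parser = ArticleExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts) if parser.found else None
```

Navigation, footers, and anything else outside the matching article element are never collected, which is the "bloat removal" the skill describes.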
Extract article from a URL:
uv run --script ${CLAUDE_SKILL_DIR}/scripts/article_extractor.py -- --url "https://example.com/page"
Extract with specific output format:
uv run --script ${CLAUDE_SKILL_DIR}/scripts/article_extractor.py -- --url "https://example.com/page" --format markdown
Save to file:
uv run --script ${CLAUDE_SKILL_DIR}/scripts/article_extractor.py -- --url "https://example.com/page" --output article.md
Extract with custom article selector:
uv run --script ${CLAUDE_SKILL_DIR}/scripts/article_extractor.py -- --url "https://example.com/page" --selector "article.main-content"
--url URL: The URL to fetch HTML from (required)
--format {html,markdown,text}: Output format (default: markdown)
--output FILE: Save output to file instead of stdout
--selector SELECTOR: CSS selector for article content (default: article[aria-live="polite"])
--pretty: Pretty-print HTML output with indentation
--strip-links: Remove all hyperlinks from output
HTML (default): Extracts the article HTML content with all tags preserved but removes surrounding bloat.
Markdown: Converts the article content to Markdown format for easy reading and documentation.
Plain Text: Strips all HTML tags and returns plain text content.
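For reference, the documented options map naturally onto an argparse definition. The sketch below is an assumption about how such a command line could be declared, not a copy of article_extractor.py; build_parser is a hypothetical name, and the default format follows the option list (markdown), so the real script may behave differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed above."""
    parser = argparse.ArgumentParser(
        description="Download a page and extract its article content."
    )
    parser.add_argument("--url", required=True,
                        help="URL to fetch HTML from")
    parser.add_argument("--format", choices=["html", "markdown", "text"],
                        default="markdown",  # assumed from the option list
                        help="output format")
    parser.add_argument("--output", metavar="FILE",
                        help="save output to FILE instead of stdout")
    parser.add_argument("--selector", default='article[aria-live="polite"]',
                        help="CSS selector for the article content")
    parser.add_argument("--pretty", action="store_true",
                        help="pretty-print HTML output with indentation")
    parser.add_argument("--strip-links", action="store_true",
                        help="remove all hyperlinks from output")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Declaring --format with choices means an unsupported value is rejected before any download is attempted.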
# Extract from Red Hat OpenShift Lightspeed documentation
uv run --script ${CLAUDE_SKILL_DIR}/scripts/article_extractor.py -- \
--url "https://docs.redhat.com/en/documentation/red_hat_openshift_lightspeed/1.0/html/install/ols-installing-lightspeed" \
--format markdown \
--output openshift-lightspeed-install.md
# Extract from any site with article tags
uv run --script ${CLAUDE_SKILL_DIR}/scripts/article_extractor.py -- \
--url "https://example.com/docs/guide" \
--selector "article.documentation" \
--format text
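To illustrate what the text format in the last example produces, here is a rough standard-library sketch of tag stripping. It is not the skill's implementation; TextCollector and html_to_text are hypothetical names, and the set of tags treated as line breaks is an assumption chosen for readability.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather visible text, dropping tags plus script and style bodies."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "pre",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than zero inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(fragment):
    """Return plain text with one non-empty line per block element."""
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    raw = "".join(collector.chunks)
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)
```

Link text is kept while the anchor tag and its href are discarded, so the result stays readable without markup.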
This skill requires the following Python packages:
requests: For downloading HTML content
beautifulsoup4: For parsing and extracting HTML
html2text: For converting HTML to Markdown (optional, for markdown format)
Install dependencies:
python3 -m pip install requests beautifulsoup4 html2text
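Because html2text is optional, it can help to check which packages are importable before running an extraction. The helper below is a hypothetical sketch (missing_modules is not part of the skill); note that beautifulsoup4 is imported under the module name bs4.

```python
import importlib.util

# Import names for the packages listed above (beautifulsoup4 installs "bs4").
REQUIRED = ("requests", "bs4")
OPTIONAL = ("html2text",)  # only needed for Markdown output


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing required packages:", ", ".join(missing))
    if missing_modules(OPTIONAL):
        print("html2text not found; Markdown output may be unavailable.")
```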
The skill downloads and processes HTML efficiently, targeting <article> tags or similar semantic HTML.
npx claudepluginhub opendatahub-io/docs-skills --plugin docs-skills