From web-scraper
Use when user asks to scrape, extract articles, fetch news, get blog posts, discover RSS feeds, or parse sitemaps from websites.
How this skill is triggered — by the user, by Claude, or both
Slash command
/web-scraper:web-scraperThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Extract blog and news content from any website. Results are auto-saved to `./scraper-output/` as JSON, Markdown, and HTML preview.
Extract blog and news content from any website. Results are auto-saved to ./scraper-output/ as JSON, Markdown, and HTML preview.
The scraper automatically detects whether a URL is:
When user provides a URL to scrape, create this script and run it:
import { smartScrape, extractArticle } from '${CLAUDE_PLUGIN_ROOT}/lib';
import * as fs from 'fs';
import * as path from 'path';
import { exec } from 'child_process';
async function main() {
const url = 'USER_PROVIDED_URL'; // Replace with actual URL
console.log(`Scraping: ${url}`);
// Smart scrape auto-detects article vs listing
const result = await smartScrape(url, {
maxArticles: 10,
qualityThreshold: 0.3
});
// Create output directory
const outputDir = './scraper-output';
fs.mkdirSync(outputDir, { recursive: true });
// Generate filename from URL and timestamp
const hostname = new URL(url).hostname.replace(/\./g, '-');
const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, 19);
const baseName = `${hostname}_${timestamp}`;
if (result.mode === 'article') {
// Single article extracted
const article = result.article;
// Save JSON
const jsonPath = path.join(outputDir, `${baseName}.json`);
fs.writeFileSync(jsonPath, JSON.stringify(article, null, 2));
// Save Markdown
const mdPath = path.join(outputDir, `${baseName}.md`);
fs.writeFileSync(mdPath, `# ${article.title}\n\n${article.markdown}`);
// Save HTML preview
const html = generateHtmlPreview([{
url: article.url,
title: article.title,
publishedDate: article.publishedDate,
qualityScore: article.confidence,
fullContent: article.html
}], url, 'single-article');
const htmlPath = path.join(outputDir, `${baseName}.html`);
fs.writeFileSync(htmlPath, html);
console.log('\n=== SINGLE ARTICLE EXTRACTED ===');
console.log(`Title: ${article.title}`);
console.log(`Words: ${article.wordCount}`);
console.log(`Reading time: ${article.readingTime} min`);
console.log(`\nSaved to:`);
console.log(` ${jsonPath}`);
console.log(` ${mdPath}`);
console.log(` ${htmlPath}`);
exec(`open "${htmlPath}"`);
console.log('\n✓ Opened preview in browser');
} else if (result.mode === 'listing') {
// Multiple articles discovered
const articles = result.articles;
// Save JSON
const jsonPath = path.join(outputDir, `${baseName}.json`);
fs.writeFileSync(jsonPath, JSON.stringify({ url, articles, stats: result.stats }, null, 2));
// Save Markdown
let markdown = `# Scraped Content: ${url}\n\n`;
markdown += `**Articles:** ${articles.length}\n\n---\n\n`;
for (const article of articles) {
markdown += `## ${article.title}\n\n`;
markdown += `- **URL:** ${article.url}\n`;
markdown += `- **Date:** ${article.publishedDate || 'Unknown'}\n`;
markdown += `- **Quality:** ${Math.round(article.qualityScore * 100)}%\n\n`;
if (article.fullContentMarkdown) {
markdown += article.fullContentMarkdown + '\n\n---\n\n';
}
}
const mdPath = path.join(outputDir, `${baseName}.md`);
fs.writeFileSync(mdPath, markdown);
// Save HTML preview
const html = generateHtmlPreview(articles, url, result.stats?.detectedType || 'auto');
const htmlPath = path.join(outputDir, `${baseName}.html`);
fs.writeFileSync(htmlPath, html);
console.log('\n=== RESULTS ===');
console.log(`Articles: ${articles.length}`);
console.log(`\nSaved to:`);
console.log(` ${jsonPath}`);
console.log(` ${mdPath}`);
console.log(` ${htmlPath}`);
console.log('\nTop articles:');
articles.slice(0, 5).forEach((a, i) => {
console.log(` ${i + 1}. ${a.title} (${Math.round(a.qualityScore * 100)}%)`);
});
exec(`open "${htmlPath}"`);
console.log('\n✓ Opened preview in browser');
} else {
console.error('Failed to extract content:', result.error);
}
}
function generateHtmlPreview(articles: any[], sourceUrl: string, sourceType: string): string {
return `<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Scraped: ${new URL(sourceUrl).hostname}</title>
<style>
body { font-family: -apple-system, system-ui, sans-serif; max-width: 900px; margin: 0 auto; padding: 2rem; background: #f9fafb; color: #111827; line-height: 1.6; }
h1 { font-size: 1.5rem; border-bottom: 2px solid #e5e7eb; padding-bottom: 0.5rem; }
.stats { background: #fff; padding: 1rem; border-radius: 8px; margin-bottom: 2rem; border: 1px solid #e5e7eb; }
.article { background: #fff; padding: 1.5rem; border-radius: 8px; margin-bottom: 1rem; border: 1px solid #e5e7eb; }
.article h2 { margin-top: 0; font-size: 1.2rem; }
.article h2 a { color: #2563eb; text-decoration: none; }
.meta { font-size: 0.875rem; color: #6b7280; margin-bottom: 1rem; }
.quality { display: inline-block; padding: 0.25rem 0.5rem; border-radius: 4px; font-size: 0.75rem; font-weight: 600; }
.quality.high { background: #d1fae5; color: #065f46; }
.quality.medium { background: #fef3c7; color: #92400e; }
.quality.low { background: #fee2e2; color: #991b1b; }
.content { font-size: 0.95rem; }
</style>
</head>
<body>
<h1>Scraped: ${sourceUrl}</h1>
<div class="stats">
<strong>Source:</strong> ${sourceType} •
<strong>Articles:</strong> ${articles.length}
</div>
${articles.map(a => {
const q = (a.qualityScore || a.confidence || 0) >= 0.7 ? 'high' : (a.qualityScore || a.confidence || 0) >= 0.4 ? 'medium' : 'low';
return `<article class="article">
<h2><a href="${a.url}" target="_blank">${a.title}</a></h2>
<div class="meta"><span class="quality ${q}">${Math.round((a.qualityScore || a.confidence || 0) * 100)}%</span> • ${a.publishedDate || 'Unknown'}</div>
<div class="content">${a.fullContent || a.html || a.description || ''}</div>
</article>`;
}).join('')}
</body>
</html>`;
}
main().catch(e => console.error('Error:', e.message));
Run with: npx tsx scrape-site.ts
Results are saved to ./scraper-output/:
| File | Format | Contains |
|---|---|---|
{hostname}_{timestamp}.json | JSON | Full structured data |
{hostname}_{timestamp}.md | Markdown | Human-readable content |
{hostname}_{timestamp}.html | HTML | Styled preview (auto-opens in browser) |
After scraping, report:
| Function | Purpose |
|---|---|
smartScrape(url) | Auto-detect article vs listing, extract appropriately |
extractArticle(url) | Extract single article directly (fastest) |
scrapeWebsite(url) | Discover multiple articles from listing page |
isArticleUrl(url) | Check if URL looks like an article |
isListingUrl(url) | Check if URL looks like a listing |
| Option | Type | Default | Description |
|---|---|---|---|
| maxArticles | number | 10 | Maximum articles to return |
| qualityThreshold | number | 0.3 | Minimum quality score (0-1) |
| sourceType | string | 'auto' | Force: 'rss', 'sitemap', 'html' |
| forceMode | string | - | Force: 'article' or 'listing' |
| allowPaths | string[] | [] | Only scrape these paths |
| denyPaths | string[] | [...] | Skip these paths |
Each article includes multiple formats:
| Format | Field | Description |
|---|---|---|
| HTML | html / fullContent | Raw HTML content |
| Markdown | markdown / fullContentMarkdown | Formatted markdown |
| Text | text / fullContentText | Plain text, cleaned |
| Excerpt | excerpt / description | Short summary |
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub tyroneross/rosslabs-ai-toolkit --plugin web-scraper