firecrawl-scraper | benedictking-skills

Stats

Actions

Tags

firecrawl-scraper | benedictking-skills

Firecrawl Scraper Skill

Trigger Conditions & Endpoint Selection

Choose Firecrawl endpoint based on user intent:

scrape: Need to extract content from a single web page (markdown, html, json, screenshot, pdf)
crawl: Need to crawl entire website with depth control and path filtering
map: Need to quickly get a list of all URLs on a website
batch-scrape: Need to scrape multiple URLs in parallel
crawl-status: Given crawl job ID, check crawl progress/results (optional --wait)
batch-status: Given batch job ID, check batch scrape progress/results (optional --wait)
batch-errors: Given batch job ID, retrieve batch scrape errors

Recommended Architecture (Main Skill + Sub-skill)

This skill uses a two-phase architecture:

Main skill (current context): Understand user question → Choose endpoint → Assemble JSON payload
Sub-skill (fork context): Only responsible for HTTP call execution, avoiding conversation history token waste

Execution Method

Use Task tool to invoke firecrawl-fetcher sub-skill, passing command and JSON (stdin):

Task parameters:
- subagent_type: Bash
- description: "Call Firecrawl API"
- prompt: cat <<'JSON' | node scripts/firecrawl-api.cjs <scrape|crawl|map|batch-scrape|crawl-status|batch-status|batch-errors> [--wait]
  { ...payload... }
  JSON

Payload Examples

1) Scrape Single Page

cat <<'JSON' | node scripts/firecrawl-api.cjs scrape
{
  "url": "https://example.com",
  "formats": ["markdown", "links"],
  "onlyMainContent": true,
  "includeTags": [],
  "excludeTags": ["nav", "footer"],
  "waitFor": 0,
  "timeout": 30000
}
JSON

Available formats:

"markdown", "html", "rawHtml", "links", "images", "summary"
{"type": "json", "prompt": "Extract product info", "schema": {...}}
{"type": "screenshot", "fullPage": true, "quality": 85}

2) Scrape with Actions (Page Interaction)

cat <<'JSON' | node scripts/firecrawl-api.cjs scrape
{
  "url": "https://example.com",
  "formats": ["markdown"],
  "actions": [
    {"type": "wait", "milliseconds": 2000},
    {"type": "click", "selector": "#load-more"},
    {"type": "wait", "milliseconds": 1000},
    {"type": "scroll", "direction": "down", "amount": 500}
  ]
}
JSON

Available actions:

wait, click, write, press, scroll, screenshot, scrape, executeJavascript

3) Parse PDF

cat <<'JSON' | node scripts/firecrawl-api.cjs scrape
{
  "url": "https://example.com/document.pdf",
  "formats": ["markdown"],
  "parsers": ["pdf"]
}
JSON

4) Extract Structured JSON

cat <<'JSON' | node scripts/firecrawl-api.cjs scrape
{
  "url": "https://example.com/product",
  "formats": [
    {
      "type": "json",
      "prompt": "Extract product information",
      "schema": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"},
          "description": {"type": "string"}
        },
        "required": ["name", "price"]
      }
    }
  ]
}
JSON

5) Crawl Entire Website

cat <<'JSON' | node scripts/firecrawl-api.cjs crawl
{
  "url": "https://docs.example.com",
  "formats": ["markdown"],
  "includePaths": ["^/docs/.*"],
  "excludePaths": ["^/blog/.*"],
  "maxDiscoveryDepth": 3,
  "limit": 100,
  "allowExternalLinks": false,
  "allowSubdomains": false
}
JSON

5.1) Crawl + Wait for Completion

cat <<'JSON' | node scripts/firecrawl-api.cjs crawl --wait
{
  "url": "https://docs.example.com",
  "formats": ["markdown"],
  "limit": 100
}
JSON

6) Map Website URLs

cat <<'JSON' | node scripts/firecrawl-api.cjs map
{
  "url": "https://example.com",
  "search": "documentation",
  "limit": 5000
}
JSON

7) Batch Scrape Multiple URLs

cat <<'JSON' | node scripts/firecrawl-api.cjs batch-scrape
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "formats": ["markdown"]
}
JSON

Returns async job response: { "success": true, "id": "<batch-id>", "url": "..." }

7.1) Batch Scrape + Wait for Completion

cat <<'JSON' | node scripts/firecrawl-api.cjs batch-scrape --wait
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2"
  ],
  "formats": ["markdown"]
}
JSON

7.2) Check Batch Scrape Status

node scripts/firecrawl-api.cjs batch-status <batch-id>

Wait for completion:

node scripts/firecrawl-api.cjs batch-status <batch-id> --wait

7.3) Get Batch Scrape Errors

node scripts/firecrawl-api.cjs batch-errors <batch-id>

8) Check Crawl Status

node scripts/firecrawl-api.cjs crawl-status <crawl-id>

Wait for completion:

node scripts/firecrawl-api.cjs crawl-status <crawl-id> --wait

Key Features

Formats

markdown: Clean markdown content
html: Parsed HTML
rawHtml: Original HTML
links: All links on page
images: All images on page
summary: AI-generated summary
json: Structured data extraction with schema
screenshot: Page screenshot (PNG)

Content Control

onlyMainContent: Extract only main content (default: true)
includeTags: CSS selectors to include
excludeTags: CSS selectors to exclude
waitFor: Wait time before scraping (ms)
maxAge: Cache duration (default: 48 hours)

Actions (Browser Automation)

wait: Wait for specified time
click: Click element by selector
write: Input text into field
press: Press keyboard key
scroll: Scroll page
executeJavascript: Run custom JS

Crawl Options

includePaths: Regex patterns to include
excludePaths: Regex patterns to exclude
maxDiscoveryDepth: Maximum crawl depth
limit: Maximum pages to crawl
allowExternalLinks: Follow external links
allowSubdomains: Follow subdomains

Environment Variables & API Key

Two ways to configure API Key (priority: environment variable > .env):

Environment variable: FIRECRAWL_API_KEY
.env file: Place in .env, can copy from .env.example

Response Format

All endpoints return JSON with:

success: Boolean indicating success
data: Extracted content (format depends on endpoint)
For crawl: Returns job ID, use crawl-status (or GET /v2/crawl/{id}) to check status
For batch-scrape: Returns async job response ({ success, id, url }), use batch-status (or GET /v2/batch/scrape/{id}) to poll status
Batch status response includes status, total, completed, creditsUsed, expiresAt, next, data[]
- next: pagination URL for large/incomplete results (script returns raw response; follow manually if needed)