Skill

crw-extract

Extracts typed JSON from one or more web pages using a JSON Schema with fastCRW's scrape-based extraction pipeline. Use for structured data like prices, stock status, or job listings.

data-engineering

Popularity

Stars

190

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/crw:crw-extract

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Bash(crw:*)Bash(curl:*)Read

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

- You need a **structured JSON object** from a page, not prose.

SKILL.md

138 lines · ~1.2k tokens

Stats

LanguageRust

Stars190

Forks16

MaintenanceExcellent

Last CommitJun 18, 2026

Actions

View Source View Plugin View on GitHub View README

crw-extract — typed JSON from pages

When to use

You need a structured JSON object from a page, not prose.
Step 6 in the crw ladder. First scrape succeeds (step 2), but you need machine-readable fields. If the source is a local PDF, use crw-parse (step 5) with formats:["json"] instead.
Schema-driven = deterministic output shape. No schema = exploratory; use a prompt to describe what you want.

How extraction works in crw

crw has no /v1/extract endpoint (Firecrawl's dedicated extract API). Instead, extraction runs in two ways, both backed by the same LLM pipeline:

Path	When to use	Sync?
Per-page: `formats:["json"]` + `jsonSchema` on scrape	Single URL, or inline during a crawl	Synchronous
Async multi-URL: `POST /v2/extract` → poll `GET /v2/extract/{id}`	Many URLs, fire-and-forget	Async

/v2/extract is marked deprecated in the server (it recommends /v2/scrape with formats:["json"]), but it works and is useful for multi-URL batches.

Requires a server-side LLM. Set [extraction.llm] in the server config (provider, api_key, model). Without it, requests return an error (HTTP 4xx) — e.g. 422 "no LLM configured". Use crw setup to configure the LLM for the CLI.

Quick start

CLI — per-page extraction via --extract:

# Inline schema
crw scrape "https://example.com/product" \
  --extract '{"type":"object","properties":{"price":{"type":"number"},"inStock":{"type":"boolean"}}}'

# Schema from file
crw scrape "https://example.com/job" --extract @schema.json -o result.json

MCP — pass formats:["json"] with jsonSchema on a scrape:

crw_scrape(
  url="https://example.com/product",
  formats=["json"],
  extract={"schema": {"type":"object","properties":{"price":{"type":"number"}}}}
)

Note: the MCP crw_scrape accepts extract.schema (Firecrawl style). The REST API also accepts jsonSchema as a top-level alias.

REST — per-page (synchronous):

curl -X POST "$CRW_API_URL/v1/scrape" \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "formats": ["json"],
    "jsonSchema": {
      "type": "object",
      "properties": {
        "price": {"type": "number"},
        "inStock": {"type": "boolean"}
      }
    }
  }'

REST — async multi-URL (deprecated endpoint, still functional):

# Start job
curl -X POST "$CRW_API_URL/v2/extract" \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/p1", "https://example.com/p2"],
    "schema": {"type":"object","properties":{"price":{"type":"number"}}}
  }'
# → {"success":true,"id":"<uuid>","warnings":["...use /v2/scrape..."], ...}

# Poll until completed
curl "$CRW_API_URL/v2/extract/<uuid>" -H "Authorization: Bearer $CRW_API_KEY"
# → {"success":true,"status":"completed|scraping|failed","data":{...}}

Options

Need	CLI	MCP / REST
JSON schema	`--extract '<schema>'` or `@file.json`	`jsonSchema` / `extract.schema`
Free-text prompt (no schema)	—	`prompt` on `/v2/extract`
Save output	`-o FILE`	write the response yourself
Multi-URL async	not available	`POST /v2/extract` with `urls:[...]`
LLM override	`--llm-provider`, `--llm-key`, `--llm-model`	server config only

Tips

Schema = deterministic, prompt = exploratory. A schema pins the output shape; a free-text prompt is useful for exploration but less reliable. Start with a schema when you know the fields you want.
Don't over-specify. Narrow schemas ("give me exactly these three fields") extract more reliably than wide ones with fifty optional fields.
Crawl + extract in one pass. crw_crawl accepts a jsonSchema parameter — each page in the crawl gets extracted against the schema, saving a second round-trip.
Check data.json in the scrape response. Per-page extraction lands in data.json, not data.markdown. The markdown field is also populated for reference.

crw-extract

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

crw-extract

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

crw-extract — typed JSON from pages

When to use

How extraction works in crw

Quick start

Options

Tips

See also

Similar Skills

crw-extract — typed JSON from pages

When to use

How extraction works in crw

Quick start

Options

Tips

See also

Similar Skills