From wayback-archive
Recover product databases from defunct e-commerce sites via Wayback Machine, CommonCrawl, and Shopify CDN archaeology.
Slash command: /wayback-archive:wayback-archive
Self-contained pipeline for recovering product databases from defunct e-commerce sites. Supports Shopify, Swell Commerce, Fourthwall, and custom platforms via config-driven CDN patterns. Each stage has checkpoint/resume support.
Phase 1: DISCOVERY -> Find what existed (CDX dump, CommonCrawl, CDN archaeology)
Phase 2: EXTRACTION -> Get product data (fetch pages, extract metadata)
Phase 3: ASSET DOWNLOAD -> Get images/media (live CDN first, Wayback fallback)
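The "live CDN first, Wayback fallback" order in Phase 3 can be sketched as a simple cascade. This is a minimal illustration, not the plugin's download code: `wayback_raw_url`, `download_with_cascade`, and the fetcher callables are hypothetical names introduced here. The `id_` suffix is the Wayback Machine's modifier for returning archived bytes without the replay wrapper.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a URL and returns the bytes, or None if that source has nothing.
Fetcher = Callable[[str], Optional[bytes]]


def wayback_raw_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback URL that returns the archived bytes unmodified.

    The `id_` modifier after the 14-digit timestamp asks Wayback to skip
    its replay rewriting.
    """
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def download_with_cascade(
    url: str, sources: Iterable[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[bytes]]:
    """Try each (name, fetcher) pair in order; return the first non-empty result."""
    for name, fetch in sources:
        try:
            data = fetch(url)
        except Exception:
            # Treat any failure as a miss and fall through to the next source.
            data = None
        if data:
            return name, data
    return None, None
```

In practice each fetcher would wrap an HTTP client call; keeping them as plain callables makes the ordering easy to test without network access.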
Always query BOTH Wayback AND CommonCrawl. Their coverage is independent, so each finds captures the other lacks. For HTML, CommonCrawl fetches succeed 76% of the time; Wayback HTML fetches succeed 2.4% of the time.
For HTML, prefer CommonCrawl WARCs. Wayback serves HTML through a JS replay framework. CommonCrawl WARCs contain the raw HTTP response with no wrapper.
Filter the CDX dump first. Raw dumps are 90%+ junk. filter_cdx.py reduces them by ~94% with zero product data loss.
For detailed extraction strategy and method hierarchy, see references/extraction-strategy.md.
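As a rough illustration of the "query both" principle above, the sketch below builds one query URL per index and merges the results by source. It is not the plugin's discovery code: the function names are hypothetical, and the CommonCrawl crawl ID (one index exists per crawl) is left as an input rather than assumed.

```python
from urllib.parse import urlencode

WAYBACK_CDX = "https://web.archive.org/cdx/search/cdx"
# CommonCrawl publishes one index per crawl; the caller supplies the crawl ID.
COMMONCRAWL_INDEX = "https://index.commoncrawl.org/{crawl_id}-index"


def wayback_cdx_url(domain: str) -> str:
    """Query URL for every successfully captured URL under a domain."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains
        "output": "json",
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one row per distinct URL
    }
    return f"{WAYBACK_CDX}?{urlencode(params)}"


def commoncrawl_index_url(domain: str, crawl_id: str) -> str:
    """Query URL for one CommonCrawl crawl's index of a domain."""
    params = {"url": f"*.{domain}", "output": "json"}
    return COMMONCRAWL_INDEX.format(crawl_id=crawl_id) + "?" + urlencode(params)


def merge_coverage(wayback_urls, commoncrawl_urls):
    """Map each URL to the set of archives that hold a capture of it."""
    coverage = {}
    for source, urls in (("wayback", wayback_urls), ("commoncrawl", commoncrawl_urls)):
        for url in urls:
            coverage.setdefault(url, set()).add(source)
    return coverage
```

URLs that appear under only one source in the merged map are the ones that would have been missed by querying a single archive.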
# 1. Install dependencies
pip install -r requirements.txt
# 2. Copy and customize a config
cp skills/wayback-archive/configs/example.yaml configs/mysite.yaml
# 3. Run the full pipeline (with confirmation gates)
python3 scripts/run_stage.py all --config configs/mysite.yaml
# Or dry-run first
python3 scripts/run_stage.py all --config configs/mysite.yaml --dry-run
Nine stages, executed in order. Run individually or use all:
python3 scripts/run_stage.py <stage> --config configs/site.yaml [--dry-run]
| Stage | Purpose | Key Tool |
|---|---|---|
| cdx_dump | Dump every Wayback snapshot URL for each domain | tools/wayback_cdx |
| index | Parse CDX + CommonCrawl discovery -> product index | lib/wayback_archiver/cdx.py |
| filter | 6-layer CDX filter (94% junk reduction) | filter_cdx.py |
| fetch | Queue-based cascade: direct -> CommonCrawl WARC -> proxy | fetch_archive.py |
| cdn_discover | Shopify CDN archaeology (finds delisted product images) | shopify_downloader.py |
| match | Fuzzy slug-to-SKU matching + dedup | lib/wayback_archiver/match.py |
| download | Image cascade: live CDN -> Wayback CDX best -> exhaustive | lib/wayback_archiver/download.py |
| normalize | Rename images, generate metadata.txt per product | lib/wayback_archiver/normalize.py |
| build | Compile final catalog JSON + stats | lib/wayback_archiver/util.py |
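To make the match stage's "fuzzy slug-to-SKU matching" concrete, here is one way it could work using the standard library's `difflib`. This is an assumption for illustration only: the source does not say which algorithm lib/wayback_archiver/match.py uses, and the function names, the catalog shape (SKU -> product name), and the 0.8 cutoff are all hypothetical.

```python
import difflib
import re
from typing import Dict, Optional


def normalize_slug(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def match_slug_to_sku(
    slug: str, catalog: Dict[str, str], cutoff: float = 0.8
) -> Optional[str]:
    """Return the SKU whose product name is closest to `slug`, or None.

    `catalog` maps SKU -> product name; `cutoff` is a similarity ratio in [0, 1].
    """
    by_slug = {normalize_slug(name): sku for sku, name in catalog.items()}
    hits = difflib.get_close_matches(
        normalize_slug(slug), list(by_slug), n=1, cutoff=cutoff
    )
    return by_slug[hits[0]] if hits else None
```

A high cutoff trades recall for precision: archived slugs with suffixes such as `-1` still match, while unrelated slugs return None instead of a wrong SKU.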
# Fetch with datacenter proxies and 3 workers
python3 scripts/run_stage.py fetch --config configs/site.yaml --proxy dc --workers 3
# Try alternative archives for failed URLs
python3 scripts/run_stage.py fetch --config configs/site.yaml --fallback-archives archive_today memento
# Full pipeline, skip confirmation prompts
python3 scripts/run_stage.py all --config configs/site.yaml --yes
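The combination of ordered stages, checkpoint/resume, and `--dry-run` can be pictured with the sketch below. It is a simplified model, not the actual run_stage.py: the stage names come from the table above, but `run_pipeline`, the handler mapping, and the JSON checkpoint file format are assumptions made for this example.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

STAGES = ["cdx_dump", "index", "filter", "fetch", "cdn_discover",
          "match", "download", "normalize", "build"]


def run_pipeline(handlers: Dict[str, Callable[[], None]],
                 checkpoint: Path, dry_run: bool = False) -> List[str]:
    """Run stages in order, skipping any already recorded in `checkpoint`.

    Returns the stages that ran (or would run, when dry_run is set).
    """
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    ran = []
    for stage in STAGES:
        if stage in done:
            continue  # completed on a previous run
        ran.append(stage)
        if dry_run:
            continue  # report only: do no work and write no checkpoint
        handlers[stage]()
        done.append(stage)
        # Persist after every stage so a crash resumes from the next one.
        checkpoint.write_text(json.dumps(done))
    return ran
```

Writing the checkpoint after each stage, rather than once at the end, is what lets a failed run pick up at the stage that failed.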
- Copy skills/wayback-archive/configs/example.yaml and customize domains
- tools/wayback_cdx handles CDX dumps automatically
- shopify_cdn.enabled: true in config
- OXYLABS_ISP_USER / OXYLABS_ISP_PASS env vars
- python3 scripts/run_stage.py all --config configs/mysite.yaml --dry-run

For config field reference, see references/site-config-schema.md.
Each script works independently without run_stage.py:
# CDX dump
cd tools/ && python -m wayback_cdx --domain mystore.com --output raw_cdx.txt --resume
# Filter
python filter_cdx.py raw_cdx.txt > links.txt
# Fetch
python fetch_archive.py links.txt --resume [--proxy isp|dc] [--workers 5]
# Shopify CDN discovery
python shopify_downloader.py --store mystore.com --wayback-only --manifest-only
For detailed script documentation, see references/tool-reference.md.
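For readers wondering what a layered CDX filter like the `filter_cdx.py raw_cdx.txt > links.txt` step might do, the sketch below shows three plausible layers. The real script's six layers are not described here, so every pattern and name in this block is an assumption, including the Shopify-style `/products/` path convention.

```python
import re
from typing import Iterable, Iterator
from urllib.parse import urlsplit

# Assumed junk and keep patterns; the real filter's layers are not documented here.
ASSET_EXT = re.compile(r"\.(?:css|woff2?|ttf|ico|map)$", re.IGNORECASE)
PRODUCT_PATH = re.compile(r"/products(?:\.json|/[^/]+)")


def filter_cdx_urls(urls: Iterable[str]) -> Iterator[str]:
    """Yield one URL per distinct product resource, dropping obvious junk."""
    seen = set()
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        path = parts.path.rstrip("/")
        if ASSET_EXT.search(path):
            continue  # layer 1: static assets
        if not PRODUCT_PATH.search(path):
            continue  # layer 2: non-product pages (cart, blog, checkout, ...)
        # Keep the query for JSON endpoints so paginated feeds stay distinct.
        query = parts.query if path.endswith(".json") else ""
        key = (parts.netloc.lower(), path.lower(), query)
        if key in seen:
            continue  # layer 3: same page differing only by query or case
        seen.add(key)
        yield url
```

The JSON exception in the dedup key matters because a feed such as products.json is typically paginated, and collapsing its pages would discard product data.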
products.json is the holy grail
npx claudepluginhub saldigioia/wayback-archive-plugin --plugin wayback-archive