From data-liberation
Guides building a new platform adapter to extract content from unsupported platforms (Blogger, Ghost, Tumblr, etc.) by reverse-engineering detection signals, content discovery, and extraction methods.
Slash command: /data-liberation:adapt
Guide the process of adding extraction support for a new platform. The result is a working adapter that plugs into the existing extraction pipeline.
- src/adapters/: if an adapter exists, this skill isn't needed.

Understand how the target platform works before writing any code.

Figure out how to identify sites on this platform. Check:

- Hosted-domain URL patterns (e.g. .squarespace.com, .webflow.io, .wixsite.com)
- HTTP response headers (e.g. X-Squarespace-Version, X-Wix-Request-Id)

Add detection signals to src/lib/extraction/detect-platform.ts:

- URL_PATTERNS
- detectFromHttp()

Figure out how to find all pages on the site:

- Sitemaps: sitemap.xml, sitemap_index.xml. Most platforms generate these.
- Navigation links: extractNavLinks() in src/adapters/shared.ts handles this generically.
- Platform JSON endpoints (e.g. ?format=json or Shopify's /products.json).

Figure out how to get the actual content from each page:

- Content selectors (e.g. .post-body, article, .content, main).
- Browser rendering where needed: launchBrowser() from src/adapters/shared.ts.

If the platform has an admin dashboard or uses client-side API calls, use liberate_map_apis to automatically discover all API endpoints:

1. Have the user launch a Chromium-based browser with --remote-debugging-port=9222 and log in to their account on the target platform.
2. Call liberate_map_apis with the CDP port, the site URL, and optionally a list of admin dashboard URLs to crawl.

This is the fastest way to reverse-engineer a platform's API surface. The output tells you exactly which endpoints return content data, what auth is needed, and what the response shapes look like: everything you need to write the adapter's extractPage function.
You can also call liberate_probe to inspect window globals, localStorage, cookies, and platform identity fields on any page — useful for understanding what data the platform exposes client-side.
Document everything you find. This is research — take notes on endpoints, selectors, quirks.
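To make the detection step concrete, here is a minimal sketch of matching a site against URL and header signals. The table shapes and the names urlSignals, headerSignals, and guessPlatform are hypothetical; the real URL_PATTERNS and detectFromHttp() in detect-platform.ts may be structured differently.

```typescript
// Hypothetical signal tables for illustration only.
type PlatformId = string;

const urlSignals: Array<{ platform: PlatformId; pattern: RegExp }> = [
  { platform: 'squarespace', pattern: /\.squarespace\.com$/i },
  { platform: 'webflow', pattern: /\.webflow\.io$/i },
  { platform: 'wix', pattern: /\.wixsite\.com$/i },
];

const headerSignals: Array<{ platform: PlatformId; header: string }> = [
  { platform: 'squarespace', header: 'x-squarespace-version' },
  { platform: 'wix', header: 'x-wix-request-id' },
];

// Cheap check: match the hostname against known hosted-domain suffixes.
function guessFromUrl(url: string): PlatformId | null {
  const m = /^https?:\/\/([^/:?#]+)/i.exec(url);
  if (!m) return null; // not an absolute http(s) URL
  const hostname = m[1];
  for (const signal of urlSignals) {
    if (signal.pattern.test(hostname)) return signal.platform;
  }
  return null;
}

// Fallback for custom domains: look for platform-specific response headers.
// Header names are compared case-insensitively.
function guessFromHeaders(headers: Record<string, string>): PlatformId | null {
  const names = Object.keys(headers).map((h) => h.toLowerCase());
  for (const signal of headerSignals) {
    if (names.indexOf(signal.header) !== -1) return signal.platform;
  }
  return null;
}

function guessPlatform(url: string, headers: Record<string, string> = {}): PlatformId | null {
  return guessFromUrl(url) ?? guessFromHeaders(headers);
}
```

URL signals are tried first because they need no network request. Header signals matter mostly for sites on custom domains, where the hostname gives nothing away.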
An adapter is a directory src/adapters/<platform>/, never a single file. index.ts is a thin assembler; each concern lives in its own sibling. Read src/adapters/webflow/ (the smallest — a 3-file split) and src/adapters/shopify/ (a fuller split) as references.
index.ts — thin assembler + public API. It defines detect inline, imports discover/extract (and optional capture/blocks) from siblings, exports the <platform>Adapter object, and re-exports the inventory/opts types (plus any helpers other modules need) so external code only ever imports <platform>/index.js. Keep all real logic in siblings.
// src/adapters/webflow/index.ts — the whole assembler
import type { PlatformAdapter } from '../../types.js';
import { discoverWebflow } from './discover.js';
import { extractWebflow } from './extract.js';
export type { WebflowInventory, WebflowAdapterOpts } from './discover.js';
function detect(url: string): boolean {
  return /webflow\.io|webflow\.com/i.test(url);
}
export const webflowAdapter: PlatformAdapter = {
  id: 'webflow',
  detect,
  discover: discoverWebflow,
  extract: extractWebflow,
};
Sibling files — add only what the platform needs (webflow uses 3; richer platforms split further):
| file | holds |
|---|---|
| types.ts | <Platform>AdapterOpts + <Platform>Inventory (+ platform JSON shapes) |
| discover.ts | discover(): sitemap/nav crawl, URL classification → inventory |
| extract.ts | extract(): drives runExtractionLoop() with an extractPage fn |
| content.ts | HTML/content parsing + quality scoring |
| media.ts | media URL extraction |
| products.ts | product → WooProduct mapping (e-commerce only) |
| capture.ts | optional AdapterCapture (seam 1: pre-capture DOM removals) |
| blocks.ts | optional AdapterBlocks (seam 2: content→blocks recipe) |
Both seams are typed in src/adapters/page-actions.ts; examples are shopify/capture.ts and squarespace/blocks.ts. Add any platform-specific helpers as further siblings (wix has runtime.ts/gallery.ts/page.ts; hubspot has url.ts/metadata.ts). The tiny webflow adapter has no types.ts — it inlines its opts/inventory in discover.ts and re-exports from there; use a dedicated types.ts for anything non-trivial.
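As a rough illustration of the discover.ts row in the table (sitemap crawl, URL classification, inventory), the sketch below pulls URLs from sitemap XML and buckets them by path. The names extractSitemapLocs, classifyUrl, and buildInventory, the path rules, and the trimmed-down inventory shape are all assumptions for illustration. A real adapter would fetch the sitemap and fill in the full <Platform>Inventory.

```typescript
type UrlType = 'post' | 'product' | 'page';

interface ClassifiedUrl {
  url: string;
  type: UrlType;
}

// Pull every <loc> value out of a sitemap document.
// Regex-based and deliberately simple: CDATA-wrapped values are not handled.
function extractSitemapLocs(xml: string): string[] {
  const locs: string[] = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    locs.push(m[1].replace(/&amp;/g, '&'));
  }
  return locs;
}

// Hypothetical path rules; each platform needs its own.
function classifyUrl(url: string): UrlType {
  const path = url.replace(/^https?:\/\/[^/]+/i, '');
  if (/^\/(blog|posts?|news)\//i.test(path)) return 'post';
  if (/^\/(products?|shop|store)\//i.test(path)) return 'product';
  return 'page';
}

// Build a reduced inventory: de-duplicated URLs plus per-type counts.
function buildInventory(siteUrl: string, sitemapXml: string) {
  const urls: ClassifiedUrl[] = [];
  const seen = new Set<string>();
  const counts: Record<UrlType, number> = { post: 0, product: 0, page: 0 };
  for (const url of extractSitemapLocs(sitemapXml)) {
    if (seen.has(url)) continue;
    seen.add(url);
    const type = classifyUrl(url);
    urls.push({ url, type });
    counts[type] += 1;
  }
  return { siteUrl, discoveredAt: new Date().toISOString(), counts, urls };
}
```

Classification happens at discovery time so the extraction step can decide per URL how to treat a page, for example whether to attempt product extraction.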
The adapter contract: <platform>Adapter implements PlatformAdapter (src/types.ts):

- id: lowercase platform name (e.g. 'ghost')
- detect(url): true if the URL belongs to this platform (defined inline in index.ts)
- discover(url, opts): fetch sitemap + navigation, classify URLs, return inventory
- extract(inventory, wxr, opts, context): call runExtractionLoop() from src/adapters/shared.ts with an extractPage function
- probe, capture, blocks

Define in types.ts:

- <Platform>AdapterOpts extending Record<string, unknown> with: delay?, resume?, dryRun?, verbose?, outputDir?
- <Platform>Inventory with: siteUrl, discoveredAt, siteMeta (title, tagline, language), navigation, counts, urls

This is where platform-specific extraction lives. For each URL:

- Return an ExtractedPage object (defined in src/adapters/shared.ts)

Use the shared helpers from src/adapters/shared.ts:

- extractMeta(html, property): read meta tags
- extractTitle(html): read <title> tag
- extractHeading(html): read <h1> with title fallback
- extractNavLinks(html, baseUrl): parse nav links
- IMAGE_EXTENSIONS: regex for image file detection

Check during reconnaissance whether the platform has e-commerce (product pages, a store, a shop section).
Generic detection (automatic): The shared extraction loop in src/adapters/shared.ts automatically detects products via JSON-LD @type: Product on any page classified as product type. This works out of the box if:
Platform-specific detection (optional but recommended): If the platform has a richer product API or non-standard product markup, provide a custom extractProduct function to runExtractionLoop():
const result = await runExtractionLoop({
  // ...other opts
  csvBuilder,
  extractProduct: (url: string, html: string) => {
    // Try platform-specific product extraction first.
    // Return a WooProduct, or null when nothing platform-specific is found.
    return null;
  },
});
The custom extractor is called before the generic JSON-LD fallback, so it takes priority.
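A custom extractor might look like the sketch below, written for an imaginary platform that embeds product data in a script tag with id product-data. The markup, the field names (title, price_cents, image_urls), and the ProductSketch type are all hypothetical. ProductSketch stands in for a subset of WooProduct, whose real field types may differ.

```typescript
// Local stand-in for a subset of WooProduct (assumption, not the real type).
interface ProductSketch {
  name: string;
  sku?: string;
  regularPrice?: string;
  images: string[];
}

// Hypothetical platform markup:
//   <script id="product-data" type="application/json">{ ... }</script>
function extractProduct(url: string, html: string): ProductSketch | null {
  const m = /<script[^>]*id=["']product-data["'][^>]*>([\s\S]*?)<\/script>/i.exec(html);
  if (!m) return null; // no platform payload on this page

  let raw: unknown;
  try {
    raw = JSON.parse(m[1]);
  } catch {
    return null; // malformed payload
  }
  if (!raw || typeof raw !== 'object') return null;

  const data = raw as Record<string, unknown>;
  if (typeof data.title !== 'string' || data.title === '') return null; // name is required

  const images = Array.isArray(data.image_urls)
    ? data.image_urls.filter((u): u is string => typeof u === 'string')
    : [];

  return {
    name: data.title,
    sku: typeof data.sku === 'string' ? data.sku : undefined,
    // Assumes the platform reports prices in integer cents.
    regularPrice:
      typeof data.price_cents === 'number' ? (data.price_cents / 100).toFixed(2) : undefined,
    images,
  };
}
```

The sketch returns null whenever it cannot build a product with a name, rather than a half-filled object, since name is the one required field.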
What to extract for products (see WooProduct type in src/lib/import/woo-product-csv.ts):
- name (required), description, shortDescription
- regularPrice, salePrice
- sku
- images: array of image URLs
- categories, tags
- weight, length, width, height
- inStock, stock
- attributes: array of { name, values[], visible, global } for product options (size, color, etc.)
- type: 'simple', 'variable', 'grouped', 'external', or 'variation'
- parentSku: for variations, the parent product's SKU

Variable products: If the platform supports product variants (sizes, colors), generate one variable parent row plus variation child rows with parentSku linking them. See shopifyProductToWoo() in src/adapters/shopify/products.ts for the pattern.
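The parent-plus-variations layout can be sketched as follows. RowSketch, VariantInput, and toVariableRows are hypothetical names. RowSketch mirrors only a few of the WooProduct fields listed above, and the visible/global defaults are assumptions. This is not a copy of shopifyProductToWoo().

```typescript
interface AttributeSketch {
  name: string;
  values: string[];
  visible: boolean;
  global: boolean;
}

interface RowSketch {
  type: 'simple' | 'variable' | 'variation';
  name: string;
  sku: string;
  parentSku?: string;
  regularPrice?: string;
  attributes: AttributeSketch[];
}

interface VariantInput {
  sku: string;
  price: string;
  options: Record<string, string>; // e.g. { Size: 'M', Color: 'Red' }
}

function toVariableRows(name: string, parentSku: string, variants: VariantInput[]): RowSketch[] {
  // Collect every distinct value per option name, preserving first-seen order.
  const optionValues = new Map<string, string[]>();
  for (const variant of variants) {
    for (const optName of Object.keys(variant.options)) {
      const values = optionValues.get(optName) ?? [];
      const value = variant.options[optName];
      if (values.indexOf(value) === -1) values.push(value);
      optionValues.set(optName, values);
    }
  }

  // Parent row: lists all possible values for each attribute.
  const parent: RowSketch = {
    type: 'variable',
    name,
    sku: parentSku,
    attributes: Array.from(optionValues, ([attr, values]) => ({
      name: attr,
      values,
      visible: true,
      global: false,
    })),
  };

  // Child rows: one per variant, each pinned to a single value per attribute
  // and linked back to the parent through parentSku.
  const children = variants.map((variant): RowSketch => ({
    type: 'variation',
    name,
    sku: variant.sku,
    parentSku,
    regularPrice: variant.price,
    attributes: Object.keys(variant.options).map((attr) => ({
      name: attr,
      values: [variant.options[attr]],
      visible: true,
      global: false,
    })),
  }));

  return [parent, ...children];
}
```

The parent row comes first in the returned array so that it precedes its variations in the output.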
CSV streaming: The adapter should create a WooProductCsvBuilder, call openStream(outputDir) before extraction, and closeStream() after. The shared loop calls csvBuilder.addProduct() automatically when it detects products. See the Shopify or Wix adapters for the wiring pattern.
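One way to guarantee that closeStream() runs even when extraction throws is a small try/finally wrapper. This is a sketch under assumptions: CsvStreamLike is a structural stand-in for the two WooProductCsvBuilder methods named above, withCsvStream is a hypothetical helper, and whether the real methods are synchronous or asynchronous is not known here (awaiting them works in both cases).

```typescript
// Structural stand-in for the stream methods of WooProductCsvBuilder (assumption).
interface CsvStreamLike {
  openStream(outputDir: string): void | Promise<void>;
  closeStream(): void | Promise<void>;
}

// Open the CSV stream, run the extraction, and always close the stream afterwards.
async function withCsvStream<T>(
  csvBuilder: CsvStreamLike,
  outputDir: string,
  run: () => Promise<T>,
): Promise<T> {
  await csvBuilder.openStream(outputDir);
  try {
    return await run();
  } finally {
    // Runs on success and on failure, so a partial CSV is still flushed.
    await csvBuilder.closeStream();
  }
}
```

Inside an adapter's extract(), the run callback would be the place to call runExtractionLoop() with the same csvBuilder passed in its options.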
Always import the adapter from its barrel — ./adapters/<platform>/index.js — never a sibling directly.
- src/mcp-server.ts (required): add the import under the // Static adapter imports comment and add the adapter to the adapters: PlatformAdapter[] array. Both are kept alphabetical.
- src/ui/discover.tsx (CLI/Ink discovery UI): add the top-level import and append to its adapters array. The default data-liberation <url> flow resolves adapters here, so the CLI path needs it.
- src/ui/inspect.tsx (optional): liberate_inspect lazy-import()s a small allAdapters list inside the component; add yours there for inspect coverage. This list is partial today and isn't required for extraction.

Create fixture files in test/fixtures/ with sample HTML and/or JSON from the platform. Sanitize any PII.
Create test/adapters/<platform>.test.ts. Test:
Run extraction against the user's live site:
npx tsx src/cli.ts <site-url> --dry-run --verbose
Check the output for quality: are titles correct? Is content complete? Are media URLs captured?
- README.md
- DISCOVERIES.md documenting what you learned about the platform
- AGENTS.md if any non-obvious details are worth noting

npx claudepluginhub automattic/data-liberation-agent --plugin data-liberation