Distill insights from any URL data source — browser history, bookmark exports, CSV/JSON dumps, URL lists, or read-later services — into an Obsidian-compatible knowledge base. Use this skill when a user mentions extracting insights from browsing history, building a knowledge base from bookmarks, organizing saved links, processing a Pocket/Raindrop/CSV export, sorting through URLs, or distilling articles from any source.
Slash command: /distill:distill
Distill insights from any collection of URLs into a growing Obsidian-compatible knowledge base. Works with browser history, bookmark exports, read-later service exports, CSV/JSON dumps, or plain URL lists.
Two outputs per run:
- Insight notes in {topic}/ folders with YAML frontmatter, key takeaways, and concept tags.
- A noise log (_noise.md) listing every article-candidate URL that was classified as not insight-worthy.

| Source | How to provide | Adapter |
|---|---|---|
| Zen Browser | Auto-detect profile, read places.sqlite | references/browser-schemas.md (Firefox schema) |
| Firefox | Auto-detect profile, read places.sqlite | Same schema as Zen |
| Chrome / Arc / Brave / Edge | Auto-detect profile, read History DB | references/browser-schemas.md (Chromium schema) |
| Safari | Read History.db | references/browser-schemas.md (Safari schema) |
| Bookmark HTML export | User provides file path | Parse Netscape bookmark format |
| Pocket / Raindrop / Instapaper export | User provides CSV/JSON file | Parse export format |
| CSV / JSON URL dump | User provides file path | Read rows with url column or key |
| Plain URL list | User provides text file or pastes URLs | One URL per line |
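
For the file-based sources in the table above, URL extraction could look roughly like the sketch below. The function names are hypothetical, and the JSON shape (a top-level list of objects with a `url` key) is an assumption; real Pocket, Raindrop, or Instapaper exports may need their own field mapping.

```python
import csv
import io
import json


def urls_from_text(text):
    """Plain URL list: one URL per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def urls_from_csv(text, column="url"):
    """CSV dump: read the `url` column, skipping rows where it is empty."""
    reader = csv.DictReader(io.StringIO(text))
    return [row[column].strip() for row in reader if row.get(column)]


def urls_from_json(text, key="url"):
    """JSON dump: assumes a top-level list of objects carrying a `url` key."""
    items = json.loads(text)
    return [item[key] for item in items if isinstance(item, dict) and item.get(key)]
```

Each helper takes the file contents as a string, so the same code works for a pasted list or a file read from disk.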
On first run, use AskUserQuestion to ask what source to process and where the vault lives. If they just say "my bookmarks" or "my history," auto-detect installed browsers and present options alongside file-based sources.
The pipeline runs in 8 phases. Phases 1-7 are automated. Phase 8 is interactive review.
Read references/pipeline.md for detailed phase-by-phase instructions.
| Phase | What | How | LLM? |
|---|---|---|---|
| 1. Identify Source | Ask user for source, detect browser, extract URLs | Interactive + code | No |
| 2. Domain Triage | Classify domains as article-source/tool/mixed | Deterministic rules + Haiku | Haiku |
| 3. URL Triage | Filter mixed-domain URLs by title + path | Deterministic rules + Haiku | Haiku |
| 4. Scrape | Fetch article content | curl with parallel workers | No |
| 5. Extract Text | Strip HTML to plain text, filter short pages | Python | No |
| 6. Classify | Read content, judge insight vs noise | Sonnet | Sonnet |
| 7. Export | Write insight notes + noise log | File I/O | No |
| 8. Review + Restructure | Present summary, user overrides, consolidate folders | Interactive | Session model |
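
As an illustration of Phase 5 in the table above (strip HTML to plain text, filter short pages), here is a minimal standard-library sketch. The `min_chars` threshold of 500 and the set of skipped tags are assumptions, not values taken from the pipeline reference.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring script/style/noscript contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html, min_chars=500):
    """Return plain text, or None when the page is too short to classify."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return text if len(text) >= min_chars else None
```

Returning `None` for short pages keeps them out of the Sonnet classification step, which is where the token cost sits.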
WebFetch through LLM agents is unreliable at scale. Agents time out, die mid-scrape, and waste tokens on HTTP overhead. Instead:
- Scrape with curl, using 20 parallel workers via Python's ThreadPoolExecutor.

This separation (scrape with code, classify with LLM) is the single most important reliability improvement.
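
A minimal sketch of the scrape-with-code approach described above: curl invoked from a thread pool. The function names and the 20-second per-request timeout are assumptions; only the worker count of 20 comes from the text.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def fetch_with_curl(url, timeout_s=20):
    """Return the response body, or None on any curl failure."""
    try:
        result = subprocess.run(
            # -s: silent, -L: follow redirects, --max-time: overall time limit
            ["curl", "-sL", "--max-time", str(timeout_s), url],
            capture_output=True,
            text=True,
            errors="replace",
        )
    except OSError:  # e.g. curl is not installed
        return None
    return result.stdout if result.returncode == 0 else None


def scrape_all(urls, fetch=fetch_with_curl, workers=20):
    """Fetch every URL in parallel and map each URL to its body (or None)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

Because failures come back as `None` rather than exceptions, one dead host cannot take down the whole batch, and the `fetch` parameter lets the scraper be exercised without network access.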
For browser history sources, don't over-filter at the SQL level. The visit_count/frecency filter removes almost nothing. Domain triage is the real volume reducer (drops ~70% of URLs). Process all visited URLs and let domain triage handle it.
Bookmarks with visit_count = 0 (imported/synced but never opened) must also be processed.
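
For a Firefox-schema source, reading without a visit_count filter might look like the sketch below. It assumes the standard `moz_places` table with `id`, `url`, `title`, and `visit_count` columns; the `read_places` name and the http-only restriction are illustrative choices, and in practice the query would likely run against a copy of places.sqlite rather than the live file.

```python
import sqlite3


def read_places(conn):
    """Return every http(s) row from moz_places, including visit_count = 0."""
    cursor = conn.execute(
        "SELECT url, title, visit_count FROM moz_places "
        "WHERE url LIKE 'http%' ORDER BY id"
    )
    return [{"url": u, "title": t, "visit_count": v} for u, t, v in cursor]
```

The only SQL-level filter drops non-web entries such as `about:` pages; volume reduction is left to domain triage.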
Different classification agents create inconsistent folder names. After all notes are written, a consolidation step restructures into a clean nested hierarchy. This is NOT part of the per-run pipeline.
Source detection (Phase 1): If the user doesn't specify, check for installed browsers in order: Zen, Firefox, Chrome, Arc, Brave, Safari. Ask which one.
Domain triage (Phase 2): Classify each unique domain as article-source, tool, or mixed. Deterministic rules handle ~80%. Haiku classifies the rest. Cached in domain-cache.yaml.
URL triage (Phase 3): For mixed domains, read title + URL path to decide article vs skip.
Insight classification (Phase 6): Sonnet reads scraped text and decides: genuine insight or noise?
Concept tagging: Tags emerge from content. Stored in frontmatter concepts field.
Topic grouping: Notes go into {topic}/ folders. Topics emerge from content, not a fixed taxonomy.
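
The domain triage step above (cache first, then deterministic rules, then an LLM for the remainder) could be sketched as follows. The rule tables are hypothetical examples, and `llm_fallback` stands in for the Haiku call; the real rules live in the pipeline reference.

```python
from urllib.parse import urlparse

# Hypothetical rule tables, for illustration only.
KNOWN = {"github.com": "mixed"}
ARTICLE_SUFFIXES = (".substack.com", ".medium.com")
TOOL_PREFIXES = ("app.", "dashboard.", "grafana.", "mail.")


def classify_domain(domain, cache, llm_fallback):
    """Return article-source, tool, or mixed, consulting the cache first."""
    if domain in cache:
        return cache[domain]
    if domain in KNOWN:
        label = KNOWN[domain]
    elif domain.endswith(ARTICLE_SUFFIXES):
        label = "article-source"
    elif domain.startswith(TOOL_PREFIXES):
        label = "tool"
    else:
        label = llm_fallback(domain)  # only unresolved domains reach the LLM
    cache[domain] = label
    return label


def triage(urls, cache, llm_fallback):
    """Drop URLs on tool domains; article-source and mixed URLs move on."""
    kept = []
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if classify_domain(domain, cache, llm_fallback) != "tool":
            kept.append(url)
    return kept
```

URLs on mixed domains survive this step and are filtered later by URL triage, using title and path.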
Each insight article becomes a .md file in {topic}/:
---
url: "https://..."
title: "Article Title"
source: "zen-history" or "chrome-bookmark" or "pocket-export" etc.
date_processed: 2026-04-01
confidence: high
insight_summary: "One-line core takeaway"
concepts:
- "topic-tag"
---
# Article Title
> [Original](url)
## Key Insight
{2-3 sentence summary of the core takeaway}
## Takeaways
- {bullet points of actionable or notable points}
## Context
{1 paragraph situating the article}
---
{scraped content}
The source field identifies where the URL came from:
- {browser}-bookmark or {browser}-history (e.g., zen-bookmark, chrome-history, safari-history)
- pocket-export, raindrop-export, instapaper-export
- url-list, manual

Never overwrite existing .md files.
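
One way to honor the no-overwrite rule and build the source values above is sketched here. The function names and the slug-based filename are assumptions; opening the file in exclusive-create mode makes the existence check and the write a single step.

```python
from pathlib import Path


def source_label(kind, browser=None):
    """Build a `source` value such as "zen-history" or "pocket-export"."""
    return f"{browser}-{kind}" if browser else kind


def write_note(vault, topic, slug, text):
    """Write {vault}/{topic}/{slug}.md; return False if it already exists."""
    path = Path(vault) / topic / f"{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" raises FileExistsError instead of truncating an existing note.
        with open(path, "x", encoding="utf-8") as handle:
            handle.write(text)
    except FileExistsError:
        return False
    return True
```

A `False` return lets the caller log the skipped note instead of silently replacing something the user may have edited.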
_noise.md lists article candidates classified as not insight-worthy, grouped by source folder. Users can scan it to catch misclassifications.
| Phase | Model | Why |
|---|---|---|
| 2. Domain Triage | Haiku (for uncached domains only) | Trivial: "blog or dashboard?" |
| 3. URL Triage | Haiku (for ambiguous mixed-domain URLs) | Pattern matching on title + path |
| 6. Classify | Sonnet | Needs content comprehension |
| 7. Export (note generation) | Sonnet | Summarization and extraction |
Phases 1, 4, 5 are pure code. Phase 8 uses the session model.
Two state files in the vault root:
.distill-state.yaml — run config and watermarks:
last_run: 2026-04-01T12:00:00
vault_path: /Users/.../Obsidian/Insights
source:
  type: zen # or: firefox, chrome, arc, brave, safari, file
  profile_path: ~/Library/... # for browser sources
last_bookmark_id: 2405 # browser-specific watermarks
last_history_timestamp: 1774972539187785
domain-cache.yaml — cached domain classifications (shared across all sources):
domains:
  simonwillison.net: article-source
  github.com: mixed
  grafana.hardcoretech.co: tool
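
The last_history_timestamp watermark in .distill-state.yaml looks like a Firefox-style value in microseconds since the Unix epoch; that reading is an assumption based on the example's magnitude. Under it, incremental runs could filter history rows as sketched below, where `last_visit_date` is the Firefox column name and the function names are hypothetical.

```python
from datetime import datetime, timezone


def watermark_to_datetime(microseconds):
    """Convert a microsecond Unix timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(microseconds / 1_000_000, tz=timezone.utc)


def new_history_rows(rows, watermark):
    """Keep only rows visited strictly after the stored watermark."""
    return [r for r in rows if (r.get("last_visit_date") or 0) > watermark]
```

Chromium and Safari store visit times against different epochs, so the conversion would need adjusting for those sources.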
- references/pipeline.md: Phase-by-phase execution with edge cases
- references/classification.md: Insight vs noise classification guide
- references/browser-schemas.md: Browser database schemas (Firefox, Chromium, Safari)
npx claudepluginhub cjhwong/claude-skills --plugin distill