This skill should be used when the user asks to "build corpus index", "create index from docs", "analyze documentation", "populate corpus index", or needs to build the initial index for a corpus that was just initialized. Triggers on "build my corpus", "index the documentation", "create the index.md", "finish setting up corpus", "hiivmind-corpus build", or when a corpus has placeholder index.md that says "Run hiivmind-corpus-build", or "create the index.yaml".
Slash command: /hiivmind-corpus:hiivmind-corpus-build
Build the documentation corpus index. Prepares all sources, scans for content, consults
the user on organization preferences, generates index.yaml (structured, machine-queryable)
and renders index.md from it. Updates config metadata. Supports single and multi-source
corpora with tiered indexing for large (500+ file) corpora.
A config.yaml must exist with at least one source configured.
If not found, suggest running hiivmind-corpus-init and hiivmind-corpus-add-source.
1. PREPARE → 2. SCAN → 3. SEGMENT → 4. PREFERENCES → 5. INDEX → 6. GRAPH → 7. EMBEDDINGS → 8. SAVE
Inputs: working directory
Outputs: computed.config, computed.sources, all sources verified ready
- config.yaml

Git source:
- .source/{source_id}/ clone exists
- git clone --depth 1 --branch {branch} {url} .source/{source_id}

Local source:
- uploads/{source_id}/ directory exists
- Files (.md, .mdx, or .pdf)
- .md files may contain YAML frontmatter with tags, headings, and provenance metadata produced by the PDF extraction pipeline (see lib/corpus/patterns/sources/pdf.md). During scanning, extract tags and headings from frontmatter when present to enrich index entries rather than deriving them solely from file content.

Web source:
- .cache/web/{source_id}/ directory exists

llms-txt source:
- .cache/llms-txt/{source_id}/ exists

Self source:
- git rev-parse --show-toplevel
- docs_root: if ".", treat as repo root
- .hiivmind/ is auto-excluded during scanning (see lib/corpus/patterns/sources/self.md)

Display: "Sources prepared: {count} ready, {skipped} skipped"
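The per-type directory checks above can be sketched as a small lookup. This is only an illustration: the function names and the dict shape for a source (`type`, `id`) are assumptions, and a missing directory is simply counted as skipped here, whereas the real phase may clone or fetch the source first.

```python
from pathlib import Path

# Directory layout per source type, taken from the checks listed above.
# "self" sources resolve from the repo root instead, so they are omitted.
SOURCE_DIRS = {
    "git": ".source/{source_id}",
    "local": "uploads/{source_id}",
    "web": ".cache/web/{source_id}",
    "llms-txt": ".cache/llms-txt/{source_id}",
}


def expected_dir(corpus_root: Path, source_type: str, source_id: str) -> Path:
    """Return the directory that should exist once a source is prepared."""
    template = SOURCE_DIRS[source_type]
    return corpus_root / template.format(source_id=source_id)


def summarize(corpus_root: Path, sources: list[dict]) -> str:
    """Count ready vs. skipped sources and format the Phase 1 display line.

    Assumption: a source whose directory is missing counts as skipped.
    """
    ready = sum(
        1 for s in sources
        if expected_dir(corpus_root, s["type"], s["id"]).is_dir()
    )
    skipped = len(sources) - ready
    return f"Sources prepared: {ready} ready, {skipped} skipped"
```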
Inputs: prepared sources
Outputs: computed.scan_results
GUARD_PHASE_2():
    IF computed.sources IS null OR len(computed.sources) == 0:
        DISPLAY "Cannot proceed: Phase 1 (Prepare Sources) has not completed."
        EXIT
See: lib/corpus/patterns/scanning.md
If only one source, scan directly:
If 2+ sources, spawn parallel source-scanner agents:
See: agents/source-scanner.md
For each source, create a Task with prompt:
Scan source '{source_id}' (type: {type}) at corpus path '{corpus_path}'.
Return YAML with: source_id, type, status, file_count, sections (name/path/file_count),
large_files, framework, frontmatter_type, notes.
{if source has extraction: block in config}
extraction_config:
wikilinks: {true|false}
frontmatter: {true|false}
tags: {true|false}
dataview: {true|false}
Include extraction output in your YAML report per the extraction output format in
${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/extraction.md § "Source-Scanner Extraction Output Format".
{end if}
{if source has sections: block in config}
sections_config:
  enabled: true
  min_level: {min_level}
  min_content_lines: {min_content_lines}
Generate section entries for qualifying headings per the Section Entry Generation
instructions in ${CLAUDE_PLUGIN_ROOT}/agents/source-scanner.md.
Also report heading_consistency in your scan output.
{end if}
{if source has chunking: block in config}
chunking_config:
  strategy: {strategy}
  target_tokens: {target_tokens}
  overlap_tokens: {overlap_tokens}
Run chunk.py on each file and include chunks in your output per the Chunk Generation
instructions in ${CLAUDE_PLUGIN_ROOT}/agents/source-scanner.md.
{end if}
Additionally, for each documentation file, include entry metadata in your output:
path, title, summary, tags, keywords, category, content_type, size, grep_hint, headings.
See ${CLAUDE_PLUGIN_ROOT}/agents/source-scanner.md § "Entry Metadata Generation" for field details.
{if type is "self"}
For self sources: scan from repo root {repo_root}/{docs_root}. Auto-exclude .hiivmind/ directory.
The repo root is: {output of git rev-parse --show-toplevel}
{end if}
Launch ALL tasks in a single response for parallel execution. Aggregate results.
Display results table:
Scan Results
──────────────────────────────────
| Source | Type | Files | Sections | Framework |
|-------------|------|-------|----------|--------------|
| {id} | git | 142 | 8 | Docusaurus |
| {id} | local| 12 | 1 | none |
Total: {total_files} files across {source_count} sources
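As a rough sketch of the aggregation step, the scanner reports could be folded into the results table like this. The report keys (`source_id`, `type`, `file_count`, `sections`, `framework`) follow the YAML fields requested in the prompt above; the function name and the in-memory dict shape are assumptions.

```python
def render_scan_table(reports: list[dict]) -> str:
    """Render aggregated scan reports as a pipe table plus a totals line."""
    lines = [
        "| Source | Type | Files | Sections | Framework |",
        "|---|---|---|---|---|",
    ]
    for r in reports:
        # A missing or empty framework is shown as "none", as in the sample table.
        framework = r.get("framework") or "none"
        lines.append(
            f"| {r['source_id']} | {r['type']} | {r['file_count']} "
            f"| {len(r['sections'])} | {framework} |"
        )
    total_files = sum(r["file_count"] for r in reports)
    lines.append(f"Total: {total_files} files across {len(reports)} sources")
    return "\n".join(lines)
```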
GUARD_TREE_THINNING():
    section_count = count(entry for entry in computed.scan_results if entry.tier == "section")
    IF section_count == 0:
        SKIP "No section entries to thin."
        PROCEED to next phase
    has_token_config = ANY(source.sections.min_section_tokens IS NOT null for source in config.sources)
    IF NOT has_token_config:
        SKIP "Tree thinning not configured (no min_section_tokens in any source)."
        PROCEED to next phase
    result = Bash("python3 ${CLAUDE_PLUGIN_ROOT}/lib/corpus/scripts/thin_sections.py --index index.yaml --min-tokens {min_section_tokens} --dry-run")
    IF result.exit_code != 0:
        DISPLAY "Tree thinning failed: {stderr}. Proceeding with unthinned sections."
        PROCEED to next phase
    IF result.sections_before == result.sections_after:
        DISPLAY "Tree thinning: all sections above threshold. No merges needed."
        PROCEED to next phase
    DISPLAY "Tree thinning would merge {sections_before - sections_after} sections."
    ASK user: "Apply these merges? [Y/n]"
    IF user approves:
        Bash("python3 ${CLAUDE_PLUGIN_ROOT}/lib/corpus/scripts/thin_sections.py --index index.yaml --min-tokens {min_section_tokens}")
        DISPLAY "Thinned: {sections_before} → {sections_after} sections."
Inputs: computed.scan_results, total file count
Outputs: computed.segmentation
GUARD_PHASE_3():
    IF computed.scan_results IS null:
        DISPLAY "Cannot proceed: Phase 2 (Scan Sources) has not completed."
        EXIT
Present segmentation options:
| Strategy | Description |
|---|---|
| Tiered (recommended) | Main index.md with section summaries, detailed index-{section}.md files |
| By source | One sub-index per source |
| By section | Main index covers top 20-30% only, link to sources for rest |
| Single file | Everything in one index.md (not recommended for large corpora) |
If tiered or by-source selected, collect section definitions from user.
Suggest segmentation but don't require it: "This corpus has {n} files. A tiered index is optional but can improve navigation. Use tiered indexing?"
Default to single file. No segmentation prompt needed.
Inputs: computed.scan_results, computed.segmentation
Outputs: computed.user_preferences
GUARD_PHASE_4():
    IF computed.segmentation IS null:
        DISPLAY "Cannot proceed: Phase 3 (Determine Segmentation) has not completed."
        EXIT
Ask: "What's the primary use case for this corpus?"
| Option | Description |
|---|---|
| Reference | API docs, configuration reference |
| Learning | Tutorials, getting started guides |
| Troubleshooting | Error handling, debugging guides |
| Mixed | General purpose documentation |
If multiple sources, ask: "Which sources should be prioritized in the index?" Present sources for ordering. Higher priority sources get more detailed entries.
Ask: "How should the index be organized?"
| Option | Description |
|---|---|
| By topic | Group entries by subject area across sources |
| By source | Group entries by documentation source |
| Mixed | Topics first, source attribution inline |
Ask: "Are there sections to exclude? (e.g., changelog, internal docs)" Allow comma-separated section names or "none".
See: ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/section-indexing.md and ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/chunking.md
For each source, analyze scan results and recommend an indexing depth. Present the recommendation with concrete consequences — estimated counts and sizes.
For each source, present:
Source: {source_id} ({type}, {file_count} files, {large_file_count} large files)
Recommended indexing depth: {recommendation}
- File-level: {file_count} entries with metadata embeddings (current default)
- Section-level: ~{section_estimate} additional entries from h{min_level}+ headings
- Deep chunking: {chunk_estimate_or_"Not recommended"} using {strategy} strategy
Reason: {explanation based on scan results}
Options:
a) File only (current behavior)
b) File + Sections [recommended if heading_consistency is high]
c) File + Chunks [recommended if heading_consistency is low]
d) File + Sections + Chunks
Recommendation logic:
| Scan Result | Recommendation |
|---|---|
| heading_consistency: high, large_files > 0 | File + Sections |
| heading_consistency: low or mixed | File + Chunks |
| heading_consistency: high, large_files > 0, file_count > 200 | File + Sections + Chunks |
| file_count < 50, no large files | File only |
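The recommendation table can be read as a small decision function. The sketch below is one possible reading, not the skill's own implementation: the rows overlap, so the evaluation order (small corpora first, then the more specific high-consistency rows) is an assumption, as is the "File only" fallback for combinations the table does not cover.

```python
def recommend_depth(heading_consistency: str, large_files: int, file_count: int) -> str:
    """Map scan results to an indexing-depth recommendation.

    Assumptions: rows are checked in the order below, and any combination
    the table leaves out falls back to "File only".
    """
    if file_count < 50 and large_files == 0:
        return "File only"
    if heading_consistency == "high" and large_files > 0:
        # The 200+ file row is the more specific case, so it is checked first.
        if file_count > 200:
            return "File + Sections + Chunks"
        return "File + Sections"
    if heading_consistency in ("low", "mixed"):
        return "File + Chunks"
    return "File only"
```

For example, `recommend_depth("high", 3, 250)` returns `"File + Sections + Chunks"`, while the same source with 120 files gets `"File + Sections"`.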
Estimation heuristics:
- Section estimate: sum(headings per file at min_level+) * 0.7 (30% filtered by min_content_lines)
- Chunk estimate: sum(file_lines / target_lines_per_chunk) across chunking-eligible files
- Size estimate: entry_count * 2KB for metadata, chunk_count * 15KB for chunks

After all sources are configured, show a confirmation table:
Indexing Depth Summary
──────────────────────────────────────────────────────────────
Source | File | Sections | Chunks | Est. size
------------------|------|----------|--------|----------
polars-docs | 142 | ~200 | — | ~4MB
meeting-notes | 340 | — | ~3000 | ~45MB
obsidian-vault | 215 | ~180 | ~800 | ~18MB
──────────────────────────────────────────────────────────────
Store user choices in computed.indexing_depth for use in later phases.
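The estimation heuristics above reduce to three short formulas. The following sketch spells them out; the function names are hypothetical, rounding to whole entries is an assumption, and the figures in the sample summary table are illustrative rather than outputs of this code.

```python
def estimate_sections(headings_per_file: list[int]) -> int:
    """Headings at min_level or deeper, minus the ~30% filtered by min_content_lines."""
    return round(sum(headings_per_file) * 0.7)


def estimate_chunks(file_lines: list[int], target_lines_per_chunk: int) -> int:
    """Sum of file_lines / target_lines_per_chunk over chunking-eligible files."""
    return round(sum(n / target_lines_per_chunk for n in file_lines))


def estimate_size_kb(entry_count: int, chunk_count: int) -> int:
    """2KB per metadata entry plus 15KB per chunk."""
    return entry_count * 2 + chunk_count * 15
```

With 340 entries and 3000 chunks, `estimate_size_kb(340, 3000)` gives 45680 KB, so chunk embeddings dominate the size of a deeply chunked source.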
Inputs: computed.scan_results, computed.user_preferences, computed.segmentation
Outputs: computed.index
GUARD_PHASE_5():
    IF computed.user_preferences IS null:
        DISPLAY "Cannot proceed: Phase 4 (Collect User Preferences) has not completed."
        EXIT
All file paths in the index use: {source_id}:{relative_path}
| Source Type | Format | Example |
|---|---|---|
| git | {source_id}:{path} | react:reference/hooks.md |
| local | local:{source_id}/{file} | local:team-docs/guidelines.md |
| web | web:{source_id}/{file} | web:blog/article.md |
| llms-txt | llms-txt:{source_id}/{path} | llms-txt:claude-code/skills.md |
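A minimal sketch of the path formats in the table above. The function name is hypothetical, and rejecting types that the table does not list (such as self sources) is an assumption of this sketch.

```python
def entry_path(source_type: str, source_id: str, relative_path: str) -> str:
    """Build an index path following the per-type formats in the table."""
    if source_type == "git":
        return f"{source_id}:{relative_path}"
    if source_type in ("local", "web", "llms-txt"):
        # These types are prefixed with the type name rather than the source ID.
        return f"{source_type}:{source_id}/{relative_path}"
    # Assumption: types not shown in the table are rejected here.
    raise ValueError(f"no path format listed for source type {source_type!r}")
```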
Read the documentation files, analyze their content, and generate an index organized per user preferences. Each entry should include:
For tiered indexes, generate the main index.md with section summaries and separate
index-{section}.md files with detailed entries.
See: lib/corpus/patterns/index-generation.md
From the source-scanner output, construct index.yaml following the strict schema in lib/corpus/patterns/index-format-v2.md.
For each entry from each source-scanner report:
- id as {source_id}:{path}
- title, summary, tags, keywords, category, content_type, size, grep_hint, headings
- source to the source ID
- links_to from extraction wikilinks (if extraction was enabled)
- links_from by cross-referencing all entries' links_to lists
- frontmatter from extraction frontmatter data (if available, else {})
- concepts to empty list [] (populated later by Phase 6 if graph extraction is enabled, or manually via graph add-concept)
- stale: false, stale_since: null, last_indexed to current timestamp

Section entries (tier: section):
- id as {source_id}:{path}#{anchor}
- parent to the file entry ID ({source_id}:{path})
- title, summary, tags, keywords, anchor, heading_level, line_range
- tier: section
- concepts to empty list (populated by Phase 6 if applicable)
- size, grep_hint, headings, links_to, links_from, frontmatter

Construct meta:
- generated_at: current timestamp
- entry_count: total entries

Write index.yaml to the corpus root.
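Deriving links_from by cross-referencing every entry's links_to list amounts to inverting the link map. A minimal sketch, assuming entries are plain dicts keyed as in the field list above; the function name is hypothetical, and dropping links to unknown IDs is an assumption.

```python
def add_links_from(entries: list[dict]) -> None:
    """Populate links_from on each entry by inverting all links_to lists in place."""
    by_id = {entry["id"]: entry for entry in entries}
    for entry in entries:
        entry["links_from"] = []
    for entry in entries:
        for target in entry.get("links_to", []):
            # Assumption: links to IDs outside the index are ignored.
            if target not in by_id:
                continue
            backlinks = by_id[target]["links_from"]
            if entry["id"] not in backlinks:
                backlinks.append(entry["id"])
```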
After writing index.yaml, render index.md deterministically:
bash render-index.sh index.yaml
If render-index.sh does not exist in the corpus root, copy it from ${CLAUDE_PLUGIN_ROOT}/templates/render-index.sh first.
Pattern reference: ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/index-rendering.md
Present the draft to the user and ask: "How does this look?"
| Option | Action |
|---|---|
| Looks good | Proceed to save |
| Expand sections | Ask which sections to expand, regenerate with more detail |
| Reorganize | Ask for new organization preference, regenerate |
| Missing coverage | Ask what topics are missing, add entries |
| Custom feedback | Apply user's specific feedback |
Loop back to showing the draft after each refinement until the user is satisfied.
Inputs: computed.scan_results (with extraction data from sources that had it enabled)
Outputs: graph.yaml written alongside index.md
GUARD_PHASE_6():
    IF computed.index IS null:
        DISPLAY "Cannot proceed: Phase 5 (Generate Index) has not completed."
        EXIT
Precondition: At least one source in computed.scan_results has an extraction: block in its scan report.
Skip condition: If no source produced extraction data → skip this phase entirely. No graph.yaml is written.
See: ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/graph.md and ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/extraction.md
Merge extraction data
Collect extraction: blocks from all source-scanner reports. For each source's extraction data, prefix all file paths with {source_id}: to create corpus-scoped references. Merge into a unified extraction dataset:
- from and to paths)

Cluster entries into concepts
Apply the clustering algorithm from graph.md § "Graph Generation from Extraction Output":
Propose concepts to user
Present a table of proposed concepts with their candidate labels and entry counts:
Proposed Concepts from Extraction
────────────────────────────────────────
| Concept (proposed) | Entries | Based On |
|---------------------|---------|-----------------|
| family-activities | 12 | directory + tags |
| work-projects | 8 | tags |
| recipes | 5 | directory |
Accept all / Rename / Merge / Discard unwanted
Allow the user to rename, merge, or discard proposed concepts before proceeding.
Generate relationships
From the merged extraction data and confirmed concepts:
- wikilink)
- includes relationships
- see-also relationships (origin: tag)
- evidence path for each auto-generated relationship

Write graph.yaml
Write graph.yaml to the same directory as index.md, following the strict schema in graph.md § "Schema Definition (Strict)" (schema_version: 2 — no entry lists in concepts). Set meta.generated_at to current timestamp.
Display: "Graph generated: graph.yaml ({concept_count} concepts, {relationship_count} relationships)"
Populate concepts in index.yaml entries:
After graph.yaml concepts are confirmed, update index.yaml entries with concept membership:
- concepts: ["{concept-id}"] on each matched entry in index.yaml

graph.yaml v1 compatibility: If an existing graph.yaml with schema_version: 1 is detected (concepts have entries[] lists):
- concepts[] field
- entries and entry_count from each concept in graph.yaml
- schema_version: 2

Inputs: computed.index (index.yaml written), entry count
Outputs: index-embeddings.lance/ (if user opts in)
GUARD_PHASE_7():
    IF computed.index IS null:
        DISPLAY "Cannot proceed: Phase 5 (Generate Index) has not completed."
        EXIT
See: ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/embeddings.md
- Run dependency detection: python3 ${CLAUDE_PLUGIN_ROOT}/lib/corpus/scripts/detect.py
- Heuristic: entry_count > 150 OR corpus has tiered indexes (index-*.md files exist)
- pip install fastembed lancedb pyyaml (~260MB)"
  c. If user declines: skip, proceed to Phase 8
  d. If user accepts and fastembed not installed: run pip install fastembed lancedb pyyaml
  e. If detect.py reports "no-model": inform user "Downloading embedding model (~80MB, one-time)..."
- python3 ${CLAUDE_PLUGIN_ROOT}/lib/corpus/scripts/embed.py index.yaml index-embeddings.lance/

Commit guidance: index-embeddings.lance/ MUST be committed alongside index.yaml and index.md. It is a distributable artifact, not a cache. Do NOT add to .gitignore.
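The heuristic that gates the embeddings prompt (more than 150 entries, or tiered index-*.md files present) could be checked as follows. This is a sketch under assumptions: the function name is hypothetical, and tiered indexes are assumed to sit directly in the corpus root.

```python
from pathlib import Path


def should_offer_embeddings(entry_count: int, corpus_root: Path) -> bool:
    """Return True when the embeddings heuristic described above is met."""
    # Assumption: tiered sub-indexes live directly in the corpus root.
    has_tiered = corpus_root.is_dir() and any(corpus_root.glob("index-*.md"))
    return entry_count > 150 or has_tiered
```

Note that the threshold is strict: a corpus with exactly 150 entries and no tiered indexes does not trigger the prompt.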
Inputs: computed.scan_results (with chunks data), computed.indexing_depth
Outputs: chunks-embeddings.lance/ (if any source has chunking enabled)
Skip condition: No source in computed.indexing_depth has chunks enabled.
Aggregate all chunks from source-scanner reports into a single JSON file:
- chunks.json to a temporary location
- id, parent, source, path, chunk_index, chunk_text, line_range, overlap_prev

Run dependency detection:
python3 ${CLAUDE_PLUGIN_ROOT}/lib/corpus/scripts/detect.py
If not installed, prompt user (same as Phase 7).
Run chunk embedding:
python3 ${CLAUDE_PLUGIN_ROOT}/lib/corpus/scripts/embed.py --mode chunks chunks.json chunks-embeddings.lance/
Clean up temporary chunks.json
Display: "Generated chunk embeddings for {chunk_count} chunks across {source_count} sources"
Commit guidance: chunks-embeddings.lance/ MUST be committed alongside other corpus files. Do NOT add to .gitignore.
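The aggregation into a temporary chunks.json might look like the sketch below. The field names come from the list above; everything else is an assumption, in particular that each scanner report carries a `chunks` list and that the file is a top-level JSON array. The layout embed.py actually expects should be checked against lib/corpus/patterns/chunking.md.

```python
import json
import tempfile
from pathlib import Path

# Chunk fields named in the aggregation step above.
CHUNK_FIELDS = (
    "id", "parent", "source", "path",
    "chunk_index", "chunk_text", "line_range", "overlap_prev",
)


def write_chunks_json(reports: list[dict]) -> Path:
    """Collect chunks from all scanner reports into one temporary chunks.json.

    Assumptions: reports expose a "chunks" list, and the output is a
    top-level JSON array; missing fields are written as null.
    """
    chunks = [
        {field: chunk.get(field) for field in CHUNK_FIELDS}
        for report in reports
        for chunk in report.get("chunks", [])
    ]
    out_path = Path(tempfile.mkdtemp(prefix="corpus-chunks-")) / "chunks.json"
    out_path.write_text(json.dumps(chunks, indent=2), encoding="utf-8")
    return out_path
```

The caller is responsible for deleting the returned file once chunk embedding has finished.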
GUARD_PHASE_7C_VERIFICATION():
    IF computed.index IS null:
        DISPLAY "Cannot verify: index has not been generated."
        EXIT
    verify_enabled = config.build.verify_on_build
    IF verify_enabled IS null:
        verify_enabled = (computed.index.meta.entry_count < 200)
    IF verify_enabled == false:
        SKIP "Verification skipped."
        PROCEED to Phase 8
    sample_size = config.build.verify_sample_size OR 20
    result = Bash("python3 ${CLAUDE_PLUGIN_ROOT}/lib/corpus/scripts/verify_entries.py --index index.yaml --source-root .source/ --sample {sample_size}")
    IF result.exit_code != 0:
        DISPLAY "Verification script failed. Proceeding without verification."
        PROCEED to Phase 8
    # LLM verification of previews in batches of 10
    inaccurate = LLM_VERIFY(result)
    IF len(inaccurate) == 0:
        DISPLAY "Verification passed: all entries accurate."
    ELSE:
        DISPLAY "Verification found {N} entries with summary drift."
        ASK user: "Regenerate summaries for these entries? [Y/n]"
        IF user approves:
            regenerate and re-embed if needed
Inputs: computed.index, computed.segmentation
GUARD_PHASE_8():
    IF computed.index IS null:
        DISPLAY "Cannot proceed: Phase 5 (Generate Index) has not completed."
        EXIT
    # Phase 6 (Graph) and Phase 7 (Embeddings) are optional,
    # but verify they were evaluated, not skipped silently.
    # Graph: skip condition is "no extraction data" (checked in Phase 6)
    # Embeddings: skip condition is "heuristic not met" (checked in Phase 7)
- index.yaml with the structured index
- ${CLAUDE_PLUGIN_ROOT}/templates/render-index.sh to corpus root (if not already present)
- bash render-index.sh index.yaml to generate index.md
- index-{section}.md sub-index file (v1 format only; tiered v2 is deferred)
- index.last_updated_at to current timestamp
- last_indexed_at to current timestamp
- last_commit_sha to current clone HEAD
- config.yaml

Display summary:
Build complete!
Index: index.yaml ({entry_count} entries, {section_count} sections)
Rendered: index.md
{if graph: Graph: graph.yaml ({concept_count} concepts, {relationship_count} relationships)}
{if embeddings: Embeddings: index-embeddings.lance/ ({entry_count + section_count} entries)}
{if chunks: Chunks: chunks-embeddings.lance/ ({chunk_count} chunks)}
{if tiered: Sub-indexes: {count} files}
Strategy: {segmentation_strategy}
Sources indexed: {source_count}
| Error | Message | Recovery |
|---|---|---|
| No config.yaml | "No config.yaml found" | Run hiivmind-corpus-init |
| No sources | "No sources configured" | Run hiivmind-corpus-add-source |
| Clone failed | "Failed to clone {url}" | Check URL and network |
| Local source empty | "No files in uploads/{id}/" | Add documents or skip source |
| Scan failed | "Failed to scan source" | Check source accessibility |
| Save failed | "Failed to write index" | Check file permissions |
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/scanning.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/parallel-scanning.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/index-generation.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/config-parsing.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/sources/
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/extraction.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/graph.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/index-format-v2.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/index-rendering.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/freshness.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/embeddings.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/section-indexing.md
- ${CLAUDE_PLUGIN_ROOT}/lib/corpus/patterns/chunking.md
- ${CLAUDE_PLUGIN_ROOT}/agents/source-scanner.md
- ${CLAUDE_PLUGIN_ROOT}/skills/hiivmind-corpus-init/SKILL.md
- ${CLAUDE_PLUGIN_ROOT}/skills/hiivmind-corpus-add-source/SKILL.md
- ${CLAUDE_PLUGIN_ROOT}/skills/hiivmind-corpus-enhance/SKILL.md
- ${CLAUDE_PLUGIN_ROOT}/skills/hiivmind-corpus-refresh/SKILL.md
- ${CLAUDE_PLUGIN_ROOT}/skills/hiivmind-corpus-graph/SKILL.md: View, validate, edit concept graphs
- ${CLAUDE_PLUGIN_ROOT}/skills/hiivmind-corpus-bridge/SKILL.md: Cross-corpus concept bridges and aliases