From godmode
Guides building RAG systems for Q&A, chatbots, and knowledge bases, covering embedding models, chunking strategies, vector stores, ingestion pipelines, and retrieval optimization.
Slash command: /godmode:rag
- `/godmode:rag`, "build RAG system", "knowledge base"
Use case: <questions the system must answer>
Data sources: <docs, wiki, DB, PDFs, code>
Corpus: <N documents, N tokens, N MB>
Update frequency: static|daily|real-time
Query patterns:
Factual lookup (single-hop retrieval)
Analytical (multi-document retrieval)
Conversational (multi-turn Q&A)
Structured (metadata filtering + retrieval)
| Model | Dims | MTEB | Cost (per 1M tokens) |
|---|---|---|---|
| text-embedding-3-large | 3072 | 64.6 | $0.13/1M |
| text-embedding-3-small | 1536 | 62.3 | $0.02/1M |
| Cohere embed-v3 | 1024 | 64.5 | $0.10/1M |
| Voyage voyage-3 | 1024 | 67.1 | $0.06/1M |
| BGE-large-en-v1.5 | 1024 | 64.2 | Free* |
IF budget-constrained: text-embedding-3-small. IF quality-critical: Voyage or BGE fine-tuned. IF multi-language: Cohere embed-v3.
| Strategy | Best For |
|---|---|
| Fixed-size (token) | Baseline, uniform docs |
| Recursive character | General-purpose |
| Semantic | Varying topic density |
| Code-aware (AST) | Source code repos |
| Markdown headers | Structured docs |
| Sliding window | Boundary context critical |
ALWAYS set overlap >= 10% of chunk size. IF chunk_size > 1000 tokens: information dilution risk. IF chunk_size < 100 tokens: context too fragmented. Default: 500 tokens, 50 token overlap.
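A minimal sketch of the fixed-size strategy with the defaults above (500 tokens, 50 token overlap). The function name `chunk_tokens` is hypothetical, and the token list is a stand-in for the output of a real tokenizer, which is an assumption here.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk starts 450 tokens after the previous one, so neighbouring chunks share their boundary 50 tokens.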
| Store | Type | Scale | Best For |
|---|---|---|---|
| Pinecone | Managed | Billions | Production |
| Weaviate | Managed/Self | Millions | Hybrid search |
| Chroma | Embedded | Millions | Prototyping |
| pgvector | Extension | Millions | Existing PG |
| Qdrant | Managed/Self | Billions | High perf |
IF already using PostgreSQL: start with pgvector. IF < 100K chunks: Chroma for development. IF > 10M chunks: Pinecone or Qdrant.
Document -> Parse/Extract -> Clean/Transform -> Chunk -> Embed -> Index
Loaders:
PDF: PyMuPDF, pdfplumber, Unstructured
HTML: BeautifulSoup, Unstructured
Code: tree-sitter AST parser
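# Verify indexing (assumes a persistent Chroma store; ./chroma_db is a placeholder path,
# since the default in-memory Client() starts empty in every new process)
python -c "import chromadb; \
c = chromadb.PersistentClient(path='./chroma_db'); print(c.list_collections())"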
Hybrid search (RECOMMENDED for production):
Dense (vector): semantic similarity
Sparse (BM25): keyword/exact matching
Fusion: Reciprocal Rank Fusion (RRF)
Top-K: 5-20 chunks (start with 10)
Reranker: cross-encoder on top-20 results (highest-impact single optimization)
IF recall < 70%: increase overlap, add BM25, try domain-specific embeddings. IF recall > 90% but bad answers: generation problem.
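The fusion step of hybrid search can be sketched as below. This is an illustrative version of Reciprocal Rank Fusion, not the skill's own implementation: the function name, the chunk ids, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=10):
    """Fuse several ranked lists of chunk ids into one list.

    Each list contributes 1 / (k + rank) to a chunk's score, with rank
    starting at 1, so chunks ranked highly by several retrievers rise.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id for deterministic output.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_k]]


dense = ["c3", "c1", "c7"]   # hypothetical vector-search ranking
sparse = ["c1", "c9", "c3"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # → ['c1', 'c3', 'c9', 'c7']
```

RRF uses only ranks, so dense similarity scores and BM25 scores never need to be normalized onto a shared scale.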
Context window budget:
System prompt: <N tokens>
Retrieved context: <N tokens>
Conversation history: <N tokens>
Output reservation: <N tokens>
Total < model context limit
Assembly: rank by relevance, include until budget.
Format with source attribution.
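One way to read the budget and assembly rules above, as a sketch. The helper names, the `[source: ...]` attribution format, and the whitespace-based token counter are all assumptions; a real system would count tokens with the model's tokenizer.

```python
def retrieval_budget(context_limit, system_prompt, history, output_reservation):
    """Tokens left for retrieved context after the other reservations."""
    remaining = context_limit - system_prompt - history - output_reservation
    if remaining <= 0:
        raise ValueError("no room left for retrieved context")
    return remaining


def assemble_context(chunks, budget_tokens,
                     count_tokens=lambda text: len(text.split())):
    """Add chunks in relevance order until the next one would exceed the budget.

    `chunks` is a list of dicts with 'text', 'source', and 'score' keys.
    """
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        entry = f"[source: {chunk['source']}]\n{chunk['text']}"
        cost = count_tokens(entry)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit
        selected.append(entry)
        used += cost
    return "\n\n".join(selected), used


chunks = [
    {"text": "Refunds are issued within 14 days.", "source": "policy.md", "score": 0.91},
    {"text": "Contact support by email.", "source": "faq.md", "score": 0.40},
    {"text": "Shipping takes 3 to 5 business days.", "source": "shipping.md", "score": 0.75},
]
context, used = assemble_context(chunks, budget_tokens=20)
print(used)
print(context)
```

Stopping at the first chunk that does not fit keeps the context strictly in relevance order; skipping it and trying smaller chunks is an alternative that uses more of the budget.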
Retrieval metrics:
Hit rate @ K: % queries with answer in top-K
MRR: average 1/rank of first correct result
Generation metrics:
Faithfulness: grounded in retrieved context
Hallucination rate: answers without evidence
Targets:
Recall@10 >= 80%, MRR >= 0.7
Faithfulness >= 90%, Hallucination < 5%
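The two retrieval metrics can be computed from a labeled query set roughly as follows. This is a sketch: the function names and the toy data are hypothetical, and it assumes each query comes with a set of known-relevant chunk ids.

```python
def hit_rate_at_k(results, relevant, k=10):
    """Fraction of queries with at least one relevant chunk in the top k."""
    if not results:
        raise ValueError("need at least one query")
    hits = sum(
        1 for ranked, gold in zip(results, relevant)
        if any(chunk_id in gold for chunk_id in ranked[:k])
    )
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant chunk (0 when none is found)."""
    if not results:
        raise ValueError("need at least one query")
    total = 0.0
    for ranked, gold in zip(results, relevant):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in gold:
                total += 1.0 / rank
                break
    return total / len(results)


# Three toy queries: relevant chunk at rank 1, at rank 2, and not retrieved.
results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"e"}, {"z"}]
hit_rate = hit_rate_at_k(results, relevant, k=3)
mrr = mean_reciprocal_rank(results, relevant)
print(round(hit_rate, 3), round(mrr, 3))  # → 0.667 0.5
```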
# RAG pipeline testing
python -m pytest tests/test_rag.py -v
curl -s "http://localhost:8080/api/search?q=test" | jq .results
Append .godmode/rag.tsv:
timestamp action chunks recall_at_10 faithfulness hallucination status
KEEP if: target metric improved AND hallucination did not increase.
DISCARD if: hallucination increased OR no improvement.
Never keep a change that increases hallucination.
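The keep/discard rule and the log row could look like this in code. It is a sketch under assumptions: `decide` and `log_row` are hypothetical names, Recall@10 is taken as the target metric, metrics are stored as fractions, and writing a header row to a new file is a choice made here rather than something the rule specifies.

```python
import csv
import os
from datetime import datetime, timezone

COLUMNS = ["timestamp", "action", "chunks", "recall_at_10",
           "faithfulness", "hallucination", "status"]


def decide(before, after, target="recall_at_10"):
    """KEEP only if the target metric improved and hallucination did not rise."""
    improved = after[target] > before[target]
    hallucination_up = after["hallucination"] > before["hallucination"]
    return "KEEP" if improved and not hallucination_up else "DISCARD"


def log_row(path, action, chunks, metrics, status):
    """Append one tab-separated row, writing the header if the file is new."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if new_file:
            writer.writerow(COLUMNS)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(), action, chunks,
            metrics["recall_at_10"], metrics["faithfulness"],
            metrics["hallucination"], status,
        ])


before = {"recall_at_10": 0.72, "faithfulness": 0.88, "hallucination": 0.06}
after = {"recall_at_10": 0.78, "faithfulness": 0.90, "hallucination": 0.06}
status = decide(before, after)
print(status)  # → KEEP
```

A call such as `log_row(".godmode/rag.tsv", "add_bm25", 1200, after, status)` would then record the iteration; the action label and chunk count are made-up values.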
STOP when FIRST of:
- Recall@10 >= 80%, faithfulness >= 90%, hallucination < 5%
- Two iterations < 2% improvement
- Latency meets requirements
On failure: git reset --hard HEAD~1. Never pause.
| Failure | Action |
|---|---|
| Low recall < 70% | Increase overlap, add BM25, reranker |
| High hallucination | Add "only use context", reduce chunks |
| High latency | Cache frequent queries, reduce top-K |
npx claudepluginhub arbazkhan971/godmode