Skill

rag-architect

End-to-end RAG system design — chunking strategies, embedding selection, retrieval optimization, reranking. Use when building or tuning a RAG pipeline.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ai-ml-eng-pro:rag-architect

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Designs and optimizes Retrieval-Augmented Generation (RAG) systems end-to-end. Covers chunking strategies (fixed, semantic, recursive, agentic), embedding model selection, vector database architecture, retrieval optimization (hybrid search, multi-stage retrieval), reranking, context window management, and hallucination reduction techniques.

SKILL.md

67 lines · ~1.1k tokens

Stats

Language: Python

Parent stars: 0

Maintenance: Good

Last Commit: May 25, 2026


RAG Architect

What It Does

Iron Laws (NEVER violate)

Retrieval before generation — Never let the LLM answer from its parametric knowledge when relevant documents exist. RAG without retrieval is just generation.
Chunk for the task — Chunking strategy must match the query pattern. Q&A needs small chunks; summarization needs large chunks. Wrong chunk size = irrelevant retrieval.
Source attribution required — Every generated claim that came from a retrieved document must cite its source. Unattributed claims are indistinguishable from hallucination.
Evaluate retrieval separately — Retrieval quality (precision@k, recall@k, MRR) must be measured independently from generation quality. Bad retrieval → bad RAG, regardless of LLM quality.
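Here is a minimal Python sketch of the three retrieval metrics named in the last law. It assumes each evaluation query has a hand-labeled set of relevant document IDs. The function names are illustrative and do not come from any particular library. In this version, precision@k divides by k even when fewer than k results come back; some evaluation tools use other conventions.

```python
from typing import Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear somewhere in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average over queries of 1/rank of the first relevant hit (0 if none)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(runs)
```

For example, suppose a query retrieves d3, d1, d7, d2 and only d1 and d2 are relevant. Then precision@3 is 1/3 and recall@3 is 0.5. These functions run over a labeled query set without calling the LLM at all, so retrieval is scored separately from generation.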

Red Flags (STOP immediately)

Retrieval irrelevance — Top-k retrieved documents are unrelated to query → chunking or embedding strategy broken
Context overflow — Retrieved documents exceed model context window → need reranking, summarization, or chunk reduction
Hallucination from retrieval — LLM generates claims not present in retrieved documents → source attribution failure
Latency spiral — Retrieval + reranking + generation exceeds latency SLA → pipeline optimization needed
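One simple guard against the context-overflow flag is to pack reranked chunks into an explicit token budget before building the prompt. This sketch assumes each chunk already carries a relevance score. It uses a whitespace word count as a stand-in tokenizer. Chunk, approx_tokens, and pack_context are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float  # relevance score from retrieval or reranking; higher is better


def approx_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words. Use the target model's own
    # tokenizer for real budgets; word counts usually undercount tokens.
    return len(text.split())


def pack_context(
    chunks: List[Chunk],
    context_window: int,
    reserved: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[Chunk]:
    """Greedily keep the highest-scoring chunks that fit in the token budget.

    `reserved` covers the system prompt, the question, and the answer.
    """
    budget = context_window - reserved
    if budget <= 0:
        raise ValueError("reserved tokens leave no room for retrieved context")
    packed: List[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if used + cost > budget:
            continue  # skip a chunk that would overflow; a smaller one may still fit
        packed.append(chunk)
        used += cost
    return packed
```

Greedy packing by score is only one policy. Summarizing lower-ranked chunks or lowering k are the other remedies the red flag lists.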

Common Rationalizations (self-deception)

"Just use the default chunk size (e.g., 1000 tokens)" → Chunk size is task-dependent. A framework default is rarely optimal.
"Vector search is enough" → Keyword search catches exact matches that embeddings miss. Hybrid search is almost always better.
"More retrieved documents = better answers" → Beyond the optimal k, irrelevant documents dilute the context and increase hallucination.

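Hybrid search needs some way to merge the dense and sparse result lists. The skill does not prescribe a fusion method, so Reciprocal Rank Fusion (RRF) is used below as one common choice. The sketch assumes you already have ranked document IDs from a vector index and from a keyword index such as BM25. The retrievers themselves are out of scope.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k = 60 is a commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

RRF uses only ranks, so it never has to compare cosine similarities with BM25 scores, which live on different scales. With dense = [d2, d1, d5] and sparse = [d1, d3, d2], the fused order is d1, d2, d3, d5. Documents found by both retrievers rise to the top.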
When To Use

Building a RAG system for document Q&A
Optimizing an existing RAG pipeline that produces poor answers
Selecting embedding models and vector databases
Designing multi-stage retrieval with reranking
Reducing hallucination in LLM applications with grounding

Human Partner Signals (escalate to human)

Data sensitivity — Documents contain confidential/proprietary information → access control review
Compliance requirement — RAG system handles regulated data (health, finance, legal) → compliance review
Scale decision — Vector database choice has significant cost implications → architecture decision
Quality threshold — RAG accuracy below business requirement after optimization → human review of feasibility

Pipeline

Analyze: understand query patterns, document types, latency requirements, accuracy targets
Chunk: select and tune chunking strategy — size, overlap, metadata preservation
Embed: choose embedding model based on domain, dimensionality, cost, and benchmark performance
Store: configure vector database with appropriate indexing (HNSW, IVF) and metadata filtering
Retrieve: implement retrieval — hybrid search (dense + sparse), multi-stage (candidate → rerank)
Generate: design LLM prompt with retrieved context, source attribution, and anti-hallucination guards
Evaluate: measure retrieval quality and end-to-end answer accuracy separately
Iterate: tune chunk size, retrieval k, reranking threshold based on eval results
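The Chunk step (size, overlap, metadata preservation) is where the "chunk for the task" law gets applied. Below is a sketch of plain fixed-size chunking with overlap. It treats whitespace-separated words as a stand-in for tokens. TextChunk and chunk_words are hypothetical names. Semantic, recursive, or agentic strategies would replace the windowing loop, but they should emit the same metadata so source attribution still works downstream.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextChunk:
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def chunk_words(
    text: str,
    source: str,
    chunk_size: int = 200,
    overlap: int = 40,
) -> List[TextChunk]:
    """Fixed-size chunking over words with overlap, keeping source metadata.

    Sizes are in words as a rough proxy for tokens (an assumption; use the
    embedding model's tokenizer for exact sizes).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[TextChunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(
            TextChunk(
                text=" ".join(window),
                # Source and position let every retrieved chunk be cited later.
                metadata={"source": source, "start_word": start, "chunk_index": len(chunks)},
            )
        )
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

chunk_size and overlap are the knobs the Iterate step tunes against real queries. The defaults here are placeholders, not recommendations.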

Verification Checklist

Chunking strategy tested with real query patterns (not synthetic)
Retrieval evaluated independently (precision@k, recall@k, MRR)
Hybrid search implemented (dense + sparse) for retrieval
Source attribution verified — every generated claim traceable to a document
Context window budget honored (no overflow)
Latency measured end-to-end and within SLA
Hallucination rate measured on held-out test set
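Part of the source-attribution check can be automated. This sketch assumes the generation prompt tells the model to cite sources inline as [doc-id]. It flags sentences that carry no citation and citations that point to documents that were never retrieved. audit_citations is a hypothetical helper, and its regex sentence splitter is naive: abbreviations such as "e.g." will cause extra splits. It checks traceability only, not whether the cited passage actually supports the claim. It therefore complements the hallucination-rate measurement rather than replacing it.

```python
import re
from typing import Dict, List, Set

# Inline citation like [doc-7]; captures the ID between the brackets.
CITATION = re.compile(r"\[([^\[\]]+)\]")
# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def audit_citations(answer: str, retrieved_ids: Set[str]) -> Dict[str, List[str]]:
    """Return uncited sentences and cited IDs that are not in the retrieved set."""
    uncited: List[str] = []
    unknown: List[str] = []
    for sentence in SENTENCE_SPLIT.split(answer.strip()):
        if not sentence:
            continue
        cited = CITATION.findall(sentence)
        if not cited:
            uncited.append(sentence)
        unknown.extend(c for c in cited if c not in retrieved_ids)
    return {"uncited_sentences": uncited, "unknown_sources": unknown}
```

A non-empty result for either key is a checklist failure worth inspecting by hand.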

Related Skills

embedding-manager — Embedding generation and optimization for RAG retrieval
prompt-engineer — RAG generation prompts require specialized design
model-evaluator — Evaluate RAG system quality end-to-end
dataset-curator — Curate document collections for RAG indexing
