From skillry-ai-and-agent-systems
Use when you need to review retrieval, embeddings, chunking, vector search, ranking, and grounding strategy.
Slash command
/skillry-ai-and-agent-systems:43-rag-vector-search-review
Audit a Retrieval-Augmented Generation pipeline end-to-end: chunking strategy, embedding model choice, vector database configuration, similarity metric, hybrid search setup, reranking, recall measurement, and index staleness. Produces specific, actionable findings — not a generic RAG tutorial. Every finding must include the configuration parameter that needs to change and the expected impact on recall.
Related skills: prompt-systems-review, llm-evaluation-review.
Map the pipeline. Document each stage with the responsible component and its key configuration: document ingestion → text extraction → chunking → embedding → indexing → query embedding → vector search → [optional: reranking] → context assembly → generation prompt construction. Any stage with no documented configuration is an audit gap.
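The pipeline map can be kept as data so that gaps are detected mechanically. The following is a minimal sketch, assuming a hypothetical `StageRecord` structure and `find_audit_gaps` helper; the stage names are taken from the list above, and everything else is illustrative.

```python
from dataclasses import dataclass, field

# Stage names follow the pipeline order described above.
STAGES = [
    "document ingestion", "text extraction", "chunking", "embedding",
    "indexing", "query embedding", "vector search", "reranking",
    "context assembly", "generation prompt construction",
]
OPTIONAL_STAGES = {"reranking"}


@dataclass
class StageRecord:
    """Hypothetical record for one stage of the pipeline map."""
    component: str                              # responsible component or service
    config: dict = field(default_factory=dict)  # key configuration values


def find_audit_gaps(pipeline: dict) -> list:
    """Return stages that are missing or have no documented configuration."""
    gaps = []
    for stage in STAGES:
        record = pipeline.get(stage)
        if record is None:
            # An absent optional stage is not a gap; an absent required one is.
            if stage not in OPTIONAL_STAGES:
                gaps.append(stage)
        elif not record.config:
            gaps.append(stage)
    return gaps
```

Calling `find_audit_gaps` on the documented pipeline returns the stage names to list as audit gaps in the report.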
Audit chunking strategy. Record:
Chunk size against the embedding model's input limit. For text-embedding-3-small: max 8,191 tokens; for text-embedding-3-large: max 8,191 tokens; for voyage-3: max 32,000 tokens. Verify overlap is 10-15% of chunk size to prevent context splitting at boundaries.
Audit the embedding model. Record: model name, version or release date, vector dimension count, max token input length, whether the model is identical at index time and query time (model name AND version). A mismatch between index-time and query-time embedding models produces meaningless similarity scores because the vectors are in different semantic spaces. Check whether the provider manages the model version or whether it is pinned. Provider-managed "latest" endpoints change without notice.
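The chunk-size and overlap checks can be expressed as a small configuration audit. This is a sketch only: `audit_chunking` is a hypothetical helper, and the model limit is passed in by the caller rather than looked up, so it should be taken from the provider's current documentation.

```python
def audit_chunking(chunk_tokens: int, overlap_tokens: int, model_max_tokens: int) -> list:
    """Return findings for a chunking configuration; an empty list means no findings."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    findings = []
    if chunk_tokens > model_max_tokens:
        findings.append(
            f"chunk size {chunk_tokens} exceeds model limit {model_max_tokens}"
        )
    # Overlap guidance from the text above: 10-15% of chunk size.
    ratio = overlap_tokens / chunk_tokens
    if not 0.10 <= ratio <= 0.15:
        findings.append(f"overlap is {ratio:.0%} of chunk size; expected 10-15%")
    return findings
```

Each returned string names the parameter to change, which fits the requirement that findings be tied to a specific configuration value.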
Audit the vector database configuration. Record:
ef_search (HNSW) or nprobe (IVF): higher values improve recall at latency cost
Check index staleness. Record: last full re-index date, incremental update frequency (real-time / hourly / daily / weekly), average lag between source document update and index update. For high-volatility content (product FAQs, news, pricing), index lag over 6 hours is a hallucination risk: the model retrieves stale facts and presents them as current. For low-volatility content (legal docs, technical specs updated quarterly), weekly indexing may be acceptable.
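The staleness check reduces to comparing timestamps against a threshold per content class. The sketch below assumes each document carries a source-update time and a last-indexed time; the `MAX_LAG` table and `stale_documents` function are hypothetical names, with the 6-hour and weekly thresholds taken from the guidance above.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the guidance above; the volatility labels are illustrative.
MAX_LAG = {"high": timedelta(hours=6), "low": timedelta(weeks=1)}


def stale_documents(docs, volatility, now):
    """Return ids of documents whose index entry lags the source beyond the limit.

    docs: iterable of (doc_id, source_updated_at, indexed_at) tuples.
    """
    limit = MAX_LAG[volatility]
    stale = []
    for doc_id, source_updated_at, indexed_at in docs:
        if indexed_at >= source_updated_at:
            continue  # the index already reflects the latest source version
        if now - source_updated_at > limit:
            stale.append(doc_id)
    return stale
```

Running this on a schedule and reporting the returned ids gives a concrete lag measurement instead of an assumed refresh cadence.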
Review context assembly. Confirm:
HNSW ef_search or IVF nprobe value documented; not left at default
Embedding model version drift. The index was built with text-embedding-ada-002. The provider released text-embedding-3-small, which has the same default vector dimension (1,536) but a different semantic space. A developer switches the query-time model to 3-small without re-indexing. Because the dimensions match, the vector database accepts the queries without error, but cosine similarity scores between ada-002 index vectors and 3-small query vectors are meaningless. Pin embedding model versions and re-index on every model change.
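One way to catch this drift is to store the embedding model identity alongside the index and compare it before every query. The sketch below is an assumption about how that could look: `EmbeddingSpec` and `check_query_model` are hypothetical, and the version string is whatever pin the team records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Identity of an embedding model, stored as index metadata (hypothetical)."""
    model: str
    version: str      # pinned version or release date recorded by the team
    dimensions: int


def check_query_model(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Raise if the query-time model differs from the index-time model.

    Matching dimensions alone are not sufficient: two models can share a
    dimension count and still produce vectors in different semantic spaces.
    """
    if index_spec != query_spec:
        raise ValueError(
            f"embedding mismatch: index built with {index_spec}, "
            f"query uses {query_spec}; re-index before querying"
        )
```

Failing loudly at query time turns a silent relevance regression into an explicit error.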
Chunk size exceeds model token limit. Chunks are split at 1,500 characters (~375 tokens) and the embedding model's limit is 512 tokens. However, the document contains technical paragraphs with very long sentences. Some chunks exceed 512 tokens; the model silently truncates them. The tail of every oversized chunk is not represented in the embedding. Convert character limits to token counts and enforce the token limit directly.
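Enforcing the limit in tokens rather than characters could look like the sketch below. It uses whitespace splitting as a stand-in tokenizer, which is an assumption made to keep the example self-contained; in a real audit the `tokenize` and `detokenize` arguments should be the embedding model's own tokenizer, since token counts differ between tokenizers.

```python
def chunk_by_tokens(text, max_tokens, overlap_tokens,
                    tokenize=str.split, detokenize=" ".join):
    """Split text into chunks of at most max_tokens tokens with fixed overlap.

    tokenize/detokenize default to whitespace splitting (a stand-in only).
    """
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and < max_tokens")
    tokens = tokenize(text)
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

Because the window is measured in tokens, no chunk can exceed the limit regardless of sentence length, so nothing is silently truncated at embedding time.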
No recall measurement. The team reports that retrieval "seems better" after a tuning session. There are no Recall@k numbers before or after. The improvement may be a cherry-picked example. Always measure Recall@5 before making any pipeline change and after, using the same fixed golden set.
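Recall@k over a fixed golden set is straightforward to compute. The sketch below assumes the golden set is a list of (query, relevant chunk ids) pairs and that `retrieve` is a hypothetical callable returning ranked chunk ids for a query.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("each golden query needs at least one relevant id")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(golden_set, retrieve, k=5):
    """Average Recall@k over (query, relevant_ids) pairs in the golden set."""
    scores = [recall_at_k(retrieve(query), relevant, k)
              for query, relevant in golden_set]
    return sum(scores) / len(scores)
```

Running `mean_recall_at_k` with the same golden set before and after a change gives the two numbers the review asks for.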
L2 distance on unnormalized embeddings. The vector DB is configured with L2 distance because it was the default. The embedding model outputs unnormalized vectors. Vector magnitude then affects the L2 ranking independently of direction, so a chunk can rank above or below another for reasons unrelated to its semantic match with the query. Either normalize all embeddings at insert time or switch the distance metric to cosine. Do not rely on the application layer to normalize; enforce it at the DB configuration level.
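A two-dimensional toy example shows how magnitude can override direction under L2. The vectors below are invented for illustration and do not come from any real embedding model.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.dist(a, b)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]


query = [1.0, 0.0]
same_direction_large = [10.0, 0.0]   # same direction as the query, large norm
orthogonal_small = [0.0, 1.0]        # unrelated direction, unit norm

# Under raw L2 the orthogonal vector is "closer" than the aligned one.
raw_l2_prefers_unrelated = (
    l2_distance(query, orthogonal_small) < l2_distance(query, same_direction_large)
)
```

After normalizing both candidates, L2 and cosine agree on the ordering, which is why the fix is either normalization or a cosine metric.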
Retrieval without deduplication. A 50-page PDF is chunked into 200 chunks with overlap. A query about Chapter 3 retrieves 8 chunks from Chapter 3, all with high similarity. The context assembly includes all 8, consuming most of the context window with near-duplicate text. Deduplicate by source document ID in the context assembly step: keep only the highest-scoring chunk per document for the initial result set.
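The deduplication step described above can be sketched as follows, assuming each hit is a `(doc_id, chunk_id, score)` tuple where a higher score means a better match. The tuple layout and function name are illustrative.

```python
def dedupe_by_document(hits):
    """Keep only the highest-scoring chunk per source document.

    hits: iterable of (doc_id, chunk_id, score); higher score is better.
    Returns the surviving hits ordered by score, best first.
    """
    best = {}
    for doc_id, chunk_id, score in hits:
        if doc_id not in best or score > best[doc_id][2]:
            best[doc_id] = (doc_id, chunk_id, score)
    return sorted(best.values(), key=lambda hit: hit[2], reverse=True)
```

For example, four hits of which three come from the same PDF collapse to two entries, freeing context window space for other sources.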
Stale index for high-volatility content. A pricing FAQ is updated whenever product pricing changes, roughly weekly. The vector index is refreshed monthly. For up to a month after a pricing change, the RAG system retrieves and presents the old price with confidence. Set incremental index updates for pricing content to run within 1 hour of source document change.
Reranker on the full index. A developer adds a cross-encoder reranker for quality improvement but configures it to rerank all 10,000 documents for each query. Reranking 10,000 pairs takes 30-60 seconds. The application becomes unusable. Cross-encoders are accurate but slow; use them only on the top-k candidates from the approximate vector search (k = 20-50), not on the full corpus.
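The two-stage pattern can be sketched as below. Both `vector_search` and `rerank_score` are hypothetical callables supplied by the caller (an approximate nearest-neighbor lookup and a cross-encoder scoring function, respectively); no specific library API is implied.

```python
def retrieve_then_rerank(query, vector_search, rerank_score,
                         candidate_k=50, final_k=5):
    """Rerank only the top candidates from vector search, not the full corpus.

    vector_search(query, k) -> list of documents (cheap, approximate stage).
    rerank_score(query, doc) -> float, higher is better (expensive stage).
    """
    candidates = vector_search(query, candidate_k)
    # The expensive scorer runs on at most candidate_k pairs per query.
    scored = [(rerank_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]
```

With `candidate_k` in the 20-50 range suggested above, the reranker cost per query is bounded by the candidate count rather than the corpus size.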
Produce a RAG pipeline review report with:
ef_search/nprobe value
npx claudepluginhub fluxonlab/skillry --plugin skillry-ai-and-agent-systems