From rag-architect
Expert guidance on document chunking strategies for RAG systems. Use this skill when designing how to split documents for vector embeddings. Activate when: chunking, chunk size, text splitting, document segmentation, overlap, semantic chunking, recursive splitting.
How this skill is triggered — by the user, by Claude, or both
Slash command
/rag-architect:chunking-strategiesThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
**Optimize document splitting for retrieval accuracy and context preservation.**
Optimize document splitting for retrieval accuracy and context preservation.
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separator="\n"
)
chunks = splitter.split_text(document)
Best for: Homogeneous content, quick prototyping Avoid when: Documents have natural boundaries (sections, paragraphs)
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ".", " ", ""]
)
chunks = splitter.split_documents(docs)
Best for: General-purpose text, maintains paragraph integrity Hierarchy: Tries larger separators first, falls back to smaller
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95
)
chunks = splitter.split_text(document)
Best for: When meaning matters more than size Trade-off: Slower, requires embedding calls
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers = [
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_doc)
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.PYTHON,
chunk_size=2000,
chunk_overlap=200
)
chunks = splitter.split_documents(code_docs)
from langchain.text_splitter import HTMLHeaderTextSplitter
splitter = HTMLHeaderTextSplitter(
headers_to_split_on=[("h1", "h1"), ("h2", "h2"), ("h3", "h3")]
)
chunks = splitter.split_text(html_doc)
| Content Type | Recommended Size | Overlap |
|---|---|---|
| Dense technical docs | 500-1000 tokens | 10-20% |
| Conversational/FAQ | 200-500 tokens | 5-10% |
| Legal/contracts | 1000-1500 tokens | 15-20% |
| Code | 1500-2000 tokens | 10-15% |
| Mixed content | 800-1200 tokens | 15% |
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
# Small chunks for retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
store = InMemoryStore()
retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=store,
child_splitter=child_splitter,
parent_splitter=parent_splitter,
)
Why: Small chunks = precise retrieval, large chunks = better context
Always attach metadata to chunks:
for i, chunk in enumerate(chunks):
chunk.metadata.update({
"source": doc.metadata["source"],
"chunk_index": i,
"total_chunks": len(chunks),
"doc_type": detect_doc_type(chunk.page_content),
"has_code": bool(re.search(r'```', chunk.page_content)),
"timestamp": datetime.now().isoformat()
})
npx claudepluginhub latestaiagents/agent-skills --plugin rag-pluginDocument chunking implementations and benchmarking tools for RAG pipelines including fixed-size, semantic, recursive, and sentence-based strategies. Use when implementing document processing, optimizing chunk sizes, comparing chunking approaches, benchmarking retrieval performance, or when user mentions chunking, text splitting, document segmentation, RAG optimization, or chunk evaluation.
Provides chunking strategies for RAG document processing pipelines: fixed-size, semantic, recursive methods with Python examples, document-type recommendations, and best practices.
Generates chunking strategies for RAG systems: 256-1024 token sizes, 10-20% overlaps, semantic boundaries; validates coherence and evaluates precision/recall metrics. For vector DBs and large documents.