Chunking is the process of splitting documents into smaller segments for embedding and retrieval in RAG systems. It is a critical design decision that significantly impacts both retrieval quality and generation accuracy. The fundamental tension: smaller chunks are more precise, retrieving exactly the relevant sentence rather than surrounding irrelevant content, but may lack the context needed for understanding. Larger chunks provide more context but may contain irrelevant information that dilutes relevance scores or confuses the generator.

Several strategies navigate this trade-off. Fixed-size chunking splits documents at character or token boundaries (e.g., 500 tokens with 50-token overlap); it is simple and predictable but can cut sentences or ideas mid-thought. Sentence-based chunking respects linguistic boundaries but produces variable-length chunks. Semantic chunking uses embeddings to identify topic shifts, keeping coherent ideas together. Hierarchical chunking creates multiple granularities: paragraphs for broad retrieval, sentences for precise matching. Document structure-aware chunking respects headings, sections, and formatting, keeping tables and lists intact. Overlap between chunks ensures that information split across boundaries remains retrievable.

The optimal strategy depends on document type, query patterns, and the embedding model's characteristics. Dense documents like technical manuals may need smaller chunks; narrative content may benefit from larger ones. Chunking decisions compound through the RAG pipeline: poor chunking creates retrieval failures that no amount of generation quality can compensate for.
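The fixed-size strategy with overlap can be sketched as a sliding window over a token sequence. The snippet below is a minimal illustration, not a production splitter: it treats whitespace-separated words as "tokens" for simplicity (real systems typically use the embedding model's own tokenizer), and the `chunk_fixed` helper and its parameter defaults are hypothetical names chosen for this example.

```python
def chunk_fixed(tokens, size=500, overlap=50):
    """Split a token list into fixed-size chunks, where consecutive
    chunks share `overlap` tokens so content split across a boundary
    remains retrievable from at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    i = 0
    while i < len(tokens):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):
            break  # final chunk reached; avoid emitting a redundant tail
        i += size - overlap  # advance by the stride, keeping the overlap
    return chunks


# Usage: naive whitespace tokenization, small sizes for illustration.
text = "Chunking splits documents into smaller segments for retrieval."
words = text.split()
for chunk in chunk_fixed(words, size=4, overlap=1):
    print(" ".join(chunk))
```

Note the stride is `size - overlap`, so each boundary token pair appears in two adjacent chunks; raising `overlap` trades index size for robustness against mid-idea splits.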