veda.ng

Document Retrieval

Document retrieval finds relevant documents from a corpus given a query, forming the foundation of search engines and retrieval-augmented generation (RAG) systems.

Traditional methods like TF-IDF and BM25 use lexical matching: documents containing query terms score higher, with weights for term rarity and normalization for document length. These methods are fast and interpretable, but they miss semantic matches where different words express the same meaning.

Dense retrieval embeds queries and documents into the same vector space and retrieves the documents whose embeddings are nearest to the query embedding. This captures semantic similarity: a query about 'machine learning' retrieves documents about 'ML' and 'neural networks' even without exact word matches. The embedding model is trained on query-document relevance data, learning to place relevant documents close to their queries. Hybrid retrieval combines lexical and dense methods, often outperforming either alone.

The retrieval pipeline typically involves indexing (pre-computing document embeddings), candidate generation (fast approximate nearest neighbor search), and optionally re-ranking. Evaluation uses metrics like NDCG, MRR, and Recall@k, which measure how well relevant documents rank. Modern RAG systems depend on high-quality retrieval: if the right documents aren't retrieved, the generator can't produce good answers.
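The lexical scoring described above can be sketched as a minimal BM25 implementation in pure Python. The function name `bm25_scores` and the toy documents are illustrative; the parameter defaults `k1=1.5` and `b=0.75` are common choices rather than values taken from this text:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query with BM25.

    Term frequency is damped by k1 and normalized by document length
    via b; rarer terms receive a higher IDF weight.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # df[t]: number of documents containing term t
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

A document sharing no terms with the query scores exactly zero, which illustrates the lexical-matching limitation discussed above.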
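Dense retrieval reduces, at query time, to a nearest-neighbor search over embedding vectors. A brute-force sketch using cosine similarity follows; the two-dimensional toy vectors stand in for real embedding-model outputs, and production systems would use an approximate nearest-neighbor index instead of this exhaustive scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dense_retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents nearest the query embedding."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

With `query_vec = [1.0, 0.0]` and documents `[[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]`, the first and third documents are returned: they point in roughly the query's direction even though none matches it exactly, mirroring how semantically related documents are retrieved without exact word overlap.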
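Hybrid retrieval needs a rule for merging the lexical and dense ranked lists. One widely used option (an assumption here, not something the text prescribes) is reciprocal rank fusion, where each list contributes a score that decays with rank:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids.

    Each list contributes 1 / (k + rank) for every document it ranks;
    k=60 is the constant commonly used in practice.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, RRF sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.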
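The evaluation metrics named above can each be computed in a few lines. The sketch below assumes binary relevance judgments (a set of relevant doc ids) and a non-empty relevant set; graded-relevance NDCG generalizes the same formula:

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for i, doc in enumerate(ranked):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal
```

Recall@k rewards getting relevant documents anywhere in the top k, MRR rewards ranking the first one early, and NDCG rewards ranking all of them early with a logarithmic position discount.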