veda.ng

Re-ranking is a two-stage retrieval approach: an initial fast retriever generates candidate documents, then a more powerful but slower model re-scores and reorders them by relevance. This greatly improves search quality without the cost of running expensive models over an entire corpus.

The first stage (BM25, dense retrieval, or a hybrid) runs efficiently over millions or billions of documents and returns the top 100-1000 candidates. The re-ranker, typically a cross-encoder transformer, then scores each candidate by attending jointly to the query and document. Cross-encoders are more accurate than bi-encoders because they model fine-grained query-document interactions, but they are too slow for first-stage retrieval over large corpora. By limiting re-ranking to candidates from fast retrieval, systems achieve both coverage and precision.

The re-ranker outputs a relevance score for each query-document pair, enabling rankings that account for nuance, context, and semantic matching. Modern re-rankers such as Cohere's Rerank, BGE Reranker, and cross-encoder models are trained on relevance judgments. Re-ranking is especially valuable in RAG pipelines, where providing the LLM with the most relevant chunks directly impacts response quality. This retrieve-then-rerank approach is standard in production search systems.
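The two-stage flow can be sketched in a few lines. This is a minimal, self-contained toy: unigram term overlap stands in for BM25, and a bigram-overlap `score_fn` stands in for a cross-encoder forward pass (in practice you would call a real model such as BGE Reranker or Cohere's Rerank here). All function names are illustrative, not from any particular library.

```python
def first_stage_retrieve(query, corpus, k=100):
    # Fast first stage: unigram term overlap, a crude stand-in for BM25.
    q_terms = set(query.lower().split())
    scored = sorted(
        ((len(q_terms & set(doc.lower().split())), i) for i, doc in enumerate(corpus)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]

def rerank(query, corpus, candidate_ids, score_fn):
    # Slow second stage: re-score ONLY the shortlisted candidates.
    # score_fn(query, doc) -> float stands in for a cross-encoder forward pass.
    rescored = sorted(((score_fn(query, corpus[i]), i) for i in candidate_ids), reverse=True)
    return [i for _, i in rescored]

def bigram_overlap(query, doc):
    # Hypothetical "re-ranker": rewards phrase-level matches that a
    # bag-of-words first stage cannot distinguish.
    def bigrams(text):
        tokens = text.lower().split()
        return set(zip(tokens, tokens[1:]))
    return len(bigrams(query) & bigrams(doc))

corpus = [
    "learn python programming language basics",      # exact phrase match
    "the python snake is a large reptile",           # topical mismatch
    "language history and programming with python",  # same terms, no phrases
]
query = "python programming language"

candidates = first_stage_retrieve(query, corpus, k=3)  # bag-of-words ranks doc 2 first
final = rerank(query, corpus, candidates, bigram_overlap)  # re-ranker promotes doc 0
```

The first stage cannot tell documents 0 and 2 apart (both share all three query terms), while the second-stage scorer, which sees the query and document jointly, promotes the document containing the actual phrase. Real cross-encoders make the same kind of distinction, only learned from relevance judgments rather than hand-coded.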