Context loss is a well-known challenge in traditional Retrieval-Augmented Generation (RAG) systems, stemming from the necessary practice of splitting large documents into smaller chunks for efficient processing. Once a document is divided, individual chunks often lack sufficient contextual information, making it difficult for the retrieval system to identify and use relevant information effectively. For example, a chunk mentioning “the company’s revenue” without specifying which company or timeframe, or a segment stating “the new policy changes” without indicating which policy or department, becomes practically meaningless on its own.
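To make the failure mode concrete, here is a minimal sketch of naive fixed-size chunking. The document text and chunk size are illustrative, but the effect is general: once the text is split, later chunks lose the entities that anchor their meaning.

```python
def chunk_text(text: str, chunk_size: int = 120) -> list[str]:
    """Split text into fixed-size character chunks (no overlap)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

document = (
    "ACME Corp released its Q2 2023 financial results on Tuesday. "
    "The company's revenue grew 3% over the previous quarter, driven by "
    "strong demand in its cloud division. Management raised full-year guidance."
)

chunks = chunk_text(document)
# Later chunks no longer mention 'ACME Corp' or 'Q2 2023', so a retriever
# handling a query like 'ACME Q2 revenue growth' may miss them entirely.
for i, c in enumerate(chunks):
    print(i, repr(c))
```

Sentence-aware splitters and chunk overlap soften this, but any splitter eventually separates a statement from the entities it refers to.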
To address the issue of context loss in RAG systems, one can try several approaches, though each has its limitations:
- Using Semantic Embeddings: Leveraging embeddings to capture the meanings and relationships between words and phrases. However, this method may fail to retrieve exact matches for specific queries, especially with unique identifiers or technical terms.
- Applying TF-IDF and BM25 Ranking Functions: Utilizing TF-IDF to measure word importance and BM25 as a ranking function for information retrieval. The downside is that these methods focus on lexical matching and may not capture semantic relationships, leading to incomplete retrieval of relevant information.
- Combining Embeddings with Lexical Matching: Integrating semantic similarity with lexical matching to enhance retrieval accuracy. Despite this, it still suffers from context loss due to document chunking, as individual chunks may lack necessary context.
- Adding Summaries to Chunks: Prepending a general summary to each chunk to provide additional context. This offers limited improvements since generic summaries may not supply sufficient specific context for each chunk.
- Hypothetical Document Embedding: Generating a hypothetical document that answers the query and embedding it to guide retrieval. This approach has not been thoroughly evaluated, so its practical effectiveness remains uncertain.
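The third approach above, combining semantic and lexical signals, is often implemented by merging the two ranked lists. A common recipe is reciprocal rank fusion; the sketch below uses toy rankings, where a real system would obtain one list from an embedding index and one from BM25.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked highly by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc3", "doc1", "doc2"]   # ranking from embedding similarity
lexical = ["doc3", "doc4", "doc1"]    # ranking from BM25 keyword match
fused = reciprocal_rank_fusion([semantic, lexical])
print(fused)
```

Fusion improves recall over either retriever alone, but as noted above it does nothing about chunks that arrive stripped of their context.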
While these methods aim to reduce context loss, they often fall short in fully capturing the meaning and detailed context needed for accurate information retrieval. As a result, splitting documents into smaller chunks continues to be a major challenge, making it difficult to locate and use the right information from large knowledge bases effectively.

Anthropic recently introduced Contextual Retrieval, a method designed to enhance the retrieval process in RAG systems by preserving and leveraging contextual information for each chunk of a document. This approach involves generating and prepending chunk-specific context to each chunk before embedding and indexing, enabling more accurate retrieval of relevant information.
Contextual Retrieval comprises two main components: Contextual Embeddings and Contextual BM25. Contextual Embeddings aim to create embeddings that incorporate both the content of the chunk and its context within the original document. This is achieved by using a large language model (LLM), such as Claude, to generate concise, chunk-specific context by prompting the LLM with the entire document and specifying the chunk. The generated context provides an explanation or summary that situates the chunk within the broader document. This context is then prepended to the original chunk, resulting in a contextualized chunk that contains both the context and the content. Semantic embeddings are created for these contextualized chunks using an embedding model, capturing both semantic meaning and contextual relationships.
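The contextualization step can be sketched as follows. The prompt wording loosely resembles the template in Anthropic's write-up, but the exact phrasing and the `generate_context` stub are illustrative; in practice that function would call an LLM such as Claude with the full document and the target chunk.

```python
# Illustrative prompt template for situating one chunk within a document.
CONTEXT_PROMPT = """\
<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short, succinct context that situates this chunk within the
overall document, for the purposes of improving search retrieval of the chunk."""

def generate_context(document: str, chunk: str) -> str:
    """Stub standing in for an LLM call; returns a canned context string."""
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    assert chunk in prompt  # the chunk is embedded in the prompt
    return "This chunk is from ACME Corp's Q2 2023 earnings report."

def contextualize(document: str, chunk: str) -> str:
    """Prepend the generated chunk-specific context to the chunk itself."""
    return generate_context(document, chunk) + " " + chunk

document = "ACME Corp Q2 2023 earnings report. The company's revenue grew 3%."
chunk = "The company's revenue grew 3%."
print(contextualize(document, chunk))
```

The contextualized string, not the bare chunk, is what gets embedded and indexed, so both the dense and lexical indexes see the disambiguating context.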

The second component, Contextual BM25, aims to improve lexical matching by indexing the contextualized chunks using TF-IDF and BM25 ranking functions. TF-IDF encodings are generated for the contextualized chunks, and a BM25 index is built using these encodings. At query time, BM25 is used to perform lexical search on the contextualized chunks, enhancing the retrieval of exact matches and important terms.
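To show what the lexical side looks like, here is a minimal, self-contained BM25 index over already-contextualized chunks. The `k1` and `b` values are common defaults; a production system would use a tuned library implementation rather than this sketch.

```python
import math
from collections import Counter

class BM25Index:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [d.lower().split() for d in docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tfs = [Counter(d) for d in self.docs]
        # Document frequency per term, then smoothed inverse document frequency.
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in df.items()}

    def score(self, query: str, index: int) -> float:
        tf, dl = self.tfs[index], len(self.docs[index])
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            num = tf[term] * (self.k1 + 1)
            den = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += self.idf.get(term, 0.0) * num / den
        return s

    def search(self, query: str, top_k: int = 2) -> list[int]:
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:top_k]

chunks = [
    "ACME Corp Q2 2023 report: the company's revenue grew 3% over Q1.",
    "Policy update: the HR department revised its remote-work policy.",
    "ACME Corp hired a new CFO in Q2 2023.",
]
idx = BM25Index(chunks)
print(idx.search("ACME revenue growth"))
```

Because the prepended context injects terms like the company name into every chunk, exact-match queries that would previously have missed a bare chunk now score against it.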
Prompt caching can be used in conjunction with Contextual Retrieval to improve efficiency and reduce costs, though it is not an inherent part of the method itself. Generating chunk-specific context requires sending the full document to the LLM once per chunk, which quickly becomes expensive for large documents. With prompt caching, the document is cached after the first request and reused across the remaining per-chunk calls, so the system avoids redundant processing and the added cost of contextualization drops substantially.
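The sketch below shows how per-chunk contextualization requests might mark the shared document for caching. The field names follow Anthropic's Messages API (`cache_control` blocks of type `"ephemeral"`), but the model identifier and exact payload shape are assumptions to verify against the current API documentation.

```python
def build_request(document: str, chunk: str) -> dict:
    """Build a request payload that marks the large, shared document as cacheable."""
    return {
        "model": "claude-3-5-sonnet-latest",  # hypothetical model id
        "max_tokens": 200,
        "system": [
            {
                "type": "text",
                "text": f"<document>{document}</document>",
                # Cached on the first request, then reused by every later one.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": f"Situate this chunk within the document: {chunk}",
            }
        ],
    }

doc = "A long source document shared by every chunk of the same file."
requests = [build_request(doc, c) for c in ["chunk one", "chunk two"]]
# Only the chunk-specific user message differs between requests; the document
# block is byte-identical, which is what lets the provider serve it from cache.
assert requests[0]["system"] == requests[1]["system"]
```

The design point is simply that the expensive, repeated part of the prompt (the whole document) is identical across chunks, while the cheap part (the chunk) varies.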
Experiments conducted by Anthropic across various domains—such as codebases, fiction, and scientific literature—show that Contextual Retrieval significantly enhances retrieval accuracy. Using the top-20-chunk retrieval failure rate as a performance metric (which measures the proportion of times relevant chunks are not retrieved within the top 20 results), they observed a reduction of approximately 35% with Contextual Embeddings alone. When combined with Contextual BM25, the failure rate decreased by about 49%, and with the addition of reranking steps, a total reduction of around 67% was achieved. These results indicate that Contextual Retrieval effectively overcomes context loss, leading to more accurate and relevant information retrieval from large knowledge bases.

Limitations, Challenges, and the Road Ahead
While Contextual Retrieval offers significant improvements in addressing context loss, it introduces some limitations. Extensive use of LLMs for chunking and indexing raises costs and complexity, while generating context for each chunk significantly increases preprocessing time and storage needs. Dependence on advanced language models like Claude may entail additional costs or access restrictions. Fine-tuning prompts and adjusting parameters require manual effort, and the added complexity of integrating multiple components can pose implementation challenges. Additionally, the system will likely experience increased latency and computational costs during query time due to reranking steps, which can impact real-time applications.
To mitigate these limitations, teams should focus on optimization and experimentation. Exploring different chunk sizes, overlaps, and prompting techniques can enhance performance while reducing overhead. Developing domain-specific contextualizer prompts and fine-tuning embedding models can improve retrieval accuracy in specific applications. Efficiency improvements like batching, caching, and parallel processing can help manage computational demands. Evaluating the system across diverse datasets will provide insights into its effectiveness, and minimizing latency and cost in reranking will make it more suitable for time-sensitive applications. Establishing best practices and scalable architectures will facilitate integration and adoption in various environments.
Related Content
- GraphRAG: Design Patterns, Challenges, Recommendations
- Advancing RAG: Best Practices and Evaluation Frameworks
- Best Practices in Retrieval Augmented Generation
