Introduction
In this chapter, we dive into several intriguing techniques within the Advanced RAG landscape that didn’t warrant a full chapter but are still worth highlighting. These methods, ranging from contextual retrieval to innovative embedding strategies, push the boundaries of traditional information retrieval. By providing a concise overview, our goal is to spark curiosity and offer a glimpse into the potential of these approaches, setting the stage for deeper exploration in future courses.
Contextual retrieval by Anthropic
Contextual Retrieval enhances traditional retrieval by adding chunk-specific context using an LLM. Each chunk is enriched with a concise explanation that situates it within the overall document, improving search relevance. For example:
Original chunk:
"The company's revenue grew by 3% over the previous quarter.”
Enriched chunk:
"This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter.”
I know what you are thinking: enriching every chunk sounds expensive. But thanks to prompt caching, Anthropic claims that “the one-time cost to generate contextualized chunks is $1.02 per million document tokens”, which actually sounds quite manageable.
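To make this concrete, here is a minimal sketch of how such enrichment could be implemented with the Anthropic Python SDK. The prompt wording and model name are illustrative assumptions rather than Anthropic's exact implementation; in practice, prompt caching of the shared document prefix is what keeps the per-chunk cost low.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short, succinct context that situates this chunk within the overall document
to improve search retrieval of the chunk. Answer only with the context."""

def contextualize_chunk(document: str, chunk: str) -> str:
    # One LLM call per chunk; the generated context is prepended to the original text.
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # illustrative model choice
        max_tokens=150,
        messages=[{"role": "user", "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)}],
    )
    return f"{response.content[0].text.strip()} {chunk}"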
From my perspective, this is the main takeaway. In the technical blog post, Anthropic adds Hybrid search on top of this, contextualizing both the dense embeddings and the sparse BM25 representations (we explained Hybrid search in [LINK]). These are the results:
Anthropic suggests going one step further and adding rerankers (which we explained in chapter [LINK]) which show even more promising results:
If you put all of this together, the final process diagram may look like this:
Recommendations for Small Datasets
For smaller datasets (e.g., under 500 pages), Anthropic recommends putting the entire dataset into the model’s prompt, bypassing the need for retrieval altogether. I was again skeptical about the cost, but Anthropic claims that prompt caching significantly helps here as well.
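As a minimal sketch of that “no retrieval” setup (again using the Anthropic SDK with an illustrative model name; prompt caching of the static document block is what keeps repeated questions cheap):
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def answer_over_full_corpus(documents: list[str], question: str) -> str:
    # Skip retrieval entirely: place every document into the prompt, then ask the question.
    corpus = "\n\n".join(f"<document>\n{doc}\n</document>" for doc in documents)
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # illustrative model choice
        max_tokens=500,
        messages=[{"role": "user", "content": f"{corpus}\n\nQuestion: {question}"}],
    )
    return response.content[0].text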
Anyway, the technical blog post from Anthropic is well written and full of insights; give it a read: AnthropicAI Introducing Contextual Retrieval.
Contextual chunk headers
If Anthropic’s Contextual Retrieval feels like a high-end solution, then “Contextual Chunk Headers” might be its budget-friendly counterpart. Instead of enriching every chunk with precise LLM-generated context, this technique involves prepending higher-level contextual information, such as document titles or section headings, to groups of chunks (e.g., 10 chunks at once). This provides a more lightweight way to situate chunks in context without processing each individually through an LLM.
For example:
Chunk: "The company’s revenue grew by 3% over the previous quarter.”
With Header: "ACME Corp Q2 2023 Financial Report: The company’s revenue grew by 3% over the previous quarter.”
This approach offers an efficient trade-off between performance and cost. Although less granular than Contextual Retrieval, it can still significantly improve retrieval results.
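A minimal sketch of the idea, assuming the header string (e.g., document title plus section heading) is already known for each group of chunks:
def add_contextual_header(header: str, chunks: list[str]) -> list[str]:
    # Prepend the same higher-level context to every chunk in the group; no LLM call needed.
    return [f"{header}: {chunk}" for chunk in chunks]

chunks = ["The company's revenue grew by 3% over the previous quarter."]
print(add_contextual_header("ACME Corp Q2 2023 Financial Report", chunks))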
The full implementation can be found here: GitHub RAG_Techniques/all_rag_techniques/contextual_chunk_headers.ipynb at main · NirDiamant/RAG_Techniques
Contextual Document Embeddings
The paper "Contextual Document Embeddings" improves document embeddings by incorporating information from neighboring documents into the embedding process, creating more robust and transferable representations. Traditional methods treat documents independently, but this approach adds contextual awareness by modifying the training process to include nearby documents and designing an architecture that explicitly encodes context. While fine-tuning is done on specific data, the model focuses on learning general principles of contextual relationships rather than memorizing domain-specific details. This allows it to generalize effectively to out-of-domain scenarios, improving performance on tasks like search and retrieval without relying on complex techniques like hard negative mining or large batch sizes.
By the way, in case you have not noticed, this is the third technique in a row whose name starts with “Contextual”. I would say that is quite telling of the direction RAG improvements are taking. Traditional splitting based on tokens is rather arbitrary and lacks context. This can be improved by splitting on logical units (e.g., sentences or markdown sections), but the chunk can still miss the context of the chapter or section it comes from.
ColBERT: Late Interaction for Efficient Retrieval
ColBERT (Contextualized Late Interaction over BERT) is an advanced model designed for efficient and accurate information retrieval. Its key innovation is the late interaction mechanism, which enables token-level comparisons between queries and documents. This approach captures fine-grained semantic details without significantly increasing query-time computational costs.
In the Rerankers chapter [LINK], we discussed ColBERT's role in reranking. Here, we will explore its broader capabilities, particularly as a retriever and an embedding-based model.
Further resources:
- arXiv.org ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- Jupyter: Google Colab
How ColBERT Differs from Traditional RAG
- Traditional Retrieval in RAG:
  - Documents are encoded offline into a single dense vector each, compressing the entire chunk's semantics into one embedding.
  - Queries are likewise encoded into a single embedding at runtime.
  - At query time, the query embedding is compared to document embeddings using similarity measures like cosine or dot product.
  - While efficient, this approach may miss finer details since it aggregates the document's semantics into a single vector.
- ColBERT's Approach:
  - Documents are encoded into multiple token-level embeddings offline, capturing granular semantic information for each token.
  - Queries are similarly encoded into token-level embeddings at runtime.
  - During retrieval, a late interaction mechanism compares each query token embedding with the document's token embeddings, allowing for fine-grained semantic matching.
  - This mechanism uses MaxSim: for every query token, it takes the maximum similarity over all document tokens and sums these maxima to compute the final relevance score.
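To make the MaxSim operator concrete, here is a minimal PyTorch sketch of the late interaction score, assuming both inputs are token-level embedding matrices that are already L2-normalized:
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim),
    # both assumed L2-normalized per token so the dot product equals cosine similarity.
    sim = query_emb @ doc_emb.T            # pairwise token-to-token similarities
    return sim.max(dim=1).values.sum()     # best document token per query token, summed

A higher score means the document matches the query better; ranking documents by this score is exactly what the late interaction step does.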
Architecture Overview
The ColBERT architecture comprises three main components:
- Query Encoder:
  - Encodes queries into a bag of contextual token embeddings.
  - A special [Q] token is prepended to each query, differentiating it from documents.
  - Padding or truncation ensures a uniform token count, enhancing performance.
- Document Encoder:
  - Encodes documents into token-level embeddings while filtering out unnecessary tokens like punctuation.
  - A [D] token is prepended to each document to distinguish it during the encoding process.
- Late Interaction Module:
  - Computes pairwise similarities between query and document token embeddings and aggregates them to rank documents.
This modular approach makes ColBERT scalable for both reranking and end-to-end retrieval scenarios.
Applications and Use Cases
ColBERT can function as:
- A retriever: Directly retrieving top-k relevant documents from a large corpus by generating token-level embeddings and performing fine-grained matching using its late interaction mechanism.
- A reranker: Refining an initial set of candidate documents retrieved using simpler methods like BM25 by applying detailed token-level matching to improve ranking precision.
- A token-level embedding generator: While not designed for general-purpose embeddings, ColBERT produces token-level embeddings optimized for retrieval, which can be stored and queried directly (as we do in the example below).
Example
Jupyter: Google Colab
To use ColBERT, we can leverage the colbert-ai library. We’ll start by installing it:
!pip install -U colbert-ai torch
In this snippet, we are loading a pretrained ColBERT model checkpoint for use in information retrieval tasks. Here’s what each part does:
- Importing Modules:
  - Checkpoint is a utility from ColBERT that allows loading and managing pretrained model checkpoints.
  - ColBERTConfig provides configuration options for the ColBERT model, such as directory paths and other settings.
- Initializing the Checkpoint:
  - "colbert-ir/colbertv2.0" specifies the name of the pretrained checkpoint to load. This could be a path to a local model file or a remote model identifier, depending on your setup.
  - ColBERTConfig(root="experiments") sets the root directory where model-related experiments will be saved or accessed. This is useful for organizing logs, results, and intermediate files.
- Purpose:
  - The ckpt object now contains the pretrained ColBERT model and its configuration, ready to be used for tasks like ranking or embedding documents in information retrieval pipelines.

This step sets up the foundation for using ColBERT's capabilities in semantic search and ranking tasks efficiently.
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

# Load the pretrained ColBERTv2 checkpoint; "experiments" is where run artifacts are stored.
ckpt = Checkpoint(
    "colbert-ir/colbertv2.0", colbert_config=ColBERTConfig(root="experiments")
)
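As a quick illustrative check (assuming, as the dataset pipeline below also does, that queryFromText and docFromText return one token-embedding matrix per input text), we can embed a query and two short documents and score them with the maxsim_score sketch from earlier:
# Embed one query and two tiny documents as token-level matrices, then rank them.
q = ckpt.queryFromText(["How does ColBERT score documents?"])[0]
docs = ckpt.docFromText([
    "ColBERT relies on late interaction between query and document tokens.",
    "The weather in Paris is mild in spring.",
])
print([float(maxsim_score(q, d)) for d in docs])  # the first document should score higher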
In this example, we copy, structure, and process a medical dataset to generate embeddings for text documents using a pretrained ColBERT model.
- Dataset Copy and Setup:
  - The deeplake.copy() function duplicates the medical_dataset from the Activeloop repository into your organization’s workspace.
  - deeplake.open() then opens the dataset for modifications, allowing us to add or manipulate columns.
- Adding an Embedding Column:
  - A new column named embedding is added to the dataset with the data type types.Array(types.Float32(), dimensions=2), preparing it to store 2D embeddings generated from the medical text.
- Text Extraction:
  - The text data from the medical dataset is extracted into a list (medical_text) by iterating over the dataset and pulling the text field for each entry.
- Batch Embedding Generation:
  - The text data is processed in batches of 1,000 entries using the ColBERT model (ckpt.docFromText), which generates embeddings for each batch.
  - The embeddings are appended to a list (all_vectors) for later use.
- Efficient Processing:
  - Batching ensures efficient processing, especially when dealing with large datasets, as it prevents memory overload and speeds up embedding generation.
deeplake.copy(f"al://activeloop/medical_dataset", f"al://{org_id}/medical_dataset")
medical_dataset = deeplake.open(f"al://{org_id}/medical_dataset")
medical_dataset.summary()
medical_dataset.add_column(name="embedding", dtype=types.Array(types.Float32(),dimensions=2))
medical_dataset.commit()
all_vectors = []
medical_text = [el["text"] for el in medical_dataset]
for i in range(0, len(medical_text), 1000):
chunk = medical_text[i:i+1000]
vectors_chunk = ckpt.docFromText(chunk)
all_vectors.extend(vectors_chunk)
list_of_embeddings = [vector.tolist() for vector in all_vectors]
len(list_of_embeddings)
Output:
19719
We convert the embeddings into Python lists for compatibility with Deep Lake storage and check the total number of embeddings. Each embedding from all_vectors is transformed using .tolist(), creating list_of_embeddings, and len(list_of_embeddings) confirms the total count matches the number of processed text entries.
medical_dataset["embedding"][0:len(list_of_embeddings)] = list_of_embeddings
medical_dataset.commit()
This code performs a semantic search using ColBERT embeddings, leveraging the MaxSim operator, executed directly in the cloud (as described in the index-on-the-lake section), for efficient similarity computations.
- Query Embedding: The query is embedded with ckpt.queryFromText and converted into a format compatible with TQL queries.
- TQL Query Construction: The maxsim function compares the query embedding to dataset embeddings, ranking results by similarity and limiting them to the top n_res matches.
- Query Execution: medical_dataset.query retrieves the most relevant entries based on semantic similarity.
query_vectors = ckpt.queryFromText(["What were the key risk factors for the development of posthemorrhagic/postoperative epilepsy in the study?"])[0]
query_vectors = query_vectors.tolist()
n_res = 3
# Render the query's token embeddings as a TQL array-of-arrays literal
q_substrs = [f"ARRAY[{','.join(str(x) for x in sq)}]" for sq in query_vectors]
q_str = f"ARRAY[{','.join(q_substrs)}]"
# Construct a formatted TQL query that ranks documents by their MaxSim score
tql_colbert = f"""
SELECT *, maxsim({q_str}, embedding) AS score
ORDER BY maxsim({q_str}, embedding) DESC
LIMIT {n_res}
"""
# Execute the query
results = medical_dataset.query(tql_colbert)
Results
for res in results:
    print(f"Text: {res['text']}")
Conclusion
This chapter introduced a mix of notable techniques, each offering a unique perspective on improving retrieval performance, cost-efficiency, and contextual relevance. From Anthropic's advanced contextualization to lightweight methods like chunk headers, and robust architectures like ColBERT, we explored diverse solutions that enrich the RAG toolkit. While briefly covered here, these ideas may inspire deeper dives and practical applications as you continue to explore retrieval strategies.