Introduction
Close your eyes and imagine a world with unlimited time and compute. Are you there? Ok, in this beautiful world, you can just take the smartest LLM available, pass it all the text chunks in your database and let the LLM evaluate which ones contain the answer to the user's query. Sounds nice, right? Unfortunately, we live in a world with limited compute and time, and therefore we need to figure out solutions that are fast but, admittedly, a bit dumb. Cosine similarity is one such fast solution, but it falls far short of capturing the full meaning of the similarity between the user's query and individual text chunks.
Luckily, we can bring at least some parts of that beautiful world into the real one. We can use a fast metric (e.g. cosine similarity) to get a subset of our text chunks - e.g. the 30 most relevant chunks out of the millions in the database. Then we apply a reranker - a model that is far better at evaluating how relevant a chunk is to the query. It can be an LLM, a cross-encoder or another type of model. For now, the exact type does not matter - it is a model that is extremely good at evaluating relevancy (which is what we want) but costly and slow (which is why we do not run it on millions of chunks right from the start). The reranker then narrows the chunks down even further, e.g. from 30 to 5.
The traditional approach would use cosine similarity directly to narrow down millions of chunks to just 5. The difference may seem subtle, but it can have a big impact. Imagine that the chunks containing the answer to the user's query are ranked 1st, 4th and 12th by cosine similarity. With the traditional approach, you get the 1st and 4th chunks, but you miss the 12th one - and maybe that one contained a detail that is crucial for the answer. It is a game of probability - in 90% of cases, cosine similarity does a good enough job, but in the remaining 10%, the reranker noticeably improves the response.
Sold? Let us get into it!
Cross-encoder rerankers
Cross-encoder rerankers analyze query-document pairs jointly, enabling precise relevance scoring. They concatenate the query and document into a single input sequence, allowing the model to process the full interaction between them in a single pass. This joint encoding captures nuanced semantic relationships, making cross-encoders highly effective for small reranking tasks where accuracy is the primary concern. However, their computational overhead makes them less suitable for large-scale retrieval.
Key Characteristics
- Joint Encoding: Cross-encoders process the full interaction between the query and document in a single pass, capturing nuanced semantic relationships.
- Relevance Scoring: The output is a score representing the document's relevance to the query, facilitating fine-grained ranking.
- Model Design: Most cross-encoders are built on transformer architectures fine-tuned on large datasets where relevance labels guide the model to differentiate between relevant and non-relevant query-document pairs.
Advantages
- High Accuracy: Captures fine-grained semantic relationships, making it highly effective for identifying relevant documents.
- Adaptability: Fine-tuning on domain-specific data enhances performance for specialized use cases.
Numerous cross-encoder reranker models are available on platforms like Hugging Face, making them accessible for direct use or further customization. Cross-encoder rerankers are a valuable tool in RAG systems, delivering precise and context-aware relevance scoring to improve retrieval results.
Example code
This is a toy example to demonstrate the process of reranking. In real-world scenarios, sentences or documents would typically be retrieved using traditional vector embedding techniques before applying reranking.
from sentence_transformers import SentenceTransformer, CrossEncoder, util
# Initialize the bi-encoder and cross-encoder models
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Sample corpus of documents
corpus = [
"The capital of France is Paris.",
"The Eiffel Tower is located in Paris.",
"Berlin is the capital of Germany.",
"The Great Wall of China is in Beijing."
]
# Encode the corpus using the bi-encoder
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)
# Define the query
query = "Where is the Eiffel Tower located?"
# Encode the query using the bi-encoder
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
# Retrieve top-k candidates using cosine similarity
top_k = 3
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
# Prepare pairs for cross-encoder reranking
cross_inp = [[query, corpus[hit['corpus_id']]] for hit in hits]
# Rerank using the cross-encoder
cross_scores = cross_encoder.predict(cross_inp)
# Combine scores with corresponding documents
reranked_results = sorted(zip(cross_scores, hits), key=lambda x: x[0], reverse=True)
# Display reranked results
for score, hit in reranked_results:
print(f"Score: {score:.4f} \t Document: {corpus[hit['corpus_id']]}")
Expected output (may vary slightly):
Score: 10.3981 Document: The Eiffel Tower is located in Paris.
Score: -7.0279 Document: The capital of France is Paris.
Score: -9.1305 Document: The Great Wall of China is in Beijing.
LLM rerankers
LLM rerankers utilize the advanced language understanding capabilities of large-scale models to assess and reorder retrieved documents based on their relevance to a given query. Unlike traditional rerankers that may rely on predefined heuristics or simpler models, LLM rerankers can comprehend complex semantic relationships and contextual nuances, leading to more accurate relevance scoring.
Key Characteristics
- Deep Contextual Understanding: LLMs are trained on vast amounts of text data, enabling them to grasp intricate language patterns and contextual meanings, which is crucial for accurately determining the relevance of documents to a query.
- Generative Capabilities: Beyond classification tasks, LLMs can generate human-like text, allowing them to provide explanations or summaries that justify the relevance scores assigned to documents.
- Adaptability: LLM rerankers can be fine-tuned on specific datasets or tasks, enhancing their performance in particular domains or applications.
Advantages
- High Precision: The comprehensive language understanding of LLMs leads to more precise relevance assessments, especially in complex or ambiguous queries.
- Flexibility: LLM rerankers can adapt to various tasks without extensive retraining, making them versatile tools in information retrieval systems.
Considerations
- Computational Resources: Deploying LLM rerankers requires significant computational power, which may not be feasible for all applications.
- Latency: The complexity of LLMs can introduce latency in the reranking process, affecting the responsiveness of the system.
Example Implementations
- RankGPT: This approach leverages GPT-based models to reorder passages by generating permutations that reflect their relevance to the query. It utilizes prompt engineering to guide the model in assessing and ranking the documents.
Code example
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate
import os
# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = ""
# Initialize the LLM with GPT-4o
llm = OpenAI(model="gpt-4o", temperature=0)
# Define the reranking prompt template
rerank_template = PromptTemplate("""
You are an expert at document retrieval and ranking based on relevance.
Rank the following documents in order of their relevance to the given query.
Provide the ranking as a numbered list with the document content.
Query: {query_str}
Documents:
{context_str}
Ranked List:
""")
# Define the query and candidate documents
query = "What are the health benefits of green tea?"
documents = [
"Green tea is rich in antioxidants and provides numerous health benefits such as improved brain function and weight management.",
"Green tea is grown in several regions of the world, including Japan and China, and has cultural significance.",
"Green tea can reduce the risk of heart disease and contains catechins, which have anti-inflammatory properties.",
"Green tea is one of the most consumed beverages worldwide, known for its calming effects."
]
# Format the documents for the context
context_str = "\n".join([f"{i + 1}. {doc}" for i, doc in enumerate(documents)])
# Format the prompt
prompt = rerank_template.format(context_str=context_str, query_str=query)
# Call the LLM to get the reranked results
response = llm.complete(prompt)
print("Reranked Documents:")
print(response)
This is a very simple implementation of LLM-based reranking. A more sophisticated approach, such as RankGPT, includes additional techniques and optimizations behind the scenes for improved reranking quality.
Expected output:
Reranked Documents:
1. Green tea is rich in antioxidants and provides numerous health benefits such as improved brain function and weight management.
2. Green tea can reduce the risk of heart disease and contains catechins, which have anti-inflammatory properties.
3. Green tea is one of the most consumed beverages worldwide, known for its calming effects.
4. Green tea is grown in several regions of the world, including Japan and China, and has cultural significance.
In this example, documents 1 and 3 both directly address the query, and the LLM correctly places them at the top. In RAG this is often sufficient: the goal is not to achieve an exact rank order but to ensure that the correct chunks appear in the top results (e.g., in the top 5 out of 15).
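If you want to quantify this, a simple hit-rate check (sometimes called recall@k) is usually enough. Below is a minimal sketch; the document IDs are made up for illustration and would in practice come from your own ground-truth labels and reranker output.
def hit_rate_at_k(relevant_ids, reranked_ids, k=5):
    # Fraction of ground-truth relevant documents that appear in the top-k reranked results
    top_k = set(reranked_ids[:k])
    return sum(1 for doc_id in relevant_ids if doc_id in top_k) / len(relevant_ids)
# Hypothetical example: documents 1 and 3 are the relevant ones,
# and the reranker returned the ordering [1, 3, 4, 2]
print(hit_rate_at_k(relevant_ids=[1, 3], reranked_ids=[1, 3, 4, 2], k=2))  # 1.0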
ColBERT
ColBERT (Contextualized Late Interaction over BERT) is a retrieval model that can be used both as an embedding model and as a reranker. Its flexibility allows it to handle large-scale retrieval efficiently and also refine results for better precision. In this section, we focus on its reranking use case.
How ColBERT Works as a Reranker
- Independent Encoding: During the reranking process, both the query and candidate chunks are encoded independently into token-level embeddings using a transformer model like BERT.
- Late Interaction: ColBERT compares the token embeddings of the query and document using a MaxSim operation (see the sketch below):
  - Each token in the query is compared to all document tokens to find the maximum similarity.
  - These maximum similarity scores are aggregated to compute the final relevance score.
- Reranking: The computed relevance scores are used to reorder the subset of retrieved documents, ensuring that the most relevant ones are prioritized for downstream tasks.
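To make the MaxSim operation concrete, here is a minimal NumPy sketch of the late-interaction score. It assumes the query and document have already been encoded into L2-normalized token embeddings; the random matrices below are just placeholders for illustration, not actual ColBERT outputs.
import numpy as np
def maxsim_score(query_embs, doc_embs):
    # Similarity matrix: one row per query token, one column per document token
    sim = query_embs @ doc_embs.T
    # MaxSim: take the best-matching document token for each query token, then sum over the query
    return float(sim.max(axis=1).sum())
# Toy example with random, L2-normalized "token embeddings"
# (4 query tokens, 12 document tokens, embedding dimension 128)
rng = np.random.default_rng(0)
query_embs = rng.normal(size=(4, 128))
query_embs /= np.linalg.norm(query_embs, axis=1, keepdims=True)
doc_embs = rng.normal(size=(12, 128))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)
print(maxsim_score(query_embs, doc_embs))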
Comparison to Cross-Encoders
Unlike cross-encoders, which jointly encode the query and document, ColBERT encodes them independently and can leverage precomputed document embeddings for scalability. While cross-encoders provide higher accuracy for small reranking tasks thanks to their full joint encoding, ColBERT's late interaction mechanism offers a practical trade-off: it maintains high accuracy while being faster and more scalable for larger datasets. Both approaches serve different needs depending on the balance between accuracy and efficiency required.
Example code
#!pip install "rerankers[transformers]"
from rerankers import Reranker
# Initialize the ColBERT reranker
reranker = Reranker('colbert-ir/colbertv2.0', model_type='colbert')
# Define your query
query = "What are the health benefits of green tea?"
# Define your candidate documents
documents = [
"Green tea is rich in antioxidants and provides numerous health benefits such as improved brain function and weight management.",
"Green tea is grown in several regions of the world, including Japan and China, and has cultural significance.",
"Green tea can reduce the risk of heart disease and contains catechins, which have anti-inflammatory properties.",
"Green tea is one of the most consumed beverages worldwide, known for its calming effects."
]
# Rerank the documents based on the query
reranked_results = reranker.rank(query, documents)
# Display the reranked results
print("Reranked Results:")
for result in reranked_results:
    # Access the attributes of the Result object
    score = result.score
    doc_text = result.document.text
    print(f"Score: {score:.4f} \t Document: {doc_text}")
Expected output:
Score: 1.2095 Document: Green tea is rich in antioxidants and provides numerous health benefits such as improved brain function and weight management.
Score: 1.1540 Document: Green tea can reduce the risk of heart disease and contains catechins, which have anti-inflammatory properties.
Score: 0.9720 Document: Green tea is one of the most consumed beverages worldwide, known for its calming effects.
Score: 0.8190 Document: Green tea is grown in several regions of the world, including Japan and China, and has cultural significance.
Hybrid Approaches
In many retrieval-augmented generation (RAG) pipelines, hybrid strategies combine different rerankers to balance scalability and precision. For example, ColBERT can act as the first-stage retriever by efficiently narrowing down millions of documents to a manageable subset using precomputed embeddings. These top-k candidates can then be refined with more precise but computationally intensive methods, such as cross-encoders or LLM rerankers.
This approach leverages the strengths of each method: ColBERT’s scalability ensures rapid initial retrieval, while cross-encoders or LLM rerankers provide fine-grained semantic relevance for the final results.
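As a rough illustration of this hand-off, here is a minimal sketch that reuses the models and the green-tea documents from the earlier examples. In a real pipeline, the candidate list would come from a first-stage search over precomputed ColBERT embeddings rather than being reranked in-memory like this.
from rerankers import Reranker
from sentence_transformers import CrossEncoder
query = "What are the health benefits of green tea?"
# Hypothetical candidate set - in practice this comes from your first-stage retriever
candidates = [
    "Green tea is rich in antioxidants and provides numerous health benefits such as improved brain function and weight management.",
    "Green tea is grown in several regions of the world, including Japan and China, and has cultural significance.",
    "Green tea can reduce the risk of heart disease and contains catechins, which have anti-inflammatory properties.",
    "Green tea is one of the most consumed beverages worldwide, known for its calming effects."
]
# Stage 1: ColBERT narrows the candidates down to a smaller set
colbert = Reranker('colbert-ir/colbertv2.0', model_type='colbert')
colbert_results = colbert.rank(query, candidates)
top_candidates = [result.document.text for result in colbert_results][:3]
# Stage 2: a cross-encoder rescores only the surviving candidates
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = cross_encoder.predict([[query, doc] for doc in top_candidates])
final_ranking = sorted(zip(scores, top_candidates), key=lambda x: x[0], reverse=True)
for score, doc in final_ranking:
    print(f"Score: {score:.4f} \t Document: {doc}")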
Comparison
| Reranker Type | Encoding Process | Strengths | Weaknesses | Ideal Use Case |
| --- | --- | --- | --- | --- |
| Cross-Encoder | Joint encoding of query and document | High accuracy; captures nuanced relationships | High computational cost; not scalable for large datasets | Reranking a small number of candidates for precision-critical tasks |
| LLM Reranker | Depends on LLM prompt and retrieval | Deep contextual understanding; flexible for complex queries | Computationally expensive; higher latency | Handling complex or ambiguous queries requiring nuanced understanding |
| ColBERT | Independent token-level encoding with late interaction | Efficient for large datasets; balances scalability and accuracy | Lower precision compared to cross-encoders for small reranking tasks | Reranking large subsets of documents efficiently in RAG pipelines |
But you know what is better than rerankers? Having a RAG system with a high baseline accuracy that minimizes the need for reranking! Luckily, we have such a system ready for you: it is called Deep Memory, and you can read more in Davit Buniatyan's post Use Deep Memory to Boost RAG Apps' Accuracy by up to +22%.
So what?
Lots of explanations and examples, but what to do next? If your application can accommodate slightly higher latency and cost, LLM rerankers often provide the most precise results, especially for complex or nuanced queries. However, for use cases where computational efficiency is a priority, cross-encoders or ColBERT offer effective alternatives. ColBERT, in particular, can serve as both a reranker and an embedding framework.
Ultimately, the best choice depends on your specific use case. We strongly recommend running use-case-specific evaluations to determine the most suitable approach for your system. While LLM rerankers may seem ideal, simpler methods like cross-encoders might suffice for your requirements. For more guidance on designing and running evaluations, refer to the dedicated module on evaluation methods included in this course [LINK].
Conclusion
This chapter covered the role of rerankers in RAG systems, focusing on cross-encoders, LLM rerankers, and ColBERT. We emphasized the importance of using lightweight retrieval methods to narrow down candidates before applying rerankers, ensuring both scalability and relevance.
The main takeaway is that rerankers should operate on a small subset of retrieved chunks, refining results to pass the most relevant information to the LLM for response generation. Choosing the right reranking strategy depends on your system’s specific performance and resource requirements.
Final tip: rerankers shine in use cases where you retrieve chunks from multiple sources, whether because of multiple queries (e.g. the multi-query technique) or multiple indexes (e.g. hybrid search [LINK]).
The next chapter is all about vectors: dense, sparse, and their powerful combination, hybrid search!
Jupyter: Google Colab