Introduction
Hybrid search is a modern retrieval technique that combines two complementary approaches: sparse retrieval, which relies on keyword matching, and dense retrieval, which uses vector embeddings to capture semantic meaning.
- Sparse Retrieval (e.g., BM25): This method matches queries to documents based on exact keyword occurrences. It is efficient, interpretable, and works well for specific terms but struggles with synonyms or broader contextual meaning.
- Dense Retrieval (Vector-Based): Dense retrieval maps queries and documents into high-dimensional vector spaces, capturing their semantic meaning. It excels at understanding synonyms and related concepts but may miss exact keyword matches.
Hybrid search combines these strengths, retrieving results that are both precise and contextually relevant. For instance, it can surface documents containing specific technical terms (sparse retrieval) while also retrieving those semantically aligned with the query, even if they lack the exact keywords (dense retrieval).
How Hybrid Search Works
Hybrid search integrates sparse retrieval (e.g., BM25) with dense vector-based retrieval to leverage the strengths of both methods. Here’s how the process typically works:
Combining BM25 and Vector Search
- Parallel Retrieval:
- A query is sent to both a sparse index (e.g., BM25) and a dense vector index.
- Sparse retrieval identifies documents based on exact keyword matches.
- Dense retrieval finds documents based on semantic similarity in the vector space.
- Merging Results:
- Each method returns a ranked list of candidate documents with associated relevance scores.
- These results are merged into a single candidate set for further processing. In this chapter, we show a simplified approach to merging the results, but in production you may want to add a reranker (covered in the chapter on rerankers [LINK]).
Ranking and Scoring
- Score Normalization:
- Sparse retrieval scores (e.g., BM25) and dense retrieval scores are calculated differently and exist on different scales.
- Scores must be normalized to a comparable range (e.g., using min-max scaling or standardization) to combine them meaningfully.
- Unified Scoring:
- A weighted combination of the normalized scores is used to compute a final relevance score for each document: final_score = α · normalized_sparse_score + (1 − α) · normalized_dense_score (a minimal sketch of this step follows this list).
- The weight α controls the balance between sparse and dense contributions, depending on the use case.
- Final Ranking:
- The documents are sorted based on their final scores, with the top results presented to the user or passed to downstream tasks like reranking or LLM-based generation. See the chapter on rerankers for more details! [LINK]
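As a minimal sketch of the normalization and weighting steps above (the score values below are illustrative placeholders, not output from a real retriever):
import numpy as np
def min_max_normalize(scores):
    # Rescale scores to [0, 1] so sparse and dense scores become comparable
    return (scores - scores.min()) / (scores.max() - scores.min())
# Illustrative raw scores for five documents (placeholders, not real retriever output)
sparse_scores = np.array([0.0, 1.2, 0.0, 0.36, 0.87])    # e.g., BM25 scores
dense_scores = np.array([0.88, 0.89, 0.89, 0.80, 0.90])  # e.g., cosine similarities
alpha = 0.5  # balance between sparse and dense contributions
final_scores = alpha * min_max_normalize(sparse_scores) + (1 - alpha) * min_max_normalize(dense_scores)
print(np.argsort(final_scores)[::-1])  # document indices, best first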
Advantages of Combining Methods
- Improved Recall: Sparse retrieval ensures exact matches are included, while dense retrieval adds semantically related results.
- Balanced Precision: Combining methods ensures that results are both accurate and contextually relevant.
- Flexibility: Adjusting the weighting (α) allows systems to adapt to different query types, prioritizing sparse or dense retrieval as needed.
Types of Indexes for Search
Efficient search relies on indexing methods that organize and retrieve data effectively. Different types of indexes are suited for sparse (keyword-based) and dense (vector-based) retrieval, each with distinct advantages and limitations.
Sparse Indexes
Sparse indexes rely on exact keyword matches and typically score documents with functions like BM25 or TF-IDF for traditional text search. They represent text in a high-dimensional vector space where each dimension corresponds to a term in the vocabulary, and most entries are zero.
Advantages
- Lightweight and Efficient: Sparse indexes are computationally inexpensive to build and query.
- Interpretable: Scores (e.g., BM25 or TF-IDF) are straightforward to understand, making it easy to explain why a document is ranked higher.
- Keyword Precision: Highly effective for queries with specific terms, such as proper nouns or exact phrases.
Limitations
- Lack of Semantic Understanding: Sparse methods treat terms as independent and struggle with synonyms or related concepts (e.g., "revenue" vs. "income").
- Contextual Nuance: They often fail to capture the broader meaning of a query, leading to less relevant results for complex queries.
Example of Sparse Vectors
Sparse vectors are high-dimensional and mostly empty (sparse), with non-zero values representing the weight of terms (e.g., BM25 or TF-IDF scores).
- Query: "health benefits of green tea"
- Sparse Vector (Illustrative):
[antioxidants: 0, benefits: 1, green: 1, cat: 0, health: 1, of: 1, tea: 1, ...]
In this vector:
- Non-zero values correspond to terms in the query that are present in the document corpus.
- Scores indicate the term's importance for relevance, as determined by methods like BM25 or TF-IDF.
Code Example (BM25)
Here’s how sparse vectors are generated using BM25 scoring for a query against a set of documents:
# pip install bm25s
import bm25s
# Create your corpus here
corpus = [
    "Auto maintenance shops in urban areas are highly rated.",
    "Find the best car repair services near you.",
    "Mechanics for vehicle maintenance are available downtown.",
    "Bike repair stations are popular in rural areas.",
    "City-based automotive services offer competitive pricing.",
]
# Create the BM25 model and index the corpus
retriever = bm25s.BM25(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))
# Print the vocabulary
vocab = retriever.vocab_dict
print("Vocabulary:")
print(vocab)
# Print the scores - see below for explanation
print("Scores:")
print(retriever.scores)
# Query the corpus and get top-k results
query = "car repair services in the city"
results, scores = retriever.retrieve(bm25s.tokenize(query), k=3)
# Display the top results
print("\nTop-k Results:")
for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")
Expected Output:
Vocabulary:
{'auto': 0, 'maintenance': 1, 'shops': 2, 'urban': 3, 'areas': 4, 'highly': 5, 'rated': 6, 'find': 7, 'best': 8, 'car': 9, 'repair': 10, 'services': 11, 'near': 12, 'you': 13, 'mechanics': 14, 'vehicle': 15, 'available': 16, 'downtown': 17, 'bike': 18, 'stations': 19, 'popular': 20, 'rural': 21, 'city': 22, 'based': 23, 'automotive': 24, 'offer': 25, 'competitive': 26, 'pricing': 27}
Scores:
{'data': array([0.532071 , 0.336012 , 0.38842288, 0.532071 , 0.532071 ,
0.336012 , 0.36032152, 0.532071 , 0.532071 , 0.532071 ,
0.532071 , 0.532071 , 0.336012 , 0.36032152, 0.336012 ,
0.336012 , 0.532071 , 0.532071 , 0.615063 , 0.615063 ,
0.615063 , 0.615063 , 0.57056487, 0.57056487, 0.57056487,
0.57056487, 0.532071 , 0.532071 , 0.532071 , 0.532071 ,
0.532071 , 0.532071 ], dtype=float32), 'indices': array([0, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 1, 1, 3, 1, 4, 1, 1, 2, 2, 2, 2,
3, 3, 3, 3, 4, 4, 4, 4, 4, 4], dtype=int32), 'indptr': array([ 0, 1, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32], dtype=int32), 'num_docs': 5}
Top-k Results:
Rank 1 (score: 1.20): Find the best car repair services near you.
Rank 2 (score: 0.87): City-based automotive services offer competitive pricing.
Rank 3 (score: 0.36): Bike repair stations are popular in rural areas.
Explanation of Output
Vocabulary
The vocab dictionary maps each term in the corpus to a unique index. For example, "auto" is indexed as 0 and "repair" as 10. When new documents introduce terms not in the current vocabulary, the vocabulary expands, requiring re-indexing and recalculation of IDF values to maintain consistent scoring.
Scores
The scores dictionary holds the BM25 weights for every term-document pair in a compressed sparse format:
- data: BM25 weights computed from term frequency (TF), inverse document frequency (IDF), and document length normalization.
- indices: the document ids that each weight in data belongs to.
- indptr: the boundaries that split data and indices into one slice per vocabulary term, so each term's slice lists the documents containing it.
The short sketch below shows how to read these arrays for a single term.
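The snippet below is a small, illustrative way to inspect this structure. It reuses the retriever and vocab variables from the code example above and assumes the layout shown in the printed output (one indptr slice per vocabulary term):
scores = retriever.scores  # dict with "data", "indices", "indptr", "num_docs"
term_id = vocab["repair"]
start, end = scores["indptr"][term_id], scores["indptr"][term_id + 1]
for doc_id, weight in zip(scores["indices"][start:end], scores["data"][start:end]):
    print(f"'repair' appears in document {doc_id} with BM25 weight {weight:.3f}")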
Top-k Results
The retrieved results rank documents by their BM25 scores:
- Rank 1 (score: 1.20): "Find the best car repair services near you." This document directly matches multiple query terms such as "car," "repair," and "services," resulting in the highest score.
- Rank 2 (score: 0.87): "City-based automotive services offer competitive pricing." This document is somewhat relevant, with "services" and "city" matching the query directly.
- Rank 3 (score: 0.36): "Bike repair stations are popular in rural areas." While this document mentions "repair," the context ("bike repair stations") diverges from the query intent, leading to a lower score.
Missed Synonyms
Two documents that could be considered semantically similar to the query received notably low scores:
- "Auto maintenance shops in urban areas are highly rated."
- "Mechanics for vehicle maintenance are available downtown."
These documents contain synonymous terms like "auto" (instead of "car"), "maintenance" (instead of "repair"), and "mechanics" (instead of "services"). However, BM25 relies on exact matches of query terms in the documents. Without shared vocabulary terms, these documents are ranked poorly, showcasing the limitation of sparse indexing in handling synonyms or semantically equivalent terms.
Re-indexing Requirement
When adding new documents, the vocabulary expands, and the IDF values must be recalculated to incorporate new terms appropriately. This re-indexing ensures that scoring remains consistent across the corpus.
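A straightforward way to handle this with bm25s is to rebuild the index over the expanded corpus, as in this minimal sketch (it reuses the corpus from the example above; the new document is illustrative):
# Add a new document and rebuild the index so the vocabulary and IDF values are recomputed
new_corpus = corpus + ["Affordable tire replacement and oil changes in the city center."]
retriever = bm25s.BM25(corpus=new_corpus)
retriever.index(bm25s.tokenize(new_corpus))
results, scores = retriever.retrieve(bm25s.tokenize("car repair services in the city"), k=3)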
Key Takeaway
Sparse indexes, powered by BM25 or TF-IDF, are a foundational component of text search systems. While they excel at efficiency and interpretability, their limitations in handling synonyms and contextual nuances often necessitate pairing with dense or hybrid approaches for improved relevance in complex queries.
Dense Indexes
Dense indexes store high-dimensional vector embeddings, capturing semantic relationships between queries and documents.
- Advantages:
- Semantic Search: Dense indexes excel in understanding synonyms and related concepts, enabling retrieval based on meaning rather than exact words.
- Contextual Retrieval: Ideal for complex queries where context matters more than specific keywords.
- Limitations:
- Less Interpretable: Relevance scores are harder to explain than sparse scores; dimensionality reduction and clustering can help when inspecting the embedding space (see the sketch below).
- Computationally Intensive: Generating embeddings and searching large vector indexes is more expensive than keyword lookup.
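As a minimal, illustrative sketch of that inspection idea (the random vectors below are placeholders standing in for real document embeddings):
import numpy as np
from sklearn.decomposition import PCA
# Placeholder vectors standing in for real document embeddings (e.g., 1536-dim OpenAI embeddings)
embeddings = np.random.rand(5, 1536)
# Project to 2D so the documents can be plotted and visually clustered
coords_2d = PCA(n_components=2).fit_transform(embeddings)
print(coords_2d)  # nearby points suggest semantically similar documents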
Example of Dense Vectors: Dense vectors are comparatively low-dimensional (typically a few hundred to a few thousand dimensions), densely populated, and encode semantic meaning in a shared vector space. Similar queries and documents lie closer together in this space.
- Query: "health benefits of green tea"
- Dense Vector Example:
[0.21, -0.13, 0.44, ..., 0.19]
Code Example:
import os
import numpy as np
from llama_index.embeddings.openai import OpenAIEmbedding
from sklearn.metrics.pairwise import cosine_similarity
os.environ["OPENAI_API_KEY"] = ""  # Set your OpenAI API key here
# Initialize the OpenAI Embedding model
openai_embedding = OpenAIEmbedding()
# Define your corpus of documents
documents = [
    "Auto maintenance shops in urban areas are highly rated.",
    "Find the best car repair services near you.",
    "Mechanics for vehicle maintenance are available downtown.",
    "Bike repair stations are popular in rural areas.",
    "City-based automotive services offer competitive pricing.",
]
# Generate dense embeddings for all documents
document_embeddings = [openai_embedding.get_text_embedding(doc) for doc in documents]
# Define your query
query = "car repair services in the city"
# Generate embedding for the query
query_embedding = openai_embedding.get_text_embedding(query)
# Compute cosine similarity between query and document embeddings
similarities = cosine_similarity(
    [query_embedding],
    document_embeddings
).flatten()
# Get the top 3 most similar documents
top_indices = similarities.argsort()[-3:][::-1]
# Display the top results
print("\nTop 3 Results:")
for idx in top_indices:
    print(f"Document: {documents[idx]}")
    print(f"Similarity Score: {similarities[idx]:.4f}")
Expected output:
Top 3 Results:
Document: City-based automotive services offer competitive pricing.
Similarity Score: 0.9044
Document: Find the best car repair services near you.
Similarity Score: 0.8942
Document: Mechanics for vehicle maintenance are available downtown.
Similarity Score: 0.8933
Final Thoughts on Dense Indexes
The results demonstrate the strength of dense embeddings in capturing semantic similarities that go beyond exact keyword matches. For example, "city-based automotive services" ranks highest because it closely aligns with the broader meaning of "car repair services in the city," even without exact term overlap.
Unlike BM25, which produces unbounded scores driven by term frequency and inverse document frequency, dense retrieval produces similarity scores bounded by the chosen metric; cosine similarity, for example, always lies in [-1, 1] and, for typical text embeddings, usually in [0, 1]. These scores reflect the closeness of vectors in the embedding space rather than keyword co-occurrence, which makes dense indexes particularly powerful for capturing relationships between semantically similar queries and documents, even when exact wording differs. However, their lack of interpretability and reliance on high-quality embeddings can present challenges for debugging or understanding retrieval decisions.
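As a tiny illustration of the bounded nature of cosine similarity (a minimal sketch with made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions):
import numpy as np
def cosine(a, b):
    # Dot product divided by the product of the norms; the result always lies in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine(np.array([0.21, -0.13, 0.44]), np.array([0.20, -0.10, 0.40])))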
Hybrid Search
Hybrid search combines the strengths of sparse retrieval (e.g., BM25) and dense retrieval (e.g., vector embeddings) to create a robust and versatile search mechanism. Sparse retrieval ensures precision for exact matches, while dense retrieval captures semantic similarities and contextual nuances. By merging these approaches, hybrid search improves recall and relevance, especially for complex queries.
In this section, we demonstrate how to implement hybrid search by running sparse and dense retrieval in parallel and merging their results. We'll compute normalized scores for both methods, combine them with a weighted formula, and return the most relevant documents.
Code example
import os
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from llama_index.embeddings.openai import OpenAIEmbedding
import bm25s
# Initialize the OpenAI Embedding model
openai_embedding = OpenAIEmbedding()
# Define your corpus of documents
documents = [
    "Auto maintenance shops in urban areas are highly rated.",
    "Find the best car repair services near you.",
    "Mechanics for vehicle maintenance are available downtown.",
    "Bike repair stations are popular in rural areas.",
    "City-based automotive services offer competitive pricing.",
]
query = "car repair services in the city"
# Sparse Retrieval with BM25
corpus = documents
retriever = bm25s.BM25(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))
# Retrieve BM25 scores for all documents, then map them back into corpus order
# so they line up element-wise with the dense scores below
sparse_results, sparse_result_scores = retriever.retrieve(bm25s.tokenize(query), k=len(corpus))
sparse_scores = np.zeros(len(corpus))
for rank in range(sparse_results.shape[1]):
    doc_idx = corpus.index(sparse_results[0, rank])
    sparse_scores[doc_idx] = sparse_result_scores[0, rank]
# Dense Retrieval with OpenAI Embeddings
document_embeddings = [openai_embedding.get_text_embedding(doc) for doc in documents]
query_embedding = openai_embedding.get_text_embedding(query)
dense_scores = cosine_similarity([query_embedding], document_embeddings).flatten()
# Normalize Scores to [0, 1] so the two scales are comparable
sparse_scores_normalized = (sparse_scores - np.min(sparse_scores)) / (np.max(sparse_scores) - np.min(sparse_scores))
dense_scores_normalized = (dense_scores - np.min(dense_scores)) / (np.max(dense_scores) - np.min(dense_scores))
# Combine Scores
alpha = 0.5  # Adjust this weight to balance sparse and dense scores
final_scores = alpha * sparse_scores_normalized + (1 - alpha) * dense_scores_normalized
# Rank Documents by the combined score
ranked_indices = np.argsort(final_scores)[::-1]
print("\nHybrid Search Results:")
for idx in ranked_indices[:3]:
    print(f"Document: {documents[idx]}")
    print(f"Sparse Score (Normalized): {sparse_scores_normalized[idx]:.2f}, Dense Score (Normalized): {dense_scores_normalized[idx]:.2f}")
    print(f"Final Combined Score: {final_scores[idx]:.2f}\n")
Expected output (illustrative; the exact dense similarities, and therefore the combined scores, depend on the embedding model, so your values may differ slightly):
Hybrid Search Results:
Document: Find the best car repair services near you.
Sparse Score (Normalized): 1.00, Dense Score (Normalized): 0.92
Final Combined Score: 0.96
Document: City-based automotive services offer competitive pricing.
Sparse Score (Normalized): 0.73, Dense Score (Normalized): 1.00
Final Combined Score: 0.86
Document: Mechanics for vehicle maintenance are available downtown.
Sparse Score (Normalized): 0.00, Dense Score (Normalized): 0.90
Final Combined Score: 0.45
Code Commentary and Output Explanation
In this example, we implemented hybrid search from scratch to provide a clear, step-by-step demonstration of the mechanics involved. In a production environment, you would likely leverage existing frameworks or abstractions to simplify and optimize the process. For example, libraries like LlamaIndex offer built-in support for hybrid search, integrating sparse and dense retrieval seamlessly.
Reranking with a Simple Formula
We combined the scores from sparse (BM25) and dense (OpenAI embeddings) retrieval using a simple weighted formula: final_score = α · sparse_score_normalized + (1 − α) · dense_score_normalized.
This approach enables the system to balance keyword-based precision with semantic relevance. For a deeper dive into advanced reranking strategies and their applications, refer to the Rerankers [LINK] chapter in this document. More sophisticated reranking methods can significantly enhance retrieval accuracy, especially in complex use cases.
Passing Reranked Chunks to the LLM
After reranking, the top chunks are typically passed to an LLM for generating a final answer. In this example, we used simplified chunks (short sentences) to focus on retrieval. In real-world applications, these chunks might represent segments from larger documents, such as market research reports or other domain-specific content. The hybrid search ensures that the most relevant chunks—both semantically and contextually—are presented to the LLM, improving the quality of the generated response.
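As a minimal sketch of that hand-off (it reuses documents and ranked_indices from the hybrid search example above; the prompt wording and the model name are assumptions, so adapt them to your own LLM setup):
from openai import OpenAI
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
# Build the context from the top-ranked chunks of the hybrid search example
context = "\n".join(documents[idx] for idx in ranked_indices[:3])
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What car repair services are available in the city?"
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name, swap in your preferred model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)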
Choosing the Weighting Parameter (Alpha)
We selected α=0.5 for simplicity, giving equal weight to sparse and dense scores. However, α is effectively a hyperparameter of the system and should be tuned based on the specific use case. For instance:
- Higher α: emphasizes exact keyword matches, useful in domains where precision is critical.
- Lower α: favors semantic relevance, ideal for queries that rely on understanding synonyms or contextual meaning.
To find the optimal α, systematic evaluation (discussed in the Evaluation Module [LINK]) is recommended. Depending on the use case, you may choose to prioritize sparse or dense retrieval differently.
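As a toy sketch of such a sweep (the normalized scores and the relevance label below are illustrative placeholders; in practice you would aggregate a metric such as recall@k over a labeled evaluation set):
import numpy as np
sparse_norm = np.array([0.00, 1.00, 0.00, 0.30, 0.73])  # illustrative normalized BM25 scores
dense_norm = np.array([0.70, 0.92, 0.90, 0.00, 1.00])   # illustrative normalized cosine similarities
relevant_doc = 1  # index of the document labeled relevant for this query (placeholder label)
k = 2
for alpha in np.linspace(0.0, 1.0, 11):
    combined = alpha * sparse_norm + (1 - alpha) * dense_norm
    top_k = np.argsort(combined)[::-1][:k]
    print(f"alpha={alpha:.1f} -> top-{k}={top_k.tolist()}, relevant retrieved: {relevant_doc in top_k}")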
So what?
A year ago, I would have advised you to just use dense vectors; they seem like the more elegant solution. But lately I have been seeing great results with sparse vectors and hybrid search, so I suggest you give them a try. For this specific part of the RAG pipeline it is very hard to give concrete guidelines, which makes it all the more important to run evaluations on your own use case. Define metrics like precision or recall and run evaluations to find the best balance for your needs. For guidance on setting up evaluations and optimizing parameters, see the evaluation module [LINK].
Comparison of Sparse, Dense, and Hybrid Retrieval
| Feature | Sparse Retrieval (e.g., BM25) | Dense Retrieval (e.g., Embeddings) | Hybrid Retrieval |
| --- | --- | --- | --- |
| Strengths | Efficient, interpretable, precise for exact matches | Semantic understanding, handles synonyms and context | Combines precision and semantic relevance |
| Limitations | Struggles with synonyms and contextual queries | Less interpretable, computationally intensive | Requires score normalization and parameter tuning |
| Use Cases | Keyword-based search, FAQs, legal texts | Semantic search, conversational AI, document clustering | Complex queries, knowledge bases, recommendation systems |
| Computational Cost | Low to moderate | High | Moderate to high, depending on balance |
| Interpretability | High | Low | Moderate |
| Scaling | Easy to scale for large datasets | Computationally expensive at scale | Scales depending on implementation |
| Parameter Tuning | Minimal | Moderate | High (e.g., weighting alpha) |
Conclusion
Hybrid search offers a powerful framework for balancing the strengths of sparse and dense retrieval methods. By combining sparse retrieval’s precision with dense retrieval’s semantic understanding, it ensures that results are both accurate and contextually relevant. This versatility makes hybrid search particularly effective for diverse applications, from answering complex queries in knowledge bases to improving e-commerce recommendations.
While our example demonstrated a straightforward approach using normalized scores and a fixed weight (alpha), the real-world implementation often involves additional considerations, such as query-specific tuning, advanced reranking strategies, and integration with domain-specific embeddings. Leveraging hybrid search can significantly enhance the quality of information retrieval, especially when paired with tools like LLMs to generate nuanced answers from the retrieved results.
Experimentation is key to unlocking the full potential of hybrid search. Adjust parameters like alpha, explore different embedding models, and test with varied datasets to optimize performance for your specific use case.
In the next chapter, we will explore advanced chunking methods. Naive chunking approaches based on fixed token counts and overlap disrupt the logical flow and context of the text, and there are many techniques that try to fix this!
Jupyter: Google Colab