Introduction
This chapter demonstrates the implementation of a hybrid retrieval system on a practical restaurant reviews project. The main focus is on how to combine traditional BM25 and dense vector similarity search methods for enhanced search capabilities. The initial sections lay the groundwork by explaining and implementing these foundational techniques. The primary objective, however, is to show how these methods integrate into a hybrid approach, combining lexical and semantic retrieval to improve relevance of retrieved documents.
Resources:
- Deep Lake docs: RAG
- Jupyter: Google Colab
Load the Data from Deep Lake
The following code opens the dataset in read-only mode from Deep Lake at the path al://activeloop/restaurant_reviews_complete. The scraped_data object now contains the complete restaurant dataset, featuring 160 restaurants and over 24,000 images, ready for data extraction and processing.
#!pip install deeplake
import deeplake
scraped_data = deeplake.open_read_only(f"al://activeloop/restaurant_reviews_complete")
print(f"Scraped {len(scraped_data)} reviews")
Output:
Scraped 18625 reviews
Create the Dataset and Use an Inverted Index for Filtering
In the first stage of this course, we’ll cover Lexical Search, a traditional and foundational approach to information retrieval.
An inverted index is a data structure commonly used in search engines and databases to facilitate fast full-text searches. Unlike a row-wise search, which scans each row of a document or dataset for a search term, an inverted index maps each unique word or term to the locations (such as document IDs or row numbers) where it appears. This setup allows for very efficient retrieval of information, especially in large datasets.
For small datasets with up to 1,000 documents, row-wise search can provide efficient performance without needing an inverted index. For medium-sized datasets (10,000+ documents), inverted indexes become useful, particularly if search queries are frequent. For large datasets of 100,000+ documents, using an inverted index is essential to ensure efficient query processing and meet performance expectations.
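As a rough illustration (a toy sketch, not Deep Lake's API or internal implementation), an inverted index can be pictured as a mapping from each token to the row IDs that contain it:
# Toy inverted index: map each token to the set of row IDs where it appears.
# Deep Lake builds and queries this kind of structure for you when a column
# uses types.Inverted.
from collections import defaultdict

reviews = ["great burritos", "good drinks and burgers", "best burritos in town"]

inverted_index = defaultdict(set)
for row_id, text in enumerate(reviews):
    for token in text.lower().split():
        inverted_index[token].add(row_id)

print(inverted_index["burritos"])  # {0, 2} -> rows containing the term
Deep Lake builds this structure for us; we only need to declare the index type when creating the columns, as shown below.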
import deeplake
from deeplake import types
# Create a dataset
inverted_index_dataset = "local_inverted_index"
ds = deeplake.create(f"file://{inverted_index_dataset}")
We now create three columns in the dataset: restaurant_name, restaurant_review, and owner_answer. All three columns are text-based and use an inverted index to improve search efficiency.
ds.add_column("restaurant_name", types.Text(index_type=types.Inverted))
ds.add_column("restaurant_review", types.Text(index_type=types.Inverted))
ds.add_column("owner_answer", types.Text(index_type=types.Inverted))
Extract the data
This code extracts restaurant details from scraped_data into separate lists:
- Initialize lists: restaurant_name, restaurant_review, and owner_answer are initialized to store the respective fields for each restaurant.
- Populate lists: for each entry (el) in scraped_data, the code appends el['restaurant_name'] to restaurant_name, el['restaurant_review'] to restaurant_review, and el['owner_answer'] to owner_answer.
After running, each list holds a specific field from all restaurants, ready for further processing.
restaurant_name = []
restaurant_review = []
owner_answer = []
for el in scraped_data:
restaurant_name.append(el['restaurant_name'])
restaurant_review.append(el['restaurant_review'])
owner_answer.append(el['owner_answer'])
Add the data to the dataset
We add the collected restaurant names, reviews, and owner answers to the dataset ds. Using ds.append(), we insert three columns: "restaurant_name", "restaurant_review", and "owner_answer", populated with the values from our lists. After appending the data, ds.commit() saves the changes permanently to the dataset, ensuring all new entries are stored and ready for further processing.
ds.append({
"restaurant_name": restaurant_name,
"restaurant_review": restaurant_review,
"owner_answer": owner_answer
})
ds.commit()
ds
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=18625)
Search for the restaurant using a specific word
We define a search query to find entries in the dataset ds where the word "burritos" appears in the restaurant_review column. The command ds.query() runs a TQL query with SELECT *, which retrieves all entries that match the condition CONTAINS(restaurant_review, '{word}') and limits the output to 4 rows. This search filters the dataset to show only records containing the specified word in their reviews. The results are saved in the variable view.
Deep Lake offers a high-performance SQL-based query engine for data analysis called TQL (Tensor Query Language); see the official Deep Lake documentation for the full syntax.
word = 'burritos'
view = ds.query(f"""
SELECT *
WHERE CONTAINS(restaurant_review, '{word}')
LIMIT 4
""")
view
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=4)
Show the results
for row in view:
print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']}")
Output:
Restaurant name: Los Amigos
Review: Best Burritos i have ever tried!!!!! Wolderful!!!
Restaurant name: Los Amigos
Review: Really good breakfast burrito, and just burritos in general
Restaurant name: Los Amigos
Review: Ordered two of their veggie burritos, nothing crazy just added extra cheese and sour cream. They even repeated the order back to me and everything was fine, then when I picked the burritos up and got home they put zucchini and squash in it.. like what??
Restaurant name: Los Amigos
Review: Don't make my mistake and over order. The portions are monstrous. The wet burritos are as big as a football.
AI data retrieval systems today face three challenges: limited modalities, lack of accuracy, and high costs at scale. Deep Lake 4.0 addresses these by enabling true multi-modality, enhancing accuracy, and reducing query costs by 2x with index-on-the-lake technology.
Consider a scenario where we store all our data locally on a computer. Initially, this may be adequate, but as the volume of data grows, managing it becomes increasingly challenging. The computer’s storage becomes limited, data access slows, and sharing information with others is less efficient.
To address these challenges, we can transition our data storage to the cloud using Deep Lake. Designed specifically for handling large-scale datasets and AI workloads, Deep Lake enables up to 10 times faster data access. With cloud storage, hardware limitations are no longer a concern: Deep Lake offers ample storage capacity, secure access from any location, and streamlined data sharing.
This approach provides a robust and scalable infrastructure that can grow alongside our projects, minimizing the need for frequent hardware upgrades and ensuring efficient data management.
Use BM25 to Retrieve the Data
Our advanced "Index-On-The-Lake"
technology enables sub-second query performance directly from object storage, such as S3
, using minimal compute power and memory resources. Achieve up to 10x greater cost efficiency
compared to in-memory databases and 2x faster performance
than other object storage solutions, all without requiring additional disk-based caching.
With Deep Lake, you benefit from rapid streaming columnar access to train deep learning models directly, while also executing sub-second indexed queries for retrieval-augmented generation.
In this stage, the system uses BM25 for a straightforward lexical search. This approach is efficient for retrieving documents based on exact or partial keyword matches.
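For intuition, the relevance score that a BM25 index computes for a document against a query can be sketched as follows. This is a simplified, self-contained illustration of the Okapi BM25 formula, not Deep Lake's internal implementation; the tiny corpus and the k1 and b values are placeholders.
# Simplified Okapi BM25 scoring over a tiny tokenized corpus (illustrative only).
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    N = len(corpus)                                   # number of documents
    avgdl = sum(len(d) for d in corpus) / N           # average document length
    tf = Counter(doc_tokens)                          # term frequencies in this document
    score = 0.0
    for term in query_tokens:
        df = sum(1 for d in corpus if term in d)      # documents containing the term
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        numerator = tf[term] * (k1 + 1)
        denominator = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * numerator / denominator
    return score

corpus = [["great", "burritos"], ["good", "drinks"], ["best", "burritos", "in", "town"]]
print(bm25_score(["burritos"], corpus[0], corpus))
In practice, declaring a column with types.BM25 lets Deep Lake compute these scores for us at query time.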
We start by importing deeplake and setting up an organization ID org_id and dataset name dataset_name_bm25. Next, we create a new dataset with the specified name and location in Deep Lake storage.
We then add three columns to the dataset: restaurant_name, restaurant_review, and owner_answer. All three columns use a BM25 index, which optimizes them for relevance-based searches, enhancing the ability to rank results based on how well they match search terms.
Finally, we use ds_bm25.commit() to save these changes to the dataset and ds_bm25.summary() to display an overview of the dataset's structure and contents.
If you don't have a token yet, you can sign up and then log in on the official Activeloop website, then click the Create API token button to obtain a new API token. There, under Select organization, you can also find your organization ID(s).
import os, getpass
os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("Activeloop API token: ")
org_id = ""
dataset_name_bm25 = "bm25_test"
ds_bm25 = deeplake.create(f"al://{org_id}/{dataset_name_bm25}")
# Add columns to the dataset
ds_bm25.add_column("restaurant_name", types.Text(index_type=types.BM25))
ds_bm25.add_column("restaurant_review", types.Text(index_type=types.BM25))
ds_bm25.add_column("owner_answer", types.Text(index_type=types.BM25))
ds_bm25.commit()
ds_bm25.summary()
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=0)
+-----------------+-----------------+
| column | type |
+-----------------+-----------------+
| restaurant_name |text (bm25 Index)|
+-----------------+-----------------+
|restaurant_review|text (bm25 Index)|
+-----------------+-----------------+
| owner_answer |text (bm25 Index)|
+-----------------+-----------------+
Add data to the dataset
We add data to the ds_bm25 dataset by appending the three columns, filled with values from the lists we previously created. After appending, ds_bm25.commit() saves the changes, ensuring the new data is permanently stored in the dataset. Finally, ds_bm25.summary() provides a summary of the dataset's updated structure and contents, allowing us to verify that the data was added successfully.
ds_bm25.append({
"restaurant_name": restaurant_name,
"restaurant_review": restaurant_review,
"owner_answer": owner_answer
})
ds_bm25.commit()
ds_bm25.summary()
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=18625)
+-----------------+-----------------+
| column | type |
+-----------------+-----------------+
| restaurant_name |text (bm25 Index)|
+-----------------+-----------------+
|restaurant_review|text (bm25 Index)|
+-----------------+-----------------+
| owner_answer |text (bm25 Index)|
+-----------------+-----------------+
Search for the restaurant using a specific sentence
We define a query, "I want burritos"
, to find relevant restaurant reviews in the dataset. Using ds_bm25.query()
, we search and rank entries in restaurant_review
based on BM25 similarity to the query. The code orders results by how well they match the query (BM25_SIMILARITY
), from highest to lowest relevance, and limits the output to the top 10 results. The final list of results is stored in view_bm25
.
query = "I want burritos"
view_bm25 = ds_bm25.query(f"""
SELECT *
ORDER BY BM25_SIMILARITY(restaurant_review, '{query}') DESC
LIMIT 6
""")
view_bm25
Output:
Dataset(columns=(restaurant_name,restaurant_review,owner_answer), length=6)
Show the results
for row in view_bm25:
print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']}")
Output:
Restaurant name: Los Amigos
Review: Best Burritos i have ever tried!!!!! Wolderful!!!
Restaurant name: Los Amigos
Review: Fantastic burritos!
Restaurant name: Cheztakos!!!
Review: Great burritos
Restaurant name: La Costeña
Review: Awesome burritos!
Restaurant name: La Costeña
Review: Awesome burritos
Restaurant name: La Costeña
Review: Bomb burritos
Vector similarity search
If you want to generate text embeddings for similarity search, you can choose a proprietary model like text-embedding-3-large from OpenAI, or you can opt for an open-source model. The MTEB leaderboard on Hugging Face provides a selection of open-source models that have been tested for their effectiveness at converting text into embeddings, which are numerical representations that capture the meaning and nuances of words and sentences. Using these embeddings, you can perform similarity search, grouping similar pieces of text (like sentences or documents) based on their meaning.
Selecting a model from the MTEB leaderboard offers several benefits: these models are ranked based on performance across a variety of tasks and languages, ensuring that you’re choosing a model that’s both accurate and versatile. If you prefer not to use a proprietary model, a high-performing model from this list is an excellent alternative.
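If you prefer the open-source route, a drop-in alternative to the OpenAI-based embedding function used below might look like the following sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (both are assumptions, not part of this chapter's setup). Note that this model produces 384-dimensional vectors, so the Embedding column created later would need a matching dimension instead of 3072.
# Hypothetical open-source embedding function using sentence-transformers.
# all-MiniLM-L6-v2 returns 384-dimensional vectors, so types.Embedding(384)
# would be required instead of types.Embedding(3072).
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def open_source_embedding_function(texts):
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.replace("\n", " ") for t in texts]
    return st_model.encode(texts).tolist()
In this chapter, however, we proceed with OpenAI embeddings.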
We start by installing and importing the openai library to access OpenAI's API for generating embeddings. Next, we define the function embedding_function, which takes texts as input (either a single string or a list of strings) and a model name, defaulting to "text-embedding-3-large". For each text, we replace newline characters with spaces to keep the input clean and uniform. Finally, we use openai.embeddings.create() to generate embeddings for each text and return a list of these embeddings, ready for cosine similarity comparisons.
#!pip install openai
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")
import openai
def embedding_function(texts, model="text-embedding-3-large"):
    if isinstance(texts, str):
        texts = [texts]
    # Replace newlines so each input stays on a single line
    texts = [t.replace("\n", " ") for t in texts]
    response = openai.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
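For intuition, cosine similarity between two embedding vectors can be computed as in the sketch below. In this chapter the computation is performed server-side by Deep Lake's cosine_similarity TQL function, so this helper is illustrative only.
# Illustrative cosine similarity between two embedding vectors.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example usage (hypothetical): similarity between two short texts
# sim = cosine_similarity(embedding_function("good burritos")[0],
#                         embedding_function("great tacos")[0])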
Create the dataset and add the columns
Next, we add four columns to vector_search:
- embedding: stores vector embeddings with a dimension of 3072, enabling vector-based similarity searches.
- restaurant_name: a text column with a BM25 index, optimizing it for relevance-based text search.
- restaurant_review: another text column with a BM25 index, also optimized for efficient, ranked search results.
- owner_answer: a text column with an inverted index, allowing fast and efficient filtering on a specific owner answer.
Finally, we use vector_search.commit() to save these new columns, ensuring the dataset structure is ready for further data additions and queries.
dataset_name_vs = "vector_indexes"
vector_search = deeplake.create(f"al://{org_id}/{dataset_name_vs}")
# Add columns to the dataset
vector_search.add_column(name="embedding", dtype=types.Embedding(3072))
vector_search.add_column(name="restaurant_name", dtype=types.Text(index_type=types.BM25))
vector_search.add_column(name="restaurant_review", dtype=types.Text(index_type=types.BM25))
vector_search.add_column(name="owner_answer", dtype=types.Text(index_type=types.Inverted))
vector_search.commit()
Create embeddings
This loop processes the reviews in restaurant_review in batches of 500 and converts each one into a numerical embedding. These embeddings, stored in embeddings_restaurant_review, represent each review as a vector, enabling cosine similarity searches and comparisons within the dataset.
Deep Lake handles the search computations and returns the final results.
# Create embeddings
batch_size = 500
embeddings_restaurant_review = []
for i in range(0, len(restaurant_review), batch_size):
embeddings_restaurant_review += embedding_function(restaurant_review[i : i + batch_size])
# Add data to the dataset
vector_search.append({"restaurant_name": restaurant_name, "restaurant_review": restaurant_review, "embedding": embeddings_restaurant_review, "owner_answer": owner_answer})
vector_search.commit()
vector_search.summary()
Output:
Dataset(columns=(embedding,restaurant_name,restaurant_review,owner_answer), length=18625)
+-----------------+---------------------+
| column | type |
+-----------------+---------------------+
| embedding | embedding(3072) |
+-----------------+---------------------+
| restaurant_name | text (bm25 Index) |
+-----------------+---------------------+
|restaurant_review| text (bm25 Index) |
+-----------------+---------------------+
| owner_answer |text (Inverted Index)|
+-----------------+---------------------+
Search for the restaurant using a specific sentence
We start by defining a search query, "A restaurant that serves good burritos."
- Generate an embedding for the query: we call embedding_function(query) to generate an embedding for this query. Since embedding_function returns a list, we access the first (and only) item with [0], storing the result in embed_query.
- Convert the embedding to a string: we convert embed_query (a list of numbers) into a single comma-separated string using ",".join(str(c) for c in embed_query). This stores the embedding as a formatted string in str_query, preparing it for use in the TQL query.
query = "A restaurant that serves good burritos."
embed_query = embedding_function(query)[0]
str_query = ",".join(str(c) for c in embed_query)
- Define the query with cosine similarity: we construct a TQL query (query_vs) to search within the vector_search dataset. The query calculates the cosine similarity between the embedding column and str_query, the embedding of our query "A restaurant that serves good burritos." This similarity score (score) measures how closely each entry matches the query.
- Order by score and limit results: the query orders results by score in descending order, showing the most relevant matches first, and limits the output to the top 3 matches.
- Execute the query: vector_search.query(query_vs) runs the query on the dataset, storing the output in view_vs, which contains the top 3 most similar entries based on cosine similarity. This approach retrieves the records in vector_search most relevant to our query.
query_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{str_query}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 3
"""
view_vs = vector_search.query(query_vs)
view_vs
for row in view_vs:
print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']}")
Output:
Restaurant name: Cheztakos!!!
Review: Great burritos
Restaurant name: Los Amigos
Review: Nice place real good burritos.
Restaurant name: La Costeña
Review: Awesome burritos
If we want to filter for a specific owner answer, such as "Thank you", we set word = "Thank you" to define the desired owner answer. Here, we use the inverted index on the owner_answer column to efficiently filter results based on this owner answer.
word = "Thank you"
query_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{str_query}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
WHERE CONTAINS(owner_answer, '{word}')
ORDER BY score DESC
LIMIT 3
"""
view_vs = vector_search.query(query_vs)
view_vs
for row in view_vs:
print(f"Restaurant name: {row['restaurant_name']} \nReview: {row['restaurant_review']} \nOwner Answer: {row['owner_answer']}")
Output:
Restaurant name: Taqueria La Espuela
Review: My favorite place for super burrito and horchata
Owner Answer: Thank you for your continued support!
Restaurant name: Chaat Bhavan Mountain View
Review: Great place with good food
Owner Answer: Thank you for your positive feedback! We're thrilled to hear that you had a great experience at our restaurant and enjoyed our delicious food. Your satisfaction is our priority, and we can't wait to welcome you back for another wonderful dining experience.
Thanks,
Team Chaat Bhavan
Restaurant name: Chaat Bhavan Mountain View
Review: Good food.
Owner Answer: Thank you for your 4-star rating! We're glad to hear that you had a positive experience at our restaurant. Your feedback is valuable to us, and we appreciate your support. If there's anything specific we can improve upon to earn that extra star next time, please let us know. We look forward to serving you again soon.
Thanks,
Team Chaat Bhavan
Hybrid search
In this stage, the system enhances its search capabilities by combining BM25 with Approximate Nearest Neighbors (ANN) for a hybrid search. This approach blends lexical search with semantic search, improving relevance by considering both keywords and semantic meaning. The introduction of a Large Language Model (LLM) allows the system to generate text-based answers, delivering direct responses instead of simply listing relevant documents.
We open the vector_search dataset to perform a hybrid search. First, we define a query, "I feel like a drink", and generate its embedding using embedding_function(query)[0]. We then convert this embedding into a comma-separated string, embedding_string, preparing it for use in combined text and vector-based searches.
vector_search = deeplake.open(f"al://{org_id}/{dataset_name_vs}")
Search for the correct restaurant using a specific sentence
query = "I feel like a drink"
embed_query = embedding_function(query)[0]
embedding_string = ",".join(str(c) for c in embed_query)
We create two queries:
- Vector search (tql_vs): calculates cosine similarity with embedding_string and returns the top 5 matches by score.
- BM25 search (tql_bm25): ranks restaurant_review by BM25 similarity to query, also limited to the top 5.
We then execute both queries, storing the vector results in vs_results and the BM25 results in bm25_results. This allows us to compare results from both search methods.
tql_vs = f"""
SELECT *
FROM (
SELECT *, cosine_similarity(embedding, ARRAY[{embedding_string}]) AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
)
ORDER BY score DESC
LIMIT 5
"""
tql_bm25 = f"""
SELECT *, BM25_SIMILARITY(restaurant_review, '{query}') AS score
FROM (
SELECT *, ROW_NUMBER() AS row_id
)
ORDER BY BM25_SIMILARITY(restaurant_review, '{query}') DESC
LIMIT 5
"""
vs_results = vector_search.query(tql_vs)
bm25_results = vector_search.query(tql_bm25)
Show the scores:
for el_vs in vs_results:
print(f"vector search score: {el_vs['score']}")
for el_bm25 in bm25_results:
print(f"bm25 score: {el_bm25['score']}")
Output:
vector search score: 0.5322654247283936
vector search score: 0.46281781792640686
vector search score: 0.4580579102039337
vector search score: 0.45585304498672485
vector search score: 0.4528498649597168
bm25 score: 13.076177597045898
bm25 score: 11.206666946411133
bm25 score: 11.023599624633789
bm25 score: 10.277934074401855
bm25 score: 10.238584518432617
First, we import the required libraries and define two building blocks:
- Document class: defined with pydantic.BaseModel, each Document has an id, a data dictionary, and an optional score used for ranking.
- Softmax function: the softmax function normalizes a list of scores (retrieved_score). Each score is exponentiated (capped at max_weight to avoid overflow) and divided by the sum of the exponentials, so the resulting new_weights sum to 1.
#!pip install numpy pydantic
import math
import numpy as np
from typing import Any, Dict, List, Optional
from pydantic import BaseModel
class Document(BaseModel):
id: str
data: Dict[str, Any]
score: Optional[float] = None
def softmax(retrieved_score: List[float], max_weight: int = 700) -> List[float]:
    # Compute the exponentials, capping each score to avoid overflow
    exp_scores = [math.exp(min(score, max_weight)) for score in retrieved_score]
    # Compute the sum of the exponentials
    sum_exp_scores = sum(exp_scores)
    # Normalize each score so the weights sum to 1
    new_weights = [score / sum_exp_scores for score in exp_scores]
    return new_weights
Normalize the score
- Apply softmax to the scores: we extract the score values from vs_results and bm25_results and apply softmax to them, storing the results in vss and bm25s. This scales both sets of scores for easy comparison.
- Create document dictionaries: we create dictionaries docs_vs and docs_bm25 to store documents from vs_results and bm25_results, respectively. For each result, we add the restaurant_name and restaurant_review along with the normalized score. Each document is keyed by its row_id.
This code standardizes the scores and organizes the results, allowing comparison across both the vector and BM25 search methods.
vs_score = vs_results["score"]
bm_score = bm25_results["score"]
vss = softmax(vs_score)
bm25s = softmax(bm_score)
print(vss)
print(bm25s)
Output:
[0.21224761685297047, 0.19800771415362647, 0.1970674552539808, 0.19663342673946818, 0.19604378699995426]
[0.7132230191866898, 0.10997834807700335, 0.09158030054295993, 0.04344738382536802, 0.04177094836797888]
docs_vs = {}
docs_bm25 = {}
for el, score in zip(vs_results, vss):
docs_vs[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)
for el, score in zip(bm25_results, bm25s):
docs_bm25[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)
We define weights for our hybrid search: VECTOR_WEIGHT and LEXICAL_WEIGHT are both set to 0.5, giving equal importance to the vector-based and BM25 scores.
- Initialize the results dictionary: we create an empty dictionary, results, to store documents with their combined scores from both search methods.
- Combine scores: we iterate over the union of document IDs from docs_vs and docs_bm25. Each document is added to results, defaulting to whichever version is available (vector or BM25). We then compute the weighted scores: vs_score from the vector results (if the ID is present in docs_vs) and bm_score from the BM25 results (if present in docs_bm25). The final results[k].score is the sum of vs_score and bm_score.
This produces a fused score for each document in results, ready to rank in the hybrid search.
def fusion(docs_vs: Dict[str, Document], docs_bm25: Dict[str, Document]) -> Dict[str, Document]:
    VECTOR_WEIGHT = 0.5
    LEXICAL_WEIGHT = 0.5
    results: Dict[str, Document] = {}
    for k in set(docs_vs) | set(docs_bm25):
        # Keep whichever document is available (vector or BM25 result)
        results[k] = docs_vs.get(k, None) or docs_bm25.get(k, None)
        # Weighted contribution from each retrieval method (0 if absent)
        vs_score = VECTOR_WEIGHT * docs_vs[k].score if k in docs_vs else 0
        bm_score = LEXICAL_WEIGHT * docs_bm25[k].score if k in docs_bm25 else 0
        results[k].score = vs_score + bm_score
    return results
results = fusion(docs_vs, docs_bm25)
results
Output:
{'2637': Document(id='2637', data={'restaurant_name': 'Mifen101 花溪米粉王', 'restaurant_review': 'Feel like I’m back in China.'}, score=0.013747293509625419),
'5136': Document(id='5136', data={'restaurant_name': 'Scratch', 'restaurant_review': 'Just had drinks. They were good!'}, score=0.024505473374994282),
'17426': Document(id='17426', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks an easy going bartenders'}, score=0.024579178342433523),
'17444': Document(id='17444', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks, good food'}, score=0.02475096426920331),
'2496': Document(id='2496', data={'restaurant_name': 'Seasons Noodles & Dumplings Garden', 'restaurant_review': 'Comfort food, excellent service! Feel like back to home.'}, score=0.005430922978171003),
'4022': Document(id='4022', data={'restaurant_name': 'Eureka! Mountain View', 'restaurant_review': 'Good drinks and burgers'}, score=0.0246334319067476),
'3518': Document(id='3518', data={'restaurant_name': 'Olympus Caffe & Bakery', 'restaurant_review': 'I like the garden to sit down with friends and have a drink.'}, score=0.08915287739833623),
'17502': Document(id='17502', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Nice place for a drink'}, score=0.02653095210662131),
'11383': Document(id='11383', data={'restaurant_name': 'Ludwigs Biergarten Mountain View', 'restaurant_review': 'Beer is fresh tables are big feel like a proper beer garden'}, score=0.011447537567869991),
'10788': Document(id='10788', data={'restaurant_name': 'Casa Lupe', 'restaurant_review': 'Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos'}, score=0.00522136854599736)}
We sort the results dictionary by each document's combined score in descending order, ensuring that the highest-ranking documents appear first.
sorted_documents = dict(sorted(results.items(), key=lambda item: item[1].score, reverse=True))
sorted_documents
Output:
{'3518': Document(id='3518', data={'restaurant_name': 'Olympus Caffe & Bakery', 'restaurant_review': 'I like the garden to sit down with friends and have a drink.'}, score=0.3566115095933449),
'17502': Document(id='17502', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Nice place for a drink'}, score=0.10612380842648524),
'17444': Document(id='17444', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks, good food'}, score=0.09900385707681324),
'4022': Document(id='4022', data={'restaurant_name': 'Eureka! Mountain View', 'restaurant_review': 'Good drinks and burgers'}, score=0.0985337276269904),
'17426': Document(id='17426', data={'restaurant_name': "St. Stephen's Green", 'restaurant_review': 'Good drinks an easy going bartenders'}, score=0.09831671336973409),
'5136': Document(id='5136', data={'restaurant_name': 'Scratch', 'restaurant_review': 'Just had drinks. They were good!'}, score=0.09802189349997713),
'2637': Document(id='2637', data={'restaurant_name': 'Mifen101 花溪米粉王', 'restaurant_review': 'Feel like I’m back in China.'}, score=0.054989174038501676),
'11383': Document(id='11383', data={'restaurant_name': 'Ludwigs Biergarten Mountain View', 'restaurant_review': 'Beer is fresh tables are big feel like a proper beer garden'}, score=0.045790150271479965),
'2496': Document(id='2496', data={'restaurant_name': 'Seasons Noodles & Dumplings Garden', 'restaurant_review': 'Comfort food, excellent service! Feel like back to home.'}, score=0.02172369191268401),
'10788': Document(id='10788', data={'restaurant_name': 'Casa Lupe', 'restaurant_review': 'Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos'}, score=0.02088547418398944)}
Show the results
We will output a list of restaurants in order of relevance, showing each name and review based on the hybrid search results.
for v in sorted_documents.values():
print(f"Restaurant name: {v.data['restaurant_name']} \nReview: {v.data['restaurant_review']}")
Output:
Restaurant name: Olympus Caffe & Bakery
Review: I like the garden to sit down with friends and have a drink.
Restaurant name: St. Stephen's Green
Review: Nice place for a drink
Restaurant name: St. Stephen's Green
Review: Good drinks, good food
Restaurant name: Eureka! Mountain View
Review: Good drinks and burgers
Restaurant name: St. Stephen's Green
Review: Good drinks an easy going bartenders
Restaurant name: Scratch
Review: Just had drinks. They were good!
Restaurant name: Mifen101 花溪米粉王
Review: Feel like I’m back in China.
Restaurant name: Ludwigs Biergarten Mountain View
Review: Beer is fresh tables are big feel like a proper beer garden
Restaurant name: Seasons Noodles & Dumplings Garden
Review: Comfort food, excellent service! Feel like back to home.
Restaurant name: Casa Lupe
Review: Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos
Generating LLM answer
This code completes the RAG (Retrieval-Augmented Generation) approach by generating an LLM-based answer to a user's question, using the results retrieved in the previous step. Here's how it works:
- Setup and initialization: we import json for handling JSON responses and initialize the OpenAI client to interact with the language model.
- Define the generate_question function: it accepts question (the user's question) and information (a list of relevant chunks retrieved previously, providing context).
- System and user prompts: the system_prompt instructs the model to act as a restaurant assistant, using the provided chunks to answer clearly and without repetition, and directs it to format its response as JSON. The user_prompt combines the user's question and the information chunks.
- Generate and parse the response: using client.chat.completions.create(), the system and user prompts are sent to the LLM (gpt-4o-mini). The response is parsed as JSON and the answer field is extracted; if parsing fails, False is returned.
import json
from openai import OpenAI
client = OpenAI()
def generate_question(question:str, information:list):
system_prompt = f"""You are a helpful assistant specialized in providing answers to questions about restaurants. Below is a question from a user, along with the top four relevant information chunks about restaurants from a Deep Lake database. Using these chunks, construct a clear and informative answer that addresses the question, incorporating key details without repeating information.
The output must be in JSON format with the following structure:
{{
"answer": "The answer to the question."
}}
"""
user_prompt = f"Here is a question from a user: {question}\n\nHere are the top relevant information about restaurants {information}"
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
response_format={"type": "json_object"},
)
    try:
        # Parse the JSON response and return only the answer field
        content = response.choices[0].message.content
        parsed = json.loads(content)
        return parsed["answer"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
We now build the information list from the retrieved entries in view_vs and pass it, together with the query, to generate_question. This completes the RAG process by combining the relevant retrieved chunks with LLM-generated content into a concise answer.
information = [f'Review: {el["restaurant_review"]}, Restaurant name: {el["restaurant_name"]}' for el in view_vs]
result = generate_question(query, information)
result
Output:
"If you're feeling like a drink, consider visiting Taqueria La Espuela
which is known for its refreshing horchata. Alternatively, you might enjoy
Chaat Bhavan Mountain View, a great place with good food and a lively atmosphere."
Search on multiple datasets
In this approach, we perform the hybrid search across two separate datasets: vector_search for the vector-based search results and ds_bm25 for the BM25-based text search results. This allows us to independently query and retrieve scores from each dataset, then combine them using the same fusion method as before.
ds_bm25 = deeplake.open(f"al://{org_id}/{dataset_name_bm25}")
vs_results = vector_search.query(tql_vs)
bm25_results = ds_bm25.query(tql_bm25)
vs_score = vs_results["score"]
bm_score = bm25_results["score"]
vss = softmax(vs_score)
bm25s = softmax(bm_score)
docs_vs = {}
docs_bm25 = {}
for el, score in zip(vs_results, vss):
docs_vs[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)
for el, score in zip(bm25_results, bm25s):
docs_bm25[str(el["row_id"])] = Document(id=str(el["row_id"]), data={"restaurant_name": el["restaurant_name"], "restaurant_review": el["restaurant_review"]}, score=score)
results = fusion(docs_vs, docs_bm25)
# Sort the fused results by combined score before printing
sorted_documents = dict(sorted(results.items(), key=lambda item: item[1].score, reverse=True))
for v in sorted_documents.values():
    print(f"Restaurant name: {v.data['restaurant_name']} \nReview: {v.data['restaurant_review']}")
Output:
Restaurant name: Olympus Caffe & Bakery
Review: I like the garden to sit down with friends and have a drink.
Restaurant name: St. Stephen's Green
Review: Nice place for a drink
Restaurant name: St. Stephen's Green
Review: Good drinks, good food
Restaurant name: Eureka! Mountain View
Review: Good drinks and burgers
Restaurant name: St. Stephen's Green
Review: Good drinks an easy going bartenders
Restaurant name: Scratch
Review: Just had drinks. They were good!
Restaurant name: Mifen101 花溪米粉王
Review: Feel like I’m back in China.
Restaurant name: Ludwigs Biergarten Mountain View
Review: Beer is fresh tables are big feel like a proper beer garden
Restaurant name: Seasons Noodles & Dumplings Garden
Review: Comfort food, excellent service! Feel like back to home.
Restaurant name: Casa Lupe
Review: Run by a family that makes you feel like part of the family. Awesome food. I love their wet Chili Verde burritos
Comparison of Sync vs Async Query Performance
This code performs an asynchronous query on a Deep Lake dataset. It begins by opening the dataset asynchronously using await deeplake.open_async(), specifying org_id and dataset_name_vs, then runs the query with query_async() and retrieves the output with .result().
ds_async = await deeplake.open_async(f"al://{org_id}/{dataset_name_vs}")
ds_async_results = ds_async.query_async(tql_vs).result()
The following code compares the execution times of synchronous and asynchronous queries on a Deep Lake dataset:
- First, it records the start time start_sync, runs both queries synchronously with vector_search.query(tql_vs) and ds_bm25.query(tql_bm25), records the end time end_sync, and prints the total synchronous time (end_sync - start_sync).
- Next, it measures the asynchronous execution: it records start_async, runs the same two queries concurrently with asyncio.gather() via query_async(), records end_async, and prints the asynchronous time (end_async - start_async).
Finally, the code calculates the speed factor by dividing the synchronous query time by the asynchronous query time, indicating how much faster the asynchronous execution is. Because asyncio.gather() runs the queries in parallel, the overall execution time is reduced.
import time
import asyncio
import nest_asyncio
nest_asyncio.apply()
async def run_async_queries():
# Use asyncio.gather to run queries concurrently
ds_async_results, ds_bm25_async_results = await asyncio.gather(
vector_search.query_async(tql_vs),
ds_bm25.query_async(tql_bm25)
)
return ds_async_results, ds_bm25_async_results
# Measure synchronous execution time
start_sync = time.time()
ds_sync_results = vector_search.query(tql_vs)
ds_bm25_sync_results = ds_bm25.query(tql_bm25)
end_sync = time.time()
print(f"Sync query time: {end_sync - start_sync}")
# Measure asynchronous execution time
start_async = time.time()
# Run the async queries concurrently using asyncio.gather
ds_async_results, ds_bm25_async_results = asyncio.run(run_async_queries())
end_async = time.time()
print(f"Async query time: {end_async - start_async}")
sync_time = end_sync - start_sync
async_time = end_async - start_async
# Calculate speed factor
speed_factor = sync_time / async_time
# Print the result
print(f"The async query is {speed_factor:.2f} times faster than the sync query.")
Output:
Sync query time: 0.09148645401000977
Async query time: 0.0657045841217041
The async query is 1.39 times faster than the sync query.
We can execute asynchronous queries even after loading the dataset synchronously. In the following example, we perform a BM25 query asynchronously on the dataset ds_bm25, which was loaded synchronously.
result_async_with_bm25 = ds_bm25.query_async(tql_bm25).result()
result_async_with_bm25
Conclusion
This chapter provides a step-by-step guide to building a hybrid search system, starting with BM25 and dense vector retrieval methods and culminating in their integration into a hybrid approach. By combining lexical and semantic retrieval, the hybrid system demonstrates how these methods complement each other to deliver more accurate and flexible results. This progression illustrates the practical value of hybrid search for achieving advanced functionality in modern information retrieval systems.