Introduction
In previous lessons, we learned about advanced techniques and evaluation metrics for LlamaIndex Retrieval-Augmented Generation (RAG) pipelines. Building on this knowledge, we now focus on optimizing a LlamaIndex RAG pipeline through a series of iterative evaluations. We aim to enhance the system's ability to retrieve and generate accurate and relevant information.
Here's our step-by-step plan:
- Baseline Evaluation: Construct a standard LlamaIndex RAG pipeline and establish an initial performance baseline.
- Adjusting TOP_K Retrieval Values: Experiment with different values of k (1, 3, 5, 7) to understand their effect on the accuracy of retrieved information and the relevance of generated answers.
- Testing Different Embedding Models: Evaluate models such as "text-embedding-ada-002" and "cohere/embed-english-v3.0" to identify the most effective one for our pipeline.
- Incorporating a Reranker: Implement a reranking mechanism to refine the document selection process of the retriever.
- Employing a Deep Memory Approach: Investigate the impact of a deep memory component on the accuracy of information retrieval.
Through these steps, we aim to refine our RAG system systematically, enhancing its performance by providing accurate and relevant information.
The code for this lesson is also available through a Colab notebook, where you can follow along.
1. Baseline evaluation
The first step is installing the required Python packages.
!pip3 install deeplake llama_index langchain openai tiktoken cohere pandas torch sentence-transformers
Here, you can set your API keys. You can skip the keys for any services you don't plan to use.
import os
os.environ['OPENAI_API_KEY'] = '<YOUR_OPENAI_API_KEY>'
os.environ['ACTIVELOOP_TOKEN'] = '<YOUR_ACTIVELOOP_KEY>'
os.environ['COHERE_API_KEY'] = '<COHERE_API_KEY>'
We download the data, which is a single text file. You can use this or replace it with your own data.
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'
Let's load the data and build the LlamaIndex nodes/chunks.
from llama_index.node_parser import SimpleNodeParser
from llama_index import SimpleDirectoryReader
# First we create Document LlamaIndex objects from the text data
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
# By default, node/chunk IDs are set to random UUIDs. To ensure the same IDs across runs, we set them manually.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"
print(f"Number of Documents: {len(documents)}")
print(f"Number of nodes: {len(nodes)} with the current chunk size of {node_parser.chunk_size}")
Number of Documents: 1
Number of nodes: 58 with the current chunk size of 512
The next step is to create a LlamaIndex VectorStoreIndex object and use a DeepLakeVectorStore to store the vector embeddings. We also choose gpt-3.5-turbo-1106 as our LLM and OpenAI's embedding model text-embedding-ada-002.
from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms import OpenAI
# Create a local Deep Lake VectorStore
dataset_path = "./data/paul_graham/deep_lake_db"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True, exec_option="compute_engine")
# LLM that will answer questions with the retrieved context
llm = OpenAI(model="gpt-3.5-turbo-1106")
# We use OpenAI's embedding model "text-embedding-ada-002"
embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)
Generating embeddings: 100%
58/58 [00:06<00:00, 8.75it/s]
Uploading data to deeplake dataset.
100%|██████████| 58/58 [00:00<00:00, 169.79it/s]Dataset(path='./data/paul_graham/deep_lake_db', tensors=['text', 'metadata', 'embedding', 'id'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
text text (58, 1) str None
metadata json (58, 1) str None
embedding embedding (58, 1536) float32 None
id text (58, 1) str None
With the vector index, we can now build a QueryEngine, which generates answers with the LLM and the retrieved chunks of text.
query_engine = vector_index.as_query_engine(similarity_top_k=10)
response_vector = query_engine.query("What are the main things Paul worked on before college?")
print(response_vector.response)
Before college, Paul worked on writing and programming.
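To see which chunks the answer was grounded in, you can inspect the response's source nodes. This is a minimal sketch reusing the response_vector object from the previous cell; the attribute names follow the NodeWithScore interface of the llama_index version used in this lesson.
# Inspect the retrieved chunks behind the generated answer.
for source_node in response_vector.source_nodes:
    # Each source node carries the original chunk plus its similarity score.
    print(source_node.node.node_id, round(source_node.score, 3))
    print(source_node.node.get_text()[:150], "...")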
Now that we have a simple RAG pipeline, we can evaluate it. For that, we need a dataset; in this tutorial, we will generate one. LlamaIndex offers a generate_question_context_pairs module specifically for generating question and context pairs, and we will use that dataset to assess the pipeline's chunk retrieval and response capabilities.
Let's also save the generated dataset in JSON format for later use. In this case, we only generate 58 question and context pairs, but you can increase the number of samples for a more thorough evaluation.
from llama_index.evaluation import generate_question_context_pairs
qc_dataset = generate_question_context_pairs(
nodes,
llm=llm,
num_questions_per_chunk=1
)
# We can save the dataset as a json file for later use.
qc_dataset.save_json("qc_dataset.json")
100%|██████████| 58/58 [01:30<00:00, 1.56s/it]
You can load the dataset from your local disk if you have already generated it.
from llama_index.finetuning.embeddings.common import (
EmbeddingQAFinetuneDataset,
)
qc_dataset = EmbeddingQAFinetuneDataset.from_json(
"qc_dataset.json"
)
With the generated dataset, we can start with the retrieval evaluations.
We will make use of the RetrieverEvaluator available in LlamaIndex and measure the Hit Rate and Mean Reciprocal Rank (MRR).
Hit Rate:
Think of the Hit Rate as playing a game of guessing. You're given a question and need to guess the correct answer from a list of options. The Hit Rate measures how often you guess the correct answer by only looking at your top few guesses. You have a high Hit Rate if you often find the right answer in your first few guesses.
So, in a retrieval system, it's about how frequently the system finds the correct document within its top 'k' picks (where 'k' is a number you decide, like top 5 or top 10).
Mean Reciprocal Rank (MRR):
MRR is like measuring how quickly you can find a treasure in a list of boxes. Imagine you have a row of boxes, and only one has a treasure. The MRR calculates how close to the start of the row the treasure box is, on average.
If the treasure is always in the first box you open, you're doing great and have an MRR of 1. If it's in the second box, the score is 1/2, since you took two tries to find it. If it's in the third box, your score is 1/3, and so on. MRR averages these scores across all your searches. So, for a retrieval system, MRR looks at where the correct document ranks in the system's guesses. If it's usually near the top, the MRR will be high, indicating good performance.
In summary, Hit Rate tells you how often the system gets it right in its top guesses, and MRR tells you how close to the top the right answer usually is. Both metrics are useful for evaluating the effectiveness of a retrieval system, like how well a search engine or a recommendation system works.
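To make these definitions concrete, here is a small, self-contained sketch that computes Hit Rate and MRR from ranked retrieval results. The query and node IDs are made up for illustration, and we assume exactly one relevant node per query, as in our generated dataset.
# Toy example: ranked retrieval results and the single relevant node per query.
retrieved = {
    "q1": ["node_3", "node_7", "node_1"],  # relevant node at rank 1
    "q2": ["node_9", "node_2", "node_5"],  # relevant node at rank 2
    "q3": ["node_4", "node_6", "node_8"],  # relevant node not retrieved
}
relevant = {"q1": "node_3", "q2": "node_2", "q3": "node_0"}

hits, reciprocal_ranks = [], []
for query_id, ranked_ids in retrieved.items():
    target = relevant[query_id]
    if target in ranked_ids:
        rank = ranked_ids.index(target) + 1  # 1-based position of the correct node
        hits.append(1)
        reciprocal_ranks.append(1.0 / rank)
    else:
        hits.append(0)
        reciprocal_ranks.append(0.0)

print(f"Hit Rate: {sum(hits) / len(hits):.3f}")  # 2/3 = 0.667
print(f"MRR: {sum(reciprocal_ranks) / len(reciprocal_ranks):.3f}")  # (1 + 0.5 + 0) / 3 = 0.5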
First, we define a function to display the Retrieval evaluation results in table format.
import pandas as pd
def display_results_retriever(name, eval_results):
    """Display results from evaluate."""
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)
    full_df = pd.DataFrame(metric_dicts)
    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )
    return metric_df
Then, run the evaluation procedure.
from llama_index.evaluation import RetrieverEvaluator
# We can evaluate the retrievers with different top_k values.
for i in [2, 4, 6, 8, 10]:
    retriever = vector_index.as_retriever(similarity_top_k=i)
    retriever_evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )
    eval_results = await retriever_evaluator.aevaluate_dataset(qc_dataset)
    print(display_results_retriever(f"Retriever top_{i}", eval_results))
Retriever Name Hit Rate MRR
0 Retriever top_2 0.687943 0.560284
Retriever Name Hit Rate MRR
0 Retriever top_4 0.829787 0.602837
Retriever Name Hit Rate MRR
0 Retriever top_6 0.893617 0.615366
Retriever Name Hit Rate MRR
0 Retriever top_8 0.943262 0.621952
Retriever Name Hit Rate MRR
0 Retriever top_10 0.957447 0.623449
We notice that the Hit Rate increases as the top_k value increases, which is expected: we're increasing the probability of the correct chunk being included in the returned set.
Now, how does that impact the quality of the generated answers?
Evaluation of the Relevancy and Faithfulness metrics:
- Relevancy evaluates whether the retrieved context and the answer are relevant to the query.
- Faithfulness evaluates whether the answer is faithful to the retrieved contexts, in other words, whether there is a hallucination.
LlamaIndex includes functions that evaluate both metrics using an LLM as the judge; GPT-4 will be used as the judge here.
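Before running the batch evaluation below, here is a minimal sketch of how a single response can be judged with these evaluators, reusing the query_engine built earlier. The judge variables are new names introduced for this sketch, and it assumes the evaluate_response method and the passing field exposed by the llama_index version used in this lesson.
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# GPT-4 acts as the judge, with temperature 0 for more deterministic verdicts.
gpt4_judge_context = ServiceContext.from_defaults(llm=OpenAI(temperature=0, model="gpt-4-1106-preview"))
faithfulness_judge = FaithfulnessEvaluator(service_context=gpt4_judge_context)
relevancy_judge = RelevancyEvaluator(service_context=gpt4_judge_context)

query = "What are the main things Paul worked on before college?"
response = query_engine.query(query)

# Faithfulness checks the answer against the retrieved context;
# relevancy checks the answer and context against the query.
print("Faithful:", faithfulness_judge.evaluate_response(response=response).passing)
print("Relevant:", relevancy_judge.evaluate_response(query=query, response=response).passing)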
Now, let's see how the top_k value affects these two metrics.
from llama_index.evaluation import RelevancyEvaluator, FaithfulnessEvaluator, BatchEvalRunner
for i in [2, 4, 6, 8, 10]:
    # Set Faithfulness and Relevancy evaluators
    query_engine = vector_index.as_query_engine(similarity_top_k=i)
    # While we use GPT-3.5-Turbo to answer questions,
    # we can use GPT-4 to evaluate the answers.
    llm_gpt4 = OpenAI(temperature=0, model="gpt-4-1106-preview")
    service_context_gpt4 = ServiceContext.from_defaults(llm=llm_gpt4)
    faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_gpt4)
    relevancy_evaluator = RelevancyEvaluator(service_context=service_context_gpt4)
    # Run evaluation
    queries = list(qc_dataset.queries.values())
    batch_eval_queries = queries[:20]
    runner = BatchEvalRunner(
        {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
        workers=8,
    )
    eval_results = await runner.aevaluate_queries(
        query_engine, queries=batch_eval_queries
    )
    faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
    print(f"top_{i} faithfulness_score: {faithfulness_score}")
    relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])
    print(f"top_{i} relevancy_score: {relevancy_score}")
top_2 faithfulness_score: 0.95
top_2 relevancy_score: 0.95
top_4 faithfulness_score: 0.95
top_4 relevancy_score: 0.95
top_6 faithfulness_score: 0.95
top_6 relevancy_score: 0.95
top_8 faithfulness_score: 1.0
top_8 relevancy_score: 1.0
top_10 faithfulness_score: 1.0
top_10 relevancy_score: 1.0
We can see that both the relevancy and faithfulness scores increase as the top_k value increases, reaching a perfect score with eight or more retrieved chunks as context.
This is the LlamaIndex Relevancy prompt default template.
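If you want to print that template from code, some llama_index versions expose evaluator prompts through a get_prompts() method inherited from PromptMixin; treat the method and the template accessor below as assumptions to verify against your installed version.
# Sketch: print the evaluator's default prompt template(s).
# Assumes RelevancyEvaluator implements get_prompts() (PromptMixin);
# prompt dictionary keys and template accessors vary across releases.
for prompt_name, prompt in relevancy_evaluator.get_prompts().items():
    print(prompt_name)
    print(prompt.get_template() if hasattr(prompt, "get_template") else prompt)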
2. Changing the embedding model
Now that we have the baseline evaluation score, we can start changing some modules of our LlamaIndex RAG pipeline.
We can start by changing the embedding model. Here, we will test Cohere's embedding model embed-english-v3.0.
import os
from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.embeddings.cohereai import CohereEmbedding
from llama_index.llms import OpenAI
# Create another local DeepLakeVectorStore to store the embeddings
dataset_path = "./data/paul_graham/deep_lake_db_1"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False, exec_option="compute_engine")
llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = CohereEmbedding(
cohere_api_key=os.getenv('COHERE_API_KEY'),
model_name="embed-english-v3.0",
input_type="search_document",
)
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)
Generating embeddings: 100%
58/58 [00:02<00:00, 23.68it/s]
Uploading data to deeplake dataset.
100%|██████████| 58/58 [00:00<00:00, 315.69it/s]Dataset(path='./data/paul_graham/deep_lake_db_1', tensors=['text', 'metadata', 'embedding', 'id'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
text text (58, 1) str None
metadata json (58, 1) str None
embedding embedding (58, 1024) float32 None
id text (58, 1) str None
We run the retrieval evaluation using these new embeddings. Note that Cohere's v3 embedding models distinguish between document and query inputs, which is why the code below switches input_type to "search_query" before retrieval.
from llama_index.evaluation import RetrieverEvaluator
embed_model.input_type = "search_query"
retriever = vector_index.as_retriever(similarity_top_k=10, embed_model=embed_model)
retriever_evaluator = RetrieverEvaluator.from_metric_names(
["mrr", "hit_rate"], retriever=retriever
)
eval_results = await retriever_evaluator.aevaluate_dataset(qc_dataset)
print(display_results_retriever(f"Retriever_cohere_embeds", eval_results))
Retriever Name Hit Rate MRR
0 Retriever_cohere_embeds 0.943262 0.648697
With these embeddings, we see a lower Hit Rate but a better MRR value.
3. Incorporating a Reranker
Here, we will be testing three different rerankers that we learned about in previous lessons:
- cross-encoder/ms-marco-MiniLM-L-6-v2 from the Hugging Face Hub
- LlamaIndex's LLMRerank
- Cohere's CohereRerank
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.indices.postprocessor import SentenceTransformerRerank, LLMRerank
st_reranker = SentenceTransformerRerank(
top_n=5, model="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
llm_reranker = LLMRerank(
choice_batch_size=4, top_n=5,
)
cohere_rerank = CohereRerank(api_key=os.getenv('COHERE_API_KEY'), top_n=10)
for reranker in [cohere_rerank, st_reranker, llm_reranker]:
    retriever_with_reranker = vector_index.as_retriever(similarity_top_k=10, postprocessor=reranker, embed_model=embed_model)
    retriever_evaluator_1 = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever_with_reranker
    )
    eval_results1 = await retriever_evaluator_1.aevaluate_dataset(qc_dataset)
    print(display_results_retriever("Retriever with added Reranker", eval_results1))
Retriever Name Hit Rate MRR
0 Retriever with added Reranker 0.943262 0.648697
Retriever Name Hit Rate MRR
0 Retriever with added Reranker 0.943262 0.648697
Retriever Name Hit Rate MRR
0 Retriever with added Reranker 0.943262 0.648697
Here, we don't see a significant improvement in the retriever's performance.
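The identical scores suggest the reranker is not affecting the retriever-level metrics in this setup, since Hit Rate and MRR are computed on the raw retrieved set before any post-processing. A common alternative, shown here only as a sketch and not as the lesson's exact setup, is to attach the reranker to the query engine as a node postprocessor so it reorders the retrieved chunks before the LLM generates an answer:
# Sketch: apply the reranker at query time via node_postprocessors.
# similarity_top_k controls how many candidates the retriever fetches;
# the reranker then keeps and reorders its own top_n of those candidates.
reranked_query_engine = vector_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[cohere_rerank],
)
response = reranked_query_engine.query("What are the main things Paul worked on before college?")
print(response.response)
Measuring the effect of a reranker is then typically done with response-level metrics, such as the faithfulness and relevancy evaluations above, rather than with the retriever evaluator.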
4. Employing Deep Memory
Activeloop's Deep Memory is a feature that introduces a tiny neural network layer trained to match user queries with relevant data from a corpus. While this addition incurs minimal latency during search, it can boost retrieval accuracy by up to 27%.
First, let's reuse and convert our generated dataset into a format Deep Memory expects. We need queries and relevant IDs.
def create_query_relevance(qa_dataset):
    """Convert a LlamaIndex QA dataset to the format expected by Deep Memory training."""
    queries = [text for _, text in qa_dataset.queries.items()]
    relevant_docs = qa_dataset.relevant_docs
    relevance = []
    for doc in relevant_docs:
        relevance.append([(relevant_docs[doc][0], 1)])
    return queries, relevance
train_queries, train_relevance = create_query_relevance(qc_dataset)
print(len(train_queries))
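To sanity-check the conversion, you can peek at one query/relevance pair; the node ID shown in the comment is illustrative, not a guaranteed value.
# Deep Memory expects a list of query strings plus, for each query,
# a list of (node_id, relevance_score) tuples.
print(train_queries[0])    # a generated question string
print(train_relevance[0])  # e.g. [("node_0", 1)] (the exact node ID will vary)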
Now, let's upload our baseline vector store to Activeloop's cloud platform and convert it into a managed database.
import deeplake
local = "./data/paul_graham/deep_lake_db"
hub_path = "hub://genai360/optimization_paul_graham"
hub_managed_path = "hub://genai360/optimization_paul_graham_managed"
# First, upload our local vector store
deeplake.deepcopy(local, hub_path, overwrite=True)
# Then, create a managed vector store
deeplake.deepcopy(hub_path, hub_managed_path, overwrite=True, runtime={"tensor_db": True})
You can replace the paths with your own organization and dataset names.
Let’s create a LlamaIndex RAG pipeline using our new managed vector store.
import os
from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms import OpenAI
vector_store = DeepLakeVectorStore(dataset_path=hub_managed_path, overwrite=False, runtime={"tensor_db": True}, read_only=True)
llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex.from_vector_store(vector_store,service_context=service_context, storage_context=storage_context, use_async=False, show_progress=True)
Deep Lake Dataset in hub://genai360/optimization_paul_graham_managed already exists, loading from the storage
And now we can launch Deep Memory training.
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
job_id = vector_store.vectorstore.deep_memory.train(
queries=train_queries,
relevance=train_relevance,
embedding_function=embeddings.embed_documents,
)
Your Deep Lake dataset has been successfully created!
creating embeddings: 100%|██████████| 1/1 [00:02<00:00, 2.27s/it]
100%|██████████| 100/100 [00:00<00:00, 158.16it/s]
Dataset(path='hub://genai360/optimization_paul_graham_managed_queries', tensors=['text', 'metadata', 'embedding', 'id'])
tensor htype shape dtype compression
------- ------- ------- ------- -------
text text (100, 1) str None
metadata json (100, 1) str None
embedding embedding (100, 1536) float32 None
id text (100, 1) str None
DeepMemory training job started. Job ID: 652dceeed7d1579bf6abf3df
With the job_id, you can track the Deep Memory training progress as follows:
vector_store.vectorstore.deep_memory.status('652dceeed7d1579bf6abf3df')
--------------------------------------------------------------
| 652dceeed7d1579bf6abf3df |
--------------------------------------------------------------
| status | completed |
--------------------------------------------------------------
| progress | eta: 0.9 seconds |
| | |
--------------------------------------------------------------
| results | |
--------------------------------------------------------------
To evaluate our Deep Memory-enabled vector store, we can generate a test dataset. Here, we only use the first 20 chunks to keep things fast, but a larger dataset is recommended for a stronger evaluation.
from llama_index.evaluation import generate_question_context_pairs
# Generate test dataset
test_dataset = generate_question_context_pairs(
nodes[:20],
llm=llm,
num_questions_per_chunk=1
)
test_dataset.save_json("test_dataset.json")
# We can also load the dataset from a json file if already done previously.
from llama_index.finetuning.embeddings.common import (
EmbeddingQAFinetuneDataset,
)
test_dataset = EmbeddingQAFinetuneDataset.from_json(
"test_dataset.json"
)
test_queries, test_relevance = create_query_relevance(test_dataset)
100%|██████████| 20/20 [00:29<00:00, 1.49s/it]
Let's evaluate the recall on the generated test dataset using Deep Lake's evaluation Python function.
# Evaluate recall on the generated test dataset
recalls = vector_store.vectorstore.deep_memory.evaluate(
queries=test_queries,
relevance=test_relevance,
embedding_function=embeddings.embed_documents,
)
Embedding queries took 0.82 seconds
---- Evaluating without Deep Memory ----
Recall@1: 45.2%
Recall@3: 78.6%
Recall@5: 90.5%
Recall@10: 95.2%
Recall@50: 100.0%
Recall@100: 100.0%
---- Evaluating with Deep Memory ----
Recall@1: 45.2%
Recall@3: 83.3%
Recall@5: 92.9%
Recall@10: 95.2%
Recall@50: 100.0%
Recall@100: 100.0%
Recall explained
Definition: Recall measures the proportion of relevant items successfully retrieved by the system from all relevant items available in the dataset.
Formula: Recall is calculated as:
Recall = (number of relevant items retrieved) / (total number of relevant items in the dataset)
Focus: It focuses on the system's ability to find all relevant items. A high recall means the system is good at not missing relevant items.
Compared to Hit Rate: Recall is about the system's thoroughness in retrieving all relevant items, and Hit Rate is about its effectiveness in ensuring that each query retrieves something relevant.
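As a concrete illustration (toy data, independent of the pipeline above), recall@k with a single relevant item per query reduces to checking whether that item appears in the top k results:
# Toy recall@k computation: one relevant node per query, ranked retrieval lists.
retrieved = {
    "q1": ["node_3", "node_7", "node_1", "node_9"],
    "q2": ["node_9", "node_2", "node_5", "node_4"],
}
relevant = {"q1": "node_1", "q2": "node_8"}

def recall_at_k(retrieved, relevant, k):
    hits = sum(1 for query_id, ranked_ids in retrieved.items() if relevant[query_id] in ranked_ids[:k])
    return hits / len(retrieved)

for k in (1, 3, 5):
    print(f"Recall@{k}: {recall_at_k(retrieved, relevant, k):.2f}")
# q1's relevant node sits at rank 3 and q2's was not retrieved,
# so Recall@1 = 0.00 and Recall@3 = Recall@5 = 0.50.
Note that with exactly one relevant item per query, as in our generated dataset, recall@k and Hit Rate at the same k coincide.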
Now, let's get the Hit Rate and MRR scores of our Deep Memory-enabled vector store.
First, let's compute them for our base vector store.
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.evaluation import (
RetrieverEvaluator,
)
base_retriever = vector_index.as_retriever(similarity_top_k=10)
deep_memory_retriever = vector_index.as_retriever(
similarity_top_k=10, vector_store_kwargs={"deep_memory": True}
)
base_retriever_evaluator = RetrieverEvaluator.from_metric_names(
["mrr", "hit_rate"], retriever=base_retriever
)
eval_results = await base_retriever_evaluator.aevaluate_dataset(test_dataset)
print(display_results_retriever("Retriever Results", eval_results))
Retriever Name Hit Rate MRR
0 Retriever Results 0.952381 0.641761
Now, the same evaluation for the Deep Memory Vector Store
deep_memory_retriever = vector_index.as_retriever(
similarity_top_k=10, vector_store_kwargs={"deep_memory": True}
)
dm_retriever_evaluator = RetrieverEvaluator.from_metric_names(
["mrr", "hit_rate"], retriever=deep_memory_retriever
)
dm_eval_results = await dm_retriever_evaluator.aevaluate_dataset(test_dataset)
print(display_results_retriever("Retriever Results", dm_eval_results))
Retriever Name Hit Rate MRR
0 Retriever Results 0.952381 0.661376
We can see an increase in the MRR score compared to the baseline RAG pipeline.
Conclusion
In this lesson, we optimized a LlamaIndex RAG pipeline through a structured series of iterative evaluations aimed at improving retrieval and generation quality.
We adjusted retrieval top_k values, evaluated two embedding models, introduced reranking mechanisms, and integrated Activeloop's Deep Memory, several of which led to performance improvements. We also highlighted the importance of good evaluation tooling, such as a well-curated and sufficiently large evaluation dataset.
RESOURCES
- Colab notebook for the lesson:
- LlamaIndex and Deep Memory integration:
This lesson is based on the LlamaIndex AI-engineer-workshop posted by Disiok.