Introduction
So far, we've observed that LLMs are improved by incorporating context from a vector database, a typical design approach used in RAG systems for chatbots and question-answering systems.
RAG applications strive to produce outputs that are factually grounded and supported by the context they retrieve. The evaluation process should verify that the output meaningfully incorporates this context rather than merely repeating it, and that responses are comprehensive without being redundant.
RAG Metrics
Broad-to-Specific Perspective vs Building Block Approach
In analyzing RAG metrics, we adopt two contrasting methodologies: the 'broad-to-specific' perspective and the 'building block' approach. The former starts from a general view of the generated output, works backwards through retrieval, and finally focuses on indexing. In contrast, the latter starts with the creation of an index, advances to refining retrieval techniques, and culminates in exploring generation options.
Faithfulness vs Relevancy
Two critical scores in RAG evaluation are faithfulness and answer relevancy. Faithfulness measures the factual accuracy of an answer concerning the retrieved context. Answer relevancy evaluates the pertinence of the answer to the posed question. A high faithfulness score does not guarantee high relevance. For example, an answer that accurately reflects the context but lacks direct relevance to the question would score lower in answer relevance, especially if it includes incomplete or redundant information.
Indexing in Broad-to-Specific Perspective
In the 'broad-to-specific' framework, it is important to understand that errors introduced at indexing time propagate into the subsequent retrieval and generation stages; the perspective itself does not introduce inaccuracies, it simply makes them visible. In practice, comprehensive evaluations covering every stage of the RAG stack are rarely conducted. Evaluations more often rely on fixed contexts or controlled experiments, such as the 'Lost in the Middle' studies, and focus on the accuracy and relevance of the generated output.
Embedding Metrics
For embeddings, evaluation usually relies on brute-force indexing, which does not account for the errors introduced by approximate nearest neighbor (ANN) algorithms. ANN indexes are typically evaluated on the trade-off between accuracy, query processing speed, and recall. Recall, in this context, refers to recovering the true nearest neighbors for a query, as opposed to simply identifying documents marked as 'relevant'.
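As a rough illustration of this notion of recall, the sketch below compares the neighbors returned by a hypothetical approximate index against exact brute-force neighbors. It uses plain NumPy and cosine similarity and is not tied to any particular vector database or ANN library; the perturbed neighbor list simply stands in for an approximate result.
import numpy as np

def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the true top-k neighbors that the approximate index also returned."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# toy corpus and query embeddings (stand-ins for real embedding vectors)
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
query = rng.normal(size=64)

# exact ("brute force") nearest neighbors by cosine similarity
scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
exact_top10 = np.argsort(-scores)[:10]

# pretend this came from an approximate index (e.g. HNSW); here we just perturb the exact list
approx_top10 = exact_top10.copy()
approx_top10[-1] = 999  # one missed neighbor

print(recall_at_k(exact_top10, approx_top10, k=10))  # ~0.9 in this toy setup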
Generation Metrics
The analysis starts broadly, focusing on the overarching goal of RAG applications to produce helpful and contextually supported outputs. It then narrows to specific evaluation metrics, including faithfulness, answer relevancy, and the Sensibleness and Specificity Average (SSA), with a focus on avoiding hallucination in responses.
Google's SSA metric evaluates open-domain chatbot responses for sensibleness (contextual coherence) and specificity (detailed and direct responses). Initially involving human evaluators, this approach aims to ensure outputs are comprehensive yet not overly vague.
Faithfulness Evaluator
Avoiding vague responses is essential, but preventing LLMs from 'hallucinating' is equally crucial. Hallucination refers to the generation of responses that are not grounded in factual content or in the retrieved context. LlamaIndex's FaithfulnessEvaluator measures exactly this, assessing responses by their alignment with the retrieved context.
Faithfulness evaluation considers whether the response matches the retrieved context, aligns with the query, and adheres to the reference answer or guidelines.
Here's an example illustrating how to evaluate a single response for faithfulness:
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.llms import OpenAI
from llama_index.evaluation import FaithfulnessEvaluator
# build service context
llm = OpenAI(model="gpt-4", temperature=0.0)
service_context = ServiceContext.from_defaults(llm=llm)
# load documents and build index (the data directory below is illustrative)
documents = SimpleDirectoryReader("./data").load_data()
vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
# define evaluator
evaluator = FaithfulnessEvaluator(service_context=service_context)
# query index
query_engine = vector_index.as_query_engine()
response = query_engine.query(
"What battles took place in New York City in the American Revolution?"
)
eval_result = evaluator.evaluate_response(response=response)
print(str(eval_result.passing))
The Output:
True
The result returns a boolean value indicating whether the response passed the accuracy and faithfulness checks.
In this example, the RAG system initializes the GPT-4 model within a service context.
Index creation then involves constructing a vector-based index over the loaded documents for storing and retrieving data. The evaluation process can be broken down as follows:
Evaluator Initialization: Setting up an evaluator to assess the accuracy of responses based on the service context:
- The code initializes a FaithfulnessEvaluator, a tool designed to assess the accuracy of responses generated by the language model (GPT-4 in this case).
- The evaluator uses the service_context created earlier, which includes the configured GPT-4 model. This context provides the necessary environment and parameters for the language model to function.
- The primary role of the FaithfulnessEvaluator is to determine how closely the language model's responses adhere to the retrieved context, using the evaluating LLM to compare each generated response against the context retrieved for the query.
Query and Evaluation: The system queries the index about historical events and evaluates the response for its faithfulness to the retrieved information:
- The code queries the created index using a query_engine derived from the VectorStoreIndex.
- The specific query here is about historical events: "What battles took place in New York City in the American Revolution?".
- The query engine processes the query, generates a response, and passes that response to the evaluator for assessment.
- The evaluator then checks the response for faithfulness, i.e. whether it accurately and reliably reflects the retrieved context about the queried topic.
- The result of this evaluation (eval_result) is inspected via eval_result.passing, a boolean indicating whether the response met the evaluator's standards of accuracy and faithfulness.
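Beyond the boolean verdict, the EvaluationResult returned by evaluate_response carries additional fields that are useful when debugging; a minimal sketch, assuming attribute names from recent LlamaIndex versions:
# Inspect the evaluation result in more detail (attribute names may differ across versions)
print(eval_result.passing)   # True / False verdict
print(eval_result.score)     # numeric score, typically 1.0 when passing and 0.0 otherwise
print(eval_result.feedback)  # the evaluating LLM's raw judgement text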
Retrieval Evaluation Metrics
The evaluation of retrieval in RAG systems involves determining the relevance of documents to specific queries.
In Information Retrieval, the main aim is to identify unstructured data that meets a specific information requirement within a database.
Common Information Retrieval metrics for evaluating a retriever include Mean Reciprocal Rank (MRR), Hit Rate, MAP, and NDCG:
- MAP (Mean Average Precision): is a measure of ranking quality across multiple queries. MAP calculates the mean of the average precisions for each query, where the average precision is computed as the mean of the precision scores after each relevant document is retrieved.
- NDCG (Normalized Discounted Cumulative Gain): this metric evaluates the ranking of documents based on their relevance, giving more importance to relevant documents that appear higher in the rank. It is normalized so that the perfect ranking's score is 1, allowing for comparison across different sets of queries.
- MRR is a measure of the retrieval system's ability to return the best result as high up in the ranking as possible.
- Hit Rate evaluates whether relevant items are present within the top results returned, which is crucial when users only consider the first few results:
from llama_index.evaluation import RetrieverEvaluator
# define retriever from index
retriever = index.as_retriever(similarity_top_k=2)
retriever_evaluator = RetrieverEvaluator.from_metric_names(
["mrr", "hit_rate"], retriever=retriever
)
retriever_evaluator.evaluate(
query="query", expected_ids=["node_id1", "node_id2"]
)
The RetrieverEvaluator is set up to assess the performance of a retrieval system tasked with fetching data relevant to user queries from a database or index.
The evaluator uses MRR and Hit Rate to measure how well the retriever performs for a given query and a set of expected results, which serve as the benchmark.
The process compares the retriever's output against the expected nodes to assess relevance and accuracy in the context of the query provided.
While a single-query evaluation is shown here, practical implementations usually require batch evaluation. In batch testing, the retriever is queried with a large set of queries, each paired with its expected results, and its outputs are systematically compared against these predefined answers to measure its overall consistency and reliability.
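To make Hit Rate and MRR concrete, here is a small, library-independent sketch that computes both by hand over a toy batch of queries, assuming each query comes with the retrieved document ids (best first) and the set of expected ids:
def hit_rate(retrieved, expected):
    """1.0 if any expected document appears in the retrieved list, else 0.0."""
    return 1.0 if any(doc_id in expected for doc_id in retrieved) else 0.0

def reciprocal_rank(retrieved, expected):
    """1 / rank of the first relevant document, 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0

# toy batch: (retrieved ids, expected ids) per query
batch = [
    (["n3", "n7"], {"n7"}),  # relevant doc at rank 2
    (["n1", "n2"], {"n1"}),  # relevant doc at rank 1
    (["n5", "n6"], {"n9"}),  # relevant doc missed entirely
]

print("hit_rate:", sum(hit_rate(r, e) for r, e in batch) / len(batch))    # 2/3 ≈ 0.667
print("mrr:", sum(reciprocal_rank(r, e) for r, e in batch) / len(batch))  # (0.5 + 1 + 0) / 3 = 0.5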
The Holistic RAG approach:
The Holistic Approach in RAG system evaluation presents a detailed assessment of individual components and the system as a whole.
Setting baseline values for elements like chunking logic and embedding models, and then examining each part independently as well as end-to-end, is key to understanding the impact of modifications on the system's overall performance.
Many of these modules do not require ground-truth labels, as they can evaluate based on the query, retrieved context, response, and LLM interpretations alone.
The holistic evaluation modules cover:
Correctness: Checks whether the generated answer matches the reference answer for the given query (labels required); accuracy is verified by comparing the generated answer directly to the provided reference.
Semantic Similarity: Assesses whether the predicted answer is semantically close to the reference answer. The evaluation goes beyond literal matching: reference labels are still required, but the focus is on the nuances of language, ensuring that the meaning, not just the wording, matches the reference.
Faithfulness: Determines whether the answer is accurate and free of fabrications relative to the retrieved contexts, ensuring the answer faithfully represents the retrieved information without distortions that could misrepresent the source material.
Context Relevancy: Measures how relevant the retrieved context and the resulting answer are to the original query, ensuring that the system retrieves only information pertinent to the user's request.
Guideline Adherence: Determines whether the predicted answer follows a set of guidelines, i.e. whether the response meets predefined criteria encompassing stylistic, factual, and ethical standards, so the answer addresses the query while also aligning with established norms.
Notebooks demonstrating these modules are linked in the resources section; a minimal sketch of how they can be instantiated is shown below.
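As a minimal sketch of how these modules map onto LlamaIndex's evaluation classes, reusing the service_context defined earlier (import paths and constructor arguments may differ slightly between versions, and the guidelines string is only an example):
from llama_index.evaluation import (
    CorrectnessEvaluator,         # needs a reference answer (labels required)
    SemanticSimilarityEvaluator,  # compares the predicted answer to the reference by embedding similarity
    FaithfulnessEvaluator,        # checks grounding in the retrieved context
    RelevancyEvaluator,           # checks relevance of context/answer to the query
    GuidelineEvaluator,           # checks adherence to free-form guidelines
)

correctness = CorrectnessEvaluator(service_context=service_context)
similarity = SemanticSimilarityEvaluator(service_context=service_context)
faithfulness = FaithfulnessEvaluator(service_context=service_context)
relevancy = RelevancyEvaluator(service_context=service_context)
guidelines = GuidelineEvaluator(
    service_context=service_context,
    guidelines="Answers must be concise, grounded in the provided context, and free of speculation.",
)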
Golden Context Dataset
The Golden Context dataset would consist of carefully selected queries paired with an ideally matched set of sources that contain the answers. Optionally, it could also include the perfect answers that are expected to be generated by the LLM.
For our purposes, a total of 177 representative user queries have been manually curated. For each query, the most relevant source within our documentation has been diligently identified, so that these sources directly address the queries in question.
The Golden Context Dataset serves as our benchmark for precision evaluation. The dataset is structured around 'question' and 'source' pairings.
To create a Golden Dataset, gather a set of realistic customer questions and pair them with expert answers, then compare the language model's responses against this dataset for quality assurance, ensuring that the LLM's answers align closely with the expert ones in accuracy and relevance. The creation of such a dataset is explained in more depth via the link in the resources section.
Once the golden dataset is ready, the next step is to use it to measure the quality of LLM responses.
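As an illustration only (the field names, paths, and structure below are assumptions for this sketch, not a fixed schema), a golden dataset can be represented as a list of question/source pairs, and the RAG system's answers collected for later scoring against it:
# Illustrative golden dataset: each entry pairs a question with its ideal source
# (and optionally an expert answer); field names are an assumption for this sketch.
golden_dataset = [
    {
        "question": "How do I create a vector index from local documents?",
        "source": "docs/getting_started/indexing.md",
        "expert_answer": "Load the documents with SimpleDirectoryReader and pass them to VectorStoreIndex.",
    },
    # ... the remaining curated query/source pairs
]

# Collect the RAG system's answers and retrieved sources so they can be scored later
predictions = []
for item in golden_dataset:
    response = query_engine.query(item["question"])
    predictions.append(
        {
            "question": item["question"],
            "expected_source": item["source"],
            "retrieved_sources": [n.node.metadata.get("file_path") for n in response.source_nodes],
            "answer": response.response,
        }
    )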
After each evaluation, metrics like the following will be available to quantify the user experience. For example:
Similarity | Relevance | Coherence | Groundedness
3.7 | 77 | 88 | 69
Community-Based Evaluation Tools
LlamaIndex incorporates a variety of evaluation tools designed to foster community engagement and collaborative endeavors.
The tools are structured to support a shared process of assessing and enhancing the system, empowering both users and developers to play an active role in the evaluation.
Through the use of tools shaped by community input, LlamaIndex creates a collaborative environment where constant feedback is smoothly incorporated, contributing to the continual development.
Notable tools in this ecosystem include:
- Ragas: A framework for evaluating RAG pipelines that integrates with LlamaIndex, offering detailed metrics.
- DeepEval: A tool designed for in-depth evaluation, facilitating comprehensive assessments of various aspects of the system.
Evaluating with Ragas
The evaluation process involves importing specific metrics from Ragas, such as faithfulness, answer relevancy, context precision, context recall, and harmfulness.
When evaluating using Ragas, the following elements are essential:
- QueryEngine: The primary component and the core of the evaluation process; it is the engine whose performance is assessed.
- Metrics: Ragas provides a range of metrics designed to give a nuanced assessment of the engine's capabilities.
- Questions: A curated set of questions is required to probe the engine's ability to retrieve and generate accurate responses.
Example questions are used here, but ideally questions gathered from real-world production traffic should be used for a more accurate picture of performance:
eval_questions = [
"What is the population of New York City as of 2020?",
"Which borough of New York City has the highest population?",
"What is the economic significance of New York City?",
"How did New York City get its name?",
"What is the significance of the Statue of Liberty in New York City?",
]
eval_answers = [
"8,804,000", # incorrect answer
"Queens", # incorrect answer
"New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
"New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
"The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]
eval_answers = [[a] for a in eval_answers]
This is the setup part of the evaluation process: the QueryEngine is assessed on how well it handles and responds to these specific questions, using the answers as a benchmark for performance.
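The query_engine under evaluation is assumed to already exist. For completeness, here is a sketch of how such an engine might be built over a New York City document (the file path and model choice are illustrative, not part of the Ragas setup itself):
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI

# Build the engine under evaluation (the source file and path are illustrative)
documents = SimpleDirectoryReader(input_files=["./data/nyc_wikipedia.txt"]).load_data()
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"))
vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = vector_index.as_query_engine()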
Now let's import the metrics:
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from ragas.metrics.critique import harmfulness
metrics = [
faithfulness,
answer_relevancy,
context_precision,
context_recall,
harmfulness,
]
The metrics list compiles the metrics into a single collection, which is then used in the evaluation process to assess various aspects of the QueryEngine's performance. The results, which include scores for each metric, can be analyzed further.
Finally, let's run the evaluation:
from ragas.llama_index import evaluate
result = evaluate(query_engine, metrics, eval_questions, eval_answers)
evaluating with [faithfulness]
100%|█████████████████████████████████████████████████████████████| 1/1 [00:51<00:00, 51.40s/it]
evaluating with [answer_relevancy]
100%|█████████████████████████████████████████████████████████████| 1/1 [00:09<00:00, 9.64s/it]
evaluating with [context_precision]
100%|█████████████████████████████████████████████████████████████| 1/1 [00:41<00:00, 41.21s/it]
evaluating with [context_recall]
100%|█████████████████████████████████████████████████████████████| 1/1 [00:34<00:00, 34.97s/it]
evaluating with [harmfulness]
100%|█████████████████████████████████████████████████████████████| 1/1 [00:13<00:00, 13.98s/it]
# print the final scores
print(result)
{'ragas_score': 0.5142, 'faithfulness': 0.7000, 'answer_relevancy': 0.9550, 'context_precision': 0.2335, 'context_recall': 0.9800, 'harmfulness': 0.0000}
The metrics quantify different aspects of the RAG system's performance:
ragas_score: 0.5142 - This is an overall score calculated by the Ragas evaluation system as an aggregate of the other metrics. A score of 0.5142 suggests a moderate overall level of quality.
faithfulness: 0.7000 - Measures how accurately the system's responses adhere to the factual content of the source material. A score of 0.7 indicates relatively high faithfulness, meaning the responses are mostly accurate and true to the source.
answer_relevancy: 0.9550 - Measures how relevant the system's responses are to the given queries. A high score of 0.955 suggests that most responses are closely aligned with the queries' intent.
context_precision: 0.2335 - Evaluates the precision of the context used to generate responses. A lower score of 0.2335 indicates that the retrieved context often includes irrelevant information.
context_recall: 0.9800 - Measures the recall rate of relevant context. A high score of 0.98 suggests that the system is very effective at retrieving most of the relevant context.
harmfulness: 0.0000 - Checks the responses for harmful or inappropriate content. A score of 0 implies that no harmful content was generated in the evaluated responses.
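Beyond these aggregate numbers, the per-question scores can be inspected for debugging. Recent Ragas versions expose the result as a pandas DataFrame (treat the exact method and columns as version-dependent):
# Per-question breakdown of the evaluation scores
# (to_pandas is available in recent Ragas versions; exact columns may vary)
result_df = result.to_pandas()
print(result_df.head())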
The Custom RAG Pipeline Evaluation
For an effective evaluation of a custom RAG system, it is important to employ a range of evaluation benchmarks that assess its various facets, such as effectiveness and reliability. This variety of measures guarantees a detailed evaluation and in-depth insight into the system's overall capabilities.
# Setup OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"
import pandas as pd
import nest_asyncio
nest_asyncio.apply()
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.evaluation import generate_question_context_pairs, RetrieverEvaluator
from llama_index.llms import OpenAI
get the dataset:
!mkdir -p 'data/transmission'
!wget 'https://raw.githubusercontent.com/idontcalculate/data-repo/main/venus_transmission.txt' -O 'data/transmission/venus_transmission.txt'
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19241 (19K) [text/plain]
Saving to: ‘data/transmission/venus_transmission.txt’
load the documents:
from llama_index import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_dir="/content/data/transmission")
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")
Loaded 1 docs
Defining LLM, Node Parser, Creating Index from Documents, and Retrieving Response from Query Engine:
The SimpleNodeParser converts documents into a structured format known as nodes and allows customization of how documents are parsed, specifically the chunk size, the overlap between chunks, and the metadata to attach. Each chunk of the document is treated as a node. In this case, the parser is set with a chunk_size of 512, which means each node will consist of roughly 512 tokens from the original document.
Here's a breakdown of the process:
Defining the LLM:
# Define an LLM
llm = OpenAI(model="gpt-3.5-turbo")
Building the Index:
# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(docs)
vector_index = VectorStoreIndex(nodes)
Querying the Engine:
The VectorStoreIndex is converted into a query_engine, which is then used to ask a specific question:
query_engine = vector_index.as_query_engine()
response_vector = query_engine.query("What was The first beings to inhabit the planet?")
response_vector.response
The Output:
The first beings to inhabit the planet were a dinoid and reptoid race from two different systems outside our solar system.
The response generated by the query engine is stored in response_vector. In summary, the document is processed into nodes, indexed, and then queried using a language model.
get the first retrieved node:
# First retrieved node
response_vector.source_nodes[0].get_text()
They had heard of this beautiful new planet. At this time, Earth had two moons to harmonize the weather conditions and control the tides of the large bodies of water. The first beings to inhabit the planet were a dinoid and reptoid race from two different systems outside our solar system. They were intelligent and walked on two legs like humans and were war-like considering themselves to be superior to all other life forms. In the past, the four races of humans had conflicts with them before they outgrew such behavior. They arrived on Earth to rob it of its minerals and valuable gems. Soon they had created a terrible war. They were joined by re- 1 enforcements from their home planets. One set up its base on one of the Earth's moons, the other on Earth. It was a terrible war with advanced nuclear and laser weapons like you see in your science fiction movies. It lasted very long. Most of the life forms lay in singed waste and the one moon was destroyed. No longer interested in Earth, they went back to their planets leaving their wounded behind, they had no use for them. The four races sent a few forces to see if they could help the wounded dinoids and reptilians and to see what they could do to repair the Earth. They soon found that due to the nuclear radiation it was too dangerous on Earth before it was cleared. Even they had to remain so as not to contaminate their own planets. Due to the radiation, the survivors of the dinoids and reptoids mutated into the Dinosaurs and giant reptilians you know of in your history. The humans that were trapped there mutated into what you call Neanderthals. The Earth remained a devastated ruin, covered by a huge dark nuclear cloud and what vegetation was left was being devoured by the giant beings, also humans and animals by some. It was this way for hundreds of years before a giant comet crashed into one of the oceans and created another huge cloud. This created such darkness that the radiating heat of the Sun could not interact with Earth's gravitational field and an ice age was created. This destroyed the mutated life forms and gave the four races the chance to cleanse and heal the Earth with technology and their energy. Once again, they brought various forms of life to the Earth, creating again a paradise, except for extreme weather conditions and extreme tidal activities.
This output is useful for understanding which part of the indexed documents (the retrieved nodes) the query engine is referencing in its response, providing insight into the source of the information and the relevance of that node to the query.
now let’s get the output of a second node:
# Second retrieved node
response_vector.source_nodes[1].get_text()
Due to the radiation, the survivors of the dinoids and reptoids mutated into the Dinosaurs and giant reptilians you know of in your history. The humans that were trapped there mutated into what you call Neanderthals. The Earth remained a devastated ruin, covered by a huge dark nuclear cloud and what vegetation was left was being devoured by the giant beings, also humans and animals by some. It was this way for hundreds of years before a giant comet crashed into one of the oceans and created another huge cloud. This created such darkness that the radiating heat of the Sun could not interact with Earth's gravitational field and an ice age was created. This destroyed the mutated life forms and gave the four races the chance to cleanse and heal the Earth with technology and their energy. Once again, they brought various forms of life to the Earth, creating again a paradise, except for extreme weather conditions and extreme tidal activities. During this time they realized that their planets were going into a natural dormant stage that they would not be able to support physical life. So they decided to colonize the Earth with their own people. They were concerned about the one moon, because it is creating earthquakes and tidal waves and storms and other difficulties for the structure of the Earth. They knew how to drink fluids to protect and balance themselves. These were the first colonies like Atlantis and Lemuria. The rest of the people stayed on their planets to await their destiny. They knew that they would perish and die. They had made the decision only to bring the younger generation with some spiritual teachers and elders to the Earth. The planet was too small for all of them. But they had no fear of death. They had once again created a paradise. They were instructed to build special temples here as doorways to the other dimensions. Because of the aggressive beings, the temples were hidden for future times when they will be important. There they could do their meditations and the higher beings. They were informed to build two shields around the Earth out of ice particles to balance the influence of the one moon. They created a tropical climate for the Earth. There were no deserts at that time. They have special crystals for these doorways and they were able to lower their vibration to enter through these doorways. The news spread of the beautiful planet.
You can view the textual information from the second node that the query engine found relevant, providing additional context or information in response to the query. This helps in understanding the breadth of information the query engine is pulling from and how different parts of the indexed documents contribute to the overall response.
we can generate the q&a dataset now:
The LLM generates questions based on the content of each node: for each node, two questions are created, resulting in a dataset where each item consists of a context (the node's text) and a corresponding set of questions.
The Q&A dataset will serve to evaluate the capabilities of the RAG system in question generation and context understanding tasks.
qa_dataset = generate_question_context_pairs(
nodes,
llm=llm,
num_questions_per_chunk=2
)
100%|██████████| 13/13 [00:31<00:00, 2.46s/it]
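Before running the retriever evaluation, it can help to sanity-check the generated dataset. The object returned by generate_question_context_pairs exposes queries, corpus, and relevant_docs mappings (attribute names as in recent LlamaIndex versions):
# Quick sanity check of the generated Q&A dataset
sample_query_id = list(qa_dataset.queries.keys())[0]
print("query:  ", qa_dataset.queries[sample_query_id])
# the node (context chunk) this query was generated from
print("context:", qa_dataset.corpus[qa_dataset.relevant_docs[sample_query_id][0]][:200])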
specify the similarity top-k for the retriever and set up the evaluator:
retriever = vector_index.as_retriever(similarity_top_k=2)
retriever_evaluator = RetrieverEvaluator.from_metric_names(
["mrr", "hit_rate"], retriever=retriever
)
# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
def display_results(name, eval_results):
"""Display results from evaluate."""
metric_dicts = []
for eval_result in eval_results:
metric_dict = eval_result.metric_vals_dict
metric_dicts.append(metric_dict)
full_df = pd.DataFrame(metric_dicts)
hit_rate = full_df["hit_rate"].mean()
mrr = full_df["mrr"].mean()
metric_df = pd.DataFrame(
{"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
)
return metric_df
display_results("OpenAI Embedding Retriever", eval_results)
| | Retriever Name | Hit Rate | MRR |
| 0 | OpenAI Embedding Retriever | 0.846154 | 0.730769 |
Get the list of queries from the created dataset, define service_contexts for the LLMs, and build the query engine:
from llama_index.evaluation import FaithfulnessEvaluator
# the list of queries from the created dataset
queries = list(qa_dataset.queries.values())
# gpt-3.5-turbo
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_gpt35 = ServiceContext.from_defaults(llm=gpt35)
# gpt-4
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)
# evaluator that uses GPT-4 to judge faithfulness
faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)
# index and query engine that use GPT-3.5-turbo to generate answers
vector_index = VectorStoreIndex(nodes, service_context=service_context_gpt35)
query_engine = vector_index.as_query_engine()
eval_query = queries[10]
eval_query
The Output:
How did the colonies respond to the declaration of war by the dark forces, and what measures did they take to protect their knowledge and technology?
compute faithfulness and relevancy:
# Compute faithfulness and relevancy evaluation
from llama_index.evaluation import FaithfulnessEvaluator
from llama_index.evaluation import RelevancyEvaluator
relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)
# Pick a query from the generated dataset (the same one shown above)
query = queries[10]
# Generate response.
# response_vector has response and source nodes (retrieved context)
response_vector = query_engine.query(query)
# Faithfulness evaluation
eval_result = faithfulness_gpt4.evaluate_response(response=response_vector)
# check the passing parameter in eval_result to see if it passed the evaluation
eval_result.passing
# Relevancy evaluation
eval_result = relevancy_gpt4.evaluate_response(
    query=query, response=response_vector
)
# check the passing parameter in eval_result to see if it passed the evaluation
eval_result.passing
True
Two types of evaluations are performed on the response generated by the query engine: faithfulness and relevancy. Each evaluation returns a boolean indicating whether the response meets the established criteria for accuracy and relevance.
Together with the retriever metrics reported earlier (MRR and Hit Rate), this gives a view of both retrieval and generation quality for a single query. To assess performance more reliably, the same evaluators can be run over a batch of queries.
Batch Evaluation:
# BatchEvalRunner computes multiple evaluations in a batch-wise manner.
from llama_index.evaluation import BatchEvalRunner
# Let's pick top 10 queries to do evaluation
batch_eval_queries = queries[:10]
# Initialize BatchEvalRunner to compute faithfulness and relevancy evaluations.
runner = BatchEvalRunner(
{"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
workers=8,
)
# Compute evaluation
eval_results = await runner.aevaluate_queries(
query_engine, queries=batch_eval_queries
)
The batch processing method helps in quickly assessing the system’s performance over a range of different queries.
Faithfulness and Relevancy score:
# get faithfulness score
faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
faithfulness_score
1.0
# get relevancy score
relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])
relevancy_score
1.0
Observation:
A faithfulness score of 1.0 signifies that the generated answers contain no hallucinations and are entirely grounded in the retrieved context.
A relevancy score of 1.0 suggests that the generated answers are consistently aligned with the retrieved context and the queries.
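For a more granular view than these two aggregate scores, the per-query verdicts can be tabulated; this is a sketch assuming the EvaluationResult objects expose query and passing attributes, as they do in recent LlamaIndex versions:
import pandas as pd

# Tabulate per-query pass/fail verdicts for both evaluators
rows = []
for faith_res, rel_res in zip(eval_results["faithfulness"], eval_results["relevancy"]):
    rows.append(
        {
            "query": faith_res.query,
            "faithfulness_passing": faith_res.passing,
            "relevancy_passing": rel_res.passing,
        }
    )
print(pd.DataFrame(rows))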
Conclusion & metrics recap
In this notebook, we have explored how to build and evaluate a RAG pipeline using LlamaIndex, with a specific focus on evaluating the retrieval system and generated responses within the pipeline.
Key Metrics for Information Retrieval and Q&A Pipelines:
For Information Retrieval Pipelines:
🎯 Hit Rate: Calculates the fraction of queries where the correct answer is found within the top-k retrieved documents; in other words, how often the retriever gets it right within its top few guesses.
🎯 Recall: Measures how often the correct document was among the retrieved documents over a set of queries. It is affected by the number of documents the retriever returns.
🎯 Mean Reciprocal Rank (MRR): Considers the position of the first correctly retrieved document for each query and averages the reciprocal of that rank, accounting for the fact that a query elicits multiple responses of varying relevance.
🎯 Mean Average Precision (mAP): Considers the position of every correctly retrieved document. This metric is handy when more than one correct document should be retrieved.
🎯 Normalized Discounted Cumulative Gain (NDCG): A ranking performance measure that weights relevant documents by their position in the search results.
🎯 Precision: Counts how many of all retrieved documents were relevant to the query.
For Question Answering Pipelines:
🎯 Exact Match (EM): Measures the proportion of cases where the predicted answer is identical to the correct answer.
🎯 F1: More forgiving, measures the word overlap between the labeled and the predicted answer.
🎯 Semantic Answer Similarity (SAS): Uses a transformer-based, cross-encoder architecture to evaluate the semantic similarity of two answers rather than their lexical overlap.
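As a self-contained illustration of the two answer-matching metrics above (a simplified version, independent of any specific library), Exact Match and token-level F1 can be computed as follows:
def exact_match(prediction, reference):
    """1.0 when the normalized prediction equals the normalized reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Word-overlap F1 between the predicted and reference answers (simplified, set-based)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Duke of York", "The Duke of York"))           # 1.0
print(token_f1("named after the Duke of York", "the Duke of York"))  # 0.8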
RESOURCES:
- Response Evaluation
- Retrieval Evaluation
- openai-cookbook-eval
- RAG-eval code notebook:
- llamaindex
- golden-dataset
- RAGAS
- RagEvaluatorPack