Introduction
This chapter focuses on designing and evaluating experiments to systematically assess RAG systems. Starting with dataset preparation, we guide you through setting up various query engine configurations and running targeted experiments. The primary emphasis is on evaluation—how to effectively measure and compare the performance of different RAG setups using structured scoring methods. By the end of this chapter, you’ll have a clear framework for experimenting with and analyzing RAG systems.
The evaluation methodology builds on ideas from my ARAGOG paper. While more advanced and automated methods for evaluation now exist, this chapter intentionally focuses on building the process from scratch. By starting from the fundamentals, this approach ensures you understand every step of the experimentation and evaluation workflow. In this context, the ARAGOG paper serves as a perfect foundation, as its "from-scratch" design aligns with the goals of this chapter.
Load the Dataset
For this example, we will use the AI-ArXiv dataset from Hugging Face (same as in the chapter on creating eval dataset). This dataset contains research papers focused on AI, including their titles, summaries, authors, and full content. It provides a rich source of information suitable for generating diverse and challenging Q&A pairs, which are critical for robust RAG evaluation.
The dataset contains 423 papers, which is a practical size for experiments: enough material to challenge the retriever with realistic noise, but not so much that running experiments becomes costly.
Let’s load the dataset and explore its structure:
# Import necessary libraries
from datasets import load_dataset
import pandas as pd
# Load the AI-ArXiv dataset
dataset = load_dataset("jamescalam/ai-arxiv")
# Convert the dataset to a Pandas DataFrame
df = pd.DataFrame(dataset['train'])
# Display basic information about the dataset
print(f"Dataset loaded successfully with {len(df)} entries.")
print(df[['title', 'summary']].head())
Loading QA Pairs
We will use QA pairs from the ARAGOG paper. These questions are based on the AI-ArXiv dataset and are designed to evaluate RAG systems comprehensively, covering both simple and complex queries. If you want to create your own QA pairs, follow the instructions in the chapter on creating an eval dataset [LINK].
Code to Load QA Pairs:
import json
# Specify the path to the local JSON file (download from https://github.com/predlico/ARAGOG/blob/main/eval_questions/benchmark.json)
file_path = "benchmark.json"
# Load the JSON data
with open(file_path, "r") as f:
    data = json.load(f)

# Check the structure of the loaded data
if "questions" in data and "ground_truths" in data:
    questions = data["questions"]
    answers = data["ground_truths"]

    # Combine questions and answers into a list of dictionaries
    qa_pairs = [{"question": q, "answer": a} for q, a in zip(questions, answers)]
    print(f"Successfully loaded {len(qa_pairs)} QA pairs.\n")

    # Display a few QA pairs
    for qa in qa_pairs[:3]:  # Display the first 3 QA pairs
        print(f"Question: {qa['question']}")
        print(f"Answer: {qa['answer']}\n")
else:
    print("The JSON file does not contain 'questions' and 'ground_truths' keys. Please verify the structure.")
Sample QA Pairs
- Question: "What are the two main tasks BERT is pre-trained on?"
  Answer: "Masked LM (MLM) and Next Sentence Prediction (NSP)."
- Question: "What model sizes are reported for BERT, and what are their specifications?"
  Answer: "BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M)."
- Question: "How does BERT's architecture facilitate the use of a unified model across diverse NLP tasks?"
  Answer: "BERT uses a multi-layer bidirectional Transformer encoder architecture, allowing for minimal task-specific architecture modifications in fine-tuning."
Initializing the LLMs
In this section, we will initialize the large language models (LLMs) that will be used for our experiments. For this example, we will test two versions of the GPT-4o model: the full-sized version (gpt-4o) and the smaller, more efficient version (gpt-4o-mini). Using both models allows us to compare their performance on the same evaluation dataset, providing insights into how model size impacts RAG tasks.
from llama_index.llms.openai import OpenAI
import os
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = ""
# Initialize the LLM with GPT-4o
llm_gpt4o = OpenAI(model="gpt-4o", temperature=0)
llm_gpt4o_mini = OpenAI(model="gpt-4o-mini", temperature=0)
Creating the Index
The first step in building our RAG system is to create an index, which will serve as the foundation for retrieving relevant chunks of information during query processing. This involves converting the dataset into a format that can be efficiently searched and ranked based on relevance.
In this example, we start by preparing the dataset as a list of Document objects, where each document corresponds to an entry from the dataset. These documents are then split into smaller, overlapping chunks using a token-based splitter.
Next, we set up the embedding model, which transforms text into vector representations. These embeddings are used to compare and retrieve the most relevant chunks from the index. Finally, we create the VectorStoreIndex, the simplest possible index (LINK to the chapter on Deep Lake and its benefits).
import nest_asyncio
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core import Document, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# Required so use_async=True works inside a notebook's running event loop
nest_asyncio.apply()

# Prepare document objects from the dataset for indexing
documents = [Document(text=content) for content in df['content']]

# Split documents into 512-token chunks with a small overlap
parser = TokenTextSplitter(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)

# Setup the embedding model
embed_model = OpenAIEmbedding(model="text-embedding-3-large")

# Build the vector index (async embedding speeds up indexing of large datasets)
index = VectorStoreIndex(nodes, embed_model=embed_model, use_async=True)
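Before building query engines on top of this index, it can be useful to sanity-check retrieval directly. The following is a minimal sketch (the sample question comes from the QA pairs loaded earlier):
# Quick retrieval sanity check: fetch the top chunks for a sample question
# and inspect their similarity scores and content.
retriever = index.as_retriever(similarity_top_k=3)
retrieved_nodes = retriever.retrieve(qa_pairs[0]["question"])
for node_with_score in retrieved_nodes:
    print(f"Score: {node_with_score.score:.3f}")
    print(node_with_score.node.get_content()[:200], "...\n")
If the retrieved chunks look unrelated to the question, revisit the chunking and embedding setup before moving on.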
Setting Up the Prompt Template for Answering
A well-designed prompt is critical for guiding the language model to generate accurate, context-based answers. In this step, we define a PromptTemplate that ensures the model adheres to specific rules while answering queries.
The template explicitly instructs the model to rely solely on the provided context for its responses, avoiding the use of any prior knowledge. Additionally, it emphasizes succinctness, restricting answers to a maximum of two sentences and 250 characters. By prohibiting direct references to the context (e.g., "Based on the context..."), the prompt ensures that the answers remain clear and professional. This structured approach to prompt engineering enhances the reliability and factual accuracy of the system.
from llama_index.core import PromptTemplate
text_qa_template = PromptTemplate("""You are an expert Q&A system that is trusted around the world for your factual accuracy.
Always answer the query using the provided context information, and not prior knowledge. Ensure your answers are fact-based and accurately reflect the context provided.
Some rules to follow:
1. Never directly reference the given context in your answer.
2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines.
3. Focus on succinct answers that provide only the facts necessary, do not be verbose. Your answers should be max two sentences, up to 250 characters.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: """)
Setting Up Experiments
In this section, we set up various experiments to evaluate different query engine configurations. These experiments explore a combination of query transformation techniques, post-processing strategies, and language models to assess their impact on retrieval and answering performance.
Experiment Overview
- NAIVE Query Engine: The simplest configuration, where the query engine retrieves the top-k most relevant chunks and directly generates answers using the selected LLM (e.g., GPT-4o or GPT-4o-mini).
- HyDE Transformation: The Hypothetical Document Embeddings (HyDE) approach expands the query by generating a hypothetical answer using the LLM and then embedding it alongside the original query. This enriched embedding improves the retrieval process (see the short sketch after this list).
- LLM Reranker: A post-processing step where the top retrieved documents are reranked based on their relevance to the query. The reranker uses the LLM to assign relevance scores, ensuring that only the most pertinent chunks are used for answering.
- Combination of HyDE and LLM Reranker: Combines the benefits of both techniques: query expansion for better retrieval and reranking for improved selection of relevant information.
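To see what HyDE actually does before wiring it into a query engine, you can apply the transform to a single question and inspect the hypothetical document it generates. This is a minimal sketch that mirrors the setup shown later in this section (the sample question is taken from the QA pairs):
# HyDE in isolation: the transform asks the LLM for a hypothetical answer,
# and that answer's text is what gets embedded for retrieval.
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
hyde = HyDEQueryTransform(include_original=True)
query_bundle = hyde("What are the two main tasks BERT is pre-trained on?")
# embedding_strs contains the hypothetical document plus the original query
# (because include_original=True); these strings are embedded for retrieval.
for text in query_bundle.embedding_strs:
    print(text[:200], "...\n")
Depending on your LlamaIndex version, the transform may accept an llm argument; otherwise it falls back to the library's default LLM.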
Model Configurations
We run these experiments for two LLMs:
- GPT-4o: A high-performance language model with enhanced capabilities.
- GPT-4o-mini: A smaller variant designed for lower computational overhead.
Simplified Setup
While we demonstrate four configurations here, the possibilities for experiments are vast. For example, you could:
- Test multiple language models beyond GPT-4o variants.
- Explore different vector stores like graph-based RAG or hybrid setups.
- Experiment with a variety of prompt templates.
- Incorporate diverse index structures or similarity metrics.
In real-world scenarios, you might have dozens of experiments to optimize every aspect of the system, from retrieval accuracy to computational efficiency. This setup represents a starting point, and as your system grows, so will the complexity of your experimental framework.
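As the number of configurations grows, it can help to enumerate them programmatically rather than writing each engine out by hand. The sketch below is illustrative only; the rest of this chapter keeps the eight explicit configurations for clarity.
# Sketch: enumerate experiment configurations as a grid instead of writing
# each combination by hand. With 2 LLMs x HyDE on/off x rerank on/off,
# this yields the same eight configurations built explicitly below.
llms = {"gpt-4o": llm_gpt4o, "gpt-4o-mini": llm_gpt4o_mini}
experiment_grid = [
    {"llm": llm_name, "hyde": use_hyde, "rerank": use_rerank}
    for llm_name in llms
    for use_hyde in (False, True)
    for use_rerank in (False, True)
]
for config in experiment_grid:
    print(config)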
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
from llama_index.core.postprocessor import LLMRerank
from llama_index.core import Settings
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
# GPT-4o
## NAIVE
query_engine_naive_4o = index.as_query_engine(
    llm=llm_gpt4o,
    text_qa_template=text_qa_template,
    similarity_top_k=3,
    embed_model=embed_model,
)
## HYDE
hyde = HyDEQueryTransform(include_original=True)
query_engine_hyde_4o = TransformQueryEngine(query_engine_naive_4o, hyde)
## LLM rerank
llm_rerank = LLMRerank(choice_batch_size=10, top_n=3)
query_engine_llm_rerank_4o = index.as_query_engine(
    similarity_top_k=10,
    text_qa_template=text_qa_template,
    node_postprocessors=[llm_rerank],
    embed_model=embed_model,
    llm=llm_gpt4o,
)
## HyDE + LLM rerank
query_engine_hyde_llm_rerank_4o = TransformQueryEngine(query_engine_llm_rerank_4o, hyde)
# GPT-4o-mini
## NAIVE
query_engine_naive_mini = index.as_query_engine(
    llm=llm_gpt4o_mini,
    text_qa_template=text_qa_template,
    similarity_top_k=3,
    embed_model=embed_model,
)
## HYDE
query_engine_hyde_mini = TransformQueryEngine(query_engine_naive_mini, hyde)
## LLM rerank
query_engine_llm_rerank_mini = index.as_query_engine(
    similarity_top_k=10,
    text_qa_template=text_qa_template,
    node_postprocessors=[llm_rerank],
    embed_model=embed_model,
    llm=llm_gpt4o_mini,
)
## HyDE + LLM rerank
query_engine_hyde_llm_rerank_mini = TransformQueryEngine(query_engine_llm_rerank_mini, hyde)
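Before looping over every engine and question in the next section, it is worth smoke-testing a single engine on a single question; a minimal check:
# Smoke test: run one engine on one question to confirm the setup works
# end-to-end before launching the full experiment loop.
sample_question = qa_pairs[0]["question"]
response = query_engine_naive_4o.query(sample_question)
print("Q:", sample_question)
print("A:", response)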
Running the Experiments
This section describes how we run the experiments using the query engines set up earlier. The goal is to evaluate how each configuration performs when answering a subset of questions from the QA dataset. This process provides initial insights into the effectiveness of various retrieval and answering strategies.
Workflow
- Query Engines: We define a dictionary of query engine configurations, each representing a unique combination of techniques (e.g., naive retrieval, HyDE query expansion, LLM reranking).
- Subset Evaluation: To simplify and speed up the evaluation process, we use a small subset of the QA dataset (the first two questions and answers). This allows us to validate the experiment setup before scaling to the full dataset.
- Execution: Each query engine is used to answer the selected questions. The results are stored in a DataFrame for further analysis.
- Saving Results: The outputs are saved as a CSV file, enabling you to review and analyze the answers generated by each query engine.
Considerations
Running experiments on a subset is a good practice to debug and validate configurations before scaling to the entire dataset. Once validated, the same workflow can be extended to process the complete QA dataset or include additional query engines. This modular approach keeps the system flexible and efficient as the experiments grow in scope and complexity.
# Query engines (assuming they are already initialized)
query_engines = {
    "naive_4o": query_engine_naive_4o,
    "hyde_4o": query_engine_hyde_4o,
    "llm_rerank_4o": query_engine_llm_rerank_4o,
    "hyde_llm_rerank_4o": query_engine_hyde_llm_rerank_4o,
    "naive_mini": query_engine_naive_mini,
    "hyde_mini": query_engine_hyde_mini,
    "llm_rerank_mini": query_engine_llm_rerank_mini,
    "hyde_llm_rerank_mini": query_engine_hyde_llm_rerank_mini,
}

# Create a DataFrame with only the subset of QA pairs
results_df = pd.DataFrame(qa_pairs[:2])  # Limit DataFrame to the first 2 pairs

# Iterate through engines and run queries
for engine_name, engine in query_engines.items():
    print(f"Running queries for engine: {engine_name}")
    results = []
    for idx, qa in enumerate(qa_pairs[:2]):  # Limit to the first 2 QA pairs
        print(f"Querying question {idx + 1}: {qa['question']}")
        response = engine.query(qa["question"])
        results.append(str(response))  # Store the answer text, not the Response object
    results_df[engine_name] = results  # Append results for the current engine
    print(f"Completed queries for engine: {engine_name}\n")

# Save the subset results to a CSV
results_df.to_csv("experiment_results_subset.csv", index=False)
print("Results saved to 'experiment_results_subset.csv'")
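To get a feel for the raw outputs before scoring, you can preview the answers next to the questions and ground truth (shown here for the naive GPT-4o engine; swap in any key from query_engines):
# Preview generated answers alongside the question and ground-truth answer.
print(results_df[["question", "answer", "naive_4o"]].to_string(index=False))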
Evaluation Prompt
To systematically evaluate the quality of the answers generated by different query engines, we use a custom evaluation prompt. This prompt guides the LLM in assigning a numerical score to each answer based on its accuracy, relevance, and completeness compared to the ground truth.
Prompt Details
The evaluation prompt:
- Scale: Scores answers on a scale of 1 to 10.
- 1: The answer is completely incorrect or unrelated.
- 10: The answer is entirely accurate, detailed, and matches the ground truth.
- Criteria:
- Correctness: Does the answer align with the ground truth?
- Completeness: Does the answer cover all key aspects of the question?
- Relevance: Is the answer directly related to the question without irrelevant details?
Structure
The prompt is structured to include:
- Question: The original query posed to the engine.
- Truth: The ground truth answer from the dataset.
- Provided Answer: The answer generated by the query engine.
- Instructions: Clear guidelines for the LLM to evaluate the response objectively.
Considerations for Production Systems
While this setup evaluates answers using a single numerical score, real-world production systems often require more granular evaluation metrics. For example, you may want to assess:
- Accuracy: How factually correct the answer is.
- Completeness: Whether the answer includes all necessary details.
- Relevance: Whether the answer avoids unnecessary or unrelated information.
- Toxicity: Whether the answer contains any inappropriate or harmful content.
To achieve this, you could split the evaluation into multiple prompts, each focusing on a specific metric (and/or use a framework like RAGAS). This allows for a more detailed analysis and better insights into the strengths and weaknesses of each query engine. In this demonstration, we simplify by using a single combined score for clarity and efficiency.
evaluation_prompt = PromptTemplate("""Evaluate the accuracy of the provided answer based on the original question and the ground truth answer. Assign a score on a scale of 1 to 10, where:
- 1 means the answer is completely incorrect and unrelated to the question.
- 10 means the answer is completely accurate, detailed, and matches the ground truth.
### Question:
{question}
### Truth:
{truth}
### Provided Answer:
{new_answer}
### Instructions:
1. Compare the provided answer to the ground truth, considering correctness, completeness, and relevance to the question.
2. Assign a score based on how well the provided answer matches the ground truth.
3. If the provided answer is partially correct or incomplete, reduce the score accordingly.
4. If the provided answer is unrelated to the question or completely incorrect, assign the lowest score (1).
Provide only the numerical score.
""")
Evaluation
With the experiments run and their outputs stored, the next step is to evaluate the quality of the answers generated by each query engine. This process uses the previously defined evaluation prompt to compare each answer against the ground truth and assigns a numerical score on a scale of 1 to 10. The evaluation focuses on measuring accuracy, completeness, and relevance systematically for every engine and question.
Code Overview
- Iteration: The script iterates through all QA pairs and query engines, processing each question-answer pair.
- Prompt Filling: For each pair, the question, ground truth answer, and the generated answer are inserted into the evaluation prompt.
- LLM Evaluation: The prompt is sent to GPT-4o, which serves as the judge, providing a score based on predefined criteria.
- Error Handling: The script incorporates error handling to ensure that individual failures, such as malformed responses, do not disrupt the entire evaluation process.
- Storing Results: Scores are appended as new columns in the results DataFrame, with the complete dataset saved to a CSV file (experiment_results_with_scores.csv) for further analysis.
Using GPT-4o as the Judge
GPT-4o evaluates the generated answers, following the logic that the evaluation model (the "judge") should be more advanced than the models being evaluated (the "workers"). This ensures a higher standard of assessment and nuanced scoring. However, this approach may introduce bias, as GPT-4o might favor responses aligned with its reasoning patterns, potentially scoring its own outputs higher than those of GPT-4o-mini or other models.
Considerations for Production Systems
In production, evaluations often require more nuanced and multidimensional metrics. For example:
- Separate Metrics: Accuracy, relevance, completeness, and toxicity could be assessed independently, each with tailored prompts.
- Diverse Evaluators: Different models or even human reviewers could be incorporated to reduce bias and improve robustness.
# Iterate over the dataset and calculate scores
for engine_name in query_engines.keys():
    scores = []
    for idx, row in results_df.iterrows():
        question = row["question"]
        truth = row["answer"]
        new_answer = row[engine_name]

        # Prepare the prompt with placeholders filled
        formatted_prompt = evaluation_prompt.format(
            question=question,
            truth=truth,
            new_answer=new_answer,
        )

        # Call the LLM to evaluate
        try:
            response_obj = llm_gpt4o.complete(formatted_prompt)

            # Extract the actual text from the CompletionResponse object
            response_text = response_obj.text.strip()

            # Convert the text to a float score
            score = float(response_text)
            scores.append(score)
        except AttributeError as ae:
            print(f"Attribute error for engine {engine_name}, question {idx + 1}: {ae}")
            scores.append(None)
        except ValueError:
            print(f"Value error parsing score for engine {engine_name}, question {idx + 1}: {response_text}")
            scores.append(None)
        except Exception as e:
            print(f"Error evaluating for engine {engine_name}, question {idx + 1}: {e}")
            scores.append(None)

    # Add the scores as a new column to the DataFrame
    results_df[f"{engine_name}_score"] = scores

# Save the scored dataset
results_df.to_csv("experiment_results_with_scores.csv", index=False)
print("Results with scores saved to 'experiment_results_with_scores.csv'")
Results
TBD: to be added after the experiments are run on the full QA dataset.
Conclusion
In this chapter, we explored a hands-on approach to evaluating RAG systems. Starting with the creation of a dataset and index, we defined experiments using different query engine configurations, including naive retrieval, query expansion with HyDE, and reranking with LLMs. Each configuration was tested on a subset of QA pairs, and the results were systematically evaluated using a custom scoring prompt.
This process highlighted the importance of experimentation and provided a practical workflow for comparing model performance. While the evaluation was simplified to a single numerical score, the foundation laid here can be extended with more nuanced metrics and advanced methods.