Creating an Evaluation Dataset

Introduction

In this chapter, we will cover the key ingredient to making RAG evaluation work - creating an evaluation dataset. This is a dataset of (ideally) 100+ Q&A pairs that cover different aspects of our system. Coverage is crucial here: if we only include one or two types of questions, the system will easily break in production when prompted with question types we never tested. This is again very similar to “traditional” ML - you want the distribution of the validation dataset to be as close as possible to the real-world distribution.
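
To make “covering different aspects” concrete, it helps to decide up front what a single record of the evaluation dataset looks like and which question categories you want represented (factual lookups, multi-hop questions, comparisons, out-of-scope questions, and so on). Here is a minimal sketch of one possible record format - the field names and the category taxonomy are just illustrative assumptions, not a fixed standard:

# One possible record format for an entry in the evaluation dataset.
# Field names and the category taxonomy are illustrative assumptions.
eval_example = {
    "question": "What pre-training objectives does BERT use?",
    "reference_answer": "BERT is pre-trained with a masked language model objective "
                        "and a next sentence prediction objective.",
    "source_document": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
    "category": "factual",  # e.g. factual, multi-hop, comparison, out-of-scope
}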

As mentioned in the previous chapter, we can use an LLM to help us create this evaluation dataset. But I advise caution here - in my experience, most fully automated solutions for this do not work well. What works best is a human and an LLM working together: the LLM provides a batch of suggestions, the human reviews them, and iteratively they arrive at the optimal evaluation dataset. Since this is done only once (and possibly updated a few times), we do not need the process to be fully automated. I know it is tempting to be lazy and let an LLM do the work completely, but in this case we are setting the target, the truth we will be optimizing for. And it is crucial that the target is set properly.

Imagine you are practicing archery, but instead of aiming at a standard bullseye, you are told to aim at a moving target. The person setting up the target, however, forgets to align it properly with the actual path it is supposed to take in a real competition. You practice diligently, day after day, and become absolutely confident in your ability to hit the moving target perfectly.

When competition day arrives, you step up with complete assurance, only to discover that the real target moves along a completely different path. Despite all your confidence, your arrows miss the mark every single time. All of your hard work has been wasted because you were optimizing for the wrong target.

This is exactly what we want to avoid. In machine learning, as in archery, if the target is not set up as close to the real-world conditions as possible, you risk building a system that performs well in testing but fails catastrophically in production. Overconfidence in a poorly set target can be even more dangerous, as it blinds you to potential issues until it is too late.

Practical example

The Dataset

For this example, we will use the AI-ArXiv dataset from Hugging Face. This dataset contains research papers focused on AI, including their titles, summaries, authors, and full content. It provides a rich source of information suitable for generating diverse and challenging Q&A pairs, which are critical for robust RAG evaluation.

The dataset contains 423 papers, which is an ideal size for experiments - enough noise to challenge the system, but not too costly to run.

Let us load the dataset and explore its structure:

# Import necessary libraries
# !pip install datasets
from datasets import load_dataset
import pandas as pd

# Load the AI-ArXiv dataset
dataset = load_dataset("jamescalam/ai-arxiv")

# Convert the dataset to a Pandas DataFrame
df = pd.DataFrame(dataset['train'])

# Display basic information about the dataset
print(f"Dataset loaded successfully with {len(df)} entries.")
print(df[['title', 'summary']].head())

Initializing the LLM

Creating an evaluation dataset is a one-time task, so it is essential to use the best model available to ensure the quality of the Q&A pairs. A SoTA model like GPT-4o provides the accuracy and depth needed for generating reliable and diverse examples.

Here is the code to initialize the LLM:

# Install necessary library
# !pip install llama_index

from llama_index.llms.openai import OpenAI
import os

# Set your OpenAI API key (replace the placeholder with your actual key)
os.environ["OPENAI_API_KEY"] = "<your-api-key>"

# Initialize the LLM with GPT-4o
llm = OpenAI(model="gpt-4o", temperature=0)

Selecting Papers for QA Generation

To create a robust evaluation dataset, it is crucial to cover a wide range of topics and document types. This ensures that the dataset represents the diversity of questions your system will encounter in production. Ideally, you would sample papers at random and review enough of them to cover as much of the space as possible. For demonstration purposes, we will focus on just two notable papers:

  1. LLaMA: Open and Efficient Foundation Language Models - This paper introduces LLaMA, a family of efficient foundation language models designed for scalability and performance in NLP tasks.

  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - A foundational paper in NLP, BERT revolutionized the field with its pre-training approach, enabling contextual understanding and bidirectional language processing.

# Select specific papers for QA generation
selected_titles = [
    'LLaMA: Open and Efficient Foundation Language Models',
    'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'
]

# Filter the dataset to include only the selected papers
selected_papers = df[df['title'].isin(selected_titles)]

# Display the selected papers
print(selected_papers[['title', 'summary']])
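
In a real project you would more likely sample papers at random instead of hand-picking them, so that the evaluation set covers more of the corpus. A minimal sketch of what that could look like (the sample size and random seed are arbitrary assumptions):

# Randomly sample a subset of papers to cover more of the corpus
sampled_papers = df.sample(n=50, random_state=42)
print(sampled_papers[['title']].head())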

Splitting Papers into Chunks

When preparing papers for QA generation, it is essential to understand that the chunking strategy for this task is not necessarily the same as for RAG. The primary goal here is to divide the content into meaningful sections that align with the level of granularity you want for your QA pairs.

If your use case involves highly technical content, where each paragraph is dense with information, smaller chunks may be ideal to generate specific, detailed questions. Conversely, for less dense materials, such as news articles or opinion pieces, you might prefer larger chunks—or even use the full document if it fits within the LLM’s context window.

The chunk size should be driven by the level of granularity you need in the QA dataset. For example:

  • Small Chunks: Best for tasks requiring highly specific questions and answers. These can even be the same size as the chunks used for RAG itself - then you can also evaluate whether the correct chunk was retrieved when answering.
  • Large Chunks: Useful when broader context is needed to generate meaningful questions.

To keep things simple for this demonstration, we will use the entire paper content for QA generation, as it fits comfortably within the LLM’s context window.
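
If you do want to generate questions per chunk instead, a minimal sketch using llama_index's SentenceSplitter could look like the following (the chunk_size and chunk_overlap values are illustrative assumptions - pick them to match the granularity you want):

from llama_index.core.node_parser import SentenceSplitter

# Split one paper into overlapping chunks; the sizes here are illustrative
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=100)

paper_text = selected_papers.iloc[0]["content"]
chunks = splitter.split_text(paper_text)

print(f"Paper split into {len(chunks)} chunks.")
print(chunks[0][:500])  # Preview the beginning of the first chunk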

Generating QA Pairs

Creating QA pairs is one of the most flexible parts of the process. The questions can be tailored to your specific needs—whether they are highly technical, requiring information from multiple sections, or more general in nature. You can generate as many QA pairs as needed to create a comprehensive evaluation dataset.

This is where prompt engineering plays a critical role. Including detailed instructions in the prompt is essential to guide the LLM. You can even add examples of the types of questions and answers you expect in order to get more precise outputs.

For this demonstration, we will ask the LLM to generate 10 highly technical questions for each paper, where answers may require synthesizing information from multiple parts of the text. These QA pairs should be as detailed and specific as possible.

Here’s how we can set this up using llama-index:

# Define the prompt template
qa_prompt_template = """
You are a domain expert tasked with generating technical QA pairs for evaluation purposes.
Based on the following paper, generate 10 highly technical questions. Each question should:
- Require synthesizing information from multiple sections of the paper.
- Be clearly answerable from the content provided.
- Contain precise, detailed phrasing.
For each question, also provide a detailed answer grounded in the paper.

Here is an example of the expected format:
- Q: What are the primary architectural modifications introduced in LLaMA to improve training efficiency and performance?
  A: LLaMA modifies the Transformer architecture with pre-normalization using RMSNorm, SwiGLU activation functions, and rotary positional embeddings, and is trained exclusively on publicly available data.

The paper content:
{paper_content}
"""

# Use the 'selected_papers' DataFrame to extract content
qa_results = []
for _, row in selected_papers.iterrows():
    paper_content = row["content"]
    paper_title = row["title"]
    
    # Generate QA pairs
    prompt = qa_prompt_template.format(paper_content=paper_content)
    response = llm.complete(prompt)  # Send the prompt to the LLM
    
    qa_results.append({
        "title": paper_title,
        "qa_pairs": response.text.strip()  # Clean up the output
    })

# Display QA results
for result in qa_results:
    print(f"QA Pairs for Paper: {result['title']}")
    print(result["qa_pairs"])
    print("\n")

Example output:

QA Pairs for Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Here are 10 highly technical questions based on the paper content:

  • Question: How does BERT's masked language model (MLM) pre-training objective differ from traditional left-to-right language model pre-training, and what advantages does it offer for bidirectional context representation?
  • Answer: BERT's MLM pre-training objective involves randomly masking some input tokens and predicting the original tokens based on their context, allowing the model to incorporate both left and right context. This differs from traditional left-to-right models, which only use preceding context, limiting bidirectional understanding. [9 more QA pairs following]

QA Pairs for Paper: LLaMA: Open and Efficient Foundation Language Models

Based on the provided paper, here are 10 highly technical questions:

  • Question: How does LLaMA's approach to dataset selection and preprocessing differ from that of other large language models like GPT-3 and Chinchilla?
  • Answer: LLaMA exclusively uses publicly available datasets, employing a mixture of sources such as CommonCrawl, C4, GitHub, Wikipedia, and others, with specific preprocessing steps like deduplication, language identification, and quality filtering, unlike GPT-3 and Chinchilla which use proprietary datasets. [9 more QA pairs following]

Refining QA Pairs: From LLM Drafts to Expert Validation

The final and most crucial step in building a high-quality evaluation dataset is refining the generated QA pairs. While LLMs can provide an excellent starting point, domain expertise is essential to ensure that the QA pairs truly align with real-world requirements.

Here’s where you would ideally involve domain experts. For instance, if your RAG system is designed for a law firm, you’d have lawyers review and refine the QA pairs to guarantee relevance and accuracy. However, for the sake of simplicity (and perhaps a touch of laziness), we will assume these QA pairs are good enough. That said, I did spend some time cross-checking the QA pairs with ChatGPT, and the set seems solid.
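
In practice, it also helps to turn the raw LLM output into a structured file that reviewers can edit and annotate. Below is a minimal sketch of one way to do this, assuming the model followed the "Question: ... Answer: ..." format shown above - the parsing pattern, the "reviewed" flag, and the output path are all assumptions you may need to adapt to your actual output:

import json
import re

# Parse the raw LLM output into structured records for expert review.
# Assumes each pair appears as "Question: ..." followed by "Answer: ...";
# adjust the pattern if your model formats its output differently.
structured_qa = []
for result in qa_results:
    pairs = re.findall(
        r"Question:\s*(.+?)\s*Answer:\s*(.+?)(?=Question:|\Z)",
        result["qa_pairs"],
        flags=re.DOTALL,
    )
    for question, answer in pairs:
        structured_qa.append({
            "paper_title": result["title"],
            "question": question.strip(),
            "answer": answer.strip(),
            "reviewed": False,  # flipped to True once a domain expert signs off
        })

# Save as JSON so reviewers can comment on or correct individual pairs
with open("eval_dataset_draft.json", "w") as f:
    json.dump(structured_qa, f, indent=2)

print(f"Saved {len(structured_qa)} QA pairs for review.")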

Conclusion

Creating a robust evaluation dataset is a foundational step for building reliable RAG systems. By combining domain expertise with the power of SoTA LLMs, we can generate high-quality QA pairs that cover diverse aspects of the system’s capabilities. The key takeaway is that this process requires thoughtful planning:

  • Use a diverse dataset: Ensure your evaluation dataset reflects the variety of questions and contexts your system will encounter in production.
  • Leverage LLMs for QA generation: Prompt engineering can help guide LLMs to produce precise and meaningful QA pairs.
  • Validate with experts: While LLMs can provide a great starting point, human validation ensures alignment with real-world needs.

Although we have demonstrated a simplified process here, remember that this is just the starting point. You can iterate and improve the dataset over time, incorporating feedback from users and stakeholders. A well-designed evaluation dataset sets the foundation for optimizing your RAG system effectively and ensures it performs reliably in real-world scenarios.

Jupyter: Google Colab

(ask me for access, currently private due to API keys)