RAG evaluation tools

Introduction

This chapter covers the main tools that help with RAG evaluation, from automated optimization with AutoRAG to metrics-based frameworks like RAGAS and observability tools like TruLens. We’ll also briefly mention other tools such as Langfuse and LlamaIndex to give you a full picture of what is available.

AutoRAG

AutoRAG brings the concept of AutoML to RAG. It simplifies and automates the process of building and optimizing RAG pipelines, saving a lot of manual work. I came across AutoRAG on Reddit when I was looking for something similar to AutoML for RAG and LLM pipelines. After exchanging ideas with the team, we collaborated on a research paper, and I have been impressed with their work ever since. Their GitHub repo has 2.7k stars, and the tool is incredibly useful.

Key Features

  1. End-to-End Automation: AutoRAG handles the entire RAG pipeline, from retrieval to generation. Once set up, it automatically runs experiments to find the best configuration for your use case.

  2. Simple YAML Configuration: You set everything up with a YAML file (example below). It supports different retrieval and generation strategies, which makes it very flexible.

  3. Metrics-Driven Approach: AutoRAG optimizes pipelines based on a wide range of metrics, from traditional NLP metrics (F1, METEOR) to LLM-based metrics (consistency, relevance).

  4. Automatic Dataset Creation: AutoRAG can even create custom QA datasets for you, although this can be a bit tricky because it does not include human feedback (as discussed in the evaluation chapter).

How AutoRAG Works

AutoRAG follows a clear optimization loop:

  1. Pipeline Definition: Configure nodes (retrieval, generation) and strategies using a YAML file.
  2. Experimentation: The tool tests various configurations, running benchmarks on retrieval and generation outputs.
  3. Evaluation: AutoRAG uses specified metrics to rank configurations and identify the best-performing setup.

For example, the YAML file below defines a two-stage pipeline (retrieval and generation):

node_lines:
- node_line_name: retrieve_node_line  # Set Node Line (Arbitrary Name)
  nodes:
    - node_type: retrieval  # Set Retrieval Node
      strategy:
        metrics: [retrieval_f1, retrieval_recall, retrieval_ndcg, retrieval_mrr]  # Set Retrieval Metrics
      top_k: 3
      modules:
        - module_type: vectordb
          vectordb: default
        - module_type: bm25
        - module_type: hybrid_rrf
          weight_range: (4,80)
- node_line_name: post_retrieve_node_line  # Set Node Line (Arbitrary Name)
  nodes:
    - node_type: prompt_maker  # Set Prompt Maker Node
      strategy:
        metrics:   # Set Generation Metrics
          - metric_name: meteor
          - metric_name: rouge
          - metric_name: sem_score
            embedding_model: openai
      modules:
        - module_type: fstring
          prompt: "Read the passages and answer the given question. \n Question: {query} \n Passage: {retrieved_contents} \n Answer : "
    - node_type: generator  # Set Generator Node
      strategy:
        metrics:  # Set Generation Metrics
          - metric_name: meteor
          - metric_name: rouge
          - metric_name: sem_score
            embedding_model: openai
      modules:
        - module_type: openai_llm
          llm: gpt-4o-mini
          batch: 16

This configuration defines two node lines: one for retrieval and one for post-retrieval (prompt construction and generation). Each node lists its strategy metrics and the candidate modules AutoRAG will compare, all in a straightforward format.

Further Resources

  • GitHub Repository - Detailed documentation and examples for setting up pipelines.
  • Research Paper - Discusses the theoretical foundations and applications of AutoRAG.

Why AutoRAG Is Useful

Once you’ve set up the YAML file, AutoRAG runs everything automatically. It iterates over different configurations, evaluates them, and identifies the best-performing setup. This makes it easy to experiment with various strategies and metrics without writing custom scripts. AutoRAG is highly recommended for efficiently optimizing RAG systems at scale.
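
To give a sense of how this looks in practice, here is a minimal sketch of launching a trial with AutoRAG's Python API, assuming the Evaluator entry point described in the project's README; the exact interface may differ between releases, and the parquet and YAML paths are placeholders for your own files.

# Minimal sketch: launching an AutoRAG trial over the YAML config shown above.
# The Evaluator entry point follows the project's README; paths are placeholders.
from autorag.evaluator import Evaluator

evaluator = Evaluator(
    qa_data_path="data/qa.parquet",          # QA dataset (questions + ground truths)
    corpus_data_path="data/corpus.parquet",  # document corpus to retrieve from
)

# Runs every module combination defined in the YAML file, scores each one with the
# configured metrics, and writes the results (including the best setup) to a trial folder.
evaluator.start_trial("config.yaml")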

In the image below, you can see all the different modules AutoRAG can optimize.

[Image: overview of the modules AutoRAG can optimize]

RAGAS

RAGAS is a lightweight framework designed to evaluate RAG pipelines. Unlike tools that handle retrieval or generation, RAGAS focuses solely on evaluation, making it an essential complement to tools like AutoRAG or LlamaIndex. It provides a structured approach to assess the performance of RAG pipelines across multiple metrics, even when you don’t have labeled datasets.

Key Features

  1. Works with Labeled and Unlabeled Datasets: RAGAS is flexible enough to work with both labeled and unlabeled datasets:
    • With labeled data, you can compare generated answers to predefined ground truths.
    • Without labeled data, RAGAS uses reference-free metrics to evaluate aspects like faithfulness and relevance.
  2. Comprehensive Metrics Suite: RAGAS evaluates multiple dimensions of RAG performance, including:
    • Faithfulness: How well the generated answer reflects the retrieved information.
    • Relevance: How closely the generated answer matches the query’s intent.
    • Context Recall and Relevancy: The quality and relevance of retrieved documents.
  3. Scalable Evaluation: Since it doesn’t rely on human annotations, RAGAS is highly scalable, making it ideal for rapid experimentation and development cycles.

Evaluating with RAGAS: Amnesty QA Dataset Example

RAGAS simplifies evaluation for RAG pipelines by providing a suite of metrics to assess model performance. This example demonstrates how to run a RAGAS evaluation on a real-world dataset, Amnesty QA, which contains questions and answers related to Amnesty International topics.

We will use the following evaluation metrics:

  • Context Precision: Measures the proportion of retrieved context that is relevant to the query.
  • Faithfulness: Evaluates how well the generated answer aligns with the retrieved context.
  • Answer Relevancy: Assesses whether the answer is relevant to the query.
  • Context Recall: Examines the coverage of relevant context retrieved for the query.

# Install required libraries
# !pip install llama-index ragas datasets
from datasets import load_dataset
import os
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas import evaluate

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"

# Load the Amnesty QA dataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")

# Perform evaluation on the 'eval' split
result = evaluate(
    amnesty_qa["eval"],
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

# Convert results to a pandas DataFrame for better readability
df = result.to_pandas()

# Display the first few rows of the evaluation results
print(df.head())

Let us fetch a row with low faithfulness for demonstration:

# Sort by faithfulness score to identify rows with low scores
low_faithfulness = df.sort_values(by="faithfulness").iloc[0]

# Display the row with the lowest faithfulness score
print("Row with the lowest faithfulness score:")
print(low_faithfulness)

# Extract and display relevant fields for analysis
print("\nAnalysis:")
print(f"Query: {low_faithfulness['user_input']}")
print(f"Retrieved Context: {low_faithfulness['retrieved_contexts']}")
print(f"Response: {low_faithfulness['response']}")
print(f"Reference: {low_faithfulness['reference']}")
print(f"Faithfulness Score: {low_faithfulness['faithfulness']}")

Example:

Query:

Which private companies in the Americas are the largest GHG emitters according to the Carbon Majors database?

Retrieved Context:

  1. The issue of greenhouse gas emissions has become a major concern for environmentalists and policymakers alike, as the impact of climate change becomes more apparent. Private companies in the Americas play a significant role in contributing to these emissions, with some being identified as the largest emitters in the region according to the Carbon Majors database.
  2. Reducing greenhouse gas emissions from private companies is a complex challenge that requires cooperation between governments, businesses, and consumers. Many companies are now taking steps to reduce their carbon footprint through initiatives such as investing in renewable energy, improving energy efficiency, and offsetting emissions through carbon credits.
  3. The private companies responsible for the most emissions during this period, according to the database, are from the United States: ExxonMobil, Chevron, and Peabody. The largest emitter amongst state-owned companies in the Americas is Mexican company Pemex, followed by Venezuelan company Petróleos de Venezuela, S.A.

Response:

According to the Carbon Majors database, the largest private companies in the Americas that are the largest GHG emitters are:

  1. Chevron Corporation (United States)
  2. ExxonMobil Corporation (United States)
  3. ConocoPhillips Company (United States)
  4. BP plc (United Kingdom, but with significant operations in the Americas)
  5. Royal Dutch Shell plc (Netherlands, but with significant operations in the Americas)
  6. Peabody Energy Corporation (United States)
  7. Duke Energy Corporation (United States)
  8. TotalEnergies SE (France, but with significant operations in the Americas)
  9. BHP Group Limited (Australia, but with significant operations in the Americas)
  10. Rio Tinto Group (United Kingdom/Australia, but with significant operations in the Americas)

Please note that the rankings may change over time as new data becomes available.

Reference:

The largest private companies in the Americas that are the largest GHG emitters according to the Carbon Majors database are ExxonMobil, Chevron, and Peabody.

Faithfulness Score:

0.0

The faithfulness score of 0.0 reflects the response's significant deviation from the retrieved context, as it introduces unsupported companies not mentioned in the evidence. This highlights the importance of faithfulness as a metric to ensure that generated answers are strictly grounded in the retrieved information.
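
Under the hood, RAGAS computes faithfulness by decomposing the answer into individual claims and measuring the fraction of those claims that the retrieved context supports. The sketch below illustrates that ratio with deliberately naive helpers; the real implementation prompts an LLM for both the claim extraction and the verification step.

# Illustrative sketch of the claim-based idea behind the faithfulness metric.
# extract_claims and is_supported are naive stand-ins for the LLM prompts RAGAS
# uses internally; this is not the library's actual implementation.

def extract_claims(answer: str) -> list[str]:
    # Naive stand-in: treat each sentence as a single claim.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(claim: str, contexts: list[str]) -> bool:
    # Naive stand-in: a claim counts as supported if it appears verbatim in a context.
    return any(claim.lower() in ctx.lower() for ctx in contexts)

def faithfulness_score(answer: str, contexts: list[str]) -> float:
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    supported = sum(is_supported(c, contexts) for c in claims)
    return supported / len(claims)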

TruLens

TruLens is an open-source tool for evaluating and improving RAG pipelines and LLM applications. What sets it apart is its focus on observability: it does not just provide metrics, it also integrates into your development workflow so you can monitor, debug, and iterate on your system in real time.

Key Features

  • Custom Feedback Functions: Evaluate different stages of your pipeline, such as queries, retrievals, and responses, with flexible feedback functions tailored to your needs (a short sketch follows below).
  • Real-Time Insights: TruLens integrates into your workflow, providing live observability for debugging and performance tracking during experiments or production.
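
As a rough illustration, the snippet below defines a single feedback function and attaches it to a LangChain pipeline using the trulens_eval package. Import paths have changed across TruLens releases, and rag_chain stands in for your own chain, so treat this as a sketch rather than a drop-in example.

# Sketch of a TruLens feedback function (trulens_eval 0.x style imports).
# rag_chain is a placeholder for your own LangChain RAG pipeline.
from trulens_eval import Feedback, Tru, TruChain
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()  # reads OPENAI_API_KEY from the environment

# Score how relevant the final answer is to the user's question.
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

# Wrap the app so every call is traced and scored by the feedback function.
tru_recorder = TruChain(rag_chain, app_id="rag_v1", feedbacks=[f_answer_relevance])

with tru_recorder as recording:
    rag_chain.invoke("Which companies are the largest GHG emitters?")

Tru().run_dashboard()  # launches the local observability dashboard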

Video Example

Here’s a video example where Anupam Datta, founder of TruLens, explains the core principles:

Other tools

In addition to the previously discussed tools, several other platforms offer valuable features for evaluating and monitoring RAG systems. Here's an overview of these tools:

LlamaIndex Evaluation Modules

LlamaIndex provides comprehensive evaluation modules tailored for RAG systems. These modules assess both retrieval and generation components, offering metrics such as recall, precision, and context relevance. By integrating these evaluations, developers can fine-tune their RAG pipelines to enhance performance.
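
As a rough sketch, the snippet below runs two of these evaluators on a query engine response. It assumes the llama-index 0.10+ module layout and an already-built query_engine, both of which you would adapt to your own setup.

# Sketch of LlamaIndex's built-in evaluators (module paths assume llama-index 0.10+).
# query_engine is a placeholder for an index/query engine you have already built.
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")
faithfulness = FaithfulnessEvaluator(llm=llm)
relevancy = RelevancyEvaluator(llm=llm)

query = "Which companies are the largest GHG emitters?"
response = query_engine.query(query)  # placeholder query engine

# Is the response grounded in the retrieved source nodes?
print(faithfulness.evaluate_response(response=response).passing)
# Do the response and its sources actually answer the query?
print(relevancy.evaluate_response(query=query, response=response).passing)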

Tonic Validate

Tonic Validate is a platform designed to streamline the evaluation of RAG applications. It offers a suite of metrics to assess accuracy, context quality, and latency, among other factors. With its user-friendly interface and integration capabilities, Tonic Validate enables developers to monitor and improve their RAG systems effectively.

Langfuse

Langfuse focuses on observability and monitoring for LLM applications. It provides tools to trace and debug workflows, capturing production traces and offering insights into system performance. While not exclusively an evaluation tool, Langfuse's monitoring capabilities are beneficial for maintaining and optimizing RAG systems in production environments.
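
For completeness, here is a small sketch of what Langfuse tracing can look like with its decorator API (SDK v2 style; credentials are read from the LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY environment variables). The retrieve and generate functions are placeholders for your own pipeline steps.

# Sketch of tracing a RAG pipeline with Langfuse's decorator API (SDK v2 style).
# retrieve() and generate() are placeholders for your own retriever and LLM call.
from langfuse.decorators import observe

@observe()
def retrieve(query: str) -> list[str]:
    return ["placeholder context"]   # swap in your retriever

@observe()
def generate(query: str, contexts: list[str]) -> str:
    return "placeholder answer"      # swap in your LLM call

@observe()
def rag(query: str) -> str:
    contexts = retrieve(query)
    return generate(query, contexts)

# Each call produces a nested trace (rag -> retrieve -> generate) in the Langfuse UI,
# where latency, cost, and intermediate inputs/outputs can be inspected.
rag("Which companies are the largest GHG emitters?")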

Conclusion

The tools in this chapter show different ways to evaluate RAG systems depending on your needs. AutoRAG focuses on automation, RAGAS provides structured metrics, and TruLens helps with real-time monitoring and debugging. Each tool serves a specific purpose, so you can choose the right one to fit your workflow and improve your system’s performance.

Jupyter: Google Colab

(ask for access, includes API keys)