Evaluation approaches

Introduction

In this chapter, we will dive into different evaluation approaches for RAG systems, focusing on the metrics that help us measure and optimize performance. Evaluations are critical for ensuring the scalability and reliability of RAG pipelines, as they guide us in making informed decisions about pipeline configurations.

As highlighted in the image below, evaluation metrics can be categorized along two dimensions: scalability and meaningfulness. While some metrics, like traditional NLP evaluations (e.g., BLEU, ROUGE), are highly scalable, they may not capture the true quality of results. On the other hand, human evaluations are very meaningful but often impractical for large-scale experiments due to time and cost constraints.

In the following sections, we will explore these evaluation metrics in detail, covering their strengths, weaknesses, and when to use each. By the end of this chapter, you will have a clear understanding of how to choose the right metric for your use case and what trade-offs each metric involves.

[Image: evaluation approaches plotted along two axes, scalability and meaningfulness]

Traditional NLP Evals

Before the rise of LLMs, traditional NLP evaluation metrics were the backbone of assessing text quality. Metrics like BLEU, ROUGE, METEOR, and word error rate (WER) were widely used for tasks like machine translation, summarization, and text generation. These metrics represented some of the earliest attempts to quantify language quality in a scalable and repeatable way.

BLEU (Bilingual Evaluation Understudy)

BLEU is a precision-based metric that evaluates how many n-grams match between the system-generated text and the reference text.

  • What are n-grams? n-grams are sequences of n consecutive words. For example:
    • For n = 1 (unigrams): "The cat is sitting" → ["The", "cat", "is", "sitting"]
    • For n = 2 (bigrams): "The cat is sitting" → ["The cat", "cat is", "is sitting"]

BLEU checks how many such n-grams from the candidate output overlap with the reference. It works well for structured tasks like machine translation and is still commonly used in translation benchmarks like WMT.
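
To make this concrete, here is a minimal sketch of extracting n-grams and checking their overlap by hand. It is a simplification of what BLEU does internally (real BLEU also applies clipped counts and a brevity penalty), and the helper function is our own illustration rather than part of any library.

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

reference = "The cat is sitting".split()
candidate = "The cat sat down".split()

# Unigram and bigram overlap between candidate and reference
for n in (1, 2):
    ref_ngrams = set(ngrams(reference, n))
    cand_ngrams = ngrams(candidate, n)
    overlap = [g for g in cand_ngrams if g in ref_ngrams]
    print(f"{n}-gram overlap: {overlap}")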

  • Strengths:
    • Highly scalable and computationally efficient.
    • Provides a simple, automated way to compare outputs against references.
  • Weaknesses:
    • Focuses only on surface-level similarity without understanding meaning.
    • Penalizes rephrased or semantically equivalent answers if the wording is different.

For more, see the original BLEU paper.

  • Summary: Introduces n-gram precision for machine translation and discusses its strengths in automating text quality evaluation.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is primarily used in summarization tasks. While BLEU emphasizes precision (matching what the system generates to the reference), ROUGE emphasizes recall—how much of the reference content is present in the system output. It remains widely used for summarization datasets like CNN/DailyMail.

  • Strengths:
    • Useful for tasks where capturing all critical information is key (e.g., summarization).
    • Easy to calculate and interpret.
  • Weaknesses:
    • Similar to BLEU, it does not understand meaning and rewards surface-level overlap.

For more, see the original ROUGE paper.

  • Summary: Focuses on recall for summarization tasks, emphasizing overlap between generated and reference summaries.
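
To see the precision/recall distinction concretely, here is a minimal sketch that computes clipped unigram precision and recall by hand. It is a simplification of ROUGE-1 (no stemming, a single reference), and the sentences are made up for illustration.

from collections import Counter

reference = "the cat sat on the mat".split()
candidate = "the cat sat on the mat and then the cat slept".split()

ref_counts, cand_counts = Counter(reference), Counter(candidate)
# Clipped overlap: count each word at most as often as it appears on the other side
overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())

precision = overlap / len(candidate)  # how much of the output matches the reference (BLEU-like view)
recall = overlap / len(reference)     # how much of the reference is covered (ROUGE's focus)
print(f"Unigram precision: {precision:.2f}, unigram recall: {recall:.2f}")

Note how the verbose, redundant candidate still achieves perfect unigram recall even though much of it is padding, which is exactly why ROUGE can reward wordy summaries.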

METEOR

METEOR was designed to address BLEU’s limitations by incorporating word stems, synonyms, and paraphrases. While it captures more semantic nuances than BLEU, it still falls short of true understanding, as it cannot fully comprehend meaning, context, or intent.

  • Strengths:
    • Adds semantic matching using synonyms, stems, and word order.
    • More effective for tasks requiring flexibility in phrasing.
  • Weaknesses:
    • Still limited to rule-based matching; lacks true semantic understanding.

For more, see the METEOR paper.

  • Summary: Improves on BLEU by adding semantic matching using synonyms and stemming.

Why These Metrics Fall Short

The primary limitation of traditional NLP metrics is that they are syntactic rather than semantic. They focus on word-level or phrase-level overlaps and do not account for meaning. For instance:

  • A sentence that uses different wording but conveys the same meaning as the reference can score poorly.
  • A sentence that matches the reference exactly but has no relevance in a given context can score highly.

For example:

  • BLEU Weakness: "The cat is on the mat" vs. "The mat has a cat." BLEU might score this high due to shared n-grams but fails to capture the flipped meaning.
  • ROUGE Weakness: A very verbose summary might score well if it contains most of the reference words, even if it is poorly written or redundant.

Demonstrating the Weakness in Code

Here is a simple example using BLEU to show how it fails to capture meaning:

#!pip install nltk==3.9b1  # pinned version; some issues reported with newer releases
#!pip install rouge_score
from nltk.translate import meteor
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from nltk.tokenize import word_tokenize
import nltk

# Download required NLTK resources
nltk.download('punkt_tab')
nltk.download('wordnet')  # required by METEOR for stem/synonym matching

# Reference and candidate sentences
reference = "The quick brown fox jumps over the lazy dog."
candidate_good = "The fast brown fox leaps over the lazy dog."  # Similar meaning
candidate_bad = "The dog jumps over the fox."   # Different meaning
candidate_semantic = "A swift brown fox vaults over a sleepy dog."  # Semantically similar

# Tokenize the sentences for METEOR
reference_tokens = word_tokenize(reference)
candidate_good_tokens = word_tokenize(candidate_good)
candidate_bad_tokens = word_tokenize(candidate_bad)
candidate_semantic_tokens = word_tokenize(candidate_semantic)

# BLEU Scores with smoothing
smooth_fn = SmoothingFunction().method1
bleu_good = sentence_bleu([reference_tokens], candidate_good_tokens, smoothing_function=smooth_fn)
bleu_bad = sentence_bleu([reference_tokens], candidate_bad_tokens, smoothing_function=smooth_fn)
bleu_semantic = sentence_bleu([reference_tokens], candidate_semantic_tokens, smoothing_function=smooth_fn)

# ROUGE Scores
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
rouge_good = scorer.score(reference, candidate_good)
rouge_bad = scorer.score(reference, candidate_bad)
rouge_semantic = scorer.score(reference, candidate_semantic)

# METEOR Scores with tokenized inputs
meteor_good = meteor([reference_tokens], candidate_good_tokens)
meteor_bad = meteor([reference_tokens], candidate_bad_tokens)
meteor_semantic = meteor([reference_tokens], candidate_semantic_tokens)

# Display results
print("=== BLEU Scores ===")
print({
    "Good Candidate": bleu_good,
    "Bad Candidate": bleu_bad,
    "Semantic Candidate": bleu_semantic
})

print("\n=== ROUGE Scores ===")
print({
    "Good Candidate": {k: v.fmeasure for k, v in rouge_good.items()},
    "Bad Candidate": {k: v.fmeasure for k, v in rouge_bad.items()},
    "Semantic Candidate": {k: v.fmeasure for k, v in rouge_semantic.items()}
})

print("\n=== METEOR Scores ===")
print({
    "Good Candidate": meteor_good,
    "Bad Candidate": meteor_bad,
    "Semantic Candidate": meteor_semantic
})

Expected output:

#Note: Output values may vary slightly due to differences in tokenizer or smoothing configurations.
#=== BLEU Scores ===
{'Good Candidate': 0.4671379777282001, 'Bad Candidate': 0.13162427160934248, 'Semantic Candidate': 0.06674094719041482}
#=== ROUGE Scores ===
{'Good Candidate': {'rouge1': 0.7777777777777778, 'rougeL': 0.7777777777777778}, 'Bad Candidate': {'rouge1': 0.8, 'rougeL': 0.5333333333333333}, 'Semantic Candidate': {'rouge1': 0.4444444444444444, 'rougeL': 0.4444444444444444}}
#=== METEOR Scores ===
{'Good Candidate': 0.9995, 'Bad Candidate': 0.5901535872080791, 'Semantic Candidate': 0.446}

Takeaway

This example illustrates the strengths and limitations of traditional NLP metrics. BLEU heavily penalizes rephrased or semantically similar candidates, making it less suitable for nuanced tasks. ROUGE and METEOR capture semantic overlap better, identifying the Semantic Candidate as meaningful. However, both fail to sufficiently penalize the Bad Candidate, as they lack the ability to assess deeper semantic correctness. These results highlight the need for advanced evaluation methods, especially in complex RAG tasks where understanding nuanced meaning is crucial.

Ground Truth Evals

Ground truth evaluations are used in tasks with clear, objective answers—where the output is either correct or not. This binary approach works well for simple tasks like:

  • Fact-based Question Answering: e.g., "What is 2 + 2?" (Answer: "4").
  • Classification: e.g., labeling a sentence as "positive" or "negative."
  • Entity Recognition: e.g., extracting "cat" from "The cat is sleeping."

When to Use Ground Truth Evals

Ground truth is ideal for:

  1. Single-answer tasks (e.g., "What is the capital of France?" → "Paris").
  2. Low ambiguity problems, like math or structured data tasks.

Strengths:

  • Highly Meaningful: Directly measures accuracy with no subjective judgment.
  • Easy to Compute: Accuracy reduces to a simple percentage of correct answers.

Weaknesses:

  • Rarely Fits Modern LLM Tasks: Most LLM and RAG applications involve open-ended questions or multiple valid answers (e.g., "Summarize this document" or "What are the benefits of X?").
  • Insensitive to Partial Correctness: Binary scoring gives no partial credit. If the reference answer is "Paris, France" and the model outputs "Paris," an exact-match check marks it wrong even though the answer is mostly correct (see the sketch after the quick code example below).

Quick Code Example

# Define ground truth and model predictions
ground_truth = ["cat", "dog", "rabbit"]
model_predictions = ["cat", "rabbit", "rabbit"]

# Calculate accuracy
accuracy = sum(gt == pred for gt, pred in zip(ground_truth, model_predictions)) / len(ground_truth)
print(f"Accuracy: {accuracy:.2%}")

Ground truth evals are foundational and simple, but they often do not fit modern LLM or RAG tasks, where answers are rarely black-and-white. Instead, we need evaluation methods that account for meaning, relevance, and nuance—topics we will explore in the next sections.

Human Evals

Human evaluations are the gold standard for assessing NLP systems. They excel at capturing meaning, context, and nuance—things automated metrics often miss. For RAG tasks, where answers are often very complex, humans provide the most meaningful insights.

Strengths:

  • Highly Meaningful: Can assess intent, relevance, and creativity.
  • Versatile: Suitable for complex, open-ended tasks like summarization or reasoning.

Weaknesses:

  • Not Scalable: Optimizing a RAG system with 100 Q&A pairs across 100 experiments means 10,000 evaluations. Even at just 30 seconds per evaluation, that adds up to more than 83 hours (see the quick estimate after this list). This approach simply does not scale.
  • Inconsistent: Human judgments can vary due to bias, fatigue, or interpretation, reducing reliability.
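
The scaling estimate in the first bullet is easy to reproduce. Here is the back-of-the-envelope calculation; the 30-second figure is an assumption, and real annotation times vary widely.

qa_pairs = 100
experiments = 100
seconds_per_judgment = 30  # assumed average; real annotation times vary widely

total_judgments = qa_pairs * experiments
total_hours = total_judgments * seconds_per_judgment / 3600
print(f"{total_judgments} judgments ≈ {total_hours:.1f} hours of human time")
# 10000 judgments ≈ 83.3 hours of human time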

When to Use Human Evals

  • Reserve for critical, small-scale experiments where precision is key.
  • Pair with automated metrics to reduce workload while maintaining quality.

Human evaluations provide unmatched insight but are impractical for large-scale optimization. In the next section, we will look at a more scalable alternative.

LLM Evals

LLM evaluations, when combined with human oversight, are one of the most effective approaches for assessing NLP systems. SoTA models like GPT-4o or Claude are incredibly capable of evaluating meaning, relevance, and nuance—provided they are guided correctly.

Strengths:

  • Highly Meaningful: SoTA LLMs can capture complex aspects of responses, such as accuracy, relevance, and even creativity, when prompted effectively.
  • Scalable: Once the prompts are designed and validated, LLM evals can run automatically, enabling large-scale optimization.

The Role of Humans

While LLM evals are highly scalable, they are not a fully hands-off solution. A human must:

  1. Set up the prompts: Proper prompt engineering ensures that the evaluation criteria align with your preferences.
  2. Validate the outputs: Reviewing a subset of evaluations ensures the LLM is working as intended. This step helps avoid scaling a flawed evaluation approach.

In this sense, LLM evals are more of a collaborative process between humans and LLMs. Once the system is tuned and validated, however, it becomes fully automatic.
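
One lightweight way to handle the validation step is to spot-check a random sample of LLM scores against human judgments and measure how often they agree. The sketch below assumes you already have both sets of scores on a 0-to-1 scale; the example scores are made up, and the 0.1 agreement tolerance is an arbitrary choice.

import random

# Hypothetical scores on a 0-to-1 scale for the same five answers
llm_scores = {"q1": 0.9, "q2": 0.4, "q3": 0.7, "q4": 0.2, "q5": 0.8}
human_scores = {"q1": 0.85, "q2": 0.6, "q3": 0.7, "q4": 0.2, "q5": 0.9}

# Spot-check a random subset and count agreements within a 0.1 tolerance
sample = random.sample(list(llm_scores), k=3)
agreement = sum(abs(llm_scores[q] - human_scores[q]) <= 0.1 for q in sample) / len(sample)
print(f"Spot-check agreement on {len(sample)} items: {agreement:.0%}")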

Key Consideration: Using SoTA Models

For this method to succeed, the evaluator LLM must be as good as or better than the system it is evaluating. A SoTA model (e.g., GPT-4o or Claude) can effectively evaluate another SoTA model or a less advanced model. However, if you use a weaker model (e.g., LLaMA-3.1) to evaluate a stronger model (e.g., GPT-4o), the evaluations will likely fail, like having a student grade the teacher's work.

Why LLM Evals Are the Best Approach

LLM evals strike the perfect balance between human and machine:

  • Human judgment: Guides the process with prompt engineering and validation.
  • Machine efficiency: Scales evaluations across large datasets and configurations.

By leveraging the strengths of both humans and SoTA LLMs, this approach offers the best of both worlds—deep insights at scale with minimal manual overhead.

#!pip install llama_index
from llama_index.llms.openai import OpenAI
import os
import re

os.environ["OPENAI_API_KEY"] = "sk-a1-jvspMHwWNe-aa2GqKsjoephC2su0ICtNMv_ViClT3BlbkFJxoYEcVsJ1slACQ-m1N3fzBL_Uffk9udwkWgL0aZaMA"
# Set up the LLM with the latest OpenAI integration
llm = OpenAI(model="gpt-4o", temperature=0)

# Example input data with slightly imperfect answers
evaluations = [
    {
        "query": "What are the key differences between photosynthesis and cellular respiration?",
        "generated_answer": "Photosynthesis uses water and carbon dioxide to make energy. Cellular respiration converts oxygen into glucose.",
        "reference_answer": "Photosynthesis captures energy to create glucose, while cellular respiration releases energy by breaking down glucose."
    },
    {
        "query": "Summarize the main reasons for the fall of the Roman Empire.",
        "generated_answer": "The Roman Empire declined because of financial problems and weak emperors.",
        "reference_answer": "Major reasons for the fall of Rome include political corruption, economic decline, and external invasions by groups such as the Visigoths."
    }
]

# Define a prompt for semantic relevance evaluation with reasoning
def create_prompt(item):
    return f"""
    Query: {item['query']}
    Generated Answer: {item['generated_answer']}
    Reference Answer: {item['reference_answer']}
    Evaluate the semantic relevance of the Generated Answer to the Reference Answer on a scale of 0 to 1.
    Provide your reasoning and the score in the format: "Reasoning: ... Score: X"
    """

# Evaluate using the LLM
results = []
for item in evaluations:
    prompt = create_prompt(item)
    response = llm.complete(prompt)  # Returns a CompletionResponse object
    reasoning_and_score = response.text.strip()  # Extract the response text

    # Extract the numeric score using a regex
    match = re.search(r"Score:\s*([0-9.]+)", reasoning_and_score)
    score = float(match.group(1)) if match else None

    results.append({
        "query": item["query"],
        "generated_answer": item["generated_answer"],
        "reasoning": reasoning_and_score,
        "reference_answer": item["reference_answer"],
        "score": score
    })

# Output results
for result in results:
    print(f"Query: {result['query']}")
    print(f"Generated Answer: {result['generated_answer']}")
    print(f"Reference Answer: {result['reference_answer']}")
    print(f"Reasoning and Score:\n{result['reasoning']}")

Expected result for the second example (outputs may vary slightly due to the inherent stochasticity of LLMs):

Query: Summarize the main reasons for the fall of the Roman Empire.
Generated Answer: The Roman Empire declined because of financial problems and weak emperors.
Reference Answer: Major reasons for the fall of Rome include political corruption, economic decline, and external invasions by groups such as the Visigoths.
Reasoning and Score:
Reasoning: The Generated Answer mentions "financial problems" and "weak emperors" as reasons for the fall of the Roman Empire. These points are somewhat related to the Reference Answer, which cites "economic decline" (related to financial problems) and "political corruption" (which can be associated with weak leadership). However, the Generated Answer does not mention "external invasions," which is a significant factor in the Reference Answer. Therefore, while there is some overlap, the Generated Answer does not fully capture the breadth of reasons provided in the Reference Answer. Score: 0.5

As you can see, the LLM evaluation can be quite thorough. You can, of course, control how critical the LLM is via the prompt; in the example above, a score of around 0.7 rather than 0.5 would arguably have been justified. You can also control which aspects of the answer the judge should focus on: grammatical correctness? Completeness of a summary? Key facts? This is all up to you and can be steered through the prompt. You can even use separate prompts for different aspects of the answer and assign more granular scores, as sketched below.
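
As one way to set this up, here is a hedged sketch of a per-aspect prompt builder in the spirit of the create_prompt function above; it reuses the llm object and evaluations list defined earlier. The aspect names and wording are illustrative, not a fixed rubric.

# Illustrative per-aspect prompts
ASPECTS = {
    "factual_accuracy": "Are the facts in the Generated Answer consistent with the Reference Answer?",
    "completeness": "Does the Generated Answer cover all key points of the Reference Answer?",
    "clarity": "Is the Generated Answer clearly and grammatically written?",
}

def create_aspect_prompt(item, aspect, question):
    return f"""
    Query: {item['query']}
    Generated Answer: {item['generated_answer']}
    Reference Answer: {item['reference_answer']}
    Evaluation aspect: {aspect}. {question}
    Rate this aspect on a scale of 0 to 1.
    Provide your reasoning and the score in the format: "Reasoning: ... Score: X"
    """

# Example usage for the first item, one call per aspect:
# for aspect, question in ASPECTS.items():
#     response = llm.complete(create_aspect_prompt(evaluations[0], aspect, question))
#     print(aspect, response.text.strip())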

Conclusion

Choosing the right evaluation approach is critical for optimizing your RAG system. Each method has its strengths and weaknesses, and the best choice often depends on your specific goals, constraints, and resources:

  • Traditional NLP evals are simple and scalable but lack depth, making them unsuitable for modern, nuanced tasks.
  • Ground truth evals are great for tasks with clear, objective answers but do not work well for open-ended or complex questions.
  • Human evals provide unmatched insight but are impractical for large-scale experimentation due to their high cost and inconsistency.
  • LLM evals, when combined with human oversight, offer the best of both worlds: scalability, meaningfulness, and flexibility. With proper prompt engineering and validation, they become a powerful tool for modern NLP evaluation.

As RAG systems and LLM applications continue to evolve, so too will the methods we use to evaluate them. For now, leveraging the complementary strengths of humans and SoTA LLMs is the most effective way to ensure your system performs reliably in real-world scenarios.

Jupyter notebook: Google Colab (ask for access; the notebook contains API keys)