Text Generation Evaluation Metrics
- Evaluating text generation is inherently difficult because language is subjective, context-dependent, and allows for infinite valid variations.
- Traditional lexical metrics like BLEU and ROUGE rely on exact word overlap, which fails to capture semantic meaning or stylistic nuance.
- Modern evaluation shifts toward model-based metrics (e.g., BERTScore) and LLM-as-a-judge approaches to assess semantic similarity.
- No single metric is sufficient; a robust evaluation pipeline typically combines automated scores with human annotation and task-specific benchmarks.
Why It Matters
In the legal technology sector, companies like Harvey AI use advanced evaluation metrics to ensure that generated contract summaries are factually accurate and legally sound. Because legal documents require high precision, they cannot rely on standard BLEU scores; instead, they employ custom reward models and LLM-as-a-judge pipelines to verify that the generated text adheres to specific legal clauses and terminology. This ensures that the model does not "hallucinate" obligations that do not exist in the source text.
Customer support automation platforms, such as those provided by Intercom or Zendesk, use text generation evaluation to monitor the quality of AI-driven chat responses. They track metrics like "Helpfulness" and "Resolution Rate" by comparing model outputs against historical high-quality human agent responses. By using a combination of BERTScore for semantic consistency and LLM-based sentiment analysis, these companies ensure that their bots maintain a professional tone and provide accurate technical instructions to users.
In the field of medical documentation, companies like Nuance (a Microsoft company) utilize evaluation metrics to assess the accuracy of AI-generated clinical notes. These systems must be rigorously evaluated to ensure that no critical medical information is omitted or misrepresented during the transcription and summarization process. Developers use a combination of automated fact-checking metrics and expert human-in-the-loop reviews to validate that the generated notes align perfectly with the physician's dictated observations.
How It Works
The Challenge of Evaluation
Text generation is the process of producing human-readable language via computational models. Unlike classification tasks, where there is usually a single "correct" label, text generation is open-ended. If you ask a model to "write a story about a cat," there are millions of valid, high-quality responses. This creates a fundamental problem: how do we mathematically determine if a generated sentence is "good"? If we compare the model's output to a single human-written reference, we might penalize the model for being creative or using different vocabulary, even if the output is factually correct and fluent.
Lexical Overlap Metrics
Historically, the field relied on lexical overlap metrics. These methods treat text as a "bag of words" or a sequence of n-grams (contiguous sequences of tokens). BLEU, for instance, measures how many n-grams in the generated text also appear in the reference text, combining clipped n-gram precisions with a brevity penalty that punishes overly short outputs. If the reference is "The cat sat on the mat" and the model generates "The feline sat on the rug," BLEU gives a very low score because "feline" and "rug" do not match the reference tokens. This is the primary limitation of lexical metrics: they measure form rather than meaning. They are fast and computationally inexpensive, making them useful for quick iterations during training, but they are poor proxies for human judgment.
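At its core, BLEU is built on clipped n-gram precision. A minimal sketch of the unigram case only (real BLEU combines clipped precisions up to 4-grams and multiplies by the brevity penalty):

from collections import Counter

def clipped_unigram_precision(candidate, reference):
    # Clip each candidate token's count at its count in the reference,
    # so repeating a matching word cannot inflate the score
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

reference = "The cat sat on the mat"
print(clipped_unigram_precision("The cat sat on the mat", reference))     # 1.0
print(clipped_unigram_precision("The feline sat on the rug", reference))  # ~0.67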
Semantic and Model-Based Metrics
To overcome the limitations of lexical overlap, researchers turned to embedding-based metrics. Instead of matching exact strings, these metrics map tokens into a high-dimensional vector space. If two words appear in similar contexts during the model's training, their vectors will be close together. BERTScore, for example, uses the internal representations of a Transformer model to compute similarity. If the model generates "feline" instead of "cat," BERTScore recognizes that these vectors are close in the embedding space and assigns a high score. This represents a significant leap forward, as it allows for evaluation that respects the semantic intent of the writer.
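BERTScore is available as an off-the-shelf package (bert-score on PyPI); a minimal sketch, assuming that package and its default model weights are installed:

# pip install bert-score
from bert_score import score

candidates = ["The feline sat on the rug."]
references = ["The cat sat on the mat."]

# Each candidate token is greedily matched to its most similar reference
# token in embedding space; scores come back as precision/recall/F1 tensors
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")

Unlike the BLEU sketch above, this pair scores highly because "feline"/"cat" and "rug"/"mat" sit close together in the embedding space.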
The Rise of LLM-as-a-Judge
The most recent trend in evaluation is using LLMs to evaluate other LLMs. This approach acknowledges that human evaluation is the "gold standard" but is too slow and expensive to scale. By providing a strong model with a rubric—such as "Rate the following response from 1 to 5 based on accuracy, tone, and conciseness"—we can automate the assessment of complex, subjective qualities. However, this introduces new risks, such as "positional bias" (where the judge prefers the first option presented) or "self-preference bias" (where the judge favors outputs that resemble text it would have generated itself). Despite these biases, LLM-as-a-judge is currently the most effective way to evaluate open-ended generation tasks like dialogue and creative writing.
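A minimal sketch of such a judge pipeline; call_llm is a hypothetical placeholder for whatever chat-completion client you actually use:

import re

JUDGE_PROMPT = """Rate the following response from 1 to 5 based on
accuracy, tone, and conciseness. Reply with only the number.

Question: {question}
Response: {response}
Score:"""

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in your provider's chat-completion API
    raise NotImplementedError

def judge(question: str, response: str) -> int:
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    match = re.search(r"[1-5]", raw)
    if match is None:
        raise ValueError(f"Judge returned no parsable score: {raw!r}")
    return int(match.group())

For pairwise comparisons, a common mitigation for positional bias is to run the judgment twice with the candidates swapped and only count verdicts on which both orderings agree.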
Common Pitfalls
- "Higher BLEU score always means better quality." This is false because BLEU only measures surface-level n-gram overlap. A model can produce a grammatically incorrect sentence that happens to contain many of the same words as the reference, resulting in a high BLEU score while being completely useless to the user.
- "Metrics like BERTScore solve the evaluation problem." While BERTScore is better than BLEU, it is still an approximation based on the internal state of a pre-trained model. It can be "fooled" by text that is semantically similar but logically contradictory, or by text that is fluent but factually incorrect.
- "Human evaluation is the only reliable method." Human evaluation is often treated as the gold standard, but it is highly subjective and prone to inter-annotator disagreement. Different humans have different preferences for tone, brevity, and style, meaning that even "human" scores can be noisy and inconsistent.
- "Perplexity is a measure of factual accuracy." Perplexity measures how well a model predicts the next token based on its training data, not whether the information is true. A model can be very "confident" (low perplexity) while generating a completely false statement, as long as that statement is grammatically and stylistically plausible.
Sample Code
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Simulating word embeddings for "cat" and "feline"
# In a real scenario, use a library like 'transformers' to get these vectors
vec_cat = np.array([[0.8, 0.1, 0.2]])
vec_feline = np.array([[0.75, 0.15, 0.25]])
vec_dog = np.array([[-0.5, 0.8, 0.1]])
def calculate_similarity(v1, v2):
    # Cosine similarity ranges from -1 to 1
    return cosine_similarity(v1, v2)[0][0]
score_match = calculate_similarity(vec_cat, vec_feline)
score_mismatch = calculate_similarity(vec_cat, vec_dog)
print(f"Similarity (Cat vs Feline): {score_match:.4f}")
print(f"Similarity (Cat vs Dog): {score_mismatch:.4f}")
# Expected Output:
# Similarity (Cat vs Feline): 0.9949
# Similarity (Cat vs Dog): -0.3807