Evaluating LLM Outputs — The LLM-as-Judge Pattern

Mental Model

Think of asyncio.gather as collecting reports from multiple agents. By default, if one agent shouts "Emergency!" (raises an exception), gather hands you that emergency immediately and you never receive the other reports, even though the other agents keep working. By setting return_exceptions=True, you are telling the system: "Wait for every report, and if an agent reports an emergency, hand me that emergency report alongside the others."

Rule: When gathering batch evaluations concurrently, always set return_exceptions=True to prevent a single failure from discarding the entire suite.

The Setup

You are building an automated evaluation system that runs your test suite against an LLM-as-Judge. To speed up evaluation, you trigger concurrent validation checks using Python's asyncio.gather helper.

What Does This Print?

Broken code
Python
import asyncio

async def evaluate_output(test_id: int) -> dict:
    if test_id == 2:
        # Simulating a transient API rate limit error or timeout
        raise RuntimeError("API Rate Limit Exceeded!")
    return {"test_id": test_id, "score": 0.95}

async def run_evaluation_suite():
    test_cases = [1, 2, 3]
    # Trigger evaluations concurrently
    results = await asyncio.gather(*(evaluate_output(tid) for tid in test_cases))
    return results

try:
    print(asyncio.run(run_evaluation_suite()))
except Exception as e:
    print(f"Suite crashed: {e}")

Predict whether your pipeline retrieves the evaluation scores for Test Cases 1 and 3.

The Output

What actually happens
Suite crashed: API Rate Limit Exceeded!

The entire evaluation suite crashed. Even though Test Cases 1 and 3 ran successfully and each produced a score of 0.95, you retrieve nothing. Because asyncio.gather propagates the first exception immediately by default, a single failure aborts result collection and leaves your telemetry blind to partial results.

Why Python Does This

Under the hood, asyncio.gather wraps each coroutine in a Task and returns a single Future that aggregates them. By default, return_exceptions is False, so the first exception raised by any awaitable is immediately propagated to the caller awaiting gather. Crucially, this does not cancel the other awaitables: they keep running on the event loop, but gather no longer hands their results back, and any exceptions they raise later never reach the caller. To capture partial successes and log faults per test case, pass return_exceptions=True, which makes gather wait for every awaitable and return exceptions as ordinary items in the result list, in the same order as the inputs, rather than raising them.
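To make the "not cancelled, just abandoned" behavior visible, here is a minimal sketch. It assumes artificial delays (the `delay` parameter is hypothetical, standing in for judge API latency) and creates the tasks explicitly so the code still holds a handle to them after gather raises.

```python
import asyncio

async def evaluate_output(test_id: int, delay: float) -> dict:
    await asyncio.sleep(delay)  # stand-in for the judge API call (assumed latency)
    if test_id == 2:
        raise RuntimeError("API Rate Limit Exceeded!")
    return {"test_id": test_id, "score": 0.95}

async def main() -> tuple:
    # Explicit tasks, so we keep a reference after gather raises.
    tasks = [
        asyncio.create_task(evaluate_output(1, 0.1)),
        asyncio.create_task(evaluate_output(2, 0.01)),
        asyncio.create_task(evaluate_output(3, 0.1)),
    ]
    error_message = None
    try:
        await asyncio.gather(*tasks)
    except RuntimeError as exc:
        error_message = str(exc)
    # gather has already raised, but it did not cancel the sibling tasks.
    cancelled = [t.cancelled() for t in tasks]
    # asyncio.wait blocks until every task is done and does not raise their exceptions.
    await asyncio.wait(tasks)
    recovered = [t.result() for t in tasks if t.exception() is None]
    return error_message, cancelled, recovered

error_message, cancelled, recovered = asyncio.run(main())
print(error_message)
print(cancelled)
print(recovered)
```

The scores for tests 1 and 3 are only recoverable here because the task handles were kept. In the original snippet the coroutines are passed straight to gather, so nothing is left to inspect once it raises.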

The Fix

Corrected pattern
Python
import asyncio

async def evaluate_output(test_id: int) -> dict:
    if test_id == 2:
        raise RuntimeError("API Rate Limit Exceeded!")
    return {"test_id": test_id, "score": 0.95}

async def run_evaluation_suite_safe():
    test_cases = [1, 2, 3]
    # Set return_exceptions=True to intercept errors gracefully
    results = await asyncio.gather(
        *(evaluate_output(tid) for tid in test_cases),
        return_exceptions=True
    )
    return results

results = asyncio.run(run_evaluation_suite_safe())
for idx, res in enumerate(results):
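    # BaseException also covers asyncio.CancelledError, which is not an Exception subclass on Python 3.8+
    if isinstance(res, BaseException):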
        print(f"Task {idx+1} failed with error: {res}")
    else:
        print(f"Task {idx+1} succeeded: {res}")

With return_exceptions=True, asyncio.gather no longer propagates the first exception. It waits for every evaluation to finish and returns each exception as an item in the result list, in the same order as the inputs. The suite therefore collects both the successful scores and the individual error objects, and a single transient failure cannot abort the whole batch.
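Because gather returns results in the same order as its inputs, each result can be paired back with its test ID. The sketch below shows one way to split the output into scores and errors. The helper names `partition_results` and `run_suite` are hypothetical and not part of asyncio.

```python
import asyncio

async def evaluate_output(test_id: int) -> dict:
    if test_id == 2:
        raise RuntimeError("API Rate Limit Exceeded!")
    return {"test_id": test_id, "score": 0.95}

def partition_results(test_cases, results):
    """Split gather output into scores and errors, keyed by test ID."""
    scores, errors = {}, {}
    # zip is safe because gather preserves input order.
    for test_id, res in zip(test_cases, results):
        if isinstance(res, BaseException):
            errors[test_id] = res
        else:
            scores[test_id] = res["score"]
    return scores, errors

async def run_suite(test_cases):
    results = await asyncio.gather(
        *(evaluate_output(tid) for tid in test_cases),
        return_exceptions=True,
    )
    return partition_results(test_cases, results)

scores, errors = asyncio.run(run_suite([1, 2, 3]))
print(scores)  # {1: 0.95, 3: 0.95}
print({tid: repr(err) for tid, err in errors.items()})  # {2: "RuntimeError('API Rate Limit Exceeded!')"}
```

Keying by test ID rather than list position makes it easier to log or report exactly which cases failed.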

How This Fails in Real Systems

A CI/CD quality assurance gate evaluated model performance on 500 scenarios daily. When a single parallel request hit a transient 504 Gateway Timeout, the entire CI pipeline failed, blocking the production deploy for 4 hours until someone manually isolated and reran the evaluation suite.
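One way a gate like this could avoid the manual rerun is to retry only the cases that failed. The following is a rough sketch under stated assumptions, not a description of the pipeline in the incident: `run_with_retries`, `max_rounds`, and the `attempts` counter are hypothetical, and the failure is simulated so that test 2 times out only on its first attempt.

```python
import asyncio

attempts = {}  # hypothetical per-test call counter, used to simulate a transient fault

async def evaluate_output(test_id: int) -> dict:
    attempts[test_id] = attempts.get(test_id, 0) + 1
    # Simulated transient failure: test 2 fails only on its first attempt.
    if test_id == 2 and attempts[test_id] == 1:
        raise TimeoutError("504 Gateway Timeout")
    return {"test_id": test_id, "score": 0.95}

async def run_with_retries(test_cases, max_rounds: int = 3):
    scores, errors = {}, {}
    pending = list(test_cases)
    for _ in range(max_rounds):
        if not pending:
            break
        results = await asyncio.gather(
            *(evaluate_output(tid) for tid in pending),
            return_exceptions=True,
        )
        errors = {}
        for tid, res in zip(pending, results):
            if isinstance(res, BaseException):
                errors[tid] = res  # keep the error and retry this case next round
            else:
                scores[tid] = res["score"]
        pending = list(errors)
    return scores, errors

scores, errors = asyncio.run(run_with_retries([1, 2, 3]))
print(scores)
print(errors)
print(attempts)
```

A production gate would likely add backoff between rounds and retry only error types known to be transient. Both are left out here to keep the sketch short.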

Key Takeaway

When gathering batch evaluations concurrently, always set return_exceptions=True to prevent a single failure from discarding the entire suite.
Common mistake: Developers use asyncio.gather for concurrent tasks without accounting for individual task failures, leading to the entire batch crashing if even one sub-task raises an exception.