AI Pipeline Reliability
LLM outputs are probabilistic — every downstream system that treats them as deterministic will eventually corrupt data, enter an infinite retry loop, or serve a wrong answer with full confidence.
- Schema validation (Pydantic, Instructor, Outlines) catches format failures in <5ms; semantic validation via an LLM judge adds 500ms–2s but catches factual errors schema can't.
- Retrying on semantic failures without modifying the prompt is almost always futile — the model will reproduce the same error with high probability at the same temperature.
- Idempotency keys on every LLM-triggered write operation are non-negotiable: retries after partial failures create duplicate records, double charges, or conflicting state.
- Monitor validation failure rate as an SLI — a spike signals prompt regression, model update, or upstream data change before users report it.
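As a rough illustration of treating validation failure rate as an SLI, the sketch below keeps a sliding window of recent validation results and flags when the failure rate crosses a threshold. The class name, window size, and 5% threshold are assumptions chosen for the example, not values from any particular monitoring stack.

```python
from collections import deque


class ValidationFailureSLI:
    """Tracks the validation failure rate over the most recent `window` outputs."""

    def __init__(self, window: int = 1000, alert_threshold: float = 0.05):
        # Each entry is True when the output failed validation.
        self.results = deque(maxlen=window)
        self.alert_threshold = alert_threshold  # hypothetical alerting threshold

    def record(self, failed: bool) -> None:
        self.results.append(failed)

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def should_alert(self) -> bool:
        return self.failure_rate() > self.alert_threshold
```

In a real deployment the same counter would more likely be emitted as a metric and alerted on by the monitoring system; the in-process window only shows the calculation.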
The Problem
An LLM-powered invoice extraction pipeline parses PDFs and writes structured records to a database. The model returns valid JSON 98% of the time — but on 2% of invoices, it returns a string where a number is expected, or omits a required field. Those records fail silently, the pipeline retries the full PDF, writes a partial duplicate, and the finance team discovers the discrepancy three weeks later during reconciliation. This isn't a model quality problem — it's a pipeline reliability problem. Non-determinism is a property of LLMs, not a bug to be fixed; the pipeline must be engineered to handle it.
Core System Idea
A reliable AI pipeline treats every LLM output as untrusted external input that must pass validation before any state mutation. The architecture has four layers:
1. LLM inference: call the model with structured output mode (JSON schema enforcement) where the provider supports it.
2. Schema validation: use Pydantic or Instructor to validate the returned structure in <5ms, rejecting malformed output immediately.
3. Semantic validation: optionally use rule-based checks (regex, range assertions, business logic) or an LLM-as-judge call for nuanced correctness.
4. Idempotent persistence: write to the database with a deduplication key derived from the input, so retries produce the same record rather than duplicates.

Conditional retry logic re-prompts the LLM with specific validation error context rather than blindly resending the original prompt. Dead-letter queues capture failed items for human review rather than dropping them silently.
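The schema validation layer can be sketched as follows. To keep the example dependency-free it uses only the standard library in place of Pydantic or Instructor; the `Invoice` fields, `SchemaError`, and `parse_invoice` are hypothetical names invented for illustration.

```python
import json
from dataclasses import dataclass


class SchemaError(ValueError):
    """Raised when LLM output does not match the expected structure."""


@dataclass(frozen=True)
class Invoice:
    invoice_number: str
    invoice_total: float
    currency: str


# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "invoice_number": str,
    "invoice_total": (int, float),
    "currency": str,
}


def parse_invoice(raw_output: str) -> Invoice:
    """Parse and validate raw model output; raise SchemaError on any mismatch."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise SchemaError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise SchemaError(f"missing required field: {name}")
        value = data[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise SchemaError(f"field {name} has wrong type: {type(value).__name__}")
    return Invoice(data["invoice_number"], float(data["invoice_total"]), data["currency"])
```

A string where a number is expected, or a missing required field, raises `SchemaError` before anything touches the database, which is the behavior the second layer is meant to guarantee.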
System Flow
Validation gates prevent invalid LLM output from reaching persistence; failed items go to DLQ for human review, not silent discard.
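One way to picture the flow is a function that runs each validation gate in order and routes the item either to persistence or to the dead-letter queue. This is a minimal sketch with retries left out; `run_pipeline` and all of its callable parameters are hypothetical placeholders for the real model client, validators, and storage.

```python
from typing import Callable, Optional


def run_pipeline(
    document: str,
    call_model: Callable[[str], str],
    validators: list[Callable[[str], Optional[str]]],
    persist: Callable[[str, str], None],
    dead_letter: Callable[[str, str, str], None],
) -> bool:
    """Return True if the output was persisted, False if it went to the DLQ."""
    output = call_model(document)
    for validate in validators:      # gates run in the order given
        error = validate(output)     # None means the gate passed
        if error is not None:
            # Invalid output never reaches persistence; it is queued for review.
            dead_letter(document, output, error)
            return False
    persist(document, output)
    return True
```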
Real-World Examples (Indicative)
Copilot: generated code completions pass through AST parsing and syntax validation before surfacing to the user. Syntactically invalid completions (mismatched braces, undefined references detectable statically) are suppressed silently. This validation runs in a separate thread in ~10ms and explains why Copilot rarely suggests code that doesn't parse: invalid suggestions are filtered after generation, not prevented by the model.
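The idea of filtering after generation can be illustrated with Python's built-in `ast` module. This is only a stand-in for the technique, not a description of how any product implements it, and `filter_parseable` is a hypothetical helper name.

```python
import ast


def filter_parseable(candidates: list[str]) -> list[str]:
    """Keep only candidate completions that parse as Python source."""
    kept = []
    for code in candidates:
        try:
            ast.parse(code)
        except (SyntaxError, ValueError):
            continue  # drop the candidate silently instead of surfacing it
        kept.append(code)
    return kept
```

Parsing only proves the candidate is syntactically valid; it says nothing about whether the code does what the user wanted.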
Every LLM-generated customer response is validated against a rule set before sending: no pricing claims that are not in the live product catalog, no promises about refund timelines that differ from policy, and no mentions of competitor products. Responses that fail validation are escalated to a human agent rather than retried, because semantic errors in customer-facing financial responses are better handled by a human than by re-rolling the dice with the model.
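A rule set of this kind might look like the sketch below. The catalog prices, refund window, competitor names, and regular expressions are all invented for illustration; a production rule set would read from the live catalog and policy sources and would need far more robust matching.

```python
import re

CATALOG_PRICES = {"19.00", "49.00"}   # hypothetical live catalog prices
POLICY_REFUND_DAYS = 14               # hypothetical refund policy
COMPETITORS = {"acme", "globex"}      # hypothetical competitor names


def check_response(text: str) -> list[str]:
    """Return a list of rule violations; an empty list means the response may be sent."""
    violations = []
    for price in re.findall(r"\$(\d+(?:\.\d{2})?)", text):
        normalized = price if "." in price else price + ".00"
        if normalized not in CATALOG_PRICES:
            violations.append(f"price ${price} not in catalog")
    for days in re.findall(r"refund\w*\s+within\s+(\d+)\s+days", text, re.IGNORECASE):
        if int(days) != POLICY_REFUND_DAYS:
            violations.append(f"refund timeline of {days} days differs from policy")
    lowered = text.lower()
    for name in sorted(COMPETITORS):
        if re.search(rf"\b{re.escape(name)}\b", lowered):
            violations.append(f"mentions competitor {name}")
    return violations


def route(text: str) -> str:
    # Any violation escalates to a human agent; there is no automatic retry.
    return "escalate_to_human" if check_response(text) else "send"
```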
Claude's tool use API takes a JSON schema for each declared tool, and the model generates tool call arguments against it. Where the provider offers strict structured output, conformance to the schema is enforced during generation; otherwise the application validates the arguments and, on wrong parameter types, returns a tool result flagged as an error with the validation message attached so the model can retry. Either way, schema enforcement starts at the provider boundary, before the application's remaining validation layers run.
Anti-Patterns
Resending the identical prompt after a semantic validation failure gives the model the same context that produced the wrong answer — it reproduces the error at roughly the same rate. Always include the specific validation error in the retry prompt: "Your previous response was missing the invoice_total field. Return valid JSON including this field."
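A retry loop that feeds the validation error back to the model could be sketched like this. `call_with_feedback`, `call_model`, and `validate` are hypothetical names, and the wording of the correction message is only an example.

```python
from typing import Callable, Optional


def call_with_feedback(
    base_prompt: str,
    call_model: Callable[[str], str],
    validate: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first valid output, or None if every attempt fails validation."""
    prompt = base_prompt
    for _ in range(max_attempts):
        output = call_model(prompt)
        error = validate(output)  # None means the output is valid
        if error is None:
            return output
        # Attach the specific failure so the retry differs from the original prompt.
        prompt = (
            f"{base_prompt}\n\nYour previous response was rejected: {error}\n"
            f"Previous response:\n{output}\nReturn corrected JSON only."
        )
    return None  # the caller should route the item to the dead-letter queue
```

Capping the number of attempts keeps a persistently failing input from looping forever.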
JSON schema validates structure, not correctness. {"amount": -99999, "currency": "ZZZ"} passes schema validation but is semantically wrong. Business logic rules and range assertions are a separate, necessary layer.
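The separate business-rule layer might look like the following sketch, applied to the example record above. The currency set and the upper bound are illustrative assumptions; real limits come from the business domain.

```python
KNOWN_CURRENCIES = {"USD", "EUR", "GBP", "JPY"}  # illustrative subset of currency codes
MAX_PLAUSIBLE_TOTAL = 1_000_000                  # hypothetical business limit


def semantic_errors(record: dict) -> list[str]:
    """Return business-rule violations for a record that already passed schema validation."""
    errors = []
    amount = record.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        errors.append("amount must be a number")
    elif amount < 0:
        errors.append("amount must not be negative")
    elif amount > MAX_PLAUSIBLE_TOTAL:
        errors.append("amount exceeds plausible maximum")
    if record.get("currency") not in KNOWN_CURRENCIES:
        errors.append("currency is not a recognized code")
    return errors
```

For `{"amount": -99999, "currency": "ZZZ"}` this returns two errors, even though the record is structurally valid.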
An LLM pipeline that writes to a database or sends an email without a deduplication key will create duplicates on any retry. Derive the key from the input hash — not a timestamp, which changes on retry.
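A minimal sketch of an input-derived deduplication key, using SHA-256 over the input bytes and a primary-key constraint in SQLite. The table layout and function names are assumptions made for the example.

```python
import hashlib
import sqlite3


def idempotency_key(pdf_bytes: bytes) -> str:
    # The same input bytes always hash to the same key, so a retry reuses it.
    return hashlib.sha256(pdf_bytes).hexdigest()


def save_invoice(conn: sqlite3.Connection, pdf_bytes: bytes, total: float) -> bool:
    """Insert at most once per input; return True only if a new row was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO invoices (idem_key, total) VALUES (?, ?)",
        (idempotency_key(pdf_bytes), total),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical table: the primary key on idem_key is what enforces deduplication.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (idem_key TEXT PRIMARY KEY, total REAL)")
```

With `INSERT OR IGNORE` the first successful write wins, so a retry that extracts a different total is dropped rather than overwriting the stored row. Whether that or an upsert is the right policy depends on the pipeline.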
A validation failure that's logged but not queued for review means the failure rate is invisible until someone notices missing data. Route all persistent failures to a dead-letter queue with the full input, the model output, and the validation error.
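The dead-letter record described above can be sketched as a small data structure holding the three pieces a reviewer needs. The in-memory list is a stand-in for a durable queue or table, and all names here are hypothetical.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class DeadLetter:
    input_text: str        # the full input that was sent to the model
    model_output: str      # the raw output that failed validation
    validation_error: str  # why it was rejected
    failed_at: float = field(default_factory=time.time)


class DeadLetterQueue:
    """In-memory stand-in for a durable queue or database table."""

    def __init__(self) -> None:
        self.items: list[DeadLetter] = []

    def put(self, input_text: str, model_output: str, validation_error: str) -> None:
        self.items.append(DeadLetter(input_text, model_output, validation_error))

    def export_jsonl(self) -> str:
        # One JSON object per line, convenient for review tooling.
        return "\n".join(json.dumps(asdict(item)) for item in self.items)
```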
Setting temperature to 0 reduces variance but does not eliminate it — especially across model versions or after provider updates. Never design a pipeline that breaks on any output variation.
Design Tradeoffs
| Dimension | Schema Validation | LLM-as-Judge |
|---|---|---|
| Latency | <5ms (local, synchronous) | 500ms–2s (extra LLM call) |
| Cost | Near zero | $0.001–$0.01 per validation |
| What it catches | Format errors, wrong types, missing fields | Factual errors, tone issues, policy violations |
| False positive rate | Very low (deterministic) | Higher (model-dependent) |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| LLM output drives database writes, API calls, or financial operations | Output is advisory only and shown to a human who decides what to do |
| Schema conformance is required for downstream systems | Latency is the primary constraint and validation overhead is unacceptable |
| Compliance or audit trails require predictable, validated outputs | Prototype stage where occasional failures are acceptable and expected |
| LLM pipeline runs unattended in batch mode | Human reviews every output before any action is taken |