AI Systems · #6

AI Pipeline Reliability

LLM outputs are probabilistic — every downstream system that treats them as deterministic will eventually corrupt data, enter an infinite retry loop, or serve a wrong answer with full confidence.

Published May 29, 2026 · By MortalApps · 5 min read · 990 words

TL;DR

Schema validation (Pydantic, Instructor, Outlines) catches format failures in <5ms; semantic validation via an LLM judge adds 500ms–2s but catches factual errors schema can't.
Retrying on semantic failures without modifying the prompt is almost always futile — the model will reproduce the same error with high probability at the same temperature.
Idempotency keys on every LLM-triggered write operation are non-negotiable: retries after partial failures create duplicate records, double charges, or conflicting state.
Monitor validation failure rate as an SLI — a spike signals prompt regression, model update, or upstream data change before users report it.


The Problem

An LLM-powered invoice extraction pipeline parses PDFs and writes structured records to a database. The model returns valid JSON 98% of the time — but on 2% of invoices, it returns a string where a number is expected, or omits a required field. Those records fail silently, the pipeline retries the full PDF, writes a partial duplicate, and the finance team discovers the discrepancy three weeks later during reconciliation. This isn't a model quality problem — it's a pipeline reliability problem. Non-determinism is a property of LLMs, not a bug to be fixed; the pipeline must be engineered to handle it.

Core System Idea

A reliable AI pipeline treats every LLM output as untrusted external input that must pass validation before any state mutation. The architecture has four layers: (1) LLM inference — call the model with structured output mode (JSON schema enforcement) where the provider supports it; (2) schema validation — use Pydantic or Instructor to validate the returned structure in <5ms, rejecting malformed output immediately; (3) semantic validation — optionally use rule-based checks (regex, range assertions, business logic) or an LLM-as-judge call for nuanced correctness; (4) idempotent persistence — write to the database with a deduplication key derived from the input, so retries produce the same record rather than duplicates. Conditional retry logic re-prompts the LLM with specific validation error context rather than blindly resending the original prompt. Dead-letter queues capture failed items for human review rather than dropping them silently.
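
The schema validation layer can be sketched as follows. This is a minimal illustration using only the Python standard library as a stand-in for Pydantic or Instructor; the `Invoice` fields, `REQUIRED_FIELDS`, and `parse_invoice` are hypothetical names chosen for the invoice example, not part of any library.

```python
import json
from dataclasses import dataclass


class SchemaError(ValueError):
    """Raised when model output does not match the expected structure."""


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    invoice_total: float
    currency: str


# Hypothetical field spec: name -> accepted Python types after JSON decoding.
REQUIRED_FIELDS = {
    "invoice_id": (str,),
    "invoice_total": (int, float),
    "currency": (str,),
}


def parse_invoice(raw: str) -> Invoice:
    """Treat model output as untrusted input: parse it, then check every field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"output is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise SchemaError("output must be a JSON object")
    for name, types in REQUIRED_FIELDS.items():
        if name not in data:
            raise SchemaError(f"missing required field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            raise SchemaError(f"field {name} has wrong type: {type(value).__name__}")
    return Invoice(data["invoice_id"], float(data["invoice_total"]), data["currency"])
```

Raising a specific error message for each failure matters here: the same message can later be fed back to the model in a conditional re-prompt or stored with the item in a dead-letter queue.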

System Flow

flowchart TD
    A["User Input"] --> B["LLM Inference"]
    B --> C{"Schema Valid?"}
    C -- No --> D["Conditional Re-prompt"]
    D --> B
    C -- Yes --> E{"Semantic Valid?"}
    E -- No --> F["Dead Letter Queue"]
    E -- Yes --> G["Idempotent Write"]

Validation gates prevent invalid LLM output from reaching persistence; failed items go to DLQ for human review, not silent discard.

Real-World Examples (Indicative)

GitHub Copilot

Generated code completions pass through AST parsing and syntax validation before surfacing to the user. Syntactically invalid completions (mismatched braces, undefined references detectable statically) are suppressed silently. This validation runs in a separate thread in ~10ms and explains why Copilot rarely suggests code that doesn't parse — it's filtered post-generation, not prevented by the model.

Klarna's AI support bot

Validates every LLM-generated customer response against a rule set before sending: no pricing claims not in the live product catalog, no promises about refund timelines that differ from policy, no mentions of competitor products. Responses failing validation trigger an escalation to a human agent rather than a retry — semantic errors in customer-facing financial responses are better handled by a human than by re-rolling the dice with the model.

Anthropic tool use (function calling)

Claude's tool use API takes a JSON schema (input_schema) for each declared tool, and the model generates tool call arguments to match it. This moves schema enforcement toward the provider level, but conformance is not guaranteed in every mode, so the application's own validation layer should still check parameter types before executing a tool. When a call fails validation, the application can return the error to the model as a tool result so that the next attempt carries the validation error as context.

Anti-Patterns

Blind retry on any failure

Resending the identical prompt after a semantic validation failure gives the model the same context that produced the wrong answer — it reproduces the error at roughly the same rate. Always include the specific validation error in the retry prompt: "Your previous response was missing the invoice_total field. Return valid JSON including this field."
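
A retry prompt that carries the error context might be assembled like this. It is a sketch only: `build_retry_prompt` and its wording are illustrative assumptions, and the exact phrasing that works best will depend on the model and the task.

```python
from typing import Sequence


def build_retry_prompt(
    original_prompt: str, previous_output: str, errors: Sequence[str]
) -> str:
    """Compose a re-prompt showing the model what it returned and why it was rejected."""
    if not errors:
        raise ValueError("a retry prompt needs at least one validation error")
    error_lines = "\n".join(f"- {e}" for e in errors)
    return (
        f"{original_prompt}\n\n"
        "Your previous response was rejected by validation.\n"
        f"Previous response:\n{previous_output}\n\n"
        f"Validation errors:\n{error_lines}\n\n"
        "Return valid JSON that fixes every error listed above."
    )
```

The point of the design is that the second call differs from the first in exactly the information the model lacked: what it got wrong.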

Schema validation as the only gate

JSON schema validates structure, not correctness. {"amount": -99999, "currency": "ZZZ"} passes schema validation but is semantically wrong. Business logic rules and range assertions are a separate, necessary layer.
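
Rule-based semantic checks for the record above could look like the following sketch. The allow-list and the amount ceiling are assumptions made for illustration; a real pipeline would take them from configuration or business policy.

```python
from typing import List

# Assumed allow-list; in practice, load the supported currency codes from configuration.
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}
# Hypothetical business ceiling for a single invoice.
MAX_INVOICE_AMOUNT = 1_000_000


def semantic_errors(record: dict) -> List[str]:
    """Return business-rule violations; an empty list means the record passes."""
    errors = []
    amount = record.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        errors.append("amount must be a number")
    elif amount <= 0:
        errors.append(f"amount must be positive, got {amount}")
    elif amount > MAX_INVOICE_AMOUNT:
        errors.append(f"amount exceeds ceiling of {MAX_INVOICE_AMOUNT}, got {amount}")
    currency = record.get("currency")
    if currency not in ALLOWED_CURRENCIES:
        errors.append(f"unknown currency code: {currency!r}")
    return errors
```

Collecting every violation, rather than stopping at the first, gives a re-prompt or a human reviewer the full picture in one pass.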

LLM-triggered writes without idempotency keys

An LLM pipeline that writes to a database or sends an email without a deduplication key will create duplicates on any retry. Derive the key from the input hash — not a timestamp, which changes on retry.
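
One way to sketch this uses a SHA-256 hash of the input as the primary key and lets the database reject repeats. The table layout and the function names are hypothetical; SQLite's `INSERT OR IGNORE` stands in for whatever upsert or unique-constraint mechanism the real datastore offers.

```python
import hashlib
import sqlite3


def dedup_key(pdf_bytes: bytes) -> str:
    """The same input always yields the same key, so retries collide on purpose."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def write_invoice(conn: sqlite3.Connection, key: str, total: float, currency: str) -> bool:
    """Insert once; return False when the key already exists (the write was a retry)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO invoices (dedup_key, total, currency) VALUES (?, ?, ?)",
        (key, total, currency),
    )
    conn.commit()
    return cur.rowcount == 1


# Usage: the second write with the same input is a no-op instead of a duplicate.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (dedup_key TEXT PRIMARY KEY, total REAL, currency TEXT)"
)
key = dedup_key(b"%PDF-1.7 example invoice bytes")
first = write_invoice(conn, key, 120.0, "USD")
second = write_invoice(conn, key, 120.0, "USD")
```

Because the key depends only on the input, it is stable across process restarts and retries, which a timestamp or a random UUID generated at write time would not be.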

Dropping failed items silently

A validation failure that's logged but not queued for review means the failure rate is invisible until someone notices missing data. Route all persistent failures to a dead-letter queue with the full input, the model output, and the validation error.

Treating temperature=0 as deterministic

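Setting temperature to 0 reduces variance but does not eliminate it: identical requests can still return different outputs, and behavior shifts across model versions or after provider updates. Design the pipeline so that any variation in output is caught by validation rather than breaking a downstream step.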

Design Tradeoffs

Dimension	Schema Validation	LLM-as-Judge
Latency	<5ms (local, synchronous)	500ms–2s (extra LLM call)
Cost	Near zero	$0.001–$0.01 per validation
What it catches	Format errors, wrong types, missing fields	Factual errors, tone issues, policy violations
False positive rate	Very low (deterministic)	Higher (model-dependent)

Best Practices

Use Instructor or Outlines to enforce structured output at the model call level. Instructor retries automatically with the validation error appended, which handles the most common failure mode without custom retry logic; Outlines constrains decoding so that the output conforms to the schema as it is generated.

Define a validation failure SLI: alert when failure rate exceeds 2% over a 5-minute window. A jump from 0.5% to 5% means the prompt regressed, the model updated, or upstream data changed.
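
A sliding-window version of this SLI can be sketched in a few lines. The class name, the `min_samples` guard, and the in-memory deque are assumptions for illustration; a production system would more likely emit counters to its existing metrics stack and alert there.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class FailureRateMonitor:
    """Tracks validation outcomes over a sliding time window."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.02,
                 min_samples: int = 50) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        # Assumed guard: avoid alerting on a handful of events.
        self.min_samples = min_samples
        self._events: Deque[Tuple[float, bool]] = deque()

    def _evict(self, now: float) -> None:
        """Drop events older than the window."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def record(self, failed: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._events.append((now, failed))
        self._evict(now)

    def failure_rate(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_alert(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        rate = self.failure_rate(now)
        return len(self._events) >= self.min_samples and rate > self.threshold
```

The `now` parameter exists so the monitor can be driven with explicit timestamps in tests; in normal use it defaults to the monotonic clock.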

For retries, cap at 2 attempts with error context included in the prompt. After 2 failures, route to a dead-letter queue — don't keep calling the model.
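
The capped retry loop might be sketched as below. `call_model` is a hypothetical callable standing in for the real model client, `validate` is a deliberately minimal check, and the dead-letter queue is modeled as a plain list; all three are assumptions for illustration.

```python
import json
from typing import Callable, List, Optional

MAX_ATTEMPTS = 2  # total model calls per item, per the cap described above


def validate(raw: str) -> Optional[str]:
    """Return an error message, or None when the output is acceptable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "response was not valid JSON"
    if not isinstance(data, dict) or "invoice_total" not in data:
        return "response was missing the invoice_total field"
    return None


def run_with_retries(
    prompt: str,
    call_model: Callable[[str], str],
    dead_letters: List[dict],
) -> Optional[dict]:
    """Call the model at most MAX_ATTEMPTS times, then dead-letter the item."""
    attempt_prompt = prompt
    raw = ""
    error: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        raw = call_model(attempt_prompt)
        error = validate(raw)
        if error is None:
            return json.loads(raw)
        # Re-prompt with the specific error instead of resending the same text.
        attempt_prompt = (
            f"{prompt}\n\nYour previous response was rejected: {error}. "
            "Return valid JSON that fixes this."
        )
    # Keep the full input, the last output, and the error for human review.
    dead_letters.append(
        {"input": prompt, "output": raw, "error": error, "attempts": MAX_ATTEMPTS}
    )
    return None
```

Returning `None` after dead-lettering keeps the failure explicit for the caller, which can skip the write step rather than persisting an unvalidated record.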

Assign a deduplication key to every LLM-triggered write derived from a hash of the input. This makes the write operation idempotent regardless of how many times the pipeline runs.

Version your prompts alongside your application code. When validation failure rate spikes, you need to know which prompt version was live — not reconstruct it from memory.

When to Use / Avoid

Use When	Avoid When
LLM output drives database writes, API calls, or financial operations	Output is advisory only and shown to a human who decides what to do
Schema conformance is required for downstream systems	Latency is the primary constraint and validation overhead is unacceptable
Compliance or audit trails require predictable, validated outputs	Prototype stage where occasional failures are acceptable and expected
LLM pipeline runs unattended in batch mode	Human reviews every output before any action is taken