Engineering for Unreliability

How to build reliable systems from unreliable components — and why this isn't as new as it feels.

The Discomfort

Every developer knows this contract:

Same input → same logic → same output. Every time.

AI breaks this contract. The same prompt, the same model, the same input can produce different outputs on consecutive runs. For engineers trained on deterministic systems, this feels fundamentally broken.

The resistance isn't purely intellectual. It's emotional. Engineers build identity around the quality of their output, around control over the system's behavior. When a component produces inconsistent results, it doesn't just challenge a technical assumption. It challenges the sense of mastery that makes the work feel meaningful. Recognizing this is the first step past it.

It's not broken. It's a different kind of system. And it requires a different kind of reliability engineering.

You Already Do This

The discomfort is real, but the problem isn't new. Developers already work with unreliable components daily:

Networks fail. You retry. You add timeouts. You design for partial failure.
Third-party APIs change. You version. You write contract tests. You add fallbacks.
Race conditions exist. You use locks, queues, and idempotency.
Users provide garbage input. You validate, sanitize, and reject.

None of these systems are deterministic at the component level. They're reliable at the system level — because you engineered reliability around unreliable parts.

AI is the same pattern. The model is an unreliable component. Your job is to build a reliable system around it.

The Reframe

Traditional software asks: Does it work?

AI systems ask: How often does it work?

This isn't a lower standard. It's a more honest one. Traditional software also fails — it just fails in ways we've learned to hide (error handling, retries, graceful degradation). AI makes the unreliability visible instead of burying it.

Reliability is not "it always gives the same answer." Reliability is "the system consistently performs within acceptable bounds." The bounds are yours to define.

The Toolkit

The shift from deterministic to probabilistic systems requires specific engineering practices. None of these are exotic — they're the same discipline you already apply to distributed systems, applied to a new kind of component.

Evaluation replaces unit testing

In traditional software, tests verify exact outputs. In AI systems, you evaluate distributions.

Every AI system should have:

An evaluation dataset that represents expected behavior
Automated evaluation runs on every change
Regression checks that catch degradation

This plays the same role as a test suite. Without it, AI systems degrade silently — not because the code changed, but because the model did.
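
A minimal sketch of what such an evaluation run could look like, assuming the model is wrapped as a plain function from prompt to text. The names EvalCase, pass_rate, and is_regression are illustrative, not part of any library, and the 0.02 tolerance is an arbitrary placeholder for a bound you would choose yourself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable


def pass_rate(model: Callable[[str], str], cases: list[EvalCase],
              runs_per_case: int = 5) -> float:
    """Fraction of acceptable outputs across repeated runs of every case."""
    if not cases or runs_per_case < 1:
        raise ValueError("need at least one case and one run per case")
    # Each case is run several times because the same prompt may yield different outputs.
    results = [case.check(model(case.prompt))
               for case in cases
               for _ in range(runs_per_case)]
    return sum(results) / len(results)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag a change whose pass rate falls more than `tolerance` below the baseline."""
    return candidate < baseline - tolerance
```

In use, the baseline would come from the last accepted run, and a change to the prompt, the model, or the surrounding code would be blocked when is_regression returns True.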

Guardrails replace trust

LLM output should almost never be trusted directly. The model provides reasoning. The system provides reliability.

Production systems must include validation layers:

Structured outputs or schemas that enforce format
Deterministic business rules that override model judgment where appropriate
Output validation that catches hallucinations and constraint violations
Retry and repair loops for recoverable failures
Agent-as-reviewer — a separate evaluator agent reviews the primary agent's output (see below)
Risk-graded human review where the cost of error is high (see risk-graded validation gates for allocating human-in-the-loop, human-on-the-loop, and human-out-of-the-loop review: HITL / HOTL / HOOTL)

The guardrails aren't an admission that the AI is bad. They're the engineering that makes the AI useful.
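
The validation and repair layers listed above can be sketched as follows. This is one possible shape, not a prescribed implementation: the JSON format, the ALLOWED_CATEGORIES rule, and the function names are assumptions made for illustration, and the model is assumed to be a plain function from prompt to text.

```python
import json
from typing import Callable

ALLOWED_CATEGORIES = {"billing", "technical", "other"}  # hypothetical business rule


def validate(raw: str) -> dict:
    """Parse model output and enforce format and constraints; raise ValueError on violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("category") not in ALLOWED_CATEGORIES:
        raise ValueError(f"category must be one of {sorted(ALLOWED_CATEGORIES)}")
    return data


def generate_with_repair(model: Callable[[str], str], prompt: str,
                         max_attempts: int = 3) -> dict:
    """Call the model, validate the reply, and feed the error back on failure."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = model(attempt_prompt)
        try:
            return validate(raw)
        except ValueError as err:
            # Repair step: tell the model what was wrong and ask again.
            attempt_prompt = (f"{prompt}\n\nYour previous reply was rejected: {err}. "
                              "Reply with JSON only.")
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

The caller only ever sees a validated dict or an explicit failure, which is the point: the deterministic layer, not the model, decides what leaves the system.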

Agent-as-reviewer is the production default

For non-trivial work, pair the generator with a separate evaluator agent — different context, sometimes a different model. The pattern is now the production default for code review (CodeRabbit, Graphite Diamond, Greptile, GitHub Copilot review) and is being adopted in customer service (tone + policy review), document processing, and other discrete-task domains.

Why it matters: humans who rate output quickly are unreliable graders of confidently presented incorrect output. Agent reviewers, run in different contexts, surface what tired humans miss. The PoLL ("Panel of LLM evaluators") research finds that a panel of smaller models often outperforms a single large judge, at lower cost (Verga et al., 2024).

Cost structure: a separate reviewer agent on every artifact roughly doubles inference spend per task. At Rung 5, this is now considered worth it — the cost-per-merged-unit math works because human review at scale doesn't.
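
As a rough illustration of the panel idea, the sketch below aggregates several independent reviewers by strict majority vote. This is a simplification chosen for clarity and is not claimed to be the aggregation method used in the PoLL paper. Each reviewer is assumed to be a function that returns True to approve, and the name panel_verdict is hypothetical.

```python
from typing import Callable

Reviewer = Callable[[str], bool]  # True means "approve"


def panel_verdict(artifact: str, reviewers: list[Reviewer]) -> bool:
    """Approve only if a strict majority of reviewers approve; ties are rejected."""
    if not reviewers:
        raise ValueError("need at least one reviewer")
    approvals = sum(1 for review in reviewers if review(artifact))
    return approvals * 2 > len(reviewers)
```

In a real system each reviewer would wrap a separate model call with its own context. Rejecting ties keeps the default conservative when the panel disagrees.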

Smaller steps replace large prompts

Large prompts that try to solve everything are fragile. They combine too many failure modes into a single call.

Reliable AI systems decompose tasks:

Classify the input
Retrieve relevant context
Generate output
Validate structure and quality
Repair if needed

Each step can be evaluated, monitored, and improved independently. Smaller reasoning units increase reliability for the same reason that small functions are more reliable than monolithic ones.
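
The five steps can be sketched as a pipeline of small, separately replaceable functions. The signatures below are assumptions for illustration. In practice each callable would wrap a model call, a retrieval query, or a deterministic check.

```python
from typing import Callable


def run_pipeline(
    text: str,
    classify: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    validate: Callable[[str], bool],
    repair: Callable[[str], str],
) -> tuple[str, list[str]]:
    """Run the steps in order; return the final output and the names of the steps that ran."""
    steps = ["classify"]
    label = classify(text)
    steps.append("retrieve")
    context = retrieve(label)
    steps.append("generate")
    output = generate(text, context)
    steps.append("validate")
    if not validate(output):
        # Repair runs only when validation fails, and gets exactly one attempt here.
        steps.append("repair")
        output = repair(output)
        if not validate(output):
            raise ValueError("output still invalid after one repair attempt")
    return output, steps
```

Because every step is passed in as a function, each can be evaluated against its own dataset and swapped without touching the others.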

Observability replaces assumptions

Production AI systems must expose:

Prompt logs — what went in
Model responses — what came out
Evaluation scores — how good was it
Failure cases — what went wrong
Cost and latency metrics — what it costs

You can't improve what you can't see. And AI systems fail in ways that are invisible without instrumentation — the output looks plausible but is wrong.
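
One way to capture the five signals above is a single structured record per model call. The sketch below is an assumption about shape, not a standard: CallRecord, observed_call, and to_log_line are hypothetical names, the cost is passed in by the caller rather than computed, and the scorer is any function from response to a number.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass
class CallRecord:
    prompt: str                  # what went in
    response: str                # what came out
    eval_score: Optional[float]  # how good it was (None if the call failed)
    error: Optional[str]         # what went wrong, if anything
    cost_usd: float              # what it cost
    latency_s: float             # how long it took


def observed_call(model: Callable[[str], str], prompt: str,
                  score: Callable[[str], float], cost_usd: float = 0.0) -> CallRecord:
    """Call the model once and record inputs, outputs, score, failure, cost, and latency."""
    start = time.perf_counter()
    try:
        response, error = model(prompt), None
    except Exception as exc:  # record the failure instead of losing it
        response, error = "", repr(exc)
    latency_s = time.perf_counter() - start
    eval_score = score(response) if error is None else None
    return CallRecord(prompt, response, eval_score, error, cost_usd, latency_s)


def to_log_line(record: CallRecord) -> str:
    """Serialize a record as one JSON line for whatever log pipeline is in use."""
    return json.dumps(asdict(record))
```

Plausible-but-wrong outputs show up here as low evaluation scores on calls that otherwise succeeded, which is exactly the case ordinary error logging misses.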

Versioning replaces stability

AI models evolve continuously. A model update can change behavior in ways no code change would explain.

Engineering practices must include:

Model versioning — pin the model version in production. Upgrade deliberately.
Prompt versioning — treat prompts like code. Track changes. Review diffs.
Evaluation on upgrade — run your evaluation suite before and after every model change, the same way you run tests before deploying a new dependency.
Rapid iteration — when something breaks, iterate fast. The fix is usually in the prompt, the guardrails, or the evaluation — not in the model.

Treat model changes like dependency upgrades. You already know how to manage those.
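
A small sketch of pinning, assuming a release is described by a model version string and a prompt. The Release class and changed_fields helper are hypothetical, and the version strings in the usage note are made up. The idea is that any behavior change can be traced to a specific field that changed.

```python
import hashlib
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Release:
    model_version: str  # pinned explicitly, never a floating alias
    prompt: str         # tracked and reviewed like code

    def prompt_hash(self) -> str:
        """Short content hash so logs can reference the exact prompt text."""
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]


def changed_fields(old: Release, new: Release) -> list[str]:
    """Names of the fields that differ between two releases."""
    return [f.name for f in fields(Release)
            if getattr(old, f.name) != getattr(new, f.name)]
```

For example, comparing Release("example-model-2024-01", p) with Release("example-model-2024-06", p) reports only model_version, which signals that the evaluation suite should be rerun before the upgrade ships.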

Failure Modes and Recovery at High Maturity

The toolkit handles known failure modes. At Tier 3 / Rung 5 maturity, three additional concerns become dominant: two new failure modes (sycophancy, subjective edge cases) and a response pattern (recalibration vs debugging) that distinguishes high-maturity practice from low-maturity practice. None are caught by tests or monitoring; engineers need to recognize them.

Sycophancy: a structural concern

LLMs reliably defend wrong positions with confidence. This is measured across multiple studies (Sharma et al., 2023; Wen et al., 2024; OpenAI on hallucination, 2025). Wen et al. found that RLHF makes models better at convincing humans they're right without making them better at being right: the human evaluators' false-positive rate rose by 24% on QA and 18% on programming.

The literature genuinely disagrees on whether sycophancy is a tractable training fix or a structural artifact of RLHF.

The framework's stance: treat sycophancy as a structural concern for engineering purposes, regardless of training trajectory.

Build process safeguards (external signal, agent-as-reviewer, ground-truth retrieval, executable tests) into every loop. Don't trust model confidence as a signal of correctness.

If training improves it later, the safeguards remain useful infrastructure. If it doesn't, the safeguards are essential.

Subjective edge cases

At low maturity, edge cases are technical: tests don't catch them, monitoring shows them as anomalies, the fix is in the code. At higher maturity, the dominant edge case shifts: it's subjective. A user reports the AI got something wrong — but the tests pass, monitoring is green, and the failure is qualitative.

Examples:

Engineering — a PR is technically correct but the agent took a wrong implementation path; the user notices "this isn't what I meant"
Customer service — an AI-resolved ticket has correct facts but tone is off for the customer's emotional state
Finance — a transaction is correctly categorized by the rule but the rule itself misrepresents the business intent
Content / marketing — copy is grammatical and on-spec but loses the brand's voice

The recovery is conversation, not patching. Talk to the user. Understand what they were actually trying to accomplish. Update the spec or context — not the code.

This failure mode is well-documented but under-named. Practitioner sources reference it under different vocabulary: Addy Osmani's 70% problem (the last 30% is human work), NN/g UX research, the PAIR Guidebook. The framework names it explicitly because the operational response differs structurally from a technical bug fix.

Recalibration vs debugging

When the AI is wrong, two responses are possible:

Debugging: fix the artifact (code, response, document) the agent produced. Treats the failure as a code problem.

Recalibration: rebuild the agent's understanding via fresh context, re-articulated spec, multi-perspective brainstorm. Treats the failure as a spec or context problem.

These are operationally distinct.

The literature on intrinsic self-correction is consistent: a model that committed to a wrong direction will not reliably notice on its own, and reflection in the same context window after a wrong answer tends to compound the error rather than fix it (Huang et al., 2023; Kamoi et al., 2024). Cemri et al. (Why Do Multi-Agent LLM Systems Fail?, 2025) found 41.8% of multi-agent failures are specification or design issues that need re-specification, not retry.

Practical heuristic: if the same spec produces the same wrong output across two fresh runs, debug the spec, not the artifact. Most non-trivial failures at high maturity are recalibration problems disguised as debugging problems.
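
The heuristic can be written down as a small triage function. This is a sketch under simplifying assumptions: fresh_run stands for a hypothetical function that executes the spec in a brand-new context, outputs are compared by exact equality (a real system would need a looser notion of "same"), and the returned labels are illustrative.

```python
from typing import Callable


def triage(spec: str, fresh_run: Callable[[str], str],
           is_wrong: Callable[[str], bool], runs: int = 2) -> str:
    """Return 'recalibrate', 'debug', or 'ok' based on repeated fresh runs of one spec."""
    outputs = [fresh_run(spec) for _ in range(runs)]
    wrong = [out for out in outputs if is_wrong(out)]
    if len(wrong) == runs and len(set(wrong)) == 1:
        # Same wrong output every time: the spec or context is the likely cause.
        return "recalibrate"
    if wrong:
        # Wrong only sometimes, or wrong in different ways: treat as an artifact-level failure.
        return "debug"
    return "ok"
```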

See AI Lab § Stuck-State Protocol for the operational pattern.

The Role Shift

In this model, engineers are not primarily writing logic. They are designing systems that supervise intelligence. This is, structurally, management: specifying intent to an imperfect executor, evaluating output rather than controlling the process, and iterating on your own communication when the result doesn't match the intent. Managers have always operated in a probabilistic world. Engineers are joining them.

Engineers used to implement:

Write business logic
Implement known algorithms
Debug line by line
Optimize execution paths

Engineers now orchestrate:

Design prompts and tool interfaces
Build validation and evaluation layers
Design retry and correction loops
Monitor system-level performance

This isn't a demotion. It's a shift to a higher level of abstraction, the same shift that happened when engineers moved from assembly to high-level languages, or from bare metal to cloud infrastructure. Each transition felt like losing control. Each one was actually gaining leverage.

The Core Principle

We do not program intelligence. We design environments where intelligence performs reliably.

The unreliability is not the problem. It's the nature of the component. The engineering is what turns an unreliable component into a reliable system.

This is not new. This is what engineers have always done.

Related: Codebase readiness addresses the inverse problem: not how to make reliable systems from unreliable AI output, but how to give AI agents a codebase legible enough to produce reliable work. Same uncertainty, reflected.
