Engineering for Unreliability

How to build reliable systems from unreliable components — and why this isn't as new as it feels.


The Discomfort

Every developer knows this contract:

Same input → same logic → same output. Every time.

AI breaks this contract. The same prompt, the same model, the same input can produce different outputs on consecutive runs. For engineers trained on deterministic systems, this feels fundamentally broken.

The resistance isn't purely intellectual. It's emotional. Engineers build identity around the quality of their output, around control over the system's behavior. When a component produces inconsistent results, it doesn't just challenge a technical assumption. It challenges the sense of mastery that makes the work feel meaningful. Recognizing this is the first step past it.

It's not broken. It's a different kind of system. And it requires a different kind of reliability engineering.


You Already Do This

The discomfort is real, but the problem isn't new. Developers already work with unreliable components daily:

  • Networks fail. You retry. You add timeouts. You design for partial failure.
  • Third-party APIs change. You version. You write contract tests. You add fallbacks.
  • Race conditions exist. You use locks, queues, and idempotency.
  • Users provide garbage input. You validate, sanitize, and reject.
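The failure-handling discipline in the list above fits in a few lines. A minimal retry-with-backoff wrapper, as a sketch; `call_flaky_service` and `TransientError` are illustrative stand-ins for any dependency that can fail transiently:

```python
import time

class TransientError(Exception):
    """Illustrative stand-in for a timeout or 5xx from a dependency."""

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying on transient failure with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (2 ** attempt))

# Usage: a component that fails twice, then succeeds on the third call.
calls = {"n": 0}
def call_flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError
    return "ok"

result = with_retries(call_flaky_service)
```

The component is unreliable; the wrapper is what makes the call reliable. That is the whole pattern.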

None of these systems are deterministic at the component level. They're reliable at the system level — because you engineered reliability around unreliable parts.

AI is the same pattern. The model is an unreliable component. Your job is to build a reliable system around it.


The Reframe

Traditional software asks: Does it work?

AI systems ask: How often does it work?

This isn't a lower standard. It's a more honest one. Traditional software also fails — it just fails in ways we've learned to hide (error handling, retries, graceful degradation). AI makes the unreliability visible instead of burying it.

Reliability is not "it always gives the same answer." Reliability is "the system consistently performs within acceptable bounds." The bounds are yours to define.
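"Acceptable bounds" can be made concrete as a threshold over repeated runs. A minimal sketch, assuming each run can be scored pass/fail; the 90% bound is an illustrative choice, not a recommendation:

```python
def within_bounds(outcomes, min_pass_rate=0.90):
    """True if the observed pass rate meets the bound you defined."""
    return sum(outcomes) / len(outcomes) >= min_pass_rate

# 19 of 20 runs acceptable: a 95% pass rate, inside a 90% bound.
runs = [True] * 19 + [False]
ok = within_bounds(runs)
```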


The Toolkit

The shift from deterministic to probabilistic systems requires specific engineering practices. None of these are exotic — they're the same discipline you already apply to distributed systems, applied to a new kind of component.

Evaluation replaces unit testing

In traditional software, tests verify exact outputs. In AI systems, you evaluate distributions.

Every AI system should have:

  • An evaluation dataset that represents expected behavior
  • Automated evaluation runs on every change
  • Regression checks that catch degradation

This plays the same role as a test suite. Without it, AI systems degrade silently — not because the code changed, but because the model did.
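A minimal evaluation harness covering those three pieces, as a sketch. The `toy_model`, dataset, and exact-match scorer are illustrative; a real system would call an LLM and use task-specific graders:

```python
def evaluate(model, dataset, score):
    """Run the model over an eval dataset; return the mean score."""
    scores = [score(model(case["input"]), case["expected"]) for case in dataset]
    return sum(scores) / len(scores)

def check_regression(new_score, baseline, tolerance=0.02):
    """Gate a change: fail if quality dropped more than the tolerance."""
    return new_score >= baseline - tolerance

# Illustrative: exact-match scoring over a two-case dataset.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
toy_model = {"2+2": "4", "capital of France": "Paris"}.get
score = lambda out, exp: 1.0 if out == exp else 0.0

current = evaluate(toy_model, dataset, score)
passed = check_regression(current, baseline=0.95)
```

Run this on every prompt change and every model upgrade, the way you run tests on every commit.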

Guardrails replace trust

LLM output should almost never be trusted directly. The model provides reasoning. The system provides reliability.

Production systems must include validation layers:

  • Structured outputs or schemas that enforce format
  • Deterministic business rules that override model judgment where appropriate
  • Output validation that catches hallucinations and constraint violations
  • Retry and repair loops for recoverable failures
  • Human review where the cost of error is high

The guardrails aren't an admission that the AI is bad. They're the engineering that makes the AI useful.
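Several of these layers combine in a few lines. A sketch of schema validation, a deterministic business rule, and a retry/repair loop, assuming the model is asked for JSON; `generate` and the field names are illustrative:

```python
import json

def validate(raw, required_keys=("summary", "priority")):
    """Parse and check structure; return data or None on violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not all(k in data for k in required_keys):
        return None
    if data["priority"] not in ("low", "medium", "high"):
        return None  # deterministic business rule overrides model judgment
    return data

def generate_with_repair(generate, max_attempts=3):
    """Retry/repair loop: re-ask until output validates or attempts run out."""
    for _ in range(max_attempts):
        data = validate(generate())
        if data is not None:
            return data
    raise ValueError("model output never satisfied the schema")

# Illustrative model: malformed on the first attempt, valid on the second.
replies = iter(['not json', '{"summary": "ok", "priority": "low"}'])
result = generate_with_repair(lambda: next(replies))
```

Note what the guardrail does: the model is free to be wrong, and the system still only emits valid output.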

Smaller steps replace large prompts

Large prompts that try to solve everything are fragile. They combine too many failure modes into a single call.

Reliable AI systems decompose tasks:

  1. Classify the input
  2. Retrieve relevant context
  3. Generate output
  4. Validate structure and quality
  5. Repair if needed

Each step can be evaluated, monitored, and improved independently. Smaller reasoning units increase reliability for the same reason that small functions are more reliable than monolithic ones.
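The five steps above can be sketched as a pipeline of small, separately testable functions. Every name here is illustrative and every stage is stubbed; in a real system each stage would wrap its own model call or retrieval lookup:

```python
def classify(text):
    """Step 1: route the input to a task type (stubbed heuristic)."""
    return "question" if text.endswith("?") else "statement"

def retrieve(task_type):
    """Step 2: fetch context relevant to the task (stubbed lookup)."""
    return {"question": "answering docs", "statement": "logging docs"}[task_type]

def generate(text, context):
    """Step 3: produce a draft output (stubbed model call)."""
    return f"[{context}] {text}"

def validate(output):
    """Step 4: check structure and quality."""
    return output.startswith("[") and len(output) < 200

def repair(output):
    """Step 5: fix recoverable problems."""
    return output[:200]

def pipeline(text):
    task = classify(text)
    context = retrieve(task)
    out = generate(text, context)
    if not validate(out):
        out = repair(out)
    return out

answer = pipeline("What failed?")
```

Because each stage has its own inputs and outputs, each one gets its own evaluation set, its own metrics, and its own fixes.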

Observability replaces assumptions

Production AI systems must expose:

  • Prompt logs — what went in
  • Model responses — what came out
  • Evaluation scores — how good was it
  • Failure cases — what went wrong
  • Cost and latency metrics — what it costs

You can't improve what you can't see. And AI systems fail in ways that are invisible without instrumentation — the output looks plausible but is wrong.
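A minimal instrumentation wrapper that records the five fields above as structured log entries, as a sketch; the stub model, scorer, and in-memory `LOG` are illustrative stand-ins for a real model call and a logging/metrics backend:

```python
import time

LOG = []  # stand-in for your logging/metrics backend

def observed_call(model, prompt, score_fn):
    """Wrap a model call so inputs, outputs, quality, and cost are visible."""
    start = time.monotonic()
    response = model(prompt)
    score = score_fn(response)
    LOG.append({
        "prompt": prompt,                       # what went in
        "response": response,                   # what came out
        "score": score,                         # how good was it
        "failure": score < 0.5,                 # what went wrong
        "latency_s": time.monotonic() - start,  # what it cost in time
    })
    return response

# Illustrative: a stub model and a trivial scorer.
stub = lambda p: p.upper()
out = observed_call(stub, "hello", score_fn=lambda r: 1.0 if r else 0.0)
```

The point is not this particular wrapper; it's that every production call path passes through something like it, so no output goes unrecorded.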

Versioning replaces stability

AI models evolve continuously. A model update can change behavior in ways no code change would explain.

Engineering practices must include:

  • Model versioning — pin the model version in production. Upgrade deliberately.
  • Prompt versioning — treat prompts like code. Track changes. Review diffs.
  • Evaluation on upgrade — run your evaluation suite before and after every model change, the same way you run tests before deploying a new dependency.
  • Rapid iteration — when something breaks, iterate fast. The fix is usually in the prompt, the guardrails, or the evaluation — not in the model.

Treat model changes like dependency upgrades. You already know how to manage those.
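The pinning discipline can be sketched as explicit versioned configuration plus an eval gate on every change. Model names, version strings, and the baseline score are all illustrative:

```python
CONFIG = {
    "model": "example-model-2024-06-01",  # pinned: upgrades are deliberate
    "prompt_version": "v3",               # prompts tracked and reviewed like code
}

PROMPTS = {
    "v3": "Summarize the ticket in one sentence: {ticket}",
}

def safe_upgrade(current, candidate, run_eval, baseline):
    """Adopt a new model/prompt config only if the eval suite still passes."""
    if run_eval(candidate) < baseline:
        return current  # reject: keep the pinned versions
    return candidate

# Illustrative: the candidate scores below baseline and is rejected.
candidate = {"model": "example-model-2024-09-01", "prompt_version": "v3"}
chosen = safe_upgrade(CONFIG, candidate, run_eval=lambda cfg: 0.80, baseline=0.90)
```

This is the same gate you'd put in front of a dependency bump: the new version ships only after it passes the suite.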


The Role Shift

In this model, engineers are not primarily writing logic. They are designing systems that supervise intelligence. This is, structurally, management: specifying intent to an imperfect executor, evaluating output rather than controlling the process, and iterating on your own communication when the result doesn't match the intent. Managers have always operated in a probabilistic world. Engineers are joining them.

Engineers used to implement:

  • Write business logic
  • Implement known algorithms
  • Debug line by line
  • Optimize execution paths

Engineers now orchestrate:

  • Design prompts and tool interfaces
  • Build validation and evaluation layers
  • Design retry and correction loops
  • Monitor system-level performance

This isn't a demotion. It's a shift to a higher level of abstraction, the same shift that happened when engineers moved from assembly to high-level languages, or from bare metal to cloud infrastructure. Each transition felt like losing control. Each one was actually gaining leverage.


The Core Principle

We do not program intelligence. We design environments where intelligence performs reliably.

The unreliability is not the problem. It's the nature of the component. The engineering is what turns an unreliable component into a reliable system.

This is not new. This is what engineers have always done.
