The AI Lab
A cutting-edge engineering environment where specs go in and software comes out.
Engineering Environment — Rung 5 (Autonomous Production)
The AI Lab is a parallel operating environment targeting the highest rung of AI-driven development. It experiments with new ways of working and serves as a testing space for practices, agents, and workflows that will then be applied across engineering. The Lab operates outside of engineering's standard operating procedures. It has its own rules.
Two Maturity Scales
This document uses two distinct scales defined in the reference framework: the organizational scale (Levels 1-3, applicable company-wide) and the engineering scale (Rungs 0-5, specific to software development). See the framework for complete definitions and acceptance criteria.
The Lab targets Rung 5. Engineering outside the Lab aims for organizational Level 3 (Rungs 4-5).
The hardest transition is the shift from Rung 3 to Rung 4: accepting that you no longer read the code and trusting scenarios to validate the result. It's a psychological change before it's a technical one. Most engineers plateau at Rung 3 because letting go of control over the code goes against all their professional instincts.
Absolute Rules
Two rules define the Lab: code is not written by humans, and code is not reviewed by humans. They are not aspirations; they are conditions of admission.
The human defines the architecture, constraints, and satisfaction scenarios. AI produces the code, runs the tests, and converges toward the solution. If you're writing or reading code line by line, you're not operating in the Lab's working mode.
Project Admission Criteria
Greenfield projects are the Lab's natural terrain: no legacy, no technical debt, no habits. The Lab's rules (Rung 5) apply end-to-end from day one.
- Scope is sufficiently defined to write specs and scenarios
- The project can tolerate a learning pace
The Lab also takes on existing (brownfield) projects being transitioned to Rung 5. This is harder, because the code exists and so do the habits, but this is where the transformation has the most impact.
- Sufficient scenario coverage (or commitment to build it first)
- All new work follows the Lab's rules — no falling back
- Existing code is context for the agent, not untouchable
- Regression risk managed by scenarios, not human review
Typical sequence for a brownfield:
These steps assume a codebase being transitioned directly to Rung 5 under Lab rules. The upstream decisions — assessing codebase readiness and choosing among the three brownfield modes (remediate, strangler-fig, rebuild) — sit outside the Lab itself.
1. Extract the implicit specification. The existing system IS the specification. Nobody ever documented the thousand implicit decisions accumulated over years of patches, hotfixes, and workarounds that became permanent. This extraction is the hardest and most human work in the transition. It requires the people who know why this module has that exception, why this service was split that way, why this value is configured like that. AI can help document what the system does (generate specs from code). But distinguishing intentional behaviors from historical accidents remains a human judgment.
2. Write end-to-end scenarios that describe the current expected behavior, based on the specification extracted in step 1.
3. Verify that the scenarios pass on the existing code.
4. From that point on, all changes are made by the agent and validated by scenarios.
5. Iterate: each transitioned component increases the project's Rung 5 coverage.
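The scenario-baseline check in this sequence (scenarios must pass on the existing code before the agent takes over changes) can be sketched as follows. This is a minimal illustration, not the Lab's tooling: the `Scenario` type, the `check` callables, and both function names are hypothetical, and a real scenario would drive the running system end to end rather than return a constant.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    """Hypothetical end-to-end scenario: a name plus a callable that
    exercises the running system and returns True on expected behavior."""
    name: str
    check: Callable[[], bool]


def baseline_failures(scenarios: list[Scenario]) -> list[str]:
    """Return the names of scenarios that fail against the existing code."""
    return [s.name for s in scenarios if not s.check()]


def ready_for_agent_changes(scenarios: list[Scenario]) -> bool:
    """The agent takes over only when scenarios exist and all of them pass."""
    return len(scenarios) > 0 and not baseline_failures(scenarios)


# Stand-in checks for illustration only.
scenarios = [
    Scenario("checkout applies legacy discount", lambda: True),
    Scenario("export keeps historical column order", lambda: False),
]
print(baseline_failures(scenarios))
print(ready_for_agent_changes(scenarios))
```

A failing baseline scenario means either the extracted specification is wrong or the scenario describes an accident rather than an intention; both are resolved by humans before the agent starts making changes.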
What Does NOT Belong in the Lab
- Any project whose development continues using traditional practices (human writes or reviews code)
- Projects whose delivery constraints tolerate zero learning risk
Rule: the entry condition isn't the absence of existing code — it's the commitment that all new work follows the Lab's rules.
Working Mode: The Operational Unit
The Five Stages
The Lab structures all work around a recurring five-stage loop. The human concentrates at the boundaries; the agent runs inside.
1. Context: Living working context (CLAUDE.md / AGENTS.md, scoped tools, on-demand skills) validated against the system before work begins. Stale context produces confidently wrong work.
2. Clarification: Agent exposes assumptions and asks calibrated questions. No execution while material ambiguity remains. Cost rule: clarification cost ≪ correction cost.
3. Execution: Agent produces, runs tests, converges. The human doesn't supervise execution.
4. Validation: Risk-graded gate (see Risk-Graded Validation Gates). Tests, scenarios, and an agent reviewer for reversible work; human approval for irreversible work.
5. Recovery: When validation fails or the agent gets stuck, recalibrate (re-spec, re-context, brainstorm) rather than debug. See Stuck-State Protocol.
Stage 2 (Clarification) ships in production tooling — spec-kit's /speckit.clarify, Anthropic's plan mode, and the AskUserQuestion tool all operationalize it as a discrete gate. Stage 4 is detailed in Risk-Graded Validation Gates; Stage 5 in Stuck-State Protocol.
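The loop can be read as a small state machine. The sketch below is one possible encoding under stated assumptions: the `Stage` enum, the `DONE` marker, and the flag names are hypothetical, and the transition out of Recovery goes back to Context because recovery means recalibrating rather than patching the artifact.

```python
from enum import Enum


class Stage(Enum):
    CONTEXT = 1
    CLARIFICATION = 2
    EXECUTION = 3
    VALIDATION = 4
    RECOVERY = 5
    DONE = 6  # hypothetical terminal marker, not one of the five stages


def next_stage(stage, *, context_valid=True, ambiguity_open=False,
               agent_stuck=False, validation_passed=True):
    """Return the stage that follows `stage` given the current signals."""
    if stage is Stage.CONTEXT:
        # Stale context blocks progress until it is re-validated.
        return Stage.CLARIFICATION if context_valid else Stage.CONTEXT
    if stage is Stage.CLARIFICATION:
        # No execution while material ambiguity remains.
        return Stage.CLARIFICATION if ambiguity_open else Stage.EXECUTION
    if stage is Stage.EXECUTION:
        return Stage.RECOVERY if agent_stuck else Stage.VALIDATION
    if stage is Stage.VALIDATION:
        return Stage.DONE if validation_passed else Stage.RECOVERY
    if stage is Stage.RECOVERY:
        # Recalibrate, then restart from Context rather than Execution.
        return Stage.CONTEXT
    return Stage.DONE


# Walk the happy path with default signals.
stage = Stage.CONTEXT
path = [stage]
while stage is not Stage.DONE:
    stage = next_stage(stage)
    path.append(stage)
print([s.name for s in path])
```

The human-facing decisions sit in the flags for Context, Clarification, Validation, and Recovery; Execution has no human input in this sketch.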
Two Human Checkpoints
Human work concentrates at the front boundary (Context + Clarification) and the back boundary (Validation + Recovery). Inside the loop, agents and evaluators run. This is the operational pattern that makes Rung 5 sustainable — the human's attention isn't a per-line review bottleneck; it's a per-loop direction-and-judgment role.
The Pattern Is Discrete-Task, Not Engineering-Specific
The Lab applies the loop to code, but the same shape governs other discrete-task work:
| Stage | Engineering | Customer service | Finance operations |
|---|---|---|---|
| Context | Codebase + CLAUDE.md | Knowledge base + customer history | Ledger + chart of accounts + period rules |
| Clarification | Agent asks about ambiguous acceptance criteria | Agent asks "what does the customer actually want?" before drafting | Agent surfaces ambiguity in transaction categorization |
| Execution | PR with tests | Response + actions | Drafted journal entries |
| Validation | Agent reviewer + CI; human gate on production deploy | Agent reviewer for tone + policy; human gate on refund above threshold | Agent reconciler; human approval before posting |
| Recovery | Re-spec when subtle bug surfaces | Re-train when escalation patterns shift | Re-spec when an edge case surfaces a category gap |
The substrate changes; the loop doesn't. The Lab's rules — code not written or reviewed by humans — are the engineering instance of the broader principle: humans direct and validate; AI executes within risk-graded gates.
Scenarios vs Tests
- Tests: validations stored in the code. Vulnerable to gaming by agents — an agent can rewrite a test to make it pass. Useful but insufficient.
- Scenarios: end-to-end user journeys that describe expected behavior from the user's perspective. Harder to circumvent. The Lab favors scenarios.
When humans no longer read the code, unit tests lose a crucial advantage: the engineer's ability to identify edge cases from their knowledge of the implementation. In an opaque execution model, only end-to-end behavioral validation remains reliable, because it doesn't depend on knowledge of internal details.
Satisfaction Metric
The Lab doesn't measure success in binary (tests green / red). It measures satisfaction: "across all observed trajectories through all scenarios, what fraction satisfies the user?"
When satisfaction is insufficient, the problem is in the specification, not in the agent. Iterate on the spec, not the code.
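As a rough illustration, the satisfaction fraction can be computed directly from recorded trajectories. The `Trajectory` record and the per-scenario breakdown are assumptions made for this sketch; how a trajectory is judged "satisfying" is left to whatever evaluation the team uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one observed run through one scenario."""
    scenario: str
    satisfied: bool


def satisfaction(trajectories: list[Trajectory]) -> float:
    """Fraction of all observed trajectories that satisfy the user."""
    if not trajectories:
        raise ValueError("no trajectories observed")
    return sum(t.satisfied for t in trajectories) / len(trajectories)


def satisfaction_by_scenario(trajectories: list[Trajectory]) -> dict[str, float]:
    """Same fraction, broken down per scenario to locate weak spots in the spec."""
    grouped: dict[str, list[Trajectory]] = defaultdict(list)
    for t in trajectories:
        grouped[t.scenario].append(t)
    return {name: satisfaction(ts) for name, ts in grouped.items()}


runs = [
    Trajectory("signup", True),
    Trajectory("signup", True),
    Trajectory("refund", True),
    Trajectory("refund", False),
]
print(satisfaction(runs))              # 0.75
print(satisfaction_by_scenario(runs))  # {'signup': 1.0, 'refund': 0.5}
```

A low per-scenario value points at the part of the specification to revisit first.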
Token Economics
Wall-clock time is the wrong measure at Rung 5 — agents work in parallel, asynchronously, and overnight. The binding metrics are:
- Cost per merged unit (PR, ticket, transaction). Anthropic quantifies the multi-agent premium: typical agents use ~4× the tokens of chat; multi-agent systems can use ~15× (Anthropic, 2025). Cost per merged unit makes that visible.
- AI gross margin at the team level — value produced relative to inference spent. AI-native teams treat token cost as a first-class engineering metric. See AI economics at maturity.
- Agent throughput per dollar — merged units per dollar of inference. Distinguishes high-spend-high-throughput teams from high-spend-low-throughput ones.
The Lab tracks these alongside satisfaction. A team that reaches Rung 5 by running expensive multi-agent loops on every task can be technically successful and economically unsustainable at the same time.
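The three metrics reduce to simple ratios. The sketch below assumes a hypothetical `PeriodStats` record per team and reporting period, and it assumes AI gross margin is computed as (value produced minus inference spend) divided by value produced; the source only describes it as value produced relative to inference spent, so that formula is an interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodStats:
    """Hypothetical per-team figures for one reporting period."""
    inference_spend_usd: float
    merged_units: int          # PRs, tickets, or transactions
    value_produced_usd: float  # however the team estimates value


def cost_per_merged_unit(s: PeriodStats) -> float:
    if s.merged_units == 0:
        raise ValueError("no merged units in this period")
    return s.inference_spend_usd / s.merged_units


def throughput_per_dollar(s: PeriodStats) -> float:
    if s.inference_spend_usd == 0:
        raise ValueError("no inference spend in this period")
    return s.merged_units / s.inference_spend_usd


def ai_gross_margin(s: PeriodStats) -> float:
    # Assumed definition: (value - inference spend) / value.
    if s.value_produced_usd == 0:
        raise ValueError("no value recorded in this period")
    return (s.value_produced_usd - s.inference_spend_usd) / s.value_produced_usd


stats = PeriodStats(inference_spend_usd=1200.0, merged_units=40,
                    value_produced_usd=6000.0)
print(cost_per_merged_unit(stats))   # 30.0
print(throughput_per_dollar(stats))
print(ai_gross_margin(stats))
```

Tracking all three together helps separate a team that spends heavily and ships a lot from one that spends heavily and ships little.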
The Critical Skills: Specification and Process Design
The Lab's bottleneck isn't implementation speed — it's the front-boundary work. Two new skills:
- Specification — writing instructions precise enough for the agent to implement correctly without human intervention.
- Process design for AI — designing the constraints, gates, and validation tiers within which the agent operates consistently. Distinct from prompt engineering and from spec-writing per se. See AI Execution Standards § Layer 5.
Almost nobody has fully developed either.
The difficulty with specification: when a human receives an ambiguous spec, they fill the gaps with judgment, context, or a Slack message asking "did you mean X or Y?" The agent builds what you described. If the description is ambiguous, the software fills the gaps with machine assumptions, not customer intuitions. The clarification stage of the operational unit is the structural fix — but only if the team uses it.
This skill is developed through practice:
- AI clinics should include spec reviews: "here's my spec, here's what the agent produced, here's what was missing from the spec"
- Pair sessions should work on specification exercises, not just code exercises
- Every failed iteration is a signal about the spec, not the agent — document what the spec didn't state clearly enough
The Lab's goal isn't just to produce software through agents. It's to develop engineers who can specify with the rigor that agents demand.
Risk-Graded Validation Gates
Stage 4 (Validation) of the operational unit is risk-graded — different action classes get different gates. The dimensions that determine the gate, drawn from SAE J3016 (driving) as the cleanest analog:
- Operational design domain (ODD) — the conditions under which the agent is designed to function. Outside the ODD, the agent makes no claims; the gate falls back to human.
- Fallback responsibility — who handles the action when the agent gets stuck or hits a forbidden boundary.
- Supervision expectation — what the human is expected to do during operation.
- Transfer of control — how authority shifts between agent and human.
The Three Operational Stances
- HITL (human-in-the-loop): Human approval required before the action executes. Default for irreversible high-impact actions: financial transactions, production deploys, customer-facing communications, anything that creates legal or financial obligation.
- HOTL (human-on-the-loop): Agent acts autonomously; human monitors with intervention authority. Default for reversible production work with strong eval coverage.
- HOOTL (human-out-of-the-loop): Agent acts within pre-defined boundaries; no real-time human involvement. Reserved for sandboxed, reversible work with strong tests and an agent reviewer on every artifact.
A mature Rung 5 team operates all three concurrently. Examples:
| Action | Gate | Rationale |
|---|---|---|
| Code merge to main (well-tested repo, agent reviewer) | HOOTL | Reversible (revertable); blast radius bounded by CI |
| Production deploy | HITL | Irreversible at customer-perception level; high impact |
| Drafted financial transaction in accounting system | HITL | Irreversible (audit trail); regulatory implications |
| Routine customer support response (within knowledge base scope) | HOTL | Reversible; quality monitored; agent reviewer for tone/policy |
| Refund or credit issued above threshold | HITL | Financial irreversibility; trust implications |
The Lab's absolute rules ("code not written or reviewed by humans") still apply within HOOTL mode. But Rung 5 teams deliberately step out of HOOTL for irreversible work — that's a risk-graded design choice, not a regression.
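One way to make the gate choice explicit is a small decision function over the properties the stances depend on. This is a sketch under assumptions, not a prescribed policy: the `Action` fields and `select_gate` are hypothetical, and the final fallback (reversible work without strong coverage goes to HITL) is a conservative choice made here rather than something the text states.

```python
from dataclasses import dataclass
from enum import Enum


class Gate(Enum):
    HITL = "human approves before the action executes"
    HOTL = "agent acts; human monitors with intervention authority"
    HOOTL = "agent acts within pre-defined boundaries"


@dataclass(frozen=True)
class Action:
    """Hypothetical description of an action class."""
    name: str
    within_odd: bool      # inside the agent's operational design domain
    reversible: bool
    sandboxed: bool       # or blast radius otherwise bounded, e.g. by CI
    strong_tests: bool    # strong tests / eval coverage
    agent_reviewer: bool  # an agent reviewer checks every artifact


def select_gate(a: Action) -> Gate:
    # Outside the ODD or irreversible: the gate falls back to a human.
    if not a.within_odd or not a.reversible:
        return Gate.HITL
    if a.sandboxed and a.strong_tests and a.agent_reviewer:
        return Gate.HOOTL
    if a.strong_tests:
        return Gate.HOTL
    # Assumption: weak coverage is treated conservatively.
    return Gate.HITL


merge = Action("merge to main", within_odd=True, reversible=True,
               sandboxed=True, strong_tests=True, agent_reviewer=True)
deploy = Action("production deploy", within_odd=True, reversible=False,
                sandboxed=False, strong_tests=True, agent_reviewer=True)
support = Action("routine support reply", within_odd=True, reversible=True,
                 sandboxed=False, strong_tests=True, agent_reviewer=True)

for action in (merge, deploy, support):
    print(action.name, "->", select_gate(action).name)
```

Encoding the rule this way keeps the gate a property of the action class rather than a per-task negotiation.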
Vigilance Fatigue
HOTL is operationally fragile when treated as passive monitoring. For HOTL to be meaningful, the human must (a) have intervention authority (kill switch, rollback, override) and (b) actually be paying attention. Vigilance fatigue is well documented: teams that label everything HOTL as a compliance gesture end up with oversight that is illusory. The cognitive cost of sustained vigilance is real (see Cognitive cost).
Stuck-State Protocol
When validation fails or the agent gets stuck at a deliverable, the response is governed by the operational unit's Recovery stage — and the Lab's rules.
The AI-Bottleneck Failure Mode
At Rung 5, "behind on a deliverable" rarely means human capacity is short. It means the agent has hit a structural limit — wrong direction, ambiguous spec, subjective edge case it cannot resolve alone. Cemri et al. (Why Do Multi-Agent LLM Systems Fail?, 2025) found 41.8% of multi-agent failures are specification or design issues that need re-specification, not retry.
Recalibration, Not Debugging
The literature on intrinsic self-correction points consistently in one direction: a model that has committed to a wrong direction will not reliably notice on its own, and reflection in the same context window after a wrong answer tends to compound the error rather than fix it. The recovery move is to rebuild the agent's understanding (fresh context, re-articulated spec, multi-perspective brainstorm), not to debug the artifact the agent produced.
Practical pattern at Rung 5:
- Detect the stuck state — iteration limit reached; same failure pattern recurring; subjective issue raised by a user.
- Stop iterating. Convene a recalibration session — one or more humans engage the agent in dialogue; the goal is to surface the wrong assumption.
- Re-spec or re-context. Refine acceptance criteria, fix ambiguous requirements, update CLAUDE.md / AGENTS.md if intent has drifted from documented context.
- Restart the loop from Context, not Execution. A fresh agent run on a corrected spec almost always outperforms continued correction within the original context window.
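The detection step can be sketched as a check over the failure history of a loop. Everything specific here is an assumption for illustration: the function name, the idea of reducing each failed iteration to a short "failure signature" string, and the default thresholds of 5 iterations and 3 repeats.

```python
from collections import Counter


def is_stuck(failure_signatures: list[str], *, max_iterations: int = 5,
             repeat_threshold: int = 3, user_flagged: bool = False) -> bool:
    """Return True when the loop should stop iterating and recalibrate.

    failure_signatures: one short label per failed iteration,
    e.g. the name of the failing scenario (hypothetical convention).
    """
    # A subjective issue raised by a user is a stuck state on its own.
    if user_flagged:
        return True
    # Iteration limit reached.
    if len(failure_signatures) >= max_iterations:
        return True
    # Same failure pattern recurring.
    counts = Counter(failure_signatures)
    return any(n >= repeat_threshold for n in counts.values())


print(is_stuck(["refund-total-mismatch"]))       # False
print(is_stuck(["refund-total-mismatch"] * 3))   # True
print(is_stuck([], user_flagged=True))           # True
```

When the check returns True, the next move is the recalibration session, not another retry in the same context.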
Lab Rule for Stuck States
When a Lab project is stuck, do not take the work back manually. Taking the implementation back into human hands violates the Lab's absolute rules and replaces the symptom (slow delivery) with a worse one (the Lab no longer demonstrates what Rung 5 looks like). The right response is collective recalibration: bring more humans into the spec/clarification, surface the wrong assumption, re-run the loop.
If the work is genuinely irrecoverable at Rung 5, the project's codebase readiness is below where the team is operating — fall back to a lower Rung temporarily and remediate the harness (codebase readiness, brownfield strategy). This is a legitimate move; reverting to manual coding inside the Lab is not.
Deliberate Naivety
The biggest obstacle to Rung 5 isn't technical — it's habit.
Experienced engineers have deep reflexes: structuring code a certain way, reviewing line by line, writing tests themselves, refactoring manually. These reflexes were strengths in traditional development. In the Lab, they're obstacles.
Deliberate naivety means:
- Removing traditional development conventions and seeing what holds without them
- Systematically asking: "Why am I doing this? The model should be doing it instead."
- Accepting that approaches that seem "naive" or "incorrect" by traditional standards may be correct in an AI-native environment
- Treating tasks historically deemed too expensive (building full service replicas, writing thousands of scenarios) as routine when AI execution costs make it feasible
The Lab's permanent question:
Why am I doing this? The model should be doing it instead.
If the answer is "because I've always done it that way," that's exactly the reason to change.
Support Role
The Lab isn't isolated from the rest of engineering. It serves it.
The Lab produces:
- Documented working patterns: how to specify for an agent, how to write scenarios, how to evaluate satisfaction
- Reusable or adaptable agents
- Concrete proof that Rung 5 works on real projects
- Honest feedback — what works and what doesn't work yet
The Lab shares through:
- AI clinics: regular sessions, short format. "Here's what we tried, here's what happened."
- Documentation: every discovered pattern and anti-pattern is documented
- Lab / non-Lab pairing: a Lab member temporarily works with a non-Lab engineer to transfer practices
A Lab that doesn't share is useless. Sharing is as important as production.
Lab Culture
The Lab has a distinct culture from the rest of the organization:
- Mandatory curiosity — the question "what if we tried..." is always welcome
- Aggressive monitoring — Lab members stay on top of the latest AI model advancements. When a new model or tool drops, the Lab tests it quickly and evaluates whether it's a game-changer. Waiting for things to "mature" is incompatible with the Lab.
- Boldness in methods, rigor in commitments — the Lab pushes boundaries on how we work: what tools we adopt, what workflows we reinvent, what "naive" approaches we test. But contractual, economic, legal, and security obligations to customers remain non-negotiable. Boldness applies to the means, not the guarantees.
- High risk, low stakes — Lab projects are chosen to tolerate failure. Use that to take risks you wouldn't take elsewhere
- Radical transparency — failures are shared with as much detail as successes. A documented failure has more value than a silent success
- Leadership means elevating the team — in the Lab, leadership isn't measured by individual performance. Leaders are those who make the rest of the team better: who share their discoveries, document their patterns, unblock their colleagues, and turn their expertise into reproducible practices. A brilliant engineer who keeps their methods to themselves is not a Lab leader.
- Iteration speed — the spec-to-shipped feedback loop must be fast. If an iteration takes days, the cycle is too heavy
Pitfalls to Avoid
- Pitfall: Reverting to habits
The reflex to "manually check the code just to be sure" is exactly what the Lab forbids. If the scenarios pass, the code is validated by the scenarios.
- Pitfall: Insufficient specifications
When the agent produces bad code, the problem is usually in the spec. Recalibrate (re-spec, re-context) before debugging the artifact. Iterate on the spec, not the code.
- Pitfall: Isolation
A Lab that doesn't share its learnings is a hobby, not a Lab.
- Pitfall: Too-critical projects too early
The Lab has a high risk tolerance, but only for projects that can absorb failure. Don't admit a project whose failure would endanger a customer or a contract.
- Pitfall: Agent perfectionism
The goal isn't a perfect agent. It's an agent that produces value. Iterate.
- Pitfall: Brownfield without spec extraction
Transitioning an existing project without first extracting the implicit specification and writing scenarios that protect current behavior is flying without a net. The extraction is the hardest work — don't underestimate it.
- Pitfall: "Half-Lab" brownfield
If part of the work on a brownfield project is done in traditional mode "because it's faster for that part," the project isn't in the Lab. The rules are absolute, even when it's uncomfortable.
- Pitfall: The six-month wall
AI-driven projects without strong human involvement (specs, scenarios, architecture) accumulate structural debt that explodes after roughly six months. AI-generated code without clear constraints is often less structured and less maintainable. The Lab's scenarios and specifications are precisely the defense against this wall — they impose upstream rigor that prevents debt from accumulating silently.
- Pitfall: Treating an AI-bottleneck as a productivity problem
When the agent can't ship, it's almost never a capacity problem (the agent has near-infinite capacity). It's almost always a spec or context problem. Redistributing work to humans addresses the wrong root cause and violates the Lab's rules. The right response is recalibration (see Stuck-State Protocol).
Lifecycle
Phase 1: The Lab starts with greenfield projects and begins transitioning 1-2 selected brownfield projects. Small team. Absolute rules in effect. Output = delivered projects + documented practices + working agents + brownfield transition playbook.
Phase 2: Brownfield projects transitioned in the Lab become the reference cases for the rest of engineering. Engineers who went through the Lab become the pairing partners for those who haven't. More brownfield projects enter the Lab.
Phase 3: The Lab has absorbed engineering. The distinction disappears. Everything is Rung 5. The Lab was never a destination; it was the transition vehicle. Both greenfield AND brownfield projects operate under the same rules.
This lifecycle aligns with the organizational transformation path: Phase 1 corresponds to the Level 2→3 transition (6-12 months at the organizational scale).
Summary Rule
Why am I doing this? The model should be doing it instead.
