Codebase readiness
Your AI coding agent is only as good as your codebase's harness. AI amplifies whatever structure already exists — a disciplined codebase ships faster, an undisciplined one ships confidently wrong code faster. The harness (Fowler: Agent = Model + Harness) is what this page helps you measure.
A nine-dimension diagnostic with a 1–5 scoring rubric per dimension. Three are blocking — low scores compromise agent work fundamentally and can't be compensated by high scores elsewhere. The other six are constraining. Score any repo in one command with the free open-source Claude Code skill.
The five readiness levels
Each level is defined by the feedback mechanism it adds. You cannot skip levels; the thing that unlocks the next level is always another feedback loop, not another tool.
| Level | Name | What's present | What agents can do |
|---|---|---|---|
| 0 | Opaque | No enforced types. Little to no test coverage. CI absent or unreliable. Docs stale. Module boundaries unclear. | Isolated single-function suggestions. Anything beyond produces hallucinated code with no corrective signal. |
| 1 | Instrumented | Types enforced at build time. Some tests exist. CI returns pass/fail within minutes. | Single-function generation with basic confidence. Narrow-scope refactors. Signal is weak but present. |
| 2 | Validated | Meaningful coverage on changing code paths. CI under 30 min (under 5 min is better). At least one end-to-end integration path. | Multi-file refactors within bounded scope. Bug fixes validated by existing tests. Minimum level for Rung 3 (directed development). |
| 3 | Legible | Module boundaries clear. Naming consistent. Architecture documented. Behavioral contracts declared at interface boundaries. | Cross-module work with predictable outcomes. Agents can reason about the codebase without walkthroughs. |
| 4 | Specified | Intent captured in specifications, not only in code. New work is spec-first. Historical decisions documented. | Autonomous feature generation from spec (Rung 4). The human specifies; the agent builds and tests. |
| 5 | Scenario-governed | End-to-end scenarios cover critical paths. Satisfaction metrics replace binary pass/fail. Observability complete. | Full autonomous production (Rung 5). The Lab's operating model becomes sustainable. |
Fowler calls the Level 3+ structural traits ambient affordances — properties that make a codebase legible to an agent without additional instruction. Their absence forces the agent to invent structure.
The AI Codebase Maturity Model, validated on a real project at 91% coverage and sub-30-minute bug-to-fix cycles, stated the underlying finding bluntly: "The intelligence of an AI-driven development system resides not in the AI model itself, but in the infrastructure of instructions, tests, metrics, and feedback loops that surround it."
The ceiling rule
Engineering practice (the framework's Rungs 0–5) is about how the team works. Codebase readiness (Levels 0–5) is about what the code supports. They're linked:
A codebase's readiness level is the ceiling on the engineering Rung that can operate reliably on it.
A team at Rung 5 pushing agents into a Level 2 codebase will get the same breakage as a team at Rung 2 — faster. The feedback loops that keep Rung 5 honest don't exist. This is the most common failure mode in brownfield AI adoption: teams see Rung 5 demos on greenfield repositories and assume the working model transfers. It won't until the harness is built.
The Codebase Readiness Grid
Nine dimensions. Each answers one question. Score each 1–5. The lowest score is the ceiling on the readiness level you can operate at reliably; your weakest dimension is where agents will fail first.
The Codebase Readiness Grid is also available as a Claude Code skill. It runs on any repo, collects signals per dimension, and produces a scorecard with a path-to-Level-5 recommendation. Open source, MIT licensed.
| # | Dimension | Type | Question it answers | Scoring rubric (1–5) |
|---|---|---|---|---|
| 1 | Test coverage and feedback latency | blocking | Is there meaningful coverage on the code paths that change most often? How fast is CI? | 1 = no tests or dead tests. 2 = low coverage, slow CI. 3 = meaningful coverage on hot paths, CI under 30 min. 4 = enforced threshold, CI under 5 min, failures localize. 5 = behavioral scenarios cover critical paths, CI fast and green as deploy gate. |
| 2 | Type strictness | blocking | Are types enforced at build time? Is `any` rare or pervasive? | 1 = untyped or types off. 2 = types optional, `any` everywhere. 3 = types enforced, some escape hatches. 4 = strict mode on, `any` rare and justified. 5 = strict mode in CI, zero `any`, contracts at every boundary. |
| 3 | File size and context legibility | constraining | Can an agent reason about one file without loading the rest of the system? | 1 = god files (2000+ lines), mixed responsibilities. 2 = many 1000+ line files, tangled. 3 = most files under 500 lines, some outliers. 4 = most under 300 lines, outliers rare and justified. 5 = each file sized to one responsibility. |
| 4 | Module boundary clarity | constraining | Can a new engineer (or agent) describe the module graph after 30 minutes? Are boundaries enforced? | 1 = no modular structure. 2 = structure on paper, violated in practice. 3 = clear high-level modules, occasional leaks. 4 = enforced boundaries, stable contracts. 5 = module graph is a navigable map; architecture documented and current. |
| 5 | API directness | blocking | Is the HTTP/RPC call visible at the call site, or hidden behind wrapper layers? | 1 = access via opaque abstractions (Factory, dynamic dispatch, ORM with invisible queries). 2 = significant abstraction, traceable with effort. 3 = mix of direct and abstracted. 4 = mostly direct calls with typed responses. 5 = call sites show what's happening; no hidden network or database operations. |
| 6 | Documented intent | constraining | Is the why documented separately from the how? Are architectural decisions recorded? | 1 = no intent docs; code is the only truth. 2 = scattered docs, often stale. 3 = module READMEs, some ADRs. 4 = current ADR log, critical modules documented, CLAUDE.md / AGENTS.md in place. 5 = every non-trivial decision has a spec or ADR; historical context preserved. |
| 7 | Observability | constraining | Can you tell what the system did for a given request? Do failures produce reproducible test cases? | 1 = no structured logs, no traces, no metrics, no error tracking. 2 = some logs, no tracing. 3 = structured logs and basic metrics; error tracking in place. 4 = full telemetry with query access. 5 = production fully observable; errors produce reproducible test cases. |
| 8 | Dev and deploy simplicity | constraining | Can you set up dev with one command? Deploy with one (or auto-)? | 1 = dev setup takes days or tribal knowledge; deploys manual and fragile. 2 = scripted but fragile; deploys require runbook steps. 3 = documented commands, reliable deploy. 4 = one-command setup; one-command (or auto-) deploy. 5 = dev, CI, prod architecturally identical. |
| 9 | Dependency and runtime currency | constraining | Is the runtime supported? Are frameworks within 1–2 majors of current? Any abandoned libraries? | 1 = runtime EOL, framework 2+ majors behind, abandoned libraries, paradigms mixed. 2 = runtime current but many deps 1+ majors behind. 3 = runtime current, most deps within 1 major, occasional legacy pattern documented. 4 = consistent modern paradigms. 5 = aggressive dependency hygiene; conventions match current community standards. |
Notes on the less self-evident dimensions:
- D1: slow CI is the most common brownfield deficit — a 72-hour test suite is not a sensor, it's a report.
- D3: small files aren't a style preference; they're a context-window discipline. Agents that need three files to understand one function hallucinate the missing parts.
- D5: opaque abstractions don't just hide detail; they produce confidently wrong code when the agent guesses at what `UserFactory.resolve()` does.
- D6: code documents itself; intent does not. This is the hardest deficit to repair quickly.
- D9: agents are trained on current library versions and idioms; a codebase using patterns 2–3 years behind receives code that contradicts its own conventions.
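Some dimensions are mechanically measurable. D3 (file size and context legibility) is the simplest: count lines per source file and map the distribution onto the rubric thresholds. A minimal sketch, assuming a plain source tree; the extension set and the exact bucket cut-offs here are illustrative, and a mechanical check can at best justify a 4 — the 5 ("each file sized to one responsibility") needs human judgment:

```python
from pathlib import Path

SOURCE_EXTS = {".py", ".ts", ".tsx", ".js", ".go", ".java"}  # illustrative set

def audit_file_sizes(root: str) -> dict:
    """Count source-file line counts and suggest a D3 score (1-4)."""
    sizes = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in SOURCE_EXTS:
            sizes.append(len(path.read_text(errors="ignore").splitlines()))
    if not sizes:
        return {"files": 0, "score": None}
    over_2000 = sum(s >= 2000 for s in sizes)          # god files → rubric 1
    over_1000 = sum(s >= 1000 for s in sizes)          # tangled 1000+ files → rubric 2
    under_300 = sum(s < 300 for s in sizes) / len(sizes)
    under_500 = sum(s < 500 for s in sizes) / len(sizes)
    if over_2000:
        score = 1
    elif over_1000 > len(sizes) * 0.1:                 # illustrative 10% cut-off
        score = 2
    elif under_300 > 0.9:                              # "most under 300 lines"
        score = 4
    elif under_500 > 0.8:                              # "most under 500 lines"
        score = 3
    else:
        score = 2
    return {"files": len(sizes), "score": score}
```

The same shape works for D9 (parse the lockfile, compare major versions) and parts of D2 (grep for `any` under strict mode).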
How scoring works
Four rules govern how to read the scorecard. They compound — all four apply simultaneously.
1. The ceiling is the lowest score. Your weakest dimension sets your readiness level. A codebase with eight 5s and one 1 is at Level 0, not Level 4. Agents fail at the weakest link.
2. Three dimensions are blocking; six are constraining.
- Blocking (D1, D2, D5) — low scores compromise agent work fundamentally. Without tests, agents are blind. Without types, they hallucinate shapes. With opaque APIs, they produce confidently wrong code. These cannot be compensated by high scores elsewhere.
- Constraining (D3, D4, D6, D7, D8, D9) — low scores degrade quality but agents can still produce value. More human review per change, more cleanup, more friction.
When two dimensions tie at a low score, raise the blocking one first. The ceiling rule still applies; this classification just sharpens remediation priority.
3. Deferral credit: an intentional, documented deferral scores one level higher than the same gap undocumented. If observability isn't wired up, but the project spec defers it to a GA-Ready phase with date and criteria, score it 2 instead of 1. The deferral is a decision; an absent-minded gap is not. Verbal claims don't qualify — the deferral must be in the repo.
4. Never summarize the scorecard with an average. The scorecard is a vector, not a scalar. Two codebases averaging 3.0 can be in radically different situations: one with D1 at 1 is blind; one with D7 at 1 is operationally immature but usable. Averages hide that distinction.
The scorecard is a snapshot. Re-run it every quarter — codebases move in both directions.
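The four scoring rules compose into one small routine. A sketch, assuming the scorecard arrives as a dimension-to-score mapping; the `deferred` parameter and the example dimension keys are illustrative, and mapping the ceiling score onto a readiness level (a lowest score of 1 pins the codebase at Level 0) follows the rules in the text:

```python
BLOCKING = {"D1", "D2", "D5"}  # tests, types, API directness

def readiness(scores: dict, deferred: frozenset = frozenset()) -> dict:
    """Apply the four scoring rules to a non-empty nine-dimension scorecard."""
    adjusted = {
        # Rule 3: a documented deferral scores one level higher (capped at 5).
        dim: min(5, s + 1) if dim in deferred else s
        for dim, s in scores.items()
    }
    # Rule 1: the ceiling is the lowest score, never an average (rule 4).
    ceiling = min(adjusted.values())
    # Rule 2: among tied weakest dimensions, raise the blocking one first.
    weakest = sorted(
        (dim for dim, s in adjusted.items() if s == ceiling),
        key=lambda d: (d not in BLOCKING, d),
    )
    return {"ceiling_score": ceiling, "fix_first": weakest[0], "scorecard": adjusted}
```

Note that the function returns the whole adjusted vector alongside the ceiling — per rule 4, the scalar alone is never the report.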
What a Level 5 codebase looks like
A concrete reference, generalized from observed patterns. A codebase that scores 5 across all nine dimensions tends to share these traits:
- Spec-first for non-trivial work. API contracts verified, acceptance scenarios defined, decisions documented before code.
- Strict typing enforced at build time. Zero `any`. Type violations fail CI.
- Small, focused files. Most under a few hundred lines. Largest files are intentional outliers, not accidental accumulations.
- Direct API and data access. Call sites show what's being called. No opaque factories, no generic dispatchers.
- Minimal dependencies. Few packages, each justified.
- Explicit conventions. File structure and naming documented. `ABOUTME:` header on every file. `AGENTS.md` or `CLAUDE.md` with the agent rules.
- No implicit knowledge. The codebase is legible because it says what it means, not because the team remembers.
- Tests as contracts. Simple mocks, behavior assertions, readable names.
- One-command build, one-command deploy. No fighting infrastructure to ship.
- Current dependencies, supported runtime. No EOL runtimes, no abandoned libraries.
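Several of these traits can be linted mechanically. A minimal sketch checking two of them — the `ABOUTME:` header convention and the presence of an agent-rules file. The header format, the checked filenames, and the `.py`-only scope are assumptions; adapt them to your own conventions:

```python
from pathlib import Path

def check_conventions(root: str) -> list:
    """Return convention violations for two of the Level-5 traits above."""
    violations = []
    repo = Path(root)
    # Trait: an AGENTS.md or CLAUDE.md carrying the agent rules.
    if not any((repo / name).is_file() for name in ("AGENTS.md", "CLAUDE.md")):
        violations.append("missing AGENTS.md / CLAUDE.md")
    # Trait: ABOUTME: header near the top of every source file.
    for path in sorted(repo.rglob("*.py")):
        head = path.read_text(errors="ignore").splitlines()[:3]
        if not any("ABOUTME:" in line for line in head):
            violations.append(f"{path.name}: no ABOUTME: header")
    return violations
```

Wired into CI as a failing check, this turns a documented convention into an enforced one — the difference between D6 scoring 2 and 4.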
These aren't style preferences. They're the structural properties that let agents produce reliable work — and they're the same properties that make humans productive. The two aren't separate.
Harness-building strategy
When readiness is below where your team needs to work, sequence the investment in this order. Skipping stages produces speed without confidence.
The harness has two parts: sensors (tests, CI, observability) verify agent output; guides (types, conventions, architecture docs, documented intent) specify what to produce. Brownfield codebases typically have the guides but lack the sensors — that's the configuration that ships confidently wrong code.
Stage 1 — Install fast sensors first. Useful tests and fast CI (feedback latency under 30 minutes is the gate). Without sensors, every downstream change is a guess and you can't verify that later harness work is landing.
Stage 2 — Make the code legible. Consistent naming, enforced types, clear module boundaries, centralized cross-cutting concerns, deletion of dead paths. Legibility is what lets agents work across modules without hallucinating structure.
Stage 3 — Capture intent. Document architectural decisions. Declare behavioral contracts at interface boundaries. For the most critical modules, extract the implicit specification — the thousand decisions that accumulated over years of patches. Agents can document what the system does; distinguishing intentional behavior from historical accident stays human. See the AI Lab brownfield sequence for the disciplined version.
Stage 4 — Govern with scenarios. Convert intent into end-to-end scenarios. Scenarios are harder for agents to circumvent than unit tests. Rung 5 working mode becomes viable at this point.
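What Stage 4 output looks like in practice: intent expressed as an end-to-end scenario rather than an implementation-level assertion. A sketch in pytest style; the checkout flow, the `Cart` class, and the discount rule are entirely hypothetical — the point is the shape: one scenario per documented behavior, named after the intent it protects, inlined here so the example is self-contained:

```python
# Hypothetical domain under test, inlined so the scenario runs standalone.
class Cart:
    def __init__(self):
        self.items = []

    def add(self, name, price):
        self.items.append((name, price))

    def checkout(self, discount_code=None):
        total = sum(price for _, price in self.items)
        if discount_code == "WELCOME10":  # hypothetical business rule
            total *= 0.9
        return round(total, 2)


def test_first_order_discount_applies_to_the_whole_order():
    """Scenario from the spec, not from the implementation.

    Given a customer with two items in the cart,
    when they check out with the WELCOME10 code,
    then the 10% discount applies to the order total.
    """
    cart = Cart()
    cart.add("book", 20.00)
    cart.add("pen", 5.00)
    assert cart.checkout("WELCOME10") == 22.50
```

A unit test asserting on `total *= 0.9` could be "fixed" by an agent editing the assertion; a scenario phrased as customer-visible behavior is much harder to satisfy with the wrong code.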
Patterns from Paul Duvall's AI Development Patterns that fit this sequence: Readiness Assessment (Foundation), Observable Development, Guided Refactoring.
What to do when the score is low
Four modes apply, covered in detail at Brownfield engineering strategy: remediate in place, strangler-fig migration, full rebuild, or isolate and bypass. The right mode depends on the codebase's architectural soundness, seam availability, and remaining business value — not on team capacity or strategic priorities, which sit outside this assessment.
Mirror: readiness and unreliability
Engineering for unreliability addresses AI output as input to systems. Codebase readiness addresses the inverse: the codebase as input to AI agents. Same uncertainty, reflected.
Related
- Brownfield engineering strategy — the four modes for low-readiness codebases
- AI Lab — the Rung 5 working mode that a Level 5 codebase supports
- Engineering for unreliability — the mirror problem
- AI Execution Standards and Specification Guide — specification engineering
Sources
- Mahiou et al. (2026). "AI Codebase Maturity Model (ACMM)." arXiv. arxiv.org/abs/2604.09388
- Fowler, M. (2026). "Harness Engineering for Coding Agent Users." martinfowler.com
- Duvall, P. (2026). "AI Development Patterns." github.com/PaulDuvall/ai-development-patterns
- Google Cloud (2025). "DORA 2025 State of AI-assisted Software Development." dora.dev
- "A Maturity Model for AI-Native Development." (2025). Transcode. blog.transcode.be
- "The State of Generative AI in Software Development." (2026). arXiv. arxiv.org/abs/2603.16975
