AI-Native Transformation Framework

Brownfield engineering strategy

How to transition an existing codebase to AI-native development — when to remediate in place, when to strangler-fig, when to rebuild, when to isolate, and the methodology that makes each tractable.


Is your codebase actually brownfield?

The four modes on this page apply to brownfield codebases — existing code with accumulated decisions, production users, and a team working inside it. Before picking a mode, verify the codebase state. Three show up in practice:

| State | What it means | What to do |
| --- | --- | --- |
| Greenfield (in development, pre-GA) | New codebase, not yet in production, or in production but young enough that all code is still being actively designed. Low readiness scores here are scheduling decisions, not debt. | Continue development. Close readiness gaps on roadmap timing. Re-run the readiness assessment at GA and treat unclosed gaps as brownfield then. |
| Brownfield (production, accumulated) | Code exists, runs in production, has real users, carries years of decisions. Readiness gaps are debt. | Pick a mode using the four below. |
| Hybrid | A new Level 5-ready service inside an organization whose other systems are brownfield. The new service is greenfield; the legacy integration boundary is brownfield. | Score the new service as greenfield. For each legacy boundary, pick a mode separately. Most common version: a new AI-native app reading or writing data through a legacy API. |

The four modes

Each mode is a different path to Level 5. The question is not "which mode does the team have capacity for" — that's a resource-allocation decision downstream. The question is "which mode is the shortest path to Level 5 for this codebase, on the evidence." Team size, strategic priorities, and opportunity cost are inputs a human weighs after reading the recommendation; they are not inputs to the recommendation itself.

| Mode | What "path to Level 5" means |
| --- | --- |
| Remediate in place | Level 5 happens in this codebase via staged harness-building. |
| Strangler-fig | Level 5 happens gradually — new pieces are built Level 5-ready while old pieces retire. |
| Full rebuild | Level 5 happens in a new version of this codebase; the old version retires at cutover. |
| Isolate and bypass | Level 5 does not happen in this codebase. It happens in new Level 5-ready codebases alongside; this one stays frozen. |

Mode 1 — Remediate in place

Keep the existing code. Invest in the harness: tests, types, conventions, documented intent. Use AI to accelerate the remediation itself via the Research-Review-Rebuild methodology below.

When this is right: the architecture is fundamentally sound (the issue is the harness, not the structure); the team has domain expertise tied to the current codebase; business continuity cannot tolerate parallel systems.

The risk: you remediate the harness and the code stays fundamentally Level 3 because the structure fights you. You've spent six months and still can't reach Rung 5.

Mode 2 — Strangler-fig migration

Build new functionality Level 5-ready alongside the old system. Route traffic through a facade. Kill pieces of the old system one by one as the new ones prove themselves. Martin Fowler's Strangler Fig Pattern is the canonical reference.

When this is right: the existing system has clean seams for extraction; business continuity matters; rebuild time for the whole system would exceed business tolerance.

The risk: seam identification is hard. If the seams aren't clean, you end up with two coupled systems instead of one — twice the complexity, none of the benefit.

Mode 3 — Full rebuild

Start over with a Level 5-ready codebase. Run the two systems in parallel, migrate data, cut over when the new system proves equivalent or better.

When this is right: the existing structure actively prevents the work you need to do; the cost of continued remediation exceeds the cost of rebuild; AI-assisted rebuild economics make the rebuild tractable in a timeframe that wasn't viable before; the domain is well-understood (you're replacing how, not rediscovering what).

The risk: the second-system effect. Rebuilds collect every deferred feature request from the old system and fail under their own weight. Keep scope disciplined.

Mode 4 — Isolate and bypass

Freeze the legacy. Maintain it by human hands at whatever level the business still requires. Build new value as new Level 5-ready apps alongside it, routing through whatever API or data layer already works. Don't attempt to modernize the legacy itself.

When this is right: remediation cost exceeds the value remaining in the legacy; new value can be delivered in parallel without deep integration; the legacy has a viable integration seam (API, database, queue) that new apps can route through; the business tolerates maintenance mode for an extended period.

The risk: bypassed legacy accumulates more debt over time. Eventually something forces the decision — a security patch that can't be applied, a platform EOL, a data model that can't hold new requirements. Isolate-and-bypass buys time; it doesn't solve the problem.

"Do not invest" is a legitimate outcome here. Not every legacy is worth fixing. The signal: you're writing your third remediation plan for the same module, the tests you added last quarter are no longer green, and the business still delivers value through this code. That's not a codebase waiting to be modernized. That's a codebase waiting to be replaced.


Decision criteria

| Question | Remediate | Strangler | Rebuild | Isolate |
| --- | --- | --- | --- | --- |
| Is the architecture fundamentally sound? | Yes | Mostly | No | Doesn't matter |
| Are there clean seams for extraction? | N/A | Yes | N/A | At least one usable integration point |
| Can the team describe the system's intent? | Yes | Yes | Yes (or use Black Box to Blueprint) | Not required for legacy |
| Can the business tolerate parallel systems? | N/A | Yes | Yes | Yes |
| Is continued debt payment cheaper than replacement? | Yes | Partially | No | Yes, while new value ships alongside |
| Does the legacy have value remaining worth preserving? | Yes | Yes | Yes | Yes, but not worth fixing |
| How much team capacity does the mode demand? | Lowest | Medium | Highest | Low (the legacy stays frozen) |

No clean yes on the rebuild row? Don't rebuild. The second-system effect punishes teams who chose rebuild for reasons other than fit.

Several "no" answers across Remediate/Strangler/Rebuild, but legacy still has business value? Isolate and bypass is probably the right mode.
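
The decision criteria above can be sketched as code. This is a minimal illustration, not part of the framework: the function name, parameter names, and the precedence order (worth-fixing first, then architecture and debt economics, then seams) are assumptions layered on top of the table.

```python
# Illustrative sketch of the decision-criteria table. All names and the
# precedence order are assumptions; the table itself is the authority.

def recommend_mode(
    architecture_sound: bool,
    clean_seams: bool,
    debt_cheaper_than_replacement: bool,
    legacy_worth_fixing: bool,
) -> str:
    """Map four load-bearing answers to one of the four modes."""
    if not legacy_worth_fixing:
        # Legacy still delivers value but isn't worth fixing: freeze it,
        # build new Level 5-ready apps alongside.
        return "isolate-and-bypass"
    if architecture_sound and debt_cheaper_than_replacement:
        return "remediate-in-place"
    if clean_seams:
        return "strangler-fig"
    # No sound architecture, no clean seams, replacement cheaper than debt.
    return "full-rebuild"
```

For example, a sound architecture with remediable debt maps to `remediate-in-place`; unsound architecture with clean seams maps to `strangler-fig`. A real recommendation would weigh all seven rows, plus the evidence behind each answer.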

The Technical Debt Quadrant sharpens the rebuild decision:

| Debt type | Cause | Remediation |
| --- | --- | --- |
| Prudent, deliberate | We shipped knowing the shortcut. | Usually remediable. |
| Prudent, inadvertent | We learned better later. | Remediable, but check whether the learning invalidated structural decisions. |
| Reckless, deliberate | We knew better, did it anyway. | Often remediable with discipline, but signals process problems that outlive the codebase. |
| Reckless, inadvertent | We didn't know what we were doing. | Rebuild candidate. The structure reflects ignorance that later knowledge cannot unwind in place. |

Try it on your codebase

Need help picking the right mode? The codebase-readiness Claude Code skill assesses a codebase against the nine-dimension readiness model and recommends a path to Level 5 — including which of the four modes applies.

The methodology: Research, Review, Rebuild

Published by Fowler and EPAM in Research, Review, Rebuild, this is the most concrete brownfield methodology available. It applies directly to Modes 1–3; in Mode 4 it applies to the integration seam even if the legacy stays frozen. Direct agent deployment into an opaque legacy codebase reliably produces confidently wrong output. The phase-gated structure prevents this.

Phase 1 — Research. AI analyzes the existing code: reconstructs intent, extracts behavioral contracts, identifies patterns, documents the dependency graph. Tools like MCP connectors let agents systematically traverse the codebase. Output: an intent map — what the system does, independent of how it's structured.

Phase 2 — Review. Domain experts validate the intent map. AI can extract what the code does. Only humans can distinguish intentional behavior from historical accident. This is the throughput bottleneck — on the Bahmni case study (AngularJS to React), human review took ~20 minutes per component. Plan capacity for review, not just generation.

Phase 3 — Rebuild. With validated intent, AI generates replacement code with minimal ambiguity. On Bahmni: ~$2 per component in under an hour, vs. 3–6 days manually. The economics are compelling when Phases 1 and 2 are disciplined — and worse than traditional when they're skipped.

The order is load-bearing. Teams that skip research and review to get to rebuild faster don't save time; they produce wrong outputs faster and spend the saved time debugging hallucinated behavior.
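
The load-bearing order can be made mechanical. The sketch below is illustrative only — class and method names are invented — but it shows the shape of the gate: Rebuild is unreachable until a Research output exists and Review has signed off on it.

```python
# Sketch of phase gating for Research -> Review -> Rebuild. Names are
# illustrative assumptions, not part of the published methodology.

class PhaseGateError(RuntimeError):
    """Raised when a phase is attempted out of order."""

class MigrationUnit:
    """One component moving through the three phases."""

    def __init__(self, name: str):
        self.name = name
        self.intent_map = None          # Phase 1 output
        self.review_signed_off = False  # Phase 2 gate

    def research(self, intent_map: dict) -> None:
        # Phase 1: AI reconstructs intent; output is the intent map.
        self.intent_map = intent_map

    def review(self, approved_by: str) -> None:
        # Phase 2: a domain expert validates the intent map.
        if self.intent_map is None:
            raise PhaseGateError("nothing to review: run Research first")
        self.review_signed_off = True
        self.reviewer = approved_by

    def rebuild(self) -> str:
        # Phase 3: generation only proceeds from validated intent.
        if not self.review_signed_off:
            raise PhaseGateError("Rebuild blocked: intent map not validated")
        return f"generate replacement for {self.name} from validated intent"
```

Skipping ahead raises an error instead of silently producing code from unvalidated intent — the software equivalent of the discipline the methodology asks of the team.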

Economics: what the numbers imply

The Bahmni data point rearranges the math: what was an 18-month migration becomes a 6-month migration, where the constraint is reviewer capacity, not engineering hours. Your mileage will vary with architecture, test coverage, and domain complexity — and the economics only work when Research and Review are disciplined. Skipping them collapses the speedup.
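
A back-of-envelope version of that math, as a sketch: every number below except the ~20 minutes per component (from the Bahmni case study) is an assumption to replace with your own figures.

```python
# Reviewer-capacity arithmetic. Only review_min_per_component comes from the
# Bahmni case study; scope, headcount, and hours/week are assumed inputs.

components = 600                 # assumed migration scope
review_min_per_component = 20    # Bahmni: ~20 minutes of human review each
reviewers = 2                    # assumed available domain experts
review_hours_per_week = 10       # assumed: review is not their only job

total_review_hours = components * review_min_per_component / 60
weeks = total_review_hours / (reviewers * review_hours_per_week)

print(f"{total_review_hours:.0f} review hours over {weeks:.0f} weeks")
```

With these assumptions the answer is 200 review hours spread over 10 weeks: the reviewer pipeline, not generation, sets the schedule.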

Black Box to Blueprint: reverse-engineering intent

When the original team is gone, documentation is wrong, and the code is the only source of truth, Phase 1 becomes a reverse-engineering problem. Fowler's Black Box to Blueprint describes five techniques:

  1. UI-layer reconstruction — infer behavior from the user interface and its state transitions.
  2. Change data capture — observe how the system modifies data in production to infer business logic.
  3. Server logic inference — analyze API boundaries and request/response patterns to rebuild the logic behind them.
  4. Binary archaeology — reconstruct from binaries, logs, and external interfaces when source is lost.
  5. Progressive multi-pass enrichment — break artifacts into manageable chunks, extract partial insights per pass, build context incrementally. Single-shot whole-system analysis fails at scale.
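
Technique 5 has a simple skeleton. The sketch below is an assumption-laden illustration: `analyze_chunk` stands in for an agent or model call, and here just counts identifiers, but the loop shows the essential move — each pass receives the context accumulated by earlier passes instead of analyzing the whole artifact at once.

```python
# Sketch of progressive multi-pass enrichment. `analyze_chunk` is a
# placeholder for a real analysis step (e.g., an agent call).

def analyze_chunk(chunk: str, context: dict) -> dict:
    # Toy analysis: count tokens, seeded with what earlier passes found.
    seen = dict(context)
    for token in chunk.split():
        seen[token] = seen.get(token, 0) + 1
    return seen

def enrich(artifact: str, chunk_size: int = 200) -> dict:
    """Break the artifact into chunks; build context incrementally."""
    context: dict = {}
    for i in range(0, len(artifact), chunk_size):
        context = analyze_chunk(artifact[i:i + chunk_size], context)
    return context
```

The design point is the accumulator: single-shot whole-system analysis fails at scale, so each pass extracts partial insight and hands it forward.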

Two disciplines are non-negotiable: triangulation (every hypothesis about intent confirmed across two independent sources — UI + logs, API + database, code + observed behavior) and lineage tracking (for every claim, record the evidence it's based on, so unreliable evidence is identifiable later).
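
Both disciplines fit in one small data structure. This is a sketch under assumptions — the class, the claim text, and the source-kind labels are invented for illustration — but it shows the mechanics: every claim carries its evidence (lineage), and a claim counts as confirmed only when two independent source kinds agree (triangulation).

```python
# Sketch of a claim ledger combining lineage tracking and triangulation.
# All names are illustrative assumptions.

from collections import defaultdict

class ClaimLedger:
    def __init__(self):
        # claim -> list of (source_kind, detail) evidence records
        self._evidence = defaultdict(list)

    def record(self, claim: str, source_kind: str, detail: str) -> None:
        """Lineage tracking: every claim keeps the evidence it rests on."""
        self._evidence[claim].append((source_kind, detail))

    def confirmed(self, claim: str) -> bool:
        """Triangulation: confirmed only across >= 2 independent source kinds."""
        kinds = {kind for kind, _ in self._evidence[claim]}
        return len(kinds) >= 2

    def lineage(self, claim: str) -> list:
        """Return the evidence behind a claim, for later reliability audits."""
        return list(self._evidence[claim])
```

A claim backed only by code stays unconfirmed until observed behavior (or logs, or the database) corroborates it; if a source later proves unreliable, `lineage` identifies every claim that leaned on it.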


Spec-from-code: the brownfield inversion

The AI Execution Standards and Specification Guide assume specs precede code. Brownfield inverts this: the code already exists, and the spec must be reverse-engineered from it. This is a different workflow.

Spec-driven development tools like spec-kit, Kiro, and Tessl are primarily greenfield-oriented. Spec-kit's "Brownfield Bootstrap" is the exception: auto-discover existing architecture to establish a Constitution (persistent governing principles), then apply spec-driven development to new features only while Research and Review catch up on the legacy surface.

The pattern that works:

  1. Extract the implicit specification. What does the system actually do? See the AI Lab brownfield sequence for the disciplined version.
  2. Write end-to-end scenarios that describe current expected behavior from the extracted spec.
  3. Verify scenarios pass on the existing code. This becomes your regression harness.
  4. Apply spec-first to all new work. No exceptions.
  5. Strangler-migrate component by component, using the scenarios as protection during migration.

The first step is the hardest human work in the transition. Agents can document what the system does; only humans can answer whether that's what it should do.
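
Steps 2 and 3 can be sketched concretely. In this illustration `legacy_quote` is a hypothetical stand-in for a call into the existing system, and the scenarios are invented — the point is the shape: scenarios encode current behavior from the extracted spec, then must pass against the existing code before they count as a harness.

```python
# Sketch of a characterization harness (steps 2-3 above). `legacy_quote`
# and the scenario data are hypothetical stand-ins.

def legacy_quote(subtotal: float, member: bool) -> float:
    # Stand-in for the existing system's observed behavior.
    return round(subtotal * (0.9 if member else 1.0), 2)

# Scenarios describe *current* expected behavior, right or wrong. They
# protect against regression during migration, not against design mistakes.
SCENARIOS = [
    ({"subtotal": 100.0, "member": True}, 90.0),
    ({"subtotal": 100.0, "member": False}, 100.0),
]

def run_harness(fn) -> list:
    """Return the scenarios a candidate implementation fails (empty == green)."""
    return [(args, want, fn(**args))
            for args, want in SCENARIOS if fn(**args) != want]
```

During a strangler migration, the replacement component is run through the same harness; an empty failure list is the evidence that lets a piece of the old system retire.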


Common pitfalls

Underestimating the human review bottleneck. AI generation costs dropped by orders of magnitude. Human review costs didn't. Plan reviewer capacity as a first-class constraint, not an afterthought.

Half-lab brownfield. Running part of the project in AI-native mode and part in traditional mode because "that part's faster the old way" produces neither benefit. The AI Lab pitfall list names this specifically.

Treating seams as given when they're implicit. Strangler-fig reads well on whiteboards. In practice, "clean seams" are where responsibilities can be extracted without coupling to eight other modules. If that isn't true of your codebase, strangler-fig becomes full rebuild with extra steps.

Confusing modernization cost with replacement cost. The decision "remediate vs. rebuild" isn't about whether fixing costs money — both cost money. It's about whether the structure can hold what you need it to hold. Reckless-inadvertent debt can't be unwound in place regardless of budget.


Tool landscape (snapshot)

The framework doesn't recommend specific tools; the landscape shifts too fast. But worth knowing about: spec-driven development (spec-kit, Kiro, Tessl — see Fowler's comparison); MCP connectors for systematic codebase traversal; agent platforms (Claude Code, Cursor, Aider, Devin) with different autonomy/human-in-the-loop tradeoffs; benchmarks (SWE-bench, IDE-Bench) — but benchmark scores on curated problems don't predict brownfield performance on your code.


Mirror: strategy and unreliability

Engineering for unreliability is about making reliable systems from unreliable AI outputs. Brownfield strategy is about making AI agents effective on unreliable codebases. The two converge at the harness: fast sensors, documented intent, scenario-level validation. A brownfield migration that doesn't end with the codebase at Readiness Level 5 and doesn't match the Unreliability discipline succeeded in form but not in function.


