Specification Engineering Guide
A practical companion to the Execution Standards. The Standards define the layers and primitives. This guide shows how to use them.
When do you need a spec?
Not every AI interaction requires a formal specification. Use this decision tree:
Use a prompt when:
- The task is a single step with obvious success criteria
- You'll evaluate the output immediately
- Failure costs nothing (you can just re-prompt)
Use a prompt + context file when:
- The task requires domain knowledge (terminology, brand voice, technical constraints)
- You've done this type of task before and quality matters
- The output will be used by others
Use a full specification when:
- The task has multiple steps or dependencies
- Multiple people or agents will work on parts of it
- Failure is costly (production code, customer-facing output, irreversible actions)
- You need someone else to verify the result without asking you questions
- The task will be repeated and should produce consistent results
The threshold is not complexity — it's cost of failure. A simple task with high stakes needs a spec. A complex task with zero stakes can start with a prompt.
Writing a spec takes time. But research shows it pays for itself: enriching a problem statement before an agent begins work yields a 20% improvement in resolution rates, and weaker agents see the largest gains. It's cheaper to improve the spec than to improve the agent.
The four layers in practice
The Standards define four mandatory layers. These layers form a cumulative maturity progression: each layer builds on the previous one. Here's what each looks like when done well.
Layer 1 — Prompt Craft
The baseline. You're writing a clear instruction.
Seems reasonable but has gaps:
Write a 3-email onboarding sequence for trial users who haven't activated yet. Make it friendly and helpful. Include CTAs.
Better — specific enough to verify:
Write a 3-email nurture sequence for trial users who signed up but haven't completed onboarding by day 3. Goal: get them to complete their first project. Tone: helpful, not pushy. Each email should be under 100 words with one clear CTA. Email 1 (day 3): highlight the easiest way to start. Email 2 (day 5): share a template they can customize. Email 3 (day 7): offer a 15-minute onboarding call. Subject lines should be conversational, not promotional.
The first prompt feels complete but leaves critical questions open: what does "activated" mean? How many emails? How long? What's the CTA pointing to? The agent will fill those gaps with assumptions — and assumptions are where quality breaks down.
The ≤20% correction rule: If you consistently need to fix more than 20% of the AI output, the problem is your prompt, not the model. Invest in the instruction, not in editing the result.
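The 20% threshold becomes measurable if you compare the AI draft against the version you actually shipped and compute how much changed. A minimal sketch using Python's standard-library difflib; the character-level similarity measure and the 0.20 cutoff are illustrative choices, not part of the rule itself.

```python
import difflib

def correction_ratio(ai_draft: str, final_version: str) -> float:
    """Fraction of the draft that had to change to reach the final version."""
    matcher = difflib.SequenceMatcher(None, ai_draft, final_version)
    return 1.0 - matcher.ratio()

def prompt_needs_work(ai_draft: str, final_version: str,
                      threshold: float = 0.20) -> bool:
    """True when edits exceed the threshold: invest in the prompt,
    not in editing the result."""
    return correction_ratio(ai_draft, final_version) > threshold
```

Track this ratio across a handful of outputs rather than a single one — the rule is about what you *consistently* need to fix.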
Layer 2 — Context Engineering
Context is everything the agent needs to know that isn't in the prompt. Andrej Karpathy defines context engineering as "the delicate art and science of filling the context window with just the right information for the next step."
The key question is not "what could be relevant?" — it's "what does the agent need to see right now?" More context is not better context. Research shows models hit performance walls when overwhelmed with information, regardless of context window size.
Five quality criteria (Vishnyakova, 2026):
- Relevant — only what's necessary for the current step. Irrelevant data actively worsens output.
- Sufficient — everything needed to make a decision without guessing. Missing context is the primary architectural cause of hallucination.
- Isolated — each step or sub-agent sees only its own context. Sharing everything causes compounding failures (see Context failure modes).
- Economical — minimum tokens while preserving quality. Anthropic frames this as finding "the smallest possible set of high-signal tokens."
- Traceable — every piece of context attributable to a source. When something goes wrong, you need to know which input caused it.
A team context file is not a brain dump. It's a curated set of facts the agent needs for recurring tasks. Keep it concise and gotchas-only — it loads into every session, so every line must be universally applicable.
# Customer Success Team Context
## Goals
- Retain existing customers (retention > acquisition)
- Reduce time-to-resolution for support tickets
- Identify expansion opportunities in existing accounts
## Constraints
- Never share customer data across accounts
- Never promise features that aren't shipped
- Escalate billing disputes over $500 to a manager
- Response time SLA: first response within 4 hours
## Terminology
- "Partner" = reseller account (not a regular customer)
- "White-label" = customer-branded version of the platform
- "Link branding" = custom domain for tracking links
## Quality standards
- Responses must address the specific question, not generic FAQ
- Include next steps in every response
- If unsure, say so — don't guess
Just-in-time vs upfront: Don't load everything upfront. Use a hybrid strategy — always load the team context file and terminology; load specific data (customer records, ticket history, documentation) on demand. The agent should know where to find information, not memorize it all.
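The hybrid strategy above can be sketched as a small assembly step: curated context always loads first, task-specific data is fetched on demand, and a token budget enforces the economy criterion. All names here are hypothetical, and the word count is a crude stand-in for a real tokenizer.

```python
def assemble_context(task, always_load, fetch_for_task, token_budget=8000):
    """Hybrid loading: curated (name, text) pairs load first, then
    task-specific data on demand, capped by a token budget.
    Keeping the source name with each chunk preserves traceability."""
    context, used = [], 0
    for name, text in always_load + fetch_for_task(task):
        cost = len(text.split())  # crude token estimate
        if used + cost > token_budget:
            break                 # economical: stop before overflowing
        context.append((name, text))
        used += cost
    return context
```

Because curated files come first, a tight budget degrades gracefully: the team context survives and the on-demand extras are what get dropped.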
For a deeper treatment of context engineering, see Anthropic's Effective Context Engineering for AI Agents.
Layer 3 — Intent Engineering
Intent answers: "What should the agent optimize for, and what trade-offs is it allowed to make?"
Without explicit intent, agents optimize the most measurable metric available — which is rarely what you actually want. The Klarna case study illustrates this: their AI agent handled two-thirds of customer inquiries and saved $60M, but the CEO publicly acknowledged that quality had suffered because the agent optimized cost per token, not the value of customer relationships. As one researcher puts it: "Context without intent is noise."
Intent for a support workflow:
## Objective hierarchy (in order of priority)
1. Accuracy — never give incorrect information
2. Resolution — solve the customer's actual problem,
not just answer their question
3. Tone — match the brand voice (friendly, competent)
4. Efficiency — keep responses concise
## Trade-off rules
- Accuracy vs speed → choose accuracy
- Tone vs efficiency → keep it warm, even if longer
- Uncertain about a technical answer → say so,
don't guess
## Escalation conditions
- Customer mentions legal action → immediately
- Billing dispute over $500 → manager
- Technical issue affecting multiple customers
→ engineering
- Same question asked three times → senior support
## Decision authority
May: draft responses, look up account details,
cite knowledge base
Must not: promise refunds, share other customers'
data, commit to future features
Intent for an engineering workflow:
## Objective hierarchy (in order of priority)
1. Correctness — code must pass all existing tests
2. Maintainability — readable by a developer unfamiliar
with the codebase
3. Performance — meet SLA requirements
4. Simplicity — prefer fewer lines over abstraction
## Trade-off rules
- Correctness vs speed → choose correctness
- Maintainability vs performance → favor readability
unless the SLA is at risk
- When multiple approaches are valid → pick the one
with less coupling to other modules
## Escalation conditions
- Change requires modifying a public API → escalate
- Performance impact exceeds 10% on critical paths
→ escalate
- Solution requires a database migration → escalate
## Decision authority
May: refactor within the scope of the task,
add helper functions, update related tests
Must not: change public APIs, modify unrelated code,
skip test coverage
Layer 4 — Specification Engineering
A specification is a complete, self-contained instruction set for a non-trivial task. It includes everything the agent needs and nothing it doesn't.
The seven required components (from the Standards):
- Problem statement — what needs to happen and why
- Scope — what's included and explicitly what's not
- Inputs — what the agent has access to
- Constraints — the Must / Must Not / Prefer / Escalate categories
- Acceptance criteria — how to verify the output is correct (what "done" looks like, objectively)
- Failure conditions — what counts as a failed attempt
- Success tests — specific scenarios that the output must handle
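The seven components are also a checklist you can enforce mechanically. A minimal sketch: a spec represented as a dataclass where every component must be non-empty before the task is delegated. The structure is illustrative, not a format the Standards prescribe.

```python
from dataclasses import dataclass, fields

@dataclass
class Spec:
    """The seven required components from the Standards."""
    problem_statement: str
    scope: str
    inputs: str
    constraints: str
    acceptance_criteria: str
    failure_conditions: str
    success_tests: str

def missing_components(spec: Spec) -> list:
    """Names of components that are still empty — delegate only
    when this list is empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]
```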
The five primitives with examples
Primitive 1 — Self-Contained Problem Statements
The test: could someone with no context about your project execute this task using only what's written?
Seems complete but isn't:
The job scheduler is too slow for large batches. Fix the priority system so smaller jobs don't get stuck behind large ones. Use BullMQ.
Self-contained:
The job scheduler (src/jobs/queue.ts) processes tasks sequentially. Large jobs (50K+ items) block smaller jobs in the queue — users with small jobs wait 30+ minutes for processing to start. The fix should add priority-based scheduling: jobs under 5K items get higher priority. The rate limiter configuration (100 req/s) must not change. The existing BullMQ priority feature should be used if possible. If the solution requires changing the job schema, escalate before implementing.
The first version sounds actionable but hides questions: which file? What's "too slow" in numbers? What are the constraints? The agent will make assumptions about all of these — and they may be wrong.
Research on coding agents confirms this: providing architectural context, cross-file dependencies, and exploration hints upfront prevents agents from wasting steps on unfocused repository traversal.
Marketing version — seems complete but isn't:
Create an email drip campaign for trial users who haven't converted. Focus on showing product value. 3 emails, friendly tone, with CTAs.
Self-contained:
Trial-to-paid conversion is 12%. Industry benchmark is 18%. Users who complete onboarding within 72 hours convert at 31%. Design a 3-email sequence triggered when a trial user hasn't completed their first project by day 3. Goal: get them to complete onboarding. Each email under 100 words, one CTA per email, works in EN and FR. No pricing mentions, no urgency tactics.
The first version would produce a generic drip campaign. The second gives the agent the business context (why this matters) and the constraints (what to avoid) that produce a targeted result.
Primitive 2 — Acceptance Criteria
Write three sentences that let someone else verify the output without asking you questions.
Feels testable but isn't:
The dashboard should accurately display key metrics and load quickly across all browsers.
Actually testable:
The dashboard displays MRR, churn rate, and active users for the selected date range. MRR matches the finance team's number within $100. The page loads in under 2 seconds on a standard connection. Charts render without errors in Chrome and Firefox.
The first version uses words that feel precise ("accurately," "quickly," "all browsers") but that no one can verify without asking follow-up questions. The second version defines specific numbers, specific metrics, and specific browsers.
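Criteria written this way can be turned directly into checks. A sketch for the dashboard example; the metric field names are illustrative, and in practice load time and browser results would come from your monitoring and test tooling.

```python
def check_dashboard(metrics: dict, finance_mrr: float,
                    load_seconds: float, browsers_ok: set) -> list:
    """Evaluate the rewritten acceptance criteria; returns failures."""
    failures = []
    for key in ("mrr", "churn_rate", "active_users"):
        if key not in metrics:
            failures.append(f"missing metric: {key}")
    if "mrr" in metrics and abs(metrics["mrr"] - finance_mrr) > 100:
        failures.append("MRR differs from finance by more than $100")
    if load_seconds >= 2.0:
        failures.append("page does not load in under 2 seconds")
    if not {"chrome", "firefox"} <= browsers_ok:
        failures.append("charts not verified in Chrome and Firefox")
    return failures
```

If a criterion resists being written as a check like this, that is usually a sign it still contains an "accurately"- or "quickly"-style word.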
Primitive 3 — Constraint Architecture
Four categories. Fill in all four for every non-trivial task.
Example: Support ticket auto-response system
| Category | Rules |
|---|---|
| Must | Address the customer by name. Reference their specific issue. Include a next step. |
| Must not | Never share credentials or internal system details. Never promise a specific resolution time. Never auto-close a ticket. |
| Prefer | Link to relevant knowledge base articles when they exist. Use the customer's language based on their account settings. Keep responses under 200 words. |
| Escalate | Customer mentions cancellation. Issue involves data loss. Same issue reported 3+ times in 24 hours. Customer is on an enterprise plan. |
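A constraint table like this can back a review gate on every draft. A hypothetical sketch: "Must" violations and escalation triggers block sending, while "Prefer" items only warn. The keyword matching is a crude proxy — real systems would use classifiers — but the four-category structure is the point.

```python
ESCALATE_TRIGGERS = ("cancel", "data loss")  # illustrative subset

def review_draft(draft: str, customer_name: str, ticket_text: str) -> dict:
    """Apply the constraint table to one draft before sending."""
    result = {"violations": [], "warnings": [], "escalate": False}
    if customer_name.lower() not in draft.lower():
        result["violations"].append("must: address the customer by name")
    if "next step" not in draft.lower():  # crude check for an explicit next step
        result["violations"].append("must: include a next step")
    if any(t in ticket_text.lower() for t in ESCALATE_TRIGGERS):
        result["escalate"] = True
    if len(draft.split()) > 200:
        result["warnings"].append("prefer: keep responses under 200 words")
    return result
```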
Primitive 4 — Decomposition
Break tasks into independent pieces with clear inputs and outputs. Target: subtasks that take ≤2 hours and can be verified on their own.
Looks decomposed but is still coupled:
- Design the data model and build the API
- Build the frontend that uses the API
- Write tests and deploy
Actually independent:
| Step | Input | Output | Verification |
|---|---|---|---|
| 1. Schema migration | Current schema + requirements doc | Migration file + rollback file | Migration runs forward and backward without errors |
| 2. API endpoint | Migration + API spec | Route handler with validation | All 5 test cases pass, returns correct status codes |
| 3. Frontend form | API spec + design mockup | React component with form validation | Form submits successfully, validation errors display correctly |
| 4. Integration test | All three above | End-to-end test | User can complete the full workflow in staging |
The first version has hidden dependencies (step 1 bundles two concerns, step 2 can't start until step 1 is fully done, step 3 bundles testing with deployment). The second version has clear inputs, outputs, and verification at each step. Anthropic recommends this orchestrator-workers pattern for multi-component tasks.
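The input/output columns make the dependency structure checkable: every input a step consumes must be either external or produced by an earlier step. A sketch of that check, using the four-step table above as illustrative data.

```python
def check_decomposition(steps: list, external: set) -> list:
    """Each step may consume only external inputs or outputs of
    earlier steps; flags forward or circular dependencies."""
    available, problems = set(external), []
    for step in steps:
        missing = set(step["inputs"]) - available
        if missing:
            problems.append(f"{step['name']}: unmet inputs {sorted(missing)}")
        available |= set(step["outputs"])
    return problems
```

Running this on a draft plan surfaces the hidden coupling that the first, bundled version of the task list concealed.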
Primitive 5 — Evaluation Design
Build 3-5 test cases with known-good outputs. Run them after every change.
Example: Email subject line generator
| Test case | Input | Expected output characteristics |
|---|---|---|
| 1. Welcome email | New user, signed up today, name: Marie | Contains user's name. Under 50 characters. No spam trigger words. |
| 2. Re-engagement | User inactive 30 days | References a specific timeframe or their last activity. Creates curiosity. Not guilt-inducing. |
| 3. Feature announcement | New editor feature, all users | Highlights the benefit, not the feature name. No technical jargon. |
| 4. French locale | Same as #1 but user locale is FR | Grammatically correct French. Not a word-for-word translation. |
| 5. Edge case: no name | User has no first name on file | Doesn't say "Hi null" or "Hi ". Uses a generic greeting. |
You don't test the exact output (AI is non-deterministic). You test the output characteristics. This is evaluation, not unit testing. Anthropic recommends pairing each eval prompt with verifiable outcomes and tracking accuracy, runtime, and error rates over time.
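Characteristic checks from the table translate into small predicate functions you run on each generated output. A sketch for test cases 1 and 5; the spam-word list is an illustrative stand-in for a real deliverability checklist.

```python
SPAM_WORDS = ("free!!!", "act now", "winner")  # illustrative list

def check_welcome_subject(subject: str, name: str) -> list:
    """Case 1: contains the user's name, under 50 characters,
    no spam trigger words. Characteristics, not exact strings."""
    failures = []
    if name and name not in subject:
        failures.append("missing user's name")
    if len(subject) >= 50:
        failures.append("50 characters or more")
    if any(w in subject.lower() for w in SPAM_WORDS):
        failures.append("spam trigger word")
    return failures

def check_no_name_greeting(greeting: str) -> list:
    """Edge case 5: never 'Hi null' or a dangling 'Hi'."""
    bad = greeting.strip().lower() in ("hi null", "hi none", "hi")
    return ["broken fallback greeting"] if bad else []
```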
Context failure modes
When AI output goes wrong, diagnose which context failure caused it before retrying. These four failure modes are well-documented in the context engineering literature:
Poisoning
An error enters the context and compounds across turns. The agent hallucinates a function name in step 2, then keeps calling that non-existent function in steps 3-5.
Fix: Clear context between steps. Don't let tool outputs from early steps persist deep into the conversation. Summarize at boundaries instead of carrying raw history.
Distraction
The context is so long that the agent repeats patterns from history instead of reasoning about the current step. This degradation has been observed once context grows past roughly 100K tokens.
Fix: Compress or summarize older context. When the conversation gets long, create a fresh session with a summary of decisions made so far.
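Boundary compaction can be sketched as replacing everything but the most recent turns with a single summary entry. In practice `summarize` would be an LLM call; here it is a placeholder, and the cutoff of four turns is arbitrary.

```python
def compact_history(turns: list, keep_recent: int = 4,
                    summarize=lambda ts: f"[summary of {len(ts)} earlier turns]") -> list:
    """Replace all but the most recent turns with one summary entry,
    so long histories stop feeding the distraction failure mode."""
    if len(turns) <= keep_recent:
        return turns
    return [summarize(turns[:-keep_recent])] + turns[-keep_recent:]
```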
Confusion
Too many tools or too much documentation in context. The agent picks the wrong tool or cites irrelevant documentation because it can't distinguish what's relevant.
Fix: Reduce what's loaded. Only expose tools and documents relevant to the current task. As Anthropic puts it: "If a human cannot definitively choose the correct tool, an agent cannot perform better."
Clash
Contradictory information in context. The style guide says "be concise" but the template says "include a detailed explanation." A Microsoft/Salesforce study showed a 39% quality drop when contradictory information was present.
Fix: Audit your context for contradictions. When updating context files, remove the old guidance — don't just add new guidance alongside it. One source of truth per topic.
Writing specs for different roles
Engineering spec
## Problem
The job scheduler processes tasks sequentially. Large jobs
(50K+ items) block smaller jobs in the queue. Users with
small jobs wait 30+ minutes for processing to start.
## Scope
- Modify the job queue to support priority-based scheduling
- Small jobs (< 5K items) get higher priority
- Do NOT change the processing rate limits or the
rendering pipeline
## Inputs
- Job queue implementation: src/jobs/queue.ts
- Job model: src/models/job.ts
- Current tests: src/jobs/__tests__/queue.test.ts
## Constraints
- Must: maintain FIFO order within the same priority tier
- Must: process all jobs within the existing rate limits
- Must not: drop or reorder jobs already in progress
- Must not: require a database migration
- Prefer: use the existing BullMQ priority feature
- Escalate: if the solution requires changing the job schema
## Acceptance criteria
1. A 1K-item job queued after a 100K job starts processing
within 5 minutes, not 30+
2. No job is starved — all jobs complete within 2x their
expected duration
3. Existing tests pass without modification
4. New tests cover priority ordering and starvation prevention
## Failure conditions
- Any job processed with incorrect inputs
- Any job processed twice
- Queue deadlock under concurrent submissions
## Success tests
1. Queue 3 jobs: 100K, 1K, 500. Verify 1K and 500
start before 100K completes.
2. Queue 20 small jobs and 1 large. Verify all complete.
3. Submit jobs concurrently from 3 API clients.
Verify no duplicates or lost jobs.
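The spec's Prefer constraint points at BullMQ's priority feature, but the ordering rule itself can be sketched independently of any queue library. This assumes BullMQ-style semantics, where a lower priority number is served first; the tier values are illustrative.

```python
HIGH, NORMAL = 1, 10  # lower number = higher priority (BullMQ convention)

def job_priority(item_count: int) -> int:
    """Spec rule: jobs under 5K items get the higher-priority tier."""
    return HIGH if item_count < 5_000 else NORMAL

def processing_order(jobs: list) -> list:
    """jobs are (name, item_count, enqueue_seq) tuples. Sort by tier,
    then by enqueue order: FIFO within the same tier, per the Musts."""
    return [name for name, items, seq in
            sorted(jobs, key=lambda j: (job_priority(j[1]), j[2]))]
```

Note this matches success test 1: small jobs queued after a 100K job still start first, while FIFO within each tier prevents starvation games between small jobs.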
Marketing spec
## Problem
Trial-to-paid conversion is 12%. Industry benchmark is 18%.
Users who complete onboarding within 72 hours convert at 31%.
Most trial users never complete onboarding.
## Scope
Design a 3-email onboarding sequence triggered when a trial
user hasn't completed their first project by day 3. Goal:
get them to complete onboarding. This is the email content
only — the automation trigger already exists.
## Inputs
- Current welcome email (attached)
- Top 3 templates by usage (attached)
- Brand voice guide: friendly, competent, never corporate
- Product: visual editor, template library, project management
## Constraints
- Must: each email under 100 words
- Must: one clear CTA per email
- Must: work in both EN and FR
- Must not: mention pricing or plan limitations
- Must not: use urgency tactics ("your trial is expiring!")
- Prefer: show, don't tell (link to a template, not a
feature list)
- Escalate: if the sequence needs more than 3 emails
## Acceptance criteria
1. Each email has exactly one CTA linking to a specific
action in the product
2. Tone matches the brand voice guide
3. No email exceeds 100 words
4. Subject lines are under 50 characters
5. A non-customer can understand the value proposition
without prior product knowledge
## Failure conditions
- Any email exceeds 100 words
- CTA links to a generic page instead of a specific action
- Urgency language appears ("limited time", "expiring",
"last chance")
- French version is a literal translation rather than
natural prose
## Success tests
1. Read email 1 cold. Can you identify what the product
does and what action to take? (yes/no)
2. Read all 3 in sequence. Does the progression feel
natural, not repetitive? (yes/no)
3. French versions read naturally, not like translations.
(verified by native speaker)
Support spec
## Problem
Support response time averages 6 hours. 40% of tickets are
questions already answered in the knowledge base. Agents
spend time searching for answers that already exist.
## Scope
Build an AI-assisted draft response system. When a ticket
comes in, the system searches the knowledge base and drafts
a response for the support agent to review and send.
The agent always reviews before sending — no auto-responses.
## Inputs
- Knowledge base articles (~200 articles)
- Ticket history (last 90 days, anonymized)
- Customer account data (plan type, tenure, recent activity)
- Canned response templates (32 templates)
## Constraints
- Must: human reviews every response before sending
- Must: cite the knowledge base article used
- Must: flag when no relevant article exists
(signal to create one)
- Must not: access customer data from other accounts
- Must not: promise resolution times
- Must not: auto-close tickets
- Prefer: match the customer's language
- Prefer: keep drafts under 200 words
- Escalate: billing disputes over $500, mentions of
cancellation, data loss issues
## Acceptance criteria
1. Drafts are generated within 30 seconds of ticket creation
2. 70%+ of drafts require less than 20% editing
3. Cited knowledge base articles are relevant to the question
4. Agents can override or discard any draft with one click
## Failure conditions
- Response sent without human review
- Customer data from account A appears in account B's response
- Draft cites a deprecated or incorrect KB article
- System generates a draft for a ticket it should escalate
## Success tests
1. "How do I set up custom domains?" → Draft references
the domain setup KB article with correct instructions
2. "I was charged twice" → Draft flags for escalation,
does not attempt to resolve billing
3. Question in French → Draft is in French,
references relevant troubleshooting steps
4. Gibberish input → System flags as unclear, does not
generate a draft
The spec improvement loop
A spec is never done on the first try. Use the failure responsibility model:
- Write the spec using the five primitives
- Run it — let the agent execute
- Evaluate the output against your acceptance criteria
- Diagnose failures by layer:
| What went wrong | Which layer to fix |
|---|---|
| Output is wrong or low quality | Layer 1 — improve the prompt |
| Output ignores domain knowledge | Layer 2 — improve the context |
| Output optimizes the wrong thing | Layer 3 — clarify the intent |
| Output is incomplete or misses requirements | Layer 4 — tighten the spec |
- Fix one layer at a time. Never change multiple layers simultaneously — you won't know what worked.
- Update the spec with what you learned. The best specs are written by people who've watched agents fail.
This iterative approach aligns with Anthropic's recommendation to "start minimal with the best model, then iteratively add instructions based on observed failure modes."
Common mistakes
Over-specifying. A spec that describes every implementation detail is just pseudocode. Specify the what and the constraints, not the how. Let the agent choose the approach within your boundaries.
Under-specifying. "Make it better" is not a spec. If you can't describe done, you're not ready to delegate.
Context dumping. Loading every document you have into context "just in case." Irrelevant information actively degrades output. Anthropic's core principle: find the smallest set of high-signal tokens, not the most tokens.
Optimizing prompts when the problem is context. If the agent keeps producing irrelevant output, adding "be more relevant" to the prompt won't help. The agent needs different information, not better instructions. Use the failure responsibility model to diagnose the right layer.
Skipping evaluation design. Without test cases, you're relying on vibes to judge output quality. Build the test cases first — they force you to clarify what you actually want.
Retrying instead of diagnosing. When output fails, the instinct is to regenerate. Stop. Figure out which layer caused the failure. Fix that layer. Then retry.
Quick reference
Before delegating, confirm:
- I can describe the problem without referencing things the agent doesn't know
- I can write three sentences that verify success
- I've listed what must happen, must not happen, and when to stop and ask
- I've broken the task into pieces small enough to verify independently
- I have test cases with known-good characteristics
When output fails, diagnose:
| Symptom | Likely cause | Action |
|---|---|---|
| Output is wrong | Prompt | Improve instructions, add examples |
| Output ignores your domain | Context | Add or update the context file |
| Output optimizes the wrong thing | Intent | Define objective hierarchy and trade-offs |
| Output is incomplete | Specification | Add missing requirements, tighten scope |
Further reading
- Building Effective Agents — Anthropic's guide to agent architecture patterns
- Effective Context Engineering for AI Agents — Managing the full token state across agent loops
- Writing Tools for Agents — Tool design principles and evaluation methodology
- Context Engineering for Agents — LangChain's four operations framework (Write, Select, Compress, Isolate)
- A Survey of Context Engineering for LLMs — Academic survey establishing a formal taxonomy
- CodeScout: Contextual Problem Statement Enhancement — Research showing 20% improvement from better upfront specifications
