Proof-driven requirements: the new Agile for building AI systems

Building an AI system that writes customer-facing data reports taught me that the requirements that matter most can’t be written in advance. Here are eighteen ways an agent looks right and is wrong, and how to fix each one.

Mohannad Arbaji, Founder, ChalkTalk · ~9 min read + ~15 min browsable failure catalog

This post is a war story.

Over the last few weeks, I built and operated an agentic AI system that does something unforgiving: it takes a school district’s usage and assessment data and writes the activity-and-impact report an account manager puts in front of a senior K-12 administrator. The report has to be safe, compliant, and right. A wrong number isn’t a bug; it’s a credibility breach at the exact moment we ask a customer to renew.

I am a founder. Two of my degrees are in Electrical and Computer Engineering, but I haven’t written code professionally since my first corporate job fifteen years ago. I came to agentic engineering because of the roughly 50x increase in agency I experienced the first time I worked this way: things I would have had to brief, queue, and wait for, I could now just build, production-grade.

I built this system directly using frontier LLMs, starting from a single top-level business objective: “build a company second brain for renewals.” In plain terms: one system that pulls together everything we know about a customer — CRM, support tickets, product usage, meeting notes — and turns it into the report and the story for that customer’s renewal.

I planned exhaustively: long, structured brainstorming sessions that became specs, specs that became detailed implementation plans, plans I executed against. That is how the business logic was designed: what a renewal story is made of, which analyses exist, what the report should say. But no amount of planning produced the other document — the one that makes the output of the system I built trustworthy. That document exists now: dozens of hard rules, enforced by code and by multiple AI agents that review each other’s work — and almost none of them came out of planning. Each one was paid for by a failure I hit on a real run.

That experience has a name now.

What changed: requirements used to come first

Traditional software is deterministic. The same input produces the same output, which is what made it possible to specify behavior before building it. Agile already taught us not to specify everything upfront: requirements emerge through iteration, and the approach this post describes, Proof-Driven Requirements (PDR), inherits that loop wholesale. But Agile kept one assumption so safe it never needed stating: once a behavior passes its test, it stays passed. A failure announces itself.

AI systems break that assumption. The same prompt can produce different answers depending on context, phrasing, or what got loaded into the window. And the failures don’t look like failures. Nothing crashes. No error fires. The system produces a plausible, confident answer that happens to be wrong — and it looks identical to a correct one.

A wrong interpretation isn’t a failed test; it can be a trust breach, a governance gap, or a compliance failure. That is what makes upfront specification structurally insufficient, not just inconvenient. I’ve seen the alternative called Proof-Driven Requirements (PDR),¹ and the name fits: in an AI system, the most important requirements cannot be written before execution. They are discovered through it.

The sequence I was taught

Define the business objective → Write the full product requirements doc → Build → Test → Ship

The sequence that actually worked

Define the business objective → Brainstorm + write the plan → Build a working flow → Run it on real data → Turn each failure into a requirement → Refine and repeat

Notice what the second sequence keeps: the objective still comes first, and so does real planning — PDR is not winging it. Build, run, learn, repeat is still Agile’s loop. What changes is how the behavioral layer gets defined: by proof, not by prediction — because passing once proves almost nothing about passing next time. Naming the pattern is the easy half. The hard half is what you do with a requirement once a failure proves it — that’s what this post is about.

| | Agile (built for deterministic code) | Proof-driven (AI systems) |
|---|---|---|
| Behavior is | Fully specifiable upfront | Discovered through execution |
| A failure is | A bug to fix | Evidence the spec is incomplete |
| Tests prove | The feature works | The failure can’t recur; each one became a permanent gate |
| The requirement doc is | A feature description | A growing set of verified behavioral constraints |
| Done means | Shipped | Continuously proving reliability, run after run |

The one idea that organizes everything

Looking back across everything this system forced me to fix, one move kept repeating — not the only fix, but the one that did the most work each time:

Everything starts as prose. You turn the deterministic parts into code. The prose that remains is true LLM judgment — and that isn’t a compromise. It’s the superpower.

Concretely: an agent skill usually begins life as one long prose file — prompt-based instructions the model reads and follows. But look closely: most of those instructions are work that should happen the same way every single time — which query runs next, what shape a record takes, where a file gets written, whether the parts sum to the whole. Written as prose, that is a program in English executed by a probabilistic reader — an instruction the model chooses to follow, and on a long enough run, eventually doesn’t. So you turn it into code.

What you deliberately leave as prose is the true LLM judgment: synthesizing data into a story, writing the narrative, choosing which analyses to run for this customer, deciding whether an anomaly is signal or noise. This is where the model’s non-determinism stops being a liability and becomes the point — and that work is the reason the system is worth building at all.

That conversion takes you most of the way. But it sets up the harder question, the one the rest of this post answers: how do you trust a system where some steps — and some hand-offs between steps — still run on a model reading prose? You build robust tracking and watch where it breaks.

The catalog of failures that wrote the spec

While building our renewal storytelling engine, I encountered failure modes that no planning session had surfaced. Below is a sample of key issues — eighteen, grouped into five families. Each one forced an architectural change. Each one produced a requirement that was proven into existence, not planned.

Every failure mode steps through four cards:

The Wall — what failed, and why it looked fine
The Proof — the actual run where I hit it
The Requirement — the constraint it forced into existence
The Fix Shape — the portable pattern, not my specific tool

The names are deliberately general. The examples are mine, but I expect any team building multi-step agentic workflows will encounter most of these too.

Family A · #1–3 of 18

Drift

A1

Semantic Drift under Ambiguity

An instruction that sounds specific has several legitimate readings — and the model picks one silently.

1 / 4 — The Wall

Most instructions you give an agent are ambiguous without you noticing. “Analyze this data for multiple years if this district has been with us for more than 1 year” sounds specific. It isn’t — there are several legitimate analyses that phrase could mean, and the model will pick one for you: silently, plausibly, and possibly a different one tomorrow. Nothing about the output looks broken, because every reading produces a clean-looking result.

2 / 4 — The Proof

In my system, a multi-year analysis could legitimately mean at least three different things (tracking the same students’ growth across years, comparing this year’s cohort against last year’s, or trending school-wide results over time). So I defined the three ways we do it and wrote selection rules based on the shape of the district’s data. That handles the ambiguity machines can settle.

For the ambiguity they can’t: when a renewal-report request hits a district running our courses in multiple subjects, the system stops and asks me which subject I mean (or whether I want all of them) instead of guessing.

3 / 4 — The Requirement

Ambiguity resolves through a defined menu of interpretations with selection rules — and when the rules can’t decide, the system asks the operator instead of guessing.

4 / 4 — The Fix Shape

Enumerate the legitimate interpretations before the model ever meets them. Encode the rules for choosing. Gate whatever remains on a human. The model generates within a chosen interpretation — it never chooses silently.
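
A minimal sketch of that shape, under stated assumptions: the three menu entries come from the multi-year example above, but the selection rules, the function name choose_analysis, and the NeedsOperator exception are hypothetical stand-ins, not the system's actual code.

```python
from enum import Enum
from typing import Optional


class MultiYearAnalysis(Enum):
    SAME_STUDENT_GROWTH = "same students tracked across years"
    COHORT_COMPARISON = "this year's cohort vs last year's"
    SCHOOLWIDE_TREND = "school-wide results over time"


class NeedsOperator(Exception):
    """The rules cannot decide; a human must choose before the run continues."""


def choose_analysis(has_linked_students: bool, years_of_data: int,
                    subjects: list[str],
                    requested_subject: Optional[str] = None):
    # Ambiguity the rules cannot settle: stop and ask instead of guessing.
    if requested_subject is None:
        if len(subjects) != 1:
            raise NeedsOperator(f"Which subject? Options: {sorted(subjects)} or all")
        requested_subject = subjects[0]
    # Ambiguity the rules can settle: pick from the fixed menu by data shape.
    # These thresholds are illustrative assumptions only.
    if has_linked_students:
        kind = MultiYearAnalysis.SAME_STUDENT_GROWTH
    elif years_of_data >= 3:
        kind = MultiYearAnalysis.SCHOOLWIDE_TREND
    else:
        kind = MultiYearAnalysis.COHORT_COMPARISON
    return kind, requested_subject
```

The model is only invoked after this function returns, so it writes within an interpretation that code or a person has already chosen.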

A2

Semantic Drift under Clarity

The companion nobody warns you about: a perfectly clear spec still drifts on a long run.

1 / 4 — The Wall

Drift under ambiguity at least has a cause you can fix with clearer instructions. The nastier discovery: a completely unambiguous spec also drifts on a long run. The model shortcuts, conflates adjacent steps, re-derives a format from memory instead of re-reading the doc two files away. Length and repetition are the enemy, not vagueness.

2 / 4 — The Proof

I audited one end-to-end run of my report skill: 164 structured events emitted, six distinct drift modes — with three correct reference docs sitting in the working directory the whole time. Separately, a 1,000-line skill spec that ran clean when I supervised it interactively started silently skipping steps the moment it ran as a background agent. Same spec. Same code. Same data.

3 / 4 — The Requirement

Anything deterministic must be code, not prose the model re-interprets on every pass. The longer the run, the more this matters.

4 / 4 — The Fix Shape

A validating helper the model calls with arguments — it never writes the fixed shape itself. The clearer your prose, the more tempting it is to trust it. Don’t. Better prose isn’t the answer; not-prose is.

A3

Schema Improvisation

Asked for a fixed shape, the model invents a slightly better one.

1 / 4 — The Wall

Hand the model a strict schema and it improvises: promotes a subtype into the type field because it’s “more specific,” adds helper fields that seem useful, writes the file one directory up from where it belongs. Each deviation is locally reasonable. Every one breaks the downstream tools that parse the output.

2 / 4 — The Proof

In that same 164-event audit: the subtype sql_query was promoted into the type: field on 39 of 164 events; an invented entry: START/END field appeared on 34; nine invented subtypes fell outside the canonical enum; the journal landed in the wrong directory.

3 / 4 — The Requirement

The producer of a fixed shape validates structure and rejects unknowns. The model passes data — never format.

4 / 4 — The Fix Shape

A small CLI owns the schema: it allocates IDs, stamps timestamps, hardcodes the path, validates every enum, rejects invented fields. Five of my six drift modes disappeared by construction — the model can’t drift a file it no longer writes.
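
As a rough illustration of a helper that owns the schema, here is a sketch in Python. The type and subtype values, the allowed payload keys, and the journal path are assumptions made up for this example; the post does not give the real enum.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical canonical schema; the real enum values are not given in the post.
ALLOWED_SUBTYPES = {
    "query": {"sql_query", "python_calc"},
    "checkpoint": {"phase_end"},
}
ALLOWED_PAYLOAD_KEYS = {"name", "phase", "rows"}
JOURNAL_PATH = Path("run/journal.jsonl")  # fixed here, never passed in by the model


def build_event(event_type: str, subtype: str, payload: dict) -> dict:
    """Validate the caller's data and return a fully shaped event."""
    if event_type not in ALLOWED_SUBTYPES:
        raise ValueError(f"unknown type {event_type!r}")
    if subtype not in ALLOWED_SUBTYPES[event_type]:
        raise ValueError(f"unknown subtype {subtype!r} for type {event_type!r}")
    unknown = set(payload) - ALLOWED_PAYLOAD_KEYS
    if unknown:
        raise ValueError(f"invented fields rejected: {sorted(unknown)}")
    return {
        "id": uuid.uuid4().hex,                          # the helper allocates IDs
        "ts": datetime.now(timezone.utc).isoformat(),    # and stamps timestamps
        "type": event_type,
        "subtype": subtype,
        **payload,
    }


def append_event(event_type: str, subtype: str, payload: dict) -> dict:
    """Validate, then append one JSON line to the hardcoded journal path."""
    event = build_event(event_type, subtype, payload)
    JOURNAL_PATH.parent.mkdir(parents=True, exist_ok=True)
    with JOURNAL_PATH.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return event
```

With this shape, promoting sql_query into the type field or adding an entry field raises an error at the call site instead of producing a file that downstream tools cannot parse.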

Family B · #4–7 of 18

Trust

B4

The Confident Fabrication

A number nobody computed, stated fluently.

1 / 4 — The Wall

The model states a metric that doesn’t match any query that ran — or types an “expected result” from its prior of what the data probably says. The sentence is grammatical, specific, and wrong. In a customer-facing report, a fabricated number is worse than no number.

2 / 4 — The Proof

While registering a new SQL query with verified logic, the model hand-typed the expected result of that query instead of running it. The typed value was 530x off from what the warehouse actually returned.

This one incident put a hard gate on expectation strings (e.g. “this query should return X”): computed, never authored.

3 / 4 — The Requirement

No value reaches output unless execution produced it during this run. Result strings bind to the query that made them — never to memory.

4 / 4 — The Fix Shape

Separate compute from narrate. Registering a metric forces the query to execute at registration time, so the recorded expectation is real by construction. The narrator may reference computed values by name; it can never mint one.
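
One way to make "computed, never authored" structural is a registry whose only way to store a value is to run the query. The sketch below assumes an in-memory SQLite database as a stand-in for the warehouse; MetricRegistry and its methods are hypothetical names.

```python
import sqlite3


class MetricRegistry:
    """Stores metric values that only execution can produce."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        self._metrics = {}

    def register(self, name: str, sql: str):
        # There is deliberately no parameter for a hand-typed expected value:
        # the recorded result is whatever the query returns right now.
        row = self._conn.execute(sql).fetchone()
        if row is None or len(row) != 1:
            raise ValueError(f"{name}: query must return exactly one value")
        self._metrics[name] = {"sql": sql, "value": row[0]}
        return row[0]

    def value(self, name: str):
        # The narrator can look a value up by name; unknown names raise KeyError.
        return self._metrics[name]["value"]
```

Usage, with a toy table in place of real enrollment data:

```python
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollments (student_id INTEGER)")
conn.executemany("INSERT INTO enrollments VALUES (?)", [(1,), (2,), (3,)])
registry = MetricRegistry(conn)
registry.register("funnel_enrolled", "SELECT COUNT(*) FROM enrollments")
```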

B5

Evidence Surface Inconsistency

The story and the evidence beside it disagree — or no one can see whether they agree.

1 / 4 — The Wall

A report has two surfaces: the narrative a human reads aloud, and the charts and tables beside it. If they’re built from different read paths they quietly diverge. My early reports retyped numbers into sentences, and a retyped number is a fork: the prose and the chart beside it can drift apart with no error anywhere.

And there’s a deeper version: even when the two surfaces do agree, can a reviewer tell which sentences are anchored to data and which are authored prose?

2 / 4 — The Proof

The fix: narrative templates carry tokens instead of numbers. A sentence is authored as “{funnel_enrolled} students enrolled in ChalkTalk,” where {funnel_enrolled} names a registered SQL query or Python calculation. At render time it becomes “12,651 students enrolled in ChalkTalk.”

Interpolated values render visibly distinct from authored prose — a reviewer sees at a glance which words are data, and can click a number to land on the query that produced it.

3 / 4 — The Requirement

Every claim traces to a recorded event — and the trace is inspectable. Data-bound text and authored prose must be distinguishable at a glance.

4 / 4 — The Fix Shape

One read model for narrative and evidence: numbers live as tokens bound at render time, never retyped. A claims file with source and reasoning per claim; a halt when a narrative number matches no recorded event. Provenance that exists but can’t be read is provenance that doesn’t exist.
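
A small sketch of token binding at render time. The render function and the square-bracket marking are assumptions for illustration; the rule that authored prose may contain no digits at all is deliberately stricter than a real report would need (it would reject "K-12", for example).

```python
import re

TOKEN = re.compile(r"\{(\w+)\}")
DIGIT = re.compile(r"\d")


def render(template: str, computed: dict) -> str:
    """Bind {tokens} to computed numeric values; refuse retyped numbers."""
    # Strip the tokens first, then look for literal digits in what remains.
    authored = TOKEN.sub("", template)
    if DIGIT.search(authored):
        raise ValueError("literal number in authored prose; use a token instead")

    def bind(match):
        name = match.group(1)
        if name not in computed:
            raise KeyError(f"token {name!r} has no computed value")
        # Brackets stand in for the visual styling that marks data-bound text.
        return f"[{computed[name]:,}]"

    return TOKEN.sub(bind, template)
```

Given the example sentence from above, render("{funnel_enrolled} students enrolled in ChalkTalk.", {"funnel_enrolled": 12651}) returns "[12,651] students enrolled in ChalkTalk.", while a template with the number typed in by hand fails before anything is rendered.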

B6

The Sum That Doesn’t Add Up

Individually correct queries combine into nonsense. The join is where correctness dies.

1 / 4 — The Wall

Every input query is verified. The final numbers get fact-checked. The result is still wrong — because the corruption happened in the middle: a join fans out and silently inflates counts, an aggregation breaks across grade levels, a filter leaks rows from outside the year you declared. Nobody checks the middle, because each piece passed its own test.

2 / 4 — The Proof

I built a whole validation layer because of this zone. The checks:

Grain: each row is unique on its declared columns, which catches fan-out.
Rollup: recomputing the same number from its finest-grained rows gives the same total (one query checked against its own math).
Scope: no rows from outside the declared district, curriculum, and year.
Source-consistency: the declared table matches what the SQL actually joins.
Reconciliation: separate outputs are checked against each other. The counts published in one section must sum to the headline in another, and independent sources must agree within tolerance.

3 / 4 — The Requirement

Declare invariants on the output shape — uniqueness, parts-equal-whole, no leakage — and mechanically verify them at every step. Verify the middle, not just the ends.

4 / 4 — The Fix Shape

The invariant rides as a declaration on the data model itself and is checked per-query at runtime. Severity is a product decision per constraint: fan-out and broken rollups halt; a possible scope leak proceeds but gets flagged where the human and the narrator both see it.
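
Three of these invariants are simple enough to sketch over plain rows. The function names and the list-of-dicts row format are assumptions for this example; a real implementation would run against query results and read the declarations from the data model.

```python
from collections import Counter


def check_grain(rows: list[dict], key_columns: list[str]) -> list[tuple]:
    """Return duplicated keys; an empty list means rows are unique on the declared grain."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def check_rollup(rows: list[dict], value_column: str, headline) -> bool:
    """True when the finest-grained rows sum to the published headline."""
    return sum(row[value_column] for row in rows) == headline


def check_scope(rows: list[dict], declared: dict) -> list[dict]:
    """Return rows that fall outside the declared scope (for example district and year)."""
    return [r for r in rows if any(r[k] != v for k, v in declared.items())]
```

A fan-out shows up in two places at once: the duplicated key fails the grain check, and the inflated sum fails the rollup check against the headline.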

B7

The Correct Number, Wrong Reading

The data is right. The sentence makes the reader compute something false.

1 / 4 — The Wall

The subtlest trust failure I’ve hit: every value in the sentence is accurate, and the sentence still lies — because of how the values sit next to each other. The output of a report isn’t the number; it’s the meaning the reader walks away with.

2 / 4 — The Proof

A report cover read: “reaching 334 of 440 enrolled students and a path to doubling the paired-evidence base.” Every number was right — the doubling meant 109 going to roughly 220. But put “doubling” next to “334 of 440” and the reader’s eye computes an impossible 668. Correct data, false reading.

3 / 4 — The Requirement

Check how a number will be read, not only whether it is right. A multiplier must bind explicitly to its base.

4 / 4 — The Fix Shape

A render-time check for perception traps — mine detects a multiplier landing near an unrelated base and halts unless the sentence carries its own binding: “doubling (from 109 to ~220).” Data-correctness gates can’t catch this class. It needs its own gate.
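
A deliberately crude sketch of such a gate, using regular expressions. The patterns and the function name unbound_multipliers are assumptions; this version flags any multiplier word in a sentence that also contains an "N of M" ratio, unless the multiplier is followed immediately by a "(from X to Y)" binding.

```python
import re

MULTIPLIER = re.compile(r"\b(doubl\w*|tripl\w*|quadrupl\w*|\d+(?:\.\d+)?x)\b", re.IGNORECASE)
RATIO = re.compile(r"\b\d[\d,]*\s+of\s+\d[\d,]*\b")
BINDING = re.compile(r"\s*\(from\s+[~\d][\d,]*\s+to\s+[~\d][\d,]*\)")


def unbound_multipliers(sentence: str) -> list[str]:
    """Return multiplier words that sit near a ratio without their own base."""
    if not RATIO.search(sentence):
        return []  # no unrelated base nearby for the reader to misapply
    traps = []
    for match in MULTIPLIER.finditer(sentence):
        # The multiplier is safe only if its binding follows immediately.
        if not BINDING.match(sentence, match.end()):
            traps.append(match.group(0))
    return traps
```

A render step could halt whenever this returns a non-empty list. Applied to the cover sentence above, it flags "doubling"; with "(from 109 to ~220)" added right after the word, it returns nothing.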

Family C · #8–12 of 18

Orchestration

C8

The Hidden Failure Cascade

No layer proved correctness — and the deeper the orchestration, the harder to localize.

1 / 4 — The Wall

One phase retrieves weak data. The next synthesizes on top of it. A third acts on the flawed synthesis. The final output sounds coherent and intelligent — because the last model in the chain is very good at sounding coherent. No layer actually proved anything, and the more sophisticated the orchestration, the more places the break can hide.

2 / 4 — The Proof

In my version, phase 2’s exact numbers reached phase 3 as a lossy conversational summary, and a hollow phase then poisoned everything downstream while the final report read beautifully. Until each phase boundary became a checkable artifact, localizing a cascade meant re-reading a 90-minute transcript by hand.

3 / 4 — The Requirement

Deep orchestration must decompose into phases that each prove their own correctness — or a coherent-sounding whole hides a broken part you cannot find.

4 / 4 — The Fix Shape

Checkpoints between phases — downstream reads the file, never the prior prose. Per-phase execution counts. Reconciliation at phase boundaries, so a dead layer surfaces after phase 1 instead of after the run. Orchestration depth raises the stakes of per-phase proof; it never lowers them.

C9

The Silent Skip / Phase Hollowing

A step quietly dropped — output still looks complete. Success by appearance.

1 / 4 — The Wall

A multi-step agent drops a step — or collapses a whole dispatch-subagents architecture into one inline pass — and still emits deliverables that look finished. From the outside: success. Underneath: hollow. This failure does not announce itself, ever.

2 / 4 — The Proof

One run executed every phase inline instead of dispatching them: substantive-looking deliverables, zero telemetry, and about 115 false alarms from confused downstream checks. An earlier run had silently dropped an entire analysis phase and two customer PDFs — nothing checked for them, so nothing noticed. The hardening that followed cut detection from ~90 minutes to ~30 seconds.

3 / 4 — The Requirement

Every mandated step proves it ran. A step that cannot show execution evidence halts the run. Zero evidence is a failure — never a clean run.

4 / 4 — The Fix Shape

A completion contract (a machine-readable list of what each phase must produce) plus per-phase execution counts means a hollow phase halts before the next one builds on it. The elegant endgame: the retry path re-dispatches the phase, and a dispatched subagent physically cannot run inline — the architecture cures its own failure mode.
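
A completion contract can be as small as a dictionary plus one check. In this sketch the phase names, artifact names, and minimum event counts are invented for illustration, as are verify_phase and HollowPhase.

```python
class HollowPhase(Exception):
    """A phase finished without the evidence its contract requires."""


# Hypothetical contract: what each phase must produce before the next may start.
CONTRACT = {
    "analysis": {"artifacts": ["analysis_checkpoint"], "min_events": 1},
    "report": {"artifacts": ["customer_pdf", "claims_file"], "min_events": 1},
}


def verify_phase(phase: str, produced_artifacts: set, event_count: int,
                 contract: dict = CONTRACT) -> None:
    """Raise HollowPhase unless the phase shows both outputs and execution evidence."""
    spec = contract[phase]
    missing = sorted(set(spec["artifacts"]) - set(produced_artifacts))
    if missing:
        raise HollowPhase(f"{phase}: missing artifacts {missing}")
    # Zero evidence is a failure, never a clean run.
    if event_count < spec["min_events"]:
        raise HollowPhase(
            f"{phase}: {event_count} execution events, expected at least {spec['min_events']}"
        )
```

Calling this at every phase boundary means a phase that produced plausible files but no telemetry stops the run there, before anything builds on it.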

C10

The Lying Recorder

The telemetry itself silently fails. The run looks clean because nothing was watching.

1 / 4 — The Wall

You build the monitoring. Then the monitoring breaks silently — a misconfigured environment, a hook installed mid-session that never fires — and every run since has looked clean precisely because the thing that would have complained was dead. Observability has failure modes of its own.

2 / 4 — The Proof

An empty environment variable made my logging hook silently no-op. Hooks installed mid-session didn’t fire until restart: one district’s run captured 0 of 26 expected events and finished “successfully.” The report was right. The telemetry was fiction — and nothing knew.

3 / 4 — The Requirement

An end-of-run assertion that the telemetry captured reality — and the checker shares code with the thing it checks, so the two can never drift apart.

4 / 4 — The Fix Shape

Reconstruct what should have been recorded from an independent source — the session transcript — and diff it against what was recorded; hard-fail on zero-capture. A detail I’m proud of: the gate imports the recorder’s own decision function instead of re-implementing it. The anti-drift lesson, applied to the anti-drift tool.
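
The shared-decision-function idea can be sketched in a few lines. Everything named here is hypothetical: should_record stands in for the recorder's rule, and the transcript is reduced to a list of tool-call dicts. In a real layout the gate would import should_record from the recorder's module; both live in one file here so the example runs on its own.

```python
RECORDED_TOOLS = {"run_query", "write_checkpoint"}  # hypothetical recording rule


def should_record(call: dict) -> bool:
    """The single source of truth for what counts as a recordable event."""
    return call.get("tool") in RECORDED_TOOLS


def record(call: dict, log: list) -> None:
    """The recorder: appends the call ID when the shared rule says so."""
    if should_record(call):
        log.append(call["id"])


def assert_capture(transcript: list[dict], recorded_ids: list) -> int:
    """End-of-run gate: rebuild expectations from the transcript and diff."""
    expected = {c["id"] for c in transcript if should_record(c)}
    recorded = set(recorded_ids)
    if expected and not recorded:
        raise RuntimeError(f"zero-capture: 0 of {len(expected)} expected events recorded")
    missing = expected - recorded
    if missing:
        raise RuntimeError(f"{len(missing)} expected events missing: {sorted(missing)}")
    return len(expected)
```

Because the gate and the recorder call the same function, a change to the recording rule cannot leave the checker enforcing a stale one.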

C11

The Courier Tax

Paying a language model to be a for-loop.

1 / 4 — The Wall

Between every two deterministic steps — run a query, write a checkpoint, run the next query — the model is the courier: read prose, pick up the next command, execute, read more prose. The round-trip costs tokens and minutes, and it’s exactly where attention drifts and steps get skipped.

2 / 4 — The Proof

When the system ran unattended, the steps it skipped were never the judgment calls — they were the boilerplate. Every run ends with the same four wrap-up commands. Each prose handoff was one more chance for attention to wander — exactly where the skips clustered. One 90-minute run wasted about 50 minutes on work it had abbreviated and then had to redo.

3 / 4 — The Requirement

Code drives sequencing. The model is invoked only where the next step requires judgment.

4 / 4 — The Fix Shape

A driver script picks the next step from state. It calls the model surgically — choosing the story angle, drafting narrative, talking to the operator — and runs helpers directly for everything else. Skipping becomes structurally impossible, and you stop paying inference prices for couriering.
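
A toy driver makes the division of labor concrete. The step names, the state layout, and the ask_model callable are assumptions for this sketch; the enrollment figure is a stand-in for a real query result.

```python
from typing import Callable


def run_query(state: dict) -> None:
    state["enrolled"] = 440  # stand-in for executing a registered query


def choose_angle(state: dict, ask_model: Callable[[str], str]) -> None:
    # The only step that needs judgment, so the only one that calls the model.
    state["angle"] = ask_model(f"Pick a story angle for {state['enrolled']} enrolled students")


def write_checkpoint(state: dict) -> None:
    state["checkpoint"] = {"enrolled": state["enrolled"], "angle": state["angle"]}


STEPS = [
    ("run_query", "code", run_query),
    ("choose_angle", "judgment", choose_angle),
    ("write_checkpoint", "code", write_checkpoint),
]


def run_pipeline(steps, state: dict, ask_model: Callable[[str], str]) -> dict:
    """Code owns the sequence; completed steps are tracked in state["done"]."""
    for name, kind, fn in steps:
        if name in state["done"]:
            continue  # resuming a run skips only what already finished
        if kind == "judgment":
            fn(state, ask_model)
        else:
            fn(state)
        state["done"].append(name)
    return state
```

The loop cannot skip write_checkpoint out of inattention, and the model is called once instead of once per handoff.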

C12

The Orchestration-Shape Tradeoff

Inline drifts. Full fan-out goes blind. The shape is a per-step decision.

1 / 4 — The Wall

Another face of drift: one long sequential context degrades as it grows. Fanning work out to subagents fixes that — isolated context, parallel speed. But a fully autonomous subagent is a black box: you lose insight into what it’s doing, and it cannot pause to ask a question. Whatever you gated on confirmation is now ungated inside the box.

2 / 4 — The Proof

I run all three shapes in one system, deliberately. Heavy data phases fan out — context isolation and parallelism. Judgment moments and operator gates stay inline, where the orchestrator can see and ask. Choosing wrong in either direction cost me: sequential drift on long runs, and contained subagents that couldn’t surface the question they should have asked.

3 / 4 — The Requirement

Orchestration shape is a first-class design decision trading accuracy × performance × control — chosen per step, never globally.

4 / 4 — The Fix Shape

Fan out for isolation and independent parallel work. Stay inline wherever a clarification gate or a human question lives. A driver mediates between the two. This is more than a telemetry choice — it decides what the system can notice and what it can ask.

Family D · #13–15 of 18

Memory

D13

Context Dilution / The Monolith

Loading everything up front performs worse than loading nothing.

1 / 4 — The Wall

The intuition says: give the model all the context and it will use what it needs. The reality: attention spreads across everything you load, including the 90% that isn’t load-bearing for this request. Constraints stated early get buried under everything stated after.

2 / 4 — The Proof

My first skill was a 600-line monolithic file — every rule, every reference, loaded up front. By line 300 the agent had forgotten the constraints from line 50. It performed worse than having no skill at all.

3 / 4 — The Requirement

Context loads progressively: start small, navigate to detail on demand, stop at the shallowest layer that answers the question.

4 / 4 — The Fix Shape

Layered structure — a small entry point with dispatch logic, phase files that load when their phase fires, references that load when pointed to. The budget you’re actually managing is attention, not tokens.
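
The layering can be pictured as a dispatch table. In this sketch a dictionary stands in for files on disk, and every file name and phase name is hypothetical.

```python
# Hypothetical skill layout; a dict stands in for files on disk.
SKILL_FILES = {
    "entry.md": "Dispatch rules: which phase file to load for which request.",
    "phases/analysis.md": "Rules for the analysis phase.",
    "phases/report.md": "Rules for the report phase.",
    "references/glossary.md": "Reference material, loaded only when pointed to.",
}
DISPATCH = {"analysis": "phases/analysis.md", "report": "phases/report.md"}


def context_for(phase: str, references: tuple[str, ...] = ()) -> list[str]:
    """Return the paths to load for one phase: entry point, that phase, named references."""
    paths = ["entry.md", DISPATCH[phase], *references]
    missing = [p for p in paths if p not in SKILL_FILES]
    if missing:
        raise FileNotFoundError(f"unknown skill files: {missing}")
    return paths
```

The report phase's rules never enter the window while the analysis phase runs, which is the whole point: less loaded text competing for the same attention.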

D14

The Compaction Cliff

Long sessions summarize themselves. Summaries round. Downstream builds on the rounding.

1 / 4 — The Wall

Long agent sessions compact: older turns get summarized to free up context. Summaries preserve gist and destroy precision — and any later step that reads the summary instead of the original builds on rounded values, with nothing flagging that it happened.

2 / 4 — The Proof

Mid-run, phase 2 produced exact metrics — paired_students: 247. The conversation compacted between phases. Phase 3 read “about 250 paired students” from the summary and built the customer-facing artifacts on it. The report was off by single-digit percentages everywhere, for no visible reason.

3 / 4 — The Requirement

Persist load-bearing facts to disk the moment they exist. Files survive compaction; conversation prose does not.

4 / 4 — The Fix Shape

A checkpoint file written at every phase end, in structured YAML, so a value like paired_students: 247 is data, not a sentence a summarizer can soften into “about 250.” Narrative may be summarized; hard numbers are written to disk before they are ever presented. Downstream phases read the file, never the prior prose.
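
A sketch of the write and read halves. Note the substitution: the text above uses YAML, but this example uses JSON so it runs on the Python standard library alone. The file naming and function names are assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_checkpoint(run_dir, phase: str, facts: dict) -> Path:
    """Persist exact values as data the moment a phase ends."""
    path = Path(run_dir) / f"{phase}.checkpoint.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(facts, indent=2, sort_keys=True), encoding="utf-8")
    return path


def read_checkpoint(run_dir, phase: str) -> dict:
    """Downstream phases read the file, never the conversation."""
    path = Path(run_dir) / f"{phase}.checkpoint.json"
    return json.loads(path.read_text(encoding="utf-8"))


def round_trip_example() -> dict:
    """Write in 'phase 2', read in 'phase 3', using a throwaway directory."""
    with tempfile.TemporaryDirectory() as run_dir:
        write_checkpoint(run_dir, "phase2", {"paired_students": 247})
        return read_checkpoint(run_dir, "phase2")
```

Whatever happens to the conversation between the two calls, the value that comes back is the integer 247, not a sentence about it.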

D15

The Re-Learned Lesson

A lesson you can’t find is a lesson you re-learn. Even the system you built to prevent this.

1 / 4 — The Wall

Corrections made in conversation evaporate when the session ends; the same mistake comes back next week. The bigger cousin: even the meta-systems you build get forgotten, if they don’t live somewhere that loads automatically.

2 / 4 — The Proof

The same branch, path, and formatting mistakes recurred session after session until each became a written rule in a file the agent loads at startup. Then the sharper proof: I came back from my two-week honeymoon and couldn’t remember some of the finer details of how the “shared learnings & rules” system I had built four weeks earlier worked — which file fed which reviewer, where a lesson was supposed to live.

3 / 4 — The Requirement

Every correction becomes a rule in a known home that the right reader loads automatically — at write-time and at review-time.

4 / 4 — The Fix Shape

Route the lesson by who needs it: personal habits into per-user memory; repo standards into committed rule files mirrored into the AI reviewer’s config. One lesson, two readers — the writer never makes the mistake, and the reviewer catches it if it slips through anyway.

Family E · #16–18 of 18

Maturity

E16

Single-Model Blindness

A second pass from the same model is polish, not verification.

1 / 4 — The Wall

Every model is structurally blind to certain classes of its own mistakes — the same priors that produced the bug are the ones reviewing it. Re-reading your own work with the same eyes finds typos, not blind spots.

2 / 4 — The Proof

I shipped a redesign, self-reviewed the diff, and it looked clean. A different vendor’s model (Codex reviewing Opus’s work, in my case) then returned 9 findings — 7 were real bugs that would have gone to production. The disagreement between models was the signal.

3 / 4 — The Requirement

A second, different model reads every change before it ships — and you validate its findings against source, because it is confident when wrong, too.

4 / 4 — The Fix Shape

Cross-model review as a standing gate: one model writes, a different one reviews, a human adjudicates with the source open. Model diversity — different training, different priors — catches what repetition can’t.

E17

The Polite Fiction

A declared capability nothing verifies is a lie waiting to ship.

1 / 4 — The Wall

Systems accumulate declarations: “this component opts out of X,” “this skill adopts Y.” The moment a declaration isn’t machine-checked, it starts drifting from reality — copied forward, stale, comforting, false. The checkbox manufactures confidence in coverage that doesn’t exist.

2 / 4 — The Proof

My skills share a library of capabilities — telemetry, shared Python helpers, reference data files. Each skill declares which capabilities it adopts and which it opts out of, with a justification, so one canonical implementation serves many skills and a linter can check each declaration against reality.

The linter’s first version emitted an advisory note for every declared opt-out — which it never actually verified. Any skill could copy an opt-out clause from a neighbor, forget to update the wording, and pass forever. The fix made the check enforceable, with a test for exactly the copy-and-forget case.

3 / 4 — The Requirement

Every declaration is machine-checked against reality. An unverified promise is treated as false.

4 / 4 — The Fix Shape

The dial-and-machine pattern: a declaration you write (the dial) is only meaningful when automatic machinery reads it and fails when it doesn’t match reality (the machine). A dial with no machine is decoration.
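
A minimal version of the machine half, assuming the linter can already observe which capabilities a skill's code actually uses. The stance strings and the function name lint_declarations are hypothetical.

```python
def lint_declarations(declared: dict, observed: set) -> list[str]:
    """Compare what a skill declares with what its code was observed to use.

    declared maps capability name to "adopts" or "opts_out".
    observed is the set of capabilities actually used.
    Returns error strings; an empty list means every declaration matched reality.
    """
    errors = []
    for capability, stance in sorted(declared.items()):
        used = capability in observed
        if stance == "adopts" and not used:
            errors.append(f"{capability}: declared adopted but never used")
        elif stance == "opts_out" and used:
            # The copy-and-forget case: an opt-out clause that is no longer true.
            errors.append(f"{capability}: declared opt-out but in use")
        elif stance not in ("adopts", "opts_out"):
            errors.append(f"{capability}: unknown stance {stance!r}")
    for capability in sorted(observed - set(declared)):
        errors.append(f"{capability}: used but undeclared")
    return errors
```

The difference from an advisory note is that a non-empty result fails the build, so a stale declaration cannot pass indefinitely.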

E18

The Dead Safety Net

A gate built ahead of need rots into theater — and its presence implies false coverage.

1 / 4 — The Wall

Not every entry in this catalog means “add machinery.” Gates built before a failure proves them tend to go unused — wired up, never firing, silently trusted. Dormant machinery is worse than absent machinery, because everyone assumes it’s watching.

2 / 4 — The Proof

I built a baseline-drift mechanism to catch numbers shifting between runs. Across weeks of real runs: one baseline ever committed, zero drift comparisons performed. I deprecated it — and deleting it is what forced the better mechanism (binding results to live execution) to exist at all.

3 / 4 — The Requirement

Build the gate when a failure proves it’s needed. Delete dormant machinery — its presence implies a coverage that isn’t there.

4 / 4 — The Fix Shape

Keep an explicit ledger: what’s automated, what’s still prose, what’s a promotion target. Promote on proven pain. Retire on proven disuse. The ledger of what you haven’t hardened is as load-bearing as the hardening.

The evaluation loop is the spec

If requirements follow proof, then evaluation stops being a QA stage at the end and becomes part of the specification itself. In practice, that means every run — every district, every regenerated report — is held against explicit criteria:

Did every phase prove it actually ran — or did something quietly hollow out?
Did every number in the narrative trace back to a query executed during this run?
Did the parts sum to the whole at every join and rollup?
Did the system ask at the gates instead of guessing?
Did the telemetry itself capture reality at write time — or was the recorder dead and self-healed at the end of the run with confident guesswork?

A failure against this list is not just a bug to fix. It is evidence that the specification is incomplete — and the fix is a new permanent gate, not a patch. The misleading-sentence check from B7 is now a regression test every future change must pass. The hollow-phase check from C9 runs on every run, forever.
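To make one of these criteria concrete, here is a hedged Python sketch of the number-tracing check: every number in the narrative must match a value produced by a query in the current run. `QueryResult`, `untraced_numbers`, the run IDs, and the sample figures are hypothetical. A real check would also need to handle rounding, percentages, and numbers spelled out as words.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryResult:
    run_id: str
    query_name: str
    value: float


def untraced_numbers(narrative, results, run_id):
    """Return numbers in `narrative` that no query from this run produced."""
    # Only results from the current run count as evidence.
    produced = {r.value for r in results if r.run_id == run_id}
    # Naive extraction: digits with optional thousands separators and decimals.
    tokens = re.findall(r"\d[\d,]*(?:\.\d+)?", narrative)
    found = [float(t.replace(",", "")) for t in tokens]
    # Exact comparison keeps the sketch short; real data needs a tolerance.
    return [n for n in found if n not in produced]


# Illustrative data only.
results = [
    QueryResult("run-42", "active_students", 1204),
    QueryResult("run-42", "completion_rate_pct", 87.5),
    QueryResult("run-41", "active_students", 1350),  # stale: from an earlier run
]
narrative = (
    "This term 1,204 students were active, with an 87.5 percent "
    "completion rate, up from 1,350."
)
untraced = untraced_numbers(narrative, results, "run-42")
print(untraced)
```

The stale figure is flagged even though a query once produced it, because the evidence has to come from this run.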

The skill this builds

If the catalog has a punchline, it’s this: the work of building reliable AI systems is mostly not prompting. The frontier keeps moving up a layer — 2024 was the year of prompt engineering, 2025 the year of context engineering, 2026 the year of harness engineering — and the next layer up is loop engineering: running the agent against a goal on a schedule, evaluating the result, and feeding the improvement back in. Each layer is the prerequisite for the one above it.

The loop is the optimizer; the harness is the verifier it iterates against. A loop without a harness is the Lying Recorder (C10) generalized — confidently converging on garbage at machine speed. A harness without a loop is the Dead Safety Net (E18) — a gate no run flows through rots until someone deletes it. The clearest loop in my system: telemetry feeds a health dashboard, the worst-performing skill surfaces first, the lesson becomes a rule, and the rule loads into the next run.
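The "worst-performing skill surfaces first" step can be sketched in a few lines of Python. The telemetry shape (pairs of skill name and success flag) and the skill names are assumptions for illustration, not the author's schema.

```python
from collections import Counter


def worst_skill_first(telemetry):
    """Rank skills by failure rate, highest first.

    `telemetry` is an iterable of (skill_name, succeeded) pairs.
    """
    runs, fails = Counter(), Counter()
    for skill, succeeded in telemetry:
        runs[skill] += 1
        if not succeeded:
            fails[skill] += 1
    return sorted(runs, key=lambda s: fails[s] / runs[s], reverse=True)


# Illustrative telemetry only.
telemetry = [
    ("sql_rollup", False),
    ("sql_rollup", True),
    ("narrative_writer", True),
    ("narrative_writer", True),
    ("scope_check", False),
]
ranking = worst_skill_first(telemetry)
print(ranking)
```

Turning the top entry into a rule is still a judgment call; the ranking only decides where to look first.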

The day-to-day of harness work is defining trust boundaries: deciding, per check, whether a violation fails open or fails closed. A join that silently inflates a count stops the run cold; a possible scope leak continues with a visible warning that the human reviewer and the writing model both see before anything ships. That call, made constraint by constraint, defines what your customer can trust — which means it defines the product.
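A small Python sketch of that per-check decision. The names (`Policy`, `Check`, `RunHalted`, `apply_checks`) are hypothetical; the two example checks mirror the ones in the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    FAIL_CLOSED = "fail_closed"  # a violation stops the run
    FAIL_OPEN = "fail_open"      # a violation continues with a visible warning


class RunHalted(Exception):
    """Raised when a fail-closed check is violated."""


@dataclass(frozen=True)
class Check:
    name: str
    policy: Policy
    violated: bool


def apply_checks(checks):
    """Return warnings for fail-open violations; raise on fail-closed ones."""
    warnings = []
    for check in checks:
        if not check.violated:
            continue
        if check.policy is Policy.FAIL_CLOSED:
            raise RunHalted(f"{check.name}: violation, run stopped")
        warnings.append(f"WARNING: {check.name} (review before shipping)")
    return warnings


# A possible scope leak continues with a warning the reviewer will see.
warnings = apply_checks([Check("possible scope leak", Policy.FAIL_OPEN, True)])
print(warnings)

# A join that inflates a count stops the run cold.
halted = False
try:
    apply_checks([Check("join inflates count", Policy.FAIL_CLOSED, True)])
except RunHalted as exc:
    halted = True
    print(exc)
```

Keeping the policy as data on each check means the trust boundary is visible in one place and can be reviewed by someone who does not read the rest of the code.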

The competitive advantage is shifting from “can the agent do the task?” to “can the system prove the task was done correctly?” The second question is harder — and far more defensible. As models commoditize, raw execution stops differentiating anyone. Trust infrastructure doesn’t commoditize, because it can only be earned the way this catalog was: one proven failure at a time.

Who gets to design trust

Here is the shift hiding under everything above: the people building these systems are no longer just engineers. Everyone at a company now needs to be building their own agentic systems — sales, account management, support, product, marketing, ops. As Amjad Masad, CEO of Replit, put it: “The roles are collapsing. We have designers shipping code, engineers shipping code, and sales people shipping code. The particular skill is not the bottleneck anymore. It is how ambitious you are, how generative you are, how creative you are, and how good you are at utilizing these tools.”

And whoever builds one is no longer only defining features, workflows, and UX. They are defining acceptable uncertainty, evidence requirements, escalation policies, and risk allocation. Should the model decide autonomously? Should uncertain cases escalate to a human? Fail open or fail closed? Those are not purely engineering questions anymore — they define trust, liability, and customer confidence. You are no longer only designing experiences; you are designing how the system earns trust.

Engineering leadership changes underneath this shift, too: from managing the engineers who write code to solve the company’s problems to enabling the entire company to write code to solve their own — often better than a handoff could, because the person closest to the problem now has the agency to build — while engineering focuses on what genuinely needs them: security, scale, and the core product. Enablement is what makes a non-engineer’s work verifiable instead of forbidden. The org that shares trust-boundary decisions between product and engineering ships; the org that hoards them stalls — and the harness is what makes that sharing safe.

Bottom line

In the agent era, a product requirement can no longer just describe behavior. It has to define verification, evidence, containment, observability, and escalation — the product requirements document evolves from a feature document into a proof document. Because once agents can reason, retrieve, synthesize, and act, the real product question is no longer “what can the system do?” It becomes “what can the system reliably prove?”

Planning produced my intent: a company second brain for renewals. Proof produced everything that made it trustworthy enough to put in front of a customer. The evaluation loop becomes the infrastructure through which the product learns what reliability actually demands.

¹ Some places I’ve seen this emerging term include HUMAIN’s product team and Architecture of Proof.

No relation to proof-driven development, a formal-methods practice (2015) of machine-proving code against an upfront spec. That is upfront specification taken to its limit; PDR is what’s left when the limit can’t be reached.

Author: Mohannad Arbaji, ChalkTalk Founder

A note on punctuation: Every em-dash in this post is mine, not an AI’s. I founded an education company whose flagship products include SAT and ACT courses — with entire lessons on em-dashes, semicolons, and proper punctuation. I was using em-dashes long before AI made them suspicious, and I refuse to replace them with commas. A moment of silence for those who did.

Also published at medium.com/@moarbaji
