Proof-Driven Requirements: The New Agile for Building AI Systems

Building an AI system that writes customer-facing data reports taught me the requirements that matter most can’t be written in advance. Here are eighteen ways an agent looks right and is wrong — and how to fix each one.

This post is a war story.

Over the last few weeks, I built and operated an agentic AI system that does something unforgiving: it takes a school district’s usage and assessment data and writes the activity-and-impact report an account manager puts in front of a senior K-12 administrator. The report has to be safe, compliant, and right. A wrong number isn’t a bug; it’s a credibility breach at the exact moment we ask a customer to renew.

I am a founder. Two of my degrees are in Electrical and Computer Engineering, but I haven’t written code professionally since my first corporate job fifteen years ago. I came to agentic engineering because of the roughly 50x increase in agency I experienced the first time I worked this way: things I would have had to brief, queue, and wait for, I could now just build, production-grade.

I built this system directly using frontier LLMs, starting from a single top-level business objective: “build a company second brain for renewals.” In plain terms: one system that pulls together everything we know about a customer — CRM, support tickets, product usage, meeting notes — and turns it into the report and the story for that customer’s renewal.

I planned exhaustively: long, structured brainstorming sessions that became specs, specs that became detailed implementation plans, plans I executed against. That is how the business logic was designed: what a renewal story is made of, which analyses exist, what the report should say. But no amount of planning produced the other document — the one that makes the output of the system I built trustworthy. That document exists now: dozens of hard rules, enforced by code and by multiple AI agents that review each other’s work — and almost none of them came out of planning. Each one was paid for by a failure I hit on a real run.

That experience has a name now.

What changed: requirements used to come first

Traditional software is deterministic. The same input produces the same output, which is what made it possible to specify behavior before building it. Agile already taught us not to specify everything upfront: requirements emerge through iteration, and Proof-Driven Requirements (PDR), the approach this post describes, inherits that loop wholesale. But Agile kept one assumption so safe it never needed stating: once a behavior passes its test, it stays passed. A failure announces itself.

AI systems break that assumption. The same prompt can produce different answers depending on context, phrasing, or what got loaded into the window. And the failures don’t look like failures. Nothing crashes. No error fires. The system produces a plausible, confident answer that happens to be wrong — and it looks identical to a correct one.

A wrong interpretation isn’t a failed test; it can be a trust breach, a governance gap, or a compliance failure. That is what makes upfront specification structurally insufficient, not just inconvenient. I’ve seen the alternative called Proof-Driven Requirements (PDR),[1] and the name fits: in an AI system, the most important requirements cannot be written before execution. They are discovered through it.

The sequence I was taught

Define the business objective → Write the full product requirements doc → Build → Test → Ship

The sequence that actually worked

Define the business objective → Brainstorm + write the plan → Build a working flow → Run it on real data → Turn each failure into a requirement → Refine and repeat

Notice what the second sequence keeps: the objective still comes first, and so does real planning — PDR is not winging it. Build, run, learn, repeat is still Agile’s loop. What changes is how the behavioral layer gets defined: by proof, not by prediction — because passing once proves almost nothing about passing next time. Naming the pattern is the easy half. The hard half is what you do with a requirement once a failure proves it — that’s what this post is about.

Agile (built for deterministic code) vs. Proof-driven (AI systems)

Behavior is
  • Agile: Fully specifiable upfront
  • Proof-driven: Discovered through execution

A failure is
  • Agile: A bug to fix
  • Proof-driven: Evidence the spec is incomplete

Tests prove
  • Agile: The feature works
  • Proof-driven: The failure can’t recur; each one became a permanent gate

The requirement doc is
  • Agile: A feature description
  • Proof-driven: A growing set of verified behavioral constraints

Done means
  • Agile: Shipped
  • Proof-driven: Continuously proving reliability, run after run

The one idea that organizes everything

Looking back across everything this system forced me to fix, one move kept repeating — not the only fix, but the one that did the most work each time:

Everything starts as prose. You turn the deterministic parts into code. The prose that remains is true LLM judgment — and that isn’t a compromise. It’s the superpower.

Concretely: an agent skill usually begins life as one long prose file — prompt-based instructions the model reads and follows. But look closely: most of those instructions are work that should happen the same way every single time — which query runs next, what shape a record takes, where a file gets written, whether the parts sum to the whole. Written as prose, that is a program in English executed by a probabilistic reader — an instruction the model chooses to follow, and on a long enough run, eventually doesn’t. So you turn it into code.
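
A minimal sketch of that conversion, under stated assumptions: the step names, the `State` dictionary, and the sample numbers are all hypothetical and stand in for whatever the real skill does. The point it illustrates is only that the order of work lives in a list the interpreter walks, while the model is handed one narrowly scoped job at the end.

```python
from typing import Callable

State = dict

def load_usage(state: State) -> State:
    # Stand-in for a real query; the numbers are made up for illustration.
    state["usage"] = {"active_students": 120, "enrolled_students": 150}
    return state

def compute_activation_rate(state: State) -> State:
    usage = state["usage"]
    state["activation_rate"] = usage["active_students"] / usage["enrolled_students"]
    return state

# The order of work lives in code, so no reader can reorder or skip a step.
DETERMINISTIC_STEPS: list[Callable[[State], State]] = [
    load_usage,
    compute_activation_rate,
]

def run(write_narrative: Callable[[State], str]) -> State:
    state: State = {}
    for step in DETERMINISTIC_STEPS:
        state = step(state)
    # The only call left to model judgment: turning computed facts into prose.
    state["narrative"] = write_narrative(state)
    return state
```

In use, `write_narrative` would wrap a model call; for a dry run, any function that takes the state and returns a string will do, for example `run(lambda s: f"{s['activation_rate']:.0%} active")`.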

What you deliberately leave as prose is the true LLM judgment: synthesizing data into a story, writing the narrative, choosing which analyses to run for this customer, deciding whether an anomaly is signal or noise. This is where the model’s non-determinism stops being a liability and becomes the point — and that work is the reason the system is worth building at all.

That conversion takes you most of the way. But it sets up the harder question, the one the rest of this post answers: how do you trust a system where some steps — and some hand-offs between steps — still run on a model reading prose? You build robust tracking and watch where it breaks.

The catalog of failures that wrote the spec

While building our renewal storytelling engine, I encountered failure modes that no planning session had surfaced. Below is a sample of key issues — eighteen, grouped into five families. Each one forced an architectural change. Each one produced a requirement that was proven into existence, not planned.

Every failure mode steps through four cards:

  • The Wall — what failed, and why it looked fine
  • The Proof — the actual run where I hit it
  • The Requirement — the constraint it forced into existence
  • The Fix Shape — the portable pattern, not my specific tool

The names are deliberately general. The examples are mine, but I expect any team building multi-step agentic workflows will encounter most of these too.

Family A · #1–3 of 18

Drift

A1

Semantic Drift under Ambiguity

An instruction that sounds specific has several legitimate readings — and the model picks one silently.

A2

Semantic Drift under Clarity

The companion nobody warns you about: a perfectly clear spec still drifts on a long run.

A3

Schema Improvisation

Asked for a fixed shape, the model invents a slightly better one.
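
One possible guard against this, sketched under assumptions: the field names in `RECORD_FIELDS` are hypothetical, since the post does not show the real record shape. The idea is to reject any output whose keys differ from the fixed shape at all, so an "improved" field fails loudly instead of flowing downstream.

```python
import json

# Hypothetical record shape; the post does not give the real one.
RECORD_FIELDS = {"district_id": str, "metric": str, "value": float}

def parse_record(raw: str) -> dict:
    """Accept the model's JSON only if it matches the fixed shape exactly."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    extra = sorted(set(record) - set(RECORD_FIELDS))
    missing = sorted(set(RECORD_FIELDS) - set(record))
    if extra or missing:
        raise ValueError(f"schema mismatch: extra={extra}, missing={missing}")
    for name, expected_type in RECORD_FIELDS.items():
        if not isinstance(record[name], expected_type):
            raise ValueError(f"{name} must be {expected_type.__name__}")
    return record
```

This sketch is deliberately strict: an integer where a float is declared is also rejected. Whether that strictness is right is a per-field decision.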

Family B · #4–7 of 18

Trust

B4

The Confident Fabrication

A number nobody computed, stated fluently.

B5

Evidence Surface Inconsistency

The story and the evidence beside it disagree — or no one can see whether they agree.

B6

The Sum That Doesn’t Add Up

Individually correct queries combine into nonsense. The join is where correctness dies.
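
As an illustration only, here is one way such a join could be guarded in plain Python. The function names and the row format (lists of dictionaries) are assumptions, not the system described in this post. The guard refuses a join whose right side repeats a key, because that is the case where left-side rows get duplicated and counts silently inflate; a second helper checks that the parts still sum to the whole afterward.

```python
from collections import Counter

def guarded_join(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Inner-join two row lists, refusing any join that would duplicate left rows."""
    counts = Counter(row[key] for row in right)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"join on {key!r} would fan out for keys {duplicates}")
    lookup = {row[key]: row for row in right}
    return [{**l, **lookup[l[key]]} for l in left if l[key] in lookup]

def assert_sums_to(rows: list[dict], field: str, whole: int) -> None:
    """Raise if the per-row values no longer add up to the known total."""
    total = sum(row[field] for row in rows)
    if total != whole:
        raise ValueError(f"{field} sums to {total}, expected {whole}")
```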

B7

The Correct Number, Wrong Reading

The data is right. The sentence makes the reader compute something false.

Family C · #8–12 of 18

Orchestration

C8

The Hidden Failure Cascade

No layer proved correctness — and the deeper the orchestration, the harder to localize.

C9

The Silent Skip / Phase Hollowing

A step quietly dropped — output still looks complete. Success by appearance.
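
A hedged sketch of a hollow-phase check: it assumes each phase writes a small receipt when it finishes, and the phase names and the `items_out` field are hypothetical. A phase counts as hollow if it left no receipt or reported producing nothing.

```python
# Hypothetical phase names; substitute whatever the workflow actually declares.
EXPECTED_PHASES = ("extract", "analyze", "write", "review")

def hollow_phases(receipts: list[dict]) -> list[str]:
    """Return every expected phase that left no receipt or produced nothing."""
    by_phase = {r["phase"]: r for r in receipts}
    return [
        phase
        for phase in EXPECTED_PHASES
        if phase not in by_phase or by_phase[phase].get("items_out", 0) == 0
    ]
```

An empty list means every expected phase left evidence of work; anything else names the phases to investigate before the output is trusted.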

C10

The Lying Recorder

The telemetry itself silently fails. The run looks clean because nothing was watching.

C11

The Courier Tax

Paying a language model to be a for-loop.

C12

The Orchestration-Shape Tradeoff

Inline drifts. Full fan-out goes blind. The shape is a per-step decision.

Family D · #13–15 of 18

Memory

D13

Context Dilution / The Monolith

Loading everything up front performs worse than loading nothing.

D14

The Compaction Cliff

Long sessions summarize themselves. Summaries round. Downstream builds on the rounding.

D15

The Re-Learned Lesson

A lesson you can’t find is a lesson you re-learn. Even the system you built to prevent this.

Family E · #16–18 of 18

Maturity

E16

Single-Model Blindness

A second pass from the same model is polish, not verification.

E17

The Polite Fiction

A declared capability nothing verifies is a lie waiting to ship.

E18

The Dead Safety Net

A gate built ahead of need rots into theater — and its presence implies false coverage.

The evaluation loop is the spec

If requirements follow proof, then evaluation stops being a QA stage at the end and becomes part of the specification itself. In practice, that means every run — every district, every regenerated report — is held against explicit criteria:

  • Did every phase prove it actually ran — or did something quietly hollow out?
  • Did every number in the narrative trace back to a query executed during this run?
  • Did the parts sum to the whole at every join and rollup?
  • Did the system ask at the gates instead of guessing?
  • Did the telemetry itself capture reality at write time — or was the recorder dead and self-healed at the end of the run with confident guesswork?
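
The second criterion in the list above can be approximated mechanically. The sketch below is an assumption-laden simplification, not the check this system runs: it pulls plain numeric tokens out of the narrative and flags any that match no value returned by a query in the same run. It ignores thousands separators, spelled-out numbers, and figures legitimately derived from two query results, all of which a real check would need to handle.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def untraced_numbers(narrative: str, query_results: dict[str, float]) -> list[str]:
    """Return numeric tokens in the narrative that match no value computed this run."""
    known = {float(v) for v in query_results.values()}
    return [tok for tok in NUMBER.findall(narrative) if float(tok) not in known]
```

For example, with query results `{"usage_pct": 82, "school_count": 14}`, a sentence claiming usage is "up from 75%" would have `75` flagged, because no query in the run produced it.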

A failure against this list is not just a bug to fix. It is evidence that the specification is incomplete — and the fix is a new permanent gate, not a patch. The misleading-sentence check from B7 is now a regression test every future change must pass. The hollow-phase check from C9 runs on every run, forever.

The skill this builds

If the catalog has a punchline, it’s this: the work of building reliable AI systems is mostly not prompting. The frontier keeps moving up a layer — 2024 was the year of prompt engineering, 2025 the year of context engineering, 2026 the year of harness engineering — and the next layer up is loop engineering: running the agent against a goal on a schedule, evaluating the result, and feeding the improvement back in. Each layer is the prerequisite for the one above it.

The loop is the optimizer; the harness is the verifier it iterates against. A loop without a harness is the Lying Recorder (C10) generalized — confidently converging on garbage at machine speed. A harness without a loop is the Dead Safety Net (E18) — a gate no run flows through rots until someone deletes it. The clearest loop in my system: telemetry feeds a health dashboard, the worst-performing skill surfaces first, the lesson becomes a rule, and the rule loads into the next run.
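
The "worst-performing skill surfaces first" step can be sketched as a simple ranking over run telemetry. The record format here (`skill` and `passed` fields) is an assumption for illustration; the post does not describe the real telemetry schema.

```python
from collections import defaultdict

def rank_skills_by_failure_rate(runs: list[dict]) -> list[tuple[str, float]]:
    """Return (skill, failure_rate) pairs, worst first; ties break alphabetically."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for run in runs:
        totals[run["skill"]] += 1
        if not run["passed"]:
            failures[run["skill"]] += 1
    rates = [(skill, failures[skill] / totals[skill]) for skill in totals]
    return sorted(rates, key=lambda pair: (-pair[1], pair[0]))
```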

The day-to-day of harness work is defining trust boundaries: deciding, per check, whether a violation fails open or fails closed. A join that silently inflates a count stops the run cold; a possible scope leak continues with a visible warning that the human reviewer and the writing model both see before anything ships. That call, made constraint by constraint, defines what your customer can trust — which means it defines the product.
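
One way to make that per-check decision explicit is to attach the policy to the check itself. This is a sketch under assumptions: the class names, `RunHalted`, and the check names in the usage note are hypothetical. Fail-closed violations stop the run by raising; fail-open violations are collected as warnings for the reviewer and the writing model to see.

```python
from dataclasses import dataclass
from enum import Enum

class OnViolation(Enum):
    FAIL_CLOSED = "stop the run"
    FAIL_OPEN = "continue with a visible warning"

@dataclass
class Check:
    name: str
    passed: bool
    on_violation: OnViolation

class RunHalted(Exception):
    """Raised when a fail-closed check is violated."""

def enforce(checks: list[Check]) -> list[str]:
    """Halt on the first fail-closed violation; return warnings for fail-open ones."""
    warnings: list[str] = []
    for check in checks:
        if check.passed:
            continue
        if check.on_violation is OnViolation.FAIL_CLOSED:
            raise RunHalted(check.name)
        warnings.append(f"WARNING: {check.name}")
    return warnings
```

With this shape, a violated `Check("join inflates count", False, OnViolation.FAIL_CLOSED)` raises `RunHalted`, while a violated `Check("possible scope leak", False, OnViolation.FAIL_OPEN)` comes back as a warning string and the run continues.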

The competitive advantage is shifting from “can the agent do the task?” to “can the system prove the task was done correctly?” The second question is harder — and far more defensible. As models commoditize, raw execution stops differentiating anyone. Trust infrastructure doesn’t commoditize, because it can only be earned the way this catalog was: one proven failure at a time.

Who gets to design trust

Here is the shift hiding under everything above: the people building these systems are no longer just engineers. Everyone at a company now needs to be building their own agentic systems — sales, account management, support, product, marketing, ops. As Amjad Masad, CEO of Replit, put it: “The roles are collapsing. We have designers shipping code, engineers shipping code, and sales people shipping code. The particular skill is not the bottleneck anymore. It is how ambitious you are, how generative you are, how creative you are, and how good you are at utilizing these tools.”

And whoever builds one is no longer only defining features, workflows, and UX. They are defining acceptable uncertainty, evidence requirements, escalation policies, and risk allocation. Should the model decide autonomously? Should uncertain cases escalate to a human? Fail open or fail closed? Those are not purely engineering questions anymore — they define trust, liability, and customer confidence. You are no longer only designing experiences; you are designing how the system earns trust.

Engineering leadership changes underneath this shift, too: from managing the engineers who write code to solve the company’s problems to enabling the entire company to write code to solve its own. That often works better than a handoff could, because the person closest to the problem now has the agency to build. Engineering, meanwhile, focuses on what genuinely needs it: security, scale, and the core product. Enablement is what makes a non-engineer’s work verifiable instead of forbidden. The org that shares trust-boundary decisions between product and engineering ships; the org that hoards them stalls. The harness is what makes that sharing safe.

Bottom line

In the agent era, a product requirement can no longer just describe behavior. It has to define verification, evidence, containment, observability, and escalation — the product requirements document evolves from a feature document into a proof document. Because once agents can reason, retrieve, synthesize, and act, the real product question is no longer “what can the system do?” It becomes “what can the system reliably prove?”

Planning produced my intent: a company second brain for renewals. Proof produced everything that made it trustworthy enough to put in front of a customer. The evaluation loop becomes the infrastructure through which the product learns what reliability actually demands.

[1] Some places I’ve seen this emerging term include HUMAIN’s product team and Architecture of Proof.

No relation to proof-driven development, a formal-methods practice (2015) of machine-proving code against an upfront spec. That is upfront specification taken to its limit; PDR is what’s left when the limit can’t be reached.

Author: Mohannad Arbaji, ChalkTalk Founder

A note on punctuation: Every em-dash in this post is mine, not an AI’s. I founded an education company whose flagship products include SAT and ACT courses — with entire lessons on em-dashes, semicolons, and proper punctuation. I was using em-dashes long before AI made them suspicious, and I refuse to replace them with commas. A moment of silence for those who did.

Also published at medium.com/@moarbaji