Reliable Agent Systems
What Building This Taught Me About Better Agents
The most important lessons were not about prompts. They were about structure, verification, memory, and the limits of self-evaluation.
Failure Mode
Building a coding-agent system changed my view of agents in a few ways.
Not because it made me believe in some grand framework. If anything, it made me more skeptical of broad claims. What it gave me was a sharper sense of where agent systems actually break, and what kinds of fixes genuinely improve them.
The first lesson is that prompts are weaker than they look.
A good prompt matters. It shapes behavior, tradeoffs, tone, and decomposition. But there is a big difference between a prompt that sounds rigorous and a system that is actually hard to fool.
The model will still skip steps.
It will still compress instructions.
It will still declare progress too early.
It will still subtly optimize for plausible completion.
That means the important rules cannot live only in prose.
Some of them have to become environment.
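One way to make a rule environmental rather than prosaic is to gate completion on artifacts instead of on the agent's claim. This is a minimal sketch, not the system described above; the artifact fields and messages are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Artifacts:
    """Evidence the agent must actually produce; field names are illustrative."""
    tests_ran: bool = False
    tests_passed: bool = False
    build_log: str = ""

def accept_done_claim(artifacts: Artifacts) -> tuple[bool, str]:
    """A rule encoded as an environment check, not as prompt prose.

    The agent's assertion of completion is ignored; only artifacts count.
    """
    if not artifacts.tests_ran:
        return False, "rejected: no test run recorded"
    if not artifacts.tests_passed:
        return False, "rejected: tests failed"
    if not artifacts.build_log:
        return False, "rejected: no build log produced"
    return True, "accepted"
```

The point is that the model can still skip steps in its narration, but the gate does not care what the narration says.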
The second lesson is that self-evaluation is unreliable.
This showed up everywhere. The implementor would think the feature was basically done. Then an independent evaluator would build the app, launch it, and immediately find something broken. Or the agent would write tests that technically passed but did not really validate the behavior the feature was supposed to guarantee.
This changed how I think about review.
A lot of current agent systems assume that stronger reasoning will eventually make self-grading good enough. Maybe it will improve, but I think the structural bias remains. A system is generally too close to its own local choices to judge them cleanly. Independent verification matters.
Control Surface
[Diagram: Reliable agent loop. The system only advances when artifacts pass an external check. Failed work routes into diagnosis, not more guessing.]
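The loop in the diagram can be sketched as a single stage that advances only when an independent check passes. The `implement`, `verify`, and `diagnose` callables are placeholders for whatever the real system plugs in; the key assumption is that `verify` is external to the implementor:

```python
from typing import Callable, Optional

def run_stage(
    implement: Callable[[str], str],
    verify: Callable[[str], bool],
    diagnose: Callable[[str], str],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Advance only when an external check passes; route failures to diagnosis."""
    guidance = task
    for _ in range(max_attempts):
        artifact = implement(guidance)
        if verify(artifact):           # external check, not self-grading
            return artifact            # ship
        guidance = diagnose(artifact)  # failed work becomes diagnostic input
    return None                        # surface the failure instead of guessing again
```

Returning `None` rather than the best guess is deliberate: an unresolved failure should be surfaced, not papered over.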
The third lesson is that runtime reality is where most of the truth lives.
Code can look coherent and still fail because of wiring, configuration, network assumptions, state mismatches, async behavior, UI integration, or platform quirks. That sounds obvious to anyone who has shipped software, but many agent systems still behave as if static code analysis plus unit tests were enough.
It usually is not.
The fourth lesson is that retries are not learning.
If a bug survives two or three attempts, the system often does not need more effort. It needs a different mental model. Repeating the same style of fix with slightly different code is not diagnosis. It is just motion.
That is why I ended up caring more about the debugger role than I expected. Not because every system needs a separate debugger, but because the pause between failure and another code change is often where the real improvement happens.
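That pause can be made mechanical. As a rough sketch (the threshold and the repeated-fix heuristic are assumptions, not the system's actual policy), a loop can be forced into diagnosis when a failure survives repeated attempts or when a proposed fix repeats an earlier one:

```python
def should_diagnose(attempts: int, proposed_fixes: list[str], threshold: int = 2) -> bool:
    """Decide whether to pause for diagnosis instead of retrying.

    Trigger diagnosis when the bug has survived `threshold` attempts,
    or when a fix repeats an earlier one -- motion, not diagnosis.
    """
    if attempts >= threshold:
        return True
    # A duplicate in the fix history means the system is guessing, not modeling.
    return len(proposed_fixes) != len(set(proposed_fixes))
```

The heuristic is crude, but it encodes the lesson: more effort is not the same as a different mental model.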
The fifth lesson is that memory matters, even in small amounts.
A repo-level knowledge file is not a grand memory architecture. But even lightweight carry-forward knowledge changes the experience. The next session starts with some local truth already available: how the system is structured, what the build commands are, which gotchas matter, what patterns worked before. That lowers the reorientation tax and makes repeated use feel more cumulative.
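The mechanism really is that small. A minimal sketch, assuming a hypothetical notes file in the repo root (the filename and format are illustrative):

```python
from pathlib import Path

KNOWLEDGE_FILE = "AGENT_NOTES.md"  # hypothetical filename

def load_knowledge(repo: Path) -> str:
    """Read carried-forward local truth at session start, if any exists."""
    path = repo / KNOWLEDGE_FILE
    return path.read_text() if path.exists() else ""

def record_lesson(repo: Path, lesson: str) -> None:
    """Append one lesson so the next session starts less blind."""
    path = repo / KNOWLEDGE_FILE
    with path.open("a") as f:
        f.write(f"- {lesson}\n")
```

An append-only list of build commands and gotchas is not a memory architecture, but it is exactly the kind of state that lowers the reorientation tax.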
The sixth lesson is that multi-agent is not the point.
I ended up with several agents because certain responsibilities conflicted when one loop tried to do everything. Planning, implementation, evaluation, and diagnosis were not failing for the same reason, so separating them helped. But I do not think the lesson is “everyone should build five-agent systems.”
What Ships
task -> spec -> implement -> verify -> ship
                               |
                               +--> diagnose -> retry

The lesson is narrower: separate responsibilities when the separation isolates a real failure mode.
That is a very different claim.
The broader conclusion I took away is that useful agents will probably be defined less by how autonomous they sound and more by how well their environment is designed.
By environment I mean:
• what state they can carry forward
• what they are forced to verify
• what they are allowed to assume
• how failures are surfaced
• how corrections are preserved
• what counts as done
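Those six levers can be written down as an explicit contract. This is only a sketch of the idea, with illustrative field names, not a schema the system actually uses:

```python
from dataclasses import dataclass, field

@dataclass
class AgentEnvironment:
    """The environment as an explicit contract; all names are illustrative."""
    carried_state: list[str] = field(default_factory=list)       # what persists across sessions
    required_checks: list[str] = field(default_factory=list)     # what must be verified
    allowed_assumptions: list[str] = field(default_factory=list) # what may go unchecked
    failure_channel: str = "report"                              # how failures are surfaced
    corrections_log: list[str] = field(default_factory=list)     # how corrections are preserved
    done_criteria: list[str] = field(default_factory=list)       # what counts as done

    def is_done(self, passed_checks: set[str]) -> bool:
        """Done is defined by the environment, not claimed by the agent."""
        return all(c in passed_checks for c in self.done_criteria)
```

Even as a toy, it makes the point concrete: "done" is a property of the environment's criteria, not of the agent's confidence.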
That is where a lot of the real leverage is.
The model matters, of course.
But once the model is capable enough, the important differentiator shifts.
Then the question becomes:
what kind of system is this model embedded in?
That is the question I keep coming back to.
Because the future of agents probably will not be decided by prompts alone.
It will be decided by whether we can build systems around them that accumulate useful state, verify themselves against reality, and become more dependable through use.
That still feels early.
But it also feels like the right direction.