Reliable Agent Systems
The Goal Is Not Multi-Agent. The Goal Is Reliable Software
The question is not whether your coding system has one agent or five. The question is whether each piece of structure isolates a real failure mode.
Failure Mode
A lot of discussion around coding agents gets stuck on the wrong question.
Single agent or multi-agent?
Planner or no planner?
Architect, reviewer, debugger, evaluator, orchestrator: how many roles should there be?
I do not think that is the right starting point.
The goal is not multi-agent.
The goal is reliable software.
That sounds obvious, but many systems drift toward role inflation because role names create a feeling of seriousness. The setup begins to resemble a miniature org chart: planner agent, architect agent, reviewer agent, test agent, doc agent, evaluator agent, deploy agent. Sometimes this helps. Often it just recreates human ceremony inside an agent harness.
That is usually a mistake.
The right question is not “how many agents should there be?”
It is “what failure mode am I isolating?”
If planning and implementation collapse into one loop and the system keeps making poorly scoped changes, maybe separation helps.
If self-review is too positive, maybe an independent evaluator helps.
If repeated bug fixes keep circling the same wrong assumption, maybe a debugger role helps.
If none of those problems are present, then extra roles are just overhead.
That is the distinction that matters.
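One way to make that distinction concrete is to key every role to the failure mode it isolates, and to add a role only when that failure has actually been observed. The sketch below is illustrative only: the enum members paraphrase the three examples above, and the role names and the `roles_to_add` helper are hypothetical, not part of any real framework.

```python
from enum import Enum, auto


class FailureMode(Enum):
    """Failure modes taken from the three examples in the text."""
    POORLY_SCOPED_CHANGES = auto()  # planning and implementation collapsed into one loop
    GENEROUS_SELF_REVIEW = auto()   # self-review is too positive
    CIRCULAR_BUG_FIXES = auto()     # fixes keep circling the same wrong assumption


# Hypothetical role names. Each role exists only to isolate one failure mode.
ROLE_FOR_FAILURE = {
    FailureMode.POORLY_SCOPED_CHANGES: "planner",
    FailureMode.GENEROUS_SELF_REVIEW: "independent_evaluator",
    FailureMode.CIRCULAR_BUG_FIXES: "debugger",
}


def roles_to_add(observed: set[FailureMode]) -> list[str]:
    """Return only the roles justified by failures that were actually observed."""
    return sorted(ROLE_FOR_FAILURE[mode] for mode in observed)
```

With no observed failures, `roles_to_add(set())` returns an empty list: no problem, no extra role.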
Control Surface
Diagram: Reliable agent loop
The system only advances when artifacts pass an external check. Failed work routes into diagnosis, not more guessing.
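The loop in that caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive harness: `implement`, `verify`, and `diagnose` are hypothetical callables supplied by the caller, `CheckResult` is an invented type, and the retry cap of three is an arbitrary default.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    passed: bool
    log: str = ""


def run_loop(
    implement: Callable[[str, Optional[str]], str],
    verify: Callable[[str], CheckResult],
    diagnose: Callable[[str, CheckResult], str],
    spec: str,
    max_attempts: int = 3,  # arbitrary cap, chosen for illustration
) -> Optional[str]:
    """Return an artifact that passed the external check, or None if none did."""
    diagnosis: Optional[str] = None
    for _ in range(max_attempts):
        artifact = implement(spec, diagnosis)
        result = verify(artifact)  # external check, not the agent's own opinion
        if result.passed:
            return artifact  # only verified work advances
        # Failed work is diagnosed before the next attempt, so the retry
        # starts from an explanation of the failure rather than a fresh guess.
        diagnosis = diagnose(artifact, result)
    return None
```

The design choice worth noticing is that `verify` is passed in from outside. The component that produces the artifact never decides whether it is good enough.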
I am not against multi-agent systems.
I am against adding roles before earning them.
This matters because a lot of current coding-agent design seems inherited from human organizations. We borrow the shape of teams because it feels familiar: product manager, architect, engineer, reviewer, QA, on-call. But human organizations have constraints that agents do not share in the same way. Human teams are shaped by scheduling, incentives, communication gaps, specialization, and bounded individual context.
Coding agents have different failure modes.
They fail because of context instability, weak persistence, overconfident reasoning, shallow verification, poor tool use, and brittle adaptation. The structure should be designed around those failure modes, not around a desire to simulate a company.
That is why I think the useful primitives are smaller and sharper than most agent workflows suggest.
A serious coding system needs at least:
• a clear objective
• a workable model of the codebase
• a way to decompose tasks
• verification against reality
• memory of local conventions and past mistakes
• a definition of done tied to behavior, not rhetoric
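As a rough sketch of the first and last items in that list, a task can carry its objective together with executable checks, so that "done" is something the harness observes rather than something the agent asserts. The `Task` class and its field names are hypothetical, and the other primitives (codebase model, decomposition, memory) are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    objective: str
    # Each check is a named predicate that observes real behavior,
    # for example by running a test or calling the built artifact.
    done_checks: dict[str, Callable[[], bool]] = field(default_factory=dict)

    def failing_checks(self) -> list[str]:
        """Names of the checks that do not currently pass."""
        return [name for name, check in self.done_checks.items() if not check()]

    def is_done(self) -> bool:
        # A task with no checks is never done: a claim alone is not evidence.
        return bool(self.done_checks) and not self.failing_checks()
```

For example, `Task("add retry to client").is_done()` is `False` until at least one behavioral check is attached and passes.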
How those primitives are implemented can vary.
Sometimes one agent with strong verification is enough.
Sometimes several agents with clear boundaries work better.
Sometimes the most important distinction is not between agents, but between the prompt and the harness. Sometimes the bottleneck is not planning but runtime testing. Sometimes memory matters more than decomposition.
That is why “multi-agent” is not a philosophy I find very interesting on its own.
It is just one possible implementation choice.
What Ships
task -> spec -> implement -> verify -> ship
                             |
                             +--> diagnose -> retry

What matters is whether the structure increases reliable convergence.
Does it reduce hidden failure?
Does it improve verification?
Does it help the next session start smarter instead of colder?
Does it make debugging more precise?
Does it stop the system from grading its own work too generously?
If the answer is yes, the structure earned its place.
If the answer is no, it is probably theater.
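One hedged way to turn that yes-or-no into something checkable is to compare how often tasks pass external verification with and without the extra structure. The function names and the `min_gain` threshold below are assumptions made for illustration, and a real comparison would also need enough tasks to rule out noise.

```python
def convergence_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks whose final artifact passed the external check."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


def earned_its_place(
    baseline: list[bool],
    with_structure: list[bool],
    min_gain: float = 0.05,  # arbitrary threshold, an assumption for this sketch
) -> bool:
    """True if the added structure raised the pass rate by at least min_gain."""
    gain = convergence_rate(with_structure) - convergence_rate(baseline)
    return gain >= min_gain
```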
I think that is the lens more people should use.
Not single-agent versus multi-agent.
Not simple versus sophisticated.
More like:
what is the minimum structure that makes this system dependable?
That is the real design problem.
And in practice, the answer is often less about the number of agents and more about whether the system has clear boundaries, hard feedback loops, and memory that actually carries forward.
The goal is not to build a tiny autonomous company.
The goal is to ship reliable software.
Everything else is secondary.