Reliable Agent Systems

The Goal Is Not Multi-Agent. The Goal Is Reliable Software

Extra agents only earn their place when they isolate a recurring failure mode.

Failure Mode

A lot of agent design gets stuck on the wrong question.

Should the system use one agent or five?
Do we need a planner, architect, evaluator, debugger, and reviewer?
Would another role make the workflow more serious?

That framing is usually a trap.

The goal is not multi-agent.
The goal is reliable software.

Role count is only interesting if it improves convergence.

That means every extra role should have to answer one question:

What failure mode does this isolate better than the simpler design?

If collapsing planning and implementation into one loop keeps producing badly scoped work, then separating them may help.
If the same loop keeps grading its own output too positively, then independent evaluation may help.
If repeated bug fixes circle the same wrong assumption, then a diagnosis step may help.

If none of those failures are happening, extra roles are overhead.
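One way to make that question concrete is to keep a log of observed failures and only admit a role once its target failure actually recurs. The sketch below is illustrative only: the failure labels, the role names, and the `min_occurrences` threshold are assumptions for the example, not a prescribed taxonomy.

```python
from collections import Counter

# Hypothetical mapping from a recurring failure to the role that isolates it.
# The three entries mirror the three examples in the text above.
ROLE_FOR_FAILURE = {
    "badly_scoped_work": "separate planner",
    "self_grading_too_positive": "independent evaluator",
    "repeated_wrong_assumption": "diagnosis step",
}


def justified_roles(failure_log, min_occurrences=3):
    """Return roles whose target failure appears at least min_occurrences times.

    failure_log is a list of failure labels recorded from past runs.
    The threshold of 3 is an arbitrary assumption; tune it to your own data.
    """
    counts = Counter(failure_log)
    return sorted(
        role
        for failure, role in ROLE_FOR_FAILURE.items()
        if counts[failure] >= min_occurrences
    )
```

With an empty log the function returns no roles at all, which is the default the argument implies: the simpler design stays until a failure shows up often enough to justify more.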

This is why I am skeptical of agent workflows that look like miniature org charts.

Control Surface

Reliable agent loop

The system only advances when artifacts pass an external check. Failed work routes into diagnosis, not more guessing.

Planner. Architect. Builder. Reviewer. QA. Deployer. Writer. Analyst.

Sometimes that structure is useful.
A lot of the time it is just human ceremony imported into an agent harness.

That import is rarely questioned hard enough.

Human teams are shaped by constraints agents do not share in the same way: scheduling, handoff cost, incentives, specialization, bounded individual context, and organizational politics.

Agents have different failure modes.

They fail through context instability, weak persistence, overconfident self-evaluation, shallow runtime verification, brittle tool use, and poor recovery loops.

So agent structure should be designed around those failures, not around the shape of a software org chart.

That is why I think the useful primitives are smaller and sharper than most multi-agent discourse suggests.

You need:
• a clear objective
• a workable view of the codebase or domain
• some method for decomposition
• verification against reality
• memory of local conventions and past failure
• a definition of done tied to behavior

How those are implemented can vary.
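As one possible illustration of the last item, a definition of done can be written as named behavioral checks rather than a prose claim that the work is finished. This is a minimal sketch: `slugify` is a stand-in for whatever code is under test, and `DONE_CHECKS` and `is_done` are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple


def slugify(title: str) -> str:
    """Stand-in for the code under test: lowercase words joined by hyphens."""
    return "-".join(title.lower().split())


# Each entry names an observable behavior and a check that exercises it.
DONE_CHECKS: Dict[str, Callable[[], bool]] = {
    "lowercases input": lambda: slugify("Hello") == "hello",
    "joins words with hyphens": lambda: slugify("Reliable Agent Systems")
    == "reliable-agent-systems",
    "collapses repeated whitespace": lambda: slugify("a   b") == "a-b",
}


def is_done(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every check; done means no check failed."""
    failed = [name for name, check in checks.items() if not check()]
    return (not failed, failed)
```

The list of failed check names doubles as input for a later diagnosis step, since it says which behavior is missing rather than just that something is wrong.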

What Ships

Delivery loop (pipeline)

task -> spec -> implement -> verify -> ship
                               |
                               +--> diagnose -> retry
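The loop above can be sketched as a single function with one external gate. This is a sketch under stated assumptions, not a reference harness: `run_loop`, `RunResult`, the callable parameters, and the `max_attempts` cap are all hypothetical names introduced here, and "ship" is reduced to returning a successful result.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunResult:
    shipped: bool
    attempts: int
    notes: List[str] = field(default_factory=list)  # diagnoses gathered on failure


def run_loop(task, write_spec, implement, verify, diagnose, max_attempts=3):
    """task -> spec -> implement -> verify -> ship, with diagnose -> retry on failure.

    verify(artifact) is the external check: it returns None on success or a
    failure description otherwise. implement(spec, notes) receives earlier
    diagnoses so a retry is informed by them instead of being another guess.
    """
    spec = write_spec(task)
    notes: List[str] = []
    for attempt in range(1, max_attempts + 1):
        artifact = implement(spec, notes)
        failure = verify(artifact)
        if failure is None:
            return RunResult(shipped=True, attempts=attempt, notes=notes)
        # Failed work routes into diagnosis before the next attempt.
        notes.append(diagnose(artifact, failure))
    return RunResult(shipped=False, attempts=max_attempts, notes=notes)
```

Nothing here requires separate agents. The same callables could be backed by one model, several, or plain scripts; the structure that matters is that `verify` sits outside `implement` and that retries are bounded.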

Sometimes one agent with strong gates is enough.
Sometimes a small number of roles helps because the responsibilities conflict.
Sometimes the most important distinction is not between agents at all, but between prompt and harness, or between implementation and runtime verification.

That is why I do not find "multi-agent" very interesting as a philosophy.

It is an implementation choice.
The real design problem is much narrower:

What is the minimum structure that makes this system dependable?

That question is harder and more useful.

It forces you to account for cost.

Every extra role increases coordination, token use, handoff surfaces, and the chance of stale context moving between steps. If the role does not remove a recurring failure more than it adds overhead, it should not exist.
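That trade-off can be written down as rough arithmetic. The sketch below is only a way to structure the estimate: `RoleEstimate`, its fields, and the choice of unit (for instance minutes of rework) are assumptions made for illustration, and the figures you feed it have to come from your own runs.

```python
from dataclasses import dataclass


@dataclass
class RoleEstimate:
    name: str
    failures_avoided_per_100_tasks: float  # recurring failures the role removes
    cost_per_failure: float                # e.g. minutes of rework per failure
    overhead_per_task: float               # coordination, tokens, handoffs (same unit)

    def net_benefit_per_100_tasks(self) -> float:
        saved = self.failures_avoided_per_100_tasks * self.cost_per_failure
        spent = 100 * self.overhead_per_task
        return saved - spent


def should_exist(role: RoleEstimate) -> bool:
    """A role stays only if it removes more cost than it adds."""
    return role.net_benefit_per_100_tasks() > 0
```

For example, a role that avoids 12 failures per 100 tasks at 30 units each, while adding 2 units of overhead per task, saves 360 and costs 200, so it passes. A role that avoids only 1 such failure at the same overhead does not.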

So my default posture is simple:

Add structure only after the failure earns it.

That rule produces smaller systems, but usually better ones.

Because the point is not to build a tiny autonomous company.
The point is to ship software that works.
