Suraj Lab: Backend systems, memory, and orchestration.

Reliable Agent Systems

Why Most Coding Agents Still Don’t Ship Working Software

Code generation is local. Shipping software is end-to-end. A lot of coding agents still confuse the two.


Failure Mode

Most coding agents still fail in the same boring way.

You ask them to build something. They generate code quickly. The code looks plausible. It may compile. A few tests pass. But the moment you actually run the system, the cracks show up.

• The UI does not behave the way the code implied it would.

• The agent invented an API that does not exist.

• It skipped over design decisions that mattered.

• It swallowed errors instead of surfacing them.

• And when you come back later to add a feature, it behaves like it has never seen the project before.

This failure mode is common because most coding agents are still optimized for code generation, not for software delivery.

Those are not the same thing.

Generating code is local.
Shipping software is end-to-end.

Shipping includes design, decomposition, integration, verification, runtime checks, debugging, and some durable memory of what was learned along the way. A model can be good at producing syntax while still being bad at producing working systems.

That gap matters more than most demos admit.

A lot of agent workflows still reward “looks right” over “works under execution.” That is the wrong optimization. If the system has not built the project, run the tests, launched the app, exercised the flows, and compared the result to the original contract, it has not really finished. It has only produced a plausible draft.

Control Surface

[Diagram: Reliable agent loop. The system only advances when artifacts pass an external check. Failed work routes into diagnosis, not more guessing.]

Part of the problem is that coding agents tend to collapse too many jobs into one loop.

The same system interprets the task, makes design decisions, writes the code, evaluates the code, and often decides for itself that the result is acceptable. That creates a predictable bias: it grades its own work too positively.

Another problem is weak verification.

A lot of “evaluation” is still basically code reading. The agent inspects the files, sees that the functions exist, notices a few tests, and declares success. But software does not live in static text. It lives at runtime. It lives in build behavior, wiring, state transitions, race conditions, network edges, UI flows, and actual execution.

A third problem is poor continuity.

Even when a coding agent succeeds once, the next session often starts too close to zero. It re-explores the codebase, forgets past decisions, reintroduces old mistakes, and pays the same tax repeatedly. The system may be smart in isolation and still fail to become meaningfully better over time.

This is why I think the next step for coding agents is not just better generation.
It is better systems around generation.

That means:
• separating planning from implementation where it helps

• treating build and runtime as first-class gates

• using independent review rather than self-grading

• routing repeated failures into diagnosis instead of more guessing

• persisting local project knowledge across sessions

None of that is especially glamorous.
But it is the difference between “the code looks reasonable” and “the software actually works.”

What Ships

Delivery loop pipeline:
task -> spec -> implement -> verify -> ship
                      |
                      +--> diagnose -> retry
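The pipeline above can be sketched as a loop, with `implement`, `verify`, and `diagnose` as stand-in callables rather than a real agent. The structure is the claim: failures feed a diagnosis into the next attempt instead of triggering blind regeneration, and the loop refuses to declare success on its own.

```python
from typing import Callable, Optional

def delivery_loop(
    implement: Callable[[str, Optional[str]], str],  # spec + optional diagnosis -> artifact
    verify: Callable[[str], Optional[str]],          # artifact -> failure report, or None if it passes
    diagnose: Callable[[str], str],                  # failure report -> diagnosis for next attempt
    spec: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """task -> spec -> implement -> verify -> ship, with failures routed into diagnosis."""
    diagnosis = None
    for _ in range(max_attempts):
        artifact = implement(spec, diagnosis)
        failure = verify(artifact)
        if failure is None:
            return artifact              # ship: passed the external check
        diagnosis = diagnose(failure)    # diagnose -> retry, not more guessing
    return None                          # out of attempts: do not claim success
```

Returning `None` on exhaustion matters: an honest "did not ship" beats a confident wrong answer.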

It also changes how success should be measured.

A good coding agent should not be judged mainly by how fast it produces files. It should be judged by things like:
• how often the project builds on first pass

• whether required tests actually exist and pass

• whether the runtime behavior matches the requested behavior

• whether UI flows work under execution

• whether the next feature is easier because the system retained useful project knowledge
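A minimal way to track outcome-based metrics like these, assuming you log each task's result; the field names and the two recorded signals are illustrative, not a complete scorecard.

```python
from dataclasses import dataclass

@dataclass
class DeliveryMetrics:
    """Track outcomes, not token speed: first-pass builds and runtime behavior matches."""
    tasks: int = 0
    first_pass_builds: int = 0
    runtime_matches: int = 0

    def record(self, built_first_try: bool, behavior_matched: bool) -> None:
        """Log one completed task's outcome."""
        self.tasks += 1
        self.first_pass_builds += built_first_try
        self.runtime_matches += behavior_matched

    def first_pass_rate(self) -> float:
        """Fraction of tasks where the project built on the first attempt."""
        return self.first_pass_builds / self.tasks if self.tasks else 0.0
```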

Those are much more meaningful metrics than token speed or the number of lines written.

I also think this is where a lot of confusion around coding agents comes from.

People see the model produce a large amount of plausible code and assume the hard part is solved. But real software delivery is not constrained mainly by typing speed. It is constrained by architecture, feedback loops, integration boundaries, and verification against reality.

That is why a coding agent that writes less code but closes the loop properly is often more valuable than one that writes a lot of code quickly.

The interesting part is not that a model can write code.
We already know that.

The interesting part is whether it can be part of a system that reliably ships.

That is a harder problem.
But it is also the one that matters.