Suraj Lab
Amazon software engineer building backend systems and side projects.

Reliable Agent Systems

Why Most Coding Agents Still Don’t Ship Working Software

Plausible code is local success. Shipping software requires design, verification, runtime checks, and carry-forward knowledge.

Failure Mode

Most coding agents still fail in a familiar way.

They produce a lot of plausible code.
The files look coherent.

The diff feels substantial.

Then the actual system breaks where real software usually breaks: at integration time, at runtime, or in the next session.

The API the UI depends on does not exist.
The app compiles but the main flow is broken.

The tests pass because they are narrow or fake.

The feature works once and then the next task repeats a repo mistake that was already corrected.

This is not mainly an intelligence problem.
It is a system design problem.

Coding agents are often optimized for code generation, not software delivery.

That is a deep mismatch.

Generating code is local.
Shipping software is end-to-end.

End-to-end means the system has to survive several kinds of reality:
• design constraints
• existing codebase rules
• build behavior
• runtime behavior
• user-facing flows
• project memory across sessions

Control Surface

Figure: Reliable agent loop. The system only advances when artifacts pass an external check. Failed work routes into diagnosis, not more guessing.

A model can be good at the local part, generating code, and weak at the rest of the delivery path.

This is why code review alone is such a weak truth source for agents.

Static code can look reasonable while the system fails in motion. Real software lives in actual execution: build wiring, state transitions, configuration edges, race conditions, browser behavior, network calls, and the mismatch between what the code suggests and what the app actually does.

That is why "the code looks right" is not a completion criterion.

If the project was not built, if the tests were not run, if the app was not launched, if the requested behavior was not exercised, and if the result was not checked against the original contract, then the system did not finish. It drafted.
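One way to make that completion criterion concrete is to record what was actually executed and refuse to call the work finished until every item is present. The sketch below is only an illustration: the `Evidence` class, its field names, and the `status` helper are hypothetical names chosen for this example, not part of any particular agent framework.

```python
from dataclasses import dataclass, fields


@dataclass
class Evidence:
    """Hypothetical record of what was actually executed for one task."""
    built: bool = False
    tests_run: bool = False
    app_launched: bool = False
    behavior_exercised: bool = False
    checked_against_contract: bool = False


def missing_checks(evidence: Evidence) -> list[str]:
    """Return the names of the checks that have not happened yet."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


def status(evidence: Evidence) -> str:
    """Report 'finished' only when every check ran; otherwise it is a draft."""
    return "finished" if not missing_checks(evidence) else "drafted"


if __name__ == "__main__":
    partial = Evidence(built=True, tests_run=True)
    print(status(partial), missing_checks(partial))
```

The point of the design is that "finished" is derived from recorded evidence rather than asserted by whoever wrote the code.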

Another failure mode is self-grading.

The same loop interprets the request, chooses the design, writes the code, and decides whether the result is acceptable. That structure creates predictable optimism. The system is too close to its own decisions to grade them cleanly.

Then there is continuity.

What Ships

Delivery loop (pipeline)
task -> spec -> implement -> verify -> ship
                               |
                               +--> diagnose -> retry
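The loop above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: `deliver` and its callback parameters (`write_spec`, `implement`, `verify`, `diagnose`) are hypothetical names, and `verify` is assumed to be an external check (a build, a test run, a launched app) rather than the implementer's own judgment.

```python
from typing import Any, Callable, Optional, Tuple


def deliver(
    task: str,
    write_spec: Callable[[str], Any],
    implement: Callable[[Any, Optional[str]], Any],
    verify: Callable[[Any, Any], Tuple[bool, str]],
    diagnose: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[Any]:
    """Run task -> spec -> implement -> verify -> ship, with diagnose -> retry.

    Returns the shipped artifact, or None if no attempt passed verification.
    """
    spec = write_spec(task)
    hint: Optional[str] = None  # diagnosis carried into the next attempt
    for _ in range(max_attempts):
        artifact = implement(spec, hint)
        passed, report = verify(artifact, spec)  # external check, not self-grading
        if passed:
            return artifact  # only verified work advances to "ship"
        hint = diagnose(report)  # failed work routes into diagnosis, not guessing
    return None
```

Capping the attempts is a deliberate choice in this sketch: a loop that cannot pass verification should stop and report, not keep emitting code.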

Even when an agent gets the feature over the line, the next session often restarts too cold. It re-explores the same files, forgets past constraints, and pays the same orientation tax again. That is not just inefficient. It also makes reliability harder because the system keeps re-opening mistakes it should have closed.
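A very small version of carry-forward knowledge is a notes file that each session reads before it starts and appends to when a mistake is corrected. The sketch below assumes a JSON file holding a list of strings; the function names `load_notes` and `remember` and the file layout are hypothetical, chosen only to illustrate the idea.

```python
import json
from pathlib import Path


def load_notes(path: Path) -> list[str]:
    """Read project notes left by earlier sessions (empty if none exist)."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def remember(path: Path, note: str) -> list[str]:
    """Append a constraint or corrected mistake, skipping exact duplicates."""
    notes = load_notes(path)
    if note not in notes:
        notes.append(note)
        path.write_text(json.dumps(notes, indent=2), encoding="utf-8")
    return notes
```

A session would call `load_notes` during orientation and `remember` whenever a repo-specific rule is learned, so the next task starts from what was already settled.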

So when I evaluate a coding agent, I care less about how quickly it can emit code and more about whether it closes the loop.

The meaningful questions are:
• did the build pass?
• did the required tests exist and pass?
• did the app run?
• did the requested behavior work under execution?
• did the system preserve project knowledge that makes the next task easier?

Those are much better measures than token speed or lines written.

The interesting problem is not whether a model can generate code. That part is already real.

The interesting problem is whether the model is embedded in a system that can reliably ship software.

That requires stronger feedback loops, clearer definitions of done, and some way to carry project-specific knowledge, such as known constraints and already-corrected mistakes, across sessions.

Without that, you get a lot of plausible code and not enough dependable delivery.