Suraj Lab · Backend systems, memory, and orchestration.

Reliable Agent Systems

Building a Multi-Agent Coding System That Actually Ships

A case study in separating roles, adding deterministic gates, forcing runtime verification, and building just enough memory for the next session to start smarter.

5 min read · 953 words

Failure Mode

I built a Claude Code plugin recently called coding-agent.

The goal was narrow: make coding agents produce software that actually works when you run it.

That sounds like a low bar, but it is still where many systems fail. They generate plausible code, maybe pass a few tests, then fall apart at runtime. They skip design, improvise library usage, ignore integration boundaries, and rarely build enough project memory to make the next session better than the last one.

So I built a system around a different assumption:

A coding agent should not just be a model with a large prompt.
It should operate inside a pipeline with explicit roles, hard gates, and verification against reality.

The final system ended up with five agents.

The orchestrator runs the state machine. It reads the task, classifies its size, dispatches the right subagent, and advances or retries stages based on deterministic checks. It mostly does not write code.

The architect handles discovery, research, specs, and plans. It talks to the user, reads the repo, checks current docs, identifies risks, and turns vague requests into something implementable.

The implementor writes the code. It owns the actual change, ideally with tests first, and uses specialist skills depending on whether the task touches frontend, backend, mobile, infra, or data.

The evaluator is an independent reviewer. It builds the project, runs the tests, launches the app, checks runtime behavior, reviews code quality, and writes an explicit PASS or FAIL review.

The debugger only appears when a bug survives repeated fix attempts. Its job is not to patch symptoms. It traces the failure path, checks assumptions, and writes a diagnosis before more code is changed.
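The five roles above can be sketched as a stage-ownership table. This is an illustrative sketch, not the plugin's actual API; the stage names and the `Role` enum are assumptions:

```python
from enum import Enum, auto

class Role(Enum):
    ORCHESTRATOR = auto()
    ARCHITECT = auto()
    IMPLEMENTOR = auto()
    EVALUATOR = auto()
    DEBUGGER = auto()

# Which role owns each pipeline stage. The orchestrator runs the
# state machine itself, so it deliberately owns no stage here.
STAGE_OWNER = {
    "discovery": Role.ARCHITECT,
    "spec": Role.ARCHITECT,
    "implement": Role.IMPLEMENTOR,
    "review": Role.EVALUATOR,
    "diagnose": Role.DEBUGGER,  # entered only after repeated failed fixes
}
```

Keeping the orchestrator out of the table is the point: it advances stages, it does not perform them.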

That architecture did not come from theory. It came from failure.

Control Surface

[Diagram: reliable agent loop]

The system only advances when artifacts pass an external check. Failed work routes into diagnosis, not more guessing.

The first version used a deeper hierarchy with many more agents. It failed because the environment only allowed one level of subagent dispatch. The next version was flatter, but the orchestrator kept doing too much itself. Even when told not to write code, it would gradually drift into editing files directly after the first feature shipped. It had momentum.

The fix was not stronger wording. It was task-size classification.

If a task is truly tiny, the orchestrator can handle it. But if it touches more than a couple of files or requires meaningful new logic, it must dispatch. That boundary removed a lot of hidden quality regression because it stopped the orchestrator from quietly bypassing the pipeline.
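That size gate can be expressed as a tiny predicate. The threshold value is my reading of "more than a couple of files", not a number the plugin documents:

```python
# Hypothetical size gate: tiny tasks may be handled inline by the
# orchestrator; anything larger must be dispatched to a subagent.
TINY_FILE_LIMIT = 2  # "more than a couple of files" forces dispatch

def must_dispatch(files_touched: int, needs_new_logic: bool) -> bool:
    return files_touched > TINY_FILE_LIMIT or needs_new_logic
```

Because the check is explicit, the orchestrator cannot quietly reclassify medium tasks as tiny once it has momentum.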

The next important design choice was building in waves.

Many AI-built apps fail because the system moves through the stack in big horizontal passes: first schema, then backend, then frontend, then tests. By the time it reaches the UI, it has forgotten important constraints from earlier layers. So I decomposed work into vertical slices.

Wave 1 is always foundation: config, schema, shared types, structured logging, core wiring.
Later waves are feature slices that go end-to-end: data, API, UI, tests, and evaluation criteria.

That makes the system close loops faster. Each wave produces something coherent and testable. It also gives the evaluator something concrete to verify at each step.
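A wave plan along those lines might look like the following sketch. The feature names and scope labels are invented for illustration:

```python
# Illustrative wave plan: wave 1 is the horizontal foundation,
# every later wave is a vertical slice that goes end to end.
waves = [
    {"name": "foundation",
     "scope": ["config", "schema", "shared types", "logging", "wiring"]},
    {"name": "feature: user signup",
     "scope": ["data", "API", "UI", "tests", "eval criteria"]},
    {"name": "feature: billing",
     "scope": ["data", "API", "UI", "tests", "eval criteria"]},
]

def is_vertical(wave: dict) -> bool:
    # A vertical slice must carry its own tests and evaluation
    # criteria, so each wave closes its own loop.
    return {"tests", "eval criteria"} <= set(wave["scope"])
```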

Another major lesson was that prompts are too weak to enforce critical process.

So each stage has deterministic verification around it. A spec is not accepted because the architect says it is done. A script checks whether the required sections exist. An implementation stage is not accepted because the implementor says the code should build. The build actually runs. The review stage is not accepted because the evaluator sounds confident. The review must contain a status, findings, and evidence.

That one change removed an entire class of fake progress.
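A minimal version of the spec gate fits in a few lines. The required section names here are assumptions, not the plugin's actual checklist; the point is that acceptance is a string check, not a judgment call:

```python
# Deterministic spec gate: the spec is accepted only if every
# required section heading is literally present in the document.
REQUIRED_SECTIONS = ["## Goal", "## Constraints", "## Plan", "## Risks"]

def spec_passes(spec_text: str) -> bool:
    return all(section in spec_text for section in REQUIRED_SECTIONS)
```

The architect can claim the spec is done; the gate only cares whether the artifact says so.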

The evaluator itself is deliberately heavy on runtime checks.

It does not just read code. It:
• builds the project
• runs the tests
• launches the app
• navigates core UI paths
• checks console errors and network behavior
• tests both mobile and desktop widths when relevant

For iOS projects, it can build into a simulator and exercise the app there.
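The runtime part of that gate reduces to actually executing commands and failing on the first non-zero exit. This is a sketch; the commands in `PIPELINE` are placeholders for whatever the target project uses to build and test:

```python
import subprocess
import sys

def run_checks(checks: list, cwd: str = ".") -> str:
    """Run each named command; fail the review on the first non-zero exit."""
    for name, cmd in checks:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True)
        if result.returncode != 0:
            return f"FAIL: {name}"  # captured stderr becomes review evidence
    return "PASS"

# Placeholder pipeline; a real project would substitute its own
# build, test, and launch commands here.
PIPELINE = [
    ("build", [sys.executable, "-c", "print('build ok')"]),
    ("tests", [sys.executable, "-c", "raise SystemExit(0)"]),
]
```

The review artifact then carries an explicit PASS or FAIL plus the command output as evidence, rather than the evaluator's impression of the code.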

What Ships

Delivery loop:
task -> spec -> implement -> verify -> ship
                      |
                      +--> diagnose -> retry

This matters because code that compiles is not the same thing as software that works.

Another lesson was that repeated fix attempts can keep failing for the same reason if the system holds the wrong mental model. So I added the debugger.

If the same bug survives another fix round, the system stops guessing. The debugger reproduces the issue, traces the execution path, checks what the framework or library actually guarantees, and writes a diagnosis. Only then does the implementor try again.

That pause between symptom and retry matters more than it sounds.
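The escalation rule itself is simple bookkeeping. The threshold and stage names below are illustrative assumptions:

```python
# Illustrative escalation rule: after MAX_FIX_ATTEMPTS failed fixes
# for the same bug, route to the debugger for a written diagnosis
# instead of letting the implementor retry blind.
MAX_FIX_ATTEMPTS = 2

def next_stage(bug_id: str, failures: dict) -> str:
    failures[bug_id] = failures.get(bug_id, 0) + 1
    return "diagnose" if failures[bug_id] > MAX_FIX_ATTEMPTS else "implement"
```

The counter is keyed by bug, not by session, so the system notices when it is circling the same failure rather than hitting new ones.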

The last major piece is reflection.

After a PASS review, the orchestrator writes down what was learned: technical gotchas, architecture decisions, patterns that worked, and suggested updates to the repo’s AGENTS.md. That file becomes durable local memory for future sessions. The next agent does not have to rediscover everything from scratch.
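The reflection step can be as small as appending a structured section to AGENTS.md. The section format here is a sketch of the idea, not the plugin's fixed schema:

```python
from pathlib import Path

def format_learnings(gotchas: list, decisions: list) -> str:
    """Render session learnings as a section to append to AGENTS.md."""
    lines = ["## Session learnings"]
    lines += [f"- Gotcha: {g}" for g in gotchas]
    lines += [f"- Decision: {d}" for d in decisions]
    return "\n".join(lines) + "\n"

def record_learnings(repo: Path, gotchas: list, decisions: list) -> None:
    # Append, never overwrite: AGENTS.md is durable memory
    # that accumulates across sessions.
    with (repo / "AGENTS.md").open("a", encoding="utf-8") as f:
        f.write("\n" + format_learnings(gotchas, decisions))
```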

That closes the loop.

One session ships the feature.
The next session starts smarter.

I like this design because it stays relatively small.

It is not trying to be an entire autonomous software company in a box. It is just enough structure to isolate the main failure modes I kept seeing:
• the same agent grading its own work too positively
• prompts failing as process control
• runtime behavior being ignored
• repeated bug fixes circling the same wrong assumption
• useful project knowledge disappearing between sessions

Those were the problems.
The architecture followed from them.

So I do not see this system as proof that “multi-agent is the answer.”
I see it as proof that targeted separation, hard verification, and lightweight memory help a model stop fooling itself.

That is a more modest claim.
But it is also the part that actually matters.