Reliable Agent Systems
What Building This Taught Me About Better Agents
The most important lessons were not about prompts. They were about structure, verification, memory, and the limits of self-evaluation.
Failure Mode
Building a coding-agent system changed my view of agents in a few ways.
Not because it made me believe in some grand framework. If anything, it made me more skeptical of broad claims. What it gave me was a sharper sense of where agent systems actually break, and what kinds of fixes genuinely improve them.
The first lesson is that prompts are weaker than they look.
A good prompt matters. It shapes behavior, tradeoffs, tone, and decomposition. But there is a big difference between a prompt that sounds rigorous and a system that is actually hard to fool.
The model will still skip steps.
It will still compress instructions.
It will still declare progress too early.
It will still subtly optimize for plausible completion.
That means the important rules cannot live only in prose.
Some of them have to become environment.
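One way to make a rule environmental rather than prosaic is to gate completion on artifacts instead of on the agent's claim. This is a minimal sketch, not the system described above; the artifact fields and messages are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Artifacts:
    """Evidence the agent must actually produce; field names are illustrative."""
    tests_ran: bool = False
    tests_passed: bool = False
    build_log: str = ""

def accept_done_claim(artifacts: Artifacts) -> tuple[bool, str]:
    """A rule encoded as an environment check, not as prompt prose.

    The agent's assertion of completion is ignored; only artifacts count.
    """
    if not artifacts.tests_ran:
        return False, "rejected: no test run recorded"
    if not artifacts.tests_passed:
        return False, "rejected: tests failed"
    if not artifacts.build_log:
        return False, "rejected: no build log produced"
    return True, "accepted"
```

The point is that the model can still skip steps in its narration, but the gate does not care what the narration says.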
The second lesson is that self-evaluation is unreliable.
This showed up everywhere. The implementor would think the feature was basically done. Then an independent evaluator would build the app, launch it, and immediately find something broken. Or the agent would write tests that technically passed but did not really validate the behavior the feature was supposed to guarantee.
This changed how I think about review.
A lot of current agent systems assume that stronger reasoning will eventually make self-grading good enough. Maybe it will improve, but I think the structural bias remains. A system is generally too close to its own local choices to judge them cleanly. Independent verification matters.
Control Surface
[Diagram: Reliable agent loop. The system only advances when artifacts pass an external check. Failed work routes into diagnosis, not more guessing.]
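The loop in the diagram can be sketched as a single stage that advances only when an independent check passes. The `implement`, `verify`, and `diagnose` callables are placeholders for whatever the real system plugs in; the key assumption is that `verify` is external to the implementor:

```python
from typing import Callable, Optional

def run_stage(
    implement: Callable[[str], str],
    verify: Callable[[str], bool],
    diagnose: Callable[[str], str],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Advance only when an external check passes; route failures to diagnosis."""
    guidance = task
    for _ in range(max_attempts):
        artifact = implement(guidance)
        if verify(artifact):           # external check, not self-grading
            return artifact            # ship
        guidance = diagnose(artifact)  # failed work becomes diagnostic input
    return None                        # surface the failure instead of guessing again
```

Returning `None` rather than the best guess is deliberate: an unresolved failure should be surfaced, not papered over.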
The third lesson is that runtime reality is where most of the truth lives.
Code can look coherent and still fail because of wiring, configuration, network assumptions, state mismatches, async behavior, UI integration, or platform quirks. That sounds obvious to anyone who has shipped software, but many agent systems still behave as if static code analysis plus unit tests were enough.
It usually is not.
The fourth lesson is that retries are not learning.
If a bug survives two or three attempts, the system often does not need more effort. It needs a different mental model. Repeating the same style of fix with slightly different code is not diagnosis. It is just motion.
That is why I ended up caring more about the debugger role than I expected. Not because every system needs a separate debugger, but because the pause between failure and another code change is often where the real improvement happens.
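That pause can be made mechanical. As a rough sketch (the threshold and the repeated-fix heuristic are assumptions, not the system's actual policy), a loop can be forced into diagnosis when a failure survives repeated attempts or when a proposed fix repeats an earlier one:

```python
def should_diagnose(attempts: int, proposed_fixes: list[str], threshold: int = 2) -> bool:
    """Decide whether to pause for diagnosis instead of retrying.

    Trigger diagnosis when the bug has survived `threshold` attempts,
    or when a fix repeats an earlier one -- motion, not diagnosis.
    """
    if attempts >= threshold:
        return True
    # A duplicate in the fix history means the system is guessing, not modeling.
    return len(proposed_fixes) != len(set(proposed_fixes))
```

The heuristic is crude, but it encodes the lesson: more effort is not the same as a different mental model.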
The fifth lesson is that memory matters, even in small amounts.
A repo-level knowledge file is not a grand memory architecture. But even lightweight carry-forward knowledge changes the experience. The next session starts with some local truth already available: how the system is structured, what the build commands are, which gotchas matter, what patterns worked before. That lowers the reorientation tax and makes repeated use feel more cumulative.
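The mechanism really is that small. A minimal sketch, assuming a hypothetical notes file in the repo root (the filename and format are illustrative):

```python
from pathlib import Path

KNOWLEDGE_FILE = "AGENT_NOTES.md"  # hypothetical filename

def load_knowledge(repo: Path) -> str:
    """Read carried-forward local truth at session start, if any exists."""
    path = repo / KNOWLEDGE_FILE
    return path.read_text() if path.exists() else ""

def record_lesson(repo: Path, lesson: str) -> None:
    """Append one lesson so the next session starts less blind."""
    path = repo / KNOWLEDGE_FILE
    with path.open("a") as f:
        f.write(f"- {lesson}\n")
```

An append-only list of build commands and gotchas is not a memory architecture, but it is exactly the kind of state that lowers the reorientation tax.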
The sixth lesson is that multi-agent is not the point.
I ended up with several agents because certain responsibilities conflicted when one loop tried to do everything. Planning, implementation, evaluation, and diagnosis were not failing for the same reason, so separating them helped. But I do not think the lesson is “everyone should build five-agent systems.”
What Ships
task -> spec -> implement -> verify -> ship
                               |
                               +--> diagnose -> retry

The lesson is narrower: separate responsibilities when the separation isolates a real failure mode.
That is a very different claim.
The broader conclusion I took away is that useful agents will probably be defined less by how autonomous they sound and more by how well their environment is designed.
By environment I mean:
• what state they can carry forward
• what they are forced to verify
• what they are allowed to assume
• how failures are surfaced
• how corrections are preserved
• what counts as done
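Those six levers can be written down as an explicit contract. This is only a sketch of the idea, with illustrative field names, not a schema the system actually uses:

```python
from dataclasses import dataclass, field

@dataclass
class AgentEnvironment:
    """The environment as an explicit contract; all names are illustrative."""
    carried_state: list[str] = field(default_factory=list)       # what persists across sessions
    required_checks: list[str] = field(default_factory=list)     # what must be verified
    allowed_assumptions: list[str] = field(default_factory=list) # what may go unchecked
    failure_channel: str = "report"                              # how failures are surfaced
    corrections_log: list[str] = field(default_factory=list)     # how corrections are preserved
    done_criteria: list[str] = field(default_factory=list)       # what counts as done

    def is_done(self, passed_checks: set[str]) -> bool:
        """Done is defined by the environment, not claimed by the agent."""
        return all(c in passed_checks for c in self.done_criteria)
```

Even as a toy, it makes the point concrete: "done" is a property of the environment's criteria, not of the agent's confidence.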
That is where a lot of the real leverage is.
The model matters, of course.
But once the model is capable enough, the important differentiator shifts.
Then the question becomes:
what kind of system is this model embedded in?
That is the question I keep coming back to.
Because the future of agents probably will not be decided by prompts alone.
It will be decided by whether we can build systems around them that accumulate useful state, verify themselves against reality, and become more dependable through use.
That still feels early.
But it also feels like the right direction.