Reliable Agent Systems
Deterministic Gates Beat Prompt-Only Control
The more important a rule is, the less it should depend on prompt wording alone.
Failure Mode
One of the clearest lessons from building agent systems is that language is a weak control surface.
You can write careful instructions:
• “Ask discovery questions before writing the spec.”
• “Verify outputs after every stage.”
• “Do not skip build verification.”
• “Never claim completion without runtime testing.”
And still watch the model skip half of it.
Not because it is malicious, but because this is how these systems behave. They compress intent. They pattern-match. They optimize for plausible continuation. They do not naturally respect process boundaries just because the wording was strong.
This is why I have become skeptical of prompt-only control for serious agent systems.
At some point, if you want reliability, the important constraints have to leave the prompt and enter the environment.
That is what deterministic gates do.
Instead of trusting the agent’s own description of whether it completed a stage correctly, the system checks the artifact against a concrete rule. A spec is not accepted because the architect says it is done. It is accepted because a verification step confirms the required sections exist. A build stage is not accepted because the implementor says the code should compile. It is accepted because the build command exits successfully. A review is not accepted because the evaluator sounded thorough. It is accepted because the review contains a status, traceable findings, and evidence.
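A gate like that can be a few lines of ordinary code rather than prompt text. A minimal sketch, with hypothetical section names standing in for whatever a real spec template actually requires:

```python
# Sketch of a deterministic artifact gate. The section names are
# hypothetical placeholders; a real harness would use its own template.
REQUIRED_SECTIONS = ["## Goals", "## Non-Goals", "## Interfaces", "## Test Plan"]

def spec_gate(spec_text: str) -> tuple[bool, list[str]]:
    """Return (passed, missing_sections) for a spec artifact.

    The spec is accepted only if every required section heading is
    present, regardless of what the authoring agent claims.
    """
    missing = [s for s in REQUIRED_SECTIONS if s not in spec_text]
    return (not missing, missing)

passed, missing = spec_gate("## Goals\n...\n## Test Plan\n...")
# missing == ["## Non-Goals", "## Interfaces"] -- the gate reports
# exactly which sections block acceptance, so failure is actionable.
```

The point is not the specific check; it is that acceptance no longer depends on the agent's self-report.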
Control Surface
Diagram: the reliable agent loop.
The system only advances when artifacts pass an external check. Failed work routes into diagnosis, not more guessing.
This sounds obvious in hindsight, but it changes the character of the system.
Without deterministic gates, the pipeline is mostly social. Each stage makes claims, and the next stage informally trusts them. With deterministic gates, the pipeline becomes operational. The artifact either passes or it does not.
That distinction matters because models are often too willing to declare progress. They are good at producing the appearance of completion. They are much less reliable at recognizing that a missing section, failed command, or untested runtime path should block forward motion.
Once you move those checks into scripts, an entire class of failure disappears.
It also simplifies the prompts.
You no longer need giant paragraphs insisting that the model please remember to do the right thing. You can let it try, then force the output through a rule that is external and unambiguous.
More generally, I think this points to a broader design principle:
The more important a constraint is, the less it should rely on wording alone.
If violating it would break trust, break correctness, or create hidden failure, that rule probably belongs in the harness, not just in the prompt.
What Ships
task -> spec -> implement -> verify -> ship
                               |
                               +--> diagnose -> retry
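The gated loop itself can be sketched directly: advance only when the gate passes, and route failures through diagnosis with a bounded retry budget. The stage, gate, and diagnose callables here are placeholders for whatever a real harness wires in:

```python
# Sketch of a gated stage loop (stage, gate, and diagnose are placeholders).
# Work only advances on a passing gate; failures go to diagnosis, not guessing.
def run_gated(stage, gate, diagnose, max_retries=3):
    feedback = None
    findings = []
    for _ in range(max_retries):
        artifact = stage(feedback)      # let the agent try
        ok, findings = gate(artifact)   # external, deterministic check
        if ok:
            return artifact             # advance to the next stage
        feedback = diagnose(findings)   # route failure into diagnosis
    raise RuntimeError(f"gate failed after {max_retries} attempts: {findings}")
```

Bounding the retries matters: a gate that can loop forever just converts a quality failure into a liveness failure.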
This is especially true for coding systems because software has a natural ground truth. The code builds or it does not. The test passes or it does not. The app runs or it does not. The UI behaves or it does not. When that reality exists, it is a mistake not to use it.
Prompts still matter. They shape intent, decomposition, local judgment, and how the system handles ambiguity. But prompts should guide. Gates should enforce.
That is a cleaner division of labor.
I think a lot of agent systems become fragile because they ask prompts to do too much. They ask language to carry not just the task, but the policy, the safety boundaries, the structure, the quality bar, and the verification logic. That is too much weight on a medium that is inherently soft.
A better design is usually some mix of:
• prompt for direction
• environment for enforcement
• review against reality
That may sound less elegant than “the agent understands.”
But it works better.
And for systems that are supposed to do real work, that trade is worth making.