Suraj Lab

Amazon software engineer building backend systems and side projects.

Reliable Agent Systems

Building a Coding Agent That Actually Ships

A Claude Code plugin for turning vague software ideas into reviewed, tested, committed code.

08 · 10 min read · 2,023 words

Generation Is Not Shipping

AI coding tools are getting very good at producing code. That is no longer the hard part.

The hard part is producing software.

Software is not just files. It is intent, requirements, design tradeoffs, tests, runtime behavior, error handling, review, documentation, and memory across sessions. Most coding agents collapse all of that into one loose loop:

user asks -> model writes code -> maybe runs tests -> user inspects the result

That works for small edits. It breaks down when the task has ambiguity, architectural decisions, multiple files, UI behavior, external libraries, or long-running context.

I built coding-agent as a Claude Code plugin to explore a different model: not one smart model writing code, but a structured software delivery pipeline where agents play different roles, durable artifacts preserve state, deterministic checks gate progress, and the user approves the important decisions before code is written.

The result is a multi-agent Claude Code plugin for building software end to end. The repo describes a system with five agents, 54 skills, seven MCP servers, and deterministic pipeline gates. Its promise is simple: turn a vague request like "build me a blog with comments" into research, spec, plan, implementation, independent review, runtime testing, and commit.

The problem is familiar if you have used AI coding tools heavily. The agent reports the task as done, but:

  • the UI was never opened
  • the library API was hallucinated
  • the design decision was never made explicit
  • the tests cover happy paths only
  • the agent changed unrelated files
  • a later session forgets why something was built
  • the same agent that wrote the code also "reviews" it

The problem is not only model intelligence. The problem is workflow architecture.

Human software teams do not ship serious systems by letting one person vaguely interpret a request, write code, self-review it, and merge it. We introduce structure: product intent, requirements, technical design, task breakdown, implementation, testing, independent review, release gates, documentation, and incident learnings.

coding-agent applies that idea to AI coding. Not because agents should imitate bureaucracy, but because software quality comes from state, separation, feedback, and verification.

A normal coding agent is mostly an executor. It receives a task and mutates the repo.

coding-agent adds a control plane around that execution.

That control plane answers questions the model conversation should not be allowed to blur:

  • What phase are we in?
  • Has the user approved the spec?
  • What files or areas are allowed to change?
  • Which agent owns the next step?
  • What evidence proves the feature works?
  • What happens if review fails?
  • What context survives into the next session?
  • How do we prevent approved decisions from drifting silently?

The key design move is that the LLM conversation is not the source of truth. Artifacts on disk are.

The plugin docs define four primitives: Actor, Artifact, Skill, and Check. Actors produce work. Artifacts store durable outputs like intent.md, spec.md, plan.md, work.md, review.md, and learnings.md. Skills provide scoped knowledge. Checks are deterministic predicates such as intent-approved, ui-evidence, and revisions-resolved.

That gives the system a backbone.

The conversation can be messy. The model can be probabilistic. But the pipeline state is written down, checked, and resumed.

The first primitive is Actor.

The major actors are Orchestrator, Architect, Implementor, Evaluator, Debugger, and User.

The Orchestrator owns state, classifies tasks, dispatches agents, and enforces gates. The Architect researches, asks discovery questions through the orchestrator, and writes spec and plan artifacts. The Implementor writes code and tests inside a scoped task contract. The Evaluator independently builds, tests, and verifies runtime behavior. The Debugger performs root-cause analysis when fixes fail. The User approves intent, spec, plan, and push.

This separation matters because one of the biggest weaknesses of LLM coding is self-evaluation bias.

The agent that wrote the code is naturally bad at reviewing it. It tends to confirm its own assumptions. So coding-agent separates the generator from the evaluator. That mirrors a real engineering team: the person who implements a feature should not be the only person validating it.

The second primitive is Artifact.

The Control Plane

Diagram: Control plane for coding agents. The orchestrator owns state and dispatch. Artifacts carry intent, design, work, review, diagnosis, and memory across the pipeline.

Artifacts are durable outputs on disk. This is one of the strongest parts of the design because the system does not rely on "the model remembers what it said." It writes structured files.

The important artifacts include intent.md for the approved user request, spec.md for requirements and technical approach, plan.md for task decomposition and evaluation criteria, work.md or progress.md for mutable execution state, review.md for evaluator findings, diagnosis.md for root-cause analysis, learnings.md for persistent patterns and gotchas, and AGENTS.md for project-specific conventions.

The most important invariant is that approved artifacts are immutable.

If something changes after approval, the system does not silently rewrite the past. It records a revision in work.md with a Supersedes: pointer. That solves a common agent problem: drift. Agents often start with one plan, discover something later, and quietly reshape the task. Sometimes that is useful. Sometimes it violates the user's original intent. Immutable approved artifacts make drift visible.
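A minimal Python sketch of that invariant follows. The `Revision` record, the `record_revision` helper, the entry layout, and the `spec.md#FR-3` style of pointer are all assumptions made for illustration. The only details taken from the plugin are that revisions land in work.md and carry a Supersedes: pointer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """Hypothetical record of a change to an already-approved decision."""
    revision_id: str
    supersedes: str    # pointer to the approved text being replaced (format assumed)
    reason: str
    new_decision: str


def record_revision(work_md: str, rev: Revision) -> str:
    """Return work.md content with the revision appended.

    The approved artifact is never edited; only work.md grows.
    """
    entry = (
        f"\n## Revision {rev.revision_id}\n"
        f"Supersedes: {rev.supersedes}\n"
        f"Reason: {rev.reason}\n"
        f"Decision: {rev.new_decision}\n"
    )
    return work_md.rstrip("\n") + "\n" + entry
```

Because the function only appends, the history of what was approved and what later replaced it stays readable in one place.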

The third primitive is Skill.

Skills are scoped knowledge modules loaded only when needed. The repo currently carries skills across frontend, backend, data, mobile, infrastructure, observability, configuration, testing, security, documentation, and pipeline verification.

That matters because long prompts are not free. They increase context size, dilute attention, and make instructions easier to ignore. The better pattern is to keep the agent role small, then load domain knowledge just in time.
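One way to picture just-in-time loading is a lookup from a task's domain tags to the skills it needs. The sketch below is illustrative only: the `SKILL_INDEX` table, the skill names, and `skills_for_task` are hypothetical and not taken from the repo.

```python
# Hypothetical mapping from task domain tags to skill names.
SKILL_INDEX = {
    "frontend": ["frontend-components", "ui-testing"],
    "backend": ["api-design"],
    "security": ["auth-review"],
}


def skills_for_task(domain_tags: list[str]) -> list[str]:
    """Return the de-duplicated skill names for a task, in first-seen order."""
    selected: list[str] = []
    for tag in domain_tags:
        for skill in SKILL_INDEX.get(tag, []):
            if skill not in selected:
                selected.append(skill)
    return selected
```

A backend task with a security concern would load two skills and nothing else, so the agent's context stays proportional to the task rather than to the size of the skill library.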

The fourth primitive is Check.

Checks are deterministic gates. This is the most important anti-vibe-coding mechanism in the system.

The repo has scripts that exit 0 or 1 for stage verification: spec, plan, build, tests, review, UI evidence, approved intent, approved plan, active feature consistency, action logging, close-out completeness, and unresolved revisions.

The model can propose. The check decides whether the stage is valid.

That distinction is critical. LLMs are good at producing plausible explanations. They are weaker as final arbiters of whether a pipeline condition has actually been satisfied. A deterministic gate creates a hard boundary.
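To make the idea concrete, here is a hedged sketch of what a check such as intent-approved could look like. It is written in Python for illustration, and the repo's actual scripts may be in another language. The `Status: approved` marker, the directory layout, and the function names are assumptions.

```python
import re
from pathlib import Path

# Assumed approval marker; the real artifact format may differ.
APPROVED = re.compile(r"^Status:\s*approved\s*$", re.IGNORECASE | re.MULTILINE)


def intent_approved(feature_dir: Path) -> bool:
    """True only if intent.md exists and carries the approval marker."""
    intent = feature_dir / "intent.md"
    if not intent.is_file():
        return False
    return bool(APPROVED.search(intent.read_text(encoding="utf-8")))


def main(argv: list[str]) -> int:
    """Return the process exit code: 0 if the gate passes, 1 otherwise."""
    feature_dir = Path(argv[1]) if len(argv) > 1 else Path(".")
    return 0 if intent_approved(feature_dir) else 1

# A wrapper script would end with: sys.exit(main(sys.argv))
```

The predicate reads files and returns a boolean. There is no model call and no judgment involved, which is what makes the result repeatable.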

The primary pipeline is:

  • user request
  • intake
  • intent.md
  • user approval
  • spec
  • spec.md
  • user approval
  • plan
  • plan.md
  • user approval
  • implementation
  • code and tests
  • review
  • review.md PASS or FAIL
  • fix-round if needed
  • close-out
  • learnings.md, docs, archive
  • commit gate
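The approval points in that sequence can be sketched as a small gate function. This is a simplified illustration, not the plugin's implementation: the phase names, `NEEDS_USER_APPROVAL`, and `next_phase` are assumptions, and the fix-round loop is reduced to "stay in review until it passes."

```python
PHASES = ["intake", "spec", "plan", "implementation", "review", "close-out", "commit"]

# Phases whose output artifact must be approved by the user before moving on.
NEEDS_USER_APPROVAL = {"intake": "intent.md", "spec": "spec.md", "plan": "plan.md"}


def next_phase(current: str, approved: set[str], review_passed: bool = False) -> str:
    """Return the phase the pipeline may enter next, or `current` if it is blocked."""
    required = NEEDS_USER_APPROVAL.get(current)
    if required is not None and required not in approved:
        return current  # blocked until the user approves the artifact
    if current == "review" and not review_passed:
        return current  # a failed review leads to a fix round, not to close-out
    idx = PHASES.index(current)
    return PHASES[min(idx + 1, len(PHASES) - 1)]
```

For example, `next_phase("intake", set())` stays at `"intake"`, while `next_phase("intake", {"intent.md"})` moves on to `"spec"`.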

The orchestrator begins by restating and classifying the user request. The repo defines task-size paths: micro tasks can stay in the main thread, small tasks use implementor plus lightweight evaluation, medium tasks get architect planning, and large features use the full spec -> plan -> implementation -> review pipeline.

That classification matters because not every task deserves five agents.

You do not want ceremony for a typo. You also do not want the orchestrator casually editing five files because "it seems simple." The bright-line rule is useful: if the task touches more than two files or writes more than about thirty lines of new logic, it should not be handled as a main-thread patch.
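The bright-line rule is simple enough to write as a predicate. In this sketch the function and constant names are hypothetical, and "about thirty lines" is treated as a hard limit of 30, which is an assumption.

```python
MAX_FILES_MAIN_THREAD = 2
MAX_NEW_LOGIC_LINES = 30  # "about thirty" in the text; treated as a hard limit here


def can_stay_in_main_thread(files_touched: int, new_logic_lines: int) -> bool:
    """True if the task is small enough to be handled as a main-thread patch."""
    return (
        files_touched <= MAX_FILES_MAIN_THREAD
        and new_logic_lines <= MAX_NEW_LOGIC_LINES
    )
```

A typo fix in one file passes. A five-file change fails no matter how simple it seems.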

For meaningful features, the architect writes spec.md before coding starts.

That is where the system slows down. Most agent failures happen because implementation starts before ambiguity is resolved. The model fills in missing details silently. Then the user reviews code and realizes the wrong thing was built.

The spec phase forces the system to clarify functional requirements, non-goals, technical risks, test infrastructure, stack choices, user-visible behavior, and acceptance criteria. If the architect needs answers, it returns structured discovery questions to the orchestrator. The orchestrator surfaces those questions through the real user channel.

That keeps user communication centralized. Subagents do not independently negotiate with the user. The orchestrator owns the conversation and the state. This prevents a subtle multi-agent failure mode: different agents creating different versions of truth.

After the spec is approved, the architect writes plan.md.

The plan decomposes the feature into waves and tasks. Each task declares domain tags, required skills, likely touched files, acceptance criteria, evaluation requirements, and whether it can run in parallel.

Parallelism only earns its place when the files are disjoint and the dependency graph allows it. Uncontrolled parallel agents create merge conflicts, duplicated work, and inconsistent design choices. A good plan is a dispatch contract, not a motivational outline.
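The two conditions, disjoint files and a permissive dependency graph, can be sketched as a greedy wave builder. The `Task` shape and `build_waves` are illustrative assumptions, not the plugin's plan format.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    files: set[str]                          # files the task is expected to touch
    depends_on: set[str] = field(default_factory=set)


def build_waves(tasks: list[Task]) -> list[list[str]]:
    """Group tasks into waves whose members can run in parallel.

    A task joins a wave only if all of its dependencies finished in an
    earlier wave and its files are disjoint from every task already in it.
    """
    done: set[str] = set()
    remaining = list(tasks)
    waves: list[list[str]] = []
    while remaining:
        wave: list[Task] = []
        for task in remaining:
            deps_met = task.depends_on <= done
            no_conflict = all(task.files.isdisjoint(o.files) for o in wave)
            if deps_met and no_conflict:
                wave.append(task)
        if not wave:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append([t.task_id for t in wave])
        done |= {t.task_id for t in wave}
        remaining = [t for t in remaining if t.task_id not in done]
    return waves
```

Two tasks that both touch the same file land in different waves even when neither depends on the other, which is the merge-conflict case the plan is meant to prevent.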

Implementation is done by one or more implementors.

Each implementor receives a scoped task contract. It reads the approved artifacts, loads relevant skills, inspects the codebase, writes tests first where appropriate, implements the change, and returns a structured payload. The orchestrator parses that return block and applies updates to work.md. Subagents do not directly mutate shared workflow state.

Where Trust Comes From

Delivery pipeline workflow
intake -> intent.md -> approval
spec -> spec.md -> approval
plan -> plan.md -> approval
implement -> code + tests
review -> review.md PASS/FAIL
fail -> diagnose -> fix-round -> review
pass -> learnings.md + docs + commit gate

The single-writer rule, where only the orchestrator mutates shared workflow state, is good systems design.

In distributed systems, multi-writer shared state creates race conditions. Multi-agent coding has the same problem. If every subagent can rewrite the plan, progress, and status, the workflow becomes non-deterministic. coding-agent avoids that by making the orchestrator the state owner. Subagents produce outputs. The orchestrator integrates them.
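A sketch of that integration step, under stated assumptions: the return block is taken to be JSON with `task_id`, `status`, and `files_changed` fields, and the workflow state is a plain dictionary. The real payload format and the work.md layout are not specified here, so treat every name as hypothetical.

```python
import json


def apply_return_block(work_state: dict, raw_return: str) -> dict:
    """Orchestrator-side merge of one subagent result into workflow state.

    Subagents only return data; this function is the single writer.
    It returns a new state and leaves the input untouched.
    """
    payload = json.loads(raw_return)
    task_id = payload["task_id"]
    if task_id not in work_state["tasks"]:
        raise KeyError(f"unknown task: {task_id}")
    updated = {**work_state, "tasks": dict(work_state["tasks"])}
    updated["tasks"][task_id] = {
        "status": payload["status"],
        "files_changed": sorted(payload.get("files_changed", [])),
    }
    return updated
```

Because every update passes through one function, a malformed or out-of-scope result can be rejected before it reaches shared state.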

Review is independent from implementation.

The evaluator builds the project, runs tests, checks acceptance criteria, and validates runtime behavior. For UI work, it drives the browser through Playwright or uses the relevant simulator tooling. A UI cannot pass just because the code looks coherent. The page needs to load, the interaction needs to work, console errors need to be checked, and screenshots need to exist.

No runtime evidence, no pass.
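As a rough sketch, the evidence requirement can be expressed as a checklist that must come back empty. The `UiEvidence` fields and `missing_ui_evidence` are assumptions that mirror the four conditions named above, not the plugin's actual ui-evidence check.

```python
from dataclasses import dataclass, field


@dataclass
class UiEvidence:
    """Hypothetical summary of what the evaluator observed at runtime."""
    page_loaded: bool = False
    interaction_passed: bool = False
    console_checked: bool = False
    screenshots: list[str] = field(default_factory=list)


def missing_ui_evidence(ev: UiEvidence) -> list[str]:
    """List what is still missing; an empty list means the gate can pass."""
    missing = []
    if not ev.page_loaded:
        missing.append("page load")
    if not ev.interaction_passed:
        missing.append("working interaction")
    if not ev.console_checked:
        missing.append("console error check")
    if not ev.screenshots:
        missing.append("screenshots")
    return missing
```

Returning the missing items, rather than a bare boolean, gives the evaluator something specific to write into review.md when the gate fails.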

The system also assumes failure will happen. That is healthy.

Instead of pretending the first implementation will be correct, it defines a fix ladder:

  • Review FAIL
  • Round 1: re-implement from evaluator findings
  • Review again
  • Round 2: debugger diagnoses root cause
  • Implementor fixes from diagnosis
  • Review again
  • Round 3: escalate to user

A stubborn bug usually means the implementor's mental model is wrong. Asking the same implementor to "try again" repeatedly wastes cycles. You need a different role: debugger, not builder. The debugger reproduces, isolates, traces, and writes diagnosis.md. It does not write the fix itself.
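The ladder maps cleanly onto the number of failed reviews so far. The function name and the returned strings below are illustrative; only the three rounds come from the ladder described above.

```python
def next_fix_step(failed_reviews: int) -> str:
    """Map the number of failed reviews so far to the next step on the ladder."""
    if failed_reviews < 1:
        raise ValueError("the ladder only applies after a failed review")
    if failed_reviews == 1:
        return "implementor: re-implement from evaluator findings"
    if failed_reviews == 2:
        return "debugger: write diagnosis.md, then implementor fixes from it"
    return "user: escalate"
```

The point of encoding it this way is that the owner of the next attempt changes with each failure, instead of the same role retrying indefinitely.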

Close-out is not just cleanup. It is memory formation.

After a successful review, the orchestrator freezes artifacts, archives the feature, updates learnings.md, updates AGENTS.md if a new convention emerged, clears the active feature pointer, updates the session checkpoint, and prepares for commit.

That is how the system improves across sessions. Most coding agents are stateless contractors. They show up, make changes, and leave. Next time, they rediscover the same project facts. coding-agent tries to make the project itself teach the agent.

The strongest design decisions are not flashy:

  • the orchestrator is the only state owner
  • approved artifacts are immutable
  • humans approve intent, spec, plan, and push
  • the evaluator is separate from the implementor
  • runtime evidence is required for UI work
  • skills replace giant prompts
  • close-out turns completed work into project memory

There are tradeoffs.

More structure means more overhead. Deterministic checks only verify what they encode. Human approvals can become rubber stamps if artifacts are too verbose. Parallelism is risky unless boundaries are clean. Some rules still depend on disciplined prompt behavior and should eventually move into mechanical enforcement where possible.

But the deeper point is that coding agents need institutions, not just intelligence.

Human teams scale because they invent trust machinery: design docs, code review, ownership boundaries, incident reports, release gates, architecture decisions, style guides, CI, and deployment checklists. These are not bureaucracy by default. They are compression mechanisms for trust.

Agentic software engineering needs the same thing.

A raw model is intelligence. A coding agent is intelligence plus tools. A software-building system needs more:

  • tools
  • memory
  • roles
  • artifacts
  • gates
  • review
  • recovery
  • audit trail

That is why coding-agent, the plugin, is more interesting than a generic demo of a coding agent. It does not just generate code. It encodes opinions about what should block progress, what should be observable, what should be diagnosed, and what should survive into the next run.

The thesis is simple:

AI agents can write code, but shipping software requires a pipeline.

That pipeline needs durable artifacts, explicit approvals, scoped skills, deterministic checks, independent review, runtime evidence, memory, and recovery.

The goal is not to make the human disappear.

The goal is to make the human's attention count.