Suraj LabAmazon software engineer building backend systems and side projects.

Systems With Continuity

Most AI Agents Don’t Improve Over Time

Repeated work should compound. Most agents still recompute.

033 min read573 words

Old Model

A lot of agent systems are quietly graded on the wrong thing.

They get credit when they complete a task once.
They get less scrutiny on whether the next run is meaningfully better.

That hides a real weakness.

Most agents do not improve over time. They recompute over time.

They may have more tools, more workflow stages, more prompt structure, more retrieval, and more memory-looking infrastructure. That can increase capability. It does not automatically produce adaptation.

Adaptation is a harder standard.

It means repeated work changes the system in a way that improves future work.

Take a recurring report. A real improvement loop should make the next report narrower, sharper, and less repetitive because the system knows:
• what it covered before

• what conclusions survived

• what corrections were made

• what sources proved noisy

• what open questions are still unresolved

If the next report starts from scratch with a fresh generation, that is not improvement. It is a looped demo.

The same logic applies to coding systems.

Continuity Layer

Diagram

System with continuity

The system does not end at output. It carries state, accepts correction, and changes future behavior.

If the agent has already been corrected on the repo's build command, file structure, or library constraints, the next task should reflect that. If it makes the same class of mistake again, the system may be useful, but it is not really learning.

This is why I keep separating task completion from accumulation.

Most current agents are optimized for the first.
Few are built carefully for the second.

That difference shows up in their memory model.

A lot of systems store things. Fewer systems govern how stored things should influence future behavior.

That governance is the hard part.

For repeated work, the system has to decide:
• what is an observation versus an inference

• what is stable enough to preserve

• what remains tentative

• what was corrected and should stop influencing behavior

• what deserves reinforcement

• what should decay

Without those distinctions, the system is not improving. It is just carrying a larger backpack.

That is why I am skeptical when people equate more tooling with more learning.

Tools help with execution.
Learning requires governed change.

What Changes

Continuity loopsystems
observe -> interpret -> update -> act
          ^                   |
          |------ review -----|

The two are related, but they are not interchangeable.

A well-tooled agent can still plateau if nothing in the architecture promotes repeated signal into usable state, or if stale assumptions remain active forever because nobody gave the system a way to downgrade them.

So when I think about whether an agent is improving, I no longer ask only whether it solved the current task.

I ask:
• what will be easier next time because this run happened

• what mistake is now less likely

• what knowledge became durable

• what belief changed status

• what was removed from influence even if it stayed in history

That is a much higher bar than "the output looked good."

It also forces a more honest product standard.

If you want repeated work to compound, you need more than a context window and a few saved notes. You need a model of change. You need some way to turn interaction history into state that is revisable, traceable, and behaviorally relevant.

Otherwise the agent stays stuck in an awkward place:
highly capable at fresh starts, weak at accumulation.

That may be enough for some categories.
It is not enough for the kinds of systems I care about.

Repeated work should not just be another chance to generate.
It should be proof that the system can carry lessons forward.