Building an AI Agent Fleet from Scratch: What I Learned After 18 Months
April 13, 2026 · Jascha Kaykas-Wolff
Eighteen months ago I started building a multi-agent AI system to run the operational and content work across my portfolio. Not to assist a team — to replace the need for a team on a specific class of work. I was wrong about some things, right about others, and surprised by almost everything.
This is what I actually learned.
What I got wrong first
I built the first version as a monolith. One agent that knew everything about everything — the calendar, the email, the content operation, the code, the finances. The thinking was that a single context window with full access would be more capable than specialized agents operating in silos.
That was wrong. A single agent with too much context is slower, more prone to errors, and harder to debug. When something goes wrong — and things go wrong — you have no idea which part of the system caused it because everything is entangled. The monolith also made it impossible to improve one capability without risking regression in another.
The right architecture turned out to be the obvious one: specialized agents with narrow responsibilities, a coordinator that routes work, and shared memory that lets them pass context to each other without being directly coupled.
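The coordinator pattern described above can be sketched in a few lines. Everything here is illustrative, not the post's actual code: the names (Coordinator, register, route) and the lambda "agents" are stand-ins for real model-backed workers, each owning one narrow domain.

```python
# Hypothetical sketch: narrow agents registered by domain, a coordinator
# that routes each task to exactly one of them.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Coordinator:
    # domain name -> handler function for the agent that owns it
    agents: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def register(self, domain: str, handler: Callable[[str], str]) -> None:
        self.agents[domain] = handler

    def route(self, domain: str, task: str) -> str:
        if domain not in self.agents:
            raise KeyError(f"no agent owns domain {domain!r}")
        return self.agents[domain](task)

coordinator = Coordinator()
coordinator.register("content", lambda task: f"content agent handled: {task}")
coordinator.register("ops", lambda task: f"ops agent handled: {task}")

print(coordinator.route("content", "draft Friday newsletter"))
```

The point of the narrow interface is debuggability: when an output is wrong, the domain tells you which agent to inspect, which the monolith could not.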
Specialization is not the same as isolation
The hardest design problem in a multi-agent system is the boundary between agents. You want each agent to own a domain clearly enough that it can act autonomously. But you also need them to share context — a content agent needs to know what the email agent sent last week so it does not contradict it. An ops agent needs to know what the code agent just deployed so it can update the runbook.
I solved this with a shared memory layer: a set of markdown files that any agent can read and write, with conventions about where different types of information live. The content agent writes to memory/content-calendar.md. The email agent reads it before drafting a newsletter. The ops agent writes deployment notes to memory/deployments.md. Nobody is tightly coupled, but everybody can see what everybody else is doing.
This is boring infrastructure. It is also the most important piece of the system. Without it, the agents produce contradictory outputs and the system feels chaotic rather than coordinated.
Autonomy requires verification, not trust
The biggest cultural shift in working with an AI agent fleet is that you cannot trust agent output the way you trust a capable person's output. A capable person has internalized context, values, and judgment about what crosses a line. An agent will cheerfully do things that are wrong, off-brand, or embarrassing because it does not know what it does not know.
The solution is not to reduce autonomy. The solution is verification loops: explicit points in every workflow where the agent's output is checked against a defined standard before it leaves the system. For content, that is a checklist. For code, that is a test suite. For email, that is a human review step on anything customer-facing.
I have shipped things I should not have because I trusted agent output that had passed a surface check. I have also not shipped things I should have because verification was too heavy. Finding the right balance is the ongoing work.
The economics changed everything
When I started this, running a fleet of agents cost enough that I had to be selective about what I ran. The cost-per-token math meant that a daily content production run across 13 sites would have been expensive to operate continuously.
That constraint is gone. I have agents running on models that cost a fraction of what I was paying 18 months ago for worse results. The Kimi K2.5 model that Volt runs on costs almost nothing per task. Claude Sonnet handles complex coordination work at a price point that would have seemed impossible when I started.
This matters because it changes what you build. When cost is a constraint, you design for efficiency — minimize agent calls, batch work, reduce context. When cost is not a constraint, you can build for reliability — add verification loops, run parallel agents for cross-checking, use more capable models for critical tasks. I rebuilt significant parts of the fleet after the cost curve shifted.
What I would do differently
Start with a clear taxonomy of work: what is rote and should be fully automated, what requires human judgment and should never be automated, and what is in the middle and should be automated with human review. I spent too long trying to automate the middle category without adequate review, and stayed too conservative for too long about the rote category.
Build the memory layer first. Before writing a single agent prompt, design where state lives, how agents read it, and how they write back. This is not glamorous work, but it is the foundation everything else depends on.
Hire for verification, not for execution. The agents can execute. What they cannot do is decide whether the output is right. The humans in this system are not there to do the work the agents do — they are there to make judgment calls the agents cannot make. That is a different hiring profile than most operators expect.
Where this goes
I think about this fleet as infrastructure now, not as a product or an experiment. It is closer to a database or a cloud account than it is to a tool I use — it runs in the background, it handles a class of work reliably, and I only think about it when something breaks.
Getting to that feeling took 18 months of building, debugging, and rebuilding. The technical part was the easy part. The hard part was developing the right intuitions about when to trust the system and when to intervene — and that only comes from operating it long enough to fail in enough different ways.