Home/ Writing/ 60 days, 23 million tokens
● Field note · February 2026 · 14 min read · Frollie OS

60 days, 23 million tokens — what one operator can ship now.

Two months. One operator. Twenty‑three million tokens of Claude Code in production. This is the field note for what actually happened in those sixty days — what got built, where the tokens went, what the cost stack looks like underneath, and what I think it means for any FMCG operator thinking about staffing decisions for 2026.

The setup.

Frollie went live in early December 2025. By the second week of February I'd shipped enough of the operating system that we could close the quarter without a single technical hire. Two months. One operator. Roughly twenty‑three million tokens of Claude Code, give or take, across about forty distinct sub‑agent workflows.

The point of writing it down is that the inputs and outputs of those sixty days are now legible in a way they weren't to me three months ago. Where the tokens went, what they bought, what broke, and how much the whole thing actually costs in dollars — those numbers are the part that makes the rest of this argument concrete.

Claude Code terminal showing live agent activity, parallel sub‑agents running concurrently in tmux panes, log highlights, and quality‑gate status.
A working trace from the Frollie OS development environment — multiple Claude Code sub‑agents running in parallel against the same repo. This is what 23 million tokens looks like in motion.

What actually got shipped.

The system that came out the other side is what we now call Frollie OS. It runs the entire business — three brands, six distribution channels — on a single repo with no engineering team behind me.

The modules, roughly in the order they got built:

That is the surface area. At HelloFresh ANZ, a comparable scope is what a twelve‑person product team would put on its annual roadmap. We did it in sixty days, and we did it without ever hiring a developer.

Frollie Pro internal dashboard. Modules for Operations, Inventory & Supply, Sales & Distribution, Financials, Accounting, Configuration, Admin, and Help & Training, each with their own sub-pages and integrations.
Frollie Pro — the operator surface. Every module is composed from agent‑deployed code under review by one person.

Where the tokens actually went.

The thing I didn't expect coming in: greenfield code is the cheap part.

Rough breakdown of where the twenty‑three million tokens went, by cost share:

Sixty‑five percent of the tokens went to making code I already had not break in production. That ratio is a different shape from what I expected. A junior engineer will tell you the hard part is writing the code. The agent stack tells you the hard part is the second half — the unromantic, unphotographed part nobody walks through at boot camp.

Once you internalise that, the strategy reorganises. You spend your spec time on the easy 25% and your review time on the expensive 65%. The leverage is in the second half, not the first.

The composition pattern.

I run a roughly consistent sub‑agent topology now. It looks like this on every non‑trivial workflow:

Sub‑agent 1The spec agent.

Takes my one‑paragraph requirement and turns it into a structured spec — interfaces, data shapes, edge cases, failure modes — before any implementation runs. This is the cheapest agent to run and by far the highest leverage. Most of my time as the human in the loop is spent here.

Sub‑agent 2The implementation agent.

Takes the spec and writes the first cut. It has no awareness of the rest of the codebase except via context I curate — which is deliberate. Letting an implementation agent see the whole repo by default produces more drift than discipline.

Sub‑agent 3The review agent.

Reads the implementation against both the spec and the surrounding code, and either approves or sends it back. The review agent gets the broad context the implementation agent was denied. It catches what implementation can't see.

Sub‑agent 4The test agent.

Writes the test suite from the spec, not the implementation — so the tests can catch implementation bugs rather than rubber‑stamp them. This is a small distinction that matters more than any other.

I'm the human in the loop at two points: the spec hand‑off, and the final review before merge. Everything in between can run unattended. I usually watch the trace anyway, because it's genuinely interesting and because catching a confused agent early saves tokens later.

The cost — in dollars.

Twenty‑three million tokens at Claude Sonnet 4.6 / 4.7 mix prices works out to around USD $400–500 per month in raw API spend across the period, depending on which model I was leaning on in a given week. Add a chunky Claude Code subscription on top and we are at maybe USD $700–800 of agentic spend per month for what built and now maintains the entire Frollie backend.

Compare to the obvious alternatives:

The cost stack, plainly A twelve‑person product team at HelloFresh comparable scope, fully loaded, runs ~AUD $2.5M/year.
A single mid‑level engineer in Sydney is ~AUD $200k all‑in.
A senior contractor doing the same work bills ~AUD $1,500/day — sixty days is AUD $90k.

Frollie OS, sixty days of build, agent stack inclusive: ~USD $1,500 total.

The agent stack is two orders of magnitude cheaper than the cheapest human alternative. The number is so large it stops feeling like a comparison and starts feeling like a category change. The right way to read it isn't "we saved money" — it's "we removed the cost line from the P&L entirely."

What broke.

Three failure categories, in order of how often I encountered them. I keep a tally in a notebook because the pattern is actually informative.

Failure 1Bad spec. My fault.

I wrote a one‑paragraph requirement that was ambiguous on a critical edge case. The model produced something internally consistent that solved the wrong problem. Caught at review, mostly. Twice it shipped to staging before I caught it. Once it shipped to production. Fix in production was thirty minutes; the embarrassment lasted longer.

The lesson is unsexy and very obvious in retrospect: a well‑written spec is the single most leveraged thing a non‑engineer can do for an agent. I now spend twice as long writing specs as I did at month one, and the failure rate is down by roughly an order of magnitude.

Failure 2Brittle context. Architecture fault.

The model lost the shape of the codebase between sessions and produced code that worked locally but violated a system‑level invariant — a database constraint, a billing assumption, a state‑machine contract. Painful when it happened, because the local correctness made the breakage harder to spot.

The fix was a CLAUDE.md at the repo root that the planning sub‑agent has to read on every spawn. It captures the system invariants — the things that are true everywhere, not just in the file you're touching. Should have been there from day one. Wasn't, because I'd never built a system with this much surface area before, and you don't write down the invariants you don't know you have until they bite you.

Failure 3Model judgement gaps. Model fault.

The model picks the locally elegant solution over the globally correct one. The most common version: aggressive deduplication of code that should have stayed distinct because the duplication signalled two different concepts. Less common but more painful: a security shortcut that was technically correct but bypassed a defence‑in‑depth I had elsewhere.

This category is the one where senior judgement actually matters. The fix is not a config change or a markdown file — it's the human reviewer asking "is this the right shape of solution?" before greenlighting the merge. The agent will produce confidently wrong answers fast if you let it. The discipline of pausing is the new senior contribution.

What this means for FMCG operators.

Three implications, each working its way through how I now think about the next venture.

Implication 1Build capacity is no longer the bottleneck.

The bottleneck is direction. (See also: Force × direction.) What do you want? Why? What does winning look like? Those questions used to be the easy part of strategy because the answer would take six months to build anyway. Now the build is the cheap part, and the questions are the expensive part.

Implication 2Senior judgement is the only remaining moat.

The agent will produce the wrong thing fast and confidently if you let it. The reviewer who knows what good looks like — what a healthy P&L feels like, what a brittle architecture smells like, what a real customer will and won't tolerate — is the new senior contribution. McKinsey trained mine over seven years. I'm grateful daily.

Implication 3The team you don't hire is the team you actually save.

Not in salary — in coordination cost. A twelve‑person product team has sixty‑six pairwise communication channels. A solo operator with sub‑agents has zero. The agent stack doesn't just replace headcount; it removes the entire coordination overhead that was the actual bottleneck of any small business trying to scale.

"The agent stack doesn't just replace headcount; it removes the coordination overhead that was the bottleneck of any small business trying to scale."

Where this leaves me.

Two months in, the operating system runs the business. The business turns over enough to pay for itself. The agentic spend is a rounding error on the marketing budget. There is no engineering team to hire, manage, or lose.

I am one operator with a wife, a daughter, and a Claude Code session running in the background while I write this. The team I never hired never broke up. The roadmap I never wrote never slipped. The architecture I never centralised never bottlenecked.

Building taught me what driving felt like. Driving taught me what servicing missed. Servicing taught me what good looks like. None of those reps were free — but the next sixty days will not need any of them to be repeated. That, I think, is the part of this that's actually new.

— Lucas, February 2026 · field note from Frollie OS development