Recursive self-improvement, you said?

Spawning a fleet of coding agents is a solved problem. You write a for loop, you call the Agent tool N times, you go get coffee. The unsolved problem is everything wrapped around the spawn: deciding what can actually run in parallel, stopping the agent that wrote the code from also grading (or eating) its own homework, and — the part nobody ships — recording how the run went so the next one isn’t the same run with the same mistakes.

I wish I didn’t remember this anymore but this used to be called a “retrospective” in that sect I was once a member of.

I’ve been dogfooding a small orchestration skill (agent-team-orchestration, open in voitta-ai/skillz) that treats those as the actual work. Three runs in. This is the first write-up, warts very much included — the warts are the only part with information in them.

The shape, and the one non-negotiable rule

Start with a conversation, not a spawn. Before any developer agent exists, an architect reads the open issues (gh issue list, then actually gh issue view each one) and the repo, and produces the one deliverable that’s genuinely hard: the parallel set. Independent work (different modules, no shared schema, PRs that won’t collide on merge) fans out; everything else serializes (shared files, a migration that has to land first, B’s acceptance depends on A). Get that wrong and you don’t get parallelism, you get merge conflicts with extra steps.

Then each issue in the wave gets a squad, roles deliberately split so no agent both writes and blesses the same diff:

  • developer — its own git worktree, opens the PR;
  • adversarial reviewer — a different agent, briefed to break the diff, not rubber-stamp it;
  • SDET — drives the change like a user;
  • productivity engineer — a meta-role that watches the process: every stall, every human approval, every bit of rework, written down.

The dev/reviewer split is load-bearing. The instant the context that wrote the code also reviews it, the review is theater.

And the telemetry is free, which is the best price. Every Claude Code session is a complete JSONL transcript at ~/.claude/projects/<slug>/<uuid>.jsonl — every tool call, every AskUserQuestion, every answer you gave. (We’ll gate the privacy policy to not log every breath you take).

TFW that retrospective is not a wishful thinking, it’s actionable.

Three runs, in ascending order of interesting

Run 1 — shipped clean, screwed up in a way I didn’t catch until I read the log. Two bug fixes on a production Next.js + Prisma app (two-branch staging/prod). Both merged, deployed, SDET-verified green. Then I read the transcript: the two bugs already had open PRs from a prior run. The architect never looked. We’d built and squash-merged duplicates, closed the issues, and orphaned two perfectly good PRs.

That’s not an agent being dumb. It’s a hole in the recipe. “Choose the parallel set” reasoned about file overlap and ordering and never asked the first question a human lead asks — is anyone already on this? — which is one gh pr list away. Second tell, same run: asked “where’s the evidence the reviewer approved these?”, the answer was nowhere. The verdicts lived in the agents’ context and never touched the PR. An approval that leaves no durable artifact didn’t happen. (Worse, squash-merge later buried even the merge-commit note, but I’m getting ahead of myself.)

Run 2 — the loop closed, and I have receipts. New work — a homepage redesign across seven sub-issues — same skill. At startup the agent did something I didn’t tell it to: it ran gh issue view 122 on the prior run’s recorded retro and read the engagement log. Then it did exactly the things Run 1 botched. It pre-flighted existing PRs. Every merge carried an adversarial verdict with specifics; the reviewer caught a dead query param (?q= where the target route reads ?search=) and sent it back with REQUEST_CHANGES.

Then it got interesting. A staging route started returning 500. The team traced it to schema drift, and went to fix the deploy pipeline by adding prisma db push. The safe version (no --accept-data-loss) did the right thing and aborted:

⚠️ There might be data loss when applying the changes:
• drop column `negotiableTerms` on `Property` (1 non-null value)
Error: Use the --accept-data-loss flag to ignore the data loss warnings

It refused to drop a column with live data, surfaced it for a human call, took a one-time --accept-data-loss against staging only, reconciled, and reverted — production never saw the flag. The redesign isn’t the headline. The headline is that the run improved because it had read how the last run went. Best current read: that’s the flywheel, showing up unprompted.

Run 3 — we pointed it at itself, which is geekily elegant, and scientifically noble I scraped every point across Runs 1–2 where an agent stopped to ask a human to approve something — fifteen gates — dumped them into one issue, and ran the skill on that issue. The architect grouped the fifteen by type, correctly separated the gates worth keeping (destructive DB ops — yes, always ask) from the avoidable friction (re-asking a runtime question it already answered two turns ago), and — the good part — ran two of the fixes on its own execution before they were written into the skill. It pre-flighted with gh pr list and caught two pre-existing issues that overlapped the work, exactly the Run-1 bug, fixed live by the thing being fixed.

What’s actually carrying the weight

  • The parallel-set call is real architecture. Run 3 ran two repos in parallel but serialized five edits that all touched one SKILL.md into a single PR — instead of four agents racing to conflict on the same file.
  • Build/attack/verify pays rent. The reviewer caught a bug the developer was happy with. Once is enough to justify the second agent.
  • Worktree-per-issue keeps the squads from knifing each other.
  • The flight recorder is the product. Every stall is a candidate fix — a default, a permission, a pre-flight, a sharper brief.

Where it falls down (best current read)

  • The headline feature has never once fired. The skill leads with “every agent is a watchable terminal tab you can steer mid-run.” That needs the root session launched through the cmux claude-teams wrapper, which prepends a tmux shim to PATH (CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 alone is a red herring — diagnose with which tmux + echo $TMUX). Three runs, three fallbacks to background agents, because the session wasn’t started that one specific way. A feature nobody reaches isn’t a feature, it’s a positioning bug.
  • The same two process bugs recur every run until baked in: a setup question asked at spawn time instead of as a step-0 precondition, and re-asking a decision already made. Prose doesn’t self-correct — the executor re-litigates your opinions until you encode them as defaults.
  • N=3 and confounded. Run 2’s wins rode on memory carried from Run 1, so I can’t yet split skill-value from memory-value. The compounding loop is a strong signal, not a proof. The honest next experiment is one run on a clean, never-seen repo, launched under cmux, with no carried memory, measured by a typed telemetry schema — which doesn’t exist yet, so I’m building that before I build anything else.

The actual thesis

Spawning is commodity; the moat is the operating doctrine plus the telemetry loop — the thing that makes human-interventions-per-issue trend down run over run. Build the instrument first, defer the spaceship. YAGNI applies to strategy, too.

Skill’s open in voitta-ai/skillz. Run it on your backlog and tell me where it stalls. The stalls are the entire point.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.