Recursive self-improvement, you said?

Spawning a fleet of coding agents is a solved problem. You write a for loop, you call the Agent tool N times, you go get coffee. The unsolved problem is everything wrapped around the spawn: deciding what can actually run in parallel, stopping the agent that wrote the code from also grading (or eating) its own homework, and — the part nobody ships — recording how the run went so the next one isn’t the same run with the same mistakes.

I wish I didn’t remember this anymore but this used to be called a “retrospective” in that sect I was once a member of.

I’ve been dogfooding a small orchestration skill (agent-team-orchestration, open in voitta-ai/skillz) that treats those as the actual work. Three runs in. This is the first write-up, warts very much included — the warts are the only part with information in them.

The shape, and the one non-negotiable rule

Start with a conversation, not a spawn. Before any developer agent exists, an architect reads the open issues (gh issue list, then actually gh issue view each one) and the repo, and produces the one deliverable that’s genuinely hard: the parallel set. Independent work (different modules, no shared schema, PRs that won’t collide on merge) fans out; everything else serializes (shared files, a migration that has to land first, B’s acceptance depends on A). Get that wrong and you don’t get parallelism, you get merge conflicts with extra steps.
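Here is a minimal sketch of the parallel-set decision, under loud assumptions: the architect has already reduced each issue to the set of files it will touch plus any "must land after" dependencies. The function name plan_waves, the issue labels, and the file names are all hypothetical; the skill itself does this reasoning in prose, not in code.

```python
def plan_waves(issue_files, depends_on=None):
    """Group issues into waves.

    Issues in the same wave touch disjoint files, and every dependency of an
    issue has landed in an earlier wave. Inputs are hypothetical:
      issue_files: {issue: iterable of file paths it will edit}
      depends_on:  {issue: iterable of issues that must land first}
    """
    depends_on = depends_on or {}
    remaining = list(issue_files)
    done = set()
    waves = []
    while remaining:
        wave, claimed = [], set()
        for issue in remaining:
            files = set(issue_files[issue])
            deps_ok = set(depends_on.get(issue, ())) <= done
            # Fan out only if nothing in this wave already owns one of the files.
            if deps_ok and not (files & claimed):
                wave.append(issue)
                claimed |= files
        if not wave:
            raise ValueError("unsatisfiable ordering among: %r" % remaining)
        waves.append(wave)
        done |= set(wave)
        remaining = [i for i in remaining if i not in done]
    return waves
```

Anything that shares a file with an earlier pick, or waits on an unmerged dependency, drops to a later wave. That is the "everything else serializes" half of the rule.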

Then each issue in the wave gets a squad, roles deliberately split so no agent both writes and blesses the same diff:

  • developer — its own git worktree, opens the PR;
  • adversarial reviewer — a different agent, briefed to break the diff, not rubber-stamp it;
  • SDET — drives the change like a user;
  • productivity engineer — a meta-role that watches the process: every stall, every human approval, every bit of rework, written down.

The dev/reviewer split is load-bearing. The instant the context that wrote the code also reviews it, the review is theater.

And the telemetry is free, which is the best price. Every Claude Code session is a complete JSONL transcript at ~/.claude/projects/<slug>/<uuid>.jsonl — every tool call, every AskUserQuestion, every answer you gave. (We’ll gate the privacy policy to not log every breath you take).
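To make "free telemetry" concrete, here is a hedged sketch that counts tool calls in one transcript. The only thing taken from above is that the file is JSONL. The record shape is an assumption: I treat any nested object with "type": "tool_use" and a "name" as a tool call, and the walker is deliberately schema-agnostic so it does not depend on where such objects sit.

```python
import json
from collections import Counter


def iter_tool_uses(node):
    """Yield every dict that looks like a tool call, wherever it is nested."""
    if isinstance(node, dict):
        # Assumed shape: {"type": "tool_use", "name": "...", ...}
        if node.get("type") == "tool_use" and "name" in node:
            yield node
        for value in node.values():
            yield from iter_tool_uses(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tool_uses(item)


def count_tool_calls(lines):
    """Count tool calls by name across an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or non-JSON line
        for call in iter_tool_uses(record):
            counts[call["name"]] += 1
    return counts
```

Pass it an open file handle for one of the .jsonl transcripts. Under the same assumption, the count for AskUserQuestion is a first rough proxy for how often the run stopped to ask.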

TFW that retrospective is not wishful thinking, it’s actionable.

Three runs, in ascending order of interesting

Run 1 — shipped clean, screwed up in a way I didn’t catch until I read the log. Two bug fixes on a production Next.js + Prisma app (two-branch staging/prod). Both merged, deployed, SDET-verified green. Then I read the transcript: the two bugs already had open PRs from a prior run. The architect never looked. We’d built and squash-merged duplicates, closed the issues, and orphaned two perfectly good PRs.

That’s not an agent being dumb. It’s a hole in the recipe. “Choose the parallel set” reasoned about file overlap and ordering and never asked the first question a human lead asks — is anyone already on this? — which is one gh pr list away. Second tell, same run: asked “where’s the evidence the reviewer approved these?”, the answer was nowhere. The verdicts lived in the agents’ context and never touched the PR. An approval that leaves no durable artifact didn’t happen. (Worse, squash-merge later buried even the merge-commit note, but I’m getting ahead of myself.)
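The missing pre-flight is small enough to sketch. This assumes the open PRs were fetched with something like gh pr list --state open --json number,title,body and parsed into a list of dicts, and that a PR "claims" an issue when its title or body uses a closing keyword such as "Fixes #118". The function name and the keyword heuristic are mine, not the skill's.

```python
import re

# GitHub-style closing keywords followed by an issue reference.
CLOSES = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.IGNORECASE
)


def issues_already_claimed(issue_numbers, open_prs):
    """Map issue number -> open PR numbers that say they close it."""
    wanted = set(issue_numbers)
    claimed = {}
    for pr in open_prs:
        text = "%s\n%s" % (pr.get("title") or "", pr.get("body") or "")
        for match in CLOSES.finditer(text):
            issue = int(match.group(1))
            prs = claimed.setdefault(issue, []) if issue in wanted else None
            if prs is not None and pr["number"] not in prs:
                prs.append(pr["number"])
    return claimed
```

A non-empty result is the architect's cue to stop and ask whether to adopt the existing PR instead of spawning a squad. A PR that fixes an issue without saying so will slip through, so this is a floor, not a guarantee.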

Run 2 — the loop closed, and I have receipts. New work — a homepage redesign across seven sub-issues — same skill. At startup the agent did something I didn’t tell it to: it ran gh issue view 122 on the prior run’s recorded retro and read the engagement log. Then it did exactly the things Run 1 botched. It pre-flighted existing PRs. Every merge carried an adversarial verdict with specifics; the reviewer caught a dead query param (?q= where the target route reads ?search=) and sent it back with REQUEST_CHANGES.

Then it got interesting. A staging route started returning 500. The team traced it to schema drift, and went to fix the deploy pipeline by adding prisma db push. The safe version (no --accept-data-loss) did the right thing and aborted:

⚠️ There might be data loss when applying the changes:
• drop column `negotiableTerms` on `Property` (1 non-null value)
Error: Use the --accept-data-loss flag to ignore the data loss warnings

It refused to drop a column with live data, surfaced it for a human call, took a one-time --accept-data-loss against staging only, reconciled, and reverted — production never saw the flag. The redesign isn’t the headline. The headline is that the run improved because it had read how the last run went. Best current read: that’s the flywheel, showing up unprompted.

Run 3 — we pointed it at itself, which is geekily elegant, and scientifically noble. I scraped every point across Runs 1–2 where an agent stopped to ask a human to approve something — fifteen gates — dumped them into one issue, and ran the skill on that issue. The architect grouped the fifteen by type, correctly separated the gates worth keeping (destructive DB ops — yes, always ask) from the avoidable friction (re-asking a runtime question it already answered two turns ago), and — the good part — ran two of the fixes on its own execution before they were written into the skill. It pre-flighted with gh pr list and caught two pre-existing issues that overlapped the work, exactly the Run-1 bug, fixed live by the thing being fixed.

What’s actually carrying the weight

  • The parallel-set call is real architecture. Run 3 ran two repos in parallel but serialized five edits that all touched one SKILL.md into a single PR — instead of four agents racing to conflict on the same file.
  • Build/attack/verify pays rent. The reviewer caught a bug the developer was happy with. Once is enough to justify the second agent.
  • Worktree-per-issue keeps the squads from knifing each other.
  • The flight recorder is the product. Every stall is a candidate fix — a default, a permission, a pre-flight, a sharper brief.

Where it falls down (best current read)

  • The headline feature has never once fired. The skill leads with “every agent is a watchable terminal tab you can steer mid-run.” That needs the root session launched through the cmux claude-teams wrapper, which prepends a tmux shim to PATH (CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 alone is a red herring — diagnose with which tmux + echo $TMUX). Three runs, three fallbacks to background agents, because the session wasn’t started that one specific way. A feature nobody reaches isn’t a feature, it’s a positioning bug.
  • The same two process bugs recur every run until baked in: a setup question asked at spawn time instead of as a step-0 precondition, and re-asking a decision already made. Prose doesn’t self-correct — the executor re-litigates your opinions until you encode them as defaults.
  • N=3 and confounded. Run 2’s wins rode on memory carried from Run 1, so I can’t yet split skill-value from memory-value. The compounding loop is a strong signal, not a proof. The honest next experiment is one run on a clean, never-seen repo, launched under cmux, with no carried memory, measured by a typed telemetry schema — which doesn’t exist yet, so I’m building that before I build anything else.

The actual thesis

Spawning is commodity; the moat is the operating doctrine plus the telemetry loop — the thing that makes human-interventions-per-issue trend down run over run. Build the instrument first, defer the spaceship. YAGNI applies to strategy, too.
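Since the typed telemetry schema doesn't exist yet, treat the following as a guess at its smallest useful shape, not a description of anything shipped. Every name here (GateKind, Run, interventions_per_issue, trending_down) is hypothetical; the three gate kinds are lifted from the failure modes described above.

```python
from dataclasses import dataclass, field
from enum import Enum


class GateKind(Enum):
    DESTRUCTIVE_OP = "destructive_op"        # worth keeping: always ask
    SETUP_QUESTION = "setup_question"        # should be a step-0 precondition
    REPEATED_QUESTION = "repeated_question"  # re-asking a decision already made


@dataclass
class Run:
    name: str
    issues: int
    gates: list = field(default_factory=list)  # one GateKind per human stop

    def interventions_per_issue(self):
        if self.issues == 0:
            raise ValueError("run has no issues")
        return len(self.gates) / self.issues


def trending_down(runs):
    """True if interventions per issue strictly decrease run over run."""
    rates = [run.interventions_per_issue() for run in runs]
    return all(later < earlier for earlier, later in zip(rates, rates[1:]))
```

Typing the gates is what makes this more than a counter: it lets you subtract the gates you want to keep (destructive ops) before judging whether the avoidable ones are shrinking.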

Skill’s open in voitta-ai/skillz. Run it on your backlog and tell me where it stalls. The stalls are the entire point.

voitta-yolt: The Missing Safety Layer for Claude Code

Voitta AI just released voitta-yolt, and it’s aimed at a very real problem: how do you let an agent move fast in the shell without giving it a blank check?

YOLO — You Only Live Once — is the vibe-coder’s operating principle: ship now, deal with consequences later.

YOLT — You Only Live Twice — is the correction.

No, it’s not a replacement for auto mode. It’s finer-grained discernment: it gives Claude Code a second look before a Bash command (or the code that command invokes, e.g. Python or SQL) actually runs.

The problem it solves

Claude Code’s built-in permission system has an awkward gap.

Some commands are obviously safe, but still annoying to approve over and over. Others are wrapped in ways that make broad allowlisting dangerous.

Two cases matter most:

  • Arbitrary-execution wrappers. python3, bash, node, gh api, curl, kubectl, and friends are too powerful to wildcard-allow safely.
  • Compound shell commands. Loops, subshells, command substitutions, and bash -c '...' forms hide the actual inner commands from the simple outer matcher.

That means you either:

  1. approve too much and weaken the safety model, or
  2. approve everything manually and hate your life.

YOLT exists to get out of that false choice.

What YOLT actually does

YOLT installs as a Claude Code PreToolUse hook on the Bash tool.

When Claude is about to run a shell command, YOLT parses the invocation, walks the structure of the command, and classifies what it finds:

  • safe → auto-allow
  • unsafe → ask for review, with a reason
  • unknown → fall back to Claude Code’s default prompt
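A toy version of that three-way verdict, to show the shape rather than the implementation. The rule table is hypothetical and is not the contents of YOLT's rules file. The "worst verdict wins" aggregation for compound commands is also my assumption about a sensible policy.

```python
from enum import Enum


class Verdict(Enum):
    SAFE = "safe"
    UNKNOWN = "unknown"
    UNSAFE = "unsafe"


# Hypothetical rules for illustration only: argv prefixes, longest intent first.
UNSAFE_PREFIXES = [("rm",), ("aws", "ec2", "terminate-instances")]
SAFE_PREFIXES = [
    ("ls",),
    ("cat",),
    ("aws", "ec2", "describe-instances"),
    ("aws", "ecs", "describe-services"),
]


def classify_argv(argv):
    """Return (Verdict, reason) for one simple command given as an argv list."""
    for prefix in UNSAFE_PREFIXES:
        if tuple(argv[: len(prefix)]) == prefix:
            return Verdict.UNSAFE, "matches unsafe rule: " + " ".join(prefix)
    for prefix in SAFE_PREFIXES:
        if tuple(argv[: len(prefix)]) == prefix:
            return Verdict.SAFE, "matches safe rule: " + " ".join(prefix)
    return Verdict.UNKNOWN, "no rule matched"


_SEVERITY = {Verdict.SAFE: 0, Verdict.UNKNOWN: 1, Verdict.UNSAFE: 2}


def combine(verdicts):
    """A compound command is only as safe as its worst part."""
    return max(verdicts, key=_SEVERITY.__getitem__, default=Verdict.UNKNOWN)
```

The reason string matters as much as the verdict: "ask for review, with a reason" is only useful if the human sees which rule fired.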

The interesting part is that it no longer treats the shell as a flat string.

The current release parses Bash with tree-sitter-bash, reconstructs argv from the AST, and then classifies each command node against rules in rules/shell.json. If the shell invocation contains inline Python, it delegates that body to a Python AST analyzer.

And it now covers a genuinely useful extra case: common SQL CLIs. sqlite3, psql, mysql, mariadb, and duckdb get their query text inspected so read-only commands like SELECT, SHOW, and .tables can pass quietly while mutating statements like INSERT, DELETE, DROP, .import, or .load get surfaced for review.
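The SQL case reduces to "look at the leading keyword of each statement". A deliberately naive sketch: the read-only and mutating words named above come from the release description, while update, alter, create and truncate are my additions, and the semicolon split is a simplification that a quoted ";" would fool.

```python
# Leading keywords named in the text above.
READ_ONLY = {"select", "show", ".tables"}
# insert/delete/drop/.import/.load are from the text; the rest are assumed extras.
MUTATING = {"insert", "delete", "drop", ".import", ".load",
            "update", "alter", "create", "truncate"}


def classify_sql(query):
    """Return 'allow', 'review', or 'unknown' for the query text of a SQL CLI."""
    saw_statement = False
    saw_unknown = False
    for statement in query.split(";"):  # naive: ignores ';' inside string literals
        words = statement.split()
        if not words:
            continue
        saw_statement = True
        head = words[0].lower()
        if head in MUTATING:
            return "review"  # one mutating statement taints the whole query
        if head not in READ_ONLY:
            saw_unknown = True
    if not saw_statement or saw_unknown:
        return "unknown"
    return "allow"
```

Note what is not on the read-only list: anything the sketch does not recognize falls through to "unknown" rather than "allow", which is the conservative default.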

So this is not just “grep for scary words.” It’s structured analysis.

Why that matters

This is the real improvement over naive allowlists.

A normal matcher sees the wrapper:

  • bash -c "..."
  • for ...; do ...; done
  • $(...)
  • <(...)

YOLT walks inside those forms.

That means a loop full of read-only AWS inspection commands can be auto-approved, while a destructive operation buried inside a process substitution still gets surfaced for review.
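To show what "walking inside" means without pulling in a grammar, here is a much weaker stand-in built on Python's shlex. It only understands command separators and bash -c / sh -c wrappers; loops, subshells, and substitutions are exactly the cases where this approach runs out and a real parser such as tree-sitter-bash earns its keep. The function names are hypothetical.

```python
import shlex

SEPARATORS = {";", "&&", "||", "|", "&"}
SHELLS = {"bash", "sh"}


def tokenize(command):
    """Split a shell string into words and operator tokens (Python 3.8+)."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    lexer.commenters = ""  # do not treat '#' as a comment start
    return list(lexer)


def inner_commands(command):
    """Return the argv of every simple command, descending into `bash -c '...'`."""
    found, argv = [], []
    for token in tokenize(command) + [None]:  # None flushes the last command
        if token is not None and token not in SEPARATORS:
            argv.append(token)
            continue
        if len(argv) >= 3 and argv[0] in SHELLS and argv[1] == "-c":
            # The wrapper's payload is itself a shell string: recurse into it.
            found.extend(inner_commands(argv[2]))
        elif argv:
            found.append(argv)
        argv = []
    return found
```

An outer matcher sees only bash. This sees the rm hiding behind it, which is the whole argument for classifying inner commands rather than wrappers.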

That’s the right shape of safety tooling for agentic coding: less theater, more actual inspection.

The architectural shift

The sharpest detail in the release is that YOLT has already outgrown its first framing.

What began as a Python-script safety hook is now a more general shell-execution analyzer with language-specific followers.

The current structure is roughly:

  • hooks/grammar_classifier.py — Bash AST walker
  • hooks/rule_classifier.py — argv-level command classification
  • hooks/yolt_analyzer.py — Python AST analysis when Python appears inline

That’s a better architecture than a pile of string heuristics, and the repo history shows exactly why the rewrite happened: quote-state edge cases, heredocs, substitutions, continuations, and shell grammar weirdness are not bugs you “finish.” They are why parsers exist.
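The inline-Python follower can be illustrated with the standard ast module. This is a sketch of the idea, not the contents of yolt_analyzer.py: the deny-list is hypothetical, and it only catches fully dotted calls, so from os import remove would slip past unless imports are tracked as well.

```python
import ast

# Hypothetical deny-list for illustration; not YOLT's actual rule set.
RISKY_CALLS = {
    "os.remove", "os.system", "shutil.rmtree", "subprocess.run", "subprocess.Popen",
}


def dotted_name(node):
    """Rebuild 'a.b.c' from nested Attribute/Name nodes, or return None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def risky_calls(source):
    """Return sorted (line, dotted_name) pairs for calls on the deny-list."""
    tree = ast.parse(source)  # raises SyntaxError on unparseable input
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in RISKY_CALLS:
                hits.append((node.lineno, name))
    return sorted(hits)
```

A caller would reasonably map a SyntaxError to "unknown" rather than "safe": code the analyzer cannot parse is code it cannot vouch for.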

Using a real grammar here is the grown-up move.

Practical wins

A few details make this more than a neat demo:

  • It supports both plugin install and manual hook install.
  • It explicitly warns that broad static allow rules like Bash(python3:*) or Bash(aws:*) can bypass the hook entirely.
  • It can use the user’s existing permissions.allow patterns as a secondary upgrade pass for otherwise-unknown inner commands.
  • The new SQL CLI handling is exactly the sort of practical expansion I like: not theoretical safety, but fewer prompts for read-only database inspection without waving through destructive schema/data changes.
  • It now defaults logging to ~/.claude/yolt.log, which makes dogfooding and debugging much easier.

And most importantly, the dogfood loop appears real. One recent pass through transcript history reportedly cut the classifier’s unknown rate from 60.2% to 11.7% by fixing a handful of recurring gaps. That’s the number I care about most, because it shows the project is being tuned against actual usage rather than imagined usage.

Why I think this matters

The broader point is not “Claude Code needs more hooks.”

It’s that agent safety gets much better when you stop treating the shell as an indivisible permission blob.

What you really want is a front-line gate for command execution: let the obviously safe paths go through quietly, and save human interruption for the suspicious stuff. That won’t replace every approval surface in an agent stack, but it can take a huge bite out of routine approval fatigue.

There is a big difference between:

  • aws ec2 describe-instances
  • aws ec2 terminate-instances ...
  • for svc in $(aws ecs list-services ...); do aws ecs describe-services ...; done
  • bash -c 'curl ... | sh'

A permission system that collapses all of those into “it’s Bash” is too coarse to be pleasant and too coarse to be trustworthy.

YOLT narrows that gap.

And the cleaner operational pattern is to pair that with direct API usage wherever possible. If a service already gives you a token to create a draft, update a post, or mutate a record, that is usually a better path than driving a browser through the same workflow just to satisfy the UI.

The real thesis

What’s new here is not just another safety wrapper.

What’s new is the move from tool-level permissions to structure-aware command understanding.

That is where a lot of agent tooling is headed, because the old model breaks down as soon as agents start composing commands instead of issuing one-liners.

If you want agents to operate with less friction without quietly turning root access into a vibes-based exercise, this is the kind of infrastructure you need.

Try it

YOLT is open source under AGPL v3 and available here:

https://github.com/voitta-ai/voitta-yolt

Plugin install is straightforward:

/plugin marketplace add voitta-ai/voitta-yolt
/plugin install yolt@voitta-yolt

And if you already installed it manually, the repo documents how to migrate cleanly to the plugin model.

That part matters too. Safety tooling people won’t keep updated is safety tooling that quietly dies.


Related: earlier we wrote about llm-tldr vs voitta-rag. YOLT sits in a different layer of the stack, but it comes out of the same practical question: if you are going to work with agents seriously, where do you put the guardrails so they help instead of getting in the way?

llm-tldr vs voitta-rag: Two Ways to Feed a Codebase to an LLM

Every LLM-assisted coding tool faces the same fundamental tension: codebases are too large to fit in a context window. Two recent tools attack this from opposite directions, and understanding the difference clarifies something important about how we’ll work with code-aware AI going forward.

The Shared Problem

llm-tldr is a compression tool. It parses source code through five layers of static analysis — AST, call graph, control flow, data flow, and program dependence — and produces structural summaries that are 90–99% smaller than raw source. The LLM receives a map of the codebase rather than the code itself.

voitta-rag is a retrieval tool. It indexes codebases into searchable chunks and serves actual source code on demand via hybrid semantic + keyword search. The LLM receives real code, but only the relevant fragments.

Compression vs. retrieval. A map vs. the territory.

At a Glance

           | llm-tldr                               | voitta-rag
Approach   | Static analysis → structural summaries | Hybrid search → actual code chunks
Foundation | Tree-sitter parsers (17 languages)     | Server-side indexing (language-agnostic)
Interface  | CLI + MCP server                       | MCP server
Compute    | Local (embeddings, tree-sitter)        | Server-side

What Each Does Better

llm-tldr wins when you need to understand how code fits together:

  • Call graphs and dependency tracing across files
  • “What affects line 42?” via program slicing and data flow
  • Dead code detection and architectural layer inference
  • Semantic search by behavior — “validate JWT tokens” finds verify_access_token()
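For a feel of what the call-graph layer produces, here is a tiny single-file version using Python's own ast module. It is an analogy only: llm-tldr is described as building on tree-sitter across many languages and files, and nothing below is its code. The function names call_graph and callers_of are hypothetical.

```python
import ast


def call_graph(source):
    """Map each top-level function name to the set of plain names it calls."""
    tree = ast.parse(source)
    graph = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = set()
            for inner in ast.walk(node):
                # Only direct calls like foo(...); method calls are skipped here.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
            graph[node.name] = callees
    return graph


def callers_of(graph, target):
    """Answer 'who calls this?' from the graph, sorted for stable output."""
    return sorted(name for name, callees in graph.items() if target in callees)
```

The output is a few names per function instead of the function bodies, which is the compression trade: the LLM gets the map and has to ask elsewhere for the territory.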

voitta-rag wins when you need the actual code:

  • Retrieving exact implementations for review or modification
  • Searching across many repositories indexed server-side
  • Tunable search precision (pure keyword ↔ pure semantic via sparse_weight)
  • Progressive context loading via chunk ranges — start narrow, expand as needed
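The sparse_weight dial is easiest to picture as a linear blend of two scores. That formula is my assumption, not something voitta-rag documents here: I assume both scores are already normalized to 0..1, that 1.0 means pure keyword and 0.0 means pure semantic, and the chunk identifiers are made up.

```python
def hybrid_rank(keyword_scores, semantic_scores, sparse_weight):
    """Rank chunk ids by a weighted blend of keyword and semantic scores.

    Assumed convention: sparse_weight=1.0 is pure keyword, 0.0 is pure semantic.
    A chunk missing from one side scores 0.0 on that side.
    """
    if not 0.0 <= sparse_weight <= 1.0:
        raise ValueError("sparse_weight must be between 0 and 1")
    ids = set(keyword_scores) | set(semantic_scores)
    blended = {
        chunk: sparse_weight * keyword_scores.get(chunk, 0.0)
        + (1.0 - sparse_weight) * semantic_scores.get(chunk, 0.0)
        for chunk in ids
    }
    # Highest score first; break ties by id so the order is deterministic.
    return sorted(blended, key=lambda chunk: (-blended[chunk], chunk))
```

Sliding the weight toward keyword helps when you know the identifier you want; sliding it toward semantic helps when you only know what the code does.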

The Interesting Part

These tools don’t compete — they occupy different layers of the same workflow. Use llm-tldr to figure out where to look and why, then voitta-rag to pull the code you need. Static analysis for navigation, RAG for retrieval.

This mirrors how experienced developers actually work: first you build a mental model of the architecture (“what calls what, where does data flow”), then you dive into specific files. One tool builds the mental model; the other hands you the files.

The fact that both expose themselves as MCP servers makes combining them straightforward — plug both into your editor or agent and let the LLM decide which to call based on the question.

Reverse-engineering and keratinous biomass reduction in bos grunniens

Not that we needed all that for the trip, but once you get locked into a serious drug collection, the tendency is to push it as far as you can.

Hunter S. Thompson

Reverse-engineering is kinda fun. More fun when we can shave the yak by adding more tools to our LLM/MCP toolbox, amirite?

So I accidentally came across this LinkedIn post about an SVG diagramming tool for Claude. I was working on some diagrams as part of reverse engineering and had been making agents create those with Mermaid, but I thought I’d give it a try.

Well, that was a flock of wild geese chasing a red herring down a rabbit hole to borrow a shear…

First, I thought the idea was clever, but I wanted more cowbell (because we don’t have enough animals in this post), so I forked that and vibe-coded an MCP server on top of that.

Then I tried to use it to create a few architecture diagrams but I found it actually somewhat lacking. When the client (Claude Desktop) was using it, I didn’t love the editing capability. When the client was not using it, it created nicer-looking diagrams somehow (in SVG, yes) and with legends and stuff. But of course the graph layout still sucked. So I’d need to manually edit it.

Well, screw that, said I. I’ll use AWS MCP server, said I.

Screw that, said I next.

Then I modified the prompt to ask not for SVG but for DOT format of GraphViz. Much better, I said. And then, uh… It could have gone better, right? But at this point I’m not sure how to improve the prompt.

But I know what to do when I don’t know something, right?

Yes. I give the DOT file to the LLM and ask it to tweak it a certain way. Then I ask why it made that change. Then I, of course, ask it to fix the original prompt. And it’s turtles (yes, we’re in a zoo and you’re reading it on a Safari) all the way down.

And what do we learn, Palmer? Well, never mind, let us draw the curtain of charity over the rest of this scene.

(Well, not quite true: using DOT is better here than spelling out explicit layout instructions like “30px”.)
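The reason DOT wins is that you state the edges and let Graphviz own the layout. A minimal sketch of what the agent is being asked to emit; the helper name to_dot and the node labels in the usage note are illustrative, not from any real diagram of mine.

```python
def to_dot(edges, name="architecture", rankdir="LR"):
    """Render (source, target) pairs as a Graphviz DOT digraph string."""

    def quote(label):
        # Quoted DOT identifiers only need embedded double quotes escaped.
        return '"' + label.replace('"', '\\"') + '"'

    lines = ["digraph %s {" % quote(name), "  rankdir=%s;" % rankdir]
    for source, target in edges:
        lines.append("  %s -> %s;" % (quote(source), quote(target)))
    lines.append("}")
    return "\n".join(lines)
```

Feeding it [("API Gateway", "Lambda"), ("Lambda", "DynamoDB")] and rendering the result with dot -Tsvg gives a laid-out diagram with no coordinates anywhere in the prompt or the output.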


NOTE: multiple individuals of bos grunniens species have undergone keratinous biomass reduction, which also included:

The moral of the story is absent.

Coding assistants musing

I love me my Cline, Claude Code and company. But there’s one major thing I found missing from them: I want my assistant to be able to step through a debugger with me and examine variables and the call stack. Somehow this doesn’t exist, even though it would help with figuring out the flow of an unfamiliar program, for example.

Now, the JetBrains MCP Server Plugin gets some of the way there, but it has limits. It can set breakpoints, but because it analyzes code as text it often gets confused. For example, when asked to set a breakpoint on the first line of a method, it would put it on the method signature or an annotation.

And it doesn’t do anything in terms of examining the code state at a breakpoint.

So I decided to build on top of it, see JetBrains-Voitta plugin (based on a Demo Plugin). It:

  • Uses IntelliJ PSI API to provide more meaningful code structure to the LLM (as AST)
    • This helps with properly setting breakpoints from verbal instructions
    • Hopefully this should also prevent some hallucinations about methods that do not exist (educated guess).
  • Adds more debugging capability, such as inspecting the call stack and variables at a given breakpoint.

    Here are a couple of example debug sessions:

Much better.

And completely vibe-coded.
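The breakpoint problem described above (landing on a signature or annotation instead of the first real statement) is easy to show with a syntax tree. This is a Python analogue using the standard ast module, purely to illustrate why structure beats text matching. The plugin itself goes through IntelliJ's PSI, which this sketch does not touch, and first_body_line is a hypothetical name.

```python
import ast


def first_body_line(source, function_name):
    """Line number of the first executable statement in the named function.

    Decorators and the def line are skipped because they are not part of the
    body; a leading docstring is skipped when a later statement exists.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name == function_name:
            first = node.body[0]
            is_docstring = (
                isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)
            )
            if is_docstring and len(node.body) > 1:
                first = node.body[1]
            return first.lineno
    raise LookupError("no function named %r" % function_name)
```

A text-based search for the function name would stop at the def line or the decorator above it. The tree knows where the body starts.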

Maybe do something with Cline next?

MCP protocol of choice: stdin/stdout? WTF, man?

Let’s talk about MCP. More specifically, let’s talk about using stdin/stdout as a protocol transport layer in the year of our Lord 2025.

Yes, yes—it’s universal. It’s composable. It works “everywhere.” It’s the spiritual successor to Unix pipes, which were cool at the time. The time when my Dad was hitting on my Mom. As an actual transport layer, stdin/stdout is a disaster.

Debugging Is Basically a Crime

Let’s say I want to create an MCP server in Python. Reasonable. Now let’s say I want to debug it. Set a breakpoint. Inspect variables. Use threads. Maybe spin up the LLM in the same process for context. You know, software engineering.

The moment you try to do this, you’re writing a debug driver. Congratulations. You are now:

  • Building a fake client to simulate a streaming LLM
  • Implementing bidirectional IO while praying the LLM doesn’t send surprise newline characters
  • Wrapping things in threads and/or asyncio or multiprocessing or whatnot other total fucking bullshit.

Been there. Twice:

  • Voitta’s Brokkoly: Thought I could run the LLM and the driver in one process. Spent 3 hours implementing queues, got it half-working, and realized I was debugging my own debug tool.
  • Samtyzukki: Round two. Same problem. Ended up with more abstraction layers than a Kafka conference.

Eventually, I just gave up and decided to use SSE (Server-Sent Events). Because you know what’s great about SSE? You can log things. You can see the messages. You can debug. It’s like rediscovering civilization after weeks of wilderness survival with only printf() and trauma.
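Part of why SSE is debuggable is that the wire format is just labelled text lines. A minimal framing helper following the Server-Sent Events format (an optional event: line, one or more data: lines, then a blank line); the function name and the example payload are illustrative and not taken from any MCP SDK.

```python
import json


def format_sse(data, event=None):
    """Serialize one message as a Server-Sent Events frame."""
    lines = []
    if event is not None:
        lines.append("event: " + event)
    payload = json.dumps(data)
    # Each line of the payload needs its own 'data:' prefix; compact JSON
    # from json.dumps is a single line, but the loop keeps the rule explicit.
    for part in payload.split("\n"):
        lines.append("data: " + part)
    return "\n".join(lines) + "\n\n"  # blank line terminates the frame
```

Because a frame is plain text over HTTP, you can log it, curl it, or read it in the browser's network tab, and the server's stdout stays free for whatever wants to print to it.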

stdout Is Sacred, Until It Isn’t

Here’s the other problem. stdout is a shared space. You can’t count on it. Libraries will write to it. Dependencies will write to it. Your logger will write to it. Some genius upstream will write:

print("INFO: falling back to CPU because the GPU is feeling shy today.")

Congratulations. You just corrupted your transport. Your parser reads that as malformed JSON or a broken packet or an existential and spiritual crisis.
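If you are stuck with stdio anyway, the usual defensive move is to hold on to the real channel and point sys.stdout somewhere harmless while foreign code runs. A sketch with the standard library; noisy_library_call, send, and handle_request are stand-ins I made up, and the caveat is that this only catches Python-level writes, so a C extension writing straight to file descriptor 1 still gets through.

```python
import contextlib
import io
import json
import sys


def noisy_library_call():
    # Stand-in for a dependency that prints to stdout without asking.
    print("INFO: falling back to CPU because the GPU is feeling shy today.")
    return {"device": "cpu"}


def send(message, channel):
    """Write exactly one JSON message per line to the protocol channel."""
    channel.write(json.dumps(message) + "\n")
    channel.flush()


def handle_request(channel, log):
    # While foreign code runs, anything printed lands in `log`, not the channel.
    with contextlib.redirect_stdout(log):
        result = noisy_library_call()
    send({"jsonrpc": "2.0", "id": 1, "result": result}, channel)


def demo():
    """Run one request against in-memory streams and return what each saw."""
    channel, log = io.StringIO(), io.StringIO()
    handle_request(channel, log)
    return channel.getvalue(), log.getvalue()


if __name__ == "__main__":
    # In a real server: keep the true stdout as the channel, log to stderr.
    handle_request(sys.stdout, sys.stderr)
```

It works, but notice that the fix is a convention every handler has to remember, which is rather the point of the complaint.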

It’s not a bug. It’s a design decision—and not a good one.

This is the part where I invoke Rob Pike. Sorry. Not sorry.

In Go, to format a date, one doesn’t simply use YYYY-MM-DD. You spell the layout out as the reference time Mon Jan 2 15:04:05 MST 2006, so YYYY-MM-DD becomes 2006-01-02.

Because, I get it, we all need to get high once in a while. But srsly.