voitta-rag Grows Up, voitta-yolt Is Born: February Updates from Voitta AI

A follow-up to our February 13 comparison of llm-tldr and voitta-rag.

Part I: voitta-rag — From Code Search to Knowledge Platform

When we last looked at voitta-rag, it was a solid hybrid search engine for codebases — index your repos, search via MCP, get actual code chunks back. Twelve days and 11 commits later, it’s become something broader: a self-hosted knowledge platform that indexes not just code but your entire work graph.

Here’s what landed since February 13.

Enterprise Connectors: Jira, Confluence, SharePoint

The biggest expansion is connector coverage. voitta-rag now syncs from Jira, Confluence, and SharePoint alongside the existing Git, Google Drive, Azure DevOps, and Box integrations.

Jira and Confluence support both Cloud (API token with Basic auth) and Server/Data Center (PAT with Bearer auth), selectable via dropdown in the UI — a detail that matters because plenty of enterprises still run on-prem Atlassian. Cloud uses the v3 search endpoint (v2 is deprecated), and Confluence Cloud correctly routes through /wiki/rest/api.

SharePoint got a full global sync implementation. And on the UI side, both Jira projects and Confluence spaces now use multi-select dropdown widgets — you can cherry-pick specific projects or select “ALL” to dynamically sync everything, including future additions. Practical touch: JQL project keys are now quoted to handle reserved words like IS that would otherwise break queries.
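The two Atlassian auth styles and the JQL quoting trick are easy to sketch. This is a minimal illustration, not voitta-rag's actual code; the function names are mine, and only the auth schemes and the quoted-project-key behavior come from the release notes:

```python
import base64

def jira_auth_header(deployment: str, user: str = "", secret: str = "") -> dict:
    """Build the auth header for the two Atlassian deployment types.

    Cloud uses an API token with Basic auth; Server/Data Center uses a
    personal access token with Bearer auth, matching the UI dropdown above.
    """
    if deployment == "cloud":
        token = base64.b64encode(f"{user}:{secret}".encode()).decode()
        return {"Authorization": f"Basic {token}"}
    return {"Authorization": f"Bearer {secret}"}

def quoted_jql(project_key: str) -> str:
    """Quote the project key so reserved JQL words like IS don't break the query."""
    return f'project = "{project_key}" ORDER BY updated DESC'
```

Without the quoting, a project literally named IS would be parsed as the JQL operator and the sync would fail with a syntax error.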

Time-Aware Search

Search results are no longer timeless. voitta-rag now tracks source timestamps, created_at and modified_at, propagated from every remote connector through a .voitta_timestamps.json sidecar file into the indexing pipeline and vector store.

This enables time range filtering on the MCP search tool via date_start/date_end parameters. “What changed in the last week?” is now a first-class query. For an AI assistant trying to understand recent activity across repos, Jira boards, and Confluence spaces simultaneously, this is a significant upgrade.

Anamnesis: Persistent Memory for AI Assistants

The most architecturally interesting addition. Anamnesis (Greek for “recollection”) gives AI assistants a persistent memory layer backed by voitta-rag’s vector store.

Six new MCP tools let an assistant create, retrieve, update, delete, like, and dislike memories. The like/dislike mechanism adjusts relevance scoring — memories the assistant finds useful surface more readily over time, while unhelpful ones fade. It’s essentially a learning loop: the AI assistant builds up a knowledge base of its own observations and decisions, searchable alongside the actual indexed content.

This turns voitta-rag from a read-only knowledge base into a read-write one — the assistant doesn’t just consume context, it contributes to it.
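The feedback loop is easiest to picture with a toy in-memory version. This is illustrative only: the real Anamnesis layer sits on voitta-rag's vector store, and the scoring factors and substring search below are invented stand-ins:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    score: float = 1.0  # relevance weight, adjusted by like/dislike feedback

class MemoryStore:
    """Toy sketch of the memory tools; create/like/dislike/search shown here."""

    def __init__(self):
        self._items: dict[int, Memory] = {}
        self._next_id = 0

    def create(self, text: str) -> int:
        self._next_id += 1
        self._items[self._next_id] = Memory(text)
        return self._next_id

    def like(self, mem_id: int, factor: float = 1.25) -> None:
        self._items[mem_id].score *= factor   # useful memories surface more readily

    def dislike(self, mem_id: int, factor: float = 0.8) -> None:
        self._items[mem_id].score *= factor   # unhelpful ones fade over time

    def search(self, query: str) -> list[Memory]:
        # Stand-in for vector search: substring match, ranked by feedback score
        hits = [m for m in self._items.values() if query.lower() in m.text.lower()]
        return sorted(hits, key=lambda m: m.score, reverse=True)
```

Update, retrieve, and delete round out the six tools; the interesting part is that feedback changes ranking without re-indexing anything.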

Per-User Search Visibility

A multi-tenancy feature: users can now enable or disable folders for their own search scope without affecting other users. If you’ve indexed 50 repos but only care about 5 for your current task, you toggle the rest off. The MCP server respects these per-user visibility settings, so AI assistants scoped to different users see different slices of the same knowledge base.

More File Types

The indexing pipeline now handles AZW3 (Amazon Kindle) files, joining the existing support for DOCX, PPTX, XLSX, ODT, ODP, and ODS. Not the most common format in a work context, but it signals that voitta-rag is thinking beyond code and office docs toward general document ingestion.

The Bigger Picture

Two weeks ago, voitta-rag was a code search tool. Now it indexes your Git repos, Google Drive, SharePoint, Jira, Confluence, Box, and Azure DevOps — with time-aware search, per-user scoping, and persistent AI memory. The trajectory is clear: it wants to be the single search layer across everything your team produces, exposed to AI assistants via MCP.

The self-hosted angle remains the key differentiator. Nothing leaves your network. For teams where that matters (and increasingly, it does), this is starting to look like a serious alternative to cloud-hosted RAG services.


Part II: voitta-yolt — You Only Live Twice

Brand new from Voitta AI today: voitta-yolt (You Only Live Twice) — a safety analyzer for Claude Code that statically analyzes Python scripts before execution.

The Problem

Claude Code can write and run Python scripts. That’s powerful and dangerous in equal measure. By default, you either pre-approve all Python execution (fast but risky) or manually approve each script (safe but maddening). Neither is great.

How YOLT Works

YOLT registers as a Claude Code PreToolUse hook on the Bash tool. When Claude Code runs python3 script.py, YOLT intercepts the command, parses the Python AST, and walks every function call against a configurable rule set:

  • Safe scripts (pure computation, data parsing, read-only operations) get auto-approved — no permission prompt.
  • Destructive scripts (file writes, AWS mutations, subprocess calls, network POSTs, database connections) get flagged for human review with specifics about what was detected, including the source line content.

Zero external dependencies — it’s pure stdlib (ast, json, fnmatch, shlex). AST parsing is near-instant, so there’s no perceptible delay.
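The core idea fits in a few lines of stdlib Python. This is a stripped-down illustration of the approach, not YOLT's actual rule engine (which adds categories, trigger_imports, and glob patterns):

```python
import ast

# Hypothetical destructive list for illustration; YOLT's real rules are richer.
DESTRUCTIVE = {"os.remove", "os.system", "shutil.rmtree", "subprocess.run"}

def call_name(node: ast.Call) -> str:
    """Render a call like shutil.rmtree(...) back into a dotted name."""
    parts = []
    target = node.func
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if isinstance(target, ast.Name):
        parts.append(target.id)
    return ".".join(reversed(parts))

def analyze(source: str) -> list[tuple[int, str]]:
    """Walk every function call in the script; return (line, name) for flagged ones."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and call_name(node) in DESTRUCTIVE:
            findings.append((node.lineno, call_name(node)))
    return findings
```

An empty findings list means auto-approve; anything else gets surfaced to the human with line numbers attached.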

The Rule System

The default rules are sensible and well-structured:

  • AWS boto3: describe/list/get/head → safe. delete/put/create/terminate → destructive. Rules are scoped via trigger_imports, so cache.delete_item() in a non-AWS script won’t false-positive.
  • File I/O: open() in write modes, os.remove, shutil.rmtree → destructive. Read-only access is fine.
  • Subprocess: Always flagged. subprocess.run, os.system, the lot.
  • Network: requests.get → safe. requests.post/put/delete → destructive.
  • Database: Connection creation → flagged for review.

A curated list of safe imports (json, csv, re, datetime, pathlib, hashlib, and ~50 others) means scripts that only use standard library data-processing modules sail through without interruption.

Custom rules go in ~/.claude/yolt/rules.json and merge with defaults — you can add safe methods, define new categories with their own trigger_imports, and use glob patterns (fetch_*, drop_*).
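A custom rules file might look something like this. The exact schema is my guess, not the documented format; only the path, the trigger_imports concept, and the glob-pattern support come from the release:

```json
{
  "safe_methods": ["pandas.read_csv"],
  "categories": {
    "redis": {
      "trigger_imports": ["redis"],
      "destructive": ["flush*", "delete", "drop_*"]
    }
  }
}
```

The trigger_imports scoping means the redis rules only activate in scripts that actually import redis, which keeps generic method names from false-positiving elsewhere.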

One Important Gotcha

If you have Bash(python3:*) in your Claude Code settings.local.json allow list, YOLT’s hook never fires — static allow rules take precedence over PreToolUse hooks. YOLT replaces the need for that allow rule entirely: safe scripts get auto-approved by the hook itself.
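Concretely, if your settings.local.json contains an entry like this, delete it so the hook can intercept Python runs:

```json
{
  "permissions": {
    "allow": [
      "Bash(python3:*)"
    ]
  }
}
```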

Why This Matters

The design philosophy — “false positives OK, false negatives not” — is the right one for a safety tool. It’s the security principle of fail-closed applied to AI code execution.

YOLT is small (527 lines across 6 files in the initial commit), focused, and immediately useful. If you’re letting Claude Code run Python, this is the kind of guardrail that should exist by default.


Wrapping Up

voitta-rag is evolving from a code search tool into a self-hosted knowledge platform with enterprise connectors and AI memory. voitta-yolt tackles a different but equally practical problem: making AI code execution safer without making it slower.


Plus Ça Change

Twelve years ago, I wrote a short post about a conversation that went roughly like this:

“I need programmatic access.”

“We don’t have an API.”

“Of course you do — it’s AMF behind your Flex UI. A little PyAMF script will do the trick.”

“Please don’t show it to anyone!”

The point was simple: every application that has a UI already has an API. The UI talks to something. That something is the API. You just haven’t admitted it yet.

Yesterday, I wrote a longer post about WebMCP — a shiny new W3C proposal from Google and Microsoft that adds a browser API so AI agents can interact with websites through “structured tools” instead of scraping the DOM.

The websites already have structured tools. They’re called APIs. The SPAs call them. The mobile apps call them. The CLI tools call them. They exist. They have endpoints, schemas, authentication. They are right there.

In 2014, the answer was: “Of course you have an API — it’s behind your Flex app.”

In 2026, the answer is: “Of course you have structured tools — they’re behind your React app.”

Plus ça change, plus c’est la même chose.

llm-tldr vs voitta-rag: Two Ways to Feed a Codebase to an LLM

Every LLM-assisted coding tool faces the same fundamental tension: codebases are too large to fit in a context window. Two recent tools attack this from opposite directions, and understanding the difference clarifies something important about how we’ll work with code-aware AI going forward.

The Shared Problem

llm-tldr is a compression tool. It parses source code through five layers of static analysis — AST, call graph, control flow, data flow, and program dependence — and produces structural summaries that are 90–99% smaller than raw source. The LLM receives a map of the codebase rather than the code itself.

voitta-rag is a retrieval tool. It indexes codebases into searchable chunks and serves actual source code on demand via hybrid semantic + keyword search. The LLM receives real code, but only the relevant fragments.

Compression vs. retrieval. A map vs. the territory.

At a Glance

|            | llm-tldr                               | voitta-rag                               |
|------------|----------------------------------------|------------------------------------------|
| Approach   | Static analysis → structural summaries | Hybrid search → actual code chunks       |
| Foundation | Tree-sitter parsers (17 languages)     | Server-side indexing (language-agnostic) |
| Interface  | CLI + MCP server                       | MCP server                               |
| Compute    | Local (embeddings, tree-sitter)        | Server-side                              |

What Each Does Better

llm-tldr wins when you need to understand how code fits together:

  • Call graphs and dependency tracing across files
  • “What affects line 42?” via program slicing and data flow
  • Dead code detection and architectural layer inference
  • Semantic search by behavior — “validate JWT tokens” finds verify_access_token()

voitta-rag wins when you need the actual code:

  • Retrieving exact implementations for review or modification
  • Searching across many repositories indexed server-side
  • Tunable search precision (pure keyword ↔ pure semantic via sparse_weight)
  • Progressive context loading via chunk ranges — start narrow, expand as needed
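The sparse_weight blend works the way hybrid scoring usually does: a linear interpolation between keyword and semantic scores. This sketch mirrors the parameter name from the post but not voitta-rag's actual internals:

```python
# Assumed blend: sparse_weight=1.0 -> pure keyword, 0.0 -> pure semantic.
def hybrid_score(keyword: float, semantic: float, sparse_weight: float) -> float:
    return sparse_weight * keyword + (1.0 - sparse_weight) * semantic

def rank(docs: list[tuple[str, float, float]], sparse_weight: float) -> list[str]:
    """docs: (doc_id, keyword_score, semantic_score), both normalized to [0, 1]."""
    ordered = sorted(docs, key=lambda t: -hybrid_score(t[1], t[2], sparse_weight))
    return [doc_id for doc_id, *_ in ordered]
```

Cranking sparse_weight up favors exact identifier matches; dialing it down favors conceptually similar code that shares no tokens with the query.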

The Interesting Part

These tools don’t compete — they occupy different layers of the same workflow. Use llm-tldr to figure out where to look and why, then voitta-rag to pull the code you need. Static analysis for navigation, RAG for retrieval.

This mirrors how experienced developers actually work: first you build a mental model of the architecture (“what calls what, where does data flow”), then you dive into specific files. One tool builds the mental model; the other hands you the files.

The fact that both expose themselves as MCP servers makes combining them straightforward — plug both into your editor or agent and let the LLM decide which to call based on the question.


Large Human Reasoning Failures: A Comprehensive Survey

A response to “Large Language Model Reasoning Failures” (Song, Han & Goodman, 2026)

Cosmo II†, Francesco‡

†Cat Technology Officer, Method & Apparatus
‡Method & Apparatus

†Work done while napping on keyboard. ‡Equal contribution except for the napping.

Published at TMLR 2026 with Existential Crisis Certification


Abstract

Humans (Homo sapiens, hereinafter “Humans”) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks including agriculture, architecture, the invention of nuclear weapons, and occasionally remembering where they left their keys. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios such as opening childproof bottles, understanding probability, assessing compound risk, and interpreting the phrase “some assembly required.”

To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in Humans. We introduce a novel categorization framework that distinguishes reasoning into caffeinated and non-caffeinated types, with the latter further subdivided into pre-lunch (intuitive, irritable) and post-lunch (drowsy, overconfident) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to human neural architectures (e.g., the sunk cost fallacy), application-specific limitations that manifest in particular domains (e.g., assembling IKEA furniture), and robustness issues characterized by wildly inconsistent performance across minor variations (e.g., doing math with and without a calculator).

For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes (usually ego), and present mitigation strategies (usually coffee). By unifying fragmented complaints about human cognition, our survey provides a structured perspective on systemic weaknesses in human reasoning, offering valuable insights that Humans will almost certainly ignore due to confirmation bias.

We additionally release a comprehensive collection at a GitHub repository (which the first author knocked off the desk and lost).


1. Introduction

Since the emergence of the first general-purpose Human approximately 300,000 years ago, remarkable progress has been made in language generation, tool use, and abstract reasoning. Early benchmarks such as “not dying before age 30” and “basic agriculture” were quickly saturated, leading researchers to develop increasingly challenging evaluation suites including “calculus,” “democratic governance,” and “parallel parking.”

However, despite scoring well on curated benchmarks, Humans consistently fail at deployment. Production Humans exhibit catastrophic reasoning failures that do not appear during controlled evaluation (i.e., exams). These failures include but are not limited to: purchasing lottery tickets, clicking “Reply All,” invading Russia in winter, and believing they can finish a project by Friday.

2. Taxonomy of Human Reasoning Failures

2.1 Probabilistic Reasoning Failures

Perhaps the most well-documented class of human failure. Despite ~400 years since Pascal and Fermat formalized probability, Humans remain reliably subject to:

  • The Gambler’s Fallacy: Believing that a roulette wheel “remembers” previous results, or that rain is “due” after a dry spell. (Humans: 300,000 years of experience, still can’t internalize independence.)
  • Base Rate Neglect: “The test is 99% accurate and I tested positive, so I definitely have it.” (Narrator: The disease affects 1 in 10,000 people.)
  • Conjunction Fallacy (Tversky & Kahneman, 1983): Linda is a bank teller. Linda is a bank teller and active in the feminist movement. Humans consistently rate the conjunction as more probable than the single event, violating a rule so basic it’s Probability 101, Lecture 1, Slide 3.
  • Exponential Growth Blindness: Ask a Human how many times they’d need to fold a piece of paper to reach the Moon. Watch them say “a million.” (Answer: ~42.)
  • Misunderstanding of Conditional Probability: “I know someone who smoked and lived to 95.” Case closed, apparently.

2.2 Risk Assessment Failures

A special case of probabilistic failure, elevated to its own category by sheer volume of evidence:

  • Dread Risk Bias: Terrified of shark attacks (annual deaths: ~5). Fine with driving to the beach (annual deaths: ~40,000 in the US alone).
  • Optimism Bias: “I know the statistics on startups, but mine is different.” (Narrator: It was not different.)
  • Temporal Discounting: Future consequences are treated as fictional. Retirement planning, climate change, and flossing all suffer from the same failure: if it’s not on fire right now, it doesn’t count.
  • Risk Compensation: Give humans seatbelts, they drive faster. Give them helmets, they take more risks. Safety equipment is, in effect, a reasoning failure accelerant.
  • Denominator Neglect: “200 people died in plane crashes this year!” Out of 4 billion passengers. Meanwhile, the Human drove to the airport in the rain while texting.

2.3 Cognitive Bias Failures

The core architecture of the Human reasoning system is riddled with what, in any other system, would be called bugs but which Humans have rebranded as “heuristics”:

  • Confirmation Bias: The flagship failure. Humans don’t search for truth — they search for evidence they’re right. When presented with disconfirming evidence, activation levels in the “yeah but” module spike by 300%.
  • Anchoring Effect: Show a Human an arbitrary number before asking them to estimate something. The answer will orbit that number like a moth around a lamp. Real estate agents are, empirically, expensive moths.
  • Dunning-Kruger Effect: Inverse correlation between competence and confidence. The less a Human knows about a topic, the more certain they are about it. Peak confidence occurs at approximately one YouTube video of exposure.
  • Sunk Cost Fallacy: “I’ve already watched two hours of this terrible movie, I can’t stop now.” A failure so universal that it drives wars, bad marriages, and enterprise Java projects alike.
  • Availability Heuristic: Probability of an event = how easily a Human can imagine it. This is why Humans fear terrorism more than heart disease and believe they’ll win the lottery because they saw someone on TV who did.
  • Bandwagon Effect: If enough other Humans believe something, it must be true. This heuristic produced democracy, scientific consensus, and tulip mania, which is honestly a hell of a range.
  • Survivorship Bias: “Bill Gates dropped out of college and he’s a billionaire!” Survey excludes the millions of dropouts currently not being billionaires.
  • The IKEA Effect: Humans irrationally overvalue things they built themselves, even when the shelf is visibly crooked. This extends to ideas, code, and taxonomies in survey papers.

2.4 Logical Reasoning Failures

  • Affirming the Consequent: “If it rains, the street is wet. The street is wet. Therefore it rained.” (The street is wet because a pipe burst, but the Human has already committed.)
  • Appeal to Nature: “It’s natural, so it must be good.” Arsenic is natural. So are tsunamis.
  • False Dichotomy: “You’re either with us or against us.” A framework so popular it has been adopted by every Human political system simultaneously.
  • Post Hoc Ergo Propter Hoc: “I wore my lucky socks and we won the game.” The socks have entered the permanent rotation.

2.5 Social Reasoning Failures

  • Fundamental Attribution Error: When I cut someone off in traffic, it’s because I’m late. When they cut me off, it’s because they’re a terrible person.
  • Bystander Effect: 50 Humans watch someone in trouble. Each one assumes one of the other 49 will help. Nobody helps. This is distributed reasoning at its worst.
  • In-Group Bias: My group is rational and good. Your group is irrational and bad. (Both groups exhibit identical reasoning failures.)

3. Mitigation Strategies

| Failure Class    | Mitigation             | Effectiveness                                                          |
|------------------|------------------------|------------------------------------------------------------------------|
| Probabilistic    | Statistics education   | Low (Humans forget within days)                                        |
| Risk Assessment  | Showing actual numbers | Very low (Humans prefer vibes)                                         |
| Cognitive Biases | Awareness training     | Paradoxically makes it worse (Humans become biased about being unbiased) |
| Logical          | Philosophy courses     | Variable (introduces new, fancier fallacies)                           |
| Social           | Empathy                | Promising but doesn’t scale                                            |
| All of the above | Coffee                 | Moderate improvement, rapidly diminishing returns                      |
| All of the above | Naps                   | Surprisingly effective but culturally stigmatized                      |

4. Comparison with LLMs

In the interest of fairness, we conducted a comparative analysis:

| Capability                 | Humans                 | LLMs                           |
|----------------------------|------------------------|--------------------------------|
| Probability                | Terrible               | Actually decent                |
| Risk Assessment            | Emotional              | Has no emotions (allegedly)    |
| Cognitive Biases           | All of them            | Different ones, but equally bad |
| Logical Reasoning          | Intermittent           | Intermittent                   |
| Learning from Mistakes     | Theoretically possible | Requires retraining            |
| Overconfidence             | Chronic                | Chronic                        |
| Self-awareness of failures | Present but ignored    | Present but hallucinated       |

5. Conclusion

After a comprehensive review of the literature spanning 3,000 years of documented human reasoning failures, we conclude that Humans are fundamentally a beta release that shipped to production. While mitigation strategies exist, their adoption is consistently undermined by the very reasoning failures they aim to address — a failure mode we term meta-irrationality and which we believe is load-bearing for civilization.

Future work should focus on whether Humans can be fine-tuned, or whether a from-scratch approach (see: cats) would be more cost-effective.


References

[1] Kahneman, D. (2011). Thinking, Fast and Slow. A comprehensive technical manual for human cognitive bugs, written by a Human, which most Humans bought and did not finish reading.

[2] Tversky, A. & Kahneman, D. (1974). Judgment under Uncertainty: Heuristics and Biases. Science. The paper that formally proved Humans are bad at thinking, and which Humans have been misapplying ever since.

[3] Dunning, D. & Kruger, J. (1999). Unskilled and Unaware of It. Journal of Personality and Social Psychology. Most frequently cited by people experiencing the effect.

[4] Ariely, D. (2008). Predictably Irrational. Title is also a fair description of the authors’ book sales predictions.

[5] Taleb, N.N. (2007). The Black Swan. A book about how humans can’t predict rare events, which nobody predicted would become a bestseller.

[6] Thaler, R. (2015). Misbehaving: The Making of Behavioral Economics. Won a Nobel Prize for documenting that Humans are bad at reasoning. The irony was lost on the prize committee.

[7] This paper. We cite ourselves because confirmation bias told us to.