The 25% problem
Your BitBakery update for March 2026
University of Waterloo researchers tested 11 large language models on a range of software tasks. The headline finding was straightforward: even the most capable models topped out at around 75% accuracy. Open-source alternatives ran about 10 points behind that.
These aren’t catastrophic numbers. In fact, for many tasks they’re impressive. But they do clarify something important about where human judgment fits in the new workflow.
In a post we published earlier this month, we talked about the shift from vibe coding to agentic engineering, the term Andrej Karpathy now prefers (he coined “vibe coding” in the first place).
The distinction matters.
Vibe coding was prompt-and-hope. Agentic engineering means orchestrating AI systems with real oversight. You define the architecture, set the quality gates, and own what ships.
The Waterloo research puts a number on why that oversight layer isn’t optional. If AI tools are wrong 25% of the time on simpler tasks—and more often on complex ones like image, video, and website generation—then the engineer’s job isn’t disappearing. It’s changing.
Increasingly, that job is to catch the 25%.
That requires seniority. It requires knowing what good architecture looks like before the agent starts writing. And it’s why we keep coming back to the same principle—AI augments, it doesn’t replace. Every line that ships is still a human decision.
This is especially relevant if you’re evaluating an outsourced development partner right now. The question isn’t whether they use AI tools. Anyone worth working with does. The question is what their human layer looks like around those tools.
If you want to talk about keeping humans in the loop as you adopt AI, I’m always available for a call.
— Wes