Amnesia, Hallucinated Endpoints, Tribal Silos: Why AI Made Your Team Busier, Not Faster
Key Takeaways
- A randomised trial by METR found AI made experienced developers 19% slower, while they believed it had sped them up by 20%.
- Adoption is high. Trust is low. Only 24% of developers trust AI-generated code "a lot" or "a great deal" (DORA, 2024).
- The bottleneck has moved from writing code to debugging code an AI wrote without understanding the codebase.
- Three specific failure modes explain the gap: amnesia between sessions, hallucinated endpoints inside sessions, and tribal knowledge that no agent can see.
- Each one has a concrete fix. None of them require a frontier model.
Your team adopted AI. Your delivery velocity is the same as last quarter. You suspect it is actually worse. You are not imagining it. The data says you are right.
The numbers nobody is putting on a slide deck
Stack Overflow's 2024 Developer Survey reports that 63.2% of professional developers now use AI tools, with another 13.5% planning to adopt soon. 81% identify productivity as the biggest benefit they hope for. Adoption is real.
The measured impact tells a different story.
In a randomised controlled trial published by METR in 2025, 16 experienced open-source developers worked through 246 real issues from their own repositories, half with AI tools and half without. Developers expected AI to speed them up by 24%. The actual measured result: AI made them 19% slower. The most striking finding came at the end. Even after experiencing the slowdown, developers still believed AI had sped them up by 20%.
DORA's 2024 research flags the same gap from a different angle. Only 24% of developers trust AI-generated code "a lot" or "a great deal", and roughly half do not interact with AI as an automated part of their toolchain at all. Adoption is high. Integration and trust are not.
The headline narrative is "AI made us faster". The lived experience is "AI made us busier". Those are not the same thing.
This article is about why, and what to do about it.
Failure mode 1: Amnesia
Every new session starts cold. Every new chat is a stranger to your project.
The 90 minutes you spent yesterday explaining the auth flow to the agent are gone. You will explain it again today. And tomorrow. And so will every other developer on your team, separately.
Most "AI memory" in 2026 lives at one of three altitudes: the model's training data (frozen, generic), the active context window (alive for one session), or some saved-chat feature (great for the developer who saved it, invisible to the next agent that opens the same repo). None of those persist team context across sessions, across people, or across products.
The fix is not exotic. It is persistent, project-scoped memory that the next agent reads at session start. Sessions accumulate. Decisions get captured. The agent that opens this repo tomorrow starts where today's agent finished, on the same codebase, with the same context. The trick is not the storage. The trick is the trickle-fed capture pattern: write after every assistant turn, not at session end. Most save-on-end implementations lose half a day's context to a single IDE crash.
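To make the capture pattern concrete, here is a minimal sketch in Python. The class name, the `.agent-memory/sessions.jsonl` layout, and the record shape are illustrative assumptions, not any particular product's API; the write-after-every-turn and read-at-session-start behaviour is the point.

```python
import json
import time
from pathlib import Path

class SessionMemory:
    """Append-only, project-scoped memory: one JSONL file per repo.
    Illustrative sketch -- names and layout are assumptions."""

    def __init__(self, repo_root: str):
        self.path = Path(repo_root) / ".agent-memory" / "sessions.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def capture_turn(self, session_id: str, summary: str) -> None:
        # Called after EVERY assistant turn, not at session end,
        # so a crash loses at most one turn of context.
        record = {"ts": time.time(), "session": session_id, "summary": summary}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def load_context(self, max_records: int = 200) -> list[dict]:
        # The agent that opens this repo tomorrow reads this at
        # session start and picks up where today's agent finished.
        if not self.path.exists():
            return []
        lines = self.path.read_text(encoding="utf-8").splitlines()
        return [json.loads(line) for line in lines[-max_records:]]
```

Appending one JSON line per turn is deliberately boring: no database, no locking drama, and the worst-case loss from a crash is a single turn.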
Failure mode 2: Hallucinated endpoints
Asked to call an existing API, the model invents an endpoint it thinks should exist. Asked to query a table, it invents a column. Asked to use an internal service, it imports it from a package that does not exist.
These are not hard bugs to spot once they hit a PR. They are expensive bugs to write, review, and unwind. The reviewer spots three things wrong with the PR, fixes the surface, misses the deeper conflict, and ships it. Two weeks later, an outage.
The fix here is a living matrix of the codebase that the agent queries before suggesting code. Endpoints, dependencies, database schema, environment variables, framework conventions. Mechanically scanned, not LLM-generated. Cheap, fast, deterministic, reproducible. When the agent suggests a new endpoint, it checks the matrix first. Either the endpoint exists, in which case the agent uses it, or it does not, in which case it flags the gap rather than inventing a plausible URL.
The matrix is a living document. Session capture feeds it. Decisions made today update it tomorrow. New developer (or new agent session) onboarding moves from days to minutes. Hallucinated endpoints do not survive a matrix lookup.
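As a sketch of the endpoint slice of such a matrix, assuming Flask/FastAPI-style route decorators (the regex, function names, and return strings are illustrative; a real scanner would also cover schema, dependencies, and environment variables):

```python
import re
from pathlib import Path

# Mechanical scan: a regex over route decorators, no LLM involved.
# Assumes declarations like @app.get("/users/{id}") -- an assumption,
# not a universal convention.
ROUTE_PATTERN = re.compile(
    r'@\w+\.(get|post|put|patch|delete)\(\s*["\']([^"\']+)["\']'
)

def scan_endpoints(repo_root: str) -> set[tuple[str, str]]:
    """Build the endpoint slice of the matrix as (METHOD, path) pairs."""
    endpoints = set()
    for source in Path(repo_root).rglob("*.py"):
        text = source.read_text(errors="ignore")
        for method, path in ROUTE_PATTERN.findall(text):
            endpoints.add((method.upper(), path))
    return endpoints

def check_endpoint(matrix: set[tuple[str, str]], method: str, path: str) -> str:
    """The guard the agent consults before emitting a URL."""
    if (method.upper(), path) in matrix:
        return "exists: use it"
    return f"GAP: {method.upper()} {path} is not in the matrix -- flag it, do not invent it"
```

Because the scan is deterministic and cheap, it can run on every commit, so the matrix is never more than one push stale.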
Failure mode 3: Tribal silos
This is the deepest failure, and the one most teams will not name out loud.
There is one engineer on the team who actually understands how the notification system works. Until that engineer is in the room, every change to the notification system is a gamble. The AI cannot save you here. The AI does not know what that engineer knows. Most of what they know is not written down.
Most teams will not name this because naming it makes the bus factor uncomfortable. So they ship "AI tooling for the team" without solving it, and then cannot understand why the productivity numbers look the way they do.
The fix is shared, compounding team memory. A wiki compiler that folds raw session content (decisions, gotchas, design conversations) into curated topic articles, with hallucination guards, contradiction checks, and confidence scores. Architecture Decision Records captured inline as the decision is being made, not weeks later when someone "gets around to writing it up". Gotchas as a first-class entity type so the same library quirk does not get rediscovered every six months.
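A rough sketch of what the entry model and the contradiction check could look like. The entity names, fields, and the deliberately naive comparison are illustrative assumptions, not a production wiki compiler:

```python
from dataclasses import dataclass, field
from enum import Enum

class EntryKind(Enum):
    DECISION = "decision"      # ADR-style, captured inline as the decision is made
    GOTCHA = "gotcha"          # first-class, so the same quirk is not rediscovered
    DESIGN_NOTE = "design_note"

@dataclass
class WikiEntry:
    entry_id: str
    kind: EntryKind
    topic: str                 # e.g. "notification-system"
    claim: str
    source_session: str        # provenance: which captured session produced this
    confidence: float          # 0.0-1.0, lowered when session evidence is thin
    supersedes: list[str] = field(default_factory=list)  # entry ids this replaces

def find_contradictions(new: WikiEntry, article: list[WikiEntry]) -> list[WikiEntry]:
    """Naive contradiction check: an entry on the same topic and kind,
    not explicitly superseded, making a different claim, goes to human
    review instead of silently overwriting the compiled article."""
    return [old for old in article
            if old.topic == new.topic
            and old.kind == new.kind
            and old.entry_id not in new.supersedes
            and old.claim != new.claim]
```

The source_session field doubles as the hallucination guard: a claim that cannot be traced back to a captured session never makes it into the compiled article.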
After three months of compounding, the topic article on the notification system is the canonical reference. After six, it is what new joiners read first. The single-engineer dependency dissolves. So does the bug-rediscovery loop.
What "AI made us busier" really means
The bottleneck in AI-augmented engineering is not generation speed. Generation speed is solved. Grounding is the unsolved problem.
When the agent has nowhere to ground, it generates plausible code that does not fit. The PR review chases plausibility, fixes the surface, misses the conflict. The cost moves downstream. Velocity stays flat. The team feels busier because they are. The slope is wrong.
When the agent has somewhere to ground (persistent memory, a current matrix, a wiki that compounds), generation speed translates into delivered features. The fixes do not require a frontier model. They require infrastructure: persistence, retrieval, and a quality gate against fabrication.
We run this infrastructure on a single £4,000 box. One DGX Spark, one vault, one team. The architecture works at solo scale (my own product portfolio) and at team scale (a paid contract engineering team). The principles are the same. The configuration flag is one bool.
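For what it is worth, the shape of that flag, with made-up names standing in for the real configuration:

```python
from dataclasses import dataclass

@dataclass
class BrainConfig:
    vault_path: str
    team_mode: bool = False  # the one bool: False = solo scope, True = shared team scope

    @property
    def memory_scope(self) -> str:
        # Solo: memory is keyed per developer. Team: one shared pool per vault.
        return "team" if self.team_mode else "solo"
```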
The gap between "more AI suggestions than ever" and "the same delivery velocity as last quarter" is the cost of un-grounded AI.
A team-context brain is what closes it.
Take the Next Step
If your team has been adopting AI tools and your shipping cadence has not moved, the gap is the un-grounded-suggestion problem. We have built the architecture that closes it, in production, against real codebases. Get in touch if you want to talk about what that would look like for your team.