
The AI Productivity Paradox, and the Team-Context Infrastructure That Fixed Mine

[Header image: glowing crystalline brain at the centre of a chaotic data field, knowledge filaments converging from below]

Written by Mike McGreal, Dendro Logic | Engineering | Published: 1 May 2026


The contradiction nobody is talking about loudly enough

Adoption is real. Stack Overflow’s 2024 Developer Survey reports that 63.2% of professional developers now use AI tools, with another 13.5% planning to adopt soon, and that 81% identify productivity as the biggest benefit they hope for. The narrative writes itself: AI is making engineering teams faster.

The measured impact tells a different story.

In a randomised controlled trial published by METR in 2025, 16 experienced open-source developers worked through 246 real issues from their own repositories, half with AI tools, half without. Developers expected AI to speed them up by 24%. The actual measured result: AI made them 19% slower. The most striking finding: even after experiencing the slowdown, developers still believed AI had sped them up by 20%.

DORA’s 2024 research flags the same gap from a different angle. Only 24% of developers trust AI-generated code a lot or a great deal, and roughly half do not interact with AI as an automated part of their toolchain at all. Adoption is high, integration and trust are not.

“Our AI suggestions saved each developer 15-20 percent of their time. Our cycle time is unchanged. Some weeks it is worse.”

“We are shipping more code, fewer features, and the same number of bugs.”

“I spend half my AI time undoing what the AI did to a part of the codebase it did not understand.”

The headline narrative is AI made us faster. The lived experience is AI made us busier. Those are not the same thing.

This is an article about why, and about a system I built to close the gap on a real codebase.


What is actually happening

LLMs are very good at generating plausible code. They are very bad at generating code that fits.

A team’s codebase is a graph of decisions made over years: this auth flow, that database schema, this naming convention, that retry policy. Most of these decisions are not written down anywhere a model can see. They live in git log archaeology, in three engineers’ heads, in a Slack thread from June.

When you ask an AI assistant to build a new feature, the assistant has access to the open file and maybe a handful of imports. It does not have access to:

  • The 47 endpoints that already exist in the same app
  • The shape of the database tables it is about to write against
  • The decision made last sprint to deprecate the pattern it is about to use
  • The fact that the function it is writing fresh already exists three modules over

So it does what plausible-text generators do. It writes plausible-shaped code. That code reaches a PR. The reviewer spots three things wrong with it. The reviewer fixes the surface, misses the deeper conflict, and ships it. Two weeks later, debugging.

The bottleneck has moved from writing code (where AI helps) to “debugging code that AI wrote without understanding the codebase” (where AI hurts). Net velocity, flat or worse.

I have lived this on my own products. I have watched it on a paid contract team. The pattern is identical at solo scale and team scale.


The three failure modes, named

1. Amnesia

Every new session starts cold. Every new chat is a stranger to your project. The 90 minutes you spent yesterday explaining the auth flow to the agent: gone. You will explain it again today. And tomorrow. And so will every other developer, separately, if your team is more than one person.

2. Hallucinated endpoints

Asked to call an existing API, the model invents an endpoint it thinks should exist. Asked to query a table, it invents a column. Asked to use a service, it imports it from a package that does not exist. These are not hard bugs to spot once they hit a PR. They are expensive bugs to write, review, and unwind.

3. Tribal silos

The deepest failure. There is one engineer on the team who actually understands how the notification system works. Until that engineer is in the room, every change to the notification system is gambling. The AI cannot save you here, because the AI does not know what that engineer knows. Most of what they know is not written down.

This third one is the productivity cap most teams will not name out loud, because naming it makes the bus factor uncomfortable.


What I built

A self-hosted second brain that solves all three.

It is not a personal note-taking app. It is team-context infrastructure that LLM agents query before they suggest anything. The brain holds the current shape of the codebase as a structured matrix: endpoints, schema, dependencies, environment variables, decisions, the full reasoning trail of the conversations that produced them. When an agent (or a human) asks “what is the current way we do X”, the brain answers from grounded sources, with citations, in a few hundred tokens.

I run one instance for my own product portfolio. I run a second instance for a contract engineering team. The architecture is identical: a single config flag flips it between solo and team mode, as sketched below. The team instance adds contributor-scoped subdirectories and a pre-write secret scrubber on raw transcripts; everything else is the same.
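
To make that flag concrete, here is a minimal sketch. The names (BrainConfig, team_mode) are my illustration, not the real private config, but the behaviour is as described: one boolean drives both the contributor scoping and the secret scrubbing.

```python
from dataclasses import dataclass

@dataclass
class BrainConfig:
    # Illustrative names only; the real config is private.
    team_mode: bool = False  # the single flag that flips solo <-> team

    @property
    def contributor_scoped_dirs(self) -> bool:
        # Team instances shard the vault into per-contributor subdirectories.
        return self.team_mode

    @property
    def scrub_secrets_before_write(self) -> bool:
        # Team instances pass raw transcripts through a secret scrubber first.
        return self.team_mode
```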

The brain runs on a single NVIDIA DGX Spark, which I benchmarked separately. One box, ~£4,000 retail, holds the inference for a 120-billion-parameter mixture-of-experts model plus a smaller utility model, plus the brain itself, plus the memory vault. No cloud LLM bill on a typical day. Cloud tiers (Gemini Flash, Gemini Pro) are present as fallbacks, used when the local stack hiccups.

The brain ships:

  • A FastAPI server with 14 routers (memory, sessions, integrations, dashboard, vault browser, observability, and so on)
  • A Discord bot that exposes the brain’s specialist agent teams (research, business, engineering, operations, brand, media) for any conversation you want to have
  • A scheduled job runner with twelve-plus background jobs (heartbeat monitoring, wiki recompiler, hallucination detector, daily reflection, session aggregation, and so on)
  • A web dashboard with vault file browsing, session views, manual reindex controls, and observability for jobs and LLM costs
  • An MCP server sidecar that exposes brain capabilities as tools any AI agent can call: memory_search, memory_ask, memory_expand_ref, memory_session_list, session_log (a sketch follows this list)
  • Per-session capture that writes every assistant turn to disk before the next turn fires, so an IDE crash loses at most one in-flight turn (sketched below)
  • A periodic re-summariser that turns raw session transcripts into topical chunks (decisions, blockers, progress, open questions, follow-ups), each independently retrievable
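
The MCP sidecar is the piece that lets any agent runtime call the brain directly. A minimal sketch using the official MCP Python SDK’s FastMCP server: the tool name matches the real surface (memory_search), but the body here is a placeholder standing in for the brain’s retrieval pipeline.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("polaris-brain")

@mcp.tool()
def memory_search(query: str, scope: str | None = None) -> str:
    """Search the brain's vault index and return dense, cited summaries."""
    # Placeholder: the real implementation hits the retrieval pipeline and
    # returns hash-referenced summaries suitable for progressive drill-down.
    return f"(results for {query!r} in scope {scope or 'all'})"

if __name__ == "__main__":
    mcp.run()  # serves over stdio; Claude Code or Cursor registers it as a tool server
```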

Together these solve the amnesia problem. Sessions accumulate. The brain remembers. Tomorrow’s agent starts where yesterday’s agent finished, on the same project, with the same context.
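
The trickle-capture write path, sketched under assumptions (the file layout and names are mine, not the real implementation). The point is that each assistant turn is persisted off the agent’s critical path, so a crash costs at most the in-flight turn.

```python
import asyncio
import json
import time
from pathlib import Path

SESSION_DIR = Path("vault/sessions")  # assumed layout

def _append(path: Path, line: str) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")

async def capture_turn(session_id: str, turn: dict) -> None:
    """Persist one assistant turn as a JSONL line without blocking the agent.

    Blocking file I/O runs in a thread; callers schedule this with
    asyncio.create_task so the next turn can fire immediately.
    """
    line = json.dumps({"ts": time.time(), **turn})
    await asyncio.to_thread(_append, SESSION_DIR / f"{session_id}.jsonl", line)
```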


Inspirations, lifts, and what is new

I did not invent the second brain. The space has prior art.

Andrej Karpathy has been making the case for years that personal AI infrastructure should be local-first, model-tier-aware, and built around the developer’s actual workflow rather than a generic chatbot. His framing of Software 2.0 and his more recent talks on AI agents shaped the tier-routing approach, the local-inference-first posture, and the bias toward small specialist models for utility work that this brain leans on.

Cole Medin’s open-source projects map directly onto choices in this brain: local-ai-packaged for the local-first deployment pattern, ai-agents-masterclass and ottomator-agents for agent design patterns, context-engineering-intro for the practical retrieval shaping that makes RAG actually work. The vault-as-source-of-truth approach (markdown files as canonical, git-tracked, with the index built on top) is straight from this lineage.

What is new, or at least uncommon enough to be worth writing about:

  1. Token-efficient progressive drill-down with stable hash references. Most retrieval systems return whole chunks or whole files. This one returns dense summaries with stable hashes, the agent decides whether to drill (one hash gives the section, the same hash with a flag gives that section plus its neighbours, the same hash with another flag gives the whole file). Most queries answered at layer one or two. The agent never receives a 50 KB blob when 200 tokens of TOC entry will do.
  2. MCP-first access from day one. Every brain capability is exposed as an MCP tool. Any Claude Code session, Cursor instance, or compatible agent runtime can call memory_search, memory_ask, memory_expand_ref, memory_session_list, session_log, and record_adr directly. The brain meets agents where they already live, rather than asking developers to switch tools.
  3. Trickle-fed session capture engineered for crash safety. Most save-the-chat patterns rely on a clean session-end hook. IDE crashes, computer restarts, network drops, and force-quits all bypass that hook. The trickle pattern writes after every assistant turn, asynchronously, never blocking the agent. Worst case loss is one in-flight turn.
  4. Repo matrix as a living document, designed to evolve from session capture. This is the spine of the team-context capability. Discussed in detail below.
  5. Half-life decay applied to every chunk. Older content surfaces less for queries about today’s work. Decay is age-based, derived from each chunk’s last-modified timestamp, with a configurable half-life and per-category weights (decisions, lessons, facts, projects, entities). Today’s work ranks ahead of stale content of similar surface relevance, automatically. A scoring sketch follows this list.
  6. The web dashboard as a first-class team surface. Headless agents make a brain invisible. Without a UI a team will not trust it and will not manage it. The dashboard is the social interface that lets the brain be reviewed, curated, and operated by the team rather than treated as a black box only the operator can poke. The dashboard is exposed only on the private Tailscale network, never to the public internet, with API-key auth on every endpoint behind that. The network boundary is the security boundary.
  7. Product-scope filtering at retrieval time. The brain knows which content belongs to which product (Leap, the contract team’s product, my consultancy itself, and so on) via a scope registry the team curates. When an agent session opens against a particular repo, retrieval is filtered to that scope automatically. A query in a Leap session will not return chunks from an unrelated product, even when the semantics match. This is what makes a multi-product team brain practical instead of a noisy mush.
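
Item 5 is the most formula-shaped of these, so here is the scoring sketch. Exponential half-life decay: a chunk’s weight halves every half_life_days, scaled by a per-category weight. The function names and the final-score composition are my illustration of the described behaviour, not the brain’s actual code.

```python
import time

def decay_weight(last_modified_ts: float,
                 half_life_days: float = 30.0,
                 category_weight: float = 1.0) -> float:
    """Age-based exponential decay: weight halves every half_life_days."""
    age_days = (time.time() - last_modified_ts) / 86400
    return category_weight * 0.5 ** (age_days / half_life_days)

# Final ranking combines surface relevance with freshness, e.g.:
#   score = similarity(query, chunk) * decay_weight(chunk.mtime,
#                                                   half_life_days,
#                                                   weights[chunk.category])
```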

The repo matrix, and why I think it matters

This is the part most “personal AI memory” projects do not have, and it is the part that actually solves the productivity paradox.

When the brain meets a repo it has never seen, a mechanical scanner walks the file tree and produces a matrix: endpoints, dependencies, database schema, environment variables, framework, top-level architecture. This is mechanical, not LLM-generated. Cheap, fast, deterministic, reproducible. The matrix lives in the vault as a markdown file alongside the other product memory.
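
A minimal sketch of the mechanical-scanner idea, confined to one concern (endpoints) and one convention (FastAPI-style route decorators). The real scanner also covers schema, dependencies, and environment variables, and handles more frameworks; the point here is that there is no LLM anywhere, just a deterministic walk and a regex.

```python
import re
from pathlib import Path

ROUTE_RE = re.compile(r'@\w+\.(get|post|put|patch|delete)\(\s*["\']([^"\']+)')

def scan_endpoints(repo_root: str) -> list[tuple[str, str, str]]:
    """Walk the tree and extract (METHOD, path, source_file) rows mechanically."""
    rows = []
    for py in Path(repo_root).rglob("*.py"):
        for m in ROUTE_RE.finditer(py.read_text(errors="ignore")):
            rows.append((m.group(1).upper(), m.group(2), str(py)))
    return sorted(rows)

# Each row becomes one line of the markdown matrix: cheap, fast, reproducible.
```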

The matrix is a living document. As developers (or LLM agents) work on the repo, the structured session-capture pipeline produces topical chunks: decisions made, endpoints added, schemas changed, dependencies bumped. A periodic job reads those chunks and proposes updates to the matrix, citing the session that produced each update. A human reviews and promotes; over time, automation can take more of the load.

The result is a current, mechanical, session-fed picture of the codebase that any agent can query before suggesting code. New endpoint suggestions are checked against the matrix. Hallucinated endpoints get caught at the matrix-lookup stage, before they hit a PR. Tribal knowledge dissolves: “only one engineer knows how the auth flow works” becomes “the matrix shows auth wired here, here, and here, with these JWT claims, and the last decision touching it was three weeks ago in this session”. Onboarding a new dev (or a new agent session) takes minutes, not days.

The mechanical scanner runs on demand against any repo and produces the matrix in minutes. The session-capture-feeds-matrix job runs in the background and proposes updates as developer conversations introduce new endpoints or schema changes, a human approves before they land. Both are running today against my own product portfolio.


Defence in depth against hallucination

Most “AI memory” systems trust the LLM. This one does not. There are two layers, both in production, both with their own job.

Layer 1: HaluGate, at compile time. The brain has a wiki compiler that folds source material into topic articles. After every fold, an asynchronous job extracts the factual claims from the article, cross-references each claim against the source material the LLM was given, and flags any claim that has no supporting source. Risk score (low, medium, high) lands in the article’s frontmatter, high-risk articles trigger a notification on the team chat for review. Fabricated content does not land in the vault unchecked.

Layer 2: citation verification at serve time. Every answer the brain synthesises for a user (via memory_ask and similar) carries inline [#hash] citations after each claim. Before the answer is returned, a verifier dispatches each claim to a verification model concurrently, asks “is this claim supported by the cited source?”, and flags any miscites in the response itself. Latency: about a second for a typical three-or-four-claim answer. The user sees the answer plus an honest summary: claims verified, claims miscited, claims uncited.
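
Layer 2, sketched. Here fetch_source and ask_verifier are stand-ins for the brain’s internals (hash-indexed retrieval and the verification model); the shape that matters is the concurrent dispatch, which is what keeps the added latency near a second.

```python
import asyncio
import re
from typing import Awaitable, Callable

CITE_RE = re.compile(r"\[#([0-9a-f]{6,})\]")  # inline [#hash] citations

async def verify_answer(
    answer: str,
    fetch_source: Callable[[str], Awaitable[str]],        # hash -> source text
    ask_verifier: Callable[[str, str], Awaitable[bool]],  # (claim, source) -> supported?
) -> dict[str, list[str]]:
    # Split the answer into sentences and keep those carrying a citation.
    claims = [(s, m.group(1))
              for s in re.split(r"(?<=[.!?])\s+", answer)
              if (m := CITE_RE.search(s))]

    async def check(claim: str, ref: str) -> tuple[str, bool]:
        source = await fetch_source(ref)
        return claim, await ask_verifier(claim, source)

    # All claims are verified concurrently, not sequentially.
    results = await asyncio.gather(*(check(c, r) for c, r in claims))
    return {"verified": [c for c, ok in results if ok],
            "miscited": [c for c, ok in results if not ok]}
```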

The companion to Layer 2 is an uncited-claim flagger that examines any sentence in the answer without a citation. Each uncited claim gets classified as one of three things: likely sourced but citation missed (the source exists in the vault, the model just forgot to cite it), found in team sources (the answer is right but came from the team brain rather than the personal one, suggesting the right scope to query next time), or genuine knowledge gap (no support in the vault at all, with a suggested location for the user to capture the new fact).

That last one is the one I find most interesting. Instead of pretending to know, the brain says: “I do not have a source for this. Here is where you might want to record it.” It actively asks to be taught.

The same defence-in-depth primitive guards the matrix-evolution job: before a session-derived candidate endpoint update is appended to a repo matrix, the verifier checks whether the claim has a supporting source in the session transcript. If not, the update is not appended.


Beyond memory: specialist agents, compounding knowledge, and a chat front-end for non-developers

Memory and retrieval are the spine, not the whole skeleton. Three more capabilities worth naming, because they multiply the value at team scale.

Specialist agent teams with curated tool access

The brain runs six director-pod teams, each with a director persona, specialist agents, and a curated tool surface:

  • Research – researcher and security expert, with web search, vault search, drive upload, and library lookups
  • Business Board – product-specific advisors that reason against the current portfolio context
  • Engineering – engineering manager persona with code search, architecture lookup, and library docs
  • Operations – calendar, email, document, and vault tools for daily admin work
  • Brand and Content – copywriter persona with web search, image generation, and brand guidelines
  • Media – media manager persona for the household media stack

Each team is fronted by a director that decomposes the work and dispatches to the right specialist. Behind it sits a ReAct-pattern orchestrator with iterative tool-calling, a per-run cost cap, a circuit breaker after three consecutive failures, a full audit trail in the brain’s database, and PII masking on the audit logs. None of this is exposed to the calling team member: they ask a question, the right specialist answers.
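
The two guardrails named above are simple to state in code. A sketch with hypothetical names and thresholds (both the cost cap and the failure count are described as configurable):

```python
class RunGuards:
    """Per-run cost cap plus a circuit breaker on consecutive tool failures."""

    def __init__(self, cost_cap_usd: float = 0.50, max_failures: int = 3):
        self.cost_cap_usd = cost_cap_usd
        self.max_failures = max_failures
        self.spent_usd = 0.0
        self.consecutive_failures = 0

    def record_call(self, cost_usd: float, succeeded: bool) -> None:
        self.spent_usd += cost_usd
        # Any success resets the breaker; failures accumulate toward the trip.
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1

    def should_continue(self) -> bool:
        # The ReAct loop checks this before every tool-calling iteration.
        return (self.spent_usd < self.cost_cap_usd
                and self.consecutive_failures < self.max_failures)
```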

For a team this means: a product manager asks the Research team to compile competitor positioning, an engineer asks the Engineering team to plan a feature against the current codebase, the marketing lead asks the Brand team to draft a launch post. Same brain, same memory, different specialists.

The wiki compiler: knowledge that compounds with use

A daily background job folds source material (session summaries, meeting notes, ingested conversations, any markdown that lands in the vault) into curated wiki topic articles. The compiler runs through quality gates: HaluGate (described above), a contradiction checker that flags claims that disagree with prior versions, and a confidence scorer that records how well-supported each article is.

The result is a knowledge layer that gets richer over time without manual curation. After the first month, a query about a project returns a topic article that consolidates everything the brain has ever ingested about it, with citations back to every source. After three months, the topic article is a near-canonical reference. After six months, it is what new team members read first.

This is the bit most personal-brain projects do not have. They store, they do not compound. The compiler is the difference.

A channel-based chat front-end for non-developers, in both directions

Not every team member lives in an IDE. The brain exposes itself as a bot in whatever channel-based comms platform the team already uses. My instance runs in Discord; for an engineering team on Slack, the equivalent integration is straightforward, because the brain does not care which protocol calls it. The persona-aware channel pattern (researcher, librarian, operations, business advisor, and so on) maps the same way: each channel is fronted by a specialist agent appropriate to the work happening in it.

A product manager asks the librarian persona to summarise this week’s design decisions without opening a code editor. An ops lead asks the operations persona to draft a customer-facing email grounded in the current product positioning. The same retrieval, the same memory, the same specialist agents, different surface.

Crucially, the integration runs in both directions. The brain does not just serve through your channel platform, it ingests from it. A scheduled job at end of day summarises every team conversation into a daily digest with per-conversation references back to the full archive. Decisions made in chat become searchable brain content the next morning. The “we discussed this in Slack three weeks ago, who remembers what we decided” problem dissolves: the brain remembers, with citations to the message.

For a team that runs on chat, this is the integration that closes the loop. The chat is where decisions get made. The chat is where the brain captures them. The brain is where the next agent session starts.

Gotchas as first-class institutional memory

Engineering teams rediscover the same bugs every six months because the team member who hit the bug first has moved on, or wrote a Confluence page nobody reads, or documented it only in a Slack thread that has scrolled away.

The brain treats gotchas as a first-class entity type. Each lives in a structured markdown file under a dedicated folder, tagged for retrieval, with a fixed shape: the symptom, the root cause, the fix, the reproduction (a capture sketch follows the examples). Examples from real production use across my own work and the contract team:

  • “Library X validates against the bundled schema and silently accepts invalid input.”
  • “A particular DSL does not support nested array filters and returns empty results without erroring.”
  • “An LLM model’s reasoning_effort setting eats the max-tokens budget on chain-of-thought and breaks downstream JSON parsing.”
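
The fixed shape, as a capture sketch. The template fields mirror the ones named above (symptom, root cause, fix, reproduction); the folder layout and frontmatter keys are my assumptions about a reasonable vault convention, not the brain’s actual schema.

```python
from pathlib import Path

GOTCHA_TEMPLATE = """\
---
type: gotcha
tags: [{tags}]
---
# {title}

## Symptom
{symptom}

## Root cause
{root_cause}

## Fix
{fix}

## Reproduction
{reproduction}
"""

def record_gotcha(vault: Path, title: str, **fields: str) -> Path:
    """Write one gotcha into the dedicated folder, tagged for retrieval."""
    path = vault / "gotchas" / (title.lower().replace(" ", "-") + ".md")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(GOTCHA_TEMPLATE.format(title=title, **fields), encoding="utf-8")
    return path
```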

When an agent encounters a problem similar to a documented gotcha, the brain surfaces it in retrieval. New team members get the institutional bug history on day one. The same bug does not get rediscovered in month seven.

This is the second-order productivity win. The first-order win is “AI suggests fewer wrong things”. The second-order win is “the team stops repeating its own mistakes”.

Decision records and a self-synthesising adversarial reviewer

Two related capabilities that together turn the brain into something more than a passive memory.

ADRs as searchable decision history, with inline capture from any agent session. Every architectural decision gets recorded in a standard format (status, context, decision, alternatives considered, consequences) under a sequentially-numbered file. Seventeen and counting in my own brain to date, covering everything from “which LLM tier for which workload” to “why we route brain access through MCP from day one”. The compiler indexes ADRs alongside everything else, so when a future agent session asks “should we use approach X for this?” the brain surfaces the previous decision with the reasoning, not just the outcome. Decisions never have to be re-litigated from scratch. A record_adr MCP tool lets any session create a new ADR inline as the decision is being made: the agent calls the tool with title, context, decision, alternatives and consequences, the brain assigns the next sequential number, writes the file, and indexes it for retrieval. Decisions land in the searchable knowledge layer the moment they are made.
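
The only non-obvious mechanic in record_adr is the sequential numbering. A sketch of the assignment step, assuming a zero-padded filename convention (the real tool’s internals are private):

```python
import re
from pathlib import Path

def next_adr_number(adr_dir: Path) -> int:
    """Scan files named like 0017-use-mcp-from-day-one.md; return max + 1."""
    nums = [int(m.group(1))
            for p in adr_dir.glob("*.md")
            if (m := re.match(r"(\d{4})-", p.name))]
    return max(nums, default=0) + 1

def adr_path(adr_dir: Path, title: str) -> Path:
    """Assign the next number and derive a slugged filename for the new ADR."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return adr_dir / f"{next_adr_number(adr_dir):04d}-{slug}.md"
```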

Adversarial reviewer, synthesised weekly from real review patterns. The brain runs a weekly background job that scans the vault for code review feedback, PR comments, quality-gate findings, and architectural pushbacks recorded over the last week. It compiles those into an anonymous reviewer persona that captures what effective review on this team actually catches: naming conventions the team enforces, integration assumptions that have bitten before, error-handling patterns that keep slipping through, architectural boundaries that get violated. Anonymous by design, per a separate ADR: the persona captures what good review catches, not who caught it. That persona is then available to any agent session as a code reviewer of last resort: “before this lands, run it past the adversarial reviewer”. Every team’s bug patterns are specific to that team; codifying them into a reviewer persona turns one engineer’s hard-won pattern recognition into shared infrastructure.

Together these two are why the brain gets sharper over time, not just bigger. ADRs preserve reasoning. The reviewer learns from review.

Heartbeat and proactive surfacing

A scheduled job, the heartbeat, monitors signals the brain has been told matter (urgent emails, calendar conflicts, blocked PRs, missed deadlines, anomalies in monitored systems). When something rises above a configured threshold, the brain posts a Discord notification with the context and a recommended action.

Most second-brain projects are reactive – you ask, they answer. Proactive surfacing turns the brain into something more like an always-on engineering operations layer. It catches things the team forgot to ask about.


Living alongside the team’s existing tools: the cache hierarchy

The brain does not replace your team’s Atlassian stack. Most engineering teams already pay for Jira and Confluence and have years of useful information sitting in both. The brain plays a specific role in that ecosystem.

It becomes the L1 cache. Every AI session, every developer query, hits the brain first.

The hierarchy is binding, injected into every Claude Code session at startup as a non-negotiable rule:

  • L1, Polaris Brain – always queried first. Compiled, curated, project-scoped, fast (under 50 ms typical).
  • L2, Atlassian – fall back here if L1 is incomplete. Query Jira and Confluence via the official Atlassian MCP server (the one most Claude Code installs already have). The brain is not reinventing this integration, it is sitting in front of it.
  • L3, Web search – last resort, only for genuinely external information.

The critical mechanic is L2 write-back. After any Atlassian lookup, the agent contributes a summary back to the brain. The next session asking the same question finds the answer in L1 and never touches Atlassian. Over weeks, the brain accumulates coverage of the team’s Atlassian content. Atlassian queries decrease, latency drops, cost drops. Every knowledge gap becomes a one-time cost.
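
The hierarchy plus write-back, as a sequencing sketch. The callables stand in for the real MCP tool surfaces (the brain’s memory_ask and session_log, the Atlassian MCP server, a web-search tool); what matters is the order of the lookups and the write-back step.

```python
from dataclasses import dataclass
from typing import Awaitable, Callable

@dataclass
class L1Hit:
    answer: str
    confident: bool

async def answer(
    query: str,
    l1_ask: Callable[[str], Awaitable[L1Hit]],          # brain: memory_ask
    l2_search: Callable[[str], Awaitable[str | None]],  # Atlassian MCP lookup
    l2_writeback: Callable[[str], Awaitable[None]],     # brain: contribute summary
    l3_web: Callable[[str], Awaitable[str]],            # last resort
) -> str:
    hit = await l1_ask(query)
    if hit.confident:
        return hit.answer                     # L1: curated, scoped, under 50 ms typical
    found = await l2_search(query)
    if found is not None:
        await l2_writeback(f"Q: {query}\nA: {found}")  # next time this is an L1 hit
        return found
    return await l3_web(query)                # L3: genuinely external information only
```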

For a team that uses Jira for tickets, Confluence for documentation, and AI for development, this turns three previously-disconnected sources of truth into a single first-stop layer that gets richer with use. New joiners hit the brain and get answers grounded in the team’s actual ticket history and documentation, not in the model’s training-data approximation of what “a Jira ticket about X usually looks like”.

The brain’s MCP tools and the Atlassian MCP tools coexist in the same Claude Code session. The hierarchy is what makes them sequential rather than competing.


Production case study: my own product portfolio

The proof is in usage. I run this brain against my own product portfolio. The most heavily-instrumented case is Leap, a career mobility and capability verification platform built by Dendro Logic Ltd.

Leap is a Turborepo monorepo on Vercel and Supabase. Five apps under the same workspace:

  • talent app (talent.leapcareer.io)
  • recruiter app (recruit.leapcareer.io)
  • enterprise admin (enterprise.leapcareer.io)
  • marketing site (leapcareer.io)
  • internal admin (overlord.leapcareer.io)

Plus a small standalone service, fetch-proxy, running as a Docker container on the brain’s own host hardware. A bearer-token-auth HTTP proxy with SSRF protection and a strict allowlist, used by the talent app to fetch external content where direct cloud-to-cloud requests are not the right architecture.

Leap is a work in progress. Development has so far been weighted heavily toward the talent app. The recruiter, enterprise, and marketing apps have minimal endpoints today because they sit on foundational scaffolding waiting for their next build phase, not because they will stay that thin. The brain reflects exactly this: 46 endpoints catalogued in talent, single digits across the others, the matrix evolving as each app gets its turn.

  Component                              Catalogued
  API endpoints (talent app)             46
  API endpoints (recruiter app)          4
  API endpoints (enterprise app)         1
  API endpoints (marketing app)          2
  API endpoints (overlord app)           9
  API endpoints (fetch-proxy)            2
  Total endpoints                        64
  Postgres tables (Supabase)             70+, full RLS coverage
  External scrape domains (allowlist)    30+
  Frontend framework                     Next.js on Vercel
  Backend                                Supabase (Postgres + Auth + Edge Functions + Realtime + Storage)
  Vector search                          Supabase pgvector for skill matching
  Deployment                             Vercel (frontend), Supabase (backend), Polaris (fetch-proxy)

Each row in this table is a drill-down handle. Asking the brain “what are the talent app endpoints” returns the actual list of 46 routes with their methods, paths, auth notes, expected request payload structure (keys, types, required vs optional), response shape, and a one-line description of each, sourced directly from the route files in the repo. The matrix in the article is the index, not the data, because the data is too long to belong in an article. The agent that needs the detail asks for it.

When I open a fresh agent session against Leap, I do not need to explain any of this. The agent queries the brain, gets the matrix, and starts grounded. When I ask “where would I add a new endpoint for X”, the agent checks whether something similar already exists before suggesting code. When I plan a new feature, the agent reasons against the actual data model rather than a guess.

The brain also holds the decision log (work/leap/leap-decisions.md), the architecture review notes (work/leap/leap-architecture-review-2026-04-16.md), and the principles doc (work/leap/leap-principles.md). All grounded, all retrievable, all decay-weighted so the most recent decisions surface first.

I returned to Leap after a three-week gap on another project. The brain knew exactly where I had been. The session reopened with current state in context. There was no re-onboarding. There was no “where was I”. This is the experience the system is built to deliver.


What it does not solve

Two honest limits.

Garbage in, garbage out. The brain reflects what has been ingested, not what is true. If your team’s decisions are inconsistent, or your repo’s documentation is stale, the brain will surface the inconsistency. It will not invent the right answer. The hallucination guard catches LLM-introduced fabrications, it does not catch stale facts that were correctly transcribed at the time.

Trust takes time. Teams that have been burned by AI hallucinations are appropriately sceptical of “the AI knows the codebase now”. The web dashboard, the citations on every retrieval, and the human-in-the-loop on matrix updates are all designed to earn trust slowly. There is no shortcut.


What this actually is

It is not a personal-productivity layer. It is not a note-taking app with AI features. It is not RAG over your documents.

It is team-context infrastructure that LLM agents hit before they hallucinate.

The architectural insight, if there is one worth taking away, is that the bottleneck in AI-augmented engineering is not generation speed, it is grounding. Generation speed is solved. Grounding is the unsolved problem. Every additional second of grounding work pays back in hours of avoided debugging.

A self-hosted second brain that ingests the repo matrix, captures the conversations, and serves both back to LLM agents over a standard protocol is a cheap and reliable answer. It does not need a frontier model. It does not need multi-million-pound infrastructure. One DGX Spark, one vault, one team.

The same architecture works at solo scale (my own product portfolio) and team scale (a paid contract engineering team). The principles are the same, the configuration flag is one bool.

If your team has more AI suggestions than ever and the same delivery velocity as last quarter, the gap between those two numbers is the cost of un-grounded AI. A team-context brain is what closes it.


Open questions I am still chewing on

How to share matrices between brain instances when teams overlap, without losing scope isolation. How to handle private decisions in a team brain that should not leak to all members. How to make the trust-earning curve shorter without compromising on safety. How far the auto-merge line can be pushed on matrix evolution before human review becomes a bottleneck rather than a quality gate.

Comments and pushback welcome. The brain code is private for now while the design stabilises. The principles are not.

hello@dendro-logic.com


The brain runs in production against my own product portfolio (Leap, CoreThread, Dendro Logic itself) and a paid engineering contract team. Hardware: single NVIDIA DGX Spark.