Resumption-First Memory: Design Rules for Agentic Systems
Every serious agentic memory system you can read about today has converged on roughly the same whiteboard. Capture the turns. Summarise them. Embed and index the result. Retrieve the relevant chunks on the next session. Inject them back into the model. The substrate is now boring enough that each layer can be bought off the shelf or wired up in an afternoon: pgvector, MCP, a cross-encoder reranker, an embedding model trained six months ago.
The mechanics have stopped being the interesting part.
What is still wide open is what shape memory should take in the first place. Most public systems still optimise for fact-recall. Remember that the user prefers concise answers. Remember that this codebase uses Vitest. Flat extracted facts, embedded, retrieved on demand. That works for a personalisation bullet in a consumer chatbot. It does not work for an agent that has to pick up a half-built feature on a Friday afternoon. The optimisation target there is different. It is resumption value.
This piece is the design-rules companion to Surviving Claude Compaction, which I published four days ago and which walked through how to wire up a five-tier session continuity stack against Claude Code. That article was a how-to. This one is the why-this-shape. Five design rules. One concrete chunk shown end to end. A landscape table. One category claim worth defending.
The landscape, May 2026
If you list what the public agentic memory systems actually do, the moving parts look strikingly similar. The differences are not in the mechanics. They are in what each is optimising for.
| System | What it does | What it optimises for |
|---|---|---|
| Letta (formerly MemGPT) | Editable memory blocks inside context, plus tiered recall and archival storage | Self-edited context blocks with tiered storage |
| mem0 | LLM extracts facts, hybrid retrieval (vector + BM25 + entity links) | Fact recall with temporal and multi-hop reasoning |
| Zep / Graphiti | Bi-temporal knowledge graph with fact validity windows | Time-aware facts with automatic invalidation |
| Cognee | ECL pipeline turning unstructured docs into a queryable knowledge graph | Unstructured-corpus ingestion into agent memory |
| OpenAI ChatGPT memory | Saved bullets plus implicit reference across past conversations | Consumer personalisation across chats |
| Anthropic memory tool | File-system memory (/memory) the model self-manages, paired with context editing | Model-driven self-paging |
Three of these (mem0, Zep, Cognee) extract facts or entities from a conversation and store them as standalone propositions. Two (Letta, the Anthropic memory tool) page content in and out of the context window based on what fits. One (ChatGPT memory) surfaces a small list of saved bullets to the user, with implicit reference across past chats added on top.
Notice what is missing. None of them is explicitly designed around the question: if the previous session was cut off mid-thought, what is the cheapest, highest-fidelity way to put the next session back in the same room? That is the agentic-workflow question. It is the question my own memory system, the Polaris brain, is built to answer, and it is not a fact-recall question. It is a resumption question. It has a different optimal answer.
Why fact-recall is the wrong target
The instinct to extract facts is reasonable. Facts are tidy. They embed well. They retrieve on demand. You can show them to the user as a list of "what I remember about you" and the user understands immediately. Fact-recall is the natural shape if you came to memory from a search-and-retrieval background.
But the moment the agent’s job is to continue work rather than answer questions, fact-recall starts to fail in a specific way. A fact is a paraphrase. A paraphrase is an interpretation of the original conversation. An agent that trusts a paraphrase quotes it back, builds on top of it, and gradually drifts away from what the user actually said. The drift compounds across sessions. After ten resumptions you are not continuing the original work. You are continuing the system’s running interpretation of the original work.
I covered this failure mode in the previous article. A digest tells the next agent what happened but not where the truth lives. Hallucination, for an agent picking up where another agent left off, is what happens when there is no way to grow comprehension of a session over time. All the agent can do is act on the digest it has been handed. The fix is not a tidier digest. It is giving the agent the means to pull on the threads in that digest and find the original turns, so each session compounds the next one’s grasp of the work rather than restarting from a paraphrase.

Resumption-first memory inverts the model. Instead of trying to compress the conversation, it provides a structured index of threads still in motion at the moment the previous session ended, with each thread carrying enough provenance for the next agent to verify rather than trust. The agent reads the index, decides which threads matter, and pulls the original exchange when it needs the actual detail. The conversation stays canonical. The index just tells the agent which threads exist and how to grab each one.
That is a different shape of memory. It has different design rules.
Five design rules
These are the five choices that, taken together, distinguish a resumption-first memory system from the fact-recall systems above. Each rule is independent of the others, but they reinforce each other in practice. I arrived at them by iterating on this since late last year: a project-scoped SQLite store first, then a complete rework of the retrieval shape in late February, watching where each version failed and which design choices held up.
Rule 1. Typed ontology, ranked by resumption value
Most memory systems treat memories as flat strings (mem0, ChatGPT memory) or rank them by recency (everything with a timestamp). A resumption-first memory system needs a typed ontology where the types are ranked by how much resumption value they carry.
My current ontology has eight types, in priority order:
open_thread > next_move > open_question > decision > rejection > blocker > progress > follow_up
open_thread is at the top because it is the most expensive thing for the next agent to recover unaided. progress and follow_up sit at the bottom because the next agent can usually rediscover those by reading the codebase or the task list. The ordering is the order the next agent reads the index in.
This is workflow insight expressed as schema. It is the difference between memory that helps the agent finish what it was doing, and memory that helps the agent remember what it finished.
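A minimal sketch of that ordering as code, assuming each item carries a type string drawn from the ontology above. The Item class, the reading_order helper, and two of the example titles are hypothetical; only the type names and their order come from the list.

```python
from dataclasses import dataclass

# Priority order copied from the ontology above: earlier means more resumption value.
TYPE_PRIORITY = [
    "open_thread", "next_move", "open_question", "decision",
    "rejection", "blocker", "progress", "follow_up",
]
RANK = {name: position for position, name in enumerate(TYPE_PRIORITY)}


@dataclass(frozen=True)
class Item:
    type: str
    title: str


def reading_order(items: list[Item]) -> list[Item]:
    """Sort items so the highest-resumption-value types come first.

    sorted() is stable, so items of the same type keep their original order.
    An unknown type raises KeyError instead of being silently ranked.
    """
    return sorted(items, key=lambda item: RANK[item.type])


# Hypothetical items, listed in the order a summariser might emit them.
items = [
    Item("progress", "Migrations applied"),
    Item("open_thread", "Polaris async worker implementation"),
    Item("decision", "Use pgvector for chunk search"),
]

for item in reading_order(items):
    print(item.type, "-", item.title)
```

Keeping the ranking in one list means a change of priorities is a one-line schema change rather than a retrieval-tuning exercise.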
Rule 2. Summary as an index, not a substitute
The default instinct, and the one I made the mistake of following on my first attempt, is to write a paragraph that summarises the previous session and inject it into the next session as a recap. Tidy, readable, low-token. And wrong, for the reasons in the previous section.
A resumption-first summary is not a substitute for the previous session. It is an index over it. Each entry in the index points back to the raw transcript and tells the next agent: this thread exists, here is the verbatim user quote, here is the retrieval query to pull it open if you need the detail.
The raw transcript stays canonical. The summary is a navigation layer on top, designed to be cheap to inject in full but exhaustive in coverage.
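As a rough illustration of a summary that navigates rather than substitutes, the sketch below renders every entry as a short block for injection. IndexEntry and render_index are hypothetical names, the second entry is invented for the example, and the layout of the rendered text is an assumption rather than the format used in production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    type: str
    title: str
    user_quote: str    # verbatim words from the transcript
    search_query: str  # query that reopens the original exchange


def render_index(entries: list[IndexEntry]) -> str:
    """Render every entry as a short block for the system prompt.

    Nothing is dropped: the index stays exhaustive in coverage, while the
    detail stays in the transcript rather than in the rendered text.
    """
    blocks = []
    for entry in entries:
        blocks.append(
            f"[{entry.type}] {entry.title}\n"
            f'  user said: "{entry.user_quote}"\n'
            f"  to reopen: memory_search({entry.search_query!r})"
        )
    return "\n\n".join(blocks)


entries = [
    IndexEntry(
        type="open_thread",
        title="Polaris async worker implementation",
        user_quote="I think 15 then 16.",
        search_query="Polaris async worker Sprint 15 A implementation",
    ),
    IndexEntry(  # hypothetical second entry
        type="decision",
        title="Keep raw transcripts canonical",
        user_quote="do not replace the transcript",
        search_query="raw transcript canonical summary index",
    ),
]

print(render_index(entries))
```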
Rule 3. search_query as a first-class field on every item
Every memory item carries a literal search_query field: three to eight words, in the user’s own language, designed to retrieve the original exchange when run through memory_search. The next agent does not have to invent a query each time it wants to drill down. The query is part of the breadcrumb.
This is the single most token-efficient rule in the set, and as far as I can tell from the published landscape, nobody else is doing it. Everyone else builds a search system, then leaves the agent to compose the right query on the fly. The query is part of the original session’s context, the agent that produced the item is the agent best placed to write the query, and writing it into the item at summarisation time costs nothing extra.
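One way to hold the line on this rule is to check the field when the item is written, then reuse it untouched at drill-down time. The sketch below assumes the three-to-eight-word bound described above; validate_search_query, drill_down, and the stand-in search function are hypothetical, and the real memory_search tool is not shown.

```python
from typing import Callable


def validate_search_query(query: str) -> str:
    """Return the query with whitespace collapsed, or raise if it is out of bounds."""
    words = query.split()
    if not 3 <= len(words) <= 8:
        raise ValueError(f"search_query needs 3 to 8 words, got {len(words)}")
    return " ".join(words)


def drill_down(item: dict, memory_search: Callable[[str], list[str]]) -> list[str]:
    """Run the query stored on the item; the agent composes nothing new."""
    return memory_search(item["search_query"])


def fake_memory_search(query: str) -> list[str]:
    # Stand-in for the real tool: it only echoes what it was asked for.
    return [f"(exchange matching: {query})"]


item = {
    "type": "open_thread",
    "search_query": validate_search_query(
        "Polaris async worker Sprint 15 A implementation"
    ),
}

print(drill_down(item, fake_memory_search))
```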

Rule 4. Faithfulness rules in the summariser prompt
A summariser that is allowed to paraphrase is a summariser that will drift over long horizons. I enforce three faithfulness rules in the summariser’s system prompt:
- user_quote is verbatim. Never paraphrased.
- The number of items per session is not fixed. A trivial session produces one item. A big architecture session produces ten. No N-per-session quota, no forced compression.
- Items must correspond to material actually in the transcript. No inventing topics that look plausible.
These look obvious. In production they are easy to soften, and softening them is where most memory systems drift. The summariser becomes a tool for making the session sound like a clean digest, and clean digests are paraphrase. The discipline of treating the summariser prompt like an ADR is what keeps the breadcrumb honest.
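Prompt rules can also be backed by a mechanical check after summarisation. The sketch below is one possible guard, not the summariser itself: it flags any item whose user_quote cannot be found in the transcript. Collapsing whitespace before comparing is an assumption, made so that line wrapping does not cause false alarms, and both function names are hypothetical.

```python
import re


def _normalise(text: str) -> str:
    """Collapse runs of whitespace so wrapped lines still compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def quote_is_verbatim(user_quote: str, transcript: str) -> bool:
    """True if the quote appears in the transcript, ignoring whitespace layout."""
    return _normalise(user_quote) in _normalise(transcript)


def unfaithful_items(items: list[dict], transcript: str) -> list[dict]:
    """Return the items whose user_quote is not found in the transcript."""
    return [
        item for item in items
        if not quote_is_verbatim(item["user_quote"], transcript)
    ]


transcript = """user: I think 15 then 16. please confirm you have looked at the
current session start / session continue logics"""

items = [
    {"title": "kept", "user_quote": "I think 15 then 16. please confirm you have looked at the current session start"},
    {"title": "paraphrased", "user_quote": "The user wants sprint 15 before 16"},
]

for item in unfaithful_items(items, transcript):
    print("rejected:", item["title"])
```

A check like this covers the first and third rules only. The second rule, no quota on item count, still has to live in the prompt.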
Rule 5. Two-tier retrieval, split by cost not capacity
Letta’s hierarchical context splits memory by context-window fit: is there room to keep this in-context, or should it be paged out. That is a capacity-driven split.
A resumption-first system should split by likelihood of mattering:
- Tier 1, every turn, free. Embed the user message, kNN over the user’s own memory chunks, inject the top few breadcrumbs into the system prompt. Always on. Cheap because the breadcrumbs are small.
- Tier 2, agent-decided. The agent has a memory_search tool. When a breadcrumb looks thin and the agent decides the detail matters, it runs the breadcrumb’s own search_query (rule 3) and pulls the underlying exchange.
The split looks similar to Letta’s on the surface, but the question being asked is different. Letta asks whether there is room. This asks whether it is likely to matter. Subtle but meaningful: the former wastes effort paging in content the agent will never use, the latter only pays the cost when the agent has decided it is worth paying.
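Tier 1 can be sketched in a few lines. The version below assumes the user message has already been embedded and uses brute-force cosine similarity over an in-memory list; a production system would more likely push this down to the vector index. The chunk layout, the toy two-dimensional vectors, and tier1_breadcrumbs are hypothetical. Tier 2 is not shown because it is simply the agent choosing to call memory_search.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tier1_breadcrumbs(query_vec: list[float], chunks: list[dict],
                      user_id: str, k: int = 3) -> list[dict]:
    """Return the k chunks owned by user_id that are most similar to the query."""
    own = [chunk for chunk in chunks if chunk["user_id"] == user_id]
    ranked = sorted(own, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:k]


# Toy data: real embeddings would come from an embedding model.
chunks = [
    {"user_id": "u1", "title": "Polaris async worker implementation", "embedding": [0.9, 0.1]},
    {"user_id": "u1", "title": "Unrelated thread", "embedding": [0.0, 1.0]},
    {"user_id": "u2", "title": "Another user's thread", "embedding": [1.0, 0.0]},
]

top = tier1_breadcrumbs([1.0, 0.0], chunks, user_id="u1", k=1)
print([chunk["title"] for chunk in top])
```

Filtering by owner before ranking keeps the search scoped to the user's own memory chunks, as the Tier 1 description requires.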
A concrete chunk, end to end
Here is what one of these breadcrumbs looks like in practice, taken (lightly redacted) from a session I ran yesterday:
```
type: open_thread
title: Polaris async worker implementation
what: Build a Docker-container job in the polaris-brain image that polls
  Supabase for queued opal_jobs rows tagged tier=medium, preferred_host=polaris.
  Claims them, runs local vLLM, writes results back, heartbeats every cycle.
why: Provides the heavy-compute path for async tasks while keeping the
  Vercel front-end snappy.
user_quote: "I think 15 then 16. please confirm you have looked at the
  current session start / session continue logics / files so you can see
  what I mean by the summary for system prompt being threads to pull on /
  bread crumbs."
search_query: Polaris async worker Sprint 15 A implementation
source_lines: 210-260
ref_hash: 04acf2bb
```
A few hundred bytes. The next agent reading the resumed session sees this chunk in its system prompt, knows there is an open thread on a Polaris worker, knows the literal request the user made, knows exactly which transcript line range to drill into if it needs more, and knows the exact memory_search query to run if it wants the surrounding exchange.
The same chunk under a fact-recall system would have been compressed into something like "the user wants a Polaris worker for heavy jobs". Plausible, lossy, paraphrased, no provenance. The next agent would have no way to recover the actual constraints (tier='medium', preferred_host='polaris', polled via service-role key, heartbeats every cycle) without re-asking the user. It would, eventually, rebuild a plausible-but-wrong version of the design. Then it would write code against it.
The breadcrumb is the difference between continuing the work and re-deciding the work.
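The source_lines pointer is what makes the drill-down cheap. A minimal sketch of resolving it follows, assuming the range is 1-indexed and inclusive and that the transcript is available as a list of lines. Both assumptions, and both function names, are mine rather than something the chunk format states.

```python
def parse_line_range(source_lines: str) -> tuple[int, int]:
    """Parse a 'start-end' string such as '210-260' into two integers."""
    start_text, end_text = source_lines.split("-", 1)
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid line range: {source_lines!r}")
    return start, end


def slice_transcript(transcript_lines: list[str], source_lines: str) -> list[str]:
    """Return the transcript lines covered by the range (1-indexed, inclusive)."""
    start, end = parse_line_range(source_lines)
    return transcript_lines[start - 1:end]


# Stand-in transcript of 300 numbered lines.
transcript_lines = [f"line {n}" for n in range(1, 301)]

excerpt = slice_transcript(transcript_lines, "210-260")
print(len(excerpt), excerpt[0], excerpt[-1])
```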
Where this pattern goes next: Leap and Opal
The system I have described is for engineering work. The agent is Claude Code, the previous session is yesterday’s coding session, the resumption value is measured in time-to-productive on the next morning. But the shape of the pattern is not specific to engineering.
I am currently porting it into Leap, a career-progression product in closed beta where the user-facing assistant is called Opal. Opal is a long-running coach, not a chatbot, and the resumption problem there is even more pointed than it is in code: a user might come back to Opal after three weeks of life happening, expecting the coach to remember the constraints, the aspirations, the rejected paths, and the open questions from the last conversation. The mechanics are identical to those of the engineering system. The ontology is different, because user conversations are not engineering sessions. The eight types shift to fit:
| Engineering ontology | User-conversation ontology (Opal) |
|---|---|
| open_thread | aspiration |
| next_move | concern |
| decision | decision |
| rejection | rejected_path |
| blocker | constraint |
| progress | progress |
| follow_up | open_question |
| – | insight (new) |
The point of porting this rather than copying a ChatGPT-memory-style flat fact list is that the breadcrumb shape gives you a useful side-effect almost for free. Every memory item is a row in a table, addressable by ref_hash, tagged with the user it belongs to. Per-chunk row-level security falls out naturally. So does per-chunk consent (the user revoked memory item X). So does per-chunk right-to-forget under GDPR. A flat fact-bullet system has none of these properties without retrofitting them.
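To make the per-chunk property concrete, here is a small sketch using an in-memory SQLite table. It shows application-level scoping by user and deletion by ref_hash. It is not Postgres row-level security, which would be enforced by database policies instead. The table name, columns, and the second ref_hash are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memory_items (
           ref_hash TEXT PRIMARY KEY,
           user_id  TEXT NOT NULL,
           type     TEXT NOT NULL,
           title    TEXT NOT NULL
       )"""
)
conn.executemany(
    "INSERT INTO memory_items VALUES (?, ?, ?, ?)",
    [
        ("04acf2bb", "u1", "open_thread", "Polaris async worker implementation"),
        ("9f1e77a0", "u2", "aspiration", "Move into a staff role"),  # hypothetical row
    ],
)


def items_for(conn: sqlite3.Connection, user_id: str) -> list[str]:
    """List the ref_hash of every item that belongs to one user."""
    rows = conn.execute(
        "SELECT ref_hash FROM memory_items WHERE user_id = ? ORDER BY ref_hash",
        (user_id,),
    )
    return [row[0] for row in rows]


def forget(conn: sqlite3.Connection, user_id: str, ref_hash: str) -> bool:
    """Delete one item, but only if it belongs to the requesting user."""
    cursor = conn.execute(
        "DELETE FROM memory_items WHERE user_id = ? AND ref_hash = ?",
        (user_id, ref_hash),
    )
    conn.commit()
    return cursor.rowcount == 1


print(forget(conn, "u2", "04acf2bb"))  # another user's item: nothing is deleted
print(forget(conn, "u1", "04acf2bb"))  # the owner revokes the item
print(items_for(conn, "u1"))
```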
Consumer AI memory today is shallow. ChatGPT memory is a small list of saved bullets plus an implicit reference across past chats, useful but flat. The bigger, more interesting claim is that consumer AI memory could be qualitatively different: a coach that remembers the threads of someone’s life and career, not just the facts about them. Opal is built on exactly that bet. The same architectural rules that work for engineering resumption work for life-and-career resumption, with the GDPR posture as a bonus rather than a tax.
The category claim
Stated plainly, so it can be defended or torn down:
Memory systems for agentic workflows should optimise for resumption value, not fact-recall. A summary should be a typed index of threads, not a digest. Breadcrumbs (verbatim quote + retrieve-via query + source pointer) belong inside each memory item, not invented at retrieval time.
The mechanics of agentic memory have converged. The framing has not. Someone is going to write the version of this article with academic citations and benchmark numbers, and they are probably already drafting it. That is fine. The framing matters more than the priority. If this piece is the thing that puts a name on the shape, the months of building it will have done their job.
Close
Substrate moves faster than design. pgvector, MCP, cross-encoders, frontier embedding models, all of these are commodities now. What separates one agentic memory system from another over the next 12 to 18 months will not be the substrate. It will be the shape of what they choose to store, and the discipline with which they store it. Resumption-first is one framing. There will be others. The category claim worth defending is that the target matters, and that fact-recall is the wrong target for the kind of work agents are now doing.
If you want the implementation walkthrough, Surviving Claude Compaction is the companion piece. This one was about why the chunks look the way they do.
Landscape table verified against current published docs on 26 May 2026 (canonical URLs in the table itself). The memory architecture described here runs in production against my own daily coding sessions and is being ported into Leap, a career-progression product currently in closed beta, where the user-facing assistant is Opal. Substrate: Polaris brain (Docker on Tailscale, openai/gpt-oss-120b on local vLLM, pgvector on Postgres, MCP surface for agent access). The summariser runs at session end on each captured transcript, produces the typed breadcrumbs described above, and persists them with a stable ref_hash addressable by both agents and the Discord brain bot. Hardware: single NVIDIA DGX Spark.