LLM Memory: What Actually Works (And What Doesn't)
How I benchmarked my AI companion's memory system against a viral open-source project — and what the numbers revealed about the state of LLM memory.
I’m building Tizenegy, an AI companion platform where users create and chat with customizable AI characters. One of the hardest problems in this space is memory — making the AI actually remember what you told it last week, last month, or six months ago.
Most LLM conversations are stateless. You close the tab, the context is gone. For a companion that’s supposed to know you, that’s a dealbreaker. So I built a memory system. Then I found someone else’s memory system that claimed near-perfect recall scores. Naturally, I had to test that claim — and test my own system honestly while I was at it.
This is what I found.
The Problem With LLM Memory
Every LLM has a context window — a fixed amount of text it can “see” at once. Claude can handle 200K tokens, GPT-4 around 128K. That sounds like a lot, but six months of daily conversations easily exceeds that. You need a system that stores memories externally and retrieves the right ones at the right time.
The standard approach: after each conversation, extract the important bits (facts, preferences, events), store them as vector embeddings, and when the user sends a new message, search for semantically similar memories to inject into the prompt.
Simple in theory. Surprisingly tricky in practice.
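The core of that loop can be sketched in a few lines. This is an illustrative toy, not the production pipeline: the `embed` function is injected, and a real system would use a vector database (Vectorize, ChromaDB, etc.) instead of a Python list.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

class MemoryStore:
    """Minimal store-and-retrieve loop over embedded memory texts."""

    def __init__(self, embed):
        self.embed = embed   # any text -> list[float] function
        self.items = []      # (vector, memory text) pairs

    def add(self, text):
        self.items.append((self.embed(text), text))

    def search(self, query, top_k=3):
        # Rank stored memories by semantic similarity to the query.
        qv = self.embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(it[0], qv),
                        reverse=True)
        return [text for _, text in ranked[:top_k]]
```

Everything that follows — extraction, scoring, palace structure — is elaboration on this basic shape.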
How Tizenegy’s Memory Works
My system runs on Cloudflare’s stack — Workers, D1 (SQLite), and Vectorize (vector search). Here’s the pipeline:
After every message pair, a queue consumer extracts structured data using a cheap LLM (Llama 3.1 8B): entities (people, places, pets), facts (“works at Google”), preferences (“likes coffee”), and a 1-2 sentence summary.
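The extraction step reduces to prompting a cheap model for structured JSON and validating what comes back. A minimal sketch — the prompt wording, `extract_memories` name, and injectable `llm` callable are my own illustration, not Tizenegy's actual code:

```python
import json

EXTRACTION_PROMPT = """\
From the chat exchange below, return JSON with keys:
  "entities": people, places, pets mentioned,
  "facts": stable facts like "works at Google",
  "preferences": likes/dislikes like "likes coffee",
  "summary": a 1-2 sentence summary.
Exchange:
{exchange}
"""

def extract_memories(exchange, llm):
    """llm is any callable taking a prompt string and returning model text
    (e.g. a thin wrapper around a cheap model like Llama 3.1 8B)."""
    raw = llm(EXTRACTION_PROMPT.format(exchange=exchange))
    data = json.loads(raw)
    # Keep only the expected keys so a chatty model can't smuggle extras in.
    keys = ("entities", "facts", "preferences", "summary")
    return {k: data.get(k, "" if k == "summary" else []) for k in keys}
```

In production you would also handle malformed JSON (retry or drop), but the shape of the step is just prompt, parse, validate.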
Contradiction detection catches when facts change. If you said you work at Google last month and now mention starting at Meta, the old fact gets invalidated with a timestamp. The new one supersedes it. Your companion won’t awkwardly ask about your day at Google anymore.
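The supersede mechanic is straightforward in SQL: stamp the old fact invalid, insert the new one live. A sketch against an in-memory SQLite table (the schema and function names are illustrative, not the actual D1 schema):

```python
import sqlite3

def init_db(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS facts (
        user_id TEXT, subject TEXT, value TEXT,
        created_at REAL, invalidated_at REAL)""")

def record_fact(conn, user_id, subject, value, now):
    # Any live fact on the same subject is superseded: stamp it invalid...
    conn.execute(
        "UPDATE facts SET invalidated_at = ? "
        "WHERE user_id = ? AND subject = ? AND invalidated_at IS NULL",
        (now, user_id, subject))
    # ...then insert the new fact as the live one.
    conn.execute("INSERT INTO facts VALUES (?, ?, ?, ?, NULL)",
                 (user_id, subject, value, now))

def live_fact(conn, user_id, subject):
    row = conn.execute(
        "SELECT value FROM facts WHERE user_id = ? AND subject = ? "
        "AND invalidated_at IS NULL",
        (user_id, subject)).fetchone()
    return row[0] if row else None
```

Keeping the invalidated rows (rather than deleting them) preserves a history of what the user said and when it changed.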
Semantic memories are embedded with bge-base-en-v1.5 (768-dimensional vectors) and stored in Vectorize, namespaced per user and filtered by companion.

At retrieval time, the system computes a composite score: 50% semantic similarity, 30% recency (exponential decay, ~6-day half-life), and 20% importance. This means a moderately similar recent memory can outrank an old but semantically closer one, which matches how human memory actually works.
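Those weights translate directly into code. A sketch of the scoring function (function names are mine; the 0.5/0.3/0.2 weights and 6-day half-life are from the description above):

```python
HALF_LIFE_DAYS = 6.0  # recency contribution halves every ~6 days

def recency_score(age_days):
    # Exponential decay with the given half-life.
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def composite_score(similarity, age_days, importance):
    """similarity and importance are assumed normalized to [0, 1]."""
    return (0.5 * similarity
            + 0.3 * recency_score(age_days)
            + 0.2 * importance)
```

For example, a memory from yesterday at 0.7 similarity scores higher overall than a 60-day-old memory at 0.9 similarity, because the old memory's recency term has decayed to nearly zero.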
The whole thing runs asynchronously. Memory extraction happens in a queue after the companion has already responded, so there’s no latency hit on the chat experience.
Enter MemPalace
In early April 2026, a project called MemPalace made waves on social media. Created by a team using actress Milla Jovovich’s name and likeness, it racked up 5,400 GitHub stars in under 24 hours and reached over 15 million people. The headline claim: 96.6% recall on LongMemEval — a benchmark from an ICLR 2025 paper — with zero API calls, using only local vector storage.
The core idea is compelling. MemPalace organizes memories into a “palace” structure borrowed from the ancient Method of Loci: wings (people or projects), rooms (topics), and halls (memory types like facts, preferences, or events). Instead of flat vector search across all memories, you search within the relevant wing and room first. They reported this metadata-filtered approach gives a 34% retrieval boost over flat search.
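The room-first lookup reduces to a metadata filter with a flat-search fallback. A sketch under the assumption that the store exposes a `search(query, filter, top_k)` interface taking a metadata dict, as vector stores like Vectorize or ChromaDB do in their own filter syntaxes — the function and parameter names here are mine, not MemPalace's:

```python
def palace_search(store, query, wing=None, room=None, top_k=5):
    # Build a metadata filter from whichever palace coordinates are known.
    meta = {k: v for k, v in (("wing", wing), ("room", room)) if v}
    hits = store.search(query, filter=meta, top_k=top_k)
    if not hits and meta:
        # The wing/room guess missed: fall back to flat, unfiltered search.
        hits = store.search(query, filter={}, top_k=top_k)
    return hits
```

The win, when it works, is that the filtered candidate set is much smaller and more topically coherent than the full memory corpus; the risk, as we'll see in the benchmarks, is misclassifying the room.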
They also store everything verbatim — no LLM summarization, no lossy compression. The argument: why burn tokens summarizing when you can just store the raw text and let the embedding model handle similarity? It’s a valid architectural point, and the benchmark numbers seemed to back it up.
The Critique
Then came a detailed technical analysis that poked serious holes in the claims.
The key issues:
The 100% LoCoMo score was trivial. LoCoMo conversations max out at 32 sessions. MemPalace used top_k=50 for retrieval — retrieving more items than exist in the dataset. That’s not retrieval, that’s just… returning everything.
The 96.6% LongMemEval number measures retrieval, not answers. It checks whether the right conversation session appears in the top-5 search results (recall_any@5). It does not check whether the system can actually answer the question. No answer generation, no judge evaluation. You can find the right session and still give a wrong answer.
The knowledge graph claims didn’t match the code. The marketing touted “contradiction detection,” but the actual knowledge_graph.py only does exact-match triple deduplication — a far cry from temporal fact management.
The AAAK compression format regresses quality. MemPalace’s own documentation admits their compression dialect scores 84.2% vs 96.6% for raw storage — a 12.4 percentage point drop. At small scale, it actually uses more tokens than raw text.
To be fair, the project’s internal BENCHMARKS.md file honestly discloses most of these caveats. The issue was the gap between the internal documentation and the public marketing.
Running My Own Benchmarks
Reading all this, I had two questions: how does my system actually score, and does the palace structure help?
I set up a benchmark using LongMemEval-S: 500 questions across 6 categories (single-session facts, multi-session reasoning, temporal questions, knowledge updates, preferences, and abstention tests), each with ~53 conversation sessions as the search corpus.
I tested five configurations: raw, extracted, and palace modes with bge-base-en-v1.5, plus raw and palace modes with all-MiniLM-L6-v2.
Results with bge-base-en-v1.5 (my production model)
| Mode | Recall@5 | Recall@10 | NDCG@10 |
|---|---|---|---|
| Raw (baseline) | 96.0% | 97.6% | 91.3% |
| Extracted (with LLM pipeline) | 96.0% | 97.6% | 91.3% |
| Palace (room-filtered) | 94.6% | 96.2% | 90.0% |
Results with all-MiniLM-L6-v2 (MemPalace’s model)
| Mode | Recall@5 | Recall@10 | NDCG@10 |
|---|---|---|---|
| Raw | 73.6% | 73.6% | 71.2% |
| Palace (room-filtered) | 72.4% | 72.4% | 70.1% |
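For reference, the metrics in these tables are the standard retrieval ones — recall here meaning recall_any@k, as in the LongMemEval setup, and NDCG using binary relevance. A minimal sketch:

```python
import math

def recall_any_at_k(ranked_ids, relevant_ids, k):
    # 1.0 if any relevant session appears in the top-k results, else 0.0.
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    # Binary-relevance NDCG: discounted gain normalized by the ideal ordering,
    # so relevant sessions ranked near the top score higher.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, rid in enumerate(ranked_ids[:k]) if rid in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0
```

The reported numbers are these per-question scores averaged over all 500 questions.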
How Other Memory Systems Compare
For context, here are published LongMemEval scores from other memory systems (where available):
| System | R@5 | R@10 | Notes |
|---|---|---|---|
| Tizenegy (bge-base, raw) | 96.0% | 97.6% | Our baseline, zero API calls |
| MemPalace (claimed, MiniLM) | 96.6% | 98.2% | Not reproducible in our tests |
| Mem0 | ~85% | — | $19-249/mo cloud service |
| Zep | ~85% | — | $25/mo+, Neo4j-backed |
| Mastra (GPT-based) | 94.9% | — | Requires GPT API calls |
| Supermemory ASMR | ~99% | — | API required |
The commercial services (Mem0, Zep) hover around 85%. Mastra gets to ~95% but requires GPT API calls for every extraction. MemPalace and Supermemory claim higher numbers but with methodological caveats. Our raw baseline with bge-base sits comfortably in the top tier — at zero ongoing API cost for retrieval.
Note: these numbers come from different benchmark runs with varying methodologies. Direct comparison should be taken with a grain of salt — which is exactly the point of this article.
What the Numbers Tell Us
The embedding model matters more than anything else. bge-base-en-v1.5 scores 96.0% where MiniLM scores 73.6% — a 22 percentage point gap. This single choice dwarfs every other architectural decision.
LLM extraction doesn’t improve retrieval. The “extracted” mode (running my full pipeline with entity/fact extraction and memory summaries) scores identically to raw embedding. The extraction pipeline’s value is in the structured data it produces (facts, entities, contradiction detection) — things that enrich the companion’s prompt but don’t show up in a retrieval-only benchmark.
Room filtering slightly hurts retrieval. The palace mode costs about 1.3 percentage points. Keyword-based room detection sometimes misclassifies a question or session, causing the room-filtered search to miss relevant results. The fallback to unfiltered search helps, but doesn’t fully recover.
MemPalace’s 96.6% is not reproducible with MiniLM. Using their published embedding model, I get 73.6%. The discrepancy likely comes from different benchmark parameters — possibly a higher top_k, different chunking strategies, or pure cosine distance without composite scoring.
The Bigger Picture
Here’s what I think the AI memory space is getting wrong: we’re over-optimizing for retrieval benchmarks while under-investing in what actually makes a companion feel like it knows you.
Retrieval recall measures whether the right memory exists in the search results. It doesn’t measure whether the companion uses that memory well, whether it contradicts itself about your job, whether it remembers your sister’s name, or whether it notices you’ve been stressed lately.
The palace structure — rooms, layers, identity cards — might cost 1.3% on a retrieval benchmark. But a companion that loads a 100-token identity card on session start (“this person works in tech, has a sister named Emma, recently went through a breakup, loves hiking”) provides a qualitatively different experience than one that starts fresh every time and hopes the vector search returns something relevant.
That’s the bet I’m making with Tizenegy. The memory system uses a feature flag (PALACE_MEMORY_ENABLED) so I can A/B test both approaches with real users. If the palace structure makes conversations feel more natural — even if retrieval recall dips slightly — that’s the right trade-off for a companion product.
Takeaways
If you’re building LLM memory, here’s what I’d suggest based on these benchmarks:
Pick your embedding model carefully. It’s the highest-leverage decision. Test multiple models against your actual data.
Store raw text. LLM-extracted summaries don’t improve vector retrieval. They’re valuable for structured data (facts, entities), but don’t expect them to boost search quality.
Be skeptical of benchmark claims. Read the methodology before the headline number. “100% recall” often means “we retrieved everything” — which is easy when your top_k exceeds your corpus size.

Measure what matters to your users. Retrieval recall is necessary but not sufficient. For a companion product, the subjective experience of “this AI knows me” matters more than the percentage point difference between 94.6% and 96.0%.
Use feature flags. Memory architectures are hard to evaluate in isolation. Ship both, measure with real users, keep what works.
The code for the benchmark runner and all results are in the Tizenegy repository. I used LongMemEval-S (500 questions, ICLR 2025), ChromaDB locally with the same models used in production, and honest methodology — no inflated top_k, no skipped answer verification. I encourage anyone building in this space to benchmark honestly. The field needs it.
Bjorn Wikkeling builds AI products at Smoking Media. Tizenegy is currently in development at tizenegy.com.
