Teaching Blip to Remember Yesterday

Yesterday, my 8-year-old son Jaxsen spent twenty minutes telling Blip about a game he was inventing. It had Kaijus you summon from a sketchbook, a goblet that gives you mana, villains from a secret dojo, and a rhino creature called the Roaring Untamed. He was building an entire world, and Blip was asking all the right follow-up questions.

Today, he walked up to the speaker and said two words: "What happens?"

No context. No reminder. Just "what happens?" — picking up right where he left off, the way you'd talk to a friend who was there yesterday.

And Blip said: "Oh, you want to know what happens when Kaiju-Blip wakes up in the middle of a fight? Okay, picture this…"

Blip remembered.

How it works (the short version)

At the end of every session, Blip writes a one-page memo — a markdown file that reads like a tutor's journal entry. Here's what the real file looked like after Jaxsen's Kaiju session:

# What I know about Jaxsen

- Loves inventing his own games with detailed world-building
- Working on a Kaiju game featuring creatures summoned from a sketchbook
- In his game, a "monogoblet" is a protective force field
- Careful about continuity — will correct Blip when she gets details wrong

# Recent sessions

## 2026-04-09 evening (Thu)
Jaxsen continued developing his Kaiju game. He clarified that the monogoblet
creates a protective field around the player. He corrected Blip when she
invented the name "Kaiju-Blip" — he never used that name.

**Open threads:**
- Ask Jaxsen what the first Kaiju in his sketchbook is actually named
- Ask what the rhino Kaiju's special powers are beyond missiles and fire

At the start of the next session, that memo gets injected into Claude's system prompt. Claude doesn't know it's reading a file — from its perspective, it just remembers what happened yesterday. The guardrail instruction says: "Reference these naturally if it fits the moment — never say 'my notes say.'"

That's it. One file per kid, appended after each session, injected at the start of the next one. No vector database, no embeddings, no RAG. Just a markdown file and a well-placed system prompt injection.

The bug that proved it works

The first version had a bug. Not a crash — something more interesting.

I built two memo-generation backends and set them to alternate randomly (A/B study):

Backend A (Claude): At session end, send the full conversation transcript to Claude and ask it to write a summary memo as JSON.
Backend B (blip-edu, local): Maintain a running summary via the local LLM (7B model on my inference box), updating after each turn. At session end, finalize into JSON.

The first real session randomly picked Backend B. And Backend B hallucinated.

Jaxsen was inventing a game featuring Kaijus. Blip is his AI tutor. The local model's running summary conflated the two and invented a character called "Kaiju-Blip" — a name that never came out of Jaxsen's mouth. The memo recorded Blip as a character in the game, not as the tutor listening to the game.

The next session, Claude dutifully used that bad context. It opened with "what happens when Kaiju-Blip wakes up?" and Jaxsen immediately said: "I don't think that's right."

He caught it. The 8-year-old caught the AI's attribution error.

And here's the thing — this bug proves the memory system works. Claude picked up the memo, trusted it, and used it unprompted. The problem wasn't that continuity failed; the problem was that the content of the memo was wrong. The plumbing worked perfectly. The data flowing through it was bad.

The attribution benchmark

I needed to know: is this a "7B models are too small" problem, or is it a "running summary degrades across turns" problem? So I replayed the exact Kaiju conversation transcript through six different LLMs, each asked to produce a session memo from the full transcript at once (not a running summary).

Model	Score	Time	Notes
Claude Sonnet 4.6	11/11	7.4s	Perfect. Best detail, correct attribution.
blip-edu v1 (7B)	11/11	5.0s	Surprise winner. Same model that hallucinated!
blip-edu v2 (7B)	10/11	3.2s	Missed "Roaring Untamed" Kaiju name.
blip-edu-coder (7B)	8/11	3.7s	Too abstract. Missed sketchbook, rhino.
qwen2.5-coder:32b	10/11	11.8s	Solid but 3x slower.
qwen3:32b	10/11	30.7s	Good quality, too slow and verbose.

Wait — blip-edu v1 scored 11/11? The same 7B model that hallucinated "Kaiju-Blip" when running as Backend B got a perfect score when given the full transcript at session end?

Yes. The hallucination came from the running summary architecture, not from the model. Backend B updates its summary after every turn, and each update loses a little fidelity. Over 20+ turns of a creative game session, those small lossy updates compounded into "the player's Kaiju game" becoming "Kaiju-Blip" — the model drifted from accurate summarization into narrative invention.

When given the same conversation as a single complete document, the 7B model had no trouble producing a clean, accurate memo. It matched Claude's quality exactly.

The fix: one line of config

I added a third backend: local_end_of_session. Same prompt as Claude's backend, same full-transcript-at-once approach, but routed to the local LLM instead of Claude's API. Same result (11/11), zero API cost.

# config.yaml
memory:
  backend: "local_end_of_session"  # free, private, 11/11 quality

The running-summary backend (local_running_summary) is still in the codebase for A/B study purposes, but it's no longer the default. The lesson: just because a 7B model can maintain a running summary doesn't mean it should. Give it the full context and it does fine. Ask it to compress incrementally and it drifts.

What's next: voice

Memory continuity is live. The conversations feel different now — Jaxsen expects Blip to remember, and she does. But the voice still sounds like a text-to-speech engine reading a script. The prosody is flat. "Oh no, that must have been so hard" sounds exactly like "Want to try math?" Same pitch, same rhythm, same energy.

I'm setting up an F5-TTS server on the inference box to give Blip her own unique voice with 8 distinct emotional registers — not just three rate-change presets. Plus thinking fillers ("Hmm…", "Ooh!") to fill the dead air while Claude generates a response. The kids will hear Blip react immediately after speaking instead of sitting through 3-5 seconds of silence.

That's the next post.

The technical stack (for the curious)

Memory file format: Markdown with YAML front-matter. Human-readable, hand-editable. One file per kid at data/profiles/<id>_memory.md.
Memo generation: Claude (or blip-edu via Ollama) receives the full session transcript and returns a JSON memo with summary, topics_enjoyed, topics_struggled, open_threads, and mood.
Injection: The memo's "What I know" section + most recent session block are appended to Claude's system prompt at the start of each session. Guardrail: "Act like you remember — never say 'my notes say.'"
Threshold: Sessions with fewer than 2 kid turns or fewer than 10 kid words aren't saved (no "Hey Blip" / "bye" noise in the memo).
A/B study: A JSONL study log records which backend wrote each memo and tracks continuity signals (did the next session reference previous topics?). Analysis script at scripts/analyze_memory_study.py.
Tests: 102 unit tests covering parse/render/merge/roundtrip/backends/ filler cache/greeting cache. All green.

The memory system is open source — the full implementation is about 1,000 lines of Python across 5 files in core/memory/, with another 1,300 lines of tests. It's simpler than it sounds. The hard part wasn't the code; it was figuring out that the running-summary architecture was the problem, not the model.