← All notes

Persistent memory in Claude Code: hooks, local embeddings, and auto-recall

May 11, 20268 min read

Claude Code starts every session with zero context about what you learned, decided, or corrected yesterday. If you want it to compound with use, something has to persist that knowledge and re-inject it when it's relevant. This is what I built.

Anatomy of the system

Four pieces holding each other up:

  1. Memory files.md documents with YAML frontmatter, one per unit of knowledge. They live on disk under a memory/ directory, version-controllable if you want.
  2. Vector index — a SQLite database with chunk-level embeddings, plus an FTS5 virtual table for BM25 keyword scoring.
  3. Search daemon — a long-running process that keeps the embedding model warm in RAM and answers queries over a Unix socket. Optional, but worth it.
  4. Hooks — the writers (PreCompact, PostToolUse) and the readers (UserPromptSubmit, SessionStart). These are the connective tissue.

The end-to-end flow: a prompt lands, the UserPromptSubmit hook asks the daemon for the top-k memories relevant to it, the daemon embeds the query, runs hybrid search against the index, returns hits, and the hook injects the matching .md contents into the model's context as an <auto-loaded-memory> block — just before the model sees the turn. From the agent's perspective, it's as if you'd pasted those memories into the prompt yourself.

The memory file format

Every memory is a .md file with a tiny frontmatter:

---
name: <short imperative title>
description: <one line about what triggers this memory>
type: <feedback|reference|project|user|tool>
---
<markdown body — the rule, the decision, the technical detail>

The five types are conventions, not enforced by anything. They help when you grep, and when you want different injection thresholds per type:

  • feedback — user corrections you don't want to repeat. "Prefers atomic commits over batches." "Don't use --no-verify to bypass hooks."
  • reference — stable technical knowledge. API quirks, undocumented behavior, paths that change rarely.
  • project — state about ongoing work. Clients, integrations, in-flight TODOs.
  • user — personal preferences. Naming conventions, communication style.
  • tool — details about a specific binary or service. Install command, config path, gotchas.

Two scopes work well in practice: a global directory under ~/.claude/.../memory/ for things that apply everywhere, and a project-local <repo>/.claude/memory/ that only loads when your cwd is inside that repo. Walk up from cwd looking for .claude/memory/ and you get per-project memory without configuring anything.

The local model

The embedding side runs entirely on your machine:

  • paraphrase-multilingual-MiniLM-L12-v2 from sentence-transformers. 384 dimensions, ~120 MB on disk, CPU-fine. The "multilingual" matters — my memories mix English and Spanish freely, and an English-only model degrades badly on mixed content.
  • Hybrid search: 70% cosine similarity + 30% BM25 from SQLite's FTS5. Cosine captures semantic intent ("how do I authenticate" → memory about JWT setup). BM25 captures literal terms — paths, commands, filenames — that embedding models tend to smooth over.
  • Chunking: ~120 tokens per chunk with 30 tokens of overlap, because the model's max_seq_length is 128. Chunk-level indexing means a 2,000-word file generates several embedding rows; the search returns the file of the best-matching chunk (deduplicated), so a single strong paragraph in a long doc can surface the whole memory.

Why not OpenAI's text-embedding-3 family? Three reasons. Privacy — memories contain corrections, project state, sometimes credentials by accident; nothing leaves the machine. Latency — a warm daemon answers in 22–49 ms; a round-trip to a hosted API is 200+ ms minimum. Cost — zero, indefinitely.

The indexer is short:

# Conceptual sketch — not production code
from sentence_transformers import SentenceTransformer
import sqlite3, glob

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
db = sqlite3.connect('memory-index.db')

for path in glob.glob('memory/*.md'):
    text = open(path).read()
    for i, chunk in enumerate(chunk_text(text, size=120, overlap=30)):
        embedding = model.encode(chunk)
        db.execute(
            "INSERT INTO chunks (file, idx, text, vec) VALUES (?, ?, ?, ?)",
            (path, i, chunk, embedding.tobytes())
        )

# After inserts, rebuild the FTS5 table for BM25 queries
db.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")

Reindexing is incremental: store each file's mtime in a metadata table and only re-embed files whose mtime has advanced.

The four hooks

Claude Code's hooks are shell commands that fire at specific lifecycle events. Each hook receives a JSON blob on stdin and can write back another JSON blob to inject context, block the action, or trigger side effects.

Hook Matcher Job
SessionStart startup|resume|clear|compact Reindex if anything changed; inject today's journal as initial context
UserPromptSubmit .* Look up top-k memories and inject them before the model sees the prompt
PostToolUse Write|Edit|MultiEdit Async reindex if you just edited a .md file inside a memory directory
PreCompact manual|auto Extract durable memories from the transcript before it gets compressed

They're declared in ~/.claude/settings.json:

{
  "hooks": {
    "SessionStart": [{
      "matcher": "startup|resume|clear|compact",
      "hooks": [{ "type": "command", "command": "bash /path/to/session-start.sh", "timeout": 60 }]
    }],
    "UserPromptSubmit": [{
      "matcher": ".*",
      "hooks": [{ "type": "command", "command": "bash /path/to/memory-search-hook.sh", "timeout": 5 }]
    }],
    "PostToolUse": [{
      "matcher": "Write|Edit|MultiEdit",
      "hooks": [{ "type": "command", "command": "bash /path/to/reindex-if-touched.sh", "timeout": 5 }]
    }],
    "PreCompact": [{
      "matcher": "manual|auto",
      "hooks": [{ "type": "command", "command": "bash /path/to/precompact-extract.sh", "timeout": 380 }]
    }]
  }
}

The UserPromptSubmit hook is the smallest and the most impactful. Conceptually:

# Conceptual sketch
PROMPT=$(jq -r '.prompt' <<< "$INPUT")
[ ${#PROMPT} -lt 30 ] && { echo '{}'; exit 0; }

HITS=$(python3 /path/to/memory-search.py --query "$PROMPT" --topk 5 --threshold 0.35)

if [ -n "$HITS" ]; then
  CONTEXT=$(printf '<auto-loaded-memory>\n%s\n</auto-loaded-memory>' "$HITS")
  jq -nc --arg ctx "$CONTEXT" \
    '{ hookSpecificOutput: { hookEventName: "UserPromptSubmit", additionalContext: $ctx } }'
fi
exit 0

Two details worth stealing:

  • Re-entrancy guard: skip the search when CLAUDE_HEADLESS=1 is set in the environment, so a sub-Claude spawned by another hook doesn't recursively trigger its own memory search.
  • Graceful degradation: any error path should echo '{}' && exit 0. A bug in your hook shouldn't make the parent session refuse to accept user prompts.

The hook that changes everything — PreCompact

This is the one that flipped the system from "manual journaling that I always forget to do" to "the agent journals itself."

When Claude Code is about to compact a long conversation — either automatically because the context is full, or manually because you ran /compact — the PreCompact hook fires first. The hook receives the path to the transcript JSONL on disk, plus session metadata. Inside it:

  1. Pre-filter the transcript with jq, keeping only user and assistant text turns (drop tool results, system reminders, the noise).
  2. Cap to the last ~100 KB — roughly the last 25K tokens of conversation. Older context is usually summarized by then and not worth re-analyzing.
  3. Spawn a headless Claude with claude -p --model sonnet and a system prompt that knows the memory file format. Its job is to scan the conversation and emit zero or more new .md files representing durable knowledge: corrections the user gave, decisions that were made, gotchas that were discovered.
  4. Reindex so the new memories are queryable on the next turn.
  5. Failsafe everything: any error path exits 0. A broken extractor can never block the compaction itself — the user would lose work.
# Conceptual sketch
TRANSCRIPT=$(jq -r '.transcript_path' <<< "$INPUT")

jq -c 'select(.type=="user" or .type=="assistant")' "$TRANSCRIPT" \
  | tail -c 100000 \
  | CLAUDE_HEADLESS=1 claude -p \
       --model sonnet \
       --system-prompt "$(cat /path/to/extract-prompt.txt)" \
       --output-format text \
  | python3 /path/to/parse-and-write-memories.py

python3 /path/to/memory-index.py --reindex
exit 0

The extractor prompt is the single most important file in the system. It explains the frontmatter format, the five types, what not to save (anything ephemeral, anything trivially derivable from git log), and asks for one .md per durable insight. Spend an afternoon iterating on it. Everything else is plumbing.

Budget: ~$0.20–0.50 per compaction with Sonnet, runtime 30–60 seconds. Cheap enough that I let it run on every compaction, manual or auto.

Rolling your own

A workable setup is six or seven steps:

  1. Create the directories: ~/.claude/memory/ for global and ~/.claude/scripts/ for the hooks.
  2. Install the embedding model inside an isolated venv: python -m venv ~/.claude/scripts/.venv && ~/.claude/scripts/.venv/bin/pip install sentence-transformers. The first run downloads the model; subsequent runs are instant.
  3. Write the scripts — five files, each doing one thing:
    • memory-index.py — indexes .md files into SQLite with embeddings + FTS5.
    • memory-search.py — hybrid query (cosine + BM25), returns top-k file paths with scores.
    • memory-search-hook.sh — invoked by UserPromptSubmit; calls search, emits the additionalContext JSON.
    • precompact-extract.sh — invoked by PreCompact; runs the headless extractor.
    • reindex-if-touched.sh — invoked by PostToolUse; async reindex when a memory file changes.
  4. Register the hooks in ~/.claude/settings.json using the JSON above.
  5. Write 2–3 seed memories by hand to validate the pipeline end-to-end. A user_preferences.md, a feedback_no_force_push.md, a tool_my_cli.md — whatever feels real.
  6. Test the search: python memory-search.py "your test query". You want hits with combined score > 0.35.
  7. Restart Claude Code and watch a fresh prompt come back with an <auto-loaded-memory> block in the context.

A few decisions you'll make along the way:

  • Global, project, or both? Both. Walk up from cwd looking for .claude/memory/, and union the project hits with the global hits. Project state goes project; user preferences and tool details go global.
  • Bigger embedding model? Stay on MiniLM-L12 unless you have RAM and latency budget to burn. The multilingual MiniLM punches above its weight on mixed-language content.
  • Daemon or spawn-per-query? Daemon if you want consistent sub-50ms response. Spawn-per-query (load the model on every search) is fine if you can live with ~500 ms — the model cold-start dominates.

What you get

Once it runs, the agent stops being a stateless responder and starts composing with its own past. It uses last week's correction so you don't have to give it again. It remembers an obscure config path from a month ago. It stops re-litigating settled decisions. That's the difference between a tool and an assistant.