“The ceiling of an AI agent is its tool interface.”
This chapter is about one thing: the interface you give an AI agent determines what it can do, what it can’t do, and where it will break.
Not prompt engineering. Not model selection. The protocol layer — how you design the tool interface you hand to the agent.
The hardest-won lesson from that sprint was this: when agents fail, 80% of the time it’s not because the model isn’t good enough — it’s because the tools I gave it aren’t good enough. Fix the tools, and the same model immediately accomplishes what it couldn’t before.
The first layer of the harness mentioned in the preface is the protocol layer. This chapter starts with a counterintuitive choice: rejecting vector databases as the primary architecture for agent memory.
1.1 From Vector Database to Zettelkasten
In 2024–2025, every AI conversation about memory inevitably led to RAG (Retrieval-Augmented Generation). You’ve probably seen the architecture: chunk documents, embed them, store in a vector database, do similarity search at query time, stuff top-K results into the context.
I tried this approach. More than tried — I used Pinecone, Weaviate, and Chroma across three projects.
Let me tell you about a specific failure. In the second project, I had an agent use RAG to look up “design principles for MCP tool descriptions.” Vector search returned five results — three of which were about “REST API documentation best practices.” Because in embedding space, “tool description” and “API documentation” are close neighbors.
I spent two hours trying to fix it: different chunk sizes, adjusted overlap ratios, different embedding models. Nothing worked. Embeddings are thousand-dimensional floating-point vectors — you can’t look at one and say “ah, dimension 834 is too high, that’s why it mismatched.” All you can do is tweak parameters and pray.
That’s not engineering. That’s alchemy.
Conclusion: for the agent memory use case, a vector database is the wrong primary architecture.
Why? Three reasons, and a better alternative.
First, Embedding Search Is Not Debuggable
When embedding search returns the wrong results, there is nothing to inspect: the only artifact is a similarity score computed in a space no human can read. By contrast, explicit links are human-readable, debuggable, and editable. Card A links to Card B because a human decided they're related — and that decision is visible in the link. When an agent follows a link, every step leaves a trail.
Second, One-Shot Retrieval Doesn’t Fit Complex Problems
The standard RAG flow is: query → retrieve → generate. One retrieval, one generation.
But real engineering problems are rarely solvable with a single retrieval. Say an agent encounters an MCP tool timeout bug. It needs to: find known issues → discover it’s related to tool description design → find design principles → locate the self-containment constraint → finally understand the root cause of the timeout.
That’s a 5-step exploration process. You could say “just do multi-hop RAG” — sure, but at that point you’re using vector similarity to simulate graph traversal. Why not just use a graph?
Third, Tokens Will Only Get Cheaper
Token prices drop several-fold every 12–18 months, and this trend has held for two years with no signs of slowing. Once tokens are cheap enough, having the LLM read the original text, make judgments, and follow links becomes entirely economically viable.
Zettelkasten: A Better Alternative
A quick primer on Zettelkasten. The method comes from German sociologist Niklas Luhmann — he wrote over 50 books and 600 scholarly articles using a system of roughly 90,000 interlinked index cards. The core idea is simple: each card records one atomic idea, and cards are connected through explicit links to form a knowledge network. No hierarchical directories, no classification system — just cards and links. Structure emerges from the link relationships.
Zettelkasten has a counterintuitive core principle: connection matters more than storage. Traditional note-taking systems help you “find what you thought you were searching for” — you search a keyword, you find the matching note. Zettelkasten does something different: it surfaces ideas you’ve already forgotten. You follow links, and three hops in you hit a card you wrote three months ago, and suddenly realize that idea is relevant to the problem in front of you. That serendipity is something search can’t give you.
There’s also an iron rule: never copy-paste. Every card must be restated in your own words. This isn’t pedantry — the more effort you invest at storage time, the easier retrieval becomes. Text copied verbatim becomes incomprehensible three days later — you can’t even remember why you thought it mattered. Rewriting in your own words forces digestion. The same applies to agents: memex_retro requires the agent to summarize learnings in its own language, not paste raw logs.
Applying Zettelkasten to agent memory comes down to three substitutions:
LLM Replaces Embedding. Instead of mapping text to vector space with an embedding model, let the LLM read card text directly, understand the content, and decide where to go next. LLM semantic understanding far surpasses embedding — it can grasp “this paragraph describes a cause, that one describes an effect,” and it can judge relevance in the context of the current task. Embedding can’t make this kind of context-sensitive judgment — the same card has different relevance depending on the task.
Explicit Links Replace Vector Similarity. Vector similarity has exactly one dimension: how similar. Explicit links carry richer semantics — causal relationships, evolutionary relationships (“this design was later replaced”), dependency relationships, even contradictory relationships. Link meaning is determined by context — exactly where LLMs excel. Links are embedded directly in card text using [[wikilink]] syntax — agents naturally weave in links as they write cards, which is more elegant than calling a separate “create link” command.
Iterative Exploration Replaces One-Shot Retrieval. When solving problems, agents often don’t know what they’re looking for. They have a vague direction and need to explore the knowledge network. The agent reads a card, sees [[wikilinks]], calls links to check link targets, uses read to fetch the next card, sees more links, keeps deciding — every step involves LLM judgment. Not mechanical similarity ranking, but contextual, reasoned exploration.
Where embedding still fits in: memex did eventually add embedding as a supplementary search mechanism (supporting OpenAI, local GGUF models, and Ollama as providers, with 0.7 * semantic + 0.3 * keyword hybrid scoring). But this is an optional enhancement, not the primary architecture. The primary architecture remains wikilinks + keyword search + iterative exploration. Embedding has value when you need fuzzy semantic search, but the structure of knowledge, the agent’s navigation paths, the traceability for debugging — all of that is built on explicit links.
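As a sketch of what that hybrid scoring might look like: the 0.7/0.3 weights come from the text above, but the scoring functions themselves are illustrative assumptions, not memex's actual implementation.

```python
# Hypothetical sketch of hybrid scoring: 0.7 * semantic + 0.3 * keyword.
# Both sub-scores are assumed to be normalized to [0, 1]; all names are illustrative.

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the card text (naive keyword match)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    hits = sum(1 for t in terms if t in text.lower())
    return hits / len(terms)

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query: str, text: str,
                 query_vec: list[float], card_vec: list[float]) -> float:
    """Blend semantic and keyword signals, weighted toward semantic."""
    return 0.7 * cosine(query_vec, card_vec) + 0.3 * keyword_score(query, text)
```

The point of keeping this as a supplement rather than the core: the hybrid score helps when the agent needs fuzzy entry points, but every result it returns still lands the agent on a card whose explicit links carry the actual structure.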
This nicely illustrates good architecture supporting progressive enhancement: get the core path working first, then layer on capabilities as needed.
To be fair, vector databases genuinely excel in certain scenarios — like rough filtering of massive unstructured document collections, or “dump everything” modes where users don’t maintain any knowledge structure. My conclusion is scoped to the agent memory use case: agents need debuggable, iteratively explorable, cross-session structured knowledge — not a black-box search engine.
Why Memory Is Systematically Undervalued
Agent memory is one of the most systematically undervalued directions in AI right now. The vast majority of agents’ “memory” is still at the conversation history level — session ends, memory vanishes. It’s like an employee who wakes up with amnesia every morning: you have to teach them from scratch every time, and team knowledge never accumulates.
The jump from conversation history to structured long-term memory isn’t incremental improvement — it’s a phase transition. Conversation history is linear, short-lived, bound to a single session. Zettelkasten-style memory is networked, persistent, and accumulates across sessions. The former is working memory; the latter is an external brain. An agent with only working memory can do tasks, but it will never “get smarter” — it won’t learn from experience, won’t debug faster on attempt 50 than on attempt 1.
Zettelkasten is essentially the direct analog for agent memory. Luhmann used his card box to build an external brain that let him accumulate and connect knowledge across decades. An agent’s memex does exactly the same thing — just compressed from decades to weeks, and from tens of thousands of cards to hundreds. The structure is identical; the principle is the same.
1.2 CLI as Protocol Layer: A Design Philosophy
With the design direction for the memory system established, the next question is: how do you let agents use it?
The most intuitive answer might be: write a Python SDK, expose an API, let the agent call it.
I went a different way. I chose CLI.
Not out of nostalgia, not out of faith in “the Unix philosophy” — because this was the optimal solution I arrived at after getting burned in practice.
CLI Is the Pure Data Layer, Skill Is the Intelligence Layer
Look at this layering:
┌─────────────────────────────┐
│ Skill Layer (Intelligence) │ → LLM calls, judgment logic, workflow orchestration
├─────────────────────────────┤
│ CLI Layer (Protocol) │ → Pure data operations, no LLM dependency, deterministic
├─────────────────────────────┤
│ Storage Layer │ → Plain text, Markdown, Git
└─────────────────────────────┘
What the CLI layer does:
# Write a card (receives full frontmatter + body via stdin)
echo "---
title: MCP Tool Timeout Fix
created: 2026-03-15
source: claude
---
Root cause: tool internally called a slow external API.
Fix: pass state as parameter instead.
See also: [[mcp-tool-self-containment]]" | memex write mcp-tool-timeout-fix
# Read a card
memex read mcp-tool-timeout-fix
# Search (supports keyword search and manifest filtering)
memex search "timeout"
memex search --tag "mcp" --category "bugfix"
# View link relationships
memex links mcp-tool-timeout-fix
# View backlinks (who links to this card)
memex backlinks mcp-tool-self-containment
Notice the characteristics of these commands:
- No LLM dependency. Every command is deterministic. Same inputs, same outputs, every time.
- Pure data operations. Write, read, search, check links — all data operations. No “summarize this” or “make a judgment” — nothing that requires intelligence.
- Each command does one thing. Complex functionality comes from agents composing their own call chains — the agent is the ultimate pipe operator.
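To make "pure data operations" concrete, here is a sketch of how a deterministic backlinks command could work: scan every card body for [[wikilinks]] and invert the mapping. The storage layout and function names are assumptions, not memex's actual code.

```python
# Sketch of a deterministic backlinks computation: same inputs, same outputs,
# no LLM anywhere. Card storage is modeled as a slug -> body mapping.
import re

# Capture the target slug of a [[wikilink]], stopping at ]], |, or #
# so [[slug|alias]] and [[slug#section]] still yield the bare slug.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def extract_links(body: str) -> list[str]:
    """Return the slugs referenced by [[wikilinks]] in a card body."""
    return [m.strip() for m in WIKILINK.findall(body)]

def backlinks(cards: dict[str, str], target: str) -> list[str]:
    """Slugs of all cards whose body links to `target`. Pure data operation."""
    return sorted(slug for slug, body in cards.items()
                  if target in extract_links(body))
```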
The intelligence layer (Skill) handles: read the task description → decide which memories are needed → call CLI to retrieve → read results → decide if it’s enough → if not, follow links to explore further → synthesize all retrieved cards → start executing the task. Every “decide” and “synthesize” here is done by the LLM; every “read” and “retrieve” is done by the CLI.
Why Not an API/SDK?
CLI’s unique advantage is that it’s bilingual — humans and machines both speak it natively. A human can manually debug in the terminal, manually maintain memory, manually verify an agent’s operations. memex read some-card, hit enter, see the content, eyeball it for problems. An API/SDK can’t match this zero-cost human intervention.
On top of that, CLI is inherently testable — every command can be manually run in a terminal to verify. Inherently composable — pipes, redirects, scripts, 30 years of Unix tooling available out of the box. And any agent framework — Claude Code, Cursor, Windsurf — can invoke CLI commands. Language-agnostic, environment-agnostic.
Memory Is Independent of Any Agent Environment
This is crucial. The memory system is a standalone CLI tool. It doesn’t know who’s calling it — Claude Code, Cursor, or a human typing commands in a terminal. It doesn’t care.
This means: you can switch agent frameworks and memory is unaffected. You can have multiple agent frameworks accessing the same memory simultaneously. You can maintain memory manually when no agent is running. Backup is memex sync push, restore is memex sync pull — in fact, the MCP server auto-fetches before recall and auto-pushes after retro, so you don’t even need to sync manually.
This decoupling seems obvious, but I’ve seen too many people lock their memory into a specific agent framework. The moment that framework upgrades or you switch products, all your memory is gone.
Knowledge is your most valuable asset. Don’t lock it inside any single tool.
1.3 The Three-Tier Dispatch Model: Tool Protocol → Instruction → Workflow
CLI solves “how does the agent access memory.” But agents need to do more than access memory — they need to call various tools, follow various rules, execute various workflows. How do you “dispatch” these capabilities to an agent?
In practice I discovered a three-tier model. It wasn’t designed from the top down — it was extracted from chaos after hitting enough walls.
Tier 1: Tool Protocol (MCP)
MCP (Model Context Protocol) is a tool protocol initiated by Anthropic, widely adopted by major AI coding tools in 2025 (Cursor, Windsurf, Claude Code, etc.). It defines what tools an agent can call, what parameters each tool accepts, and what results it returns.
Tool Descriptions Are Runtime Instructions, Not Documentation
Most people write tool descriptions like API docs: “Read a card by slug.” Concise, accurate, useless.
When an agent sees a description like that, it knows the tool can read cards — but not when to use it, how to use it well, or what goes wrong when you misuse it.
Look at the actual description of memex’s memex_recall:
“IMPORTANT: You MUST call this at the START of every new task or conversation, BEFORE doing any work. This retrieves your persistent memory — knowledge cards from previous sessions with [[bidirectional links]]. Returns the keyword index (if exists) or card list. Optionally search by query. Without calling this first, you will miss context from prior sessions and repeat past mistakes.”
This isn’t documentation — it’s a runtime instruction. It contains:
- When to use it: “at the START of every new task, BEFORE doing any work”
- What it does: “retrieves your persistent memory — knowledge cards”
- Usage strategy: “Returns the keyword index (if exists) or card list”
- Consequence warning: “you will miss context and repeat past mistakes”
The essence is embedding instructions into the tool interface. The agent doesn’t need separate documentation to learn — the tool schema itself is the tutorial.
MCP Tools Must Be Self-Contained
When an MCP tool is invoked, it should be able to complete its task using only its input parameters, with no dependency on external state.
Anti-pattern: "Read the currently selected card." — “currently selected” implies external state. If another session changes the selection, behavior becomes unpredictable.
Correct pattern:
{
  "name": "memex_read",
  "inputSchema": {
    "properties": {
      "slug": { "type": "string", "description": "Card slug (e.g. 'my-card-name')" }
    },
    "required": ["slug"]
  }
}
All necessary information is in the parameters. No implicit dependencies. Deterministic. Testable.
I once had a tool that depended on “the result of the last operation” — worked fine in a single session, blew up immediately when running multiple sessions in parallel. Debugging took four hours; the fix took ten minutes — replace the implicit state with an explicit parameter.
High-Level Operations: Formatting Is an Engineering Problem, Don’t Delegate It to the Language Model
Here’s the scenario: I had an agent create a Zettelkasten card. The initial design gave it a low-level write_file tool and let it assemble the Markdown according to Zettelkasten format on its own — YAML frontmatter fields, date formatting, wikilink syntax.
Result: roughly one formatting error every six or seven attempts. Missing created field in frontmatter, wrong date format, missing brackets in wikilinks. After a few hundred operations, these malformed cards poison subsequent searches and link traversals.
The solution is a two-tier design. Compare memex’s MCP tools:
- Low-level memex_write: receives full frontmatter + body; the agent must assemble the format itself
- High-level memex_retro: the agent provides only slug, title, body, and category; frontmatter is auto-generated by code (timestamp, source tag); auto-triggers git sync after writing
# Low-level: agent assembles format (error-prone)
memex_write({ slug: "new-card", content: "---\ntitle: ...\ncreated: ...\n---\n..." })
# High-level: agent provides only semantic info, format guaranteed by tool
memex_retro({ slug: "new-card", title: "New Insight", body: "...", category: "architecture" })
memex_retro handles all formatting details internally: YAML frontmatter generation, timestamp insertion, source auto-tagged to the caller (Claude Code / Cursor), auto-sync to remote after write. Formatting is an engineering problem. Don’t delegate it to the language model.
Tier 2: Instruction Injection
Tool protocol defines what an agent can do. Instructions define what an agent should do.
The most intuitive example is the difference between good and bad instructions.
Bad instruction: “Try to reference previous memory when possible.” In practice this is effectively a no-op — the agent doesn’t know what “try” means, doesn’t know what to look for, or how much is enough.
Good instruction (from my CLAUDE.md):
## Memory Recall
- **Session start**: At the beginning of every session, run `memex read index`
and scan for cards relevant to the current task. Read 2-3 most relevant cards
before starting work.
- **Before debugging**: When diagnosing an issue, search memex for related patterns
before tracing code from scratch.
The difference in effectiveness is enormous. The former leads to agents occasionally glancing at memory and mostly ignoring it. The latter makes agents proactively load context at every session start and check the knowledge base before every debug session — what used to take 15-20 minutes of “re-understanding the project” at each session start now takes 2-3 minutes to get up to speed.
Instructions and tool descriptions are complementary: tool descriptions are embedded in the tool interface and take effect when the tool is used; instructions live in the context window and stay active throughout the entire session. One says “how to use this tool well,” the other says “under what circumstances to use this tool.”
Tier 3: Workflow Orchestration (Skill)
The first two tiers handle “what tools does the agent have” and “what rules should the agent follow.” The third tier solves: how do multiple steps chain into a complete workflow?
This is the role of Skills. Skill = reusable intelligent workflow. (In this book, “Skill” refers to Claude Code’s built-in concept, but the pattern applies to any agent framework — Cursor has “rules,” other frameworks have similar workflow abstractions.)
Take the memory recall skill as an example. Initially I designed it as a single-step operation — search keywords, return results. Terrible results, because keyword search is too coarse; the agent would find irrelevant results and just stop.
After iteration, it became a multi-step workflow:
1. Read the current task description
2. Call memex_recall to get index or search results (LLM chooses keywords)
3. Read top 2-3 cards
4. For each card, check its [[wikilinks]]
5. Decide whether to follow links for further exploration (LLM decides)
6. If yes, call memex_read on link targets (max 2 more hops)
7. Synthesize all retrieved cards, begin executing the task
Steps 2 and 5 require LLM judgment; the rest are deterministic tool calls. This is the concrete realization of “CLI is the pure data layer, Skill is the intelligence layer.” Define once, reuse infinitely.
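That separation can be sketched as a loop where the deterministic operations and the LLM judgments are passed in as distinct callables. All names here are stand-ins, not memex's real API:

```python
# Sketch of the recall skill: deterministic CLI calls vs. LLM judgment points.
# Every callable is a hypothetical stand-in for the real CLI / model.
from typing import Callable

def recall_workflow(task: str,
                    recall: Callable[[str], list[str]],         # CLI: index/search
                    read_card: Callable[[str], str],            # CLI: read a card
                    extract_links: Callable[[str], list[str]],  # CLI: parse [[links]]
                    choose_keywords: Callable[[str], str],      # LLM judgment (step 2)
                    should_follow: Callable[[str, str], bool],  # LLM judgment (step 5)
                    max_hops: int = 2) -> list[str]:
    """Iterative exploration: read top cards, then follow links up to max_hops."""
    cards = [read_card(slug) for slug in recall(choose_keywords(task))[:3]]
    frontier, hops = list(cards), 0
    while frontier and hops < max_hops:
        next_frontier = []
        for card in frontier:
            for slug in extract_links(card):
                if should_follow(task, slug):       # LLM decides per link
                    next_frontier.append(read_card(slug))
        cards += next_frontier
        frontier, hops = next_frontier, hops + 1
    return cards  # the LLM synthesizes these before executing the task
```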
The Principle Behind the Three Tiers: Progressive Disclosure
This three-tier dispatch model wasn’t designed on a whiteboard. It’s grounded in a classic UI principle — progressive disclosure — and that principle applies equally well to LLM prompts.
In UI design, progressive disclosure means: don’t dump all information on the user at once; reveal it in layers, on demand. The same logic applies to LLMs. Cramming everything into the system prompt — tool definitions, all instructions, every workflow detail — wastes the context window and dilutes attention. Model attention is a finite resource; the more you stuff in, the lower the execution quality of each individual instruction.
The three-tier dispatch model maps neatly to three levels of progressive disclosure:
- Capability declaration (always visible) — Tool Protocol layer. MCP tool descriptions are always visible to the agent, defining capability boundaries. But they’re declarations, not execution.
- Behavioral rules (active throughout) — Instruction layer. Core rules in CLAUDE.md and system prompts, in effect for the entire session. Keep the count low — fewer items means higher execution quality per item.
- Execution details (expanded on demand) — Workflow layer. Skills pull cards from memory on demand during execution, expanding workflow steps as needed. No preloading, no redundancy.
Information layering isn’t just architectural aesthetics — it’s a core principle of prompt engineering. The more precise and on-demand the context you give an agent, the higher its execution quality. This is no different from giving instructions to humans — a good manager doesn’t dump all documentation on a new hire and say “figure it out,” but provides context in stages, matched to the task at hand.
Design Principles for the Three Tiers
- Tool Protocol: self-contained, deterministic, description-as-instruction
- Instruction Injection: explicit, specific, verifiable — no “best effort,” demand “do X then Y”
- Workflow Orchestration: separate intelligent decisions from deterministic operations; make it reusable
Violate any of these, and agent behavior becomes unpredictable. And unpredictability is the mortal enemy of production.
1.4 In Practice: From 0 to 227 Cards
Enough theory. Let’s look at what actually happened.
Minimum Viable Validation: Will Agents Actually Use Memory?
Day 1, the memory system had only three commands: memex write, memex read, memex search.
No tags, no links, no index. The Day 1 goal was to validate the core hypothesis: will agents actually use memory?
The answer: yes, and they become heavily dependent on it. Once I added the “read relevant memory at session start” instruction to CLAUDE.md, agent performance improved immediately and dramatically. That day produced 12 cards, mostly architecture decisions and key API interface definitions.
The First Scaling Pain Point: Links and Tags
By Day 2, 12 cards weren’t enough. Keyword search was too coarse — as cards multiplied, precision dropped.
This is when [[wikilink]] syntax and tag filtering entered the picture. Links were written directly into card text — sometimes added by me manually, sometimes embedded naturally by agents as they recorded things. Backlinks were computed via the memex backlinks command. By the end of Day 3, the card count reached 47.
The “Map” Epiphany: Index Cards
47 cards, averaging two or three links each — the network had real complexity. The agent needed a map.
memex read index
index is a manually maintained special card — an overview of all card slugs with brief descriptions. The agent reads the index at session start, does a quick scan, and decides which cards to read in depth. This pattern is extremely effective — the agent gets a table of contents instead of rummaging through a pile of files.
In fact, the memex_recall high-level MCP operation is implemented exactly this way: try to read the index card first; if it doesn’t exist, fall back to listing all cards.
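The fallback logic described above is small enough to sketch directly. The storage accessors here are hypothetical stand-ins:

```python
# Sketch of the memex_recall fallback: prefer the index card, else list all cards.
# `read_card` is assumed to return None when a card doesn't exist.
def recall_entrypoint(read_card, list_cards):
    """Return the index card if it exists, otherwise the full card list."""
    index = read_card("index")
    if index is not None:
        return {"kind": "index", "content": index}
    return {"kind": "card-list", "content": list_cards()}
```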
Multi-hop exploration doesn’t require a special graph-traversal command either: the agent implements it through iterative links → read → links calls. This echoes the earlier argument — the LLM handles judgment, the CLI handles data operations, and every hop is decided with the current task in context. By the end of Day 5, the card count broke 100.
A Concrete Debugging Case
Day 5, an agent session hit an MCP tool call timeout while working on the webhook module. The agent’s debug flow:
1. Call memex_recall to read the index, scan for relevant cards
   → Found "mcp-tool-timeout-patterns"
2. memex_read to read that card
   → Content: list of known timeout causes, body contains [[mcp-tool-self-containment]]
3. memex_read to read "mcp-tool-self-containment"
   → Content: MCP tools must be self-contained, no external state dependencies
4. Agent forms hypothesis: timeout may be caused by the tool internally waiting on external state
5. Checks code, confirms: the webhook handler's MCP tool was calling another service's status API internally, and that API was slow
6. Fix: move the status query outside the tool, pass it in as a parameter
7. Call memex_retro to create a new card "webhook-handler-timeout-fix", whose body embeds [[mcp-tool-self-containment]]
The entire debug process: 8 minutes. Without memory, the agent would need to analyze from scratch — probably 30-40 minutes, and it might never arrive at the “self-containment” angle.
More importantly, look at step 7: the agent wrote this debugging experience back into memory. The next time any session encounters a similar problem, it can start directly from this card.
This is the core value of a memory system: knowledge accumulation is cross-session, and error fixes are one-time.
By the end of Day 7, the project shipped. Total: 227 cards. Every core operation is either reading a file or traversing an index. No full scans, no re-indexing. Plain text + git — scaling is essentially unlimited. These cards later became the source material for this book.
1.5 Chapter Summary
The core of this chapter: tool interface > model capability. When an agent fails, fix the tool first — don’t swap the model.
The three-tier model (MCP / Instruction / Skill) is the systematic implementation of this principle — Tool Protocol defines capability boundaries, Instructions embed decision-making context, Skills orchestrate reusable intelligent workflows. And the Zettelkasten-style memory system proves something counterintuitive: explicit structure + LLM judgment systematically outperforms black-box embedding in the agent memory use case.
These principles aren’t theoretical derivations. Every single one has at least one concrete project failure behind it.
Next chapter, we move from the infrastructure of tools and memory to a more fundamental question: how do you constrain agent behavior? Once an agent has capabilities (tools) and memory, the next step isn’t “let it loose” — it’s “design the reins so it can’t go off-track.” That’s Harness-Native Engineering — the core methodology of this book.