Touchskyer's Thinking Wall
Ch 3
Multi-Silicon Collaboration

Not Delegating Tasks, but Arming Agents

The previous chapter solved a fundamental problem: how to constrain a single agent so its output is reliable. Mechanical gates, independent review, tamper detection — these are the reins for harnessing a single execution engine. But when you’re not riding one horse but commanding a cavalry, the problem shifts from “how to constrain” to “how to arm.” Constraint sets the floor. Armament is what makes the team dangerous.

Most people’s mental model of “building a team with AI” goes like this: you’re the boss, AIs are the employees, you break down tasks, hand them out one by one, they each do their thing, you aggregate at the end.

This understanding isn’t wrong, but it stops at task delegation — you’re assigning work.

The problem with task delegation is that it assumes you can decompose tasks finely enough and clearly enough. But real engineering work isn’t an assembly line. A task like “implement the user authentication module” will encounter decisions during execution: “should we change the database schema?”, “how does this play with the existing session mechanism?” Each decision requires context, judgment, and a global understanding of the system.

If your agent is just a “receive task, execute task” operator, it will stall or guess blindly at every decision point. Either you intervene constantly (making you the one doing the actual work) or you let it decide on its own (and brace for rework).

The approach that actually scales is capability empowerment — you’re not giving the agent a task, you’re giving it a complete capability kit that enables it to make sound decisions autonomously during execution.

This kit includes:

  • A complete toolset: not “only the tools it needs,” but enough tools for it to independently solve problems encountered during execution.
  • Sufficient context: not just “what to do” but “why we’re doing it this way,” “what the overall system architecture is,” “which constraints must not be violated.”
  • An explicit decision framework: when judgment calls arise, what should be prioritized? Which situations require escalation? Which can be decided autonomously?
  • Collaboration protocols: how to interact with other agents? How to handle conflicts? How to share discoveries?

That’s what “arming” means — you’re not hiring temps, you’re training special forces.
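The four-part kit above can be made concrete. Here's a minimal sketch of what a capability kit might look like as a data structure handed to an agent at spawn time. All names (`CapabilityKit`, the field names, the tool strings) are illustrative, not from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class CapabilityKit:
    tools: list[str]                       # complete toolset, not a minimal one
    context: dict[str, str]                # the "why", architecture, hard constraints
    decision_rules: dict[str, str]         # what to decide autonomously vs escalate
    collaboration: dict[str, str] = field(default_factory=dict)  # protocols

kit = CapabilityKit(
    tools=["read_file", "write_file", "run_tests", "search_code"],
    context={
        "goal": "implement the user authentication module",
        "why": "replace session cookies with signed tokens",
        "constraint": "must not change the database schema",
    },
    decision_rules={
        "schema_change": "escalate to coordinator",
        "naming": "decide autonomously",
    },
)
```

The point of the structure isn't the specific fields but the completeness check it forces: if `decision_rules` is empty, you've handed out a task, not a kit.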

Two Execution Models for Agents: Spawn vs Delegate

When your main agent needs another agent’s help, there are two fundamentally different models. These aren’t industry-standard terms — they’re a classification framework I’ve distilled from practice. Different agent frameworks may use different names, but the underlying tradeoffs are the same.

The Spawn Model

The main agent launches a brand-new agent session, gives it a task description and necessary context, then they operate independently. The spawned agent runs autonomously with its own context window, its own toolset, its own execution rhythm. Once done, it returns results to the main agent.

Spawn’s defining characteristic is session isolation. The two agents don’t share a context window; they communicate via message passing. Note: this isn’t OS-level process isolation — multiple agents can absolutely run within the same process. The key is that they hold independent conversation state. This has several direct consequences:

  • Good parallelism: multiple spawned agents can run simultaneously without blocking each other.
  • Fault isolation: one agent crashing doesn’t affect the others.
  • Context independence: each agent starts from a clean context, uncontaminated by other agents’ reasoning. This is critical for independent review.
  • High communication cost: every interaction is a full “pack-transmit-unpack” cycle. You can’t assume the other side “should know” something — you must explicitly pass all necessary information.

The Delegate Model

The main agent hands off a sub-task to a sub-routine via tool call. After execution, control returns to the main agent along with the results.

Delegate is more like a function call. Its defining characteristic is centralized control flow: the main agent always holds the reins, and the sub-routine is an extension of it rather than an independent entity.

  • Low latency: no process startup overhead, results available immediately.
  • Shared context: the delegate can access part or all of the main agent’s context.
  • Serial execution: the delegate blocks within the main agent’s execution flow.
  • High coupling: the delegate’s behavior depends on the main agent’s context — unsuitable for tasks requiring independent judgment.
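The contrast between the two models comes down to conversation state. Here's a minimal sketch, with `call_llm` as a stand-in for whatever model API you use; session isolation is modeled simply as separate message histories:

```python
def call_llm(messages):
    # Placeholder for a real model call.
    return f"handled {len(messages)} messages"

def spawn(task, explicit_context):
    # New session: the sub-agent sees ONLY what we explicitly pass.
    messages = [{"role": "user", "content": f"{explicit_context}\n\nTask: {task}"}]
    return call_llm(messages)

def delegate(task, main_session):
    # Same session: the sub-routine inherits the main agent's full history.
    main_session.append({"role": "user", "content": f"Sub-task: {task}"})
    result = call_llm(main_session)
    main_session.append({"role": "assistant", "content": result})
    return result
```

Note how `spawn` forces you to pack `explicit_context` by hand, which is exactly the "high communication cost" tradeoff, while `delegate` gets the whole history for free, which is exactly the coupling.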

When to Use Which

Through extensive practice, this choice can be formalized:

Use Spawn when:

  • The task requires independent judgment (review, second opinion)
  • The task can run in parallel
  • The task may be long-running
  • Fault isolation is needed

Use Delegate when:

  • The task is a natural extension of the current line of thinking
  • The task needs the main agent’s full context
  • The task is lightweight and spawn overhead isn’t worth it
  • Results are needed immediately for the next decision

A common mistake is delegating what should be spawned. The classic example is code review — you have the builder agent invoke a review tool within its own context. The tool is nominally “another agent,” but it inherits the builder’s context, has seen the builder’s reasoning, and is essentially still self-review. Chapter 2 covered why that doesn’t work.

The opposite mistake also exists: spawning what should be delegated. Something like “read a config file and return a specific field” — you spawn an agent for this, wait for it to initialize, execute, and return, taking 30 seconds. A tool call could handle it in 1 second.

There are also critical orchestration mechanisms:

Notification system: after a spawned agent completes its task, the main agent needs to be notified. A common approach is through a message bus — the agent posts a message to the bus upon completion, and the main agent subscribes to topics of interest. Far more efficient than polling.

Tracker: when you have multiple spawned agents running concurrently, you need a single place to track their status — who’s running, who’s done, who’s stuck, who’s failed. This tracker is the coordinator’s eyes.

Abort Cascade: when a task is cancelled, all sub-agents spawned for that task should also be cancelled. Sounds obvious, but if you don’t explicitly implement this cascade, you’ll regularly encounter “the main task was cancelled but sub-agents are still chugging away” — wasting resources is the minor issue; they might still be modifying files, creating ghost changes (modifications to files that no active task owns).
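All three mechanisms are small to sketch. Below is an illustrative (not framework-specific) version of a topic-based bus for completion notifications, plus a tracker that records status and the parent-to-child spawn tree so that aborts can cascade:

```python
from collections import defaultdict

class Bus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # Push, not poll: the coordinator reacts when an agent finishes.
        for handler in self.subscribers[topic]:
            handler(payload)

class Tracker:
    def __init__(self):
        self.status = {}                    # agent_id -> "running"/"done"/"aborted"
        self.children = defaultdict(list)   # parent_id -> spawned agent_ids

    def spawn(self, parent_id, agent_id):
        self.status[agent_id] = "running"
        self.children[parent_id].append(agent_id)

    def abort_cascade(self, agent_id):
        # Cancel this agent and, recursively, everything it spawned.
        self.status[agent_id] = "aborted"
        for child in self.children[agent_id]:
            self.abort_cascade(child)
```

The cascade is the part people forget: without the recursive walk over `children`, cancelling a task leaves its grandchildren running and producing ghost changes.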

Failure Taxonomy for Multi-Agent Systems

Before diving into specific routing and orchestration designs, it’s worth building a failure model first. When something goes wrong in a multi-agent system, most people’s instinct is “fix the agent that has the bug.” But multi-agent system failures often aren’t in any single agent — they’re in the seams between agents.

I classify multi-agent system failures into three layers:

Agent-layer failures: capability problems in a single agent. Hallucinations, tool-use failures, forgetting things because the context window overflowed. This is the easiest layer to diagnose — you can point to a specific agent and say “this one screwed up.” The harness system from Chapter 2 is primarily designed to combat this layer.

Coordination-layer failures: collaboration problems between agents. Information gets lost or distorted in transit. Two agents waiting on each other’s output create a deadlock. Multiple agents make contradictory decisions about the same problem and nobody notices. The signature of this layer: every agent looks correct in isolation, but they break when combined. The explicit routing and conflict management discussed below are fundamentally fighting coordination-layer failures.

System-layer failures: resilience problems in the overall architecture. No fallback mechanism means one agent going down stalls the entire pipeline. Errors propagate and amplify through the agent chain (agent A outputs a small deviation, agent B makes a larger deviation based on it, agent C writes the deviation into the final deliverable as fact). Missing global timeouts and circuit breakers allow the system to enter an unrecoverable state.

The significance of this three-layer model: it tells you that failures at different layers require responses at different layers. Tactically tweaking a single agent’s prompt won’t fix coordination-layer and system-layer problems. You need systemic strategies — explicit communication protocols, structured state management, and fallback design running through the entire chain. The specific designs in the following sections are all responses to this three-layer failure model.

Routing Must Be Explicit: Lessons from Three Systems

This might be the most important lesson in this chapter: state passing between agents must be explicit.

Let me illustrate with a comparison across three projects.

Agent Store’s Routing Bug

I’ve seen a real case from an agent platform: its agent store lets users create custom agents, each with its own toolset and behavior configuration. In the early design, routing between agents relied on an implicit context-passing mechanism — when agent A called agent B, the system automatically injected A’s “current state” into B’s context.

Sounds convenient. In practice, a nightmare.

The bug manifested as: agent B would occasionally make baffling decisions, as if responding to a question nobody asked. Investigation revealed that A’s context contained residual state from a previous conversation turn, which was implicitly passed to B, and B treated it as current task context.

This wasn’t an ordinary bug. It was a design flaw at the architectural level: implicit state passing makes data flow impossible to reason about. You don’t know what information an agent received, where it came from, or whether it’s still valid.

Three Systems Compared

That agent platform: post-fix, switched to explicit message passing. When agent A calls agent B, it must declare all transmitted information in the message payload. No “automatic context inheritance” — whatever you pass, B has; whatever you don’t, B doesn’t. This increases the caller’s workload but makes every interaction auditable.

OpenClaw (an open-source config-driven agent routing system): adopted an 8-tier priority cascade routing design. It’s a purely config-driven binding system — each request passes through 8 priority-ordered matching layers and gets routed to the target agent by rules. All routing rules live in configuration files with no runtime dynamic inference. You can open the config file and answer “which agent will this request be routed to?” without running any code.

OpenClaw’s 8-tier cascade, from highest to lowest priority:

  1. Explicit user specification
  2. Scenario-specific binding
  3. User preference
  4. Topic binding
  5. Capability matching (tag-based — agents declare capability tags; requests carry required tags; matching is a deterministic set intersection, no LLM inference)
  6. Load balancing
  7. Default route
  8. Fallback

Every layer is a deterministic rule. The moment any layer matches, it stops — no further cascading.

The elegance of this design: it externalizes all routing decisions into configuration. Adding new routing rules requires no code changes — just add a config entry at the appropriate tier. Debugging routing issues requires no breakpoints — just look at the config file and request parameters.
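To make the cascade concrete, here's a sketch in the spirit of OpenClaw's design: each tier is a deterministic rule, and the first match wins. The tier order follows the list above; the rule implementations and config shape are my own illustration, not OpenClaw's actual code:

```python
def match_capabilities(request, config):
    # Tier 5: deterministic set intersection, no LLM inference involved.
    required = set(request.get("tags", []))
    for agent, tags in config["capabilities"].items():
        if required and required <= set(tags):
            return agent

def least_loaded(config):
    # Tier 6: pick the agent with the lowest current load, if any.
    loads = config.get("loads", {})
    return min(loads, key=loads.get) if loads else None

def route(request, config):
    tiers = [
        lambda r: r.get("explicit_agent"),                    # 1. explicit user spec
        lambda r: config["scenario"].get(r.get("scenario")),  # 2. scenario binding
        lambda r: config["user_pref"].get(r.get("user")),     # 3. user preference
        lambda r: config["topic"].get(r.get("topic")),        # 4. topic binding
        lambda r: match_capabilities(r, config),              # 5. capability match
        lambda r: least_loaded(config),                       # 6. load balancing
        lambda r: config.get("default"),                      # 7. default route
        lambda r: "fallback-agent",                           # 8. fallback
    ]
    for tier in tiers:
        target = tier(request)
        if target:
            return target   # first match stops the cascade
```

Because every tier is a pure lookup over `request` and `config`, the question "which agent will this request be routed to?" is answerable by reading the config, exactly the auditability property the design is after.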

Claude Code (my personal usage pattern): the simplest model — one main agent directly calls subagent tools with all necessary information in the parameters. No middleware, no message bus, no priority cascade. Works for small-scale scenarios, but routing logic is scattered across prompts and tool calls — doesn’t scale.

The shared lesson across all three systems: no matter how simple or complex your routing is, state must be explicit. Implicit state passing is the original sin of agent systems. Humans can rely on shared working memory and cultural context to fill in what’s left unsaid. LLMs can’t — they only process the tokens you give them. The thing you think “it should know”? It doesn’t.

Protocols Over Frameworks: The Underrated Dimension of Agent Infrastructure

In the agent orchestration space, there’s one thing that’s systematically overrated and one that’s systematically underrated. The overrated thing is frameworks. The underrated thing is protocols.

Choose LangChain and you’re locked into its execution model, memory scheme, prompt templates, and tool-calling conventions. Choose CrewAI and you’re locked into its role-based abstractions and task orchestration patterns. The problem with frameworks isn’t that they’re not useful — they work great at the demo stage. The problem is coupling. When you need to swap an LLM provider, change a memory strategy, or integrate a tool the framework didn’t anticipate, you find yourself wrestling with the framework’s abstraction layer instead of solving your business problem.

MCP’s (Model Context Protocol) success is a direct validation of protocol thinking. MCP isn’t a framework — it doesn’t care what language you implement your server in or how your agent orchestrates internally. It only defines the communication contract between agent and tool. This is the same relationship HTTP has with web frameworks: HTTP defines how client and server talk, but whether you use Django or Express or raw sockets, HTTP doesn’t care. It’s precisely this loose coupling that enabled MCP’s rapid adoption — any framework can plug in without abandoning existing technical choices.

For multi-agent systems, the practical implication is: build your orchestration logic on protocols, not frameworks. Define clearly how agents pass messages, what state formats look like, how errors get reported, how capabilities are declared. What framework each agent uses internally is its own business. The explicit message passing, structured review reports, and three-tier blocking/warning/info classification discussed earlier — these are all protocol-level designs, independent of any specific framework.
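What does a protocol-level design look like in practice? Here's a minimal sketch of an inter-agent message envelope. The field names are illustrative, not a published standard; the point is that every message declares sender, recipient, intent, and a complete payload, and that the format is plain serializable data any framework can produce or consume:

```python
import json

def make_message(sender, recipient, intent, payload):
    msg = {
        "sender": sender,
        "recipient": recipient,
        "intent": intent,      # e.g. "review_request", "result", "error"
        "payload": payload,    # ALL information the recipient needs
    }
    return json.dumps(msg)     # serializable: auditable and framework-neutral

def parse_message(raw):
    msg = json.loads(raw)
    # Reject malformed messages at the boundary instead of letting a
    # half-formed request leak into an agent's context.
    for key in ("sender", "recipient", "intent", "payload"):
        if key not in msg:
            raise ValueError(f"malformed message: missing {key}")
    return msg
```

Validation at the boundary is itself part of the protocol: errors get reported as explicit `error`-intent messages rather than as silently degraded behavior downstream.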

Companion Design: Full-Capability vs Constrained

When you have multiple agents, you face a design choice: what should the agent that users directly interact with (the companion) look like?

Two schools of thought:

The Router school: the companion is a router. It receives the user’s request, analyzes intent, and forwards to the appropriate specialist agent. The companion itself does no real work — only dispatch. This design looks “architecturally correct” — separation of concerns, single responsibility.

The Full-capability school: the companion is a generalist agent. It can handle most requests on its own and only calls specialist agents when necessary. The user always perceives a single entity.

In practice, I’ve consistently chosen full-capability.

OpenClaw and Claude Code — the main agent in both products is full-capability.

Why?

Because the router pattern has a fatal latency problem. The user says something, the router first has to understand intent (one LLM call), then route to a specialist agent (one spawn or delegate), then the specialist starts processing. Every hop adds latency. A full-capability agent handles it directly — one less hop.

The deeper reason is context completeness. In the router pattern, the specialist only gets the information the router passes along. If the router drops important context (say, a constraint the user mentioned three turns ago), the specialist makes decisions that miss expectations. A full-capability agent has been in the conversation from the start — it has the complete context.

Of course, full-capability doesn’t mean doing everything yourself. It means having the ability to do it yourself, but choosing when to call specialists. The choice rests with the companion, not forced by the architecture.

There are cases where a router architecture does win: very large tool surfaces where no single agent can hold them all, strict security boundaries between domains (e.g., multi-tenant environments where agents must not cross data boundaries), or regulatory environments where audit trails demand a clean separation of concerns. But for most product-building scenarios, full-capability is the better default.

Once you’ve chosen full-capability, a practical question follows: how do you manage what the companion is allowed to do? In practice, the answer is organizing tool policy by theme, not by individual tool.

Tool Policy by Theme, Not by Tool

The traditional approach to permissions: list all tools, individually mark whether each agent can use them. This produces a massive permission matrix — N agents × M tools — with crushing maintenance cost.

The better approach is organizing policy by theme. A theme is a group of related capabilities, like “filesystem operations,” “code execution,” “network requests,” “user interaction.” Each agent is granted a set of themes, and all tools within a theme are automatically available.

Benefits:

  1. Adding a new tool doesn’t require updating every agent’s permissions. The new tool belongs to a theme; all agents with that theme automatically gain access.
  2. Permission semantics are clear. “This agent can operate the filesystem” is easier to understand and reason about than “this agent can use read_file, write_file, delete_file, list_directory, move_file, copy_file.”
  3. Fewer configuration errors. Tool-level permissions are easy to miss — you added write_file but forgot create_directory, so the agent can write files but can’t create directories. Theme-level authorization eliminates this problem.
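A theme-based policy is small enough to sketch end to end. Theme and tool names below are illustrative; the mechanics are just set union over granted themes:

```python
# Tools are grouped into themes; agents are granted themes, never raw tools.
THEMES = {
    "filesystem": {"read_file", "write_file", "delete_file", "list_directory"},
    "code_execution": {"run_tests", "run_script"},
    "network": {"http_get", "http_post"},
}

AGENT_THEMES = {
    "builder": {"filesystem", "code_execution"},
    "reviewer": {"filesystem"},   # read-heavy role: no execution, no network
}

def allowed_tools(agent):
    # Effective permissions = union of all tools in the agent's themes.
    return set().union(*(THEMES[t] for t in AGENT_THEMES[agent]))

def can_use(agent, tool):
    return tool in allowed_tools(agent)
```

Benefit 1 falls out mechanically: registering a new tool means adding it to one theme's set, and every agent holding that theme gains it with no per-agent config change.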

Conflict Management: Parallel Agents

When multiple agents work in parallel, you will inevitably hit this problem: they will edit the same file at the same time.

This isn’t an edge case — it’s the norm. Consider a realistic dev scenario: agent A is implementing a new API endpoint, agent B is refactoring error handling logic. Both need to modify app/middleware.py. If they don’t know about each other, each making different changes based on the same version, you’re left manually resolving merge conflicts.

And this is trickier than human git conflicts. A human developer has a mental model — when they see conflict markers, they understand both sides’ intent and can make a sensible merge decision. Agents don’t have this ability. When an agent sees conflict markers, the most likely response is to delete one side entirely.

Here are the strategies I’ve distilled from practice:

Strategy 1: Pessimistic Locking at the File Level

The simplest, bluntest approach. During task assignment, identify which files each task will touch and ensure no two parallel tasks touch the same file. If there’s a conflict, serialize one of them.

Upside: completely prevents conflicts. Downside: overly conservative. Two agents might both need config.py, but one changes line 10 and the other changes line 200 — no actual conflict. Pessimistic locking forces them to serialize anyway.
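The scheduling logic for strategy 1 is a few lines. This is a sketch of the idea, not any particular framework's scheduler: each task declares its files up front, and a task may start only if none of its files are held by a running task:

```python
class FileLockScheduler:
    def __init__(self):
        self.locked = {}   # file path -> task id currently holding it

    def try_start(self, task_id, files):
        # Pessimistic check at assignment time: any overlap means serialize.
        if any(f in self.locked for f in files):
            return False
        for f in files:
            self.locked[f] = task_id
        return True

    def finish(self, task_id):
        # Release everything this task held.
        self.locked = {f: t for f, t in self.locked.items() if t != task_id}
```

The conservatism is visible in the code: the check is at file granularity, so two tasks touching disjoint regions of `config.py` still serialize.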

Strategy 2: Branches + Auto-Merge

Each agent works on its own git branch. Upon completion, the coordinator handles the merge. If git can auto-merge (changes in different regions), merge directly. If there’s a conflict, the coordinator invokes a merge resolution agent.

Upside: maximizes parallelism. Downside: merge resolution agent quality is inconsistent. For simple conflicts (both sides adding an import line), it does fine. For semantic conflicts — where both sides modify different aspects of the same function, each internally consistent but contradictory when combined — it frequently gets it wrong.

Strategy 3: Architectural Isolation

The most fundamental approach. If your code architecture is sufficiently modular, each agent works within its own module, and cross-module interaction happens through well-defined interfaces. Parallel agents naturally never edit the same file.

Upside: solves the problem at the root. Downside: requires upfront architectural investment. Not every codebase has this luxury.

In practice, I usually mix all three: start with architectural isolation (minimize the possibility of conflicts), use branches + auto-merge where architecture can’t fully isolate, and apply pessimistic locking for known high-contention files (global config, database migrations, route tables).

There’s one more point that’s critical but easy to overlook: when delegating in phases, you’re not just verifying components — you’re verifying wiring.

When you split a large task into multiple phases, each completed by a different agent, every agent’s individual component might be correct, but the connections between them might be wrong. Agent A exposes a function called get_user, agent B calls fetch_user. Each one’s unit tests pass, but integration blows up.

You must verify wiring at every phase transition, not just verify components. This means integration tests are the gate between phases — not something you only run at final delivery.
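A wiring check can be as simple as asserting that every call one component makes actually exists on the component it calls into. The example below recreates the `get_user`/`fetch_user` mismatch from above; class and function names are illustrative:

```python
class UserStore:
    # Built by agent A: exposes get_user.
    def get_user(self, uid):
        return {"id": uid}

# What agent B's code actually calls, declared explicitly for the check.
REQUIRED_WIRING = {
    "UserStore": ["fetch_user"],   # mismatch: A exposed get_user
}

def check_wiring(components, required):
    # Return every call site whose target doesn't exist. Empty list = wired.
    missing = []
    for name, methods in required.items():
        obj = components[name]
        for m in methods:
            if not hasattr(obj, m):
                missing.append(f"{name}.{m}")
    return missing
```

Run at the phase boundary, this catches the mismatch before either side's unit tests lull you into shipping it; both components pass in isolation, but the seam does not.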

Review in Practice: Multi-Agent Scenarios

Chapter 2 established the necessity of independent review. In multi-agent scenarios, the question is no longer “should we review” but “how do we review efficiently when multiple agents are working in parallel.”

Spawn, Not Delegate

The review agent must be spawned, not delegated — this was established in Chapter 2’s gate hierarchy. But in multi-agent scenarios, how you spawn matters:

  • The reviewer doesn’t share the builder’s context window. It receives the pure deliverable: code + requirements document. The builder’s prompt, reasoning chain, abandoned alternatives — the reviewer knows none of it.
  • The reviewer’s prompt is standardized. Not “please review this code,” but a structured checklist: functional correctness, boundary conditions, error handling, consistency with the existing system.
  • One builder, one reviewer. Don’t have a single reviewer simultaneously review multiple builders’ output — context pollution propagates across tasks within the same session.

Review Results Drive Iteration

Review isn’t the endpoint — it’s one link in the feedback loop. The review agent’s output is a structured report that clearly separates three levels:

  • Blocking: must be fixed before proceeding. Builder re-enters the build-review cycle.
  • Warning: should be fixed but doesn’t block. Builder fixes in the current round or tags as a known issue for the next.
  • Info: advisory. Builder may or may not adopt.

This three-tier classification isn’t about the reviewer performing “I’m being thorough” — it’s about giving the coordinator an actionable signal. The coordinator only needs to check whether the blocking list is empty to decide whether to proceed.
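The actionable-signal property is easiest to see in code. A sketch of the report structure (field names are illustrative): the coordinator's gate check reduces to one boolean:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewReport:
    blocking: list[str] = field(default_factory=list)   # must fix before proceeding
    warning: list[str] = field(default_factory=list)    # fix now or tag as known issue
    info: list[str] = field(default_factory=list)       # advisory only

    def may_proceed(self):
        # The coordinator's entire gate decision.
        return not self.blocking

report = ReviewReport(
    blocking=["auth token never expires"],
    warning=["missing test for empty password"],
    info=["consider renaming validate() to verify()"],
)
```

An unstructured prose review forces the coordinator to re-read and re-judge; a structured report makes "proceed or iterate" a mechanical check.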

Quantitative Data: Directional Reference

During a version iteration, I ran a self-review vs independent review comparison across 20 tasks. Small sample size, limited statistical significance, but the directional conclusion is clear:

| Metric | Self-Review | Independent Review |
| --- | --- | --- |
| Real bugs found | 3 | 11 |
| False positive rate (flagged but not real issues) | 45% | 18% |
| Design issues found | 0 | 4 |
| Boundary conditions builder missed | 1 | 7 |

Experimental conditions: same Claude model, same batch of tasks (covering new feature development, bug fixes, and refactoring). Self-review was a delegate within the same session; independent review was a spawned independent session that only saw code and requirements. (Note: this comparison doesn’t control for token consumption — the independent reviewer may have used more tokens per review. Some of the improvement likely comes from more thorough reading, not independence alone. The directional signal is still clear, but take the exact magnitudes with a grain of salt.)

A few patterns worth noting:

  • Self-review is virtually incapable of finding design-level issues. The builder chose a design approach; during self-review it defends that approach rather than questioning it.
  • Independent review leads dramatically in discovering boundary conditions. Boundary conditions the builder didn’t think of during implementation won’t be thought of during self-review either — cognitive blind spots within the same context window are shared.
  • Self-review has an extremely high false positive rate. It nitpicks on irrelevant style issues (“naming isn’t great,” “comments could be more detailed”) while waving through real problems.

On cost: at current LLM pricing (as of mid-2025), one review’s token cost runs from a few cents to a few tens of cents. Even if only 10% of reviews catch a critical bug, compared to the cost of a bug that escapes to production (locating, fixing, testing, deploying, cleanup), the ROI easily exceeds 50x.

With these review patterns in place, let’s compress this chapter’s lessons.

Chapter Summary

This chapter’s theme is the leap from “one person using one AI” to “one person using a team of AIs.” The core isn’t technical complexity — it’s a shift in how you think about the problem.

Key principles:

  1. Arm, don’t assign. Give agents a complete capability kit so they can autonomously handle uncertainty during execution. Full-capability companion beats router.

  2. Spawn vs Delegate is an architectural decision. Tasks needing independence get spawned; tasks needing shared context get delegated. The wrong choice causes systemic quality problems (delegating review) or performance problems (spawning simple lookups).

  3. Routing must be explicit. Implicit state passing is the original sin of agent systems. OpenClaw’s 8-tier cascade is a reference model: all routing logic externalized as configuration — auditable, understandable, modifiable.

  4. Parallel agent conflicts are an architecture problem, not a luck problem. Architectural isolation > branches + auto-merge > pessimistic locking. Also: phased delegation must verify wiring, not just components.

  5. Independent review is the error-correction mechanism of multi-agent systems. It’s not just quality assurance — in an architecture where parallel agents operate autonomously, independent review is the only mechanism that can detect cross-agent blind spots. Spawn it, standardize it, let it drive the iteration loop.

From Chapter 2’s constraint system to Chapter 3’s team architecture, one thread runs throughout: you’re not collaborating with AI — you’re designing an engineering system with AI as the execution layer. The collaboration metaphor implies AI has autonomous judgment and quality standards, which makes you drop your guard. The system metaphor reminds you: every behavior must be designed, constrained, and verified.

Next chapter, we shift perspective from “how to coordinate agents” to “how to make the whole system run autonomously.” Once you have a well-armed AI team, the natural next step is: can you step back from operator to architect and let the pipeline run itself? Autonomous operation isn’t hands-off — it’s engineering precise down to the tick level.
