
EP03 solved the review-perspective problem by adding the Tenth Man to watch direction. But another problem was still growing: with every new product, OPC’s core got fatter. Before tearing apart the extension system, I stopped to ask: how do other people do this?
Not out of academic curiosity. Out of fear. A solo tool-builder’s worst nightmare isn’t building the wrong thing — it’s finishing and then discovering someone else stepped in the same pit five years ago and left a clear warning sign.
I spent a week dissecting 30 open-source AI projects: Cline, Open WebUI, LangChain, Dify, n8n, LobeHub, AutoGen, CrewAI, Pydantic-AI, DSPy, Haystack, Composio, LiteLLM, Agno, vLLM, AutoGPT, LlamaIndex, Semantic Kernel, Activepieces, Letta, plus about ten smaller projects. This was not competitive analysis, since OPC doesn’t compete with any of these. It was decision calibration: I’d already made some design choices and wanted to know whether they leaned left or right in a larger sample.
All data is as of May 2026, and the measures are rough: stars are GitHub display values, test density = tests/ directory LOC / src/ directory LOC (via tokei), and god file line counts come from wc -l. This is not rigorous research; it is empirical judgment from a single snapshot.
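To make the test density measure concrete, here is a minimal sketch of the ratio. It is not the script I used: `count_lines` and `density_ratio` are hypothetical helper names, and counting every physical line is only a rough stand-in for tokei, which excludes blanks and comments, so the numbers will differ somewhat.

```python
from pathlib import Path


def count_lines(root: Path, suffixes: tuple[str, ...] = (".py", ".ts")) -> int:
    """Count physical lines in source files under root (blanks and comments included)."""
    total = 0
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            with path.open(encoding="utf-8", errors="ignore") as handle:
                total += sum(1 for _ in handle)
    return total


def density_ratio(tests_loc: int, src_loc: int) -> float:
    """Test density = tests/ LOC divided by src/ LOC."""
    if src_loc <= 0:
        raise ValueError("src_loc must be positive")
    return tests_loc / src_loc
```

Calling `density_ratio(count_lines(Path("tests")), count_lines(Path("src")))` from a repository root gives a comparable rough figure, assuming the project follows the tests/ and src/ layout.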
Four findings directly influenced OPC’s subsequent design decisions.
The Three Genes of God Files
The first finding came from Cline — a 61k-star VS Code AI coding assistant.
Opening the source: task/index.ts at 3,764 lines, api.ts at 5,062 lines. Two files totaling roughly 8,800 lines, larger than many complete projects. Gut reaction: “this needs to be split.” But Cline has 61k stars, and the god file didn’t kill it.
After dissecting god files across 30 projects, they nearly all fell into three patterns:
| Pattern | Bloat driver | Typical case |
|---|---|---|
| Orchestrator | Agent is the universal entry point; all state converges here | Cline task/index.ts (3,764 lines) |
| Middleware | Each new integration adds an adaptation layer | Cline api.ts (5,062 lines) |
| Config Hub | All configuration routing centralized | LiteLLM (14,693 lines) |
God files aren’t architectural failure — they’re a common byproduct of feature-driven development. When a file becomes the system’s entry point, adaptation layer, or configuration center, every new feature needs a hook here. Bloat isn’t laziness — it’s that the file has natural gravity.
The criteria for when to split aren’t based on line count alone but on a combination of three signals:
- Acceptable: File is orchestrator-natured + has clear internal function boundaries + low modification frequency
- Must split: >5,000 lines AND modified weekly + 3+ unrelated features mixed + new developers need 2+ days to understand
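The two rules above can be written down as a small check. This is a sketch of my reading of them, not a tool that exists in OPC: `FileSignals` and `split_verdict` are hypothetical names, all three "must split" signals are assumed to be required together, and "low modification frequency" is approximated as "not modified weekly".

```python
from dataclasses import dataclass


@dataclass
class FileSignals:
    lines: int
    modified_weekly: bool
    unrelated_features: int      # count of unrelated features mixed in the file
    onboarding_days: float       # days a new developer needs to understand it
    is_orchestrator: bool
    has_clear_boundaries: bool   # clear internal function boundaries


def split_verdict(s: FileSignals) -> str:
    """Classify a large file as 'must split', 'acceptable', or 'judgment call'."""
    must_split = (
        s.lines > 5000
        and s.modified_weekly
        and s.unrelated_features >= 3
        and s.onboarding_days >= 2
    )
    if must_split:
        return "must split"
    if s.is_orchestrator and s.has_clear_boundaries and not s.modified_weekly:
        return "acceptable"
    # Everything in between is left to human judgment.
    return "judgment call"
```

The middle verdict matters: most real files match neither rule cleanly, and the sketch does not pretend otherwise.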
Open WebUI’s middleware.py (5,120 lines) had both orchestrator and middleware genes — hitting all three “must split” signals. But they chose features first: one core developer, rapid iteration — ship now, refactor later. 135k stars proved the market bought in.
Callback to OPC: Later, when splitting the harness from monolith to core + extensions in EP05, I hesitated over whether to fully shatter harness.ts. After seeing Cline’s orchestrator pattern, I changed my mind — the harness is a natural orchestrator. Under 3,000 lines, moderate modification frequency: don’t split. Capability contracts and hook dispatch logic get pulled out, but pipeline scheduling stays in one file so the overall flow is visible at a glance.
The Plugin Difficulty Triangle
Surveying 11 frameworks’ plugin systems revealed a clear difficulty triangle: Developer Experience (DX), Security, and Discoverability — no framework gets all three right simultaneously.
The DX spectrum spans a 20x gap between extremes. LangChain and Pydantic-AI need just 5 lines (one @tool decorator plus a function). LobeHub needs 100 lines (standalone npm package + manifest + handler + UI component). The reason is simple: isolation requires declaring boundaries, and boundaries mean boilerplate.
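For a sense of what the 5-line end of the spectrum looks like, here is a minimal sketch of a decorator-based tool registry. The `tool` decorator and `TOOLS` dict below are stand-ins written for illustration; they are not LangChain’s or Pydantic-AI’s actual decorators, which also handle schemas and validation.

```python
from typing import Callable

# Hypothetical in-process registry: tool name -> function.
TOOLS: dict[str, Callable[..., object]] = {}


def tool(fn):
    """Register a plain function as a tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())
```

Nothing here declares a boundary: the tool runs in the host process with full access, which is exactly why it is so short.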
The security dimension is harsher. Open WebUI uses exec() to run user Python code directly in the server process — amazing DX, zero security. Activepieces isolates each Piece in a sandbox — best security, but writing a Piece requires 30 lines plus SDK learning cost.
| Framework | DX | Security | Discoverability |
|---|---|---|---|
| LangChain | Strong (5 lines) | Weak (no isolation) | Convention-based |
| Open WebUI | Strong (10 lines) | Extremely weak (exec) | Community-based |
| LobeHub | Weak (100 lines) | Medium (package-level) | Marketplace |
| Activepieces | Medium (30 lines) | Strong (sandbox) | Marketplace |
| Haystack | Medium (15 lines) | Medium (pipeline-level) | Registry |
The theoretical optimum is typed decorators (reduce boilerplate) + package isolation (ensure security) + marketplace (improve discoverability). But nobody achieves it — because the implementation cost across three layers is additive, each layer being thousands of lines of code.
Callback to OPC: The extension design that landed later (EP05) took a deliberate position in this triangle — DX first, security via trust boundary, discoverability deferred. OPC is a single-person tool; the extension author is the user themselves (or an agent they trust). So no sandbox isolation needed (trust boundary = user boundary), no marketplace needed (just one user). DX kept as low-barrier as possible: one JSON meta + hook functions, starter template at 121 lines. This trade-off would likely break down in a multi-user scenario — but for a tool with exactly one user, paying the security cost in advance for a hypothetical second user isn’t worthwhile. The difficulty triangle confirmed this wasn’t laziness but a conscious trade-off.
The Cost of Standardization
LangChain is the de facto standard for LLM frameworks (as of 2026). prompt | llm | parser in one line made “LLM application = component pipeline” the industry’s default mental model.
But the standard’s foundation is runnables/base.py — a 6,342-line god file with Config Hub genes. Every component (prompt template, LLM, parser, retriever, tool) implements the Runnable interface. Each new composition pattern adds a class to this file. The core package: 67.7k lines — 4.7x AutoGen’s core.
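The pipe syntax and the shared interface can be illustrated with a toy version. This is a sketch under my own assumptions, not LangChain’s implementation: `Step` is a hypothetical class, and the "llm" is a plain function standing in for a model call.

```python
from typing import Any, Callable


class Step:
    """A toy shared interface: every component exposes invoke() and composes with |."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new Step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


prompt = Step(lambda topic: f"Summarize: {topic}")
llm = Step(lambda text: text.upper())               # stand-in for a model call
parser = Step(lambda text: text.split(": ", 1)[1])  # keep the part after the label

chain = prompt | llm | parser
```

The toy fits in twenty lines because it supports one composition pattern. Each additional pattern (batching, streaming, fallbacks, parallel branches) needs more code behind the same interface, which is the gravity described above.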
This is the core trade-off of standardization:
The heavier the core, the lighter the upper layers; the lighter the core, the heavier the upper layers.
LangChain chose a heavy core: 67.7k lines so that the upper layer needs just 5 lines to write a tool. The cost: a user who only wants to call one API has to pull in the entire core. Add the classic/v1 dual-version coexistence and a 2,792-line callback manager, and debugging becomes a nightmare for new users.
Compare DSPy — no file exceeds 900 lines. Because it doesn’t define a standard, it does one thing (prompt optimization). Narrow scope, light code.
Callback to OPC: OPC went the opposite direction from LangChain — convention over configuration, no abstraction layer. No Runnable interface, no general pipeline SDK, no plugin API version management. Flow definitions are a JSON schema plus a set of conventions. Extension hook signatures are fixed at five types, with no support for custom hook types. This limits flexibility — you can’t invent new hook points. But in return: small core codebase (an order of magnitude less than LangChain), newcomers can read extension-authoring.md and write extensions without first understanding an abstract type system. LangChain’s lesson: the more universal the standard, the heavier the internals. OPC chose not to be universal — it serves only OPC’s review pipeline, without attempting to become a general-purpose agent framework.
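A fixed hook set is cheap to enforce. The sketch below shows the idea only: OPC’s harness is TypeScript, and the hook names, the `name` and `hooks` meta fields, and `load_extension` are all hypothetical, since this article does not list the five real hook types.

```python
import json
from typing import Callable

# Hypothetical names: the five real hook types are not listed in this article.
HOOK_TYPES = frozenset(
    {"before_flow", "before_node", "after_node", "on_fail", "after_flow"}
)


def load_extension(meta_json: str, hooks: dict[str, Callable]) -> dict:
    """Validate an extension's JSON meta against the fixed hook set."""
    meta = json.loads(meta_json)
    declared = set(meta.get("hooks", []))
    unknown = declared - HOOK_TYPES
    if unknown:
        # No custom hook points: anything outside the fixed set is rejected.
        raise ValueError(f"unsupported hook types: {sorted(unknown)}")
    missing = declared - set(hooks)
    if missing:
        raise ValueError(f"declared but not implemented: {sorted(missing)}")
    return {"name": meta["name"], "hooks": {h: hooks[h] for h in declared}}
```

The whole contract is a set membership check, with no versioned plugin API or abstract base class to learn first. The price is the one already named: an extension cannot invent a sixth hook.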
Test Density ≠ Success
The most counterintuitive finding: across this 5-project sample, test density and user scale showed no obvious positive correlation.
| Project | Stars | Test Ratio | Purpose |
|---|---|---|---|
| Pydantic-AI | ~10k | 2.54 | Type-safe agent |
| n8n | ~70k | 1.88 | Automation platform |
| LangChain | ~135k | 0.73 | LLM framework |
| Cline | ~61k | 0.084 | VS Code AI |
| Open WebUI | ~135k | 0.005 | AI full-stack frontend |
Open WebUI: 135k stars, test ratio 0.005 — for every 100 lines of product code, just 0.5 lines of tests. Nearly zero integration tests across the entire backend. But users voted with their feet.
Pydantic-AI: 2.54 — highest test ratio, fewest stars. Not because excessive testing made it bad, but because it was young and narrow in scope.
Test density affects maintainability; market timing and feature coverage determine whether a project survives. Two 135k-star projects (LangChain and Open WebUI) differ in test density by 146x — but LangChain is a framework (other projects depend on it; API stability is its lifeline), while Open WebUI is an end-user product (users don’t import its code). Different positioning, different quality strategies.
Callback to OPC: OPC’s positioning falls between the two — both a personal tool and open-sourced for others to use. Starting from v0.6, extensions are required to have tests, but the core’s testing strategy is integration-test-first rather than unit-test-first. Reasoning: the core’s pipeline scheduling logic is hard to meaningfully cover with unit tests (node A’s output feeds node B’s input — testing A and B separately is less useful than testing the A→B flow). This choice was bolstered by Cline’s case — Cline’s 0.084 test ratio is compensated by human-in-the-loop reviews and checkpoint rollbacks. OPC’s mechanical gates and role separation play a similar role: not untested, but trust mechanisms extend beyond tests alone.
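To illustrate why the A→B flow is the more useful unit to test, here is a toy two-node pipeline. The node names, the TODO rule, and `check_flow_end_to_end` are invented for this sketch and are not OPC’s real nodes or gates.

```python
def extract_findings(diff: str) -> list[str]:
    """Node A: return the added lines of a diff, without the leading '+'."""
    return [line[1:].strip() for line in diff.splitlines() if line.startswith("+")]


def gate(findings: list[str]) -> str:
    """Node B: FAIL if any added line still carries a TODO marker."""
    return "FAIL" if any("TODO" in finding for finding in findings) else "PASS"


def run_flow(diff: str) -> str:
    """The pipeline: node A's output is node B's input."""
    return gate(extract_findings(diff))


def check_flow_end_to_end() -> None:
    """Integration-style check: assert on the flow's verdict, not on each node alone."""
    assert run_flow("+ x = 1\n- y = 2") == "PASS"
    assert run_flow("+ # TODO: handle errors") == "FAIL"
```

Unit tests on `extract_findings` and `gate` could both pass while the pair still disagrees about the shape of a finding. The flow check catches that mismatch, but it tells you less about which node is at fault.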
See the Pits, Then Choose Your Path
Four findings, four decision calibrations:
| Finding | OPC’s choice | Reasoning |
|---|---|---|
| God File three genes | Harness stays as single-file orchestrator | Under 3,000 lines, moderate modification frequency, visible at a glance |
| Plugin difficulty triangle | DX first, security via trust boundary | Single-person tool; extension author = user |
| Cost of standardization | Convention over configuration | Not a general framework; scoped to review pipeline |
| Test ≠ success | Integration tests first + mechanical gates | Pipeline logic needs flow tests; gates compensate for coverage gaps |
Not all of these choices were made after looking at others’ code; some were already in place before the extension system was torn apart. But after examining 30 projects, I could articulate more clearly why these choices were made and under what conditions they would fail.
But honestly, all four findings conveniently confirmed “OPC got it right” — and if analyzing 30 projects only yields “I was right about everything,” the analytical framework itself may be biased. At least one choice turned out to have problems: the fourth one, “integration tests first + mechanical gates as compensation.” I used mechanical gates to justify fewer unit tests — but EP08 will reveal an uncomfortable fact: the gate’s FAIL loop was never triggered. Using an unverified mechanism to argue “we don’t need more tests” is circular reasoning.
Harness stays as a single file — if line count breaks 5,000 and it’s modified weekly, time to split. DX first, security via trust — if an untrusted second user shows up, an isolation layer is needed. Convention over configuration — if someone wants to use OPC extensions in a non-OPC pipeline, interface abstraction is needed. Integration tests first — if the core starts being depended on by third parties, unit tests are needed to stabilize the API.
Every design decision has a failure condition. Knowing that condition matters more than defending the decision.
Look at others’ pitfalls before choosing your own. At least you step in with your eyes open.