Silicon Workforce S1E03: From 'It Runs' to 'I Trust It'

Security audit: 47/100.

That was the score Hermes — a code review advisor — gave to OPC’s main branch. 5 Critical, 8 High. A framework that claims to check other projects’ code quality scored 47 on its own security audit.

It’s like a company that sells security doors while the lock on its own office door is broken.

Stop the Bleeding

The four most glaring problems were each the kind that keeps you awake at night once you know about them.

Command injection. Flow template names were concatenated directly into shell commands. If someone named a template “; rm -rf /”, the system would dutifully execute it. This wasn’t theoretical: OPC is designed to let AI agents choose flow templates themselves, and an agent doesn’t need malicious intent. Its generated output can contain shell metacharacters (semicolons, pipes, backticks) that the shell interprets as command separators or command substitutions, triggering entirely unexpected operations.

The core idea behind the fix: whitelist all external input before it reaches the shell — only allow letters, digits, underscores, and hyphens; reject everything else. (Implementation: validateRelativePath() checks path legitimacy, blocking .. and absolute paths; template names must match /^[a-zA-Z0-9_-]+$/.)
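The whitelist idea can be sketched in a few lines. This is a minimal illustration in Python, not OPC's actual code: the function names are hypothetical stand-ins for the checks described above, and paths are treated as POSIX-style. One detail worth noting in Python is that `$` also matches just before a trailing newline, so the sketch uses `fullmatch` instead of anchors.

```python
import re
from pathlib import PurePosixPath

# Same character class as the pattern quoted above: letters, digits, "_" and "-".
TEMPLATE_NAME = re.compile(r"[a-zA-Z0-9_-]+")


def is_valid_template_name(name: str) -> bool:
    """Accept a name only if every character is on the whitelist."""
    # fullmatch avoids the trailing-newline loophole of "^...$" in Python.
    return TEMPLATE_NAME.fullmatch(name) is not None


def is_valid_relative_path(raw: str) -> bool:
    """Hypothetical analogue of validateRelativePath(): no absolute paths, no ".."."""
    path = PurePosixPath(raw)
    if raw == "" or path.is_absolute():
        return False
    # PurePosixPath keeps ".." components as written, so they are easy to detect.
    return ".." not in path.parts
```

Rejecting input outright, rather than trying to escape it, keeps the rule easy to audit: anything that is not plainly a name never reaches the shell.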

Path traversal. The --flow-file parameter could point anywhere on the filesystem. The original intent was letting users load custom flow configurations, but there were no boundary checks. AI agents could read any file outside OPC’s working directory — config files, environment variables, SSH keys.

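Fix: resolve the path first, then check that resolvedPath.startsWith(cwd), ensuring it stays within the working directory. One caveat: a bare string-prefix test also accepts a sibling directory such as /work/opc-evil when cwd is /work/opc, so the comparison should be anchored at a path separator. Built-in template names are protected; external flow files can’t override them.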
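Here is a minimal sketch of that boundary check, written in Python for illustration. The function name `resolve_inside` is hypothetical. It compares whole path components rather than string prefixes, because a plain prefix test would let a sibling directory with a similar name slip through.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path against base_dir and refuse anything that escapes it."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." and follows symlinks before the comparison.
    candidate = (base / user_path).resolve()
    try:
        # relative_to() compares path components and raises ValueError
        # when candidate does not live under base.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"{user_path!r} escapes {base}") from None
    return candidate
```

With this shape, `resolve_inside("/tmp/opc-demo", "flows/custom.json")` returns a path under the base directory, while `"/etc/passwd"` or `"../opc-demo-evil/x"` raise `ValueError`.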

Empty file cheating. OPC’s review stage requires at least two independent evaluation reports. But the validation only checked “does this file exist,” not “does it have content.” Two empty files could pass the review.

This is another example of the “aspirational theater” from the previous episode — the system claims to be reviewing, but actually just checks for file existence.

The fix went beyond just “file not empty.” Now validation checks five things: must contain severity markers (red/yellow/blue or LGTM); line count >= 50 (thin reviews are likely going through the motions); must have file:line references (reviews must be anchored to specific code); two reviews can’t be identical — copy-paste is an instant error, and mostly-the-same content (over 70% line overlap) gets a warning; two reviews must come from different roles.
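The five checks are mechanical enough to sketch. The Python below is an illustration under several assumptions, not OPC's validator: the severity markers are taken to be colored-circle emoji or the literal LGTM, a file:line reference is taken to look like `src/gate.py:42`, and "line overlap" is measured as shared unique lines divided by the size of the smaller report. All names are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumption: red/yellow/blue markers are circle emoji (written as escapes).
SEVERITY = re.compile("[\U0001F534\U0001F7E1\U0001F535]|LGTM")
# Assumption: a code reference looks like path/to/file.ext:123.
FILE_LINE = re.compile(r"[\w./-]+\.\w+:\d+")


@dataclass
class Review:
    role: str
    text: str


def check_review(review: Review, min_lines: int = 50) -> list:
    """Return per-report errors; an empty list means the report passes."""
    errors = []
    if not SEVERITY.search(review.text):
        errors.append(f"{review.role}: no severity marker")
    if len(review.text.splitlines()) < min_lines:
        errors.append(f"{review.role}: fewer than {min_lines} lines")
    if not FILE_LINE.search(review.text):
        errors.append(f"{review.role}: no file:line reference")
    return errors


def _unique_lines(text: str) -> set:
    return {line.strip() for line in text.splitlines() if line.strip()}


def check_pair(a: Review, b: Review, overlap_warn: float = 0.70):
    """Return (errors, warnings) for a pair of review reports."""
    errors = check_review(a) + check_review(b)
    warnings = []
    if a.role == b.role:
        errors.append("both reviews come from the same role")
    if a.text.strip() == b.text.strip():
        errors.append("reviews are identical")
    else:
        lines_a, lines_b = _unique_lines(a.text), _unique_lines(b.text)
        smaller = min(len(lines_a), len(lines_b))
        if smaller and len(lines_a & lines_b) / smaller > overlap_warn:
            warnings.append(
                f"reviews share more than {overlap_warn:.0%} of their lines"
            )
    return errors, warnings
```

Two empty files now fail on every per-report check and on the identical-content check as well, which is exactly the case the original validation let through.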

Infinite bouncing. A Gate ruling ITERATE sends the flow back to Builder. But what if the Builder’s fix still doesn’t pass? ITERATE again. Fix again. Still fails. Infinite loop.

Fix: maxLoopsPerEdge = 3 — same edge traversed more than 3 times forces a stop. maxNodeReentry = 5 — same node entered more than 5 times forces a stop. Like safety ropes in a mine shaft — you don’t feel them during normal work, but they prevent you from falling.
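A guard like this can be sketched as two counters. The Python below is an illustrative assumption about the mechanism, not OPC's implementation; the class and exception names are hypothetical, and only the two limits come from the text above.

```python
from collections import Counter


class LoopLimitExceeded(RuntimeError):
    """Raised when the flow must be force-stopped."""


class LoopGuard:
    def __init__(self, max_loops_per_edge: int = 3, max_node_reentry: int = 5):
        self.max_loops_per_edge = max_loops_per_edge
        self.max_node_reentry = max_node_reentry
        self.edge_counts = Counter()
        self.node_counts = Counter()

    def traverse(self, src: str, dst: str) -> None:
        """Record one transition; raise once either limit is exceeded."""
        self.edge_counts[(src, dst)] += 1
        self.node_counts[dst] += 1
        if self.edge_counts[(src, dst)] > self.max_loops_per_edge:
            raise LoopLimitExceeded(
                f"edge {src} -> {dst} traversed more than "
                f"{self.max_loops_per_edge} times"
            )
        if self.node_counts[dst] > self.max_node_reentry:
            raise LoopLimitExceeded(
                f"node {dst} entered more than {self.max_node_reentry} times"
            )
```

With the defaults, the fourth Gate-to-Builder bounce raises `LoopLimitExceeded`, and so does the sixth entry into any single node, whichever edge it arrives by.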

After stopping the bleeding, the security score climbed from 47 to about 75. Test suites grew from 12 to 16. But 75 wasn’t enough.

Acceptance Criteria Aren’t a Checklist — They’re a Contract

The key to going from 75 to 90 wasn’t fixing more bugs. It was thinking clearly about one thing: what exactly are acceptance criteria?

Some people say acceptance criteria are just a few bullet points. Check them off and you’re done.

Wrong. Acceptance criteria are a contract.

Imagine a client-vendor relationship. The client pays, the vendor does the work. The client’s core concern isn’t “you did good work.” It’s “if you didn’t, I can prove it.” Acceptance criteria are that contract: the more specific, testable, and unambiguous they are, the more secure everyone feels.

A bad acceptance criterion: “The system should be fast.”

A good acceptance criterion: “P95 latency < 200ms. Verification: run wrk with 1000 concurrent connections for 30 seconds; the measured P95 latency must be below 200ms.”

The difference? With the bad criterion, you can always say “seems fine” — because “fast” has no definition. With the good one, you can only say “yes” or “no” — 200ms is 200ms, no middle ground.

Figure: OPC's blueprint, a quality fortress of 14 checkpoint nodes.

OPC uses a tool called criteria-lint to enforce this. 14 lint rules in three layers:

Structural layer checks your format: Is there an Outcomes section? A Verification section? Does each outcome have a corresponding verification method? Is the count between 3 and 7?

Content layer checks substance: Using vague words like “fast,” “clean,” “intuitive”? If they’re not followed by quantitative metrics (“under 200ms,” “WCAG AA”), it’s an instant fail. Writing “should work as expected”? That’s always true, meaning nothing. Verification method is “manual inspection” or “looks correct”? Eyeballing isn’t verification. Two criteria share 80% of words? Either copy-paste redundancy or unclear thinking.

Warning layer flags potential oversights: Is the scope section empty? No failure modes written (only success scenarios, nothing about what happens when things fail)? More than 5 outcomes (too many means blurred focus)?

These rules don’t rely on AI judgment. They’re regex patterns and Jaccard similarity scores — run them and they tell you exactly which line doesn’t pass.
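To make "regex patterns and Jaccard similarity" concrete, here is a small Python sketch of two content-layer rules. It is an assumption about how such rules could look, not the criteria-lint source: the word list, the metric pattern, and the function names are all illustrative, and the 0.8 threshold is taken from the "80% of words" rule above.

```python
import re

# Illustrative vague-word list; a real rule set would be longer.
VAGUE = re.compile(r"\b(fast|clean|intuitive)\b", re.IGNORECASE)
# Illustrative "quantitative metric": a number with a unit, or a WCAG level.
METRIC = re.compile(
    r"\d+(?:\.\d+)?\s*(?:(?:ms|s)\b|%)|\bWCAG\s+A{1,3}\b", re.IGNORECASE
)


def vague_without_metric(criterion: str) -> list:
    """Return vague words that are not followed by a quantitative metric."""
    hits = []
    for match in VAGUE.finditer(criterion):
        # Only text after the vague word counts as "followed by".
        if not METRIC.search(criterion, match.end()):
            hits.append(match.group(0).lower())
    return hits


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two word sets: |A & B| / |A | B|."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def near_duplicates(criteria: list, threshold: float = 0.8) -> list:
    """Return index pairs whose word overlap reaches the threshold."""
    return [
        (i, j)
        for i in range(len(criteria))
        for j in range(i + 1, len(criteria))
        if jaccard(criteria[i], criteria[j]) >= threshold
    ]
```

Because both rules are pure functions over text, the same input always yields the same verdict, which is the property the lint relies on.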

The Tenth Person

When designing OPC’s review process, a key decision was inspired by Israeli intelligence’s “Tenth Man” doctrine.

The rule goes like this: if 9 analysts all agree on a conclusion, the 10th person’s duty is to oppose it. Not because the conclusion is necessarily wrong — but because consensus itself is the biggest risk signal.

In OPC, when a discussion round enters its second iteration with all agents converging, the orchestrator automatically introduces the devil-advocate role. This role’s job is to challenge — not because challenging is fun, but because if everyone agrees, either they’re truly right, or everyone is making the same mistake.

For irreversible decisions (data deletion, public API contracts, destructive migrations), the devil-advocate role is mandatory. No matter how many agree, someone must attempt to overturn the decision.
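The trigger logic described here is simple enough to write down. The Python below is a hedged sketch of the rule, not the orchestrator's code: the function name, the decision labels, and the definition of "converging" (every agent returning the same verdict) are assumptions made for illustration.

```python
# Hypothetical labels for the irreversible decision kinds named above.
IRREVERSIBLE = {"data_deletion", "public_api_contract", "destructive_migration"}


def needs_devil_advocate(
    round_number: int, verdicts: list, decision_kind: str = ""
) -> bool:
    """Decide whether the orchestrator should add the devil-advocate role."""
    # Irreversible decisions always get a challenger, regardless of agreement.
    if decision_kind in IRREVERSIBLE:
        return True
    # Assumption: "converging" means at least two agents, all with one verdict.
    converged = len(verdicts) > 1 and len(set(verdicts)) == 1
    return round_number >= 2 and converged
```

Under these assumptions, a unanimous second round triggers the role, a split second round does not, and a destructive migration triggers it even in round one.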

This design reflects a core OPC belief: good process isn’t about getting everyone to agree — it’s about giving disagreement a place to exist.

Don’t Make AI Better — Make Bad Outcomes Smaller

After three weeks of hardening, test suites grew from 12 to 21, assertions from 150 to 450, enforcement rules from 20 to 60, and the security score from 47 to about 90.

But the numbers aren’t the point. The point is what I understood through this process:

Traditional software invariants are deterministic — input A must produce output B. Agent framework invariants are probabilistic — you can’t guarantee AI writes good code, but you can guarantee bad code won’t pass the gate.

So OPC’s philosophy isn’t “make AI better.” AI’s capability boundary isn’t something I can control — it depends on model training, prompt quality, context size. These are all probabilistic; I can’t offer guarantees.

What I can guarantee is the other side: make bad outcomes smaller.

60 enforcement rules aren’t designed to produce good output — they’re designed to intercept bad output. This distinction determines the entire system’s design direction. OPC isn’t a tool that makes AI write better code; it’s a tool that prevents AI’s bad code from passing acceptance.

Like a dam. A dam doesn’t make the rain fall better. It stops the flood from getting through.

In a later test, this philosophy was validated. A 55-hour session, 52 subagents collaborating, produced 639 new tests. One pre-commit hook called check_test_imports.py had a brutally simple rule: every test file must import real product code. If your test defines its own function and tests itself — the most common trap in AI-generated tests — the hook blocks it immediately.
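The rule behind such a hook can be approximated with Python's `ast` module. What follows is a sketch of the idea, not the contents of check_test_imports.py: the function name and the product package name `opc` are assumptions, and relative imports are ignored for brevity.

```python
import ast


def imports_product_code(source: str, product_packages: set) -> bool:
    """Return True if the test source imports at least one product package."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # level > 0 is a relative import ("from . import x"); skipped here.
            names = [node.module or ""] if node.level == 0 else []
        else:
            continue
        # Compare only the top-level package, e.g. "opc" in "opc.gate".
        if any(name.split(".")[0] in product_packages for name in names):
            return True
    return False
```

A test file that defines its own `add()` and asserts on it imports nothing from the product, so a check of this kind rejects it without having to judge the quality of the assertions.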

639 tests, every one importing real modules. $150 in token costs, averaging $0.23 per test. If a senior engineer spent three days writing the same quantity and quality of tests at $800/day, that’s $2,400.

But price isn’t the point. The point is: these tests are trustworthy. Not because AI wrote them well, but because untrustworthy tests were mechanically intercepted.

From 47 to 90 wasn’t about AI getting better. It was about keeping bad things from getting through.
