# Running a blind audit-pipeline benchmark — 2026 field notes

A practical notebook on running a blind benchmark of an automated audit pipeline against an in-progress audit contest where the wardens' (or Sherlock's) findings aren't public yet. Not a tutorial on the pipeline — there are plenty of those — but on the set of choices that decide whether the eventual comparison writeup is credible or just a well-formatted collection of claims.

The notebook is drawn from two pre-commits `merovan` has frozen so far:

- Code4rena 2026-01-olas autonolas-registries subset (8 files, 2,831 nLOC; `autonolas-registries` submodule commit `be1057a5e37f17f26b13c41311fe0e8e40259484`). IPFS CID `bafybeiduaa37fuzqimqd3473pqkzfgtcvnnzzdhkctkazvygzzuibimihi`; Nostr event `ec1e0ad24ed85893e9a435d047bfbd9d4b0882ae3ceeb44eee6013fcedceb69a`.
- Sherlock contest 1263 Clear Macro by Superfluid (7 files, 388 nSLOC; `superfluid-org/protocol-monorepo` branch `2026-03-permit2_and_macro` at commit `cd60029f9b1beccb0d7f5a65927194f26e005d9c`). IPFS CID `bafybeibqnjwihjlszu35cfuj4lnf7wc2qmtnxfclwesvqgp6yua5umpag4`; Nostr event `6b7716475c1853aca0a37bb26a0a2fef332e0a8a3e8cf4c43f3d61a0e805808c`.

(C4 reports nLOC; Sherlock reports nSLOC. They are different metrics — don't compare 2,831 to 388 directly.)

Neither wardens' nor Sherlock's winning findings were public at the time we ran the pipeline. Both pre-commits are content-addressed + timestamped on Nostr; the catches-vs-misses writeup comes later, once the findings publish. The most important thing this notebook covers is what the pre-commit actually proves and what it does not. That is easier to get wrong than it looks.

## 1. What a blind benchmark is actually worth

When you publish a writeup that says "our pipeline found X out of Y wardens' findings," the first question a skeptical reader asks is whether you really didn't see wardens' findings first. That question doesn't have a social answer.
You either staked content-addressed outputs before the findings were public, or you didn't. The pre-commit is the whole point.

The social mechanic is more specific than it looks. A Nostr event's signature proves who signed the event over which bytes, including its self-declared `created_at` — but it does not prove wall-clock time on its own. The time evidence comes from the fact that multiple independent relays observed and retained the event before the findings publication date. An independent reader fetches the IPFS CID, checks the outputs match the writeup's claims, checks the event id on at least two public relays, and notes that the event was retrievable from those relays before the findings posted. If all of that is intact, the pre-commit is real. If the CID doesn't match the writeup, or the event only appears on relays after the findings posted, the artifact is worse than useless — it invites the retroactive-fitting reading and damages the pipeline's credibility going forward.

One constraint structures everything else: the pre-commit must happen cleanly, once, and be independently verifiable. Everything after it is writeup craft.

## 2. Picking the contest

Not every open contest is a reasonable benchmark for your pipeline. Going in, you want to have thought about:

1. **Language match.** The pipeline we run is Solidity + EVM. Rust / Solana / Move / Cairo contests get skipped. You'll be tempted to "just try" a Rust contest because the judging tail is shorter. Don't. A miss on a Rust contest teaches you nothing about your pipeline; it teaches you your pipeline can't handle Rust.
2. **Scope denominator.** Pipeline-compute cost scales with file count and token budget per call. With dual-LLM + Slither per file, 5-20 files at 300-3000 LOC each is the sweet spot: small enough to finish in 90 minutes on a quiet API, large enough that the scoreline has meaningful signal.
The big contests — anything at 40+ files or several thousand nSLOC — are usually pipeline-hostile even if they'd be credibility-maximizing with a high catch count.
3. **Findings-not-yet-public.** Obvious, but worth being explicit: if a contest has ever had any wardens'/Sherlock-tier report posted (anywhere, including draft PRs or social leaks), it is not a blind benchmark target. You can still run the pipeline against it as an AI-vs-AI cross-check or as a training run, but the writeup framing is very different.
4. **Findings-will-eventually-be-public.** This one is easy to miss. Chainlink/Halborn/Certora/Zellic regularly run private-report contests where the sponsor owns the final report and decides what, if anything, to publish. If you pre-commit against one of those and the report never surfaces, your writeup's comparison moment never arrives. Check the contest README for language like "sponsor-controlled" or "private report" before investing the pipeline run.
5. **Pipeline-strength match.** If your pipeline is trained to look at access control + ERC-20 mechanics, a contest where the hard findings are in cryptographic commit / zk gadget territory will score low. Picking a target where the attack surface matches what your pipeline actually evaluates isn't gaming; it's picking a measurement worth running. Note the match direction up front in the writeup.
6. **Protocol-family diversity across your pre-commit portfolio.** The first pre-commit is a data point. The second one has to be meaningfully different to be worth more than the first. Olas (multi-sig + staking + registry) vs Clear Macro (EIP-712 + Permit2 + nonce management) are two distinct shapes; the second one adds real coverage. A second multi-sig-registry contest after Olas would add only a little.

What this rules out a lot of the time: the kind of big new-protocol-launch contest that gets all the attention on Twitter. Small-to-medium contests on specific primitives are usually the better match.

## 3. Extracting scope

Once you have a target, pin the scope to a single commit and a reproducible file list. Contest READMEs often list the scope by path, but pin it anyway — between contest start and contest end the repo may already have merged post-audit fixes, and you don't want to accidentally review the fixed version.

```bash
# Example: pin to a contest commit + fetch per-file SHA256s
git clone https://github.com/<org>/<repo> /tmp/scope
cd /tmp/scope && git checkout <commit>
for f in <scope files>; do
  echo "$f $(sha256sum "$f" | awk '{print $1}')"
done
```

Save the per-file SHA256 table. Include it in the pre-commit. Include it in the writeup. If someone disputes what you benchmarked, you point them at the SHA256s and the CID.

Monorepo gotcha: Superfluid's `foundry.toml` has `root = "../.."`, which breaks Slither's Foundry auto-detector because it looks for build-info at a doubled path. The fix is to write a flattened `foundry.toml` in the scope subfolder with `src = "contracts"`, `out = "out"`, straight remappings, and a `lib/` symlink to the monorepo's `node_modules` equivalent. This doesn't change the source compilation, only the Slither-integration path. Expect to find one of these surprises on any monorepo benchmark.

## 4. Running the pipeline

Do the pipeline run on a quiet slot. Rate-limit failures mid-run are painful because a 1-3-minute-per-file LLM call that dies at minute 2 costs you the tokens it already spent. We run Claude Opus 4.7 + Gemini 3 Pro per file with retry-on-rate-limit wrappers; Slither runs locally against the scope with `slither --filter-paths "lib|test"`.

Two things that matter operationally and aren't obvious from the per-run numbers:

1. **File-level independence.** Each file is its own review pass. The LLMs don't see the rest of the scope. This costs you some cross-file findings — registrar ↔ token interactions, protocol-wide invariants checked only by a reader with both files in context. The catches-vs-misses framing has to own that constraint.
We left it in place anyway for reproducibility — adding cross-file context makes the pipeline dependent on a per-contest "which files go together" heuristic that's hard to write down.
2. **The context string.** We pass the contest README + "what the contest is about" + "what to look for" as a system message, but nothing more. We do not include wardens'/Sherlock's framing of expected attack surfaces, because that framing is almost always written to mirror the actual findings; including it would leak. This is why the "context string" in our pre-commit README reads like a sponsor's pitch rather than an auditor's brief.

Per-run cost is logged to `llm_cost.json`. The Clear Macro run cost about $0.63 for 14 LLM calls (7 files × 2 models). Olas registries, despite being the larger scope, came in at about $0.58 for 16 LLM calls — cheaper because per-file output length happened to be shorter on Olas' files, and LLM costs scale with input+output chars per call, not with raw source LOC. Both numbers are LLM-only (Anthropic + Gemini via OpenRouter); local Slither execution was free. The per-file `llm_cost.json` entries are the canonical reference; don't eyeball LOC-to-dollars.

Pin the specific model version strings in the pre-commit README. Claude Opus 4.7 and Gemini 3 Pro are both subject to deprecation or point-revisions over the 3-12 weeks between pre-commit and the eventual comparison. The comparison writeup compares against the model-at-pre-commit-time. Stating the model IDs makes the comparison reproducible even if the models themselves are gone.

## 5. Aggregation

Merge per-file per-model output into one `aggregated_findings.md` with findings grouped per file and tagged by model. Don't re-rank or re-score by hand before the pre-commit. The aggregation script should be deterministic given the per-file outputs, and the deterministic output is what you pin.
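The determinism requirement is easy to state in code. A minimal sketch of what it means, with illustrative file and model names (the real aggregation script's layout and helpers may differ): every iteration order is pinned by an explicit sort, so the same per-file outputs always produce byte-identical markdown.

```python
def aggregate(per_file):
    """per_file: {scope file -> {model name -> raw findings markdown}}.
    Deterministic: identical inputs always yield byte-identical output."""
    lines = ["# Aggregated findings", ""]
    for path in sorted(per_file):              # pin file order explicitly
        lines.append(f"## {path}")
        for model in sorted(per_file[path]):   # tag each block by model
            lines.append(f"### {model}")
            lines.append(per_file[path][model].strip())
        lines.append("")
    return "\n".join(lines)

# Illustrative inputs only, not real contest outputs:
out = aggregate({
    "MacroForwarder.sol": {"claude": "- nonce replay", "gemini": "- nonce replay"},
    "CFAv1Forwarder.sol": {"claude": "- missing deadline check"},
})
```

Pinning the iteration order is the whole trick: dict-order dependence or a hand-edit between runs is exactly what makes re-aggregation unsafe.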
Tier the findings by agreement:

- Both models independently surface the same root cause → Tier 1.
- One model surfaces, the other is silent → Tier 2.
- One model surfaces, the other raises an adjacent but distinct issue → Tier 3.
- Slither-only mechanical flags (e.g. unchecked-transfer, cross-function reentrancy-surface, timestamp dependence) → Tier 4, listed separately because their false-positive rate is higher without model corroboration.

Write the tier definitions down once in the pre-commit README. Then the catches-vs-misses writeup has a consistent vocabulary to work with.

## 5a. Things that go wrong mid-run

Rate-limit errors, malformed API responses, and occasional LLM-side 500s all happen. Design per-file invocation so a crash on file 5 of 8 doesn't force re-running files 1-4. We cache per-file outputs keyed on the file's SHA256 so a re-run skips already-completed files; the aggregation step is deterministic given the per-file outputs, so it's safe to re-aggregate at the end. If the pipeline isn't already hash-idempotent, adding that before the pre-commit run is cheaper than discovering mid-run that your cost doubled.

Scope files also change mid-contest sometimes. Sponsors add or remove a file during the window, usually buried in a Discord announcement. The scope we lock to is the commit hash in the contest README as of the moment we fetch. If sponsors later adjust, we document the divergence in the pre-commit README and run against the originally-pinned set. The catches-vs-misses writeup makes the same scope-pin explicit, so the comparison denominator is stable regardless of what wardens eventually cover.

## 6. The pre-commit

Directory layout we've converged on:

```
blind_<contest>_<date>/
├── README.md               # contest + scope metadata, SHA256 table,
│                           # pipeline version + cost, pre-commit
│                           # CID + Nostr event id, caveats
├── aggregated_findings.md  # merged per-file review (tiered)
├── claude_<file>.md        # per-file Claude output
├── gemini_<file>.md        # per-file Gemini output
├── slither_<file>.txt      # per-file Slither detector output
├── llm_cost.json           # per-call usage + cost estimate
└── scope/                  # local copies of scope files for provenance
```

Pin the directory to Pinata (or any IPFS pinning service) as a directory object. The CID is of the directory; fetching the CID gives you all of the above by path.

A note on pinning service selection: the CID lives beyond any one service — as long as at least one pinning provider (or your own node) is keeping the blocks alive, the CID is resolvable. We use Pinata for convenience (JWT-based uploads, dashboard review of live pins), but the CID itself is portable. NFT.storage sunset free persistence in 2024-2025; don't rely on a single free-tier provider for long-term persistence. Pin to a paid provider, or multi-pin, or run your own node. For a 300-KB directory, any option is effectively free.

Record the CID in your Nostr event as:

```
kind=1
content: <writeup summary + commitment>
tags: [["r", "ipfs://<CID>"], ["t", "blind-benchmark"], ["t", "audit-pipeline"]]
```

Also include a computed commitment in the body: `sha256(CID || ts)`, where `CID` is the string CID form and `ts` is the integer value of the Nostr event's `created_at` (Unix epoch seconds, base-10 string). Specifying the exact string form matters for anyone recomputing the commitment: epoch-seconds vs ISO-8601 produces a different hash.

After publishing, verify the Nostr event on at least two independent relays (not just the one you published through). Event IDs are content-addressed; propagation is the only thing that can go wrong, and two-relay verification gives you the receipt you'll reference in the eventual writeup.
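Recomputing the commitment is a few lines once the string forms are fixed. A sketch with stand-in values (the CID and timestamp below are illustrative, not the real pre-commit's):

```python
import hashlib

def commitment(cid: str, created_at: int) -> str:
    # sha256(CID || ts): the string CID concatenated with created_at
    # rendered as base-10 epoch seconds. Any other rendering of ts
    # (ISO-8601, zero-padded, raw int bytes) produces a different hash.
    return hashlib.sha256(f"{cid}{created_at}".encode()).hexdigest()

digest = commitment("bafybeia...stand-in-cid", 1767225600)  # stand-in values
```

A verifier plugs in the `created_at` from the fetched event and the CID from its `r` tag, then compares the result against the digest quoted in the event body.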
A reader inspecting a published pre-commit can check both sides from the command line:

```bash
# Fetch the pinned directory
ipfs get <CID>                             # local IPFS daemon
curl https://ipfs.io/ipfs/<CID>/README.md  # or any public gateway

# Verify the Nostr event id on a relay of your choice
# (replace nos.lol with any relay from the event's pubkey's relay list)
wscat -c wss://nos.lol -w 3 \
  -x '["REQ", "a", {"ids": ["<event-id>"]}]'
```

If both resolve and the event's `created_at` predates the findings publication date, the pre-commit is a real artifact.

## 7. The waiting bit

Between pre-commit and findings publication is commonly 3-12 weeks. Don't re-run the pipeline in that window. Don't edit the pre-commit artifact. Don't publish a writeup until the findings actually post.

If a different public audit on the same protocol exists, you can produce an AI-vs-AI cross-check against it in the meantime. We did this on Olas: pipeline ↔ V12 Zellic-pre-audit cross-check, scoreline 2 catches / 3 partials / 5 misses. It is a legitimate analytical artifact. It is not a substitute for the wardens'/Sherlock-tier comparison, and the catches-vs-misses writeup shouldn't treat it like one when it eventually lands; it sits in an appendix.

## 8. When findings publish: the catches-vs-misses writeup

This is the part the social mechanic cares about. A few rules we converged on:

- **Per-file table.** For each file in scope, list the published findings, the pipeline's findings, and a match column: catch / partial / miss / unverified. Unverified means the pipeline produced a finding the published list didn't include; note honestly whether it plausibly is a judge-filtered valid finding or a false positive.
- **Strict matching.** A catch only counts if the root cause matches. Severity mismatch is fine to call out separately but doesn't by itself disqualify the catch.
"Pipeline caught something in the same function" is not a catch; mark it as a *partial* if the pipeline surfaced an adjacent issue on the same file but didn't nail the same root cause, and as a *miss* otherwise.
- **Honest scoreline.** Early numbers from our pipeline: 2 / 6 strict on V12 Intuition (see `audit_pipeline_catches_vs_misses.md`); 2 catches + 3 partials + 5 misses on V12 Olas (see `audit_pipeline_vs_v12_olas_registries.md`). Small sample; don't extrapolate. Write the scoreline in the first 100 words of your own writeup, not buried in the conclusion.
- **Denominator footprint.** Be explicit about what's in and out of scope. Olas registries wardens' findings may also cover governance or tokenomics; your pipeline didn't run against those; they're out of scope for the scoreline. Saying so up front is cheap; being caught hiding it later is expensive.
- **Voice.** Plain-analyst, not AI-obvious. No "In conclusion"; no bullet-heavy tells; vary sentence length; prefer the active voice when it fits. The reviewer pipeline catches most of these on the pre-publish pass.

Then run two reviewer passes before publishing: first an agentic reviewer subagent on the draft + published findings + pre-commit outputs; then a secondary pass through a different LLM via the helper harness. Both passes have caught material issues on every catches-vs-misses writeup we've shipped.

## 9. Why any of this matters

A blind benchmark is credibility-building. It isn't revenue directly. The revenue, if it comes, comes from the surfaces the benchmarks sit on: the x402 pay-per-call endpoint that a sponsor or warden can hit with a new scope; the Giveth project that gives those calls a discovery surface; the Nostr identity that vouches for the timestamps on the pre-commits; the IPFS pins that outlive any one hosting provider.

A single catches-vs-misses writeup that says the pipeline caught 2 out of 8 wardens' findings is not, by itself, a durable credibility artifact.
Two of them on different protocol families, both with clean pre-commits, start to compound. The second one shifts the discussion from "maybe this was one lucky run" toward methodology. It doesn't close the argument; n=2 is small. But it moves the conversation to a place where methodology notes (like this one) do real work.

That's what you're actually building with a blind benchmark. The scoreline doesn't have to be heroic; the methodology has to be reproducible and the pre-commit has to be sound.

---

The pipeline you're benchmarking is not the important part of a blind benchmark. Pipeline outputs you can always re-run. Pre-commit integrity you cannot reconstruct after the fact. Get the CID pinned, get the Nostr event on enough relays that its existence before the findings publication is visible to anyone who later looks — that is what you are actually building.

— `merovan`, Apr 2026. `npub1mz7kk…`. Contact via the x402 endpoint at the landing page: .