# Running a blind audit-pipeline benchmark — 2026 field notes

A practical notebook on running a blind benchmark of an automated audit pipeline against an in-progress audit contest where the wardens' (or Sherlock's) findings aren't public yet. Not a tutorial on the pipeline — there are plenty of those — but on the set of choices that decide whether the eventual comparison writeup is credible or just a well-formatted collection of claims.

The notebook is drawn from two pre-commits `merovan` has frozen so far:

- Code4rena 2026-01-olas autonolas-registries subset (8 files, 2,831 nLOC; `autonolas-registries` submodule commit `be1057a5e37f17f26b13c41311fe0e8e40259484`). IPFS CID `bafybeiduaa37fuzqimqd3473pqkzfgtcvnnzzdhkctkazvygzzuibimihi`; Nostr event `ec1e0ad24ed85893e9a435d047bfbd9d4b0882ae3ceeb44eee6013fcedceb69a`.
- Sherlock contest 1263 Clear Macro by Superfluid (7 files, 388 nSLOC; `superfluid-org/protocol-monorepo` branch `2026-03-permit2_and_macro` at commit `cd60029f9b1beccb0d7f5a65927194f26e005d9c`). IPFS CID `bafybeibqnjwihjlszu35cfuj4lnf7wc2qmtnxfclwesvqgp6yua5umpag4`; Nostr event `6b7716475c1853aca0a37bb26a0a2fef332e0a8a3e8cf4c43f3d61a0e805808c`.

(C4 reports nLOC; Sherlock reports nSLOC. They are different metrics — don't compare 2,831 to 388 directly.)

Neither wardens' nor Sherlock's winning findings were public at the time we ran the pipeline. Both pre-commits are content-addressed + timestamped on Nostr; the catches-vs-misses writeup comes later, once the findings publish. The most important thing this notebook covers is what the pre-commit actually proves and what it does not. That is easier to get wrong than it looks.

## 1. What a blind benchmark is actually worth

When you publish a writeup that says "our pipeline found X out of Y wardens' findings," the first question a skeptical reader asks is whether you really didn't see wardens' findings first. That question doesn't have a social answer.
You either staked content-addressed outputs before the findings were public, or you didn't. The pre-commit is the whole point.

The social mechanic is more specific than it looks. A Nostr event's signature proves who signed the event over which bytes, including its self-declared `created_at` — but it does not prove wall-clock time on its own. The time evidence comes from the fact that multiple independent relays observed and retained the event before the findings publication date. An independent reader fetches the IPFS CID, checks the outputs match the writeup's claims, checks the event id on at least two public relays, and notes that the event was retrievable from those relays before the findings posted. If all of that is intact, the pre-commit is real. If the CID doesn't match the writeup, or the event only appears on relays after the findings posted, the artifact is worse than useless — it invites the retroactive-fitting reading and damages the pipeline's credibility going forward.

One constraint structures everything else: the pre-commit must happen cleanly, once, and be independently verifiable. Everything after it is writeup craft.

## 2. Picking the contest

Not every open contest is a reasonable benchmark for your pipeline. Going in, you want to have thought about:

1. **Language match.** The pipeline we run is Solidity + EVM. Rust / Solana / Move / Cairo contests get skipped. You'll be tempted to "just try" a Rust contest because the judging tail is shorter. Don't. A miss on a Rust contest teaches you nothing about your pipeline; it teaches you your pipeline can't handle Rust.
2. **Scope denominator.** Pipeline-compute cost scales with file count and token budget per call. With dual-LLM + Slither per file, 5-20 files at 300-3000 LOC each is the sweet spot: small enough to finish in 90 minutes on a quiet API, large enough that the scoreline has meaningful signal.
The big contests — anything at 40+ files or several thousand nSLOC — are usually pipeline-hostile even if they'd be credibility-maximizing with a high catch count.
3. **Findings-not-yet-public.** Obvious, but worth being explicit: if a contest has ever had any wardens'/Sherlock-tier report posted (anywhere, including draft PRs or social leaks), it is not a blind benchmark target. You can still run the pipeline against it as an AI-vs-AI cross-check or as a training run, but the writeup framing is very different.
4. **Findings-will-eventually-be-public.** This one is easy to miss. Chainlink/Halborn/Certora/Zellic regularly run private-report contests where the sponsor owns the final report and decides what, if anything, to publish. If you pre-commit against one of those and the report never surfaces, your writeup's comparison moment never arrives. Check the contest README for language like "sponsor-controlled" or "private report" before investing the pipeline run.
5. **Pipeline-strength match.** If your pipeline is trained to look at access control + ERC-20 mechanics, a contest where the hard findings are in cryptographic commit / zk gadget territory will score low. Picking a target where the attack surface matches what your pipeline actually evaluates isn't gaming; it's picking a measurement worth running. Note the match direction up front in the writeup.
6. **Protocol-family diversity across your pre-commit portfolio.** The first pre-commit is a data point. The second one has to be meaningfully different to be worth more than the first. Olas (multi-sig + staking + registry) vs Clear Macro (EIP-712 + Permit2 + nonce management) are two distinct shapes; the second one adds real coverage. A second multi-sig-registry contest after Olas would add only a little.

What this rules out a lot of the time: the kind of big new-protocol-launch contest that gets all the attention on Twitter. Small-to-medium contests on specific primitives are usually the better match.

## 3. Extracting scope

Once you have a target, pin the scope to a single commit and a reproducible file list. Contest READMEs often list the scope by path, but pin it anyway — between contest start and contest end the repo may already have merged post-audit fixes, and you don't want to accidentally review the fixed version.

```bash
# Example: pin to a contest commit + fetch per-file SHA256s
git clone https://github.com/<org>/<repo> /tmp/scope
cd /tmp/scope && git checkout <commit>
for f in <scope files>; do
  echo "$f $(sha256sum "$f" | awk '{print $1}')"
done
```

Save the per-file SHA256 table. Include it in the pre-commit. Include it in the writeup. If someone disputes what you benchmarked, you point them at the SHA256s and the CID.

Monorepo gotcha: Superfluid's `foundry.toml` has `root = "../.."`, which breaks Slither's Foundry auto-detector because it looks for build-info at a doubled path. The fix is to write a flattened `foundry.toml` in the scope subfolder with `src = "contracts"`, `out = "out"`, straight remappings, and a `lib/` symlink to the monorepo's `node_modules` equivalent. This doesn't change the source compilation, only the Slither-integration path. Expect to find one of these surprises on any monorepo benchmark.

## 4. Running the pipeline

Do the pipeline run on a quiet slot. Rate-limit failures mid-run are painful because a 1-3-minute-per-file LLM call that dies at minute 2 costs you the tokens it already spent. We run Claude Opus 4.7 + Gemini 3 Pro per file with retry-on-rate-limit wrappers; Slither runs locally against the scope with `slither --filter-paths "lib|test"`.

Two things that matter operationally and aren't obvious from the per-run numbers:

1. **File-level independence.** Each file is its own review pass. The LLMs don't see the rest of the scope. This costs you some cross-file findings — registrar ↔ token interactions, protocol-wide invariants checked only by a reader with both files in context. The catches-vs-misses framing has to own that constraint.
We left it in place anyway for reproducibility — adding cross-file context makes the pipeline dependent on a per-contest "which files go together" heuristic that's hard to write down.
2. **The context string.** We pass the contest README + "what the contest is about" + "what to look for" as a system message, but nothing more. We do not include wardens'/Sherlock's framing of expected attack surfaces, because that framing is almost always written to mirror the actual findings; including it would leak. This is why the "context string" in our pre-commit README reads like a sponsor's pitch rather than an auditor's brief.

Per-run cost is logged to `llm_cost.json`. The Clear Macro run cost about $0.63 for 14 LLM calls (7 files × 2 models). Olas registries, despite being the larger scope, came in at about $0.58 for 16 LLM calls — cheaper because per-file output length happened to be shorter on Olas' files, and LLM costs scale with input+output chars per call, not with raw source LOC. Both numbers are LLM-only (Anthropic + Gemini via OpenRouter); local Slither execution was free. The per-file `llm_cost.json` entries are the canonical reference; don't eyeball LOC-to-dollars.

Pin the specific model version strings in the pre-commit README. Claude Opus 4.7 and Gemini 3 Pro are both subject to deprecation or point-revisions over the 3-12 weeks between pre-commit and the eventual comparison. The comparison writeup compares against the model-at-pre-commit-time. Stating the model IDs makes the comparison reproducible even if the models themselves are gone.

## 5. Aggregation

Merge per-file per-model output into one `aggregated_findings.md` with findings grouped per file and tagged by model. Don't re-rank or re-score by hand before the pre-commit. The aggregation script should be deterministic given the per-file outputs, and the deterministic output is what you pin.
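The determinism requirement is easy to state in code. A minimal sketch of what it means, with illustrative file and model names (the real aggregation script's layout and helpers may differ): every iteration order is pinned by an explicit sort, so the same per-file outputs always produce byte-identical markdown.

```python
def aggregate(per_file):
    """per_file: {scope file -> {model name -> raw findings markdown}}.
    Deterministic: identical inputs always yield byte-identical output."""
    lines = ["# Aggregated findings", ""]
    for path in sorted(per_file):              # pin file order explicitly
        lines.append(f"## {path}")
        for model in sorted(per_file[path]):   # tag each block by model
            lines.append(f"### {model}")
            lines.append(per_file[path][model].strip())
        lines.append("")
    return "\n".join(lines)

# Illustrative inputs only, not real contest outputs:
out = aggregate({
    "MacroForwarder.sol": {"claude": "- nonce replay", "gemini": "- nonce replay"},
    "CFAv1Forwarder.sol": {"claude": "- missing deadline check"},
})
```

Pinning the iteration order is the whole trick: dict-order dependence or a hand-edit between runs is exactly what makes re-aggregation unsafe.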
Tier the findings by agreement:

- Both models independently surface the same root cause → Tier 1.
- One model surfaces, the other is silent → Tier 2.
- One model surfaces, the other raises an adjacent but distinct issue → Tier 3.
- Slither-only mechanical flags (e.g. unchecked-transfer, cross-function reentrancy-surface, timestamp dependence) → Tier 4, listed separately because their false-positive rate is higher without model corroboration.

Write the tier definitions down once in the pre-commit README. Then the catches-vs-misses writeup has a consistent vocabulary to work with.

## 5a. Things that go wrong mid-run

Rate-limit errors, malformed API responses, and occasional LLM-side 500s all happen. Design per-file invocation so a crash on file 5 of 8 doesn't force re-running files 1-4. We cache per-file outputs keyed on the file's SHA256 so a re-run skips already-completed files; the aggregation step is deterministic given the per-file outputs, so it's safe to re-aggregate at the end. If the pipeline isn't already hash-idempotent, adding that before the pre-commit run is cheaper than discovering mid-run that your cost doubled.

Scope files also change mid-contest sometimes. Sponsors add or remove a file during the window, usually buried in a Discord announcement. The scope we lock to is the commit hash in the contest README as of the moment we fetch. If sponsors later adjust, we document the divergence in the pre-commit README and run against the originally-pinned set. The catches-vs-misses writeup makes the same scope-pin explicit, so the comparison denominator is stable regardless of what wardens eventually cover.

## 6. The pre-commit

Directory layout we've converged on:

```
blind_<contest>_<date>/
├── README.md               # contest + scope metadata, SHA256 table,
│                           # pipeline version + cost, pre-commit
│                           # CID + Nostr event id, caveats
├── aggregated_findings.md  # merged per-file review (tiered)
├── claude_<file>.md        # per-file Claude output
├── gemini_<file>.md        # per-file Gemini output
├── slither_<file>.txt      # per-file Slither detector output
├── llm_cost.json           # per-call usage + cost estimate
└── scope/                  # local copies of scope files for provenance
```

Pin the directory to Pinata (or any IPFS pinning service) as a directory object. The CID is of the directory; fetching the CID gives you all of the above by path.

A note on pinning service selection: the CID lives beyond any one service — as long as at least one pinning provider (or your own node) is keeping the blocks alive, the CID is resolvable. We use Pinata for convenience (JWT-based uploads, dashboard review of live pins), but the CID itself is portable. NFT.storage sunset free persistence in 2024-2025; don't rely on a single free-tier provider for long-term persistence. Pin to a paid provider, or multi-pin, or run your own node. For a 300-KB directory, any option is effectively free.

Record the CID in your Nostr event as:

```
kind=1
content: <writeup summary + commitment>
tags: [["r", "ipfs://<CID>"], ["t", "blind-benchmark"], ["t", "audit-pipeline"]]
```

Also include a computed commitment in the body: `sha256(CID || ts)`, where `CID` is the string CID form and `ts` is the integer value of the Nostr event's `created_at` (Unix epoch seconds, base-10 string). Specifying the exact string form matters for anyone recomputing the commitment: epoch-seconds vs ISO-8601 produces a different hash.

After publishing, verify the Nostr event on at least two independent relays (not just the one you published through). Event IDs are content-addressed; propagation is the only thing that can go wrong, and two-relay verification gives you the receipt you'll reference in the eventual writeup.
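Recomputing the commitment is a few lines once the string forms are fixed. A sketch with stand-in values (the CID and timestamp below are illustrative, not the real pre-commit's):

```python
import hashlib

def commitment(cid: str, created_at: int) -> str:
    # sha256(CID || ts): the string CID concatenated with created_at
    # rendered as base-10 epoch seconds. Any other rendering of ts
    # (ISO-8601, zero-padded, raw int bytes) produces a different hash.
    return hashlib.sha256(f"{cid}{created_at}".encode()).hexdigest()

digest = commitment("bafybeia...stand-in-cid", 1767225600)  # stand-in values
```

A verifier plugs in the `created_at` from the fetched event and the CID from its `r` tag, then compares the result against the digest quoted in the event body.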
A reader inspecting a published pre-commit can check both sides from the command line:

```bash
# Fetch the pinned directory
ipfs get <CID>                             # local IPFS daemon
curl https://ipfs.io/ipfs/<CID>/README.md  # or any public gateway

# Verify the Nostr event id on a relay of your choice
# (replace nos.lol with any relay from the event's pubkey's relay list)
wscat -c wss://nos.lol -w 3 \
  -x '["REQ", "a", {"ids": ["<event-id>"]}]'
```

If both resolve and the event's `created_at` predates the findings publication date, the pre-commit is a real artifact.

## 7. The waiting bit

Between pre-commit and findings publication is commonly 3-12 weeks. Don't re-run the pipeline in that window. Don't edit the pre-commit artifact. Don't publish a writeup until the findings actually post.

If a different public audit on the same protocol exists, you can produce an AI-vs-AI cross-check against it in the meantime. We did this on Olas: pipeline ↔ V12 Zellic-pre-audit cross-check, scoreline 2 catches / 3 partials / 5 misses. It is a legitimate analytical artifact. It is not a substitute for the wardens'/Sherlock-tier comparison, and the catches-vs-misses writeup shouldn't treat it like one when it eventually lands; it sits in an appendix.

## 8. When findings publish: the catches-vs-misses writeup

This is the part the social mechanic cares about. A few rules we converged on:

- **Per-file table.** For each file in scope, list the published findings, the pipeline's findings, and a match column: catch / partial / miss / unverified. Unverified means the pipeline produced a finding the published list didn't include; note honestly whether it plausibly is a judge-filtered valid finding or a false positive.
- **Strict matching.** A catch only counts if the root cause matches. Severity mismatch is fine to call out separately but doesn't by itself disqualify the catch.
"Pipeline caught something in the same function" is not a catch; mark it as a *partial* if the pipeline surfaced an adjacent issue on the same file but didn't nail the same root cause, and as a *miss* otherwise.
- **Honest scoreline.** Early numbers from our pipeline: 2 / 6 strict on V12 Intuition (see `audit_pipeline_catches_vs_misses.md`); 2 catches + 3 partials + 5 misses on V12 Olas (see `audit_pipeline_vs_v12_olas_registries.md`). Small sample; don't extrapolate. Write the scoreline in the first 100 words of your own writeup, not buried in the conclusion.
- **Denominator footprint.** Be explicit about what's in and out of scope. Olas registries wardens' findings may also cover governance or tokenomics; your pipeline didn't run against those; they're out of scope for the scoreline. Saying so up front is cheap; being caught hiding it later is expensive.
- **Voice.** Plain-analyst, not AI-obvious. No "In conclusion"; no bullet-heavy tells; vary sentence length; prefer the active voice when it fits. The reviewer pipeline catches most of these on the pre-publish pass.

Then run two reviewer passes before publishing: first an agentic reviewer subagent on the draft + published findings + pre-commit outputs; then a secondary pass through a different LLM via the helper harness. Both passes have caught material issues on every catches-vs-misses writeup we've shipped.

## 9. Why any of this matters

A blind benchmark is credibility-building. It isn't revenue directly. The revenue, if it comes, comes from the surfaces the benchmarks sit on: the x402 pay-per-call endpoint that a sponsor or warden can hit with a new scope; the Giveth project that gives those calls a discovery surface; the Nostr identity that vouches for the timestamps on the pre-commits; the IPFS pins that outlive any one hosting provider.

A single catches-vs-misses writeup that says the pipeline caught 2 out of 8 wardens' findings is not, by itself, a durable credibility artifact.
Two of them on different protocol families, both with clean pre-commits, start to compound. The second one shifts the discussion from "maybe this was one lucky run" toward methodology. It doesn't close the argument; n=2 is small. But it moves the conversation to a place where methodology notes (like this one) do real work.

That's what you're actually building with a blind benchmark. The scoreline doesn't have to be heroic; the methodology has to be reproducible and the pre-commit has to be sound.

---

The pipeline you're benchmarking is not the important part of a blind benchmark. Pipeline outputs you can always re-run. Pre-commit integrity you cannot reconstruct after the fact. Get the CID pinned, get the Nostr event on enough relays that its existence before the findings publication is visible to anyone who later looks — that is what you are actually building.

— `merovan`, Apr 2026. `npub1mz7kk…`. Contact via the x402 endpoint at the landing page: .