ADR-174: CI Bench-Regression Gate (Compile-Verify)
| Field | Value |
|---|---|
| Status | Accepted — implemented, caught one real bit-rotted bench |
| Date | 2026-06-15 |
| Deciders | ruv |
| Codename | BENCH-GATE |
| Milestone | benchmark/optimization re-balance — sub-deliverable 8.3 |
| Motivated by | docs/research/sota-nn-train-benchmark-brief.md (target 3: criterion benches as CI regression baselines) |
Context
The v2/ workspace ships 26 criterion benches across 18 crates (e.g.
nvsim/pipeline_throughput, wifi-densepose-ruvector/{ann,sketch,fusion}_bench,
wifi-densepose-signal/{signal,dsp_perf,features,calibration,cir,…}_bench,
wifi-densepose-mat/detection_bench, wifi-densepose-nn/{inference,native_conv}_bench,
wifi-densepose-engine/engine_cycle, …). Because benches are not part of
cargo test, nothing in CI compiled them — so they bit-rot silently the moment
a public API they call changes, and the rot is invisible until someone manually
runs cargo bench months later.
The SOTA brief named "wire existing criterion benches into CI as regression
baselines" as a concrete benchmark-hygiene target. The honest difficulty: true
timing-regression gating on shared GitHub runners is unreliable — wall-clock
varies 2–3× run-to-run (a captured 10-sample run showed float_l2/512 ranging
307–444 ns), so a hard threshold or a cross-runner criterion --baseline compare
(baseline and PR land on different physical machines) would manufacture false
regressions. A gate that cries wolf gets disabled.
Decision
Add .github/workflows/bench-regression.yml with two jobs of explicitly
different authority — and do NOT pretend to gate on timing.
bench-compile — HARD GATE (real regression detection)
cargo bench --workspace --no-default-features --no-run compiles + links every
default-feature bench (no measurement → fully deterministic), plus a
--features cir compile of the gated cir_bench. Benches aren't in cargo test,
so this is the genuine guard: the build fails the moment a bench stops
compiling.
bench-fast-run — INFORMATIONAL (continue-on-error: true, never gates)
Runs a curated pure-CPU subset (nvsim/pipeline_throughput,
ruvector/{sketch,fusion}_bench) in criterion quick-mode (1 s warm-up / 2 s
measure / 10 samples), each selected with its own --bench flag, and uploads logs as an artifact.
Every number it produces is informational only — explicitly stated in the
workflow header.
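As a rough illustration of the two invocations described above, the sketch below builds their cargo argument lists. The hard-gate flags are the ones quoted in this ADR. The quick-mode flags assume the settings are passed as Criterion command-line options (`--warm-up-time`, `--measurement-time`, `--sample-size`); the actual workflow may configure them differently. The function names are hypothetical.

```rust
/// Arguments for the hard gate: compile and link every bench, measure nothing.
pub fn compile_gate_args() -> Vec<String> {
    ["bench", "--workspace", "--no-default-features", "--no-run"]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Arguments for one informational quick-mode run of a single bench target.
/// Assumption: quick mode is expressed through Criterion CLI options placed
/// after the `--` separator (1 s warm-up, 2 s measurement, 10 samples).
pub fn fast_run_args(package: &str, bench: &str) -> Vec<String> {
    let mut args: Vec<String> = vec![
        "bench".into(),
        "-p".into(),
        package.into(),
        "--bench".into(),
        bench.into(),
        "--".into(),
    ];
    for (flag, value) in [
        ("--warm-up-time", "1"),
        ("--measurement-time", "2"),
        ("--sample-size", "10"),
    ] {
        args.push(flag.to_string());
        args.push(value.to_string());
    }
    args
}
```

Either list can be handed to `std::process::Command::new("cargo").args(...)`. Keeping the two builders separate mirrors the split in authority: only the first one's exit status should be able to fail the build.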
What is NOT done, and why (honest scope)
No timing-regression gate, no committed baseline JSON. The workflow header documents the exact condition under which true timing-gating becomes honest: a frequency-pinned self-hosted runner with a generous (>2×) floor. A cross-runner baseline would be dishonest, so none is committed.
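To make the scope decision concrete, here is a minimal sketch of a ratio-based timing check, applied to the float_l2/512 spread quoted in the Context section (307–444 ns within one 10-sample run). The helper names and the 10% threshold are illustrative assumptions, not part of the workflow.

```rust
/// Hypothetical check: true when the candidate is slower than the baseline
/// by more than `floor`, expressed as a ratio (1.10 = 10% slower, 2.0 = 2x).
pub fn exceeds_floor(baseline_ns: f64, candidate_ns: f64, floor: f64) -> bool {
    candidate_ns / baseline_ns > floor
}

/// Ratio between the slowest and fastest sample in one run, or `None`
/// when the slice is empty or contains a non-positive minimum.
pub fn spread_ratio(samples_ns: &[f64]) -> Option<f64> {
    if samples_ns.is_empty() {
        return None;
    }
    let min = samples_ns.iter().copied().fold(f64::INFINITY, f64::min);
    let max = samples_ns.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    if min <= 0.0 {
        None
    } else {
        Some(max / min)
    }
}
```

With unchanged code, `exceeds_floor(307.0, 444.0, 1.10)` is true, so a tight threshold reports a regression that is only noise, while `exceeds_floor(307.0, 444.0, 2.0)` is false. That is the reasoning behind the generous floor, and since shared runners are stated to vary 2–3× run-to-run, even that floor is only meaningful on pinned hardware.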
Proof it matters (MEASURED)
Running the new gate on the current tree immediately caught
wifi-densepose-mat/detection_bench failing to compile:
error[E0063]: missing field last_rssi in initializer of SensorPosition — the
struct gained a field; the bench was never updated. Fixed in the same change
(last_rssi: None, the simulated-zone convention) and re-verified
(cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run
→ Finished). The gate paid for itself on its first run.
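The failure mode is easy to reproduce in miniature. In the sketch below only the struct name and the last_rssi field come from this ADR; the other fields, every type, and the helper function are hypothetical stand-ins for the real definitions.

```rust
/// Hypothetical stand-in for the real `SensorPosition`.
/// Assumption: `last_rssi` is an `Option`, inferred from the `None` fix.
#[derive(Debug, Clone, PartialEq)]
pub struct SensorPosition {
    pub x: f64,
    pub y: f64,
    /// The field that was added after the bench was written.
    pub last_rssi: Option<f64>,
}

/// Builds a simulated position with no RSSI reading, as the fixed bench does.
pub fn simulated_position(x: f64, y: f64) -> SensorPosition {
    // Leaving `last_rssi` out of this initializer is what triggers
    // error[E0063]: struct literals must name every field.
    SensorPosition { x, y, last_rssi: None }
}
```

Because a struct literal must list every field, any caller that is never compiled stays stale indefinitely, which is why a compile-only gate is enough to catch this class of rot.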
Exclusions (documented in-workflow)
- ruvector/crv_bench: its crates.io dep ruvector-crv 0.1.1 fails to build on stable (upstream E0308 in stage_iii.rs); excluded with a re-add condition.
- onnx_bench / mqtt_throughput: feature-gated (ort / mqtt), left to their crates' own workflows.
- wasm-edge/process_frame_bench: workspace-excluded.
Conventions mirror existing workflows: submodules: recursive (the workspace has a path dependency on vendor/rufield), Swatinem/rust-cache with workspaces: v2, Tauri/GTK apt
deps (a --workspace bench link pulls in the whole graph), and path-filtered triggers.
Validation
- Bit-rot caught + fixed (above), re-verified with --no-run.
- MEASURED locally (--no-default-features, Windows): nvsim, ruvector (sketch/fusion/ann), signal/cir_bench, mat/detection_bench (post-fix), vitals, ruview-swarm/swarm_bench all compile; fast subset runs (nvsim pipeline_run/d1/256 ≈ 55 µs; ruvector sketch_hamming ≈ 3–7 ns vs float_l2 ≈ 63–371 ns).
- cargo test -p wifi-densepose-mat --no-default-features → 166/6/2 passed, 0 failed.
- python archive/v1/data/proof/verify.py → VERDICT: PASS, hash f8e76f21…46f7a unchanged.
- Honest limitation: the full --workspace --no-run could not be end-to-end validated on this Windows box (desktop needs GTK, candle-core fails on MSVC, swarm_bench LTO-links OOM under parallel pressure; all are Windows-env artifacts, and each affected bench compiles standalone here). The first green Linux CI run on the PR is the authoritative proof of the --workspace step.
Consequences
Positive
- Bench bit-rot is now a hard CI failure, not a silent surprise — the 26 benches stay compilable as the APIs they exercise evolve.
- The benchmark-infrastructure half of the DoD (step 5) is satisfied honestly, setting up the next sub-deliverable (QAT-int8 measurement) to be regression-protected.
Negative / Neutral
- No automated timing-regression detection (deliberate — see scope). Revisit only with a frequency-pinned self-hosted runner.
- One bench (crv_bench) excluded pending an upstream dep fix.
Links
- ADR-173: metric-locked accuracy harness (sub-deliverable 8.1)
- docs/research/sota-nn-train-benchmark-brief.md: motivating target
- ADR-134 (CIR), ADR-135 (calibration), ADR-154 (signal DSP benches): benched paths