# ADR-174: CI Bench-Regression Gate (Compile-Verify)

| Field | Value |
|-------|-------|
| **Status** | Accepted; implemented, caught one real bit-rotted bench |
| **Date** | 2026-06-15 |
| **Deciders** | ruv |
| **Codename** | **BENCH-GATE** |
| **Milestone** | benchmark/optimization re-balance, sub-deliverable 8.3 |
| **Motivated by** | `docs/research/sota-nn-train-benchmark-brief.md` (target 3: criterion benches as CI regression baselines) |

## Context

The v2/ workspace ships **26 criterion benches across 18 crates** (e.g. `nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`, `wifi-densepose-signal/{signal,dsp_perf,features,calibration,cir,…}_bench`, `wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv}_bench`, `wifi-densepose-engine/engine_cycle`, …). Because **benches are not part of `cargo test`**, nothing in CI compiled them, so they bit-rot silently the moment a public API they call changes, and the rot stays invisible until someone manually runs `cargo bench` months later. The SOTA brief named "wire existing criterion benches into CI as regression baselines" as a concrete benchmark-hygiene target.

The honest difficulty: true *timing*-regression gating on shared GitHub runners is unreliable. Wall-clock varies 2–3× run-to-run (a captured 10-sample run showed `float_l2/512` ranging 307–444 ns), so a hard threshold or a cross-runner `criterion --baseline` compare (baseline and PR land on different physical machines) would manufacture false regressions. A gate that cries wolf gets disabled.

## Decision

Add `.github/workflows/bench-regression.yml` with **two jobs of explicitly different authority**, and do NOT pretend to gate on timing.

### `bench-compile`: HARD GATE (real regression detection)

`cargo bench --workspace --no-default-features --no-run` compiles and links every bench that is not feature-gated (no measurement, so fully deterministic), plus a `--features cir` compile of the gated `cir_bench`.
Benches aren't in `cargo test`, so this is the genuine guard: **the build fails the moment a bench stops compiling.**

### `bench-fast-run`: INFORMATIONAL (`continue-on-error: true`, never gates)

Runs a curated pure-CPU subset (`nvsim/pipeline_throughput`, `ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1 s warm-up / 2 s measure / 10 samples), with each bench targeted individually via `--bench`, and uploads the logs as an artifact. Every number it produces is **informational only**, as the workflow header states explicitly.

### What is NOT done, and why (honest scope)

No timing-regression gate, no committed baseline JSON. The workflow header documents the exact condition under which true timing-gating becomes honest: a frequency-pinned **self-hosted** runner with a generous (>2×) floor. A cross-runner baseline would be dishonest, so none is committed.

### Proof it matters (MEASURED)

Running the new gate on the current tree immediately caught `wifi-densepose-mat/detection_bench` failing to compile: `error[E0063]: missing field last_rssi in initializer of SensorPosition`. The struct had gained a field and the bench was never updated. **Fixed** in the same change (`last_rssi: None`, the simulated-zone convention) and re-verified (`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run` → `Finished`). The gate paid for itself on its first run.

### Exclusions (documented in-workflow)

- `ruvector/crv_bench`: its crates.io dep `ruvector-crv 0.1.1` fails to build on stable (upstream `E0308` in `stage_iii.rs`); excluded with a re-add condition.
- `onnx_bench` / `mqtt_throughput`: feature-gated (ort / mqtt), left to their crates' own workflows.
- `wasm-edge/process_frame_bench`: workspace-excluded.

Conventions mirror existing workflows: `submodules: recursive` (the workspace has a path dependency on `vendor/rufield`), Swatinem/rust-cache with `workspaces: v2`, Tauri/GTK apt deps (a `--workspace` bench link pulls the whole graph), and path-filtered triggers.
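
The two-job split described above might look roughly like the following workflow. This is a minimal sketch, not the committed `bench-regression.yml`: the trigger paths, runner label, action versions, toolchain action, log file names, and the `wifi-densepose-signal` package for `cir_bench` are assumptions made for illustration. The Tauri/GTK apt step and the `crv_bench` exclusion are left out because this ADR does not spell out their details.

```yaml
# Illustrative sketch only; names marked "assumed" are not taken from the real workflow.
# Timing numbers from bench-fast-run are informational and never gate a merge.
name: bench-regression

on:
  pull_request:
    paths:                                   # assumed path filter
      - "v2/**"
      - ".github/workflows/bench-regression.yml"

jobs:
  bench-compile:                             # HARD GATE: fails when a bench stops compiling
    runs-on: ubuntu-latest                   # assumed runner label
    defaults:
      run:
        working-directory: v2
    steps:
      - uses: actions/checkout@v4            # assumed version
        with:
          submodules: recursive              # workspace path-depends on vendor/rufield
      - uses: dtolnay/rust-toolchain@stable  # assumed toolchain action
      - uses: Swatinem/rust-cache@v2         # assumed version
        with:
          workspaces: v2
      - name: Compile and link benches (no measurement)
        run: cargo bench --workspace --no-default-features --no-run
      - name: Compile the feature-gated cir_bench
        # Package name assumed from the bench list in the Context section.
        run: cargo bench -p wifi-densepose-signal --no-default-features --features cir --bench cir_bench --no-run

  bench-fast-run:                            # INFORMATIONAL: a failure here does not fail the run
    runs-on: ubuntu-latest
    continue-on-error: true
    defaults:
      run:
        working-directory: v2
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
        with:
          workspaces: v2
      - name: Run curated pure-CPU subset in quick mode
        # Criterion CLI flags: 1 s warm-up, 2 s measurement, 10 samples.
        run: |
          cargo bench -p nvsim --bench pipeline_throughput -- --warm-up-time 1 --measurement-time 2 --sample-size 10 | tee bench-nvsim.log
          cargo bench -p wifi-densepose-ruvector --bench sketch_bench -- --warm-up-time 1 --measurement-time 2 --sample-size 10 | tee bench-sketch.log
          cargo bench -p wifi-densepose-ruvector --bench fusion_bench -- --warm-up-time 1 --measurement-time 2 --sample-size 10 | tee bench-fusion.log
      - uses: actions/upload-artifact@v4     # assumed version
        if: always()                         # keep the logs even when a bench run fails
        with:
          name: bench-fast-run-logs          # assumed artifact name
          path: v2/bench-*.log
```

Keeping the informational job separate, with job-level `continue-on-error: true`, means noisy shared-runner timings can never turn the check red, while the compile job keeps its full authority.
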
## Validation

- **Bit-rot caught + fixed** (above), re-verified `--no-run`.
- **MEASURED locally** (`--no-default-features`, Windows): nvsim, ruvector (sketch/fusion/ann), signal/cir_bench, mat/detection_bench (post-fix), vitals, ruview-swarm/swarm_bench all compile; fast subset runs (`nvsim pipeline_run/d1/256` ≈ 55 µs; `ruvector sketch_hamming` ≈ 3–7 ns vs `float_l2` ≈ 63–371 ns).
- `cargo test -p wifi-densepose-mat --no-default-features` → 166/6/2 passed, 0 failed.
- `python archive/v1/data/proof/verify.py` → **VERDICT: PASS**, hash `f8e76f21…46f7a` unchanged.
- **Honest limitation:** the full `--workspace --no-run` could not be end-to-end validated on this Windows box (`desktop` needs GTK, `candle-core` fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure; all are Windows-env artifacts, and each affected bench compiles standalone here). **The first green Linux CI run on the PR is the authoritative proof of the `--workspace` step.**

## Consequences

### Positive

- Bench bit-rot is now a hard CI failure, not a silent surprise: the 26 benches stay compilable as the APIs they exercise evolve.
- The benchmark-infrastructure half of the DoD (step 5) is satisfied honestly, setting up the next sub-deliverable (QAT-int8 measurement) to be regression-protected.

### Negative / Neutral

- No automated timing-regression detection (deliberate; see scope). Revisit only with a frequency-pinned self-hosted runner.
- One bench (`crv_bench`) excluded pending an upstream dep fix.

## Links

- ADR-173: metric-locked accuracy harness (sub-deliverable 8.1)
- `docs/research/sota-nn-train-benchmark-brief.md`: motivating target
- ADR-134 (CIR), ADR-135 (calibration), ADR-154 (signal DSP benches): benched paths