ADR-174: CI Bench-Regression Gate (Compile-Verify)

| Field | Value |
| --- | --- |
| Status | Accepted; implemented, caught one real bit-rotted bench |
| Date | 2026-06-15 |
| Deciders | ruv |
| Codename | BENCH-GATE |
| Milestone | benchmark/optimization re-balance, sub-deliverable 8.3 |
| Motivated by | docs/research/sota-nn-train-benchmark-brief.md (target 3: criterion benches as CI regression baselines) |

Context

The v2/ workspace ships 26 criterion benches across 18 crates (e.g. nvsim/pipeline_throughput, wifi-densepose-ruvector/{ann,sketch,fusion}_bench, wifi-densepose-signal/{signal,dsp_perf,features,calibration,cir,…}_bench, wifi-densepose-mat/detection_bench, wifi-densepose-nn/{inference,native_conv}_bench, wifi-densepose-engine/engine_cycle, …). Because benches are not part of cargo test, nothing in CI compiled them — so they bit-rot silently the moment a public API they call changes, and the rot is invisible until someone manually runs cargo bench months later.

The SOTA brief named "wire existing criterion benches into CI as regression baselines" as a concrete benchmark-hygiene target. The honest difficulty: true timing-regression gating on shared GitHub runners is unreliable. Wall-clock time varies 2–3× run-to-run (a captured 10-sample run showed float_l2/512 ranging from 307 to 444 ns), so a hard threshold or a cross-runner criterion --baseline compare (where baseline and PR land on different physical machines) would manufacture false regressions. A gate that cries wolf gets disabled.
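To make the noise argument concrete, here is a minimal sketch of how a tight threshold misfires on noisy samples. The sample values, the 10 % tolerance, and the names `spread_ratio`, `naive_gate_fails`, and `hypothetical_samples_ns` are illustrative assumptions; none of them come from the workflow or from the captured run.

```rust
/// Ratio of the slowest to the fastest sample (1.0 = perfectly stable).
/// Returns `None` for an empty slice or non-positive timings.
fn spread_ratio(samples_ns: &[f64]) -> Option<f64> {
    if samples_ns.is_empty() {
        return None;
    }
    let min = samples_ns.iter().copied().fold(f64::INFINITY, f64::min);
    let max = samples_ns.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    if min <= 0.0 {
        None
    } else {
        Some(max / min)
    }
}

/// Naive hard-threshold gate (hypothetical): fails when `current_ns` is more
/// than `tolerance` slower than `baseline_ns` (0.10 means 10 %).
fn naive_gate_fails(baseline_ns: f64, current_ns: f64, tolerance: f64) -> bool {
    current_ns > baseline_ns * (1.0 + tolerance)
}

/// Made-up timings for one unchanged function on a noisy shared runner.
fn hypothetical_samples_ns() -> [f64; 5] {
    [310.0, 325.0, 350.0, 400.0, 440.0]
}
```

With these made-up samples the spread is 440 / 310, roughly 1.42×, so a 10 % gate that happened to compare the fastest sample against the slowest would fail even though the code under test never changed.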

Decision

Add .github/workflows/bench-regression.yml with two jobs of explicitly different authority — and do NOT pretend to gate on timing.

bench-compile — HARD GATE (real regression detection)

cargo bench --workspace --no-default-features --no-run compiles + links every default-feature bench (no measurement → fully deterministic), plus a --features cir compile of the gated cir_bench. Benches aren't in cargo test, so this is the genuine guard: the build fails the moment a bench stops compiling.

bench-fast-run — INFORMATIONAL (continue-on-error: true, never gates)

Runs a curated pure-CPU subset (nvsim/pipeline_throughput, ruvector/{sketch,fusion}_bench) in criterion quick-mode (1 s warm-up / 2 s measure / 10 samples), targeted per --bench, and uploads logs as an artifact. Every number it produces is informational only, as stated explicitly in the workflow header.

What is NOT done, and why (honest scope)

No timing-regression gate, no committed baseline JSON. The workflow header documents the exact condition under which true timing-gating becomes honest: a frequency-pinned self-hosted runner with a generous (>2×) floor. A cross-runner baseline would be dishonest, so none is committed.
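The condition above can be read as a two-part rule: timing only gates when the runner is trustworthy, and even then only past a generous floor. The sketch below models that rule. The `Runner` struct, the `Verdict` enum, and `timing_verdict` are hypothetical names, and the floor is a caller-supplied parameter assumed to be greater than 2.0; nothing here is taken from the workflow itself.

```rust
/// Hypothetical description of the machine a bench ran on.
#[derive(Debug, Clone, Copy)]
struct Runner {
    self_hosted: bool,
    frequency_pinned: bool,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    /// Timing is reported but cannot fail the build.
    Informational,
    Pass,
    Regression,
}

/// Gate on timing only when the runner is self-hosted and frequency-pinned,
/// and only when the slowdown exceeds `floor` (assumed to be > 2.0).
fn timing_verdict(runner: Runner, baseline_ns: f64, current_ns: f64, floor: f64) -> Verdict {
    if !(runner.self_hosted && runner.frequency_pinned) {
        return Verdict::Informational;
    }
    if current_ns > baseline_ns * floor {
        Verdict::Regression
    } else {
        Verdict::Pass
    }
}
```

On a shared runner the function never returns `Regression`, whatever the numbers say, which mirrors the decision to keep timing informational until the hardware precondition is met.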

Proof it matters (MEASURED)

Running the new gate on the current tree immediately caught wifi-densepose-mat/detection_bench failing to compile: error[E0063]: missing field last_rssi in initializer of SensorPosition. The struct had gained a field and the bench was never updated. Fixed in the same change (last_rssi: None, the simulated-zone convention) and re-verified (cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run → Finished). The gate paid for itself on its first run.
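For readers unfamiliar with E0063, the failure mode looks roughly like the sketch below. Only the names `SensorPosition` and `last_rssi` and the value `None` come from this ADR; the other fields, the `Option<f64>` inner type, and the `simulated_sensor` helper are hypothetical stand-ins, not the real wifi-densepose-mat definitions.

```rust
/// Hypothetical stand-in for the real `SensorPosition`.
#[derive(Debug, Clone, PartialEq)]
struct SensorPosition {
    id: String,             // hypothetical field
    x: f64,                 // hypothetical field
    y: f64,                 // hypothetical field
    last_rssi: Option<f64>, // the newly added field; inner type is assumed
}

/// Builds a sensor the way a bench might. Rust requires every field in a
/// struct literal, so omitting `last_rssi` here is what produces
/// `error[E0063]: missing field `last_rssi` in initializer of `SensorPosition``.
fn simulated_sensor(id: &str, x: f64, y: f64) -> SensorPosition {
    SensorPosition {
        id: id.to_string(),
        x,
        y,
        // The fix: a simulated sensor has no RSSI reading yet.
        last_rssi: None,
    }
}
```

Because the error surfaces at compile time, a `--no-run` build is enough to catch it; no benchmark has to execute. A bench could instead use struct-update syntax over a `Default` implementation, which would survive field additions but would also hide them; whether `SensorPosition` implements `Default` is not stated here.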

Exclusions (documented in-workflow)

  • ruvector/crv_bench — its crates.io dep ruvector-crv 0.1.1 fails to build on stable (upstream E0308 in stage_iii.rs); excluded with a re-add condition.
  • onnx_bench / mqtt_throughput — feature-gated (ort / mqtt), left to their crates' own workflows. wasm-edge/process_frame_bench — workspace-excluded.

Conventions mirror existing workflows: submodules: recursive (the workspace has path dependencies on vendor/rufield), Swatinem/rust-cache with workspaces: v2, Tauri/GTK apt deps (a --workspace bench link pulls the whole graph), and path-filtered triggers.

Validation

  • Bit-rot caught + fixed (above), re-verified --no-run.
  • MEASURED locally (--no-default-features, Windows): nvsim, ruvector (sketch/fusion/ann), signal/cir_bench, mat/detection_bench (post-fix), vitals, ruview-swarm/swarm_bench all compile; fast subset runs (nvsim pipeline_run/d1/256 ≈ 55 µs; ruvector sketch_hamming ≈ 37 ns vs float_l2 ≈ 63371 ns).
  • cargo test -p wifi-densepose-mat --no-default-features → 166/6/2 passed, 0 failed.
  • python archive/v1/data/proof/verify.py → VERDICT: PASS, hash f8e76f21…46f7a unchanged.
  • Honest limitation: the full --workspace --no-run could not be end-to-end validated on this Windows box (desktop needs GTK, candle-core fails on MSVC, swarm_bench LTO-links OOM under parallel pressure — all Windows-env artifacts; each affected bench compiles standalone here). The first green Linux CI run on the PR is the authoritative proof of the --workspace step.

Consequences

Positive

  • Bench bit-rot is now a hard CI failure, not a silent surprise — the 26 benches stay compilable as the APIs they exercise evolve.
  • The benchmark-infrastructure half of the DoD (step 5) is satisfied honestly, setting up the next sub-deliverable (QAT-int8 measurement) to be regression-protected.

Negative / Neutral

  • No automated timing-regression detection (deliberate — see scope). Revisit only with a frequency-pinned self-hosted runner.
  • One bench (crv_bench) excluded pending an upstream dep fix.

References

  • ADR-173: metric-locked accuracy harness (sub-deliverable 8.1)
  • docs/research/sota-nn-train-benchmark-brief.md: motivating target
  • ADR-134 (CIR), ADR-135 (calibration), ADR-154 (signal DSP benches): benched paths