wifi-densepose/docs/adr/ADR-174-ci-bench-regression...

111 lines
5.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ADR-174: CI Bench-Regression Gate (Compile-Verify)
| Field | Value |
|-------|-------|
| **Status** | Accepted — implemented, caught one real bit-rotted bench |
| **Date** | 2026-06-15 |
| **Deciders** | ruv |
| **Codename** | **BENCH-GATE** |
| **Milestone** | benchmark/optimization re-balance — sub-deliverable 8.3 |
| **Motivated by** | `docs/research/sota-nn-train-benchmark-brief.md` (target 3: criterion benches as CI regression baselines) |
## Context
The v2/ workspace ships **26 criterion benches across 18 crates** (e.g.
`nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`,
`wifi-densepose-signal/{signal,dsp_perf,features,calibration,cir,…}_bench`,
`wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv}_bench`,
`wifi-densepose-engine/engine_cycle`, …). Because **benches are not part of
`cargo test`**, nothing in CI compiled them — so they bit-rot silently the moment
a public API they call changes, and the rot is invisible until someone manually
runs `cargo bench` months later.
The SOTA brief named "wire existing criterion benches into CI as regression
baselines" as a concrete benchmark-hygiene target. The honest difficulty: true
*timing*-regression gating on shared GitHub runners is unreliable — wall-clock
varies 23× run-to-run (a captured 10-sample run showed `float_l2/512` ranging
307444 ns), so a hard threshold or a cross-runner `criterion --baseline` compare
(baseline and PR land on different physical machines) would manufacture false
regressions. A gate that cries wolf gets disabled.
## Decision
Add `.github/workflows/bench-regression.yml` with **two jobs of explicitly
different authority** — and do NOT pretend to gate on timing.
### `bench-compile` — HARD GATE (real regression detection)
`cargo bench --workspace --no-default-features --no-run` compiles + links every
default-feature bench (no measurement → fully deterministic), plus a
`--features cir` compile of the gated `cir_bench`. Benches aren't in `cargo test`,
so this is the genuine guard: **the build fails the moment a bench stops
compiling.**
### `bench-fast-run` — INFORMATIONAL (`continue-on-error: true`, never gates)
Runs a curated pure-CPU subset (`nvsim/pipeline_throughput`,
`ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1 s warm-up / 2 s
measure / 10 samples), targeted per-`--bench`, and uploads logs as an artifact.
Every number it produces is **informational only** — explicitly stated in the
workflow header.
### What is NOT done, and why (honest scope)
No timing-regression gate, no committed baseline JSON. The workflow header
documents the exact condition under which true timing-gating becomes honest: a
frequency-pinned **self-hosted** runner with a generous (>2×) floor. A
cross-runner baseline would be dishonest, so none is committed.
### Proof it matters (MEASURED)
Running the new gate on the current tree immediately caught
`wifi-densepose-mat/detection_bench` failing to compile:
`error[E0063]: missing field last_rssi in initializer of SensorPosition` — the
struct gained a field; the bench was never updated. **Fixed** in the same change
(`last_rssi: None`, the simulated-zone convention) and re-verified
(`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run`
`Finished`). The gate paid for itself on its first run.
### Exclusions (documented in-workflow)
- `ruvector/crv_bench` — its crates.io dep `ruvector-crv 0.1.1` fails to build on
stable (upstream `E0308` in `stage_iii.rs`); excluded with a re-add condition.
- `onnx_bench` / `mqtt_throughput` — feature-gated (ort / mqtt), left to their
crates' own workflows. `wasm-edge/process_frame_bench` — workspace-excluded.
Conventions mirror existing workflows: `submodules: recursive` (the workspace
path-deps `vendor/rufield`), Swatinem/rust-cache `workspaces: v2`, Tauri/GTK apt
deps (a `--workspace` bench link pulls the whole graph), path-filtered triggers.
## Validation
- **Bit-rot caught + fixed** (above), re-verified `--no-run`.
- **MEASURED locally** (`--no-default-features`, Windows): nvsim, ruvector
(sketch/fusion/ann), signal/cir_bench, mat/detection_bench (post-fix),
vitals, ruview-swarm/swarm_bench all compile; fast subset runs (`nvsim
pipeline_run/d1/256` ≈ 55 µs; `ruvector sketch_hamming` ≈ 37 ns vs `float_l2`
≈ 63371 ns).
- `cargo test -p wifi-densepose-mat --no-default-features` → 166/6/2 passed, 0 failed.
- `python archive/v1/data/proof/verify.py`**VERDICT: PASS**, hash
`f8e76f21…46f7a` unchanged.
- **Honest limitation:** the full `--workspace --no-run` could not be
end-to-end validated on this Windows box (`desktop` needs GTK, `candle-core`
fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure — all
Windows-env artifacts; each affected bench compiles standalone here). **The
first green Linux CI run on the PR is the authoritative proof of the
`--workspace` step.**
## Consequences
### Positive
- Bench bit-rot is now a hard CI failure, not a silent surprise — the 26 benches
stay compilable as the APIs they exercise evolve.
- The benchmark-infrastructure half of the DoD (step 5) is satisfied honestly,
setting up the next sub-deliverable (QAT-int8 measurement) to be
regression-protected.
### Negative / Neutral
- No automated timing-regression detection (deliberate — see scope). Revisit only
with a frequency-pinned self-hosted runner.
- One bench (`crv_bench`) excluded pending an upstream dep fix.
## Links
- ADR-173 — metric-locked accuracy harness (sub-deliverable 8.1)
- `docs/research/sota-nn-train-benchmark-brief.md` — motivating target
- ADR-134 (CIR), ADR-135 (calibration), ADR-154 (signal DSP benches) — benched paths