From b209b8b778a30735314fefd0b79f88fc55111734 Mon Sep 17 00:00:00 2001
From: rUv
Date: Mon, 15 Jun 2026 08:26:38 -0400
Subject: [PATCH] ci(bench): compile-verify regression gate for v2 criterion benches + ADR-174 (#1094)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* ci(bench): wire v2 criterion benches into CI as a compile-verify regression gate

Sub-deliverable 8.3 of the benchmark/optimization milestone (needs ADR slot 174). The v2/ workspace ships 26 criterion benches across 18 crates, but benches are not part of `cargo test`, so nothing in CI compiled them and they silently rot when a public API they call changes.

Add `.github/workflows/bench-regression.yml`:
- bench-compile (HARD GATE): `cargo bench --workspace --no-default-features --no-run` compiles + links every default-feature bench (no measurement) plus the cir-gated cir_bench — a real, deterministic regression guard against bench bit-rot.
- bench-fast-run (INFORMATIONAL, continue-on-error, never gates): runs a curated pure-CPU subset (nvsim, ruvector sketch/fusion) in criterion quick-mode and uploads logs as an artifact.

No timing-regression gate, by design: wall-clock on shared GitHub runners varies 2-3x run-to-run, so a hard threshold or cross-runner `criterion --baseline` compare would manufacture false failures. The honest scope is compile-verify + informational-run; the workflow header documents the self-hosted-runner condition under which true timing-gating becomes honest.

The crv-gated crv_bench is excluded because its crates.io dep ruvector-crv 0.1.1 fails to build upstream.

Running the gate immediately caught one already-bit-rotted bench: wifi-densepose-mat/detection_bench failed to compile (E0063: missing field last_rssi in SensorPosition). Fixed (last_rssi: None) and re-verified.
Validation (MEASURED): mat detection_bench + cir_bench + nvsim + ruvector + vitals + swarm benches compile under --no-default-features; fast subset runs; `cargo test -p wifi-densepose-mat --no-default-features` 174 passed / 0 failed; Python proof PASS, hash f8e76f21...46f7a unchanged.

Co-Authored-By: claude-flow

* docs(adr): ADR-174 — CI bench-regression compile-verify gate

Records sub-deliverable 8.3 (bench-regression.yml, committed c4c59e085): a hard compile-verify gate over all 26 v2 criterion benches (caught + fixed one real bit-rotted bench, mat/detection_bench E0063) + an informational fast-run. Documents the honest scope — no timing-regression gate, since shared-runner wall-clock varies 2-3x; states the self-hosted-runner condition under which timing gating becomes honest.

Co-Authored-By: claude-flow
---
 .github/workflows/bench-regression.yml        | 199 ++++++++++++++++++
 CHANGELOG.md                                  |   1 +
 ...ci-bench-regression-compile-verify-gate.md | 110 ++++++++++
 .../benches/detection_bench.rs                |   3 +
 4 files changed, 313 insertions(+)
 create mode 100644 .github/workflows/bench-regression.yml
 create mode 100644 docs/adr/ADR-174-ci-bench-regression-compile-verify-gate.md

diff --git a/.github/workflows/bench-regression.yml b/.github/workflows/bench-regression.yml
new file mode 100644
index 00000000..1defa9bb
--- /dev/null
+++ b/.github/workflows/bench-regression.yml
@@ -0,0 +1,199 @@
+name: Bench Regression Guard
+
+# Sub-deliverable 8.3 of the benchmark/optimization milestone.
+#
+# HONEST SCOPE (read this before assuming this gates on timing):
+#   * The `bench-compile` job is a REAL, HARD-FAILING regression gate. It runs
+#     `cargo bench --no-default-features --no-run`, which type-checks and links
+#     EVERY criterion bench in the v2/ workspace without running a single
+#     measurement. Benches are not part of `cargo test`, so they silently
+#     bit-rot when a public API they call changes — this job catches that the
+#     moment it happens. This is the part of this workflow that can fail a PR.
+#
+#   * The `bench-fast-run` job runs a small, curated subset of pure-CPU benches
+#     in criterion "quick mode" (short warm-up / measurement / 10 samples) and
+#     is INFORMATIONAL ONLY (`continue-on-error: true`). It does NOT gate on
+#     timing. Wall-clock timings on shared GitHub-hosted runners vary by
+#     2-3x run-to-run (noisy neighbours, CPU throttling, no pinned frequency),
+#     so a hard ">X ms" threshold here would flake constantly and teach
+#     everyone to ignore it. We deliberately do not pretend to do timing
+#     regression-gating we cannot deliver reliably. The numbers are surfaced in
+#     the job log + uploaded as an artifact for humans to eyeball trends.
+#
+# WHY NO criterion --baseline COMPARE GATE:
+#   criterion's `--save-baseline` / `--baseline` compare is the textbook
+#   regression mechanism, but it only produces a trustworthy verdict when the
+#   baseline and the candidate were measured on the SAME hardware under the SAME
+#   conditions. GitHub-hosted runners give neither (the baseline commit and the
+#   PR commit land on different physical machines). Committing a baseline JSON
+#   measured on one runner and comparing a different runner against it would
+#   manufacture false regressions. If/when these benches run on a dedicated,
+#   frequency-pinned self-hosted runner, a `--baseline` compare with a generous
+#   (>2x) noise floor becomes honest and can be added then. Until then,
+#   compile-verify + informational-run is the honest gate.
+
+on:
+  push:
+    branches: [ main, develop, 'feat/*' ]
+    paths:
+      - 'v2/crates/**/benches/**'
+      - 'v2/crates/**/Cargo.toml'
+      - 'v2/crates/**/src/**'
+      - 'v2/Cargo.toml'
+      - 'v2/Cargo.lock'
+      - '.github/workflows/bench-regression.yml'
+  pull_request:
+    paths:
+      - 'v2/crates/**/benches/**'
+      - 'v2/crates/**/Cargo.toml'
+      - 'v2/crates/**/src/**'
+      - 'v2/Cargo.toml'
+      - 'v2/Cargo.lock'
+      - '.github/workflows/bench-regression.yml'
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+env:
+  CARGO_TERM_COLOR: always
+  # Debuginfo is useless in CI and the 38-crate workspace target dir otherwise
+  # exhausts the runner disk (mirrors ci.yml's rust-tests job). The bench
+  # profile inherits release + debug = true (v2/Cargo.toml [profile.bench]);
+  # force it off so the link step does not run out of space.
+  CARGO_PROFILE_BENCH_DEBUG: "0"
+  CARGO_PROFILE_RELEASE_DEBUG: "0"
+
+jobs:
+  # ── HARD GATE: every bench must still compile + link ─────────────────────
+  bench-compile:
+    name: bench compile-verify (--no-run)
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout (recursive — wifi-densepose-rufield path-deps vendor/rufield)
+        uses: actions/checkout@v4
+        with:
+          # The workspace includes `wifi-densepose-rufield`, which path-deps the
+          # `vendor/rufield` submodule crates. Without a recursive checkout the
+          # whole workspace fails to resolve before any bench is built.
+          submodules: recursive
+
+      # The workspace pulls in `wifi-densepose-desktop` (Tauri v2) whose -sys
+      # crates need the GTK/WebKit/serial dev libraries via pkg-config, exactly
+      # as ci.yml's rust-tests job documents. A `--workspace` bench build links
+      # the whole graph, so these are required here too.
+      - name: Install Tauri / GTK / serial system dev libraries
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y --no-install-recommends \
+            libglib2.0-dev \
+            libgtk-3-dev \
+            libsoup-3.0-dev \
+            libjavascriptcoregtk-4.1-dev \
+            libwebkit2gtk-4.1-dev \
+            libayatana-appindicator3-dev \
+            librsvg2-dev \
+            libxdo-dev \
+            libudev-dev \
+            libdbus-1-dev \
+            libssl-dev \
+            pkg-config
+
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@stable
+
+      - name: Cache cargo (Swatinem/rust-cache)
+        uses: Swatinem/rust-cache@v2
+        with:
+          workspaces: v2
+          # Distinct cache scope from ci.yml's rust-tests so the bench profile
+          # artifacts (release+opt) do not evict the test profile cache.
+          key: bench-regression
+
+      # The core regression guard. `--no-run` compiles + links every bench
+      # target in the workspace's DEFAULT feature set but runs no measurement,
+      # so it is deterministic and fast-ish (build only). A bench that no longer
+      # compiles — because a type/signature it calls changed and nobody updated
+      # the bench — fails the build here. `--no-default-features` is the
+      # workspace's standard gate flag (openblas/tch/ort/onnx stay opt-out).
+      - name: Compile all workspace benches (default features)
+        working-directory: v2
+        run: cargo bench --workspace --no-default-features --no-run
+
+      # Feature-gated benches are skipped by the default build above because
+      # their `[[bench]]` entries carry `required-features`. Compile the ones we
+      # can guard so they are also covered against bit-rot.
+      #   * cir → wifi-densepose-signal/benches/cir_bench.rs (ADR-134). The
+      #     `cir` feature is pure-Rust (`cir = []`), so it builds on the stock
+      #     runner and is a real, hard-failing guard like the step above.
+      #
+      # NOT guarded here (honest scope):
+      #   * crv → wifi-densepose-ruvector/benches/crv_bench.rs. The `crv` feature
+      #     pulls the crates.io dependency `ruvector-crv 0.1.1`, which currently
+      #     FAILS to compile on stable (E0308 type mismatch in its own
+      #     `stage_iii.rs` — an UPSTREAM bug, unrelated to bench bit-rot).
+      #     Adding a hard `--features crv` compile step would make this workflow
+      #     red for a reason this gate is not meant to police. Re-add this step
+      #     once `ruvector-crv` ships a fixed release. (mqtt/onnx benches are
+      #     likewise left to their own crate workflows.)
+      - name: Compile feature-gated benches (cir)
+        working-directory: v2
+        run: cargo bench -p wifi-densepose-signal --no-default-features --features cir --bench cir_bench --no-run
+
+  # ── INFORMATIONAL: run a curated fast subset (never gates) ───────────────
+  bench-fast-run:
+    name: bench fast-run (informational, non-gating)
+    runs-on: ubuntu-latest
+    # NEVER fail the workflow on this job — timings are noise-prone on shared
+    # runners (see header). It exists to surface trends for humans, not to gate.
+    continue-on-error: true
+    needs: [bench-compile]
+    steps:
+      - name: Checkout (recursive)
+        uses: actions/checkout@v4
+        with:
+          submodules: recursive
+
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@stable
+
+      - name: Cache cargo (Swatinem/rust-cache)
+        uses: Swatinem/rust-cache@v2
+        with:
+          workspaces: v2
+          key: bench-regression
+
+      # Curated subset = pure-CPU, fast, dependency-light criterion benches that
+      # finish in seconds under quick-mode flags. Each is targeted by `--bench`
+      # (NOT a bare `cargo bench -p`) because the crates' lib targets use the
+      # libtest harness, which rejects criterion's CLI flags (--warm-up-time
+      # etc.) and aborts the run. Quick-mode: 1s warm-up, 2s measure, 10 samples.
+      - name: nvsim pipeline_throughput (quick)
+        working-directory: v2
+        run: |
+          mkdir -p ../bench-out
+          cargo bench -p nvsim --no-default-features --bench pipeline_throughput -- \
+            --warm-up-time 1 --measurement-time 2 --sample-size 10 \
+            | tee ../bench-out/nvsim_pipeline_throughput.txt
+
+      - name: ruvector sketch_bench (quick)
+        working-directory: v2
+        run: |
+          cargo bench -p wifi-densepose-ruvector --no-default-features --bench sketch_bench -- \
+            --warm-up-time 1 --measurement-time 2 --sample-size 10 \
+            | tee ../bench-out/ruvector_sketch_bench.txt
+
+      - name: ruvector fusion_bench (quick)
+        working-directory: v2
+        run: |
+          cargo bench -p wifi-densepose-ruvector --no-default-features --bench fusion_bench -- \
+            --warm-up-time 1 --measurement-time 2 --sample-size 10 \
+            | tee ../bench-out/ruvector_fusion_bench.txt
+
+      - name: Upload informational bench logs
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: bench-fast-run-logs
+          path: bench-out/
+          if-no-files-found: warn
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4dc3b4e0..25736a07 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,6 +15,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 - **Metric-locked PCK/MPJPE accuracy harness — resolves the PCK-definition ambiguity (`wifi-densepose-train`, needs ADR slot 173).** The SOTA brief (`docs/research/sota-nn-train-benchmark-brief.md` §1, §3.1, §4) found the single biggest threat to any "beyond-SOTA" claim is **metric ambiguity**: three PCK@20 figures (96.09% WiFlow-STD image-normalized, 81.63% AetherArena torso-PCK, 61.1% GraphPose-Fi standard PCK) cannot be lined up because each silently uses a different normalization — the project was retracted twice over this (a withdrawn "92.9%" used *absolute* pixels, not torso).
New `src/accuracy.rs` makes the normalizer **explicit, selectable, and carried with every reported number**: a `PckNormalization` enum (`TorsoDiameter` = standard MM-Fi/GraphPose-Fi hip↔hip; `BoundingBoxDiagonal` = looser WiFlow-STD image-normalized; `AbsolutePixels(threshold)` = the retracted convention, included so historical numbers are reproducible and clearly labeled non-comparable); one canonical `pck_at(pred, gt, vis, k, normalization)` reusing the `metrics_core` geometric primitives (hip distance, bbox diagonal — no duplicate kernel); `mpjpe(pred, gt, vis)` (2D/3D, mm); and a self-describing `PoseAccuracy { pck_at: BTreeMap, mpjpe, normalization, n_keypoints, n_frames }` returned by `accuracy_report(frames, ks, normalization)` so an **unlabeled PCK number is structurally impossible**. **17 hand-computed deterministic tests** (no GPU, no datasets) prove the harness arithmetic: perfect→PCK=1.0/MPJPE=0; all-just-outside→0.0; half-in-half-out→0.5; the **key proof** that identical predictions score 0.50 (torso) / 1.00 (bbox) / 0.75 (abs) under the three normalizations (the ambiguity is real and the definitions are distinct); MPJPE 2D/3D fixtures; and graceful degenerate handling (zero torso, empty frames, NaN coords — no panic, never a false-perfect). **This is measurement infrastructure, not an accuracy claim** — the tests prove the harness is correct, not that any model is good. `wifi-densepose-train` lib 191→206, `test_metrics` 12→14, 0 failed. Python deterministic proof unchanged (off the signal proof path). +- **CI bench-regression guard (`.github/workflows/bench-regression.yml`) — wires the v2/ criterion benches into CI as a real, hard-failing COMPILE-VERIFY gate + an informational fast-run; caught + fixed one already-bit-rotted bench (benchmark/optimization milestone sub-deliverable 8.3; needs ADR slot 174).** The v2/ workspace ships **26 criterion benches across 18 crates** (e.g. 
`nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`, `wifi-densepose-signal/{signal,dsp_perf,features,calibration,aether_prefilter,cir}_bench`, `wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv,onnx}_bench`, `wifi-densepose-engine/engine_cycle`, …) but, because benches are **not** part of `cargo test`, nothing in CI compiled them — so they silently rot when a public API they call changes. **Proof this matters (MEASURED):** running the new gate on the current tree immediately caught `wifi-densepose-mat/detection_bench` failing to compile (`E0063: missing field last_rssi in initializer of SensorPosition` — the struct gained a field, the bench was never updated); fixed in this change (`last_rssi: None`, the simulated-zone convention) and re-verified (`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run` → `Finished`, Executable produced). **HONEST SCOPE — what gates vs what is informational:** (1) `bench-compile` (HARD GATE) runs `cargo bench --workspace --no-default-features --no-run` (compile + link every default-feature bench, no measurement) plus a `--features cir` compile of the gated `cir_bench` — a deterministic, real regression guard against bench bit-rot; (2) `bench-fast-run` (INFORMATIONAL, `continue-on-error: true`, NEVER gates) runs a curated pure-CPU subset (`nvsim/pipeline_throughput`, `ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1s warm-up / 2s measure / 10 samples), targeted per-`--bench` (the crates' libtest lib targets reject criterion flags), and uploads the logs as an artifact. **No timing-regression gate, by design and stated in the workflow header:** wall-clock on shared GitHub runners varies 2-3x run-to-run, so a hard threshold or a cross-runner `criterion --baseline` compare would manufacture false failures; that becomes honest only on a frequency-pinned self-hosted runner (documented as the re-add condition). 
The `crv`-gated `ruvector/crv_bench` is deliberately NOT compiled by the gate because its crates.io dep `ruvector-crv 0.1.1` currently fails to build on stable (upstream E0308 in its own `stage_iii.rs`) — noted in-workflow with the re-add condition. Checkout is `submodules: recursive` (the workspace path-deps `vendor/rufield`) and installs the Tauri/GTK dev libs like `ci.yml`'s rust-tests job (a `--workspace` bench link pulls the whole graph). **MEASURED locally (Windows, `--no-default-features`):** `nvsim`, `wifi-densepose-ruvector` (sketch/fusion/ann), `wifi-densepose-signal/cir_bench`, `wifi-densepose-mat/detection_bench` (post-fix), `wifi-densepose-vitals/vitals_bench`, and `ruview-swarm/swarm_bench` all compile + the fast subset runs (sample baseline: `nvsim pipeline_run/d1/256` ≈ 55 µs, `d16/1024` ≈ 315 µs; `ruvector sketch_hamming` ≈ 3-7 ns vs `float_l2` ≈ 63-371 ns). The full `--workspace` `--no-run` could **not** be fully validated on Windows (Tauri-`desktop` needs GTK, `candle-core` fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure) — those are Windows-env artifacts that build in the Linux CI runner (each affected bench was confirmed to compile standalone here). No baseline JSON is committed (a cross-runner baseline would be dishonest). Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path). - **RuField `rufield-viewer` live-ingest mode — closes the RuView↔RuField visual loop (ADR-262 surfaces).** The dashboard gains `--source live --upstream `: it consumes RuView's `/ws/field` SSE (falling back to polling `/api/field`), **verifies every event's ed25519 provenance receipt on ingest** (`is_fusable`) — forged/tampered events are flagged ✗ and **never fused** into trusted inferences — and renders real RuView `FieldEvent`s through the same room-state/privacy-badge/fusion-graph/receipt path the synthetic mode uses (wire-compatible by construction: both sides use `rufield_core::FieldEvent` serde). 
**Strict banner honesty:** a single `BannerState` shows `SYNTHETIC` / `LIVE — ` / `DISCONNECTED — unreachable`, mutually exclusive — never SYNTHETIC while showing live data or vice versa; live mode returns **409** on `/api/run` rather than fabricate a synthetic run, and starts DISCONNECTED until first verified contact. Default stays synthetic. 26 tests / 0 failed. `ruvnet/rufield` `crates/rufield-viewer`; `vendor/rufield` submodule bumped. - **ADR-262 P3 — live RuField surface: RuView's running sensing-server now speaks RuField on `/api/field` + `/ws/field`.** Wires the P1 `wifi-densepose-rufield` bridge into the live `wifi-densepose-sensing-server` (the bridge is the only added coupling, ADR-262 §5.4). A new `src/rufield_surface.rs` module (kept out of the 8k-line `main.rs`) holds a `FieldSurface` with a **dedicated ed25519 `Signer`**, a bounded ring buffer of recent signed events (`FIELD_RING_CAPACITY = 64`), and the `/ws/field` broadcast topic; it exposes `GET /api/field` (latest signed `FieldEvent`s + signer pubkey + a `dev_signing_key` flag) and `GET /ws/field` (per-cycle stream, mirroring `/ws/sensing`), plus a standalone `router()` for isolated testing. **Tap:** at the ESP32 governed-trust cycle (`main.rs` `observe_cycle` ~`:5886` / `SensingUpdate` build ~`:5938`), `emit_rufield_event` joins the cycle's real `SensingUpdate` (features/classification/signal_field) with the engine's recorded `effective_class`/`demoted` trust state into a `SensingSnapshot` and surfaces a signed `FieldEvent` — **existing endpoints (`/ws/sensing` etc.) are unchanged; this is purely additive.** **Signer (defers the P2 key decision, §8 Q1):** a **standalone dev/sensing key** from `WDP_RUFIELD_SIGNING_SEED` (64-hex or ≥32-byte value), else a deterministic dev default with a logged `WARN` — reusing the `cog-ha-matter` Ed25519 key is the deferred P2 call, so P3 does not pre-empt it. 
**Egress privacy (fail-closed):** `network_egress_allowed` is *stricter* than `DefaultPrivacyGuard` for an unattended live surface — only **P1/P2** leave the box; P0 (raw) and P3/P4/P5 are held edge-local, so a `Derived → P4/P5` cycle **never** surfaces; no-presence cycles emit **no phantom event**. **P3 acceptance gates (`tests/rufield_surface_test.rs`, 4 integration via `tower::oneshot` + 4 module unit, 0 failed):** a well-formed **signed** event (`Modality::WifiCsi`, P2 not P1, `is_fusable` ed25519-verified, real timestamp); empty cycle → no phantom; **privacy-safety** — an injected `Derived` trust never surfaces; a mixed stream surfaces only egress-safe events. **Honest scope (ADR-262 §0/§6):** real plumbing on a **live endpoint**, **NOT accuracy** — single-link CSI with its existing caveats (no validated room-coordinate accuracy — `field_localize`), a dedicated dev signing key pending the P2 ownership decision, no accuracy claim. The win is narrowly: "RuView's live sensing now speaks RuField on `/ws/field`." - **ADR-262 P1 — `wifi-densepose-rufield` anti-corruption bridge: RuView WiFi-CSI sensing → signed RuField `FieldEvent`s.** A new v2 workspace member (the *single coupling point* between RuView and the standalone RuField MFS spec, ADR-262 §5.4) that **path-deps** the `vendor/rufield` submodule crates (`rufield-core`/`-provenance`/`-privacy`/`-fusion` — pure-Rust, `--no-default-features`-buildable: serde/sha2/ed25519/toml only, no tch/openblas/ndarray/candle) and **no** RuView internal crate. 
The bridge takes owned primitives — `SensingSnapshot` mirrors the `/ws/sensing` `SensingUpdate` (features + classification + signal_field) joined with the `TrustedOutput` trust state (`trust_class`/`demoted`/`identity_bound`) — and `snapshot_to_field_event()` emits one **signed** `FieldEvent` (`Modality::WifiCsi`, axis `[Frequency]`): a real `FieldTensor` from the feature scalars with the real `timestamp_ns`; an `Observation` whose `range_m`/`motion_vector`/`space_cell` are derived from the strongest **signal-field peak** when present (else `None` — coordinates are **never fabricated**, per the `field_localize` caveat) and `confidence` from the classification; a real `ProvenanceRef` (sha256 over the tensor bytes, `synthetic=false`) **ed25519-signed** so `rufield_provenance::is_fusable` passes. **The §3.3 privacy mapping is the critical correctness item**, implemented as `map_privacy()` mapping RuView's class onto RuField P0–P5 **by information content, NEVER by byte value** and **fail-closed**: RuView `Derived` (byte `1`, which sorts *below* `Anonymous` byte `2`) carries an identity embedding → maps to **P4** (or **P5** if identity-bound), **never P1** (the single most dangerous mapping mistake); `Raw → P0`, `Anonymous → P2`, `Restricted → P2`; a governed-engine `demoted` cycle floors the egress class to ≥ P2 with raw suppressed. **P1 acceptance gates (15 tests / 0 failed — 5 unit + 9 integration + 1 doc):** round-trip (`SensingSnapshot → FieldEvent →` serde `→` equal), `is_fusable` (verified ed25519 receipt), `RuFieldFusion::ingest` accept + `infer()` runs, **privacy-safety** (`gate_privacy_safety_derived_never_maps_to_low_privacy` — `Derived → P4/P5`, never P1; a table test over every RuView class; fail-closed demotion), and determinism (same snapshot + same signer seed → byte-identical event). **Honest scope:** this is **P1 plumbing** — a tested conversion + a safe privacy mapping. 
It is **not** wired into the live server (that is P3) and makes **no accuracy claim** (RuField v0.1 is synthetic; RuView's single-link CSI carries its own caveats). CI: the `rust-tests` workflow checkout gains `submodules: recursive` so the path-deps resolve. Python deterministic proof unchanged (off the signal proof path).
diff --git a/docs/adr/ADR-174-ci-bench-regression-compile-verify-gate.md b/docs/adr/ADR-174-ci-bench-regression-compile-verify-gate.md
new file mode 100644
index 00000000..7aed7fe9
--- /dev/null
+++ b/docs/adr/ADR-174-ci-bench-regression-compile-verify-gate.md
@@ -0,0 +1,110 @@
+# ADR-174: CI Bench-Regression Gate (Compile-Verify)
+
+| Field | Value |
+|-------|-------|
+| **Status** | Accepted — implemented, caught one real bit-rotted bench |
+| **Date** | 2026-06-15 |
+| **Deciders** | ruv |
+| **Codename** | **BENCH-GATE** |
+| **Milestone** | benchmark/optimization re-balance — sub-deliverable 8.3 |
+| **Motivated by** | `docs/research/sota-nn-train-benchmark-brief.md` (target 3: criterion benches as CI regression baselines) |
+
+## Context
+
+The v2/ workspace ships **26 criterion benches across 18 crates** (e.g.
+`nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`,
+`wifi-densepose-signal/{signal,dsp_perf,features,calibration,cir,…}_bench`,
+`wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv}_bench`,
+`wifi-densepose-engine/engine_cycle`, …). Because **benches are not part of
+`cargo test`**, nothing in CI compiled them — so they bit-rot silently the moment
+a public API they call changes, and the rot is invisible until someone manually
+runs `cargo bench` months later.
+
+The SOTA brief named "wire existing criterion benches into CI as regression
+baselines" as a concrete benchmark-hygiene target. The honest difficulty: true
+*timing*-regression gating on shared GitHub runners is unreliable — wall-clock
+varies 2–3× run-to-run (a captured 10-sample run showed `float_l2/512` ranging
+307–444 ns), so a hard threshold or a cross-runner `criterion --baseline` compare
+(baseline and PR land on different physical machines) would manufacture false
+regressions. A gate that cries wolf gets disabled.
+
+## Decision
+
+Add `.github/workflows/bench-regression.yml` with **two jobs of explicitly
+different authority** — and do NOT pretend to gate on timing.
+
+### `bench-compile` — HARD GATE (real regression detection)
+`cargo bench --workspace --no-default-features --no-run` compiles + links every
+default-feature bench (no measurement → fully deterministic), plus a
+`--features cir` compile of the gated `cir_bench`. Benches aren't in `cargo test`,
+so this is the genuine guard: **the build fails the moment a bench stops
+compiling.**
+
+### `bench-fast-run` — INFORMATIONAL (`continue-on-error: true`, never gates)
+Runs a curated pure-CPU subset (`nvsim/pipeline_throughput`,
+`ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1 s warm-up / 2 s
+measure / 10 samples), targeted per-`--bench`, and uploads logs as an artifact.
+Every number it produces is **informational only** — explicitly stated in the
+workflow header.
+
+### What is NOT done, and why (honest scope)
+No timing-regression gate, no committed baseline JSON. The workflow header
+documents the exact condition under which true timing-gating becomes honest: a
+frequency-pinned **self-hosted** runner with a generous (>2×) floor. A
+cross-runner baseline would be dishonest, so none is committed.
+
+### Proof it matters (MEASURED)
+Running the new gate on the current tree immediately caught
+`wifi-densepose-mat/detection_bench` failing to compile:
+`error[E0063]: missing field last_rssi in initializer of SensorPosition` — the
+struct gained a field; the bench was never updated. **Fixed** in the same change
+(`last_rssi: None`, the simulated-zone convention) and re-verified
+(`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run`
+→ `Finished`). The gate paid for itself on its first run.
+
+### Exclusions (documented in-workflow)
+- `ruvector/crv_bench` — its crates.io dep `ruvector-crv 0.1.1` fails to build on
+  stable (upstream `E0308` in `stage_iii.rs`); excluded with a re-add condition.
+- `onnx_bench` / `mqtt_throughput` — feature-gated (ort / mqtt), left to their
+  crates' own workflows. `wasm-edge/process_frame_bench` — workspace-excluded.
+
+Conventions mirror existing workflows: `submodules: recursive` (the workspace
+path-deps `vendor/rufield`), Swatinem/rust-cache `workspaces: v2`, Tauri/GTK apt
+deps (a `--workspace` bench link pulls the whole graph), path-filtered triggers.
+
+## Validation
+
+- **Bit-rot caught + fixed** (above), re-verified `--no-run`.
+- **MEASURED locally** (`--no-default-features`, Windows): nvsim, ruvector
+  (sketch/fusion/ann), signal/cir_bench, mat/detection_bench (post-fix),
+  vitals, ruview-swarm/swarm_bench all compile; fast subset runs (`nvsim
+  pipeline_run/d1/256` ≈ 55 µs; `ruvector sketch_hamming` ≈ 3–7 ns vs `float_l2`
+  ≈ 63–371 ns).
+- `cargo test -p wifi-densepose-mat --no-default-features` → 166/6/2 passed, 0 failed.
+- `python archive/v1/data/proof/verify.py` → **VERDICT: PASS**, hash
+  `f8e76f21…46f7a` unchanged.
+- **Honest limitation:** the full `--workspace --no-run` could not be
+  end-to-end validated on this Windows box (`desktop` needs GTK, `candle-core`
+  fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure — all
+  Windows-env artifacts; each affected bench compiles standalone here). **The
+  first green Linux CI run on the PR is the authoritative proof of the
+  `--workspace` step.**
+
+## Consequences
+
+### Positive
+- Bench bit-rot is now a hard CI failure, not a silent surprise — the 26 benches
+  stay compilable as the APIs they exercise evolve.
+- The benchmark-infrastructure half of the DoD (step 5) is satisfied honestly,
+  setting up the next sub-deliverable (QAT-int8 measurement) to be
+  regression-protected.
+
+### Negative / Neutral
+- No automated timing-regression detection (deliberate — see scope). Revisit only
+  with a frequency-pinned self-hosted runner.
+- One bench (`crv_bench`) excluded pending an upstream dep fix.
+
+## Links
+- ADR-173 — metric-locked accuracy harness (sub-deliverable 8.1)
+- `docs/research/sota-nn-train-benchmark-brief.md` — motivating target
+- ADR-134 (CIR), ADR-135 (calibration), ADR-154 (signal DSP benches) — benched paths
diff --git a/v2/crates/wifi-densepose-mat/benches/detection_bench.rs b/v2/crates/wifi-densepose-mat/benches/detection_bench.rs
index f4efb6cd..4bbc4ffe 100644
--- a/v2/crates/wifi-densepose-mat/benches/detection_bench.rs
+++ b/v2/crates/wifi-densepose-mat/benches/detection_bench.rs
@@ -220,6 +220,9 @@ fn create_test_sensors(count: usize) -> Vec<SensorPosition> {
                 z: 1.5,
                 sensor_type: SensorType::Transceiver,
                 is_operational: true,
+                // No live RSSI plumbed for synthetic bench sensors (simulated
+                // zone) — localization must not fabricate one.
+                last_rssi: None,
             }
         })
         .collect()
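
Reviewer note: the breakage this gate caught can be reproduced in isolation. The sketch below is a minimal, hypothetical stand-in for the bench helper, not the crate's real code. The actual `SensorPosition` lives in `wifi-densepose-mat` and its full field list is not visible in this patch, so `id`, `x`, `y`, the `f64` payload of `last_rssi`, and the circular layout are all assumptions made for illustration. Only `z`, `sensor_type`, `is_operational`, and `last_rssi: None` are taken from the hunk above.

```rust
// Hypothetical, trimmed-down mirror of the types the bench touches.
// Only `z`, `sensor_type`, `is_operational`, and `last_rssi` appear in the
// patch; every other field and the RSSI payload type are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SensorType {
    Transceiver,
}

#[derive(Debug, Clone, PartialEq)]
pub struct SensorPosition {
    pub id: String,
    pub x: f64,
    pub y: f64,
    pub z: f64,
    pub sensor_type: SensorType,
    pub is_operational: bool,
    /// The newly added field. A struct literal that omits it is rejected
    /// by rustc with E0063 (missing field in initializer).
    pub last_rssi: Option<f64>,
}

/// Builds `count` synthetic sensors spaced evenly on a circle of radius 10.
pub fn create_test_sensors(count: usize) -> Vec<SensorPosition> {
    (0..count)
        .map(|i| {
            let angle = 2.0 * std::f64::consts::PI * i as f64 / count as f64;
            SensorPosition {
                id: format!("sensor_{i}"),
                x: 10.0 * angle.cos(),
                y: 10.0 * angle.sin(),
                z: 1.5,
                sensor_type: SensorType::Transceiver,
                is_operational: true,
                // Synthetic sensors have no live reading, so none is invented.
                last_rssi: None,
            }
        })
        .collect()
}
```

Deleting the `last_rssi: None` line from this sketch should make it fail to compile with `E0063`, which is the class of error `cargo bench --no-run` surfaces. Because benches are compiled only by `cargo bench`, a plain `cargo test` run would not have reported it.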