diff --git a/docs/adr/ADR-114-replay-regression-suite.md b/docs/adr/ADR-114-replay-regression-suite.md
new file mode 100644
index 00000000..a9dbe0dd
--- /dev/null
+++ b/docs/adr/ADR-114-replay-regression-suite.md
@@ -0,0 +1,162 @@
+# ADR-114 — 2000-Packet Replay Regression Suite
+
+**Status**: Accepted
+**Date**: 2026-05-17
+**Scope**: `v2/crates/wifi-densepose-sensing-server/src/main.rs`
+(`replay_tests` module under `#[cfg(test)]`),
+`v2/crates/wifi-densepose-sensing-server/tests/fixtures/replay_*.jsonl`,
+`scripts/generate-replay-fixtures.py`. Closes the "2 000-packet fixed-
+replay test suite" item in CHECKLIST.
+
+## Context
+
+Up to now the amplitude classifier has been protected by per-function
+unit tests (cv calculation, NBVI selection, baseline drop trigger) but
+not by an end-to-end regression test that feeds a known-good stream
+through the full `amp_presence_override` pipeline and checks that the
+labels still look right.
+
+Without that, a refactor of NBVI selection or a threshold tweak could
+silently regress classifier behaviour on real deployments — the unit
+tests would all pass while the production output flipped.
+
+Pace's ESPectre has a similar pattern: 1000 idle + 1000 motion frames,
+checked into the repo, replayed in CI on every PR.
+
+## Decisions
+
+### D1 — Fixture format: line-delimited JSON, `{node_id, amplitude[]}`
+
+```jsonl
+{"node_id":1,"amplitude":[28.842, 19.333, ...]}
+{"node_id":2,"amplitude":[15.601, 17.220, ...]}
+...
+```
+
+Minimal: just the two fields the classifier reads. Round-robined across
+nodes (500 per node × 2 nodes = 1000 frames per fixture file). 1000
+frames per file × 2 files = 2000 packets total.
+
+### D2 — Fixtures live in-repo under `tests/fixtures/`
+
+```
+v2/crates/wifi-densepose-sensing-server/tests/fixtures/
+  replay_idle.jsonl    (1000 lines)
+  replay_motion.jsonl  (1000 lines)
+```
+
+Co-located with the test that consumes them. `cargo test` picks them up
+via `env!("CARGO_MANIFEST_DIR")`.
+The fixture files are ~1.5 MB total
+(text JSON) — small enough for the repo, not so small that the test
+loses statistical power.
+
+### D3 — Synthetic but parameter-matched to live data
+
+The fixtures are generated by `scripts/generate-replay-fixtures.py` with
+two deterministic seeds (42 and 43). Parameters chosen to mirror the
+live deployment:
+
+* Baseline mean amplitudes per node taken from `data/baseline.json`
+  (node 1: 27.04, node 2: 14.72).
+* Idle: per-frame Gaussian noise σ = 1.8 % of the per-subcarrier mean.
+* Motion: ±40 % slow envelope (0.15 Hz sinusoid, 6.7 s cycle, longer
+  than the classifier's 4.5 s `AMP_SHORT_WIN`) + 5 % per-frame noise.
+  Mimics a body slowly modulating the channel during walking.
+
+This is deliberately *synthetic*. Capturing 1000 real frames of
+"empty room" requires the operator to step out and stay out for ~50 s,
+and capturing "motion" requires walking through the room — neither is
+something this session could do without manual operator labour. The
+synthetic-but-realistic alternative gives deterministic regression
+coverage today, with the option to swap in live captures (same JSONL
+schema, same filenames) when time allows.
+
+### D4 — Test lives inside `main.rs` under `#[cfg(test)] mod replay_tests`
+
+`amp_presence_override` is private to the binary crate, so the test
+can't sit in `tests/` (which is for integration tests against
+`lib.rs`). Putting it under `#[cfg(test)]` in `main.rs` keeps the
+helper visibility minimal and exercises the exact function path
+production uses.
+
+### D5 — Test resets per-node history before each fixture run
+
+`amp_presence_override` accumulates per-node state in
+`OnceLock>>` statics. The test clears those between
+the idle and motion runs so each fixture starts with a fresh classifier
+(no cross-contamination from the previous fixture's frames sitting in
+the rolling window).
+
+It also clears the per-subcarrier baseline (`amp_baseline_per_sub`)
+because the synthetic fixtures don't share a per-subcarrier profile
+with whatever real recording lives in `data/baseline.json` — leaving
+the live per-sub baseline in place would make the drift channel
+saturate and obscure the CV-threshold path we're actually testing.
+
+### D6 — F1 threshold: 0.85
+
+Convention from Pace's ESPectre CI gate. Current value on the synthetic
+fixtures with this deployment's baseline is `F1 = 1.000` (tp=822,
+fp=0, tn=822, fn=0; 178 warmup frames excluded per fixture). The 0.15
+headroom gives room for legitimate classifier evolution without
+forcing a fixture re-record on every tuning change.
+
+### D7 — Test loads the deployment baseline at startup
+
+Without `data/baseline.json` loaded, the classifier compares raw CV
+against thresholds of 3.0 (300 %) and 6.0 — values no realistic signal
+reaches. The test discovers the baseline via a couple of canonical
+relative paths (`../../data/baseline.json` from the crate dir, etc.)
+and exits early with a clear `eprintln!` hint if none are found.
+
+## Trade-offs
+
+* **Synthetic fixtures don't catch sensor-specific bugs.** A
+  Kconfig-level FW regression that produced subtly different amplitude
+  scaling would not be caught — the synthetic fixtures encode the
+  *expected* scaling, not whatever the FW currently emits. The witness
+  bundle (ADR-028) still covers that end of the pipeline.
+* **`replay_2000` runs only when explicitly named or via the full
+  suite.** No filtering hides it from CI. It runs in well under a
+  second so cost is negligible.
+* **F1 currently 1.0 — too clean to detect subtle regressions.** A
+  followup with live captures may bring the natural F1 to ~0.9, at
+  which point the 0.85 threshold becomes a real gate. For now it's
+  primarily a contract test: "the classifier still emits something
+  reasonable on a known input".
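As a rough illustration of the D6 gate, the sketch below recomputes F1 from per-frame labels. It is not the Rust test in `main.rs`: the helper names (`confusion`, `f1_score`), the boolean label lists, and the choice to drop the first 178 labels of each fixture as warmup are all assumptions made for this example.

```python
F1_THRESHOLD = 0.85   # CI gate from D6
WARMUP_FRAMES = 178   # frames excluded per fixture, as reported in D6


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*tp / (2*tp + fp + fn); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def confusion(idle_labels, motion_labels, warmup=WARMUP_FRAMES):
    """Count (tp, fp, tn, fn) from boolean 'presence detected' labels.

    Assumption: warmup is modelled by skipping the first `warmup` labels of
    each fixture; the real test may decide which frames to skip differently.
    """
    idle = idle_labels[warmup:]
    motion = motion_labels[warmup:]
    tp = sum(motion)           # motion frames labelled as presence
    fn = len(motion) - tp      # motion frames that were missed
    fp = sum(idle)             # idle frames wrongly labelled as presence
    tn = len(idle) - fp        # idle frames correctly labelled as empty
    return tp, fp, tn, fn


# A classifier that labels every frame correctly on two 1000-frame fixtures.
idle_labels = [False] * 1000
motion_labels = [True] * 1000

tp, fp, tn, fn = confusion(idle_labels, motion_labels)
f1 = f1_score(tp, fp, fn)
print(f"replay_2000 F1={f1:.3f} tp={tp} fp={fp} tn={tn} fn={fn}")
assert f1 >= F1_THRESHOLD
```

Under these assumptions, and with no false positives, the 0.85 gate would still pass with up to 214 of the 822 scored motion frames missed (F1 ≈ 0.8503). That gives a feel for how loose the gate is on the current fixtures.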
+
+## Files Touched
+
+```
+scripts/generate-replay-fixtures.py            (new)
+v2/crates/wifi-densepose-sensing-server/tests/fixtures/
+  replay_idle.jsonl                            (new)
+  replay_motion.jsonl                          (new)
+v2/crates/wifi-densepose-sensing-server/src/main.rs
+  - replay_tests module (D4, D5, D7)
+docs/adr/ADR-114-replay-regression-suite.md    (this)
+```
+
+## Verified Acceptance
+
+```
+$ cargo test --release -p wifi-densepose-sensing-server \
+    --no-default-features --bin sensing-server replay_2000 -- --nocapture
+replay_2000 F1=1.000 tp=822 fp=0 tn=822 fn=0
+test replay_tests::replay_2000_packets_f1_above_threshold ... ok
+test result: ok. 1 passed; 0 failed; 0 ignored;
+```
+
+Full workspace suite: 327 tests pass (was 326 + this one).
+
+## References
+
+* ADR-101 — raw-amplitude classifier this test exercises.
+* ADR-102 — NBVI subcarrier selection that feeds CV calculation.
+* ADR-103 — persistent baseline that drives the universal-threshold
+  normalization the test relies on.
+* ADR-028 — witness bundle (the other end-to-end regression
+  mechanism; ADR-114 covers classifier code paths, ADR-028 covers
+  the deterministic-CSI proof pipeline).
+* Francesco Pace, *How I Turned My Wi-Fi Into a Motion Sensor —
+  Part 2*, "Replay regression test" — the upstream pattern.