# ADR-114 — 2000-Packet Replay Regression Suite

**Status**: Accepted

**Date**: 2026-05-17

**Scope**: `v2/crates/wifi-densepose-sensing-server/src/main.rs` (`replay_tests` module under `#[cfg(test)]`), `v2/crates/wifi-densepose-sensing-server/tests/fixtures/replay_*.jsonl`, `scripts/generate-replay-fixtures.py`.

Closes the "2 000-packet fixed-replay test suite" item in CHECKLIST.

## Context

Up to now the amplitude classifier has been protected by per-function unit tests (CV calculation, NBVI selection, baseline drop trigger) but not by an end-to-end regression test that feeds a known-good stream through the full `amp_presence_override` pipeline and checks that the emitted labels are still correct. Without that, a refactor of NBVI selection or a threshold tweak could silently regress classifier behaviour on real deployments: the unit tests would all pass while the production output flipped.

Pace's ESPectre has a similar pattern: 1000 idle + 1000 motion frames, checked into the repo, replayed in CI on every PR.

## Decisions

### D1 — Fixture format: line-delimited JSON, `{node_id, amplitude[]}`

```jsonl
{"node_id":1,"amplitude":[28.842, 19.333, ...]}
{"node_id":2,"amplitude":[15.601, 17.220, ...]}
...
```

The schema is minimal: just the two fields the classifier reads. Frames are round-robined across nodes (500 per node × 2 nodes = 1000 frames per fixture file), and 1000 frames per file × 2 files = 2000 packets total.

### D2 — Fixtures live in-repo under `tests/fixtures/`

```
v2/crates/wifi-densepose-sensing-server/tests/fixtures/
  replay_idle.jsonl    (1000 lines)
  replay_motion.jsonl  (1000 lines)
```

The fixtures are co-located with the test that consumes them, and the test resolves their path via `env!("CARGO_MANIFEST_DIR")`. The fixture files are ~1.5 MB total (text JSON): small enough for the repo, but not so small that the test loses statistical power.

### D3 — Synthetic but parameter-matched to live data

The fixtures are generated by `scripts/generate-replay-fixtures.py` with two deterministic seeds (42 and 43).
Parameters are chosen to mirror the live deployment:

* Baseline mean amplitudes per node are taken from `data/baseline.json` (node 1: 27.04, node 2: 14.72).
* Idle: per-frame Gaussian noise with σ = 1.8 % of the per-subcarrier mean.
* Motion: ±40 % slow envelope (0.15 Hz sinusoid, 6.7 s cycle, longer than the classifier's 4.5 s `AMP_SHORT_WIN`) plus 5 % per-frame noise. This mimics a body slowly modulating the channel during walking.

This is deliberately *synthetic*. Capturing 1000 real frames of "empty room" requires the operator to step out and stay out for ~50 s, and capturing "motion" requires walking through the room; neither is something this session could do without manual operator labour. The synthetic-but-realistic alternative gives deterministic regression coverage today, with the option to swap in live captures (same JSONL schema, same filenames) when time allows.

### D4 — Test lives inside `main.rs` under `#[cfg(test)] mod replay_tests`

`amp_presence_override` is private to the binary crate, so the test can't sit in `tests/` (integration tests there can only reach the public API of a library target such as `lib.rs`). Putting it under `#[cfg(test)]` in `main.rs` keeps the helper visibility minimal and exercises the exact function path production uses.

### D5 — Test resets per-node history before each fixture run

`amp_presence_override` accumulates per-node state in `OnceLock`-backed statics. The test clears those between the idle and motion runs so each fixture starts with a fresh classifier (no cross-contamination from the previous fixture's frames sitting in the rolling window).

It also clears the per-subcarrier baseline (`amp_baseline_per_sub`), because the synthetic fixtures don't share a per-subcarrier profile with whatever real recording lives in `data/baseline.json`. Leaving the live per-sub baseline in place would make the drift channel saturate and obscure the CV-threshold path we're actually testing.

### D6 — F1 threshold: 0.85

Convention from Pace's ESPectre CI gate.
Current value on the synthetic fixtures with this deployment's baseline is `F1 = 1.000` (tp=822, fp=0, tn=822, fn=0; 178 warmup frames excluded per fixture, leaving 822 scored frames out of 1000). The 0.15 headroom gives room for legitimate classifier evolution without forcing a fixture re-record on every tuning change.

### D7 — Test loads the deployment baseline at startup

Without `data/baseline.json` loaded, the classifier compares raw CV against thresholds of 3.0 (300 %) and 6.0 (600 %), values no realistic signal reaches. The test discovers the baseline via a couple of canonical relative paths (`../../data/baseline.json` from the crate dir, etc.) and exits early with a clear `eprintln!` hint if none are found.

## Trade-offs

* **Synthetic fixtures don't catch sensor-specific bugs.** A Kconfig-level firmware (FW) regression that produced subtly different amplitude scaling would not be caught: the synthetic fixtures encode the *expected* scaling, not whatever the FW currently emits. The witness bundle (ADR-028) still covers that end of the pipeline.
* **`replay_2000` runs only when explicitly named or via the full suite.** No filtering hides it from CI. It runs in well under a second, so the cost is negligible.
* **F1 currently 1.0, too clean to detect subtle regressions.** A follow-up with live captures may bring the natural F1 to ~0.9, at which point the 0.85 threshold becomes a real gate. For now it's primarily a contract test: "the classifier still emits something reasonable on a known input".
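
The F1 gate from D6 and the headroom discussed in the last trade-off can be made concrete with a short worked example. The sketch below is in Python for brevity and is not the code in `main.rs`; it assumes the standard definition F1 = 2·tp / (2·tp + fp + fn), and the names `f1_score`, `passes_gate`, and `max_tolerated_misses` are hypothetical helpers introduced only for illustration.

```python
# Worked example of the D6 gate. Helper names are illustrative assumptions,
# not identifiers from the sensing-server crate.

F1_THRESHOLD = 0.85        # gate value from D6
FRAMES_PER_FIXTURE = 1000  # D1: 500 frames per node x 2 nodes
WARMUP_FRAMES = 178        # D6: excluded per fixture before scoring


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def passes_gate(tp: int, fp: int, fn: int,
                threshold: float = F1_THRESHOLD) -> bool:
    """True when the confusion counts clear the F1 threshold."""
    return f1_score(tp, fp, fn) >= threshold


def max_tolerated_misses(positives: int,
                         threshold: float = F1_THRESHOLD) -> int:
    """Largest number of missed motion frames (false negatives, with no
    false positives) that still passes the gate."""
    misses = 0
    while (misses < positives
           and f1_score(positives - (misses + 1), 0, misses + 1) >= threshold):
        misses += 1
    return misses


scored = FRAMES_PER_FIXTURE - WARMUP_FRAMES  # 822 scored frames per fixture

print(f1_score(tp=scored, fp=0, fn=0))  # 1.0, the value reported in D6
print(max_tolerated_misses(scored))     # 214
```

Under these assumptions the gate tolerates up to 214 missed motion frames out of 822 (with no false positives) before F1 drops below 0.85, which illustrates why the current threshold behaves as a contract test and not as a tight regression gate.
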
## Files Touched

```
scripts/generate-replay-fixtures.py                      (new)
v2/crates/wifi-densepose-sensing-server/tests/fixtures/
  replay_idle.jsonl                                      (new)
  replay_motion.jsonl                                    (new)
v2/crates/wifi-densepose-sensing-server/src/main.rs
  - replay_tests module (D4, D5, D7)
docs/adr/ADR-114-replay-regression-suite.md              (this)
```

## Verified Acceptance

```
$ cargo test --release -p wifi-densepose-sensing-server \
    --no-default-features --bin sensing-server replay_2000 -- --nocapture
replay_2000 F1=1.000 tp=822 fp=0 tn=822 fn=0
test replay_tests::replay_2000_packets_f1_above_threshold ... ok
test result: ok. 1 passed; 0 failed; 0 ignored;
```

Full workspace suite: 327 tests pass (was 326 + this one).

## References

* ADR-101: raw-amplitude classifier this test exercises.
* ADR-102: NBVI subcarrier selection that feeds CV calculation.
* ADR-103: persistent baseline that drives the universal-threshold normalization the test relies on.
* ADR-028: witness bundle (the other end-to-end regression mechanism; ADR-114 covers classifier code paths, ADR-028 covers the deterministic-CSI proof pipeline).
* Francesco Pace, *How I Turned My Wi-Fi Into a Motion Sensor — Part 2*, "Replay regression test": the upstream pattern.