ADR-114 — 2000-Packet Replay Regression Suite

Status: Accepted
Date: 2026-05-17
Scope: v2/crates/wifi-densepose-sensing-server/src/main.rs (replay_tests module under #[cfg(test)]), v2/crates/wifi-densepose-sensing-server/tests/fixtures/replay_*.jsonl, scripts/generate-replay-fixtures.py.
Closes the "2 000-packet fixed-replay test suite" item in CHECKLIST.

Context

Until now the amplitude classifier has been protected only by per-function unit tests (CV calculation, NBVI selection, baseline drop trigger). There has been no end-to-end regression test that feeds a known-good stream through the full amp_presence_override pipeline and checks that the emitted labels still match the expected ones.

Without that, a refactor of NBVI selection or a threshold tweak could silently regress classifier behaviour on real deployments — the unit tests would all pass while the production output flipped.

Pace's ESPectre has a similar pattern: 1000 idle + 1000 motion frames, checked into the repo, replayed in CI on every PR.

Decisions

D1 — Fixture format: line-delimited JSON, {node_id, amplitude[]}

{"node_id":1,"amplitude":[28.842, 19.333, ...]}
{"node_id":2,"amplitude":[15.601, 17.220, ...]}
...

Minimal: just the two fields the classifier reads. Frames alternate round-robin across nodes (500 per node × 2 nodes = 1000 frames per fixture file), and 1000 frames per file × 2 files = 2000 packets total.
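A minimal sketch of reading this format, useful for sanity-checking a fixture file outside of cargo test. The helper names (load_fixture, is_round_robin) and the four sample lines are hypothetical and exist only for illustration; the real fixtures carry full amplitude vectors and 1000 lines each.

```python
import json
from collections import Counter

def load_fixture(lines):
    """Parse JSONL fixture lines into (node_id, amplitudes) tuples."""
    frames = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate a trailing blank line
        rec = json.loads(raw)
        frames.append((int(rec["node_id"]), [float(a) for a in rec["amplitude"]]))
    return frames

def is_round_robin(frames, nodes):
    """True if node ids cycle through `nodes` in order, frame by frame."""
    return all(node == nodes[i % len(nodes)] for i, (node, _) in enumerate(frames))

# Hand-written sample with shortened amplitude vectors (illustrative values only).
sample = [
    '{"node_id":1,"amplitude":[28.842, 19.333]}',
    '{"node_id":2,"amplitude":[15.601, 17.220]}',
    '{"node_id":1,"amplitude":[28.101, 19.874]}',
    '{"node_id":2,"amplitude":[15.322, 17.009]}',
]

frames = load_fixture(sample)
per_node = Counter(node for node, _ in frames)
round_robin_ok = is_round_robin(frames, nodes=(1, 2))
```

On a real fixture file, passing an open file handle to load_fixture should give per_node counts of 500 for each of the two nodes if the file matches the layout described above.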

D2 — Fixtures live in-repo under tests/fixtures/

v2/crates/wifi-densepose-sensing-server/tests/fixtures/
  replay_idle.jsonl    (1000 lines)
  replay_motion.jsonl  (1000 lines)

Co-located with the test that consumes them. cargo test picks them up via env!("CARGO_MANIFEST_DIR"). The fixture files are ~1.5 MB total (text JSON) — small enough for the repo, not so small that the test loses statistical power.

D3 — Synthetic but parameter-matched to live data

The fixtures are generated by scripts/generate-replay-fixtures.py with two deterministic seeds (42 and 43). Parameters chosen to mirror the live deployment:

  • Baseline mean amplitudes per node taken from data/baseline.json (node 1: 27.04, node 2: 14.72).
  • Idle: per-frame Gaussian noise σ = 1.8 % of the per-subcarrier mean.
  • Motion: ±40 % slow envelope (0.15 Hz sinusoid, 6.7 s cycle, longer than the classifier's 4.5 s AMP_SHORT_WIN) + 5 % per-frame noise. Mimics a body slowly modulating the channel during walking.

This is deliberately synthetic. Capturing 1000 real frames of "empty room" requires the operator to step out and stay out for ~50 s, and capturing "motion" requires walking through the room. Neither could be done for this ADR without manual operator labour. The synthetic-but-realistic alternative gives deterministic regression coverage today, with the option to swap in live captures (same JSONL schema, same filenames) when time allows.
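The parameters in D3 can be sketched as a small generator. This is not the contents of scripts/generate-replay-fixtures.py; it is an illustration under several assumptions that the ADR does not state: a 20 Hz aggregate frame rate (inferred from "1000 frames in ~50 s"), a flat per-subcarrier profile equal to the node mean, 8 subcarriers, noise scaled to the unmodulated mean, and seed 42 for idle with seed 43 for motion. The names make_frames and coefficient_of_variation are hypothetical.

```python
import math
import random
import statistics

NODE_MEANS = {1: 27.04, 2: 14.72}   # baseline means quoted in D3
FRAME_RATE_HZ = 20.0                # assumption: 1000 frames in ~50 s
N_SUBCARRIERS = 8                   # hypothetical; real count not stated here
ENVELOPE_HZ = 0.15                  # slow motion envelope, ~6.7 s cycle
ENVELOPE_DEPTH = 0.40               # +/- 40 %
IDLE_SIGMA = 0.018                  # 1.8 % of the mean
MOTION_SIGMA = 0.05                 # 5 % per-frame noise

def make_frames(kind, seed, n_frames=1000):
    """Generate round-robin frames for kind in {'idle', 'motion'}."""
    rng = random.Random(seed)       # private RNG keeps the output deterministic
    nodes = sorted(NODE_MEANS)
    frames = []
    for i in range(n_frames):
        node = nodes[i % len(nodes)]
        mean = NODE_MEANS[node]
        if kind == "motion":
            t = i / FRAME_RATE_HZ
            envelope = 1.0 + ENVELOPE_DEPTH * math.sin(2 * math.pi * ENVELOPE_HZ * t)
            sigma = MOTION_SIGMA
        else:
            envelope = 1.0
            sigma = IDLE_SIGMA
        amplitude = [
            round(max(0.0, mean * envelope + rng.gauss(0.0, sigma * mean)), 3)
            for _ in range(N_SUBCARRIERS)
        ]
        frames.append({"node_id": node, "amplitude": amplitude})
    return frames

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)

idle = make_frames("idle", seed=42)
motion = make_frames("motion", seed=43)

# CV of one subcarrier on node 1 over the whole fixture.
idle_cv = coefficient_of_variation([f["amplitude"][0] for f in idle if f["node_id"] == 1])
motion_cv = coefficient_of_variation([f["amplitude"][0] for f in motion if f["node_id"] == 1])
```

Under these assumptions the idle CV stays close to the 1.8 % noise level while the motion CV is dominated by the 40 % envelope, which is the separation the classifier's CV thresholds rely on. Each frame can be written out with json.dumps, one per line, to match the D1 schema.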

D4 — Test lives inside main.rs under #[cfg(test)] mod replay_tests

amp_presence_override is private to the binary crate, so the test can't sit in tests/ (which is for integration tests against lib.rs). Putting it under #[cfg(test)] in main.rs keeps the helper visibility minimal and exercises the exact function path production uses.

D5 — Test resets per-node history before each fixture run

amp_presence_override accumulates per-node state in OnceLock<Mutex<HashMap<…>>> statics. The test clears those between the idle and motion runs so each fixture starts with a fresh classifier (no cross-contamination from the previous fixture's frames sitting in the rolling window).

It also clears the per-subcarrier baseline (amp_baseline_per_sub) because the synthetic fixtures don't share a per-subcarrier profile with whatever real recording lives in data/baseline.json — leaving the live per-sub baseline in place would make the drift channel saturate and obscure the CV-threshold path we're actually testing.

D6 — F1 threshold: 0.85

The threshold follows the convention of Pace's ESPectre CI gate. The current value on the synthetic fixtures with this deployment's baseline is F1 = 1.000 (tp=822, fp=0, tn=822, fn=0; 178 warmup frames excluded per fixture). The 0.15 headroom leaves room for legitimate classifier evolution without forcing a fixture re-record on every tuning change.
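The arithmetic behind these numbers, as a short sketch. The F1 formula is the standard one; the function name and the "equal false positives and false negatives" scenario used to size the headroom are illustrative assumptions, not part of the test.

```python
def f1_score(tp, fp, fn):
    """Standard F1: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

FRAMES_PER_FIXTURE = 1000
WARMUP_FRAMES = 178
THRESHOLD = 0.85

# Frames actually scored per fixture after the warmup is excluded.
scored = FRAMES_PER_FIXTURE - WARMUP_FRAMES

# Current result reported in the ADR: no misclassifications.
current_f1 = f1_score(tp=822, fp=0, fn=0)

# Illustrative headroom: if k motion frames were missed and k idle frames
# were flagged, how large can k get before the gate fails?
max_equal_errors = max(
    k for k in range(scored + 1)
    if f1_score(tp=scored - k, fp=k, fn=k) >= THRESHOLD
)
```

In this symmetric scenario the gate tolerates 123 errors of each kind out of 822 scored frames per fixture before F1 drops below 0.85.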

D7 — Test loads the deployment baseline at startup

Without data/baseline.json loaded, the classifier compares raw CV against thresholds of 3.0 (300 %) and 6.0 — values no realistic signal reaches. The test discovers the baseline via a couple of canonical relative paths (../../data/baseline.json from the crate dir, etc.) and exits early with a clear eprintln! hint if none are found.

Trade-offs

  • Synthetic fixtures don't catch sensor-specific bugs. A Kconfig-level FW regression that produced subtly different amplitude scaling would not be caught — the synthetic fixtures encode the expected scaling, not whatever the FW currently emits. The witness bundle (ADR-028) still covers that end of the pipeline.
  • replay_2000 runs as part of the full suite as well as when named explicitly; nothing filters it out of CI. It runs in well under a second, so the cost is negligible.
  • F1 is currently 1.0, which is too clean to detect subtle regressions. A follow-up with live captures may bring the natural F1 to ~0.9, at which point the 0.85 threshold becomes a real gate. For now it is primarily a contract test: "the classifier still emits something reasonable on a known input".

Files Touched

scripts/generate-replay-fixtures.py                            (new)
v2/crates/wifi-densepose-sensing-server/tests/fixtures/
  replay_idle.jsonl                                            (new)
  replay_motion.jsonl                                          (new)
v2/crates/wifi-densepose-sensing-server/src/main.rs
  - replay_tests module (D4, D5, D7)
docs/adr/ADR-114-replay-regression-suite.md                    (this)

Verified Acceptance

$ cargo test --release -p wifi-densepose-sensing-server \
    --no-default-features --bin sensing-server replay_2000 -- --nocapture
replay_2000 F1=1.000  tp=822 fp=0 tn=822 fn=0
test replay_tests::replay_2000_packets_f1_above_threshold ... ok
test result: ok. 1 passed; 0 failed; 0 ignored;

Full workspace suite: 327 tests pass (was 326 + this one).

References

  • ADR-101 — raw-amplitude classifier this test exercises.
  • ADR-102 — NBVI subcarrier selection that feeds CV calculation.
  • ADR-103 — persistent baseline that drives the universal-threshold normalization the test relies on.
  • ADR-028 — witness bundle (the other end-to-end regression mechanism; ADR-114 covers classifier code paths, ADR-028 covers the deterministic-CSI proof pipeline).
  • Francesco Pace, How I Turned My Wi-Fi Into a Motion Sensor — Part 2, "Replay regression test" — the upstream pattern.