ADR-114 — 2000-Packet Replay Regression Suite
Status: Accepted
Date: 2026-05-17
Scope: v2/crates/wifi-densepose-sensing-server/src/main.rs
(replay_tests module under #[cfg(test)]),
v2/crates/wifi-densepose-sensing-server/tests/fixtures/replay_*.jsonl,
scripts/generate-replay-fixtures.py.
Closes the "2 000-packet fixed-replay test suite" item in CHECKLIST.
Context
Up to now the amplitude classifier has been protected by per-function
unit tests (cv calculation, NBVI selection, baseline drop trigger) but
not by an end-to-end regression test that feeds a known-good stream
through the full amp_presence_override pipeline and checks that the
labels still look right.
Without that, a refactor of NBVI selection or a threshold tweak could silently regress classifier behaviour on real deployments — the unit tests would all pass while the production output flipped.
Pace's ESPectre has a similar pattern: 1000 idle + 1000 motion frames, checked into the repo, replayed in CI on every PR.
Decisions
D1 — Fixture format: line-delimited JSON, {node_id, amplitude[]}
{"node_id":1,"amplitude":[28.842, 19.333, ...]}
{"node_id":2,"amplitude":[15.601, 17.220, ...]}
...
Minimal: just the two fields the classifier reads. Round-robined across nodes (500 per node × 2 nodes = 1000 frames per fixture file). 1000 frames per file × 2 files = 2000 packets total.
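As a rough illustration of the layout described above, the following sketch writes and re-reads a fixture in this schema and checks the round-robin split. The amplitude values are placeholders and the helper names (write_fixture, read_fixture) are hypothetical; they are not taken from the repository.

```python
import json

NODES = (1, 2)            # node IDs as in the example lines above
FRAMES_PER_FILE = 1000    # 500 per node x 2 nodes


def write_fixture(frames):
    """Serialise frames as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(f, separators=(",", ":")) for f in frames) + "\n"


def read_fixture(text):
    """Parse JSONL back into (node_id, amplitude) tuples, skipping blank lines."""
    parsed = []
    for line in text.splitlines():
        if line.strip():
            rec = json.loads(line)
            parsed.append((rec["node_id"], rec["amplitude"]))
    return parsed


# Placeholder amplitudes: three made-up subcarriers per frame.
frames = [
    {"node_id": NODES[i % len(NODES)], "amplitude": [28.842, 19.333, 21.5]}
    for i in range(FRAMES_PER_FILE)
]
text = write_fixture(frames)
parsed = read_fixture(text)

# Count frames per node to confirm the 500 + 500 round-robin split.
per_node = {n: sum(1 for node_id, _ in parsed if node_id == n) for n in NODES}
```

Keeping one object per line means a fixture can be truncated, concatenated, or diffed with ordinary text tools, which is convenient if live captures later replace the synthetic files.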
D2 — Fixtures live in-repo under tests/fixtures/
v2/crates/wifi-densepose-sensing-server/tests/fixtures/
replay_idle.jsonl (1000 lines)
replay_motion.jsonl (1000 lines)
Co-located with the test that consumes them. cargo test picks them up
via env!("CARGO_MANIFEST_DIR"). The fixture files are ~1.5 MB total
(text JSON) — small enough for the repo, not so small that the test
loses statistical power.
D3 — Synthetic but parameter-matched to live data
The fixtures are generated by scripts/generate-replay-fixtures.py with
two deterministic seeds (42 and 43). Parameters chosen to mirror the
live deployment:
- Baseline mean amplitudes per node taken from data/baseline.json (node 1: 27.04, node 2: 14.72).
- Idle: per-frame Gaussian noise σ = 1.8 % of the per-subcarrier mean.
- Motion: ±40 % slow envelope (0.15 Hz sinusoid, 6.7 s cycle, longer than the classifier's 4.5 s AMP_SHORT_WIN) + 5 % per-frame noise. Mimics a body slowly modulating the channel during walking.
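The parameters above can be sketched as a small generator. This is not the contents of scripts/generate-replay-fixtures.py; it is a minimal reconstruction under several assumptions that the ADR does not state: a flat per-subcarrier profile (every subcarrier uses the node mean), 8 subcarriers, a 20 Hz per-node frame rate for the envelope phase, multiplicative noise, and seed 42 for idle with 43 for motion.

```python
import json
import math
import random

BASELINE_MEAN = {1: 27.04, 2: 14.72}  # per-node means quoted from data/baseline.json
N_SUBCARRIERS = 8       # assumption: the real subcarrier count is not given here
FRAME_RATE_HZ = 20.0    # assumption: per-node frame rate, used for envelope timing
FRAMES_PER_FILE = 1000
IDLE_SIGMA = 0.018      # 1.8 % per-frame Gaussian noise
MOTION_DEPTH = 0.40     # +/-40 % slow envelope
MOTION_FREQ_HZ = 0.15   # roughly a 6.7 s cycle
MOTION_SIGMA = 0.05     # 5 % per-frame noise


def generate(kind, seed):
    """Return FRAMES_PER_FILE JSONL lines for kind 'idle' or 'motion'."""
    rng = random.Random(seed)  # private RNG so output depends only on the seed
    lines = []
    for i in range(FRAMES_PER_FILE):
        node_id = 1 + (i % 2)              # round-robin across the two nodes
        t = (i // 2) / FRAME_RATE_HZ       # timestamp of this node's frame
        if kind == "motion":
            envelope = 1.0 + MOTION_DEPTH * math.sin(2 * math.pi * MOTION_FREQ_HZ * t)
            sigma = MOTION_SIGMA
        else:
            envelope = 1.0
            sigma = IDLE_SIGMA
        mean = BASELINE_MEAN[node_id]
        amplitude = [
            round(mean * envelope * (1.0 + rng.gauss(0.0, sigma)), 3)
            for _ in range(N_SUBCARRIERS)
        ]
        lines.append(json.dumps({"node_id": node_id, "amplitude": amplitude},
                                separators=(",", ":")))
    return lines


idle = generate("idle", seed=42)      # seed-to-file mapping is an assumption
motion = generate("motion", seed=43)
```

Using a dedicated random.Random instance per file, rather than the module-level RNG, is what makes regeneration reproducible: the same seed yields the same lines regardless of what else has drawn random numbers.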
This is deliberately synthetic. Capturing 1000 real frames of "empty room" requires the operator to step out and stay out for ~50 s, and capturing "motion" requires walking through the room; neither could be done in this session without manual operator labour. The synthetic-but-realistic alternative gives deterministic regression coverage today, with the option to swap in live captures (same JSONL schema, same filenames) when time allows.
D4 — Test lives inside main.rs under #[cfg(test)] mod replay_tests
amp_presence_override is private to the binary crate, so the test
can't sit in tests/ (which is for integration tests against
lib.rs). Putting it under #[cfg(test)] in main.rs keeps the
helper visibility minimal and exercises the exact function path
production uses.
D5 — Test resets per-node history before each fixture run
amp_presence_override accumulates per-node state in
OnceLock<Mutex<HashMap<…>>> statics. The test clears those between
the idle and motion runs so each fixture starts with a fresh classifier
(no cross-contamination from the previous fixture's frames sitting in
the rolling window).
It also clears the per-subcarrier baseline (amp_baseline_per_sub)
because the synthetic fixtures don't share a per-subcarrier profile
with whatever real recording lives in data/baseline.json — leaving
the live per-sub baseline in place would make the drift channel
saturate and obscure the CV-threshold path we're actually testing.
D6 — F1 threshold: 0.85
Convention from Pace's ESPectre CI gate. Current value on the synthetic
fixtures with this deployment's baseline is F1 = 1.000 (tp=822,
fp=0, tn=822, fn=0; 178 warmup frames excluded per fixture). The 0.15
headroom gives room for legitimate classifier evolution without
forcing a fixture re-record on every tuning change.
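To make the headroom concrete, the sketch below recomputes F1 from the reported confusion counts and then asks how many motion frames could be missed before the gate fails. It assumes motion is the positive class and that no false positives occur; the helper max_missed_motion is hypothetical and exists only for this illustration.

```python
F1_THRESHOLD = 0.85
FRAMES_PER_FIXTURE = 1000
WARMUP_FRAMES = 178  # excluded per fixture, as stated above


def f1_score(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def max_missed_motion(positives, threshold):
    """Largest fn (with fp = 0) for which F1 still meets the threshold."""
    fn = 0
    while fn < positives and f1_score(positives - (fn + 1), 0, fn + 1) >= threshold:
        fn += 1
    return fn


scored = FRAMES_PER_FIXTURE - WARMUP_FRAMES      # frames that count toward F1
current = f1_score(tp=822, fp=0, fn=0)           # counts reported for the current run
tolerated = max_missed_motion(scored, F1_THRESHOLD)
passes = current >= F1_THRESHOLD
```

Under these assumptions `scored` is 822 and `tolerated` comes out at 214, so roughly a quarter of the scored motion frames could be mislabelled before the 0.85 gate trips.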
D7 — Test loads the deployment baseline at startup
Without data/baseline.json loaded, the classifier compares raw CV
against thresholds of 3.0 (300 %) and 6.0 — values no realistic signal
reaches. The test discovers the baseline via a couple of canonical
relative paths (../../data/baseline.json from the crate dir, etc.)
and exits early with a clear eprintln! hint if none are found.
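A quick back-of-envelope check shows why the raw thresholds are out of reach. The sketch assumes CV means standard deviation divided by mean over a window of one subcarrier's amplitudes, and it uses the D3 fixture parameters with an assumed 20 Hz frame rate; the actual classifier's windowing, NBVI selection, and baseline normalization are not reproduced here.

```python
import math
import random
import statistics

RAW_THRESHOLDS = (3.0, 6.0)   # thresholds applied when no baseline is loaded
FRAME_RATE_HZ = 20.0          # assumption: not stated in this ADR
NODE_MEAN = 27.04             # node 1 baseline mean
N_SAMPLES = 400               # 20 s at 20 Hz = exactly 3 cycles of 0.15 Hz


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


rng = random.Random(42)

# Idle: 1.8 % Gaussian noise around the mean.
idle = [NODE_MEAN * (1.0 + rng.gauss(0.0, 0.018)) for _ in range(N_SAMPLES)]

# Motion: noiseless +/-40 % sinusoidal envelope, to isolate the envelope's contribution.
motion = [
    NODE_MEAN * (1.0 + 0.40 * math.sin(2 * math.pi * 0.15 * i / FRAME_RATE_HZ))
    for i in range(N_SAMPLES)
]

cv_idle = coefficient_of_variation(idle)
cv_motion = coefficient_of_variation(motion)
reaches_threshold = max(cv_idle, cv_motion) >= min(RAW_THRESHOLDS)
```

For a pure sinusoid of depth 0.40 the CV is 0.40/√2, about 0.28, and the idle CV sits near 0.018, so both are at least an order of magnitude below 3.0. That matches the observation that the test is only meaningful once the baseline is loaded.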
Trade-offs
- Synthetic fixtures don't catch sensor-specific bugs. A Kconfig-level FW regression that produced subtly different amplitude scaling would not be caught — the synthetic fixtures encode the expected scaling, not whatever the FW currently emits. The witness bundle (ADR-028) still covers that end of the pipeline.
- replay_2000 runs only when explicitly named or via the full suite. No filtering hides it from CI. It runs in well under a second so cost is negligible.
- F1 currently 1.0: too clean to detect subtle regressions. A followup with live captures may bring the natural F1 to ~0.9, at which point the 0.85 threshold becomes a real gate. For now it's primarily a contract test: "the classifier still emits something reasonable on a known input".
Files Touched
scripts/generate-replay-fixtures.py (new)
v2/crates/wifi-densepose-sensing-server/tests/fixtures/
replay_idle.jsonl (new)
replay_motion.jsonl (new)
v2/crates/wifi-densepose-sensing-server/src/main.rs
- replay_tests module (D4, D5, D7)
docs/adr/ADR-114-replay-regression-suite.md (this)
Verified Acceptance
$ cargo test --release -p wifi-densepose-sensing-server \
--no-default-features --bin sensing-server replay_2000 -- --nocapture
replay_2000 F1=1.000 tp=822 fp=0 tn=822 fn=0
test replay_tests::replay_2000_packets_f1_above_threshold ... ok
test result: ok. 1 passed; 0 failed; 0 ignored;
Full workspace suite: 327 tests pass (was 326 + this one).
References
- ADR-101 — raw-amplitude classifier this test exercises.
- ADR-102 — NBVI subcarrier selection that feeds CV calculation.
- ADR-103 — persistent baseline that drives the universal-threshold normalization the test relies on.
- ADR-028 — witness bundle (the other end-to-end regression mechanism; ADR-114 covers classifier code paths, ADR-028 covers the deterministic-CSI proof pipeline).
- Francesco Pace, How I Turned My Wi-Fi Into a Motion Sensor — Part 2, "Replay regression test" — the upstream pattern.