# ADR-114 — 2000-Packet Replay Regression Suite
**Status**: Accepted

**Date**: 2026-05-17

**Scope**: `v2/crates/wifi-densepose-sensing-server/src/main.rs`
(`replay_tests` module under `#[cfg(test)]`),
`v2/crates/wifi-densepose-sensing-server/tests/fixtures/replay_*.jsonl`,
`scripts/generate-replay-fixtures.py`. Closes the "2 000-packet fixed-replay
test suite" item in CHECKLIST.
## Context
Until now the amplitude classifier has been protected by per-function
unit tests (CV calculation, NBVI selection, baseline drop trigger) but
not by an end-to-end regression test that feeds a known-good stream
through the full `amp_presence_override` pipeline and checks that the
emitted labels still match the expected ones.

Without that, a refactor of NBVI selection or a threshold tweak could
silently regress classifier behaviour on real deployments: the unit
tests would all pass while the production output flipped.

Pace's ESPectre has a similar pattern: 1000 idle + 1000 motion frames,
checked into the repo, replayed in CI on every PR.
## Decisions
### D1 — Fixture format: line-delimited JSON, `{node_id, amplitude[]}`
```jsonl
{"node_id":1,"amplitude":[28.842, 19.333, ...]}
{"node_id":2,"amplitude":[15.601, 17.220, ...]}
...
```
Minimal: just the two fields the classifier reads. Frames are
round-robined across nodes: 500 per node × 2 nodes = 1000 frames per
fixture file, and 1000 frames per file × 2 files = 2000 packets total.
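
The layout described above can be sketched in a few lines of Python, the language of the fixture generator. This is a minimal illustration, not the project's code. The helper names, the strict 1, 2, 1, 2 interleaving order, and the 3-element placeholder amplitude vectors are assumptions; only the two field names and the 500 × 2 frame count come from this ADR.

```python
import json

NODES = (1, 2)          # two nodes, as in D1
FRAMES_PER_NODE = 500   # 500 per node -> 1000 frames per fixture file


def round_robin_frames(amplitudes_by_node):
    """Yield frames alternating between nodes: 1, 2, 1, 2, ... (assumed order)."""
    for i in range(FRAMES_PER_NODE):
        for node_id in NODES:
            yield {"node_id": node_id, "amplitude": amplitudes_by_node[node_id][i]}


def to_jsonl(frames):
    """Serialize frames as line-delimited JSON, one object per line."""
    return "".join(json.dumps(f, separators=(",", ":")) + "\n" for f in frames)


def parse_jsonl(text):
    """Parse line-delimited JSON back into a list of frame dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Placeholder amplitude vectors (3 values each); the real subcarrier count
# is not stated in this ADR.
demo = {n: [[float(n), float(i), 0.5] for i in range(FRAMES_PER_NODE)] for n in NODES}
jsonl_text = to_jsonl(round_robin_frames(demo))
frames = parse_jsonl(jsonl_text)
print(len(frames), [f["node_id"] for f in frames[:4]])
```

A consumer that reads only `node_id` and `amplitude` should keep working if live captures later replace the synthetic files, since the schema stays the same.
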
### D2 — Fixtures live in-repo under `tests/fixtures/`
```
v2/crates/wifi-densepose-sensing-server/tests/fixtures/
  replay_idle.jsonl    (1000 lines)
  replay_motion.jsonl  (1000 lines)
```
Co-located with the test that consumes them. `cargo test` picks them up
via `env!("CARGO_MANIFEST_DIR")`. The fixture files are ~1.5 MB total
(text JSON) — small enough for the repo, not so small that the test
loses statistical power.
### D3 — Synthetic but parameter-matched to live data
The fixtures are generated by `scripts/generate-replay-fixtures.py` with
two deterministic seeds (42 and 43). Parameters chosen to mirror the
live deployment:
* Baseline mean amplitudes per node taken from `data/baseline.json`
  (node 1: 27.04, node 2: 14.72).
* Idle: per-frame Gaussian noise σ = 1.8 % of the per-subcarrier mean.
* Motion: ±40 % slow envelope (0.15 Hz sinusoid, 6.7 s cycle, longer
  than the classifier's 4.5 s `AMP_SHORT_WIN`) + 5 % per-frame noise.
  Mimics a body slowly modulating the channel during walking.

This is deliberately *synthetic*. Capturing 1000 real frames of
"empty room" requires the operator to step out and stay out for ~50 s,
and capturing "motion" requires walking through the room; neither is
something this session could do without manual operator labour. The
synthetic-but-realistic alternative gives deterministic regression
coverage today, with the option to swap in live captures (same JSONL
schema, same filenames) when time allows.
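
The generation recipe in this section can be sketched as follows. This is a hedged approximation, not the contents of `scripts/generate-replay-fixtures.py`. The noise and envelope parameters and the two seeds come from this ADR. The rest is assumed for illustration: the 20 Hz frame rate (inferred from 1000 frames in ~50 s), the 8-subcarrier flat profile, which seed maps to which file, and the function names.

```python
import math
import random
import statistics

# Values stated in this ADR.
BASELINE_MEAN = {1: 27.04, 2: 14.72}  # per-node mean amplitude
IDLE_SIGMA = 0.018                    # 1.8 % per-frame Gaussian noise
MOTION_SIGMA = 0.05                   # 5 % per-frame noise
MOTION_DEPTH = 0.40                   # +/-40 % slow envelope
MOTION_FREQ_HZ = 0.15                 # 6.7 s cycle

# Assumptions for illustration only.
FRAME_RATE_HZ = 20.0   # 1000 frames in ~50 s
N_SUBCARRIERS = 8      # placeholder; flat profile across subcarriers
FRAMES_PER_FILE = 1000


def generate(seed, motion):
    """Return 1000 round-robined frames for one fixture file."""
    rng = random.Random(seed)  # fixed seed -> identical output on every run
    sigma = MOTION_SIGMA if motion else IDLE_SIGMA
    frames = []
    for idx in range(FRAMES_PER_FILE):
        node_id = 1 + idx % 2          # alternate node 1, node 2
        t = idx / FRAME_RATE_HZ        # capture time in seconds
        envelope = 1.0
        if motion:
            envelope += MOTION_DEPTH * math.sin(2 * math.pi * MOTION_FREQ_HZ * t)
        mean = BASELINE_MEAN[node_id] * envelope
        amplitude = [round(mean * (1.0 + rng.gauss(0.0, sigma)), 3)
                     for _ in range(N_SUBCARRIERS)]
        frames.append({"node_id": node_id, "amplitude": amplitude})
    return frames


def coefficient_of_variation(frames, node_id, subcarrier=0):
    """std / mean of one subcarrier's amplitude over all frames of a node."""
    series = [f["amplitude"][subcarrier] for f in frames if f["node_id"] == node_id]
    return statistics.pstdev(series) / statistics.mean(series)


idle = generate(seed=42, motion=False)    # seed-to-file mapping is assumed
moving = generate(seed=43, motion=True)
print(round(coefficient_of_variation(idle, 1), 3),
      round(coefficient_of_variation(moving, 1), 3))
```

The raw coefficient of variation printed at the end only shows that the two regimes are well separated. It does not reproduce the classifier's NBVI selection, windowing, or baseline normalization.
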
### D4 — Test lives inside `main.rs` under `#[cfg(test)] mod replay_tests`
`amp_presence_override` is private to the binary crate, so the test
can't sit in `tests/` (which is for integration tests against
`lib.rs`). Putting it under `#[cfg(test)]` in `main.rs` keeps the
helper visibility minimal and exercises the exact function path
production uses.
### D5 — Test resets per-node history before each fixture run
`amp_presence_override` accumulates per-node state in
`OnceLock<Mutex<HashMap<…>>>` statics. The test clears those between
the idle and motion runs so each fixture starts with a fresh classifier
(no cross-contamination from the previous fixture's frames sitting in
the rolling window).

It also clears the per-subcarrier baseline (`amp_baseline_per_sub`)
because the synthetic fixtures don't share a per-subcarrier profile
with whatever real recording lives in `data/baseline.json` — leaving
the live per-sub baseline in place would make the drift channel
saturate and obscure the CV-threshold path we're actually testing.
### D6 — F1 threshold: 0.85
The threshold follows the convention of Pace's ESPectre CI gate. The
current value on the synthetic fixtures with this deployment's baseline
is `F1 = 1.000` (tp=822, fp=0, tn=822, fn=0; 178 warmup frames excluded
per fixture, leaving 822 scored frames each). The 0.15 headroom gives
room for legitimate classifier evolution without forcing the fixtures
to be regenerated on every tuning change.
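
The arithmetic behind the gate can be sketched as below. This is an illustration under assumptions, not the Rust test itself. It treats "motion" as the positive class, drops the first 178 frames of each fixture as warmup, and passes when F1 is at or above 0.85. The function names and the all-correct prediction lists are hypothetical.

```python
WARMUP_FRAMES = 178   # excluded per fixture (from this ADR)
F1_THRESHOLD = 0.85   # D6 gate


def score_fixture(predictions, expect_motion):
    """Return (correct, wrong) counts after dropping the warmup frames."""
    scored = predictions[WARMUP_FRAMES:]
    correct = sum(1 for p in scored if p == expect_motion)
    return correct, len(scored) - correct


def f1_score(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn); defined as 0.0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical perfect classifier output: one boolean "motion" label per frame.
idle_predictions = [False] * 1000
motion_predictions = [True] * 1000

tn, fp = score_fixture(idle_predictions, expect_motion=False)
tp, fn = score_fixture(motion_predictions, expect_motion=True)
f1 = f1_score(tp, fp, fn)
print(f"replay_2000 F1={f1:.3f} tp={tp} fp={fp} tn={tn} fn={fn}")
assert f1 >= F1_THRESHOLD
```

Under these assumptions, and with no false positives, the gate would still pass with up to 214 of the 822 scored motion frames missed (F1 ≈ 0.8503) and would fail at 215 (F1 ≈ 0.8495). That gives a concrete sense of how much drift the 0.15 headroom tolerates.
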
### D7 — Test loads the deployment baseline at startup
Without `data/baseline.json` loaded, the classifier compares raw CV
against thresholds of 3.0 (300 %) and 6.0 — values no realistic signal
reaches. The test discovers the baseline via a couple of canonical
relative paths (`../../data/baseline.json` from the crate dir, etc.)
and exits early with a clear `eprintln!` hint if none are found.
## Trade-offs
* **Synthetic fixtures don't catch sensor-specific bugs.** A
Kconfig-level FW regression that produced subtly different amplitude
scaling would not be caught — the synthetic fixtures encode the
*expected* scaling, not whatever the FW currently emits. The witness
bundle (ADR-028) still covers that end of the pipeline.
* **`replay_2000` runs only when explicitly named or via the full
suite.** No filtering hides it from CI. It runs in well under a
second so cost is negligible.
* **F1 is currently 1.0, too clean to detect subtle regressions.** A
follow-up with live captures may bring the natural F1 to ~0.9, at
which point the 0.85 threshold becomes a real gate. For now it's
primarily a contract test: "the classifier still emits something
reasonable on a known input".
## Files Touched
```
scripts/generate-replay-fixtures.py                      (new)
v2/crates/wifi-densepose-sensing-server/tests/fixtures/
  replay_idle.jsonl                                      (new)
  replay_motion.jsonl                                    (new)
v2/crates/wifi-densepose-sensing-server/src/main.rs
  - replay_tests module (D4, D5, D7)
docs/adr/ADR-114-replay-regression-suite.md              (this)
```
## Verified Acceptance
```
$ cargo test --release -p wifi-densepose-sensing-server \
    --no-default-features --bin sensing-server replay_2000 -- --nocapture
replay_2000 F1=1.000 tp=822 fp=0 tn=822 fn=0
test replay_tests::replay_2000_packets_f1_above_threshold ... ok
test result: ok. 1 passed; 0 failed; 0 ignored;
```
Full workspace suite: 327 tests pass (was 326 + this one).
## References
* ADR-101 — raw-amplitude classifier this test exercises.
* ADR-102 — NBVI subcarrier selection that feeds CV calculation.
* ADR-103 — persistent baseline that drives the universal-threshold
normalization the test relies on.
* ADR-028 — witness bundle (the other end-to-end regression
mechanism; ADR-114 covers classifier code paths, ADR-028 covers
the deterministic-CSI proof pipeline).
* Francesco Pace, *How I Turned My Wi-Fi Into a Motion Sensor —
Part 2*, "Replay regression test" — the upstream pattern.