docs(adr-114): ADR for replay regression suite
Co-Authored-By: claude-flow <ruv@ruv.net>
parent 96225e27cf
commit c827cde69e

@@ -0,0 +1,162 @@
# ADR-114 — 2000-Packet Replay Regression Suite

**Status**: Accepted
**Date**: 2026-05-17
**Scope**: `v2/crates/wifi-densepose-sensing-server/src/main.rs`
(`replay_tests` module under `#[cfg(test)]`),
`v2/crates/wifi-densepose-sensing-server/tests/fixtures/replay_*.jsonl`,
`scripts/generate-replay-fixtures.py`. Closes the "2 000-packet
fixed-replay test suite" item in CHECKLIST.

## Context

Until now the amplitude classifier has been protected only by per-function
unit tests (CV calculation, NBVI selection, baseline drop trigger). No
end-to-end regression test fed a known-good stream through the full
`amp_presence_override` pipeline and checked that the emitted labels
still match the expected class.

Without such a test, a refactor of NBVI selection or a threshold tweak could
silently regress classifier behaviour on real deployments: every unit
test would pass while the production output flipped.

Pace's ESPectre follows a similar pattern: 1000 idle + 1000 motion frames,
checked into the repo and replayed in CI on every PR.

## Decisions

### D1 — Fixture format: line-delimited JSON, `{node_id, amplitude[]}`

```jsonl
{"node_id":1,"amplitude":[28.842, 19.333, ...]}
{"node_id":2,"amplitude":[15.601, 17.220, ...]}
...
```

Minimal: just the two fields the classifier reads. Frames are round-robined
across nodes (500 per node × 2 nodes = 1000 frames per fixture file), and
1000 frames per file × 2 files = 2000 packets total.

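To make the schema concrete, here is a minimal, dependency-free sketch of parsing one fixture line. The `Frame` struct, the `parse_frame` helper, and the `u8` type for `node_id` are illustrative assumptions rather than names from the crate, and the real test may well use a JSON library instead of this strict hand-rolled parser.

```rust
/// One fixture record: the two fields the classifier reads.
/// (Hypothetical type, not taken from the crate.)
#[derive(Debug, PartialEq)]
struct Frame {
    node_id: u8,
    amplitude: Vec<f64>,
}

/// Parse a line shaped like {"node_id":1,"amplitude":[28.842, 19.333]}.
/// Deliberately strict: it assumes exactly this key order and a non-empty
/// amplitude list, and returns None on anything else.
fn parse_frame(line: &str) -> Option<Frame> {
    let rest = line.trim().strip_prefix("{\"node_id\":")?;
    let (id_text, rest) = rest.split_once(',')?;
    let node_id = id_text.trim().parse::<u8>().ok()?;
    let list = rest
        .trim()
        .strip_prefix("\"amplitude\":[")?
        .strip_suffix("]}")?;
    // Every element must parse as f64, otherwise the whole line is rejected.
    let amplitude = list
        .split(',')
        .map(|value| value.trim().parse::<f64>().ok())
        .collect::<Option<Vec<f64>>>()?;
    Some(Frame { node_id, amplitude })
}
```

Rejecting malformed lines outright, rather than skipping bad values, keeps a corrupted fixture from quietly shrinking the sample the F1 score is computed on.
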
### D2 — Fixtures live in-repo under `tests/fixtures/`

```
v2/crates/wifi-densepose-sensing-server/tests/fixtures/
  replay_idle.jsonl    (1000 lines)
  replay_motion.jsonl  (1000 lines)
```

Co-located with the test that consumes them. The test resolves the fixture
directory via `env!("CARGO_MANIFEST_DIR")`, so `cargo test` finds the files
regardless of the working directory. The fixture files are ~1.5 MB total
(text JSON): small enough for the repo, yet large enough that the test
keeps its statistical power.

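The path lookup can be sketched as follows. The helper name `fixture_path` is hypothetical, and the manifest directory is passed in as a parameter so the snippet compiles outside Cargo as well.

```rust
use std::path::{Path, PathBuf};

/// Build the path to a fixture file under `<crate>/tests/fixtures/`.
/// `manifest_dir` is the crate root; inside the crate it would typically
/// come from `env!("CARGO_MANIFEST_DIR")`.
fn fixture_path(manifest_dir: &str, file_name: &str) -> PathBuf {
    Path::new(manifest_dir)
        .join("tests")
        .join("fixtures")
        .join(file_name)
}
```

Inside the crate a call would look like `fixture_path(env!("CARGO_MANIFEST_DIR"), "replay_idle.jsonl")`. Cargo sets that variable at compile time to the directory containing the crate's `Cargo.toml`.
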
### D3 — Synthetic but parameter-matched to live data

The fixtures are generated by `scripts/generate-replay-fixtures.py` with
two deterministic seeds (42 and 43). The parameters are chosen to mirror the
live deployment:

* Baseline mean amplitudes per node taken from `data/baseline.json`
  (node 1: 27.04, node 2: 14.72).
* Idle: per-frame Gaussian noise with σ = 1.8 % of the per-subcarrier mean.
* Motion: ±40 % slow envelope (0.15 Hz sinusoid, 6.7 s cycle, longer
  than the classifier's 4.5 s `AMP_SHORT_WIN`) + 5 % per-frame noise.
  Mimics a body slowly modulating the channel during walking.

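One plausible reading of the parameters above, written in Rust for consistency with the test code (the actual generator is the Python script, which is not reproduced here). The function names, and the assumption that the envelope and the noise both add to the mean as relative terms, are illustrative. The standard-normal draw `z` is supplied by the caller, so the sketch needs no random number generator.

```rust
use std::f64::consts::PI;

const IDLE_SIGMA: f64 = 0.018; // 1.8 % of the per-subcarrier mean
const MOTION_DEPTH: f64 = 0.40; // ±40 % slow envelope
const MOTION_FREQ_HZ: f64 = 0.15; // one cycle every ~6.7 s
const MOTION_SIGMA: f64 = 0.05; // 5 % per-frame noise
const AMP_SHORT_WIN_S: f64 = 4.5; // classifier window length quoted in the ADR

/// Idle sample: the subcarrier mean plus relative Gaussian noise.
/// `z` is one draw from a standard normal distribution.
fn idle_amplitude(mean: f64, z: f64) -> f64 {
    mean * (1.0 + IDLE_SIGMA * z)
}

/// Motion sample at time `t_s` seconds: slow sinusoidal envelope plus noise.
/// Assumes both terms are relative to the mean and simply add.
fn motion_amplitude(mean: f64, t_s: f64, z: f64) -> f64 {
    let envelope = MOTION_DEPTH * (2.0 * PI * MOTION_FREQ_HZ * t_s).sin();
    mean * (1.0 + envelope + MOTION_SIGMA * z)
}

/// Envelope period in seconds; it should exceed the short window.
fn envelope_period_s() -> f64 {
    1.0 / MOTION_FREQ_HZ
}
```

With the noise term at zero, the motion sample swings between 0.6× and 1.4× the mean over one period, and that period is longer than `AMP_SHORT_WIN_S`.
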
This is deliberately *synthetic*. Capturing 1000 real frames of
"empty room" requires the operator to step out and stay out for ~50 s,
and capturing "motion" requires walking through the room. Neither was
possible in this session without manual operator labour. The
synthetic-but-realistic alternative gives deterministic regression
coverage today, with the option to swap in live captures (same JSONL
schema, same filenames) when time allows.

### D4 — Test lives inside `main.rs` under `#[cfg(test)] mod replay_tests`

`amp_presence_override` is private to the binary crate, so the test
can't sit in `tests/`: integration tests there are compiled as separate
crates and can only reach the public API of `lib.rs`. Putting it under
`#[cfg(test)]` in `main.rs` keeps the helper visibility minimal and
exercises the exact function path production uses.

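The visibility argument can be illustrated with a small sketch. `presence_from_cv` is a placeholder standing in for the private classifier entry point. Its body is not the real logic, and only the module name `replay_tests` comes from this ADR.

```rust
// Private to the crate, like `amp_presence_override`.
// Placeholder body: the real classifier is far more involved.
fn presence_from_cv(cv: f64, threshold: f64) -> bool {
    cv > threshold
}

#[cfg(test)]
mod replay_tests {
    // A child module can see its parent's private items, so no `pub` is needed.
    use super::*;

    #[test]
    fn private_function_is_reachable() {
        assert!(presence_from_cv(0.2, 0.1));
        assert!(!presence_from_cv(0.05, 0.1));
    }
}
```

The module is compiled only under `cargo test`, so nothing test-related ends up in the release binary.
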
### D5 — Test resets per-node history before each fixture run

`amp_presence_override` accumulates per-node state in
`OnceLock<Mutex<HashMap<…>>>` statics. The test clears those between
the idle and motion runs so each fixture starts with a fresh classifier,
with no cross-contamination from the previous fixture's frames sitting in
the rolling window.

It also clears the per-subcarrier baseline (`amp_baseline_per_sub`)
because the synthetic fixtures don't share a per-subcarrier profile
with whatever real recording lives in `data/baseline.json`. Leaving
the live per-sub baseline in place would make the drift channel
saturate and obscure the CV-threshold path under test.

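A minimal sketch of the reset pattern, assuming a single static keyed by node id. `NODE_HISTORY`, `push_frame`, `reset_history`, and `frames_seen` are hypothetical names; the real statics and their value types are elided in this ADR.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::{Mutex, OnceLock};

// Hypothetical stand-in for one of the classifier's per-node statics.
static NODE_HISTORY: OnceLock<Mutex<HashMap<u8, VecDeque<f64>>>> = OnceLock::new();

/// Lazily initialise the map on first use and hand back the shared mutex.
fn node_history() -> &'static Mutex<HashMap<u8, VecDeque<f64>>> {
    NODE_HISTORY.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Append one frame's mean amplitude to a node's rolling window.
fn push_frame(node_id: u8, mean_amplitude: f64) {
    let mut map = node_history().lock().unwrap();
    map.entry(node_id).or_default().push_back(mean_amplitude);
}

/// Drop all per-node state so the next fixture starts from an empty window.
fn reset_history() {
    node_history().lock().unwrap().clear();
}

/// Number of frames currently held for a node (0 if the node is unknown).
fn frames_seen(node_id: u8) -> usize {
    node_history()
        .lock()
        .unwrap()
        .get(&node_id)
        .map_or(0, |window| window.len())
}
```

A test following this pattern would call `reset_history()` once before replaying `replay_idle.jsonl` and again before `replay_motion.jsonl`. Because the state is process-global, two tests touching it in parallel could interfere with each other, so keeping the whole replay in one test function avoids that.
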
### D6 — F1 threshold: 0.85

The threshold follows the convention of Pace's ESPectre CI gate. The current
value on the synthetic fixtures with this deployment's baseline is
`F1 = 1.000` (tp=822, fp=0, tn=822, fn=0; 178 warmup frames excluded per
fixture, leaving 1000 - 178 = 822 scored frames each). The 0.15
headroom leaves room for legitimate classifier evolution without
forcing a fixture re-record on every tuning change.

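For reference, the gate reduces to the standard F1 formula over the confusion counts. The sketch below assumes "motion" is the positive class. The `Confusion` struct and the `F1_THRESHOLD` constant are illustrative names, not taken from the crate.

```rust
/// Confusion counts accumulated over both fixtures (warmup frames excluded).
struct Confusion {
    tp: u32,
    fp: u32,
    tn: u32,
    fn_: u32, // trailing underscore because `fn` is a Rust keyword
}

impl Confusion {
    /// F1 = 2*tp / (2*tp + fp + fn). Returns 0.0 when the denominator is
    /// zero, i.e. there are no positives and no predicted positives.
    fn f1(&self) -> f64 {
        let denom = 2 * self.tp + self.fp + self.fn_;
        if denom == 0 {
            0.0
        } else {
            f64::from(2 * self.tp) / f64::from(denom)
        }
    }
}

/// CI gate from D6.
const F1_THRESHOLD: f64 = 0.85;
```

With the counts reported above (tp=822, fp=0, fn=0), the formula gives 1644 / 1644 = 1.0. True negatives do not enter F1 at all, so the 822 correctly labelled idle frames only matter through the absence of false positives.
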
### D7 — Test loads the deployment baseline at startup

Without `data/baseline.json` loaded, the classifier compares raw CV
against thresholds of 3.0 (300 %) and 6.0, values no realistic signal
reaches. The test looks for the baseline at a couple of canonical
relative paths (`../../data/baseline.json` from the crate dir, etc.)
and exits early with a clear `eprintln!` hint if none is found.

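The discovery step could look roughly like this. `find_baseline` and `baseline_or_hint` are hypothetical helpers, and only the one candidate path named above is listed because the remaining candidates are not spelled out in this ADR.

```rust
use std::path::{Path, PathBuf};

/// Return the first candidate (relative to the crate dir) that exists as a file.
fn find_baseline(crate_dir: &Path, candidates: &[&str]) -> Option<PathBuf> {
    candidates
        .iter()
        .map(|rel| crate_dir.join(rel))
        .find(|path| path.is_file())
}

/// Locate the baseline, or print a hint and return None so the caller
/// can bail out early instead of scoring against unnormalised thresholds.
fn baseline_or_hint(crate_dir: &Path) -> Option<PathBuf> {
    // Only this path is named in the ADR; further candidates are elided there.
    let candidates = ["../../data/baseline.json"];
    let found = find_baseline(crate_dir, &candidates);
    if found.is_none() {
        eprintln!(
            "replay test: no baseline.json found relative to {}",
            crate_dir.display()
        );
    }
    found
}
```

Returning `Option` leaves the early-exit decision to the caller, which keeps the lookup usable from both the test and any other tooling.
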
## Trade-offs

* **Synthetic fixtures don't catch sensor-specific bugs.** A
  Kconfig-level FW regression that produced subtly different amplitude
  scaling would go undetected, because the synthetic fixtures encode the
  *expected* scaling, not whatever the FW currently emits. The witness
  bundle (ADR-028) still covers that end of the pipeline.
* **`replay_2000` runs whenever it is named explicitly or the full
  suite runs.** No filter hides it from CI. It runs in well under a
  second, so the cost is negligible.
* **F1 is currently 1.0, too clean to detect subtle regressions.** A
  follow-up with live captures may bring the natural F1 down to ~0.9, at
  which point the 0.85 threshold becomes a real gate. For now the test is
  primarily a contract test: "the classifier still emits something
  reasonable on a known input".

## Files Touched

```
scripts/generate-replay-fixtures.py                    (new)
v2/crates/wifi-densepose-sensing-server/tests/fixtures/
  replay_idle.jsonl                                    (new)
  replay_motion.jsonl                                  (new)
v2/crates/wifi-densepose-sensing-server/src/main.rs
  - replay_tests module (D4, D5, D7)
docs/adr/ADR-114-replay-regression-suite.md            (this)
```

## Verified Acceptance

```
$ cargo test --release -p wifi-densepose-sensing-server \
    --no-default-features --bin sensing-server replay_2000 -- --nocapture
replay_2000 F1=1.000 tp=822 fp=0 tn=822 fn=0
test replay_tests::replay_2000_packets_f1_above_threshold ... ok
test result: ok. 1 passed; 0 failed; 0 ignored;
```

Full workspace suite: 327 tests pass (326 existing + this one).

## References

* ADR-101 — raw-amplitude classifier this test exercises.
* ADR-102 — NBVI subcarrier selection that feeds CV calculation.
* ADR-103 — persistent baseline that drives the universal-threshold
  normalization the test relies on.
* ADR-028 — witness bundle (the other end-to-end regression
  mechanism; ADR-114 covers classifier code paths, ADR-028 covers
  the deterministic-CSI proof pipeline).
* Francesco Pace, *How I Turned My Wi-Fi Into a Motion Sensor —
  Part 2*, "Replay regression test" — the upstream pattern.