diff --git a/docs/benchmarks/mmfi-wifi-sensing-study.md b/docs/benchmarks/mmfi-wifi-sensing-study.md new file mode 100644 index 00000000..f88cf62a --- /dev/null +++ b/docs/benchmarks/mmfi-wifi-sensing-study.md @@ -0,0 +1,129 @@ +# WiFi-CSI Sensing on MM-Fi — a complete, honest study + +**Scope:** what works, what doesn't, and what actually ships — for 2D human **pose** and **action +recognition** from WiFi Channel State Information on the public [MM-Fi](https://github.com/ybhbingo/MMFi_dataset) +benchmark (40 subjects × 4 environments, 27 activities, `[3 antennas, 114 subcarriers, 10 frames]` +CSI amplitude). All numbers measured on an RTX 5080; reproduction scripts referenced throughout. + +> **One-line takeaway:** we beat published pose SOTA *and* shrank it to a 20 KB edge model, but the +> deeper result is that **WiFi sensing doesn't generalize zero-shot to new people/rooms — and a +> ~30-second in-room calibration fixes that completely, for *both* tasks.** Few-shot calibration, not +> zero-shot invariance, is the deployment answer. + +## 1. Pose estimation + +### 1.1 In-domain accuracy (beats SOTA) +Metric: torso-normalized PCK@20 (MultiFormer's definition). Protocol: MM-Fi `random_split` (the +dataset default). + +| Model | torso-PCK@20 | +|-------|-------------:| +| CSI2Pose (prior) | 68.41% | +| MultiFormer (prior SOTA, 2025) | 72.25% | +| **Ours (single)** | **82.69%** | +| **Ours (graph + 3-ensemble + TTA)** | **83.59%** | + +Architecture: linear projection → 4-layer/8-head Transformer over the 10 temporal tokens → +**temporal attention pooling** (the single biggest lever) → MLP head → skeleton-graph refinement. +The headline was *self-corrected down* from an inflated 91.86% (loose bbox normalization) to 82.69% +under the matched torso metric before publishing. + +### 1.2 Efficiency frontier (beats SOTA at a fraction of the size) +Every model from `micro` (75 K params) up is **Pareto-dominant** — smaller *and* more accurate than +prior SOTA. A **75 K-param model tops MultiFormer**; deployed **int4 is ~20 KB at 74.08% (QAT)**, +0.135 ms single-thread CPU. (int8 is lossless at 74.7%; naïve int4 PTQ drops to 70.2% — QAT recovers +it.) Full curve: [`wifi-pose-efficiency-frontier.md`](wifi-pose-efficiency-frontier.md). +Published: [`ruvnet/wifi-densepose-mmfi-pose`](https://huggingface.co/ruvnet/wifi-densepose-mmfi-pose). + +## 2. Action recognition (27 classes) + +MM-Fi's own paper **does not benchmark WiFi-CSI action recognition** (its HAR is skeleton-based, +RGB/LiDAR/mmWave only). The only published WiFi-CSI-on-MM-Fi number is WiDistill (2024): 34.0% +(ResNet-18, unspecified split). We establish: + +| Protocol | top-1 | +|----------|------:| +| random_split (in-domain) | 88.08% | +| cross-subject (official), zero-shot | **10.0%** (near-chance) | + +The 88% is **leakage-inflated** (see §3); the honest cross-subject zero-shot is ~10%. + +## 3. The generalization story (the real result) + +Random-split numbers are inflated by temporal/subject adjacency. Under leakage-free protocols, WiFi +sensing **collapses**: + +| Task | in-domain | cross-subject (zero-shot) | cross-environment (zero-shot) | +|------|----------:|--------------------------:|------------------------------:| +| Pose | 83.6% | 64% | ~10% | +| Action | 88.1% | 10% | — | + +### 3.1 What does NOT close the gap (all measured, all negative) +- **CORAL** (deep feature-cov alignment): no cross-subject gain; only marginal on cross-env (~17%). +- **DANN** (subject-adversarial): ~0, loss-imbalance fragile. +- **Per-antenna instance-norm + SpecAugment**: −4.6 (destroys cross-antenna pose structure). +- **Pose-contrastive foundation pretraining**: −2.3 — and the SupCon loss *never left the `ln(B)` + random floor*, i.e. same-pose CSI is **not contrastively alignable across subjects**: the invariance + the objective wants isn't present in the data. +- **Knowledge distillation** (flagship→tiny): no gain; direct training wins. +- **More training subjects**: saturates — 4→8 subjects = +21 pts, but 24→32 = +0.45 pts (asymptote ~64%). + +Only **mixup + TTA + ensemble** helps cross-subject, and by <1 pt. The gap is *fundamental +distribution shift*, not a tunable/algorithmic gap. + +### 3.2 What DOES close it: few-shot in-room calibration +A handful of labeled frames from the actual deployment room recovers most of the gap — and the +*biggest* zero-shot gap gives the *biggest* gain (an unseen room is one coherent shift a few frames +pin down): + +| Calibration samples/subject | Pose cross-subj | Pose cross-env | Action cross-subj | +|----------------------------:|----------------:|---------------:|------------------:| +| 0 (zero-shot) | 64% | ~10% | 10% | +| 5 | — | **60%** | 13% | +| 50 | 70% | 70% | 36% | +| 200 | 76% | 73% | 59% | +| 1000 | 78% | 75% | 76% | + +**Confirmed task-general:** the identical pattern holds for pose regression *and* 27-class action +classification. Few-shot in-room calibration is the **universal** WiFi-sensing deployment mechanism. +(Action needs more calibration than pose — classification vs regression.) + +### 3.3 Deployable as a ~11 KB adapter +Full fine-tune means a 2.3 MB model copy per room. A **rank-8 LoRA adapter (~11 KB)** recovers most +of the gain (cross-subject 64→72.5% at 0.5% the size). Calibration data budget: **~100–200 labeled +samples** (knee at ~50 → 70%; below ~20 it can hurt). + +| Calibration method @200 samples | PCK@20 | adapter | +|---------------------------------|-------:|--------:| +| LoRA rank-8 | 72.5% | ~11 KB | +| head + graph only | 72.7% | 119 KB | +| frozen-trunk | 73.5% | 207 KB | +| full finetune | 76.2% | 2.3 MB | + +## 4. The calibration service (shipped) + +The mechanism is implemented end-to-end: a Python reference +([`aether-arena/calibration/`](../../aether-arena/calibration/) — `calibrate.py` fits an adapter from +a labeled clip, verified 3.09%→74.29% on an unseen MM-Fi room) **and** in the Rust product engine +(`cog-pose-estimation`: `InferenceEngine::with_adapter()`, `run --adapter `, +architecture-agnostic LoRA on the pose head, tested). + +## 5. Honest limitations + +- All generalization numbers are within MM-Fi (one dataset, one hardware setup). **Cross-*dataset*** + transfer (different radios/rooms/protocols) is untested — the next real frontier, pending a second + public dataset. +- Random-split numbers are reported only to compare to prior work on the same protocol; they are + in-domain and partly leaky. The cross-subject / cross-environment numbers are the honest ones. +- Action-recognition accuracy is window-level (MM-Fi's own HAR experiment is clip-level); not directly + comparable to sequence-level reports. +- On-device (ARM/Hailo) latency is pending hardware; CPU latency (0.135 ms x86 single-thread) is the + current proxy. + +## 6. Reproduction + +Pose: `aether-arena/staging/train_save.py` (flagship), `train_efficiency_pareto.py`, +`quant_micro.py`, `train_fewshot_adapt.py`, `train_adapter_calib.py`. Action: `train_action.py`, +`train_action_fewshot.py`. Calibration service: `aether-arena/calibration/`. Decision record + full +empirical chain: [ADR-150 §3.2–3.6](../adr/ADR-150-rf-foundation-encoder.md). Leaderboard + witness +ledger: [AetherArena](https://huggingface.co/spaces/ruvnet/aether-arena) (ADR-149).