21 KiB

Raw Blame History

SOTA Evidence Brief — `wifi-densepose-nn` / `wifi-densepose-train` Benchmark ADR Seed

Field	Value
Date	2026-06-14
Author	deep-research (Opus)
Purpose	Seed a future benchmark/optimization ADR for the NN-inference (`wifi-densepose-nn`) and training (`wifi-densepose-train`) crates
Scope	The DELTA beyond what ADR-152 / ADR-150 / ADR-015 already establish — current published WiFi-CSI pose SOTA, winning architectures, edge-quantization SOTA, and a defensible benchmark-suite design
Ethos	Every claim graded PEER-REVIEWED / PREPRINT / VENDOR-CLAIM / BLOG, with MEASURED-on-public-benchmark distinguished from marketing. Numbers that could not be verified are flagged. No fabricated citations.

Citation discipline carried in from ADR-152 §2.2: preprint accuracy numbers are CLAIMED until reproduced on our hardware. The project has already retracted its own "92.9% PCK@20" and "shipped-WiFlow-STD 97.25%" figures after measurement; this brief inherits that bar.

1. Executive summary

Where the project stands vs the 2026 frontier. The repo is, by the evidence already in-tree, ahead of most academic groups on benchmark hygiene and roughly at parity on capability — but the two are measured on incompatible yardsticks, which is the single biggest risk to any "beyond-SOTA" claim.

The project's headline reproductions (benchmarks/wiflow-std/RESULTS.md) are MEASURED and rigorous: WiFlow-STD retrained to 96.09–96.61% PCK@20 on the authors' own 360k-window 2D dataset (RTX 5080), shipped checkpoint REFUTED, dataset/code defects documented. This is a genuinely strong, reproducible result.
But that number is not on a standard public benchmark. WiFlow-STD's dataset is self-collected (5 subjects, 15 keypoints, 2D, in-domain random split, hardware unspecified). The academic frontier on the standard public 3D benchmark (MM-Fi) reports PCK@20 ≈ 61% / MPJPE ≈ 161 mm random-split (GraphPose-Fi, Nov 2025) — a harder metric (3D, mm-scale, standard PCK normalization). The project's own AetherArena MM-Fi number (81.63% torso-PCK@20 in-domain, ADR-150) uses a torso-normalized PCK that is looser than GraphPose-Fi's standard PCK, so the three numbers (96% / 81.6% / 61%) cannot be lined up without a unified harness. Making them comparable IS the highest-value work item.
The deployment frontier — cross-subject / cross-environment generalization — is where everyone collapses, the project included (ADR-150: 81.63% in-domain → ~11.6% leakage-free cross-subject). GraphPose-Fi independently confirms the cliff (61.1% random → 12.9% cross-environment PCK@20). This is the real research target, not in-domain PCK.

Top 3 highest-value optimization/benchmark targets:

A unified, metric-locked accuracy harness in wifi-densepose-train that scores any model under one explicit PCK definition (normalization, keypoint convention, split) so WiFlow-STD-repro, AetherArena/MM-Fi, and GraphPose-Fi numbers become directly comparable. Without this, no "beyond-SOTA" claim survives the "prove it" bar — the project has already been burned twice by metric ambiguity (the retracted 92.9% used absolute, not torso-normalized, PCK).
A QAT path for the WiFlow-STD-class edge model. The in-tree edge work (RESULTS.md) has fully characterized PTQ (static QDQ conv-only is the int8 sweet spot; dynamic int8 is a no-op on this all-conv architecture) and found the half model (843k params) strictly dominates the published 2.23M and tiny (56k, 295 KB ONNX fp32) holds 94.1% PCK@20. The one untested lever is quantization-aware training, which the general literature says recovers most of the PTQ accuracy gap. That is the next defensible edge win.
Criterion-backed regression benches wired into CI for the real Candle/ONNX forward path. The benches exist (wifi-densepose-nn/benches/{inference,onnx,native_conv}_bench.rs, wifi-densepose-train/benches/training_bench.rs) and benchmarks/edge-latency/RESULTS.md shows the methodology is sound (host≠ESP32 caveat made explicit). The gap is turning point-in-time captures into committed regression baselines.

2. Findings per research question

RQ1 — Latest WiFi-CSI pose SOTA (2024–2026): published PCK@20 / MPJPE on the standard public benchmarks

The crucial framing: "WiFi pose SOTA" splits into two non-comparable tracks — 3D pose on MM-Fi/Person-in-WiFi-3D (mm-scale MPJPE, standard PCK) vs 2D pose on self-collected sets (image-normalized PCK). The project's flagship reproduction lives in the second track; the academic frontier lives in the first.

Method	Venue / Year	Benchmark + split	PCK@20	MPJPE	Grade
GraphPose-Fi (arXiv 2511.19105)	PREPRINT, Nov 2025	MM-Fi P1, random split	61.1%	160.6 mm (PA-MPJPE 105.0)	numbers MEASURED-in-study (preprint); beats MetaFi++, HPE-Li, DT-Pose
GraphPose-Fi	same	MM-Fi P1, cross-subject	44.2%	210.5 mm	same
GraphPose-Fi	same	MM-Fi P1, cross-environment	12.9%	302.7 mm	same — the generalization cliff
DT-Pose (arXiv 2501.09411)	PREPRINT (ICLR'25 OpenReview aPnLQ6WfQQ), Jan 2025; code cseeyangchen/DT-Pose	MM-Fi (domain-gap + topology focus)	not cleanly extractable from abstract	reports MPJPE; self-supervised masked pretrain + topology decode	numbers NOT verified at exact-table level here — flagged
Person-in-WiFi-3D (CVPR 2024, openaccess)	PEER-REVIEWED, CVPR 2024	own 97k-frame multi-person set	— (multi-person, not single-PCK)	91.7 mm (1p) / 108.1 (2p) / 125.3 (3p) 3D joint error	MEASURED (peer-reviewed); own dataset, not MM-Fi
WiFlow-STD (arXiv 2602.08661, DY2434 repo)	PREPRINT, Apr 2026	self-collected, 5-subj, 2D, in-domain random	97.25% (claimed)	0.007 m (image-norm)	claimed CLAIMED; project reproduced 96.09–96.61% (MEASURED, RTX 5080) after repairing dataset/code
PerceptAlign (arXiv 2601.12252)	PREPRINT + MobiCom'26 acceptance	own 7-layout cross-domain 3D set	—	222.4 mm (Scene4) / 317.1 (Scene5), claims −54% cross-env vs SOTA	CLAIMED (preprint); failure mode corroborated
Project AetherArena (ADR-150, issue #876)	internal	MM-Fi, random split, torso-PCK	81.63% torso-PCK@20	—	MEASURED-internal; torso-PCK ≠ GraphPose-Fi standard PCK
Project WiFlow-STD repro (`benchmarks/wiflow-std/RESULTS.md`)	internal	their data, their split	96.09–96.61%	0.0094–0.0098 m	MEASURED-internal (RTX 5080)

How the project's ~96% compares to the frontier: It is not directly comparable. The 96% is on an easier task (2D, in-domain, image-normalized PCK, single-environment, 5 subjects) than GraphPose-Fi's 61.1% (3D, standard PCK, mm-scale). The project's own MM-Fi-track number (81.63% torso-PCK@20) appears to beat GraphPose-Fi's 61.1%, but only because torso-PCK is a looser normalization — the project explicitly flags this (ADR-150 cites beating "MultiFormer's 72.25%" under the same torso metric, not GraphPose-Fi's). The honest statement: the project is competitive on in-domain MM-Fi under its own torso metric, and collapses cross-subject exactly as the published frontier does. No public number lets the project claim "beyond-SOTA" today.

RQ2 — What's winning architecturally now (2025–2026)

The clear trend across the verified 2025–2026 papers:

Graph / skeleton-aware decoders are the current academic SOTA on MM-Fi. GraphPose-Fi (PREPRINT, Nov 2025) wins by injecting anatomical graph structure into the decoder — exactly the GraphPose-Fi-style skeleton-aware graph head ADR-150 §2.2 already names as the planned decoder. The project's architecture direction matches the frontier.
Self-supervised masked pretraining (MAE) is the cross-domain lever, not capacity. UNSW MAE study (arXiv 2511.18792, PREPRINT, Nov 2025): cross-domain gains scale log-linearly with pretraining data, unsaturated at 1.3M samples; ViT-Base adds only 0.4–0.9% over ViT-Small. Recipe: 80% masking, (30,3) small patches. DT-Pose (arXiv 2501.09411) independently uses masked pretraining + topology constraints for the domain gap. Caveat (MEASURED in ADR-152 §2.3): UNSW's downstream tasks are classification, not pose — pose transfer remains a hypothesis. The project's own measurement (b) found WiFlow-STD pretrained features give optimization transfer but NOT feature transfer to ESP32 CSI.
Spatio-temporal decoupling is the efficiency lever. WiFlow-STD's whole contribution is decoupling spatial and temporal CSI processing to hit 2.23M params. The project verified the params/FLOPs (MEASURED) and then beat it: the half-model (843k) matches accuracy with 0.38× params (RESULTS.md efficiency sweep).
Geometry/layout conditioning is the cross-layout lever. PerceptAlign (MobiCom'26): fusing transceiver-position embeddings + two-checkerboard calibration, claimed −60% cross-domain. ADR-152 §2.1 already adopted this (NodeGeometry, geometry embeddings).
NOT winning / absent: diffusion models for CSI pose did not surface in the verified frontier. Full DensePose-UV regression from commodity WiFi remains undemonstrated (ADR-152 F5, MEASURED by full-text screening). No 2025–2026 paper was found that beats the project's current direction — the project is tracking, not trailing, the architecture frontier.

Verdict RQ2: the winning stack (MAE pretrain → graph/skeleton decoder → geometry conditioning, ViT-Small-class capacity) is already the planned ADR-150/152 stack. The gain available is not a new architecture; it's (a) more heterogeneous pretraining data and (b) honest cross-domain measurement.

RQ3 — Edge/quantized inference SOTA for small CSI pose models

The in-tree edge work (benchmarks/wiflow-std/RESULTS.md "Edge optimization" + "Static PTQ" + "Efficiency sweep") is already at or beyond what the public literature offers for this specific model class, and is MEASURED. Key findings to carry forward:

Dynamic INT8 is a trap on all-conv CSI models. WiFlow-STD has zero nn.Linear layers (21 Conv1d + 22 Conv2d + BatchNorm). torch.quantize_dynamic quantizes 0% of params (dynamic int8 has no conv kernels). MEASURED.
Static QDQ conv-only PTQ is the int8 sweet spot. PCK@20 96.60–96.63% (vs fp32 96.68%, dynamic 96.52%), 2.53 MB. All-ops QDQ is strictly worse (−1.4 pt). MEASURED.
ONNX Runtime fp32 is the real CPU latency win: 3.2 ms/window batch-1 vs torch 11.0 ms (~3.4×) at parity (2.4e-7). int8 is ~2× slower than ONNX fp32 at batch-1 (ConvInteger kernels). MEASURED.
Smaller-than-published dominates. half (843k) ≥ full on accuracy; tiny (56k, 295 KB ONNX fp32, 0.66 ms/win, 94.1% PCK@20) is the smallest deployable artifact. At tiny scale int8 is a bad trade (−1.43 pt for −47 KB). MEASURED.
General QAT-vs-PTQ context (BLOG/VENDOR): NVIDIA TensorRT QAT blog, Ultralytics QAT glossary, ONNX Runtime quantization docs: QAT "almost always" recovers accuracy PTQ loses on sensitive models; ONNX Runtime does NOT retrain (QAT must happen in PyTorch, then export QDQ). The Onboard Optimization survey, arXiv 2505.08793 (PREPRINT) covers on-device optimization broadly. These are general claims, not CSI-pose-specific — grade accordingly.
Hailo / Pi target (CLAUDE.local.md): the 4× Pi+Hailo cluster (Hailo-8 @ 26 TOPS / Hailo-10 @ 40 TOPS) needs a HEF compile path, which is its own toolchain (not ONNX/Candle). No in-tree HEF benchmark exists yet — this is a genuine gap for the edge-inference claim.

Actionable for an inference-speed benchmark: the honest comparand set is {torch fp32, ONNX fp32, ONNX static-QDQ-conv-only int8, candle fp32} × {full, half, tiny} on a fixed host, with the host≠ESP32 / host≠Hailo caveat stated up front (the edge-latency/RESULTS.md template already does this correctly). The one new datapoint worth producing: QAT-int8 on the half model to test whether QAT closes the PTQ −0.16 pt gap and keeps the size win.

RQ4 — Rigorous, reproducible benchmark methodology

The repo already demonstrates the right methodology in three places — the ADR should codify it, not invent it:

benchmarks/wiflow-std/RESULTS.md — the gold standard already in-tree: pinned upstream commit, seed-42 file-level split documented, corruption masks committed as ground truth, every forced deviation recorded, mean-pose honesty baseline, MEASURED-vs-CLAIMED grading.
benchmarks/edge-latency/RESULTS.md — criterion 0.5, explicit host machine, low/median/high brackets, contention caveat, host≠ESP32 separation, steady-state-vs-cold-start distinction.
Rust micro-bench: criterion benches already exist in both crates (wifi-densepose-nn/benches/, wifi-densepose-train/benches/).

What a credible "beyond-SOTA" claim requires (the bar that survives "prove it"):

One locked accuracy definition — PCK normalization (torso vs absolute vs bbox), keypoint convention (15 vs 17 COCO), and split (random / cross-subject / cross-environment) declared before the run. The retracted 92.9% died exactly because PCK normalization was unstated.
A mean-pose / constant-output honesty baseline on every split (already done in measurement (b) — a single-subject near-static set scored 95.9% torso-PCK@20 with a constant pose). Any claim must beat this.
MEASURED-vs-CLAIMED grading per number, with the exact command and raw-JSON path committed.
Cross-domain, not just in-domain. In-domain PCK is saturated and uninformative; the defensible claim is on cross-subject/cross-environment, where the frontier is 12–44% PCK@20.

3. Proposed benchmark-suite design

A two-part suite (wifi-densepose-train accuracy harness + wifi-densepose-nn latency harness), both committing raw JSON + a graded RESULTS.md.

3.1 Accuracy harness (`wifi-densepose-train`)

Metric module with one canonical PCK (parameterized: {torso, bbox, absolute} normalization × threshold × keypoint-map), so a single function scores WiFlow-STD-repro, MM-Fi/AetherArena, and a GraphPose-Fi re-run identically. Lock the default to torso-PCK@20 on 17-kp COCO and always also print standard-PCK to expose the gap.
Fixed datasets/splits: (i) WiFlow-STD cleaned 360k (their split, for repro parity), (ii) MM-Fi P1 random + cross-subject + cross-environment (to line up against GraphPose-Fi 61.1/44.2/12.9 and the project's 81.63), (iii) ESP32 paired eval set when ≥2k multi-subject windows exist.
Mandatory honesty baselines emitted every run: mean-pose, constant-output, and (for cross-domain) source-only.
Output: raw JSON + a RESULTS.md table with MEASURED/CLAIMED grades, mirroring benchmarks/wiflow-std/RESULTS.md.

3.2 Latency/size harness (`wifi-densepose-nn`)

Matrix: {torch fp32 (ref), ONNX fp32, ONNX static-QDQ-conv-only int8, candle fp32} × {full 2.23M, half 843k, tiny 56k} × {batch 1, 64}, criterion-timed, host declared.
Report: disk size, batch-1 + batch-64 ms/window (median + low/high), and PCK@20 on the locked 10k-window subset, so latency and accuracy never get cited apart.
Caveat block up front: host ≠ ESP32-S3/WASM3, host ≠ Hailo HEF. No host number is presented as the edge number.
CI gate: commit the current medians as regression baselines; fail PRs that regress latency >X% or accuracy >Y pt.

3.3 What counts as a defensible "beyond-SOTA" result

A claim is citable only if all hold: (1) scored under a pre-declared metric/split, (2) beats the relevant published frontier number on the same metric definition (e.g. >61.1% standard-PCK@20 on MM-Fi random, or >12.9% on cross-environment), (3) beats the mean-pose honesty baseline, (4) raw JSON + exact command committed, (5) graded MEASURED. The single most valuable "beyond-SOTA" target is cross-environment MM-Fi, where the published bar (12.9% PCK@20) is low enough that a real win is both achievable and unambiguous.

4. Gap table

Capability	Project current (graded)	Published SOTA (graded)	Proposed target	Data / hardware needed
In-domain 2D PCK@20 (self-collected)	96.09–96.61% (MEASURED, RTX 5080, WiFlow-STD repro)	97.25% claimed (WiFlow-STD, CLAIMED)	match within noise + own architecture	cleaned 360k dataset (have); already met
In-domain MM-Fi PCK@20 (torso-norm)	81.63% torso-PCK (MEASURED-internal)	GraphPose-Fi 61.1% standard-PCK (PREPRINT) — not comparable	re-score both under one PCK def	MM-Fi P1 (have); unified metric harness (gap)
Cross-subject MM-Fi PCK@20	~11.6% torso (MEASURED, the cliff)	GraphPose-Fi 44.2% standard (PREPRINT)	close gap via MAE pretrain + graph decoder	1.3M heterogeneous CSI corpus (ADR-150/152 §2.3), ViT-Small encoder
Cross-environment MM-Fi PCK@20	untested-internal	GraphPose-Fi 12.9% standard (PREPRINT)	beat 12.9% → cleanest beyond-SOTA win	MM-Fi cross-env split + geometry conditioning (ADR-152 §2.1)
ESP32 CSI→pose (17-kp)	no run beats mean-pose baseline (MEASURED, measurement b)	n/a (no public ESP32 pose benchmark)	beat mean-pose on temporal split	≥2k multi-subject/multi-position paired windows (gap)
Edge int8 size/accuracy	static QDQ conv-only 96.61% @ 2.53 MB; tiny 94.1% @ 295 KB fp32 (MEASURED)	no model-matched public number	QAT-int8 on half model (untested lever)	PyTorch QAT + QDQ export; RTX 5080 (have)
Edge CPU latency	ONNX fp32 3.2 ms/win b1 host (MEASURED)	n/a (model-specific)	committed criterion regression baseline	host bench (have); ESP32/Hailo on-hardware (gap)
Hailo HEF edge inference	none in-tree (gap)	n/a	first MEASURED HEF latency	Hailo compile toolchain + Pi cluster (have hardware, CLAUDE.local.md)
Foundation encoder (MAE)	recipe adopted, untrained (ADR-152 §2.3)	UNSW: log-linear cross-domain scaling on classification (PREPRINT)	pose-transfer validation (hypothesis today)	1.3M-sample corpus aggregation (priority per F3)

5. Sources (graded)

Source	Type	Grade	Used for
GraphPose-Fi, arXiv 2511.19105	preprint	PREPRINT; table numbers MEASURED-in-study (fetched + quoted)	RQ1 MM-Fi frontier (61.1/44.2/12.9 PCK@20, 160.6/210.5/302.7 mm)
WiFlow-STD, arXiv 2602.08661 + DY2434 repo	preprint+code	numbers CLAIMED; artifacts MEASURED; project repro 96% MEASURED	RQ1/RQ2/RQ3
PerceptAlign, arXiv 2601.12252	preprint + MobiCom'26 acceptance	CLAIMED numbers; failure mode corroborated	RQ1/RQ2 geometry conditioning
UNSW MAE, arXiv 2511.18792	preprint	ablations MEASURED-in-study; pose transfer = hypothesis	RQ2 MAE recipe
DT-Pose, arXiv 2501.09411, OpenReview aPnLQ6WfQQ, code	preprint+code (ICLR'25)	exact MPJPE table NOT verified here — flagged	RQ2 masked-pretrain + topology
Person-in-WiFi-3D, CVPR 2024	peer-reviewed	MEASURED (91.7/108.1/125.3 mm); own dataset	RQ1 3D multi-person frontier
ONNX Runtime quantization docs	vendor docs	VENDOR	RQ3 PTQ/QAT mechanics
NVIDIA TensorRT QAT blog, Ultralytics	vendor/blog	BLOG/VENDOR; general, not CSI-specific	RQ3 QAT>PTQ context
Onboard Optimization survey, arXiv 2505.08793	preprint	PREPRINT	RQ3 on-device optimization landscape
In-tree `benchmarks/wiflow-std/RESULTS.md`, `benchmarks/edge-latency/RESULTS.md`, ADR-150, ADR-152, ADR-015	internal MEASURED	MEASURED-internal	grounding, all RQs

Unverified / flagged: DT-Pose exact MM-Fi MPJPE table not extracted at primary-source precision (abstract-level only). GraphPose-Fi parameter count not reported in the paper. WiFlow-STD/PerceptAlign accuracy numbers are author-self-reported preprints. No CSI-pose-specific QAT benchmark exists in the public literature — the QAT recommendation rests on general (non-CSI) vendor/blog evidence.

21 KiB Raw Blame History Unescape Escape

SOTA Evidence Brief — wifi-densepose-nn / wifi-densepose-train Benchmark ADR Seed