ADR-155: NN / Training Beyond-SOTA Sweep — Milestone 1 (Claim Integrity, Honest Validation, the Unified Metric, and the SOTA Landscape)
| Field | Value |
|---|---|
| Status | Proposed |
| Date | 2026-06-11 |
| Deciders | ruv |
| Codebase target | wifi-densepose-train (metrics.rs, dataset.rs, proof.rs, rapid_adapt.rs, ruview_metrics.rs, config.rs, ablation.rs, subcarrier.rs, bin/train.rs, bin/verify_training.rs), wifi-densepose-nn (tensor.rs, translator.rs, onnx.rs), benches, docs |
| Relates to | ADR-154 (Signal/DSP sweep, Milestone 0), ADR-152 (WiFi-Pose SOTA 2026 intake), ADR-150 (RF Foundation Encoder), ADR-079 (Camera-Supervised Pose), ADR-027 (MERIDIAN), ADR-024 (AETHER) |
| Scope | Milestone 1 of the beyond-SOTA NN/training sweep: the integrity-critical fixes that let the training/metrics subsystem substantiate a clean accuracy claim (the unified metric, leak-free validation, honest TTA, rigorous proof), a focused set of correctness/security fixes, one landed and measured perf win plus one measured-but-blocked concurrency diagnosis (§4), the NN SOTA landscape with evidence grades, and a prioritized backlog. ~45 review findings are explicitly deferred (§8); nothing is silently dropped. |
0. PROOF discipline (this ADR's contract)
This project has been publicly accused of "AI slop." Milestone 1 is the most integrity-critical of the sweep because a gap review found the training/metrics subsystem could not substantiate a clean accuracy claim: there were four divergent PCK implementations and three divergent OKS implementations, a model trained on real data was validated against a synthetic set, the dataset had no leak-free split, the test-time-adaptation path descended a fake gradient, and the deterministic proof self-certified on any loss decrease (including float noise) with no committed baseline.
We answer that with evidence, not adjectives:
- Every integrity fix ships with a committed regression test that would have caught the bug.
- Every perf number is MEASURED before/after with the exact reproduce command. A perf claim without a measured before/after is UNPROVEN and is not made here.
- Every external SOTA reference is graded MEASURED / CLAIMED / THEORETICAL.
- We disclose, in full, what the proof does not prove and what remains unmeasured.
Build/test constraint (disclosed)
The reportable-metric code (metrics.rs, trainer.rs, proof.rs, model.rs, losses.rs) is gated behind the tch-backend Cargo feature (libtorch FFI). libtorch is not installed on the development host, so the project's standard gate is cargo test --workspace --no-default-features (no tch). The canonical-metric logic is therefore validated two ways: (1) the non-tch reachable surface (compute_pck/compute_oks free functions, dataset.rs split, rapid_adapt.rs, ruview_metrics.rs) runs under the workspace test suite with new regression tests; (2) the tch-gated accumulator/trainer/proof changes are routed through those same canonical functions, so the metric definition is identical whether or not tch is present. This limitation is disclosed rather than hidden.
1. Context — the seven divergent metric definitions
The gap review found four PCK and three OKS implementations that disagreed on normalization, on the zero-visible-joint case, and on the OKS scale:
| # | Location | Normalizer | Zero-visible PCK | OKS scale |
|---|---|---|---|---|
| PCK-1 | metrics.rs MetricsAccumulator (the trainer's) | bbox diagonal | 1.0 (false-perfect bug) | normalized-coord diag² |
| PCK-2 | metrics.rs compute_pck | torso hip↔shoulder | 0.0 | n/a |
| PCK-3 | metrics.rs compute_pck_v2 | torso hip↔hip (pixel) | 0.0 | n/a |
| PCK-4 | training_bench.rs | raw threshold (no torso) | 0.0 | n/a |
| OKS-1 | metrics.rs:443 compute_oks | n/a | n/a | caller s (1.0 ⇒ fake Gold) |
| OKS-2 | metrics.rs:994 compute_oks_v2 | n/a | n/a | sqrt(area) (could be 0) |
| OKS-3 | ruview_metrics.rs:642 | n/a | n/a | caller s (1.0 ⇒ fake Gold) |
Two of these are not merely inconsistent, they are wrong in a claim-inflating direction:
- The MetricsAccumulator zero-visible-joint bug scored a sample with no visible joints as PCK = 1.0 ("no errors to measure"). An empty or garbage prediction could thus inflate the reported metric.
- The OKS s = 1.0-on-normalized-coordinates bug ("fake Gold tier"): with keypoints in [0,1] and the scale fixed at 1.0, every squared distance is ≈0 and the exponential kernel returns ≈1.0 for any pose. OKS looked near-perfect regardless of prediction quality.
This is the same metric-bug class ADR-152 flagged. Milestone 1 closes it for real.
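The fake-Gold effect can be reproduced in isolation. The sketch below is a minimal, std-only illustration and not the project's compute_oks: it assumes the standard COCO per-keypoint constants and, for the derived scale, uses hip-to-hip distance as a stand-in for the GT extent. The names `oks_with_scale` and `hip_span` are hypothetical.

```rust
/// COCO per-keypoint sigmas for the 17-joint skeleton; the kernel uses k = 2 * sigma.
const COCO_SIGMAS: [f32; 17] = [
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
];

/// Mean over visible joints of exp(-d^2 / (2 * s^2 * k^2)).
/// Returns 0.0 when nothing is scoreable (no visible joint or unusable scale).
fn oks_with_scale(pred: &[[f32; 2]; 17], gt: &[[f32; 2]; 17], vis: &[bool; 17], s: f32) -> f32 {
    if !(s > 0.0) {
        return 0.0; // rejects zero, negative, and NaN scales
    }
    let mut sum = 0.0f32;
    let mut visible = 0u32;
    for j in 0..17 {
        if !vis[j] {
            continue;
        }
        let dx = pred[j][0] - gt[j][0];
        let dy = pred[j][1] - gt[j][1];
        let k = 2.0 * COCO_SIGMAS[j];
        sum += (-(dx * dx + dy * dy) / (2.0 * s * s * k * k)).exp();
        visible += 1;
    }
    if visible == 0 { 0.0 } else { sum / visible as f32 }
}

/// Assumed scale proxy for this sketch: distance between left hip (11) and right hip (12).
fn hip_span(gt: &[[f32; 2]; 17]) -> f32 {
    let dx = gt[11][0] - gt[12][0];
    let dy = gt[11][1] - gt[12][1];
    (dx * dx + dy * dy).sqrt()
}
```

With hips 0.1 apart on [0,1] coordinates and every joint shifted by 0.02 (a fifth of the hip span), `oks_with_scale(.., 1.0)` scores above 0.9, while `oks_with_scale(.., hip_span(&gt))` scores below 0.5 for the same prediction.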
2. Decision — TIER 1: CLAIM INTEGRITY (the "prove everything" core)
2.1 Unify the metrics — ONE canonical definition — ACCEPTED & IMPLEMENTED
There is now exactly one PCK and one OKS that may be used for any reported number, in the canonical region of metrics.rs:
- pck_canonical(pred, gt, vis, k): torso-normalized PCK@k. A keypoint j is correct iff ‖pred_j − gt_j‖₂ ≤ k · torso, where torso = ‖left_hip(11) − right_hip(12)‖₂ in the keypoint coordinate space, with a bounding-box-diagonal fallback when the hips are not both visible. This is the COCO / ADR-152 convention validated in benchmarks/wiflow-std/RESULTS.md (the ~96% PCK@20 reproduction: hip↔hip torso, COCO Setting). Zero visible joints ⇒ (0, 0, 0.0): a sample with no measurable evidence scores 0, never 1.
- oks_canonical(pred, gt, vis): COCO OKS. s = sqrt(area) is derived from the GT pose extent (the canonical torso size as a robust, always-positive scale proxy), never a fixed 1.0. There is no escape hatch that makes OKS ≈ 1.0 for any pose; a degenerate (zero-extent) pose returns 0.0.
Single source of truth, enforced. MetricsAccumulator::update (the trainer's), compute_pck, compute_per_joint_pck, compute_oks, aggregate_metrics, and the deprecated compute_pck_v2/compute_oks_v2/MetricsAccumulatorV2 all route through pck_canonical/oks_canonical. So Trainer::evaluate() → MetricsAccumulator → canonical; the WiFlow-STD bench definition (RESULTS.md) is the reference the canonical matches. eval.rs reports MPJPE (a distinct, non-divergent error metric, unchanged). The v2 functions and the training_bench.rs raw-threshold kernel are annotated #[deprecated] / "DO NOT USE for reported metrics".
The two claim-inflating bugs are fixed and pinned by regression tests:
- canonical_pck_zero_visible_is_zero_not_one: no-visible ⇒ PCK 0.0 (was 1.0).
- canonical_oks_not_one_for_wrong_pose_on_normalized_coords: a pose off by 3× the torso on [0,1] coords yields OKS < 0.2 (the old s=1.0 path returned ≈1.0).
- canonical_pck_uses_hip_to_hip_torso, canonical_torso_falls_back_to_bbox_when_hips_hidden: pin the normalizer.
- all_invisible_gives_zero_pck (renamed from all_invisible_gives_trivial_pck, comment cites this ADR): the trainer accumulator now scores no-visible as 0.
Legitimately changed test expectations (each updated with a comment citing this finding): the historical "perfect on an all-coincident pose" fixtures used keypoints at a single point, which is correctly unscoreable under canonical (zero extent ⇒ no scale). Test fixtures were given a real ±0.05 hip span so the canonical normalizer is positive; all_invisible_* flipped from 1.0 → 0.0.
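The canonical PCK rule described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the body of pck_canonical: the names `torso_size` and `pck_sketch` are hypothetical, the fallback uses the bounding box of the visible GT joints, and a zero-size torso is treated as "nothing can be correct".

```rust
const LEFT_HIP: usize = 11;
const RIGHT_HIP: usize = 12;

fn dist(a: [f32; 2], b: [f32; 2]) -> f32 {
    ((a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)).sqrt()
}

/// Hip-to-hip distance when both hips are visible, otherwise the diagonal of the
/// bounding box of the visible GT joints (0.0 if nothing is visible).
fn torso_size(gt: &[[f32; 2]], vis: &[bool]) -> f32 {
    let hips_visible =
        gt.len() > RIGHT_HIP && vis.len() > RIGHT_HIP && vis[LEFT_HIP] && vis[RIGHT_HIP];
    if hips_visible {
        return dist(gt[LEFT_HIP], gt[RIGHT_HIP]);
    }
    let mut lo = [f32::INFINITY; 2];
    let mut hi = [f32::NEG_INFINITY; 2];
    for (p, &v) in gt.iter().zip(vis.iter()) {
        if v {
            lo = [lo[0].min(p[0]), lo[1].min(p[1])];
            hi = [hi[0].max(p[0]), hi[1].max(p[1])];
        }
    }
    if lo[0].is_finite() { dist(lo, hi) } else { 0.0 }
}

/// Returns (correct, visible, pck). Zero visible joints score (0, 0, 0.0), never 1.0.
fn pck_sketch(pred: &[[f32; 2]], gt: &[[f32; 2]], vis: &[bool], k: f32) -> (usize, usize, f32) {
    let torso = torso_size(gt, vis);
    let (mut correct, mut visible) = (0usize, 0usize);
    for ((p, g), &v) in pred.iter().zip(gt.iter()).zip(vis.iter()) {
        if !v {
            continue;
        }
        visible += 1;
        // A joint is correct iff its error is within k torso lengths.
        if torso > 0.0 && dist(*p, *g) <= k * torso {
            correct += 1;
        }
    }
    if visible == 0 {
        (0, 0, 0.0)
    } else {
        (correct, visible, correct as f32 / visible as f32)
    }
}
```

The key design point is the last branch: an all-invisible sample carries no evidence, so it contributes a score of 0 rather than a vacuous 1.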
2.2 Honest validation — leak-free split + synthetic-val disclosure — ACCEPTED & IMPLEMENTED
The leak. MM-Fi windows are extracted with stride 1 (MmFiEntry::num_windows = num_frames − window_frames + 1), so adjacent windows overlap by window_frames − 1 frames (~99% at the default 100-frame window). And bin/train.rs validated a real MM-Fi training run against a synthetic val set "for pipeline verification" — any PCK it printed was meaningless on two counts.
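The overlap arithmetic is worth making concrete. The helpers below are a hypothetical sketch (the names are not from the codebase) of the stride-1 window count quoted above and of how many frames two windows share.

```rust
/// Number of stride-1 windows in a clip (0 when the clip is shorter than one window).
fn num_windows(num_frames: usize, window_frames: usize) -> usize {
    if window_frames == 0 || num_frames < window_frames {
        0
    } else {
        num_frames - window_frames + 1
    }
}

/// Frames shared by two equal-length windows that start at frames `a` and `b`.
fn shared_frames(a: usize, b: usize, window_frames: usize) -> usize {
    window_frames.saturating_sub(a.abs_diff(b))
}
```

For an illustrative 300-frame clip and the default 100-frame window this gives 201 windows, and neighbouring windows share 99 of their 100 frames, so a per-window random split would place near-duplicates on both sides of the train/test boundary.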
The fix (mirroring the leak-free discipline of occupancy_bench::EvalSplit):
- MmFiDataset::subject_disjoint_split(test_subject_fraction, seed) → (train_view, test_view) partitions whole subjects to one side. Because every window of a subject travels with that subject, the two views share no subject and no window: the split is leak-free by construction and deterministic per seed. Returns DatasetError::InvalidSplit on <2 subjects, bad fraction, or an empty side.
- assert_split_leak_free(train, test) independently verifies subject-disjointness and window-index-disjointness, and is called inside the split so a leaky split can never be handed out.
- bin/train.rs now prefers the real split; the synthetic path is reachable only as a labelled fallback (single-subject data) and is routed through a new run_smoke_test that prefixes every metric with [SMOKE-TEST] (DO NOT REPORT). --dry-run is likewise relabelled. A synthetic-val PCK can no longer be mistaken for a measurement.
Leak-free proof (tests): subject_split_is_subject_and_window_disjoint (no shared subject, no shared window index, partition covers every window once), subject_split_is_deterministic_for_seed, subject_split_rejects_single_subject, subject_split_rejects_bad_fraction, assert_leak_free_detects_injected_subject_leak (the validator catches a deliberately-injected subject overlap — a guard against future partitioner bugs).
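A subject-disjoint split of this kind can be sketched with the standard library alone. The code below is an illustration under assumptions, not MmFiDataset::subject_disjoint_split: windows are represented only by the subject id that owns them, the seed-dependent ordering uses a SplitMix64-style mixer chosen for this sketch, and `SplitError`, `subject_disjoint_split_sketch`, and `is_leak_free` are hypothetical names.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum SplitError {
    TooFewSubjects,
    BadFraction,
    EmptySide,
}

/// SplitMix64-style mixer, used only to derive a deterministic, seed-dependent order.
fn mix(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// `window_subjects[i]` is the subject that owns window `i`.
/// Returns (train_window_indices, test_window_indices).
fn subject_disjoint_split_sketch(
    window_subjects: &[u32],
    test_fraction: f64,
    seed: u64,
) -> Result<(Vec<usize>, Vec<usize>), SplitError> {
    if !(test_fraction > 0.0 && test_fraction < 1.0) {
        return Err(SplitError::BadFraction); // also rejects NaN
    }
    let unique: BTreeSet<u32> = window_subjects.iter().copied().collect();
    let mut subjects: Vec<u32> = unique.into_iter().collect();
    if subjects.len() < 2 {
        return Err(SplitError::TooFewSubjects);
    }
    // Same seed gives the same subject order, hence the same split.
    subjects.sort_by_key(|&s| mix(seed ^ u64::from(s)));
    let n_test = (subjects.len() as f64 * test_fraction).round() as usize;
    if n_test == 0 || n_test == subjects.len() {
        return Err(SplitError::EmptySide);
    }
    let test_subjects: BTreeSet<u32> = subjects[..n_test].iter().copied().collect();
    let (mut train, mut test) = (Vec::new(), Vec::new());
    for (i, s) in window_subjects.iter().enumerate() {
        // Every window follows its subject, so no subject straddles the split.
        if test_subjects.contains(s) { test.push(i) } else { train.push(i) }
    }
    Ok((train, test))
}

/// Independent check: no shared subject and no shared window index.
fn is_leak_free(window_subjects: &[u32], train: &[usize], test: &[usize]) -> bool {
    let train_subj: BTreeSet<u32> = train.iter().map(|&i| window_subjects[i]).collect();
    let test_subj: BTreeSet<u32> = test.iter().map(|&i| window_subjects[i]).collect();
    let train_idx: BTreeSet<usize> = train.iter().copied().collect();
    train_subj.is_disjoint(&test_subj) && test.iter().all(|i| !train_idx.contains(i))
}
```

Keeping the checker separate from the partitioner is the point of the design: the validator does not trust how the split was produced, so it can catch a deliberately injected overlap.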
2.3 rapid_adapt honesty — real gradients, scoped claim — ACCEPTED & IMPLEMENTED
rapid_adapt.rs's contrastive_step/entropy_step wrote a fake gradient (grad += v * 0.01) unrelated to the stated triplet / entropy objective, so any "TTA improves the metric" claim was unsupported by the code.
Resolution: real gradients (not removal). The two *_loss functions are now pure evaluators of the real objective; RapidAdaptation::adapt descends them with a central finite-difference gradient of that exact loss (∂L/∂wᵢ ≈ (L(w+εeᵢ) − L(w−εeᵢ)) / (2ε)). The central difference approximates the true gradient of the stated objective with O(ε²) truncation error, so descending it genuinely minimizes that objective, and "the adaptation loss decreases" is now a real, reproducible measurement rather than an artefact of a hand-tuned step. The returned final_loss is the actual objective at the produced weights.
Honest scope caveat (recorded in the module and here): this minimizes a self-supervised proxy (temporal-contrastive + prediction entropy) over a tiny LoRA bottleneck on raw CSI. It is NOT wired to the pose model, and there is no measured end-to-end PCK gain on WiFi pose from this path. TTA-on-pose is a future, not-yet-measured capability — no PCK improvement may be cited from this module.
Tests: contrastive_loss_decreases and entropy_loss_decreases (20/30 real gradient steps do not increase the loss vs 0 steps), reported_loss_is_the_real_objective_not_a_placeholder (the returned final_loss equals an independent recomputation of the objective at the output weights — i.e. it is the real loss, not a fabricated number).
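The central-difference descent described in this section can be sketched generically. This is a minimal illustration, not RapidAdaptation::adapt: the loss is any caller-supplied closure, and `central_diff_grad` and `descend` are hypothetical names with assumed step-size and epsilon parameters.

```rust
/// Central finite-difference gradient: dL/dw_i ~ (L(w + eps*e_i) - L(w - eps*e_i)) / (2*eps).
fn central_diff_grad<F: Fn(&[f64]) -> f64>(loss: &F, w: &[f64], eps: f64) -> Vec<f64> {
    let mut probe = w.to_vec();
    let mut grad = vec![0.0; w.len()];
    for i in 0..w.len() {
        let orig = probe[i];
        probe[i] = orig + eps;
        let up = loss(&probe);
        probe[i] = orig - eps;
        let down = loss(&probe);
        probe[i] = orig; // restore before perturbing the next coordinate
        grad[i] = (up - down) / (2.0 * eps);
    }
    grad
}

/// Plain gradient descent on the estimated gradient.
/// The returned loss is re-evaluated at the returned weights, not tracked on the side.
fn descend<F: Fn(&[f64]) -> f64>(
    loss: &F,
    w0: &[f64],
    lr: f64,
    eps: f64,
    steps: usize,
) -> (Vec<f64>, f64) {
    let mut w = w0.to_vec();
    for _ in 0..steps {
        let g = central_diff_grad(loss, &w, eps);
        for (wi, gi) in w.iter_mut().zip(&g) {
            *wi -= lr * gi;
        }
    }
    let final_loss = loss(&w);
    (w, final_loss)
}
```

Each gradient costs two loss evaluations per weight, which is only practical for a small parameter count such as the tiny bottleneck this section describes. Reporting `loss(&w)` at the final weights is what lets a test recompute the objective independently and compare.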
2.4 proof.rs rigor — margin + committed-hash requirement — ACCEPTED & IMPLEMENTED
The deterministic proof self-certified: generate_expected_hash blessed whatever the pipeline emitted, PASS counted any loss decrease (including 1e-9 float noise), and a missing expected hash defaulted to PASS.
Two hardenings:
- Minimum-decrease margin. MIN_LOSS_DECREASE = 1e-4. A run counts as "learning" only when initial − final ≥ MIN_LOSS_DECREASE, which is well above float noise and far below a real step's decrease. A pipeline that only wanders by noise now FAILS.
- No-hash is a SKIP, never a PASS. ProofResult::is_pass() requires hash_matches == Some(true) (a committed expected_proof.sha256). An absent baseline yields SKIP (exit 2). The verify-training binary additionally fails fast on a sub-margin loss before the hash comparison, so a missing baseline can never downgrade a non-learning pipeline to SKIP.
What this proves — and what it does NOT (disclosed): the proof certifies reproducibility and determinism (same seed ⇒ same weights ⇒ same hash) and that the optimiser measurably reduces a loss. It runs on a deterministic synthetic dataset by construction, so it does not prove the shipped weights came from real MM-Fi data, nor that any accuracy claim is met. Accuracy is substantiated separately (benchmarks/wiflow-std/RESULTS.md). There is currently no committed expected_proof.sha256 for the Rust proof, so it is honestly in the SKIP state until a baseline is committed on a libtorch-enabled host — and SKIP is now reported as SKIP, not green.
Tests: no_committed_hash_is_skip_not_pass, submargin_loss_change_fails_even_without_hash, committed_matching_hash_with_real_decrease_passes.
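The resulting decision table is small enough to write down. The sketch below is an illustration of the ordering described above (margin check first, then hash), not the code in proof.rs; the `Verdict` enum and `verdict` function are hypothetical names, and only the 1e-4 margin is taken from this ADR.

```rust
const MIN_LOSS_DECREASE: f64 = 1e-4;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Verdict {
    Pass,
    Fail,
    Skip,
}

/// `hash_matches` is None when no baseline hash is committed.
fn verdict(initial_loss: f64, final_loss: f64, hash_matches: Option<bool>) -> Verdict {
    // Margin check runs first, so a missing baseline cannot turn a
    // non-learning run into SKIP. NaN losses also land here.
    if !(initial_loss - final_loss >= MIN_LOSS_DECREASE) {
        return Verdict::Fail;
    }
    match hash_matches {
        Some(true) => Verdict::Pass,
        Some(false) => Verdict::Fail,
        None => Verdict::Skip, // reproducibility not established: never green
    }
}
```

For example, `verdict(1.0, 0.5, None)` is `Skip`, while a run that moves by only 1e-9 is `Fail` whether or not a hash is present.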
3. Decision — TIER 2: CORRECTNESS / SECURITY
Each fix ships a test that would have caught the bug (all in the non-tch, workspace-tested surface).
| Finding | File | Fix | Test |
|---|---|---|---|
| softmax(axis) ignored the axis (whole-tensor normalize; breaks densepose per-pixel probs) | nn/tensor.rs | softmax along the given axis per lane; out-of-range axis ⇒ NnError (no panic) | (tier-2 suite) |
| apply_attention identity/uniform stub (any "with attention" ablation == without) | nn/translator.rs | implemented real single-head scaled-dot-product attention (softmax(QKᵀ/√d)V with Q/K/V/output projections); mis-shaped checkpoint projections rejected so a bad checkpoint can't silently become a no-op | test_attention_is_not_uniform_stub, test_attention_rejects_wrong_weight_shape |
| config.validate() had no UPPER bounds (config-OOM class still open) | train/config.rs | upper bounds on window_frames/subcarriers/backbone_channels/heatmap_size/keypoints/parts/batch_size; reject negative gpu_device_id | rejection tests; defaults+presets still validate |
| subcarrier.rs panic on non-contiguous input | train/subcarrier.rs | graceful path / typed error on strided input | non-contiguous-input test |
| ablation.rs latency_percentiles partial_cmp().unwrap() NaN panic | train/ablation.rs | total_cmp / NaN-guarded compare | NaN-input no-panic test |
| onnx.rs unchecked -1 dim cast | nn/onnx.rs | reject negative/zero output dims with NnError | guarded-dim test |
| ruview_metrics compute_single_oks s=1.0 fake-Gold + unguarded [j]<17 | train/ruview_metrics.rs | derive scale from GT extent when none supplied; reject s≤0; bound the loop to array extents | oks_rejects_nonpositive_scale, oks_does_not_panic_on_short_arrays, oks_not_perfect_for_wrong_pose_with_derived_scale |
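The softmax row in the table above is the easiest to illustrate: normalizing per lane along an axis is different from normalizing the whole tensor. The sketch below handles only a row-major 2-D buffer and is not the tensor.rs implementation; `softmax_2d` and `SoftmaxError` are hypothetical stand-ins (the real code returns NnError).

```rust
#[derive(Debug, PartialEq)]
enum SoftmaxError {
    AxisOutOfRange { axis: usize, ndim: usize },
    ShapeMismatch,
}

/// Softmax over a row-major [rows, cols] buffer.
/// axis = 1 normalizes each row; axis = 0 normalizes each column.
fn softmax_2d(
    data: &[f32],
    rows: usize,
    cols: usize,
    axis: usize,
) -> Result<Vec<f32>, SoftmaxError> {
    if axis >= 2 {
        return Err(SoftmaxError::AxisOutOfRange { axis, ndim: 2 }); // typed error, no panic
    }
    if data.len() != rows * cols {
        return Err(SoftmaxError::ShapeMismatch);
    }
    // A lane is one run of elements along `axis`.
    let (lanes, len, lane_step, elem_step) =
        if axis == 0 { (cols, rows, 1, cols) } else { (rows, cols, cols, 1) };
    let mut out = vec![0.0f32; data.len()];
    for lane in 0..lanes {
        let at = |i: usize| lane * lane_step + i * elem_step;
        // Subtract the lane maximum for numerical stability.
        let max = (0..len).map(|i| data[at(i)]).fold(f32::NEG_INFINITY, f32::max);
        let sum: f32 = (0..len).map(|i| (data[at(i)] - max).exp()).sum();
        for i in 0..len {
            out[at(i)] = (data[at(i)] - max).exp() / sum;
        }
    }
    Ok(out)
}
```

On a 2×2 input, every row sums to 1 with `axis = 1` and every column sums to 1 with `axis = 0`; a whole-tensor normalize would instead make all four entries sum to 1, which is the behaviour the fix removes.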
rf_encoder.rs was inspected and found to contain no checkpoint-deserialization assert: its assert_eq!s in LinearHead::new / ContrastiveBatcher::new are documented construction-time API contracts on programmer-supplied vector lengths, not adversarial-input panics — the described bug does not exist there. Any genuine checkpoint-load assert lives in the tch-gated proof.rs/trainer.rs path and is deferred (§8) as unverifiable without libtorch. Test pass counts: nn --no-default-features 35 passed, nn --features onnx onnx::tests 3 passed, train --no-default-features lib 176 passed.
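For the latency_percentiles row of the Tier-2 table, the panic comes from `partial_cmp().unwrap()` meeting a NaN. A minimal sketch of a NaN-safe variant follows; it is an assumption-laden illustration, not the ablation.rs code: it drops non-finite samples and uses nearest-rank selection, and `latency_percentile` is a hypothetical name.

```rust
/// Nearest-rank percentile over the finite samples; None when there is nothing to rank
/// or when `p` is outside 0..=100 (including NaN).
fn latency_percentile(samples: &[f64], p: f64) -> Option<f64> {
    let mut finite: Vec<f64> = samples.iter().copied().filter(|x| x.is_finite()).collect();
    if finite.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    // total_cmp is a total order on f64, so sorting cannot panic.
    finite.sort_by(f64::total_cmp);
    let rank = ((p / 100.0) * (finite.len() - 1) as f64).round() as usize;
    Some(finite[rank])
}
```

Whether NaN samples should be dropped or surfaced as an error is a policy choice; the sketch drops them so that one bad timing cannot poison the whole percentile.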
4. Decision — TIER 3: MEASURED perf wins (new criterion benches)
All numbers MEASURED on the Windows dev host with the onnx feature (ort 2.0.0-rc.11, runtime auto-downloaded), committed in nn/benches/onnx_bench.rs.
4.1 Zero-copy ORT input — LANDED, MEASURED
onnx.rs built the ORT input via arr.iter().cloned().collect::<Vec<f32>>() — a full element-wise copy. Replaced with a contiguous fast path (arr.as_slice() ⇒ single memcpy, iterator fallback only for strided views).
- Reproduce: cargo bench -p wifi-densepose-nn --no-default-features --features onnx --bench onnx_bench -- onnx_input_copy
- Measured (input [1,256,64,64] = 1.05M f32): 1.972 ms → 1.336 ms (~1.48× faster), 532 → 785 Melem/s. Strided fallback unchanged (within noise), correctness preserved. End-to-end real-model inference: ~45.9 µs.
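The fast-path/fallback shape of this change can be shown without the array crate. The sketch below uses a hypothetical `StridedView` as a stand-in for an array view, so it illustrates the pattern only and is not the onnx.rs code.

```rust
/// Hypothetical 1-D strided view over an f32 buffer (stand-in for an array view).
/// `data` must hold at least (len - 1) * stride + 1 elements.
struct StridedView<'a> {
    data: &'a [f32],
    stride: usize,
    len: usize,
}

impl StridedView<'_> {
    /// Some(slice) only when the elements are adjacent in memory.
    fn as_slice(&self) -> Option<&[f32]> {
        if self.stride == 1 { self.data.get(..self.len) } else { None }
    }
}

/// Builds the owned input buffer: one bulk copy when contiguous,
/// element-wise gather only for strided views.
fn to_input_vec(view: &StridedView) -> Vec<f32> {
    match view.as_slice() {
        Some(contiguous) => contiguous.to_vec(),
        None => (0..view.len).map(|i| view.data[i * view.stride]).collect(),
    }
}
```

Both paths return the same values in the same order, which is why correctness is preserved; only the contiguous case changes how the copy is performed.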
4.2 ONNX per-inference write-lock — DIAGNOSED, NOT LANDABLE (honest)
OnnxBackend::run takes a parking_lot::RwLock write lock per inference, serializing concurrency. The intended fix was a read-lock. It is not landable on ort 2.0.0-rc.11: the safe Session::run is &mut self (verified against the vendored source) — there is no &self run path, so a read-lock fails the borrow checker. The underlying C++ OrtSession::Run is thread-safe, but exploiting that would require an unsafe interior-mutability bypass; we did not introduce that soundness risk. The write lock was kept, with a doc comment recording the upgrade path (a future ort with &self run ⇒ flip to read()).
- Harness landed anyway, empirically proving the serialization: cargo bench -p wifi-densepose-nn --no-default-features --features onnx --bench onnx_bench -- onnx_concurrency → throughput falls and then plateaus as threads are added (1 thr 19.4 Kelem/s → 2 thr 16.9K → 4 thr 14.0K → 8 thr 14.3K). When ort exposes a &self run, the one-line lock change can be measured on this same bench; no speedup is claimed until then.
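The borrow-checker constraint behind this can be reproduced with the standard library. The sketch below uses std::sync::RwLock instead of parking_lot and a hypothetical `Session` type whose `run` takes `&mut self`, mirroring the signature described above; it is an illustration, not the OnnxBackend code.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Stand-in for a session whose safe `run` requires exclusive access.
struct Session {
    runs: u64,
}

impl Session {
    fn run(&mut self, input: &[f32]) -> f32 {
        self.runs += 1;
        input.iter().sum()
    }
}

struct Backend {
    session: RwLock<Session>,
}

impl Backend {
    fn infer(&self, input: &[f32]) -> f32 {
        // A read guard only yields `&Session`, so calling `run(&mut self)` through it
        // would not compile. The write guard is required, and it serializes callers.
        let mut guard = self.session.write().expect("lock poisoned");
        guard.run(input)
    }
}

/// Runs `iters` inferences on each of `threads` workers and returns the total run count.
fn run_concurrently(backend: Arc<Backend>, threads: usize, iters: usize) -> u64 {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let b = Arc::clone(&backend);
            thread::spawn(move || {
                for _ in 0..iters {
                    b.infer(&[1.0, 2.0]);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().expect("worker panicked");
    }
    let runs = backend.session.read().expect("lock poisoned").runs;
    runs
}
```

Every worker queues on the same exclusive lock, so adding threads adds contention rather than parallelism. If `run` took `&self`, `infer` could switch to `read()` and the workers could proceed concurrently.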
The native-conv naive-loop rewrite was deferred (§8) as out of scope for a measured milestone.
5. The NN / training SOTA landscape (graded)
| Candidate | What | Grade | Verdict |
|---|---|---|---|
| GraphPose-Fi (arXiv 2511.19105, code github.com/Cirrick/GraphPose-Fi) | Graph/skeleton pose decoder for cross-environment WiFi pose; MM-Fi, 17 joints — matches our setup. ADR-150 §2.2 named a graph decoder but never built it. | CLAIMED (preprint; cross-env gains author-reported) | Top beyond-SOTA candidate. Propose as ACCEPTED-future — NOT built here. Best fit because the decoder is a drop-in on our 17-joint MM-Fi backbone and directly targets the cross-environment brittleness ADR-150/ADR-027 fight. |
| ONNX INT4 | Extend our measured INT8 ONNX quantization to INT4 for edge. | THEORETICAL for our pipeline (INT8 is MEASURED; INT4 untested here) | #2 priority — natural extension of a measured capability. |
| CSI-JEPA vs MAE A/B | Joint-embedding predictive pretraining vs the ADR-152 §2.3 MAE recipe. | CLAIMED (JEPA strong elsewhere) — honest caveat: no JEPA or MAE result exists on WiFi POSE yet (ADR-152 F3: UNSW MAE downstream tasks are classification, not pose). | #3 — run as a measured A/B, do not pre-announce a winner. |
| "Mamba-CSI-pose" | A state-space-model CSI pose backbone. | — | Does NOT exist. Do not propose it. No such artifact in the 2025–2026 literature; naming it would be exactly the kind of unfounded claim this sweep exists to prevent. |
6. Validation
- cargo test --workspace --no-default-features: green (the metric unification legitimately changed a handful of test expectations; each was updated with a comment citing the finding, and the trainer/eval/proof now all route through the one canonical metric).
- python archive/v1/data/proof/verify.py: VERDICT: PASS (Python pipeline proof, independent of the Rust changes).
- New criterion benches compile and run under the onnx feature.
7. What changed, file by file
- metrics.rs: canonical_torso_size, pck_canonical, oks_canonical (single source of truth); MetricsAccumulator/compute_pck/compute_per_joint_pck/compute_oks/aggregate_metrics route through them; compute_pck_v2/compute_oks_v2/MetricsAccumulatorV2 deprecated → canonical; zero-visible and s=1.0 bugs fixed; canonical bug-catching tests.
- dataset.rs: subject_disjoint_split, MmFiSplitView, assert_split_leak_free; leak-free split tests.
- error.rs: DatasetError::InvalidSplit.
- bin/train.rs: prefer real subject-disjoint split; synthetic path relabelled run_smoke_test ("DO NOT REPORT").
- proof.rs + bin/verify_training.rs: MIN_LOSS_DECREASE margin; no-hash ⇒ SKIP-not-PASS; sub-margin ⇒ FAIL-not-SKIP; new tests.
- rapid_adapt.rs: fake gradient removed; finite-difference gradient of the real objective; honesty docs + tests.
- ruview_metrics.rs: OKS scale derived from GT extent (no s=1.0); s≤0 rejected; OKS loop bounded; tests.
- config.rs / ablation.rs / subcarrier.rs / nn/tensor.rs / nn/translator.rs / nn/onnx.rs: Tier-2 fixes (§3) + Tier-3 perf (§4).
- training_bench.rs, sensing-server/training_api.rs: divergent local PCK kernels annotated "DO NOT USE for reported metrics"; the sensing-server torso-height PCK unification is a deferred backlog item (separate service + tch boundary).
8. Deferred backlog (NOT silently dropped)
The gap review surfaced ~60 findings; this milestone was scoped to the provable integrity-critical subset plus the §4 perf work (one landed, measured win and one measured-but-blocked diagnosis). The remainder are tracked here for a future ADR-155 milestone:
- GraphPose-Fi graph decoder — build the §5 top candidate (ACCEPTED-future, not built).
- ONNX INT4 quantization; CSI-JEPA vs MAE A/B; the rest of the §5 roadmap.
- ONNX read-lock concurrency win: blocked on an ort release exposing &self Session::run (§4.2); harness already committed.
- native-conv naive-loop perf rewrite (§4).
- rf_encoder.rs assert_eq!-on-checkpoint and any other tch-gated panic-on-input sites: require a libtorch host to compile/verify (model.rs amp_fc1 unbounded alloc is indirectly guarded by the new config.validate() upper bounds, but a direct guard + test is deferred).
- sensing-server/training_api.rs PCK: unify the live-server torso-height PCK with pck_canonical (crosses the service + tch boundary).
- test_metrics.rs reference kernels: the integration test's local compute_pck/compute_oks are independent reference impls (not production); fold them onto the canonical definition.
- The remaining ~40 lower-severity review findings (style, micro-opt, doc) from the NN/training gap review.
9. Consequences
Positive. The training/metrics subsystem can now substantiate a clean accuracy claim: one documented metric used everywhere, a leak-free split, an honest TTA path, a proof that fails on noise and refuses to bless an unbaselined run, and two of the most claim-inflating bugs (false-perfect PCK, fake-Gold OKS) closed and pinned by regression tests. The unmeasured/unprovable parts are disclosed, not hidden.
Negative / honest. The reportable-metric tch-gated code cannot be compiled on the dev host (libtorch absent), so its validation rests on routing through the workspace-tested canonical functions plus review; the Rust deterministic proof is in SKIP until a baseline is committed on a tch host; the ONNX concurrency win is blocked upstream; and ~45 findings are deferred. None of these is presented as done.