
ADR-155: NN / Training Beyond-SOTA Sweep — Milestone 1 (Claim Integrity, Honest Validation, the Unified Metric, and the SOTA Landscape)

Field	Value
Status	Proposed
Date	2026-06-11
Deciders	ruv
Codebase target	`wifi-densepose-train` (`metrics.rs`, `dataset.rs`, `proof.rs`, `rapid_adapt.rs`, `ruview_metrics.rs`, `config.rs`, `ablation.rs`, `subcarrier.rs`, `bin/train.rs`, `bin/verify_training.rs`), `wifi-densepose-nn` (`tensor.rs`, `translator.rs`, `onnx.rs`), benches, docs
Relates to	ADR-154 (Signal/DSP sweep, Milestone 0), ADR-152 (WiFi-Pose SOTA 2026 intake), ADR-150 (RF Foundation Encoder), ADR-079 (Camera-Supervised Pose), ADR-027 (MERIDIAN), ADR-024 (AETHER)
Scope	Milestone 1 of the beyond-SOTA NN/training sweep: the integrity-critical fixes that let the training/metrics subsystem substantiate a clean accuracy claim (the unified metric, leak-free validation, honest TTA, rigorous proof), a focused set of correctness/security fixes, two measured perf wins, the NN SOTA landscape with evidence grades, and a prioritized backlog. ~45 review findings are explicitly deferred (§8) — nothing is silently dropped.

0. PROOF discipline (this ADR's contract)

This project has been publicly accused of "AI slop." Milestone 1 is the most integrity-critical of the sweep because a gap review found the training/metrics subsystem could not substantiate a clean accuracy claim: there were four divergent PCK implementations and three divergent OKS implementations, a model trained on real data was validated against a synthetic set, the dataset had no leak-free split, the test-time-adaptation path descended a fake gradient, and the deterministic proof self-certified on any loss decrease (including float noise) with no committed baseline.

We answer that with evidence, not adjectives:

Every integrity fix ships with a committed regression test that would have caught the bug.
Every perf number is MEASURED before/after with the exact reproduce command. A perf claim without a measured before/after is UNPROVEN and is not made here.
Every external SOTA reference is graded MEASURED / CLAIMED / THEORETICAL.
We disclose, in full, what the proof does not prove and what remains unmeasured.

Build/test constraint (disclosed)

The reportable-metric code (metrics.rs, trainer.rs, proof.rs, model.rs, losses.rs) is gated behind the tch-backend Cargo feature (libtorch FFI). libtorch is not installed on the development host, so the project's standard gate is cargo test --workspace --no-default-features (no tch). The canonical-metric logic is therefore validated two ways: (1) the non-tch reachable surface (compute_pck/compute_oks free functions, dataset.rs split, rapid_adapt.rs, ruview_metrics.rs) runs under the workspace test suite with new regression tests; (2) the tch-gated accumulator/trainer/proof changes are routed through those same canonical functions, so the metric definition is identical whether or not tch is present. This limitation is disclosed rather than hidden.

1. Context — the seven divergent metric definitions

The gap review found four PCK and three OKS implementations that disagreed on normalization, on the zero-visible-joint case, and on the OKS scale:

#	Location	Normalizer	Zero-visible PCK	OKS scale
PCK-1	`metrics.rs` `MetricsAccumulator` (the trainer's)	bbox diagonal	1.0 (false-perfect bug)	normalized-coord diag²
PCK-2	`metrics.rs` `compute_pck`	torso hip↔shoulder	0.0	—
PCK-3	`metrics.rs` `compute_pck_v2`	torso hip↔hip (pixel)	0.0	—
PCK-4	`training_bench.rs`	raw threshold (no torso)	0.0	—
OKS-1	`metrics.rs:443` `compute_oks`	—	—	caller `s` (`1.0` ⇒ fake Gold)
OKS-2	`metrics.rs:994` `compute_oks_v2`	—	—	`sqrt(area)` (could be 0)
OKS-3	`ruview_metrics.rs:642`	—	—	caller `s` (`1.0` ⇒ fake Gold)

Two of these are not merely inconsistent, they are wrong in a claim-inflating direction:

The MetricsAccumulator zero-visible-joint bug scored a sample with no visible joints as PCK = 1.0 ("no errors to measure"). An empty or garbage prediction could thus inflate the reported metric.
The OKS s = 1.0-on-normalized-coordinates bug ("fake Gold tier"): with keypoints in [0,1] and the scale fixed at 1.0, every squared distance is ≈0 and the exponential kernel returns ≈1.0 for any pose. OKS looked near-perfect regardless of prediction quality.
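The scale effect behind the second bug can be illustrated with a minimal sketch. It is not the project's kernel: it assumes the standard COCO form `exp(-d² / (2·s²·k²))` with `k = 2σ`, a single uniform `SIGMA` for every joint, and a hypothetical `extent_scale` helper that uses the bounding-box diagonal of the ground-truth pose as the scale.

```rust
/// Uniform per-keypoint falloff constant (assumption; COCO uses per-joint sigmas).
const SIGMA: f32 = 0.07;

/// COCO-style OKS over visible keypoints: mean of exp(-d^2 / (2 * s^2 * k^2)), k = 2 * sigma.
fn oks(pred: &[[f32; 2]], gt: &[[f32; 2]], vis: &[bool], scale: f32) -> f32 {
    if scale <= 0.0 {
        return 0.0; // a non-positive scale is unscoreable, never "perfect"
    }
    let k = 2.0 * SIGMA;
    let mut sum = 0.0f32;
    let mut n = 0u32;
    for ((p, g), &v) in pred.iter().zip(gt).zip(vis) {
        if !v {
            continue;
        }
        let d2 = (p[0] - g[0]).powi(2) + (p[1] - g[1]).powi(2);
        sum += (-d2 / (2.0 * scale * scale * k * k)).exp();
        n += 1;
    }
    if n == 0 { 0.0 } else { sum / n as f32 }
}

/// Hypothetical scale proxy: bounding-box diagonal of the visible ground-truth joints.
fn extent_scale(gt: &[[f32; 2]], vis: &[bool]) -> f32 {
    let mut lo = [f32::INFINITY; 2];
    let mut hi = [f32::NEG_INFINITY; 2];
    for (g, &v) in gt.iter().zip(vis) {
        if v {
            for a in 0..2 {
                lo[a] = lo[a].min(g[a]);
                hi[a] = hi[a].max(g[a]);
            }
        }
    }
    if lo[0] > hi[0] {
        return 0.0; // no visible joint
    }
    ((hi[0] - lo[0]).powi(2) + (hi[1] - lo[1]).powi(2)).sqrt()
}

/// Scores the same wrong prediction with a fixed scale of 1.0 and with a derived scale.
fn fake_gold_demo() -> (f32, f32) {
    // A small pose in [0,1] coordinates, about 0.1 wide.
    let gt: [[f32; 2]; 4] = [[0.45, 0.50], [0.55, 0.50], [0.45, 0.60], [0.55, 0.60]];
    // Every joint is shifted by half the pose width.
    let pred = gt.map(|p| [p[0] + 0.05, p[1]]);
    let vis = [true; 4];
    let fixed = oks(&pred, &gt, &vis, 1.0);
    let derived = oks(&pred, &gt, &vis, extent_scale(&gt, &vis));
    (fixed, derived)
}
```

Under these assumptions the fixed-scale score stays above 0.9 for a prediction that is off by half the pose width, while the extent-derived score falls below 0.2. The exact values depend on the kernel constants, so only the direction of the effect should be read from this sketch.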

This is the same metric-bug class ADR-152 flagged. Milestone 1 closes it for real.

2. Decision — TIER 1: CLAIM INTEGRITY (the "prove everything" core)

2.1 Unify the metrics — ONE canonical definition — ACCEPTED & IMPLEMENTED

There is now exactly one PCK and one OKS that may be used for any reported number, in the canonical region of metrics.rs:

pck_canonical(pred, gt, vis, k) — torso-normalized PCK@k. A keypoint j is correct iff ‖pred_j − gt_j‖₂ ≤ k · torso, where torso = ‖left_hip(11) − right_hip(12)‖₂ in the keypoint coordinate space, with a bounding-box-diagonal fallback when the hips are not both visible. This is the COCO / ADR-152 convention validated in benchmarks/wiflow-std/RESULTS.md (the ~96% PCK@20 reproduction — hip↔hip torso, COCO Setting). Zero visible joints ⇒ (0, 0, 0.0) — a sample with no measurable evidence scores 0, never 1.
oks_canonical(pred, gt, vis) — COCO OKS. s = sqrt(area) is derived from the GT pose extent (the canonical torso size as a robust, always-positive scale proxy), never a fixed 1.0. There is no escape hatch that makes OKS ≈ 1.0 for any pose; a degenerate (zero-extent) pose returns 0.0.
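A minimal sketch of the PCK rule described above, under stated assumptions: keypoints are plain `[x, y]` pairs rather than the crate's tensor types, the names `torso_size`, `pck`, and `MIN_EXTENT` are illustrative, and a degenerate extent is reported the same way as zero visible joints. The real `pck_canonical` signature and edge-case handling may differ.

```rust
const LEFT_HIP: usize = 11;
const RIGHT_HIP: usize = 12;
/// Below this extent a pose is treated as unscoreable (assumed floor).
const MIN_EXTENT: f32 = 1e-6;

fn dist(a: [f32; 2], b: [f32; 2]) -> f32 {
    ((a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)).sqrt()
}

/// Hip-to-hip distance, or the bounding-box diagonal of the visible
/// ground-truth joints when the hips are not both visible.
fn torso_size(gt: &[[f32; 2]], vis: &[bool]) -> f32 {
    if gt.len() > RIGHT_HIP && vis.len() > RIGHT_HIP && vis[LEFT_HIP] && vis[RIGHT_HIP] {
        return dist(gt[LEFT_HIP], gt[RIGHT_HIP]);
    }
    let mut lo = [f32::INFINITY; 2];
    let mut hi = [f32::NEG_INFINITY; 2];
    for (g, &v) in gt.iter().zip(vis) {
        if v {
            for a in 0..2 {
                lo[a] = lo[a].min(g[a]);
                hi[a] = hi[a].max(g[a]);
            }
        }
    }
    if lo[0] > hi[0] { 0.0 } else { dist(lo, hi) }
}

/// Returns (correct, visible, pck). A joint is correct iff its error is <= k * torso.
/// Zero visible joints (or a degenerate torso) score 0.0, never 1.0.
fn pck(pred: &[[f32; 2]], gt: &[[f32; 2]], vis: &[bool], k: f32) -> (usize, usize, f32) {
    let torso = torso_size(gt, vis);
    let mut correct = 0;
    let mut visible = 0;
    for ((p, g), &v) in pred.iter().zip(gt).zip(vis) {
        if v {
            visible += 1;
            if dist(*p, *g) <= k * torso {
                correct += 1;
            }
        }
    }
    if visible == 0 || torso < MIN_EXTENT {
        return (0, 0, 0.0);
    }
    (correct, visible, correct as f32 / visible as f32)
}
```

With hips 0.1 apart and `k = 0.2`, the tolerance is 0.02 in coordinate units, so a joint that is 0.03 away from its ground truth counts as wrong.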

Single source of truth, enforced. MetricsAccumulator::update (the trainer's), compute_pck, compute_per_joint_pck, compute_oks, aggregate_metrics, and the deprecated compute_pck_v2/compute_oks_v2/MetricsAccumulatorV2 all route through pck_canonical/oks_canonical. So Trainer::evaluate() → MetricsAccumulator → canonical; the WiFlow-STD bench definition (RESULTS.md) is the reference the canonical matches. eval.rs reports MPJPE (a distinct, non-divergent error metric, unchanged). The v2 functions and the training_bench.rs raw-threshold kernel are annotated #[deprecated] / "DO NOT USE for reported metrics".

The two claim-inflating bugs are fixed and pinned by regression tests:

canonical_pck_zero_visible_is_zero_not_one — no-visible ⇒ PCK 0.0 (was 1.0).
canonical_oks_not_one_for_wrong_pose_on_normalized_coords — a pose off by 3× the torso on [0,1] coords yields OKS < 0.2 (the old s=1.0 path returned ≈1.0).
canonical_pck_uses_hip_to_hip_torso, canonical_torso_falls_back_to_bbox_when_hips_hidden — pin the normalizer.
all_invisible_gives_zero_pck (renamed from all_invisible_gives_trivial_pck, comment cites this ADR) — the trainer accumulator now scores no-visible as 0.

Legitimately changed test expectations (each updated with a comment citing this finding): the historical "perfect on an all-coincident pose" fixtures used keypoints at a single point, which is correctly unscoreable under canonical (zero extent ⇒ no scale). Test fixtures were given a real ±0.05 hip span so the canonical normalizer is positive; all_invisible_* flipped from 1.0 → 0.0.

2.2 Honest validation — leak-free split + synthetic-val disclosure — ACCEPTED & IMPLEMENTED

The leak. MM-Fi windows are extracted with stride 1 (MmFiEntry::num_windows = num_frames − window_frames + 1), so adjacent windows overlap by window_frames − 1 frames (~99% at the default 100-frame window). And bin/train.rs validated a real MM-Fi training run against a synthetic val set "for pipeline verification" — any PCK it printed was meaningless on two counts.
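The overlap arithmetic can be checked with a small sketch. The function names are illustrative, and the 297-frame clip used in the usage note is a hypothetical length chosen only for the arithmetic.

```rust
/// Number of stride-1 windows in a clip (0 if the clip is shorter than one window).
fn num_windows(num_frames: usize, window_frames: usize) -> usize {
    if window_frames == 0 || num_frames < window_frames {
        0
    } else {
        num_frames - window_frames + 1
    }
}

/// Fraction of frames that two adjacent stride-1 windows share.
fn adjacent_overlap(window_frames: usize) -> f64 {
    if window_frames == 0 {
        0.0
    } else {
        (window_frames - 1) as f64 / window_frames as f64
    }
}
```

For a 297-frame clip and a 100-frame window this gives 198 windows, each sharing 99 of its 100 frames with its neighbour. A random per-window split would therefore place near-duplicates on both sides.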

The fix (mirroring the leak-free discipline of occupancy_bench::EvalSplit):

MmFiDataset::subject_disjoint_split(test_subject_fraction, seed) → (train_view, test_view) partitions whole subjects to one side. Because every window of a subject travels with that subject, the two views share no subject and no window — leak-free by construction, deterministic per seed. Returns DatasetError::InvalidSplit on <2 subjects, bad fraction, or an empty side.
assert_split_leak_free(train, test) independently verifies subject-disjointness and window-index-disjointness, and is called inside the split so a leaky split can never be handed out.
bin/train.rs now prefers the real split; the synthetic path is reachable only as a labelled fallback (single-subject data) and is routed through a new run_smoke_test that prefixes every metric [SMOKE-TEST] (DO NOT REPORT). --dry-run is likewise relabelled. A synthetic-val PCK can no longer be mistaken for a measurement.

Leak-free proof (tests): subject_split_is_subject_and_window_disjoint (no shared subject, no shared window index, partition covers every window once), subject_split_is_deterministic_for_seed, subject_split_rejects_single_subject, subject_split_rejects_bad_fraction, assert_leak_free_detects_injected_subject_leak (the validator catches a deliberately-injected subject overlap — a guard against future partitioner bugs).
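The split-and-verify discipline above might look like the following sketch. It is an assumption-laden stand-in, not `MmFiDataset::subject_disjoint_split` itself: `Window`, `SplitError`, `mix`, and `is_leak_free` are hypothetical names, windows are reduced to a `(subject, index)` pair, and subjects are ordered by a SplitMix64-style hash of the seed rather than whatever shuffle the crate uses.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Window {
    subject: u32,
    index: usize,
}

#[derive(Debug, PartialEq, Eq)]
enum SplitError {
    TooFewSubjects,
    BadFraction,
    EmptySide,
}

/// Deterministic pseudo-random key per (seed, subject), SplitMix64-style finalizer.
fn mix(seed: u64, subject: u32) -> u64 {
    let mut z = seed.wrapping_add(subject as u64).wrapping_add(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Independent check: no shared subject and no shared window index.
fn is_leak_free(train: &[Window], test: &[Window]) -> bool {
    let subjects: BTreeSet<u32> = train.iter().map(|w| w.subject).collect();
    let indices: BTreeSet<usize> = train.iter().map(|w| w.index).collect();
    test.iter()
        .all(|w| !subjects.contains(&w.subject) && !indices.contains(&w.index))
}

/// Assigns whole subjects to one side; returns (train, test).
fn subject_disjoint_split(
    windows: &[Window],
    test_fraction: f64,
    seed: u64,
) -> Result<(Vec<Window>, Vec<Window>), SplitError> {
    // The negated form also rejects NaN.
    if !(test_fraction > 0.0 && test_fraction < 1.0) {
        return Err(SplitError::BadFraction);
    }
    let unique: BTreeSet<u32> = windows.iter().map(|w| w.subject).collect();
    if unique.len() < 2 {
        return Err(SplitError::TooFewSubjects);
    }
    let mut subjects: Vec<u32> = unique.into_iter().collect();
    subjects.sort_by_key(|&s| (mix(seed, s), s)); // deterministic per seed
    let n_test = (subjects.len() as f64 * test_fraction).floor() as usize;
    if n_test == 0 || n_test == subjects.len() {
        return Err(SplitError::EmptySide);
    }
    let held_out: BTreeSet<u32> = subjects[..n_test].iter().copied().collect();
    let (test, train): (Vec<Window>, Vec<Window>) = windows
        .iter()
        .copied()
        .partition(|w| held_out.contains(&w.subject));
    // Verified inside the split so a leaky result is never handed out.
    assert!(is_leak_free(&train, &test), "split leaked a subject or window");
    Ok((train, test))
}
```

Keeping the validator separate from the partitioner is the design point: the check does not trust the code it guards, so an injected overlap is caught even if the partition logic later regresses.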

2.3 rapid_adapt honesty — real gradients, scoped claim — ACCEPTED & IMPLEMENTED

rapid_adapt.rs's contrastive_step/entropy_step wrote a fake gradient (grad += v * 0.01) unrelated to the stated triplet / entropy objective — so any "TTA improves the metric" was unsupported by the code.

Resolution: real gradients (not removal). The two *_loss functions are now pure evaluators of the real objective; RapidAdaptation::adapt descends them with a central finite-difference gradient of that exact loss (∂L/∂wᵢ ≈ (L(w+εeᵢ) − L(w−εeᵢ)) / (2ε)). Finite differences genuinely minimize the stated objective (to O(ε²) truncation), so "the adaptation loss decreases" is now a real, reproducible measurement rather than an artefact of a hand-tuned step. The returned final_loss is the actual objective at the produced weights.
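As a minimal sketch of the central-difference descent described above (generic over any scalar loss, and not the `RapidAdaptation::adapt` implementation; `central_diff_gradient` and `descend` are illustrative names):

```rust
/// Central finite-difference gradient: dL/dw_i ~ (L(w + eps*e_i) - L(w - eps*e_i)) / (2*eps).
fn central_diff_gradient<F: Fn(&[f64]) -> f64>(loss: F, w: &[f64], eps: f64) -> Vec<f64> {
    let mut probe = w.to_vec();
    let mut grad = Vec::with_capacity(w.len());
    for i in 0..w.len() {
        let orig = probe[i];
        probe[i] = orig + eps;
        let plus = loss(&probe);
        probe[i] = orig - eps;
        let minus = loss(&probe);
        probe[i] = orig; // restore before perturbing the next coordinate
        grad.push((plus - minus) / (2.0 * eps));
    }
    grad
}

/// Plain gradient descent on `loss`; returns the loss evaluated at the final weights.
fn descend<F: Fn(&[f64]) -> f64>(loss: F, w: &mut [f64], lr: f64, eps: f64, steps: usize) -> f64 {
    for _ in 0..steps {
        let g = central_diff_gradient(&loss, w, eps);
        for (wi, gi) in w.iter_mut().zip(&g) {
            *wi -= lr * gi;
        }
    }
    loss(w)
}
```

Because the returned value is `loss(w)` at the produced weights, an independent recomputation must match it exactly, which is the property `reported_loss_is_the_real_objective_not_a_placeholder` pins. The cost is two loss evaluations per weight per step, which is only practical for a small parameter count such as the LoRA bottleneck mentioned here.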

Honest scope caveat (recorded in the module and here): this minimizes a self-supervised proxy (temporal-contrastive + prediction entropy) over a tiny LoRA bottleneck on raw CSI. It is NOT wired to the pose model, and there is no measured end-to-end PCK gain on WiFi pose from this path. TTA-on-pose is a future, not-yet-measured capability — no PCK improvement may be cited from this module.

Tests: contrastive_loss_decreases and entropy_loss_decreases (20/30 real gradient steps do not increase the loss vs 0 steps), reported_loss_is_the_real_objective_not_a_placeholder (the returned final_loss equals an independent recomputation of the objective at the output weights — i.e. it is the real loss, not a fabricated number).

2.4 proof.rs rigor — margin + committed-hash requirement — ACCEPTED & IMPLEMENTED

The deterministic proof self-certified: generate_expected_hash blessed whatever the pipeline emitted, PASS counted any loss decrease (including 1e-9 float noise), and a missing expected hash defaulted to PASS.

Two hardenings:

Minimum-decrease margin. MIN_LOSS_DECREASE = 1e-4. A run counts as "learning" only when initial − final ≥ MIN_LOSS_DECREASE — well above float noise, far below a real step's decrease. A pipeline that only wanders by noise now FAILS.
No-hash is a SKIP, never a PASS. ProofResult::is_pass() requires hash_matches == Some(true) (a committed expected_proof.sha256). An absent baseline yields SKIP (exit 2). The verify-training binary additionally fails fast on a sub-margin loss before the hash comparison, so a missing baseline can never downgrade a non-learning pipeline to SKIP.
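The ordering of the two hardenings can be sketched as a small decision function. `Verdict`, `verdict`, and `exit_code` are illustrative names, not the `ProofResult` API; exit code 2 for SKIP comes from the text above, while 0 for PASS and 1 for FAIL are assumptions.

```rust
/// Minimum loss decrease that counts as "learning".
const MIN_LOSS_DECREASE: f64 = 1e-4;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Pass,
    Fail,
    Skip,
}

/// `hash_matches` is None when no expected hash has been committed.
fn verdict(initial_loss: f64, final_loss: f64, hash_matches: Option<bool>) -> Verdict {
    // Margin check first, so a missing baseline cannot turn a non-learning run into SKIP.
    // The negated comparison also fails on NaN.
    if !(initial_loss - final_loss >= MIN_LOSS_DECREASE) {
        return Verdict::Fail;
    }
    match hash_matches {
        Some(true) => Verdict::Pass,
        Some(false) => Verdict::Fail,
        None => Verdict::Skip, // no baseline: never reported as green
    }
}

/// Process exit code (2 for SKIP per the text; 0 and 1 are assumed).
fn exit_code(v: Verdict) -> i32 {
    match v {
        Verdict::Pass => 0,
        Verdict::Fail => 1,
        Verdict::Skip => 2,
    }
}
```

The three cases mirror the listed tests: a real decrease with no hash is SKIP, a 1e-9 wobble with no hash is FAIL, and only a real decrease with a matching committed hash is PASS.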

What this proves — and what it does NOT (disclosed): the proof certifies reproducibility and determinism (same seed ⇒ same weights ⇒ same hash) and that the optimiser measurably reduces a loss. It runs on a deterministic synthetic dataset by construction, so it does not prove the shipped weights came from real MM-Fi data, nor that any accuracy claim is met. Accuracy is substantiated separately (benchmarks/wiflow-std/RESULTS.md). There is currently no committed expected_proof.sha256 for the Rust proof, so it is honestly in the SKIP state until a baseline is committed on a libtorch-enabled host — and SKIP is now reported as SKIP, not green.

Tests: no_committed_hash_is_skip_not_pass, submargin_loss_change_fails_even_without_hash, committed_matching_hash_with_real_decrease_passes.

3. Decision — TIER 2: CORRECTNESS / SECURITY

Each fix ships a test that would have caught the bug (all in the non-tch, workspace-tested surface).

Finding	File	Fix	Test
`softmax(axis)` ignored the axis (whole-tensor normalize — breaks densepose per-pixel probs)	`nn/tensor.rs`	softmax along the given axis per lane; out-of-range axis ⇒ `NnError` (no panic)	(tier-2 suite)
`apply_attention` identity/uniform stub (any "with attention" ablation == without)	`nn/translator.rs`	implemented real single-head scaled-dot-product attention (`softmax(QKᵀ/√d)V` with Q/K/V/output projections); mis-shaped checkpoint projections rejected so a bad checkpoint can't silently become a no-op	`test_attention_is_not_uniform_stub`, `test_attention_rejects_wrong_weight_shape`
`config.validate()` had no UPPER bounds (config-OOM class still open)	`train/config.rs`	upper bounds on `window_frames`/subcarriers/`backbone_channels`/`heatmap_size`/keypoints/parts/`batch_size`; reject negative `gpu_device_id`	rejection tests; defaults+presets still validate
`subcarrier.rs` panic on non-contiguous input	`train/subcarrier.rs`	graceful path / typed error on strided input	non-contiguous-input test
`ablation.rs` `latency_percentiles` `partial_cmp().unwrap()` NaN panic	`train/ablation.rs`	`total_cmp` / NaN-guarded compare	NaN-input no-panic test
`onnx.rs` unchecked `-1` dim cast	`nn/onnx.rs`	reject negative/zero output dims with `NnError`	guarded-dim test
`ruview_metrics` `compute_single_oks` `s=1.0` fake-Gold + unguarded `[j]<17`	`train/ruview_metrics.rs`	derive scale from GT extent when none supplied; reject `s≤0`; bound the loop to array extents	`oks_rejects_nonpositive_scale`, `oks_does_not_panic_on_short_arrays`, `oks_not_perfect_for_wrong_pose_with_derived_scale`
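To illustrate the first row of the table, here is a minimal sketch of a per-lane softmax over a row-major 2-D buffer. It is a simplification under assumptions: the real `tensor.rs` handles general shapes and returns `NnError`, whereas `softmax_2d` and `SoftmaxError` are hypothetical names used only here.

```rust
#[derive(Debug, PartialEq, Eq)]
enum SoftmaxError {
    AxisOutOfRange { axis: usize, ndim: usize },
    ShapeMismatch,
}

/// Softmax along `axis` of a row-major [rows, cols] buffer, one lane at a time.
/// axis = 1 normalizes each row; axis = 0 normalizes each column.
fn softmax_2d(
    data: &[f32],
    rows: usize,
    cols: usize,
    axis: usize,
) -> Result<Vec<f32>, SoftmaxError> {
    if axis >= 2 {
        return Err(SoftmaxError::AxisOutOfRange { axis, ndim: 2 }); // typed error, no panic
    }
    if data.len() != rows * cols {
        return Err(SoftmaxError::ShapeMismatch);
    }
    // (number of lanes, lane length, offset between lanes, offset between lane elements)
    let (lanes, lane_len, lane_stride, elem_stride) = if axis == 1 {
        (rows, cols, cols, 1)
    } else {
        (cols, rows, 1, cols)
    };
    let mut out = vec![0.0f32; data.len()];
    for lane in 0..lanes {
        let idx = |i: usize| lane * lane_stride + i * elem_stride;
        // Subtract the lane maximum for numerical stability.
        let max = (0..lane_len)
            .map(|i| data[idx(i)])
            .fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for i in 0..lane_len {
            let e = (data[idx(i)] - max).exp();
            out[idx(i)] = e;
            sum += e;
        }
        for i in 0..lane_len {
            out[idx(i)] /= sum;
        }
    }
    Ok(out)
}
```

A whole-tensor normalization would make all six entries of a 2×3 input sum to 1; the per-lane version makes each row (or each column) sum to 1, which is what per-pixel class probabilities require.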

rf_encoder.rs was inspected and found to contain no checkpoint-deserialization assert: its assert_eq!s in LinearHead::new / ContrastiveBatcher::new are documented construction-time API contracts on programmer-supplied vector lengths, not adversarial-input panics — the described bug does not exist there. Any genuine checkpoint-load assert lives in the tch-gated proof.rs/trainer.rs path and is deferred (§8) as unverifiable without libtorch. Test pass counts: nn --no-default-features 35 passed, nn --features onnx onnx::tests 3 passed, train --no-default-features lib 176 passed.

4. Decision — TIER 3: MEASURED perf wins (new criterion benches)

All numbers MEASURED on the Windows dev host with the onnx feature (ort 2.0.0-rc.11, runtime auto-downloaded), committed in nn/benches/onnx_bench.rs.

4.1 Zero-copy ORT input — LANDED, MEASURED

onnx.rs built the ORT input via arr.iter().cloned().collect::<Vec<f32>>() — a full element-wise copy. Replaced with a contiguous fast path (arr.as_slice() ⇒ single memcpy, iterator fallback only for strided views).

Reproduce: cargo bench -p wifi-densepose-nn --no-default-features --features onnx --bench onnx_bench -- onnx_input_copy
Measured (input [1,256,64,64] = 1.05M f32): 1.972 ms → 1.336 ms (~1.48× faster), 532 → 785 Melem/s. Strided fallback unchanged (within noise), correctness preserved. End-to-end real-model inference: ~45.9 µs.
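The shape of the contiguous fast path can be sketched without the ndarray dependency. `StridedView` below is a hypothetical stand-in for an array view, and `to_input_buffer` is an illustrative name; the actual change operates on the array type used in `onnx.rs`.

```rust
/// Hypothetical stand-in for an array view: `len` elements spaced `stride` apart.
struct StridedView<'a> {
    data: &'a [f32],
    len: usize,
    stride: usize,
}

impl<'a> StridedView<'a> {
    /// `Some` only when the elements are adjacent in memory.
    fn as_slice(&self) -> Option<&'a [f32]> {
        if self.stride == 1 {
            self.data.get(..self.len)
        } else {
            None
        }
    }

    /// Element-wise iteration that works for any stride.
    fn iter(&self) -> impl Iterator<Item = f32> + '_ {
        self.data
            .iter()
            .step_by(self.stride.max(1))
            .take(self.len)
            .copied()
    }
}

/// Builds the owned input buffer: one bulk copy when contiguous, iterator fallback otherwise.
fn to_input_buffer(view: &StridedView<'_>) -> Vec<f32> {
    match view.as_slice() {
        Some(slice) => slice.to_vec(),
        None => view.iter().collect(),
    }
}
```

Both paths produce the same values; only the contiguous case avoids the per-element iterator, which is consistent with the strided fallback being reported as unchanged.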

4.2 ONNX per-inference write-lock — DIAGNOSED, NOT LANDABLE (honest)

OnnxBackend::run takes a parking_lot::RwLock write lock per inference, serializing concurrency. The intended fix was a read-lock. It is not landable on ort 2.0.0-rc.11: the safe Session::run is &mut self (verified against the vendored source) — there is no &self run path, so a read-lock fails the borrow checker. The underlying C++ OrtSession::Run is thread-safe, but exploiting that would require an unsafe interior-mutability bypass; we did not introduce that soundness risk. The write lock was kept, with a doc comment recording the upgrade path (a future ort with &self run ⇒ flip to read()).
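The borrow-checker constraint can be reproduced in miniature. This sketch uses `std::sync::RwLock` in place of `parking_lot::RwLock`, and `Session` and `Backend` are hypothetical stand-ins whose only shared property with the real types is a `run` method that takes `&mut self`.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical session whose `run` needs exclusive access.
struct Session {
    calls: u64,
}

impl Session {
    fn run(&mut self, input: f32) -> f32 {
        self.calls += 1;
        input * 2.0
    }
}

struct Backend {
    session: RwLock<Session>,
}

impl Backend {
    fn run(&self, input: f32) -> f32 {
        // A read guard only derefs to `&Session`, which cannot call `run(&mut self)`,
        // so the write lock is the only safe option and inferences are serialized.
        let mut guard = self.session.write().expect("lock poisoned");
        guard.run(input)
    }
}

/// Runs `per_thread` inferences on each of `threads` workers; returns the total call count.
fn run_concurrently(threads: usize, per_thread: usize) -> u64 {
    let backend = Arc::new(Backend {
        session: RwLock::new(Session { calls: 0 }),
    });
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let b = Arc::clone(&backend);
            thread::spawn(move || {
                for i in 0..per_thread {
                    b.run(i as f32);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().expect("worker panicked");
    }
    let calls = backend.session.read().expect("lock poisoned").calls;
    calls
}
```

Swapping `write()` for `read()` in `Backend::run` fails to compile for exactly the reason given above, which is why the change has to wait for a `&self` run path.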

Harness landed anyway, empirically proving the serialization: cargo bench -p wifi-densepose-nn --no-default-features --features onnx --bench onnx_bench -- onnx_concurrency → throughput does not scale with thread count; it falls and then plateaus (1 thr 19.4 Kelem/s → 2 thr 16.9K → 4 thr 14.0K → 8 thr 14.3K). When ort exposes &self run, the one-line lock change is expected to show the speedup on this same bench.

The native-conv naive-loop rewrite was deferred (§8) as out of scope for a measured milestone.

5. The NN / training SOTA landscape (graded)

Candidate	What	Grade	Verdict
GraphPose-Fi (arXiv 2511.19105, code github.com/Cirrick/GraphPose-Fi)	Graph/skeleton pose decoder for cross-environment WiFi pose; MM-Fi, 17 joints — matches our setup. ADR-150 §2.2 named a graph decoder but never built it.	CLAIMED (preprint; cross-env gains author-reported)	Top beyond-SOTA candidate. Propose as ACCEPTED-future — NOT built here. Best fit because the decoder is a drop-in on our 17-joint MM-Fi backbone and directly targets the cross-environment brittleness ADR-150/ADR-027 fight.
ONNX INT4	Extend our measured INT8 ONNX quantization to INT4 for edge.	THEORETICAL for our pipeline (INT8 is MEASURED; INT4 untested here)	#2 priority — natural extension of a measured capability.
CSI-JEPA vs MAE A/B	Joint-embedding predictive pretraining vs the ADR-152 §2.3 MAE recipe.	CLAIMED (JEPA strong elsewhere) — honest caveat: no JEPA or MAE result exists on WiFi POSE yet (ADR-152 F3: UNSW MAE downstream tasks are classification, not pose).	#3 — run as a measured A/B, do not pre-announce a winner.
"Mamba-CSI-pose"	A state-space-model CSI pose backbone.	—	Does NOT exist. Do not propose it. No such artifact in the 2025–2026 literature; naming it would be exactly the kind of unfounded claim this sweep exists to prevent.

6. Validation

cargo test --workspace --no-default-features — green (the metric unification legitimately changed a handful of test expectations; each was updated with a comment citing the finding, and the trainer/eval/proof now all route through the one canonical metric).
python archive/v1/data/proof/verify.py — VERDICT: PASS (Python pipeline proof, independent of the Rust changes).
New criterion benches compile and run under the onnx feature.

7. What changed, file by file

metrics.rs — canonical_torso_size, pck_canonical, oks_canonical (single source of truth); MetricsAccumulator/compute_pck/compute_per_joint_pck/compute_oks/aggregate_metrics route through them; compute_pck_v2/compute_oks_v2/MetricsAccumulatorV2 deprecated → canonical; zero-visible and s=1.0 bugs fixed; canonical bug-catching tests.
dataset.rs — subject_disjoint_split, MmFiSplitView, assert_split_leak_free; leak-free split tests.
error.rs — DatasetError::InvalidSplit.
bin/train.rs — prefer real subject-disjoint split; synthetic path relabelled run_smoke_test ("DO NOT REPORT").
proof.rs + bin/verify_training.rs — MIN_LOSS_DECREASE margin; no-hash ⇒ SKIP-not-PASS; sub-margin ⇒ FAIL-not-SKIP; new tests.
rapid_adapt.rs — fake gradient removed; finite-difference gradient of the real objective; honesty docs + tests.
ruview_metrics.rs — OKS scale derived from GT extent (no s=1.0); s≤0 rejected; OKS loop bounded; tests.
config.rs / ablation.rs / subcarrier.rs / nn/tensor.rs / nn/translator.rs / nn/onnx.rs — Tier-2 fixes (§3) + Tier-3 perf (§4).
training_bench.rs, sensing-server/training_api.rs — divergent local PCK kernels annotated "DO NOT USE for reported metrics"; the sensing-server torso-height PCK unification is a deferred backlog item (separate service + tch boundary).

8. Deferred backlog (NOT silently dropped)

The gap review surfaced ~60 findings; this milestone scoped to the provable integrity-critical subset plus two measured perf wins. The remainder are tracked here for a future ADR-155 milestone:

GraphPose-Fi graph decoder — build the §5 top candidate (ACCEPTED-future, not built).
ONNX INT4 quantization; CSI-JEPA vs MAE A/B; the rest of the §5 roadmap.
ONNX read-lock concurrency win — blocked on an ort release exposing &self Session::run (§4.2); harness already committed.
~~native-conv naive-loop perf rewrite (§4)~~: RESOLVED in Milestone-2 (see §8.2); bench-first → MEASURED-INCONCLUSIVE, no perf change shipped.
~~rf_encoder.rs assert_eq!-on-checkpoint~~ — RESOLVED in Milestone-2 (see §8.2): a pure-Rust fallible LinearHead::try_new guard was added. Any genuine tch-gated panic-on-input sites remain deferred — they require a libtorch host to compile/verify (model.rs amp_fc1 unbounded alloc is indirectly guarded by the new config.validate() upper bounds, but a direct guard + test is deferred).
~~sensing-server/training_api.rs PCK~~ — RESOLVED in Milestone-1b (see §8.1, Goal C). Relabelled (not unified) — and the audit found the real live divergence is in trainer.rs, not the orphaned training_api.rs.
~~test_metrics.rs reference kernels~~ — RESOLVED in Milestone-1b (see §8.1, Goal B). Canonical core hoisted to an un-gated module; the integration test now validates the production functions against hand-computed fixtures + a differential cross-check.
metrics.rs compute_pck_v2/compute_oks_v2/MetricsAccumulatorV2/evaluate_dataset_v2/hungarian_assignment_v2 — confirmed to have zero external callers (only evaluate_dataset_v2→MetricsAccumulatorV2 internally). They are already #[deprecated] and route through canonical, so they are not a divergent-definition risk, only dead weight. Left in place this pass (public API in a tch-gated module; deleting needs a deprecation-cycle + tch host to verify) — flagged here for a future cleanup, NOT deleted silently.
sensing-server/trainer.rs pck_at_threshold (raw) + oks_map(area=1.0) and the training_bench.rs raw kernel — relabelled in Milestone-1b (§8.1); true unification onto pck_canonical/oks_canonical (needs a torso scale + the train crate as a sensing-server dep) remains deferred.
~~The remaining ~40 lower-severity review findings (style, micro-opt, doc).~~ — RESOLVED in Milestone-2 (§8.2): the host-verifiable subset is cleared. The "~40" was an estimate; the actual host-verifiable (non-tch) train/nn surface is smaller. Enumerated resolution below.

8.2 Milestone-2 — host-verifiable §8 P3 backlog clearance — RESOLVED

Mirroring the ADR-154 M3 cleanup discipline, M2 closed the host-verifiable (non-tch) subset of the §8 backlog in wifi-densepose-train (+ the pure-Rust rf_encoder.rs/densepose.rs in wifi-densepose-nn that the §3/§4 items named). Everything behind #[cfg(feature = "tch-backend")] (metrics.rs, model.rs, losses.rs, proof.rs, trainer.rs, wiflow_std/{layers,model}.rs) is out of host-verifiable scope — it cannot be compiled/verified without libtorch and stays genuinely deferred (not dropped).

PROOF discipline held: every de-magicked constant is pinned == prior literal by a *_consts_unchanged_from_literals test; every boundary test characterizes CURRENT behaviour; no operating-value or behaviour change; the Python proof stays bit-exact at f8e76f21…46f7a (the metrics path is off the signal proof path — asserted, not assumed). A smaller-but-true count was reported rather than inventing 40 fixes.
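The pin-test pattern might look like the sketch below. The constant names and literal values are taken from the `metrics_core.rs` row of the table that follows; `is_visible` is a hypothetical helper, and treating the boundary as `>=` is an assumption about what "inclusive" means in the test name.

```rust
const VISIBILITY_THRESHOLD: f32 = 0.5;
const MIN_REFERENCE_EXTENT: f32 = 1e-6;
const OKS_FALLBACK_SIGMA: f32 = 0.07;

/// Hypothetical helper: a joint counts as visible at or above the threshold.
fn is_visible(score: f32) -> bool {
    score >= VISIBILITY_THRESHOLD
}

#[cfg(test)]
mod pin_tests {
    use super::*;

    /// Fails if a refactor changes a value while "only renaming" a literal.
    #[test]
    fn metrics_core_consts_unchanged_from_literals() {
        assert_eq!(VISIBILITY_THRESHOLD, 0.5);
        assert_eq!(MIN_REFERENCE_EXTENT, 1e-6);
        assert_eq!(OKS_FALLBACK_SIGMA, 0.07);
    }

    /// Characterizes current behaviour at the exact boundary.
    #[test]
    fn visibility_threshold_boundary_is_inclusive() {
        assert!(is_visible(0.5));
        assert!(!is_visible(0.499));
    }
}
```

Comparing a constant to the same literal looks redundant, but it turns a silent operating-value change into a test failure that has to be justified in review.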

Enumerated finding → resolution (real counts):

#	Finding (location)	Action	Pin/characterization test
1	`metrics_core.rs` — `0.5` vis / `1e-6` extent / `0.07` OKS-fallback sigma	de-magic → `VISIBILITY_THRESHOLD` / `MIN_REFERENCE_EXTENT` / `OKS_FALLBACK_SIGMA`	`metrics_core_consts_unchanged_from_literals`; `visibility_threshold_boundary_is_inclusive`; `degenerate_extent_below_floor_is_unscoreable`
2	`ruview_metrics.rs` — `17` / `0.5` / `0.2` / `1e-3` / `1e-6`	de-magic → `NUM_KEYPOINTS` / `VISIBILITY_THRESHOLD` / `PCK_THRESHOLD` / `MIN_BBOX_DIAG` / `MIN_DURATION_MINUTES`	`ruview_metrics_consts_unchanged_from_literals`; `tracking_zero_duration_does_not_divide_by_zero`; `oks_short_array_is_bounded_at_keypoint_count`
3	`subcarrier.rs` — sparse-interp `0.15`/`1e-4`/`0.1`/`1e-8`/`1e-5`/`500`	de-magic → 6 `SPARSE_*` consts	`sparse_interp_consts_unchanged_from_literals`; `compute_interp_weights_single_target_is_index_zero`; `sparse_interp_single_target_is_finite`
4	`eval.rs` — `1e-10` division guard (×3)	de-magic → `MIN_POSITIVE_MPJPE`	`eval_min_positive_mpjpe_unchanged_from_literal`; `domain_gap_infinite_when_in_domain_perfect_but_cross_nonzero`; `domain_gap_unity_when_everything_perfect`
5	`domain.rs` — `1e-5` LayerNorm eps	de-magic → `LAYER_NORM_EPS`	`layer_norm_eps_unchanged_from_literal` (n=0/zero-var boundary already covered)
6	`virtual_aug.rs` — `1e-10` Box-Muller / room-scale guards	de-magic → `BOX_MULLER_U1_FLOOR` / `MIN_ROOM_SCALE`	`virtual_aug_guard_consts_unchanged_from_literals`; `augment_frame_zero_room_scale_passes_amplitude_finite`
7	`rf_encoder.rs` — `20.0` softplus overflow threshold	de-magic → `SOFTPLUS_LINEAR_THRESHOLD`	`softplus_threshold_unchanged_from_literal`
8	`rf_encoder.rs` — panic-only `LinearHead::new` for untrusted weights (§3)	add pure-Rust fallible `try_new` → typed `RfHeadError` (additive; `new` unchanged)	`try_new_accepts_valid_and_rejects_each_bad_shape`
9	`densepose.rs::apply_conv_layer` naive-loop (§4)	bench-first → MEASURED-INCONCLUSIVE, no perf change shipped; committed bench + characterization anchor	`native_conv_matches_reference` + `benches/native_conv_bench.rs`
10	`rapid_adapt.rs` module-doc "O(ε)" inconsistency	doc-only fix → "O(ε²)" (central differences)	n/a (doc)
11	`geometry.rs` `DeepSets::encode` missing `# Panics`	doc-only fix (documents existing `assert!`)	n/a (doc)

Tally: 7 de-magicked (const + pin test), 9 new boundary/characterization tests, 1 added input guard (try_new) + test, 2 doc-only fixes, 1 perf item bench-first MEASURED-INCONCLUSIVE (not shipped, deferred). Test counts after M2: train --no-default-features 303 (was 288, +15); nn --no-default-features lib 38 (was 35, +3).
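Row 8's additive guard can be sketched as follows. Only the names `LinearHead`, `try_new`, and `RfHeadError` come from the table; the field layout, the error variants, and the `forward` method are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq, Eq)]
enum RfHeadError {
    InvalidDims,
    WeightLen { expected: usize, got: usize },
    BiasLen { expected: usize, got: usize },
}

#[derive(Debug)]
struct LinearHead {
    in_dim: usize,
    out_dim: usize,
    weights: Vec<f32>, // row-major [out_dim, in_dim]
    bias: Vec<f32>,
}

impl LinearHead {
    /// Fallible constructor for untrusted weights: returns a typed error instead of panicking.
    fn try_new(
        in_dim: usize,
        out_dim: usize,
        weights: Vec<f32>,
        bias: Vec<f32>,
    ) -> Result<Self, RfHeadError> {
        // Rejects zero dimensions and multiplication overflow in one step.
        let expected = in_dim
            .checked_mul(out_dim)
            .filter(|&n| n > 0)
            .ok_or(RfHeadError::InvalidDims)?;
        if weights.len() != expected {
            return Err(RfHeadError::WeightLen { expected, got: weights.len() });
        }
        if bias.len() != out_dim {
            return Err(RfHeadError::BiasLen { expected: out_dim, got: bias.len() });
        }
        Ok(Self { in_dim, out_dim, weights, bias })
    }

    /// y = W x + b; `None` when the input length does not match `in_dim`.
    fn forward(&self, x: &[f32]) -> Option<Vec<f32>> {
        if x.len() != self.in_dim {
            return None;
        }
        Some(
            (0..self.out_dim)
                .map(|o| {
                    let row = &self.weights[o * self.in_dim..(o + 1) * self.in_dim];
                    row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>() + self.bias[o]
                })
                .collect(),
        )
    }
}
```

Adding `try_new` next to an unchanged `new` keeps the existing programmer-facing contract while giving checkpoint-loading code a path that cannot panic on a malformed file.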

Skipped honestly (flagged-but-not-real): ablation.rs (NaN sort + boundary already fixed/tested in M1 — clean), signal_features.rs (consts already named, n=0 boundary already tested), mae.rs (no bare guard literals found), metrics_core already had thorough zero-visible/hip-normalizer coverage from M1. No churn was manufactured to hit a count.

Genuinely data-gated / tch-gated — remaining backlog (blocked, not dropped): GraphPose-Fi graph decoder, ONNX INT4, CSI-JEPA vs MAE A/B (all data/model-gated — need a training run + datasets); ONNX read-lock concurrency win (upstream-gated on ort); the tch-gated panic-on-input sites in proof.rs/trainer.rs/model.rs and the metrics.rs *_v2 dead-code deletion (tch-gated — need a libtorch host to compile/verify). The non-tch-verifiable subset of §8 is now cleared.

8.1 Milestone-1b — metric-definition unification (the §8 metric subset) — RESOLVED

This milestone closed the two metric-integrity items above. The work is pinned by tests, graded MEASURED, and surfaced findings the §1 table missed.

The complete, honest PCK / OKS audit map (every definition in v2/):

Definition (file:line)	Normalization basis	Threshold convention	Status
`metrics_core.rs` `pck_canonical` (was `metrics.rs`)	hip↔hip torso WIDTH (bbox-diag fallback), `[0,1]` coords	`k·torso`	CANONICAL
`metrics_core.rs` `oks_canonical`	`s=sqrt(area)` from GT pose extent	COCO kernel	CANONICAL
`metrics.rs` `compute_pck` / `compute_per_joint_pck` / `compute_oks`	— (thin wrappers)	—	route to canonical
`metrics.rs` `aggregate_metrics` / `MetricsAccumulator`	—	—	route to canonical
`metrics.rs` `compute_pck_v2` / `compute_oks_v2` / `MetricsAccumulatorV2`	hip↔hip (folded)	—	legacy-redundant, deprecated, NO callers — route to canonical
`tests/test_metrics.rs` local `compute_pck`/`compute_oks` (removed)	raw-threshold reimpl	raw	was independent reimpl → now validate canonical + 1 differential kernel
`benches/training_bench.rs` `compute_pck`	raw-threshold	raw	distinct-by-design (bench-only), annotated DO-NOT-REPORT
`sensing-server/training_api.rs` `compute_pck`	torso-HEIGHT (nose→hip), pixel-space	`ratio·torso_h`, 50px floor	distinct-by-design — and ORPHAN file (not `mod`-declared, does not compile); relabelled `compute_pck_torso_height`
`sensing-server/trainer.rs` `pck_at_threshold`	RAW (no normalization)	raw `thr`	distinct, LIVE (drives `best_pck`); MISSED by §1 table; relabelled `pck_raw@0.2`
`sensing-server/trainer.rs` `oks_map`→`oks_single(area=1.0)`	`area=1.0`	COCO kernel	fake-Gold, LIVE (drives `best_oks`); MISSED by §1 table; relabelled `oks_map(area=1.0 proxy)`

Findings the §1 seven-definition table under-counted (honest correction): the live sensing-server claim surface is trainer.rs (in lib.rs), not the named training_api.rs — which is an orphan file, never mod-declared, so it does not compile into the crate. The live best_pck is a raw, unnormalized PCK and the live best_oks still uses the area=1.0 fake-Gold path ADR-155 §2.1 reported as closed elsewhere. So the true metric landscape is messier than §1 documented: ≥3 PCK and ≥1 OKS live in sensing-server, two of them on the inflating side, and the file the ADR named for the fix was dead code. This is a finding, not a failure — recorded here rather than hidden.

Goal B (test_metrics.rs) — RESOLVED, MEASURED. The canonical core (pck_canonical/oks_canonical/canonical_torso_size/sigmas/bounding_box_diagonal) was hoisted into a new un-gated metrics_core module (the full metrics module is tch-backend-gated, so the canonical definition was previously unreachable from the workspace test gate; metrics now re-exports it → still ONE implementation). tests/test_metrics.rs now asserts the production functions against hand-computed fixtures — canonical_pck_matches_hand_computed_fixture (3/4 correct ⇒ 0.75, hand-derived), zero-visible⇒0.0, hip↔hip normalizer pin, OKS perfect⇒1.0, the fake-Gold pin — plus test_kernel_agrees_with_canonical, a differential test where an independent raw-threshold reference must AGREE with canonical in the torso=1.0 regime. (10→12 tests.)

Goal C (training_api.rs PCK) — RESOLVED by RELABEL, MEASURED. Torso-height is load-bearing (pixel-space, vertical nose→hip scale, [17×3] layout, no ndarray/train dep), so unifying would silently change the live numbers' meaning — exactly what to avoid. Resolution: relabel everywhere the metric surfaces so it is never read as canonical, in both the named training_api.rs (now compute_pck_torso_height, struct/JSON-field docs, pck_torso_h@0.2 logs) and — the real fix — the LIVE trainer.rs path (pck_at_threshold documented raw-unnormalized; oks_map area=1.0 flagged fake-Gold; main.rs prints pck_raw@0.2 / oks_map(area=1.0 proxy)). No wire-format field or pub-fn renames (no silent API break). Pinned by torso_pck_is_labelled_distinctly_from_canonical (training_api) and pck_at_threshold_is_raw_unnormalized_not_canonical (the live kernel). True unification (route the live server through pck_canonical/oks_canonical) remains a deferred §8 item — it needs a torso scale on the live data and the train crate as a dep.

9. Consequences

Positive. The training/metrics subsystem can now substantiate a clean accuracy claim: one documented metric used everywhere, a leak-free split, an honest TTA path, a proof that fails on noise and refuses to bless an unbaselined run, and two of the most claim-inflating bugs (false-perfect PCK, fake-Gold OKS) closed and pinned by regression tests. The unmeasured/unprovable parts are disclosed, not hidden.

Negative / honest. The reportable-metric tch-gated code cannot be compiled on the dev host (libtorch absent), so its validation rests on routing through the workspace-tested canonical functions plus review; the Rust deterministic proof is in SKIP until a baseline is committed on a tch host; the ONNX concurrency win is blocked upstream; and ~45 findings are deferred. None of these is presented as done.

Picture changed by Milestone-1b (§8.1) — corrected, not hidden. The §1 "seven divergent metrics" count was an under-count. The metric-unification audit (Goal A) found the live wifi-densepose-sensing-server carries additional, divergent definitions the §1 table omitted: a raw, unnormalized pck_at_threshold and an area=1.0 fake-Gold oks_map in trainer.rs — and these, not the orphaned training_api.rs the backlog named, are what actually drive the live-reported best_pck/best_oks. Milestone-1b relabelled them (load-bearing math on different data; relabel beats false unification) and pinned the divergence with tests; full unification onto the canonical definition stays deferred. So the canonical train/nn metric is unified and test-validated end-to-end, but the sensing-server still computes (now clearly-labelled, non-canonical) progress proxies — disclosed here as the honest current state.
