wifi-densepose/docs/adr/ADR-173-metric-locked-pck-m...

6.3 KiB
Raw Blame History

ADR-173: Metric-Locked PCK/MPJPE Accuracy Harness

Field Value
Status Accepted — implemented, deterministically tested
Date 2026-06-15
Deciders ruv
Codename METRIC-LOCK
Amends ADR-155 (generalizes the torso-only metrics_core::pck_canonical to a selectable normalization)
Motivated by docs/research/sota-nn-train-benchmark-brief.md (PR #1090)

Context

The beyond-SOTA SOTA-research brief (PR #1090) identified the single biggest threat to any "beyond-SOTA" accuracy claim this project makes: metric ambiguity. Three PCK@20 numbers circulate, computed under three different and unstated normalizations, so they cannot be compared:

  • 96.0996.61% — WiFlow-STD reproduction, image/bounding-box-normalized PCK (the looser convention).
  • 81.63% — an internal MM-Fi number reported as "torso-PCK" (tighter).
  • 61.1% — GraphPose-Fi (arXiv 2511.19105), standard torso-diameter PCK on the MM-Fi random split (the academic frontier).

The project has been burned by this twice: a previously-published 92.9% was retracted because it used absolute-pixel normalization, not torso. Until there is one canonical, documented, tested PCK definition — and every reported number carries the definition it was computed under — no accuracy comparison is credible, and the "prove everything" bar cannot be met for the benchmark half of the work.

This is measurement infrastructure, not an accuracy claim. The deliverable's job is to make the metric unambiguous and reproducible, so future numbers are comparable and an unlabeled PCK is structurally impossible.

Decision

Add a metric-locked accuracy harness as a new module v2/crates/wifi-densepose-train/src/accuracy.rs (404 non-test lines; inline deterministic tests bring the file to 708), re-exported at the crate root. It extends, not duplicates — it reuses metrics_core's geometric primitives (bounding_box_diagonal, canonical hip indices CANON_LEFT_HIP/RIGHT_HIP), so there remains exactly one implementation of each geometric reference; the existing ADR-155 pck_canonical (torso-only) is unchanged and this generalizes it.

Public API

  • enum PckNormalization { TorsoDiameter, BoundingBoxDiagonal, AbsolutePixels(f32) } — the three conventions the three historical numbers used, now explicit and selectable. .label() / .tolerance(...).
  • pck_at(pred, gt, vis, k, norm) -> (correct, total, pck) — PCK@k = fraction of visible keypoints whose predicted-vs-GT distance ≤ the tolerance, where tolerance = k% of the chosen normalizer (or an absolute threshold for AbsolutePixels).
  • mpjpe(pred, gt, vis) -> f32 — mean per-joint position error (2D/3D, coordinate units; mm for mm inputs). Re-exported crate-root as pck_mpjpe to avoid colliding with the existing eval::mpjpe.
  • struct PoseAccuracy { pck_at: BTreeMap<u8,f32>, mpjpe, normalization, n_keypoints, n_frames }a reported number always carries its normalization; an unlabeled PCK is structurally impossible to produce through this surface.
  • struct PoseFrame { pred, gt, visibility } + accuracy_report(frames, ks, norm) -> PoseAccuracy (micro-averaged over keypoints).

Correctness is proven by hand-computed deterministic tests (no GPU, no data)

The tests construct synthetic keypoint sets whose PCK/MPJPE can be computed by hand, and assert the harness matches. Highlights (all pass):

Test Construction Expected
perfect_prediction pred==gt PCK=1.0 (all 3 norms), MPJPE=0
all_just_outside every error just past τ@20 PCK=0.0
half_in_half_out 2 exact, 2 just outside PCK=0.5
three_normalizations (KEY PROOF) identical pred; nose err .06, shoulder .10, hips exact torso=0.50, bbox=1.00, abs(.08)=0.75
mpjpe_2d / mpjpe_3d (3,4)→5 / (1,2,2)→3 2.5 / 3.0
mpjpe_excludes_invisible invisible joint err 100 ignored 5.0
zero_torso_unscoreable coincident hips (0,0,0.0), not false-perfect
no_visible_keypoints vis=∅ (0,0,0.0)
nan_coords one NaN pred coord counted wrong, no panic
empty report no frames 0.0, not NaN
bbox≥torso ordering same frames bbox-PCK ≥ torso-PCK

The key proof (the ambiguity is real and quantified)

Identical predictions, three declared normalizations → 0.50 / 1.00 / 0.75. Mechanism: the bbox diagonal √(0.20² + 0.80²) = 0.825 is ~4× the hip-span torso 0.20, so τ@20 is 0.165 (bbox) vs 0.040 (torso) — the looser image-normalized convention passes joints the strict torso convention rejects. This is exactly why 96% / 81.6% / 61% cannot be lined up without declaring the enum, demonstrated in-code.

Validation

  • cargo test -p wifi-densepose-train --no-default-features → lib 191 → 206 (+15), test_metrics 12 → 14 (+2), doc-tests 8 — 0 failed.
  • cargo test --workspace --no-default-featuresexit 0, 0 failed.
  • python archive/v1/data/proof/verify.pyVERDICT: PASS, hash f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a unchanged (off the signal proof path — confirms no pipeline alteration).

Consequences

Positive

  • The three historical PCK numbers can now be recomputed under one declared definition and compared honestly. The retracted-number class of error (silent normalization mismatch) is structurally prevented going forward.
  • Establishes the measurement substrate for the beyond-SOTA target: GraphPose-Fi cross-environment PCK@20 = 12.9% (standard torso PCK) is now a number this harness can produce comparably.

Negative

  • None functional. The harness is additive; no existing metric path changed.

Neutral

  • Producing actual model numbers under this harness requires the trained models + datasets (MM-Fi) and, for cross-domain splits, is the next sub-deliverable of the benchmark/optimization milestone — out of scope here (this ADR is the instrument, not the reading).
  • ADR-155 — metric core (pck_canonical, torso-only) — generalized here
  • ADR-152 — WiFi-Pose SOTA 2026 intake / WiFlow-STD benchmark
  • docs/research/sota-nn-train-benchmark-brief.md — the motivating gap analysis
  • GraphPose-Fi — arXiv 2511.19105 (verified cross-env PCK@20 = 12.9% anchor)