wifi-densepose/docs/research/ruview-beyond-sota/03-benchmark-validation-met...

# Beyond-SOTA Validation, Test & Benchmark Methodology

**Series:** `docs/research/ruview-beyond-sota/` · Document 03
**Date:** 2026-06-09
**Scope:** How RuView proves (and gates) beyond-SOTA claims using the verification
infrastructure that already exists in this repository. Every number below is sourced
from a cited file in this repo; nothing is invented.

---

## 1. The Layered Validation Pyramid

Six layers, cheapest/most-deterministic at the bottom, most expensive/most-credible at
the top. A beyond-SOTA claim must survive **every layer below it** before it may be
published from the layer it lives at.

| Layer | What it proves | Tooling | Frequency | Determinism |
|-------|----------------|---------|-----------|-------------|
| **L0** Unit/integration tests | Code correctness | `cargo test --workspace --no-default-features` + pytest | per commit | exact |
| **L1** Deterministic proof + witness bundle | Pipeline is real, unchanged, reproducible | `archive/v1/data/proof/verify.py`, `scripts/generate-witness-bundle.sh` | per merge / release | exact (SHA-256) |
| **L2** Criterion micro-benchmarks | Compute latency only — never quality (ADR-171 §2) | 15 bench targets across `v2/crates/*/benches/` | nightly / pre-release | statistical |
| **L3** Dataset-level accuracy eval | Pose/presence/vitals quality vs published SOTA | MM-Fi / Wi-Pose (ADR-015), `ruview_metrics.rs` tiers, ADR-145 ablation harness | per model release | seeded |
| **L4** Hardware-in-loop | Real CSI on real ESP32, no mocks | COM9 (S3) / COM12 (C6) protocol, witness firmware hashes | per firmware release | A/B controlled |
| **L5** Field trials / live capture | End-to-end behavior in a real room | live-session captures (e.g. `benchmark_baseline.json`) | campaign | statistical |

### 1.1 L0 — Workspace tests (current counts)

- ADR-028 audit (2026-03-01): **1,031 passed, 0 failed, 8 ignored** for
  `cargo test --workspace --no-default-features`
  (`docs/adr/ADR-028-esp32-capability-audit.md` §2).
- Current `CHANGELOG.md` (Unreleased, cross-platform fix entry): **2,682 workspace
  tests pass / 0 fail on Windows** — the suite has more than doubled since the audit.
- `CLAUDE.md` pre-merge gate still cites "1,031+ passed, 0 failed" as the floor.

**Rule:** the post-change test count may never be lower than the pre-change count, and
failures must be 0. The witness bundle records the full log
(`test-results/rust-workspace-tests.log`) and an aggregated `summary.txt`
(`scripts/generate-witness-bundle.sh` step 3).

### 1.2 L1 — Deterministic proof ("Trust Kill Switch") + witness bundle

`archive/v1/data/proof/verify.py` (header comment): feeds 1,000 synthetic CSI frames
(seed=42, `sample_csi_data.json`) through the **production** `CSIProcessor`
(`src/core/csi_processor.py`), hashes the first 100 frames' feature output
(`VERIFICATION_FRAME_COUNT = 100`), and compares against
`archive/v1/data/proof/expected_features.sha256`.

- **Current published hash (file contents, verified during this investigation):**
  `f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a`
- The hash is **environment-coupled** and has been legitimately regenerated before:
  ADR-028 §5.3 recorded `8c0680d7…` under numpy 2.4.2/scipy 1.17.1; `CHANGELOG.md`
  (#560 fix) recorded `667eb054…` after 6-decimal quantization + single-thread BLAS
  pinning (`OMP_NUM_THREADS=1` etc.). Each regeneration must follow the documented
  procedure: `python verify.py --generate-hash` then `python verify.py` → `VERDICT: PASS`.

`scripts/generate-witness-bundle.sh` packages: witness log + ADR-028, the Python proof
(verify.py + expected hash + reference-signal metadata), full Rust test log + summary,
the ADR-134 CIR proof, firmware source/binary SHA-256s, crate version manifest, npm
tarball SHA-256, and a recipient-side `VERIFY.sh`.

**Accuracy note on check counts:** `CLAUDE.md` describes the recipient verification as
"7/7 PASS"; the current `VERIFY.sh` embedded in the script performs **10** `check()`
assertions (witness log, ADR, proof-hash file, tests, firmware hashes, crate manifest,
npm manifest, Python proof, CIR proof, CIR hash file) but prints a hardcoded
`"ALL CHECKS PASSED (8/8)"` string (`generate-witness-bundle.sh` line 293). The
hardcoded count is stale relative to the actual check list — fix it to print
`${PASS_COUNT}/${PASS_COUNT+FAIL_COUNT}` so the verdict can never silently desynchronize
from the check inventory.

### 1.3 L2 — Criterion micro-benchmark inventory (all 15 targets)

All bench sources read directly. Per ADR-171 §2 these are **latency regression gates
only, never quality evidence**.

| Bench target | Crate | Benchmark functions / groups | What it measures | Recorded value or in-source target (citation) |
|---|---|---|---|---|
| `engine_cycle.rs` | wifi-densepose-engine | `process_cycle_4nodes_56sc` | One full `StreamingEngine::process_cycle` (fuse + quality + calibration provenance + privacy gate + WorldGraph node), 4-node/56-subcarrier ESP32-S3 HT20 mesh | Budget: **50 ms** (20 Hz) — bench header |
| `signal_bench.rs` | wifi-densepose-signal | `CSI Preprocessing`, `Phase Sanitization`, `Feature Extraction`, `Motion Detection`, `Full Pipeline` | SOTA signal stages (ADR-014) at varying frame sizes | no recorded baseline |
| `cir_bench.rs` | wifi-densepose-signal | `cir_estimate` (HT20/HT40/HE20/HE40), `cir_estimate_12link`, `cir_estimator_new` | ADR-134 `CirEstimator::estimate()` per tier; 12-link multistatic amortization; cold-start | no recorded baseline |
| `calibration_bench.rs` | wifi-densepose-signal | `bench_recorder_record`, `bench_recorder_finalize`, `bench_deviation`, `bench_record_600`, `bench_to_bytes` (K=52/114/242/484) | ADR-135 empty-room baseline recorder + deviation scoring | no recorded baseline |
| `aether_prefilter_bench.rs` | wifi-densepose-signal | `aether_search_d…_n…_k…` (search vs prefilter) | ADR-084 Pass-2: `EmbeddingHistory::search_prefilter` vs brute force, prefilter_factor=8 | Pass: **≥4× at n=1024** — bench header |
| `sketch_bench.rs` | wifi-densepose-ruvector | `compare_d128/256/512` × `float_l2`/`float_cosine`/`sketch_hamming` | ADR-084 sketch-vs-float per-pair compare cost (AETHER 128-d, spectrogram 256-d) | Pass: **sketch ≥8× faster** at every dim (ADR-084 threshold 8×–30×) — bench header |
| `crv_bench.rs` | wifi-densepose-ruvector | `gestalt_classify_single/batch_100`, `sensory_encode_single`, `pipeline_full_session`, `convergence_two_sessions`, `crv_session_create`, `crv_embedding_dimension_scaling` (32/128/384), `crv_stage_vi_partition` | CRV integration throughput | no recorded baseline |
| `inference_bench.rs` | wifi-densepose-nn | `tensor_ops` (relu/sigmoid/tanh), `densepose_inference`, `translator_inference`, `mock_inference`, `batch_inference` | NN forward-pass cost by input/batch size | no recorded baseline; **`mock_inference` group must never be quoted as a pipeline number** (§6) |
| `training_bench.rs` | wifi-densepose-train | `interp_114_to_56_batch32`, `interp_scaling`, `compute_interp_weights_114_56`, `synthetic_dataset_get`, `synthetic_epoch`, `config_validate`, PCK over 100 samples | Training preprocessing + metrics hot paths; fixtures fully deterministic (no `rand`) — header | no recorded baseline |
| `detection_bench.rs` | wifi-densepose-mat | `breathing_detection`, `heartbeat_detection`, `movement_classification`, `detection_pipeline`, localization (triangulation/depth), alert generation | MAT survivor-detection algorithms at varying signal lengths / noise | no recorded baseline |
| `transport_bench.rs` | wifi-densepose-hardware | `beacon_serialize_16byte/28byte_auth/quic_framed`, `auth_beacon_verify`, `replay_window`, `framed_message` encode/decode, `secure_tdm_cycle` (manual vs QUIC) | TDM beacon crypto + transport | no recorded baseline |
| `mqtt_throughput.rs` | wifi-densepose-sensing-server | `discovery::build_*`, `state::*`, `rate_limiter::allow_*`, `privacy::decide_*`, `semantic::bus_tick_all_10_primitives` | ADR-115 MQTT hot path | Targets (header): discovery **<5 µs**, state encode **<2 µs**, rate limit **<100 ns**, privacy **<50 ns**, bus tick **<10 µs** |
| `swarm_bench.rs` | ruview-swarm | `marl_actor_inference`, `rrt_apf_100iter`, `multiview_fusion_3drones`, `demo_coverage_estimate`, `ppo_update_64transitions` | ADR-148 swarm control-loop compute | Measured: **3.3 µs / 43 µs / 54–58.5 ns / 100 ps / 248 µs** (ADR-171 §4.3; `CHANGELOG.md` Performance section) |
| `pipeline_throughput.rs` | nvsim | `pipeline_run` (sample-count sweep), `witness::run` vs `run_with_witness` | NV-diamond sim throughput + witness overhead | Acceptance: **≥1 kHz** simulated samples/s on Cortex-A53-class CPU — bench header |
| `state_machine.rs` | homecore | `set` first/warm/no-op, `get` hit/miss, `all_snapshot`, `all_by_domain_light_20_of_100`, `broadcast_fan_out` | HOMECORE state-machine hot paths | no recorded baseline |

**Honest gap — `benchmark_baseline.json` is not a criterion baseline.** The repo-root
`benchmark_baseline.json` (369.9 KB) contains **1,566 live-capture samples** from a
2-node session (fields: `tick`, `n_nodes`, `variance`, `motion`, `presence`,
`confidence`, `est_persons`, `n_persons_rendered`, `kp_spread`, `rssi`) plus a summary
block — it records **field-trial telemetry (L5)**, not micro-benchmark latencies.
No file in the repo references it (`grep -rn benchmark_baseline` → 0 hits outside the
file itself); its producer must be identified and committed (§5.3). Summary values
(all from the file's `summary` object):

| Metric | Baseline value |
|---|---:|
| `total_frames` | 1,566 |
| `presence_ratio` | 0.9336 (1,462/1,566 frames presence-true) |
| `confidence_mean` | 0.6433 |
| `variance_mean` / `variance_std` | 109.36 / 154.13 |
| `kp_spread_mean` / `kp_spread_std` | 86.73 / 4.52 |
| `person_count_changes` | 10 |

Criterion latencies that *have* been recorded live in ADR documents instead
(ADR-168-benchmark-proof.md, ADR-171 §4.3, CHANGELOG Performance) — §5 below defines
how to consolidate them into a real machine-readable criterion baseline.

### 1.4 L3 — Dataset-level accuracy evaluation

- **Datasets (ADR-015):** primary **MM-Fi** (40 subjects × 27 actions × ~320K frames,
  1TX×3RX, 114 subcarriers @100 Hz, 17-keypoint COCO + DensePose UV, CC BY-NC 4.0);
  secondary **Wi-Pose** (12 volunteers × 12 actions × 166,600 packets, 3×3, 30
  subcarriers). 114→56 subcarrier interpolation via `subcarrier.rs`; validation split =
  subjects 33–40 held out (ADR-015 Phase 1).
- **Acceptance tiers:** `wifi-densepose-train/src/ruview_metrics.rs` —
  PCK@0.2 / OKS / MOTA / vitals rolled into `RuViewTier`
  (Fail/Bronze/Silver/Gold) (ADR-145 §1.1).
- **Ablation harness (ADR-145):** 6-variant matrix (`csi_only`, `cir_only`,
  `csi_plus_cir`, `plus_doppler`, `plus_bfld`, `plus_uwb`-skipped), each variant
  producing acceptance tier + `SpecMetrics` (presence ≥0.90, localization ≤0.50 m,
  activity ≥0.70, FP ≤0.05, FN ≤0.10), `LatencyProfile` (p95 ≤100 ms), and
  `PrivacyLeakage` (MIA `leakage_score` ≤0.05), SHA-256-pinned per variant under
  `PROOF_SEED=42` (ADR-145 §2.2–2.6). Built at commit `0f336b7d3` (ADR-145
  implementation status); CLI auto-mode wiring is pending.
- **Cross-environment:** ADR-027 MERIDIAN `CrossDomainEvaluator`
  (`wifi-densepose-train/src/eval.rs`) — `domain_gap_ratio`, extended by ADR-145
  `cross_room_degradation()` with a 17-joint PCK-delta heatmap.

### 1.5 L4 — Hardware-in-loop

- Real CSI nodes: ESP32-S3 on **COM9**, ESP32-C6 + MR60BHA2 on **COM12** (`CLAUDE.md`
  hardware table). ADR-018 binary frame protocol over UDP:5005 (ADR-028 §3.2/§3.4).
- ADR-145 Tier-4 test (gated, `#[cfg(feature = "hardware-test")]`): replay a live 30 s
  COM9 capture through `csi_only` and `csi_plus_cir`; assert no presence regression and
  p95 < 100 ms.
- A/B board protocol precedent (`CHANGELOG.md` #987): fixed vs unmodified control board
  against Apple-Watch ground truth (control pegged 40–49 BPM; fixed 88–91 vs 87 GT) —
  this fixed-board/control-board + external ground-truth pattern is the required design
  for all hardware vital-sign claims.
- Witness bundle pins firmware: per-file SHA-256 of all sources + release binaries
  (`generate-witness-bundle.sh` step 5).

### 1.6 L5 — Field trials

Live multi-node sessions captured as JSONL/JSON with summary statistics —
`benchmark_baseline.json` (§1.3) is the existing exemplar. ADR-171 §6 adds the seeded
`evals/` episode harness (Stage 1 kinematic full-matrix, Stage 2 Gazebo/PX4 SITL on the
3 median seeds) for the swarm domain.

---

## 2. Beyond-SOTA Acceptance Criteria per Capability Axis

A claim is "beyond SOTA" only with: a named external baseline, an exact metric and
protocol match, the dataset/split named, the threshold pre-registered, and the
statistical procedure of §3 followed. Current axes with measured status:

| Axis | Metric (exact) | Dataset / protocol | SOTA baseline | Beyond-SOTA threshold | Measured status (cited) |
|---|---|---|---|---|---|
| In-domain pose accuracy | torso-PCK@20: `‖pred−gt‖ ≤ 0.2·‖R-shoulder−L-hip‖` | MM-Fi `random_split` (ratio 0.8, seed 0) | MultiFormer **72.25%** (Table VII); CSI2Pose 68.41% | > 72.25% with 95% CI lower bound above it | Flagship **83.59%**; micro (75,237 params) **74.30%** (`docs/benchmarks/wifi-pose-efficiency-frontier.md`) |
| Edge efficiency frontier | torso-PCK@20 at deployed precision + params + batch-1 latency | same | MultiFormer 72.25% at full size | Pareto-dominance: smaller **and** above 72.25% at the deployed precision | int8 73.5 KB **74.70%**; int4-QAT 36.7 KB **74.46%**; shipped int4 verified **74.08%**, 0.135 ms 1-thread x86 (same file) |
| Cross-subject generalization | torso-PCK@20, official MM-Fi cross-subject split (256,608 train / 64,152 test) | leakage-free split | own zero-shot baseline 63.99% | ADR-150 §4 gate: **+≥6 pts cross-subject without losing >2 pts random-split** | Best zero-shot **64.92%** (mixup+TTA+3-seed); gate judged unreachable without new capture (ADR-150 §3.2) |
| Few-shot calibration (deployment) | PCK@20 after K labeled in-room samples; adapter size | MM-Fi cross-subject & cross-environment splits | zero-shot (64% / 10.6%) | SOTA-level (≳72%) from ≤200 samples with ≤~11 KB per-room adapter | cross-subject ~**72%** @100–200 samples (3 seeds); cross-env **10.6→73.1%** @200, 60.1% @5 (ADR-150 §3.5–3.6) |
| Swarm SAR localization | CEP50/CEP95 (m), GDOP-stratified | seeded episode distribution (ADR-171 §6), not single geometry | Wi2SAR **5 m** (arxiv 2604.09115, paper-to-paper) | CEP50 < 5 m, IQM over ≥10 seeds, 95% CI excluding 5 m | 1.732 m single synthetic geometry — graded **Low–Medium**, not yet claimable (ADR-171 §7) |
| Swarm coverage | coverage-rate@240 s; time-to-95% | episode rollouts | Wi2SAR 160k m²/13.5 min | rollout (not analytic) mean+CI beating baseline | 223 s is an analytic estimate — graded **Low** (ADR-171 §7) |
| Control-loop latency | criterion wall-clock | local hardware, named | 10 ms / 100 Hz budget | all stages ≪ budget | 3.3 µs MARL / 43 µs RRT-APF / 54 ns fusion / 248 µs PPO (ADR-171 §4.3) |
| World-model trajectory | MDE (m) at 5-frame horizon | RuView CSI-derived occupancy | pre-fine-tune random-weight baseline 9.49 m MDE | **≤1.0 m (2.0 vox)** at 5-frame horizon (ADR-147 §5 target, cited in benchmark-proof §4) | 9.49 m / FDE 16.23 m random weights; 208.45 ms median latency on real CSI (ADR-168-benchmark-proof §4, §7) |
| Privacy leakage | MIA `leakage_score = 2·(AUC−0.5)` | fixed replay, fixed-seed shadow classifier | chance (0) | ≤ **0.05** (attacker AUC ≤ 0.525) | gate defined, harness built (ADR-145 §2.3) |
| Vitals (hardware) | BPM error vs wearable ground truth | live A/B board protocol | control board behavior | within physiological agreement of ground truth, stable spread | 88–91 BPM vs 87 GT, spread 59→0 (CHANGELOG #987) |

### Claim-language discipline (from ADR-171 §7 grading)

| Evidence | Permitted language |
|---|---|
| Single run / single geometry / analytic estimate | "directional", never "beats SOTA" |
| Seeded multi-run with CIs vs paper baseline | "exceeds the published X result paper-to-paper" |
| Same metric, same split, same protocol, CI excludes baseline | "beyond SOTA on <dataset>/<split>" |
| No public leaderboard exists (swarm CSI-SAR) | never claim "leaderboard standing" (ADR-171 §3) |

---

## 3. Statistical Procedure for Honest Claims

Adopted from ADR-171 §5 (Agarwal 2021 / Gorsane 2022 standard) and the practices
already used in ADR-150/efficiency-frontier measurements:

1. **Seeds.** ≥10 independent seeds for RL/episodic claims (ADR-171 §5); ≥3 seeds
   minimum for supervised dataset evals (ADR-150 §3.5 used 3 seeds; report all).
   Training seeds, eval seeds, and split files are versioned and committed.
2. **Aggregate.** IQM (not mean/median) for episodic metrics + performance profiles;
   for dataset accuracy report mean across seeds with each seed's value listed.
3. **Confidence intervals.** 95% stratified bootstrap, 1,000 resamples (ADR-171 §5;
   reference impl: `rliable`).
4. **Paired comparisons.** When comparing model A vs B (e.g. `csi_plus_cir` vs
   `csi_only`, or ours vs a reproduced baseline), evaluate both on the **identical
   frozen test frames** and use a paired bootstrap over per-sample correctness
   (PCK hit/miss is per-joint binary — pair at the joint-sample level). For
   paper-to-paper comparisons where the baseline cannot be re-run, state so
   explicitly ("paper-to-paper", ADR-171 §2) and require the CI lower bound to clear
   the published point value.
5. **Pre-registration.** The threshold lives in an ADR **before** the run
   (precedent: ADR-150 §4 gate written before §3.2 measurements; the measurements
   honestly reported the gate as not met).
6. **Negative results are recorded.** ADR-150 §1/§3.2 keeps DANN-failed,
   capacity-hurts, and KD-didn't-help results in the record — required practice.
7. **Eval episodes (swarm):** 50 fixed, versioned episodes per policy
   (10 victim layouts × 5 CSI-noise levels), ≥3 baselines (random walk,
   boustrophedon+triangulation, IPPO) (ADR-171 §5).
8. **GDOP stratification** for any localization claim, so geometry artifacts cannot
   produce the headline (ADR-171 §6.3).

---

## 4. Regression-Gate Design (CI Enforcement)

### 4.1 Three gate classes, three tolerances

| Gate class | Source of truth | Tolerance | On breach |
|---|---|---|---|
| Determinism hashes | `expected_features.sha256`, `expected_cir_features.sha256`, `expected_calibration_features.sha256`, future `expected_ablation_<slug>.sha256` | **exact (0%)** | exit 1 = FAIL; exit 2 = SKIP only for placeholder hashes (proof.rs `0/1/2` convention, ADR-145 §2.4) |
| Accuracy / quality metrics | per-variant canonical bytes, quantized 1e-3 (ADR-145 §2.6) | exact after quantization | FAIL CI; tier change requires ADR amendment |
| Latency / throughput | criterion estimates JSON | **% tolerance per scale** (below) | FAIL on regression beyond tolerance; trend everything |

### 4.2 Criterion baseline file (replaces the current gap)

Today criterion numbers live in prose (ADR-168-benchmark-proof, ADR-171 §4.3,
CHANGELOG). Formalize:

1. `cargo bench --workspace -- --save-baseline main` on a **named, fixed runner**
   (ADR-147 used RTX 5080 / specific host; record host + toolchain in the file).
2. Export `target/criterion/*/estimates.json` point estimates into a committed
   `v2/benchmarks/criterion-baseline.json`: `{bench_id, crate, p50_ns, host, commit}`.
3. CI compares new runs against it with scale-aware tolerance — wall-clock noise is
   proportionally larger at small magnitudes:

| Magnitude | Tolerance | Rationale |
|---|---|---|
| < 1 µs (e.g. fusion 54 ns, privacy decide <50 ns target) | ±25% | timer/jitter dominated |
| 1 µs – 1 ms (MARL 3.3 µs, RRT-APF 43 µs, PPO 248 µs) | ±15% | criterion CI typically <5%, leave CI-runner headroom |
| > 1 ms (engine cycle vs 50 ms budget, OccWorld ~209 ms) | ±10% **and** absolute budget (50 ms / 500 ms ADR-147 §6) | budgets are the contract |

4. Hard in-source acceptance thresholds remain authoritative regardless of baseline:
   sketch ≥8× (`sketch_bench.rs`), prefilter ≥4× (`aether_prefilter_bench.rs`),
   nvsim ≥1 kHz (`pipeline_throughput.rs`), MQTT header targets, ADR-145 p95 ≤100 ms.
5. Latency stays **out of determinism hashes** (ADR-145 §2.6) but **in** the trended
   `summary.json`, so sub-threshold drift is visible (ADR-145 §3.2 mitigation).

### 4.3 Live-capture baseline gate (`benchmark_baseline.json`)

Adopt the file as the L5 regression anchor with documented provenance, then gate a
re-capture of the same scenario (same 2-node placement, same room class) against the
summary block:

| Field | Baseline | Suggested gate |
|---|---:|---|
| `presence_ratio` | 0.9336 | ≥ 0.90 for an occupied-room session |
| `confidence_mean` | 0.6433 | within ±0.10 |
| `kp_spread_std` | 4.52 | ≤ 2× baseline (skeleton stability) |
| `person_count_changes` | 10 / 1,566 frames | ≤ 2× baseline (count flapping — see CHANGELOG #803/#894 clamp bugs this metric would have caught) |

Field-trial gates are **soft** (warn + require human sign-off), never auto-merge
blockers — environments differ; the gate exists to force an explanation.

### 4.4 Wiring

Pre-merge (`CLAUDE.md` checklist): L0 + L1. Nightly: L2 criterion + ADR-145 Tier-3
ablation matrix (minutes-scale, ADR-145 §3.2). Release: full witness bundle +
`VERIFY.sh` + L4 on real COM-port hardware (`CLAUDE.md` firmware rule 6/7).

---

## 5. Reproducibility & External-Witness Requirements

Anyone outside the project must be able to re-run every claimed result:

1. **One command per layer.** `cargo test --workspace --no-default-features`;
   `python archive/v1/data/proof/verify.py`; `bash scripts/generate-witness-bundle.sh`
   then `bash VERIFY.sh` inside the bundle; per ADR-150 §4 every accuracy result needs
   "one-command reproduction" (efficiency frontier publishes its exact command:
   `python aether-arena/staging/train_efficiency_pareto.py npy/X.npy npy/Y.npy npy/split_random.npy`).
2. **Pinned numerical environment.** The Python proof requires single-threaded BLAS
   (`OMP_NUM_THREADS=1`, `OPENBLAS_NUM_THREADS=1`, `MKL_NUM_THREADS=1`,
   `VECLIB_MAXIMUM_THREADS=1`, `NUMEXPR_NUM_THREADS=1`) and 6-decimal quantization
   (`HASH_QUANTIZATION_DECIMALS=6`) — the #560 fix in `CHANGELOG.md`; Rust proof
   runners use coarse u16 quantization at 1e-3 in natural order
   (`calibration_proof_runner.rs` pattern, ADR-145 §2.6) for libm portability.
3. **Seeds are constants, committed:** `PROOF_SEED=42`, `MODEL_SEED=0`
   (`proof.rs`, ADR-015 Phase 5); dataset splits committed as `.npy`
   (`split_random.npy`); swarm configs as versioned YAML with all seeds (ADR-171 §5).
4. **Artifacts carry hashes.** Published model artifacts include SHA-256 (HuggingFace
   `pose_micro_int4.npz`, sha256 `c03eeb…` — efficiency-frontier doc); witness bundle
   has a `MANIFEST.sha256` over every file; provenance fields
   (`replay_sha256`, `model_sha256`, `calibration_version`, `privacy_mode`) are bound
   into ablation proof hashes (ADR-145 §2.7) so a metric cannot be quoted without its
   exact model + calibration + privacy decision.
5. **Hardware claims name the hardware.** ADR-147 records RTX 5080 / CUDA 12.8 /
   PyTorch 2.10.0; nvsim states the Cortex-A53 scaling caveat in the bench header;
   efficiency-frontier flags ARM validation as pending. Copy this discipline.
6. **Witness rows.** Every new proof gains rows in `docs/WITNESS-LOG-028.md`
   (ADR-145 §5.3 adds W-39…W-41) and the bundle's `source-hashes.txt`.
7. **Secret hygiene in evidence.** Bundle logs pass through
   `scripts/redact-secrets.py` (ADR-110 wave-5 incident note in
   `generate-witness-bundle.sh` step 4) — external evidence must never embed `.env`.

---

## 6. Known Measurement Pitfalls (WiFi-sensing specific)

| # | Pitfall | Repo evidence | Mitigation in this methodology |
|---|---|---|---|
| 1 | **Subject leakage / split optimism.** In-domain `random_split` has temporal/subject-adjacency effects; the same model family scores 83.6% random-split but ~11.6% torso-PCK on the leakage-free cross-subject split | efficiency-frontier "Controlled claim" footnote; ADR-150 §1, §3.2 | Always report the split name; publish random-split and cross-subject numbers side by side; cross-subject claims only on the official split |
| 2 | **Per-environment overfitting.** Zero-shot cross-environment collapses to 10.6%; subject-scaling saturates ~63.7% past 16–20 subjects because the residual is room/device shift | ADR-150 §3.3, §3.6 | Cross-room degradation + 17-joint heatmap in every ablation (ADR-145 §2.5); claim deployment accuracy only with the calibration protocol stated (K samples, adapter size) |
| 3 | **Mock-mode contamination.** Mock firmware missed a real Kconfig threshold bug; the nn crate ships a `mock_inference` criterion group that must never be quoted as pipeline performance | `CLAUDE.md` firmware rule 7; `inference_bench.rs` `bench_mock_inference` | L4 mandatory before firmware release ("Always test with real WiFi CSI, not mock mode"); label mock benches in reports; ADR-147 §7 re-ran the benchmark on real CSI explicitly "no mocks" |
| 4 | **Single-run point estimates.** 1.732 m localization from one synthetic geometry; 223 s coverage from an analytic formula | ADR-171 §1, §7 | §3 seed/CI protocol; evidence-grade table before publication |
| 5 | **Random-weight / untrained baselines read as results.** OccWorld MDE 9.49 m is a pre-fine-tuning random-weight reading | ADR-168-benchmark-proof §4 | Label baseline-vs-target explicitly; never aggregate untrained-model numbers into capability claims |
| 6 | **Latency conflated with quality.** Criterion µs numbers prove no compute bottleneck, nothing about accuracy | ADR-171 §2, §4.3 | L2 is gate-only; quality claims live in L3+ |
| 7 | **Floating-point nondeterminism breaking proofs.** SciPy FFT SIMD reordering + multithreaded BLAS produced different hashes across CI microarchitectures | CHANGELOG #560; `calibration_proof_runner.rs` lines 1–13 (cited in ADR-145 §2.3) | Quantize before hashing; pin thread env vars; exclude wall-clock from hashes |
| 8 | **Hash churn without procedure.** Three distinct historical values of the proof hash exist (`8c0680d7…` ADR-028, `667eb054…` CHANGELOG #560, `f8e76f21…` current file) | cited files | Every regeneration via `--generate-hash` + re-verify + CHANGELOG entry + witness bundle refresh |
| 9 | **Aggregation bugs masking accuracy.** Person count clamped to 1 by EMA mapping; eigenvalue path leaking counts up to 10; both invisible to unit tests for months | CHANGELOG #803, #894 | L5 summary gates on `person_count_changes`/count distributions; convergence tests replaying the live loop |
| 10 | **Stale verification claims.** `VERIFY.sh` prints hardcoded "(8/8)" over 10 actual checks; `CLAUDE.md` says "7/7" | `generate-witness-bundle.sh` line 293; `CLAUDE.md` | Compute the verdict count; audit doc claims against scripts each release |
| 11 | **Licensing limits on the eval set.** MM-Fi is CC BY-NC — weights trained solely on it cannot back commercial claims | ADR-015 Consequences | Track dataset license alongside every published number |

---

## 7. Gap List (what must be built to fully execute this methodology)

| Gap | Owner layer | Source |
|---|---|---|
| Machine-readable criterion baseline (`v2/benchmarks/criterion-baseline.json`) + CI comparison job | L2 | §4.2 (numbers currently only in ADR prose) |
| Provenance + producer script for `benchmark_baseline.json`; soft-gate job | L5 | §1.3, §4.3 (zero code references today) |
| `ruview-cli --ablation mode=auto` wiring + `expected_ablation_<slug>.sha256` (currently placeholders → exit 2) | L3 | ADR-145 implementation status |
| Seeded swarm `evals/` harness + `evals/RESULTS.md` internal leaderboard | L3/L5 | ADR-171 §6, §8 open issues |
| Fix `VERIFY.sh` hardcoded verdict count; reconcile `CLAUDE.md` "7/7" | L1 | §1.2 |
| Curated paired room-A/room-B labeled replay set (frozen, SHA-pinned, never trained on) | L3 | ADR-145 §3.2 |
| ARM/edge on-device latency validation for the int4 model (x86-only today) | L4 | efficiency-frontier doc ("Pi fleet pending") |
| Bench validation of the antenna-placement matrix on real hardware | L4 | PRODUCTION-ROADMAP.md Tier 2.3 |

---

## Update — falsifiable occupancy benchmark implemented

`wifi-densepose-train::occupancy_bench` (added this branch) makes the
presence/person-count claim **falsifiable in code**, directly enforcing the L3
discipline above. It grades predictions vs ground truth and gates a SOTA claim
behind a single `claim_allowed` invariant that requires **all** of:

1. `DataProvenance::Measured` — synthetic/mock data is scorable for regression
   but **never claimable** (anti-mock-contamination; the CLAUDE.md Kconfig-bug
   lesson made structural).
2. A leak-free `EvalSplit` — `validate()` refuses any split where a subject *or*
   environment id appears in both train and test (subject leakage / per-env
   overfitting).
3. `n_test ≥ min_test_samples` (small-N guard).
4. Presence F1 whose **bootstrap-CI lower bound** (deterministic splitmix64,
   seeded) clears the threshold — not the point estimate.
5. Count MAE within threshold.

The claim string is unreadable except through the gate (returns `NO_CLAIM`
otherwise) — same discipline as the `ruview-gamma` acceptance gate. 10 tests
cover each refusal path. What remains is *data*, not *method*: feed it a frozen,
SHA-pinned, subject/environment-disjoint **measured** replay set (the curated
room-A/room-B item above) and the "beyond SOTA" claim becomes a passing or
failing test, not a slogan.

---

*All values cited from: `benchmark_baseline.json`, `v2/crates/*/benches/*.rs` (15
files), `docs/adr/ADR-168-benchmark-proof.md`,
`docs/adr/ADR-171-swarm-benchmarking-evaluation-methodology.md`,
`docs/adr/ADR-145-ablation-eval-harness-privacy-leakage.md`,
`docs/adr/ADR-028-esp32-capability-audit.md`,
`docs/adr/ADR-015-public-dataset-training-strategy.md`,
`docs/adr/ADR-150-rf-foundation-encoder.md`,
`docs/benchmarks/wifi-pose-efficiency-frontier.md`,
`scripts/generate-witness-bundle.sh`, `archive/v1/data/proof/verify.py`,
`archive/v1/data/proof/expected_features.sha256`, `CHANGELOG.md`, `CLAUDE.md`,
`docs/research/sota-2026-05-22/PRODUCTION-ROADMAP.md`.*