30 KiB
Beyond-SOTA Validation, Test & Benchmark Methodology
Series: docs/research/ruview-beyond-sota/ · Document 03
Date: 2026-06-09
Scope: How RuView proves (and gates) beyond-SOTA claims using the verification
infrastructure that already exists in this repository. Every number below is sourced
from a cited file in this repo; nothing is invented.
1. The Layered Validation Pyramid
Six layers, cheapest/most-deterministic at the bottom, most expensive/most-credible at the top. A beyond-SOTA claim must survive every layer below it before it may be published from the layer it lives at.
| Layer | What it proves | Tooling | Frequency | Determinism |
|---|---|---|---|---|
| L0 Unit/integration tests | Code correctness | cargo test --workspace --no-default-features + pytest |
per commit | exact |
| L1 Deterministic proof + witness bundle | Pipeline is real, unchanged, reproducible | archive/v1/data/proof/verify.py, scripts/generate-witness-bundle.sh |
per merge / release | exact (SHA-256) |
| L2 Criterion micro-benchmarks | Compute latency only — never quality (ADR-149 §2) | 15 bench targets across v2/crates/*/benches/ |
nightly / pre-release | statistical |
| L3 Dataset-level accuracy eval | Pose/presence/vitals quality vs published SOTA | MM-Fi / Wi-Pose (ADR-015), ruview_metrics.rs tiers, ADR-145 ablation harness |
per model release | seeded |
| L4 Hardware-in-loop | Real CSI on real ESP32, no mocks | COM9 (S3) / COM12 (C6) protocol, witness firmware hashes | per firmware release | A/B controlled |
| L5 Field trials / live capture | End-to-end behavior in a real room | live-session captures (e.g. benchmark_baseline.json) |
campaign | statistical |
1.1 L0 — Workspace tests (current counts)
- ADR-028 audit (2026-03-01): 1,031 passed, 0 failed, 8 ignored for
cargo test --workspace --no-default-features(docs/adr/ADR-028-esp32-capability-audit.md§2). - Current
CHANGELOG.md(Unreleased, cross-platform fix entry): 2,682 workspace tests pass / 0 fail on Windows — the suite has more than doubled since the audit. CLAUDE.mdpre-merge gate still cites "1,031+ passed, 0 failed" as the floor.
Rule: the post-change test count may never be lower than the pre-change count, and
failures must be 0. The witness bundle records the full log
(test-results/rust-workspace-tests.log) and an aggregated summary.txt
(scripts/generate-witness-bundle.sh step 3).
1.2 L1 — Deterministic proof ("Trust Kill Switch") + witness bundle
archive/v1/data/proof/verify.py (header comment): feeds 1,000 synthetic CSI frames
(seed=42, sample_csi_data.json) through the production CSIProcessor
(src/core/csi_processor.py), hashes the first 100 frames' feature output
(VERIFICATION_FRAME_COUNT = 100), and compares against
archive/v1/data/proof/expected_features.sha256.
- Current published hash (file contents, verified during this investigation):
f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a - The hash is environment-coupled and has been legitimately regenerated before:
ADR-028 §5.3 recorded
8c0680d7…under numpy 2.4.2/scipy 1.17.1;CHANGELOG.md(#560 fix) recorded667eb054…after 6-decimal quantization + single-thread BLAS pinning (OMP_NUM_THREADS=1etc.). Each regeneration must follow the documented procedure:python verify.py --generate-hashthenpython verify.py→VERDICT: PASS.
scripts/generate-witness-bundle.sh packages: witness log + ADR-028, the Python proof
(verify.py + expected hash + reference-signal metadata), full Rust test log + summary,
the ADR-134 CIR proof, firmware source/binary SHA-256s, crate version manifest, npm
tarball SHA-256, and a recipient-side VERIFY.sh.
Accuracy note on check counts: CLAUDE.md describes the recipient verification as
"7/7 PASS"; the current VERIFY.sh embedded in the script performs 10 check()
assertions (witness log, ADR, proof-hash file, tests, firmware hashes, crate manifest,
npm manifest, Python proof, CIR proof, CIR hash file) but prints a hardcoded
"ALL CHECKS PASSED (8/8)" string (generate-witness-bundle.sh line 293). The
hardcoded count is stale relative to the actual check list — fix it to print
${PASS_COUNT}/${PASS_COUNT+FAIL_COUNT} so the verdict can never silently desynchronize
from the check inventory.
1.3 L2 — Criterion micro-benchmark inventory (all 15 targets)
All bench sources read directly. Per ADR-149 §2 these are latency regression gates only, never quality evidence.
| Bench target | Crate | Benchmark functions / groups | What it measures | Recorded value or in-source target (citation) |
|---|---|---|---|---|
engine_cycle.rs |
wifi-densepose-engine | process_cycle_4nodes_56sc |
One full StreamingEngine::process_cycle (fuse + quality + calibration provenance + privacy gate + WorldGraph node), 4-node/56-subcarrier ESP32-S3 HT20 mesh |
Budget: 50 ms (20 Hz) — bench header |
signal_bench.rs |
wifi-densepose-signal | CSI Preprocessing, Phase Sanitization, Feature Extraction, Motion Detection, Full Pipeline |
SOTA signal stages (ADR-014) at varying frame sizes | no recorded baseline |
cir_bench.rs |
wifi-densepose-signal | cir_estimate (HT20/HT40/HE20/HE40), cir_estimate_12link, cir_estimator_new |
ADR-134 CirEstimator::estimate() per tier; 12-link multistatic amortization; cold-start |
no recorded baseline |
calibration_bench.rs |
wifi-densepose-signal | bench_recorder_record, bench_recorder_finalize, bench_deviation, bench_record_600, bench_to_bytes (K=52/114/242/484) |
ADR-135 empty-room baseline recorder + deviation scoring | no recorded baseline |
aether_prefilter_bench.rs |
wifi-densepose-signal | aether_search_d…_n…_k… (search vs prefilter) |
ADR-084 Pass-2: EmbeddingHistory::search_prefilter vs brute force, prefilter_factor=8 |
Pass: ≥4× at n=1024 — bench header |
sketch_bench.rs |
wifi-densepose-ruvector | compare_d128/256/512 × float_l2/float_cosine/sketch_hamming |
ADR-084 sketch-vs-float per-pair compare cost (AETHER 128-d, spectrogram 256-d) | Pass: sketch ≥8× faster at every dim (ADR-084 threshold 8×–30×) — bench header |
crv_bench.rs |
wifi-densepose-ruvector | gestalt_classify_single/batch_100, sensory_encode_single, pipeline_full_session, convergence_two_sessions, crv_session_create, crv_embedding_dimension_scaling (32/128/384), crv_stage_vi_partition |
CRV integration throughput | no recorded baseline |
inference_bench.rs |
wifi-densepose-nn | tensor_ops (relu/sigmoid/tanh), densepose_inference, translator_inference, mock_inference, batch_inference |
NN forward-pass cost by input/batch size | no recorded baseline; mock_inference group must never be quoted as a pipeline number (§6) |
training_bench.rs |
wifi-densepose-train | interp_114_to_56_batch32, interp_scaling, compute_interp_weights_114_56, synthetic_dataset_get, synthetic_epoch, config_validate, PCK over 100 samples |
Training preprocessing + metrics hot paths; fixtures fully deterministic (no rand) — header |
no recorded baseline |
detection_bench.rs |
wifi-densepose-mat | breathing_detection, heartbeat_detection, movement_classification, detection_pipeline, localization (triangulation/depth), alert generation |
MAT survivor-detection algorithms at varying signal lengths / noise | no recorded baseline |
transport_bench.rs |
wifi-densepose-hardware | beacon_serialize_16byte/28byte_auth/quic_framed, auth_beacon_verify, replay_window, framed_message encode/decode, secure_tdm_cycle (manual vs QUIC) |
TDM beacon crypto + transport | no recorded baseline |
mqtt_throughput.rs |
wifi-densepose-sensing-server | discovery::build_*, state::*, rate_limiter::allow_*, privacy::decide_*, semantic::bus_tick_all_10_primitives |
ADR-115 MQTT hot path | Targets (header): discovery <5 µs, state encode <2 µs, rate limit <100 ns, privacy <50 ns, bus tick <10 µs |
swarm_bench.rs |
ruview-swarm | marl_actor_inference, rrt_apf_100iter, multiview_fusion_3drones, demo_coverage_estimate, ppo_update_64transitions |
ADR-148 swarm control-loop compute | Measured: 3.3 µs / 43 µs / 54–58.5 ns / 100 ps / 248 µs (ADR-149 §4.3; CHANGELOG.md Performance section) |
pipeline_throughput.rs |
nvsim | pipeline_run (sample-count sweep), witness::run vs run_with_witness |
NV-diamond sim throughput + witness overhead | Acceptance: ≥1 kHz simulated samples/s on Cortex-A53-class CPU — bench header |
state_machine.rs |
homecore | set first/warm/no-op, get hit/miss, all_snapshot, all_by_domain_light_20_of_100, broadcast_fan_out |
HOMECORE state-machine hot paths | no recorded baseline |
Honest gap — benchmark_baseline.json is not a criterion baseline. The repo-root
benchmark_baseline.json (369.9 KB) contains 1,566 live-capture samples from a
2-node session (fields: tick, n_nodes, variance, motion, presence,
confidence, est_persons, n_persons_rendered, kp_spread, rssi) plus a summary
block — it records field-trial telemetry (L5), not micro-benchmark latencies.
No file in the repo references it (grep -rn benchmark_baseline → 0 hits outside the
file itself); its producer must be identified and committed (§5.3). Summary values
(all from the file's summary object):
| Metric | Baseline value |
|---|---|
total_frames |
1,566 |
presence_ratio |
0.9336 (1,462/1,566 frames presence-true) |
confidence_mean |
0.6433 |
variance_mean / variance_std |
109.36 / 154.13 |
kp_spread_mean / kp_spread_std |
86.73 / 4.52 |
person_count_changes |
10 |
Criterion latencies that have been recorded live in ADR documents instead (ADR-147-benchmark-proof.md, ADR-149 §4.3, CHANGELOG Performance) — §5 below defines how to consolidate them into a real machine-readable criterion baseline.
1.4 L3 — Dataset-level accuracy evaluation
- Datasets (ADR-015): primary MM-Fi (40 subjects × 27 actions × ~320K frames,
1TX×3RX, 114 subcarriers @100 Hz, 17-keypoint COCO + DensePose UV, CC BY-NC 4.0);
secondary Wi-Pose (12 volunteers × 12 actions × 166,600 packets, 3×3, 30
subcarriers). 114→56 subcarrier interpolation via
subcarrier.rs; validation split = subjects 33–40 held out (ADR-015 Phase 1). - Acceptance tiers:
wifi-densepose-train/src/ruview_metrics.rs— PCK@0.2 / OKS / MOTA / vitals rolled intoRuViewTier(Fail/Bronze/Silver/Gold) (ADR-145 §1.1). - Ablation harness (ADR-145): 6-variant matrix (
csi_only,cir_only,csi_plus_cir,plus_doppler,plus_bfld,plus_uwb-skipped), each variant producing acceptance tier +SpecMetrics(presence ≥0.90, localization ≤0.50 m, activity ≥0.70, FP ≤0.05, FN ≤0.10),LatencyProfile(p95 ≤100 ms), andPrivacyLeakage(MIAleakage_score≤0.05), SHA-256-pinned per variant underPROOF_SEED=42(ADR-145 §2.2–2.6). Built at commit0f336b7d3(ADR-145 implementation status); CLI auto-mode wiring is pending. - Cross-environment: ADR-027 MERIDIAN
CrossDomainEvaluator(wifi-densepose-train/src/eval.rs) —domain_gap_ratio, extended by ADR-145cross_room_degradation()with a 17-joint PCK-delta heatmap.
1.5 L4 — Hardware-in-loop
- Real CSI nodes: ESP32-S3 on COM9, ESP32-C6 + MR60BHA2 on COM12 (
CLAUDE.mdhardware table). ADR-018 binary frame protocol over UDP:5005 (ADR-028 §3.2/§3.4). - ADR-145 Tier-4 test (gated,
#[cfg(feature = "hardware-test")]): replay a live 30 s COM9 capture throughcsi_onlyandcsi_plus_cir; assert no presence regression and p95 < 100 ms. - A/B board protocol precedent (
CHANGELOG.md#987): fixed vs unmodified control board against Apple-Watch ground truth (control pegged 40–49 BPM; fixed 88–91 vs 87 GT) — this fixed-board/control-board + external ground-truth pattern is the required design for all hardware vital-sign claims. - Witness bundle pins firmware: per-file SHA-256 of all sources + release binaries
(
generate-witness-bundle.shstep 5).
1.6 L5 — Field trials
Live multi-node sessions captured as JSONL/JSON with summary statistics —
benchmark_baseline.json (§1.3) is the existing exemplar. ADR-149 §6 adds the seeded
evals/ episode harness (Stage 1 kinematic full-matrix, Stage 2 Gazebo/PX4 SITL on the
3 median seeds) for the swarm domain.
2. Beyond-SOTA Acceptance Criteria per Capability Axis
A claim is "beyond SOTA" only with: a named external baseline, an exact metric and protocol match, the dataset/split named, the threshold pre-registered, and the statistical procedure of §3 followed. Current axes with measured status:
| Axis | Metric (exact) | Dataset / protocol | SOTA baseline | Beyond-SOTA threshold | Measured status (cited) |
|---|---|---|---|---|---|
| In-domain pose accuracy | torso-PCK@20: ‖pred−gt‖ ≤ 0.2·‖R-shoulder−L-hip‖ |
MM-Fi random_split (ratio 0.8, seed 0) |
MultiFormer 72.25% (Table VII); CSI2Pose 68.41% | > 72.25% with 95% CI lower bound above it | Flagship 83.59%; micro (75,237 params) 74.30% (docs/benchmarks/wifi-pose-efficiency-frontier.md) |
| Edge efficiency frontier | torso-PCK@20 at deployed precision + params + batch-1 latency | same | MultiFormer 72.25% at full size | Pareto-dominance: smaller and above 72.25% at the deployed precision | int8 73.5 KB 74.70%; int4-QAT 36.7 KB 74.46%; shipped int4 verified 74.08%, 0.135 ms 1-thread x86 (same file) |
| Cross-subject generalization | torso-PCK@20, official MM-Fi cross-subject split (256,608 train / 64,152 test) | leakage-free split | own zero-shot baseline 63.99% | ADR-150 §4 gate: +≥6 pts cross-subject without losing >2 pts random-split | Best zero-shot 64.92% (mixup+TTA+3-seed); gate judged unreachable without new capture (ADR-150 §3.2) |
| Few-shot calibration (deployment) | PCK@20 after K labeled in-room samples; adapter size | MM-Fi cross-subject & cross-environment splits | zero-shot (64% / 10.6%) | SOTA-level (≳72%) from ≤200 samples with ≤~11 KB per-room adapter | cross-subject ~72% @100–200 samples (3 seeds); cross-env 10.6→73.1% @200, 60.1% @5 (ADR-150 §3.5–3.6) |
| Swarm SAR localization | CEP50/CEP95 (m), GDOP-stratified | seeded episode distribution (ADR-149 §6), not single geometry | Wi2SAR 5 m (arxiv 2604.09115, paper-to-paper) | CEP50 < 5 m, IQM over ≥10 seeds, 95% CI excluding 5 m | 1.732 m single synthetic geometry — graded Low–Medium, not yet claimable (ADR-149 §7) |
| Swarm coverage | coverage-rate@240 s; time-to-95% | episode rollouts | Wi2SAR 160k m²/13.5 min | rollout (not analytic) mean+CI beating baseline | 223 s is an analytic estimate — graded Low (ADR-149 §7) |
| Control-loop latency | criterion wall-clock | local hardware, named | 10 ms / 100 Hz budget | all stages ≪ budget | 3.3 µs MARL / 43 µs RRT-APF / 54 ns fusion / 248 µs PPO (ADR-149 §4.3) |
| World-model trajectory | MDE (m) at 5-frame horizon | RuView CSI-derived occupancy | pre-fine-tune random-weight baseline 9.49 m MDE | ≤1.0 m (2.0 vox) at 5-frame horizon (ADR-147 §5 target, cited in benchmark-proof §4) | 9.49 m / FDE 16.23 m random weights; 208.45 ms median latency on real CSI (ADR-147-benchmark-proof §4, §7) |
| Privacy leakage | MIA leakage_score = 2·(AUC−0.5) |
fixed replay, fixed-seed shadow classifier | chance (0) | ≤ 0.05 (attacker AUC ≤ 0.525) | gate defined, harness built (ADR-145 §2.3) |
| Vitals (hardware) | BPM error vs wearable ground truth | live A/B board protocol | control board behavior | within physiological agreement of ground truth, stable spread | 88–91 BPM vs 87 GT, spread 59→0 (CHANGELOG #987) |
Claim-language discipline (from ADR-149 §7 grading)
| Evidence | Permitted language |
|---|---|
| Single run / single geometry / analytic estimate | "directional", never "beats SOTA" |
| Seeded multi-run with CIs vs paper baseline | "exceeds the published X result paper-to-paper" |
| Same metric, same split, same protocol, CI excludes baseline | "beyond SOTA on /" |
| No public leaderboard exists (swarm CSI-SAR) | never claim "leaderboard standing" (ADR-149 §3) |
3. Statistical Procedure for Honest Claims
Adopted from ADR-149 §5 (Agarwal 2021 / Gorsane 2022 standard) and the practices already used in ADR-150/efficiency-frontier measurements:
- Seeds. ≥10 independent seeds for RL/episodic claims (ADR-149 §5); ≥3 seeds minimum for supervised dataset evals (ADR-150 §3.5 used 3 seeds; report all). Training seeds, eval seeds, and split files are versioned and committed.
- Aggregate. IQM (not mean/median) for episodic metrics + performance profiles; for dataset accuracy report mean across seeds with each seed's value listed.
- Confidence intervals. 95% stratified bootstrap, 1,000 resamples (ADR-149 §5;
reference impl:
rliable). - Paired comparisons. When comparing model A vs B (e.g.
csi_plus_cirvscsi_only, or ours vs a reproduced baseline), evaluate both on the identical frozen test frames and use a paired bootstrap over per-sample correctness (PCK hit/miss is per-joint binary — pair at the joint-sample level). For paper-to-paper comparisons where the baseline cannot be re-run, state so explicitly ("paper-to-paper", ADR-149 §2) and require the CI lower bound to clear the published point value. - Pre-registration. The threshold lives in an ADR before the run (precedent: ADR-150 §4 gate written before §3.2 measurements; the measurements honestly reported the gate as not met).
- Negative results are recorded. ADR-150 §1/§3.2 keeps DANN-failed, capacity-hurts, and KD-didn't-help results in the record — required practice.
- Eval episodes (swarm): 50 fixed, versioned episodes per policy (10 victim layouts × 5 CSI-noise levels), ≥3 baselines (random walk, boustrophedon+triangulation, IPPO) (ADR-149 §5).
- GDOP stratification for any localization claim, so geometry artifacts cannot produce the headline (ADR-149 §6.3).
4. Regression-Gate Design (CI Enforcement)
4.1 Three gate classes, three tolerances
| Gate class | Source of truth | Tolerance | On breach |
|---|---|---|---|
| Determinism hashes | expected_features.sha256, expected_cir_features.sha256, expected_calibration_features.sha256, future expected_ablation_<slug>.sha256 |
exact (0%) | exit 1 = FAIL; exit 2 = SKIP only for placeholder hashes (proof.rs 0/1/2 convention, ADR-145 §2.4) |
| Accuracy / quality metrics | per-variant canonical bytes, quantized 1e-3 (ADR-145 §2.6) | exact after quantization | FAIL CI; tier change requires ADR amendment |
| Latency / throughput | criterion estimates JSON | % tolerance per scale (below) | FAIL on regression beyond tolerance; trend everything |
4.2 Criterion baseline file (replaces the current gap)
Today criterion numbers live in prose (ADR-147-benchmark-proof, ADR-149 §4.3, CHANGELOG). Formalize:
cargo bench --workspace -- --save-baseline mainon a named, fixed runner (ADR-147 used RTX 5080 / specific host; record host + toolchain in the file).- Export
target/criterion/*/estimates.jsonpoint estimates into a committedv2/benchmarks/criterion-baseline.json:{bench_id, crate, p50_ns, host, commit}. - CI compares new runs against it with scale-aware tolerance — wall-clock noise is proportionally larger at small magnitudes:
| Magnitude | Tolerance | Rationale |
|---|---|---|
| < 1 µs (e.g. fusion 54 ns, privacy decide <50 ns target) | ±25% | timer/jitter dominated |
| 1 µs – 1 ms (MARL 3.3 µs, RRT-APF 43 µs, PPO 248 µs) | ±15% | criterion CI typically <5%, leave CI-runner headroom |
| > 1 ms (engine cycle vs 50 ms budget, OccWorld ~209 ms) | ±10% and absolute budget (50 ms / 500 ms ADR-147 §6) | budgets are the contract |
- Hard in-source acceptance thresholds remain authoritative regardless of baseline:
sketch ≥8× (
sketch_bench.rs), prefilter ≥4× (aether_prefilter_bench.rs), nvsim ≥1 kHz (pipeline_throughput.rs), MQTT header targets, ADR-145 p95 ≤100 ms. - Latency stays out of determinism hashes (ADR-145 §2.6) but in the trended
summary.json, so sub-threshold drift is visible (ADR-145 §3.2 mitigation).
4.3 Live-capture baseline gate (benchmark_baseline.json)
Adopt the file as the L5 regression anchor with documented provenance, then gate a re-capture of the same scenario (same 2-node placement, same room class) against the summary block:
| Field | Baseline | Suggested gate |
|---|---|---|
presence_ratio |
0.9336 | ≥ 0.90 for an occupied-room session |
confidence_mean |
0.6433 | within ±0.10 |
kp_spread_std |
4.52 | ≤ 2× baseline (skeleton stability) |
person_count_changes |
10 / 1,566 frames | ≤ 2× baseline (count flapping — see CHANGELOG #803/#894 clamp bugs this metric would have caught) |
Field-trial gates are soft (warn + require human sign-off), never auto-merge blockers — environments differ; the gate exists to force an explanation.
4.4 Wiring
Pre-merge (CLAUDE.md checklist): L0 + L1. Nightly: L2 criterion + ADR-145 Tier-3
ablation matrix (minutes-scale, ADR-145 §3.2). Release: full witness bundle +
VERIFY.sh + L4 on real COM-port hardware (CLAUDE.md firmware rule 6/7).
5. Reproducibility & External-Witness Requirements
Anyone outside the project must be able to re-run every claimed result:
- One command per layer.
cargo test --workspace --no-default-features;python archive/v1/data/proof/verify.py;bash scripts/generate-witness-bundle.shthenbash VERIFY.shinside the bundle; per ADR-150 §4 every accuracy result needs "one-command reproduction" (efficiency frontier publishes its exact command:python aether-arena/staging/train_efficiency_pareto.py npy/X.npy npy/Y.npy npy/split_random.npy). - Pinned numerical environment. The Python proof requires single-threaded BLAS
(
OMP_NUM_THREADS=1,OPENBLAS_NUM_THREADS=1,MKL_NUM_THREADS=1,VECLIB_MAXIMUM_THREADS=1,NUMEXPR_NUM_THREADS=1) and 6-decimal quantization (HASH_QUANTIZATION_DECIMALS=6) — the #560 fix inCHANGELOG.md; Rust proof runners use coarse u16 quantization at 1e-3 in natural order (calibration_proof_runner.rspattern, ADR-145 §2.6) for libm portability. - Seeds are constants, committed:
PROOF_SEED=42,MODEL_SEED=0(proof.rs, ADR-015 Phase 5); dataset splits committed as.npy(split_random.npy); swarm configs as versioned YAML with all seeds (ADR-149 §5). - Artifacts carry hashes. Published model artifacts include SHA-256 (HuggingFace
pose_micro_int4.npz, sha256c03eeb…— efficiency-frontier doc); witness bundle has aMANIFEST.sha256over every file; provenance fields (replay_sha256,model_sha256,calibration_version,privacy_mode) are bound into ablation proof hashes (ADR-145 §2.7) so a metric cannot be quoted without its exact model + calibration + privacy decision. - Hardware claims name the hardware. ADR-147 records RTX 5080 / CUDA 12.8 / PyTorch 2.10.0; nvsim states the Cortex-A53 scaling caveat in the bench header; efficiency-frontier flags ARM validation as pending. Copy this discipline.
- Witness rows. Every new proof gains rows in
docs/WITNESS-LOG-028.md(ADR-145 §5.3 adds W-39…W-41) and the bundle'ssource-hashes.txt. - Secret hygiene in evidence. Bundle logs pass through
scripts/redact-secrets.py(ADR-110 wave-5 incident note ingenerate-witness-bundle.shstep 4) — external evidence must never embed.env.
6. Known Measurement Pitfalls (WiFi-sensing specific)
| # | Pitfall | Repo evidence | Mitigation in this methodology |
|---|---|---|---|
| 1 | Subject leakage / split optimism. In-domain random_split has temporal/subject-adjacency effects; the same model family scores 83.6% random-split but ~11.6% torso-PCK on the leakage-free cross-subject split |
efficiency-frontier "Controlled claim" footnote; ADR-150 §1, §3.2 | Always report the split name; publish random-split and cross-subject numbers side by side; cross-subject claims only on the official split |
| 2 | Per-environment overfitting. Zero-shot cross-environment collapses to 10.6%; subject-scaling saturates ~63.7% past 16–20 subjects because the residual is room/device shift | ADR-150 §3.3, §3.6 | Cross-room degradation + 17-joint heatmap in every ablation (ADR-145 §2.5); claim deployment accuracy only with the calibration protocol stated (K samples, adapter size) |
| 3 | Mock-mode contamination. Mock firmware missed a real Kconfig threshold bug; the nn crate ships a mock_inference criterion group that must never be quoted as pipeline performance |
CLAUDE.md firmware rule 7; inference_bench.rs bench_mock_inference |
L4 mandatory before firmware release ("Always test with real WiFi CSI, not mock mode"); label mock benches in reports; ADR-147 §7 re-ran the benchmark on real CSI explicitly "no mocks" |
| 4 | Single-run point estimates. 1.732 m localization from one synthetic geometry; 223 s coverage from an analytic formula | ADR-149 §1, §7 | §3 seed/CI protocol; evidence-grade table before publication |
| 5 | Random-weight / untrained baselines read as results. OccWorld MDE 9.49 m is a pre-fine-tuning random-weight reading | ADR-147-benchmark-proof §4 | Label baseline-vs-target explicitly; never aggregate untrained-model numbers into capability claims |
| 6 | Latency conflated with quality. Criterion µs numbers prove no compute bottleneck, nothing about accuracy | ADR-149 §2, §4.3 | L2 is gate-only; quality claims live in L3+ |
| 7 | Floating-point nondeterminism breaking proofs. SciPy FFT SIMD reordering + multithreaded BLAS produced different hashes across CI microarchitectures | CHANGELOG #560; calibration_proof_runner.rs lines 1–13 (cited in ADR-145 §2.3) |
Quantize before hashing; pin thread env vars; exclude wall-clock from hashes |
| 8 | Hash churn without procedure. Three distinct historical values of the proof hash exist (8c0680d7… ADR-028, 667eb054… CHANGELOG #560, f8e76f21… current file) |
cited files | Every regeneration via --generate-hash + re-verify + CHANGELOG entry + witness bundle refresh |
| 9 | Aggregation bugs masking accuracy. Person count clamped to 1 by EMA mapping; eigenvalue path leaking counts up to 10; both invisible to unit tests for months | CHANGELOG #803, #894 | L5 summary gates on person_count_changes/count distributions; convergence tests replaying the live loop |
| 10 | Stale verification claims. VERIFY.sh prints hardcoded "(8/8)" over 10 actual checks; CLAUDE.md says "7/7" |
generate-witness-bundle.sh line 293; CLAUDE.md |
Compute the verdict count; audit doc claims against scripts each release |
| 11 | Licensing limits on the eval set. MM-Fi is CC BY-NC — weights trained solely on it cannot back commercial claims | ADR-015 Consequences | Track dataset license alongside every published number |
7. Gap List (what must be built to fully execute this methodology)
| Gap | Owner layer | Source |
|---|---|---|
Machine-readable criterion baseline (v2/benchmarks/criterion-baseline.json) + CI comparison job |
L2 | §4.2 (numbers currently only in ADR prose) |
Provenance + producer script for benchmark_baseline.json; soft-gate job |
L5 | §1.3, §4.3 (zero code references today) |
ruview-cli --ablation mode=auto wiring + expected_ablation_<slug>.sha256 (currently placeholders → exit 2) |
L3 | ADR-145 implementation status |
Seeded swarm evals/ harness + evals/RESULTS.md internal leaderboard |
L3/L5 | ADR-149 §6, §8 open issues |
Fix VERIFY.sh hardcoded verdict count; reconcile CLAUDE.md "7/7" |
L1 | §1.2 |
| Curated paired room-A/room-B labeled replay set (frozen, SHA-pinned, never trained on) | L3 | ADR-145 §3.2 |
| ARM/edge on-device latency validation for the int4 model (x86-only today) | L4 | efficiency-frontier doc ("Pi fleet pending") |
| Bench validation of the antenna-placement matrix on real hardware | L4 | PRODUCTION-ROADMAP.md Tier 2.3 |
Update — falsifiable occupancy benchmark implemented
wifi-densepose-train::occupancy_bench (added this branch) makes the
presence/person-count claim falsifiable in code, directly enforcing the L3
discipline above. It grades predictions vs ground truth and gates a SOTA claim
behind a single claim_allowed invariant that requires all of:
DataProvenance::Measured— synthetic/mock data is scorable for regression but never claimable (anti-mock-contamination; the CLAUDE.md Kconfig-bug lesson made structural).- A leak-free
EvalSplit—validate()refuses any split where a subject or environment id appears in both train and test (subject leakage / per-env overfitting). n_test ≥ min_test_samples(small-N guard).- Presence F1 whose bootstrap-CI lower bound (deterministic splitmix64, seeded) clears the threshold — not the point estimate.
- Count MAE within threshold.
The claim string is unreadable except through the gate (returns NO_CLAIM
otherwise) — same discipline as the ruview-gamma acceptance gate. 10 tests
cover each refusal path. What remains is data, not method: feed it a frozen,
SHA-pinned, subject/environment-disjoint measured replay set (the curated
room-A/room-B item above) and the "beyond SOTA" claim becomes a passing or
failing test, not a slogan.
All values cited from: benchmark_baseline.json, v2/crates/*/benches/*.rs (15
files), docs/adr/ADR-147-benchmark-proof.md,
docs/adr/ADR-149-swarm-benchmarking-evaluation-methodology.md,
docs/adr/ADR-145-ablation-eval-harness-privacy-leakage.md,
docs/adr/ADR-028-esp32-capability-audit.md,
docs/adr/ADR-015-public-dataset-training-strategy.md,
docs/adr/ADR-150-rf-foundation-encoder.md,
docs/benchmarks/wifi-pose-efficiency-frontier.md,
scripts/generate-witness-bundle.sh, archive/v1/data/proof/verify.py,
archive/v1/data/proof/expected_features.sha256, CHANGELOG.md, CLAUDE.md,
docs/research/sota-2026-05-22/PRODUCTION-ROADMAP.md.