wifi-densepose/benchmarks/edge-skills/RESULTS.md

Edge-Skill Synthetic-Ground-Truth Validation — RESULTS

Crate: v2/crates/wifi-densepose-wasm-edge (workspace-EXCLUDED; build from its own dir)
Branch: feat/edge-skills-synthetic-validation
ADR: ADR-160
Date: 2026-06-13
Harness: tests/synthetic_validation.rs

HONESTY BOUNDARY — read first. Everything below is synthetic-ground-truth validation: a signal is planted with a known answer, the real detector is run, and detection accuracy / precision / recall / rate-error is measured. This is NOT field accuracy. A skill that recovers a planted sinusoid here is proven to do the math it claims on a constructed signal; it is NOT proven to work on real CSI in a real room. Skills whose detection target cannot be honestly planted (clinical, weapon, affect, sleep-stage, sign-language) are NOT given a number — they are listed under DATA-GATED with the real data each would require.

Reproduce

cd v2/crates/wifi-densepose-wasm-edge   # workspace-excluded; build here
cargo test --features std --test synthetic_validation -- --nocapture
# also runs under the medical tier (med_* skills stay DATA-GATED, not validated):
cargo test --features std,medical-experimental --test synthetic_validation -- --nocapture

Each MEASURED-on-synthetic | … line printed by the harness is the source of the table below. Numbers are deterministic (no RNG; pseudo-noise uses a fixed LCG seed).
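
To illustrate why a fixed-seed LCG makes the numbers reproducible, here is a minimal sketch of such a generator. The `Lcg` type, the `pseudo_noise` helper, and the multiplier/increment (the widely published Numerical Recipes constants) are assumptions for illustration; the harness may use different constants and scaling.

```rust
/// Hypothetical fixed-seed linear congruential generator.
/// Constants are the Numerical Recipes ones, assumed for illustration only.
struct Lcg {
    state: u32,
}

impl Lcg {
    fn new(seed: u32) -> Self {
        Lcg { state: seed }
    }

    /// Advance the generator and return a sample in [-1.0, 1.0).
    fn next_noise(&mut self) -> f32 {
        self.state = self
            .state
            .wrapping_mul(1_664_525)
            .wrapping_add(1_013_904_223);
        // Keep the top 24 bits, which an f32 represents exactly.
        let unit = (self.state >> 8) as f32 / (1u32 << 24) as f32; // [0, 1)
        unit * 2.0 - 1.0
    }
}

/// Generate `n` pseudo-noise samples from a fixed seed.
fn pseudo_noise(seed: u32, n: usize) -> Vec<f32> {
    let mut rng = Lcg::new(seed);
    (0..n).map(|_| rng.next_noise()).collect()
}

fn main() {
    let a = pseudo_noise(42, 8);
    let b = pseudo_noise(42, 8);
    // Same seed, same sequence: every test run sees identical input.
    assert_eq!(a, b);
    println!("{:?}", a);
}
```

Because the sequence depends only on the seed, any change in a printed metric between runs points at a code change rather than at noise.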


MEASURED-on-synthetic (constructible skills)

| Skill | What was planted (ground truth) | Result | Grade |
|---|---|---|---|
| vital_trend | BPM held N≥6 calls at each threshold band (brady/tachy-pnea <12 / >25, brady/tachy-cardia <50 / >120, apnea breathing<1.0 for ≥20) vs normal | acc 1.000, prec 1.000, recall 1.000 (TP5 FP0 TN5 FN0) | MEASURED |
| exo_time_crystal | period-2 coordinated motion vs pseudo-noise + flat | acc 1.000 (TP1 FP0 TN2 FN0) | MEASURED † |
| exo_ghost_hunter (hidden breathing) | phase sinusoid at lag-8 (breathing band 515) in an empty room vs flat phase | acc 1.000; planted score 1.000, flat 0.000 | MEASURED |
| occupancy | 220-frame flat-amplitude calibration, then strong per-zone amplitude variance vs flat | acc 1.000 (TP1 FP0 TN1 FN0) | MEASURED |
| intrusion | calibrate→arm (330 quiet frames), then per-subcarrier Δphase>1.5 + Δamp≫3σ vs quiet | acc 1.000 (TP1 FP0 TN1 FN0) | MEASURED |
| exo_rain_detect | empty room, 60-frame baseline, then broadband variance (8/8 groups, ratio≫2.5) for ≥10 frames vs stable-low | acc 1.000 (TP1 FP0 TN1 FN0) | MEASURED |
| sig_flash_attention | sustained high phase+amplitude in each of the 8 subcarrier groups; assert reported attention peak == planted group | peak-localization 8/8 = 1.000 | MEASURED |
| spt_spiking_tracker | sparse (2-subcarrier) large phase-delta in each of the 4 zones; assert tracked zone == planted zone | zone-localization 4/4 = 1.000 | MEASURED ‡ |
| sig_optimal_transport | sustained large frame-to-frame amplitude-distribution change vs stationary | acc 1.000 (TP1 FP0 TN1 FN0) | MEASURED |
| sig_mincut_person_match | 2 persons with distinct stable per-region variance signatures over 40 frames | person ids assigned, 0 id-swaps / 40 frames | MEASURED |
| lrn_dtw_gesture_learn | stillness → 3 identical gesture rehearsals → enrollment | template enrolled (templates=1) | MEASURED (enroll) § |
| sig_sparse_recovery | 30 clean frames to init, then 8/32 (25%) nulled subcarriers | dropout-detect + recovery-trigger = PASS | MEASURED (trigger) ¶ |
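
The acc / prec / recall figures in the Result column follow directly from the TP/FP/TN/FN counts. A minimal sketch of that arithmetic, using a hypothetical `Confusion` struct that is not taken from the harness:

```rust
/// Hypothetical confusion-matrix counts for one skill.
#[derive(Debug, Clone, Copy)]
struct Confusion {
    tp: u32,
    fp: u32,
    tn: u32,
    fn_: u32, // `fn` is a Rust keyword, hence the trailing underscore
}

/// Divide two counts, returning None when the denominator is zero.
fn ratio(num: u32, den: u32) -> Option<f64> {
    if den == 0 {
        None
    } else {
        Some(num as f64 / den as f64)
    }
}

impl Confusion {
    /// (TP + TN) / all trials.
    fn accuracy(&self) -> Option<f64> {
        ratio(self.tp + self.tn, self.tp + self.fp + self.tn + self.fn_)
    }

    /// TP / (TP + FP): how many raised detections were correct.
    fn precision(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fp)
    }

    /// TP / (TP + FN): how many planted events were detected.
    fn recall(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fn_)
    }
}

fn main() {
    // Counts from the vital_trend row: TP5 FP0 TN5 FN0.
    let vital_trend = Confusion { tp: 5, fp: 0, tn: 5, fn_: 0 };
    println!(
        "acc {:?}, prec {:?}, recall {:?}",
        vital_trend.accuracy(),
        vital_trend.precision(),
        vital_trend.recall()
    );
}
```

With only one to five trials per class, a score of 1.000 means every planted case was classified correctly, not that the rate is estimated to three decimal places.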

Caveats on individual results

† exo_time_crystal: honest discriminative limit. A purely periodic signal already has autocorrelation peaks at lag L and 2L (natural harmonics), so this "period-doubling" detector cannot separate a true period-2 sub-harmonic from a plain periodic signal. An earlier plant using a clean sine produced a false positive (recorded during development). The construct it can discriminate with known ground truth is periodic coordination vs aperiodic input (noise/flat), which is what is measured (1.000). The original "sub-harmonic vs clean period" claim is NOT validatable with this algorithm.
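
The lag-L and lag-2L peaks can be reproduced with a plain normalized autocorrelation. This is a sketch under assumptions: `autocorr` and `sine` are hypothetical helpers, and the detector's own autocorrelation estimator may be normalized differently.

```rust
use std::f32::consts::PI;

/// Normalized autocorrelation of `x` at `lag`, after mean removal.
/// Returns 0.0 for a flat signal or an out-of-range lag.
fn autocorr(x: &[f32], lag: usize) -> f32 {
    let n = x.len();
    if lag >= n {
        return 0.0;
    }
    let mean = x.iter().sum::<f32>() / n as f32;
    let energy: f32 = x.iter().map(|v| (v - mean).powi(2)).sum();
    if energy == 0.0 {
        return 0.0;
    }
    let cross: f32 = (0..n - lag)
        .map(|t| (x[t] - mean) * (x[t + lag] - mean))
        .sum();
    cross / energy
}

/// A clean sine with the given period in samples.
fn sine(period: usize, n: usize) -> Vec<f32> {
    (0..n)
        .map(|t| (2.0 * PI * t as f32 / period as f32).sin())
        .collect()
}

fn main() {
    let x = sine(16, 256); // L = 16
    // A clean sine correlates strongly with itself at L and at 2L,
    // and anti-correlates at L/2.
    println!("r(L/2) = {:.3}", autocorr(&x, 8));
    println!("r(L)   = {:.3}", autocorr(&x, 16));
    println!("r(2L)  = {:.3}", autocorr(&x, 32));
}
```

A detector that keys on a strong peak at twice the fundamental lag therefore fires on any clean periodic input, which matches the false positive recorded during development.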

‡ spt_spiking_tracker: plant must be sparse. With weights initialised to home=1.0 / cross=0.25, firing all 8 inputs in a zone drives every non-home neuron at 8×0.25=2.0, above the 1.0 threshold, so every output neuron fires and the tracker collapses to zone 0 (measured 1/4 during development). Firing only 2 inputs (home drive 2×1.0=2.0 fires, cross drive 2×0.25=0.5 stays silent) yields clean 4/4 zone localization. The validatable claim is single-zone localization.
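
The overdrive arithmetic can be checked with a simplified sum-and-threshold model. The constants come from the caveat above; `zone_drive` and `fires` are hypothetical helpers, and the real tracker presumably has additional neuron dynamics that this sketch ignores.

```rust
const HOME_WEIGHT: f32 = 1.0; // input -> neuron of its own zone
const CROSS_WEIGHT: f32 = 0.25; // input -> neurons of other zones
const THRESHOLD: f32 = 1.0; // a neuron fires when its drive exceeds this

/// Summed drive on (home neuron, each cross neuron) when
/// `active_inputs` inputs of a single zone fire together.
fn zone_drive(active_inputs: u32) -> (f32, f32) {
    let n = active_inputs as f32;
    (n * HOME_WEIGHT, n * CROSS_WEIGHT)
}

/// Simplified firing rule: strict threshold crossing.
fn fires(drive: f32) -> bool {
    drive > THRESHOLD
}

fn main() {
    for inputs in [8u32, 2u32] {
        let (home, cross) = zone_drive(inputs);
        println!(
            "{} inputs: home {:.2} fires={}, cross {:.2} fires={}",
            inputs,
            home,
            fires(home),
            cross,
            fires(cross)
        );
    }
}
```

In this model a dense 8-input plant makes the cross neurons fire as well, so the winning zone carries no information, while a 2-input plant leaves only the home neuron above threshold.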

§ lrn_dtw_gesture_learn: enrollment validated; replay-match NOT. The deterministic, constructible part (stillness → 3 identical rehearsals → a template is enrolled) is MEASURED. The DTW replay match (731) did not fire on the identical replay in this run (match_same=false). Replay-recognition accuracy is reported, not asserted, and is not claimed as validated.

¶ sig_sparse_recovery: trigger validated; recovery accuracy is NEGATIVE. The dropout-detection + ISTA-recovery trigger pipeline fires correctly on >10% planted nulls (asserted). But the measured recovery accuracy is NOT a win: recovered RMSE 1.0045 vs unrecovered-null RMSE 0.9830 (about 2.2% higher, i.e. slightly worse than leaving the nulls at zero) on a neighbor-correlated signal. The tridiagonal correlation model's fixed point does not equal the planted truth. The recovery's reconstruction quality is therefore NOT validated as effective on synthetic data; only its detection/trigger path is. Reported honestly; no positive number claimed.
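
The two quantities in this caveat, the fraction of nulled subcarriers and the RMSE comparison, can be sketched as follows. The helper names are hypothetical, and treating an exactly-zero amplitude as a dropout is an assumption about how nulls are planted, not the crate's detection rule.

```rust
/// Fraction of subcarriers whose amplitude is exactly zero
/// (assumed here to be how a planted null looks).
fn dropout_fraction(amplitudes: &[f32]) -> f32 {
    if amplitudes.is_empty() {
        return 0.0;
    }
    let nulled = amplitudes.iter().filter(|a| **a == 0.0).count();
    nulled as f32 / amplitudes.len() as f32
}

/// Root-mean-square error between an estimate and the planted truth.
fn rmse(estimate: &[f32], truth: &[f32]) -> f32 {
    assert_eq!(estimate.len(), truth.len());
    assert!(!truth.is_empty());
    let sum_sq: f32 = estimate
        .iter()
        .zip(truth)
        .map(|(e, t)| (e - t).powi(2))
        .sum();
    (sum_sq / truth.len() as f32).sqrt()
}

/// Relative change of `recovered` against `baseline`, in percent.
/// Positive means the recovered error is larger, i.e. worse.
fn relative_change_pct(recovered: f64, baseline: f64) -> f64 {
    (recovered - baseline) / baseline * 100.0
}

fn main() {
    // 8 of 32 subcarriers nulled, as in the planted scenario.
    let mut frame = vec![1.0f32; 32];
    for a in frame.iter_mut().take(8) {
        *a = 0.0;
    }
    let frac = dropout_fraction(&frame);
    println!("dropout {:.2}, trigger={}", frac, frac > 0.10);

    // RMSE values reported above: recovered vs nulls left at zero.
    println!("change {:+.1}%", relative_change_pct(1.0045, 0.9830));
}
```

A positive relative change is the reason the reconstruction quality is reported as negative: the recovered frame sits further from the planted truth than the untouched nulls do.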


DATA-GATED — NOT validatable on synthetic data

Planting a "seizure-like" / "weapon-like" / "happy-like" synthetic signal and claiming the detector "works" validates nothing real and is exactly the AI-slop this project fights. These skills run real DSP (per ADR-160, 0 stubs) and keep their ADR-160 disclaimers, but get no accuracy number here. Each needs the specific real, labelled data listed:

| Skill | Why not constructible on synthetic | Real data required |
|---|---|---|
| med_seizure_detect | "seizure-like" motion is not a seizure; no ground-truth signature exists synthetically | Clinical EEG-/video-labelled tonic-clonic seizure CSI from instrumented patients |
| med_sleep_apnea | a planted breathing-pause is not clinical apnea (AHI scoring, hypopnea, desaturation) | Polysomnography-labelled (PSG) overnight CSI with scored apnea/hypopnea events |
| med_cardiac_arrhythmia | a synthetic HR sequence cannot encode true arrhythmia morphology | ECG-labelled CSI (AFib/PVC/etc.) from clinical monitoring |
| med_respiratory_distress | distress is a clinical gestalt, not a plantable rate | Clinician-labelled respiratory-distress CSI episodes |
| med_gait_analysis | clinical gait metrics need a reference motion-capture standard | Mocap-/force-plate-labelled gait CSI |
| sec_weapon_detect | a high variance ratio is RF reflectivity, not weapon discrimination (ADR-160 §A3 already renamed the event to HIGH_METAL_REFLECTIVITY) | Labelled metal-object-vs-no-object CSI with controlled object classes |
| exo_emotion_detect | affect is not recoverable from a planted heuristic; outputs are proxies (ADR-160 §A2) | Validated affect-labelled CSI (self-report / physiological ground truth) |
| exo_happiness_score | "happiness" is a gait-energy proxy, not a measured affect (ADR-160 §A2) | Validated affect/valence-labelled CSI |
| exo_dream_stage | sleep staging needs PSG reference (EEG/EOG/EMG) | PSG-staged overnight CSI |
| exo_gesture_language | coarse gesture clusters ≠ true sign language (ADR-160 §A4) | Labelled ASL letter/word CSI dataset |

The above are not failures — they are the honest boundary. A smaller set of genuinely-measured skills plus this explicit gated list is the deliverable, per the prove-everything directive.


Skills not in either list

The remaining edge skills (smart-building / retail / industrial occupancy-style, the other sig_*/lrn_*/spt_*/tmp_*/qnt_*/aut_*/ais_* algorithm-named modules) are wired and exercised live in the unified pipeline integration test (tests/pipeline_all.rs, all 59 default / 64 medical skills run without panic over 300 synthetic frames) but were not given an individual planted-ground-truth accuracy number here. They are honest REAL-DSP modules (ADR-160) whose physical observable could be planted with more harness work; that is deferred, not claimed.

Test counts (full crate suite)

DEFAULT  (--features std):                     631 passed, 0 failed
  (lib 504; budget 25; honest_labeling 10; pipeline_all 4; synthetic_validation 12; bench 1; vendor 75)
MEDICAL  (--features std,medical-experimental): 669 passed, 0 failed
  (lib 542; +16 same new tests; med_* stay DATA-GATED, not validated)

(M6 baseline was 615 / 653; the new pipeline_all (4) + synthetic_validation (12) tests add 16 to each tier.)