Edge-Skill Synthetic-Ground-Truth Validation — RESULTS
Crate: v2/crates/wifi-densepose-wasm-edge (workspace-EXCLUDED — build from its own dir)
Branch: feat/edge-skills-synthetic-validation
ADR: ADR-160
Date: 2026-06-13
Harness: tests/synthetic_validation.rs
HONESTY BOUNDARY — read first. Everything below is synthetic-ground-truth validation: a signal is planted with a known answer, the real detector is run, and detection accuracy / precision / recall / rate-error is measured. This is NOT field accuracy. A skill that recovers a planted sinusoid here is proven to do the math it claims on a constructed signal; it is NOT proven to work on real CSI in a real room. Skills whose detection target cannot be honestly planted (clinical, weapon, affect, sleep-stage, sign-language) are NOT given a number — they are listed under DATA-GATED with the real data each would require.
Reproduce
cd v2/crates/wifi-densepose-wasm-edge # workspace-excluded; build here
cargo test --features std --test synthetic_validation -- --nocapture
# also runs under the medical tier (med_* skills stay DATA-GATED, not validated):
cargo test --features std,medical-experimental --test synthetic_validation -- --nocapture
Each MEASURED-on-synthetic | … line printed by the harness is the source of the
table below. Numbers are deterministic (no RNG; pseudo-noise uses a fixed LCG seed).
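
The determinism claim rests on pseudo-noise that comes from a seeded generator rather than a system RNG. A minimal sketch of that idea follows. The `Lcg` type, the `pseudo_noise` helper, and the multiplier/increment constants (the well-known Numerical Recipes 32-bit pair) are illustrative assumptions; the harness's actual constants and seed are not stated in this document.

```rust
/// Minimal linear congruential generator (LCG) for reproducible pseudo-noise.
/// ASSUMPTION: constants are the Numerical Recipes 32-bit parameters, used
/// here for illustration only; they are not taken from the harness.
pub struct Lcg {
    state: u32,
}

impl Lcg {
    pub fn new(seed: u32) -> Self {
        Lcg { state: seed }
    }

    /// Advance the state (mod 2^32 via wrapping arithmetic) and return it.
    pub fn next_u32(&mut self) -> u32 {
        self.state = self
            .state
            .wrapping_mul(1_664_525)
            .wrapping_add(1_013_904_223);
        self.state
    }

    /// Map the top 24 bits of the next state to a float in [-1.0, 1.0).
    pub fn next_noise(&mut self) -> f32 {
        let top24 = self.next_u32() >> 8; // 0 ..= 2^24 - 1, exact in f32
        (top24 as f32 / 8_388_608.0) - 1.0 // divide by 2^23, then shift
    }
}

/// Generate `n` noise samples from a fixed seed (hypothetical helper).
pub fn pseudo_noise(seed: u32, n: usize) -> Vec<f32> {
    let mut rng = Lcg::new(seed);
    (0..n).map(|_| rng.next_noise()).collect()
}
```

Because the only input is the seed, two calls such as `pseudo_noise(42, 64)` return identical vectors on every run and every platform, which is what makes the printed numbers repeatable.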
MEASURED-on-synthetic (constructible skills)
| Skill | What was planted (ground truth) | Result | Grade |
|---|---|---|---|
| vital_trend | BPM held N≥6 calls at each threshold band (brady/tachy-pnea <12 / >25, brady/tachy-cardia <50 / >120, apnea breathing<1.0 for ≥20) vs normal | acc 1.000, prec 1.000, recall 1.000 (TP5 FP0 TN5 FN0) | MEASURED |
| exo_time_crystal | period-2 coordinated motion vs pseudo-noise + flat | acc 1.000 (TP1 FP0 TN2 FN0) | MEASURED † |
| exo_ghost_hunter (hidden breathing) | phase sinusoid at lag-8 (breathing band 5–15) in an empty room vs flat phase | acc 1.000; planted score 1.000, flat 0.000 | MEASURED |
| occupancy | 220-frame flat-amplitude calibration, then strong per-zone amplitude variance vs flat | acc 1.000 (TP1 FP0 TN1 FN0) | MEASURED |
| intrusion | calibrate→arm (330 quiet frames), then per-subcarrier Δphase>1.5 + Δamp≫3σ vs quiet | acc 1.000 (TP1 FP0 TN1 FN0) | MEASURED |
| exo_rain_detect | empty room, 60-frame baseline, then broadband variance (8/8 groups, ratio≫2.5) for ≥10 frames vs stable-low | acc 1.000 (TP1 FP0 TN1 FN0) | MEASURED |
| sig_flash_attention | sustained high phase+amplitude in each of the 8 subcarrier groups; assert reported attention peak == planted group | peak-localization 8/8 = 1.000 | MEASURED |
| spt_spiking_tracker | sparse (2-subcarrier) large phase-delta in each of the 4 zones; assert tracked zone == planted zone | zone-localization 4/4 = 1.000 | MEASURED ‡ |
| sig_optimal_transport | sustained large frame-to-frame amplitude-distribution change vs stationary | acc 1.000 (TP1 FP0 TN1 FN0) | MEASURED |
| sig_mincut_person_match | 2 persons with distinct stable per-region variance signatures over 40 frames | person ids assigned, 0 id-swaps / 40 frames | MEASURED |
| lrn_dtw_gesture_learn | stillness → 3 identical gesture rehearsals → enrollment | template enrolled (templates=1) | MEASURED (enroll) § |
| sig_sparse_recovery | 30 clean frames to init, then 8/32 (25%) nulled subcarriers | dropout-detect + recovery-trigger = PASS | MEASURED (trigger) ¶ |
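
The acc / prec / recall figures in the table are the standard ratios over the TP/FP/TN/FN counts. A small sketch of that arithmetic, assuming a hypothetical `Confusion` struct (not the harness's own type) and returning `None` when a denominator is zero:

```rust
/// Counts from a binary detection run (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Confusion {
    pub tp: u32,
    pub fp: u32,
    pub tn: u32,
    pub fn_: u32, // `fn` is a Rust keyword, hence the trailing underscore
}

/// Safe division: None when the denominator is zero.
fn ratio(num: u32, den: u32) -> Option<f64> {
    if den == 0 {
        None
    } else {
        Some(f64::from(num) / f64::from(den))
    }
}

impl Confusion {
    /// (TP + TN) / all cases.
    pub fn accuracy(&self) -> Option<f64> {
        ratio(self.tp + self.tn, self.tp + self.fp + self.tn + self.fn_)
    }

    /// TP / (TP + FP): of everything flagged, how much was planted.
    pub fn precision(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fp)
    }

    /// TP / (TP + FN): of everything planted, how much was flagged.
    pub fn recall(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fn_)
    }
}
```

For the `vital_trend` row (TP5 FP0 TN5 FN0) all three ratios evaluate to 1.0. Rows with a single positive and one or two negatives have very few cases, so a 1.000 there means "no error on a handful of constructed cases", not a statistically tight estimate.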
Caveats on individual results
† exo_time_crystal — honest discriminative limit. A pure periodic signal already has autocorrelation peaks at lag L and 2L (natural harmonics), so this "period-doubling" detector cannot separate a true period-2 sub-harmonic from a plain periodic signal — an earlier plant using a clean sine produced a false positive (recorded during development). The construct it can discriminate with known ground truth is periodic-coordination vs aperiodic (noise/flat), which is what is measured (1.000). The original "sub-harmonic vs clean period" claim is NOT validatable with this algorithm.
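
The ambiguity described in the † caveat can be reproduced with a generic normalized autocorrelation. The sketch below is not the crate's detector; `sine` and `autocorr` are hypothetical helpers using the biased estimator (sum over the overlapping samples divided by total energy). It shows that a plain sine with period L already scores high at both lag L and lag 2L.

```rust
/// Pure sinusoid with the given period in samples (hypothetical helper).
pub fn sine(period: usize, n: usize) -> Vec<f64> {
    (0..n)
        .map(|i| (2.0 * std::f64::consts::PI * i as f64 / period as f64).sin())
        .collect()
}

/// Biased, mean-removed autocorrelation at `lag`, normalized by total energy.
/// Returns 0.0 for empty input, out-of-range lags, or a constant signal.
pub fn autocorr(signal: &[f64], lag: usize) -> f64 {
    let n = signal.len();
    if n == 0 || lag >= n {
        return 0.0;
    }
    let mean = signal.iter().sum::<f64>() / n as f64;
    let energy: f64 = signal.iter().map(|x| (x - mean).powi(2)).sum();
    if energy <= f64::EPSILON {
        return 0.0; // flat signal: no periodic structure to score
    }
    let cov: f64 = (0..n - lag)
        .map(|i| (signal[i] - mean) * (signal[i + lag] - mean))
        .sum();
    cov / energy
}
```

With `sine(16, 256)`, the score is roughly 0.94 at lag 16 and 0.88 at lag 32, while a flat signal scores 0.0 at every lag. A detector that looks for a peak at 2L therefore cannot tell a clean period-L signal from a genuine period-doubled one, but it can separate periodic from flat or aperiodic input, which matches what the harness measures.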
‡ spt_spiking_tracker — plant must be sparse. With weights init'd home=1.0 / cross=0.25, firing all 8 inputs in a zone (8×0.25=2.0 > threshold 1.0) overdrives every output neuron and the tracker collapses to zone 0 (measured 1/4 during development). Firing only 2 inputs (home 2.0 fires, cross 0.5 silent) yields clean 4/4 zone localization. The validatable claim is single-zone localization.
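
The overdrive arithmetic in the ‡ caveat can be checked with a one-step model. This sketch assumes the simplest reading of the caveat: each active input contributes its weight once, an output neuron fires when its summed drive strictly exceeds the threshold, and leak, refractory periods and weight updates are ignored. `zone_drive` and `firing_zones` are hypothetical names; only the weights (1.0 / 0.25) and threshold (1.0) come from the text above.

```rust
pub const ZONES: usize = 4;
const HOME_WEIGHT: f32 = 1.0; // input -> its own zone's output neuron
const CROSS_WEIGHT: f32 = 0.25; // input -> every other zone's output neuron
const THRESHOLD: f32 = 1.0;

/// Summed drive per output neuron when `active_inputs` inputs fire in
/// `planted_zone` (single step, no leak: an illustrative simplification).
pub fn zone_drive(planted_zone: usize, active_inputs: usize) -> [f32; ZONES] {
    let mut drive = [0.0_f32; ZONES];
    for (zone, d) in drive.iter_mut().enumerate() {
        let w = if zone == planted_zone { HOME_WEIGHT } else { CROSS_WEIGHT };
        *d = w * active_inputs as f32;
    }
    drive
}

/// Indices of output neurons whose drive strictly exceeds the threshold.
pub fn firing_zones(drive: &[f32; ZONES]) -> Vec<usize> {
    drive
        .iter()
        .enumerate()
        .filter(|(_, d)| **d > THRESHOLD)
        .map(|(zone, _)| zone)
        .collect()
}
```

With 8 active inputs in zone 2 the drives are `[2.0, 2.0, 8.0, 2.0]` and every neuron fires, so the planted zone is no longer distinguishable. With 2 active inputs the drives are `[0.5, 0.5, 2.0, 0.5]` and only zone 2 fires.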
§ lrn_dtw_gesture_learn — enrollment validated; replay-match NOT. The
deterministic, constructible part (stillness → 3 identical rehearsals → a template
is enrolled) is MEASURED. The DTW replay match (731) did not fire on the
identical replay in this run (match_same=false) — replay-recognition accuracy is
reported, not asserted, and is not claimed as validated.
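
For reference, a textbook dynamic-time-warping distance is sketched below. It is not the crate's matcher: `dtw_distance`, the absolute-difference cost and the unconstrained warping window are all assumptions made for illustration. In this textbook form an identical replay has distance exactly zero; whether the crate's matcher applies additional gating that explains the observed miss is not stated in this document.

```rust
/// Classic O(n*m) DTW distance with |a - b| as the local cost.
/// Uses two rolling rows instead of the full cost matrix.
/// Returns infinity if either sequence is empty.
pub fn dtw_distance(a: &[f32], b: &[f32]) -> f32 {
    if a.is_empty() || b.is_empty() {
        return f32::INFINITY;
    }
    let m = b.len();
    let mut prev = vec![f32::INFINITY; m + 1];
    let mut curr = vec![f32::INFINITY; m + 1];
    prev[0] = 0.0; // empty prefix aligned with empty prefix costs nothing

    for i in 1..=a.len() {
        curr[0] = f32::INFINITY; // non-empty prefix of `a` vs empty `b`
        for j in 1..=m {
            let cost = (a[i - 1] - b[j - 1]).abs();
            // Cheapest way to reach (i, j): insertion, deletion, or match.
            let best = prev[j].min(curr[j - 1]).min(prev[j - 1]);
            curr[j] = cost + best;
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[m]
}
```

A time-stretched copy of a gesture (each sample repeated) also yields distance zero under this definition, which is the property that makes DTW attractive for rehearsals performed at slightly different speeds.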
¶ sig_sparse_recovery — trigger validated; recovery accuracy is NEGATIVE. The dropout-detection + ISTA-recovery trigger pipeline fires correctly on >10% planted nulls (asserted). But the measured recovery accuracy is NOT a win: recovered RMSE 1.0045 vs unrecovered-null RMSE 0.9830 (−2.2%, i.e. slightly worse than leaving the nulls at zero) on a neighbor-correlated signal. The tridiagonal correlation model's fixed point does not equal the planted truth. The recovery's reconstruction quality is therefore NOT validated as effective on synthetic data — only its detection/trigger path is. Reported honestly; no positive number claimed.
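
The negative result above is a comparison of two RMSE values against the planted truth. A minimal sketch of that comparison, with hypothetical helper names (`rmse`, `relative_improvement`); which samples the harness averages over (all subcarriers or only the nulled ones) is not stated here, so the helper simply takes whatever slices it is given:

```rust
/// Root-mean-square error between an estimate and the planted truth.
/// Returns None for empty or length-mismatched inputs.
pub fn rmse(estimate: &[f64], truth: &[f64]) -> Option<f64> {
    if estimate.is_empty() || estimate.len() != truth.len() {
        return None;
    }
    let sum_sq: f64 = estimate
        .iter()
        .zip(truth)
        .map(|(e, t)| (e - t).powi(2))
        .sum();
    Some((sum_sq / estimate.len() as f64).sqrt())
}

/// Fractional improvement of `recovered` over `baseline`.
/// Positive means recovery helped; negative means it made things worse.
pub fn relative_improvement(baseline_rmse: f64, recovered_rmse: f64) -> f64 {
    (baseline_rmse - recovered_rmse) / baseline_rmse
}
```

Plugging in the reported figures, `relative_improvement(0.9830, 1.0045)` is about -0.022, the −2.2% quoted above: the recovered frame is slightly further from the truth than the frame with its nulls left at zero.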
DATA-GATED — NOT validatable on synthetic data
Planting a "seizure-like" / "weapon-like" / "happy-like" synthetic signal and claiming the detector "works" validates nothing real and is exactly the AI-slop this project fights. These skills run real DSP (per ADR-160, 0 stubs) and keep their ADR-160 disclaimers, but get no accuracy number here. Each needs the specific real, labelled data listed:
| Skill | Why not constructible on synthetic | Real data required |
|---|---|---|
| med_seizure_detect | "seizure-like" motion is not a seizure; no ground-truth signature exists synthetically | Clinical EEG-/video-labelled tonic-clonic seizure CSI from instrumented patients |
| med_sleep_apnea | a planted breathing-pause is not clinical apnea (AHI scoring, hypopnea, desaturation) | Polysomnography-labelled (PSG) overnight CSI with scored apnea/hypopnea events |
| med_cardiac_arrhythmia | a synthetic HR sequence cannot encode true arrhythmia morphology | ECG-labelled CSI (AFib/PVC/etc.) from clinical monitoring |
| med_respiratory_distress | distress is a clinical gestalt, not a plantable rate | Clinician-labelled respiratory-distress CSI episodes |
| med_gait_analysis | clinical gait metrics need a reference motion-capture standard | Mocap-/force-plate-labelled gait CSI |
| sec_weapon_detect | a high variance ratio is RF reflectivity, not weapon discrimination (ADR-160 §A3 already renamed the event to HIGH_METAL_REFLECTIVITY) | Labelled metal-object-vs-no-object CSI with controlled object classes |
| exo_emotion_detect | affect is not recoverable from a planted heuristic; outputs are proxies (ADR-160 §A2) | Validated affect-labelled CSI (self-report / physiological ground truth) |
| exo_happiness_score | "happiness" is a gait-energy proxy, not a measured affect (ADR-160 §A2) | Validated affect/valence-labelled CSI |
| exo_dream_stage | sleep staging needs PSG reference (EEG/EOG/EMG) | PSG-staged overnight CSI |
| exo_gesture_language | coarse gesture clusters ≠ true sign language (ADR-160 §A4) | Labelled ASL letter/word CSI dataset |
The above are not failures — they are the honest boundary. A smaller set of genuinely-measured skills plus this explicit gated list is the deliverable, per the prove-everything directive.
Skills not in either list
The remaining edge skills (smart-building / retail / industrial occupancy-style,
the other sig_*/lrn_*/spt_*/tmp_*/qnt_*/aut_*/ais_* algorithm-named
modules) are wired and exercised live in the unified pipeline integration test
(tests/pipeline_all.rs, all 59 default / 64 medical skills run without panic over
300 synthetic frames) but were not given an individual planted-ground-truth
accuracy number here. They are honest REAL-DSP modules (ADR-160) whose physical
observable could be planted with more harness work; that is deferred, not claimed.
Test counts (full crate suite)
DEFAULT (--features std): 631 passed, 0 failed
(lib 504; budget 25; honest_labeling 10; pipeline_all 4; synthetic_validation 12; bench 1; vendor 75)
MEDICAL (--features std,medical-experimental): 669 passed, 0 failed
(lib 542; all other suites identical to DEFAULT, including the same 16 new tests; med_* stay DATA-GATED, not validated)
(M6 baseline was 615 / 653; the new pipeline_all (4) + synthetic_validation (12) tests add 16 to each tier.)