Edge-Skill Synthetic-Ground-Truth Validation — RESULTS

Crate: v2/crates/wifi-densepose-wasm-edge (workspace-EXCLUDED; build from its own dir)
Branch: feat/edge-skills-synthetic-validation
ADR: ADR-160
Date: 2026-06-13
Harness: tests/synthetic_validation.rs

HONESTY BOUNDARY — read first. Everything below is synthetic-ground-truth validation: a signal is planted with a known answer, the real detector is run, and detection accuracy / precision / recall / rate-error is measured. This is NOT field accuracy. A skill that recovers a planted sinusoid here is proven to do the math it claims on a constructed signal; it is NOT proven to work on real CSI in a real room. Skills whose detection target cannot be honestly planted (clinical, weapon, affect, sleep-stage, sign-language) are NOT given a number — they are listed under DATA-GATED with the real data each would require.

Reproduce

cd v2/crates/wifi-densepose-wasm-edge   # workspace-excluded; build here
cargo test --features std --test synthetic_validation -- --nocapture
# also runs under the medical tier (med_* skills stay DATA-GATED, not validated):
cargo test --features std,medical-experimental --test synthetic_validation -- --nocapture

Each MEASURED-on-synthetic | … line printed by the harness is the source of the table below. Numbers are deterministic (no RNG; pseudo-noise uses a fixed LCG seed).
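
As an illustration of why a fixed-seed LCG makes the numbers reproducible, here is a minimal sketch of a seeded pseudo-noise source. The type `Lcg`, the helper `noise_frame`, and the multiplier/increment constants (the widely used Numerical Recipes pair) are assumptions for illustration; the constants and seed the harness actually uses are not stated here.

```rust
/// Minimal linear congruential generator (LCG) for reproducible pseudo-noise.
/// ASSUMPTION: the constants are the Numerical Recipes pair, chosen only for
/// illustration; they are not taken from the harness.
pub struct Lcg {
    state: u32,
}

impl Lcg {
    pub fn new(seed: u32) -> Self {
        Self { state: seed }
    }

    /// Advance the state and return a value in [-1.0, 1.0).
    pub fn next_noise(&mut self) -> f32 {
        self.state = self
            .state
            .wrapping_mul(1_664_525)
            .wrapping_add(1_013_904_223);
        // Keep the top 24 bits: they convert to f32 without rounding.
        let unit = (self.state >> 8) as f32 / (1u32 << 24) as f32; // [0, 1)
        2.0 * unit - 1.0
    }
}

/// Build one frame of `n` pseudo-noise samples from a fixed seed.
/// (Hypothetical helper, not a function of the crate.)
pub fn noise_frame(seed: u32, n: usize) -> Vec<f32> {
    let mut rng = Lcg::new(seed);
    (0..n).map(|_| rng.next_noise()).collect()
}
```

Because the state update is pure integer arithmetic, calling `noise_frame` twice with the same seed yields bit-identical frames, which is the property that makes a rerun print the same `MEASURED-on-synthetic` lines.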

MEASURED-on-synthetic (constructible skills)

Skill	What was planted (ground truth)	Result	Grade
vital_trend	BPM held for N≥6 calls in each threshold band (bradypnea <12 / tachypnea >25, bradycardia <50 / tachycardia >120, apnea: breathing <1.0 for ≥20) vs normal	acc 1.000, prec 1.000, recall 1.000 (TP5 FP0 TN5 FN0)	MEASURED
exo_time_crystal	period-2 coordinated motion vs pseudo-noise + flat	acc 1.000 (TP1 FP0 TN2 FN0)	MEASURED †
exo_ghost_hunter (hidden breathing)	phase sinusoid at lag-8 (breathing band 5–15) in an empty room vs flat phase	acc 1.000; planted score 1.000, flat 0.000	MEASURED
occupancy	220-frame flat-amplitude calibration, then strong per-zone amplitude variance vs flat	acc 1.000 (TP1 FP0 TN1 FN0)	MEASURED
intrusion	calibrate→arm (330 quiet frames), then per-subcarrier Δphase>1.5 + Δamp≫3σ vs quiet	acc 1.000 (TP1 FP0 TN1 FN0)	MEASURED
exo_rain_detect	empty room, 60-frame baseline, then broadband variance (8/8 groups, ratio≫2.5) for ≥10 frames vs stable-low	acc 1.000 (TP1 FP0 TN1 FN0)	MEASURED
sig_flash_attention	sustained high phase+amplitude in each of the 8 subcarrier groups; assert reported attention peak == planted group	peak-localization 8/8 = 1.000	MEASURED
spt_spiking_tracker	sparse (2-subcarrier) large phase-delta in each of the 4 zones; assert tracked zone == planted zone	zone-localization 4/4 = 1.000	MEASURED ‡
sig_optimal_transport	sustained large frame-to-frame amplitude-distribution change vs stationary	acc 1.000 (TP1 FP0 TN1 FN0)	MEASURED
sig_mincut_person_match	2 persons with distinct stable per-region variance signatures over 40 frames	person ids assigned, 0 id-swaps / 40 frames	MEASURED
lrn_dtw_gesture_learn	stillness → 3 identical gesture rehearsals → enrollment	template enrolled (templates=1)	MEASURED (enroll) §
sig_sparse_recovery	30 clean frames to init, then 8/32 (25%) nulled subcarriers	dropout-detect + recovery-trigger = PASS	MEASURED (trigger) ¶
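
The acc / prec / recall figures in the Result column follow from the TP/FP/TN/FN counts printed next to them. The sketch below shows the standard definitions; the `Confusion` type and its method names are hypothetical and are not the harness's own code.

```rust
/// Confusion-matrix counts for one skill, as printed in the Result column.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Confusion {
    pub tp: u32,
    pub fp: u32,
    pub tn: u32,
    pub fn_: u32, // `fn` is a Rust keyword, hence the trailing underscore
}

/// Divide two counts, returning None when the denominator is zero.
fn ratio(num: u32, den: u32) -> Option<f64> {
    if den == 0 {
        None
    } else {
        Some(f64::from(num) / f64::from(den))
    }
}

impl Confusion {
    /// (TP + TN) / all trials.
    pub fn accuracy(&self) -> Option<f64> {
        ratio(self.tp + self.tn, self.tp + self.fp + self.tn + self.fn_)
    }

    /// TP / (TP + FP): of everything flagged, how much was truly planted.
    pub fn precision(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fp)
    }

    /// TP / (TP + FN): of everything planted, how much was flagged.
    pub fn recall(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fn_)
    }
}
```

For the `vital_trend` row, `Confusion { tp: 5, fp: 0, tn: 5, fn_: 0 }` gives accuracy, precision, and recall of 1.0. Note how few trials sit behind most rows: a 1.000 from TP1 FP0 TN1 FN0 rests on two trials.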

Caveats on individual results

† exo_time_crystal — honest discriminative limit. A pure periodic signal already has autocorrelation peaks at lag L and 2L (natural harmonics), so this "period-doubling" detector cannot separate a true period-2 sub-harmonic from a plain periodic signal — an earlier plant using a clean sine produced a false positive (recorded during development). The construct it can discriminate with known ground truth is periodic-coordination vs aperiodic (noise/flat), which is what is measured (1.000). The original "sub-harmonic vs clean period" claim is NOT validatable with this algorithm.
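
The limit described above can be reproduced with a few lines: a clean sine of period L has a normalized autocorrelation of about 1 at both lag L and lag 2L, so a detector keyed on those two peaks cannot tell it apart from a true sub-harmonic. This is a sketch only; `autocorr` and `sine` are hypothetical helpers, and the normalization used by the real detector may differ.

```rust
use std::f64::consts::PI;

/// Normalized autocorrelation of `x` at `lag`, computed over the overlapping
/// samples only. ASSUMPTION: this normalization is for illustration and is
/// not taken from the exo_time_crystal source.
pub fn autocorr(x: &[f64], lag: usize) -> f64 {
    if lag >= x.len() {
        return 0.0;
    }
    let n = x.len() - lag;
    let (mut num, mut e0, mut e1) = (0.0, 0.0, 0.0);
    for i in 0..n {
        num += x[i] * x[i + lag];
        e0 += x[i] * x[i];
        e1 += x[i + lag] * x[i + lag];
    }
    let denom = (e0 * e1).sqrt();
    if denom == 0.0 {
        0.0
    } else {
        num / denom
    }
}

/// A clean sine with the given period, `n` samples long.
pub fn sine(period: usize, n: usize) -> Vec<f64> {
    (0..n)
        .map(|i| (2.0 * PI * i as f64 / period as f64).sin())
        .collect()
}
```

With `let x = sine(8, 64)`, both `autocorr(&x, 8)` and `autocorr(&x, 16)` come out near 1, while `autocorr(&x, 4)` is near -1. The peak at 2L is simply a consequence of periodicity, not evidence of period doubling.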

‡ spt_spiking_tracker — plant must be sparse. With weights init'd home=1.0 / cross=0.25, firing all 8 inputs in a zone (8×0.25=2.0 > threshold 1.0) overdrives every output neuron and the tracker collapses to zone 0 (measured 1/4 during development). Firing only 2 inputs (home 2.0 fires, cross 0.5 silent) yields clean 4/4 zone localization. The validatable claim is single-zone localization.
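
The arithmetic behind this caveat can be checked with a one-step model of the output layer. The sketch assumes each output neuron's drive is simply (number of active inputs) × (weight) compared against the threshold, with no leak or refractory dynamics; the constant and function names are hypothetical, and only the weights and threshold come from the text above.

```rust
pub const ZONES: usize = 4;
pub const W_HOME: f32 = 1.0; // weight from inputs in the neuron's own zone
pub const W_CROSS: f32 = 0.25; // weight from inputs in other zones
pub const THRESHOLD: f32 = 1.0;

/// Return the zones whose output neuron would fire when `active_inputs`
/// subcarriers in `planted_zone` spike at the same time.
/// ASSUMPTION: single-step summation, no membrane leak or learning.
pub fn fired_zones(planted_zone: usize, active_inputs: u32) -> Vec<usize> {
    (0..ZONES)
        .filter(|&zone| {
            let weight = if zone == planted_zone { W_HOME } else { W_CROSS };
            active_inputs as f32 * weight > THRESHOLD
        })
        .collect()
}
```

Under this model, `fired_zones(2, 2)` returns only zone 2 (home drive 2.0, cross drive 0.5), while `fired_zones(2, 8)` returns all four zones because the cross drive reaches 2.0. If the tracker then picks the lowest-index firing neuron, that would be consistent with the observed collapse to zone 0, though the tie-break rule is not confirmed here.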

§ lrn_dtw_gesture_learn — enrollment validated; replay-match NOT. The deterministic, constructible part (stillness → 3 identical rehearsals → a template is enrolled) is MEASURED. The DTW replay match (731) did not fire on the identical replay in this run (match_same=false) — replay-recognition accuracy is reported, not asserted, and is not claimed as validated.

¶ sig_sparse_recovery: trigger validated; recovery accuracy is NEGATIVE. The dropout-detection + ISTA-recovery trigger pipeline fires correctly on >10% planted nulls (asserted). But the measured recovery accuracy is NOT a win: on a neighbor-correlated signal, recovered RMSE is 1.0045 vs unrecovered-null RMSE 0.9830, an error increase of about 2.2%, i.e. slightly worse than leaving the nulls at zero. The tridiagonal correlation model's fixed point does not equal the planted truth. The recovery's reconstruction quality is therefore NOT validated as effective on synthetic data; only its detection/trigger path is. Reported honestly; no positive number claimed.
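
For reference, the comparison above reduces to two RMSE values against the planted truth and their relative difference. The sketch below shows that arithmetic; `rmse` and `relative_change_pct` are hypothetical helpers, not the harness's functions.

```rust
/// Root-mean-square error between an estimate and the planted truth.
/// Panics if the slices differ in length; returns 0.0 for empty input.
pub fn rmse(estimate: &[f64], truth: &[f64]) -> f64 {
    assert_eq!(estimate.len(), truth.len(), "length mismatch");
    if truth.is_empty() {
        return 0.0;
    }
    let sum_sq: f64 = estimate
        .iter()
        .zip(truth)
        .map(|(e, t)| (e - t) * (e - t))
        .sum();
    (sum_sq / truth.len() as f64).sqrt()
}

/// Relative change of `recovered` against `baseline`, in percent.
/// Positive means the recovered error is larger, i.e. recovery made it worse.
pub fn relative_change_pct(recovered: f64, baseline: f64) -> f64 {
    (recovered - baseline) / baseline * 100.0
}
```

Plugging in the reported values, `relative_change_pct(1.0045, 0.9830)` is roughly 2.19, which matches the quoted 2.2% degradation relative to leaving the nulled subcarriers at zero.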

DATA-GATED — NOT validatable on synthetic data

Planting a "seizure-like" / "weapon-like" / "happy-like" synthetic signal and claiming the detector "works" validates nothing real and is exactly the AI-slop this project fights. These skills run real DSP (per ADR-160, 0 stubs) and keep their ADR-160 disclaimers, but get no accuracy number here. Each needs the specific real, labelled data listed:

Skill	Why not constructible on synthetic	Real data required
`med_seizure_detect`	"seizure-like" motion is not a seizure; no ground-truth signature exists synthetically	Clinical EEG-/video-labelled tonic-clonic seizure CSI from instrumented patients
`med_sleep_apnea`	a planted breathing-pause is not clinical apnea (AHI scoring, hypopnea, desaturation)	Polysomnography-labelled (PSG) overnight CSI with scored apnea/hypopnea events
`med_cardiac_arrhythmia`	a synthetic HR sequence cannot encode true arrhythmia morphology	ECG-labelled CSI (AFib/PVC/etc.) from clinical monitoring
`med_respiratory_distress`	distress is a clinical gestalt, not a plantable rate	Clinician-labelled respiratory-distress CSI episodes
`med_gait_analysis`	clinical gait metrics need a reference motion-capture standard	Mocap-/force-plate-labelled gait CSI
`sec_weapon_detect`	a high variance ratio is RF reflectivity, not weapon discrimination (ADR-160 §A3 already renamed the event to `HIGH_METAL_REFLECTIVITY`)	Labelled metal-object-vs-no-object CSI with controlled object classes
`exo_emotion_detect`	affect is not recoverable from a planted heuristic; outputs are proxies (ADR-160 §A2)	Validated affect-labelled CSI (self-report / physiological ground truth)
`exo_happiness_score`	"happiness" is a gait-energy proxy, not a measured affect (ADR-160 §A2)	Validated affect/valence-labelled CSI
`exo_dream_stage`	sleep staging needs PSG reference (EEG/EOG/EMG)	PSG-staged overnight CSI
`exo_gesture_language`	coarse gesture clusters ≠ true sign language (ADR-160 §A4)	Labelled ASL letter/word CSI dataset

The above are not failures — they are the honest boundary. A smaller set of genuinely-measured skills plus this explicit gated list is the deliverable, per the prove-everything directive.

Skills not in either list

The remaining edge skills (smart-building / retail / industrial occupancy-style, the other sig_*/lrn_*/spt_*/tmp_*/qnt_*/aut_*/ais_* algorithm-named modules) are wired and exercised live in the unified pipeline integration test (tests/pipeline_all.rs, all 59 default / 64 medical skills run without panic over 300 synthetic frames) but were not given an individual planted-ground-truth accuracy number here. They are honest REAL-DSP modules (ADR-160) whose physical observable could be planted with more harness work; that is deferred, not claimed.

Test counts (full crate suite)

DEFAULT  (--features std):                     631 passed, 0 failed
  (lib 504; budget 25; honest_labeling 10; pipeline_all 4; synthetic_validation 12; bench 1; vendor 75)
MEDICAL  (--features std,medical-experimental): 669 passed, 0 failed
  (lib 542; +16 same new tests; med_* stay DATA-GATED, not validated)

(M6 baseline was 615 / 653; the new pipeline_all (4) + synthetic_validation (12) tests add 16 to each tier.)
