# Edge-Skill Synthetic-Ground-Truth Validation — RESULTS

**Crate:** `v2/crates/wifi-densepose-wasm-edge` (workspace-EXCLUDED — build from its own dir)

**Branch:** `feat/edge-skills-synthetic-validation`

**ADR:** [ADR-160](../../docs/adr/ADR-160-edge-skill-library-honest-labeling.md)

**Date:** 2026-06-13

**Harness:** `tests/synthetic_validation.rs`

> **HONESTY BOUNDARY — read first.** Everything below is **synthetic-ground-truth
> validation**: a signal is *planted* with a known answer, the **real** detector
> is run, and detection accuracy / precision / recall / rate-error is **measured**.
> This is **NOT field accuracy.** A skill that recovers a planted sinusoid here is
> proven to do the math it claims on a *constructed* signal; it is **NOT** proven
> to work on real CSI in a real room. Skills whose detection target cannot be
> honestly planted (clinical, weapon, affect, sleep-stage, sign-language) are
> **NOT** given a number — they are listed under **DATA-GATED** with the real
> data each would require.

## Reproduce

```bash
cd v2/crates/wifi-densepose-wasm-edge   # workspace-excluded; build here
cargo test --features std --test synthetic_validation -- --nocapture
# also runs under the medical tier (med_* skills stay DATA-GATED, not validated):
cargo test --features std,medical-experimental --test synthetic_validation -- --nocapture
```

Each `MEASURED-on-synthetic | …` line printed by the harness is the source of the
table below. Numbers are deterministic: no randomly seeded RNG is used, and all
pseudo-noise comes from an LCG with a fixed seed.
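
The determinism claim and the acc / prec / recall columns can be illustrated with a minimal Rust sketch. It assumes a 32-bit LCG using the common Numerical Recipes constants (1664525, 1013904223); the harness's actual constants and seed are not given in this report, and `Lcg` and `Confusion` are hypothetical names, not the crate's types. The metrics use the standard definitions: accuracy = (TP+TN)/total, precision = TP/(TP+FP), recall = TP/(TP+FN).

```rust
/// Hypothetical 32-bit LCG (Numerical Recipes constants; an assumption,
/// not necessarily the harness's). Same seed -> same sequence, every run.
struct Lcg(u32);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self.0.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
        self.0
    }

    /// Pseudo-noise in [-1.0, 1.0). The top 24 bits are used so the
    /// intermediate value is exactly representable as an f32.
    fn next_noise(&mut self) -> f32 {
        let unit = (self.next_u32() >> 8) as f32 / (1u32 << 24) as f32; // [0, 1)
        unit * 2.0 - 1.0
    }
}

/// Hypothetical confusion-matrix tally for one skill's planted vs control cases.
#[derive(Default, Debug, Clone, Copy)]
struct Confusion {
    tp: u32,
    fp: u32,
    tn: u32,
    fn_: u32,
}

impl Confusion {
    /// `planted` = ground truth, `detected` = what the real detector reported.
    fn record(&mut self, planted: bool, detected: bool) {
        match (planted, detected) {
            (true, true) => self.tp += 1,
            (false, true) => self.fp += 1,
            (false, false) => self.tn += 1,
            (true, false) => self.fn_ += 1,
        }
    }
    fn accuracy(&self) -> f64 {
        let total = self.tp + self.fp + self.tn + self.fn_;
        (self.tp + self.tn) as f64 / total as f64
    }
    /// NaN when nothing was detected at all (0/0).
    fn precision(&self) -> f64 {
        self.tp as f64 / (self.tp + self.fp) as f64
    }
    /// NaN when nothing was planted at all (0/0).
    fn recall(&self) -> f64 {
        self.tp as f64 / (self.tp + self.fn_) as f64
    }
}

fn main() {
    // Two generators with the same seed produce identical noise.
    let (mut a, mut b) = (Lcg(42), Lcg(42));
    let sa: Vec<f32> = (0..8).map(|_| a.next_noise()).collect();
    let sb: Vec<f32> = (0..8).map(|_| b.next_noise()).collect();
    assert_eq!(sa, sb);

    // vital_trend-style tally: 5 planted bands detected, 5 normal controls quiet.
    let mut c = Confusion::default();
    for _ in 0..5 {
        c.record(true, true);
        c.record(false, false);
    }
    // prints: TP5 FP0 TN5 FN0 acc 1.000 prec 1.000 recall 1.000
    println!(
        "TP{} FP{} TN{} FN{} acc {:.3} prec {:.3} recall {:.3}",
        c.tp, c.fp, c.tn, c.fn_, c.accuracy(), c.precision(), c.recall()
    );
}
```

Because the generator is a pure function of its seed, rerunning the harness reproduces every printed number bit-for-bit, which is what allows the table below to be quoted as fixed values.
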

---

## MEASURED-on-synthetic (constructible skills)

| Skill | What was planted (ground truth) | Result | Grade |
|-------|----------------------------------|--------|-------|
| **vital_trend** | BPM held for N≥6 calls in each threshold band (bradypnea/tachypnea <12 / >25, bradycardia/tachycardia <50 / >120, apnea: breathing <1.0 for ≥20) vs normal | **acc 1.000, prec 1.000, recall 1.000** (TP5 FP0 TN5 FN0) | MEASURED |
| **exo_time_crystal** | period-2 coordinated motion vs pseudo-noise + flat | **acc 1.000** (TP1 FP0 TN2 FN0) | MEASURED † |
| **exo_ghost_hunter** (hidden breathing) | phase sinusoid at lag-8 (breathing band 5–15) in an empty room vs flat phase | **acc 1.000**; planted score **1.000**, flat **0.000** | MEASURED |
| **occupancy** | 220-frame flat-amplitude calibration, then strong per-zone amplitude variance vs flat | **acc 1.000** (TP1 FP0 TN1 FN0) | MEASURED |
| **intrusion** | calibrate→arm (330 quiet frames), then per-subcarrier Δphase>1.5 + Δamp≫3σ vs quiet | **acc 1.000** (TP1 FP0 TN1 FN0) | MEASURED |
| **exo_rain_detect** | empty room, 60-frame baseline, then broadband variance (8/8 groups, ratio≫2.5) for ≥10 frames vs stable-low | **acc 1.000** (TP1 FP0 TN1 FN0) | MEASURED |
| **sig_flash_attention** | sustained high phase+amplitude in each of the 8 subcarrier groups; assert reported attention peak == planted group | **peak-localization 8/8 = 1.000** | MEASURED |
| **spt_spiking_tracker** | sparse (2-subcarrier) large phase-delta in each of the 4 zones; assert tracked zone == planted zone | **zone-localization 4/4 = 1.000** | MEASURED ‡ |
| **sig_optimal_transport** | sustained large frame-to-frame amplitude-distribution change vs stationary | **acc 1.000** (TP1 FP0 TN1 FN0) | MEASURED |
| **sig_mincut_person_match** | 2 persons with distinct stable per-region variance signatures over 40 frames | **person IDs assigned, 0 ID swaps / 40 frames** | MEASURED |
| **lrn_dtw_gesture_learn** | stillness → 3 identical gesture rehearsals → enrollment | **template enrolled (templates=1)** | MEASURED (enroll) § |
| **sig_sparse_recovery** | 30 clean frames to init, then 8/32 (25%) nulled subcarriers | **dropout-detect + recovery-trigger = PASS** | MEASURED (trigger) ¶ |

### Caveats on individual results

† **exo_time_crystal — honest discriminative limit.** A *pure* periodic signal
already has autocorrelation peaks at lag L **and** 2L (natural harmonics), so this
"period-doubling" detector cannot separate a true period-2 sub-harmonic from a
plain periodic signal: an earlier plant using a clean sine produced a *false
positive* (recorded during development). The construct it **can** discriminate
with known ground truth is **periodic coordination vs aperiodic** (noise/flat),
which is what is measured (1.000). The original "sub-harmonic vs clean period"
claim is **NOT** validatable with this algorithm.

‡ **spt_spiking_tracker — plant must be sparse.** With weights initialized to
home = 1.0 / cross = 0.25, firing all 8 inputs in a zone (8 × 0.25 = 2.0 > threshold 1.0)
overdrives *every* output neuron, and the tracker collapses to zone 0 (measured 1/4
during development). Firing only 2 inputs (home 2 × 1.0 = 2.0 fires; cross
2 × 0.25 = 0.5 stays silent) yields clean 4/4 zone localization. The validatable
claim is *single-zone* localization.

§ **lrn_dtw_gesture_learn — enrollment validated; replay-match NOT.** The
deterministic, constructible part (stillness → 3 identical rehearsals → a template
is enrolled) is MEASURED. The DTW *replay match* (731) did **not** fire on the
identical replay in this run (`match_same=false`); replay-recognition accuracy is
**reported, not asserted**, and is not claimed as validated.

¶ **sig_sparse_recovery — trigger validated; recovery accuracy is NEGATIVE.** The
dropout-detection + ISTA-recovery *trigger* pipeline fires correctly on >10% planted
nulls (asserted). But the **measured recovery accuracy is NOT a win**: recovered
RMSE **1.0045** vs unrecovered-null RMSE **0.9830** (**−2.2%**, i.e.
slightly *worse* than leaving the nulls at zero) on a neighbor-correlated signal.
The tridiagonal correlation model's fixed point does not equal the planted truth.
**The recovery's reconstruction quality is therefore NOT validated as effective on
synthetic data**; only its detection/trigger path is. Reported honestly; no positive
number claimed.

---

## DATA-GATED — NOT validatable on synthetic data

Planting a "seizure-like" / "weapon-like" / "happy-like" synthetic signal and
claiming the detector "works" validates **nothing real** and is exactly the AI-slop
this project fights. These skills run real DSP (per ADR-160, 0 stubs) and keep their
ADR-160 disclaimers, but get **no accuracy number** here. Each needs the specific
real, labelled data listed:

| Skill | Why not constructible on synthetic | Real data required |
|-------|------------------------------------|--------------------|
| `med_seizure_detect` | "seizure-like" motion is not a seizure; no ground-truth signature exists synthetically | Clinical EEG-/video-labelled tonic-clonic seizure CSI from instrumented patients |
| `med_sleep_apnea` | a planted breathing-pause is not clinical apnea (AHI scoring, hypopnea, desaturation) | Polysomnography-labelled (PSG) overnight CSI with scored apnea/hypopnea events |
| `med_cardiac_arrhythmia` | a synthetic HR sequence cannot encode true arrhythmia morphology | ECG-labelled CSI (AFib/PVC/etc.) from clinical monitoring |
| `med_respiratory_distress` | distress is a clinical gestalt, not a plantable rate | Clinician-labelled respiratory-distress CSI episodes |
| `med_gait_analysis` | clinical gait metrics need a reference motion-capture standard | Mocap-/force-plate-labelled gait CSI |
| `sec_weapon_detect` | a high variance ratio is RF reflectivity, **not** weapon discrimination (ADR-160 §A3 already renamed the event to `HIGH_METAL_REFLECTIVITY`) | Labelled metal-object-vs-no-object CSI with controlled object classes |
| `exo_emotion_detect` | affect is not recoverable from a planted heuristic; outputs are proxies (ADR-160 §A2) | Validated affect-labelled CSI (self-report / physiological ground truth) |
| `exo_happiness_score` | "happiness" is a gait-energy proxy, not a measured affect (ADR-160 §A2) | Validated affect/valence-labelled CSI |
| `exo_dream_stage` | sleep staging needs PSG reference (EEG/EOG/EMG) | PSG-staged overnight CSI |
| `exo_gesture_language` | coarse gesture clusters ≠ true sign language (ADR-160 §A4) | Labelled ASL letter/word CSI dataset |

> The above are **not failures** — they are the honest boundary. A smaller set of
> genuinely measured skills plus this explicit gated list is the deliverable, per
> the prove-everything directive.

---

## Skills not in either list

The remaining edge skills (smart-building / retail / industrial occupancy-style, the
other `sig_*`/`lrn_*`/`spt_*`/`tmp_*`/`qnt_*`/`aut_*`/`ais_*` algorithm-named modules)
are **wired and exercised live** in the unified pipeline integration test
(`tests/pipeline_all.rs`: all 59 default / 64 medical skills run without panic over
300 synthetic frames) but were **not** given an individual planted-ground-truth
accuracy number here. They are honest REAL-DSP modules (ADR-160) whose physical
observable could be planted with more harness work; that is deferred, not claimed.
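
The `tests/pipeline_all.rs` smoke run described above (every skill fed the same synthetic frames, pass = no panic) can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions: the `Skill` trait, its `process_frame` signature, `run_smoke`, and both toy skills are hypothetical stand-ins, since the crate's real skill interface is not shown in this report. `std::panic::catch_unwind` only catches panics when the build uses `panic = "unwind"`, which is the default for host `cargo test`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical per-frame interface (not the crate's real API).
trait Skill {
    fn name(&self) -> &'static str;
    fn process_frame(&mut self, phases: &[f32], amplitudes: &[f32]);
}

/// Toy skill: accumulates amplitude statistics (stands in for a DSP module).
struct MeanAmp {
    sum: f64,
    n: u64,
}
impl Skill for MeanAmp {
    fn name(&self) -> &'static str { "mean_amp" }
    fn process_frame(&mut self, _phases: &[f32], amplitudes: &[f32]) {
        self.sum += amplitudes.iter().map(|&a| a as f64).sum::<f64>();
        self.n += amplitudes.len() as u64;
    }
}

/// Toy skill that panics on its 100th frame, to show the loop catches it.
struct Fragile {
    frames: u32,
}
impl Skill for Fragile {
    fn name(&self) -> &'static str { "fragile" }
    fn process_frame(&mut self, _phases: &[f32], _amplitudes: &[f32]) {
        self.frames += 1;
        assert!(self.frames < 100, "fragile: frame budget exceeded");
    }
}

/// Feed `frames` deterministic synthetic frames to every skill and
/// return the names of the skills that panicked.
fn run_smoke(skills: &mut [Box<dyn Skill>], frames: usize, subcarriers: usize) -> Vec<&'static str> {
    let mut failed = Vec::new();
    for skill in skills.iter_mut() {
        let name = skill.name();
        let result = panic::catch_unwind(AssertUnwindSafe(|| {
            for t in 0..frames {
                // Slow sinusoid per subcarrier; no randomness involved.
                let phases: Vec<f32> =
                    (0..subcarriers).map(|k| ((t + k) as f32 * 0.1).sin()).collect();
                let amps: Vec<f32> = phases.iter().map(|p| 1.0 + 0.5 * p).collect();
                skill.process_frame(&phases, &amps);
            }
        }));
        if result.is_err() {
            failed.push(name);
        }
    }
    failed
}

fn main() {
    let mut skills: Vec<Box<dyn Skill>> = vec![
        Box::new(MeanAmp { sum: 0.0, n: 0 }),
        Box::new(Fragile { frames: 0 }),
    ];
    let failed = run_smoke(&mut skills, 300, 32);
    println!(
        "{} of {} skills ran 300 frames without panic; failed: {:?}",
        skills.len() - failed.len(),
        skills.len(),
        failed
    );
}
```

Catching each skill's panic separately, rather than letting the first one abort the test, reports every failing skill in a single run; a real harness would then assert that the returned list is empty.
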

## Test counts (full crate suite)

```
DEFAULT (--features std):                      631 passed, 0 failed
  (lib 504; budget 25; honest_labeling 10; pipeline_all 4; synthetic_validation 12; bench 1; vendor 75)
MEDICAL (--features std,medical-experimental): 669 passed, 0 failed
  (lib 542; +16 same new tests; med_* stay DATA-GATED, not validated)
```

(M6 baseline was 615 / 653; the new pipeline_all (4) + synthetic_validation (12)
tests add 16 to each tier.)
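
The tallies above are internally consistent, and a short Rust check makes the arithmetic explicit. The per-target counts are copied from the DEFAULT line. Treating the medical tier as identical except for the `lib` target is an assumption inferred from the "+16 same new tests" note; it agrees with 669 − 542 = 127, the sum of the non-lib targets in the default tier. The names `DEFAULT_TIER`, `total`, and `medical_tier` are illustrative only.

```rust
/// Per-target passed-test counts, copied from the DEFAULT tier line.
const DEFAULT_TIER: [(&str, u32); 7] = [
    ("lib", 504),
    ("budget", 25),
    ("honest_labeling", 10),
    ("pipeline_all", 4),
    ("synthetic_validation", 12),
    ("bench", 1),
    ("vendor", 75),
];

/// Sum the passed counts over all test targets.
fn total(targets: &[(&str, u32)]) -> u32 {
    targets.iter().map(|&(_, n)| n).sum()
}

/// Assumed medical tier: same integration targets; only `lib` grows (504 -> 542)
/// because the med_* unit tests compile in.
fn medical_tier() -> Vec<(&'static str, u32)> {
    DEFAULT_TIER
        .iter()
        .map(|&(name, n)| if name == "lib" { (name, 542) } else { (name, n) })
        .collect()
}

fn main() {
    let default_total = total(&DEFAULT_TIER);
    let medical_total = total(&medical_tier());
    let new_tests = 4 + 12; // pipeline_all + synthetic_validation
    println!("default {default_total}, medical {medical_total}, new {new_tests}");
    // Each tier equals its M6 baseline plus the 16 new tests.
    assert_eq!(default_total, 615 + new_tests);
    assert_eq!(medical_total, 653 + new_tests);
}
```
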