# Edge-Skill Synthetic-Ground-Truth Validation — RESULTS

**Crate:** `v2/crates/wifi-densepose-wasm-edge` (workspace-EXCLUDED; build from its own dir)  
**Branch:** `feat/edge-skills-synthetic-validation`  
**ADR:** [ADR-160](../../docs/adr/ADR-160-edge-skill-library-honest-labeling.md)  
**Date:** 2026-06-13  
**Harness:** `tests/synthetic_validation.rs`

> **HONESTY BOUNDARY: read first.** Everything below is **synthetic-ground-truth
> validation**: a signal is *planted* with a known answer, the **real** detector
> is run, and detection accuracy / precision / recall / rate-error are **measured**.
> This is **NOT field accuracy.** A skill that recovers a planted sinusoid here is
> proven to do the math it claims on a *constructed* signal; it is **NOT** proven
> to work on real CSI in a real room. Skills whose detection target cannot be
> honestly planted (clinical, weapon, affect, sleep-stage, sign-language) are
> **NOT** given a number; they are listed under **DATA-GATED** with the real
> data each would require.

## Reproduce

```bash
cd v2/crates/wifi-densepose-wasm-edge   # workspace-excluded; build here
cargo test --features std --test synthetic_validation -- --nocapture
# also runs under the medical tier (med_* skills stay DATA-GATED, not validated):
cargo test --features std,medical-experimental --test synthetic_validation -- --nocapture
```

Each `MEASURED-on-synthetic | …` line printed by the harness is the source of a row in the
table below. Numbers are deterministic (no RNG; the pseudo-noise comes from an LCG
with a fixed seed).

---

## MEASURED-on-synthetic (constructible skills)

| Skill | What was planted (ground truth) | Result | Grade |
|-------|----------------------------------|--------|-------|
| **vital_trend** | BPM held N≥6 calls at each threshold band (brady/tachy-pnea <12 / >25, brady/tachy-cardia <50 / >120, apnea breathing<1.0 for ≥20) vs normal | **acc 1.000, prec 1.000, recall 1.000** (TP5 FP0 TN5 FN0) | MEASURED |
| **exo_time_crystal** | period-2 coordinated motion vs pseudo-noise + flat | **acc 1.000** (TP1 FP0 TN2 FN0) | MEASURED † |
| **exo_ghost_hunter** (hidden breathing) | phase sinusoid at lag-8 (breathing band 5–15) in an empty room vs flat phase | **acc 1.000**; planted score **1.000**, flat **0.000** | MEASURED |
| **occupancy** | 220-frame flat-amplitude calibration, then strong per-zone amplitude variance vs flat | **acc 1.000** (TP1 FP0 TN1 FN0) | MEASURED |
| **intrusion** | calibrate→arm (330 quiet frames), then per-subcarrier Δphase>1.5 + Δamp≫3σ vs quiet | **acc 1.000** (TP1 FP0 TN1 FN0) | MEASURED |
| **exo_rain_detect** | empty room, 60-frame baseline, then broadband variance (8/8 groups, ratio≫2.5) for ≥10 frames vs stable-low | **acc 1.000** (TP1 FP0 TN1 FN0) | MEASURED |
| **sig_flash_attention** | sustained high phase+amplitude in each of the 8 subcarrier groups; assert reported attention peak == planted group | **peak-localization 8/8 = 1.000** | MEASURED |
| **spt_spiking_tracker** | sparse (2-subcarrier) large phase-delta in each of the 4 zones; assert tracked zone == planted zone | **zone-localization 4/4 = 1.000** | MEASURED ‡ |
| **sig_optimal_transport** | sustained large frame-to-frame amplitude-distribution change vs stationary | **acc 1.000** (TP1 FP0 TN1 FN0) | MEASURED |
| **sig_mincut_person_match** | 2 persons with distinct stable per-region variance signatures over 40 frames | **person ids assigned, 0 id-swaps / 40 frames** | MEASURED |
| **lrn_dtw_gesture_learn** | stillness → 3 identical gesture rehearsals → enrollment | **template enrolled (templates=1)** | MEASURED (enroll) § |
| **sig_sparse_recovery** | 30 clean frames to init, then 8/32 (25%) nulled subcarriers | **dropout-detect + recovery-trigger = PASS** | MEASURED (trigger) ¶ |

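The accuracy, precision, and recall figures in the table follow directly from the confusion counts shown in parentheses. A minimal sketch of that arithmetic, assuming the usual definitions; the `Confusion` type and its method names are hypothetical and are not taken from `tests/synthetic_validation.rs`:

```rust
/// Confusion counts for one planted-vs-control experiment.
/// Hypothetical type for illustration; not part of the crate.
#[derive(Debug, Clone, Copy)]
pub struct Confusion {
    pub tp: u32,
    pub fp: u32,
    pub tn: u32,
    pub fn_: u32,
}

/// Divide two counts, returning 0.0 when the denominator is zero
/// (a convention chosen here so the metrics are always defined).
fn ratio(num: u32, den: u32) -> f64 {
    if den == 0 {
        0.0
    } else {
        f64::from(num) / f64::from(den)
    }
}

impl Confusion {
    /// (TP + TN) / all trials.
    pub fn accuracy(&self) -> f64 {
        ratio(self.tp + self.tn, self.tp + self.fp + self.tn + self.fn_)
    }

    /// TP / (TP + FP): how many reported detections were planted.
    pub fn precision(&self) -> f64 {
        ratio(self.tp, self.tp + self.fp)
    }

    /// TP / (TP + FN): how many planted signals were detected.
    pub fn recall(&self) -> f64 {
        ratio(self.tp, self.tp + self.fn_)
    }
}
```

For the `vital_trend` row, `Confusion { tp: 5, fp: 0, tn: 5, fn_: 0 }` gives 1.0 for all three metrics. A row such as `TP1 FP0 TN1 FN0` also reports 1.000, but from only two trials, so the counts matter as much as the ratio.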
### Caveats on individual results

† **exo_time_crystal: honest discriminative limit.** A *pure* periodic signal
with period L already has autocorrelation peaks at lag L **and** 2L (every
multiple of its period), so this "period-doubling" detector cannot separate a true
period-2 sub-harmonic from a plain periodic signal. An earlier plant using a clean
sine produced a *false positive* (recorded during development). The construct it
**can** discriminate with known ground truth is **periodic-coordination vs
aperiodic** (noise/flat), which is what is measured (1.000). The original
"sub-harmonic vs clean period" claim is **NOT** validatable with this algorithm.

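The limit described in the † caveat can be reproduced outside the detector. The sketch below is not the crate's algorithm; it is a plain normalized autocorrelation (the helper names `autocorr` and `sine` are hypothetical) applied to a clean sine. With `sine(8, 256)`, `autocorr` is close to 1 at both lag 8 and lag 16 and close to -1 at lag 4, so a strong peak at twice the base lag cannot by itself indicate a sub-harmonic:

```rust
use std::f64::consts::PI;

/// Normalized autocorrelation of `x` at `lag`, computed over the
/// overlapping samples only. Returns 0.0 for degenerate inputs.
pub fn autocorr(x: &[f64], lag: usize) -> f64 {
    if lag >= x.len() {
        return 0.0;
    }
    let n = x.len() - lag;
    // Cross term between the signal and its lagged copy.
    let num: f64 = (0..n).map(|i| x[i] * x[i + lag]).sum();
    // Energies of the two overlapping windows, used for normalization.
    let e0: f64 = (0..n).map(|i| x[i] * x[i]).sum();
    let e1: f64 = (0..n).map(|i| x[i + lag] * x[i + lag]).sum();
    let den = (e0 * e1).sqrt();
    if den == 0.0 {
        0.0
    } else {
        num / den
    }
}

/// Clean sine wave with a period of `period` samples.
pub fn sine(period: usize, len: usize) -> Vec<f64> {
    (0..len)
        .map(|i| (2.0 * PI * i as f64 / period as f64).sin())
        .collect()
}
```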
‡ **spt_spiking_tracker: plant must be sparse.** With weights initialized to
home=1.0 / cross=0.25, firing all 8 inputs in a zone (8×0.25=2.0 > threshold 1.0)
overdrives *every* output neuron and the tracker collapses to zone 0 (measured 1/4
during development). Firing only 2 inputs (home 2×1.0=2.0 fires, cross 2×0.25=0.5
stays silent) yields clean 4/4 zone localization. The validatable claim is
*single-zone* localization.

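The overdrive arithmetic in the ‡ caveat can be checked with a deliberately simplified model. This is a sketch under stated assumptions: a single integration step with no leak or refractory behaviour, using only the weights and threshold quoted above. The constant and function names are hypothetical and do not come from the crate. In this model `fired(2, 8)` marks all four neurons as firing, while `fired(2, 2)` marks only index 2:

```rust
/// Weight from an input to the output neuron of its own zone.
pub const HOME_W: f64 = 1.0;
/// Weight from an input to the output neurons of other zones.
pub const CROSS_W: f64 = 0.25;
/// Firing threshold of every output neuron.
pub const THRESHOLD: f64 = 1.0;
/// Number of zones (one output neuron per zone).
pub const ZONES: usize = 4;

/// Which output neurons fire when `active` inputs spike in `planted_zone`.
/// One-step model: summed drive is `active * weight`, compared to the threshold.
pub fn fired(planted_zone: usize, active: usize) -> [bool; ZONES] {
    let mut out = [false; ZONES];
    for (zone, slot) in out.iter_mut().enumerate() {
        let weight = if zone == planted_zone { HOME_W } else { CROSS_W };
        *slot = active as f64 * weight > THRESHOLD;
    }
    out
}
```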
§ **lrn_dtw_gesture_learn: enrollment validated; replay-match NOT.** The
deterministic, constructible part (stillness → 3 identical rehearsals → a template
is enrolled) is MEASURED. The DTW *replay match* (731) did **not** fire on the
identical replay in this run (`match_same=false`). Replay-recognition accuracy is
**reported, not asserted**, and is not claimed as validated.

¶ **sig_sparse_recovery: trigger validated; recovery accuracy is NEGATIVE.**
The dropout-detection + ISTA-recovery *trigger* pipeline fires correctly on >10%
planted nulls (asserted). But the **measured recovery accuracy is NOT a win**:
recovered RMSE **1.0045** vs unrecovered-null RMSE **0.9830** (**−2.2%**, i.e.
slightly *worse* than leaving the nulls at zero) on a neighbor-correlated signal.
The tridiagonal correlation model's fixed point does not equal the planted truth.
**The recovery's reconstruction quality is therefore NOT validated as effective on
synthetic data**; only its detection/trigger path is. Reported honestly; no
positive number claimed.

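The −2.2% figure in the ¶ caveat is consistent with a relative change normalized by the unrecovered RMSE: (0.9830 − 1.0045) / 0.9830 ≈ −0.0219. A minimal sketch of that comparison, assuming this normalization (the source reports only the two RMSE values and the percentage); `rmse` and `relative_improvement` are hypothetical helper names, not harness functions:

```rust
/// Root-mean-square error between a planted truth and an estimate.
/// Panics if the slices differ in length; returns 0.0 for empty input.
pub fn rmse(truth: &[f64], estimate: &[f64]) -> f64 {
    assert_eq!(truth.len(), estimate.len(), "length mismatch");
    if truth.is_empty() {
        return 0.0;
    }
    let sum_sq: f64 = truth
        .iter()
        .zip(estimate)
        .map(|(t, e)| (t - e).powi(2))
        .sum();
    (sum_sq / truth.len() as f64).sqrt()
}

/// Relative improvement of recovery over leaving the nulls at zero.
/// Positive means recovery helped; negative means it made things worse.
pub fn relative_improvement(rmse_unrecovered: f64, rmse_recovered: f64) -> f64 {
    (rmse_unrecovered - rmse_recovered) / rmse_unrecovered
}
```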

---

## DATA-GATED — NOT validatable on synthetic data

Planting a "seizure-like" / "weapon-like" / "happy-like" synthetic signal and
claiming the detector "works" validates **nothing real** and is exactly the
AI-slop this project fights. These skills run real DSP (per ADR-160, 0 stubs) and
keep their ADR-160 disclaimers, but get **no accuracy number** here. Each needs
the specific real, labelled data listed:

| Skill | Why not constructible on synthetic | Real data required |
|-------|------------------------------------|--------------------|
| `med_seizure_detect` | "seizure-like" motion is not a seizure; no ground-truth signature exists synthetically | Clinical EEG-/video-labelled tonic-clonic seizure CSI from instrumented patients |
| `med_sleep_apnea` | a planted breathing-pause is not clinical apnea (AHI scoring, hypopnea, desaturation) | Polysomnography-labelled (PSG) overnight CSI with scored apnea/hypopnea events |
| `med_cardiac_arrhythmia` | a synthetic HR sequence cannot encode true arrhythmia morphology | ECG-labelled CSI (AFib/PVC/etc.) from clinical monitoring |
| `med_respiratory_distress` | distress is a clinical gestalt, not a plantable rate | Clinician-labelled respiratory-distress CSI episodes |
| `med_gait_analysis` | clinical gait metrics need a reference motion-capture standard | Mocap-/force-plate-labelled gait CSI |
| `sec_weapon_detect` | a high variance ratio is RF reflectivity, **not** weapon discrimination (ADR-160 §A3 already renamed the event to `HIGH_METAL_REFLECTIVITY`) | Labelled metal-object-vs-no-object CSI with controlled object classes |
| `exo_emotion_detect` | affect is not recoverable from a planted heuristic; outputs are proxies (ADR-160 §A2) | Validated affect-labelled CSI (self-report / physiological ground truth) |
| `exo_happiness_score` | "happiness" is a gait-energy proxy, not a measured affect (ADR-160 §A2) | Validated affect/valence-labelled CSI |
| `exo_dream_stage` | sleep staging needs PSG reference (EEG/EOG/EMG) | PSG-staged overnight CSI |
| `exo_gesture_language` | coarse gesture clusters ≠ true sign language (ADR-160 §A4) | Labelled ASL letter/word CSI dataset |

> The above are **not failures**; they are the honest boundary. A smaller set of
> genuinely-measured skills plus this explicit gated list is the deliverable, per
> the prove-everything directive.

---

## Skills not in either list

The remaining edge skills (the smart-building / retail / industrial
occupancy-style skills and the other
`sig_*`/`lrn_*`/`spt_*`/`tmp_*`/`qnt_*`/`aut_*`/`ais_*` algorithm-named modules)
are **wired and exercised live** in the unified pipeline integration test
(`tests/pipeline_all.rs`; all 59 default / 64 medical skills run without panic over
300 synthetic frames) but were **not** given an individual planted-ground-truth
accuracy number here. They are honest REAL-DSP modules (ADR-160) whose physical
observable could be planted with more harness work; that is deferred, not claimed.

## Test counts (full crate suite)

```
DEFAULT (--features std): 631 passed, 0 failed
  (lib 504; budget 25; honest_labeling 10; pipeline_all 4; synthetic_validation 12; bench 1; vendor 75)
MEDICAL (--features std,medical-experimental): 669 passed, 0 failed
  (lib 542; +16 same new tests; med_* stay DATA-GATED, not validated)
```

(M6 baseline was 615 / 653; the new pipeline_all (4) + synthetic_validation (12)
tests add 16 to each tier.)
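
The totals above can be cross-checked from the per-suite breakdown. A small sketch of that bookkeeping; it assumes the non-lib suites have the same counts in the medical tier as in the default tier (the source lists them only for DEFAULT), and the names `NON_LIB_SUITES` and `tier_total` are hypothetical:

```rust
/// Per-suite test counts outside the lib tests, as listed for the DEFAULT tier.
/// Assumed (not stated in the source) to be identical in the MEDICAL tier.
pub const NON_LIB_SUITES: [(&str, u32); 6] = [
    ("budget", 25),
    ("honest_labeling", 10),
    ("pipeline_all", 4),
    ("synthetic_validation", 12),
    ("bench", 1),
    ("vendor", 75),
];

/// Number of tests added on top of the M6 baseline
/// (pipeline_all 4 + synthetic_validation 12).
pub const NEW_TESTS: u32 = 16;

/// Total test count for a tier, given its lib test count.
pub fn tier_total(lib: u32) -> u32 {
    lib + NON_LIB_SUITES.iter().map(|(_, n)| n).sum::<u32>()
}
```

Under that assumption, `tier_total(504)` is 631 and `tier_total(542)` is 669, and both match the M6 baseline plus `NEW_TESTS` (615 + 16 and 653 + 16).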