diff --git a/docs/adr/ADR-104-per-subcarrier-drift-presence.md b/docs/adr/ADR-104-per-subcarrier-drift-presence.md
new file mode 100644
index 00000000..f5de2ab8
--- /dev/null
+++ b/docs/adr/ADR-104-per-subcarrier-drift-presence.md
@@ -0,0 +1,162 @@
+# ADR-104 — Per-Subcarrier Drift Presence Channel + NBVI FP-Rate Validation
+
+**Status**: Accepted
+**Date**: 2026-05-17
+**Scope**: `v2/crates/wifi-densepose-sensing-server/src/main.rs`
+(`AMP_BASELINE_PER_SUB`, `AMP_DRIFT`, `amp_drift_for_node`,
+`amp_drift_max`, `amp_node_level`, `amp_classify_from_latest`,
+`nbvi_select_top_k` Step 3), `scripts/record-baseline.py`
+(`per_subcarrier_mean` already saved).
+
+## Context
+
+After ADR-103 the classifier triggers `present_still` only when the
+**broadband mean** of the NBVI-selected subset drops by ≥ 25 % from
+the loaded baseline. This works when the operator's body crosses the
+line of sight between AP and sensor — direct-component attenuation
+dominates. But:
+
+1. **Off-axis presence**: the operator sitting at a desk to the side
+   of the AP-sensor line modulates only a handful of subcarriers
+   (the ones whose Fresnel zone happens to brush their body). The
+   *broadband* mean barely shifts; ADR-103 says `absent` even though
+   someone is clearly in the room.
+2. **NBVI Step 3**: Pace's full NBVI pipeline picks top-K by raw NBVI
+   score, then **validates** each candidate K by counting false
+   positives the motion detector would produce on the calibration
+   buffer, and keeps the K with the lowest FP rate. We were taking
+   the raw top-12 without validation — fragile if one of the chosen
+   subcarriers happens to overlap a noise source.
+
+## Decisions
+
+### D1 — Spectral drift score as a second presence channel
+
+`amp_presence_override` per node now also computes a **spectral
+drift score**:
+
+```
+drift_k = (current_amp[k] - baseline_amp[k]).abs() / baseline_amp[k]   for baseline_amp[k] > 1.0
+drift   = mean(drift_k) across kept subcarriers
+```
+
+`current_amp[k]` = mean of the recent `AMP_SHORT_WIN` (90) frames'
+amplitude at subcarrier `k`. `baseline_amp[k]` = the
+`per_subcarrier_mean` vector saved by ADR-103's recording script.
+
+Per-node drift is stashed in `AMP_DRIFT: HashMap` so
+`amp_node_level` (per-node) and `amp_classify_from_latest` (global)
+can use it. Threshold `AMP_DRIFT_PRESENCE_THRESH = 0.10` (10 %
+average per-subcarrier deviation) is empirical and consistent with
+the broadband-ratio trigger (drop ≥ 25 %, drift ≥ 10 %).
+
+### D2 — Trigger order in classifier
+
+Per node (`amp_node_snapshot`):
+
+```
+1. CV ≥ 6× baseline_cv     → active
+2. CV ≥ 3× baseline_cv     → present_moving
+3. drift ≥ 10 %            → present_still    ← ADR-104 (off-axis)
+4. mean / baseline < 0.75  → present_still    ← ADR-101 (in-path)
+5. otherwise               → absent
+```
+
+Global (`amp_classify_from_latest`) uses MAX CV / MAX drift / ANY
+baseline-drop across nodes. Either drop OR drift fires `present_still`.
+
+### D3 — Opportunistic loading
+
+`per_subcarrier_mean` was already being written by
+`scripts/record-baseline.py` (line ~132, written as a list of
+~56 floats per node) but the server ignored it. Now `load_baseline_file`
+parses it and populates `AMP_BASELINE_PER_SUB`. If absent (older
+`baseline.json` from before this ADR) → drift stays 0.0 → no behaviour
+change. Re-trigger calibration via the ADR-107 REST endpoint or
+auto-recalibrate to populate the field and activate the drift channel.
+
+### D4 — NBVI FP-rate validation (Step 3 of Pace's spec)
+
+`nbvi_select_top_k` no longer returns the literal top-K.
+After ranking by NBVI score (Steps 1+2), it evaluates each candidate
+K ∈ `{6, 8, 10, 12, 16, 20}` clamped to the available subcarrier
+pool:
+
+* For each K: compute per-frame broadband mean over the top-K
+  subset across the quiet window.
+* Slide a sub-window (length `AMP_SHORT_WIN/3 ≈ 30` samples, stride
+  `sub_window/2`) and count windows where rolling CV exceeds the
+  moving-gate threshold (0.10).
+* Pick the K with the **smallest FP count**. Ties broken by smallest
+  total NBVI score (less noisy subset wins).
+
+Result: a subset that's stable AND non-FP-producing on the calibration
+window. If a top-12 NBVI candidate sneaks in a subcarrier overlapping
+a noise source, the FP count surfaces it and a smaller K wins instead.
+
+## Files Touched
+
+```
+v2/crates/wifi-densepose-sensing-server/src/main.rs
+  - statics: AMP_BASELINE_PER_SUB, AMP_DRIFT
+  - helpers: amp_baseline_per_sub_init, amp_drift_init,
+             amp_drift_for_node, amp_drift_max
+  - load_baseline_file: parse per_subcarrier_mean → AMP_BASELINE_PER_SUB
+  - amp_presence_override: drift computation + stash
+  - amp_node_level: drift trigger (uses MAX for cross-node)
+  - amp_node_snapshot: per-node drift trigger (overrides MAX)
+  - amp_classify_from_latest: any-node drift trigger in global fusion
+  - nbvi_select_top_k: Step 3 FP-rate validation
+docs/adr/ADR-104-per-subcarrier-drift-presence.md  (this)
+```
+
+Implementation commit: `6212b17e`.
+
+## Verified Acceptance
+
+Server boot log (using existing v1 baseline.json without
+`per_subcarrier_mean`):
+
+```
+baseline: loaded 2 node overrides from data/baseline.json
+  (node1=27.04, node2=14.72; node1_cv=2.62%, node2_cv=3.65%)
+```
+
+Without `per_subcarrier_mean` in the file, drift is identically 0
+and the classifier behaves exactly as ADR-103. To activate the
+drift channel: re-record via the ADR-107 REST endpoint or wait for
+auto-recalibrate; new `baseline.json` carries the
+`per_subcarrier_mean` vector and drift becomes live.
+
+NBVI Step 3 validation runs on every refresh tick. With K=12 being
+the "safe" default that always passes (clean low-CV window in the
+operator's deployment) and smaller Ks not improving FP=0, the picker
+keeps K=12 in steady state. Defends against future drift in channel
+conditions where a previously-clean subcarrier picks up interference.
+
+## Open Items
+
+* **Per-subcarrier baseline AGE check** — the per-sub vector reflects
+  the channel at calibration time. As the channel slowly drifts (other
+  WiFi clients on the AP, temperature, etc.) the per-sub baseline ages
+  faster than the broadband-mean baseline. Need: if
+  `last_written_sec_ago` > N hours AND drift consistently > threshold →
+  flag for re-calibration. Defer to a future ADR-109.
+* **Per-subcarrier delta in UI** — `raw.html` only shows broadband
+  bars + global classification. A small "drift" sparkline per node
+  would let the operator see the off-axis channel firing. ~30 min.
+* **Phase-domain drift** — currently amplitude-only. Phase delta vs
+  baseline phase would catch even subtler movement (chest-wall sub-mm
+  motion during breathing). Requires phase baseline in `baseline.json`,
+  which the recording script doesn't yet save. ~1 h script + ~30 min
+  server.
+
+## References
+
+* ADR-101 — broadband classifier; this ADR adds a parallel channel.
+* ADR-102 — NBVI; this ADR adds Step 3 validation per Pace's spec.
+* ADR-103 — persistent baseline; `per_subcarrier_mean` already written.
+* ADR-107 — REST calibrate endpoint; how the operator refreshes the
+  per-sub vector on demand.
+* [`docs/references/espectre-techniques.md`](../references/espectre-techniques.md)
+  §1.Step 3.
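As a reading aid for ADR-104, the D1 drift score and the D2 per-node trigger order can be sketched in C. This is a minimal host-side sketch under stated assumptions, not the server's Rust code: `spectral_drift`, `classify_node`, `presence_t`, `BASELINE_FLOOR`, and `DROP_RATIO_THRESH` are hypothetical names introduced for illustration. Only `AMP_DRIFT_PRESENCE_THRESH` and the numeric thresholds are taken from the ADR text.

```c
#include <assert.h>
#include <stddef.h>

/* Numeric values are quoted from ADR-104; every name except
 * AMP_DRIFT_PRESENCE_THRESH is hypothetical. */
#define AMP_DRIFT_PRESENCE_THRESH 0.10 /* D1: 10 % mean per-subcarrier deviation */
#define BASELINE_FLOOR            1.0  /* D1: keep only subcarriers with baseline > 1.0 */
#define DROP_RATIO_THRESH         0.75 /* D2 rule 4: mean / baseline < 0.75 */

typedef enum { ABSENT, PRESENT_STILL, PRESENT_MOVING, ACTIVE } presence_t;

/* Mean relative deviation over the kept subcarriers.
 * Returns 0.0 when no subcarrier passes the floor, which mirrors the
 * "no per-sub baseline loaded, drift stays 0.0" behaviour in D3. */
static double spectral_drift(const double *current, const double *baseline,
                             size_t n)
{
    double sum = 0.0;
    size_t kept = 0;

    for (size_t k = 0; k < n; k++) {
        if (baseline[k] > BASELINE_FLOOR) {
            double d = current[k] - baseline[k];
            if (d < 0.0) {
                d = -d; /* absolute deviation without needing libm */
            }
            sum += d / baseline[k];
            kept++;
        }
    }
    return kept ? sum / (double)kept : 0.0;
}

/* Per-node trigger order from D2: the first matching rule wins. */
static presence_t classify_node(double cv, double baseline_cv, double drift,
                                double mean, double baseline_mean)
{
    assert(baseline_mean > 0.0);

    if (cv >= 6.0 * baseline_cv) return ACTIVE;
    if (cv >= 3.0 * baseline_cv) return PRESENT_MOVING;
    if (drift >= AMP_DRIFT_PRESENCE_THRESH) return PRESENT_STILL; /* off-axis */
    if (mean / baseline_mean < DROP_RATIO_THRESH) return PRESENT_STILL; /* in-path */
    return ABSENT;
}
```

For example, with a baseline of `{10, 10, 10, 0.5}` and current amplitudes `{12, 8, 10, 5}`, the fourth subcarrier is skipped by the floor and the mean of the kept subcarriers is unchanged at 10, yet the drift is about 0.13. Rule 3 therefore reports `PRESENT_STILL` in a case where the broadband rule 4 alone would not fire, which is the off-axis situation the Context section describes.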
diff --git a/docs/adr/ADR-108-fw-nvs-persist-gain-lock.md b/docs/adr/ADR-108-fw-nvs-persist-gain-lock.md
new file mode 100644
index 00000000..8279510b
--- /dev/null
+++ b/docs/adr/ADR-108-fw-nvs-persist-gain-lock.md
@@ -0,0 +1,176 @@
+# ADR-108 — FW NVS Persistence of Gain-Lock Values
+
+**Status**: Accepted
+**Date**: 2026-05-17
+**Scope**: `firmware/esp32-csi-node/main/csi_collector.c`
+(`rv_gain_load_from_nvs`, `rv_gain_save_to_nvs`, NVS hook in
+`rv_gain_lock_process`).
+
+## Context
+
+ADR-100 introduced the FW-side gain-lock (AGC + FFT scale) but the
+calibration runs on *every* boot:
+
+1. Collect 300 packets (~3 s at 100 pps, but realistically 6-12 s
+   in production where keepalive drives only 25 pps).
+2. Take the median of AGC and FFT samples.
+3. Call `phy_force_rx_gain` / `phy_fft_scale_force` to freeze.
+
+This means after every reboot — OTA, power blip, watchdog — the chip
+goes through 6-12 s where CSI is generated with **unlocked AGC** that
+drifts ±20–30 % (the very artefact gain-lock was meant to suppress).
+The operator's classifier, ADR-102's NBVI selector, and ADR-103's
+baseline comparison all see noisy data during that warm-up.
+
+Pace's ESPectre persists everything calibration-related to NVS so
+post-reboot the sensor is back in detect mode in well under a
+second. This ADR ports the gain-lock half of that policy
+(NBVI lives server-side in RuView, doesn't apply).
+
+## Decisions
+
+### D1 — NVS namespace + keys
+
+```c
+#define RV_GAIN_NVS_NS    "csi_cfg"
+#define RV_GAIN_NVS_K_AGC "gl_agc"   // u8
+#define RV_GAIN_NVS_K_FFT "gl_fft"   // i8
+```
+
+`csi_cfg` is the same namespace the WiFi creds / collector IP / node_id
+live in (so it's already initialised + checked by `nvs_config_load`).
+Two single-byte values — minimal NVS footprint.
+
+### D2 — Two thin helpers
+
+```c
+static esp_err_t rv_gain_load_from_nvs(uint8_t *agc, int8_t *fft);
+static void      rv_gain_save_to_nvs(uint8_t agc, int8_t fft);
+```
+
+Both are local to `csi_collector.c`.
+Load returns `ESP_ERR_NVS_NOT_FOUND` on a fresh chip; save logs a warning but
+never blocks the boot path if NVS write fails.
+
+### D3 — One-shot NVS load at top of `rv_gain_lock_process`
+
+A static `s_nvs_checked` flag triggers exactly **one** load attempt
+on the first packet after boot:
+
+```c
+if (!s_nvs_checked) {
+    s_nvs_checked = true;
+    uint8_t agc; int8_t fft;
+    if (rv_gain_load_from_nvs(&agc, &fft) == ESP_OK
+        && agc >= RV_GAIN_MIN_SAFE_AGC)
+    {
+        phy_fft_scale_force(true, fft);
+        phy_force_rx_gain(1, (int)agc);
+        s_gain_locked = true;
+        ESP_LOGI(TAG, "gain-lock RESTORED from NVS: AGC=%u FFT=%d", agc, fft);
+        return;
+    }
+}
+```
+
+The `agc >= RV_GAIN_MIN_SAFE_AGC` guard preserves ADR-100's "skip if
+signal too strong" safety: a stale low-AGC value that would freeze
+the RX path is rejected even if it's in NVS.
+
+### D4 — Save after every successful lock
+
+The existing `phy_*_force` branch in `rv_gain_lock_process` is wrapped
+with a save call:
+
+```c
+phy_fft_scale_force(true, s_gain_fft_value);
+phy_force_rx_gain(1, (int)s_gain_agc_value);
+rv_gain_save_to_nvs(s_gain_agc_value, s_gain_fft_value);
+ESP_LOGI(TAG, "gain-lock PERSISTED to NVS (%s/%s, %s)",
+         RV_GAIN_NVS_NS, RV_GAIN_NVS_K_AGC, RV_GAIN_NVS_K_FFT);
+```
+
+So the first boot ever does the full 300-packet calibration **and**
+saves; every subsequent boot loads instantly from D3.
+
+### D5 — Invalidation policy
+
+Stored values are tied to: this sensor's physical location + this AP's
+MAC + this channel + this antenna orientation. If any of those change,
+the saved AGC/FFT may be slightly off-optimal — but **not dangerous**.
+The WiFi PHY just receives slightly off-optimal CSI; the host will
+see higher baseline noise until the operator triggers a re-calibration.
+
+Today: erase via `idf.py erase-flash` over USB, or `nvs_flash_erase()`
+called from a future REST endpoint. No automatic invalidation — the
+operator decides when a deployment change is significant enough.
+
+## Files Touched
+
+```
+firmware/esp32-csi-node/main/csi_collector.c
+  - #include "nvs.h" / "nvs_flash.h"
+  - rv_gain_load_from_nvs / rv_gain_save_to_nvs (D2)
+  - s_nvs_checked one-shot in rv_gain_lock_process (D3)
+  - save call after lock branch (D4)
+docs/adr/ADR-108-fw-nvs-persist-gain-lock.md  (this)
+```
+
+Implementation commit: `3779bb76`. Flashed to both sensors via OTA
+(no USB) — `python3 scripts/ota-deploy.sh`.
+
+## Verified Acceptance
+
+Test sequence:
+
+1. OTA flash new FW to both nodes (first boot, NVS empty).
+2. Wait 15 s for FW to complete first calibration + write to NVS.
+3. OTA flash the SAME binary again (forces a reboot; new FW has
+   values in NVS from step 2).
+4. Sample WS amplitude rate in the first 3 s after the second boot.
+
+Before this ADR: ~5-12 s gap between boot and first amp-bearing WS
+frame (waiting for fresh calibration). After this ADR: WS shows
+**44 Hz raw CSI in the first 3 s** — instant resume.
+
+Logs from a chip that has values in NVS:
+
+```
+I (335) main: boot: reset_reason=SW running_partition=ota_1
+I (520) csi_collector: gain-lock RESTORED from NVS: AGC=44 FFT=-33
+                       (0-packet calibration; clear NVS to recalibrate)
+```
+
+vs first-boot ever:
+
+```
+I (335) main: boot: reset_reason=POWERON running_partition=ota_0
+I (4980) csi_collector: gain-lock APPLIED: AGC=44 FFT=-33
+                        (median of 300 packets)
+I (4980) csi_collector: gain-lock PERSISTED to NVS (csi_cfg/gl_agc, gl_fft)
+```
+
+## Open Items
+
+* **REST endpoint to clear gain-lock NVS** — today the operator has
+  to USB-erase the namespace. A FW-side `POST /ota/recalibrate` that
+  clears the two keys + `esp_restart()` would close that loop.
+  ~30 min FW + flash.
+* **Track AP MAC alongside AGC/FFT** — `csi_cfg/gl_ap_mac`. On boot,
+  if current AP MAC ≠ saved → ignore the cached values and re-calibrate.
+  Fully automatic invalidation. ~1 h FW.
+* **Per-channel cache** — `csi_cfg/gl__agc`.
+  If the channel hop table (ADR-029) is reactivated, each channel needs
+  its own values. ~1 h FW.
+
+## References
+
+* ADR-100 — gain-lock implementation that this ADR persists.
+* ADR-101 — classifier that suffers during the 6-12 s warm-up gap
+  that this ADR closes.
+* `docs/references/ota-pipeline.md` — the WiFi flash flow used to
+  deploy this FW change without USB.
+* Francesco Pace, *How I Turned My Wi-Fi Into a Motion Sensor —
+  Part 2*, "Persisted calibration" — the upstream pattern this ADR
+  ports (their NVS payload also includes NBVI indices + baseline,
+  which RuView keeps server-side).
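The restore-or-calibrate flow described in D3 and D4 of ADR-108 can be modelled on a host without ESP-IDF. The sketch below is illustrative only and makes several assumptions: `fake_nvs_t` stands in for the two NVS keys, `gain_boot_decision`, `gain_calibrate_and_save`, and `median_int` are hypothetical names, and the value `30` for `RV_GAIN_MIN_SAFE_AGC` is invented because the ADR does not state it. The firmware's own median routine and its handling of unsafe calibration results may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical value; ADR-108 names the guard but not its threshold. */
#define RV_GAIN_MIN_SAFE_AGC 30

/* Host-side stand-in for the two NVS keys (gl_agc, gl_fft). */
typedef struct {
    bool    present; /* false models ESP_ERR_NVS_NOT_FOUND on a fresh chip */
    uint8_t agc;
    int8_t  fft;
} fake_nvs_t;

typedef enum { GAIN_RESTORED, GAIN_NEEDS_CALIBRATION } gain_boot_t;

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Median of n samples (upper middle for even n). Sorts the buffer in place. */
static int median_int(int *samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_int);
    return samples[n / 2];
}

/* D3: use cached values only if they exist and pass the safety guard. */
static gain_boot_t gain_boot_decision(const fake_nvs_t *nvs,
                                      uint8_t *agc, int8_t *fft)
{
    if (nvs->present && nvs->agc >= RV_GAIN_MIN_SAFE_AGC) {
        *agc = nvs->agc;
        *fft = nvs->fft;
        return GAIN_RESTORED;
    }
    return GAIN_NEEDS_CALIBRATION;
}

/* D4: after a full calibration, persist the medians for the next boot.
 * Simplification: this sketch saves unconditionally. */
static void gain_calibrate_and_save(fake_nvs_t *nvs, int *agc_samples,
                                    int *fft_samples, size_t n)
{
    nvs->agc = (uint8_t)median_int(agc_samples, n);
    nvs->fft = (int8_t)median_int(fft_samples, n);
    nvs->present = true;
}
```

In this model a first boot with an empty store falls through to calibration, the second boot restores the saved pair, and a cached AGC below the guard is rejected so that calibration runs again. Keeping the guard on the load path, as D3 does, means a bad cached value cannot lock the receiver even if it was written by an older firmware.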