docs: ADR-104 (per-subcarrier drift) + ADR-108 (FW NVS persistence)

ADR-104: documents the off-axis presence channel that fires present_still when per-subcarrier amplitudes drift ≥10% from the saved per_subcarrier_mean baseline, plus the NBVI Step 3 FP-rate validation (K candidate sweep, smallest-FP wins). Implementation shipped in 6212b17e.

ADR-108: documents the FW NVS persistence of gain-lock values (csi_cfg/gl_agc + gl_fft), the one-shot load at first packet after boot, the save after every successful calibration, and the MIN_SAFE_AGC safety guard on restored values. Implementation shipped in 3779bb76; flashed to both sensors via OTA.

Both ADRs are ≤ 200 lines per the project's docs convention. Open items are recorded so future agents can pick them up: per-sub drift age check, phase-domain drift, REST recalibrate endpoint, AP-MAC tied cache.
2026-05-17 13:34:22 +07:00 · d7189d9b0f
parent 3779bb7655
commit d7189d9b0f
2 changed files with 338 additions and 0 deletions
--- a/docs/adr/ADR-104-per-subcarrier-drift-presence.md
+++ b/docs/adr/ADR-104-per-subcarrier-drift-presence.md
@@ -0,0 +1,162 @@
+# ADR-104 — Per-Subcarrier Drift Presence Channel + NBVI FP-Rate Validation
+
+**Status**: Accepted
+**Date**: 2026-05-17
+**Scope**: `v2/crates/wifi-densepose-sensing-server/src/main.rs`
+(`AMP_BASELINE_PER_SUB`, `AMP_DRIFT`, `amp_drift_for_node`,
+`amp_drift_max`, `amp_node_level`, `amp_classify_from_latest`,
+`nbvi_select_top_k` Step 3), `scripts/record-baseline.py`
+(`per_subcarrier_mean` already saved).
+
+## Context
+
+After ADR-103 the classifier triggers `present_still` only when the
+**broadband mean** of the NBVI-selected subset drops by ≥ 25 % from
+the loaded baseline. This works when the operator's body crosses the
+line of sight between AP and sensor — direct-component attenuation
+dominates. But:
+
+1. **Off-axis presence**: the operator sitting at a desk to the side
+   of the AP-sensor line modulates only a handful of subcarriers
+   (the ones whose Fresnel zone happens to brush their body). The
+   *broadband* mean barely shifts; ADR-103 says `absent` even though
+   someone is clearly in the room.
+2. **NBVI Step 3**: Pace's full NBVI pipeline picks top-K by raw NBVI
+   score, then **validates** each candidate K by counting false
+   positives the motion detector would produce on the calibration
+   buffer, and keeps the K with the lowest FP rate. We were taking
+   the raw top-12 without validation — fragile if one of the chosen
+   subcarriers happens to overlap a noise source.
+
+## Decisions
+
+### D1 — Spectral drift score as a second presence channel
+
+`amp_presence_override` per node now also computes a **spectral
+drift score**:
+
+```
+drift_k = (current_amp[k] - baseline_amp[k]).abs() / baseline_amp[k]    for baseline_amp[k] > 1.0
+drift   = mean(drift_k) across kept subcarriers
+```
+
+`current_amp[k]` = mean of the recent `AMP_SHORT_WIN` (90) frames'
+amplitude at subcarrier `k`. `baseline_amp[k]` = the
+`per_subcarrier_mean` vector saved by ADR-103's recording script.
+
+Per-node drift is stashed in `AMP_DRIFT: HashMap<u8, f64>` so
+`amp_node_level` (per-node) and `amp_classify_from_latest` (global)
+can use it. Threshold `AMP_DRIFT_PRESENCE_THRESH = 0.10` (10 %
+average per-subcarrier deviation) is empirical and consistent with
+the broadband-ratio trigger (drop ≥ 25 %, drift ≥ 10 %).
+
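A minimal Python sketch of the D1 drift score (the server itself is Rust; this is an illustrative model, not the shipped code). The function name `spectral_drift`, the constant `MIN_BASELINE_AMP`, and the sample amplitudes are assumptions made for the example; only the formula, the `> 1.0` baseline floor, and the 0.10 threshold come from the ADR.

```python
from statistics import fmean

AMP_DRIFT_PRESENCE_THRESH = 0.10  # threshold stated in D1
MIN_BASELINE_AMP = 1.0            # hypothetical name for the "baseline > 1.0" floor


def spectral_drift(current_amp, baseline_amp):
    """Mean relative deviation over subcarriers whose baseline exceeds the floor."""
    kept = [
        abs(cur - base) / base
        for cur, base in zip(current_amp, baseline_amp)
        if base > MIN_BASELINE_AMP
    ]
    # No usable subcarrier (or no per-sub baseline at all) means zero drift.
    return fmean(kept) if kept else 0.0


# Invented amplitudes: two of the four usable subcarriers dip, the rest hold.
baseline = [20.0, 20.0, 20.0, 20.0, 0.5]
off_axis = [20.0, 20.0, 14.0, 12.0, 0.9]

drift = spectral_drift(off_axis, baseline)

# Broadband view of the same frames, over the same kept subcarriers.
kept_idx = [i for i, b in enumerate(baseline) if b > MIN_BASELINE_AMP]
broadband_ratio = (fmean([off_axis[i] for i in kept_idx])
                   / fmean([baseline[i] for i in kept_idx]))
```

With these invented numbers the drift is about 0.175 and clears the 10 % threshold, while the broadband mean only falls to about 0.825 of baseline and would not trip the 25 % drop trigger. That is the off-axis case the Context section describes.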
+### D2 — Trigger order in classifier
+
+Per node (`amp_node_snapshot`):
+
+```
+1. CV ≥ 6× baseline_cv  → active
+2. CV ≥ 3× baseline_cv  → present_moving
+3. drift ≥ 10 %         → present_still   ← ADR-104 (off-axis)
+4. mean / baseline < 0.75 → present_still ← ADR-101 (in-path)
+5. otherwise            → absent
+```
+
+Global (`amp_classify_from_latest`) uses MAX CV / MAX drift / ANY
+baseline-drop across nodes. Either drop OR drift fires `present_still`.
+
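The per-node trigger order can be modelled as a simple cascade. This is a hedged Python sketch rather than the Rust in `main.rs`: `classify_node` and its parameter names are hypothetical, and only the multipliers, thresholds, and ordering are taken from the list in D2.

```python
def classify_node(cv, baseline_cv, drift, mean_amp, baseline_mean):
    """Return the presence level for one node, checking triggers in D2 order."""
    if cv >= 6 * baseline_cv:
        return "active"
    if cv >= 3 * baseline_cv:
        return "present_moving"
    if drift >= 0.10:                       # off-axis channel (this ADR)
        return "present_still"
    if mean_amp / baseline_mean < 0.75:     # in-path broadband drop
        return "present_still"
    return "absent"
```

For example, with a baseline CV of 0.0262 and a baseline mean of 27.04 (the node1 figures in the boot log further down), a quiet CV of 0.03 with a drift of 0.175 yields `present_still`, while the same CV with a drift of 0.02 and a mean of 26.5 yields `absent`.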
+### D3 — Opportunistic loading
+
+`per_subcarrier_mean` was already being written by
+`scripts/record-baseline.py` (line ~132, written as a list of
+~56 floats per node) but the server ignored it. Now `load_baseline_file`
+parses it and populates `AMP_BASELINE_PER_SUB`. If the field is absent (an
+older `baseline.json` from before this ADR), drift stays 0.0 and behaviour is
+unchanged. To populate the field and activate the drift channel, re-trigger
+calibration via the ADR-107 REST endpoint or wait for auto-recalibrate.
+
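A small Python sketch of the "load it if it is there" behaviour. The JSON layout used here (a top-level `nodes` object keyed by node id) is an assumption for illustration, since the ADR does not show the `baseline.json` schema; `load_per_sub_baseline` is likewise a hypothetical name. Only the field name `per_subcarrier_mean` comes from the ADR.

```python
import json


def load_per_sub_baseline(text):
    """Return {node_id: [per-subcarrier means]} for nodes that carry the field."""
    doc = json.loads(text)
    per_sub = {}
    for node_id, entry in doc.get("nodes", {}).items():
        vec = entry.get("per_subcarrier_mean")
        if vec:  # older files simply lack the key; skip them silently
            per_sub[int(node_id)] = [float(v) for v in vec]
    return per_sub


# Hypothetical file contents: an older file without the field, a newer one with it.
old_file = '{"nodes": {"1": {"mean": 27.04}, "2": {"mean": 14.72}}}'
new_file = '{"nodes": {"1": {"mean": 27.04, "per_subcarrier_mean": [26.0, 28.5]}}}'

old_baseline = load_per_sub_baseline(old_file)
new_baseline = load_per_sub_baseline(new_file)
```

An empty result leaves every node without a per-subcarrier baseline, so a drift function that returns 0.0 in that case keeps the classifier on its pre-ADR-104 behaviour.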
+### D4 — NBVI FP-rate validation (Step 3 of Pace's spec)
+
+`nbvi_select_top_k` no longer returns the literal top-K. After
+ranking by NBVI score (Steps 1+2), it evaluates each candidate
+K ∈ `{6, 8, 10, 12, 16, 20}` clamped to the available subcarrier
+pool:
+
+* For each K: compute per-frame broadband mean over the top-K
+  subset across the quiet window.
+* Slide a sub-window (length `AMP_SHORT_WIN/3 ≈ 30` samples, stride
+  `sub_window/2`) and count windows where rolling CV exceeds the
+  moving-gate threshold (0.10).
+* Pick the K with the **smallest FP count**. Ties broken by smallest
+  total NBVI score (less noisy subset wins).
+
+Result: a subset that's stable AND non-FP-producing on the calibration
+window. If a top-12 NBVI candidate sneaks in a subcarrier overlapping
+a noise source, the FP count surfaces it and a smaller K wins instead.
+
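The K sweep can be sketched in Python as follows. This is a model under stated assumptions, not the Rust `nbvi_select_top_k`: the function names are hypothetical, CV is taken as population standard deviation over mean, and the tie-break reads "smallest total NBVI score" literally as the sum of scores over the subset. The candidate set, the 30-sample sub-window, the half-window stride, and the 0.10 gate come from D4.

```python
from statistics import fmean, pstdev

K_CANDIDATES = (6, 8, 10, 12, 16, 20)
MOVING_GATE_CV = 0.10
SUB_WINDOW = 30  # AMP_SHORT_WIN / 3


def count_false_positives(series, window=SUB_WINDOW, gate=MOVING_GATE_CV):
    """Count sliding windows whose CV would trip the moving gate."""
    stride = max(1, window // 2)
    fp = 0
    for start in range(0, len(series) - window + 1, stride):
        chunk = series[start:start + window]
        mean = fmean(chunk)
        if mean > 0 and pstdev(chunk) / mean > gate:
            fp += 1
    return fp


def select_k(frames, ranked, scores):
    """frames: per-frame amplitude lists from the quiet window.
    ranked: subcarrier indices, best NBVI first. scores: NBVI score per index."""
    best_key, best_subset = None, None
    for k in sorted({min(k, len(ranked)) for k in K_CANDIDATES}):
        subset = ranked[:k]
        series = [fmean([frame[i] for i in subset]) for frame in frames]
        key = (count_false_positives(series), sum(scores[i] for i in subset))
        if best_key is None or key < best_key:
            best_key, best_subset = key, subset
    return best_subset


# Invented data: 20 subcarriers, 90 quiet frames; the 7th-ranked one flickers.
frames = [[10.0] * 20 for _ in range(90)]
for t, frame in enumerate(frames):
    if t % 2:
        frame[6] = 200.0
ranked = list(range(20))
scores = [0.01 * (i + 1) for i in range(20)]
chosen = select_k(frames, ranked, scores)
```

In this invented case every K of 8 or more includes the flickering subcarrier and fails all five windows, so the sweep falls back to the top 6. Note that with a summed tie-break, a fully clean window also favours the smallest K; how the shipped code aggregates the score on ties is not shown in the ADR.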
+## Files Touched
+
+```
+v2/crates/wifi-densepose-sensing-server/src/main.rs
+  - statics: AMP_BASELINE_PER_SUB, AMP_DRIFT
+  - helpers: amp_baseline_per_sub_init, amp_drift_init,
+             amp_drift_for_node, amp_drift_max
+  - load_baseline_file: parse per_subcarrier_mean → AMP_BASELINE_PER_SUB
+  - amp_presence_override: drift computation + stash
+  - amp_node_level: drift trigger (uses MAX for cross-node)
+  - amp_node_snapshot: per-node drift trigger (overrides MAX)
+  - amp_classify_from_latest: any-node drift trigger in global fusion
+  - nbvi_select_top_k: Step 3 FP-rate validation
+docs/adr/ADR-104-per-subcarrier-drift-presence.md  (this)
+```
+
+Implementation commit: `6212b17e`.
+
+## Verified Acceptance
+
+Server boot log (using existing v1 baseline.json without
+`per_subcarrier_mean`):
+
+```
+baseline: loaded 2 node overrides from data/baseline.json
+          (node1=27.04, node2=14.72; node1_cv=2.62%, node2_cv=3.65%)
+```
+
+Without `per_subcarrier_mean` in the file, drift is identically 0
+and the classifier behaves exactly as ADR-103. To activate the
+drift channel: re-record via the ADR-107 REST endpoint or wait for
+auto-recalibrate; new `baseline.json` carries the
+`per_subcarrier_mean` vector and drift becomes live.
+
+NBVI Step 3 validation runs on every refresh tick. In the operator's
+deployment the quiet window is clean and low-CV, so the "safe" default
+K=12 always passes and smaller Ks cannot improve on FP=0; the picker keeps
+K=12 in steady state. The validation is there to defend against future drift in
+channel conditions where a previously-clean subcarrier picks up interference.
+
+## Open Items
+
+* **Per-subcarrier baseline AGE check** — the per-sub vector reflects
+  the channel at calibration time. As the channel slowly drifts (other
+  WiFi clients on the AP, temperature, etc.) the per-sub baseline ages
+  faster than the broadband-mean baseline. Need: if
+  `last_written_sec_ago` > N hours AND drift consistently > threshold → flag
+  for re-calibration. Defer to a future ADR-109.
+* **Per-subcarrier delta in UI** — `raw.html` only shows broadband
+  bars + global classification. A small "drift" sparkline per node
+  would let the operator see the off-axis channel firing. ~30 min.
+* **Phase-domain drift** — currently amplitude-only. Phase delta vs
+  baseline phase would catch even subtler movement (chest-wall sub-mm
+  motion during breathing). Requires phase baseline in `baseline.json`,
+  which the recording script doesn't yet save. ~1 h script + ~30 min
+  server.
+
+## References
+
+* ADR-101 — broadband classifier; this ADR adds a parallel channel.
+* ADR-102 — NBVI; this ADR adds Step 3 validation per Pace's spec.
+* ADR-103 — persistent baseline; `per_subcarrier_mean` already written.
+* ADR-107 — REST calibrate endpoint; how the operator refreshes the
+  per-sub vector on demand.
+* [`docs/references/espectre-techniques.md`](../references/espectre-techniques.md)
+  §1.Step 3.
--- a/docs/adr/ADR-108-fw-nvs-persist-gain-lock.md
+++ b/docs/adr/ADR-108-fw-nvs-persist-gain-lock.md
@@ -0,0 +1,176 @@
+# ADR-108 — FW NVS Persistence of Gain-Lock Values
+
+**Status**: Accepted
+**Date**: 2026-05-17
+**Scope**: `firmware/esp32-csi-node/main/csi_collector.c`
+(`rv_gain_load_from_nvs`, `rv_gain_save_to_nvs`, NVS hook in
+`rv_gain_lock_process`).
+
+## Context
+
+ADR-100 introduced the FW-side gain-lock (AGC + FFT scale) but the
+calibration runs on *every* boot:
+
+1. Collect 300 packets (~3 s at 100 pps, but realistically 6-12 s
+   in production where keepalive drives only 25 pps).
+2. Take the median of AGC and FFT samples.
+3. Call `phy_force_rx_gain` / `phy_fft_scale_force` to freeze.
+
+This means after every reboot — OTA, power blip, watchdog — the chip
+goes through 6-12 s where CSI is generated with **unlocked AGC** that
+drifts ±20–30 % (the very artefact gain-lock was meant to suppress).
+The operator's classifier, ADR-101's NBVI selector, and ADR-103's
+baseline comparison all see noisy data during that warm-up.
+
+Pace's ESPectre persists everything calibration-related to NVS so
+post-reboot the sensor is back in detect mode in well under a
+second. This ADR ports the gain-lock half of that policy
+(the NBVI half lives server-side in RuView, so it does not apply here).
+
+## Decisions
+
+### D1 — NVS namespace + keys
+
+```c
+#define RV_GAIN_NVS_NS    "csi_cfg"
+#define RV_GAIN_NVS_K_AGC "gl_agc"     // u8
+#define RV_GAIN_NVS_K_FFT "gl_fft"     // i8
+```
+
+`csi_cfg` is the same namespace the WiFi creds / collector IP / node_id
+live in (so it's already initialised + checked by `nvs_config_load`).
+Two single-byte values — minimal NVS footprint.
+
+### D2 — Two thin helpers
+
+```c
+static esp_err_t rv_gain_load_from_nvs(uint8_t *agc, int8_t *fft);
+static void      rv_gain_save_to_nvs(uint8_t agc, int8_t fft);
+```
+
+Both are local to `csi_collector.c`. Load returns `ESP_ERR_NVS_NOT_FOUND`
+on a fresh chip. If the NVS write fails, save logs a warning but never
+blocks the boot path.
+
+### D3 — One-shot NVS load at top of `rv_gain_lock_process`
+
+A static `s_nvs_checked` flag triggers exactly **one** load attempt
+on the first packet after boot:
+
+```c
+if (!s_nvs_checked) {
+    s_nvs_checked = true;                 // one load attempt per boot, pass or fail
+    uint8_t agc; int8_t fft;
+    if (rv_gain_load_from_nvs(&agc, &fft) == ESP_OK
+        && agc >= RV_GAIN_MIN_SAFE_AGC)   // reject stale low-AGC values
+    {
+        phy_fft_scale_force(true, fft);
+        phy_force_rx_gain(1, (int)agc);
+        s_gain_locked = true;
+        ESP_LOGI(TAG, "gain-lock RESTORED from NVS: AGC=%u FFT=%d", agc, fft);
+        return;                           // skip the 300-packet calibration
+    }
+}                                         // on failure, fall through to calibration
+```
+
+The `agc >= RV_GAIN_MIN_SAFE_AGC` guard preserves ADR-100's "skip if
+signal too strong" safety: a stale low-AGC value that would freeze
+the RX path is rejected even if it's in NVS.
+
+### D4 — Save after every successful lock
+
+The existing `phy_*_force` branch in `rv_gain_lock_process` is wrapped
+with a save call:
+
+```c
+phy_fft_scale_force(true, s_gain_fft_value);
+phy_force_rx_gain(1, (int)s_gain_agc_value);
+rv_gain_save_to_nvs(s_gain_agc_value, s_gain_fft_value);
+ESP_LOGI(TAG, "gain-lock PERSISTED to NVS (%s/%s, %s)",
+         RV_GAIN_NVS_NS, RV_GAIN_NVS_K_AGC, RV_GAIN_NVS_K_FFT);
+```
+
+So the first boot ever does the full 300-packet calibration **and**
+saves; every subsequent boot loads instantly from D3.
+
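The restore-or-calibrate flow of D3 and D4 can be modelled in a few lines of Python. This is a sketch of the control flow only, not firmware: a `dict` stands in for the `csi_cfg` NVS namespace, the `GainLock` class and `boot` helper are hypothetical, and `MIN_SAFE_AGC = 30` is an invented value because the ADR does not state `RV_GAIN_MIN_SAFE_AGC`. ADR-100's own "skip if signal too strong" path during calibration is left out.

```python
from statistics import median_low

CALIBRATION_PACKETS = 300   # from the ADR
MIN_SAFE_AGC = 30           # hypothetical; the real RV_GAIN_MIN_SAFE_AGC is not given


class GainLock:
    def __init__(self, nvs):
        self.nvs = nvs              # dict standing in for the csi_cfg namespace
        self.nvs_checked = False    # mirrors s_nvs_checked
        self.locked = None          # (agc, fft) once frozen
        self.samples = []

    def process(self, agc, fft):
        """Feed one packet's AGC/FFT reading and report the resulting state."""
        if self.locked is not None:
            return "locked"
        if not self.nvs_checked:    # D3: exactly one load attempt per boot
            self.nvs_checked = True
            saved_agc = self.nvs.get("gl_agc")
            saved_fft = self.nvs.get("gl_fft")
            if (saved_agc is not None and saved_fft is not None
                    and saved_agc >= MIN_SAFE_AGC):
                self.locked = (saved_agc, saved_fft)
                return "restored"
        self.samples.append((agc, fft))
        if len(self.samples) < CALIBRATION_PACKETS:
            return "calibrating"
        # Median of the collected samples, then persist (D4).
        self.locked = (median_low([a for a, _ in self.samples]),
                       median_low([f for _, f in self.samples]))
        self.nvs["gl_agc"], self.nvs["gl_fft"] = self.locked
        return "calibrated"


def boot(nvs, packets):
    """Simulate one boot; return (how the lock was obtained, packets consumed)."""
    lock = GainLock(nvs)
    for n, (agc, fft) in enumerate(packets, start=1):
        state = lock.process(agc, fft)
        if state in ("restored", "calibrated"):
            return state, n
    return "calibrating", len(packets)


nvs = {}
stream = [(44, -33)] * 400
first_boot = boot(nvs, stream)
second_boot = boot(nvs, stream)
stale_boot = boot({"gl_agc": 5, "gl_fft": -33}, stream)
```

In this model the first boot needs all 300 packets and writes both keys, the second boot locks on the first packet, and a stored AGC below the floor is ignored so the node recalibrates and overwrites it.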
+### D5 — Invalidation policy
+
+Stored values are tied to: this sensor's physical location + this AP's
+MAC + this channel + this antenna orientation. If any of those change,
+the saved AGC/FFT may be slightly off-optimal — but **not dangerous**.
+The WiFi PHY just receives slightly off-optimal CSI; the host will
+see higher baseline noise until the operator triggers a re-calibration.
+
+Today the only way to clear the values is `idf.py erase-flash` over USB; a future
+REST endpoint could call `nvs_flash_erase()` instead. There is no automatic
+invalidation: the operator decides when a deployment change is significant enough.
+
+## Files Touched
+
+```
+firmware/esp32-csi-node/main/csi_collector.c
+  - #include "nvs.h" / "nvs_flash.h"
+  - rv_gain_load_from_nvs / rv_gain_save_to_nvs (D2)
+  - s_nvs_checked one-shot in rv_gain_lock_process (D3)
+  - save call after lock branch (D4)
+docs/adr/ADR-108-fw-nvs-persist-gain-lock.md  (this)
+```
+
+Implementation commit: `3779bb76`. Flashed to both sensors via OTA
+(no USB) — `python3 scripts/ota-deploy.sh`.
+
+## Verified Acceptance
+
+Test sequence:
+
+1. OTA flash new FW to both nodes (first boot, NVS empty).
+2. Wait 15 s for FW to complete first calibration + write to NVS.
+3. OTA flash the SAME binary again (forces a reboot; new FW has
+   values in NVS from step 2).
+4. Sample WS amplitude rate in the first 3 s after the second boot.
+
+Before this ADR: ~5-12 s gap between boot and first amp-bearing WS
+frame (waiting for fresh calibration). After this ADR: WS shows
+**44 Hz raw CSI in the first 3 s** — instant resume.
+
+Logs from a chip that has values in NVS:
+
+```
+I (335) main: boot: reset_reason=SW running_partition=ota_1
+I (520) csi_collector: gain-lock RESTORED from NVS: AGC=44 FFT=-33
+                       (0-packet calibration; clear NVS to recalibrate)
+```
+
+vs first-boot ever:
+
+```
+I (335) main: boot: reset_reason=POWERON running_partition=ota_0
+I (4980) csi_collector: gain-lock APPLIED: AGC=44 FFT=-33
+                        (median of 300 packets)
+I (4980) csi_collector: gain-lock PERSISTED to NVS (csi_cfg/gl_agc, gl_fft)
+```
+
+## Open Items
+
+* **REST endpoint to clear gain-lock NVS** — today the operator has
+  to USB-erase the namespace. A FW-side `POST /ota/recalibrate` that
+  clears the two keys + `esp_restart()` would close that loop.
+  ~30 min FW + flash.
+* **Track AP MAC alongside AGC/FFT** — `csi_cfg/gl_ap_mac`. On boot,
+  if current AP MAC ≠ saved → ignore the cached values and re-calibrate.
+  Fully automatic invalidation. ~1 h FW.
+* **Per-channel cache** — `csi_cfg/gl_<chan>_agc`. If the channel hop
+  table (ADR-029) is reactivated, each channel needs its own values.
+  ~1 h FW.
+
+## References
+
+* ADR-100 — gain-lock implementation that this ADR persists.
+* ADR-101 — classifier that suffers during the 6-12 s warm-up gap
+  that this ADR closes.
+* `docs/references/ota-pipeline.md` — the WiFi flash flow used to
+  deploy this FW change without USB.
+* Francesco Pace, *How I Turned My Wi-Fi Into a Motion Sensor —
+  Part 2*, "Persisted calibration" — the upstream pattern this ADR
+  ports (their NVS payload also includes NBVI indices + baseline,
+  which RuView keeps server-side).