docs: ADR-104 (per-subcarrier drift) + ADR-108 (FW NVS persistence)

ADR-104: documents the off-axis presence channel that fires
present_still when per-subcarrier amplitudes drift ≥10% from the
saved per_subcarrier_mean baseline, plus the NBVI Step 3 FP-rate
validation (K candidate sweep, smallest-FP wins). Implementation
shipped in 6212b17e.

ADR-108: documents the FW NVS persistence of gain-lock values
(csi_cfg/gl_agc + gl_fft), the one-shot load at first packet after
boot, the save after every successful calibration, and the safety
MIN_SAFE_AGC guard on restored values. Implementation shipped in
3779bb76; flashed to both sensors via OTA.

Both ADRs ≤ 200 lines per the project's docs convention. Open items
recorded so future agents can pick up: per-sub drift age check,
phase-domain drift, REST recalibrate endpoint, AP-MAC tied cache.
arsen 2026-05-17 13:34:22 +07:00
parent 3779bb7655
commit d7189d9b0f
2 changed files with 338 additions and 0 deletions


@@ -0,0 +1,162 @@
# ADR-104 — Per-Subcarrier Drift Presence Channel + NBVI FP-Rate Validation
**Status**: Accepted
**Date**: 2026-05-17
**Scope**: `v2/crates/wifi-densepose-sensing-server/src/main.rs`
(`AMP_BASELINE_PER_SUB`, `AMP_DRIFT`, `amp_drift_for_node`,
`amp_drift_max`, `amp_node_level`, `amp_classify_from_latest`,
`nbvi_select_top_k` Step 3), `scripts/record-baseline.py`
(`per_subcarrier_mean` already saved).
## Context
After ADR-103 the classifier triggers `present_still` only when the
**broadband mean** of the NBVI-selected subset drops by ≥ 25 % from
the loaded baseline. This works when the operator's body crosses the
line of sight between AP and sensor — direct-component attenuation
dominates. But:
1. **Off-axis presence**: the operator sitting at a desk to the side
of the AP-sensor line modulates only a handful of subcarriers
(the ones whose Fresnel zone happens to brush their body). The
*broadband* mean barely shifts; ADR-103 says `absent` even though
someone is clearly in the room.
2. **NBVI Step 3**: Pace's full NBVI pipeline picks top-K by raw NBVI
score, then **validates** each candidate K by counting false
positives the motion detector would produce on the calibration
buffer, and keeps the K with the lowest FP rate. We were taking
the raw top-12 without validation — fragile if one of the chosen
subcarriers happens to overlap a noise source.
## Decisions
### D1 — Spectral drift score as a second presence channel
`amp_presence_override` per node now also computes a **spectral
drift score**:
```
drift_k = (current_amp[k] - baseline_amp[k]).abs() / baseline_amp[k] for baseline_amp[k] > 1.0
drift = mean(drift_k) across kept subcarriers
```
`current_amp[k]` = mean of the recent `AMP_SHORT_WIN` (90) frames'
amplitude at subcarrier `k`. `baseline_amp[k]` = the
`per_subcarrier_mean` vector saved by ADR-103's recording script.
Per-node drift is stashed in `AMP_DRIFT: HashMap<u8, f64>` so
`amp_node_level` (per-node) and `amp_classify_from_latest` (global)
can use it. Threshold `AMP_DRIFT_PRESENCE_THRESH = 0.10` (10 %
average per-subcarrier deviation) is empirical and consistent with
the broadband-ratio trigger (drop ≥ 25 %, drift ≥ 10 %).
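
The drift formula can be sketched in Python (the language of `scripts/record-baseline.py`; the server itself is Rust). This is a minimal illustration, not the shipped `amp_presence_override` code: the function name `spectral_drift` and the sample amplitudes are made up for the example, while the `> 1.0` baseline cut-off and the 0.10 threshold come from this ADR.

```python
from statistics import fmean

MIN_BASELINE_AMP = 1.0            # formula above: skip subcarriers with baseline <= 1.0
AMP_DRIFT_PRESENCE_THRESH = 0.10  # threshold named in this ADR


def spectral_drift(current_amp, baseline_amp, min_baseline=MIN_BASELINE_AMP):
    """Mean relative deviation over subcarriers whose baseline exceeds min_baseline."""
    kept = [
        abs(cur - base) / base
        for cur, base in zip(current_amp, baseline_amp)
        if base > min_baseline
    ]
    # No usable subcarriers (or no per-subcarrier baseline at all): drift stays 0.0.
    return fmean(kept) if kept else 0.0


# Illustrative amplitudes only; the last subcarrier is below the 1.0 cut-off.
baseline = [20.0, 20.0, 20.0, 20.0, 0.5]
quiet = [20.0, 20.0, 20.0, 20.0, 9.0]      # only the ignored subcarrier moved
off_axis = [28.0, 20.0, 16.0, 20.0, 0.5]   # two subcarriers shifted

print(spectral_drift(quiet, baseline))
print(spectral_drift(off_axis, baseline) >= AMP_DRIFT_PRESENCE_THRESH)
```

In the off-axis case the broadband mean over the kept subcarriers only moves from 20.0 to 21.0, so the 25 % drop trigger stays silent while the drift score is roughly 0.15 and crosses the 10 % threshold.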
### D2 — Trigger order in classifier
Per node (`amp_node_snapshot`):
```
1. CV ≥ 6× baseline_cv → active
2. CV ≥ 3× baseline_cv → present_moving
3. drift ≥ 10 % → present_still ← ADR-104 (off-axis)
4. mean / baseline < 0.75 → present_still ← ADR-101 (in-path)
5. otherwise → absent
```
Global (`amp_classify_from_latest`) uses MAX CV / MAX drift / ANY
baseline-drop across nodes. Either drop OR drift fires `present_still`.
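
The per-node trigger order reads naturally as a chain of early returns. The sketch below is a Python illustration under assumptions: `classify_node` and its parameter names are hypothetical, and only the thresholds (6x, 3x, 10 %, 0.75) are taken from the list above. Global fusion across nodes is not modelled.

```python
def classify_node(cv, baseline_cv, drift, mean_amp, baseline_mean):
    """Return the presence label for one node, checking triggers in ADR order."""
    if cv >= 6.0 * baseline_cv:
        return "active"
    if cv >= 3.0 * baseline_cv:
        return "present_moving"
    if drift >= 0.10:                      # ADR-104: off-axis channel
        return "present_still"
    if mean_amp / baseline_mean < 0.75:    # ADR-101: in-path broadband drop
        return "present_still"
    return "absent"


# Baseline figures borrowed from the boot log in this ADR (node1: 27.04, CV 2.62 %).
print(classify_node(cv=0.03, baseline_cv=0.0262, drift=0.15,
                    mean_amp=27.0, baseline_mean=27.04))
```

Because the motion triggers come first, a high CV always wins over either `present_still` path, and the two `present_still` paths are interchangeable in outcome.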
### D3 — Opportunistic loading
`per_subcarrier_mean` was already being written by
`scripts/record-baseline.py` (line ~132, written as a list of
~56 floats per node) but the server ignored it. Now `load_baseline_file`
parses it and populates `AMP_BASELINE_PER_SUB`. If absent (older
`baseline.json` from before this ADR) → drift stays 0.0 → no behaviour
change. Re-trigger calibration via the ADR-107 REST endpoint or auto-
recalibrate to populate the field and activate the drift channel.
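
The backward-compatible load can be illustrated as follows. This is a hedged Python sketch, not the Rust `load_baseline_file`: the JSON layout (`nodes` keyed by node id, with `mean` and `cv` fields) and the function name `load_per_sub_baselines` are assumptions made for the example; only the `per_subcarrier_mean` key is taken from this ADR.

```python
import json


def load_per_sub_baselines(text):
    """Return {node_id: [float, ...]} for nodes that carry per_subcarrier_mean."""
    doc = json.loads(text)
    per_sub = {}
    # "nodes" is a hypothetical top-level key; the real file layout may differ.
    for node_id, entry in doc.get("nodes", {}).items():
        vec = entry.get("per_subcarrier_mean")
        if isinstance(vec, list) and vec:
            per_sub[int(node_id)] = [float(v) for v in vec]
    return per_sub


old_file = '{"nodes": {"1": {"mean": 27.04, "cv": 0.0262}}}'
new_file = ('{"nodes": {"1": {"mean": 27.04, "cv": 0.0262, '
            '"per_subcarrier_mean": [26.5, 27.1, 27.6]}}}')

print(load_per_sub_baselines(old_file))
print(load_per_sub_baselines(new_file))
```

An older file yields an empty map, so no node has a per-subcarrier baseline and the drift score stays at 0.0, which matches the "no behaviour change" guarantee.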
### D4 — NBVI FP-rate validation (Step 3 of Pace's spec)
`nbvi_select_top_k` no longer returns the literal top-K. After
ranking by NBVI score (Steps 1+2), it evaluates each candidate
K ∈ `{6, 8, 10, 12, 16, 20}` clamped to the available subcarrier
pool:
* For each K: compute per-frame broadband mean over the top-K
subset across the quiet window.
* Slide a sub-window (length `AMP_SHORT_WIN/3 ≈ 30` samples, stride
`sub_window/2`) and count windows where rolling CV exceeds the
moving-gate threshold (0.10).
* Pick the K with the **smallest FP count**. Ties broken by smallest
total NBVI score (less noisy subset wins).
Result: a subset that's stable AND non-FP-producing on the calibration
window. If a top-12 NBVI candidate sneaks in a subcarrier overlapping
a noise source, the FP count surfaces it and a smaller K wins instead.
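
The K sweep can be sketched in Python as below. This is an illustration under assumptions rather than the shipped `nbvi_select_top_k`: the function names and the demo data are hypothetical, CV is taken as population standard deviation over mean, and "smallest total NBVI score" is read literally as the sum over the subset. The candidate set, the 30-sample sub-window, the half-window stride and the 0.10 gate come from this ADR.

```python
from statistics import fmean, pstdev

CANDIDATE_KS = (6, 8, 10, 12, 16, 20)  # candidate set from this ADR
MOVING_GATE_CV = 0.10                  # moving-gate threshold from this ADR
SUB_WINDOW = 90 // 3                   # AMP_SHORT_WIN / 3 = 30 samples


def count_false_positives(series, sub_window=SUB_WINDOW, gate=MOVING_GATE_CV):
    """Count sliding windows (stride = sub_window / 2) whose CV exceeds the gate."""
    stride = max(1, sub_window // 2)
    fp = 0
    for start in range(0, len(series) - sub_window + 1, stride):
        window = series[start:start + sub_window]
        mean = fmean(window)
        if mean > 0 and pstdev(window) / mean > gate:
            fp += 1
    return fp


def select_subset(frames, ranked, scores):
    """Pick the top-K prefix of `ranked` with the fewest false positives.

    frames: quiet-window frames, each a list of per-subcarrier amplitudes
    ranked: subcarrier indices ordered by NBVI score, best first
    scores: NBVI score per subcarrier index (lower = less noisy)
    """
    if not ranked:
        return []
    best_key, best_subset = None, []
    # Clamp each candidate K to the available pool and drop duplicates.
    for k in sorted({min(k, len(ranked)) for k in CANDIDATE_KS}):
        subset = ranked[:k]
        series = [fmean([frame[i] for i in subset]) for frame in frames]
        # Fewest FPs first; ties go to the smaller summed NBVI score.
        key = (count_false_positives(series), sum(scores[i] for i in subset))
        if best_key is None or key < best_key:
            best_key, best_subset = key, subset
    return best_subset


# Demo: 20 subcarriers, 90 quiet frames; subcarrier 0 flickers between 1 and 19.
frames = [[1.0 if t % 2 == 0 else 19.0] + [10.0] * 19 for t in range(90)]
ranked = list(range(20))
scores = [0.01 * (i + 1) for i in ranked]
print(select_subset(frames, ranked, scores))
```

With the demo data, K = 6 and K = 8 trip the gate in every window because the flickering subcarrier dominates the small subsets; K = 10 is the smallest candidate with zero false positives, so the first ten ranked subcarriers are returned. Under a summed-score tie-break a fully clean window always favours the smallest K, so the shipped code may normalise the score differently.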
## Files Touched
```
v2/crates/wifi-densepose-sensing-server/src/main.rs
- statics: AMP_BASELINE_PER_SUB, AMP_DRIFT
- helpers: amp_baseline_per_sub_init, amp_drift_init,
amp_drift_for_node, amp_drift_max
- load_baseline_file: parse per_subcarrier_mean → AMP_BASELINE_PER_SUB
- amp_presence_override: drift computation + stash
- amp_node_level: drift trigger (uses MAX for cross-node)
- amp_node_snapshot: per-node drift trigger (overrides MAX)
- amp_classify_from_latest: any-node drift trigger in global fusion
- nbvi_select_top_k: Step 3 FP-rate validation
docs/adr/ADR-104-per-subcarrier-drift-presence.md (this)
```
Implementation commit: `6212b17e`.
## Verified Acceptance
Server boot log (using existing v1 baseline.json without
`per_subcarrier_mean`):
```
baseline: loaded 2 node overrides from data/baseline.json
(node1=27.04, node2=14.72; node1_cv=2.62%, node2_cv=3.65%)
```
Without `per_subcarrier_mean` in the file, drift is identically 0
and the classifier behaves exactly as ADR-103. To activate the
drift channel: re-record via the ADR-107 REST endpoint or wait for
auto-recalibrate; new `baseline.json` carries the
`per_subcarrier_mean` vector and drift becomes live.
NBVI Step 3 validation runs on every refresh tick. In the operator's
deployment the calibration window is clean and low-CV, so the "safe"
default K=12 already yields FP=0 and no smaller K improves on it; the
picker therefore keeps K=12 in steady state. The validation step guards
against future changes in channel conditions where a previously clean
subcarrier picks up interference.
## Open Items
* **Per-subcarrier baseline AGE check** — the per-sub vector reflects
the channel at calibration time. As the channel slowly drifts (other
WiFi clients on the AP, temperature, etc.) the per-sub baseline ages
faster than the broadband-mean baseline. Need: if
`last_written_sec_ago` > N hours AND drift consistently > threshold →
flag for re-calibration. Defer to a future ADR-109.
* **Per-subcarrier delta in UI**: `raw.html` only shows broadband
bars + global classification. A small "drift" sparkline per node
would let the operator see the off-axis channel firing. ~30 min.
* **Phase-domain drift** — currently amplitude-only. Phase delta vs
baseline phase would catch even subtler movement (chest-wall sub-mm
motion during breathing). Requires phase baseline in `baseline.json`,
which the recording script doesn't yet save. ~1 h script + ~30 min
server.
## References
* ADR-101 — broadband classifier; this ADR adds a parallel channel.
* ADR-102 — NBVI; this ADR adds Step 3 validation per Pace's spec.
* ADR-103 — persistent baseline; `per_subcarrier_mean` already written.
* ADR-107 — REST calibrate endpoint; how the operator refreshes the
per-sub vector on demand.
* [`docs/references/espectre-techniques.md`](../references/espectre-techniques.md)
§1.Step 3.


@@ -0,0 +1,176 @@
# ADR-108 — FW NVS Persistence of Gain-Lock Values
**Status**: Accepted
**Date**: 2026-05-17
**Scope**: `firmware/esp32-csi-node/main/csi_collector.c`
(`rv_gain_load_from_nvs`, `rv_gain_save_to_nvs`, NVS hook in
`rv_gain_lock_process`).
## Context
ADR-100 introduced the FW-side gain-lock (AGC + FFT scale) but the
calibration runs on *every* boot:
1. Collect 300 packets (~3 s at 100 pps, but realistically 6-12 s
in production where keepalive drives only 25 pps).
2. Take the median of AGC and FFT samples.
3. Call `phy_force_rx_gain` / `phy_fft_scale_force` to freeze.
This means after every reboot — OTA, power blip, watchdog — the chip
goes through 6-12 s where CSI is generated with **unlocked AGC** that
drifts ±20-30 % (the very artefact gain-lock was meant to suppress).
The operator's classifier, ADR-102's NBVI selector, and ADR-103's
baseline comparison all see noisy data during that warm-up.
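
The warm-up figures follow directly from the packet count and the packet rate. A small worked example in Python (the helper name `warmup_seconds` is hypothetical; the 300-packet count and the 100 pps and 25 pps rates are from the list above, and 50 pps is added only to show where a 6 s figure would come from):

```python
CAL_PACKETS = 300  # packets collected before the lock is applied


def warmup_seconds(packets_per_second, packets=CAL_PACKETS):
    """Time spent with unlocked AGC before calibration completes."""
    if packets_per_second <= 0:
        raise ValueError("packet rate must be positive")
    return packets / packets_per_second


for pps in (100, 50, 25):
    print(pps, "pps ->", warmup_seconds(pps), "s")
```

At 25 pps the full 12 s is needed; the 6 s end of the quoted range corresponds to roughly 50 pps.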
Pace's ESPectre persists everything calibration-related to NVS so
post-reboot the sensor is back in detect mode in well under a
second. This ADR ports the gain-lock half of that policy
(NBVI lives server-side in RuView, so that half does not apply here).
## Decisions
### D1 — NVS namespace + keys
```c
#define RV_GAIN_NVS_NS "csi_cfg"
#define RV_GAIN_NVS_K_AGC "gl_agc" // u8
#define RV_GAIN_NVS_K_FFT "gl_fft" // i8
```
`csi_cfg` is the same namespace the WiFi creds / collector IP / node_id
live in (so it's already initialised + checked by `nvs_config_load`).
Two single-byte values — minimal NVS footprint.
### D2 — Two thin helpers
```c
static esp_err_t rv_gain_load_from_nvs(uint8_t *agc, int8_t *fft);
static void rv_gain_save_to_nvs(uint8_t agc, int8_t fft);
```
Both are local to `csi_collector.c`. Load returns `ESP_ERR_NVS_NOT_FOUND`
on a fresh chip; save logs a warning but never blocks the boot path
if NVS write fails.
### D3 — One-shot NVS load at top of `rv_gain_lock_process`
A static `s_nvs_checked` flag triggers exactly **one** load attempt
on the first packet after boot:
```c
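if (!s_nvs_checked) {
    s_nvs_checked = true;  /* one-shot: never retry the load after the first packet */
    uint8_t agc; int8_t fft;
    if (rv_gain_load_from_nvs(&agc, &fft) == ESP_OK
        && agc >= RV_GAIN_MIN_SAFE_AGC)
    {
        /* Apply the cached values and skip calibration entirely. */
        phy_fft_scale_force(true, fft);
        phy_force_rx_gain(1, (int)agc);
        s_gain_locked = true;
        ESP_LOGI(TAG, "gain-lock RESTORED from NVS: AGC=%u FFT=%d", agc, fft);
        return;
    }
    /* Missing or unsafe values: fall through to the regular calibration path. */
}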
```
The `agc >= RV_GAIN_MIN_SAFE_AGC` guard preserves ADR-100's "skip if
signal too strong" safety: a stale low-AGC value that would freeze
the RX path is rejected even if it's in NVS.
### D4 — Save after every successful lock
The existing `phy_*_force` branch in `rv_gain_lock_process` is now
followed by a save call:
```c
phy_fft_scale_force(true, s_gain_fft_value);
phy_force_rx_gain(1, (int)s_gain_agc_value);
rv_gain_save_to_nvs(s_gain_agc_value, s_gain_fft_value);
ESP_LOGI(TAG, "gain-lock PERSISTED to NVS (%s/%s, %s)",
RV_GAIN_NVS_NS, RV_GAIN_NVS_K_AGC, RV_GAIN_NVS_K_FFT);
```
So the first boot ever does the full 300-packet calibration **and**
saves; every subsequent boot loads instantly from D3.
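
The interaction of D3 and D4 across reboots can be modelled in a few lines of Python. This is a behavioural sketch, not firmware code: the `GainLock` class, the dict standing in for the `csi_cfg` namespace, the `MIN_SAFE_AGC = 30` value and the use of `median_low` are all assumptions for illustration; the real `RV_GAIN_MIN_SAFE_AGC` value and median implementation are not given in this ADR.

```python
from statistics import median_low

MIN_SAFE_AGC = 30    # hypothetical stand-in for RV_GAIN_MIN_SAFE_AGC
CAL_PACKETS = 300    # calibration length from ADR-100


class GainLock:
    """Python model of the D3 one-shot restore and the D4 save."""

    def __init__(self, nvs):
        self.nvs = nvs              # dict standing in for the csi_cfg namespace
        self.nvs_checked = False
        self.locked = False
        self.source = None          # "nvs" or "calibration" once locked
        self.agc = self.fft = None
        self.samples = []

    def process(self, agc, fft):
        """Handle one packet's (AGC, FFT) reading."""
        if self.locked:
            return
        if not self.nvs_checked:
            self.nvs_checked = True  # exactly one load attempt per boot
            saved_agc = self.nvs.get("gl_agc")
            saved_fft = self.nvs.get("gl_fft")
            if (saved_agc is not None and saved_fft is not None
                    and saved_agc >= MIN_SAFE_AGC):
                self._lock(saved_agc, saved_fft, "nvs")
                return
        # No usable cache: accumulate samples until calibration completes.
        self.samples.append((agc, fft))
        if len(self.samples) >= CAL_PACKETS:
            agc_med = median_low([a for a, _ in self.samples])
            fft_med = median_low([f for _, f in self.samples])
            self._lock(agc_med, fft_med, "calibration")
            self.nvs["gl_agc"], self.nvs["gl_fft"] = agc_med, fft_med

    def _lock(self, agc, fft, source):
        self.agc, self.fft = agc, fft
        self.locked, self.source = True, source


nvs = {}
first_boot = GainLock(nvs)
for _ in range(CAL_PACKETS):
    first_boot.process(44, -33)
second_boot = GainLock(nvs)
second_boot.process(50, -30)   # first packet after reboot
print(first_boot.source, second_boot.source, second_boot.agc, second_boot.fft)
```

The first boot locks after 300 packets and writes both keys; the second boot locks on its very first packet from the cached values and ignores the live reading. A cached AGC below the guard is skipped and the model falls back to calibration.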
### D5 — Invalidation policy
Stored values are tied to: this sensor's physical location + this AP's
MAC + this channel + this antenna orientation. If any of those change,
the saved AGC/FFT may be slightly off-optimal, but **not dangerous**:
the WiFi PHY keeps delivering CSI, and the host sees higher baseline
noise until the operator triggers a re-calibration.
Today: erase via `idf.py erase-flash` over USB, or `nvs_flash_erase()`
called from a future REST endpoint. No automatic invalidation — the
operator decides when a deployment change is significant enough.
## Files Touched
```
firmware/esp32-csi-node/main/csi_collector.c
- #include "nvs.h" / "nvs_flash.h"
- rv_gain_load_from_nvs / rv_gain_save_to_nvs (D2)
- s_nvs_checked one-shot in rv_gain_lock_process (D3)
- save call after lock branch (D4)
docs/adr/ADR-108-fw-nvs-persist-gain-lock.md (this)
```
Implementation commit: `3779bb76`. Flashed to both sensors via OTA
(no USB) — `python3 scripts/ota-deploy.sh`.
## Verified Acceptance
Test sequence:
1. OTA flash new FW to both nodes (first boot, NVS empty).
2. Wait 15 s for FW to complete first calibration + write to NVS.
3. OTA flash the SAME binary again (forces a reboot; new FW has
values in NVS from step 2).
4. Sample WS amplitude rate in the first 3 s after the second boot.
Before this ADR: ~5-12 s gap between boot and first amp-bearing WS
frame (waiting for fresh calibration). After this ADR: WS shows
**44 Hz raw CSI in the first 3 s** — instant resume.
Logs from a chip that has values in NVS:
```
I (335) main: boot: reset_reason=SW running_partition=ota_1
I (520) csi_collector: gain-lock RESTORED from NVS: AGC=44 FFT=-33
(0-packet calibration; clear NVS to recalibrate)
```
vs first-boot ever:
```
I (335) main: boot: reset_reason=POWERON running_partition=ota_0
I (4980) csi_collector: gain-lock APPLIED: AGC=44 FFT=-33
(median of 300 packets)
I (4980) csi_collector: gain-lock PERSISTED to NVS (csi_cfg/gl_agc, gl_fft)
```
## Open Items
* **REST endpoint to clear gain-lock NVS** — today the operator has
to USB-erase the namespace. A FW-side `POST /ota/recalibrate` that
clears the two keys + `esp_restart()` would close that loop.
~30 min FW + flash.
* **Track AP MAC alongside AGC/FFT**: `csi_cfg/gl_ap_mac`. On boot,
if current AP MAC ≠ saved → ignore the cached values and re-calibrate.
Fully automatic invalidation. ~1 h FW.
* **Per-channel cache**: `csi_cfg/gl_<chan>_agc`. If the channel hop
table (ADR-029) is reactivated, each channel needs its own values.
~1 h FW.
## References
* ADR-100 — gain-lock implementation that this ADR persists.
* ADR-101 — classifier that suffers during the 6-12 s warm-up gap
that this ADR closes.
* `docs/references/ota-pipeline.md` — the WiFi flash flow used to
deploy this FW change without USB.
* Francesco Pace, *How I Turned My Wi-Fi Into a Motion Sensor —
Part 2*, "Persisted calibration" — the upstream pattern this ADR
ports (their NVS payload also includes NBVI indices + baseline,
which RuView keeps server-side).