# ADR-100 — PHY Gain Lock for Baseline-Stable CSI
**Status**: Accepted
**Date**: 2026-05-17
**Scope**: `firmware/esp32-csi-node/main/csi_collector.c`,
`v2/crates/wifi-densepose-sensing-server/static/raw.html`.
## Context
After ADR-110 deployed the TP-Link WISP AP and the operator captured three
controlled one-minute windows (empty / sit / walk), the RSSI MAD-Δ
classifier failed to separate the three states — measured `d` values
overlapped within ±0.03 of 0.49 while in-state spread was ±0.10. We
inspected the live amplitude spectrum on the new `raw.html` console and
saw a slow ±20-30 % broadband drift in the sensor amplitude even with
the room provably empty. The drift was indistinguishable from body
modulation at multi-meter range and dominated every downstream feature.
Francesco Pace's [ESPectre](https://github.com/francescopace/espectre)
project (GPLv3) traced the same artefact to the ESP32 PHY's automatic
gain control: AGC continuously rebalances the receiver gain per packet
so received frames stay in the optimal decoding range. For CSI sensing
this is a disaster — the same channel state arrives with a different
amplitude every packet because the gain stage shifts under it. Pace
identified two undocumented PHY routines in the IDF blob that freeze
AGC and FFT scaling, plus a calibration recipe (median of the first
300 packets) that is robust to brief startup activity.
## Decisions
### D1 — Port the ESPectre gain-lock to RuView FW
Added a self-contained block to `csi_collector.c`:
* **Overlay struct** `rv_phy_rx_ctrl_t` aliased over `wifi_csi_info_t.rx_ctrl`
to read the hidden `agc_gain` (u8) and `fft_gain` (signed i8) fields.
* **Extern declarations** for the two PHY routines:
```c
extern void phy_fft_scale_force(bool force_en, int8_t force_value);
extern void phy_force_rx_gain(int force_en, int force_value);
```
* **Two-phase calibration** (`rv_gain_lock_process`):
  - Phase 1 (first 300 packets, ~6 s at the rate-gated 50 Hz callback):
    accumulate AGC and FFT samples into static arrays.
  - Phase 2 (at the 300th packet): `qsort` both arrays, take the median, and
    call the two PHY routines to freeze gain.
* **Safety branch**: if median AGC < 30, skip the lock and log a
warning. Forcing a low gain on a strong-signal deployment causes the
RX path to freeze (empirically documented in ESPectre's
`gain_controller.h`).
* **Supported targets**: ESP32-S3, ESP32-C3, and ESP32-C6 only; older
parts compile to a no-op stub. RuView ships on S3, so this is the only
path we care about.
The hook is wired immediately after the existing rate-gate and MAC
filter in the CSI callback so calibration completes within the first
~6 s after the WiFi association, regardless of host traffic. After
that it short-circuits.
Tagged as ADR-100 in the source comment for traceability.
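
The calibration flow in D1 can be sketched as a small host-testable state machine. This is a minimal illustration, not the code in `csi_collector.c`: the `gl_*` names, the context struct, and the `gl_apply_fn` hook are hypothetical, taking the upper median of an even-sized window is an assumption, and so is the order of the two PHY calls. Only the 300-packet window, the `qsort` median, the AGC < 30 safety branch, the target list, and the two extern signatures come from this ADR.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define GL_CAL_PACKETS  300  /* calibration window (from this ADR) */
#define GL_MIN_SAFE_AGC 30   /* below this median the lock is skipped (from this ADR) */

typedef enum { GL_COLLECTING, GL_LOCKED, GL_SKIPPED } gl_state_t;

/* Hypothetical calibration context; zero-initialise before first use. */
typedef struct {
    uint8_t    agc[GL_CAL_PACKETS];
    int8_t     fft[GL_CAL_PACKETS];
    size_t     count;
    gl_state_t state;
    uint8_t    agc_median;
    int8_t     fft_median;
} gl_ctx_t;

/* Hypothetical hook that freezes the gains once the medians are known. */
typedef void (*gl_apply_fn)(uint8_t agc, int8_t fft);

#if defined(CONFIG_IDF_TARGET_ESP32S3) || defined(CONFIG_IDF_TARGET_ESP32C3) || \
    defined(CONFIG_IDF_TARGET_ESP32C6)
extern void phy_fft_scale_force(bool force_en, int8_t force_value);
extern void phy_force_rx_gain(int force_en, int force_value);
void gl_apply_phy(uint8_t agc, int8_t fft) {
    phy_fft_scale_force(true, fft);  /* call order is an assumption */
    phy_force_rx_gain(1, agc);
}
#else
/* Older parts and host builds: no-op stub. */
void gl_apply_phy(uint8_t agc, int8_t fft) { (void)agc; (void)fft; }
#endif

static int cmp_u8(const void *a, const void *b) {
    return (int)*(const uint8_t *)a - (int)*(const uint8_t *)b;
}

static int cmp_i8(const void *a, const void *b) {
    return (int)*(const int8_t *)a - (int)*(const int8_t *)b;
}

/* Feed one packet's gain readings; returns the state after this packet. */
gl_state_t gl_process(gl_ctx_t *ctx, uint8_t agc, int8_t fft, gl_apply_fn apply) {
    if (ctx->state != GL_COLLECTING) {
        return ctx->state;  /* calibration finished: short-circuit */
    }
    ctx->agc[ctx->count] = agc;
    ctx->fft[ctx->count] = fft;
    if (++ctx->count < GL_CAL_PACKETS) {
        return GL_COLLECTING;
    }

    /* Window is full: sort both arrays and take the (upper) median. */
    qsort(ctx->agc, GL_CAL_PACKETS, sizeof ctx->agc[0], cmp_u8);
    qsort(ctx->fft, GL_CAL_PACKETS, sizeof ctx->fft[0], cmp_i8);
    ctx->agc_median = ctx->agc[GL_CAL_PACKETS / 2];
    ctx->fft_median = ctx->fft[GL_CAL_PACKETS / 2];

    if (ctx->agc_median < GL_MIN_SAFE_AGC) {
        ctx->state = GL_SKIPPED;  /* safety branch: leave AGC running */
    } else {
        if (apply) {
            apply(ctx->agc_median, ctx->fft_median);
        }
        ctx->state = GL_LOCKED;
    }
    return ctx->state;
}
```

In firmware, something like `gl_process(&ctx, agc_gain, fft_gain, gl_apply_phy)` would be called from the CSI callback with the values read through the overlay struct. Because the median ignores a minority of outliers, a short burst of atypical gain readings at boot does not move the locked value.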
### D2 — Use the existing `raw.html` console (ADR-110, D2 reuse) as the verification UI
The console added in ADR-110 already streams `nodes[].amplitude` from
the existing WebSocket. No server-side change was needed. The HTML
displays a per-node bar histogram of all 56 active subcarriers plus
broadband mean amplitude and RSSI traces over the last 30 s. This is
the surface where the operator can watch, without any DSP or
classification, whether the gain-lock has actually flattened the
baseline.
### D3 — Geometry matters as much as gain-lock
A controlled three-state capture made on 2026-05-17 with both sensors
positioned so that the line `TP-Link AP → sensor` passes through the
operator (lying on the bed) confirmed both decisions. The summary
table appears under *Verified Acceptance* below. Earlier captures
(ADR-110) failed to separate states partly because the sensors were
placed off-axis from the AP-to-body line; with that geometry the body
never physically obstructs the CSI channel.
## Calibration values observed (real captures, this deployment)
| Node | Boot rate (low traffic) | Boot rate (ping flood) | AGC median | FFT scale median | Lock decision |
|---|---|---|---|---|---|
| room01 (192.168.0.101) | 0.3 fps | 30+ fps | **42-44** | 31 / 33 | **APPLIED** |
| room02 (192.168.0.100) | 0.3 fps | 30+ fps | **44** | 40 / 42 | **APPLIED** |
Both AGC medians are comfortably above the 30 safety threshold. The
calibration completes in ~6 s when there is any host traffic (a single
ping to the sensor at 10 pps is enough); on a totally idle channel,
beacons alone drive the rate down to 0.3 fps and calibration would take
~17 minutes. In practice we always have some traffic.
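
The calibration times quoted above follow from dividing the 300-packet window by the CSI callback rate. A minimal sketch of that arithmetic, where the helper name `rv_calibration_seconds` is hypothetical and not part of the firmware:

```c
#include <math.h>

/* Seconds needed to collect `packets` CSI callbacks arriving at `rate_hz`.
 * Returns INFINITY when there is no traffic at all. */
double rv_calibration_seconds(unsigned packets, double rate_hz) {
    if (rate_hz <= 0.0) {
        return INFINITY;
    }
    return (double)packets / rate_hz;
}
```

With the figures from this ADR, `rv_calibration_seconds(300, 50.0)` gives 6 s at the rate-gated 50 Hz callback, and `rv_calibration_seconds(300, 0.3)` gives 1000 s, about 17 minutes, on a beacon-only channel. At the 30 fps ping-flood rate in the table the same formula gives 10 s.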
## Verified Acceptance — three-state separation
Geometry: TP-Link AP on the wall, both sensors at table-level on the
opposite side of the room, operator lying on the bed between AP and
sensors. 30 seconds per state, gain-lock active on both nodes,
`raw.html` open during capture, `target_ip` provisioned to the Mac's
TP-Link-side IP (192.168.0.103) so no upstream NAT is in the path.
| State | node 1 mean A | node 1 CV | node 1 sub-CV <5 % | node 2 mean A | node 2 CV | node 2 sub-CV <7 % |
|---|---|---|---|---|---|---|
| **EMPTY** (operator out) | **37.28** | **2.71 %** | **44/44** | 9.52 | 5.22 % | 26/44 |
| **STILL** (operator lying still on bed) | 22.43 | 3.70 % | 30/44 | 9.67 | 5.02 % | 24/44 |
| **WALK** (operator pacing the room) | 31.77 | **12.50 %** | 0/44 | 7.15 | **29.72 %** | 0/44 |
Observations:
* **Node 1 separates all three states** by mean amplitude alone: 37 →
22 → 32. The body lying still blocks the direct path
(40 % amplitude drop), then motion adds reflections back. The CV
ladder 2.71 → 3.70 → 12.50 % is a second independent feature.
* **Node 2 separates STILL+EMPTY from WALK** by CV (5 → 30 %). Its
geometry doesn't pick up a still body, only motion.
* **Compare to ADR-110**, where empty/sit/walk differed by ±0.02 inside
±0.10 noise: we now have inter-state separation ratios (WALK CV over
STILL CV) of **×3.4 on node 1 and ×5.9 on node 2**. The signal is no
longer dominated by baseline drift.
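
The headline figures in these observations can be reproduced from the table. The sketch below assumes the separation ratio is WALK CV divided by STILL CV, which is the reading that matches the quoted ×3.4 and ×5.9; the struct and function names are hypothetical.

```c
/* One row of the acceptance table for a single node. */
typedef struct {
    const char *state;
    double mean_amplitude;
    double cv_percent;
} rv_row_t;

/* Values copied from the Verified Acceptance table. */
static const rv_row_t rv_node1[] = {
    {"EMPTY", 37.28, 2.71}, {"STILL", 22.43, 3.70}, {"WALK", 31.77, 12.50},
};
static const rv_row_t rv_node2[] = {
    {"EMPTY", 9.52, 5.22}, {"STILL", 9.67, 5.02}, {"WALK", 7.15, 29.72},
};

/* Ratio of the CV in an active state to the CV in a quiet state. */
double rv_cv_ratio(const rv_row_t *active, const rv_row_t *quiet) {
    return active->cv_percent / quiet->cv_percent;
}

/* Percentage drop in mean amplitude relative to a baseline state. */
double rv_amplitude_drop_percent(const rv_row_t *baseline, const rv_row_t *state) {
    return 100.0 * (baseline->mean_amplitude - state->mean_amplitude)
                 / baseline->mean_amplitude;
}
```

For node 1, `rv_cv_ratio(&rv_node1[2], &rv_node1[1])` is 12.50 / 3.70 ≈ 3.4 and `rv_amplitude_drop_percent(&rv_node1[0], &rv_node1[1])` is about 40 %; for node 2 the CV ratio is 29.72 / 5.02 ≈ 5.9.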
## Files Touched
```
firmware/esp32-csi-node/main/csi_collector.c # gain-lock module + hook
v2/crates/wifi-densepose-sensing-server/static/raw.html # already from ADR-110
docs/adr/ADR-100-gain-lock-baseline-stabilization.md # this ADR
```
## Open Items
* ✅ **NBVI subcarrier selection**: closed in ADR-102 (server-side
port with quiet-window finder).
* ✅ **Server-side RSSI parsing**: fixed by parallel agent in commit
`3393c1e8` (`parse_esp32_frame` offset realignment + carrying RSSI
through feature_state packets).
* ✅ **Calibration latency on an idle channel**: closed in ADR-106
by the built-in managed-`ping` keepalive (drives sensor RX at
25 pkt/s/node out of the box).
* ⏳ **NVS target_ip is hardcoded**: still open. Tailscale-target
option not implemented; sensors still send to the Mac's TP-Link-side
IP (192.168.0.103). Mac roaming still breaks the CSI stream.
## References
* ADR-039: Edge intelligence pipeline (host DSP path).
* ADR-098: Earlier ESP32-S3 deployment fixes.
* ADR-110: TP-Link WISP deployment + first RSSI-Δ attempt (this ADR
supersedes the threshold table in ADR-110, D3; the RSSI MAD-Δ
detector is left in place but is no longer the primary signal).
* Francesco Pace, *How I Turned My Wi-Fi Into a Motion Sensor — Part 2*,
Dec 2025: source of the gain-lock recipe.
* `francescopace/espectre`, `components/espectre/gain_controller.{h,cpp}`
on GitHub: reference implementation (GPLv3).