# ADR-098 — ESP32-S3 CSI Node Deployment Fixes (room01/room02)

**Status**: Accepted
**Date**: 2026-05-14
**Scope**: `firmware/esp32-csi-node/`, `v2/crates/wifi-densepose-sensing-server/`,
`v2/crates/wifi-densepose-desktop/`, `ui/mobile/`

## Context

Two ESP32-S3 CSI nodes (room01 `1c:db:d4:49:eb:88`, room02 `e8:f6:0a:83:89:44`)
were deployed against the RuView stack on a 2.4 GHz domestic LAN. The
out-of-the-box firmware booted but did not produce usable presence/motion
signal: `motion_score` saturated at `1.0`, `presence_score` froze near a
non-zero constant regardless of activity, vital signs never populated,
and OTA updates rolled back on every attempt.

Root-causing the chain took multiple rebuild/flash cycles. This ADR
records the final patches that made the stack functional end-to-end on
the deployed hardware and the empirical evidence that drove each change.

## Decisions

### D1 — Disable promiscuous mode in `csi_collector`

`esp_wifi_set_promiscuous(true)` silenced the CSI RX callback entirely
on this silicon revision (`yield=0pps` in the `adaptive_ctrl` medium-tick
log). Removing the call lets the Wi-Fi driver invoke `wifi_csi_callback`
again at the connected-AP rate (~5-10 pps for beacon-driven traffic).

**Patch** (`csi_collector.c`): replace `esp_wifi_set_promiscuous(true);`
with a one-line `ESP_LOGI` documenting the empirical incompatibility.
Do **not** re-enable the call.

### D2 — Truncate `n_subcarriers` to `EDGE_MAX_SUBCARRIERS` instead of early-return

CSI frames on this hardware arrive at 384 bytes, i.e. 192 subcarriers
at 2 bytes (I and Q) each. The DSP pipeline declared
`EDGE_MAX_SUBCARRIERS = 128`, so every incoming frame failed the
`n_subcarriers > EDGE_MAX_SUBCARRIERS` check and returned before
`process_frame` reached Step 8 (motion energy). This was the underlying
reason DSP outputs appeared frozen: the pipeline was not running at all.

**Patch** (`edge_processing.c`): on oversized frames, clamp
`n_subcarriers = EDGE_MAX_SUBCARRIERS` and log a one-shot warning
instead of returning. The first 128 subcarriers cover the full 20 MHz
HT20 channel; the trailing bins are HT40 sideband and not relied on.

### D3 — Broadband motion source

After D2 the original Step 8 (variance of unwrapped phase of a single
"primary" subcarrier) still failed:

* unwrapped phase drifts monotonically (thermal, oscillator) so its
  variance over a 20-frame window equals `(slope·W/2)²/3`, a non-zero
  constant unrelated to activity;
* the "primary" winner index jumps frame-to-frame (e.g. 22 → 103 →
  105), so per-bin amplitude variance is dominated by index churn,
  not motion.
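The constant quoted in the first bullet can be checked by modelling the drift as a linear ramp. A minimal derivation, assuming a phase $\phi(t) = \phi_0 + s\,t$ with constant slope $s$ observed over a window of length $W$ (the symbols $\phi_0$, $s$ and $t$ are introduced here for illustration only):

$$
\bar{\phi} = \phi_0 + \frac{sW}{2}, \qquad
\operatorname{Var}[\phi] = \frac{1}{W}\int_0^W \left(s\,t - \frac{sW}{2}\right)^2 dt
= \frac{s^2 W^2}{12} = \frac{(s\,W/2)^2}{3}.
$$

For a sampled window of $N$ frames at indices $0,\dots,N-1$ the exact value is $s^2 (N^2 - 1)/12$, which for $N = 20$ differs from the continuous result by well under 1 %. In both cases the variance depends only on the drift slope, not on activity in the room.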

We replace the source with **broadband mean amplitude variance**:
on every frame compute `mean(sqrt(I²+Q²))` across **all** subcarriers,
push that scalar into a 20-sample ring, and use its temporal variance
as `motion_energy`. This is the well-known CSI motion proxy:
human motion smears multipath and inflates frequency-domain spread
coherently across the whole channel.

Empirical separation measured on the deployed hardware:

| Condition | Broadband variance (median) |
|---|---|
| Empty room (3 m) | 0.07 – 0.10 (occasional 1.6 spike) |
| Walking past at 2-3 m | 3.5 – 14 |

Ratio ≈ 44×. Dividing by 3 (`var / 3.0f`) and clamping to `[0, 1.0]`
puts the empty room under 0.05 and walking near saturation.

**Patch** (`edge_processing.c`):

* New buffer `s_broad_mean_amp_history[20]`.
* Per-frame `band_amp_mean = mean(sqrt(I²+Q²))` over all subcarriers.
* Step 8 replaced: `s_motion_energy = clamp(var / 3.0f, 0, 1)`.

### D4 — Biquad sample rate consistency

`biquad_bandpass_design(..., fs=20.0f, ...)` (filter design) did not
match `estimate_bpm_zero_crossing(..., sample_rate=10.0f, ...)` (BPM
detector). At a real callback rate of ~10 Hz, the breathing passband
designed for 20 Hz shrinks to 0.05–0.25 Hz on the wire, cutting off the
upper half of the 0.2–0.3 Hz human breathing band (12–18 BPM).

**Patch** (`edge_processing.c:1063`): `fs = 10.0f` for both
breathing and heart-rate filters. With D2+D3 active, `breathing_rate_bpm`
populates 21–22 BPM for a stationary person within ~30 s.
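The passband shift follows from a simple scaling rule: a digital filter designed for one sample rate but fed samples at another has every corner frequency multiplied by the ratio of the two rates. A small sketch of that arithmetic is below; the helper names are hypothetical, and the designed corners of 0.1 Hz and 0.5 Hz are inferred from the 0.05–0.25 Hz figures above rather than taken from the source.

```c
#include <stdio.h>

/* Corner frequency actually realised when a filter designed for
 * design_fs is run on samples arriving at actual_fs. */
static float effective_hz(float designed_hz, float design_fs, float actual_fs)
{
    return designed_hz * actual_fs / design_fs;
}

/* Convert a frequency in Hz to breaths (or beats) per minute. */
static float hz_to_bpm(float hz)
{
    return hz * 60.0f;
}

/* Print the realised passband for an assumed 0.1-0.5 Hz design. */
static void print_effective_passband(float design_fs, float actual_fs)
{
    float lo = effective_hz(0.1f, design_fs, actual_fs);
    float hi = effective_hz(0.5f, design_fs, actual_fs);
    printf("passband %.2f-%.2f Hz (%.0f-%.0f BPM)\n",
           lo, hi, hz_to_bpm(lo), hz_to_bpm(hi));
}
```

Calling `print_effective_passband(20.0f, 10.0f)` reproduces the mismatched case, and `print_effective_passband(10.0f, 10.0f)` shows the passband once design and runtime rates agree.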

### D5 — OTA: full-partition erase + larger HTTP task stack

Two independent OTA bugs:

1. `esp_ota_begin(..., OTA_WITH_SEQUENTIAL_WRITES, ...)` skipped the
   trailing-page erase, leaving stale code from a previous (larger)
   image in the tail of the target partition. The new image header
   passed SHA validation but residual instructions still resided at
   addresses reachable via IRAM jump tables.
2. The HTTP server worker that runs the OTA verify step overflowed
   its default 4 KB stack (`esp_ota_get_app_partition_description` does
   substantial work). The new image *was* booted from `ota_1`, then
   panicked in early init from stack overflow, and the bootloader
   fell back to `ota_0`, which looked exactly like a rollback even though
   `CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE` is disabled.

**Patches** (`ota_update.c`):

* `esp_ota_begin(update_partition, OTA_SIZE_UNKNOWN, &handle)`:
  full-partition erase before write.
* `httpd_config_t config = HTTPD_DEFAULT_CONFIG(); config.stack_size = 8192;`:
  doubled stack so OTA validation has room.

In addition, `main.c:130-153` now logs `esp_reset_reason()` and the
running-partition label once at app start, so any future boot anomaly
is visible without guesswork.

### D6 — sensing-server: parse RuView feature_state, refuse simulation

Out of the box, `sensing-server` (`v2/crates/wifi-densepose-sensing-server`)
parsed only `0xC5110001` (raw CSI) and `0xC5110002` (vitals). RuView FW
emits `0xC5110006` (ADR-081 feature_state) as its default upstream
payload, so the server could not decode what the firmware actually sends.

**Patches** (`src/main.rs`):

* New `parse_rv_feature_state(buf)` decoding the 60-byte
  `rv_feature_state_t` into the existing `Esp32VitalsPacket` shape;
  wired ahead of the existing `parse_esp32_vitals` call.
* Per-node `BaselineTracker` (file-scope `OnceLock<Mutex<HashMap<u8,_>>>`)
  applies hysteretic motion gating on top of the FW-reported scores so
  the UI receives clean boolean presence transitions even when the FW
  scalar is noisy.
* `--source simulate` and the auto-fallback to simulation removed;
  `simulate`/`simulated` now exit non-zero with an `ERROR` log.
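The hysteretic gating mentioned for `BaselineTracker` can be illustrated with a two-threshold gate. The sketch below is written in C for consistency with the firmware examples; the server itself is Rust and its tracker may work differently. The type `hysteresis_gate_t`, the function `hysteresis_update`, and any threshold values passed to them are hypothetical.

```c
#include <stdbool.h>

/* Two-threshold gate: the score must rise above on_threshold to assert
 * presence and fall below off_threshold to clear it. Scores between the
 * two thresholds leave the state unchanged, which suppresses flicker. */
typedef struct {
    float on_threshold;
    float off_threshold;
    bool  present;
} hysteresis_gate_t;

/* Feed one score sample and return the gated boolean state. */
static bool hysteresis_update(hysteresis_gate_t *g, float score)
{
    if (!g->present && score > g->on_threshold) {
        g->present = true;
    } else if (g->present && score < g->off_threshold) {
        g->present = false;
    }
    return g->present;
}
```

As a usage example, a gate initialised with `{0.3f, 0.1f, false}` (values borrowed from the acceptance targets purely for illustration) turns on at a score of 0.35, stays on at 0.2, and turns off only once the score drops to 0.05.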

A `parse_csi_lean` parser was also added for compatibility with the
legacy FW 5.47 (`esp32s3_csi_capture`) CSV format. It is dead code under
current FW, kept as defence-in-depth so a mistakenly flashed legacy
sensor still produces useful data.

### D7 — Desktop UI: HTTP-sweep discovery

mDNS (`_ruview._udp.local.`) and UDP-broadcast beacon discovery (the
two paths the desktop ships) are not advertised by current RuView FW.
We added a third concurrent path: `GET http://<probe-ip>:8032/status`
for each host on the local /24 subnet, parsing the JSON returned by
RuView's `ota_status_handler`.

**Patches** (`v2/crates/wifi-densepose-desktop/src/commands/discovery.rs`):

* `discover_via_http_sweep(timeout)` running alongside mDNS + UDP.
* `futures::future::join_all(tasks)` with overall `tokio::time::timeout`
  replaces the previous sequential `for task in tasks` loop, which
  blocked on slow-to-time-out unrelated IPs and missed the responding
  sensors.
* Result-keeping in `useNodes`/`Dashboard`: keep the last good list when
  a poll round returns 0 nodes.
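The address enumeration behind the sweep can be sketched as below. This only shows how the probe URLs for a /24 are generated; the real implementation is Rust and issues the probes concurrently under an overall timeout. The names `format_probe_url`, `sweep_slash24`, `probe_fn` and `count_probe` are hypothetical, and `count_probe` is a stand-in for an actual HTTP request.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PROBE_PORT 8032  /* status port named in the text above */

/* Write the status URL for host a.b.c.host into buf; returns the
 * snprintf result (number of characters that were needed). */
static int format_probe_url(char *buf, size_t len,
                            uint8_t a, uint8_t b, uint8_t c, uint8_t host)
{
    return snprintf(buf, len, "http://%u.%u.%u.%u:%d/status",
                    (unsigned)a, (unsigned)b, (unsigned)c, (unsigned)host,
                    PROBE_PORT);
}

typedef void (*probe_fn)(const char *url, void *ctx);

/* Visit hosts .1 to .254 of the /24, skipping the network and broadcast
 * addresses, and return how many URLs were handed to the callback. */
static int sweep_slash24(uint8_t a, uint8_t b, uint8_t c,
                         probe_fn probe, void *ctx)
{
    char url[64];
    int count = 0;
    for (unsigned host = 1; host <= 254; host++) {
        format_probe_url(url, sizeof url, a, b, c, (uint8_t)host);
        probe(url, ctx);
        count++;
    }
    return count;
}

/* Stand-in probe that only counts calls instead of doing network I/O. */
static void count_probe(const char *url, void *ctx)
{
    (void)url;
    (*(int *)ctx)++;
}
```

A sequential loop like this would suffer from the same slow-timeout problem the patch fixed, which is why the desktop code runs all probes at once and bounds the whole round with a single timeout.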

### D8 — Mobile UI: WS path + Tailscale default + no simulation fallback

* `WS_PATH = '/ws/sensing'` and a hard-coded `WS_PORT = 8765` so the
  mobile app's `ws.service` connects to the RuView WS endpoint instead
  of the legacy `/api/v1/stream/pose` FastAPI path.
* `settingsStore.serverUrl` defaults to `http://100.123.189.10:8080`,
  the deployed Mac's Tailscale IP, so the phone reaches the server
  without LAN dependency.
* All `simulated` fallbacks removed from `ws.service.ts` and
  `matStore.ts`: the UI shows `disconnected` rather than synthetic data
  when the server is unreachable.

### D9 — Reset-reason logging in `app_main`

A two-line `ESP_LOGI` at the start of `app_main` records
`esp_reset_reason()` and `esp_ota_get_running_partition()->label`.
This paid for itself every time we touched OTA: it eliminated guesswork
when an image silently fell back.

## Verification

Acceptance ran on both deployed nodes with the operator stationary,
then walking 2-3 m past each sensor, then leaving the room.

| Criterion | Target | room01 | room02 |
|---|---|---|---|
| `motion_energy` empty room | < 0.05 | 0.018 | 0.070 |
| `motion_energy` walking | > 0.3 within 2 s | < 1 s | 3 s |
| `motion_energy` decay after exit | < 0.1 within 5 s | 0.02–0.03 | 0.02–0.03 |
| `breathing_rate_bpm` stationary 30 s | 12-20 BPM | 22.2 BPM | 21.0 BPM |
| OTA round-trip | 2 consecutive succeed | ✅ | ✅ |
| Reset-reason visible | one-line log at boot | ✅ | ✅ |

OTA #1 transitioned `running_partition: ota_0 → ota_1`; OTA #2 reversed
it back to `ota_0`. No panics. `Connection reset` on the curl side is
expected: `esp_restart()` tears down the TCP connection after
`httpd_resp_send` returns.

## Files Touched

```
firmware/esp32-csi-node/main/csi_collector.c
firmware/esp32-csi-node/main/edge_processing.c
firmware/esp32-csi-node/main/main.c
firmware/esp32-csi-node/main/ota_update.c
firmware/esp32-csi-node/sdkconfig.defaults

v2/crates/wifi-densepose-sensing-server/src/main.rs
v2/crates/wifi-densepose-sensing-server/src/csi.rs

v2/crates/wifi-densepose-desktop/src/commands/discovery.rs
v2/crates/wifi-densepose-desktop/src/commands/server.rs
v2/crates/wifi-densepose-desktop/ui/src/hooks/useNodes.ts
v2/crates/wifi-densepose-desktop/ui/src/hooks/useServer.ts
v2/crates/wifi-densepose-desktop/ui/src/pages/Dashboard.tsx
v2/crates/wifi-densepose-desktop/ui/src/pages/Sensing.tsx
v2/crates/wifi-densepose-desktop/ui/src/types.ts

ui/mobile/src/constants/websocket.ts
ui/mobile/src/services/ws.service.ts
ui/mobile/src/stores/matStore.ts
ui/mobile/src/stores/settingsStore.ts
ui/mobile/src/screens/MATScreen/index.tsx
ui/mobile/src/screens/VitalsScreen/index.tsx

docker/docker-compose.yml   # host port 5005 → 5006 (RuView FW target)
```

## Open Items

* `EDGE_MAX_SUBCARRIERS` is still `128`: D2 truncates incoming frames
  rather than enlarging the buffer. Increasing it to 192 would let the
  pipeline use all 192 subcarriers, including the HT40 sideband, but
  requires re-sizing several stack/heap structures and re-tuning DSP
  windows. Tracked for a future release.
* Empty-room `motion_energy` on room02 sits slightly above the 0.05
  target (0.07). Either the Fresnel-zone alignment for that node is
  noisier or the calibration constant `var / 3.0f` needs to be
  hardware-rev specific. Acceptable for the current deployment;
  candidate for an auto-calibration routine.

## References

* ADR-039 — Edge intelligence pipeline (the file we patched).
* ADR-081 — `rv_feature_state_t` packet format (`0xC5110006`).
* RuView issue #555 — *DSP froze on unwrapped phase variance* (this ADR).
* RuView issue #556 — *OTA never sticks* (this ADR).