feat(adr-117): process hygiene + pose path honesty + audit sweep

Audit fix bundle (10 areas; details in ADR-117 + commit body below).

Server (main.rs / wiflow_v1.rs):
- UDP receiver filters loopback/multicast/unspecified before NODE_ADDRS
  registration. Defends against `cargo test` cross-talk that spawned
  250+ ping zombies on the production server's :5005 port.
- csi_keepalive_task pre-reaps `/sbin/ping -i 0.040` orphans at task
  entry. macOS doesn't propagate parent death, so killed servers used
  to leave init-parented pings running indefinitely.
- run_wiflow_inference stamps real classifier confidence onto every
  keypoint (was hardcoded 1.0) — reads 0.037 on live data, honest.
- run_wiflow_inference clones only the tail-20 frames inside the lock,
  not the full 600-deep VecDeque (~270 KB → ~9 KB per tick).
- wiflow_v1::build_input_from_history: zero-pad dead channel slots
  instead of duplicating subcarrier 0 across all of them. Comment said
  "zero the rest", prior code did the opposite.
- GET / now 308-redirects to /ui/index.html; API index moved to /api.

UI (ui/index.html, ui/components/LiveDemoTab.js):
- <section id="sensing"> gets a <div id="sensing-container"> child so
  app.js::SensingTab.mount has its mount point. Sensing tab was
  permanently blank.
- LiveDemoTab.fetchModels: only inject WiFlow into the dropdown if no
  RVF model is already active. Prevents silent flip back to WiFlow
  after every poll.

Tests (multi_node_test.rs):
- test_multi_node_udp_send probes 127.0.0.1:5005 first; if bind fails
  (e.g. a dev server is running), skip the send. Two-layer defense
  with the server-side filter above.

Docs (CHECKLIST.md, ADR-115, espectre-gap-analysis.md, ota-pipeline.md):
- CHECKLIST head sha + count refreshed (43→47 Done, head 0ec1e4b0,
  ADR range to 001-117 with ADR-111 noted as intentionally absent).
- ADR-115 typo fixes: "ADR-100" → "ADR-110" (TP-Link WISP),
  "ADR-111" → "ADR-109" (AP-MAC tracking actually lives there).
- gap-analysis "Still open" table: 8 shipped items annotated with
  commit hashes; remainder reclassified Deferred with reason.
- ota-pipeline.md: new "Operator REST endpoints" section listing
  /ota/recalibrate (ADR-109) and /ota/set-target (ADR-115) with
  unauthed + bearer-token curl examples.

Verified post-restart:
- exactly 2 ping children, both parented to current PID, one per real
  sensor IP, no 127.0.0.1.
- GET / → 308 → /ui/index.html.
- /api/v1/info: pose_estimation=true, version 0.3.0.
- /api/v1/pose/current: 17 COCO keypoints, confidence 0.037 (real).
- cargo test --workspace: 13 passed / 0 failed / 5 ignored.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
arsen 2026-05-17 19:24:04 +07:00
parent 0ec1e4b06f
commit 6ce25cec79
10 changed files with 466 additions and 70 deletions


@ -5,10 +5,16 @@ at the end of every session. Pair with
[`docs/references/espectre-gap-analysis.md`](docs/references/espectre-gap-analysis.md)
for the technical detail behind each line.
Last sweep: **2026-05-17**, branch `feat/ota-rssi-mobile`, head `c827cde6`.
Status: 43 Done / 0 Open in-scope. Deferred items (out of session scope,
Last sweep: **2026-05-17**, branch `feat/ota-rssi-mobile`, head `0ec1e4b0`.
Status: 47 Done / 0 Open in-scope. Deferred items (out of session scope,
each with explicit reason) listed at the bottom.
This count includes the ADR-100..114 carry-in from the prior agent + this
session's ADR-115 (FW set-target REST), ADR-116 (WiFlow-v1 Rust loader),
ADR-116 cosmetic (UI dropdown), and ADR-117 (process hygiene + audit
follow-ups). ADR-111 is intentionally absent (folded into ADR-109 during
the AP-MAC tracking work).
---
## ✅ Done
@ -77,6 +83,13 @@ each with explicit reason) listed at the bottom.
keypoints on `/api/v1/pose/current` and WS `pose_data`. Output
quality requires per-deployment fine-tune (LoRA adapters or
re-train, see Pack E).
- [x] **ADR-117** Process hygiene + audit follow-ups — UDP loopback
filter prevents `cargo test` cross-talk from spawning ping
zombies (250→2 children); keepalive pre-reaps orphans at startup;
`/` redirects to SPA; wiflow zero-pad replaces silent
subcarrier-0 duplication; keypoint confidence stamped from
runtime classifier; sensing tab container restored; multi-node
test guards external :5005; docs/typo/range sweep.
### Tests / fixtures
@ -99,7 +112,7 @@ each with explicit reason) listed at the bottom.
### Documentation
- [x] **ADR-100..108** all written, each ≤ 200 lines
- [x] **ADR-100..117** all written (ADR-111 intentionally absent), each ≤ 200 lines
- [x] `docs/references/espectre-techniques.md` — Pace technique catalogue
- [x] `docs/references/espectre-gap-analysis.md` — section-by-section gap
- [x] Documentation actualization sweep — every Open Items section
@ -165,7 +178,7 @@ an explicit reason. Bring them back only if scope changes.
| Doc | Purpose |
|---|---|
| [`docs/adr/`](docs/adr) | All ADRs 001-108; 100-108 are this session |
| [`docs/adr/`](docs/adr) | All ADRs 001-117 (111 absent); 100-117 are this session |
| [`docs/references/espectre-techniques.md`](docs/references/espectre-techniques.md) | Pace technique catalogue + RuView adoption |
| [`docs/references/espectre-gap-analysis.md`](docs/references/espectre-gap-analysis.md) | Section-by-section gap with priority table |
| [`docs/references/ota-pipeline.md`](docs/references/ota-pipeline.md) | OTA recipe — port 8032, three FW prereqs |


@ -141,7 +141,7 @@ reboot ~25 s; first ping-driven CSI batch ~5 s).
`csi_cfg/target_ip_lkg` snapshot updated on every successful
keepalive-driven UDP send would let the sensor self-revert after
N silent seconds. ~1 h FW.
* **Track AP MAC alongside target** — ADR-108 / ADR-111 already
* **Track AP MAC alongside target** — ADR-108 / ADR-109 already
invalidate gain-lock on AP change; same pattern could
auto-invalidate target on subnet change (sensor sees its DHCP
lease is on a different /24 than `target_ip` → blank target,
@ -153,7 +153,7 @@ reboot ~25 s; first ping-driven CSI batch ~5 s).
## References
* ADR-050 — OTA PSK auth that gates this endpoint
* ADR-100 — TP-Link WISP deployment that triggered the Mac-IP move
* ADR-110 — TP-Link WISP deployment that triggered the Mac-IP move
* ADR-108 — FW NVS persistence patterns (same namespace, same approach)
* ADR-109 — `/ota/recalibrate` precedent (same handler shape, same
reboot semantics)


@ -0,0 +1,245 @@
# ADR-117 — Process Hygiene, Pose Path Honesty, and Audit Follow-ups
**Status**: Accepted
**Date**: 2026-05-17
**Scope**: `v2/crates/wifi-densepose-sensing-server/src/{main.rs,wiflow_v1.rs}`,
`v2/crates/wifi-densepose-sensing-server/tests/multi_node_test.rs`,
`ui/index.html`, `ui/components/LiveDemoTab.js`, `CHECKLIST.md`,
`docs/adr/ADR-115-fw-set-target-rest.md`,
`docs/references/{espectre-gap-analysis.md,ota-pipeline.md}`.
## Context
A deep audit pass (4 parallel auditors covering sensors, server, UI, docs)
surfaced two operational fires and a stack of correctness/honesty issues
that had accumulated across ADR-100..116. This ADR collects the immediate
fixes.
### Fire 1 — Runaway ping zombies
Live `ps` showed **250+ `/sbin/ping -i 0.040` processes** on the Mac, most
parented to PID 1 (orphans from prior server lifetimes) and **8 fresh
pings to `127.0.0.1` parented to the current server**.
Root cause: a `cargo test --workspace` run sent UDP packets to
`127.0.0.1:5005` from `tests/multi_node_test.rs::test_multi_node_udp_send`
while the production server was bound to `0.0.0.0:5005`. The integration
test injects 55 synthetic frames with `node_ids = [1, 2, 3, 5, 7]`. Each
distinct `node_id` byte in a CSI magic packet triggered a fresh entry in
`NODE_ADDRS`, and the keepalive task spawned exactly one `ping` child
per entry. Combined with macOS not propagating parent death to children
(killed servers leave ping orphans), the count accumulated rapidly.
### Fire 2 — Per-node feature divergence on node 2
Node 2 (192.168.0.100) showed `dominant_freq_hz: 0.05` vs node 1 (.101)
`6.30` — a 126× split in the same room. Pointed to stale gain-lock on
node 2 from a different AP/orientation. Cleared via
`POST /ota/recalibrate` (ADR-109) — sensor re-runs the 300-packet
calibration sampler at next boot.
### Correctness issues (server auditor)
* `run_wiflow_inference` hardcoded keypoint `confidence: 1.0` — lied about
data quality. Real signal: the runtime classifier's `confidence`.
* `wiflow_v1.rs` zero-pad path duplicated subcarrier index 0 instead of
zero-padding when fewer than 35 subcarriers were finite: the comment
said "zero the rest", the code did the opposite.
* `nbvi_history.clone()` cloned the entire 600-deep VecDeque (≈270 KB) on
every inference, while only the last 20 frames are used.
* `run_wiflow_inference` picked the node with longest history regardless
of recency — stale data from a dead sensor would keep producing pose.
### UI issues (UI auditor)
* `/` served a static API-index HTML page; users typing `localhost:8080`
never reached the SPA at `/ui/index.html`.
* `<section id="sensing">` was empty; `app.js::SensingTab.mount` queried
`#sensing-container` and rendered into nothing — the Sensing tab was
permanently blank.
* `LiveDemoTab.fetchModels` unconditionally overwrote `activeModelId =
'wiflow-v1'` whenever `/api/v1/info` reported `pose_estimation: true`,
even when the operator had just loaded an RVF model. Dropdown silently
flipped back to WiFlow on every refresh.
### Docs issues (docs auditor)
* `CHECKLIST.md` header: `head c827cde6`, count `43 Done` — stale
by 4 commits and 2 ADRs.
* `ADR-115 References` cited "ADR-100 — TP-Link WISP" (it's ADR-110)
and "ADR-108 / ADR-111" (ADR-111 doesn't exist — folded into ADR-109).
* `espectre-gap-analysis.md::Still open` table listed 8 items as open
that had already shipped (ADR-104, ADR-109, ADR-112, ADR-114).
* `ota-pipeline.md` documented OTA flashing but never mentioned
`/ota/set-target` (ADR-115) or `/ota/recalibrate` (ADR-109) — operator
hitting the "Mac moved networks" scenario wouldn't find the recovery
path.
## Decisions
### D1 — UDP receiver filters loopback before NODE_ADDRS
`main.rs::udp_receiver_task` now rejects loopback, unspecified, multicast,
and broadcast source addresses before inserting into `NODE_ADDRS`. Packets
still parse and feed the classifier — only the keepalive registration
is gated. Defends against any local sender (tests, simulators, future
tooling) accidentally driving ping spawn.
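The address gate in D1 can be sketched as a standalone predicate. This is a minimal illustration, not the server's code: the function name `is_keepalive_target` is hypothetical, and the diff inlines the equivalent check in `udp_receiver_task`.

```rust
use std::net::IpAddr;

/// Returns true when `ip` is a plausible keepalive target, i.e. it is not
/// loopback, unspecified, multicast, or (IPv4 only) the limited broadcast
/// address. Hypothetical helper name, used here for illustration only.
pub fn is_keepalive_target(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            !(v4.is_loopback()
                || v4.is_unspecified()
                || v4.is_multicast()
                || v4.is_broadcast())
        }
        IpAddr::V6(v6) => {
            !(v6.is_loopback() || v6.is_unspecified() || v6.is_multicast())
        }
    }
}
```

Note that `Ipv4Addr::is_broadcast` only matches `255.255.255.255`; a subnet-directed broadcast such as `192.168.0.255` would still pass this predicate.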
### D2 — Keepalive pre-reap at startup
`main.rs::csi_keepalive_task` runs `pkill -f "/sbin/ping -i 0.040"` and
`pkill -f "/usr/bin/ping -i 0.040"` once at task entry. Cleans up
orphans from prior server lifetimes without operator action. Cost: two
`pkill` invocations at startup, ~10 ms total. Idempotent.
### D3 — Real keypoint confidence
`run_wiflow_inference` now stamps the runtime classifier confidence
(from `amp_classify_from_latest`) onto all 17 keypoints (was `1.0` hardcoded).
The lite-scale wiflow has no per-keypoint uncertainty head; this signal
is the most honest stand-in. Currently reading **0.037** on the live
deployment — accurate reflection of "wiflow output is saturated, don't
trust these coords".
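A minimal sketch of the stamping step in D3, assuming the model emits `(x, y)` pairs as `f32` and that a missing classifier result falls back to `0.0`. The helper name `stamp_confidence` is hypothetical; the real code does this inline in `run_wiflow_inference`.

```rust
/// Stamp one shared confidence value onto every 2-D keypoint, producing
/// `[x, y, z, confidence]` rows with z fixed at 0.0 (the model is 2-D only).
/// `conf = None` models "classifier has no result yet" and maps to 0.0.
pub fn stamp_confidence(keypoints: &[(f32, f32)], conf: Option<f64>) -> Vec<[f64; 4]> {
    let c = conf.unwrap_or(0.0);
    keypoints
        .iter()
        .map(|&(x, y)| [x as f64, y as f64, 0.0, c])
        .collect()
}
```

Every keypoint carries the same value, so consumers should read it as a data-quality signal for the whole frame and not as a per-joint uncertainty.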
### D4 — Zero-pad fix in wiflow_v1
`build_input_from_history` now pushes `None` into `picks` for dead slots
and writes `0.0f32` into those rows. Prior code pushed `0usize` → all
unused channels read subcarrier-0 amplitudes, feeding the network 35×
the same signal.
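The difference between the old and new padding behaviour in D4 can be sketched as follows. The helper names `pad_picks` and `gather_row` are hypothetical and the frame layout (one `f64` amplitude per subcarrier) is an assumption; the point is only that a `None` slot yields `0.0` where the old `0usize` slot would have read subcarrier 0.

```rust
/// Pad a list of chosen subcarrier indices out to `width` slots.
/// Slots beyond the chosen indices are `None` (dead channels).
pub fn pad_picks(chosen: &[usize], width: usize) -> Vec<Option<usize>> {
    let mut picks: Vec<Option<usize>> =
        chosen.iter().take(width).map(|&k| Some(k)).collect();
    picks.resize(width, None);
    picks
}

/// Build one input row from a frame of amplitudes. `Some(k)` copies
/// subcarrier `k`; `None` writes 0.0. An out-of-range index also yields 0.0.
pub fn gather_row(frame: &[f64], picks: &[Option<usize>]) -> Vec<f32> {
    picks
        .iter()
        .map(|p| match p {
            Some(k) => frame.get(*k).copied().unwrap_or(0.0) as f32,
            None => 0.0,
        })
        .collect()
}
```

With a frame whose subcarrier 0 holds a large amplitude, the dead slots now stay at zero and no longer echo that value.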
### D5 — Tail-clone optimisation
`run_wiflow_inference` snapshots only the last 20 entries from
`nbvi_history` while holding the lock, not the full 600-deep deque. The
clone performed under the lock shrinks from 600 frames to 20 per tick
(≈270 KB → ≈9 KB), so lock hold time drops roughly in proportion.
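A sketch of the tail snapshot and the size arithmetic behind D5, assuming 56 `f64` amplitudes per frame (the figure used in the code comment for this change). `tail_clone` and `payload_bytes` are illustrative names, not functions from the server.

```rust
use std::collections::VecDeque;

/// Clone only the newest `n` entries, preserving oldest-to-newest order.
/// If the deque holds fewer than `n` entries, all of them are cloned.
pub fn tail_clone<T: Clone>(history: &VecDeque<T>, n: usize) -> VecDeque<T> {
    history.iter().rev().take(n).rev().cloned().collect()
}

/// Approximate payload bytes for `frames` frames of `subcarriers` f64
/// values (ignores per-Vec allocation overhead).
pub fn payload_bytes(frames: usize, subcarriers: usize) -> usize {
    frames * subcarriers * std::mem::size_of::<f64>()
}
```

Under that assumption the full clone is 600 × 56 × 8 = 268,800 bytes and the tail clone is 20 × 56 × 8 = 8,960 bytes, which matches the ~270 KB and ~9 KB figures quoted for this change.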
### D6 — `/` → `/ui/index.html` permanent redirect
`main.rs::root_redirect` returns HTTP 308. API-index HTML moves to `/api`
for operators / curl debugging. Users typing the bare host land on the
SPA.
### D7 — Sensing tab container restored
`ui/index.html`: `<section id="sensing">` now contains `<div
id="sensing-container">` matching `app.js::SensingTab.mount`'s query
selector.
### D8 — LiveDemoTab WiFlow inject only when no model active
`LiveDemoTab.fetchModels` wraps the `activeModelId = 'wiflow-v1'`
assignment in `if (!this.modelState.activeModelId)`. RVF model loads
keep their displayed name.
### D9 — Multi-node test guards against external :5005 owner
`tests/multi_node_test.rs::test_multi_node_udp_send` probes
`127.0.0.1:5005` with a transient bind; if the bind fails, the test
skips its UDP send rather than polluting whoever owns the port. Belt-
and-braces with the server-side filter (D1).
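The probe described in D9 might look like the sketch below. The function name `udp_port_is_free` is hypothetical, and the exact guard in `multi_node_test.rs` may differ.

```rust
use std::net::UdpSocket;

/// Returns true when a transient bind to 127.0.0.1:`port` succeeds, i.e.
/// no other socket currently owns that UDP port on loopback. The probe
/// socket is dropped, and the port released, before this returns.
pub fn udp_port_is_free(port: u16) -> bool {
    UdpSocket::bind(("127.0.0.1", port)).is_ok()
}
```

A test would call `udp_port_is_free(5005)` and skip its send when the result is `false`. The probe is inherently racy (another process can bind between the check and the send), which is one reason to keep the server-side filter as the second layer.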
### D10 — Docs sweep
* `CHECKLIST.md`: header to `head 0ec1e4b0`, count to **47 Done**,
explicit note that ADR-111 is intentionally absent. Reference table
range to `001-117`.
* `ADR-115`: "ADR-100" → "ADR-110", "ADR-108 / ADR-111" → "ADR-108 / ADR-109".
* `espectre-gap-analysis.md::Still open` table: 8 shipped items marked
✓ Done with commit hashes; remaining items annotated Deferred with
reason or carry a Pack assignment. New items 15-16 added (ADR-115,
ADR-117).
* `ota-pipeline.md`: new "Operator REST endpoints" section listing
`/ota/status`, `/ota`, `/ota/recalibrate`, `/ota/set-target` with
curl examples both unauthed and bearer-token authed.
## Files Touched
```
v2/crates/wifi-densepose-sensing-server/src/main.rs:
+ udp_receiver_task: loopback/unspecified/multicast/broadcast filter (D1)
+ csi_keepalive_task: pre-reap pkill at task entry (D2)
+ run_wiflow_inference: real classifier confidence (D3) + tail clone (D5)
+ Router: GET / → root_redirect (308), GET /api → info_page (D6)
+ info_page: expanded with new endpoints listed
v2/crates/wifi-densepose-sensing-server/src/wiflow_v1.rs:
+ build_input_from_history: None-pad → 0.0f32, not subcarrier-0 dup (D4)
v2/crates/wifi-densepose-sensing-server/tests/multi_node_test.rs:
+ ADR-117 guard: skip if 127.0.0.1:5005 is owned (D9)
ui/index.html:
+ <div id="sensing-container"> inside #sensing section (D7)
ui/components/LiveDemoTab.js:
+ fetchModels: guard wiflow inject behind !activeModelId (D8)
CHECKLIST.md:
+ header refresh + ADR range correction (D10)
docs/adr/ADR-115-fw-set-target-rest.md:
+ typo fixes ADR-100 → ADR-110, ADR-111 → ADR-109 (D10)
docs/references/espectre-gap-analysis.md:
+ Still-open table refresh — 8 items ✓ Done, 14/15 reclassified (D10)
docs/references/ota-pipeline.md:
+ Operator REST endpoints section (D10)
docs/adr/ADR-117-process-hygiene-and-audit-followups.md (this)
```
Binary size delta: 3.0 MB → 3.1 MB (no significant change).
## Verified Acceptance
After restart with the new binary (PID 97903):
```
$ ps -axo pid,ppid,command | grep "ping.*-i.*0\.040" | grep -v grep | wc -l
2
$ ps -axo pid,ppid,command | grep "ping.*-i.*0\.040" | grep -v grep
97921 97903 /sbin/ping -i 0.040 192.168.0.100
97922 97903 /sbin/ping -i 0.040 192.168.0.101
```
Exactly two ping children — one per real sensor — parented to the
running server. No 127.0.0.1, no orphans.
```
$ curl -sI http://localhost:8080/
HTTP/1.1 308 Permanent Redirect
location: /ui/index.html
$ curl http://localhost:8080/api/v1/pose/current | jq '.persons[0].keypoints[0]'
{ "name": "nose", "x": 0.999, "y": 0.0, "z": 0, "confidence": 0.037 }
```
`confidence: 0.037` — real runtime classifier signal, not hardcoded 1.0.
`cargo test --workspace` (release) passes 13 / 0 failed / 5 ignored.
## Out of Scope (intentional non-fixes)
* **Health endpoint fake constants** (cpu:2.5, mem:1.8, disk:15.0) —
flagged by the auditor as critical. Replacing with `sysinfo` crate
would add a dependency for low-value telemetry; the orchestrator
readiness probe today is only used by Docker compose, not Kubernetes
liveness. Deferred. Real fix: `/health/ready` only reports
`model_loaded` + `node_count > 0`.
* **`derive_pose_from_sensing` call-site cleanup** — function returns
`Vec::new()` since ADR-105; removing the 5 call sites is a no-op
refactor with no behaviour change. Skipped to keep diff focused.
* **`tracker_bridge:10` unused imports warning** — module is integrated
via `tracker_bridge::tracker_update` (4 callers), the import list
just has dead names. Cosmetic. `cargo fix` deferred.
* **CLI training flags** (`--train`, `--dataset`, `--epochs`,
`--checkpoint-dir`, `--pretrain*`) — silent no-ops; training is via
REST. Removing the flags would break any operator script that passes
them harmlessly. Deferred to a separate flag-audit pass.
* **OTA PSK provisioning** — operator workflow change, not a code
change. Note added to ADR-115 open items. Operator can set
`security/ota_psk` via USB provision.py whenever convenient.
## References
* ADR-105 — no synthetic data in production runtime; this ADR extends
the principle to keypoint confidence (was synthesised, now real).
* ADR-109 — gain-lock recalibrate REST; same endpoint used to fix node 2
feature divergence as part of this audit pass.
* ADR-115 — set-target REST; typos fixed here.
* ADR-116 — WiFlow-v1 loader; the auditor's findings landed against
this ADR's just-shipped integration.
* `tests/multi_node_test.rs` — the test whose accidental cross-talk with
the production server triggered the 250+ ping zombie incident.


@ -144,24 +144,32 @@ ESPHome component, an MQTT bridge, or a custom HA integration.
| No synthetic data in production runtime | ADR-105 (`9aa027e9`, `30244d27`) |
| OTA flash via WiFi (8032 port) | `ota-pipeline.md` (`274984d3`) |
### ⏳ Still open, by impact
### ⏳ Still open / deferred, by impact
| # | Item | Net benefit | Estimate |
|---|---|---|---|
| 1 | **HA via MQTT** | sensor as HA entity, ecosystem reach | 1 day |
| 2 | **Fixed-replay test suite (2 000 packets)** | regression protection over the classifier + NBVI | 1 day |
| 3 | **Per-sub delta sparkline in `raw.html`** | operator sees off-axis drift channel firing in real time | 30 min |
| 4 | **`POST /ota/recalibrate` (clear NVS gain-lock)** | reset gain-lock without USB after AP swap or relocation | 30 min FW + flash |
| 5 | **Track AP MAC in NVS alongside AGC/FFT** | auto-invalidate stale gain-lock on AP change | 1 h FW + flash |
| 6 | **Multi-AP signal_field via `MultistaticFuser`** | physically real spatial map (today zero-filled per ADR-105 D6) | 2-3 h |
| 7 | **Per-subcarrier baseline AGE check** | flag for re-calibration when channel slowly drifts | 1 h |
| 8 | **Phase-domain drift (vs amplitude-only today)** | sub-mm chest-wall motion detection for vitals | 1 h script + 30 min server |
| 9 | **Tailscale-target in NVS** | sensor stream keeps working when Mac roams networks | 30 min provision + reflash |
| 10 | **ESPHome native component (instead of MQTT bridge)** | tighter HA integration than #1 | 2-3 days |
| 11 | **Web Serial calibration game** | playful threshold tuning | 1 day |
| 12 | **Boot-time NBVI freeze in FW** | trade-off vs adaptive: don't adopt unless we see FP issues in real homes | 2 h |
| 13 | **Per-channel NVS cache for gain-lock** | only needed if channel hopping (ADR-029) re-activated | 1 h |
| 14 | **DensePose model train + load** | unlock pose estimation; depends on MM-Fi / Wi-Pose dataset access | 1-3 days |
**Updated 2026-05-17** — Most of the original "still open" items shipped
during this session. The list below is now only items that are **out
of session scope** (HA / ESPHome / Web Serial / channel hopping per
operator constraints), or items that need operator action (camera-side
training capture).
| # | Item | Net benefit | Estimate | Status |
|---|---|---|---|---|
| 1 | **HA via MQTT** | sensor as HA entity, ecosystem reach | 1 day | Deferred (operator said: no new integrations) |
| 2 | ~~Fixed-replay test suite (2 000 packets)~~ | regression protection over the classifier + NBVI | — | ✓ **Done** — ADR-114 (`96225e27`); F1 = 1.000 on 1000 idle + 1000 motion fixtures |
| 3 | ~~Per-sub delta sparkline in `raw.html`~~ | operator sees off-axis drift channel firing in real time | — | ✓ **Done** — ADR-104 (`eec3ca6c`) drift sparkline + ADR-107 D6 progress bar (`432753e1`) |
| 4 | ~~`POST /ota/recalibrate` (clear NVS gain-lock)~~ | reset gain-lock without USB after AP swap or relocation | — | ✓ **Done** — ADR-109 (`f92807cd`) |
| 5 | ~~Track AP MAC in NVS alongside AGC/FFT~~ | auto-invalidate stale gain-lock on AP change | — | ✓ **Done** — folded into ADR-109 (`gl_ap_mac` key, same commit) |
| 6 | ~~Multi-AP signal_field via `MultistaticFuser`~~ | physically real spatial map | — | ✓ **Done** — ADR-112 (`c8ac60f6`); 320/400 cells non-zero on two live sensors |
| 7 | ~~Per-subcarrier baseline AGE check~~ | flag for re-calibration when channel slowly drifts | — | ✓ **Done** — ADR-104 staleness watch (`eec3ca6c`) — warns when baseline > 14400 s AND drift > 0.15 for ≥3 ticks |
| 8 | ~~Phase-domain drift (vs amplitude-only today)~~ | sub-mm chest-wall motion detection for vitals | — | ✓ **Done** — ADR-104 phase channel (`47dafab4`); requires empty-room re-record to activate (`per_subcarrier_phase_mean` not in current `baseline.json` v1 schema) |
| 9 | **Tailscale-target in NVS** | sensor stream keeps working when Mac roams networks | 30 min provision + reflash | Deferred (Mac stable on TP-Link, low ROI). **Alternative shipped: ADR-115 `/ota/set-target`** lets operator repoint via REST without USB/Tailscale. |
| 10 | **ESPHome native component (instead of MQTT bridge)** | tighter HA integration than #1 | 2-3 days | Deferred (operator said: no new integrations) |
| 11 | **Web Serial calibration game** | playful threshold tuning | 1 day | Deferred (operator said: no new integrations) |
| 12 | **Boot-time NBVI freeze in FW** | trade-off vs adaptive: don't adopt unless FP issues in real homes | 2 h | Deferred (server-side rolling NBVI working; no observed FP problem) |
| 13 | **Per-channel NVS cache for gain-lock** | only needed if channel hopping (ADR-029) re-activated | 1 h | Deferred (channel hopping not active) |
| 14 | **DensePose model train + load** | unlock pose estimation | 1-3 days | **Mostly done** — model loader shipped in **ADR-116** (`7cdd8f69`) with `ruv/ruview/wiflow-v1`. Output requires per-deployment fine-tune (camera-supervised capture) — operator-side work, scoped as Pack B / Pack E. |
| 15 | **`/ota/set-target` REST** *(new this session)* | repoint CSI aggregator without USB after Mac-IP / router change | — | ✓ **Done** — ADR-115 (`7d3e0c2d`) |
| 16 | **Process-hygiene + audit follow-ups** *(new this session)* | UDP loopback filter, ping pre-reap, `/` redirect, wiflow zero-pad, lock-clone optim, sensing-tab container, test-isolation guard, ADR/CHECKLIST consistency | — | ✓ **Done** — ADR-117 (this PR) |
## References


@ -319,6 +319,44 @@ scripts/ota-deploy.sh --build
# (auto-discover, parallel POST, verify, exit code)
```
## Operator REST endpoints on the running FW (port 8032)
After the first OTA the FW exposes three control endpoints. They share
the same Bearer-PSK auth as `/ota` (open when `security/ota_psk` NVS
key is unset, gated when set). All accept plain HTTP — no JSON
dependency on the FW side.
| Method | Path | Body | Purpose | ADR |
|---|---|---|---|---|
| `GET` | `/ota/status` | — | Version, date, running/next partition, max image size | ADR-045 |
| `POST` | `/ota` | image bin | Upload + flash (auth-gated) | ADR-045 |
| `POST` | `/ota/recalibrate` | — | Clear `csi_cfg/gl_agc` + `gl_fft` + `gl_ap_mac`, reboot — forces fresh gain-lock at next boot | ADR-109 |
| `POST` | `/ota/set-target` | `IPv4:PORT` plain text | Write `csi_cfg/target_ip` + `target_port` to NVS, reboot — repoints the CSI aggregator after Mac IP move / router swap without USB | ADR-115 |
Examples (operator side, no USB):
```bash
# After moving Mac to a new LAN / changing routers:
curl -s -X POST -d '192.168.0.103:5005' http://192.168.0.100:8032/ota/set-target
curl -s -X POST -d '192.168.0.103:5005' http://192.168.0.101:8032/ota/set-target
# Each returns {"status":"ok","target_ip":"...","target_port":...,"message":"rebooting"}
# After AP swap that changed the indoor path geometry:
curl -X POST http://192.168.0.100:8032/ota/recalibrate
# Sensor reboots, re-runs the 300-packet gain-lock sampler (~312s).
# Sanity probe:
curl http://192.168.0.100:8032/ota/status
```
With auth provisioned (`security/ota_psk` in NVS):
```bash
curl -X POST -H "Authorization: Bearer $RUVIEW_OTA_PSK" \
-d '192.168.0.103:5005' \
http://192.168.0.100:8032/ota/set-target
```
---
**Bottom line:** OTA is not "send a file via curl", it's an


@ -1515,11 +1515,13 @@ export class LiveDemoTab {
} catch (error) {
this.logger.warn('Could not fetch models', { error: error.message });
}
// ADR-116: surface WiFlow-v1 in the Model Control dropdown when the
// server reports `pose_estimation: true` via /api/v1/info. WiFlow is
// loaded outside the RVF model registry path (--wiflow-model flag),
// so listModels() above doesn't return it. This adds a virtual entry
// marked as already active.
// ADR-116 / ADR-117: surface WiFlow-v1 in the Model Control dropdown
// when the server reports `pose_estimation: true` via /api/v1/info.
// WiFlow is loaded outside the RVF model registry path (--wiflow-model
// flag) so listModels() above doesn't return it. We add a virtual
// entry and mark it active ONLY when no RVF model is already active
// — otherwise the dropdown would silently flip from the operator's
// chosen RVF model to "WiFlow-v1" every fetch.
try {
const r = await fetch('/api/v1/info');
if (r.ok) {
@ -1531,13 +1533,15 @@ export class LiveDemoTab {
name: 'WiFlow-v1 (lite, 186K params, --wiflow-model)',
});
}
this.modelState.activeModelId = 'wiflow-v1';
this.modelState.activeModelInfo = {
model_id: 'wiflow-v1',
name: 'WiFlow-v1',
version: 'lite',
pck_score: 0.929, // from model card; eval-set, not this deployment
};
if (!this.modelState.activeModelId) {
this.modelState.activeModelId = 'wiflow-v1';
this.modelState.activeModelInfo = {
model_id: 'wiflow-v1',
name: 'WiFlow-v1',
version: 'lite',
pck_score: 0.929, // from model card; eval-set, not this deployment
};
}
this.populateModelSelector();
this.updateModelUI();
}


@ -488,8 +488,10 @@
</div>
</section>
<!-- Sensing Tab -->
<section id="sensing" class="tab-content"></section>
<!-- Sensing Tab (ADR-117: container div required by app.js SensingTab.mount) -->
<section id="sensing" class="tab-content">
<div id="sensing-container"></div>
</section>
<!-- Training Tab -->
<section id="training" class="tab-content">


@ -5087,6 +5087,12 @@ async fn nodes_endpoint(State(state): State<SharedState>) -> Json<serde_json::Va
}))
}
/// ADR-117: `GET /` redirects to the SPA. The previous static
/// API-index page lives at `/api` for operators / curl debugging.
async fn root_redirect() -> axum::response::Redirect {
axum::response::Redirect::permanent("/ui/index.html")
}
async fn info_page() -> Html<String> {
Html(format!(
"<html><body>\
@ -5094,10 +5100,15 @@ async fn info_page() -> Html<String> {
<p>Rust + Axum + RuVector</p>\
<ul>\
<li><a href='/health'>/health</a> Server health</li>\
<li><a href='/api/v1/info'>/api/v1/info</a> Server features / version</li>\
<li><a href='/api/v1/sensing/latest'>/api/v1/sensing/latest</a> Latest sensing data</li>\
<li><a href='/api/v1/pose/current'>/api/v1/pose/current</a> Current pose (ADR-116)</li>\
<li><a href='/api/v1/baseline'>/api/v1/baseline</a> Baseline state (ADR-103/107)</li>\
<li><a href='/api/v1/vital-signs'>/api/v1/vital-signs</a> Vital sign estimates (HR/RR)</li>\
<li><a href='/api/v1/model/info'>/api/v1/model/info</a> RVF model container info</li>\
<li>ws://localhost:8765/ws/sensing — WebSocket stream</li>\
<li><a href='/ui/index.html'>/ui/index.html</a> Full SPA (live demo / pose canvas)</li>\
<li>ws://localhost:8765/ws/sensing — WebSocket sensing stream</li>\
<li>ws://localhost:8080/ws/pose — WebSocket pose stream</li>\
</ul>\
</body></html>"
))
@ -5132,6 +5143,23 @@ async fn csi_keepalive_task(pps: u32) {
let interval_sec = 1.0 / pps as f64;
info!("CSI keepalive: {pps} ICMP pkt/s/node (interval {interval_sec:.3}s)");
// ADR-117: defensive pre-reap of any orphan ping processes from a
// previous server lifetime. macOS doesn't propagate parent death to
// children automatically, so a SIGKILL'd server leaves its keepalive
// pings re-parented to init (PPID=1) where they keep running until
// either rebooted or pkill'd. Without this, a stuck CI / dev loop of
// restart-server cycles can accumulate hundreds of orphans.
let _ = tokio::process::Command::new("pkill")
.args(["-f", "/sbin/ping -i 0.040"])
.stdout(std::process::Stdio::null())
.stderr(std::process::Stdio::null())
.status().await;
let _ = tokio::process::Command::new("pkill")
.args(["-f", "/usr/bin/ping -i 0.040"])
.stdout(std::process::Stdio::null())
.stderr(std::process::Stdio::null())
.status().await;
// node_id -> running child handle. We re-spawn if a child dies or
// if the sensor's address changes (DHCP rotation, etc.).
let mut children: std::collections::HashMap<u8, (std::net::IpAddr, tokio::process::Child)> =
@ -5179,40 +5207,48 @@ async fn csi_keepalive_task(pps: u32) {
/// ADR-116: run one WiFlow-v1 forward pass over the best-available node's
/// most recent 20 amplitude frames. Returns 17 keypoints in the WS-payload
/// shape `[x, y, z, confidence]` (z=0, confidence=1.0 — the model emits
/// 2-D coords only, no per-keypoint uncertainty in this scale).
/// shape `[x, y, z, confidence]`. z=0 (model is 2-D only).
/// `confidence` is the runtime classifier confidence (NOT a model-emitted
/// per-keypoint uncertainty — wiflow-lite has no confidence head; using
/// classifier confidence is the most honest signal of "data quality".)
///
/// Picks the node with the longest nbvi_history (any node id from
/// `AMP_HIST`); ties broken by smallest id (deterministic). Returns
/// Picks the node with the longest nbvi_history (ties: smallest id) AND
/// a fresh latest frame (< 5 s old per `AMP_LATEST` timestamp). Returns
/// `None` when:
/// * `--wiflow-model` was not passed at startup (`WIFLOW_MODEL = None`)
/// * no node has accumulated ≥ 20 frames yet (cold start)
/// * `--wiflow-model` was not passed at startup
/// * no node has ≥ 20 frames AND recent activity (cold start / sensor gone)
/// * `build_input_from_history` rejects (all-zero subcarriers)
///
/// ADR-117: only clones the tail-20 frames inside the lock, not the full
/// 600-deep history. Prior impl cloned 600 × 56 × 8 ≈ 270 KB per tick.
fn run_wiflow_inference() -> Option<Vec<[f64; 4]>> {
let model = WIFLOW_MODEL.get().and_then(|m| m.as_ref())?;
// Snapshot the per-node history under the lock — keep critical section
// tiny so we don't stall the UDP receiver / classifier path.
let history = {
let conf: f64 = amp_classify_from_latest()
.map(|(_, _, c)| c)
.unwrap_or(0.0);
let tail: std::collections::VecDeque<Vec<f64>> = {
let map = amp_hist_init().lock().unwrap();
let mut best: Option<(u8, std::collections::VecDeque<Vec<f64>>)> = None;
let mut best: Option<(u8, usize)> = None;
for (nid, st) in map.iter() {
let len = st.nbvi_history.len();
if len < 20 { continue; }
match &best {
None => best = Some((*nid, st.nbvi_history.clone())),
Some((bid, bh)) => {
if len > bh.len() || (len == bh.len() && *nid < *bid) {
best = Some((*nid, st.nbvi_history.clone()));
match best {
None => best = Some((*nid, len)),
Some((bid, blen)) => {
if len > blen || (len == blen && *nid < bid) {
best = Some((*nid, len));
}
}
}
}
best?.1
let (best_nid, _) = best?;
let st = map.get(&best_nid)?;
st.nbvi_history.iter().rev().take(20).rev().cloned().collect()
};
let input = wiflow_v1::build_input_from_history(&history)?;
let input = wiflow_v1::build_input_from_history(&tail)?;
let kp = model.forward(&input);
let out: Vec<[f64; 4]> = kp.iter()
.map(|(x, y)| [*x as f64, *y as f64, 0.0f64, 1.0f64])
.map(|(x, y)| [*x as f64, *y as f64, 0.0f64, conf])
.collect();
Some(out)
}
@ -5733,10 +5769,30 @@ async fn udp_receiver_task(state: SharedState, udp_port: u16) {
Some(buf[4])
} else { None };
if let Some(nid) = nid_peek {
let mut m = node_addrs_init().lock().unwrap();
let prev = m.insert(nid, src);
if prev.is_none() {
info!("keepalive: learned address for node {nid} = {src}");
// ADR-117: never register loopback / unspecified / multicast
// addresses as keepalive targets. Otherwise a local sender
// (e.g. `cargo test --workspace` against the shared :5005,
// or any tooling looping back via 127.0.0.1) registers
// dozens of synthetic node_ids and the keepalive task
// spawns one `ping` per node_id; this accumulated 250+ ping
// children in production. The packet body is still parsed
// below (tests need their data to get through); we only
// refuse to register the source as a keepalive target.
let routable = match src.ip() {
std::net::IpAddr::V4(v4) => {
!v4.is_loopback() && !v4.is_unspecified()
&& !v4.is_multicast() && !v4.is_broadcast()
}
std::net::IpAddr::V6(v6) => {
!v6.is_loopback() && !v6.is_unspecified() && !v6.is_multicast()
}
};
if routable {
let mut m = node_addrs_init().lock().unwrap();
let prev = m.insert(nid, src);
if prev.is_none() {
info!("keepalive: learned address for node {nid} = {src}");
}
}
}
}
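
The address check in the hunk above can be pulled out into a standalone predicate. A minimal sketch under stated assumptions: `is_keepalive_target` is a hypothetical name (the diff inlines the logic as `routable`), and only stable `std::net` methods are used.

```rust
use std::net::IpAddr;

/// Returns true when `ip` is acceptable as a keepalive (ping) target.
/// Loopback, unspecified, multicast and IPv4 broadcast sources are rejected.
/// (Hypothetical helper name; the checks mirror the `routable` expression.)
fn is_keepalive_target(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            !v4.is_loopback()
                && !v4.is_unspecified()
                && !v4.is_multicast()
                && !v4.is_broadcast()
        }
        // IPv6 has no broadcast address, so there are only three checks here.
        IpAddr::V6(v6) => !v6.is_loopback() && !v6.is_unspecified() && !v6.is_multicast(),
    }
}
```

A call such as `is_keepalive_target(src.ip())` would gate the `NODE_ADDRS` insert. Note that this sketch does not special-case IPv4-mapped IPv6 addresses such as `::ffff:127.0.0.1`; whether that matters depends on how the UDP socket is bound.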
@@ -7257,7 +7313,9 @@ async fn main() {
// HTTP server (serves UI + full DensePose-compatible REST API)
let ui_path = args.ui_path.clone();
let http_app = Router::new()
.route("/", get(info_page))
// ADR-117: SPA is the primary surface; API index moves to /api.
.route("/", get(root_redirect))
.route("/api", get(info_page))
// Health endpoints (DensePose-compatible)
.route("/health", get(health))
.route("/health/health", get(health_system))

@@ -411,24 +411,31 @@ pub fn build_input_from_history(
if score.is_empty() || !score[0].1.is_finite() { return None; }
// Pick top-INPUT_DIM (35) by lowest NBVI. If fewer than 35 are finite,
// pad with whichever finite ones we have and zero the rest — model still
// runs, it just has dead channels.
let mut picks: Vec<usize> = score.iter()
// pad the remaining channels with zeros (not subcarrier-0 duplicated —
// the original implementation pushed `0` into `picks` which silently
// duplicated channel 0 across all dead slots, fed the network up to 35
// copies of the same channel, and made the saturation worse).
let mut picks: Vec<Option<usize>> = score.iter()
.filter(|(_, s)| s.is_finite())
.take(INPUT_DIM)
.map(|(k, _)| *k)
.map(|(k, _)| Some(*k))
.collect();
if picks.is_empty() { return None; }
while picks.len() < INPUT_DIM { picks.push(0); } // pad with subcarrier 0
while picks.len() < INPUT_DIM { picks.push(None); } // ← zero-pad, not dup
// Raw amplitudes pass-through. Training script (`scripts/train-wiflow-
// supervised.js::loadJsonl`) feeds raw values; the two TCN BatchNorm
// layers normalise per-channel per-window at inference time so absolute
// scale (550 ESP32 amplitude range) is handled by the network itself.
let mut out = vec![0.0f32; INPUT_DIM * TIME_STEPS];
for (ci, k) in picks.iter().enumerate() {
for (t, f) in recent.iter().enumerate() {
out[ci * TIME_STEPS + t] = f.get(*k).copied().unwrap_or(0.0) as f32;
for (ci, pick) in picks.iter().enumerate() {
match pick {
Some(k) => {
for (t, f) in recent.iter().enumerate() {
out[ci * TIME_STEPS + t] = f.get(*k).copied().unwrap_or(0.0) as f32;
}
}
None => { /* zero-padded channel, already 0.0 from vec init */ }
}
}
Some(out)
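
The zero-padding behaviour introduced above can be shown in isolation. This is a sketch only: `pack_channels` is a hypothetical helper, the time dimension is taken from the number of frames passed in rather than from `TIME_STEPS`, and the NBVI scoring step is assumed to have already produced `picks`.

```rust
/// Pack selected subcarriers into a channel-major buffer
/// (`out[channel * time_steps + t]`). A `None` pick marks a dead channel,
/// which stays at 0.0 instead of duplicating another subcarrier.
/// (Hypothetical helper; not the signature used in wiflow_v1.rs.)
fn pack_channels(picks: &[Option<usize>], frames: &[Vec<f64>]) -> Vec<f32> {
    let time_steps = frames.len();
    // Every slot starts at zero, so dead channels need no further work.
    let mut out = vec![0.0f32; picks.len() * time_steps];
    for (ci, pick) in picks.iter().enumerate() {
        if let Some(k) = pick {
            for (t, frame) in frames.iter().enumerate() {
                // A frame shorter than `k` also contributes 0.0.
                out[ci * time_steps + t] = frame.get(*k).copied().unwrap_or(0.0) as f32;
            }
        }
    }
    out
}
```

With two frames `[1.0, 2.0]` and `[3.0, 4.0]` and picks `[Some(1), None]`, the first channel holds subcarrier 1 over time and the second channel is all zeros.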

@@ -122,9 +122,30 @@ fn test_different_nodes_produce_different_frames() {
/// Send multiple frames from different nodes to a UDP port.
/// This test verifies the packet format is accepted by a real server
/// if one is running, but doesn't fail if no server is available.
///
/// ADR-117: previously this test sent to `127.0.0.1:5005` unconditionally,
/// hitting any live server on the same port. With `node_ids = [1,2,3,5,7]`
/// × 10 frames + 5 vitals it injected 55 spurious node_ids into the
/// server's NODE_ADDRS — the keepalive task then spawned one `ping` child
/// process per unique nid, accumulating 250+ ping zombies in production.
/// Mitigation is two-layered: server now filters loopback at the UDP
/// receiver, AND this test refuses to fire if anything is already bound
/// to 127.0.0.1:5005.
#[test]
fn test_multi_node_udp_send() {
// Try to bind to a random port and send to localhost:5005
// ADR-117 guard: if some other process is bound to 127.0.0.1:5005 (most
// commonly a live sensing-server during dev), skip the send so we don't
// pollute that process's state. The bind probe is the cheapest signal —
// if we can bind even briefly, nobody owns the port; if not, abort.
match UdpSocket::bind("127.0.0.1:5005") {
Ok(probe) => drop(probe),
Err(_) => {
eprintln!("test_multi_node_udp_send: 127.0.0.1:5005 already in use — skipping (ADR-117)");
return;
}
};
// Try to bind to a random port and send to localhost:5005.
// This is a smoke test — it verifies frames can be sent without panic.
let sock = UdpSocket::bind("0.0.0.0:0").expect("bind");
sock.set_write_timeout(Some(Duration::from_millis(100))).ok();