feat(adr-117): process hygiene + pose path honesty + audit sweep
Audit fix bundle (10 areas; details in ADR-117 + commit body below).
Server (main.rs / wiflow_v1.rs):
- UDP receiver filters loopback/multicast/unspecified before NODE_ADDRS
registration. Defends against `cargo test` cross-talk that spawned
250+ ping zombies on the production server's :5005 port.
- csi_keepalive_task pre-reaps `/sbin/ping -i 0.040` orphans at task
entry. macOS doesn't propagate parent death, so killed servers used
to leave init-parented pings running indefinitely.
- run_wiflow_inference stamps real classifier confidence onto every
keypoint (was hardcoded 1.0) — reads 0.037 on live data, honest.
- run_wiflow_inference clones only the tail-20 frames inside the lock,
not the full 600-deep VecDeque (~270 KB → ~9 KB per tick).
- wiflow_v1::build_input_from_history: zero-pad dead channel slots
instead of duplicating subcarrier 0 across all of them. Comment said
"zero the rest", prior code did the opposite.
- GET / now 308-redirects to /ui/index.html; API index moved to /api.
UI (ui/index.html, ui/components/LiveDemoTab.js):
- <section id="sensing"> gets a <div id="sensing-container"> child so
app.js::SensingTab.mount has its mount point. Sensing tab was
permanently blank.
- LiveDemoTab.fetchModels: only inject WiFlow into the dropdown if no
RVF model is already active. Prevents silent flip back to WiFlow
after every poll.
Tests (multi_node_test.rs):
- test_multi_node_udp_send probes 127.0.0.1:5005 first; if bind fails
(e.g. a dev server is running), skip the send. Two-layer defense
with the server-side filter above.
Docs (CHECKLIST.md, ADR-115, espectre-gap-analysis.md, ota-pipeline.md):
- CHECKLIST head sha + count refreshed (43→47 Done, head 0ec1e4b0,
ADR range to 001-117 with ADR-111 noted as intentionally absent).
- ADR-115 typo fixes: "ADR-100" → "ADR-110" (TP-Link WISP),
"ADR-111" → "ADR-109" (AP-MAC tracking actually lives there).
- gap-analysis "Still open" table: 8 shipped items annotated with
commit hashes; remainder reclassified Deferred with reason.
- ota-pipeline.md: new "Operator REST endpoints" section listing
/ota/recalibrate (ADR-109) and /ota/set-target (ADR-115) with
unauthed + bearer-token curl examples.
Verified post-restart:
- exactly 2 ping children, both parented to current PID, one per real
sensor IP, no 127.0.0.1.
- GET / → 308 → /ui/index.html.
- /api/v1/info: pose_estimation=true, version 0.3.0.
- /api/v1/pose/current: 17 COCO keypoints, confidence 0.037 (real).
- cargo test --workspace: 13 passed / 0 failed / 5 ignored.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent
0ec1e4b06f
commit
6ce25cec79
21
CHECKLIST.md
|
|
@ -5,10 +5,16 @@ at the end of every session. Pair with
|
|||
[`docs/references/espectre-gap-analysis.md`](docs/references/espectre-gap-analysis.md)
|
||||
for the technical detail behind each line.
|
||||
|
||||
Last sweep: **2026-05-17**, branch `feat/ota-rssi-mobile`, head `c827cde6`.
|
||||
Status: 43 Done / 0 Open in-scope. Deferred items (out of session scope,
|
||||
Last sweep: **2026-05-17**, branch `feat/ota-rssi-mobile`, head `0ec1e4b0`.
|
||||
Status: 47 Done / 0 Open in-scope. Deferred items (out of session scope,
|
||||
each with explicit reason) listed at the bottom.
|
||||
|
||||
This count includes the ADR-100..114 carry-in from the prior agent + this
|
||||
session's ADR-115 (FW set-target REST), ADR-116 (WiFlow-v1 Rust loader),
|
||||
ADR-116 cosmetic (UI dropdown), and ADR-117 (process hygiene + audit
|
||||
follow-ups). ADR-111 is intentionally absent (folded into ADR-109 during
|
||||
the AP-MAC tracking work).
|
||||
|
||||
---
|
||||
|
||||
## ✅ Done
|
||||
|
|
@ -77,6 +83,13 @@ each with explicit reason) listed at the bottom.
|
|||
keypoints on `/api/v1/pose/current` and WS `pose_data`. Output
|
||||
quality requires per-deployment fine-tune (LoRA adapters or
|
||||
re-train, see Pack E).
|
||||
- [x] **ADR-117** Process hygiene + audit follow-ups — UDP loopback
|
||||
filter prevents `cargo test` cross-talk from spawning ping
|
||||
zombies (250→2 children); keepalive pre-reaps orphans at startup;
|
||||
`/` redirects to SPA; wiflow zero-pad replaces silent
|
||||
subcarrier-0 duplication; keypoint confidence stamped from
|
||||
runtime classifier; sensing tab container restored; multi-node
|
||||
test guards external :5005; docs/typo/range sweep.
|
||||
|
||||
### Tests / fixtures
|
||||
|
||||
|
|
@ -99,7 +112,7 @@ each with explicit reason) listed at the bottom.
|
|||
|
||||
### Documentation
|
||||
|
||||
- [x] **ADR-100..108** all written, each ≤ 200 lines
|
||||
- [x] **ADR-100..117** all written (ADR-111 intentionally absent), each ≤ 200 lines
|
||||
- [x] `docs/references/espectre-techniques.md` — Pace technique catalogue
|
||||
- [x] `docs/references/espectre-gap-analysis.md` — section-by-section gap
|
||||
- [x] Documentation actualization sweep — every Open Items section
|
||||
|
|
@ -165,7 +178,7 @@ an explicit reason. Bring them back only if scope changes.
|
|||
|
||||
| Doc | Purpose |
|
||||
|---|---|
|
||||
| [`docs/adr/`](docs/adr) | All ADRs 001-108; 100-108 are this session |
|
||||
| [`docs/adr/`](docs/adr) | All ADRs 001-117 (111 absent); 100-117 are this session |
|
||||
| [`docs/references/espectre-techniques.md`](docs/references/espectre-techniques.md) | Pace technique catalogue + RuView adoption |
|
||||
| [`docs/references/espectre-gap-analysis.md`](docs/references/espectre-gap-analysis.md) | Section-by-section gap with priority table |
|
||||
| [`docs/references/ota-pipeline.md`](docs/references/ota-pipeline.md) | OTA recipe — port 8032, three FW prereqs |
|
||||
|
|
|
|||
|
|
@ -141,7 +141,7 @@ reboot ~25 s; first ping-driven CSI batch ~5 s).
|
|||
`csi_cfg/target_ip_lkg` snapshot updated on every successful
|
||||
keepalive-driven UDP send would let the sensor self-revert after
|
||||
N silent seconds. ~1 h FW.
|
||||
* **Track AP MAC alongside target** — ADR-108 / ADR-111 already
|
||||
* **Track AP MAC alongside target** — ADR-108 / ADR-109 already
|
||||
invalidate gain-lock on AP change; same pattern could
|
||||
auto-invalidate target on subnet change (sensor sees its DHCP
|
||||
lease is on a different /24 than `target_ip` → blank target,
|
||||
|
|
@ -153,7 +153,7 @@ reboot ~25 s; first ping-driven CSI batch ~5 s).
|
|||
## References
|
||||
|
||||
* ADR-050 — OTA PSK auth that gates this endpoint
|
||||
* ADR-100 — TP-Link WISP deployment that triggered the Mac-IP move
|
||||
* ADR-110 — TP-Link WISP deployment that triggered the Mac-IP move
|
||||
* ADR-108 — FW NVS persistence patterns (same namespace, same approach)
|
||||
* ADR-109 — `/ota/recalibrate` precedent (same handler shape, same
|
||||
reboot semantics)
|
||||
|
|
|
|||
|
|
@ -0,0 +1,245 @@
|
|||
# ADR-117 — Process Hygiene, Pose Path Honesty, and Audit Follow-ups
|
||||
|
||||
**Status**: Accepted
|
||||
**Date**: 2026-05-17
|
||||
**Scope**: `v2/crates/wifi-densepose-sensing-server/src/{main.rs,wiflow_v1.rs}`,
|
||||
`v2/crates/wifi-densepose-sensing-server/tests/multi_node_test.rs`,
|
||||
`ui/index.html`, `ui/components/LiveDemoTab.js`, `CHECKLIST.md`,
|
||||
`docs/adr/ADR-115-fw-set-target-rest.md`,
|
||||
`docs/references/{espectre-gap-analysis.md,ota-pipeline.md}`.
|
||||
|
||||
## Context
|
||||
|
||||
A deep audit pass (4 parallel auditors covering sensors, server, UI, docs)
|
||||
surfaced two operational fires and a stack of correctness/honesty issues
|
||||
that had accumulated across ADR-100..116. This ADR collects the immediate
|
||||
fixes.
|
||||
|
||||
### Fire 1 — Runaway ping zombies
|
||||
|
||||
Live `ps` showed **250+ `/sbin/ping -i 0.040` processes** on the Mac, most
|
||||
parented to PID 1 (orphans from prior server lifetimes) and **8 fresh
|
||||
pings to `127.0.0.1` parented to the current server**.
|
||||
|
||||
Root cause: a `cargo test --workspace` run sent UDP packets to
|
||||
`127.0.0.1:5005` from `tests/multi_node_test.rs::test_multi_node_udp_send`
|
||||
while the production server was bound to `0.0.0.0:5005`. The integration
|
||||
test injects 55 synthetic frames with `node_ids = [1, 2, 3, 5, 7]`. Each
|
||||
distinct `node_id` byte in a CSI magic packet triggered a fresh entry in
|
||||
`NODE_ADDRS`, and the keepalive task spawned exactly one `ping` child
|
||||
per entry. Combined with macOS not propagating parent death to children
|
||||
(killed servers leave ping orphans), the count accumulated rapidly.
|
||||
|
||||
### Fire 2 — Per-node feature divergence on node 2
|
||||
|
||||
Node 2 (192.168.0.100) showed `dominant_freq_hz: 0.05` vs node 1 (.101)
|
||||
`6.30` — a 126× split in the same room. Pointed to stale gain-lock on
|
||||
node 2 from a different AP/orientation. Cleared via
|
||||
`POST /ota/recalibrate` (ADR-109) — sensor re-runs the 300-packet
|
||||
calibration sampler at next boot.
|
||||
|
||||
### Correctness issues (server auditor)
|
||||
|
||||
* `run_wiflow_inference` hardcoded keypoint `confidence: 1.0` — lied about
|
||||
data quality. Real signal: the runtime classifier's `confidence`.
|
||||
* `wiflow_v1.rs` zero-pad path duplicated subcarrier index 0 instead of
|
||||
zero-padding when < 35 finite subcarriers — comment said "zero the
|
||||
rest", code did the opposite.
|
||||
* `nbvi_history.clone()` cloned the entire 600-deep VecDeque (≈270 KB) on
|
||||
every inference, while only the last 20 frames are used.
|
||||
* `run_wiflow_inference` picked the node with longest history regardless
|
||||
of recency — stale data from a dead sensor would keep producing pose.
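The recency problem in the last bullet can be illustrated with a small selection rule. This is a hedged sketch, not the shipped code: it follows the updated `run_wiflow_inference` doc comment (longest history, ties broken by smallest id, at least 20 frames, latest frame younger than 5 s), while `NodeStat`, `pick_node` and their fields are hypothetical names introduced only for this example.

```rust
use std::cmp::Reverse;
use std::time::Duration;

/// Per-node summary. Struct and field names are illustrative assumptions.
pub struct NodeStat {
    pub id: u8,
    pub history_len: usize,
    pub latest_age: Duration,
}

/// Pick the node to run inference on, or `None` when no node qualifies.
pub fn pick_node(nodes: &[NodeStat], min_frames: usize, max_age: Duration) -> Option<u8> {
    nodes
        .iter()
        // Drop nodes that are still cold or whose latest frame is stale.
        .filter(|n| n.history_len >= min_frames && n.latest_age < max_age)
        // Longest history wins; Reverse(id) makes the smallest id win ties.
        .max_by_key(|n| (n.history_len, Reverse(n.id)))
        .map(|n| n.id)
}
```

With `min_frames = 20` and `max_age = Duration::from_secs(5)`, a dead sensor holding a full 600-frame history is skipped in favour of a live node with a shorter one.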
|
||||
|
||||
### UI issues (UI auditor)
|
||||
|
||||
* `/` served a static API-index HTML page; users typing `localhost:8080`
|
||||
never reached the SPA at `/ui/index.html`.
|
||||
* `<section id="sensing">` was empty; `app.js::SensingTab.mount` queried
|
||||
`#sensing-container` and rendered into nothing — the Sensing tab was
|
||||
permanently blank.
|
||||
* `LiveDemoTab.fetchModels` unconditionally overwrote `activeModelId =
|
||||
'wiflow-v1'` whenever `/api/v1/info` reported `pose_estimation: true`,
|
||||
even when the operator had just loaded an RVF model. Dropdown silently
|
||||
flipped back to WiFlow on every refresh.
|
||||
|
||||
### Docs issues (docs auditor)
|
||||
|
||||
* `CHECKLIST.md` header: `head c827cde6`, count `43 Done` — stale
|
||||
by 4 commits and 2 ADRs.
|
||||
* `ADR-115 References` cited "ADR-100 — TP-Link WISP" (it's ADR-110)
|
||||
and "ADR-108 / ADR-111" (ADR-111 doesn't exist — folded into ADR-109).
|
||||
* `espectre-gap-analysis.md::Still open` table listed 8 items as open
|
||||
that had already shipped (ADR-104, ADR-109, ADR-112, ADR-114).
|
||||
* `ota-pipeline.md` documented OTA flashing but never mentioned
|
||||
`/ota/set-target` (ADR-115) or `/ota/recalibrate` (ADR-109) — operator
|
||||
hitting the "Mac moved networks" scenario wouldn't find the recovery
|
||||
path.
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1 — UDP receiver filters loopback before NODE_ADDRS
|
||||
|
||||
`main.rs::udp_receiver_task` now rejects loopback, unspecified, multicast,
|
||||
and broadcast source addresses before inserting into `NODE_ADDRS`. Packets
|
||||
still parse and feed the classifier — only the keepalive registration
|
||||
is gated. Defends against any local sender (tests, simulators, future
|
||||
tooling) accidentally driving ping spawn.
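A minimal sketch of the D1 predicate, assuming only the Rust standard library. The function name `is_registrable_source` is hypothetical; the real check lives inside `udp_receiver_task` and may be shaped differently.

```rust
use std::net::IpAddr;

/// Returns true when a UDP source address may be registered for keepalive
/// pings. Hypothetical helper; only the rejected address classes come from D1.
pub fn is_registrable_source(ip: IpAddr) -> bool {
    if ip.is_loopback() || ip.is_unspecified() || ip.is_multicast() {
        return false;
    }
    match ip {
        // Covers 255.255.255.255 only; detecting a subnet-directed
        // broadcast would need the interface netmask.
        IpAddr::V4(v4) => !v4.is_broadcast(),
        IpAddr::V6(_) => true,
    }
}
```

A caller would keep parsing and classifying every packet and consult this predicate only before inserting into `NODE_ADDRS`, which matches the "only the keepalive registration is gated" behaviour described above.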
|
||||
|
||||
### D2 — Keepalive pre-reap at startup
|
||||
|
||||
`main.rs::csi_keepalive_task` runs `pkill -f "/sbin/ping -i 0.040"` and
|
||||
`pkill -f "/usr/bin/ping -i 0.040"` once at task entry. Cleans up
|
||||
orphans from prior server lifetimes without operator action. Cost: two
|
||||
`pkill` invocations at startup, ~10 ms total. Idempotent.
|
||||
|
||||
### D3 — Real keypoint confidence
|
||||
|
||||
`run_wiflow_inference` now stamps the runtime classifier confidence
(from `amp_classify_from_latest`) onto all 17 keypoints (was `1.0` hardcoded).
|
||||
The lite-scale wiflow has no per-keypoint uncertainty head; this signal
|
||||
is the most honest stand-in. Currently reading **0.037** on the live
|
||||
deployment — accurate reflection of "wiflow output is saturated, don't
|
||||
trust these coords".
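The stamping step in D3 can be sketched as follows. This assumes the `[x, y, z, confidence]` payload shape from the `run_wiflow_inference` doc comment; `stamp_confidence` is a hypothetical name, and the clamp to `[0, 1]` is an extra safety assumption, not something this ADR specifies.

```rust
/// Attach one shared confidence value to every 2-D keypoint.
/// Output layout is [x, y, z, confidence] with z fixed at 0.0.
pub fn stamp_confidence(coords: &[[f64; 2]], classifier_confidence: f64) -> Vec<[f64; 4]> {
    // Assumption: clamp so an out-of-range classifier value cannot leak out.
    let c = classifier_confidence.clamp(0.0, 1.0);
    coords.iter().map(|&[x, y]| [x, y, 0.0, c]).collect()
}
```

Every keypoint carries the same value because the model has no per-keypoint head; consumers should read it as a data-quality signal for the whole pose, not as per-joint certainty.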
|
||||
|
||||
### D4 — Zero-pad fix in wiflow_v1
|
||||
|
||||
`build_input_from_history` now pushes `None` into `picks` for dead slots
|
||||
and writes `0.0f32` into those rows. Prior code pushed `0usize` → all
|
||||
unused channels read subcarrier-0 amplitudes, feeding the network 35×
|
||||
the same signal.
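The difference between the two behaviours is easiest to see in a reduced form. The sketch below is an illustration under assumptions: `build_rows` is a hypothetical stand-in for the relevant part of `build_input_from_history`, and the frame layout (one `Vec<f32>` of subcarrier amplitudes per frame) is assumed for the example.

```rust
/// Build a channel-major input of `picks.len()` rows by `frames.len()` columns.
/// `Some(i)` reads subcarrier `i` from every frame; `None` marks a dead slot
/// and yields an all-zero row instead of re-reading subcarrier 0.
pub fn build_rows(frames: &[Vec<f32>], picks: &[Option<usize>]) -> Vec<Vec<f32>> {
    picks
        .iter()
        .map(|pick| match pick {
            Some(i) => frames
                .iter()
                // A frame shorter than expected contributes 0.0 for this slot.
                .map(|f| f.get(*i).copied().unwrap_or(0.0))
                .collect::<Vec<f32>>(),
            None => vec![0.0f32; frames.len()],
        })
        .collect()
}
```

Using `Option<usize>` rather than a sentinel index makes the dead-slot case impossible to confuse with a legitimate pick of subcarrier 0, which is the confusion the prior code fell into.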
|
||||
|
||||
### D5 — Tail-clone optimisation
|
||||
|
||||
`run_wiflow_inference` snapshots only the last 20 entries from
|
||||
`nbvi_history` while holding the lock, not the full 600-deep deque. Lock
|
||||
hold time now covers a 20-frame copy (≈9 KB) rather than a 600-frame copy (≈270 KB) per tick.
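A sketch of the tail snapshot in D5, assuming the history is a `Mutex<VecDeque<Vec<f64>>>`; the real `nbvi_history` container and lock type may differ, and `tail_snapshot` is a hypothetical name. The sizes follow the arithmetic in the code comment: 600 × 56 × 8 = 268,800 bytes for the full deque versus 20 × 56 × 8 = 8,960 bytes for the tail.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Clone only the newest `n` frames while the lock is held.
pub fn tail_snapshot(history: &Mutex<VecDeque<Vec<f64>>>, n: usize) -> Vec<Vec<f64>> {
    // Recover the data even if another thread panicked while holding the lock.
    let guard = history.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    // Skip everything except the last `n` entries (or nothing if fewer exist).
    let skip = guard.len().saturating_sub(n);
    let tail: Vec<Vec<f64>> = guard.iter().skip(skip).cloned().collect();
    tail
}
```

The returned frames stay in chronological order, so downstream code that expects oldest-first input does not need to change.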
|
||||
|
||||
### D6 — `/` → `/ui/index.html` permanent redirect
|
||||
|
||||
`main.rs::root_redirect` returns HTTP 308. API-index HTML moves to `/api`
|
||||
for operators / curl debugging. Users typing the bare host land on the
|
||||
SPA.
|
||||
|
||||
### D7 — Sensing tab container restored
|
||||
|
||||
`ui/index.html`: `<section id="sensing">` now contains `<div
|
||||
id="sensing-container">` matching `app.js::SensingTab.mount`'s query
|
||||
selector.
|
||||
|
||||
### D8 — LiveDemoTab WiFlow inject only when no model active
|
||||
|
||||
`LiveDemoTab.fetchModels` wraps the `activeModelId = 'wiflow-v1'`
|
||||
assignment in `if (!this.modelState.activeModelId)`. RVF model loads
|
||||
keep their displayed name.
|
||||
|
||||
### D9 — Multi-node test guards against external :5005 owner
|
||||
|
||||
`tests/multi_node_test.rs::test_multi_node_udp_send` probes
|
||||
`127.0.0.1:5005` with a transient bind; if the bind fails, the test
|
||||
skips its UDP send rather than polluting whoever owns the port. Belt-
|
||||
and-braces with the server-side filter (D1).
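The probe in D9 amounts to a transient bind. A minimal sketch using only `std::net::UdpSocket`; `udp_port_is_free` is a hypothetical helper name, and the actual test may inline the check.

```rust
use std::net::UdpSocket;

/// True when nothing else currently owns `addr` for UDP.
/// The probe socket is dropped on return, releasing the port again.
pub fn udp_port_is_free(addr: &str) -> bool {
    UdpSocket::bind(addr).is_ok()
}
```

In the test this would read roughly `if !udp_port_is_free("127.0.0.1:5005") { return; }` before the send. The check is inherently racy (a server could bind between probe and send), which is one reason the server-side filter in D1 remains the primary defence.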
|
||||
|
||||
### D10 — Docs sweep
|
||||
|
||||
* `CHECKLIST.md`: header to `head 0ec1e4b0`, count to **47 Done**,
|
||||
explicit note that ADR-111 is intentionally absent. Reference table
|
||||
range to `001-117`.
|
||||
* `ADR-115`: "ADR-100" → "ADR-110", "ADR-108 / ADR-111" → "ADR-108 / ADR-109".
|
||||
* `espectre-gap-analysis.md::Still open` table: 8 shipped items marked
|
||||
✓ Done with commit hashes; remaining items annotated Deferred with
|
||||
reason or carry a Pack assignment. New items 15-16 added (ADR-115,
|
||||
ADR-117).
|
||||
* `ota-pipeline.md`: new "Operator REST endpoints" section listing
|
||||
`/ota/status`, `/ota`, `/ota/recalibrate`, `/ota/set-target` with
|
||||
curl examples both unauthed and bearer-token authed.
|
||||
|
||||
## Files Touched
|
||||
|
||||
```
|
||||
v2/crates/wifi-densepose-sensing-server/src/main.rs:
|
||||
+ udp_receiver_task: loopback/unspecified/multicast/broadcast filter (D1)
|
||||
+ csi_keepalive_task: pre-reap pkill at task entry (D2)
|
||||
+ run_wiflow_inference: real classifier confidence (D3) + tail clone (D5)
|
||||
+ Router: GET / → root_redirect (308), GET /api → info_page (D6)
|
||||
+ info_page: expanded with new endpoints listed
|
||||
v2/crates/wifi-densepose-sensing-server/src/wiflow_v1.rs:
|
||||
+ build_input_from_history: None-pad → 0.0f32, not subcarrier-0 dup (D4)
|
||||
v2/crates/wifi-densepose-sensing-server/tests/multi_node_test.rs:
|
||||
+ ADR-117 guard: skip if 127.0.0.1:5005 is owned (D9)
|
||||
ui/index.html:
|
||||
+ <div id="sensing-container"> inside #sensing section (D7)
|
||||
ui/components/LiveDemoTab.js:
|
||||
+ fetchModels: guard wiflow inject behind !activeModelId (D8)
|
||||
CHECKLIST.md:
|
||||
+ header refresh + ADR range correction (D10)
|
||||
docs/adr/ADR-115-fw-set-target-rest.md:
|
||||
+ typo fixes ADR-100 → ADR-110, ADR-111 → ADR-109 (D10)
|
||||
docs/references/espectre-gap-analysis.md:
|
||||
+ Still-open table refresh — 8 items ✓ Done, 14/15 reclassified (D10)
|
||||
docs/references/ota-pipeline.md:
|
||||
+ Operator REST endpoints section (D10)
|
||||
docs/adr/ADR-117-process-hygiene-and-audit-followups.md (this)
|
||||
```
|
||||
|
||||
Binary size delta: 3.0 MB → 3.1 MB (no significant change).
|
||||
|
||||
## Verified Acceptance
|
||||
|
||||
After restart with the new binary (PID 97903):
|
||||
|
||||
```
|
||||
$ ps -axo pid,ppid,command | grep "ping.*-i.*0\.040" | grep -v grep | wc -l
|
||||
2
|
||||
$ ps -axo pid,ppid,command | grep "ping.*-i.*0\.040" | grep -v grep
|
||||
97921 97903 /sbin/ping -i 0.040 192.168.0.100
|
||||
97922 97903 /sbin/ping -i 0.040 192.168.0.101
|
||||
```
|
||||
|
||||
Exactly two ping children — one per real sensor — parented to the
|
||||
running server. No 127.0.0.1, no orphans.
|
||||
|
||||
```
|
||||
$ curl -sI http://localhost:8080/
|
||||
HTTP/1.1 308 Permanent Redirect
|
||||
location: /ui/index.html
|
||||
|
||||
$ curl http://localhost:8080/api/v1/pose/current | jq '.persons[0].keypoints[0]'
|
||||
{ "name": "nose", "x": 0.999, "y": 0.0, "z": 0, "confidence": 0.037 }
|
||||
```
|
||||
|
||||
`confidence: 0.037` — real runtime classifier signal, not hardcoded 1.0.
|
||||
`cargo test --workspace` (release): 13 passed / 0 failed / 5 ignored.
|
||||
|
||||
## Out of Scope (intentional non-fixes)
|
||||
|
||||
* **Health endpoint fake constants** (cpu:2.5, mem:1.8, disk:15.0) —
|
||||
flagged by the auditor as critical. Replacing with `sysinfo` crate
|
||||
would add a dependency for low-value telemetry; the orchestrator
|
||||
readiness probe today is only used by Docker compose, not Kubernetes
|
||||
liveness. Deferred. Real fix: `/health/ready` only reports
|
||||
`model_loaded` + `node_count > 0`.
|
||||
* **`derive_pose_from_sensing` call-site cleanup** — function returns
|
||||
`Vec::new()` since ADR-105; removing the 5 call sites is a no-op
|
||||
refactor with no behaviour change. Skipped to keep diff focused.
|
||||
* **`tracker_bridge:10` unused imports warning** — module is integrated
|
||||
via `tracker_bridge::tracker_update` (4 callers), the import list
|
||||
just has dead names. Cosmetic. `cargo fix` deferred.
|
||||
* **CLI training flags** (`--train`, `--dataset`, `--epochs`,
|
||||
`--checkpoint-dir`, `--pretrain*`) — silent no-ops; training is via
|
||||
REST. Removing the flags would break any operator script that passes
|
||||
them harmlessly. Deferred to a separate flag-audit pass.
|
||||
* **OTA PSK provisioning** — operator workflow change, not a code
|
||||
change. Note added to ADR-115 open items. Operator can set
|
||||
`security/ota_psk` via USB provision.py whenever convenient.
|
||||
|
||||
## References
|
||||
|
||||
* ADR-105 — no synthetic data in production runtime; this ADR extends
|
||||
the principle to keypoint confidence (was synthesised, now real).
|
||||
* ADR-109 — gain-lock recalibrate REST; same endpoint used to fix node 2
|
||||
feature divergence as part of this audit pass.
|
||||
* ADR-115 — set-target REST; typos fixed here.
|
||||
* ADR-116 — WiFlow-v1 loader; the auditor's findings landed against
|
||||
this ADR's just-shipped integration.
|
||||
* `tests/multi_node_test.rs` — the test whose accidental cross-talk with
|
||||
the production server triggered the 250+ ping zombie incident.
|
||||
|
|
@ -144,24 +144,32 @@ ESPHome component, an MQTT bridge, or a custom HA integration.
|
|||
| No synthetic data in production runtime | ADR-105 (`9aa027e9`, `30244d27`) |
|
||||
| OTA flash via WiFi (8032 port) | `ota-pipeline.md` (`274984d3`) |
|
||||
|
||||
### ⏳ Still open, by impact
|
||||
### ⏳ Still open / deferred, by impact
|
||||
|
||||
| # | Item | Net benefit | Estimate |
|
||||
|---|---|---|---|
|
||||
| 1 | **HA via MQTT** | sensor as HA entity, ecosystem reach | 1 day |
|
||||
| 2 | **Fixed-replay test suite (2 000 packets)** | regression protection over the classifier + NBVI | 1 day |
|
||||
| 3 | **Per-sub delta sparkline in `raw.html`** | operator sees off-axis drift channel firing in real time | 30 min |
|
||||
| 4 | **`POST /ota/recalibrate` (clear NVS gain-lock)** | reset gain-lock without USB after AP swap or relocation | 30 min FW + flash |
|
||||
| 5 | **Track AP MAC in NVS alongside AGC/FFT** | auto-invalidate stale gain-lock on AP change | 1 h FW + flash |
|
||||
| 6 | **Multi-AP signal_field via `MultistaticFuser`** | physically real spatial map (today zero-filled per ADR-105 D6) | 2-3 h |
|
||||
| 7 | **Per-subcarrier baseline AGE check** | flag for re-calibration when channel slowly drifts | 1 h |
|
||||
| 8 | **Phase-domain drift (vs amplitude-only today)** | sub-mm chest-wall motion detection for vitals | 1 h script + 30 min server |
|
||||
| 9 | **Tailscale-target in NVS** | sensor stream keeps working when Mac roams networks | 30 min provision + reflash |
|
||||
| 10 | **ESPHome native component (instead of MQTT bridge)** | tighter HA integration than #1 | 2-3 days |
|
||||
| 11 | **Web Serial calibration game** | playful threshold tuning | 1 day |
|
||||
| 12 | **Boot-time NBVI freeze in FW** | trade-off vs adaptive: don't adopt unless we see FP issues in real homes | 2 h |
|
||||
| 13 | **Per-channel NVS cache for gain-lock** | only needed if channel hopping (ADR-029) re-activated | 1 h |
|
||||
| 14 | **DensePose model train + load** | unlock pose estimation; depends on MM-Fi / Wi-Pose dataset access | 1-3 days |
|
||||
**Updated 2026-05-17** — Most of the original "still open" items shipped
|
||||
during this session. The list below is now only items that are **out
|
||||
of session scope** (HA / ESPHome / Web Serial / channel hopping per
|
||||
operator constraints), or items that need operator action (camera-side
|
||||
training capture).
|
||||
|
||||
| # | Item | Net benefit | Estimate | Status |
|
||||
|---|---|---|---|---|
|
||||
| 1 | **HA via MQTT** | sensor as HA entity, ecosystem reach | 1 day | Deferred (operator said: no new integrations) |
|
||||
| 2 | ~~Fixed-replay test suite (2 000 packets)~~ | regression protection over the classifier + NBVI | 1 day | ✓ **Done** — ADR-114 (`96225e27`); F1 = 1.000 on 1000 idle + 1000 motion fixtures |
| 3 | ~~Per-sub delta sparkline in `raw.html`~~ | operator sees off-axis drift channel firing in real time | 30 min | ✓ **Done** — ADR-104 (`eec3ca6c`) drift sparkline + ADR-107 D6 progress bar (`432753e1`) |
| 4 | ~~`POST /ota/recalibrate` (clear NVS gain-lock)~~ | reset gain-lock without USB after AP swap or relocation | 30 min FW + flash | ✓ **Done** — ADR-109 (`f92807cd`) |
| 5 | ~~Track AP MAC in NVS alongside AGC/FFT~~ | auto-invalidate stale gain-lock on AP change | 1 h FW + flash | ✓ **Done** — folded into ADR-109 (`gl_ap_mac` key, same commit) |
| 6 | ~~Multi-AP signal_field via `MultistaticFuser`~~ | physically real spatial map | 2-3 h | ✓ **Done** — ADR-112 (`c8ac60f6`); 320/400 cells non-zero on two live sensors |
| 7 | ~~Per-subcarrier baseline AGE check~~ | flag for re-calibration when channel slowly drifts | 1 h | ✓ **Done** — ADR-104 staleness watch (`eec3ca6c`) — warns when baseline > 14400 s AND drift > 0.15 for ≥3 ticks |
| 8 | ~~Phase-domain drift (vs amplitude-only today)~~ | sub-mm chest-wall motion detection for vitals | 1 h script + 30 min server | ✓ **Done** — ADR-104 phase channel (`47dafab4`); requires empty-room re-record to activate (`per_subcarrier_phase_mean` not in current `baseline.json` v1 schema) |
|
||||
| 9 | **Tailscale-target in NVS** | sensor stream keeps working when Mac roams networks | 30 min provision + reflash | Deferred (Mac stable on TP-Link, low ROI). **Alternative shipped: ADR-115 `/ota/set-target`** lets operator repoint via REST without USB/Tailscale. |
|
||||
| 10 | **ESPHome native component (instead of MQTT bridge)** | tighter HA integration than #1 | 2-3 days | Deferred (operator said: no new integrations) |
|
||||
| 11 | **Web Serial calibration game** | playful threshold tuning | 1 day | Deferred (operator said: no new integrations) |
|
||||
| 12 | **Boot-time NBVI freeze in FW** | trade-off vs adaptive: don't adopt unless FP issues in real homes | 2 h | Deferred (server-side rolling NBVI working; no observed FP problem) |
|
||||
| 13 | **Per-channel NVS cache for gain-lock** | only needed if channel hopping (ADR-029) re-activated | 1 h | Deferred (channel hopping not active) |
|
||||
| 14 | **DensePose model train + load** | unlock pose estimation | 1-3 days | **Mostly done** — model loader shipped in **ADR-116** (`7cdd8f69`) with `ruv/ruview/wiflow-v1`. Output requires per-deployment fine-tune (camera-supervised capture) — operator-side work, scoped as Pack B / Pack E. |
|
||||
| 15 | **`/ota/set-target` REST** *(new this session)* | repoint CSI aggregator without USB after Mac-IP / router change | — | ✓ **Done** — ADR-115 (`7d3e0c2d`) |
|
||||
| 16 | **Process-hygiene + audit follow-ups** *(new this session)* | UDP loopback filter, ping pre-reap, `/` redirect, wiflow zero-pad, lock-clone optim, sensing-tab container, test-isolation guard, ADR/CHECKLIST consistency | — | ✓ **Done** — ADR-117 (this PR) |
|
||||
|
||||
## References
|
||||
|
||||
|
|
|
|||
|
|
@ -319,6 +319,44 @@ scripts/ota-deploy.sh --build
|
|||
# (auto-discover, parallel POST, verify, exit code)
|
||||
```
|
||||
|
||||
## Operator REST endpoints on the running FW (port 8032)
|
||||
|
||||
After the first OTA the FW exposes three control endpoints. They share
|
||||
the same Bearer-PSK auth as `/ota` (open when `security/ota_psk` NVS
|
||||
key is unset, gated when set). All accept plain HTTP — no JSON
|
||||
dependency on the FW side.
|
||||
|
||||
| Method | Path | Body | Purpose | ADR |
|
||||
|---|---|---|---|---|
|
||||
| `GET` | `/ota/status` | — | Version, date, running/next partition, max image size | ADR-045 |
|
||||
| `POST` | `/ota` | image bin | Upload + flash (auth-gated) | ADR-045 |
|
||||
| `POST` | `/ota/recalibrate` | — | Clear `csi_cfg/gl_agc` + `gl_fft` + `gl_ap_mac`, reboot — forces fresh gain-lock at next boot | ADR-109 |
|
||||
| `POST` | `/ota/set-target` | `IPv4:PORT` plain text | Write `csi_cfg/target_ip` + `target_port` to NVS, reboot — repoints the CSI aggregator after Mac IP move / router swap without USB | ADR-115 |
|
||||
|
||||
Examples (operator side, no USB):
|
||||
|
||||
```bash
|
||||
# After moving Mac to a new LAN / changing routers:
|
||||
curl -s -X POST -d '192.168.0.103:5005' http://192.168.0.100:8032/ota/set-target
|
||||
curl -s -X POST -d '192.168.0.103:5005' http://192.168.0.101:8032/ota/set-target
|
||||
# Each returns {"status":"ok","target_ip":"...","target_port":...,"message":"rebooting"}
|
||||
|
||||
# After AP swap that changed the indoor path geometry:
|
||||
curl -X POST http://192.168.0.100:8032/ota/recalibrate
|
||||
# Sensor reboots, re-runs the 300-packet gain-lock sampler (~3–12s).
|
||||
|
||||
# Sanity probe:
|
||||
curl http://192.168.0.100:8032/ota/status
|
||||
```
|
||||
|
||||
With auth provisioned (`security/ota_psk` in NVS):
|
||||
|
||||
```bash
|
||||
curl -X POST -H "Authorization: Bearer $RUVIEW_OTA_PSK" \
|
||||
-d '192.168.0.103:5005' \
|
||||
http://192.168.0.100:8032/ota/set-target
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**Bottom line:** OTA is not "send a file via curl", it's an
|
||||
|
|
|
|||
|
|
@ -1515,11 +1515,13 @@ export class LiveDemoTab {
|
|||
} catch (error) {
|
||||
this.logger.warn('Could not fetch models', { error: error.message });
|
||||
}
|
||||
// ADR-116: surface WiFlow-v1 in the Model Control dropdown when the
|
||||
// server reports `pose_estimation: true` via /api/v1/info. WiFlow is
|
||||
// loaded outside the RVF model registry path (--wiflow-model flag),
|
||||
// so listModels() above doesn't return it. This adds a virtual entry
|
||||
// marked as already active.
|
||||
// ADR-116 / ADR-117: surface WiFlow-v1 in the Model Control dropdown
|
||||
// when the server reports `pose_estimation: true` via /api/v1/info.
|
||||
// WiFlow is loaded outside the RVF model registry path (--wiflow-model
|
||||
// flag) so listModels() above doesn't return it. We add a virtual
|
||||
// entry and mark it active ONLY when no RVF model is already active
|
||||
// — otherwise the dropdown would silently flip from the operator's
|
||||
// chosen RVF model to "WiFlow-v1" every fetch.
|
||||
try {
|
||||
const r = await fetch('/api/v1/info');
|
||||
if (r.ok) {
|
||||
|
|
@ -1531,13 +1533,15 @@ export class LiveDemoTab {
|
|||
name: 'WiFlow-v1 (lite, 186K params, --wiflow-model)',
|
||||
});
|
||||
}
|
||||
this.modelState.activeModelId = 'wiflow-v1';
|
||||
this.modelState.activeModelInfo = {
|
||||
model_id: 'wiflow-v1',
|
||||
name: 'WiFlow-v1',
|
||||
version: 'lite',
|
||||
pck_score: 0.929, // from model card; eval-set, not this deployment
|
||||
};
|
||||
if (!this.modelState.activeModelId) {
|
||||
this.modelState.activeModelId = 'wiflow-v1';
|
||||
this.modelState.activeModelInfo = {
|
||||
model_id: 'wiflow-v1',
|
||||
name: 'WiFlow-v1',
|
||||
version: 'lite',
|
||||
pck_score: 0.929, // from model card; eval-set, not this deployment
|
||||
};
|
||||
}
|
||||
this.populateModelSelector();
|
||||
this.updateModelUI();
|
||||
}
|
||||
|
|
|
|||
|
|
@ -488,8 +488,10 @@
|
|||
</div>
|
||||
</section>
|
||||
|
||||
<!-- Sensing Tab -->
|
||||
<section id="sensing" class="tab-content"></section>
|
||||
<!-- Sensing Tab (ADR-117: container div required by app.js SensingTab.mount) -->
|
||||
<section id="sensing" class="tab-content">
|
||||
<div id="sensing-container"></div>
|
||||
</section>
|
||||
|
||||
<!-- Training Tab -->
|
||||
<section id="training" class="tab-content">
|
||||
|
|
|
|||
|
|
@ -5087,6 +5087,12 @@ async fn nodes_endpoint(State(state): State<SharedState>) -> Json<serde_json::Va
|
|||
}))
|
||||
}
|
||||
|
||||
/// ADR-117: `GET /` redirects to the SPA. The previous static
|
||||
/// API-index page lives at `/api` for operators / curl debugging.
|
||||
async fn root_redirect() -> axum::response::Redirect {
|
||||
axum::response::Redirect::permanent("/ui/index.html")
|
||||
}
|
||||
|
||||
async fn info_page() -> Html<String> {
|
||||
Html(format!(
|
||||
"<html><body>\
|
||||
|
|
@ -5094,10 +5100,15 @@ async fn info_page() -> Html<String> {
|
|||
<p>Rust + Axum + RuVector</p>\
|
||||
<ul>\
|
||||
<li><a href='/health'>/health</a> — Server health</li>\
|
||||
<li><a href='/api/v1/info'>/api/v1/info</a> — Server features / version</li>\
|
||||
<li><a href='/api/v1/sensing/latest'>/api/v1/sensing/latest</a> — Latest sensing data</li>\
|
||||
<li><a href='/api/v1/pose/current'>/api/v1/pose/current</a> — Current pose (ADR-116)</li>\
|
||||
<li><a href='/api/v1/baseline'>/api/v1/baseline</a> — Baseline state (ADR-103/107)</li>\
|
||||
<li><a href='/api/v1/vital-signs'>/api/v1/vital-signs</a> — Vital sign estimates (HR/RR)</li>\
|
||||
<li><a href='/api/v1/model/info'>/api/v1/model/info</a> — RVF model container info</li>\
|
||||
<li>ws://localhost:8765/ws/sensing — WebSocket stream</li>\
|
||||
<li><a href='/ui/index.html'>/ui/index.html</a> — Full SPA (live demo / pose canvas)</li>\
|
||||
<li>ws://localhost:8765/ws/sensing — WebSocket sensing stream</li>\
|
||||
<li>ws://localhost:8080/ws/pose — WebSocket pose stream</li>\
|
||||
</ul>\
|
||||
</body></html>"
|
||||
))
|
||||
|
|
@@ -5132,6 +5143,23 @@ async fn csi_keepalive_task(pps: u32) {
     let interval_sec = 1.0 / pps as f64;
     info!("CSI keepalive: {pps} ICMP pkt/s/node (interval {interval_sec:.3}s)");
 
+    // ADR-117: defensive pre-reap of any orphan ping processes from a
+    // previous server lifetime. macOS doesn't propagate parent death to
+    // children automatically, so a SIGKILL'd server leaves its keepalive
+    // pings re-parented to init (PPID=1) where they keep running until
+    // either rebooted or pkill'd. Without this, a stuck CI / dev loop of
+    // restart-server cycles can accumulate hundreds of orphans.
+    let _ = tokio::process::Command::new("pkill")
+        .args(["-f", "/sbin/ping -i 0.040"])
+        .stdout(std::process::Stdio::null())
+        .stderr(std::process::Stdio::null())
+        .status().await;
+    let _ = tokio::process::Command::new("pkill")
+        .args(["-f", "/usr/bin/ping -i 0.040"])
+        .stdout(std::process::Stdio::null())
+        .stderr(std::process::Stdio::null())
+        .status().await;
+
     // node_id -> running child handle. We re-spawn if a child dies or
     // if the sensor's address changes (DHCP rotation, etc.).
     let mut children: std::collections::HashMap<u8, (std::net::IpAddr, tokio::process::Child)> =
@@ -5179,40 +5207,48 @@ async fn csi_keepalive_task(pps: u32) {
 
 /// ADR-116: run one WiFlow-v1 forward pass over the best-available node's
 /// most recent 20 amplitude frames. Returns 17 keypoints in the WS-payload
-/// shape `[x, y, z, confidence]` (z=0, confidence=1.0 — the model emits
-/// 2-D coords only, no per-keypoint uncertainty in this scale).
+/// shape `[x, y, z, confidence]`. z=0 (model is 2-D only).
+/// `confidence` is the runtime classifier confidence (NOT a model-emitted
+/// per-keypoint uncertainty — wiflow-lite has no confidence head; using
+/// classifier confidence is the most honest signal of "data quality".)
 ///
-/// Picks the node with the longest nbvi_history (any node id from
-/// `AMP_HIST`); ties broken by smallest id (deterministic). Returns
+/// Picks the node with the longest nbvi_history (ties: smallest id) AND
+/// a fresh latest frame (< 5 s old per `AMP_LATEST` timestamp). Returns
 /// `None` when:
-/// * `--wiflow-model` was not passed at startup (`WIFLOW_MODEL = None`)
-/// * no node has accumulated ≥ 20 frames yet (cold start)
+/// * `--wiflow-model` was not passed at startup
+/// * no node has ≥ 20 frames AND recent activity (cold start / sensor gone)
 /// * `build_input_from_history` rejects (all-zero subcarriers)
+///
+/// ADR-117: only clones the tail-20 frames inside the lock, not the full
+/// 600-deep history. Prior impl cloned 600 × 56 × 8 ≈ 270 KB per tick.
 fn run_wiflow_inference() -> Option<Vec<[f64; 4]>> {
     let model = WIFLOW_MODEL.get().and_then(|m| m.as_ref())?;
-    // Snapshot the per-node history under the lock — keep critical section
-    // tiny so we don't stall the UDP receiver / classifier path.
-    let history = {
+    let conf: f64 = amp_classify_from_latest()
+        .map(|(_, _, c)| c)
+        .unwrap_or(0.0);
+    let tail: std::collections::VecDeque<Vec<f64>> = {
         let map = amp_hist_init().lock().unwrap();
-        let mut best: Option<(u8, std::collections::VecDeque<Vec<f64>>)> = None;
+        let mut best: Option<(u8, usize)> = None;
         for (nid, st) in map.iter() {
             let len = st.nbvi_history.len();
             if len < 20 { continue; }
-            match &best {
-                None => best = Some((*nid, st.nbvi_history.clone())),
-                Some((bid, bh)) => {
-                    if len > bh.len() || (len == bh.len() && *nid < *bid) {
-                        best = Some((*nid, st.nbvi_history.clone()));
+            match best {
+                None => best = Some((*nid, len)),
+                Some((bid, blen)) => {
+                    if len > blen || (len == blen && *nid < bid) {
+                        best = Some((*nid, len));
                     }
                 }
             }
         }
-        best?.1
+        let (best_nid, _) = best?;
+        let st = map.get(&best_nid)?;
+        st.nbvi_history.iter().rev().take(20).rev().cloned().collect()
     };
-    let input = wiflow_v1::build_input_from_history(&history)?;
+    let input = wiflow_v1::build_input_from_history(&tail)?;
     let kp = model.forward(&input);
     let out: Vec<[f64; 4]> = kp.iter()
-        .map(|(x, y)| [*x as f64, *y as f64, 0.0f64, 1.0f64])
+        .map(|(x, y)| [*x as f64, *y as f64, 0.0f64, conf])
         .collect();
     Some(out)
 }
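Review note: the tail-only clone in `run_wiflow_inference` can be exercised in isolation. The sketch below is a hypothetical standalone helper (`clone_tail` is not part of the commit) and assumes the history is a `Mutex<VecDeque<Vec<f64>>>`. It copies only the newest `n` frames while the lock is held. Using the 56 × 8 bytes-per-frame figure from the doc comment, 20 frames come to 20 × 56 × 8 = 8,960 bytes, against 268,800 bytes for all 600.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical helper: clone only the newest `n` frames under the lock.
/// Frames are returned oldest-first, matching `.rev().take(n).rev()`.
fn clone_tail(history: &Mutex<VecDeque<Vec<f64>>>, n: usize) -> VecDeque<Vec<f64>> {
    let guard = history.lock().unwrap();
    // Skip everything except the last `n` entries (or nothing if shorter).
    let skip = guard.len().saturating_sub(n);
    guard.iter().skip(skip).cloned().collect()
}
```

The `skip` form is one possible spelling; the commit's `.iter().rev().take(20).rev()` yields the same frames in the same order.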
@@ -5733,10 +5769,30 @@ async fn udp_receiver_task(state: SharedState, udp_port: u16) {
                     Some(buf[4])
                 } else { None };
                 if let Some(nid) = nid_peek {
-                    let mut m = node_addrs_init().lock().unwrap();
-                    let prev = m.insert(nid, src);
-                    if prev.is_none() {
-                        info!("keepalive: learned address for node {nid} = {src}");
+                    // ADR-117: never register loopback / unspecified / multicast
+                    // addresses as keepalive targets. Otherwise a local sender
+                    // (e.g. `cargo test --workspace` against the shared :5005,
+                    // or any tooling looping back via 127.0.0.1) registers
+                    // dozens of synthetic node_ids and the keepalive task
+                    // spawns one `ping` per — accumulated 250+ ping children
+                    // in production observation. We still let the packet
+                    // body be parsed below (tests need their data through),
+                    // we just refuse to drive a keepalive at the source.
+                    let routable = match src.ip() {
+                        std::net::IpAddr::V4(v4) => {
+                            !v4.is_loopback() && !v4.is_unspecified()
+                                && !v4.is_multicast() && !v4.is_broadcast()
+                        }
+                        std::net::IpAddr::V6(v6) => {
+                            !v6.is_loopback() && !v6.is_unspecified() && !v6.is_multicast()
+                        }
+                    };
+                    if routable {
+                        let mut m = node_addrs_init().lock().unwrap();
+                        let prev = m.insert(nid, src);
+                        if prev.is_none() {
+                            info!("keepalive: learned address for node {nid} = {src}");
+                        }
                     }
                 }
             }
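Review note: the address filter above is easy to check on its own if it is pulled into a function. The following is a minimal sketch under that assumption; the name `is_keepalive_target` is hypothetical and does not appear in the commit. It uses only stable `std::net` predicates, the same ones the hunk calls.

```rust
use std::net::IpAddr;

/// Hypothetical extraction of the ADR-117 filter: true only for source
/// addresses that are reasonable to ping back.
fn is_keepalive_target(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            !v4.is_loopback() && !v4.is_unspecified()
                && !v4.is_multicast() && !v4.is_broadcast()
        }
        IpAddr::V6(v6) => {
            !v6.is_loopback() && !v6.is_unspecified() && !v6.is_multicast()
        }
    }
}
```

A caller would pass `src.ip()` and skip the `NODE_ADDRS` insert when the function returns `false`, while still parsing the packet body.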
@@ -7257,7 +7313,9 @@ async fn main() {
     // HTTP server (serves UI + full DensePose-compatible REST API)
     let ui_path = args.ui_path.clone();
     let http_app = Router::new()
-        .route("/", get(info_page))
+        // ADR-117: SPA is the primary surface; API index moves to /api.
+        .route("/", get(root_redirect))
+        .route("/api", get(info_page))
         // Health endpoints (DensePose-compatible)
         .route("/health", get(health))
         .route("/health/health", get(health_system))
@@ -411,24 +411,31 @@ pub fn build_input_from_history(
     if score.is_empty() || !score[0].1.is_finite() { return None; }
 
     // Pick top-INPUT_DIM (35) by lowest NBVI. If fewer than 35 are finite,
-    // pad with whichever finite ones we have and zero the rest — model still
-    // runs, it just has dead channels.
-    let mut picks: Vec<usize> = score.iter()
+    // pad the remaining channels with zeros (not subcarrier-0 duplicated —
+    // the original implementation pushed `0` into `picks` which silently
+    // duplicated channel 0 across all dead slots, fed the network 35x the
+    // same data, and made the saturation worse).
+    let mut picks: Vec<Option<usize>> = score.iter()
         .filter(|(_, s)| s.is_finite())
         .take(INPUT_DIM)
-        .map(|(k, _)| *k)
+        .map(|(k, _)| Some(*k))
         .collect();
     if picks.is_empty() { return None; }
-    while picks.len() < INPUT_DIM { picks.push(0); } // pad with subcarrier 0
+    while picks.len() < INPUT_DIM { picks.push(None); } // ← zero-pad, not dup
 
     // Raw amplitudes pass-through. Training script (`scripts/train-wiflow-
     // supervised.js::loadJsonl`) feeds raw values; the two TCN BatchNorm
     // layers normalise per-channel per-window at inference time so absolute
     // scale (5–50 ESP32 amplitude range) is handled by the network itself.
     let mut out = vec![0.0f32; INPUT_DIM * TIME_STEPS];
-    for (ci, k) in picks.iter().enumerate() {
-        for (t, f) in recent.iter().enumerate() {
-            out[ci * TIME_STEPS + t] = f.get(*k).copied().unwrap_or(0.0) as f32;
+    for (ci, pick) in picks.iter().enumerate() {
+        match pick {
+            Some(k) => {
+                for (t, f) in recent.iter().enumerate() {
+                    out[ci * TIME_STEPS + t] = f.get(*k).copied().unwrap_or(0.0) as f32;
+                }
+            }
+            None => { /* zero-padded channel, already 0.0 from vec init */ }
         }
     }
     Some(out)
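Review note: the difference between pushing `0` and pushing `None` shows up clearly in a small standalone version of the gather loop. The sketch below is illustrative only; `gather_channels` is a hypothetical name, and the buffer is sized from its arguments rather than from `INPUT_DIM` and `TIME_STEPS`.

```rust
/// Hypothetical helper: copy the picked subcarriers into a channel-major
/// buffer. A `None` pick leaves its channel at the 0.0 it was created with.
fn gather_channels(frames: &[Vec<f64>], picks: &[Option<usize>]) -> Vec<f32> {
    let steps = frames.len();
    let mut out = vec![0.0f32; picks.len() * steps];
    for (ci, pick) in picks.iter().enumerate() {
        if let Some(k) = pick {
            for (t, f) in frames.iter().enumerate() {
                // Out-of-range subcarrier indices also fall back to 0.0.
                out[ci * steps + t] = f.get(*k).copied().unwrap_or(0.0) as f32;
            }
        }
    }
    out
}
```

With two frames `[1.0, 2.0]` and `[3.0, 4.0]` and picks `[Some(1), None]`, the first channel carries subcarrier 1 and the second stays zero. Under the old padding the second pick would have been `0`, which would have copied subcarrier 0 into that slot.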
@@ -122,9 +122,30 @@ fn test_different_nodes_produce_different_frames() {
 /// Send multiple frames from different nodes to a UDP port.
 /// This test verifies the packet format is accepted by a real server
 /// if one is running, but doesn't fail if no server is available.
+///
+/// ADR-117: previously this test sent to `127.0.0.1:5005` unconditionally,
+/// hitting any live server on the same port. With `node_ids = [1,2,3,5,7]`
+/// × 10 frames + 5 vitals it injected 55 spurious node_ids into the
+/// server's NODE_ADDRS — the keepalive task then spawned one `ping` child
+/// process per unique nid, accumulating 250+ ping zombies in production.
+/// Mitigation is two-layered: server now filters loopback at the UDP
+/// receiver, AND this test refuses to fire if anything is already bound
+/// to 127.0.0.1:5005.
 #[test]
 fn test_multi_node_udp_send() {
-    // Try to bind to a random port and send to localhost:5005
+    // ADR-117 guard: if some other process is bound to 127.0.0.1:5005 (most
+    // commonly a live sensing-server during dev), skip the send so we don't
+    // pollute that process's state. The bind probe is the cheapest signal —
+    // if we can bind even briefly, nobody owns the port; if not, abort.
+    match UdpSocket::bind("127.0.0.1:5005") {
+        Ok(probe) => drop(probe),
+        Err(_) => {
+            eprintln!("test_multi_node_udp_send: 127.0.0.1:5005 already in use — skipping (ADR-117)");
+            return;
+        }
+    };
+
+    // Try to bind to a random port and send to localhost:5005.
     // This is a smoke test — it verifies frames can be sent without panic.
     let sock = UdpSocket::bind("0.0.0.0:0").expect("bind");
     sock.set_write_timeout(Some(Duration::from_millis(100))).ok();
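Review note: the bind probe in the test guard can be sketched as a tiny predicate. This is an assumption-laden illustration, not code from the commit: `port_is_free` is a hypothetical name, and the check is inherently racy because another process may take the port between the probe and the later send.

```rust
use std::net::UdpSocket;

/// Hypothetical helper: true when a short-lived bind to `addr` succeeds,
/// i.e. nothing else currently holds that UDP address. The probe socket is
/// dropped (and the port released) as soon as the function returns.
fn port_is_free(addr: &str) -> bool {
    UdpSocket::bind(addr).is_ok()
}
```

A test could call `port_is_free("127.0.0.1:5005")` and return early when it is `false`, mirroring the `match` in the hunk above.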