docs(ADR-163): edge-latency RESULTS + PROOF/prove.sh wiring (T3)

Adds benchmarks/edge-latency/RESULTS.md (wiflow-std RESULTS style: each
measured number with reproduce command, machine, MEASURED-on-host grade, and
the honest host-vs-ESP32 / steady-state-vs-cold-start caveats) and ADR-163
(HEADLINE: CLAIMED latency budgets -> MEASURED-on-host, closing M5/M6
measurement debt; ESP32-on-hardware still pending).

- ADR-160 deferred 'criterion benches for process_frame budget claims' line
  updated to DONE (host) with the ESP32-pending note.
- PROOF.md performance table gains the two edge-latency reproduce rows;
  provenance ADR range extended to ADR-163.
- prove.sh gated section gains the edge-latency bench note (host proxy only;
  not asserted, never claims the ESP32 figure).

Benches/docs only; no crate republishes.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-06-12 08:02:07 -04:00 · 2026-06-12 08:02:07 -04:00 · 1a17cc5b06
parent 7c13ec6a00
commit 1a17cc5b06
5 changed files with 275 additions and 5 deletions
--- a/PROOF.md
+++ b/PROOF.md
@@ -55,6 +55,8 @@ trained checkpoint) so you can reproduce them yourself.
 | zero-copy ORT input ~1.48× (ADR-155) | **MEASURED** | `cd v2 && cargo bench -p wifi-densepose-nn --features onnx --bench onnx_bench` |
 | pointcloud splats 9→2 passes ~1.24× (ADR-160 research) | **MEASURED** | `cd v2 && cargo bench -p wifi-densepose-pointcloud --bench splats_bench` |
 | native wlanapi multi-BSSID scan 9.74 Hz (vs netsh ~2 Hz) | **MEASURED (Windows)** | `cd v2 && cargo test -p wifi-densepose-wifiscan -- --ignored measure_native_scan_rate` |
+| wasm-edge `process_frame` hot-path latency (host proxy, ADR-163) | **MEASURED-on-host** (NOT the ESP32/WASM3 budget — needs hardware) | `cd v2/crates/wifi-densepose-wasm-edge && cargo bench --features std` |
+| cog steady-state CPU infer latency ~305 µs (ADR-163; NOT the manifest cold-start) | **MEASURED-on-host** | `cd v2 && cargo bench -p cog-person-count -p cog-pose-estimation --no-default-features --bench infer_bench` |

 ## What we do NOT claim (the honest negatives — the strongest anti-slop signal)

@@ -68,8 +70,9 @@ trained checkpoint) so you can reproduce them yourself.

 ## Provenance

-Every claim above traces to a committed ADR (`docs/adr/ADR-154`…`ADR-160`), a
-test, a criterion bench, or `benchmarks/wiflow-std/RESULTS.md`. The history
+Every claim above traces to a committed ADR (`docs/adr/ADR-154`…`ADR-163`), a
+test, a criterion bench, `benchmarks/wiflow-std/RESULTS.md`, or
+`benchmarks/edge-latency/RESULTS.md`. The history
 includes published **retractions** (the 92.9% PCK retraction; the WiFlow-STD
 shipped-checkpoint refutation; the NV-diamond BOM reality check) — a faker hides
 failures; we commit them.
--- a/benchmarks/edge-latency/RESULTS.md
+++ b/benchmarks/edge-latency/RESULTS.md
@@ -0,0 +1,137 @@
+# Edge-Latency Benchmark Results — ADR-163
+
+Converting **CLAIMED** edge latency budgets into **MEASURED-on-host** numbers,
+closing the measurement debt flagged by Milestones 5/6 (ADR-159 / ADR-160).
+Benches + docs only — **no production-code behavior changed**.
+
+## The honest caveat, up front (read before citing any number)
+
+Two distinct gaps separate every number below from the figure it is converting:
+
+1. **Host ≠ ESP32.** The wasm-edge skill modules document budgets *"on ESP32-S3
+   WASM3"* (e.g. `exo_time_crystal`: "H (<10 ms)"). These benches run **native
+   x86_64 on a development laptop**, not the Xtensa/WASM3 target. A native host
+   median is a **lower bound on the ESP32 latency**, not the ESP32 number itself.
+   WASM3 interpretation on a ~240 MHz Xtensa core is typically 1–2 orders of
+   magnitude slower than native `-O` host code, so a host median far under the
+   budget **does NOT prove the ESP32 meets it.** *The ESP32 figure is NOT
+   reproduced here — it needs hardware.*
+
+2. **Bench ≠ the doc-claimed measurement.** For the cogs, the manifest cites a
+   **cold-start** number (`cold_start_ms_avg`, weight-load included); these
+   benches measure **steady-state** per-frame `infer` (warm, weights resident).
+   Different measurements; we report both, labelled.
+
+Grades (per `benchmarks/wiflow-std/RESULTS.md` / ADR-152 vocabulary):
+- **MEASURED-on-host** — reproduced in this repo on the machine below, exact
+  command recorded. NOT the ESP32 / NOT the cold-start figure.
+- **CLAIMED (ESP32)** — the doc budget; UNMEASURED on hardware here.
+
+## Machine
+
+| | |
+|---|---|
+| Host | `ruvzen` (Windows 11, this dev box) |
+| CPU | Intel Core Ultra 9 285H |
+| Toolchain | `cargo 1.91.1`, `--release` (opt-level per crate profile) |
+| Bench harness | criterion 0.5 (`time: [low **median** high]` reported below) |
+| Date | 2026-06-12 |
+
+Run-to-run spread on this box is non-trivial (criterion's low/high bracket the
+median by a few %); the medians below are single-session captures with the smoke
+settings `--warm-up-time 1 --measurement-time 2` (wasm-edge) / `3` (cogs). Re-run
+for your own machine — the absolute numbers are host-specific.
+
+---
+
+## T1 — wasm-edge `process_frame` hot paths (ADR-160 deferred item → DONE host)
+
+The crate is **excluded from the v2 workspace**; bench from the crate dir.
+
+```bash
+cd v2/crates/wifi-densepose-wasm-edge
+cargo bench --features std -- --warm-up-time 1 --measurement-time 2
+# med_seizure_detect is medical-experimental-gated:
+cargo bench --features std,medical-experimental -- --warm-up-time 1 --measurement-time 2 med_seizure
+```
+
+| Hot path (M6-audit-named) | Bench id | Host median | Grade | Doc budget (CLAIMED, ESP32) |
+|---|---|---|---|---|
+| `exo_time_crystal` 256-pt × 128-lag autocorrelation (full buffer) | `exo_time_crystal::process_frame[autocorr_256x128]` | **17.3 µs** | MEASURED-on-host | "H (<10 ms) on ESP32-S3 WASM3" — **NOT reproduced here (needs hardware)** |
+| `exo_ghost_hunter` empty-room periodicity + hidden-breathing | `exo_ghost_hunter::process_frame[empty_room_periodicity]` | **1.44 µs** | MEASURED-on-host | research/exotic; no firm ESP32 figure — host proxy only |
+| `sec_weapon_detect` per-subcarrier Welford (MAX_SC=32) | `sec_weapon_detect::process_frame[per_sc_welford]` | **0.42 µs** (420 ns) | MEASURED-on-host | research-grade; calibration-gated — host proxy only |
+| `med_seizure_detect` clonic-phase rhythm path (steady-state frame) | `med_seizure_detect::process_frame[clonic_rhythm]` | **0.10 µs** (105 ns) | MEASURED-on-host (feature-gated) | doc budget "S (<5 ms) on ESP32"; **NOT reproduced here** |
+
+Reading these honestly:
+
+- `exo_time_crystal` at **17.3 µs host** is the only one whose host cost is even
+  in the same *thousandths* of its 10 ms ESP32 budget — it does the most work
+  (~32K MACs/frame). 17.3 µs native says the algorithm is cheap; it says
+  **nothing** about whether WASM3-on-Xtensa lands under 10 ms. A naïve
+  host→ESP32 extrapolation (assume 100× interpreter+clock penalty) would put it
+  near ~1.7 ms, comfortably under — **but that is an extrapolation, not a
+  measurement**, and is recorded here only to show the host number is not
+  obviously in tension with the budget. ESP32 figure: **UNMEASURED**.
+- `med_seizure_detect`'s 105 ns is the **steady-state** per-frame cost; the
+  expensive clonic autocorrelation only fires when the state machine is in the
+  clonic phase, so this is a lower-bound on the heavy path, not the worst case.
+  It is still a real, committed host datapoint.
+- The pre-existing `tests/budget_compliance.rs` already asserts the L/S/H
+  wall-clock tiers (25 passing tests); these criterion benches add the
+  regression-grade, reproducible median that ADR-160 deferred.
+
+---
+
+## T2 — cog steady-state inference latency (ADR-159/160 deferred item → DONE)
+
+Cog crates are normal workspace members; bench from `v2/`. Real weights
+(`count_v1.safetensors` / `pose_v1.safetensors`) ship in-repo under each cog's
+`cog/artifacts/`, so the bench measures the **real Candle CPU forward**, not the
+stub (the bench `assert!`s `backend().starts_with("candle-")`).
+
+```bash
+cd v2
+cargo bench -p cog-person-count  --no-default-features --bench infer_bench -- --warm-up-time 1 --measurement-time 3
+cargo bench -p cog-pose-estimation --no-default-features --bench infer_bench -- --warm-up-time 1 --measurement-time 3
+```
+
+| Cog | Bench id | Host median (steady-state infer, CPU) | Grade | Manifest cold-start (CLAIMED, different measurement + machine) |
+|---|---|---|---|---|
+| cog-person-count | `cog_person_count::infer[cpu_real_weights_steady_state]` | **305 µs** (idle box) | MEASURED-on-host | — (person-count manifest carries comparable provenance) |
+| cog-pose-estimation | `cog_pose_estimation::infer[cpu_real_weights_steady_state]` | **305 µs** (idle box) | MEASURED-on-host | `cold_start_ms_avg: 5.4` (30 invocations, **ruvultra/RTX 5080 host**, candle 0.9 cpu) — **cold-start, NOT steady-state; NOT this machine** |
+
+> Spread caveat (observed, honest): both medians above were captured with the box
+> otherwise idle. A re-run of the validate-form command *while a second cargo job
+> was loading the same cores* gave 385 µs (person-count) / 973 µs (pose) —
+> the criterion low/high bracket widens to ~0.34–1.18 ms under contention. The
+> 305 µs figures are the idle-box datapoints; the absolute number is host- and
+> load-dependent (the ~3× pose swing is core contention, not a code change).
+
+Reading these honestly:
+
+- **Steady-state ≠ cold-start.** The pose manifest's `5.4 ms` folds in one-time
+  weight load / mmap / first-forward allocation. This bench warms the engine
+  first and times only the recurring per-frame forward, on a *different
+  machine*. The two numbers are not comparable and we do not claim this bench
+  reproduces the 5.4 ms manifest figure.
+- Both cogs share the same conv encoder; person-count adds a count head +
+  confidence head, pose adds a 256-wide MLP head. The host steady-state cost is
+  dominated by the three dilated Conv1d layers (56→64→128→128) shared by both —
+  which is why both land at ~305 µs.
+- **Empirical confirmation of the steady-state/cold-start gap:** pose
+  steady-state (305 µs host) is ~18× *under* the manifest's 5.4 ms cold-start.
+  Even accounting for the different machine, this is the expected shape — the
+  bulk of cold-start is one-time setup, not the forward pass — and it is exactly
+  why conflating the two would be dishonest.
+
+---
+
+## Status vs the deferred items
+
+| Deferred item | Was | Now |
+|---|---|---|
+| ADR-160 "Criterion benches for `process_frame` budget claims" | ACCEPTED-FUTURE | **DONE (host)**; ESP32-on-hardware still **PENDING** (needs the wasm32 target + a flashed ESP32-S3) |
+| ADR-159/160 cog inference latency (`cold_start_ms_avg` had no committed bench) | CLAIMED | **MEASURED-on-host (steady-state)**; cold-start-on-ruvultra remains the manifest's separate claim |
+
+Nothing here changes runtime behavior — these are benches + this results file
+only. No crate needs republishing.
--- a/docs/adr/ADR-160-edge-skill-library-honest-labeling.md
+++ b/docs/adr/ADR-160-edge-skill-library-honest-labeling.md
@@ -182,9 +182,15 @@ label or behavior change, consistent with leaving their claim surface intact.)
  sign-language claim requires labelled clinical/affective/ASL data and reference
  standards that do not exist in this repo. The disclaimers + feature gate are the
  honest stand-in. Nothing is claimed that is not measured.
-- **Criterion benches for `process_frame` budget claims** — **ACCEPTED-FUTURE**.
-  `tests/budget_compliance.rs` asserts L/S/H tier wall-clock budgets (25 tests,
-  passing), but a regression-grade criterion bench is not yet wired.
+- **Criterion benches for `process_frame` budget claims** — **DONE (host)**
+  (ADR-163, 2026-06-12). `benches/process_frame_bench.rs` benches the heaviest
+  hot paths (`exo_time_crystal` 256×128 autocorrelation, `exo_ghost_hunter`
+  periodicity, `sec_weapon_detect` per-subcarrier Welford, `med_seizure_detect`
+  clonic rhythm) and reports committed **host** medians
+  (`benchmarks/edge-latency/RESULTS.md`). `tests/budget_compliance.rs` continues
+  to assert the L/S/H tier wall-clock budgets (25 tests, passing). **ESP32-on-
+  hardware (Xtensa/WASM3) latency remains PENDING** — the host bench is an
+  algorithm-cost proxy (a lower bound on ESP32 latency), NOT the ESP32 figure (needs hardware).
 - **`wasm32-unknown-unknown` `static_mut_refs` confirmation** — **ACCEPTED-FUTURE**
  (toolchain): the source pattern is eliminated; a CI job on the wasm target should
  assert zero `static_mut_refs` once the target is added to the build image.
--- a/docs/adr/ADR-163-edge-latency-measurement.md
+++ b/docs/adr/ADR-163-edge-latency-measurement.md
@@ -0,0 +1,123 @@
+# ADR-163: Edge-Latency Measurement — CLAIMED budgets → MEASURED-on-host
+
+- **Status**: accepted
+- **Date**: 2026-06-12
+- **Deciders**: ruv
+- **Tags**: edge-latency, wasm-edge, esp32, cog-inference, criterion, prove-everything, measurement-debt
+- **Amends**: ADR-160 (deferred "criterion benches for process_frame budget claims" line now DONE-on-host); ADR-159 (cog inference latency)
+
+## Context — Milestone 9 of the beyond-SOTA sweep
+
+Prior milestones (M5/M6, ADR-159/ADR-160) flagged **measurement debt**: edge
+latency budgets asserted in doc-comments and manifests but **never reproduced by
+a committed benchmark**. Specifically:
+
+- Many `wifi-densepose-wasm-edge` skill modules document a timing budget *"on
+  ESP32-S3 WASM3"* (e.g. `exo_time_crystal`: "H (heavy, <10 ms)"). These were
+  **CLAIMED**, not benchmarked. ADR-160's deferred backlog named exactly this:
+  *"Criterion benches for `process_frame` budget claims — ACCEPTED-FUTURE."*
+- `cog-pose-estimation`'s manifest cites `cold_start_ms_avg: 5.4`, but neither
+  cog had a `benches/` directory or any committed inference-latency number.
+
+Under the project's **prove-everything / anti-"AI-slop"** directive, a CLAIMED
+latency budget that a skeptic cannot reproduce is debt. M9 pays it down — benches
+and docs only, **no production-code behavior change** (so nothing republishes).
+
+## Headline
+
+**Converted the CLAIMED edge-latency budgets into MEASURED-on-host numbers, with
+the honest host-vs-ESP32 caveat stated everywhere.** Added committed criterion
+benches over the heaviest hot paths and a results file a skeptic can re-run. The
+ESP32-on-hardware figure remains explicitly **UNMEASURED** — this milestone does
+not pretend a laptop reproduces an Xtensa/WASM3 budget.
+
+## Decision — benches landed
+
+### T1 — wasm-edge `process_frame` budget benches
+
+`v2/crates/wifi-densepose-wasm-edge/benches/process_frame_bench.rs` (criterion,
+`harness = false`, `required-features = ["std"]`). The crate is **excluded from
+the v2 workspace**, so it runs from the crate dir. Benches the M6-audit-named
+heaviest hot paths over a **fixed synthetic CSI frame**, each driven through the
+public `process_frame` after warming the relevant ring/phase buffers so the
+expensive path actually executes:
+
+- `exo_time_crystal::process_frame` — full 256-pt × 128-lag autocorrelation.
+- `exo_ghost_hunter::process_frame` — empty-room periodicity / hidden-breathing.
+- `sec_weapon_detect::process_frame` — per-subcarrier (MAX_SC=32) Welford.
+- `med_seizure_detect::process_frame` — clonic-rhythm path (`#[cfg(feature =
+  "medical-experimental")]`, only built/run with that gate).
+
+The lib's `bench = false` was set so the libtest harness does not intercept
+criterion CLI flags; the `ghost_hunter` bin is already `standalone-bin`-gated and
+not built under `--features std`.
+
+**Measured host medians** (Intel Core Ultra 9 285H, native `--release`):
+`exo_time_crystal` **17.3 µs** · `exo_ghost_hunter` **1.44 µs** ·
+`sec_weapon_detect` **0.42 µs** · `med_seizure_detect` **0.10 µs**.
+
+### T2 — cog inference latency benches
+
+`v2/crates/cog-person-count/benches/infer_bench.rs` and
+`v2/crates/cog-pose-estimation/benches/infer_bench.rs` (criterion,
+`harness = false`). Each loads the **real** shipped weights from the in-repo
+`cog/artifacts/`, asserts the Candle CPU backend (so the stub can never be
+silently benched), warms one forward, then times steady-state
+`InferenceEngine::infer` over a fixed CSI window on `Device::Cpu`.
+
+**Measured host medians:** cog-person-count **305 µs** · cog-pose-estimation
+**305 µs** (steady-state, CPU, real weights).
+
+### T3 — results file
+
+`benchmarks/edge-latency/RESULTS.md`, in the `benchmarks/wiflow-std/RESULTS.md`
+style: each number with its exact reproduce command, the machine, the
+MEASURED-on-host grade, and the honest caveat.
+
+## The honest caveat (recorded, non-negotiable)
+
+1. **Host ≠ ESP32.** The wasm-edge benches run native x86_64, not Xtensa/WASM3.
+   A host median is a **lower bound on the ESP32 latency**, not the ESP32 number;
+   WASM3 interpretation on a ~240 MHz core is 1–2 orders of magnitude slower than
+   native `-O`. A host median under budget does **not** prove the ESP32 meets it.
+   **The ESP32 figure is NOT reproduced here — it needs hardware.**
+2. **Bench ≠ the doc-claimed measurement.** The cogs' manifest cites a
+   **cold-start** number (weight-load included); these benches measure
+   **steady-state** per-frame `infer`. We report both, labelled, and do not
+   conflate them. Empirically, pose steady-state (305 µs host) is ~18× under the
+   5.4 ms cold-start — the expected shape, and exactly why conflating would lie.
+
+## Deferred / still-pending (nothing dropped)
+
+- **ESP32-on-hardware `process_frame` latency** — **PENDING (hardware)**. Needs
+  the `wasm32-unknown-unknown` target built + flashed to an ESP32-S3 and timed
+  under WASM3. The host bench is the algorithm-cost proxy until then.
+- **Per-skill *accuracy*** remains **DATA-GATED** (unchanged from ADR-160) —
+  this ADR measures latency only, never claims detection accuracy.
+
+## Reproduction (MEASURED)
+
+```bash
+# T1 — wasm-edge (workspace-excluded → run from the crate dir)
+cd v2/crates/wifi-densepose-wasm-edge
+cargo bench --features std -- --warm-up-time 1 --measurement-time 2
+cargo bench --features std,medical-experimental -- --warm-up-time 1 --measurement-time 2 med_seizure
+
+# T2 — cogs (workspace members)
+cd v2
+cargo bench -p cog-person-count   --no-default-features --bench infer_bench
+cargo bench -p cog-pose-estimation --no-default-features --bench infer_bench
+
+# existing tests still green (behavior unchanged)
+cargo test -p cog-person-count -p cog-pose-estimation --no-default-features
+```
+
+## Consequences
+
+- ADR-160's deferred *"Criterion benches for `process_frame` budget claims"* line
+  is now **DONE (host)**; the ESP32-on-hardware confirmation is explicitly the
+  one remaining pending item.
+- The cogs now ship committed, reproducible steady-state inference-latency
+  numbers, cleanly distinguished from the manifest's cold-start claim.
+- No runtime behavior changed; no crate republishes. `PROOF.md`'s performance
+  table and `scripts/prove.sh`'s gated section reference the new benches.
--- a/scripts/prove.sh
+++ b/scripts/prove.sh
@@ -131,6 +131,7 @@ else
  SKIP "named person-identity — DATA-GATED: needs a real enrollment feeding the AETHER/body-resonance channel (see docs/research/soul/)"
  SKIP "OccWorld trained accuracy — needs a trained checkpoint (predict() carries weights_trained=false until then)"
  SKIP "native wlanapi 9.74 Hz scan — Windows-only; run: cargo test -p wifi-densepose-wifiscan -- --ignored measure_native_scan_rate"
+  SKIP "edge-latency benches (ADR-163) — host medians, not asserted here: (cd v2/crates/wifi-densepose-wasm-edge && cargo bench --features std) and (cd v2 && cargo bench -p cog-person-count -p cog-pose-estimation --no-default-features --bench infer_bench). HOST proxy only — the ESP32/WASM3 budget is NOT reproduced on a laptop; see benchmarks/edge-latency/RESULTS.md"
  echo "  (re-run with --full to attempt the feature-gated subset where prereqs exist)"
 fi
 hr
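
Reviewer note (not part of this commit): the arithmetic quoted in `benchmarks/edge-latency/RESULTS.md` can be sanity-checked with a small standalone sketch. The constants are copied from the tables in the diff above. The function names are hypothetical, and the 100× interpreter-plus-clock penalty is the naive assumption that the results file itself labels as an extrapolation, not a measurement.

```rust
// Standalone arithmetic check; all helper names are hypothetical.
// Constants come from RESULTS.md as added in this diff.

/// Multiply-accumulate count for a full-buffer autocorrelation:
/// one MAC per (sample, lag) pair.
fn autocorr_macs(points: u64, lags: u64) -> u64 {
    points * lags
}

/// Naive host -> ESP32 extrapolation, in milliseconds.
/// `penalty` is an ASSUMED slowdown factor, not a measured value.
fn extrapolate_ms(host_us: f64, penalty: f64) -> f64 {
    host_us * penalty / 1000.0
}

/// How many times smaller a steady-state figure (µs) is than a
/// cold-start figure (ms).
fn ratio_under(cold_start_ms: f64, steady_us: f64) -> f64 {
    cold_start_ms * 1000.0 / steady_us
}

fn main() {
    // exo_time_crystal: 256-pt x 128-lag autocorrelation.
    println!("MACs/frame: {}", autocorr_macs(256, 128));
    // 17.3 µs host median with the assumed 100x penalty.
    println!("extrapolated: {:.2} ms", extrapolate_ms(17.3, 100.0));
    // Pose manifest cold-start (5.4 ms) vs host steady-state (305 µs).
    println!("cold-start / steady-state: {:.1}x", ratio_under(5.4, 305.0));
}
```

This only reproduces the quoted ratios; it says nothing about the real ESP32 latency, which still needs hardware.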