wifi-densepose/api-docs/research/ruview-beyond-sota/04-optimization-roadmap.md

253 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# RuView Beyond-SOTA — 04: Performance Review & Optimization Roadmap
**Scope:** the streaming sensing pipeline (CSI ingest → multistatic fusion → CIR gate →
pose publish) in `v2/`, hot-path crates `wifi-densepose-signal` (ruvsense),
`wifi-densepose-engine`, `wifi-densepose-ruvector`, plus build-profile and edge-target
(Pi 5-class, WASM) considerations.
**Hard constraint (non-negotiable):** the witness chain (ADR-028, ADR-136 §2.5 replay
contract, ADR-137 §2.7 BLAKE3 witness in
`v2/crates/wifi-densepose-engine/src/lib.rs:437-448`) requires **bit-exact deterministic
float output**. Every recommendation below is tagged with its determinism risk. Anything
that reorders float additions, enables FMA contraction, fast-math, or parallel reduction
**changes the witness hash** and requires a coordinated proof-hash regeneration
(`verify.py --generate-hash`) plus witness-bundle re-issue.
---
## 1. What we actually have measured (and what we don't)
`/home/user/RuView/benchmark_baseline.json` is a **signal-quality soak baseline**, not a
latency benchmark: 1,566 samples (ticks 5113152395) of
`variance / motion / presence / confidence / est_persons / kp_spread / rssi`, with a
summary block (`confidence_mean: 0.643`, `presence_ratio: 0.934`,
`kp_spread_mean: 86.7`, `person_count_changes: 10`). **It contains zero timing data.**
It is the accuracy guardrail for any optimization (post-change soak must reproduce these
distributions), not a latency baseline.
Latency benchmarks exist but no committed results were found in the repo:
| Bench | File | What it measures |
|---|---|---|
| `process_cycle_4nodes_56sc` | `v2/crates/wifi-densepose-engine/benches/engine_cycle.rs:34-48` | One full engine cycle, 4 nodes × 56 subcarriers, vs. the documented 50 ms budget (`engine_cycle.rs:3-6`) |
| `cir_bench` | `v2/crates/wifi-densepose-signal/benches/cir_bench.rs` | `CirEstimator::estimate()` per tier (HT20/HT40/HE20/HE40) + 12-link amortization |
| `sketch_bench` | `v2/crates/wifi-densepose-ruvector/benches/sketch_bench.rs:86-175` | Hamming sketch vs. float L2/cosine compare; top-K over 1,024-sketch bank |
| `signal_bench`, `calibration_bench`, `aether_prefilter_bench` | `v2/crates/wifi-densepose-signal/benches/` | Signal-path and ADR-135 calibration throughput |
**Action zero of the roadmap is to run these on a Pi 5 and commit the criterion
baselines.** All impact classes below are derived from operation counts read out of the
code (cited), not invented measurements.
---
## 2. Latency budget model — streaming pipeline
Two clock domains exist and must not be conflated:
- **TDMA sensing cycle: 20 Hz / 50 ms** — the architecture's own budget
(`v2/crates/wifi-densepose-signal/src/ruvsense/mod.rs:5`, `RuvSenseConfig::target_hz =
20.0` at `mod.rs:258`, and the bench doc `engine_cycle.rs:3`).
- **CSI ingest: 100 Hz per node** — raw frames arrive ~5× faster than the fused output
rate; per-frame ingest work (parse, normalize, calibrate, window) must therefore fit a
**10 ms** per-frame envelope while the fused path fits **< 50 ms end-to-end**.
Proposed per-stage budget for the 50 ms end-to-end target (4 nodes, HT20 / 56
subcarriers the configuration the engine bench encodes):
| # | Stage | Code | Budget | Risk (from code reading) |
|---|---|---|---|---|
| 1 | Ingest + hardware normalize (per 100 Hz frame) | `hardware_norm`, `multiband.rs` | 2 ms | Low vector ops on 56 floats |
| 2 | Calibration apply (ADR-135) | `ruvsense/calibration.rs` | 2 ms | Low Welford lookups |
| 3 | Phase alignment | `phase_align.rs:117-152` | 1 ms | Low 20 iterations over 17 static subcarriers (`config.max_iterations: 20`, `phase_align.rs:57`); allocation churn only 3) |
| 4 | Multistatic fusion (attention + softmax) | `multistatic.rs:512-598` | 2 ms | Low O(nodes × 56); but does duplicate work in `fuse_scored` 3, F2) |
| 5 | **CIR gate (ISTA L1)** | `multistatic.rs:440-475` `cir.rs:601-654` | 15 ms | **HIGH** dominant cost, scales badly with PHY tier (below) |
| 6 | Coherence score + gate decision | `coherence.rs`, `coherence_gate.rs` | 2 ms | Low z-scores over 56 subcarriers |
| 7 | Tomography (ADR-030 tier 2, when enabled) | `tomography.rs:236-323` | 8 ms | **Medium** per-iteration allocation + loose step size 3, F8/F9) |
| 8 | Pose tracker (17-kp Kalman + re-ID) | `pose_tracker.rs` | 8 ms | Medium sketch prefilter (ADR-084) already mitigates the re-ID scan |
| 9 | Engine: quality score, privacy gate, WorldGraph node, BLAKE3 witness | `engine/src/lib.rs:304-368` | 5 ms | Low per cycle, but **unbounded memory growth** 4) |
| 10 | Publish (WS/serde) | sensing-server | 5 ms | Low |
| | **Total** | | **50 ms** | |
### Why stage 5 is the at-risk stage — operation counts from the code
`ista_solve` (`cir.rs:601-654`) runs **two dense complex mat-vecs per iteration**
(`matvec_phi` at `cir.rs:717-726`, `matvec_phi_h` at `cir.rs:730-745`), each O(K·G)
complex MACs (≈ 8 FLOPs each), up to `max_iters: 100` (`cir.rs:176`). Per
`CirConfig` (`cir.rs:164-233`):
| Tier | K (active) | G (taps) | FLOPs/iter (2·K·G·8) | FLOPs @100 iters |
|---|---|---|---|---|
| HT20 | 52 | 156 | 0.13 M | 13 M |
| HT40 | 114 | 342 | 0.62 M | 62 M |
| HE20 | 242 | 726 | 2.8 M | 0.28 G |
| HE40 | 484 | 1,452 | 11.2 M | 1.1 G |
HT20 fits the 15 ms budget comfortably on a Pi 5; **HE40 at worst-case iteration count
is ~1.1 GFLOP of scalar, cache-unfriendly work per estimate and will not fit any 50 ms
budget without structural change** (F4 below). Today the gate runs once per cycle on the
first link only (`multistatic.rs:452-463`), which contains the damage; the 12-link
amortization pattern in `cir_bench.rs` shows the intended scale-up, which multiplies
this cost ×12.
---
## 3. Findings table — optimization opportunities
Impact: relative cycle-time/memory effect at the 4-node HT20 operating point unless
noted. Determinism: **EXACT** = bit-identical output guaranteed; **TIE** = only
tie-breaking/ordering may differ; **CHANGES-FLOATS** = output bits change, witness/proof
hash must be regenerated.
| ID | Finding (file:line) | Impact | Effort | Determinism |
|---|---|---|---|---|
| F1 | `FusedSensingFrame` deep-copies every input frame each cycle: `node_frames: node_frames.to_vec()` (`multistatic.rs:282`) clones all per-node amplitude+phase vectors per 50 ms cycle even when downstream geometry consumers don't need them | Med | Low (Arc/Cow or borrow) | EXACT |
| F2 | `fuse_scored` re-derives the per-node amplitude views and recomputes `node_attention_weights` after `fuse` already computed them inside `attention_weighted_fusion` (`multistatic.rs:311-321` duplicating `multistatic.rs:520`) full cosine-sim + softmax done twice per cycle | Low-Med | Low (return weights from `fuse`) | EXACT (same math, computed once) |
| F3 | CIR gate rebuilds a heap `CsiFrame` per cycle: `build_csi_frame_from_channel` allocates an `Array2<Complex64>` and converts amplitude/phase via `from_polar` per subcarrier (`multistatic.rs:488-506`, called from `multistatic.rs:462`), then `extract_csi_vector` converts back to `Complex32` (`cir.rs:505-530`) f32f64f32 round-trip plus two allocations purely as glue | Med | Med (give `CirEstimator` a slice-based entry point) | EXACT if conversions reproduce exactly (f32f64 is lossless; `from_polar` in f64 then truncate f32 polar keep the f64 intermediate to stay exact, or accept CHANGES-FLOATS and regenerate hashes) |
| F4 | ISTA inner loop uses dense O(K·G) mat-vecs (`cir.rs:717-745`) although Φ is a sub-sampled DFT (`cir.rs:539-558`) the products Φx and Φᴴr are computable via an FFT of length G in O(G log G), an ~840× FLOP cut at HE20/HE40 (table §2) | **High** (the only path to HE40 real-time) | High | **CHANGES-FLOATS** (different summation order than the sequential dot product) must ship behind a feature flag, A/B against `cir_proof_runner`, regenerate `expected_features.sha256` + witness bundle |
| F5 | `neumann_warm_start` recomputes the diagonal of ΦᴴΦ with a full K×G pass **per frame** (`cir.rs:676-681`), rebuilds the COOCSR diagonal matrix per frame (`cir.rs:683-685`), and collects `rhs_re`/`rhs_im` Vecs per frame (`cir.rs:689-690`) yet `diag` depends only on Φ, which is fixed at `CirEstimator::new` | Med | Low (precompute diag+CSR in `new()`) | EXACT (same values, computed once) |
| F6 | `phase_variance` collects a `Vec<f32>` of phases per call (`cir.rs:792`) replaceable by a two-pass loop with zero allocation | Low | Low | EXACT |
| F7 | Φ and Φᴴ are both stored densely (`cir.rs:546-547`): 2·K·G·8 bytes Φᴴ entries are just conjugates of Φ (`cir.rs:555`), so a transposed-iteration kernel over Φ alone halves the footprint (HE40: 11.2 MB 5.6 MB) | Low (latency) / Med (memory §4) | Med | EXACT (conjugation is exact; keep identical accumulation order in the transposed kernel) |
| F8 | Tomography allocates the gradient vector **inside** the solver iteration loop: `let mut gradient = vec![0.0_f64; self.n_voxels]` (`tomography.rs:266`) one heap alloc + zeroing per iteration, up to `max_iterations: 100` (`tomography.rs:75`); hoist and `fill(0.0)` | Med (for tier-2 deployments) | Low | EXACT |
| F9 | Tomography step size uses the Frobenius-norm upper bound for the Lipschitz constant (`tomography.rs:253-259`, comment admits `‖WᵀW‖ ≤ ‖W‖_F²`) a bound loose by up to the matrix rank, forcing proportionally more ISTA iterations than the power-method estimate used in `cir.rs:566-590` | Med | Low (reuse the cir.rs power-method pattern) | **CHANGES-FLOATS** (different step different iterate path) |
| F10 | `apply_phase_correction` clones the amplitude vector and allocates a fresh corrected-phase Vec per channel per cycle (`phase_align.rs:258-268`, `frame.amplitude.clone()` at `phase_align.rs:264`); `align` additionally `frames.to_vec()`s on the single-channel path (`phase_align.rs:128`) an in-place `align_mut` avoids all of it | Low-Med | Low | EXACT |
| F11 | Static-subcarrier selection fully sorts all subcarriers by variance (`phase_align.rs:180`) where `select_nth_unstable_by` suffices trivial at 56 subcarriers, relevant at HE tiers (242484) | Low | Low | **TIE** (equal-variance ties may select a different subcarrier set; pin a stable tie-break on index to stay EXACT) |
| F12 | Engine clones each node's amplitude vector for the array coordinator every cycle: `cf.amplitude.clone()` (`engine/src/lib.rs:385`); also allocates a `Vec<Option<CalibrationId>>` per cycle (`lib.rs:293`) and `format!("{e:?}")` strings for every evidence ref (`lib.rs:337`) | Low | Low | EXACT |
| F13 | `fuse_scored_calibrated` computes the modal calibration id in O(n²) (`multistatic.rs:404-410`) harmless at n 15 nodes, noted for swarm-scale reuse (ADR-148) | Low | Low | EXACT |
| F14 | **No `rayon` and no SIMD feature exists anywhere in the hot crates** (grep over `crates/*/Cargo.toml`: zero hits for rayon/simd/target-feature outside wasm-opt flags). The 12-link CIR pattern (`cir_bench.rs:4-5`) and the per-node ingest path are embarrassingly parallel **across independent links/nodes** | High (multi-link tiers) | Med | **EXACT if and only if** parallelism stays at link/node granularity with results collected in deterministic (index) order and no shared float accumulator; intra-link parallel reductions are CHANGES-FLOATS and are banned |
| F15 | `Cir::top_k_taps` clones and fully sorts all G taps (`cir.rs:322-332`) O(G log G) with a G-sized clone; a k-heap (the exact pattern already written in `sketch.rs:546-563`) is O(G log k) | Low | Low | TIE (equal-magnitude ordering; pin index tie-break) |
| F16 | Core `CsiFrame` carries `Complex64` while the entire ruvsense DSP path computes in f32 (conversion at `cir.rs:525`) 2× memory and bandwidth on every ingest for precision the pipeline immediately discards | Med (memory/bandwidth) | High (core type change ripples everywhere) | **CHANGES-FLOATS** at the boundary; defer until a major version |
| F17 | Sketch path is already well-optimized: heap-based top-K with n k fast path (`sketch.rs:536-569`), 28-byte wire format (`sketch.rs:303`). Remaining win is build-level: `count_ones()` only lowers to POPCNT/NEON-vcnt when the target CPU enables it (see §5) | Low | Low | EXACT (integer ops) |
---
## 4. Memory-footprint analysis (Pi 5-class and WASM; ESP32 aggregation out of scope)
**Static, per-process (from struct definitions):**
| Component | Sizing source | Footprint |
|---|---|---|
| `CirEstimator` HT20 (Φ + Φᴴ, `Complex32`) | `cir.rs:546-547`, K=52 G=156 | 2 · 52 · 156 · 8 B **130 KB** |
| `CirEstimator` HE20 | K=242 G=726 | **2.8 MB** |
| `CirEstimator` HE40 | K=484 G=1452 | **11.2 MB** (halvable via F7) |
| Tomography weight matrix | `tomography.rs:214-217`, sparse per-link (voxel,weight) pairs; default grid 8×8×4 = 256 voxels (`tomography.rs:70-73`) | tens of KB at default grid |
| Sketch bank, 1,024 × 128-d | `sketch.rs` 1 bit/dim | 1,024 · 16 B **16 KB** (vs 512 KB float) |
A Pi 5 (48 GB) absorbs all of this trivially. The real memory risks are dynamic:
1. **Unbounded WorldGraph growth (the one genuine leak-class issue).** Every
`process_cycle` appends a `SemanticState` node plus a `DerivedFrom` edge
(`engine/src/lib.rs:346-352`), and change-points append `Event` nodes
(`lib.rs:422-428`). At 20 Hz that is **1.73 M nodes/day** with no eviction anywhere
in the engine. `snapshot_json` (`lib.rs:191-193`) then serializes the whole graph.
**Required:** a retention/compaction policy (ring buffer or time-windowed rollup of
SemanticStates). Determinism caveat: eviction changes snapshot *contents* (a product
decision), not float math the per-cycle witness (`lib.rs:437-448`) is unaffected.
2. **Per-cycle allocation churn** (F1, F3, F5, F8, F10, F12): at 20 Hz this is dozens of
short-lived heap allocations per cycle. On a Pi 5 this is allocator pressure and
cache pollution rather than RSS growth; on WASM (bump-ish dlmalloc, no MADV_FREE) it
inflates the linear memory high-water mark, which is never returned to the host.
3. **WASM targets.** `wifi-densepose-wasm` is a browser binding crate (JS interop,
serde, chrono `crates/wifi-densepose-wasm/Cargo.toml`) and pulls `wifi-densepose-mat`
optionally; it relies on `wasm-opt -O4` (`Cargo.toml` `[package.metadata.wasm-pack]`).
`wifi-densepose-wasm-edge` is the disciplined one: `no_std` + `libm`, its own profile
`opt-level = "s"`, lto, cgu=1 (`crates/wifi-densepose-wasm-edge/Cargo.toml`). Neither
enables `+simd128` 5). If the CIR estimator is ever compiled to wasm-edge, HE40's
11.2 MB of sensing matrix alone is ~700 pages of linear memory restrict edge WASM
to HT20 (130 KB) or ship F4/F7 first.
---
## 5. Build-profile review & recommendations
Current release profile (`v2/Cargo.toml:213-218`) is already aggressive and correct:
`opt-level = 3`, `lto = true` (fat), `codegen-units = 1`, `panic = "abort"`,
`strip = true`; `bench` inherits release with debug symbols (`v2/Cargo.toml:225-227`).
There is nothing wrong to fix here the gains left are target- and feedback-driven:
1. **Per-target CPU tuning (EXACT, do first).** No `target-cpu` is set anywhere. For
Pi 5 fleet builds: `RUSTFLAGS="-C target-cpu=cortex-a76"` enables NEON scheduling
and `vcnt` for the sketch path (F17) without changing IEEE semantics. LLVM does not
reassociate float reductions or contract to FMA without explicit fast-math/contract
flags, so scalar float results stay bit-exact. **Verify with the existing proof
runners** (`cir_proof_runner`, `calibration_proof_runner`,
`signal/Cargo.toml`) as the acceptance gate that is exactly what they exist for.
2. **WASM SIMD.** Add `-C target-feature=+simd128` for `wifi-densepose-wasm` builds and
keep a non-SIMD artifact for older runtimes. Same determinism note as above; gate
with the proof runners compiled to wasm where feasible.
3. **PGO: feasible and determinism-safe.** PGO changes inlining/layout, never FP
semantics. The repo already has ideal deterministic training workloads: the proof
runner binaries plus `engine_cycle` / `cir_bench`. Pipeline: `cargo pgo build`
run proof runners + benches `cargo pgo optimize`. Expect mid-single-digit to ~15%
on branchy paths (gate decisions, tracker lifecycle); the dense ISTA loop will see
little. Cost: CI complexity. Verdict: do it after F1F12, not before.
4. **Do not** enable `-ffast-math`-equivalents (`fadd_fast`, `core::intrinsics`,
`-C llvm-args=-fp-contract=fast`) anywhere in the witness path. This must be a
stated rule in CONTRIBUTING/ADR, not tribal knowledge.
5. **BOLT / `opt-level` experiments are not worth it** ahead of F4; the pipeline is
FLOP-bound in one loop, not front-end bound.
---
## 6. Prioritized 90-day plan
### Phase 0 — Measure (days 110)
- Run and commit criterion baselines on a Pi 5 and an x86 dev box:
`engine_cycle`, `cir_bench` (all four tiers), `sketch_bench`, `signal_bench`,
`calibration_bench`. The 50 ms claim in `engine_cycle.rs:3` becomes a measured number.
- Add a lightweight per-stage timing histogram (feature-gated, off in witness builds) at
the §2 stage boundaries; wire a CI perf-regression gate 10%) on the committed
baselines.
- Re-run the soak that produced `benchmark_baseline.json` and pin it as the accuracy
guardrail for everything below.
### Phase 1 — Exact, zero-risk wins (days 1035)
All EXACT findings; no witness impact; each lands with proof-runner verification:
- F5 (precompute warm-start diag/CSR in `CirEstimator::new`) biggest exact CIR win.
- F8 (hoist tomography gradient buffer), F6, F10, F12, F1, F2 (allocation/duplication
removal), F15 + F11 with pinned index tie-breaks.
- WorldGraph retention policy (the §4.1 unbounded-growth fix) design ADR + ring-buffer
implementation.
- Expected outcome: measurable cycle-time reduction and flat memory under 24 h soak;
**identical witness hashes**.
### Phase 2 — Determinism-managed structural wins (days 3570)
Each behind a feature flag, A/B'd against the legacy path (the `use_cir_gate` A/B switch
at `multistatic.rs:103` is the template), with proof-hash regeneration as an explicit,
witnessed release event:
- **F4: FFT-based Φ/Φᴴ application in ISTA** the headline item; the only route to
HE20/HE40 real-time and the 12-link pattern. Acceptance: cir_bench speedup 5× at
HE20, soak metrics within guardrail, new `expected_features.sha256` published in a
fresh witness bundle.
- F9 (power-method Lipschitz in tomography) riding the same hash-regen train.
- F3 (slice-based CIR entry point), choosing the exact-f64-intermediate variant if the
hash train slips.
- F14: feature-gated `rayon` across **links/nodes only**, deterministic index-ordered
collection; CI must run the determinism test (`engine/src/lib.rs:535-548`
`cycle_is_deterministic`) with the feature on.
### Phase 3 — Platform & toolchain (days 7090)
- Pi 5 `target-cpu=cortex-a76` fleet builds + proof-runner verification 5.1).
- `+simd128` WASM artifact + size budget check for wasm-edge 5.2, §4.3).
- PGO pilot in CI using proof runners as the training corpus 5.3).
- Re-baseline: new criterion numbers, refreshed witness bundle, updated this document's
§1 with real measured latencies.
**Out of 90-day scope, flagged for the architecture backlog:** F16 (Complex64Complex32
in core), F7 (single-matrix Φ kernel bundle with F4), and HE40-on-edge (blocked on
F4+F7).
---
## 7. Summary
The pipeline's only structural latency hazard is the dense ISTA CIR solver
(`cir.rs:601-654` + `cir.rs:717-745`): fine at HT20, ~1.1 GFLOP worst-case per estimate
at HE40, and slated to run per-link (×12). Everything else is allocation churn and
duplicated work that can be removed with **bit-exact** refactors (F1F12), plus one
genuine memory bug-class issue: unbounded WorldGraph growth at 20 Hz
(`engine/src/lib.rs:346-352`). The build profile is already optimal; remaining toolchain
gains (target-cpu, wasm simd128, PGO) are determinism-safe and cheap. The determinism
constraint is workable because the repo already owns the right tools deterministic
proof runners, an A/B gate pattern, and a per-cycle witness so float-changing
optimizations become scheduled, witnessed hash-regeneration events rather than risks.