From 2aee4d21cfee09bc6bea0a3255b12bb3b8cd04e1 Mon Sep 17 00:00:00 2001 From: ruv Date: Fri, 8 May 2026 12:06:26 -0400 Subject: [PATCH] docs(temporal): README for wifi-densepose-temporal (#513) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the documentation gap on the host-side ADR-096 surface. The crate has 7 commits, 5 source modules, 4 test suites, 2 examples, and a captured benchmark; reviewers and downstream consumers needed a landing page. Sections: - Quick start (5-line forward + 7-line streaming) - Backends + selection rule (SparseGqa MHA-vs-GQA dispatch) - Streaming semantics (cache lifetime, eviction policy, the headline correctness test) - Weight blob format with the host/firmware lockstep note - Examples (init_random_blob, bench_speedup) with run lines - Tests (18/18 passing as of 247794a2c, broken down by suite) - Status of ADR-096 claims with concrete evidence for each - Status of ADR-095 surface (firmware) + the toolchain blocker - Carry-forward of the open questions still applicable from §8 The README intentionally cross-links to: - docs/adr/ADR-096 for design rationale - components/ruv_temporal/ README for the firmware mirror - benches_results.md for the captured speedup curve Doesn't claim more than is proven. Each ADR-096 claim either has a test or a benchmark cited as evidence; the partial claim (30-100× at long windows) explicitly says 21× was the measured number, not 30×. Co-Authored-By: claude-flow --- v2/crates/wifi-densepose-temporal/README.md | 146 ++++++++++++++++++++ 1 file changed, 146 insertions(+) create mode 100644 v2/crates/wifi-densepose-temporal/README.md diff --git a/v2/crates/wifi-densepose-temporal/README.md b/v2/crates/wifi-densepose-temporal/README.md new file mode 100644 index 00000000..a5915d89 --- /dev/null +++ b/v2/crates/wifi-densepose-temporal/README.md @@ -0,0 +1,146 @@ +# `wifi-densepose-temporal` + +AETHER temporal head over CSI feature windows. Sparse-GQA attention via +`ruvllm_sparse_attention`, with a streaming `KvCache` decode path for +online re-ID and incremental classification. + +Implements the host side of [ADR-096](../../../docs/adr/ADR-096-aether-temporal-head-sparse-gqa.md); +mirrored on the firmware side at +[`firmware/esp32-csi-node/components/ruv_temporal/`](../../../firmware/esp32-csi-node/components/ruv_temporal/). + +## Quick start + +```rust +use wifi_densepose_temporal::{AetherTemporalHead, TemporalHeadConfig, Tensor3}; + +// Default config matches AETHER's MQA shape: +// q_heads=4, kv_heads=1, head_dim=32, window=32, block_size=16, causal=true +let cfg = TemporalHeadConfig::default_aether(); +let head = AetherTemporalHead::new(&cfg)?; + +// Prefill: full window forward +let out = head.forward(&q, &k, &v)?; // shape: (window, q_heads, head_dim) + +// Streaming: O(log T) per new frame against an accumulated cache +let mut cache = head.make_cache(/* capacity */ 1024)?; +for new_frame in stream { + let (q1, k1, v1) = project(&new_frame); // each seq=1 + let attn_out = head.step(&q1, &k1, &v1, &mut cache)?; + // pool, run classifier head, etc +} +``` + +## Backends + +`TemporalBackendKind` selects between two paths (ADR-096 §4.4): + +| Backend | When | Cost | +|---|---|---| +| `SparseGqa` | New training runs (default) | O(N log N) prefill, O(log T) decode | +| `Dense` | Reserved for back-compat | Returns `TemporalError::DenseBackendNotImplemented` for now (ADR-096 §4.4 follow-up) | + +The `SparseGqa` backend dispatches at `forward()` time: + +- `q_heads == kv_heads` → `forward()` (sparse MHA) +- `q_heads != kv_heads` → `forward_gqa()` (GQA / MQA) + +## Streaming semantics + +`step()` is the structural advantage over dense MHA — append `(k, v)` to the +caller-owned cache and decode the new `q` in O(log T) per token. + +- `q`/`k`/`v` must each have `seq == 1` (multi-token q is the prefill path). +- `KvCache` lifetime is the caller's. Per ADR-096 §8.5 the natural lifetime + is per-`PoseTrack` (re-ID) or per-session (online classification). When + the track drops, drop the cache. +- Cache fills are the caller's problem. Upstream H2O heavy-hitter eviction + is opt-in; this crate's wrapper doesn't pre-pick a policy. + +Headline correctness test: `streaming_step_matches_forward_at_last_position` +proves token-by-token `step()` produces the same output as a single-shot +`forward()` at position `N-1`, max_abs_err < 1e-3. + +## Weight blob format (`.rvne`) + +Wire format for transferring trained weights to the firmware. +[`weights.rs`](src/weights.rs) defines the host side; the firmware mirror +at [`components/ruv_temporal/src/weights.rs`](../../../firmware/esp32-csi-node/components/ruv_temporal/src/weights.rs) +parses it bit-for-bit. + +| Section | Bytes | Contents | +|---|---|---| +| Header | 24 | magic `RVNE` / version 1 / dtype flag (FP32 \| FP16) / dims | +| Weights | variable | flat per-layer arrays, dtype as flagged | +| Footer | 4 | CRC32-IEEE over everything before | + +Hard-break versioning: bumping `version` means firmware refuses to load. +Adding fields goes behind reserved flag bits, never by reorder. + +```rust +let blob = WeightBlob::new(header, weights)?; +let bytes = blob.serialize(); // host +// ... +let view = WeightBlobView::parse(&bytes)?; // firmware (no_std, borrowed slice) +``` + +## Examples + +| Example | Run | +|---|---| +| `init_random_blob` | `cargo run -p wifi-densepose-temporal --example init_random_blob -- model.rvne` — emits a 41 KB AETHER-shaped `.rvne` | +| `bench_speedup` | `cargo run -p wifi-densepose-temporal --example bench_speedup --release` — sparse-vs-dense speedup curve | + +Captured benchmark results: [`benches_results.md`](benches_results.md). + +## Tests + +``` +cargo test -p wifi-densepose-temporal +``` + +| Suite | Tests | What | +|---|---|---| +| `tests/smoke.rs` | 5 | Forward at AETHER default, MHA dispatch, GQA dispatch, dense-rejected, invalid-GQA-rejected, N=1000 long window | +| `tests/weight_blob.rs` | 8 | Roundtrip FP32 + FP16, bad magic / version / size / CRC / GQA, layout anchor | +| `tests/blob_e2e.rs` | 2 | Realistic 25 KB+ filesystem roundtrip, deterministic seed reproducibility | +| `tests/streaming.rs` | 3 | step()-matches-forward at last position, multi-token-q rejected, make_cache shape | + +**18/18 passing as of commit `247794a2c`.** + +## Status of ADR-096 claims + +| Claim | Status | Evidence | +|---|---|---| +| O(N log N) sparse vs O(N²) dense | **Empirically confirmed** | `bench_speedup` measures 21.21× at N=1024; cost-growth ratios match theory (dense 274×, sparse 24× for 16× more tokens) | +| `step()` matches `forward()` at last position | **Proven** | `streaming_step_matches_forward_at_last_position` test | +| Wire format consistent host↔firmware | **Proven** | 3 sites with same magic/version/CRC, 41-KB blob roundtrips through filesystem in tests | +| Path-vendored, no crates.io coupling | **Confirmed** | Workspace dep is `path = "../vendor/ruvector/crates/ruvllm_sparse_attention"` | +| 30–100× at long windows | **Partial** | 21.21× measured at N=1024 in single-run wall-clock; higher N + criterion would push closer to the 30× lower bound | + +## Status of ADR-095 surface (firmware) + +`AetherTemporalHead` is the host-side analog of the firmware on-device path. +The firmware Rust component scaffold and C-side wiring are complete; the +Rust component cross-compile is currently blocked by an upstream esp-rs +nightly-bundle inconsistency. See +[`components/ruv_temporal/README.md`](../../../firmware/esp32-csi-node/components/ruv_temporal/README.md) +for details. + +When the toolchain unblocks, no changes to this crate are needed — +`weights.rs` is already mirrored, `Tensor3` and `KvCache` cross the +boundary unchanged, and the C ABI consumed by `temporal_task.c` is stable. + +## Open questions (still applicable from ADR-096 §8) + +- The deployed AETHER tracker's actual window length is what determines + whether sparse pays off in production. At training default of 100 frames, + sparse begins to win (5–6× at N=128–256). At the 1000-frame roadmap + target, the speedup is much larger (21× measured). +- Streaming GQA decode is an upstream roadmap item; the current + `decode_step` is wired for the MHA branch. When upstream ships GQA + decode (post-ADR-189/190), `AetherTemporalHead.step` gets a GQA dispatch + branch added without any public API change. + +## License + +MIT.