docs(adr): ADR-095/096 — sparse attention on ESP32 + AETHER GQA head (#513)

Two Proposed ADRs covering the integration of vendored ruvllm_sparse_attention v0.1.1 (released 2026-05-07, no_std + alloc validated on real ESP32-S3 per upstream ADR-192).

* ADR-095: adds a learned temporal head to the ESP32-S3 firmware via a Rust component compiled --no-default-features against the 376 KB rlib. Runs alongside the existing physics-only DSP, gated behind a Kconfig (8 MB only initially). Use cases: gesture recognition, fall classification with sequence context, breathing-quality scoring, on-device anomaly detection. Builds on ADR-018, ADR-039, ADR-081.
* ADR-096: adopts forward_gqa + KvCache for the AETHER (ADR-024) contrastive CSI embedding's temporal aggregation. Path-vendored workspace dep, A/B gate before flipping the inference default. ~30-100x speedup at long windows; streaming decode goes from O(N^2) recompute to O(log T) per new frame.

Refs #513
2026-05-07 15:14:38 -04:00 · 2026-05-07 15:14:38 -04:00 · 684ef4f1a5
parent e7904786f0
commit 684ef4f1a5
2 changed files with 758 additions and 0 deletions
--- a/docs/adr/ADR-095-on-esp32-temporal-modeling-sparse-attention.md
+++ b/docs/adr/ADR-095-on-esp32-temporal-modeling-sparse-attention.md
@@ -0,0 +1,369 @@
+# ADR-095: On-ESP32-S3 Temporal Modeling at the Edge via `ruvllm_sparse_attention` (no_std)
+
+| Field       | Value                                                                                                  |
+|-------------|--------------------------------------------------------------------------------------------------------|
+| **Status**  | Proposed (2026-05-07)                                                                                  |
+| **Date**    | 2026-05-07                                                                                             |
+| **Authors** | ruvnet, claude-flow                                                                                    |
+| **Related** | ADR-018, ADR-024, ADR-039, ADR-040, ADR-061, ADR-081, ADR-091; upstream ADR-189, ADR-190, ADR-192      |
+| **Branch**  | `feat/ruvllm-sparse-attention-edge`                                                                    |
+| **Tracking**| #513                                                                                                   |
+
+---
+
+## 1. Context
+
+Today the ESP32-S3 firmware in `firmware/esp32-csi-node/main/` does
+**physics-only** sensing on-device. The pipeline in `edge_processing.c`
+runs on Core 1 and produces:
+
+- Adaptive presence detection (`presence_score`).
+- Breathing-band (0.1–0.5 Hz) and heart-rate-band (0.8–2.0 Hz) biquad
+  IIR bandpass + zero-crossing BPM estimators.
+- A motion / fall flag (`flags` bits 0–2 in `edge_vitals_pkt_t` magic
+  `0xC5110002`, plus fused mmWave variant `0xC5110004` per ADR-063).
+- ADR-081 `rv_feature_state_t` (60 B at magic `0xC5110006`) emitted at
+  1–10 Hz from the adaptive controller's fast loop.
+
+There is **no learned model of any kind on the MCU**. The closest things
+are: ADR-039 Tier-1 compressed-CSI emission, ADR-040 WASM modules
+(Tier-3, but used by the user for ad-hoc DSP, not transformer
+inference), and the Rust-side AETHER embeddings (ADR-024) which run
+on the host, not the node. Anomaly detection that needs *temporal
+context* — "is this fall pattern consistent with a fall, or just a
+sit-down?" — is structurally absent. The fall debounce in v0.6.x
+(3-frame consecutive + 5 s cooldown, raised threshold 2.0 → 15.0 rad/s²)
+is a hand-tuned heuristic exactly because the firmware has nothing
+better to reason with.
+
+A second pressure point: the Tmr Svc / FreeRTOS stack is already
+sensitive. `edge_processing.c` lines 47–48 explicitly note that
+`process_frame + update_multi_person_vitals` combined used ~6.5–7.5 KB
+of the 8 KB task stack and that **scratch buffers were moved to static
+storage to avoid stack overflow.** Any new heavyweight workload — and
+a transformer forward pass is heavyweight — must therefore live in
+**its own FreeRTOS task with its own task stack**, not piggyback on
+the existing edge DSP task.
+
+The vendored crate `ruvllm_sparse_attention` v0.1.1 (released 2026-05-07,
+synced today at `vendor/ruvector/crates/ruvllm_sparse_attention/`)
+removes the previously-blocking `std` requirement. Per upstream
+**ADR-192**, the crate now compiles cleanly to
+`xtensa-esp32s3-none-elf` via `espup`, with a measured **376 KB
+release rlib**, zero runtime dependencies beyond `libm`, and was
+validated on a real ESP32-S3 (rev v0.2, 16 MB flash). It exposes
+`SubquadraticSparseAttention`, `KvCache` / `KvCacheF16`, `FastGrnnGate`,
+`IncrementalLandmarks`, `RuvLlmSparseBlock`, and a `Tensor3` value
+type. The kernel is O(N log N) by default and near-linear O(N) when
+the FastGRNN salience gate is enabled.
+
+This is the first time we have had a credible path to **on-device
+transformer inference for CSI** without a Python runtime, without
+TFLite, and without a coprocessor. It is also the right moment to
+decide *whether* we want it before code starts to land.
+
+---
+
+## 2. Decision
+
+Add a learned **temporal head** to the ESP32-S3 firmware running on
+the node itself, using `ruvllm_sparse_attention` compiled
+`--no-default-features` (no_std + alloc, optionally `+fp16`), driven
+by a small Rust component integrated into the ESP-IDF build. The
+temporal head runs **alongside** the existing physics-only pipeline,
+not as a replacement — physics gives us breathing/heart-rate/presence,
+the temporal head gives us classification and sequence-aware reasoning.
+
+Concretely:
+
+1. The temporal head consumes a rolling window of feature vectors
+   (initially the same `rv_feature_state_t` floats already produced
+   by ADR-081, plus optionally a small projection of recent CSI
+   amplitude statistics), length `N` ∈ [100, 500] frames, sampled at
+   the controller's fast-loop rate.
+2. It outputs a small set of **class logits** for the active
+   detection task. The first four deployable tasks are listed in
+   §4.
+3. It runs in its own FreeRTOS task on Core 1 (or pinned to whichever
+   core the WiFi driver is *not* on), at a cadence slower than the
+   fast loop — initially 1 Hz, classification-on-demand.
+4. The kernel is invoked through a thin C ABI (`ruv_temporal_init`,
+   `ruv_temporal_push`, `ruv_temporal_classify`) exported from
+   a Rust static library linked into the ESP-IDF build the same way
+   the existing Tier-3 components are linked.
+5. Weights are stored as a flat `f32` (or `f16` with the `fp16`
+   feature) blob in the ESP32-S3 flash, loadable from either an
+   embedded `EMBED_FILES` resource (compile-time bake-in) or NVS
+   (post-flash provisioning, mirroring ADR-040's WASM-upload path).
+6. The temporal head is gated behind a Kconfig option
+   `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`, **default off**, and is only
+   compiled into the 8 MB build profile until the flash math in §6
+   demonstrates 4 MB headroom.
+
+This ADR authorizes the architecture; it does **not** ship any of
+the firmware-side or training-side changes. Implementation lands in
+follow-up issues per the roadmap in §7.
+
+---
+
+## 3. Approach
+
+### 3.1 Build integration
+
+ESP-IDF v5.4 already supports Rust components via the
+`rust-esp32`-style template (a CMake `idf_component_register` shim
+that runs `cargo build --target xtensa-esp32s3-none-elf` and links
+the resulting static library). The new component lives at
+`firmware/esp32-csi-node/components/ruv_temporal/`:
+
+```
+ruv_temporal/
+  CMakeLists.txt          # component manifest, Rust build invocation
+  Cargo.toml              # crate config: no_std, deps on ruvllm_sparse_attention
+  build.rs                # generates the C header from #[no_mangle] exports
+  src/lib.rs              # public C ABI: init/push/classify/destroy
+  src/window.rs           # rolling frame buffer
+  src/weights.rs          # NVS / EMBED_FILES weight loader
+  include/ruv_temporal.h  # generated; consumed by edge_processing.c
+```
+
+Cargo features compiled in: `["fp16"]`. **Not** `parallel` (rayon
+needs threads, breaks no_std). **Not** `std`.
+
+### 3.2 Interface
+
+The C ABI is intentionally narrow. It does not expose `Tensor3`,
+attention configs, or any Rust types — only `float*` buffers and
+opaque handles:
+
+```c
+typedef struct ruv_temporal_ctx ruv_temporal_ctx_t;
+
+esp_err_t ruv_temporal_init(const uint8_t *weights, size_t wlen,
+                            uint32_t input_dim, uint32_t window,
+                            ruv_temporal_ctx_t **out_ctx);
+esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
+esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
+                                float *logits, uint32_t n_classes);
+void      ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);
+```
+
+`push` is the hot path and must be cheap: it only writes into a
+ring buffer, held in PSRAM if available and in internal DRAM
+otherwise. `classify` runs the actual sparse attention forward pass
+and is the budget-heavy call.
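
A minimal host-side sketch of the ring-buffer behaviour that `push` relies on, assuming `input_dim` and `window` are fixed at init time. `FrameWindow` and its methods are hypothetical names used for illustration only; they are not the contents of `src/window.rs`, and the firmware version would be `no_std` and would place the buffer in PSRAM rather than use `std::vec::Vec`.

```rust
/// Hypothetical rolling frame buffer: keeps the most recent `window` frames.
pub struct FrameWindow {
    data: Vec<f32>, // window * input_dim floats, allocated once at init
    input_dim: usize,
    window: usize,
    head: usize, // slot the next frame will be written to
    len: usize,  // number of valid frames, saturates at `window`
}

impl FrameWindow {
    pub fn new(input_dim: usize, window: usize) -> Self {
        assert!(input_dim > 0 && window > 0);
        Self { data: vec![0.0; input_dim * window], input_dim, window, head: 0, len: 0 }
    }

    /// Hot path: one length check and one copy, no allocation.
    pub fn push(&mut self, frame: &[f32]) -> Result<(), &'static str> {
        if frame.len() != self.input_dim {
            return Err("frame length does not match input_dim");
        }
        let start = self.head * self.input_dim;
        self.data[start..start + self.input_dim].copy_from_slice(frame);
        self.head = (self.head + 1) % self.window;
        self.len = (self.len + 1).min(self.window);
        Ok(())
    }

    /// Copies the valid frames out in oldest-to-newest order,
    /// which is the layout a `classify`-style call would consume.
    pub fn ordered(&self) -> Vec<f32> {
        let mut out = Vec::with_capacity(self.len * self.input_dim);
        let first = (self.head + self.window - self.len) % self.window;
        for i in 0..self.len {
            let start = ((first + i) % self.window) * self.input_dim;
            out.extend_from_slice(&self.data[start..start + self.input_dim]);
        }
        out
    }
}
```

Keeping the reorder cost in `ordered` rather than in `push` matches the split described above: the per-frame call stays constant-time, and the once-per-classification call pays for the copy.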
+
+### 3.3 Task topology
+
+A new task `ruv_temporal_task` with its own 16 KB stack, pinned to
+the same core as the edge DSP task (Core 1), fed via a FreeRTOS
+queue from the adaptive controller's fast loop. We do **not** call
+the kernel from the existing edge task — the edge stack is already
+near-full per the comment at `edge_processing.c:47-48` and recent
+fall-debounce / Tmr-Svc-stack work.
+
+### 3.4 Memory budget (per inference)
+
+With sequence length `N = 256`, `d_model = 32`, `n_heads = 4`,
+`head_dim = 8`, 1–2 `RuvLlmSparseBlock` layers, `block_size = 64`,
+and a local attention `window = 64`:
+
+- Weights: ~5–15 KB (single block, INT8 quant deferred to a later
+  ADR; FP16 default).
+- KV cache (FP16, full sequence): `2 * 256 * 4 * 8 * 2 B = 32 KB`
+  per layer (K and V, 256 frames, 4 heads, 8 dims, 2 bytes each).
+- Activations (peak, with `forward_flash` tiling): ≈ 2 KB.
+- Working set: < 64 KB for a single block. Comfortable in PSRAM,
+  possible in ISR-safe internal SRAM.
+
+These are first-pass estimates; the precise numbers come out of the
+`forward_flash` benchmark on real hardware, which is an exit
+criterion in §7.
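
The KV-cache term in the budget above can be checked with a few lines of arithmetic. This is a sketch, not code from the crate: `kv_cache_bytes` is a hypothetical helper, and it assumes the cache stores one K and one V tensor per layer with no padding or eviction metadata.

```rust
/// Bytes for a K+V cache:
/// layers * 2 tensors * frames * kv_heads * head_dim * bytes per element.
pub const fn kv_cache_bytes(
    layers: usize,
    frames: usize,
    n_kv_heads: usize,
    head_dim: usize,
    elem_bytes: usize,
) -> usize {
    layers * 2 * frames * n_kv_heads * head_dim * elem_bytes
}

/// The configuration quoted in this section: N = 256, 4 heads, head_dim = 8, FP16.
pub const SINGLE_LAYER_FP16: usize = kv_cache_bytes(1, 256, 4, 8, 2);
```

For the quoted configuration the expression evaluates to 32,768 B per layer. Sharing K/V across two KV heads (a GQA layout) would halve that to 16,384 B, and a second layer doubles whichever figure applies, so the layer count matters as much as the head layout for the 64 KB target.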
+
+### 3.5 Compatibility with ADR-081 / ADR-039 / ADR-018
+
+The temporal head is a **consumer** of the same feature stream
+already flowing in the firmware. It does not alter:
+
+- ADR-018 raw CSI frame layout (`0xC5110001`).
+- ADR-039 Tier-1 compressed CSI (`0xC5110005`) or vitals
+  (`0xC5110002`).
+- ADR-063 fused vitals (`0xC5110004`).
+- ADR-081 `rv_feature_state_t` (`0xC5110006`) — this is the primary
+  input we tap.
+
+If the temporal head fires a classification, the result rides on a
+new `0xC5110007` packet (small: class id, confidence, monotonic seq,
+ts_us, CRC32). Allocation of that magic is deferred to the
+implementation PR — this ADR reserves the *concept*, not the byte
+layout.
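
To make the packet concept concrete, here is one possible encoding of the fields listed above. Everything about the layout is an assumption for illustration (field order, widths, little-endian encoding, the 8-bit confidence, the two reserved bytes, and CRC coverage of the first 20 bytes); the ADR explicitly defers the real byte layout and the magic allocation to the implementation PR. The CRC routine is the standard reflected CRC-32 (polynomial `0xEDB88320`).

```rust
/// Magic value floated in this ADR; its allocation is not final.
pub const MAGIC_TEMPORAL: u32 = 0xC511_0007;

/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in bytes {
        crc ^= b as u32;
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical 24-byte layout:
/// magic(4) | class_id(1) | confidence_q8(1) | reserved(2) | seq(4) | ts_us(8) | crc32(4)
pub fn encode_packet(class_id: u8, confidence: f32, seq: u32, ts_us: u64) -> [u8; 24] {
    let mut p = [0u8; 24];
    p[0..4].copy_from_slice(&MAGIC_TEMPORAL.to_le_bytes());
    p[4] = class_id;
    // Confidence in [0, 1] quantised to 0..=255.
    p[5] = (confidence.clamp(0.0, 1.0) * 255.0).round() as u8;
    // p[6..8] stay zero (reserved).
    p[8..12].copy_from_slice(&seq.to_le_bytes());
    p[12..20].copy_from_slice(&ts_us.to_le_bytes());
    let crc = crc32(&p[..20]);
    p[20..24].copy_from_slice(&crc.to_le_bytes());
    p
}
```

A receiver would recompute `crc32` over the first 20 bytes and compare it with the trailing four before trusting `class_id`.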
+
+---
+
+## 4. Use cases that motivate this
+
+| Task | Why temporal context matters | Window | Class count |
+|------|------------------------------|--------|-------------|
+| **Gesture recognition** (wave / point / clap / kick) | Single-frame CSI snapshots can't disambiguate gestures from random motion. ~100-frame windows capture full gesture trajectories. | 100 frames @ 50 Hz = 2 s | 4–8 |
+| **Fall classification with sequence context** | The current heuristic ("> 15 rad/s² for 3 consecutive frames + 5 s cooldown") was raised to suppress false positives. A learned temporal head can distinguish a fall (rapid descent then stillness) from a sit-down (descent then sustained micro-motion) using the same input window. | 200 frames @ 50 Hz = 4 s | 3 (fall / sit / nothing) |
+| **Breathing-quality scoring** | Today's pipeline emits a BPM and a confidence float. A temporal head trained on labeled apnea / shallow / paradoxical / normal sequences can output a 4-class quality label that downstream consumers can render in one glance. | 500 frames @ 50 Hz = 10 s | 4 |
+| **"Is this normal for this room/time" anomaly detection** | Per-room SONA profiles (ADR-005) capture environment statistics, but anomaly *temporal shape* is currently checked host-side via embedding distance (ADR-024 §2.4 `temporal_baseline` index). A small on-device classifier can flag ahead of host roundtrip. | 300 frames | 2 (normal / anomalous) |
+
+These four cover the visible product gaps in the v0.6.x line.
+Gesture recognition is the headline; fall classification is the
+highest-impact for the eldercare scenarios v0.5.4 was tuned for.
+
+---
+
+## 5. Alternatives considered
+
+| Option | Why rejected |
+|--------|--------------|
+| **TFLite Micro** | Heavier runtime (~150 KB code + interpreter), pulls in C++ STL surface, no Rust-native API. Does not benefit from sparse attention specifically. We'd be re-paying the cost of a full inference framework when we only need one kernel. |
+| **Run all classifiers server-side** | Costs a full Tier-1 CSI uplink (~50–70 KB/s/node per ADR-039) just to feed a remote classifier, then a roundtrip back. Defeats the point of ADR-081's compact feature stream and makes the system worthless when the backhaul is down. Also leaks raw CSI to the network for purposes the user did not opt into. |
+| **Stay physics-only forever** | Cleanest from a maintenance standpoint, but it structurally rules out gesture recognition, and the fall-debounce hack will keep accreting per-deployment knobs. The product space already has commodity physics-only firmware (Bosch presence sensors, etc.); on-device transformer inference for CSI is what would *differentiate* RuView. |
+| **Use `ruvector-attention` (already in workspace) on-device** | `ruvector-attention` is `std`-bound today; doesn't compile to `xtensa-esp32s3-none-elf` without a port comparable in scope to upstream ADR-192. Even if ported, it doesn't give us GQA + streaming KV cache, which is the structural capability the new crate adds. |
+| **Wait for IEEE 802.11bf** | Different problem (standardised CSI exposure across vendors). Doesn't address whether the model runs on-device or off. |
+
+---
+
+## 6. Consequences
+
+### Positive
+
+- **Genuinely novel.** To our knowledge, no competing CSI-sensing
+  project ships transformer inference on the MCU itself. The closest
+  peers (Espressif's ESP-DL, Edge Impulse) are non-attention CNN/RNN
+  pipelines.
+- **Latency.** Classification result is local — no backhaul,
+  no host roundtrip, sub-100 ms gesture-to-action.
+- **Privacy.** Raw CSI never leaves the node for these tasks.
+- **Reuses the ADR-081 feature stream** — the temporal head is a
+  consumer of the existing 60 B `rv_feature_state_t`, not a new
+  uplink format.
+- **Validated kernel.** Per upstream ADR-192, the no_std build was
+  validated on real ESP32-S3 hardware (MAC `ac:a7:04:e2:66:24`).
+  We are not betting on a paper crate.
+
+### Negative / tradeoffs
+
+- **Flash budget pressure on 4 MB boards.** Per `partitions_4mb.csv`,
+  each OTA slot is `0x1D0000` bytes (1,856 KiB, about 1.81 MiB). The
+  current build is ~853 KiB. Adding a 376 KB rlib plus weights brings
+  us to ~1.3 MB, still under the slot ceiling but with limited
+  headroom for other
+  growth. **Decision: temporal head is 8 MB-only initially**, gated
+  behind `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`. 4 MB enablement is a
+  separate ADR after we measure the actual incremental link size
+  (the 376 KB upstream number is for the rlib in isolation; the
+  linked-and-stripped final binary delta will be smaller).
+- **Rust toolchain dependency.** The ESP-IDF build now needs
+  `espup` + `cargo +esp` to be present on every developer machine
+  and CI runner. This is a real hurdle on Windows — see
+  `CLAUDE.local.md` for the existing Python-subprocess wrapper
+  required to run ESP-IDF cleanly. CI will need a parallel
+  Rust-toolchain step.
+- **One more thing to test.** QEMU (ADR-061) does not run the
+  ESP32-S3 Xtensa Rust binary today. The QEMU validator pipeline
+  will need a build matrix entry for "Rust component compiled but
+  classifier disabled" at minimum.
+- **Stack overflow risk.** Same hazard the v0.6.4 work just
+  navigated. Mitigated by §3.3 (own task, own stack); this needs
+  to be a code-review checklist item.
+- **Weights provenance.** Once we ship a model, we need a story
+  for *which model*, signed by *whom*, retrained *how often*. See
+  Open Questions §8.
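
The flash-budget bullet above reduces to a subtraction, sketched here. Assumptions: the upstream "376 KB" is treated as KiB, the full rlib size is used as a worst case (the ADR expects the linked-and-stripped delta to be smaller), and weights are taken at the 15 KB upper estimate from §3.4. `slot_headroom` is a hypothetical helper, not part of the build.

```rust
/// OTA slot size as quoted from `partitions_4mb.csv` in this ADR.
pub const OTA_SLOT_BYTES: u32 = 0x1D_0000;

const KIB: u32 = 1024;

/// Remaining bytes in an OTA slot, or `None` if the image would not fit.
pub fn slot_headroom(current_image: u32, added: u32) -> Option<u32> {
    OTA_SLOT_BYTES.checked_sub(current_image.checked_add(added)?)
}

/// Worst-case headroom with the numbers quoted above.
pub fn worst_case_headroom() -> Option<u32> {
    slot_headroom(853 * KIB, (376 + 15) * KIB)
}
```

Under these assumptions roughly 612 KiB of the slot remains, which is why the bullet describes the fit as possible but tight once other firmware growth is counted.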
+
+### Neutral
+
+- ADR-040's WASM Tier-3 path is **not** superseded. WASM remains
+  the right choice for user-uploaded modules. The temporal head is
+  a first-party signed-by-us component, with a different deploy
+  story.
+- The host-side ADR-024 AETHER pipeline is unchanged by this ADR.
+  ADR-096 covers the host-side use of the same crate.
+
+---
+
+## 7. Roadmap
+
+| Phase | Scope | Gating |
+|-------|-------|--------|
+| 0 | This ADR + ADR-096 land. No code. | Maintainer review of #513. |
+| 1 | New crate `wifi-densepose-temporal` (host-side only): defines the temporal-head architecture, training script, weight serialization format. | Phase 0 accepted. |
+| 2 | `ruv_temporal` ESP-IDF component scaffolding — empty kernel, just the C ABI and ring buffer. Compiles cleanly into 8 MB firmware. Adds ~5 KB to binary. | Phase 1 produces a serialised set of weights. |
+| 3 | Wire `ruvllm_sparse_attention` `forward` (not yet `forward_gated_with_fastgrnn`) into the component. First on-target classification benchmark on COM7. Gate: end-to-end inference ≤ 50 ms with `N = 256`, no stack overflow under 24 h soak. | Phase 2 ABI stable. |
+| 4 | First trained classifier (gesture or fall, whichever has labelled data first). Hardware A/B: temporal-head decision vs current heuristic on a held-out set. Promotion criterion: temporal head matches or beats heuristic on F1 *and* false-positive rate. | Phase 3 latency gate met. |
+| 5 | 4 MB profile gating — measure actual binary delta, decide whether to enable on SuperMini. | Phase 4 in production on 8 MB. |
+| 6 | `forward_gated_with_fastgrnn` for long-window tasks (breathing-quality at N = 500). | Phase 4 stable. |
+
+---
+
+## 8. Open questions
+
+1. **Who trains the temporal heads?** Two options:
+   (a) host-side training on captured `rv_feature_state_t` traces
+   labelled in-app, then export to flat-buffer weights;
+   (b) teacher-distillation from the larger AETHER model (ADR-024)
+   running off-device, using soft labels. Option (b) is more
+   data-efficient but couples this ADR's ship date to ADR-024's
+   training-pipeline maturity. Open.
+2. **How are weights flashed?** Three options, in increasing
+   capability: NVS blob (small, safe, 4–8 KB ceiling per key),
+   `EMBED_FILES` baked into the firmware image (no runtime update),
+   OTA-updateable partition (mirrors ADR-040 RVF upload path,
+   biggest engineering cost). Phase 2/3 will pick one; my prior is
+   `EMBED_FILES` for the first model, OTA partition once we have
+   more than one.
+3. **Does the 376 KB rlib figure scale?** Upstream measured
+   376 KB for the kernel plus the embedding/projection weights for
+   *their* test config. Adding 1–2 `RuvLlmSparseBlock` layers with
+   embedding/projection weights sized to the actual CSI feature
+   dimension may push this higher. Phase 2 will measure the
+   on-target stripped-binary delta directly; if the delta exceeds
+   600 KB we revisit the 4 MB story sooner.
+4. **What window length is right for fall classification?**
+   200 frames at 50 Hz = 4 s feels right based on the v0.6.4
+   debounce numbers (3-frame consecutive + 5 s cooldown is
+   essentially a 4-second decision window already). Empirical, not
+   architectural — set in Phase 4.
+5. **Quantisation.** First model ships FP16 (KV cache feature flag
+   already supports this). INT8 for both weights and activations
+   is a follow-up; the current crate has no INT8 path so it would
+   be a separate kernel.
+6. **What happens when the controller is in `RV_PROFILE_PASSIVE_LOW_RATE`?**
+   The fast loop slows down, so the input frame rate to the
+   temporal head drops. Either the head needs to handle variable
+   sample rate (resample at push time) or it stops emitting until
+   the controller goes back to active. Phase 1 design call.
+
+---
+
+## 9. Acceptance criteria
+
+This ADR is **Accepted** once:
+
+1. Maintainer review on #513 confirms the architecture.
+2. The follow-up implementation issue is filed and references this
+   ADR plus ADR-096 by number.
+3. ADR index in `docs/adr/README.md` (if present) has an ADR-095
+   row.
+
+This ADR is **Implemented** once:
+
+1. Phase 3 is in `main` with the gating Kconfig off by default.
+2. A Phase-4 hardware A/B has been published (witness-bundle
+   compatible per ADR-028).
+3. The QEMU validator (ADR-061) has at minimum a "compiles, doesn't
+   run" check for the Rust component.
+
+---
+
+## 10. Related
+
+ADR-018 (binary CSI frame), ADR-024 (AETHER contrastive embedding —
+host-side counterpart, see ADR-096), ADR-039 (edge intelligence
+tiers), ADR-040 (WASM Tier-3 modules — the *other* extensibility
+path), ADR-061 (QEMU CI), ADR-081 (adaptive controller, mesh plane,
+`rv_feature_state_t`), ADR-091 (stand-off radar tier — adjacent
+edge-intelligence ADR), upstream ADR-189 (KV cache incremental
+decode), upstream ADR-190 (GQA/MQA), upstream ADR-192 (no_std +
+alloc on ESP32-S3 — the structural unblock that makes this ADR
+possible).
--- a/docs/adr/ADR-096-aether-temporal-head-sparse-gqa.md
+++ b/docs/adr/ADR-096-aether-temporal-head-sparse-gqa.md
@@ -0,0 +1,389 @@
+# ADR-096: AETHER Temporal Head via `ruvllm_sparse_attention::forward_gqa` + Streaming KV Cache
+
+| Field       | Value                                                                                 |
+|-------------|---------------------------------------------------------------------------------------|
+| **Status**  | Proposed (2026-05-07)                                                                 |
+| **Date**    | 2026-05-07                                                                            |
+| **Authors** | ruvnet, claude-flow                                                                   |
+| **Related** | ADR-014, ADR-016, ADR-024, ADR-095; upstream ADR-189, ADR-190, ADR-192                |
+| **Branch**  | `feat/ruvllm-sparse-attention-edge`                                                   |
+| **Tracking**| #513                                                                                  |
+
+---
+
+## 1. Context
+
+ADR-024 ("Project AETHER") specifies a contrastive CSI embedding
+model on top of the existing `CsiToPoseTransformer` backbone. It
+adds a 2-layer projection head to the per-keypoint features and
+trains it with InfoNCE + VICReg + (optional) cross-modal alignment.
+The **temporal aggregation** that turns per-frame backbone features
+into a window-level representation is described at the level of
+"a transformer encoder over the CSI window" — but ADR-024 does not
+pin a specific attention kernel. In the current code:
+
+- `v2/crates/wifi-densepose-train/src/model.rs` uses
+  `ruvector_attention::ScaledDotProductAttention` (line 34) and
+  applies `apply_antenna_attention` over the antenna-path dimension
+  and `apply_spatial_attention` over the spatial location dimension.
+  Both are dense.
+- The training-side temporal pooling currently runs at
+  `window_frames = 100` by default (`config.rs:165`), with
+  `proof.rs` and `trainer.rs` using shorter test windows of 4 and 2
+  respectively.
+- `v2/crates/wifi-densepose-signal/src/ruvsense/pose_tracker.rs`
+  consumes a 128-dim AETHER re-ID embedding (lines 22 and 263) but does
+  not perform the temporal aggregation itself — that happens
+  upstream.
+
+So the temporal head is a real seam in the codebase, but its
+specific attention kernel is *currently dense* and *currently not a
+named architectural decision*. This ADR makes that decision.
+
+The vendored `ruvllm_sparse_attention` v0.1.1 (synced today,
+released 2026-05-07) provides a different kind of temporal kernel:
+
+- **Subquadratic O(N log N)** sparse attention (`forward`,
+  `forward_flash`).
+- **Grouped-Query / Multi-Query Attention** (`forward_gqa`,
+  `forward_gqa_flash`) — shares K/V across query heads, the
+  pattern Mistral-7B and Llama-3 use.
+- **Streaming KV cache** (`KvCache`, `KvCacheF16`) with H2O
+  heavy-hitter eviction, allowing token-by-token decode in
+  **O(log T)** per step against an accumulated cache. See upstream
+  ADR-189.
+- **FastGRNN salience gate** for **near-linear O(N)** when the
+  log-stride candidate set can be pruned.
+
+These capabilities are qualitatively different from
+`ruvector-attention` 2.0.4, which is what the workspace uses today
+for spatial / antenna attention.
+
+---
+
+## 2. Decision
+
+The AETHER temporal head will be implemented with
+`ruvllm_sparse_attention::SubquadraticSparseAttention::forward_gqa`
+for prefill, and `decode_step` against a `KvCache` (with the `fp16`
+feature enabled) for streaming inference paths (online re-ID,
+incremental embedding extraction during a tracked session).
+
+Concretely:
+
+1. `wifi-densepose-train` adds `ruvllm_sparse_attention` as a
+   workspace dependency, **path-vendored** against
+   `vendor/ruvector/crates/ruvllm_sparse_attention` so the workspace
+   does not gain a crates.io publish dependency.
+2. The AETHER block factory takes a feature flag
+   (`temporal_head = "dense" | "sparse_gqa"`) selecting between the
+   current dense MHA path and the new sparse-GQA path. The default
+   for new training runs is `sparse_gqa`. Existing checkpoints
+   continue to load on `dense`.
+3. Signal-side consumers (the streaming embedding extraction used
+   by `pose_tracker.rs` for re-ID updates) call `decode_step` rather
+   than re-running prefill on every new frame — this is the
+   structural win that dense MHA cannot provide.
+4. We add an A/B benchmark gate (§5) before flipping the production
+   default. The default *training* config can move first; the
+   default *inference* config waits for the gate.
+
+This ADR sanctions the swap. It does not perform the swap; that
+lands in a follow-up implementation issue once both ADR-095 and
+ADR-096 are accepted.
+
+---
+
+## 3. Quantitative argument
+
+### 3.1 Edge-evaluation count
+
+For a single attention layer over `N` frames:
+
+| Path | Edge evaluations | At `N = 100` (today's default) | At `N = 1000` (10 s @ 100 Hz) | At `N = 8192` |
+|------|------------------|--------------------------------|-------------------------------|---------------|
+| Dense MHA | `N²` | 1.0 × 10⁴ | 1.0 × 10⁶ | 6.7 × 10⁷ |
+| Sparse `forward` (window + log-stride + landmarks) | ~`N · (W + log N + N/B)` | 1.4 × 10⁴ | 1.4 × 10⁵ | 1.1 × 10⁶ |
+| Sparse + FastGRNN | ~`N · (W + globals + K)` | constant per token (linear in `N`) | constant per token (linear in `N`) | constant per token (linear in `N`) |
+
+Numbers for the sparse rows are taken from upstream's measured
+table (`README.md:230-237`, "sparse-edge reduction vs causal dense
+attention"): 8192 → 29.3× edge reduction, 16384 → 57.5×, 32768 →
+113.2×.
+
+**The honest framing:** at the *current* AETHER default of
+`window_frames = 100`, dense MHA is essentially free and the
+sparse machinery has overhead: the per-token cost in upstream's
+benchmark is ~2.4 µs at `N = 256` and ~2.1 µs at `N = 128`. The
+sparse path probably *loses* below `N ≈ 128`. It starts winning at
+the 1 s+ windows we'd realistically use for activity classification
+(`N = 200` at 50 Hz, `N = 500` for breathing-quality), and the gap
+widens with `N`: upstream's measured edge reduction reaches 29.3×
+at `N = 8192` and 113.2× at `N = 32768`, the long-session regime
+that long-context re-ID benefits from.
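
The scaling argument in §3.1 can be reproduced with a rough calculator. This is a sketch under stated assumptions: `W = 64` and `B = 64` are borrowed from the ADR-095 configuration, the log term is `ceil(log2 N)`, and the formula is applied without a causal cap, so its outputs are order-of-magnitude estimates and will not match upstream's measured figures exactly. The function names are hypothetical.

```rust
/// Dense attention evaluates every query-key pair.
pub fn dense_edges(n: u64) -> u64 {
    n * n
}

/// Rough sparse estimate from the table's formula: N * (W + log2 N + N / B).
/// `block` must be non-zero.
pub fn sparse_edges(n: u64, window: u64, block: u64) -> u64 {
    // ceil(log2 n) for n >= 1, computed from the bit length of n - 1.
    let ceil_log2 = 64 - (n.max(1) - 1).leading_zeros() as u64;
    n * (window + ceil_log2 + n / block)
}

/// Ratio of dense to sparse edge counts for a given sequence length.
pub fn edge_reduction(n: u64, window: u64, block: u64) -> f64 {
    dense_edges(n) as f64 / sparse_edges(n, window, block) as f64
}
```

With these inputs the estimated reduction is below 1 at `N = 100`, which is the overhead regime described above, and it grows steadily with `N`, which is the shape of the argument rather than a benchmark result.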
+
+### 3.2 Streaming decode
+
+Where dense MHA structurally cannot follow is incremental decode.
+Re-ID over a long-tracked person (a 5-minute session at 50 Hz =
+15,000 frames) with dense MHA requires recomputing attention from
+scratch every time the window slides. With `decode_step` against a
+`KvCache`:
+
+| Operation | Dense MHA | Sparse GQA + KV cache |
+|-----------|-----------|-----------------------|
+| Append one new frame to the embedding context | O(N²) | **O(log T)** |
+| Memory growth | O(N · d) per recompute | O(T · d_kv) cached, evicted by H2O heavy-hitter |
+| FP16 KV cache | n/a | available via `fp16` feature, halves memory |
+
+This is the qualitative capability dense MHA lacks. Even at small
+`N` where dense MHA is competitive on prefill, decode is structurally
+different: O(log T) per new frame vs O(N²) recompute.
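
A back-of-the-envelope model of the session-level difference, using the 5-minute, 50 Hz example from this section. The cost model is an assumption made for illustration: the dense path is charged a full `window × window` recompute per new frame, and the cached path is charged `ceil(log2 t)` lookups at step `t`. Constant factors, eviction work, and the real `decode_step` internals are ignored.

```rust
/// Frames in a session of `seconds` at `hz` frames per second.
pub fn session_frames(seconds: u64, hz: u64) -> u64 {
    seconds * hz
}

/// Dense path: every new frame triggers a full N x N recompute over the window.
pub fn dense_session_cost(frames: u64, window: u64) -> u64 {
    frames * window * window
}

/// Cached path, modelled as ceil(log2 t) candidate lookups at step t (t >= 1).
pub fn cached_session_cost(frames: u64) -> u64 {
    (1..=frames)
        .map(|t| (64 - (t - 1).leading_zeros()) as u64)
        .sum()
}
```

For 15,000 frames and a 100-frame window the dense model charges 150 million pair evaluations, while the cached model stays below 15,000 × 14 lookups. The gap is the point of the table above; the absolute numbers are not predictions.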
+
+---
+
+## 4. Approach
+
+### 4.1 Workspace dependency
+
+Add to `v2/Cargo.toml`:
+
+```toml
+[workspace.dependencies]
+# TOML 1.0 inline tables must stay on a single line.
+ruvllm_sparse_attention = { path = "../vendor/ruvector/crates/ruvllm_sparse_attention", default-features = false, features = ["fp16"] }
+```
+
+`default-features = false` mirrors the rest of the workspace's
+`--no-default-features` posture (and matches what ADR-095 does on
+the firmware side, so both consumers have the same feature set).
+We **do not** pull `parallel` here — rayon doesn't help with
+inference-shaped batches at the sequence lengths we run, and it
+breaks ADR-095's no_std build if the dependency leaks.
+
+### 4.2 Crate placement
+
+Two viable homes for the AETHER temporal head:
+
+| Option | Tradeoffs |
+|--------|-----------|
+| **A. New `wifi-densepose-temporal` crate** | Cleanest. Unique import surface, easy to feature-gate. But: one more crate in the publishing order (CLAUDE.md crate table grows to 16). |
+| **B. Add to `wifi-densepose-train`** | Co-located with the model; no new crate; simpler workspace graph. But: `wifi-densepose-train` is heavyweight (`tch`, full training stack), and signal-side consumers would have to depend on the whole training crate just to run inference. |
+
+**Recommendation: A.** The temporal head is consumed by both
+`wifi-densepose-train` (training) and `wifi-densepose-signal`
+(inference, re-ID). Pulling those toward a shared third crate keeps
+the dependency arrows clean. Also matches ADR-095's
+`wifi-densepose-temporal` host-side training crate name —
+deliberate convergence.
+
+### 4.3 API sketch
+
+```rust
+pub struct AetherTemporalHead {
+    backend: TemporalBackend,
+    cache: Option<KvCache>,           // populated for streaming inference
+}
+
+pub enum TemporalBackend {
+    Dense(DenseMha),                  // current ruvector-attention path
+    SparseGqa(SubquadraticSparseAttention),
+}
+
+impl AetherTemporalHead {
+    pub fn new(cfg: &TemporalHeadConfig) -> Self;
+
+    /// Window-level prefill. Returns pooled [d_model] embedding.
+    pub fn forward(&self, frames: &Tensor3) -> Vec<f32>;
+
+    /// Incremental decode for streaming re-ID. Updates internal
+    /// cache and returns pooled embedding given a single new frame.
+    /// SparseGqa backend only.
+    pub fn step(&mut self, frame: &Tensor3) -> Result<Vec<f32>, TemporalError>;
+}
+```
+
+### 4.4 Selection rule
+
+In the spirit of `forward_auto`, the head selects the path based on
+`(window, n_q_heads, n_kv_heads)` of the model:
+
+- `window ≤ 64` and dense MHA is in the checkpoint: use dense path.
+- `n_q_heads != n_kv_heads`: use `forward_gqa`.
+- `n_q_heads == n_kv_heads` and `window > 64`: use `forward`.
+- Streaming (per-frame) inference: always `decode_step`.
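
The four rules above can be written as a single function. This is a sketch: `TemporalPath` and `select_path` are hypothetical names, and the precedence (streaming first, then the dense shortcut, then the head-count check) is an assumption, since the list does not state an order. The case the list leaves open, equal head counts with `window ≤ 64` and no dense checkpoint, falls through to `forward` here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TemporalPath {
    Dense,
    ForwardGqa,
    Forward,
    DecodeStep,
}

pub fn select_path(
    window: usize,
    n_q_heads: usize,
    n_kv_heads: usize,
    has_dense_checkpoint: bool,
    streaming: bool,
) -> TemporalPath {
    if streaming {
        // Per-frame inference always decodes against the cache.
        TemporalPath::DecodeStep
    } else if window <= 64 && has_dense_checkpoint {
        TemporalPath::Dense
    } else if n_q_heads != n_kv_heads {
        TemporalPath::ForwardGqa
    } else {
        // Equal head counts; also the fallback for short windows
        // when no dense checkpoint is available (assumed behaviour).
        TemporalPath::Forward
    }
}
```

Encoding the rule as a pure function makes the unstated corner cases visible and easy to pin down in tests before the real head is wired up.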
+
+---
+
+## 5. Validation gate before flipping the inference default
+
+We do not flip the production inference default until *all four*
+of these pass on the most recent AETHER checkpoint:
+
+1. **Contrastive loss within 1%** of the dense baseline at the same
+   training budget (so the kernel substitution doesn't silently
+   regress the loss surface).
+2. **Re-ID rank-1 accuracy within 1 percentage point** of the dense
+   baseline on the held-out test split.
+3. **Spearman rank correlation ≥ 0.95** between dense-MHA and
+   sparse-GQA top-50 nearest-neighbour orderings on the
+   `env_fingerprint` and `person_track` HNSW indices (matches the
+   ADR-024 §2.5.3 quantisation-rank-preservation criterion).
+4. **Latency improvement ≥ 5×** at the deployed window length.
+
+Any of (1)–(3) failing rolls back the default; the kernel can stay
+in the codebase as opt-in, but is not what new training runs use.
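
The gate can be expressed as a small checker, sketched below. The struct and function names are hypothetical, and two readings are assumptions: "within 1%" is taken as relative absolute difference in loss, and "within 1 percentage point" as absolute difference in rank-1 accuracy. How the metrics are measured is outside this sketch.

```rust
pub struct GateMetrics {
    pub loss_dense: f64,
    pub loss_sparse: f64,
    pub rank1_dense_pct: f64,
    pub rank1_sparse_pct: f64,
    pub spearman_top50: f64,
    pub latency_dense_ms: f64,
    pub latency_sparse_ms: f64,
}

/// One pass/fail flag per criterion, in the order listed above.
pub fn gate_results(m: &GateMetrics) -> [bool; 4] {
    [
        ((m.loss_sparse - m.loss_dense) / m.loss_dense).abs() <= 0.01,
        (m.rank1_dense_pct - m.rank1_sparse_pct).abs() <= 1.0,
        m.spearman_top50 >= 0.95,
        m.latency_dense_ms / m.latency_sparse_ms >= 5.0,
    ]
}

/// The inference default flips only when all four criteria pass.
pub fn may_flip_inference_default(m: &GateMetrics) -> bool {
    gate_results(m).iter().all(|&ok| ok)
}
```

Returning the per-criterion flags, not only the final boolean, lets a benchmark report say which gate blocked the flip.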
+
+---
+
+## 6. Alternatives considered
+
+| Option | Why rejected |
+|--------|--------------|
+| **Keep dense MHA, period** | Simple, but caps the practical window length. The 10 s+ windows that long-context re-ID and breathing-quality scoring want are exactly where dense MHA hurts. We'd be locking in a ceiling for no reason. |
+| **Use `ruvector-attention` 2.0.4 (already in workspace)** | It's what we use today for antenna and spatial attention. But it lacks GQA, lacks streaming KV cache, and its dependency story upstream is messy (`ruvector-attn-mincut` is stuck at 2.0.4 per the issue). It works, but it's not the right tool for *temporal* attention specifically. |
+| **Wait for `ruvector-attention 2.x` to add GQA + KV cache** | Speculative; no published roadmap. Meanwhile `ruvllm_sparse_attention` shipped real artifacts on 2026-05-07 and is path-vendorable today. |
+| **Use a non-attention temporal pooler (TCN / S4 / Mamba)** | All three are real options for time-series sensing; some research gives them a slight edge on long-horizon dependencies. But (a) we already have AETHER specified around attention in ADR-024, (b) the contrastive recipe is attention-tuned, (c) we'd be re-running the entire ADR-024 training story to swap to a different family. Switching to *sparse* attention preserves the ADR-024 mathematical apparatus exactly. |
+| **`forward_gated_with_fastgrnn` immediately** | Tempting because it's the O(N) path. But the gate adds approximation error on top of the sparsity-induced approximation error. Phase the introductions: prove sparse-GQA matches dense first, then layer the gate on top in a follow-up. |
+
+---
+
+## 7. Consequences
+
+### Positive
+
+- **Long windows are no longer scary.** `window_frames = 1000` for
+  10 s sessions becomes practical, not aspirational.
+- **Streaming re-ID gets a structural speedup.** Per-frame decode
+  cost goes from an O(N²) recompute of the window to O(log T) per
+  new frame. Pose tracker cost is a real budget today; this shrinks
+  it.
+- **GQA fits the AETHER backbone better.** AETHER's per-keypoint
+  cross-attention already has a query/key shape mismatch (17
+  keypoint queries vs N CSI keys). GQA was designed for exactly
+  this asymmetry.
+- **Path-vendored, not crates.io-coupled.** No bind-time risk —
+  the crate ships from the vendored copy of upstream, and the
+  vendored copy was synced on 2026-05-07 (`e38347601`).
+- **Same kernel, two consumers.** ADR-095 wants this on the MCU;
+  this ADR wants it on the host. Path-vendoring once keeps the
+  versions in lockstep.
+- **Approximation error is bounded** by the local window +
+  log-stride + landmark pattern. Upstream's measurement (`README.md`
+  §FAQ) is "<1% perplexity on standard benchmarks" for the
+  causal case; we measure ours via §5's gate.
+
+### Negative
+
+- **Adds a workspace dependency** the team has to know about.
+  Mitigated by path-vendoring (no version-resolution risk).
+- **Approximation error is not zero.** For high-precision re-ID
+  this needs measurement. §5's gate is the safety net; if rank
+  correlation drops below 0.95 we don't flip the default.
+- **More moving parts in the temporal head.** Dense MHA has one
+  knob (number of heads). Sparse GQA has window, log-stride,
+  landmark block size, KV head count, and (later) gate top-K. We
+  pay this in default-config tuning effort.
+- **`KvCache` introduces session state** in a place that didn't
+  have it. Code that previously called a stateless `forward(...)`
+  now has to think about cache lifetime per tracked person. The
+  pose tracker (`pose_tracker.rs`) already has per-track state, so
+  the natural place for the cache is inside `PoseTrack`; needs a
+  small lifecycle review.
+- **Training and inference paths diverge slightly.** Training
+  always uses `forward` (full window prefill). Inference uses
+  `decode_step` for streaming. The two paths must be tested
+  separately; upstream's `forward` and `decode_step` are unit-test
+  parity-checked, but our wrapper has its own surface.
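+
+One way to cover the wrapper's own surface is a parity harness that
+runs the same frames through prefill and through frame-by-frame
+streaming, then reports the largest gap between the two pooled
+outputs. The sketch below is an assumption-laden illustration: the
+`TemporalHead` trait, its `reset` method, and `max_parity_gap` are
+hypothetical, frames are plain `Vec<f32>` rather than `Tensor3`, and
+`MeanPool` is a toy stand-in used only so the harness has something
+to exercise.
+
+```rust
+/// Hypothetical minimal interface covering both inference paths.
+pub trait TemporalHead {
+    /// Full-window prefill; returns the pooled embedding.
+    fn forward(&self, frames: &[Vec<f32>]) -> Vec<f32>;
+    /// Clears any streaming state.
+    fn reset(&mut self);
+    /// Consumes one frame and returns the pooled embedding so far.
+    fn step(&mut self, frame: &[f32]) -> Vec<f32>;
+}
+
+/// Toy head: mean pooling over frames, kept as a running sum.
+pub struct MeanPool {
+    sum: Vec<f32>,
+    count: usize,
+}
+
+impl MeanPool {
+    pub fn new(d_model: usize) -> Self {
+        Self { sum: vec![0.0; d_model], count: 0 }
+    }
+}
+
+impl TemporalHead for MeanPool {
+    fn forward(&self, frames: &[Vec<f32>]) -> Vec<f32> {
+        let mut acc = vec![0.0f32; self.sum.len()];
+        for f in frames {
+            for (a, x) in acc.iter_mut().zip(f) {
+                *a += *x;
+            }
+        }
+        let n = frames.len().max(1) as f32;
+        acc.iter().map(|a| a / n).collect()
+    }
+
+    fn reset(&mut self) {
+        self.sum.iter_mut().for_each(|s| *s = 0.0);
+        self.count = 0;
+    }
+
+    fn step(&mut self, frame: &[f32]) -> Vec<f32> {
+        for (s, x) in self.sum.iter_mut().zip(frame) {
+            *s += *x;
+        }
+        self.count += 1;
+        let n = self.count as f32;
+        self.sum.iter().map(|s| s / n).collect()
+    }
+}
+
+/// Largest absolute difference between the prefill output and the
+/// output of the last streaming step over the same frames.
+pub fn max_parity_gap<H: TemporalHead>(head: &mut H, frames: &[Vec<f32>]) -> f32 {
+    let prefill = head.forward(frames);
+    head.reset();
+    let mut streamed = Vec::new();
+    for f in frames {
+        streamed = head.step(f);
+    }
+    prefill
+        .iter()
+        .zip(&streamed)
+        .map(|(a, b)| (a - b).abs())
+        .fold(0.0, f32::max)
+}
+```
+
+For a real sparse-GQA head the acceptable gap would be a tolerance
+chosen by the implementation PR rather than exact equality.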
+
+### Neutral
+
+- ADR-024 is **not superseded.** The contrastive loss, the
+  augmentation strategy, the projection head, the HNSW indices —
+  all unchanged. This ADR makes a single architectural choice
+  inside ADR-024's "temporal aggregation" black box.
+- ADR-016 (RuVector training pipeline integration) is unaffected.
+  The other RuVector crates (`mincut`, `attn-mincut`,
+  `temporal-tensor`, `solver`, `attention`) keep their existing
+  roles in `model.rs`.
+
+---
+
+## 8. Open questions
+
+1. **What is the AETHER temporal head's actual current
+   architecture in code?** ADR-024 specifies the projection head
+   precisely (Linear → BN → ReLU → Linear → L2-norm) but the
+   *temporal aggregation* before that is not pinned. The closest
+   thing in `model.rs` today is `apply_antenna_attention` and
+   `apply_spatial_attention`, which are over antenna and spatial
+   axes, not the temporal axis. So this ADR is, in practice,
+   choosing the temporal kernel for the *first time* — not
+   replacing one. Worth confirming with the maintainer before the
+   implementation PR uses language like "swap" rather than "add".
+2. **What window length is the deployed AETHER tracker using
+   today?** The training default is 100 frames (`config.rs:165`),
+   but `proof.rs` uses 4 and `trainer.rs` uses 2. The realistic
+   deployment number determines how much of the §3.1 quantitative
+   argument is *currently* operative versus *future-state*. If the
+   answer is "we run AETHER on 4-frame windows", sparse pays
+   nothing today, and the case for this ADR rests entirely on the
+   long-window roadmap. If 100 or more, sparse already pays.
+3. **Is `FastGrnnGate` worth enabling for re-ID specifically?**
+   Probably not — re-ID benefits from full-sequence visibility,
+   and the gate's job is to *prune* long-range candidates. Save
+   the gate for activity classification (where transient movement
+   is the signal of interest, and saliency-based pruning matches
+   the use case). Confirm via §5's accuracy gate when we get there.
+4. **Does the cross-modal alignment loss (ADR-024 §2.2.4) need
+   any change?** The cross-modal loss operates on pooled
+   `z_csi` (already temporally aggregated) and pooled `z_pose`. As
+   long as the temporal aggregator returns a comparable pooled
+   vector, the loss is kernel-agnostic. Likely no change, but
+   worth a smoke test.
+5. **Where does the KV cache live for re-ID?** Per `pose_tracker.rs`,
+   each `PoseTrack` already has lifecycle (create / update /
+   evict). The natural place is `PoseTrack::kv_cache:
+   Option<KvCache>`, populated when the track first emits an
+   embedding. Eviction policy ties to `track.last_seen` — when
+   the track is dropped, drop the cache. Spec-level sanity check
+   only; needs a real design pass in the implementation PR.
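+
+To make question 5 concrete, here is a minimal sketch of the
+lifecycle it describes. Everything in it is an assumption for
+illustration: `KvCache` is a stand-in with a single counter (the
+upstream type's fields and constructor are not shown in this ADR),
+`Tracker`, `observe`, `evict_stale`, and `max_age` are hypothetical,
+and `last_seen` is modelled as a frame index.
+
+```rust
+use std::collections::HashMap;
+
+/// Stand-in for upstream's `KvCache`; only counts cached frames.
+#[derive(Debug, Default)]
+pub struct KvCache {
+    frames_seen: usize,
+}
+
+/// Per-person track state with an optional streaming cache.
+pub struct PoseTrack {
+    pub last_seen: u64,
+    pub kv_cache: Option<KvCache>,
+}
+
+/// Hypothetical owner of all live tracks.
+pub struct Tracker {
+    tracks: HashMap<u32, PoseTrack>,
+    max_age: u64,
+}
+
+impl Tracker {
+    pub fn new(max_age: u64) -> Self {
+        Self { tracks: HashMap::new(), max_age }
+    }
+
+    /// Records a frame for `track_id`, creating the track and its
+    /// cache on first use. Returns how many frames the cache holds.
+    pub fn observe(&mut self, track_id: u32, now: u64) -> usize {
+        let track = self
+            .tracks
+            .entry(track_id)
+            .or_insert(PoseTrack { last_seen: now, kv_cache: None });
+        track.last_seen = now;
+        // The cache is populated lazily, when the track first emits
+        // an embedding.
+        let cache = track.kv_cache.get_or_insert_with(KvCache::default);
+        cache.frames_seen += 1;
+        cache.frames_seen
+    }
+
+    /// Drops tracks not seen for more than `max_age` frames; each
+    /// dropped track takes its cache with it. Returns the number
+    /// of evicted tracks.
+    pub fn evict_stale(&mut self, now: u64) -> usize {
+        let before = self.tracks.len();
+        let max_age = self.max_age;
+        self.tracks
+            .retain(|_, t| now.saturating_sub(t.last_seen) <= max_age);
+        before - self.tracks.len()
+    }
+}
+```
+
+Owning the cache inside the track means eviction needs no separate
+bookkeeping: dropping the `PoseTrack` drops its `KvCache`, and a
+re-created track starts from an empty cache.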
+
+---
+
+## 9. Acceptance criteria
+
+This ADR is **Accepted** once:
+
+1. Maintainer review on #513 confirms the architecture and resolves
+   §8.1 (the "first-time choice vs replacement" framing).
+2. Open question §8.2 has a concrete answer (ideally a one-line
+   pointer to the production training config).
+3. The follow-up implementation issue is filed.
+
+This ADR is **Implemented** once:
+
+1. `wifi-densepose-temporal` (or equivalent) ships in the workspace
+   with a default-off feature flag exposing both dense and
+   sparse-GQA backends.
+2. §5's four-gate validation has run on the most recent AETHER
+   checkpoint and the result is published (witness-bundle
+   compatible per ADR-028 if the run is reproducible).
+3. The default for new training runs is `sparse_gqa`, with `dense`
+   still selectable for back-compat.
+
+---
+
+## 10. Related
+
+ADR-014 (signal SOTA), ADR-016 (RuVector training pipeline
+integration), ADR-024 (AETHER contrastive CSI embedding — this
+ADR fills in its temporal-aggregation black box), ADR-095
+(on-ESP32-S3 temporal modeling — same crate, different consumer),
+upstream ADR-189 (KV cache incremental decode — the basis for
+streaming re-ID), upstream ADR-190 (GQA / MQA — what AETHER's 17
+keypoint queries × N CSI keys asymmetry naturally maps onto),
+upstream ADR-192 (no_std + alloc support — the structural change
+that means the *same* kernel runs both on the host here and on
+the MCU under ADR-095).