docs(adr): ADR-095/096 — sparse attention on ESP32 + AETHER GQA head (#513)

Two Proposed ADRs covering the integration of vendored
ruvllm_sparse_attention v0.1.1 (released 2026-05-07, no_std + alloc
validated on real ESP32-S3 per upstream ADR-192).

* ADR-095 — adds a learned temporal head to the ESP32-S3 firmware
  via a Rust component compiled --no-default-features against the
  376 KB rlib. Runs alongside the existing physics-only DSP, gated
  behind a Kconfig (8 MB only initially). Use cases: gesture
  recognition, fall classification with sequence context,
  breathing-quality scoring, on-device anomaly detection. Builds
  on ADR-018, ADR-039, ADR-081.

* ADR-096 — adopts forward_gqa + KvCache for the AETHER (ADR-024)
  contrastive CSI embedding's temporal aggregation. Path-vendored
  workspace dep, A/B gate before flipping the inference default.
  ~30-100x speedup at long windows; streaming decode goes from
  O(N^2) recompute to O(log T) per new frame.

Refs #513
ruv 2026-05-07 15:14:38 -04:00
parent e7904786f0
commit 684ef4f1a5
2 changed files with 758 additions and 0 deletions


@ -0,0 +1,369 @@
# ADR-095: On-ESP32-S3 Temporal Modeling at the Edge via `ruvllm_sparse_attention` (no_std)
| Field | Value |
|-------------|--------------------------------------------------------------------------------------------------------|
| **Status** | Proposed (2026-05-07) |
| **Date** | 2026-05-07 |
| **Authors** | ruvnet, claude-flow |
| **Related** | ADR-018, ADR-024, ADR-039, ADR-040, ADR-061, ADR-081, ADR-091; upstream ADR-189, ADR-190, ADR-192 |
| **Branch** | `feat/ruvllm-sparse-attention-edge` |
| **Tracking**| #513 |
---
## 1. Context
Today the ESP32-S3 firmware in `firmware/esp32-csi-node/main/` does
**physics-only** sensing on-device. The pipeline in `edge_processing.c`
runs on Core 1 and produces:
- Adaptive presence detection (`presence_score`).
- Breathing-band (0.1-0.5 Hz) and heart-rate-band (0.8-2.0 Hz) biquad
IIR bandpass + zero-crossing BPM estimators.
- A motion / fall flag (`flags` bits 0-2 in `edge_vitals_pkt_t` magic
`0xC5110002`, plus fused mmWave variant `0xC5110004` per ADR-063).
- ADR-081 `rv_feature_state_t` (60 B at magic `0xC5110006`) emitted at
1-10 Hz from the adaptive controller's fast loop.
There is **no learned model of any kind on the MCU**. The closest things
are: ADR-039 Tier-1 compressed-CSI emission, ADR-040 WASM modules
(Tier-3, but used by the user for ad-hoc DSP, not transformer
inference), and the Rust-side AETHER embeddings (ADR-024) which run
on the host, not the node. Anomaly detection that needs *temporal
context* — "is this fall pattern consistent with a fall, or just a
sit-down?" — is structurally absent. The fall debounce in v0.6.x
(3-frame consecutive + 5 s cooldown, raised threshold 2.0 → 15.0 rad/s²)
is a hand-tuned heuristic exactly because the firmware has nothing
better to reason with.
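
To make the heuristic concrete, here is a minimal sketch of a "3 consecutive frames + cooldown" debounce of the kind described above. It is not the code in `edge_processing.c`: the type `FallDebounce`, its fields, and the `accel` input name are hypothetical, and the thresholds are passed in rather than hard-coded.

```rust
/// Debounced fall flag: `needed` consecutive over-threshold frames, then a cooldown.
/// All names here are hypothetical; they are not taken from `edge_processing.c`.
pub struct FallDebounce {
    threshold: f32,            // e.g. 15.0 rad/s^2
    needed: u32,               // e.g. 3 consecutive frames
    cooldown_us: u64,          // e.g. 5_000_000 (5 s)
    run: u32,                  // current streak of over-threshold frames
    last_fire_us: Option<u64>, // timestamp of the last reported fall
}

impl FallDebounce {
    pub fn new(threshold: f32, needed: u32, cooldown_us: u64) -> Self {
        Self { threshold, needed, cooldown_us, run: 0, last_fire_us: None }
    }

    /// Feed one frame; returns true only on the frame where a fall is reported.
    pub fn update(&mut self, accel: f32, ts_us: u64) -> bool {
        // Extend or reset the streak.
        self.run = if accel > self.threshold { self.run.saturating_add(1) } else { 0 };
        if self.run < self.needed {
            return false;
        }
        // Suppress repeat reports while the cooldown is running.
        if let Some(last) = self.last_fire_us {
            if ts_us.saturating_sub(last) < self.cooldown_us {
                return false;
            }
        }
        self.last_fire_us = Some(ts_us);
        self.run = 0;
        true
    }
}
```

The only state is a streak counter and a timestamp, which is why such a rule cannot tell a fall from a sit-down: it never sees what happens after the spike.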
A second pressure point: the Tmr Svc / FreeRTOS stack is already
sensitive. `edge_processing.c` lines 47-48 explicitly note that
`process_frame + update_multi_person_vitals` combined used ~6.5-7.5 KB
of the 8 KB task stack and that **scratch buffers were moved to static
storage to avoid stack overflow.** Any new heavyweight workload — and
a transformer forward pass is heavyweight — must therefore live in
**its own FreeRTOS task with its own task stack**, not piggyback on
the existing edge DSP task.
The vendored crate `ruvllm_sparse_attention` v0.1.1 (released 2026-05-07,
synced today at `vendor/ruvector/crates/ruvllm_sparse_attention/`)
removes the previously-blocking `std` requirement. Per upstream
**ADR-192**, the crate now compiles cleanly to
`xtensa-esp32s3-none-elf` via `espup`, with a measured **376 KB
release rlib**, zero runtime dependencies beyond `libm`, and was
validated on a real ESP32-S3 (rev v0.2, 16 MB flash). It exposes
`SubquadraticSparseAttention`, `KvCache` / `KvCacheF16`, `FastGrnnGate`,
`IncrementalLandmarks`, `RuvLlmSparseBlock`, and a `Tensor3` value
type. The kernel is O(N log N) by default and near-linear O(N) when
the FastGRNN salience gate is enabled.
This is the first time we have had a credible path to **on-device
transformer inference for CSI** without a Python runtime, without
TFLite, and without a coprocessor. It is also the right moment to
decide *whether* we want it before code starts to land.
---
## 2. Decision
Add a learned **temporal head** to the ESP32-S3 firmware running on
the node itself, using `ruvllm_sparse_attention` compiled
`--no-default-features` (no_std + alloc, optionally `+fp16`), driven
by a small Rust component integrated into the ESP-IDF build. The
temporal head runs **alongside** the existing physics-only pipeline,
not as a replacement — physics gives us breathing/heart-rate/presence,
the temporal head gives us classification and sequence-aware reasoning.
Concretely:
1. The temporal head consumes a rolling window of feature vectors
(initially the same `rv_feature_state_t` floats already produced
by ADR-081, plus optionally a small projection of recent CSI
amplitude statistics), length `N` ∈ [100, 500] frames, sampled at
the controller's fast-loop rate.
2. It outputs a small set of **class logits** for the active
detection task. The first three deployable tasks are listed in
§4.
3. It runs in its own FreeRTOS task on Core 1 (or pinned to whichever
core the WiFi driver is *not* on), at a cadence slower than the
fast loop — initially 1 Hz, classification-on-demand.
4. The kernel is invoked through a thin C ABI (`ruv_temporal_init`,
`ruv_temporal_push`, `ruv_temporal_classify`, `ruv_temporal_destroy`) exported from
a Rust static library linked into the ESP-IDF build the same way
the existing Tier-3 components are linked.
5. Weights are stored as a flat `f32` (or `f16` with the `fp16`
feature) blob in the ESP32-S3 flash, loadable from either an
embedded `EMBED_FILES` resource (compile-time bake-in) or NVS
(post-flash provisioning, mirroring ADR-040's WASM-upload path).
6. The temporal head is gated behind a Kconfig option
`CONFIG_CSI_TEMPORAL_HEAD_ENABLED`, **default off**, and is only
compiled into the 8 MB build profile until the flash math in §6
demonstrates 4 MB headroom.
This ADR authorizes the architecture; it does **not** ship any of
the firmware-side or training-side changes. Implementation lands in
follow-up issues per the roadmap in §7.
---
## 3. Approach
### 3.1 Build integration
ESP-IDF v5.4 already supports Rust components via the
`rust-esp32`-style template (a CMake `idf_component_register` shim
that runs `cargo build --target xtensa-esp32s3-none-elf` and links
the resulting static library). The new component lives at
`firmware/esp32-csi-node/components/ruv_temporal/`:
```
ruv_temporal/
  CMakeLists.txt          # component manifest, Rust build invocation
  Cargo.toml              # crate config: no_std, deps on ruvllm_sparse_attention
  build.rs                # generates the C header from #[no_mangle] exports
  src/lib.rs              # public C ABI: init/push/classify/teardown
  src/window.rs           # rolling frame buffer
  src/weights.rs          # NVS / EMBED_FILES weight loader
  include/ruv_temporal.h  # generated; consumed by edge_processing.c
```
Cargo features compiled in: `["fp16"]`. **Not** `parallel` (rayon
needs threads, breaks no_std). **Not** `std`.
### 3.2 Interface
The C ABI is intentionally narrow. It does not expose `Tensor3`,
attention configs, or any Rust types — only `float*` buffers and
opaque handles:
```c
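#include <stddef.h>
#include <stdint.h>
#include "esp_err.h"

typedef struct ruv_temporal_ctx ruv_temporal_ctx_t;
/* Create a context from a weight blob; *out_ctx is set on success. */
esp_err_t ruv_temporal_init(const uint8_t *weights, size_t wlen,
                            uint32_t input_dim, uint32_t window,
                            ruv_temporal_ctx_t **out_ctx);
/* Hot path: append one frame of input_dim floats to the ring buffer. */
esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
/* Run the sparse attention forward pass and write n_classes logits. */
esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
                                float *logits, uint32_t n_classes);
void ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);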
```
`push` is the hot path and must be cheap (it just writes into a
ring buffer in PSRAM if available, IRAM/DRAM otherwise). `classify`
runs the actual sparse attention forward and is the budget-heavy
call.
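
For orientation, the Rust side of `init`, `push`, and `destroy` could look roughly like the sketch below. It is written against `std` so it can be run on a host; the real component would be `no_std + alloc`. The struct layout, the argument checks, and the choice to return plain `i32` for `esp_err_t` are assumptions, and `classify` is left out because the ADR text does not give the crate's call signatures.

```rust
// ESP-IDF error codes (values as defined in esp_err.h).
pub const ESP_OK: i32 = 0;
pub const ESP_ERR_INVALID_ARG: i32 = 0x102;

/// Opaque to C. The field layout is a hypothetical illustration.
pub struct RuvTemporalCtx {
    ring: Vec<f32>,   // window * input_dim floats, frame-major
    input_dim: usize, // floats per frame
    window: usize,    // capacity in frames
    head: usize,      // slot the next push overwrites
    filled: usize,    // valid frames, saturates at `window`
}

// When exported from the static library these would also carry
// `#[no_mangle]` (spelled `#[unsafe(no_mangle)]` on edition 2024).
pub unsafe extern "C" fn ruv_temporal_init(
    weights: *const u8,
    wlen: usize,
    input_dim: u32,
    window: u32,
    out_ctx: *mut *mut RuvTemporalCtx,
) -> i32 {
    if weights.is_null() || wlen == 0 || input_dim == 0 || window == 0 || out_ctx.is_null() {
        return ESP_ERR_INVALID_ARG;
    }
    // Weight parsing is omitted; this sketch only sets up the ring buffer.
    let (input_dim, window) = (input_dim as usize, window as usize);
    let ctx = Box::new(RuvTemporalCtx {
        ring: vec![0.0; input_dim * window],
        input_dim,
        window,
        head: 0,
        filled: 0,
    });
    unsafe { *out_ctx = Box::into_raw(ctx) };
    ESP_OK
}

pub unsafe extern "C" fn ruv_temporal_push(ctx: *mut RuvTemporalCtx, frame: *const f32) -> i32 {
    if ctx.is_null() || frame.is_null() {
        return ESP_ERR_INVALID_ARG;
    }
    // Caller guarantees `frame` points at `input_dim` readable floats.
    let ctx = unsafe { &mut *ctx };
    let src = unsafe { std::slice::from_raw_parts(frame, ctx.input_dim) };
    let start = ctx.head * ctx.input_dim;
    ctx.ring[start..start + ctx.input_dim].copy_from_slice(src);
    ctx.head = (ctx.head + 1) % ctx.window;
    ctx.filled = (ctx.filled + 1).min(ctx.window);
    ESP_OK
}

pub unsafe extern "C" fn ruv_temporal_destroy(ctx: *mut RuvTemporalCtx) {
    if !ctx.is_null() {
        // Reclaim the Box allocated in init.
        drop(unsafe { Box::from_raw(ctx) });
    }
}
```

`push` does one bounds-checked slice copy and two index updates, which is the property the hot path needs; all allocation happens once in `init`.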
### 3.3 Task topology
A new task `ruv_temporal_task` with its own 16 KB stack, pinned to
the same core as the edge DSP task (Core 1), fed via a FreeRTOS
queue from the adaptive controller's fast loop. We do **not** call
the kernel from the existing edge task — the edge stack is already
near-full per the comment at `edge_processing.c:47-48` and recent
fall-debounce / Tmr-Svc-stack work.
### 3.4 Memory budget (per inference)
With a rolling window of `N = 256` frames, `d_model = 32`, `n_heads = 4`,
`head_dim = 8`, 1-2 `RuvLlmSparseBlock` layers, `block_size = 64`, and
attention `window = 64`:
- Weights: ~5-15 KB (single block, INT8 quant deferred to a later
ADR; FP16 default).
- KV cache (FP16, full window, single block): `2 * 256 * 4 * 8 * 2 B = 32 KB`.
- Activations (peak, with `forward_flash` tiling): ≈ 2 KB.
- Working set: < 64 KB. Comfortable in PSRAM, possible in ISR-safe
internal SRAM.

These are first-pass estimates; the precise numbers come out of the
`forward_flash` benchmark on real hardware, which is an exit criterion
in §7.
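
The KV-cache line is a plain product, so it is easy to re-evaluate for other configurations. The helper below is a sketch; the function name is hypothetical and it assumes one K and one V tensor of shape `[seq_len, n_kv_heads, head_dim]` per block, with no allocator overhead.

```rust
/// Bytes held by one block's K and V tensors.
/// Hypothetical helper; ignores allocator and alignment overhead.
pub const fn kv_cache_bytes(
    seq_len: usize,
    n_kv_heads: usize,
    head_dim: usize,
    bytes_per_elem: usize,
) -> usize {
    // Factor 2 covers the separate K and V tensors.
    2 * seq_len * n_kv_heads * head_dim * bytes_per_elem
}
```

With the configuration above, `kv_cache_bytes(256, 4, 8, 2)` evaluates to 32768 bytes. Dropping to a single shared KV head, as grouped-query attention allows, would cut that to 8192 bytes.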
### 3.5 Compatibility with ADR-081 / ADR-039 / ADR-018
The temporal head is a **consumer** of the same feature stream
already flowing in the firmware. It does not alter:
- ADR-018 raw CSI frame layout (`0xC5110001`).
- ADR-039 Tier-1 compressed CSI (`0xC5110005`) or vitals
(`0xC5110002`).
- ADR-063 fused vitals (`0xC5110004`).
- ADR-081 `rv_feature_state_t` (`0xC5110006`) — this is the primary
input we tap.
If the temporal head fires a classification, the result rides on a
new `0xC5110007` packet (small: class id, confidence, monotonic seq,
ts_us, CRC32). Allocation of that magic is deferred to the
implementation PR — this ADR reserves the *concept*, not the byte
layout.
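
Purely as an illustration of how small such a packet can be, the sketch below packs the five listed fields into 24 bytes. The byte layout, the little-endian encoding, the 8-bit quantised confidence, and the reserved padding are all assumptions made for this example; the ADR explicitly leaves the real layout to the implementation PR. The CRC is the standard reflected CRC-32 (polynomial `0xEDB88320`).

```rust
pub const MAGIC_CLASS_PKT: u32 = 0xC511_0007;

/// Bitwise reflected CRC-32 (init 0xFFFFFFFF, final XOR 0xFFFFFFFF).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, else zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical layout: magic(4) class(1) conf(1) reserved(2) seq(4) ts_us(8) crc(4).
pub fn encode_class_pkt(class_id: u8, confidence_q8: u8, seq: u32, ts_us: u64) -> [u8; 24] {
    let mut p = [0u8; 24];
    p[0..4].copy_from_slice(&MAGIC_CLASS_PKT.to_le_bytes());
    p[4] = class_id;
    p[5] = confidence_q8;
    // p[6..8] stays zero (reserved).
    p[8..12].copy_from_slice(&seq.to_le_bytes());
    p[12..20].copy_from_slice(&ts_us.to_le_bytes());
    // CRC covers everything before the CRC field itself.
    let crc = crc32(&p[..20]);
    p[20..24].copy_from_slice(&crc.to_le_bytes());
    p
}
```

A receiver would recompute `crc32` over the first 20 bytes and compare it with the trailing field before trusting the class id.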
---
## 4. Use cases that motivate this
| Task | Why temporal context matters | Window | Class count |
|------|------------------------------|--------|-------------|
| **Gesture recognition** (wave / point / clap / kick) | Single-frame CSI snapshots can't disambiguate gestures from random motion. ~100-frame windows capture full gesture trajectories. | 100 frames @ 50 Hz = 2 s | 4-8 |
| **Fall classification with sequence context** | The current heuristic ("> 15 rad/s² for 3 consecutive frames + 5 s cooldown") was raised to suppress false positives. A learned temporal head can distinguish a fall (rapid descent then stillness) from a sit-down (descent then sustained micro-motion) using the same input window. | 200 frames @ 50 Hz = 4 s | 3 (fall / sit / nothing) |
| **Breathing-quality scoring** | Today's pipeline emits a BPM and a confidence float. A temporal head trained on labeled apnea / shallow / paradoxical / normal sequences can output a 4-class quality label that downstream consumers can render in one glance. | 500 frames @ 50 Hz = 10 s | 4 |
| **"Is this normal for this room/time" anomaly detection** | Per-room SONA profiles (ADR-005) capture environment statistics, but anomaly *temporal shape* is currently checked host-side via embedding distance (ADR-024 §2.4 `temporal_baseline` index). A small on-device classifier can flag ahead of host roundtrip. | 300 frames | 2 (normal / anomalous) |
These four cover the visible product gaps in the v0.6.x line.
Gesture recognition is the headline; fall classification is the
highest-impact for the eldercare scenarios v0.5.4 was tuned for.
---
## 5. Alternatives considered
| Option | Why rejected |
|--------|--------------|
| **TFLite Micro** | Heavier runtime (~150 KB code + interpreter), pulls in C++ STL surface, no Rust-native API. Does not benefit from sparse attention specifically. We'd be re-paying the cost of a full inference framework when we only need one kernel. |
| **Run all classifiers server-side** | Costs a full Tier-1 CSI uplink (~50-70 KB/s/node per ADR-039) just to feed a remote classifier, then a roundtrip back. Defeats the point of ADR-081's compact feature stream and makes the system worthless when the backhaul is down. Also leaks raw CSI to the network for purposes the user did not opt into. |
| **Stay physics-only forever** | Cleanest from a maintenance standpoint, but it structurally rules out gesture recognition, and the fall-debounce hack will keep accreting per-deployment knobs. The product space already has commodity physics-only firmware (Bosch presence sensors, etc.); on-device transformer inference for CSI is what would *differentiate* RuView. |
| **Use `ruvector-attention` (already in workspace) on-device** | `ruvector-attention` is `std`-bound today; doesn't compile to `xtensa-esp32s3-none-elf` without a port comparable in scope to upstream ADR-192. Even if ported, it doesn't give us GQA + streaming KV cache, which is the structural capability the new crate adds. |
| **Wait for IEEE 802.11bf** | Different problem (standardised CSI exposure across vendors). Doesn't address whether the model runs on-device or off. |
---
## 6. Consequences
### Positive
- **Genuinely novel.** No competing CSI-sensing project ships
transformer inference on the MCU itself. The closest peers
(Espressif's ESP-DL, Edge Impulse) are non-attention CNN/RNN
pipelines.
- **Latency.** Classification result is local — no backhaul,
no host roundtrip, sub-100 ms gesture-to-action.
- **Privacy.** Raw CSI never leaves the node for these tasks.
- **Reuses the ADR-081 feature stream** — the temporal head is a
consumer of the existing 60 B `rv_feature_state_t`, not a new
uplink format.
- **Validated kernel.** Per upstream ADR-192, the no_std build was
validated on real ESP32-S3 hardware (MAC `ac:a7:04:e2:66:24`).
We are not betting on a paper crate.
### Negative / tradeoffs
- **Flash budget pressure on 4 MB boards.** Per `partitions_4mb.csv`,
each OTA slot is `0x1D0000` bytes (1856 KiB, about 1.81 MiB). The current build is
~853 KiB. Adding a 376 KB rlib plus weights brings us to ~1.3 MB —
still under the slot ceiling but with little headroom for other
growth. **Decision: temporal head is 8 MB-only initially**, gated
behind `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`. 4 MB enablement is a
separate ADR after we measure the actual incremental link size
(the 376 KB upstream number is for the rlib in isolation; the
linked-and-stripped final binary delta will be smaller).
- **Rust toolchain dependency.** The ESP-IDF build now needs
`espup` + `cargo +esp` to be present on every developer machine
and CI runner. This is a real hurdle on Windows — see
`CLAUDE.local.md` for the existing Python-subprocess wrapper
required to run ESP-IDF cleanly. CI will need a parallel
Rust-toolchain step.
- **One more thing to test.** QEMU (ADR-061) does not run the
ESP32-S3 Xtensa Rust binary today. The QEMU validator pipeline
will need a build matrix entry for "Rust component compiled but
classifier disabled" at minimum.
- **Stack overflow risk.** Same hazard the v0.6.4 work just
navigated. Mitigated by §3.3 (own task, own stack); this needs
to be a code-review checklist item.
- **Weights provenance.** Once we ship a model, we need a story
for *which model*, signed by *whom*, retrained *how often*. See
Open Questions §8.
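
The flash-budget tradeoff at the top of this list reduces to a subtraction, sketched below. The helper name is hypothetical, the "376 KB" rlib figure is treated as KiB, and model weights are not included, so the result is an upper bound on the remaining headroom.

```rust
/// OTA slot size quoted from `partitions_4mb.csv` in the text above.
pub const OTA_SLOT_BYTES: u32 = 0x1D_0000;

/// Remaining space in KiB after the current image and an added component.
/// Negative means the slot would overflow. Hypothetical helper.
pub fn headroom_kib(slot_bytes: u32, current_kib: u32, added_kib: u32) -> i64 {
    (slot_bytes / 1024) as i64 - current_kib as i64 - added_kib as i64
}
```

`headroom_kib(OTA_SLOT_BYTES, 853, 376)` comes out at 627 KiB before weights, which is the margin a 4 MB enablement would have to defend against other firmware growth.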
### Neutral
- ADR-040's WASM Tier-3 path is **not** superseded. WASM remains
the right choice for user-uploaded modules. The temporal head is
a first-party signed-by-us component, with a different deploy
story.
- The host-side ADR-024 AETHER pipeline is unchanged by this ADR.
ADR-096 covers the host-side use of the same crate.
---
## 7. Roadmap
| Phase | Scope | Gating |
|-------|-------|--------|
| 0 | This ADR + ADR-096 land. No code. | Maintainer review of #513. |
| 1 | New crate `wifi-densepose-temporal` (host-side only): defines the temporal-head architecture, training script, weight serialization format. | Phase 0 accepted. |
| 2 | `ruv_temporal` ESP-IDF component scaffolding — empty kernel, just the C ABI and ring buffer. Compiles cleanly into 8 MB firmware. Adds ~5 KB to binary. | Phase 1 produces a serialised set of weights. |
| 3 | Wire `ruvllm_sparse_attention` `forward` (not yet `forward_gated`) into the component. First on-target classification benchmark on COM7. Gate: end-to-end inference ≤ 50 ms with `N = 256`, no stack overflow under 24 h soak. | Phase 2 ABI stable. |
| 4 | First trained classifier (gesture or fall, whichever has labelled data first). Hardware A/B: temporal-head decision vs current heuristic on a held-out set. Promotion criterion: temporal head matches or beats heuristic on F1 *and* false-positive rate. | Phase 3 latency gate met. |
| 5 | 4 MB profile gating — measure actual binary delta, decide whether to enable on SuperMini. | Phase 4 in production on 8 MB. |
| 6 | `forward_gated_with_fastgrnn` for long-window tasks (breathing-quality at N = 500). | Phase 4 stable. |
---
## 8. Open questions
1. **Who trains the temporal heads?** Two options:
(a) host-side training on captured `rv_feature_state_t` traces
labelled in-app, then export to flat-buffer weights;
(b) teacher-distillation from the larger AETHER model (ADR-024)
running off-device, using soft labels. Option (b) is more
data-efficient but couples this ADR's ship date to ADR-024's
training-pipeline maturity. Open.
2. **How are weights flashed?** Three options, in increasing
capability: NVS blob (small, safe, 48 KB ceiling per key),
`EMBED_FILES` baked into the firmware image (no runtime update),
OTA-updateable partition (mirrors ADR-040 RVF upload path,
biggest engineering cost). Phase 2/3 will pick one; my prior is
`EMBED_FILES` for the first model, OTA partition once we have
more than one.
3. **Does the 376 KB rlib figure scale?** Upstream measured
376 KB for the kernel + the embedding/projection
weights for *their* test config. Adding 1-2
`RuvLlmSparseBlock` layers with embedding/projection weights
sized to actual CSI feature dimension may push this. Phase 2
will measure the on-target stripped-binary delta directly; if
the delta exceeds 600 KB we revisit the 4 MB story sooner.
4. **What window length is right for fall classification?**
200 frames at 50 Hz = 4 s feels right based on the v0.6.4
debounce numbers (3-frame consecutive + 5 s cooldown is
essentially a 4-second decision window already). Empirical, not
architectural — set in Phase 4.
5. **Quantisation.** First model ships FP16 (KV cache feature flag
already supports this). INT8 for both weights and activations
is a follow-up; the current crate has no INT8 path so it would
be a separate kernel.
6. **What happens when the controller is in `RV_PROFILE_PASSIVE_LOW_RATE`?**
The fast loop slows down, so the input frame rate to the
temporal head drops. Either the head needs to handle variable
sample rate (resample at push time) or it stops emitting until
the controller goes back to active. Phase 1 design call.
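
For question 6, "resample at push time" could be as simple as linear interpolation onto a fixed grid, sketched below for a single channel. The type `PushResampler` and its policy (first sample anchors the grid, out-of-order timestamps are dropped) are assumptions for illustration, not a design the ADR has settled on.

```rust
/// Linear-interpolating resampler onto a fixed-period grid (hypothetical helper).
/// One channel is shown; a real frame would repeat this per feature.
pub struct PushResampler {
    period_us: u64,
    next_us: u64,             // next grid timestamp to emit
    prev: Option<(u64, f32)>, // last input sample (timestamp, value)
}

impl PushResampler {
    pub fn new(period_us: u64) -> Self {
        assert!(period_us > 0, "period must be non-zero");
        Self { period_us, next_us: 0, prev: None }
    }

    /// Feed one input sample; returns the grid samples it completes.
    pub fn push(&mut self, ts_us: u64, value: f32) -> Vec<f32> {
        let mut out = Vec::new();
        match self.prev {
            // The first sample anchors the grid.
            None => {
                out.push(value);
                self.next_us = ts_us + self.period_us;
            }
            Some((t0, v0)) => {
                if ts_us <= t0 {
                    return out; // drop out-of-order or duplicate timestamps
                }
                // Emit every grid point between the previous and current sample.
                while self.next_us <= ts_us {
                    let a = (self.next_us - t0) as f32 / (ts_us - t0) as f32;
                    out.push(v0 + a * (value - v0));
                    self.next_us += self.period_us;
                }
            }
        }
        self.prev = Some((ts_us, value));
        out
    }
}
```

Feeding a 10 Hz input into a 50 Hz grid yields five interpolated samples per input frame. Whether interpolated frames are acceptable training-distribution inputs is exactly the design call the question leaves open.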
---
## 9. Acceptance criteria
This ADR is **Accepted** once:
1. Maintainer review on #513 confirms the architecture.
2. The follow-up implementation issue is filed and references this
ADR plus ADR-096 by number.
3. ADR index in `docs/adr/README.md` (if present) has an ADR-095
row.
This ADR is **Implemented** once:
1. Phase 3 is in `main` with the gating Kconfig off by default.
2. A Phase-4 hardware A/B has been published (witness-bundle
compatible per ADR-028).
3. The QEMU validator (ADR-061) has at minimum a "compiles, doesn't
run" check for the Rust component.
---
## 10. Related
ADR-018 (binary CSI frame), ADR-024 (AETHER contrastive embedding —
host-side counterpart, see ADR-096), ADR-039 (edge intelligence
tiers), ADR-040 (WASM Tier-3 modules — the *other* extensibility
path), ADR-061 (QEMU CI), ADR-081 (adaptive controller, mesh plane,
`rv_feature_state_t`), ADR-091 (stand-off radar tier — adjacent
edge-intelligence ADR), upstream ADR-189 (KV cache incremental
decode), upstream ADR-190 (GQA/MQA), upstream ADR-192 (no_std +
alloc on ESP32-S3 — the structural unblock that makes this ADR
possible).


@ -0,0 +1,389 @@
# ADR-096: AETHER Temporal Head via `ruvllm_sparse_attention::forward_gqa` + Streaming KV Cache
| Field | Value |
|-------------|---------------------------------------------------------------------------------------|
| **Status** | Proposed (2026-05-07) |
| **Date** | 2026-05-07 |
| **Authors** | ruvnet, claude-flow |
| **Related** | ADR-014, ADR-016, ADR-024, ADR-095; upstream ADR-189, ADR-190, ADR-192 |
| **Branch** | `feat/ruvllm-sparse-attention-edge` |
| **Tracking**| #513 |
---
## 1. Context
ADR-024 ("Project AETHER") specifies a contrastive CSI embedding
model on top of the existing `CsiToPoseTransformer` backbone. It
adds a 2-layer projection head to the per-keypoint features and
trains it with InfoNCE + VICReg + (optional) cross-modal alignment.
The **temporal aggregation** that turns per-frame backbone features
into a window-level representation is described at the level of
"a transformer encoder over the CSI window" — but ADR-024 does not
pin a specific attention kernel. In the current code:
- `v2/crates/wifi-densepose-train/src/model.rs` uses
`ruvector_attention::ScaledDotProductAttention` (line 34) and
applies `apply_antenna_attention` over the antenna-path dimension
and `apply_spatial_attention` over the spatial location dimension.
Both are dense.
- The training-side temporal pooling currently runs at
`window_frames = 100` by default (`config.rs:165`), with
`proof.rs` and `trainer.rs` using shorter test windows of 4 and 2
respectively.
- `v2/crates/wifi-densepose-signal/src/ruvsense/pose_tracker.rs`
consumes a 128-dim AETHER re-ID embedding (lines 22 and 263) but does
not perform the temporal aggregation itself — that happens
upstream.
So the temporal head is a real seam in the codebase, but its
specific attention kernel is *currently dense* and *currently not a
named architectural decision*. This ADR makes that decision.
The vendored `ruvllm_sparse_attention` v0.1.1 (synced today,
released 2026-05-07) provides a different kind of temporal kernel:
- **Subquadratic O(N log N)** sparse attention (`forward`,
`forward_flash`).
- **Grouped-Query / Multi-Query Attention** (`forward_gqa`,
`forward_gqa_flash`) — shares K/V across query heads, the
pattern Mistral-7B and Llama-3 use.
- **Streaming KV cache** (`KvCache`, `KvCacheF16`) with H2O
heavy-hitter eviction, allowing token-by-token decode in
**O(log T)** per step against an accumulated cache. See upstream
ADR-189.
- **FastGRNN salience gate** for **near-linear O(N)** when the
log-stride candidate set can be pruned.
These capabilities are qualitatively different from
`ruvector-attention` 2.0.4, which is what the workspace uses today
for spatial / antenna attention.
---
## 2. Decision
The AETHER temporal head will be implemented with
`ruvllm_sparse_attention::SubquadraticSparseAttention::forward_gqa`
for prefill, and `decode_step` against a `KvCache` (with the `fp16`
feature enabled) for streaming inference paths (online re-ID,
incremental embedding extraction during a tracked session).
Concretely:
1. `wifi-densepose-train` adds `ruvllm_sparse_attention` as a
workspace dependency, **path-vendored** against
`vendor/ruvector/crates/ruvllm_sparse_attention` so the workspace
does not gain a crates.io publish dependency.
2. The AETHER block factory takes a feature flag
(`temporal_head = "dense" | "sparse_gqa"`) selecting between the
current dense MHA path and the new sparse-GQA path. The default
for new training runs is `sparse_gqa`. Existing checkpoints
continue to load on `dense`.
3. Signal-side consumers (the streaming embedding extraction used
by `pose_tracker.rs` for re-ID updates) call `decode_step` rather
than re-running prefill on every new frame — this is the
structural win that dense MHA cannot provide.
4. We add an A/B benchmark gate (§5) before flipping the production
default. The default *training* config can move first; the
default *inference* config waits for the gate.
This ADR sanctions the swap. It does not perform the swap; that
lands in a follow-up implementation issue once both ADR-095 and
ADR-096 are accepted.
---
## 3. Quantitative argument
### 3.1 Edge-evaluation count
For a single attention layer over `N` frames:
| Path | Edge evaluations | At `N = 100` (today's default) | At `N = 1000` (10 s @ 100 Hz) | At `N = 8192` |
|------|------------------|--------------------------------|-------------------------------|---------------|
| Dense MHA | `N²` | 1.0 × 10⁴ | 1.0 × 10⁶ | 6.7 × 10⁷ |
| Sparse `forward` (window + log-stride + landmarks) | ~`N · (W + log N + N/B)` | 1.4 × 10⁴ | 1.4 × 10⁴ | 1.1 × 10⁶ |
| Sparse + FastGRNN | ~`N · (W + globals + K)` | linear in `N` (constant per token) | linear in `N` (constant per token) | linear in `N` (constant per token) |
Numbers for the sparse rows are taken from upstream's measured
table (`README.md:230-237`, "sparse-edge reduction vs causal dense
attention"): 8192 → 29.3× edge reduction, 16384 → 57.5×, 32768 →
113.2×.
**The honest framing:** at the *current* AETHER default of
`window_frames = 100`, dense MHA is essentially free and the
sparse machinery has overhead — the per-token cost in upstream's
benchmark is ~2.4 µs at `N = 256` and ~2.1 µs at `N = 128`. The
sparse path probably *loses* below `N ≈ 128`. It starts winning at
the 1 s+ windows we'd realistically use for activity classification
(`N = 200` at 50 Hz, `N = 500` for breathing-quality), and pulls
ahead by 30-100× in edge count at the `N = 8192` to `N = 32768`
lengths upstream measured, the regime long-context re-ID runs in.
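
As a cross-check on the table, the `N = 8192` sparse entry can be recovered from upstream's quoted reduction factor. The sketch assumes "causal dense" means each token attends to itself and all predecessors, i.e. `N(N+1)/2` edges; the function names are hypothetical.

```rust
/// Full (non-causal) dense attention: every token attends to every token.
pub fn dense_edges(n: u64) -> u64 {
    n * n
}

/// Causal dense attention: token i attends to positions 0..=i.
pub fn causal_dense_edges(n: u64) -> u64 {
    n * (n + 1) / 2
}

/// Sparse edge count implied by a measured reduction factor
/// relative to causal dense attention.
pub fn sparse_edges_from_reduction(n: u64, reduction: f64) -> f64 {
    causal_dense_edges(n) as f64 / reduction
}
```

`sparse_edges_from_reduction(8192, 29.3)` lands at roughly 1.1 × 10⁶, matching the sparse row's last column, while `dense_edges(8192)` reproduces the 6.7 × 10⁷ in the dense row.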
### 3.2 Streaming decode
Where dense MHA structurally cannot follow is incremental decode.
Re-ID over a long-tracked person (a 5-minute session at 50 Hz =
15,000 frames) with dense MHA requires recomputing attention from
scratch every time the window slides. With `decode_step` against a
`KvCache`:
| Operation | Dense MHA | Sparse GQA + KV cache |
|-----------|-----------|-----------------------|
| Append one new frame to the embedding context | O(N²) | **O(log T)** |
| Memory growth | O(N · d) per recompute | O(T · d_kv) cached, evicted by H2O heavy-hitter |
| FP16 KV cache | n/a | available via `fp16` feature, halves memory |
This is the qualitative capability dense MHA lacks. Even at small
`N` where dense MHA is competitive on prefill, decode is structurally
different: O(log T) per new frame vs O(N²) recompute.
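
One way to see where the logarithm comes from is to list which cached positions a single new frame would look at under a "local window plus power-of-two strides" pattern. The sketch below is an illustration of that pattern only: it is not upstream's selection rule, it omits landmarks and H2O eviction, and the function name is hypothetical.

```rust
/// Cached positions that the frame at index `t` attends to, oldest first.
/// Assumes `window >= 1`. Illustrative pattern, not the upstream kernel.
pub fn decode_candidates(t: usize, window: usize) -> Vec<usize> {
    // Log-stride positions t - 2^k that reach past the local window.
    let mut out = Vec::new();
    let mut stride = 1usize;
    while stride <= t {
        if stride >= window {
            out.push(t - stride);
        }
        match stride.checked_mul(2) {
            Some(next) => stride = next,
            None => break,
        }
    }
    out.reverse(); // oldest first

    // Local window: the most recent `window` positions, including t itself.
    let lo = t.saturating_sub(window.saturating_sub(1));
    out.extend(lo..=t);
    out
}
```

With `window = 64`, a frame at position 150 touches 66 cached entries and a frame at position 15,000 touches 72: a hundredfold longer session adds six lookups per step.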
---
## 4. Approach
### 4.1 Workspace dependency
Add to `v2/Cargo.toml`:
```toml
[workspace.dependencies]
# TOML 1.0 inline tables must stay on a single line.
ruvllm_sparse_attention = { path = "../vendor/ruvector/crates/ruvllm_sparse_attention", default-features = false, features = ["fp16"] }
```
`default-features = false` mirrors the rest of the workspace's
`--no-default-features` posture (and matches what ADR-095 does on
the firmware side, so both consumers have the same feature set).
We **do not** pull `parallel` here — rayon doesn't help with
inference-shaped batches at the sequence lengths we run, and it
breaks ADR-095's no_std build if the dependency leaks.
### 4.2 Crate placement
Two viable homes for the AETHER temporal head:
| Option | Tradeoffs |
|--------|-----------|
| **A. New `wifi-densepose-temporal` crate** | Cleanest. Unique import surface, easy to feature-gate. But: one more crate in the publishing order (CLAUDE.md crate table grows to 16). |
| **B. Add to `wifi-densepose-train`** | Co-located with the model; no new crate; simpler workspace graph. But: `wifi-densepose-train` is heavyweight (`tch`, full training stack), and signal-side consumers would have to depend on the whole training crate just to run inference. |
**Recommendation: A.** The temporal head is consumed by both
`wifi-densepose-train` (training) and `wifi-densepose-signal`
(inference, re-ID). Pulling those toward a shared third crate keeps
the dependency arrows clean. Also matches ADR-095's
`wifi-densepose-temporal` host-side training crate name —
deliberate convergence.
### 4.3 API sketch
```rust
pub struct AetherTemporalHead {
    backend: TemporalBackend,
    cache: Option<KvCache>, // populated for streaming inference
}

pub enum TemporalBackend {
    Dense(DenseMha), // current ruvector-attention path
    SparseGqa(SubquadraticSparseAttention),
}

impl AetherTemporalHead {
    pub fn new(cfg: &TemporalHeadConfig) -> Self;

    /// Window-level prefill. Returns pooled [d_model] embedding.
    pub fn forward(&self, frames: &Tensor3) -> Vec<f32>;

    /// Incremental decode for streaming re-ID. Updates internal
    /// cache and returns pooled embedding given a single new frame.
    /// SparseGqa backend only.
    pub fn step(&mut self, frame: &Tensor3) -> Result<Vec<f32>, TemporalError>;
}
```
### 4.4 Selection rule
In `forward_auto`'s spirit, the head selects the path based on
`(window, n_q_heads, n_kv_heads)` of the model:
- `window ≤ 64` and dense MHA is in the checkpoint: use dense path.
- `n_q_heads != n_kv_heads`: use `forward_gqa`.
- `n_q_heads == n_kv_heads` and `window > 64`: use `forward`.
- Streaming (per-frame) inference: always `decode_step`.
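
The four rules above translate into a short decision function. The sketch below checks streaming first, since that rule says "always". The enum, the function name, and the fallback for the one case the rules leave unspecified (equal head counts, `window ≤ 64`, no dense weights in the checkpoint) are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TemporalPath {
    Dense,
    ForwardGqa,
    Forward,
    DecodeStep,
}

/// Hypothetical selector mirroring the bullet list in this section.
pub fn select_path(
    window: usize,
    n_q_heads: usize,
    n_kv_heads: usize,
    has_dense_ckpt: bool,
    streaming: bool,
) -> TemporalPath {
    if streaming {
        return TemporalPath::DecodeStep;
    }
    if window <= 64 && has_dense_ckpt {
        return TemporalPath::Dense;
    }
    if n_q_heads != n_kv_heads {
        return TemporalPath::ForwardGqa;
    }
    // Covers window > 64 with equal head counts, and (as an assumption)
    // short windows whose checkpoint carries no dense weights.
    TemporalPath::Forward
}
```

Keeping the rule in one pure function makes it straightforward to unit-test every branch before it is wired into the head.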
---
## 5. Validation gate before flipping the inference default
We do not flip the production inference default until *all four*
of these pass on the most recent AETHER checkpoint:
1. **Contrastive loss within 1%** of the dense baseline at the same
training budget (so the kernel substitution doesn't silently
regress the loss surface).
2. **Re-ID rank-1 accuracy within 1 percentage point** of the dense
baseline on the held-out test split.
3. **Spearman rank correlation ≥ 0.95** between dense-MHA and
sparse-GQA top-50 nearest-neighbour orderings on the
`env_fingerprint` and `person_track` HNSW indices (matches the
ADR-024 §2.5.3 quantisation-rank-preservation criterion).
4. **Latency improvement ≥ 5×** at the deployed window length.
Any of (1)-(3) failing rolls back the default; the kernel can stay
in the codebase as opt-in, but is not what new training runs use.
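
The four criteria can be encoded as a single predicate for a CI job to evaluate. In the sketch below the struct and field names are hypothetical, "within 1%" is read as a relative absolute difference in loss, and the latency criterion is expressed as the ratio of dense to sparse latency.

```rust
/// Measurements from one dense-vs-sparse A/B run (hypothetical struct).
pub struct GateMetrics {
    pub dense_loss: f64,
    pub sparse_loss: f64,
    pub dense_rank1_pct: f64,  // re-ID rank-1 accuracy, in percent
    pub sparse_rank1_pct: f64,
    pub spearman: f64,         // top-50 neighbour ordering correlation
    pub dense_latency_ms: f64,
    pub sparse_latency_ms: f64,
}

/// True only when all four criteria of this section hold.
pub fn gate_passes(m: &GateMetrics) -> bool {
    let loss_ok = ((m.sparse_loss - m.dense_loss) / m.dense_loss).abs() <= 0.01;
    let rank1_ok = (m.sparse_rank1_pct - m.dense_rank1_pct).abs() <= 1.0;
    let spearman_ok = m.spearman >= 0.95;
    let latency_ok = m.dense_latency_ms / m.sparse_latency_ms >= 5.0;
    loss_ok && rank1_ok && spearman_ok && latency_ok
}
```

Returning one boolean keeps the promotion decision auditable; a fuller version would report which criterion failed.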
---
## 6. Alternatives considered
| Option | Why rejected |
|--------|--------------|
| **Keep dense MHA, period** | Simple, but caps the practical window length. The 10 s + windows that long-context re-ID and breathing-quality scoring want are exactly where dense MHA hurts. We'd be locking in a ceiling for no reason. |
| **Use `ruvector-attention` 2.0.4 (already in workspace)** | It's what we use today for antenna and spatial attention. But it lacks GQA, lacks streaming KV cache, and its dependency story upstream is messy (`ruvector-attn-mincut` is stuck at 2.0.4 per the issue). It works, but it's not the right tool for *temporal* attention specifically. |
| **Wait for `ruvector-attention 2.x` to add GQA + KV cache** | Speculative; no published roadmap. Meanwhile `ruvllm_sparse_attention` shipped real artifacts on 2026-05-07 and is path-vendorable today. |
| **Use a non-attention temporal pooler (TCN / S4 / Mamba)** | All three are real options for time-series sensing; some research gives them a slight edge on long-horizon dependencies. But (a) we already have AETHER specified around attention in ADR-024, (b) the contrastive recipe is attention-tuned, (c) we'd be re-running the entire ADR-024 training story to swap to a different family. Switching to *sparse* attention preserves the ADR-024 mathematical apparatus exactly. |
| **`forward_gated_with_fastgrnn` immediately** | Tempting because it's the O(N) path. But the gate adds approximation error on top of the sparsity-induced approximation error. Phase the introductions: prove sparse-GQA matches dense first, then layer the gate on top in a follow-up. |
---
## 7. Consequences
### Positive
- **Long windows are no longer scary.** `window_frames = 1000` for
10 s sessions becomes practical, not aspirational.
- **Streaming re-ID gets a structural speedup.** Per-frame decode
cost goes from O(N²) to O(log T). Pose tracker cost is a real
budget today; this shrinks it.
- **GQA fits the AETHER backbone better.** AETHER's per-keypoint
cross-attention already has a query/key shape mismatch (17
keypoint queries vs N CSI keys). GQA was designed for exactly
this asymmetry.
- **Path-vendored, not crates.io-coupled.** No bind-time risk —
the crate ships from the vendored copy of upstream, and the
vendor was synced today (`e38347601`).
- **Same kernel, two consumers.** ADR-095 wants this on the MCU;
this ADR wants it on the host. Path-vendoring once keeps the
versions in lockstep.
- **Approximation error is bounded** by the local window +
  log-stride + landmark pattern. Upstream's reported figure
  (`README.md` §FAQ) is "<1% perplexity on standard benchmarks" for
  the causal case. Perplexity is a language-modeling metric and does
  not transfer directly to re-ID, so we measure ours via §5's gate.
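
The long-window and per-frame claims above can be made concrete with a toy count of attended keys. This is a minimal sketch, not upstream's kernel: `sparse_keys`, `sparse_pairs`, and the 16-frame window are hypothetical names and values, and the landmark term is left out because this ADR does not pin its layout.

```rust
use std::collections::BTreeSet;

/// Causal key positions a query at `i` attends to under a toy pattern:
/// the last `window` positions (window >= 1) plus positions i - 2^k.
/// Assumption: this only approximates the upstream pattern (no landmarks).
fn sparse_keys(i: usize, window: usize) -> BTreeSet<usize> {
    let mut keys = BTreeSet::new();
    // Local window: positions i-window+1 ..= i, clipped at 0.
    let start = i.saturating_sub(window.saturating_sub(1));
    for j in start..=i {
        keys.insert(j);
    }
    // Log-stride: i - 1, i - 2, i - 4, ... while the offset stays in range.
    let mut stride = 1usize;
    while stride <= i {
        keys.insert(i - stride);
        stride *= 2;
    }
    keys
}

/// Total (query, key) pairs for a causal prefill over `n` frames.
fn sparse_pairs(n: usize, window: usize) -> usize {
    (0..n).map(|i| sparse_keys(i, window).len()).sum()
}

/// Total (query, key) pairs for dense causal attention over `n` frames.
fn dense_pairs(n: usize) -> usize {
    n * (n + 1) / 2
}

fn main() {
    let window = 16; // hypothetical local window size
    for n in [4usize, 100, 1000] {
        println!(
            "n={}: dense pairs={}, sparse pairs={}, keys for last frame={}",
            n,
            dense_pairs(n),
            sparse_pairs(n, window),
            sparse_keys(n - 1, window).len()
        );
    }
}
```

Under these assumptions a 4-frame input attends to exactly the same pairs as dense attention, while the last of 1000 frames attends to 22 keys instead of 1000.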
### Negative
- **Adds a workspace dependency** the team has to know about.
Mitigated by path-vendoring (no version-resolution risk).
- **Approximation error is not zero.** For high-precision re-ID
this needs measurement. §5's gate is the safety net; if rank
correlation drops below 0.95 we don't flip the default.
- **More moving parts in the temporal head.** Dense MHA has one
knob (number of heads). Sparse GQA has window, log-stride,
landmark block size, KV head count, and (later) gate top-K. We
pay this in default-config tuning effort.
- **`KvCache` introduces session state** in a place that didn't
have it. Code that previously called a stateless `forward(...)`
now has to think about cache lifetime per tracked person. The
pose tracker (`pose_tracker.rs`) already has per-track state, so
the natural place for the cache is inside `PoseTrack`; needs a
small lifecycle review.
- **Training and inference paths diverge slightly.** Training
always uses `forward` (full window prefill). Inference uses
`decode_step` for streaming. The two paths must be tested
separately; upstream's `forward` and `decode_step` are unit-test
parity-checked, but our wrapper has its own surface.
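
A wrapper-level parity test for the two paths could look like the sketch below. It uses a toy dense single-head causal attention as a stand-in; `ToyKvCache`, `attend`, `max_parity_gap`, and `frames` are hypothetical, and the `forward` and `decode_step` here only borrow the upstream names, not the upstream signatures.

```rust
/// Toy stand-in for a KV cache: stores every key/value seen so far.
#[derive(Default)]
struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Softmax-weighted sum of `values` for one query over `keys`.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / denom) * x;
        }
    }
    out
}

/// Prefill path: causal attention over the whole window at once.
fn forward(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    (0..q.len()).map(|t| attend(&q[t], &k[..=t], &v[..=t])).collect()
}

impl ToyKvCache {
    /// Streaming path: append one frame, attend over everything cached.
    fn decode_step(&mut self, q: &[f32], k: Vec<f32>, v: Vec<f32>) -> Vec<f32> {
        self.keys.push(k);
        self.values.push(v);
        attend(q, &self.keys, &self.values)
    }
}

/// Largest element-wise gap between the prefill and streaming outputs.
fn max_parity_gap(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> f32 {
    let full = forward(q, k, v);
    let mut cache = ToyKvCache::default();
    let mut gap = 0.0f32;
    for t in 0..q.len() {
        let step = cache.decode_step(&q[t], k[t].clone(), v[t].clone());
        for (a, b) in full[t].iter().zip(&step) {
            gap = gap.max((a - b).abs());
        }
    }
    gap
}

/// Deterministic pseudo-data so the check is reproducible.
fn frames(n: usize, dim: usize, seed: f32) -> Vec<Vec<f32>> {
    (0..n)
        .map(|t| (0..dim).map(|d| ((t * dim + d) as f32 * 0.37 + seed).sin()).collect())
        .collect()
}

fn main() {
    let (q, k, v) = (frames(32, 8, 0.1), frames(32, 8, 1.3), frames(32, 8, 2.7));
    let gap = max_parity_gap(&q, &k, &v);
    assert!(gap < 1e-5, "prefill and streaming diverged: {}", gap);
    println!("max parity gap = {:e}", gap);
}
```

The real test would swap the toy functions for the wrapper's own prefill and streaming entry points and keep the same shape: one pass over the full window, one pass frame by frame, and a tolerance on the per-frame difference.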
### Neutral
- ADR-024 is **not superseded.** The contrastive loss, the
augmentation strategy, the projection head, the HNSW indices —
all unchanged. This ADR makes a single architectural choice
inside ADR-024's "temporal aggregation" black box.
- ADR-016 (RuVector training pipeline integration) is unaffected.
The other RuVector crates (`mincut`, `attn-mincut`,
`temporal-tensor`, `solver`, `attention`) keep their existing
roles in `model.rs`.
---
## 8. Open questions
1. **What is the AETHER temporal head's actual current
architecture in code?** ADR-024 specifies the projection head
precisely (Linear → BN → ReLU → Linear → L2-norm) but the
*temporal aggregation* before that is not pinned. The closest
thing in `model.rs` today is `apply_antenna_attention` and
`apply_spatial_attention`, which are over antenna and spatial
axes, not the temporal axis. So this ADR is, in practice,
choosing the temporal kernel for the *first time* — not
replacing one. Worth confirming with the maintainer before the
implementation PR uses language like "swap" rather than "add".
2. **What window length is the deployed AETHER tracker using
today?** The training default is 100 frames (`config.rs:165`),
but `proof.rs` uses 4 and `trainer.rs` uses 2. The realistic
deployment number determines how much of the §3.1 quantitative
argument is *currently* operative versus *future-state*. If the
answer is "we run AETHER on 4-frame windows", sparse pays
nothing today, and the case for this ADR rests entirely on the
   long-window roadmap. If the answer is 100 frames or more, sparse
   already pays.
3. **Is `FastGrnnGate` worth enabling for re-ID specifically?**
Probably not — re-ID benefits from full-sequence visibility,
and the gate's job is to *prune* long-range candidates. Save
the gate for activity classification (where transient movement
is the signal of interest, and saliency-based pruning matches
the use case). Confirm via §5's accuracy gate when we get there.
4. **Does the cross-modal alignment loss (ADR-024 §2.2.4) need
any change?** The cross-modal loss operates on pooled
`z_csi` (already temporally aggregated) and pooled `z_pose`. As
long as the temporal aggregator returns a comparable pooled
vector, the loss is kernel-agnostic. Likely no change, but
worth a smoke test.
5. **Where does the KV cache live for re-ID?** Per `pose_tracker.rs`,
each `PoseTrack` already has lifecycle (create / update /
evict). The natural place is `PoseTrack::kv_cache:
Option<KvCache>`, populated when the track first emits an
embedding. Eviction policy ties to `track.last_seen` — when
the track is dropped, drop the cache. Spec-level sanity check
only; needs a real design pass in the implementation PR.
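
The cache lifecycle in question 5 can be sketched as follows, assuming the cache is allocated lazily and dropped with its track. Everything here is hypothetical: `KvCache` is a stub that only counts frames, `PoseTrack` shows just the two relevant fields, and `Tracker`, `observe`, `emit_embedding`, and `evict_stale` are illustrative names rather than the contents of `pose_tracker.rs`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the upstream cache; only counts cached frames.
#[derive(Default)]
struct KvCache {
    frames: usize,
}

/// Hypothetical slice of a track; the real `PoseTrack` has more fields.
struct PoseTrack {
    last_seen: u64,
    kv_cache: Option<KvCache>,
}

#[derive(Default)]
struct Tracker {
    tracks: HashMap<u32, PoseTrack>,
}

impl Tracker {
    /// Create or refresh a track. No cache is allocated here.
    fn observe(&mut self, id: u32, now: u64) {
        self.tracks
            .entry(id)
            .and_modify(|t| t.last_seen = now)
            .or_insert(PoseTrack { last_seen: now, kv_cache: None });
    }

    /// Called when a track emits an embedding: allocate the cache on first
    /// use, then record one more cached frame. Returns the cached length,
    /// or `None` if the track does not exist.
    fn emit_embedding(&mut self, id: u32) -> Option<usize> {
        let track = self.tracks.get_mut(&id)?;
        let cache = track.kv_cache.get_or_insert_with(KvCache::default);
        cache.frames += 1;
        Some(cache.frames)
    }

    /// Drop tracks not seen within `max_age`; their caches drop with them.
    fn evict_stale(&mut self, now: u64, max_age: u64) {
        self.tracks
            .retain(|_, t| now.saturating_sub(t.last_seen) <= max_age);
    }
}

fn main() {
    let mut tracker = Tracker::default();
    tracker.observe(7, 0);
    assert!(tracker.tracks[&7].kv_cache.is_none());
    tracker.emit_embedding(7);
    tracker.emit_embedding(7);
    assert_eq!(tracker.tracks[&7].kv_cache.as_ref().map(|c| c.frames), Some(2));
    tracker.evict_stale(100, 30);
    assert!(tracker.tracks.is_empty());
}
```

Keeping the cache as an `Option` field means a track that never emits an embedding pays nothing, and eviction needs no separate cleanup step because the cache is owned by the track.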
---
## 9. Acceptance criteria
This ADR is **Accepted** once:
1. Maintainer review on #513 confirms the architecture and resolves
§8.1 (the "first-time choice vs replacement" framing).
2. Open question §8.2 has a concrete answer (ideally a one-line
pointer to the production training config).
3. The follow-up implementation issue is filed.

This ADR is **Implemented** once:
1. `wifi-densepose-temporal` (or equivalent) ships in the workspace
with a default-off feature flag exposing both dense and
sparse-GQA backends.
2. §5's four-gate validation has run on the most recent AETHER
checkpoint and the result is published (witness-bundle
compatible per ADR-028 if the run is reproducible).
3. The default for new training runs is `sparse_gqa`, with `dense`
still selectable for back-compat.
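
One way the backend choice in criterion 3 might surface in configuration is sketched below. It models a runtime setting rather than the Cargo feature flag itself; `TemporalBackend` and `resolve_backend` are hypothetical names, and only the `sparse_gqa` and `dense` strings come from this ADR.

```rust
use std::str::FromStr;

/// Hypothetical selector for the temporal aggregation backend.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TemporalBackend {
    Dense,
    SparseGqa,
}

impl Default for TemporalBackend {
    /// Post-implementation default for new training runs.
    fn default() -> Self {
        TemporalBackend::SparseGqa
    }
}

impl FromStr for TemporalBackend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dense" => Ok(TemporalBackend::Dense),
            "sparse_gqa" => Ok(TemporalBackend::SparseGqa),
            other => Err(format!("unknown temporal backend: {}", other)),
        }
    }
}

/// Use the configured name if present, otherwise fall back to the default.
fn resolve_backend(configured: Option<&str>) -> Result<TemporalBackend, String> {
    match configured {
        Some(name) => name.parse(),
        None => Ok(TemporalBackend::default()),
    }
}

fn main() {
    println!("{:?}", resolve_backend(None));
    println!("{:?}", resolve_backend(Some("dense")));
    println!("{:?}", resolve_backend(Some("mamba")));
}
```

Rejecting unknown names instead of silently falling back keeps a typo in an old config from quietly changing which kernel a checkpoint was trained with.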
---
## 10. Related
ADR-014 (signal SOTA), ADR-016 (RuVector training pipeline
integration), ADR-024 (AETHER contrastive CSI embedding — this
ADR fills in its temporal-aggregation black box), ADR-095
(on-ESP32-S3 temporal modeling — same crate, different consumer),
upstream ADR-189 (KV cache incremental decode — the basis for
streaming re-ID), upstream ADR-190 (GQA / MQA — what AETHER's 17
keypoint queries × N CSI keys asymmetry naturally maps onto),
upstream ADR-192 (no_std + alloc support — the structural change
that means the *same* kernel runs both on the host here and on
the MCU under ADR-095).