docs(adr): ADR-095/096 — sparse attention on ESP32 + AETHER GQA head (#513)
Two Proposed ADRs covering the integration of vendored ruvllm_sparse_attention v0.1.1 (released 2026-05-07, no_std + alloc validated on real ESP32-S3 per upstream ADR-192).

* ADR-095: adds a learned temporal head to the ESP32-S3 firmware via a Rust component compiled --no-default-features against the 376 KB rlib. Runs alongside the existing physics-only DSP, gated behind a Kconfig (8 MB only initially). Use cases: gesture recognition, fall classification with sequence context, breathing-quality scoring, on-device anomaly detection. Builds on ADR-018, ADR-039, ADR-081.
* ADR-096: adopts forward_gqa + KvCache for the AETHER (ADR-024) contrastive CSI embedding's temporal aggregation. Path-vendored workspace dep, A/B gate before flipping the inference default. ~30-100x speedup at long windows; streaming decode goes from O(N^2) recompute to O(log T) per new frame.

Refs #513
parent e7904786f0
commit 684ef4f1a5
@ -0,0 +1,369 @@
|
|||
# ADR-095: On-ESP32-S3 Temporal Modeling at the Edge via `ruvllm_sparse_attention` (no_std)
|
||||
|
||||
| Field       | Value                                                                                                  |
|-------------|--------------------------------------------------------------------------------------------------------|
| **Status**  | Proposed (2026-05-07)                                                                                  |
| **Date**    | 2026-05-07                                                                                             |
| **Authors** | ruvnet, claude-flow                                                                                    |
| **Related** | ADR-018, ADR-024, ADR-039, ADR-040, ADR-061, ADR-081, ADR-091; upstream ADR-189, ADR-190, ADR-192      |
| **Branch**  | `feat/ruvllm-sparse-attention-edge`                                                                    |
| **Tracking**| #513                                                                                                   |
|
||||
|
||||
---
|
||||
|
||||
## 1. Context
|
||||
|
||||
Today the ESP32-S3 firmware in `firmware/esp32-csi-node/main/` does
|
||||
**physics-only** sensing on-device. The pipeline in `edge_processing.c`
|
||||
runs on Core 1 and produces:
|
||||
|
||||
- Adaptive presence detection (`presence_score`).
|
||||
- Breathing-band (0.1–0.5 Hz) and heart-rate-band (0.8–2.0 Hz) biquad
|
||||
IIR bandpass + zero-crossing BPM estimators.
|
||||
- A motion / fall flag (`flags` bits 0–2 in `edge_vitals_pkt_t` magic
|
||||
`0xC5110002`, plus fused mmWave variant `0xC5110004` per ADR-063).
|
||||
- ADR-081 `rv_feature_state_t` (60 B at magic `0xC5110006`) emitted at
|
||||
1–10 Hz from the adaptive controller's fast loop.
|
||||
|
||||
There is **no learned model of any kind on the MCU**. The closest things
|
||||
are: ADR-039 Tier-1 compressed-CSI emission, ADR-040 WASM modules
|
||||
(Tier-3, but used by the user for ad-hoc DSP, not transformer
|
||||
inference), and the Rust-side AETHER embeddings (ADR-024) which run
|
||||
on the host, not the node. Anomaly detection that needs *temporal
|
||||
context* — "is this fall pattern consistent with a fall, or just a
|
||||
sit-down?" — is structurally absent. The fall debounce in v0.6.x
|
||||
(3-frame consecutive + 5 s cooldown, raised threshold 2.0 → 15.0 rad/s²)
|
||||
is a hand-tuned heuristic exactly because the firmware has nothing
|
||||
better to reason with.
|
||||
|
||||
A second pressure point: the Tmr Svc / FreeRTOS stack is already
|
||||
sensitive. `edge_processing.c` lines 47–48 explicitly note that
|
||||
`process_frame + update_multi_person_vitals` combined used ~6.5–7.5 KB
|
||||
of the 8 KB task stack and that **scratch buffers were moved to static
|
||||
storage to avoid stack overflow.** Any new heavyweight workload — and
|
||||
a transformer forward pass is heavyweight — must therefore live in
|
||||
**its own FreeRTOS task with its own task stack**, not piggyback on
|
||||
the existing edge DSP task.
|
||||
|
||||
The vendored crate `ruvllm_sparse_attention` v0.1.1 (released 2026-05-07,
|
||||
synced today at `vendor/ruvector/crates/ruvllm_sparse_attention/`)
|
||||
removes the previously-blocking `std` requirement. Per upstream
|
||||
**ADR-192**, the crate now compiles cleanly to
|
||||
`xtensa-esp32s3-none-elf` via `espup`, with a measured **376 KB
|
||||
release rlib**, zero runtime dependencies beyond `libm`, and was
|
||||
validated on a real ESP32-S3 (rev v0.2, 16 MB flash). It exposes
|
||||
`SubquadraticSparseAttention`, `KvCache` / `KvCacheF16`, `FastGrnnGate`,
|
||||
`IncrementalLandmarks`, `RuvLlmSparseBlock`, and a `Tensor3` value
|
||||
type. The kernel is O(N log N) by default and near-linear O(N) when
|
||||
the FastGRNN salience gate is enabled.
|
||||
|
||||
This is the first time we have had a credible path to **on-device
|
||||
transformer inference for CSI** without a Python runtime, without
|
||||
TFLite, and without a coprocessor. It is also the right moment to
|
||||
decide *whether* we want it before code starts to land.
|
||||
|
||||
---
|
||||
|
||||
## 2. Decision
|
||||
|
||||
Add a learned **temporal head** to the ESP32-S3 firmware running on
|
||||
the node itself, using `ruvllm_sparse_attention` compiled
|
||||
`--no-default-features` (no_std + alloc, optionally `+fp16`), driven
|
||||
by a small Rust component integrated into the ESP-IDF build. The
|
||||
temporal head runs **alongside** the existing physics-only pipeline,
|
||||
not as a replacement — physics gives us breathing/heart-rate/presence,
|
||||
the temporal head gives us classification and sequence-aware reasoning.
|
||||
|
||||
Concretely:
|
||||
|
||||
1. The temporal head consumes a rolling window of feature vectors
|
||||
(initially the same `rv_feature_state_t` floats already produced
|
||||
by ADR-081, plus optionally a small projection of recent CSI
|
||||
amplitude statistics), length `N` ∈ [100, 500] frames, sampled at
|
||||
the controller's fast-loop rate.
|
||||
2. It outputs a small set of **class logits** for the active
   detection task. The four candidate tasks are listed in
   §4.
|
||||
3. It runs in its own FreeRTOS task on Core 1 (or pinned to whichever
|
||||
core the WiFi driver is *not* on), at a cadence slower than the
|
||||
fast loop — initially 1 Hz, classification-on-demand.
|
||||
4. The kernel is invoked through a thin C ABI (`ruv_temporal_init`,
   `ruv_temporal_push`, `ruv_temporal_classify`,
   `ruv_temporal_destroy`; see §3.2) exported from
   a Rust static library linked into the ESP-IDF build the same way
   the existing Tier-3 components are linked.
|
||||
5. Weights are stored as a flat `f32` (or `f16` with the `fp16`
|
||||
feature) blob in the ESP32-S3 flash, loadable from either an
|
||||
embedded `EMBED_FILES` resource (compile-time bake-in) or NVS
|
||||
(post-flash provisioning, mirroring ADR-040's WASM-upload path).
|
||||
6. The temporal head is gated behind a Kconfig option
|
||||
`CONFIG_CSI_TEMPORAL_HEAD_ENABLED`, **default off**, and is only
|
||||
compiled into the 8 MB build profile until the flash math in §6
|
||||
demonstrates 4 MB headroom.
|
||||
|
||||
This ADR authorizes the architecture; it does **not** ship any of
|
||||
the firmware-side or training-side changes. Implementation lands in
|
||||
follow-up issues per the roadmap in §7.
|
||||
|
||||
---
|
||||
|
||||
## 3. Approach
|
||||
|
||||
### 3.1 Build integration
|
||||
|
||||
ESP-IDF v5.4 already supports Rust components via the
|
||||
`rust-esp32`-style template (a CMake `idf_component_register` shim
|
||||
that runs `cargo build --target xtensa-esp32s3-none-elf` and links
|
||||
the resulting static library). The new component lives at
|
||||
`firmware/esp32-csi-node/components/ruv_temporal/`:
|
||||
|
||||
```
ruv_temporal/
  CMakeLists.txt          # component manifest, Rust build invocation
  Cargo.toml              # crate config: no_std, deps on ruvllm_sparse_attention
  build.rs                # generates the C header from #[no_mangle] exports
  src/lib.rs              # public C ABI: init/push/classify/destroy
  src/window.rs           # rolling frame buffer
  src/weights.rs          # NVS / EMBED_FILES weight loader
  include/ruv_temporal.h  # generated; consumed by edge_processing.c
```
|
||||
|
||||
Cargo features compiled in: `["fp16"]`. **Not** `parallel` (rayon
|
||||
needs threads, breaks no_std). **Not** `std`.
|
||||
|
||||
### 3.2 Interface
|
||||
|
||||
The C ABI is intentionally narrow. It does not expose `Tensor3`,
|
||||
attention configs, or any Rust types — only `float*` buffers and
|
||||
opaque handles:
|
||||
|
||||
```c
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint8_t, uint32_t */
#include "esp_err.h"  /* esp_err_t */

/* Opaque handle; the layout is private to the Rust component. */
typedef struct ruv_temporal_ctx ruv_temporal_ctx_t;

esp_err_t ruv_temporal_init(const uint8_t *weights, size_t wlen,
                            uint32_t input_dim, uint32_t window,
                            ruv_temporal_ctx_t **out_ctx);
esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
                                float *logits, uint32_t n_classes);
void      ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);
```
|
||||
|
||||
`push` is the hot path and must be cheap (it just writes into a
|
||||
ring buffer in PSRAM if available, IRAM/DRAM otherwise). `classify`
|
||||
runs the actual sparse attention forward and is the budget-heavy
|
||||
call.
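
As an illustration of why `push` can stay cheap, here is a minimal sketch of the rolling frame buffer that `src/window.rs` might hold. The `FrameWindow` type and its methods are hypothetical names, not part of `ruvllm_sparse_attention`, and the sketch uses `std`'s `Vec` for host-side readability; a no_std build would presumably take `Vec` from `alloc` instead.

```rust
/// Fixed-capacity rolling window of feature frames.
/// Hypothetical sketch; not an API of the vendored crate.
pub struct FrameWindow {
    input_dim: usize,
    capacity: usize, // window length N, in frames
    data: Vec<f32>,  // capacity * input_dim floats, allocated once in new()
    head: usize,     // slot the next frame will be written to
    len: usize,      // number of valid frames, saturates at capacity
}

impl FrameWindow {
    pub fn new(input_dim: usize, capacity: usize) -> Self {
        assert!(input_dim > 0 && capacity > 0);
        Self {
            input_dim,
            capacity,
            data: vec![0.0; input_dim * capacity],
            head: 0,
            len: 0,
        }
    }

    /// Hot path: copy one frame into the next slot, overwriting the oldest.
    /// No allocation happens here. Returns false on a dimension mismatch.
    pub fn push(&mut self, frame: &[f32]) -> bool {
        if frame.len() != self.input_dim {
            return false;
        }
        let start = self.head * self.input_dim;
        self.data[start..start + self.input_dim].copy_from_slice(frame);
        self.head = (self.head + 1) % self.capacity;
        if self.len < self.capacity {
            self.len += 1;
        }
        true
    }

    pub fn is_full(&self) -> bool {
        self.len == self.capacity
    }

    /// Copy the valid frames oldest-first into `out`, which is the
    /// contiguous layout a classify call would hand to the kernel.
    pub fn copy_ordered(&self, out: &mut Vec<f32>) {
        out.clear();
        let first = (self.head + self.capacity - self.len) % self.capacity;
        for i in 0..self.len {
            let slot = (first + i) % self.capacity;
            let start = slot * self.input_dim;
            out.extend_from_slice(&self.data[start..start + self.input_dim]);
        }
    }
}
```

In this sketch the only per-frame work is one `copy_from_slice` and two index updates; the reordering cost is paid in `copy_ordered`, on the slower classify cadence.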
|
||||
|
||||
### 3.3 Task topology
|
||||
|
||||
A new task `ruv_temporal_task` with its own 16 KB stack, pinned to
|
||||
the same core as the edge DSP task (Core 1), fed via a FreeRTOS
|
||||
queue from the adaptive controller's fast loop. We do **not** call
|
||||
the kernel from the existing edge task — the edge stack is already
|
||||
near-full per the comment at `edge_processing.c:47-48` and recent
|
||||
fall-debounce / Tmr-Svc-stack work.
|
||||
|
||||
### 3.4 Memory budget (per inference)
|
||||
|
||||
With `N = 256` (sequence length in frames), `d_model = 32`, `n_heads = 4`, `head_dim = 8`,
1–2 `RuvLlmSparseBlock` layers, `block_size = 64`, and a local attention `window = 64`:

- Weights: ~5–15 KB (single block, INT8 quant deferred to a later
  ADR; FP16 default).
- KV cache (FP16, full window): `2 * 256 * 4 * 8 * 2 B = 32 KiB` per layer.
- Activations (peak, with `forward_flash` tiling): ≈ 2 KB.
- Working set: < 64 KB for a single block (a second block adds another
  32 KiB of KV cache). Comfortable in PSRAM, possible in ISR-safe
  internal SRAM.
|
||||
|
||||
These are first-pass estimates; the precise numbers come out of the
`forward_flash` benchmark on real hardware, which is an exit criterion
in §7.
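
The KV-cache line item can be re-derived for other configurations with a one-line helper. This is a sketch only: `kv_cache_bytes` is a hypothetical name, and it assumes the cache stores one K and one V vector per KV head per frame with no padding or metadata, which the real `KvCache` / `KvCacheF16` layout may not match.

```rust
/// Bytes for one layer's K and V tensors over a full window.
/// Assumption: 2 tensors (K and V), densely packed, no per-entry overhead.
pub const fn kv_cache_bytes(
    window: usize,
    n_kv_heads: usize,
    head_dim: usize,
    bytes_per_elem: usize,
) -> usize {
    2 * window * n_kv_heads * head_dim * bytes_per_elem
}
```

Evaluating the expression from the list above, `kv_cache_bytes(256, 4, 8, 2)` gives 32,768 B per layer; the same call with 4-byte `f32` elements gives 65,536 B.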
|
||||
|
||||
### 3.5 Compatibility with ADR-081 / ADR-039 / ADR-018
|
||||
|
||||
The temporal head is a **consumer** of the same feature stream
|
||||
already flowing in the firmware. It does not alter:
|
||||
|
||||
- ADR-018 raw CSI frame layout (`0xC5110001`).
|
||||
- ADR-039 Tier-1 compressed CSI (`0xC5110005`) or vitals
|
||||
(`0xC5110002`).
|
||||
- ADR-063 fused vitals (`0xC5110004`).
|
||||
- ADR-081 `rv_feature_state_t` (`0xC5110006`) — this is the primary
|
||||
input we tap.
|
||||
|
||||
If the temporal head fires a classification, the result rides on a
|
||||
new `0xC5110007` packet (small: class id, confidence, monotonic seq,
|
||||
ts_us, CRC32). Allocation of that magic is deferred to the
|
||||
implementation PR — this ADR reserves the *concept*, not the byte
|
||||
layout.
|
||||
|
||||
---
|
||||
|
||||
## 4. Use cases that motivate this
|
||||
|
||||
| Task | Why temporal context matters | Window | Class count |
|------|------------------------------|--------|-------------|
| **Gesture recognition** (wave / point / clap / kick) | Single-frame CSI snapshots can't disambiguate gestures from random motion. ~100-frame windows capture full gesture trajectories. | 100 frames @ 50 Hz = 2 s | 4–8 |
| **Fall classification with sequence context** | The current heuristic ("> 15 rad/s² for 3 consecutive frames + 5 s cooldown") was raised to suppress false positives. A learned temporal head can distinguish a fall (rapid descent then stillness) from a sit-down (descent then sustained micro-motion) using the same input window. | 200 frames @ 50 Hz = 4 s | 3 (fall / sit / nothing) |
| **Breathing-quality scoring** | Today's pipeline emits a BPM and a confidence float. A temporal head trained on labeled apnea / shallow / paradoxical / normal sequences can output a 4-class quality label that downstream consumers can render in one glance. | 500 frames @ 50 Hz = 10 s | 4 |
| **"Is this normal for this room/time" anomaly detection** | Per-room SONA profiles (ADR-005) capture environment statistics, but anomaly *temporal shape* is currently checked host-side via embedding distance (ADR-024 §2.4 `temporal_baseline` index). A small on-device classifier can flag ahead of host roundtrip. | 300 frames | 2 (normal / anomalous) |
|
||||
|
||||
These four cover the visible product gaps in the v0.6.x line.
|
||||
Gesture recognition is the headline; fall classification is the
|
||||
highest-impact for the eldercare scenarios v0.5.4 was tuned for.
|
||||
|
||||
---
|
||||
|
||||
## 5. Alternatives considered
|
||||
|
||||
| Option | Why rejected |
|--------|--------------|
| **TFLite Micro** | Heavier runtime (~150 KB code + interpreter), pulls in C++ STL surface, no Rust-native API. Does not benefit from sparse attention specifically. We'd be re-paying the cost of a full inference framework when we only need one kernel. |
| **Run all classifiers server-side** | Costs a full Tier-1 CSI uplink (~50–70 KB/s/node per ADR-039) just to feed a remote classifier, then a roundtrip back. Defeats the point of ADR-081's compact feature stream and makes the system worthless when the backhaul is down. Also leaks raw CSI to the network for purposes the user did not opt into. |
| **Stay physics-only forever** | Cleanest from a maintenance standpoint, but it structurally rules out gesture recognition, and the fall-debounce hack will keep accreting per-deployment knobs. The product space already has commodity physics-only firmware (Bosch presence sensors, etc.); on-device transformer inference for CSI is what would *differentiate* RuView. |
| **Use `ruvector-attention` (already in workspace) on-device** | `ruvector-attention` is `std`-bound today; doesn't compile to `xtensa-esp32s3-none-elf` without a port comparable in scope to upstream ADR-192. Even if ported, it doesn't give us GQA + streaming KV cache, which is the structural capability the new crate adds. |
| **Wait for IEEE 802.11bf** | Different problem (standardised CSI exposure across vendors). Doesn't address whether the model runs on-device or off. |
|
||||
|
||||
---
|
||||
|
||||
## 6. Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- **Genuinely novel.** No competing CSI-sensing project ships
|
||||
transformer inference on the MCU itself. The closest peers
|
||||
(Espressif's ESP-DL, Edge Impulse) are non-attention CNN/RNN
|
||||
pipelines.
|
||||
- **Latency.** Classification result is local — no backhaul,
|
||||
no host roundtrip, sub-100 ms gesture-to-action.
|
||||
- **Privacy.** Raw CSI never leaves the node for these tasks.
|
||||
- **Reuses the ADR-081 feature stream** — the temporal head is a
|
||||
consumer of the existing 60 B `rv_feature_state_t`, not a new
|
||||
uplink format.
|
||||
- **Validated kernel.** Per upstream ADR-192, the no_std build was
|
||||
validated on real ESP32-S3 hardware (MAC `ac:a7:04:e2:66:24`).
|
||||
We are not betting on a paper crate.
|
||||
|
||||
### Negative / tradeoffs
|
||||
|
||||
- **Flash budget pressure on 4 MB boards.** Per `partitions_4mb.csv`,
  each OTA slot is `0x1D0000` bytes (1,900,544 B, about 1.81 MiB). The
  current build is ~853 KiB. Adding a 376 KB rlib plus weights brings
  us to ~1.3 MB, which is still under the slot ceiling but leaves
  limited headroom for other growth. **Decision: temporal head is 8 MB-only initially**, gated
  behind `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`. 4 MB enablement is a
  separate ADR after we measure the actual incremental link size
  (the 376 KB upstream number is for the rlib in isolation; the
  linked-and-stripped final binary delta is expected to be smaller).
|
||||
- **Rust toolchain dependency.** The ESP-IDF build now needs
|
||||
`espup` + `cargo +esp` to be present on every developer machine
|
||||
and CI runner. This is a real hurdle on Windows — see
|
||||
`CLAUDE.local.md` for the existing Python-subprocess wrapper
|
||||
required to run ESP-IDF cleanly. CI will need a parallel
|
||||
Rust-toolchain step.
|
||||
- **One more thing to test.** QEMU (ADR-061) does not run the
|
||||
ESP32-S3 Xtensa Rust binary today. The QEMU validator pipeline
|
||||
will need a build matrix entry for "Rust component compiled but
|
||||
classifier disabled" at minimum.
|
||||
- **Stack overflow risk.** Same hazard the v0.6.4 work just
|
||||
navigated. Mitigated by §3.3 (own task, own stack); this needs
|
||||
to be a code-review checklist item.
|
||||
- **Weights provenance.** Once we ship a model, we need a story
|
||||
for *which model*, signed by *whom*, retrained *how often*. See
|
||||
Open Questions §8.
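
The flash-budget arithmetic in the first bullet above can be kept honest with a small helper. This is a sketch under stated assumptions: `OTA_SLOT_BYTES` takes the `0x1D0000` figure quoted from `partitions_4mb.csv`, the helper names are hypothetical, and the upstream "376 KB" is treated as KiB, which the source does not confirm.

```rust
/// OTA slot size quoted for the 4 MB partition table.
pub const OTA_SLOT_BYTES: u32 = 0x1D_0000;

/// Convert KiB to bytes.
pub const fn kib(n: u32) -> u32 {
    n * 1024
}

/// Remaining bytes in one OTA slot after adding `added` to the current
/// image, or None if the image would no longer fit (or the sum overflows).
pub fn slot_headroom_bytes(current_image: u32, added: &[u32]) -> Option<u32> {
    let total = added
        .iter()
        .try_fold(current_image, |acc, &b| acc.checked_add(b))?;
    OTA_SLOT_BYTES.checked_sub(total)
}
```

With the figures from the bullet (853 KiB image, 376 KiB rlib counted in full, 15 KiB of weights) this returns `Some(kib(612))`, i.e. roughly 0.6 MiB of headroom in the worst case where none of the rlib is stripped at link time.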
|
||||
|
||||
### Neutral
|
||||
|
||||
- ADR-040's WASM Tier-3 path is **not** superseded. WASM remains
|
||||
the right choice for user-uploaded modules. The temporal head is
|
||||
a first-party signed-by-us component, with a different deploy
|
||||
story.
|
||||
- The host-side ADR-024 AETHER pipeline is unchanged by this ADR.
|
||||
ADR-096 covers the host-side use of the same crate.
|
||||
|
||||
---
|
||||
|
||||
## 7. Roadmap
|
||||
|
||||
| Phase | Scope | Gating |
|-------|-------|--------|
| 0 | This ADR + ADR-096 land. No code. | Maintainer review of #513. |
| 1 | New crate `wifi-densepose-temporal` (host-side only): defines the temporal-head architecture, training script, weight serialization format. | Phase 0 accepted. |
| 2 | `ruv_temporal` ESP-IDF component scaffolding: empty kernel, just the C ABI and ring buffer. Compiles cleanly into 8 MB firmware. Adds ~5 KB to binary. | Phase 1 produces a serialised set of weights. |
| 3 | Wire `ruvllm_sparse_attention` `forward` (not yet `forward_gated`) into the component. First on-target classification benchmark on COM7. Gate: end-to-end inference ≤ 50 ms with `N = 256`, no stack overflow under 24 h soak. | Phase 2 ABI stable. |
| 4 | First trained classifier (gesture or fall, whichever has labelled data first). Hardware A/B: temporal-head decision vs current heuristic on a held-out set. Promotion criterion: temporal head matches or beats heuristic on F1 *and* false-positive rate. | Phase 3 latency gate met. |
| 5 | 4 MB profile gating: measure actual binary delta, decide whether to enable on SuperMini. | Phase 4 in production on 8 MB. |
| 6 | `forward_gated_with_fastgrnn` for long-window tasks (breathing-quality at N = 500). | Phase 4 stable. |
|
||||
|
||||
---
|
||||
|
||||
## 8. Open questions
|
||||
|
||||
1. **Who trains the temporal heads?** Two options:
|
||||
(a) host-side training on captured `rv_feature_state_t` traces
|
||||
labelled in-app, then export to flat-buffer weights;
|
||||
(b) teacher-distillation from the larger AETHER model (ADR-024)
|
||||
running off-device, using soft labels. Option (b) is more
|
||||
data-efficient but couples this ADR's ship date to ADR-024's
|
||||
training-pipeline maturity. Open.
|
||||
2. **How are weights flashed?** Three options, in increasing
|
||||
capability: NVS blob (small, safe, 4–8 KB ceiling per key),
|
||||
`EMBED_FILES` baked into the firmware image (no runtime update),
|
||||
OTA-updateable partition (mirrors ADR-040 RVF upload path,
|
||||
biggest engineering cost). Phase 2/3 will pick one; my prior is
|
||||
`EMBED_FILES` for the first model, OTA partition once we have
|
||||
more than one.
|
||||
3. **Does the 376 KB rlib figure scale?** Upstream measured
|
||||
376 KB for the kernel + the embedding/projection
|
||||
weights for *their* test config. Adding 1–2
|
||||
`RuvLlmSparseBlock` layers with embedding/projection weights
|
||||
sized to actual CSI feature dimension may push this. Phase 2
|
||||
will measure the on-target stripped-binary delta directly; if
|
||||
the delta exceeds 600 KB we revisit the 4 MB story sooner.
|
||||
4. **What window length is right for fall classification?**
|
||||
200 frames at 50 Hz = 4 s feels right based on the v0.6.4
|
||||
debounce numbers (3-frame consecutive + 5 s cooldown is
|
||||
essentially a 4-second decision window already). Empirical, not
|
||||
architectural — set in Phase 4.
|
||||
5. **Quantisation.** First model ships FP16 (KV cache feature flag
|
||||
already supports this). INT8 for both weights and activations
|
||||
is a follow-up; the current crate has no INT8 path so it would
|
||||
be a separate kernel.
|
||||
6. **What happens when the controller is in `RV_PROFILE_PASSIVE_LOW_RATE`?**
|
||||
The fast loop slows down, so the input frame rate to the
|
||||
temporal head drops. Either the head needs to handle variable
|
||||
sample rate (resample at push time) or it stops emitting until
|
||||
the controller goes back to active. Phase 1 design call.
|
||||
|
||||
---
|
||||
|
||||
## 9. Acceptance criteria
|
||||
|
||||
This ADR is **Accepted** once:
|
||||
|
||||
1. Maintainer review on #513 confirms the architecture.
|
||||
2. The follow-up implementation issue is filed and references this
|
||||
ADR plus ADR-096 by number.
|
||||
3. ADR index in `docs/adr/README.md` (if present) has an ADR-095
|
||||
row.
|
||||
|
||||
This ADR is **Implemented** once:
|
||||
|
||||
1. Phase 3 is in `main` with the gating Kconfig off by default.
|
||||
2. A Phase-4 hardware A/B has been published (witness-bundle
|
||||
compatible per ADR-028).
|
||||
3. The QEMU validator (ADR-061) has at minimum a "compiles, doesn't
|
||||
run" check for the Rust component.
|
||||
|
||||
---
|
||||
|
||||
## 10. Related
|
||||
|
||||
ADR-018 (binary CSI frame), ADR-024 (AETHER contrastive embedding —
|
||||
host-side counterpart, see ADR-096), ADR-039 (edge intelligence
|
||||
tiers), ADR-040 (WASM Tier-3 modules — the *other* extensibility
|
||||
path), ADR-061 (QEMU CI), ADR-081 (adaptive controller, mesh plane,
|
||||
`rv_feature_state_t`), ADR-091 (stand-off radar tier — adjacent
|
||||
edge-intelligence ADR), upstream ADR-189 (KV cache incremental
|
||||
decode), upstream ADR-190 (GQA/MQA), upstream ADR-192 (no_std +
|
||||
alloc on ESP32-S3 — the structural unblock that makes this ADR
|
||||
possible).
|
||||
|
|
@ -0,0 +1,389 @@
|
|||
# ADR-096: AETHER Temporal Head via `ruvllm_sparse_attention::forward_gqa` + Streaming KV Cache
|
||||
|
||||
| Field       | Value                                                                                 |
|-------------|---------------------------------------------------------------------------------------|
| **Status**  | Proposed (2026-05-07)                                                                 |
| **Date**    | 2026-05-07                                                                            |
| **Authors** | ruvnet, claude-flow                                                                   |
| **Related** | ADR-014, ADR-016, ADR-024, ADR-095; upstream ADR-189, ADR-190, ADR-192                |
| **Branch**  | `feat/ruvllm-sparse-attention-edge`                                                   |
| **Tracking**| #513                                                                                  |
|
||||
|
||||
---
|
||||
|
||||
## 1. Context
|
||||
|
||||
ADR-024 ("Project AETHER") specifies a contrastive CSI embedding
|
||||
model on top of the existing `CsiToPoseTransformer` backbone. It
|
||||
adds a 2-layer projection head to the per-keypoint features and
|
||||
trains it with InfoNCE + VICReg + (optional) cross-modal alignment.
|
||||
The **temporal aggregation** that turns per-frame backbone features
|
||||
into a window-level representation is described at the level of
|
||||
"a transformer encoder over the CSI window" — but ADR-024 does not
|
||||
pin a specific attention kernel. In the current code:
|
||||
|
||||
- `v2/crates/wifi-densepose-train/src/model.rs` uses
|
||||
`ruvector_attention::ScaledDotProductAttention` (line 34) and
|
||||
applies `apply_antenna_attention` over the antenna-path dimension
|
||||
and `apply_spatial_attention` over the spatial location dimension.
|
||||
Both are dense.
|
||||
- The training-side temporal pooling currently runs at
|
||||
`window_frames = 100` by default (`config.rs:165`), with
|
||||
`proof.rs` and `trainer.rs` using shorter test windows of 4 and 2
|
||||
respectively.
|
||||
- `v2/crates/wifi-densepose-signal/src/ruvsense/pose_tracker.rs`
|
||||
consumes a 128-dim AETHER re-ID embedding (line 22, 263) but does
|
||||
not perform the temporal aggregation itself — that happens
|
||||
upstream.
|
||||
|
||||
So the temporal head is a real seam in the codebase, but its
|
||||
specific attention kernel is *currently dense* and *currently not a
|
||||
named architectural decision*. This ADR makes that decision.
|
||||
|
||||
The vendored `ruvllm_sparse_attention` v0.1.1 (synced today,
|
||||
released 2026-05-07) provides a different kind of temporal kernel:
|
||||
|
||||
- **Subquadratic O(N log N)** sparse attention (`forward`,
|
||||
`forward_flash`).
|
||||
- **Grouped-Query / Multi-Query Attention** (`forward_gqa`,
|
||||
`forward_gqa_flash`) — shares K/V across query heads, the
|
||||
pattern Mistral-7B and Llama-3 use.
|
||||
- **Streaming KV cache** (`KvCache`, `KvCacheF16`) with H2O
|
||||
heavy-hitter eviction, allowing token-by-token decode in
|
||||
**O(log T)** per step against an accumulated cache. See upstream
|
||||
ADR-189.
|
||||
- **FastGRNN salience gate** for **near-linear O(N)** when the
|
||||
log-stride candidate set can be pruned.
|
||||
|
||||
These capabilities are qualitatively different from
|
||||
`ruvector-attention` 2.0.4, which is what the workspace uses today
|
||||
for spatial / antenna attention.
|
||||
|
||||
---
|
||||
|
||||
## 2. Decision
|
||||
|
||||
The AETHER temporal head will be implemented with
|
||||
`ruvllm_sparse_attention::SubquadraticSparseAttention::forward_gqa`
|
||||
for prefill, and `decode_step` against a `KvCache` (with the `fp16`
|
||||
feature enabled) for streaming inference paths (online re-ID,
|
||||
incremental embedding extraction during a tracked session).
|
||||
|
||||
Concretely:
|
||||
|
||||
1. `wifi-densepose-train` adds `ruvllm_sparse_attention` as a
|
||||
workspace dependency, **path-vendored** against
|
||||
`vendor/ruvector/crates/ruvllm_sparse_attention` so the workspace
|
||||
does not gain a crates.io publish dependency.
|
||||
2. The AETHER block factory takes a feature flag
|
||||
(`temporal_head = "dense" | "sparse_gqa"`) selecting between the
|
||||
current dense MHA path and the new sparse-GQA path. The default
|
||||
for new training runs is `sparse_gqa`. Existing checkpoints
|
||||
continue to load on `dense`.
|
||||
3. Signal-side consumers (the streaming embedding extraction used
|
||||
by `pose_tracker.rs` for re-ID updates) call `decode_step` rather
|
||||
than re-running prefill on every new frame — this is the
|
||||
structural win that dense MHA cannot provide.
|
||||
4. We add an A/B benchmark gate (§5) before flipping the production
|
||||
default. The default *training* config can move first; the
|
||||
default *inference* config waits for the gate.
|
||||
|
||||
This ADR sanctions the swap. It does not perform the swap; that
|
||||
lands in a follow-up implementation issue once both ADR-095 and
|
||||
ADR-096 are accepted.
|
||||
|
||||
---
|
||||
|
||||
## 3. Quantitative argument
|
||||
|
||||
### 3.1 Edge-evaluation count
|
||||
|
||||
For a single attention layer over `N` frames:
|
||||
|
||||
| Path | Edge evaluations | At `N = 100` (today's default) | At `N = 1000` (10 s @ 100 Hz) | At `N = 8192` |
|------|------------------|--------------------------------|-------------------------------|---------------|
| Dense MHA | `N²` | 1.0 × 10⁴ | 1.0 × 10⁶ | 6.7 × 10⁷ |
| Sparse `forward` (window + log-stride + landmarks) | ~`N · (W + log N + N/B)` | 1.4 × 10⁴ | 1.4 × 10⁵ | 1.1 × 10⁶ |
| Sparse + FastGRNN | ~`N · (W + globals + K)` | linear in `N` (constant per token) | linear in `N` (constant per token) | linear in `N` (constant per token) |
|
||||
|
||||
Numbers for the sparse rows are taken from upstream's measured
|
||||
table (`README.md:230-237`, "sparse-edge reduction vs causal dense
|
||||
attention"): 8192 → 29.3× edge reduction, 16384 → 57.5×, 32768 →
|
||||
113.2×.
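
To relate upstream's reduction factors back to absolute edge counts, the following sketch divides the causal-dense edge count by the quoted factor. The function names are hypothetical, and the causal-dense count `N(N+1)/2` is an assumption about what upstream means by "causal dense attention" (each token attending to itself and all earlier tokens).

```rust
/// Edges in causal dense attention: token i attends to tokens 0..=i.
pub fn causal_dense_edges(n: u64) -> u64 {
    n * (n + 1) / 2
}

/// Sparse edge count implied by a "reduction vs causal dense" factor.
pub fn sparse_edges_from_reduction(n: u64, reduction: f64) -> f64 {
    causal_dense_edges(n) as f64 / reduction
}

/// Average edges per token implied by the same factor.
pub fn sparse_edges_per_token(n: u64, reduction: f64) -> f64 {
    sparse_edges_from_reduction(n, reduction) / n as f64
}
```

Under that assumption, the 29.3× factor at `N = 8192` corresponds to about 1.1 × 10⁶ edges, or roughly 140 edges per token, and the 16384 and 32768 factors land in the same 140–145 per-token range.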
|
||||
|
||||
**The honest framing:** at the *current* AETHER default of
`window_frames = 100`, dense MHA is essentially free and the
sparse machinery has overhead: the per-token cost in upstream's
benchmark is ~2.4 µs at `N = 256` and ~2.1 µs at `N = 128`. The
sparse path probably *loses* below `N ≈ 128`. It starts winning at
the multi-second windows we'd realistically use for activity classification
(`N = 200` at 50 Hz, `N = 500` for breathing-quality). The 30–100×
edge reductions in upstream's table apply at `N ≥ 8192`, which is far
longer than a 10 s window at 50–100 Hz (`N = 500`–`1000`); long-context
re-ID only sees gains of that size if its context grows into that range.
|
||||
|
||||
### 3.2 Streaming decode
|
||||
|
||||
Where dense MHA structurally cannot follow is incremental decode.
|
||||
Re-ID over a long-tracked person (a 5-minute session at 50 Hz =
|
||||
15,000 frames) with dense MHA requires recomputing attention from
|
||||
scratch every time the window slides. With `decode_step` against a
|
||||
`KvCache`:
|
||||
|
||||
| Operation | Dense MHA | Sparse GQA + KV cache |
|-----------|-----------|-----------------------|
| Append one new frame to the embedding context | O(N²) | **O(log T)** |
| Memory growth | O(N · d) per recompute | O(T · d_kv) cached, evicted by H2O heavy-hitter |
| FP16 KV cache | n/a | available via `fp16` feature, halves memory |
|
||||
|
||||
This is the qualitative capability dense MHA lacks. Even at small
`N` where dense MHA is competitive on prefill, decode is structurally
different: O(log T) per new frame vs O(N²) recompute.
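
A rough cost model makes the gap concrete. This is a sketch, not a measurement: both function names are hypothetical, the dense figure counts attention edges for a full recompute, and the decode figure counts only the log-stride candidates (`ceil(log2 T)`), ignoring the constant local-window and landmark terms that the real `decode_step` also visits.

```rust
/// Attention edges for recomputing dense MHA over the whole window.
pub fn dense_recompute_edges(window: u64) -> u64 {
    window * window
}

/// Log-stride candidates visited by one decode step against `cached`
/// frames: ceil(log2(cached)), with 0 for an empty or single-entry cache.
pub fn decode_step_candidates(cached: u64) -> u32 {
    if cached <= 1 {
        0
    } else {
        // Bit length of (cached - 1) equals ceil(log2(cached)).
        64 - (cached - 1).leading_zeros()
    }
}
```

For the 15,000-frame session above, the model gives 14 log-stride candidates per new frame, against 2.25 × 10⁸ edges if a dense layer were recomputed over the full session.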
|
||||
|
||||
---
|
||||
|
||||
## 4. Approach
|
||||
|
||||
### 4.1 Workspace dependency
|
||||
|
||||
Add to `v2/Cargo.toml`:
|
||||
|
||||
```toml
[workspace.dependencies]
# TOML inline tables must stay on a single line.
ruvllm_sparse_attention = { path = "../vendor/ruvector/crates/ruvllm_sparse_attention", default-features = false, features = ["fp16"] }
```
|
||||
|
||||
`default-features = false` mirrors the rest of the workspace's
|
||||
`--no-default-features` posture (and matches what ADR-095 does on
|
||||
the firmware side, so both consumers have the same feature set).
|
||||
We **do not** pull `parallel` here — rayon doesn't help with
|
||||
inference-shaped batches at the sequence lengths we run, and it
|
||||
breaks ADR-095's no_std build if the dependency leaks.
|
||||
|
||||
### 4.2 Crate placement
|
||||
|
||||
Two viable homes for the AETHER temporal head:
|
||||
|
||||
| Option | Tradeoffs |
|--------|-----------|
| **A. New `wifi-densepose-temporal` crate** | Cleanest. Unique import surface, easy to feature-gate. But: one more crate in the publishing order (CLAUDE.md crate table grows to 16). |
| **B. Add to `wifi-densepose-train`** | Co-located with the model; no new crate; simpler workspace graph. But: `wifi-densepose-train` is heavyweight (`tch`, full training stack), and signal-side consumers would have to depend on the whole training crate just to run inference. |
|
||||
|
||||
**Recommendation: A.** The temporal head is consumed by both
|
||||
`wifi-densepose-train` (training) and `wifi-densepose-signal`
|
||||
(inference, re-ID). Pulling those toward a shared third crate keeps
|
||||
the dependency arrows clean. Also matches ADR-095's
|
||||
`wifi-densepose-temporal` host-side training crate name —
|
||||
deliberate convergence.
|
||||
|
||||
### 4.3 API sketch
|
||||
|
||||
```rust
|
||||
pub struct AetherTemporalHead {
|
||||
backend: TemporalBackend,
|
||||
cache: Option<KvCache>, // populated for streaming inference
|
||||
}
|
||||
|
||||
pub enum TemporalBackend {
|
||||
Dense(DenseMha), // current ruvector-attention path
|
||||
SparseGqa(SubquadraticSparseAttention),
|
||||
}
|
||||
|
||||
impl AetherTemporalHead {
|
||||
pub fn new(cfg: &TemporalHeadConfig) -> Self;
|
||||
|
||||
/// Window-level prefill. Returns pooled [d_model] embedding.
|
||||
pub fn forward(&self, frames: &Tensor3) -> Vec<f32>;
|
||||
|
||||
/// Incremental decode for streaming re-ID. Updates internal
|
||||
/// cache and returns pooled embedding given a single new frame.
|
||||
/// SparseGqa backend only.
|
||||
pub fn step(&mut self, frame: &Tensor3) -> Result<Vec<f32>, TemporalError>;
|
||||
}
|
||||
```
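
A minimal sketch of how the "SparseGqa backend only" contract on `step` might surface to callers. Everything here is a stand-in: `KvCache` just stores frames, mean pooling replaces the real attention kernel, frames are plain `&[f32]` slices rather than `Tensor3`, and both `TemporalError` variants are hypothetical names.

```rust
/// Stand-in for upstream `KvCache`; it only stores the frames seen so far.
#[derive(Default)]
pub struct KvCache {
    frames: Vec<Vec<f32>>,
}

#[derive(Debug, PartialEq)]
pub enum TemporalError {
    /// Hypothetical: `step` was called on a backend that has no KV cache.
    StreamingUnsupported,
    /// Hypothetical: the frame width does not match `d_model`.
    BadFrameWidth { expected: usize, got: usize },
}

pub enum TemporalBackend {
    Dense,
    SparseGqa,
}

pub struct AetherTemporalHead {
    backend: TemporalBackend,
    d_model: usize,
    cache: Option<KvCache>,
}

impl AetherTemporalHead {
    pub fn new(backend: TemporalBackend, d_model: usize) -> Self {
        Self { backend, d_model, cache: None }
    }

    /// Appends one frame to the cache and returns a pooled embedding.
    /// Mean pooling is a placeholder for the attention kernel.
    pub fn step(&mut self, frame: &[f32]) -> Result<Vec<f32>, TemporalError> {
        if matches!(self.backend, TemporalBackend::Dense) {
            return Err(TemporalError::StreamingUnsupported);
        }
        if frame.len() != self.d_model {
            return Err(TemporalError::BadFrameWidth {
                expected: self.d_model,
                got: frame.len(),
            });
        }
        // The cache is created lazily on the first streamed frame.
        let cache = self.cache.get_or_insert_with(KvCache::default);
        cache.frames.push(frame.to_vec());
        let n = cache.frames.len() as f32;
        let mut pooled = vec![0.0f32; self.d_model];
        for f in &cache.frames {
            for (p, x) in pooled.iter_mut().zip(f) {
                *p += x / n;
            }
        }
        Ok(pooled)
    }
}
```

Returning an error rather than panicking lets a caller holding a dense checkpoint fall back to window-level `forward` without special-casing the backend up front.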
|
||||
|
||||
### 4.4 Selection rule
|
||||
|
||||
In the spirit of `forward_auto`, the head selects the path from the
|
||||
model's `(window, n_q_heads, n_kv_heads)`:
|
||||
|
||||
- `window ≤ 64` and dense MHA is in the checkpoint: use dense path.
|
||||
- `n_q_heads != n_kv_heads`: use `forward_gqa`.
|
||||
- `n_q_heads == n_kv_heads` and `window > 64`: use `forward`.
|
||||
- Streaming (per-frame) inference: always `decode_step`.
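
The rule above can be sketched as a small pure function. `ModelShape`, `Path`, and `select_path` are hypothetical names. The precedence between overlapping bullets (streaming first, then the short-window dense case, then head-count mismatch) is an assumption, since the list does not fix an order.

```rust
/// Which kernel entry point the head would call (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Path {
    Dense,
    Forward,
    ForwardGqa,
    DecodeStep,
}

/// The model properties the selection rule looks at (hypothetical struct).
pub struct ModelShape {
    pub window: usize,
    pub n_q_heads: usize,
    pub n_kv_heads: usize,
    /// True when the checkpoint carries dense MHA weights.
    pub has_dense_mha: bool,
}

/// Assumed precedence: streaming, then short dense windows, then GQA.
pub fn select_path(shape: &ModelShape, streaming: bool) -> Path {
    if streaming {
        return Path::DecodeStep;
    }
    if shape.window <= 64 && shape.has_dense_mha {
        return Path::Dense;
    }
    if shape.n_q_heads != shape.n_kv_heads {
        return Path::ForwardGqa;
    }
    // Equal head counts. This also catches short windows without dense
    // weights, a case the list above does not address.
    Path::Forward
}
```

Keeping the rule as a function of plain values makes it cheap to unit-test each bullet separately from the kernels themselves.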
|
||||
|
||||
---
|
||||
|
||||
## 5. Validation gate before flipping the inference default
|
||||
|
||||
We do not flip the production inference default until *all four*
|
||||
of these pass on the most recent AETHER checkpoint:
|
||||
|
||||
1. **Contrastive loss within 1%** of the dense baseline at the same
|
||||
training budget (so the kernel substitution doesn't silently
|
||||
regress the loss surface).
|
||||
2. **Re-ID rank-1 accuracy within 1 percentage point** of the dense
|
||||
baseline on the held-out test split.
|
||||
3. **Spearman rank correlation ≥ 0.95** between dense-MHA and
|
||||
sparse-GQA top-50 nearest-neighbour orderings on the
|
||||
`env_fingerprint` and `person_track` HNSW indices (matches the
|
||||
ADR-024 §2.5.3 quantisation-rank-preservation criterion).
|
||||
4. **Latency improvement ≥ 5×** at the deployed window length.
|
||||
|
||||
Any of (1)–(3) failing rolls back the default; the kernel can stay
|
||||
in the codebase as opt-in, but is not what new training runs use.
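
A sketch of how the four gates might be evaluated in a test harness. `GateInputs`, `failed_gates`, and `spearman_rho` are hypothetical names. Two assumptions are baked in: gates (1) and (2) are read one-sidedly (only regressions count), and the two top-50 lists contain the same ids without ties, which real HNSW results may not satisfy.

```rust
use std::collections::{HashMap, HashSet};

/// Spearman's rho between two orderings of the same ids (no ties).
/// Returns None for mismatched lengths, fewer than two ids,
/// duplicate ids, or ids present in only one list.
pub fn spearman_rho(a: &[u64], b: &[u64]) -> Option<f64> {
    let n = a.len();
    if n < 2 || b.len() != n {
        return None;
    }
    let distinct_a: HashSet<u64> = a.iter().copied().collect();
    let rank_b: HashMap<u64, usize> =
        b.iter().enumerate().map(|(rank, &id)| (id, rank)).collect();
    if distinct_a.len() != n || rank_b.len() != n {
        return None;
    }
    let mut sum_d2 = 0.0f64;
    for (rank_a, id) in a.iter().enumerate() {
        let rb = *rank_b.get(id)?;
        let d = (rank_a as f64) - (rb as f64);
        sum_d2 += d * d;
    }
    let n = n as f64;
    // Closed form for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    Some(1.0 - 6.0 * sum_d2 / (n * (n * n - 1.0)))
}

/// Measurements for one dense-vs-sparse comparison (hypothetical fields).
pub struct GateInputs {
    pub dense_loss: f64,
    pub sparse_loss: f64,
    pub dense_rank1: f64,  // fraction in [0, 1]
    pub sparse_rank1: f64, // fraction in [0, 1]
    pub spearman: f64,
    pub dense_latency_ms: f64,
    pub sparse_latency_ms: f64,
}

/// Returns the numbers of the gates that fail; empty means all four pass.
pub fn failed_gates(m: &GateInputs) -> Vec<u8> {
    let checks = [
        (1u8, (m.sparse_loss - m.dense_loss) / m.dense_loss <= 0.01),
        (2u8, m.dense_rank1 - m.sparse_rank1 <= 0.01),
        (3u8, m.spearman >= 0.95),
        (4u8, m.dense_latency_ms / m.sparse_latency_ms >= 5.0),
    ];
    checks
        .iter()
        .filter(|(_, ok)| !*ok)
        .map(|(gate, _)| *gate)
        .collect()
}
```

Reporting the failing gate numbers, rather than a single boolean, keeps the published result readable against the list above.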
|
||||
|
||||
---
|
||||
|
||||
## 6. Alternatives considered
|
||||
|
||||
| Option | Why rejected |
|
||||
|--------|--------------|
|
||||
| **Keep dense MHA, period** | Simple, but caps the practical window length. The 10 s+ windows that long-context re-ID and breathing-quality scoring want are exactly where dense MHA hurts. We'd be locking in a ceiling for no reason. |
|
||||
| **Use `ruvector-attention` 2.0.4 (already in workspace)** | It's what we use today for antenna and spatial attention. But it lacks GQA, lacks streaming KV cache, and its dependency story upstream is messy (`ruvector-attn-mincut` is stuck at 2.0.4 per the issue). It works, but it's not the right tool for *temporal* attention specifically. |
|
||||
| **Wait for a later `ruvector-attention` 2.x release to add GQA + KV cache** | Speculative; no published roadmap. Meanwhile, `ruvllm_sparse_attention` shipped real artifacts on 2026-05-07 and is path-vendorable today. |
|
||||
| **Use a non-attention temporal pooler (TCN / S4 / Mamba)** | All three are real options for time-series sensing; some research gives them a slight edge on long-horizon dependencies. But (a) we already have AETHER specified around attention in ADR-024, (b) the contrastive recipe is attention-tuned, (c) we'd be re-running the entire ADR-024 training story to swap to a different family. Switching to *sparse* attention preserves the ADR-024 mathematical apparatus exactly. |
|
||||
| **`forward_gated_with_fastgrnn` immediately** | Tempting because it's the O(N) path. But the gate adds approximation error on top of the sparsity-induced approximation error. Phase the introductions: prove sparse-GQA matches dense first, then layer the gate on top in a follow-up. |
|
||||
|
||||
---
|
||||
|
||||
## 7. Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- **Long windows are no longer scary.** `window_frames = 1000` for
|
||||
10 s sessions becomes practical, not aspirational.
|
||||
- **Streaming re-ID gets a structural speedup.** Per-frame decode
|
||||
cost goes from O(N²) to O(log T). Pose tracker cost is a real
|
||||
budget today; this shrinks it.
|
||||
- **GQA fits the AETHER backbone better.** AETHER's per-keypoint
|
||||
cross-attention already has a query/key shape mismatch (17
|
||||
keypoint queries vs N CSI keys). GQA, which shares each KV head
|
||||
across a group of query heads, maps naturally onto this asymmetry.
|
||||
- **Path-vendored, not crates.io-coupled.** No bind-time risk —
|
||||
the crate ships from the vendored copy of upstream, and the
|
||||
vendor was synced today (`e38347601`).
|
||||
- **Same kernel, two consumers.** ADR-095 wants this on the MCU;
|
||||
this ADR wants it on the host. Path-vendoring once keeps the
|
||||
versions in lockstep.
|
||||
- **Approximation error is bounded** by the local window +
|
||||
log-stride + landmark pattern. Upstream's measurement (`README.md`
|
||||
§FAQ) is "<1% perplexity on standard benchmarks" for the
|
||||
causal case; we measure ours via §5's gate.
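
To make the per-frame cost claim concrete, here is an illustrative sketch that lists the key positions one causal query would attend to under a local window plus power-of-two strides. It is an assumption-laden toy, not upstream's actual pattern: `attended_keys` is a hypothetical name, landmarks are omitted entirely, and the stride rule is a guess at what "log-stride" means.

```rust
/// Key positions a causal query at `t` attends to under a local window
/// plus power-of-two strides. Illustrative only; landmarks are omitted.
pub fn attended_keys(t: usize, window: usize) -> Vec<usize> {
    // Local window: the `window` most recent positions plus `t` itself.
    let lo = t.saturating_sub(window);
    let mut keys: Vec<usize> = (lo..=t).collect();

    // Log-stride: positions t - 1, t - 2, t - 4, ... that fall
    // outside the local window.
    let mut stride = 1usize;
    while stride <= t {
        let pos = t - stride;
        if pos < lo {
            keys.push(pos);
        }
        stride = match stride.checked_mul(2) {
            Some(next) => next,
            None => break,
        };
    }
    keys.sort_unstable();
    keys
}
```

Under these assumptions, a query at `t = 999` with `window = 32` touches 37 keys instead of the 1000 a dense causal row would, and the count grows only logarithmically in `t` once the window is fixed.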
|
||||
|
||||
### Negative
|
||||
|
||||
- **Adds a workspace dependency** the team has to know about.
|
||||
Mitigated by path-vendoring (no version-resolution risk).
|
||||
- **Approximation error is not zero.** For high-precision re-ID
|
||||
this needs measurement. §5's gate is the safety net; if rank
|
||||
correlation drops below 0.95 we don't flip the default.
|
||||
- **More moving parts in the temporal head.** Dense MHA has one
|
||||
knob (number of heads). Sparse GQA has window, log-stride,
|
||||
landmark block size, KV head count, and (later) gate top-K. We
|
||||
pay this in default-config tuning effort.
|
||||
- **`KvCache` introduces session state** in a place that didn't
|
||||
have it. Code that previously called a stateless `forward(...)`
|
||||
now has to think about cache lifetime per tracked person. The
|
||||
pose tracker (`pose_tracker.rs`) already has per-track state, so
|
||||
the natural place for the cache is inside `PoseTrack`; needs a
|
||||
small lifecycle review.
|
||||
- **Training and inference paths diverge slightly.** Training
|
||||
always uses `forward` (full window prefill). Inference uses
|
||||
`decode_step` for streaming. The two paths must be tested
|
||||
separately; upstream unit tests check `forward` against
|
||||
`decode_step` for parity, but our wrapper has its own surface.
|
||||
|
||||
### Neutral
|
||||
|
||||
- ADR-024 is **not superseded.** The contrastive loss, the
|
||||
augmentation strategy, the projection head, the HNSW indices —
|
||||
all unchanged. This ADR makes a single architectural choice
|
||||
inside ADR-024's "temporal aggregation" black box.
|
||||
- ADR-016 (RuVector training pipeline integration) is unaffected.
|
||||
The other RuVector crates (`mincut`, `attn-mincut`,
|
||||
`temporal-tensor`, `solver`, `attention`) keep their existing
|
||||
roles in `model.rs`.
|
||||
|
||||
---
|
||||
|
||||
## 8. Open questions
|
||||
|
||||
1. **What is the AETHER temporal head's actual current
|
||||
architecture in code?** ADR-024 specifies the projection head
|
||||
precisely (Linear → BN → ReLU → Linear → L2-norm) but the
|
||||
*temporal aggregation* before that is not pinned. The closest
|
||||
thing in `model.rs` today is `apply_antenna_attention` and
|
||||
`apply_spatial_attention`, which are over antenna and spatial
|
||||
axes, not the temporal axis. So this ADR is, in practice,
|
||||
choosing the temporal kernel for the *first time* — not
|
||||
replacing one. Worth confirming with the maintainer before the
|
||||
implementation PR uses language like "swap" rather than "add".
|
||||
2. **What window length is the deployed AETHER tracker using
|
||||
today?** The training default is 100 frames (`config.rs:165`),
|
||||
but `proof.rs` uses 4 and `trainer.rs` uses 2. The realistic
|
||||
deployment number determines how much of the §3.1 quantitative
|
||||
argument is *currently* operative versus *future-state*. If the
|
||||
answer is "we run AETHER on 4-frame windows", sparse pays
|
||||
nothing today, and the case for this ADR rests entirely on the
|
||||
long-window roadmap. If 100 or more, sparse already pays.
|
||||
3. **Is `FastGrnnGate` worth enabling for re-ID specifically?**
|
||||
Probably not — re-ID benefits from full-sequence visibility,
|
||||
and the gate's job is to *prune* long-range candidates. Save
|
||||
the gate for activity classification (where transient movement
|
||||
is the signal of interest, and saliency-based pruning matches
|
||||
the use case). Confirm via §5's accuracy gate when we get there.
|
||||
4. **Does the cross-modal alignment loss (ADR-024 §2.2.4) need
|
||||
any change?** The cross-modal loss operates on pooled
|
||||
`z_csi` (already temporally aggregated) and pooled `z_pose`. As
|
||||
long as the temporal aggregator returns a comparable pooled
|
||||
vector, the loss is kernel-agnostic. Likely no change, but
|
||||
worth a smoke test.
|
||||
5. **Where does the KV cache live for re-ID?** Per `pose_tracker.rs`,
|
||||
each `PoseTrack` already has lifecycle (create / update /
|
||||
evict). The natural place is `PoseTrack::kv_cache:
|
||||
Option<KvCache>`, populated when the track first emits an
|
||||
embedding. Eviction policy ties to `track.last_seen` — when
|
||||
the track is dropped, drop the cache. Spec-level sanity check
|
||||
only; needs a real design pass in the implementation PR.
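
A sketch of the lifecycle described in question 5, under stated assumptions: `KvCache` is a stand-in that only counts frames, `last_seen` is a frame index, and `Tracker`, `observe`, `evict_stale`, and `max_age` are hypothetical names that may not match `pose_tracker.rs`.

```rust
use std::collections::HashMap;

/// Stand-in for upstream `KvCache`; it only counts appended frames.
#[derive(Default)]
pub struct KvCache {
    pub frames_seen: usize,
}

pub struct PoseTrack {
    /// Frame index of the last observation (unit is an assumption).
    pub last_seen: u64,
    /// Created when the track first emits an embedding.
    pub kv_cache: Option<KvCache>,
}

pub struct Tracker {
    tracks: HashMap<u32, PoseTrack>,
    max_age: u64,
}

impl Tracker {
    pub fn new(max_age: u64) -> Self {
        Self { tracks: HashMap::new(), max_age }
    }

    /// Create or update a track and append one frame to its cache.
    pub fn observe(&mut self, id: u32, now: u64) {
        let track = self
            .tracks
            .entry(id)
            .or_insert(PoseTrack { last_seen: now, kv_cache: None });
        track.last_seen = now;
        track.kv_cache.get_or_insert_with(KvCache::default).frames_seen += 1;
    }

    /// Drop tracks not seen within `max_age`; their caches go with them.
    pub fn evict_stale(&mut self, now: u64) -> usize {
        let before = self.tracks.len();
        let max_age = self.max_age;
        self.tracks
            .retain(|_, t| now.saturating_sub(t.last_seen) <= max_age);
        before - self.tracks.len()
    }

    pub fn cached_frames(&self, id: u32) -> Option<usize> {
        self.tracks
            .get(&id)
            .and_then(|t| t.kv_cache.as_ref())
            .map(|c| c.frames_seen)
    }
}
```

Because the cache is owned by the track, eviction needs no separate bookkeeping: dropping the `PoseTrack` drops the cache.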
|
||||
|
||||
---
|
||||
|
||||
## 9. Acceptance criteria
|
||||
|
||||
This ADR is **Accepted** once:
|
||||
|
||||
1. Maintainer review on #513 confirms the architecture and resolves
|
||||
§8.1 (the "first-time choice vs replacement" framing).
|
||||
2. Open question §8.2 has a concrete answer (ideally a one-line
|
||||
pointer to the production training config).
|
||||
3. The follow-up implementation issue is filed.
|
||||
|
||||
This ADR is **Implemented** once:
|
||||
|
||||
1. `wifi-densepose-temporal` (or equivalent) ships in the workspace
|
||||
with a default-off feature flag exposing both dense and
|
||||
sparse-GQA backends.
|
||||
2. §5's four-gate validation has run on the most recent AETHER
|
||||
checkpoint and the result is published (witness-bundle
|
||||
compatible per ADR-028 if the run is reproducible).
|
||||
3. The default for new training runs is `sparse_gqa`, with `dense`
|
||||
still selectable for back-compat.
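
A sketch of what the selectable backend in item 3 could look like in config code. `TemporalBackendKind` is a hypothetical name. The `dense` and `sparse_gqa` strings come from the text above; the default shown reflects the state after §5's gate has passed, not today's.

```rust
use std::str::FromStr;

/// Backend choice as it might appear in a training config (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum TemporalBackendKind {
    /// Kept selectable for back-compat.
    Dense,
    /// Default for new training runs once the validation gate has passed.
    #[default]
    SparseGqa,
}

impl FromStr for TemporalBackendKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dense" => Ok(Self::Dense),
            "sparse_gqa" => Ok(Self::SparseGqa),
            other => Err(format!("unknown temporal backend: {other}")),
        }
    }
}
```

Rejecting unknown strings instead of silently falling back to the default keeps a typo in a config file from changing which kernel a run uses.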
|
||||
|
||||
---
|
||||
|
||||
## 10. Related
|
||||
|
||||
ADR-014 (signal SOTA), ADR-016 (RuVector training pipeline
|
||||
integration), ADR-024 (AETHER contrastive CSI embedding — this
|
||||
ADR fills in its temporal-aggregation black box), ADR-095
|
||||
(on-ESP32-S3 temporal modeling — same crate, different consumer),
|
||||
upstream ADR-189 (KV cache incremental decode — the basis for
|
||||
streaming re-ID), upstream ADR-190 (GQA / MQA — what AETHER's 17
|
||||
keypoint queries × N CSI keys asymmetry naturally maps onto),
|
||||
upstream ADR-192 (no_std + alloc support — the structural change
|
||||
that means the *same* kernel runs both on the host here and on
|
||||
the MCU under ADR-095).
|
||||