# ADR-096: AETHER Temporal Head via `ruvllm_sparse_attention::forward_gqa` + Streaming KV Cache

| Field        | Value                                                                  |
|--------------|------------------------------------------------------------------------|
| **Status**   | Proposed (2026-05-07)                                                  |
| **Date**     | 2026-05-07                                                             |
| **Authors**  | ruvnet, claude-flow                                                    |
| **Related**  | ADR-014, ADR-016, ADR-024, ADR-095; upstream ADR-189, ADR-190, ADR-192 |
| **Branch**   | `feat/ruvllm-sparse-attention-edge`                                    |
| **Tracking** | #513                                                                   |

---

## 1. Context

ADR-024 ("Project AETHER") specifies a contrastive CSI embedding
model on top of the existing `CsiToPoseTransformer` backbone. It
adds a 2-layer projection head to the per-keypoint features and
trains it with InfoNCE + VICReg + (optional) cross-modal alignment.
The **temporal aggregation** that turns per-frame backbone features
into a window-level representation is described at the level of
"a transformer encoder over the CSI window", but ADR-024 does not
pin a specific attention kernel. In the current code:

- `v2/crates/wifi-densepose-train/src/model.rs` uses
  `ruvector_attention::ScaledDotProductAttention` (line 34) and
  applies `apply_antenna_attention` over the antenna-path dimension
  and `apply_spatial_attention` over the spatial location dimension.
  Both are dense.
- The training-side temporal pooling currently runs at
  `window_frames = 100` by default (`config.rs:165`), with
  `proof.rs` and `trainer.rs` using shorter test windows of 4 and 2
  respectively.
- `v2/crates/wifi-densepose-signal/src/ruvsense/pose_tracker.rs`
  consumes a 128-dim AETHER re-ID embedding (lines 22 and 263) but
  does not perform the temporal aggregation itself; that happens
  upstream.

So the temporal head is a real seam in the codebase, but its
specific attention kernel is *currently dense* and *currently not a
named architectural decision*. This ADR makes that decision.

The vendored `ruvllm_sparse_attention` v0.1.1 (released 2026-05-07
and synced into the vendor tree the same day) provides a different
kind of temporal kernel:

- **Subquadratic O(N log N)** sparse attention (`forward`,
  `forward_flash`).
- **Grouped-Query / Multi-Query Attention** (`forward_gqa`,
  `forward_gqa_flash`), which shares K/V across query heads, the
  pattern Mistral-7B and Llama-3 use.
- **Streaming KV cache** (`KvCache`, `KvCacheF16`) with H2O
  heavy-hitter eviction, allowing token-by-token decode in
  **O(log T)** per step against an accumulated cache. See upstream
  ADR-189.
- **FastGRNN salience gate** for **near-linear O(N)** when the
  log-stride candidate set can be pruned.

These capabilities are qualitatively different from
`ruvector-attention` 2.0.4, which is what the workspace uses today
for spatial / antenna attention.

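
To make the K/V sharing in the GQA bullet concrete, the sketch below maps a query head to the K/V head it reads from, using the conventional contiguous grouping (`q_head / group_size`). The function name `kv_head_for` is hypothetical and this is not upstream's API; whether `forward_gqa` lays out its groups the same way is an assumption.

```rust
/// Maps a query head to the K/V head it shares under grouped-query attention.
/// Returns `None` when the head counts do not form a valid GQA configuration.
/// Hypothetical helper for illustration; not part of `ruvllm_sparse_attention`.
pub fn kv_head_for(q_head: usize, n_q_heads: usize, n_kv_heads: usize) -> Option<usize> {
    // The query heads must split evenly into one group per K/V head.
    if n_kv_heads == 0 || n_q_heads % n_kv_heads != 0 || q_head >= n_q_heads {
        return None;
    }
    // Assumed layout: consecutive query heads belong to the same group.
    let group_size = n_q_heads / n_kv_heads;
    Some(q_head / group_size)
}
```

With 8 query heads and 2 K/V heads, `kv_head_for(5, 8, 2)` yields `Some(1)`. Setting `n_kv_heads = 1` is the multi-query case, and `n_kv_heads = n_q_heads` degenerates to ordinary multi-head attention.
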
---

## 2. Decision

The AETHER temporal head will be implemented with
`ruvllm_sparse_attention::SubquadraticSparseAttention::forward_gqa`
for prefill, and `decode_step` against a `KvCache` (with the `fp16`
feature enabled) for streaming inference paths (online re-ID,
incremental embedding extraction during a tracked session).

Concretely:

1. `wifi-densepose-train` adds `ruvllm_sparse_attention` as a
   workspace dependency, **path-vendored** against
   `vendor/ruvector/crates/ruvllm_sparse_attention` so the workspace
   does not gain a crates.io publish dependency.
2. The AETHER block factory takes a feature flag
   (`temporal_head = "dense" | "sparse_gqa"`) selecting between the
   current dense MHA path and the new sparse-GQA path. The default
   for new training runs is `sparse_gqa`. Existing checkpoints
   continue to load on `dense`.
3. Signal-side consumers (the streaming embedding extraction used
   by `pose_tracker.rs` for re-ID updates) call `decode_step` rather
   than re-running prefill on every new frame. This is the
   structural win that dense MHA cannot provide.
4. We add an A/B benchmark gate (§5) before flipping the production
   default. The default *training* config can move first; the
   default *inference* config waits for the gate.

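
One way the `temporal_head` setting from item 2 could be parsed and defaulted, as a minimal sketch. The names `TemporalHeadKind` and `resolve`, and the idea that a legacy checkpoint is recognised by the setting being absent, are assumptions for illustration rather than the existing config code.

```rust
use std::str::FromStr;

/// The two backends named in the decision. Type name is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TemporalHeadKind {
    Dense,
    SparseGqa,
}

impl Default for TemporalHeadKind {
    /// New training runs default to the sparse-GQA path.
    fn default() -> Self {
        TemporalHeadKind::SparseGqa
    }
}

impl FromStr for TemporalHeadKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dense" => Ok(TemporalHeadKind::Dense),
            "sparse_gqa" => Ok(TemporalHeadKind::SparseGqa),
            other => Err(format!("unknown temporal_head value: {other}")),
        }
    }
}

/// Resolves the backend. An explicit setting always wins; when the setting is
/// absent, a legacy checkpoint falls back to dense (assumed detection rule)
/// and everything else takes the new default.
pub fn resolve(
    configured: Option<&str>,
    loading_legacy_checkpoint: bool,
) -> Result<TemporalHeadKind, String> {
    match configured {
        Some(value) => value.parse(),
        None if loading_legacy_checkpoint => Ok(TemporalHeadKind::Dense),
        None => Ok(TemporalHeadKind::default()),
    }
}
```

Keeping the fallback in one function means the "existing checkpoints continue to load on `dense`" guarantee has a single place to be tested.
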

This ADR sanctions the swap. It does not perform the swap; that
lands in a follow-up implementation issue once both ADR-095 and
ADR-096 are accepted.

---

## 3. Quantitative argument

### 3.1 Edge-evaluation count

For a single attention layer over `N` frames:

| Path | Edge evaluations | At `N = 100` (today's default) | At `N = 1000` (10 s @ 100 Hz) | At `N = 8192` |
|------|------------------|--------------------------------|-------------------------------|---------------|
| Dense MHA | `N²` | 1.0 × 10⁴ | 1.0 × 10⁶ | 6.7 × 10⁷ |
| Sparse `forward` (window + log-stride + landmarks) | ~`N · (W + log N + N/B)` | 1.4 × 10⁴ | 1.4 × 10⁴ | 1.1 × 10⁶ |
| Sparse + FastGRNN | ~`N · (W + globals + K)` | constant per token | constant per token | constant per token |

Numbers for the sparse rows are taken from upstream's measured
table (`README.md:230-237`, "sparse-edge reduction vs causal dense
attention"): 8192 → 29.3× edge reduction, 16384 → 57.5×, 32768 →
113.2×.

**The honest framing:** at the *current* AETHER default of
`window_frames = 100`, dense MHA is essentially free and the
sparse machinery has overhead: the per-token cost in upstream's
benchmark is ~2.4 µs at `N = 256` and ~2.1 µs at `N = 128`. The
sparse path probably *loses* below `N ≈ 128`. It starts winning at
the windows of 1 s and longer that we'd realistically use for
activity classification (`N = 200` at 50 Hz, `N = 500` for
breathing-quality), and pulls ahead by 30–100× at the 10 s windows
that long-context re-ID benefits from.

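
The two formulas in the table can be evaluated directly, which is handy for checking a candidate window length. This is a minimal sketch: `W = 64` and `B = 128` in the usage note are illustrative values chosen here, not confirmed upstream defaults, and the function is an estimate from the formula rather than a reproduction of upstream's measured table.

```rust
/// Dense attention evaluates every (query, key) pair: N^2 edges.
pub fn dense_edges(n: u64) -> u64 {
    n * n
}

/// Sparse estimate N * (W + ceil(log2 N) + N / B), following the table's formula.
/// `window` is W (local window) and `block` is B (landmark block size, non-zero).
pub fn sparse_edges(n: u64, window: u64, block: u64) -> u64 {
    // ceil(log2(n)) via bit length of n - 1; zero for n <= 1.
    let log_stride = if n <= 1 {
        0
    } else {
        64 - (n - 1).leading_zeros() as u64
    };
    n * (window + log_stride + n / block)
}
```

Under those assumed parameters, `sparse_edges(8192, 64, 128)` is 8192 × (64 + 13 + 64) = 1,155,072 against `dense_edges(8192)` = 67,108,864. The real crossover point depends on upstream's actual `W` and `B` and on per-edge overhead, which this count ignores.
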
### 3.2 Streaming decode

Where dense MHA structurally cannot follow is incremental decode.
Re-ID over a long-tracked person (a 5-minute session at 50 Hz =
15,000 frames) with dense MHA requires recomputing attention from
scratch every time the window slides. With `decode_step` against a
`KvCache`:

| Operation | Dense MHA | Sparse GQA + KV cache |
|-----------|-----------|-----------------------|
| Append one new frame to the embedding context | O(N²) | **O(log T)** |
| Memory growth | O(N · d) per recompute | O(T · d_kv) cached, evicted by H2O heavy-hitter |
| FP16 KV cache | n/a | available via `fp16` feature, halves memory |

This is the qualitative capability dense MHA lacks. Even at small
`N` where dense MHA is competitive on prefill, decode is structurally
different: O(log T) per new frame vs O(N²) recompute.

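
The eviction idea behind the "Memory growth" row can be illustrated with a toy bounded cache that drops the entry with the least accumulated attention. Everything here (`ToyKvCache`, `credit`, `append`) is hypothetical and much simpler than upstream's `KvCache`: it uses a linear scan, keeps no protected recent window, and says nothing about the O(log T) decode cost.

```rust
/// One cached frame: its K/V payload plus the attention mass received so far.
#[derive(Debug, Clone)]
pub struct CacheEntry {
    pub frame_idx: usize,
    pub kv: Vec<f32>,
    pub score: f32,
}

/// Toy bounded cache that evicts the lowest-scoring entry when full.
pub struct ToyKvCache {
    capacity: usize,
    entries: Vec<CacheEntry>,
}

impl ToyKvCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, entries: Vec::with_capacity(capacity) }
    }

    /// Adds attention mass to a cached frame (no-op if it was already evicted).
    pub fn credit(&mut self, frame_idx: usize, attention: f32) {
        if let Some(e) = self.entries.iter_mut().find(|e| e.frame_idx == frame_idx) {
            e.score += attention;
        }
    }

    /// Appends a frame. When the cache is full, the entry with the lowest
    /// accumulated score is evicted first; its frame index is returned.
    pub fn append(&mut self, frame_idx: usize, kv: Vec<f32>) -> Option<usize> {
        let mut evicted = None;
        if self.entries.len() == self.capacity {
            let victim = self
                .entries
                .iter()
                .enumerate()
                .min_by(|a, b| a.1.score.total_cmp(&b.1.score))
                .map(|(pos, _)| pos);
            if let Some(pos) = victim {
                evicted = Some(self.entries.remove(pos).frame_idx);
            }
        }
        self.entries.push(CacheEntry { frame_idx, kv, score: 0.0 });
        evicted
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

The point of the sketch is only that memory stays bounded by `capacity` while frames that keep attracting attention survive; the real cache's data layout and eviction details are upstream's (ADR-189).
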
---

## 4. Approach

### 4.1 Workspace dependency

Add to `v2/Cargo.toml`:

```toml
[workspace.dependencies]
# Inline tables must stay on one line for Cargo's TOML parser.
ruvllm_sparse_attention = { path = "../vendor/ruvector/crates/ruvllm_sparse_attention", default-features = false, features = ["fp16"] }
```

`default-features = false` mirrors the rest of the workspace's
`--no-default-features` posture (and matches what ADR-095 does on
the firmware side, so both consumers have the same feature set).
We **do not** pull `parallel` here: rayon doesn't help with
inference-shaped batches at the sequence lengths we run, and it
breaks ADR-095's no_std build if the dependency leaks.

### 4.2 Crate placement

Two viable homes for the AETHER temporal head:

| Option | Tradeoffs |
|--------|-----------|
| **A. New `wifi-densepose-temporal` crate** | Cleanest. Unique import surface, easy to feature-gate. But: one more crate in the publishing order (CLAUDE.md crate table grows to 16). |
| **B. Add to `wifi-densepose-train`** | Co-located with the model; no new crate; simpler workspace graph. But: `wifi-densepose-train` is heavyweight (`tch`, full training stack), and signal-side consumers would have to depend on the whole training crate just to run inference. |

**Recommendation: A.** The temporal head is consumed by both
`wifi-densepose-train` (training) and `wifi-densepose-signal`
(inference, re-ID). Pulling those toward a shared third crate keeps
the dependency arrows clean. It also matches ADR-095's
`wifi-densepose-temporal` host-side training crate name, a
deliberate convergence.

### 4.3 API sketch

```rust
// Sketch only: `Tensor3`, `DenseMha`, `TemporalHeadConfig`, and
// `TemporalError` are not defined in this ADR.
pub struct AetherTemporalHead {
    backend: TemporalBackend,
    cache: Option<KvCache>, // populated for streaming inference
}

pub enum TemporalBackend {
    Dense(DenseMha), // current ruvector-attention path
    SparseGqa(SubquadraticSparseAttention),
}

impl AetherTemporalHead {
    pub fn new(cfg: &TemporalHeadConfig) -> Self { todo!() }

    /// Window-level prefill. Returns pooled [d_model] embedding.
    pub fn forward(&self, frames: &Tensor3) -> Vec<f32> { todo!() }

    /// Incremental decode for streaming re-ID. Updates internal
    /// cache and returns pooled embedding given a single new frame.
    /// SparseGqa backend only.
    pub fn step(&mut self, frame: &Tensor3) -> Result<Vec<f32>, TemporalError> { todo!() }
}
```

### 4.4 Selection rule

In the spirit of `forward_auto`, the head selects the path based on
the model's `(window, n_q_heads, n_kv_heads)`:

- `window ≤ 64` and dense MHA is in the checkpoint: use dense path.
- `n_q_heads != n_kv_heads`: use `forward_gqa`.
- `n_q_heads == n_kv_heads` and `window > 64`: use `forward`.
- Streaming (per-frame) inference: always `decode_step`.

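
The rules above can be written as one function, which makes the precedence explicit. This is a sketch under assumptions: the type and function names are hypothetical, the rules are applied in the order listed with streaming checked first (since it is "always"), and the case the list leaves open (equal head counts, `window ≤ 64`, no dense weights in the checkpoint) is routed to `forward` purely as a placeholder choice.

```rust
/// Which kernel entry point the head would call. Names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TemporalPath {
    Dense,
    ForwardGqa,
    Forward,
    DecodeStep,
}

/// Inputs to the selection rule.
pub struct HeadShape {
    pub window: usize,
    pub n_q_heads: usize,
    pub n_kv_heads: usize,
    /// True when the loaded checkpoint carries dense MHA weights.
    pub checkpoint_has_dense: bool,
}

pub fn select_path(shape: &HeadShape, streaming: bool) -> TemporalPath {
    // Streaming inference always decodes incrementally.
    if streaming {
        return TemporalPath::DecodeStep;
    }
    // Short windows stay on the dense path when the checkpoint supports it.
    if shape.window <= 64 && shape.checkpoint_has_dense {
        return TemporalPath::Dense;
    }
    // Fewer K/V heads than query heads means grouped-query attention.
    if shape.n_q_heads != shape.n_kv_heads {
        return TemporalPath::ForwardGqa;
    }
    // Equal head counts. The written rule covers `window > 64`; sending the
    // remaining short-window case here too is an assumption of this sketch.
    TemporalPath::Forward
}
```

For example, an 8-query-head, 2-K/V-head model with a 100-frame window resolves to `ForwardGqa` for prefill and `DecodeStep` when called per frame.
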

---

## 5. Validation gate before flipping the inference default

We do not flip the production inference default until *all four*
of these pass on the most recent AETHER checkpoint:

1. **Contrastive loss within 1%** of the dense baseline at the same
   training budget (so the kernel substitution doesn't silently
   regress the loss surface).
2. **Re-ID rank-1 accuracy within 1 percentage point** of the dense
   baseline on the held-out test split.
3. **Spearman rank correlation ≥ 0.95** between dense-MHA and
   sparse-GQA top-50 nearest-neighbour orderings on the
   `env_fingerprint` and `person_track` HNSW indices (matches the
   ADR-024 §2.5.3 quantisation-rank-preservation criterion).
4. **Latency improvement ≥ 5×** at the deployed window length.

Any of (1)–(3) failing rolls back the default; the kernel can stay
in the codebase as opt-in, but is not what new training runs use.

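
The four criteria lend themselves to a small, testable check. The sketch below is one possible encoding with hypothetical names (`GateMetrics`, `GateReport`, `evaluate`); it reads "within 1%" and "within 1 percentage point" as one-sided regression bounds, which is an interpretation rather than something this ADR states.

```rust
/// Measurements from an A/B run. Field names are hypothetical.
pub struct GateMetrics {
    pub dense_loss: f64,
    pub sparse_loss: f64,
    /// Rank-1 accuracy in percent (0 to 100).
    pub dense_rank1_pct: f64,
    pub sparse_rank1_pct: f64,
    /// Spearman correlation of top-50 neighbour orderings.
    pub spearman_top50: f64,
    pub dense_latency_us: f64,
    pub sparse_latency_us: f64,
}

/// Per-criterion outcome, so a failed run shows which gate tripped.
pub struct GateReport {
    pub loss_ok: bool,
    pub rank1_ok: bool,
    pub spearman_ok: bool,
    pub latency_ok: bool,
}

impl GateReport {
    /// The inference default may flip only when all four gates pass.
    pub fn may_flip_default(&self) -> bool {
        self.loss_ok && self.rank1_ok && self.spearman_ok && self.latency_ok
    }
}

pub fn evaluate(m: &GateMetrics) -> GateReport {
    GateReport {
        // Assumed one-sided: sparse loss may be at most 1% worse than dense.
        loss_ok: m.sparse_loss <= 1.01 * m.dense_loss,
        // Assumed one-sided: at most 1 percentage point below dense.
        rank1_ok: m.sparse_rank1_pct >= m.dense_rank1_pct - 1.0,
        spearman_ok: m.spearman_top50 >= 0.95,
        // Sparse must be at least 5x faster at the deployed window length.
        latency_ok: m.dense_latency_us >= 5.0 * m.sparse_latency_us,
    }
}
```

Reporting each criterion separately keeps the rollback rule for (1)–(3) distinguishable from a latency-only miss.
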

---

## 6. Alternatives considered

| Option | Why rejected |
|--------|--------------|
| **Keep dense MHA, period** | Simple, but caps the practical window length. The 10 s+ windows that long-context re-ID and breathing-quality scoring want are exactly where dense MHA hurts. We'd be locking in a ceiling for no reason. |
| **Use `ruvector-attention` 2.0.4 (already in workspace)** | It's what we use today for antenna and spatial attention. But it lacks GQA, lacks streaming KV cache, and its dependency story upstream is messy (`ruvector-attn-mincut` is stuck at 2.0.4 per the issue). It works, but it's not the right tool for *temporal* attention specifically. |
| **Wait for `ruvector-attention 2.x` to add GQA + KV cache** | Speculative; no published roadmap. Meanwhile `ruvllm_sparse_attention` shipped real artifacts on 2026-05-07 and is path-vendorable today. |
| **Use a non-attention temporal pooler (TCN / S4 / Mamba)** | All three are real options for time-series sensing; some research gives them a slight edge on long-horizon dependencies. But (a) we already have AETHER specified around attention in ADR-024, (b) the contrastive recipe is attention-tuned, (c) we'd be re-running the entire ADR-024 training story to swap to a different family. Switching to *sparse* attention preserves the ADR-024 mathematical apparatus exactly. |
| **`forward_gated_with_fastgrnn` immediately** | Tempting because it's the O(N) path. But the gate adds approximation error on top of the sparsity-induced approximation error. Phase the introductions: prove sparse-GQA matches dense first, then layer the gate on top in a follow-up. |

---

## 7. Consequences

### Positive

- **Long windows are no longer scary.** `window_frames = 1000` for
  10 s sessions becomes practical, not aspirational.
- **Streaming re-ID gets a structural speedup.** Per-frame decode
  cost goes from O(N²) to O(log T). Pose tracker cost is a real
  budget today; this shrinks it.
- **GQA fits the AETHER backbone.** AETHER's per-keypoint
  cross-attention already has far fewer queries than keys (17
  keypoint queries vs N CSI keys). GQA shares each K/V head across
  a group of query heads, so the K/V side, which grows with N, is
  the part that gets cheaper.
- **Path-vendored, not crates.io-coupled.** No bind-time risk:
  the crate ships from the vendored copy of upstream, and the
  vendor was synced on 2026-05-07 (`e38347601`).
- **Same kernel, two consumers.** ADR-095 wants this on the MCU;
  this ADR wants it on the host. Path-vendoring once keeps the
  versions in lockstep.
- **Approximation error is bounded** by the local window +
  log-stride + landmark pattern. Upstream's measurement (`README.md`
  §FAQ) is "<1% perplexity on standard benchmarks" for the
  causal case; we measure ours via §5's gate.

### Negative

- **Adds a workspace dependency** the team has to know about.
  Mitigated by path-vendoring (no version-resolution risk).
- **Approximation error is not zero.** For high-precision re-ID
  this needs measurement. §5's gate is the safety net; if rank
  correlation drops below 0.95 we don't flip the default.
- **More moving parts in the temporal head.** Dense MHA has one
  knob (number of heads). Sparse GQA has window, log-stride,
  landmark block size, KV head count, and (later) gate top-K. We
  pay this in default-config tuning effort.
- **`KvCache` introduces session state** in a place that didn't
  have it. Code that previously called a stateless `forward(...)`
  now has to think about cache lifetime per tracked person. The
  pose tracker (`pose_tracker.rs`) already has per-track state, so
  the natural place for the cache is inside `PoseTrack`; needs a
  small lifecycle review.
- **Training and inference paths diverge slightly.** Training
  always uses `forward` (full window prefill). Inference uses
  `decode_step` for streaming. The two paths must be tested
  separately; upstream's `forward` and `decode_step` are unit-test
  parity-checked, but our wrapper has its own surface.

### Neutral

- ADR-024 is **not superseded.** The contrastive loss, the
  augmentation strategy, the projection head, and the HNSW indices
  are all unchanged. This ADR makes a single architectural choice
  inside ADR-024's "temporal aggregation" black box.
- ADR-016 (RuVector training pipeline integration) is unaffected.
  The other RuVector crates (`mincut`, `attn-mincut`,
  `temporal-tensor`, `solver`, `attention`) keep their existing
  roles in `model.rs`.

---

## 8. Open questions

1. **What is the AETHER temporal head's actual current
   architecture in code?** ADR-024 specifies the projection head
   precisely (Linear → BN → ReLU → Linear → L2-norm) but the
   *temporal aggregation* before that is not pinned. The closest
   thing in `model.rs` today is `apply_antenna_attention` and
   `apply_spatial_attention`, which are over antenna and spatial
   axes, not the temporal axis. So this ADR is, in practice,
   choosing the temporal kernel for the *first time*, not
   replacing one. Worth confirming with the maintainer before the
   implementation PR uses language like "swap" rather than "add".
2. **What window length is the deployed AETHER tracker using
   today?** The training default is 100 frames (`config.rs:165`),
   but `proof.rs` uses 4 and `trainer.rs` uses 2. The realistic
   deployment number determines how much of the §3.1 quantitative
   argument is *currently* operative versus *future-state*. If the
   answer is "we run AETHER on 4-frame windows", sparse pays
   nothing today, and the case for this ADR rests entirely on the
   long-window roadmap. If 100 or more, sparse already pays.
3. **Is `FastGrnnGate` worth enabling for re-ID specifically?**
   Probably not: re-ID benefits from full-sequence visibility,
   and the gate's job is to *prune* long-range candidates. Save
   the gate for activity classification (where transient movement
   is the signal of interest, and saliency-based pruning matches
   the use case). Confirm via §5's accuracy gate when we get there.
4. **Does the cross-modal alignment loss (ADR-024 §2.2.4) need
   any change?** The cross-modal loss operates on pooled
   `z_csi` (already temporally aggregated) and pooled `z_pose`. As
   long as the temporal aggregator returns a comparable pooled
   vector, the loss is kernel-agnostic. Likely no change, but
   worth a smoke test.
5. **Where does the KV cache live for re-ID?** Per `pose_tracker.rs`,
   each `PoseTrack` already has a lifecycle (create / update /
   evict). The natural place is `PoseTrack::kv_cache:
   Option<KvCache>`, populated when the track first emits an
   embedding. Eviction policy ties to `track.last_seen`: when
   the track is dropped, drop the cache. Spec-level sanity check
   only; needs a real design pass in the implementation PR.

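
The lifetime described in question 5 could look roughly like the sketch below. It is illustrative only: `KvCacheStub` stands in for the real cache, this `PoseTrack` and `Tracker` are simplified hypothetical types rather than the ones in `pose_tracker.rs`, and the idle timeout is an assumed eviction trigger.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Stand-in for the streaming cache; only counts the frames it has seen.
#[derive(Default)]
pub struct KvCacheStub {
    pub frames_seen: usize,
}

/// Simplified track: the cache is owned by the track, so it dies with it.
pub struct PoseTrack {
    pub last_seen: Instant,
    /// `None` until the track first emits an embedding.
    pub kv_cache: Option<KvCacheStub>,
}

pub struct Tracker {
    tracks: HashMap<u64, PoseTrack>,
    max_idle: Duration,
}

impl Tracker {
    pub fn new(max_idle: Duration) -> Self {
        Self { tracks: HashMap::new(), max_idle }
    }

    /// Records a frame for `track_id`, lazily creating the track and its
    /// cache. Returns how many frames the cache has absorbed.
    pub fn observe(&mut self, track_id: u64, now: Instant) -> usize {
        let track = self
            .tracks
            .entry(track_id)
            .or_insert_with(|| PoseTrack { last_seen: now, kv_cache: None });
        track.last_seen = now;
        let cache = track.kv_cache.get_or_insert_with(KvCacheStub::default);
        cache.frames_seen += 1;
        cache.frames_seen
    }

    /// Drops tracks idle for longer than `max_idle`; each cache is dropped
    /// together with its track. Returns the number of tracks removed.
    pub fn evict_idle(&mut self, now: Instant) -> usize {
        let before = self.tracks.len();
        let max_idle = self.max_idle;
        self.tracks
            .retain(|_, t| now.duration_since(t.last_seen) <= max_idle);
        before - self.tracks.len()
    }

    pub fn has_cache(&self, track_id: u64) -> bool {
        self.tracks
            .get(&track_id)
            .map_or(false, |t| t.kv_cache.is_some())
    }
}
```

Owning the cache inside the track means no separate cleanup path is needed: removing the track releases the cache through ordinary `Drop`.
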

---

## 9. Acceptance criteria

This ADR is **Accepted** once:

1. Maintainer review on #513 confirms the architecture and resolves
   §8.1 (the "first-time choice vs replacement" framing).
2. Open question §8.2 has a concrete answer (ideally a one-line
   pointer to the production training config).
3. The follow-up implementation issue is filed.

This ADR is **Implemented** once:

1. `wifi-densepose-temporal` (or equivalent) ships in the workspace
   with a default-off feature flag exposing both dense and
   sparse-GQA backends.
2. §5's four-gate validation has run on the most recent AETHER
   checkpoint and the result is published (witness-bundle
   compatible per ADR-028 if the run is reproducible).
3. The default for new training runs is `sparse_gqa`, with `dense`
   still selectable for back-compat.

---

## 10. Related

ADR-014 (signal SOTA), ADR-016 (RuVector training pipeline
integration), ADR-024 (AETHER contrastive CSI embedding; this
ADR fills in its temporal-aggregation black box), ADR-095
(on-ESP32-S3 temporal modeling: same crate, different consumer),
upstream ADR-189 (KV cache incremental decode, the basis for
streaming re-ID), upstream ADR-190 (GQA / MQA, which shares K/V
across the query heads in AETHER's 17 keypoint queries × N CSI
keys cross-attention), upstream ADR-192 (no_std + alloc support,
the structural change that means the *same* kernel runs both on
the host here and on the MCU under ADR-095).