ADR-096: AETHER Temporal Head via ruvllm_sparse_attention::forward_gqa + Streaming KV Cache
| Field | Value |
|---|---|
| Status | Proposed (2026-05-07) |
| Date | 2026-05-07 |
| Authors | ruvnet, claude-flow |
| Related | ADR-014, ADR-016, ADR-024, ADR-095; upstream ADR-189, ADR-190, ADR-192 |
| Branch | feat/ruvllm-sparse-attention-edge |
| Tracking | #513 |
1. Context
ADR-024 ("Project AETHER") specifies a contrastive CSI embedding
model on top of the existing CsiToPoseTransformer backbone. It
adds a 2-layer projection head to the per-keypoint features and
trains it with InfoNCE + VICReg + (optional) cross-modal alignment.
The temporal aggregation that turns per-frame backbone features
into a window-level representation is described at the level of
"a transformer encoder over the CSI window" — but ADR-024 does not
pin a specific attention kernel. In the current code:
- v2/crates/wifi-densepose-train/src/model.rs uses ruvector_attention::ScaledDotProductAttention (line 34) and applies apply_antenna_attention over the antenna-path dimension and apply_spatial_attention over the spatial location dimension. Both are dense.
- The training-side temporal pooling currently runs at window_frames = 100 by default (config.rs:165), with proof.rs and trainer.rs using shorter test windows of 4 and 2 respectively.
- v2/crates/wifi-densepose-signal/src/ruvsense/pose_tracker.rs consumes a 128-dim AETHER re-ID embedding (lines 22, 263) but does not perform the temporal aggregation itself; that happens upstream.
So the temporal head is a real seam in the codebase, but the attention around it is dense today and the temporal kernel has never been a named architectural decision. This ADR makes that decision.
The vendored ruvllm_sparse_attention v0.1.1 (synced today,
released 2026-05-07) provides a different kind of temporal kernel:
- Subquadratic O(N log N) sparse attention (forward, forward_flash).
- Grouped-Query / Multi-Query Attention (forward_gqa, forward_gqa_flash): shares K/V across query heads, the pattern Mistral-7B and Llama-3 use.
- Streaming KV cache (KvCache, KvCacheF16) with H2O heavy-hitter eviction, allowing token-by-token decode in O(log T) per step against an accumulated cache. See upstream ADR-189.
- FastGRNN salience gate for near-linear O(N) when the log-stride candidate set can be pruned.
These capabilities are qualitatively different from
ruvector-attention 2.0.4, which is what the workspace uses today
for spatial / antenna attention.
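To make the K/V sharing behind forward_gqa concrete, here is a minimal sketch of how grouped-query attention assigns query heads to shared K/V heads. The helper name kv_head_for_query is hypothetical and is not part of ruvllm_sparse_attention; it only illustrates the grouping arithmetic, assuming the query head count is a multiple of the K/V head count.

```rust
/// Hypothetical helper (not an upstream API): index of the K/V head that
/// query head `q` reads from under grouped-query attention.
/// Returns None when the head counts cannot form equal-sized groups
/// or when `q` is out of range.
pub fn kv_head_for_query(q: usize, n_q_heads: usize, n_kv_heads: usize) -> Option<usize> {
    if n_kv_heads == 0 || n_q_heads % n_kv_heads != 0 || q >= n_q_heads {
        return None;
    }
    // Number of query heads that share one K/V head.
    let group_size = n_q_heads / n_kv_heads;
    Some(q / group_size)
}
```

With n_kv_heads = 1 every query head maps to K/V head 0 (multi-query attention); with n_kv_heads equal to n_q_heads the mapping is the identity, which is ordinary multi-head attention.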
2. Decision
The AETHER temporal head will be implemented with
ruvllm_sparse_attention::SubquadraticSparseAttention::forward_gqa
for prefill, and decode_step against a KvCache (with the fp16
feature enabled) for streaming inference paths (online re-ID,
incremental embedding extraction during a tracked session).
Concretely:
- wifi-densepose-train adds ruvllm_sparse_attention as a workspace dependency, path-vendored against vendor/ruvector/crates/ruvllm_sparse_attention so the workspace does not gain a crates.io publish dependency.
- The AETHER block factory takes a feature flag (temporal_head = "dense" | "sparse_gqa") selecting between the current dense MHA path and the new sparse-GQA path. The default for new training runs is sparse_gqa. Existing checkpoints continue to load on dense.
- Signal-side consumers (the streaming embedding extraction used by pose_tracker.rs for re-ID updates) call decode_step rather than re-running prefill on every new frame; this is the structural win that dense MHA cannot provide.
- We add an A/B benchmark gate (§5) before flipping the production default. The default training config can move first; the default inference config waits for the gate.
This ADR sanctions the swap. It does not perform the swap; that lands in a follow-up implementation issue once both ADR-095 and ADR-096 are accepted.
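One way the temporal_head flag could be represented in code is sketched below. The enum TemporalHeadKind and the function kind_for_checkpoint are hypothetical names chosen for illustration; the rule that a checkpoint with no recorded value loads on the dense path is an assumption drawn from "existing checkpoints continue to load on dense", not a confirmed design.

```rust
use std::str::FromStr;

/// Hypothetical config enum mirroring temporal_head = "dense" | "sparse_gqa".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TemporalHeadKind {
    Dense,
    SparseGqa,
}

impl Default for TemporalHeadKind {
    /// New training runs default to the sparse-GQA path.
    fn default() -> Self {
        TemporalHeadKind::SparseGqa
    }
}

impl FromStr for TemporalHeadKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dense" => Ok(TemporalHeadKind::Dense),
            "sparse_gqa" => Ok(TemporalHeadKind::SparseGqa),
            other => Err(format!("unknown temporal_head value: {other:?}")),
        }
    }
}

/// Assumption: a checkpoint that records no temporal_head value predates
/// this ADR and therefore loads on the dense path.
pub fn kind_for_checkpoint(recorded: Option<&str>) -> Result<TemporalHeadKind, String> {
    match recorded {
        Some(s) => s.parse(),
        None => Ok(TemporalHeadKind::Dense),
    }
}
```

Rejecting unknown strings instead of silently falling back keeps a typo in a training config from quietly selecting the wrong kernel.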
3. Quantitative argument
3.1 Edge-evaluation count
For a single attention layer over N frames:
| Path | Edge evaluations | At N = 100 (today's default) | At N = 1000 (10 s @ 100 Hz) | At N = 8192 |
|---|---|---|---|---|
| Dense MHA | N² | 1.0 × 10⁴ | 1.0 × 10⁶ | 6.7 × 10⁷ |
| Sparse forward (window + log-stride + landmarks) | ~N · (W + log N + N/B) | 1.4 × 10⁴ | 1.4 × 10⁴ | 1.1 × 10⁶ |
| Sparse + FastGRNN | ~N · (W + globals + K) | constant per token | constant per token | constant per token |
Numbers for the sparse rows are taken from upstream's measured
table (README.md:230-237, "sparse-edge reduction vs causal dense
attention"): 8192 → 29.3× edge reduction, 16384 → 57.5×, 32768 →
113.2×.
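The "window + log-stride + landmarks" pattern can be pictured with a small sketch that lists which earlier frames a given query position would attend to. This is an assumption about the general shape of such a pattern, not upstream's exact layout; the function candidate_keys and its parameters are hypothetical.

```rust
use std::collections::BTreeSet;

/// Illustrative causal candidate set for query position `i`:
/// a local window, log-stride lookbacks (i - 1, i - 2, i - 4, ...),
/// and one landmark at the start of every block.
/// Assumption: this is a guess at the pattern's shape, not upstream's layout.
pub fn candidate_keys(i: usize, window: usize, block: usize) -> Vec<usize> {
    let mut keys = BTreeSet::new();

    // Local window: the `window` most recent positions, including `i` itself.
    let start = (i + 1).saturating_sub(window);
    keys.extend(start..=i);

    // Log-stride: positions at power-of-two distances behind `i`.
    let mut stride = 1usize;
    while stride <= i {
        keys.insert(i - stride);
        match stride.checked_mul(2) {
            Some(next) => stride = next,
            None => break,
        }
    }

    // Landmarks: the first position of every block up to `i`.
    if block > 0 {
        keys.extend((0..=i).step_by(block));
    }

    // BTreeSet iteration yields the positions sorted and de-duplicated.
    keys.into_iter().collect()
}
```

For position 10 with a window of 2 and a block size of 4, the sketch yields 7 candidate positions instead of the 11 a causal dense row would score; the gap widens as the position index grows.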
The honest framing: at the current AETHER default of
window_frames = 100, dense MHA is essentially free and the
sparse machinery has overhead — the per-token cost in upstream's
benchmark is ~2.4 µs at N = 256 and ~2.1 µs at N = 128. The
sparse path probably loses below N ≈ 128. It starts winning at
the 1 s + windows we'd realistically use for activity classification
(N = 200 at 50 Hz, N = 500 for breathing-quality), and pulls
ahead by 30–100× at the 10 s windows that long-context re-ID
benefits from.
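The edge-count formulas in the table can be evaluated directly. The sketch below is a rough cost model only: the W and B values passed in are placeholders chosen by the caller, so its output is not expected to reproduce upstream's measured figures, and both function names are hypothetical.

```rust
/// Dense attention: every frame is scored against every frame.
pub fn dense_edges(n: u64) -> u64 {
    n * n
}

/// Rough sparse estimate following ~N * (W + log N + N/B).
/// `window` is W and `block` is B; both are caller-supplied placeholders.
/// The per-frame count is capped at `n`, since a frame cannot attend to
/// more positions than exist. A `block` of 0 is treated as 1.
pub fn sparse_edges_estimate(n: u64, window: u64, block: u64) -> u64 {
    if n == 0 {
        return 0;
    }
    let log_n = u64::from(n.ilog2()); // floor(log2 n)
    let per_frame = (window + log_n + n / block.max(1)).min(n);
    n * per_frame
}
```

The dense column of the table follows from dense_edges: 100 frames give 10,000 evaluations, 1000 frames give 1,000,000, and 8192 frames give 67,108,864 (about 6.7 × 10⁷).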
3.2 Streaming decode
Where dense MHA structurally cannot follow is incremental decode.
Re-ID over a long-tracked person (a 5-minute session at 50 Hz =
15,000 frames) with dense MHA requires recomputing attention from
scratch every time the window slides. With decode_step against a
KvCache:
| Operation | Dense MHA | Sparse GQA + KV cache |
|---|---|---|
| Append one new frame to the embedding context | O(N²) | O(log T) |
| Memory growth | O(N · d) per recompute | O(T · d_kv) cached, evicted by H2O heavy-hitter |
| FP16 KV cache | n/a | available via fp16 feature, halves memory |
This is the qualitative capability dense MHA lacks. Even at small
N where dense MHA is competitive on prefill, decode is structurally
different: O(log T) per new frame vs O(N²) recompute.
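The idea behind heavy-hitter eviction can be shown with a toy bounded cache: entries that keep receiving attention mass survive, and the lowest-scoring older entry is dropped when the cache is full. This is a simplified illustration of the concept only; it is not upstream's KvCache, and every name in it is hypothetical.

```rust
/// One cached frame plus the attention mass it has accumulated so far.
#[derive(Debug, Clone, PartialEq)]
pub struct CachedFrame {
    pub frame: usize,
    pub score: f32,
}

/// Toy bounded cache with heavy-hitter-style eviction. Not upstream's KvCache.
pub struct ToyHeavyHitterCache {
    capacity: usize,
    protected_recent: usize,
    entries: Vec<CachedFrame>,
}

impl ToyHeavyHitterCache {
    /// The `protected_recent` newest entries are never evicted;
    /// it must be smaller than `capacity`.
    pub fn new(capacity: usize, protected_recent: usize) -> Self {
        assert!(protected_recent < capacity, "need at least one evictable slot");
        Self { capacity, protected_recent, entries: Vec::with_capacity(capacity + 1) }
    }

    /// Record attention weight that a decode step placed on a cached frame.
    pub fn credit(&mut self, frame: usize, weight: f32) {
        if let Some(e) = self.entries.iter_mut().find(|e| e.frame == frame) {
            e.score += weight;
        }
    }

    /// Append a new frame. When over capacity, evict the lowest-scoring entry
    /// outside the protected recent tail; ties evict the oldest entry.
    pub fn push(&mut self, frame: usize) {
        self.entries.push(CachedFrame { frame, score: 0.0 });
        if self.entries.len() > self.capacity {
            let evictable = self.entries.len() - self.protected_recent;
            let victim = self.entries[..evictable]
                .iter()
                .enumerate()
                .min_by(|a, b| a.1.score.total_cmp(&b.1.score))
                .map(|(idx, _)| idx);
            if let Some(idx) = victim {
                self.entries.remove(idx);
            }
        }
    }

    /// Frame indices currently held, oldest first.
    pub fn frames(&self) -> Vec<usize> {
        self.entries.iter().map(|e| e.frame).collect()
    }
}
```

The cache never holds more than `capacity` entries after a push, which is what keeps memory bounded over a long tracked session regardless of how many frames have been seen.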
4. Approach
4.1 Workspace dependency
Add to v2/Cargo.toml:

    [workspace.dependencies.ruvllm_sparse_attention]
    path = "../vendor/ruvector/crates/ruvllm_sparse_attention"
    default-features = false
    features = ["fp16"]

default-features = false mirrors the rest of the workspace's
--no-default-features posture (and matches what ADR-095 does on
the firmware side, so both consumers have the same feature set).
We do not pull parallel here — rayon doesn't help with
inference-shaped batches at the sequence lengths we run, and it
breaks ADR-095's no_std build if the dependency leaks.
4.2 Crate placement
Two viable homes for the AETHER temporal head:
| Option | Tradeoffs |
|---|---|
| A. New wifi-densepose-temporal crate | Cleanest. Unique import surface, easy to feature-gate. But: one more crate in the publishing order (CLAUDE.md crate table grows to 16). |
| B. Add to wifi-densepose-train | Co-located with the model; no new crate; simpler workspace graph. But: wifi-densepose-train is heavyweight (tch, full training stack), and signal-side consumers would have to depend on the whole training crate just to run inference. |
Recommendation: A. The temporal head is consumed by both
wifi-densepose-train (training) and wifi-densepose-signal
(inference, re-ID). Pulling those toward a shared third crate keeps
the dependency arrows clean. Also matches ADR-095's
wifi-densepose-temporal host-side training crate name —
deliberate convergence.
4.3 API sketch

    pub struct AetherTemporalHead {
        backend: TemporalBackend,
        cache: Option<KvCache>, // populated for streaming inference
    }

    pub enum TemporalBackend {
        Dense(DenseMha), // current ruvector-attention path
        SparseGqa(SubquadraticSparseAttention),
    }

    impl AetherTemporalHead {
        pub fn new(cfg: &TemporalHeadConfig) -> Self {
            todo!("build the backend selected by cfg")
        }

        /// Window-level prefill. Returns pooled [d_model] embedding.
        pub fn forward(&self, frames: &Tensor3) -> Vec<f32> {
            todo!("run prefill over the window and pool")
        }

        /// Incremental decode for streaming re-ID. Updates internal
        /// cache and returns pooled embedding given a single new frame.
        /// SparseGqa backend only.
        pub fn step(&mut self, frame: &Tensor3) -> Result<Vec<f32>, TemporalError> {
            todo!("decode_step against the cache")
        }
    }

4.4 Selection rule
In forward_auto's spirit, the head selects the path based on
(window, n_q_heads, n_kv_heads) of the model:
- window ≤ 64 and dense MHA is in the checkpoint: use dense path.
- n_q_heads != n_kv_heads: use forward_gqa.
- n_q_heads == n_kv_heads and window > 64: use forward.
- Streaming (per-frame) inference: always decode_step.
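The selection rule can be sketched as a single function. All names here are hypothetical. Two points are assumptions rather than part of the rule as written: the streaming check is evaluated first because it is stated as unconditional, and the one case the rule leaves open (equal head counts, a window of 64 or less, and no dense weights in the checkpoint) falls through to forward.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TemporalPath {
    Dense,
    ForwardGqa,
    Forward,
    DecodeStep,
}

/// Hypothetical summary of the model shape the rule inspects.
pub struct HeadShape {
    pub window: usize,
    pub n_q_heads: usize,
    pub n_kv_heads: usize,
}

/// Largest window for which the dense path is still preferred.
pub const DENSE_WINDOW_MAX: usize = 64;

pub fn select_path(shape: &HeadShape, checkpoint_has_dense: bool, streaming: bool) -> TemporalPath {
    // Streaming (per-frame) inference always decodes against the cache.
    if streaming {
        return TemporalPath::DecodeStep;
    }
    // Short windows stay dense when the checkpoint carries dense weights.
    if shape.window <= DENSE_WINDOW_MAX && checkpoint_has_dense {
        return TemporalPath::Dense;
    }
    // Fewer K/V heads than query heads means grouped-query attention.
    if shape.n_q_heads != shape.n_kv_heads {
        return TemporalPath::ForwardGqa;
    }
    // Equal head counts: plain sparse forward. This also covers the case the
    // rule leaves open (short window, no dense weights); that is an assumption.
    TemporalPath::Forward
}
```

Keeping the rule in one pure function makes each branch straightforward to unit-test before any kernel is wired in.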
5. Validation gate before flipping the inference default
We do not flip the production inference default until all four of these pass on the most recent AETHER checkpoint:
1. Contrastive loss within 1% of the dense baseline at the same training budget (so the kernel substitution doesn't silently regress the loss surface).
2. Re-ID rank-1 accuracy within 1 percentage point of the dense baseline on the held-out test split.
3. Spearman rank correlation ≥ 0.95 between dense-MHA and sparse-GQA top-50 nearest-neighbour orderings on the env_fingerprint and person_track HNSW indices (matches the ADR-024 §2.5.3 quantisation-rank-preservation criterion).
4. Latency improvement ≥ 5× at the deployed window length.
Any of (1)–(3) failing rolls back the default; the kernel can stay in the codebase as opt-in, but is not what new training runs use.
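The four gates can be expressed as a small check. The sketch below assumes tie-free rankings for the Spearman computation, a positive loss value, and that gates 1 and 2 only fail on regressions (a sparse run that beats the dense baseline passes). The struct GateInputs and both function names are hypothetical.

```rust
/// Spearman's rho for two tie-free rankings of the same items, where
/// rank_a[i] and rank_b[i] are the positions of item i in each ordering.
/// Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
pub fn spearman_rho(rank_a: &[usize], rank_b: &[usize]) -> Option<f64> {
    let n = rank_a.len();
    if n < 2 || rank_b.len() != n {
        return None;
    }
    let sum_d2: f64 = rank_a
        .iter()
        .zip(rank_b)
        .map(|(&a, &b)| {
            let d = a as f64 - b as f64;
            d * d
        })
        .sum();
    let n = n as f64;
    Some(1.0 - 6.0 * sum_d2 / (n * (n * n - 1.0)))
}

/// Hypothetical container for one dense-vs-sparse A/B run.
pub struct GateInputs {
    pub dense_loss: f64,
    pub sparse_loss: f64,
    pub dense_rank1_pct: f64,
    pub sparse_rank1_pct: f64,
    pub spearman: f64,
    pub dense_latency_us: f64,
    pub sparse_latency_us: f64,
}

/// Returns the numbers (1 to 4) of the gates that fail; empty means all pass.
pub fn failed_gates(g: &GateInputs) -> Vec<u8> {
    let mut failed = Vec::new();
    // Gate 1: loss no more than 1% above the dense baseline (assumes loss > 0).
    if g.sparse_loss > g.dense_loss * 1.01 {
        failed.push(1);
    }
    // Gate 2: rank-1 accuracy no more than 1 percentage point below dense.
    if g.dense_rank1_pct - g.sparse_rank1_pct > 1.0 {
        failed.push(2);
    }
    // Gate 3: neighbour orderings stay rank-correlated.
    if g.spearman < 0.95 {
        failed.push(3);
    }
    // Gate 4: at least a 5x latency improvement.
    if g.dense_latency_us < 5.0 * g.sparse_latency_us {
        failed.push(4);
    }
    failed
}
```

Returning the list of failing gate numbers, not a single boolean, lets the caller apply the rollback rule for gates 1 to 3 separately from the latency gate.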
6. Alternatives considered
| Option | Why rejected |
|---|---|
| Keep dense MHA, period | Simple, but caps the practical window length. The 10 s + windows that long-context re-ID and breathing-quality scoring want are exactly where dense MHA hurts. We'd be locking in a ceiling for no reason. |
| Use ruvector-attention 2.0.4 (already in workspace) | It's what we use today for antenna and spatial attention. But it lacks GQA, lacks streaming KV cache, and its dependency story upstream is messy (ruvector-attn-mincut is stuck at 2.0.4 per the issue). It works, but it's not the right tool for temporal attention specifically. |
| Wait for ruvector-attention 2.x to add GQA + KV cache | Speculative; no published roadmap. Meanwhile ruvllm_sparse_attention shipped real artifacts on 2026-05-07 and is path-vendorable today. |
| Use a non-attention temporal pooler (TCN / S4 / Mamba) | All three are real options for time-series sensing; some research gives them a slight edge on long-horizon dependencies. But (a) we already have AETHER specified around attention in ADR-024, (b) the contrastive recipe is attention-tuned, (c) we'd be re-running the entire ADR-024 training story to swap to a different family. Switching to sparse attention preserves the ADR-024 mathematical apparatus exactly. |
| forward_gated_with_fastgrnn immediately | Tempting because it's the O(N) path. But the gate adds approximation error on top of the sparsity-induced approximation error. Phase the introductions: prove sparse-GQA matches dense first, then layer the gate on top in a follow-up. |
7. Consequences
Positive
- Long windows are no longer scary. window_frames = 1000 for 10 s sessions becomes practical, not aspirational.
- Streaming re-ID gets a structural speedup. Per-frame decode cost goes from O(N²) to O(log T). Pose tracker cost is a real budget today; this shrinks it.
- GQA fits the AETHER backbone better. AETHER's per-keypoint cross-attention already has a query/key shape mismatch (17 keypoint queries vs N CSI keys). GQA was designed for exactly this asymmetry.
- Path-vendored, not crates.io-coupled. No bind-time risk: the crate ships from the vendored copy of upstream, and the vendor was synced today (e38347601).
- Same kernel, two consumers. ADR-095 wants this on the MCU; this ADR wants it on the host. Path-vendoring once keeps the versions in lockstep.
- Approximation error is bounded by the local window + log-stride + landmark pattern. Upstream's measurement (README.md §FAQ) is "<1% perplexity on standard benchmarks" for the causal case; we measure ours via §5's gate.
Negative
- Adds a workspace dependency the team has to know about. Mitigated by path-vendoring (no version-resolution risk).
- Approximation error is not zero. For high-precision re-ID this needs measurement. §5's gate is the safety net; if rank correlation drops below 0.95 we don't flip the default.
- More moving parts in the temporal head. Dense MHA has one knob (number of heads). Sparse GQA has window, log-stride, landmark block size, KV head count, and (later) gate top-K. We pay this in default-config tuning effort.
- KvCache introduces session state in a place that didn't have it. Code that previously called a stateless forward(...) now has to think about cache lifetime per tracked person. The pose tracker (pose_tracker.rs) already has per-track state, so the natural place for the cache is inside PoseTrack; needs a small lifecycle review.
- Training and inference paths diverge slightly. Training always uses forward (full window prefill). Inference uses decode_step for streaming. The two paths must be tested separately; upstream's forward and decode_step are unit-test parity-checked, but our wrapper has its own surface.
Neutral
- ADR-024 is not superseded. The contrastive loss, the augmentation strategy, the projection head, the HNSW indices — all unchanged. This ADR makes a single architectural choice inside ADR-024's "temporal aggregation" black box.
- ADR-016 (RuVector training pipeline integration) is unaffected. The other RuVector crates (mincut, attn-mincut, temporal-tensor, solver, attention) keep their existing roles in model.rs.
8. Open questions
1. What is the AETHER temporal head's actual current architecture in code? ADR-024 specifies the projection head precisely (Linear → BN → ReLU → Linear → L2-norm) but the temporal aggregation before that is not pinned. The closest thing in model.rs today is apply_antenna_attention and apply_spatial_attention, which are over antenna and spatial axes, not the temporal axis. So this ADR is, in practice, choosing the temporal kernel for the first time, not replacing one. Worth confirming with the maintainer before the implementation PR uses language like "swap" rather than "add".
2. What window length is the deployed AETHER tracker using today? The training default is 100 frames (config.rs:165), but proof.rs uses 4 and trainer.rs uses 2. The realistic deployment number determines how much of the §3.1 quantitative argument is currently operative versus future-state. If the answer is "we run AETHER on 4-frame windows", sparse pays nothing today, and the case for this ADR rests entirely on the long-window roadmap. If 100 or more, sparse already pays.
3. Is FastGrnnGate worth enabling for re-ID specifically? Probably not: re-ID benefits from full-sequence visibility, and the gate's job is to prune long-range candidates. Save the gate for activity classification (where transient movement is the signal of interest, and saliency-based pruning matches the use case). Confirm via §5's accuracy gate when we get there.
4. Does the cross-modal alignment loss (ADR-024 §2.2.4) need any change? The cross-modal loss operates on pooled z_csi (already temporally aggregated) and pooled z_pose. As long as the temporal aggregator returns a comparable pooled vector, the loss is kernel-agnostic. Likely no change, but worth a smoke test.
5. Where does the KV cache live for re-ID? Per pose_tracker.rs, each PoseTrack already has lifecycle (create / update / evict). The natural place is PoseTrack::kv_cache: Option<KvCache>, populated when the track first emits an embedding. Eviction policy ties to track.last_seen: when the track is dropped, drop the cache. Spec-level sanity check only; needs a real design pass in the implementation PR.
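The cache lifecycle proposed in the last question can be sketched with stand-in types. Everything below is hypothetical: ToyKvCache replaces the real KvCache, the PoseTrack shown here carries only the two fields the lifecycle needs, and the tick-based last_seen is an assumption about how staleness is measured.

```rust
use std::collections::HashMap;

/// Stand-in for the real KV cache; it only counts appended frames.
#[derive(Debug, Default)]
pub struct ToyKvCache {
    pub frames_seen: usize,
}

/// Hypothetical slice of per-track state; the real PoseTrack has more fields.
#[derive(Debug)]
pub struct PoseTrack {
    pub last_seen: u64,
    pub kv_cache: Option<ToyKvCache>,
}

#[derive(Default)]
pub struct Tracker {
    tracks: HashMap<u32, PoseTrack>,
}

impl Tracker {
    /// Feed one frame for `track_id` at tick `now`. The cache is created
    /// lazily the first time the track emits an embedding.
    /// Returns how many frames the track's cache has absorbed.
    pub fn observe(&mut self, track_id: u32, now: u64) -> usize {
        let track = self
            .tracks
            .entry(track_id)
            .or_insert(PoseTrack { last_seen: now, kv_cache: None });
        track.last_seen = now;
        let cache = track.kv_cache.get_or_insert_with(ToyKvCache::default);
        cache.frames_seen += 1;
        cache.frames_seen
    }

    /// Drop tracks not seen within `max_age` ticks. Because the cache is
    /// owned by the track, it is dropped together with it.
    pub fn evict_stale(&mut self, now: u64, max_age: u64) {
        self.tracks
            .retain(|_, t| now.saturating_sub(t.last_seen) <= max_age);
    }

    /// Frames held in the cache of `track_id`, if the track still exists.
    pub fn cached_frames(&self, track_id: u32) -> Option<usize> {
        self.tracks
            .get(&track_id)
            .and_then(|t| t.kv_cache.as_ref())
            .map(|c| c.frames_seen)
    }
}
```

Owning the cache inside the track means no separate cleanup path is needed: a track that is evicted and later re-created starts with a fresh cache.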
9. Acceptance criteria
This ADR is Accepted once:
- Maintainer review on #513 confirms the architecture and resolves §8.1 (the "first-time choice vs replacement" framing).
- Open question §8.2 has a concrete answer (ideally a one-line pointer to the production training config).
- The follow-up implementation issue is filed.
This ADR is Implemented once:
- wifi-densepose-temporal (or equivalent) ships in the workspace with a default-off feature flag exposing both dense and sparse-GQA backends.
- §5's four-gate validation has run on the most recent AETHER checkpoint and the result is published (witness-bundle compatible per ADR-028 if the run is reproducible).
- The default for new training runs is sparse_gqa, with dense still selectable for back-compat.
10. Related
ADR-014 (signal SOTA), ADR-016 (RuVector training pipeline integration), ADR-024 (AETHER contrastive CSI embedding — this ADR fills in its temporal-aggregation black box), ADR-095 (on-ESP32-S3 temporal modeling — same crate, different consumer), upstream ADR-189 (KV cache incremental decode — the basis for streaming re-ID), upstream ADR-190 (GQA / MQA — what AETHER's 17 keypoint queries × N CSI keys asymmetry naturally maps onto), upstream ADR-192 (no_std + alloc support — the structural change that means the same kernel runs both on the host here and on the MCU under ADR-095).