wifi-densepose/docs/adr/ADR-096-aether-temporal-hea...

# ADR-096: AETHER Temporal Head via `ruvllm_sparse_attention::forward_gqa` + Streaming KV Cache

| Field       | Value                                                                                 |
|-------------|---------------------------------------------------------------------------------------|
| **Status**  | Proposed (2026-05-07)                                                                 |
| **Date**    | 2026-05-07                                                                            |
| **Authors** | ruvnet, claude-flow                                                                   |
| **Related** | ADR-014, ADR-016, ADR-024, ADR-095; upstream ADR-189, ADR-190, ADR-192                |
| **Branch**  | `feat/ruvllm-sparse-attention-edge`                                                   |
| **Tracking**| #513                                                                                  |

---

## 1. Context

ADR-024 ("Project AETHER") specifies a contrastive CSI embedding
model on top of the existing `CsiToPoseTransformer` backbone. It
adds a 2-layer projection head to the per-keypoint features and
trains it with InfoNCE + VICReg + (optional) cross-modal alignment.
The **temporal aggregation** that turns per-frame backbone features
into a window-level representation is described at the level of
"a transformer encoder over the CSI window" — but ADR-024 does not
pin a specific attention kernel. In the current code:

- `v2/crates/wifi-densepose-train/src/model.rs` uses
  `ruvector_attention::ScaledDotProductAttention` (line 34) and
  applies `apply_antenna_attention` over the antenna-path dimension
  and `apply_spatial_attention` over the spatial location dimension.
  Both are dense.
- The training-side temporal pooling currently runs at
  `window_frames = 100` by default (`config.rs:165`), with
  `proof.rs` and `trainer.rs` using shorter test windows of 4 and 2
  respectively.
- `v2/crates/wifi-densepose-signal/src/ruvsense/pose_tracker.rs`
  consumes a 128-dim AETHER re-ID embedding (line 22, 263) but does
  not perform the temporal aggregation itself — that happens
  upstream.

So the temporal head is a real seam in the codebase, but its
specific attention kernel is *currently dense* and *currently not a
named architectural decision*. This ADR makes that decision.

The vendored `ruvllm_sparse_attention` v0.1.1 (synced today,
released 2026-05-07) provides a different kind of temporal kernel:

- **Subquadratic O(N log N)** sparse attention (`forward`,
  `forward_flash`).
- **Grouped-Query / Multi-Query Attention** (`forward_gqa`,
  `forward_gqa_flash`) — shares K/V across query heads, the
  pattern Mistral-7B and Llama-3 use.
- **Streaming KV cache** (`KvCache`, `KvCacheF16`) with H2O
  heavy-hitter eviction, allowing token-by-token decode in
  **O(log T)** per step against an accumulated cache. See upstream
  ADR-189.
- **FastGRNN salience gate** for **near-linear O(N)** when the
  log-stride candidate set can be pruned.

These capabilities are qualitatively different from
`ruvector-attention` 2.0.4, which is what the workspace uses today
for spatial / antenna attention.

---

## 2. Decision

The AETHER temporal head will be implemented with
`ruvllm_sparse_attention::SubquadraticSparseAttention::forward_gqa`
for prefill, and `decode_step` against a `KvCache` (with the `fp16`
feature enabled) for streaming inference paths (online re-ID,
incremental embedding extraction during a tracked session).

Concretely:

1. `wifi-densepose-train` adds `ruvllm_sparse_attention` as a
   workspace dependency, **path-vendored** against
   `vendor/ruvector/crates/ruvllm_sparse_attention` so the workspace
   does not gain a crates.io publish dependency.
2. The AETHER block factory takes a feature flag
   (`temporal_head = "dense" | "sparse_gqa"`) selecting between the
   current dense MHA path and the new sparse-GQA path. The default
   for new training runs is `sparse_gqa`. Existing checkpoints
   continue to load on `dense`.
3. Signal-side consumers (the streaming embedding extraction used
   by `pose_tracker.rs` for re-ID updates) call `decode_step` rather
   than re-running prefill on every new frame — this is the
   structural win that dense MHA cannot provide.
4. We add an A/B benchmark gate (§5) before flipping the production
   default. The default *training* config can move first; the
   default *inference* config waits for the gate.

This ADR sanctions the swap. It does not perform the swap; that
lands in a follow-up implementation issue once both ADR-095 and
ADR-096 are accepted.

---

## 3. Quantitative argument

### 3.1 Edge-evaluation count

For a single attention layer over `N` frames:

| Path | Edge evaluations | At `N = 100` (today's default) | At `N = 1000` (10 s @ 100 Hz) | At `N = 8192` |
|------|------------------|--------------------------------|-------------------------------|---------------|
| Dense MHA | `N²` | 1.0 × 10⁴ | 1.0 × 10⁶ | 6.7 × 10⁷ |
| Sparse `forward` (window + log-stride + landmarks) | ~`N · (W + log N + N/B)` | 1.4 × 10⁴ | 1.4 × 10⁴ | 1.1 × 10⁶ |
| Sparse + FastGRNN | ~`N · (W + globals + K)` | constant in `N` | constant in `N` | constant in `N` |

Numbers for the sparse rows are taken from upstream's measured
table (`README.md:230-237`, "sparse-edge reduction vs causal dense
attention"): 8192 → 29.3× edge reduction, 16384 → 57.5×, 32768 →
113.2×.

**The honest framing:** at the *current* AETHER default of
`window_frames = 100`, dense MHA is essentially free and the
sparse machinery has overhead — the per-token cost in upstream's
benchmark is ~2.4 µs at `N = 256` and ~2.1 µs at `N = 128`. The
sparse path probably *loses* below `N ≈ 128`. It starts winning at
the 1 s + windows we'd realistically use for activity classification
(`N = 200` at 50 Hz, `N = 500` for breathing-quality), and pulls
ahead by 30–100× at the 10 s windows that long-context re-ID
benefits from.

### 3.2 Streaming decode

Where dense MHA structurally cannot follow is incremental decode.
Re-ID over a long-tracked person (a 5-minute session at 50 Hz =
15,000 frames) with dense MHA requires recomputing attention from
scratch every time the window slides. With `decode_step` against a
`KvCache`:

| Operation | Dense MHA | Sparse GQA + KV cache |
|-----------|-----------|-----------------------|
| Append one new frame to the embedding context | O(N²) | **O(log T)** |
| Memory growth | O(N · d) per recompute | O(T · d_kv) cached, evicted by H2O heavy-hitter |
| FP16 KV cache | n/a | available via `fp16` feature, halves memory |

This is the qualitative capability dense MHA lacks. Even at small
`N` where dense MHA is competitive on prefill, decode is structurally
different: amortised O(1) per new frame vs O(N²) recompute.

---

## 4. Approach

### 4.1 Workspace dependency

Add to `v2/Cargo.toml`:

```toml
[workspace.dependencies]
ruvllm_sparse_attention = {
    path = "../vendor/ruvector/crates/ruvllm_sparse_attention",
    default-features = false,
    features = ["fp16"]
}
```

`default-features = false` mirrors the rest of the workspace's
`--no-default-features` posture (and matches what ADR-095 does on
the firmware side, so both consumers have the same feature set).
We **do not** pull `parallel` here — rayon doesn't help with
inference-shaped batches at the sequence lengths we run, and it
breaks ADR-095's no_std build if the dependency leaks.

### 4.2 Crate placement

Two viable homes for the AETHER temporal head:

| Option | Tradeoffs |
|--------|-----------|
| **A. New `wifi-densepose-temporal` crate** | Cleanest. Unique import surface, easy to feature-gate. But: one more crate in the publishing order (CLAUDE.md crate table grows to 16). |
| **B. Add to `wifi-densepose-train`** | Co-located with the model; no new crate; simpler workspace graph. But: `wifi-densepose-train` is heavyweight (`tch`, full training stack), and signal-side consumers would have to depend on the whole training crate just to run inference. |

**Recommendation: A.** The temporal head is consumed by both
`wifi-densepose-train` (training) and `wifi-densepose-signal`
(inference, re-ID). Pulling those toward a shared third crate keeps
the dependency arrows clean. Also matches ADR-095's
`wifi-densepose-temporal` host-side training crate name —
deliberate convergence.

### 4.3 API sketch

```rust
pub struct AetherTemporalHead {
    backend: TemporalBackend,
    cache: Option<KvCache>,           // populated for streaming inference
}

pub enum TemporalBackend {
    Dense(DenseMha),                  // current ruvector-attention path
    SparseGqa(SubquadraticSparseAttention),
}

impl AetherTemporalHead {
    pub fn new(cfg: &TemporalHeadConfig) -> Self;

    /// Window-level prefill. Returns pooled [d_model] embedding.
    pub fn forward(&self, frames: &Tensor3) -> Vec<f32>;

    /// Incremental decode for streaming re-ID. Updates internal
    /// cache and returns pooled embedding given a single new frame.
    /// SparseGqa backend only.
    pub fn step(&mut self, frame: &Tensor3) -> Result<Vec<f32>, TemporalError>;
}
```

### 4.4 Selection rule

In `forward_auto`'s spirit, the head selects the path based on
`(window, n_q_heads, n_kv_heads)` of the model:

- `window ≤ 64` and dense MHA is in the checkpoint: use dense path.
- `n_q_heads != n_kv_heads`: use `forward_gqa`.
- `n_q_heads == n_kv_heads` and `window > 64`: use `forward`.
- Streaming (per-frame) inference: always `decode_step`.

---

## 5. Validation gate before flipping the inference default

We do not flip the production inference default until *all four*
of these pass on the most recent AETHER checkpoint:

1. **Contrastive loss within 1%** of the dense baseline at the same
   training budget (so the kernel substitution doesn't silently
   regress the loss surface).
2. **Re-ID rank-1 accuracy within 1 percentage point** of the dense
   baseline on the held-out test split.
3. **Spearman rank correlation ≥ 0.95** between dense-MHA and
   sparse-GQA top-50 nearest-neighbour orderings on the
   `env_fingerprint` and `person_track` HNSW indices (matches the
   ADR-024 §2.5.3 quantisation-rank-preservation criterion).
4. **Latency improvement ≥ 5×** at the deployed window length.

Any of (1)–(3) failing rolls back the default; the kernel can stay
in the codebase as opt-in, but is not what new training runs use.

---

## 6. Alternatives considered

| Option | Why rejected |
|--------|--------------|
| **Keep dense MHA, period** | Simple, but caps the practical window length. The 10 s + windows that long-context re-ID and breathing-quality scoring want are exactly where dense MHA hurts. We'd be locking in a ceiling for no reason. |
| **Use `ruvector-attention` 2.0.4 (already in workspace)** | It's what we use today for antenna and spatial attention. But it lacks GQA, lacks streaming KV cache, and its dependency story upstream is messy (`ruvector-attn-mincut` is stuck at 2.0.4 per the issue). It works, but it's not the right tool for *temporal* attention specifically. |
| **Wait for `ruvector-attention 2.x` to add GQA + KV cache** | Speculative; no published roadmap. Meanwhile `ruvllm_sparse_attention` shipped real artifacts on 2026-05-07 and is path-vendorable today. |
| **Use a non-attention temporal pooler (TCN / S4 / Mamba)** | All three are real options for time-series sensing; some research gives them a slight edge on long-horizon dependencies. But (a) we already have AETHER specified around attention in ADR-024, (b) the contrastive recipe is attention-tuned, (c) we'd be re-running the entire ADR-024 training story to swap to a different family. Switching to *sparse* attention preserves the ADR-024 mathematical apparatus exactly. |
| **`forward_gated_with_fastgrnn` immediately** | Tempting because it's the O(N) path. But the gate adds approximation error on top of the sparsity-induced approximation error. Phase the introductions: prove sparse-GQA matches dense first, then layer the gate on top in a follow-up. |

---

## 7. Consequences

### Positive

- **Long windows are no longer scary.** `window_frames = 1000` for
  10 s sessions becomes practical, not aspirational.
- **Streaming re-ID gets a structural speedup.** Per-frame decode
  cost goes from O(N²) to O(log T). Pose tracker cost is a real
  budget today; this shrinks it.
- **GQA fits the AETHER backbone better.** AETHER's per-keypoint
  cross-attention already has a query/key shape mismatch (17
  keypoint queries vs N CSI keys). GQA was designed for exactly
  this asymmetry.
- **Path-vendored, not crates.io-coupled.** No bind-time risk —
  the crate ships from the vendored copy of upstream, and the
  vendor was synced today (`e38347601`).
- **Same kernel, two consumers.** ADR-095 wants this on the MCU;
  this ADR wants it on the host. Path-vendoring once keeps the
  versions in lockstep.
- **Approximation error is bounded** by the local window +
  log-stride + landmark pattern. Upstream's measurement (`README.md`
  §FAQ) is "<1% perplexity on standard benchmarks" for the
  causal case; we measure ours via §5's gate.

### Negative

- **Adds a workspace dependency** the team has to know about.
  Mitigated by path-vendoring (no version-resolution risk).
- **Approximation error is not zero.** For high-precision re-ID
  this needs measurement. §5's gate is the safety net; if rank
  correlation drops below 0.95 we don't flip the default.
- **More moving parts in the temporal head.** Dense MHA has one
  knob (number of heads). Sparse GQA has window, log-stride,
  landmark block size, KV head count, and (later) gate top-K. We
  pay this in default-config tuning effort.
- **`KvCache` introduces session state** in a place that didn't
  have it. Code that previously called a stateless `forward(...)`
  now has to think about cache lifetime per tracked person. The
  pose tracker (`pose_tracker.rs`) already has per-track state, so
  the natural place for the cache is inside `PoseTrack`; needs a
  small lifecycle review.
- **Training and inference paths diverge slightly.** Training
  always uses `forward` (full window prefill). Inference uses
  `decode_step` for streaming. The two paths must be tested
  separately; upstream's `forward` and `decode_step` are unit-test
  parity-checked, but our wrapper has its own surface.

### Neutral

- ADR-024 is **not superseded.** The contrastive loss, the
  augmentation strategy, the projection head, the HNSW indices —
  all unchanged. This ADR makes a single architectural choice
  inside ADR-024's "temporal aggregation" black box.
- ADR-016 (RuVector training pipeline integration) is unaffected.
  The other RuVector crates (`mincut`, `attn-mincut`,
  `temporal-tensor`, `solver`, `attention`) keep their existing
  roles in `model.rs`.

---

## 8. Open questions

1. **What is the AETHER temporal head's actual current
   architecture in code?** ADR-024 specifies the projection head
   precisely (Linear → BN → ReLU → Linear → L2-norm) but the
   *temporal aggregation* before that is not pinned. The closest
   thing in `model.rs` today is `apply_antenna_attention` and
   `apply_spatial_attention`, which are over antenna and spatial
   axes, not the temporal axis. So this ADR is, in practice,
   choosing the temporal kernel for the *first time* — not
   replacing one. Worth confirming with the maintainer before the
   implementation PR uses language like "swap" rather than "add".
2. **What window length is the deployed AETHER tracker using
   today?** The training default is 100 frames (`config.rs:165`),
   but `proof.rs` uses 4 and `trainer.rs` uses 2. The realistic
   deployment number determines how much of the §3.1 quantitative
   argument is *currently* operative versus *future-state*. If the
   answer is "we run AETHER on 4-frame windows", sparse pays
   nothing today, and the case for this ADR rests entirely on the
   long-window roadmap. If 100 or more, sparse already pays.
3. **Is `FastGrnnGate` worth enabling for re-ID specifically?**
   Probably not — re-ID benefits from full-sequence visibility,
   and the gate's job is to *prune* long-range candidates. Save
   the gate for activity classification (where transient movement
   is the signal of interest, and saliency-based pruning matches
   the use case). Confirm via §5's accuracy gate when we get there.
4. **Does the cross-modal alignment loss (ADR-024 §2.2.4) need
   any change?** The cross-modal loss operates on pooled
   `z_csi` (already temporally aggregated) and pooled `z_pose`. As
   long as the temporal aggregator returns a comparable pooled
   vector, the loss is kernel-agnostic. Likely no change, but
   worth a smoke test.
5. **Where does the KV cache live for re-ID?** Per `pose_tracker.rs`,
   each `PoseTrack` already has lifecycle (create / update /
   evict). The natural place is `PoseTrack::kv_cache:
   Option<KvCache>`, populated when the track first emits an
   embedding. Eviction policy ties to `track.last_seen` — when
   the track is dropped, drop the cache. Spec-level sanity check
   only; needs a real design pass in the implementation PR.

---

## 9. Acceptance criteria

This ADR is **Accepted** once:

1. Maintainer review on #513 confirms the architecture and resolves
   §8.1 (the "first-time choice vs replacement" framing).
2. Open question §8.2 has a concrete answer (ideally a one-line
   pointer to the production training config).
3. The follow-up implementation issue is filed.

This ADR is **Implemented** once:

1. `wifi-densepose-temporal` (or equivalent) ships in the workspace
   with a default-off feature flag exposing both dense and
   sparse-GQA backends.
2. §5's four-gate validation has run on the most recent AETHER
   checkpoint and the result is published (witness-bundle
   compatible per ADR-028 if the run is reproducible).
3. The default for new training runs is `sparse_gqa`, with `dense`
   still selectable for back-compat.

---

## 10. Related

ADR-014 (signal SOTA), ADR-016 (RuVector training pipeline
integration), ADR-024 (AETHER contrastive CSI embedding — this
ADR fills in its temporal-aggregation black box), ADR-095
(on-ESP32-S3 temporal modeling — same crate, different consumer),
upstream ADR-189 (KV cache incremental decode — the basis for
streaming re-ID), upstream ADR-190 (GQA / MQA — what AETHER's 17
keypoint queries × N CSI keys asymmetry naturally maps onto),
upstream ADR-192 (no_std + alloc support — the structural change
that means the *same* kernel runs both on the host here and on
the MCU under ADR-095).