# ADR-095: On-ESP32-S3 Temporal Modeling at the Edge via `ruvllm_sparse_attention` (no_std)

| Field       | Value                                                                                                  |
|-------------|--------------------------------------------------------------------------------------------------------|
| **Status**  | Proposed (2026-05-07)                                                                                  |
| **Date**    | 2026-05-07                                                                                             |
| **Authors** | ruvnet, claude-flow                                                                                    |
| **Related** | ADR-018, ADR-024, ADR-039, ADR-040, ADR-061, ADR-081, ADR-091; upstream ADR-189, ADR-190, ADR-192      |
| **Branch**  | `feat/ruvllm-sparse-attention-edge`                                                                    |
| **Tracking**| #513                                                                                                   |

---

## 1. Context

Today the ESP32-S3 firmware in `firmware/esp32-csi-node/main/` does
**physics-only** sensing on-device. The pipeline in `edge_processing.c`
runs on Core 1 and produces:

- Adaptive presence detection (`presence_score`).
- Breathing-band (0.1–0.5 Hz) and heart-rate-band (0.8–2.0 Hz) biquad
  IIR bandpass + zero-crossing BPM estimators.
- A motion / fall flag (`flags` bits 0–2 in `edge_vitals_pkt_t` magic
  `0xC5110002`, plus fused mmWave variant `0xC5110004` per ADR-063).
- ADR-081 `rv_feature_state_t` (60 B at magic `0xC5110006`) emitted at
  1–10 Hz from the adaptive controller's fast loop.

There is **no learned model of any kind on the MCU**. The closest things
are: ADR-039 Tier-1 compressed-CSI emission, ADR-040 WASM modules
(Tier-3, but used by the user for ad-hoc DSP, not transformer
inference), and the Rust-side AETHER embeddings (ADR-024) which run
on the host, not the node. Anomaly detection that needs *temporal
context* — "is this fall pattern consistent with a fall, or just a
sit-down?" — is structurally absent. The fall debounce in v0.6.x
(3-frame consecutive + 5 s cooldown, raised threshold 2.0 → 15.0 rad/s²)
is a hand-tuned heuristic exactly because the firmware has nothing
better to reason with.
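
For reference, the debounce rule described above (threshold, three consecutive frames, 5 s cooldown) can be sketched as a small state machine. This is a minimal illustration built only from the numbers quoted in this section; `fall_debounce_t` and both function names are hypothetical and are not the firmware's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FALL_THRESHOLD_RAD_S2   15.0f       /* raised threshold quoted above */
#define FALL_CONSECUTIVE_FRAMES 3u          /* frames in a row required */
#define FALL_COOLDOWN_US        5000000ull  /* 5 s cooldown */

/* Hypothetical state holder; not the firmware's real struct. */
typedef struct {
    uint32_t consecutive;   /* frames in a row above threshold */
    uint64_t last_fire_us;  /* timestamp of the last reported fall */
    bool     has_fired;     /* false until the first fall is reported */
} fall_debounce_t;

static void fall_debounce_init(fall_debounce_t *d)
{
    assert(d != NULL);
    d->consecutive = 0;
    d->last_fire_us = 0;
    d->has_fired = false;
}

/* Returns true when a fall should be reported for this frame. */
static bool fall_debounce_step(fall_debounce_t *d, float accel_rad_s2,
                               uint64_t now_us)
{
    assert(d != NULL);
    if (accel_rad_s2 > FALL_THRESHOLD_RAD_S2) {
        d->consecutive++;
    } else {
        d->consecutive = 0;          /* any quiet frame resets the run */
    }
    if (d->consecutive < FALL_CONSECUTIVE_FRAMES) {
        return false;
    }
    if (d->has_fired && (now_us - d->last_fire_us) < FALL_COOLDOWN_US) {
        return false;                /* still cooling down */
    }
    d->has_fired = true;
    d->last_fire_us = now_us;
    d->consecutive = 0;
    return true;
}
```

The rule only ever sees the last three frames, which is why it cannot tell a fall from a sit-down: both cross the threshold in the same way, and what follows the crossing is never inspected.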

A second pressure point: the Tmr Svc / FreeRTOS stack is already
sensitive. `edge_processing.c` lines 47–48 explicitly note that
`process_frame + update_multi_person_vitals` combined used ~6.5–7.5 KB
of the 8 KB task stack and that **scratch buffers were moved to static
storage to avoid stack overflow.** Any new heavyweight workload — and
a transformer forward pass is heavyweight — must therefore live in
**its own FreeRTOS task with its own task stack**, not piggyback on
the existing edge DSP task.

The vendored crate `ruvllm_sparse_attention` v0.1.1 (released 2026-05-07
and synced the same day to `vendor/ruvector/crates/ruvllm_sparse_attention/`)
removes the previously-blocking `std` requirement. Per upstream
**ADR-192**, the crate now compiles cleanly to
`xtensa-esp32s3-none-elf` via `espup`, with a measured **376 KB
release rlib**, zero runtime dependencies beyond `libm`, and was
validated on a real ESP32-S3 (rev v0.2, 16 MB flash). It exposes
`SubquadraticSparseAttention`, `KvCache` / `KvCacheF16`, `FastGrnnGate`,
`IncrementalLandmarks`, `RuvLlmSparseBlock`, and a `Tensor3` value
type. The kernel is O(N log N) by default and near-linear O(N) when
the FastGRNN salience gate is enabled.

This is the first time we have had a credible path to **on-device
transformer inference for CSI** without a Python runtime, without
TFLite, and without a coprocessor. It is also the right moment to
decide *whether* we want it before code starts to land.

---

## 2. Decision

Add a learned **temporal head** to the ESP32-S3 firmware running on
the node itself, using `ruvllm_sparse_attention` compiled
`--no-default-features` (no_std + alloc, optionally `+fp16`), driven
by a small Rust component integrated into the ESP-IDF build. The
temporal head runs **alongside** the existing physics-only pipeline,
not as a replacement — physics gives us breathing/heart-rate/presence,
the temporal head gives us classification and sequence-aware reasoning.

Concretely:

1. The temporal head consumes a rolling window of feature vectors
   (initially the same `rv_feature_state_t` floats already produced
   by ADR-081, plus optionally a small projection of recent CSI
   amplitude statistics), length `N` ∈ [100, 500] frames, sampled at
   the controller's fast-loop rate.
2. It outputs a small set of **class logits** for the active
   detection task. The first four deployable tasks are listed in
   §4.
3. It runs in its own FreeRTOS task on Core 1 (or pinned to whichever
   core the WiFi driver is *not* on), at a cadence slower than the
   fast loop — initially 1 Hz, classification-on-demand.
4. The kernel is invoked through a thin C ABI (`ruv_temporal_init`,
   `ruv_temporal_push`, `ruv_temporal_classify`,
   `ruv_temporal_destroy`; see §3.2) exported from
   a Rust static library linked into the ESP-IDF build the same way
   the existing Tier-3 components are linked.
5. Weights are stored as a flat `f32` (or `f16` with the `fp16`
   feature) blob in the ESP32-S3 flash, loadable from either an
   embedded `EMBED_FILES` resource (compile-time bake-in) or NVS
   (post-flash provisioning, mirroring ADR-040's WASM-upload path).
6. The temporal head is gated behind a Kconfig option
   `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`, **default off**, and is only
   compiled into the 8 MB build profile until the flash math in §6
   demonstrates 4 MB headroom.

This ADR authorizes the architecture; it does **not** ship any of
the firmware-side or training-side changes. Implementation lands in
follow-up issues per the roadmap in §7.

---

## 3. Approach

### 3.1 Build integration

ESP-IDF v5.4 already supports Rust components via the
`rust-esp32`-style template (a CMake `idf_component_register` shim
that runs `cargo build --target xtensa-esp32s3-none-elf` and links
the resulting static library). The new component lives at
`firmware/esp32-csi-node/components/ruv_temporal/`:

```
ruv_temporal/
  CMakeLists.txt          # component manifest, Rust build invocation
  Cargo.toml              # crate config: no_std, deps on ruvllm_sparse_attention
  build.rs                # generates the C header from #[no_mangle] exports
  src/lib.rs              # public C ABI: init/push/classify/destroy
  src/window.rs           # rolling frame buffer
  src/weights.rs          # NVS / EMBED_FILES weight loader
  include/ruv_temporal.h  # generated; consumed by edge_processing.c
```

Cargo features compiled in: `["fp16"]`. **Not** `parallel` (rayon
needs threads, breaks no_std). **Not** `std`.

### 3.2 Interface

The C ABI is intentionally narrow. It does not expose `Tensor3`,
attention configs, or any Rust types — only `float*` buffers and
opaque handles:

```c
typedef struct ruv_temporal_ctx ruv_temporal_ctx_t;

esp_err_t ruv_temporal_init(const uint8_t *weights, size_t wlen,
                            uint32_t input_dim, uint32_t window,
                            ruv_temporal_ctx_t **out_ctx);
esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
                                float *logits, uint32_t n_classes);
void      ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);
```

`push` is the hot path and must be cheap: it only writes into a
ring buffer, held in PSRAM if available and in internal DRAM
otherwise. `classify` runs the actual sparse attention forward pass
and is the budget-heavy call.
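
The ring-buffer behaviour expected of `push` can be sketched as follows. This is a host-portable illustration in C, assuming caller-supplied storage so the buffer can be placed in PSRAM or internal memory. The component itself would implement this in Rust (`src/window.rs` per §3.1), and `frame_ring_t` with its functions are hypothetical names, not part of the ABI above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rolling window over caller-supplied storage
 * (capacity * dim floats). */
typedef struct {
    float   *storage;
    uint32_t dim;       /* floats per frame */
    uint32_t capacity;  /* frames in the window (N) */
    uint32_t head;      /* slot the next push writes to */
    uint32_t count;     /* valid frames, saturates at capacity */
} frame_ring_t;

static void frame_ring_init(frame_ring_t *rb, float *storage,
                            uint32_t dim, uint32_t capacity)
{
    assert(rb != NULL && storage != NULL && dim > 0 && capacity > 0);
    rb->storage = storage;
    rb->dim = dim;
    rb->capacity = capacity;
    rb->head = 0;
    rb->count = 0;
}

/* O(dim) copy with no allocation; overwrites the oldest frame when full. */
static void frame_ring_push(frame_ring_t *rb, const float *frame)
{
    memcpy(&rb->storage[(size_t)rb->head * rb->dim], frame,
           rb->dim * sizeof(float));
    rb->head = (rb->head + 1u) % rb->capacity;
    if (rb->count < rb->capacity) {
        rb->count++;
    }
}

/* Copies the window oldest-first into out (count * dim floats) and
 * returns the number of frames copied. */
static uint32_t frame_ring_snapshot(const frame_ring_t *rb, float *out)
{
    uint32_t start = (rb->count < rb->capacity) ? 0u : rb->head;
    for (uint32_t i = 0; i < rb->count; i++) {
        uint32_t slot = (start + i) % rb->capacity;
        memcpy(&out[(size_t)i * rb->dim],
               &rb->storage[(size_t)slot * rb->dim],
               rb->dim * sizeof(float));
    }
    return rb->count;
}
```

Keeping `push` to a single `memcpy` plus index arithmetic is what allows it to be called from the fast loop's consumer without affecting that loop's timing; the oldest-first snapshot is only needed on the slower `classify` path.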

### 3.3 Task topology

A new task `ruv_temporal_task` with its own 16 KB stack, pinned to
the same core as the edge DSP task (Core 1), fed via a FreeRTOS
queue from the adaptive controller's fast loop. We do **not** call
the kernel from the existing edge task — the edge stack is already
near-full per the comment at `edge_processing.c:47-48` and recent
fall-debounce / Tmr-Svc-stack work.

### 3.4 Memory budget (per inference)

With `N = 256` (sequence window), `d_model = 32`, `n_heads = 4`, `head_dim = 8`,
1–2 `RuvLlmSparseBlock` layers, `block_size = 64`, and attention `window = 64`:

- Weights: ~5–15 KB (single block, INT8 quant deferred to a later
  ADR; FP16 default).
- KV cache (FP16, full window): `2 * 256 * 4 * 8 * 2 B = 32 KB` per
  layer (16 KB per layer if GQA reduces the KV heads to 2).
- Activations (peak, with `forward_flash` tiling): ≈ 2 KB.
- Working set: < 64 KB for a single block. Comfortable in PSRAM,
  possible in ISR-safe internal SRAM.

These are first-pass estimates; the precise numbers will come from the
`forward_flash` benchmark on real hardware, which is an exit criterion
in §7.
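
The KV-cache term can be checked mechanically. The helper below is a minimal sketch (the name `kv_cache_bytes` is hypothetical) that multiplies out the factors listed above: two tensors (K and V), sequence length, KV heads, head dimension, and bytes per element.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one layer's KV cache: K and V, each
 * seq_len x n_kv_heads x head_dim elements. */
static size_t kv_cache_bytes(size_t seq_len, size_t n_kv_heads,
                             size_t head_dim, size_t bytes_per_elem)
{
    assert(bytes_per_elem > 0);
    return 2u * seq_len * n_kv_heads * head_dim * bytes_per_elem;
}
```

With the stated factors, `kv_cache_bytes(256, 4, 8, 2)` evaluates to 32,768 B per layer. Halving the KV heads to 2, as a GQA configuration might (an assumption, not something this ADR fixes), gives 16,384 B.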

### 3.5 Compatibility with ADR-081 / ADR-039 / ADR-018

The temporal head is a **consumer** of the same feature stream
already flowing in the firmware. It does not alter:

- ADR-018 raw CSI frame layout (`0xC5110001`).
- ADR-039 Tier-1 compressed CSI (`0xC5110005`) or vitals
  (`0xC5110002`).
- ADR-063 fused vitals (`0xC5110004`).
- ADR-081 `rv_feature_state_t` (`0xC5110006`) — this is the primary
  input we tap.

If the temporal head fires a classification, the result rides on a
new `0xC5110007` packet (small: class id, confidence, monotonic seq,
ts_us, CRC32). Allocation of that magic is deferred to the
implementation PR — this ADR reserves the *concept*, not the byte
layout.
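
To make the size of such a packet concrete, here is one possible encoding. It is purely illustrative: the 24-byte layout, the little-endian field order, the quantised `confidence_q8` field, and every name below are assumptions, since this ADR deliberately leaves the byte layout to the implementation PR. The CRC is the standard CRC-32 (IEEE 802.3, reflected polynomial `0xEDB88320`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RUV_TEMPORAL_MAGIC   0xC5110007u  /* proposed above, not yet allocated */
#define RUV_TEMPORAL_PKT_LEN 24u          /* hypothetical size */

/* Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320). */
static uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Writes the low nbytes of v to dst in little-endian order. */
static void put_le(uint8_t *dst, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++) {
        dst[i] = (uint8_t)(v >> (8u * i));
    }
}

/* Hypothetical layout:
 *   0..3   magic          4..7   seq
 *   8..15  ts_us          16     class_id
 *   17     confidence_q8  18..19 reserved (zero)
 *   20..23 CRC-32 over bytes 0..19 */
static void ruv_temporal_pkt_encode(uint8_t out[RUV_TEMPORAL_PKT_LEN],
                                    uint32_t seq, uint64_t ts_us,
                                    uint8_t class_id, uint8_t confidence_q8)
{
    put_le(&out[0], RUV_TEMPORAL_MAGIC, 4);
    put_le(&out[4], seq, 4);
    put_le(&out[8], ts_us, 8);
    out[16] = class_id;
    out[17] = confidence_q8;
    out[18] = 0;
    out[19] = 0;
    put_le(&out[20], crc32_ieee(out, 20), 4);
}
```

Serialising field by field rather than casting a packed struct keeps the wire format independent of compiler padding, which matters if the host-side parser is written in a different language.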

---

## 4. Use cases that motivate this

| Task | Why temporal context matters | Window | Class count |
|------|------------------------------|--------|-------------|
| **Gesture recognition** (wave / point / clap / kick) | Single-frame CSI snapshots can't disambiguate gestures from random motion. ~100-frame windows capture full gesture trajectories. | 100 frames @ 50 Hz = 2 s | 4–8 |
| **Fall classification with sequence context** | The current heuristic ("> 15 rad/s² for 3 consecutive frames + 5 s cooldown") was raised to suppress false positives. A learned temporal head can distinguish a fall (rapid descent then stillness) from a sit-down (descent then sustained micro-motion) using the same input window. | 200 frames @ 50 Hz = 4 s | 3 (fall / sit / nothing) |
| **Breathing-quality scoring** | Today's pipeline emits a BPM and a confidence float. A temporal head trained on labeled apnea / shallow / paradoxical / normal sequences can output a 4-class quality label that downstream consumers can render in one glance. | 500 frames @ 50 Hz = 10 s | 4 |
| **"Is this normal for this room/time" anomaly detection** | Per-room SONA profiles (ADR-005) capture environment statistics, but anomaly *temporal shape* is currently checked host-side via embedding distance (ADR-024 §2.4 `temporal_baseline` index). A small on-device classifier can flag ahead of host roundtrip. | 300 frames | 2 (normal / anomalous) |

These four cover the visible product gaps in the v0.6.x line.
Gesture recognition is the headline; fall classification is the
highest-impact for the eldercare scenarios v0.5.4 was tuned for.

---

## 5. Alternatives considered

| Option | Why rejected |
|--------|--------------|
| **TFLite Micro** | Heavier runtime (~150 KB code + interpreter), pulls in C++ STL surface, no Rust-native API. Does not benefit from sparse attention specifically. We'd be re-paying the cost of a full inference framework when we only need one kernel. |
| **Run all classifiers server-side** | Costs a full Tier-1 CSI uplink (~50–70 KB/s/node per ADR-039) just to feed a remote classifier, then a roundtrip back. Defeats the point of ADR-081's compact feature stream and makes the system worthless when the backhaul is down. Also leaks raw CSI to the network for purposes the user did not opt into. |
| **Stay physics-only forever** | Cleanest from a maintenance standpoint, but it structurally rules out gesture recognition, and the fall-debounce hack will keep accreting per-deployment knobs. The product space already has commodity physics-only firmware (Bosch presence sensors, etc.); on-device transformer inference for CSI is what would *differentiate* RuView. |
| **Use `ruvector-attention` (already in workspace) on-device** | `ruvector-attention` is `std`-bound today; doesn't compile to `xtensa-esp32s3-none-elf` without a port comparable in scope to upstream ADR-192. Even if ported, it doesn't give us GQA + streaming KV cache, which is the structural capability the new crate adds. |
| **Wait for IEEE 802.11bf** | Different problem (standardised CSI exposure across vendors). Doesn't address whether the model runs on-device or off. |

---

## 6. Consequences

### Positive

- **Genuinely novel.** No competing CSI-sensing project ships
  transformer inference on the MCU itself. The closest peers
  (Espressif's ESP-DL, Edge Impulse) are non-attention CNN/RNN
  pipelines.
- **Latency.** Classification result is local — no backhaul,
  no host roundtrip, sub-100 ms gesture-to-action.
- **Privacy.** Raw CSI never leaves the node for these tasks.
- **Reuses the ADR-081 feature stream** — the temporal head is a
  consumer of the existing 60 B `rv_feature_state_t`, not a new
  uplink format.
- **Validated kernel.** Per upstream ADR-192, the no_std build was
  validated on real ESP32-S3 hardware (MAC `ac:a7:04:e2:66:24`).
  We are not betting on a paper crate.

### Negative / tradeoffs

- **Flash budget pressure on 4 MB boards.** Per `partitions_4mb.csv`,
  each OTA slot is `0x1D0000` bytes (1,856 KiB, about 1.81 MiB). The
  current build is ~853 KiB. Adding a 376 KB rlib plus weights brings
  us to ~1.3 MB, still under the slot ceiling but with little headroom
  for other growth. **Decision: temporal head is 8 MB-only initially**,
  gated behind `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`. 4 MB enablement is a
  separate ADR after we measure the actual incremental link size
  (the 376 KB upstream number is for the rlib in isolation; the
  linked-and-stripped final binary delta will be smaller).
- **Rust toolchain dependency.** The ESP-IDF build now needs
  `espup` + `cargo +esp` to be present on every developer machine
  and CI runner. This is a real hurdle on Windows — see
  `CLAUDE.local.md` for the existing Python-subprocess wrapper
  required to run ESP-IDF cleanly. CI will need a parallel
  Rust-toolchain step.
- **One more thing to test.** QEMU (ADR-061) does not run the
  ESP32-S3 Xtensa Rust binary today. The QEMU validator pipeline
  will need a build matrix entry for "Rust component compiled but
  classifier disabled" at minimum.
- **Stack overflow risk.** Same hazard the v0.6.4 work just
  navigated. Mitigated by §3.3 (own task, own stack); this needs
  to be a code-review checklist item.
- **Weights provenance.** Once we ship a model, we need a story
  for *which model*, signed by *whom*, retrained *how often*. See
  Open Questions §8.
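
The flash-budget figures in the first bullet above can be reproduced with a short calculation. The sketch below assumes the 376 KB rlib figure means 376 × 1024 bytes and takes the upper weight estimate of 15 KB from §3.4; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define OTA_SLOT_BYTES       0x1D0000u        /* from partitions_4mb.csv */
#define CURRENT_BUILD_BYTES  (853u * 1024u)   /* ~853 KiB */
#define RLIB_BYTES           (376u * 1024u)   /* upstream figure, worst case */
#define WEIGHTS_BYTES        (15u * 1024u)    /* upper estimate from section 3.4 */

/* Remaining bytes in an OTA slot, or 0 if the image no longer fits. */
static uint32_t ota_headroom_bytes(uint32_t slot_bytes, uint32_t image_bytes)
{
    assert(slot_bytes > 0);
    return (image_bytes > slot_bytes) ? 0u : (slot_bytes - image_bytes);
}
```

Under these assumptions the projected image is 1,273,856 B and `ota_headroom_bytes(OTA_SLOT_BYTES, CURRENT_BUILD_BYTES + RLIB_BYTES + WEIGHTS_BYTES)` returns 626,688 B (612 KiB). This is a worst case, since the linked and stripped delta is expected to be smaller than the rlib.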

### Neutral

- ADR-040's WASM Tier-3 path is **not** superseded. WASM remains
  the right choice for user-uploaded modules. The temporal head is
  a first-party signed-by-us component, with a different deploy
  story.
- The host-side ADR-024 AETHER pipeline is unchanged by this ADR.
  ADR-096 covers the host-side use of the same crate.

---

## 7. Roadmap

| Phase | Scope | Gating |
|-------|-------|--------|
| 0 | This ADR + ADR-096 land. No code. | Maintainer review of #513. |
| 1 | New crate `wifi-densepose-temporal` (host-side only): defines the temporal-head architecture, training script, weight serialization format. | Phase 0 accepted. |
| 2 | `ruv_temporal` ESP-IDF component scaffolding — empty kernel, just the C ABI and ring buffer. Compiles cleanly into 8 MB firmware. Adds ~5 KB to binary. | Phase 1 produces a serialised set of weights. |
| 3 | Wire `ruvllm_sparse_attention` `forward` (not yet `forward_gated`) into the component. First on-target classification benchmark on COM7. Gate: end-to-end inference ≤ 50 ms with `N = 256`, no stack overflow under 24 h soak. | Phase 2 ABI stable. |
| 4 | First trained classifier (gesture or fall, whichever has labelled data first). Hardware A/B: temporal-head decision vs current heuristic on a held-out set. Promotion criterion: temporal head matches or beats heuristic on F1 *and* false-positive rate. | Phase 3 latency gate met. |
| 5 | 4 MB profile gating — measure actual binary delta, decide whether to enable on SuperMini. | Phase 4 in production on 8 MB. |
| 6 | `forward_gated_with_fastgrnn` for long-window tasks (breathing-quality at N = 500). | Phase 4 stable. |
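
The Phase 4 promotion criterion can be stated precisely in terms of confusion-matrix counts. The sketch below is one way to encode it for a binary decision such as fall versus no fall; `confusion_t` and the function names are hypothetical, and a multi-class task would need a per-class or averaged variant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Binary confusion counts from a held-out set. */
typedef struct {
    uint32_t tp, fp, tn, fn;
} confusion_t;

/* F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives. */
static double f1_score(const confusion_t *c)
{
    assert(c != NULL);
    double denom = 2.0 * c->tp + c->fp + c->fn;
    return (denom > 0.0) ? (2.0 * c->tp) / denom : 0.0;
}

/* FPR = FP / (FP + TN); defined as 0 when there are no negatives. */
static double false_positive_rate(const confusion_t *c)
{
    assert(c != NULL);
    double denom = (double)c->fp + c->tn;
    return (denom > 0.0) ? c->fp / denom : 0.0;
}

/* Phase 4 rule: match or beat the heuristic on F1 and on FPR. */
static bool temporal_head_promotable(const confusion_t *head,
                                     const confusion_t *heuristic)
{
    return f1_score(head) >= f1_score(heuristic) &&
           false_positive_rate(head) <= false_positive_rate(heuristic);
}
```

Requiring both conditions prevents a model from being promoted because it trades a higher F1 for more false alarms, which is the failure mode the raised debounce threshold was introduced to suppress.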

---

## 8. Open questions

1. **Who trains the temporal heads?** Two options:
   (a) host-side training on captured `rv_feature_state_t` traces
   labelled in-app, then export to flat-buffer weights;
   (b) teacher-distillation from the larger AETHER model (ADR-024)
   running off-device, using soft labels. Option (b) is more
   data-efficient but couples this ADR's ship date to ADR-024's
   training-pipeline maturity. Open.
2. **How are weights flashed?** Three options, in increasing
   capability: NVS blob (small, safe, 4–8 KB ceiling per key),
   `EMBED_FILES` baked into the firmware image (no runtime update),
   OTA-updateable partition (mirrors ADR-040 RVF upload path,
   biggest engineering cost). Phase 2/3 will pick one; the current
   preference is `EMBED_FILES` for the first model, and an OTA
   partition once we have more than one.
3. **Does the 376 KB rlib figure scale?** Upstream measured
   376 KB for the kernel plus the embedding/projection weights of
   *their* test config. Adding 1–2 `RuvLlmSparseBlock` layers with
   embedding/projection weights sized to the actual CSI feature
   dimension may push the figure higher. Phase 2 will measure the
   on-target stripped-binary delta directly; if the delta exceeds
   600 KB we revisit the 4 MB story sooner.
4. **What window length is right for fall classification?**
   200 frames at 50 Hz = 4 s feels right based on the v0.6.4
   debounce numbers (3-frame consecutive + 5 s cooldown is
   essentially a 4-second decision window already). Empirical, not
   architectural — set in Phase 4.
5. **Quantisation.** First model ships FP16 (KV cache feature flag
   already supports this). INT8 for both weights and activations
   is a follow-up; the current crate has no INT8 path so it would
   be a separate kernel.
6. **What happens when the controller is in `RV_PROFILE_PASSIVE_LOW_RATE`?**
   The fast loop slows down, so the input frame rate to the
   temporal head drops. Either the head needs to handle variable
   sample rate (resample at push time) or it stops emitting until
   the controller goes back to active. Phase 1 design call.
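
For question 6, the "resample at push time" option could look like the sketch below: incoming frames with arbitrary timestamps are linearly interpolated onto a fixed output grid before they reach the window. This illustrates one option only, not a decision. `resampler_t`, its functions, and the `RESAMPLE_MAX_DIM` cap are hypothetical, and linear interpolation is an assumed choice that may not suit every feature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RESAMPLE_MAX_DIM 16u  /* hypothetical cap on floats per frame */

typedef struct {
    uint32_t dim;          /* floats per frame */
    uint64_t period_us;    /* output grid spacing, e.g. 20000 for 50 Hz */
    uint64_t next_out_us;  /* next grid timestamp to emit */
    uint64_t prev_ts_us;   /* timestamp of the previous input frame */
    float    prev[RESAMPLE_MAX_DIM];
    bool     primed;       /* false until the first frame arrives */
} resampler_t;

static void resampler_init(resampler_t *r, uint32_t dim, uint64_t period_us)
{
    assert(r != NULL && dim > 0 && dim <= RESAMPLE_MAX_DIM && period_us > 0);
    memset(r, 0, sizeof(*r));
    r->dim = dim;
    r->period_us = period_us;
}

/* Feeds one input frame; writes up to max_out interpolated frames to out
 * (max_out * dim floats) and returns how many were written. Grid points
 * beyond max_out are dropped. */
static uint32_t resampler_push(resampler_t *r, const float *frame,
                               uint64_t ts_us, float *out, uint32_t max_out)
{
    uint32_t emitted = 0;

    if (!r->primed) {
        /* First frame: pass it through and start the output grid here. */
        if (max_out > 0) {
            memcpy(out, frame, r->dim * sizeof(float));
            emitted = 1;
        }
        r->next_out_us = ts_us + r->period_us;
        r->primed = true;
    } else {
        if (ts_us <= r->prev_ts_us) {
            return 0;  /* drop duplicate or out-of-order frames */
        }
        float span = (float)(ts_us - r->prev_ts_us);
        while (r->next_out_us <= ts_us) {
            if (emitted < max_out) {
                float a = (float)(r->next_out_us - r->prev_ts_us) / span;
                float *dst = &out[(size_t)emitted * r->dim];
                for (uint32_t i = 0; i < r->dim; i++) {
                    dst[i] = r->prev[i] + a * (frame[i] - r->prev[i]);
                }
                emitted++;
            }
            r->next_out_us += r->period_us;
        }
    }
    memcpy(r->prev, frame, r->dim * sizeof(float));
    r->prev_ts_us = ts_us;
    return emitted;
}
```

When the input rate drops, each push emits several interpolated frames, so the window keeps a constant time span. The cost is that interpolated frames carry no new information, which is an argument for the alternative of pausing classification in the low-rate profile.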

---

## 9. Acceptance criteria

This ADR is **Accepted** once:

1. Maintainer review on #513 confirms the architecture.
2. The follow-up implementation issue is filed and references this
   ADR plus ADR-096 by number.
3. ADR index in `docs/adr/README.md` (if present) has an ADR-095
   row.

This ADR is **Implemented** once:

1. Phase 3 is in `main` with the gating Kconfig off by default.
2. A Phase-4 hardware A/B has been published (witness-bundle
   compatible per ADR-028).
3. The QEMU validator (ADR-061) has at minimum a "compiles, doesn't
   run" check for the Rust component.

---

## 10. Related

ADR-018 (binary CSI frame), ADR-024 (AETHER contrastive embedding —
host-side counterpart, see ADR-096), ADR-039 (edge intelligence
tiers), ADR-040 (WASM Tier-3 modules — the *other* extensibility
path), ADR-061 (QEMU CI), ADR-081 (adaptive controller, mesh plane,
`rv_feature_state_t`), ADR-091 (stand-off radar tier — adjacent
edge-intelligence ADR), upstream ADR-189 (KV cache incremental
decode), upstream ADR-190 (GQA/MQA), upstream ADR-192 (no_std +
alloc on ESP32-S3 — the structural unblock that makes this ADR
possible).