# ADR-095: On-ESP32-S3 Temporal Modeling at the Edge via `ruvllm_sparse_attention` (no_std)

| Field        | Value |
|--------------|-------|
| **Status**   | Proposed (2026-05-07) |
| **Date**     | 2026-05-07 |
| **Authors**  | ruvnet, claude-flow |
| **Related**  | ADR-018, ADR-024, ADR-039, ADR-040, ADR-061, ADR-081, ADR-091; upstream ADR-189, ADR-190, ADR-192 |
| **Branch**   | `feat/ruvllm-sparse-attention-edge` |
| **Tracking** | #513 |

---

## 1. Context

Today the ESP32-S3 firmware in `firmware/esp32-csi-node/main/` does
**physics-only** sensing on-device. The pipeline in `edge_processing.c`
runs on Core 1 and produces:

- Adaptive presence detection (`presence_score`).
- Breathing-band (0.1–0.5 Hz) and heart-rate-band (0.8–2.0 Hz) biquad
  IIR bandpass + zero-crossing BPM estimators.
- A motion / fall flag (`flags` bits 0–2 in `edge_vitals_pkt_t` magic
  `0xC5110002`, plus fused mmWave variant `0xC5110004` per ADR-063).
- ADR-081 `rv_feature_state_t` (60 B at magic `0xC5110006`) emitted at
  1–10 Hz from the adaptive controller's fast loop.

There is **no learned model of any kind on the MCU**. The closest things
are: ADR-039 Tier-1 compressed-CSI emission, ADR-040 WASM modules
(Tier-3, but used by the user for ad-hoc DSP, not transformer
inference), and the Rust-side AETHER embeddings (ADR-024), which run
on the host, not the node. Anomaly detection that needs *temporal
context* ("is this motion pattern consistent with a fall, or just a
sit-down?") is structurally absent. The fall debounce in v0.6.x
(3-frame consecutive + 5 s cooldown, raised threshold 2.0 → 15.0 rad/s²)
is a hand-tuned heuristic exactly because the firmware has nothing
better to reason with.

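
The debounce described above can be sketched as a small state machine. This is an illustrative reconstruction from the numbers quoted in this section (3 consecutive frames, 5 s cooldown, 15.0 rad/s² threshold), not the code in `edge_processing.c`; the type, macro, and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the values are the ones quoted in the text. */
#define FALL_THRESH_RAD_S2  15.0f       /* raised from 2.0 in v0.6.x */
#define FALL_CONSEC_FRAMES  3u          /* frames in a row above threshold */
#define FALL_COOLDOWN_US    5000000ull  /* 5 s between reported falls */

typedef struct {
    uint32_t consec;        /* consecutive frames above threshold (saturating) */
    uint64_t last_fire_us;  /* timestamp of the last reported fall */
    bool     has_fired;     /* false until the first fall is reported */
} fall_debounce_t;

/* Feed one frame; returns true only when a fall should be reported. */
bool fall_debounce_step(fall_debounce_t *s, float accel_rad_s2, uint64_t now_us)
{
    assert(s != NULL);
    if (accel_rad_s2 > FALL_THRESH_RAD_S2) {
        if (s->consec < FALL_CONSEC_FRAMES) s->consec++;
    } else {
        s->consec = 0;  /* any quiet frame breaks the run */
    }
    if (s->consec < FALL_CONSEC_FRAMES) return false;
    if (s->has_fired && now_us - s->last_fire_us < FALL_COOLDOWN_US) return false;

    s->has_fired = true;
    s->last_fire_us = now_us;
    s->consec = 0;
    return true;
}
```

Note that the only state is a counter and a timestamp. Nothing in it can tell "descent then stillness" from "descent then micro-motion", which is the gap the temporal head is meant to close.
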

A second pressure point: the Tmr Svc / FreeRTOS stack is already
sensitive. `edge_processing.c` lines 47–48 explicitly note that
`process_frame + update_multi_person_vitals` combined used ~6.5–7.5 KB
of the 8 KB task stack and that **scratch buffers were moved to static
storage to avoid stack overflow.** Any new heavyweight workload (and
a transformer forward pass is heavyweight) must therefore live in
**its own FreeRTOS task with its own task stack**, not piggyback on
the existing edge DSP task.

The vendored crate `ruvllm_sparse_attention` v0.1.1 (released 2026-05-07,
synced the same day at `vendor/ruvector/crates/ruvllm_sparse_attention/`)
removes the previously-blocking `std` requirement. Per upstream
**ADR-192**, the crate now compiles cleanly to
`xtensa-esp32s3-none-elf` via `espup`, with a measured **376 KB
release rlib** and zero runtime dependencies beyond `libm`, and was
validated on a real ESP32-S3 (rev v0.2, 16 MB flash). It exposes
`SubquadraticSparseAttention`, `KvCache` / `KvCacheF16`, `FastGrnnGate`,
`IncrementalLandmarks`, `RuvLlmSparseBlock`, and a `Tensor3` value
type. The kernel is O(N log N) by default and near-linear O(N) when
the FastGRNN salience gate is enabled.

This is the first time we have had a credible path to **on-device
transformer inference for CSI** without a Python runtime, without
TFLite, and without a coprocessor. It is also the right moment to
decide *whether* we want it before code starts to land.

---

## 2. Decision

Add a learned **temporal head** to the ESP32-S3 firmware running on
the node itself, using `ruvllm_sparse_attention` compiled with
`--no-default-features` (no_std + alloc, optionally `+fp16`), driven
by a small Rust component integrated into the ESP-IDF build. The
temporal head runs **alongside** the existing physics-only pipeline,
not as a replacement: physics gives us breathing/heart-rate/presence,
the temporal head gives us classification and sequence-aware reasoning.

Concretely:

1. The temporal head consumes a rolling window of feature vectors
   (initially the same `rv_feature_state_t` floats already produced
   by ADR-081, plus optionally a small projection of recent CSI
   amplitude statistics), length `N` ∈ [100, 500] frames, sampled at
   the controller's fast-loop rate.
2. It outputs a small set of **class logits** for the active
   detection task. The first deployable tasks are listed in §4.
3. It runs in its own FreeRTOS task on Core 1 (or pinned to whichever
   core the WiFi driver is *not* on), at a cadence slower than the
   fast loop: initially 1 Hz, classification-on-demand.
4. The kernel is invoked through a thin C ABI (`ruv_temporal_init`,
   `ruv_temporal_push`, `ruv_temporal_classify`,
   `ruv_temporal_destroy`; see §3.2) exported from a Rust static
   library linked into the ESP-IDF build the same way the existing
   Tier-3 components are linked.
5. Weights are stored as a flat `f32` (or `f16` with the `fp16`
   feature) blob in the ESP32-S3 flash, loadable from either an
   embedded `EMBED_FILES` resource (compile-time bake-in) or NVS
   (post-flash provisioning, mirroring ADR-040's WASM-upload path).
6. The temporal head is gated behind a Kconfig option
   `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`, **default off**, and is only
   compiled into the 8 MB build profile until the flash math in §6
   demonstrates 4 MB headroom.

This ADR authorizes the architecture; it does **not** ship any of
the firmware-side or training-side changes. Implementation lands in
follow-up issues per the roadmap in §7.

---

## 3. Approach

### 3.1 Build integration

ESP-IDF v5.4 already supports Rust components via the
`rust-esp32`-style template (a CMake `idf_component_register` shim
that runs `cargo build --target xtensa-esp32s3-none-elf` and links
the resulting static library). The new component lives at
`firmware/esp32-csi-node/components/ruv_temporal/`:

```
ruv_temporal/
  CMakeLists.txt          # component manifest, Rust build invocation
  Cargo.toml              # crate config: no_std, deps on ruvllm_sparse_attention
  build.rs                # generates the C header from #[no_mangle] exports
  src/lib.rs              # public C ABI: init/push/classify/destroy
  src/window.rs           # rolling frame buffer
  src/weights.rs          # NVS / EMBED_FILES weight loader
  include/ruv_temporal.h  # generated; consumed by edge_processing.c
```

Cargo features compiled in: `["fp16"]`. **Not** `parallel` (rayon
needs threads, breaks no_std). **Not** `std`.

### 3.2 Interface

The C ABI is intentionally narrow. It does not expose `Tensor3`,
attention configs, or any Rust types; it exposes only `float*` buffers
and opaque handles:

```c
#include <stddef.h>
#include <stdint.h>
#include "esp_err.h"

/* Opaque handle; the layout is private to the Rust side. */
typedef struct ruv_temporal_ctx ruv_temporal_ctx_t;

esp_err_t ruv_temporal_init(const uint8_t *weights, size_t wlen,
                            uint32_t input_dim, uint32_t window,
                            ruv_temporal_ctx_t **out_ctx);
esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
                                float *logits, uint32_t n_classes);
void      ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);
```

`push` is the hot path and must be cheap (it just writes into a
ring buffer in PSRAM if available, internal SRAM otherwise). `classify`
runs the actual sparse attention forward pass and is the budget-heavy
call.

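
To make the "cheap `push`" requirement concrete, here is a minimal sketch of a fixed-capacity ring buffer of feature frames. It is an assumption about the data structure, not the planned implementation: the real window is slated to live in Rust (`src/window.rs`), the `rt_window_*` names are hypothetical, and the capacity constants are kept tiny for illustration (the firmware window would be 100–500 frames).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical capacities, deliberately small for a host-side example. */
#define RT_MAX_DIM     16u
#define RT_MAX_WINDOW  8u

typedef struct {
    float    data[RT_MAX_WINDOW][RT_MAX_DIM];
    uint32_t dim;     /* floats per frame */
    uint32_t window;  /* frames retained */
    uint32_t head;    /* next slot to overwrite */
    uint32_t count;   /* valid frames, saturates at window */
} rt_window_t;

/* Returns 0 on success, -1 if the requested shape exceeds capacity. */
int rt_window_init(rt_window_t *w, uint32_t dim, uint32_t window)
{
    assert(w != NULL);
    if (dim == 0 || dim > RT_MAX_DIM || window == 0 || window > RT_MAX_WINDOW)
        return -1;
    memset(w, 0, sizeof *w);
    w->dim = dim;
    w->window = window;
    return 0;
}

/* Hot path: one O(dim) copy, no allocation, oldest frame is overwritten. */
void rt_window_push(rt_window_t *w, const float *frame)
{
    memcpy(w->data[w->head], frame, w->dim * sizeof(float));
    w->head = (w->head + 1u) % w->window;
    if (w->count < w->window) w->count++;
}

/* Copies frames oldest-first into out (count * dim floats); returns count. */
uint32_t rt_window_snapshot(const rt_window_t *w, float *out)
{
    uint32_t start = (w->count < w->window) ? 0u : w->head;
    for (uint32_t i = 0; i < w->count; i++) {
        uint32_t idx = (start + i) % w->window;
        memcpy(out + (size_t)i * w->dim, w->data[idx], w->dim * sizeof(float));
    }
    return w->count;
}
```

The split mirrors the ABI: `rt_window_push` does the work `push` is allowed to do, while the chronological copy in `rt_window_snapshot` is the kind of cost that belongs in `classify`.
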

### 3.3 Task topology

A new task `ruv_temporal_task` with its own 16 KB stack, pinned to
the same core as the edge DSP task (Core 1), fed via a FreeRTOS
queue from the adaptive controller's fast loop. We do **not** call
the kernel from the existing edge task: the edge stack is already
near-full per the comment at `edge_processing.c:47-48` and recent
fall-debounce / Tmr-Svc-stack work.

### 3.4 Memory budget (per inference)

With `N = 256` (window), `d_model = 32`, `n_heads = 4`, `head_dim = 8`,
1–2 `RuvLlmSparseBlock` layers, `block_size = 64`, `window = 64`:

- Weights: ~5–15 KB (single block, INT8 quant deferred to a later
  ADR; FP16 default).
- KV cache (FP16, full window): `2 * 256 * 4 * 8 * 2 B = 32 KiB` per
  block.
- Activations (peak, with `forward_flash` tiling): ≈ 2 KB.
- Working set: < 64 KB for a single block (a second block adds
  another 32 KiB of KV cache). Comfortable in PSRAM, possible in
  ISR-safe internal SRAM.

These are first-pass estimates; the precise numbers come out of the
`forward_flash` benchmark on real hardware, which is an exit criterion
of Phase 3 in §7.

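
The KV-cache line above is plain multiplication, and a short helper makes the sensitivity to each parameter visible. This is a sketch under the assumption that the cache stores one K and one V vector per position, per KV head, per block; the function name is hypothetical, and passing fewer KV heads than query heads models a GQA-style layout (upstream ADR-190), which this ADR does not commit to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold K and V for a full window across n_layers blocks. */
size_t kv_cache_bytes(uint32_t n_layers, uint32_t window,
                      uint32_t n_kv_heads, uint32_t head_dim,
                      uint32_t bytes_per_elem)
{
    assert(bytes_per_elem == 2u || bytes_per_elem == 4u); /* f16 or f32 */
    return (size_t)2 /* K and V */ * n_layers * window
         * n_kv_heads * head_dim * bytes_per_elem;
}
```

With the configuration in this section, `kv_cache_bytes(1, 256, 4, 8, 2)` is 32768 bytes (32 KiB), two blocks double that to 64 KiB, and sharing K/V across pairs of heads (2 KV heads) would halve a single block to 16 KiB.
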

### 3.5 Compatibility with ADR-081 / ADR-039 / ADR-018

The temporal head is a **consumer** of the same feature stream
already flowing in the firmware. It does not alter:

- ADR-018 raw CSI frame layout (`0xC5110001`).
- ADR-039 Tier-1 compressed CSI (`0xC5110005`) or vitals
  (`0xC5110002`).
- ADR-063 fused vitals (`0xC5110004`).
- ADR-081 `rv_feature_state_t` (`0xC5110006`), which is the primary
  input we tap.

If the temporal head fires a classification, the result rides on a
new packet, provisionally `0xC5110007` (small: class id, confidence,
monotonic seq, ts_us, CRC32). Final allocation of that magic is
deferred to the implementation PR; this ADR reserves the *concept*,
not the byte layout.

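
Purely as an illustration of how small such a packet can be, the sketch below packs the five listed fields behind a magic word and seals it with a CRC. Everything about the layout is an assumption: the field order, widths, padding, the `rt_class_pkt_*` names, and the choice of the common reflected CRC-32 (polynomial `0xEDB88320`) are hypothetical and do not pre-empt the implementation PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RT_CLASS_PKT_MAGIC 0xC5110007u  /* provisional, per the text above */

#pragma pack(push, 1)
typedef struct {
    uint32_t magic;        /* packet type tag */
    uint8_t  class_id;     /* index of the winning class */
    uint8_t  reserved[3];  /* hypothetical padding, kept zero */
    float    confidence;   /* score for class_id */
    uint32_t seq;          /* monotonic sequence number */
    uint64_t ts_us;        /* timestamp in microseconds */
    uint32_t crc32;        /* CRC over all preceding bytes */
} rt_class_pkt_t;
#pragma pack(pop)

_Static_assert(sizeof(rt_class_pkt_t) == 28, "hypothetical layout is 28 bytes");

/* Bitwise reflected CRC-32 (init 0xFFFFFFFF, final XOR 0xFFFFFFFF). */
uint32_t rt_crc32(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Stamps the magic and computes the CRC over everything before crc32. */
void rt_class_pkt_seal(rt_class_pkt_t *pkt)
{
    assert(pkt != NULL);
    pkt->magic = RT_CLASS_PKT_MAGIC;
    pkt->crc32 = rt_crc32((const uint8_t *)pkt, offsetof(rt_class_pkt_t, crc32));
}

/* True if the magic matches and the stored CRC matches the contents. */
bool rt_class_pkt_valid(const rt_class_pkt_t *pkt)
{
    assert(pkt != NULL);
    return pkt->magic == RT_CLASS_PKT_MAGIC
        && pkt->crc32 == rt_crc32((const uint8_t *)pkt,
                                  offsetof(rt_class_pkt_t, crc32));
}
```

At 28 bytes this hypothetical packet is under half the size of the 60 B `rv_feature_state_t`, so it would add very little to the existing uplink.
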

---

## 4. Use cases that motivate this

| Task | Why temporal context matters | Window | Class count |
|------|------------------------------|--------|-------------|
| **Gesture recognition** (wave / point / clap / kick) | Single-frame CSI snapshots can't disambiguate gestures from random motion. ~100-frame windows capture full gesture trajectories. | 100 frames @ 50 Hz = 2 s | 4–8 |
| **Fall classification with sequence context** | The current heuristic ("> 15 rad/s² for 3 consecutive frames + 5 s cooldown") was raised to suppress false positives. A learned temporal head can distinguish a fall (rapid descent then stillness) from a sit-down (descent then sustained micro-motion) using the same input window. | 200 frames @ 50 Hz = 4 s | 3 (fall / sit / nothing) |
| **Breathing-quality scoring** | Today's pipeline emits a BPM and a confidence float. A temporal head trained on labeled apnea / shallow / paradoxical / normal sequences can output a 4-class quality label that downstream consumers can render in one glance. | 500 frames @ 50 Hz = 10 s | 4 |
| **"Is this normal for this room/time" anomaly detection** | Per-room SONA profiles (ADR-005) capture environment statistics, but anomaly *temporal shape* is currently checked host-side via embedding distance (ADR-024 §2.4 `temporal_baseline` index). A small on-device classifier can flag ahead of host roundtrip. | 300 frames | 2 (normal / anomalous) |

These four cover the visible product gaps in the v0.6.x line.
Gesture recognition is the headline; fall classification is the
highest-impact for the eldercare scenarios v0.5.4 was tuned for.

---

## 5. Alternatives considered

| Option | Why rejected |
|--------|--------------|
| **TFLite Micro** | Heavier runtime (~150 KB code + interpreter), pulls in C++ STL surface, no Rust-native API. Does not benefit from sparse attention specifically. We'd be re-paying the cost of a full inference framework when we only need one kernel. |
| **Run all classifiers server-side** | Costs a full Tier-1 CSI uplink (~50–70 KB/s/node per ADR-039) just to feed a remote classifier, then a roundtrip back. Defeats the point of ADR-081's compact feature stream and makes the system worthless when the backhaul is down. Also leaks raw CSI to the network for purposes the user did not opt into. |
| **Stay physics-only forever** | Cleanest from a maintenance standpoint, but it structurally rules out gesture recognition, and the fall-debounce hack will keep accreting per-deployment knobs. The product space already has commodity physics-only firmware (Bosch presence sensors, etc.); on-device transformer inference for CSI is what would *differentiate* RuView. |
| **Use `ruvector-attention` (already in workspace) on-device** | `ruvector-attention` is `std`-bound today; it doesn't compile to `xtensa-esp32s3-none-elf` without a port comparable in scope to upstream ADR-192. Even if ported, it doesn't give us GQA + streaming KV cache, which is the structural capability the new crate adds. |
| **Wait for IEEE 802.11bf** | Different problem (standardised CSI exposure across vendors). Doesn't address whether the model runs on-device or off. |

---

## 6. Consequences

### Positive

- **Genuinely novel.** To our knowledge, no competing CSI-sensing
  project ships transformer inference on the MCU itself. The closest
  peers (Espressif's ESP-DL, Edge Impulse) are non-attention CNN/RNN
  pipelines.
- **Latency.** Classification result is local: no backhaul,
  no host roundtrip, with a target of sub-100 ms gesture-to-action.
- **Privacy.** Raw CSI never leaves the node for these tasks.
- **Reuses the ADR-081 feature stream.** The temporal head is a
  consumer of the existing 60 B `rv_feature_state_t`, not a new
  uplink format.
- **Validated kernel.** Per upstream ADR-192, the no_std build was
  validated on real ESP32-S3 hardware (MAC `ac:a7:04:e2:66:24`).
  We are not betting on a paper crate.

### Negative / tradeoffs

- **Flash budget pressure on 4 MB boards.** Per `partitions_4mb.csv`,
  each OTA slot is `0x1D0000` bytes (1,856 KiB, about 1.81 MiB). The
  current build is ~853 KiB. Adding a 376 KB rlib plus weights brings
  us to roughly 1.2 MiB, still under the slot ceiling but with
  limited headroom for other growth. **Decision: temporal head is
  8 MB-only initially**, gated behind
  `CONFIG_CSI_TEMPORAL_HEAD_ENABLED`. 4 MB enablement is a separate
  ADR after we measure the actual incremental link size (the 376 KB
  upstream number is for the rlib in isolation; the
  linked-and-stripped final binary delta should be smaller).
- **Rust toolchain dependency.** The ESP-IDF build now needs
  `espup` + `cargo +esp` to be present on every developer machine
  and CI runner. This is a real hurdle on Windows; see
  `CLAUDE.local.md` for the existing Python-subprocess wrapper
  required to run ESP-IDF cleanly. CI will need a parallel
  Rust-toolchain step.
- **One more thing to test.** QEMU (ADR-061) does not run the
  ESP32-S3 Xtensa Rust binary today. The QEMU validator pipeline
  will need a build matrix entry for "Rust component compiled but
  classifier disabled" at minimum.
- **Stack overflow risk.** Same hazard the v0.6.4 work just
  navigated. Mitigated by §3.3 (own task, own stack); this needs
  to be a code-review checklist item.
- **Weights provenance.** Once we ship a model, we need a story
  for *which model*, signed by *whom*, retrained *how often*. See
  Open Questions §8.

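
The flash-budget bullet above reduces to one subtraction, sketched here so the headroom can be recomputed as the measured numbers change. The inputs are the figures quoted in this ADR, with two assumptions: the upstream "376 KB" is treated as KiB, and weights are taken at the 15 KB upper end of the §3.4 estimate. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KIB                  1024u
#define OTA_SLOT_BYTES       0x1D0000u      /* partitions_4mb.csv, as quoted */
#define CURRENT_IMAGE_BYTES  (853u * KIB)   /* current build, approximate */
#define RLIB_BYTES           (376u * KIB)   /* upstream rlib, assumed KiB */
#define WEIGHTS_BYTES        (15u * KIB)    /* upper end of the estimate */

/* Remaining bytes in the OTA slot; negative means the image no longer fits. */
int64_t flash_headroom_bytes(uint32_t slot, uint32_t current, uint32_t added)
{
    assert(slot > 0u);
    return (int64_t)slot - (int64_t)current - (int64_t)added;
}
```

Under those assumptions, `flash_headroom_bytes(OTA_SLOT_BYTES, CURRENT_IMAGE_BYTES, RLIB_BYTES + WEIGHTS_BYTES)` comes to 626688 bytes (612 KiB). This is a worst case, since it counts the whole rlib rather than the linked-and-stripped delta.
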

### Neutral

- ADR-040's WASM Tier-3 path is **not** superseded. WASM remains
  the right choice for user-uploaded modules. The temporal head is
  a first-party signed-by-us component, with a different deploy
  story.
- The host-side ADR-024 AETHER pipeline is unchanged by this ADR.
  ADR-096 covers the host-side use of the same crate.

---

## 7. Roadmap

| Phase | Scope | Gating |
|-------|-------|--------|
| 0 | This ADR + ADR-096 land. No code. | Maintainer review of #513. |
| 1 | New crate `wifi-densepose-temporal` (host-side only): defines the temporal-head architecture, training script, weight serialization format. | Phase 0 accepted. |
| 2 | `ruv_temporal` ESP-IDF component scaffolding: empty kernel, just the C ABI and ring buffer. Compiles cleanly into 8 MB firmware. Adds ~5 KB to binary. | Phase 1 produces a serialised set of weights. |
| 3 | Wire `ruvllm_sparse_attention` `forward` (not yet `forward_gated`) into the component. First on-target classification benchmark on COM7. Gate: end-to-end inference ≤ 50 ms with `N = 256`, no stack overflow under 24 h soak. | Phase 2 ABI stable. |
| 4 | First trained classifier (gesture or fall, whichever has labelled data first). Hardware A/B: temporal-head decision vs current heuristic on a held-out set. Promotion criterion: temporal head matches or beats heuristic on F1 *and* false-positive rate. | Phase 3 latency gate met. |
| 5 | 4 MB profile gating: measure actual binary delta, decide whether to enable on SuperMini. | Phase 4 in production on 8 MB. |
| 6 | `forward_gated_with_fastgrnn` for long-window tasks (breathing-quality at N = 500). | Phase 4 stable. |

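
The Phase 4 promotion criterion can be written down as a check over two confusion matrices. This is a sketch for the binary case (for example fall vs not-fall), using the standard definitions F1 = 2TP / (2TP + FP + FN) and FPR = FP / (FP + TN); the type and function names are hypothetical, and how a multi-class task is reduced to these counts is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Counts from the held-out A/B set for one detector. */
typedef struct { uint32_t tp, fp, tn, fn; } confusion_t;

/* F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives. */
double f1_score(confusion_t c)
{
    uint32_t denom = 2u * c.tp + c.fp + c.fn;
    return denom ? (2.0 * c.tp) / denom : 0.0;
}

/* FPR = FP / (FP + TN); defined as 0 when there are no negatives. */
double false_positive_rate(confusion_t c)
{
    uint32_t denom = c.fp + c.tn;
    return denom ? (double)c.fp / denom : 0.0;
}

/* Promote only if the head is at least as good on BOTH metrics. */
bool temporal_head_promotable(confusion_t head, confusion_t heuristic)
{
    assert(head.tp + head.fp + head.tn + head.fn > 0u);
    assert(heuristic.tp + heuristic.fp + heuristic.tn + heuristic.fn > 0u);
    return f1_score(head) >= f1_score(heuristic)
        && false_positive_rate(head) <= false_positive_rate(heuristic);
}
```

Requiring both conditions means a head that gains F1 by firing more often, at the cost of extra false alarms, is not promoted.
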

---

## 8. Open questions

1. **Who trains the temporal heads?** Two options:
   (a) host-side training on captured `rv_feature_state_t` traces
   labelled in-app, then export to flat-buffer weights;
   (b) teacher-distillation from the larger AETHER model (ADR-024)
   running off-device, using soft labels. Option (b) is more
   data-efficient but couples this ADR's ship date to ADR-024's
   training-pipeline maturity. Open.
2. **How are weights flashed?** Three options, in increasing
   capability: NVS blob (small, safe, 4–8 KB ceiling per key),
   `EMBED_FILES` baked into the firmware image (no runtime update),
   OTA-updateable partition (mirrors ADR-040 RVF upload path,
   biggest engineering cost). Phase 2/3 will pick one; our prior is
   `EMBED_FILES` for the first model, OTA partition once we have
   more than one.
3. **Does the 376 KB rlib figure scale?** Upstream measured
   376 KB for the kernel + the embedding/projection
   weights for *their* test config. Adding 1–2
   `RuvLlmSparseBlock` layers with embedding/projection weights
   sized to the actual CSI feature dimension may push this higher.
   Phase 2 will measure the on-target stripped-binary delta directly;
   if the delta exceeds 600 KB we revisit the 4 MB story sooner.
4. **What window length is right for fall classification?**
   200 frames at 50 Hz = 4 s feels right based on the v0.6.4
   debounce numbers (3-frame consecutive + 5 s cooldown is
   essentially a 4-second decision window already). This is an
   empirical question, not an architectural one; it is set in Phase 4.
5. **Quantisation.** First model ships FP16 (KV cache feature flag
   already supports this). INT8 for both weights and activations
   is a follow-up; the current crate has no INT8 path so it would
   be a separate kernel.
6. **What happens when the controller is in `RV_PROFILE_PASSIVE_LOW_RATE`?**
   The fast loop slows down, so the input frame rate to the
   temporal head drops. Either the head needs to handle variable
   sample rate (resample at push time) or it stops emitting until
   the controller goes back to active. Phase 1 design call.

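
For question 6, the "resample at push time" option could be as simple as linear interpolation onto a fixed tick grid. The sketch below shows one possible approach and is not a decision: the `rt_*` names are hypothetical, timestamps are assumed to be in microseconds, and linear interpolation is chosen only because it is the cheapest resampler, not because it has been validated on CSI features.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interpolates one frame at time t_us between frames f0 (t0_us) and f1 (t1_us). */
void rt_lerp_frame(const float *f0, uint64_t t0_us,
                   const float *f1, uint64_t t1_us,
                   uint64_t t_us, float *out, uint32_t dim)
{
    assert(t0_us <= t_us && t_us <= t1_us);
    /* Degenerate interval: fall back to the newer frame. */
    float a = (t1_us == t0_us)
            ? 1.0f
            : (float)(t_us - t0_us) / (float)(t1_us - t0_us);
    for (uint32_t i = 0; i < dim; i++)
        out[i] = f0[i] + a * (f1[i] - f0[i]);
}

/* Number of fixed-rate ticks (multiples of period_us) in the interval (t0, t1]. */
uint32_t rt_ticks_between(uint64_t t0_us, uint64_t t1_us, uint32_t period_us)
{
    assert(period_us > 0u && t1_us >= t0_us);
    return (uint32_t)(t1_us / period_us - t0_us / period_us);
}
```

For example, if the controller drops to 1 Hz while the window expects 50 Hz frames (a 20000 µs period), `rt_ticks_between(0, 1000000, 20000)` reports 50 ticks to fill between two consecutive inputs. Whether 49 interpolated frames carry any real information is exactly the design question for Phase 1.
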

---

## 9. Acceptance criteria

This ADR is **Accepted** once:

1. Maintainer review on #513 confirms the architecture.
2. The follow-up implementation issue is filed and references this
   ADR plus ADR-096 by number.
3. ADR index in `docs/adr/README.md` (if present) has an ADR-095
   row.

This ADR is **Implemented** once:

1. Phase 3 is in `main` with the gating Kconfig off by default.
2. A Phase-4 hardware A/B has been published (witness-bundle
   compatible per ADR-028).
3. The QEMU validator (ADR-061) has at minimum a "compiles, doesn't
   run" check for the Rust component.

---

## 10. Related

ADR-018 (binary CSI frame), ADR-024 (AETHER contrastive embedding;
host-side counterpart, see ADR-096), ADR-039 (edge intelligence
tiers), ADR-040 (WASM Tier-3 modules; the *other* extensibility
path), ADR-061 (QEMU CI), ADR-081 (adaptive controller, mesh plane,
`rv_feature_state_t`), ADR-091 (stand-off radar tier; adjacent
edge-intelligence ADR), upstream ADR-189 (KV cache incremental
decode), upstream ADR-190 (GQA/MQA), upstream ADR-192 (no_std +
alloc on ESP32-S3; the structural unblock that makes this ADR
possible).