ADR-095: On-ESP32-S3 Temporal Modeling at the Edge via ruvllm_sparse_attention (no_std)
| Field | Value |
|---|---|
| Status | Proposed (2026-05-07) |
| Date | 2026-05-07 |
| Authors | ruvnet, claude-flow |
| Related | ADR-018, ADR-024, ADR-039, ADR-040, ADR-061, ADR-081, ADR-091; upstream ADR-189, ADR-190, ADR-192 |
| Branch | feat/ruvllm-sparse-attention-edge |
| Tracking | #513 |
1. Context
Today the ESP32-S3 firmware in firmware/esp32-csi-node/main/ does
physics-only sensing on-device. The pipeline in edge_processing.c
runs on Core 1 and produces:
- Adaptive presence detection (presence_score).
- Breathing-band (0.1–0.5 Hz) and heart-rate-band (0.8–2.0 Hz) biquad IIR bandpass + zero-crossing BPM estimators.
- A motion / fall flag (flags bits 0–2 in edge_vitals_pkt_t magic 0xC5110002, plus fused mmWave variant 0xC5110004 per ADR-063).
- ADR-081 rv_feature_state_t (60 B at magic 0xC5110006) emitted at 1–10 Hz from the adaptive controller's fast loop.
There is no learned model of any kind on the MCU. The closest things are: ADR-039 Tier-1 compressed-CSI emission, ADR-040 WASM modules (Tier-3, but used by the user for ad-hoc DSP, not transformer inference), and the Rust-side AETHER embeddings (ADR-024) which run on the host, not the node. Anomaly detection that needs temporal context — "is this fall pattern consistent with a fall, or just a sit-down?" — is structurally absent. The fall debounce in v0.6.x (3-frame consecutive + 5 s cooldown, raised threshold 2.0 → 15.0 rad/s²) is a hand-tuned heuristic exactly because the firmware has nothing better to reason with.
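For reference, the v0.6.x debounce described above can be sketched in a few lines of C. This is a minimal sketch under stated assumptions, not the code in edge_processing.c: the struct, function, and macro names are hypothetical, and only the three constants (15.0 rad/s², 3 consecutive frames, 5 s cooldown) come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants quoted in the text; all identifiers here are hypothetical. */
#define FALL_THRESH_RAD_S2  15.0f
#define FALL_CONSEC_FRAMES  3
#define FALL_COOLDOWN_US    5000000LL

typedef struct {
    int     consec;        /* consecutive frames above threshold (saturates) */
    int64_t last_fire_us;  /* timestamp of the last reported fall */
    bool    has_fired;     /* false until the first report */
} fall_debounce_t;

/* Returns true when a fall should be reported for this frame. */
bool fall_debounce_step(fall_debounce_t *s, float accel_rad_s2, int64_t now_us)
{
    if (accel_rad_s2 > FALL_THRESH_RAD_S2) {
        if (s->consec < FALL_CONSEC_FRAMES) s->consec++;
    } else {
        s->consec = 0;  /* any quiet frame breaks the run */
    }
    if (s->consec < FALL_CONSEC_FRAMES) return false;
    if (s->has_fired && (now_us - s->last_fire_us) < FALL_COOLDOWN_US) return false;

    s->has_fired = true;
    s->last_fire_us = now_us;
    s->consec = 0;
    return true;
}
```

A caller zero-initialises a fall_debounce_t and calls fall_debounce_step once per frame. The sketch only counts frames above a fixed threshold; it never looks at what follows the spike, which is the gap a sequence-aware classifier is meant to close.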
A second pressure point: the Tmr Svc / FreeRTOS stack is already
sensitive. edge_processing.c lines 47–48 explicitly note that
process_frame + update_multi_person_vitals combined used ~6.5–7.5 KB
of the 8 KB task stack and that scratch buffers were moved to static
storage to avoid stack overflow. Any new heavyweight workload — and
a transformer forward pass is heavyweight — must therefore live in
its own FreeRTOS task with its own task stack, not piggyback on
the existing edge DSP task.
The vendored crate ruvllm_sparse_attention v0.1.1 (released 2026-05-07,
synced today at vendor/ruvector/crates/ruvllm_sparse_attention/)
removes the previously-blocking std requirement. Per upstream
ADR-192, the crate now compiles cleanly to
xtensa-esp32s3-none-elf via espup, produces a measured 376 KB
release rlib, has zero runtime dependencies beyond libm, and was
validated on a real ESP32-S3 (rev v0.2, 16 MB flash). It exposes
SubquadraticSparseAttention, KvCache / KvCacheF16, FastGrnnGate,
IncrementalLandmarks, RuvLlmSparseBlock, and a Tensor3 value
type. The kernel is O(N log N) by default and near-linear O(N) when
the FastGRNN salience gate is enabled.
This is the first time we have had a credible path to on-device transformer inference for CSI without a Python runtime, without TFLite, and without a coprocessor. It is also the right moment to decide whether we want it before code starts to land.
2. Decision
Add a learned temporal head to the ESP32-S3 firmware running on
the node itself, using ruvllm_sparse_attention compiled
--no-default-features (no_std + alloc, optionally +fp16), driven
by a small Rust component integrated into the ESP-IDF build. The
temporal head runs alongside the existing physics-only pipeline,
not as a replacement — physics gives us breathing/heart-rate/presence,
the temporal head gives us classification and sequence-aware reasoning.
Concretely:
- The temporal head consumes a rolling window of feature vectors (initially the same rv_feature_state_t floats already produced by ADR-081, plus optionally a small projection of recent CSI amplitude statistics), length N ∈ [100, 500] frames, sampled at the controller's fast-loop rate.
- It outputs a small set of class logits for the active detection task. The first deployable tasks are listed in §4.
- It runs in its own FreeRTOS task on Core 1 (or pinned to whichever core the WiFi driver is not on), at a cadence slower than the fast loop: initially 1 Hz, classification-on-demand.
- The kernel is invoked through a thin C ABI (ruv_temporal_init, ruv_temporal_push, ruv_temporal_classify) exported from a Rust static library linked into the ESP-IDF build the same way the existing Tier-3 components are linked.
- Weights are stored as a flat f32 (or f16 with the fp16 feature) blob in the ESP32-S3 flash, loadable from either an embedded EMBED_FILES resource (compile-time bake-in) or NVS (post-flash provisioning, mirroring ADR-040's WASM-upload path).
- The temporal head is gated behind a Kconfig option CONFIG_CSI_TEMPORAL_HEAD_ENABLED, default off, and is only compiled into the 8 MB build profile until the flash math in §6 demonstrates 4 MB headroom.
This ADR authorizes the architecture; it does not ship any of the firmware-side or training-side changes. Implementation lands in follow-up issues per the roadmap in §7.
3. Approach
3.1 Build integration
ESP-IDF v5.4 already supports Rust components via the
rust-esp32-style template (a CMake idf_component_register shim
that runs cargo build --target xtensa-esp32s3-none-elf and links
the resulting static library). The new component lives at
firmware/esp32-csi-node/components/ruv_temporal/:

    ruv_temporal/
      CMakeLists.txt          # component manifest, Rust build invocation
      Cargo.toml              # crate config: no_std, deps on ruvllm_sparse_attention
      build.rs                # generates the C header from #[no_mangle] exports
      src/lib.rs              # public C ABI: init/push/classify/teardown
      src/window.rs           # rolling frame buffer
      src/weights.rs          # NVS / EMBED_FILES weight loader
      include/ruv_temporal.h  # generated; consumed by edge_processing.c

Cargo features compiled in: ["fp16"]. Not parallel (rayon
needs threads, breaks no_std). Not std.
3.2 Interface
The C ABI is intentionally narrow. It does not expose Tensor3,
attention configs, or any Rust types — only float* buffers and
opaque handles:

    #include <stddef.h>
    #include <stdint.h>
    #include "esp_err.h"

    /* Opaque handle; the layout stays private to the Rust side. */
    typedef struct ruv_temporal_ctx ruv_temporal_ctx_t;

    esp_err_t ruv_temporal_init(const uint8_t *weights, size_t wlen,
                                uint32_t input_dim, uint32_t window,
                                ruv_temporal_ctx_t **out_ctx);
    esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
    esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
                                    float *logits, uint32_t n_classes);
    void ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);

push is the hot path and must be cheap (it just writes into a
ring buffer in PSRAM if available, IRAM/DRAM otherwise). classify
runs the actual sparse attention forward and is the budget-heavy
call.
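The push path can be illustrated with a fixed-capacity ring of frames. This is a minimal sketch, assuming caller-provided storage (PSRAM or internal SRAM on the target, a plain array on a host); frame_ring_t and its functions are hypothetical names and are not part of the ruv_temporal ABI.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ring of fixed-size float frames. */
typedef struct {
    float   *data;       /* storage for window * input_dim floats */
    uint32_t input_dim;  /* floats per frame */
    uint32_t window;     /* capacity in frames (N) */
    uint32_t head;       /* slot the next push writes to */
    uint32_t count;      /* frames held, saturates at window */
} frame_ring_t;

void frame_ring_init(frame_ring_t *r, float *storage,
                     uint32_t input_dim, uint32_t window)
{
    r->data = storage;
    r->input_dim = input_dim;
    r->window = window;
    r->head = 0;
    r->count = 0;
}

/* One memcpy, no allocation: once full, the oldest frame is overwritten. */
void frame_ring_push(frame_ring_t *r, const float *frame)
{
    memcpy(&r->data[(size_t)r->head * r->input_dim], frame,
           r->input_dim * sizeof(float));
    r->head = (r->head + 1u) % r->window;
    if (r->count < r->window) r->count++;
}

/* Frame i in arrival order among those currently held (0 = oldest). */
const float *frame_ring_at(const frame_ring_t *r, uint32_t i)
{
    uint32_t oldest = (r->head + r->window - r->count) % r->window;
    return &r->data[(size_t)((oldest + i) % r->window) * r->input_dim];
}
```

Keeping push to a single copy is what lets it be called from the fast loop's consumer without affecting the classify budget; the forward pass can then walk the frames oldest-first through frame_ring_at.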
3.3 Task topology
A new task ruv_temporal_task with its own 16 KB stack, pinned to
the same core as the edge DSP task (Core 1), fed via a FreeRTOS
queue from the adaptive controller's fast loop. We do not call
the kernel from the existing edge task — the edge stack is already
near-full per the comment at edge_processing.c:47-48 and recent
fall-debounce / Tmr-Svc-stack work.
3.4 Memory budget (per inference)
With N = 256 (window), d_model = 32, n_heads = 4, head_dim = 8,
1–2 RuvLlmSparseBlock layers, block_size = 64, window = 64:
- Weights: ~5–15 KB (single block, INT8 quant deferred to a later ADR; FP16 default).
- KV cache (FP16, full window): 2 * 256 * 4 * 8 * 2 B = 32 KB per layer.
- Activations (peak, with forward_flash tiling): ≈ 2 KB.
- Working set: < 64 KB for a single block. Comfortable in PSRAM, possible in ISR-safe internal SRAM.
These are first-pass estimates; the precise numbers come out of the
forward_flash benchmark on real hardware, which is an exit criterion
in §7.
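The KV-cache term can be checked mechanically. The sketch below evaluates the product quoted in the list; the function names are hypothetical, and the 15 KB weight and 2 KB activation inputs are the upper ends of the first-pass estimates, not measurements.

```c
#include <stddef.h>

/* K and V each hold n_tokens * n_kv_heads * head_dim elements. */
size_t kv_cache_bytes(size_t n_tokens, size_t n_kv_heads,
                      size_t head_dim, size_t elem_bytes)
{
    return 2u * n_tokens * n_kv_heads * head_dim * elem_bytes;
}

/* Sum of the three budget lines for one block. */
size_t working_set_bytes(size_t weights, size_t kv_cache, size_t activations)
{
    return weights + kv_cache + activations;
}
```

With N = 256, 4 heads, head_dim = 8 and 2-byte FP16 elements, kv_cache_bytes returns 32,768 B (32 KiB) per layer; passing 2 KV heads instead, as a GQA configuration would, gives 16 KiB. Using the upper weight estimate, one block totals 50,176 B, which stays under the 64 KB working-set bound.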
3.5 Compatibility with ADR-081 / ADR-039 / ADR-018
The temporal head is a consumer of the same feature stream already flowing in the firmware. It does not alter:
- ADR-018 raw CSI frame layout (0xC5110001).
- ADR-039 Tier-1 compressed CSI (0xC5110005) or vitals (0xC5110002).
- ADR-063 fused vitals (0xC5110004).
- ADR-081 rv_feature_state_t (0xC5110006); this is the primary input we tap.
If the temporal head fires a classification, the result rides on a
new 0xC5110007 packet (small: class id, confidence, monotonic seq,
ts_us, CRC32). Allocation of that magic is deferred to the
implementation PR — this ADR reserves the concept, not the byte
layout.
4. Use cases that motivate this
| Task | Why temporal context matters | Window | Class count |
|---|---|---|---|
| Gesture recognition (wave / point / clap / kick) | Single-frame CSI snapshots can't disambiguate gestures from random motion. ~100-frame windows capture full gesture trajectories. | 100 frames @ 50 Hz = 2 s | 4–8 |
| Fall classification with sequence context | The current heuristic ("> 15 rad/s² for 3 consecutive frames + 5 s cooldown") was raised to suppress false positives. A learned temporal head can distinguish a fall (rapid descent then stillness) from a sit-down (descent then sustained micro-motion) using the same input window. | 200 frames @ 50 Hz = 4 s | 3 (fall / sit / nothing) |
| Breathing-quality scoring | Today's pipeline emits a BPM and a confidence float. A temporal head trained on labeled apnea / shallow / paradoxical / normal sequences can output a 4-class quality label that downstream consumers can render in one glance. | 500 frames @ 50 Hz = 10 s | 4 |
| "Is this normal for this room/time" anomaly detection | Per-room SONA profiles (ADR-005) capture environment statistics, but anomaly temporal shape is currently checked host-side via embedding distance (ADR-024 §2.4 temporal_baseline index). A small on-device classifier can flag ahead of host roundtrip. | 300 frames | 2 (normal / anomalous) |
These four cover the visible product gaps in the v0.6.x line. Gesture recognition is the headline; fall classification is the highest-impact for the eldercare scenarios v0.5.4 was tuned for.
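The window column translates directly into time span and buffer size. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the ring buffer stores the whole 60 B rv_feature_state_t per frame, which the text does not specify. The rate is a parameter because the text quotes 50 Hz windows here and a 1–10 Hz emission rate for the feature state.

```c
#include <assert.h>
#include <stdint.h>

#define FEATURE_STATE_BYTES 60u  /* rv_feature_state_t size per ADR-081 */

/* Time covered by a window of `frames` samples at `rate_hz`. */
uint32_t window_duration_ms(uint32_t frames, uint32_t rate_hz)
{
    assert(rate_hz > 0);
    return (frames * 1000u) / rate_hz;
}

/* Bytes needed to hold the window (assumption: full struct per frame). */
uint32_t window_buffer_bytes(uint32_t frames)
{
    return frames * FEATURE_STATE_BYTES;
}
```

At 50 Hz the 100, 200 and 500 frame windows span 2 s, 4 s and 10 s, matching the table; at 10 Hz the same 100 frames would span 10 s. Under the full-struct assumption the largest window (500 frames) needs 30,000 B of buffer.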
5. Alternatives considered
| Option | Why rejected |
|---|---|
| TFLite Micro | Heavier runtime (~150 KB code + interpreter), pulls in C++ STL surface, no Rust-native API. Does not benefit from sparse attention specifically. We'd be re-paying the cost of a full inference framework when we only need one kernel. |
| Run all classifiers server-side | Costs a full Tier-1 CSI uplink (~50–70 KB/s/node per ADR-039) just to feed a remote classifier, then a roundtrip back. Defeats the point of ADR-081's compact feature stream and makes the system worthless when the backhaul is down. Also leaks raw CSI to the network for purposes the user did not opt into. |
| Stay physics-only forever | Cleanest from a maintenance standpoint, but it structurally rules out gesture recognition, and the fall-debounce hack will keep accreting per-deployment knobs. The product space already has commodity physics-only firmware (Bosch presence sensors, etc.); on-device transformer inference for CSI is what would differentiate RuView. |
| Use ruvector-attention (already in workspace) on-device | ruvector-attention is std-bound today; doesn't compile to xtensa-esp32s3-none-elf without a port comparable in scope to upstream ADR-192. Even if ported, it doesn't give us GQA + streaming KV cache, which is the structural capability the new crate adds. |
| Wait for IEEE 802.11bf | Different problem (standardised CSI exposure across vendors). Doesn't address whether the model runs on-device or off. |
6. Consequences
Positive
- Genuinely novel. To our knowledge, no competing CSI-sensing project ships transformer inference on the MCU itself. The closest peers (Espressif's ESP-DL, Edge Impulse) are non-attention CNN/RNN pipelines.
- Latency. Classification result is local — no backhaul, no host roundtrip, sub-100 ms gesture-to-action.
- Privacy. Raw CSI never leaves the node for these tasks.
- Reuses the ADR-081 feature stream: the temporal head is a consumer of the existing 60 B rv_feature_state_t, not a new uplink format.
- Validated kernel. Per upstream ADR-192, the no_std build was validated on real ESP32-S3 hardware (MAC ac:a7:04:e2:66:24). We are not betting on a paper crate.
Negative / tradeoffs
- Flash budget pressure on 4 MB boards. Per partitions_4mb.csv, each OTA slot is 0x1D0000 bytes (1,856 KiB, about 1.81 MiB). The current build is ~853 KiB. Adding a 376 KB rlib plus weights brings us to ~1.3 MB, still under the slot ceiling but with little headroom for other growth. Decision: temporal head is 8 MB-only initially, gated behind CONFIG_CSI_TEMPORAL_HEAD_ENABLED. 4 MB enablement is a separate ADR after we measure the actual incremental link size (the 376 KB upstream number is for the rlib in isolation; the linked-and-stripped final binary delta is expected to be smaller).
- Rust toolchain dependency. The ESP-IDF build now needs espup + cargo +esp to be present on every developer machine and CI runner. This is a real hurdle on Windows; see CLAUDE.local.md for the existing Python-subprocess wrapper required to run ESP-IDF cleanly. CI will need a parallel Rust-toolchain step.
- One more thing to test. QEMU (ADR-061) does not run the ESP32-S3 Xtensa Rust binary today. The QEMU validator pipeline will need a build matrix entry for "Rust component compiled but classifier disabled" at minimum.
- Stack overflow risk. Same hazard the v0.6.4 work just navigated. Mitigated by §3.3 (own task, own stack); this needs to be a code-review checklist item.
- Weights provenance. Once we ship a model, we need a story for which model, signed by whom, retrained how often. See Open Questions §8.
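The slot arithmetic behind the flash-budget item is easy to reproduce. This is a hedged sketch: the slot size is the 0x1D0000 value quoted from partitions_4mb.csv, the 853 KiB and 376 KB figures come from the text (with KB read as KiB, an assumption), the 15 KB weight blob is the upper estimate from §3.4, and flash_headroom_bytes is a hypothetical helper.

```c
#include <stdint.h>

#define OTA_SLOT_BYTES       0x1D0000u        /* per partitions_4mb.csv */
#define CURRENT_IMAGE_BYTES  (853u * 1024u)   /* current build, ~853 KiB */
#define RLIB_UPPER_BYTES     (376u * 1024u)   /* rlib size as an upper bound */
#define WEIGHTS_UPPER_BYTES  (15u * 1024u)    /* upper weight estimate, §3.4 */

/* Bytes left in the slot after adding `added` to `image`; 0 if it overflows. */
uint32_t flash_headroom_bytes(uint32_t slot, uint32_t image, uint32_t added)
{
    uint32_t total = image + added;
    return (total <= slot) ? (slot - total) : 0u;
}
```

0x1D0000 is 1,900,544 bytes (1,856 KiB). With the inputs above, and treating the isolated rlib size as a pessimistic stand-in for the linked delta, the headroom is 612 KiB. The measured stripped-binary delta from the roadmap replaces RLIB_UPPER_BYTES once it exists.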
Neutral
- ADR-040's WASM Tier-3 path is not superseded. WASM remains the right choice for user-uploaded modules. The temporal head is a first-party signed-by-us component, with a different deploy story.
- The host-side ADR-024 AETHER pipeline is unchanged by this ADR. ADR-096 covers the host-side use of the same crate.
7. Roadmap
| Phase | Scope | Gating |
|---|---|---|
| 0 | This ADR + ADR-096 land. No code. | Maintainer review of #513. |
| 1 | New crate wifi-densepose-temporal (host-side only): defines the temporal-head architecture, training script, weight serialization format. | Phase 0 accepted. |
| 2 | ruv_temporal ESP-IDF component scaffolding: empty kernel, just the C ABI and ring buffer. Compiles cleanly into 8 MB firmware. Adds ~5 KB to binary. | Phase 1 produces a serialised set of weights. |
| 3 | Wire ruvllm_sparse_attention forward (not yet forward_gated) into the component. First on-target classification benchmark on COM7. Gate: end-to-end inference ≤ 50 ms with N = 256, no stack overflow under 24 h soak. | Phase 2 ABI stable. |
| 4 | First trained classifier (gesture or fall, whichever has labelled data first). Hardware A/B: temporal-head decision vs current heuristic on a held-out set. Promotion criterion: temporal head matches or beats heuristic on F1 and false-positive rate. | Phase 3 latency gate met. |
| 5 | 4 MB profile gating: measure actual binary delta, decide whether to enable on SuperMini. | Phase 4 in production on 8 MB. |
| 6 | forward_gated_with_fastgrnn for long-window tasks (breathing-quality at N = 500). | Phase 4 stable. |
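The Phase 4 promotion criterion can be written down as a small check. This is a sketch under stated assumptions: it treats the task as binary (for example fall vs. everything else) so that a single confusion matrix applies, reads "matches or beats" as F1 not lower and false-positive rate not higher, and uses hypothetical names throughout.

```c
/* Counts from a held-out set for one detector. */
typedef struct {
    unsigned tp, fp, tn, fn;
} confusion_t;

/* F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all. */
double f1_score(confusion_t c)
{
    unsigned denom = 2u * c.tp + c.fp + c.fn;
    return denom ? (2.0 * c.tp) / denom : 0.0;
}

/* FPR = FP / (FP + TN); defined as 0 when there are no negatives. */
double false_positive_rate(confusion_t c)
{
    unsigned denom = c.fp + c.tn;
    return denom ? (double)c.fp / denom : 0.0;
}

/* Nonzero when the temporal head matches or beats the heuristic on both metrics. */
int promote_temporal_head(confusion_t head, confusion_t heuristic)
{
    return f1_score(head) >= f1_score(heuristic) &&
           false_positive_rate(head) <= false_positive_rate(heuristic);
}
```

For the multi-class tasks in §4 the same check would have to be applied per class or on a macro average; which of those the A/B uses is not specified here.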
8. Open questions
- Who trains the temporal heads? Two options: (a) host-side training on captured rv_feature_state_t traces labelled in-app, then export to flat-buffer weights; (b) teacher-distillation from the larger AETHER model (ADR-024) running off-device, using soft labels. Option (b) is more data-efficient but couples this ADR's ship date to ADR-024's training-pipeline maturity. Open.
- How are weights flashed? Three options, in increasing capability: NVS blob (small, safe, 4–8 KB ceiling per key), EMBED_FILES baked into the firmware image (no runtime update), OTA-updateable partition (mirrors ADR-040 RVF upload path, biggest engineering cost). Phase 2/3 will pick one; my prior is EMBED_FILES for the first model, OTA partition once we have more than one.
- Does the 376 KB rlib figure scale? Upstream measured 376 KB for the kernel + the embedding/projection weights for their test config. Adding 1–2 RuvLlmSparseBlock layers with embedding/projection weights sized to actual CSI feature dimension may push this. Phase 2 will measure the on-target stripped-binary delta directly; if the delta exceeds 600 KB we revisit the 4 MB story sooner.
- What window length is right for fall classification? 200 frames at 50 Hz = 4 s feels right based on the v0.6.4 debounce numbers (3-frame consecutive + 5 s cooldown is essentially a 4-second decision window already). Empirical, not architectural; set in Phase 4.
- Quantisation. First model ships FP16 (KV cache feature flag already supports this). INT8 for both weights and activations is a follow-up; the current crate has no INT8 path so it would be a separate kernel.
- What happens when the controller is in RV_PROFILE_PASSIVE_LOW_RATE? The fast loop slows down, so the input frame rate to the temporal head drops. Either the head needs to handle variable sample rate (resample at push time) or it stops emitting until the controller goes back to active. Phase 1 design call.
9. Acceptance criteria
This ADR is Accepted once:
- Maintainer review on #513 confirms the architecture.
- The follow-up implementation issue is filed and references this ADR plus ADR-096 by number.
- ADR index in docs/adr/README.md (if present) has an ADR-095 row.
This ADR is Implemented once:
- Phase 3 is in main with the gating Kconfig off by default.
- A Phase-4 hardware A/B has been published (witness-bundle compatible per ADR-028).
- The QEMU validator (ADR-061) has at minimum a "compiles, doesn't run" check for the Rust component.
10. Related
ADR-018 (binary CSI frame), ADR-024 (AETHER contrastive embedding —
host-side counterpart, see ADR-096), ADR-039 (edge intelligence
tiers), ADR-040 (WASM Tier-3 modules — the other extensibility
path), ADR-061 (QEMU CI), ADR-081 (adaptive controller, mesh plane,
rv_feature_state_t), ADR-091 (stand-off radar tier — adjacent
edge-intelligence ADR), upstream ADR-189 (KV cache incremental
decode), upstream ADR-190 (GQA/MQA), upstream ADR-192 (no_std +
alloc on ESP32-S3 — the structural unblock that makes this ADR
possible).