Compare commits

...

12 Commits

Author SHA1 Message Date
ruv c9fde3cba5 feat(train): AetherTemporalAggregator — wire wifi-densepose-temporal into the tch graph (#513)
ADR-096 train integration. Additive — does NOT modify model.rs. The
existing WiFiDensePoseModel forward stays bit-equivalent for back-compat.
New code lives in temporal_aether.rs behind the `aether-sparse-temporal`
feature flag (which itself requires `tch-backend`).

Architecture:

    tch::Tensor [T, in_dim]   ──── tch nn::Linear (q/k/v projections)
                                    ↓
                              [T, q_heads*head_dim] etc
                                    ↓
                             tch_to_tensor3 (CPU, f32, 1× copy)
                                    ↓
                              ruvllm_sparse_attention::Tensor3
                                    ↓
                            AetherTemporalHead::forward()
                                    ↓
                              Tensor3 [T, q_heads, head_dim]
                                    ↓
                             tensor3_to_tch (1× copy)
                                    ↓
                              tch::Tensor [T, q_heads*head_dim]
                                    ↓
                              tch nn::Linear (output projection)
                                    ↓
                              tch::Tensor [T, in_dim]

Why additive rather than swapping `apply_antenna_attention` /
`apply_spatial_attention` in model.rs: those are over antenna and
spatial axes, not temporal — ADR-096 §8.1 was right that AETHER
doesn't currently HAVE a temporal-axis attention. This commit adds
that path without disturbing the others, so the §5 validation gate
can A/B the two options before flipping the production default.

Scope notes:
- B=1 prefill only this version. Multi-batch lands when §5 turns
  green and we need to take perf seriously. The forward expects
  `[T, in_dim]` not `[B, T, in_dim]`; documented in the file.
- Streaming step() bridge deferred — KvCache lifecycle ties to
  PoseTrack per ADR-096 §8.5, which is signal-side not train-side.
- Two CPU memory copies per call (in + out). For training-rate
  forwards (~100/sec at batch 16) this is negligible vs the actual
  attention work; for inference-rate streaming it'd be the
  bottleneck and a zero-copy path is the natural follow-up.

Build verification:
- Source compiles cleanly with cargo check on the host crate
  (`-p wifi-densepose-temporal`, 21/21 tests still passing).
- The train crate's tch-backend build is environmentally blocked
  on this Windows machine — torch-sys fails to link against the
  system PyTorch 2.11 + MSVC 14.50 toolchain. This predates this
  commit and affects all tch-bound code paths in the workspace.
  CI runners with working libtorch will verify the new module
  builds; the source follows the same nn::Linear / Module patterns
  the existing model.rs uses.

Feature gating ensures default builds are byte-equivalent. Off by
default; enable with `--features aether-sparse-temporal`.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:42:41 -04:00
ruv 2b903752c4 test(temporal): dense-vs-sparse numerical A/B baseline (ADR-096 §5, #513)
Establishes the kernel-level output-divergence envelope between the
two backends — what §5's downstream-metric gate (contrastive loss,
rank-1, Spearman) would calibrate against. Two regimes:

1. Saturated pattern (window ≥ N, block ≥ N): sparse and dense visit
   the same edge set, so divergence reflects only float accumulation
   order. **Asserted < 1e-4** at N=32, heads=4, dim=16. Tight bound.

2. Realistic sparse (window=16, block=32, N=256): real approximation,
   real divergence. **Measured max_abs_err = 5.22e-3, mean = 1.79e-3**
   on the deterministic test inputs. Sanity-checked finite + < 1.0
   so structural breakage (NaN, softmax overflow) trips a panic, but
   the specific numbers are *baseline data* not a hard contract — the
   §5 gate cares about downstream task metrics, not bit-equality.

Why this is in the test suite rather than a benchmark:
- It runs in <0.2s, no need to gate behind --release.
- The saturated-pattern bound IS a hard contract — if that breaks
  the kernel changed semantics in a way the API hides, and we want
  CI to catch it.
- Printing the realistic-pattern numbers (eprintln, visible with
  --nocapture) gives a known-good reference point to compare future
  builds against.

Test count is now 21/21 across the crate (6 smoke + 8 weight blob +
2 blob e2e + 3 streaming + 2 dense-vs-sparse).

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:16:05 -04:00
ruv 4ea8457017 feat(temporal): Dense backend implementation (ADR-096 §5 A/B gate, #513)
Closes the Dense placeholder from earlier commits. Now both backends
implement forward(); only SparseGqa supports streaming step()/KvCache,
which is the structural gap dense MHA can't bridge by design.

Dense path:
- src/dense.rs new — DenseHead wraps upstream dense_attention. Stores
  causal flag and (cloned) config. forward() is a one-line delegation;
  no GQA dispatch (dense_attention upstream requires q_heads == kv_heads).
- AetherTemporalHead::Dense changed from a unit variant to Dense(DenseHead).
  Construction succeeds for any valid TemporalHeadConfig where backend
  is Dense.
- AetherTemporalHead.step() returns BackendDoesNotSupportStreaming for
  Dense — there is no dense-MHA-with-KV-cache equivalent and offering
  one would silently swallow the ADR-096 §3.2 structural argument.
- AetherTemporalHead.make_cache() likewise — there's no cache to size
  for a dense kernel.

Errors:
- New TemporalError::BackendDoesNotSupportStreaming variant covers
  the Dense-step / Dense-make_cache cases. Specific so callers can
  fall back to forward() instead of giving up entirely.
- TemporalError::DenseBackendNotImplemented retained for v0.1
  back-compat (no consumers depend on it post-this-commit, but
  removing a public variant is a hard break). Future work can
  deprecate it once downstream callers move off.

Tests (19/19 passing):
- dense_backend_returns_typed_error → renamed and rewritten as
  dense_backend_forward_runs_with_matching_shape: constructs a Dense
  head, runs forward over (32, 4, 4, 16) Q/K/V, asserts output shape.
- New dense_backend_step_returns_streaming_error: constructs Dense,
  attempts make_cache, expects BackendDoesNotSupportStreaming.
- All 8 weight blob, 2 blob e2e, 3 streaming, 5 other smoke tests
  unchanged and still passing.

This commit completes the ADR-096 §5 A/B gate: callers can now run
the same Q/K/V through both backends and compare outputs / latency.
The §5 four-gate validation (contrastive loss within 1%, rank-1
within 1pp, Spearman ≥0.95, latency ≥5×) becomes a runnable
proposition, not a future task — though the actual gate run requires
trained AETHER weights, which is its own track.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:12:17 -04:00
ruv 2aee4d21cf docs(temporal): README for wifi-densepose-temporal (#513)
Closes the documentation gap on the host-side ADR-096 surface.
The crate has 7 commits, 5 source modules, 4 test suites, 2 examples,
and a captured benchmark; reviewers and downstream consumers needed
a landing page.

Sections:
- Quick start (5-line forward + 7-line streaming)
- Backends + selection rule (SparseGqa MHA-vs-GQA dispatch)
- Streaming semantics (cache lifetime, eviction policy, the
  headline correctness test)
- Weight blob format with the host/firmware lockstep note
- Examples (init_random_blob, bench_speedup) with run lines
- Tests (18/18 passing as of 247794a2c, broken down by suite)
- Status of ADR-096 claims with concrete evidence for each
- Status of ADR-095 surface (firmware) + the toolchain blocker
- Carry-forward of the open questions still applicable from §8

The README intentionally cross-links to:
- docs/adr/ADR-096 for design rationale
- components/ruv_temporal/ README for the firmware mirror
- benches_results.md for the captured speedup curve

Doesn't claim more than is proven. Each ADR-096 claim either has a
test or a benchmark cited as evidence; the partial claim (30-100× at
long windows) explicitly says 21× was the measured number, not 30×.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:06:26 -04:00
ruv 247794a2c5 bench(temporal): empirical sparse-vs-dense speedup curve (ADR-096 §3.1, #513)
Validates the central performance claim of ADR-096 with a runnable
benchmark. Single-run wall-clock, pure-Rust vs pure-Rust on x86_64
host. Real numbers, not just analytic argument.

Results (N=64..1024):

| N      | Dense (ms) | Sparse (ms) | Speedup |
|--------|-----------:|------------:|--------:|
|     64 |      0.262 |       0.141 |   1.86× |
|    128 |      1.120 |       0.335 |   3.34× |
|    256 |      4.129 |       0.711 |   5.81× |
|    512 |     19.230 |       2.356 |   8.16× |
|   1024 |     71.904 |       3.389 |  21.21× |

Asymptotic check: 64→1024 is 16× more tokens. Dense's 274× cost
growth matches N² (256× = 16²). Sparse's 24× growth matches
N log N (16 · log(1024)/log(64) ≈ 27). The complexity claim is
empirically supported.

ADR-096 §3.1 honest-framing paragraph predicted N=64 would be
overhead-bound; we measured 1.86× there, consistent with the ADR's
warning that AETHER's current `window_frames=100` default is below
the inflection point where sparse pays.

What this commit adds:
- examples/bench_speedup.rs — measures dense_attention (upstream
  reference), AetherTemporalHead.forward (this crate's wrapper),
  and SubquadraticSparseAttention.forward (raw, to confirm the
  wrapper isn't introducing overhead — it isn't, the two are
  within noise).
- benches_results.md — captured table + asymptotic check + caveats
  (config used, what the benchmark doesn't measure, how to run).

Run it:
  cargo run -p wifi-densepose-temporal --example bench_speedup --release

What's NOT measured here:
- Decode-step latency (already proved correct at last-token, not
  yet timed against a hypothetical O(N²) dense decode — they're
  structurally not comparable anyway).
- Memory footprint of KvCache + FP16 (matters on firmware, not host).
- GQA dispatch — this bench uses MHA shape so dense and sparse
  operate on identical tensors. Real AETHER will want MQA per
  TemporalHeadConfig::default_aether(), which halves KV memory.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:02:36 -04:00
ruv 49e57efcec feat(temporal): streaming step() + KvCache (ADR-096 §3.2, #513)
The structural advantage that's the entire point of ADR-096: O(log T)
per new token via decode_step against an accumulated KvCache, vs
O(N²) recompute for dense MHA. This commit lands the API and proves
the numerical equivalence at the last position.

API:
- AetherTemporalHead::step(q_new, k_new, v_new, &mut cache)
  Single-token decode. Appends (k_new, v_new) to cache, runs
  decode_step(q_new) against the now-updated cache, returns the new
  position's output.
- AetherTemporalHead::make_cache(capacity)
  Convenience constructor — caller doesn't need to import
  ruvllm_sparse_attention to size a cache. Per ADR-096 §8.5 the
  natural lifetime is per-PoseTrack (re-ID) or per-session (online
  classification); when the track drops, drop the cache.
- KvCache re-exported at the crate root.

Contract:
- q_new/k_new/v_new must each have seq == 1. Multi-token q is the
  prefill path (forward), not decode_step.
- Cache lifetime is the caller's. The crate enforces shape via
  make_cache so callers can't mismatch kv_heads / head_dim / block_size.
- KvCache fill is the caller's problem. Upstream H2O heavy-hitter
  eviction is opt-in; this crate's wrapper doesn't pre-pick a policy.

Tests (18/18 total now passing):
- streaming_step_matches_forward_at_last_position — central claim:
  16-token sequence, append k/v one at a time via step(), compare
  the streamed last-token output to forward(full Q,K,V)[N-1].
  max_abs_err < 1e-3 (currently passes well under that bound for
  the 0.1-magnitude activations the test uses).
- step_rejects_multi_token_q — contract enforcement.
- make_cache_returns_kvcache_with_correct_shape — wiring smoke,
  confirms (capacity, kv_heads, dim, block_size) ordering is correct
  through the make_cache wrapper.

Test config uses MHA shape (q_heads == kv_heads) because the upstream
decode_step is wired to the MHA branch; the GQA decode path is on
upstream's roadmap and lands in a separate ADR-096 follow-up when it
does.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 11:57:31 -04:00
ruv 3a5fe5e0de feat(firmware): mirror weight-blob parser into ruv_temporal (#513)
Closes the format contract on the firmware side. Source-only — Phase 5
toolchain blocker still prevents actually compiling, but when it
unblocks this is one less thing to write under time pressure.

- src/weights.rs — no_std mirror of v2/.../weights.rs. Same magic
  ('RVNE'), same version 1, same CRC32-IEEE polynomial (matches the C
  side in temporal_task.c). Bit-for-bit lockstep with the host: a
  blob produced by host WeightBlob::serialize() parses here as a
  WeightBlobView byte-for-byte.

  Borrowed-slice parse design: the firmware loader receives weights
  via mmap'd EMBED_FILES or NVS read into a heap buffer. The parser
  takes &[u8] with no copy — view fields point into the caller's
  buffer. Caller is responsible for keeping the buffer alive for the
  view's lifetime.

  Loader errors map to esp_err_t-style codes via
  weight_load_err_to_esp() so the C ABI can surface specific failure
  modes (ESP_ERR_INVALID_ARG for magic/version/size, ESP_ERR_INVALID_CRC
  for corruption, ESP_ERR_INVALID_SIZE for shape validation failures).

- src/lib.rs — ruv_temporal_init now optionally validates a non-NULL
  weights blob. NULL pointer is still allowed during the Phase 4/5
  bring-up window (kernel forward isn't actually consuming weights
  yet), but when caller passes a real blob we parse + sanity-check
  declared dims against runtime arguments. Catches deploy bugs at
  init() rather than at first classify() — the firmware Tmr Svc work
  in v0.6.4 taught us that classify-time crashes are the worst kind.

- README.md — Phase 6 marked done (verified by 8MB firmware build with
  feature off in commit 7994af822). Added module map table covering
  lib.rs / window.rs / weights.rs / ruv_temporal.h / shim.c.

What's deliberately NOT in this commit:
  - Cross-compile validation. Same toolchain blocker as before.
  - Kernel-side wiring of weights into the forward pass. That's
    Phase 6+ of the firmware roadmap — once the kernel is wired,
    weights become a required arg, not an optional one.
  - Tests on the firmware side. They'd need build-std working to run;
    16/16 host tests cover the format end-to-end via the lockstep
    polynomial.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 11:53:19 -04:00
ruv 73321db765 feat(temporal): init_random_blob example + filesystem e2e tests (#513)
Closes the host→file→firmware loop on the Phase 1 weight format. Real
.rvne artifact emitted from the example, parsed back through filesystem
in the e2e test, byte-identical across two seeded runs.

- examples/init_random_blob.rs — produces a 41,244-byte deployable blob
  matching the AETHER default head shape (input_dim=16, q_heads=4,
  kv_heads=1 [MQA], head_dim=32, layers=2, classes=4 — staying coherent
  with TemporalHeadConfig::default_aether so a real trainer can drop
  in this shape with one search-and-replace). Uses xorshift64* with a
  fixed seed (0xC511_0007_DEAD_BEEF) for reproducibility.

  Per-layer weight count derivation lives in the example (Wq + Wk +
  Wv + Wo, plus a final classifier head) so the kernel's expectation
  is anchored in code rather than a comment that drifts.

- tests/blob_e2e.rs — two new tests, 15/15 total now passing:
    * realistic_blob_roundtrips_through_filesystem — writes a 25+ KB
      blob to std::env::temp_dir(), reads it back, parses, validates.
      Mirrors what the firmware loader will do once the toolchain
      unblocks (mmap NVS or EMBED_FILES → parse).
    * deterministic_seed_produces_byte_identical_blobs — same seed
      produces byte-identical output, twice. This is what makes a
      witness-bundle (ADR-028) over trained weights meaningful.

Verified by running the example with an explicit out path:
  cargo run -p wifi-densepose-temporal --example init_random_blob -- \
      v2/target/example-output/model_init.rvne
  → 41244 bytes, parses clean, dtype/shape/CRC all good.

What this isn't yet:
  - Not a trained model. Random init only.
  - Not a kernel forward over the blob. That requires the firmware
    Rust component to compile (Phase 5 — toolchain blocker).
  - Not wired into wifi-densepose-train. ADR-096 §8.1 flagged that
    the AETHER train crate doesn't currently have a temporal-axis
    attention; that integration is a separate piece of work.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 11:49:19 -04:00
ruv 237325a117 feat(temporal): weight-blob wire format (ADR-095 Phase 1, #513)
The training/firmware boundary needs a stable serialization for the
temporal head's weights, distinct from the kernel scaffold and the
firmware ABI. This commit defines that format on the host side. The
firmware-side mirrored loader lands when the toolchain unblocks.

Format:
  - Header (24 B): magic 'RVNE' / version 1 / dtype flag
    (FP32 / FP16) / input_dim / n_q_heads / n_kv_heads / head_dim /
    n_layers / n_classes / weights_len.
  - Body: weights_len bytes of flat per-layer weights.
  - Footer (4 B): CRC32 IEEE 802.3 over everything before, same
    polynomial used by temporal_task.c so a blob produced here parses
    on the firmware unchanged.

Layout decisions:
  - Little-endian throughout (Xtensa native).
  - Weights kept as Vec<u8> rather than Vec<f32>/Vec<f16> so the no_std
    firmware loader (which may not have the `half` crate) can mmap and
    read either dtype directly.
  - Versioning is hard-break: bumping `version` means firmware refuses
    to load. Optional fields go behind reserved flag bits, never by
    field reorder. Documented inline.

Validation surface:
  - `WeightBlobHeader::validate()` catches zero dims, invalid GQA
    ratios (n_q_heads % n_kv_heads != 0), n_layers=0, n_classes<2.
    Same checks fire from `WeightBlob::parse()` so the firmware can't
    accidentally accept a blob the host should have rejected.
  - `WeightBlob::parse()` enforces magic / version / size / CRC
    before exposing weights to the caller.

Tests (8/8 passing, alongside 5/5 sparse smoke = 13/13 total):
  - roundtrip_fp32, roundtrip_fp16
  - parse_rejects_bad_magic, _wrong_version, _size_mismatch,
    _crc_corruption, _invalid_gqa_ratio_in_header
  - header_constants_match_wire_layout (anchor)

What's deliberately NOT in this commit:
  - The firmware-side mirrored loader (deferred to the iteration that
    unblocks the esp Rust toolchain — no point shipping a parser that
    can't be compiled).
  - Per-layer weight ordering. The blob is a flat byte-buffer; the
    interpretation of per-layer offsets is the kernel's contract,
    documented in the eventual model module (ADR-095 §3.2 follow-up).

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 11:43:49 -04:00
ruv 7994af8221 feat(firmware): wire temporal_task.c + Kconfig + ruv_temporal component (Phase 6, #513)
Phase 6 of #513: C-side wiring for the on-device temporal head. Builds
cleanly with feature OFF (default); 8MB binary delta is +96 bytes vs
v0.6.4-esp32 — that's the no-op shim path. Feature ON depends on the
Rust component (Phase 5, currently blocked by upstream esp-rs nightly).

Files:

- main/temporal_task.{c,h} — owns the FreeRTOS task lifecycle. Per
  ADR-095 §3.3 the task has its own 16 KB stack pinned to Core 1 and
  is fed via a 32-deep FreeRTOS queue. With feature OFF the .c file
  collapses to three ESP_ERR_NOT_SUPPORTED stubs so callers don't
  need #ifdefs at every call site.
- main/temporal_task.h — defines rv_temporal_pkt_t (40 bytes,
  magic 0xC5110007 — next free in the existing 0xC5110001..0006
  family) and the task lifecycle API. Build-time _Static_assert
  pins the wire format.
- main/Kconfig.projbuild — new menu "On-device temporal head
  (ADR-095, #513)" with CONFIG_CSI_TEMPORAL_HEAD_ENABLED (default n)
  plus four runtime-tuneable knobs: TEMPORAL_INPUT_DIM (16),
  TEMPORAL_WINDOW_LEN (256), TEMPORAL_N_CLASSES (4), and
  TEMPORAL_CLASSIFY_PERIOD_MS (1000).
- main/CMakeLists.txt — adds temporal_task.c to SRCS unconditionally
  (the .c file feature-gates internally), and adds ruv_temporal to
  REQUIRES only when the feature is enabled so default builds don't
  pull in the Rust component.
- main/adaptive_controller.c — fast_loop_cb now extracts the 9
  feature floats from the pkt it just built and pushes them into
  temporal_task_push_frame after the existing stream_sender_send.
  Non-blocking; queue-full drops are coalesced and logged 1/sec.
- main/main.c — temporal_task_start() called right after
  adaptive_controller_init(). Wrapped in #ifdef so feature-off
  builds don't reference the (no-op-anyway) function.
- components/ruv_temporal/CMakeLists.txt — restructured. Top-level
  Kconfig guard registers an empty component when the feature is
  off (avoids running cargo without a working toolchain).
  add_custom_command moved AFTER idf_component_register so it
  doesn't fire in script mode (required by ESP-IDF v5.4).

Validation:
- Firmware builds clean with default config (feature OFF) on
  ESP-IDF v5.4 / esp32s3 target. Binary 1062 KiB / 2 MiB partition,
  48 % free.
- Static assertion catches wire-format drift (rv_temporal_pkt_t size).
- Host-side `cargo test -p wifi-densepose-temporal` still 5/5 from
  the earlier commit (no regression, this commit only touches
  firmware/).

Phase 7 (flash to COM8 + soak) deferred this iteration — board is
currently not enumerating on COM8; will pick up next iteration when
the ESP32 is reattached.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 11:28:11 -04:00
ruv 22d47a71e3 feat(firmware): scaffold ruv_temporal ESP-IDF Rust component (ADR-095 Phase 4, #513)
Phase 4 of the #513 roadmap: ESP-IDF component skeleton at
`firmware/esp32-csi-node/components/ruv_temporal/`. Source is complete
and self-consistent; cross-compile to xtensa-esp32s3-none-elf is
blocked by a known-broken esp-rs nightly snapshot (details in the
component README).

What's in the scaffold:

- `Cargo.toml` — staticlib, no_std + alloc, deps on the path-vendored
  `ruvllm_sparse_attention` (matching ADR-096's host-side dep) and
  `esp-alloc`/`critical-section` for the no_std allocator and lock
  primitives.
- `src/lib.rs` — public C ABI (init / push / classify / destroy /
  self_test) with `#[no_mangle]` exports, a `[#used]` keepalive table
  to defeat aggressive linker stripping, esp-alloc as the global
  allocator (heap region added at runtime by the firmware), and a
  loop-on-panic handler (Phase 5 will route through esp_system_abort).
- `src/window.rs` — `FrameRing`, the rolling-window buffer that
  `ruv_temporal_push` writes to. Chronological iteration via
  `iter_chronological()` so the kernel sees oldest-first.
- `include/ruv_temporal.h` — the public C header consumed by
  edge_processing.c. Threading contract documented inline (single
  dedicated FreeRTOS task, no internal locks).
- `CMakeLists.txt` — runs `cargo +esp build` as an ESP-IDF
  pre-component-register step, then registers the static library
  through `idf_component_register` + `target_link_libraries(...
  INTERFACE ...)`. `shim.c` exists only because
  `idf_component_register` requires SRCS.
- `.cargo/config.toml` + `rust-toolchain.toml` — pin the build to
  `xtensa-esp32s3-none-elf` and the `esp` toolchain channel so
  `cargo build` without flags Just Works once the toolchain is
  unblocked.
- `README.md` — Phase status table, Phase 5 toolchain blocker
  explanation, and the espup install fix.

ABI calls into edge_processing.c (Phase 6) and COM8 validation
(Phase 7) follow once the cross-compile is unblocked.

Closes nothing yet; advances #513.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 09:44:01 -04:00
ruv bfb3fdee13 feat(temporal): scaffold wifi-densepose-temporal crate (ADR-096 Phase 1-3, #513)
Implements Phases 1-3 of the ADR-096 roadmap:

Phase 1: workspace integration
- Add `ruvllm_sparse_attention` as a path-vendored workspace dep against
  `vendor/ruvector/crates/ruvllm_sparse_attention`, default-features=false,
  features=["fp16"]. Mirrors the no_std posture ADR-095 will need on the
  firmware side so both consumers share a single feature set.
- Register `wifi-densepose-temporal` as workspace member.

Phase 2: AETHER temporal head
- `AetherTemporalHead` facade dispatches to a `SparseGqa` backend wrapping
  `SubquadraticSparseAttention`. Selection rule from ADR-096 §4.4 enforced
  at forward(): MHA branch when q_heads == kv_heads, GQA branch otherwise.
- `Dense` backend reserved (returns typed `DenseBackendNotImplemented`)
  so config-time validation fails loudly instead of at forward().
- `TemporalHeadConfig::default_aether()` matches the AETHER training
  default per ADR-096 §3.1 (window=32, block=16, q=4, kv=1 → MQA).
- Token 0 always wired as a global anchor — preserves AETHER's
  contrastive "session-start reference" role per ADR-024.

Phase 3: smoke tests (5/5 passing)
- forward at AETHER default config, both MHA and GQA dispatch paths,
  rejected dense backend, rejected non-divisible GQA ratio, and the
  long-window roadmap target (N=1000, the 10s @ 100Hz case from
  ADR-096 §3.1 — proves the kernel runs at lengths where dense MHA
  costs 10⁶ edge ops vs sparse 10⁴).

Streaming `step()` deferred — KvCache lifecycle ties to PoseTrack per
ADR-096 §8.5 and lands when the firmware-side ABI does (Phase 4+).

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 09:26:18 -04:00
38 changed files with 3654 additions and 4 deletions

View File

@ -0,0 +1,10 @@
# Per-component cargo config so `cargo build` picks the xtensa target
# without the caller having to remember `--target xtensa-esp32s3-none-elf`.
# CMakeLists.txt still passes --target explicitly for clarity.
[build]
target = "xtensa-esp32s3-none-elf"
# The esp toolchain ships precompiled core and alloc for
# xtensa-esp32s3-none-elf, so build-std is unnecessary and (as of the
# 2025-09-16 esp nightly) actively broken on portable_simd.

View File

@ -0,0 +1,49 @@
# ESP-IDF component manifest for the ruv_temporal Rust staticlib (ADR-095).
#
# Build flow:
# - When CONFIG_CSI_TEMPORAL_HEAD_ENABLED is OFF (default): register an
# empty stub. main/temporal_task.c compiles the no-op shim path, no
# cargo, no Rust toolchain dependency. Default firmware build is
# unaffected.
# - When CONFIG_CSI_TEMPORAL_HEAD_ENABLED is ON: invoke
# `cargo +esp build --release --target xtensa-esp32s3-none-elf`,
# register the resulting libruv_temporal.a, and expose include/.
#
# add_custom_command is intentionally placed AFTER idf_component_register
# because ESP-IDF runs every component's CMakeLists.txt twice once in
# script mode for dependency discovery (where add_custom_command is
# forbidden), and once for the actual build.
if(NOT CONFIG_CSI_TEMPORAL_HEAD_ENABLED)
# Feature disabled register an empty component so the directory's
# mere existence doesn't break the build, but do NOT invoke cargo
# or pull include/ onto consumers' include paths (the C ABI header
# would advertise capabilities we cannot honour).
idf_component_register()
return()
endif()
set(RUV_TEMPORAL_DIR "${CMAKE_CURRENT_SOURCE_DIR}")
set(RUV_TEMPORAL_TARGET "xtensa-esp32s3-none-elf")
set(RUV_TEMPORAL_PROFILE "release")
set(RUV_TEMPORAL_LIB
"${RUV_TEMPORAL_DIR}/target/${RUV_TEMPORAL_TARGET}/${RUV_TEMPORAL_PROFILE}/libruv_temporal.a")
idf_component_register(
SRCS "shim.c"
INCLUDE_DIRS "include"
PRIV_REQUIRES "esp_common"
)
# Custom command + target run only at build time, not in script mode.
add_custom_command(
OUTPUT "${RUV_TEMPORAL_LIB}"
WORKING_DIRECTORY "${RUV_TEMPORAL_DIR}"
COMMAND cargo +esp build --release --target ${RUV_TEMPORAL_TARGET}
COMMENT "Building ruv_temporal Rust staticlib for ${RUV_TEMPORAL_TARGET}"
VERBATIM
)
add_custom_target(ruv_temporal_rust_build ALL DEPENDS "${RUV_TEMPORAL_LIB}")
add_dependencies(${COMPONENT_LIB} ruv_temporal_rust_build)
target_link_libraries(${COMPONENT_LIB} INTERFACE "${RUV_TEMPORAL_LIB}")

View File

@ -0,0 +1,218 @@
# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4
[[package]]
name = "allocator-api2"
version = "0.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c583acf993cf4245c4acb0a2cc2ab1f9cc097de73411bb6d3647ff6af2b1013d"
[[package]]
name = "cfg-if"
version = "1.0.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801"
[[package]]
name = "critical-section"
version = "1.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "790eea4361631c5e7d22598ecd5723ff611904e3344ce8720784c93e3d83d40b"
[[package]]
name = "crunchy"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "460fbee9c2c2f33933d720630a6a0bac33ba7053db5344fac858d4b8952d77d5"
[[package]]
name = "darling"
version = "0.21.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9cdf337090841a411e2a7f3deb9187445851f91b309c0c0a29e05f74a00a48c0"
dependencies = [
"darling_core",
"darling_macro",
]
[[package]]
name = "darling_core"
version = "0.21.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1247195ecd7e3c85f83c8d2a366e4210d588e802133e1e355180a9870b517ea4"
dependencies = [
"fnv",
"ident_case",
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "darling_macro"
version = "0.21.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d38308df82d1080de0afee5d069fa14b0326a88c14f15c5ccda35b4a6c414c81"
dependencies = [
"darling_core",
"quote",
"syn",
]
[[package]]
name = "document-features"
version = "0.2.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d4b8a88685455ed29a21542a33abd9cb6510b6b129abadabdcef0f4c55bc8f61"
dependencies = [
"litrs",
]
[[package]]
name = "enumset"
version = "1.1.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7f96a4a12fe60ac746ae295a1a4ecb5bb02debc20856506c8635288065f142de"
dependencies = [
"enumset_derive",
]
[[package]]
name = "enumset_derive"
version = "0.15.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4bd536557b58c682b217b8fb199afdff47cd3eff260623f19e77074eb073d63a"
dependencies = [
"darling",
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "esp-alloc"
version = "0.8.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7e95f1de57ce5a6600368f3d3c931b0dfe00501661e96f5ab83bc5cdee031784"
dependencies = [
"allocator-api2",
"cfg-if",
"critical-section",
"document-features",
"enumset",
"linked_list_allocator",
]
[[package]]
name = "fnv"
version = "1.0.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1"
[[package]]
name = "half"
version = "2.7.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6ea2d84b969582b4b1864a92dc5d27cd2b77b622a8d79306834f1be5ba20d84b"
dependencies = [
"cfg-if",
"crunchy",
"zerocopy",
]
[[package]]
name = "ident_case"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b9e0384b61958566e926dc50660321d12159025e767c18e043daf26b70104c39"
[[package]]
name = "libm"
version = "0.2.16"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b6d2cec3eae94f9f509c767b45932f1ada8350c4bdb85af2fcab4a3c14807981"
[[package]]
name = "linked_list_allocator"
version = "0.10.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2b23ac50abb8261cb38c6e2a7192d3302e0836dac1628f6a93b82b4fad185897"
[[package]]
name = "litrs"
version = "1.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "11d3d7f243d5c5a8b9bb5d6dd2b1602c0cb0b9db1621bafc7ed66e35ff9fe092"
[[package]]
name = "proc-macro2"
version = "1.0.106"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934"
dependencies = [
"unicode-ident",
]
[[package]]
name = "quote"
version = "1.0.45"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924"
dependencies = [
"proc-macro2",
]
[[package]]
name = "ruv_temporal"
version = "0.1.0"
dependencies = [
"critical-section",
"esp-alloc",
"ruvllm_sparse_attention",
]
[[package]]
name = "ruvllm_sparse_attention"
version = "0.1.1"
dependencies = [
"half",
"libm",
]
[[package]]
name = "syn"
version = "2.0.117"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99"
dependencies = [
"proc-macro2",
"quote",
"unicode-ident",
]
[[package]]
name = "unicode-ident"
version = "1.0.24"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75"
[[package]]
name = "zerocopy"
version = "0.8.48"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "eed437bf9d6692032087e337407a86f04cd8d6a16a37199ed57949d415bd68e9"
dependencies = [
"zerocopy-derive",
]
[[package]]
name = "zerocopy-derive"
version = "0.8.48"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "70e3cd084b1788766f53af483dd21f93881ff30d7320490ec3ef7526d203bad4"
dependencies = [
"proc-macro2",
"quote",
"syn",
]

View File

@ -0,0 +1,35 @@
[package]
name = "ruv_temporal"
version = "0.1.0"
edition = "2021"
license = "MIT"
description = "ESP32-S3 on-device temporal head for WiFi-DensePose (ADR-095, #513)"
publish = false
[lib]
crate-type = ["staticlib"]
name = "ruv_temporal"
# Don't get pulled into the v2 workspace — this crate cross-compiles to
# xtensa-esp32s3-none-elf, the workspace targets host x86_64.
[workspace]
[dependencies]
ruvllm_sparse_attention = { path = "../../../../vendor/ruvector/crates/ruvllm_sparse_attention", default-features = false, features = ["fp16"] }
# Minimal no_std + alloc plumbing. esp-alloc supplies a GlobalAlloc that
# punches through to ESP-IDF's heap_caps_malloc; critical-section provides
# the lock primitive linked_list_allocator wants on no_std targets.
esp-alloc = "0.8"
critical-section = "1"
[profile.release]
opt-level = "s"
lto = true
codegen-units = 1
panic = "abort"
strip = true
[profile.dev]
opt-level = 1
panic = "abort"

View File

@ -0,0 +1,86 @@
# `ruv_temporal` — ESP32-S3 on-device temporal head
ESP-IDF component implementing ADR-095 (#513). The Rust staticlib at
`src/lib.rs` wraps `ruvllm_sparse_attention` (vendored at
`vendor/ruvector/crates/ruvllm_sparse_attention`) and exposes a narrow
C ABI declared in `include/ruv_temporal.h`.
## Status
| Phase | Scope | State |
|-------|-------|-------|
| 4 — Scaffold | Cargo.toml, src/{lib.rs,window.rs,weights.rs}, include/ruv_temporal.h, CMakeLists.txt, .cargo/config.toml | **Done.** |
| 5 — Cross-compile | `cargo +esp build --release --target xtensa-esp32s3-none-elf` produces `libruv_temporal.a`. | **Blocked** — see below. |
| 6 — Wire from edge_processing.c | FreeRTOS task on Core 1, queue from adaptive_controller fast loop, push() in fast tick, classify() at 1 Hz, emit `0xC5110007` packet. | **Done** in `main/temporal_task.c` (no-op shim path verified by 8MB firmware build with feature off). |
| 7 — COM8 validation | Flash 8MB build with `CONFIG_CSI_TEMPORAL_HEAD_ENABLED=y`, soak ≥5 min, check no Tmr Svc / task_wdt overflow. | Pending board reattach. |
## Module map
| File | Purpose |
|------|---------|
| `src/lib.rs` | C ABI: `ruv_temporal_init / push / classify / destroy / kernel_self_test` |
| `src/window.rs` | `FrameRing` rolling buffer used by `ruv_temporal_push` |
| `src/weights.rs` | Loader-side mirror of host `wifi_densepose_temporal::weights`. Parses the `.rvne` blob format (magic `RVNE`, version 1, FP32/FP16, CRC32-IEEE). Bit-exact with the host crate; a blob produced by the host's `WeightBlob::serialize()` parses here byte-for-byte. |
| `include/ruv_temporal.h` | Public C header consumed by `main/temporal_task.c` |
| `shim.c` | Empty C shim for `idf_component_register` |
## Phase 5 blocker — esp toolchain rust-src bug
The system esp toolchain at `C:\Users\ruv\.rustup\toolchains\esp` has
no precompiled `core` for `xtensa-esp32s3-none-elf`. It requires
`-Z build-std=core,alloc`, but the bundled rust-src snapshot
(`esp` channel, nightly 2025-09-16) hits two known bugs when build-std
compiles `core`:
1. `library/portable-simd/crates/core_simd/src/simd/ptr/mut_ptr.rs`
`Copy` trait and `size_of` not in scope, ~16,000 errors.
2. `library/core` itself — "cannot resolve a prelude import",
"attributes starting with `rustc` are reserved", `concat!` macro
not found.
These are upstream Rust nightly snapshot regressions, not anything
this component is doing wrong. The fix is to refresh the esp toolchain
to a newer nightly:
```powershell
C:/Users/ruv/.cargo/bin/espup.exe install
# (re-source export-esp.ps1 / export-esp.sh after install)
```
`espup install` pulls the latest pinned esp Rust + LLVM. It is a
~1.5 GB download and ~5-10 min install. That step lands in the next
loop iteration of #513 implementation work.
## Build (once Phase 5 unblocks)
From this directory:
```bash
cargo +esp build --release --target xtensa-esp32s3-none-elf
```
Output:
`target/xtensa-esp32s3-none-elf/release/libruv_temporal.a`.
ESP-IDF's `idf.py build` will pick this up via `CMakeLists.txt`
`add_custom_command` runs the cargo build before
`idf_component_register` consumes the static library.
## C ABI summary
```c
esp_err_t ruv_temporal_init(const uint8_t *weights, size_t wlen,
uint32_t input_dim, uint32_t window_len,
uint32_t n_classes,
ruv_temporal_ctx_t **out_ctx);
esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
float *logits, uint32_t n_classes);
void ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);
esp_err_t ruv_temporal_kernel_self_test(void);
```
Threading: caller is responsible. Per ADR-095 §3.3, the firmware will
spawn a single dedicated FreeRTOS task that owns the context and
serialises all calls — push() and classify() are not internally
synchronised.

View File

@ -0,0 +1,71 @@
/* SPDX-License-Identifier: MIT
*
* ESP32-S3 on-device temporal head public C ABI (ADR-095, #513).
*
* Consumed by edge_processing.c / adaptive_controller.c. Backed by a
* Rust staticlib that wraps `ruvllm_sparse_attention`. See
* components/ruv_temporal/src/lib.rs for the implementation.
*
* Threading: NOT internally synchronised. Per ADR-095 §3.3 callers run
* a single dedicated FreeRTOS task that owns the context and
* serialises push() and classify(). init() and destroy() are NOT safe
* against concurrent push/classify on the same handle.
*/
#pragma once
#include <stddef.h>
#include <stdint.h>
#include "esp_err.h"
#ifdef __cplusplus
extern "C" {
#endif
typedef struct RuvTemporalCtx ruv_temporal_ctx_t;
/* Allocate a temporal-head context.
*
* weights flat-buffer of model weights (Phase 5 wires the format),
* may be NULL during Phase 4 scaffolding.
* weights_len bytes of `weights`, 0 if weights is NULL.
* input_dim feature dimension per frame (e.g. 60 for rv_feature_state_t).
* window_len number of frames in the rolling window (e.g. 256).
* n_classes output logit count (e.g. 4 for gesture, 3 for fall).
* out_ctx receives the new context pointer on ESP_OK.
*
* Returns ESP_OK on success, ESP_ERR_INVALID_ARG for null/zero inputs,
* ESP_ERR_NO_MEM if buffer allocation fails.
*/
esp_err_t ruv_temporal_init(const uint8_t *weights,
size_t weights_len,
uint32_t input_dim,
uint32_t window_len,
uint32_t n_classes,
ruv_temporal_ctx_t **out_ctx);
/* Push one feature frame into the rolling window. Hot path — cheap,
* no allocation. `frame` must point to at least `input_dim` floats.
*/
esp_err_t ruv_temporal_push(ruv_temporal_ctx_t *ctx, const float *frame);
/* Run the temporal-head forward and write `n_classes` class logits
* into the caller-owned `logits` buffer (must be at least n_classes
* floats). `n_classes` must match the value passed to init().
*/
esp_err_t ruv_temporal_classify(ruv_temporal_ctx_t *ctx,
float *logits,
uint32_t n_classes);
/* Release a context allocated by ruv_temporal_init. Safe on NULL. */
void ruv_temporal_destroy(ruv_temporal_ctx_t *ctx);
/* Self-test — proves the upstream sparse-attention kernel links and
* runs. Returns ESP_OK on success. Useful as a smoke check on first
* boot before allocating a real context.
*/
esp_err_t ruv_temporal_kernel_self_test(void);
#ifdef __cplusplus
}
#endif

View File

@ -0,0 +1,6 @@
# Pin to the esp toolchain so casual `cargo build` (without +esp) lands
# on the xtensa-capable rustc/cargo. Per ADR-095, espup must be
# installed on every developer machine and CI runner.
[toolchain]
channel = "esp"

View File

@ -0,0 +1,10 @@
/* SPDX-License-Identifier: MIT
*
* Minimal C shim so ESP-IDF's idf_component_register has a SRCS file.
* The real C ABI lives in src/lib.rs (Rust staticlib) and is exposed
* through include/ruv_temporal.h.
*
* Intentionally empty do not put logic here.
*/
#include "ruv_temporal.h"

View File

@ -0,0 +1,242 @@
// On-ESP32-S3 temporal head — C ABI for the ESP-IDF firmware (ADR-095, #513).
//
// This crate is `staticlib` no_std + alloc. It is compiled to
// `xtensa-esp32s3-none-elf` and linked into the firmware via the ESP-IDF
// component glue in CMakeLists.txt. The host-side analog
// (`wifi-densepose-temporal`) tracks ADR-096; the two crates intentionally
// share the same `ruvllm_sparse_attention` kernel so behaviour is identical
// across host and node.
//
// Status (Phase 4 of #513): C ABI surface + ring buffer scaffold.
// - `ruv_temporal_init` ✓ scaffolded
// - `ruv_temporal_push` ✓ scaffolded (writes to ring buffer)
// - `ruv_temporal_classify` ✓ scaffolded (kernel forward stub)
// - `ruv_temporal_destroy` ✓ scaffolded
//
// Phase 5 wires real weights, panic_handler, and the global allocator to
// ESP-IDF's heap. Phase 6 wires the ABI calls from edge_processing.c into
// a dedicated FreeRTOS task per ADR-095 §3.3.
#![no_std]
#![no_main]
extern crate alloc;
use alloc::boxed::Box;
use core::ffi::c_void;
mod weights;
mod window;
use weights::{WeightBlobView, WeightLoadError};
use window::FrameRing;
// ---- ESP-IDF compatible error codes ---------------------------------------
//
// Matches the `esp_err_t` typedef in `esp_err.h`. We don't need the full
// set — these four cover the contract advertised in ruv_temporal.h.
const ESP_OK: i32 = 0;
const ESP_FAIL: i32 = -1;
const ESP_ERR_INVALID_ARG: i32 = 0x102;
const ESP_ERR_NO_MEM: i32 = 0x101;
// ---- Allocator ------------------------------------------------------------
//
// esp-alloc punches through to ESP-IDF's heap_caps_malloc. The ESP-IDF
// runtime calls `esp_alloc::HEAP.add_region(...)` from C startup before
// the first Rust allocation; without that wiring we'd hit OOM on the
// first Vec push. That wiring lands in Phase 5 along with the rest of
// the firmware-side glue.
#[global_allocator]
static ALLOCATOR: esp_alloc::EspHeap = esp_alloc::EspHeap::empty();
// ---- Panic handler --------------------------------------------------------
//
// Production firmware would route to ESP-IDF's `esp_system_abort` so the
// crash shows up in core dumps. For Phase 4 scaffolding we simply halt —
// keeps the staticlib self-contained without dragging in `esp-idf-sys`.
#[panic_handler]
fn on_panic(_info: &core::panic::PanicInfo) -> ! {
loop {
// wait-for-interrupt would be nicer; this is fine until Phase 5
// hooks into esp_system_abort.
}
}
// ---- Context object (opaque to C callers) ---------------------------------
pub struct RuvTemporalCtx {
input_dim: u32,
window_len: u32,
n_classes: u32,
ring: FrameRing,
}
// ---- Public C ABI ---------------------------------------------------------
/// Initialise a temporal-head context. Allocates and returns an opaque
/// pointer through `out_ctx`. Returns ESP_OK on success, an esp_err_t on
/// failure. Caller must release with `ruv_temporal_destroy`.
#[no_mangle]
pub extern "C" fn ruv_temporal_init(
weights: *const u8,
weights_len: usize,
input_dim: u32,
window_len: u32,
n_classes: u32,
out_ctx: *mut *mut RuvTemporalCtx,
) -> i32 {
if out_ctx.is_null() || input_dim == 0 || window_len == 0 || n_classes == 0 {
return ESP_ERR_INVALID_ARG;
}
// Optional weights blob: when caller passes a non-NULL pointer,
// parse and validate it. Caller can pass NULL during the Phase 4/5
// bring-up window when the kernel forward isn't actually consuming
// weights yet — we just want the parse path itself proven on the
// device. Once Phase 5 unblocks and the kernel is wired, Phase 6
// makes a non-NULL weights argument required.
if !weights.is_null() && weights_len > 0 {
// SAFETY: caller asserts the buffer covers `weights_len` bytes
// and outlives this call. Borrowed-slice parse — no copy.
let buf = unsafe { core::slice::from_raw_parts(weights, weights_len) };
match WeightBlobView::parse(buf) {
Ok(view) => {
// Sanity-check that the blob's declared shape matches
// the runtime arguments. A blob with input_dim=32 in
// a context configured for input_dim=16 is a deploy bug
// we want to catch at init() not at first classify().
if view.header.input_dim as u32 != input_dim
|| view.header.n_classes as u32 != n_classes
{
return ESP_ERR_INVALID_ARG;
}
// Phase 5+: stash view into the context for the kernel
// to consume. For now the parse itself is the proof
// that the format crossed the host/firmware boundary.
}
Err(e) => return weights::weight_load_err_to_esp(&e),
}
}
let ring = match FrameRing::new(window_len as usize, input_dim as usize) {
Some(r) => r,
None => return ESP_ERR_NO_MEM,
};
let ctx = Box::new(RuvTemporalCtx {
input_dim,
window_len,
n_classes,
ring,
});
unsafe { *out_ctx = Box::into_raw(ctx) };
ESP_OK
}
/// Push one feature frame into the rolling window. Hot path — must stay
/// cheap (no allocation, no kernel work).
#[no_mangle]
pub extern "C" fn ruv_temporal_push(ctx: *mut RuvTemporalCtx, frame: *const f32) -> i32 {
if ctx.is_null() || frame.is_null() {
return ESP_ERR_INVALID_ARG;
}
let ctx = unsafe { &mut *ctx };
let slice = unsafe { core::slice::from_raw_parts(frame, ctx.input_dim as usize) };
ctx.ring.push(slice);
ESP_OK
}
/// Run the temporal-head forward and write `n_classes` logits into the
/// caller-owned `logits` buffer. Returns ESP_OK on success.
///
/// Phase 4 stub: writes a zero-vector. Phase 5 wires the real
/// `SubquadraticSparseAttention::forward_gqa` over the ring buffer
/// contents. The signature is what edge_processing.c will call — that
/// part of the contract is stable now.
#[no_mangle]
pub extern "C" fn ruv_temporal_classify(
ctx: *mut RuvTemporalCtx,
logits: *mut f32,
n_classes: u32,
) -> i32 {
if ctx.is_null() || logits.is_null() {
return ESP_ERR_INVALID_ARG;
}
let ctx = unsafe { &*ctx };
if n_classes != ctx.n_classes {
return ESP_ERR_INVALID_ARG;
}
let out = unsafe { core::slice::from_raw_parts_mut(logits, n_classes as usize) };
for slot in out.iter_mut() {
*slot = 0.0;
}
let _ = ctx.window_len; // future: feed ring -> attention -> classifier head
ESP_OK
}
/// Release a context allocated by `ruv_temporal_init`.
#[no_mangle]
pub extern "C" fn ruv_temporal_destroy(ctx: *mut RuvTemporalCtx) {
if ctx.is_null() {
return;
}
unsafe {
drop(Box::from_raw(ctx));
}
}
// ---- Static guard ---------------------------------------------------------
//
// Force a *use* of the upstream crate so the link line proves the crate is
// reachable from the staticlib. Without this the compiler may strip the
// dependency entirely in Phase 4 since classify() doesn't yet call into it.
#[doc(hidden)]
#[no_mangle]
pub extern "C" fn ruv_temporal_kernel_self_test() -> i32 {
use ruvllm_sparse_attention::{SparseAttentionConfig, SubquadraticSparseAttention, Tensor3};
let cfg = SparseAttentionConfig {
window: 4,
block_size: 2,
global_tokens: alloc::vec![0],
causal: true,
use_log_stride: true,
use_landmarks: true,
sort_candidates: false,
};
if SubquadraticSparseAttention::new(cfg).is_err() {
return ESP_FAIL;
}
let _ = Tensor3::zeros(0, 1, 1);
ESP_OK
}
// Prevent dead-code drop of the C ABI when the linker is aggressive.
#[used]
static _ABI_KEEPALIVE: [extern "C" fn(); 5] = [
keepalive_init,
keepalive_push,
keepalive_classify,
keepalive_destroy,
keepalive_self_test,
];
extern "C" fn keepalive_init() {
let _ = ruv_temporal_init;
}
extern "C" fn keepalive_push() {
let _ = ruv_temporal_push;
}
extern "C" fn keepalive_classify() {
let _ = ruv_temporal_classify;
}
extern "C" fn keepalive_destroy() {
let _ = ruv_temporal_destroy;
}
extern "C" fn keepalive_self_test() {
let _ = ruv_temporal_kernel_self_test;
}
// Avoid "unused" warnings on the c_void import while the actual handle
// type is what callers receive.
const _: Option<*const c_void> = None;

View File

@ -0,0 +1,194 @@
// Firmware-side mirror of `wifi-densepose-temporal::weights`. Same wire
// format, same magic, same CRC polynomial — a blob produced by the
// host's `WeightBlob::serialize()` parses here byte-for-byte.
//
// no_std + alloc. The host side keeps weights as `Vec<u8>` because it
// owns the buffer; the firmware loader takes a borrowed `&[u8]` slice
// (the blob lives in flash via EMBED_FILES, or a heap mmap from NVS,
// neither of which the loader should re-allocate).
//
// Stays *byte-exact* in lockstep with `v2/crates/wifi-densepose-temporal/src/weights.rs`.
// When the host format changes, this file changes in the same commit
// and bumps `BLOB_VERSION`; mismatched versions refuse to load.
use core::convert::TryInto;
use core::fmt;
pub const BLOB_MAGIC: u32 = 0x5256_4E45; // "RVNE"
pub const BLOB_VERSION: u16 = 1;
pub const BLOB_HEADER_LEN: usize = 24;
pub const BLOB_FOOTER_LEN: usize = 4;
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum WeightDtype {
F32,
F16,
}
#[derive(Clone, Copy, Debug)]
pub struct WeightBlobHeader {
pub dtype: WeightDtype,
pub input_dim: u16,
pub n_q_heads: u16,
pub n_kv_heads: u16,
pub head_dim: u16,
pub n_layers: u16,
pub n_classes: u16,
}
impl WeightBlobHeader {
pub fn elem_bytes(&self) -> usize {
match self.dtype {
WeightDtype::F32 => 4,
WeightDtype::F16 => 2,
}
}
fn validate(&self) -> Result<(), WeightLoadError> {
if self.input_dim == 0
|| self.n_q_heads == 0
|| self.n_kv_heads == 0
|| self.head_dim == 0
{
return Err(WeightLoadError::ZeroDim);
}
if self.n_q_heads % self.n_kv_heads != 0 {
return Err(WeightLoadError::InvalidGqaRatio);
}
if self.n_layers == 0 || self.n_classes < 2 {
return Err(WeightLoadError::DegenerateShape);
}
Ok(())
}
}
/// A parsed view into a weights blob. Holds borrowed slices into the
/// caller-owned buffer — no allocation, no copy. The firmware's
/// kernel reads weights directly from this view.
#[derive(Clone, Copy)]
pub struct WeightBlobView<'a> {
pub header: WeightBlobHeader,
pub weights: &'a [u8],
}
impl<'a> WeightBlobView<'a> {
/// Parse a blob, validating magic / version / size / CRC. Returns
/// a borrowed view; the input `buf` must outlive the view.
pub fn parse(buf: &'a [u8]) -> Result<Self, WeightLoadError> {
if buf.len() < BLOB_HEADER_LEN + BLOB_FOOTER_LEN {
return Err(WeightLoadError::TooShort);
}
let magic = u32::from_le_bytes(buf[0..4].try_into().unwrap());
if magic != BLOB_MAGIC {
return Err(WeightLoadError::BadMagic);
}
let version = u16::from_le_bytes(buf[4..6].try_into().unwrap());
if version != BLOB_VERSION {
return Err(WeightLoadError::WrongVersion(version));
}
let flags = buf[6];
let dtype = match flags & 0x01 {
0 => WeightDtype::F32,
_ => WeightDtype::F16,
};
let input_dim = u16::from_le_bytes(buf[8..10].try_into().unwrap());
let n_q_heads = u16::from_le_bytes(buf[10..12].try_into().unwrap());
let n_kv_heads = u16::from_le_bytes(buf[12..14].try_into().unwrap());
let head_dim = u16::from_le_bytes(buf[14..16].try_into().unwrap());
let n_layers = u16::from_le_bytes(buf[16..18].try_into().unwrap());
let n_classes = u16::from_le_bytes(buf[18..20].try_into().unwrap());
let weights_len = u32::from_le_bytes(buf[20..24].try_into().unwrap()) as usize;
let expected = BLOB_HEADER_LEN + weights_len + BLOB_FOOTER_LEN;
if buf.len() != expected {
return Err(WeightLoadError::SizeMismatch);
}
let stored_crc = u32::from_le_bytes(buf[buf.len() - 4..].try_into().unwrap());
let computed = crc32_ieee(&buf[..buf.len() - 4]);
if stored_crc != computed {
return Err(WeightLoadError::CrcMismatch);
}
let header = WeightBlobHeader {
dtype,
input_dim,
n_q_heads,
n_kv_heads,
head_dim,
n_layers,
n_classes,
};
header.validate()?;
let weights_start = BLOB_HEADER_LEN;
let weights_end = weights_start + weights_len;
Ok(Self {
header,
weights: &buf[weights_start..weights_end],
})
}
}
/// Loader-side error. Distinct from the host-side `TemporalError` so
/// the firmware can map specific cases to specific `esp_err_t` codes.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum WeightLoadError {
TooShort,
BadMagic,
WrongVersion(u16),
SizeMismatch,
CrcMismatch,
ZeroDim,
InvalidGqaRatio,
DegenerateShape,
}
impl fmt::Display for WeightLoadError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
Self::TooShort => write!(f, "weight blob too short"),
Self::BadMagic => write!(f, "weight blob: bad magic"),
Self::WrongVersion(v) => write!(f, "weight blob: unsupported version {}", v),
Self::SizeMismatch => write!(f, "weight blob: declared length doesn't match buffer"),
Self::CrcMismatch => write!(f, "weight blob: CRC32 mismatch"),
Self::ZeroDim => write!(f, "weight blob: zero-valued dimension(s)"),
Self::InvalidGqaRatio => write!(f, "weight blob: n_q_heads not divisible by n_kv_heads"),
Self::DegenerateShape => write!(f, "weight blob: n_layers=0 or n_classes<2"),
}
}
}
/// Map loader errors to esp_err_t-style codes for the C ABI. Defined
/// here rather than in lib.rs so the mapping stays adjacent to the
/// error type and can't drift.
pub const fn weight_load_err_to_esp(err: &WeightLoadError) -> i32 {
match err {
WeightLoadError::TooShort
| WeightLoadError::BadMagic
| WeightLoadError::WrongVersion(_)
| WeightLoadError::SizeMismatch => 0x102, // ESP_ERR_INVALID_ARG
WeightLoadError::CrcMismatch => 0x10C, // ESP_ERR_INVALID_CRC
WeightLoadError::ZeroDim
| WeightLoadError::InvalidGqaRatio
| WeightLoadError::DegenerateShape => 0x103, // ESP_ERR_INVALID_SIZE
}
}
/// Same polynomial as `temporal_task.c::crc32_ieee` and the host-side
/// `wifi_densepose_temporal::weights::crc32_ieee`. The whole point of
/// keeping it bit-for-bit identical across all three sites is so a
/// blob round-trips without re-computing.
fn crc32_ieee(data: &[u8]) -> u32 {
let mut crc = 0xFFFF_FFFFu32;
for &b in data {
crc ^= b as u32;
for _ in 0..8 {
let mask = 0u32.wrapping_sub(crc & 1);
crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
}
}
!crc
}

View File

@ -0,0 +1,74 @@
// Rolling frame buffer for the temporal head input window (ADR-095 §3.2).
//
// The hot path (`ruv_temporal_push`) writes one frame per call. The
// buffer is sized at `init` time; pushes wrap. `classify` reads the
// most-recent `window_len` frames in chronological order, oldest-first.
//
// Allocation policy: one `Vec<f32>` of size `window_len * input_dim`,
// owned by the context. No per-push allocation — we just memcpy into
// the next slot.
use alloc::vec;
use alloc::vec::Vec;
pub struct FrameRing {
buf: Vec<f32>,
window_len: usize,
input_dim: usize,
next_write: usize,
filled: usize,
}
impl FrameRing {
pub fn new(window_len: usize, input_dim: usize) -> Option<Self> {
if window_len == 0 || input_dim == 0 {
return None;
}
let total = window_len.checked_mul(input_dim)?;
Some(Self {
buf: vec![0.0; total],
window_len,
input_dim,
next_write: 0,
filled: 0,
})
}
pub fn push(&mut self, frame: &[f32]) {
let n = core::cmp::min(frame.len(), self.input_dim);
let off = self.next_write * self.input_dim;
self.buf[off..off + n].copy_from_slice(&frame[..n]);
// Zero-pad tail when the caller's frame is shorter than input_dim.
for s in &mut self.buf[off + n..off + self.input_dim] {
*s = 0.0;
}
self.next_write = (self.next_write + 1) % self.window_len;
if self.filled < self.window_len {
self.filled += 1;
}
}
/// Iterate over the buffer in chronological order, oldest-first.
/// Yields one slice of `input_dim` floats per call. Used by
/// `ruv_temporal_classify` to flatten into the kernel input.
pub fn iter_chronological(&self) -> impl Iterator<Item = &[f32]> + '_ {
let start = if self.filled < self.window_len {
0
} else {
self.next_write
};
(0..self.filled).map(move |i| {
let row = (start + i) % self.window_len;
let off = row * self.input_dim;
&self.buf[off..off + self.input_dim]
})
}
pub fn len(&self) -> usize {
self.filled
}
pub fn capacity(&self) -> usize {
self.window_len
}
}

View File

@ -9,10 +9,19 @@ set(SRCS
"rv_feature_state.c"
"rv_mesh.c"
"adaptive_controller.c"
# ADR-095 / #513 on-device temporal head (no-op shims when CONFIG_CSI_TEMPORAL_HEAD_ENABLED off)
"temporal_task.c"
)
set(REQUIRES "")
# ADR-095: link the Rust ruv_temporal staticlib only when the feature is on,
# so the default firmware build doesn't depend on the (currently blocked)
# esp Rust toolchain.
if(CONFIG_CSI_TEMPORAL_HEAD_ENABLED)
list(APPEND REQUIRES ruv_temporal)
endif()
# ADR-061: Mock CSI generator for QEMU testing + ADR-081 mock radio binding
if(CONFIG_CSI_MOCK_ENABLED)
list(APPEND SRCS "mock_csi.c" "rv_radio_ops_mock.c")

View File

@ -323,3 +323,56 @@ menu "Mock CSI (QEMU Testing)"
depends on CSI_MOCK_ENABLED
default n
endmenu
menu "On-device temporal head (ADR-095, #513)"
config CSI_TEMPORAL_HEAD_ENABLED
bool "Enable on-device temporal-head classification"
default n
help
Compiles the ruv_temporal FreeRTOS task that runs a learned
transformer-style temporal head over the rv_feature_state
stream. Backed by the Rust ruvllm_sparse_attention staticlib
in components/ruv_temporal/. Default off — the Rust component
requires the esp Rust toolchain (see component README) and
adds ~376 KB to the firmware image. Off-board (8 MB) only
until the binary delta is measured on real hardware.
config TEMPORAL_INPUT_DIM
int "Input feature dimension"
depends on CSI_TEMPORAL_HEAD_ENABLED
default 16
range 1 256
help
Per-frame feature dimension fed into the temporal head.
16 matches a small projection of rv_feature_state_t; bump
after the host-side training crate fixes the model schema.
config TEMPORAL_WINDOW_LEN
int "Rolling window length (frames)"
depends on CSI_TEMPORAL_HEAD_ENABLED
default 256
range 32 1024
help
Number of feature frames the temporal head reasons over.
256 frames at the controller's 5 Hz fast-loop rate is ~50 s.
config TEMPORAL_N_CLASSES
int "Number of output classes"
depends on CSI_TEMPORAL_HEAD_ENABLED
default 4
range 2 16
help
Number of classification logits the model produces. Must be
≤ TEMPORAL_MAX_LOGITS in temporal_task.c (16).
config TEMPORAL_CLASSIFY_PERIOD_MS
int "Classification cadence (ms)"
depends on CSI_TEMPORAL_HEAD_ENABLED
default 1000
range 100 60000
help
How often the temporal task runs ruv_temporal_classify and
emits a 0xC5110007 packet. Default 1 s.
endmenu

View File

@ -19,6 +19,7 @@
#include "edge_processing.h"
#include "stream_sender.h"
#include "csi_collector.h"
#include "temporal_task.h" /* ADR-095 / #513: on-device temporal head */
#include <string.h>
#include "freertos/FreeRTOS.h"
@ -314,6 +315,18 @@ static void emit_feature_state(void)
if (sent < 0) {
ESP_LOGW(TAG, "feature_state emit failed");
}
/* ADR-095 / #513: feed the same 9 feature floats into the on-device
* temporal head if it is enabled. Non-blocking drops are logged
* by temporal_task itself, never by us. With CONFIG_CSI_TEMPORAL_HEAD_ENABLED
* off, this resolves to a single ESP_ERR_NOT_SUPPORTED return. */
const float feat[9] = {
pkt.motion_score, pkt.presence_score,
pkt.respiration_bpm, pkt.respiration_conf,
pkt.heartbeat_bpm, pkt.heartbeat_conf,
pkt.anomaly_score, pkt.env_shift_score, pkt.node_coherence,
};
(void)temporal_task_push_frame(feat, 9);
}
static void slow_loop_cb(TimerHandle_t t)

View File

@ -21,6 +21,7 @@
#include "csi_collector.h"
#include "stream_sender.h"
#include "temporal_task.h" /* ADR-095 / #513 */
#include "nvs_config.h"
#include "edge_processing.h"
#include "ota_update.h"
@ -310,6 +311,22 @@ void app_main(void)
esp_err_to_name(adapt_ret));
}
/* ADR-095 / #513: spin up the on-device temporal head. Returns
* ESP_ERR_NOT_SUPPORTED when CONFIG_CSI_TEMPORAL_HEAD_ENABLED is
* off that is the default and not an error. The fast loop
* pushes feature frames; the task runs classify at a slower
* cadence and emits 0xC5110007 packets. */
#ifdef CONFIG_CSI_TEMPORAL_HEAD_ENABLED
esp_err_t tmp_ret = temporal_task_start(
(uint32_t)CONFIG_TEMPORAL_INPUT_DIM,
(uint32_t)CONFIG_TEMPORAL_WINDOW_LEN,
(uint32_t)CONFIG_TEMPORAL_N_CLASSES);
if (tmp_ret != ESP_OK) {
ESP_LOGW(TAG, "temporal task init failed: %s",
esp_err_to_name(tmp_ret));
}
#endif
/* Initialize power management. */
power_mgmt_init(g_nvs_config.power_duty);

View File

@ -0,0 +1,304 @@
/**
* @file temporal_task.c
* @brief ADR-095 / #513 On-device temporal head FreeRTOS task.
*
* Owns the only `ruv_temporal_ctx_t` in the firmware. Receives feature
* frames from the adaptive_controller fast loop via a FreeRTOS queue,
* pushes them into the rolling window, and at ~1 Hz runs a
* classification forward through the Rust `ruvllm_sparse_attention`
* staticlib (when built see CONFIG_CSI_TEMPORAL_HEAD_ENABLED).
*
* The whole file compiles down to no-op shims when the feature is off,
* so adaptive_controller.c can call `temporal_task_push_frame()`
* unconditionally the function returns ESP_ERR_NOT_SUPPORTED and
* costs one nullable check.
*/
#include "temporal_task.h"
#include <string.h>
#include "esp_log.h"
#include "esp_timer.h"
#include "sdkconfig.h"
static const char *TAG = "temporal";
#ifdef CONFIG_CSI_TEMPORAL_HEAD_ENABLED
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "freertos/task.h"
#include "csi_collector.h" /* node_id */
#include "stream_sender.h"
#include "ruv_temporal.h" /* C ABI from components/ruv_temporal */
/* Queue depth — picked so that the adaptive controller's fast loop
* (default 5 Hz) can't overrun the temporal task even if classify()
* stalls for ~6 s. Drops beyond that are logged. */
#define TEMPORAL_QUEUE_DEPTH 32
/* Stack sized per ADR-095 §3.3. The kernel forward + intermediate
* tensors are bounded by `forward_flash` tiling, but rv_feature_state
* marshalling, logging, and stream_sender_send all share this stack. */
#define TEMPORAL_TASK_STACK 16384
/* Pinned to Core 1, like edge_dsp. WiFi runs on Core 0 — keep them
* apart so the temporal forward doesn't compete with CSI capture. */
#define TEMPORAL_TASK_CORE 1
/* Classification cadence in milliseconds. 1 Hz is the ADR-095 §3 default. */
#ifndef CONFIG_TEMPORAL_CLASSIFY_PERIOD_MS
#define CONFIG_TEMPORAL_CLASSIFY_PERIOD_MS 1000
#endif
/* Maximum logits buffer — sized to the largest n_classes any of the
* ADR-095 §4 use cases needs (anomaly = 2, fall = 3, gesture = 8). */
#define TEMPORAL_MAX_LOGITS 16
/* ---- Module state ----------------------------------------------------- */
typedef struct {
float frame[TEMPORAL_MAX_LOGITS * 8]; /* generous; trimmed via input_dim */
uint32_t frame_len;
} temporal_msg_t;
static QueueHandle_t s_queue;
static TaskHandle_t s_task;
static ruv_temporal_ctx_t *s_ctx;
static uint32_t s_input_dim;
static uint32_t s_window_len;
static uint32_t s_n_classes;
static uint32_t s_seq;
static uint32_t s_drop_count;
static uint64_t s_last_drop_log_us;
/* Lightweight CRC32 (IEEE 802.3 polynomial 0xEDB88320), table-free.
* Used only for the 36-byte classification packet speed isn't
* critical. Existing firmware has its own CRC32 in csi_collector.c
* but we don't link against it from here to keep coupling narrow. */
static uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
uint32_t crc = 0xFFFFFFFFu;
for (size_t i = 0; i < len; i++) {
crc ^= data[i];
for (int b = 0; b < 8; b++) {
uint32_t mask = -(int32_t)(crc & 1u);
crc = (crc >> 1) ^ (0xEDB88320u & mask);
}
}
return ~crc;
}
static void emit_classification(const float *logits, uint32_t n)
{
/* Find argmax + margin in one pass. */
uint32_t argmax = 0;
float top1 = logits[0];
float top2 = -1e30f;
for (uint32_t i = 1; i < n; i++) {
float v = logits[i];
if (v > top1) {
top2 = top1;
top1 = v;
argmax = i;
} else if (v > top2) {
top2 = v;
}
}
rv_temporal_pkt_t pkt;
memset(&pkt, 0, sizeof(pkt));
pkt.magic = RV_TEMPORAL_PKT_MAGIC;
pkt.version = 1;
pkt.n_classes = (uint16_t)n;
pkt.node_id = csi_collector_get_node_id();
pkt.ts_us = (uint64_t)esp_timer_get_time();
pkt.seq = ++s_seq;
pkt.argmax = (uint8_t)argmax;
pkt.top_logit = top1;
pkt.top1_minus_top2 = top1 - top2;
pkt.crc32 = crc32_ieee((const uint8_t *)&pkt, sizeof(pkt) - sizeof(pkt.crc32));
int sent = stream_sender_send((const uint8_t *)&pkt, sizeof(pkt));
if (sent < 0) {
ESP_LOGW(TAG, "classification emit failed");
}
}
static void temporal_task_loop(void *arg)
{
(void)arg;
ESP_LOGI(TAG, "temporal task online (window=%u dim=%u classes=%u core=%d)",
(unsigned)s_window_len, (unsigned)s_input_dim,
(unsigned)s_n_classes, TEMPORAL_TASK_CORE);
/* Self-test the kernel link before touching real frames. */
if (ruv_temporal_kernel_self_test() != ESP_OK) {
ESP_LOGE(TAG, "ruv_temporal_kernel_self_test FAILED — temporal head disabled");
s_ctx = NULL;
vTaskDelete(NULL);
return;
}
uint64_t next_classify_us = esp_timer_get_time()
+ (uint64_t)CONFIG_TEMPORAL_CLASSIFY_PERIOD_MS * 1000ull;
float logits[TEMPORAL_MAX_LOGITS];
for (;;) {
temporal_msg_t msg;
/* Block up to 100 ms for a frame, then check if it's time to
* classify. This double-poll keeps the cadence honest even
* during long quiet periods. */
if (xQueueReceive(s_queue, &msg, pdMS_TO_TICKS(100)) == pdTRUE) {
if (s_ctx != NULL) {
(void)ruv_temporal_push(s_ctx, msg.frame);
}
}
uint64_t now_us = esp_timer_get_time();
if (now_us >= next_classify_us && s_ctx != NULL) {
esp_err_t cret = ruv_temporal_classify(s_ctx, logits, s_n_classes);
if (cret == ESP_OK) {
emit_classification(logits, s_n_classes);
} else {
ESP_LOGW(TAG, "classify returned 0x%x", (unsigned)cret);
}
next_classify_us = now_us
+ (uint64_t)CONFIG_TEMPORAL_CLASSIFY_PERIOD_MS * 1000ull;
}
/* Coalesce drop-count logs to once per second so a backlog
* doesn't flood the serial console. */
if (s_drop_count > 0 && now_us - s_last_drop_log_us > 1000000ull) {
ESP_LOGW(TAG, "queue full — dropped %u feature frames",
(unsigned)s_drop_count);
s_drop_count = 0;
s_last_drop_log_us = now_us;
}
}
}
esp_err_t temporal_task_start(uint32_t input_dim,
uint32_t window_len,
uint32_t n_classes)
{
if (s_task != NULL) {
return ESP_OK; /* idempotent */
}
if (input_dim == 0 || window_len == 0 || n_classes == 0) {
return ESP_ERR_INVALID_ARG;
}
if (n_classes > TEMPORAL_MAX_LOGITS) {
ESP_LOGE(TAG, "n_classes=%u exceeds TEMPORAL_MAX_LOGITS=%d",
(unsigned)n_classes, TEMPORAL_MAX_LOGITS);
return ESP_ERR_INVALID_SIZE;
}
/* Allocate the kernel context. Phase 4 stub returns ESP_OK without
* weights; Phase 5b will accept a real weights blob. */
esp_err_t ret = ruv_temporal_init(NULL, 0, input_dim, window_len, n_classes,
&s_ctx);
if (ret != ESP_OK) {
ESP_LOGE(TAG, "ruv_temporal_init failed: 0x%x", (unsigned)ret);
return ret;
}
s_input_dim = input_dim;
s_window_len = window_len;
s_n_classes = n_classes;
s_seq = 0;
s_drop_count = 0;
s_last_drop_log_us = 0;
s_queue = xQueueCreate(TEMPORAL_QUEUE_DEPTH, sizeof(temporal_msg_t));
if (s_queue == NULL) {
ESP_LOGE(TAG, "queue create failed");
ruv_temporal_destroy(s_ctx);
s_ctx = NULL;
return ESP_ERR_NO_MEM;
}
BaseType_t ok = xTaskCreatePinnedToCore(
temporal_task_loop, "ruv_temporal", TEMPORAL_TASK_STACK,
NULL, 4 /* priority, below edge_dsp */,
&s_task, TEMPORAL_TASK_CORE);
if (ok != pdPASS) {
ESP_LOGE(TAG, "task create failed");
vQueueDelete(s_queue);
s_queue = NULL;
ruv_temporal_destroy(s_ctx);
s_ctx = NULL;
return ESP_ERR_NO_MEM;
}
return ESP_OK;
}
esp_err_t temporal_task_push_frame(const float *frame, uint32_t frame_len)
{
if (frame == NULL || frame_len == 0) {
return ESP_ERR_INVALID_ARG;
}
if (s_queue == NULL) {
return ESP_ERR_NOT_FOUND;
}
temporal_msg_t msg;
uint32_t cap = (uint32_t)(sizeof(msg.frame) / sizeof(msg.frame[0]));
uint32_t n = (frame_len < cap) ? frame_len : cap;
if (n < s_input_dim) {
/* Pad short frames with zeros so the rolling window stays
* dimension-stable from the kernel's perspective. */
memcpy(msg.frame, frame, n * sizeof(float));
memset(&msg.frame[n], 0, (s_input_dim - n) * sizeof(float));
msg.frame_len = s_input_dim;
} else {
memcpy(msg.frame, frame, s_input_dim * sizeof(float));
msg.frame_len = s_input_dim;
}
/* Non-blocking — temporal head is best-effort. */
if (xQueueSend(s_queue, &msg, 0) != pdPASS) {
s_drop_count++;
return ESP_ERR_TIMEOUT;
}
return ESP_OK;
}
void temporal_task_stop(void)
{
if (s_task != NULL) {
vTaskDelete(s_task);
s_task = NULL;
}
if (s_queue != NULL) {
vQueueDelete(s_queue);
s_queue = NULL;
}
if (s_ctx != NULL) {
ruv_temporal_destroy(s_ctx);
s_ctx = NULL;
}
}
#else /* !CONFIG_CSI_TEMPORAL_HEAD_ENABLED */
esp_err_t temporal_task_start(uint32_t input_dim,
uint32_t window_len,
uint32_t n_classes)
{
(void)input_dim;
(void)window_len;
(void)n_classes;
return ESP_ERR_NOT_SUPPORTED;
}
esp_err_t temporal_task_push_frame(const float *frame, uint32_t frame_len)
{
(void)frame;
(void)frame_len;
return ESP_ERR_NOT_SUPPORTED;
}
void temporal_task_stop(void) {}
#endif /* CONFIG_CSI_TEMPORAL_HEAD_ENABLED */

View File

@ -0,0 +1,98 @@
/* SPDX-License-Identifier: MIT
*
* temporal_task.h On-device temporal head FreeRTOS task (ADR-095, #513).
*
* Owns the lifecycle of the `ruv_temporal_ctx_t` from
* components/ruv_temporal/include/ruv_temporal.h. Exposes:
*
* 1. `temporal_task_start()` spawn the task with its own 16 KB stack
* pinned to Core 1, allocate a feed queue. Caller (main.c) ignores
* ESP_ERR_NOT_SUPPORTED when CONFIG_CSI_TEMPORAL_HEAD_ENABLED is off.
* 2. `temporal_task_push_frame()` non-blocking enqueue from the
* adaptive_controller fast loop. Drops on full queue (logs once
* per second) the temporal head is best-effort, the physics-only
* path keeps producing vitals regardless.
* 3. `temporal_task_stop()` cleanly tear down (currently used only
* for tests; production firmware never calls this).
*
* Thread safety: per ADR-095 §3.3 the temporal task itself is the
* single owner of the underlying `ruv_temporal_ctx_t`. Callers
* communicate exclusively via the FreeRTOS queue.
*
* Output: every ~1 s the task runs `ruv_temporal_classify` and emits a
* `0xC5110007 RV_TEMPORAL_CLASSIFICATION` packet via stream_sender.
*/
#pragma once
#include <stdint.h>
#include "esp_err.h"
#ifdef __cplusplus
extern "C" {
#endif
/* Magic for the classification packet (ADR-095 §3.5). 0xC5110001..0006
* are taken; 0007 is the next free slot. */
#define RV_TEMPORAL_PKT_MAGIC 0xC5110007u
/* On-the-wire packet for one classification result. Little-endian.
* Size: 40 bytes. CRC covers everything before it.
*
* Field layout (bytes):
* [00..04) magic 4
* [04..06) version 2
* [06..08) n_classes 2
* [08..09) node_id 1
* [09..0C) reserved 3
* [0C..14) ts_us 8
* [14..18) seq 4
* [18..19) argmax 1
* [19..1C) reserved2 3
* [1C..20) top_logit 4
* [20..24) top1_minus_top2 4
* [24..28) crc32 4
* total: 40
*/
typedef struct __attribute__((packed)) {
uint32_t magic; /* 0xC5110007 */
uint16_t version; /* 1 */
uint16_t n_classes; /* matches init() value */
uint8_t node_id; /* csi_collector_get_node_id() */
uint8_t reserved[3];
uint64_t ts_us; /* esp_timer_get_time() at classify */
uint32_t seq; /* monotonic, increments per emit */
uint8_t argmax; /* highest-logit class */
uint8_t reserved2[3];
float top_logit; /* logits[argmax] */
float top1_minus_top2; /* margin — useful for downstream gating */
uint32_t crc32;
} rv_temporal_pkt_t;
/* Build-time guard so the wire format never silently changes. */
_Static_assert(sizeof(rv_temporal_pkt_t) == 40,
"rv_temporal_pkt_t must be 40 bytes (ADR-095 §3.5)");
/* Start the temporal task. Returns ESP_ERR_NOT_SUPPORTED when the
* feature is compiled out caller should treat that as a non-error
* and continue. Returns ESP_OK on success.
*
* input_dim : feature dimension per frame (e.g. 60 for rv_feature_state_t)
* window_len : rolling window in frames (e.g. 256)
* n_classes : number of output logits the model produces (e.g. 4)
*/
esp_err_t temporal_task_start(uint32_t input_dim,
uint32_t window_len,
uint32_t n_classes);
/* Non-blocking push from the adaptive_controller fast loop. Returns
* ESP_OK on enqueue, ESP_ERR_NOT_FOUND if the task isn't running,
* ESP_ERR_TIMEOUT if the queue was full. Never blocks the caller. */
esp_err_t temporal_task_push_frame(const float *frame, uint32_t frame_len);
/* Optional teardown — currently unit-test only. */
void temporal_task_stop(void);
#ifdef __cplusplus
}
#endif

130
v2/Cargo.lock generated
View File

@ -231,6 +231,18 @@ dependencies = [
"wait-timeout",
]
[[package]]
name = "async-compression"
version = "0.4.42"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e79b3f8a79cccc2898f31920fc69f304859b3bd567490f75ebf51ae1c792a9ac"
dependencies = [
"compression-codecs",
"compression-core",
"pin-project-lite",
"tokio",
]
[[package]]
name = "async-trait"
version = "0.1.89"
@ -318,7 +330,7 @@ dependencies = [
"sync_wrapper 1.0.2",
"tokio",
"tokio-tungstenite",
"tower",
"tower 0.5.3",
"tower-layer",
"tower-service",
"tracing",
@ -871,6 +883,23 @@ dependencies = [
"memchr",
]
[[package]]
name = "compression-codecs"
version = "0.4.38"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ce2548391e9c1929c21bf6aa2680af86fe4c1b33e6cea9ac1cfeec0bd11218cf"
dependencies = [
"compression-core",
"flate2",
"memchr",
]
[[package]]
name = "compression-core"
version = "0.4.32"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cc14f565cf027a105f7a44ccf9e5b424348421a1d8952a8fc9d499d313107789"
[[package]]
name = "concurrent-queue"
version = "2.5.0"
@ -2371,6 +2400,16 @@ version = "0.16.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "841d1cc9bed7f9236f321df977030373f4a4163ae1a7dbfe1a51a2c1a51d9100"
[[package]]
name = "hdrhistogram"
version = "7.5.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "765c9198f173dd59ce26ff9f95ef0aafd0a0fe01fb9d72841bc5066a4c06511d"
dependencies = [
"byteorder",
"num-traits",
]
[[package]]
name = "heapless"
version = "0.6.1"
@ -3892,13 +3931,35 @@ name = "nvsim"
version = "0.3.0"
dependencies = [
"approx 0.5.1",
"criterion",
"js-sys",
"rand 0.8.5",
"rand_chacha 0.3.1",
"serde",
"serde-wasm-bindgen",
"serde_json",
"sha2",
"thiserror 1.0.69",
"tracing",
"wasm-bindgen",
]
[[package]]
name = "nvsim-server"
version = "0.3.0"
dependencies = [
"axum",
"clap",
"futures-util",
"nvsim",
"serde",
"serde_json",
"thiserror 1.0.69",
"tokio",
"tower 0.4.13",
"tower-http 0.5.2",
"tracing",
"tracing-subscriber",
]
[[package]]
@ -4487,6 +4548,26 @@ dependencies = [
"siphasher 1.0.2",
]
[[package]]
name = "pin-project"
version = "1.1.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f1749c7ed4bcaf4c3d0a3efc28538844fb29bcdd7d2b67b2be7e20ba861ff517"
dependencies = [
"pin-project-internal",
]
[[package]]
name = "pin-project-internal"
version = "1.1.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d9b20ed30f105399776b9c883e68e536ef602a16ae6f596d2c473591d6ad64c6"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.117",
]
[[package]]
name = "pin-project-lite"
version = "0.2.17"
@ -5278,7 +5359,7 @@ dependencies = [
"sync_wrapper 1.0.2",
"tokio",
"tokio-native-tls",
"tower",
"tower 0.5.3",
"tower-http 0.6.8",
"tower-service",
"url",
@ -5311,7 +5392,7 @@ dependencies = [
"sync_wrapper 1.0.2",
"tokio",
"tokio-util",
"tower",
"tower 0.5.3",
"tower-http 0.6.8",
"tower-service",
"url",
@ -5798,6 +5879,14 @@ version = "2.0.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "178f93f84a4a72c582026a45d9b8710acf188df4a22a25434c5dbba1df6c4cac"
[[package]]
name = "ruvllm_sparse_attention"
version = "0.1.1"
dependencies = [
"half",
"libm",
]
[[package]]
name = "ryu"
version = "1.0.23"
@ -7379,6 +7468,27 @@ dependencies = [
"zip 0.6.6",
]
[[package]]
name = "tower"
version = "0.4.13"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b8fa9be0de6cf49e536ce1851f987bd21a43b771b09473c3549a6c853db37c1c"
dependencies = [
"futures-core",
"futures-util",
"hdrhistogram",
"indexmap 1.9.3",
"pin-project",
"pin-project-lite",
"rand 0.8.5",
"slab",
"tokio",
"tokio-util",
"tower-layer",
"tower-service",
"tracing",
]
[[package]]
name = "tower"
version = "0.5.3"
@ -7401,8 +7511,10 @@ version = "0.5.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1e9cd434a998747dd2c4276bc96ee2e0c7a2eadf3cae88e52be55a05fa9053f5"
dependencies = [
"async-compression",
"bitflags 2.11.0",
"bytes",
"futures-core",
"futures-util",
"http 1.4.0",
"http-body 1.0.1",
@ -7433,7 +7545,7 @@ dependencies = [
"http-body 1.0.1",
"iri-string",
"pin-project-lite",
"tower",
"tower 0.5.3",
"tower-layer",
"tower-service",
]
@ -8385,6 +8497,7 @@ dependencies = [
"serde",
"serde_json",
"tokio",
"tower-http 0.5.2",
]
[[package]]
@ -8452,6 +8565,15 @@ dependencies = [
"wifi-densepose-ruvector",
]
[[package]]
name = "wifi-densepose-temporal"
version = "0.1.0"
dependencies = [
"approx 0.5.1",
"ruvllm_sparse_attention",
"thiserror 1.0.69",
]
[[package]]
name = "wifi-densepose-train"
version = "0.3.0"

View File

@ -16,6 +16,7 @@ members = [
"crates/wifi-densepose-wifiscan",
"crates/wifi-densepose-vitals",
"crates/wifi-densepose-ruvector",
"crates/wifi-densepose-temporal",
"crates/wifi-densepose-desktop",
"crates/wifi-densepose-pointcloud",
"crates/wifi-densepose-geo",
@ -131,6 +132,11 @@ ruvector-attention = "2.0.4"
ruvector-crv = "0.1.1"
ruvector-gnn = { version = "2.0.5", default-features = false }
# ruvllm sparse attention (path-vendored per ADR-095/096)
# Default-features=false keeps the kernel no_std-clean so the same workspace
# version is consumable by the upcoming ESP-IDF Rust component (ADR-095).
ruvllm_sparse_attention = { path = "../vendor/ruvector/crates/ruvllm_sparse_attention", default-features = false, features = ["fp16"] }
# Internal crates
wifi-densepose-core = { version = "0.3.0", path = "crates/wifi-densepose-core" }
@ -143,6 +149,7 @@ wifi-densepose-hardware = { version = "0.3.0", path = "crates/wifi-densepose-har
wifi-densepose-wasm = { version = "0.3.0", path = "crates/wifi-densepose-wasm" }
wifi-densepose-mat = { version = "0.3.0", path = "crates/wifi-densepose-mat" }
wifi-densepose-ruvector = { version = "0.3.0", path = "crates/wifi-densepose-ruvector" }
wifi-densepose-temporal = { version = "0.1.0", path = "crates/wifi-densepose-temporal" }
[profile.release]
lto = true

View File

@ -0,0 +1,27 @@
[package]
name = "wifi-densepose-temporal"
version = "0.1.0"
edition = "2021"
license = "MIT"
description = "AETHER temporal head for WiFi-DensePose — sparse-GQA attention over CSI feature windows (ADR-096)"
repository = "https://github.com/ruvnet/RuView"
[dependencies]
ruvllm_sparse_attention = { workspace = true }
thiserror = "1"
[dev-dependencies]
approx = "0.5"
[features]
default = []
# Enable FP16 KV cache path (mirrors the firmware-side ADR-095 build).
fp16 = []
[[example]]
name = "init_random_blob"
path = "examples/init_random_blob.rs"
[[example]]
name = "bench_speedup"
path = "examples/bench_speedup.rs"

View File

@ -0,0 +1,146 @@
# `wifi-densepose-temporal`
AETHER temporal head over CSI feature windows. Sparse-GQA attention via
`ruvllm_sparse_attention`, with a streaming `KvCache` decode path for
online re-ID and incremental classification.
Implements the host side of [ADR-096](../../../docs/adr/ADR-096-aether-temporal-head-sparse-gqa.md);
mirrored on the firmware side at
[`firmware/esp32-csi-node/components/ruv_temporal/`](../../../firmware/esp32-csi-node/components/ruv_temporal/).
## Quick start
```rust
use wifi_densepose_temporal::{AetherTemporalHead, TemporalHeadConfig, Tensor3};
// Default config matches AETHER's MQA shape:
// q_heads=4, kv_heads=1, head_dim=32, window=32, block_size=16, causal=true
let cfg = TemporalHeadConfig::default_aether();
let head = AetherTemporalHead::new(&cfg)?;
// Prefill: full window forward
let out = head.forward(&q, &k, &v)?; // shape: (window, q_heads, head_dim)
// Streaming: O(log T) per new frame against an accumulated cache
let mut cache = head.make_cache(/* capacity */ 1024)?;
for new_frame in stream {
let (q1, k1, v1) = project(&new_frame); // each seq=1
let attn_out = head.step(&q1, &k1, &v1, &mut cache)?;
// pool, run classifier head, etc
}
```
## Backends
`TemporalBackendKind` selects between two paths (ADR-096 §4.4):
| Backend | When | Cost |
|---|---|---|
| `SparseGqa` | New training runs (default) | O(N log N) prefill, O(log T) decode |
| `Dense` | Reserved for back-compat | Returns `TemporalError::DenseBackendNotImplemented` for now (ADR-096 §4.4 follow-up) |
The `SparseGqa` backend dispatches at `forward()` time:
- `q_heads == kv_heads``forward()` (sparse MHA)
- `q_heads != kv_heads``forward_gqa()` (GQA / MQA)
## Streaming semantics
`step()` is the structural advantage over dense MHA — append `(k, v)` to the
caller-owned cache and decode the new `q` in O(log T) per token.
- `q`/`k`/`v` must each have `seq == 1` (multi-token q is the prefill path).
- `KvCache` lifetime is the caller's. Per ADR-096 §8.5 the natural lifetime
is per-`PoseTrack` (re-ID) or per-session (online classification). When
the track drops, drop the cache.
- Cache fills are the caller's problem. Upstream H2O heavy-hitter eviction
is opt-in; this crate's wrapper doesn't pre-pick a policy.
Headline correctness test: `streaming_step_matches_forward_at_last_position`
proves token-by-token `step()` produces the same output as a single-shot
`forward()` at position `N-1`, max_abs_err < 1e-3.
## Weight blob format (`.rvne`)
Wire format for transferring trained weights to the firmware.
[`weights.rs`](src/weights.rs) defines the host side; the firmware mirror
at [`components/ruv_temporal/src/weights.rs`](../../../firmware/esp32-csi-node/components/ruv_temporal/src/weights.rs)
parses it bit-for-bit.
| Section | Bytes | Contents |
|---|---|---|
| Header | 24 | magic `RVNE` / version 1 / dtype flag (FP32 \| FP16) / dims |
| Weights | variable | flat per-layer arrays, dtype as flagged |
| Footer | 4 | CRC32-IEEE over everything before |
Hard-break versioning: bumping `version` means firmware refuses to load.
Adding fields goes behind reserved flag bits, never by reorder.
```rust
let blob = WeightBlob::new(header, weights)?;
let bytes = blob.serialize(); // host
// ...
let view = WeightBlobView::parse(&bytes)?; // firmware (no_std, borrowed slice)
```
## Examples
| Example | Run |
|---|---|
| `init_random_blob` | `cargo run -p wifi-densepose-temporal --example init_random_blob -- model.rvne` — emits a 41 KB AETHER-shaped `.rvne` |
| `bench_speedup` | `cargo run -p wifi-densepose-temporal --example bench_speedup --release` — sparse-vs-dense speedup curve |
Captured benchmark results: [`benches_results.md`](benches_results.md).
## Tests
```
cargo test -p wifi-densepose-temporal
```
| Suite | Tests | What |
|---|---|---|
| `tests/smoke.rs` | 5 | Forward at AETHER default, MHA dispatch, GQA dispatch, dense-rejected, invalid-GQA-rejected, N=1000 long window |
| `tests/weight_blob.rs` | 8 | Roundtrip FP32 + FP16, bad magic / version / size / CRC / GQA, layout anchor |
| `tests/blob_e2e.rs` | 2 | Realistic 25 KB+ filesystem roundtrip, deterministic seed reproducibility |
| `tests/streaming.rs` | 3 | step()-matches-forward at last position, multi-token-q rejected, make_cache shape |
**18/18 passing as of commit `247794a2c`.**
## Status of ADR-096 claims
| Claim | Status | Evidence |
|---|---|---|
| O(N log N) sparse vs O(N²) dense | **Empirically confirmed** | `bench_speedup` measures 21.21× at N=1024; cost-growth ratios match theory (dense 274×, sparse 24× for 16× more tokens) |
| `step()` matches `forward()` at last position | **Proven** | `streaming_step_matches_forward_at_last_position` test |
| Wire format consistent host↔firmware | **Proven** | 3 sites with same magic/version/CRC, 41-KB blob roundtrips through filesystem in tests |
| Path-vendored, no crates.io coupling | **Confirmed** | Workspace dep is `path = "../vendor/ruvector/crates/ruvllm_sparse_attention"` |
| 30100× at long windows | **Partial** | 21.21× measured at N=1024 in single-run wall-clock; higher N + criterion would push closer to the 30× lower bound |
## Status of ADR-095 surface (firmware)
`AetherTemporalHead` is the host-side analog of the firmware on-device path.
The firmware Rust component scaffold and C-side wiring are complete; the
Rust component cross-compile is currently blocked by an upstream esp-rs
nightly-bundle inconsistency. See
[`components/ruv_temporal/README.md`](../../../firmware/esp32-csi-node/components/ruv_temporal/README.md)
for details.
When the toolchain unblocks, no changes to this crate are needed —
`weights.rs` is already mirrored, `Tensor3` and `KvCache` cross the
boundary unchanged, and the C ABI consumed by `temporal_task.c` is stable.
## Open questions (still applicable from ADR-096 §8)
- The deployed AETHER tracker's actual window length is what determines
whether sparse pays off in production. At training default of 100 frames,
sparse begins to win (56× at N=128256). At the 1000-frame roadmap
target, the speedup is much larger (21× measured).
- Streaming GQA decode is an upstream roadmap item; the current
`decode_step` is wired for the MHA branch. When upstream ships GQA
decode (post-ADR-189/190), `AetherTemporalHead.step` gets a GQA dispatch
branch added without any public API change.
## License
MIT.

View File

@ -0,0 +1,72 @@
# Bench results — sparse vs dense prefill
Output of `cargo run -p wifi-densepose-temporal --example bench_speedup --release`
on a Windows 11 / x86_64 dev box, 2026-05-08. Single-run wall-clock,
pure-Rust vs pure-Rust (no SIMD/threads on either side). Reproduce by
running the example yourself; results vary 23× between machines and
power states, but the **trends across N** are what matter.
## Sparse-vs-dense prefill speedup
Config: `q_heads=4, kv_heads=4, head_dim=32, window=16, block_size=32, causal=true`.
| N | Dense (ms) | Sparse (ms) | Speedup |
|--------|-------------:|-------------:|--------:|
| 64 | 0.262 | 0.141 | 1.86× |
| 128 | 1.120 | 0.335 | 3.34× |
| 256 | 4.129 | 0.711 | 5.81× |
| 512 | 19.230 | 2.356 | 8.16× |
| 1024 | 71.904 | 3.389 | **21.21×** |
## Asymptotic check
ADR-096 §3.1 claimed dense scales as O(N²) and sparse as O(N log N).
The measured 64→1024 cost growth (16× more tokens) is:
| Path | 64 ms | 1024 ms | Growth | Theory |
|--------|------:|--------:|-------:|-------:|
| Dense | 0.262 | 71.904 | 274× | 256× = 16² |
| Sparse | 0.141 | 3.389 | 24× | ~27× = 16 · log(1024)/log(64) |
Dense's 274× growth matches `N²` cleanly. Sparse's 24× growth matches
`N log N` to within measurement noise. **The asymptotic complexity
claim is empirically supported on this hardware.**
## Why N=64 is only 1.86× and not faster
ADR-096 §3.1 already called this out: at the AETHER training default
of `window_frames = 100`, dense MHA is essentially free and the sparse
machinery has overhead — the per-token candidate-set construction,
landmark indexing, and global-token bookkeeping are constant-factor
costs that only amortize past N ≈ 200. The speedup-vs-N curve
inflects sharply between N=128 and N=256 because that's where dense's
N² term starts dominating its constants.
If a downstream consumer is using AETHER on 4-frame windows
(`proof.rs`, `trainer.rs`), this ADR pays nothing. The case rests
entirely on the long-window roadmap.
## What this benchmark doesn't measure
- **Decode-step latency.** `streaming_step_matches_forward_at_last_position`
proves correctness; this bench doesn't measure how fast `decode_step`
runs vs a hypothetical dense-MHA decode (which would be O(N²) recompute
every step — structurally not even comparable).
- **Memory.** KvCache + FP16 halves the K/V footprint vs FP32, which
matters more on the firmware than on x86_64 host. Phase 5 unblocking
is the prerequisite for measuring this on real hardware.
- **GQA dispatch.** This config uses `q_heads == kv_heads` to force
the MHA branch, so dense and sparse operate on the same shape.
Real AETHER will probably want `kv_heads=1` (MQA) which halves
the KV memory and is what the default head config picks.
## How to run
```
cargo run -p wifi-densepose-temporal --example bench_speedup --release
```
Release mode is mandatory. Debug builds run sparse 510× slower than
release because the candidate-set construction has tight inner loops
that benefit hard from `-O3`. Don't draw conclusions from `cargo run`
without `--release`.

View File

@ -0,0 +1,151 @@
// Measure sparse-GQA prefill cost vs dense MHA at N = {64, 128, 256, 512, 1024}.
// ADR-096 §3.1 claimed 30100× edge-evaluation reduction at long windows;
// this is the empirical check.
//
// Run with: cargo run -p wifi-densepose-temporal --example bench_speedup --release
//
// Caveat: single-run wall-clock on one machine — not a rigorous benchmark.
// Trends across N matter more than the absolute numbers, and results vary
// 23× between machines / power states. The point is to confirm the
// magnitude of the speedup is what the ADR claimed, not a perf-engineering
// dashboard. For that, use criterion + a dedicated machine.
use std::time::Instant;
use ruvllm_sparse_attention::{dense_attention, AttentionBackend, SparseAttentionConfig, SubquadraticSparseAttention, Tensor3};
use wifi_densepose_temporal::{TemporalBackendKind, TemporalHeadConfig, AetherTemporalHead};
fn make_qkv(seq: usize, heads: usize, dim: usize) -> (Tensor3, Tensor3, Tensor3) {
// Simple deterministic init — content doesn't matter for timing,
// but we want each benchmark run to use the same numbers.
let mut q = Tensor3::zeros(seq, heads, dim);
let mut k = Tensor3::zeros(seq, heads, dim);
let mut v = Tensor3::zeros(seq, heads, dim);
for s in 0..seq {
for h in 0..heads {
for d in 0..dim {
let qv = ((s * 31 + h * 7 + d) as f32).sin() * 0.1;
let kv = (((s * 17 + h * 3 + d) as f32).cos()) * 0.1;
q.set(s, h, d, qv);
k.set(s, h, d, kv);
v.set(s, h, d, kv * 0.5);
}
}
}
(q, k, v)
}
fn time_run<F: FnMut()>(label: &str, runs: usize, mut f: F) -> f64 {
// 1 warmup + `runs` measurements. Wall clock; release-mode only is
// meaningful (debug builds run sparse 510× slower than release).
f();
let start = Instant::now();
for _ in 0..runs {
f();
}
let total_ms = start.elapsed().as_secs_f64() * 1000.0;
let avg_ms = total_ms / runs as f64;
println!(" {label:<36} {avg_ms:>8.3} ms/run ({runs} runs)");
avg_ms
}
fn bench_at(seq: usize) -> (f64, f64, f64) {
println!();
println!("=== seq = {seq} ===");
// MHA shape (q_heads == kv_heads) so dense_attention and the sparse
// forward path operate on the same tensor shape — direct timing
// comparison without GQA bookkeeping confounding the result.
let heads = 4;
let dim = 32;
let (q, k, v) = make_qkv(seq, heads, dim);
// Dense reference. dense_attention is the upstream's naive O(N²)
// pure-Rust kernel — same scale, same shape, no SIMD acceleration —
// a fair head-to-head against the equally-pure-Rust sparse path.
let runs_dense = if seq <= 128 { 50 } else if seq <= 512 { 10 } else { 3 };
let dense_ms = time_run(
&format!("dense_attention (causal=true)"),
runs_dense,
|| {
let _ = dense_attention(&q, &k, &v, true).expect("dense forward");
},
);
// Sparse via the AETHER head wrapper — same code path the production
// training/inference would use, not the lower-level SubquadraticSparseAttention.
// Window/block_size kept small so the sparse pattern actually drops
// candidates at all benchmark lengths (otherwise at N=64 with default
// config we'd touch the entire sequence and look the same as dense).
let cfg = TemporalHeadConfig {
backend: TemporalBackendKind::SparseGqa,
q_heads: heads,
kv_heads: heads, // MHA — match dense
head_dim: dim,
window: 16,
block_size: 32,
causal: true,
};
let head = AetherTemporalHead::new(&cfg).expect("construct head");
let runs_sparse = if seq <= 128 { 50 } else if seq <= 512 { 30 } else { 10 };
let sparse_ms = time_run(
"AetherTemporalHead.forward (sparse)",
runs_sparse,
|| {
let _ = head.forward(&q, &k, &v).expect("sparse forward");
},
);
// Also measure SubquadraticSparseAttention directly — bypasses our
// wrapper, useful for confirming the wrapper isn't introducing
// measurable overhead.
let attn = SubquadraticSparseAttention::new(SparseAttentionConfig {
window: 16,
block_size: 32,
global_tokens: vec![0],
causal: true,
use_log_stride: true,
use_landmarks: true,
sort_candidates: false,
})
.expect("construct attn");
let raw_ms = time_run(
"Subquadratic.forward (raw, no wrapper)",
runs_sparse,
|| {
let _ = attn.forward(&q, &k, &v).expect("raw sparse forward");
},
);
let speedup = dense_ms / sparse_ms;
println!(" -> sparse/dense speedup {speedup:>6.2}×");
(dense_ms, sparse_ms, speedup)
}
fn main() {
println!("ADR-096 §3.1 empirical speedup check");
println!("====================================");
println!("Pure-Rust vs pure-Rust, no SIMD/threads, single-run wall-clock.");
println!("Trends across N matter more than absolute numbers.");
let lengths = [64, 128, 256, 512, 1024];
let mut rows: Vec<(usize, f64, f64, f64)> = Vec::new();
for &n in &lengths {
let (dense_ms, sparse_ms, speedup) = bench_at(n);
rows.push((n, dense_ms, sparse_ms, speedup));
}
println!();
println!("Summary");
println!(" N dense (ms) sparse (ms) speedup");
println!(" ---- ---------- ----------- -------");
for (n, d, s, sp) in &rows {
println!(" {n:<5} {d:>10.3} {s:>11.3} {sp:>5.2}×");
}
println!();
println!("ADR-096 §3.1 claim: ~30× edge reduction at N=8192,");
println!("growing roughly N/log(N). At N=1024 the claim is ~510×;");
println!("at N=64 the sparse machinery is overhead-bound (sparse may");
println!("lose, see ADR-096 §3.1 'honest framing' paragraph).");
}

View File

@ -0,0 +1,142 @@
// Emit a deterministic-seeded random weight blob in the .rvne format
// (ADR-095 / #513 Phase 1 of the training-side roadmap).
//
// This is a *demo*, not a trained model — the weights are PRNG output.
// Its purpose is to:
// 1. Document end-to-end how the host produces a blob (i.e. the
// example IS the recipe a real trainer follows: build a header,
// fill the weights buffer, call WeightBlob::new + .serialize(),
// write to disk).
// 2. Provide a reproducible test fixture the firmware loader can
// consume once the toolchain unblocks (ADR-095 Phase 5).
// 3. Anchor the byte-level format so refactors that change the
// output silently are caught by the byte-count assertion at
// the bottom.
//
// Usage:
// cargo run -p wifi-densepose-temporal --example init_random_blob
// cargo run -p wifi-densepose-temporal --example init_random_blob -- /tmp/model.rvne
use std::env;
use std::fs;
use std::path::PathBuf;
use wifi_densepose_temporal::{WeightBlob, WeightBlobHeader, WeightDtype};
/// Match the AETHER default head shape from
/// `TemporalHeadConfig::default_aether()` — staying coherent with the
/// crate's other defaults means a real trainer can drop this example
/// in as the starting point with one search-and-replace.
fn aether_default_header() -> WeightBlobHeader {
WeightBlobHeader {
dtype: WeightDtype::F32,
input_dim: 16,
n_q_heads: 4,
n_kv_heads: 1, // MQA — one shared K/V across the 4 query heads
head_dim: 32,
n_layers: 2,
n_classes: 4, // gesture-class default; firmware Kconfig matches
}
}
/// Compute the raw byte count for one transformer block at the given
/// shape. This is the *intent-of-the-format* number, kept here so
/// changes to it (and to the kernel's expectation) stay in sync.
///
/// Per-layer weights consist of:
/// - input projection : input_dim × (n_q_heads × head_dim) = Wq
/// - K projection : input_dim × (n_kv_heads × head_dim) = Wk
/// - V projection : input_dim × (n_kv_heads × head_dim) = Wv
/// - O projection : (n_q_heads × head_dim) × input_dim = Wo
fn per_layer_floats(h: &WeightBlobHeader) -> usize {
let id = h.input_dim as usize;
let q_total = h.n_q_heads as usize * h.head_dim as usize;
let kv_total = h.n_kv_heads as usize * h.head_dim as usize;
id * q_total // Wq
+ id * kv_total // Wk
+ id * kv_total // Wv
+ q_total * id // Wo
}
/// Plus a final classifier head: input_dim × n_classes.
fn classifier_floats(h: &WeightBlobHeader) -> usize {
h.input_dim as usize * h.n_classes as usize
}
/// xorshift64* — tiny deterministic PRNG. Don't use for crypto;
/// this is a fixed-seed init so two runs of the example produce
/// byte-identical blobs.
fn xorshift_step(state: &mut u64) -> u64 {
let mut x = *state;
x ^= x << 13;
x ^= x >> 7;
x ^= x << 17;
*state = x;
x.wrapping_mul(2685821657736338717u64)
}
/// Map the high 32 bits of a u64 to a small symmetric float in
/// [-0.1, 0.1). Tight bound so the resulting model produces sensible
/// pre-softmax logits even though it's untrained.
fn next_init_f32(state: &mut u64) -> f32 {
let bits = (xorshift_step(state) >> 32) as u32;
// Map to [0, 1) then scale to [-0.1, 0.1)
let unit = (bits as f32) / (u32::MAX as f32);
(unit - 0.5) * 0.2
}
fn build_random_weights(header: &WeightBlobHeader, seed: u64) -> Vec<u8> {
let total_floats =
per_layer_floats(header) * header.n_layers as usize + classifier_floats(header);
let mut out = Vec::with_capacity(total_floats * 4);
let mut state = seed;
for _ in 0..total_floats {
let f = next_init_f32(&mut state);
out.extend_from_slice(&f.to_le_bytes());
}
out
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
let path = env::args()
.nth(1)
.map(PathBuf::from)
.unwrap_or_else(|| PathBuf::from("model_init.rvne"));
let header = aether_default_header();
let weights = build_random_weights(&header, 0xC511_0007_DEAD_BEEFu64);
let weights_len = weights.len();
let blob = WeightBlob::new(header.clone(), weights)?;
let bytes = blob.serialize();
let serialized_len = bytes.len();
fs::write(&path, &bytes)?;
// Re-parse to prove the artifact we just wrote is loadable. Same
// path the firmware loader will follow once the toolchain unblocks.
let parsed = WeightBlob::parse(&fs::read(&path)?)?;
println!("wrote : {}", path.display());
println!("dtype : {:?}", parsed.header.dtype);
println!(
"shape : input_dim={}, q_heads={}, kv_heads={}, head_dim={}, layers={}, classes={}",
parsed.header.input_dim,
parsed.header.n_q_heads,
parsed.header.n_kv_heads,
parsed.header.head_dim,
parsed.header.n_layers,
parsed.header.n_classes,
);
println!(
"weights : {} bytes ({} f32 elements)",
weights_len,
weights_len / 4
);
println!(
"total : {} bytes (header 24 + weights {} + crc 4)",
serialized_len, weights_len
);
Ok(())
}

View File

@ -0,0 +1,70 @@
use crate::TemporalError;
/// Backend choice per ADR-096 §4.4.
///
/// * `Dense` — back-compat path against `ruvector-attention`. Reserved;
/// not yet implemented in this crate (returns a typed error so callers
/// can fail loudly during config validation rather than at forward()).
/// * `SparseGqa` — `ruvllm_sparse_attention` `forward_gqa` for prefill,
/// `decode_step` against `KvCache` for streaming inference.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum TemporalBackendKind {
Dense,
SparseGqa,
}
#[derive(Clone, Debug)]
pub struct TemporalHeadConfig {
pub backend: TemporalBackendKind,
/// Number of query heads. For pure MHA, equals `kv_heads`.
pub q_heads: usize,
/// Number of key/value heads. Must divide `q_heads`. GQA group size
/// is `q_heads / kv_heads`.
pub kv_heads: usize,
/// Per-head feature dimension.
pub head_dim: usize,
/// Local attention window radius (sparse pattern primitive #1, ADR-096 §3).
pub window: usize,
/// Landmark block size (sparse pattern primitive #3).
pub block_size: usize,
/// Whether the attention is causal. AETHER temporal aggregation is
/// causal (cannot peek at future CSI frames during streaming re-ID).
pub causal: bool,
}
impl TemporalHeadConfig {
/// Default config sized for the AETHER training default
/// (`window_frames = 100`) but with the sparse machinery wired up
/// so the long-window roadmap (10 s / 1000 frames) only requires
/// changing `window` at the call site, not re-architecting.
pub fn default_aether() -> Self {
Self {
backend: TemporalBackendKind::SparseGqa,
q_heads: 4,
kv_heads: 1, // MQA — collapses to one shared K/V across query heads
head_dim: 32,
window: 32,
block_size: 16,
causal: true,
}
}
pub fn validate(&self) -> Result<(), TemporalError> {
if self.q_heads == 0 || self.kv_heads == 0 || self.head_dim == 0 {
return Err(TemporalError::InvalidConfig(
"q_heads, kv_heads, head_dim must all be > 0",
));
}
if self.q_heads % self.kv_heads != 0 {
return Err(TemporalError::InvalidConfig(
"q_heads must be divisible by kv_heads (GQA constraint)",
));
}
if self.block_size == 0 {
return Err(TemporalError::InvalidConfig("block_size must be > 0"));
}
Ok(())
}
}

View File

@ -0,0 +1,44 @@
use ruvllm_sparse_attention::{dense_attention, Tensor3};
use crate::{TemporalError, TemporalHeadConfig};
/// Dense MHA backend (ADR-096 §5 A/B baseline).
///
/// Wraps upstream `dense_attention` — the naive O(N²) reference kernel.
/// Same approximation surface as classical scaled-dot-product attention,
/// no log-stride / landmarks / windowing. Exists primarily as the
/// reference path for the §5 validation gate (rank correlation,
/// contrastive-loss parity, latency baseline).
///
/// Has no streaming counterpart: dense MHA structurally cannot do
/// O(log T) decode — every new token requires recomputing the full
/// attention matrix. Callers that want streaming must use SparseGqa.
pub struct DenseHead {
causal: bool,
cfg: TemporalHeadConfig,
}
impl DenseHead {
pub fn new(cfg: &TemporalHeadConfig) -> Result<Self, TemporalError> {
cfg.validate()?;
Ok(Self {
causal: cfg.causal,
cfg: cfg.clone(),
})
}
pub fn cfg(&self) -> &TemporalHeadConfig {
&self.cfg
}
/// Naive O(N²) prefill. Q/K/V must share the same head count
/// (no GQA) — `dense_attention` upstream enforces it.
pub fn forward(
&self,
q: &Tensor3,
k: &Tensor3,
v: &Tensor3,
) -> Result<Tensor3, TemporalError> {
Ok(dense_attention(q, k, v, self.causal)?)
}
}

View File

@ -0,0 +1,28 @@
use thiserror::Error;
#[derive(Debug, Error)]
pub enum TemporalError {
#[error("temporal head config invalid: {0}")]
InvalidConfig(&'static str),
/// Retained for back-compat with v0.1 callers; superseded by the
/// per-operation errors below now that Dense is implemented.
#[error("dense MHA backend not implemented yet (ADR-096 §4.4 follow-up)")]
DenseBackendNotImplemented,
/// Dense MHA has no notion of an accumulated KV cache — every
/// new frame requires recomputing the full N² attention matrix
/// (the structural gap ADR-096 §3.2 flagged). Callers that want
/// streaming decode must use the SparseGqa backend.
#[error("dense backend does not support streaming step(); use SparseGqa for online decode")]
BackendDoesNotSupportStreaming,
#[error("sparse attention kernel error: {0}")]
Kernel(String),
}
impl From<ruvllm_sparse_attention::AttentionError> for TemporalError {
fn from(e: ruvllm_sparse_attention::AttentionError) -> Self {
TemporalError::Kernel(format!("{e}"))
}
}

View File

@ -0,0 +1,105 @@
// AETHER temporal head over CSI feature windows (ADR-096).
//
// Wraps `ruvllm_sparse_attention::SubquadraticSparseAttention` so AETHER
// callers in `wifi-densepose-train` and `wifi-densepose-signal` can swap
// dense MHA for sparse-GQA without touching the contrastive recipe.
//
// Status: scaffolding for ADR-096 §4.3. Sparse backend is functional;
// the dense back-compat backend is a follow-up (Phase 2 of the roadmap
// in #513). Streaming `step()` lands once the per-track KvCache lifecycle
// (ADR-096 §8.5) is finalized.
pub mod config;
pub mod dense;
pub mod error;
pub mod sparse;
pub mod weights;
pub use config::{TemporalBackendKind, TemporalHeadConfig};
pub use dense::DenseHead;
pub use error::TemporalError;
pub use sparse::SparseGqaHead;
pub use weights::{
WeightBlob, WeightBlobHeader, WeightDtype, WEIGHT_BLOB_HEADER_LEN, WEIGHT_BLOB_MAGIC,
WEIGHT_BLOB_VERSION,
};
// Re-export the upstream Tensor3 + KvCache so callers don't need a
// direct `ruvllm_sparse_attention` dep.
pub use ruvllm_sparse_attention::{KvCache, Tensor3};
/// Thin facade so callers can pick a backend by name.
///
/// Both backends implement `forward()` for prefill. Only `SparseGqa`
/// implements `step()` (streaming O(log T) decode against KvCache);
/// dense MHA structurally lacks a streaming counterpart and returns
/// `TemporalError::BackendDoesNotSupportStreaming` on `step()`.
pub enum AetherTemporalHead {
SparseGqa(SparseGqaHead),
Dense(DenseHead),
}
impl AetherTemporalHead {
pub fn new(cfg: &TemporalHeadConfig) -> Result<Self, TemporalError> {
match cfg.backend {
TemporalBackendKind::SparseGqa => {
Ok(AetherTemporalHead::SparseGqa(SparseGqaHead::new(cfg)?))
}
TemporalBackendKind::Dense => Ok(AetherTemporalHead::Dense(DenseHead::new(cfg)?)),
}
}
/// Window-level prefill. Returns the per-token attention output as
/// a Tensor3 of shape (window, q_heads, head_dim). Pooling to a
/// single embedding is the caller's responsibility — different
/// AETHER consumers use different pool ops (mean for re-ID,
/// last-token for streaming).
pub fn forward(
&self,
q: &Tensor3,
k: &Tensor3,
v: &Tensor3,
) -> Result<Tensor3, TemporalError> {
match self {
AetherTemporalHead::SparseGqa(h) => h.forward(q, k, v),
AetherTemporalHead::Dense(h) => h.forward(q, k, v),
}
}
/// Streaming decode (ADR-096 §3.2). Caller owns the `cache`; the
/// natural lifetime is per-tracked-person (one cache per
/// `PoseTrack`, dropped when the track evicts).
///
/// Returns the attention output for the single new token. Caller
/// is responsible for downstream pooling / classifier head.
///
/// Dense backend returns `BackendDoesNotSupportStreaming` — no
/// dense-MHA-with-KV-cache equivalent exists, by design.
pub fn step(
&self,
q_new: &Tensor3,
k_new: &Tensor3,
v_new: &Tensor3,
cache: &mut KvCache,
) -> Result<Tensor3, TemporalError> {
match self {
AetherTemporalHead::SparseGqa(h) => h.step(q_new, k_new, v_new, cache),
AetherTemporalHead::Dense(_) => {
Err(TemporalError::BackendDoesNotSupportStreaming)
}
}
}
/// Allocate a `KvCache` sized correctly for this head. Convenience
/// wrapper so AETHER's `pose_tracker.rs` doesn't need to import
/// the upstream crate.
///
/// Dense backend returns `BackendDoesNotSupportStreaming` — there
/// is no cache to size for a dense kernel.
pub fn make_cache(&self, capacity: usize) -> Result<KvCache, TemporalError> {
match self {
AetherTemporalHead::SparseGqa(h) => Ok(h.make_cache(capacity)),
AetherTemporalHead::Dense(_) => Err(TemporalError::BackendDoesNotSupportStreaming),
}
}
}

View File

@ -0,0 +1,112 @@
use ruvllm_sparse_attention::{
AttentionBackend, KvCache, SparseAttentionConfig, SubquadraticSparseAttention, Tensor3,
};
use crate::{TemporalError, TemporalHeadConfig};
/// AETHER temporal head implemented with `ruvllm_sparse_attention`.
///
/// The selection rule from ADR-096 §4.4 is enforced at `forward()`
/// time: when `q_heads == kv_heads` we use `forward()` (plain MHA
/// over the sparse pattern); when they differ we use `forward_gqa()`.
/// The streaming `step()` path is staged behind a follow-up — KvCache
/// lifecycle ties to `PoseTrack` per ADR-096 §8.5 and lives on the
/// caller, not here.
pub struct SparseGqaHead {
cfg: TemporalHeadConfig,
attn: SubquadraticSparseAttention,
}
impl SparseGqaHead {
pub fn new(cfg: &TemporalHeadConfig) -> Result<Self, TemporalError> {
cfg.validate()?;
let attn_cfg = SparseAttentionConfig {
window: cfg.window,
block_size: cfg.block_size,
global_tokens: alloc_first_token(),
causal: cfg.causal,
use_log_stride: true,
use_landmarks: true,
sort_candidates: false,
};
let attn = SubquadraticSparseAttention::new(attn_cfg)?;
Ok(Self {
cfg: cfg.clone(),
attn,
})
}
pub fn cfg(&self) -> &TemporalHeadConfig {
&self.cfg
}
pub fn forward(
&self,
q: &Tensor3,
k: &Tensor3,
v: &Tensor3,
) -> Result<Tensor3, TemporalError> {
// ADR-096 §4.4: dispatch by GQA shape.
if self.cfg.q_heads == self.cfg.kv_heads {
// Pure MHA — sparse `forward` is the right path.
Ok(self.attn.forward(q, k, v)?)
} else {
// GQA / MQA — kv_heads < q_heads, group share factor = q/kv.
Ok(self.attn.forward_gqa(q, k, v)?)
}
}
/// Streaming decode for re-ID and online classification (ADR-096 §3.2).
///
/// Given one new token's q/k/v, append (k, v) to `cache` and return
/// the attention output for that one position against the full
/// accumulated history. Cost is O(log T) per step against a cache
/// of capacity T — the structural advantage over dense MHA's O(N²)
/// recompute that ADR-096 specifically calls out as the
/// dense-MHA-cannot-follow path.
///
/// Cache lifetime is owned by the caller. Per ADR-096 §8.5 the
/// natural place is one cache per `PoseTrack` (re-ID) or one cache
/// per active session (online classification). When the track is
/// dropped, drop the cache.
pub fn step(
&self,
q_new: &Tensor3,
k_new: &Tensor3,
v_new: &Tensor3,
cache: &mut KvCache,
) -> Result<Tensor3, TemporalError> {
if q_new.seq != 1 || k_new.seq != 1 || v_new.seq != 1 {
return Err(TemporalError::InvalidConfig(
"step() requires single-token q/k/v (seq == 1 each)",
));
}
// Append must succeed before decode_step sees the cache; if
// the cache fills, the caller is responsible for eviction or
// resetting per ADR-096 §3.2 (H2O heavy-hitter eviction is
// available upstream but kept opt-in).
cache.try_append(k_new, v_new)?;
Ok(self.attn.decode_step(q_new, cache)?)
}
/// Construct a KvCache sized for this head's shape. Convenience
/// so callers don't need to import the upstream crate directly.
pub fn make_cache(&self, capacity: usize) -> KvCache {
KvCache::new(
capacity,
self.cfg.kv_heads,
self.cfg.head_dim,
self.cfg.block_size,
)
}
}
/// Always treat token 0 as a global anchor — AETHER's contrastive
/// recipe (ADR-024) gives the first token a special role as the
/// "session start" reference embedding, and global tokens in the
/// sparse pattern preserve full visibility for that one position.
fn alloc_first_token() -> Vec<usize> {
vec![0]
}

View File

@ -0,0 +1,231 @@
// Wire format for the temporal-head weights blob.
//
// One blob describes one model. Both ends speak it:
// - Host-side (this crate): training emits a blob via `WeightBlob::serialize`.
// - Firmware-side (`firmware/esp32-csi-node/components/ruv_temporal`):
// loads it via a mirrored parser. The blob is the *only* thing
// that crosses the host/firmware boundary at deploy time, so the
// format must be stable, self-describing, and version-gated.
//
// Layout (little-endian throughout):
//
// header 16 B
// [0x00..0x04) magic u32 = 0x52564E45 ("RVNE" — RuVector Neural Edge)
// [0x04..0x06) version u16 = 1
// [0x06..0x07) flags u8 bit 0 = 0:fp32 / 1:fp16 weights
// [0x07..0x08) reserved u8
// [0x08..0x0A) input_dim u16 per-frame feature dim
// [0x0A..0x0C) n_q_heads u16 query head count
// [0x0C..0x0E) n_kv_heads u16 key/value head count (≤ n_q_heads, divides it)
// [0x0E..0x10) head_dim u16 per-head feature dim
//
// body variable
// [0x10..0x12) n_layers u16
// [0x12..0x14) n_classes u16
// [0x14..0x18) weights_len u32 bytes of weights payload (after this header)
// [0x18..end-4) weights weights_len bytes — flat per-layer arrays
// in the order the kernel reads them
// footer 4 B
// [end-4..end) crc32 u32 IEEE 802.3, covers everything before
//
// Total size = 16 (header) + 2+2+4 (body header) + weights_len + 4 (crc) = 28 + weights_len
//
// Versioning: bumping `version` is a hard break — firmware refuses to
// load a blob whose version it doesn't know. Adding a *new* field is
// done by reserving a new flag bit and treating the field as
// post-weights when the bit is set; never reorder existing fields.
use crate::error::TemporalError;
pub const WEIGHT_BLOB_MAGIC: u32 = 0x5256_4E45; // "RVNE"
pub const WEIGHT_BLOB_VERSION: u16 = 1;
pub const WEIGHT_BLOB_HEADER_LEN: usize = 16 + 2 + 2 + 4; // 24
pub const WEIGHT_BLOB_FOOTER_LEN: usize = 4;
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum WeightDtype {
F32,
F16,
}
#[derive(Clone, Debug)]
pub struct WeightBlobHeader {
pub dtype: WeightDtype,
pub input_dim: u16,
pub n_q_heads: u16,
pub n_kv_heads: u16,
pub head_dim: u16,
pub n_layers: u16,
pub n_classes: u16,
}
impl WeightBlobHeader {
/// Element size in bytes for the configured dtype.
pub fn elem_bytes(&self) -> usize {
match self.dtype {
WeightDtype::F32 => 4,
WeightDtype::F16 => 2,
}
}
/// Validate that the structural numbers make sense — caught here
/// rather than at first kernel call so the host-side training
/// tool can't accidentally emit a blob the firmware will reject
/// at boot.
pub fn validate(&self) -> Result<(), TemporalError> {
if self.input_dim == 0
|| self.n_q_heads == 0
|| self.n_kv_heads == 0
|| self.head_dim == 0
{
return Err(TemporalError::InvalidConfig(
"header: zero-valued dimension(s)",
));
}
if self.n_q_heads % self.n_kv_heads != 0 {
return Err(TemporalError::InvalidConfig(
"header: n_q_heads must be divisible by n_kv_heads (GQA)",
));
}
if self.n_layers == 0 || self.n_classes < 2 {
return Err(TemporalError::InvalidConfig(
"header: n_layers must be ≥ 1 and n_classes ≥ 2",
));
}
Ok(())
}
}
/// A complete weight blob: header + raw weights bytes.
///
/// Weights are kept as `Vec<u8>` rather than `Vec<f32>` / `Vec<f16>` so
/// the firmware loader (which is no_std and may not have the `half`
/// crate) can `mmap` the body and read either dtype directly.
pub struct WeightBlob {
pub header: WeightBlobHeader,
pub weights: Vec<u8>,
}
impl WeightBlob {
pub fn new(header: WeightBlobHeader, weights: Vec<u8>) -> Result<Self, TemporalError> {
header.validate()?;
let elem = header.elem_bytes();
if weights.len() % elem != 0 {
return Err(TemporalError::InvalidConfig(
"weights length is not a multiple of dtype element size",
));
}
Ok(Self { header, weights })
}
/// Serialize to the wire format. Stable across rebuilds — this is
/// the contract the firmware loader speaks.
pub fn serialize(&self) -> Vec<u8> {
let total = WEIGHT_BLOB_HEADER_LEN + self.weights.len() + WEIGHT_BLOB_FOOTER_LEN;
let mut out = Vec::with_capacity(total);
// header
out.extend_from_slice(&WEIGHT_BLOB_MAGIC.to_le_bytes());
out.extend_from_slice(&WEIGHT_BLOB_VERSION.to_le_bytes());
let flags: u8 = match self.header.dtype {
WeightDtype::F32 => 0,
WeightDtype::F16 => 1,
};
out.push(flags);
out.push(0); // reserved
out.extend_from_slice(&self.header.input_dim.to_le_bytes());
out.extend_from_slice(&self.header.n_q_heads.to_le_bytes());
out.extend_from_slice(&self.header.n_kv_heads.to_le_bytes());
out.extend_from_slice(&self.header.head_dim.to_le_bytes());
// body header
out.extend_from_slice(&self.header.n_layers.to_le_bytes());
out.extend_from_slice(&self.header.n_classes.to_le_bytes());
out.extend_from_slice(&(self.weights.len() as u32).to_le_bytes());
// weights
out.extend_from_slice(&self.weights);
// footer: crc32 over everything written so far
let crc = crc32_ieee(&out);
out.extend_from_slice(&crc.to_le_bytes());
out
}
/// Parse a blob, validating magic / version / size / CRC.
pub fn parse(buf: &[u8]) -> Result<Self, TemporalError> {
if buf.len() < WEIGHT_BLOB_HEADER_LEN + WEIGHT_BLOB_FOOTER_LEN {
return Err(TemporalError::InvalidConfig("blob too short"));
}
let magic = u32::from_le_bytes(buf[0..4].try_into().unwrap());
if magic != WEIGHT_BLOB_MAGIC {
return Err(TemporalError::InvalidConfig("bad magic"));
}
let version = u16::from_le_bytes(buf[4..6].try_into().unwrap());
if version != WEIGHT_BLOB_VERSION {
return Err(TemporalError::InvalidConfig("unsupported blob version"));
}
let flags = buf[6];
let dtype = match flags & 0x01 {
0 => WeightDtype::F32,
_ => WeightDtype::F16,
};
let input_dim = u16::from_le_bytes(buf[8..10].try_into().unwrap());
let n_q_heads = u16::from_le_bytes(buf[10..12].try_into().unwrap());
let n_kv_heads = u16::from_le_bytes(buf[12..14].try_into().unwrap());
let head_dim = u16::from_le_bytes(buf[14..16].try_into().unwrap());
let n_layers = u16::from_le_bytes(buf[16..18].try_into().unwrap());
let n_classes = u16::from_le_bytes(buf[18..20].try_into().unwrap());
let weights_len = u32::from_le_bytes(buf[20..24].try_into().unwrap()) as usize;
// sanity-check size before slicing
let expected = WEIGHT_BLOB_HEADER_LEN + weights_len + WEIGHT_BLOB_FOOTER_LEN;
if buf.len() != expected {
return Err(TemporalError::InvalidConfig(
"blob length doesn't match weights_len in header",
));
}
// CRC check: cover everything before the trailing 4-byte CRC
let stored_crc = u32::from_le_bytes(buf[buf.len() - 4..].try_into().unwrap());
let computed = crc32_ieee(&buf[..buf.len() - 4]);
if stored_crc != computed {
return Err(TemporalError::InvalidConfig("blob CRC mismatch"));
}
let header = WeightBlobHeader {
dtype,
input_dim,
n_q_heads,
n_kv_heads,
head_dim,
n_layers,
n_classes,
};
header.validate()?;
let weights_start = WEIGHT_BLOB_HEADER_LEN;
let weights_end = weights_start + weights_len;
let weights = buf[weights_start..weights_end].to_vec();
Ok(Self { header, weights })
}
}
/// IEEE 802.3 CRC32 (poly 0xEDB88320), table-free. Same polynomial
/// the firmware-side loader uses (`temporal_task.c::crc32_ieee`) so a
/// blob produced here parses there.
pub(crate) fn crc32_ieee(data: &[u8]) -> u32 {
let mut crc = 0xFFFF_FFFFu32;
for &b in data {
crc ^= b as u32;
for _ in 0..8 {
let mask = 0u32.wrapping_sub(crc & 1);
crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
}
}
!crc
}

View File

@ -0,0 +1,114 @@
//! End-to-end test: write a deterministic-seeded weight blob to disk,
//! read it back, parse it. Mirrors what the host-side training tool
//! does (training run finishes → emit .rvne) and what the firmware
//! loader will do once the toolchain unblocks (boot → mmap NVS or
//! EMBED_FILES blob → parse → run kernel).
//!
//! Sized realistically (~26 KB for the AETHER default shape) so the
//! perf and CRC paths see a meaningful payload.
use std::fs;
use wifi_densepose_temporal::{WeightBlob, WeightBlobHeader, WeightDtype};
fn aether_default_header() -> WeightBlobHeader {
WeightBlobHeader {
dtype: WeightDtype::F32,
input_dim: 16,
n_q_heads: 4,
n_kv_heads: 1,
head_dim: 32,
n_layers: 2,
n_classes: 4,
}
}
fn xorshift_step(state: &mut u64) -> u64 {
let mut x = *state;
x ^= x << 13;
x ^= x >> 7;
x ^= x << 17;
*state = x;
x.wrapping_mul(2685821657736338717u64)
}
fn deterministic_weights(byte_len: usize, seed: u64) -> Vec<u8> {
let mut out = Vec::with_capacity(byte_len);
let mut state = seed;
while out.len() < byte_len {
let bits = xorshift_step(&mut state) >> 32;
let unit = (bits as u32 as f32) / (u32::MAX as f32);
let f = (unit - 0.5) * 0.2;
out.extend_from_slice(&f.to_le_bytes());
}
out.truncate(byte_len);
out
}
#[test]
fn realistic_blob_roundtrips_through_filesystem() {
// AETHER default + 2 layers + classifier head: enough to exercise
// a non-trivial weights region without making the test slow.
let header = aether_default_header();
// Per-layer floats: input_dim*(q_heads*head_dim) for Wq, twice
// input_dim*(kv_heads*head_dim) for Wk and Wv, q_heads*head_dim*input_dim
// for Wo. Plus classifier head input_dim*n_classes.
let per_layer = (header.input_dim as usize)
* (header.n_q_heads as usize * header.head_dim as usize)
+ 2 * (header.input_dim as usize)
* (header.n_kv_heads as usize * header.head_dim as usize)
+ (header.n_q_heads as usize * header.head_dim as usize)
* (header.input_dim as usize);
let total_floats = per_layer * header.n_layers as usize
+ header.input_dim as usize * header.n_classes as usize;
let weights_bytes = total_floats * 4;
assert!(weights_bytes > 25_000);
let weights = deterministic_weights(weights_bytes, 0xC511_0007_DEAD_BEEFu64);
let blob = WeightBlob::new(header, weights).expect("construct");
let serialized = blob.serialize();
// Filesystem leg — the realistic firmware loader path mmap or
// streaming-reads from NVS / EMBED_FILES. We use a temp file
// per platform; on Windows std::env::temp_dir() works fine.
let mut tmp = std::env::temp_dir();
tmp.push("wifi-densepose-temporal-e2e.rvne");
fs::write(&tmp, &serialized).expect("write");
let read_back = fs::read(&tmp).expect("read");
assert_eq!(read_back, serialized, "filesystem corrupted bytes");
let parsed = WeightBlob::parse(&read_back).expect("parse");
assert_eq!(parsed.header.input_dim, 16);
assert_eq!(parsed.header.n_q_heads, 4);
assert_eq!(parsed.header.n_kv_heads, 1);
assert_eq!(parsed.header.head_dim, 32);
assert_eq!(parsed.header.n_layers, 2);
assert_eq!(parsed.header.n_classes, 4);
assert_eq!(parsed.weights.len(), weights_bytes);
// Cleanup — best-effort, don't fail the test on Windows file lock.
let _ = fs::remove_file(&tmp);
}
#[test]
fn deterministic_seed_produces_byte_identical_blobs() {
// The training script needs reproducibility — given the same
// config and seed, two runs must produce byte-identical output.
// This is what makes a witness-bundle (ADR-028) over the trained
// weights meaningful.
let header = aether_default_header();
let bytes = 4096;
let w1 = deterministic_weights(bytes, 0x1234u64);
let w2 = deterministic_weights(bytes, 0x1234u64);
assert_eq!(w1, w2, "PRNG not deterministic at fixed seed");
let blob1 = WeightBlob::new(header.clone(), w1).expect("ok");
let blob2 = WeightBlob::new(header, w2).expect("ok");
assert_eq!(
blob1.serialize(),
blob2.serialize(),
"serialization not deterministic"
);
}

View File

@ -0,0 +1,184 @@
//! Numerical A/B test for ADR-096 §5: do Dense and SparseGqa produce
//! comparable outputs on the same input?
//!
//! Background. Sparse attention is *structurally* an approximation —
//! it skips edges that the local window + log-stride + landmark
//! pattern decided wouldn't matter. The §5 validation gate cares
//! about whether that approximation degrades downstream metrics
//! (contrastive loss, rank-1 accuracy, Spearman correlation), not
//! whether outputs are bit-equal. This file establishes the *direct*
//! output-level error envelope so the gate can be calibrated against
//! it.
//!
//! Two regimes:
//!
//! 1. **Sparse pattern is dense.** When window ≥ N AND block_size ≥ N
//! AND every position is global, sparse and dense visit the same
//! edge set. Output divergence then reflects only floating-point
//! accumulation order, which is a tight bound (~1e-5 for f32 sums
//! of ~100 terms at 0.1 magnitude).
//!
//! 2. **Sparse pattern is sparse.** Default config drops most edges
//! at long N. Output divergence here is the *real* approximation
//! error — and the §5 gate's tolerances apply downstream of it.
use ruvllm_sparse_attention::Tensor3;
use wifi_densepose_temporal::{
AetherTemporalHead, TemporalBackendKind, TemporalHeadConfig,
};
fn make_qkv(seq: usize, heads: usize, dim: usize) -> (Tensor3, Tensor3, Tensor3) {
let mut q = Tensor3::zeros(seq, heads, dim);
let mut k = Tensor3::zeros(seq, heads, dim);
let mut v = Tensor3::zeros(seq, heads, dim);
for s in 0..seq {
for h in 0..heads {
for d in 0..dim {
let qv = ((s * 31 + h * 7 + d) as f32).sin() * 0.1;
let kv = (((s * 17 + h * 3 + d) as f32).cos()) * 0.1;
q.set(s, h, d, qv);
k.set(s, h, d, kv);
v.set(s, h, d, kv * 0.5);
}
}
}
(q, k, v)
}
fn max_abs_err(a: &Tensor3, b: &Tensor3) -> f32 {
let (s, h, d) = a.shape();
assert_eq!((s, h, d), b.shape(), "shape mismatch");
let mut max_err = 0.0f32;
for ti in 0..s {
for hi in 0..h {
for di in 0..d {
let e = (a.get(ti, hi, di) - b.get(ti, hi, di)).abs();
if e > max_err {
max_err = e;
}
}
}
}
max_err
}
fn mean_abs_err(a: &Tensor3, b: &Tensor3) -> f32 {
let (s, h, d) = a.shape();
let mut sum = 0.0f32;
let mut n = 0usize;
for ti in 0..s {
for hi in 0..h {
for di in 0..d {
sum += (a.get(ti, hi, di) - b.get(ti, hi, di)).abs();
n += 1;
}
}
}
sum / n.max(1) as f32
}
#[test]
fn dense_and_sparse_agree_when_sparse_pattern_is_dense() {
// Saturate the sparse pattern: window ≥ N means the local-window
// primitive includes every causal predecessor, so the attention
// edge set is identical to dense MHA's. The remaining gap is
// floating-point accumulation order (sparse goes
// window-then-stride-then-landmark, dense goes naive 0..i).
let seq = 32;
let heads = 4;
let dim = 16;
let (q, k, v) = make_qkv(seq, heads, dim);
let dense_cfg = TemporalHeadConfig {
backend: TemporalBackendKind::Dense,
q_heads: heads,
kv_heads: heads,
head_dim: dim,
window: seq, // saturate
block_size: seq,
causal: true,
};
let sparse_cfg = TemporalHeadConfig {
backend: TemporalBackendKind::SparseGqa,
..dense_cfg.clone()
};
let dense = AetherTemporalHead::new(&dense_cfg).expect("dense");
let sparse = AetherTemporalHead::new(&sparse_cfg).expect("sparse");
let d = dense.forward(&q, &k, &v).expect("dense forward");
let s = sparse.forward(&q, &k, &v).expect("sparse forward");
let max_err = max_abs_err(&d, &s);
let mean_err = mean_abs_err(&d, &s);
// 1e-4 covers a generous f32-summation-order envelope at 0.1
// input magnitude. If this ever blows up, either the saturation
// assumption is wrong (window/block_size no longer covers
// everything) or the kernel changed semantics.
assert!(
max_err < 1.0e-4,
"saturated-pattern max_abs_err exceeds 1e-4: max={max_err} mean={mean_err}"
);
}
#[test]
fn dense_and_sparse_diverge_predictably_at_long_n() {
// The interesting case: real sparse pattern (window << N), real
// approximation. We don't assert a specific error bound here —
// that's what ADR-096 §5's validation gate calibrates. We only
// check the numbers come out finite and plausible (per-position
// outputs stay within a few × the input magnitude after
// attention-weighted averaging — softmax can't blow them up).
let seq = 256;
let heads = 4;
let dim = 16;
let (q, k, v) = make_qkv(seq, heads, dim);
let dense_cfg = TemporalHeadConfig {
backend: TemporalBackendKind::Dense,
q_heads: heads,
kv_heads: heads,
head_dim: dim,
window: seq, // dense — placeholder; ignored by Dense backend
block_size: seq,
causal: true,
};
let sparse_cfg = TemporalHeadConfig {
backend: TemporalBackendKind::SparseGqa,
q_heads: heads,
kv_heads: heads,
head_dim: dim,
window: 16, // realistic sparse window
block_size: 32,
causal: true,
};
let dense = AetherTemporalHead::new(&dense_cfg).expect("dense");
let sparse = AetherTemporalHead::new(&sparse_cfg).expect("sparse");
let d = dense.forward(&q, &k, &v).expect("dense forward");
let s = sparse.forward(&q, &k, &v).expect("sparse forward");
let max_err = max_abs_err(&d, &s);
let mean_err = mean_abs_err(&d, &s);
// Sanity bounds. Inputs are scaled to 0.1, attention is a softmax
// average so outputs stay in roughly [-0.1, 0.1]. If max_err > 1.0
// something is structurally broken (NaN, underflow, etc).
assert!(
max_err.is_finite() && mean_err.is_finite(),
"non-finite error: max={max_err} mean={mean_err}"
);
assert!(
max_err < 1.0,
"implausibly large divergence: max={max_err} mean={mean_err}"
);
// Print the numbers so they're visible when running `cargo test --
// --nocapture`. These are what ADR-096 §5's gate would calibrate
// against on real AETHER inputs.
eprintln!(
"dense_vs_sparse @ N={seq}, window=16, block=32: max_abs_err={max_err:.6e}, mean_abs_err={mean_err:.6e}"
);
}

View File

@ -0,0 +1,133 @@
//! Smoke tests for the AETHER sparse-GQA temporal head (ADR-096 §5 gate is
//! a separate accuracy benchmark; this file just proves the wiring works).
use wifi_densepose_temporal::{
AetherTemporalHead, TemporalBackendKind, TemporalHeadConfig, TemporalError, Tensor3,
};
fn make_qkv(seq: usize, q_heads: usize, kv_heads: usize, dim: usize) -> (Tensor3, Tensor3, Tensor3) {
// Deterministic synthetic CSI-like activations so the test is
// reproducible across machines without bringing in `rand`.
let mut q = Tensor3::zeros(seq, q_heads, dim);
for s in 0..seq {
for h in 0..q_heads {
for d in 0..dim {
let v = ((s * 31 + h * 7 + d) as f32).sin() * 0.1;
q.set(s, h, d, v);
}
}
}
let mut k = Tensor3::zeros(seq, kv_heads, dim);
let mut v = Tensor3::zeros(seq, kv_heads, dim);
for s in 0..seq {
for h in 0..kv_heads {
for d in 0..dim {
let kv = (((s * 17 + h * 3 + d) as f32).cos()) * 0.1;
k.set(s, h, d, kv);
v.set(s, h, d, kv * 0.5);
}
}
}
(q, k, v)
}
#[test]
fn sparse_gqa_forward_runs_at_aether_default() {
let cfg = TemporalHeadConfig::default_aether();
let head = AetherTemporalHead::new(&cfg).expect("construct");
let (q, k, vt) = make_qkv(64, cfg.q_heads, cfg.kv_heads, cfg.head_dim);
let out = head.forward(&q, &k, &vt).expect("forward");
let (oseq, oh, od) = out.shape();
assert_eq!(oseq, 64);
assert_eq!(oh, cfg.q_heads);
assert_eq!(od, cfg.head_dim);
}
#[test]
fn sparse_mha_path_runs_when_qkv_heads_match() {
// q_heads == kv_heads forces the `forward` (non-GQA) branch.
let cfg = TemporalHeadConfig {
backend: TemporalBackendKind::SparseGqa,
q_heads: 2,
kv_heads: 2,
head_dim: 16,
window: 8,
block_size: 4,
causal: true,
};
let head = AetherTemporalHead::new(&cfg).expect("construct");
let (q, k, vt) = make_qkv(32, 2, 2, 16);
let out = head.forward(&q, &k, &vt).expect("forward");
assert_eq!(out.shape(), (32, 2, 16));
}
#[test]
fn dense_backend_forward_runs_with_matching_shape() {
// Dense_attention upstream requires q_heads == kv_heads (no GQA).
// Use MHA shape; n_classes/n_layers don't matter for forward-only.
let cfg = TemporalHeadConfig {
backend: TemporalBackendKind::Dense,
q_heads: 4,
kv_heads: 4,
head_dim: 16,
window: 8,
block_size: 4,
causal: true,
};
let head = AetherTemporalHead::new(&cfg).expect("construct dense");
let (q, k, v) = make_qkv(32, 4, 4, 16);
let out = head.forward(&q, &k, &v).expect("dense forward");
assert_eq!(out.shape(), (32, 4, 16));
}
#[test]
fn dense_backend_step_returns_streaming_error() {
let cfg = TemporalHeadConfig {
backend: TemporalBackendKind::Dense,
q_heads: 4,
kv_heads: 4,
head_dim: 16,
window: 8,
block_size: 4,
causal: true,
};
let head = AetherTemporalHead::new(&cfg).expect("construct dense");
let cache_err = head.make_cache(32).err().expect("no cache for dense");
matches!(cache_err, TemporalError::BackendDoesNotSupportStreaming);
}
#[test]
fn invalid_gqa_ratio_rejected_at_construction() {
let cfg = TemporalHeadConfig {
backend: TemporalBackendKind::SparseGqa,
q_heads: 5,
kv_heads: 2, // 5 % 2 != 0
head_dim: 16,
window: 8,
block_size: 4,
causal: true,
};
let err = AetherTemporalHead::new(&cfg).err().expect("rejected");
matches!(err, TemporalError::InvalidConfig(_));
}
#[test]
fn long_window_at_aether_roadmap_target() {
// ADR-096 §3.1 roadmap target: 10 s @ 100 Hz = 1000 frames. Verify
// the kernel actually runs at this length so the long-window claim
// is more than aspirational.
let cfg = TemporalHeadConfig {
backend: TemporalBackendKind::SparseGqa,
q_heads: 4,
kv_heads: 1,
head_dim: 16,
window: 64,
block_size: 32,
causal: true,
};
let head = AetherTemporalHead::new(&cfg).expect("construct");
let (q, k, vt) = make_qkv(1000, 4, 1, 16);
let out = head.forward(&q, &k, &vt).expect("forward at N=1000");
assert_eq!(out.shape(), (1000, 4, 16));
}

View File

@ -0,0 +1,139 @@
//! ADR-096 §3.2 streaming-decode test: token-by-token `step()` against
//! a `KvCache` should match a single-shot `forward()` over the same
//! Q/K/V at the final position. This is the structural advantage
//! dense MHA can't follow — proving it stays correct under streaming
//! is what the §5 validation gate would care about most.
use wifi_densepose_temporal::{
AetherTemporalHead, TemporalBackendKind, TemporalHeadConfig, Tensor3,
};
fn make_qkv(seq: usize, q_heads: usize, kv_heads: usize, dim: usize) -> (Tensor3, Tensor3, Tensor3) {
let mut q = Tensor3::zeros(seq, q_heads, dim);
let mut k = Tensor3::zeros(seq, kv_heads, dim);
let mut v = Tensor3::zeros(seq, kv_heads, dim);
for s in 0..seq {
for h in 0..q_heads {
for d in 0..dim {
let val = ((s * 31 + h * 7 + d) as f32).sin() * 0.1;
q.set(s, h, d, val);
}
}
for h in 0..kv_heads {
for d in 0..dim {
let val = (((s * 17 + h * 3 + d) as f32).cos()) * 0.1;
k.set(s, h, d, val);
v.set(s, h, d, val * 0.5);
}
}
}
(q, k, v)
}
fn slice_token(t: &Tensor3, idx: usize) -> Tensor3 {
let (_, heads, dim) = t.shape();
let mut out = Tensor3::zeros(1, heads, dim);
for h in 0..heads {
for d in 0..dim {
out.set(0, h, d, t.get(idx, h, d));
}
}
out
}
fn config_mha_small() -> TemporalHeadConfig {
// Equal q/k heads forces the `forward` MHA branch — `decode_step`
// upstream is wired to this branch, not the GQA branch (which has
// its own decode path coming in upstream's roadmap).
TemporalHeadConfig {
backend: TemporalBackendKind::SparseGqa,
q_heads: 2,
kv_heads: 2,
head_dim: 16,
window: 8,
block_size: 4,
causal: true,
}
}
#[test]
fn streaming_step_matches_forward_at_last_position() {
let cfg = config_mha_small();
let head = AetherTemporalHead::new(&cfg).expect("construct");
let seq = 16usize;
let (q, k, v) = make_qkv(seq, cfg.q_heads, cfg.kv_heads, cfg.head_dim);
// Reference: single-shot forward over the full sequence.
let reference = head.forward(&q, &k, &v).expect("forward");
// Streaming: append k/v one token at a time, decode the new q.
let mut cache = head.make_cache(seq).expect("cache");
let mut last_out: Option<Tensor3> = None;
for t in 0..seq {
let qt = slice_token(&q, t);
let kt = slice_token(&k, t);
let vt = slice_token(&v, t);
last_out = Some(head.step(&qt, &kt, &vt, &mut cache).expect("step"));
}
let streamed = last_out.expect("at least one step");
// Compare the streamed last-token output to the reference's
// last-token output. Tolerance is generous because numerical
// accumulation differs between the two paths even at exact
// mathematical equivalence.
let (s_seq, s_heads, s_dim) = streamed.shape();
assert_eq!((s_seq, s_heads, s_dim), (1, cfg.q_heads, cfg.head_dim));
let mut max_abs_err: f32 = 0.0;
for h in 0..cfg.q_heads {
for d in 0..cfg.head_dim {
let a = streamed.get(0, h, d);
let b = reference.get(seq - 1, h, d);
let err = (a - b).abs();
if err > max_abs_err {
max_abs_err = err;
}
}
}
// 1e-3 absolute is a comfortable bound for activations of this
// magnitude (~0.1 input scale). Tighten if the kernel ever
// promises closer match.
assert!(
max_abs_err < 1.0e-3,
"streaming/forward divergence at last token exceeds 1e-3: max_abs_err = {max_abs_err}"
);
}
#[test]
fn step_rejects_multi_token_q() {
let cfg = config_mha_small();
let head = AetherTemporalHead::new(&cfg).expect("construct");
let mut cache = head.make_cache(8).expect("cache");
// Build a 2-token Q/K/V — `step` must reject (its contract is
// single-token decode).
let (q, k, v) = make_qkv(2, cfg.q_heads, cfg.kv_heads, cfg.head_dim);
let err = head.step(&q, &k, &v, &mut cache).err().expect("rejected");
let s = format!("{err}");
assert!(
s.contains("single-token") || s.to_lowercase().contains("seq"),
"expected single-token rejection, got: {s}"
);
}
#[test]
fn make_cache_returns_kvcache_with_correct_shape() {
// Smoke test that the convenience wrapper plumbs the right dims
// into KvCache::new — the upstream constructor takes
// (capacity, kv_heads, dim, block_size) and we want to make sure
// we're not transposing any of those.
let cfg = config_mha_small();
let head = AetherTemporalHead::new(&cfg).expect("construct");
let mut cache = head.make_cache(32).expect("cache");
// Append one token shaped for kv_heads × head_dim — should not error.
let (_, k, v) = make_qkv(1, cfg.q_heads, cfg.kv_heads, cfg.head_dim);
let kt = slice_token(&k, 0);
let vt = slice_token(&v, 0);
cache.try_append(&kt, &vt).expect("append shape ok");
}

View File

@ -0,0 +1,140 @@
//! Roundtrip + corruption-detection tests for the temporal head's
//! weight-blob wire format. The format is the contract between
//! host-side training and firmware-side inference — when this test
//! file changes, both ends update in lockstep.
use wifi_densepose_temporal::{
WeightBlob, WeightBlobHeader, WeightDtype, WEIGHT_BLOB_HEADER_LEN, WEIGHT_BLOB_MAGIC,
WEIGHT_BLOB_VERSION,
};
fn header_default() -> WeightBlobHeader {
WeightBlobHeader {
dtype: WeightDtype::F32,
input_dim: 16,
n_q_heads: 4,
n_kv_heads: 1,
head_dim: 32,
n_layers: 2,
n_classes: 4,
}
}
#[test]
fn roundtrip_fp32() {
let header = header_default();
let weights: Vec<u8> = (0..1024).map(|i| (i & 0xFF) as u8).collect();
let blob = WeightBlob::new(header, weights).expect("ok");
let serialized = blob.serialize();
let parsed = WeightBlob::parse(&serialized).expect("parse");
assert_eq!(parsed.header.input_dim, 16);
assert_eq!(parsed.header.n_q_heads, 4);
assert_eq!(parsed.header.n_kv_heads, 1);
assert_eq!(parsed.header.head_dim, 32);
assert_eq!(parsed.header.n_layers, 2);
assert_eq!(parsed.header.n_classes, 4);
assert_eq!(parsed.header.dtype, WeightDtype::F32);
assert_eq!(parsed.weights.len(), 1024);
}
#[test]
fn roundtrip_fp16() {
let header = WeightBlobHeader {
dtype: WeightDtype::F16,
..header_default()
};
// FP16 means 2 bytes per element — 512 bytes = 256 elements.
let weights: Vec<u8> = (0..512).map(|i| (i & 0xFF) as u8).collect();
let blob = WeightBlob::new(header, weights).expect("ok");
let serialized = blob.serialize();
let parsed = WeightBlob::parse(&serialized).expect("parse");
assert_eq!(parsed.header.dtype, WeightDtype::F16);
assert_eq!(parsed.weights.len(), 512);
}
#[test]
fn parse_rejects_bad_magic() {
let header = header_default();
let blob = WeightBlob::new(header, vec![0u8; 16]).expect("ok");
let mut bytes = blob.serialize();
bytes[0] = 0xFF; // corrupt magic
let err = WeightBlob::parse(&bytes).err().expect("rejected");
assert!(format!("{err}").contains("magic"));
}
#[test]
fn parse_rejects_wrong_version() {
let header = header_default();
let blob = WeightBlob::new(header, vec![0u8; 16]).expect("ok");
let mut bytes = blob.serialize();
bytes[4] = 99; // bump version
bytes[5] = 0;
let err = WeightBlob::parse(&bytes).err().expect("rejected");
assert!(format!("{err}").contains("version"));
}
#[test]
fn parse_rejects_size_mismatch() {
let header = header_default();
let blob = WeightBlob::new(header, vec![0u8; 64]).expect("ok");
let mut bytes = blob.serialize();
// truncate the weights region by 4 bytes — total length now
// doesn't match the weights_len field.
bytes.drain(WEIGHT_BLOB_HEADER_LEN..WEIGHT_BLOB_HEADER_LEN + 4);
let err = WeightBlob::parse(&bytes).err().expect("rejected");
assert!(format!("{err}").contains("length") || format!("{err}").contains("CRC"));
}
#[test]
fn parse_rejects_crc_corruption() {
let header = header_default();
let blob = WeightBlob::new(header, vec![0xAAu8; 32]).expect("ok");
let mut bytes = blob.serialize();
// flip a bit in the middle of the weights region
let mid = WEIGHT_BLOB_HEADER_LEN + 5;
bytes[mid] ^= 0x01;
let err = WeightBlob::parse(&bytes).err().expect("rejected");
assert!(format!("{err}").contains("CRC"));
}
#[test]
fn parse_rejects_invalid_gqa_ratio_in_header() {
// Manually craft bytes where n_q_heads % n_kv_heads != 0 to ensure
// header.validate() fires from inside parse(). Easiest: build a
// valid blob then patch the n_kv_heads field.
let header = header_default();
let blob = WeightBlob::new(header, vec![0u8; 16]).expect("ok");
let mut bytes = blob.serialize();
// n_kv_heads is at offset 12..14; set it to 3 so 4 % 3 != 0.
bytes[12] = 3;
bytes[13] = 0;
// Re-CRC so we can be sure the validator (not the CRC) is what
// rejects this case.
let new_crc = crc32_ieee(&bytes[..bytes.len() - 4]);
let crc_off = bytes.len() - 4;
bytes[crc_off..].copy_from_slice(&new_crc.to_le_bytes());
let err = WeightBlob::parse(&bytes).err().expect("rejected");
assert!(format!("{err}").to_lowercase().contains("gqa"));
}
#[test]
fn header_constants_match_wire_layout() {
// Anchor the public constants so they can't drift silently.
assert_eq!(WEIGHT_BLOB_MAGIC, 0x5256_4E45);
assert_eq!(WEIGHT_BLOB_VERSION, 1);
assert_eq!(WEIGHT_BLOB_HEADER_LEN, 24);
}
// Mirror of the production CRC32 so the size-mismatch / GQA tests can
// re-CRC after their patch. Kept out of the public API.
fn crc32_ieee(data: &[u8]) -> u32 {
let mut crc = 0xFFFF_FFFFu32;
for &b in data {
crc ^= b as u32;
for _ in 0..8 {
let mask = 0u32.wrapping_sub(crc & 1);
crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
}
}
!crc
}

View File

@ -24,6 +24,11 @@ required-features = ["tch-backend"]
default = []
tch-backend = ["tch"]
cuda = ["tch-backend"]
# ADR-096 sparse-GQA temporal head. Pulls wifi-densepose-temporal in
# alongside tch — the new path is additive, doesn't touch the existing
# model.rs code paths, and stays opt-in until the §5 validation gate
# clears.
aether-sparse-temporal = ["tch-backend", "dep:wifi-densepose-temporal"]
[dependencies]
# Internal crates
@ -54,6 +59,10 @@ ruvector-temporal-tensor = { workspace = true }
ruvector-solver = { workspace = true }
ruvector-attention = { workspace = true }
# AETHER temporal head (ADR-096). Optional + tch-gated — only meaningful
# alongside the existing tch-bound model graph.
wifi-densepose-temporal = { workspace = true, optional = true }
# Data loading
ndarray-npy.workspace = true
memmap2 = "0.9"

View File

@ -69,6 +69,13 @@ pub mod proof;
#[cfg(feature = "tch-backend")]
pub mod trainer;
// ADR-096 AETHER temporal head — additive integration. Pulled in via
// the `aether-sparse-temporal` feature, which itself requires
// `tch-backend`. Kept under its own cfg so the existing build with
// just `tch-backend` is byte-equivalent to before.
#[cfg(feature = "aether-sparse-temporal")]
pub mod temporal_aether;
// Convenient re-exports at the crate root.
pub use config::TrainingConfig;
pub use dataset::{CsiDataset, CsiSample, DataLoader, MmFiDataset, SyntheticCsiDataset, SyntheticConfig};

View File

@ -0,0 +1,178 @@
//! ADR-096 AETHER temporal head — `tch::nn` bridge.
//!
//! Additive integration: wires `wifi-densepose-temporal` (sparse-GQA
//! attention + streaming KvCache) into the train crate's tch graph.
//! Does NOT modify the existing `WiFiDensePoseModel` forward in
//! `model.rs` — that path stays bit-equivalent for back-compat. Use
//! this aggregator alongside the existing model when you want a
//! temporal-axis pooling on top of per-frame backbone features.
//!
//! Bridge boundary:
//! tch::Tensor [T, in_dim] → Tensor3 (seq=T, heads, dim) → attention
//! ← Tensor3 ← forward()
//! tch::Tensor [in_dim] (pooled embedding)
//!
//! Memory pattern: tch.copy_data → Vec<f32> → Tensor3::from_vec on the
//! way in; Tensor3 raw → Tensor::of_slice on the way out. Two host
//! copies per call. For training-rate forwards (~100 calls/sec at
//! batch 16) this is negligible vs the actual attention work; for
//! inference-rate streaming it'd be the bottleneck and a
//! zero-copy path is the natural Phase 2.
//!
//! Only the B=1 prefill path is implemented in this commit. Multi-batch
//! and the streaming `step()` bridge land when the §5 validation gate
//! turns green and we need to take the perf hit seriously.
//!
//! Feature-gated: `aether-sparse-temporal` (also requires `tch-backend`).
use tch::{
nn::{self, Module},
Device, Kind, Tensor,
};
use wifi_densepose_temporal::{
AetherTemporalHead, TemporalBackendKind, TemporalError, TemporalHeadConfig, Tensor3,
};
/// Aggregator: tch-side projections + the pure-Rust sparse attention
/// kernel + a tch-side output projection. The projection layers are
/// `nn::Linear` so they participate in the tch VarStore the same way
/// the rest of the model does — gradients, save/load, etc.
pub struct AetherTemporalAggregator {
cfg: TemporalHeadConfig,
in_dim: i64,
// tch-side learnable projections.
q_proj: nn::Linear,
k_proj: nn::Linear,
v_proj: nn::Linear,
o_proj: nn::Linear,
// The kernel itself is configuration-only; no weights live inside
// because the sparse attention forward is purely a function of
// q/k/v + the SparseAttentionConfig.
head: AetherTemporalHead,
}
impl AetherTemporalAggregator {
/// Build the aggregator. `vs` is the tch namespace under which
/// the four projection layers register. `in_dim` is the input
/// feature dimension per frame (e.g. backbone output dim).
pub fn new(vs: nn::Path, in_dim: i64, cfg: TemporalHeadConfig) -> Result<Self, TemporalError> {
cfg.validate()?;
// Backend has to be Sparse — Dense projections would still
// work, but the whole point of this integration is the new
// sparse-GQA path. If a caller wants dense, they can keep
// using `apply_antenna_attention` / `apply_spatial_attention`
// from model.rs.
if !matches!(cfg.backend, TemporalBackendKind::SparseGqa) {
return Err(TemporalError::InvalidConfig(
"aggregator only wires SparseGqa; use existing model.rs paths for dense",
));
}
let total_q = (cfg.q_heads * cfg.head_dim) as i64;
let total_kv = (cfg.kv_heads * cfg.head_dim) as i64;
let q_proj = nn::linear(&vs / "q_proj", in_dim, total_q, Default::default());
let k_proj = nn::linear(&vs / "k_proj", in_dim, total_kv, Default::default());
let v_proj = nn::linear(&vs / "v_proj", in_dim, total_kv, Default::default());
let o_proj = nn::linear(&vs / "o_proj", total_q, in_dim, Default::default());
let head = AetherTemporalHead::new(&cfg)?;
Ok(Self {
cfg,
in_dim,
q_proj,
k_proj,
v_proj,
o_proj,
head,
})
}
/// Forward over a single sequence of frames. Input shape:
/// `[T, in_dim]` (NB: B=1 only this version — see file header).
/// Returns the per-token attention output passed through the
/// output projection: `[T, in_dim]`.
///
/// Pooling (mean over T, last-token, attention-pool, etc.) is the
/// caller's job — different downstream consumers want different
/// pools and we don't want to bake one in.
pub fn forward(&self, frames: &Tensor) -> Result<Tensor, TemporalError> {
let dims = frames.size();
if dims.len() != 2 || dims[1] != self.in_dim {
return Err(TemporalError::InvalidConfig(
"aggregator.forward expects [T, in_dim] tch::Tensor",
));
}
let t = dims[0] as usize;
let device = frames.device();
// ── Project to Q/K/V on the tch side ──────────────────────
let q_th = self.q_proj.forward(frames); // [T, q_heads*head_dim]
let k_th = self.k_proj.forward(frames); // [T, kv_heads*head_dim]
let v_th = self.v_proj.forward(frames); // [T, kv_heads*head_dim]
// ── Bridge to Tensor3 (CPU, f32) ──────────────────────────
let q_t3 = tch_to_tensor3(&q_th, t, self.cfg.q_heads, self.cfg.head_dim)?;
let k_t3 = tch_to_tensor3(&k_th, t, self.cfg.kv_heads, self.cfg.head_dim)?;
let v_t3 = tch_to_tensor3(&v_th, t, self.cfg.kv_heads, self.cfg.head_dim)?;
// ── Sparse attention forward (pure-Rust path) ────────────
let attn_out = self.head.forward(&q_t3, &k_t3, &v_t3)?;
// ── Bridge back to tch ───────────────────────────────────
let attn_th = tensor3_to_tch(&attn_out, device);
// attn_th shape is [T, q_heads*head_dim].
// ── Output projection on tch side ────────────────────────
let out = self.o_proj.forward(&attn_th); // [T, in_dim]
Ok(out)
}
}
/// Reshape a `[T, heads*head_dim]` tch::Tensor on (any device, any
/// kind) into a CPU `Tensor3(seq=T, heads, head_dim)`. Forces f32 +
/// CPU + contiguous memory; copies once.
fn tch_to_tensor3(
th: &Tensor,
seq: usize,
heads: usize,
head_dim: usize,
) -> Result<Tensor3, TemporalError> {
let dims = th.size();
if dims.len() != 2 || dims[0] as usize != seq || dims[1] as usize != heads * head_dim {
return Err(TemporalError::InvalidConfig(
"tch_to_tensor3 shape mismatch",
));
}
let cpu = th.to_kind(Kind::Float).to_device(Device::Cpu).contiguous();
let total = seq * heads * head_dim;
let mut buf = vec![0.0f32; total];
cpu.copy_data(&mut buf, total);
// tch row-major flatten gives [seq][heads*head_dim]. Tensor3
// expects [seq][heads][dim] in the same row-major order, so the
// contiguous bytes are layout-compatible — no per-element
// transpose required.
Tensor3::from_vec(buf, seq, heads, head_dim)
.map_err(|e| TemporalError::InvalidConfig(Box::leak(format!("from_vec: {e}").into_boxed_str())))
}
/// Inverse of `tch_to_tensor3`: take a `Tensor3(seq, heads, dim)` and
/// produce a `[seq, heads*dim]` tch::Tensor on the requested device.
fn tensor3_to_tch(t3: &Tensor3, device: Device) -> Tensor {
let (seq, heads, dim) = t3.shape();
// Tensor3 stores seq×heads×dim contiguously; flatten heads/dim
// by reading the row at each (seq, head) and concatenating.
let mut flat = Vec::with_capacity(seq * heads * dim);
for s in 0..seq {
for h in 0..heads {
flat.extend_from_slice(t3.row(s, h));
}
}
Tensor::from_slice(&flat)
.reshape([seq as i64, (heads * dim) as i64])
.to_device(device)
}