Streaming decode is the structural advantage that is the entire point of
ADR-096: O(log T) work per new token via decode_step against an
accumulated KvCache, versus an O(N²) recompute for dense MHA. This commit
lands the API and proves numerical equivalence with forward() at the last
position.
API:
- AetherTemporalHead::step(q_new, k_new, v_new, &mut cache)
Single-token decode. Appends (k_new, v_new) to cache, runs
decode_step(q_new) against the now-updated cache, returns the new
position's output.
- AetherTemporalHead::make_cache(capacity)
Convenience constructor — caller doesn't need to import
ruvllm_sparse_attention to size a cache. Per ADR-096 §8.5 the
natural lifetime is per-PoseTrack (re-ID) or per-session (online
classification); when the track drops, drop the cache.
- KvCache re-exported at the crate root.
Contract:
- q_new/k_new/v_new must each have seq == 1. Multi-token q is the
prefill path (forward), not decode_step.
- Cache lifetime is the caller's. The crate enforces shape via
make_cache so callers can't mismatch kv_heads / head_dim / block_size.
- Handling a full KvCache is the caller's responsibility. Upstream H2O
heavy-hitter eviction is opt-in; this crate's wrapper does not pre-select
a policy.
Tests (18/18 total now passing):
- streaming_step_matches_forward_at_last_position — central claim:
16-token sequence, append k/v one at a time via step(), compare
the streamed last-token output to forward(full Q,K,V)[N-1].
max_abs_err < 1e-3 (currently passes well under that bound for
the 0.1-magnitude activations the test uses).
- step_rejects_multi_token_q — contract enforcement.
- make_cache_returns_kvcache_with_correct_shape — wiring smoke,
confirms (capacity, kv_heads, dim, block_size) ordering is correct
through the make_cache wrapper.
The test config uses an MHA shape (q_heads == kv_heads) because the
upstream decode_step is wired to the MHA branch. The GQA decode path is on
upstream's roadmap; once it ships there, it lands here in a separate
ADR-096 follow-up.
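The equivalence claim above can be illustrated with a toy model. This is a minimal sketch, not the crate's code: `ToyCache`, `attend`, and the free functions `step` and `forward` are hypothetical stand-ins, and plain single-head softmax attention replaces the upstream sparse kernel. It only shows why appending one (k, v) pair at a time and attending over the cache should match the causal prefill at the last position.

```rust
/// Hypothetical stand-in for the upstream KvCache: one head, flat row-major storage.
struct ToyCache {
    head_dim: usize,
    keys: Vec<f32>,   // len * head_dim
    values: Vec<f32>, // len * head_dim
}

impl ToyCache {
    fn new(capacity: usize, head_dim: usize) -> Self {
        ToyCache {
            head_dim,
            keys: Vec::with_capacity(capacity * head_dim),
            values: Vec::with_capacity(capacity * head_dim),
        }
    }

    fn len(&self) -> usize {
        self.keys.len() / self.head_dim
    }
}

/// Softmax attention of one query over the first `n` rows of keys/values.
fn attend(q: &[f32], keys: &[f32], values: &[f32], n: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    let scores: Vec<f32> = (0..n)
        .map(|t| q.iter().zip(&keys[t * d..(t + 1) * d]).map(|(a, b)| a * b).sum::<f32>() * scale)
        .collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; d];
    for t in 0..n {
        let w = exps[t] / denom;
        for j in 0..d {
            out[j] += w * values[t * d + j];
        }
    }
    out
}

/// Single-token decode: reject anything that is not exactly one token,
/// append (k_new, v_new), then attend over the updated cache.
fn step(q_new: &[f32], k_new: &[f32], v_new: &[f32], cache: &mut ToyCache) -> Result<Vec<f32>, String> {
    let d = cache.head_dim;
    if q_new.len() != d || k_new.len() != d || v_new.len() != d {
        return Err("step expects exactly one token per call".to_string());
    }
    cache.keys.extend_from_slice(k_new);
    cache.values.extend_from_slice(v_new);
    Ok(attend(q_new, &cache.keys, &cache.values, cache.len(), d))
}

/// Causal prefill: row i attends over rows 0..=i.
fn forward(q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(n * d);
    for i in 0..n {
        out.extend(attend(&q[i * d..(i + 1) * d], k, v, i + 1, d));
    }
    out
}

/// Max absolute difference between the streamed last-token output and forward()[N-1].
fn max_abs_err_at_last(n: usize, d: usize) -> f32 {
    // Deterministic inputs with magnitude at most 0.1.
    let make = |salt: usize| -> Vec<f32> {
        (0..n * d)
            .map(|i| 0.1 * (((i * 7 + salt * 13) % 11) as f32 / 5.0 - 1.0))
            .collect()
    };
    let (q, k, v) = (make(1), make(2), make(3));
    let full = forward(&q, &k, &v, n, d);
    let mut cache = ToyCache::new(n, d);
    let mut last = Vec::new();
    for i in 0..n {
        let r = i * d..(i + 1) * d;
        last = step(&q[r.clone()], &k[r.clone()], &v[r], &mut cache).unwrap();
    }
    last.iter()
        .zip(&full[(n - 1) * d..])
        .map(|(a, b)| (a - b).abs())
        .fold(0.0, f32::max)
}
```

Calling `max_abs_err_at_last(16, 8)` mirrors the 16-token shape of the test; in the real crate the two paths run different kernels, which is why the test needs a tolerance rather than exact equality.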
Co-Authored-By: claude-flow <ruv@ruv.net>
Closes the host→file→firmware loop on the Phase 1 weight format. A real
.rvne artifact is emitted from the example, parsed back through the
filesystem in the e2e test, and byte-identical across two seeded runs.
- examples/init_random_blob.rs — produces a 41,244-byte deployable blob
matching the AETHER default head shape (input_dim=16, q_heads=4,
kv_heads=1 [MQA], head_dim=32, layers=2, classes=4 — staying coherent
with TemporalHeadConfig::default_aether so a real trainer can drop
in this shape with one search-and-replace). Uses xorshift64* with a
fixed seed (0xC511_0007_DEAD_BEEF) for reproducibility.
Per-layer weight count derivation lives in the example (Wq + Wk +
Wv + Wo, plus a final classifier head) so the kernel's expectation
is anchored in code rather than a comment that drifts.
- tests/blob_e2e.rs — two new tests, 15/15 total now passing:
* realistic_blob_roundtrips_through_filesystem — writes a 25+ KB
blob to std::env::temp_dir(), reads it back, parses, validates.
Mirrors what the firmware loader will do once the toolchain
unblocks (mmap NVS or EMBED_FILES → parse).
* deterministic_seed_produces_byte_identical_blobs — same seed
produces byte-identical output, twice. This is what makes a
witness-bundle (ADR-028) over trained weights meaningful.
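For readers unfamiliar with the generator, here is a minimal sketch of xorshift64* producing a reproducible FP32 weight buffer. The shift triple (12, 25, 27) and the multiplier are the standard xorshift64* constants; the `next_f32` scaling to [-0.1, 0.1) and the helper name `random_weight_bytes` are assumptions for illustration, not what the example necessarily does.

```rust
/// xorshift64* PRNG. The state must be non-zero.
struct XorShift64Star(u64);

impl XorShift64Star {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Assumed scaling: take the top 24 bits as a value in [0, 1),
    /// then map it to [-0.1, 0.1).
    fn next_f32(&mut self) -> f32 {
        let unit = (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32;
        (unit * 2.0 - 1.0) * 0.1
    }
}

/// Hypothetical helper: `count` little-endian FP32 weights from a fixed seed.
fn random_weight_bytes(seed: u64, count: usize) -> Vec<u8> {
    let mut rng = XorShift64Star(seed);
    let mut out = Vec::with_capacity(count * 4);
    for _ in 0..count {
        out.extend_from_slice(&rng.next_f32().to_le_bytes());
    }
    out
}
```

Because the generator has no hidden state beyond the seed, two calls with 0xC511_0007_DEAD_BEEF yield the same bytes, which is the property the determinism test relies on.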
Verified by running the example with an explicit out path:

    cargo run -p wifi-densepose-temporal --example init_random_blob -- \
        v2/target/example-output/model_init.rvne

→ 41244 bytes, parses clean, dtype/shape/CRC all good.
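The 41,244-byte figure can be reproduced by hand. The sketch below assumes FP32 weights, no biases, Wq/Wk/Wv projecting from input_dim and Wo projecting back to it, and a classifier of input_dim × classes; those shapes are an assumption (the authoritative derivation lives in the example), but they do add up to the reported size together with the 24-byte header and 4-byte CRC footer. `Shape` and the function names are hypothetical.

```rust
/// Hypothetical mirror of the shape fields named in the commit text.
struct Shape {
    input_dim: usize,
    q_heads: usize,
    kv_heads: usize,
    head_dim: usize,
    layers: usize,
    classes: usize,
}

/// The AETHER default head shape quoted above.
fn default_aether_shape() -> Shape {
    Shape { input_dim: 16, q_heads: 4, kv_heads: 1, head_dim: 32, layers: 2, classes: 4 }
}

/// Number of scalar weights under the assumed shapes (no biases).
fn weight_count(s: &Shape) -> usize {
    let wq = s.input_dim * s.q_heads * s.head_dim; // 16 * 128 = 2048
    let wk = s.input_dim * s.kv_heads * s.head_dim; // 16 * 32 = 512
    let wv = wk; // 512
    let wo = s.q_heads * s.head_dim * s.input_dim; // 128 * 16 = 2048
    let classifier = s.input_dim * s.classes; // 16 * 4 = 64
    s.layers * (wq + wk + wv + wo) + classifier
}

/// Total blob size for FP32: 24-byte header + 4 bytes per weight + 4-byte CRC32.
fn blob_len_fp32(s: &Shape) -> usize {
    24 + 4 * weight_count(s) + 4
}
```

With the default shape this gives 10,304 weights and 24 + 41,216 + 4 = 41,244 bytes.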
What this isn't yet:
- Not a trained model. Random init only.
- Not a kernel forward over the blob. That requires the firmware
Rust component to compile (Phase 5 — toolchain blocker).
- Not wired into wifi-densepose-train. ADR-096 §8.1 flagged that
the AETHER train crate has no temporal-axis attention yet; that
integration is a separate piece of work.
Co-Authored-By: claude-flow <ruv@ruv.net>
The training/firmware boundary needs a stable serialization for the
temporal head's weights, distinct from the kernel scaffold and the
firmware ABI. This commit defines that format on the host side. The
firmware-side mirrored loader lands when the toolchain unblocks.
Format:
- Header (24 B): magic 'RVNE' / version 1 / dtype flag
(FP32 / FP16) / input_dim / n_q_heads / n_kv_heads / head_dim /
n_layers / n_classes / weights_len.
- Body: weights_len bytes of flat per-layer weights.
- Footer (4 B): CRC32 (IEEE 802.3) over all preceding bytes, using the
same polynomial as temporal_task.c, so a blob produced here parses on
the firmware unchanged.
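A minimal sketch of the header / body / footer framing follows. The field widths are an assumption: the commit gives only the field order and the 24-byte total, and the split used here (4-byte magic, eight u16 fields, u32 weights_len) is just one assignment that sums to 24. The dtype value 0 for FP32 and the names `encode_blob` and `check_blob` are likewise hypothetical. The CRC routine is the standard reflected CRC-32 with polynomial 0xEDB88320.

```rust
const MAGIC: [u8; 4] = *b"RVNE";
const HEADER_LEN: usize = 24;

/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
fn crc32_ieee(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, zero otherwise
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// dims = [input_dim, n_q_heads, n_kv_heads, head_dim, n_layers, n_classes]
/// (assumed u16 each). Everything is little-endian.
fn encode_blob(dims: [u16; 6], dtype: u16, weights: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + weights.len() + 4);
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&1u16.to_le_bytes()); // version 1
    out.extend_from_slice(&dtype.to_le_bytes());
    for d in dims.iter() {
        out.extend_from_slice(&d.to_le_bytes());
    }
    out.extend_from_slice(&(weights.len() as u32).to_le_bytes());
    out.extend_from_slice(weights);
    let crc = crc32_ieee(&out); // CRC covers header + body
    out.extend_from_slice(&crc.to_le_bytes());
    out
}

/// Check magic, version, size, and CRC before handing back the weight bytes.
fn check_blob(bytes: &[u8]) -> Result<&[u8], &'static str> {
    if bytes.len() < HEADER_LEN + 4 {
        return Err("too short");
    }
    if bytes[..4] != MAGIC[..] {
        return Err("bad magic");
    }
    if u16::from_le_bytes([bytes[4], bytes[5]]) != 1 {
        return Err("wrong version");
    }
    let weights_len = u32::from_le_bytes([bytes[20], bytes[21], bytes[22], bytes[23]]) as usize;
    if bytes.len() != HEADER_LEN + weights_len + 4 {
        return Err("size mismatch");
    }
    let (covered, footer) = bytes.split_at(bytes.len() - 4);
    let stored = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    if crc32_ieee(covered) != stored {
        return Err("crc mismatch");
    }
    Ok(&covered[HEADER_LEN..])
}
```

Running the CRC last on the encode side and before exposing weights on the parse side keeps a single corrupted byte anywhere in the header or body from reaching the kernel.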
Layout decisions:
- Little-endian throughout (Xtensa native).
- Weights kept as Vec<u8> rather than Vec<f32>/Vec<f16> so the no_std
firmware loader (which may not have the `half` crate) can mmap and
read either dtype directly.
- Versioning is hard-break: bumping `version` means firmware refuses
to load. Optional fields are gated behind reserved flag bits and are
never introduced by reordering fields. Documented inline.
Validation surface:
- `WeightBlobHeader::validate()` catches zero dims, invalid GQA
ratios (n_q_heads % n_kv_heads != 0), n_layers=0, n_classes<2.
Same checks fire from `WeightBlob::parse()` so the firmware can't
accidentally accept a blob the host should have rejected.
- `WeightBlob::parse()` enforces magic / version / size / CRC
before exposing weights to the caller.
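The header checks listed above could look roughly like the following. This is a sketch under assumptions: `Header`, `HeaderError`, and the free function `validate` are hypothetical names and field types, not the crate's `WeightBlobHeader::validate()`. Only the four rules come from the commit text.

```rust
#[derive(Debug, PartialEq)]
enum HeaderError {
    ZeroDim,
    InvalidGqaRatio,
    NoLayers,
    TooFewClasses,
}

/// Hypothetical header view; field widths are assumed.
struct Header {
    input_dim: u16,
    n_q_heads: u16,
    n_kv_heads: u16,
    head_dim: u16,
    n_layers: u16,
    n_classes: u16,
}

fn validate(h: &Header) -> Result<(), HeaderError> {
    // Zero checks come first so the modulo below can never divide by zero.
    if h.input_dim == 0 || h.n_q_heads == 0 || h.n_kv_heads == 0 || h.head_dim == 0 {
        return Err(HeaderError::ZeroDim);
    }
    if h.n_q_heads % h.n_kv_heads != 0 {
        return Err(HeaderError::InvalidGqaRatio);
    }
    if h.n_layers == 0 {
        return Err(HeaderError::NoLayers);
    }
    if h.n_classes < 2 {
        return Err(HeaderError::TooFewClasses);
    }
    Ok(())
}
```

Calling the same function from both the build path and the parse path is what keeps the host and firmware from disagreeing about which blobs are acceptable.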
Tests (8/8 passing, alongside 5/5 sparse smoke = 13/13 total):
- roundtrip_fp32, roundtrip_fp16
- parse_rejects_bad_magic, _wrong_version, _size_mismatch,
_crc_corruption, _invalid_gqa_ratio_in_header
- header_constants_match_wire_layout (anchor)
What's deliberately NOT in this commit:
- The firmware-side mirrored loader (deferred to the iteration that
unblocks the ESP Rust toolchain, since there is no point shipping a
parser that can't be compiled).
- Per-layer weight ordering. The blob is a flat byte-buffer; the
interpretation of per-layer offsets is the kernel's contract,
documented in the eventual model module (ADR-095 §3.2 follow-up).
Co-Authored-By: claude-flow <ruv@ruv.net>
Implements Phases 1-3 of the ADR-096 roadmap:
Phase 1: workspace integration
- Add `ruvllm_sparse_attention` as a path-vendored workspace dep against
`vendor/ruvector/crates/ruvllm_sparse_attention`, default-features=false,
features=["fp16"]. Mirrors the no_std posture ADR-095 will need on the
firmware side so both consumers share a single feature set.
- Register `wifi-densepose-temporal` as workspace member.
Phase 2: AETHER temporal head
- `AetherTemporalHead` facade dispatches to a `SparseGqa` backend wrapping
`SubquadraticSparseAttention`. Selection rule from ADR-096 §4.4 enforced
at forward(): MHA branch when q_heads == kv_heads, GQA branch otherwise.
- `Dense` backend reserved (returns typed `DenseBackendNotImplemented`)
so config-time validation fails loudly instead of at forward().
- `TemporalHeadConfig::default_aether()` matches the AETHER training
default per ADR-096 §3.1 (window=32, block=16, q=4, kv=1 → MQA).
- Token 0 always wired as a global anchor — preserves AETHER's
contrastive "session-start reference" role per ADR-024.
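The selection rule can be sketched as a small pure function. `Branch` and `select_branch` are hypothetical names used only for illustration; the rule itself (MHA when q_heads == kv_heads, GQA otherwise, with non-divisible ratios rejected) is taken from the description above, and MQA is treated as the kv_heads == 1 case of GQA.

```rust
#[derive(Debug, PartialEq)]
enum Branch {
    Mha,
    /// `group_size` query heads share each KV head; kv_heads == 1 is MQA.
    Gqa { group_size: usize },
}

/// Pick the attention branch from the head counts, rejecting invalid ratios.
fn select_branch(q_heads: usize, kv_heads: usize) -> Result<Branch, String> {
    if q_heads == 0 || kv_heads == 0 {
        return Err("head counts must be non-zero".to_string());
    }
    if q_heads % kv_heads != 0 {
        return Err(format!("q_heads ({q_heads}) is not a multiple of kv_heads ({kv_heads})"));
    }
    if q_heads == kv_heads {
        Ok(Branch::Mha)
    } else {
        Ok(Branch::Gqa { group_size: q_heads / kv_heads })
    }
}
```

Under this sketch the AETHER default (q=4, kv=1) resolves to `Gqa { group_size: 4 }`.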
Phase 3: smoke tests (5/5 passing)
- forward at the AETHER default config, both MHA and GQA dispatch paths,
a rejected dense backend, a rejected non-divisible GQA ratio, and the
long-window roadmap target (N=1000, the 10 s @ 100 Hz case from
ADR-096 §3.1). The last case shows the kernel runs at lengths where
dense MHA costs roughly 10⁶ edge ops versus roughly 10⁴ for sparse.
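The order-of-magnitude comparison for N=1000 can be checked with a few lines of arithmetic. This is an estimate only: it assumes dense attention touches N² edges and that the sparse kernel touches about N·⌈log₂ N⌉, ignoring the local-window and global-anchor constants, which the commit does not quantify. The function names are hypothetical.

```rust
/// Ceiling of log2(n) for n >= 1, using integer bit counting.
fn ceil_log2(n: u64) -> u32 {
    if n <= 1 {
        0
    } else {
        64 - (n - 1).leading_zeros()
    }
}

/// Dense attention: every query position scores every key position.
fn dense_edge_ops(n: u64) -> u64 {
    n * n
}

/// Rough sparse estimate: about log2(n) edges per position (assumption).
fn sparse_edge_ops_estimate(n: u64) -> u64 {
    n * ceil_log2(n) as u64
}
```

For n = 1000 this gives 1,000,000 dense edge ops against an estimated 10,000 sparse ones, matching the 10⁶ vs 10⁴ figures quoted above.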
Streaming `step()` deferred — KvCache lifecycle ties to PoseTrack per
ADR-096 §8.5 and lands when the firmware-side ABI does (Phase 4+).
Co-Authored-By: claude-flow <ruv@ruv.net>