Validates the central performance claim of ADR-096 with a runnable
benchmark. Single-run wall-clock, pure-Rust vs pure-Rust on x86_64
host. Real numbers, not just analytic argument.
Results (N=64..1024):
| N | Dense (ms) | Sparse (ms) | Speedup |
|--------|-----------:|------------:|--------:|
| 64 | 0.262 | 0.141 | 1.86× |
| 128 | 1.120 | 0.335 | 3.34× |
| 256 | 4.129 | 0.711 | 5.81× |
| 512 | 19.230 | 2.356 | 8.16× |
| 1024 | 71.904 | 3.389 | 21.21× |
Asymptotic check: 64→1024 is 16× more tokens. Dense's 274× cost
growth matches N² (256× = 16²). Sparse's 24× growth matches
N log N (16 · log(1024)/log(64) ≈ 27). The complexity claim is
empirically supported.
ADR-096 §3.1 honest-framing paragraph predicted N=64 would be
overhead-bound; we measured 1.86× there, consistent with the ADR's
warning that AETHER's current `window_frames=100` default is below
the inflection point where sparse pays.
What this commit adds:
- examples/bench_speedup.rs — measures dense_attention (upstream
reference), AetherTemporalHead.forward (this crate's wrapper),
and SubquadraticSparseAttention.forward (raw, to confirm the
wrapper isn't introducing overhead — it isn't, the two are
within noise).
- benches_results.md — captured table + asymptotic check + caveats
(config used, what the benchmark doesn't measure, how to run).
Run it:
cargo run -p wifi-densepose-temporal --example bench_speedup --release
What's NOT measured here:
- Decode-step latency (already proved correct at last-token, not
yet timed against a hypothetical O(N²) dense decode — they're
structurally not comparable anyway).
- Memory footprint of KvCache + FP16 (matters on firmware, not host).
- GQA dispatch — this bench uses MHA shape so dense and sparse
operate on identical tensors. Real AETHER will want MQA per
TemporalHeadConfig::default_aether(), which halves KV memory.
Co-Authored-By: claude-flow <ruv@ruv.net>
Closes the host→file→firmware loop on the Phase 1 weight format. Real
.rvne artifact emitted from the example, parsed back through filesystem
in the e2e test, byte-identical across two seeded runs.
- examples/init_random_blob.rs — produces a 41,244-byte deployable blob
matching the AETHER default head shape (input_dim=16, q_heads=4,
kv_heads=1 [MQA], head_dim=32, layers=2, classes=4 — staying coherent
with TemporalHeadConfig::default_aether so a real trainer can drop
in this shape with one search-and-replace). Uses xorshift64* with a
fixed seed (0xC511_0007_DEAD_BEEF) for reproducibility.
Per-layer weight count derivation lives in the example (Wq + Wk +
Wv + Wo, plus a final classifier head) so the kernel's expectation
is anchored in code rather than a comment that drifts.
- tests/blob_e2e.rs — two new tests, 15/15 total now passing:
* realistic_blob_roundtrips_through_filesystem — writes a 25+ KB
blob to std::env::temp_dir(), reads it back, parses, validates.
Mirrors what the firmware loader will do once the toolchain
unblocks (mmap NVS or EMBED_FILES → parse).
* deterministic_seed_produces_byte_identical_blobs — same seed
produces byte-identical output, twice. This is what makes a
witness-bundle (ADR-028) over trained weights meaningful.
Verified by running the example with an explicit out path:
cargo run -p wifi-densepose-temporal --example init_random_blob -- \
v2/target/example-output/model_init.rvne
→ 41244 bytes, parses clean, dtype/shape/CRC all good.
What this isn't yet:
- Not a trained model. Random init only.
- Not a kernel forward over the blob. That requires the firmware
Rust component to compile (Phase 5 — toolchain blocker).
- Not wired into wifi-densepose-train. ADR-096 §8.1 flagged that
the AETHER train crate doesn't currently have a temporal-axis
attention; that integration is a separate piece of work.
Co-Authored-By: claude-flow <ruv@ruv.net>