Validates the central performance claim of ADR-096 with a runnable
benchmark: single-run wall-clock timings, pure-Rust dense vs pure-Rust
sparse, on an x86_64 host. These are measured numbers, not just an
analytic argument.
Results (N=64..1024):
| N | Dense (ms) | Sparse (ms) | Speedup |
|--------|-----------:|------------:|--------:|
| 64 | 0.262 | 0.141 | 1.86× |
| 128 | 1.120 | 0.335 | 3.34× |
| 256 | 4.129 | 0.711 | 5.81× |
| 512 | 19.230 | 2.356 | 8.16× |
| 1024 | 71.904 | 3.389 | 21.21× |
Asymptotic check: 64→1024 is 16× more tokens. Dense's 274× cost
growth is close to the N² prediction (16² = 256×). Sparse's 24× growth
is close to the N log N prediction (16 · log(1024)/log(64) ≈ 27×).
The measurements are consistent with the complexity claim.
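The arithmetic behind the asymptotic check can be reproduced with a short sketch like the one below. It only re-derives the growth ratios from the table rows; the function names are made up for illustration and are not part of this crate.

```rust
/// (N, dense ms, sparse ms) copied from the results table in this message.
const RESULTS: [(u32, f64, f64); 5] = [
    (64, 0.262, 0.141),
    (128, 1.120, 0.335),
    (256, 4.129, 0.711),
    (512, 19.230, 2.356),
    (1024, 71.904, 3.389),
];

/// Predicted cost growth from n0 to n1 under O(N^2). Hypothetical helper.
fn quadratic_growth(n0: u32, n1: u32) -> f64 {
    let r = f64::from(n1) / f64::from(n0);
    r * r
}

/// Predicted cost growth from n0 to n1 under O(N log N). Hypothetical helper.
fn n_log_n_growth(n0: u32, n1: u32) -> f64 {
    let (a, b) = (f64::from(n0), f64::from(n1));
    (b * b.ln()) / (a * a.ln())
}

fn main() {
    // Compare the smallest and largest N in the table.
    let (n0, d0, s0) = RESULTS[0];
    let (n1, d1, s1) = RESULTS[RESULTS.len() - 1];
    println!(
        "dense:  measured {:.0}x, N^2 predicts {:.0}x",
        d1 / d0,
        quadratic_growth(n0, n1)
    );
    println!(
        "sparse: measured {:.0}x, N log N predicts {:.0}x",
        s1 / s0,
        n_log_n_growth(n0, n1)
    );
}
```

With single-run timings, agreement within roughly 10% of the predicted ratio is about as much as this check can show.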
The honest-framing paragraph in ADR-096 §3.1 predicted N=64 would be
overhead-bound; we measured 1.86× there, consistent with the ADR's
warning that AETHER's current `window_frames=100` default is below
the inflection point where sparse pays off.
What this commit adds:
- examples/bench_speedup.rs: measures dense_attention (upstream
  reference), AetherTemporalHead.forward (this crate's wrapper),
  and SubquadraticSparseAttention.forward (raw, to confirm the
  wrapper isn't introducing overhead; it isn't, as the two are
  within noise).
- benches_results.md: captured table + asymptotic check + caveats
  (config used, what the benchmark doesn't measure, how to run).
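For readers who want the shape of a single-run wall-clock measurement without opening the example, here is a minimal sketch. It is not the contents of examples/bench_speedup.rs: `time_ms` and `naive_dense_attention` are hypothetical stand-ins, the attention is single-head and unmasked, and the head dimension and input data are arbitrary.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical helper: run `f` once and return (result, elapsed milliseconds).
fn time_ms<T>(f: impl FnOnce() -> T) -> (T, f64) {
    let start = Instant::now();
    // black_box keeps the optimizer from discarding the computation.
    let out = black_box(f());
    (out, start.elapsed().as_secs_f64() * 1e3)
}

/// Naive single-head O(N^2) attention over row-major [n, d] buffers.
/// A stand-in for illustration, not the crate's `dense_attention`.
fn naive_dense_attention(q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; n * d];
    let mut scores = vec![0.0f32; n];
    for i in 0..n {
        let qi = &q[i * d..(i + 1) * d];
        // Scaled dot-product scores of query i against every key.
        let mut max = f32::NEG_INFINITY;
        for j in 0..n {
            let kj = &k[j * d..(j + 1) * d];
            let s = qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale;
            scores[j] = s;
            max = max.max(s);
        }
        // Numerically stable softmax over the score row.
        let mut denom = 0.0f32;
        for s in scores.iter_mut() {
            *s = (*s - max).exp();
            denom += *s;
        }
        // Weighted sum of value rows.
        for j in 0..n {
            let w = scores[j] / denom;
            for c in 0..d {
                out[i * d + c] += w * v[j * d + c];
            }
        }
    }
    out
}

fn main() {
    let d = 32; // hypothetical head dimension
    for n in [64usize, 128, 256] {
        // Deterministic filler data so the sketch needs no RNG crate.
        let data: Vec<f32> = (0..n * d).map(|i| ((i % 17) as f32) * 0.01).collect();
        let (out, ms) = time_ms(|| naive_dense_attention(&data, &data, &data, n, d));
        println!("N={n:4} dense {ms:.3} ms (out[0]={:.4})", out[0]);
    }
}
```

A single run like this is sensitive to cache state and scheduler noise, so it is best read as an order-of-magnitude check and built with `--release`.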
Run it:
    cargo run -p wifi-densepose-temporal --example bench_speedup --release
What's NOT measured here:
- Decode-step latency (already proven correct at the last token, but
  not yet timed against a hypothetical O(N²) dense decode; the two are
  structurally not comparable anyway).
- Memory footprint of KvCache + FP16 (matters on firmware, not host).
- GQA dispatch: this bench uses the MHA shape so dense and sparse
  operate on identical tensors. Real AETHER will want MQA per
  TemporalHeadConfig::default_aether(), which halves KV memory.
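KV-cache size scales linearly with the number of KV heads, so the saving from GQA or MQA depends on how many query heads share each KV head. The sketch below shows that arithmetic only. Every shape value in it is hypothetical and is not read from TemporalHeadConfig, and `kv_cache_bytes` is an illustrative helper, not a function in this crate.

```rust
/// Bytes held by a K+V cache: two tensors of [n_tokens, kv_heads, head_dim].
/// Hypothetical helper for illustration.
fn kv_cache_bytes(
    n_tokens: usize,
    kv_heads: usize,
    head_dim: usize,
    bytes_per_elem: usize,
) -> usize {
    2 * n_tokens * kv_heads * head_dim * bytes_per_elem
}

fn main() {
    // All shape values are assumptions: 1024 tokens, head_dim 32, FP16 (2 bytes).
    let (n_tokens, head_dim, fp16) = (1024, 32, 2);
    // Fewer KV heads means proportionally less cache memory.
    for kv_heads in [4usize, 2, 1] {
        let bytes = kv_cache_bytes(n_tokens, kv_heads, head_dim, fp16);
        println!("kv_heads={kv_heads}: {} KiB", bytes / 1024);
    }
}
```

Under these assumed shapes, each halving of `kv_heads` halves the cache, which is why the footprint matters more on firmware than on the host.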
Co-Authored-By: claude-flow <ruv@ruv.net>