Commit Graph

2 Commits

Author SHA1 Message Date
ruv 247794a2c5 bench(temporal): empirical sparse-vs-dense speedup curve (ADR-096 §3.1, #513)
Validates the central performance claim of ADR-096 with a runnable
benchmark. Timings are single-run wall-clock, pure Rust against pure
Rust, on an x86_64 host: measured numbers, not just an analytic
argument.

Results (N=64..1024):

| N      | Dense (ms) | Sparse (ms) | Speedup |
|--------|-----------:|------------:|--------:|
|     64 |      0.262 |       0.141 |   1.86× |
|    128 |      1.120 |       0.335 |   3.34× |
|    256 |      4.129 |       0.711 |   5.81× |
|    512 |     19.230 |       2.356 |   8.16× |
|   1024 |     71.904 |       3.389 |  21.21× |

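Asymptotic check: 64→1024 is 16× more tokens. Dense cost grows 274×
(71.904 / 0.262), close to the N² prediction (16² = 256×). Sparse cost
grows 24× (3.389 / 0.141), close to the N log N prediction
(16 · log(1024)/log(64) ≈ 27×). The complexity claim is empirically
supported, within the limits of single-run timing.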
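
The growth ratios quoted in the asymptotic check can be recomputed from
the table endpoints. A minimal sketch; the helper names are illustrative
and not part of the crate:

```rust
/// Ratio by which a measured cost grows between two runs.
fn growth(first_ms: f64, last_ms: f64) -> f64 {
    last_ms / first_ms
}

/// Growth predicted by an N^2 cost model when N scales from n0 to n1.
fn quadratic_growth(n0: f64, n1: f64) -> f64 {
    (n1 / n0).powi(2)
}

/// Growth predicted by an N log N cost model when N scales from n0 to n1.
fn nlogn_growth(n0: f64, n1: f64) -> f64 {
    (n1 * n1.ln()) / (n0 * n0.ln())
}

fn main() {
    // Endpoints copied from the results table (N=64 and N=1024 rows).
    let (n0, n1) = (64.0, 1024.0);
    let (dense0, dense1) = (0.262, 71.904);
    let (sparse0, sparse1) = (0.141, 3.389);

    println!(
        "dense:  measured {:.1}x, N^2 model {:.1}x",
        growth(dense0, dense1),
        quadratic_growth(n0, n1)
    );
    println!(
        "sparse: measured {:.1}x, N log N model {:.1}x",
        growth(sparse0, sparse1),
        nlogn_growth(n0, n1)
    );
}
```

The base of the logarithm cancels in the N log N ratio, so `ln` gives the
same prediction as log2 or log10.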

The honest-framing paragraph of ADR-096 §3.1 predicted N=64 would be
overhead-bound; we measured 1.86× there, consistent with the ADR's
warning that AETHER's current `window_frames=100` default is below
the inflection point where sparse pays.

What this commit adds:
- examples/bench_speedup.rs — measures dense_attention (upstream
  reference), AetherTemporalHead.forward (this crate's wrapper),
  and SubquadraticSparseAttention.forward (raw, to confirm the
  wrapper isn't introducing overhead — it isn't, the two are
  within noise).
- benches_results.md — captured table + asymptotic check + caveats
  (config used, what the benchmark doesn't measure, how to run).
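
For readers without the crate checked out, a single-run wall-clock
measurement of the kind described above might look like the sketch
below. `naive_dense_attention` and `time_once_ms` are hypothetical
stand-ins written for this illustration; they are not the upstream
`dense_attention` or anything in bench_speedup.rs, and the absolute
timings will not match the table.

```rust
use std::time::Instant;

/// Hypothetical single-head dense attention, O(N^2 * d).
/// `q`, `k`, `v` are row-major [n, d] buffers.
fn naive_dense_attention(q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; n * d];
    let mut scores = vec![0.0f32; n];
    for i in 0..n {
        // Scaled dot-product scores of query i against every key.
        let mut max = f32::NEG_INFINITY;
        for j in 0..n {
            let dot: f32 = (0..d).map(|c| q[i * d + c] * k[j * d + c]).sum();
            scores[j] = dot * scale;
            max = max.max(scores[j]);
        }
        // Numerically stable softmax over the row.
        let mut denom = 0.0f32;
        for s in scores.iter_mut() {
            *s = (*s - max).exp();
            denom += *s;
        }
        // Weighted sum of value rows.
        for j in 0..n {
            let w = scores[j] / denom;
            for c in 0..d {
                out[i * d + c] += w * v[j * d + c];
            }
        }
    }
    out
}

/// Run `f` once and return its result with the elapsed milliseconds.
fn time_once_ms<T>(f: impl FnOnce() -> T) -> (T, f64) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed().as_secs_f64() * 1e3)
}

fn main() {
    let d = 32;
    for n in [64usize, 128, 256] {
        // Deterministic filler data; the values do not matter for timing.
        let buf: Vec<f32> = (0..n * d).map(|i| (i % 7) as f32 * 0.1).collect();
        let (out, ms) = time_once_ms(|| naive_dense_attention(&buf, &buf, &buf, n, d));
        // Print a checksum so the optimizer cannot discard the work.
        let checksum: f32 = out.iter().sum();
        println!("N={n:5} dense={ms:.3} ms checksum={checksum:.3}");
    }
}
```

A single run is noisy; build with --release and treat the numbers as
indicative only.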

Run it:
  cargo run -p wifi-densepose-temporal --example bench_speedup --release

What's NOT measured here:
- Decode-step latency (already proved correct at last-token, not
  yet timed against a hypothetical O(N²) dense decode — they're
  structurally not comparable anyway).
- Memory footprint of KvCache + FP16 (matters on firmware, not host).
- GQA dispatch — this bench uses MHA shape so dense and sparse
  operate on identical tensors. Real AETHER will want MQA per
  TemporalHeadConfig::default_aether(), which halves KV memory.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:02:36 -04:00
ruv 73321db765 feat(temporal): init_random_blob example + filesystem e2e tests (#513)
Closes the host→file→firmware loop on the Phase 1 weight format. The
example emits a real .rvne artifact, the e2e test parses it back through
the filesystem, and the output is byte-identical across two seeded runs.

- examples/init_random_blob.rs — produces a 41,244-byte deployable blob
  matching the AETHER default head shape (input_dim=16, q_heads=4,
  kv_heads=1 [MQA], head_dim=32, layers=2, classes=4 — staying coherent
  with TemporalHeadConfig::default_aether so a real trainer can drop
  in this shape with one search-and-replace). Uses xorshift64* with a
  fixed seed (0xC511_0007_DEAD_BEEF) for reproducibility.

  Per-layer weight count derivation lives in the example (Wq + Wk +
  Wv + Wo, plus a final classifier head) so the kernel's expectation
  is anchored in code rather than a comment that drifts.

- tests/blob_e2e.rs — two new tests, 15/15 total now passing:
    * realistic_blob_roundtrips_through_filesystem — writes a 25+ KB
      blob to std::env::temp_dir(), reads it back, parses, validates.
      Mirrors what the firmware loader will do once the toolchain
      unblocks (mmap NVS or EMBED_FILES → parse).
    * deterministic_seed_produces_byte_identical_blobs — same seed
      produces byte-identical output, twice. This is what makes a
      witness-bundle (ADR-028) over trained weights meaningful.
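
The determinism and filesystem round-trip properties above can be
illustrated without the crate. A minimal sketch: the generator is the
standard xorshift64* step with the seed quoted in this message, while
`random_bytes` and the temp file name are hypothetical and the payload
is raw bytes, not a structured .rvne blob.

```rust
use std::{env, fs, io};

/// xorshift64* generator. The state must be non-zero.
struct XorShift64Star(u64);

impl XorShift64Star {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
}

/// Hypothetical helper: `len` pseudo-random bytes derived from `seed`.
fn random_bytes(seed: u64, len: usize) -> Vec<u8> {
    let mut rng = XorShift64Star(seed);
    let mut out = Vec::with_capacity(len + 8);
    while out.len() < len {
        out.extend_from_slice(&rng.next_u64().to_le_bytes());
    }
    out.truncate(len);
    out
}

fn main() -> io::Result<()> {
    const SEED: u64 = 0xC511_0007_DEAD_BEEF;

    // Same seed, two runs: the outputs must be byte-identical.
    let a = random_bytes(SEED, 41_244);
    let b = random_bytes(SEED, 41_244);
    assert_eq!(a, b, "same seed must give byte-identical output");

    // Write to the temp dir, read back, and compare.
    let path = env::temp_dir().join("xorshift_roundtrip_sketch.bin");
    fs::write(&path, &a)?;
    let read_back = fs::read(&path)?;
    fs::remove_file(&path)?;
    assert_eq!(a, read_back);
    Ok(())
}
```

Fixed little-endian serialization (`to_le_bytes`) keeps the bytes
identical across hosts, which is what a hash over the output needs.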

Verified by running the example with an explicit out path:
  cargo run -p wifi-densepose-temporal --example init_random_blob -- \
      v2/target/example-output/model_init.rvne
  → 41244 bytes, parses clean, dtype/shape/CRC all good.
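
As a sanity check on the 41,244-byte figure, here is one possible
reading of the "Wq + Wk + Wv + Wo, plus a final classifier head"
derivation. Everything beyond the quoted shape is an ASSUMPTION made for
this sketch: that each projection maps between input_dim and
heads × head_dim, that there are no biases, and that weights are stored
at 4 bytes each. The authoritative derivation lives in
examples/init_random_blob.rs.

```rust
/// Shape quoted in this commit message for the AETHER default head.
struct Shape {
    input_dim: usize,
    q_heads: usize,
    kv_heads: usize,
    head_dim: usize,
    layers: usize,
    classes: usize,
}

/// ASSUMPTION: bias-free projections between input_dim and heads * head_dim.
fn weights_per_layer(s: &Shape) -> usize {
    let wq = s.input_dim * s.q_heads * s.head_dim;
    let wk = s.input_dim * s.kv_heads * s.head_dim;
    let wv = s.input_dim * s.kv_heads * s.head_dim;
    let wo = s.q_heads * s.head_dim * s.input_dim;
    wq + wk + wv + wo
}

/// ASSUMPTION: the classifier head is a bias-free input_dim x classes matrix.
fn total_weights(s: &Shape) -> usize {
    s.layers * weights_per_layer(s) + s.input_dim * s.classes
}

fn main() {
    let s = Shape { input_dim: 16, q_heads: 4, kv_heads: 1, head_dim: 32, layers: 2, classes: 4 };
    let weights = total_weights(&s);
    // ASSUMPTION: 4 bytes per weight.
    let payload_bytes = weights * 4;
    println!("weights = {weights}, payload = {payload_bytes} bytes");
    println!("remainder vs 41244 = {} bytes", 41_244 - payload_bytes);
}
```

Under these assumptions the count is 10,304 weights, or 41,216 bytes,
which would leave 28 bytes for header and CRC. The tidy remainder is
suggestive but does not confirm the layout.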

What this isn't yet:
  - Not a trained model. Random init only.
  - Not a kernel forward over the blob. That requires the firmware
    Rust component to compile (Phase 5 — toolchain blocker).
  - Not wired into wifi-densepose-train. ADR-096 §8.1 flagged that
    the AETHER train crate doesn't currently have a temporal-axis
    attention; that integration is a separate piece of work.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 11:49:19 -04:00