wifi-densepose

Commit Graph

Author

SHA1

Message

Date

ruv

247794a2c5

bench(temporal): empirical sparse-vs-dense speedup curve (ADR-096 §3.1, #513)

Validates the central performance claim of ADR-096 with a runnable
benchmark. Single-run wall-clock, pure-Rust vs pure-Rust on x86_64
host. Real numbers, not just analytic argument.

Results (N=64..1024):

| N      | Dense (ms) | Sparse (ms) | Speedup |
|--------|-----------:|------------:|--------:|
|     64 |      0.262 |       0.141 |   1.86× |
|    128 |      1.120 |       0.335 |   3.34× |
|    256 |      4.129 |       0.711 |   5.81× |
|    512 |     19.230 |       2.356 |   8.16× |
|   1024 |     71.904 |       3.389 |  21.21× |

Asymptotic check: 64→1024 is 16× more tokens. Dense's cost grows
274× (71.904 / 0.262), close to the N² prediction of 256× (16²).
Sparse's cost grows 24× (3.389 / 0.141), close to the N log N
prediction of 16 · log(1024)/log(64) ≈ 27×. The complexity claim is
empirically supported.
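
The arithmetic behind the asymptotic check can be reproduced with a few lines of Rust. This is a minimal sketch, not code from the crate: the function names are made up for illustration, and the inputs are the first and last rows of the table above.

```rust
/// Measured growth: ratio of the last timing to the first (both in ms).
fn growth(first_ms: f64, last_ms: f64) -> f64 {
    last_ms / first_ms
}

/// Predicted cost growth for an O(N^2) algorithm going from n0 to n1 tokens.
fn quadratic_growth(n0: f64, n1: f64) -> f64 {
    (n1 / n0).powi(2)
}

/// Predicted cost growth for an O(N log N) algorithm going from n0 to n1 tokens.
/// The logarithm base cancels in the ratio, so natural log is fine.
fn n_log_n_growth(n0: f64, n1: f64) -> f64 {
    (n1 * n1.ln()) / (n0 * n0.ln())
}
```

Calling `growth(0.262, 71.904)` and `growth(0.141, 3.389)` gives the measured dense and sparse ratios, which can then be compared with `quadratic_growth(64.0, 1024.0)` and `n_log_n_growth(64.0, 1024.0)`.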

ADR-096 §3.1 honest-framing paragraph predicted N=64 would be
overhead-bound; we measured 1.86× there, consistent with the ADR's
warning that AETHER's current `window_frames=100` default is below
the inflection point where sparse pays.

What this commit adds:
- examples/bench_speedup.rs — measures dense_attention (upstream
  reference), AetherTemporalHead.forward (this crate's wrapper),
  and SubquadraticSparseAttention.forward (raw, to confirm the
  wrapper isn't introducing overhead — it isn't, the two are
  within noise).
- benches_results.md — captured table + asymptotic check + caveats
  (config used, what the benchmark doesn't measure, how to run).
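
For readers unfamiliar with the baseline, here is a rough sketch of what a dense-attention reference and a single-run wall-clock measurement can look like. It is not the upstream `dense_attention` function or the contents of bench_speedup.rs: `dense_attention_sketch` and `time_once_ms` are hypothetical names, and the sketch assumes a single head, no causal mask, and row-major `[n, d]` inputs.

```rust
use std::time::Instant;

/// Naive single-head, non-causal dense attention over row-major [n, d] slices.
/// Cost is O(n^2 * d): every query row is scored against every key row.
fn dense_attention_sketch(q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> Vec<f32> {
    assert_eq!(q.len(), n * d);
    assert_eq!(k.len(), n * d);
    assert_eq!(v.len(), n * d);
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; n * d];
    let mut scores = vec![0.0f32; n];
    for i in 0..n {
        let qi = &q[i * d..(i + 1) * d];
        // Scaled dot-product scores for query i, tracking the max for a stable softmax.
        let mut max = f32::NEG_INFINITY;
        for j in 0..n {
            let kj = &k[j * d..(j + 1) * d];
            let s = qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale;
            scores[j] = s;
            max = max.max(s);
        }
        // Softmax numerators and their sum.
        let mut denom = 0.0f32;
        for s in scores.iter_mut() {
            *s = (*s - max).exp();
            denom += *s;
        }
        // Weighted sum of value rows.
        let oi = &mut out[i * d..(i + 1) * d];
        for j in 0..n {
            let w = scores[j] / denom;
            let vj = &v[j * d..(j + 1) * d];
            for (o, x) in oi.iter_mut().zip(vj) {
                *o += w * x;
            }
        }
    }
    out
}

/// Runs `f` once and returns its result with the elapsed wall-clock time in ms.
fn time_once_ms<T>(f: impl FnOnce() -> T) -> (T, f64) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed().as_secs_f64() * 1e3)
}
```

A single timed run is noisy, so numbers obtained this way are best read as a curve across N rather than as precise per-size latencies.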

Run it:
  cargo run -p wifi-densepose-temporal --example bench_speedup --release

What's NOT measured here:
- Decode-step latency (already proved correct at last-token, not
  yet timed against a hypothetical O(N²) dense decode — they're
  structurally not comparable anyway).
- Memory footprint of KvCache + FP16 (matters on firmware, not host).
- GQA dispatch: this bench uses MHA shape so dense and sparse
  operate on identical tensors. Real AETHER will want MQA per
  TemporalHeadConfig::default_aether(), which shrinks KV memory by
  q_heads / kv_heads (4× for the q_heads=4, kv_heads=1 default shape).
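
The effect of MQA on KV memory is simple arithmetic. The sketch below is an illustration, not the crate's KvCache: `kv_cache_bytes` is a hypothetical helper, and it assumes the cache stores one key vector and one value vector per token, per layer, per KV head.

```rust
/// Bytes needed to cache keys and values, assuming one K and one V vector
/// of `head_dim` elements per token, per layer, per KV head.
fn kv_cache_bytes(
    tokens: usize,
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
    bytes_per_elem: usize,
) -> usize {
    2 * tokens * layers * kv_heads * head_dim * bytes_per_elem
}
```

Taking tokens = 100 (the `window_frames` default, assumed here to be the sequence length), layers = 2, head_dim = 32, and FP16 storage (2 bytes), `kv_cache_bytes(100, 2, 4, 32, 2)` gives the MHA footprint with four KV heads and `kv_cache_bytes(100, 2, 1, 32, 2)` gives the MQA footprint with one. The ratio between them is q_heads / kv_heads.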

Co-Authored-By: claude-flow <ruv@ruv.net>

2026-05-08 12:02:36 -04:00

ruv

73321db765

feat(temporal): init_random_blob example + filesystem e2e tests (#513)

Closes the host→file→firmware loop on the Phase 1 weight format. Real
.rvne artifact emitted from the example, parsed back through filesystem
in the e2e test, byte-identical across two seeded runs.

- examples/init_random_blob.rs — produces a 41,244-byte deployable blob
  matching the AETHER default head shape (input_dim=16, q_heads=4,
  kv_heads=1 [MQA], head_dim=32, layers=2, classes=4 — staying coherent
  with TemporalHeadConfig::default_aether so a real trainer can drop
  in this shape with one search-and-replace). Uses xorshift64* with a
  fixed seed (0xC511_0007_DEAD_BEEF) for reproducibility.

  Per-layer weight count derivation lives in the example (Wq + Wk +
  Wv + Wo, plus a final classifier head) so the kernel's expectation
  is anchored in code rather than a comment that drifts.

- tests/blob_e2e.rs — two new tests, 15/15 total now passing:
    * realistic_blob_roundtrips_through_filesystem — writes a 25+ KB
      blob to std::env::temp_dir(), reads it back, parses, validates.
      Mirrors what the firmware loader will do once the toolchain
      unblocks (mmap NVS or EMBED_FILES → parse).
    * deterministic_seed_produces_byte_identical_blobs — same seed
      produces byte-identical output, twice. This is what makes a
      witness-bundle (ADR-028) over trained weights meaningful.
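
The determinism property rests on the generator being a pure function of its seed. Below is a minimal sketch of that idea, not the code in init_random_blob.rs: it uses the standard xorshift64* step (shifts 12/25/27 and multiplier 0x2545F4914F6CDD1D), while the mapping to f32, the little-endian payload layout, and the names `XorShift64Star`, `random_payload`, and `roundtrip` are assumptions made for illustration. No .rvne header or CRC is modelled.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Standard xorshift64* generator; the state must never be zero.
struct XorShift64Star {
    state: u64,
}

impl XorShift64Star {
    fn new(seed: u64) -> Self {
        assert_ne!(seed, 0, "xorshift state must be non-zero");
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.state = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform f32 in [-1, 1), built from the top 24 bits (assumed mapping).
    fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 40) as f32;
        bits / (1u32 << 23) as f32 - 1.0
    }
}

/// Serialises `n_weights` pseudo-random f32 values as little-endian bytes.
fn random_payload(seed: u64, n_weights: usize) -> Vec<u8> {
    let mut rng = XorShift64Star::new(seed);
    let mut bytes = Vec::with_capacity(n_weights * 4);
    for _ in 0..n_weights {
        bytes.extend_from_slice(&rng.next_f32().to_le_bytes());
    }
    bytes
}

/// Writes the bytes to `path` and reads them back from the filesystem.
fn roundtrip(path: &Path, bytes: &[u8]) -> io::Result<Vec<u8>> {
    fs::write(path, bytes)?;
    fs::read(path)
}
```

Two calls to `random_payload(0xC511_0007_DEAD_BEEF, n)` yield identical byte vectors, and `roundtrip` against a file under `std::env::temp_dir()` should return the same bytes it was given.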

Verified by running the example with an explicit out path:
  cargo run -p wifi-densepose-temporal --example init_random_blob -- \
      v2/target/example-output/model_init.rvne
  → 41244 bytes, parses clean, dtype/shape/CRC all good.
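
As a sanity check on the reported size, the weight count can be estimated from the head shape. The sketch below is a guess at the layout, not the derivation in the example: it assumes bias-free projections, that every layer maps input_dim to input_dim, that the classifier is a single input_dim × classes matrix, and that weights are stored as f32. `HeadShape` and both functions are hypothetical names.

```rust
/// Head shape as quoted in the commit message.
struct HeadShape {
    input_dim: usize,
    q_heads: usize,
    kv_heads: usize,
    head_dim: usize,
    layers: usize,
    classes: usize,
}

/// Wq + Wk + Wv + Wo for one layer, assuming no biases.
fn weights_per_layer(s: &HeadShape) -> usize {
    let q_width = s.q_heads * s.head_dim; // concatenated query heads
    let kv_width = s.kv_heads * s.head_dim; // shared K/V heads (MQA when kv_heads = 1)
    let wq = s.input_dim * q_width;
    let wk = s.input_dim * kv_width;
    let wv = s.input_dim * kv_width;
    let wo = q_width * s.input_dim; // projects back to input_dim (assumption)
    wq + wk + wv + wo
}

/// All layers plus a final input_dim x classes classifier (assumption).
fn total_weights(s: &HeadShape) -> usize {
    s.layers * weights_per_layer(s) + s.input_dim * s.classes
}
```

With input_dim=16, q_heads=4, kv_heads=1, head_dim=32, layers=2, classes=4, these assumptions give 10,304 weights, or 41,216 bytes as f32. That is 28 bytes less than the 41,244-byte file, a gap that would be plausible for a header and checksum, but the example source remains the authority on the actual layout.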

What this isn't yet:
  - Not a trained model. Random init only.
  - Not a kernel forward over the blob. That requires the firmware
    Rust component to compile (Phase 5 — toolchain blocker).
  - Not wired into wifi-densepose-train. ADR-096 §8.1 flagged that
    the AETHER train crate doesn't currently have a temporal-axis
    attention; that integration is a separate piece of work.

Co-Authored-By: claude-flow <ruv@ruv.net>

2026-05-08 11:49:19 -04:00

2 Commits