Commit Graph

8 Commits

Author SHA1 Message Date
ruv 2b903752c4 test(temporal): dense-vs-sparse numerical A/B baseline (ADR-096 §5, #513)
Establishes the kernel-level output-divergence envelope between the
two backends — what §5's downstream-metric gate (contrastive loss,
rank-1, Spearman) would calibrate against. Two regimes:

1. Saturated pattern (window ≥ N, block ≥ N): sparse and dense visit
   the same edge set, so divergence reflects only float accumulation
   order. **Asserted < 1e-4** at N=32, heads=4, dim=16. Tight bound.

2. Realistic sparse (window=16, block=32, N=256): real approximation,
   real divergence. **Measured max_abs_err = 5.22e-3, mean = 1.79e-3**
   on the deterministic test inputs. Sanity-checked finite + < 1.0
   so structural breakage (NaN, softmax overflow) trips a panic, but
   the specific numbers are *baseline data* not a hard contract — the
   §5 gate cares about downstream task metrics, not bit-equality.

Why this is in the test suite rather than a benchmark:
- It runs in <0.2s, no need to gate behind --release.
- The saturated-pattern bound IS a hard contract — if that breaks
  the kernel changed semantics in a way the API hides, and we want
  CI to catch it.
- Printing the realistic-pattern numbers (eprintln, visible with
  --nocapture) gives a known-good reference point to compare future
  builds against.

Test count is now 21/21 across the crate (6 smoke + 8 weight blob +
2 blob e2e + 3 streaming + 2 dense-vs-sparse).

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:16:05 -04:00
ruv 4ea8457017 feat(temporal): Dense backend implementation (ADR-096 §5 A/B gate, #513)
Closes the Dense placeholder from earlier commits. Now both backends
implement forward(); only SparseGqa supports streaming step()/KvCache,
which is the structural gap dense MHA can't bridge by design.

Dense path:
- src/dense.rs new — DenseHead wraps upstream dense_attention. Stores
  causal flag and (cloned) config. forward() is a one-line delegation;
  no GQA dispatch (dense_attention upstream requires q_heads == kv_heads).
- AetherTemporalHead::Dense changed from a unit variant to Dense(DenseHead).
  Construction succeeds for any valid TemporalHeadConfig where backend
  is Dense.
- AetherTemporalHead.step() returns BackendDoesNotSupportStreaming for
  Dense — there is no dense-MHA-with-KV-cache equivalent and offering
  one would silently swallow the ADR-096 §3.2 structural argument.
- AetherTemporalHead.make_cache() likewise — there's no cache to size
  for a dense kernel.

Errors:
- New TemporalError::BackendDoesNotSupportStreaming variant covers
  the Dense-step / Dense-make_cache cases. Specific so callers can
  fall back to forward() instead of giving up entirely.
- TemporalError::DenseBackendNotImplemented retained for v0.1
  back-compat (no consumers depend on it post-this-commit, but
  removing a public variant is a hard break). Future work can
  deprecate it once downstream callers move off.

Tests (19/19 passing):
- dense_backend_returns_typed_error → renamed and rewritten as
  dense_backend_forward_runs_with_matching_shape: constructs a Dense
  head, runs forward over (32, 4, 4, 16) Q/K/V, asserts output shape.
- New dense_backend_step_returns_streaming_error: constructs Dense,
  attempts make_cache, expects BackendDoesNotSupportStreaming.
- All 8 weight blob, 2 blob e2e, 3 streaming, 5 other smoke tests
  unchanged and still passing.

This commit completes the ADR-096 §5 A/B gate: callers can now run
the same Q/K/V through both backends and compare outputs / latency.
The §5 four-gate validation (contrastive loss within 1%, rank-1
within 1pp, Spearman ≥0.95, latency ≥5×) becomes a runnable
proposition, not a future task — though the actual gate run requires
trained AETHER weights, which is its own track.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:12:17 -04:00
ruv 2aee4d21cf docs(temporal): README for wifi-densepose-temporal (#513)
Closes the documentation gap on the host-side ADR-096 surface.
The crate has 7 commits, 5 source modules, 4 test suites, 2 examples,
and a captured benchmark; reviewers and downstream consumers needed
a landing page.

Sections:
- Quick start (5-line forward + 7-line streaming)
- Backends + selection rule (SparseGqa MHA-vs-GQA dispatch)
- Streaming semantics (cache lifetime, eviction policy, the
  headline correctness test)
- Weight blob format with the host/firmware lockstep note
- Examples (init_random_blob, bench_speedup) with run lines
- Tests (18/18 passing as of 247794a2c, broken down by suite)
- Status of ADR-096 claims with concrete evidence for each
- Status of ADR-095 surface (firmware) + the toolchain blocker
- Carry-forward of the open questions still applicable from §8

The README intentionally cross-links to:
- docs/adr/ADR-096 for design rationale
- components/ruv_temporal/ README for the firmware mirror
- benches_results.md for the captured speedup curve

Doesn't claim more than is proven. Each ADR-096 claim either has a
test or a benchmark cited as evidence; the partial claim (30-100× at
long windows) explicitly says 21× was the measured number, not 30×.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:06:26 -04:00
ruv 247794a2c5 bench(temporal): empirical sparse-vs-dense speedup curve (ADR-096 §3.1, #513)
Validates the central performance claim of ADR-096 with a runnable
benchmark. Single-run wall-clock, pure-Rust vs pure-Rust on x86_64
host. Real numbers, not just analytic argument.

Results (N=64..1024):

| N      | Dense (ms) | Sparse (ms) | Speedup |
|--------|-----------:|------------:|--------:|
|     64 |      0.262 |       0.141 |   1.86× |
|    128 |      1.120 |       0.335 |   3.34× |
|    256 |      4.129 |       0.711 |   5.81× |
|    512 |     19.230 |       2.356 |   8.16× |
|   1024 |     71.904 |       3.389 |  21.21× |

Asymptotic check: 64→1024 is 16× more tokens. Dense's 274× cost
growth matches N² (256× = 16²). Sparse's 24× growth matches
N log N (16 · log(1024)/log(64) ≈ 27). The complexity claim is
empirically supported.

ADR-096 §3.1 honest-framing paragraph predicted N=64 would be
overhead-bound; we measured 1.86× there, consistent with the ADR's
warning that AETHER's current `window_frames=100` default is below
the inflection point where sparse pays.

What this commit adds:
- examples/bench_speedup.rs — measures dense_attention (upstream
  reference), AetherTemporalHead.forward (this crate's wrapper),
  and SubquadraticSparseAttention.forward (raw, to confirm the
  wrapper isn't introducing overhead — it isn't, the two are
  within noise).
- benches_results.md — captured table + asymptotic check + caveats
  (config used, what the benchmark doesn't measure, how to run).

Run it:
  cargo run -p wifi-densepose-temporal --example bench_speedup --release

What's NOT measured here:
- Decode-step latency (already proved correct at last-token, not
  yet timed against a hypothetical O(N²) dense decode — they're
  structurally not comparable anyway).
- Memory footprint of KvCache + FP16 (matters on firmware, not host).
- GQA dispatch — this bench uses MHA shape so dense and sparse
  operate on identical tensors. Real AETHER will want MQA per
  TemporalHeadConfig::default_aether(), which halves KV memory.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 12:02:36 -04:00
ruv 49e57efcec feat(temporal): streaming step() + KvCache (ADR-096 §3.2, #513)
The structural advantage that's the entire point of ADR-096: O(log T)
per new token via decode_step against an accumulated KvCache, vs
O(N²) recompute for dense MHA. This commit lands the API and proves
the numerical equivalence at the last position.

API:
- AetherTemporalHead::step(q_new, k_new, v_new, &mut cache)
  Single-token decode. Appends (k_new, v_new) to cache, runs
  decode_step(q_new) against the now-updated cache, returns the new
  position's output.
- AetherTemporalHead::make_cache(capacity)
  Convenience constructor — caller doesn't need to import
  ruvllm_sparse_attention to size a cache. Per ADR-096 §8.5 the
  natural lifetime is per-PoseTrack (re-ID) or per-session (online
  classification); when the track drops, drop the cache.
- KvCache re-exported at the crate root.

Contract:
- q_new/k_new/v_new must each have seq == 1. Multi-token q is the
  prefill path (forward), not decode_step.
- Cache lifetime is the caller's. The crate enforces shape via
  make_cache so callers can't mismatch kv_heads / head_dim / block_size.
- KvCache fill is the caller's problem. Upstream H2O heavy-hitter
  eviction is opt-in; this crate's wrapper doesn't pre-pick a policy.

Tests (18/18 total now passing):
- streaming_step_matches_forward_at_last_position — central claim:
  16-token sequence, append k/v one at a time via step(), compare
  the streamed last-token output to forward(full Q,K,V)[N-1].
  max_abs_err < 1e-3 (currently passes well under that bound for
  the 0.1-magnitude activations the test uses).
- step_rejects_multi_token_q — contract enforcement.
- make_cache_returns_kvcache_with_correct_shape — wiring smoke,
  confirms (capacity, kv_heads, dim, block_size) ordering is correct
  through the make_cache wrapper.

Test config uses MHA shape (q_heads == kv_heads) because the upstream
decode_step is wired to the MHA branch; the GQA decode path is on
upstream's roadmap and lands in a separate ADR-096 follow-up when it
does.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 11:57:31 -04:00
ruv 73321db765 feat(temporal): init_random_blob example + filesystem e2e tests (#513)
Closes the host→file→firmware loop on the Phase 1 weight format. Real
.rvne artifact emitted from the example, parsed back through filesystem
in the e2e test, byte-identical across two seeded runs.

- examples/init_random_blob.rs — produces a 41,244-byte deployable blob
  matching the AETHER default head shape (input_dim=16, q_heads=4,
  kv_heads=1 [MQA], head_dim=32, layers=2, classes=4 — staying coherent
  with TemporalHeadConfig::default_aether so a real trainer can drop
  in this shape with one search-and-replace). Uses xorshift64* with a
  fixed seed (0xC511_0007_DEAD_BEEF) for reproducibility.

  Per-layer weight count derivation lives in the example (Wq + Wk +
  Wv + Wo, plus a final classifier head) so the kernel's expectation
  is anchored in code rather than a comment that drifts.

- tests/blob_e2e.rs — two new tests, 15/15 total now passing:
    * realistic_blob_roundtrips_through_filesystem — writes a 25+ KB
      blob to std::env::temp_dir(), reads it back, parses, validates.
      Mirrors what the firmware loader will do once the toolchain
      unblocks (mmap NVS or EMBED_FILES → parse).
    * deterministic_seed_produces_byte_identical_blobs — same seed
      produces byte-identical output, twice. This is what makes a
      witness-bundle (ADR-028) over trained weights meaningful.

Verified by running the example with an explicit out path:
  cargo run -p wifi-densepose-temporal --example init_random_blob -- \
      v2/target/example-output/model_init.rvne
  → 41244 bytes, parses clean, dtype/shape/CRC all good.

What this isn't yet:
  - Not a trained model. Random init only.
  - Not a kernel forward over the blob. That requires the firmware
    Rust component to compile (Phase 5 — toolchain blocker).
  - Not wired into wifi-densepose-train. ADR-096 §8.1 flagged that
    the AETHER train crate doesn't currently have a temporal-axis
    attention; that integration is a separate piece of work.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 11:49:19 -04:00
ruv 237325a117 feat(temporal): weight-blob wire format (ADR-095 Phase 1, #513)
The training/firmware boundary needs a stable serialization for the
temporal head's weights, distinct from the kernel scaffold and the
firmware ABI. This commit defines that format on the host side. The
firmware-side mirrored loader lands when the toolchain unblocks.

Format:
  - Header (24 B): magic 'RVNE' / version 1 / dtype flag
    (FP32 / FP16) / input_dim / n_q_heads / n_kv_heads / head_dim /
    n_layers / n_classes / weights_len.
  - Body: weights_len bytes of flat per-layer weights.
  - Footer (4 B): CRC32 IEEE 802.3 over everything before, same
    polynomial used by temporal_task.c so a blob produced here parses
    on the firmware unchanged.

Layout decisions:
  - Little-endian throughout (Xtensa native).
  - Weights kept as Vec<u8> rather than Vec<f32>/Vec<f16> so the no_std
    firmware loader (which may not have the `half` crate) can mmap and
    read either dtype directly.
  - Versioning is hard-break: bumping `version` means firmware refuses
    to load. Optional fields go behind reserved flag bits, never by
    field reorder. Documented inline.

Validation surface:
  - `WeightBlobHeader::validate()` catches zero dims, invalid GQA
    ratios (n_q_heads % n_kv_heads != 0), n_layers=0, n_classes<2.
    Same checks fire from `WeightBlob::parse()` so the firmware can't
    accidentally accept a blob the host should have rejected.
  - `WeightBlob::parse()` enforces magic / version / size / CRC
    before exposing weights to the caller.

Tests (8/8 passing, alongside 5/5 sparse smoke = 13/13 total):
  - roundtrip_fp32, roundtrip_fp16
  - parse_rejects_bad_magic, _wrong_version, _size_mismatch,
    _crc_corruption, _invalid_gqa_ratio_in_header
  - header_constants_match_wire_layout (anchor)

What's deliberately NOT in this commit:
  - The firmware-side mirrored loader (deferred to the iteration that
    unblocks the esp Rust toolchain — no point shipping a parser that
    can't be compiled).
  - Per-layer weight ordering. The blob is a flat byte-buffer; the
    interpretation of per-layer offsets is the kernel's contract,
    documented in the eventual model module (ADR-095 §3.2 follow-up).

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 11:43:49 -04:00
ruv bfb3fdee13 feat(temporal): scaffold wifi-densepose-temporal crate (ADR-096 Phase 1-3, #513)
Implements Phases 1-3 of the ADR-096 roadmap:

Phase 1: workspace integration
- Add `ruvllm_sparse_attention` as a path-vendored workspace dep against
  `vendor/ruvector/crates/ruvllm_sparse_attention`, default-features=false,
  features=["fp16"]. Mirrors the no_std posture ADR-095 will need on the
  firmware side so both consumers share a single feature set.
- Register `wifi-densepose-temporal` as workspace member.

Phase 2: AETHER temporal head
- `AetherTemporalHead` facade dispatches to a `SparseGqa` backend wrapping
  `SubquadraticSparseAttention`. Selection rule from ADR-096 §4.4 enforced
  at forward(): MHA branch when q_heads == kv_heads, GQA branch otherwise.
- `Dense` backend reserved (returns typed `DenseBackendNotImplemented`)
  so config-time validation fails loudly instead of at forward().
- `TemporalHeadConfig::default_aether()` matches the AETHER training
  default per ADR-096 §3.1 (window=32, block=16, q=4, kv=1 → MQA).
- Token 0 always wired as a global anchor — preserves AETHER's
  contrastive "session-start reference" role per ADR-024.

Phase 3: smoke tests (5/5 passing)
- forward at AETHER default config, both MHA and GQA dispatch paths,
  rejected dense backend, rejected non-divisible GQA ratio, and the
  long-window roadmap target (N=1000, the 10s @ 100Hz case from
  ADR-096 §3.1 — proves the kernel runs at lengths where dense MHA
  costs 10⁶ edge ops vs sparse 10⁴).

Streaming `step()` deferred — KvCache lifecycle ties to PoseTrack per
ADR-096 §8.5 and lands when the firmware-side ABI does (Phase 4+).

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-08 09:26:18 -04:00