Establishes the kernel-level output-divergence envelope between the
two backends — what §5's downstream-metric gate (contrastive loss,
rank-1, Spearman) would calibrate against. Two regimes:
1. Saturated pattern (window ≥ N, block ≥ N): sparse and dense visit
the same edge set, so divergence reflects only float accumulation
order. **Asserted < 1e-4** at N=32, heads=4, dim=16. Tight bound.
2. Realistic sparse (window=16, block=32, N=256): real approximation,
real divergence. **Measured max_abs_err = 5.22e-3, mean = 1.79e-3**
on the deterministic test inputs. Sanity-checked finite + < 1.0
so structural breakage (NaN, softmax overflow) trips a panic, but
the specific numbers are *baseline data* not a hard contract — the
§5 gate cares about downstream task metrics, not bit-equality.
Why this is in the test suite rather than a benchmark:
- It runs in <0.2s, no need to gate behind --release.
- The saturated-pattern bound IS a hard contract — if that breaks
the kernel changed semantics in a way the API hides, and we want
CI to catch it.
- Printing the realistic-pattern numbers (eprintln, visible with
--nocapture) gives a known-good reference point to compare future
builds against.
Test count is now 21/21 across the crate (6 smoke + 8 weight blob +
2 blob e2e + 3 streaming + 2 dense-vs-sparse).
Co-Authored-By: claude-flow <ruv@ruv.net>