Adds a second dynamic weight approach to the bridge alongside the existing
BLOBFILE compile path. Instead of packing weights into the spatial dimension
of a single large input tensor and slicing them inside MIL (the training_dynamic/
approach), weights are declared as native MIL function parameters backed by
persistent IOSurfaces:

  // training_dynamic/ approach: spatial packing
  func main<ios18>(tensor<fp32, [1, DIM, 1, SEQ + 4*DIM]> x) {
    Wq = slice_by_size(x=x, begin=..., size=...);  // overhead
    ...
  }

  // this PR: native function parameters
  func main<ios18>(tensor<fp16, [1, K, 1, M]> x, tensor<fp16, [1, N, K]> W) { ... }
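
The offset arithmetic implied by the spatial-packing signature above can be sketched as follows. This is a minimal illustration, assuming four DIM-wide weight blocks packed in order after the SEQ activation columns along the last axis; the block order and the names packed_slice, weight_slice and packed_width are hypothetical and not taken from training_dynamic/.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one slice along the last (spatial) axis. */
typedef struct {
    size_t begin; /* first column of the slice */
    size_t size;  /* number of columns in the slice */
} packed_slice;

/* Total width of the packed axis: SEQ activation columns + 4 weight blocks. */
static size_t packed_width(size_t seq, size_t dim) {
    return seq + 4 * dim;
}

/* Assumed layout: [ SEQ activations | W0 | W1 | W2 | W3 ], each W_i DIM wide.
 * Returns the begin/size pair a slice_by_size call would need for W_index. */
static packed_slice weight_slice(size_t seq, size_t dim, size_t index) {
    assert(index < 4);
    packed_slice s = { seq + index * dim, dim };
    return s;
}
```

With SEQ = 256 and DIM = 768, the first weight block would start at column 256 and the last would end at column 3328, the full packed width. Every such slice is an extra op in the graph, which is the overhead the function-parameter approach avoids.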
New API:
  ane_bridge_compile_dyn()         compile with n_weights IOSurface parameters
  ane_bridge_write_weight()        write fp16 to a weight IOSurface (~0.001 ms)
  ane_bridge_write_weight_f32()    write fp32 with NEON conversion
  ane_bridge_copy_io()             direct output→input copy, no CPU round-trip
  ane_bridge_begin/end_realtime()  90.6% p99 jitter reduction
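
The conversion step behind ane_bridge_write_weight_f32() can be pictured with a scalar stand-in. The sketch below is not the bridge's code: it assumes the destination format is fp16 (as used by the weight IOSurfaces) and round-to-nearest-even rounding, and the names f32_to_f16 and convert_weights_f16 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one IEEE 754 binary32 value to binary16 bits (round-to-nearest-even). */
static uint16_t f32_to_f16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);                 /* type-pun without UB */
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t exp  = (x >> 23) & 0xFFu;
    uint32_t mant = x & 0x7FFFFFu;

    if (exp == 0xFFu)                         /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0u));

    int32_t e = (int32_t)exp - 127 + 15;      /* re-bias the exponent */
    if (e >= 31)                              /* too large: overflow to Inf */
        return (uint16_t)(sign | 0x7C00u);

    if (e <= 0) {                             /* result is subnormal or zero */
        if (e < -10) return (uint16_t)sign;   /* below half the smallest subnormal */
        mant |= 0x800000u;                    /* restore the implicit leading 1 */
        uint32_t shift   = (uint32_t)(14 - e);
        uint32_t half    = mant >> shift;
        uint32_t rem     = mant & ((1u << shift) - 1u);
        uint32_t halfway = 1u << (shift - 1u);
        if (rem > halfway || (rem == halfway && (half & 1u))) half++;
        return (uint16_t)(sign | half);
    }

    uint32_t half = ((uint32_t)e << 10) | (mant >> 13);
    uint32_t rem  = mant & 0x1FFFu;
    /* A carry out of the mantissa correctly bumps the exponent (up to Inf). */
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u))) half++;
    return (uint16_t)(sign | half);
}

/* Convert n fp32 weights into an fp16 buffer, e.g. before copying to a surface. */
static void convert_weights_f16(const float *src, uint16_t *dst, size_t n) {
    assert(n == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < n; i++) dst[i] = f32_to_f16(src[i]);
}
```

A NEON path would do the same work several lanes at a time; the scalar form is shown only because it makes the rounding and overflow cases explicit.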
Compile cache fix: for parameter-based models ANE writes only net.plist, with
no data file. try_cache_restore now requires only net.plist; the data file is
saved and restored only for BLOBFILE models, which do produce it.
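
The restore rule described above can be sketched as a small probe. This is an illustration only, assuming the cache entry is a directory and the optional file is literally named "data"; cache_probe, file_exists and probe_cache_dir are hypothetical names and do not reflect the actual try_cache_restore code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Result of inspecting one cache directory. */
typedef struct {
    int usable;   /* 1 if net.plist is present, so the entry can be restored */
    int has_data; /* 1 if the optional data file is present as well */
} cache_probe;

/* Returns 1 if dir/name exists and is readable, 0 otherwise. */
static int file_exists(const char *dir, const char *name) {
    char path[1024];
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path) return 0; /* path too long */
    return access(path, R_OK) == 0;
}

/* net.plist alone decides whether the entry is usable; the data file is
 * restored only when it exists (the BLOBFILE case). */
static cache_probe probe_cache_dir(const char *dir) {
    assert(dir != NULL);
    cache_probe p;
    p.usable   = file_exists(dir, "net.plist");
    p.has_data = p.usable && file_exists(dir, "data");
    return p;
}
```

The point of the split is that a missing data file no longer counts as a cache miss for parameter-based models.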
Also removes the pre-built libane_bridge.dylib binary from version control.
Performance vs spatial packing (Stories110M, 12 layers, M-series):
  training_dynamic/ (slice approach):  110 ms/step
  function parameter approach:         76.9 ms/step (-30%)
The slice/reshape/transpose overhead per weight matrix explains the gap.
Both compile once at startup; weight updates are IOSurface writes in both cases.
Tested: test_bridge.m — 15/15 assertions across all new API functions.
Add a dynamic weight pipeline that eliminates the ~3.7 s
recompile-every-10-steps bottleneck. Weights are packed into the spatial
dimension of an input IOSurface instead of being baked in as constants, so
kernels compile once at startup (345 ms) and run indefinitely without an
exec() restart.
Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
- 9 dynamic kernels shared across all 12 layers
- Vocab compaction 32K→9.2K for faster classifier
- Vectorized cross-entropy with vDSP/NEON
- Adam optimizer with gradient clipping + cosine LR schedule
- Checkpoint save/resume
- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface
- dashboard.py — updated with --dynamic flag for v2 pipeline support,
improved step regex parsing, --scratch/--lr/--accum CLI args
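
The vocab compaction item can be illustrated with a remap table. This is a sketch under a stated assumption: that compaction means keeping only the token ids that actually occur in the training data and renumbering them densely. The commit does not spell out the method, and build_vocab_remap is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills remap[v] with the compact id of full-vocab token v, or -1 if v never
 * occurs in tokens[]. Compact ids preserve the original id order.
 * Returns the compact vocabulary size. */
static size_t build_vocab_remap(const uint16_t *tokens, size_t n_tokens,
                                int32_t *remap, size_t vocab_size) {
    for (size_t v = 0; v < vocab_size; v++) remap[v] = -1;

    /* Pass 1: mark every token id that appears in the data. */
    for (size_t i = 0; i < n_tokens; i++) {
        assert(tokens[i] < vocab_size);
        remap[tokens[i]] = 0;
    }

    /* Pass 2: hand out dense ids to the marked entries. */
    size_t next = 0;
    for (size_t v = 0; v < vocab_size; v++) {
        if (remap[v] == 0) remap[v] = (int32_t)next++;
    }
    return next;
}
```

Under this assumption the classifier only needs one output row per compact id, which is where a 32K→9.2K reduction would save time.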
Performance: 110 ms/step steady-state (no recompile overhead)
  Breakdown (ms): ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5
Weave in scope notice near the top covering project intent, what it
is/isn't, hype clarification, maintenance expectations, and fork
encouragement. Consolidate private API disclaimer with existing
disclaimer section to avoid duplication.
https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv
Key findings from running all 4 probes on Apple M5:
- Weight reload (unload+load after file overwrite) does NOT work: weights
  are baked at compile time, so output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models
Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern)
and 64+ channel kernels (ANE minimum size requirement).
Full analysis in training/m5result.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient
Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time
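
A JSON-lines telemetry record of this kind can be sketched as below. The field names (step, loss, step_ms, ane_ms, tflops) and the step_stats struct are hypothetical; the actual schema emitted by train_large.m is not shown in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-step measurements. */
typedef struct {
    int    step;
    double loss;
    double step_ms; /* wall time for the whole step */
    double ane_ms;  /* time spent in ANE evaluation */
    double flops;   /* floating-point operations performed in the step */
} step_stats;

/* Throughput in TFLOPS for `flops` operations completed in `ms` milliseconds. */
static double tflops(double flops, double ms) {
    return ms > 0.0 ? flops / (ms * 1e-3) / 1e12 : 0.0;
}

/* Formats one record as a single-line JSON object; returns snprintf's result. */
static int format_step_json(char *buf, size_t cap, const step_stats *s) {
    assert(buf != NULL && s != NULL);
    return snprintf(buf, cap,
                    "{\"step\":%d,\"loss\":%.4f,\"step_ms\":%.2f,"
                    "\"ane_ms\":%.2f,\"tflops\":%.3f}",
                    s->step, s->loss, s->step_ms, s->ane_ms,
                    tflops(s->flops, s->step_ms));
}

/* Writes the record to stderr, one JSON object per line. */
static void emit_step_json(const step_stats *s) {
    char line[256];
    int n = format_step_json(line, sizeof line, s);
    if (n > 0 && (size_t)n < sizeof line) fprintf(stderr, "%s\n", line);
}
```

Keeping each record on one line lets a monitor read stderr line by line and parse every line independently, without buffering the whole stream.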
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>