- Vectorize adam_update with vDSP batch ops (stories_cpu_ops.h)
  Replaces scalar per-element loop with vDSP_vsmul/vsma/vsq/vdiv
  Expected ~3-4x faster for 2.4M parameter updates
- Vectorize model_adam_step ADAM_UPDATE macro with vDSP (backward.h)
  Same batch ops pattern for the train.m model pipeline
- Replace cpu_accum_dW with cblas_sgemm (backward.h)
  dW += dy^T @ x is a standard BLAS GEMM operation
  Expected 5-10x faster for weight gradient accumulation
- Replace cpu_matmul_backward_dx with cblas_sgemm (backward.h)
  dx = dy @ W^T is also a standard BLAS GEMM
- Add -framework Accelerate to train target (Makefile)
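
For reviewers, the dW += dy^T @ x accumulation can be sketched as a single GEMM call as follows. This is an illustration, not the code in backward.h: the function names accum_dW and accum_dW_naive are hypothetical, and the row-major shapes (dy is S x out_dim, x is S x in_dim, dW is out_dim x in_dim) are assumptions the commit does not state. The cblas_sgemm path is compiled only on Apple platforms; elsewhere the sketch falls back to the scalar loop so it still builds.

```c
#include <assert.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>  /* provides cblas_sgemm */
#endif

/* Assumed row-major layout (not confirmed by the commit):
 *   dy: [S x out_dim], x: [S x in_dim], dW: [out_dim x in_dim]. */

/* Scalar reference: dW[o][i] += sum over s of dy[s][o] * x[s][i]. */
void accum_dW_naive(float *dW, const float *dy, const float *x,
                    int S, int out_dim, int in_dim) {
    for (int s = 0; s < S; s++)
        for (int o = 0; o < out_dim; o++)
            for (int i = 0; i < in_dim; i++)
                dW[o * in_dim + i] += dy[s * out_dim + o] * x[s * in_dim + i];
}

/* Same accumulation expressed as one GEMM: C = alpha * A^T * B + beta * C. */
void accum_dW(float *dW, const float *dy, const float *x,
              int S, int out_dim, int in_dim) {
    assert(dW && dy && x && S > 0 && out_dim > 0 && in_dim > 0);
#ifdef __APPLE__
    cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                out_dim, in_dim, S,   /* M, N, K */
                1.0f, dy, out_dim,    /* A = dy, lda = stored row length */
                x, in_dim,            /* B = x,  ldb = stored row length */
                1.0f, dW, in_dim);    /* beta = 1 accumulates into dW */
#else
    accum_dW_naive(dW, dy, x, S, out_dim, in_dim);  /* portable fallback */
#endif
}
```

Passing beta = 1.0f is what turns the GEMM into an accumulation; beta = 0.0f would overwrite dW instead of adding to it.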

Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real time
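
A minimal sketch of what one telemetry record could look like, for readers writing a consumer. The commit does not spell out the schema, so the StepTiming struct, the field names (step, fwd_ms, bwd_ms, update_ms, tflops), and the helper functions are all hypothetical. TFLOPS is taken here as floating-point operations divided by elapsed seconds, divided by 1e12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-step record; the real field set may differ. */
typedef struct {
    int step;
    double fwd_ms, bwd_ms, update_ms;  /* timing breakdown in milliseconds */
    double flops;                      /* floating-point ops in this batch */
} StepTiming;

/* TFLOPS over the summed phases; returns 0 if no time was recorded. */
double step_tflops(const StepTiming *t) {
    double seconds = (t->fwd_ms + t->bwd_ms + t->update_ms) / 1000.0;
    return seconds > 0.0 ? t->flops / seconds / 1e12 : 0.0;
}

/* Formats one JSON object (no trailing newline); returns snprintf's result. */
int format_step_json(char *buf, size_t cap, const StepTiming *t) {
    assert(buf && t && cap > 0);
    return snprintf(buf, cap,
                    "{\"step\":%d,\"fwd_ms\":%.3f,\"bwd_ms\":%.3f,"
                    "\"update_ms\":%.3f,\"tflops\":%.3f}",
                    t->step, t->fwd_ms, t->bwd_ms, t->update_ms,
                    step_tflops(t));
}

/* Writes one record per line to stderr so stdout stays free for normal logs. */
void emit_step_json(const StepTiming *t) {
    char line[256];
    int n = format_step_json(line, sizeof line, t);
    if (n > 0 && (size_t)n < sizeof line)
        fprintf(stderr, "%s\n", line);
}
```

Keeping each record on a single line lets a monitoring tool parse the stream line by line without buffering a whole document.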

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>