- Vectorize adam_update with vDSP batch ops (stories_cpu_ops.h)
Replaces scalar per-element loop with vDSP_vsmul/vsma/vsq/vdiv
Expected ~3-4x faster for 2.4M parameter updates
- Vectorize model_adam_step ADAM_UPDATE macro with vDSP (backward.h)
Same batch ops pattern for the train.m model pipeline
- Replace cpu_accum_dW with cblas_sgemm (backward.h)
dW += dy^T @ x is a standard BLAS GEMM operation
Expected 5-10x faster for weight gradient accumulation
- Replace cpu_matmul_backward_dx with cblas_sgemm (backward.h)
dx = dy @ W^T is also a standard BLAS GEMM
- Add -framework Accelerate to train target (Makefile)
- Parse static pipeline JSON step/batch/perf lines for real-time updates
- Running elapsed time, ms/step from wall-clock timestamps, steps/sec
- Compute ANE + Total TFLOPS from FLOPs/step when not reported directly
- Support --ane (train_large_ane) and --no-ane-extras flags
- Dynamic pipeline timing breakdown + CKPT_PATH per mode
Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps
bottleneck. Weights are passed via IOSurface spatial dimension instead of
baked as constants, so kernels compile once at startup (345ms) and run
indefinitely without exec() restart.
Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
- 9 dynamic kernels shared across all 12 layers
- Vocab compaction 32K→9.2K for faster classifier
- Vectorized cross-entropy with vDSP/NEON
- Adam optimizer with gradient clipping + cosine LR schedule
- Checkpoint save/resume
- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface
- dashboard.py — updated with --dynamic flag for v2 pipeline support,
improved step regex parsing, --scratch/--lr/--accum CLI args
Performance: 110ms/step steady-state (no recompile overhead)
ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms
Key findings from running all 4 probes on Apple M5:
- Weight reload (unload+load after file overwrite) does NOT work — weights
are baked at compile time, output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models
Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern)
and 64+ channel kernels (ANE minimum size requirement).
Full analysis in training/m5result.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient
Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>