Key discovery: compile and eval can run in parallel via GCD.

During a single 26.8 ms background compile, 119 foreground evals completed (on average, roughly one eval every 0.23 ms).

Architecture:
- Two kernel sets (A/B) alternate between the active and pending roles
- A background GCD queue compiles the pending set while the active set runs
- Atomic swap at batch boundary
- Eliminates the 88% compilation bottleneck
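
The A/B scheme above can be sketched in a few lines. This is only an illustration of the double-buffering idea, written in Python rather than the Objective-C of train_double_buffer.m: a plain thread stands in for the GCD queue, and `DoubleBufferedKernels`, `slow_compile`, and `demo` are hypothetical names that do not appear in the repository.

```python
import threading
import time


class DoubleBufferedKernels:
    """Two kernel slots: one is active while the other compiles in the background."""

    def __init__(self, compile_fn, initial_weights):
        self._compile = compile_fn
        # Slot A is compiled up front; slot B starts empty (pending).
        self._slots = [compile_fn(initial_weights), None]
        self._active = 0
        self._worker = None

    def start_compile(self, weights):
        """Compile into the pending slot on a background thread (stand-in for a GCD queue)."""
        pending = 1 - self._active

        def work():
            self._slots[pending] = self._compile(weights)

        self._worker = threading.Thread(target=work)
        self._worker.start()

    def compiling(self):
        """True while a background compile is still running."""
        return self._worker is not None and self._worker.is_alive()

    def eval(self, x):
        """Run the active kernel; this never waits on the background compile."""
        return self._slots[self._active](x)

    def swap_at_batch_boundary(self):
        """Wait for the pending compile to finish, then make that slot active."""
        if self._worker is None:
            return False
        self._worker.join()
        self._worker = None
        self._active = 1 - self._active
        return True


def slow_compile(weights):
    """Hypothetical compiler: sleeps to simulate compile latency."""
    time.sleep(0.02)
    return lambda x: x * weights


def demo():
    kernels = DoubleBufferedKernels(slow_compile, initial_weights=2)
    kernels.start_compile(3)          # background compile of the pending set
    evals_during_compile = 0
    while kernels.compiling():        # foreground keeps evaluating the active set
        assert kernels.eval(10) == 20
        evals_during_compile += 1
    kernels.swap_at_batch_boundary()  # the new kernels take over here
    return evals_during_compile, kernels.eval(10)
```

The swap is performed by the foreground thread between batches, so an eval never observes a half-compiled slot. The real implementation presumably needs an actual atomic or barrier, since Objective-C has no equivalent of Python's global interpreter lock.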

Includes:
- train_double_buffer.m: modified train_large.m with async compilation
- PROBE_RESULTS.md: full benchmark data from M4 probe
- Updated Makefile

Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real time
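
As a sketch of what an external monitoring tool might do with this stream, the snippet below parses JSON lines and averages one numeric field. The field names (`type`, `step_ms`, `tflops`) and the sample values are invented for illustration; the actual schema is whatever train_large.m emits.

```python
import json


def parse_telemetry(lines):
    """Yield JSON objects from a stream, skipping stderr lines that are not JSON."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # ordinary log output mixed into stderr
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed line
        if isinstance(record, dict):
            yield record


def mean_field(records, field):
    """Average a numeric field over the records that carry it, or None if none do."""
    values = [r[field] for r in records if isinstance(r.get(field), (int, float))]
    return sum(values) / len(values) if values else None


# Made-up sample input; these are not measured values.
SAMPLE = [
    '{"type": "step", "step": 1, "step_ms": 1.0}',
    "some non-JSON log line",
    '{"type": "batch", "batch": 0, "tflops": 1.5}',
    '{"type": "batch", "batch": 1, "tflops": 2.5}',
]

records = list(parse_telemetry(SAMPLE))
average_tflops = mean_field(records, "tflops")
```

In practice the lines would come from the trainer's redirected stderr (for example by iterating over `sys.stdin`) instead of `SAMPLE`.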

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>