5 Commits

Author SHA1 Message Date
mgkcloud 0edafd48ca feat: double-buffered async ANE training
Key discovery: compile and eval can run in parallel via GCD.
119 foreground evals completed during a 26.8ms background compile.

Architecture:
- Two kernel sets (A/B) alternate active/pending
- Background GCD thread compiles pending kernels while active runs
- Atomic swap at batch boundary
- Eliminates 88% compilation bottleneck

Includes:
- train_double_buffer.m: modified train_large.m with async compilation
- PROBE_RESULTS.md: full benchmark data from M4 probe
- Updated Makefile
2026-03-04 00:12:17 +11:00
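
The A/B scheme described in commit 0edafd48ca can be sketched roughly as follows. This is a minimal illustration in Python, not the code in train_double_buffer.m: a `threading.Thread` stands in for the background GCD thread, `time.sleep` stands in for the compile and eval calls, and every name here (`compile_kernels`, `evaluate`, `DoubleBuffer`, `run`) is hypothetical.

```python
import threading
import time


def compile_kernels(weights_version):
    """Stand-in for the slow compile step (assumption: sleeps instead of compiling)."""
    time.sleep(0.02)
    return {"version": weights_version}


def evaluate(kernels):
    """Stand-in for one foreground eval on the active kernel set."""
    time.sleep(0.0005)
    return kernels["version"]


class DoubleBuffer:
    """Two kernel sets: one active (being evaluated), one pending (being compiled)."""

    def __init__(self):
        self.active = compile_kernels(0)  # first set is compiled up front
        self.pending = None               # filled in by the background thread
        self._worker = None

    def start_compile(self, weights_version):
        # Kick off compilation of the pending set without blocking the caller.
        def work():
            self.pending = compile_kernels(weights_version)

        self._worker = threading.Thread(target=work)
        self._worker.start()

    def swap_at_batch_boundary(self):
        # Wait for the compile if it has not finished, then swap the two sets.
        self._worker.join()
        self.active, self.pending = self.pending, None
        self._worker = None


def run(num_batches=3, steps_per_batch=5):
    buf = DoubleBuffer()
    versions_used = []
    for batch in range(num_batches):
        buf.start_compile(batch + 1)        # background: compile the next set
        for _ in range(steps_per_batch):    # foreground: keep evaluating the active set
            version = evaluate(buf.active)
        versions_used.append(version)
        buf.swap_at_batch_boundary()
    return buf.active["version"], versions_used
```

Calling `run()` evaluates each batch on the set compiled during the previous batch, so the foreground loop only waits on compilation if the background thread is still running at the batch boundary.
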
Vipul ebac5dd73f Python Bridge+Memory leak fix+More functions 2026-03-03 02:04:36 -05:00
m0at 40d3f45631 Add ANE probe tests and training telemetry for M5 optimization
Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 22:54:58 -08:00
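
The telemetry format in commit 40d3f45631 (JSON lines on stderr, per-step timings, per-batch TFLOPS) might look something like the sketch below. The record layout and all field names (`type`, `step`, `timings_ms`, `tflops`, and so on) are assumptions for illustration; the commit message does not specify the actual schema written by train_large.m.

```python
import json
import sys


def step_line(step, timings_ms):
    """One JSON line per training step with a timing breakdown (hypothetical fields)."""
    record = {
        "type": "step",
        "step": step,
        "timings_ms": timings_ms,
        "total_ms": sum(timings_ms.values()),
    }
    return json.dumps(record)


def batch_line(batch, flops, elapsed_ms):
    """One JSON line per batch with throughput (hypothetical fields)."""
    seconds = elapsed_ms / 1000.0
    record = {
        "type": "batch",
        "batch": batch,
        # TFLOPS = floating-point operations / seconds / 1e12
        "tflops": flops / seconds / 1e12,
    }
    return json.dumps(record)


def emit(line, stream=sys.stderr):
    """Write one record per line and flush so a monitor sees it immediately."""
    stream.write(line + "\n")
    stream.flush()
```

An external monitor can then read stderr line by line and call `json.loads` on each line, which is what makes the one-record-per-line format convenient for real-time dashboards.
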
maderix 4d67db1bdb stories110M: 12-layer ANE training with dashboard, 107ms/step
- Scale to full stories110M (109M params, 12 layers) with real TinyStories data
- vDSP-vectorized cross-entropy (110ms→14ms), NEON fp16 IO, async dW
- TUI dashboard: loss curve, ANE/CPU power, CPU/memory graphs, text generation
- Split into modular headers: config, io, mil, cpu_ops
2026-03-01 03:14:39 -08:00
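
For reference, the cross-entropy that commit 4d67db1bdb reports vectorizing with vDSP is, per token, a log-sum-exp over the logits minus the target logit. The scalar Python sketch below shows only that computation in its numerically stable form; it does not use vDSP and is not taken from the repository, and the function name is hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy loss for one token: log(sum(exp(logits))) - logits[target].

    Subtracting the maximum logit first keeps exp() from overflowing.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]
```

A vectorized version would apply the same max, exp, sum and log steps across the whole vocabulary dimension at once rather than in a Python loop.
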
maderix f213c8db68 Initial release 2026-02-28 00:22:06 -08:00