berkus/ANE - ANE

Commit Graph

Author	SHA1	Message	Date
mgkcloud	0edafd48ca	feat: double-buffered async ANE training. Key discovery: compile and eval can run in parallel via GCD; 119 foreground evals completed during a 26.8ms background compile. Architecture: two kernel sets (A/B) alternate active/pending; a background GCD thread compiles pending kernels while the active set runs; atomic swap at batch boundary; eliminates 88% compilation bottleneck. Includes: train_double_buffer.m (modified train_large.m with async compilation); PROBE_RESULTS.md (full benchmark data from M4 probe); updated Makefile.	2026-03-04 00:12:17 +11:00
Vipul	ebac5dd73f	Python Bridge+Memory leak fix+More functions	2026-03-03 02:04:36 -05:00
m0at	40d3f45631	Add ANE probe tests and training telemetry for M5 optimization. Four standalone probe tests to characterize the M5 ANE: test_weight_reload (can weights be hot-swapped via unload+load without recompilation?); test_perf_stats (enumerate _ANEPerformanceStats methods/properties and hardware counters); test_qos_sweep (measure compile/load/eval latency across QoS 0-63); test_ane_advanced (probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient). Training telemetry (train_large.m): JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics; enables external monitoring tools to visualize ANE utilization in real time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 22:54:58 -08:00
maderix	4d67db1bdb	stories110M: 12-layer ANE training with dashboard, 107ms/step. Scale to full stories110M (109M params, 12 layers) with real TinyStories data; vDSP-vectorized cross-entropy (110ms→14ms), NEON fp16 IO, async dW; TUI dashboard: loss curve, ANE/CPU power, CPU/memory graphs, text generation; split into modular headers: config, io, mil, cpu_ops.	2026-03-01 03:14:39 -08:00
maderix	f213c8db68	Initial release	2026-02-28 00:22:06 -08:00

5 Commits
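
The double-buffering scheme described in commit 0edafd48ca (two kernel sets, background compile, swap at the batch boundary) can be illustrated with a minimal sketch. This is not the repository's code: it uses Python threads in place of GCD, and `compile_kernels`, `evaluate`, and `train` are hypothetical stand-ins in which a version number represents a compiled kernel set.

```python
import threading
import time


def compile_kernels(version):
    """Hypothetical stand-in for a slow kernel compile; returns a 'kernel set'."""
    time.sleep(0.02)  # pretend compilation takes a while
    return {"version": version}


def evaluate(kernels):
    """Hypothetical stand-in for a fast eval on the active kernel set."""
    return kernels["version"]


def train(num_batches, evals_per_batch):
    """Run evals on the active set while the pending set compiles in the background."""
    active = compile_kernels(0)  # the first compile has nothing to overlap with
    used_versions = []
    for batch in range(num_batches):
        pending = {}

        def background_compile(version=batch + 1, out=pending):
            # Build the next kernel set off the foreground thread.
            out["kernels"] = compile_kernels(version)

        worker = threading.Thread(target=background_compile)
        worker.start()

        # Foreground evals keep using the active set during the compile.
        for _ in range(evals_per_batch):
            used_versions.append(evaluate(active))

        # Batch boundary: wait for the pending set, then swap it in.
        worker.join()
        active = pending["kernels"]
    return used_versions


if __name__ == "__main__":
    print(train(num_batches=3, evals_per_batch=2))
```

Because the swap happens only after the worker has been joined, no eval ever sees a partially built kernel set; the real implementation would need its own synchronization for the equivalent swap.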