Commit Graph

6 Commits

Author SHA1 Message Date
imperatormk 0cf13e2b84 define g_fp16_io in train.m (fixes linker error) 2026-03-03 17:16:22 +01:00
imperatormk 709b60208f Fix MIL syntax for cross-generation ANE compatibility
The MIL scalar types used shorthand syntax (string("x"), int32(1)) that
only works on M4. Changed to the canonical verbose format that CoreML's
own compiler emits (tensor<string, []>("x"), tensor<int32, []>(1)).
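
As a rough illustration of the verbose scalar form described above, the sketch below formats scalars as text. The helper names mil_int32_scalar and mil_string_scalar are hypothetical and not taken from the repository; only the output format comes from this commit message.

```c
#include <stdio.h>

/* Hypothetical helper (not from the repository): writes an int32 scalar in
   the verbose form quoted in the commit message, e.g. tensor<int32, []>(1).
   Returns the value returned by snprintf. */
int mil_int32_scalar(char *buf, size_t n, int value) {
    return snprintf(buf, n, "tensor<int32, []>(%d)", value);
}

/* Hypothetical helper: writes a string scalar, e.g. tensor<string, []>("x").
   Assumption: s contains no quotes or backslashes; no escaping is done. */
int mil_string_scalar(char *buf, size_t n, const char *s) {
    return snprintf(buf, n, "tensor<string, []>(\"%s\")", s);
}
```

Keeping the formatting in one place would mean a later syntax change touches a single helper rather than every kernel generator.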

Also targets program(1.0) with <ios16> instead of program(1.3)/<ios18>,
and simplifies buildInfo to just coremlc-version.

For conv-based kernels, adds runtime fp16 I/O fallback — M1/M2 ANE
doesn't support the cast op (fp32<->fp16), so on first compile failure
it retries with native fp16 inputs/outputs and does the conversion on
the CPU side. The fallback is persisted across exec() restarts.
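
The retry logic described above might look roughly like the following sketch. Only the flag name g_fp16_io appears in this log; the callback type, compile_with_fallback, and the stand-in compiler are hypothetical, and persistence across exec() and the CPU-side conversion are left out.

```c
#include <stdbool.h>

/* Hypothetical callback: compiles a kernel with fp32 or fp16 I/O and
   returns true on success. */
typedef bool (*compile_fn)(bool fp16_io, void *ctx);

/* Flag name taken from the commit log; how the real code uses it is an
   assumption. */
bool g_fp16_io = false;

/* Try the current mode first. On failure, retry once with fp16 I/O and
   remember the fallback only if the retry succeeds. */
bool compile_with_fallback(compile_fn compile, void *ctx) {
    if (compile(g_fp16_io, ctx)) return true;
    if (g_fp16_io) return false;        /* already in fallback mode */
    if (!compile(true, ctx)) return false;
    g_fp16_io = true;                   /* later compiles skip the fp32 attempt */
    return true;
}

/* Stand-in compiler for illustration only: rejects fp32 I/O, as the commit
   reports for conv kernels on M1/M2. ctx points to an int call counter. */
bool fake_compile_fp16_only(bool fp16_io, void *ctx) {
    int *calls = ctx;
    (*calls)++;
    return fp16_io;
}
```

With the stand-in compiler, the first call costs two compile attempts and every later call costs one, since the flag is already set.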

Note: matmul and scaled_dot_product_attention ops still fail on M1/M2;
these are M4+ ANE ops. The attention tests (test_ane_causal_attn,
test_ane_sdpa5, and the attention part of test_full_fused) require M4 hardware.
Conv-based kernels (training, QKV projections, FFN) work on all generations.

Tested on M1 Pro, macOS 26.3 (Tahoe).
2026-03-02 22:00:45 +01:00
m0at 184b182bfc Add M5 probe results: weight reload fails, all QoS work, chaining API found
Key findings from running all 4 probes on Apple M5:

- Weight reload (unload+load after file overwrite) does NOT work — weights
  are baked at compile time, output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models

Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern)
and 64+ channel kernels (ANE minimum size requirement).

Full analysis in training/m5result.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 23:16:38 -08:00
m0at 40d3f45631 Add ANE probe tests and training telemetry for M5 optimization
Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time
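
A minimal sketch of JSON-lines telemetry as described above. The field names (step, step_ms, ane_ms, tflops) and both function names are assumptions made for illustration; the actual schema in train_large.m is not shown in this log.

```c
#include <stdio.h>

/* Hypothetical record layout: formats one telemetry record as a single JSON
   object. Returns the value returned by snprintf. */
int format_step_json(char *buf, size_t n, int step,
                     double step_ms, double ane_ms, double tflops) {
    return snprintf(buf, n,
                    "{\"step\":%d,\"step_ms\":%.3f,\"ane_ms\":%.3f,\"tflops\":%.3f}",
                    step, step_ms, ane_ms, tflops);
}

/* Writes one record per line and flushes, so a monitor reading the stream
   sees each step as soon as it finishes. Pass stderr to match the commit. */
void emit_step_json(FILE *out, int step,
                    double step_ms, double ane_ms, double tflops) {
    char line[256];
    int len = format_step_json(line, sizeof line, step, step_ms, ane_ms, tflops);
    if (len > 0 && (size_t)len < sizeof line) {
        fputs(line, out);
        fputc('\n', out);
        fflush(out);
    }
}
```

One object per line lets an external tool parse the stream incrementally without waiting for a closing bracket.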

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 22:54:58 -08:00
maderix 4d67db1bdb stories110M: 12-layer ANE training with dashboard, 107ms/step
- Scale to full stories110M (109M params, 12 layers) with real TinyStories data
- vDSP-vectorized cross-entropy (110ms→14ms), NEON fp16 IO, async dW
- TUI dashboard: loss curve, ANE/CPU power, CPU/memory graphs, text generation
- Split into modular headers: config, io, mil, cpu_ops
2026-03-01 03:14:39 -08:00
maderix f213c8db68 Initial release 2026-02-28 00:22:06 -08:00