berkus/ANE - ANE

Commit Graph

Author	SHA1	Message	Date
m0at	184b182bfc	Add M5 probe results: weight reload fails, all QoS work, chaining API found Key findings from running all 4 probes on Apple M5: - Weight reload (unload+load after file overwrite) does NOT work — weights are baked at compile time, output is identical regardless of file changes - weightsBuffer IOSurface parameter also does not override compiled weights - All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval) - _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData - _ANEChainingRequest supports loopback execution (output→input chaining) - _ANEClient has real-time eval path and chaining preparation methods - procedureIndex 0-15 all succeed on single-procedure models Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern) and 64+ channel kernels (ANE minimum size requirement). Full analysis in training/m5result.md. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 23:16:38 -08:00
m0at	40d3f45631	Add ANE probe tests and training telemetry for M5 optimization Four standalone probe tests to characterize the M5 ANE: - test_weight_reload: Can weights be hot-swapped via unload+load without recompilation? - test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters - test_qos_sweep: Measure compile/load/eval latency across QoS 0-63 - test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient Training telemetry (train_large.m): - JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics - Enables external monitoring tools to visualize ANE utilization in real-time Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 22:54:58 -08:00

Author

SHA1

Message

Date

m0at

184b182bfc

Add M5 probe results: weight reload fails, all QoS work, chaining API found

Key findings from running all 4 probes on Apple M5:

- Weight reload (unload+load after file overwrite) does NOT work — weights
  are baked at compile time, output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models

Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern)
and 64+ channel kernels (ANE minimum size requirement).

Full analysis in training/m5result.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-01 23:16:38 -08:00

m0at

40d3f45631

Add ANE probe tests and training telemetry for M5 optimization

Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-01 22:54:58 -08:00

2 Commits