mirror of https://github.com/maderix/ANE.git
7.1 KiB
7.1 KiB
Apple Neural Engine — Cross-Generation Benchmark Report
Community-submitted benchmark data from Issue #3. All results use Stories110M (12-layer transformer, 109M params, dim=768, seq=256).
Training Performance (Static Pipeline)
Chip ms/step ANE ms Compile/10 ANE TFLOPS Util% Contributor
─────────────────────────────────────────────────────────────────────────────────
M1 Pro 148-163 32-35 7.9-8.5s 0.57-0.63 3.6-4.0 @moriwang
M1 Max 143-167 35-45 ~7.1s 0.54-0.65 3.4-4.1 @andyg5000
M3 Ultra* 91 ~10 ~3.7s 0.88 5.6 (repo ref)
M4 Pro 69-73 8.9 ~3.5s 1.28 8.1 @srt54558
M4 Max 64 10.2 ~3.5s 1.45 9.2 @SethBurkart123
M5 101-120 9.1-9.8 3.2-3.4s 0.77-0.91 4.9-5.8 @GitBubble
*M3 Ultra = reference platform this project was developed on.
Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)
Chip TFLOPS Rated TOPS Utilization
───────────────────────────────────────────────────
M1 Pro FAIL 11 - (MIL compat issue)
M1 Max FAIL 11 - (MIL compat issue)
M3 Pro 9.98 15.8 63%
M4 Pro 12.57 38 33%
M4 Max 10.93 38 29%
M5 12.17 ~19* 64%
M5 (other) 12.44 ~19* 65%
*M5 ANE TOPS not officially disclosed; ~19 TOPS estimated from measured peak.
Comparative Chart
ANE Training Speed (ms/step, lower is better)
══════════════════════════════════════════════════════════════
M1 Pro ████████████████████████████████████████░░░░ 148-163 ms
M1 Max ██████████████████████████████████████░░░░░░ 143-167 ms
M3 Ultra ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 91 ms
M4 Pro ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 69-73 ms
M4 Max ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 64 ms
M5 ████████████████████████░░░░░░░░░░░░░░░░░░░░ 101-120 ms
0 50 100 150 200
Peak ANE Throughput (TFLOPS, higher is better)
══════════════════════════════════════════════════════════════
M1 Pro FAIL (MIL compat)
M1 Max FAIL (MIL compat)
M3 Pro ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░ 9.98
M4 Pro ████████████████████████████████░░░░░░░░░░░░░ 12.57
M4 Max ██████████████████████░░░░░░░░░░░░░░░░░░░░░░ 10.93
M5 █████████████████████████░░░░░░░░░░░░░░░░░░░ 12.17
0 3 6 9 12 15 18
ANE Sustained Throughput (TFLOPS, 5s window)
══════════════════════════════════════════════════════════════
M3 Pro ██████████████████████████████████████████████ 15.04 (95.2%)
0 3 6 9 12 15 18
(Only M3 Pro submitted sustained benchmark)
Key Findings
M1/M1 Pro/M1 Max
- Standalone benchmarks fail —
ane_mil_gen.hsingle-blob weight format rejected - Training works via
stories_mil.h(separate per-matrix weight blobs) - ANE compiler handles weight blobs differently from M4+
- Training at 148-167 ms/step, ~0.6 TFLOPS
M3 Pro
- Only ch=512 compiles — 52 channel values tested (1-4096), only 512 accepted
- Fixed 512-wide lane structure in SRAM tiling
- Peak: 16.77 TFLOPS (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048
- Sustained: 15.04 TFLOPS over 5 seconds (95.2% utilization)
- Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement)
M4 Pro / M4 Max
- Flexible channel support (256/384/512/768+)
- M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step
- M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall)
sram_probeandinmem_benchfail on M4 Pro (same MIL compat issue)
M5
- Training works out of the box with existing
program(1.3)MIL - Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra)
- Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro)
- ANE appears to be same H16 family as M4
- M5 Pro/Max not yet benchmarked — Fusion Architecture may change ANE behavior
Cross-Generation MIL Compatibility
Feature M1 M3 M4 M5
─────────────────────────────────────────────────────────
program(1.3) / ios18 PARTIAL YES YES YES
Single-blob weights FAIL YES YES YES
Per-matrix weight blobs YES YES YES YES
Channel flexibility ? ch=512 FLEX FLEX
BLOBFILE offset refs FAIL YES YES YES
macOS Compatibility Issues
- macOS 26.x —
[MLModel compileModelAtURL:]broken for standalone benchmarks (fixed in PR #27: switched to in-memory MIL compilation) - macOS 15.x — Works for all M-series with correct MIL format
- M1 generation requires
stories_mil.hpath, notane_mil_gen.h
How to Contribute
Run on your hardware and post results to Issue #3:
cd training && make train_large
./train_large ane_stories110M_ckpt.bin 256 20 1e-4
Include: chip model, macOS version, full output with JSON lines.
Report compiled 2026-03-04 from community submissions. Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton