mirror of https://github.com/maderix/ANE.git
164 lines
8.6 KiB
Markdown
164 lines
8.6 KiB
Markdown
# Apple Neural Engine — Cross-Generation Benchmark Report
|
|
|
|
Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3).
|
|
|
|
## Model Configuration
|
|
|
|
All training benchmarks use **Stories110M** — a Llama2-architecture transformer:
|
|
|
|
```
|
|
Parameter Value
|
|
────────────────────────
|
|
Architecture Llama2 (RoPE, SwiGLU, RMSNorm, GQA-ready)
|
|
Layers 12
|
|
Dimension 768
|
|
Hidden (FFN) 2048
|
|
Heads 12
|
|
Vocab 32000 (Llama 2 BPE)
|
|
Sequence 256
|
|
Total Params 109.53M (84.95M transformer + 24.58M embedding)
|
|
Training Data TinyStories (~20M tokens, pretokenized)
|
|
Optimizer Adam (lr=1e-4 to 3e-4, b1=0.9, b2=0.999)
|
|
Precision FP16 on ANE, FP32 on CPU
|
|
```
|
|
|
|
Kernels per step (static pipeline): 72 (60 weight-bearing + 12 static sdpaBwd2).
|
|
Forward: sdpaFwd + ffnW13 + ffnW2 per layer. Backward: ffnBwdW2t + ffnBwdW13t + wotBwd + sdpaBwd1 + sdpaBwd2 + qkvBwd per layer. Weight gradients (dW) via `cblas_sgemm` on CPU.
|
|
|
|
## Training Performance (Static Pipeline)
|
|
|
|
```
|
|
Chip ms/step ANE ms Compile/10 ANE TFLOPS Util% Contributor
|
|
─────────────────────────────────────────────────────────────────────────────────
|
|
M1 Pro 148-163 32-35 7.9-8.5s 0.57-0.63 3.6-4.0 @moriwang
|
|
M1 Max 143-167 35-45 ~7.1s 0.54-0.65 3.4-4.1 @andyg5000
|
|
M3 Ultra* 91 ~10 ~3.7s 0.88 5.6 (repo ref)
|
|
M4 Pro 69-73 8.9 ~3.5s 1.28 8.1 @srt54558
|
|
M4 Max 64 10.2 ~3.5s 1.45 9.2 @SethBurkart123
|
|
M5 101-120 9.1-9.8 3.2-3.4s 0.77-0.91 4.9-5.8 @GitBubble
|
|
```
|
|
|
|
*M3 Ultra = reference platform this project was developed on.
|
|
|
|
## Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)
|
|
|
|
```
|
|
Chip NE Cores FP16 TFLOPS (measured) Rated TOPS (Apple spec*)
|
|
────────────────────────────────────────────────────────────────────────────
|
|
M1 Pro 16 FAIL 11 (MIL compat issue)
|
|
M1 Max 16 FAIL 11 (MIL compat issue)
|
|
M3 Pro 16 9.98 15.8
|
|
M3 Ultra 32 - 31.6 (ref platform)
|
|
M4 Pro 16 12.57 38
|
|
M4 Max 16 10.93 38
|
|
M5 16 12.17 not disclosed
|
|
M5 (other) 16 12.44 not disclosed
|
|
```
|
|
|
|
*Apple's "Rated TOPS" changed methodology across generations — M1/M3 report FP16,
|
|
M4 reports INT8/mixed-precision peak. The numbers are not directly comparable across
|
|
generations. Use the measured FP16 TFLOPS column for apples-to-apples comparison.
|
|
All chips have 16 NE cores except Ultra variants (32 cores, two dies via UltraFusion).
|
|
Max variants share the same 16-core NE as Pro — the M4 Max vs M4 Pro TFLOPS difference
|
|
is run-to-run variance, not hardware.*
|
|
|
|
## Comparative Chart
|
|
|
|
```
|
|
ANE Training Speed (ms/step, lower is better)
|
|
══════════════════════════════════════════════════════════════
|
|
|
|
M1 Pro ████████████████████████████████████████░░░░ 148-163 ms
|
|
M1 Max ██████████████████████████████████████░░░░░░ 143-167 ms
|
|
M3 Ultra ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░ 91 ms
|
|
M4 Pro ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 69-73 ms
|
|
M4 Max ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 64 ms
|
|
M5 ████████████████████████░░░░░░░░░░░░░░░░░░░░ 101-120 ms
|
|
|
|
0 50 100 150 200
|
|
|
|
|
|
Peak ANE Throughput (TFLOPS, higher is better)
|
|
══════════════════════════════════════════════════════════════
|
|
|
|
M1 Pro FAIL (MIL compat)
|
|
M1 Max FAIL (MIL compat)
|
|
M3 Pro ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░ 9.98
|
|
M4 Pro ████████████████████████████████░░░░░░░░░░░░░ 12.57
|
|
M4 Max ██████████████████████░░░░░░░░░░░░░░░░░░░░░░ 10.93
|
|
M5 █████████████████████████░░░░░░░░░░░░░░░░░░░ 12.17
|
|
|
|
0 3 6 9 12 15 18
|
|
|
|
|
|
ANE Sustained Throughput (TFLOPS, 5s window)
|
|
══════════════════════════════════════════════════════════════
|
|
|
|
M3 Pro ██████████████████████████████████████████████ 15.04 (95.2%)
|
|
|
|
0 3 6 9 12 15 18
|
|
(Only M3 Pro submitted sustained benchmark)
|
|
```
|
|
|
|
## Key Findings
|
|
|
|
### M1/M1 Pro/M1 Max
|
|
- **Standalone benchmarks fail** — `ane_mil_gen.h` single-blob weight format rejected
|
|
- **Training works** via `stories_mil.h` (separate per-matrix weight blobs)
|
|
- ANE compiler handles weight blobs differently from M4+
|
|
- Training at 148-167 ms/step, ~0.6 TFLOPS
|
|
|
|
### M3 Pro
|
|
- **Only ch=512 compiles** — 52 channel values tested (1-4096), only 512 accepted
|
|
- Fixed 512-wide lane structure in SRAM tiling
|
|
- **Peak: 16.77 TFLOPS** (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048
|
|
- **Sustained: 15.04 TFLOPS** over 5 seconds (95.2% utilization)
|
|
- Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement)
|
|
|
|
### M4 Pro / M4 Max
|
|
- Flexible channel support (256/384/512/768+)
|
|
- M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step
|
|
- M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall)
|
|
- `sram_probe` and `inmem_bench` fail on M4 Pro (same MIL compat issue)
|
|
|
|
### M5
|
|
- Training works out of the box with existing `program(1.3)` MIL
|
|
- Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra)
|
|
- Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro)
|
|
- ANE appears to be same H16 family as M4
|
|
- **M5 Pro/Max not yet benchmarked** — Fusion Architecture may change ANE behavior
|
|
|
|
### Cross-Generation MIL Compatibility
|
|
|
|
```
|
|
Feature M1 M3 M4 M5
|
|
─────────────────────────────────────────────────────────
|
|
program(1.3) / ios18 PARTIAL YES YES YES
|
|
Single-blob weights FAIL YES YES YES
|
|
Per-matrix weight blobs YES YES YES YES
|
|
Channel flexibility ? ch=512 FLEX FLEX
|
|
BLOBFILE offset refs FAIL YES YES YES
|
|
```
|
|
|
|
## macOS Compatibility Issues
|
|
|
|
- **macOS 26.x** — `[MLModel compileModelAtURL:]` broken for standalone benchmarks
|
|
(fixed in PR #27: switched to in-memory MIL compilation)
|
|
- **macOS 15.x** — Works for all M-series with correct MIL format
|
|
- M1 generation requires `stories_mil.h` path, not `ane_mil_gen.h`
|
|
|
|
## How to Contribute
|
|
|
|
Run on your hardware and post results to [Issue #3](https://github.com/maderix/ANE/issues/3):
|
|
|
|
```bash
|
|
cd training && make train_large
|
|
./train_large ane_stories110M_ckpt.bin 256 20 1e-4
|
|
```
|
|
|
|
Include: chip model, macOS version, full output with JSON lines.
|
|
|
|
---
|
|
*Report compiled 2026-03-04 from community submissions.*
|
|
*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton*
|