8.6 KiB

Raw Blame History

Apple Neural Engine — Cross-Generation Benchmark Report

Community-submitted benchmark data from Issue #3.

Model Configuration

All training benchmarks use Stories110M — a Llama2-architecture transformer:

Parameter       Value
────────────────────────
Architecture    Llama2 (RoPE, SwiGLU, RMSNorm, GQA-ready)
Layers          12
Dimension       768
Hidden (FFN)    2048
Heads           12
Vocab           32000 (Llama 2 BPE)
Sequence        256
Total Params    109.53M (84.95M transformer + 24.58M embedding)
Training Data   TinyStories (~20M tokens, pretokenized)
Optimizer       Adam (lr=1e-4 to 3e-4, b1=0.9, b2=0.999)
Precision       FP16 on ANE, FP32 on CPU

Kernels per step (static pipeline): 72 (60 weight-bearing + 12 static sdpaBwd2). Forward: sdpaFwd + ffnW13 + ffnW2 per layer. Backward: ffnBwdW2t + ffnBwdW13t + wotBwd + sdpaBwd1 + sdpaBwd2 + qkvBwd per layer. Weight gradients (dW) via cblas_sgemm on CPU.

Training Performance (Static Pipeline)

Chip            ms/step   ANE ms   Compile/10   ANE TFLOPS   Util%    Contributor
─────────────────────────────────────────────────────────────────────────────────
M1 Pro          148-163   32-35    7.9-8.5s     0.57-0.63    3.6-4.0  @moriwang
M1 Max          143-167   35-45    ~7.1s        0.54-0.65    3.4-4.1  @andyg5000
M3 Ultra*       91        ~10      ~3.7s        0.88         5.6      (repo ref)
M4 Pro          69-73     8.9      ~3.5s        1.28         8.1      @srt54558
M4 Max          64        10.2     ~3.5s        1.45         9.2      @SethBurkart123
M5              101-120   9.1-9.8  3.2-3.4s     0.77-0.91    4.9-5.8  @GitBubble

*M3 Ultra = reference platform this project was developed on.

Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)

Chip            NE Cores  FP16 TFLOPS (measured)    Rated TOPS (Apple spec*)
────────────────────────────────────────────────────────────────────────────
M1 Pro          16        FAIL                      11    (MIL compat issue)
M1 Max          16        FAIL                      11    (MIL compat issue)
M3 Pro          16        9.98                      15.8
M3 Ultra        32        -                         31.6  (ref platform)
M4 Pro          16        12.57                     38
M4 Max          16        10.93                     38
M5              16        12.17                     not disclosed
M5 (other)      16        12.44                     not disclosed

Apple's "Rated TOPS" changed methodology across generations — M1/M3 report FP16, M4 reports INT8/mixed-precision peak. The numbers are not directly comparable across generations. Use the measured FP16 TFLOPS column for apples-to-apples comparison. All chips have 16 NE cores except Ultra variants (32 cores, two dies via UltraFusion). Max variants share the same 16-core NE as Pro — the M4 Max vs M4 Pro TFLOPS difference is run-to-run variance, not hardware.

Comparative Chart

ANE Training Speed (ms/step, lower is better)
══════════════════════════════════════════════════════════════

M1 Pro    ████████████████████████████████████████░░░░  148-163 ms
M1 Max    ██████████████████████████████████████░░░░░░  143-167 ms
M3 Ultra  ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░   91 ms
M4 Pro    ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   69-73 ms
M4 Max    ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   64 ms
M5        ████████████████████████░░░░░░░░░░░░░░░░░░░░  101-120 ms

          0        50       100       150       200


Peak ANE Throughput (TFLOPS, higher is better)
══════════════════════════════════════════════════════════════

M1 Pro    FAIL (MIL compat)
M1 Max    FAIL (MIL compat)
M3 Pro    ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░  9.98
M4 Pro    ████████████████████████████████░░░░░░░░░░░░░  12.57
M4 Max    ██████████████████████░░░░░░░░░░░░░░░░░░░░░░  10.93
M5        █████████████████████████░░░░░░░░░░░░░░░░░░░  12.17

          0     3     6     9     12    15    18


ANE Sustained Throughput (TFLOPS, 5s window)
══════════════════════════════════════════════════════════════

M3 Pro    ██████████████████████████████████████████████  15.04 (95.2%)

          0     3     6     9     12    15    18
          (Only M3 Pro submitted sustained benchmark)

Key Findings

M1/M1 Pro/M1 Max

Standalone benchmarks fail — ane_mil_gen.h single-blob weight format rejected
Training works via stories_mil.h (separate per-matrix weight blobs)
ANE compiler handles weight blobs differently from M4+
Training at 148-167 ms/step, ~0.6 TFLOPS

M3 Pro

Only ch=512 compiles — 52 channel values tested (1-4096), only 512 accepted
Fixed 512-wide lane structure in SRAM tiling
Peak: 16.77 TFLOPS (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048
Sustained: 15.04 TFLOPS over 5 seconds (95.2% utilization)
Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement)

M4 Pro / M4 Max

Flexible channel support (256/384/512/768+)
M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step
M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall)
sram_probe and inmem_bench fail on M4 Pro (same MIL compat issue)

M5

Training works out of the box with existing program(1.3) MIL
Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra)
Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro)
ANE appears to be same H16 family as M4
M5 Pro/Max not yet benchmarked — Fusion Architecture may change ANE behavior

Cross-Generation MIL Compatibility

Feature                    M1       M3       M4       M5
─────────────────────────────────────────────────────────
program(1.3) / ios18       PARTIAL  YES      YES      YES
Single-blob weights        FAIL     YES      YES      YES
Per-matrix weight blobs    YES      YES      YES      YES
Channel flexibility        ?        ch=512   FLEX     FLEX
BLOBFILE offset refs       FAIL     YES      YES      YES

macOS Compatibility Issues

macOS 26.x — [MLModel compileModelAtURL:] broken for standalone benchmarks (fixed in PR #27: switched to in-memory MIL compilation)
macOS 15.x — Works for all M-series with correct MIL format
M1 generation requires stories_mil.h path, not ane_mil_gen.h

How to Contribute

Run on your hardware and post results to Issue #3:

cd training && make train_large
./train_large ane_stories110M_ckpt.bin 256 20 1e-4

Include: chip model, macOS version, full output with JSON lines.

Report compiled 2026-03-04 from community submissions. Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton

8.6 KiB Raw Blame History