7.1 KiB

Raw Blame History

Apple Neural Engine — Cross-Generation Benchmark Report

Community-submitted benchmark data from Issue #3. All results use Stories110M (12-layer transformer, 109M params, dim=768, seq=256).

Training Performance (Static Pipeline)

Chip            ms/step   ANE ms   Compile/10   ANE TFLOPS   Util%    Contributor
─────────────────────────────────────────────────────────────────────────────────
M1 Pro          148-163   32-35    7.9-8.5s     0.57-0.63    3.6-4.0  @moriwang
M1 Max          143-167   35-45    ~7.1s        0.54-0.65    3.4-4.1  @andyg5000
M3 Ultra*       91        ~10      ~3.7s        0.88         5.6      (repo ref)
M4 Pro          69-73     8.9      ~3.5s        1.28         8.1      @srt54558
M4 Max          64        10.2     ~3.5s        1.45         9.2      @SethBurkart123
M5              101-120   9.1-9.8  3.2-3.4s     0.77-0.91    4.9-5.8  @GitBubble

*M3 Ultra = reference platform this project was developed on.

Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)

Chip            TFLOPS    Rated TOPS    Utilization
───────────────────────────────────────────────────
M1 Pro          FAIL      11            -  (MIL compat issue)
M1 Max          FAIL      11            -  (MIL compat issue)
M3 Pro          9.98      15.8          63%
M4 Pro          12.57     38            33%
M4 Max          10.93     38            29%
M5              12.17     ~19*          64%
M5 (other)      12.44     ~19*          65%

*M5 ANE TOPS not officially disclosed; ~19 TOPS estimated from measured peak.

Comparative Chart

ANE Training Speed (ms/step, lower is better)
══════════════════════════════════════════════════════════════

M1 Pro    ████████████████████████████████████████░░░░  148-163 ms
M1 Max    ██████████████████████████████████████░░░░░░  143-167 ms
M3 Ultra  ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░   91 ms
M4 Pro    ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   69-73 ms
M4 Max    ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   64 ms
M5        ████████████████████████░░░░░░░░░░░░░░░░░░░░  101-120 ms

          0        50       100       150       200


Peak ANE Throughput (TFLOPS, higher is better)
══════════════════════════════════════════════════════════════

M1 Pro    FAIL (MIL compat)
M1 Max    FAIL (MIL compat)
M3 Pro    ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░  9.98
M4 Pro    ████████████████████████████████░░░░░░░░░░░░░  12.57
M4 Max    ██████████████████████░░░░░░░░░░░░░░░░░░░░░░  10.93
M5        █████████████████████████░░░░░░░░░░░░░░░░░░░  12.17

          0     3     6     9     12    15    18


ANE Sustained Throughput (TFLOPS, 5s window)
══════════════════════════════════════════════════════════════

M3 Pro    ██████████████████████████████████████████████  15.04 (95.2%)

          0     3     6     9     12    15    18
          (Only M3 Pro submitted sustained benchmark)

Key Findings

M1/M1 Pro/M1 Max

Standalone benchmarks fail — ane_mil_gen.h single-blob weight format rejected
Training works via stories_mil.h (separate per-matrix weight blobs)
ANE compiler handles weight blobs differently from M4+
Training at 148-167 ms/step, ~0.6 TFLOPS

M3 Pro

Only ch=512 compiles — 52 channel values tested (1-4096), only 512 accepted
Fixed 512-wide lane structure in SRAM tiling
Peak: 16.77 TFLOPS (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048
Sustained: 15.04 TFLOPS over 5 seconds (95.2% utilization)
Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement)

M4 Pro / M4 Max

Flexible channel support (256/384/512/768+)
M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step
M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall)
sram_probe and inmem_bench fail on M4 Pro (same MIL compat issue)

M5

Training works out of the box with existing program(1.3) MIL
Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra)
Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro)
ANE appears to be same H16 family as M4
M5 Pro/Max not yet benchmarked — Fusion Architecture may change ANE behavior

Cross-Generation MIL Compatibility

Feature                    M1       M3       M4       M5
─────────────────────────────────────────────────────────
program(1.3) / ios18       PARTIAL  YES      YES      YES
Single-blob weights        FAIL     YES      YES      YES
Per-matrix weight blobs    YES      YES      YES      YES
Channel flexibility        ?        ch=512   FLEX     FLEX
BLOBFILE offset refs       FAIL     YES      YES      YES

macOS Compatibility Issues

macOS 26.x — [MLModel compileModelAtURL:] broken for standalone benchmarks (fixed in PR #27: switched to in-memory MIL compilation)
macOS 15.x — Works for all M-series with correct MIL format
M1 generation requires stories_mil.h path, not ane_mil_gen.h

How to Contribute

Run on your hardware and post results to Issue #3:

cd training && make train_large
./train_large ane_stories110M_ckpt.bin 256 20 1e-4

Include: chip model, macOS version, full output with JSON lines.

Report compiled 2026-03-04 from community submissions. Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton

7.1 KiB Raw Blame History