ANE/benchmarks/ANE_BENCHMARK_REPORT.md

# Apple Neural Engine — Cross-Generation Benchmark Report

Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3).
All results use Stories110M (12-layer transformer, 109M params, dim=768, seq=256).

## Training Performance (Static Pipeline)

```
Chip            ms/step   ANE ms   Compile/10   ANE TFLOPS   Util%    Contributor
─────────────────────────────────────────────────────────────────────────────────
M1 Pro          148-163   32-35    7.9-8.5s     0.57-0.63    3.6-4.0  @moriwang
M1 Max          143-167   35-45    ~7.1s        0.54-0.65    3.4-4.1  @andyg5000
M3 Ultra*       91        ~10      ~3.7s        0.88         5.6      (repo ref)
M4 Pro          69-73     8.9      ~3.5s        1.28         8.1      @srt54558
M4 Max          64        10.2     ~3.5s        1.45         9.2      @SethBurkart123
M5              101-120   9.1-9.8  3.2-3.4s     0.77-0.91    4.9-5.8  @GitBubble
```

*M3 Ultra = reference platform this project was developed on.

## Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)

```
Chip            TFLOPS    Rated TOPS    Utilization
───────────────────────────────────────────────────
M1 Pro          FAIL      11            -  (MIL compat issue)
M1 Max          FAIL      11            -  (MIL compat issue)
M3 Pro          9.98      15.8          63%
M4 Pro          12.57     38            33%
M4 Max          10.93     38            29%
M5              12.17     ~19*          64%
M5 (other)      12.44     ~19*          65%
```

*M5 ANE TOPS not officially disclosed; ~19 TOPS estimated from measured peak.

## Comparative Chart

```
ANE Training Speed (ms/step, lower is better)
══════════════════════════════════════════════════════════════

M1 Pro    ████████████████████████████████████████░░░░  148-163 ms
M1 Max    ██████████████████████████████████████░░░░░░  143-167 ms
M3 Ultra  ██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░   91 ms
M4 Pro    ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   69-73 ms
M4 Max    ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   64 ms
M5        ████████████████████████░░░░░░░░░░░░░░░░░░░░  101-120 ms

          0        50       100       150       200


Peak ANE Throughput (TFLOPS, higher is better)
══════════════════════════════════════════════════════════════

M1 Pro    FAIL (MIL compat)
M1 Max    FAIL (MIL compat)
M3 Pro    ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░  9.98
M4 Pro    ████████████████████████████████░░░░░░░░░░░░░  12.57
M4 Max    ██████████████████████░░░░░░░░░░░░░░░░░░░░░░  10.93
M5        █████████████████████████░░░░░░░░░░░░░░░░░░░  12.17

          0     3     6     9     12    15    18


ANE Sustained Throughput (TFLOPS, 5s window)
══════════════════════════════════════════════════════════════

M3 Pro    ██████████████████████████████████████████████  15.04 (95.2%)

          0     3     6     9     12    15    18
          (Only M3 Pro submitted sustained benchmark)
```

## Key Findings

### M1/M1 Pro/M1 Max
- **Standalone benchmarks fail** — `ane_mil_gen.h` single-blob weight format rejected
- **Training works** via `stories_mil.h` (separate per-matrix weight blobs)
- ANE compiler handles weight blobs differently from M4+
- Training at 148-167 ms/step, ~0.6 TFLOPS

### M3 Pro
- **Only ch=512 compiles** — 52 channel values tested (1-4096), only 512 accepted
- Fixed 512-wide lane structure in SRAM tiling
- **Peak: 16.77 TFLOPS** (106% of rated 15.8 TOPS) at 128x conv 512ch sp2048
- **Sustained: 15.04 TFLOPS** over 5 seconds (95.2% utilization)
- Spatial dimension is the key to peak throughput (sp64→sp2048 = 2x improvement)

### M4 Pro / M4 Max
- Flexible channel support (256/384/512/768+)
- M4 Pro: peak 12.57 TFLOPS, training at 72.5 ms/step
- M4 Max: peak 10.93 TFLOPS, training at 64 ms/step (fastest overall)
- `sram_probe` and `inmem_bench` fail on M4 Pro (same MIL compat issue)

### M5
- Training works out of the box with existing `program(1.3)` MIL
- Training speed 101-120 ms/step (slower than M4 Max, comparable to M3 Ultra)
- Peak ANE throughput ~12.2-12.4 TFLOPS (similar to M4 Pro)
- ANE appears to be same H16 family as M4
- **M5 Pro/Max not yet benchmarked** — Fusion Architecture may change ANE behavior

### Cross-Generation MIL Compatibility

```
Feature                    M1       M3       M4       M5
─────────────────────────────────────────────────────────
program(1.3) / ios18       PARTIAL  YES      YES      YES
Single-blob weights        FAIL     YES      YES      YES
Per-matrix weight blobs    YES      YES      YES      YES
Channel flexibility        ?        ch=512   FLEX     FLEX
BLOBFILE offset refs       FAIL     YES      YES      YES
```

## macOS Compatibility Issues

- **macOS 26.x** — `[MLModel compileModelAtURL:]` broken for standalone benchmarks
  (fixed in PR #27: switched to in-memory MIL compilation)
- **macOS 15.x** — Works for all M-series with correct MIL format
- M1 generation requires `stories_mil.h` path, not `ane_mil_gen.h`

## How to Contribute

Run on your hardware and post results to [Issue #3](https://github.com/maderix/ANE/issues/3):

```bash
cd training && make train_large
./train_large ane_stories110M_ckpt.bin 256 20 1e-4
```

Include: chip model, macOS version, full output with JSON lines.

---
*Report compiled 2026-03-04 from community submissions.*
*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton*