ANE Benchmark Results: Apple M4 Max

Date: March 3, 2026
Machine: Mac16,5 (MacBook Pro, Apple M4 Max)
macOS: 26.2
ANE Peak: 15.8 TFLOPS (theoretical)

Training Performance

train_large (CPU classifier path)

Metric           Value
--------------------------------------------------------------
Model            Stories110M (12 layers, dim=768, hidden=2048)
Kernels          72 (60 weight-bearing + 12 static sdpaBwd2)
Avg step time    72.4 ms/step
ANE TFLOPS       1.29 sustained
Total TFLOPS     2.41 (ANE+CPU)
ANE utilization  8.1% of 15.8 TFLOPS
Compile time     79.7% of wall time
Train time       16.4% of wall time

train_large_ane (ANE-offloaded classifier)

Metric           Value
--------------------------------------------------------------
Model            Stories110M (same as above)
Kernels          99 (86 weight-bearing + 13 static)
Avg step time    62.9 ms/step
ANE TFLOPS       1.68 sustained
Total TFLOPS     2.77 (ANE+CPU)
ANE utilization  10.6% of 15.8 TFLOPS
Compile time     84.5% of wall time
Train time       12.5% of wall time
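
The utilization rows in the two tables above are the sustained ANE throughput divided by the 15.8 TFLOPS theoretical peak. A minimal sketch of that arithmetic, using only figures copied from the tables (the function and variable names are illustrative, not from the benchmark code). Because the inputs are already rounded, the recomputed percentages can differ from the tables in the last digit.

```python
# Figures copied from the train_large and train_large_ane tables above.
PEAK_TFLOPS = 15.8  # theoretical ANE peak quoted in the header

runs = {
    "train_large": {"step_ms": 72.4, "ane_tflops": 1.29},
    "train_large_ane": {"step_ms": 62.9, "ane_tflops": 1.68},
}


def utilization_pct(ane_tflops, peak_tflops=PEAK_TFLOPS):
    """Sustained ANE throughput as a percentage of the theoretical peak."""
    return 100.0 * ane_tflops / peak_tflops


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate step is than the baseline step."""
    return baseline_ms / candidate_ms


for name, run in runs.items():
    print(f"{name}: {utilization_pct(run['ane_tflops']):.1f}% of peak")

ratio = speedup(runs["train_large"]["step_ms"], runs["train_large_ane"]["step_ms"])
print(f"step-time speedup from offloading the classifier: {ratio:.2f}x")
```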

Step time breakdown (ms/step, ANE classifier path):

Component    Time (ms)  Description
----------------------------------------------------------------
ane          10-12      ANE kernel dispatch + evaluation
elem         12-13      Elementwise ops (residuals, activations)
cls          5-6        Classifier forward + backward
io           3-5        IOSurface data transfers
rms          0.1        RMSNorm
cblas_wait   0.0        BLAS sync overhead
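
As a quick sanity check, the itemized components can be summed and compared with the average step time. The sketch below assumes the breakdown belongs to the train_large_ane run (62.9 ms/step); the source does not say how any time outside these components is spent, so the code only reports the itemized range.

```python
# (component, low ms, high ms) copied from the breakdown table above.
BREAKDOWN = [
    ("ane", 10.0, 12.0),
    ("elem", 12.0, 13.0),
    ("cls", 5.0, 6.0),
    ("io", 3.0, 5.0),
    ("rms", 0.1, 0.1),
    ("cblas_wait", 0.0, 0.0),
]
AVG_STEP_MS = 62.9  # assumed to be the matching run (train_large_ane)


def total_range(rows):
    """Sum the low and high ends of every component's time range."""
    low = sum(row[1] for row in rows)
    high = sum(row[2] for row in rows)
    return low, high


low, high = total_range(BREAKDOWN)
print(f"itemized components: {low:.1f}-{high:.1f} ms of {AVG_STEP_MS} ms/step")
```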

Programmatic MIL Peak TFLOPS

Config                         W(MB)   GFLOP   ms/eval  TFLOPS
----------------------------------------------------------------------
32x conv 512ch sp64            16.0    1.07    0.408 ms   2.63
48x conv 512ch sp64            24.0    1.61    0.262 ms   6.15
64x conv 512ch sp64            32.0    2.15    0.244 ms   8.80
96x conv 512ch sp64            48.0    3.22    0.326 ms   9.89
128x conv 512ch sp64           64.0    4.29    0.385 ms  11.14
64x conv 256ch sp64             8.0    0.54    0.365 ms   1.47
128x conv 256ch sp64           16.0    1.07    0.454 ms   2.37
256x conv 256ch sp64           32.0    2.15    0.351 ms   6.11
64x conv 384ch sp64            18.0    1.21    0.429 ms   2.82
128x conv 384ch sp64           36.0    2.42    0.354 ms   6.82

Peak observed: 11.14 TFLOPS (128x conv 512ch sp64, 64 MB weights)
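
The W(MB), GFLOP, and TFLOPS columns above can be reproduced from the config string and the measured latency. The sketch below assumes each layer is a 1x1 convolution with equal input and output channels, fp16 weights, "sp64" meaning 64 spatial positions, and MB meaning 2^20 bytes. These assumptions are inferred from the table values, not taken from the benchmark source, and the function name is illustrative.

```python
MIB = 1024 * 1024


def conv_stack_stats(layers, channels, spatial, ms_per_eval, bytes_per_weight=2):
    """Return (weight MB, GFLOP per eval, TFLOPS) for a stack of 1x1 convs."""
    weights = layers * channels * channels        # C_in == C_out (assumed)
    weight_mb = weights * bytes_per_weight / MIB
    flops = 2 * weights * spatial                 # multiply + add per weight per position
    gflop = flops / 1e9
    tflops = flops / (ms_per_eval * 1e-3) / 1e12
    return weight_mb, gflop, tflops


# Two rows from the table: "32x conv 512ch sp64" and "128x conv 512ch sp64".
for layers, ms in [(32, 0.408), (128, 0.385)]:
    mb, gflop, tflops = conv_stack_stats(layers, 512, 64, ms)
    print(f"{layers}x conv 512ch sp64: {mb:.1f} MB, {gflop:.2f} GFLOP, {tflops:.2f} TFLOPS")
```

Because ms/eval is rounded to three decimals in the table, the recomputed TFLOPS can differ from the printed column by a few hundredths.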

In-Memory ANE Benchmark (via mlpackage)

Config         W (MB)    ms/eval   TFLOPS
---------------------------------------------
 256ch x64sp     0.1     0.319 ms    0.03
 512ch x64sp     0.5     0.357 ms    0.09
1024ch x64sp     2.0     0.457 ms    0.29
2048ch x64sp     8.0     0.254 ms    2.11
3072ch x64sp    18.0     0.389 ms    3.10
4096ch x64sp    32.0     1.148 ms    1.87

SRAM Probe Results

Coarse Probe (varying channels + spatial)

Config                      W (MB)  Act(MB)  Tot(MB)    ms/eval   TFLOPS
--------------------------------------------------------------------------
256ch x 64sp                  0.1     0.03      0.2     0.378 ms    0.02
512ch x 64sp                  0.5     0.06      0.6     0.389 ms    0.09
1024ch x 64sp                 2.0     0.12      2.2     0.392 ms    0.34
2048ch x 64sp                 8.0     0.25      8.5     0.218 ms    2.47
3072ch x 64sp                18.0     0.38     18.8     0.396 ms    3.05
4096ch x 64sp                32.0     0.50     33.0     1.116 ms    1.92
5120ch x 64sp                50.0     0.62     51.2     0.767 ms    4.38
6144ch x 64sp                72.0     0.75     73.5     0.872 ms    5.54
8192ch x 32sp               128.0     0.50    129.0     4.195 ms    1.02

Fine Probe (spatial=64, weights only)

Channels       W (MB)    ms/eval   TFLOPS    GFLOPS/MB
--------------------------------------------------------------
   256 ch       0.1     0.378 ms    0.02       177.7
   512 ch       0.5     0.431 ms    0.08       155.6
  1024 ch       2.0     0.411 ms    0.33       163.5
  1536 ch       4.5     0.493 ms    0.61       136.1
  2048 ch       8.0     0.410 ms    1.31       163.9
  2560 ch      12.5     0.237 ms    3.53       282.6  <-- peak efficiency
  3072 ch      18.0     0.335 ms    3.60       200.1
  3584 ch      24.5     0.414 ms    3.97       162.1
  4096 ch      32.0     1.134 ms    1.89        59.2  <-- spilling
  4608 ch      40.5     0.563 ms    4.83       119.2
  5120 ch      50.0     0.659 ms    5.09       101.8
  6144 ch      72.0     0.844 ms    5.73        79.5  <-- spilling
  8192 ch     128.0     4.203 ms    1.02         8.0  <-- catastrophic spilling

SRAM Analysis

The M4 Max ANE SRAM appears to be approximately 24-32 MB:

  • Peak efficiency at 2560ch (12.5 MB weights): 282.6 GFLOPS/MB, 3.53 TFLOPS
  • First spill at 4096ch (32.0 MB): drops to 59.2 GFLOPS/MB (1.89 TFLOPS)
  • Catastrophic at 8192ch (128.0 MB): 8.0 GFLOPS/MB (1.02 TFLOPS)

The 4608ch recovery (4.83 TFLOPS despite 40.5 MB weights) suggests the ANE may use tiling strategies for some weight configurations.

Training kernels (dim=768, weight matrices ~1.2 MB fp16 each) stay well within the SRAM budget.
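
The per-matrix size quoted above can be checked from the model dimensions. The sketch assumes fp16 (2 bytes per weight) and a transformer layout with dim x dim attention projections and dim x hidden feed-forward matrices; the layout is an assumption about Stories110M, not something stated in this report. The 24 MB figure is the low end of the SRAM estimate above.

```python
MIB = 1024 * 1024
SRAM_LOW_ESTIMATE_MB = 24.0  # low end of the 24-32 MB estimate


def fp16_matrix_mb(rows, cols):
    """Size of a rows x cols fp16 weight matrix in MB (2^20 bytes)."""
    return rows * cols * 2 / MIB


dim, hidden = 768, 2048
sizes = {
    "attention projection (dim x dim)": fp16_matrix_mb(dim, dim),
    "feed-forward (dim x hidden)": fp16_matrix_mb(dim, hidden),
}

for name, mb in sizes.items():
    fits = "within" if mb < SRAM_LOW_ESTIMATE_MB else "exceeds"
    print(f"{name}: {mb:.3f} MB ({fits} the {SRAM_LOW_ESTIMATE_MB:.0f} MB low estimate)")
```

The dim x dim case works out to 1.125 MB in 2^20-byte units, or about 1.18 MB in decimal megabytes, which matches the ~1.2 MB quoted.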

Known Test Results

Test                Status           Notes
----------------------------------------------------------------------------------------------------
test_rmsnorm_bwd    PASS             ANE-accelerated RMSNorm backward
test_classifier     PASS             4 tests passed; ANE backward 3x slower than CPU cblas for matmul
test_weight_reload  FAIL (expected)  ANE bakes weights at compile time; IOSurface override doesn't work
test_perf_stats     PASS             _ANEPerformanceStats API accessible
test_qos_sweep      PASS             QoS parameter has no measurable effect on latency
test_ane_advanced   PASS             Advanced ANE operations verified
inmem_basic         PASS             In-memory compilation and execution verified
inmem_bench         PASS             Multi-config benchmarks via mlpackage
inmem_peak          PASS             Peak TFLOPS measurement via programmatic MIL
sram_bench          PASS             SRAM capacity probing
sram_probe          PASS             Fine-grained SRAM spilling detection

Reproducing

cd scripts && bash run_benchmarks.sh

The benchmark script auto-generates required .mlpackage models (needs Python 3.11-3.13 with coremltools).
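
Before running the script, it can help to confirm that the interpreter meets the requirement above. The following is a hypothetical preflight check, not part of run_benchmarks.sh; it uses only the standard library and checks whether coremltools is importable without actually importing it.

```python
import importlib.util
import sys

SUPPORTED = ((3, 11), (3, 13))  # inclusive Python range quoted above


def python_supported(version=None):
    """True if the (major, minor) version falls inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED[0] <= major_minor <= SUPPORTED[1]


def coremltools_available():
    """True if coremltools can be found on the import path."""
    return importlib.util.find_spec("coremltools") is not None


if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("coremltools found:", coremltools_available())
```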

Override the model and training data paths with environment variables:

ANE_MODEL_PATH=/path/to/stories110M.bin ANE_DATA_PATH=/path/to/data.bin ./train_large