6.0 KiB

Raw Blame History

ANE Benchmark Results: Apple M4 Max

Date: March 3, 2026 Machine: Mac16,5 (MacBook Pro, Apple M4 Max) macOS: 26.2 ANE Peak: 15.8 TFLOPS (theoretical)

Training Performance

train_large (CPU classifier path)

Metric	Value
Model	Stories110M (12 layers, dim=768, hidden=2048)
Kernels	72 (60 weight-bearing + 12 static sdpaBwd2)
Avg step time	72.4 ms/step
ANE TFLOPS	1.29 sustained
Total TFLOPS	2.41 (ANE+CPU)
ANE utilization	8.1% of 15.8 TFLOPS
Compile time	79.7% of wall time
Train time	16.4% of wall time

train_large_ane (ANE-offloaded classifier)

Metric	Value
Model	Stories110M (same as above)
Kernels	99 (86 weight-bearing + 13 static)
Avg step time	62.9 ms/step
ANE TFLOPS	1.68 sustained
Total TFLOPS	2.77 (ANE+CPU)
ANE utilization	10.6% of 15.8 TFLOPS
Compile time	84.5% of wall time
Train time	12.5% of wall time

Step time breakdown (ms/step, ANE classifier path):

Component	Time (ms)	Description
ane	10-12	ANE kernel dispatch + evaluation
elem	12-13	Elementwise ops (residuals, activations)
cls	5-6	Classifier forward + backward
io	3-5	IOSurface data transfers
rms	0.1	RMSNorm
cblas_wait	0.0	BLAS sync overhead

Programmatic MIL Peak TFLOPS

Config                         W(MB)   GFLOP   ms/eval  TFLOPS
----------------------------------------------------------------------
32x conv 512ch sp64            16.0    1.07    0.408 ms   2.63
48x conv 512ch sp64            24.0    1.61    0.262 ms   6.15
64x conv 512ch sp64            32.0    2.15    0.244 ms   8.80
96x conv 512ch sp64            48.0    3.22    0.326 ms   9.89
128x conv 512ch sp64           64.0    4.29    0.385 ms  11.14
64x conv 256ch sp64             8.0    0.54    0.365 ms   1.47
128x conv 256ch sp64           16.0    1.07    0.454 ms   2.37
256x conv 256ch sp64           32.0    2.15    0.351 ms   6.11
64x conv 384ch sp64            18.0    1.21    0.429 ms   2.82
128x conv 384ch sp64           36.0    2.42    0.354 ms   6.82

Peak observed: 11.14 TFLOPS (128x conv 512ch sp64, 64 MB weights)

In-Memory ANE Benchmark (via mlpackage)

Config         W (MB)    ms/eval   TFLOPS
---------------------------------------------
 256ch x64sp     0.1     0.319 ms    0.03
 512ch x64sp     0.5     0.357 ms    0.09
1024ch x64sp     2.0     0.457 ms    0.29
2048ch x64sp     8.0     0.254 ms    2.11
3072ch x64sp    18.0     0.389 ms    3.10
4096ch x64sp    32.0     1.148 ms    1.87

SRAM Probe Results

Coarse Probe (varying channels + spatial)

Config                      W (MB)  Act(MB)  Tot(MB)    ms/eval   TFLOPS
--------------------------------------------------------------------------
256ch x 64sp                  0.1     0.03      0.2     0.378 ms    0.02
512ch x 64sp                  0.5     0.06      0.6     0.389 ms    0.09
1024ch x 64sp                 2.0     0.12      2.2     0.392 ms    0.34
2048ch x 64sp                 8.0     0.25      8.5     0.218 ms    2.47
3072ch x 64sp                18.0     0.38     18.8     0.396 ms    3.05
4096ch x 64sp                32.0     0.50     33.0     1.116 ms    1.92
5120ch x 64sp                50.0     0.62     51.2     0.767 ms    4.38
6144ch x 64sp                72.0     0.75     73.5     0.872 ms    5.54
8192ch x 32sp               128.0     0.50    129.0     4.195 ms    1.02

Fine Probe (spatial=64, weights only)

Channels       W (MB)    ms/eval   TFLOPS    GFLOPS/MB
--------------------------------------------------------------
   256 ch       0.1     0.378 ms    0.02       177.7
   512 ch       0.5     0.431 ms    0.08       155.6
  1024 ch       2.0     0.411 ms    0.33       163.5
  1536 ch       4.5     0.493 ms    0.61       136.1
  2048 ch       8.0     0.410 ms    1.31       163.9
  2560 ch      12.5     0.237 ms    3.53       282.6  <-- peak efficiency
  3072 ch      18.0     0.335 ms    3.60       200.1
  3584 ch      24.5     0.414 ms    3.97       162.1
  4096 ch      32.0     1.134 ms    1.89        59.2  <-- spilling
  4608 ch      40.5     0.563 ms    4.83       119.2
  5120 ch      50.0     0.659 ms    5.09       101.8
  6144 ch      72.0     0.844 ms    5.73        79.5  <-- spilling
  8192 ch     128.0     4.203 ms    1.02         8.0  <-- catastrophic spilling

SRAM Analysis

The M4 Max ANE SRAM appears to be approximately 24-32 MB:

Peak efficiency at 2560ch (12.5 MB weights): 282.6 GFLOPS/MB, 3.53 TFLOPS
First spill at 4096ch (32.0 MB): drops to 59.2 GFLOPS/MB (1.89 TFLOPS)
Catastrophic at 8192ch (128.0 MB): 8.0 GFLOPS/MB (1.02 TFLOPS)

The 4608ch recovery (4.83 TFLOPS despite 40.5 MB weights) suggests the ANE may use tiling strategies for some weight configurations.

Training kernels (dim=768, weight matrices ~1.2 MB fp16 each) stay well within the SRAM budget.

Known Test Results

Test	Status	Notes
test_rmsnorm_bwd	PASS	ANE-accelerated RMSNorm backward
test_classifier	PASS	4 tests passed; ANE backward 3x slower than CPU cblas for matmul
test_weight_reload	FAIL (expected)	ANE bakes weights at compile time; IOSurface override doesn't work
test_perf_stats	PASS	_ANEPerformanceStats API accessible
test_qos_sweep	PASS	QoS parameter has no measurable effect on latency
test_ane_advanced	PASS	Advanced ANE operations verified
inmem_basic	PASS	In-memory compilation and execution verified
inmem_bench	PASS	Multi-config benchmarks via mlpackage
inmem_peak	PASS	Peak TFLOPS measurement via programmatic MIL
sram_bench	PASS	SRAM capacity probing
sram_probe	PASS	Fine-grained SRAM spilling detection

Reproducing

cd scripts && bash run_benchmarks.sh

The benchmark script auto-generates required .mlpackage models (needs Python 3.11-3.13 with coremltools).

Override training data paths:

ANE_MODEL_PATH=/path/to/stories110M.bin ANE_DATA_PATH=/path/to/data.bin ./train_large

6.0 KiB Raw Blame History