mirror of https://github.com/maderix/ANE.git
Merge c3263bd618 into d91c9845c0
commit
82c275d140
|
|
@ -94,6 +94,26 @@
|
|||
"peak_tflops_inmem": 12.17,
|
||||
"notes": "inmem_peak only, no training data submitted.",
|
||||
"contributor": "elijah-pelton"
|
||||
},
|
||||
{
|
||||
"chip": "M5 Max",
|
||||
"cores": "18-core (6P+12E)",
|
||||
"ram_gb": 128,
|
||||
"macos": "26.4.1",
|
||||
"ane_subtype": "h17",
|
||||
"ms_per_step": [89.7, 90.2],
|
||||
"ane_ms": [7.8, 8.1],
|
||||
"compile_ms": [3335, 3384],
|
||||
"ane_tflops": 1.03,
|
||||
"ane_util_pct": 5.4,
|
||||
"peak_tflops_inmem": 13.80,
|
||||
"peak_tflops_int8_w8a8": 35.61,
|
||||
"peak_tflops_fp16_64x64": 19.27,
|
||||
"ms_per_step_dynamic_stories110m": 73.5,
|
||||
"ms_per_step_dynamic_qwen3_06b": 320.0,
|
||||
"compile_ms_dynamic": 421,
|
||||
"notes": "First H17 ANE on record — distinct subtype from M4/M5 base (both H16). FP16/INT8 peak compute matches M4 within 4%; training gains over M5 base are CPU-driven (12P cores + Accelerate). See benchmarks/m5max_result.md for the full probe report.",
|
||||
"contributor": "lixiang.ict@gmail.com"
|
||||
}
|
||||
],
|
||||
"neural_engine_specs": {
|
||||
|
|
@ -108,6 +128,7 @@
|
|||
"M3_Ultra": {"ne_cores": 32, "rated_tops": 31.6},
|
||||
"M4": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"},
|
||||
"M4_Max": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"},
|
||||
"M5": {"ne_cores": 16, "rated_tops": null, "estimated_tops": 19}
|
||||
"M5": {"ne_cores": 16, "rated_tops": null, "estimated_tops": 19, "ane_subtype": "h16"},
|
||||
"M5_Max": {"ne_cores": 16, "rated_tops": null, "measured_fp16_tflops": 19.27, "measured_int8_tops": 35.61, "ane_subtype": "h17", "note": "First chip on record reporting H17 ANE subtype; peak math matches M4 H16."}
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -0,0 +1,229 @@
|
|||
# M5 Max ANE Probe & Training Benchmark
|
||||
|
||||
**Machine**: MacBook Pro · Apple M5 Max (6P + 12E CPU) · 128 GB RAM
|
||||
**macOS**: 26.4.1 (Darwin 25.4.0), Model Identifier `Mac17,7`
|
||||
**Date**: 2026-04-23
|
||||
**ANE Family**: **`h17`** (new — M4 and base M5 are `h16`)
|
||||
|
||||
All data was gathered with the repo's probes and training harness as-is, with no
source changes. Results are compared against:
|
||||
- [README.md](../README.md) M4 reference figures
|
||||
- [training/m5result.md](../training/m5result.md) — base M5 (10-core, 16 GB) probe notes
|
||||
- [benchmarks/community_results.json](./community_results.json) — M1/M3/M4/M5 submissions
|
||||
|
||||
---
|
||||
|
||||
## Hardware identification
|
||||
|
||||
**Question**: Is the ANE in M5 Max the same silicon block as M4 / base M5?
|
||||
|
||||
**Result**: **No — `_ANEDeviceInfo.aneSubType` returns `h17`**, a version not
|
||||
seen in any community submission. The base M5 (per `m5result.md`) still reports
|
||||
`h16`, same as M4. M5 Max is the first `h17` on record.
|
||||
|
||||
```
|
||||
=== ANE INT8 W8A8 Benchmark (M4, h17) === ← header label is hardcoded "M4",
|
||||
the "h17" is read from the device.
|
||||
```
|
||||
|
||||
Everything else still works: `program(1.3)` MIL, `_ANEInMemoryModelDescriptor`,
`constexpr_affine_dequantize`, `quantize` / `dequantize`. None of these private
APIs has been closed off on macOS 26.4.1.
|
||||
|
||||
---
|
||||
|
||||
## inmem_peak — deep conv stacks (FP16)
|
||||
|
||||
**Question**: What is the peak FP16 throughput on the same 128-layer conv sweep
([inmem_peak.m](../inmem_peak.m)) that the README reports for M4?
|
||||
|
||||
```
|
||||
Config W(MB) GFLOP ms/eval TFLOPS
|
||||
-----------------------------------------------------------------
|
||||
32x conv 512ch sp64 16.0 1.07 0.135 ms 7.95
|
||||
48x conv 512ch sp64 24.0 1.61 0.171 ms 9.42
|
||||
64x conv 512ch sp64 32.0 2.15 0.206 ms 10.40
|
||||
96x conv 512ch sp64 48.0 3.22 0.266 ms 12.13
|
||||
128x conv 512ch sp64 64.0 4.29 0.311 ms 13.80 ← peak
|
||||
64x conv 256ch sp64 8.0 0.54 0.168 ms 3.19
|
||||
128x conv 256ch sp64 16.0 1.07 0.132 ms 8.16
|
||||
256x conv 256ch sp64 32.0 2.15 0.216 ms 9.94
|
||||
64x conv 384ch sp64 18.0 1.21 0.142 ms 8.52
|
||||
128x conv 384ch sp64 36.0 2.42 0.203 ms 11.91
|
||||
```
|
||||
|
||||
**Peak: 13.80 TFLOPS** at `128× conv 512ch sp=64`.
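
The TFLOPS column can be reproduced from the config name and the measured latency. The sketch below assumes each layer is a square 1×1 convolution (input channels equal output channels) and counts one multiply-accumulate as 2 FLOPs. That formula is inferred from the GFLOP column of the table, not taken from `inmem_peak.m`, and the function name `conv_stack_tflops` is hypothetical.

```python
def conv_stack_tflops(layers, channels, spatial, ms_per_eval):
    """Throughput of a stack of square 1x1 convs, in TFLOPS (assumed formula)."""
    # Each layer does channels * channels multiply-accumulates per spatial
    # position; one multiply-accumulate counts as 2 FLOPs.
    flops = 2 * channels * channels * spatial * layers
    return flops / (ms_per_eval * 1e-3) / 1e12


# Latencies are the rounded ms/eval values printed in the table above.
peak = conv_stack_tflops(layers=128, channels=512, spatial=64, ms_per_eval=0.311)
small = conv_stack_tflops(layers=32, channels=512, spatial=64, ms_per_eval=0.135)
print(f"128x conv 512ch sp64: {peak:.2f} TFLOPS")
print(f" 32x conv 512ch sp64: {small:.2f} TFLOPS")
```

Because the table prints latency to three decimals, the recomputed values can differ from the logged TFLOPS in the last digit.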
|
||||
|
||||
| Chip | inmem_peak FP16 (TFLOPS) |
|
||||
|------------|--------------------------|
|
||||
| M3 Pro | 9.98 |
|
||||
| M4 Pro | 12.57 |
|
||||
| M4 Max | 10.93 |
|
||||
| M5 (16 GB) | 12.17 |
|
||||
| M5 (32 GB) | 12.44 |
|
||||
| **M5 Max** | **13.80** |
|
||||
|
||||
---
|
||||
|
||||
## ane_int8_bench — FP16 vs INT8 W8A8 (larger spatial 64×64)
|
||||
|
||||
**Question**: How close does M5 Max come to the M4 blog's 19 TFLOPS / 35 TOPS
|
||||
figures when the conv is large enough to saturate the array?
|
||||
|
||||
```
|
||||
Config GOP ms/eval TOPS Ratio
|
||||
-----------------------------------------------------------------
|
||||
FP16 128x conv 512ch 64x64 274.88 14.263 ms 19.27
|
||||
W8A8 128x conv 512ch 64x64 274.88 7.720 ms 35.61 1.85x
|
||||
FP16 64x conv 512ch 64x64 137.44 7.153 ms 19.21
|
||||
W8A8 64x conv 512ch 64x64 137.44 3.824 ms 35.94 1.87x
|
||||
FP16 256x conv 256ch 64x64 137.44 7.318 ms 18.78
|
||||
W8A8 256x conv 256ch 64x64 137.44 4.118 ms 33.37 1.78x
|
||||
FP16 128x conv 256ch 64x64 68.72 3.696 ms 18.59
|
||||
W8A8 128x conv 256ch 64x64 68.72 2.112 ms 32.54 1.75x
|
||||
FP16 128x conv 384ch 64x64 154.62 8.154 ms 18.96
|
||||
W8A8 128x conv 384ch 64x64 154.62 4.389 ms 35.23 1.86x
|
||||
```
|
||||
|
||||
| Precision | M5 Max | M4 (README `H16G`) |
|
||||
|-----------|--------|---------------------|
|
||||
| FP16 peak | **19.27 TFLOPS** | 18.6 TFLOPS |
|
||||
| INT8 W8A8 peak | **35.61 TOPS** | 35.1 TOPS |
|
||||
| INT8/FP16 ratio | 1.85× | 1.88× |
|
||||
|
||||
**Implication**: the `h17` ANE's raw compute is **within 4 % of `h16`**, a gap
small enough to be run-to-run noise. Apple kept the ~19 TFLOPS FP16 / ~35 TOPS
INT8 ceiling across two chip generations, and the rated "38 TOPS" spec for M4
still corresponds to the INT8 path, not FP16.
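
As a quick check of the "within 4 %" claim, the following sketch recomputes the INT8/FP16 ratio for M5 Max and the percentage gap to the README's M4 figures, using only the peak numbers from the comparison table above. The dictionary and variable names are illustrative.

```python
# Peak figures copied from the comparison table above.
m5_max = {"fp16_tflops": 19.27, "int8_tops": 35.61}
m4_readme = {"fp16_tflops": 18.6, "int8_tops": 35.1}

# INT8 W8A8 speedup over FP16 on the same chip.
int8_over_fp16 = m5_max["int8_tops"] / m5_max["fp16_tflops"]

# Percentage by which M5 Max (h17) exceeds the M4 (h16) reference.
delta_pct = {key: 100.0 * (m5_max[key] / m4_readme[key] - 1.0) for key in m5_max}

print(f"INT8/FP16 ratio on M5 Max: {int8_over_fp16:.2f}x")
for key, pct in delta_pct.items():
    print(f"{key}: +{pct:.1f} % vs M4")
```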
|
||||
|
||||
---
|
||||
|
||||
## sram_bench — working-set cliff
|
||||
|
||||
**Question**: Where does the on-chip SRAM spill to DRAM?
|
||||
|
||||
```
|
||||
Config W(MB) Act(MB) Tot(MB) ms/eval TFLOPS
|
||||
---------------------------------------------------------------------
|
||||
256ch x 64sp 0.1 0.03 0.2 0.212 ms 0.04
|
||||
512ch x 64sp 0.5 0.06 0.6 0.085 ms 0.40
|
||||
1024ch x 64sp 2.0 0.12 2.2 0.335 ms 0.40
|
||||
2048ch x 64sp 8.0 0.25 8.5 0.141 ms 3.80
|
||||
3072ch x 64sp 18.0 0.38 18.8 0.204 ms 5.92
|
||||
4096ch x 64sp 32.0 0.50 33.0 0.300 ms 7.17
|
||||
5120ch x 64sp 50.0 0.62 51.2 0.432 ms 7.76
|
||||
6144ch x 64sp 72.0 0.75 73.5 0.565 ms 8.56
|
||||
8192ch x 32sp 128.0 0.50 129.0 0.965 ms 4.45
|
||||
```
|
||||
|
||||
**The M4 blog shows the cliff at ~32 MB**. On M5 Max, throughput is still
climbing past **73 MB** and only breaks at **129 MB**. Caveat: the last row
also halves `sp` from 64 to 32, a pipeline-starvation confound we can't rule
out without an independent probe. What is unambiguous: the effective SRAM
working set is **at least as large as M4's**, plausibly larger.
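
The working-set columns follow from the tensor shapes. The sketch below is a reconstruction inferred from the table, not from the `sram_bench` source: it assumes FP16 (2 bytes per element), a square `channels × channels` weight matrix, one input and one output activation buffer of `channels × spatial` elements, and "MB" meaning 2^20 bytes. The function name `working_set_mb` is hypothetical.

```python
MIB = 1024 * 1024
FP16_BYTES = 2


def working_set_mb(channels, spatial):
    """Return (weights, one activation buffer, total) in MB for one conv."""
    weights = channels * channels * FP16_BYTES / MIB
    activation = channels * spatial * FP16_BYTES / MIB
    # Assumed: the total counts the weights plus input and output activations.
    total = weights + 2 * activation
    return weights, activation, total


for channels, spatial in [(4096, 64), (6144, 64), (8192, 32)]:
    w, act, tot = working_set_mb(channels, spatial)
    print(f"{channels}ch x {spatial}sp  W={w:.1f}  Act={act:.2f}  Tot={tot:.1f}")
```

Under these assumptions the last row's total of 129 MB comes almost entirely from weights, so halving `sp` changes the per-evaluation work much more than it changes the footprint.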
|
||||
|
||||
---
|
||||
|
||||
## inmem_bench — single 1×1 conv latency scan
|
||||
|
||||
```
|
||||
Config W(MB) ms/eval TFLOPS
|
||||
--------------------------------------------
|
||||
256ch x64sp 0.1 0.088 ms 0.09
|
||||
512ch x64sp 0.5 0.089 ms 0.38
|
||||
1024ch x64sp 2.0 0.313 ms 0.43
|
||||
2048ch x64sp 8.0 0.131 ms 4.10
|
||||
3072ch x64sp 18.0 0.189 ms 6.38
|
||||
4096ch x64sp 32.0 0.302 ms 7.11
|
||||
```
|
||||
|
||||
Dispatch floor ≈ 0.09 ms, matching the M4 blog's ~0.095 ms XPC/IOKit overhead.
|
||||
|
||||
---
|
||||
|
||||
## Training — dynamic pipeline (`training_dynamic/train.m`)
|
||||
|
||||
**Synthetic token data** (5 M random uint16 in [0, 5000) to mimic a compressed
|
||||
TinyStories vocab), random init, `--accum 10`.
|
||||
|
||||
| Model | Params | Layers | One-time kernel compile | **M5 Max ms/step** |
|
||||
|-------|--------|--------|-----------------------|--------------------|
|
||||
| Stories110M (MHA 12/12) | 109 M | 12 | 421 ms | **73.5 ms** |
|
||||
| Qwen3-0.6B (GQA 16/8) | 596 M | 28 | 398 ms | **320.0 ms** |
|
||||
|
||||
Qwen3-0.6B per-step timing breakdown (stable from step 10+):
|
||||
|
||||
```
|
||||
ane_fwd=54.6 io_fwd=15.2 rms=4.5 ane_bwd=70.5 io_bwd=43.3
|
||||
silu=27.0 rms_bwd=12.4 cls=8.7 cblas_wait=0.0 dw_copy=9.9
|
||||
```
|
||||
|
||||
ANE time (`ane_fwd` + `ane_bwd`) = 125 ms (39 %) · remaining CPU-side time =
195 ms (61 %) of the 320 ms step. The bottleneck is unchanged from the README's
diagnosis: the ANE is idle for most of the step, waiting for
RMSNorm / SiLU / classifier / dW / Adam on the CPU.
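
The 39 % / 61 % split can be recomputed from the breakdown line. This sketch treats only `ane_fwd` and `ane_bwd` as ANE time and everything else in the 320.0 ms step as CPU-side time; that classification is an assumption made here for illustration. It also shows how much of the step is not itemized in the log line at all.

```python
# Per-step timings (ms) copied from the Qwen3-0.6B breakdown above.
breakdown_ms = {
    "ane_fwd": 54.6, "io_fwd": 15.2, "rms": 4.5, "ane_bwd": 70.5,
    "io_bwd": 43.3, "silu": 27.0, "rms_bwd": 12.4, "cls": 8.7,
    "cblas_wait": 0.0, "dw_copy": 9.9,
}
step_ms = 320.0  # measured ms/step for Qwen3-0.6B

# Assumption: only the ane_* entries run on the ANE.
ane_ms = breakdown_ms["ane_fwd"] + breakdown_ms["ane_bwd"]
cpu_side_ms = step_ms - ane_ms
# Time in the step that no entry of the log line accounts for.
unlisted_ms = step_ms - sum(breakdown_ms.values())

print(f"ANE:      {ane_ms:.1f} ms ({100 * ane_ms / step_ms:.0f} %)")
print(f"CPU side: {cpu_side_ms:.1f} ms ({100 * cpu_side_ms / step_ms:.0f} %)")
print(f"Not itemized in the log line: {unlisted_ms:.1f} ms")
```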
|
||||
|
||||
---
|
||||
|
||||
## Training — static pipeline (`train_large.m`)
|
||||
|
||||
Run for an apples-to-apples comparison with `community_results.json`, whose
existing entries all use this path.
|
||||
|
||||
```
|
||||
[batch 10: compile=3384ms train=902.5ms (90.2ms/step) compiles=72]
|
||||
ane=8.0 io=2.8 cls=7.6 elem=11.7 rms=0.1 cblas_wait=0.0 ms/step
|
||||
[batch 20: compile=3335ms train=897.1ms (89.7ms/step) compiles=72]
|
||||
[batch 30: compile=3353ms train=900.6ms (90.1ms/step) compiles=72]
|
||||
Total steps: 30
|
||||
Wall time: 13.1 s
|
||||
Compile time: 10072 ms (76.9 %)
|
||||
Train time: 2700 ms (20.6 %)
|
||||
Avg train: 90.0 ms/step
|
||||
```
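
The summary lines of this log can be cross-checked against the three batch lines. The sketch below sums the per-batch compile and train times and divides by the reported 13.1 s wall time; the variable names are illustrative and the 10-steps-per-batch figure is read off the `batch 10 / 20 / 30` labels.

```python
# (compile_ms, train_ms) for batches 10, 20 and 30, copied from the log above.
batches = [(3384, 902.5), (3335, 897.1), (3353, 900.6)]
steps_per_batch = 10
wall_ms = 13.1 * 1000  # reported wall time

compile_ms = sum(c for c, _ in batches)
train_ms = sum(t for _, t in batches)
avg_step_ms = train_ms / (steps_per_batch * len(batches))
compile_pct = 100 * compile_ms / wall_ms
train_pct = 100 * train_ms / wall_ms

print(f"compile: {compile_ms} ms ({compile_pct:.1f} % of wall time)")
print(f"train:   {train_ms:.0f} ms ({train_pct:.1f} % of wall time)")
print(f"average: {avg_step_ms:.1f} ms/step")
```

In this static pipeline, recompilation rather than training dominates wall time, which is the cost the dynamic pipeline's one-time compile avoids.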
|
||||
|
||||
| Chip        | ms/step | ane ms | compile per 10-step batch |
|
||||
|-------------|---------|--------|--------------|
|
||||
| M1 Pro | 148–163 | 32–35 | 7.9–8.5 s |
|
||||
| M1 Max | 143–167 | 35–45 | ~7.1 s |
|
||||
| M3 Ultra\* | 91 | ~10 | ~3.7 s |
|
||||
| M4 Pro | 69–73 | 8.9 | ~3.5 s |
|
||||
| M4 Max | 64 | 10.2 | ~3.5 s |
|
||||
| M5 (16 GB) | 101–120 | 9.1–9.8| 3.2–3.4 s |
|
||||
| **M5 Max** | **90.0**| **8.0**| **~3.35 s** |
|
||||
|
||||
\* repo reference platform.
|
||||
|
||||
---
|
||||
|
||||
## Speedup summary — M5 Max vs baselines
|
||||
|
||||
| Metric | M4 (README) | M5 base | M5 Max | vs M4 | vs M5 base |
|
||||
|--------|-------------|---------|--------|-------|------------|
|
||||
| FP16 peak, 64×64 bench (TFLOPS) | 18.6 | — | **19.27** | 1.04× | — |
| FP16 `inmem_peak` (TFLOPS) | — | 12.17–12.44 | **13.80** | — | 1.11–1.13× |
|
||||
| INT8 W8A8 (TOPS) | 35.1 | — | **35.61** | 1.01× | — |
|
||||
| Stories110M static (ms/step) | 91 | 101–120 | **90.0** | 1.01× | 1.22× |
|
||||
| Stories110M dynamic (ms/step)| — | — | **73.5** | — | — |
|
||||
| Qwen3-0.6B dynamic (ms/step) | 412| — | **320.0** | 1.29× | — |
|
||||
|
||||
**Takeaways**:
|
||||
|
||||
1. **Peak ANE compute has not moved between M4 and M5 Max** (≈ 19 TFLOPS FP16,
|
||||
≈ 35 TOPS INT8). The `h16 → h17` version bump does not show up in peak math.
|
||||
2. **Training gains of 1.22–1.29× are CPU-driven**, not ANE-driven. M5 Max's
   18-core CPU (6P + 12E) plus Accelerate's `cblas_sgemm` close the gap
   that made base M5 (4P + 6E) slower than M4 Pro despite being the newer chip.
|
||||
3. **M5 Max's effective SRAM working set is ≥ M4's.** The `sram_bench` cliff
|
||||
sits past 70 MB where M4's was at ~32 MB, though a cleaner probe is needed
|
||||
(the 128 MB row changes two variables at once).
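
The training ratios in the summary table use "lower is better" arithmetic: baseline ms/step divided by M5 Max ms/step. The sketch below recomputes them from the table's own numbers; the helper name `speedup_ms` is hypothetical. For the base M5, whose static result is a range, it reports both ends of the range.

```python
def speedup_ms(baseline_ms, new_ms):
    """Speedup for a latency metric: values above 1 mean the new chip is faster."""
    return baseline_ms / new_ms


# ms/step figures copied from the summary table above.
static_vs_m4 = speedup_ms(91, 90.0)
qwen_vs_m4 = speedup_ms(412, 320.0)
static_vs_m5_low = speedup_ms(101, 90.0)   # best base-M5 run
static_vs_m5_high = speedup_ms(120, 90.0)  # worst base-M5 run

print(f"Stories110M static vs M4:      {static_vs_m4:.2f}x")
print(f"Qwen3-0.6B dynamic vs M4:      {qwen_vs_m4:.2f}x")
print(f"Stories110M static vs M5 base: "
      f"{static_vs_m5_low:.2f}x to {static_vs_m5_high:.2f}x")
```

The table's single 1.22× figure for base M5 lies inside that range, so it should be read as a representative value, not a precise one.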
|
||||
|
||||
---
|
||||
|
||||
## Strategic implications
|
||||
|
||||
- Anyone optimizing training on this repo for M5 Max should focus on pushing
  RMSNorm / SiLU / classifier onto the ANE, not on peak-throughput MIL tricks:
  in the Qwen3-0.6B dynamic run the ANE is already idle for about 60 % of each step.
|
||||
- `h17` is worth re-probing with the tests under `training/test_*.m` — the
|
||||
`m5result.md` findings (weight-reload fails, weightsBuffer is inert,
|
||||
procedureIndex is accepted but ignored, QoS has no effect) were recorded on
|
||||
`h16` and may or may not hold on `h17`.
|
||||
- No evidence that Apple has tightened the private-API surface on macOS 26.4.1.
|
||||