mirror of https://github.com/maderix/ANE.git
2.9 KiB
2.9 KiB
ANE Probe Results: M4 (macOS 26.3)
Machine: Apple M4 (10 cores), 32GB RAM, macOS 26.3
Date: 2026-03-03
ANE Family: H16 (same as M5 results in training/m5result.md)
Key Discovery: Compile and Eval Run in Parallel
This was not known before. The M5 probes tested compile and eval sequentially.
We tested with GCD dispatch_async and found they fully overlap.
probe_v2.m Results
TEST 1: Pure Eval Throughput
Conv 128x128, spatial=64
1000 evals: 189.1ms total, 0.189ms/eval
11.09 GFLOPS sustained
TEST 2: Ping-pong (Two Pre-compiled Models)
500 ping-pong pairs: 207.4ms (0.415ms/pair, 0.207ms/eval)
Near-zero overhead switching between two loaded models.
TEST 3: Sequential Compile (20 Models)
All 20 models compiled and verified ✓
Compile time: ~23-29ms each (consistent, no degradation)
All 20 models correct with different scale factors
TEST 4: Background Compile Overlap ⭐
Background compile: 26.8ms
Foreground evals during compile: 119 (26.8ms total)
Overlap: YES — compile and eval CAN run in parallel!
Background model verified correct ✓
Summary
| Metric | Value |
|---|---|
| Compile time | ~25ms per kernel set |
| Eval time | 0.189ms per eval |
| Compile:eval ratio | ~130:1 |
| Parallel compile+eval | YES |
| Max simultaneous models | 20+ |
| Ping-pong overhead | +10% vs single model |
Peak ANE Throughput (inmem_peak)
Config W(MB) GFLOP ms/eval TFLOPS
96x conv 512ch sp64 48.0 3.22 0.429 ms 7.50
128x conv 512ch sp64 64.0 4.29 0.589 ms 7.30
256x conv 256ch sp64 32.0 2.15 0.380 ms 5.65
64x conv 512ch sp64 32.0 2.15 0.395 ms 5.43
Peak: 7.50 TFLOPS (47% of 15.8 TFLOPS theoretical).
Implications for Training
Before (train_large.m)
- Synchronous compile: 88.6% of wall time is compilation
- 55ms compile per batch, 0.54ms actual training
- Training throughput limited by compiler, not by ANE
After (train_double_buffer.m)
- Async double-buffered compile: 0% compile stall
- Background compile happens during forward/backward passes
- ~130 eval steps fit in one compile window
- Weight updates are "delayed" by one batch (standard technique in distributed training)
- Training throughput limited only by ANE eval speed
Architecture
Time →
Active kernels: [=== eval batch N ===][=== eval batch N+1 ===][=== eval batch N+2 ===]
Background: [compile N+1 weights ][compile N+2 weights ][compile N+3 weights ]
↑ ↑ ↑
swap ready swap ready swap ready
Two kernel sets (A and B) alternate between active evaluation and background compilation. When the background compile finishes, pointers swap atomically at the batch boundary.