ANE Training -- Benchmarks and Tests Guide
All benchmarks and tests require macOS 15+ on Apple Silicon (tested on M4, M5).
Quick Start
# Build and run training benchmark (100 steps)
cd training
make train_large && ./train_large --steps 100
# Run the automated benchmark suite
cd ..
bash scripts/run_benchmarks.sh
Training Benchmarks
train_large (CPU classifier)
The main 12-layer Stories110M training loop with classifier on CPU.
| Item | Details |
|---|---|
| Purpose | Full transformer training benchmark |
| Measures | ms/step, ANE TFLOPS, ANE utilization %, per-component timing |
| Prerequisites | Training data from bash download_data.sh (falls back to random data if the download is absent) |
| Build | cd training && make train_large |
| Run | ./train_large --steps 100 |
| CLI flags | --steps N (default 10000), --lr F (default 3e-4), --resume |
Expected output:
ane=9.6 io=4.1 cls=9.1 elem=14.4 rms=0.1 cblas_wait=2.3 ms/step
=== Efficiency Report ===
Total steps: 100
Avg train: 107.0 ms/step
ANE TFLOPS: 2.45 sustained
ANE utilization: 15.5% of 15.8 TFLOPS
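The utilization line in the report is the sustained throughput divided by the peak it is quoted against. A minimal Python sketch of that arithmetic, using the numbers from the sample output above (the function name `ane_utilization` is illustrative and not part of the repo):

```python
def ane_utilization(sustained_tflops: float, peak_tflops: float) -> float:
    """Return utilization as a percentage of the quoted peak throughput."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return 100.0 * sustained_tflops / peak_tflops


# Values taken from the sample Efficiency Report above.
util = ane_utilization(2.45, 15.8)
print(f"ANE utilization: {util:.1f}% of 15.8 TFLOPS")
```

For your own runs, substitute the sustained TFLOPS your report prints. The 15.8 TFLOPS peak is the figure the sample report uses.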
train_large_ane (ANE classifier)
Same training with classifier, softmax, and RMSNorm backward offloaded to ANE.
| Item | Details |
|---|---|
| Purpose | Measure ANE-offloaded training (16% faster than the train_large baseline) |
| Build | cd training && make train_large_ane |
| Run | ./train_large_ane --steps 100 |
Compare baseline vs ANE-offloaded:
make train_large && ./train_large --steps 100
make train_large_ane && ./train_large_ane --steps 100
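When comparing the two runs, note that "N% faster" can mean either the share of step time saved or the gain in steps per second, and the two differ. The sketch below computes both from a pair of "Avg train" values. The function name and the sample timings are placeholders for illustration, not measurements from this repo.

```python
def compare_runs(baseline_ms: float, candidate_ms: float) -> dict:
    """Compare two ms/step figures (lower is better)."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("timings must be positive")
    return {
        # Share of the baseline step time that the candidate saves.
        "time_saved_pct": 100.0 * (baseline_ms - candidate_ms) / baseline_ms,
        # Increase in steps per second relative to the baseline.
        "throughput_gain_pct": 100.0 * (baseline_ms / candidate_ms - 1.0),
    }


# Placeholder timings: replace with the "Avg train" values from your two runs.
result = compare_runs(baseline_ms=100.0, candidate_ms=80.0)
print(f"time saved: {result['time_saved_pct']:.1f}%")
print(f"throughput gain: {result['throughput_gain_pct']:.1f}%")
```

With these placeholder inputs the same pair of runs is 20% faster by the first definition and 25% by the second, so it helps to state which one is being reported.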
Dashboard (live monitoring)
pip install blessed psutil numpy
sudo python3 dashboard.py # live mode (needs powermetrics)
sudo python3 dashboard.py --resume # attach to resumed training
| Flag | Description |
|---|---|
| --resume | Resume from checkpoint |
| --infinite | Train indefinitely |
| --no-powermetrics | Disable power monitoring |
| --no-generate | Disable text generation preview |
| --steps N | Total steps (default 10000) |
Root-Level Benchmark Scripts
All root-level scripts are standalone Objective-C programs. Common build pattern:
xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML \
-framework IOSurface -ldl -o <output> <source>.m
inmem_peak.m -- Peak TFLOPS (self-contained)
No prerequisites. Generates MIL and weight blobs programmatically.
| Item | Details |
|---|---|
| Purpose | Maximum sustained TFLOPS via deep conv chains (32-256 layers deep) |
| Measures | ms per run, TFLOPS, % peak across 10 configurations |
| Prerequisites | None (self-contained MIL generation) |
| Build | xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_peak inmem_peak.m |
| Run | ./inmem_peak |
Expected output:
=== Programmatic MIL to In-Memory ANE Peak ===
Config W(MB) GFLOP ms/run TFLOPS %peak
----------------------------------------------------------------------
32x conv 512ch sp64 16.0 1.07 X.XXX ms Y.YY Z.Z%
64x conv 512ch sp64 32.0 2.15 X.XXX ms Y.YY Z.Z%
...
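The W(MB) and GFLOP columns in the sample output can be reproduced from the configuration name alone. The sketch below assumes each layer is a 1x1 convolution mapping N channels to N channels, that weights are stored in fp16, that a multiply-add counts as 2 FLOPs, and that "sp64" means 64 spatial positions. These are assumptions inferred from the printed numbers, not taken from inmem_peak.m, and the function names are illustrative.

```python
def conv_chain_stats(layers: int, channels: int, spatial: int,
                     bytes_per_weight: int = 2) -> tuple:
    """Weight size (MiB) and work (GFLOP) for a chain of 1x1 convolutions.

    Assumes channels -> channels per layer, fp16 weights by default,
    and 2 FLOPs per multiply-add.
    """
    weights_per_layer = channels * channels
    weight_mib = layers * weights_per_layer * bytes_per_weight / 2**20
    gflop = 2 * weights_per_layer * spatial * layers / 1e9
    return weight_mib, gflop


def tflops(gflop: float, ms_per_run: float) -> float:
    """GFLOP per millisecond is numerically equal to TFLOP per second."""
    if ms_per_run <= 0:
        raise ValueError("ms_per_run must be positive")
    return gflop / ms_per_run


for layers in (32, 64):
    w, g = conv_chain_stats(layers, channels=512, spatial=64)
    print(f"{layers}x conv 512ch sp64  {w:.1f} MB  {g:.2f} GFLOP")
```

Under these assumptions the two rows come out at 16.0 MB / 1.07 GFLOP and 32.0 MB / 2.15 GFLOP, matching the sample output. Dividing the GFLOP value by the measured ms/run gives the TFLOPS column.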
inmem_basic.m -- In-Memory Proof-of-Concept
| Item | Details |
|---|---|
| Purpose | End-to-end test: compile, load, run, benchmark using _ANEInMemoryModel |
| Prerequisites | Pre-built mlpackage at /tmp/ane_sram_256ch_64sp.mlpackage |
| Build | xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_basic inmem_basic.m |
| Run | ./inmem_basic |
inmem_bench.m -- Dispatch Latency
| Item | Details |
|---|---|
| Purpose | ANE dispatch latency across 6 model sizes (256-4096 channels) |
| Measures | ms per run, TFLOPS at each configuration |
| Prerequisites | Pre-built mlpackages for all 6 configs |
| Build | xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_bench inmem_bench.m |
| Run | ./inmem_bench |
sram_bench.m -- SRAM Capacity Probe
| Item | Details |
|---|---|
| Purpose | Find SRAM capacity by detecting performance cliff at increasing weight sizes |
| Measures | ms per run, TFLOPS, weight/activation/total memory at 9 configurations |
| Prerequisites | Pre-built mlpackages for 9 configs (256-8192 channels) |
| Build | xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o sram_bench sram_bench.m |
| Run | ./sram_bench |
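One way to read the sram_bench output is to look for the first configuration whose throughput drops sharply below the best value seen at smaller sizes. The sketch below shows that idea. The `find_cliff` helper, the 30% threshold, and the sample data are all hypothetical and not taken from sram_bench.m.

```python
def find_cliff(points, drop_fraction=0.3):
    """Return the first (total_mb, tflops) point whose throughput falls more
    than `drop_fraction` below the best throughput seen at smaller sizes.
    Returns None if no such point exists."""
    best = 0.0
    for total_mb, tflops in sorted(points):
        if best > 0 and tflops < best * (1.0 - drop_fraction):
            return (total_mb, tflops)
        best = max(best, tflops)
    return None


# Made-up numbers for illustration only; not measurements from any chip.
samples = [(2, 4.0), (8, 5.0), (16, 5.2), (32, 5.1), (64, 2.0), (128, 1.8)]
print(find_cliff(samples))
```

With this made-up data the helper reports the 64 MB point, which would suggest the working set stopped fitting somewhere between 32 MB and 64 MB. The finer-grained sram_probe configurations exist to narrow such a range.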
sram_probe.m -- Fine-Grained SRAM Exploration
| Item | Details |
|---|---|
| Purpose | Finer-grained SRAM probe with 13 data points and GFLOPS/MB efficiency |
| Measures | ms per run, TFLOPS, GFLOPS/MB with spilling indicators |
| Prerequisites | Pre-built mlpackages for 13 configs |
| Build | xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o sram_probe sram_probe.m |
| Run | ./sram_probe |
api_exploration.m -- API Discovery
| Item | Details |
|---|---|
| Purpose | Explore ANE private API surface (class methods, file structures, internal objects) |
| Prerequisites | Pre-built mlpackage at /tmp/ane_sram_1024ch_64sp.mlpackage |
| Build | xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o api_exploration api_exploration.m |
| Run | ./api_exploration |
Test Files
Tests with Makefile targets (cd training/)
| Test | Build | What It Tests |
|---|---|---|
| test_rmsnorm_bwd | make test_rmsnorm_bwd | RMSNorm backward on ANE vs CPU reference. PASS: max diff < 0.05, mean < 0.01. Benchmarks 100 runs. |
| test_classifier | make test_classifier | 4-part: final RMSNorm, classifier forward (32000-ch conv), softmax over VOCAB, classifier backward. |
| test_weight_reload | make test_weight_reload | Tests if weights can be hot-swapped by overwriting blob files + unload/reload. Key finding: NO, weights are baked. |
| test_perf_stats | make test_perf_stats | Probes _ANEPerformanceStats class methods, properties, and instantiation. Tests perfStats in _ANERequest. |
| test_qos_sweep | make test_qos_sweep | QoS parameter sweep (0-63) across compile, load, run. Finding: no measurable latency difference. |
| test_ane_advanced | make test_ane_advanced | Probes SharedEvents, weightsBuffer IOSurface, procedureIndex, ChainingRequest. Enumerates all 67 ANE classes. |
Build all probe tests at once: make probes
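The PASS criterion quoted for test_rmsnorm_bwd (max diff < 0.05, mean < 0.01) can be sketched as below. This assumes "diff" means the element-wise absolute difference between the ANE output and the CPU reference; the helper name and sample values are illustrative, not from the test source.

```python
def compare_outputs(ane, ref, max_tol=0.05, mean_tol=0.01):
    """Return (passed, max_abs_diff, mean_abs_diff) for two equal-length sequences."""
    if len(ane) != len(ref) or not ref:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [abs(a - r) for a, r in zip(ane, ref)]
    max_diff = max(diffs)
    mean_diff = sum(diffs) / len(diffs)
    passed = max_diff < max_tol and mean_diff < mean_tol
    return passed, max_diff, mean_diff


# Illustrative values only: a CPU reference and a slightly perturbed copy.
ref = [0.10, -0.25, 0.50, 0.75]
ane = [0.11, -0.25, 0.49, 0.75]
ok, max_diff, mean_diff = compare_outputs(ane, ref)
print("PASS" if ok else "FAIL", f"max={max_diff:.3f} mean={mean_diff:.3f}")
```

Tolerances of this size are typical when comparing fp16 accelerator output against an fp32 reference, where bit-exact agreement is not expected.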
Tests without Makefile targets (manual build)
| Test | Build Command | What It Tests |
|---|---|---|
| test_ane_causal_attn | xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_ane_causal_attn test_ane_causal_attn.m | Decomposed causal attention: Q @ K^T on ANE, mask+softmax on CPU, scores @ V on ANE |
| test_ane_sdpa5 | xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_ane_sdpa5 test_ane_sdpa5.m | 4 approaches to causal masking with scaled_dot_product_attention |
| test_conv_attn3 | xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_conv_attn3 test_conv_attn3.m | Grouped conv approach to attention (K,V baked as conv weights) |
| test_full_fused | xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o test_full_fused test_full_fused.m | Full fused attention + FFN in single MIL dispatch at DIM=768, HEADS=12, SEQ=64 |
| test_fused_qkv | xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_fused_qkv test_fused_qkv.m | Fused QKV (3 convs + concat in one dispatch) vs separate dispatches |
| test_fused_bwd | xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_fused_bwd test_fused_bwd.m | Fused backward: slice_by_size + 2 convs + add in one kernel |
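As a reference for what the decomposed causal attention test splits across devices, here is a minimal pure-Python sketch of the three stages for a single head: Q @ K^T, mask plus softmax, then scores @ V. The 1/sqrt(d) scaling and its placement are assumptions, and none of the helper names come from the test source.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def causal_softmax(scores):
    """Row-wise softmax where position i only attends to positions j <= i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


def causal_attention(q, k, v):
    """Single-head causal attention over [seq][dim] lists."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # stage 1: Q @ K^T (on ANE in the test)
    scores = [[s / math.sqrt(d) for s in row] for row in scores]  # assumed scaling
    probs = causal_softmax(scores)  # stage 2: mask + softmax (on CPU in the test)
    return matmul(probs, v)  # stage 3: scores @ V (on ANE in the test)


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
print(out[0])  # the first position can only attend to itself
```

Because the first position sees only itself, its output row equals the first row of V, which makes a convenient sanity check when validating a device implementation against a CPU reference.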
Bridge Library
cd bridge
make # Build libane_bridge.dylib
make test # Build and link test_bridge
./test_bridge # Run bridge tests
Known Results
M4 (from README)
Single-layer (dim=768, seq=512):
| Optimization | ms/step | ANE utilization |
|---|---|---|
| Baseline (vDSP transpose) | 33.5 | 3.1% |
| Channel-first layout | 20.3 | 5.2% |
| vDSP vectorized RMSNorm | 14.2 | 7.4% |
| GCD async cblas overlap | 11.4 | 9.2% |
| ANE RMSNorm fusion | 11.4 | 9.2% |
| Wo^T fusion (7 to 6 kernels) | 11.4 | 9.2% |
| Deferred cblas wait | 9.3 | 11.2% |
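The two columns in this table move together: if the work per step is fixed, utilization should scale inversely with step time, so ms/step multiplied by utilization should stay roughly constant. The sketch below checks that and computes the cumulative speedup over the baseline. The rows are copied from the table; the helper name is illustrative.

```python
# (optimization, ms/step, ANE utilization %) copied from the table above.
ROWS = [
    ("Baseline (vDSP transpose)", 33.5, 3.1),
    ("Channel-first layout", 20.3, 5.2),
    ("vDSP vectorized RMSNorm", 14.2, 7.4),
    ("GCD async cblas overlap", 11.4, 9.2),
    ("ANE RMSNorm fusion", 11.4, 9.2),
    ("Wo^T fusion (7 to 6 kernels)", 11.4, 9.2),
    ("Deferred cblas wait", 9.3, 11.2),
]


def summarize(rows):
    """Return (name, speedup vs first row, ms * utilization) for each row."""
    baseline_ms = rows[0][1]
    return [(name, baseline_ms / ms, ms * util) for name, ms, util in rows]


for name, speedup, product in summarize(ROWS):
    print(f"{name:30s} {speedup:4.2f}x  ms*util={product:6.1f}")
```

The product stays between about 103.9 and 105.6 for every row, consistent with a fixed amount of ANE work per step, and the final row is about 3.6x faster than the baseline.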
Full Stories110M (12 layers):
| Component | Time (ms/step) |
|---|---|
| ANE runs | 9.6 |
| IO (fp16 conversion) | 4.1 |
| Classifier (cblas) | 9.1 |
| Cross-entropy + residuals | 14.4 |
| RMSNorm | 0.1 |
| Total | ~107 |
M5 Probe Results (from m5result.md)
Machine: Apple M5, macOS 26.3, ANE Family H16 (same as M4)
- Weight reload: FAIL -- weights baked at compile time, cannot be overwritten
- QoS sweep: All QoS 0-63 work, no measurable latency difference
- Performance stats: _ANEPerformanceStats class exists, alloc/init returns nil (needs factory methods)
- weightsBuffer IOSurface: Does NOT override compiled weights
- ChainingRequest: Exists with loopback and pipeline support -- most promising for utilization improvement
Timing Metrics Key
| Metric | What it measures |
|---|---|
| ane | ANE kernel runs (all 6 kernels per layer x 12 layers) |
| io | fp16-to-fp32 IOSurface data transfer (NEON conversion) |
| cls | Classifier matmul (CPU cblas_sgemm) |
| elem | Embedding lookup, residual adds, cross-entropy |
| rms | RMSNorm forward/backward (CPU vDSP) |
| cblas_wait | Time waiting for async dW gradient sgemms to complete |
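To work with these metrics programmatically, the per-step timing line can be split into name/value pairs. The sketch below parses the sample line shown earlier in this guide and derives two figures: the total time covered by the listed metrics and the average ANE time per kernel. The per-kernel average assumes each of the 6 x 12 kernels runs once per step. The helper name is illustrative and not part of the repo.

```python
import re


def parse_timing_line(line: str) -> dict:
    """Extract name=value pairs (milliseconds) from a per-step timing line."""
    return {k: float(v) for k, v in re.findall(r"(\w+)=([0-9]*\.?[0-9]+)", line)}


line = "ane=9.6 io=4.1 cls=9.1 elem=14.4 rms=0.1 cblas_wait=2.3 ms/step"
timings = parse_timing_line(line)

accounted_ms = sum(timings.values())
kernels = 6 * 12  # 6 kernels per layer x 12 layers, per the ane row above
print(f"accounted: {accounted_ms:.1f} ms")
print(f"ane per kernel: {timings['ane'] / kernels:.3f} ms")
```

For the sample line the listed metrics add up to 39.6 ms, well below the roughly 107 ms average step reported in the same run, so the line does not account for the entire step.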