|
|
||
|---|---|---|
| .. | ||
| Makefile | ||
| README.md | ||
| ane_classifier.h | ||
| ane_mil_gen.h | ||
| ane_rmsnorm_bwd.h | ||
| ane_runtime.h | ||
| backward.h | ||
| dashboard.gif | ||
| dashboard.py | ||
| download_data.sh | ||
| forward.h | ||
| m5result.md | ||
| model.h | ||
| stories_config.h | ||
| stories_cpu_ops.h | ||
| stories_io.h | ||
| stories_mil.h | ||
| test_ane_advanced.m | ||
| test_ane_causal_attn.m | ||
| test_ane_sdpa5.m | ||
| test_chaining.m | ||
| test_classifier.m | ||
| test_conv_attn3.m | ||
| test_full_fused.m | ||
| test_fused_bwd.m | ||
| test_fused_qkv.m | ||
| test_perf_stats.m | ||
| test_qos_sweep.m | ||
| test_rmsnorm_bwd.m | ||
| test_weight_reload.m | ||
| tiny_train.m | ||
| tiny_train_old.m | ||
| tokenize.py | ||
| train.m | ||
| train_large.m | ||
| train_large_ane.m | ||
README.md
ANE Training — Stories110M on Apple Neural Engine
Training a 109M-parameter Llama2-architecture transformer (Stories110M) directly on Apple's Neural Engine using private ANE APIs.
Architecture
- Model: Stories110M — dim=768, hidden=2048, heads=12, layers=12, vocab=32000, seq=256
- 109.53M params (84.95M transformer + 24.58M embedding)
- 72 ANE kernels per compile (60 weight-bearing, 12 weight-free sdpaBwd2)
- 6 kernel types per layer: fwdAttn, fwdFFN, ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd
Performance
| Component | Time (ms/step) |
|---|---|
| ANE eval | 9.6 |
| IO (fp16 conversion) | 4.1 |
| Classifier (cblas) | 9.1 |
| Cross-entropy + residuals | 14.4 |
| RMSNorm | 0.1 |
| Total | 107 ms/step |
Files
| File | Description |
|---|---|
train_large.m |
Main training loop — 12-layer forward/backward, checkpoint, exec() restart |
stories_config.h |
Model config, structs, alloc helpers |
stories_io.h |
IOSurface I/O, NEON fp16 conversion, kernel compile/eval |
stories_mil.h |
MIL program generators for all 6 ANE kernel types |
stories_cpu_ops.h |
vDSP-vectorized RMSNorm, cross-entropy, Adam, embedding ops |
dashboard.py |
TUI dashboard — loss curve, power/CPU/memory graphs, text generation |
tokenize.py |
Extract pretokenized TinyStories data |
Makefile |
Build targets |
How it works
-
Forward pass: Each layer runs fwdAttn (QKV + SDPA + Wo) and fwdFFN (W1 + SiLU(W3) + W2) on ANE via MIL-compiled kernels. Final RMSNorm + classifier matmul on CPU (cblas).
-
Backward pass: Reverse layer order. ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd on ANE. Weight gradients (dW) via async cblas_sgemm on CPU. RMSNorm backward via vDSP.
-
Compile budget: ANE has a ~119 compile limit per process. With 72 kernels per batch, we run 10 accumulation steps then
exec()restart with checkpoint resume. -
Data: Real TinyStories text (20M tokens), mmap'd uint16 token IDs, random position sampling per step.
Usage
1. Download Training Data
bash download_data.sh
Downloads pretokenized TinyStories (Llama 2 BPE, 32K vocab) from enio/TinyStories on HuggingFace. Produces tinystories_data00.bin (~41 MB, ~20M tokens).
2. Build & Train
# Baseline: classifier + softmax on CPU
make train_large
./train_large --steps 100 # quick test
./train_large # full 10k steps
./train_large --resume # resume from checkpoint
# ANE-offloaded: classifier + softmax on ANE (faster)
make train_large_ane
./train_large_ane --steps 100
CLI flags: --steps N (default 10000), --lr F (default 3e-4), --resume.
3. Monitor with Dashboard
pip install blessed psutil numpy
sudo python3 dashboard.py # live mode (needs powermetrics)
sudo python3 dashboard.py --resume # attach to resumed training
4. Benchmarking
Both programs print an Efficiency Report at completion:
=== Efficiency Report ===
Total steps: 100
Avg train: 107.0 ms/step
ANE TFLOPS: 2.45 sustained
ANE utilization: 15.5% of 15.8 TFLOPS
Per-batch timing breakdown during training:
ane=9.6 io=4.1 cls=9.1 elem=14.4 rms=0.1 cblas_wait=2.3 ms/step
| Metric | What it measures |
|---|---|
ane |
ANE kernel evaluation |
io |
fp16↔fp32 IOSurface transfer |
cls |
Classifier matmul (CPU cblas) |
elem |
Embedding, residual adds, cross-entropy |
rms |
RMSNorm forward/backward |
cblas_wait |
Waiting for async dW gradient sgemms |
Compare baseline vs ANE-offloaded:
make train_large && ./train_large --steps 100
make train_large_ane && ./train_large_ane --steps 100
Key techniques
- NEON vectorized fp16<->fp32: ARM NEON intrinsics for fast IOSurface data transfer
- vDSP cross-entropy:
vDSP_mtrans+vvexpf+vDSP_sve— 8x faster than scalar - Async weight gradients: cblas_sgemm dispatched to background queue, overlapped with ANE
- SDPA causal mask workaround: ANE hardware ignores attn_mask, so we decompose attention into Q@K^T (ANE conv) + mask+softmax (CPU) + scores@V (ANE conv)
