mirror of https://github.com/maderix/ANE.git
Add model config to benchmark report, update README with current results
Benchmark report now includes full Stories110M model configuration (arch, layers, dims, kernels). README updated: 12-layer results replace stale single-layer numbers, limitations reflect current state.
parent 1a7d8846b2
commit efcf193075
15
README.md
@@ -29,7 +29,7 @@ The goal was to demonstrate that **training on the Apple Neural Engine — and p

 Some coverage of this project has overstated its implications. To be clear:

-- Training works, but utilization is low (~2-3% of peak) with significant engineering challenges remaining
+- Training works, but utilization is low (~5-9% of peak) with significant engineering challenges remaining
 - Many element-wise operations still fall back to CPU
 - This does **not** replace GPU training for anything beyond small research models today

@@ -57,11 +57,12 @@ This is MIT licensed for a reason. Everyone now has access to AI-assisted develo

 A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.

-**Current results (M4, single transformer layer, dim=768, seq=512):**
-- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained)
-- 6 ANE kernel dispatches per training step
+**Current results — Stories110M (12-layer, dim=768, seq=256, 109M params):**
+- Static pipeline: **91 ms/step** (M3 Ultra), **106 ms/step** (M4)
+- Dynamic pipeline: **110 ms/step**, no recompilation
+- 72 ANE kernels per step (static), 9 shared kernels (dynamic)
 - All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)
-- Adam optimizer, gradient accumulation, checkpoint/resume
+- Adam optimizer, gradient accumulation, checkpoint/resume via exec() restart

 ## Architecture

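The "checkpoint/resume via exec() restart" bullet describes a pattern that can be sketched roughly as below. This is an illustrative Python sketch only, not the project's code: `COMPILE_BUDGET`, the `--resume` flag, and the helper names are all hypothetical, and the budget of 100 is an arbitrary margin under the ~119 compile limit the README lists among its limitations.

```python
import os
import sys

# Hypothetical safety margin below the ~119-compile limit the README reports.
COMPILE_BUDGET = 100


def should_restart(compiles_done, budget=COMPILE_BUDGET):
    """Return True once this process has used up its compile budget."""
    return compiles_done >= budget


def restart_argv(argv, checkpoint_path):
    """Build the argument list for the replacement process.

    Any earlier --resume option (a made-up flag for this sketch) is dropped
    so that repeated restarts do not accumulate arguments.
    """
    cleaned = []
    skip_next = False
    for arg in argv:
        if skip_next:
            skip_next = False
            continue
        if arg == "--resume":
            skip_next = True
            continue
        cleaned.append(arg)
    return cleaned + ["--resume", checkpoint_path]


def maybe_restart(compiles_done, checkpoint_path, save_checkpoint):
    """Save state, then replace the current process image with a fresh one."""
    if not should_restart(compiles_done):
        return
    save_checkpoint(checkpoint_path)  # caller-supplied; must flush to disk
    new_argv = [sys.executable] + restart_argv(sys.argv, checkpoint_path)
    os.execv(sys.executable, new_argv)  # does not return on success
```

Because `os.execv` replaces the process image in place, whatever the old process leaked is released by the operating system, and the new process picks up from the checkpoint named on its command line.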
@@ -146,8 +147,8 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve

 - **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE)
 - **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint
-- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling
-- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP
+- **Compile overhead** — Static pipeline recompiles 60+ kernels every 10 steps (~3.7s); dynamic pipeline avoids this
+- **Low utilization** — Training sustains ~1-2 TFLOPS out of 15.8+ peak due to CPU fallbacks and I/O overhead

 ## Performance History

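The SDPA causal-masking workaround in the list above splits attention into three explicit stages. A minimal pure-Python sketch of that decomposition follows; it only illustrates the math and says nothing about how the ANE kernels are actually written. The `MASK_VALUE` constant, the helper names, and the conventional 1/sqrt(d) scaling are assumptions, not taken from the source.

```python
import math

MASK_VALUE = -1e4  # assumed "large negative" additive constant; still finite in FP16


def matmul(a, b):
    """Plain-list matrix product: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def causal_mask(seq_len):
    """Additive mask: 0 on and below the diagonal, MASK_VALUE above it."""
    return [[0.0 if j <= i else MASK_VALUE for j in range(seq_len)]
            for i in range(seq_len)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def causal_attention(q, k, v):
    """Three explicit stages instead of one fused SDPA op."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    # Stage 1: scaled Q @ K^T
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    # Stage 2: add the mask, then softmax
    mask = causal_mask(len(q))
    masked = [[s + m for s, m in zip(srow, mrow)]
              for srow, mrow in zip(scores, mask)]
    probs = softmax_rows(masked)
    # Stage 3: probabilities @ V
    return matmul(probs, v)
```

Expressing the mask as an add means the masking no longer depends on the `attn_mask` input that the README says the hardware ignores; future positions simply receive a weight that underflows to zero in the softmax.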
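To see why the compile-overhead limitation matters even though the static pipeline has the faster raw step, the recompile cost can be spread across the steps between recompiles. The sketch below rests on two assumptions that the README does not confirm: that the reported ms/step figures exclude compile time, and that ~3.7 s is the cost of one full recompile round. The function name is hypothetical.

```python
def amortized_ms_per_step(step_ms, recompile_s, steps_between_recompiles):
    """Per-step time with the recompile cost spread over the interval."""
    return step_ms + (recompile_s * 1000.0) / steps_between_recompiles


# README figures for M4; see the assumptions stated above.
static_m4 = amortized_ms_per_step(106.0, 3.7, 10)   # recompiles every 10 steps
dynamic = amortized_ms_per_step(110.0, 0.0, 1)      # no recompilation

print(f"static (M4): {static_m4:.0f} ms/step, dynamic: {dynamic:.0f} ms/step")
```

Under those assumptions the static pipeline lands at roughly 476 ms per step once recompilation is included, against 110 ms for the dynamic pipeline.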
@@ -1,7 +1,29 @@
 # Apple Neural Engine — Cross-Generation Benchmark Report

 Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3).
-All results use Stories110M (12-layer transformer, 109M params, dim=768, seq=256).

+## Model Configuration
+
+All training benchmarks use **Stories110M** — a Llama2-architecture transformer:
+
+```
+Parameter       Value
+────────────────────────
+Architecture    Llama2 (RoPE, SwiGLU, RMSNorm, GQA-ready)
+Layers          12
+Dimension       768
+Hidden (FFN)    2048
+Heads           12
+Vocab           32000 (Llama 2 BPE)
+Sequence        256
+Total Params    109.53M (84.95M transformer + 24.58M embedding)
+Training Data   TinyStories (~20M tokens, pretokenized)
+Optimizer       Adam (lr=1e-4 to 3e-4, b1=0.9, b2=0.999)
+Precision       FP16 on ANE, FP32 on CPU
+```
+
+Kernels per step (static pipeline): 72 (60 weight-bearing + 12 static sdpaBwd2).
+Forward: sdpaFwd + ffnW13 + ffnW2 per layer. Backward: ffnBwdW2t + ffnBwdW13t + wotBwd + sdpaBwd1 + sdpaBwd2 + qkvBwd per layer. Weight gradients (dW) via `cblas_sgemm` on CPU.
+
 ## Training Performance (Static Pipeline)

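The parameter totals in the configuration table can be reproduced from the listed dimensions. The breakdown below is a sanity check under assumptions the report does not spell out: no bias terms, 12 key/value heads (no GQA reduction), two RMSNorm vectors per layer plus one final RMSNorm, and a single embedding matrix shared with the output classifier.

```python
DIM = 768
HIDDEN = 2048
LAYERS = 12
VOCAB = 32000

attention = 4 * DIM * DIM     # Wq, Wk, Wv, Wo (assumes full multi-head, no GQA sharing)
ffn = 3 * DIM * HIDDEN        # W1, W2, W3 of the SwiGLU block
norms = 2 * DIM               # two RMSNorm weight vectors per layer
per_layer = attention + ffn + norms

transformer = LAYERS * per_layer + DIM   # plus the final RMSNorm (assumed)
embedding = VOCAB * DIM                  # assumed shared with the output classifier
total = transformer + embedding

print(f"transformer: {transformer / 1e6:.2f}M")
print(f"embedding:   {embedding / 1e6:.2f}M")
print(f"total:       {total / 1e6:.2f}M")
```

With these assumptions the script prints 84.95M, 24.58M, and 109.53M, matching the "Total Params" row.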