Add model config to benchmark report, update README with current results

Benchmark report now includes full Stories110M model configuration (arch, layers, dims, kernels). README updated: 12-layer results replace stale single-layer numbers, limitations reflect current state.
2026-03-04 06:13:21 -08:00 · 2026-03-04 06:13:21 -08:00 · efcf193075
parent 1a7d8846b2
commit efcf193075
2 changed files with 31 additions and 8 deletions
--- a/README.md
+++ b/README.md
@ -29,7 +29,7 @@ The goal was to demonstrate that **training on the Apple Neural Engine — and p
 Some coverage of this project has overstated its implications. To be clear:
- Training works, but utilization is low (~2-3% of peak) with significant engineering challenges remaining
+- Training works, but utilization is low (~5-9% of peak) with significant engineering challenges remaining
 - Many element-wise operations still fall back to CPU
 - This does **not** replace GPU training for anything beyond small research models today
@ -57,11 +57,12 @@ This is MIT licensed for a reason. Everyone now has access to AI-assisted develo
 A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.
-**Current results (M4, single transformer layer, dim=768, seq=512):**
+**Current results — Stories110M (12-layer, dim=768, seq=256, 109M params):**
- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained)
+- Static pipeline: **91 ms/step** (M3 Ultra), **106 ms/step** (M4)
- 6 ANE kernel dispatches per training step
+- Dynamic pipeline: **110 ms/step**, no recompilation
 - 72 ANE kernels per step (static), 9 shared kernels (dynamic)
 - All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)
- Adam optimizer, gradient accumulation, checkpoint/resume
+- Adam optimizer, gradient accumulation, checkpoint/resume via exec() restart
 ## Architecture
@ -146,8 +147,8 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
 - **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE)
 - **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint
- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling
+- **Compile overhead** — Static pipeline recompiles 60+ kernels every 10 steps (~3.7s); dynamic pipeline avoids this
- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP
+- **Low utilization** — Training sustains ~1-2 TFLOPS out of 15.8+ peak due to CPU fallbacks and I/O overhead
 ## Performance History
--- a/benchmarks/ANE_BENCHMARK_REPORT.md
+++ b/benchmarks/ANE_BENCHMARK_REPORT.md
@ -1,7 +1,29 @@
 # Apple Neural Engine — Cross-Generation Benchmark Report
 Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3).
-All results use Stories110M (12-layer transformer, 109M params, dim=768, seq=256).
+
 ## Model Configuration
 All training benchmarks use **Stories110M** — a Llama2-architecture transformer:
 ```
 Parameter       Value
 ────────────────────────
 Architecture    Llama2 (RoPE, SwiGLU, RMSNorm, GQA-ready)
 Layers          12
 Dimension       768
 Hidden (FFN)    2048
 Heads           12
 Vocab           32000 (Llama 2 BPE)
 Sequence        256
 Total Params    109.53M (84.95M transformer + 24.58M embedding)
 Training Data   TinyStories (~20M tokens, pretokenized)
 Optimizer       Adam (lr=1e-4 to 3e-4, b1=0.9, b2=0.999)
 Precision       FP16 on ANE, FP32 on CPU
 ```
 Kernels per step (static pipeline): 72 (60 weight-bearing + 12 static sdpaBwd2).
 Forward: sdpaFwd + ffnW13 + ffnW2 per layer. Backward: ffnBwdW2t + ffnBwdW13t + wotBwd + sdpaBwd1 + sdpaBwd2 + qkvBwd per layer. Weight gradients (dW) via `cblas_sgemm` on CPU.
 ## Training Performance (Static Pipeline)