diff --git a/README.md b/README.md
index cf5ad07..ed2362d 100644
--- a/README.md
+++ b/README.md
@@ -29,7 +29,7 @@ The goal was to demonstrate that **training on the Apple Neural Engine — and p
 
 Some coverage of this project has overstated its implications. To be clear:
 
-- Training works, but utilization is low (~2-3% of peak) with significant engineering challenges remaining
+- Training works, but utilization is low (~5-9% of peak) with significant engineering challenges remaining
 - Many element-wise operations still fall back to CPU
 - This does **not** replace GPU training for anything beyond small research models today
 
@@ -57,11 +57,12 @@ This is MIT licensed for a reason. Everyone now has access to AI-assisted develo
 
 A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.
 
-**Current results (M4, single transformer layer, dim=768, seq=512):**
-- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained)
-- 6 ANE kernel dispatches per training step
+**Current results — Stories110M (12-layer, dim=768, seq=256, 109M params):**
+- Static pipeline: **91 ms/step** (M3 Ultra), **106 ms/step** (M4)
+- Dynamic pipeline: **110 ms/step**, no recompilation
+- 72 ANE kernels per step (static), 9 shared kernels (dynamic)
 - All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)
-- Adam optimizer, gradient accumulation, checkpoint/resume
+- Adam optimizer, gradient accumulation, checkpoint/resume via exec() restart
 
 ## Architecture
 
@@ -146,8 +147,8 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
 
 - **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE)
 - **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint
-- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling
-- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP
+- **Compile overhead** — Static pipeline recompiles 60+ kernels every 10 steps (~3.7s); dynamic pipeline avoids this
+- **Low utilization** — Training sustains ~1-2 TFLOPS out of 15.8+ peak due to CPU fallbacks and I/O overhead
 
 ## Performance History
 
diff --git a/benchmarks/ANE_BENCHMARK_REPORT.md b/benchmarks/ANE_BENCHMARK_REPORT.md
index 6f41dc8..b7095a0 100644
--- a/benchmarks/ANE_BENCHMARK_REPORT.md
+++ b/benchmarks/ANE_BENCHMARK_REPORT.md
@@ -1,7 +1,29 @@
 # Apple Neural Engine — Cross-Generation Benchmark Report
 
 Community-submitted benchmark data from [Issue #3](https://github.com/maderix/ANE/issues/3).
-All results use Stories110M (12-layer transformer, 109M params, dim=768, seq=256).
+
+## Model Configuration
+
+All training benchmarks use **Stories110M** — a Llama2-architecture transformer:
+
+```
+Parameter       Value
+────────────────────────
+Architecture    Llama2 (RoPE, SwiGLU, RMSNorm, GQA-ready)
+Layers          12
+Dimension       768
+Hidden (FFN)    2048
+Heads           12
+Vocab           32000 (Llama 2 BPE)
+Sequence        256
+Total Params    109.53M (84.95M transformer + 24.58M embedding)
+Training Data   TinyStories (~20M tokens, pretokenized)
+Optimizer       Adam (lr=1e-4 to 3e-4, b1=0.9, b2=0.999)
+Precision       FP16 on ANE, FP32 on CPU
+```
+
+Kernels per step (static pipeline): 72 (60 weight-bearing + 12 static sdpaBwd2).
+Forward: sdpaFwd + ffnW13 + ffnW2 per layer. Backward: ffnBwdW2t + ffnBwdW13t + wotBwd + sdpaBwd1 + sdpaBwd2 + qkvBwd per layer. Weight gradients (dW) via `cblas_sgemm` on CPU.
 
 ## Training Performance (Static Pipeline)