From 5271c002816bc4ed9c68dfddc2ffb3045449523a Mon Sep 17 00:00:00 2001 From: m0at Date: Mon, 2 Mar 2026 14:25:26 -0800 Subject: [PATCH] Update README to reflect GPT-2 inference, M5 findings, and upstream attribution Co-Authored-By: Claude Opus 4.6 --- README.md | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 130 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index e404c25..f0d8766 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,143 @@ -# ANE Training — Backpropagation on Apple Neural Engine +# Running Transformers on Apple's Neural Engine You might be asking, "why the FUCK would you pick GPT2?" Have you read the art bro? Have you? Nah. I doubt it. -GPT2 had more soul in it's theoretical pinky finger than all of us combined. +GPT2 had more soul in it's theoretical pinky finger than all of us combined. But I digress.. -## What This Is +Running transformer inference and training directly on Apple's Neural Engine via reverse-engineered private APIs. No CoreML, no Metal, no GPU — pure ANE compute through `_ANEInMemoryModel` and MIL programs compiled at runtime. -A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon with NEON cpu decode. +Forked from [maderix/ANEtransformers](https://github.com/maderix/ANEtransformers) which demonstrated ANE training (Stories110M, 12-layer forward+backward on ANE). This fork extends the project with GPT-2 inference on ANE, systematic M5 hardware investigation, and fused kernel optimization. -I forked diz shit and need to write out everything different so stay tuned. +## What's Here + +### GPT-2 Inference on ANE (`training/gpt2.m`) + +Complete GPT-2 (124M) inference engine. Two-phase architecture: + +1. **ANE prefill** — Full sequence processed on ANE using fused attention and FFN kernels. One fused attention kernel (QKV + multi-head SDPA + causal mask + softmax + output projection) and one fused FFN kernel (W1 + GELU + W2) per layer, compiled per sequence-length bucket (32, 64, 128, 256, 512, 1024). Embedding and LayerNorm on CPU with Accelerate. + +2. **CPU decode with KV cache** — Single-token generation runs entirely on CPU using NEON fp16 matmul (4-row unrolled), bypassing ANE dispatch overhead. LM head via GCD-parallel NEON fp16 over 50,257 vocab rows. + +Includes a from-scratch BPE tokenizer (`gpt2_tokenizer.h`) and weight converter (`gpt2_convert.py`) that pulls weights from HuggingFace with no PyTorch dependency. + +```bash +cd training + +# Download and convert weights +pip install safetensors huggingface_hub +python3 gpt2_convert.py + +# Build and run +make gpt2 +./gpt2 --prompt "The meaning of life is" --tokens 100 --temp 0.8 +``` + +### ANE Training (upstream) + +The original [maderix](https://github.com/maderix/ANEtransformers) work: training a 109M-parameter Llama2-architecture transformer (Stories110M) directly on ANE. 12-layer forward+backward pass, 6 ANE kernel types per layer (72 kernels/step), 107 ms/step on M4. Adam optimizer, gradient accumulation, checkpoint/resume via `exec()` restart. See `training/train_large.m` and the [training README](training/README.md). + +### M5 Hardware Investigation + +Systematic probing of ANE behavior on Apple M5 (H16 family, same as M4). Key findings documented in [`training/m5result.md`](training/m5result.md): + +| Question | Result | +|----------|--------| +| Can weights be swapped without recompile? | **No.** Weights baked at compile time. File overwrite + reload ignored. | +| Does `weightsBuffer` IOSurface override compiled weights? | **No.** Same output regardless. | +| Does QoS affect ANE frequency? | **No.** All QoS 0-63 work. Fixed frequency, no latency difference. | +| Can `_ANEChainingRequest` chain kernels without CPU round-trips? | **Validates but rejected by driver** (Error Code=15). Likely requires entitlements only CoreML holds. | +| Can `_ANEPerformanceStats` expose hardware counters? | Class exists with `hwExecutionTime` but requires factory construction via model `perfStatsMask`. | +| Real-time eval path? | `beginRealTimeTask` returns NO (needs entitlement). `evaluateRealTimeWithModel` works but no perf gain. | + +### Fused Kernel Benchmarks (`training/bench_fused.m`) + +Quantifies the value of operation fusion on ANE: + +- **Dispatch overhead**: ~60-80 us per sequential ANE dispatch +- **Fused vs separate**: 1.5-3x speedup from fusing multiple convolutions into single MIL programs +- **Conclusion**: Fused MIL is the only viable path to high ANE utilization. Chaining API is inaccessible without Apple-internal entitlements. Intermediates stay in ANE SRAM when fused. + +Full investigation: [`training/CHAINING_INVESTIGATION.md`](training/CHAINING_INVESTIGATION.md) + +## How It Works + +1. **MIL generation** — Objective-C constructs MIL program text at runtime: convolutions (linear layers), matmul (attention), softmax, element-wise ops +2. **In-memory compilation** — `_ANEInMemoryModelDescriptor` compiles MIL text + fp16 weight blobs directly to ANE programs, no disk `.mlmodelc` needed +3. **IOSurface I/O** — Tensors via IOSurface shared memory in `[1, C, 1, S]` channel-first fp16 format. Spatial dimension padded to minimum stride of 32. Minimum surface size 49152 bytes. +4. **Weight baking** — Weights compiled as BLOBFILE constants into MIL programs. No runtime weight update possible — must recompile to change weights. +5. **NEON fp16** — ARM NEON intrinsics for fast fp32-fp16 conversion at IOSurface boundaries and for CPU-side matmul during decode + +## File Structure + +``` +├── api_exploration.m # Initial ANE private API discovery +├── inmem_basic.m # In-memory MIL compilation proof-of-concept +├── inmem_bench.m # ANE dispatch latency benchmarks +├── inmem_peak.m # Peak TFLOPS measurement (2048x2048 matmul) +├── sram_bench.m # ANE SRAM bandwidth probing +├── sram_probe.m # SRAM size/layout exploration +└── training/ + ├── gpt2.m # GPT-2 124M inference: ANE prefill + CPU KV-cache decode + ├── gpt2_convert.py # HuggingFace → ANE weight converter (no PyTorch) + ├── gpt2_tokenizer.h # Self-contained BPE tokenizer (header-only C) + ├── bench_fused.m # Fused vs separate kernel benchmarks + ├── CHAINING_INVESTIGATION.md # Full chaining API reverse-engineering writeup + ├── m5result.md # M5 ANE probe results + ├── train_large.m # 12-layer Stories110M ANE training (upstream) + ├── stories_config.h # Training model config and structs + ├── stories_io.h # Training IOSurface I/O and kernel compile/eval + ├── stories_mil.h # Training MIL generators (6 kernel types) + ├── stories_cpu_ops.h # vDSP RMSNorm, cross-entropy, Adam, embeddings + ├── dashboard.py # Training TUI: loss curves, power, text generation + ├── ane_runtime.h # ANE private API wrapper + ├── ane_mil_gen.h # MIL generation helpers + ├── test_chaining.m # Chaining API experiments + ├── test_weight_reload.m # Weight swap without recompile test + ├── test_ane_advanced.m # weightsBuffer, procedureIndex, shared events probe + ├── test_qos_sweep.m # QoS 0-63 latency sweep + ├── test_perf_stats.m # _ANEPerformanceStats introspection + ├── test_decode_attn.m # Multi-input decode attention kernel validation + ├── test_multi_input.m # Multi-input IOSurface size constraints + ├── test_ffn_seq1.m # FFN at seq=1 (decode mode) validation + ├── test_lm_head_ane.m # LM head on ANE feasibility + ├── test_lm_head_fast.m # LM head CPU benchmark (6 approaches) + ├── test_lm_head_neon.m # LM head NEON fp16 benchmark + ├── docs/ # Roadmap: GPT-2 XL, streaming, sampling, interactive + └── Makefile +``` + +## Building + +Requires macOS 15+ on Apple Silicon (tested on M4, M5). No external dependencies — system frameworks + private ANE APIs resolved at runtime via `objc_msgSend`. + +```bash +cd training + +# GPT-2 inference +make gpt2 +./gpt2 --prompt "Once upon a time" --tokens 200 + +# Stories110M training (upstream) +make train_large +./train_large + +# Benchmarks +make bench_fused +./bench_fused +``` + +## Known ANE Constraints + +- **Weights are immutable after compile** — no hot-swap, no `weightsBuffer` override, no file-swap reload +- **SDPA causal masking ignored by hardware** — must decompose into Q@K^T + mask add + softmax + scores@V +- **~119 compile limit per process** — ANE compiler leaks resources; training uses `exec()` restart +- **IOSurface minimum 49152 bytes** — even for tiny tensors (seq=1 decode) +- **Spatial stride padded to 32** — `[1, C, 1, W]` surfaces have stride `max(W, 32)` +- **Chaining API inaccessible** — `_ANEChainingRequest` validates but driver rejects (Error Code=15) ## Disclaimer @@ -21,4 +146,3 @@ This project is independent research into Apple Neural Engine architecture. It u ## License MIT — see [LICENSE](LICENSE) -