Update README to reflect GPT-2 inference, M5 findings, and upstream attribution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
m0at 2026-03-02 14:25:26 -08:00
parent 70bfa4e54e
commit 5271c00281
1 changed files with 130 additions and 6 deletions

README.md

@@ -1,18 +1,143 @@
# ANE Training — Backpropagation on Apple Neural Engine
# Running Transformers on Apple's Neural Engine
You might be asking, "why the FUCK would you pick GPT2?"
Have you read the art bro? Have you? Nah. I doubt it.
GPT2 had more soul in its theoretical pinky finger than all of us combined.
But I digress...
## What This Is
Running transformer inference and training directly on Apple's Neural Engine via reverse-engineered private APIs. No CoreML, no Metal, no GPU — pure ANE compute through `_ANEInMemoryModel` and MIL programs compiled at runtime.
A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon, with NEON CPU decode.
Forked from [maderix/ANEtransformers](https://github.com/maderix/ANEtransformers) which demonstrated ANE training (Stories110M, 12-layer forward+backward on ANE). This fork extends the project with GPT-2 inference on ANE, systematic M5 hardware investigation, and fused kernel optimization.
I forked diz shit and need to write out everything different so stay tuned.
## What's Here
### GPT-2 Inference on ANE (`training/gpt2.m`)
Complete GPT-2 (124M) inference engine. Two-phase architecture:
1. **ANE prefill** — Full sequence processed on ANE using fused attention and FFN kernels. One fused attention kernel (QKV + multi-head SDPA + causal mask + softmax + output projection) and one fused FFN kernel (W1 + GELU + W2) per layer, compiled per sequence-length bucket (32, 64, 128, 256, 512, 1024). Embedding and LayerNorm on CPU with Accelerate.
2. **CPU decode with KV cache** — Single-token generation runs entirely on CPU using NEON fp16 matmul (4-row unrolled), bypassing ANE dispatch overhead. LM head via GCD-parallel NEON fp16 over 50,257 vocab rows.
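As a rough illustration of the per-bucket compilation in step 1, a prompt can be mapped to the smallest bucket that holds it and padded up to that length. This is a minimal Python sketch, not the logic in `gpt2.m`; `pick_bucket`, `pad_tokens`, and the `pad_id` value are hypothetical names chosen for the example.

```python
import bisect

# Sequence-length buckets listed above; fused kernels are compiled per bucket.
BUCKETS = [32, 64, 128, 256, 512, 1024]

def pick_bucket(seq_len):
    """Return the smallest bucket that can hold seq_len tokens (hypothetical helper)."""
    if seq_len < 1 or seq_len > BUCKETS[-1]:
        raise ValueError(f"seq_len {seq_len} outside 1..{BUCKETS[-1]}")
    return BUCKETS[bisect.bisect_left(BUCKETS, seq_len)]

def pad_tokens(tokens, pad_id=0):
    """Right-pad a prompt to its bucket length; pad_id=0 is an assumption."""
    return list(tokens) + [pad_id] * (pick_bucket(len(tokens)) - len(tokens))
```

With causal masking, right-padding should not change the outputs at the real token positions, which is why a handful of fixed buckets can plausibly serve arbitrary prompt lengths.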
Includes a from-scratch BPE tokenizer (`gpt2_tokenizer.h`) and weight converter (`gpt2_convert.py`) that pulls weights from HuggingFace with no PyTorch dependency.
```bash
cd training
# Download and convert weights
pip install safetensors huggingface_hub
python3 gpt2_convert.py
# Build and run
make gpt2
./gpt2 --prompt "The meaning of life is" --tokens 100 --temp 0.8
```
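For readers unfamiliar with the `--temp` flag, here is a sketch of conventional temperature sampling. It assumes `--temp` is a standard softmax temperature, which the README does not spell out; `sample_with_temperature` is a hypothetical name and not a function from the repo.

```python
import math
import random

def sample_with_temperature(logits, temp=0.8, rng=random):
    """Sample an index from softmax(logits / temp); temp <= 0 falls back to argmax."""
    if temp <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Lower temperatures sharpen the distribution toward the highest logit, and higher ones flatten it.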
### ANE Training (upstream)
The original [maderix](https://github.com/maderix/ANEtransformers) work: training a 109M-parameter Llama2-architecture transformer (Stories110M) directly on ANE. 12-layer forward+backward pass, 6 ANE kernel types per layer (72 kernels/step), 107 ms/step on M4. Adam optimizer, gradient accumulation, checkpoint/resume via `exec()` restart. See `training/train_large.m` and the [training README](training/README.md).
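The kernel count and the restart trick can be put together in a small sketch: 12 layers × 6 kernel types gives the 72 kernels per step, and a process that tracks its compile count can re-exec itself before crossing the roughly 119-compile limit listed under Known ANE Constraints. This is an illustration in Python rather than the Objective-C in `train_large.m`. How many compiles a given step actually needs, and whether the limit is inclusive, are assumptions here.

```python
import os
import sys

LAYERS = 12
KERNEL_TYPES_PER_LAYER = 6
KERNELS_PER_STEP = LAYERS * KERNEL_TYPES_PER_LAYER  # 72, as quoted above
COMPILE_LIMIT = 119  # approximate per-process limit; treated as inclusive (assumption)

def must_restart(compiles_so_far, compiles_needed, limit=COMPILE_LIMIT):
    """True if the next round of compiles would cross the per-process limit."""
    return compiles_so_far + compiles_needed > limit

def restart_self(argv=None):
    """Replace this process with a fresh copy of itself.

    Any training state must already be checkpointed to disk, because
    nothing in memory survives the exec.
    """
    argv = list(sys.argv if argv is None else argv)
    os.execv(sys.executable, [sys.executable] + argv)
```

Under these assumptions a process that has already compiled one full set of 72 kernels cannot compile a second set (72 + 72 > 119), so it would checkpoint and restart first.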
### M5 Hardware Investigation
Systematic probing of ANE behavior on Apple M5 (H16 family, same as M4). Key findings documented in [`training/m5result.md`](training/m5result.md):
| Question | Result |
|----------|--------|
| Can weights be swapped without recompile? | **No.** Weights baked at compile time. File overwrite + reload ignored. |
| Does `weightsBuffer` IOSurface override compiled weights? | **No.** Same output regardless. |
| Does QoS affect ANE frequency? | **No.** All QoS 0-63 work. Fixed frequency, no latency difference. |
| Can `_ANEChainingRequest` chain kernels without CPU round-trips? | **Validates but rejected by driver** (Error Code=15). Likely requires entitlements only CoreML holds. |
| Can `_ANEPerformanceStats` expose hardware counters? | Class exists with `hwExecutionTime` but requires factory construction via model `perfStatsMask`. |
| Real-time eval path? | `beginRealTimeTask` returns NO (needs entitlement). `evaluateRealTimeWithModel` works but no perf gain. |
### Fused Kernel Benchmarks (`training/bench_fused.m`)
Quantifies the value of operation fusion on ANE:
- **Dispatch overhead**: ~60-80 µs per sequential ANE dispatch
- **Fused vs separate**: 1.5-3x speedup from fusing multiple convolutions into single MIL programs
- **Conclusion**: Fused MIL is the only viable path to high ANE utilization. Chaining API is inaccessible without Apple-internal entitlements. Intermediates stay in ANE SRAM when fused.
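A back-of-the-envelope model shows why dispatch overhead alone can account for speedups in this range. The sketch below is an idealized model, not the measurement code in `bench_fused.m`: the per-op compute times are made-up inputs, the 70 µs default is simply the midpoint of the quoted overhead range, and any extra gain from keeping intermediates in SRAM is ignored.

```python
def separate_time_us(n_ops, compute_us, dispatch_us):
    """n_ops kernels dispatched one by one: each pays the dispatch overhead."""
    return n_ops * (dispatch_us + compute_us)

def fused_time_us(n_ops, compute_us, dispatch_us):
    """One fused program: a single dispatch, same total compute (idealized)."""
    return dispatch_us + n_ops * compute_us

def fusion_speedup(n_ops, compute_us, dispatch_us=70.0):
    """Ratio of separate to fused time under the model above."""
    return (separate_time_us(n_ops, compute_us, dispatch_us)
            / fused_time_us(n_ops, compute_us, dispatch_us))
```

For example, three ops at a hypothetical 50 µs each give 360 / 220 ≈ 1.6x, and four ops at 10 µs each give 320 / 110 ≈ 2.9x. In this model the speedup approaches the number of fused ops as compute time per op shrinks.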
Full investigation: [`training/CHAINING_INVESTIGATION.md`](training/CHAINING_INVESTIGATION.md)
## How It Works
1. **MIL generation** — Objective-C constructs MIL program text at runtime: convolutions (linear layers), matmul (attention), softmax, element-wise ops
2. **In-memory compilation** — `_ANEInMemoryModelDescriptor` compiles MIL text + fp16 weight blobs directly to ANE programs, no disk `.mlmodelc` needed
3. **IOSurface I/O** — Tensors via IOSurface shared memory in `[1, C, 1, S]` channel-first fp16 format. Spatial dimension padded to minimum stride of 32. Minimum surface size 49152 bytes.
4. **Weight baking** — Weights compiled as BLOBFILE constants into MIL programs. No runtime weight update possible — must recompile to change weights.
5. **NEON fp16** — ARM NEON intrinsics for fast fp32-fp16 conversion at IOSurface boundaries and for CPU-side matmul during decode
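The IOSurface layout in step 3 can be made concrete with a little arithmetic. The sketch below assumes rows are densely packed with `max(S, 32)` fp16 elements per channel and that the 49152-byte minimum acts as a simple floor. Both are readings of the constraints above, not confirmed driver behavior, and the helper names are hypothetical.

```python
import struct

FP16_BYTES = 2
MIN_STRIDE = 32            # spatial stride floor quoted above
MIN_SURFACE_BYTES = 49152  # minimum surface size quoted above

def surface_bytes(channels, seq_len):
    """Bytes for a [1, C, 1, S] fp16 tensor, assuming dense rows of max(S, 32) elements."""
    stride = max(seq_len, MIN_STRIDE)
    return max(channels * stride * FP16_BYTES, MIN_SURFACE_BYTES)

def element_offset(channel, pos, seq_len):
    """Byte offset of element (channel, pos) in the channel-first layout."""
    return (channel * max(seq_len, MIN_STRIDE) + pos) * FP16_BYTES

def pack_fp16(values):
    """Round Python floats to IEEE half precision and pack them little-endian."""
    return struct.pack(f"<{len(values)}e", *values)
```

Under these assumptions a 768-channel tensor (the hidden width of GPT-2 124M) at seq=1 comes to 768 × 32 × 2 = 49152 bytes, which lands exactly on the quoted minimum.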
## File Structure
```
├── api_exploration.m # Initial ANE private API discovery
├── inmem_basic.m # In-memory MIL compilation proof-of-concept
├── inmem_bench.m # ANE dispatch latency benchmarks
├── inmem_peak.m # Peak TFLOPS measurement (2048x2048 matmul)
├── sram_bench.m # ANE SRAM bandwidth probing
├── sram_probe.m # SRAM size/layout exploration
└── training/
├── gpt2.m # GPT-2 124M inference: ANE prefill + CPU KV-cache decode
├── gpt2_convert.py # HuggingFace → ANE weight converter (no PyTorch)
├── gpt2_tokenizer.h # Self-contained BPE tokenizer (header-only C)
├── bench_fused.m # Fused vs separate kernel benchmarks
├── CHAINING_INVESTIGATION.md # Full chaining API reverse-engineering writeup
├── m5result.md # M5 ANE probe results
├── train_large.m # 12-layer Stories110M ANE training (upstream)
├── stories_config.h # Training model config and structs
├── stories_io.h # Training IOSurface I/O and kernel compile/eval
├── stories_mil.h # Training MIL generators (6 kernel types)
├── stories_cpu_ops.h # vDSP RMSNorm, cross-entropy, Adam, embeddings
├── dashboard.py # Training TUI: loss curves, power, text generation
├── ane_runtime.h # ANE private API wrapper
├── ane_mil_gen.h # MIL generation helpers
├── test_chaining.m # Chaining API experiments
├── test_weight_reload.m # Weight swap without recompile test
├── test_ane_advanced.m # weightsBuffer, procedureIndex, shared events probe
├── test_qos_sweep.m # QoS 0-63 latency sweep
├── test_perf_stats.m # _ANEPerformanceStats introspection
├── test_decode_attn.m # Multi-input decode attention kernel validation
├── test_multi_input.m # Multi-input IOSurface size constraints
├── test_ffn_seq1.m # FFN at seq=1 (decode mode) validation
├── test_lm_head_ane.m # LM head on ANE feasibility
├── test_lm_head_fast.m # LM head CPU benchmark (6 approaches)
├── test_lm_head_neon.m # LM head NEON fp16 benchmark
├── docs/ # Roadmap: GPT-2 XL, streaming, sampling, interactive
└── Makefile
```
## Building
Requires macOS 15+ on Apple Silicon (tested on M4, M5). No external dependencies — system frameworks + private ANE APIs resolved at runtime via `objc_msgSend`.
```bash
cd training
# GPT-2 inference
make gpt2
./gpt2 --prompt "Once upon a time" --tokens 200
# Stories110M training (upstream)
make train_large
./train_large
# Benchmarks
make bench_fused
./bench_fused
```
## Known ANE Constraints
- **Weights are immutable after compile** — no hot-swap, no `weightsBuffer` override, no file-swap reload
- **SDPA causal masking ignored by hardware** — must decompose into Q@K^T + mask add + softmax + scores@V
- **~119 compile limit per process** — ANE compiler leaks resources; training uses `exec()` restart
- **IOSurface minimum 49152 bytes** — even for tiny tensors (seq=1 decode)
- **Spatial stride padded to 32** — `[1, C, 1, W]` surfaces have stride `max(W, 32)`
- **Chaining API inaccessible** — `_ANEChainingRequest` validates but driver rejects (Error Code=15)
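The SDPA workaround in the list above (Q@K^T + mask add + softmax + scores@V) can be written out step by step. This is a plain-Python, single-head reference sketch for checking the math, not the MIL the repo generates. The `NEG_INF` mask value is an assumption chosen to stay inside fp16 range.

```python
import math

NEG_INF = -1e4  # additive mask value; an assumption, not the repo's constant

def causal_attention(q, k, v):
    """Single-head attention as explicit steps: Q@K^T, mask add, softmax, scores@V.

    q, k, v are lists of S rows, each a list of floats.
    """
    s_len, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(s_len):
        # 1) Q @ K^T for query row i, scaled by 1/sqrt(d)
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in range(s_len)]
        # 2) additive causal mask: position i must not attend to j > i
        scores = [s + (NEG_INF if j > i else 0.0) for j, s in enumerate(scores)]
        # 3) softmax over keys, with max subtraction for stability
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # 4) probabilities @ V
        out.append([sum(p * v[j][c] for j, p in enumerate(probs)) for c in range(len(v[0]))])
    return out
```

A quick sanity check is that the first output row depends only on the first row of `v`, since position 0 can attend to nothing else.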
## Disclaimer
@@ -21,4 +146,3 @@ This project is independent research into Apple Neural Engine architecture. It u
## License
MIT — see [LICENSE](LICENSE)