mirror of https://github.com/maderix/ANE.git
Update README to reflect GPT-2 inference, M5 findings, and upstream attribution
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 70bfa4e54e
commit 5271c00281
136 README.md
@@ -1,18 +1,143 @@
# ANE Training — Backpropagation on Apple Neural Engine
# Running Transformers on Apple's Neural Engine

You might be asking, "why the FUCK would you pick GPT2?"

Have you read the art bro? Have you? Nah. I doubt it.

GPT2 had more soul in its theoretical pinky finger than all of us combined.

But I digress...

## What This Is

Running transformer inference and training directly on Apple's Neural Engine via reverse-engineered private APIs. No CoreML, no Metal, no GPU: pure ANE compute through `_ANEInMemoryModel` and MIL programs compiled at runtime.

A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon, with NEON CPU decode.

Forked from [maderix/ANEtransformers](https://github.com/maderix/ANEtransformers) which demonstrated ANE training (Stories110M, 12-layer forward+backward on ANE). This fork extends the project with GPT-2 inference on ANE, systematic M5 hardware investigation, and fused kernel optimization.

I forked diz shit and still need to write up everything that's different, so stay tuned.

## What's Here

### GPT-2 Inference on ANE (`training/gpt2.m`)

Complete GPT-2 (124M) inference engine. Two-phase architecture:

1. **ANE prefill**: The full sequence is processed on ANE using fused attention and FFN kernels. One fused attention kernel (QKV + multi-head SDPA + causal mask + softmax + output projection) and one fused FFN kernel (W1 + GELU + W2) per layer, compiled per sequence-length bucket (32, 64, 128, 256, 512, 1024). Embedding and LayerNorm run on CPU with Accelerate.

2. **CPU decode with KV cache**: Single-token generation runs entirely on CPU using NEON fp16 matmul (4-row unrolled), bypassing ANE dispatch overhead. LM head via GCD-parallel NEON fp16 over 50,257 vocab rows.
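
As a rough illustration of the per-bucket compilation in the prefill phase, the sketch below picks the smallest compiled bucket that fits a prompt. The bucket sizes are the ones listed above; the helper names `pick_bucket` and `padding_needed`, and the choice to raise an error for prompts longer than 1024 tokens, are assumptions made for this example and are not taken from `gpt2.m`.

```python
import bisect

# Sequence-length buckets listed above; kernels are compiled once per bucket.
BUCKETS = [32, 64, 128, 256, 512, 1024]


def pick_bucket(n_tokens):
    """Return the smallest bucket that can hold n_tokens (hypothetical helper)."""
    if n_tokens < 1:
        raise ValueError("prompt must contain at least one token")
    i = bisect.bisect_left(BUCKETS, n_tokens)
    if i == len(BUCKETS):
        raise ValueError(
            f"prompt of {n_tokens} tokens exceeds largest bucket {BUCKETS[-1]}"
        )
    return BUCKETS[i]


def padding_needed(n_tokens):
    """Number of padded positions when the prompt is placed in its bucket."""
    return pick_bucket(n_tokens) - n_tokens


print(pick_bucket(5), pick_bucket(33), pick_bucket(1024))
```

Rounding up to a fixed set of lengths trades some wasted compute on padded positions for a small, bounded number of compiles, which matters given the per-process compile limit noted under Known ANE Constraints.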

Includes a from-scratch BPE tokenizer (`gpt2_tokenizer.h`) and weight converter (`gpt2_convert.py`) that pulls weights from HuggingFace with no PyTorch dependency.

```bash
cd training

# Download and convert weights
pip install safetensors huggingface_hub
python3 gpt2_convert.py

# Build and run
make gpt2
./gpt2 --prompt "The meaning of life is" --tokens 100 --temp 0.8
```

### ANE Training (upstream)

The original [maderix](https://github.com/maderix/ANEtransformers) work: training a 109M-parameter Llama2-architecture transformer (Stories110M) directly on ANE. 12-layer forward+backward pass, 6 ANE kernel types per layer (72 kernels/step), 107 ms/step on M4. Adam optimizer, gradient accumulation, checkpoint/resume via `exec()` restart. See `training/train_large.m` and the [training README](training/README.md).

### M5 Hardware Investigation

Systematic probing of ANE behavior on Apple M5 (H16 family, same as M4). Key findings documented in [`training/m5result.md`](training/m5result.md):

| Question | Result |
|----------|--------|
| Can weights be swapped without recompile? | **No.** Weights baked at compile time. File overwrite + reload ignored. |
| Does `weightsBuffer` IOSurface override compiled weights? | **No.** Same output regardless. |
| Does QoS affect ANE frequency? | **No.** All QoS 0-63 work. Fixed frequency, no latency difference. |
| Can `_ANEChainingRequest` chain kernels without CPU round-trips? | **Validates but rejected by driver** (Error Code=15). Likely requires entitlements only CoreML holds. |
| Can `_ANEPerformanceStats` expose hardware counters? | Class exists with `hwExecutionTime` but requires factory construction via model `perfStatsMask`. |
| Real-time eval path? | `beginRealTimeTask` returns NO (needs entitlement). `evaluateRealTimeWithModel` works but no perf gain. |

### Fused Kernel Benchmarks (`training/bench_fused.m`)

Quantifies the value of operation fusion on ANE:

- **Dispatch overhead**: ~60-80 us per sequential ANE dispatch
- **Fused vs separate**: 1.5-3x speedup from fusing multiple convolutions into single MIL programs
- **Conclusion**: Fused MIL is the only viable path to high ANE utilization. Chaining API is inaccessible without Apple-internal entitlements. Intermediates stay in ANE SRAM when fused.
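
To put the dispatch figure in context, here is a back-of-the-envelope calculation using only numbers quoted in this README: 60-80 us per dispatch, two fused kernels per layer for GPT-2 prefill, and 72 kernels per upstream training step. The 12-layer count for GPT-2 124M is a known property of that model rather than something stated above, and the sketch assumes dispatches run strictly one after another with the same overhead each time, which is a simplification.

```python
# Per-dispatch overhead range quoted above, in microseconds.
DISPATCH_US = (60, 80)


def dispatch_overhead_ms(n_dispatches):
    """Return (low, high) total dispatch overhead in milliseconds."""
    lo, hi = DISPATCH_US
    return (n_dispatches * lo / 1000.0, n_dispatches * hi / 1000.0)


# GPT-2 124M prefill: 12 layers x 2 fused kernels (attention + FFN).
gpt2_prefill = dispatch_overhead_ms(12 * 2)

# Upstream Stories110M training: 72 kernels per step.
training_step = dispatch_overhead_ms(72)

print("GPT-2 prefill overhead (ms):", gpt2_prefill)
print("Training step overhead (ms):", training_step)
```

Under these assumptions the overhead is a few milliseconds per pass, so every kernel that fusion removes saves a fixed slice of that budget independent of how much arithmetic the kernel does.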

Full investigation: [`training/CHAINING_INVESTIGATION.md`](training/CHAINING_INVESTIGATION.md)

## How It Works

1. **MIL generation**: Objective-C constructs MIL program text at runtime: convolutions (linear layers), matmul (attention), softmax, element-wise ops
2. **In-memory compilation**: `_ANEInMemoryModelDescriptor` compiles MIL text + fp16 weight blobs directly to ANE programs, no disk `.mlmodelc` needed
3. **IOSurface I/O**: Tensors via IOSurface shared memory in `[1, C, 1, S]` channel-first fp16 format. Spatial dimension padded to minimum stride of 32. Minimum surface size 49152 bytes.
4. **Weight baking**: Weights compiled as BLOBFILE constants into MIL programs. No runtime weight update possible; must recompile to change weights.
5. **NEON fp16**: ARM NEON intrinsics for fast fp32-fp16 conversion at IOSurface boundaries and for CPU-side matmul during decode
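
The IOSurface layout rules in step 3 can be sketched as a small size calculation. This is only an illustration: how the repository's code actually combines the 32-element stride and the 49152-byte floor is an assumption here, the function names are hypothetical, and the hidden size of 768 used in the example is the standard GPT-2 124M width rather than a value stated in this README.

```python
FP16_BYTES = 2             # each element is a 16-bit float
MIN_STRIDE = 32            # spatial dimension is padded up to at least 32
MIN_SURFACE_BYTES = 49152  # minimum surface size quoted above


def row_stride(seq_len):
    """Elements per channel row for a [1, C, 1, S] tensor (assumed rule)."""
    return max(seq_len, MIN_STRIDE)


def surface_bytes(channels, seq_len):
    """Bytes to allocate for a [1, C, 1, S] fp16 surface (assumed rule)."""
    raw = channels * row_stride(seq_len) * FP16_BYTES
    return max(raw, MIN_SURFACE_BYTES)


def element_index(c, s, seq_len):
    """Flat fp16 index of channel c, position s in channel-first layout."""
    return c * row_stride(seq_len) + s


# Example: 768 channels at seq=1 (decode) and seq=128 (prefill bucket).
print(surface_bytes(768, 1), surface_bytes(768, 128))
```

Note that under these assumptions a single-token tensor still occupies a full 32-wide row per channel, which is one reason seq=1 decode is a poor fit for this I/O path.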

## File Structure

```
├── api_exploration.m       # Initial ANE private API discovery
├── inmem_basic.m           # In-memory MIL compilation proof-of-concept
├── inmem_bench.m           # ANE dispatch latency benchmarks
├── inmem_peak.m            # Peak TFLOPS measurement (2048x2048 matmul)
├── sram_bench.m            # ANE SRAM bandwidth probing
├── sram_probe.m            # SRAM size/layout exploration
└── training/
    ├── gpt2.m                      # GPT-2 124M inference: ANE prefill + CPU KV-cache decode
    ├── gpt2_convert.py             # HuggingFace → ANE weight converter (no PyTorch)
    ├── gpt2_tokenizer.h            # Self-contained BPE tokenizer (header-only C)
    ├── bench_fused.m               # Fused vs separate kernel benchmarks
    ├── CHAINING_INVESTIGATION.md   # Full chaining API reverse-engineering writeup
    ├── m5result.md                 # M5 ANE probe results
    ├── train_large.m               # 12-layer Stories110M ANE training (upstream)
    ├── stories_config.h            # Training model config and structs
    ├── stories_io.h                # Training IOSurface I/O and kernel compile/eval
    ├── stories_mil.h               # Training MIL generators (6 kernel types)
    ├── stories_cpu_ops.h           # vDSP RMSNorm, cross-entropy, Adam, embeddings
    ├── dashboard.py                # Training TUI: loss curves, power, text generation
    ├── ane_runtime.h               # ANE private API wrapper
    ├── ane_mil_gen.h               # MIL generation helpers
    ├── test_chaining.m             # Chaining API experiments
    ├── test_weight_reload.m        # Weight swap without recompile test
    ├── test_ane_advanced.m         # weightsBuffer, procedureIndex, shared events probe
    ├── test_qos_sweep.m            # QoS 0-63 latency sweep
    ├── test_perf_stats.m           # _ANEPerformanceStats introspection
    ├── test_decode_attn.m          # Multi-input decode attention kernel validation
    ├── test_multi_input.m          # Multi-input IOSurface size constraints
    ├── test_ffn_seq1.m             # FFN at seq=1 (decode mode) validation
    ├── test_lm_head_ane.m          # LM head on ANE feasibility
    ├── test_lm_head_fast.m         # LM head CPU benchmark (6 approaches)
    ├── test_lm_head_neon.m         # LM head NEON fp16 benchmark
    ├── docs/                       # Roadmap: GPT-2 XL, streaming, sampling, interactive
    └── Makefile
```

## Building

Requires macOS 15+ on Apple Silicon (tested on M4, M5). No external dependencies: only system frameworks plus private ANE APIs resolved at runtime via `objc_msgSend`.

```bash
cd training

# GPT-2 inference
make gpt2
./gpt2 --prompt "Once upon a time" --tokens 200

# Stories110M training (upstream)
make train_large
./train_large

# Benchmarks
make bench_fused
./bench_fused
```

## Known ANE Constraints

- **Weights are immutable after compile**: no hot-swap, no `weightsBuffer` override, no file-swap reload
- **SDPA causal masking ignored by hardware**: must decompose into Q@K^T + mask add + softmax + scores@V
- **~119 compile limit per process**: ANE compiler leaks resources; training uses `exec()` restart
- **IOSurface minimum 49152 bytes**: even for tiny tensors (seq=1 decode)
- **Spatial stride padded to 32**: `[1, C, 1, W]` surfaces have stride `max(W, 32)`
- **Chaining API inaccessible**: `_ANEChainingRequest` validates but driver rejects (Error Code=15)
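
The SDPA decomposition named in the list above (Q@K^T, mask add, softmax, scores@V) can be written out in plain Python for a single head. This is a reference sketch of the math only, not the MIL program the project generates; the `1/sqrt(d)` scaling is the standard attention scaling, and the additive mask value of `-1e4` is an assumption chosen to stay within fp16 range.

```python
import math

MASK_VALUE = -1e4  # assumed additive mask for future positions (fits in fp16)


def causal_attention(q, k, v):
    """Single-head causal attention. q, k, v are lists of S rows of floats."""
    seq, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(seq):
        # Step 1: row i of Q @ K^T, with the usual 1/sqrt(d) scaling.
        scores = [
            scale * sum(a * b for a, b in zip(q[i], k[j])) for j in range(seq)
        ]
        # Step 2: explicit mask add, so position i cannot see positions j > i.
        scores = [s + (0.0 if j <= i else MASK_VALUE) for j, s in enumerate(scores)]
        # Step 3: numerically stable softmax over the row.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Step 4: weighted sum of value rows (scores @ V).
        out.append(
            [sum(p * v[j][d] for j, p in enumerate(probs)) for d in range(len(v[0]))]
        )
    return out


q = k = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(causal_attention(q, k, v))
```

With identical queries and keys, each output row is the running average of the value rows seen so far, which makes it easy to check by eye that the mask is being applied.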

## Disclaimer

@@ -21,4 +146,3 @@ This project is independent research into Apple Neural Engine architecture. It u

## License

MIT: see [LICENSE](LICENSE)