mirror of https://github.com/maderix/ANE.git
Revise README for project fork and updates
Forked the project and updated the README to reflect changes.
parent 6b8d69b93d
commit 40a5384074
98
README.md
@@ -14,101 +14,7 @@ Training neural networks directly on Apple's Neural Engine (ANE) via reverse-eng

A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs, including backpropagation, directly on ANE hardware.

**Current results (M4, single transformer layer, dim=768, seq=512):**
- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained)
- 6 ANE kernel dispatches per training step
- All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)
- Adam optimizer, gradient accumulation, checkpoint/resume

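As a sanity check on the figures above, utilization is sustained throughput divided by peak throughput. The following is a minimal sketch in C; the helper names `ane_utilization` and `gflop_per_step` are hypothetical and do not come from the repository.

```c
/* Hypothetical helper: fraction of peak throughput that is sustained.
   Both arguments are in TFLOPS. */
double ane_utilization(double sustained_tflops, double peak_tflops) {
    return sustained_tflops / peak_tflops;
}

/* Hypothetical helper: floating-point work per training step (in GFLOP)
   implied by a sustained rate in TFLOPS and a step time in milliseconds.
   TFLOPS * 1e3 gives GFLOP/s; ms * 1e-3 gives seconds. */
double gflop_per_step(double sustained_tflops, double ms_per_step) {
    return (sustained_tflops * 1e3) * (ms_per_step * 1e-3);
}
```

With the M4 numbers quoted above, `ane_utilization(1.78, 15.8)` comes out at roughly 0.113, in line with the reported 11.2% once rounding of the 1.78 TFLOPS figure is taken into account, and `gflop_per_step(1.78, 9.3)` suggests on the order of 16.6 GFLOP of work per step.
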
## Architecture

The training loop uses 6 ANE kernels per step:

| Kernel | Function | Weights |
|--------|----------|---------|
| `kFwdAttn` | RMSNorm + QKV projection + SDPA + output projection | Wq, Wk, Wv, Wo, rms1, mask |
| `kFwdFFN` | RMSNorm + SwiGLU FFN (W1, W3, SiLU, W2) | W1, W2, W3, rms2 |
| `kFFNBwd` | FFN backward (W2^T + SiLU_bwd + W1^T + W3^T) | W2^T, W1^T, W3^T |
| `kSdpaBwd1` | Wo^T + SDPA backward part 1 (dV, probs, dp) | Wo^T, mask |
| `kSdpaBwd2` | SDPA backward part 2 (softmax grad, dQ, dK) | none |
| `kQKVb` | QKV backward (Wq^T + Wk^T + Wv^T → dx) | Wq^T, Wk^T, Wv^T |

CPU handles: RMSNorm backward, residual connections, loss computation, dW gradient accumulation (cblas_sgemm), Adam optimizer updates.

Key optimizations:
- **Channel-first CPU layout**: matches the ANE IOSurface `[1,C,1,S]` format and eliminates all transpose overhead
- **vDSP vectorized RMSNorm**: roughly 10x faster than the naive version (6.7 ms → 0.7 ms)
- **GCD async cblas overlap**: dW gradient sgemms run on a serial dispatch queue, in parallel with ANE evals
- **Deferred cblas wait**: the wait is pushed into the next step's forward pass for maximum overlap
- **ANE RMSNorm fusion**: RMSNorm is folded into the forward kernels as MIL ops (reduce_sum + pow + mul)
- **Wo^T fusion**: the output projection backward is merged into the SDPA backward kernel
- **Forward taps**: Q, K, V, attention scores, and hidden states are exposed via concat outputs, avoiding CPU recompute
- **exec() restart**: works around the limit of ~119 ANE compiles per process

## File Structure

```
├── api_exploration.m       # Initial ANE API discovery
├── inmem_basic.m           # In-memory MIL compilation proof-of-concept
├── inmem_bench.m           # ANE dispatch latency benchmarks
├── inmem_peak.m            # Peak TFLOPS measurement (2048x2048 matmul)
├── sram_bench.m            # ANE SRAM bandwidth probing
├── sram_probe.m            # SRAM size/layout exploration
└── training/
    ├── ane_runtime.h       # ANE private API wrapper (compile, eval, IOSurface)
    ├── ane_mil_gen.h       # MIL program generation helpers
    ├── model.h             # Model weight initialization and blob builders
    ├── forward.h           # Forward pass MIL generators
    ├── backward.h          # Backward pass MIL generators
    ├── train.m             # Minimal training loop (early prototype)
    ├── tiny_train.m        # 2-layer tiny model training
    ├── train_large.m       # Main: single-layer dim=768 training (optimized)
    ├── test_*.m            # Unit tests for individual kernels
    └── Makefile
```

## Building

Requires macOS 15+ on Apple Silicon (tested on M4).

```bash
# Build the main training program
xcrun clang -O2 -framework Foundation -framework IOSurface \
  -framework CoreML -framework Accelerate -ldl -lobjc \
  -o train_large training/train_large.m

# Run
./train_large
```

No external dependencies. Uses only system frameworks + private ANE APIs resolved at runtime via `objc_msgSend`.

## How It Works

1. **MIL generation**: Objective-C code constructs MIL program text at runtime, specifying convolutions (for linear layers), matmul (for attention), softmax, and element-wise ops
2. **In-memory compilation**: `_ANEInMemoryModelDescriptor` compiles MIL text + weight blobs directly to ANE programs, with no on-disk mlmodelc needed
3. **IOSurface I/O**: Input/output tensors are passed via IOSurface shared memory in `[1, channels, 1, spatial]` format (fp16)
4. **Weight embedding**: Weights are baked into ANE programs as BLOBFILE constants and recompiled each batch when weights change
5. **Gradient flow**: Forward taps expose the intermediates needed for backward; backward kernels compute dx (input gradients) on ANE; dW (weight gradients) are computed on CPU via cblas

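The CPU side of step 5 can be illustrated with a plain-C stand-in for the `cblas_sgemm` call. For a linear layer `y = W x` in channel-first layout, the weight gradient is `dW = dy * x^T`, summed over sequence positions. The function name `accumulate_dw` and the exact buffer shapes are assumptions made for this sketch, not the repository's code.

```c
#include <stddef.h>

/* Accumulate dW += dy * x^T for a linear layer y = W x.
   x:  [C_in,  S]      layer input, position is the fast axis
   dy: [C_out, S]      gradient of the loss w.r.t. the layer output
   dW: [C_out, C_in]   gradient accumulator, updated in place
   Triple loop shown for clarity; a BLAS sgemm would replace it in practice. */
void accumulate_dw(const float *dy, const float *x, float *dW,
                   size_t C_out, size_t C_in, size_t S) {
    for (size_t o = 0; o < C_out; o++) {
        for (size_t i = 0; i < C_in; i++) {
            float acc = 0.0f;
            for (size_t s = 0; s < S; s++) {
                acc += dy[o * S + s] * x[i * S + s];
            }
            dW[o * C_in + i] += acc;  /* += supports gradient accumulation */
        }
    }
}
```

Calling the function once per micro-batch without zeroing `dW` in between gives gradient accumulation; the accumulator would be cleared after each optimizer step.
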
## Limitations

- **SDPA causal masking**: ANE hardware ignores `attn_mask` in SDPA ops, so causal attention is decomposed into three separate ANE steps: Q@K^T → mask + softmax (via add + softmax) → scores@V
- **~119 compile limit**: The ANE compiler leaks resources; this is worked around via an `exec()` restart from a checkpoint
- **Single layer**: Currently trains one transformer layer; multi-layer training would need pipeline scheduling
- **Synthetic data**: Currently uses random data for benchmarking; real tokenized data support is WIP

## Performance History

| Optimization | ms/step | ANE util |
|---|---|---|
| Baseline (vDSP transpose) | 33.5 | 3.1% |
| Channel-first layout | 20.3 | 5.2% |
| vDSP vectorized RMSNorm | 14.2 | 7.4% |
| GCD async cblas overlap | 11.4 | 9.2% |
| ANE RMSNorm fusion | 11.4 | 9.2% |
| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
| Deferred cblas wait | **9.3** | **11.2%** |

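The table's step times translate directly into an overall speedup and a step rate. A small sketch, with hypothetical helper names that are not part of the repository:

```c
/* Speedup of a later configuration relative to a baseline,
   computed from their per-step times in milliseconds. */
double speedup(double baseline_ms, double current_ms) {
    return baseline_ms / current_ms;
}

/* Training steps per second for a given per-step time in milliseconds. */
double steps_per_second(double ms_per_step) {
    return 1000.0 / ms_per_step;
}
```

For the first and last rows, `speedup(33.5, 9.3)` is about 3.6, and `steps_per_second(9.3)` is about 107.5.
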
I forked this project and still need to write up everything that is different, so stay tuned.

## Disclaimer

@@ -117,5 +23,3 @@ This project is independent research into Apple Neural Engine architecture. It u

## License

MIT, see [LICENSE](LICENSE)