Revise README for project fork and updates

Forked the project and updated the README to reflect changes.
This commit is contained in:
gigantic 2026-03-02 13:31:31 -08:00 committed by GitHub
parent 6b8d69b93d
commit 40a5384074
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
1 changed files with 1 additions and 97 deletions

View File

@ -14,101 +14,7 @@ Training neural networks directly on Apple's Neural Engine (ANE) via reverse-eng
A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.
**Current results (M4, single transformer layer, dim=768, seq=512):**
- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained)
- 6 ANE kernel dispatches per training step
- All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)
- Adam optimizer, gradient accumulation, checkpoint/resume
## Architecture
The training loop uses 6 ANE kernels per step:
| Kernel | Function | Weights |
|--------|----------|---------|
| `kFwdAttn` | RMSNorm + QKV projection + SDPA + output projection | Wq, Wk, Wv, Wo, rms1, mask |
| `kFwdFFN` | RMSNorm + SwiGLU FFN (W1, W3, SiLU, W2) | W1, W2, W3, rms2 |
| `kFFNBwd` | FFN backward (W2^T + SiLU_bwd + W1^T + W3^T) | W2^T, W1^T, W3^T |
| `kSdpaBwd1` | Wo^T + SDPA backward part 1 (dV, probs, dp) | Wo^T, mask |
| `kSdpaBwd2` | SDPA backward part 2 (softmax grad, dQ, dK) | — |
| `kQKVb` | QKV backward (Wq^T + Wk^T + Wv^T → dx) | Wq^T, Wk^T, Wv^T |
CPU handles: RMSNorm backward, residual connections, loss computation, dW gradient accumulation (cblas_sgemm), Adam optimizer updates.
Key optimizations:
- **Channel-first CPU layout** — matches ANE IOSurface `[1,C,1,S]` format, eliminates all transpose overhead
- **vDSP vectorized RMSNorm** — 10x faster than naive (6.7ms → 0.7ms)
- **GCD async cblas overlap** — dW gradient sgemms run in parallel with ANE evals on a serial dispatch queue
- **Deferred cblas wait** — wait pushed into next step's forward pass for maximum overlap
- **ANE RMSNorm fusion** — RMSNorm folded into forward kernels as MIL ops (reduce_sum + pow + mul)
- **Wo^T fusion** — output projection backward merged into SDPA backward kernel
- **Forward taps** — Q, K, V, attention scores, hidden states exposed via concat outputs, avoiding CPU recompute
- **exec() restart** — bypasses ~119 ANE compile limit per process
## File Structure
```
├── api_exploration.m # Initial ANE API discovery
├── inmem_basic.m # In-memory MIL compilation proof-of-concept
├── inmem_bench.m # ANE dispatch latency benchmarks
├── inmem_peak.m # Peak TFLOPS measurement (2048x2048 matmul)
├── sram_bench.m # ANE SRAM bandwidth probing
├── sram_probe.m # SRAM size/layout exploration
└── training/
├── ane_runtime.h # ANE private API wrapper (compile, eval, IOSurface)
├── ane_mil_gen.h # MIL program generation helpers
├── model.h # Model weight initialization and blob builders
├── forward.h # Forward pass MIL generators
├── backward.h # Backward pass MIL generators
├── train.m # Minimal training loop (early prototype)
├── tiny_train.m # 2-layer tiny model training
├── train_large.m # Main: single-layer dim=768 training (optimized)
├── test_*.m # Unit tests for individual kernels
└── Makefile
```
## Building
Requires macOS 15+ on Apple Silicon (tested on M4).
```bash
# Build the main training program
xcrun clang -O2 -framework Foundation -framework IOSurface \
-framework CoreML -framework Accelerate -ldl -lobjc \
-o train_large training/train_large.m
# Run
./train_large
```
No external dependencies. Uses only system frameworks + private ANE APIs resolved at runtime via `objc_msgSend`.
## How It Works
1. **MIL generation** — Objective-C code constructs MIL program text at runtime, specifying convolutions (for linear layers), matmul (for attention), softmax, element-wise ops
2. **In-memory compilation**`_ANEInMemoryModelDescriptor` compiles MIL text + weight blobs directly to ANE programs, no disk mlmodelc needed
3. **IOSurface I/O** — Input/output tensors passed via IOSurface shared memory in `[1, channels, 1, spatial]` format (fp16)
4. **Weight embedding** — Weights baked into ANE programs as BLOBFILE constants; recompiled each batch when weights change
5. **Gradient flow** — Forward taps expose intermediates needed for backward; backward kernels compute dx (input gradients) on ANE; dW (weight gradients) computed on CPU via cblas
## Limitations
- **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE)
- **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint
- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling
- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP
## Performance History
| Optimization | ms/step | ANE util |
|---|---|---|
| Baseline (vDSP transpose) | 33.5 | 3.1% |
| Channel-first layout | 20.3 | 5.2% |
| vDSP vectorized RMSNorm | 14.2 | 7.4% |
| GCD async cblas overlap | 11.4 | 9.2% |
| ANE RMSNorm fusion | 11.4 | 9.2% |
| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
| Deferred cblas wait | **9.3** | **11.2%** |
I forked diz shit and need to write out everything different so stay tuned.
## Disclaimer
@ -117,5 +23,3 @@ This project is independent research into Apple Neural Engine architecture. It u
## License
MIT — see [LICENSE](LICENSE)