Revise README for project fork and updates

Forked the project and updated the README to reflect changes.
2026-03-02 13:31:31 -08:00 · 40a5384074
parent 6b8d69b93d
commit 40a5384074
1 changed file with 1 addition and 97 deletions
--- a/README.md
+++ b/README.md
@@ -14,101 +14,7 @@ Training neural networks directly on Apple's Neural Engine (ANE) via reverse-eng

 A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.

-**Current results (M4, single transformer layer, dim=768, seq=512):**
- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained)
- 6 ANE kernel dispatches per training step
- All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)
- Adam optimizer, gradient accumulation, checkpoint/resume
-
-## Architecture
-
-The training loop uses 6 ANE kernels per step:
-
-| Kernel | Function | Weights |
-|--------|----------|---------|
-| `kFwdAttn` | RMSNorm + QKV projection + SDPA + output projection | Wq, Wk, Wv, Wo, rms1, mask |
-| `kFwdFFN` | RMSNorm + SwiGLU FFN (W1, W3, SiLU, W2) | W1, W2, W3, rms2 |
-| `kFFNBwd` | FFN backward (W2^T + SiLU_bwd + W1^T + W3^T) | W2^T, W1^T, W3^T |
-| `kSdpaBwd1` | Wo^T + SDPA backward part 1 (dV, probs, dp) | Wo^T, mask |
-| `kSdpaBwd2` | SDPA backward part 2 (softmax grad, dQ, dK) | — |
-| `kQKVb` | QKV backward (Wq^T + Wk^T + Wv^T → dx) | Wq^T, Wk^T, Wv^T |
-
-CPU handles: RMSNorm backward, residual connections, loss computation, dW gradient accumulation (cblas_sgemm), Adam optimizer updates.
-
-Key optimizations:
- **Channel-first CPU layout** — matches ANE IOSurface `[1,C,1,S]` format, eliminates all transpose overhead
- **vDSP vectorized RMSNorm** — 10x faster than naive (6.7ms → 0.7ms)
- **GCD async cblas overlap** — dW gradient sgemms run in parallel with ANE evals on a serial dispatch queue
- **Deferred cblas wait** — wait pushed into next step's forward pass for maximum overlap
- **ANE RMSNorm fusion** — RMSNorm folded into forward kernels as MIL ops (reduce_sum + pow + mul)
- **Wo^T fusion** — output projection backward merged into SDPA backward kernel
- **Forward taps** — Q, K, V, attention scores, hidden states exposed via concat outputs, avoiding CPU recompute
- **exec() restart** — bypasses ~119 ANE compile limit per process
-
-## File Structure
-
-```
-├── api_exploration.m       # Initial ANE API discovery
-├── inmem_basic.m           # In-memory MIL compilation proof-of-concept
-├── inmem_bench.m           # ANE dispatch latency benchmarks
-├── inmem_peak.m            # Peak TFLOPS measurement (2048x2048 matmul)
-├── sram_bench.m            # ANE SRAM bandwidth probing
-├── sram_probe.m            # SRAM size/layout exploration
-└── training/
-    ├── ane_runtime.h       # ANE private API wrapper (compile, eval, IOSurface)
-    ├── ane_mil_gen.h       # MIL program generation helpers
-    ├── model.h             # Model weight initialization and blob builders
-    ├── forward.h           # Forward pass MIL generators
-    ├── backward.h          # Backward pass MIL generators
-    ├── train.m             # Minimal training loop (early prototype)
-    ├── tiny_train.m        # 2-layer tiny model training
-    ├── train_large.m       # Main: single-layer dim=768 training (optimized)
-    ├── test_*.m            # Unit tests for individual kernels
-    └── Makefile
-```
-
-## Building
-
-Requires macOS 15+ on Apple Silicon (tested on M4).
-
-```bash
-# Build the main training program
-xcrun clang -O2 -framework Foundation -framework IOSurface \
-  -framework CoreML -framework Accelerate -ldl -lobjc \
-  -o train_large training/train_large.m
-
-# Run
-./train_large
-```
-
-No external dependencies. Uses only system frameworks + private ANE APIs resolved at runtime via `objc_msgSend`.
-
-## How It Works
-
-1. **MIL generation** — Objective-C code constructs MIL program text at runtime, specifying convolutions (for linear layers), matmul (for attention), softmax, element-wise ops
-2. **In-memory compilation** — `_ANEInMemoryModelDescriptor` compiles MIL text + weight blobs directly to ANE programs, no disk mlmodelc needed
-3. **IOSurface I/O** — Input/output tensors passed via IOSurface shared memory in `[1, channels, 1, spatial]` format (fp16)
-4. **Weight embedding** — Weights baked into ANE programs as BLOBFILE constants; recompiled each batch when weights change
-5. **Gradient flow** — Forward taps expose intermediates needed for backward; backward kernels compute dx (input gradients) on ANE; dW (weight gradients) computed on CPU via cblas
-
-## Limitations
-
- **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE)
- **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint
- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling
- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP
-
-## Performance History
-
-| Optimization | ms/step | ANE util |
-|---|---|---|
-| Baseline (vDSP transpose) | 33.5 | 3.1% |
-| Channel-first layout | 20.3 | 5.2% |
-| vDSP vectorized RMSNorm | 14.2 | 7.4% |
-| GCD async cblas overlap | 11.4 | 9.2% |
-| ANE RMSNorm fusion | 11.4 | 9.2% |
-| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
-| Deferred cblas wait | **9.3** | **11.2%** |
+I forked diz shit and need to write out everything different so stay tuned.

 ## Disclaimer

@@ -117,5 +23,3 @@ This project is independent research into Apple Neural Engine architecture. It u
 ## License

 MIT — see [LICENSE](LICENSE)
-
-