diff --git a/.github/ISSUE_TEMPLATE/benchmark_submission.md b/.github/ISSUE_TEMPLATE/benchmark_submission.md
new file mode 100644
index 0000000..9649489
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/benchmark_submission.md
@@ -0,0 +1,26 @@
+---
+name: Benchmark Submission
+about: Submit your ANE benchmark results
+title: "[Benchmark] results"
+labels: benchmark
+assignees: ''
+---
+
+## System Info
+
+- **Chip**: (e.g., Apple M4 Max)
+- **Machine**: (e.g., Mac16,5)
+- **macOS Version**:
+- **Memory**: (e.g., 128 GB)
+
+## Benchmark Results
+
+Paste the contents of your JSON results file below:
+
+```json
+
+```
+
+## Notes
+
+Any observations, issues encountered, or interesting findings.
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
new file mode 100644
index 0000000..07b0a49
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,33 @@
+---
+name: Bug Report
+about: Report a build failure, crash, or unexpected behavior
+title: "[Bug] "
+labels: bug
+assignees: ''
+---
+
+## Environment
+
+- **Chip**:
+- **macOS Version**:
+- **Xcode Version**: (run `xcodebuild -version`)
+
+## Description
+
+What happened?
+
+## Steps to Reproduce
+
+1.
+2.
+3.
+
+## Expected Behavior
+
+What did you expect to happen?
+
+## Logs / Output
+
+```
+Paste relevant output here
+```
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
new file mode 100644
index 0000000..881f1d4
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,19 @@
+---
+name: Feature Request
+about: Suggest a new feature or research direction
+title: "[Feature] "
+labels: enhancement
+assignees: ''
+---
+
+## Description
+
+What would you like to see added?
+
+## Motivation
+
+Why would this be useful?
+
+## Possible Approach
+
+If you have ideas on how to implement this, share them here.
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..260aee7
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,64 @@
+# Build artifacts
+*.o
+*.dSYM/
+
+# Root-level compiled binaries
+ane_probe
+api_explore
+inmem_basic
+inmem_bench
+inmem_peak
+sram_bench
+sram_probe
+
+# Training binaries
+tiny_train
+tiny_train_m1
+train_large
+training/train_large
+training/train_large_ane
+training/test_*
+!training/test_*.m
+
+# Test/research binaries
+test_chaining
+
+# Generated mlpackage files
+/tmp/ane_*.mlpackage
+
+# Benchmark results (keep community_benchmarks/ submissions)
+benchmark_results_*.txt
+community_benchmarks/SUMMARY.json
+community_benchmarks/SUMMARY.md
+community_benchmarks/apple_m4_max_20260303_*.json
+
+# Python
+__pycache__/
+*.pyc
+*.egg-info/
+/tmp/ane_venv/
+
+# Training data (downloaded separately)
+assets/
+
+# Web dashboard (lives in separate private repo)
+web/
+
+# Training data binaries (downloaded via make setup)
+training/tinystories_data00.bin
+training/ane_stories110M_ckpt.bin
+*.bin
+!training/download_data.sh
+
+# Internal / private
+.cursor/
+docs/launch/
+comm
+
+# macOS
+.DS_Store
+
+# Editor
+*.swp
+*.swo
+*~
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000..a9b65c7
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,60 @@
+# Contributing to ANE Training
+
+Thanks for your interest in contributing! This community fork welcomes benchmark submissions, bug fixes, and research contributions.
+
+## Benchmark Submissions (Easiest Way to Contribute)
+
+The single most valuable thing you can do is run the benchmark on your hardware and submit results.
+
+### Quick Version
+
+```bash
+bash scripts/run_community_benchmark.sh
+```
+
+The script will guide you through everything, including optional auto-submission to the dashboard.
+
+### What Gets Collected
+
+- Your chip model (e.g., Apple M4 Max)
+- macOS version, memory, core counts
+- SRAM probe results (TFLOPS vs weight size)
+- In-memory peak TFLOPS
+- Training performance (optional, requires training data)
+- Your GitHub username (optional)
+
+No personal data, no IP addresses stored (only hashed for rate limiting).
+
+## Bug Reports
+
+Open an issue with:
+- Your hardware (chip, macOS version, memory)
+- Steps to reproduce
+- Expected vs actual behavior
+- Relevant log output
+
+## Code Contributions
+
+1. Fork the repository
+2. Create a feature branch (`git checkout -b my-feature`)
+3. Make your changes
+4. Test on your hardware
+5. Submit a Pull Request
+
+### Code Style
+
+- Objective-C: follow the existing style in `training/` (no ARC annotations in headers, `_Float16` for fp16)
+- Shell scripts: use `set -euo pipefail`, quote variables
+- Python: minimal dependencies, Python 3.11+ compatible
+
+### Areas Where Help is Needed
+
+- **Benchmarks on hardware we don't have**: M1, M2, M3, M3 Pro/Max/Ultra, M4 Pro, M5
+- **Reducing compilation overhead**: currently 80-85% of wall time
+- **`_ANEChainingRequest` research**: pipelining multiple ANE operations without recompile
+- **`_ANEPerformanceStats` investigation**: getting real hardware timing data
+- **Larger model support**: scaling beyond Stories110M
+
+## Questions?
+
+Open a GitHub issue or discussion. We're happy to help.
diff --git a/README.md b/README.md
index ce3df1f..beed9c2 100644
--- a/README.md
+++ b/README.md
@@ -12,24 +12,24 @@
 
 This is a **research project**, not a production framework. The goal was to demonstrate that **training on the Apple Neural Engine — and potentially other NPUs — is possible**, and that the barrier has always been software support, not hardware capability. The ANE is a remarkably capable piece of silicon that Apple restricts to inference-only use through CoreML. This project bypasses that restriction using reverse-engineered private APIs to show what's possible when you give the hardware a chance.
 
-### What this project is
+### What This Project Is
 
 - A proof of concept for ANE training via `_ANEClient` and `_ANECompiler` private APIs
 - A set of benchmarks documenting real ANE performance characteristics (throughput, power, SRAM behavior)
 - A reference for anyone exploring direct ANE access outside CoreML
 - Research code that I update when I find something interesting
 
-### What this project is not
+### What This Project Is Not
 
 - A maintained framework or library
 - A replacement for CoreML, MLX, llama.cpp, or any production inference stack
 - A path to training large models on consumer hardware (yet)
 
-### On the hype
+### On The Hype
 
 Some coverage of this project has overstated its implications. To be clear:
 
-- Training works, but utilization is low (~2-3% of peak) with significant engineering challenges remaining
+- Training works, but utilization is low (~8-11% of peak) with significant engineering challenges remaining
 - Many element-wise operations still fall back to CPU
 - This does **not** replace GPU training for anything beyond small research models today
 
@@ -37,34 +37,73 @@ The honest results — including all limitations — are documented in the accom
 - [Part 1: Reverse Engineering](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine)
 - [Part 2: Benchmarks](https://maderix.substack.com/p/inside-the-m4-apple-neural-engine-615)
 
-### On maintenance
-
-I don't intend to grow this into a large community project. My focus is on original research (compiler infrastructure for edge AI optimization), and maintaining an open-source framework takes time away from that.
-
-That said:
-- I'll keep pushing updates when I discover something interesting
-- Bug fixes and benchmark contributions (especially on hardware I don't own) are welcome
-- Feature requests will likely go unaddressed — but feel free to fork
-
 ### Fork it, build on it
 
 This is MIT licensed for a reason. Everyone now has access to AI-assisted development tools that can adapt and extend code in hours. If this project is useful to you — take it, modify it, build something better. If you do something cool with it, I'd love to hear about it.
 
 ---
 
+## Community Fork
+
+This fork extends the original project with:
+
+- **M1/M2/M3/M4 compatibility** — MIL syntax fixes for broader Apple Silicon support (from upstream PR #6)
+- **Security hardening** — stack protection, format security, input validation (upstream PRs #5, #7)
+- **Bug fixes** — token sampling underflow fix, dashboard sudo hang fix (upstream PRs #17, #20)
+- **Configurable paths** — training data, model, and checkpoint paths via environment variables
+- **Community benchmarks** — standardized benchmark script + online dashboard for comparing results across chips
+- **12-layer training** — full Stories110M (12 transformer layers, 109M params) already working
+
+### Contributing
+
+We welcome benchmark submissions from any Apple Silicon hardware. See [Community Benchmarks](#community-benchmarks) below for how to run and submit your results.
+
+---
+
+## Quick Start
+
+**Requirements:** macOS 15+ on Apple Silicon (M1/M2/M3/M4/M5), Xcode CLI tools.
+
+```bash
+# Install Xcode CLI tools (if not already installed)
+xcode-select --install
+
+# Clone and set up
+git clone https://github.com/dev-erik/ANE.git
+cd ANE/training
+
+# Download training data + model weights
+make setup
+
+# Build and run training (12-layer Stories110M)
+make train_large
+./train_large --steps 100
+```
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `ANE_MODEL_PATH` | `../../assets/models/stories110M.bin` | Path to model weights |
+| `ANE_DATA_PATH` | `../../assets/data/tinystories_data00.bin` | Path to tokenized training data |
+| `ANE_CKPT_PATH` | `/tmp/ane_ckpt.bin` | Path for checkpoint files |
+| `ANE_ACCUM_STEPS` | `10` | Gradient accumulation steps before weight update (max 10000) |
+
+---
+
 ## What This Is
 
 A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the `_ANEClient` / `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.
 
-**Current results (M4, single transformer layer, dim=768, seq=512):**
-- 9.3 ms/step, 11.2% ANE utilization (1.78 TFLOPS sustained)
-- 6 ANE kernel dispatches per training step
+**Current results (M4 Max, 12-layer Stories110M, dim=768, seq=256):**
+- 62-72 ms/step, 8-11% ANE utilization (1.3-1.7 TFLOPS sustained)
+- 6 ANE kernel dispatches per layer per training step
 - All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)
-- Adam optimizer, gradient accumulation, checkpoint/resume
+- Adam optimizer, gradient accumulation, checkpoint/resume via process restart
 
 ## Architecture
 
-The training loop uses 6 ANE kernels per step:
+The training loop uses 6 ANE kernels per step per layer:
 
 | Kernel | Function | Weights |
 |--------|----------|---------|
@@ -73,19 +112,19 @@ The training loop uses 6 ANE kernels per step:
 | `kFFNBwd` | FFN backward (W2^T + SiLU_bwd + W1^T + W3^T) | W2^T, W1^T, W3^T |
 | `kSdpaBwd1` | Wo^T + SDPA backward part 1 (dV, probs, dp) | Wo^T, mask |
 | `kSdpaBwd2` | SDPA backward part 2 (softmax grad, dQ, dK) | — |
-| `kQKVb` | QKV backward (Wq^T + Wk^T + Wv^T → dx) | Wq^T, Wk^T, Wv^T |
+| `kQKVb` | QKV backward (Wq^T + Wk^T + Wv^T -> dx) | Wq^T, Wk^T, Wv^T |
 
 CPU handles: RMSNorm backward, residual connections, loss computation, dW gradient accumulation (cblas_sgemm), Adam optimizer updates.
 
 Key optimizations:
 - **Channel-first CPU layout** — matches ANE IOSurface `[1,C,1,S]` format, eliminates all transpose overhead
-- **vDSP vectorized RMSNorm** — 10x faster than naive (6.7ms → 0.7ms)
+- **vDSP vectorized RMSNorm** — 10x faster than naive (6.7ms to 0.7ms)
 - **GCD async cblas overlap** — dW gradient sgemms run in parallel with ANE evals on a serial dispatch queue
 - **Deferred cblas wait** — wait pushed into next step's forward pass for maximum overlap
 - **ANE RMSNorm fusion** — RMSNorm folded into forward kernels as MIL ops (reduce_sum + pow + mul)
 - **Wo^T fusion** — output projection backward merged into SDPA backward kernel
 - **Forward taps** — Q, K, V, attention scores, hidden states exposed via concat outputs, avoiding CPU recompute
-- **exec() restart** — bypasses ~119 ANE compile limit per process
+- **Process restart** — bypasses ~119 ANE compile limit per process via checkpoint and re-launch
 
 ## File Structure
 
@@ -93,34 +132,116 @@ Key optimizations:
 ├── api_exploration.m  # Initial ANE API discovery
 ├── inmem_basic.m  # In-memory MIL compilation proof-of-concept
 ├── inmem_bench.m  # ANE dispatch latency benchmarks
-├── inmem_peak.m  # Peak TFLOPS measurement (2048x2048 matmul)
+├── inmem_peak.m  # Peak TFLOPS measurement
 ├── sram_bench.m  # ANE SRAM bandwidth probing
 ├── sram_probe.m  # SRAM size/layout exploration
+├── scripts/
+│   ├── run_benchmarks.sh  # Full benchmark suite runner
+│   ├── run_community_benchmark.sh  # Standardized community benchmark (JSON output)
+│   ├── gen_mlpackages.py  # Generate .mlpackage models for sram/inmem tests
+│   └── aggregate_benchmarks.py  # Aggregate community JSON results
+├── community_benchmarks/  # Community-submitted benchmark results (JSON)
+├── web/  # Dashboard web app (Next.js + Neon Postgres)
+├── docs/
+│   ├── ARCHITECTURE.md  # System architecture with diagrams
+│   ├── API_REFERENCE.md  # Complete function index
+│   ├── BENCHMARKS.md  # Benchmark guide
+│   └── BENCHMARK_RESULTS.md  # Detailed M4 Max results
 └── training/
     ├── ane_runtime.h  # ANE private API wrapper (compile, eval, IOSurface)
     ├── ane_mil_gen.h  # MIL program generation helpers
-    ├── model.h  # Model weight initialization and blob builders
-    ├── forward.h  # Forward pass MIL generators
-    ├── backward.h  # Backward pass MIL generators
+    ├── ane_classifier.h  # Classifier forward/backward MIL generators
+    ├── ane_rmsnorm_bwd.h  # RMSNorm backward MIL generator
+    ├── stories_config.h  # Model configuration (dims, structs, macros)
+    ├── stories_io.h  # IOSurface I/O, blob builders, compile/eval helpers
+    ├── stories_mil.h  # MIL generators (SDPA, FFN, QKV backward)
+    ├── stories_cpu_ops.h  # CPU ops (RMSNorm, Adam, cross-entropy, embed)
+    ├── model.h  # Gen1 model weight init and blob builders
+    ├── forward.h  # Gen1 forward pass MIL generators
+    ├── backward.h  # Gen1 backward pass MIL generators
+    ├── train_large.m  # Main: 12-layer training (CPU classifier)
+    ├── train_large_ane.m  # 12-layer training (ANE classifier)
     ├── train.m  # Minimal training loop (early prototype)
     ├── tiny_train.m  # 2-layer tiny model training
-    ├── train_large.m  # Main: single-layer dim=768 training (optimized)
     ├── test_*.m  # Unit tests for individual kernels
-    └── Makefile
+    ├── dashboard.py  # Real-time training monitor
+    ├── tokenize.py  # Training data preprocessing
+    ├── download_data.sh  # Download training data + model weights
+    └── Makefile  # Build system (make train_large, make test, etc.)
 ```
 
+## Community Benchmarks
+
+We collect community benchmark results across Apple Silicon chips to understand ANE performance characteristics.
+
+### Run Benchmarks
+
+```bash
+# Run the standardized community benchmark
+bash scripts/run_community_benchmark.sh
+
+# Skip training benchmarks (if no training data)
+bash scripts/run_community_benchmark.sh --skip-training
+
+# Custom training steps
+bash scripts/run_community_benchmark.sh --steps 50
+```
+
+The script will:
+1. Detect your hardware (chip, memory, cores)
+2. Run SRAM probe and in-memory peak benchmarks
+3. Optionally run training benchmarks
+4. Save results as JSON to `community_benchmarks/`
+5. Ask if you'd like to submit results to the online dashboard
+
+### Submit Results
+
+**Option A: Automatic submission**
+At the end of the benchmark run, the script will ask if you want to submit. Your results are sent anonymously to our dashboard (IP is hashed, never stored raw).
+
+**Option B: GitHub PR**
+1. Fork this repository
+2. Run the benchmark script
+3. Commit the JSON file from `community_benchmarks/`
+4. Open a Pull Request
+
+**Option C: GitHub Issue**
+Paste the contents of your JSON results file in a new issue.
+
+### View Results
+
+Visit the **[ANE Community Benchmark Dashboard](https://web-lac-sigma-61.vercel.app)** to see aggregated results across all Apple Silicon chips.
+
+### Data Privacy
+
+- Your IP address is hashed (SHA-256) for rate limiting and duplicate detection only
+- No personal information is collected or stored
+- All benchmark data is public
+- Rate limited to 5 submissions per hour per IP
+
+---
+
 ## Building
 
-Requires macOS 15+ on Apple Silicon (tested on M4).
+Requires macOS 15+ on Apple Silicon (tested on M1 through M5).
 
 ```bash
-# Build the main training program
-xcrun clang -O2 -framework Foundation -framework IOSurface \
-  -framework CoreML -framework Accelerate -ldl -lobjc \
-  -o train_large training/train_large.m
+cd training
 
-# Run
-./train_large
+# Build everything
+make all
+
+# Build just the training programs
+make train_large train_large_ane
+
+# Run tests
+make test
+
+# Download training data
+make data
+
+# Full setup (data + dependencies)
+make setup
 ```
 
 No external dependencies. Uses only system frameworks + private ANE APIs resolved at runtime via `objc_msgSend`.
@@ -135,10 +256,11 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
 
 ## Limitations
 
-- **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE)
-- **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint
-- **Single layer** — Currently trains one transformer layer; multi-layer would need pipeline scheduling
-- **Synthetic data** — Currently uses random data for benchmarking; real tokenized data support is WIP
+- **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) then mask+softmax (ANE via add+softmax) then scores@V (ANE)
+- **~119 compile limit** — ANE compiler leaks resources; worked around via process restart with checkpoint
+- **Compilation overhead** — Weights baked at compile time mean recompilation every ACCUM_STEPS. Compilation is 80-85% of wall time. Investigating `_ANEChainingRequest` for potential pipeline without recompile.
+- **Classifier backward regression** — ANE classifier backward is ~3x slower than CPU cblas due to matmul (not conv) being used to work around ANE's 8192 input channel limit
+- **SRAM capacity** — ANE SRAM is ~24-32 MB (M4 Max). Models with weight matrices exceeding this threshold spill to DRAM with significant performance cliffs. Current Stories110M weights (~1.2 MB each) stay within SRAM.
 
 ## Performance History
 
@@ -149,12 +271,14 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
 | vDSP vectorized RMSNorm | 14.2 | 7.4% |
 | GCD async cblas overlap | 11.4 | 9.2% |
 | ANE RMSNorm fusion | 11.4 | 9.2% |
-| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
+| Wo^T fusion (7 to 6 kernels) | 11.4 | 9.2% |
 | Deferred cblas wait | **9.3** | **11.2%** |
 
+*Note: Above numbers are for single-layer training. Full 12-layer training runs at 62-72 ms/step.*
+
 ## Disclaimer
 
-This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.
+This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA section 1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.
 
 ## License
 
@@ -162,4 +286,4 @@ MIT — see [LICENSE](LICENSE)
 
 ---
 
-*Built by a human + Claude, one weekend at a time.*
+*Originally built by [maderix](https://github.com/maderix). Community fork maintained with contributions from the ANE research community.*
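
Review note on the utilization figures changed in the README hunks: the percentages can be cross-checked against the sustained throughput quoted next to them. The sketch below is only a sanity check, assuming utilization means sustained TFLOPS divided by the 15.8 TFLOPS (M4) peak quoted in the README, and that the same peak is used for the M4 Max numbers. `utilization_pct` and `PEAK_TFLOPS_M4` are hypothetical names for illustration, not part of this repository.

```python
# Sanity check for the README's ANE utilization figures.
# Assumption: utilization = sustained TFLOPS / peak TFLOPS, with the
# 15.8 TFLOPS (M4) peak quoted in the README applied to every figure.

PEAK_TFLOPS_M4 = 15.8


def utilization_pct(sustained_tflops: float, peak_tflops: float = PEAK_TFLOPS_M4) -> float:
    """Return utilization as a percentage of peak throughput."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return 100.0 * sustained_tflops / peak_tflops


if __name__ == "__main__":
    # Sustained throughput values quoted in the README diff.
    for label, tflops in [
        ("12-layer, low end", 1.3),
        ("12-layer, high end", 1.7),
        ("single layer", 1.78),
    ]:
        print(f"{label}: {utilization_pct(tflops):.1f}% of peak")
```

Under that assumption, the 12-layer range of 1.3-1.7 TFLOPS works out to roughly 8-11% of peak, which agrees with the updated text. The single-layer figure of 1.78 TFLOPS comes out marginally above the quoted 11.2%, which is plausibly just rounding in the throughput number.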