Training on Apple Neural Engine


Claude 7b6a18a059	Add ANE int8/int4 quantization probe	2026-03-03 01:02:05 +00:00

Probe whether Apple Neural Engine executes quantized ops natively (faster int8-int8 compute path) or just dequantizes to fp16 at load time. Tests 5 approaches at transformer-representative dimensions:

1. FP16 baseline conv (baked weights)
2. INT8 via constexpr_affine_dequantize (per-channel scale+zp)
3. UINT4 via constexpr_affine_dequantize (per-channel)
4. UINT4 via constexpr_blockwise_shift_scale (block_size=32)
5. 4-bit palettized via constexpr_lut_to_dense (16-entry LUT)

Each test compiles MIL → ANE kernel, benchmarks 100 evals, reports TFLOPS. If int8 shows ~2x fp16 TFLOPS, ANE has native int8 compute. If same TFLOPS, it's dequant-only (still useful for memory savings).

Build: xcrun clang -O2 -fobjc-arc -o quant_probe quant_probe.m -framework Foundation -framework IOSurface -ldl

https://claude.ai/code/session_01U5HLjsm4iUzL9iDaHbxeRB
training	stories110M: 12-layer ANE training with dashboard, 107ms/step	2026-03-01 03:14:39 -08:00
LICENSE	Initial release	2026-02-28 00:22:06 -08:00
README.md	Update README.md	2026-03-02 14:36:28 -08:00
api_exploration.m	Initial release	2026-02-28 00:22:06 -08:00
inmem_basic.m	Initial release	2026-02-28 00:22:06 -08:00
inmem_bench.m	Initial release	2026-02-28 00:22:06 -08:00
inmem_peak.m	Initial release	2026-02-28 00:22:06 -08:00
quant_probe.m	Add ANE int8/int4 quantization probe	2026-03-03 01:02:05 +00:00
sram_bench.m	Initial release	2026-02-28 00:22:06 -08:00
sram_probe.m	Initial release	2026-02-28 00:22:06 -08:00

README.md

Running Transformers on Apple's Neural Engine

You might be asking, "why the FUCK would you pick GPT2?"

Have you read the art bro? Have you? Nah. I doubt it.

GPT2 had more soul in its theoretical pinky finger than all of us combined.

But I digress..

Running transformer inference and training directly on Apple's Neural Engine via reverse-engineered private APIs. No CoreML, no Metal, no GPU — pure ANE compute through _ANEInMemoryModel and MIL programs compiled at runtime.

Forked from maderix/ANEtransformers, which demonstrated ANE training (Stories110M, 12-layer forward+backward on ANE). This fork extends the project with GPT-2 inference on ANE, a systematic M5 hardware investigation, and fused kernel optimization.

What's Here

GPT-2 Inference on ANE (`training/gpt2.m`)

Complete GPT-2 (124M) inference engine. Two-phase architecture:

ANE prefill — Full sequence processed on ANE using fused attention and FFN kernels. One fused attention kernel (QKV + multi-head SDPA + causal mask + softmax + output projection) and one fused FFN kernel (W1 + GELU + W2) per layer, compiled per sequence-length bucket (32, 64, 128, 256, 512, 1024). Embedding and LayerNorm on CPU with Accelerate.
CPU decode with KV cache — Single-token generation runs entirely on CPU using NEON fp16 matmul (4-row unrolled), bypassing ANE dispatch overhead. LM head via GCD-parallel NEON fp16 over 50,257 vocab rows.
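
Because kernels are compiled per sequence-length bucket, a prompt has to be mapped to one of the listed sizes before prefill. The sketch below shows one plausible mapping: pick the smallest bucket that fits and right-pad to it. The helper names, the right-padding policy, and the pad id are assumptions for illustration, not taken from gpt2.m.

```python
import bisect

# Sequence-length buckets listed above; one kernel set is compiled per bucket.
BUCKETS = (32, 64, 128, 256, 512, 1024)


def pick_bucket(prompt_len: int) -> int:
    """Return the smallest bucket that fits prompt_len tokens (hypothetical helper)."""
    if prompt_len < 1:
        raise ValueError("prompt must contain at least one token")
    if prompt_len > BUCKETS[-1]:
        raise ValueError(f"prompt of {prompt_len} tokens exceeds the largest bucket")
    # bisect_left finds the first bucket >= prompt_len in the sorted tuple.
    return BUCKETS[bisect.bisect_left(BUCKETS, prompt_len)]


def pad_to_bucket(token_ids: list[int], pad_id: int = 0) -> list[int]:
    """Right-pad token ids to the bucket length (padding policy is an assumption)."""
    bucket = pick_bucket(len(token_ids))
    return token_ids + [pad_id] * (bucket - len(token_ids))
```

With a causal mask, padding on the right cannot influence the positions before it, which is why right-padding is the natural choice in a sketch like this.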

Includes a from-scratch BPE tokenizer (gpt2_tokenizer.h) and weight converter (gpt2_convert.py) that pulls weights from HuggingFace with no PyTorch dependency.

cd training

# Download and convert weights
pip install safetensors huggingface_hub
python3 gpt2_convert.py

# Build and run
make gpt2
./gpt2 --prompt "The meaning of life is" --tokens 100 --temp 0.8
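
Converting without PyTorch is feasible because the safetensors container is simple: an 8-byte little-endian header length, a JSON header, then raw tensor bytes. The sketch below reads that layout with the standard library only and repacks float32 values as fp16. It is not the code in gpt2_convert.py (which installs the safetensors package); the function names are hypothetical, and the fp16 byte string is only a stand-in for whatever blob layout the ANE side expects.

```python
import json
import struct


def read_safetensors_header(blob: bytes) -> tuple[dict, int]:
    """Parse the JSON header; return (header, offset where tensor data starts)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)  # 8-byte little-endian length
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len


def read_f32_tensor(blob: bytes, name: str) -> list[float]:
    """Return one F32 tensor as a flat list of Python floats."""
    header, data_start = read_safetensors_header(blob)
    entry = header[name]
    if entry["dtype"] != "F32":
        raise ValueError(f"{name} has dtype {entry['dtype']}, expected F32")
    begin, end = entry["data_offsets"]  # offsets are relative to the data section
    raw = blob[data_start + begin:data_start + end]
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def f32_to_f16_bytes(values: list[float]) -> bytes:
    """Pack floats as little-endian IEEE half precision (struct format 'e')."""
    return struct.pack(f"<{len(values)}e", *values)
```

A real converter would stream from a file rather than hold the whole blob in memory; the in-memory form just keeps the sketch short.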

ANE Training (upstream)

The original maderix work: training a 109M-parameter Llama2-architecture transformer (Stories110M) directly on ANE. 12-layer forward+backward pass, 6 ANE kernel types per layer (72 kernels/step), 107 ms/step on M4. Adam optimizer, gradient accumulation, checkpoint/resume via exec() restart. See training/train_large.m and the training README.
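
The checkpoint/resume via exec() pattern can be illustrated in a few lines. The real implementation is Objective-C in train_large.m; this Python sketch only shows the shape of the idea. The budget value, the `--resume` and `--step` flags, and both function names are assumptions made for the example.

```python
import os
import sys

# Assumed safety margin below the ~119 compiles observed per process.
COMPILE_BUDGET = 100


def should_restart(compiles_so_far: int, compiles_next: int,
                   budget: int = COMPILE_BUDGET) -> bool:
    """True if the next batch of compiles would push the process past the budget."""
    return compiles_so_far + compiles_next > budget


def restart_from_checkpoint(checkpoint_path: str, step: int) -> None:
    """Replace this process with a fresh copy that resumes from a saved checkpoint.

    The checkpoint must already be written to disk: exec discards all
    in-memory state, including anything the compiler has leaked.
    """
    argv = [sys.executable, sys.argv[0],
            "--resume", checkpoint_path,   # hypothetical flag
            "--step", str(step)]           # hypothetical flag
    sys.stdout.flush()
    os.execv(sys.executable, argv)  # does not return on success
```

A training loop would call `should_restart` before compiling and, when it returns true, save a checkpoint and call `restart_from_checkpoint`.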

M5 Hardware Investigation

Systematic probing of ANE behavior on Apple M5 (H16 family, same as M4). Key findings documented in training/m5result.md:

Question	Result
Can weights be swapped without recompile?	No. Weights baked at compile time. File overwrite + reload ignored.
Does `weightsBuffer` IOSurface override compiled weights?	No. Same output regardless.
Does QoS affect ANE frequency?	No. All QoS 0-63 work. Fixed frequency, no latency difference.
Can `_ANEChainingRequest` chain kernels without CPU round-trips?	Validates but rejected by driver (Error Code=15). Likely requires entitlements only CoreML holds.
Can `_ANEPerformanceStats` expose hardware counters?	Class exists with `hwExecutionTime` but requires factory construction via model `perfStatsMask`.
Real-time eval path?	`beginRealTimeTask` returns NO (needs entitlement). `evaluateRealTimeWithModel` works but no perf gain.

Fused Kernel Benchmarks (`training/bench_fused.m`)

Quantifies the value of operation fusion on ANE:

Dispatch overhead: ~60-80 us per sequential ANE dispatch
Fused vs separate: 1.5-3x speedup from fusing multiple convolutions into single MIL programs
Conclusion: Fused MIL is the only viable path to high ANE utilization. Intermediates stay in ANE SRAM when fused, and the chaining API is rejected by the driver, likely because it requires Apple-internal entitlements.
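
To put the dispatch figure in context, the arithmetic below multiplies the measured ~60-80 us per dispatch by a dispatch count. The 24-dispatch case follows from the layout described earlier (12 GPT-2 layers, one fused attention and one fused FFN kernel each). The 6-kernels-per-layer unfused case is a hypothetical comparison point, and the estimate covers dispatch overhead only, not compute time.

```python
# Measured range from the benchmark above, in microseconds per dispatch.
DISPATCH_US = (60.0, 80.0)


def dispatch_overhead_ms(num_dispatches: int,
                         per_dispatch_us: tuple[float, float] = DISPATCH_US
                         ) -> tuple[float, float]:
    """Return (low, high) total dispatch overhead in milliseconds."""
    low_us, high_us = per_dispatch_us
    return (num_dispatches * low_us / 1000.0,
            num_dispatches * high_us / 1000.0)


# 12 layers x 2 fused kernels per layer = 24 dispatches per prefill pass.
fused = dispatch_overhead_ms(12 * 2)

# Hypothetical unfused split: 6 separate kernels per layer = 72 dispatches.
unfused = dispatch_overhead_ms(12 * 6)

print(f"fused:   {fused[0]:.2f}-{fused[1]:.2f} ms")
print(f"unfused: {unfused[0]:.2f}-{unfused[1]:.2f} ms")
```

Under these assumptions, fusing removes roughly two thirds of the dispatch overhead before any benefit from keeping intermediates on-chip is counted.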

Full investigation: training/CHAINING_INVESTIGATION.md

How It Works

MIL generation — Objective-C constructs MIL program text at runtime: convolutions (linear layers), matmul (attention), softmax, element-wise ops
In-memory compilation — _ANEInMemoryModelDescriptor compiles MIL text + fp16 weight blobs directly to ANE programs, no disk .mlmodelc needed
IOSurface I/O — Tensors via IOSurface shared memory in [1, C, 1, S] channel-first fp16 format. Spatial dimension padded to minimum stride of 32. Minimum surface size 49152 bytes.
Weight baking — Weights compiled as BLOBFILE constants into MIL programs. No runtime weight update possible — must recompile to change weights.
NEON fp16 — ARM NEON intrinsics for fast fp32-fp16 conversion at IOSurface boundaries and for CPU-side matmul during decode
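
The IOSurface rules above (channel-first fp16, stride padded to 32, 49152-byte minimum) can be written down as a small size calculation and packing routine. This is a sketch under stated assumptions: it treats the surface as dense, with element (c, s) at index c * stride + s and no extra row alignment, which the README does not confirm. Function names are hypothetical.

```python
import struct

MIN_STRIDE = 32            # minimum spatial stride, from the text above
MIN_SURFACE_BYTES = 49152  # minimum IOSurface size, from the text above
FP16_BYTES = 2


def surface_bytes(channels: int, seq_len: int) -> int:
    """Bytes needed for a [1, C, 1, S] fp16 surface under the assumed dense layout."""
    stride = max(seq_len, MIN_STRIDE)
    return max(channels * stride * FP16_BYTES, MIN_SURFACE_BYTES)


def pack_channel_first(rows: list[list[float]]) -> bytes:
    """Convert token-major [S][C] floats into channel-first fp16 bytes.

    Unused stride slots and any bytes up to the minimum size stay zero.
    """
    seq_len, channels = len(rows), len(rows[0])
    stride = max(seq_len, MIN_STRIDE)
    out = bytearray(surface_bytes(channels, seq_len))
    for c in range(channels):
        for s in range(seq_len):
            offset = (c * stride + s) * FP16_BYTES
            struct.pack_into("<e", out, offset, rows[s][c])  # 'e' = IEEE half
    return bytes(out)
```

Note that 768 channels at the minimum stride of 32 comes to exactly 768 * 32 * 2 = 49152 bytes, so a seq=1 decode tensor for GPT-2 124M sits right at the minimum under this layout.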

File Structure

├── api_exploration.m               # Initial ANE private API discovery
├── inmem_basic.m                   # In-memory MIL compilation proof-of-concept
├── inmem_bench.m                   # ANE dispatch latency benchmarks
├── inmem_peak.m                    # Peak TFLOPS measurement (2048x2048 matmul)
├── sram_bench.m                    # ANE SRAM bandwidth probing
├── sram_probe.m                    # SRAM size/layout exploration
└── training/
    ├── gpt2.m                      # GPT-2 124M inference: ANE prefill + CPU KV-cache decode
    ├── gpt2_convert.py             # HuggingFace → ANE weight converter (no PyTorch)
    ├── gpt2_tokenizer.h            # Self-contained BPE tokenizer (header-only C)
    ├── bench_fused.m               # Fused vs separate kernel benchmarks
    ├── CHAINING_INVESTIGATION.md   # Full chaining API reverse-engineering writeup
    ├── m5result.md                 # M5 ANE probe results
    ├── train_large.m               # 12-layer Stories110M ANE training (upstream)
    ├── stories_config.h            # Training model config and structs
    ├── stories_io.h                # Training IOSurface I/O and kernel compile/eval
    ├── stories_mil.h               # Training MIL generators (6 kernel types)
    ├── stories_cpu_ops.h           # vDSP RMSNorm, cross-entropy, Adam, embeddings
    ├── dashboard.py                # Training TUI: loss curves, power, text generation
    ├── ane_runtime.h               # ANE private API wrapper
    ├── ane_mil_gen.h               # MIL generation helpers
    ├── test_chaining.m             # Chaining API experiments
    ├── test_weight_reload.m        # Weight swap without recompile test
    ├── test_ane_advanced.m         # weightsBuffer, procedureIndex, shared events probe
    ├── test_qos_sweep.m            # QoS 0-63 latency sweep
    ├── test_perf_stats.m           # _ANEPerformanceStats introspection
    ├── test_decode_attn.m          # Multi-input decode attention kernel validation
    ├── test_multi_input.m          # Multi-input IOSurface size constraints
    ├── test_ffn_seq1.m             # FFN at seq=1 (decode mode) validation
    ├── test_lm_head_ane.m          # LM head on ANE feasibility
    ├── test_lm_head_fast.m         # LM head CPU benchmark (6 approaches)
    ├── test_lm_head_neon.m         # LM head NEON fp16 benchmark
    ├── docs/                       # Roadmap: GPT-2 XL, streaming, sampling, interactive
    └── Makefile

Building

Requires macOS 15+ on Apple Silicon (tested on M4, M5). No external dependencies — system frameworks + private ANE APIs resolved at runtime via objc_msgSend.

cd training

# GPT-2 inference
make gpt2
./gpt2 --prompt "Once upon a time" --tokens 200

# Stories110M training (upstream)
make train_large
./train_large

# Benchmarks
make bench_fused
./bench_fused

Known ANE Constraints

Weights are immutable after compile — no hot-swap, no weightsBuffer override, no file-swap reload
SDPA causal masking ignored by hardware — must decompose into Q@K^T + mask add + softmax + scores@V
~119 compile limit per process — ANE compiler leaks resources; training uses exec() restart
IOSurface minimum 49152 bytes — even for tiny tensors (seq=1 decode)
Spatial stride padded to 32: [1, C, 1, S] surfaces have stride max(S, 32)
Chaining API inaccessible — _ANEChainingRequest validates but driver rejects (Error Code=15)
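
The SDPA decomposition listed above (Q@K^T, mask add, softmax, scores@V) is shown below as a plain-Python reference for a single head. It is meant as a readable check of the math, not the MIL the project generates. The additive mask value of -1e4 is an assumption, chosen because it stays within fp16 range, and the 1/sqrt(d) scaling is the standard attention scale.

```python
import math

MASK_VALUE = -1e4  # assumed additive mask; large negative that fits in fp16


def causal_attention(q: list[list[float]],
                     k: list[list[float]],
                     v: list[list[float]]) -> list[list[float]]:
    """Single-head causal attention as four explicit steps; inputs are [S][D]."""
    seq_len, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(seq_len):
        # Step 1: scaled scores, row i of Q @ K^T.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                  for j in range(seq_len)]
        # Step 2: additive causal mask; position i may not see j > i.
        scores = [s + (0.0 if j <= i else MASK_VALUE)
                  for j, s in enumerate(scores)]
        # Step 3: numerically stable softmax.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Step 4: weighted sum of values, row i of probs @ V.
        out.append([sum(p * v[j][d] for j, p in enumerate(probs))
                    for d in range(len(v[0]))])
    return out
```

A quick sanity check for any implementation of this decomposition: changing the keys or values at a later position must leave every earlier output row unchanged.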

Disclaimer

This project is independent research into Apple Neural Engine architecture. It uses undocumented APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see Sega v. Accolade, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.

License

MIT — see LICENSE