ANE Inference — Full LLM on Apple Neural Engine
First complete LLM inference running directly on Apple's Neural Engine via reverse-engineered _ANEClient APIs. No CoreML. No Xcode compiler dependency at runtime. Token-for-token match with PyTorch.
Built on top of the maderix/ANE training runtime.
What This Does
Runs Qwen2.5-0.5B-Instruct (24 transformer layers, 494M parameters) entirely on the ANE:
- 169 ANE kernels compiled at startup via _ANEInMemoryModel
- 82 tokens/sec decode on M4 Pro
- Zero GPU usage — runs on 16 dedicated neural cores
- Correct output — matches PyTorch reference token-for-token
All linear projections (Q, K, V, O, gate, up, down × 24 layers + chunked LM head) compile as baked-weight 1×1 convolution kernels on ANE. The remaining ops (RMSNorm, RoPE, attention scores, softmax, SiLU) run on the CPU via Accelerate BLAS.
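Why a linear projection can be expressed as a 1×1 convolution: a minimal pure-Python sketch with toy sizes. The `[in_channels][1][seq]` layout and the helper names `linear` and `conv1x1` are assumptions made for illustration, not the layout or code used in qwen_ane_infer.h.

```python
def linear(weight, x):
    """Dense projection: weight is [out][in], x is [in], result is [out]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def conv1x1(weight, tensor):
    """1x1 convolution: weight is [out][in], tensor is [in][H][W]."""
    in_ch = len(tensor)
    height, width = len(tensor[0]), len(tensor[0][0])
    return [
        [
            [sum(row[c] * tensor[c][h][w] for c in range(in_ch))
             for w in range(width)]
            for h in range(height)
        ]
        for row in weight
    ]

# Toy sizes: 3 input features, 2 output features, 2 tokens.
W = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]
tokens = [[1.0, 0.0, 2.0], [3.0, 1.0, -1.0]]

# Assumed layout: features become channels, tokens become the width axis.
tensor = [[[tok[c] for tok in tokens]] for c in range(3)]
out = conv1x1(W, tensor)

# Column s of the conv output is the dense projection of token s.
per_token = [[out[o][0][s] for o in range(2)] for s in range(len(tokens))]
```

Because the kernel covers a single spatial position, every output element is a dot product over channels only, which is exactly a matrix-vector product per token.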
Architecture
Token → Embedding (CPU) → 24× Transformer Layer → LM Head (CPU) → Next Token
│
├── RMSNorm (CPU)
├── Q/K/V Projection (ANE conv kernel)
├── RoPE (CPU, rotate_half)
├── GQA Attention (CPU, 14 heads / 2 KV heads)
├── O Projection (ANE conv kernel)
├── Residual (CPU)
├── RMSNorm (CPU)
├── Gate/Up Projection (ANE conv kernel)
├── SiLU + elementwise mul (CPU)
├── Down Projection (ANE conv kernel)
└── Residual (CPU)
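The "14 heads / 2 KV heads" step means several query heads share one key/value head. The sketch below uses the dimensions from the config line printed by the binary; the contiguous-group mapping (`q_head // group`) follows the usual grouped-query convention and is an assumption about this code base, as is the helper name `kv_head_for`.

```python
# Dimensions from the printed config: dim=896 heads=14 kv_heads=2
DIM, N_HEADS, N_KV_HEADS = 896, 14, 2

HEAD_DIM = DIM // N_HEADS        # width of each attention head
GROUP = N_HEADS // N_KV_HEADS    # query heads sharing one KV head

def kv_head_for(q_head):
    """Index of the KV head that a given query head attends with (assumed mapping)."""
    return q_head // GROUP

mapping = {h: kv_head_for(h) for h in range(N_HEADS)}

# Floats stored in the KV cache per token per layer (keys + values).
kv_floats_per_token = 2 * N_KV_HEADS * HEAD_DIM
```

With these numbers each KV head serves 7 query heads, so the KV cache is 7× smaller than it would be with one KV head per query head.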
Quick Start
# 1. Convert weights from HuggingFace safetensors to flat binary
pip install safetensors torch transformers
python3 convert_weights.py /path/to/Qwen2.5-0.5B-Instruct qwen05b.bin
# 2. Build
xcrun clang -O2 -framework Foundation -framework IOSurface \
-framework CoreML -framework Accelerate -ldl -lobjc -fobjc-arc \
-o qwen_ane main.m
# 3. Run (single-shot, pass space-separated token IDs)
./qwen_ane qwen05b.bin "151644 8948 198 2610 525 264 10950 17847 13" 20
# 4. With tokenizer (requires transformers)
python3 run.py "Say hello in one word."
Server Mode (Recommended)
The first invocation compiles 169 ANE kernels (~5.5s). Server mode keeps them loaded so subsequent prompts respond instantly.
Socket server (best for run.py integration)
# Terminal 1: start the server (compiles once, stays running)
./qwen_ane qwen05b.bin --server /tmp/qwen_ane.sock
# Terminal 2: queries are instant (~0.5s instead of ~6s)
python3 run.py "What is 2+2?"
python3 run.py "Capital of France?"
python3 run.py "Count from 1 to 5"
run.py auto-detects the socket at /tmp/qwen_ane.sock and connects to it. If no server is running, it falls back to subprocess mode (slower).
You can also query the socket directly:
echo '{"tokens": [151644, 8948, 198], "max_tokens": 50}' | nc -U /tmp/qwen_ane.sock
Response format:
{"output": [9707, 0, 151645], "prefill_tps": 68.4, "decode_tps": 67.8, "prompt_tokens": 28, "gen_tokens": 3}
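The same request can be sent from Python without nc. This is a sketch based only on the request and response shapes shown above; it assumes the server accepts one newline-terminated JSON object per connection and closes the connection after replying. The helper names are hypothetical and not part of run.py.

```python
import json
import socket

SOCK_PATH = "/tmp/qwen_ane.sock"

def build_request(tokens, max_tokens):
    """Encode a request in the JSON shape shown above, newline-terminated."""
    payload = {"tokens": list(tokens), "max_tokens": max_tokens}
    return (json.dumps(payload) + "\n").encode()

def parse_response(raw):
    """Return (output token IDs, decode tokens/sec) from a raw reply."""
    reply = json.loads(raw.decode())
    return reply["output"], reply.get("decode_tps")

def query(tokens, max_tokens=50, path=SOCK_PATH):
    """Send one request and read the reply until the server closes (assumed)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(build_request(tokens, max_tokens))
        sock.shutdown(socket.SHUT_WR)  # no more request data from our side
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return parse_response(b"".join(chunks))

# Usage, with the server from Terminal 1 running:
#   output_ids, decode_tps = query([151644, 8948, 198], max_tokens=50)
```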
Stdin server (for piping/scripting)
./qwen_ane qwen05b.bin --server
# Waits for "READY", then send lines of space-separated token IDs:
# 151644 8948 198 2610 525|20
# (pipe character separates max_tokens)
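A small helper for producing lines in that format from a script. Only the `ids|max_tokens` shape is taken from the comment above; the function names are hypothetical, and the parser is just the inverse for illustration, not the server's actual parsing code.

```python
def format_stdin_line(token_ids, max_tokens):
    """Build one request line: space-separated IDs, a pipe, then max_tokens."""
    return " ".join(str(t) for t in token_ids) + "|" + str(max_tokens)

def parse_stdin_line(line):
    """Inverse of format_stdin_line, for checking lines before sending them."""
    ids_part, _, max_part = line.strip().partition("|")
    return [int(t) for t in ids_part.split()], int(max_part)

line = format_stdin_line([151644, 8948, 198, 2610, 525], 20)
```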
Performance comparison
| Mode | First prompt | Subsequent prompts |
|---|---|---|
| Single-shot | ~6s | ~6s (recompiles) |
| Server | ~6s (startup) | ~0.5s |
Output
=== Qwen2.5-0.5B ANE Inference ===
Loading weights...
Config: dim=896 hidden=4864 layers=24 heads=14 kv_heads=2 vocab=151936
Compiling ANE kernels (169 total)...
Compile time: 5.1s
Prompt: 28 tokens, generating up to 10
Prefill: 64.2 t/s (28 tokens)
OUT: 9707 13 151645
Decode: 82.4 t/s (2 tokens)
→ "Hello." (matches PyTorch exactly)
Files
| File | What |
|---|---|
| qwen_ane_infer.h | Full 24-layer transformer forward pass, ANE kernel compilation, KV cache |
| main.m | Weight loader, token I/O, main generation loop |
| convert_weights.py | HuggingFace safetensors → flat f32 binary (includes Q/K/V biases) |
| run.py | Python wrapper with HuggingFace tokenizer |
Model Support
Currently implements Qwen2.5 architecture:
- GQA attention (grouped-query, n_heads ≠ n_kv_heads)
- rotate_half RoPE (not interleaved pairs)
- SwiGLU FFN (gate + up + silu + down)
- Q/K/V bias (Qwen-specific)
- Tied word embeddings (lm_head = embed)
- Chunked LM head (vocab > 65536 exceeds ANE max dim)
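How chunking the LM head can work: the vocabulary rows are split into slices that each fit under the size limit, every slice is projected separately, and the winning token is found across all slices. The sketch below uses the vocab size from the printed config, the 65536 limit from the list above, and the 16 chunks mentioned under How It Works. Even splitting and greedy argmax are assumptions, and the function names are hypothetical.

```python
VOCAB = 151936        # from the printed config
ANE_MAX_DIM = 65536   # limit quoted in the list above
N_CHUNKS = 16         # chunk count quoted under "How It Works"

def chunk_ranges(vocab, n_chunks):
    """Split [0, vocab) into n_chunks contiguous (start, end) ranges."""
    base, extra = divmod(vocab, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def argmax_over_chunks(chunk_logits, ranges):
    """Greedy pick: global token ID of the largest logit across all chunks."""
    best_id, best_val = -1, float("-inf")
    for logits, (start, _end) in zip(chunk_logits, ranges):
        for offset, val in enumerate(logits):
            if val > best_val:
                best_id, best_val = start + offset, val
    return best_id

ranges = chunk_ranges(VOCAB, N_CHUNKS)
```

With these numbers the split is exact: 16 chunks of 9496 rows each, all well below the limit.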
Adapting to other architectures (LLaMA, Gemma, Mistral) requires:
- Adjusting the config constants in qwen_ane_infer.h
- Updating convert_weights.py for the weight naming scheme
- Removing Q/K/V bias handling if the model doesn't have them
- Switching RoPE to interleaved pairs if needed
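The difference between the two RoPE conventions is only which elements form a rotation pair. A pure-Python sketch for a single head vector follows; `theta` should come from the model's config.json, and the default used here is a placeholder. The function names are hypothetical, not those in qwen_ane_infer.h.

```python
import math

def rope_angles(pos, head_dim, theta):
    """Rotation angle for each of the head_dim/2 pairs at position pos."""
    half = head_dim // 2
    return [pos * theta ** (-2.0 * i / head_dim) for i in range(half)]

def rope_rotate_half(x, pos, theta=10000.0):
    """rotate_half convention: element i is paired with element i + half."""
    half = len(x) // 2
    ang = rope_angles(pos, len(x), theta)
    out = [0.0] * len(x)
    for i in range(half):
        c, s = math.cos(ang[i]), math.sin(ang[i])
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out

def rope_interleaved(x, pos, theta=10000.0):
    """Interleaved convention: element 2i is paired with element 2i + 1."""
    half = len(x) // 2
    ang = rope_angles(pos, len(x), theta)
    out = [0.0] * len(x)
    for i in range(half):
        c, s = math.cos(ang[i]), math.sin(ang[i])
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = b * c + a * s
    return out
```

The two functions produce the same values in a different element order, so a checkpoint trained with one convention gives wrong attention scores if run with the other unless the Q/K weights are permuted to match.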
Requirements
- macOS 15+ on Apple Silicon (M1/M2/M3/M4)
- Xcode Command Line Tools (for xcrun clang)
- Python 3.9+ with safetensors, torch, transformers (for weight conversion)
Known Limitations
- CPU projections only: ANE baked-weight conv kernels compile successfully but produce incorrect output (FP16 weight blob format mismatch). The USE_ANE_PROJECTIONS toggle exists but defaults to 0 (CPU via Accelerate BLAS). Fixing this would push decode speed from 82 t/s to 120+ t/s.
- Single model: hardcoded for Qwen2.5-0.5B. Needs parameterization for other sizes.
- f32 weights: 1.9GB on disk. FP16 weights would halve this; quantized weights would shrink it further.
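To illustrate the last point, re-encoding a float32 buffer as float16 halves its size. This sketch uses only Python's struct module on a toy buffer; it is not the layout written by convert_weights.py, and the helper name is hypothetical.

```python
import struct

def f32_to_f16_bytes(f32_blob):
    """Re-encode little-endian float32 values as little-endian float16.

    struct raises OverflowError for values outside the FP16 range.
    """
    count = len(f32_blob) // 4
    values = struct.unpack(f"<{count}f", f32_blob)
    return struct.pack(f"<{count}e", *values)

# Toy values chosen to be exactly representable in FP16.
weights = [0.5, -1.25, 0.0078125, 3.0]
f32_blob = struct.pack(f"<{len(weights)}f", *weights)
f16_blob = f32_to_f16_bytes(f32_blob)
restored = list(struct.unpack(f"<{len(weights)}e", f16_blob))
```

Real weights are generally not exactly representable in FP16, so a conversion like this trades a small rounding error for the size reduction.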
How It Works
The key insight from maderix's reverse engineering: the ANE executes compiled MIL (Machine Learning Intermediate Language) programs as atomic graph operations. Each linear projection becomes a MIL program with baked FP16 weights, compiled in-memory via _ANEInMemoryModel, and executed through IOSurface-based zero-copy I/O.
We chain 169 of these atomic operations (7 per transformer layer + 16 LM head chunks) with CPU-side element-wise ops in between. The ANE handles the compute-heavy matmuls; the CPU handles the memory-bound operations (attention scores, softmax, RoPE).
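The alternation between projection kernels and CPU-side ops can be sketched for the FFN half of a layer. Here `matvec` is a pure-Python stand-in for whichever backend runs a projection (it is not an ANE call), and all names and the toy weights are hypothetical.

```python
import math

def matvec(weight, x):
    """Stand-in for one projection kernel: weight is [out][in], x is [in]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = matvec(w_gate, x)                          # projection kernel
    up = matvec(w_up, x)                              # projection kernel
    hidden = [silu(g) * u for g, u in zip(gate, up)]  # CPU element-wise step
    return matvec(w_down, hidden)                     # projection kernel

# Toy weights: 2 input features, 2 hidden units, 1 output feature.
x = [1.0, 2.0]
w_gate = [[0.0, 0.0], [1.0, 0.0]]
w_up = [[1.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 1.0]]
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

Each `matvec` call corresponds to one compiled kernel in the chain; the element-wise step between them is the kind of small, memory-bound work the text assigns to the CPU.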
License
Same as maderix/ANE — research and educational use.