ANE/inference
Erik Bray b4d81b71d4 [feat] Merge upstream PRs #21, #23, #26: NEON-optimized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR #22 (binary blob risk). 2026-03-03 17:18:02 +01:00
..
README.md [feat] Merge upstream PRs #21, #23, #26: NEON-optimized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR #22 (binary blob risk). 2026-03-03 17:18:02 +01:00
convert_weights.py [feat] Merge upstream PRs #21, #23, #26: NEON-optimized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR #22 (binary blob risk). 2026-03-03 17:18:02 +01:00
main.m [feat] Merge upstream PRs #21, #23, #26: NEON-optimized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR #22 (binary blob risk). 2026-03-03 17:18:02 +01:00
qwen_ane_infer.h [feat] Merge upstream PRs #21, #23, #26: NEON-optimized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR #22 (binary blob risk). 2026-03-03 17:18:02 +01:00
run.py [feat] Merge upstream PRs #21, #23, #26: NEON-optimized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR #22 (binary blob risk). 2026-03-03 17:18:02 +01:00

README.md

ANE Inference — Full LLM on Apple Neural Engine

First complete LLM inference running directly on Apple's Neural Engine via reverse-engineered _ANEClient APIs. No CoreML. No Xcode compiler dependency at runtime. Token-for-token match with PyTorch.

Built on top of the maderix/ANE training runtime.

What This Does

Runs Qwen2.5-0.5B-Instruct (24 transformer layers, 494M parameters) entirely on the ANE:

  • 169 ANE kernels compiled at startup via _ANEInMemoryModel
  • 82 tokens/sec decode on M4 Pro
  • Zero GPU usage — runs on 16 dedicated neural cores
  • Correct output — matches PyTorch reference token-for-token

All linear projections (Q, K, V, O, gate, up, down × 24 layers + chunked LM head) compile as baked-weight 1×1 convolution kernels on ANE. Element-wise ops (RMSNorm, RoPE, softmax, SiLU, attention scores) run on CPU via Accelerate BLAS.

Architecture

Token → Embedding (CPU) → 24× Transformer Layer → LM Head (CPU) → Next Token
                              │
                              ├── RMSNorm (CPU)
                              ├── Q/K/V Projection (ANE conv kernel)
                              ├── RoPE (CPU, rotate_half)
                              ├── GQA Attention (CPU, 14 heads / 2 KV heads)
                              ├── O Projection (ANE conv kernel)
                              ├── Residual (CPU)
                              ├── RMSNorm (CPU)
                              ├── Gate/Up Projection (ANE conv kernel)
                              ├── SiLU + elementwise mul (CPU)
                              ├── Down Projection (ANE conv kernel)
                              └── Residual (CPU)

Quick Start

# 1. Convert weights from HuggingFace safetensors to flat binary
pip install safetensors torch transformers
python3 convert_weights.py /path/to/Qwen2.5-0.5B-Instruct qwen05b.bin

# 2. Build
xcrun clang -O2 -framework Foundation -framework IOSurface \
  -framework CoreML -framework Accelerate -ldl -lobjc \
  -o qwen_ane main.m

# 3. Run (pass space-separated token IDs)
./qwen_ane qwen05b.bin "151644 8948 198 2610 525 264 10950 17847 13" 20

# 4. With tokenizer (requires transformers)
python3 run.py "Say hello in one word."

Output

=== Qwen2.5-0.5B ANE Inference ===

Loading weights...
Config: dim=896 hidden=4864 layers=24 heads=14 kv_heads=2 vocab=151936
Compiling ANE kernels (169 total)...
Compile time: 5.1s

Prompt: 28 tokens, generating up to 10
Prefill: 64.2 t/s (28 tokens)
OUT: 9707 13 151645
Decode: 82.4 t/s (2 tokens)

→ "Hello."  (matches PyTorch exactly)

Files

File What
qwen_ane_infer.h Full 24-layer transformer forward pass, ANE kernel compilation, KV cache
main.m Weight loader, token I/O, main generation loop
convert_weights.py HuggingFace safetensors → flat f32 binary (includes Q/K/V biases)
run.py Python wrapper with HuggingFace tokenizer

Model Support

Currently implements Qwen2.5 architecture:

  • GQA attention (grouped-query, n_headsn_kv_heads)
  • rotate_half RoPE (not interleaved pairs)
  • SwiGLU FFN (gate + up + silu + down)
  • Q/K/V bias (Qwen-specific)
  • Tied word embeddings (lm_head = embed)
  • Chunked LM head (vocab > 65536 exceeds ANE max dim)

Adapting to other architectures (LLaMA, Gemma, Mistral) requires:

  1. Adjusting the config constants in qwen_ane_infer.h
  2. Updating convert_weights.py for the weight naming scheme
  3. Removing Q/K/V bias handling if the model doesn't have them
  4. Switching RoPE to interleaved pairs if needed

Requirements

  • macOS 15+ on Apple Silicon (M1/M2/M3/M4)
  • Xcode Command Line Tools (for xcrun clang)
  • Python 3.9+ with safetensors, torch, transformers (for weight conversion)

Known Limitations

  • CPU projections only — ANE baked-weight conv kernels compile successfully but produce incorrect output (FP16 weight blob format mismatch). The USE_ANE_PROJECTIONS toggle exists but defaults to 0 (CPU via Accelerate BLAS). Fixing this would push decode speed from 82 t/s to 120+ t/s.
  • No persistent server — each invocation recompiles 169 kernels (~5s). A server mode that compiles once and serves via HTTP would eliminate this overhead.
  • Single model — hardcoded for Qwen2.5-0.5B. Needs parameterization for other sizes.
  • f32 weights — 1.9GB on disk. FP16 or quantized weight support would halve this.

How It Works

The key insight from maderix's reverse engineering: the ANE executes compiled MIL (Machine Learning Intermediate Language) programs as atomic graph operations. Each linear projection becomes a MIL program with baked FP16 weights, compiled in-memory via _ANEInMemoryModel, and executed through IOSurface-based zero-copy I/O.

We chain 169 of these atomic operations (7 per transformer layer + 16 LM head chunks) with CPU-side element-wise ops in between. The ANE handles the compute-heavy matmuls; the CPU handles the memory-bound operations (attention scores, softmax, RoPE).

License

Same as maderix/ANE — research and educational use.