zemog
|
b476456736
|
Add LLM inference on ANE — first full transformer on Neural Engine without CoreML
Qwen2.5-0.5B (24 layers, 494M params) running directly on Apple Neural Engine
via _ANEInMemoryModel APIs. 169 ANE kernels compiled at startup.
- 82 tokens/sec decode, zero GPU usage
- Token-for-token match with PyTorch ("Hello." = [9707, 13, 151645])
- GQA attention (14 heads / 2 KV heads), rotate_half RoPE, SwiGLU FFN
- Q/K/V biases, tied embeddings, chunked LM head (vocab > ANE 65536 limit)
- CPU element-wise ops via Accelerate BLAS
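
For reference, the attention and LM-head items above can be sketched in NumPy. This is a minimal illustration, not the code in qwen_ane_infer.h: the function names, the `base` default, and the reading of the 65536 limit as a cap on output rows per kernel are all assumptions made for the example.

```python
import numpy as np

def rotate_half(x):
    """Map the two halves of the last axis [a, b] to [-b, a]."""
    half = x.shape[-1] // 2
    return np.concatenate([-x[..., half:], x[..., :half]], axis=-1)

def apply_rope(x, pos, base=10000.0):
    """Apply rotate_half-style RoPE to x of shape (heads, head_dim) at one position.

    `base` is a placeholder; use the value from the model's config.
    """
    head_dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    # The rotate_half layout pairs element i with element i + head_dim // 2,
    # so the per-pair angles are repeated for both halves.
    angles = np.concatenate([pos * inv_freq, pos * inv_freq])
    return x * np.cos(angles) + rotate_half(x) * np.sin(angles)

def kv_head_for(q_head, n_heads=14, n_kv_heads=2):
    """GQA: each consecutive group of query heads shares one KV head."""
    return q_head // (n_heads // n_kv_heads)

def chunked_lm_head(hidden, embed, max_rows=65536):
    """Compute logits = embed @ hidden in row chunks of at most max_rows.

    With tied embeddings, `embed` (vocab, dim) doubles as the LM head weight.
    Assumption: the limit applies to the number of output rows per kernel.
    """
    vocab = embed.shape[0]
    logits = np.empty(vocab, dtype=np.result_type(embed, hidden))
    for start in range(0, vocab, max_rows):
        end = min(start + max_rows, vocab)
        logits[start:end] = embed[start:end] @ hidden
    return logits
```

With 14 query heads and 2 KV heads, query heads 0-6 read KV head 0 and heads 7-13 read KV head 1. The chunked head yields the same logits as a single matrix-vector product; only the kernel sizes change.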
Files: qwen_ane_infer.h (forward pass), main.m (loader + generation),
convert_weights.py (safetensors → flat binary), run.py (tokenizer wrapper)
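
A rough sketch of what a "flat binary" weight dump can look like, assuming tensors are written back to back as little-endian float32 in a fixed order, with shapes known to the loader. The actual layout used by convert_weights.py and main.m is not described here, so `write_flat`, `read_flat`, and the layout itself are hypothetical.

```python
import numpy as np

def write_flat(tensors, order, path):
    """Write tensors back to back as little-endian float32, in the given order."""
    with open(path, "wb") as f:
        for name in order:
            f.write(np.ascontiguousarray(tensors[name], dtype="<f4").tobytes())

def read_flat(path, shapes, order):
    """Read tensors back using the same order and the known shapes."""
    data = np.fromfile(path, dtype="<f4")
    out, offset = {}, 0
    for name in order:
        count = int(np.prod(shapes[name]))
        out[name] = data[offset:offset + count].reshape(shapes[name])
        offset += count
    return out
```

A headerless layout like this keeps the loader trivial (one read, then pointer offsets), at the cost of requiring the writer and reader to agree on tensor order and shapes out of band.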
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-03-03 10:18:15 -05:00 |