2 Commits

Author SHA1 Message Date
zemog b476456736 Add LLM inference on ANE — first full transformer on Neural Engine without CoreML
Qwen2.5-0.5B (24 layers, 494M params) running directly on Apple Neural Engine
via _ANEInMemoryModel APIs. 169 ANE kernels compiled at startup.

- 82 tokens/sec decode, zero GPU usage
- Token-for-token match with PyTorch ("Hello." = [9707, 13, 151645])
- GQA attention (14 heads / 2 KV heads), rotate_half RoPE, SwiGLU FFN
- Q/K/V biases, tied embeddings, chunked LM head (vocab > ANE 65536 limit)
- CPU element-wise ops via Accelerate BLAS

Files: qwen_ane_infer.h (forward pass), main.m (loader + generation),
convert_weights.py (safetensors → flat binary), run.py (tokenizer wrapper)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:18:15 -05:00
zemog 21e8a58627 Qwen2.5-0.5B ANE inference — token-for-token match, 82 t/s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 09:30:04 -05:00
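
Note on the "chunked LM head" item in commit b476456736: the message says the vocabulary exceeds the ANE's 65536 limit, but it does not say how the projection is split. The sketch below is one plausible reading, not the code in qwen_ane_infer.h: slice the tied embedding matrix into row ranges no wider than the limit, run each slice as its own projection, and keep a running argmax. The names `chunk_ranges`, `project_chunk`, and `greedy_token` are hypothetical, and the pure-Python dot product only stands in for an ANE kernel call.

```python
# Sketch of a chunked LM head: split the vocabulary into slices that each fit
# under a per-kernel output limit, then take the argmax across all slices.
# ANE_MAX_OUT is the limit quoted in the commit message; every function name
# here is illustrative and does not come from the repository.

ANE_MAX_OUT = 65536


def chunk_ranges(vocab_size, max_out=ANE_MAX_OUT):
    """Return [start, end) row ranges, each no wider than max_out."""
    return [(s, min(s + max_out, vocab_size)) for s in range(0, vocab_size, max_out)]


def project_chunk(hidden, weight_rows):
    """Stand-in for one kernel call: logits for one slice of the vocabulary."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight_rows]


def greedy_token(hidden, embedding, max_out=ANE_MAX_OUT):
    """Greedy argmax over the full vocabulary, evaluated one chunk at a time.

    `embedding` doubles as the LM head weight because the embeddings are tied.
    """
    best_id, best_logit = -1, float("-inf")
    for start, end in chunk_ranges(len(embedding), max_out):
        logits = project_chunk(hidden, embedding[start:end])
        for offset, logit in enumerate(logits):
            if logit > best_logit:
                # Convert the position inside the chunk back to a global token id.
                best_id, best_logit = start + offset, logit
    return best_id
```

Assuming the vocabulary size of 151936 from the published Qwen2.5-0.5B config (the commit itself does not state it), `chunk_ranges(151936)` yields three slices, the last one shorter than the other two. Only the winning logit per chunk has to be carried forward for greedy decoding; sampling would need the full logit vector concatenated instead.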