berkus/ANE - ANE

Commit Graph


Author

SHA1

Message

Date

zemog

b476456736

Add LLM inference on ANE — first full transformer on Neural Engine without CoreML

Qwen2.5-0.5B (24 layers, 494M params) running directly on Apple Neural Engine
via _ANEInMemoryModel APIs. 169 ANE kernels compiled at startup.

- 82 tokens/sec decode, zero GPU usage
- Token-for-token match with PyTorch ("Hello." = [9707, 13, 151645])
- GQA attention (14 heads / 2 KV heads), rotate_half RoPE, SwiGLU FFN
- Q/K/V biases, tied embeddings, chunked LM head (vocab > ANE 65536 limit)
- CPU element-wise ops via Accelerate BLAS
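
Illustration of the chunked LM head bullet above: a minimal sketch, not the code in qwen_ane_infer.h. It assumes that a single kernel can produce at most 65536 outputs, so the vocabulary projection is split into consecutive row ranges and the partial logits are concatenated in vocabulary order. The function names, the plain-Python matvec standing in for an ANE kernel call, and the vocabulary size of 151936 (taken from the published Qwen2.5 config, not from this commit) are all assumptions.

```python
# Hypothetical sketch of a chunked LM head. Names and structure are
# illustrative assumptions, not taken from the repository.

ANE_MAX_OUT = 65536   # per-kernel output limit quoted in the commit message
VOCAB_SIZE = 151936   # assumed Qwen2.5 vocabulary size (from the public config)


def chunk_ranges(vocab_size, max_out):
    """Split [0, vocab_size) into consecutive ranges no longer than max_out."""
    return [(start, min(start + max_out, vocab_size))
            for start in range(0, vocab_size, max_out)]


def matvec(rows, x):
    """Plain dot-product matvec; stands in for one kernel invocation."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def chunked_lm_head(weight, hidden, max_out):
    """Compute logits one chunk at a time and concatenate in vocab order."""
    logits = []
    for start, end in chunk_ranges(len(weight), max_out):
        logits.extend(matvec(weight[start:end], hidden))
    return logits


def argmax(values):
    """Index of the largest logit (greedy decoding picks this token)."""
    return max(range(len(values)), key=values.__getitem__)
```

Under these assumptions, `chunk_ranges(VOCAB_SIZE, ANE_MAX_OUT)` yields three ranges, so the projection would need three kernels. With tied embeddings, `weight` would be the embedding matrix itself.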

Files: qwen_ane_infer.h (forward pass), main.m (loader + generation),
convert_weights.py (safetensors → flat binary), run.py (tokenizer wrapper)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-03 10:18:15 -05:00

zemog

21e8a58627

Qwen2.5-0.5B ANE inference — token-for-token match, 82 t/s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-03 09:30:04 -05:00

2 Commits