ANE Inference — Full LLM on Apple Neural Engine
First complete LLM inference running directly on Apple's Neural Engine via reverse-engineered _ANEClient APIs. No CoreML. No Xcode compiler dependency at runtime.
Built on top of the maderix/ANE training runtime.
What This Does
Runs Qwen2.5-0.5B-Instruct (24 transformer layers, 494M parameters) on ANE:
- 169 ANE kernels compiled at startup via _ANEInMemoryModel
- ~60 tokens/sec decode on M4 Max
- Pure C HTTP API — no Python needed for serving
- BPE tokenizer in C — send plain text, get plain text back
- ~6s cold start, then ~0.5s responses in server mode
Quick Start (One Command)
cd inference
./setup.sh
This automatically:
- Creates a Python venv and installs dependencies
- Downloads Qwen/Qwen2.5-0.5B-Instruct from HuggingFace (~953 MB)
- Converts BF16 safetensors to f32 binary format (~1.9 GB)
- Builds the qwen_ane binary
- Runs a smoke test
After setup, you're ready to go.
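The conversion step doubles the on-disk size because each weight grows from 2 bytes to 4. A BF16 value is the upper 16 bits of an IEEE-754 float32, so widening is a lossless 16-bit shift. The sketch below illustrates that idea with the standard library only; it is not the project's convert_weights.py, and the function names are hypothetical.

```python
import struct


def bf16_to_f32(bits: int) -> float:
    """Widen one bfloat16 bit pattern (0..0xFFFF) to a Python float."""
    # BF16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits
    # of a float32, so padding with 16 zero bits gives the exact f32 value.
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def convert_bf16_buffer(raw: bytes) -> bytes:
    """Convert little-endian BF16 bytes to little-endian f32 bytes."""
    if len(raw) % 2:
        raise ValueError("BF16 buffer length must be even")
    n = len(raw) // 2
    halves = struct.unpack(f"<{n}H", raw)
    # Each 2-byte input becomes a 4-byte output, hence the 2x file size.
    return struct.pack(f"<{n}I", *(h << 16 for h in halves))
```

For real checkpoints a vectorized version (for example with NumPy) would be far faster; the per-element logic stays the same.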
HTTP API (Recommended)
The fastest way to run inference. Single process, zero Python overhead.
# Start server (compiles 169 ANE kernels at startup, ~6s)
./qwen_ane qwen05b.bin --http 8000
# Query with plain text — tokenization happens in C
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "What is 2+2?", "max_tokens": 50}'
Response:
{
"text": "2+2 equals 4.",
"prompt_tokens": 29,
"gen_tokens": 8,
"prefill_tps": 66.2,
"decode_tps": 57.3,
"elapsed_s": 0.608
}
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/completions | Generate text from a prompt |
| GET | /health | Server status check |
POST /v1/completions
{
"prompt": "Your question here",
"max_tokens": 50,
"system": "You are a helpful assistant."
}
- prompt (required): The user message
- max_tokens (optional, default 50, max 512): Maximum tokens to generate
- system (optional): System prompt override
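For callers that prefer a script over curl, the request fields above can be wrapped in a small client. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the client-side range check on max_tokens is an assumption: how the server itself treats out-of-range values is not documented here.

```python
import json
import urllib.request

DEFAULT_MAX_TOKENS = 50   # documented default
MAX_TOKENS_LIMIT = 512    # documented maximum


def build_payload(prompt, max_tokens=DEFAULT_MAX_TOKENS, system=None) -> bytes:
    """Encode a /v1/completions request body as UTF-8 JSON."""
    if not prompt:
        raise ValueError("prompt is required")
    # Assumption: reject out-of-range values locally instead of relying
    # on whatever the server does with them.
    if not 1 <= max_tokens <= MAX_TOKENS_LIMIT:
        raise ValueError(f"max_tokens must be in 1..{MAX_TOKENS_LIMIT}")
    body = {"prompt": prompt, "max_tokens": max_tokens}
    if system is not None:
        body["system"] = system  # optional system prompt override
    return json.dumps(body).encode("utf-8")


def complete(prompt, max_tokens=DEFAULT_MAX_TOKENS, system=None,
             base_url="http://localhost:8000") -> dict:
    """POST to /v1/completions and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url + "/v1/completions",
        data=build_payload(prompt, max_tokens, system),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running on port 8000, `complete("What is 2+2?")["text"]` should return the generated text.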
Options
# Custom port
./qwen_ane qwen05b.bin --http 9000
# Custom model directory (for tokenizer files)
./qwen_ane qwen05b.bin --http 8000 --model-dir /path/to/Qwen2.5-0.5B-Instruct
Default model directory: ~/models/Qwen2.5-0.5B-Instruct
Other Modes
Socket server (for programmatic access)
# Terminal 1: start server
./qwen_ane qwen05b.bin --server /tmp/qwen_ane.sock
# Terminal 2: query with run.py (auto-detects socket)
python3 run.py "What is 2+2?"
# Or query directly with nc
echo '{"tokens": [151644, 8948, 198], "max_tokens": 50}' | nc -U /tmp/qwen_ane.sock
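The nc one-liner above can also be written as a small Python client. This sketch assumes the wire format matches that example: one JSON object followed by a newline, with the server closing the connection once the reply is sent. The response framing and the function names are assumptions, not taken from the server source.

```python
import json
import socket


def encode_request(tokens, max_tokens=50) -> bytes:
    """Build the JSON request line, mirroring the echo | nc example."""
    body = {"tokens": list(tokens), "max_tokens": max_tokens}
    return (json.dumps(body) + "\n").encode("utf-8")


def exchange(sock, tokens, max_tokens=50) -> bytes:
    """Send one request on a connected socket and read the full reply."""
    sock.sendall(encode_request(tokens, max_tokens))
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # assumption: server closes the socket when done
            break
        chunks.append(chunk)
    return b"".join(chunks)


def query(path, tokens, max_tokens=50, timeout=30.0) -> bytes:
    """Connect to the Unix socket at `path` and run one exchange."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        return exchange(sock, tokens, max_tokens)
```

For example, `query("/tmp/qwen_ane.sock", [151644, 8948, 198])` returns the raw reply bytes; decoding them is left to the caller because the reply schema is not specified above.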
Stdin server (for piping/scripting)
./qwen_ane qwen05b.bin --server
# Send space-separated token IDs, pipe char separates max_tokens:
# 151644 8948 198 2610 525|20
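The stdin line format is simple enough to generate from a script. The helpers below are a hypothetical sketch of that format as described above: space-separated token IDs, then a pipe, then max_tokens. What the server does when the pipe is missing is not documented, so the parser returns None for max_tokens in that case.

```python
def format_stdin_request(tokens, max_tokens) -> str:
    """Render token IDs and a limit as 'id id id|max_tokens'."""
    return " ".join(str(int(t)) for t in tokens) + f"|{int(max_tokens)}"


def parse_stdin_request(line: str):
    """Split a request line back into (token_ids, max_tokens or None)."""
    ids_part, sep, limit_part = line.strip().partition("|")
    tokens = [int(t) for t in ids_part.split()]
    # Assumption: a line without '|' carries no explicit limit.
    max_tokens = int(limit_part) if sep and limit_part.strip() else None
    return tokens, max_tokens
```

The formatted line can then be written to the server's stdin, for example through `subprocess.Popen(..., stdin=subprocess.PIPE)`.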
Single-shot (no server)
# Raw token IDs
./qwen_ane qwen05b.bin "151644 8948 198 2610 525 264 10950 17847 13" 20
# With Python tokenizer
python3 run.py "Say hello in one word."
Python API server (alternative)
If you prefer Python for the HTTP layer:
./qwen_ane qwen05b.bin --server /tmp/qwen_ane.sock
python3 api_server.py --port 8000
Throughput Benchmark
Run the standardized benchmark to measure your hardware's performance:
./benchmark.sh
This runs 5 prompts of varying length, measures prefill and decode tokens/sec in server mode, tests cold start latency, and checks decode speed consistency.
Sample output (M4 Max, 128 GB):
Prompt Input Output Prefill(t/s) Decode(t/s) Latency(ms)
──────────────────────────────────────────────────────────────────
tiny 23 10 53.7 53.6 632
short 29 8 66.2 49.5 628
medium 33 84 63.4 55.3 2064
long 36 200 66.4 54.5 4235
stress 122 11 58.6 58.5 2303
──────────────────────────────────────────────────────────────────
Average 61.7 54.3
Cold start (single-shot): ~6.2s (includes ANE kernel compilation)
Results are saved to benchmark_results.json for programmatic use.
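The sample table can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the Average row and estimates latency as prefill time plus decode time. That latency model is an assumption about how the columns relate, not something benchmark.sh is known to compute, and the rows are hard-coded from the sample output rather than read from benchmark_results.json, whose schema is not described here.

```python
# (name, input_tokens, output_tokens, prefill_tps, decode_tps, latency_ms)
ROWS = [
    ("tiny",    23,  10, 53.7, 53.6,  632),
    ("short",   29,   8, 66.2, 49.5,  628),
    ("medium",  33,  84, 63.4, 55.3, 2064),
    ("long",    36, 200, 66.4, 54.5, 4235),
    ("stress", 122,  11, 58.6, 58.5, 2303),
]


def averages(rows):
    """Mean prefill and decode tokens/sec, rounded like the table."""
    prefill = sum(r[3] for r in rows) / len(rows)
    decode = sum(r[4] for r in rows) / len(rows)
    return round(prefill, 1), round(decode, 1)


def estimate_latency_ms(row) -> float:
    """Assumed model: latency = input/prefill_tps + output/decode_tps."""
    _, n_in, n_out, prefill_tps, decode_tps, _ = row
    return 1000.0 * (n_in / prefill_tps + n_out / decode_tps)


if __name__ == "__main__":
    print("averages:", averages(ROWS))
    for row in ROWS:
        est = estimate_latency_ms(row)
        print(f"{row[0]:<8} measured {row[5]:>5} ms, estimated {est:7.1f} ms")
```

Under this model every estimate falls a few tens of milliseconds below the measured latency, which would be consistent with a small fixed per-request overhead.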
Compare with LM Studio
The benchmark script prints instructions for running the same prompts in LM Studio:
- Download LM Studio
- Search for and download Qwen2.5-0.5B-Instruct (GGUF Q4_K_M or Q8_0)
- Load the model, start the server (Developer tab, port 1234)
- Run the same prompts and compare tokens/sec:
curl http://localhost:1234/api/v1/chat \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-0.5b-instruct","system_prompt":"You are a helpful assistant.","input":"What is 2+2?"}'
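Note: LM Studio runs quantized GGUF weights on CPU/GPU, while this runtime targets the Neural Engine with the unquantized BF16 checkpoint (stored as f32 on disk), so the two are not a like-for-like comparison.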
Performance
| Mode | First prompt | Subsequent prompts |
|---|---|---|
| Single-shot | ~6s | ~6s (recompiles each time) |
| Server (socket/HTTP) | ~6s (startup) | ~0.5s |
Architecture
Token -> Embedding (CPU) -> 24x Transformer Layer -> LM Head (CPU) -> Next Token
                            |
                            +-- RMSNorm (CPU)
                            +-- Q/K/V Projection (ANE conv kernel)
                            +-- RoPE (CPU, rotate_half)
                            +-- GQA Attention (CPU, 14 heads / 2 KV heads)
                            +-- O Projection (ANE conv kernel)
                            +-- Residual (CPU)
                            +-- RMSNorm (CPU)
                            +-- Gate/Up Projection (ANE conv kernel)
                            +-- SiLU + elementwise mul (CPU)
                            +-- Down Projection (ANE conv kernel)
                            +-- Residual (CPU)
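The CPU-side steps in the diagram are small enough to sketch in plain Python. This is an illustrative reference, not the code in qwen_ane_infer.h. The epsilon (1e-6) and RoPE base (1,000,000) are assumed from the published Qwen2.5 configuration, and all function names are hypothetical.

```python
import math

N_HEADS, N_KV_HEADS, DIM = 14, 2, 896
HEAD_DIM = DIM // N_HEADS  # 64 values per attention head


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by 1/sqrt(mean(x^2) + eps), then by the weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def rope_rotate_half(x, pos, theta=1_000_000.0):
    """RoPE with rotate_half pairing: element i rotates with i + d/2."""
    d, half = len(x), len(x) // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def kv_head_for(q_head: int) -> int:
    """GQA: each KV head is shared by N_HEADS / N_KV_HEADS query heads."""
    return q_head // (N_HEADS // N_KV_HEADS)


def silu_gate(gate, up):
    """SiLU(gate) * up, the elementwise step between the MLP projections."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
```

With 14 query heads and 2 KV heads, query heads 0 to 6 read KV head 0 and heads 7 to 13 read KV head 1, so the KV cache holds only 2 x 64 values per token per layer for keys and the same for values.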
Files
| File | What |
|---|---|
| setup.sh | One-command setup: downloads model, converts weights, builds binary |
| benchmark.sh | Throughput benchmark with LM Studio comparison |
| main.m | Entry point: weight loader, server modes, HTTP API |
| qwen_ane_infer.h | Full 24-layer transformer forward pass, ANE kernel compilation, KV cache |
| tokenizer.h | BPE tokenizer in C: vocab/merge loading, encode/decode, chat template |
| http_server.h | Minimal HTTP/1.1 server: TCP, request parsing, JSON responses |
| convert_weights.py | HuggingFace safetensors to flat f32 binary |
| run.py | Python wrapper with HuggingFace tokenizer (auto-connects to socket server) |
| api_server.py | Python HTTP API bridge to socket server (alternative to C HTTP) |
Model
- 494M parameters, BFloat16
- 24 layers, 896 dim, 4864 hidden
- 14 attention heads, 2 KV heads (GQA)
- 151,936 vocab size
- Download: setup.sh handles this automatically
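The 494M figure can be reproduced from the dimensions listed above. The sketch below makes three assumptions that come from the published Qwen2.5-0.5B architecture rather than from this README: the Q/K/V projections carry biases, the O and MLP projections do not, and the LM head shares the embedding matrix. Whether the flat f32 file stores exactly these tensors is not confirmed here.

```python
VOCAB, DIM, HIDDEN, LAYERS = 151_936, 896, 4_864, 24
N_HEADS, N_KV_HEADS = 14, 2
HEAD_DIM = DIM // N_HEADS          # 64
KV_DIM = N_KV_HEADS * HEAD_DIM     # 128


def params_per_layer() -> int:
    q = DIM * DIM + DIM            # Q projection + bias (assumed)
    k = DIM * KV_DIM + KV_DIM      # K projection + bias (assumed)
    v = DIM * KV_DIM + KV_DIM      # V projection + bias (assumed)
    o = DIM * DIM                  # O projection, no bias (assumed)
    mlp = 3 * DIM * HIDDEN         # gate, up and down projections
    norms = 2 * DIM                # two RMSNorm weight vectors
    return q + k + v + o + mlp + norms


def total_params() -> int:
    embedding = VOCAB * DIM        # assumed tied with the LM head
    final_norm = DIM
    return embedding + LAYERS * params_per_layer() + final_norm


def kv_cache_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """K and V for every layer, assuming f32 cache entries."""
    return LAYERS * 2 * KV_DIM * seq_len * bytes_per_value


if __name__ == "__main__":
    n = total_params()
    print(f"parameters: {n:,}")
    print(f"f32 size:   {n * 4 / 1e9:.2f} GB")
    print(f"bf16 size:  {n * 2 / 1e9:.2f} GB")
```

Under these assumptions the count comes to roughly 494M parameters, which at 4 bytes each is in line with the ~1.9 GB f32 file that setup.sh produces.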
Requirements
- macOS 15+ on Apple Silicon (M1/M2/M3/M4)
- Xcode Command Line Tools (xcode-select --install)
- Python 3.11+ (for weight conversion only, not needed for serving)
Known Limitations
- CPU projections only: ANE baked-weight conv kernels compile but produce incorrect output (FP16 weight blob format mismatch). USE_ANE_PROJECTIONS defaults to 0 (CPU via Accelerate BLAS). Fixing this would increase decode speed significantly.
- Single model: hardcoded for Qwen2.5-0.5B. Other sizes need config changes.
- f32 weights: 1.9 GB on disk. FP16 weight support would halve this.
- Single-threaded HTTP: handles one request at a time. Sufficient for local use.
License
Same as maderix/ANE — research and educational use.