ANE/inference/README.md

7.9 KiB

ANE Inference — Full LLM on Apple Neural Engine

First complete LLM inference running directly on Apple's Neural Engine via reverse-engineered _ANEClient APIs. No CoreML. No Xcode compiler dependency at runtime.

Built on top of the maderix/ANE training runtime.

What This Does

Runs Qwen2.5-0.5B-Instruct (24 transformer layers, 494M parameters) on ANE:

  • 169 ANE kernels compiled at startup via _ANEInMemoryModel
  • ~60 tokens/sec decode on M4 Max
  • Pure C HTTP API — no Python needed for serving
  • BPE tokenizer in C — send plain text, get plain text back
  • ~6s cold start, then instant responses in server mode

Quick Start (One Command)

cd inference
./setup.sh

This automatically:

  1. Creates a Python venv and installs dependencies
  2. Downloads Qwen/Qwen2.5-0.5B-Instruct from HuggingFace (~953 MB)
  3. Converts BF16 safetensors to f32 binary format (~1.9 GB)
  4. Builds the qwen_ane binary
  5. Runs a smoke test

After setup, you're ready to go.

The fastest way to use inference. Single process, zero Python overhead.

# Start server (compiles 169 ANE kernels on first launch, ~6s)
./qwen_ane qwen05b.bin --http 8000

# Query with plain text — tokenization happens in C
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is 2+2?", "max_tokens": 50}'

Response:

{
  "text": "2+2 equals 4.",
  "prompt_tokens": 29,
  "gen_tokens": 8,
  "prefill_tps": 66.2,
  "decode_tps": 57.3,
  "elapsed_s": 0.608
}

Endpoints

Method Path Description
POST /v1/completions Generate text from a prompt
GET /health Server status check

POST /v1/completions

{
  "prompt": "Your question here",
  "max_tokens": 50,
  "system": "You are a helpful assistant."
}
  • prompt (required): The user message
  • max_tokens (optional, default 50, max 512): Maximum tokens to generate
  • system (optional): System prompt override

Options

# Custom port
./qwen_ane qwen05b.bin --http 9000

# Custom model directory (for tokenizer files)
./qwen_ane qwen05b.bin --http 8000 --model-dir /path/to/Qwen2.5-0.5B-Instruct

Default model directory: ~/models/Qwen2.5-0.5B-Instruct

Other Modes

Socket server (for programmatic access)

# Terminal 1: start server
./qwen_ane qwen05b.bin --server /tmp/qwen_ane.sock

# Terminal 2: query with run.py (auto-detects socket)
python3 run.py "What is 2+2?"

# Or query directly with nc
echo '{"tokens": [151644, 8948, 198], "max_tokens": 50}' | nc -U /tmp/qwen_ane.sock

Stdin server (for piping/scripting)

./qwen_ane qwen05b.bin --server
# Send space-separated token IDs, pipe char separates max_tokens:
# 151644 8948 198 2610 525|20

Single-shot (no server)

# Raw token IDs
./qwen_ane qwen05b.bin "151644 8948 198 2610 525 264 10950 17847 13" 20

# With Python tokenizer
python3 run.py "Say hello in one word."

Python API server (alternative)

If you prefer Python for the HTTP layer:

./qwen_ane qwen05b.bin --server /tmp/qwen_ane.sock
python3 api_server.py --port 8000

Throughput Benchmark

Run the standardized benchmark to measure your hardware's performance:

./benchmark.sh

This runs 5 prompts of varying length, measures prefill and decode tokens/sec in server mode, tests cold start latency, and checks decode speed consistency.

Sample output (M4 Max, 128 GB):

Prompt        Input Output Prefill(t/s)  Decode(t/s)  Latency(ms)
──────────────────────────────────────────────────────────────────
tiny             23     10         53.7         53.6          632
short            29      8         66.2         49.5          628
medium           33     84         63.4         55.3         2064
long             36    200         66.4         54.5         4235
stress          122     11         58.6         58.5         2303
──────────────────────────────────────────────────────────────────
Average                            61.7         54.3

Cold start (single-shot): ~6.2s (includes ANE kernel compilation)

Results are saved to benchmark_results.json for programmatic use.

Compare with LM Studio

The benchmark script prints instructions for running the same prompts in LM Studio:

  1. Download LM Studio
  2. Search for and download Qwen2.5-0.5B-Instruct (GGUF Q4_K_M or Q8_0)
  3. Load the model, start the server (Developer tab, port 1234)
  4. Run the same prompts and compare tokens/sec:
curl http://localhost:1234/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-0.5b-instruct","system_prompt":"You are a helpful assistant.","input":"What is 2+2?"}'

Note: LM Studio uses quantized GGUF weights (CPU/GPU) while we use full BF16 precision on the Neural Engine.

Performance

Mode First prompt Subsequent prompts
Single-shot ~6s ~6s (recompiles each time)
Server (socket/HTTP) ~6s (startup) ~0.5s

Architecture

Token -> Embedding (CPU) -> 24x Transformer Layer -> LM Head (CPU) -> Next Token
                              |
                              +-- RMSNorm (CPU)
                              +-- Q/K/V Projection (ANE conv kernel)
                              +-- RoPE (CPU, rotate_half)
                              +-- GQA Attention (CPU, 14 heads / 2 KV heads)
                              +-- O Projection (ANE conv kernel)
                              +-- Residual (CPU)
                              +-- RMSNorm (CPU)
                              +-- Gate/Up Projection (ANE conv kernel)
                              +-- SiLU + elementwise mul (CPU)
                              +-- Down Projection (ANE conv kernel)
                              +-- Residual (CPU)

Files

File What
setup.sh One-command setup: downloads model, converts weights, builds binary
benchmark.sh Throughput benchmark with LM Studio comparison
main.m Entry point: weight loader, server modes, HTTP API
qwen_ane_infer.h Full 24-layer transformer forward pass, ANE kernel compilation, KV cache
tokenizer.h BPE tokenizer in C: vocab/merge loading, encode/decode, chat template
http_server.h Minimal HTTP/1.1 server: TCP, request parsing, JSON responses
convert_weights.py HuggingFace safetensors to flat f32 binary
run.py Python wrapper with HuggingFace tokenizer (auto-connects to socket server)
api_server.py Python HTTP API bridge to socket server (alternative to C HTTP)

Model

Qwen/Qwen2.5-0.5B-Instruct

  • 494M parameters, BFloat16
  • 24 layers, 896 dim, 4864 hidden
  • 14 attention heads, 2 KV heads (GQA)
  • 151,936 vocab size
  • Download: setup.sh handles this automatically

Requirements

  • macOS 15+ on Apple Silicon (M1/M2/M3/M4)
  • Xcode Command Line Tools (xcode-select --install)
  • Python 3.11+ (for weight conversion only, not needed for serving)

Known Limitations

  • CPU projections only — ANE baked-weight conv kernels compile but produce incorrect output (FP16 weight blob format mismatch). USE_ANE_PROJECTIONS defaults to 0 (CPU via Accelerate BLAS). Fixing this would increase decode speed significantly.
  • Single model — hardcoded for Qwen2.5-0.5B. Other sizes need config changes.
  • f32 weights — 1.9GB on disk. FP16 weight support would halve this.
  • Single-threaded HTTP — handles one request at a time. Sufficient for local use.

License

Same as maderix/ANE — research and educational use.