# ANE Inference — Full LLM on Apple Neural Engine

First complete LLM inference running directly on Apple's Neural Engine via reverse-engineered `_ANEClient` APIs. No CoreML. No Xcode compiler dependency at runtime.

Built on top of the [maderix/ANE](https://github.com/maderix/ANE) training runtime.

## What This Does

Runs **Qwen2.5-0.5B-Instruct** (24 transformer layers, 494M parameters) on ANE:

- **169 ANE kernels** compiled at startup via `_ANEInMemoryModel`
- **~50-60 tokens/sec** decode on M4 Max (54.3 average in the benchmark below)
- **Pure C HTTP API**: no Python needed for serving
- **BPE tokenizer in C**: send plain text, get plain text back
- **~6s cold start**, then ~0.5s responses in server mode

## Quick Start (One Command)

```bash
cd inference
./setup.sh
```

This automatically:

1. Creates a Python venv and installs dependencies
2. Downloads [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) from HuggingFace (~953 MB)
3. Converts BF16 safetensors to f32 binary format (~1.9 GB)
4. Builds the `qwen_ane` binary
5. Runs a smoke test

After setup, you're ready to go.
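
Step 3 is a widening conversion: a BF16 value is the upper 16 bits of an IEEE-754 float32, so converting means shifting the bits left by 16 and zero-filling the low half. The sketch below illustrates that on raw 16-bit values. It is not the code in `convert_weights.py`, and the function name `bf16_bits_to_f32` is hypothetical.

```python
import struct


def bf16_bits_to_f32(bits: int) -> float:
    """Widen one BF16 value (given as a 16-bit integer) to a Python float."""
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    # BF16 keeps the sign bit, the 8 exponent bits and the top 7 mantissa bits
    # of float32, so the float32 pattern is the BF16 pattern plus 16 zero bits.
    f32_bits = bits << 16
    return struct.unpack("<f", struct.pack("<I", f32_bits))[0]


print(bf16_bits_to_f32(0x3F80))  # 1.0
print(bf16_bits_to_f32(0xC0A0))  # -5.0
```

Because every 2-byte value becomes 4 bytes, the converted file is roughly twice the size of the download, which matches the ~953 MB and ~1.9 GB figures above.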
## HTTP API (Recommended)

The fastest way to run inference. Single process, zero Python overhead.

```bash
# Start server (compiles 169 ANE kernels at startup, ~6s)
./qwen_ane qwen05b.bin --http 8000

# Query with plain text; tokenization happens in C
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is 2+2?", "max_tokens": 50}'
```

Response:

```json
{
  "text": "2+2 equals 4.",
  "prompt_tokens": 29,
  "gen_tokens": 8,
  "prefill_tps": 66.2,
  "decode_tps": 57.3,
  "elapsed_s": 0.608
}
```
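
A minimal Python client sketch for the request above, using only the standard library. It assumes the server is listening on port 8000 and returns the fields shown in the sample response. `complete` and `estimate_elapsed` are illustrative names, and the prefill-plus-decode relation is an assumption that happens to fit the sample numbers, not documented behavior.

```python
import json
import urllib.request


def complete(prompt, max_tokens=50, url="http://localhost:8000/v1/completions"):
    """POST a prompt to the server and return the decoded JSON response."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def estimate_elapsed(resp):
    """Estimate wall time as prefill time plus decode time (assumed relation)."""
    prefill_s = resp["prompt_tokens"] / resp["prefill_tps"]
    decode_s = resp["gen_tokens"] / resp["decode_tps"]
    return prefill_s + decode_s


# The sample response from above, used here so the sketch runs without a server.
sample = {
    "text": "2+2 equals 4.",
    "prompt_tokens": 29,
    "gen_tokens": 8,
    "prefill_tps": 66.2,
    "decode_tps": 57.3,
    "elapsed_s": 0.608,
}
print(f"estimated {estimate_elapsed(sample):.3f}s, reported {sample['elapsed_s']}s")
```

With a running server, `complete("What is 2+2?")` should return a dict shaped like `sample`. For the sample numbers the estimate comes to about 0.58 s against the reported 0.608 s; the gap is presumably tokenization and HTTP overhead.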
### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | `/v1/completions` | Generate text from a prompt |
| GET | `/health` | Server status check |

### POST /v1/completions

```json
{
  "prompt": "Your question here",
  "max_tokens": 50,
  "system": "You are a helpful assistant."
}
```

- `prompt` (required): The user message
- `max_tokens` (optional, default 50, max 512): Maximum tokens to generate
- `system` (optional): System prompt override
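
The parameter rules above can be enforced on the client before a request is sent. The sketch below builds the JSON body with the documented default and maximum. `build_payload` is a hypothetical helper, and clamping out-of-range values client-side is a design choice made here; how the server itself treats a `max_tokens` above 512 is not stated.

```python
import json

DEFAULT_MAX_TOKENS = 50   # documented default
MAX_TOKENS_LIMIT = 512    # documented maximum


def build_payload(prompt, max_tokens=None, system=None):
    """Return the JSON request body for POST /v1/completions as a string."""
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("prompt is required")
    if max_tokens is None:
        max_tokens = DEFAULT_MAX_TOKENS
    # Clamp into the documented range instead of relying on server behavior.
    max_tokens = max(1, min(int(max_tokens), MAX_TOKENS_LIMIT))
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    if system is not None:
        # Only sent when the caller wants to override the system prompt.
        payload["system"] = system
    return json.dumps(payload)


print(build_payload("Your question here", max_tokens=9999))
```

Omitting `system` from the body when it is not set leaves the server's default system prompt in place.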
### Options

```bash
# Custom port
./qwen_ane qwen05b.bin --http 9000

# Custom model directory (for tokenizer files)
./qwen_ane qwen05b.bin --http 8000 --model-dir /path/to/Qwen2.5-0.5B-Instruct
```

Default model directory: `~/models/Qwen2.5-0.5B-Instruct`

## Other Modes
### Socket server (for programmatic access)

```bash
# Terminal 1: start server
./qwen_ane qwen05b.bin --server /tmp/qwen_ane.sock

# Terminal 2: query with run.py (auto-detects socket)
python3 run.py "What is 2+2?"

# Or query directly with nc
echo '{"tokens": [151644, 8948, 198], "max_tokens": 50}' | nc -U /tmp/qwen_ane.sock
```
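
For programmatic access without `nc`, the same request can be sent from Python over the Unix socket. This is a sketch under assumptions: it sends one newline-terminated JSON object like the `echo` example, half-closes the connection, and reads until the server closes its end. The reply format is not documented here, so it is returned as raw text. `encode_request` and `query_socket` are hypothetical names.

```python
import json
import socket


def encode_request(tokens, max_tokens=50):
    """Serialize a token-ID request as one JSON line, as in the nc example."""
    message = json.dumps({"tokens": list(tokens), "max_tokens": max_tokens})
    return (message + "\n").encode("utf-8")


def query_socket(tokens, max_tokens=50, path="/tmp/qwen_ane.sock"):
    """Send one request over the Unix socket and return the raw reply text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(encode_request(tokens, max_tokens))
        # Half-close to mark the end of the request (assumed to be acceptable
        # to the server; drop this line if the server expects an open socket).
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


print(encode_request([151644, 8948, 198]).decode("utf-8"), end="")
```

With the socket server running, `query_socket([151644, 8948, 198])` sends the same bytes as the `nc` command.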
### Stdin server (for piping/scripting)

```bash
./qwen_ane qwen05b.bin --server
# Send space-separated token IDs, pipe char separates max_tokens:
# 151644 8948 198 2610 525|20
```
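
The stdin line format is simple enough to build and check in a few lines. The helpers below are hypothetical and only mirror the format shown in the comment above. Treating a missing `|max_tokens` suffix as "not specified" is an assumption about input the example does not cover.

```python
def format_stdin_request(tokens, max_tokens):
    """Build one request line: space-separated token IDs, '|', then max_tokens."""
    ids = " ".join(str(int(t)) for t in tokens)
    return f"{ids}|{int(max_tokens)}"


def parse_stdin_request(line):
    """Inverse of format_stdin_request; returns (tokens, max_tokens or None)."""
    ids_part, _, max_part = line.strip().partition("|")
    tokens = [int(t) for t in ids_part.split()]
    max_tokens = int(max_part) if max_part else None
    return tokens, max_tokens


line = format_stdin_request([151644, 8948, 198, 2610, 525], 20)
print(line)  # 151644 8948 198 2610 525|20
```

The resulting string can then be written to the server's stdin. Terminating it with a newline is assumed here, since the server appears to read requests line by line.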
### Single-shot (no server)

```bash
# Raw token IDs
./qwen_ane qwen05b.bin "151644 8948 198 2610 525 264 10950 17847 13" 20

# With Python tokenizer
python3 run.py "Say hello in one word."
```

### Python API server (alternative)

If you prefer Python for the HTTP layer:

```bash
./qwen_ane qwen05b.bin --server /tmp/qwen_ane.sock
python3 api_server.py --port 8000
```

## Throughput Benchmark

Run the standardized benchmark to measure your hardware's performance:

```bash
./benchmark.sh
```

This runs 5 prompts of varying length, measures prefill and decode tokens/sec in server mode, tests cold start latency, and checks decode speed consistency.

Sample output (M4 Max, 128 GB):

```
Prompt      Input  Output  Prefill(t/s)  Decode(t/s)  Latency(ms)
──────────────────────────────────────────────────────────────────
tiny           23      10          53.7         53.6          632
short          29       8          66.2         49.5          628
medium         33      84          63.4         55.3         2064
long           36     200          66.4         54.5         4235
stress        122      11          58.6         58.5         2303
──────────────────────────────────────────────────────────────────
Average                            61.7         54.3

Cold start (single-shot): ~6.2s (includes ANE kernel compilation)
```

Results are saved to `benchmark_results.json` for programmatic use.
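
The `Average` row in the sample output is consistent with a plain unweighted mean over the five prompts. The sketch below recomputes it from the table values and also shows a token-weighted decode rate for comparison. It hardcodes the rows, because the schema of `benchmark_results.json` is not described here and is not guessed at.

```python
# (name, input_tokens, output_tokens, prefill_tps, decode_tps, latency_ms)
ROWS = [
    ("tiny", 23, 10, 53.7, 53.6, 632),
    ("short", 29, 8, 66.2, 49.5, 628),
    ("medium", 33, 84, 63.4, 55.3, 2064),
    ("long", 36, 200, 66.4, 54.5, 4235),
    ("stress", 122, 11, 58.6, 58.5, 2303),
]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg_prefill = mean(row[3] for row in ROWS)
avg_decode = mean(row[4] for row in ROWS)
print(f"Average prefill: {avg_prefill:.1f} t/s, decode: {avg_decode:.1f} t/s")

# Token-weighted alternative: total generated tokens over total decode time.
total_out = sum(row[2] for row in ROWS)
total_decode_s = sum(row[2] / row[4] for row in ROWS)
weighted_decode = total_out / total_decode_s
print(f"Token-weighted decode: {weighted_decode:.1f} t/s")
```

The weighted figure leans toward the long prompts, which generate most of the tokens, so it is the more representative number for sustained generation.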
### Compare with LM Studio

The benchmark script prints instructions for running the same prompts in LM Studio:

1. Download [LM Studio](https://lmstudio.ai)
2. Search for and download **Qwen2.5-0.5B-Instruct** (GGUF Q4_K_M or Q8_0)
3. Load the model, start the server (Developer tab, port 1234)
4. Run the same prompts and compare tokens/sec:

```bash
curl http://localhost:1234/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-0.5b-instruct","system_prompt":"You are a helpful assistant.","input":"What is 2+2?"}'
```

Note: LM Studio runs quantized GGUF weights on CPU/GPU, while this project uses the unquantized weights (the BF16 checkpoint converted to f32), so the two are not a like-for-like comparison.

## Performance

| Mode | First prompt | Subsequent prompts |
|------|-------------|-------------------|
| Single-shot | ~6s | ~6s (recompiles each time) |
| Server (socket/HTTP) | ~6s (startup) | ~0.5s |

## Architecture

```
Token -> Embedding (CPU) -> 24x Transformer Layer -> LM Head (CPU) -> Next Token
                            |
                            +-- RMSNorm (CPU)
                            +-- Q/K/V Projection (ANE conv kernel)
                            +-- RoPE (CPU, rotate_half)
                            +-- GQA Attention (CPU, 14 heads / 2 KV heads)
                            +-- O Projection (ANE conv kernel)
                            +-- Residual (CPU)
                            +-- RMSNorm (CPU)
                            +-- Gate/Up Projection (ANE conv kernel)
                            +-- SiLU + elementwise mul (CPU)
                            +-- Down Projection (ANE conv kernel)
                            +-- Residual (CPU)
```
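
The CPU-side steps in the diagram are standard transformer building blocks. The pure-Python sketch below shows what RMSNorm, the `rotate_half` form of RoPE, and the SiLU gate compute on a single vector. It is for illustration only and is not taken from `qwen_ane_infer.h`. The `eps` and `theta` defaults are assumptions; the real values come from the model's configuration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then scale by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gated_activation(gate, up):
    """The 'SiLU + elementwise mul' step between Gate/Up and Down projections."""
    return [silu(g) * u for g, u in zip(gate, up)]


def rotate_half(x):
    """Split x into halves (a, b) and return (-b, a)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rope(x, position, theta=1_000_000.0):
    """Rotate one head vector: x * cos + rotate_half(x) * sin."""
    half = len(x) // 2
    # One rotation frequency per pair (i, i + half); theta is an assumed base.
    angles = [position / (theta ** (2 * i / len(x))) for i in range(half)]
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rotated = rotate_half(x)
    return [a * c + b * s for a, b, c, s in zip(x, rotated, cos, sin)]


print(rotate_half([1.0, 2.0, 3.0, 4.0]))  # [-3.0, -4.0, 1.0, 2.0]
```

RoPE is a pure rotation, so it leaves the length of each head vector unchanged, and position 0 leaves the vector itself unchanged.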
## Files

| File | What |
|------|------|
| `setup.sh` | One-command setup: downloads model, converts weights, builds binary |
| `benchmark.sh` | Throughput benchmark with LM Studio comparison |
| `main.m` | Entry point: weight loader, server modes, HTTP API |
| `qwen_ane_infer.h` | Full 24-layer transformer forward pass, ANE kernel compilation, KV cache |
| `tokenizer.h` | BPE tokenizer in C: vocab/merge loading, encode/decode, chat template |
| `http_server.h` | Minimal HTTP/1.1 server: TCP, request parsing, JSON responses |
| `convert_weights.py` | HuggingFace safetensors to flat f32 binary |
| `run.py` | Python wrapper with HuggingFace tokenizer (auto-connects to socket server) |
| `api_server.py` | Python HTTP API bridge to socket server (alternative to C HTTP) |

## Model

**[Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)**

- 494M parameters, BFloat16
- 24 layers, hidden size 896, MLP intermediate size 4864
- 14 attention heads, 2 KV heads (GQA)
- 151,936 vocab size
- Download: `setup.sh` handles this automatically
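
The dimensions above are enough to reproduce the parameter count and the GQA head layout. The sketch below makes three assumptions that are not stated in this README: Q/K/V projections carry biases while the O projection does not, the LM head shares weights with the embedding table, and each layer has two RMSNorm weight vectors plus one final norm.

```python
HIDDEN = 896          # model dimension
INTERMEDIATE = 4864   # MLP intermediate size
LAYERS = 24
HEADS = 14
KV_HEADS = 2
VOCAB = 151_936

head_dim = HIDDEN // HEADS          # 64 values per attention head
kv_dim = KV_HEADS * head_dim        # 128: width of the K and V projections
queries_per_kv = HEADS // KV_HEADS  # 7 query heads share each KV head


def kv_head_for(q_head):
    """Index of the KV head that a given query head attends with (GQA)."""
    return q_head // queries_per_kv


attn = HIDDEN * HIDDEN + HIDDEN           # Q projection + bias (assumed)
attn += 2 * (HIDDEN * kv_dim + kv_dim)    # K and V projections + biases (assumed)
attn += HIDDEN * HIDDEN                   # O projection, no bias (assumed)
mlp = 3 * HIDDEN * INTERMEDIATE           # gate, up and down projections
norms = 2 * HIDDEN                        # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

# Tied embedding / LM head (assumed) plus the final RMSNorm.
total = VOCAB * HIDDEN + LAYERS * per_layer + HIDDEN
print(f"{total:,} parameters, {total * 4 / 1e9:.2f} GB as f32")
```

Under these assumptions the total is 494,032,768 parameters, which matches the 494M figure, and about 1.98 GB (1.84 GiB) at 4 bytes each, in line with the ~1.9 GB converted file.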
## Requirements

- macOS 15+ on Apple Silicon (M1/M2/M3/M4)
- Xcode Command Line Tools (`xcode-select --install`)
- Python 3.11+ (for weight conversion only, not needed for serving)

## Known Limitations

- **CPU projections only**: ANE baked-weight conv kernels compile but produce incorrect output (FP16 weight blob format mismatch), so `USE_ANE_PROJECTIONS` defaults to 0 and the projections run on the CPU via Accelerate BLAS. Fixing this would increase decode speed significantly.
- **Single model**: hardcoded for Qwen2.5-0.5B. Other sizes need config changes.
- **f32 weights**: 1.9 GB on disk. FP16 weight support would halve this.
- **Single-threaded HTTP**: handles one request at a time. Sufficient for local use.

## License

Same as maderix/ANE: research and educational use.