mirror of https://github.com/maderix/ANE.git
564 lines
29 KiB
Markdown
564 lines
29 KiB
Markdown
# ANE Internals: What We Know
|
|
|
|
A comprehensive guide to Apple's Neural Engine (ANE) based on reverse engineering, private API exploration, and community research. This extends and updates [hollance/neural-engine](https://github.com/hollance/neural-engine/tree/master/docs) with findings from direct hardware experimentation on M4 Max / macOS 15.
|
|
|
|
---
|
|
|
|
## Table of Contents
|
|
|
|
1. [How does the ANE work internally?](#1-how-does-the-ane-work-internally)
|
|
2. [Can I program the ANE directly?](#2-can-i-program-the-ane-directly)
|
|
3. [What can be compiled and run on ANE?](#3-what-can-be-compiled-and-run-on-ane)
|
|
4. [Security and safety mechanisms](#4-security-and-safety-mechanisms)
|
|
5. [Is the ANE 16-bit?](#5-is-the-ane-16-bit)
|
|
6. [ANE vs GPU vs CPU](#6-ane-vs-gpu-vs-cpu)
|
|
7. [Reverse engineering the ANE](#7-reverse-engineering-the-ane)
|
|
8. [How to verify ANE execution](#8-how-to-verify-ane-execution)
|
|
9. [References and external resources](#9-references-and-external-resources)
|
|
|
|
---
|
|
|
|
## 1. How does the ANE work internally?
|
|
|
|
> hollance/neural-engine says: "I don't think anyone outside Apple knows."
|
|
|
|
We now know substantially more.
|
|
|
|
### Hardware Architecture
|
|
|
|
The ANE is a fixed-function neural network accelerator integrated into Apple Silicon SoCs:
|
|
|
|
| Chip | ANE Cores | Peak TOPS | SRAM Budget |
|
|
|------|-----------|-----------|-------------|
|
|
| A12-A13 | 8 | 5 | ~4 MB |
|
|
| A14/M1 | 16 | 11 | ~16 MB |
|
|
| A15/M2 | 16 | 15.8 | ~24 MB |
|
|
| M4/M4 Pro/M4 Max | 16 | 38 | ~24-32 MB |
|
|
|
|
SRAM budget measured via `sram_probe.m` performance cliff detection on M4 Max:
|
|
- Peak efficiency at ~12.5 MB weights (282.6 GFLOPS/MB)
|
|
- First spill at ~32 MB (drops to 59.2 GFLOPS/MB)
|
|
- Catastrophic spilling at 128 MB (8.0 GFLOPS/MB)
|
|
|
|
The ANE operates on FP16 data exclusively. All I/O is through IOSurface shared memory buffers in `[1, C, 1, S]` channel-first FP16 layout.
|
|
|
|
### Compilation Pipeline
|
|
|
|
There are two paths from a neural network to ANE hardware execution:
|
|
|
|
**Standard CoreML path** (from [Black Hat Asia 2021, Wish Wu](https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers)):
|
|
|
|
```
|
|
ML model (TF/PyTorch/Caffe)
|
|
-> coremltools -> .mlmodel
|
|
-> coremlc (CoreML compiler) -> .mlmodelc/
|
|
-> espresso precompile -> net.plist + weights
|
|
-> ANECompiler (in ane_compiler_service) -> model.hwx
|
|
-> aned daemon -> H11ANEIn kernel driver (IOKit)
|
|
-> ANE firmware -> hardware registers
|
|
```
|
|
|
|
**Direct private API path** (what this project uses):
|
|
|
|
```
|
|
MIL text + weight blobs (in memory)
|
|
-> _ANEInMemoryModelDescriptor (ObjC object)
|
|
-> _ANEInMemoryModel.compileWithQoS: -> ANE binary (in temp dir)
|
|
-> _ANEInMemoryModel.loadWithQoS: -> loaded onto ANE hardware
|
|
-> _ANEInMemoryModel.evaluateWithQoS: -> execution via aned
|
|
```
|
|
|
|
The direct path bypasses CoreML, espresso, and the `.hwx` file format entirely. It compiles MIL (Model Intermediate Language) text directly into ANE-executable binary, loads it, and runs it. This is how we achieve both training and inference on the ANE without any CoreML dependency.
|
|
|
|
### System Architecture
|
|
|
|
```
|
|
+------------------+ +------------------+ +------------------+
|
|
| User Process | | aned daemon | | Kernel |
|
|
| | | | | |
|
|
| _ANEClient -----+---->| ANE scheduler +---->| H11ANEIn driver |
|
|
| (sharedConnection)| | (all interfaces) | | (IOKit) |
|
|
| | | | | |
|
|
| App gets 3 IOKit | | Compiles models | | Passes model.hwx |
|
|
| interfaces: | | Manages loading | | to ANE firmware |
|
|
| - open | | Handles requests | | |
|
|
| - close | +------------------+ +------------------+
|
|
| - programSend | |
|
|
| Request | v
|
|
+------------------+ +------------------+
|
|
| ANE Firmware |
|
|
| (co-processor) |
|
|
| |
|
|
| Parses register |
|
|
| operations from |
|
|
| compiled binary |
|
|
+------------------+
|
|
```
|
|
|
|
The `aned` daemon mediates between user processes and the kernel driver. Apps only get 3 IOKit interfaces (open, close, programSendRequest). The daemon has access to all driver interfaces, which is why `_ANEClient.sharedConnection` communicates through the daemon rather than directly to the kernel.
|
|
|
|
### Execution Paths
|
|
|
|
We have benchmarked four distinct ways to trigger ANE kernel execution:
|
|
|
|
| Method | API | Latency (64x32) | Latency (768x256) |
|
|
|--------|-----|------------------|--------------------|
|
|
| Standard | `model.evaluateWithQoS:options:request:error:` | 0.175 ms | 0.205 ms |
|
|
| Real-Time | `client.evaluateRealTimeWithModel:options:request:error:` | 0.093 ms | 0.246 ms |
|
|
| processRequest | `program.processRequest:model:qos:...` | 0.131 ms | 0.185 ms |
|
|
| Direct | `client.doEvaluateDirectWithModel:options:request:qos:error:` | 0.225 ms | N/A |
|
|
|
|
**Key finding**: At production kernel dimensions (768x256, matching Stories110M), all paths converge to ~0.2 ms per kernel. The RT speedup (1.88x) observed on small 64x32 kernels does not hold at production scale. The standard path remains the most reliable.
|
|
|
|
### Resource Limits
|
|
|
|
The ANE runtime leaks internal resources during compilation. After ~119 compiles per process, subsequent compilations fail silently. The workaround is checkpoint-and-restart: save weights and optimizer state, terminate the process, and re-launch with `--resume`.
|
|
|
|
With `MAX_COMPILES=100` (conservative) and 60 weight-bearing kernels per batch (12 layers x 5 kernels), only 1 training batch fits per process lifetime.
|
|
|
|
---
|
|
|
|
## 2. Can I program the ANE directly?
|
|
|
|
> hollance/neural-engine says: "Unfortunately not. You can only use the Neural Engine through Core ML."
|
|
|
|
**Yes, you can.** The `AppleNeuralEngine.framework` contains 67+ private Objective-C classes that provide direct access to the ANE without CoreML. This project uses them for both training and inference.
|
|
|
|
### Minimal Example
|
|
|
|
The core compilation/load/execution cycle in pseudocode:
|
|
|
|
```objc
|
|
#import <dlfcn.h>
|
|
#import <objc/runtime.h>
|
|
|
|
// Load the private framework
|
|
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
|
|
|
|
// Write MIL program as text
|
|
NSData *milData = [@"program(1.0) { ... }" dataUsingEncoding:NSUTF8StringEncoding];
|
|
|
|
// Create descriptor
|
|
id descriptor = [_ANEInMemoryModelDescriptor modelWithMILText:milData
|
|
weights:weightDict
|
|
optionsPlist:nil];
|
|
|
|
// Compile -> Load -> Run
|
|
id model = [_ANEInMemoryModel inMemoryModelWithDescriptor:descriptor];
|
|
[model compileWithQoS:21 options:nil error:&error];
|
|
[model loadWithQoS:21 options:nil error:&error];
|
|
|
|
// Create IOSurface I/O and request
|
|
id request = [_ANERequest requestWithInputs:@[inputSurface]
|
|
inputIndices:@[@0]
|
|
outputs:@[outputSurface]
|
|
outputIndices:@[@0]
|
|
weightsBuffer:nil
|
|
perfStats:nil
|
|
procedureIndex:0];
|
|
|
|
[model evaluateWithQoS:21 options:nil request:request error:&error];
|
|
```
|
|
|
|
A complete reusable wrapper is implemented in [`training/ane_runtime.h`](../training/ane_runtime.h) with functions:
|
|
- `ane_init()` -- load framework, resolve classes
|
|
- `ane_compile(kernel, mil_text, weight_dict)` -- compile MIL to ANE binary
|
|
- `ane_run(kernel)` -- standard execution path
|
|
- `ane_free(kernel)` -- unload and release resources
|
|
|
|
### MIL (Model Intermediate Language)
|
|
|
|
MIL is Apple's intermediate representation for neural network operations. Key facts:
|
|
|
|
- Text-based format: `program(1.0) { func main(...) { ... } }`
|
|
- Targets: `ios16`, `ios17`, `ios18` (determines available ops)
|
|
- All tensors are 4D: `[batch, channels, height, width]` or equivalently `[1, C, 1, S]`
|
|
- Convolutions (`conv`) are the workhorse: a 1x1 conv with `[out_ch, in_ch, 1, 1]` weights = matrix multiply
|
|
- Weights referenced via `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(64))`
|
|
- Weights are baked at compile time and cannot be swapped at runtime
|
|
|
|
Supported operations include: `conv`, `matmul`, `add`, `mul`, `sigmoid`, `softmax`, `reshape`, `transpose`, `concat`, `reduce_mean`, `rsqrt`, `cast`, `constexpr_affine_dequantize`, and more.
|
|
|
|
### Alternative: ANECompiler CLI
|
|
|
|
[ANETools](https://github.com/antgroup-skyward/ANETools) (from Wish Wu / Ant Group) provides command-line tools that invoke the ANECompiler module directly:
|
|
|
|
```bash
|
|
# Convert mlmodelc to ANE-compatible format
|
|
MLModelCToANECompiler input.mlmodelc output/
|
|
|
|
# Compile to hardware format
|
|
ANECompiler --target-arch ane_v5 --debug-mask 2147483647 net.plist weights/ output.hwx
|
|
|
|
# Disassemble compiled binary
|
|
ANEDisassembler output.hwx
|
|
```
|
|
|
|
The `--debug-mask` flag (set to max integer) generates intermediate files during compilation, revealing internal register operations.
|
|
|
|
---
|
|
|
|
## 3. What can be compiled and run on ANE?
|
|
|
|
Any computation expressible as a static MIL (Model Intermediate Language) dataflow graph that the E5 compiler accepts. The ANE is a fixed-function accelerator, not a general-purpose processor -- it executes predefined operation graphs, not arbitrary code.
|
|
|
|
### Verified Operations
|
|
|
|
These operations have been compiled to custom MIL programs and executed on ANE hardware with output validated against CPU reference implementations (see `test_mil_custom.m`):
|
|
|
|
| Category | Operations | Notes |
|
|
|----------|-----------|-------|
|
|
| Activations | `relu`, `gelu`, `softmax` | GELU supports EXACT, TANH_APPROXIMATION, SIGMOID_APPROXIMATION modes |
|
|
| Normalization | `layer_norm` | Epsilon type must match gamma/beta dtype |
|
|
| Attention | `scaled_dot_product_attention` | Fused Q@K^T/sqrt(d) + softmax + @V in a single op (iOS 18+) |
|
|
| Linear algebra | `linear` (const weights), `matmul` (runtime tensors) | `linear` requires compile-time constant weights; `matmul` supports runtime inputs |
|
|
| Type conversion | `cast` | fp32 <-> fp16. Required at ANE I/O boundaries |
|
|
| Elementwise | `add`, `mul`, `real_div` | Broadcasting supported |
|
|
| Shape | `reshape`, `transpose`, `concat`, `slice_by_index` | `concat` requires `interleave` param |
|
|
| Composite | Full transformer block (LN + SDPA + Residual + FFN + GELU) | Compiles and runs as a single ANE program (~0.21ms) |
|
|
|
|
### Available but Not Yet Tested
|
|
|
|
These are valid MIL operations that the E5 compiler should accept:
|
|
|
|
- `conv` -- convolutions (the upstream maderix/ANE repo uses these extensively for training)
|
|
- `reduce_sum`, `reduce_mean`, `reduce_max` -- reductions
|
|
- `gather`, `scatter` -- embedding lookups, KV cache writes
|
|
- `rsqrt`, `sqrt`, `exp`, `log`, `tanh` -- unary math
|
|
- `split`, `slice_by_size` -- tensor slicing
|
|
- `batch_norm`, `instance_norm` -- normalization variants
|
|
- Various pooling, padding, upsampling operations
|
|
|
|
### What Cannot Run on ANE
|
|
|
|
| Limitation | Detail |
|
|
|-----------|--------|
|
|
| No control flow | No loops, conditionals, or branching. MIL is a static dataflow graph. |
|
|
| No dynamic shapes | All tensor dimensions must be known at compile time. |
|
|
| No runtime weight updates | Weights are `const`, baked into the compiled binary. Changing weights requires recompilation (~10-50ms). |
|
|
| No arbitrary memory access | No pointers or indexing beyond what `gather`/`scatter` provide. |
|
|
| No custom ops | Only operations in Apple's MIL op set. No user-defined kernels at the hardware level. |
|
|
| No FP32 compute | ANE computes in FP16 only. FP32 inputs are cast to FP16 internally. |
|
|
|
|
### Implications for Training
|
|
|
|
The ANE can execute the forward pass and the matrix math of backpropagation (`matmul` for dX and dW gradients). However, training is impractical because weights are read-only constants. After computing weight gradients on ANE, the optimizer step (W -= lr * dW) must run on CPU, and the MIL program must be recompiled with updated weights before the next forward pass. This recompilation costs ~10-50ms per step, dominating training time. See [ANE_CHAINING_RESEARCH.md, Section 9](ANE_CHAINING_RESEARCH.md#9-ane-training-feasibility-analysis) for detailed analysis.
|
|
|
|
---
|
|
|
|
## 4. Security and Safety Mechanisms
|
|
|
|
The ANE has multiple layers of safety enforcement, but Apple's security model assumes access goes through CoreML. The private APIs we use bypass CoreML but still pass through the `aned` daemon and the E5 compiler.
|
|
|
|
### Compile-Time Safety
|
|
|
|
| Mechanism | What it does |
|
|
|-----------|-------------|
|
|
| MIL syntax validation | The E5 compiler rejects malformed MIL with `InvalidMILProgram` errors |
|
|
| Type checking | Tensor dtypes, shapes, and parameter types must match exactly. Mismatches cause compile errors (e.g., `layer_norm` epsilon must match gamma/beta dtype; `concat` axis must be `int32` scalar, not tensor) |
|
|
| Op validation | Unknown or unsupported operations are rejected |
|
|
| I/O matching | MIL input/output names and shapes must match the `MLModelDescription` passed to `MLE5Engine` |
|
|
|
|
### Runtime Safety
|
|
|
|
| Mechanism | What it does |
|
|
|-----------|-------------|
|
|
| Shape enforcement | Input tensors must match declared shape exactly -- `MultiArray shape doesn't match ML Program's expected shape` error on mismatch |
|
|
| Daemon mediation | ANE runs through the `aned` daemon (system service). User processes only get 3 IOKit interfaces: open, close, `programSendRequest` |
|
|
| IOSurface isolation | I/O memory is managed by the kernel via IOSurface. Cannot read/write arbitrary memory through them |
|
|
| SRAM limits | Programs exceeding the ANE SRAM budget (~24-32MB on M4 Max) are rejected or fall back to CPU/GPU |
|
|
| Compile limit | ~119 compiled programs per process before the compiler leaks enough resources to fail (resource exhaustion, not a security boundary) |
|
|
|
|
### Sandbox Interaction
|
|
|
|
The E5 runtime needs write access to `~/Library/Caches/<binary_name>/` for its ANE specialization cache. macOS app sandbox can block this, causing compilation to fail with permission errors. When running outside a sandbox (e.g., command-line tools), this directory is created automatically.
|
|
|
|
### What is NOT Protected
|
|
|
|
| Gap | Detail |
|
|
|-----|--------|
|
|
| No access control | No authentication or entitlement check for using the private APIs. Any process can call `_ANEClient.sharedConnection` |
|
|
| No rate limiting | Programs can be compiled in a loop until the ~119 limit exhausts resources |
|
|
| No MIL signing | No code signing validation on MIL text -- any syntactically valid program that passes the compiler's type checks will execute |
|
|
| No isolation between programs | Multiple programs from the same process share the ANE with no hardware-level isolation (the daemon schedules them) |
|
|
|
|
### Practical Risk Assessment
|
|
|
|
The ANE attack surface is limited because:
|
|
|
|
1. **Fixed-function hardware**: The ANE executes predefined neural network operations, not arbitrary instructions. There is no instruction pointer, no stack, and no way to jump to arbitrary code.
|
|
2. **Typed dataflow**: MIL programs operate on typed tensors with fixed shapes. There are no buffer overflows in the traditional sense -- the compiler enforces all dimensions at compile time.
|
|
3. **Daemon intermediary**: All ANE access goes through `aned`, which validates requests before forwarding to the kernel driver. Direct IOKit access to the ANE is restricted to 3 interfaces.
|
|
4. **No persistent state**: ANE programs don't persist across reboots. Compiled programs live in temp directories and caches that are cleaned by the OS.
|
|
|
|
The main risk of the private APIs is **stability**: these APIs are undocumented and may change with any macOS update, potentially breaking programs that depend on them.
|
|
|
|
---
|
|
|
|
## 5. Is the ANE 16-bit?
|
|
|
|
> hollance/neural-engine says: "It appears so."
|
|
|
|
**Confirmed.** The ANE operates in FP16 for both compute and storage:
|
|
|
|
- All IOSurface I/O must be FP16. Passing FP32 data produces zeros.
|
|
- MIL programs must use `fp16` I/O types (setting `g_fp16_io=1` in our codebase)
|
|
- F32-to-F16 conversion happens on the CPU before writing to IOSurfaces
|
|
- FP16 precision limits: values above ~65504 overflow, values below ~5.96e-8 underflow to zero
|
|
|
|
### Quantization Support
|
|
|
|
| Format | ANE Native? | Notes |
|
|
|--------|------------|-------|
|
|
| FP16 | Yes | Native compute and storage format |
|
|
| INT8 | Partial | Memory bandwidth savings only, no compute speedup. `constexpr_affine_dequantize` in MIL dequantizes to FP16 before compute |
|
|
| Q4 | No | Not supported. Requires GPU (Metal) or CPU dequantization |
|
|
| FP32 | No | Internally converted to FP16; higher precision lost |
|
|
|
|
Apple markets ANE TOPS using INT8, so the 38 TOPS figure for M4 is really ~19 TFLOPS in FP16 (each INT8 op counts as 1 TOP but FP16 ops count as 2).
|
|
|
|
---
|
|
|
|
## 6. ANE vs GPU vs CPU
|
|
|
|
Benchmarked on Qwen2.5-0.5B (dim=896, 24 layers, 494M params) on M4 Max:
|
|
|
|
### Decode Performance (single-token generation)
|
|
|
|
| Engine | Format | Weight Size | Decode t/s | Bottleneck |
|
|
|--------|--------|-------------|------------|------------|
|
|
| CPU AMX (cblas_sgemv) | F32 | 1.97 GB | ~91 t/s | Memory bandwidth |
|
|
| CPU AMX (cblas_sgemv) | F16->F32 | 658 MB disk | ~91 t/s | Memory bandwidth (F32 in RAM) |
|
|
| CPU AMX (cblas_sgemv) | Q4->F32 | 188 MB disk | ~91 t/s | Memory bandwidth (dequant at load) |
|
|
| Metal GPU (Q4 SIMD) | Q4 | 188 MB | ~10 t/s | Dispatch overhead (~400 dispatches/token) |
|
|
| LM Studio (MLX) | Q4 MLX | ~188 MB | 258-496 t/s | Optimized Metal kernels |
|
|
|
|
### Prefill Performance (batch prompt processing)
|
|
|
|
| Engine | Format | Prefill t/s | Method |
|
|
|--------|--------|-------------|--------|
|
|
| CPU AMX (cblas_sgemm) | F32 | 880-960 t/s | Batched matmul |
|
|
| CPU AMX (cblas_sgemv) | F32 | ~40 t/s | Sequential per-token |
|
|
|
|
### ANE Training Kernel Performance
|
|
|
|
| Metric | Value |
|
|
|--------|-------|
|
|
| Kernel latency | ~0.2 ms per kernel (768x256 production dims) |
|
|
| Peak TFLOPS | 11.14 (128x conv 512ch sp64) |
|
|
| Sustained training | 1.29-1.68 TFLOPS |
|
|
| ANE utilization | 8-11% of peak |
|
|
|
|
### When to use each
|
|
|
|
- **ANE**: Best for parallel FP16 operations where data stays on-chip (training kernels, fused attention). The ~119 compile limit and FP16-only restriction are significant constraints.
|
|
- **GPU (Metal)**: Best for large models (dim >= 4096) where native quantized matmul kernels (as in MLX/llama.cpp) can read Q4/Q8 data directly from GPU memory. Dispatch overhead dominates for small models.
|
|
- **CPU AMX**: Best for small/medium model decode (dim <= 896). `cblas_sgemv` uses the AMX coprocessor internally and achieves ~33% of theoretical bandwidth. Cannot be beaten by manual NEON, threading, or Metal for this model size.
|
|
|
|
---
|
|
|
|
## 7. Reverse engineering the ANE
|
|
|
|
### Prior Work
|
|
|
|
| Project | Focus | Key Contribution |
|
|
|---------|-------|-------------------|
|
|
| [hollance/neural-engine](https://github.com/hollance/neural-engine) | CoreML-level documentation | Comprehensive device list, layer compatibility, model surgery guides |
|
|
| [geohot/tinygrad ANE](https://github.com/tinygrad/tinygrad) | Driver-level reverse engineering | Initial IOKit driver analysis, ANE instruction format exploration |
|
|
| [Black Hat Asia 2021 (Wish Wu)](https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers) | Full stack: ML to HW registers | Documented compilation pipeline, .hwx format, security attack surfaces, FaceID ANE usage. Created ANEDisassembler. [Video](https://www.youtube.com/watch?v=1wvBDUnPNEo) |
|
|
| [ANETools](https://github.com/antgroup-skyward/ANETools) | CLI compilation and disassembly | ANECompiler CLI wrapper, ANEDisassembler for .hwx files, `debug_mask` flag for intermediate output |
|
|
| [eiln/anecc](https://github.com/eiln/anecc) | Independent ANE compiler | CoreML-to-ANE compiler for Asahi Linux, alternative compilation path |
|
|
| [freedomtan/coreml_to_ane_hwx](https://github.com/freedomtan/coreml_to_ane_hwx) | CoreML to .hwx conversion | Direct converter bypassing some CoreML steps |
|
|
| [maderix/ANE](https://github.com/maderix/ANE) | Training on ANE | First neural network training on ANE via private APIs |
|
|
| [maderix Substack](https://open.substack.com/pub/maderix/p/inside-the-m4-apple-neural-engine) | M4 ANE deep-dive | Detailed M4 ANE architecture analysis, SRAM probing, kernel fusion |
|
|
|
|
### Our Discoveries: Private API Class Hierarchy
|
|
|
|
We have documented 20+ private Objective-C classes in `AppleNeuralEngine.framework`:
|
|
|
|
```
|
|
NSObject
|
|
|-- _ANEClient (singleton, daemon connection)
|
|
| Methods: sharedConnection, evaluateWithModel:, evaluateRealTimeWithModel:,
|
|
| doEvaluateDirectWithModel:, prepareChainingWithModel:,
|
|
| enqueueSetsWithModel:, buffersReadyWithModel:,
|
|
| beginRealTimeTask, endRealTimeTask
|
|
|
|
|
|-- _ANEInMemoryModelDescriptor (MIL + weights spec)
|
|
| Factory: +modelWithMILText:weights:optionsPlist:
|
|
|
|
|
|-- _ANEInMemoryModel (compile/load/run)
|
|
| Methods: compileWithQoS:, loadWithQoS:, evaluateWithQoS:, unloadWithQoS:
|
|
| Props: hexStringIdentifier, programHandle (uint64), program, perfStatsMask
|
|
|
|
|
|-- _ANEModel (disk-based compiled model -- 52 instance methods)
|
|
| Factory: +modelAtURL:key:, +modelAtURL:key:modelAttributes:
|
|
| Methods: getUUID, inputSymbolIndicesForProcedureIndex:,
|
|
| outputSymbolIndicesForProcedureIndex:
|
|
| Props: mapper, program
|
|
|
|
|
|-- _ANERequest (I/O surface packaging)
|
|
| Factory: +requestWithInputs:inputIndices:outputs:outputIndices:
|
|
| weightsBuffer:perfStats:procedureIndex:
|
|
|
|
|
|-- _ANEIOSurfaceObject (thin IOSurface wrapper)
|
|
| Factory: +objectWithIOSurface:
|
|
|
|
|
|-- _ANEBuffer (IOSurfaceObject + symbolIndex + source) [KEY DISCOVERY]
|
|
| Factory: +bufferWithIOSurfaceObject:symbolIndex:source:
|
|
| source: 0=ANE, 1=output, 2=unknown
|
|
|
|
|
|-- _ANEChainingRequest (multi-op pipeline)
|
|
| Factory: +chainingRequestWithInputs:outputSets:lbInputSymbolId:
|
|
| lbOutputSymbolId:procedureIndex:signalEvents:
|
|
| transactionHandle:fwEnqueueDelay:memoryPoolId:
|
|
| Methods: validate
|
|
|
|
|
|-- _ANEIOSurfaceOutputSets (output packaging for chaining)
|
|
| Factory: +objectWithstatsSurRef:outputBuffer:
|
|
| Note: requires non-NULL statsSurRef (any IOSurface works, even 64 bytes)
|
|
|
|
|
|-- _ANEInputBuffersReady (input signaling for chaining)
|
|
| Factory: +inputBuffersWithProcedureIndex:inputBufferInfoIndex:
|
|
| inputFreeValue:executionDelay:
|
|
|
|
|
|-- _ANEOutputSetEnqueue (output pipeline config for chaining)
|
|
| Factory: +outputSetWithProcedureIndex:setIndex:signalValue:
|
|
| signalNotRequired:isOpenLoop:
|
|
|
|
|
|-- _ANEProgramForEvaluation (lower-level program)
|
|
| Factory: +programWithHandle:intermediateBufferHandle:queueDepth:
|
|
| Methods: processRequest:model:qos:qIndex:modelStringID:options:
|
|
| returnValue:error:
|
|
|
|
|
|-- _ANEProgramIOSurfacesMapper (symbol-to-surface mapping)
|
|
| Factory: +mapperWithProgramHandle:, +mapperWithController:
|
|
| Note: only works with _ANEModel, not _ANEInMemoryModel
|
|
|
|
|
|-- _ANEPerformanceStats
|
|
| Factory: +statsWithHardwareExecutionNS:
|
|
| Props: hwExecutionTime, performanceCounters
|
|
|
|
|
|-- _ANESharedSignalEvent (hardware signal fence)
|
|
| Factory: +signalEventWithValue:symbolIndex:eventType:sharedEvent:
|
|
| Requires IOSurfaceSharedEvent objects
|
|
|
|
|
|-- _ANESharedWaitEvent (hardware wait fence)
|
|
| Factory: +waitEventWithValue:sharedEvent:
|
|
| Requires IOSurfaceSharedEvent objects
|
|
|
|
|
|-- _ANEModelInstanceParameters, _ANEDeviceController, _ANEQoSMapper
|
|
```
|
|
|
|
Full details with experiment logs: [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md)
|
|
|
|
### ChainingRequest API Status
|
|
|
|
The `_ANEChainingRequest` API is designed to pipeline multiple ANE operations without CPU round-trips. Current status:
|
|
|
|
- `_ANEChainingRequest.validate` returns **YES** (with `_ANEBuffer` inputs + `_ANEIOSurfaceOutputSets` outputs)
|
|
- `prepareChainingWithModel:` **fails** -- calls `getUUID` on `_ANEInMemoryModel` which lacks it
|
|
- Requires `_ANEModel` (disk-based compiled model) which has `getUUID` and symbol index methods
|
|
- `_ANEModel` factory methods require a `key:` parameter; the hex identifier from `_ANEInMemoryModel` is the likely key
|
|
|
|
This is the highest-priority research area. Chaining would eliminate the ~23 CPU-ANE round-trips per token in a 12-layer model, potentially enabling on-chip pipeline execution.
|
|
|
|
### model.hwx Binary Format
|
|
|
|
The `.hwx` file is the compiled hardware representation loaded by the ANE kernel driver. From Wu's Black Hat research:
|
|
|
|
- Mach-O format binary containing register operations
|
|
- Compiled from `net.plist` + weights by the ANECompiler module
|
|
- Loaded by the `H11ANEIn` kernel driver via `programCreate` interface
|
|
- ANE firmware parses it to extract register addresses and values
|
|
- Can be disassembled with [ANETools/ANEDisassembler](https://github.com/antgroup-skyward/ANETools)
|
|
|
|
Our `_ANEInMemoryModel` path bypasses `.hwx` generation -- the model goes directly from MIL to an internal binary format in a temp directory. Whether this temp directory contains an equivalent to `.hwx` is an open question (see [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md) for next steps).
|
|
|
|
---
|
|
|
|
## 8. How to verify ANE execution
|
|
|
|
### Power Monitoring
|
|
|
|
```bash
|
|
sudo powermetrics --samplers ane_power -i 1000
|
|
```
|
|
|
|
Shows real-time ANE power draw. Active ANE usage typically shows 2-4W on M4 Max during training.
|
|
|
|
### Performance Statistics
|
|
|
|
```objc
|
|
model.perfStatsMask = 0xFF;
|
|
// After execution:
|
|
// model.performanceCounters -- returns nil on current macOS (limited API)
|
|
```
|
|
|
|
The `_ANEPerformanceStats` class exists and can be instantiated via `+statsWithHardwareExecutionNS:`, but the hardware counters are not populated on the current macOS/M4 combination. The `perfStatsMask` property is accepted but `performanceCounters` returns nil after execution.
|
|
|
|
### IOSurface Output Validation
|
|
|
|
Read back FP16 data from output IOSurfaces and compare against CPU reference:
|
|
|
|
```objc
|
|
_Float16 *out = (_Float16 *)IOSurfaceGetBaseAddress(surface);
|
|
IOSurfaceLock(surface, kIOSurfaceLockReadOnly, NULL);
|
|
for (int i = 0; i < n; i++) {
|
|
float val = (float)out[i];
|
|
// Compare against CPU reference
|
|
}
|
|
IOSurfaceUnlock(surface, kIOSurfaceLockReadOnly, NULL);
|
|
```
|
|
|
|
### ANE Compiler Debug Output
|
|
|
|
From Wu's research, the ANECompiler module has a `debug_mask` flag. Setting it to `2147483647` (max int) generates intermediate files during compilation, revealing:
|
|
- Register operation sequences
|
|
- Memory allocation decisions
|
|
- Tiling strategies
|
|
- Weight layout in SRAM
|
|
|
|
This can be applied when using the ANECompiler CLI tools from [ANETools](https://github.com/antgroup-skyward/ANETools).
|
|
|
|
---
|
|
|
|
## 9. References and External Resources
|
|
|
|
### Documentation and Research
|
|
|
|
| Resource | URL | Focus |
|
|
|----------|-----|-------|
|
|
| hollance/neural-engine | https://github.com/hollance/neural-engine | CoreML-level ANE docs |
|
|
| maderix Substack | https://open.substack.com/pub/maderix/p/inside-the-m4-apple-neural-engine | M4 ANE architecture |
|
|
| Black Hat Asia 2021 | https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers | Full stack reverse engineering |
|
|
| BH Asia 2021 Video | https://www.youtube.com/watch?v=1wvBDUnPNEo | 30-min talk by Wish Wu |
|
|
| Apple ML Research | https://machinelearning.apple.com/research/neural-engine-transformers | Deploying transformers on ANE |
|
|
| ANE Supported Devices | https://github.com/hollance/neural-engine/blob/master/docs/supported-devices.md | Comprehensive device/chip list |
|
|
|
|
### Tools
|
|
|
|
| Tool | URL | Purpose |
|
|
|------|-----|---------|
|
|
| ANETools | https://github.com/antgroup-skyward/ANETools | ANECompiler CLI, ANEDisassembler |
|
|
| eiln/anecc | https://github.com/eiln/anecc | Independent ANE compiler (Asahi Linux) |
|
|
| freedomtan/coreml_to_ane_hwx | https://github.com/freedomtan/coreml_to_ane_hwx | CoreML to .hwx converter |
|
|
| coremltools | https://github.com/apple/coremltools | Apple's official ML model tools |
|
|
|
|
### Projects Using ANE Directly
|
|
|
|
| Project | URL | What it does |
|
|
|---------|-----|-------------|
|
|
| maderix/ANE | https://github.com/maderix/ANE | Training on ANE (this project's upstream) |
|
|
| dev-erik/ANE | https://github.com/dev-erik/ANE | This fork: inference optimization, ChainingRequest research |
|
|
|
|
### This Project's ANE Documentation
|
|
|
|
| Document | Description |
|
|
|----------|-------------|
|
|
| [ANE_INTERNALS.md](ANE_INTERNALS.md) | This file -- comprehensive ANE internals guide |
|
|
| [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md) | ChainingRequest API research, experiment logs, benchmarks |
|
|
| [ARCHITECTURE.md](ARCHITECTURE.md) | Training system architecture, kernel fusion map, data flow |
|
|
| [API_REFERENCE.md](API_REFERENCE.md) | Complete function index for all source files |
|
|
| [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) | M4 Max benchmark results (training, TFLOPS, SRAM) |
|