ANE Internals: What We Know

A comprehensive guide to Apple's Neural Engine (ANE) based on reverse engineering, private API exploration, and community research. This extends and updates hollance/neural-engine with findings from direct hardware experimentation on M4 Max / macOS 15.

1. How does the ANE work internally?
2. Can I program the ANE directly?
3. What can be compiled and run on ANE?
4. Security and safety mechanisms
5. Is the ANE 16-bit?
6. ANE vs GPU vs CPU
7. Reverse engineering the ANE
8. How to verify ANE execution
9. References and external resources

1. How does the ANE work internally?

hollance/neural-engine says: "I don't think anyone outside Apple knows."

We now know substantially more.

Hardware Architecture

The ANE is a fixed-function neural network accelerator integrated into Apple Silicon SoCs:

Chip	ANE Cores	Peak TOPS	SRAM Budget
A12-A13	8	5	~4 MB
A14/M1	16	11	~16 MB
A15/M2	16	15.8	~24 MB
M4/M4 Pro/M4 Max	16	38	~24-32 MB

SRAM budget measured via sram_probe.m performance cliff detection on M4 Max:

Peak efficiency at ~12.5 MB weights (282.6 GFLOPS/MB)
First spill at ~32 MB (drops to 59.2 GFLOPS/MB)
Catastrophic spilling at 128 MB (8.0 GFLOPS/MB)
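
The cliffs in a sweep like this can be located with a simple ratio test between neighbouring points. The sketch below is not the logic of sram_probe.m, which this guide does not reproduce; `first_cliff`, `sweep_point`, and the drop threshold are hypothetical names and choices for illustration, applied to the three measurements listed above.

```c
#include <stddef.h>

/* One point of an SRAM sweep: weight footprint and measured efficiency. */
typedef struct {
    double weight_mb;
    double gflops_per_mb;
} sweep_point;

/* Return the index of the first point whose efficiency fell by at least
 * a factor of `drop_ratio` relative to the previous point, or -1 if no
 * such drop exists. Points must be sorted by increasing weight_mb. */
int first_cliff(const sweep_point *pts, size_t n, double drop_ratio) {
    for (size_t i = 1; i < n; i++) {
        if (pts[i].gflops_per_mb * drop_ratio <= pts[i - 1].gflops_per_mb) {
            return (int)i;
        }
    }
    return -1;
}

/* The three M4 Max measurements quoted above. */
const sweep_point m4_max_sweep[] = {
    {  12.5, 282.6 },
    {  32.0,  59.2 },
    { 128.0,   8.0 },
};
```

With a 2x threshold the first cliff is reported at index 1 (the ~32 MB point); a stricter 5x threshold skips it and reports index 2 (128 MB).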

The ANE operates on FP16 data exclusively. All I/O is through IOSurface shared memory buffers in [1, C, 1, S] channel-first FP16 layout.
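
To make the [1, C, 1, S] layout concrete, the sketch below repacks a row-major [S][C] activation matrix (one row per sequence position) into channel-first order, where element (c, s) lives at offset c * S + s. It assumes a densely packed buffer; a real IOSurface may pad its rows, so check the surface's bytes-per-row before copying. The function names are illustrative, and the FP32-to-FP16 conversion step is left out.

```c
#include <stddef.h>

/* Offset of channel c, position s in a dense [1, C, 1, S] tensor. */
size_t chw_offset(size_t c, size_t s, size_t S) {
    return c * S + s;
}

/* Repack src[s * C + c] (row-major [S][C]) into dst[c * S + s]. */
void pack_channel_first(const float *src, float *dst, size_t C, size_t S) {
    for (size_t s = 0; s < S; s++) {
        for (size_t c = 0; c < C; c++) {
            dst[chw_offset(c, s, S)] = src[s * C + c];
        }
    }
}
```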

Compilation Pipeline

There are two paths from a neural network to ANE hardware execution:

Standard CoreML path (from Black Hat Asia 2021, Wish Wu):

ML model (TF/PyTorch/Caffe)
  -> coremltools -> .mlmodel
  -> coremlc (CoreML compiler) -> .mlmodelc/
  -> espresso precompile -> net.plist + weights
  -> ANECompiler (in ane_compiler_service) -> model.hwx
  -> aned daemon -> H11ANEIn kernel driver (IOKit)
  -> ANE firmware -> hardware registers

Direct private API path (what this project uses):

MIL text + weight blobs (in memory)
  -> _ANEInMemoryModelDescriptor (ObjC object)
  -> _ANEInMemoryModel.compileWithQoS: -> ANE binary (in temp dir)
  -> _ANEInMemoryModel.loadWithQoS: -> loaded onto ANE hardware
  -> _ANEInMemoryModel.evaluateWithQoS: -> execution via aned

The direct path bypasses CoreML, espresso, and explicit .hwx generation (whether the temp directory holds an .hwx equivalent is still an open question; see Section 7). It compiles MIL (Model Intermediate Language) text directly into an ANE-executable binary, loads it, and runs it. This is how we achieve both training and inference on the ANE without any CoreML dependency.

System Architecture

+------------------+     +------------------+     +------------------+
| User Process     |     | aned daemon      |     | Kernel           |
|                  |     |                  |     |                  |
| _ANEClient  -----+---->| ANE scheduler    +---->| H11ANEIn driver  |
| (sharedConnection)|    | (all interfaces) |     | (IOKit)          |
|                  |     |                  |     |                  |
| App gets 3 IOKit |     | Compiles models  |     | Passes model.hwx |
| interfaces:      |     | Manages loading  |     | to ANE firmware  |
|  - open          |     | Handles requests |     |                  |
|  - close         |     +------------------+     +------------------+
|  - programSend   |                                      |
|    Request       |                                      v
+------------------+                              +------------------+
                                                  | ANE Firmware     |
                                                  | (co-processor)   |
                                                  |                  |
                                                  | Parses register  |
                                                  | operations from  |
                                                  | compiled binary  |
                                                  +------------------+

The aned daemon mediates between user processes and the kernel driver. Apps only get 3 IOKit interfaces (open, close, programSendRequest). The daemon has access to all driver interfaces, which is why _ANEClient.sharedConnection communicates through the daemon rather than directly to the kernel.

Execution Paths

We have benchmarked four distinct ways to trigger ANE kernel execution:

Method	API	Latency (64x32)	Latency (768x256)
Standard	`model.evaluateWithQoS:options:request:error:`	0.175 ms	0.205 ms
Real-Time	`client.evaluateRealTimeWithModel:options:request:error:`	0.093 ms	0.246 ms
processRequest	`program.processRequest:model:qos:...`	0.131 ms	0.185 ms
Direct	`client.doEvaluateDirectWithModel:options:request:qos:error:`	0.225 ms	N/A

Key finding: At production kernel dimensions (768x256, matching Stories110M), the three paths we could measure converge to ~0.2 ms per kernel (the Direct path has no 768x256 measurement). The RT speedup (1.88x) observed on small 64x32 kernels does not hold at production scale, where the Real-Time path is the slowest of the three. The standard path remains the most reliable.
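
The speedup figures follow directly from the latency table. A minimal sketch of the arithmetic; the function and array names are illustrative, and the numbers are the Standard and Real-Time rows copied from the table above:

```c
/* Speedup of `candidate` over `baseline`; > 1.0 means candidate is faster. */
double speedup(double baseline_ms, double candidate_ms) {
    return baseline_ms / candidate_ms;
}

/* Latencies in milliseconds: index 0 is 64x32, index 1 is 768x256. */
const double standard_ms[2] = { 0.175, 0.205 };
const double realtime_ms[2] = { 0.093, 0.246 };
```

`speedup(standard_ms[0], realtime_ms[0])` gives about 1.88, while `speedup(standard_ms[1], realtime_ms[1])` gives about 0.83, i.e. the Real-Time path is slower than the standard path at 768x256.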

Resource Limits

The ANE runtime leaks internal resources during compilation. After ~119 compiles per process, subsequent compilations fail silently. The workaround is checkpoint-and-restart: save weights and optimizer state, terminate the process, and re-launch with --resume.

With MAX_COMPILES=100 (conservative) and 60 weight-bearing kernels per batch (12 layers x 5 kernels), only 1 training batch fits per process lifetime.
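
The budget arithmetic can be sketched as follows. The helper names are hypothetical and this is not the project's actual bookkeeping; it only restates the rule above: count compiles, and checkpoint before the next batch would cross the limit.

```c
/* Whole training batches that fit before the process must restart. */
int batches_per_process(int max_compiles, int kernels_per_batch) {
    if (max_compiles <= 0 || kernels_per_batch <= 0) {
        return 0;
    }
    return max_compiles / kernels_per_batch;
}

/* Nonzero when compiling the next batch would exceed the budget, i.e. the
 * caller should save a checkpoint and re-launch with --resume. */
int must_restart(int compiles_done, int kernels_per_batch, int max_compiles) {
    return compiles_done + kernels_per_batch > max_compiles;
}
```

With `max_compiles = 100` and 60 kernels per batch, `batches_per_process` returns 1, and `must_restart(60, 60, 100)` is true after the first batch. Raising the limit to the observed ~119 does not help: 119 / 60 is still 1.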

2. Can I program the ANE directly?

hollance/neural-engine says: "Unfortunately not. You can only use the Neural Engine through Core ML."

Yes, you can. The AppleNeuralEngine.framework contains 67+ private Objective-C classes that provide direct access to the ANE without CoreML. This project uses them for both training and inference.

Minimal Example

The core compilation/load/execution cycle in pseudocode:

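#import <Foundation/Foundation.h>
#import <dlfcn.h>
#import <objc/runtime.h>

// Load the private framework
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);

// The private classes have no public headers, so resolve them at runtime.
// The message sends below also need their selectors declared somewhere
// visible to the compiler, or must be routed through objc_msgSend casts.
Class ANEDescriptor = NSClassFromString(@"_ANEInMemoryModelDescriptor");
Class ANEInMemoryModel = NSClassFromString(@"_ANEInMemoryModel");
Class ANERequest = NSClassFromString(@"_ANERequest");
NSError *error = nil;

// Write MIL program as text
NSData *milData = [@"program(1.0) { ... }" dataUsingEncoding:NSUTF8StringEncoding];

// Create descriptor (weightDict holds the weight blobs referenced by the MIL)
id descriptor = [ANEDescriptor modelWithMILText:milData
                                        weights:weightDict
                                   optionsPlist:nil];

// Compile -> Load -> Run (check `error` after each step)
id model = [ANEInMemoryModel inMemoryModelWithDescriptor:descriptor];
[model compileWithQoS:21 options:nil error:&error];
[model loadWithQoS:21 options:nil error:&error];

// Create IOSurface I/O and request
id request = [ANERequest requestWithInputs:@[inputSurface]
                              inputIndices:@[@0]
                                   outputs:@[outputSurface]
                             outputIndices:@[@0]
                             weightsBuffer:nil
                                 perfStats:nil
                            procedureIndex:0];

[model evaluateWithQoS:21 options:nil request:request error:&error];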

A complete reusable wrapper is implemented in training/ane_runtime.h with functions:

ane_init() -- load framework, resolve classes
ane_compile(kernel, mil_text, weight_dict) -- compile MIL to ANE binary
ane_run(kernel) -- standard execution path
ane_free(kernel) -- unload and release resources

MIL (Model Intermediate Language)

MIL is Apple's intermediate representation for neural network operations. Key facts:

Text-based format: program(1.0) { func main(...) { ... } }
Targets: ios16, ios17, ios18 (determines available ops)
All tensors are 4D: [batch, channels, height, width]; this project uses the [1, C, 1, S] special case
Convolutions (conv) are the workhorse: a 1x1 conv with [out_ch, in_ch, 1, 1] weights = matrix multiply
Weights referenced via BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(64))
Weights are baked at compile time and cannot be swapped at runtime

Supported operations include: conv, matmul, add, mul, sigmoid, softmax, reshape, transpose, concat, reduce_mean, rsqrt, cast, constexpr_affine_dequantize, and more.
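
The claim that a 1x1 conv is a matrix multiply can be checked on the CPU. The sketch below is a plain-C reference under the assumption of dense, unpadded buffers; it is not ANE code, and `conv1x1` is an illustrative name. The kernel's two trailing unit dimensions are dropped, so the weights are simply a [C_out][C_in] matrix.

```c
#include <stddef.h>

/* 1x1 convolution over a dense [1, C_in, 1, S] input.
 * w is [C_out][C_in], x is [C_in][S], y is [C_out][S]. */
void conv1x1(const float *w, const float *x, float *y,
             size_t c_out, size_t c_in, size_t S) {
    for (size_t o = 0; o < c_out; o++) {
        for (size_t s = 0; s < S; s++) {
            float acc = 0.0f;
            for (size_t i = 0; i < c_in; i++) {
                acc += w[o * c_in + i] * x[i * S + s];
            }
            y[o * S + s] = acc; /* same value as (W @ X)[o][s] */
        }
    }
}
```

Each output position s sees only the input channels at position s, which is why the spatial axis behaves like the column index of an ordinary matrix product.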

Alternative: ANECompiler CLI

ANETools (from Wish Wu / Ant Group) provides command-line tools that invoke the ANECompiler module directly:

# Convert mlmodelc to ANE-compatible format
MLModelCToANECompiler input.mlmodelc output/

# Compile to hardware format
ANECompiler --target-arch ane_v5 --debug-mask 2147483647 net.plist weights/ output.hwx

# Disassemble compiled binary
ANEDisassembler output.hwx

Setting --debug-mask to 2147483647 (0x7FFFFFFF, the largest signed 32-bit integer) sets all 31 low bits of the mask and generates intermediate files during compilation, revealing internal register operations.

3. What can be compiled and run on ANE?

Any computation expressible as a static MIL (Model Intermediate Language) dataflow graph that the E5 compiler accepts. The ANE is a fixed-function accelerator, not a general-purpose processor -- it executes predefined operation graphs, not arbitrary code.

Verified Operations

These operations have been compiled to custom MIL programs and executed on ANE hardware with output validated against CPU reference implementations (see test_mil_custom.m):

Category	Operations	Notes
Activations	`relu`, `gelu`, `softmax`	GELU supports EXACT, TANH_APPROXIMATION, SIGMOID_APPROXIMATION modes
Normalization	`layer_norm`	Epsilon type must match gamma/beta dtype
Attention	`scaled_dot_product_attention`	Fused Q@K^T/sqrt(d) + softmax + @V in a single op (iOS 18+)
Linear algebra	`linear` (const weights), `matmul` (runtime tensors)	`linear` requires compile-time constant weights; `matmul` supports runtime inputs
Type conversion	`cast`	fp32 <-> fp16. Required at ANE I/O boundaries
Elementwise	`add`, `mul`, `real_div`	Broadcasting supported
Shape	`reshape`, `transpose`, `concat`, `slice_by_index`	`concat` requires `interleave` param
Composite	Full transformer block (LN + SDPA + Residual + FFN + GELU)	Compiles and runs as a single ANE program (~0.21ms)

Available but Not Yet Tested

These are valid MIL operations that the E5 compiler should accept, but we have not yet validated them against a CPU reference ourselves:

conv -- convolutions (the upstream maderix/ANE repo uses these extensively for training)
reduce_sum, reduce_mean, reduce_max -- reductions
gather, scatter -- embedding lookups, KV cache writes
rsqrt, sqrt, exp, log, tanh -- unary math
split, slice_by_size -- tensor slicing
batch_norm, instance_norm -- normalization variants
Various pooling, padding, upsampling operations

What Cannot Run on ANE

Limitation	Detail
No control flow	No loops, conditionals, or branching. MIL is a static dataflow graph.
No dynamic shapes	All tensor dimensions must be known at compile time.
No runtime weight updates	Weights are `const`, baked into the compiled binary. Changing weights requires recompilation (~10-50ms).
No arbitrary memory access	No pointers or indexing beyond what `gather`/`scatter` provide.
No custom ops	Only operations in Apple's MIL op set. No user-defined kernels at the hardware level.
No FP32 compute	ANE computes in FP16 only. FP32 inputs are cast to FP16 internally.

Implications for Training

The ANE can execute the forward pass and the matrix math of backpropagation (matmul for dX and dW gradients). However, training is impractical because weights are read-only constants. After computing weight gradients on ANE, the optimizer step (W -= lr * dW) must run on CPU, and the MIL program must be recompiled with updated weights before the next forward pass. This recompilation costs ~10-50ms per step, dominating training time. See ANE_CHAINING_RESEARCH.md, Section 9 for detailed analysis.
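
The CPU side of that loop is small. Below is a minimal sketch of the update rule exactly as written above (plain SGD in FP32); the project's real optimizer keeps additional state and may differ, and `sgd_step` is an illustrative name.

```c
#include <stddef.h>

/* Plain SGD on the CPU: w[i] -= lr * dw[i] for every weight.
 * dw holds the weight gradients read back from the ANE. */
void sgd_step(float *w, const float *dw, size_t n, float lr) {
    for (size_t i = 0; i < n; i++) {
        w[i] -= lr * dw[i];
    }
}
```

After this step the updated weights still have to be written into new weight blobs and the MIL program recompiled, which is the cost the paragraph above describes.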

4. Security and Safety Mechanisms

The ANE has multiple layers of safety enforcement, but Apple's security model assumes access goes through CoreML. The private APIs we use bypass CoreML but still pass through the aned daemon and the E5 compiler.

Compile-Time Safety

Mechanism	What it does
MIL syntax validation	The E5 compiler rejects malformed MIL with `InvalidMILProgram` errors
Type checking	Tensor dtypes, shapes, and parameter types must match exactly. Mismatches cause compile errors (e.g., `layer_norm` epsilon must match gamma/beta dtype; `concat` axis must be `int32` scalar, not tensor)
Op validation	Unknown or unsupported operations are rejected
I/O matching	MIL input/output names and shapes must match the `MLModelDescription` passed to `MLE5Engine`

Runtime Safety

Mechanism	What it does
Shape enforcement	Input tensors must match declared shape exactly -- `MultiArray shape doesn't match ML Program's expected shape` error on mismatch
Daemon mediation	ANE runs through the `aned` daemon (system service). User processes only get 3 IOKit interfaces: open, close, `programSendRequest`
IOSurface isolation	I/O memory is managed by the kernel via IOSurface. Cannot read/write arbitrary memory through them
SRAM limits	Programs exceeding the ANE SRAM budget (~24-32MB on M4 Max) are rejected or fall back to CPU/GPU
Compile limit	~119 compiled programs per process before the compiler leaks enough resources to fail (resource exhaustion, not a security boundary)

Sandbox Interaction

The E5 runtime needs write access to ~/Library/Caches/<binary_name>/ for its ANE specialization cache. macOS app sandbox can block this, causing compilation to fail with permission errors. When running outside a sandbox (e.g., command-line tools), this directory is created automatically.

What is NOT Protected

Gap	Detail
No access control	No authentication or entitlement check for using the private APIs. Any process can call `_ANEClient.sharedConnection`
No rate limiting	Programs can be compiled in a loop until the ~119 limit exhausts resources
No MIL signing	No code signing validation on MIL text -- any syntactically valid program that passes the compiler's type checks will execute
No isolation between programs	Multiple programs from the same process share the ANE with no hardware-level isolation (the daemon schedules them)

Practical Risk Assessment

The ANE attack surface is limited because:

Fixed-function hardware: The ANE executes predefined neural network operations, not arbitrary instructions. There is no instruction pointer, no stack, and no way to jump to arbitrary code.
Typed dataflow: MIL programs operate on typed tensors with fixed shapes. There are no buffer overflows in the traditional sense -- the compiler enforces all dimensions at compile time.
Daemon intermediary: All ANE access goes through aned, which validates requests before forwarding to the kernel driver. Direct IOKit access to the ANE is restricted to 3 interfaces.
No persistent state: ANE programs don't persist across reboots. Compiled programs live in temp directories and caches that are cleaned by the OS.

The main risk of the private APIs is stability: these APIs are undocumented and may change with any macOS update, potentially breaking programs that depend on them.

5. Is the ANE 16-bit?

hollance/neural-engine says: "It appears so."

Confirmed. The ANE operates in FP16 for both compute and storage:

All IOSurface I/O must be FP16. Passing FP32 data produces zeros.
MIL programs must use fp16 I/O types (setting g_fp16_io=1 in our codebase)
F32-to-F16 conversion happens on the CPU before writing to IOSurfaces
FP16 precision limits: magnitudes above 65504 (the largest finite FP16 value) overflow to infinity, and magnitudes below ~5.96e-8 (the smallest subnormal) underflow to zero
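
The overflow and underflow limits listed above can be reproduced with a portable software conversion. This is a sketch of standard IEEE 754 binary32-to-binary16 rounding (round to nearest, ties to even), not the conversion routine used in this codebase; on Apple Silicon a cast to `_Float16` is normally used instead. `f32_to_f16_bits` is an illustrative name.

```c
#include <stdint.h>
#include <string.h>

/* Convert a binary32 value to binary16 bits, rounding to nearest-even.
 * Overflow becomes infinity; values too small for FP16 become zero. */
uint16_t f32_to_f16_bits(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    uint32_t mag = x & 0x7FFFFFFFu;

    if (mag > 0x7F800000u) return (uint16_t)(sign | 0x7E00u);  /* NaN */
    if (mag >= 0x477FF000u) return (uint16_t)(sign | 0x7C00u); /* >= 65520: Inf */
    if (mag < 0x33000000u) return sign;                        /* < 2^-25: zero */

    uint32_t exp32 = mag >> 23;
    uint32_t mant = mag & 0x007FFFFFu;
    uint32_t half, rem, halfway;

    if (exp32 >= 113) {                 /* normal FP16: magnitude >= 2^-14 */
        half = ((exp32 - 112) << 10) | (mant >> 13);
        rem = mant & 0x1FFFu;
        halfway = 0x1000u;
    } else {                            /* subnormal FP16 */
        uint32_t shift = 126 - exp32;   /* 14..24 */
        mant |= 0x00800000u;            /* restore the implicit leading 1 */
        half = mant >> shift;
        rem = mant & ((1u << shift) - 1u);
        halfway = 1u << (shift - 1);
    }
    /* Round to nearest, ties to even; a carry into the exponent is fine. */
    if (rem > halfway || (rem == halfway && (half & 1u))) half++;
    return (uint16_t)(sign | half);
}
```

For example, 65504.0f maps to 0x7BFF (largest finite value), 65520.0f and above map to 0x7C00 (infinity), and 1.0e-8f maps to 0x0000.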

Quantization Support

Format	ANE Native?	Notes
FP16	Yes	Native compute and storage format
INT8	Partial	Memory bandwidth savings only, no compute speedup. `constexpr_affine_dequantize` in MIL dequantizes to FP16 before compute
Q4	No	Not supported. Requires GPU (Metal) or CPU dequantization
FP32	No	Internally converted to FP16; higher precision lost

Apple quotes ANE throughput in INT8 TOPS, so the 38 TOPS figure for M4 corresponds to roughly 19 TFLOPS in FP16, assuming FP16 runs at half the INT8 rate.

6. ANE vs GPU vs CPU

Benchmarked on Qwen2.5-0.5B (dim=896, 24 layers, 494M params) on M4 Max:

Decode Performance (single-token generation)

Engine	Format	Weight Size	Decode t/s	Bottleneck
CPU AMX (cblas_sgemv)	F32	1.97 GB	~91 t/s	Memory bandwidth
CPU AMX (cblas_sgemv)	F16->F32	658 MB disk	~91 t/s	Memory bandwidth (F32 in RAM)
CPU AMX (cblas_sgemv)	Q4->F32	188 MB disk	~91 t/s	Memory bandwidth (dequant at load)
Metal GPU (Q4 SIMD)	Q4	188 MB	~10 t/s	Dispatch overhead (~400 dispatches/token)
LM Studio (MLX)	Q4 MLX	~188 MB	258-496 t/s	Optimized Metal kernels

Prefill Performance (batch prompt processing)

Engine	Format	Prefill t/s	Method
CPU AMX (cblas_sgemm)	F32	880-960 t/s	Batched matmul
CPU AMX (cblas_sgemv)	F32	~40 t/s	Sequential per-token

ANE Training Kernel Performance

Metric	Value
Kernel latency	~0.2 ms per kernel (768x256 production dims)
Peak TFLOPS	11.14 (128x conv 512ch sp64)
Sustained training	1.29-1.68 TFLOPS
ANE utilization	8-11% of peak

When to use each

ANE: Best for parallel FP16 operations where data stays on-chip (training kernels, fused attention). The ~119 compile limit and FP16-only restriction are significant constraints.
GPU (Metal): Best for large models (dim >= 4096) where native quantized matmul kernels (as in MLX/llama.cpp) can read Q4/Q8 data directly from GPU memory. Dispatch overhead dominates for small models.
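CPU AMX: Best for small/medium model decode (dim <= 896) among the engines we implemented ourselves. cblas_sgemv uses the AMX coprocessor internally and achieves ~33% of theoretical bandwidth. In our tests, manual NEON, threading, and our own Metal kernels did not beat it at this model size; the optimized MLX kernels in the decode table do.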
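
The ~33% bandwidth figure can be cross-checked from the decode table, on the assumption that every decoded token streams the full F32 weight set from memory once. The 546 GB/s peak used in the usage note is also an assumption (Apple's published figure for the top M4 Max configuration), not a number from this document, and the function names are illustrative.

```c
/* Effective memory bandwidth in GB/s when each decoded token reads the
 * whole weight set once. */
double effective_bandwidth_gbs(double weight_gb, double tokens_per_s) {
    return weight_gb * tokens_per_s;
}

/* Fraction of a theoretical peak; both arguments in GB/s. */
double bandwidth_fraction(double effective_gbs, double peak_gbs) {
    return effective_gbs / peak_gbs;
}
```

With 1.97 GB of weights at ~91 t/s, `effective_bandwidth_gbs` gives about 179 GB/s, and `bandwidth_fraction(179.27, 546.0)` is about 0.33.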

7. Reverse engineering the ANE

Prior Work

Project	Focus	Key Contribution
hollance/neural-engine	CoreML-level documentation	Comprehensive device list, layer compatibility, model surgery guides
geohot/tinygrad ANE	Driver-level reverse engineering	Initial IOKit driver analysis, ANE instruction format exploration
Black Hat Asia 2021 (Wish Wu)	Full stack: ML to HW registers	Documented compilation pipeline, .hwx format, security attack surfaces, FaceID ANE usage. Created ANEDisassembler. Video of the talk is linked in Section 9
ANETools	CLI compilation and disassembly	ANECompiler CLI wrapper, ANEDisassembler for .hwx files, `debug_mask` flag for intermediate output
eiln/anecc	Independent ANE compiler	CoreML-to-ANE compiler for Asahi Linux, alternative compilation path
freedomtan/coreml_to_ane_hwx	CoreML to .hwx conversion	Direct converter bypassing some CoreML steps
maderix/ANE	Training on ANE	First neural network training on ANE via private APIs
maderix Substack	M4 ANE deep-dive	Detailed M4 ANE architecture analysis, SRAM probing, kernel fusion

Our Discoveries: Private API Class Hierarchy

We have documented 20+ private Objective-C classes in AppleNeuralEngine.framework:

NSObject
|-- _ANEClient (singleton, daemon connection)
|   Methods: sharedConnection, evaluateWithModel:, evaluateRealTimeWithModel:,
|            doEvaluateDirectWithModel:, prepareChainingWithModel:,
|            enqueueSetsWithModel:, buffersReadyWithModel:,
|            beginRealTimeTask, endRealTimeTask
|
|-- _ANEInMemoryModelDescriptor (MIL + weights spec)
|   Factory: +modelWithMILText:weights:optionsPlist:
|
|-- _ANEInMemoryModel (compile/load/run)
|   Methods: compileWithQoS:, loadWithQoS:, evaluateWithQoS:, unloadWithQoS:
|   Props: hexStringIdentifier, programHandle (uint64), program, perfStatsMask
|
|-- _ANEModel (disk-based compiled model -- 52 instance methods)
|   Factory: +modelAtURL:key:, +modelAtURL:key:modelAttributes:
|   Methods: getUUID, inputSymbolIndicesForProcedureIndex:,
|            outputSymbolIndicesForProcedureIndex:
|   Props: mapper, program
|
|-- _ANERequest (I/O surface packaging)
|   Factory: +requestWithInputs:inputIndices:outputs:outputIndices:
|             weightsBuffer:perfStats:procedureIndex:
|
|-- _ANEIOSurfaceObject (thin IOSurface wrapper)
|   Factory: +objectWithIOSurface:
|
|-- _ANEBuffer (IOSurfaceObject + symbolIndex + source) [KEY DISCOVERY]
|   Factory: +bufferWithIOSurfaceObject:symbolIndex:source:
|   source: 0=ANE, 1=output, 2=unknown
|
|-- _ANEChainingRequest (multi-op pipeline)
|   Factory: +chainingRequestWithInputs:outputSets:lbInputSymbolId:
|             lbOutputSymbolId:procedureIndex:signalEvents:
|             transactionHandle:fwEnqueueDelay:memoryPoolId:
|   Methods: validate
|
|-- _ANEIOSurfaceOutputSets (output packaging for chaining)
|   Factory: +objectWithstatsSurRef:outputBuffer:
|   Note: requires non-NULL statsSurRef (any IOSurface works, even 64 bytes)
|
|-- _ANEInputBuffersReady (input signaling for chaining)
|   Factory: +inputBuffersWithProcedureIndex:inputBufferInfoIndex:
|             inputFreeValue:executionDelay:
|
|-- _ANEOutputSetEnqueue (output pipeline config for chaining)
|   Factory: +outputSetWithProcedureIndex:setIndex:signalValue:
|             signalNotRequired:isOpenLoop:
|
|-- _ANEProgramForEvaluation (lower-level program)
|   Factory: +programWithHandle:intermediateBufferHandle:queueDepth:
|   Methods: processRequest:model:qos:qIndex:modelStringID:options:
|             returnValue:error:
|
|-- _ANEProgramIOSurfacesMapper (symbol-to-surface mapping)
|   Factory: +mapperWithProgramHandle:, +mapperWithController:
|   Note: only works with _ANEModel, not _ANEInMemoryModel
|
|-- _ANEPerformanceStats
|   Factory: +statsWithHardwareExecutionNS:
|   Props: hwExecutionTime, performanceCounters
|
|-- _ANESharedSignalEvent (hardware signal fence)
|   Factory: +signalEventWithValue:symbolIndex:eventType:sharedEvent:
|   Requires IOSurfaceSharedEvent objects
|
|-- _ANESharedWaitEvent (hardware wait fence)
|   Factory: +waitEventWithValue:sharedEvent:
|   Requires IOSurfaceSharedEvent objects
|
|-- _ANEModelInstanceParameters, _ANEDeviceController, _ANEQoSMapper

Full details with experiment logs: ANE_CHAINING_RESEARCH.md

ChainingRequest API Status

The _ANEChainingRequest API is designed to pipeline multiple ANE operations without CPU round-trips. Current status:

_ANEChainingRequest.validate returns YES (with _ANEBuffer inputs + _ANEIOSurfaceOutputSets outputs)
prepareChainingWithModel: fails -- calls getUUID on _ANEInMemoryModel which lacks it
Requires _ANEModel (disk-based compiled model) which has getUUID and symbol index methods
_ANEModel factory methods require a key: parameter; the hex identifier from _ANEInMemoryModel is the likely key

This is the highest-priority research area. Chaining would eliminate the ~23 CPU-ANE round-trips per token in a 12-layer model, potentially enabling on-chip pipeline execution.

model.hwx Binary Format

The .hwx file is the compiled hardware representation loaded by the ANE kernel driver. From Wu's Black Hat research:

Mach-O format binary containing register operations
Compiled from net.plist + weights by the ANECompiler module
Loaded by the H11ANEIn kernel driver via programCreate interface
ANE firmware parses it to extract register addresses and values
Can be disassembled with ANETools/ANEDisassembler

Our _ANEInMemoryModel path bypasses .hwx generation -- the model goes directly from MIL to an internal binary format in a temp directory. Whether this temp directory contains an equivalent to .hwx is an open question (see ANE_CHAINING_RESEARCH.md for next steps).

8. How to verify ANE execution

Power Monitoring

sudo powermetrics --samplers ane_power -i 1000

Shows real-time ANE power draw. Active ANE usage typically shows 2-4W on M4 Max during training.

Performance Statistics

model.perfStatsMask = 0xFF;
// After execution:
// model.performanceCounters -- returns nil on current macOS (limited API)

The _ANEPerformanceStats class exists and can be instantiated via +statsWithHardwareExecutionNS:, but the hardware counters are not populated on the current macOS/M4 combination. The perfStatsMask property is accepted but performanceCounters returns nil after execution.

IOSurface Output Validation

Read back FP16 data from output IOSurfaces and compare against CPU reference:

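#import <IOSurface/IOSurface.h>

// Lock first, then fetch the base address, so the CPU reads coherent data
IOSurfaceLock(surface, kIOSurfaceLockReadOnly, NULL);
const _Float16 *out = (const _Float16 *)IOSurfaceGetBaseAddress(surface);
for (int i = 0; i < n; i++) {
    float val = (float)out[i];
    // Compare val against the CPU reference for element i
}
IOSurfaceUnlock(surface, kIOSurfaceLockReadOnly, NULL);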
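
The comparison step in the loop above can be factored into a small helper once the FP16 values have been widened to float. This is a sketch with hypothetical names and caller-chosen tolerances, not the project's test harness. FP16 has an 11-bit significand, so each rounding contributes a relative error of up to about 5e-4, and tolerances need to be loose enough to absorb that.

```c
#include <stddef.h>

/* Largest absolute difference between ANE output and a CPU reference. */
float max_abs_diff(const float *ane, const float *ref, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = ane[i] - ref[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}

/* Returns 1 if every element satisfies |ane - ref| <= atol + rtol * |ref|.
 * The negated comparison makes NaN outputs fail instead of pass. */
int all_close(const float *ane, const float *ref, size_t n,
              float atol, float rtol) {
    for (size_t i = 0; i < n; i++) {
        float d = ane[i] - ref[i];
        float r = ref[i];
        if (d < 0.0f) d = -d;
        if (r < 0.0f) r = -r;
        if (!(d <= atol + rtol * r)) return 0;
    }
    return 1;
}
```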

ANE Compiler Debug Output

From Wu's research, the ANECompiler module has a debug_mask flag. Setting it to 2147483647 (max int) generates intermediate files during compilation, revealing:

Register operation sequences
Memory allocation decisions
Tiling strategies
Weight layout in SRAM

This can be applied when using the ANECompiler CLI tools from ANETools.

9. References and External Resources

Documentation and Research

Resource	URL	Focus
hollance/neural-engine	https://github.com/hollance/neural-engine	CoreML-level ANE docs
maderix Substack	https://open.substack.com/pub/maderix/p/inside-the-m4-apple-neural-engine	M4 ANE architecture
Black Hat Asia 2021	https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers	Full stack reverse engineering
BH Asia 2021 Video	https://www.youtube.com/watch?v=1wvBDUnPNEo	30-min talk by Wish Wu
Apple ML Research	https://machinelearning.apple.com/research/neural-engine-transformers	Deploying transformers on ANE
ANE Supported Devices	https://github.com/hollance/neural-engine/blob/master/docs/supported-devices.md	Comprehensive device/chip list

Tools

Tool	URL	Purpose
ANETools	https://github.com/antgroup-skyward/ANETools	ANECompiler CLI, ANEDisassembler
eiln/anecc	https://github.com/eiln/anecc	Independent ANE compiler (Asahi Linux)
freedomtan/coreml_to_ane_hwx	https://github.com/freedomtan/coreml_to_ane_hwx	CoreML to .hwx converter
coremltools	https://github.com/apple/coremltools	Apple's official ML model tools

Projects Using ANE Directly

Project	URL	What it does
maderix/ANE	https://github.com/maderix/ANE	Training on ANE (this project's upstream)
dev-erik/ANE	https://github.com/dev-erik/ANE	This fork: inference optimization, ChainingRequest research

This Project's ANE Documentation

Document	Description
ANE_INTERNALS.md	This file -- comprehensive ANE internals guide
ANE_CHAINING_RESEARCH.md	ChainingRequest API research, experiment logs, benchmarks
ARCHITECTURE.md	Training system architecture, kernel fusion map, data flow
API_REFERENCE.md	Complete function index for all source files
BENCHMARK_RESULTS.md	M4 Max benchmark results (training, TFLOPS, SRAM)
