[test] ANE private API research: chaining, E5 runtime, custom MIL compilation experiments

Erik Bray 2026-03-04 21:39:24 +01:00
parent efcf193075
commit 99ba013d9b
11 changed files with 8855 additions and 8 deletions


docs/ANE_INTERNALS.md Normal file

@ -0,0 +1,563 @@
# ANE Internals: What We Know
A comprehensive guide to Apple's Neural Engine (ANE) based on reverse engineering, private API exploration, and community research. This extends and updates [hollance/neural-engine](https://github.com/hollance/neural-engine/tree/master/docs) with findings from direct hardware experimentation on M4 Max / macOS 15.
---
## Table of Contents
1. [How does the ANE work internally?](#1-how-does-the-ane-work-internally)
2. [Can I program the ANE directly?](#2-can-i-program-the-ane-directly)
3. [What can be compiled and run on ANE?](#3-what-can-be-compiled-and-run-on-ane)
4. [Security and safety mechanisms](#4-security-and-safety-mechanisms)
5. [Is the ANE 16-bit?](#5-is-the-ane-16-bit)
6. [ANE vs GPU vs CPU](#6-ane-vs-gpu-vs-cpu)
7. [Reverse engineering the ANE](#7-reverse-engineering-the-ane)
8. [How to verify ANE execution](#8-how-to-verify-ane-execution)
9. [References and external resources](#9-references-and-external-resources)
---
## 1. How does the ANE work internally?
> hollance/neural-engine says: "I don't think anyone outside Apple knows."
We now know substantially more.
### Hardware Architecture
The ANE is a fixed-function neural network accelerator integrated into Apple Silicon SoCs:
| Chip | ANE Cores | Peak TOPS | SRAM Budget |
|------|-----------|-----------|-------------|
| A12-A13 | 8 | 5 | ~4 MB |
| A14/M1 | 16 | 11 | ~16 MB |
| A15/M2 | 16 | 15.8 | ~24 MB |
| M4/M4 Pro/M4 Max | 16 | 38 | ~24-32 MB |
SRAM budget measured via `sram_probe.m` performance cliff detection on M4 Max:
- Peak efficiency at ~12.5 MB weights (282.6 GFLOPS/MB)
- First spill at ~32 MB (drops to 59.2 GFLOPS/MB)
- Catastrophic spilling at 128 MB (8.0 GFLOPS/MB)
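The cliff in these numbers can be flagged mechanically. The following is a minimal sketch of one way to do it, not the logic of `sram_probe.m` (which is not shown here): the function name `first_cliff` and the drop threshold are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the first sample whose efficiency falls
 * below drop_fraction * (best efficiency seen so far), or n if none does.
 * eff[] holds GFLOPS/MB values ordered by increasing weight size. */
static size_t first_cliff(const double *eff, size_t n, double drop_fraction) {
    assert(eff != NULL && drop_fraction > 0.0 && drop_fraction < 1.0);
    double best = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (eff[i] > best) {
            best = eff[i];                    /* new peak, cannot be a cliff */
        } else if (eff[i] < drop_fraction * best) {
            return i;                         /* dropped well below the peak */
        }
    }
    return n;
}
```

With the three measurements above (282.6, 59.2, 8.0) and a threshold of 0.5, the sketch reports the ~32 MB sample as the first cliff.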
The ANE operates on FP16 data exclusively. All I/O is through IOSurface shared memory buffers in `[1, C, 1, S]` channel-first FP16 layout.
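To make the `[1, C, 1, S]` layout concrete, here is a small sketch of the index arithmetic. It assumes a dense buffer with no row padding (a real IOSurface may use a larger stride), and the helper names `ane_index` and `pack_channel_first` are illustrative, not part of this project.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index of element (c, s) in a dense [1, C, 1, S] tensor.
 * Assumption: no padding between channel rows. */
static size_t ane_index(size_t c, size_t s, size_t S) {
    return c * S + s;
}

/* Repack a token-major [S][C] matrix into channel-first [C][S] order. */
static void pack_channel_first(const float *src, float *dst,
                               size_t C, size_t S) {
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t s = 0; s < S; s++)
        for (size_t c = 0; c < C; c++)
            dst[ane_index(c, s, S)] = src[s * C + c];
}
```

In this layout all `S` values of one channel are contiguous, so a token-major activation matrix has to be transposed before it is written to the input surface.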
### Compilation Pipeline
There are two paths from a neural network to ANE hardware execution:
**Standard CoreML path** (from [Black Hat Asia 2021, Wish Wu](https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers)):
```
ML model (TF/PyTorch/Caffe)
-> coremltools -> .mlmodel
-> coremlc (CoreML compiler) -> .mlmodelc/
-> espresso precompile -> net.plist + weights
-> ANECompiler (in ane_compiler_service) -> model.hwx
-> aned daemon -> H11ANEIn kernel driver (IOKit)
-> ANE firmware -> hardware registers
```
**Direct private API path** (what this project uses):
```
MIL text + weight blobs (in memory)
-> _ANEInMemoryModelDescriptor (ObjC object)
-> _ANEInMemoryModel.compileWithQoS: -> ANE binary (in temp dir)
-> _ANEInMemoryModel.loadWithQoS: -> loaded onto ANE hardware
-> _ANEInMemoryModel.evaluateWithQoS: -> execution via aned
```
The direct path bypasses CoreML, espresso, and the standalone `.hwx` file. It compiles MIL (Model Intermediate Language) text directly into an ANE-executable binary, loads it, and runs it. This is how we run both training and inference on the ANE without any CoreML dependency.
### System Architecture
```
+------------------+ +------------------+ +------------------+
| User Process | | aned daemon | | Kernel |
| | | | | |
| _ANEClient -----+---->| ANE scheduler +---->| H11ANEIn driver |
| (sharedConnection)| | (all interfaces) | | (IOKit) |
| | | | | |
| App gets 3 IOKit | | Compiles models | | Passes model.hwx |
| interfaces: | | Manages loading | | to ANE firmware |
| - open | | Handles requests | | |
| - close | +------------------+ +------------------+
| - programSend | |
| Request | v
+------------------+ +------------------+
| ANE Firmware |
| (co-processor) |
| |
| Parses register |
| operations from |
| compiled binary |
+------------------+
```
The `aned` daemon mediates between user processes and the kernel driver. Apps only get 3 IOKit interfaces (open, close, programSendRequest). The daemon has access to all driver interfaces, which is why `_ANEClient.sharedConnection` communicates through the daemon rather than directly to the kernel.
### Execution Paths
We have benchmarked four distinct ways to trigger ANE kernel execution:
| Method | API | Latency (64x32) | Latency (768x256) |
|--------|-----|------------------|--------------------|
| Standard | `model.evaluateWithQoS:options:request:error:` | 0.175 ms | 0.205 ms |
| Real-Time | `client.evaluateRealTimeWithModel:options:request:error:` | 0.093 ms | 0.246 ms |
| processRequest | `program.processRequest:model:qos:...` | 0.131 ms | 0.185 ms |
| Direct | `client.doEvaluateDirectWithModel:options:request:qos:error:` | 0.225 ms | N/A |
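**Key finding**: At production kernel dimensions (768x256, matching Stories110M), the three paths with measurements converge to roughly 0.2 ms per kernel (0.185-0.246 ms). The 1.88x Real-Time speedup observed on small 64x32 kernels (0.093 ms vs 0.175 ms) does not hold at production scale, where the Real-Time path is the slowest of the three. The standard path remains the most reliable.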
### Resource Limits
The ANE runtime leaks internal resources during compilation. After ~119 compiles per process, subsequent compilations fail silently. The workaround is checkpoint-and-restart: save weights and optimizer state, terminate the process, and re-launch with `--resume`.
With `MAX_COMPILES=100` (conservative) and 60 weight-bearing kernels per batch (12 layers x 5 kernels), only 1 training batch fits per process lifetime.
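The budget arithmetic is simple integer division. A minimal sketch, where `batches_per_process` is a hypothetical helper name and not a function in this codebase:

```c
#include <assert.h>

/* Whole training batches that fit in one process before a
 * checkpoint-and-restart is needed. */
static int batches_per_process(int max_compiles, int kernels_per_batch) {
    assert(max_compiles >= 0 && kernels_per_batch > 0);
    return max_compiles / kernels_per_batch;   /* integer division rounds down */
}
```

With `MAX_COMPILES=100` and 12 x 5 = 60 kernels, this gives 1. Even at the observed ceiling of ~119 compiles the result is still 1, so the conservative limit costs nothing at this model size.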
---
## 2. Can I program the ANE directly?
> hollance/neural-engine says: "Unfortunately not. You can only use the Neural Engine through Core ML."
**Yes, you can.** The `AppleNeuralEngine.framework` contains 67+ private Objective-C classes that provide direct access to the ANE without CoreML. This project uses them for both training and inference.
### Minimal Example
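The core compile/load/execute cycle, shown as Objective-C-style pseudocode. The private classes ship without public headers, so working code resolves them at runtime with `NSClassFromString` and calls them through `objc_msgSend`, as `ane_runtime.h` does: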
```objc
#import <dlfcn.h>
#import <objc/runtime.h>
// Load the private framework
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
// Write MIL program as text
NSData *milData = [@"program(1.0) { ... }" dataUsingEncoding:NSUTF8StringEncoding];
// Create descriptor
id descriptor = [_ANEInMemoryModelDescriptor modelWithMILText:milData
weights:weightDict
optionsPlist:nil];
// Compile -> Load -> Run
NSError *error = nil;
id model = [_ANEInMemoryModel inMemoryModelWithDescriptor:descriptor];
[model compileWithQoS:21 options:nil error:&error];
[model loadWithQoS:21 options:nil error:&error];
// Create IOSurface I/O and request (surfaces are wrapped in _ANEIOSurfaceObject via +objectWithIOSurface:)
id request = [_ANERequest requestWithInputs:@[inputSurface]
inputIndices:@[@0]
outputs:@[outputSurface]
outputIndices:@[@0]
weightsBuffer:nil
perfStats:nil
procedureIndex:0];
[model evaluateWithQoS:21 options:nil request:request error:&error];
```
A complete reusable wrapper is implemented in [`training/ane_runtime.h`](../training/ane_runtime.h) with functions:
- `ane_init()` -- load framework, resolve classes
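- `ane_compile(milText, weightData, nInputs, inputSizes, nOutputs, outputSizes)` -- compile MIL to an ANE binary; returns an `ANEKernel *`, or `NULL` on failure
- `ane_eval(kernel)` -- standard execution path
- `ane_eval_rt(kernel)` -- real-time execution path; falls back to `ane_eval` when no `_ANEClient` connection is available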
- `ane_free(kernel)` -- unload and release resources
### MIL (Model Intermediate Language)
MIL is Apple's intermediate representation for neural network operations. Key facts:
- Text-based format: `program(1.0) { func main(...) { ... } }`
- Targets: `ios16`, `ios17`, `ios18` (determines available ops)
- All tensors are 4D: `[batch, channels, height, width]`, which this project uses as `[1, C, 1, S]`
- Convolutions (`conv`) are the workhorse: a 1x1 conv with `[out_ch, in_ch, 1, 1]` weights = matrix multiply
- Weights referenced via `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(64))`
- Weights are baked at compile time and cannot be swapped at runtime
Supported operations include: `conv`, `matmul`, `add`, `mul`, `sigmoid`, `softmax`, `reshape`, `transpose`, `concat`, `reduce_mean`, `rsqrt`, `cast`, `constexpr_affine_dequantize`, and more.
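The claim that a 1x1 `conv` is a matrix multiply can be checked with a CPU reference. The sketch below assumes dense row-major buffers; `conv1x1_ref` is an illustrative name, not a function from this project.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for a 1x1 convolution over a dense [1, C_in, 1, S] input.
 * w is the [out_ch, in_ch, 1, 1] kernel flattened to [c_out][c_in],
 * x is [c_in][S], y is [c_out][S]. This is exactly y = w @ x. */
static void conv1x1_ref(const float *w, const float *x, float *y,
                        size_t c_out, size_t c_in, size_t S) {
    assert(w != NULL && x != NULL && y != NULL);
    for (size_t o = 0; o < c_out; o++) {
        for (size_t s = 0; s < S; s++) {
            float acc = 0.0f;
            for (size_t i = 0; i < c_in; i++)
                acc += w[o * c_in + i] * x[i * S + s];   /* dot over channels */
            y[o * S + s] = acc;
        }
    }
}
```

Each spatial position `s` is treated independently, so the sequence dimension plays the role of the matrix column index.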
### Alternative: ANECompiler CLI
[ANETools](https://github.com/antgroup-skyward/ANETools) (from Wish Wu / Ant Group) provides command-line tools that invoke the ANECompiler module directly:
```bash
# Convert mlmodelc to ANE-compatible format
MLModelCToANECompiler input.mlmodelc output/
# Compile to hardware format
ANECompiler --target-arch ane_v5 --debug-mask 2147483647 net.plist weights/ output.hwx
# Disassemble compiled binary
ANEDisassembler output.hwx
```
The `--debug-mask` flag (set to max integer) generates intermediate files during compilation, revealing internal register operations.
---
## 3. What can be compiled and run on ANE?
The ANE can run any computation expressible as a static MIL (Model Intermediate Language) dataflow graph that the E5 compiler accepts. It is a fixed-function accelerator, not a general-purpose processor -- it executes predefined operation graphs, not arbitrary code.
### Verified Operations
These operations have been compiled to custom MIL programs and executed on ANE hardware with output validated against CPU reference implementations (see `test_mil_custom.m`):
| Category | Operations | Notes |
|----------|-----------|-------|
| Activations | `relu`, `gelu`, `softmax` | GELU supports EXACT, TANH_APPROXIMATION, SIGMOID_APPROXIMATION modes |
| Normalization | `layer_norm` | Epsilon type must match gamma/beta dtype |
| Attention | `scaled_dot_product_attention` | Fused Q@K^T/sqrt(d) + softmax + @V in a single op (iOS 18+) |
| Linear algebra | `linear` (const weights), `matmul` (runtime tensors) | `linear` requires compile-time constant weights; `matmul` supports runtime inputs |
| Type conversion | `cast` | fp32 <-> fp16. Required at ANE I/O boundaries |
| Elementwise | `add`, `mul`, `real_div` | Broadcasting supported |
| Shape | `reshape`, `transpose`, `concat`, `slice_by_index` | `concat` requires `interleave` param |
| Composite | Full transformer block (LN + SDPA + Residual + FFN + GELU) | Compiles and runs as a single ANE program (~0.21ms) |
### Available but Not Yet Tested
These are valid MIL operations that the E5 compiler should accept:
- `conv` -- convolutions (the upstream maderix/ANE repo uses these extensively for training)
- `reduce_sum`, `reduce_mean`, `reduce_max` -- reductions
- `gather`, `scatter` -- embedding lookups, KV cache writes
- `rsqrt`, `sqrt`, `exp`, `log`, `tanh` -- unary math
- `split`, `slice_by_size` -- tensor slicing
- `batch_norm`, `instance_norm` -- normalization variants
- Various pooling, padding, upsampling operations
### What Cannot Run on ANE
| Limitation | Detail |
|-----------|--------|
| No control flow | No loops, conditionals, or branching. MIL is a static dataflow graph. |
| No dynamic shapes | All tensor dimensions must be known at compile time. |
| No runtime weight updates | Weights are `const`, baked into the compiled binary. Changing weights requires recompilation (~10-50ms). |
| No arbitrary memory access | No pointers or indexing beyond what `gather`/`scatter` provide. |
| No custom ops | Only operations in Apple's MIL op set. No user-defined kernels at the hardware level. |
| No FP32 compute | ANE computes in FP16 only. FP32 inputs are cast to FP16 internally. |
### Implications for Training
The ANE can execute the forward pass and the matrix math of backpropagation (`matmul` for dX and dW gradients). However, training is impractical because weights are read-only constants. After computing weight gradients on ANE, the optimizer step (W -= lr * dW) must run on CPU, and the MIL program must be recompiled with updated weights before the next forward pass. This recompilation costs ~10-50ms per step, dominating training time. See [ANE_CHAINING_RESEARCH.md, Section 9](ANE_CHAINING_RESEARCH.md#9-ane-training-feasibility-analysis) for detailed analysis.
---
## 4. Security and safety mechanisms
The ANE has multiple layers of safety enforcement, but Apple's security model assumes access goes through CoreML. The private APIs we use bypass CoreML but still pass through the `aned` daemon and the E5 compiler.
### Compile-Time Safety
| Mechanism | What it does |
|-----------|-------------|
| MIL syntax validation | The E5 compiler rejects malformed MIL with `InvalidMILProgram` errors |
| Type checking | Tensor dtypes, shapes, and parameter types must match exactly. Mismatches cause compile errors (e.g., `layer_norm` epsilon must match gamma/beta dtype; `concat` axis must be `int32` scalar, not tensor) |
| Op validation | Unknown or unsupported operations are rejected |
| I/O matching | MIL input/output names and shapes must match the `MLModelDescription` passed to `MLE5Engine` |
### Runtime Safety
| Mechanism | What it does |
|-----------|-------------|
| Shape enforcement | Input tensors must match declared shape exactly -- `MultiArray shape doesn't match ML Program's expected shape` error on mismatch |
| Daemon mediation | ANE runs through the `aned` daemon (system service). User processes only get 3 IOKit interfaces: open, close, `programSendRequest` |
| IOSurface isolation | I/O memory is managed by the kernel via IOSurface. Cannot read/write arbitrary memory through them |
| SRAM limits | Programs exceeding the ANE SRAM budget (~24-32MB on M4 Max) are rejected or fall back to CPU/GPU |
| Compile limit | ~119 compiled programs per process before the compiler leaks enough resources to fail (resource exhaustion, not a security boundary) |
### Sandbox Interaction
The E5 runtime needs write access to `~/Library/Caches/<binary_name>/` for its ANE specialization cache. The macOS App Sandbox can block this, causing compilation to fail with permission errors. When running outside a sandbox (e.g., command-line tools), the runtime creates this directory automatically.
### What is NOT Protected
| Gap | Detail |
|-----|--------|
| No access control | No authentication or entitlement check for using the private APIs. Any process can call `_ANEClient.sharedConnection` |
| No rate limiting | Programs can be compiled in a loop until the ~119 limit exhausts resources |
| No MIL signing | No code signing validation on MIL text -- any syntactically valid program that passes the compiler's type checks will execute |
| No isolation between programs | Multiple programs from the same process share the ANE with no hardware-level isolation (the daemon schedules them) |
### Practical Risk Assessment
The ANE attack surface is limited because:
1. **Fixed-function hardware**: The ANE executes predefined neural network operations, not arbitrary instructions. There is no instruction pointer, no stack, and no way to jump to arbitrary code.
2. **Typed dataflow**: MIL programs operate on typed tensors with fixed shapes. There are no buffer overflows in the traditional sense -- the compiler enforces all dimensions at compile time.
3. **Daemon intermediary**: All ANE access goes through `aned`, which validates requests before forwarding to the kernel driver. Direct IOKit access to the ANE is restricted to 3 interfaces.
4. **No persistent state**: ANE programs don't persist across reboots. Compiled programs live in temp directories and caches that are cleaned by the OS.
The main risk of the private APIs is **stability**: these APIs are undocumented and may change with any macOS update, potentially breaking programs that depend on them.
---
## 5. Is the ANE 16-bit?
> hollance/neural-engine says: "It appears so."
**Confirmed.** The ANE operates in FP16 for both compute and storage:
- All IOSurface I/O must be FP16. Passing FP32 data produces zeros.
- MIL programs must use `fp16` I/O types (setting `g_fp16_io=1` in our codebase)
- F32-to-F16 conversion happens on the CPU before writing to IOSurfaces
- FP16 range limits: magnitudes above 65504 (the largest finite FP16 value) overflow, and magnitudes below ~5.96e-8 (2^-24, the smallest FP16 subnormal) underflow toward zero
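For readers without a compiler that supports `_Float16`, the CPU-side conversion and its limits can be reproduced in portable C. This is a sketch of standard IEEE 754 binary16 conversion with round-to-nearest-even, not this project's conversion code. The function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one float to IEEE 754 binary16 bits, rounding to nearest even. */
static uint16_t f32_to_f16_bits(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);                    /* reinterpret the bits safely */
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    uint32_t mag = x & 0x7FFFFFFFu;

    if (mag > 0x7F800000u) return (uint16_t)(sign | 0x7E00u);  /* NaN */
    if (mag >= 0x477FF000u) return (uint16_t)(sign | 0x7C00u); /* >= 65520: infinity */
    if (mag <= 0x33000000u) return sign;                       /* <= 2^-25: zero */

    if (mag >= 0x38800000u) {                    /* normal FP16 range (>= 2^-14) */
        uint32_t h = (mag - 0x38000000u) >> 13;  /* rebias exponent, drop 13 bits */
        uint32_t rem = mag & 0x1FFFu;
        if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) h++;
        return (uint16_t)(sign | h);
    }

    /* Subnormal FP16: shift the 24-bit significand into place. */
    uint32_t m = (mag & 0x7FFFFFu) | 0x800000u;
    uint32_t shift = 126u - (mag >> 23);         /* between 14 and 24 here */
    uint32_t h = m >> shift;
    uint32_t rem = m & ((1u << shift) - 1u);
    uint32_t half = 1u << (shift - 1u);
    if (rem > half || (rem == half && (h & 1u))) h++;
    return (uint16_t)(sign | h);
}

/* Convert a whole buffer, e.g. before copying it into an FP16 IOSurface. */
static void f32_to_f16_buffer(const float *src, uint16_t *dst, size_t n) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) dst[i] = f32_to_f16_bits(src[i]);
}
```

The two early returns mark the exact rounding boundaries: inputs of 65520 or more become infinity, and inputs of 2^-25 or less become zero. Whether the ANE itself flushes subnormals is not something this sketch can show.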
### Quantization Support
| Format | ANE Native? | Notes |
|--------|------------|-------|
| FP16 | Yes | Native compute and storage format |
| INT8 | Partial | Memory bandwidth savings only, no compute speedup. `constexpr_affine_dequantize` in MIL dequantizes to FP16 before compute |
| Q4 | No | Not supported. Requires GPU (Metal) or CPU dequantization |
| FP32 | No | Internally converted to FP16; higher precision lost |
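The INT8 row relies on affine dequantization, which maps each stored integer back to a real value as `scale * (q - zero_point)`. Below is a CPU sketch of the per-tensor case only, with `dequantize_int8` as an illustrative name. It mirrors the arithmetic, not Apple's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-tensor affine dequantization: w[i] = scale * (q[i] - zero_point). */
static void dequantize_int8(const int8_t *q, float *w, size_t n,
                            float scale, int8_t zero_point) {
    assert(q != NULL && w != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Widen to int first so the subtraction cannot overflow int8_t. */
        w[i] = scale * (float)((int)q[i] - (int)zero_point);
    }
}
```

Because the result is expanded to FP16 before any multiply-accumulate, the saving is in bytes read from memory, which matches the "bandwidth only" note in the table.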
Apple's marketing TOPS figures count INT8 operations, so the 38 TOPS quoted for M4 corresponds to roughly 19 TFLOPS in FP16 (each FP16 operation counts as two INT8 operations).
---
## 6. ANE vs GPU vs CPU
Benchmarked on Qwen2.5-0.5B (dim=896, 24 layers, 494M params) on M4 Max:
### Decode Performance (single-token generation)
| Engine | Format | Weight Size | Decode t/s | Bottleneck |
|--------|--------|-------------|------------|------------|
| CPU AMX (cblas_sgemv) | F32 | 1.97 GB | ~91 t/s | Memory bandwidth |
| CPU AMX (cblas_sgemv) | F16->F32 | 658 MB disk | ~91 t/s | Memory bandwidth (F32 in RAM) |
| CPU AMX (cblas_sgemv) | Q4->F32 | 188 MB disk | ~91 t/s | Memory bandwidth (dequant at load) |
| Metal GPU (Q4 SIMD) | Q4 | 188 MB | ~10 t/s | Dispatch overhead (~400 dispatches/token) |
| LM Studio (MLX) | Q4 MLX | ~188 MB | 258-496 t/s | Optimized Metal kernels |
### Prefill Performance (batch prompt processing)
| Engine | Format | Prefill t/s | Method |
|--------|--------|-------------|--------|
| CPU AMX (cblas_sgemm) | F32 | 880-960 t/s | Batched matmul |
| CPU AMX (cblas_sgemv) | F32 | ~40 t/s | Sequential per-token |
### ANE Training Kernel Performance
| Metric | Value |
|--------|-------|
| Kernel latency | ~0.2 ms per kernel (768x256 production dims) |
| Peak TFLOPS | 11.14 (128x conv 512ch sp64) |
| Sustained training | 1.29-1.68 TFLOPS |
| ANE utilization | 8-11% of peak |
### When to use each
- **ANE**: Best for parallel FP16 operations where data stays on-chip (training kernels, fused attention). The ~119 compile limit and FP16-only restriction are significant constraints.
- **GPU (Metal)**: Best for large models (dim >= 4096) where native quantized matmul kernels (as in MLX/llama.cpp) can read Q4/Q8 data directly from GPU memory. Dispatch overhead dominates for small models.
- **CPU AMX**: Best for small/medium model decode (dim <= 896). `cblas_sgemv` uses the AMX coprocessor internally and achieves ~33% of theoretical bandwidth. In our tests, manual NEON, threading, and our Metal Q4 kernel did not beat it at this model size.
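The "memory bandwidth" bottleneck in the decode table follows from a simple bound: each generated token streams every weight once. The sketch below shows that arithmetic. The helper names are hypothetical, and the 546 GB/s peak used in the note after it is an assumption about the top M4 Max configuration, not a figure from this document.

```c
#include <assert.h>

/* Upper bound on decode tokens/s if each token reads all weights once. */
static double decode_tps_bound(double bandwidth_gb_s, double weight_gb) {
    assert(bandwidth_gb_s > 0.0 && weight_gb > 0.0);
    return bandwidth_gb_s / weight_gb;
}

/* Effective memory bandwidth implied by a measured decode rate. */
static double effective_bandwidth_gb_s(double tokens_per_s, double weight_gb) {
    assert(tokens_per_s >= 0.0 && weight_gb > 0.0);
    return tokens_per_s * weight_gb;
}
```

At ~91 t/s with 1.97 GB of F32 weights the implied bandwidth is about 179 GB/s, which is roughly a third of an assumed 546 GB/s peak. It also explains why the F16 and Q4 files decode at the same rate: they are expanded to F32 in RAM, so the bytes read per token do not change.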
---
## 7. Reverse engineering the ANE
### Prior Work
| Project | Focus | Key Contribution |
|---------|-------|-------------------|
| [hollance/neural-engine](https://github.com/hollance/neural-engine) | CoreML-level documentation | Comprehensive device list, layer compatibility, model surgery guides |
| [geohot/tinygrad ANE](https://github.com/tinygrad/tinygrad) | Driver-level reverse engineering | Initial IOKit driver analysis, ANE instruction format exploration |
| [Black Hat Asia 2021 (Wish Wu)](https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers) | Full stack: ML to HW registers | Documented compilation pipeline, .hwx format, security attack surfaces, FaceID ANE usage. Created ANEDisassembler. [Video](https://www.youtube.com/watch?v=1wvBDUnPNEo) |
| [ANETools](https://github.com/antgroup-skyward/ANETools) | CLI compilation and disassembly | ANECompiler CLI wrapper, ANEDisassembler for .hwx files, `debug_mask` flag for intermediate output |
| [eiln/anecc](https://github.com/eiln/anecc) | Independent ANE compiler | CoreML-to-ANE compiler for Asahi Linux, alternative compilation path |
| [freedomtan/coreml_to_ane_hwx](https://github.com/freedomtan/coreml_to_ane_hwx) | CoreML to .hwx conversion | Direct converter bypassing some CoreML steps |
| [maderix/ANE](https://github.com/maderix/ANE) | Training on ANE | First neural network training on ANE via private APIs |
| [maderix Substack](https://open.substack.com/pub/maderix/p/inside-the-m4-apple-neural-engine) | M4 ANE deep-dive | Detailed M4 ANE architecture analysis, SRAM probing, kernel fusion |
### Our Discoveries: Private API Class Hierarchy
We have documented 20+ private Objective-C classes in `AppleNeuralEngine.framework`:
```
NSObject
|-- _ANEClient (singleton, daemon connection)
| Methods: sharedConnection, evaluateWithModel:, evaluateRealTimeWithModel:,
| doEvaluateDirectWithModel:, prepareChainingWithModel:,
| enqueueSetsWithModel:, buffersReadyWithModel:,
| beginRealTimeTask, endRealTimeTask
|
|-- _ANEInMemoryModelDescriptor (MIL + weights spec)
| Factory: +modelWithMILText:weights:optionsPlist:
|
|-- _ANEInMemoryModel (compile/load/run)
| Methods: compileWithQoS:, loadWithQoS:, evaluateWithQoS:, unloadWithQoS:
| Props: hexStringIdentifier, programHandle (uint64), program, perfStatsMask
|
|-- _ANEModel (disk-based compiled model -- 52 instance methods)
| Factory: +modelAtURL:key:, +modelAtURL:key:modelAttributes:
| Methods: getUUID, inputSymbolIndicesForProcedureIndex:,
| outputSymbolIndicesForProcedureIndex:
| Props: mapper, program
|
|-- _ANERequest (I/O surface packaging)
| Factory: +requestWithInputs:inputIndices:outputs:outputIndices:
| weightsBuffer:perfStats:procedureIndex:
|
|-- _ANEIOSurfaceObject (thin IOSurface wrapper)
| Factory: +objectWithIOSurface:
|
|-- _ANEBuffer (IOSurfaceObject + symbolIndex + source) [KEY DISCOVERY]
| Factory: +bufferWithIOSurfaceObject:symbolIndex:source:
| source: 0=ANE, 1=output, 2=unknown
|
|-- _ANEChainingRequest (multi-op pipeline)
| Factory: +chainingRequestWithInputs:outputSets:lbInputSymbolId:
| lbOutputSymbolId:procedureIndex:signalEvents:
| transactionHandle:fwEnqueueDelay:memoryPoolId:
| Methods: validate
|
|-- _ANEIOSurfaceOutputSets (output packaging for chaining)
| Factory: +objectWithstatsSurRef:outputBuffer:
| Note: requires non-NULL statsSurRef (any IOSurface works, even 64 bytes)
|
|-- _ANEInputBuffersReady (input signaling for chaining)
| Factory: +inputBuffersWithProcedureIndex:inputBufferInfoIndex:
| inputFreeValue:executionDelay:
|
|-- _ANEOutputSetEnqueue (output pipeline config for chaining)
| Factory: +outputSetWithProcedureIndex:setIndex:signalValue:
| signalNotRequired:isOpenLoop:
|
|-- _ANEProgramForEvaluation (lower-level program)
| Factory: +programWithHandle:intermediateBufferHandle:queueDepth:
| Methods: processRequest:model:qos:qIndex:modelStringID:options:
| returnValue:error:
|
|-- _ANEProgramIOSurfacesMapper (symbol-to-surface mapping)
| Factory: +mapperWithProgramHandle:, +mapperWithController:
| Note: only works with _ANEModel, not _ANEInMemoryModel
|
|-- _ANEPerformanceStats
| Factory: +statsWithHardwareExecutionNS:
| Props: hwExecutionTime, performanceCounters
|
|-- _ANESharedSignalEvent (hardware signal fence)
| Factory: +signalEventWithValue:symbolIndex:eventType:sharedEvent:
| Requires IOSurfaceSharedEvent objects
|
|-- _ANESharedWaitEvent (hardware wait fence)
| Factory: +waitEventWithValue:sharedEvent:
| Requires IOSurfaceSharedEvent objects
|
|-- _ANEModelInstanceParameters, _ANEDeviceController, _ANEQoSMapper
```
Full details with experiment logs: [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md)
### ChainingRequest API Status
The `_ANEChainingRequest` API is designed to pipeline multiple ANE operations without CPU round-trips. Current status:
- `_ANEChainingRequest.validate` returns **YES** (with `_ANEBuffer` inputs + `_ANEIOSurfaceOutputSets` outputs)
- `prepareChainingWithModel:` **fails** -- calls `getUUID` on `_ANEInMemoryModel` which lacks it
- Requires `_ANEModel` (disk-based compiled model) which has `getUUID` and symbol index methods
- `_ANEModel` factory methods require a `key:` parameter; the hex identifier from `_ANEInMemoryModel` is the likely key
This is the highest-priority research area. Chaining would eliminate the ~23 CPU-ANE round-trips per token in a 12-layer model, potentially enabling on-chip pipeline execution.
### model.hwx Binary Format
The `.hwx` file is the compiled hardware representation loaded by the ANE kernel driver. From Wu's Black Hat research:
- Mach-O format binary containing register operations
- Compiled from `net.plist` + weights by the ANECompiler module
- Loaded by the `H11ANEIn` kernel driver via `programCreate` interface
- ANE firmware parses it to extract register addresses and values
- Can be disassembled with [ANETools/ANEDisassembler](https://github.com/antgroup-skyward/ANETools)
Our `_ANEInMemoryModel` path bypasses `.hwx` generation -- the model goes directly from MIL to an internal binary format in a temp directory. Whether this temp directory contains an equivalent to `.hwx` is an open question (see [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md) for next steps).
---
## 8. How to verify ANE execution
### Power Monitoring
```bash
sudo powermetrics --samplers ane_power -i 1000
```
Shows real-time ANE power draw. Active ANE usage typically shows 2-4W on M4 Max during training.
### Performance Statistics
```objc
model.perfStatsMask = 0xFF;
// After execution:
// model.performanceCounters -- returns nil on current macOS (limited API)
```
The `_ANEPerformanceStats` class exists and can be instantiated via `+statsWithHardwareExecutionNS:`, but the hardware counters are not populated on the current macOS/M4 combination. The `perfStatsMask` property is accepted but `performanceCounters` returns nil after execution.
### IOSurface Output Validation
Read back FP16 data from output IOSurfaces and compare against CPU reference:
```objc
// Lock first: the base address should only be accessed while the surface is locked
IOSurfaceLock(surface, kIOSurfaceLockReadOnly, NULL);
_Float16 *out = (_Float16 *)IOSurfaceGetBaseAddress(surface);
for (int i = 0; i < n; i++) {
float val = (float)out[i];
// Compare against CPU reference
}
IOSurfaceUnlock(surface, kIOSurfaceLockReadOnly, NULL);
```
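The comparison step itself needs a tolerance, because FP16 output will not match an FP32 reference bit for bit. Below is a minimal sketch of a mixed absolute/relative check. `count_mismatches` and the tolerance values in the note after it are assumptions, not the thresholds used by this project's tests.

```c
#include <assert.h>
#include <stddef.h>

/* Count elements whose error exceeds atol + rtol * |ref[i]|.
 * got[] holds ANE output already widened to float, ref[] the CPU result. */
static size_t count_mismatches(const float *got, const float *ref, size_t n,
                               float atol, float rtol) {
    assert(got != NULL && ref != NULL);
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float diff = got[i] - ref[i];
        if (diff < 0.0f) diff = -diff;
        float mag = ref[i] < 0.0f ? -ref[i] : ref[i];
        if (!(diff <= atol + rtol * mag)) bad++;   /* NaN also counts as bad */
    }
    return bad;
}
```

FP16 carries about three significant decimal digits, so tolerances on the order of 1e-3 are a plausible starting point. An all-zero output against a non-zero reference is the signature of the FP32-into-FP16-surface mistake described in Section 5.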
### ANE Compiler Debug Output
From Wu's research, the ANECompiler module has a `debug_mask` flag. Setting it to `2147483647` (max int) generates intermediate files during compilation, revealing:
- Register operation sequences
- Memory allocation decisions
- Tiling strategies
- Weight layout in SRAM
This can be applied when using the ANECompiler CLI tools from [ANETools](https://github.com/antgroup-skyward/ANETools).
---
## 9. References and external resources
### Documentation and Research
| Resource | URL | Focus |
|----------|-----|-------|
| hollance/neural-engine | https://github.com/hollance/neural-engine | CoreML-level ANE docs |
| maderix Substack | https://open.substack.com/pub/maderix/p/inside-the-m4-apple-neural-engine | M4 ANE architecture |
| Black Hat Asia 2021 | https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers | Full stack reverse engineering |
| BH Asia 2021 Video | https://www.youtube.com/watch?v=1wvBDUnPNEo | 30-min talk by Wish Wu |
| Apple ML Research | https://machinelearning.apple.com/research/neural-engine-transformers | Deploying transformers on ANE |
| ANE Supported Devices | https://github.com/hollance/neural-engine/blob/master/docs/supported-devices.md | Comprehensive device/chip list |
### Tools
| Tool | URL | Purpose |
|------|-----|---------|
| ANETools | https://github.com/antgroup-skyward/ANETools | ANECompiler CLI, ANEDisassembler |
| eiln/anecc | https://github.com/eiln/anecc | Independent ANE compiler (Asahi Linux) |
| freedomtan/coreml_to_ane_hwx | https://github.com/freedomtan/coreml_to_ane_hwx | CoreML to .hwx converter |
| coremltools | https://github.com/apple/coremltools | Apple's official ML model tools |
### Projects Using ANE Directly
| Project | URL | What it does |
|---------|-----|-------------|
| maderix/ANE | https://github.com/maderix/ANE | Training on ANE (this project's upstream) |
| dev-erik/ANE | https://github.com/dev-erik/ANE | This fork: inference optimization, ChainingRequest research |
### This Project's ANE Documentation
| Document | Description |
|----------|-------------|
| [ANE_INTERNALS.md](ANE_INTERNALS.md) | This file -- comprehensive ANE internals guide |
| [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md) | ChainingRequest API research, experiment logs, benchmarks |
| [ARCHITECTURE.md](ARCHITECTURE.md) | Training system architecture, kernel fusion map, data flow |
| [API_REFERENCE.md](API_REFERENCE.md) | Complete function index for all source files |
| [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) | M4 Max benchmark results (training, TFLOPS, SRAM) |


@ -1,14 +1,21 @@
CC = xcrun clang
CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc
CC_C = xcrun clang
ANE_COMPAT = -Wno-deprecated-declarations
SEC_FLAGS = -fstack-protector-strong -Wformat-security
CFLAGS = -O2 -Wall $(ANE_COMPAT) -fobjc-arc $(SEC_FLAGS)
CFLAGS_C = -O2 -Wall -Wextra -Werror -std=c11
CFLAGS_DEBUG = -O0 -g -Wall $(ANE_COMPAT) -fobjc-arc -fsanitize=address,undefined
FRAMEWORKS = -framework Foundation -framework CoreML -framework IOSurface
LDFLAGS = $(FRAMEWORKS) -ldl
HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h
HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h data_validation.h
HEADERS_ANE = $(HEADERS_LARGE) ane_rmsnorm_bwd.h ane_classifier.h
train: train.m ane_runtime.h ane_mil_gen.h model.h forward.h backward.h
$(CC) $(CFLAGS) -o $@ train.m $(LDFLAGS)
$(CC) $(CFLAGS) -o $@ train.m $(LDFLAGS) -framework Accelerate
train_large: train_large.m $(HEADERS_LARGE)
$(CC) $(CFLAGS) -o $@ train_large.m $(LDFLAGS) -framework Accelerate
@ -16,6 +23,14 @@ train_large: train_large.m $(HEADERS_LARGE)
train_large_ane: train_large_ane.m $(HEADERS_ANE)
$(CC) $(CFLAGS) -o $@ train_large_ane.m $(LDFLAGS) -framework Accelerate
HEADERS_OPT = $(HEADERS_LARGE) stories_cpu_ops_opt.h
train_opt: train_opt.m $(HEADERS_OPT)
$(CC) $(CFLAGS) -o $@ train_opt.m $(LDFLAGS) -framework Accelerate -framework Metal -framework MetalPerformanceShaders
train_double_buffer: train_double_buffer.m $(HEADERS_LARGE)
$(CC) $(CFLAGS) -o $@ train_double_buffer.m $(LDFLAGS) -framework Accelerate
PROBES = test_weight_reload test_perf_stats test_qos_sweep test_ane_advanced
test_rmsnorm_bwd: test_rmsnorm_bwd.m $(HEADERS_ANE)
@ -36,13 +51,56 @@ test_qos_sweep: test_qos_sweep.m
test_ane_advanced: test_ane_advanced.m
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
test_chaining: test_chaining.m
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
test_chaining_v2: test_chaining_v2.m
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
test_bench_paths: test_bench_paths.m ane_runtime.h
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
test_ane_model: test_ane_model.m
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Metal
test_throughput_ceiling: test_throughput_ceiling.m ane_runtime.h
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
test_coreml_chaining: test_coreml_chaining.m
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Metal
test_e5_validate: test_e5_validate.m
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Metal
test_mil_custom: test_mil_custom.m
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Accelerate
test_data_validation: test_data_validation.c data_validation.h
$(CC_C) $(CFLAGS_C) -o $@ $<
probes: $(PROBES)
security-tests: test_data_validation
data: tokenize
@bash download_data.sh
tokenize:
python3 tokenize.py
setup: data
@echo "=== Setup complete ==="
@echo "Data: tinystories_data00.bin"
@echo "To train: make train_large && ./train_large"
@echo "Override paths: ANE_MODEL_PATH=... ANE_DATA_PATH=... ./train_large"
verify-flags:
@echo "=== Active CFLAGS ==="
@echo "$(CFLAGS)"
@echo "=== Compiler version ==="
@xcrun clang --version
clean:
rm -f train train_large train_large_ane $(PROBES) test_rmsnorm_bwd test_classifier
.PHONY: clean tokenize probes
rm -f train train_large train_large_ane train_opt train_double_buffer $(PROBES) test_rmsnorm_bwd test_classifier test_data_validation test_chaining test_chaining_v2 test_bench_paths test_ane_model test_throughput_ceiling test_coreml_chaining test_e5_validate test_mil_custom
.PHONY: clean tokenize probes security-tests verify-flags data setup


@ -20,15 +20,33 @@ typedef struct {
static Class g_ANEDesc, g_ANEInMem, g_ANEReq, g_ANEIO;
static bool g_ane_loaded = false;
static id g_ane_client = nil;
static bool g_ane_ok = false;
static void ane_init(void) {
if (g_ane_loaded) return;
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
g_ane_loaded = true; // Set first to prevent re-entry (ref: CRIT-01)
void *handle = dlopen(
"/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine",
RTLD_NOW);
if (!handle) {
fprintf(stderr, "ANE: dlopen failed: %s\n", dlerror());
return;
}
g_ANEDesc = NSClassFromString(@"_ANEInMemoryModelDescriptor");
g_ANEInMem = NSClassFromString(@"_ANEInMemoryModel");
g_ANEReq = NSClassFromString(@"_ANERequest");
g_ANEIO = NSClassFromString(@"_ANEIOSurfaceObject");
g_ane_loaded = true;
if (!g_ANEDesc || !g_ANEInMem || !g_ANEReq || !g_ANEIO) {
fprintf(stderr, "ANE: Private classes not found (macOS version mismatch?)\n");
return;
}
g_ane_ok = true;
Class clientCls = NSClassFromString(@"_ANEClient");
if (clientCls) {
g_ane_client = [clientCls performSelector:@selector(sharedConnection)];
}
}
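The reworked `ane_init` separates "an attempt was made" (`g_ane_loaded`) from "the attempt succeeded" (`g_ane_ok`), and sets the first flag before doing any work so that a failed load is not retried on every call. A minimal sketch of that pattern in portable C follows. The names `lazy_init`, `loader_fn`, `lazy_init_run`, and `failing_loader` are hypothetical and exist only for illustration; they are not part of `ane_runtime.h`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not part of ane_runtime.h. */
typedef bool (*loader_fn)(void *ctx);

typedef struct {
    bool attempted;  /* plays the role of g_ane_loaded: an attempt was made */
    bool ok;         /* plays the role of g_ane_ok: the attempt fully succeeded */
} lazy_init;

/* Runs `load` at most once and caches the outcome, including failure. */
static bool lazy_init_run(lazy_init *st, loader_fn load, void *ctx) {
    assert(st != NULL && load != NULL);
    if (st->attempted) return st->ok;
    st->attempted = true;   /* set before loading so a failing loader is never retried */
    st->ok = load(ctx);
    return st->ok;
}

/* Example loader that always fails and counts how often it was invoked. */
static bool failing_loader(void *ctx) {
    ++*(int *)ctx;
    return false;
}
```

Calling `lazy_init_run` twice with `failing_loader` invokes the loader once and returns `false` both times. Like the original, this sketch is not thread-safe; callers on multiple threads would need `dispatch_once` or a lock around the first call.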
static IOSurfaceRef ane_create_surface(size_t bytes) {
@@ -50,6 +68,7 @@ static ANEKernel *ane_compile(NSData *milText, NSData *weightData,
int nInputs, size_t *inputSizes,
int nOutputs, size_t *outputSizes) {
ane_init();
if (!g_ane_ok) { fprintf(stderr, "ANE: not available\n"); return NULL; } // CRIT-01/02
NSError *e = nil;
NSDictionary *wdict = nil;
@@ -63,6 +82,7 @@ static ANEKernel *ane_compile(NSData *milText, NSData *weightData,
id mdl = ((id(*)(Class,SEL,id))objc_msgSend)(
g_ANEInMem, @selector(inMemoryModelWithDescriptor:), desc);
if (!mdl) { fprintf(stderr, "ANE: inMemoryModel allocation failed\n"); return NULL; } // CRIT-02
// Pre-populate temp dir with MIL + weights
id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier));
@@ -151,6 +171,20 @@ static bool ane_eval(ANEKernel *k) {
return ok;
}
static bool ane_eval_rt(ANEKernel *k) {
if (!g_ane_client) return ane_eval(k);
NSError *e = nil;
BOOL ok = ((BOOL(*)(id,SEL,id,id,id,NSError**))objc_msgSend)(
g_ane_client, @selector(evaluateRealTimeWithModel:options:request:error:),
k->model, @{}, k->request, &e);
if (!ok) {
fprintf(stderr, "ANE RT eval failed, falling back to standard: %s\n",
e ? [[e description] UTF8String] : "unknown");
return ane_eval(k);
}
return true;
}
static void ane_free(ANEKernel *k) {
if (!k) return;
NSError *e = nil;

2260
training/test_ane_model.m Normal file

File diff suppressed because it is too large

148
training/test_bench_paths.m Normal file

@@ -0,0 +1,148 @@
// test_bench_paths.m: Benchmark ANE evaluation paths at production dimensions
// Compares: standard, RT, processRequest, and ane_eval_rt wrapper
#import <Foundation/Foundation.h>
#import <objc/runtime.h>
#import <objc/message.h>
#import <dlfcn.h>
#import <IOSurface/IOSurface.h>
#import <mach/mach_time.h>
static mach_timebase_info_data_t g_tb;
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
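The `tb_ms` helper converts Mach ticks to milliseconds as ticks × numer / denom / 1e6. The sketch below restates that arithmetic in portable C with a stand-in struct so it can be checked off macOS. The `timebase` type and both function names are hypothetical. The 125/3 ratio used in the usage note is the value commonly reported on Apple Silicon; treat it as illustrative and always query `mach_timebase_info` at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for mach_timebase_info_data_t so the sketch builds off macOS. */
typedef struct { uint32_t numer; uint32_t denom; } timebase;

/* ticks * numer / denom gives nanoseconds; dividing by 1e6 gives milliseconds. */
static double ticks_to_ms(uint64_t ticks, timebase tb) {
    assert(tb.denom != 0);
    return (double)ticks * tb.numer / tb.denom / 1e6;
}

/* Mean per-iteration latency, computed the same way as the benchmark loops. */
static double mean_ms_per_iter(uint64_t elapsed_ticks, int iters, timebase tb) {
    assert(iters > 0);
    return ticks_to_ms(elapsed_ticks, tb) / iters;
}
```

With a 125/3 timebase, 24,000 ticks correspond to 1 ms, so 4,800,000 elapsed ticks over 200 iterations give a mean of 1 ms per evaluation. A mean hides outliers; recording per-iteration samples and reporting a median would be a reasonable extension.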
static int g_fp16_io = 0;
#include "ane_runtime.h"
static NSString *gen_bench_conv(int ch, int sp) {
return [NSString stringWithFormat:
@"program(1.0)\n[buildInfo = dict<tensor<string, []>, tensor<string, []>>({{\"coremlc-version\", \"3505.4.1\"}})]\n{\n"
" func main<ios16>(tensor<fp16, [1, %d, 1, %d]> x) {\n"
" tensor<string, []> pt = const()[name=tensor<string, []>(\"pt\"), val=tensor<string, []>(\"valid\")];\n"
" tensor<int32, [2]> st = const()[name=tensor<string, []>(\"st\"), val=tensor<int32, [2]>([1,1])];\n"
" tensor<int32, [4]> pd = const()[name=tensor<string, []>(\"pd\"), val=tensor<int32, [4]>([0,0,0,0])];\n"
" tensor<int32, [2]> dl = const()[name=tensor<string, []>(\"dl\"), val=tensor<int32, [2]>([1,1])];\n"
" tensor<int32, []> gr = const()[name=tensor<string, []>(\"gr\"), val=tensor<int32, []>(1)];\n"
" tensor<fp16, [%d,%d,1,1]> W = const()[name=tensor<string, []>(\"W\"), "
"val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=tensor<string, []>(\"@model_path/weights/weight.bin\"), offset=tensor<uint64, []>(64)))];\n"
" tensor<fp16, [1,%d,1,%d]> y = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=W,x=x)"
"[name=tensor<string, []>(\"conv\")];\n"
" } -> (y);\n}\n", ch, sp, ch, ch, ch, ch, ch, sp];
}
int main(int argc, char **argv) {
@autoreleasepool {
setbuf(stdout, NULL);
mach_timebase_info(&g_tb);
printf("=== ANE Eval Path Benchmark (production dimensions) ===\n\n");
ane_init();
if (!g_ane_ok) { printf("FATAL: ANE not available\n"); return 1; }
typedef struct { int ch; int sp; const char *label; } TestConfig;
TestConfig configs[] = {
{64, 32, "64x32 (test)"},
{128, 64, "128x64 (small)"},
{256, 64, "256x64 (med)"},
{768, 256, "768x256 (prod)"},
{512, 64, "512x64 (large)"},
};
int nconfigs = sizeof(configs) / sizeof(configs[0]);
int WARMUP = 20, ITERS = 200;
id client = g_ane_client;
printf(" Client: %s | Warmup: %d | Iters: %d\n\n", client ? "OK" : "NO", WARMUP, ITERS);
printf("%-18s %10s %14s %14s %14s\n", "Config", "Standard", "RT", "ProcReq", "ane_eval_rt");
printf("%-18s %10s %14s %14s %14s\n", "------", "--------", "--", "-------", "-----------");
for (int ci = 0; ci < nconfigs; ci++) {
int CH = configs[ci].ch, SP = configs[ci].sp;
_Float16 *w = (_Float16*)calloc(CH*CH, sizeof(_Float16));
for (int i = 0; i < CH; i++) w[i*CH+i] = (_Float16)0.5f;
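// Weight blob layout as built here (inferred from this code, not from Apple documentation):
//   [0, 64)    file header; bytes 0 and 4 are set to 1 and 2
//   [64, 128)  chunk header; starts with 0xDEADBEEF stored little-endian, byte 68 set to 1
//   [128, ...) FP16 payload of ws bytes
// The MIL BLOBFILE offset of 64 points at the chunk header, not at the payload.
int ws = CH*CH*2, tot = 128+ws;
uint8_t *blob = (uint8_t*)calloc(tot, 1);
blob[0]=1; blob[4]=2; blob[64]=0xEF; blob[65]=0xBE; blob[66]=0xAD; blob[67]=0xDE; blob[68]=1;
*(uint32_t*)(blob+72)=ws; *(uint32_t*)(blob+80)=128; // payload size in bytes, payload offset
memcpy(blob+128, w, ws);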
NSData *wdata = [NSData dataWithBytesNoCopy:blob length:tot freeWhenDone:YES];
free(w);
g_fp16_io = 1;
NSString *mil = gen_bench_conv(CH, SP);
NSData *milData = [mil dataUsingEncoding:NSUTF8StringEncoding];
size_t ioBytes = CH * SP * 2;
ANEKernel *k = ane_compile(milData, wdata, 1, &ioBytes, 1, &ioBytes);
if (!k) { printf("%-18s (compile failed)\n", configs[ci].label); continue; }
IOSurfaceLock(k->ioInputs[0], 0, NULL);
_Float16 *inp = (_Float16*)IOSurfaceGetBaseAddress(k->ioInputs[0]);
for (int i = 0; i < CH*SP; i++) inp[i] = (_Float16)1.0f;
IOSurfaceUnlock(k->ioInputs[0], 0, NULL);
NSError *e = nil;
for (int i = 0; i < WARMUP; i++) ane_eval(k);
uint64_t t0 = mach_absolute_time();
for (int i = 0; i < ITERS; i++) ane_eval(k);
double std_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
double rt_ms = -1;
if (client) {
@try {
for (int i = 0; i < WARMUP; i++)
((BOOL(*)(id,SEL,id,id,id,NSError**))objc_msgSend)(
client, @selector(evaluateRealTimeWithModel:options:request:error:),
k->model, @{}, k->request, &e);
t0 = mach_absolute_time();
for (int i = 0; i < ITERS; i++)
((BOOL(*)(id,SEL,id,id,id,NSError**))objc_msgSend)(
client, @selector(evaluateRealTimeWithModel:options:request:error:),
k->model, @{}, k->request, &e);
rt_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
} @catch (NSException *ex) { rt_ms = -1; }
}
double proc_ms = -1;
@try {
id prog = [k->model valueForKey:@"program"];
id hexId = [k->model valueForKey:@"hexStringIdentifier"];
SEL procSel = @selector(processRequest:model:qos:qIndex:modelStringID:options:returnValue:error:);
if (prog && [prog respondsToSelector:procSel]) {
for (int i = 0; i < WARMUP; i++) {
BOOL rv = NO;
((BOOL(*)(id,SEL,id,id,unsigned int,int,id,id,BOOL*,NSError**))objc_msgSend)(
prog, procSel, k->request, k->model, 21, 0, hexId, @{}, &rv, &e);
}
t0 = mach_absolute_time();
for (int i = 0; i < ITERS; i++) {
BOOL rv = NO;
((BOOL(*)(id,SEL,id,id,unsigned int,int,id,id,BOOL*,NSError**))objc_msgSend)(
prog, procSel, k->request, k->model, 21, 0, hexId, @{}, &rv, &e);
}
proc_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
}
} @catch (NSException *ex) { (void)ex; }
double wrap_ms = -1;
@try {
for (int i = 0; i < WARMUP; i++) ane_eval_rt(k);
t0 = mach_absolute_time();
for (int i = 0; i < ITERS; i++) ane_eval_rt(k);
wrap_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
} @catch (NSException *ex) { wrap_ms = -1; }
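char s[32], r[32], p[32], w2[32];
snprintf(s, sizeof s, "%.3f ms", std_ms);
// Literal format strings keep -Wformat checking active; a negative time marks an unavailable path.
if (rt_ms >= 0) snprintf(r, sizeof r, "%.3f (%.1fx)", rt_ms, std_ms/rt_ms);
else snprintf(r, sizeof r, "N/A");
if (proc_ms >= 0) snprintf(p, sizeof p, "%.3f (%.1fx)", proc_ms, std_ms/proc_ms);
else snprintf(p, sizeof p, "N/A");
if (wrap_ms >= 0) snprintf(w2, sizeof w2, "%.3f (%.1fx)", wrap_ms, std_ms/wrap_ms);
else snprintf(w2, sizeof w2, "N/A");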
printf("%-18s %10s %14s %14s %14s\n", configs[ci].label, s, r, p, w2);
ane_free(k);
}
printf("\n=== Benchmark complete ===\n");
}
return 0;
}
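The benchmark builds its weight blob inline with raw byte offsets. A small helper makes the layout easier to read and to test. The sketch below reproduces the same bytes in portable C; the function name `build_weight_blob` and the two enum constants are hypothetical, and the field meanings are inferred from the benchmark code above rather than taken from any documentation. It assumes a little-endian host, as the original does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { BLOB_CHUNK_HDR = 64, BLOB_DATA_OFF = 128 };

/* Builds a single-chunk weight blob around `nbytes` of FP16 payload.
 * Returns a calloc'd buffer the caller frees; *out_len receives its size.
 * Field meanings are inferred from the benchmark code, not from documentation. */
static uint8_t *build_weight_blob(const void *payload, uint32_t nbytes, size_t *out_len) {
    assert(payload != NULL && out_len != NULL);
    size_t total = (size_t)BLOB_DATA_OFF + nbytes;
    uint8_t *blob = (uint8_t *)calloc(total, 1);
    if (!blob) return NULL;

    blob[0] = 1;                     /* file header byte 0 */
    blob[4] = 2;                     /* file header byte 4 */

    static const uint8_t magic[4] = {0xEF, 0xBE, 0xAD, 0xDE};  /* 0xDEADBEEF, little-endian */
    memcpy(blob + BLOB_CHUNK_HDR, magic, sizeof magic);
    blob[BLOB_CHUNK_HDR + 4] = 1;

    uint32_t off = BLOB_DATA_OFF;
    memcpy(blob + BLOB_CHUNK_HDR + 8, &nbytes, sizeof nbytes);  /* payload size in bytes */
    memcpy(blob + BLOB_CHUNK_HDR + 16, &off, sizeof off);       /* payload offset from file start */

    memcpy(blob + BLOB_DATA_OFF, payload, nbytes);
    *out_len = total;
    return blob;
}
```

Using `memcpy` for the two 32-bit fields avoids the pointer casts in the original while writing the same bytes on a little-endian host. In the benchmark, the returned buffer could be handed to `dataWithBytesNoCopy:length:freeWhenDone:` exactly as `blob` is today.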

1700
training/test_chaining_v2.m Normal file

File diff suppressed because it is too large

File diff suppressed because it is too large

817
training/test_e5_validate.m Normal file

@@ -0,0 +1,817 @@
// test_e5_validate.m: Experiments W1-W5, E5 Runtime Validation & Deep API Exploration
// Build: make test_e5_validate && ./test_e5_validate
#import <Foundation/Foundation.h>
#import <CoreML/CoreML.h>
#import <objc/runtime.h>
#import <objc/message.h>
#import <dlfcn.h>
#import <mach/mach_time.h>
#import <IOSurface/IOSurface.h>
static mach_timebase_info_data_t g_tb;
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
#pragma mark - Helpers
static void dump_all_methods(Class cls, const char *label) {
if (!cls) { printf(" %s: NOT FOUND\n", label); return; }
printf("\n--- %s ---\n", label);
unsigned int mc;
Method *cm = class_copyMethodList(object_getClass(cls), &mc);
if (mc > 0) {
printf(" Class methods (%u):\n", mc);
for (unsigned int i = 0; i < mc; i++) {
const char *sel = sel_getName(method_getName(cm[i]));
const char *enc = method_getTypeEncoding(cm[i]);
printf(" + %s [%s]\n", sel, enc ? enc : "?");
}
}
free(cm);
Method *im = class_copyMethodList(cls, &mc);
if (mc > 0) {
printf(" Instance methods (%u):\n", mc);
for (unsigned int i = 0; i < mc; i++) {
const char *sel = sel_getName(method_getName(im[i]));
const char *enc = method_getTypeEncoding(im[i]);
printf(" - %s [%s]\n", sel, enc ? enc : "?");
}
}
free(im);
unsigned int pc;
objc_property_t *props = class_copyPropertyList(cls, &pc);
if (pc > 0) {
printf(" Properties (%u):\n", pc);
for (unsigned int i = 0; i < pc; i++)
printf(" %s [%s]\n", property_getName(props[i]),
property_getAttributes(props[i]));
}
free(props);
unsigned int ic;
Ivar *ivars = class_copyIvarList(cls, &ic);
if (ic > 0) {
printf(" Ivars (%u):\n", ic);
for (unsigned int i = 0; i < ic; i++) {
const char *n = ivar_getName(ivars[i]);
const char *t = ivar_getTypeEncoding(ivars[i]);
printf(" %s type=%s\n", n, t ? t : "?");
}
}
free(ivars);
Class superCls = class_getSuperclass(cls); // avoid reusing the Objective-C name `super`
if (superCls && superCls != [NSObject class])
printf(" Superclass: %s\n", class_getName(superCls));
}
static float max_abs_diff(float *a, float *b, int n) {
float m = 0;
for (int i = 0; i < n; i++) {
float d = fabsf(a[i] - b[i]);
if (d > m) m = d;
}
return m;
}
static float mean_abs(float *a, int n) {
float s = 0;
for (int i = 0; i < n; i++) s += fabsf(a[i]);
return s / n;
}
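`max_abs_diff` reports a single absolute number, which is hard to interpret when output magnitudes vary: an absolute threshold of 1e-3 is generous near zero and too strict for values around 100 when the computation runs in FP16. A common alternative is a mixed absolute and relative tolerance, |got - ref| <= atol + rtol * |ref|. The sketch below is one way to write it in plain C; `count_mismatches`, `absf`, and `FP16_EPS` are hypothetical names, and the choice of `rtol` is an assumption, not something the validation in this file uses.

```c
#include <assert.h>
#include <stddef.h>

/* Spacing between 1.0 and the next representable FP16 value (10-bit mantissa): 2^-10. */
#define FP16_EPS 0.0009765625f

/* Local absolute value so the sketch does not depend on libm. */
static float absf(float x) { return x < 0 ? -x : x; }

/* Counts elements where |got - ref| exceeds atol + rtol * |ref|. */
static size_t count_mismatches(const float *ref, const float *got, size_t n,
                               float atol, float rtol) {
    assert(ref != NULL && got != NULL);
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float limit = atol + rtol * absf(ref[i]);
        if (!(absf(got[i] - ref[i]) <= limit)) bad++;   /* a NaN difference counts as a mismatch */
    }
    return bad;
}
```

For reference values `{1.0, 100.0, 0.0}` and results `{1.0005, 100.05, 0.00005}`, a purely absolute tolerance of 1e-3 flags the second element, while `atol = 1e-4` with `rtol = FP16_EPS` accepts all three. Reporting the mismatch count alongside the maximum difference would also show whether a failure is one outlier or a systematic drift.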
#pragma mark - Main
int main(int argc, const char *argv[]) {
(void)argc; (void)argv;
@autoreleasepool {
mach_timebase_info(&g_tb);
printf("================================================================\n");
printf(" E5 Runtime: Validation & Exhaustive API Documentation\n");
printf("================================================================\n\n");
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/"
"AppleNeuralEngine", RTLD_NOW);
// ============================================================
// W2: Exhaustive API Documentation (dump first so we have it)
// ============================================================
printf("================================================================\n");
printf(" W2: Exhaustive E5 Runtime API Documentation\n");
printf("================================================================\n");
const char *classNames[] = {
"MLE5Engine",
"MLE5ProgramLibrary",
"MLE5ProgramLibraryOnDeviceAOTCompilationImpl",
"MLE5ProgramLibraryE5BundleImpl",
"MLE5ExecutionStreamOperation",
"MLE5ExecutionStream",
"MLE5ExecutionStreamPool",
"MLE5StaticShapeExecutionStreamOperationPool",
"MLE5RangeShapeExecutionStreamOperationPool",
"MLE5EnumeratedShapeExecutionStreamOperationPool",
"MLE5ExecutionStreamOperationPoolFactory",
"MLE5InputPort",
"MLE5OutputPort",
"MLE5InputPortBinder",
"MLE5OutputPortBinder",
"MLProgramE5Container",
NULL
};
for (int i = 0; classNames[i]; i++) {
Class cls = NSClassFromString(
[NSString stringWithUTF8String:classNames[i]]);
dump_all_methods(cls, classNames[i]);
}
printf("\n--- e5rt_* C API Symbols ---\n");
const char *cFuncs[] = {
"e5rt_program_library_create",
"e5rt_program_library_destroy",
"e5rt_program_library_compile",
"e5rt_program_library_get_function",
"e5rt_program_library_load_function",
"e5rt_execution_stream_create",
"e5rt_execution_stream_destroy",
"e5rt_execution_stream_submit",
"e5rt_execution_stream_wait",
"e5rt_execution_stream_execute",
"e5rt_execution_stream_sync",
"e5rt_execution_stream_operation_create",
"e5rt_execution_stream_operation_destroy",
"e5rt_execution_stream_operation_set_input",
"e5rt_execution_stream_operation_set_output",
"e5rt_execution_stream_operation_execute",
"e5rt_async_event_create",
"e5rt_async_event_destroy",
"e5rt_async_event_signal",
"e5rt_async_event_wait",
"e5rt_buffer_create",
"e5rt_buffer_destroy",
"e5rt_io_port_create",
"e5rt_io_port_bind",
"e5rt_context_create",
"e5rt_init",
"e5rt_get_version",
NULL
};
for (int i = 0; cFuncs[i]; i++) {
void *sym = dlsym(RTLD_DEFAULT, cFuncs[i]);
if (sym) printf(" FOUND: %s at %p\n", cFuncs[i], sym);
}
fflush(stdout);
// ============================================================
// W1: Output Validation
// ============================================================
printf("\n================================================================\n");
printf(" W1: Output Correctness Validation\n");
printf("================================================================\n\n");
int ch = 256, sp = 64;
NSString *pkgPath = [NSString stringWithFormat:
@"/tmp/ane_sram_%dch_%dsp.mlpackage", ch, sp];
if (![[NSFileManager defaultManager] fileExistsAtPath:pkgPath]) {
printf(" FATAL: %s not found. Run gen_mlpackages.py\n",
[pkgPath UTF8String]);
return 1;
}
NSError *err = nil;
MLModelConfiguration *cfg = [[MLModelConfiguration alloc] init];
cfg.computeUnits = MLComputeUnitsAll;
MLPredictionOptions *predOpts = [[MLPredictionOptions alloc] init];
Class opCls = NSClassFromString(@"MLE5ExecutionStreamOperation");
NSURL *compiled = [MLModel compileModelAtURL:
[NSURL fileURLWithPath:pkgPath] error:&err];
if (!compiled) { printf(" Compile FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); return 1; }
err = nil;
MLModel *model = [MLModel modelWithContentsOfURL:compiled
configuration:cfg error:&err];
if (!model) { printf(" Load FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); return 1; }
int nElems = 1 * ch * 1 * sp;
MLMultiArray *inputArr = [[MLMultiArray alloc]
initWithShape:@[@1, @(ch), @1, @(sp)]
dataType:MLMultiArrayDataTypeFloat32 error:nil];
float *inPtr = (float *)[inputArr dataPointer];
for (int i = 0; i < nElems; i++)
inPtr[i] = sinf((float)i * 0.01f) * 0.5f;
NSString *inName = [[[[model modelDescription] inputDescriptionsByName]
allKeys] firstObject];
NSString *outName = [[[[model modelDescription] outputDescriptionsByName]
allKeys] firstObject];
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
initWithDictionary:@{inName: inputArr} error:nil];
printf(" Input: %s [1,%d,1,%d], first 5: [%.4f %.4f %.4f %.4f %.4f]\n",
[inName UTF8String], ch, sp,
inPtr[0], inPtr[1], inPtr[2], inPtr[3], inPtr[4]);
printf(" Output: %s\n", [outName UTF8String]);
fflush(stdout);
// --- Reference: CoreML sequential prediction ---
printf("\n --- W1.1: CoreML reference prediction ---\n");
err = nil;
id<MLFeatureProvider> refResult = [model predictionFromFeatures:fp error:&err];
if (!refResult) { printf(" Prediction FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); return 1; }
MLMultiArray *refOut = [refResult featureValueForName:outName].multiArrayValue;
float *refPtr = (float *)[refOut dataPointer];
int outElems = 1;
for (int d = 0; d < (int)refOut.shape.count; d++)
outElems *= [refOut.shape[d] intValue];
printf(" Output shape: [");
for (int d = 0; d < (int)refOut.shape.count; d++)
printf("%s%d", d ? "," : "", [refOut.shape[d] intValue]);
printf("] (%d elements)\n", outElems);
printf(" First 5 ref: [%.6f %.6f %.6f %.6f %.6f]\n",
refPtr[0], refPtr[1], refPtr[2], refPtr[3], refPtr[4]);
printf(" Mean |ref|: %.6f\n", mean_abs(refPtr, outElems));
fflush(stdout);
// --- E5 stream prediction ---
printf("\n --- W1.2: E5 stream prediction ---\n");
id e5engine = nil;
@try { e5engine = [model valueForKey:@"_internalEngine"]; }
@catch (NSException *e) { (void)e; }
id progLib = nil;
@try { progLib = [e5engine valueForKey:@"programLibrary"]; }
@catch (NSException *e) { (void)e; }
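id streamPool = nil;
@try { streamPool = [e5engine valueForKey:@"streamPool"]; }
@catch (NSException *e) { (void)e; }
// These are private, undocumented objects. Stop here if any is missing;
// otherwise every later step silently messages nil and reports "NO".
if (!opCls || !e5engine || !progLib || !streamPool) {
printf(" FATAL: E5 internals unavailable (opCls=%s engine=%s progLib=%s streamPool=%s)\n",
opCls ? "ok" : "nil", e5engine ? "ok" : "nil",
progLib ? "ok" : "nil", streamPool ? "ok" : "nil");
return 1;
}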
id op = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))objc_msgSend)(
[opCls alloc],
@selector(initWithProgramLibrary:functionName:modelDescription:
configuration:debugLabel:modelSignpostId:),
progLib, @"main", [model modelDescription], cfg,
@"validate_op", (unsigned long long)0);
NSError *plErr = nil;
BOOL plOk = ((BOOL(*)(id,SEL,NSError**))objc_msgSend)(
op, @selector(preloadAndReturnError:), &plErr);
printf(" preload: %s\n", plOk ? "YES" : "NO");
if (plErr) printf(" Error: %s\n", [[plErr description] UTF8String]);
fflush(stdout);
id stream = [streamPool performSelector:@selector(takeOut)];
Ivar shIvar = class_getInstanceVariable([stream class], "_streamHandle");
void *sh = (__bridge void *)object_getIvar(stream, shIvar);
printf(" stream: %p, handle: %p\n", (__bridge void *)stream, sh);
[stream setValue:@[op] forKey:@"operations"];
NSError *prepErr = nil;
BOOL prepOk = ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
op, @selector(prepareForInputFeatures:options:error:),
fp, predOpts, &prepErr);
printf(" prepare: %s\n", prepOk ? "YES" : "NO");
if (prepErr) printf(" Error: %s\n", [[prepErr description] UTF8String]);
fflush(stdout);
NSError *execErr = nil;
BOOL execOk = ((BOOL(*)(id,SEL,void*,NSError**))objc_msgSend)(
stream, @selector(_executeStream:error:), sh, &execErr);
printf(" execute: %s\n", execOk ? "YES" : "NO");
if (execErr) printf(" Error: %s\n", [[execErr description] UTF8String]);
fflush(stdout);
// Read output from the operation
printf("\n --- W1.3: Read E5 output features ---\n");
fflush(stdout);
id e5Result = nil;
@try {
e5Result = [op valueForKey:@"outputFeatures"];
printf(" outputFeatures: %s\n",
e5Result ? [NSStringFromClass([e5Result class]) UTF8String]
: "nil");
} @catch (NSException *ex) {
printf(" outputFeatures EXCEPTION: %s\n",
[[ex reason] UTF8String]);
}
if (e5Result && [e5Result conformsToProtocol:@protocol(MLFeatureProvider)]) {
MLMultiArray *e5Out = [(id<MLFeatureProvider>)e5Result
featureValueForName:outName].multiArrayValue;
if (e5Out) {
float *e5Ptr = (float *)[e5Out dataPointer];
printf(" E5 first 5: [%.6f %.6f %.6f %.6f %.6f]\n",
e5Ptr[0], e5Ptr[1], e5Ptr[2], e5Ptr[3], e5Ptr[4]);
printf(" Mean |e5|: %.6f\n", mean_abs(e5Ptr, outElems));
float mad = max_abs_diff(refPtr, e5Ptr, outElems);
printf(" Max abs diff: %.8f\n", mad);
printf(" Relative error: %.2e\n",
mad / (mean_abs(refPtr, outElems) + 1e-10f));
if (mad < 1e-3f) {
printf(" *** VALIDATION PASSED: outputs match ***\n");
} else if (mad < 1e-1f) {
printf(" VALIDATION WARNING: small differences (FP16 expected)\n");
} else {
printf(" VALIDATION FAILED: outputs diverge!\n");
}
} else {
printf(" E5 output array is nil for key '%s'\n",
[outName UTF8String]);
NSArray *ofNames = [(id<MLFeatureProvider>)e5Result
featureNames].allObjects;
printf(" Available features: %s\n",
[[ofNames description] UTF8String]);
}
} else {
printf(" Cannot read output features\n");
}
// Also read output via outputPorts
printf("\n --- W1.4: Read via output ports ---\n");
fflush(stdout);
@try {
id outPorts = [op valueForKey:@"outputPorts"];
printf(" outputPorts: %s (count=%lu)\n",
outPorts ? [NSStringFromClass([outPorts class]) UTF8String]
: "nil",
outPorts ? (unsigned long)[(NSArray *)outPorts count] : 0);
if (outPorts && [(NSArray *)outPorts count] > 0) {
for (NSUInteger pi = 0; pi < [(NSArray *)outPorts count]; pi++) {
id port = [(NSArray *)outPorts objectAtIndex:pi];
printf(" Port[%lu]: %s\n", (unsigned long)pi,
[[port description] UTF8String]);
@try {
id portName = [port valueForKey:@"name"];
printf(" name: %s\n",
portName ? [(NSString *)portName UTF8String] : "nil");
} @catch (NSException *ex) { (void)ex; }
@try {
id portFD = [port valueForKey:@"featureDescription"];
printf(" featureDescription: %s\n",
portFD ? [[portFD description] UTF8String] : "nil");
} @catch (NSException *ex) { (void)ex; }
@try {
id binder = [port valueForKey:@"binder"];
printf(" binder: %s\n",
binder ? [NSStringFromClass([binder class])
UTF8String] : "nil");
if (binder) {
@try {
id fv = [binder valueForKey:@"featureValue"];
printf(" featureValue: %s\n",
fv ? [NSStringFromClass([fv class])
UTF8String] : "nil");
if (fv) {
MLMultiArray *ma = [(MLFeatureValue *)fv
multiArrayValue];
if (ma) {
float *ptr = (float *)[ma dataPointer];
printf(" first 5: [%.6f %.6f %.6f"
" %.6f %.6f]\n",
ptr[0], ptr[1], ptr[2],
ptr[3], ptr[4]);
float mad2 = max_abs_diff(refPtr, ptr,
outElems);
printf(" Max abs diff vs ref: %.8f\n",
mad2);
}
}
} @catch (NSException *ex) {
printf(" featureValue EXCEPTION: %s\n",
[[ex reason] UTF8String]);
}
}
} @catch (NSException *ex) { (void)ex; }
}
}
} @catch (NSException *ex) {
printf(" outputPorts EXCEPTION: %s\n", [[ex reason] UTF8String]);
}
// Also read input ports
printf("\n --- W1.5: Inspect input ports ---\n");
fflush(stdout);
@try {
id inPorts = [op valueForKey:@"inputPorts"];
printf(" inputPorts: %s (count=%lu)\n",
inPorts ? [NSStringFromClass([inPorts class]) UTF8String]
: "nil",
inPorts ? (unsigned long)[(NSArray *)inPorts count] : 0);
if (inPorts) {
for (NSUInteger pi = 0; pi < [(NSArray *)inPorts count]; pi++) {
id port = [(NSArray *)inPorts objectAtIndex:pi];
printf(" Port[%lu]: %s\n", (unsigned long)pi,
[[port description] UTF8String]);
@try {
printf(" name: %s\n",
[[(id)[port valueForKey:@"name"] description]
UTF8String]);
printf(" portHandle: %p\n",
(__bridge void *)[port valueForKey:@"portHandle"]);
} @catch (NSException *ex) { (void)ex; }
@try {
id binder = [port valueForKey:@"binder"];
if (binder) {
printf(" binder: %s\n",
[NSStringFromClass([binder class]) UTF8String]);
printf(" bindingMode: %d\n",
((char(*)(id,SEL))objc_msgSend)(
binder, @selector(bindingMode)));
id dfv = nil;
@try {
dfv = [binder valueForKey:@"directlyBoundFeatureValue"];
} @catch (NSException *ex) { (void)ex; }
printf(" directlyBound: %s\n",
dfv ? "YES" : "NO");
}
} @catch (NSException *ex) { (void)ex; }
}
}
} @catch (NSException *ex) {
printf(" inputPorts EXCEPTION: %s\n", [[ex reason] UTF8String]);
}
// Return stream
[stream setValue:@[op] forKey:@"operations"];
((void(*)(id,SEL,id))objc_msgSend)(
streamPool, @selector(putBack:), stream);
// ============================================================
// W1.6: Multi-op output validation
// ============================================================
printf("\n --- W1.6: Multi-op output validation ---\n");
fflush(stdout);
{
NSString *pkg2Path = @"/tmp/ane_sram_512ch_64sp.mlpackage";
err = nil;
NSURL *c2 = [MLModel compileModelAtURL:
[NSURL fileURLWithPath:pkg2Path] error:&err];
if (!c2) { printf(" Compile2 FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); goto skip_multiop; }
err = nil;
MLModel *model2 = [MLModel modelWithContentsOfURL:c2
configuration:cfg error:&err];
if (!model2) { printf(" Load2 FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); goto skip_multiop; }
int ch2 = 512;
int nElems2 = 1 * ch2 * 1 * sp;
MLMultiArray *inputArr2 = [[MLMultiArray alloc]
initWithShape:@[@1, @(ch2), @1, @(sp)]
dataType:MLMultiArrayDataTypeFloat32 error:nil];
float *in2Ptr = (float *)[inputArr2 dataPointer];
for (int i = 0; i < nElems2; i++)
in2Ptr[i] = cosf((float)i * 0.02f) * 0.3f;
NSString *in2Name = [[[[model2 modelDescription] inputDescriptionsByName]
allKeys] firstObject];
NSString *out2Name = [[[[model2 modelDescription] outputDescriptionsByName]
allKeys] firstObject];
MLDictionaryFeatureProvider *fp2 = [[MLDictionaryFeatureProvider alloc]
initWithDictionary:@{in2Name: inputArr2} error:nil];
// Reference predictions
err = nil;
id<MLFeatureProvider> ref1 = [model predictionFromFeatures:fp error:&err];
err = nil;
id<MLFeatureProvider> ref2 = [model2 predictionFromFeatures:fp2 error:&err];
if (!ref1 || !ref2) { printf(" Reference prediction FAILED\n"); goto skip_multiop; }
float *ref1Ptr = (float *)[[ref1 featureValueForName:outName].multiArrayValue dataPointer];
float *ref2Ptr = (float *)[[ref2 featureValueForName:out2Name].multiArrayValue dataPointer];
// E5 multi-op stream
id e5_2 = nil;
@try { e5_2 = [model2 valueForKey:@"_internalEngine"]; }
@catch (NSException *e) { (void)e; }
id pLib2 = nil;
@try { pLib2 = [e5_2 valueForKey:@"programLibrary"]; }
@catch (NSException *e) { (void)e; }
id op1 = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))objc_msgSend)(
[opCls alloc],
@selector(initWithProgramLibrary:functionName:modelDescription:
configuration:debugLabel:modelSignpostId:),
progLib, @"main", [model modelDescription], cfg,
@"val_op1", (unsigned long long)0);
id op2 = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))objc_msgSend)(
[opCls alloc],
@selector(initWithProgramLibrary:functionName:modelDescription:
configuration:debugLabel:modelSignpostId:),
pLib2, @"main", [model2 modelDescription], cfg,
@"val_op2", (unsigned long long)0);
((BOOL(*)(id,SEL,NSError**))objc_msgSend)(op1, @selector(preloadAndReturnError:), nil);
((BOOL(*)(id,SEL,NSError**))objc_msgSend)(op2, @selector(preloadAndReturnError:), nil);
id stream2 = [streamPool performSelector:@selector(takeOut)];
Ivar shIvar2 = class_getInstanceVariable([stream2 class], "_streamHandle");
void *sh2 = (__bridge void *)object_getIvar(stream2, shIvar2);
[stream2 setValue:@[op1, op2] forKey:@"operations"];
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
op1, @selector(prepareForInputFeatures:options:error:),
fp, predOpts, nil);
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
op2, @selector(prepareForInputFeatures:options:error:),
fp2, predOpts, nil);
NSError *mErr = nil;
BOOL mOk = ((BOOL(*)(id,SEL,void*,NSError**))objc_msgSend)(
stream2, @selector(_executeStream:error:), sh2, &mErr);
printf(" Multi-op execute: %s\n", mOk ? "YES" : "NO");
if (mErr) printf(" Error: %s\n", [[mErr description] UTF8String]);
fflush(stdout);
if (mOk) {
// Read outputs
@try {
id out1 = [op1 valueForKey:@"outputFeatures"];
id out2 = [op2 valueForKey:@"outputFeatures"];
if (out1 && out2) {
MLMultiArray *ma1 = [(id<MLFeatureProvider>)out1
featureValueForName:outName].multiArrayValue;
MLMultiArray *ma2 = [(id<MLFeatureProvider>)out2
featureValueForName:out2Name].multiArrayValue;
if (ma1 && ma2) {
float *p1 = (float *)[ma1 dataPointer];
float *p2 = (float *)[ma2 dataPointer];
float mad1 = max_abs_diff(ref1Ptr, p1, outElems);
float mad2 = max_abs_diff(ref2Ptr, p2, nElems2);
printf(" Op1 max diff: %.8f (mean_ref=%.6f)\n",
mad1, mean_abs(ref1Ptr, outElems));
printf(" Op2 max diff: %.8f (mean_ref=%.6f)\n",
mad2, mean_abs(ref2Ptr, nElems2));
if (mad1 < 1e-3f && mad2 < 1e-3f) {
printf(" *** MULTI-OP VALIDATION PASSED ***\n");
} else {
printf(" MULTI-OP VALIDATION: differences detected\n");
}
} else {
printf(" Could not extract MLMultiArray from outputs\n");
}
} else {
printf(" outputFeatures nil for op1 or op2\n");
}
} @catch (NSException *ex) {
printf(" Output read EXCEPTION: %s\n",
[[ex reason] UTF8String]);
}
}
[stream2 setValue:@[op1] forKey:@"operations"];
((void(*)(id,SEL,id))objc_msgSend)(
streamPool, @selector(putBack:), stream2);
}
skip_multiop:
// ============================================================
// W4: Async stream submission
// ============================================================
printf("\n================================================================\n");
printf(" W4: Async Stream Submission\n");
printf("================================================================\n\n");
fflush(stdout);
{
id asyncStream = [streamPool performSelector:@selector(takeOut)];
Ivar ashIvar = class_getInstanceVariable([asyncStream class], "_streamHandle");
void *ash = (__bridge void *)object_getIvar(asyncStream, ashIvar);
id asyncOp = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))
objc_msgSend)([opCls alloc],
@selector(initWithProgramLibrary:functionName:modelDescription:
configuration:debugLabel:modelSignpostId:),
progLib, @"main", [model modelDescription], cfg,
@"async_op", (unsigned long long)0);
((BOOL(*)(id,SEL,NSError**))objc_msgSend)(
asyncOp, @selector(preloadAndReturnError:), nil);
[asyncStream setValue:@[asyncOp] forKey:@"operations"];
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
asyncOp, @selector(prepareForInputFeatures:options:error:),
fp, predOpts, nil);
// Try async submission
__block BOOL asyncDone = NO;
__block double asyncMs = 0;
uint64_t asyncT0 = mach_absolute_time();
@try {
// prepareAsyncSubmissionForInputFeatures
NSError *asyncPrepErr = nil;
BOOL asyncPrepOk = ((BOOL(*)(id,SEL,id,id,NSError**))
objc_msgSend)(asyncStream,
@selector(prepareAsyncSubmissionForInputFeatures:options:error:),
fp, predOpts, &asyncPrepErr);
printf(" prepareAsyncSubmission: %s\n",
asyncPrepOk ? "YES" : "NO");
if (asyncPrepErr) printf(" Error: %s\n",
[[asyncPrepErr description] UTF8String]);
fflush(stdout);
if (asyncPrepOk) {
((void(*)(id,SEL,void(^)(void)))objc_msgSend)(
asyncStream, @selector(submitWithCompletionHandler:),
^{
asyncMs = tb_ms(mach_absolute_time() - asyncT0);
asyncDone = YES;
});
printf(" Submitted async, waiting...\n");
fflush(stdout);
for (int w = 0; w < 100 && !asyncDone; w++)
usleep(1000);
printf(" Async completed: %s (%.3f ms)\n",
asyncDone ? "YES" : "TIMEOUT", asyncMs);
fflush(stdout);
if (asyncDone) {
// Benchmark async vs sync
int N = 200;
// Sync benchmark
uint64_t t0 = mach_absolute_time();
for (int i = 0; i < N; i++) {
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
asyncOp,
@selector(prepareForInputFeatures:options:error:),
fp, predOpts, nil);
((BOOL(*)(id,SEL,void*,NSError**))objc_msgSend)(
asyncStream,
@selector(_executeStream:error:), ash, nil);
}
double syncMs = tb_ms(mach_absolute_time() - t0) / N;
// Async benchmark
t0 = mach_absolute_time();
for (int i = 0; i < N; i++) {
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
asyncOp,
@selector(prepareForInputFeatures:options:error:),
fp, predOpts, nil);
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
asyncStream,
@selector(prepareAsyncSubmissionForInputFeatures:
options:error:),
fp, predOpts, nil);
__block BOOL done = NO;
((void(*)(id,SEL,void(^)(void)))objc_msgSend)(
asyncStream,
@selector(submitWithCompletionHandler:),
^{ done = YES; });
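// Bounded wait (at least 1 s) so a dropped completion handler cannot hang the benchmark.
for (int w = 0; w < 10000 && !done; w++) usleep(100);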
}
double asyncBenchMs = tb_ms(mach_absolute_time() - t0) / N;
printf(" Sync: %.4f ms/eval\n", syncMs);
printf(" Async (wait): %.4f ms/eval\n", asyncBenchMs);
}
}
} @catch (NSException *ex) {
printf(" Async EXCEPTION: %s\n", [[ex reason] UTF8String]);
}
[asyncStream setValue:@[asyncOp] forKey:@"operations"];
((void(*)(id,SEL,id))objc_msgSend)(
streamPool, @selector(putBack:), asyncStream);
}
// ============================================================
// W5: Port-Based Data Flow
// ============================================================
printf("\n================================================================\n");
printf(" W5: Port-Based Data Flow Investigation\n");
printf("================================================================\n\n");
fflush(stdout);
{
id portOp = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))
objc_msgSend)([opCls alloc],
@selector(initWithProgramLibrary:functionName:modelDescription:
configuration:debugLabel:modelSignpostId:),
progLib, @"main", [model modelDescription], cfg,
@"port_op", (unsigned long long)0);
((BOOL(*)(id,SEL,NSError**))objc_msgSend)(
portOp, @selector(preloadAndReturnError:), nil);
// Inspect ports before prepare
printf(" --- Before prepare ---\n");
@try {
id inP = [portOp valueForKey:@"inputPorts"];
id outP = [portOp valueForKey:@"outputPorts"];
id stP = [portOp valueForKey:@"statePorts"];
printf(" inputPorts: %lu, outputPorts: %lu, statePorts: %lu\n",
inP ? (unsigned long)[(NSArray *)inP count] : 0,
outP ? (unsigned long)[(NSArray *)outP count] : 0,
stP ? (unsigned long)[(NSArray *)stP count] : 0);
if (inP) {
for (id p in (NSArray *)inP) {
printf(" in: %s portHandle=%p name=%s\n",
[NSStringFromClass([p class]) UTF8String],
(__bridge void *)[p valueForKey:@"portHandle"],
[[(id)[p valueForKey:@"name"] description] UTF8String]);
}
}
if (outP) {
for (id p in (NSArray *)outP) {
printf(" out: %s portHandle=%p name=%s\n",
[NSStringFromClass([p class]) UTF8String],
(__bridge void *)[p valueForKey:@"portHandle"],
[[(id)[p valueForKey:@"name"] description] UTF8String]);
@try {
id fd = [p valueForKey:@"featureDescription"];
if (fd) printf(" featureDesc: %s\n",
[[fd description] UTF8String]);
} @catch (NSException *ex) { (void)ex; }
}
}
} @catch (NSException *ex) {
printf(" Port inspection EXCEPTION: %s\n",
[[ex reason] UTF8String]);
}
// Prepare and inspect after
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
portOp, @selector(prepareForInputFeatures:options:error:),
fp, predOpts, nil);
printf("\n --- After prepare ---\n");
@try {
id inP = [portOp valueForKey:@"inputPorts"];
if (inP) {
for (id p in (NSArray *)inP) {
id binder = [p valueForKey:@"binder"];
BOOL directBound = ((BOOL(*)(id,SEL))objc_msgSend)(
p, @selector(boundFeatureDirectly));
printf(" in: name=%s directBound=%s binder=%s\n",
[[(id)[p valueForKey:@"name"] description] UTF8String],
directBound ? "YES" : "NO",
binder ? [NSStringFromClass([binder class])
UTF8String] : "nil");
if (binder) {
char mode = ((char(*)(id,SEL))objc_msgSend)(
binder, @selector(bindingMode));
printf(" bindingMode=%d\n", (int)mode);
}
}
}
id outP = [portOp valueForKey:@"outputPorts"];
if (outP) {
for (id p in (NSArray *)outP) {
BOOL directBound = ((BOOL(*)(id,SEL))objc_msgSend)(
p, @selector(boundFeatureDirectly));
BOOL obDirectBound = ((BOOL(*)(id,SEL))objc_msgSend)(
p, @selector(outputBackingWasDirectlyBound));
printf(" out: name=%s directBound=%s"
" outputBackingDirectBound=%s\n",
[[(id)[p valueForKey:@"name"] description] UTF8String],
directBound ? "YES" : "NO",
obDirectBound ? "YES" : "NO");
id binder = [p valueForKey:@"binder"];
if (binder) {
printf(" binder: %s\n",
[NSStringFromClass([binder class]) UTF8String]);
@try {
id ob = [binder valueForKey:@"outputBacking"];
printf(" outputBacking: %s\n",
ob ? [NSStringFromClass([ob class])
UTF8String] : "nil");
} @catch (NSException *ex) { (void)ex; }
}
}
}
} @catch (NSException *ex) {
printf(" Post-prepare EXCEPTION: %s\n",
[[ex reason] UTF8String]);
}
}
// ============================================================
// Summary
// ============================================================
printf("\n================================================================\n");
printf(" SUMMARY\n");
printf("================================================================\n");
printf(" W1: Output validation -- see above\n");
printf(" W2: API documentation -- complete (all classes dumped)\n");
printf(" W4: Async submission -- see above\n");
printf(" W5: Port data flow -- see above\n");
printf("================================================================\n");
printf("\nDone.\n");
}
return 0;
}

915
training/test_mil_custom.m Normal file

@ -0,0 +1,915 @@
// test_mil_custom.m: Experiments Y1-Y3, Z1 (custom MIL -> ANE execution)
// Build: make test_mil_custom && ./test_mil_custom
#import <Foundation/Foundation.h>
#import <CoreML/CoreML.h>
#import <objc/runtime.h>
#import <objc/message.h>
#import <dlfcn.h>
#import <mach/mach_time.h>
#import <Accelerate/Accelerate.h>
static mach_timebase_info_data_t g_tb;
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
#pragma mark - MIL Compilation Pipeline
static id compileAndCreateEngine(NSString *milText, NSString *label,
id container, MLModelConfiguration *cfg,
MLModelDescription *desc, NSError **outErr) {
NSString *milPath = [NSString stringWithFormat:@"/tmp/%@.mil", label];
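// Fail early if the MIL text cannot be written; otherwise the AOT step fails with a less obvious error.
NSError *writeErr = nil;
if (![milText writeToFile:milPath atomically:YES encoding:NSUTF8StringEncoding error:&writeErr]) {
if (outErr) *outErr = writeErr;
return nil;
}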
NSURL *milURL = [NSURL fileURLWithPath:milPath];
Class aotCls = NSClassFromString(@"MLE5ProgramLibraryOnDeviceAOTCompilationImpl");
if (!aotCls) {
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:1
userInfo:@{NSLocalizedDescriptionKey: @"AOT class not found"}];
return nil;
}
id aotImpl = ((id(*)(id,SEL,id,id,id))objc_msgSend)(
[aotCls alloc],
NSSelectorFromString(@"initWithMILTextAtURL:container:configuration:"),
milURL, container, cfg);
if (!aotImpl) {
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:2
userInfo:@{NSLocalizedDescriptionKey: @"AOT init failed"}];
return nil;
}
NSError *plErr = nil;
void *plHandle = ((void*(*)(id,SEL,BOOL,NSError**))objc_msgSend)(
aotImpl,
NSSelectorFromString(@"createProgramLibraryHandleWithRespecialization:error:"),
NO, &plErr);
if (!plHandle) {
printf(" [%s] PL handle failed: %s\n", [label UTF8String],
plErr ? [[plErr description] UTF8String] : "unknown");
if (outErr) *outErr = plErr;
return nil;
}
Class plCls = NSClassFromString(@"MLE5ProgramLibrary");
id progLib = ((id(*)(id,SEL,id,id,id))objc_msgSend)(
[plCls alloc],
NSSelectorFromString(@"initWithImpl:container:configuration:"),
aotImpl, container, cfg);
if (!progLib) {
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:4
userInfo:@{NSLocalizedDescriptionKey: @"ProgramLibrary init failed"}];
return nil;
}
Class engCls = NSClassFromString(@"MLE5Engine");
// Find the correct init selector
static dispatch_once_t once;
static SEL engInitSel = NULL;
dispatch_once(&once, ^{
unsigned int mc;
Method *ims = class_copyMethodList(engCls, &mc);
printf(" MLE5Engine init selectors:\n");
for (unsigned int i = 0; i < mc; i++) {
const char *sel = sel_getName(method_getName(ims[i]));
if (strstr(sel, "init")) {
printf(" - %s [%s]\n", sel, method_getTypeEncoding(ims[i]));
if (strstr(sel, "ProgramLibrary") && strstr(sel, "modelDescription"))
engInitSel = method_getName(ims[i]);
}
}
free(ims);
});
if (!engInitSel) {
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:5
userInfo:@{NSLocalizedDescriptionKey: @"No MLE5Engine init selector found"}];
return nil;
}
printf(" Using init: %s\n", sel_getName(engInitSel));
// Count colons to determine argument count
const char *selName = sel_getName(engInitSel);
int argCount = 0;
for (const char *p = selName; *p; p++) if (*p == ':') argCount++;
id engine = nil;
if (argCount == 7) {
// initWithProgramLibrary:modelDescription:configuration:functionName:
// classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo:
engine = ((id(*)(id,SEL,id,id,id,id,id,id,id))objc_msgSend)(
[engCls alloc], engInitSel, progLib, desc, cfg,
@"main", nil, nil, nil);
} else if (argCount == 5) {
engine = ((id(*)(id,SEL,id,id,id,id,id))objc_msgSend)(
[engCls alloc], engInitSel, progLib, desc, cfg, nil, label);
} else if (argCount == 6) {
engine = ((id(*)(id,SEL,id,id,id,id,id,id))objc_msgSend)(
[engCls alloc], engInitSel, progLib, desc, cfg, nil, nil, label);
} else {
printf(" Unexpected arg count %d for MLE5Engine init\n", argCount);
}
if (!engine) {
// Code 6: code 5 is already used above for "no init selector found".
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:6
userInfo:@{NSLocalizedDescriptionKey: @"Engine init failed"}];
return nil;
}
NSError *prepErr = nil;
BOOL prepOk = ((BOOL(*)(id,SEL,long long,NSError**))objc_msgSend)(
engine, NSSelectorFromString(@"prepareWithConcurrencyHint:error:"),
(long long)1, &prepErr);
if (!prepOk) {
printf(" [%s] Prepare failed: %s\n", [label UTF8String],
prepErr ? [[prepErr description] UTF8String] : "unknown");
if (outErr) *outErr = prepErr;
return nil;
}
return engine;
}
static id<MLFeatureProvider> runEngine(id engine, id<MLFeatureProvider> features,
                                       MLPredictionOptions *opts, NSError **outErr) {
    return ((id(*)(id,SEL,id,id,NSError**))objc_msgSend)(
        engine, NSSelectorFromString(@"predictionFromFeatures:options:error:"),
        features, opts, outErr);
}
#pragma mark - Numeric Helpers
static float max_abs_diff(const float *a, const float *b, int n) {
    float m = 0;
    for (int i = 0; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > m) m = d;
    }
    return m;
}
static float mean_abs(const float *a, int n) {
    float s = 0;
    for (int i = 0; i < n; i++) s += fabsf(a[i]);
    return s / n;
}
static void fill_random(float *buf, int n, float scale) {
    for (int i = 0; i < n; i++)
        buf[i] = ((float)arc4random() / (float)UINT32_MAX - 0.5f) * 2.0f * scale;
}
static void print_first(const char *label, const float *buf, int total) {
    int n = total < 8 ? total : 8;
    printf(" %s: [", label);
    for (int i = 0; i < n; i++)
        printf("%s%.4f", i ? ", " : "", buf[i]);
    printf("]\n");
}
#pragma mark - CPU Reference Implementations
// Single-head reference: out = softmax(Q K^T / sqrt(headDim)) V.
// Q, K, V, out are row-major [seqLen, headDim]; computed entirely in fp32.
static void cpu_sdpa(const float *Q, const float *K, const float *V,
                     float *out, int seqLen, int headDim) {
    float scale = 1.0f / sqrtf((float)headDim);
    float *scores = (float *)calloc(seqLen * seqLen, sizeof(float));
    // scores[i][j] = (Q[i] . K[j]) * scale
    for (int i = 0; i < seqLen; i++) {
        for (int j = 0; j < seqLen; j++) {
            float dot = 0;
            for (int d = 0; d < headDim; d++)
                dot += Q[i * headDim + d] * K[j * headDim + d];
            scores[i * seqLen + j] = dot * scale;
        }
    }
    // Row-wise softmax, shifted by the row maximum for numerical stability
    for (int i = 0; i < seqLen; i++) {
        float maxv = scores[i * seqLen];
        for (int j = 1; j < seqLen; j++)
            if (scores[i * seqLen + j] > maxv) maxv = scores[i * seqLen + j];
        float sum = 0;
        for (int j = 0; j < seqLen; j++) {
            scores[i * seqLen + j] = expf(scores[i * seqLen + j] - maxv);
            sum += scores[i * seqLen + j];
        }
        for (int j = 0; j < seqLen; j++)
            scores[i * seqLen + j] /= sum;
    }
    // out = scores @ V
    for (int i = 0; i < seqLen; i++) {
        for (int d = 0; d < headDim; d++) {
            float acc = 0;
            for (int j = 0; j < seqLen; j++)
                acc += scores[i * seqLen + j] * V[j * headDim + d];
            out[i * headDim + d] = acc;
        }
    }
    free(scores);
}
#pragma mark - Container Discovery
static id findE5Container(MLModel *model, NSURL *compiledURL, MLModelConfiguration *cfg) {
// Try standard paths first
@try {
id eng = [model valueForKey:@"_internalEngine"];
if ([NSStringFromClass([eng class]) containsString:@"MLE5"]) {
id pl = [eng valueForKey:@"programLibrary"];
if (pl) {
id c = nil;
@try { c = [pl valueForKey:@"_container"]; } @catch(id e) { (void)e; }
if (!c) {
@try {
id impl = [pl valueForKey:@"_impl"];
if (impl) c = [impl valueForKey:@"_container"];
} @catch(id e) { (void)e; }
}
if (c) return c;
}
}
// MLMultiFunctionProgramEngine path
if ([NSStringFromClass([eng class]) isEqualToString:@"MLMultiFunctionProgramEngine"]) {
NSDictionary *map = [eng valueForKey:@"_functionNameToEngineMap"];
for (id key in map) {
id sub = map[key];
if ([NSStringFromClass([sub class]) containsString:@"MLE5"]) {
id pl = [sub valueForKey:@"programLibrary"];
if (pl) {
id c = nil;
@try { c = [pl valueForKey:@"_container"]; } @catch(id e) { (void)e; }
if (!c) {
@try {
id impl = [pl valueForKey:@"_impl"];
if (impl) c = [impl valueForKey:@"_container"];
} @catch(id e) { (void)e; }
}
if (c) return c;
}
}
}
}
} @catch(id e) { (void)e; }
// Create MLProgramE5Container directly from compiled model
Class e5Cls = NSClassFromString(@"MLProgramE5Container");
if (!e5Cls) return nil;
// Find model.mil path inside the compiled model
NSString *compiledPath = [compiledURL path];
NSString *milPath = [compiledPath stringByAppendingPathComponent:@"model.mil"];
if (![[NSFileManager defaultManager] fileExistsAtPath:milPath]) {
printf(" No model.mil at %s\n", [milPath UTF8String]);
// List contents
NSArray *contents = [[NSFileManager defaultManager]
contentsOfDirectoryAtPath:compiledPath error:nil];
printf(" Compiled model contents: %s\n", [[contents description] UTF8String]);
}
// Try to create E5 container with the model asset description from NN container
@try {
id eng = [model valueForKey:@"_internalEngine"];
id nnContainer = [eng valueForKey:@"_container"];
if (nnContainer) {
// Get model file path
NSString *modelFilePath = nil;
@try { modelFilePath = [nnContainer valueForKey:@"_modelFilePath"]; }
@catch(id e) { (void)e; }
if (modelFilePath) {
printf(" Model file path: %s\n", [modelFilePath UTF8String]);
// Try to create E5 container with this path
@try {
id c = ((id(*)(id,SEL,id,id))objc_msgSend)(
[e5Cls alloc],
NSSelectorFromString(@"initWithModelAssetPath:configuration:"),
modelFilePath, cfg);
if (c) return c;
} @catch(id e) { (void)e; }
}
// Try initWithModelAssetDescription
@try {
id assetDesc = nil;
@try { assetDesc = [nnContainer valueForKey:@"_modelAssetDescription"]; }
@catch(id e) { (void)e; }
if (!assetDesc) {
@try { assetDesc = [nnContainer valueForKey:@"modelAssetDescription"]; }
@catch(id e) { (void)e; }
}
if (assetDesc) {
printf(" Asset description: %s\n",
[NSStringFromClass([assetDesc class]) UTF8String]);
id c = ((id(*)(id,SEL,id,id))objc_msgSend)(
[e5Cls alloc],
NSSelectorFromString(@"initWithModelAssetDescription:configuration:"),
assetDesc, cfg);
if (c) return c;
}
} @catch(id e) { (void)e; }
}
} @catch(id e) { (void)e; }
// Dump E5Container init methods
unsigned int mc;
Method *ims = class_copyMethodList(e5Cls, &mc);
printf(" MLProgramE5Container init methods:\n");
for (unsigned int i = 0; i < mc; i++) {
const char *sel = sel_getName(method_getName(ims[i]));
if (strstr(sel, "init"))
printf(" - %s\n", sel);
}
free(ims);
return nil;
}
#pragma mark - Main
int main(int argc, const char *argv[]) {
(void)argc; (void)argv;
@autoreleasepool {
mach_timebase_info(&g_tb);
printf("================================================================\n");
printf(" Custom MIL -> ANE: Experiments Y1, Y2, Y3, Z1\n");
printf("================================================================\n\n");
if (!dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/"
"AppleNeuralEngine", RTLD_NOW))
printf(" WARNING: dlopen(AppleNeuralEngine) failed: %s\n", dlerror());
NSString *pkgPath = @"/tmp/ane_sram_256ch_64sp.mlpackage";
if (![[NSFileManager defaultManager] fileExistsAtPath:pkgPath]) {
printf("FATAL: %s not found. Run: python3 scripts/gen_mlpackages.py\n",
[pkgPath UTF8String]);
return 1;
}
NSError *err = nil;
MLModelConfiguration *cfg = [[MLModelConfiguration alloc] init];
cfg.computeUnits = MLComputeUnitsAll;
MLPredictionOptions *opts = [[MLPredictionOptions alloc] init];
// Check the return values rather than err: Cocoa only guarantees the NSError on failure.
NSURL *compiled = [MLModel compileModelAtURL:
[NSURL fileURLWithPath:pkgPath] error:&err];
if (!compiled) { printf("FATAL: compile: %s\n", err ? [[err description] UTF8String] : "unknown"); return 1; }
MLModel *refModel = [MLModel modelWithContentsOfURL:compiled
configuration:cfg error:&err];
if (!refModel) { printf("FATAL: load: %s\n", err ? [[err description] UTF8String] : "unknown"); return 1; }
printf(" Ref model: %s\n", [NSStringFromClass([refModel class]) UTF8String]);
MLModelDescription *refDesc = [refModel modelDescription];
// Find or create E5 container
id refContainer = findE5Container(refModel, compiled, cfg);
if (refContainer) {
printf(" Container: %s\n\n", [NSStringFromClass([refContainer class]) UTF8String]);
} else {
printf(" No E5 container found. Trying nil container...\n\n");
}
int ch = 256, sp = 64;
int nElems = ch * sp;
NSString *inName = [[[refDesc inputDescriptionsByName] allKeys] firstObject];
NSString *outName = [[[refDesc outputDescriptionsByName] allKeys] firstObject];
printf(" I/O: %s -> %s, shape [1,%d,1,%d]\n\n", [inName UTF8String],
[outName UTF8String], ch, sp);
// ============================================================
// Y1: Scaled Dot-Product Attention
// ============================================================
printf("================================================================\n");
printf(" Y1: scaled_dot_product_attention on ANE\n");
printf("================================================================\n\n");
{
int seqLen = ch, headDim = sp;
NSString *sdpaMIL = [NSString stringWithFormat:
@"program(1.3)\n"
"{\n"
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
" string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
" tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
" tensor<int32, [4]> sr = const()[name = string(\"sr\"), val = tensor<int32, [4]>([1, 1, %d, %d])];\n"
" tensor<fp16, [1, 1, %d, %d]> q = reshape(x = x16, shape = sr)[name = string(\"q\")];\n"
" tensor<fp16, [1, 1, %d, %d]> k = reshape(x = x16, shape = sr)[name = string(\"k\")];\n"
" tensor<fp16, [1, 1, %d, %d]> v = reshape(x = x16, shape = sr)[name = string(\"v\")];\n"
" tensor<fp16, [1, 1, %d, %d]> attn = scaled_dot_product_attention(query = q, key = k, value = v)[name = string(\"attn\")];\n"
" tensor<int32, [4]> or = const()[name = string(\"or\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
" tensor<fp16, [1, %d, 1, %d]> rs = reshape(x = attn, shape = or)[name = string(\"rs\")];\n"
" string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
" tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = rs)[name = string(\"cast_out\")];\n"
" } -> (cast_out);\n"
"}\n",
ch, sp, ch, sp,
seqLen, headDim, seqLen, headDim, seqLen, headDim, seqLen, headDim,
seqLen, headDim,
ch, sp, ch, sp,
ch, sp];
printf(" Self-attention: B=1, nHeads=1, seqLen=%d, headDim=%d\n\n", seqLen, headDim);
err = nil;
id engine = compileAndCreateEngine(sdpaMIL, @"y1_sdpa", refContainer, cfg, refDesc, &err);
if (!engine) {
printf(" Y1 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
} else {
printf(" Y1: Engine created\n");
MLMultiArray *inputArr = [[MLMultiArray alloc]
initWithShape:@[@1, @(ch), @1, @(sp)]
dataType:MLMultiArrayDataTypeFloat32 error:nil];
float *inPtr = (float *)[inputArr dataPointer];
fill_random(inPtr, nElems, 0.5f);
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
initWithDictionary:@{inName: inputArr} error:nil];
NSError *runErr = nil;
uint64_t t0 = mach_absolute_time();
id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
double ms = tb_ms(mach_absolute_time() - t0);
if (runErr || !result) {
printf(" Y1 prediction FAILED: %s\n\n",
runErr ? [[runErr description] UTF8String] : "nil");
} else {
MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
if (!outArr) {
printf(" Y1 output nil\n\n");
} else {
float *outPtr = (float *)[outArr dataPointer];
print_first("ANE out", outPtr, nElems);
printf(" Time: %.3f ms\n", ms);
float *cpuOut = (float *)calloc(nElems, sizeof(float));
cpu_sdpa(inPtr, inPtr, inPtr, cpuOut, seqLen, headDim);
print_first("CPU ref", cpuOut, nElems);
float mad = max_abs_diff(outPtr, cpuOut, nElems);
printf(" Max diff: %.6f, Rel: %.2e\n",
mad, mad / (mean_abs(cpuOut, nElems) + 1e-10f));
printf(" %s\n\n", mad < 0.02f ? "*** Y1 PASSED ***" :
(mad < 0.1f ? "Y1 WARNING" : "Y1 FAILED"));
int N = 100;
t0 = mach_absolute_time();
for (int i = 0; i < N; i++) runEngine(engine, fp, opts, nil);
printf(" Bench: %.4f ms/eval (%d iters)\n\n",
tb_ms(mach_absolute_time() - t0) / N, N);
free(cpuOut);
}
}
}
}
// ============================================================
// Y2: Linear with Embedded Weights
// ============================================================
printf("================================================================\n");
printf(" Y2: linear op with embedded weights on ANE\n");
printf("================================================================\n\n");
{
int inDim = sp, outDim = sp;
float *W = (float *)malloc(outDim * inDim * sizeof(float));
float *B = (float *)malloc(outDim * sizeof(float));
fill_random(W, outDim * inDim, 0.1f);
fill_random(B, outDim, 0.01f);
NSMutableString *wLit = [NSMutableString stringWithString:@"["];
for (int i = 0; i < outDim; i++) {
if (i > 0) [wLit appendString:@", "];
[wLit appendString:@"["];
for (int j = 0; j < inDim; j++) {
if (j > 0) [wLit appendString:@", "];
[wLit appendFormat:@"%.8e", W[i * inDim + j]];
}
[wLit appendString:@"]"];
}
[wLit appendString:@"]"];
NSMutableString *bLit = [NSMutableString stringWithString:@"["];
for (int j = 0; j < outDim; j++) {
if (j > 0) [bLit appendString:@", "];
[bLit appendFormat:@"%.8e", B[j]];
}
[bLit appendString:@"]"];
NSString *linearMIL = [NSString stringWithFormat:
@"program(1.3)\n"
"{\n"
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
" string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
" tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
" tensor<int32, [2]> rs = const()[name = string(\"rs\"), val = tensor<int32, [2]>([%d, %d])];\n"
" tensor<fp16, [%d, %d]> flat = reshape(x = x16, shape = rs)[name = string(\"flat\")];\n"
" tensor<fp16, [%d, %d]> Wc = const()[name = string(\"Wc\"), val = tensor<fp16, [%d, %d]>(%@)];\n"
" tensor<fp16, [%d]> Bc = const()[name = string(\"Bc\"), val = tensor<fp16, [%d]>(%@)];\n"
" tensor<fp16, [%d, %d]> lin = linear(x = flat, weight = Wc, bias = Bc)[name = string(\"lin\")];\n"
" tensor<int32, [4]> rs2 = const()[name = string(\"rs2\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
" tensor<fp16, [1, %d, 1, %d]> rso = reshape(x = lin, shape = rs2)[name = string(\"rso\")];\n"
" string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
" tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = rso)[name = string(\"cast_out\")];\n"
" } -> (cast_out);\n"
"}\n",
ch, sp, ch, sp,
ch, sp, ch, sp,
outDim, inDim, outDim, inDim, wLit,
outDim, outDim, bLit,
ch, outDim,
ch, sp, ch, sp,
ch, sp];
printf(" Config: [%d,%d] linear %d->%d with embedded W+b\n\n", ch, sp, inDim, outDim);
err = nil;
id engine = compileAndCreateEngine(linearMIL, @"y2_linear", refContainer, cfg, refDesc, &err);
if (!engine) {
printf(" Y2 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
} else {
printf(" Y2: Engine created\n");
MLMultiArray *inputArr = [[MLMultiArray alloc]
initWithShape:@[@1, @(ch), @1, @(sp)]
dataType:MLMultiArrayDataTypeFloat32 error:nil];
float *inPtr = (float *)[inputArr dataPointer];
fill_random(inPtr, nElems, 0.5f);
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
initWithDictionary:@{inName: inputArr} error:nil];
NSError *runErr = nil;
uint64_t t0 = mach_absolute_time();
id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
double ms = tb_ms(mach_absolute_time() - t0);
if (runErr || !result) {
printf(" Y2 prediction FAILED: %s\n\n",
runErr ? [[runErr description] UTF8String] : "nil");
} else {
MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
if (outArr) {
float *outPtr = (float *)[outArr dataPointer];
print_first("ANE out", outPtr, nElems);
printf(" Time: %.3f ms\n", ms);
// CPU: x[ch,sp] @ W^T[sp,sp] + b[sp]
float *cpuOut = (float *)calloc(nElems, sizeof(float));
for (int i = 0; i < ch; i++) {
for (int j = 0; j < outDim; j++) {
float acc = 0;
for (int k = 0; k < inDim; k++)
acc += inPtr[i * inDim + k] * W[j * inDim + k];
cpuOut[i * outDim + j] = acc + B[j];
}
}
print_first("CPU ref", cpuOut, nElems);
float mad = max_abs_diff(outPtr, cpuOut, nElems);
printf(" Max diff: %.6f, Rel: %.2e\n",
mad, mad / (mean_abs(cpuOut, nElems) + 1e-10f));
printf(" %s\n\n", mad < 0.05f ? "*** Y2 PASSED ***" :
(mad < 0.5f ? "Y2 WARNING" : "Y2 FAILED"));
int N = 100;
t0 = mach_absolute_time();
for (int i = 0; i < N; i++) runEngine(engine, fp, opts, nil);
printf(" Bench: %.4f ms/eval (%d iters)\n\n",
tb_ms(mach_absolute_time() - t0) / N, N);
free(cpuOut);
}
}
}
free(W); free(B);
}
// ============================================================
// Y3: Transformer Block (Attention + FFN)
// ============================================================
printf("================================================================\n");
printf(" Y3: Transformer Block (LN + SDPA + Residual + LN + FFN + Residual)\n");
printf("================================================================\n\n");
{
int seqLen = ch, dim = sp, ffnDim = 128;
float *w1 = (float *)malloc(ffnDim * dim * sizeof(float));
float *b1 = (float *)malloc(ffnDim * sizeof(float));
float *w2 = (float *)malloc(dim * ffnDim * sizeof(float));
float *b2 = (float *)malloc(dim * sizeof(float));
fill_random(w1, ffnDim * dim, 0.05f);
fill_random(b1, ffnDim, 0.01f);
fill_random(w2, dim * ffnDim, 0.05f);
fill_random(b2, dim, 0.01f);
// Build weight string literals
NSMutableString *(^buildMat)(float*, int, int) = ^(float *m, int rows, int cols) {
NSMutableString *s = [NSMutableString stringWithString:@"["];
for (int i = 0; i < rows; i++) {
if (i > 0) [s appendString:@", "];
[s appendString:@"["];
for (int j = 0; j < cols; j++) {
if (j > 0) [s appendString:@", "];
[s appendFormat:@"%.8e", m[i * cols + j]];
}
[s appendString:@"]"];
}
[s appendString:@"]"];
return s;
};
NSMutableString *(^buildVec)(float*, int) = ^(float *v, int n) {
NSMutableString *s = [NSMutableString stringWithString:@"["];
for (int i = 0; i < n; i++) {
if (i > 0) [s appendString:@", "];
[s appendFormat:@"%.8e", v[i]];
}
[s appendString:@"]"];
return s;
};
NSMutableString *(^buildOnes)(int) = ^(int n) {
NSMutableString *s = [NSMutableString stringWithString:@"["];
for (int i = 0; i < n; i++) {
if (i > 0) [s appendString:@", "];
[s appendString:@"1.0"];
}
[s appendString:@"]"];
return s;
};
NSMutableString *(^buildZeros)(int) = ^(int n) {
NSMutableString *s = [NSMutableString stringWithString:@"["];
for (int i = 0; i < n; i++) {
if (i > 0) [s appendString:@", "];
[s appendString:@"0.0"];
}
[s appendString:@"]"];
return s;
};
NSString *tfMIL = [NSString stringWithFormat:
@"program(1.3)\n"
"{\n"
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
" string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
" tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
" tensor<int32, [2]> r2 = const()[name = string(\"r2\"), val = tensor<int32, [2]>([%d, %d])];\n"
" tensor<fp16, [%d, %d]> flat = reshape(x = x16, shape = r2)[name = string(\"flat\")];\n"
// LN1
" tensor<fp16, [%d]> g1 = const()[name = string(\"g1\"), val = tensor<fp16, [%d]>(%@)];\n"
" tensor<fp16, [%d]> b1 = const()[name = string(\"b1\"), val = tensor<fp16, [%d]>(%@)];\n"
" tensor<int32, [1]> la = const()[name = string(\"la\"), val = tensor<int32, [1]>([-1])];\n"
" fp16 eps = const()[name = string(\"eps\"), val = fp16(1e-5)];\n"
" tensor<fp16, [%d, %d]> ln1 = layer_norm(x = flat, axes = la, gamma = g1, beta = b1, epsilon = eps)[name = string(\"ln1\")];\n"
// SDPA
" tensor<int32, [4]> sr = const()[name = string(\"sr\"), val = tensor<int32, [4]>([1, 1, %d, %d])];\n"
" tensor<fp16, [1, 1, %d, %d]> q = reshape(x = ln1, shape = sr)[name = string(\"q\")];\n"
" tensor<fp16, [1, 1, %d, %d]> k = reshape(x = ln1, shape = sr)[name = string(\"k\")];\n"
" tensor<fp16, [1, 1, %d, %d]> v = reshape(x = ln1, shape = sr)[name = string(\"v\")];\n"
" tensor<fp16, [1, 1, %d, %d]> at = scaled_dot_product_attention(query = q, key = k, value = v)[name = string(\"at\")];\n"
" tensor<fp16, [%d, %d]> af = reshape(x = at, shape = r2)[name = string(\"af\")];\n"
// Residual 1
" tensor<fp16, [%d, %d]> r1 = add(x = flat, y = af)[name = string(\"r1\")];\n"
// LN2
" tensor<fp16, [%d]> g2 = const()[name = string(\"g2\"), val = tensor<fp16, [%d]>(%@)];\n"
" tensor<fp16, [%d]> b2 = const()[name = string(\"b2\"), val = tensor<fp16, [%d]>(%@)];\n"
" tensor<fp16, [%d, %d]> ln2 = layer_norm(x = r1, axes = la, gamma = g2, beta = b2, epsilon = eps)[name = string(\"ln2\")];\n"
// FFN
" tensor<fp16, [%d, %d]> W1 = const()[name = string(\"W1\"), val = tensor<fp16, [%d, %d]>(%@)];\n"
" tensor<fp16, [%d]> B1 = const()[name = string(\"B1\"), val = tensor<fp16, [%d]>(%@)];\n"
" tensor<fp16, [%d, %d]> f1 = linear(x = ln2, weight = W1, bias = B1)[name = string(\"f1\")];\n"
" tensor<fp16, [%d, %d]> ga = gelu(x = f1, mode = string(\"TANH_APPROXIMATION\"))[name = string(\"ga\")];\n"
" tensor<fp16, [%d, %d]> W2 = const()[name = string(\"W2\"), val = tensor<fp16, [%d, %d]>(%@)];\n"
" tensor<fp16, [%d]> B2 = const()[name = string(\"B2\"), val = tensor<fp16, [%d]>(%@)];\n"
" tensor<fp16, [%d, %d]> f2 = linear(x = ga, weight = W2, bias = B2)[name = string(\"f2\")];\n"
// Residual 2
" tensor<fp16, [%d, %d]> r2o = add(x = r1, y = f2)[name = string(\"r2o\")];\n"
// Output
" tensor<int32, [4]> r4 = const()[name = string(\"r4\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
" tensor<fp16, [1, %d, 1, %d]> o16 = reshape(x = r2o, shape = r4)[name = string(\"o16\")];\n"
" string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
" tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = o16)[name = string(\"cast_out\")];\n"
" } -> (cast_out);\n"
"}\n",
ch, sp, ch, sp,
seqLen, dim, seqLen, dim,
dim, dim, buildOnes(dim),
dim, dim, buildZeros(dim),
seqLen, dim,
seqLen, dim, seqLen, dim, seqLen, dim, seqLen, dim,
seqLen, dim,
seqLen, dim,
seqLen, dim,
dim, dim, buildOnes(dim),
dim, dim, buildZeros(dim),
seqLen, dim,
ffnDim, dim, ffnDim, dim, buildMat(w1, ffnDim, dim),
ffnDim, ffnDim, buildVec(b1, ffnDim),
seqLen, ffnDim,
seqLen, ffnDim,
dim, ffnDim, dim, ffnDim, buildMat(w2, dim, ffnDim),
dim, dim, buildVec(b2, dim),
seqLen, dim,
seqLen, dim,
ch, sp, ch, sp,
ch, sp];
printf(" Pipeline: LN->SDPA->Res->LN->FFN(%d->%d->%d)->Res\n\n", dim, ffnDim, dim);
err = nil;
id engine = compileAndCreateEngine(tfMIL, @"y3_transformer",
refContainer, cfg, refDesc, &err);
if (!engine) {
printf(" Y3 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
} else {
printf(" Y3: Engine created!\n");
MLMultiArray *inputArr = [[MLMultiArray alloc]
initWithShape:@[@1, @(ch), @1, @(sp)]
dataType:MLMultiArrayDataTypeFloat32 error:nil];
float *inPtr = (float *)[inputArr dataPointer];
fill_random(inPtr, nElems, 0.5f);
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
initWithDictionary:@{inName: inputArr} error:nil];
NSError *runErr = nil;
uint64_t t0 = mach_absolute_time();
id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
double ms = tb_ms(mach_absolute_time() - t0);
if (runErr || !result) {
printf(" Y3 prediction FAILED: %s\n\n",
runErr ? [[runErr description] UTF8String] : "nil");
} else {
MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
if (outArr) {
float *outPtr = (float *)[outArr dataPointer];
print_first("ANE out", outPtr, nElems);
printf(" Time: %.3f ms\n", ms);
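// No CPU reference for Y3: this is only a sanity check that the output is finite and non-zero.
float m = mean_abs(outPtr, nElems);
BOOL y3ok = isfinite(m) && m > 1e-6f;
printf(" Finite, non-zero: %s (mean_abs=%.6f)\n", y3ok ? "YES" : "NO", m);
printf(" %s\n\n", y3ok ? "*** Y3 PASSED ***" : "Y3 FAILED");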
int N = 100;
t0 = mach_absolute_time();
for (int i = 0; i < N; i++) runEngine(engine, fp, opts, nil);
printf(" Bench: %.4f ms/eval (%d iters)\n\n",
tb_ms(mach_absolute_time() - t0) / N, N);
}
}
}
free(w1); free(b1); free(w2); free(b2);
}
// ============================================================
// Z1: Linear Backward Pass (Gradient Computation)
// ============================================================
printf("================================================================\n");
printf(" Z1: Backward Pass (matmul with runtime tensors) on ANE\n");
printf("================================================================\n\n");
{
int M = 128, K = 64, N = 64;
NSString *bwdMIL = [NSString stringWithFormat:
@"program(1.3)\n"
"{\n"
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
" string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
" tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
" tensor<int32, [2]> r2 = const()[name = string(\"r2\"), val = tensor<int32, [2]>([%d, %d])];\n"
" tensor<fp16, [%d, %d]> flat = reshape(x = x16, shape = r2)[name = string(\"flat\")];\n"
// Slice dY [0:128, :]
" tensor<int32, [2]> db = const()[name = string(\"db\"), val = tensor<int32, [2]>([0, 0])];\n"
" tensor<int32, [2]> de = const()[name = string(\"de\"), val = tensor<int32, [2]>([%d, %d])];\n"
" tensor<fp16, [%d, %d]> dY = slice_by_index(x = flat, begin = db, end = de)[name = string(\"dY\")];\n"
// Slice W [128:192, :]
" tensor<int32, [2]> wb = const()[name = string(\"wb\"), val = tensor<int32, [2]>([%d, 0])];\n"
" tensor<int32, [2]> we = const()[name = string(\"we\"), val = tensor<int32, [2]>([%d, %d])];\n"
" tensor<fp16, [%d, %d]> W = slice_by_index(x = flat, begin = wb, end = we)[name = string(\"W\")];\n"
// Slice pad [192:256, :]
" tensor<int32, [2]> pb = const()[name = string(\"pb\"), val = tensor<int32, [2]>([%d, 0])];\n"
" tensor<int32, [2]> pe = const()[name = string(\"pe\"), val = tensor<int32, [2]>([%d, %d])];\n"
" tensor<fp16, [%d, %d]> pad = slice_by_index(x = flat, begin = pb, end = pe)[name = string(\"pad\")];\n"
// dX = dY @ W
" bool txf = const()[name = string(\"txf\"), val = bool(false)];\n"
" bool tyf = const()[name = string(\"tyf\"), val = bool(false)];\n"
" bool txt = const()[name = string(\"txt\"), val = bool(true)];\n"
" tensor<fp16, [%d, %d]> dX = matmul(x = dY, y = W, transpose_x = txf, transpose_y = tyf)[name = string(\"dX\")];\n"
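// dW = dY^T @ dY (stand-in: a true weight gradient would use the forward activations, which are not packed into this input)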
" tensor<fp16, [%d, %d]> dW = matmul(x = dY, y = dY, transpose_x = txt, transpose_y = tyf)[name = string(\"dW\")];\n"
// Concat [dX, dW, pad]
" int32 ax = const()[name = string(\"ax\"), val = int32(0)];\n"
" bool il = const()[name = string(\"il\"), val = bool(false)];\n"
" tensor<fp16, [%d, %d]> pk = concat(values = (dX, dW, pad), axis = ax, interleave = il)[name = string(\"pk\")];\n"
" tensor<int32, [4]> r4 = const()[name = string(\"r4\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
" tensor<fp16, [1, %d, 1, %d]> o16 = reshape(x = pk, shape = r4)[name = string(\"o16\")];\n"
" string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
" tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = o16)[name = string(\"cast_out\")];\n"
" } -> (cast_out);\n"
"}\n",
ch, sp, ch, sp,
ch, sp, ch, sp,
M, K, M, K,
M, M + K, K, K, K,
M + K, ch, sp, ch - M - K, sp,
M, N,
K, K,
ch, sp,
ch, sp, ch, sp,
ch, sp];
printf(" dX = dY[%d,%d] @ W[%d,%d] -> [%d,%d]\n", M, K, K, N, M, N);
printf(" dW = dY^T @ dY -> [%d,%d]\n\n", K, K);
err = nil;
id engine = compileAndCreateEngine(bwdMIL, @"z1_backward",
refContainer, cfg, refDesc, &err);
if (!engine) {
printf(" Z1 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
} else {
printf(" Z1: Engine created\n");
MLMultiArray *inputArr = [[MLMultiArray alloc]
initWithShape:@[@1, @(ch), @1, @(sp)]
dataType:MLMultiArrayDataTypeFloat32 error:nil];
float *inPtr = (float *)[inputArr dataPointer];
fill_random(inPtr, nElems, 0.3f);
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
initWithDictionary:@{inName: inputArr} error:nil];
NSError *runErr = nil;
uint64_t t0 = mach_absolute_time();
id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
double ms = tb_ms(mach_absolute_time() - t0);
if (runErr || !result) {
printf(" Z1 prediction FAILED: %s\n\n",
runErr ? [[runErr description] UTF8String] : "nil");
} else {
MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
if (outArr) {
float *outPtr = (float *)[outArr dataPointer];
// CPU: dX = dY @ W
float *dY_cpu = inPtr;
float *W_cpu = inPtr + M * K;
float *dX_cpu = (float *)calloc(M * N, sizeof(float));
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++) {
float a = 0;
for (int k = 0; k < K; k++)
a += dY_cpu[i*K+k] * W_cpu[k*N+j];
dX_cpu[i*N+j] = a;
}
// CPU: dW = dY^T @ dY
float *dW_cpu = (float *)calloc(K * K, sizeof(float));
for (int i = 0; i < K; i++)
for (int j = 0; j < K; j++) {
float a = 0;
for (int m = 0; m < M; m++)
a += dY_cpu[m*K+i] * dY_cpu[m*K+j];
dW_cpu[i*K+j] = a;
}
print_first("ANE dX", outPtr, M * N);
print_first("CPU dX", dX_cpu, M * N);
float mad_dx = max_abs_diff(outPtr, dX_cpu, M * N);
printf(" dX diff: %.6f, Rel: %.2e\n",
mad_dx, mad_dx / (mean_abs(dX_cpu, M*N) + 1e-10f));
print_first("ANE dW", outPtr + M*N, K*K);
print_first("CPU dW", dW_cpu, K*K);
float mad_dw = max_abs_diff(outPtr + M*N, dW_cpu, K * K);
printf(" dW diff: %.6f, Rel: %.2e\n",
mad_dw, mad_dw / (mean_abs(dW_cpu, K*K) + 1e-10f));
printf(" Time: %.3f ms\n", ms);
printf(" %s\n\n",
(mad_dx < 0.5f && mad_dw < 1.0f)
? "*** Z1 PASSED ***" : "Z1: differences (fp16 precision)");
int NN = 100;
t0 = mach_absolute_time();
for (int i = 0; i < NN; i++) runEngine(engine, fp, opts, nil);
printf(" Bench: %.4f ms/eval (%d iters)\n\n",
tb_ms(mach_absolute_time() - t0) / NN, NN);
free(dX_cpu); free(dW_cpu);
} else {
printf(" Z1: output feature missing or not a multi-array\n\n");
}
}
}
}
printf("================================================================\n");
printf(" DONE\n");
printf("================================================================\n");
}
return 0;
}

@ -0,0 +1,238 @@
// test_throughput_ceiling.m - Experiment I: Multi-kernel throughput ceiling
// Measures CPU round-trip overhead for sequential ANE kernel execution
// Build: make test_throughput_ceiling && ./test_throughput_ceiling
#import <Foundation/Foundation.h>
#import <mach/mach_time.h>
#include <dispatch/dispatch.h>
#include "ane_runtime.h"
static int g_fp16_io = 1;
static NSString *gen_conv_mil_fp16(int ch, int sp) {
return [NSString stringWithFormat:
@"program(1.0)\n[buildInfo = dict<tensor<string, []>, tensor<string, []>>"
"({{\"coremlc-version\", \"3505.4.1\"}})]\n{\n"
" func main<ios16>(tensor<fp16, [1, %d, 1, %d]> x) {\n"
" tensor<string, []> pt = const()[name=tensor<string, []>(\"pt\"),"
" val=tensor<string, []>(\"valid\")];\n"
" tensor<int32, [2]> st = const()[name=tensor<string, []>(\"st\"),"
" val=tensor<int32, [2]>([1,1])];\n"
" tensor<int32, [4]> pd = const()[name=tensor<string, []>(\"pd\"),"
" val=tensor<int32, [4]>([0,0,0,0])];\n"
" tensor<int32, [2]> dl = const()[name=tensor<string, []>(\"dl\"),"
" val=tensor<int32, [2]>([1,1])];\n"
" tensor<int32, []> gr = const()[name=tensor<string, []>(\"gr\"),"
" val=tensor<int32, []>(1)];\n"
" tensor<fp16, [%d,%d,1,1]> W = const()[name=tensor<string, []>(\"W\"), "
"val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=tensor<string, []>"
"(\"@model_path/weights/weight.bin\"), offset=tensor<uint64, []>(64)))];\n"
" tensor<fp16, [1,%d,1,%d]> y = conv(dilations=dl,groups=gr,"
"pad=pd,pad_type=pt,strides=st,weight=W,x=x)"
"[name=tensor<string, []>(\"conv\")];\n"
" } -> (y);\n}\n", ch, sp, ch, ch, ch, ch, ch, sp];
}
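// Compiles one 1x1 conv kernel whose weights are the identity matrix, so the
// kernel passes its input through unchanged. Header field meanings below are
// inferred from the values this test writes, not from any documented format.
static ANEKernel *compile_fp16_kernel(int ch, int sp) {
    int ws = ch * ch * 2;    // weight payload: ch * ch fp16 values
    int tot = 128 + ws;      // 128 header bytes precede the payload
    uint8_t *blob = (uint8_t *)calloc((size_t)tot, 1);
    if (!blob) return NULL;  // main() reports NULL as a compile failure
    // Bytes 0..63: leading header.
    blob[0] = 1; blob[4] = 2;
    // Bytes 64..127: header at the offset the MIL BLOBFILE reference uses (64).
    blob[64] = 0xEF; blob[65] = 0xBE; blob[66] = 0xAD; blob[67] = 0xDE;  // 0xDEADBEEF, little-endian
    blob[68] = 1;
    *(uint32_t *)(blob + 72) = (uint32_t)ws;  // payload size in bytes
    *(uint32_t *)(blob + 80) = 128;           // payload offset from start of blob
    // Bytes 128..: fp16 weights; calloc zeroed them, so only the diagonal is set.
    _Float16 *wp = (_Float16 *)(blob + 128);
    for (int i = 0; i < ch; i++) wp[i * ch + i] = (_Float16)1.0f;
    // NSData takes ownership of blob and frees it when released.
    NSData *wdata = [NSData dataWithBytesNoCopy:blob length:(NSUInteger)tot
                                   freeWhenDone:YES];
    NSString *mil = gen_conv_mil_fp16(ch, sp);
    NSData *md = [mil dataUsingEncoding:NSUTF8StringEncoding];
    size_t ioBytes = (size_t)ch * sp * 2;  // one fp16 [1, ch, 1, sp] tensor
    return ane_compile(md, wdata, 1, &ioBytes, 1, &ioBytes);
}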
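// Hypothetical helper, not part of the original test: a minimal sketch of the
// overhead formula that main() prints at the end of each config,
//   overhead = (Sequential - RunOnly) / copies.
// It assumes the caller passes the ms/pass totals reported by Test 1
// (run + memcpy) and Test 2 (run only), with copies = compiled - 1.
// main() does not call it; the name and signature are illustrative only.
static inline double roundtrip_overhead_ms(double seqMsPerPass,
                                           double runOnlyMsPerPass,
                                           int copies) {
    if (copies <= 0) return 0.0;  // no inter-kernel copies, nothing to attribute
    return (seqMsPerPass - runOnlyMsPerPass) / (double)copies;
}
// Example: 12.0 ms/pass sequential vs 10.0 ms/pass run-only across 4 copies
// attributes 0.5 ms of CPU round-trip cost to each copy.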
int main(int argc, const char *argv[]) {
(void)argc; (void)argv;
@autoreleasepool {
mach_timebase_info_data_t tb;
mach_timebase_info(&tb);
printf("============================================================\n");
printf(" Experiment I: Multi-Kernel Throughput Ceiling\n");
printf(" Measuring CPU round-trip overhead for sequential ANE ops\n");
printf("============================================================\n\n");
ane_init();
if (!g_ane_ok) { printf("ANE not available\n"); return 1; }
typedef struct { int ch; int sp; const char *name; } Config;
Config configs[] = {
{64, 32, "64x32 (test)"},
{256, 64, "256x64 (small)"},
{768, 256, "768x256 (prod)"},
};
int nconfigs = sizeof(configs) / sizeof(configs[0]);
for (int ci = 0; ci < nconfigs; ci++) {
Config cfg = configs[ci];
printf("=== Config: %s ===\n", cfg.name);
int nlayers = 12;
ANEKernel *kernels[12];
int compiled = 0;
for (int i = 0; i < nlayers; i++) {
@try {
kernels[i] = compile_fp16_kernel(cfg.ch, cfg.sp);
if (!kernels[i]) {
printf(" Kernel %d compile failed\n", i);
break;
}
compiled++;
} @catch (NSException *ex) {
printf(" Kernel %d exception: %s\n", i,
[[ex reason] UTF8String]);
break;
}
}
printf(" Compiled %d/%d kernels\n", compiled, nlayers);
if (compiled < 2) {
printf(" Need at least 2 kernels, skipping\n\n");
for (int i = 0; i < compiled; i++) ane_free(kernels[i]);
continue;
}
size_t ioBytes = (size_t)cfg.ch * cfg.sp * 2;
int warmup = 5;
int iters = 50;
// --- Test 1: Sequential (run + memcpy chain) ---
printf("\n --- Test 1: Sequential (run + memcpy) ---\n");
{
for (int w = 0; w < warmup; w++) {
@try {
for (int i = 0; i < compiled; i++)
ane_eval(kernels[i]);
} @catch (NSException *ex) { (void)ex; }
}
uint64_t t0 = mach_absolute_time();
for (int it = 0; it < iters; it++) {
for (int i = 0; i < compiled - 1; i++) {
@try {
ane_eval(kernels[i]);
IOSurfaceLock(kernels[i]->ioOutputs[0],
kIOSurfaceLockReadOnly, NULL);
IOSurfaceLock(kernels[i+1]->ioInputs[0], 0, NULL);
memcpy(
IOSurfaceGetBaseAddress(kernels[i+1]->ioInputs[0]),
IOSurfaceGetBaseAddress(kernels[i]->ioOutputs[0]),
ioBytes);
IOSurfaceUnlock(kernels[i+1]->ioInputs[0], 0, NULL);
IOSurfaceUnlock(kernels[i]->ioOutputs[0],
kIOSurfaceLockReadOnly, NULL);
} @catch (NSException *ex) { (void)ex; }
}
@try {
ane_eval(kernels[compiled - 1]);
} @catch (NSException *ex) { (void)ex; }
}
double totalMs = (double)(mach_absolute_time() - t0) * tb.numer / tb.denom / 1e6;
double perIter = totalMs / iters;
double perKernel = perIter / compiled;
printf(" Total: %.2f ms/pass (%d kernels)\n", perIter, compiled);
printf(" Per kernel: %.3f ms\n", perKernel);
printf(" Throughput: %.0f kernels/s\n", compiled * 1000.0 / perIter);
}
// --- Test 2: Run-only (no memcpy, pure ANE overhead) ---
printf("\n --- Test 2: Run-only (no memcpy between) ---\n");
{
uint64_t t0 = mach_absolute_time();
for (int it = 0; it < iters; it++) {
for (int i = 0; i < compiled; i++) {
@try {
ane_eval(kernels[i]);
} @catch (NSException *ex) { (void)ex; }
}
}
double totalMs = (double)(mach_absolute_time() - t0) * tb.numer / tb.denom / 1e6;
double perIter = totalMs / iters;
double perKernel = perIter / compiled;
printf(" Total: %.2f ms/pass (%d kernels)\n", perIter, compiled);
printf(" Per kernel: %.3f ms\n", perKernel);
printf(" Throughput: %.0f kernels/s\n", compiled * 1000.0 / perIter);
}
// --- Test 3: Memcpy-only overhead (timing includes IOSurface lock/unlock) ---
printf("\n --- Test 3: Memcpy-only overhead ---\n");
{
uint64_t t0 = mach_absolute_time();
for (int it = 0; it < iters * 10; it++) {
for (int i = 0; i < compiled - 1; i++) {
IOSurfaceLock(kernels[i]->ioOutputs[0], kIOSurfaceLockReadOnly, NULL);
IOSurfaceLock(kernels[i+1]->ioInputs[0], 0, NULL);
memcpy(
IOSurfaceGetBaseAddress(kernels[i+1]->ioInputs[0]),
IOSurfaceGetBaseAddress(kernels[i]->ioOutputs[0]),
ioBytes);
IOSurfaceUnlock(kernels[i+1]->ioInputs[0], 0, NULL);
IOSurfaceUnlock(kernels[i]->ioOutputs[0], kIOSurfaceLockReadOnly, NULL);
}
}
double totalMs = (double)(mach_absolute_time() - t0) * tb.numer / tb.denom / 1e6;
double perIter = totalMs / (iters * 10);
double perCopy = perIter / (compiled - 1);
printf(" Total: %.3f ms/pass (%d copies)\n", perIter, compiled - 1);
printf(" Per memcpy: %.4f ms (%lu bytes)\n", perCopy, (unsigned long)ioBytes);
}
// --- Test 4: GCD serial queue ---
printf("\n --- Test 4: GCD serial queue ---\n");
{
ANEKernel **kptrs = (ANEKernel **)malloc(
(size_t)compiled * sizeof(ANEKernel *));
for (int i = 0; i < compiled; i++) kptrs[i] = kernels[i];
dispatch_queue_t q = dispatch_queue_create(
"ane.throughput", DISPATCH_QUEUE_SERIAL);
dispatch_semaphore_t sem = dispatch_semaphore_create(0);
const int ncomp = compiled;
uint64_t t0 = mach_absolute_time();
for (int it = 0; it < iters; it++) {
__block int done = 0; // mutated only by blocks on the serial queue, so no lock is needed
for (int i = 0; i < ncomp; i++) {
ANEKernel *kp = kptrs[i];
dispatch_async(q, ^{
@try {
ane_eval(kp);
} @catch (NSException *ex) { (void)ex; }
done++;
if (done == ncomp)
dispatch_semaphore_signal(sem);
});
}
dispatch_semaphore_wait(sem, DISPATCH_TIME_FOREVER);
}
double totalMs = (double)(mach_absolute_time() - t0)
* tb.numer / tb.denom / 1e6;
double perIter = totalMs / iters;
printf(" Total: %.2f ms/pass (%d kernels, serial queue)\n",
perIter, ncomp);
printf(" Per kernel: %.3f ms\n", perIter / ncomp);
free(kptrs);
}
printf("\n --- CPU Round-trip Overhead ---\n");
printf(" Overhead per copy = (Test 1 ms/pass - Test 2 ms/pass) / %d copies\n", compiled - 1);
printf(" This is what chaining would eliminate per layer.\n");
for (int i = 0; i < compiled; i++) ane_free(kernels[i]);
printf("\n");
}
printf("Done.\n");
}
return 0;
}