mirror of https://github.com/maderix/ANE.git
Merge 99ba013d9b into efcf193075
This commit is contained in:
commit
005fa4d79a
|
|
@@ -0,0 +1,563 @@
|
|||
# ANE Internals: What We Know
|
||||
|
||||
A comprehensive guide to Apple's Neural Engine (ANE) based on reverse engineering, private API exploration, and community research. This extends and updates [hollance/neural-engine](https://github.com/hollance/neural-engine/tree/master/docs) with findings from direct hardware experimentation on M4 Max / macOS 15.
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [How does the ANE work internally?](#1-how-does-the-ane-work-internally)
|
||||
2. [Can I program the ANE directly?](#2-can-i-program-the-ane-directly)
|
||||
3. [What can be compiled and run on ANE?](#3-what-can-be-compiled-and-run-on-ane)
|
||||
4. [Security and safety mechanisms](#4-security-and-safety-mechanisms)
|
||||
5. [Is the ANE 16-bit?](#5-is-the-ane-16-bit)
|
||||
6. [ANE vs GPU vs CPU](#6-ane-vs-gpu-vs-cpu)
|
||||
7. [Reverse engineering the ANE](#7-reverse-engineering-the-ane)
|
||||
8. [How to verify ANE execution](#8-how-to-verify-ane-execution)
|
||||
9. [References and external resources](#9-references-and-external-resources)
|
||||
|
||||
---
|
||||
|
||||
## 1. How does the ANE work internally?
|
||||
|
||||
> hollance/neural-engine says: "I don't think anyone outside Apple knows."
|
||||
|
||||
We now know substantially more.
|
||||
|
||||
### Hardware Architecture
|
||||
|
||||
The ANE is a fixed-function neural network accelerator integrated into Apple Silicon SoCs:
|
||||
|
||||
| Chip | ANE Cores | Peak TOPS | SRAM Budget |
|
||||
|------|-----------|-----------|-------------|
|
||||
| A12-A13 | 8 | 5 | ~4 MB |
|
||||
| A14/M1 | 16 | 11 | ~16 MB |
|
||||
| A15/M2 | 16 | 15.8 | ~24 MB |
|
||||
| M4/M4 Pro/M4 Max | 16 | 38 | ~24-32 MB |
|
||||
|
||||
The SRAM budget was estimated with `sram_probe.m`, which detects performance cliffs as the weight size grows. On M4 Max:
|
||||
- Peak efficiency at ~12.5 MB weights (282.6 GFLOPS/MB)
|
||||
- First spill at ~32 MB (drops to 59.2 GFLOPS/MB)
|
||||
- Catastrophic spilling at 128 MB (8.0 GFLOPS/MB)
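
The cliff detection behind these numbers could look roughly like the sketch below. It is not taken from `sram_probe.m`: the `probe_point` struct, the `find_first_cliff` name, and the "drop below a fraction of the best efficiency seen so far" rule are all assumptions made for illustration.

```c
#include <stddef.h>

/* One measurement from a hypothetical SRAM probe run. */
typedef struct {
    double weight_mb;      /* total weight size of the probe kernel */
    double gflops_per_mb;  /* measured efficiency at that size */
} probe_point;

/* Return the index of the first point whose efficiency drops below
 * `ratio` times the best efficiency seen so far, or -1 if none does.
 * Points are assumed to be sorted by increasing weight_mb. */
static int find_first_cliff(const probe_point *pts, size_t n, double ratio) {
    double best = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (pts[i].gflops_per_mb > best) {
            best = pts[i].gflops_per_mb;      /* new peak efficiency */
        } else if (pts[i].gflops_per_mb < ratio * best) {
            return (int)i;                    /* first spill point */
        }
    }
    return -1;
}
```

With the three measurements listed above and an assumed `ratio` of 0.5, the function flags the ~32 MB point (index 1) as the first cliff.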
|
||||
|
||||
The ANE operates on FP16 data exclusively. All I/O is through IOSurface shared memory buffers in `[1, C, 1, S]` channel-first FP16 layout.
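
As an illustration of the `[1, C, 1, S]` layout, the sketch below repacks token-major data into channel-first order. It assumes the buffer is densely packed (a real IOSurface may add per-row padding, which this ignores); `ane_index` and `pack_channel_first` are hypothetical helper names, and elements are treated as raw FP16 bit patterns.

```c
#include <stddef.h>
#include <stdint.h>

/* Element offset of channel c, position s in a dense [1, C, 1, S] tensor. */
static size_t ane_index(size_t c, size_t s, size_t S) {
    return c * S + s;
}

/* Repack token-major data src[s][c] (S rows of C values) into
 * channel-first dst[c][s]. Elements are raw FP16 bit patterns. */
static void pack_channel_first(const uint16_t *src, uint16_t *dst,
                               size_t C, size_t S) {
    for (size_t s = 0; s < S; s++)
        for (size_t c = 0; c < C; c++)
            dst[ane_index(c, s, S)] = src[s * C + c];
}
```

For two tokens with three channels each, `{1,2,3}` and `{4,5,6}`, the packed buffer becomes `{1,4, 2,5, 3,6}`: each channel's values are contiguous.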
|
||||
|
||||
### Compilation Pipeline
|
||||
|
||||
There are two paths from a neural network to ANE hardware execution:
|
||||
|
||||
**Standard CoreML path** (from [Black Hat Asia 2021, Wish Wu](https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers)):
|
||||
|
||||
```
|
||||
ML model (TF/PyTorch/Caffe)
|
||||
-> coremltools -> .mlmodel
|
||||
-> coremlc (CoreML compiler) -> .mlmodelc/
|
||||
-> espresso precompile -> net.plist + weights
|
||||
-> ANECompiler (in ane_compiler_service) -> model.hwx
|
||||
-> aned daemon -> H11ANEIn kernel driver (IOKit)
|
||||
-> ANE firmware -> hardware registers
|
||||
```
|
||||
|
||||
**Direct private API path** (what this project uses):
|
||||
|
||||
```
|
||||
MIL text + weight blobs (in memory)
|
||||
-> _ANEInMemoryModelDescriptor (ObjC object)
|
||||
-> _ANEInMemoryModel.compileWithQoS: -> ANE binary (in temp dir)
|
||||
-> _ANEInMemoryModel.loadWithQoS: -> loaded onto ANE hardware
|
||||
-> _ANEInMemoryModel.evaluateWithQoS: -> execution via aned
|
||||
```
|
||||
|
||||
The direct path bypasses CoreML, espresso, and the `.hwx` file format entirely. It compiles MIL (Model Intermediate Language) text directly into ANE-executable binary, loads it, and runs it. This is how we achieve both training and inference on the ANE without any CoreML dependency.
|
||||
|
||||
### System Architecture
|
||||
|
||||
```
|
||||
+------------------+ +------------------+ +------------------+
|
||||
| User Process | | aned daemon | | Kernel |
|
||||
| | | | | |
|
||||
| _ANEClient -----+---->| ANE scheduler +---->| H11ANEIn driver |
|
||||
| (sharedConnection)| | (all interfaces) | | (IOKit) |
|
||||
| | | | | |
|
||||
| App gets 3 IOKit | | Compiles models | | Passes model.hwx |
|
||||
| interfaces: | | Manages loading | | to ANE firmware |
|
||||
| - open | | Handles requests | | |
|
||||
| - close | +------------------+ +------------------+
|
||||
| - programSend | |
|
||||
| Request | v
|
||||
+------------------+ +------------------+
|
||||
| ANE Firmware |
|
||||
| (co-processor) |
|
||||
| |
|
||||
| Parses register |
|
||||
| operations from |
|
||||
| compiled binary |
|
||||
+------------------+
|
||||
```
|
||||
|
||||
The `aned` daemon mediates between user processes and the kernel driver. Apps only get 3 IOKit interfaces (open, close, programSendRequest). The daemon has access to all driver interfaces, which is why `_ANEClient.sharedConnection` communicates through the daemon rather than directly to the kernel.
|
||||
|
||||
### Execution Paths
|
||||
|
||||
We have benchmarked four distinct ways to trigger ANE kernel execution:
|
||||
|
||||
| Method | API | Latency (64x32) | Latency (768x256) |
|
||||
|--------|-----|------------------|--------------------|
|
||||
| Standard | `model.evaluateWithQoS:options:request:error:` | 0.175 ms | 0.205 ms |
|
||||
| Real-Time | `client.evaluateRealTimeWithModel:options:request:error:` | 0.093 ms | 0.246 ms |
|
||||
| processRequest | `program.processRequest:model:qos:...` | 0.131 ms | 0.185 ms |
|
||||
| Direct | `client.doEvaluateDirectWithModel:options:request:qos:error:` | 0.225 ms | N/A |
|
||||
|
||||
**Key finding**: At production kernel dimensions (768x256, matching Stories110M), the three paths measured at that size converge to ~0.2 ms per kernel (0.185-0.246 ms). The 1.88x real-time speedup observed on small 64x32 kernels (0.175 ms vs 0.093 ms) does not hold at production scale, where the real-time path is the slowest of the three. The standard path remains the most reliable.
|
||||
|
||||
### Resource Limits
|
||||
|
||||
The ANE runtime leaks internal resources during compilation. After ~119 compiles per process, subsequent compilations fail silently. The workaround is checkpoint-and-restart: save weights and optimizer state, terminate the process, and re-launch with `--resume`.
|
||||
|
||||
With `MAX_COMPILES=100` (conservative) and 60 weight-bearing kernels per batch (12 layers x 5 kernels), only 1 training batch fits per process lifetime.
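
The budget arithmetic can be written down as a small helper. This is only a sketch: `batches_per_process` is a hypothetical name, and it assumes every weight-bearing kernel is recompiled exactly once per batch.

```c
/* Whole training batches that fit into one process before the compile
 * budget is exhausted and a checkpoint-and-restart is required. */
static int batches_per_process(int max_compiles, int layers,
                               int kernels_per_layer) {
    int per_batch = layers * kernels_per_layer;  /* compiles per batch */
    if (per_batch <= 0) return 0;
    return max_compiles / per_batch;             /* integer floor */
}
```

`batches_per_process(100, 12, 5)` gives 1. Even at the observed ~119 limit the result is still 1, because a second batch would need 120 compiles.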
|
||||
|
||||
---
|
||||
|
||||
## 2. Can I program the ANE directly?
|
||||
|
||||
> hollance/neural-engine says: "Unfortunately not. You can only use the Neural Engine through Core ML."
|
||||
|
||||
**Yes, you can.** The `AppleNeuralEngine.framework` contains 67+ private Objective-C classes that provide direct access to the ANE without CoreML. This project uses them for both training and inference.
|
||||
|
||||
### Minimal Example
|
||||
|
||||
The core compilation/load/execution cycle in pseudocode:
|
||||
|
||||
```objc
#import <Foundation/Foundation.h>
#import <IOSurface/IOSurface.h>
#import <dlfcn.h>
#import <objc/runtime.h>

// Sketch only. The private classes ship without headers, so code that must
// compile has to resolve them at runtime and declare the selectors it sends.

// Load the private framework
void *ane = dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
if (!ane) NSLog(@"dlopen failed: %s", dlerror());

// Resolve the private classes by name
Class DescriptorClass = NSClassFromString(@"_ANEInMemoryModelDescriptor");
Class ModelClass      = NSClassFromString(@"_ANEInMemoryModel");
Class RequestClass    = NSClassFromString(@"_ANERequest");

NSError *error = nil;

// Write MIL program as text
NSData *milData = [@"program(1.0) { ... }" dataUsingEncoding:NSUTF8StringEncoding];

// Create descriptor (weightDict is a placeholder for the weight blobs)
id descriptor = [DescriptorClass modelWithMILText:milData
                                          weights:weightDict
                                     optionsPlist:nil];

// Compile -> Load -> Run (inspect `error` after each call)
id model = [ModelClass inMemoryModelWithDescriptor:descriptor];
[model compileWithQoS:21 options:nil error:&error];
[model loadWithQoS:21 options:nil error:&error];

// Create IOSurface I/O and request (inputSurface/outputSurface are
// placeholders; they presumably need wrapping in _ANEIOSurfaceObject)
id request = [RequestClass requestWithInputs:@[inputSurface]
                                inputIndices:@[@0]
                                     outputs:@[outputSurface]
                               outputIndices:@[@0]
                               weightsBuffer:nil
                                   perfStats:nil
                              procedureIndex:0];

[model evaluateWithQoS:21 options:nil request:request error:&error];
```
|
||||
|
||||
A complete reusable wrapper is implemented in [`training/ane_runtime.h`](../training/ane_runtime.h) with functions:
|
||||
- `ane_init()` -- load framework, resolve classes
|
||||
- `ane_compile(kernel, mil_text, weight_dict)` -- compile MIL to ANE binary
|
||||
- `ane_run(kernel)` -- standard execution path
|
||||
- `ane_free(kernel)` -- unload and release resources
|
||||
|
||||
### MIL (Model Intermediate Language)
|
||||
|
||||
MIL is Apple's intermediate representation for neural network operations. Key facts:
|
||||
|
||||
- Text-based format: `program(1.0) { func main(...) { ... } }`
|
||||
- Targets: `ios16`, `ios17`, `ios18` (determines available ops)
|
||||
- All tensors are 4D: `[batch, channels, height, width]`; this project uses the `[1, C, 1, S]` special case (batch and height fixed at 1)
|
||||
- Convolutions (`conv`) are the workhorse: a 1x1 conv with `[out_ch, in_ch, 1, 1]` weights = matrix multiply
|
||||
- Weights referenced via `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(64))`
|
||||
- Weights are baked at compile time and cannot be swapped at runtime
|
||||
|
||||
Supported operations include: `conv`, `matmul`, `add`, `mul`, `sigmoid`, `softmax`, `reshape`, `transpose`, `concat`, `reduce_mean`, `rsqrt`, `cast`, `constexpr_affine_dequantize`, and more.
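
The claim in the list above that a 1x1 convolution equals a matrix multiply can be checked with a CPU reference. The sketch below assumes dense `[1, C, 1, S]` buffers in FP32 for readability; `conv1x1_ref` is a hypothetical name, not a function from this project.

```c
#include <stddef.h>

/* Reference 1x1 convolution on a dense [1, C_in, 1, S] input.
 * w is [C_out][C_in] (the [out_ch, in_ch, 1, 1] weights flattened),
 * x is [C_in][S], y is [C_out][S]. This computes y = w * x. */
static void conv1x1_ref(const float *w, const float *x, float *y,
                        size_t c_out, size_t c_in, size_t S) {
    for (size_t o = 0; o < c_out; o++) {
        for (size_t s = 0; s < S; s++) {
            float acc = 0.0f;
            for (size_t i = 0; i < c_in; i++)
                acc += w[o * c_in + i] * x[i * S + s];  /* dot over channels */
            y[o * S + s] = acc;
        }
    }
}
```

Because the kernel covers a single spatial position, each output column depends only on the matching input column, which is exactly the structure of a matrix product between the weight matrix and the activation matrix.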
|
||||
|
||||
### Alternative: ANECompiler CLI
|
||||
|
||||
[ANETools](https://github.com/antgroup-skyward/ANETools) (from Wish Wu / Ant Group) provides command-line tools that invoke the ANECompiler module directly:
|
||||
|
||||
```bash
|
||||
# Convert mlmodelc to ANE-compatible format
|
||||
MLModelCToANECompiler input.mlmodelc output/
|
||||
|
||||
# Compile to hardware format
|
||||
ANECompiler --target-arch ane_v5 --debug-mask 2147483647 net.plist weights/ output.hwx
|
||||
|
||||
# Disassemble compiled binary
|
||||
ANEDisassembler output.hwx
|
||||
```
|
||||
|
||||
The `--debug-mask` flag, set here to 2147483647 (`0x7FFFFFFF`, the largest signed 32-bit integer, so all 31 low bits are set), generates intermediate files during compilation that reveal internal register operations.
|
||||
|
||||
---
|
||||
|
||||
## 3. What can be compiled and run on ANE?
|
||||
|
||||
Any computation expressible as a static MIL (Model Intermediate Language) dataflow graph that the E5 compiler accepts. The ANE is a fixed-function accelerator, not a general-purpose processor -- it executes predefined operation graphs, not arbitrary code.
|
||||
|
||||
### Verified Operations
|
||||
|
||||
These operations have been compiled to custom MIL programs and executed on ANE hardware with output validated against CPU reference implementations (see `test_mil_custom.m`):
|
||||
|
||||
| Category | Operations | Notes |
|
||||
|----------|-----------|-------|
|
||||
| Activations | `relu`, `gelu`, `softmax` | GELU supports EXACT, TANH_APPROXIMATION, SIGMOID_APPROXIMATION modes |
|
||||
| Normalization | `layer_norm` | Epsilon type must match gamma/beta dtype |
|
||||
| Attention | `scaled_dot_product_attention` | Fused Q@K^T/sqrt(d) + softmax + @V in a single op (iOS 18+) |
|
||||
| Linear algebra | `linear` (const weights), `matmul` (runtime tensors) | `linear` requires compile-time constant weights; `matmul` supports runtime inputs |
|
||||
| Type conversion | `cast` | fp32 <-> fp16. Required at ANE I/O boundaries |
|
||||
| Elementwise | `add`, `mul`, `real_div` | Broadcasting supported |
|
||||
| Shape | `reshape`, `transpose`, `concat`, `slice_by_index` | `concat` requires `interleave` param |
|
||||
| Composite | Full transformer block (LN + SDPA + Residual + FFN + GELU) | Compiles and runs as a single ANE program (~0.21 ms) |
|
||||
|
||||
### Available but Not Yet Tested
|
||||
|
||||
These are valid MIL operations that the E5 compiler should accept:
|
||||
|
||||
- `conv` -- convolutions (the upstream maderix/ANE repo uses these extensively for training)
|
||||
- `reduce_sum`, `reduce_mean`, `reduce_max` -- reductions
|
||||
- `gather`, `scatter` -- embedding lookups, KV cache writes
|
||||
- `rsqrt`, `sqrt`, `exp`, `log`, `tanh` -- unary math
|
||||
- `split`, `slice_by_size` -- tensor slicing
|
||||
- `batch_norm`, `instance_norm` -- normalization variants
|
||||
- Various pooling, padding, upsampling operations
|
||||
|
||||
### What Cannot Run on ANE
|
||||
|
||||
| Limitation | Detail |
|
||||
|-----------|--------|
|
||||
| No control flow | No loops, conditionals, or branching. MIL is a static dataflow graph. |
|
||||
| No dynamic shapes | All tensor dimensions must be known at compile time. |
|
||||
| No runtime weight updates | Weights are `const`, baked into the compiled binary. Changing weights requires recompilation (~10-50 ms). |
|
||||
| No arbitrary memory access | No pointers or indexing beyond what `gather`/`scatter` provide. |
|
||||
| No custom ops | Only operations in Apple's MIL op set. No user-defined kernels at the hardware level. |
|
||||
| No FP32 compute | ANE computes in FP16 only. FP32 inputs are cast to FP16 internally. |
|
||||
|
||||
### Implications for Training
|
||||
|
||||
The ANE can execute the forward pass and the matrix math of backpropagation (`matmul` for dX and dW gradients). However, training is impractical because weights are read-only constants. After computing weight gradients on ANE, the optimizer step (W -= lr * dW) must run on CPU, and the MIL program must be recompiled with updated weights before the next forward pass. This recompilation costs ~10-50 ms per step, dominating training time. See [ANE_CHAINING_RESEARCH.md, Section 9](ANE_CHAINING_RESEARCH.md#9-ane-training-feasibility-analysis) for detailed analysis.
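
A back-of-the-envelope estimate makes the "dominating" claim concrete. The sketch below is an assumption-laden illustration, not a measurement: it combines the ~0.2 ms kernel latency and the 60 kernels per batch quoted elsewhere in this document, ignores CPU-side work, and `recompile_fraction` is a hypothetical helper.

```c
/* Fraction of a training step spent recompiling, given the ANE
 * execution time and the recompilation time for that step (both in ms). */
static double recompile_fraction(double exec_ms, double recompile_ms) {
    double total = exec_ms + recompile_ms;
    return total > 0.0 ? recompile_ms / total : 0.0;
}
```

With 60 kernels at ~0.2 ms each (12 ms of ANE execution), a 10 ms recompile accounts for about 45% of the step and a 50 ms recompile for about 81%.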
|
||||
|
||||
---
|
||||
|
||||
## 4. Security and safety mechanisms
|
||||
|
||||
The ANE has multiple layers of safety enforcement, but Apple's security model assumes access goes through CoreML. The private APIs we use bypass CoreML but still pass through the `aned` daemon and the E5 compiler.
|
||||
|
||||
### Compile-Time Safety
|
||||
|
||||
| Mechanism | What it does |
|
||||
|-----------|-------------|
|
||||
| MIL syntax validation | The E5 compiler rejects malformed MIL with `InvalidMILProgram` errors |
|
||||
| Type checking | Tensor dtypes, shapes, and parameter types must match exactly. Mismatches cause compile errors (e.g., `layer_norm` epsilon must match gamma/beta dtype; `concat` axis must be `int32` scalar, not tensor) |
|
||||
| Op validation | Unknown or unsupported operations are rejected |
|
||||
| I/O matching | MIL input/output names and shapes must match the `MLModelDescription` passed to `MLE5Engine` |
|
||||
|
||||
### Runtime Safety
|
||||
|
||||
| Mechanism | What it does |
|
||||
|-----------|-------------|
|
||||
| Shape enforcement | Input tensors must match declared shape exactly -- `MultiArray shape doesn't match ML Program's expected shape` error on mismatch |
|
||||
| Daemon mediation | ANE runs through the `aned` daemon (system service). User processes only get 3 IOKit interfaces: open, close, `programSendRequest` |
|
||||
| IOSurface isolation | I/O memory is managed by the kernel via IOSurface. Cannot read/write arbitrary memory through them |
|
||||
| SRAM limits | Programs exceeding the ANE SRAM budget (~24-32 MB on M4 Max) are rejected or fall back to CPU/GPU |
|
||||
| Compile limit | ~119 compiled programs per process before the compiler leaks enough resources to fail (resource exhaustion, not a security boundary) |
|
||||
|
||||
### Sandbox Interaction
|
||||
|
||||
The E5 runtime needs write access to `~/Library/Caches/<binary_name>/` for its ANE specialization cache. macOS app sandbox can block this, causing compilation to fail with permission errors. When running outside a sandbox (e.g., command-line tools), this directory is created automatically.
|
||||
|
||||
### What is NOT Protected
|
||||
|
||||
| Gap | Detail |
|
||||
|-----|--------|
|
||||
| No access control | No authentication or entitlement check for using the private APIs. Any process can call `_ANEClient.sharedConnection` |
|
||||
| No rate limiting | Programs can be compiled in a loop until the ~119 limit exhausts resources |
|
||||
| No MIL signing | No code signing validation on MIL text -- any syntactically valid program that passes the compiler's type checks will execute |
|
||||
| No isolation between programs | Multiple programs from the same process share the ANE with no hardware-level isolation (the daemon schedules them) |
|
||||
|
||||
### Practical Risk Assessment
|
||||
|
||||
The ANE attack surface is limited because:
|
||||
|
||||
1. **Fixed-function hardware**: The ANE executes predefined neural network operations, not arbitrary instructions. There is no instruction pointer, no stack, and no way to jump to arbitrary code.
|
||||
2. **Typed dataflow**: MIL programs operate on typed tensors with fixed shapes. There are no buffer overflows in the traditional sense -- the compiler enforces all dimensions at compile time.
|
||||
3. **Daemon intermediary**: All ANE access goes through `aned`, which validates requests before forwarding to the kernel driver. Direct IOKit access to the ANE is restricted to 3 interfaces.
|
||||
4. **No persistent state**: ANE programs don't persist across reboots. Compiled programs live in temp directories and caches that are cleaned by the OS.
|
||||
|
||||
The main risk of the private APIs is **stability**: these APIs are undocumented and may change with any macOS update, potentially breaking programs that depend on them.
|
||||
|
||||
---
|
||||
|
||||
## 5. Is the ANE 16-bit?
|
||||
|
||||
> hollance/neural-engine says: "It appears so."
|
||||
|
||||
**Confirmed.** The ANE operates in FP16 for both compute and storage:
|
||||
|
||||
- All IOSurface I/O must be FP16. Passing FP32 data produces zeros.
|
||||
- MIL programs must use `fp16` I/O types (setting `g_fp16_io=1` in our codebase)
|
||||
- FP32-to-FP16 conversion happens on the CPU before writing to IOSurfaces
|
||||
- FP16 range limits: magnitudes above 65504 overflow to infinity, and ~5.96e-8 is the smallest positive (subnormal) value, so much smaller magnitudes underflow to zero
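
The range limits in the list above can be seen in a portable FP32-to-FP16 conversion. The sketch below is a generic IEEE 754 round-to-nearest-even implementation written for illustration; it is not this project's conversion code, and `f32_to_f16` is a hypothetical name. On Apple toolchains a plain `(_Float16)` cast does the same job.

```c
#include <stdint.h>
#include <string.h>

/* Convert an FP32 value to FP16 bits, rounding to nearest even. */
static uint16_t f32_to_f16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t exp  = (x >> 23) & 0xFFu;
    uint32_t mant = x & 0x7FFFFFu;

    if (exp == 0xFFu)                        /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0u));

    int32_t e = (int32_t)exp - 127 + 15;     /* re-biased exponent */
    if (e >= 31)                             /* too large: overflow to Inf */
        return (uint16_t)(sign | 0x7C00u);

    uint32_t half, rem, halfway;
    if (e <= 0) {                            /* subnormal or zero in FP16 */
        if (e < -10) return (uint16_t)sign;  /* under half the smallest subnormal */
        mant |= 0x800000u;                   /* restore the implicit leading 1 */
        uint32_t shift = (uint32_t)(14 - e);
        half    = mant >> shift;
        rem     = mant & ((1u << shift) - 1u);
        halfway = 1u << (shift - 1u);
    } else {                                 /* normal FP16 value */
        half    = ((uint32_t)e << 10) | (mant >> 13);
        rem     = mant & 0x1FFFu;
        halfway = 0x1000u;
    }
    /* Round to nearest, ties to even; a carry may bump the exponent. */
    if (rem > halfway || (rem == halfway && (half & 1u))) half++;
    return (uint16_t)(sign | half);
}
```

`65504.0f` maps to `0x7BFF` (the largest finite FP16 value), `70000.0f` maps to `0x7C00` (infinity), and `1e-8f` maps to zero.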
|
||||
|
||||
### Quantization Support
|
||||
|
||||
| Format | ANE Native? | Notes |
|
||||
|--------|------------|-------|
|
||||
| FP16 | Yes | Native compute and storage format |
|
||||
| INT8 | Partial | Memory bandwidth savings only, no compute speedup. `constexpr_affine_dequantize` in MIL dequantizes to FP16 before compute |
|
||||
| Q4 | No | Not supported. Requires GPU (Metal) or CPU dequantization |
|
||||
| FP32 | No | Internally converted to FP16; higher precision lost |
|
||||
|
||||
Apple markets ANE TOPS using INT8, so the 38 TOPS figure for M4 corresponds to roughly 19 TFLOPS in FP16, assuming FP16 runs at half the quoted INT8 rate.
|
||||
|
||||
---
|
||||
|
||||
## 6. ANE vs GPU vs CPU
|
||||
|
||||
Benchmarked on Qwen2.5-0.5B (dim=896, 24 layers, 494M params) on M4 Max:
|
||||
|
||||
### Decode Performance (single-token generation)
|
||||
|
||||
| Engine | Format | Weight Size | Decode t/s | Bottleneck |
|
||||
|--------|--------|-------------|------------|------------|
|
||||
| CPU AMX (cblas_sgemv) | F32 | 1.97 GB | ~91 t/s | Memory bandwidth |
|
||||
| CPU AMX (cblas_sgemv) | F16->F32 | 658 MB disk | ~91 t/s | Memory bandwidth (F32 in RAM) |
|
||||
| CPU AMX (cblas_sgemv) | Q4->F32 | 188 MB disk | ~91 t/s | Memory bandwidth (dequant at load) |
|
||||
| Metal GPU (Q4 SIMD) | Q4 | 188 MB | ~10 t/s | Dispatch overhead (~400 dispatches/token) |
|
||||
| LM Studio (MLX) | Q4 MLX | ~188 MB | 258-496 t/s | Optimized Metal kernels |
|
||||
|
||||
### Prefill Performance (batch prompt processing)
|
||||
|
||||
| Engine | Format | Prefill t/s | Method |
|
||||
|--------|--------|-------------|--------|
|
||||
| CPU AMX (cblas_sgemm) | F32 | 880-960 t/s | Batched matmul |
|
||||
| CPU AMX (cblas_sgemv) | F32 | ~40 t/s | Sequential per-token |
|
||||
|
||||
### ANE Training Kernel Performance
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Kernel latency | ~0.2 ms per kernel (768x256 production dims) |
|
||||
| Peak TFLOPS | 11.14 (128x conv 512ch sp64) |
|
||||
| Sustained training | 1.29-1.68 TFLOPS |
|
||||
| ANE utilization | 8-11% of peak |
|
||||
|
||||
### When to use each
|
||||
|
||||
- **ANE**: Best for parallel FP16 operations where data stays on-chip (training kernels, fused attention). The ~119 compile limit and FP16-only restriction are significant constraints.
|
||||
- **GPU (Metal)**: Best for large models (dim >= 4096) where native quantized matmul kernels (as in MLX/llama.cpp) can read Q4/Q8 data directly from GPU memory. Dispatch overhead dominates for small models.
|
||||
- **CPU AMX**: Best for small/medium model decode (dim <= 896). `cblas_sgemv` uses the AMX coprocessor internally and achieves ~33% of theoretical bandwidth. Our own manual NEON, multithreaded, and Metal implementations did not beat it at this model size, although optimized MLX kernels do (see the decode table above).
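
The ~33% bandwidth figure can be reproduced from the decode table with simple arithmetic. In the sketch below the helper names are hypothetical, and the 546 GB/s peak is an assumption (Apple's published figure for the highest-bandwidth M4 Max configuration); the document does not state which peak it used.

```c
/* Effective memory traffic (GB/s) when each generated token streams
 * every weight once: weight size times tokens per second. */
static double decode_bandwidth_gbs(double weight_gb, double tokens_per_s) {
    return weight_gb * tokens_per_s;
}

/* Share of an assumed peak memory bandwidth that this traffic uses. */
static double bandwidth_utilization(double weight_gb, double tokens_per_s,
                                    double peak_gbs) {
    return decode_bandwidth_gbs(weight_gb, tokens_per_s) / peak_gbs;
}
```

For the F32 row (1.97 GB at ~91 t/s) this gives about 179 GB/s, or roughly 33% of the assumed 546 GB/s peak.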
|
||||
|
||||
---
|
||||
|
||||
## 7. Reverse engineering the ANE
|
||||
|
||||
### Prior Work
|
||||
|
||||
| Project | Focus | Key Contribution |
|
||||
|---------|-------|-------------------|
|
||||
| [hollance/neural-engine](https://github.com/hollance/neural-engine) | CoreML-level documentation | Comprehensive device list, layer compatibility, model surgery guides |
|
||||
| [geohot/tinygrad ANE](https://github.com/tinygrad/tinygrad) | Driver-level reverse engineering | Initial IOKit driver analysis, ANE instruction format exploration |
|
||||
| [Black Hat Asia 2021 (Wish Wu)](https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers) | Full stack: ML to HW registers | Documented compilation pipeline, .hwx format, security attack surfaces, FaceID ANE usage. Created ANEDisassembler. [Video](https://www.youtube.com/watch?v=1wvBDUnPNEo) |
|
||||
| [ANETools](https://github.com/antgroup-skyward/ANETools) | CLI compilation and disassembly | ANECompiler CLI wrapper, ANEDisassembler for .hwx files, `debug_mask` flag for intermediate output |
|
||||
| [eiln/anecc](https://github.com/eiln/anecc) | Independent ANE compiler | CoreML-to-ANE compiler for Asahi Linux, alternative compilation path |
|
||||
| [freedomtan/coreml_to_ane_hwx](https://github.com/freedomtan/coreml_to_ane_hwx) | CoreML to .hwx conversion | Direct converter bypassing some CoreML steps |
|
||||
| [maderix/ANE](https://github.com/maderix/ANE) | Training on ANE | First neural network training on ANE via private APIs |
|
||||
| [maderix Substack](https://open.substack.com/pub/maderix/p/inside-the-m4-apple-neural-engine) | M4 ANE deep-dive | Detailed M4 ANE architecture analysis, SRAM probing, kernel fusion |
|
||||
|
||||
### Our Discoveries: Private API Class Hierarchy
|
||||
|
||||
We have documented 20+ private Objective-C classes in `AppleNeuralEngine.framework`:
|
||||
|
||||
```
|
||||
NSObject
|
||||
|-- _ANEClient (singleton, daemon connection)
|
||||
| Methods: sharedConnection, evaluateWithModel:, evaluateRealTimeWithModel:,
|
||||
| doEvaluateDirectWithModel:, prepareChainingWithModel:,
|
||||
| enqueueSetsWithModel:, buffersReadyWithModel:,
|
||||
| beginRealTimeTask, endRealTimeTask
|
||||
|
|
||||
|-- _ANEInMemoryModelDescriptor (MIL + weights spec)
|
||||
| Factory: +modelWithMILText:weights:optionsPlist:
|
||||
|
|
||||
|-- _ANEInMemoryModel (compile/load/run)
|
||||
| Methods: compileWithQoS:, loadWithQoS:, evaluateWithQoS:, unloadWithQoS:
|
||||
| Props: hexStringIdentifier, programHandle (uint64), program, perfStatsMask
|
||||
|
|
||||
|-- _ANEModel (disk-based compiled model -- 52 instance methods)
|
||||
| Factory: +modelAtURL:key:, +modelAtURL:key:modelAttributes:
|
||||
| Methods: getUUID, inputSymbolIndicesForProcedureIndex:,
|
||||
| outputSymbolIndicesForProcedureIndex:
|
||||
| Props: mapper, program
|
||||
|
|
||||
|-- _ANERequest (I/O surface packaging)
|
||||
| Factory: +requestWithInputs:inputIndices:outputs:outputIndices:
|
||||
| weightsBuffer:perfStats:procedureIndex:
|
||||
|
|
||||
|-- _ANEIOSurfaceObject (thin IOSurface wrapper)
|
||||
| Factory: +objectWithIOSurface:
|
||||
|
|
||||
|-- _ANEBuffer (IOSurfaceObject + symbolIndex + source) [KEY DISCOVERY]
|
||||
| Factory: +bufferWithIOSurfaceObject:symbolIndex:source:
|
||||
| source: 0=ANE, 1=output, 2=unknown
|
||||
|
|
||||
|-- _ANEChainingRequest (multi-op pipeline)
|
||||
| Factory: +chainingRequestWithInputs:outputSets:lbInputSymbolId:
|
||||
| lbOutputSymbolId:procedureIndex:signalEvents:
|
||||
| transactionHandle:fwEnqueueDelay:memoryPoolId:
|
||||
| Methods: validate
|
||||
|
|
||||
|-- _ANEIOSurfaceOutputSets (output packaging for chaining)
|
||||
| Factory: +objectWithstatsSurRef:outputBuffer:
|
||||
| Note: requires non-NULL statsSurRef (any IOSurface works, even 64 bytes)
|
||||
|
|
||||
|-- _ANEInputBuffersReady (input signaling for chaining)
|
||||
| Factory: +inputBuffersWithProcedureIndex:inputBufferInfoIndex:
|
||||
| inputFreeValue:executionDelay:
|
||||
|
|
||||
|-- _ANEOutputSetEnqueue (output pipeline config for chaining)
|
||||
| Factory: +outputSetWithProcedureIndex:setIndex:signalValue:
|
||||
| signalNotRequired:isOpenLoop:
|
||||
|
|
||||
|-- _ANEProgramForEvaluation (lower-level program)
|
||||
| Factory: +programWithHandle:intermediateBufferHandle:queueDepth:
|
||||
| Methods: processRequest:model:qos:qIndex:modelStringID:options:
|
||||
| returnValue:error:
|
||||
|
|
||||
|-- _ANEProgramIOSurfacesMapper (symbol-to-surface mapping)
|
||||
| Factory: +mapperWithProgramHandle:, +mapperWithController:
|
||||
| Note: only works with _ANEModel, not _ANEInMemoryModel
|
||||
|
|
||||
|-- _ANEPerformanceStats
|
||||
| Factory: +statsWithHardwareExecutionNS:
|
||||
| Props: hwExecutionTime, performanceCounters
|
||||
|
|
||||
|-- _ANESharedSignalEvent (hardware signal fence)
|
||||
| Factory: +signalEventWithValue:symbolIndex:eventType:sharedEvent:
|
||||
| Requires IOSurfaceSharedEvent objects
|
||||
|
|
||||
|-- _ANESharedWaitEvent (hardware wait fence)
|
||||
| Factory: +waitEventWithValue:sharedEvent:
|
||||
| Requires IOSurfaceSharedEvent objects
|
||||
|
|
||||
|-- _ANEModelInstanceParameters, _ANEDeviceController, _ANEQoSMapper
|
||||
```
|
||||
|
||||
Full details with experiment logs: [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md)
|
||||
|
||||
### ChainingRequest API Status
|
||||
|
||||
The `_ANEChainingRequest` API is designed to pipeline multiple ANE operations without CPU round-trips. Current status:
|
||||
|
||||
- `_ANEChainingRequest.validate` returns **YES** (with `_ANEBuffer` inputs + `_ANEIOSurfaceOutputSets` outputs)
|
||||
- `prepareChainingWithModel:` **fails** -- calls `getUUID` on `_ANEInMemoryModel` which lacks it
|
||||
- Requires `_ANEModel` (disk-based compiled model) which has `getUUID` and symbol index methods
|
||||
- `_ANEModel` factory methods require a `key:` parameter; the hex identifier from `_ANEInMemoryModel` is the likely key
|
||||
|
||||
This is the highest-priority research area. Chaining would eliminate the ~23 CPU-ANE round-trips per token in a 12-layer model, potentially enabling on-chip pipeline execution.
|
||||
|
||||
### model.hwx Binary Format
|
||||
|
||||
The `.hwx` file is the compiled hardware representation loaded by the ANE kernel driver. From Wu's Black Hat research:
|
||||
|
||||
- Mach-O format binary containing register operations
|
||||
- Compiled from `net.plist` + weights by the ANECompiler module
|
||||
- Loaded by the `H11ANEIn` kernel driver via `programCreate` interface
|
||||
- ANE firmware parses it to extract register addresses and values
|
||||
- Can be disassembled with [ANETools/ANEDisassembler](https://github.com/antgroup-skyward/ANETools)
|
||||
|
||||
Our `_ANEInMemoryModel` path bypasses `.hwx` generation -- the model goes directly from MIL to an internal binary format in a temp directory. Whether this temp directory contains an equivalent to `.hwx` is an open question (see [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md) for next steps).
|
||||
|
||||
---
|
||||
|
||||
## 8. How to verify ANE execution
|
||||
|
||||
### Power Monitoring
|
||||
|
||||
```bash
|
||||
sudo powermetrics --samplers ane_power -i 1000
|
||||
```
|
||||
|
||||
Shows real-time ANE power draw. Active ANE usage typically shows 2-4 W on M4 Max during training.
|
||||
|
||||
### Performance Statistics
|
||||
|
||||
```objc
|
||||
model.perfStatsMask = 0xFF;
|
||||
// After execution:
|
||||
// model.performanceCounters -- returns nil on current macOS (limited API)
|
||||
```
|
||||
|
||||
The `_ANEPerformanceStats` class exists and can be instantiated via `+statsWithHardwareExecutionNS:`, but the hardware counters are not populated on the current macOS/M4 combination. The `perfStatsMask` property is accepted but `performanceCounters` returns nil after execution.
|
||||
|
||||
### IOSurface Output Validation
|
||||
|
||||
Read back FP16 data from output IOSurfaces and compare against CPU reference:
|
||||
|
||||
```objc
#import <IOSurface/IOSurface.h>

// Lock before reading so the CPU sees the data the ANE wrote
if (IOSurfaceLock(surface, kIOSurfaceLockReadOnly, NULL) == kIOReturnSuccess) {
    const _Float16 *out = (const _Float16 *)IOSurfaceGetBaseAddress(surface);
    for (int i = 0; i < n; i++) {
        float val = (float)out[i];
        // Compare val against the CPU reference for element i
    }
    IOSurfaceUnlock(surface, kIOSurfaceLockReadOnly, NULL);
}
```
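
The comparison step inside that loop might be factored out as below. This is a sketch under assumptions: `outputs_match` is a hypothetical helper, the ANE values are assumed to be already widened to `float`, and the tolerances are for the caller to choose (FP16 carries about three significant decimal digits, so a relative tolerance near 1e-2 is a plausible starting point rather than a project setting).

```c
#include <stddef.h>

/* Return 1 if every ANE output is within atol + rtol * |ref| of the
 * CPU reference, 0 otherwise. NaN in either input counts as a mismatch. */
static int outputs_match(const float *ane, const float *ref, size_t n,
                         float atol, float rtol) {
    for (size_t i = 0; i < n; i++) {
        float diff = ane[i] - ref[i];
        if (diff < 0.0f) diff = -diff;
        float mag = ref[i] < 0.0f ? -ref[i] : ref[i];
        if (!(diff <= atol + rtol * mag))   /* also false for NaN */
            return 0;
    }
    return 1;
}
```

Combining an absolute and a relative term keeps the check meaningful both for outputs near zero and for large activations, where FP16 rounding error grows with magnitude.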
|
||||
|
||||
### ANE Compiler Debug Output
|
||||
|
||||
From Wu's research, the ANECompiler module has a `debug_mask` flag. Setting it to `2147483647` (max int) generates intermediate files during compilation, revealing:
|
||||
- Register operation sequences
|
||||
- Memory allocation decisions
|
||||
- Tiling strategies
|
||||
- Weight layout in SRAM
|
||||
|
||||
This can be applied when using the ANECompiler CLI tools from [ANETools](https://github.com/antgroup-skyward/ANETools).
|
||||
|
||||
---
|
||||
|
||||
## 9. References and external resources
|
||||
|
||||
### Documentation and Research
|
||||
|
||||
| Resource | URL | Focus |
|
||||
|----------|-----|-------|
|
||||
| hollance/neural-engine | https://github.com/hollance/neural-engine | CoreML-level ANE docs |
|
||||
| maderix Substack | https://open.substack.com/pub/maderix/p/inside-the-m4-apple-neural-engine | M4 ANE architecture |
|
||||
| Black Hat Asia 2021 | https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers | Full stack reverse engineering |
| BH Asia 2021 Video | https://www.youtube.com/watch?v=1wvBDUnPNEo | 30-min talk by Wish Wu |
| Apple ML Research | https://machinelearning.apple.com/research/neural-engine-transformers | Deploying transformers on ANE |
| ANE Supported Devices | https://github.com/hollance/neural-engine/blob/master/docs/supported-devices.md | Comprehensive device/chip list |

### Tools

| Tool | URL | Purpose |
|------|-----|---------|
| ANETools | https://github.com/antgroup-skyward/ANETools | ANECompiler CLI, ANEDisassembler |
| eiln/anecc | https://github.com/eiln/anecc | Independent ANE compiler (Asahi Linux) |
| freedomtan/coreml_to_ane_hwx | https://github.com/freedomtan/coreml_to_ane_hwx | CoreML to .hwx converter |
| coremltools | https://github.com/apple/coremltools | Apple's official ML model tools |

### Projects Using ANE Directly

| Project | URL | What it does |
|---------|-----|-------------|
| maderix/ANE | https://github.com/maderix/ANE | Training on ANE (this project's upstream) |
| dev-erik/ANE | https://github.com/dev-erik/ANE | This fork: inference optimization, ChainingRequest research |

### This Project's ANE Documentation

| Document | Description |
|----------|-------------|
| [ANE_INTERNALS.md](ANE_INTERNALS.md) | This file -- comprehensive ANE internals guide |
| [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md) | ChainingRequest API research, experiment logs, benchmarks |
| [ARCHITECTURE.md](ARCHITECTURE.md) | Training system architecture, kernel fusion map, data flow |
| [API_REFERENCE.md](API_REFERENCE.md) | Complete function index for all source files |
| [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) | M4 Max benchmark results (training, TFLOPS, SRAM) |

@@ -1,14 +1,21 @@
|
|||
CC = xcrun clang
|
||||
CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc
|
||||
CC_C = xcrun clang
|
||||
|
||||
ANE_COMPAT = -Wno-deprecated-declarations
|
||||
SEC_FLAGS = -fstack-protector-strong -Wformat-security
|
||||
|
||||
CFLAGS = -O2 -Wall $(ANE_COMPAT) -fobjc-arc $(SEC_FLAGS)
|
||||
CFLAGS_C = -O2 -Wall -Wextra -Werror -std=c11
|
||||
CFLAGS_DEBUG = -O0 -g -Wall $(ANE_COMPAT) -fobjc-arc -fsanitize=address,undefined
|
||||
FRAMEWORKS = -framework Foundation -framework CoreML -framework IOSurface
|
||||
LDFLAGS = $(FRAMEWORKS) -ldl
|
||||
|
||||
HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h
|
||||
HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h data_validation.h
|
||||
|
||||
HEADERS_ANE = $(HEADERS_LARGE) ane_rmsnorm_bwd.h ane_classifier.h
|
||||
|
||||
train: train.m ane_runtime.h ane_mil_gen.h model.h forward.h backward.h
|
||||
$(CC) $(CFLAGS) -o $@ train.m $(LDFLAGS)
|
||||
$(CC) $(CFLAGS) -o $@ train.m $(LDFLAGS) -framework Accelerate
|
||||
|
||||
train_large: train_large.m $(HEADERS_LARGE)
|
||||
$(CC) $(CFLAGS) -o $@ train_large.m $(LDFLAGS) -framework Accelerate
|
||||
|
|
@@ -16,6 +23,14 @@ train_large: train_large.m $(HEADERS_LARGE)
|
|||
train_large_ane: train_large_ane.m $(HEADERS_ANE)
|
||||
$(CC) $(CFLAGS) -o $@ train_large_ane.m $(LDFLAGS) -framework Accelerate
|
||||
|
||||
HEADERS_OPT = $(HEADERS_LARGE) stories_cpu_ops_opt.h
|
||||
|
||||
train_opt: train_opt.m $(HEADERS_OPT)
|
||||
$(CC) $(CFLAGS) -o $@ train_opt.m $(LDFLAGS) -framework Accelerate -framework Metal -framework MetalPerformanceShaders
|
||||
|
||||
train_double_buffer: train_double_buffer.m $(HEADERS_LARGE)
|
||||
$(CC) $(CFLAGS) -o $@ train_double_buffer.m $(LDFLAGS) -framework Accelerate
|
||||
|
||||
PROBES = test_weight_reload test_perf_stats test_qos_sweep test_ane_advanced
|
||||
|
||||
test_rmsnorm_bwd: test_rmsnorm_bwd.m $(HEADERS_ANE)
|
||||
|
|
@@ -36,13 +51,56 @@ test_qos_sweep: test_qos_sweep.m
|
|||
test_ane_advanced: test_ane_advanced.m
|
||||
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
|
||||
|
||||
test_chaining: test_chaining.m
|
||||
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
|
||||
|
||||
test_chaining_v2: test_chaining_v2.m
|
||||
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
|
||||
|
||||
test_bench_paths: test_bench_paths.m ane_runtime.h
|
||||
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
|
||||
|
||||
test_ane_model: test_ane_model.m
|
||||
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Metal
|
||||
|
||||
test_throughput_ceiling: test_throughput_ceiling.m ane_runtime.h
|
||||
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
|
||||
|
||||
test_coreml_chaining: test_coreml_chaining.m
|
||||
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Metal
|
||||
|
||||
test_e5_validate: test_e5_validate.m
|
||||
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Metal
|
||||
|
||||
test_mil_custom: test_mil_custom.m
|
||||
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Accelerate
|
||||
|
||||
test_data_validation: test_data_validation.c data_validation.h
|
||||
$(CC_C) $(CFLAGS_C) -o $@ $<
|
||||
|
||||
probes: $(PROBES)
|
||||
|
||||
security-tests: test_data_validation
|
||||
|
||||
data: tokenize
|
||||
@bash download_data.sh
|
||||
|
||||
tokenize:
|
||||
python3 tokenize.py
|
||||
|
||||
setup: data
|
||||
@echo "=== Setup complete ==="
|
||||
@echo "Data: tinystories_data00.bin"
|
||||
@echo "To train: make train_large && ./train_large"
|
||||
@echo "Override paths: ANE_MODEL_PATH=... ANE_DATA_PATH=... ./train_large"
|
||||
|
||||
verify-flags:
|
||||
@echo "=== Active CFLAGS ==="
|
||||
@echo "$(CFLAGS)"
|
||||
@echo "=== Compiler version ==="
|
||||
@xcrun clang --version
|
||||
|
||||
clean:
|
||||
rm -f train train_large train_large_ane $(PROBES) test_rmsnorm_bwd test_classifier
|
||||
|
||||
.PHONY: clean tokenize probes
|
||||
rm -f train train_large train_large_ane train_opt train_double_buffer $(PROBES) test_rmsnorm_bwd test_classifier test_data_validation test_chaining test_chaining_v2 test_bench_paths test_ane_model test_throughput_ceiling test_coreml_chaining test_e5_validate test_mil_custom
|
||||
|
||||
.PHONY: clean tokenize probes security-tests verify-flags data setup
|
||||
|
|
|
|||
|
|
@@ -20,15 +20,33 @@ typedef struct {
 
 static Class g_ANEDesc, g_ANEInMem, g_ANEReq, g_ANEIO;
 static bool g_ane_loaded = false;
+static id g_ane_client = nil;
+static bool g_ane_ok = false;
 
 static void ane_init(void) {
     if (g_ane_loaded) return;
-    dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
+    g_ane_loaded = true;  // Set first to prevent re-entry (ref: CRIT-01)
+    void *handle = dlopen(
+        "/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine",
+        RTLD_NOW);
+    if (!handle) {
+        fprintf(stderr, "ANE: dlopen failed: %s\n", dlerror());
+        return;
+    }
     g_ANEDesc = NSClassFromString(@"_ANEInMemoryModelDescriptor");
     g_ANEInMem = NSClassFromString(@"_ANEInMemoryModel");
     g_ANEReq = NSClassFromString(@"_ANERequest");
     g_ANEIO = NSClassFromString(@"_ANEIOSurfaceObject");
-    g_ane_loaded = true;
+    if (!g_ANEDesc || !g_ANEInMem || !g_ANEReq || !g_ANEIO) {
+        fprintf(stderr, "ANE: Private classes not found (macOS version mismatch?)\n");
+        return;
+    }
+    g_ane_ok = true;
+
+    Class clientCls = NSClassFromString(@"_ANEClient");
+    if (clientCls) {
+        g_ane_client = [clientCls performSelector:@selector(sharedConnection)];
+    }
 }
 
 static IOSurfaceRef ane_create_surface(size_t bytes) {
|
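The guard logic in `ane_init` separates "initialization was attempted" from "initialization succeeded". A minimal C sketch of that two-flag pattern follows. It is an illustration, not code from this repository: `runtime_init`, `runtime_available`, `loader_fn`, and `failing_loader` are hypothetical names, and the loader callback stands in for the `dlopen` and `NSClassFromString` lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for dlopen() plus the private class lookups. */
typedef bool (*loader_fn)(void);

static bool g_attempted = false; /* plays the role of g_ane_loaded */
static bool g_ok = false;        /* plays the role of g_ane_ok */
static int  g_load_calls = 0;    /* counts loader invocations for the demo */

static void runtime_init(loader_fn load) {
    assert(load != NULL);
    if (g_attempted) return;     /* never retry, whether it worked or not */
    g_attempted = true;          /* set before loading so failure paths are not re-entered */
    g_load_calls++;
    if (!load()) {
        fprintf(stderr, "runtime: load failed\n");
        return;                  /* g_ok stays false */
    }
    g_ok = true;                 /* only reached when every dependency resolved */
}

/* Callers gate on the success flag, never on the attempted flag. */
static bool runtime_available(loader_fn load) {
    runtime_init(load);
    return g_ok;
}

/* Demo loader that always fails. */
static bool failing_loader(void) { return false; }
```

Calling `runtime_available(failing_loader)` twice returns false both times and invokes the loader once. Like the original, the sketch is not thread-safe; `pthread_once` or `dispatch_once` would be the usual choice if initialization can race.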
@@ -50,6 +68,7 @@ static ANEKernel *ane_compile(NSData *milText, NSData *weightData,
|
|||
int nInputs, size_t *inputSizes,
|
||||
int nOutputs, size_t *outputSizes) {
|
||||
ane_init();
|
||||
if (!g_ane_ok) { fprintf(stderr, "ANE: not available\n"); return NULL; } // CRIT-01/02
|
||||
NSError *e = nil;
|
||||
|
||||
NSDictionary *wdict = nil;
|
||||
|
|
@@ -63,6 +82,7 @@ static ANEKernel *ane_compile(NSData *milText, NSData *weightData,
|
|||
|
||||
id mdl = ((id(*)(Class,SEL,id))objc_msgSend)(
|
||||
g_ANEInMem, @selector(inMemoryModelWithDescriptor:), desc);
|
||||
if (!mdl) { fprintf(stderr, "ANE: inMemoryModel allocation failed\n"); return NULL; } // CRIT-02
|
||||
|
||||
// Pre-populate temp dir with MIL + weights
|
||||
id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier));
|
||||
|
|
@@ -151,6 +171,20 @@ static bool ane_eval(ANEKernel *k) {
|
|||
return ok;
|
||||
}
|
||||
|
||||
static bool ane_eval_rt(ANEKernel *k) {
|
||||
if (!g_ane_client) return ane_eval(k);
|
||||
NSError *e = nil;
|
||||
BOOL ok = ((BOOL(*)(id,SEL,id,id,id,NSError**))objc_msgSend)(
|
||||
g_ane_client, @selector(evaluateRealTimeWithModel:options:request:error:),
|
||||
k->model, @{}, k->request, &e);
|
||||
if (!ok) {
|
||||
fprintf(stderr, "ANE RT eval failed, falling back to standard: %s\n",
|
||||
e ? [[e description] UTF8String] : "unknown");
|
||||
return ane_eval(k);
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
static void ane_free(ANEKernel *k) {
|
||||
if (!k) return;
|
||||
NSError *e = nil;
|
||||
|
|
|
|||
File diff suppressed because it is too large
|
|
@@ -0,0 +1,148 @@
|
|||
// test_bench_paths.m — Benchmark ANE evaluation paths at production dimensions
|
||||
// Compares: standard, RT, processRequest, and ane_eval_rt wrapper
|
||||
#import <Foundation/Foundation.h>
|
||||
#import <objc/runtime.h>
|
||||
#import <objc/message.h>
|
||||
#import <dlfcn.h>
|
||||
#import <IOSurface/IOSurface.h>
|
||||
#import <mach/mach_time.h>
|
||||
|
||||
static mach_timebase_info_data_t g_tb;
|
||||
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
|
||||
static int g_fp16_io = 0;
|
||||
|
||||
#include "ane_runtime.h"
|
||||
|
||||
static NSString *gen_bench_conv(int ch, int sp) {
|
||||
return [NSString stringWithFormat:
|
||||
@"program(1.0)\n[buildInfo = dict<tensor<string, []>, tensor<string, []>>({{\"coremlc-version\", \"3505.4.1\"}})]\n{\n"
|
||||
" func main<ios16>(tensor<fp16, [1, %d, 1, %d]> x) {\n"
|
||||
" tensor<string, []> pt = const()[name=tensor<string, []>(\"pt\"), val=tensor<string, []>(\"valid\")];\n"
|
||||
" tensor<int32, [2]> st = const()[name=tensor<string, []>(\"st\"), val=tensor<int32, [2]>([1,1])];\n"
|
||||
" tensor<int32, [4]> pd = const()[name=tensor<string, []>(\"pd\"), val=tensor<int32, [4]>([0,0,0,0])];\n"
|
||||
" tensor<int32, [2]> dl = const()[name=tensor<string, []>(\"dl\"), val=tensor<int32, [2]>([1,1])];\n"
|
||||
" tensor<int32, []> gr = const()[name=tensor<string, []>(\"gr\"), val=tensor<int32, []>(1)];\n"
|
||||
" tensor<fp16, [%d,%d,1,1]> W = const()[name=tensor<string, []>(\"W\"), "
|
||||
"val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=tensor<string, []>(\"@model_path/weights/weight.bin\"), offset=tensor<uint64, []>(64)))];\n"
|
||||
" tensor<fp16, [1,%d,1,%d]> y = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=W,x=x)"
|
||||
"[name=tensor<string, []>(\"conv\")];\n"
|
||||
" } -> (y);\n}\n", ch, sp, ch, ch, ch, ch, ch, sp];
|
||||
}
|
||||
|
||||
int main(int argc, char **argv) {
|
||||
@autoreleasepool {
|
||||
setbuf(stdout, NULL);
|
||||
mach_timebase_info(&g_tb);
|
||||
|
||||
printf("=== ANE Eval Path Benchmark (production dimensions) ===\n\n");
|
||||
|
||||
ane_init();
|
||||
if (!g_ane_ok) { printf("FATAL: ANE not available\n"); return 1; }
|
||||
|
||||
typedef struct { int ch; int sp; const char *label; } TestConfig;
|
||||
TestConfig configs[] = {
|
||||
{64, 32, "64x32 (test)"},
|
||||
{128, 64, "128x64 (small)"},
|
||||
{256, 64, "256x64 (med)"},
|
||||
{768, 256, "768x256 (prod)"},
|
||||
{512, 64, "512x64 (large)"},
|
||||
};
|
||||
int nconfigs = sizeof(configs) / sizeof(configs[0]);
|
||||
int WARMUP = 20, ITERS = 200;
|
||||
|
||||
id client = g_ane_client;
|
||||
printf(" Client: %s | Warmup: %d | Iters: %d\n\n", client ? "OK" : "NO", WARMUP, ITERS);
|
||||
printf("%-18s %10s %14s %14s %14s\n", "Config", "Standard", "RT", "ProcReq", "ane_eval_rt");
|
||||
printf("%-18s %10s %14s %14s %14s\n", "------", "--------", "--", "-------", "-----------");
|
||||
|
||||
for (int ci = 0; ci < nconfigs; ci++) {
|
||||
int CH = configs[ci].ch, SP = configs[ci].sp;
|
||||
|
||||
_Float16 *w = (_Float16*)calloc(CH*CH, sizeof(_Float16));
|
||||
for (int i = 0; i < CH; i++) w[i*CH+i] = (_Float16)0.5f;
|
||||
int ws = CH*CH*2, tot = 128+ws;
|
||||
uint8_t *blob = (uint8_t*)calloc(tot, 1);
|
||||
blob[0]=1; blob[4]=2; blob[64]=0xEF; blob[65]=0xBE; blob[66]=0xAD; blob[67]=0xDE; blob[68]=1;
|
||||
*(uint32_t*)(blob+72)=ws; *(uint32_t*)(blob+80)=128;
|
||||
memcpy(blob+128, w, ws);
|
||||
NSData *wdata = [NSData dataWithBytesNoCopy:blob length:tot freeWhenDone:YES];
|
||||
free(w);
|
||||
|
||||
g_fp16_io = 1;
|
||||
NSString *mil = gen_bench_conv(CH, SP);
|
||||
NSData *milData = [mil dataUsingEncoding:NSUTF8StringEncoding];
|
||||
size_t ioBytes = CH * SP * 2;
|
||||
ANEKernel *k = ane_compile(milData, wdata, 1, &ioBytes, 1, &ioBytes);
|
||||
if (!k) { printf("%-18s (compile failed)\n", configs[ci].label); continue; }
|
||||
|
||||
IOSurfaceLock(k->ioInputs[0], 0, NULL);
|
||||
_Float16 *inp = (_Float16*)IOSurfaceGetBaseAddress(k->ioInputs[0]);
|
||||
for (int i = 0; i < CH*SP; i++) inp[i] = (_Float16)1.0f;
|
||||
IOSurfaceUnlock(k->ioInputs[0], 0, NULL);
|
||||
|
||||
NSError *e = nil;
|
||||
|
||||
for (int i = 0; i < WARMUP; i++) ane_eval(k);
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
for (int i = 0; i < ITERS; i++) ane_eval(k);
|
||||
double std_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
|
||||
|
||||
double rt_ms = -1;
|
||||
if (client) {
|
||||
@try {
|
||||
for (int i = 0; i < WARMUP; i++)
|
||||
((BOOL(*)(id,SEL,id,id,id,NSError**))objc_msgSend)(
|
||||
client, @selector(evaluateRealTimeWithModel:options:request:error:),
|
||||
k->model, @{}, k->request, &e);
|
||||
t0 = mach_absolute_time();
|
||||
for (int i = 0; i < ITERS; i++)
|
||||
((BOOL(*)(id,SEL,id,id,id,NSError**))objc_msgSend)(
|
||||
client, @selector(evaluateRealTimeWithModel:options:request:error:),
|
||||
k->model, @{}, k->request, &e);
|
||||
rt_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
|
||||
} @catch (NSException *ex) { rt_ms = -1; }
|
||||
}
|
||||
|
||||
double proc_ms = -1;
|
||||
@try {
|
||||
id prog = [k->model valueForKey:@"program"];
|
||||
id hexId = [k->model valueForKey:@"hexStringIdentifier"];
|
||||
SEL procSel = @selector(processRequest:model:qos:qIndex:modelStringID:options:returnValue:error:);
|
||||
if (prog && [prog respondsToSelector:procSel]) {
|
||||
for (int i = 0; i < WARMUP; i++) {
|
||||
BOOL rv = NO;
|
||||
((BOOL(*)(id,SEL,id,id,unsigned int,int,id,id,BOOL*,NSError**))objc_msgSend)(
|
||||
prog, procSel, k->request, k->model, 21, 0, hexId, @{}, &rv, &e);
|
||||
}
|
||||
t0 = mach_absolute_time();
|
||||
for (int i = 0; i < ITERS; i++) {
|
||||
BOOL rv = NO;
|
||||
((BOOL(*)(id,SEL,id,id,unsigned int,int,id,id,BOOL*,NSError**))objc_msgSend)(
|
||||
prog, procSel, k->request, k->model, 21, 0, hexId, @{}, &rv, &e);
|
||||
}
|
||||
proc_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
|
||||
}
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
|
||||
double wrap_ms = -1;
|
||||
@try {
|
||||
for (int i = 0; i < WARMUP; i++) ane_eval_rt(k);
|
||||
t0 = mach_absolute_time();
|
||||
for (int i = 0; i < ITERS; i++) ane_eval_rt(k);
|
||||
wrap_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
|
||||
} @catch (NSException *ex) { wrap_ms = -1; }
|
||||
|
||||
char s[32], r[32], p[32], w2[32];
|
||||
snprintf(s, 32, "%.3f ms", std_ms);
|
||||
snprintf(r, 32, rt_ms >= 0 ? "%.3f (%.1fx)" : "N/A", rt_ms, std_ms/rt_ms);
|
||||
snprintf(p, 32, proc_ms >= 0 ? "%.3f (%.1fx)" : "N/A", proc_ms, std_ms/proc_ms);
|
||||
snprintf(w2, 32, wrap_ms >= 0 ? "%.3f (%.1fx)" : "N/A", wrap_ms, std_ms/wrap_ms);
|
||||
printf("%-18s %10s %14s %14s %14s\n", configs[ci].label, s, r, p, w2);
|
||||
|
||||
ane_free(k);
|
||||
}
|
||||
|
||||
printf("\n=== Benchmark complete ===\n");
|
||||
}
|
||||
return 0;
|
||||
}
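The benchmark above builds its weight blob with inline byte writes at fixed offsets. The sketch below names those offsets in plain C. The constant and function names (`BLOB_HEADER_BYTES`, `build_weight_blob`, and so on) are my own, and the layout is only what the inline writes show: the meaning of each header field is not documented in the source, so the comments describe values rather than semantics. FP16 weights are passed as raw `uint16_t` bit patterns to keep the sketch portable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Offsets copied from the inline writes in the benchmark; names are illustrative. */
enum {
    BLOB_HEADER_BYTES   = 128, /* payload starts here (value written at offset 80) */
    BLOB_CHUNK_OFFSET   = 64,  /* same value as offset=64 in the MIL BLOBFILE reference */
    BLOB_SIZE_OFFSET    = 72,  /* payload size in bytes */
    BLOB_DATAOFF_OFFSET = 80   /* payload offset from the start of the blob */
};

/* Byte-wise little-endian store: avoids unaligned or aliasing pointer casts. */
static void put_u32_le(uint8_t *dst, uint32_t v) {
    dst[0] = (uint8_t)(v & 0xFFu);
    dst[1] = (uint8_t)((v >> 8) & 0xFFu);
    dst[2] = (uint8_t)((v >> 16) & 0xFFu);
    dst[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Returns a calloc'd blob (caller frees) or NULL; *out_len receives total bytes. */
static uint8_t *build_weight_blob(const uint16_t *fp16_bits, size_t count,
                                  size_t *out_len) {
    assert(fp16_bits != NULL && out_len != NULL);
    if (count > UINT32_MAX / sizeof(uint16_t)) return NULL; /* size field is 32-bit */
    size_t payload = count * sizeof(uint16_t);
    size_t total = (size_t)BLOB_HEADER_BYTES + payload;
    uint8_t *blob = calloc(total, 1);
    if (!blob) return NULL;

    blob[0] = 1;                                        /* same bytes as blob[0]=1 */
    blob[4] = 2;                                        /* same bytes as blob[4]=2 */
    put_u32_le(blob + BLOB_CHUNK_OFFSET, 0xDEADBEEFu);  /* EF BE AD DE */
    blob[BLOB_CHUNK_OFFSET + 4] = 1;
    put_u32_le(blob + BLOB_SIZE_OFFSET, (uint32_t)payload);
    put_u32_le(blob + BLOB_DATAOFF_OFFSET, (uint32_t)BLOB_HEADER_BYTES);
    memcpy(blob + BLOB_HEADER_BYTES, fp16_bits, payload);

    *out_len = total;
    return blob;
}
```

On a little-endian host this produces the same bytes as the original `*(uint32_t*)` stores, and it fails cleanly on allocation errors or on payloads too large for the 32-bit size field.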
|
||||
File diff suppressed because it is too large
File diff suppressed because it is too large
|
|
@@ -0,0 +1,817 @@
|
|||
// test_e5_validate.m — Experiments W1-W5: E5 Runtime Validation & Deep API Exploration
|
||||
// Build: make test_e5_validate && ./test_e5_validate
|
||||
#import <Foundation/Foundation.h>
|
||||
#import <CoreML/CoreML.h>
|
||||
#import <objc/runtime.h>
|
||||
#import <objc/message.h>
|
||||
#import <dlfcn.h>
|
||||
#import <mach/mach_time.h>
|
||||
#import <IOSurface/IOSurface.h>
|
||||
|
||||
static mach_timebase_info_data_t g_tb;
|
||||
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
|
||||
|
||||
#pragma mark - Helpers
|
||||
|
||||
static void dump_all_methods(Class cls, const char *label) {
|
||||
if (!cls) { printf(" %s: NOT FOUND\n", label); return; }
|
||||
printf("\n--- %s ---\n", label);
|
||||
|
||||
unsigned int mc;
|
||||
Method *cm = class_copyMethodList(object_getClass(cls), &mc);
|
||||
if (mc > 0) {
|
||||
printf(" Class methods (%u):\n", mc);
|
||||
for (unsigned int i = 0; i < mc; i++) {
|
||||
const char *sel = sel_getName(method_getName(cm[i]));
|
||||
const char *enc = method_getTypeEncoding(cm[i]);
|
||||
printf(" + %s [%s]\n", sel, enc ? enc : "?");
|
||||
}
|
||||
}
|
||||
free(cm);
|
||||
|
||||
Method *im = class_copyMethodList(cls, &mc);
|
||||
if (mc > 0) {
|
||||
printf(" Instance methods (%u):\n", mc);
|
||||
for (unsigned int i = 0; i < mc; i++) {
|
||||
const char *sel = sel_getName(method_getName(im[i]));
|
||||
const char *enc = method_getTypeEncoding(im[i]);
|
||||
printf(" - %s [%s]\n", sel, enc ? enc : "?");
|
||||
}
|
||||
}
|
||||
free(im);
|
||||
|
||||
unsigned int pc;
|
||||
objc_property_t *props = class_copyPropertyList(cls, &pc);
|
||||
if (pc > 0) {
|
||||
printf(" Properties (%u):\n", pc);
|
||||
for (unsigned int i = 0; i < pc; i++)
|
||||
printf(" %s [%s]\n", property_getName(props[i]),
|
||||
property_getAttributes(props[i]));
|
||||
}
|
||||
free(props);
|
||||
|
||||
unsigned int ic;
|
||||
Ivar *ivars = class_copyIvarList(cls, &ic);
|
||||
if (ic > 0) {
|
||||
printf(" Ivars (%u):\n", ic);
|
||||
for (unsigned int i = 0; i < ic; i++) {
|
||||
const char *n = ivar_getName(ivars[i]);
|
||||
const char *t = ivar_getTypeEncoding(ivars[i]);
|
||||
printf(" %s type=%s\n", n, t ? t : "?");
|
||||
}
|
||||
}
|
||||
free(ivars);
|
||||
|
||||
Class super = class_getSuperclass(cls);
|
||||
if (super && super != [NSObject class])
|
||||
printf(" Superclass: %s\n", class_getName(super));
|
||||
}
|
||||
|
||||
static float max_abs_diff(float *a, float *b, int n) {
|
||||
float m = 0;
|
||||
for (int i = 0; i < n; i++) {
|
||||
float d = fabsf(a[i] - b[i]);
|
||||
if (d > m) m = d;
|
||||
}
|
||||
return m;
|
||||
}
|
||||
|
||||
static float mean_abs(float *a, int n) {
|
||||
float s = 0;
|
||||
for (int i = 0; i < n; i++) s += fabsf(a[i]);
|
||||
return s / n;
|
||||
}
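The W1 validation later in this file feeds `max_abs_diff` into a three-way verdict: below `1e-3` passes, below `1e-1` warns (FP16 rounding), anything else fails. A plain C sketch of that classification follows. `ValidationResult`, `classify_outputs`, and `max_abs_diff_checked` are hypothetical names. One behavior is added that the original does not have: because `d > m` is false for NaN, the helpers above silently skip NaN elements, so this sketch treats any NaN difference as a failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { VALID_PASS, VALID_WARN, VALID_FAIL } ValidationResult;

/* Max |a[i]-b[i]|; sets *saw_nan when any difference is NaN (added behavior). */
static float max_abs_diff_checked(const float *a, const float *b, size_t n,
                                  bool *saw_nan) {
    float m = 0.0f;
    *saw_nan = false;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d != d) { *saw_nan = true; continue; } /* NaN never equals itself */
        if (d < 0.0f) d = -d;
        if (d > m) m = d;
    }
    return m;
}

/* Thresholds are the ones used by the W1 check: 1e-3 pass, 1e-1 warn. */
static ValidationResult classify_outputs(const float *ref, const float *got,
                                         size_t n) {
    assert(ref != NULL && got != NULL);
    bool saw_nan = false;
    float mad = max_abs_diff_checked(ref, got, n, &saw_nan);
    if (saw_nan) return VALID_FAIL;
    if (mad < 1e-3f) return VALID_PASS;
    if (mad < 1e-1f) return VALID_WARN;
    return VALID_FAIL;
}
```

Returning an enum instead of printing keeps the verdict usable as a process exit status, which would let a `make` target fail on divergence.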
|
||||
|
||||
#pragma mark - Main
|
||||
|
||||
int main(int argc, const char *argv[]) {
|
||||
(void)argc; (void)argv;
|
||||
@autoreleasepool {
|
||||
mach_timebase_info(&g_tb);
|
||||
printf("================================================================\n");
|
||||
printf(" E5 Runtime: Validation & Exhaustive API Documentation\n");
|
||||
printf("================================================================\n\n");
|
||||
|
||||
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/"
|
||||
"AppleNeuralEngine", RTLD_NOW);
|
||||
|
||||
// ============================================================
|
||||
// W2: Exhaustive API Documentation (dump first so we have it)
|
||||
// ============================================================
|
||||
printf("================================================================\n");
|
||||
printf(" W2: Exhaustive E5 Runtime API Documentation\n");
|
||||
printf("================================================================\n");
|
||||
|
||||
const char *classNames[] = {
|
||||
"MLE5Engine",
|
||||
"MLE5ProgramLibrary",
|
||||
"MLE5ProgramLibraryOnDeviceAOTCompilationImpl",
|
||||
"MLE5ProgramLibraryE5BundleImpl",
|
||||
"MLE5ExecutionStreamOperation",
|
||||
"MLE5ExecutionStream",
|
||||
"MLE5ExecutionStreamPool",
|
||||
"MLE5StaticShapeExecutionStreamOperationPool",
|
||||
"MLE5RangeShapeExecutionStreamOperationPool",
|
||||
"MLE5EnumeratedShapeExecutionStreamOperationPool",
|
||||
"MLE5ExecutionStreamOperationPoolFactory",
|
||||
"MLE5InputPort",
|
||||
"MLE5OutputPort",
|
||||
"MLE5InputPortBinder",
|
||||
"MLE5OutputPortBinder",
|
||||
"MLProgramE5Container",
|
||||
NULL
|
||||
};
|
||||
for (int i = 0; classNames[i]; i++) {
|
||||
Class cls = NSClassFromString(
|
||||
[NSString stringWithUTF8String:classNames[i]]);
|
||||
dump_all_methods(cls, classNames[i]);
|
||||
}
|
||||
|
||||
printf("\n--- e5rt_* C API Symbols ---\n");
|
||||
const char *cFuncs[] = {
|
||||
"e5rt_program_library_create",
|
||||
"e5rt_program_library_destroy",
|
||||
"e5rt_program_library_compile",
|
||||
"e5rt_program_library_get_function",
|
||||
"e5rt_program_library_load_function",
|
||||
"e5rt_execution_stream_create",
|
||||
"e5rt_execution_stream_destroy",
|
||||
"e5rt_execution_stream_submit",
|
||||
"e5rt_execution_stream_wait",
|
||||
"e5rt_execution_stream_execute",
|
||||
"e5rt_execution_stream_sync",
|
||||
"e5rt_execution_stream_operation_create",
|
||||
"e5rt_execution_stream_operation_destroy",
|
||||
"e5rt_execution_stream_operation_set_input",
|
||||
"e5rt_execution_stream_operation_set_output",
|
||||
"e5rt_execution_stream_operation_execute",
|
||||
"e5rt_async_event_create",
|
||||
"e5rt_async_event_destroy",
|
||||
"e5rt_async_event_signal",
|
||||
"e5rt_async_event_wait",
|
||||
"e5rt_buffer_create",
|
||||
"e5rt_buffer_destroy",
|
||||
"e5rt_io_port_create",
|
||||
"e5rt_io_port_bind",
|
||||
"e5rt_context_create",
|
||||
"e5rt_init",
|
||||
"e5rt_get_version",
|
||||
NULL
|
||||
};
|
||||
for (int i = 0; cFuncs[i]; i++) {
|
||||
void *sym = dlsym(RTLD_DEFAULT, cFuncs[i]);
|
||||
if (sym) printf(" FOUND: %s at %p\n", cFuncs[i], sym);
|
||||
}
|
||||
fflush(stdout);
|
||||
|
||||
// ============================================================
|
||||
// W1: Output Validation
|
||||
// ============================================================
|
||||
printf("\n================================================================\n");
|
||||
printf(" W1: Output Correctness Validation\n");
|
||||
printf("================================================================\n\n");
|
||||
|
||||
int ch = 256, sp = 64;
|
||||
NSString *pkgPath = [NSString stringWithFormat:
|
||||
@"/tmp/ane_sram_%dch_%dsp.mlpackage", ch, sp];
|
||||
if (![[NSFileManager defaultManager] fileExistsAtPath:pkgPath]) {
|
||||
printf(" FATAL: %s not found. Run gen_mlpackages.py\n",
|
||||
[pkgPath UTF8String]);
|
||||
return 1;
|
||||
}
|
||||
|
||||
NSError *err = nil;
|
||||
MLModelConfiguration *cfg = [[MLModelConfiguration alloc] init];
|
||||
cfg.computeUnits = MLComputeUnitsAll;
|
||||
MLPredictionOptions *predOpts = [[MLPredictionOptions alloc] init];
|
||||
Class opCls = NSClassFromString(@"MLE5ExecutionStreamOperation");
|
||||
|
||||
NSURL *compiled = [MLModel compileModelAtURL:
|
||||
[NSURL fileURLWithPath:pkgPath] error:&err];
|
||||
if (!compiled) { printf(" Compile FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); return 1; }
|
||||
err = nil;
|
||||
MLModel *model = [MLModel modelWithContentsOfURL:compiled
|
||||
configuration:cfg error:&err];
|
||||
if (!model) { printf(" Load FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); return 1; }
|
||||
|
||||
int nElems = 1 * ch * 1 * sp;
|
||||
MLMultiArray *inputArr = [[MLMultiArray alloc]
|
||||
initWithShape:@[@1, @(ch), @1, @(sp)]
|
||||
dataType:MLMultiArrayDataTypeFloat32 error:nil];
|
||||
|
||||
float *inPtr = (float *)[inputArr dataPointer];
|
||||
for (int i = 0; i < nElems; i++)
|
||||
inPtr[i] = sinf((float)i * 0.01f) * 0.5f;
|
||||
|
||||
NSString *inName = [[[[model modelDescription] inputDescriptionsByName]
|
||||
allKeys] firstObject];
|
||||
NSString *outName = [[[[model modelDescription] outputDescriptionsByName]
|
||||
allKeys] firstObject];
|
||||
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
|
||||
initWithDictionary:@{inName: inputArr} error:nil];
|
||||
|
||||
printf(" Input: %s [1,%d,1,%d], first 5: [%.4f %.4f %.4f %.4f %.4f]\n",
|
||||
[inName UTF8String], ch, sp,
|
||||
inPtr[0], inPtr[1], inPtr[2], inPtr[3], inPtr[4]);
|
||||
printf(" Output: %s\n", [outName UTF8String]);
|
||||
fflush(stdout);
|
||||
|
||||
// --- Reference: CoreML sequential prediction ---
|
||||
printf("\n --- W1.1: CoreML reference prediction ---\n");
|
||||
err = nil;
|
||||
id<MLFeatureProvider> refResult = [model predictionFromFeatures:fp error:&err];
|
||||
if (!refResult) { printf(" Prediction FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); return 1; }
|
||||
|
||||
MLMultiArray *refOut = [refResult featureValueForName:outName].multiArrayValue;
|
||||
float *refPtr = (float *)[refOut dataPointer];
|
||||
int outElems = 1;
|
||||
for (int d = 0; d < (int)refOut.shape.count; d++)
|
||||
outElems *= [refOut.shape[d] intValue];
|
||||
printf(" Output shape: [");
|
||||
for (int d = 0; d < (int)refOut.shape.count; d++)
|
||||
printf("%s%d", d ? "," : "", [refOut.shape[d] intValue]);
|
||||
printf("] (%d elements)\n", outElems);
|
||||
printf(" First 5 ref: [%.6f %.6f %.6f %.6f %.6f]\n",
|
||||
refPtr[0], refPtr[1], refPtr[2], refPtr[3], refPtr[4]);
|
||||
printf(" Mean |ref|: %.6f\n", mean_abs(refPtr, outElems));
|
||||
fflush(stdout);
|
||||
|
||||
// --- E5 stream prediction ---
|
||||
printf("\n --- W1.2: E5 stream prediction ---\n");
|
||||
|
||||
id e5engine = nil;
|
||||
@try { e5engine = [model valueForKey:@"_internalEngine"]; }
|
||||
@catch (NSException *e) { (void)e; }
|
||||
id progLib = nil;
|
||||
@try { progLib = [e5engine valueForKey:@"programLibrary"]; }
|
||||
@catch (NSException *e) { (void)e; }
|
||||
id streamPool = nil;
|
||||
@try { streamPool = [e5engine valueForKey:@"streamPool"]; }
|
||||
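@catch (NSException *e) { (void)e; }
// Private API surface: stop early instead of messaging nil objects below.
if (!opCls || !progLib || !streamPool) { printf(" FATAL: E5 engine internals unavailable\n"); return 1; }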
|
||||
|
||||
id op = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))objc_msgSend)(
|
||||
[opCls alloc],
|
||||
@selector(initWithProgramLibrary:functionName:modelDescription:
|
||||
configuration:debugLabel:modelSignpostId:),
|
||||
progLib, @"main", [model modelDescription], cfg,
|
||||
@"validate_op", (unsigned long long)0);
|
||||
|
||||
NSError *plErr = nil;
|
||||
BOOL plOk = ((BOOL(*)(id,SEL,NSError**))objc_msgSend)(
|
||||
op, @selector(preloadAndReturnError:), &plErr);
|
||||
printf(" preload: %s\n", plOk ? "YES" : "NO");
|
||||
if (plErr) printf(" Error: %s\n", [[plErr description] UTF8String]);
|
||||
fflush(stdout);
|
||||
|
||||
id stream = [streamPool performSelector:@selector(takeOut)];
|
||||
Ivar shIvar = class_getInstanceVariable([stream class], "_streamHandle");
|
||||
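// Read the raw pointer at the ivar offset; object_getIvar assumes an object-typed ivar.
void *sh = shIvar ? *(void **)((char *)(__bridge void *)stream + ivar_getOffset(shIvar)) : NULL;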
|
||||
printf(" stream: %p, handle: %p\n", (__bridge void *)stream, sh);
|
||||
|
||||
[stream setValue:@[op] forKey:@"operations"];
|
||||
|
||||
NSError *prepErr = nil;
|
||||
BOOL prepOk = ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
|
||||
op, @selector(prepareForInputFeatures:options:error:),
|
||||
fp, predOpts, &prepErr);
|
||||
printf(" prepare: %s\n", prepOk ? "YES" : "NO");
|
||||
if (prepErr) printf(" Error: %s\n", [[prepErr description] UTF8String]);
|
||||
fflush(stdout);
|
||||
|
||||
NSError *execErr = nil;
|
||||
BOOL execOk = ((BOOL(*)(id,SEL,void*,NSError**))objc_msgSend)(
|
||||
stream, @selector(_executeStream:error:), sh, &execErr);
|
||||
printf(" execute: %s\n", execOk ? "YES" : "NO");
|
||||
if (execErr) printf(" Error: %s\n", [[execErr description] UTF8String]);
|
||||
fflush(stdout);
|
||||
|
||||
// Read output from the operation
|
||||
printf("\n --- W1.3: Read E5 output features ---\n");
|
||||
fflush(stdout);
|
||||
id e5Result = nil;
|
||||
@try {
|
||||
e5Result = [op valueForKey:@"outputFeatures"];
|
||||
printf(" outputFeatures: %s\n",
|
||||
e5Result ? [NSStringFromClass([e5Result class]) UTF8String]
|
||||
: "nil");
|
||||
} @catch (NSException *ex) {
|
||||
printf(" outputFeatures EXCEPTION: %s\n",
|
||||
[[ex reason] UTF8String]);
|
||||
}
|
||||
|
||||
if (e5Result && [e5Result conformsToProtocol:@protocol(MLFeatureProvider)]) {
|
||||
MLMultiArray *e5Out = [(id<MLFeatureProvider>)e5Result
|
||||
featureValueForName:outName].multiArrayValue;
|
||||
if (e5Out) {
|
||||
float *e5Ptr = (float *)[e5Out dataPointer];
|
||||
printf(" E5 first 5: [%.6f %.6f %.6f %.6f %.6f]\n",
|
||||
e5Ptr[0], e5Ptr[1], e5Ptr[2], e5Ptr[3], e5Ptr[4]);
|
||||
printf(" Mean |e5|: %.6f\n", mean_abs(e5Ptr, outElems));
|
||||
|
||||
float mad = max_abs_diff(refPtr, e5Ptr, outElems);
|
||||
printf(" Max abs diff: %.8f\n", mad);
|
||||
printf(" Relative error: %.2e\n",
|
||||
mad / (mean_abs(refPtr, outElems) + 1e-10f));
|
||||
|
||||
if (mad < 1e-3f) {
|
||||
printf(" *** VALIDATION PASSED: outputs match ***\n");
|
||||
} else if (mad < 1e-1f) {
|
||||
printf(" VALIDATION WARNING: small differences (FP16 expected)\n");
|
||||
} else {
|
||||
printf(" VALIDATION FAILED: outputs diverge!\n");
|
||||
}
|
||||
} else {
|
||||
printf(" E5 output array is nil for key '%s'\n",
|
||||
[outName UTF8String]);
|
||||
|
||||
NSArray *ofNames = [(id<MLFeatureProvider>)e5Result
|
||||
featureNames].allObjects;
|
||||
printf(" Available features: %s\n",
|
||||
[[ofNames description] UTF8String]);
|
||||
}
|
||||
} else {
|
||||
printf(" Cannot read output features\n");
|
||||
}
|
||||
|
||||
// Also read output via outputPorts
|
||||
printf("\n --- W1.4: Read via output ports ---\n");
|
||||
fflush(stdout);
|
||||
@try {
|
||||
id outPorts = [op valueForKey:@"outputPorts"];
|
||||
printf(" outputPorts: %s (count=%lu)\n",
|
||||
outPorts ? [NSStringFromClass([outPorts class]) UTF8String]
|
||||
: "nil",
|
||||
outPorts ? (unsigned long)[(NSArray *)outPorts count] : 0);
|
||||
|
||||
if (outPorts && [(NSArray *)outPorts count] > 0) {
|
||||
for (NSUInteger pi = 0; pi < [(NSArray *)outPorts count]; pi++) {
|
||||
id port = [(NSArray *)outPorts objectAtIndex:pi];
|
||||
printf(" Port[%lu]: %s\n", (unsigned long)pi,
|
||||
[[port description] UTF8String]);
|
||||
@try {
|
||||
id portName = [port valueForKey:@"name"];
|
||||
printf(" name: %s\n",
|
||||
portName ? [(NSString *)portName UTF8String] : "nil");
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
@try {
|
||||
id portFD = [port valueForKey:@"featureDescription"];
|
||||
printf(" featureDescription: %s\n",
|
||||
portFD ? [[portFD description] UTF8String] : "nil");
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
@try {
|
||||
id binder = [port valueForKey:@"binder"];
|
||||
printf(" binder: %s\n",
|
||||
binder ? [NSStringFromClass([binder class])
|
||||
UTF8String] : "nil");
|
||||
if (binder) {
|
||||
@try {
|
||||
id fv = [binder valueForKey:@"featureValue"];
|
||||
printf(" featureValue: %s\n",
|
||||
fv ? [NSStringFromClass([fv class])
|
||||
UTF8String] : "nil");
|
||||
if (fv) {
|
||||
MLMultiArray *ma = [(MLFeatureValue *)fv
|
||||
multiArrayValue];
|
||||
if (ma) {
|
||||
float *ptr = (float *)[ma dataPointer];
|
||||
printf(" first 5: [%.6f %.6f %.6f"
|
||||
" %.6f %.6f]\n",
|
||||
ptr[0], ptr[1], ptr[2],
|
||||
ptr[3], ptr[4]);
|
||||
float mad2 = max_abs_diff(refPtr, ptr,
|
||||
outElems);
|
||||
printf(" Max abs diff vs ref: %.8f\n",
|
||||
mad2);
|
||||
}
|
||||
}
|
||||
} @catch (NSException *ex) {
|
||||
printf(" featureValue EXCEPTION: %s\n",
|
||||
[[ex reason] UTF8String]);
|
||||
}
|
||||
}
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
}
|
||||
}
|
||||
} @catch (NSException *ex) {
|
||||
printf(" outputPorts EXCEPTION: %s\n", [[ex reason] UTF8String]);
|
||||
}
|
||||
|
||||
// Also read input ports
|
||||
printf("\n --- W1.5: Inspect input ports ---\n");
|
||||
fflush(stdout);
|
||||
@try {
|
||||
id inPorts = [op valueForKey:@"inputPorts"];
|
||||
printf(" inputPorts: %s (count=%lu)\n",
|
||||
inPorts ? [NSStringFromClass([inPorts class]) UTF8String]
|
||||
: "nil",
|
||||
inPorts ? (unsigned long)[(NSArray *)inPorts count] : 0);
|
||||
if (inPorts) {
|
||||
for (NSUInteger pi = 0; pi < [(NSArray *)inPorts count]; pi++) {
|
||||
id port = [(NSArray *)inPorts objectAtIndex:pi];
|
||||
printf(" Port[%lu]: %s\n", (unsigned long)pi,
|
||||
[[port description] UTF8String]);
|
||||
@try {
|
||||
printf(" name: %s\n",
|
||||
[[(id)[port valueForKey:@"name"] description]
|
||||
UTF8String]);
|
||||
printf(" portHandle: %p\n",
|
||||
(__bridge void *)[port valueForKey:@"portHandle"]);
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
@try {
|
||||
id binder = [port valueForKey:@"binder"];
|
||||
if (binder) {
|
||||
printf(" binder: %s\n",
|
||||
[NSStringFromClass([binder class]) UTF8String]);
|
||||
printf(" bindingMode: %d\n",
|
||||
((char(*)(id,SEL))objc_msgSend)(
|
||||
binder, @selector(bindingMode)));
|
||||
id dfv = nil;
|
||||
@try {
|
||||
dfv = [binder valueForKey:@"directlyBoundFeatureValue"];
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
printf(" directlyBound: %s\n",
|
||||
dfv ? "YES" : "NO");
|
||||
}
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
}
|
||||
}
|
||||
} @catch (NSException *ex) {
|
||||
printf(" inputPorts EXCEPTION: %s\n", [[ex reason] UTF8String]);
|
||||
}
|
||||
|
||||
// Return stream
|
||||
[stream setValue:@[op] forKey:@"operations"];
|
||||
((void(*)(id,SEL,id))objc_msgSend)(
|
||||
streamPool, @selector(putBack:), stream);
|
||||
|
||||
// ============================================================
|
||||
// W1.6: Multi-op output validation
|
||||
// ============================================================
|
||||
printf("\n --- W1.6: Multi-op output validation ---\n");
|
||||
fflush(stdout);
|
||||
|
||||
{
|
||||
NSString *pkg2Path = @"/tmp/ane_sram_512ch_64sp.mlpackage";
|
||||
err = nil;
|
||||
NSURL *c2 = [MLModel compileModelAtURL:
|
||||
[NSURL fileURLWithPath:pkg2Path] error:&err];
|
||||
if (!c2) { printf(" Compile2 FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); goto skip_multiop; }
|
||||
err = nil;
|
||||
MLModel *model2 = [MLModel modelWithContentsOfURL:c2
|
||||
configuration:cfg error:&err];
|
||||
if (!model2) { printf(" Load2 FAILED: %s\n", err ? [[err description] UTF8String] : "unknown"); goto skip_multiop; }
|
||||
int ch2 = 512;
|
||||
int nElems2 = 1 * ch2 * 1 * sp;
|
||||
MLMultiArray *inputArr2 = [[MLMultiArray alloc]
|
||||
initWithShape:@[@1, @(ch2), @1, @(sp)]
|
||||
dataType:MLMultiArrayDataTypeFloat32 error:nil];
|
||||
float *in2Ptr = (float *)[inputArr2 dataPointer];
|
||||
for (int i = 0; i < nElems2; i++)
|
||||
in2Ptr[i] = cosf((float)i * 0.02f) * 0.3f;
|
||||
|
||||
NSString *in2Name = [[[[model2 modelDescription] inputDescriptionsByName]
|
||||
allKeys] firstObject];
|
||||
NSString *out2Name = [[[[model2 modelDescription] outputDescriptionsByName]
|
||||
allKeys] firstObject];
|
||||
MLDictionaryFeatureProvider *fp2 = [[MLDictionaryFeatureProvider alloc]
|
||||
initWithDictionary:@{in2Name: inputArr2} error:nil];
|
||||
|
||||
// Reference predictions
|
||||
err = nil;
|
||||
id<MLFeatureProvider> ref1 = [model predictionFromFeatures:fp error:&err];
|
||||
err = nil;
|
||||
id<MLFeatureProvider> ref2 = [model2 predictionFromFeatures:fp2 error:&err];
|
||||
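// Bail out before dereferencing outputs if either reference prediction failed
if (!ref1 || !ref2) { printf(" Reference prediction FAILED\n"); goto skip_multiop; }
float *ref1Ptr = (float *)[[ref1 featureValueForName:outName].multiArrayValue dataPointer];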
|
||||
float *ref2Ptr = (float *)[[ref2 featureValueForName:out2Name].multiArrayValue dataPointer];
|
||||
|
||||
// E5 multi-op stream
|
||||
id e5_2 = nil;
|
||||
@try { e5_2 = [model2 valueForKey:@"_internalEngine"]; }
|
||||
@catch (NSException *e) { (void)e; }
|
||||
id pLib2 = nil;
|
||||
@try { pLib2 = [e5_2 valueForKey:@"programLibrary"]; }
|
||||
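@catch (NSException *e) { (void)e; }
// Without a program library for model2 the second operation cannot be built
if (!pLib2) { printf(" programLibrary unavailable for model2\n"); goto skip_multiop; }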
|
||||
|
||||
id op1 = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))objc_msgSend)(
|
||||
[opCls alloc],
|
||||
@selector(initWithProgramLibrary:functionName:modelDescription:
|
||||
configuration:debugLabel:modelSignpostId:),
|
||||
progLib, @"main", [model modelDescription], cfg,
|
||||
@"val_op1", (unsigned long long)0);
|
||||
id op2 = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))objc_msgSend)(
|
||||
[opCls alloc],
|
||||
@selector(initWithProgramLibrary:functionName:modelDescription:
|
||||
configuration:debugLabel:modelSignpostId:),
|
||||
pLib2, @"main", [model2 modelDescription], cfg,
|
||||
@"val_op2", (unsigned long long)0);
|
||||
|
||||
((BOOL(*)(id,SEL,NSError**))objc_msgSend)(op1, @selector(preloadAndReturnError:), nil);
|
||||
((BOOL(*)(id,SEL,NSError**))objc_msgSend)(op2, @selector(preloadAndReturnError:), nil);
|
||||
|
||||
id stream2 = [streamPool performSelector:@selector(takeOut)];
|
||||
Ivar shIvar2 = class_getInstanceVariable([stream2 class], "_streamHandle");
|
||||
void *sh2 = (__bridge void *)object_getIvar(stream2, shIvar2);
|
||||
|
||||
[stream2 setValue:@[op1, op2] forKey:@"operations"];
|
||||
|
||||
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
|
||||
op1, @selector(prepareForInputFeatures:options:error:),
|
||||
fp, predOpts, nil);
|
||||
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
|
||||
op2, @selector(prepareForInputFeatures:options:error:),
|
||||
fp2, predOpts, nil);
|
||||
|
||||
NSError *mErr = nil;
|
||||
BOOL mOk = ((BOOL(*)(id,SEL,void*,NSError**))objc_msgSend)(
|
||||
stream2, @selector(_executeStream:error:), sh2, &mErr);
|
||||
printf(" Multi-op execute: %s\n", mOk ? "YES" : "NO");
|
||||
if (mErr) printf(" Error: %s\n", [[mErr description] UTF8String]);
|
||||
fflush(stdout);
|
||||
|
||||
if (mOk) {
|
||||
// Read outputs
|
||||
@try {
|
||||
id out1 = [op1 valueForKey:@"outputFeatures"];
|
||||
id out2 = [op2 valueForKey:@"outputFeatures"];
|
||||
|
||||
if (out1 && out2) {
|
||||
MLMultiArray *ma1 = [(id<MLFeatureProvider>)out1
|
||||
featureValueForName:outName].multiArrayValue;
|
||||
MLMultiArray *ma2 = [(id<MLFeatureProvider>)out2
|
||||
featureValueForName:out2Name].multiArrayValue;
|
||||
|
||||
if (ma1 && ma2) {
|
||||
float *p1 = (float *)[ma1 dataPointer];
|
||||
float *p2 = (float *)[ma2 dataPointer];
|
||||
|
||||
float mad1 = max_abs_diff(ref1Ptr, p1, outElems);
|
||||
float mad2 = max_abs_diff(ref2Ptr, p2, nElems2);
|
||||
|
||||
printf(" Op1 max diff: %.8f (mean_ref=%.6f)\n",
|
||||
mad1, mean_abs(ref1Ptr, outElems));
|
||||
printf(" Op2 max diff: %.8f (mean_ref=%.6f)\n",
|
||||
mad2, mean_abs(ref2Ptr, nElems2));
|
||||
|
||||
if (mad1 < 1e-3f && mad2 < 1e-3f) {
|
||||
printf(" *** MULTI-OP VALIDATION PASSED ***\n");
|
||||
} else {
|
||||
printf(" MULTI-OP VALIDATION: differences detected\n");
|
||||
}
|
||||
} else {
|
||||
printf(" Could not extract MLMultiArray from outputs\n");
|
||||
}
|
||||
} else {
|
||||
printf(" outputFeatures nil for op1 or op2\n");
|
||||
}
|
||||
} @catch (NSException *ex) {
|
||||
printf(" Output read EXCEPTION: %s\n",
|
||||
[[ex reason] UTF8String]);
|
||||
}
|
||||
}
|
||||
|
||||
[stream2 setValue:@[op1] forKey:@"operations"];
|
||||
((void(*)(id,SEL,id))objc_msgSend)(
|
||||
streamPool, @selector(putBack:), stream2);
|
||||
}
|
||||
skip_multiop:
|
||||
|
||||
// ============================================================
|
||||
// W4: Async stream submission
|
||||
// ============================================================
|
||||
printf("\n================================================================\n");
|
||||
printf(" W4: Async Stream Submission\n");
|
||||
printf("================================================================\n\n");
|
||||
fflush(stdout);
|
||||
|
||||
{
|
||||
id asyncStream = [streamPool performSelector:@selector(takeOut)];
|
||||
Ivar ashIvar = class_getInstanceVariable([asyncStream class], "_streamHandle");
|
||||
void *ash = (__bridge void *)object_getIvar(asyncStream, ashIvar);
|
||||
|
||||
id asyncOp = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))
|
||||
objc_msgSend)([opCls alloc],
|
||||
@selector(initWithProgramLibrary:functionName:modelDescription:
|
||||
configuration:debugLabel:modelSignpostId:),
|
||||
progLib, @"main", [model modelDescription], cfg,
|
||||
@"async_op", (unsigned long long)0);
|
||||
((BOOL(*)(id,SEL,NSError**))objc_msgSend)(
|
||||
asyncOp, @selector(preloadAndReturnError:), nil);
|
||||
[asyncStream setValue:@[asyncOp] forKey:@"operations"];
|
||||
|
||||
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
|
||||
asyncOp, @selector(prepareForInputFeatures:options:error:),
|
||||
fp, predOpts, nil);
|
||||
|
||||
// Try async submission
|
||||
__block BOOL asyncDone = NO;
|
||||
__block double asyncMs = 0;
|
||||
uint64_t asyncT0 = mach_absolute_time();
|
||||
|
||||
@try {
|
||||
// prepareAsyncSubmissionForInputFeatures
|
||||
NSError *asyncPrepErr = nil;
|
||||
BOOL asyncPrepOk = ((BOOL(*)(id,SEL,id,id,NSError**))
|
||||
objc_msgSend)(asyncStream,
|
||||
@selector(prepareAsyncSubmissionForInputFeatures:options:error:),
|
||||
fp, predOpts, &asyncPrepErr);
|
||||
printf(" prepareAsyncSubmission: %s\n",
|
||||
asyncPrepOk ? "YES" : "NO");
|
||||
if (asyncPrepErr) printf(" Error: %s\n",
|
||||
[[asyncPrepErr description] UTF8String]);
|
||||
fflush(stdout);
|
||||
|
||||
if (asyncPrepOk) {
|
||||
((void(*)(id,SEL,void(^)(void)))objc_msgSend)(
|
||||
asyncStream, @selector(submitWithCompletionHandler:),
|
||||
^{
|
||||
asyncMs = tb_ms(mach_absolute_time() - asyncT0);
|
||||
asyncDone = YES;
|
||||
});
|
||||
printf(" Submitted async, waiting...\n");
|
||||
fflush(stdout);
|
||||
|
||||
for (int w = 0; w < 100 && !asyncDone; w++)
|
||||
usleep(1000);
|
||||
|
||||
printf(" Async completed: %s (%.3f ms)\n",
|
||||
asyncDone ? "YES" : "TIMEOUT", asyncMs);
|
||||
fflush(stdout);
|
||||
|
||||
if (asyncDone) {
|
||||
// Benchmark async vs sync
|
||||
int N = 200;
|
||||
|
||||
// Sync benchmark
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
for (int i = 0; i < N; i++) {
|
||||
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
|
||||
asyncOp,
|
||||
@selector(prepareForInputFeatures:options:error:),
|
||||
fp, predOpts, nil);
|
||||
((BOOL(*)(id,SEL,void*,NSError**))objc_msgSend)(
|
||||
asyncStream,
|
||||
@selector(_executeStream:error:), ash, nil);
|
||||
}
|
||||
double syncMs = tb_ms(mach_absolute_time() - t0) / N;
|
||||
|
||||
// Async benchmark
|
||||
t0 = mach_absolute_time();
|
||||
for (int i = 0; i < N; i++) {
|
||||
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
|
||||
asyncOp,
|
||||
@selector(prepareForInputFeatures:options:error:),
|
||||
fp, predOpts, nil);
|
||||
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
|
||||
asyncStream,
|
||||
@selector(prepareAsyncSubmissionForInputFeatures:
|
||||
options:error:),
|
||||
fp, predOpts, nil);
|
||||
|
||||
__block BOOL done = NO;
|
||||
((void(*)(id,SEL,void(^)(void)))objc_msgSend)(
|
||||
asyncStream,
|
||||
@selector(submitWithCompletionHandler:),
|
||||
^{ done = YES; });
|
||||
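// Bounded wait (about 5 s) so a completion handler that never fires cannot hang the benchmark
for (int w = 0; w < 50000 && !done; w++) usleep(100);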
|
||||
}
|
||||
double asyncBenchMs = tb_ms(mach_absolute_time() - t0) / N;
|
||||
|
||||
printf(" Sync: %.4f ms/eval\n", syncMs);
|
||||
printf(" Async (wait): %.4f ms/eval\n", asyncBenchMs);
|
||||
}
|
||||
}
|
||||
} @catch (NSException *ex) {
|
||||
printf(" Async EXCEPTION: %s\n", [[ex reason] UTF8String]);
|
||||
}
|
||||
|
||||
[asyncStream setValue:@[asyncOp] forKey:@"operations"];
|
||||
((void(*)(id,SEL,id))objc_msgSend)(
|
||||
streamPool, @selector(putBack:), asyncStream);
|
||||
}
|
||||
|
||||
// ============================================================
|
||||
// W5: Port-Based Data Flow
|
||||
// ============================================================
|
||||
printf("\n================================================================\n");
|
||||
printf(" W5: Port-Based Data Flow Investigation\n");
|
||||
printf("================================================================\n\n");
|
||||
fflush(stdout);
|
||||
|
||||
{
|
||||
id portOp = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))
|
||||
objc_msgSend)([opCls alloc],
|
||||
@selector(initWithProgramLibrary:functionName:modelDescription:
|
||||
configuration:debugLabel:modelSignpostId:),
|
||||
progLib, @"main", [model modelDescription], cfg,
|
||||
@"port_op", (unsigned long long)0);
|
||||
((BOOL(*)(id,SEL,NSError**))objc_msgSend)(
|
||||
portOp, @selector(preloadAndReturnError:), nil);
|
||||
|
||||
// Inspect ports before prepare
|
||||
printf(" --- Before prepare ---\n");
|
||||
@try {
|
||||
id inP = [portOp valueForKey:@"inputPorts"];
|
||||
id outP = [portOp valueForKey:@"outputPorts"];
|
||||
id stP = [portOp valueForKey:@"statePorts"];
|
||||
printf(" inputPorts: %lu, outputPorts: %lu, statePorts: %lu\n",
|
||||
inP ? (unsigned long)[(NSArray *)inP count] : 0,
|
||||
outP ? (unsigned long)[(NSArray *)outP count] : 0,
|
||||
stP ? (unsigned long)[(NSArray *)stP count] : 0);
|
||||
|
||||
if (inP) {
|
||||
for (id p in (NSArray *)inP) {
|
||||
printf(" in: %s portHandle=%p name=%s\n",
|
||||
[NSStringFromClass([p class]) UTF8String],
|
||||
(__bridge void *)[p valueForKey:@"portHandle"],
|
||||
[[(id)[p valueForKey:@"name"] description] UTF8String]);
|
||||
}
|
||||
}
|
||||
if (outP) {
|
||||
for (id p in (NSArray *)outP) {
|
||||
printf(" out: %s portHandle=%p name=%s\n",
|
||||
[NSStringFromClass([p class]) UTF8String],
|
||||
(__bridge void *)[p valueForKey:@"portHandle"],
|
||||
[[(id)[p valueForKey:@"name"] description] UTF8String]);
|
||||
@try {
|
||||
id fd = [p valueForKey:@"featureDescription"];
|
||||
if (fd) printf(" featureDesc: %s\n",
|
||||
[[fd description] UTF8String]);
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
}
|
||||
}
|
||||
} @catch (NSException *ex) {
|
||||
printf(" Port inspection EXCEPTION: %s\n",
|
||||
[[ex reason] UTF8String]);
|
||||
}
|
||||
|
||||
// Prepare and inspect after
|
||||
((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
|
||||
portOp, @selector(prepareForInputFeatures:options:error:),
|
||||
fp, predOpts, nil);
|
||||
|
||||
printf("\n --- After prepare ---\n");
|
||||
@try {
|
||||
id inP = [portOp valueForKey:@"inputPorts"];
|
||||
if (inP) {
|
||||
for (id p in (NSArray *)inP) {
|
||||
id binder = [p valueForKey:@"binder"];
|
||||
BOOL directBound = ((BOOL(*)(id,SEL))objc_msgSend)(
|
||||
p, @selector(boundFeatureDirectly));
|
||||
printf(" in: name=%s directBound=%s binder=%s\n",
|
||||
[[(id)[p valueForKey:@"name"] description] UTF8String],
|
||||
directBound ? "YES" : "NO",
|
||||
binder ? [NSStringFromClass([binder class])
|
||||
UTF8String] : "nil");
|
||||
if (binder) {
|
||||
char mode = ((char(*)(id,SEL))objc_msgSend)(
|
||||
binder, @selector(bindingMode));
|
||||
printf(" bindingMode=%d\n", (int)mode);
|
||||
}
|
||||
}
|
||||
}
|
||||
id outP = [portOp valueForKey:@"outputPorts"];
|
||||
if (outP) {
|
||||
for (id p in (NSArray *)outP) {
|
||||
BOOL directBound = ((BOOL(*)(id,SEL))objc_msgSend)(
|
||||
p, @selector(boundFeatureDirectly));
|
||||
BOOL obDirectBound = ((BOOL(*)(id,SEL))objc_msgSend)(
|
||||
p, @selector(outputBackingWasDirectlyBound));
|
||||
printf(" out: name=%s directBound=%s"
|
||||
" outputBackingDirectBound=%s\n",
|
||||
[[(id)[p valueForKey:@"name"] description] UTF8String],
|
||||
directBound ? "YES" : "NO",
|
||||
obDirectBound ? "YES" : "NO");
|
||||
id binder = [p valueForKey:@"binder"];
|
||||
if (binder) {
|
||||
printf(" binder: %s\n",
|
||||
[NSStringFromClass([binder class]) UTF8String]);
|
||||
@try {
|
||||
id ob = [binder valueForKey:@"outputBacking"];
|
||||
printf(" outputBacking: %s\n",
|
||||
ob ? [NSStringFromClass([ob class])
|
||||
UTF8String] : "nil");
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
}
|
||||
}
|
||||
}
|
||||
} @catch (NSException *ex) {
|
||||
printf(" Post-prepare EXCEPTION: %s\n",
|
||||
[[ex reason] UTF8String]);
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================
|
||||
// Summary
|
||||
// ============================================================
|
||||
printf("\n================================================================\n");
|
||||
printf(" SUMMARY\n");
|
||||
printf("================================================================\n");
|
||||
printf(" W1: Output validation -- see above\n");
|
||||
printf(" W2: API documentation -- complete (all classes dumped)\n");
|
||||
printf(" W4: Async submission -- see above\n");
|
||||
printf(" W5: Port data flow -- see above\n");
|
||||
printf("================================================================\n");
|
||||
printf("\nDone.\n");
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
|
@ -0,0 +1,915 @@
|
|||
// test_mil_custom.m — Experiments Y1-Y3, Z1: Custom MIL -> ANE Execution
|
||||
// Build: make test_mil_custom && ./test_mil_custom
|
||||
#import <Foundation/Foundation.h>
|
||||
#import <CoreML/CoreML.h>
|
||||
#import <objc/runtime.h>
|
||||
#import <objc/message.h>
|
||||
#import <dlfcn.h>
|
||||
#import <mach/mach_time.h>
|
||||
#import <Accelerate/Accelerate.h>
|
||||
|
||||
static mach_timebase_info_data_t g_tb;
|
||||
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
|
||||
|
||||
#pragma mark - MIL Compilation Pipeline
|
||||
|
||||
static id compileAndCreateEngine(NSString *milText, NSString *label,
|
||||
id container, MLModelConfiguration *cfg,
|
||||
MLModelDescription *desc, NSError **outErr) {
|
||||
NSString *milPath = [NSString stringWithFormat:@"/tmp/%@.mil", label];
|
||||
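NSError *writeErr = nil;
if (![milText writeToFile:milPath atomically:YES encoding:NSUTF8StringEncoding error:&writeErr]) {
    // Compiling a missing or stale .mil file would give a misleading failure later
    if (outErr) *outErr = writeErr;
    return nil;
}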
|
||||
NSURL *milURL = [NSURL fileURLWithPath:milPath];
|
||||
|
||||
Class aotCls = NSClassFromString(@"MLE5ProgramLibraryOnDeviceAOTCompilationImpl");
|
||||
if (!aotCls) {
|
||||
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:1
|
||||
userInfo:@{NSLocalizedDescriptionKey: @"AOT class not found"}];
|
||||
return nil;
|
||||
}
|
||||
|
||||
id aotImpl = ((id(*)(id,SEL,id,id,id))objc_msgSend)(
|
||||
[aotCls alloc],
|
||||
NSSelectorFromString(@"initWithMILTextAtURL:container:configuration:"),
|
||||
milURL, container, cfg);
|
||||
if (!aotImpl) {
|
||||
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:2
|
||||
userInfo:@{NSLocalizedDescriptionKey: @"AOT init failed"}];
|
||||
return nil;
|
||||
}
|
||||
|
||||
NSError *plErr = nil;
|
||||
void *plHandle = ((void*(*)(id,SEL,BOOL,NSError**))objc_msgSend)(
|
||||
aotImpl,
|
||||
NSSelectorFromString(@"createProgramLibraryHandleWithRespecialization:error:"),
|
||||
NO, &plErr);
|
||||
if (!plHandle) {
|
||||
printf(" [%s] PL handle failed: %s\n", [label UTF8String],
|
||||
plErr ? [[plErr description] UTF8String] : "unknown");
|
||||
if (outErr) *outErr = plErr;
|
||||
return nil;
|
||||
}
|
||||
|
||||
Class plCls = NSClassFromString(@"MLE5ProgramLibrary");
|
||||
id progLib = ((id(*)(id,SEL,id,id,id))objc_msgSend)(
|
||||
[plCls alloc],
|
||||
NSSelectorFromString(@"initWithImpl:container:configuration:"),
|
||||
aotImpl, container, cfg);
|
||||
if (!progLib) {
|
||||
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:4
|
||||
userInfo:@{NSLocalizedDescriptionKey: @"ProgramLibrary init failed"}];
|
||||
return nil;
|
||||
}
|
||||
|
||||
Class engCls = NSClassFromString(@"MLE5Engine");
|
||||
|
||||
// Find the correct init selector
|
||||
static dispatch_once_t once;
|
||||
static SEL engInitSel = NULL;
|
||||
dispatch_once(&once, ^{
|
||||
unsigned int mc;
|
||||
Method *ims = class_copyMethodList(engCls, &mc);
|
||||
printf(" MLE5Engine init selectors:\n");
|
||||
for (unsigned int i = 0; i < mc; i++) {
|
||||
const char *sel = sel_getName(method_getName(ims[i]));
|
||||
if (strstr(sel, "init")) {
|
||||
printf(" - %s [%s]\n", sel, method_getTypeEncoding(ims[i]));
|
||||
if (strstr(sel, "ProgramLibrary") && strstr(sel, "modelDescription"))
|
||||
engInitSel = method_getName(ims[i]);
|
||||
}
|
||||
}
|
||||
free(ims);
|
||||
});
|
||||
|
||||
if (!engInitSel) {
|
||||
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:5
|
||||
userInfo:@{NSLocalizedDescriptionKey: @"No MLE5Engine init selector found"}];
|
||||
return nil;
|
||||
}
|
||||
|
||||
printf(" Using init: %s\n", sel_getName(engInitSel));
|
||||
|
||||
// Count colons to determine argument count
|
||||
const char *selName = sel_getName(engInitSel);
|
||||
int argCount = 0;
|
||||
for (const char *p = selName; *p; p++) if (*p == ':') argCount++;
|
||||
|
||||
id engine = nil;
|
||||
if (argCount == 7) {
|
||||
// initWithProgramLibrary:modelDescription:configuration:functionName:
|
||||
// classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo:
|
||||
engine = ((id(*)(id,SEL,id,id,id,id,id,id,id))objc_msgSend)(
|
||||
[engCls alloc], engInitSel, progLib, desc, cfg,
|
||||
@"main", nil, nil, nil);
|
||||
} else if (argCount == 5) {
|
||||
engine = ((id(*)(id,SEL,id,id,id,id,id))objc_msgSend)(
|
||||
[engCls alloc], engInitSel, progLib, desc, cfg, nil, label);
|
||||
} else if (argCount == 6) {
|
||||
engine = ((id(*)(id,SEL,id,id,id,id,id,id))objc_msgSend)(
|
||||
[engCls alloc], engInitSel, progLib, desc, cfg, nil, nil, label);
|
||||
} else {
|
||||
printf(" Unexpected arg count %d for MLE5Engine init\n", argCount);
|
||||
}
|
||||
|
||||
if (!engine) {
|
||||
if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:6
|
||||
userInfo:@{NSLocalizedDescriptionKey: @"Engine init failed"}];
|
||||
return nil;
|
||||
}
|
||||
|
||||
NSError *prepErr = nil;
|
||||
BOOL prepOk = ((BOOL(*)(id,SEL,long long,NSError**))objc_msgSend)(
|
||||
engine, NSSelectorFromString(@"prepareWithConcurrencyHint:error:"),
|
||||
(long long)1, &prepErr);
|
||||
if (!prepOk) {
|
||||
printf(" [%s] Prepare failed: %s\n", [label UTF8String],
|
||||
prepErr ? [[prepErr description] UTF8String] : "unknown");
|
||||
if (outErr) *outErr = prepErr;
|
||||
return nil;
|
||||
}
|
||||
|
||||
return engine;
|
||||
}
|
||||
|
||||
static id<MLFeatureProvider> runEngine(id engine, id<MLFeatureProvider> features,
|
||||
MLPredictionOptions *opts, NSError **outErr) {
|
||||
return ((id(*)(id,SEL,id,id,NSError**))objc_msgSend)(
|
||||
engine, NSSelectorFromString(@"predictionFromFeatures:options:error:"),
|
||||
features, opts, outErr);
|
||||
}
|
||||
|
||||
#pragma mark - Numeric Helpers
|
||||
|
||||
static float max_abs_diff(const float *a, const float *b, int n) {
|
||||
float m = 0;
|
||||
for (int i = 0; i < n; i++) {
|
||||
float d = fabsf(a[i] - b[i]);
|
||||
if (d > m) m = d;
|
||||
}
|
||||
return m;
|
||||
}
|
||||
|
||||
static float mean_abs(const float *a, int n) {
|
||||
float s = 0;
|
||||
for (int i = 0; i < n; i++) s += fabsf(a[i]);
|
||||
return n > 0 ? s / n : 0.0f;
|
||||
}
|
||||
|
||||
static void fill_random(float *buf, int n, float scale) {
|
||||
for (int i = 0; i < n; i++)
|
||||
buf[i] = ((float)arc4random() / (float)UINT32_MAX - 0.5f) * 2.0f * scale;
|
||||
}
|
||||
|
||||
static void print_first(const char *label, const float *buf, int total) {
|
||||
int n = total < 8 ? total : 8;
|
||||
printf(" %s: [", label);
|
||||
for (int i = 0; i < n; i++)
|
||||
printf("%s%.4f", i ? ", " : "", buf[i]);
|
||||
printf("]\n");
|
||||
}
|
||||
|
||||
#pragma mark - CPU Reference Implementations
|
||||
|
||||
static void cpu_sdpa(const float *Q, const float *K, const float *V,
|
||||
float *out, int seqLen, int headDim) {
|
||||
float scale = 1.0f / sqrtf((float)headDim);
|
||||
float *scores = (float *)calloc((size_t)seqLen * (size_t)seqLen, sizeof(float));
if (!scores) return; // allocation failed; out is left untouched
|
||||
|
||||
for (int i = 0; i < seqLen; i++) {
|
||||
for (int j = 0; j < seqLen; j++) {
|
||||
float dot = 0;
|
||||
for (int d = 0; d < headDim; d++)
|
||||
dot += Q[i * headDim + d] * K[j * headDim + d];
|
||||
scores[i * seqLen + j] = dot * scale;
|
||||
}
|
||||
}
|
||||
for (int i = 0; i < seqLen; i++) {
|
||||
float maxv = scores[i * seqLen];
|
||||
for (int j = 1; j < seqLen; j++)
|
||||
if (scores[i * seqLen + j] > maxv) maxv = scores[i * seqLen + j];
|
||||
float sum = 0;
|
||||
for (int j = 0; j < seqLen; j++) {
|
||||
scores[i * seqLen + j] = expf(scores[i * seqLen + j] - maxv);
|
||||
sum += scores[i * seqLen + j];
|
||||
}
|
||||
for (int j = 0; j < seqLen; j++)
|
||||
scores[i * seqLen + j] /= sum;
|
||||
}
|
||||
for (int i = 0; i < seqLen; i++) {
|
||||
for (int d = 0; d < headDim; d++) {
|
||||
float acc = 0;
|
||||
for (int j = 0; j < seqLen; j++)
|
||||
acc += scores[i * seqLen + j] * V[j * headDim + d];
|
||||
out[i * headDim + d] = acc;
|
||||
}
|
||||
}
|
||||
free(scores);
|
||||
}
|
||||
|
||||
#pragma mark - Container Discovery
|
||||
|
||||
static id findE5Container(MLModel *model, NSURL *compiledURL, MLModelConfiguration *cfg) {
|
||||
// Try standard paths first
|
||||
@try {
|
||||
id eng = [model valueForKey:@"_internalEngine"];
|
||||
if ([NSStringFromClass([eng class]) containsString:@"MLE5"]) {
|
||||
id pl = [eng valueForKey:@"programLibrary"];
|
||||
if (pl) {
|
||||
id c = nil;
|
||||
@try { c = [pl valueForKey:@"_container"]; } @catch(id e) { (void)e; }
|
||||
if (!c) {
|
||||
@try {
|
||||
id impl = [pl valueForKey:@"_impl"];
|
||||
if (impl) c = [impl valueForKey:@"_container"];
|
||||
} @catch(id e) { (void)e; }
|
||||
}
|
||||
if (c) return c;
|
||||
}
|
||||
}
|
||||
|
||||
// MLMultiFunctionProgramEngine path
|
||||
if ([NSStringFromClass([eng class]) isEqualToString:@"MLMultiFunctionProgramEngine"]) {
|
||||
NSDictionary *map = [eng valueForKey:@"_functionNameToEngineMap"];
|
||||
for (id key in map) {
|
||||
id sub = map[key];
|
||||
if ([NSStringFromClass([sub class]) containsString:@"MLE5"]) {
|
||||
id pl = [sub valueForKey:@"programLibrary"];
|
||||
if (pl) {
|
||||
id c = nil;
|
||||
@try { c = [pl valueForKey:@"_container"]; } @catch(id e) { (void)e; }
|
||||
if (!c) {
|
||||
@try {
|
||||
id impl = [pl valueForKey:@"_impl"];
|
||||
if (impl) c = [impl valueForKey:@"_container"];
|
||||
} @catch(id e) { (void)e; }
|
||||
}
|
||||
if (c) return c;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
} @catch(id e) { (void)e; }
|
||||
|
||||
// Create MLProgramE5Container directly from compiled model
|
||||
Class e5Cls = NSClassFromString(@"MLProgramE5Container");
|
||||
if (!e5Cls) return nil;
|
||||
|
||||
// Find model.mil path inside the compiled model
|
||||
NSString *compiledPath = [compiledURL path];
|
||||
NSString *milPath = [compiledPath stringByAppendingPathComponent:@"model.mil"];
|
||||
if (![[NSFileManager defaultManager] fileExistsAtPath:milPath]) {
|
||||
printf(" No model.mil at %s\n", [milPath UTF8String]);
|
||||
|
||||
// List contents
|
||||
NSArray *contents = [[NSFileManager defaultManager]
|
||||
contentsOfDirectoryAtPath:compiledPath error:nil];
|
||||
printf(" Compiled model contents: %s\n", [[contents description] UTF8String]);
|
||||
}
|
||||
|
||||
// Try to create E5 container with the model asset description from NN container
|
||||
@try {
|
||||
id eng = [model valueForKey:@"_internalEngine"];
|
||||
id nnContainer = [eng valueForKey:@"_container"];
|
||||
if (nnContainer) {
|
||||
// Get model file path
|
||||
NSString *modelFilePath = nil;
|
||||
@try { modelFilePath = [nnContainer valueForKey:@"_modelFilePath"]; }
|
||||
@catch(id e) { (void)e; }
|
||||
|
||||
if (modelFilePath) {
|
||||
printf(" Model file path: %s\n", [modelFilePath UTF8String]);
|
||||
|
||||
// Try to create E5 container with this path
|
||||
@try {
|
||||
id c = ((id(*)(id,SEL,id,id))objc_msgSend)(
|
||||
[e5Cls alloc],
|
||||
NSSelectorFromString(@"initWithModelAssetPath:configuration:"),
|
||||
modelFilePath, cfg);
|
||||
if (c) return c;
|
||||
} @catch(id e) { (void)e; }
|
||||
}
|
||||
|
||||
// Try initWithModelAssetDescription
|
||||
@try {
|
||||
id assetDesc = nil;
|
||||
@try { assetDesc = [nnContainer valueForKey:@"_modelAssetDescription"]; }
|
||||
@catch(id e) { (void)e; }
|
||||
if (!assetDesc) {
|
||||
@try { assetDesc = [nnContainer valueForKey:@"modelAssetDescription"]; }
|
||||
@catch(id e) { (void)e; }
|
||||
}
|
||||
if (assetDesc) {
|
||||
printf(" Asset description: %s\n",
|
||||
[NSStringFromClass([assetDesc class]) UTF8String]);
|
||||
id c = ((id(*)(id,SEL,id,id))objc_msgSend)(
|
||||
[e5Cls alloc],
|
||||
NSSelectorFromString(@"initWithModelAssetDescription:configuration:"),
|
||||
assetDesc, cfg);
|
||||
if (c) return c;
|
||||
}
|
||||
} @catch(id e) { (void)e; }
|
||||
}
|
||||
} @catch(id e) { (void)e; }
|
||||
|
||||
// Dump E5Container init methods
|
||||
unsigned int mc;
|
||||
Method *ims = class_copyMethodList(e5Cls, &mc);
|
||||
printf(" MLProgramE5Container init methods:\n");
|
||||
for (unsigned int i = 0; i < mc; i++) {
|
||||
const char *sel = sel_getName(method_getName(ims[i]));
|
||||
if (strstr(sel, "init"))
|
||||
printf(" - %s\n", sel);
|
||||
}
|
||||
free(ims);
|
||||
|
||||
return nil;
|
||||
}
|
||||
|
||||
#pragma mark - Main
|
||||
|
||||
int main(int argc, const char *argv[]) {
|
||||
(void)argc; (void)argv;
|
||||
@autoreleasepool {
|
||||
mach_timebase_info(&g_tb);
|
||||
printf("================================================================\n");
|
||||
printf(" Custom MIL -> ANE: Experiments Y1, Y2, Y3, Z1\n");
|
||||
printf("================================================================\n\n");
|
||||
|
||||
void *aneLib = dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/"
"AppleNeuralEngine", RTLD_NOW);
if (!aneLib) printf(" WARNING: dlopen(AppleNeuralEngine) failed: %s\n", dlerror());
|
||||
|
||||
NSString *pkgPath = @"/tmp/ane_sram_256ch_64sp.mlpackage";
|
||||
if (![[NSFileManager defaultManager] fileExistsAtPath:pkgPath]) {
|
||||
printf("FATAL: %s not found. Run: python3 scripts/gen_mlpackages.py\n",
|
||||
[pkgPath UTF8String]);
|
||||
return 1;
|
||||
}
|
||||
|
||||
NSError *err = nil;
|
||||
MLModelConfiguration *cfg = [[MLModelConfiguration alloc] init];
|
||||
cfg.computeUnits = MLComputeUnitsAll;
|
||||
MLPredictionOptions *opts = [[MLPredictionOptions alloc] init];
|
||||
|
||||
NSURL *compiled = [MLModel compileModelAtURL:
|
||||
[NSURL fileURLWithPath:pkgPath] error:&err];
|
||||
if (!compiled) { printf("FATAL: compile: %s\n", [[err description] UTF8String]); return 1; }
|
||||
|
||||
MLModel *refModel = [MLModel modelWithContentsOfURL:compiled
|
||||
configuration:cfg error:&err];
|
||||
if (!refModel) { printf("FATAL: load: %s\n", [[err description] UTF8String]); return 1; }
|
||||
printf(" Ref model: %s\n", [NSStringFromClass([refModel class]) UTF8String]);
|
||||
|
||||
MLModelDescription *refDesc = [refModel modelDescription];
|
||||
|
||||
// Find or create E5 container
|
||||
id refContainer = findE5Container(refModel, compiled, cfg);
|
||||
if (refContainer) {
|
||||
printf(" Container: %s\n\n", [NSStringFromClass([refContainer class]) UTF8String]);
|
||||
} else {
|
||||
printf(" No E5 container found. Trying nil container...\n\n");
|
||||
}
|
||||
|
||||
int ch = 256, sp = 64;
|
||||
int nElems = ch * sp;
|
||||
NSString *inName = [[[refDesc inputDescriptionsByName] allKeys] firstObject];
|
||||
NSString *outName = [[[refDesc outputDescriptionsByName] allKeys] firstObject];
|
||||
printf(" I/O: %s -> %s, shape [1,%d,1,%d]\n\n", [inName UTF8String],
|
||||
[outName UTF8String], ch, sp);
|
||||
|
||||
// ============================================================
|
||||
// Y1: Scaled Dot-Product Attention
|
||||
// ============================================================
|
||||
printf("================================================================\n");
|
||||
printf(" Y1: scaled_dot_product_attention on ANE\n");
|
||||
printf("================================================================\n\n");
|
||||
|
||||
{
|
||||
int seqLen = ch, headDim = sp;
|
||||
|
||||
NSString *sdpaMIL = [NSString stringWithFormat:
|
||||
@"program(1.3)\n"
|
||||
"{\n"
|
||||
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
|
||||
" string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
|
||||
" tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
|
||||
" tensor<int32, [4]> sr = const()[name = string(\"sr\"), val = tensor<int32, [4]>([1, 1, %d, %d])];\n"
|
||||
" tensor<fp16, [1, 1, %d, %d]> q = reshape(x = x16, shape = sr)[name = string(\"q\")];\n"
|
||||
" tensor<fp16, [1, 1, %d, %d]> k = reshape(x = x16, shape = sr)[name = string(\"k\")];\n"
|
||||
" tensor<fp16, [1, 1, %d, %d]> v = reshape(x = x16, shape = sr)[name = string(\"v\")];\n"
|
||||
" tensor<fp16, [1, 1, %d, %d]> attn = scaled_dot_product_attention(query = q, key = k, value = v)[name = string(\"attn\")];\n"
|
||||
" tensor<int32, [4]> or = const()[name = string(\"or\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
|
||||
" tensor<fp16, [1, %d, 1, %d]> rs = reshape(x = attn, shape = or)[name = string(\"rs\")];\n"
|
||||
" string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
|
||||
" tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = rs)[name = string(\"cast_out\")];\n"
|
||||
" } -> (cast_out);\n"
|
||||
"}\n",
|
||||
ch, sp, ch, sp,
|
||||
seqLen, headDim, seqLen, headDim, seqLen, headDim, seqLen, headDim,
|
||||
seqLen, headDim,
|
||||
ch, sp, ch, sp,
|
||||
ch, sp];
|
||||
|
||||
printf(" Self-attention: B=1, nHeads=1, seqLen=%d, headDim=%d\n\n", seqLen, headDim);
|
||||
|
||||
err = nil;
|
||||
id engine = compileAndCreateEngine(sdpaMIL, @"y1_sdpa", refContainer, cfg, refDesc, &err);
|
||||
|
||||
if (!engine) {
|
||||
printf(" Y1 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
|
||||
} else {
|
||||
printf(" Y1: Engine created\n");
|
||||
MLMultiArray *inputArr = [[MLMultiArray alloc]
|
||||
initWithShape:@[@1, @(ch), @1, @(sp)]
|
||||
dataType:MLMultiArrayDataTypeFloat32 error:nil];
|
||||
float *inPtr = (float *)[inputArr dataPointer];
|
||||
fill_random(inPtr, nElems, 0.5f);
|
||||
|
||||
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
|
||||
initWithDictionary:@{inName: inputArr} error:nil];
|
||||
|
||||
NSError *runErr = nil;
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
|
||||
double ms = tb_ms(mach_absolute_time() - t0);
|
||||
|
||||
if (runErr || !result) {
|
||||
printf(" Y1 prediction FAILED: %s\n\n",
|
||||
runErr ? [[runErr description] UTF8String] : "nil");
|
||||
} else {
|
||||
MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
|
||||
if (!outArr) {
|
||||
printf(" Y1 output nil\n\n");
|
||||
} else {
|
||||
float *outPtr = (float *)[outArr dataPointer];
|
||||
print_first("ANE out", outPtr, nElems);
|
||||
printf(" Time: %.3f ms\n", ms);
|
||||
|
||||
float *cpuOut = (float *)calloc(nElems, sizeof(float));
|
||||
cpu_sdpa(inPtr, inPtr, inPtr, cpuOut, seqLen, headDim);
|
||||
print_first("CPU ref", cpuOut, nElems);
|
||||
|
||||
float mad = max_abs_diff(outPtr, cpuOut, nElems);
|
||||
printf(" Max diff: %.6f, Rel: %.2e\n",
|
||||
mad, mad / (mean_abs(cpuOut, nElems) + 1e-10f));
|
||||
printf(" %s\n\n", mad < 0.02f ? "*** Y1 PASSED ***" :
|
||||
(mad < 0.1f ? "Y1 WARNING" : "Y1 FAILED"));
|
||||
|
||||
int N = 100;
|
||||
t0 = mach_absolute_time();
|
||||
for (int i = 0; i < N; i++) runEngine(engine, fp, opts, nil);
|
||||
printf(" Bench: %.4f ms/eval (%d iters)\n\n",
|
||||
tb_ms(mach_absolute_time() - t0) / N, N);
|
||||
free(cpuOut);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================
|
||||
// Y2: Linear with Embedded Weights
|
||||
// ============================================================
|
||||
printf("================================================================\n");
|
||||
printf(" Y2: linear op with embedded weights on ANE\n");
|
||||
printf("================================================================\n\n");
|
||||
|
||||
{
|
||||
int inDim = sp, outDim = sp;
|
||||
|
||||
float *W = (float *)malloc(outDim * inDim * sizeof(float));
|
||||
float *B = (float *)malloc(outDim * sizeof(float));
|
||||
fill_random(W, outDim * inDim, 0.1f);
|
||||
fill_random(B, outDim, 0.01f);
|
||||
|
||||
NSMutableString *wLit = [NSMutableString stringWithString:@"["];
|
||||
for (int i = 0; i < outDim; i++) {
|
||||
if (i > 0) [wLit appendString:@", "];
|
||||
[wLit appendString:@"["];
|
||||
for (int j = 0; j < inDim; j++) {
|
||||
if (j > 0) [wLit appendString:@", "];
|
||||
[wLit appendFormat:@"%.8e", W[i * inDim + j]];
|
||||
}
|
||||
[wLit appendString:@"]"];
|
||||
}
|
||||
[wLit appendString:@"]"];
|
||||
|
||||
NSMutableString *bLit = [NSMutableString stringWithString:@"["];
|
||||
for (int j = 0; j < outDim; j++) {
|
||||
if (j > 0) [bLit appendString:@", "];
|
||||
[bLit appendFormat:@"%.8e", B[j]];
|
||||
}
|
||||
[bLit appendString:@"]"];
|
||||
|
||||
NSString *linearMIL = [NSString stringWithFormat:
|
||||
@"program(1.3)\n"
|
||||
"{\n"
|
||||
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
|
||||
" string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
|
||||
" tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
|
||||
" tensor<int32, [2]> rs = const()[name = string(\"rs\"), val = tensor<int32, [2]>([%d, %d])];\n"
|
||||
" tensor<fp16, [%d, %d]> flat = reshape(x = x16, shape = rs)[name = string(\"flat\")];\n"
|
||||
" tensor<fp16, [%d, %d]> Wc = const()[name = string(\"Wc\"), val = tensor<fp16, [%d, %d]>(%@)];\n"
|
||||
" tensor<fp16, [%d]> Bc = const()[name = string(\"Bc\"), val = tensor<fp16, [%d]>(%@)];\n"
|
||||
" tensor<fp16, [%d, %d]> lin = linear(x = flat, weight = Wc, bias = Bc)[name = string(\"lin\")];\n"
|
||||
" tensor<int32, [4]> rs2 = const()[name = string(\"rs2\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
|
||||
" tensor<fp16, [1, %d, 1, %d]> rso = reshape(x = lin, shape = rs2)[name = string(\"rso\")];\n"
|
||||
" string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
|
||||
" tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = rso)[name = string(\"cast_out\")];\n"
|
||||
" } -> (cast_out);\n"
|
||||
"}\n",
|
||||
ch, sp, ch, sp,
|
||||
ch, sp, ch, sp,
|
||||
outDim, inDim, outDim, inDim, wLit,
|
||||
outDim, outDim, bLit,
|
||||
ch, outDim,
|
||||
ch, sp, ch, sp,
|
||||
ch, sp];
|
||||
|
||||
printf(" Config: [%d,%d] linear %d->%d with embedded W+b\n\n", ch, sp, inDim, outDim);
|
||||
|
||||
err = nil;
|
||||
id engine = compileAndCreateEngine(linearMIL, @"y2_linear", refContainer, cfg, refDesc, &err);
|
||||
|
||||
if (!engine) {
|
||||
printf(" Y2 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
|
||||
} else {
|
||||
printf(" Y2: Engine created\n");
|
||||
MLMultiArray *inputArr = [[MLMultiArray alloc]
|
||||
initWithShape:@[@1, @(ch), @1, @(sp)]
|
||||
dataType:MLMultiArrayDataTypeFloat32 error:nil];
|
||||
float *inPtr = (float *)[inputArr dataPointer];
|
||||
fill_random(inPtr, nElems, 0.5f);
|
||||
|
||||
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
|
||||
initWithDictionary:@{inName: inputArr} error:nil];
|
||||
|
||||
NSError *runErr = nil;
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
|
||||
double ms = tb_ms(mach_absolute_time() - t0);
|
||||
|
||||
if (runErr || !result) {
|
||||
printf(" Y2 prediction FAILED: %s\n\n",
|
||||
runErr ? [[runErr description] UTF8String] : "nil");
|
||||
} else {
|
||||
MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
|
||||
if (outArr) {
|
||||
float *outPtr = (float *)[outArr dataPointer];
|
||||
print_first("ANE out", outPtr, nElems);
|
||||
printf(" Time: %.3f ms\n", ms);
|
||||
|
||||
// CPU: x[ch,sp] @ W^T[sp,sp] + b[sp]
|
||||
float *cpuOut = (float *)calloc(nElems, sizeof(float));
|
||||
for (int i = 0; i < ch; i++) {
|
||||
for (int j = 0; j < outDim; j++) {
|
||||
float acc = 0;
|
||||
for (int k = 0; k < inDim; k++)
|
||||
acc += inPtr[i * inDim + k] * W[j * inDim + k];
|
||||
cpuOut[i * outDim + j] = acc + B[j];
|
||||
}
|
||||
}
|
||||
print_first("CPU ref", cpuOut, nElems);
|
||||
|
||||
float mad = max_abs_diff(outPtr, cpuOut, nElems);
|
||||
printf(" Max diff: %.6f, Rel: %.2e\n",
|
||||
mad, mad / (mean_abs(cpuOut, nElems) + 1e-10f));
|
||||
printf(" %s\n\n", mad < 0.05f ? "*** Y2 PASSED ***" :
|
||||
(mad < 0.5f ? "Y2 WARNING" : "Y2 FAILED"));
|
||||
|
||||
int N = 100;
|
||||
t0 = mach_absolute_time();
|
||||
for (int i = 0; i < N; i++) runEngine(engine, fp, opts, nil);
|
||||
printf(" Bench: %.4f ms/eval (%d iters)\n\n",
|
||||
tb_ms(mach_absolute_time() - t0) / N, N);
|
||||
free(cpuOut);
|
||||
} else {
    printf(" Y2 output nil\n\n");
}
|
||||
}
|
||||
}
|
||||
free(W); free(B);
|
||||
}
|
||||
|
||||
// ============================================================
|
||||
// Y3: Transformer Block (Attention + FFN)
|
||||
// ============================================================
|
||||
printf("================================================================\n");
|
||||
printf(" Y3: Transformer Block (LN + SDPA + Residual + LN + FFN + Residual)\n");
|
||||
printf("================================================================\n\n");
|
||||
|
||||
{
|
||||
int seqLen = ch, dim = sp, ffnDim = 128;
|
||||
|
||||
float *w1 = (float *)malloc(ffnDim * dim * sizeof(float));
|
||||
float *b1 = (float *)malloc(ffnDim * sizeof(float));
|
||||
float *w2 = (float *)malloc(dim * ffnDim * sizeof(float));
|
||||
float *b2 = (float *)malloc(dim * sizeof(float));
|
||||
fill_random(w1, ffnDim * dim, 0.05f);
|
||||
fill_random(b1, ffnDim, 0.01f);
|
||||
fill_random(w2, dim * ffnDim, 0.05f);
|
||||
fill_random(b2, dim, 0.01f);
|
||||
|
||||
// Build weight string literals
|
||||
NSMutableString *(^buildMat)(float*, int, int) = ^(float *m, int rows, int cols) {
|
||||
NSMutableString *s = [NSMutableString stringWithString:@"["];
|
||||
for (int i = 0; i < rows; i++) {
|
||||
if (i > 0) [s appendString:@", "];
|
||||
[s appendString:@"["];
|
||||
for (int j = 0; j < cols; j++) {
|
||||
if (j > 0) [s appendString:@", "];
|
||||
[s appendFormat:@"%.8e", m[i * cols + j]];
|
||||
}
|
||||
[s appendString:@"]"];
|
||||
}
|
||||
[s appendString:@"]"];
|
||||
return s;
|
||||
};
|
||||
|
||||
NSMutableString *(^buildVec)(float*, int) = ^(float *v, int n) {
|
||||
NSMutableString *s = [NSMutableString stringWithString:@"["];
|
||||
for (int i = 0; i < n; i++) {
|
||||
if (i > 0) [s appendString:@", "];
|
||||
[s appendFormat:@"%.8e", v[i]];
|
||||
}
|
||||
[s appendString:@"]"];
|
||||
return s;
|
||||
};
|
||||
|
||||
NSMutableString *(^buildOnes)(int) = ^(int n) {
|
||||
NSMutableString *s = [NSMutableString stringWithString:@"["];
|
||||
for (int i = 0; i < n; i++) {
|
||||
if (i > 0) [s appendString:@", "];
|
||||
[s appendString:@"1.0"];
|
||||
}
|
||||
[s appendString:@"]"];
|
||||
return s;
|
||||
};
|
||||
|
||||
NSMutableString *(^buildZeros)(int) = ^(int n) {
|
||||
NSMutableString *s = [NSMutableString stringWithString:@"["];
|
||||
for (int i = 0; i < n; i++) {
|
||||
if (i > 0) [s appendString:@", "];
|
||||
[s appendString:@"0.0"];
|
||||
}
|
||||
[s appendString:@"]"];
|
||||
return s;
|
||||
};
|
||||
|
||||
NSString *tfMIL = [NSString stringWithFormat:
|
||||
@"program(1.3)\n"
|
||||
"{\n"
|
||||
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
|
||||
" string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
|
||||
" tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
|
||||
" tensor<int32, [2]> r2 = const()[name = string(\"r2\"), val = tensor<int32, [2]>([%d, %d])];\n"
|
||||
" tensor<fp16, [%d, %d]> flat = reshape(x = x16, shape = r2)[name = string(\"flat\")];\n"
|
||||
// LN1
|
||||
" tensor<fp16, [%d]> g1 = const()[name = string(\"g1\"), val = tensor<fp16, [%d]>(%@)];\n"
|
||||
" tensor<fp16, [%d]> b1 = const()[name = string(\"b1\"), val = tensor<fp16, [%d]>(%@)];\n"
|
||||
" tensor<int32, [1]> la = const()[name = string(\"la\"), val = tensor<int32, [1]>([-1])];\n"
|
||||
" fp16 eps = const()[name = string(\"eps\"), val = fp16(1e-5)];\n"
|
||||
" tensor<fp16, [%d, %d]> ln1 = layer_norm(x = flat, axes = la, gamma = g1, beta = b1, epsilon = eps)[name = string(\"ln1\")];\n"
|
||||
// SDPA
|
||||
" tensor<int32, [4]> sr = const()[name = string(\"sr\"), val = tensor<int32, [4]>([1, 1, %d, %d])];\n"
|
||||
" tensor<fp16, [1, 1, %d, %d]> q = reshape(x = ln1, shape = sr)[name = string(\"q\")];\n"
|
||||
" tensor<fp16, [1, 1, %d, %d]> k = reshape(x = ln1, shape = sr)[name = string(\"k\")];\n"
|
||||
" tensor<fp16, [1, 1, %d, %d]> v = reshape(x = ln1, shape = sr)[name = string(\"v\")];\n"
|
||||
" tensor<fp16, [1, 1, %d, %d]> at = scaled_dot_product_attention(query = q, key = k, value = v)[name = string(\"at\")];\n"
|
||||
" tensor<fp16, [%d, %d]> af = reshape(x = at, shape = r2)[name = string(\"af\")];\n"
|
||||
// Residual 1
|
||||
" tensor<fp16, [%d, %d]> r1 = add(x = flat, y = af)[name = string(\"r1\")];\n"
|
||||
// LN2
|
||||
" tensor<fp16, [%d]> g2 = const()[name = string(\"g2\"), val = tensor<fp16, [%d]>(%@)];\n"
|
||||
" tensor<fp16, [%d]> b2 = const()[name = string(\"b2\"), val = tensor<fp16, [%d]>(%@)];\n"
|
||||
" tensor<fp16, [%d, %d]> ln2 = layer_norm(x = r1, axes = la, gamma = g2, beta = b2, epsilon = eps)[name = string(\"ln2\")];\n"
|
||||
// FFN
|
||||
" tensor<fp16, [%d, %d]> W1 = const()[name = string(\"W1\"), val = tensor<fp16, [%d, %d]>(%@)];\n"
|
||||
" tensor<fp16, [%d]> B1 = const()[name = string(\"B1\"), val = tensor<fp16, [%d]>(%@)];\n"
|
||||
" tensor<fp16, [%d, %d]> f1 = linear(x = ln2, weight = W1, bias = B1)[name = string(\"f1\")];\n"
|
||||
" tensor<fp16, [%d, %d]> ga = gelu(x = f1, mode = string(\"TANH_APPROXIMATION\"))[name = string(\"ga\")];\n"
|
||||
" tensor<fp16, [%d, %d]> W2 = const()[name = string(\"W2\"), val = tensor<fp16, [%d, %d]>(%@)];\n"
|
||||
" tensor<fp16, [%d]> B2 = const()[name = string(\"B2\"), val = tensor<fp16, [%d]>(%@)];\n"
|
||||
" tensor<fp16, [%d, %d]> f2 = linear(x = ga, weight = W2, bias = B2)[name = string(\"f2\")];\n"
|
||||
// Residual 2
|
||||
" tensor<fp16, [%d, %d]> r2o = add(x = r1, y = f2)[name = string(\"r2o\")];\n"
|
||||
// Output
|
||||
" tensor<int32, [4]> r4 = const()[name = string(\"r4\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
|
||||
" tensor<fp16, [1, %d, 1, %d]> o16 = reshape(x = r2o, shape = r4)[name = string(\"o16\")];\n"
|
||||
" string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
|
||||
" tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = o16)[name = string(\"cast_out\")];\n"
|
||||
" } -> (cast_out);\n"
|
||||
"}\n",
|
||||
ch, sp, ch, sp,
|
||||
seqLen, dim, seqLen, dim,
|
||||
dim, dim, buildOnes(dim),
|
||||
dim, dim, buildZeros(dim),
|
||||
seqLen, dim,
|
||||
seqLen, dim, seqLen, dim, seqLen, dim, seqLen, dim,
|
||||
seqLen, dim,
|
||||
seqLen, dim,
|
||||
seqLen, dim,
|
||||
dim, dim, buildOnes(dim),
|
||||
dim, dim, buildZeros(dim),
|
||||
seqLen, dim,
|
||||
ffnDim, dim, ffnDim, dim, buildMat(w1, ffnDim, dim),
|
||||
ffnDim, ffnDim, buildVec(b1, ffnDim),
|
||||
seqLen, ffnDim,
|
||||
seqLen, ffnDim,
|
||||
dim, ffnDim, dim, ffnDim, buildMat(w2, dim, ffnDim),
|
||||
dim, dim, buildVec(b2, dim),
|
||||
seqLen, dim,
|
||||
seqLen, dim,
|
||||
ch, sp, ch, sp,
|
||||
ch, sp];
|
||||
|
||||
printf(" Pipeline: LN->SDPA->Res->LN->FFN(%d->%d->%d)->Res\n\n", dim, ffnDim, dim);
|
||||
|
||||
err = nil;
|
||||
id engine = compileAndCreateEngine(tfMIL, @"y3_transformer",
|
||||
refContainer, cfg, refDesc, &err);
|
||||
|
||||
if (!engine) {
|
||||
printf(" Y3 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
|
||||
} else {
|
||||
printf(" Y3: Engine created!\n");
|
||||
MLMultiArray *inputArr = [[MLMultiArray alloc]
|
||||
initWithShape:@[@1, @(ch), @1, @(sp)]
|
||||
dataType:MLMultiArrayDataTypeFloat32 error:nil];
|
||||
float *inPtr = (float *)[inputArr dataPointer];
|
||||
fill_random(inPtr, nElems, 0.5f);
|
||||
|
||||
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
|
||||
initWithDictionary:@{inName: inputArr} error:nil];
|
||||
|
||||
NSError *runErr = nil;
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
|
||||
double ms = tb_ms(mach_absolute_time() - t0);
|
||||
|
||||
if (runErr || !result) {
|
||||
printf(" Y3 prediction FAILED: %s\n\n",
|
||||
runErr ? [[runErr description] UTF8String] : "nil");
|
||||
} else {
|
||||
MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
|
||||
if (outArr) {
|
||||
float *outPtr = (float *)[outArr dataPointer];
|
||||
print_first("ANE out", outPtr, nElems);
|
||||
printf(" Time: %.3f ms\n", ms);
|
||||
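// Smoke test only: there is no CPU reference for the full block, so
// "PASSED" means the output is finite and not all zeros.
float m = mean_abs(outPtr, nElems);
int ok = isfinite(m) && m > 1e-6f;
printf(" Non-zero: %s (mean_abs=%.6f)\n", ok ? "YES" : "NO", m);
printf(" %s\n\n", ok ? "*** Y3 PASSED ***" : "Y3 FAILED");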
|
||||
|
||||
int N = 100;
|
||||
t0 = mach_absolute_time();
|
||||
for (int i = 0; i < N; i++) runEngine(engine, fp, opts, nil);
|
||||
printf(" Bench: %.4f ms/eval (%d iters)\n\n",
|
||||
tb_ms(mach_absolute_time() - t0) / N, N);
|
||||
} else {
    printf(" Y3 output nil\n\n");
}
|
||||
}
|
||||
}
|
||||
free(w1); free(b1); free(w2); free(b2);
|
||||
}
|
||||
|
||||
// ============================================================
|
||||
// Z1: Linear Backward Pass (Gradient Computation)
|
||||
// ============================================================
|
||||
printf("================================================================\n");
|
||||
printf(" Z1: Backward Pass (matmul with runtime tensors) on ANE\n");
|
||||
printf("================================================================\n\n");
|
||||
|
||||
{
|
||||
int M = 128, K = 64, N = 64;
|
||||
|
||||
NSString *bwdMIL = [NSString stringWithFormat:
|
||||
@"program(1.3)\n"
|
||||
"{\n"
|
||||
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
|
||||
" string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
|
||||
" tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
|
||||
" tensor<int32, [2]> r2 = const()[name = string(\"r2\"), val = tensor<int32, [2]>([%d, %d])];\n"
|
||||
" tensor<fp16, [%d, %d]> flat = reshape(x = x16, shape = r2)[name = string(\"flat\")];\n"
|
||||
// Slice dY [0:128, :]
|
||||
" tensor<int32, [2]> db = const()[name = string(\"db\"), val = tensor<int32, [2]>([0, 0])];\n"
|
||||
" tensor<int32, [2]> de = const()[name = string(\"de\"), val = tensor<int32, [2]>([%d, %d])];\n"
|
||||
" tensor<fp16, [%d, %d]> dY = slice_by_index(x = flat, begin = db, end = de)[name = string(\"dY\")];\n"
|
||||
// Slice W [128:192, :]
|
||||
" tensor<int32, [2]> wb = const()[name = string(\"wb\"), val = tensor<int32, [2]>([%d, 0])];\n"
|
||||
" tensor<int32, [2]> we = const()[name = string(\"we\"), val = tensor<int32, [2]>([%d, %d])];\n"
|
||||
" tensor<fp16, [%d, %d]> W = slice_by_index(x = flat, begin = wb, end = we)[name = string(\"W\")];\n"
|
||||
// Slice pad [192:256, :]
|
||||
" tensor<int32, [2]> pb = const()[name = string(\"pb\"), val = tensor<int32, [2]>([%d, 0])];\n"
|
||||
" tensor<int32, [2]> pe = const()[name = string(\"pe\"), val = tensor<int32, [2]>([%d, %d])];\n"
|
||||
" tensor<fp16, [%d, %d]> pad = slice_by_index(x = flat, begin = pb, end = pe)[name = string(\"pad\")];\n"
|
||||
// dX = dY @ W
|
||||
" bool txf = const()[name = string(\"txf\"), val = bool(false)];\n"
|
||||
" bool tyf = const()[name = string(\"tyf\"), val = bool(false)];\n"
|
||||
" bool txt = const()[name = string(\"txt\"), val = bool(true)];\n"
|
||||
" tensor<fp16, [%d, %d]> dX = matmul(x = dY, y = W, transpose_x = txf, transpose_y = tyf)[name = string(\"dX\")];\n"
|
||||
// dW = dY^T @ dY (dY is used for both operands here; for Y = X @ W the weight gradient is X^T @ dY)
|
||||
" tensor<fp16, [%d, %d]> dW = matmul(x = dY, y = dY, transpose_x = txt, transpose_y = tyf)[name = string(\"dW\")];\n"
|
||||
// Concat [dX, dW, pad]
|
||||
" int32 ax = const()[name = string(\"ax\"), val = int32(0)];\n"
|
||||
" bool il = const()[name = string(\"il\"), val = bool(false)];\n"
|
||||
" tensor<fp16, [%d, %d]> pk = concat(values = (dX, dW, pad), axis = ax, interleave = il)[name = string(\"pk\")];\n"
|
||||
" tensor<int32, [4]> r4 = const()[name = string(\"r4\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
|
||||
" tensor<fp16, [1, %d, 1, %d]> o16 = reshape(x = pk, shape = r4)[name = string(\"o16\")];\n"
|
||||
" string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
|
||||
" tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = o16)[name = string(\"cast_out\")];\n"
|
||||
" } -> (cast_out);\n"
|
||||
"}\n",
|
||||
ch, sp, ch, sp,
|
||||
ch, sp, ch, sp,
|
||||
M, K, M, K,
|
||||
M, M + K, K, K, K,
|
||||
M + K, ch, sp, ch - M - K, sp,
|
||||
M, N,
|
||||
K, K,
|
||||
ch, sp,
|
||||
ch, sp, ch, sp,
|
||||
ch, sp];
|
||||
|
||||
printf(" dX = dY[%d,%d] @ W[%d,%d] -> [%d,%d]\n", M, K, K, N, M, N);
|
||||
printf(" dW = dY^T @ dY -> [%d,%d]\n\n", K, K);
|
||||
|
||||
err = nil;
|
||||
id engine = compileAndCreateEngine(bwdMIL, @"z1_backward",
|
||||
refContainer, cfg, refDesc, &err);
|
||||
|
||||
if (!engine) {
|
||||
printf(" Z1 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
|
||||
} else {
|
||||
printf(" Z1: Engine created\n");
|
||||
MLMultiArray *inputArr = [[MLMultiArray alloc]
|
||||
initWithShape:@[@1, @(ch), @1, @(sp)]
|
||||
dataType:MLMultiArrayDataTypeFloat32 error:nil];
|
||||
float *inPtr = (float *)[inputArr dataPointer];
|
||||
fill_random(inPtr, nElems, 0.3f);
|
||||
|
||||
MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
|
||||
initWithDictionary:@{inName: inputArr} error:nil];
|
||||
|
||||
NSError *runErr = nil;
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
|
||||
double ms = tb_ms(mach_absolute_time() - t0);
|
||||
|
||||
if (runErr || !result) {
|
||||
printf(" Z1 prediction FAILED: %s\n\n",
|
||||
runErr ? [[runErr description] UTF8String] : "nil");
|
||||
} else {
|
||||
MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
|
||||
if (outArr) {
|
||||
float *outPtr = (float *)[outArr dataPointer];
|
||||
|
||||
// CPU: dX = dY @ W
|
||||
float *dY_cpu = inPtr;
|
||||
float *W_cpu = inPtr + M * K;
|
||||
float *dX_cpu = (float *)calloc(M * N, sizeof(float));
|
||||
for (int i = 0; i < M; i++)
|
||||
for (int j = 0; j < N; j++) {
|
||||
float a = 0;
|
||||
for (int k = 0; k < K; k++)
|
||||
a += dY_cpu[i*K+k] * W_cpu[k*N+j];
|
||||
dX_cpu[i*N+j] = a;
|
||||
}
|
||||
|
||||
// CPU: dW = dY^T @ dY
|
||||
float *dW_cpu = (float *)calloc(K * K, sizeof(float));
|
||||
for (int i = 0; i < K; i++)
|
||||
for (int j = 0; j < K; j++) {
|
||||
float a = 0;
|
||||
for (int m = 0; m < M; m++)
|
||||
a += dY_cpu[m*K+i] * dY_cpu[m*K+j];
|
||||
dW_cpu[i*K+j] = a;
|
||||
}
|
||||
|
||||
print_first("ANE dX", outPtr, M * N);
|
||||
print_first("CPU dX", dX_cpu, M * N);
|
||||
float mad_dx = max_abs_diff(outPtr, dX_cpu, M * N);
|
||||
printf(" dX diff: %.6f, Rel: %.2e\n",
|
||||
mad_dx, mad_dx / (mean_abs(dX_cpu, M*N) + 1e-10f));
|
||||
|
||||
print_first("ANE dW", outPtr + M*N, K*K);
|
||||
print_first("CPU dW", dW_cpu, K*K);
|
||||
float mad_dw = max_abs_diff(outPtr + M*N, dW_cpu, K * K);
|
||||
printf(" dW diff: %.6f, Rel: %.2e\n",
|
||||
mad_dw, mad_dw / (mean_abs(dW_cpu, K*K) + 1e-10f));
|
||||
printf(" Time: %.3f ms\n", ms);
|
||||
printf(" %s\n\n",
|
||||
(mad_dx < 0.5f && mad_dw < 1.0f)
|
||||
? "*** Z1 PASSED ***" : "Z1: differences (fp16 precision)");
|
||||
|
||||
int NN = 100;
|
||||
t0 = mach_absolute_time();
|
||||
for (int i = 0; i < NN; i++) runEngine(engine, fp, opts, nil);
|
||||
printf(" Bench: %.4f ms/eval (%d iters)\n\n",
|
||||
tb_ms(mach_absolute_time() - t0) / NN, NN);
|
||||
|
||||
free(dX_cpu); free(dW_cpu);
|
||||
} else {
    printf(" Z1 output nil\n\n");
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
printf("================================================================\n");
|
||||
printf(" DONE\n");
|
||||
printf("================================================================\n");
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
|
@ -0,0 +1,238 @@
|
|||
// test_throughput_ceiling.m — Experiment I: Multi-kernel throughput ceiling
|
||||
// Measures CPU round-trip overhead for sequential ANE kernel execution
|
||||
// Build: make test_throughput_ceiling && ./test_throughput_ceiling
|
||||
#import <Foundation/Foundation.h>
|
||||
#import <mach/mach_time.h>
|
||||
#include <dispatch/dispatch.h>
|
||||
#include "ane_runtime.h"
|
||||
|
||||
static int g_fp16_io = 1;
|
||||
|
||||
static NSString *gen_conv_mil_fp16(int ch, int sp) {
|
||||
return [NSString stringWithFormat:
|
||||
@"program(1.0)\n[buildInfo = dict<tensor<string, []>, tensor<string, []>>"
|
||||
"({{\"coremlc-version\", \"3505.4.1\"}})]\n{\n"
|
||||
" func main<ios16>(tensor<fp16, [1, %d, 1, %d]> x) {\n"
|
||||
" tensor<string, []> pt = const()[name=tensor<string, []>(\"pt\"),"
|
||||
" val=tensor<string, []>(\"valid\")];\n"
|
||||
" tensor<int32, [2]> st = const()[name=tensor<string, []>(\"st\"),"
|
||||
" val=tensor<int32, [2]>([1,1])];\n"
|
||||
" tensor<int32, [4]> pd = const()[name=tensor<string, []>(\"pd\"),"
|
||||
" val=tensor<int32, [4]>([0,0,0,0])];\n"
|
||||
" tensor<int32, [2]> dl = const()[name=tensor<string, []>(\"dl\"),"
|
||||
" val=tensor<int32, [2]>([1,1])];\n"
|
||||
" tensor<int32, []> gr = const()[name=tensor<string, []>(\"gr\"),"
|
||||
" val=tensor<int32, []>(1)];\n"
|
||||
" tensor<fp16, [%d,%d,1,1]> W = const()[name=tensor<string, []>(\"W\"), "
|
||||
"val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=tensor<string, []>"
|
||||
"(\"@model_path/weights/weight.bin\"), offset=tensor<uint64, []>(64)))];\n"
|
||||
" tensor<fp16, [1,%d,1,%d]> y = conv(dilations=dl,groups=gr,"
|
||||
"pad=pd,pad_type=pt,strides=st,weight=W,x=x)"
|
||||
"[name=tensor<string, []>(\"conv\")];\n"
|
||||
" } -> (y);\n}\n", ch, sp, ch, ch, ch, ch, ch, sp];
|
||||
}
|
||||
|
||||
// Builds a ch x ch identity 1x1 conv kernel with fp16 input and output.
// Blob layout as written here: a 128-byte header followed by ch*ch fp16 weights.
// The MIL above references the blob at offset 64, where the 0xDEADBEEF marker sits.
static ANEKernel *compile_fp16_kernel(int ch, int sp) {
    int ws = ch * ch * 2;                      // weight payload size in bytes (fp16)
    int tot = 128 + ws;                        // header + payload
    uint8_t *blob = (uint8_t *)calloc((size_t)tot, 1);
    if (!blob) return NULL;                    // callers treat NULL as a compile failure
    blob[0] = 1; blob[4] = 2;
    blob[64] = 0xEF; blob[65] = 0xBE; blob[66] = 0xAD; blob[67] = 0xDE;  // 0xDEADBEEF, little-endian
    blob[68] = 1;
    *(uint32_t *)(blob + 72) = (uint32_t)ws;   // stores the payload size in bytes
    *(uint32_t *)(blob + 80) = 128;            // stores the payload offset from the start of the blob
    _Float16 *wp = (_Float16 *)(blob + 128);
    for (int i = 0; i < ch; i++) wp[i * ch + i] = (_Float16)1.0f;  // identity weights
    NSData *wdata = [NSData dataWithBytesNoCopy:blob length:(NSUInteger)tot
                                   freeWhenDone:YES];

    NSString *mil = gen_conv_mil_fp16(ch, sp);
    NSData *md = [mil dataUsingEncoding:NSUTF8StringEncoding];
    size_t ioBytes = (size_t)ch * sp * 2;      // one fp16 [1, ch, 1, sp] tensor
    return ane_compile(md, wdata, 1, &ioBytes, 1, &ioBytes);
}
|
||||
|
||||
int main(int argc, const char *argv[]) {
|
||||
(void)argc; (void)argv;
|
||||
@autoreleasepool {
|
||||
mach_timebase_info_data_t tb;
|
||||
mach_timebase_info(&tb);
|
||||
|
||||
printf("============================================================\n");
|
||||
printf(" Experiment I: Multi-Kernel Throughput Ceiling\n");
|
||||
printf(" Measuring CPU round-trip overhead for sequential ANE ops\n");
|
||||
printf("============================================================\n\n");
|
||||
|
||||
ane_init();
|
||||
if (!g_ane_ok) { printf("ANE not available\n"); return 1; }
|
||||
|
||||
typedef struct { int ch; int sp; const char *name; } Config;
|
||||
Config configs[] = {
|
||||
{64, 32, "64x32 (test)"},
|
||||
{256, 64, "256x64 (small)"},
|
||||
{768, 256, "768x256 (prod)"},
|
||||
};
|
||||
int nconfigs = sizeof(configs) / sizeof(configs[0]);
|
||||
|
||||
for (int ci = 0; ci < nconfigs; ci++) {
|
||||
Config cfg = configs[ci];
|
||||
printf("=== Config: %s ===\n", cfg.name);
|
||||
|
||||
int nlayers = 12;
|
||||
ANEKernel *kernels[12];
|
||||
int compiled = 0;
|
||||
for (int i = 0; i < nlayers; i++) {
|
||||
@try {
|
||||
kernels[i] = compile_fp16_kernel(cfg.ch, cfg.sp);
|
||||
if (!kernels[i]) {
|
||||
printf(" Kernel %d compile failed\n", i);
|
||||
break;
|
||||
}
|
||||
compiled++;
|
||||
} @catch (NSException *ex) {
|
||||
printf(" Kernel %d exception: %s\n", i,
|
||||
[[ex reason] UTF8String]);
|
||||
break;
|
||||
}
|
||||
}
|
||||
printf(" Compiled %d/%d kernels\n", compiled, nlayers);
|
||||
if (compiled < 2) {
|
||||
printf(" Need at least 2 kernels, skipping\n\n");
|
||||
for (int i = 0; i < compiled; i++) ane_free(kernels[i]);
|
||||
continue;
|
||||
}
|
||||
|
||||
size_t ioBytes = (size_t)cfg.ch * cfg.sp * 2;
|
||||
int warmup = 5;
|
||||
int iters = 50;
|
||||
|
||||
// --- Test 1: Sequential (run + memcpy chain) ---
|
||||
printf("\n --- Test 1: Sequential (run + memcpy) ---\n");
|
||||
{
|
||||
for (int w = 0; w < warmup; w++) {
|
||||
@try {
|
||||
for (int i = 0; i < compiled; i++)
|
||||
ane_eval(kernels[i]);
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
}
|
||||
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
for (int it = 0; it < iters; it++) {
|
||||
for (int i = 0; i < compiled - 1; i++) {
|
||||
@try {
|
||||
ane_eval(kernels[i]);
|
||||
IOSurfaceLock(kernels[i]->ioOutputs[0],
|
||||
kIOSurfaceLockReadOnly, NULL);
|
||||
IOSurfaceLock(kernels[i+1]->ioInputs[0], 0, NULL);
|
||||
memcpy(
|
||||
IOSurfaceGetBaseAddress(kernels[i+1]->ioInputs[0]),
|
||||
IOSurfaceGetBaseAddress(kernels[i]->ioOutputs[0]),
|
||||
ioBytes);
|
||||
IOSurfaceUnlock(kernels[i+1]->ioInputs[0], 0, NULL);
|
||||
IOSurfaceUnlock(kernels[i]->ioOutputs[0],
|
||||
kIOSurfaceLockReadOnly, NULL);
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
}
|
||||
@try {
|
||||
ane_eval(kernels[compiled - 1]);
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
}
|
||||
double totalMs = (double)(mach_absolute_time() - t0) * tb.numer / tb.denom / 1e6;
|
||||
double perIter = totalMs / iters;
|
||||
double perKernel = perIter / compiled;
|
||||
printf(" Total: %.2f ms/pass (%d kernels)\n", perIter, compiled);
|
||||
printf(" Per kernel: %.3f ms\n", perKernel);
|
||||
printf(" Throughput: %.0f kernels/s\n", compiled * 1000.0 / perIter);
|
||||
}
|
||||
|
||||
// --- Test 2: Run-only (no memcpy, pure ANE overhead) ---
|
||||
printf("\n --- Test 2: Run-only (no memcpy between) ---\n");
|
||||
{
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
for (int it = 0; it < iters; it++) {
|
||||
for (int i = 0; i < compiled; i++) {
|
||||
@try {
|
||||
ane_eval(kernels[i]);
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
}
|
||||
}
|
||||
double totalMs = (double)(mach_absolute_time() - t0) * tb.numer / tb.denom / 1e6;
|
||||
double perIter = totalMs / iters;
|
||||
double perKernel = perIter / compiled;
|
||||
printf(" Total: %.2f ms/pass (%d kernels)\n", perIter, compiled);
|
||||
printf(" Per kernel: %.3f ms\n", perKernel);
|
||||
printf(" Throughput: %.0f kernels/s\n", compiled * 1000.0 / perIter);
|
||||
}
|
||||
|
||||
// --- Test 3: Memcpy-only overhead (timing includes IOSurface lock/unlock) ---
|
||||
printf("\n --- Test 3: Memcpy-only overhead ---\n");
|
||||
{
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
for (int it = 0; it < iters * 10; it++) {
|
||||
for (int i = 0; i < compiled - 1; i++) {
|
||||
IOSurfaceLock(kernels[i]->ioOutputs[0], kIOSurfaceLockReadOnly, NULL);
|
||||
IOSurfaceLock(kernels[i+1]->ioInputs[0], 0, NULL);
|
||||
memcpy(
|
||||
IOSurfaceGetBaseAddress(kernels[i+1]->ioInputs[0]),
|
||||
IOSurfaceGetBaseAddress(kernels[i]->ioOutputs[0]),
|
||||
ioBytes);
|
||||
IOSurfaceUnlock(kernels[i+1]->ioInputs[0], 0, NULL);
|
||||
IOSurfaceUnlock(kernels[i]->ioOutputs[0], kIOSurfaceLockReadOnly, NULL);
|
||||
}
|
||||
}
|
||||
double totalMs = (double)(mach_absolute_time() - t0) * tb.numer / tb.denom / 1e6;
|
||||
double perIter = totalMs / (iters * 10);
|
||||
double perCopy = perIter / (compiled - 1);
|
||||
printf(" Total: %.3f ms/pass (%d copies)\n", perIter, compiled - 1);
|
||||
printf(" Per memcpy: %.4f ms (%lu bytes)\n", perCopy, (unsigned long)ioBytes);
|
||||
}
|
||||
|
||||
// --- Test 4: GCD serial queue ---
|
||||
printf("\n --- Test 4: GCD serial queue ---\n");
|
||||
{
|
||||
ANEKernel **kptrs = (ANEKernel **)malloc(
|
||||
(size_t)compiled * sizeof(ANEKernel *));
|
||||
for (int i = 0; i < compiled; i++) kptrs[i] = kernels[i];
|
||||
|
||||
dispatch_queue_t q = dispatch_queue_create(
|
||||
"ane.throughput", DISPATCH_QUEUE_SERIAL);
|
||||
dispatch_semaphore_t sem = dispatch_semaphore_create(0);
|
||||
const int ncomp = compiled;
|
||||
|
||||
uint64_t t0 = mach_absolute_time();
|
||||
for (int it = 0; it < iters; it++) {
|
||||
__block int done = 0;
|
||||
for (int i = 0; i < ncomp; i++) {
|
||||
ANEKernel *kp = kptrs[i];
|
||||
dispatch_async(q, ^{
|
||||
@try {
|
||||
ane_eval(kp);
|
||||
} @catch (NSException *ex) { (void)ex; }
|
||||
done++;
|
||||
if (done == ncomp)
|
||||
dispatch_semaphore_signal(sem);
|
||||
});
|
||||
}
|
||||
dispatch_semaphore_wait(sem, DISPATCH_TIME_FOREVER);
|
||||
}
|
||||
double totalMs = (double)(mach_absolute_time() - t0)
|
||||
* tb.numer / tb.denom / 1e6;
|
||||
double perIter = totalMs / iters;
|
||||
printf(" Total: %.2f ms/pass (%d kernels, serial queue)\n",
|
||||
perIter, ncomp);
|
||||
printf(" Per kernel: %.3f ms\n", perIter / ncomp);
|
||||
free(kptrs);
|
||||
}
|
||||
|
||||
printf("\n --- CPU Round-trip Overhead ---\n");
|
||||
printf(" Overhead = (Sequential - RunOnly) / %d copies\n", compiled - 1);
|
||||
printf(" This is what chaining would eliminate per layer.\n");
|
||||
|
||||
for (int i = 0; i < compiled; i++) ane_free(kernels[i]);
|
||||
printf("\n");
|
||||
}
|
||||
|
||||
printf("Done.\n");
|
||||
}
|
||||
return 0;
|
||||
}
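The test prints the formula for the per-layer round-trip overhead but leaves the subtraction to the reader. Below is a minimal sketch of that arithmetic in plain C, assuming the per-pass times from Test 1 (sequential) and Test 2 (run-only) are available in milliseconds. The function names `ticks_to_ms` and `per_copy_overhead_ms` are hypothetical and are not part of `ane_runtime.h`.

```c
#include <assert.h>
#include <stdint.h>

/* Convert mach_absolute_time() ticks to milliseconds using the
 * numer/denom pair from mach_timebase_info(). Same arithmetic as the
 * test above: ticks * numer / denom gives nanoseconds. */
static double ticks_to_ms(uint64_t ticks, uint32_t numer, uint32_t denom) {
    assert(denom != 0);
    return (double)ticks * (double)numer / (double)denom / 1e6;
}

/* Overhead = (Sequential - RunOnly) / (kernels - 1), in ms per copy.
 * Returns 0 when there are fewer than two kernels (no copies happen). */
static double per_copy_overhead_ms(double sequential_ms, double run_only_ms,
                                   int kernels) {
    if (kernels < 2) return 0.0;
    return (sequential_ms - run_only_ms) / (double)(kernels - 1);
}
```

A result near zero, or a negative one, would suggest that measurement noise is larger than the copy cost at that tensor size, so more iterations may be needed before drawing conclusions.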