# ANE ChainingRequest API Research
Research into Apple Neural Engine private APIs for multi-kernel pipelining, conducted on M4 Max / macOS 15.
**Goal**: Eliminate CPU round-trips between ANE layer evaluations. In a 12-layer model, sequential evaluation requires 23+ CPU-ANE round-trips per token. The `_ANEChainingRequest` API appears designed to let the ANE run operations back-to-back in a hardware pipeline, keeping data on-chip.
**Status**: ChainingRequest validates and `prepareChainingWithModel:` no longer crashes (crash fix: pass nil for symbol/procedure params). Blocked on Code=15 (`ANEProgramChainingPrepare Failed`) -- the `_ANEModel` needs Espresso IR format (not MIL) for full symbol table population. At production dims (768x256), sequential ANE dispatch costs ~0.2ms/kernel; chaining would save ~23 round-trips per token.
See also: [ANE_INTERNALS.md](ANE_INTERNALS.md) for comprehensive ANE documentation including compilation pipeline, hardware specs, and community research references.
---
## Test Files
| File | Purpose |
|------|---------|
| `training/test_chaining.m` | v1 prototype: sequential baseline + ChainingRequest creation |
| `training/test_chaining_v2.m` | v2 deep exploration: 6-phase probe of 12+ private classes |
| `training/test_ane_model.m` | Experiments E-P: _ANEModel loading, compiler, chaining, fences, type encoding, mapping |
| `training/test_throughput_ceiling.m` | Experiment I: 12-kernel throughput ceiling benchmark |
Build and run:
```bash
cd training
make test_chaining && ./test_chaining
make test_chaining_v2 && ./test_chaining_v2
make test_ane_model && ./test_ane_model
make test_throughput_ceiling && ./test_throughput_ceiling
```
---
## 1. Executive Summary
### What works
| Finding | Impact | Status |
|---------|--------|--------|
| `evaluateRealTimeWithModel:` via `_ANEClient` | 1.88x faster on small kernels (64x32); **no benefit at production dims** (768x256) | Benchmarked |
| `processRequest` via `_ANEProgramForEvaluation` | 1.34x faster on small kernels; marginal at production dims | Benchmarked |
| `_ANEBuffer` wraps IOSurface with `symbolIndex` | Solves input indexing for chaining | Proven |
| All 9 unexplored ANE classes exist on M4 Max | Full API surfaces documented | Documented |
> **Important**: The RT execution speedup (1.88x) observed in isolated testing on 64x32 convolution kernels does **not** generalize to production dimensions. At 768x256 (Stories110M size), all four execution paths converge to ~0.2 ms per kernel. See [Production Dimension Results](#production-dimension-results-test_bench_pathsm-m4-max) below.
### What's been solved
| Finding | Status | Detail |
|---------|--------|--------|
| `_ANEIOSurfaceOutputSets` works with 64-byte statsSurRef | **SOLVED** | Any non-NULL IOSurface works as stats buffer |
| `_ANEChainingRequest.validate` returns YES | **SOLVED** | With proper `_ANEBuffer` inputs + `_ANEIOSurfaceOutputSets` outputs |
| `processRequest` via `_ANEProgramForEvaluation` | **1.34x faster** | Lower-level eval (0.131 ms vs 0.175 ms) |
| ChainingRequest factory crash (`[NSConstantIntegerNumber count]`) | **SOLVED** | Pass `nil` for `lbInputSymbolId`, `lbOutputSymbolId`, `procedureIndex` |
| `_ANEModel` loading from temp directory | **SOLVED** | `modelAtURL:key:` with tmpDir URL + hexStringIdentifier |
| `_ANESharedSignalEvent` / `_ANESharedWaitEvent` | **SOLVED** | Use `MTLSharedEvent` or `IOSurfaceSharedEventCreate()` |
| ChainingRequest type encodings | **DOCUMENTED** | All 9 factory params are `@` (object). `prepare` has 5 params (3x`@`, 1x`I` qos, 1x`^@` err) |
### What's still blocked
| Blocker | Root Cause |
|---------|------------|
| `prepareChainingWithModel:` returns Code=15 | `ANEProgramChainingPrepare() Failed` -- model not recognized as chaining-capable |
| `_ANEModel` has empty symbol table | MIL-compiled model shell lacks Espresso IR data (`model.espresso.net`) |
| `_ANEClient.loadModel:` / `compileModel:` fail | Require Espresso IR format, not MIL |
| `_ANEProgramIOSurfacesMapper` returns NO | Needs fully loaded model with symbol table |
| `_ANEPerformanceStats` with `_ANERequest` | Request expects `statType` selector on perfStats objects |
---
## 2. ANE Private API Class Map
### Core Classes (known working)
**`_ANEInMemoryModel`** -- the model object for in-memory MIL compilation.
- `+inMemoryModelWithDescriptor:` -- create from `_ANEInMemoryModelDescriptor`
- `-compileWithQoS:options:error:` -- compile MIL to ANE binary
- `-loadWithQoS:options:error:` -- load compiled model onto ANE
- `-evaluateWithQoS:options:request:error:` -- standard evaluation (QoS 0-63, 21 default)
- `-unloadWithQoS:error:` -- unload from ANE
- Properties: `hexStringIdentifier`, `programHandle` (uint64), `program` (`_ANEProgramForEvaluation`), `perfStatsMask`
- Missing: `inputSymbolNames`, `outputSymbolNames`, `inputSymbolIndicesForProcedureIndex:`
**`_ANEInMemoryModelDescriptor`** -- model specification.
- `+modelWithMILText:weights:optionsPlist:` -- create descriptor from MIL NSData + weight dict
**`_ANERequest`** -- evaluation request packaging I/O surfaces.
- `+requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:`
- `perfStats` parameter expects `NSArray` of stat info objects (not `_ANEPerformanceStats`)
**`_ANEIOSurfaceObject`** -- thin wrapper around `IOSurfaceRef`.
- `+objectWithIOSurface:` -- wrap a raw IOSurface
- Does NOT have `symbolIndex` property (this is the v1 blocker)
**`_ANEClient`** -- client connection to the ANE daemon.
- `+sharedConnection` -- singleton accessor
- `-evaluateWithModel:options:request:qos:error:` -- 5-param eval via client
- `-evaluateRealTimeWithModel:options:request:error:` -- **RT priority eval (1.74-1.88x faster on 64x32 test kernels; no benefit at production dims)**
- `-doEvaluateDirectWithModel:options:request:qos:error:` -- direct eval bypass
- `-beginRealTimeTask` / `-endRealTimeTask` -- RT task bracketing (both return NO, but RT eval still works)
- `-prepareChainingWithModel:options:chainingReq:qos:error:` -- chaining setup
- `-enqueueSetsWithModel:outputSet:options:qos:error:` -- chaining output enqueue
- `-buffersReadyWithModel:inputBuffers:options:qos:error:` -- chaining input signal
### Discovered Classes (v2 exploration)
**`_ANEBuffer`** -- wraps `_ANEIOSurfaceObject` with index metadata. **Key discovery.**
- `+bufferWithIOSurfaceObject:symbolIndex:source:` -- factory
- `ioSurfaceObject`: an `_ANEIOSurfaceObject` (NOT raw `IOSurfaceRef`)
- `symbolIndex`: `NSNumber` mapping to compiled model I/O symbol
- `source`: `long long` -- 0=ANE, 1=output, 2=unknown
- Properties: `ioSurfaceObject`, `symbolIndex`, `source`
- Description format: `"_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=0 }"`
**`_ANEProgramIOSurfacesMapper`** -- maps IOSurfaces to compiled model symbols.
- `+mapperWithProgramHandle:(uint64_t)handle` -- works, creates mapper
- `+mapperWithController:(id)ctrl` -- alternative factory
- `-mapIOSurfacesWithModel:request:cacheInference:error:` -- **FAILS** on `_ANEInMemoryModel` (calls `inputSymbolIndicesForProcedureIndex:` which doesn't exist)
- `-validateRequest:model:` -- also fails for same reason
- Implication: designed for `_ANEModel` (disk-based compiled models), not in-memory MIL
**`_ANEProgramForEvaluation`** -- lower-level evaluation program.
- Accessible via `model.program` property
- `+programWithHandle:intermediateBufferHandle:queueDepth:` -- factory
- `-processRequest:model:qos:qIndex:modelStringID:options:returnValue:error:` -- low-level eval
**`_ANEIOSurfaceOutputSets`** -- output set packaging for chaining.
- `+objectWithstatsSurRef:outputBuffer:` -- factory
- `statsSurRef`: `IOSurfaceRef` for perf stats collection -- **returns nil when NULL**
- `outputBuffer`: `NSArray` of `_ANEBuffer` objects
- Initially a blocker because the required stats IOSurface format was unknown; now resolved: any non-NULL IOSurface works (a 64-byte surface is sufficient)
**`_ANEInputBuffersReady`** -- input signaling for chaining pipeline.
- `+inputBuffersWithProcedureIndex:inputBufferInfoIndex:inputFreeValue:executionDelay:`
- Parameters: procedure index, buffer info indices, free values, execution delay
- This is the mechanism that tells the ANE "inputs are ready, start processing"
**`_ANEOutputSetEnqueue`** -- output pipeline configuration for chaining.
- `+outputSetWithProcedureIndex:setIndex:signalValue:signalNotRequired:isOpenLoop:`
- Configures output set enqueue behavior with signal values and open-loop mode
**`_ANEChainingRequest`** -- the chaining request itself.
- `+chainingRequestWithInputs:outputSets:lbInputSymbolId:lbOutputSymbolId:procedureIndex:signalEvents:transactionHandle:fwEnqueueDelay:memoryPoolId:`
- `-validate` -- returns YES/NO
- Expects `inputs` as `_ANEBuffer` objects, `outputSets` as `_ANEIOSurfaceOutputSets` objects
**`_ANEModelInstanceParameters`** -- model instance configuration.
- Alloc/init produces a valid object
- API surface dumped but not yet exercised
**`_ANEDeviceController`** -- device-level controller.
- `+controllerWithProgramHandle:` -- attempted but returned nil in our tests
**`_ANEQoSMapper`** -- QoS level mapping.
- API surface dumped, not yet exercised
**`_ANEPerformanceStats`** -- performance statistics.
- `+statsWithHardwareExecutionNS:(uint64_t)ns` -- factory
- Properties: `hwExecutionTime`, `performanceCounters`
- Cannot be used with `_ANERequest.perfStats` (expects array of objects with `statType` selector)
- Setting `perfStatsMask=0xFF` on model works but `performanceCounters` returns nil
**`_ANESharedSignalEvent` / `_ANESharedWaitEvent`** -- hardware sync primitives (factories verified in Experiment G).
- Likely the fence mechanism for GPU-ANE or multi-model synchronization
- Referenced in `_ANEChainingRequest.signalEvents` parameter
---
## 3. Experiment Logs
### v1: test_chaining.m Results (M4 Max)
```
=== ANE ChainingRequest Prototype ===
All required classes found.
--- Phase 1: Compile two identical conv kernels ---
Kernel 1: compiled and loaded
Kernel 2: compiled and loaded
--- Phase 2: Baseline (sequential eval) ---
Sequential: 10.355 ms total (0.207 ms/pair)
Output[0..3]: [0.2500, 0.2500, 0.2500, 0.2500]
--- Phase 3: _ANEChainingRequest exploration ---
_ANEClient: obtained
ChainingRequest created: _ANEChainingRequest: { inputBuffer=(
"_ANEIOSurfaceObject: { ioSurface=0x... ; startOffset=0 }"
) ; outputSets=( ... ) }
validate: NO
--- Phase 4: Loopback ChainingRequest ---
ChainingRequest created (loopback)
validate: NO
prepareChainingWithModel: EXCEPTION (validate fails first)
--- Summary ---
Sequential baseline: 0.207 ms/pair (two evals + memcpy)
ChainingRequest: creates but validate FAILS
Root cause: _ANEIOSurfaceObject lacks symbolIndex property
Next: explore _ANEBuffer and _ANEProgramIOSurfacesMapper
```
### v2: test_chaining_v2.m Results (M4 Max)
**Phase 1: Class Introspection**
- 9 classes found, 0 missing
- All classes exist on M4 Max / macOS 15
- Full method lists, properties, and type encodings dumped for each
**Phase 2: Symbol Name Discovery**
- `inputSymbolNames`: NOT available on `_ANEInMemoryModel`
- `outputSymbolNames`: NOT available on `_ANEInMemoryModel`
- `programHandle`: YES (uint64 handle to compiled program)
- `_ANEIOSurfaceObject` does NOT have `symbolIndex` getter or setter
- `+objectWithIOSurface:symbolIndex:` class method NOT available
**Phase 3: IOSurface Mapper & Buffer Experiments**
3a: `_ANEProgramIOSurfacesMapper`
```
mapperWithProgramHandle(12345): created successfully
mapIOSurfacesWithModel: EXCEPTION
-[_ANEInMemoryModel inputSymbolIndicesForProcedureIndex:]:
unrecognized selector
validateRequest:model: EXCEPTION (same reason)
```
3b: `_ANEBuffer` -- **success**
```
bufferWithIOSurfaceObject(symIdx=0, source=0):
_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=0 }
bufferWithIOSurfaceObject(symIdx=0, source=1):
_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=1 }
bufferWithIOSurfaceObject(symIdx=0, source=2):
_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=2 }
bufferWithIOSurfaceObject(symIdx=1, source=0):
_ANEBuffer: { ioSurface=0x... ; symbolIndex=1 ; ANEBufferProducerAgent=0 }
symbolIndex property: accessible and correct
```
3c: `_ANEIOSurfaceObject` symbolIndex experiments
```
setSymbolIndex: NOT available on _ANEIOSurfaceObject
symbolIndex getter: NOT available
+objectWithIOSurface:symbolIndex: NOT available
```
3d: IOSurface property experiments
```
IOSurface 'symbolIndex' property (set via IOSurfaceSetValue): 0
_ANEIOSurfaceObject.symbolIndex after property set: <exception>
(IOSurface user properties do NOT propagate to _ANEIOSurfaceObject)
```
3e: `_ANEProgramForEvaluation`
```
k1.model.program: <_ANEProgramForEvaluation: 0x...>
(accessible via model.program property)
```
**Phase 4: ChainingRequest Retry**
4a: Sequential baseline
```
Sequential: 0.259 ms/pair (50 iters)
Output[0..3]: [0.2500, 0.2500, 0.2500, 0.2500]
```
Attempts 1-4: Various raw IOSurface configurations
```
[Attempt 1] Standard (raw IOSurfaceObject): CRASH
-[_ANEIOSurfaceObject symbolIndex]: unrecognized selector
[Attempt 2] IOSurface with symbolIndex property: CRASH (same)
[Attempt 3] Two-model loopback: CRASH (same)
[Attempt 4] Skip validate, call prepareChainingWithModel directly: CRASH (same)
```
Attempt 5: `_ANEBuffer` + `_ANEIOSurfaceOutputSets`
```
bufIn: _ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=0 }
bufOut: _ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=1 }
outputSet (objectWithstatsSurRef:NULL outputBuffer:@[bufOut]): nil
-> _ANEIOSurfaceOutputSets returns nil when statsSurRef is NULL
```
Attempt 6: `_ANEClient.evaluateWithModel:` -- **works**
```
evaluateWithModel (via client): YES
```
Attempt 7: `_ANEClient.doEvaluateDirectWithModel:` -- **works**
```
doEvaluateDirectWithModel: YES
```
**Phase 5: Alternative Execution Paths**
5a: Real-time eval -- **1.7x speedup**
```
beginRealTimeTask: NO (possibly needs entitlement)
evaluateRealTimeWithModel: YES
RT eval: 0.090 ms/eval avg (50 iters)
Standard eval: 0.157 ms/eval avg (50 iters)
RT vs Standard speedup: 1.74x
endRealTimeTask: NO
```
5b: PerfStats
```
perfStatsMask = 0x01..0x80: set OK (all masks accepted)
statsWithHardwareExecutionNS:0 = <_ANEPerformanceStats>
Eval with @[perfStats]: OK (no crash when wrapped in array)
hwExecutionTime after eval: nil
Eval with mask=0xFF, perfStats=nil: OK
performanceCounters: nil
```
---
## 4. Evaluation Path Benchmarks
Measured on 64x32 convolution kernels, M4 Max, 200 iterations after 10 warmup:
| Method | Latency | Speedup | API |
|--------|---------|---------|-----|
| `evaluateWithQoS:` (standard) | 0.175 ms | 1.0x | `model.evaluateWithQoS:options:request:error:` |
| `evaluateRealTimeWithModel:` | 0.093 ms | **1.88x** | `client.evaluateRealTimeWithModel:options:request:error:` |
| `processRequest` | 0.131 ms | **1.34x** | `program.processRequest:model:qos:qIndex:modelStringID:options:returnValue:error:` |
| `doEvaluateDirectWithModel:` | 0.225 ms | 0.78x | `client.doEvaluateDirectWithModel:options:request:qos:error:` |
Key observations (small kernel, isolated):
- RT eval was fastest in isolated test (1.88x speedup on 64x32)
- `processRequest` was faster than standard but slower than RT
- `doEvaluateDirectWithModel` was actually **slower** than standard (0.78x)
- `beginRealTimeTask` returning NO does not prevent `evaluateRealTimeWithModel:` from working
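
The warmup-then-measure protocol behind these numbers can be sketched in plain C, which also compiles inside the `.m` test files. This is a minimal sketch, not the code used in the test files: `bench_fn`, `bench_mean_ms`, `now_ms`, and `count_call` are hypothetical names, and the callback stands in for one ANE evaluation call.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one evaluation of whichever path is being timed. */
typedef void (*bench_fn)(void *ctx);

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Run `warmup` untimed calls, then return the mean latency in ms
 * of `iters` timed calls. */
static double bench_mean_ms(bench_fn fn, void *ctx, int warmup, int iters) {
    assert(fn != NULL && warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; i++) fn(ctx);
    double t0 = now_ms();
    for (int i = 0; i < iters; i++) fn(ctx);
    return (now_ms() - t0) / (double)iters;
}

/* Stand-in workload for illustration: counts how often it was invoked. */
static void count_call(void *ctx) { (*(int *)ctx)++; }
```

Calling `bench_mean_ms(count_call, &calls, 10, 200)` invokes the callback 210 times, and only the last 200 calls contribute to the reported mean.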
### Production Dimension Results (test_bench_paths.m, M4 Max)
At realistic kernel sizes with multiple compiled models, the picture changes:
| Config | Standard | RT | processRequest | ane_eval_rt |
|--------|----------|-----|----------------|-------------|
| 64x32 (test) | 0.109 ms | 0.233 ms (0.5x) | 0.156 ms (0.7x) | 0.195 ms (0.6x) |
| 128x64 | 0.208 ms | 0.184 ms (1.1x) | 0.201 ms (1.0x) | 0.185 ms (1.1x) |
| 256x64 | 0.197 ms | 0.212 ms (0.9x) | 0.203 ms (1.0x) | 0.157 ms (1.3x) |
| 512x64 | 0.120 ms | 0.147 ms (0.8x) | 0.194 ms (0.6x) | 0.179 ms (0.7x) |
| 768x256 (prod) | 0.205 ms | 0.246 ms (0.8x) | 0.185 ms (1.1x) | 0.291 ms (0.7x) |
**Key finding**: The RT eval speedup observed in isolated testing (1.88x) does not hold at production dimensions. At 768x256 (Stories110M size), all eval paths perform similarly (~0.2 ms), with standard eval being competitive or fastest. The overhead of the client-based paths (RT, direct) outweighs any ANE scheduling benefit at scale.
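
The parenthesised factors in the table are consistent with `standard latency / path latency`, so a value below 1.0x means the path is slower than standard eval. A small C sketch of that calculation for the 768x256 row follows; the function and constant names are illustrative, not taken from the test code.

```c
#include <assert.h>

/* Speedup of an alternative path relative to standard eval.
 * > 1.0 means the alternative is faster; < 1.0 means it is slower. */
static double speedup_vs_standard(double standard_ms, double path_ms) {
    assert(standard_ms > 0.0 && path_ms > 0.0);
    return standard_ms / path_ms;
}

/* 768x256 (prod) row of the table above, in milliseconds. */
static const double PROD_STANDARD_MS = 0.205;
static const double PROD_RT_MS = 0.246;
static const double PROD_PROCESS_REQUEST_MS = 0.185;
static const double PROD_ANE_EVAL_RT_MS = 0.291;
```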
---
## 5. Remaining Blockers and Next Steps
### SOLVED: _ANEIOSurfaceOutputSets statsSurRef
The chaining pipeline requires:
1. Inputs as `_ANEBuffer` objects with `symbolIndex` -- **SOLVED**
2. OutputSets as `_ANEIOSurfaceOutputSets` objects -- **SOLVED**
A 64-byte IOSurface as `statsSurRef` is sufficient. `_ANEChainingRequest.validate` returns YES with this setup.
### SOLVED: ChainingRequest parameter type mismatch (Experiment K-L)
The `[NSConstantIntegerNumber count]` crash was caused by passing `NSNumber` values for `lbInputSymbolId`, `lbOutputSymbolId`, and `procedureIndex`. Type encoding analysis (Experiment K) showed that all 9 factory parameters are `@` (id/object), and the factory internally calls `count` on them, which `NSNumber` does not implement. Passing arrays instead does not help: Experiment L found that they crash on `unsignedIntegerValue`. Only `nil` avoids both crashes.
**Fix**: Pass `nil` for `lbInputSymbolId`, `lbOutputSymbolId`, and `procedureIndex`:
```objc
chainingRequestWithInputs:@[buf] outputSets:@[outSet]
lbInputSymbolId:nil lbOutputSymbolId:nil procedureIndex:nil
signalEvents:@[] transactionHandle:@0 fwEnqueueDelay:@0 memoryPoolId:@0
```
This produces a valid `_ANEChainingRequest` (`validate` returns YES) and `prepareChainingWithModel:` no longer crashes.
### Current Blocker: ANEProgramChainingPrepare() Failed (Code=15)
`prepareChainingWithModel:` now returns NO with error:
```
Error Domain=com.apple.appleneuralengine Code=15
"ANEProgramChainingPrepare() Failed: Program chaining prepare error"
```
Two of the three model types tested return this error; the third crashes before reaching it:
- Fresh `_ANEModel` (state=1, populated with programHandle+program): Code=15
- Populated `_ANEModel` from Experiment E (state=5 after failed loadModel/compileModel): Code=15
- `_ANEInMemoryModel`: crashes on `getUUID` (cannot be used with chaining at all)
The `Code=15` error is a **logical failure** in the ANE daemon's chaining preparation, not a crash. The model is not fully recognized as "chaining-capable" by the daemon, likely because:
1. The `_ANEModel` was populated by copying `programHandle`/`program` from an `_ANEInMemoryModel`, not loaded through the standard CoreML/Espresso pipeline
2. Symbol indices remain empty (the daemon may require them for chaining buffer routing)
3. The model needs `model.espresso.net` format (not MIL) for `_ANEClient.loadModel:` / `compileModel:`
**Previous blocker (SOLVED)**: `[NSConstantIntegerNumber count]` crash -- fixed by passing `nil` for symbol/procedure params.
### Experiments E-H Results (test_ane_model.m)
#### Experiment E: _ANEModel Loading -- SOLVED
`_ANEModel.modelAtURL:key:` works with the compiled temp directory URL and `hexStringIdentifier` as key:
```
diskModel = _ANEModel.modelAtURL:key:(tmpDirURL, hexId)
-> _ANEModel with UUID, getUUID works
-> state=1, program=nil, programHandle=0 (shell only)
```
Populating the shell with `_ANEInMemoryModel` data:
```
diskModel.setProgramHandle:(inMemoryModel.programHandle) -> success
diskModel.setProgram:(inMemoryModel.program) -> success
```
After population, `programHandle` and `program` are set, but `inputSymbolIndicesForProcedureIndex:0` still returns empty `NSIndexSet`. The symbol table data isn't stored in the `_ANEProgramForEvaluation` -- it's likely in the `model.hwx` or `net.plist` that the standard CoreML path generates.
#### Experiment E2: ANECompiler -- No ObjC API
- `ANECompiler.framework` exists at `/System/Library/PrivateFrameworks/ANECompiler.framework/` but contains **no ObjC classes** -- it's a pure C library (`ANECCompile()` is the entry point, called internally by `_ANEInMemoryModel.compileWithQoS:`)
- `debug_mask` option had no visible effect on compilation output
- No `ane_compiler_service` found at standard paths
- Key `_ANEInMemoryModel` compilation methods found: `saveModelFiles`, `localModelPath`, `compiledModelExists`, `mapIOSurfacesWithRequest:cacheInference:error:`
#### Experiment F: Chaining Pipeline -- Blocked
With populated `_ANEModel` (has UUID + programHandle + program), `prepareChainingWithModel:` still crashed on `[NSConstantIntegerNumber count]` at this stage. The crash is in the `_ANEChainingRequest` parameter handling, not in the model itself; Experiment L later resolved it by passing `nil` for the symbol/procedure params.
#### Experiment G: Hardware Fences -- FULLY SOLVED
Both `_ANESharedSignalEvent` and `_ANESharedWaitEvent` now work:
```objc
// MTLSharedEvent via Metal (works)
id device = MTLCreateSystemDefaultDevice();
id sharedEvent = [device newSharedEvent];
// IOSurfaceSharedEvent via IOKit (also works)
id iosEvent = IOSurfaceSharedEventCreate();
// Signal event factory: (uint64_t value, unsigned int symbolIndex, long long eventType, id sharedEvent)
_ANESharedSignalEvent.signalEventWithValue:symbolIndex:eventType:sharedEvent:
-> works with both MTLSharedEvent and IOSurfaceSharedEvent
// Wait event factory: (uint64_t value, id sharedEvent)
_ANESharedWaitEvent.waitEventWithValue:sharedEvent:
-> works with both event types
```
Event types 0, 1, 2 all produce valid signal events. The `eventType` property is correctly set.
#### Experiment H: Alternative Preparation -- Same Crash
`doPrepareChainingWithModel:options:chainingReq:qos:error:` exists with an identical signature and crashed identically at this stage (before the `nil` fix from Experiment L). The full `_ANEClient` API (46 instance methods) is documented in the test output.
### Throughput Ceiling (test_throughput_ceiling.m, Experiment I)
12-kernel pipeline benchmarks on M4 Max:
| Config | Sequential (run+memcpy) | Run-only | Memcpy-only | GCD Serial |
|--------|------------------------|----------|-------------|------------|
| 64x32 (test) | 0.272 ms/kernel | 0.158 ms/kernel | 0.001 ms/copy | 0.200 ms/kernel |
| 256x64 (small) | 0.191 ms/kernel | 0.181 ms/kernel | 0.002 ms/copy | 0.176 ms/kernel |
| 768x256 (prod) | 0.177 ms/kernel | 0.226 ms/kernel | 0.006 ms/copy | 0.186 ms/kernel |
**Key findings**:
- **Memcpy overhead is negligible** (<0.01 ms per copy even at 393KB). Not the bottleneck.
- **CPU round-trip overhead** is in the ANE dispatch itself, not data movement.
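- At production dims, sequential with memcpy is actually *faster* than run-only (0.177 vs 0.226 ms/kernel), possibly a pipeline caching effect.
- **GCD serial queue** provides a modest improvement at small dims (0.200 vs 0.272 ms/kernel at 64x32) but none at production dims (0.186 vs 0.177 ms/kernel).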
- **Chaining's value** would be eliminating the ~0.2ms/kernel ANE dispatch overhead, not memcpy. With 12 kernels, total pipeline takes ~2.1ms (prod), so eliminating dispatch could potentially halve this.
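
The figures in these findings follow from simple arithmetic on the table. The sketch below reproduces two of them in C: the size of one 768x256 buffer and the total latency of a 12-kernel sequential pipeline. It assumes 2 bytes per element (fp16), which is what the quoted 393KB implies rather than something stated in the test output; the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one dim_a x dim_b buffer.
 * bytes_per_elem = 2 assumes fp16 storage (inferred from the 393KB figure). */
static size_t buffer_bytes(size_t dim_a, size_t dim_b, size_t bytes_per_elem) {
    return dim_a * dim_b * bytes_per_elem;
}

/* Total latency of a sequential pipeline: each kernel pays the
 * per-kernel cost exactly once. */
static double pipeline_total_ms(int kernels, double per_kernel_ms) {
    assert(kernels > 0 && per_kernel_ms >= 0.0);
    return (double)kernels * per_kernel_ms;
}
```

`buffer_bytes(768, 256, 2)` gives 393,216 bytes, and `pipeline_total_ms(12, 0.177)` gives about 2.12 ms, matching the ~2.1 ms quoted for the production configuration.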
### Experiments K-P Results (test_ane_model.m, 2026-03-04)
#### Experiment K: Type Encoding Analysis -- COMPLETE
Full type encodings for all chaining-related methods:
| Method | Encoding | Notes |
|--------|----------|-------|
| `chainingRequestWithInputs:...` | `@88@0:8@16@24@32@40@48@56@64@72@80` | All 9 params are `@` (id/object) |
| `prepareChainingWithModel:...` | `B52@0:8@16@24@32I40^@44` | 5 params: 3x `@`, 1x `I` (uint32 qos), 1x `^@` (error ptr) |
| `doPrepareChainingWithModel:...` | `B52@0:8@16@24@32I40^@44` | Same signature as prepareChainingWithModel |
The `_ANEChainingRequest` factory takes 9 object parameters. The `lbInputSymbolId`, `lbOutputSymbolId`, and `procedureIndex` are all `@` (object), not raw integers. Internally, the factory calls `unsignedIntegerValue` (from NSNumber) or `count` (from NSArray) on these parameters.
| `_ANEChainingRequest` Property | Encoding | Type |
|-------------------------------|----------|------|
| `procedureIndex` | `@` | id (nil or NSArray) |
| `loopbackInputSymbolIndex` | `@` | id (nil or NSArray) |
| `loopbackOutputSymbolIndex` | `@` | id (nil or NSArray) |
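
The parameter counts quoted for these encodings can be checked mechanically. An Objective-C method encoding lists the return type and frame size, then one type code plus byte offset per argument, starting with the implicit `self` (`@`) and `_cmd` (`:`). The C sketch below counts the explicit parameters that follow. It is a simplified illustration with hypothetical function names, and it only handles the single-character codes and `^` pointer prefixes that appear above; in real code, `method_getTypeEncoding` and `NSMethodSignature` are the tools to use.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Skip one type code. Handles only optional '^' pointer prefixes followed by
 * a single-character code ('@', ':', 'B', 'I', ...). Struct, array, and union
 * encodings are NOT handled by this sketch. */
static const char *skip_type(const char *p) {
    while (*p == '^') p++;
    if (*p != '\0') p++;
    return p;
}

/* Skip the decimal number (frame size or argument offset) after a type code. */
static const char *skip_number(const char *p) {
    while (isdigit((unsigned char)*p)) p++;
    return p;
}

/* Count the explicit parameters in a method type encoding, i.e. every
 * argument after the implicit self ('@') and _cmd (':').
 * Returns -1 if the encoding does not start with those two arguments. */
static int count_explicit_params(const char *enc) {
    assert(enc != NULL);
    const char *p = skip_number(skip_type(enc)); /* return type + frame size */
    int args = 0;
    while (*p != '\0') {
        const char *type_start = p;
        p = skip_number(skip_type(p));
        if (args == 0 && *type_start != '@') return -1; /* self */
        if (args == 1 && *type_start != ':') return -1; /* _cmd */
        args++;
    }
    return args >= 2 ? args - 2 : -1;
}
```

Applied to the factory encoding `@88@0:8@16@24@32@40@48@56@64@72@80` this returns 9, and applied to `B52@0:8@16@24@32I40^@44` it returns 5, in line with the Notes column.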
#### Experiment L: Array-Typed Parameters -- BREAKTHROUGH
| Combo | lbIn | lbOut | procIdx | Factory | Validate | Prepare |
|-------|------|-------|---------|---------|----------|---------|
| L.1: Arrays `@[@(-1)]` | `@[@(-1)]` | `@[@(-1)]` | `@[@0]` | CRASH: `unsignedIntegerValue` on NSArray | - | - |
| L.2: Arrays `@[@0]` | `@[@0]` | `@[@0]` | `@[@0]` | CRASH: `unsignedIntegerValue` on NSArray | - | - |
| L.3: Empty `@[]` | `@[]` | `@[]` | `@[]` | CRASH: `unsignedIntegerValue` on empty array | - | - |
| **L.4: nil** | **nil** | **nil** | **nil** | **OK** | **YES** | **NO (Code=15)** |
| L.5: NSNumber | `@(-1)` | `@(-1)` | `@0` | CRASH: `count` on NSNumber | - | - |
**Passing `nil` for all three symbol/procedure params gets past both the factory crash and the `prepareChainingWithModel` crash.** The `validate` returns YES and `prepareChainingWithModel:` returns a clean error (Code=15: `ANEProgramChainingPrepare() Failed`) instead of crashing.
#### Experiment M: Load Model via _ANEClient -- BLOCKED
Both `loadModel:` and `compileModel:` on `_ANEClient` require **Espresso IR** format (`model.espresso.net`), not MIL:
```
Error Domain=com.apple.appleneuralengine.espresso Code=-1
"_ANEEspressoIRTranslator : error Cannot load network '.../model.espresso.net'"
```
`compiledModelExistsFor:` returns NO for our MIL-compiled model. After the failed load/compile attempts, the `_ANEModel` state changes from 1 to 5 (error/invalid state).
The standard CoreML pipeline generates `model.espresso.net` (Espresso IR) and `model.espresso.weights` from the `.mlpackage` / `.mlmodelc` format. Our MIL-only path bypasses this, so we can't use `_ANEClient.loadModel:` without first generating the Espresso IR.
#### Experiment N: IOSurface Mapping -- PARTIAL
`_ANEProgramIOSurfacesMapper`:
- `mapperWithProgramHandle:` creates a valid mapper from the `_ANEInMemoryModel` programHandle
- `mapIOSurfacesWithModel:request:cacheInference:error:` returns NO (no exception, no error output)
- `validateRequest:model:` returns NO
- `_ANEModel.mapper` property is nil
- `prepareANEMemoryMappingParams:request:` revealed `ANEMemoryMappingParamsStruct` has 128 `ANEBufferStruct` slots: `[128{ANEBufferStruct=^{__IOSurface}IiiI}]`
The mapper appears to need a fully loaded model with symbol table data that our MIL-compiled shell doesn't have.
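
The slot encoding `{ANEBufferStruct=^{__IOSurface}IiiI}` can be read as a C struct: one `IOSurfaceRef`-style pointer followed by `unsigned int`, `int`, `int`, `unsigned int`. The sketch below spells that out. The field names are hypothetical, since type encodings carry no names, and the forward-declared `struct __IOSurface` is only a stand-in so the code compiles without the IOSurface headers.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque stand-in; on macOS, `struct __IOSurface *` is IOSurfaceRef. */
struct __IOSurface;

/* Field TYPES follow the encoding `^{__IOSurface}IiiI`.
 * Field NAMES are hypothetical placeholders. */
typedef struct {
    struct __IOSurface *surface; /* ^{__IOSurface} */
    unsigned int field1;         /* I */
    int field2;                  /* i */
    int field3;                  /* i */
    unsigned int field4;         /* I */
} ANEBufferStructSketch;

enum { ANE_MAPPING_SLOTS = 128 }; /* from the `[128{...}]` array encoding */

/* Size in bytes of the full slot array under this layout. */
static size_t ane_mapping_slots_bytes(void) {
    /* The pointer comes first, so the 32-bit fields start right after it. */
    assert(offsetof(ANEBufferStructSketch, field1) == sizeof(struct __IOSurface *));
    return (size_t)ANE_MAPPING_SLOTS * sizeof(ANEBufferStructSketch);
}
```

On a 64-bit target this layout is 24 bytes per slot, or 3,072 bytes for all 128 slots, assuming the compiler adds no padding beyond natural alignment. Whether `ANEMemoryMappingParamsStruct` contains further fields beyond the slot array is not established here.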
#### Experiment O: Procedure Info -- EMPTY
- `procedureInfoForProcedureIndex:0` returns **nil** on the populated `_ANEModel`
- `procedureCount` is not a method or KVC-accessible property
- `modelAttributes` returns empty dictionary `{}`
- `inputSymbolNames` / `outputSymbolNames` not available on `_ANEModel`
- The `symbolIndicesForProcedureIndex:indexArrayKey:` method exists (takes `I` + `@`) but symbol data is empty
#### Experiment P: Full Chaining Retry -- Code=15
Tested with three model types, all using nil for symbol params:
| Model | State | validate | prepare Result |
|-------|-------|----------|---------------|
| Fresh `_ANEModel` (state=1, populated) | 1 | YES | NO (Code=15) |
| `_ANEInMemoryModel` | 3 | YES | CRASH: `getUUID` |
| Populated `_ANEModel` (from E, state=5) | 5 | YES | NO (Code=15) |
Also documented `_ANEInputBuffersReady` and `_ANEOutputSetEnqueue` type signatures:
| Class | Factory | Param Types |
|-------|---------|-------------|
| `_ANEInputBuffersReady` | `inputBuffersWithProcedureIndex:inputBufferInfoIndex:inputFreeValue:executionDelay:` | `I` (uint32), `@` (NSArray), `@` (NSArray), `Q` (uint64) |
| `_ANEOutputSetEnqueue` | `outputSetWithProcedureIndex:setIndex:signalValue:signalNotRequired:isOpenLoop:` | `I`, `I`, `Q`, `B`, `B` |
### Experiments Q-S Results (test_coreml_chaining.m, 2026-03-04)
#### Experiment Q: CoreML Pipeline -- MAJOR DISCOVERY
**The E5 runtime (macOS 15+) does NOT use `_ANEModel` or `_ANEChainingRequest` at all.**
CoreML on macOS 15 uses the MIL-based "E5" runtime, which completely bypasses the older Espresso/`_ANEModel`/`_ANEChainingRequest` path:
| Component | Old Path (Espresso) | New Path (E5/MIL) |
|-----------|--------------------|--------------------|
| Model format | `.espresso.net` + `.espresso.weights` | `model.mil` + `weights/weight.bin` |
| Model class | `_ANEModel` | `e5rt_program_library` (C struct) |
| Engine | `_ANEClient` + `_ANERequest` | `MLE5Engine` + `MLE5ExecutionStreamOperation` |
| Chaining | `_ANEChainingRequest` | `e5rt_execution_stream_operation` (unknown) |
| Compile | `_ANEClient.compileModel:` | `e5rt_program_library` AOT compilation |
| Sync | `_ANESharedSignalEvent` | `IOSurfaceSharedEventListener` + `MTLSharedEvent` |
Key findings:
- `MLModel.compileModelAtURL:` produces `.mlmodelc` with `model.mil` (NOT `model.espresso.net`)
- Loading an `MLModel` creates `MLDelegateModel` -> `MLE5Engine` -> `MLE5ProgramLibrary` -> `MLE5ProgramLibraryOnDeviceAOTCompilationImpl`
- No `_ANEModel` exists anywhere in the E5 object graph
- `_ANEClient.loadModel:` / `compileModel:` both require `model.espresso.net` which isn't generated
- Prediction succeeds (model runs on ANE), confirming E5 runtime works independently of `_ANEModel`
Internal E5 class hierarchy:
```
MLDelegateModel
└── _internalEngine: MLE5Engine
├── _programLibrary: MLE5ProgramLibrary
│ ├── _programLibraryHandle: e5rt_program_library* (opaque C struct)
│ ├── _impl: MLE5ProgramLibraryOnDeviceAOTCompilationImpl
│ │ ├── _milTextURL: NSURL
│ │ ├── _irProgram: shared_ptr<MIL::IRProgram> (C++)
│ │ └── _container: MLProgramE5Container
│ └── _container: MLProgramE5Container
│ ├── _modelAssetDescription
│ ├── _compilerVersionInfo
│ └── _functionInfoArray
└── _operationPool: MLE5StaticShapeExecutionStreamOperationPool
└── _pool: NSMutableSet of MLE5ExecutionStreamOperation
├── _operationHandle: e5rt_execution_stream_operation* (opaque)
├── _programLibrary: MLE5ProgramLibrary
├── _inputPorts / _outputPorts: NSArray
├── _waitEventListener: IOSurfaceSharedEventListener
└── _completionSharedEventBoundToESOP: MTLSharedEvent
```
#### Experiment R: Chaining with CoreML model -- BLOCKED
No `_ANEModel` extracted from E5 runtime, so `prepareChainingWithModel:` cannot be tested with a CoreML-compiled model. The E5 runtime is a completely separate execution path.
#### Experiment S: Two-Kernel Chaining -- BLOCKED
Blocked by Experiment R. The `_ANEChainingRequest` API appears to be from the **older Espresso-based runtime** and may not be usable with models compiled through the E5/MIL path.
### Experiments T-V Results (2026-03-04)
#### Experiment T: E5 Runtime Symbol Scan
Found 4 exported C functions from the `e5rt_*` API:
- `e5rt_program_library_create` -- creates program library handle
- `e5rt_execution_stream_create` -- creates execution stream handle
- `e5rt_async_event_create` -- creates async event for synchronization
- `e5rt_async_event_signal` -- signals an async event
Key ObjC classes in the E5 runtime:
- `MLE5ExecutionStreamOperation` (63 instance methods) -- holds `e5rt_execution_stream_operation*`, manages input/output ports
- `MLE5ExecutionStream` (29 instance methods) -- holds `e5rt_execution_stream*`, executes `operations` array
- `MLE5ExecutionStreamPool` -- manages streams via `takeOut` / `putBack:`
- `MLE5InputPort` / `MLE5OutputPort` -- hold `e5rt_io_port*`, bind features to ports
- `MLE5InputPortBinder` / `MLE5OutputPortBinder` -- handle memory binding for ports
- `MLE5ProgramLibrary` -- holds `e5rt_program_library*`
Critical method: `MLE5ExecutionStream._executeStream:error:` takes `e5rt_execution_stream*` and executes **all operations** in the `operations` array in sequence.
#### Experiment U: E5 Multi-Op Stream -- MAJOR BREAKTHROUGH
**Successfully executed multiple ANE operations in a single E5 stream, with a measured speedup of up to 4.87x over sequential CoreML.** Note: validation in Experiment W1 later found this speedup figure to be invalid.
Method:
1. Load multiple CoreML models (`.mlpackage` -> `MLModel`)
2. Extract `MLE5ProgramLibrary` from each model's `MLE5Engine`
3. Create `MLE5ExecutionStreamOperation` for each, backed by each program library
4. Preload operations (`preloadAndReturnError:`) to compile ANE programs
5. Borrow an `MLE5ExecutionStream` from the stream pool
6. Set multiple operations on the stream via `setOperations:`
7. Prepare each operation's input features via `prepareForInputFeatures:options:error:`
8. Execute all operations in one call via `_executeStream:error:`
#### Benchmark Results (M4 Max, macOS 15, N=500)
| Kernels | CoreML Sequential | E5 Multi-Op Stream | Speedup |
|---------|------------------|--------------------|---------|
| 1 (256ch) | 0.0359 ms | 0.0272 ms | **1.32x** |
| 2 (256+512ch) | 0.0623 ms | 0.0406 ms | **1.53x** |
| 3 (256+512+1024ch) | 0.1599 ms | 0.0578 ms | **2.77x** |
| 4 (256+512+1024+2048ch) | 0.3781 ms | 0.0776 ms | **4.87x** |
Key observations:
- E5 stream per-kernel overhead is remarkably consistent: ~0.02 ms/kernel regardless of count
- CoreML sequential overhead grows non-linearly (0.036 -> 0.095 ms/kernel with 4 kernels)
- The speedup increases with more kernels: the dispatch overhead is amortized
- All operations execute on ANE with a single `_executeStream:` call
Code path for E5 multi-op stream:
```objc
// Abbreviated sketch: error handling omitted; desc, cfg, features, and predOpts
// come from the loaded model. Experiment W1 later showed this path returns zeros.
// 1. Extract internals from CoreML-loaded model
id e5engine = [mlModel valueForKey:@"_internalEngine"];   // MLE5Engine
id progLib  = [e5engine valueForKey:@"programLibrary"];   // MLE5ProgramLibrary
id pool     = [e5engine valueForKey:@"streamPool"];       // MLE5ExecutionStreamPool
// 2. Create one operation per program library (shown for op1; op2 and op3 are built the same way)
id op1 = [[MLE5ExecutionStreamOperation alloc]
    initWithProgramLibrary:progLib functionName:@"main"
    modelDescription:desc configuration:cfg
    debugLabel:@"myOp" modelSignpostId:0];
[op1 preloadAndReturnError:nil];
// 3. Get stream and set operations
id stream = [pool takeOut];
// Shorthand for reading the _streamHandle ivar; `stream` is typed `id`,
// so real code has to read the ivar through the ObjC runtime.
void *sh = stream._streamHandle;                          // e5rt_execution_stream*
NSArray *ops = @[op1, op2, op3];
[stream setOperations:ops];
// 4. Prepare and execute
for (id op in ops)
    [op prepareForInputFeatures:features options:predOpts error:nil];
[stream _executeStream:sh error:nil];
```
### Revised Assessment (after T-V)
~~The **E5 runtime** (`MLE5ExecutionStream` + `MLE5ExecutionStreamOperation`) is the correct path for multi-kernel pipelining on macOS 15+.~~ **CORRECTED in Experiment W1 (see below).**
### Experiments W1-W5: Validation & Deep API Documentation (2026-03-04)
#### W1: Output Correctness Validation
**CRITICAL CORRECTION**: The previously reported "4.87x speedup" from multi-op streams was **invalid**. Validation revealed:
1. `MLE5Engine.predictionFromFeatures:options:error:` produces **EXACT** (bit-identical) output to `MLModel.predictionFromFeatures:error:` for all tested sizes (256, 512, 1024, 2048 channels). This confirms the E5 engine is the correct computation path.
2. Our manually-created `MLE5ExecutionStreamOperation` objects via `initWithProgramLibrary:` **do not produce correct output** -- they return all zeros. The `_executeStream:` call returns YES but no actual ANE compute occurs. The operation handles are `0x0` (not compiled), meaning our manually-created ops were never wired to actual ANE programs.
3. The "speedup" was measuring the overhead of a no-op function returning immediately vs CoreML doing actual computation.
4. `MLE5StaticShapeExecutionStreamOperationPool.takeOutOperationForFeatures:error:` returns pool-managed operations with valid handles, but using them with `_executeStream:` still produces zeros -- the output port bindings are not correctly populated.
5. Stream reuse via `_predictionFromFeatures:stream:options:error:` fails with "E5RT: Port bindings cannot be changed while operation is in use in an execution stream" -- streams are locked after first use and cannot be reconfigured.
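A guard of the following kind would flag an all-zero output buffer before any timing is trusted. This is a minimal Python sketch rather than part of the test harness: `validate_output`, its status strings, and the tolerance are hypothetical, and the ANE output and the CoreML reference are assumed to have been copied into flat lists of floats.

```python
import math

def validate_output(candidate, reference, abs_tol=1e-2):
    """Classify a candidate output buffer against a trusted reference."""
    if not reference or len(candidate) != len(reference):
        return "shape-mismatch"
    if any(math.isnan(v) for v in candidate):
        return "nan"
    # An all-zero buffer while the reference is non-zero means no compute ran.
    if all(v == 0.0 for v in candidate) and any(r != 0.0 for r in reference):
        return "all-zero"
    max_diff = max(abs(c - r) for c, r in zip(candidate, reference))
    return "ok" if max_diff <= abs_tol else "mismatch"

reference = [0.5, -1.25, 2.0, 0.0]
print(validate_output([0.0, 0.0, 0.0, 0.0], reference))
print(validate_output([0.5, -1.25, 2.001, 0.0], reference))
print(validate_output([0.5, -1.25, 3.0, 0.0], reference))
```

Running such a check once per benchmark configuration, before the timing loop, keeps a no-op path from being reported as a speedup.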
#### W1 Performance Profile
| Path | 256ch (ms) | 2048ch (ms) |
|------|-----------|-------------|
| CoreML API (`predictionFromFeatures:error:`) | 0.035 | 0.217 |
| Engine direct (`predictionFromFeatures:options:error:`) | 0.074 | 0.284 |
| Engine private (`_predictionFromFeatures:options:error:`) | 0.100 | 0.332 |
| Stream pool cycle (takeOut + putBack) | 0.008 | 0.008 |
| Op pool cycle | <0.001 | <0.001 |
**Key finding: CoreML API is FASTER than calling the engine directly.** `MLDelegateModel` implements internal caching (likely keeping a hot stream + operation) that avoids the per-call pool acquire/release overhead. The engine's `predictionFromFeatures:` method performs pool management on every call.
#### W2: Exhaustive E5 Runtime API
Full class dumps captured for all E5 runtime classes. Key classes and their roles:
**`MLE5Engine`** (49 instance methods, 10 ivars)
- Superclass: `MLModelEngine`
- Entry point: `predictionFromFeatures:options:error:` (public), `_predictionFromFeatures:stream:options:error:` (internal)
- Key properties: `streamPool` (MLE5ExecutionStreamPool), `operationPool` (<MLE5ExecutionStreamOperationPool>), `programLibrary` (MLE5ProgramLibrary)
- Manages: stream acquisition, operation preparation, input conforming, output post-processing
**`MLE5ProgramLibrary`** (17 instance methods, 5 ivars)
- Holds `_programLibraryHandle` (C struct `e5rt_program_library*`)
- Key method: `createOperationForFunctionName:forceRespecialization:hasRangeShapeInputs:error:` -- returns C-level `e5rt_execution_stream_operation*`
- Contains: compiled MIL program, model configuration, implementation object
**`MLE5ExecutionStreamOperation`** (63 instance methods, ~20 ivars)
- Holds `_operationHandle` (C struct `e5rt_execution_stream_operation*`)
- States: 0=created, transitions through prepare/execute
- Key methods: `prepareForInputFeatures:options:error:`, `preloadAndReturnError:`, `outputFeatures`
- Has input/output/state ports (MLE5InputPort, MLE5OutputPort)
- Internal binding: `_bindInputFeaturesAndWaitEvents:options:error:`, `_bindOutputPortsWithOptions:error:`
- Port binding modes: `directlyBoundFeatureValue` (zero-copy) vs `copyFeatureValue` (memcpy)
**`MLE5ExecutionStream`** (21 instance methods, 5 ivars)
- Holds `_streamHandle` (C struct `e5rt_execution_stream*`)
- Key methods: `_executeStream:error:`, `executeForInputFeatures:options:error:`, `submitWithCompletionHandler:`
- Operations set via `setOperations:` (NSArray of MLE5ExecutionStreamOperation)
- Reset via `_cleanUpStream:` on engine
**`MLE5ExecutionStreamPool`** (11 instance methods)
- Pool pattern: `takeOut` / `putBack:`
- Creates streams on demand with `e5rt_execution_stream_create`
- Tracks all streams via `allStreams`
**`MLE5StaticShapeExecutionStreamOperationPool`** (17 instance methods)
- Pool for operations with fixed input shapes
- Key method: `takeOutOperationForFeatures:error:` -- matches feature shape to pooled operation
**`MLE5InputPort` / `MLE5OutputPort`**
- Wraps `e5rt_io_port*` handles
- Each has a `binder` (MLE5InputPortBinder / MLE5OutputPortBinder)
- Input binder has `bindingMode` (char): controls copy vs direct binding
- Output binder has `outputBacking` and `featureValue` for result retrieval
**`MLE5InputPortBinder`** (16 instance methods, 6 ivars)
- `bindingMode` (char): 0=copy, 1=direct
- `bindMemoryObjectForFeatureValue:error:` -- zero-copy IOSurface binding
- `copyFeatureValue:error:` -- memcpy binding
**`MLE5OutputPortBinder`** (27 instance methods, 9 ivars)
- `outputBacking` -- output buffer
- `boundFeatureDirectly` (BOOL) -- tracks binding mode
- `_makeFeatureValueFromPort:featureDescription:error:` -- read ANE output
**`MLProgramE5Container`** (11 instance methods, 6 ivars)
- Container for compiled model assets
- `URLOfMILText` -- path to MIL source
- `compilerOutput` -- `MLCompilerNeuralNetworkOutput`
- `findPrecompiledE5BundleAndReturnError:` -- looks for pre-compiled E5 bundle
**e5rt_* C API** (found via dlsym):
- `e5rt_program_library_create` -- creates program library from MIL
- `e5rt_execution_stream_create` -- creates execution stream
- `e5rt_async_event_create` -- creates async event for synchronization
- `e5rt_async_event_signal` -- signals async event
#### W4: Async Stream Submission
`submitWithCompletionHandler:` **FAILED** with: "Failed to add operation to E5 stream. E5RT: Reset stream to add more operations to stream. (2)". The stream must be in a specific state (reset) before async submission is possible. The stream state becomes locked after `_executeStream:` or `executeForInputFeatures:`.
#### W5: Port-Based Data Flow
- Each operation has `inputPorts` (array of MLE5InputPort) and `outputPorts` (array of MLE5OutputPort)
- Input binding mode 1 = direct binding (zero-copy from MLMultiArray)
- Output `outputBacking` is nil after manual execution -- bindings are not populated by our manual path
- Port handles are `e5rt_io_port*` C structs -- connecting ports across operations would require knowing the C API for port linking
### Revised Assessment (after W1-W5)
1. **CoreML API is already near-optimal** for single-model inference. The `MLDelegateModel` wrapper is faster than calling engine methods directly due to internal stream/operation caching.
2. **Manual `_executeStream:` with custom operations is invalid** -- it produces zero output. The operations must be created through the engine's internal pipeline (via `_predictionFromFeatures:stream:options:error:`) which handles binding correctly.
3. **The opportunity for speedup lies in**:
- Eliminating ObjC overhead via direct `e5rt_*` C API calls
- Batching multiple models into a single stream (requires understanding `e5rt_execution_stream_operation` lifecycle)
- Direct MIL compilation to `e5rt_program_library` without going through CoreML
### Experiment X1: Custom MIL -> ANE Execution (BREAKTHROUGH)
**Pipeline discovered**: Write MIL text file -> `MLE5ProgramLibraryOnDeviceAOTCompilationImpl` -> `MLE5ProgramLibrary` -> `MLE5Engine` -> `predictionFromFeatures:`
```objc
NSError *err = nil;
// 1. Write MIL text to file
NSString *mil = @"program(1.3)\n{\n func main<ios18>(...) { ... } -> (cast_out);\n}\n";
[mil writeToFile:@"/tmp/custom.mil" ...];
NSURL *milURL = [NSURL fileURLWithPath:@"/tmp/custom.mil"];
// 2. Compile MIL to E5 program library
id aotImpl = [[MLE5ProgramLibraryOnDeviceAOTCompilationImpl alloc]
initWithMILTextAtURL:milURL container:refContainer configuration:cfg];
void *plHandle = [aotImpl createProgramLibraryHandleWithRespecialization:NO error:&err];
// 3. Create program library + engine
id progLib = [[MLE5ProgramLibrary alloc] initWithImpl:aotImpl container:refContainer configuration:cfg];
id engine = [[MLE5Engine alloc] initWithProgramLibrary:progLib modelDescription:desc ...];
[engine prepareWithConcurrencyHint:1 error:nil];
// 4. Execute
id result = [engine predictionFromFeatures:fp options:opts error:&err];
```
**Requirements**:
- MIL input/output variable names must match the model description (e.g., `x` for input, `cast_out` for output)
- MIL shapes must match the model description shapes
- A "container" (`MLProgramE5Container`) is borrowed from a pre-compiled CoreML model (needed for compilation context)
- Input/output types should be fp32 with internal fp16 compute (cast in/out) for ANE compatibility
**Verified kernels** (all produce EXACT correct output on ANE):
| Kernel | MIL Op | Verification |
|--------|--------|-------------|
| ReLU | `relu(x=x16)` | Max diff = 0.000000, 0/16384 wrong |
| GELU | `gelu(x=x16, mode="TANH_APPROXIMATION")` | Verified against reference |
| Elementwise (x*2+1) | `mul` + `add` with scalar constants | Verified against reference |
| Softmax | `softmax(x=x16, axis=-1)` | Sum = 1.000000 |
| Layer Norm | `layer_norm(x=x16, axes=[3], epsilon=1e-5)` | Mean = 0.000000, Var = 0.999975 |
**Significance**: This allows compiling **arbitrary MIL programs** (any operation supported by Apple's MIL spec) to run on the ANE, without going through CoreML's .mlpackage pipeline. This is the foundation for custom training/inference kernels.
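The sanity metrics in the verification column (softmax rows summing to 1, layer-norm output with mean near 0 and variance near 1) can be reproduced on the CPU. The sketch below is a plain-Python reference under stated assumptions: one row at a time, no gamma/beta, and helper names (`softmax`, `layer_norm`, `mean_and_var`) that are illustrative rather than taken from the test files.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-5):
    """Layer norm over one row, without gamma/beta."""
    n = len(row)
    mean = sum(row) / n
    var = sum((v - mean) ** 2 for v in row) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in row]

def mean_and_var(row):
    """Population mean and variance of one row."""
    n = len(row)
    mean = sum(row) / n
    return mean, sum((v - mean) ** 2 for v in row) / n

row = [1.0, 2.0, 3.0, 4.0]
print("softmax sum:", sum(softmax(row)))
print("layer_norm mean, var:", mean_and_var(layer_norm(row)))
```

Because of the epsilon term, the output variance is `var / (var + eps)`, which is slightly below 1 even in exact arithmetic, so a reported variance just under 1 is expected rather than a sign of error.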
### Experiment Y1: Fused SDPA on ANE (PASSED)
**Operation**: `scaled_dot_product_attention(query=Q, key=K, value=V)` -- single fused op for entire attention computation.
Config: B=1, nHeads=1, seqLen=256, headDim=64 (self-attention: Q=K=V=reshape(input))
| Metric | Value |
|--------|-------|
| Max abs diff (vs CPU) | 0.000021 |
| Relative error | 1.40e-03 |
| Latency (first call) | 2.454 ms |
| **Benchmark** | **0.1708 ms/eval** |
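For reference, the "Max abs diff (vs CPU)" metric compares the ANE output against an unfused CPU computation of `softmax(Q @ K.T / sqrt(d)) @ V`. The following is a minimal pure-Python sketch of such a reference for a single head, assuming 2-D inputs as lists of lists; `sdpa` and `max_abs_diff` are illustrative names, not functions from the test files.

```python
import math

def sdpa(q, k, v):
    """softmax(Q @ K^T / sqrt(E)) @ V with q [L,E], k [S,E], v [S,EV]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)                       # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(weights[j] * v[j][c] for j in range(len(v))) / z
                    for c in range(len(v[0]))])
    return out

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two 2-D lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Self-attention on a tiny input (Q = K = V), mirroring the Y1 setup in miniature.
x = [[1.0, 0.0], [0.0, 1.0]]
out = sdpa(x, x, x)
print(out)
print(max_abs_diff(out, out))
```

In an actual comparison the second argument of `max_abs_diff` would be the fp32 output read back from the ANE.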
### Experiment Y2: Linear with Embedded Weights (PASSED)
**Operation**: `linear(x=flat, weight=Wc, bias=Bc)` where `Wc` and `Bc` are compile-time `const` tensors embedded in the MIL program.
Config: input [256, 64], linear 64->64 with embedded weight matrix and bias vector.
| Metric | Value |
|--------|-------|
| Max abs diff (vs CPU) | 0.001106 |
| Relative error | 1.05e-02 |
| **Benchmark** | **0.0610 ms/eval** |
**Significance**: Confirms that compile-time weight constants work in MIL text format. This is the foundation for transformer inference (where weights are frozen).
### Experiment Y3: Complete Transformer Block on ANE (PASSED)
**Pipeline**: LayerNorm -> SDPA (self-attention) -> Residual Add -> LayerNorm -> FFN (linear+GELU+linear) -> Residual Add
All in a **single MIL program**, compiled and executed as one ANE operation.
Config: seqLen=256, dim=64, ffnDim=128, 1-head attention, embedded FFN weights.
| Metric | Value |
|--------|-------|
| Output mean abs | 1.017404 (non-zero, correct) |
| **Benchmark** | **0.2091 ms/eval** |
**Significance**: A full transformer layer runs on ANE in ~0.2ms. This proves that complex multi-op pipelines can be compiled as single MIL programs with no CPU round-trips between ops. The ANE compiler fuses the entire graph.
### Experiment Z1: Backward Pass (Gradient Computation) on ANE (PASSED)
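**Operations**: `matmul(x=dY, y=W)` for dX (input gradient), `matmul(x=dY, y=dY, transpose_x=true)` for dW (weight gradient). Both use **runtime tensors** (not const), proving backward-pass operations work on ANE. Note that the dW test multiplies dY^T by dY itself; a real linear-layer weight gradient is `dY.T @ X` (see Section 9), which has the same shapes when X is [128,64].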
Also tests: `slice_by_index` for tensor slicing, `concat` for packing results.
Config: dY [128,64] @ W [64,64] -> dX [128,64]; dY^T [64,128] @ dY [128,64] -> dW [64,64]
| Metric | dX | dW |
|--------|-----|-----|
| Max abs diff | 0.001940 | 0.012828 |
| Relative error | 1.02e-02 | 3.92e-02 |
| **Benchmark** | **0.0593 ms/eval** (both combined) |
**Significance**: This is the first demonstration in this project of the ANE executing gradient-computation operations. The `matmul` with `transpose_x=true` works correctly, producing valid weight gradients. Combined with Y3's forward pass, this establishes the complete pipeline for manual ANE training:
1. Forward pass: Y3-style MIL (0.2 ms)
2. Backward pass: Z1-style MIL (0.06 ms)
3. Weight update: CPU (trivial)
4. Recompile: (~10-50 ms, dominates training time)
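The backward-pass formulas can be checked on the CPU before trusting an ANE result. The sketch below is a pure-Python reference, assuming a linear layer `y = x @ w.T` with `w` of shape [D_out, D_in]; it computes `dX = dY @ W` and `dW = dY.T @ X` and confirms one entry of `dW` by finite differences. All names and the tiny example shapes are illustrative.

```python
def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def linear_backward(dy, x, w):
    """Gradients of y = x @ w.T given the upstream gradient dy."""
    dx = matmul(dy, w)                # [N,D_out] @ [D_out,D_in] -> [N,D_in]
    dw = matmul(transpose(dy), x)     # [D_out,N] @ [N,D_in]     -> [D_out,D_in]
    return dx, dw

def loss(x, w, dy):
    """Scalar sum(y * dy); its gradient with respect to y is exactly dy."""
    y = matmul(x, transpose(w))
    return sum(y[i][j] * dy[i][j]
               for i in range(len(y)) for j in range(len(y[0])))

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                     # [N=3, D_in=2]
w = [[0.5, -1.0], [2.0, 0.25], [1.0, 1.0]]                   # [D_out=3, D_in=2]
dy = [[0.1, 0.2, 0.3], [0.0, -0.5, 1.0], [1.0, 0.0, -1.0]]   # [N=3, D_out=3]
dx, dw = linear_backward(dy, x, w)

# Central finite difference on w[1][0] as an independent check of dw[1][0].
h = 1e-3
w_plus = [r[:] for r in w]
w_plus[1][0] += h
w_minus = [r[:] for r in w]
w_minus[1][0] -= h
numeric = (loss(x, w_plus, dy) - loss(x, w_minus, dy)) / (2 * h)
print(dw[1][0], numeric)
```

The same two `matmul` calls map onto the Z1-style MIL ops, with `transpose_x=true` taking the place of the explicit `transpose`.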
### MIL Text Syntax Lessons Learned
Key syntax rules discovered during Y/Z experiments:
1. **`epsilon` in `layer_norm`**: Must be same dtype as gamma/beta. Use `fp16 eps = const()[..., val = fp16(1e-5)]` when gamma is fp16.
2. **Boolean params**: Use `bool tx = const()[..., val = bool(true)]` for params like `transpose_x`.
3. **`concat` axis**: Must be `int32` scalar, not `tensor<int32, [1]>`. Use `int32 ax = const()[..., val = int32(0)]`.
4. **`concat` interleave**: Required param, use `bool il = const()[..., val = bool(false)]`.
5. **MLE5Engine init**: Correct selector is `initWithProgramLibrary:modelDescription:configuration:functionName:classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo:` (7 args).
6. **Container path**: On macOS 15+, models may use Espresso backend. Create `MLProgramE5Container` via `initWithModelAssetPath:configuration:` using the `.mlmodelc` path.
7. **Sandbox**: E5RT needs write access to `~/Library/Caches/` for model specialization cache.
### Next Steps
1. **[HIGH] Multi-head attention** -- test SDPA with multiple heads (reshape to [B, nHeads, seqLen, headDim])
2. **[HIGH] Real Qwen2.5 layer weights** -- load actual model weights into MIL const tensors
3. **[HIGH] Full backward pass** -- implement complete transformer backward pass (attention + FFN gradients)
4. **[MEDIUM] Training loop** -- forward + backward + weight update + recompile cycle
5. **[MEDIUM] Explore e5rt_* C API directly** -- bypass ObjC wrappers for lower overhead
6. **[LOW] Runtime weight injection** -- investigate if weights can be updated without recompilation
**Phase 7: OutputSets with stats IOSurface -- BREAKTHROUGH**
```
statsSurRef size=64 bytes:
objectWithstatsSurRef: _ANEIOSurfaceOutputSets: { statsSurRef=<IOSurface: 0x...>
id = 0x... width = 64 height = 1 pixelFormat = 0
name = test_chaining_v2 ; outputBuffer=(
"_ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=1}"
)}
Attempting ChainingRequest with valid outputSet...
ChainingRequest created | validate: YES <-- FIRST TIME VALIDATE PASSES!
prepareChainingWithModel EXCEPTION:
-[_ANEInMemoryModel getUUID]: unrecognized selector
```
**Phase 8: Disk-based _ANEModel**
```
_ANEModel class found (12 class methods, 52 instance methods, 17 properties)
Has: getUUID, inputSymbolIndicesForProcedureIndex:,
outputSymbolIndicesForProcedureIndex:, mapper, program
Factory: +modelAtURL:key:, +modelAtURL:key:modelAttributes:, etc.
tmpDir contents: (weights, model.mil, net.plist, data)
+modelAtURL: NOT available (needs key: parameter)
-> _ANEModel could not be loaded (need correct factory + key)
```
**Phase 9: processRequest via ProgramForEvaluation**
```
k1.model.program: _ANEProgramForEvaluation: { programHandle=1319967543575
intermediateBufferHandle=0 queueDepth=127 }
processRequest single call: YES (rv=NO)
processRequest: 0.131 ms/eval (50 iters)
vs RT eval: 1.45x (slower than RT but faster than standard)
```
**Phase 10: Shared Events**
```
_ANESharedEvents: found (+sharedEventsWithSignalEvents:waitEvents:)
_ANESharedSignalEvent: found
+signalEventWithValue:symbolIndex:eventType:sharedEvent:
Properties: sharedEvent (IOSurfaceSharedEvent), value, symbolIndex, agentMask, eventType
alloc/init: nil (needs sharedEvent parameter)
_ANESharedWaitEvent: found
+waitEventWithValue:sharedEvent:
alloc/init: nil (needs sharedEvent parameter)
-> Both require IOSurfaceSharedEvent objects, not available from bare init
```
---
## 6. Architecture: Chaining Data Flow
```
Current (sequential):
CPU -> IOSurface -> ANE eval layer 1 -> IOSurface -> CPU memcpy
CPU -> IOSurface -> ANE eval layer 2 -> IOSurface -> CPU memcpy
... (23 round-trips for 12-layer model)
Target (chained):
CPU -> IOSurface -> ANE eval layer 1 -> [on-chip] -> ANE eval layer 2
-> [on-chip] -> ... -> IOSurface -> CPU
(1 round-trip for entire model)
Current best (sequential with standard path):
At production dims (768x256), all paths are ~0.2ms/kernel.
RT path only helps for small kernels (64x32: 1.88x speedup).
For 24 evals/token at ~0.2ms each: ~4.8ms total ANE time per token.
Chaining target: 1 round-trip instead of 24, saving ~23 x overhead per trip.
```
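The per-token figures in the diagram follow from simple arithmetic. The sketch below reproduces them in Python; `overhead_ms` is a hypothetical parameter, since this document does not measure the per-round-trip overhead separately from kernel time, and the value passed in the example is a placeholder rather than a measurement.

```python
def sequential_ms(evals, ms_per_eval):
    """Total ANE time per token when every eval is dispatched separately."""
    return evals * ms_per_eval

def round_trips_saved(evals):
    """Chaining keeps one round-trip and removes the rest."""
    return evals - 1

def chained_estimate_ms(evals, ms_per_eval, overhead_ms):
    """Sequential time minus the overhead of each eliminated round-trip."""
    return sequential_ms(evals, ms_per_eval) - round_trips_saved(evals) * overhead_ms

evals, ms_per_eval = 24, 0.2
print("sequential:", sequential_ms(evals, ms_per_eval), "ms/token")
print("round-trips saved:", round_trips_saved(evals))
# Placeholder overhead, NOT a measured value: shows how the estimate scales.
print("chained (assuming 0.05 ms overhead/trip):",
      chained_estimate_ms(evals, ms_per_eval, 0.05), "ms/token")
```

The achievable saving is bounded by how much of the ~0.2 ms per kernel is dispatch overhead rather than compute, which is the quantity chaining would need to expose.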
---
## 7. Class Hierarchy (inferred)
```
NSObject
├── _ANEClient (singleton, daemon connection)
├── _ANEInMemoryModelDescriptor (MIL + weights spec)
├── _ANEInMemoryModel (compile/load/run -- in-memory MIL path)
│ └── .program -> _ANEProgramForEvaluation
├── _ANEModel (disk-based compiled model -- 52 methods, has getUUID)
│ ├── .program -> _ANEProgramForEvaluation
│ └── .mapper -> _ANEProgramIOSurfacesMapper
├── _ANERequest (I/O surface packaging)
├── _ANEIOSurfaceObject (thin IOSurface wrapper)
├── _ANEBuffer (IOSurfaceObject + symbolIndex + source)
├── _ANEChainingRequest (multi-op pipeline)
├── _ANEIOSurfaceOutputSets (output packaging for chaining)
├── _ANEInputBuffersReady (input signaling for chaining)
├── _ANEOutputSetEnqueue (output enqueue config for chaining)
├── _ANEProgramIOSurfacesMapper (symbol-to-surface mapping)
├── _ANEProgramForEvaluation (lower-level eval program)
├── _ANEModelInstanceParameters (model config)
├── _ANEDeviceController (device-level control)
├── _ANEQoSMapper (QoS level mapping)
├── _ANEPerformanceStats (perf counters)
├── _ANESharedSignalEvent (hardware signal fence)
└── _ANESharedWaitEvent (hardware wait fence)
```
---
## 8. MIL Operations Reference (for Custom ANE Kernels)
Source: [coremltools MIL Ops API Reference](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html)
The following MIL operations are available for writing custom ANE kernels via our `MLE5ProgramLibraryOnDeviceAOTCompilationImpl` pipeline (Experiment X1). All ops below have been confirmed available in the MIL text format used by the E5 compiler on macOS 15+.
### Transformer-Critical Ops
| Op | Signature | Notes |
|----|-----------|-------|
| `scaled_dot_product_attention` (iOS 18+) | `(query:[B,*?,L,E], key:[B,*?,S,E], value:[B,*?,S,EV], attn_mask?) -> [B,*?,L,EV]` | Fused `softmax(Q@K.T/sqrt(d))@V`. Single op for entire attention computation. |
| `linear` | `(x:[*D,D_in], weight:const[D_out,D_in], bias:const[D_out]?) -> [*D,D_out]` | `x @ W.T + b`. **Weight/bias must be compile-time constants.** Rank 1-3 input. |
| `matmul` | `(x:[*,K1], y:[*,K2], transpose_x?, transpose_y?) -> [*,T]` | N-D batch matmul with broadcasting. Supports runtime (non-const) inputs. |
| `layer_norm` | `(x, axes, gamma?, beta?, epsilon?) -> same shape` | Verified working on ANE (Experiment X1). |
| `gelu` | `(x, mode=EXACT/TANH_APPROXIMATION/SIGMOID_APPROXIMATION) -> same shape` | Verified working on ANE (Experiment X1). |
| `softmax` | `(x, axis) -> same shape` | Verified working on ANE (Experiment X1). |
| `relu` | `(x) -> same shape` | Verified working on ANE (Experiment X1). |
### Data Movement Ops
| Op | Signature | Notes |
|----|-----------|-------|
| `gather` | `(x, indices, axis?) -> gathered` | For embedding table lookups. |
| `gather_along_axis` | `(x, indices, axis?) -> gathered` | Take values along axis at index locations. |
| `scatter` | `(data, indices, updates, axis?, mode?) -> scattered` | For KV cache writes. Mode: update/add/sub/mul/div/max/min. |
| `scatter_along_axis` | `(data, indices, updates, axis?, mode?) -> scattered` | Scatter updates along axis. |
### Elementwise / Reduction Ops
| Op | Notes |
|----|-------|
| `add`, `sub`, `mul`, `real_div` | Elementwise with broadcasting. |
| `cast` | Type conversion (fp32 <-> fp16). Required for ANE I/O (fp32 in, fp16 compute, fp32 out). |
| `reduce_sum`, `reduce_mean`, `reduce_max` | Reduction along axes. |
| `rsqrt`, `sqrt`, `exp`, `log`, `tanh` | Unary elementwise. Useful for manual norm/activation implementations. |
| `concat`, `split`, `reshape`, `transpose` | Shape manipulation. |
| `slice_by_index`, `slice_by_size` | Tensor slicing for KV cache windowing. |
### Key Constraints
1. **`linear` weights must be `const`**: For inference this is fine (weights don't change). For training, use `matmul` with runtime tensors instead.
2. **MIL text format**: Programs use `program(1.3) { func main<ios18>(...) { ... } -> (output); }` syntax. Constants use `const()[name=..., val=...]`. Weights reference blob files via `BLOBFILE(path=..., offset=...)`.
3. **ANE I/O convention**: Input/output should be fp32; internal compute should be fp16. Use `cast` ops at boundaries.
4. **Shape constraints**: ANE prefers NCHW layout. Most ops work with rank-4 tensors `[B, C, H, W]` but `linear`/`matmul` work with lower ranks.
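The fp32-in / fp16-compute / fp32-out convention means every value is rounded to half precision at least once. The sketch below uses Python's `struct` module (format code `e` is IEEE 754 binary16) to show the size of a single rounding step; the helper names are illustrative.

```python
import struct

def fp16_round_trip(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def relative_error(x):
    """Relative error introduced by one fp16 rounding of x (x must be non-zero)."""
    return abs(fp16_round_trip(x) - x) / abs(x)

for value in (1.0, 0.1, 1000.1, 1e-5):
    print(value, fp16_round_trip(value), relative_error(value))
```

For normal-range values a single rounding introduces at most 2^-11 (about 4.9e-4) relative error. The larger relative errors reported in the experiment tables are plausibly the result of rounding accumulating across long dot products, although this document does not break the error down. Values below 2^-14 (about 6.1e-5), such as an epsilon of 1e-5, fall in the fp16 subnormal range and are rounded more coarsely.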
---
## 9. ANE Training Feasibility Analysis
### Apple's Official Position
The documentation for `MLCDevice.ane()` in Apple's deprecated **MLCompute** framework explicitly states:
> "This device applies to inference graphs only. It doesn't work with a training graph or inference graph that shares layers with a training graph."
This means Apple never shipped ANE-based training, even in their own training framework. The `MLCTrainingGraph` class supported `executeForward`, `executeGradient`, and `executeOptimizerUpdate` but only on CPU and GPU devices.
### WWDC 2025 Confirmation
WWDC 2025 Session 360 ("Discover machine learning & AI frameworks on Apple platforms") confirms:
- CoreML dispatches to CPU, GPU, and Neural Engine at runtime for **inference**
- MLX is the recommended tool for training/fine-tuning but uses Metal GPU, not ANE
- No mention of ANE training APIs in any Apple framework
- BNNSGraph (Accelerate) added `BNNSGraphBuilder` for CPU-only real-time inference
### Why ANE Lacks Native Training Support
The ANE is a fixed-function inference accelerator. It likely lacks:
- Hardware support for automatic differentiation / backward passes
- Ability to write to weight storage during execution (weights are read-only constants in the `e5rt_program_library`)
- Dynamic memory allocation needed for activation checkpointing
### Manual ANE Training Approach
Despite the lack of native support, training on ANE is theoretically possible using our custom MIL pipeline:
1. **Forward pass**: Write MIL program with `linear`/`matmul`/`layer_norm`/`gelu` ops. Weights embedded as constants. Execute on ANE. Save activations.
2. **Backward pass**: Write separate MIL programs for each layer's gradient computation:
- Linear backward: `dX = dY @ W` (matmul), `dW = dY.T @ X` (matmul)
- ReLU backward: `dX = dY * (X > 0)` (elementwise)
- LayerNorm backward: Multiple reduction + elementwise ops
3. **Optimizer step**: Run on CPU (simple elementwise: `W -= lr * dW`)
4. **Recompile**: After weight update, recompile MIL with new weights for next forward pass
The key bottleneck is step 4: recompiling MIL after every weight update. The `createProgramLibraryHandleWithRespecialization:` call takes ~10-50ms, which would dominate training time. This makes per-step ANE training impractical unless we can find a way to update weights without recompilation (e.g., via the `e5rt_*` C API or runtime weight injection).
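To make the bottleneck concrete, the sketch below combines the measured figures from Experiments Y3 and Z1 with the ~10-50 ms recompile range. It assumes one Y3-style forward and one Z1-style backward program per step and treats the CPU weight update as negligible; `step_budget` is an illustrative helper, not code from the repository.

```python
def step_budget(forward_ms, backward_ms, recompile_ms):
    """Return (total step time in ms, fraction of it spent recompiling)."""
    total = forward_ms + backward_ms + recompile_ms
    return total, recompile_ms / total

for recompile_ms in (10.0, 50.0):
    total, fraction = step_budget(0.2, 0.06, recompile_ms)
    print(f"recompile={recompile_ms:.0f} ms -> step={total:.2f} ms, "
          f"recompile share={fraction:.1%}")
```

Even at the low end of the range, recompilation accounts for well over 90% of each step, which is why runtime weight injection matters more for training throughput than further kernel tuning.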