ANE ChainingRequest API Research
Research into Apple Neural Engine private APIs for multi-kernel pipelining, conducted on M4 Max / macOS 15.
Goal: Eliminate CPU round-trips between ANE layer evaluations. In a 12-layer model, sequential evaluation requires 23+ CPU-ANE round-trips per token. The _ANEChainingRequest API appears designed to let the ANE run operations back-to-back in a hardware pipeline, keeping data on-chip.
Status: ChainingRequest validates and prepareChainingWithModel: no longer crashes (crash fix: pass nil for symbol/procedure params). Blocked on Code=15 (ANEProgramChainingPrepare Failed) -- the _ANEModel needs Espresso IR format (not MIL) for full symbol table population. At production dims (768x256), sequential ANE dispatch costs ~0.2ms/kernel; chaining would save ~23 round-trips per token.
See also: ANE_INTERNALS.md for comprehensive ANE documentation including compilation pipeline, hardware specs, and community research references.
Test Files
| File | Purpose |
|---|---|
| `training/test_chaining.m` | v1 prototype: sequential baseline + ChainingRequest creation |
| `training/test_chaining_v2.m` | v2 deep exploration: 6-phase probe of 12+ private classes |
| `training/test_ane_model.m` | Experiments E-P: `_ANEModel` loading, compiler, chaining, fences, type encoding, mapping |
| `training/test_throughput_ceiling.m` | Experiment I: 12-kernel throughput ceiling benchmark |
Build and run:
cd training
make test_chaining && ./test_chaining
make test_chaining_v2 && ./test_chaining_v2
make test_ane_model && ./test_ane_model
make test_throughput_ceiling && ./test_throughput_ceiling
1. Executive Summary
What works
| Finding | Impact | Status |
|---|---|---|
| `evaluateRealTimeWithModel:` via `_ANEClient` | 1.88x faster on small kernels (64x32); no benefit at production dims (768x256) | Benchmarked |
| `processRequest` via `_ANEProgramForEvaluation` | 1.34x faster on small kernels; marginal at production dims | Benchmarked |
| `_ANEBuffer` wraps IOSurface with `symbolIndex` | Solves input indexing for chaining | Proven |
| All 9 unexplored ANE classes exist on M4 Max | Full API surfaces documented | Documented |
Important: The RT execution speedup (1.88x) observed in isolated testing on 64x32 convolution kernels does not generalize to production dimensions. At 768x256 (Stories110M size), all four execution paths converge to ~0.2 ms per kernel. See Production Dimension Results below.
What's been solved
| Finding | Status | Detail |
|---|---|---|
| `_ANEIOSurfaceOutputSets` works with 64-byte `statsSurRef` | SOLVED | Any non-NULL IOSurface works as stats buffer |
| `_ANEChainingRequest.validate` returns YES | SOLVED | With proper `_ANEBuffer` inputs + `_ANEIOSurfaceOutputSets` outputs |
| `processRequest` via `_ANEProgramForEvaluation` | 1.34x faster | Lower-level eval (0.131 ms vs 0.175 ms) |
| ChainingRequest factory crash (`[NSConstantIntegerNumber count]`) | SOLVED | Pass nil for `lbInputSymbolId`, `lbOutputSymbolId`, `procedureIndex` |
| `_ANEModel` loading from temp directory | SOLVED | `modelAtURL:key:` with tmpDir URL + `hexStringIdentifier` |
| `_ANESharedSignalEvent` / `_ANESharedWaitEvent` | SOLVED | Use `MTLSharedEvent` or `IOSurfaceSharedEventCreate()` |
| ChainingRequest type encodings | DOCUMENTED | All 9 factory params are @ (object). prepare has 5 params (3x@, 1xI qos, 1x^@ err) |
What's still blocked
| Blocker | Root Cause |
|---|---|
| `prepareChainingWithModel:` returns Code=15 | `ANEProgramChainingPrepare()` Failed -- model not recognized as chaining-capable |
| `_ANEModel` has empty symbol table | MIL-compiled model shell lacks Espresso IR data (`model.espresso.net`) |
| `_ANEClient.loadModel:` / `compileModel:` fail | Require Espresso IR format, not MIL |
| `_ANEProgramIOSurfacesMapper` returns NO | Needs fully loaded model with symbol table |
| `_ANEPerformanceStats` with `_ANERequest` | Request expects `statType` selector on perfStats objects |
2. ANE Private API Class Map
Core Classes (known working)
_ANEInMemoryModel -- the model object for in-memory MIL compilation.
- `+inMemoryModelWithDescriptor:` -- create from `_ANEInMemoryModelDescriptor`
- `-compileWithQoS:options:error:` -- compile MIL to ANE binary
- `-loadWithQoS:options:error:` -- load compiled model onto ANE
- `-evaluateWithQoS:options:request:error:` -- standard evaluation (QoS 0-63, 21 default)
- `-unloadWithQoS:error:` -- unload from ANE
- Properties: `hexStringIdentifier`, `programHandle` (uint64), `program` (`_ANEProgramForEvaluation`), `perfStatsMask`
- Missing: `inputSymbolNames`, `outputSymbolNames`, `inputSymbolIndicesForProcedureIndex:`
_ANEInMemoryModelDescriptor -- model specification.
- `+modelWithMILText:weights:optionsPlist:` -- create descriptor from MIL NSData + weight dict
_ANERequest -- evaluation request packaging I/O surfaces.
- `+requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:`
- `perfStats` parameter expects `NSArray` of stat info objects (not `_ANEPerformanceStats`)
_ANEIOSurfaceObject -- thin wrapper around IOSurfaceRef.
- `+objectWithIOSurface:` -- wrap a raw IOSurface
- Does NOT have `symbolIndex` property (this is the v1 blocker)
_ANEClient -- client connection to the ANE daemon.
- `+sharedConnection` -- singleton accessor
- `-evaluateWithModel:options:request:qos:error:` -- 5-param eval via client
- `-evaluateRealTimeWithModel:options:request:error:` -- RT priority eval (1.74x-1.88x faster on 64x32 kernels; no benefit at production dims)
- `-doEvaluateDirectWithModel:options:request:qos:error:` -- direct eval bypass
- `-beginRealTimeTask` / `-endRealTimeTask` -- RT task bracketing (returns NO, but RT eval still works)
- `-prepareChainingWithModel:options:chainingReq:qos:error:` -- chaining setup
- `-enqueueSetsWithModel:outputSet:options:qos:error:` -- chaining output enqueue
- `-buffersReadyWithModel:inputBuffers:options:qos:error:` -- chaining input signal
Discovered Classes (v2 exploration)
_ANEBuffer -- wraps _ANEIOSurfaceObject with index metadata. Key discovery.
- `+bufferWithIOSurfaceObject:symbolIndex:source:` -- factory
  - `ioSurfaceObject`: an `_ANEIOSurfaceObject` (NOT raw `IOSurfaceRef`)
  - `symbolIndex`: `NSNumber` mapping to compiled model I/O symbol
  - `source`: `long long` -- 0=ANE, 1=output, 2=unknown
- Properties: `ioSurfaceObject`, `symbolIndex`, `source`
- Description format: `"_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=0 }"`
_ANEProgramIOSurfacesMapper -- maps IOSurfaces to compiled model symbols.
- `+mapperWithProgramHandle:(uint64_t)handle` -- works, creates mapper
- `+mapperWithController:(id)ctrl` -- alternative factory
- `-mapIOSurfacesWithModel:request:cacheInference:error:` -- FAILS on `_ANEInMemoryModel` (calls `inputSymbolIndicesForProcedureIndex:` which doesn't exist)
- `-validateRequest:model:` -- also fails for same reason
- Implication: designed for `_ANEModel` (disk-based compiled models), not in-memory MIL
_ANEProgramForEvaluation -- lower-level evaluation program.
- Accessible via `model.program` property
- `+programWithHandle:intermediateBufferHandle:queueDepth:` -- factory
- `-processRequest:model:qos:qIndex:modelStringID:options:returnValue:error:` -- low-level eval
_ANEIOSurfaceOutputSets -- output set packaging for chaining.
- `+objectWithstatsSurRef:outputBuffer:` -- factory
  - `statsSurRef`: `IOSurfaceRef` for perf stats collection -- returns nil when NULL
  - `outputBuffer`: `NSArray` of `_ANEBuffer` objects
- This was the v2 blocker (unknown stats IOSurface format). It has since been resolved: any non-NULL IOSurface works, and a 64-byte surface is sufficient
_ANEInputBuffersReady -- input signaling for chaining pipeline.
- `+inputBuffersWithProcedureIndex:inputBufferInfoIndex:inputFreeValue:executionDelay:`
- Parameters: procedure index, buffer info indices, free values, execution delay
- This is the mechanism that tells the ANE "inputs are ready, start processing"
_ANEOutputSetEnqueue -- output pipeline configuration for chaining.
- `+outputSetWithProcedureIndex:setIndex:signalValue:signalNotRequired:isOpenLoop:`
- Configures output set enqueue behavior with signal values and open-loop mode
_ANEChainingRequest -- the chaining request itself.
- `+chainingRequestWithInputs:outputSets:lbInputSymbolId:lbOutputSymbolId:procedureIndex:signalEvents:transactionHandle:fwEnqueueDelay:memoryPoolId:`
- `-validate` -- returns YES/NO
- Expects `inputs` as `_ANEBuffer` objects, `outputSets` as `_ANEIOSurfaceOutputSets` objects
_ANEModelInstanceParameters -- model instance configuration.
- Alloc/init produces a valid object
- API surface dumped but not yet exercised
_ANEDeviceController -- device-level controller.
- `+controllerWithProgramHandle:` -- attempted but returned nil in our tests
_ANEQoSMapper -- QoS level mapping.
- API surface dumped, not yet exercised
_ANEPerformanceStats -- performance statistics.
- `+statsWithHardwareExecutionNS:(uint64_t)ns` -- factory
- Properties: `hwExecutionTime`, `performanceCounters`
- Cannot be used with `_ANERequest.perfStats` (expects array of objects with `statType` selector)
- Setting `perfStatsMask=0xFF` on model works but `performanceCounters` returns nil
_ANESharedSignalEvent / _ANESharedWaitEvent -- hardware sync primitives (not explored in v2; exercised later in Experiment G).
- Likely the fence mechanism for GPU-ANE or multi-model synchronization
- Referenced in `_ANEChainingRequest.signalEvents` parameter
3. Experiment Logs
v1: test_chaining.m Results (M4 Max)
=== ANE ChainingRequest Prototype ===
All required classes found.
--- Phase 1: Compile two identical conv kernels ---
Kernel 1: compiled and loaded
Kernel 2: compiled and loaded
--- Phase 2: Baseline (sequential eval) ---
Sequential: 10.355 ms total (0.207 ms/pair)
Output[0..3]: [0.2500, 0.2500, 0.2500, 0.2500]
--- Phase 3: _ANEChainingRequest exploration ---
_ANEClient: obtained
ChainingRequest created: _ANEChainingRequest: { inputBuffer=(
"_ANEIOSurfaceObject: { ioSurface=0x... ; startOffset=0 }"
) ; outputSets=( ... ) }
validate: NO
--- Phase 4: Loopback ChainingRequest ---
ChainingRequest created (loopback)
validate: NO
prepareChainingWithModel: EXCEPTION (validate fails first)
--- Summary ---
Sequential baseline: 0.207 ms/pair (two evals + memcpy)
ChainingRequest: creates but validate FAILS
Root cause: _ANEIOSurfaceObject lacks symbolIndex property
Next: explore _ANEBuffer and _ANEProgramIOSurfacesMapper
v2: test_chaining_v2.m Results (M4 Max)
Phase 1: Class Introspection
- 9 classes found, 0 missing
- All classes exist on M4 Max / macOS 15
- Full method lists, properties, and type encodings dumped for each
Phase 2: Symbol Name Discovery
- `inputSymbolNames`: NOT available on `_ANEInMemoryModel`
- `outputSymbolNames`: NOT available on `_ANEInMemoryModel`
- `programHandle`: YES (uint64 handle to compiled program)
- `_ANEIOSurfaceObject` does NOT have `symbolIndex` getter or setter
- `+objectWithIOSurface:symbolIndex:` class method NOT available
Phase 3: IOSurface Mapper & Buffer Experiments
3a: _ANEProgramIOSurfacesMapper
mapperWithProgramHandle(12345): created successfully
mapIOSurfacesWithModel: EXCEPTION
-[_ANEInMemoryModel inputSymbolIndicesForProcedureIndex:]:
unrecognized selector
validateRequest:model: EXCEPTION (same reason)
3b: _ANEBuffer -- success
bufferWithIOSurfaceObject(symIdx=0, source=0):
_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=0 }
bufferWithIOSurfaceObject(symIdx=0, source=1):
_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=1 }
bufferWithIOSurfaceObject(symIdx=0, source=2):
_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=2 }
bufferWithIOSurfaceObject(symIdx=1, source=0):
_ANEBuffer: { ioSurface=0x... ; symbolIndex=1 ; ANEBufferProducerAgent=0 }
symbolIndex property: accessible and correct
3c: _ANEIOSurfaceObject symbolIndex experiments
setSymbolIndex: NOT available on _ANEIOSurfaceObject
symbolIndex getter: NOT available
+objectWithIOSurface:symbolIndex: NOT available
3d: IOSurface property experiments
IOSurface 'symbolIndex' property (set via IOSurfaceSetValue): 0
_ANEIOSurfaceObject.symbolIndex after property set: <exception>
(IOSurface user properties do NOT propagate to _ANEIOSurfaceObject)
3e: _ANEProgramForEvaluation
k1.model.program: <_ANEProgramForEvaluation: 0x...>
(accessible via model.program property)
Phase 4: ChainingRequest Retry
4a: Sequential baseline
Sequential: 0.259 ms/pair (50 iters)
Output[0..3]: [0.2500, 0.2500, 0.2500, 0.2500]
Attempts 1-4: Various raw IOSurface configurations
[Attempt 1] Standard (raw IOSurfaceObject): CRASH
-[_ANEIOSurfaceObject symbolIndex]: unrecognized selector
[Attempt 2] IOSurface with symbolIndex property: CRASH (same)
[Attempt 3] Two-model loopback: CRASH (same)
[Attempt 4] Skip validate, call prepareChainingWithModel directly: CRASH (same)
Attempt 5: _ANEBuffer + _ANEIOSurfaceOutputSets
bufIn: _ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=0 }
bufOut: _ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=1 }
outputSet (objectWithstatsSurRef:NULL outputBuffer:@[bufOut]): nil
-> _ANEIOSurfaceOutputSets returns nil when statsSurRef is NULL
Attempt 6: _ANEClient.evaluateWithModel: -- works
evaluateWithModel (via client): YES
Attempt 7: _ANEClient.doEvaluateDirectWithModel: -- works
doEvaluateDirectWithModel: YES
Phase 5: Alternative Execution Paths
5a: Real-time eval -- 1.7x speedup
beginRealTimeTask: NO (possibly needs entitlement)
evaluateRealTimeWithModel: YES
RT eval: 0.090 ms/eval avg (50 iters)
Standard eval: 0.157 ms/eval avg (50 iters)
RT vs Standard speedup: 1.74x
endRealTimeTask: NO
5b: PerfStats
perfStatsMask = 0x01..0x80: set OK (all masks accepted)
statsWithHardwareExecutionNS:0 = <_ANEPerformanceStats>
Eval with @[perfStats]: OK (no crash when wrapped in array)
hwExecutionTime after eval: nil
Eval with mask=0xFF, perfStats=nil: OK
performanceCounters: nil
4. Evaluation Path Benchmarks
Measured on 64x32 convolution kernels, M4 Max, 200 iterations after 10 warmup:
| Method | Latency | Speedup | API |
|---|---|---|---|
| `evaluateWithQoS:` (standard) | 0.175 ms | 1.0x | `model.evaluateWithQoS:options:request:error:` |
| `evaluateRealTimeWithModel:` | 0.093 ms | 1.88x | `client.evaluateRealTimeWithModel:options:request:error:` |
| `processRequest` | 0.131 ms | 1.34x | `program.processRequest:model:qos:qIndex:modelStringID:options:returnValue:error:` |
| `doEvaluateDirectWithModel:` | 0.225 ms | 0.78x | `client.doEvaluateDirectWithModel:options:request:qos:error:` |
Key observations (small kernel, isolated):
- RT eval was fastest in isolated test (1.88x speedup on 64x32)
- `processRequest` was faster than standard but slower than RT
- `doEvaluateDirectWithModel` was actually slower than standard (0.78x)
- `beginRealTimeTask` returning NO does not prevent `evaluateRealTimeWithModel:` from working
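The measurement protocol used for this table (10 warmup calls, then 200 timed iterations) can be sketched in plain C, which compiles unchanged inside the Objective-C test files. This is a minimal sketch and not the code from the test files: `eval_fn`, `bench_avg_ms`, `speedup`, and the `count_call` stub are hypothetical names, and the callback stands in for whichever evaluation path is being measured.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one evaluation of the path under test. */
typedef void (*eval_fn)(void *ctx);

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Run `warmup` untimed calls, then return the mean latency of `iters` timed calls. */
static double bench_avg_ms(eval_fn fn, void *ctx, int warmup, int iters) {
    assert(fn != NULL && warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; i++) fn(ctx);
    double t0 = now_ms();
    for (int i = 0; i < iters; i++) fn(ctx);
    double t1 = now_ms();
    return (t1 - t0) / iters;
}

/* Speedup as reported in the table: baseline latency divided by path latency. */
static double speedup(double baseline_ms, double path_ms) {
    assert(path_ms > 0.0);
    return baseline_ms / path_ms;
}

/* Stub callback for a dry run: counts how many times it was invoked. */
static void count_call(void *ctx) {
    (*(int *)ctx)++;
}
```

Calling `bench_avg_ms(count_call, &n, 10, 200)` invokes the stub 210 times, and `speedup(0.175, 0.093)` reproduces the 1.88x figure in the RT row.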
Production Dimension Results (test_bench_paths.m, M4 Max)
At realistic kernel sizes with multiple compiled models, the picture changes:
| Config | Standard | RT | processRequest | ane_eval_rt |
|---|---|---|---|---|
| 64x32 (test) | 0.109 ms | 0.233 ms (0.5x) | 0.156 ms (0.7x) | 0.195 ms (0.6x) |
| 128x64 | 0.208 ms | 0.184 ms (1.1x) | 0.201 ms (1.0x) | 0.185 ms (1.1x) |
| 256x64 | 0.197 ms | 0.212 ms (0.9x) | 0.203 ms (1.0x) | 0.157 ms (1.3x) |
| 512x64 | 0.120 ms | 0.147 ms (0.8x) | 0.194 ms (0.6x) | 0.179 ms (0.7x) |
| 768x256 (prod) | 0.205 ms | 0.246 ms (0.8x) | 0.185 ms (1.1x) | 0.291 ms (0.7x) |
Key finding: The RT eval speedup observed in isolated testing (1.88x) does not hold at production dimensions. At 768x256 (Stories110M size), all eval paths perform similarly (~0.2 ms), with standard eval being competitive or fastest. The overhead of the client-based paths (RT, direct) outweighs any ANE scheduling benefit at scale.
5. Remaining Blockers and Next Steps
SOLVED: _ANEIOSurfaceOutputSets statsSurRef
The chaining pipeline requires:
- Inputs as `_ANEBuffer` objects with `symbolIndex` -- SOLVED
- OutputSets as `_ANEIOSurfaceOutputSets` objects -- SOLVED
A 64-byte IOSurface as statsSurRef is sufficient. _ANEChainingRequest.validate returns YES with this setup.
SOLVED: ChainingRequest parameter type mismatch (Experiment K-L)
The [NSConstantIntegerNumber count] crash was caused by passing NSNumber values for lbInputSymbolId, lbOutputSymbolId, and procedureIndex. Type encoding analysis (Experiment K) revealed all 9 factory parameters are @ (id/object), but the implementation calls count on them, so NSNumber values crash. Arrays are not accepted either: they crash on unsignedIntegerValue (Experiment L), which leaves nil as the only value that works.
Fix: Pass nil for lbInputSymbolId, lbOutputSymbolId, and procedureIndex:
chainingRequestWithInputs:@[buf] outputSets:@[outSet]
lbInputSymbolId:nil lbOutputSymbolId:nil procedureIndex:nil
signalEvents:@[] transactionHandle:@0 fwEnqueueDelay:@0 memoryPoolId:@0
This produces a valid _ANEChainingRequest (validate returns YES) and prepareChainingWithModel: no longer crashes.
Current Blocker: ANEProgramChainingPrepare() Failed (Code=15)
prepareChainingWithModel: now returns NO with error:
Error Domain=com.apple.appleneuralengine Code=15
"ANEProgramChainingPrepare() Failed: Program chaining prepare error"
Three model types were tested. The two `_ANEModel` variants return this error; `_ANEInMemoryModel` crashes before reaching it:
- Fresh `_ANEModel` (state=1, populated with programHandle+program)
- Populated `_ANEModel` from Experiment E (state=5 after failed loadModel/compileModel)
- `_ANEInMemoryModel` still crashes on `getUUID` (cannot be used with chaining at all)
The Code=15 error is a logical failure in the ANE daemon's chaining preparation, not a crash. The model is not fully recognized as "chaining-capable" by the daemon, likely because:
- The `_ANEModel` was populated by copying `programHandle`/`program` from an `_ANEInMemoryModel`, not loaded through the standard CoreML/Espresso pipeline
- Symbol indices remain empty (the daemon may require them for chaining buffer routing)
- The model needs `model.espresso.net` format (not MIL) for `_ANEClient.loadModel:` / `compileModel:`
Previous blocker (SOLVED): [NSConstantIntegerNumber count] crash -- fixed by passing nil for symbol/procedure params.
Experiments E-H Results (test_ane_model.m)
Experiment E: _ANEModel Loading -- SOLVED
_ANEModel.modelAtURL:key: works with the compiled temp directory URL and hexStringIdentifier as key:
diskModel = _ANEModel.modelAtURL:key:(tmpDirURL, hexId)
-> _ANEModel with UUID, getUUID works
-> state=1, program=nil, programHandle=0 (shell only)
Populating the shell with _ANEInMemoryModel data:
diskModel.setProgramHandle:(inMemoryModel.programHandle) -> success
diskModel.setProgram:(inMemoryModel.program) -> success
After population, programHandle and program are set, but inputSymbolIndicesForProcedureIndex:0 still returns empty NSIndexSet. The symbol table data isn't stored in the _ANEProgramForEvaluation -- it's likely in the model.hwx or net.plist that the standard CoreML path generates.
Experiment E2: ANECompiler -- No ObjC API
- `ANECompiler.framework` exists at `/System/Library/PrivateFrameworks/ANECompiler.framework/` but contains no ObjC classes -- it's a pure C library (`ANECCompile()` is the entry point, called internally by `_ANEInMemoryModel.compileWithQoS:`)
- `debug_mask` option had no visible effect on compilation output
- No `ane_compiler_service` found at standard paths
- Key `_ANEInMemoryModel` compilation methods found: `saveModelFiles`, `localModelPath`, `compiledModelExists`, `mapIOSurfacesWithRequest:cacheInference:error:`
Experiment F: Chaining Pipeline -- Blocked
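With populated _ANEModel (has UUID + programHandle + program), prepareChainingWithModel: still crashed on [NSConstantIntegerNumber count] at the time of this experiment. The crash is in the _ANEChainingRequest parameter handling, not in the model itself; it was later resolved in Experiment L by passing nil for the symbol/procedure parameters.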
Experiment G: Hardware Fences -- FULLY SOLVED
Both _ANESharedSignalEvent and _ANESharedWaitEvent now work:
// MTLSharedEvent via Metal (works)
id device = MTLCreateSystemDefaultDevice();
id sharedEvent = [device newSharedEvent];
// IOSurfaceSharedEvent via IOKit (also works)
id iosEvent = IOSurfaceSharedEventCreate();
// Signal event factory: (uint64_t value, unsigned int symbolIndex, long long eventType, id sharedEvent)
_ANESharedSignalEvent.signalEventWithValue:symbolIndex:eventType:sharedEvent:
-> works with both MTLSharedEvent and IOSurfaceSharedEvent
// Wait event factory: (uint64_t value, id sharedEvent)
_ANESharedWaitEvent.waitEventWithValue:sharedEvent:
-> works with both event types
Event types 0, 1, 2 all produce valid signal events. The eventType property is correctly set.
Experiment H: Alternative Preparation -- Same Crash
doPrepareChainingWithModel:options:chainingReq:qos:error: exists with identical signature and crashes identically (observed before the nil-parameter fix from Experiment L). Full _ANEClient API (46 instance methods) documented in test output.
Throughput Ceiling (test_throughput_ceiling.m, Experiment I)
12-kernel pipeline benchmarks on M4 Max:
| Config | Sequential (run+memcpy) | Run-only | Memcpy-only | GCD Serial |
|---|---|---|---|---|
| 64x32 (test) | 0.272 ms/kernel | 0.158 ms/kernel | 0.001 ms/copy | 0.200 ms/kernel |
| 256x64 (small) | 0.191 ms/kernel | 0.181 ms/kernel | 0.002 ms/copy | 0.176 ms/kernel |
| 768x256 (prod) | 0.177 ms/kernel | 0.226 ms/kernel | 0.006 ms/copy | 0.186 ms/kernel |
Key findings:
- Memcpy overhead is negligible (<0.01 ms per copy even at 393KB). Not the bottleneck.
- CPU round-trip overhead is in the ANE dispatch itself, not data movement.
- At production dims, sequential with memcpy is actually faster than run-only (0.177 vs 0.226 ms/kernel), possibly a pipeline caching effect.
- GCD serial queue provides modest improvement at small dims but marginal at production.
- Chaining's value would be eliminating the ~0.2ms/kernel ANE dispatch overhead, not memcpy. With 12 kernels, total pipeline takes ~2.1ms (prod), so eliminating dispatch could potentially halve this.
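The arithmetic behind the last finding can be made explicit. This is a minimal sketch, assuming pipeline latency scales linearly with kernel count; both function names are hypothetical, and `dispatch_fraction` is an assumed share of per-kernel latency spent in dispatch, not a measured quantity.

```c
#include <assert.h>

/* Total latency of a sequential pipeline: kernels * per-kernel latency. */
static double pipeline_total_ms(int kernels, double per_kernel_ms) {
    assert(kernels >= 0 && per_kernel_ms >= 0.0);
    return kernels * per_kernel_ms;
}

/* Estimated total if chaining removed the dispatch share of every kernel.
   dispatch_fraction is an assumption in [0, 1], not a measurement. */
static double chained_total_ms(int kernels, double per_kernel_ms,
                               double dispatch_fraction) {
    assert(dispatch_fraction >= 0.0 && dispatch_fraction <= 1.0);
    return pipeline_total_ms(kernels, per_kernel_ms) * (1.0 - dispatch_fraction);
}
```

With the production row (12 kernels at 0.177 ms/kernel), `pipeline_total_ms` gives about 2.12 ms. A `dispatch_fraction` of 0.5 corresponds to the "could potentially halve this" estimate and yields about 1.06 ms.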
Experiments K-P Results (test_ane_model.m, 2026-03-04)
Experiment K: Type Encoding Analysis -- COMPLETE
Full type encodings for all chaining-related methods:
| Method | Encoding | Notes |
|---|---|---|
| `chainingRequestWithInputs:...` | `@88@0:8@16@24@32@40@48@56@64@72@80` | All 9 params are `@` (id/object) |
| `prepareChainingWithModel:...` | `B52@0:8@16@24@32I40^@44` | 5 params: 3x `@`, 1x `I` (uint32 qos), 1x `^@` (error ptr) |
| `doPrepareChainingWithModel:...` | `B52@0:8@16@24@32I40^@44` | Same signature as `prepareChainingWithModel` |
The _ANEChainingRequest factory takes 9 object parameters. The lbInputSymbolId, lbOutputSymbolId, and procedureIndex are all @ (object), not raw integers. Internally, the factory calls unsignedIntegerValue (from NSNumber) or count (from NSArray) on these parameters.
| `_ANEChainingRequest` Property | Encoding | Type |
|---|---|---|
| `procedureIndex` | `@` | id (nil or NSArray) |
| `loopbackInputSymbolIndex` | `@` | id (nil or NSArray) |
| `loopbackOutputSymbolIndex` | `@` | id (nil or NSArray) |
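The method encodings listed in Experiment K can be decoded mechanically: the string starts with the return type and total frame size, followed by one type-plus-offset pair per argument, where the first two arguments are the implicit `self` and `_cmd`. The sketch below counts explicit and object-typed parameters. It is a simplified reader written for illustration (the names `enc_summary`, `skip_type`, and `summarize_encoding` are hypothetical), and it handles only the subset of the encoding grammar that appears in this document: single-character types, `^` pointer prefixes, and bracketed aggregates.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Result of scanning one method type encoding. */
typedef struct {
    int explicit_params;  /* arguments after the implicit self and _cmd */
    int object_params;    /* explicit arguments whose type is exactly '@' */
} enc_summary;

/* Skip one type token: '^' prefixes, then an aggregate or a single character. */
static const char *skip_type(const char *p) {
    while (*p == '^') p++;
    if (*p == '{' || *p == '[' || *p == '(') {
        int depth = 0;
        do {
            if (*p == '{' || *p == '[' || *p == '(') depth++;
            else if (*p == '}' || *p == ']' || *p == ')') depth--;
            p++;
        } while (*p && depth > 0);
        return p;
    }
    return *p ? p + 1 : p;
}

/* Skip the decimal frame size or argument offset that follows a type. */
static const char *skip_digits(const char *p) {
    while (isdigit((unsigned char)*p)) p++;
    return p;
}

static enc_summary summarize_encoding(const char *enc) {
    enc_summary s = {0, 0};
    assert(enc != NULL);
    const char *p = skip_digits(skip_type(enc));  /* return type + frame size */
    int index = 0;                                /* 0 = self, 1 = _cmd */
    while (*p) {
        const char *start = p;
        p = skip_type(p);
        if (index >= 2) {
            s.explicit_params++;
            if (p - start == 1 && *start == '@') s.object_params++;
        }
        p = skip_digits(p);
        index++;
    }
    return s;
}
```

For the two encodings in the table, this reports 9 explicit parameters (all objects) for the factory and 5 explicit parameters (3 objects) for `prepareChainingWithModel:`, matching the Notes column.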
Experiment L: Array-Typed Parameters -- BREAKTHROUGH
| Combo | lbIn | lbOut | procIdx | Factory | Validate | Prepare |
|---|---|---|---|---|---|---|
| L.1: Arrays `@[@(-1)]` | `@[@(-1)]` | `@[@(-1)]` | `@[@0]` | CRASH: `unsignedIntegerValue` on NSArray | - | - |
| L.2: Arrays `@[@0]` | `@[@0]` | `@[@0]` | `@[@0]` | CRASH: `unsignedIntegerValue` on NSArray | - | - |
| L.3: Empty `@[]` | `@[]` | `@[]` | `@[]` | CRASH: `unsignedIntegerValue` on empty array | - | - |
| L.4: nil | nil | nil | nil | OK | YES | NO (Code=15) |
| L.5: NSNumber | `@(-1)` | `@(-1)` | `@0` | CRASH: `count` on NSNumber | - | - |
Passing nil for all three symbol/procedure params gets past both the factory crash and the prepareChainingWithModel crash. validate returns YES, and prepareChainingWithModel: returns a clean error (Code=15: ANEProgramChainingPrepare() Failed) instead of crashing.
Experiment M: Load Model via _ANEClient -- BLOCKED
Both loadModel: and compileModel: on _ANEClient require Espresso IR format (model.espresso.net), not MIL:
Error Domain=com.apple.appleneuralengine.espresso Code=-1
"_ANEEspressoIRTranslator : error Cannot load network '.../model.espresso.net'"
compiledModelExistsFor: returns NO for our MIL-compiled model. After the failed load/compile attempts, the _ANEModel state changes from 1 to 5 (error/invalid state).
The standard CoreML pipeline generates model.espresso.net (Espresso IR) and model.espresso.weights from the .mlpackage / .mlmodelc format. Our MIL-only path bypasses this, so we can't use _ANEClient.loadModel: without first generating the Espresso IR.
Experiment N: IOSurface Mapping -- PARTIAL
_ANEProgramIOSurfacesMapper:
- `mapperWithProgramHandle:` creates a valid mapper from the `_ANEInMemoryModel` programHandle
- `mapIOSurfacesWithModel:request:cacheInference:error:` returns NO (no exception, no error output)
- `validateRequest:model:` returns NO
- `_ANEModel.mapper` property is nil
- `prepareANEMemoryMappingParams:request:` revealed `ANEMemoryMappingParamsStruct` has 128 `ANEBufferStruct` slots: `[128{ANEBufferStruct=^{__IOSurface}IiiI}]`
The mapper appears to need a fully loaded model with symbol table data that our MIL-compiled shell doesn't have.
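The type encoding `[128{ANEBufferStruct=^{__IOSurface}IiiI}]` reported in Experiment N can be read as a C layout: an array of 128 structs, each holding an IOSurface pointer followed by unsigned, signed, signed, and unsigned 32-bit integers. The sketch below is a reading of that encoding only. Every field name is hypothetical (the encoding carries types and order, not names), the real `ANEMemoryMappingParamsStruct` may contain members beyond this array, and `used_slots` is an illustrative helper rather than anything found in the framework.

```c
#include <assert.h>
#include <stddef.h>

struct __IOSurface;  /* opaque; IOSurfaceRef is a pointer to this type */

/* One slot, following the encoding ^{__IOSurface} I i i I.
   Field names are hypothetical placeholders. */
typedef struct {
    struct __IOSurface *surface;  /* ^{__IOSurface} */
    unsigned int field_a;         /* I */
    int field_b;                  /* i */
    int field_c;                  /* i */
    unsigned int field_d;         /* I */
} ANEBufferStructSketch;

enum { ANE_BUFFER_SLOTS = 128 };

/* Only the 128-slot array is known from the encoding. */
typedef struct {
    ANEBufferStructSketch buffers[ANE_BUFFER_SLOTS];
} ANEMemoryMappingParamsSketch;

/* Count slots whose surface pointer is set. */
static size_t used_slots(const ANEMemoryMappingParamsSketch *p) {
    assert(p != NULL);
    size_t n = 0;
    for (size_t i = 0; i < ANE_BUFFER_SLOTS; i++)
        if (p->buffers[i].surface != NULL) n++;
    return n;
}
```

A fixed table of 128 slots suggests an upper bound on the number of surfaces one mapping request can carry, though the source does not confirm how the slots are used.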
Experiment O: Procedure Info -- EMPTY
- `procedureInfoForProcedureIndex:0` returns nil on the populated `_ANEModel`
- `procedureCount` is not a method or KVC-accessible property
- `modelAttributes` returns empty dictionary `{}`
- `inputSymbolNames` / `outputSymbolNames` not available on `_ANEModel`
- The `symbolIndicesForProcedureIndex:indexArrayKey:` method exists (takes `I` + `@`) but symbol data is empty
Experiment P: Full Chaining Retry -- Code=15
Tested with three model types, all using nil for symbol params:
| Model | State | validate | prepare Result |
|---|---|---|---|
| Fresh `_ANEModel` (state=1, populated) | 1 | YES | NO (Code=15) |
| `_ANEInMemoryModel` | 3 | YES | CRASH: `getUUID` |
| Populated `_ANEModel` (from E, state=5) | 5 | YES | NO (Code=15) |
Also documented _ANEInputBuffersReady and _ANEOutputSetEnqueue type signatures:
| Class | Factory | Param Types |
|---|---|---|
| `_ANEInputBuffersReady` | `inputBuffersWithProcedureIndex:inputBufferInfoIndex:inputFreeValue:executionDelay:` | `I` (uint32), `@` (NSArray), `@` (NSArray), `Q` (uint64) |
| `_ANEOutputSetEnqueue` | `outputSetWithProcedureIndex:setIndex:signalValue:signalNotRequired:isOpenLoop:` | `I`, `I`, `Q`, `B`, `B` |
Experiments Q-S Results (test_coreml_chaining.m, 2026-03-04)
Experiment Q: CoreML Pipeline -- MAJOR DISCOVERY
The E5 runtime (macOS 15+) does NOT use _ANEModel or _ANEChainingRequest at all.
CoreML on macOS 15 uses the MIL-based "E5" runtime, which completely bypasses the older Espresso/_ANEModel/_ANEChainingRequest path:
| Component | Old Path (Espresso) | New Path (E5/MIL) |
|---|---|---|
| Model format | `.espresso.net` + `.espresso.weights` | `model.mil` + `weights/weight.bin` |
| Model class | `_ANEModel` | `e5rt_program_library` (C struct) |
| Engine | `_ANEClient` + `_ANERequest` | `MLE5Engine` + `MLE5ExecutionStreamOperation` |
| Chaining | `_ANEChainingRequest` | `e5rt_execution_stream_operation` (unknown) |
| Compile | `_ANEClient.compileModel:` | `e5rt_program_library` AOT compilation |
| Sync | `_ANESharedSignalEvent` | `IOSurfaceSharedEventListener` + `MTLSharedEvent` |
Key findings:
- `MLModel.compileModelAtURL:` produces `.mlmodelc` with `model.mil` (NOT `model.espresso.net`)
- Loading an `MLModel` creates `MLDelegateModel` -> `MLE5Engine` -> `MLE5ProgramLibrary` -> `MLE5ProgramLibraryOnDeviceAOTCompilationImpl`
- No `_ANEModel` exists anywhere in the E5 object graph
- `_ANEClient.loadModel:` / `compileModel:` both require `model.espresso.net` which isn't generated
- Prediction succeeds (model runs on ANE), confirming E5 runtime works independently of `_ANEModel`
Internal E5 class hierarchy:
MLDelegateModel
└── _internalEngine: MLE5Engine
├── _programLibrary: MLE5ProgramLibrary
│ ├── _programLibraryHandle: e5rt_program_library* (opaque C struct)
│ ├── _impl: MLE5ProgramLibraryOnDeviceAOTCompilationImpl
│ │ ├── _milTextURL: NSURL
│ │ ├── _irProgram: shared_ptr<MIL::IRProgram> (C++)
│ │ └── _container: MLProgramE5Container
│ └── _container: MLProgramE5Container
│ ├── _modelAssetDescription
│ ├── _compilerVersionInfo
│ └── _functionInfoArray
└── _operationPool: MLE5StaticShapeExecutionStreamOperationPool
└── _pool: NSMutableSet of MLE5ExecutionStreamOperation
├── _operationHandle: e5rt_execution_stream_operation* (opaque)
├── _programLibrary: MLE5ProgramLibrary
├── _inputPorts / _outputPorts: NSArray
├── _waitEventListener: IOSurfaceSharedEventListener
└── _completionSharedEventBoundToESOP: MTLSharedEvent
Experiment R: Chaining with CoreML model -- BLOCKED
No _ANEModel extracted from E5 runtime, so prepareChainingWithModel: cannot be tested with a CoreML-compiled model. The E5 runtime is a completely separate execution path.
Experiment S: Two-Kernel Chaining -- BLOCKED
Blocked by Experiment R. The _ANEChainingRequest API appears to be from the older Espresso-based runtime and may not be usable with models compiled through the E5/MIL path.
Experiments T-V Results (2026-03-04)
Experiment T: E5 Runtime Symbol Scan
Found 4 exported C functions from the e5rt_* API:
- `e5rt_program_library_create` -- creates program library handle
- `e5rt_execution_stream_create` -- creates execution stream handle
- `e5rt_async_event_create` -- creates async event for synchronization
- `e5rt_async_event_signal` -- signals an async event
Key ObjC classes in the E5 runtime:
- `MLE5ExecutionStreamOperation` (63 instance methods) -- holds `e5rt_execution_stream_operation*`, manages input/output ports
- `MLE5ExecutionStream` (29 instance methods) -- holds `e5rt_execution_stream*`, executes `operations` array
- `MLE5ExecutionStreamPool` -- manages streams via `takeOut` / `putBack:`
- `MLE5InputPort` / `MLE5OutputPort` -- hold `e5rt_io_port*`, bind features to ports
- `MLE5InputPortBinder` / `MLE5OutputPortBinder` -- handle memory binding for ports
- `MLE5ProgramLibrary` -- holds `e5rt_program_library*`
Critical method: MLE5ExecutionStream._executeStream:error: takes e5rt_execution_stream* and executes all operations in the operations array in sequence.
Experiment U: E5 Multi-Op Stream -- MAJOR BREAKTHROUGH
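Multiple ANE operations appeared to execute in a single E5 stream, with an apparent speedup of up to 4.87x over sequential CoreML. Experiment W1 later showed this result was invalid: the manually created operations returned all zeros and performed no ANE compute.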
Method:
- Load multiple CoreML models (`.mlpackage` -> `MLModel`)
- Extract `MLE5ProgramLibrary` from each model's `MLE5Engine`
- Create `MLE5ExecutionStreamOperation` for each, backed by each program library
- Preload operations (`preloadAndReturnError:`) to compile ANE programs
- Borrow an `MLE5ExecutionStream` from the stream pool
- Set multiple operations on the stream via `setOperations:`
- Prepare each operation's input features via `prepareForInputFeatures:options:error:`
- Execute all operations in one call via `_executeStream:error:`
Benchmark Results (M4 Max, macOS 15, N=500)
| Kernels | CoreML Sequential | E5 Multi-Op Stream | Speedup |
|---|---|---|---|
| 1 (256ch) | 0.0359 ms | 0.0272 ms | 1.32x |
| 2 (256+512ch) | 0.0623 ms | 0.0406 ms | 1.53x |
| 3 (256+512+1024ch) | 0.1599 ms | 0.0578 ms | 2.77x |
| 4 (256+512+1024+2048ch) | 0.3781 ms | 0.0776 ms | 4.87x |
Key observations:
- E5 stream per-kernel overhead is remarkably consistent: ~0.02 ms/kernel regardless of count
- CoreML sequential overhead grows non-linearly (0.036 -> 0.095 ms/kernel with 4 kernels)
- The speedup increases with more kernels: the dispatch overhead is amortized
- All operations execute on ANE with a single `_executeStream:` call
Code path for E5 multi-op stream:
// 1. Extract internals from CoreML-loaded model
id e5engine = [mlModel valueForKey:@"_internalEngine"]; // MLE5Engine
id progLib = [e5engine valueForKey:@"programLibrary"]; // MLE5ProgramLibrary
id pool = [e5engine valueForKey:@"streamPool"]; // MLE5ExecutionStreamPool
// 2. Create operation from program library
id op = [[MLE5ExecutionStreamOperation alloc]
initWithProgramLibrary:progLib functionName:@"main"
modelDescription:desc configuration:cfg
debugLabel:@"myOp" modelSignpostId:0];
[op preloadAndReturnError:nil];
// 3. Get stream and set operations
id stream = [pool takeOut];
void *sh = stream._streamHandle; // e5rt_execution_stream*
[stream setOperations:@[op1, op2, op3]];
// 4. Prepare and execute
for (id op in @[op1, op2, op3])
    [op prepareForInputFeatures:features options:predOpts error:nil];
[stream _executeStream:sh error:nil];
Revised Assessment (after T-V)
Initial conclusion, CORRECTED in Experiment W1 (see below): the E5 runtime (MLE5ExecutionStream + MLE5ExecutionStreamOperation) is the correct path for multi-kernel pipelining on macOS 15+.
Experiments W1-W5: Validation & Deep API Documentation (2026-03-04)
W1: Output Correctness Validation
CRITICAL CORRECTION: The previously reported "4.87x speedup" from multi-op streams was invalid. Validation revealed:
- MLE5Engine.predictionFromFeatures:options:error: produces EXACT (bit-identical) output to MLModel.predictionFromFeatures:error: for all tested sizes (256, 512, 1024, 2048 channels). This confirms the E5 engine is the correct computation path.
- Our manually-created MLE5ExecutionStreamOperation objects via initWithProgramLibrary: do not produce correct output -- they return all zeros. The _executeStream: call returns YES but no actual ANE compute occurs. The operation handles are 0x0 (not compiled), meaning our manually-created ops were never wired to actual ANE programs.
- The "speedup" was measuring the overhead of a no-op function returning immediately vs CoreML doing actual computation.
- MLE5StaticShapeExecutionStreamOperationPool.takeOutOperationForFeatures:error: returns pool-managed operations with valid handles, but using them with _executeStream: still produces zeros -- the output port bindings are not correctly populated.
- Stream reuse via _predictionFromFeatures:stream:options:error: fails with "E5RT: Port bindings cannot be changed while operation is in use in an execution stream" -- streams are locked after first use and cannot be reconfigured.
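
The invalid speedup above went unnoticed because timing was recorded before outputs were checked. A minimal sketch of a guard that could run before any benchmark is trusted is given here; the function name, the status strings, and the tolerance are illustrative assumptions, not part of the test files.

```python
def validate_output(candidate, reference, atol=1e-3):
    """Classify a kernel output before its timing is trusted.

    Returns "all_zero" when the candidate is entirely zero but the
    reference is not (the failure mode seen in W1), "mismatch" when any
    element differs by more than atol, and "ok" otherwise.
    """
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference lengths differ")
    if all(v == 0.0 for v in candidate) and any(r != 0.0 for r in reference):
        return "all_zero"
    max_diff = max((abs(c - r) for c, r in zip(candidate, reference)), default=0.0)
    return "ok" if max_diff <= atol else "mismatch"


reference = [0.5, -1.25, 2.0, 0.0]
print(validate_output([0.0, 0.0, 0.0, 0.0], reference))       # prints all_zero
print(validate_output([0.5, -1.25, 2.0004, 0.0], reference))  # prints ok
print(validate_output([0.5, -1.0, 2.0, 0.0], reference))      # prints mismatch
```

Reporting "all_zero" separately from "mismatch" makes a path that never ran easy to tell apart from one that ran with reduced precision.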
W1 Performance Profile
| Path | 256ch (ms) | 2048ch (ms) |
|---|---|---|
| CoreML API (predictionFromFeatures:error:) | 0.035 | 0.217 |
| Engine direct (predictionFromFeatures:options:error:) | 0.074 | 0.284 |
| Engine private (_predictionFromFeatures:options:error:) | 0.100 | 0.332 |
| Stream pool cycle (takeOut + putBack) | 0.008 | 0.008 |
| Op pool cycle | <0.001 | <0.001 |
Key finding: CoreML API is FASTER than calling the engine directly. MLDelegateModel implements internal caching (likely keeping a hot stream + operation) that avoids the per-call pool acquire/release overhead. The engine's predictionFromFeatures: method performs pool management on every call.
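
The size of the penalty for bypassing the CoreML API can be read off the W1 table with a few subtractions. The sketch below only restates the table's numbers; the dictionary keys are made-up labels.

```python
# Latencies in ms copied from the W1 table: (256ch, 2048ch).
latency_ms = {
    "coreml_api": (0.035, 0.217),
    "engine_direct": (0.074, 0.284),
    "engine_private": (0.100, 0.332),
}
STREAM_POOL_CYCLE_MS = 0.008  # takeOut + putBack, same table


def overhead_vs_coreml(path):
    """Extra latency of `path` over the CoreML API at each tested size."""
    base = latency_ms["coreml_api"]
    return tuple(round(t - b, 3) for t, b in zip(latency_ms[path], base))


for path in ("engine_direct", "engine_private"):
    small, large = overhead_vs_coreml(path)
    print(f"{path}: +{small:.3f} ms at 256ch, +{large:.3f} ms at 2048ch")
print(f"stream pool cycle alone: {STREAM_POOL_CYCLE_MS:.3f} ms")
```

The gap grows with channel count and is larger than the measured stream pool cycle, so pool acquire/release may be only part of the difference.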
W2: Exhaustive E5 Runtime API
Full class dumps captured for all E5 runtime classes. Key classes and their roles:
MLE5Engine (49 instance methods, 10 ivars)
- Superclass: MLModelEngine
- Entry point: predictionFromFeatures:options:error: (public), _predictionFromFeatures:stream:options:error: (internal)
- Key properties: streamPool (MLE5ExecutionStreamPool), operationPool(), programLibrary (MLE5ProgramLibrary)
- Manages: stream acquisition, operation preparation, input conforming, output post-processing
MLE5ProgramLibrary (17 instance methods, 5 ivars)
- Holds _programLibraryHandle (C struct e5rt_program_library*)
- Key method: createOperationForFunctionName:forceRespecialization:hasRangeShapeInputs:error: -- returns C-level e5rt_execution_stream_operation*
- Contains: compiled MIL program, model configuration, implementation object
MLE5ExecutionStreamOperation (63 instance methods, ~20 ivars)
- Holds _operationHandle (C struct e5rt_execution_stream_operation*)
- States: 0=created, transitions through prepare/execute
- Key methods: prepareForInputFeatures:options:error:, preloadAndReturnError:, outputFeatures
- Has input/output/state ports (MLE5InputPort, MLE5OutputPort)
- Internal binding: _bindInputFeaturesAndWaitEvents:options:error:, _bindOutputPortsWithOptions:error:
- Port binding modes: directlyBoundFeatureValue (zero-copy) vs copyFeatureValue (memcpy)
MLE5ExecutionStream (21 instance methods, 5 ivars)
- Holds _streamHandle (C struct e5rt_execution_stream*)
- Key methods: _executeStream:error:, executeForInputFeatures:options:error:, submitWithCompletionHandler:
- Operations set via setOperations: (NSArray of MLE5ExecutionStreamOperation)
- Reset via _cleanUpStream: on engine
MLE5ExecutionStreamPool (11 instance methods)
- Pool pattern: takeOut / putBack:
- Creates streams on demand with e5rt_execution_stream_create
- Tracks all streams via allStreams
MLE5StaticShapeExecutionStreamOperationPool (17 instance methods)
- Pool for operations with fixed input shapes
- Key method: takeOutOperationForFeatures:error: -- matches feature shape to pooled operation
MLE5InputPort / MLE5OutputPort
- Wraps e5rt_io_port* handles
- Each has a binder (MLE5InputPortBinder / MLE5OutputPortBinder)
- Input binder has bindingMode (char): controls copy vs direct binding
- Output binder has outputBacking and featureValue for result retrieval
MLE5InputPortBinder (16 instance methods, 6 ivars)
- bindingMode (char): 0=copy, 1=direct
- bindMemoryObjectForFeatureValue:error: -- zero-copy IOSurface binding
- copyFeatureValue:error: -- memcpy binding
MLE5OutputPortBinder (27 instance methods, 9 ivars)
- outputBacking -- output buffer
- boundFeatureDirectly (BOOL) -- tracks binding mode
- _makeFeatureValueFromPort:featureDescription:error: -- read ANE output
MLProgramE5Container (11 instance methods, 6 ivars)
- Container for compiled model assets
- URLOfMILText -- path to MIL source
- compilerOutput -- MLCompilerNeuralNetworkOutput
- findPrecompiledE5BundleAndReturnError: -- looks for pre-compiled E5 bundle
e5rt_* C API (found via dlsym):
- e5rt_program_library_create -- creates program library from MIL
- e5rt_execution_stream_create -- creates execution stream
- e5rt_async_event_create -- creates async event for synchronization
- e5rt_async_event_signal -- signals async event
W4: Async Stream Submission
submitWithCompletionHandler: FAILED with: "Failed to add operation to E5 stream. E5RT: Reset stream to add more operations to stream. (2)". The stream must be in a specific state (reset) before async submission is possible. The stream state becomes locked after _executeStream: or executeForInputFeatures:.
W5: Port-Based Data Flow
- Each operation has inputPorts (array of MLE5InputPort) and outputPorts (array of MLE5OutputPort)
- Input binding mode 1 = direct binding (zero-copy from MLMultiArray)
- Output outputBacking is nil after manual execution -- bindings are not populated by our manual path
- Port handles are e5rt_io_port* C structs -- connecting ports across operations would require knowing the C API for port linking
Revised Assessment (after W1-W5)
- CoreML API is already near-optimal for single-model inference. The MLDelegateModel wrapper is faster than calling engine methods directly due to internal stream/operation caching.
- Manual _executeStream: with custom operations is invalid -- it produces zero output. The operations must be created through the engine's internal pipeline (via _predictionFromFeatures:stream:options:error:) which handles binding correctly.
- The opportunity for speedup lies in:
  - Eliminating ObjC overhead via direct e5rt_* C API calls
  - Batching multiple models into a single stream (requires understanding e5rt_execution_stream_operation lifecycle)
  - Direct MIL compilation to e5rt_program_library without going through CoreML
Experiment X1: Custom MIL -> ANE Execution (BREAKTHROUGH)
Pipeline discovered: Write MIL text file -> MLE5ProgramLibraryOnDeviceAOTCompilationImpl -> MLE5ProgramLibrary -> MLE5Engine -> predictionFromFeatures:
// 1. Write MIL text to file
NSString *mil = @"program(1.3)\n{\n func main<ios18>(...) { ... } -> (cast_out);\n}\n";
[mil writeToFile:@"/tmp/custom.mil" ...];
// 2. Compile MIL to E5 program library
id aotImpl = [[MLE5ProgramLibraryOnDeviceAOTCompilationImpl alloc]
initWithMILTextAtURL:milURL container:refContainer configuration:cfg];
void *plHandle = [aotImpl createProgramLibraryHandleWithRespecialization:NO error:&err];
// 3. Create program library + engine
id progLib = [[MLE5ProgramLibrary alloc] initWithImpl:aotImpl container:refContainer configuration:cfg];
id engine = [[MLE5Engine alloc] initWithProgramLibrary:progLib modelDescription:desc ...];
[engine prepareWithConcurrencyHint:1 error:nil];
// 4. Execute
id result = [engine predictionFromFeatures:fp options:opts error:&err];
Requirements:
- MIL input/output variable names must match the model description (e.g., x for input, cast_out for output)
- MIL shapes must match the model description shapes
- A "container" (MLProgramE5Container) is borrowed from a pre-compiled CoreML model (needed for compilation context)
- Input/output types should be fp32 with internal fp16 compute (cast in/out) for ANE compatibility
Verified kernels (all produce EXACT correct output on ANE):
| Kernel | MIL Op | Verification |
|---|---|---|
| ReLU | relu(x=x16) | Max diff = 0.000000, 0/16384 wrong |
| GELU | gelu(x=x16, mode="TANH_APPROXIMATION") | Verified against reference |
| Elementwise (x*2+1) | mul + add with scalar constants | Verified against reference |
| Softmax | softmax(x=x16, axis=-1) | Sum = 1.000000 |
| Layer Norm | layer_norm(x=x16, axes=[3], epsilon=1e-5) | Mean = 0.000000, Var = 0.999975 |
Significance: This allows compiling arbitrary MIL programs (any operation supported by Apple's MIL spec) to run on the ANE, without going through CoreML's .mlpackage pipeline. This is the foundation for custom training/inference kernels.
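
The verification column relies on CPU references and simple invariants (softmax rows sum to 1, layer-norm output has mean 0 and variance near 1). The sketch below shows what such references might look like in pure Python. It is not the code used in the experiments, and the input row is arbitrary.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, epsilon=1e-5):
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    scale = 1.0 / math.sqrt(var + epsilon)
    return [(v - mean) * scale for v in row]


def gelu_tanh(x):
    # Standard tanh approximation of GELU.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


row = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -2.0]
sm = softmax(row)
ln = layer_norm(row)
ln_mean = sum(ln) / len(ln)
ln_var = sum((v - ln_mean) ** 2 for v in ln) / len(ln)
print(f"softmax sum = {sum(sm):.6f}")
print(f"layer_norm mean = {ln_mean:.6f}, var = {ln_var:.6f}")
```

Because epsilon is added to the variance before the square root, the output variance of the reference is slightly below 1 even in exact arithmetic.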
Experiment Y1: Fused SDPA on ANE (PASSED)
Operation: scaled_dot_product_attention(query=Q, key=K, value=V) -- single fused op for entire attention computation.
Config: B=1, nHeads=1, seqLen=256, headDim=64 (self-attention: Q=K=V=reshape(input))
| Metric | Value |
|---|---|
| Max abs diff (vs CPU) | 0.000021 |
| Relative error | 1.40e-03 |
| Latency (first call) | 2.454 ms |
| Benchmark | 0.1708 ms/eval |
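
The "vs CPU" row implies a CPU reference for the fused op. A minimal single-head sketch of softmax(Q @ K^T / sqrt(d)) @ V follows, with a max-abs-diff helper for comparing it to an accelerator output. The helper names and the tiny 3x2 input are illustrative assumptions, and the definition of "relative error" used in the table is not reproduced here because the source does not state it.

```python
import math


def sdpa(q, k, v):
    """softmax(Q @ K^T / sqrt(d)) @ V for one head; q, k, v are lists of rows."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Self-attention as in Y1: Q = K = V = the same input.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref = sdpa(x, x, x)
print(ref)
print(max_abs_diff(ref, ref))
```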
Experiment Y2: Linear with Embedded Weights (PASSED)
Operation: linear(x=flat, weight=Wc, bias=Bc) where Wc and Bc are compile-time const tensors embedded in the MIL program.
Config: input [256, 64], linear 64->64 with embedded weight matrix and bias vector.
| Metric | Value |
|---|---|
| Max abs diff (vs CPU) | 0.001106 |
| Relative error | 1.05e-02 |
| Benchmark | 0.0610 ms/eval |
Significance: Confirms that compile-time weight constants work in MIL text format. This is the foundation for transformer inference (where weights are frozen).
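
Errors of this size are plausible for fp16 compute. The sketch below emulates a 64->64 linear layer with every input, product, and partial sum rounded to half precision and compares it to a float64 reference. How the ANE actually accumulates is not stated in these notes, so rounding each partial sum is an assumption, and the random inputs are not the ones used in Y2.

```python
import random
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def linear(x, weight, bias, rounder=float):
    """Compute x @ W.T + b, applying `rounder` to inputs, products and partial sums."""
    out = []
    for row in x:
        out_row = []
        for w_row, b in zip(weight, bias):
            acc = rounder(b)
            for a, w in zip(row, w_row):
                acc = rounder(acc + rounder(rounder(a) * rounder(w)))
            out_row.append(acc)
        out.append(out_row)
    return out


rng = random.Random(0)
x = [[rng.uniform(-1, 1) for _ in range(64)] for _ in range(8)]
weight = [[rng.uniform(-1, 1) for _ in range(64)] for _ in range(64)]
bias = [rng.uniform(-1, 1) for _ in range(64)]

ref = linear(x, weight, bias)                    # float64 reference
half = linear(x, weight, bias, rounder=to_fp16)  # fp16-rounded emulation
max_abs_diff = max(abs(r - h) for rr, hr in zip(ref, half) for r, h in zip(rr, hr))
print(f"max abs diff = {max_abs_diff:.6f}")
```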
Experiment Y3: Complete Transformer Block on ANE (PASSED)
Pipeline: LayerNorm -> SDPA (self-attention) -> Residual Add -> LayerNorm -> FFN (linear+GELU+linear) -> Residual Add
All in a single MIL program, compiled and executed as one ANE operation.
Config: seqLen=256, dim=64, ffnDim=128, 1-head attention, embedded FFN weights.
| Metric | Value |
|---|---|
| Output mean abs | 1.017404 (non-zero, correct) |
| Benchmark | 0.2091 ms/eval |
Significance: A full transformer layer runs on ANE in ~0.2ms. This proves that complex multi-op pipelines can be compiled as single MIL programs with no CPU round-trips between ops. The ANE compiler fuses the entire graph.
Experiment Z1: Backward Pass (Gradient Computation) on ANE (PASSED)
Operations: matmul(x=dY, y=W) for dX (input gradient), matmul(x=dY, y=dY, transpose_x=true) for dW (weight gradient). Both use runtime tensors (not const), proving backward-pass operations work on ANE.
Also tests: slice_by_index for tensor slicing, concat for packing results.
Config: dY [128,64] @ W [64,64] -> dX [128,64]; dY^T [64,128] @ dY [128,64] -> dW [64,64]
| Metric | dX | dW |
|---|---|---|
| Max abs diff | 0.001940 | 0.012828 |
| Relative error | 1.02e-02 | 3.92e-02 |
| Benchmark | 0.0593 ms/eval (both combined) |
Significance: This is the first demonstration of ANE executing gradient computation operations. The matmul with transpose_x=true works correctly, producing valid weight gradients. Combined with Y3's forward pass, this establishes the complete pipeline for manual ANE training:
- Forward pass: Y3-style MIL (0.2 ms)
- Backward pass: Z1-style MIL (0.06 ms)
- Weight update: CPU (trivial)
- Recompile: (~10-50 ms, dominates training time)
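
The claim that recompilation dominates can be checked with the figures in the list above. This is a back-of-the-envelope sketch; the CPU weight update is treated as zero cost, which is an assumption.

```python
FORWARD_MS = 0.2    # Y3-style forward pass
BACKWARD_MS = 0.06  # Z1-style backward pass


def step_breakdown(recompile_ms, cpu_update_ms=0.0):
    """Return (total step time in ms, fraction of the step spent recompiling)."""
    total = FORWARD_MS + BACKWARD_MS + cpu_update_ms + recompile_ms
    return total, recompile_ms / total


for recompile_ms in (10.0, 50.0):
    total, share = step_breakdown(recompile_ms)
    ane_share = (FORWARD_MS + BACKWARD_MS) / total
    print(f"recompile {recompile_ms:.0f} ms -> step {total:.2f} ms, "
          f"recompile share {share:.1%}, ANE compute share {ane_share:.1%}")
```

Under these numbers the ANE is busy for only a few percent of each step at most, so reducing recompile time matters far more than speeding up the kernels.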
MIL Text Syntax Lessons Learned
Key syntax rules discovered during Y/Z experiments:
- epsilon in layer_norm: Must be same dtype as gamma/beta. Use fp16 eps = const()[..., val = fp16(1e-5)] when gamma is fp16.
- Boolean params: Use bool tx = const()[..., val = bool(true)] for params like transpose_x.
- concat axis: Must be int32 scalar, not tensor<int32, [1]>. Use int32 ax = const()[..., val = int32(0)].
- concat interleave: Required param, use bool il = const()[..., val = bool(false)].
- MLE5Engine init: Correct selector is initWithProgramLibrary:modelDescription:configuration:functionName:classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo: (7 args).
- Container path: On macOS 15+, models may use Espresso backend. Create MLProgramE5Container via initWithModelAssetPath:configuration: using the .mlmodelc path.
- Sandbox: E5RT needs write access to ~/Library/Caches/ for model specialization cache.
Next Steps
- [HIGH] Multi-head attention -- test SDPA with multiple heads (reshape to [B, nHeads, seqLen, headDim])
- [HIGH] Real Qwen2.5 layer weights -- load actual model weights into MIL const tensors
- [HIGH] Full backward pass -- implement complete transformer backward pass (attention + FFN gradients)
- [MEDIUM] Training loop -- forward + backward + weight update + recompile cycle
- [MEDIUM] Explore e5rt_* C API directly -- bypass ObjC wrappers for lower overhead
- [LOW] Runtime weight injection -- investigate if weights can be updated without recompilation
Phase 7: OutputSets with stats IOSurface -- BREAKTHROUGH
statsSurRef size=64 bytes:
objectWithstatsSurRef: _ANEIOSurfaceOutputSets: { statsSurRef=<IOSurface: 0x...>
id = 0x... width = 64 height = 1 pixelFormat = 0
name = test_chaining_v2 ; outputBuffer=(
"_ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=1}"
)}
Attempting ChainingRequest with valid outputSet...
ChainingRequest created | validate: YES <-- FIRST TIME VALIDATE PASSES!
prepareChainingWithModel EXCEPTION:
-[_ANEInMemoryModel getUUID]: unrecognized selector
Phase 8: Disk-based _ANEModel
_ANEModel class found (12 class methods, 52 instance methods, 17 properties)
Has: getUUID, inputSymbolIndicesForProcedureIndex:,
outputSymbolIndicesForProcedureIndex:, mapper, program
Factory: +modelAtURL:key:, +modelAtURL:key:modelAttributes:, etc.
tmpDir contents: (weights, model.mil, net.plist, data)
+modelAtURL: NOT available (needs key: parameter)
-> _ANEModel could not be loaded (need correct factory + key)
Phase 9: processRequest via ProgramForEvaluation
k1.model.program: _ANEProgramForEvaluation: { programHandle=1319967543575
intermediateBufferHandle=0 queueDepth=127 }
processRequest single call: YES (rv=NO)
processRequest: 0.131 ms/eval (50 iters)
vs RT eval: 1.45x (slower than RT but faster than standard)
Phase 10: Shared Events
_ANESharedEvents: found (+sharedEventsWithSignalEvents:waitEvents:)
_ANESharedSignalEvent: found
+signalEventWithValue:symbolIndex:eventType:sharedEvent:
Properties: sharedEvent (IOSurfaceSharedEvent), value, symbolIndex, agentMask, eventType
alloc/init: nil (needs sharedEvent parameter)
_ANESharedWaitEvent: found
+waitEventWithValue:sharedEvent:
alloc/init: nil (needs sharedEvent parameter)
-> Both require IOSurfaceSharedEvent objects, not available from bare init
6. Architecture: Chaining Data Flow
Current (sequential):
CPU -> IOSurface -> ANE eval layer 1 -> IOSurface -> CPU memcpy
CPU -> IOSurface -> ANE eval layer 2 -> IOSurface -> CPU memcpy
... (23 round-trips for 12-layer model)
Target (chained):
CPU -> IOSurface -> ANE eval layer 1 -> [on-chip] -> ANE eval layer 2
-> [on-chip] -> ... -> IOSurface -> CPU
(1 round-trip for entire model)
Current best (sequential with standard path):
At production dims (768x256), all paths are ~0.2ms/kernel.
RT path only helps for small kernels (64x32: 1.88x speedup).
For 24 evals/token at ~0.2ms each: ~4.8ms total ANE time per token.
Chaining target: 1 round-trip instead of 24, saving ~23 x overhead per trip.
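
The per-token arithmetic above can be made explicit. The sequential figure uses only numbers from these notes. The chained estimate needs the dispatch overhead per round-trip, which has not been measured separately, so the values swept below are purely hypothetical.

```python
def sequential_ms(n_evals, kernel_ms):
    """Total ANE time per token when every eval is dispatched separately."""
    return n_evals * kernel_ms


def chained_ms(n_evals, kernel_ms, overhead_ms):
    """Hypothetical chained time: all but one dispatch overhead is removed."""
    if not 0.0 <= overhead_ms <= kernel_ms:
        raise ValueError("overhead_ms must lie between 0 and kernel_ms")
    return sequential_ms(n_evals, kernel_ms) - (n_evals - 1) * overhead_ms


N_EVALS = 24     # evals per token
KERNEL_MS = 0.2  # per-kernel latency at production dims (768x256)

base = sequential_ms(N_EVALS, KERNEL_MS)
print(f"sequential: {base:.2f} ms/token")
for overhead in (0.05, 0.10, 0.15):  # hypothetical per-dispatch overheads
    est = chained_ms(N_EVALS, KERNEL_MS, overhead)
    print(f"overhead {overhead:.2f} ms -> chained {est:.2f} ms/token ({base / est:.2f}x)")
```

The payoff from chaining depends entirely on what fraction of the ~0.2 ms per kernel is dispatch overhead rather than compute.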
7. Class Hierarchy (inferred)
NSObject
├── _ANEClient (singleton, daemon connection)
├── _ANEInMemoryModelDescriptor (MIL + weights spec)
├── _ANEInMemoryModel (compile/load/run -- in-memory MIL path)
│ └── .program -> _ANEProgramForEvaluation
├── _ANEModel (disk-based compiled model -- 52 methods, has getUUID)
│ └── .program -> _ANEProgramForEvaluation
│ └── .mapper -> _ANEProgramIOSurfacesMapper
├── _ANERequest (I/O surface packaging)
├── _ANEIOSurfaceObject (thin IOSurface wrapper)
├── _ANEBuffer (IOSurfaceObject + symbolIndex + source)
├── _ANEChainingRequest (multi-op pipeline)
├── _ANEIOSurfaceOutputSets (output packaging for chaining)
├── _ANEInputBuffersReady (input signaling for chaining)
├── _ANEOutputSetEnqueue (output enqueue config for chaining)
├── _ANEProgramIOSurfacesMapper (symbol-to-surface mapping)
├── _ANEProgramForEvaluation (lower-level eval program)
├── _ANEModelInstanceParameters (model config)
├── _ANEDeviceController (device-level control)
├── _ANEQoSMapper (QoS level mapping)
├── _ANEPerformanceStats (perf counters)
├── _ANESharedSignalEvent (hardware signal fence)
└── _ANESharedWaitEvent (hardware wait fence)
8. MIL Operations Reference (for Custom ANE Kernels)
Source: coremltools MIL Ops API Reference
The following MIL operations are available for writing custom ANE kernels via our MLE5ProgramLibraryOnDeviceAOTCompilationImpl pipeline (Experiment X1). All ops below have been confirmed available in the MIL text format used by the E5 compiler on macOS 15+.
Transformer-Critical Ops
| Op | Signature | Notes |
|---|---|---|
| scaled_dot_product_attention (iOS 18+) | (query:[B,*?,L,E], key:[B,*?,S,E], value:[B,*?,S,EV], attn_mask?) -> [B,*?,L,EV] | Fused softmax(Q@K.T/sqrt(d))@V. Single op for entire attention computation. |
| linear | (x:[*D,D_in], weight:const[D_out,D_in], bias:const[D_out]?) -> [*D,D_out] | x @ W.T + b. Weight/bias must be compile-time constants. Rank 1-3 input. |
| matmul | (x:[*,K1], y:[*,K2], transpose_x?, transpose_y?) -> [*,T] | N-D batch matmul with broadcasting. Supports runtime (non-const) inputs. |
| layer_norm | (x, axes, gamma?, beta?, epsilon?) -> same shape | Verified working on ANE (Experiment X1). |
| gelu | (x, mode=EXACT/TANH_APPROXIMATION/SIGMOID_APPROXIMATION) -> same shape | Verified working on ANE (Experiment X1). |
| softmax | (x, axis) -> same shape | Verified working on ANE (Experiment X1). |
| relu | (x) -> same shape | Verified working on ANE (Experiment X1). |
Data Movement Ops
| Op | Signature | Notes |
|---|---|---|
| gather | (x, indices, axis?) -> gathered | For embedding table lookups. |
| gather_along_axis | (x, indices, axis?) -> gathered | Take values along axis at index locations. |
| scatter | (data, indices, updates, axis?, mode?) -> scattered | For KV cache writes. Mode: update/add/sub/mul/div/max/min. |
| scatter_along_axis | (data, indices, updates, axis?, mode?) -> scattered | Scatter updates along axis. |
Elementwise / Reduction Ops
| Op | Notes |
|---|---|
| add, sub, mul, real_div | Elementwise with broadcasting. |
| cast | Type conversion (fp32 <-> fp16). Required for ANE I/O (fp32 in, fp16 compute, fp32 out). |
| reduce_sum, reduce_mean, reduce_max | Reduction along axes. |
| rsqrt, sqrt, exp, log, tanh | Unary elementwise. Useful for manual norm/activation implementations. |
| concat, split, reshape, transpose | Shape manipulation. |
| slice_by_index, slice_by_size | Tensor slicing for KV cache windowing. |
Key Constraints
- linear weights must be const: For inference this is fine (weights don't change). For training, use matmul with runtime tensors instead.
- MIL text format: Programs use program(1.3) { func main<ios18>(...) { ... } -> (output); } syntax. Constants use const()[name=..., val=...]. Weights reference blob files via BLOBFILE(path=..., offset=...).
- ANE I/O convention: Input/output should be fp32; internal compute should be fp16. Use cast ops at boundaries.
- Shape constraints: ANE prefers NCHW layout. Most ops work with rank-4 tensors [B, C, H, W] but linear/matmul work with lower ranks.
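
Shape mismatches between MIL text and the model description are one of the listed failure points, so checking shapes on the CPU before compiling can save a compile cycle. The sketch below encodes the linear signature from the table and a rank-2 subset of matmul; both helpers are hypothetical and not part of any Apple or coremltools API.

```python
def linear_output_shape(x_shape, weight_shape, bias_shape=None):
    """Shape of linear(x, weight, bias): [*D, D_in] x [D_out, D_in] -> [*D, D_out]."""
    if not 1 <= len(x_shape) <= 3:
        raise ValueError("linear expects a rank 1-3 input")
    d_out, d_in = weight_shape
    if x_shape[-1] != d_in:
        raise ValueError(f"last input dim {x_shape[-1]} != D_in {d_in}")
    if bias_shape is not None and tuple(bias_shape) != (d_out,):
        raise ValueError(f"bias shape {tuple(bias_shape)} != ({d_out},)")
    return tuple(x_shape[:-1]) + (d_out,)


def matmul_output_shape(x_shape, y_shape, transpose_x=False, transpose_y=False):
    """Shape of a rank-2 matmul with optional transposes (no batch broadcasting)."""
    if len(x_shape) != 2 or len(y_shape) != 2:
        raise ValueError("this sketch only handles rank-2 operands")
    m, k1 = reversed(x_shape) if transpose_x else x_shape
    k2, n = reversed(y_shape) if transpose_y else y_shape
    if k1 != k2:
        raise ValueError(f"inner dims differ: {k1} vs {k2}")
    return (m, n)


print(linear_output_shape((256, 64), (64, 64), (64,)))                # Y2 config
print(matmul_output_shape((128, 64), (64, 64)))                       # Z1 dX
print(matmul_output_shape((128, 64), (128, 64), transpose_x=True))    # Z1 dW
```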
9. ANE Training Feasibility Analysis
Apple's Official Position
Apple's deprecated MLCompute framework (MLCDevice.ane()) explicitly states:
"This device applies to inference graphs only. It doesn't work with a training graph or inference graph that shares layers with a training graph."
This means Apple never shipped ANE-based training, even in their own training framework. The MLCTrainingGraph class supported executeForward, executeGradient, and executeOptimizerUpdate but only on CPU and GPU devices.
WWDC 2025 Confirmation
WWDC 2025 Session 360 ("Discover ML & AI frameworks") confirms:
- CoreML dispatches to CPU, GPU, and Neural Engine at runtime for inference
- MLX is the recommended tool for training/fine-tuning but uses Metal GPU, not ANE
- No mention of ANE training APIs in any Apple framework
- BNNSGraph (Accelerate) added BNNSGraphBuilder for CPU-only real-time inference
Why ANE Lacks Native Training Support
The ANE is a fixed-function inference accelerator. It likely lacks:
- Hardware support for automatic differentiation / backward passes
- Ability to write to weight storage during execution (weights are read-only constants in the e5rt_program_library)
- Dynamic memory allocation needed for activation checkpointing
Manual ANE Training Approach
Despite the lack of native support, training on ANE is theoretically possible using our custom MIL pipeline:
1. Forward pass: Write MIL program with linear/matmul/layer_norm/gelu ops. Weights embedded as constants. Execute on ANE. Save activations.
2. Backward pass: Write separate MIL programs for each layer's gradient computation:
   - Linear backward: dX = dY @ W (matmul), dW = dY.T @ X (matmul)
   - ReLU backward: dX = dY * (X > 0) (elementwise)
   - LayerNorm backward: Multiple reduction + elementwise ops
3. Optimizer step: Run on CPU (simple elementwise: W -= lr * dW)
4. Recompile: After weight update, recompile MIL with new weights for next forward pass
The key bottleneck is step 4: recompiling MIL after every weight update. The createProgramLibraryHandleWithRespecialization: call takes ~10-50ms, which would dominate training time. This makes per-step ANE training impractical unless we can find a way to update weights without recompilation (e.g., via the e5rt_* C API or runtime weight injection).
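
The linear backward formulas in the backward-pass step (dX = dY @ W, dW = dY.T @ X for Y = X @ W.T) can be sanity-checked on the CPU before being written as MIL. The sketch below compares them against central finite differences on a tiny example. The matrices are arbitrary, and the scalar loss L = sum(Y * dY) is chosen only so that dL/dY equals the given dY.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def loss(x, w, dy):
    """Scalar L = sum(Y * dY) with Y = X @ W.T, so dL/dY equals dY."""
    y = matmul(x, transpose(w))
    return sum(y[i][j] * dy[i][j] for i in range(len(y)) for j in range(len(y[0])))


def numeric_grad(param, make_loss, eps=1e-6):
    """Central finite differences of make_loss() with respect to each entry of param."""
    grad = [[0.0] * len(param[0]) for _ in param]
    for i in range(len(param)):
        for j in range(len(param[0])):
            orig = param[i][j]
            param[i][j] = orig + eps
            up = make_loss()
            param[i][j] = orig - eps
            down = make_loss()
            param[i][j] = orig  # restore the entry before moving on
            grad[i][j] = (up - down) / (2 * eps)
    return grad


X = [[1.0, 2.0, -1.0], [0.5, -0.5, 3.0]]   # [N=2, D_in=3]
W = [[0.2, -0.1, 0.4], [1.0, 0.3, -0.7]]   # [D_out=2, D_in=3]
dY = [[1.0, -2.0], [0.5, 0.25]]            # [N=2, D_out=2]

dX = matmul(dY, W)             # analytic input gradient, [N, D_in]
dW = matmul(transpose(dY), X)  # analytic weight gradient, [D_out, D_in]
num_dX = numeric_grad(X, lambda: loss(X, W, dY))
num_dW = numeric_grad(W, lambda: loss(X, W, dY))

err_dX = max(abs(a - n) for ra, rn in zip(dX, num_dX) for a, n in zip(ra, rn))
err_dW = max(abs(a - n) for ra, rn in zip(dW, num_dW) for a, n in zip(ra, rn))
print(f"max |dX - numeric| = {err_dX:.2e}, max |dW - numeric| = {err_dW:.2e}")
```

A check like this is cheap to run for every backward kernel and gives the reference values needed to judge the fp16 ANE output.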