ANE ChainingRequest API Research

Research into Apple Neural Engine private APIs for multi-kernel pipelining, conducted on M4 Max / macOS 15.

Goal: Eliminate CPU round-trips between ANE layer evaluations. In a 12-layer model, sequential evaluation requires 23+ CPU-ANE round-trips per token. The _ANEChainingRequest API appears designed to let the ANE run operations back-to-back in a hardware pipeline, keeping data on-chip.

Status: ChainingRequest validates and prepareChainingWithModel: no longer crashes (crash fix: pass nil for symbol/procedure params). Blocked on Code=15 (ANEProgramChainingPrepare Failed) -- the _ANEModel needs Espresso IR format (not MIL) for full symbol table population. At production dims (768x256), sequential ANE dispatch costs ~0.2ms/kernel; chaining would save ~23 round-trips per token.

See also: ANE_INTERNALS.md for comprehensive ANE documentation including compilation pipeline, hardware specs, and community research references.


Test Files

| File | Purpose |
| --- | --- |
| training/test_chaining.m | v1 prototype: sequential baseline + ChainingRequest creation |
| training/test_chaining_v2.m | v2 deep exploration: 6-phase probe of 12+ private classes |
| training/test_ane_model.m | Experiments E-P: _ANEModel loading, compiler, chaining, fences, type encoding, mapping |
| training/test_throughput_ceiling.m | Experiment I: 12-kernel throughput ceiling benchmark |

Build and run:

cd training
make test_chaining && ./test_chaining
make test_chaining_v2 && ./test_chaining_v2
make test_ane_model && ./test_ane_model
make test_throughput_ceiling && ./test_throughput_ceiling

1. Executive Summary

What works

| Finding | Impact | Status |
| --- | --- | --- |
| evaluateRealTimeWithModel: via _ANEClient | 1.88x faster on small kernels (64x32); no benefit at production dims (768x256) | Benchmarked |
| processRequest via _ANEProgramForEvaluation | 1.34x faster on small kernels; marginal at production dims | Benchmarked |
| _ANEBuffer wraps IOSurface with symbolIndex | Solves input indexing for chaining | Proven |
| All 9 unexplored ANE classes exist on M4 Max | Full API surfaces documented | Documented |

Important: The RT execution speedup (1.88x) observed in isolated testing on 64x32 convolution kernels does not generalize to production dimensions. At 768x256 (Stories110M size), all four execution paths take roughly 0.2-0.3 ms per kernel, and none is consistently faster than standard eval. See Production Dimension Results below.

What's been solved

| Finding | Status | Detail |
| --- | --- | --- |
| _ANEIOSurfaceOutputSets works with 64-byte statsSurRef | SOLVED | Any non-NULL IOSurface works as stats buffer |
| _ANEChainingRequest.validate returns YES | SOLVED | With proper _ANEBuffer inputs + _ANEIOSurfaceOutputSets outputs |
| processRequest via _ANEProgramForEvaluation | 1.34x faster | Lower-level eval (0.131 ms vs 0.175 ms) |
| ChainingRequest factory crash ([NSConstantIntegerNumber count]) | SOLVED | Pass nil for lbInputSymbolId, lbOutputSymbolId, procedureIndex |
| _ANEModel loading from temp directory | SOLVED | modelAtURL:key: with tmpDir URL + hexStringIdentifier |
| _ANESharedSignalEvent / _ANESharedWaitEvent | SOLVED | Use MTLSharedEvent or IOSurfaceSharedEventCreate() |
| ChainingRequest type encodings | DOCUMENTED | All 9 factory params are @ (object). prepare has 5 params (3x @, 1x I qos, 1x ^@ err) |

What's still blocked

| Blocker | Root Cause |
| --- | --- |
| prepareChainingWithModel: returns Code=15 | ANEProgramChainingPrepare() Failed -- model not recognized as chaining-capable |
| _ANEModel has empty symbol table | MIL-compiled model shell lacks Espresso IR data (model.espresso.net) |
| _ANEClient.loadModel: / compileModel: fail | Require Espresso IR format, not MIL |
| _ANEProgramIOSurfacesMapper returns NO | Needs fully loaded model with symbol table |
| _ANEPerformanceStats with _ANERequest | Request expects statType selector on perfStats objects |

2. ANE Private API Class Map

Core Classes (known working)

_ANEInMemoryModel -- the model object for in-memory MIL compilation.

  • +inMemoryModelWithDescriptor: -- create from _ANEInMemoryModelDescriptor
  • -compileWithQoS:options:error: -- compile MIL to ANE binary
  • -loadWithQoS:options:error: -- load compiled model onto ANE
  • -evaluateWithQoS:options:request:error: -- standard evaluation (QoS 0-63, 21 default)
  • -unloadWithQoS:error: -- unload from ANE
  • Properties: hexStringIdentifier, programHandle (uint64), program (_ANEProgramForEvaluation), perfStatsMask
  • Missing: inputSymbolNames, outputSymbolNames, inputSymbolIndicesForProcedureIndex:

_ANEInMemoryModelDescriptor -- model specification.

  • +modelWithMILText:weights:optionsPlist: -- create descriptor from MIL NSData + weight dict

_ANERequest -- evaluation request packaging I/O surfaces.

  • +requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:
  • perfStats parameter expects NSArray of stat info objects (not _ANEPerformanceStats)

_ANEIOSurfaceObject -- thin wrapper around IOSurfaceRef.

  • +objectWithIOSurface: -- wrap a raw IOSurface
  • Does NOT have symbolIndex property (this is the v1 blocker)

_ANEClient -- client connection to the ANE daemon.

  • +sharedConnection -- singleton accessor
  • -evaluateWithModel:options:request:qos:error: -- 5-param eval via client
  • -evaluateRealTimeWithModel:options:request:error: -- RT priority eval (1.74-1.88x faster on 64x32 kernels; no benefit at production dims)
  • -doEvaluateDirectWithModel:options:request:qos:error: -- direct eval bypass
  • -beginRealTimeTask / -endRealTimeTask -- RT task bracketing (returns NO, but RT eval still works)
  • -prepareChainingWithModel:options:chainingReq:qos:error: -- chaining setup
  • -enqueueSetsWithModel:outputSet:options:qos:error: -- chaining output enqueue
  • -buffersReadyWithModel:inputBuffers:options:qos:error: -- chaining input signal

Discovered Classes (v2 exploration)

_ANEBuffer -- wraps _ANEIOSurfaceObject with index metadata. Key discovery.

  • +bufferWithIOSurfaceObject:symbolIndex:source: -- factory
    • ioSurfaceObject: an _ANEIOSurfaceObject (NOT raw IOSurfaceRef)
    • symbolIndex: NSNumber mapping to compiled model I/O symbol
    • source: long long -- 0=ANE, 1=output, 2=unknown
  • Properties: ioSurfaceObject, symbolIndex, source
  • Description format: "_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=0 }"
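
When comparing many probe runs, it can help to turn these description strings into structured records. The following is a minimal Python sketch, assuming the description keeps the "key=value ; key=value" layout shown above. The helper name parse_ane_buffer_description and the example address are hypothetical and are not part of the test files.

```python
import re

# Matches "key=value" pairs separated by ";" inside the braces.
_PAIR = re.compile(r"(\w+)=([^;}]+)")


def parse_ane_buffer_description(text):
    """Parse an _ANEBuffer description string into a dict.

    Assumes the "_ANEBuffer: { k=v ; k=v }" layout observed in the logs.
    """
    if not text.startswith("_ANEBuffer:"):
        raise ValueError("not an _ANEBuffer description")
    fields = {key: value.strip() for key, value in _PAIR.findall(text)}
    return {
        "ioSurface": fields["ioSurface"],
        "symbolIndex": int(fields["symbolIndex"]),
        # In the logs, ANEBufferProducerAgent echoes the factory's `source` argument.
        "source": int(fields["ANEBufferProducerAgent"]),
    }


# Hypothetical address; real logs print the actual IOSurface pointer.
example = "_ANEBuffer: { ioSurface=0x600000c0a000 ; symbolIndex=1 ; ANEBufferProducerAgent=0 }"
record = parse_ane_buffer_description(example)
```

Here record holds symbolIndex 1 and source 0, which makes it straightforward to check that a buffer was created with the intended index before passing it to the chaining factory.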

_ANEProgramIOSurfacesMapper -- maps IOSurfaces to compiled model symbols.

  • +mapperWithProgramHandle:(uint64_t)handle -- works, creates mapper
  • +mapperWithController:(id)ctrl -- alternative factory
  • -mapIOSurfacesWithModel:request:cacheInference:error: -- FAILS on _ANEInMemoryModel (calls inputSymbolIndicesForProcedureIndex: which doesn't exist)
  • -validateRequest:model: -- also fails for same reason
  • Implication: designed for _ANEModel (disk-based compiled models), not in-memory MIL

_ANEProgramForEvaluation -- lower-level evaluation program.

  • Accessible via model.program property
  • +programWithHandle:intermediateBufferHandle:queueDepth: -- factory
  • -processRequest:model:qos:qIndex:modelStringID:options:returnValue:error: -- low-level eval

_ANEIOSurfaceOutputSets -- output set packaging for chaining.

  • +objectWithstatsSurRef:outputBuffer: -- factory
    • statsSurRef: IOSurfaceRef for perf stats collection -- returns nil when NULL
    • outputBuffer: NSArray of _ANEBuffer objects
  • Previously a blocker (the stats IOSurface format was unknown); resolved: any non-NULL IOSurface works, and 64 bytes is sufficient

_ANEInputBuffersReady -- input signaling for chaining pipeline.

  • +inputBuffersWithProcedureIndex:inputBufferInfoIndex:inputFreeValue:executionDelay:
  • Parameters: procedure index, buffer info indices, free values, execution delay
  • This is the mechanism that tells the ANE "inputs are ready, start processing"

_ANEOutputSetEnqueue -- output pipeline configuration for chaining.

  • +outputSetWithProcedureIndex:setIndex:signalValue:signalNotRequired:isOpenLoop:
  • Configures output set enqueue behavior with signal values and open-loop mode

_ANEChainingRequest -- the chaining request itself.

  • +chainingRequestWithInputs:outputSets:lbInputSymbolId:lbOutputSymbolId:procedureIndex:signalEvents:transactionHandle:fwEnqueueDelay:memoryPoolId:
  • -validate -- returns YES/NO
  • Expects inputs as _ANEBuffer objects, outputSets as _ANEIOSurfaceOutputSets objects

_ANEModelInstanceParameters -- model instance configuration.

  • Alloc/init produces a valid object
  • API surface dumped but not yet exercised

_ANEDeviceController -- device-level controller.

  • +controllerWithProgramHandle: -- attempted but returned nil in our tests

_ANEQoSMapper -- QoS level mapping.

  • API surface dumped, not yet exercised

_ANEPerformanceStats -- performance statistics.

  • +statsWithHardwareExecutionNS:(uint64_t)ns -- factory
  • Properties: hwExecutionTime, performanceCounters
  • Cannot be used with _ANERequest.perfStats (expects array of objects with statType selector)
  • Setting perfStatsMask=0xFF on model works but performanceCounters returns nil

_ANESharedSignalEvent / _ANESharedWaitEvent -- hardware sync primitives (exercised in Experiment G).

  • Likely the fence mechanism for GPU-ANE or multi-model synchronization
  • Referenced in _ANEChainingRequest.signalEvents parameter

3. Experiment Logs

v1: test_chaining.m Results (M4 Max)

=== ANE ChainingRequest Prototype ===

All required classes found.

--- Phase 1: Compile two identical conv kernels ---
  Kernel 1: compiled and loaded
  Kernel 2: compiled and loaded

--- Phase 2: Baseline (sequential eval) ---
  Sequential: 10.355 ms total (0.207 ms/pair)
  Output[0..3]: [0.2500, 0.2500, 0.2500, 0.2500]

--- Phase 3: _ANEChainingRequest exploration ---
  _ANEClient: obtained
  ChainingRequest created: _ANEChainingRequest: { inputBuffer=(
    "_ANEIOSurfaceObject: { ioSurface=0x... ; startOffset=0 }"
  ) ; outputSets=( ... ) }
  validate: NO

--- Phase 4: Loopback ChainingRequest ---
  ChainingRequest created (loopback)
  validate: NO
  prepareChainingWithModel: EXCEPTION (validate fails first)

--- Summary ---
  Sequential baseline: 0.207 ms/pair (two evals + memcpy)
  ChainingRequest: creates but validate FAILS
  Root cause: _ANEIOSurfaceObject lacks symbolIndex property
  Next: explore _ANEBuffer and _ANEProgramIOSurfacesMapper

v2: test_chaining_v2.m Results (M4 Max)

Phase 1: Class Introspection

  • 9 classes found, 0 missing
  • All classes exist on M4 Max / macOS 15
  • Full method lists, properties, and type encodings dumped for each

Phase 2: Symbol Name Discovery

  • inputSymbolNames: NOT available on _ANEInMemoryModel
  • outputSymbolNames: NOT available on _ANEInMemoryModel
  • programHandle: YES (uint64 handle to compiled program)
  • _ANEIOSurfaceObject does NOT have symbolIndex getter or setter
  • +objectWithIOSurface:symbolIndex: class method NOT available

Phase 3: IOSurface Mapper & Buffer Experiments

3a: _ANEProgramIOSurfacesMapper

  mapperWithProgramHandle(12345): created successfully
  mapIOSurfacesWithModel: EXCEPTION
    -[_ANEInMemoryModel inputSymbolIndicesForProcedureIndex:]:
    unrecognized selector
  validateRequest:model: EXCEPTION (same reason)

3b: _ANEBuffer -- success

  bufferWithIOSurfaceObject(symIdx=0, source=0):
    _ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=0 }
  bufferWithIOSurfaceObject(symIdx=0, source=1):
    _ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=1 }
  bufferWithIOSurfaceObject(symIdx=0, source=2):
    _ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=2 }
  bufferWithIOSurfaceObject(symIdx=1, source=0):
    _ANEBuffer: { ioSurface=0x... ; symbolIndex=1 ; ANEBufferProducerAgent=0 }
  symbolIndex property: accessible and correct

3c: _ANEIOSurfaceObject symbolIndex experiments

  setSymbolIndex: NOT available on _ANEIOSurfaceObject
  symbolIndex getter: NOT available
  +objectWithIOSurface:symbolIndex: NOT available

3d: IOSurface property experiments

  IOSurface 'symbolIndex' property (set via IOSurfaceSetValue): 0
  _ANEIOSurfaceObject.symbolIndex after property set: <exception>
  (IOSurface user properties do NOT propagate to _ANEIOSurfaceObject)

3e: _ANEProgramForEvaluation

  k1.model.program: <_ANEProgramForEvaluation: 0x...>
  (accessible via model.program property)

Phase 4: ChainingRequest Retry

4a: Sequential baseline

  Sequential: 0.259 ms/pair (50 iters)
  Output[0..3]: [0.2500, 0.2500, 0.2500, 0.2500]

Attempts 1-4: Various raw IOSurface configurations

  [Attempt 1] Standard (raw IOSurfaceObject): CRASH
    -[_ANEIOSurfaceObject symbolIndex]: unrecognized selector
  [Attempt 2] IOSurface with symbolIndex property: CRASH (same)
  [Attempt 3] Two-model loopback: CRASH (same)
  [Attempt 4] Skip validate, call prepareChainingWithModel directly: CRASH (same)

Attempt 5: _ANEBuffer + _ANEIOSurfaceOutputSets

  bufIn: _ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=0 }
  bufOut: _ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=1 }
  outputSet (objectWithstatsSurRef:NULL outputBuffer:@[bufOut]): nil
  -> _ANEIOSurfaceOutputSets returns nil when statsSurRef is NULL

Attempt 6: _ANEClient.evaluateWithModel: -- works

  evaluateWithModel (via client): YES

Attempt 7: _ANEClient.doEvaluateDirectWithModel: -- works

  doEvaluateDirectWithModel: YES

Phase 5: Alternative Execution Paths

5a: Real-time eval -- 1.7x speedup

  beginRealTimeTask: NO (possibly needs entitlement)
  evaluateRealTimeWithModel: YES

  RT eval:       0.090 ms/eval avg (50 iters)
  Standard eval: 0.157 ms/eval avg (50 iters)
  RT vs Standard speedup: 1.74x

  endRealTimeTask: NO

5b: PerfStats

  perfStatsMask = 0x01..0x80: set OK (all masks accepted)
  statsWithHardwareExecutionNS:0 = <_ANEPerformanceStats>
  Eval with @[perfStats]: OK (no crash when wrapped in array)
  hwExecutionTime after eval: nil
  Eval with mask=0xFF, perfStats=nil: OK
  performanceCounters: nil

4. Evaluation Path Benchmarks

Measured on 64x32 convolution kernels, M4 Max, 200 iterations after 10 warmup:

| Method | Latency | Speedup | API |
| --- | --- | --- | --- |
| evaluateWithQoS: (standard) | 0.175 ms | 1.0x | model.evaluateWithQoS:options:request:error: |
| evaluateRealTimeWithModel: | 0.093 ms | 1.88x | client.evaluateRealTimeWithModel:options:request:error: |
| processRequest | 0.131 ms | 1.34x | program.processRequest:model:qos:qIndex:modelStringID:options:returnValue:error: |
| doEvaluateDirectWithModel: | 0.225 ms | 0.78x | client.doEvaluateDirectWithModel:options:request:qos:error: |

Key observations (small kernel, isolated):

  • RT eval was fastest in isolated test (1.88x speedup on 64x32)
  • processRequest was faster than standard but slower than RT
  • doEvaluateDirectWithModel was actually slower than standard (0.78x)
  • beginRealTimeTask returning NO does not prevent evaluateRealTimeWithModel: from working
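
The measurement method above (10 warmup calls, then the mean over 200 timed calls) and the speedup column can be sketched as follows. This is an illustrative Python harness, not the Objective-C code in the test files; bench and speedup are hypothetical helper names, and the latencies are copied from the table.

```python
import time


def bench(fn, warmup=10, iters=200):
    """Return the mean latency of fn() in milliseconds after warmup calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iters


def speedup(baseline_ms, candidate_ms):
    """Speedup of a candidate path relative to the baseline path."""
    return baseline_ms / candidate_ms


# Latencies in ms from the table above.
latencies = {
    "evaluateWithQoS": 0.175,
    "evaluateRealTimeWithModel": 0.093,
    "processRequest": 0.131,
    "doEvaluateDirectWithModel": 0.225,
}
baseline = latencies["evaluateWithQoS"]
speedups = {name: round(speedup(baseline, ms), 2) for name, ms in latencies.items()}
```

Dividing the standard latency by each path's latency reproduces the 1.88x, 1.34x, and 0.78x figures in the table.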

Production Dimension Results (test_bench_paths.m, M4 Max)

At realistic kernel sizes with multiple compiled models, the picture changes:

| Config | Standard | RT | processRequest | ane_eval_rt |
| --- | --- | --- | --- | --- |
| 64x32 (test) | 0.109 ms | 0.233 ms (0.5x) | 0.156 ms (0.7x) | 0.195 ms (0.6x) |
| 128x64 | 0.208 ms | 0.184 ms (1.1x) | 0.201 ms (1.0x) | 0.185 ms (1.1x) |
| 256x64 | 0.197 ms | 0.212 ms (0.9x) | 0.203 ms (1.0x) | 0.157 ms (1.3x) |
| 512x64 | 0.120 ms | 0.147 ms (0.8x) | 0.194 ms (0.6x) | 0.179 ms (0.7x) |
| 768x256 (prod) | 0.205 ms | 0.246 ms (0.8x) | 0.185 ms (1.1x) | 0.291 ms (0.7x) |

Key finding: The RT eval speedup observed in isolated testing (1.88x) does not hold at production dimensions. At 768x256 (Stories110M size), the four eval paths land between 0.185 ms and 0.291 ms, and standard eval (0.205 ms) is competitive with the fastest path. The overhead of the client-based paths (RT, direct) outweighs any ANE scheduling benefit at scale.
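
To make the comparison across configurations easier to read, the table can be reduced to the fastest path per configuration and the slowest-to-fastest ratio. This is a small Python sketch over the numbers above; summarize is a hypothetical helper name.

```python
# Latencies in ms, copied from the production-dimension table above.
PATHS = ("Standard", "RT", "processRequest", "ane_eval_rt")
RESULTS = {
    "64x32": (0.109, 0.233, 0.156, 0.195),
    "128x64": (0.208, 0.184, 0.201, 0.185),
    "256x64": (0.197, 0.212, 0.203, 0.157),
    "512x64": (0.120, 0.147, 0.194, 0.179),
    "768x256": (0.205, 0.246, 0.185, 0.291),
}


def summarize(results):
    """For each config, return (fastest path, slowest/fastest latency ratio)."""
    summary = {}
    for config, times in results.items():
        best = min(range(len(times)), key=times.__getitem__)
        summary[config] = (PATHS[best], max(times) / min(times))
    return summary


summary = summarize(RESULTS)
```

The fastest path differs between configurations (Standard wins twice; RT, ane_eval_rt, and processRequest win once each), so these measurements do not single out one path as the best choice.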


5. Remaining Blockers and Next Steps

SOLVED: _ANEIOSurfaceOutputSets statsSurRef

The chaining pipeline requires:

  1. Inputs as _ANEBuffer objects with symbolIndex -- SOLVED
  2. OutputSets as _ANEIOSurfaceOutputSets objects -- SOLVED

A 64-byte IOSurface as statsSurRef is sufficient. _ANEChainingRequest.validate returns YES with this setup.

SOLVED: ChainingRequest parameter type mismatch (Experiment K-L)

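The [NSConstantIntegerNumber count] crash was caused by passing NSNumber values for lbInputSymbolId, lbOutputSymbolId, and procedureIndex. Type encoding analysis (Experiment K) revealed that all 9 factory parameters are @ (id/object), and the factory internally calls count on these three, which NSNumber does not implement. Array values do not work either (Experiment L: unsignedIntegerValue is sent to the NSArray), so nil is the only value found to work.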

Fix: Pass nil for lbInputSymbolId, lbOutputSymbolId, and procedureIndex:

chainingRequestWithInputs:@[buf] outputSets:@[outSet]
    lbInputSymbolId:nil lbOutputSymbolId:nil procedureIndex:nil
    signalEvents:@[] transactionHandle:@0 fwEnqueueDelay:@0 memoryPoolId:@0

This produces a valid _ANEChainingRequest (validate returns YES) and prepareChainingWithModel: no longer crashes.

Current Blocker: ANEProgramChainingPrepare() Failed (Code=15)

prepareChainingWithModel: now returns NO with error:

Error Domain=com.apple.appleneuralengine Code=15
"ANEProgramChainingPrepare() Failed: Program chaining prepare error"

This error occurs with all three model types tested:

  • Fresh _ANEModel (state=1, populated with programHandle+program)
  • Populated _ANEModel from Experiment E (state=5 after failed loadModel/compileModel)
  • _ANEInMemoryModel still crashes on getUUID (cannot be used with chaining at all)

The Code=15 error is a logical failure in the ANE daemon's chaining preparation, not a crash. The model is not fully recognized as "chaining-capable" by the daemon, likely because:

  1. The _ANEModel was populated by copying programHandle/program from an _ANEInMemoryModel, not loaded through the standard CoreML/Espresso pipeline
  2. Symbol indices remain empty (the daemon may require them for chaining buffer routing)
  3. The model needs model.espresso.net format (not MIL) for _ANEClient.loadModel: / compileModel:

Previous blocker (SOLVED): [NSConstantIntegerNumber count] crash -- fixed by passing nil for symbol/procedure params.

Experiments E-H Results (test_ane_model.m)

Experiment E: _ANEModel Loading -- SOLVED

_ANEModel.modelAtURL:key: works with the compiled temp directory URL and hexStringIdentifier as key:

diskModel = _ANEModel.modelAtURL:key:(tmpDirURL, hexId)
  -> _ANEModel with UUID, getUUID works
  -> state=1, program=nil, programHandle=0 (shell only)

Populating the shell with _ANEInMemoryModel data:

diskModel.setProgramHandle:(inMemoryModel.programHandle)  -> success
diskModel.setProgram:(inMemoryModel.program)              -> success

After population, programHandle and program are set, but inputSymbolIndicesForProcedureIndex:0 still returns empty NSIndexSet. The symbol table data isn't stored in the _ANEProgramForEvaluation -- it's likely in the model.hwx or net.plist that the standard CoreML path generates.

Experiment E2: ANECompiler -- No ObjC API

  • ANECompiler.framework exists at /System/Library/PrivateFrameworks/ANECompiler.framework/ but contains no ObjC classes -- it's a pure C library (ANECCompile() is the entry point, called internally by _ANEInMemoryModel.compileWithQoS:)
  • debug_mask option had no visible effect on compilation output
  • No ane_compiler_service found at standard paths
  • Key _ANEInMemoryModel compilation methods found: saveModelFiles, localModelPath, compiledModelExists, mapIOSurfacesWithRequest:cacheInference:error:

Experiment F: Chaining Pipeline -- Blocked

With populated _ANEModel (has UUID + programHandle + program), prepareChainingWithModel: still crashes on [NSConstantIntegerNumber count]. The crash is in the _ANEChainingRequest parameter handling, not in the model itself (later fixed in Experiment L by passing nil for the symbol/procedure params).

Experiment G: Hardware Fences -- FULLY SOLVED

Both _ANESharedSignalEvent and _ANESharedWaitEvent now work:

// MTLSharedEvent via Metal (works)
id device = MTLCreateSystemDefaultDevice();
id sharedEvent = [device newSharedEvent];

// IOSurfaceSharedEvent via IOKit (also works)
id iosEvent = IOSurfaceSharedEventCreate();

// Signal event factory: (uint64_t value, unsigned int symbolIndex, long long eventType, id sharedEvent)
_ANESharedSignalEvent.signalEventWithValue:symbolIndex:eventType:sharedEvent:
  -> works with both MTLSharedEvent and IOSurfaceSharedEvent

// Wait event factory: (uint64_t value, id sharedEvent)
_ANESharedWaitEvent.waitEventWithValue:sharedEvent:
  -> works with both event types

Event types 0, 1, 2 all produce valid signal events. The eventType property is correctly set.

Experiment H: Alternative Preparation -- Same Crash

doPrepareChainingWithModel:options:chainingReq:qos:error: exists with identical signature and crashes identically. Full _ANEClient API (46 instance methods) documented in test output.

Throughput Ceiling (test_throughput_ceiling.m, Experiment I)

12-kernel pipeline benchmarks on M4 Max:

| Config | Sequential (run+memcpy) | Run-only | Memcpy-only | GCD Serial |
| --- | --- | --- | --- | --- |
| 64x32 (test) | 0.272 ms/kernel | 0.158 ms/kernel | 0.001 ms/copy | 0.200 ms/kernel |
| 256x64 (small) | 0.191 ms/kernel | 0.181 ms/kernel | 0.002 ms/copy | 0.176 ms/kernel |
| 768x256 (prod) | 0.177 ms/kernel | 0.226 ms/kernel | 0.006 ms/copy | 0.186 ms/kernel |

Key findings:

  • Memcpy overhead is negligible (<0.01 ms per copy even at 393KB). Not the bottleneck.
  • CPU round-trip overhead is in the ANE dispatch itself, not data movement.
  • At production dims, sequential with memcpy is actually faster than run-only (0.177 vs 0.226 ms/kernel; pipeline caching effect).
  • GCD serial queue provides modest improvement at small dims but marginal at production.
  • Chaining's value would be eliminating the ~0.2ms/kernel ANE dispatch overhead, not memcpy. With 12 kernels, total pipeline takes ~2.1ms (prod), so eliminating dispatch could potentially halve this.
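
The totals and the memcpy share behind these findings can be checked with a short calculation. The Python sketch below uses the table values; the 2-bytes-per-element figure is an assumption (fp16) that happens to match the 393KB buffer size quoted above, and pipeline_estimate is a hypothetical helper name.

```python
KERNELS = 12
BYTES_PER_ELEMENT = 2  # assumption: fp16 activations

# (sequential ms/kernel, memcpy ms/copy) from the table above
CONFIGS = {
    (64, 32): (0.272, 0.001),
    (256, 64): (0.191, 0.002),
    (768, 256): (0.177, 0.006),
}


def pipeline_estimate(dims, seq_ms, memcpy_ms, kernels=KERNELS):
    """Estimate buffer size, total pipeline time, and memcpy share for one config."""
    rows, cols = dims
    return {
        "buffer_bytes": rows * cols * BYTES_PER_ELEMENT,
        "total_ms": seq_ms * kernels,
        "memcpy_share": memcpy_ms / seq_ms,
    }


estimates = {dims: pipeline_estimate(dims, *vals) for dims, vals in CONFIGS.items()}
```

At 768x256 this gives a 393,216-byte buffer, about 2.12 ms for 12 kernels, and a memcpy share of roughly 3% of the per-kernel time, consistent with dispatch rather than data movement being the dominant cost.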

Experiments K-P Results (test_ane_model.m, 2026-03-04)

Experiment K: Type Encoding Analysis -- COMPLETE

Full type encodings for all chaining-related methods:

| Method | Encoding | Notes |
| --- | --- | --- |
| chainingRequestWithInputs:... | @88@0:8@16@24@32@40@48@56@64@72@80 | All 9 params are @ (id/object) |
| prepareChainingWithModel:... | B52@0:8@16@24@32I40^@44 | 5 params: 3x @, 1x I (uint32 qos), 1x ^@ (error ptr) |
| doPrepareChainingWithModel:... | B52@0:8@16@24@32I40^@44 | Same signature as prepareChainingWithModel |

The _ANEChainingRequest factory takes 9 object parameters. The lbInputSymbolId, lbOutputSymbolId, and procedureIndex are all @ (object), not raw integers. Internally, the factory calls unsignedIntegerValue (from NSNumber) or count (from NSArray) on these parameters.

| _ANEChainingRequest Property | Encoding | Type |
| --- | --- | --- |
| procedureIndex | @ | id (nil or NSArray) |
| loopbackInputSymbolIndex | @ | id (nil or NSArray) |
| loopbackOutputSymbolIndex | @ | id (nil or NSArray) |
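
The method encodings above follow the Objective-C runtime convention: return type, total argument frame size, then one type code plus byte offset per argument, starting with the implicit self (@) and _cmd (:). A minimal Python decoder, covering only the scalar, object, and pointer codes that appear here (no structs or blocks), might look like this; decode_method_encoding is a hypothetical helper name.

```python
import re

# One token: zero or more pointer markers, a type code, then a decimal number.
_TOKEN = re.compile(r"(\^*[@:#BcCsSiIlLqQfdv*])(\d+)")


def decode_method_encoding(encoding):
    """Split an Objective-C method type encoding into its parts.

    Returns (return_type, frame_size, [(arg_type, offset), ...]) with the
    implicit self and _cmd arguments removed from the list.
    """
    tokens = _TOKEN.findall(encoding)
    # Reject anything the simple token pattern could not fully consume.
    if "".join(code + num for code, num in tokens) != encoding:
        raise ValueError("unsupported encoding: " + encoding)
    (ret, frame), args = tokens[0], tokens[1:]
    if [code for code, _ in args[:2]] != ["@", ":"]:
        raise ValueError("expected self and _cmd first")
    return ret, int(frame), [(code, int(off)) for code, off in args[2:]]


prepare = decode_method_encoding("B52@0:8@16@24@32I40^@44")
factory = decode_method_encoding("@88@0:8@16@24@32@40@48@56@64@72@80")
```

Decoding the two strings from the table yields five explicit parameters for prepareChainingWithModel: (@, @, @, I, ^@) and nine object parameters for the factory, matching the Notes column.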

Experiment L: Array-Typed Parameters -- BREAKTHROUGH

| Combo | lbIn | lbOut | procIdx | Factory | Validate | Prepare |
| --- | --- | --- | --- | --- | --- | --- |
| L.1: Arrays @[@(-1)] | @[@(-1)] | @[@(-1)] | @[@0] | CRASH: unsignedIntegerValue on NSArray | - | - |
| L.2: Arrays @[@0] | @[@0] | @[@0] | @[@0] | CRASH: unsignedIntegerValue on NSArray | - | - |
| L.3: Empty @[] | @[] | @[] | @[] | CRASH: unsignedIntegerValue on empty array | - | - |
| L.4: nil | nil | nil | nil | OK | YES | NO (Code=15) |
| L.5: NSNumber | @(-1) | @(-1) | @0 | CRASH: count on NSNumber | - | - |

Passing nil for all three symbol/procedure params gets past both the factory crash and the prepareChainingWithModel crash. The validate returns YES and prepareChainingWithModel: returns a clean error (Code=15: ANEProgramChainingPrepare() Failed) instead of crashing.
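
One hypothesis consistent with the table is that both count and unsignedIntegerValue are sent to the same parameter, so only nil (which silently returns 0 for any message in Objective-C) survives. The toy Python model below illustrates that hypothesis only; the framework's real implementation has not been inspected, and Nil, Number, Array, and toy_factory are hypothetical stand-ins.

```python
class Nil:
    """Stand-in for Objective-C nil: any message returns 0."""

    def __getattr__(self, name):
        return lambda *args: 0


class Number:
    """Stand-in for NSNumber: responds to unsignedIntegerValue only."""

    def __init__(self, value):
        self.value = value

    def unsignedIntegerValue(self):
        return self.value


class Array:
    """Stand-in for NSArray: responds to count only."""

    def __init__(self, items):
        self.items = list(items)

    def count(self):
        return len(self.items)


def toy_factory(param):
    """Hypothetical factory body that sends both selectors to one parameter."""
    try:
        param.count()
        param.unsignedIntegerValue()
    except AttributeError as exc:  # models "unrecognized selector"
        return f"CRASH: {exc}"
    return "OK"
```

Under this model, toy_factory(Number(0)) fails on count, toy_factory(Array([0])) and toy_factory(Array([])) fail on unsignedIntegerValue, and toy_factory(Nil()) returns "OK", which mirrors rows L.1 through L.5.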

Experiment M: Load Model via _ANEClient -- BLOCKED

Both loadModel: and compileModel: on _ANEClient require Espresso IR format (model.espresso.net), not MIL:

Error Domain=com.apple.appleneuralengine.espresso Code=-1
"_ANEEspressoIRTranslator : error Cannot load network '.../model.espresso.net'"

compiledModelExistsFor: returns NO for our MIL-compiled model. After the failed load/compile attempts, the _ANEModel state changes from 1 to 5 (error/invalid state).

The standard CoreML pipeline generates model.espresso.net (Espresso IR) and model.espresso.weights from the .mlpackage / .mlmodelc format. Our MIL-only path bypasses this, so we can't use _ANEClient.loadModel: without first generating the Espresso IR.

Experiment N: IOSurface Mapping -- PARTIAL

_ANEProgramIOSurfacesMapper:

  • mapperWithProgramHandle: creates a valid mapper from the _ANEInMemoryModel programHandle
  • mapIOSurfacesWithModel:request:cacheInference:error: returns NO (no exception, no error output)
  • validateRequest:model: returns NO
  • _ANEModel.mapper property is nil
  • prepareANEMemoryMappingParams:request: revealed ANEMemoryMappingParamsStruct has 128 ANEBufferStruct slots: [128{ANEBufferStruct=^{__IOSurface}IiiI}]

The mapper appears to need a fully loaded model with symbol table data that our MIL-compiled shell doesn't have.
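
The encoding [128{ANEBufferStruct=^{__IOSurface}IiiI}] gives enough information to estimate the size of the slot array. The ctypes sketch below assumes natural C alignment with no packing; the field names are hypothetical, since the encoding exposes only the field types, and the enclosing struct may contain other members that are not modeled here.

```python
import ctypes


class ANEBufferStruct(ctypes.Structure):
    # Field names are hypothetical; only the types come from the encoding
    # {ANEBufferStruct=^{__IOSurface}IiiI}.
    _fields_ = [
        ("surface", ctypes.c_void_p),  # ^{__IOSurface}
        ("field1", ctypes.c_uint32),   # I
        ("field2", ctypes.c_int32),    # i
        ("field3", ctypes.c_int32),    # i
        ("field4", ctypes.c_uint32),   # I
    ]


SLOT_COUNT = 128
BufferSlots = ANEBufferStruct * SLOT_COUNT

slot_bytes = ctypes.sizeof(ANEBufferStruct)
table_bytes = ctypes.sizeof(BufferSlots)
```

On a 64-bit platform this works out to 24 bytes per slot and 3072 bytes for the 128-slot array.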

Experiment O: Procedure Info -- EMPTY

  • procedureInfoForProcedureIndex:0 returns nil on the populated _ANEModel
  • procedureCount is not a method or KVC-accessible property
  • modelAttributes returns empty dictionary {}
  • inputSymbolNames / outputSymbolNames not available on _ANEModel
  • The symbolIndicesForProcedureIndex:indexArrayKey: method exists (takes I + @) but symbol data is empty

Experiment P: Full Chaining Retry -- Code=15

Tested with three model types, all using nil for symbol params:

| Model | State | validate | prepare result |
| --- | --- | --- | --- |
| Fresh _ANEModel (state=1, populated) | 1 | YES | NO (Code=15) |
| _ANEInMemoryModel | 3 | YES | CRASH: getUUID |
| Populated _ANEModel (from E, state=5) | 5 | YES | NO (Code=15) |

Also documented _ANEInputBuffersReady and _ANEOutputSetEnqueue type signatures:

| Class | Factory | Param Types |
| --- | --- | --- |
| _ANEInputBuffersReady | inputBuffersWithProcedureIndex:inputBufferInfoIndex:inputFreeValue:executionDelay: | I (uint32), @ (NSArray), @ (NSArray), Q (uint64) |
| _ANEOutputSetEnqueue | outputSetWithProcedureIndex:setIndex:signalValue:signalNotRequired:isOpenLoop: | I, I, Q, B, B |

Experiments Q-S Results (test_coreml_chaining.m, 2026-03-04)

Experiment Q: CoreML Pipeline -- MAJOR DISCOVERY

The E5 runtime (macOS 15+) does NOT use _ANEModel or _ANEChainingRequest at all.

CoreML on macOS 15 uses the MIL-based "E5" runtime, which completely bypasses the older Espresso/_ANEModel/_ANEChainingRequest path:

| Component | Old Path (Espresso) | New Path (E5/MIL) |
| --- | --- | --- |
| Model format | .espresso.net + .espresso.weights | model.mil + weights/weight.bin |
| Model class | _ANEModel | e5rt_program_library (C struct) |
| Engine | _ANEClient + _ANERequest | MLE5Engine + MLE5ExecutionStreamOperation |
| Chaining | _ANEChainingRequest | e5rt_execution_stream_operation (unknown) |
| Compile | _ANEClient.compileModel: | e5rt_program_library AOT compilation |
| Sync | _ANESharedSignalEvent | IOSurfaceSharedEventListener + MTLSharedEvent |

Key findings:

  • MLModel.compileModelAtURL: produces .mlmodelc with model.mil (NOT model.espresso.net)
  • Loading an MLModel creates MLDelegateModel -> MLE5Engine -> MLE5ProgramLibrary -> MLE5ProgramLibraryOnDeviceAOTCompilationImpl
  • No _ANEModel exists anywhere in the E5 object graph
  • _ANEClient.loadModel: / compileModel: both require model.espresso.net which isn't generated
  • Prediction succeeds (model runs on ANE), confirming E5 runtime works independently of _ANEModel

Internal E5 class hierarchy:

MLDelegateModel
  └── _internalEngine: MLE5Engine
        ├── _programLibrary: MLE5ProgramLibrary
        │     ├── _programLibraryHandle: e5rt_program_library* (opaque C struct)
        │     ├── _impl: MLE5ProgramLibraryOnDeviceAOTCompilationImpl
        │     │     ├── _milTextURL: NSURL
        │     │     ├── _irProgram: shared_ptr<MIL::IRProgram> (C++)
        │     │     └── _container: MLProgramE5Container
        │     └── _container: MLProgramE5Container
        │           ├── _modelAssetDescription
        │           ├── _compilerVersionInfo
        │           └── _functionInfoArray
        └── _operationPool: MLE5StaticShapeExecutionStreamOperationPool
              └── _pool: NSMutableSet of MLE5ExecutionStreamOperation
                    ├── _operationHandle: e5rt_execution_stream_operation* (opaque)
                    ├── _programLibrary: MLE5ProgramLibrary
                    ├── _inputPorts / _outputPorts: NSArray
                    ├── _waitEventListener: IOSurfaceSharedEventListener
                    └── _completionSharedEventBoundToESOP: MTLSharedEvent

Experiment R: Chaining with CoreML model -- BLOCKED

No _ANEModel extracted from E5 runtime, so prepareChainingWithModel: cannot be tested with a CoreML-compiled model. The E5 runtime is a completely separate execution path.

Experiment S: Two-Kernel Chaining -- BLOCKED

Blocked by Experiment R. The _ANEChainingRequest API appears to be from the older Espresso-based runtime and may not be usable with models compiled through the E5/MIL path.

Experiments T-V Results (2026-03-04)

Experiment T: E5 Runtime Symbol Scan

Found 4 exported C functions from the e5rt_* API:

  • e5rt_program_library_create -- creates program library handle
  • e5rt_execution_stream_create -- creates execution stream handle
  • e5rt_async_event_create -- creates async event for synchronization
  • e5rt_async_event_signal -- signals an async event

Key ObjC classes in the E5 runtime:

  • MLE5ExecutionStreamOperation (63 instance methods) -- holds e5rt_execution_stream_operation*, manages input/output ports
  • MLE5ExecutionStream (29 instance methods) -- holds e5rt_execution_stream*, executes operations array
  • MLE5ExecutionStreamPool -- manages streams via takeOut / putBack:
  • MLE5InputPort / MLE5OutputPort -- hold e5rt_io_port*, bind features to ports
  • MLE5InputPortBinder / MLE5OutputPortBinder -- handle memory binding for ports
  • MLE5ProgramLibrary -- holds e5rt_program_library*

Critical method: MLE5ExecutionStream._executeStream:error: takes e5rt_execution_stream* and executes all operations in the operations array in sequence.

Experiment U: E5 Multi-Op Stream -- MAJOR BREAKTHROUGH

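Executed multiple operations in a single E5 stream and measured up to 4.87x speedup over sequential CoreML. Note: Experiment W1 (below) later showed this measurement was invalid; the manually created operations returned all zeros and performed no ANE compute.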

Method:

  1. Load multiple CoreML models (.mlpackage -> MLModel)
  2. Extract MLE5ProgramLibrary from each model's MLE5Engine
  3. Create MLE5ExecutionStreamOperation for each, backed by each program library
  4. Preload operations (preloadAndReturnError:) to compile ANE programs
  5. Borrow an MLE5ExecutionStream from the stream pool
  6. Set multiple operations on the stream via setOperations:
  7. Prepare each operation's input features via prepareForInputFeatures:options:error:
  8. Execute all operations in one call via _executeStream:error:

Benchmark Results (M4 Max, macOS 15, N=500)

| Kernels | CoreML Sequential | E5 Multi-Op Stream | Speedup |
| --- | --- | --- | --- |
| 1 (256ch) | 0.0359 ms | 0.0272 ms | 1.32x |
| 2 (256+512ch) | 0.0623 ms | 0.0406 ms | 1.53x |
| 3 (256+512+1024ch) | 0.1599 ms | 0.0578 ms | 2.77x |
| 4 (256+512+1024+2048ch) | 0.3781 ms | 0.0776 ms | 4.87x |

Key observations:

  • E5 stream per-kernel overhead is remarkably consistent: ~0.02 ms/kernel regardless of count
  • CoreML sequential overhead grows non-linearly (0.036 -> 0.095 ms/kernel with 4 kernels)
  • The speedup increases with more kernels: the dispatch overhead is amortized
  • All operations execute on ANE with a single _executeStream: call

Code path for E5 multi-op stream:

// NOTE: Experiment W1 found that operations created this way return all zeros.

// 1. Extract internals from the CoreML-loaded model (KVC on private ivars)
id e5engine = [mlModel valueForKey:@"_internalEngine"];  // MLE5Engine
id progLib  = [e5engine valueForKey:@"programLibrary"];   // MLE5ProgramLibrary
id pool     = [e5engine valueForKey:@"streamPool"];       // MLE5ExecutionStreamPool

// 2. Create an operation from the program library (repeat per model: op1, op2, op3)
id op1 = [[MLE5ExecutionStreamOperation alloc]
    initWithProgramLibrary:progLib functionName:@"main"
    modelDescription:desc configuration:cfg
    debugLabel:@"myOp" modelSignpostId:0];
[op1 preloadAndReturnError:nil];

// 3. Get a stream and attach the operations
id stream = [pool takeOut];
void *sh = stream._streamHandle;  // e5rt_execution_stream*
NSArray *operations = @[op1, op2, op3];
[stream setOperations:operations];

// 4. Prepare each operation, then execute them all in one call
for (id op in operations)
    [op prepareForInputFeatures:features options:predOpts error:nil];
[stream _executeStream:sh error:nil];

Revised Assessment (after T-V)

The E5 runtime (MLE5ExecutionStream + MLE5ExecutionStreamOperation) is the correct path for multi-kernel pipelining on macOS 15+. Note: this assessment was corrected in Experiment W1 (see below).

Experiments W1-W5: Validation & Deep API Documentation (2026-03-04)

W1: Output Correctness Validation

CRITICAL CORRECTION: The previously reported "4.87x speedup" from multi-op streams was invalid. Validation revealed:

  1. MLE5Engine.predictionFromFeatures:options:error: produces EXACT (bit-identical) output to MLModel.predictionFromFeatures:error: for all tested sizes (256, 512, 1024, 2048 channels). This confirms the E5 engine is the correct computation path.

  2. The MLE5ExecutionStreamOperation objects we created manually via initWithProgramLibrary: do not produce correct output -- they return all zeros. The _executeStream: call returns YES but no actual ANE compute occurs. The operation handles are 0x0 (not compiled), meaning these operations were never wired to actual ANE programs.

  3. The "speedup" was measuring the overhead of a no-op function returning immediately vs CoreML doing actual computation.

  4. MLE5StaticShapeExecutionStreamOperationPool.takeOutOperationForFeatures:error: returns pool-managed operations with valid handles, but using them with _executeStream: still produces zeros -- the output port bindings are not correctly populated.

  5. Stream reuse via _predictionFromFeatures:stream:options:error: fails with "E5RT: Port bindings cannot be changed while operation is in use in an execution stream" -- streams are locked after first use and cannot be reconfigured.
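
The failure mode in items 2-4 (a call that reports success but leaves the output buffer zeroed) is cheap to detect before trusting any timing. The following is a minimal Python sketch of such a check, not the harness used in these experiments; `classify_output` and its labels are hypothetical names, and outputs are assumed to be flat lists of fp32 values.

```python
import struct

def fp32_bits(values):
    """Pack values as little-endian fp32 so comparison is bit-exact."""
    return struct.pack(f"<{len(values)}f", *values)

def classify_output(reference, candidate):
    """Compare a candidate output against a trusted reference output.

    Returns "all_zero" when the candidate is entirely zero although the
    reference is not (the no-op symptom), "exact" when both are
    bit-identical as fp32, and "mismatch" otherwise.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    if any(v != 0.0 for v in reference) and all(v == 0.0 for v in candidate):
        return "all_zero"
    if fp32_bits(reference) == fp32_bits(candidate):
        return "exact"
    return "mismatch"

print(classify_output([1.0, 2.0], [1.0, 2.0]))   # same values
print(classify_output([1.0, 2.0], [0.0, 0.0]))   # zeroed output buffer
print(classify_output([1.0, 2.0], [1.0, 2.5]))   # differing values
```

Running a check like this on every benchmarked path would have flagged the multi-op stream results as a no-op before the speedup was reported.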

W1 Performance Profile

| Path | 256ch (ms) | 2048ch (ms) |
|---|---|---|
| CoreML API (predictionFromFeatures:error:) | 0.035 | 0.217 |
| Engine direct (predictionFromFeatures:options:error:) | 0.074 | 0.284 |
| Engine private (_predictionFromFeatures:options:error:) | 0.100 | 0.332 |
| Stream pool cycle (takeOut + putBack) | 0.008 | 0.008 |
| Op pool cycle | <0.001 | <0.001 |

Key finding: CoreML API is FASTER than calling the engine directly. MLDelegateModel implements internal caching (likely keeping a hot stream + operation) that avoids the per-call pool acquire/release overhead. The engine's predictionFromFeatures: method performs pool management on every call.

W2: Exhaustive E5 Runtime API

Full class dumps captured for all E5 runtime classes. Key classes and their roles:

MLE5Engine (49 instance methods, 10 ivars)

  • Superclass: MLModelEngine
  • Entry point: predictionFromFeatures:options:error: (public), _predictionFromFeatures:stream:options:error: (internal)
  • Key properties: streamPool (MLE5ExecutionStreamPool), operationPool (), programLibrary (MLE5ProgramLibrary)
  • Manages: stream acquisition, operation preparation, input conforming, output post-processing

MLE5ProgramLibrary (17 instance methods, 5 ivars)

  • Holds _programLibraryHandle (C struct e5rt_program_library*)
  • Key method: createOperationForFunctionName:forceRespecialization:hasRangeShapeInputs:error: -- returns C-level e5rt_execution_stream_operation*
  • Contains: compiled MIL program, model configuration, implementation object

MLE5ExecutionStreamOperation (63 instance methods, ~20 ivars)

  • Holds _operationHandle (C struct e5rt_execution_stream_operation*)
  • States: 0=created, transitions through prepare/execute
  • Key methods: prepareForInputFeatures:options:error:, preloadAndReturnError:, outputFeatures
  • Has input/output/state ports (MLE5InputPort, MLE5OutputPort)
  • Internal binding: _bindInputFeaturesAndWaitEvents:options:error:, _bindOutputPortsWithOptions:error:
  • Port binding modes: directlyBoundFeatureValue (zero-copy) vs copyFeatureValue (memcpy)

MLE5ExecutionStream (21 instance methods, 5 ivars)

  • Holds _streamHandle (C struct e5rt_execution_stream*)
  • Key methods: _executeStream:error:, executeForInputFeatures:options:error:, submitWithCompletionHandler:
  • Operations set via setOperations: (NSArray of MLE5ExecutionStreamOperation)
  • Reset via _cleanUpStream: on engine

MLE5ExecutionStreamPool (11 instance methods)

  • Pool pattern: takeOut / putBack:
  • Creates streams on demand with e5rt_execution_stream_create
  • Tracks all streams via allStreams

MLE5StaticShapeExecutionStreamOperationPool (17 instance methods)

  • Pool for operations with fixed input shapes
  • Key method: takeOutOperationForFeatures:error: -- matches feature shape to pooled operation

MLE5InputPort / MLE5OutputPort

  • Wraps e5rt_io_port* handles
  • Each has a binder (MLE5InputPortBinder / MLE5OutputPortBinder)
  • Input binder has bindingMode (char): controls copy vs direct binding
  • Output binder has outputBacking and featureValue for result retrieval

MLE5InputPortBinder (16 instance methods, 6 ivars)

  • bindingMode (char): 0=copy, 1=direct
  • bindMemoryObjectForFeatureValue:error: -- zero-copy IOSurface binding
  • copyFeatureValue:error: -- memcpy binding

MLE5OutputPortBinder (27 instance methods, 9 ivars)

  • outputBacking -- output buffer
  • boundFeatureDirectly (BOOL) -- tracks binding mode
  • _makeFeatureValueFromPort:featureDescription:error: -- read ANE output

MLProgramE5Container (11 instance methods, 6 ivars)

  • Container for compiled model assets
  • URLOfMILText -- path to MIL source
  • compilerOutput -- MLCompilerNeuralNetworkOutput
  • findPrecompiledE5BundleAndReturnError: -- looks for pre-compiled E5 bundle

e5rt_* C API (found via dlsym):

  • e5rt_program_library_create -- creates program library from MIL
  • e5rt_execution_stream_create -- creates execution stream
  • e5rt_async_event_create -- creates async event for synchronization
  • e5rt_async_event_signal -- signals async event

W4: Async Stream Submission

submitWithCompletionHandler: FAILED with: "Failed to add operation to E5 stream. E5RT: Reset stream to add more operations to stream. (2)". The stream must be in a specific state (reset) before async submission is possible. The stream state becomes locked after _executeStream: or executeForInputFeatures:.

W5: Port-Based Data Flow

  • Each operation has inputPorts (array of MLE5InputPort) and outputPorts (array of MLE5OutputPort)
  • Input binding mode 1 = direct binding (zero-copy from MLMultiArray)
  • Output outputBacking is nil after manual execution -- bindings are not populated by our manual path
  • Port handles are e5rt_io_port* C structs -- connecting ports across operations would require knowing the C API for port linking

Revised Assessment (after W1-W5)

  1. CoreML API is already near-optimal for single-model inference. The MLDelegateModel wrapper is faster than calling engine methods directly due to internal stream/operation caching.

  2. Manual _executeStream: with custom operations is invalid -- it produces zero output. The operations must be created through the engine's internal pipeline (via _predictionFromFeatures:stream:options:error:) which handles binding correctly.

  3. The opportunity for speedup lies in:

    • Eliminating ObjC overhead via direct e5rt_* C API calls
    • Batching multiple models into a single stream (requires understanding e5rt_execution_stream_operation lifecycle)
    • Direct MIL compilation to e5rt_program_library without going through CoreML

Experiment X1: Custom MIL -> ANE Execution (BREAKTHROUGH)

Pipeline discovered: Write MIL text file -> MLE5ProgramLibraryOnDeviceAOTCompilationImpl -> MLE5ProgramLibrary -> MLE5Engine -> predictionFromFeatures:

// 1. Write MIL text to file
NSString *mil = @"program(1.3)\n{\n    func main<ios18>(...) { ... } -> (cast_out);\n}\n";
[mil writeToFile:@"/tmp/custom.mil" ...];
NSURL *milURL = [NSURL fileURLWithPath:@"/tmp/custom.mil"];

// 2. Compile MIL to E5 program library
//    (refContainer: MLProgramE5Container borrowed from a pre-compiled CoreML model)
NSError *err = nil;
id aotImpl = [[MLE5ProgramLibraryOnDeviceAOTCompilationImpl alloc]
    initWithMILTextAtURL:milURL container:refContainer configuration:cfg];
void *plHandle = [aotImpl createProgramLibraryHandleWithRespecialization:NO error:&err];

// 3. Create program library + engine
id progLib = [[MLE5ProgramLibrary alloc] initWithImpl:aotImpl container:refContainer configuration:cfg];
id engine = [[MLE5Engine alloc] initWithProgramLibrary:progLib modelDescription:desc ...];
[engine prepareWithConcurrencyHint:1 error:nil];

// 4. Execute
id result = [engine predictionFromFeatures:fp options:opts error:&err];

Requirements:

  • MIL input/output variable names must match the model description (e.g., x for input, cast_out for output)
  • MIL shapes must match the model description shapes
  • A "container" (MLProgramE5Container) is borrowed from a pre-compiled CoreML model (needed for compilation context)
  • Input/output types should be fp32 with internal fp16 compute (cast in/out) for ANE compatibility

Verified kernels (all produce EXACT correct output on ANE):

| Kernel | MIL Op | Verification |
|---|---|---|
| ReLU | relu(x=x16) | Max diff = 0.000000, 0/16384 wrong |
| GELU | gelu(x=x16, mode="TANH_APPROXIMATION") | Verified against reference |
| Elementwise (x*2+1) | mul + add with scalar constants | Verified against reference |
| Softmax | softmax(x=x16, axis=-1) | Sum = 1.000000 |
| Layer Norm | layer_norm(x=x16, axes=[3], epsilon=1e-5) | Mean = 0.000000, Var = 0.999975 |
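
The verification column implies a CPU reference for each kernel. A minimal pure-Python sketch of such references is given here; it is not the code used in the experiments, the function names are ours, and it operates on one row (the normalized axis) at a time. The GELU uses the standard tanh approximation formula.

```python
import math

def relu(xs):
    return [max(0.0, x) for x in xs]

def gelu_tanh(xs):
    # Standard tanh approximation of GELU
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3))) for x in xs]

def softmax(xs):
    m = max(xs)                                # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(xs, epsilon=1e-5):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)   # biased variance
    scale = 1.0 / math.sqrt(var + epsilon)
    return [(x - mean) * scale for x in xs]

def max_abs_diff(a, b):
    """Metric for comparing a device output against the reference."""
    return max(abs(x - y) for x, y in zip(a, b))

row = [-2.0, -0.5, 0.0, 0.5, 2.0]
print("relu:", relu(row))
print("gelu:", [round(v, 6) for v in gelu_tanh(row)])
print("softmax sum:", sum(softmax(row)))
normed = layer_norm(row)
print("layer_norm mean:", sum(normed) / len(normed))
print("layer_norm var:", sum(v * v for v in normed) / len(normed))
```

With a non-zero epsilon the normalized variance comes out as var / (var + epsilon), slightly below 1, so a measured variance just under 1.0 is expected rather than a sign of error.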

Significance: This allows compiling arbitrary MIL programs (any operation supported by Apple's MIL spec) to run on the ANE, without going through CoreML's .mlpackage pipeline. This is the foundation for custom training/inference kernels.

Experiment Y1: Fused SDPA on ANE (PASSED)

Operation: scaled_dot_product_attention(query=Q, key=K, value=V) -- single fused op for entire attention computation.

Config: B=1, nHeads=1, seqLen=256, headDim=64 (self-attention: Q=K=V=reshape(input))

| Metric | Value |
|---|---|
| Max abs diff (vs CPU) | 0.000021 |
| Relative error | 1.40e-03 |
| Latency (first call) | 2.454 ms |
| Benchmark | 0.1708 ms/eval |
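
For reference, the computation that the fused op replaces is softmax(Q @ K.T / sqrt(d)) @ V. The following is a minimal single-head, unmasked Python sketch of a CPU reference for that formula; it is an illustration under those assumptions, not the reference implementation used to produce the numbers above.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sdpa(Q, K, V):
    """Single-head attention: Q is [L, E], K is [S, E], V is [S, EV]."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        # Scaled dot product of this query against every key
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# A zero query attends uniformly, so the output is the mean of the value rows.
print(sdpa([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 4.0], [6.0, 8.0]]))
```

For the self-attention configuration above, the same tensor would be passed as Q, K and V.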

Experiment Y2: Linear with Embedded Weights (PASSED)

Operation: linear(x=flat, weight=Wc, bias=Bc) where Wc and Bc are compile-time const tensors embedded in the MIL program.

Config: input [256, 64], linear 64->64 with embedded weight matrix and bias vector.

| Metric | Value |
|---|---|
| Max abs diff (vs CPU) | 0.001106 |
| Relative error | 1.05e-02 |
| Benchmark | 0.0610 ms/eval |

Significance: Confirms that compile-time weight constants work in MIL text format. This is the foundation for transformer inference (where weights are frozen).
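
The linear op computes x @ W.T + b with the weight stored as [D_out, D_in]. A minimal Python sketch of a CPU reference with that layout follows; the function name and the tiny example values are ours.

```python
def linear(x, weight, bias=None):
    """x: [N, D_in], weight: [D_out, D_in], bias: [D_out] or None."""
    d_out = len(weight)
    if bias is None:
        bias = [0.0] * d_out
    # Each output feature is the dot product of an input row with one weight row.
    return [[sum(a * w for a, w in zip(row, weight[o])) + bias[o]
             for o in range(d_out)]
            for row in x]

x = [[1.0, 2.0]]                              # N=1, D_in=2
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # D_out=3, D_in=2
b = [0.5, 0.5, 0.5]
print(linear(x, W, b))
```

Because each weight row maps to one output feature, no explicit transpose is needed when the weight is stored in this layout.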

Experiment Y3: Complete Transformer Block on ANE (PASSED)

Pipeline: LayerNorm -> SDPA (self-attention) -> Residual Add -> LayerNorm -> FFN (linear+GELU+linear) -> Residual Add

All in a single MIL program, compiled and executed as one ANE operation.

Config: seqLen=256, dim=64, ffnDim=128, 1-head attention, embedded FFN weights.

| Metric | Value |
|---|---|
| Output mean abs | 1.017404 (non-zero, correct) |
| Benchmark | 0.2091 ms/eval |

Significance: A full transformer layer runs on ANE in ~0.2ms. This proves that complex multi-op pipelines can be compiled as single MIL programs with no CPU round-trips between ops. The ANE compiler fuses the entire graph.

Experiment Z1: Backward Pass (Gradient Computation) on ANE (PASSED)

Operations: matmul(x=dY, y=W) for dX (input gradient), matmul(x=dY, y=dY, transpose_x=true) for dW (weight gradient). Both use runtime tensors (not const), proving backward-pass operations work on ANE.

Also tests: slice_by_index for tensor slicing, concat for packing results.

Config: dY [128,64] @ W [64,64] -> dX [128,64]; dY^T [64,128] @ dY [128,64] -> dW [64,64]

| Metric | dX | dW |
|---|---|---|
| Max abs diff | 0.001940 | 0.012828 |
| Relative error | 1.02e-02 | 3.92e-02 |
| Benchmark | 0.0593 ms/eval (both combined) | |
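
The two products in this experiment can be reproduced on the CPU with a small matmul that honors the transpose flags. This is a minimal Python sketch at the same shapes as the config above; the input values are arbitrary deterministic fill chosen by us, not the experiment's data.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y, transpose_x=False, transpose_y=False):
    """2-D matmul mirroring the transpose_x / transpose_y parameters."""
    if transpose_x:
        x = transpose(x)
    if transpose_y:
        y = transpose(y)
    if len(x[0]) != len(y):
        raise ValueError("inner dimensions do not match")
    cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

# Arbitrary deterministic fill at the experiment's shapes.
dY = [[((i * 7 + j * 3) % 11 - 5) / 10.0 for j in range(64)] for i in range(128)]
W = [[((i * 5 + j) % 13 - 6) / 10.0 for j in range(64)] for i in range(64)]

dX = matmul(dY, W)                      # [128,64] @ [64,64] -> [128,64]
dW = matmul(dY, dY, transpose_x=True)   # [64,128] @ [128,64] -> [64,64]
print(len(dX), len(dX[0]), len(dW), len(dW[0]))
```

The dW product sums over 128 terms per element versus 64 for dX, which may partly explain its larger fp16 error in the table.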

Significance: This is the first demonstration in these experiments of the ANE executing gradient computation operations. The matmul with transpose_x=true produces weight gradients that match the CPU reference to a relative error of 3.92e-02. Combined with Y3's forward pass, this establishes the complete pipeline for manual ANE training:

  1. Forward pass: Y3-style MIL (0.2 ms)
  2. Backward pass: Z1-style MIL (0.06 ms)
  3. Weight update: CPU (trivial)
  4. Recompile: (~10-50 ms, dominates training time)

MIL Text Syntax Lessons Learned

Key syntax rules discovered during Y/Z experiments:

  1. epsilon in layer_norm: Must be same dtype as gamma/beta. Use fp16 eps = const()[..., val = fp16(1e-5)] when gamma is fp16.
  2. Boolean params: Use bool tx = const()[..., val = bool(true)] for params like transpose_x.
  3. concat axis: Must be int32 scalar, not tensor<int32, [1]>. Use int32 ax = const()[..., val = int32(0)].
  4. concat interleave: Required param, use bool il = const()[..., val = bool(false)].
  5. MLE5Engine init: Correct selector is initWithProgramLibrary:modelDescription:configuration:functionName:classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo: (7 args).
  6. Container path: On macOS 15+, models may use Espresso backend. Create MLProgramE5Container via initWithModelAssetPath:configuration: using the .mlmodelc path.
  7. Sandbox: E5RT needs write access to ~/Library/Caches/ for model specialization cache.

Next Steps

  1. [HIGH] Multi-head attention -- test SDPA with multiple heads (reshape to [B, nHeads, seqLen, headDim])
  2. [HIGH] Real Qwen2.5 layer weights -- load actual model weights into MIL const tensors
  3. [HIGH] Full backward pass -- implement complete transformer backward pass (attention + FFN gradients)
  4. [MEDIUM] Training loop -- forward + backward + weight update + recompile cycle
  5. [MEDIUM] Explore e5rt_* C API directly -- bypass ObjC wrappers for lower overhead
  6. [LOW] Runtime weight injection -- investigate if weights can be updated without recompilation

Phase 7: OutputSets with stats IOSurface -- BREAKTHROUGH

  statsSurRef size=64 bytes:
    objectWithstatsSurRef: _ANEIOSurfaceOutputSets: { statsSurRef=<IOSurface: 0x...>
    id = 0x... width = 64 height = 1 pixelFormat = 0
    name = test_chaining_v2 ; outputBuffer=(
      "_ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=1}"
    )}

    Attempting ChainingRequest with valid outputSet...
    ChainingRequest created | validate: YES     <-- FIRST TIME VALIDATE PASSES!
    prepareChainingWithModel EXCEPTION:
      -[_ANEInMemoryModel getUUID]: unrecognized selector

Phase 8: Disk-based _ANEModel

  _ANEModel class found (12 class methods, 52 instance methods, 17 properties)
  Has: getUUID, inputSymbolIndicesForProcedureIndex:,
       outputSymbolIndicesForProcedureIndex:, mapper, program
  Factory: +modelAtURL:key:, +modelAtURL:key:modelAttributes:, etc.

  tmpDir contents: (weights, model.mil, net.plist, data)
  +modelAtURL: NOT available (needs key: parameter)
  -> _ANEModel could not be loaded (need correct factory + key)

Phase 9: processRequest via ProgramForEvaluation

  k1.model.program: _ANEProgramForEvaluation: { programHandle=1319967543575
    intermediateBufferHandle=0 queueDepth=127 }
  processRequest single call: YES (rv=NO)
  processRequest: 0.131 ms/eval (50 iters)
  vs RT eval: 1.45x (slower than RT but faster than standard)

Phase 10: Shared Events

  _ANESharedEvents: found (+sharedEventsWithSignalEvents:waitEvents:)
  _ANESharedSignalEvent: found
    +signalEventWithValue:symbolIndex:eventType:sharedEvent:
    Properties: sharedEvent (IOSurfaceSharedEvent), value, symbolIndex, agentMask, eventType
    alloc/init: nil (needs sharedEvent parameter)
  _ANESharedWaitEvent: found
    +waitEventWithValue:sharedEvent:
    alloc/init: nil (needs sharedEvent parameter)
  -> Both require IOSurfaceSharedEvent objects, not available from bare init

6. Architecture: Chaining Data Flow

Current (sequential):
  CPU -> IOSurface -> ANE eval layer 1 -> IOSurface -> CPU memcpy
  CPU -> IOSurface -> ANE eval layer 2 -> IOSurface -> CPU memcpy
  ... (23 round-trips for 12-layer model)

Target (chained):
  CPU -> IOSurface -> ANE eval layer 1 -> [on-chip] -> ANE eval layer 2
                   -> [on-chip] -> ... -> IOSurface -> CPU
  (1 round-trip for entire model)

Current best (sequential with standard path):
  At production dims (768x256), all paths are ~0.2ms/kernel.
  RT path only helps for small kernels (64x32: 1.88x speedup).
  For 24 evals/token at ~0.2ms each: ~4.8ms total ANE time per token.
  Chaining target: 1 round-trip instead of 24, saving ~23 x overhead per trip.
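
The per-token budget in the block above is simple arithmetic. A minimal Python sketch follows; the eval count and per-eval time come from the text, while the per-round-trip overhead values are hypothetical placeholders, since the actual overhead has not been measured here.

```python
EVALS_PER_TOKEN = 24     # from the text above
MS_PER_EVAL = 0.2        # ~0.2 ms/kernel at production dims

def sequential_ms(evals, ms_per_eval):
    """Total ANE time per token when every eval is dispatched separately."""
    return evals * ms_per_eval

def round_trips_saved(evals):
    """Chaining keeps one round-trip and removes the rest."""
    return evals - 1

def chained_ms(evals, ms_per_eval, overhead_ms):
    """Estimate assuming each avoided round-trip saves exactly overhead_ms."""
    return sequential_ms(evals, ms_per_eval) - round_trips_saved(evals) * overhead_ms

print(f"sequential: {sequential_ms(EVALS_PER_TOKEN, MS_PER_EVAL):.1f} ms/token")
print(f"round-trips saved: {round_trips_saved(EVALS_PER_TOKEN)}")
for overhead in (0.02, 0.05, 0.10):   # hypothetical overhead per round-trip, in ms
    est = chained_ms(EVALS_PER_TOKEN, MS_PER_EVAL, overhead)
    print(f"overhead {overhead:.2f} ms -> chained estimate {est:.2f} ms/token")
```

The benefit of chaining scales linearly with whatever fraction of the ~0.2 ms per eval is dispatch overhead rather than compute, so measuring that fraction is the prerequisite for sizing the win.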

7. Class Hierarchy (inferred)

NSObject
├── _ANEClient (singleton, daemon connection)
├── _ANEInMemoryModelDescriptor (MIL + weights spec)
├── _ANEInMemoryModel (compile/load/run -- in-memory MIL path)
│   └── .program -> _ANEProgramForEvaluation
├── _ANEModel (disk-based compiled model -- 52 methods, has getUUID)
│   └── .program -> _ANEProgramForEvaluation
│   └── .mapper -> _ANEProgramIOSurfacesMapper
├── _ANERequest (I/O surface packaging)
├── _ANEIOSurfaceObject (thin IOSurface wrapper)
├── _ANEBuffer (IOSurfaceObject + symbolIndex + source)
├── _ANEChainingRequest (multi-op pipeline)
├── _ANEIOSurfaceOutputSets (output packaging for chaining)
├── _ANEInputBuffersReady (input signaling for chaining)
├── _ANEOutputSetEnqueue (output enqueue config for chaining)
├── _ANEProgramIOSurfacesMapper (symbol-to-surface mapping)
├── _ANEProgramForEvaluation (lower-level eval program)
├── _ANEModelInstanceParameters (model config)
├── _ANEDeviceController (device-level control)
├── _ANEQoSMapper (QoS level mapping)
├── _ANEPerformanceStats (perf counters)
├── _ANESharedSignalEvent (hardware signal fence)
└── _ANESharedWaitEvent (hardware wait fence)

8. MIL Operations Reference (for Custom ANE Kernels)

Source: coremltools MIL Ops API Reference

The following MIL operations are available for writing custom ANE kernels via our MLE5ProgramLibraryOnDeviceAOTCompilationImpl pipeline (Experiment X1). All ops below have been confirmed available in the MIL text format used by the E5 compiler on macOS 15+.

Transformer-Critical Ops

| Op | Signature | Notes |
|---|---|---|
| scaled_dot_product_attention (iOS 18+) | `(query:[B,*?,L,E], key:[B,*?,S,E], value:[B,*?,S,EV], attn_mask?) -> [B,*?,L,EV]` | Fused softmax(Q@K.T/sqrt(d))@V. Single op for entire attention computation. |
| linear | `(x:[*D,D_in], weight:const[D_out,D_in], bias:const[D_out]?) -> [*D,D_out]` | x @ W.T + b. Weight/bias must be compile-time constants. Rank 1-3 input. |
| matmul | `(x:[*,K1], y:[*,K2], transpose_x?, transpose_y?) -> [*,T]` | N-D batch matmul with broadcasting. Supports runtime (non-const) inputs. |
| layer_norm | `(x, axes, gamma?, beta?, epsilon?) -> same shape` | Verified working on ANE (Experiment X1). |
| gelu | `(x, mode=EXACT/TANH_APPROXIMATION/SIGMOID_APPROXIMATION) -> same shape` | Verified working on ANE (Experiment X1). |
| softmax | `(x, axis) -> same shape` | Verified working on ANE (Experiment X1). |
| relu | `(x) -> same shape` | Verified working on ANE (Experiment X1). |

Data Movement Ops

| Op | Signature | Notes |
|---|---|---|
| gather | `(x, indices, axis?) -> gathered` | For embedding table lookups. |
| gather_along_axis | `(x, indices, axis?) -> gathered` | Take values along axis at index locations. |
| scatter | `(data, indices, updates, axis?, mode?) -> scattered` | For KV cache writes. Mode: update/add/sub/mul/div/max/min. |
| scatter_along_axis | `(data, indices, updates, axis?, mode?) -> scattered` | Scatter updates along axis. |

Elementwise / Reduction Ops

| Op | Notes |
|---|---|
| add, sub, mul, real_div | Elementwise with broadcasting. |
| cast | Type conversion (fp32 <-> fp16). Required for ANE I/O (fp32 in, fp16 compute, fp32 out). |
| reduce_sum, reduce_mean, reduce_max | Reduction along axes. |
| rsqrt, sqrt, exp, log, tanh | Unary elementwise. Useful for manual norm/activation implementations. |
| concat, split, reshape, transpose | Shape manipulation. |
| slice_by_index, slice_by_size | Tensor slicing for KV cache windowing. |

Key Constraints

  1. linear weights must be const: For inference this is fine (weights don't change). For training, use matmul with runtime tensors instead.
  2. MIL text format: Programs use program(1.3) { func main<ios18>(...) { ... } -> (output); } syntax. Constants use const()[name=..., val=...]. Weights reference blob files via BLOBFILE(path=..., offset=...).
  3. ANE I/O convention: Input/output should be fp32; internal compute should be fp16. Use cast ops at boundaries.
  4. Shape constraints: ANE prefers NCHW layout. Most ops work with rank-4 tensors [B, C, H, W] but linear/matmul work with lower ranks.
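
Constraint 3 means every value passes through an fp32 -> fp16 -> fp32 round trip at the boundaries, which puts a floor on achievable accuracy. A minimal Python sketch of that cast follows, using the standard library's IEEE 754 half-precision format (`"e"` in `struct`); it models the rounding only and is not how the ANE performs the cast.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def rel_error(x):
    """Relative error introduced by one fp16 round trip (x must be non-zero)."""
    return abs(to_fp16(x) - x) / abs(x)

# Half precision has an 11-bit significand, so normal values round with
# relative error of at most 2**-11 (about 4.9e-4).
for x in (1.0, 0.1, 3.14159, 1000.1):
    print(f"{x!r:>10} -> {to_fp16(x)!r:<22} rel err {rel_error(x):.2e}")
```

A single cast is bounded near 5e-4, so errors in the 1e-3 to 1e-2 range for matmul-style ops are plausibly the result of rounding accumulated across many fp16 operations.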

9. ANE Training Feasibility Analysis

Apple's Official Position

Apple's deprecated MLCompute framework (MLCDevice.ane()) explicitly states:

"This device applies to inference graphs only. It doesn't work with a training graph or inference graph that shares layers with a training graph."

This means Apple never shipped ANE-based training, even in their own training framework. The MLCTrainingGraph class supported executeForward, executeGradient, and executeOptimizerUpdate but only on CPU and GPU devices.

WWDC 2025 Confirmation

WWDC 2025 Session 360 ("Discover ML & AI frameworks") confirms:

  • CoreML dispatches to CPU, GPU, and Neural Engine at runtime for inference
  • MLX is the recommended tool for training/fine-tuning but uses Metal GPU, not ANE
  • No mention of ANE training APIs in any Apple framework
  • BNNSGraph (Accelerate) added BNNSGraphBuilder for CPU-only real-time inference

Why ANE Lacks Native Training Support

The ANE is a fixed-function inference accelerator. It likely lacks:

  • Hardware support for automatic differentiation / backward passes
  • Ability to write to weight storage during execution (weights are read-only constants in the e5rt_program_library)
  • Dynamic memory allocation needed for activation checkpointing

Manual ANE Training Approach

Despite the lack of native support, training on ANE is theoretically possible using our custom MIL pipeline:

  1. Forward pass: Write MIL program with linear/matmul/layer_norm/gelu ops. Weights embedded as constants. Execute on ANE. Save activations.
  2. Backward pass: Write separate MIL programs for each layer's gradient computation:
    • Linear backward: dX = dY @ W (matmul), dW = dY.T @ X (matmul)
    • ReLU backward: dX = dY * (X > 0) (elementwise)
    • LayerNorm backward: Multiple reduction + elementwise ops
  3. Optimizer step: Run on CPU (simple elementwise: W -= lr * dW)
  4. Recompile: After weight update, recompile MIL with new weights for next forward pass
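
The gradient formulas in step 2 and the update in step 3 can be checked on the CPU before porting them to MIL. A minimal Python sketch follows, assuming the forward pass is Y = X @ W.T with W stored as [D_out, D_in]; the helper names, the probe loss, and the tiny example values are ours.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def linear_forward(X, W):
    return matmul(X, transpose(W))        # Y = X @ W.T

def linear_backward(X, W, dY):
    dX = matmul(dY, W)                    # [N, D_out] @ [D_out, D_in]
    dW = matmul(transpose(dY), X)         # [D_out, N] @ [N, D_in]
    return dX, dW

def relu_backward(X, dY):
    return [[g if x > 0 else 0.0 for x, g in zip(xr, gr)] for xr, gr in zip(X, dY)]

def sgd_step(W, dW, lr):
    return [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, dW)]

def probe_loss(X, W, dY):
    # Scalar whose gradient with respect to Y is exactly dY
    Y = linear_forward(X, W)
    return sum(y * g for yr, gr in zip(Y, dY) for y, g in zip(yr, gr))

def finite_difference_dW(X, W, dY, i, j, h=1e-3):
    Wp = [row[:] for row in W]
    Wm = [row[:] for row in W]
    Wp[i][j] += h
    Wm[i][j] -= h
    return (probe_loss(X, Wp, dY) - probe_loss(X, Wm, dY)) / (2 * h)

X = [[1.0, 2.0], [3.0, -1.0]]                   # N=2, D_in=2
W = [[0.5, -1.0], [2.0, 0.25], [1.0, 1.0]]      # D_out=3, D_in=2
dY = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]        # upstream gradient, N=2, D_out=3

dX, dW = linear_backward(X, W, dY)
W_next = sgd_step(W, dW, lr=0.1)
print("dX:", dX)
print("dW:", dW)
print("finite-difference dW[0][1]:", finite_difference_dW(X, W, dY, 0, 1))
```

The finite-difference value should agree with the analytic dW entry; the same comparison can serve as the reference when validating a MIL backward kernel.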

The key bottleneck is step 4: recompiling MIL after every weight update. The createProgramLibraryHandleWithRespecialization: call takes ~10-50ms, which would dominate training time. This makes per-step ANE training impractical unless we can find a way to update weights without recompilation (e.g., via the e5rt_* C API or runtime weight injection).
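
To make the bottleneck concrete, the share of each training step spent recompiling can be estimated from the figures above (forward ~0.2 ms, backward ~0.06 ms, recompile ~10-50 ms). A minimal Python sketch follows; it treats the CPU weight update as zero cost, which is our simplifying assumption based on the text calling it trivial.

```python
FORWARD_MS = 0.2     # Y3-style forward pass
BACKWARD_MS = 0.06   # Z1-style backward pass

def step_ms(forward_ms, backward_ms, recompile_ms, update_ms=0.0):
    """Wall time of one training step; update_ms defaults to 0 (assumption)."""
    return forward_ms + backward_ms + update_ms + recompile_ms

def recompile_fraction(forward_ms, backward_ms, recompile_ms, update_ms=0.0):
    """Share of the step spent in recompilation."""
    return recompile_ms / step_ms(forward_ms, backward_ms, recompile_ms, update_ms)

for recompile in (10.0, 50.0):   # the ~10-50 ms range quoted above
    total = step_ms(FORWARD_MS, BACKWARD_MS, recompile)
    frac = recompile_fraction(FORWARD_MS, BACKWARD_MS, recompile)
    print(f"recompile {recompile:.0f} ms: step {total:.2f} ms, "
          f"{frac:.1%} recompiling, {1000.0 / total:.1f} steps/s")
```

Even at the low end of the range, recompilation accounts for well over 90% of the step, so amortizing it (for example, by recompiling only every several steps) or avoiding it matters far more than speeding up the ANE kernels themselves.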