ANE ChainingRequest API Research

Research into Apple Neural Engine private APIs for multi-kernel pipelining, conducted on M4 Max / macOS 15.

Goal: Eliminate CPU round-trips between ANE layer evaluations. In a 12-layer model, sequential evaluation requires 23+ CPU-ANE round-trips per token. The _ANEChainingRequest API appears designed to let the ANE run operations back-to-back in a hardware pipeline, keeping data on-chip.

Status: ChainingRequest validates and prepareChainingWithModel: no longer crashes (crash fix: pass nil for symbol/procedure params). Blocked on Code=15 (ANEProgramChainingPrepare Failed) -- the _ANEModel needs Espresso IR format (not MIL) for full symbol table population. At production dims (768x256), sequential ANE dispatch costs ~0.2ms/kernel; chaining would save ~23 round-trips per token.

See also: ANE_INTERNALS.md for comprehensive ANE documentation including compilation pipeline, hardware specs, and community research references.

Test Files

File	Purpose
`training/test_chaining.m`	v1 prototype: sequential baseline + ChainingRequest creation
`training/test_chaining_v2.m`	v2 deep exploration: 6-phase probe of 12+ private classes
`training/test_ane_model.m`	Experiments E-P: _ANEModel loading, compiler, chaining, fences, type encoding, mapping
`training/test_throughput_ceiling.m`	Experiment I: 12-kernel throughput ceiling benchmark

Build and run:

cd training
make test_chaining && ./test_chaining
make test_chaining_v2 && ./test_chaining_v2
make test_ane_model && ./test_ane_model
make test_throughput_ceiling && ./test_throughput_ceiling

1. Executive Summary

What works

Finding	Impact	Status
`evaluateRealTimeWithModel:` via `_ANEClient`	1.88x faster on small kernels (64x32); no benefit at production dims (768x256)	Benchmarked
`processRequest` via `_ANEProgramForEvaluation`	1.34x faster on small kernels; marginal at production dims	Benchmarked
`_ANEBuffer` wraps IOSurface with `symbolIndex`	Solves input indexing for chaining	Proven
All 9 unexplored ANE classes exist on M4 Max	Full API surfaces documented	Documented

Important: The RT execution speedup (1.88x) observed in isolated testing on 64x32 convolution kernels does not generalize to production dimensions. At 768x256 (Stories110M size), all four execution paths converge to ~0.2 ms per kernel. See Production Dimension Results below.

What's been solved

Finding	Status	Detail
`_ANEIOSurfaceOutputSets` works with 64-byte statsSurRef	SOLVED	Any non-NULL IOSurface works as stats buffer
`_ANEChainingRequest.validate` returns YES	SOLVED	With proper `_ANEBuffer` inputs + `_ANEIOSurfaceOutputSets` outputs
`processRequest` via `_ANEProgramForEvaluation`	BENCHMARKED	Lower-level eval, 1.34x faster on 64x32 kernels (0.131 ms vs 0.175 ms)
ChainingRequest factory crash (`[NSConstantIntegerNumber count]`)	SOLVED	Pass `nil` for `lbInputSymbolId`, `lbOutputSymbolId`, `procedureIndex`
`_ANEModel` loading from temp directory	SOLVED	`modelAtURL:key:` with tmpDir URL + hexStringIdentifier
`_ANESharedSignalEvent` / `_ANESharedWaitEvent`	SOLVED	Use `MTLSharedEvent` or `IOSurfaceSharedEventCreate()`
ChainingRequest type encodings	DOCUMENTED	All 9 factory params are `@` (object). `prepare` has 5 params (3x`@`, 1x`I` qos, 1x`^@` err)

What's still blocked

Blocker	Root Cause
`prepareChainingWithModel:` returns Code=15	`ANEProgramChainingPrepare() Failed` -- model not recognized as chaining-capable
`_ANEModel` has empty symbol table	MIL-compiled model shell lacks Espresso IR data (`model.espresso.net`)
`_ANEClient.loadModel:` / `compileModel:` fail	Require Espresso IR format, not MIL
`_ANEProgramIOSurfacesMapper` returns NO	Needs fully loaded model with symbol table
`_ANEPerformanceStats` with `_ANERequest`	Request expects `statType` selector on perfStats objects

2. ANE Private API Class Map

Core Classes (known working)

_ANEInMemoryModel -- the model object for in-memory MIL compilation.

+inMemoryModelWithDescriptor: -- create from _ANEInMemoryModelDescriptor
-compileWithQoS:options:error: -- compile MIL to ANE binary
-loadWithQoS:options:error: -- load compiled model onto ANE
-evaluateWithQoS:options:request:error: -- standard evaluation (QoS 0-63, 21 default)
-unloadWithQoS:error: -- unload from ANE
Properties: hexStringIdentifier, programHandle (uint64), program (_ANEProgramForEvaluation), perfStatsMask
Missing: inputSymbolNames, outputSymbolNames, inputSymbolIndicesForProcedureIndex:

_ANEInMemoryModelDescriptor -- model specification.

+modelWithMILText:weights:optionsPlist: -- create descriptor from MIL NSData + weight dict

_ANERequest -- evaluation request packaging I/O surfaces.

+requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:
perfStats parameter expects NSArray of stat info objects (not _ANEPerformanceStats)

_ANEIOSurfaceObject -- thin wrapper around IOSurfaceRef.

+objectWithIOSurface: -- wrap a raw IOSurface
Does NOT have symbolIndex property (this was the v1 blocker; _ANEBuffer supplies symbolIndex instead)

_ANEClient -- client connection to the ANE daemon.

+sharedConnection -- singleton accessor
-evaluateWithModel:options:request:qos:error: -- 5-param eval via client
-evaluateRealTimeWithModel:options:request:error: -- RT priority eval (1.7-1.9x faster on 64x32 kernels; no benefit at production dims)
-doEvaluateDirectWithModel:options:request:qos:error: -- direct eval bypass
-beginRealTimeTask / -endRealTimeTask -- RT task bracketing (returns NO, but RT eval still works)
-prepareChainingWithModel:options:chainingReq:qos:error: -- chaining setup
-enqueueSetsWithModel:outputSet:options:qos:error: -- chaining output enqueue
-buffersReadyWithModel:inputBuffers:options:qos:error: -- chaining input signal

Discovered Classes (v2 exploration)

_ANEBuffer -- wraps _ANEIOSurfaceObject with index metadata. Key discovery.

+bufferWithIOSurfaceObject:symbolIndex:source: -- factory
- ioSurfaceObject: an _ANEIOSurfaceObject (NOT raw IOSurfaceRef)
- symbolIndex: NSNumber mapping to compiled model I/O symbol
- source: long long -- 0=ANE, 1=output, 2=unknown
Properties: ioSurfaceObject, symbolIndex, source
Description format: "_ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=0 }"

_ANEProgramIOSurfacesMapper -- maps IOSurfaces to compiled model symbols.

+mapperWithProgramHandle:(uint64_t)handle -- works, creates mapper
+mapperWithController:(id)ctrl -- alternative factory
-mapIOSurfacesWithModel:request:cacheInference:error: -- FAILS on _ANEInMemoryModel (calls inputSymbolIndicesForProcedureIndex: which doesn't exist)
-validateRequest:model: -- also fails for same reason
Implication: designed for _ANEModel (disk-based compiled models), not in-memory MIL

_ANEProgramForEvaluation -- lower-level evaluation program.

Accessible via model.program property
+programWithHandle:intermediateBufferHandle:queueDepth: -- factory
-processRequest:model:qos:qIndex:modelStringID:options:returnValue:error: -- low-level eval

_ANEIOSurfaceOutputSets -- output set packaging for chaining.

+objectWithstatsSurRef:outputBuffer: -- factory
- statsSurRef: IOSurfaceRef for perf stats collection -- returns nil when NULL
- outputBuffer: NSArray of _ANEBuffer objects
Previously the blocker; resolved: any non-NULL IOSurface (a 64-byte surface suffices) works as the stats buffer

_ANEInputBuffersReady -- input signaling for chaining pipeline.

+inputBuffersWithProcedureIndex:inputBufferInfoIndex:inputFreeValue:executionDelay:
Parameters: procedure index, buffer info indices, free values, execution delay
This is the mechanism that tells the ANE "inputs are ready, start processing"

_ANEOutputSetEnqueue -- output pipeline configuration for chaining.

+outputSetWithProcedureIndex:setIndex:signalValue:signalNotRequired:isOpenLoop:
Configures output set enqueue behavior with signal values and open-loop mode

_ANEChainingRequest -- the chaining request itself.

+chainingRequestWithInputs:outputSets:lbInputSymbolId:lbOutputSymbolId:procedureIndex:signalEvents:transactionHandle:fwEnqueueDelay:memoryPoolId:
-validate -- returns YES/NO
Expects inputs as _ANEBuffer objects, outputSets as _ANEIOSurfaceOutputSets objects

_ANEModelInstanceParameters -- model instance configuration.

Alloc/init produces a valid object
API surface dumped but not yet exercised

_ANEDeviceController -- device-level controller.

+controllerWithProgramHandle: -- attempted but returned nil in our tests

_ANEQoSMapper -- QoS level mapping.

API surface dumped, not yet exercised

_ANEPerformanceStats -- performance statistics.

+statsWithHardwareExecutionNS:(uint64_t)ns -- factory
Properties: hwExecutionTime, performanceCounters
Cannot be used with _ANERequest.perfStats (expects array of objects with statType selector)
Setting perfStatsMask=0xFF on model works but performanceCounters returns nil

_ANESharedSignalEvent / _ANESharedWaitEvent -- hardware sync primitives (exercised in Experiment G).

Likely the fence mechanism for GPU-ANE or multi-model synchronization
Referenced in _ANEChainingRequest.signalEvents parameter

3. Experiment Logs

v1: test_chaining.m Results (M4 Max)

=== ANE ChainingRequest Prototype ===

All required classes found.

--- Phase 1: Compile two identical conv kernels ---
  Kernel 1: compiled and loaded
  Kernel 2: compiled and loaded

--- Phase 2: Baseline (sequential eval) ---
  Sequential: 10.355 ms total (0.207 ms/pair)
  Output[0..3]: [0.2500, 0.2500, 0.2500, 0.2500]

--- Phase 3: _ANEChainingRequest exploration ---
  _ANEClient: obtained
  ChainingRequest created: _ANEChainingRequest: { inputBuffer=(
    "_ANEIOSurfaceObject: { ioSurface=0x... ; startOffset=0 }"
  ) ; outputSets=( ... ) }
  validate: NO

--- Phase 4: Loopback ChainingRequest ---
  ChainingRequest created (loopback)
  validate: NO
  prepareChainingWithModel: EXCEPTION (validate fails first)

--- Summary ---
  Sequential baseline: 0.207 ms/pair (two evals + memcpy)
  ChainingRequest: creates but validate FAILS
  Root cause: _ANEIOSurfaceObject lacks symbolIndex property
  Next: explore _ANEBuffer and _ANEProgramIOSurfacesMapper

v2: test_chaining_v2.m Results (M4 Max)

Phase 1: Class Introspection

9 classes found, 0 missing
All classes exist on M4 Max / macOS 15
Full method lists, properties, and type encodings dumped for each

Phase 2: Symbol Name Discovery

inputSymbolNames: NOT available on _ANEInMemoryModel
outputSymbolNames: NOT available on _ANEInMemoryModel
programHandle: YES (uint64 handle to compiled program)
_ANEIOSurfaceObject does NOT have symbolIndex getter or setter
+objectWithIOSurface:symbolIndex: class method NOT available

Phase 3: IOSurface Mapper & Buffer Experiments

3a: _ANEProgramIOSurfacesMapper

  mapperWithProgramHandle(12345): created successfully
  mapIOSurfacesWithModel: EXCEPTION
    -[_ANEInMemoryModel inputSymbolIndicesForProcedureIndex:]:
    unrecognized selector
  validateRequest:model: EXCEPTION (same reason)

3b: _ANEBuffer -- success

  bufferWithIOSurfaceObject(symIdx=0, source=0):
    _ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=0 }
  bufferWithIOSurfaceObject(symIdx=0, source=1):
    _ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=1 }
  bufferWithIOSurfaceObject(symIdx=0, source=2):
    _ANEBuffer: { ioSurface=0x... ; symbolIndex=0 ; ANEBufferProducerAgent=2 }
  bufferWithIOSurfaceObject(symIdx=1, source=0):
    _ANEBuffer: { ioSurface=0x... ; symbolIndex=1 ; ANEBufferProducerAgent=0 }
  symbolIndex property: accessible and correct

3c: _ANEIOSurfaceObject symbolIndex experiments

  setSymbolIndex: NOT available on _ANEIOSurfaceObject
  symbolIndex getter: NOT available
  +objectWithIOSurface:symbolIndex: NOT available

3d: IOSurface property experiments

  IOSurface 'symbolIndex' property (set via IOSurfaceSetValue): 0
  _ANEIOSurfaceObject.symbolIndex after property set: <exception>
  (IOSurface user properties do NOT propagate to _ANEIOSurfaceObject)

3e: _ANEProgramForEvaluation

  k1.model.program: <_ANEProgramForEvaluation: 0x...>
  (accessible via model.program property)

Phase 4: ChainingRequest Retry

4a: Sequential baseline

  Sequential: 0.259 ms/pair (50 iters)
  Output[0..3]: [0.2500, 0.2500, 0.2500, 0.2500]

Attempts 1-4: Various raw IOSurface configurations

  [Attempt 1] Standard (raw IOSurfaceObject): CRASH
    -[_ANEIOSurfaceObject symbolIndex]: unrecognized selector
  [Attempt 2] IOSurface with symbolIndex property: CRASH (same)
  [Attempt 3] Two-model loopback: CRASH (same)
  [Attempt 4] Skip validate, call prepareChainingWithModel directly: CRASH (same)

Attempt 5: _ANEBuffer + _ANEIOSurfaceOutputSets

  bufIn: _ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=0 }
  bufOut: _ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=1 }
  outputSet (objectWithstatsSurRef:NULL outputBuffer:@[bufOut]): nil
  -> _ANEIOSurfaceOutputSets returns nil when statsSurRef is NULL

Attempt 6: _ANEClient.evaluateWithModel: -- works

  evaluateWithModel (via client): YES

Attempt 7: _ANEClient.doEvaluateDirectWithModel: -- works

  doEvaluateDirectWithModel: YES

Phase 5: Alternative Execution Paths

5a: Real-time eval -- 1.7x speedup

  beginRealTimeTask: NO (possibly needs entitlement)
  evaluateRealTimeWithModel: YES

  RT eval:       0.090 ms/eval avg (50 iters)
  Standard eval: 0.157 ms/eval avg (50 iters)
  RT vs Standard speedup: 1.74x

  endRealTimeTask: NO

5b: PerfStats

  perfStatsMask = 0x01..0x80: set OK (all masks accepted)
  statsWithHardwareExecutionNS:0 = <_ANEPerformanceStats>
  Eval with @[perfStats]: OK (no crash when wrapped in array)
  hwExecutionTime after eval: nil
  Eval with mask=0xFF, perfStats=nil: OK
  performanceCounters: nil

4. Evaluation Path Benchmarks

Measured on 64x32 convolution kernels, M4 Max, 200 iterations after 10 warmup:

Method	Latency	Speedup	API
`evaluateWithQoS:` (standard)	0.175 ms	1.0x	`model.evaluateWithQoS:options:request:error:`
`evaluateRealTimeWithModel:`	0.093 ms	1.88x	`client.evaluateRealTimeWithModel:options:request:error:`
`processRequest`	0.131 ms	1.34x	`program.processRequest:model:qos:qIndex:modelStringID:options:returnValue:error:`
`doEvaluateDirectWithModel:`	0.225 ms	0.78x	`client.doEvaluateDirectWithModel:options:request:qos:error:`

Key observations (small kernel, isolated):

RT eval was fastest in isolated test (1.88x speedup on 64x32)
processRequest was faster than standard but slower than RT
doEvaluateDirectWithModel was actually slower than standard (0.78x)
beginRealTimeTask returning NO does not prevent evaluateRealTimeWithModel: from working

Production Dimension Results (test_bench_paths.m, M4 Max)

At realistic kernel sizes with multiple compiled models, the picture changes:

Config	Standard	RT	processRequest	ane_eval_rt
64x32 (test)	0.109 ms	0.233 ms (0.5x)	0.156 ms (0.7x)	0.195 ms (0.6x)
128x64	0.208 ms	0.184 ms (1.1x)	0.201 ms (1.0x)	0.185 ms (1.1x)
256x64	0.197 ms	0.212 ms (0.9x)	0.203 ms (1.0x)	0.157 ms (1.3x)
512x64	0.120 ms	0.147 ms (0.8x)	0.194 ms (0.6x)	0.179 ms (0.7x)
768x256 (prod)	0.205 ms	0.246 ms (0.8x)	0.185 ms (1.1x)	0.291 ms (0.7x)

Key finding: The RT eval speedup observed in isolated testing (1.88x) does not hold at production dimensions. At 768x256 (Stories110M size), all eval paths perform similarly (~0.2 ms), with standard eval being competitive or fastest. The overhead of the client-based paths (RT, direct) outweighs any ANE scheduling benefit at scale.

5. Remaining Blockers and Next Steps

SOLVED: _ANEIOSurfaceOutputSets statsSurRef

The chaining pipeline requires:

Inputs as _ANEBuffer objects with symbolIndex -- SOLVED
OutputSets as _ANEIOSurfaceOutputSets objects -- SOLVED

A 64-byte IOSurface as statsSurRef is sufficient. _ANEChainingRequest.validate returns YES with this setup.

SOLVED: ChainingRequest parameter type mismatch (Experiment K-L)

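The [NSConstantIntegerNumber count] crash was caused by passing NSNumber values for lbInputSymbolId, lbOutputSymbolId, and procedureIndex. Type encoding analysis (Experiment K) showed that all 9 factory parameters are @ (id/object), so any object is accepted at the call site. Internally, however, the implementation sends count to some of these values and unsignedIntegerValue to others, so both NSNumber and NSArray arguments crash (Experiment L); only nil gets through.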

Fix: Pass nil for lbInputSymbolId, lbOutputSymbolId, and procedureIndex:

chainingRequestWithInputs:@[buf] outputSets:@[outSet]
    lbInputSymbolId:nil lbOutputSymbolId:nil procedureIndex:nil
    signalEvents:@[] transactionHandle:@0 fwEnqueueDelay:@0 memoryPoolId:@0

This produces a valid _ANEChainingRequest (validate returns YES) and prepareChainingWithModel: no longer crashes.

Current Blocker: ANEProgramChainingPrepare() Failed (Code=15)

prepareChainingWithModel: now returns NO with error:

Error Domain=com.apple.appleneuralengine Code=15
"ANEProgramChainingPrepare() Failed: Program chaining prepare error"

Three model types were tested. Both _ANEModel variants return this error; _ANEInMemoryModel fails earlier:

Fresh _ANEModel (state=1, populated with programHandle+program)
Populated _ANEModel from Experiment E (state=5 after failed loadModel/compileModel)
_ANEInMemoryModel still crashes on getUUID (cannot be used with chaining at all)

The Code=15 error is a logical failure in the ANE daemon's chaining preparation, not a crash. The model is not fully recognized as "chaining-capable" by the daemon, likely because:

The _ANEModel was populated by copying programHandle/program from an _ANEInMemoryModel, not loaded through the standard CoreML/Espresso pipeline
Symbol indices remain empty (the daemon may require them for chaining buffer routing)
The model needs model.espresso.net format (not MIL) for _ANEClient.loadModel: / compileModel:

Previous blocker (SOLVED): [NSConstantIntegerNumber count] crash -- fixed by passing nil for symbol/procedure params.

Experiments E-H Results (test_ane_model.m)

Experiment E: _ANEModel Loading -- SOLVED

_ANEModel.modelAtURL:key: works with the compiled temp directory URL and hexStringIdentifier as key:

diskModel = _ANEModel.modelAtURL:key:(tmpDirURL, hexId)
  -> _ANEModel with UUID, getUUID works
  -> state=1, program=nil, programHandle=0 (shell only)

Populating the shell with _ANEInMemoryModel data:

diskModel.setProgramHandle:(inMemoryModel.programHandle)  -> success
diskModel.setProgram:(inMemoryModel.program)              -> success

After population, programHandle and program are set, but inputSymbolIndicesForProcedureIndex:0 still returns empty NSIndexSet. The symbol table data isn't stored in the _ANEProgramForEvaluation -- it's likely in the model.hwx or net.plist that the standard CoreML path generates.

Experiment E2: ANECompiler -- No ObjC API

ANECompiler.framework exists at /System/Library/PrivateFrameworks/ANECompiler.framework/ but contains no ObjC classes -- it's a pure C library (ANECCompile() is the entry point, called internally by _ANEInMemoryModel.compileWithQoS:)
debug_mask option had no visible effect on compilation output
No ane_compiler_service found at standard paths
Key _ANEInMemoryModel compilation methods found: saveModelFiles, localModelPath, compiledModelExists, mapIOSurfacesWithRequest:cacheInference:error:

Experiment F: Chaining Pipeline -- Blocked

With populated _ANEModel (has UUID + programHandle + program), prepareChainingWithModel: still crashes on [NSConstantIntegerNumber count]. The crash is in the _ANEChainingRequest parameter handling, not in the model itself. (Resolved later in Experiment L by passing nil for the symbol/procedure params.)

Experiment G: Hardware Fences -- FULLY SOLVED

Both _ANESharedSignalEvent and _ANESharedWaitEvent now work:

// MTLSharedEvent via Metal (works)
id device = MTLCreateSystemDefaultDevice();
id sharedEvent = [device newSharedEvent];

// IOSurfaceSharedEvent via IOKit (also works)
id iosEvent = IOSurfaceSharedEventCreate();

// Signal event factory: (uint64_t value, unsigned int symbolIndex, long long eventType, id sharedEvent)
_ANESharedSignalEvent.signalEventWithValue:symbolIndex:eventType:sharedEvent:
  -> works with both MTLSharedEvent and IOSurfaceSharedEvent

// Wait event factory: (uint64_t value, id sharedEvent)
_ANESharedWaitEvent.waitEventWithValue:sharedEvent:
  -> works with both event types

Event types 0, 1, 2 all produce valid signal events. The eventType property is correctly set.

Experiment H: Alternative Preparation -- Same Crash

doPrepareChainingWithModel:options:chainingReq:qos:error: exists with identical signature and crashes identically. Full _ANEClient API (46 instance methods) documented in test output.

Throughput Ceiling (test_throughput_ceiling.m, Experiment I)

12-kernel pipeline benchmarks on M4 Max:

Config	Sequential (run+memcpy)	Run-only	Memcpy-only	GCD Serial
64x32 (test)	0.272 ms/kernel	0.158 ms/kernel	0.001 ms/copy	0.200 ms/kernel
256x64 (small)	0.191 ms/kernel	0.181 ms/kernel	0.002 ms/copy	0.176 ms/kernel
768x256 (prod)	0.177 ms/kernel	0.226 ms/kernel	0.006 ms/copy	0.186 ms/kernel

Key findings:

Memcpy overhead is negligible (<0.01 ms per copy even at 393KB). Not the bottleneck.
CPU round-trip overhead is in the ANE dispatch itself, not data movement.
At production dims, sequential run+memcpy (0.177 ms/kernel) is actually faster than run-only (0.226 ms/kernel), possibly a pipeline caching effect.
GCD serial queue provides modest improvement at small dims but marginal at production.
Chaining's value would be eliminating the ~0.2ms/kernel ANE dispatch overhead, not memcpy. With 12 kernels, total pipeline takes ~2.1ms (prod), so eliminating dispatch could potentially halve this.
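The figures in these findings can be reproduced with two lines of arithmetic. The sketch below assumes fp16 activations (2 bytes per element), which is an inference from the 393KB figure rather than something stated in the benchmark output; the function names are illustrative.

```c
#include <stddef.h>

/* Bytes occupied by a dense rows x cols tensor. */
size_t tensor_bytes(size_t rows, size_t cols, size_t elem_size) {
    return rows * cols * elem_size;
}

/* Total latency in ms of `kernels` back-to-back dispatches, assuming
 * each one costs the same per-kernel latency. */
double pipeline_ms(int kernels, double ms_per_kernel) {
    return kernels * ms_per_kernel;
}
```

`tensor_bytes(768, 256, 2)` yields 393216 bytes, consistent with the 393KB copy size, and `pipeline_ms(12, 0.177)` yields about 2.12 ms, consistent with the ~2.1 ms production pipeline total.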

Experiments K-P Results (test_ane_model.m, 2026-03-04)

Experiment K: Type Encoding Analysis -- COMPLETE

Full type encodings for all chaining-related methods:

Method	Encoding	Notes
`chainingRequestWithInputs:...`	`@88@0:8@16@24@32@40@48@56@64@72@80`	All 9 params are `@` (id/object)
`prepareChainingWithModel:...`	`B52@0:8@16@24@32I40^@44`	5 params: 3x `@`, 1x `I` (uint32 qos), 1x `^@` (error ptr)
`doPrepareChainingWithModel:...`	`B52@0:8@16@24@32I40^@44`	Same signature as prepareChainingWithModel

The _ANEChainingRequest factory takes 9 object parameters. The lbInputSymbolId, lbOutputSymbolId, and procedureIndex are all @ (object), not raw integers. Internally, the factory calls unsignedIntegerValue (from NSNumber) or count (from NSArray) on these parameters.
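To make the encoding strings above easier to check by hand, here is a small sketch of a parser that counts declared parameters in an Objective-C method type encoding. It is a simplified reader written for this note (the names `skip_type` and `count_params` are hypothetical) and handles only single-character codes, `^` pointer prefixes, and balanced struct/array/union groups. At runtime the strings themselves come from `method_getTypeEncoding`.

```c
#include <ctype.h>

/* Advance past one type in an Objective-C type encoding.
 * Handles single-character codes, '^' pointer prefixes, and balanced
 * {struct}, [array], and (union) groups. */
const char *skip_type(const char *p) {
    while (*p == '^') p++;                      /* pointer-to prefixes */
    if (*p == '{' || *p == '[' || *p == '(') {
        int depth = 0;
        do {
            if (*p == '{' || *p == '[' || *p == '(') depth++;
            else if (*p == '}' || *p == ']' || *p == ')') depth--;
            p++;
        } while (depth > 0 && *p != '\0');
        return p;
    }
    return *p ? p + 1 : p;
}

/* Count the declared parameters of a method from its type encoding,
 * excluding the return type and the implicit self (@) and _cmd (:). */
int count_params(const char *enc) {
    const char *p = skip_type(enc);             /* return type */
    while (isdigit((unsigned char)*p)) p++;     /* total frame size */
    int args = 0;
    while (*p != '\0') {
        p = skip_type(p);                       /* argument type */
        while (isdigit((unsigned char)*p)) p++; /* argument offset */
        args++;
    }
    return args - 2;
}
```

Applied to the table, the factory encoding yields 9 parameters and both prepare encodings yield 5, in line with the Notes column.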

`_ANEChainingRequest` Property	Encoding	Type
`procedureIndex`	`@`	id (nil or NSArray)
`loopbackInputSymbolIndex`	`@`	id (nil or NSArray)
`loopbackOutputSymbolIndex`	`@`	id (nil or NSArray)

Experiment L: Array-Typed Parameters -- BREAKTHROUGH

Combo	lbIn	lbOut	procIdx	Factory	Validate	Prepare
L.1: Arrays `@[@(-1)]`	`@[@(-1)]`	`@[@(-1)]`	`@[@0]`	CRASH: `unsignedIntegerValue` on NSArray	-	-
L.2: Arrays `@[@0]`	`@[@0]`	`@[@0]`	`@[@0]`	CRASH: `unsignedIntegerValue` on NSArray	-	-
L.3: Empty `@[]`	`@[]`	`@[]`	`@[]`	CRASH: `unsignedIntegerValue` on empty array	-	-
L.4: nil	nil	nil	nil	OK	YES	NO (Code=15)
L.5: NSNumber	`@(-1)`	`@(-1)`	`@0`	CRASH: `count` on NSNumber	-	-

Passing nil for all three symbol/procedure params gets past both the factory crash and the prepareChainingWithModel crash. The validate returns YES and prepareChainingWithModel: returns a clean error (Code=15: ANEProgramChainingPrepare() Failed) instead of crashing.

Experiment M: Load Model via _ANEClient -- BLOCKED

Both loadModel: and compileModel: on _ANEClient require Espresso IR format (model.espresso.net), not MIL:

Error Domain=com.apple.appleneuralengine.espresso Code=-1
"_ANEEspressoIRTranslator : error Cannot load network '.../model.espresso.net'"

compiledModelExistsFor: returns NO for our MIL-compiled model. After the failed load/compile attempts, the _ANEModel state changes from 1 to 5 (error/invalid state).

The standard CoreML pipeline generates model.espresso.net (Espresso IR) and model.espresso.weights from the .mlpackage / .mlmodelc format. Our MIL-only path bypasses this, so we can't use _ANEClient.loadModel: without first generating the Espresso IR.

Experiment N: IOSurface Mapping -- PARTIAL

_ANEProgramIOSurfacesMapper:

mapperWithProgramHandle: creates a valid mapper from the _ANEInMemoryModel programHandle
mapIOSurfacesWithModel:request:cacheInference:error: returns NO (no exception, no error output)
validateRequest:model: returns NO
_ANEModel.mapper property is nil
prepareANEMemoryMappingParams:request: revealed ANEMemoryMappingParamsStruct has 128 ANEBufferStruct slots: [128{ANEBufferStruct=^{__IOSurface}IiiI}]

The mapper appears to need a fully loaded model with symbol table data that our MIL-compiled shell doesn't have.
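Read as a C declaration, the encoding `[128{ANEBufferStruct=^{__IOSurface}IiiI}]` corresponds roughly to the layout below. Only the field types, their order, and the slot count come from the encoding; every field name, the `ANE_MAX_BUFFERS` constant, and the wrapper struct name are hypothetical, and the real ANEMemoryMappingParamsStruct may contain further members.

```c
#include <stddef.h>

struct __IOSurface;                 /* opaque; IOSurfaceRef points to this */

/* Field names are hypothetical; only the types and their order
 * (^{__IOSurface} I i i I) come from the encoding. */
typedef struct {
    struct __IOSurface *surface;    /* ^{__IOSurface} */
    unsigned int        field1;     /* I */
    int                 field2;     /* i */
    int                 field3;     /* i */
    unsigned int        field4;     /* I */
} ANEBufferStruct;

enum { ANE_MAX_BUFFERS = 128 };     /* slot count from the [128...] prefix */

/* Sketch of the known part of the mapping params: the buffer array only. */
typedef struct {
    ANEBufferStruct buffers[ANE_MAX_BUFFERS];
} ANEMemoryMappingParamsSketch;

/* Number of buffer slots available to a single mapping request. */
size_t ane_buffer_slots(void) {
    return sizeof(((ANEMemoryMappingParamsSketch *)0)->buffers)
         / sizeof(ANEBufferStruct);
}
```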

Experiment O: Procedure Info -- EMPTY

procedureInfoForProcedureIndex:0 returns nil on the populated _ANEModel
procedureCount is not a method or KVC-accessible property
modelAttributes returns empty dictionary {}
inputSymbolNames / outputSymbolNames not available on _ANEModel
The symbolIndicesForProcedureIndex:indexArrayKey: method exists (takes I + @) but symbol data is empty

Experiment P: Full Chaining Retry -- Code=15

Tested with three model types, all using nil for symbol params:

Model	State	validate	prepare Result
Fresh `_ANEModel` (state=1, populated)	1	YES	NO (Code=15)
`_ANEInMemoryModel`	3	YES	CRASH: `getUUID`
Populated `_ANEModel` (from E, state=5)	5	YES	NO (Code=15)

Also documented _ANEInputBuffersReady and _ANEOutputSetEnqueue type signatures:

Class	Factory	Param Types
`_ANEInputBuffersReady`	`inputBuffersWithProcedureIndex:inputBufferInfoIndex:inputFreeValue:executionDelay:`	`I` (uint32), `@` (NSArray), `@` (NSArray), `Q` (uint64)
`_ANEOutputSetEnqueue`	`outputSetWithProcedureIndex:setIndex:signalValue:signalNotRequired:isOpenLoop:`	`I`, `I`, `Q`, `B`, `B`
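The single-letter codes in the Param Types column follow the standard Objective-C type encoding. As a reading aid, a minimal lookup for the codes that appear in this document might look like this (the function name is illustrative; the mapping itself is the documented runtime encoding):

```c
/* Map a single Objective-C type-encoding character to the C/ObjC type
 * it stands for. Covers only the codes used in this document. */
const char *objc_type_name(char code) {
    switch (code) {
    case '@': return "id";
    case ':': return "SEL";
    case 'B': return "bool";
    case 'i': return "int";
    case 'I': return "unsigned int";
    case 'q': return "long long";
    case 'Q': return "unsigned long long";
    case '^': return "pointer";
    case 'v': return "void";
    default:  return "unknown";
    }
}
```

So `I`, `I`, `Q`, `B`, `B` for `_ANEOutputSetEnqueue` reads as two unsigned ints, one unsigned long long, and two booleans.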

Experiments Q-S Results (test_coreml_chaining.m, 2026-03-04)

Experiment Q: CoreML Pipeline -- MAJOR DISCOVERY

The E5 runtime (macOS 15+) does NOT use _ANEModel or _ANEChainingRequest at all.

CoreML on macOS 15 uses the MIL-based "E5" runtime, which completely bypasses the older Espresso/_ANEModel/_ANEChainingRequest path:

Component	Old Path (Espresso)	New Path (E5/MIL)
Model format	`.espresso.net` + `.espresso.weights`	`model.mil` + `weights/weight.bin`
Model class	`_ANEModel`	`e5rt_program_library` (C struct)
Engine	`_ANEClient` + `_ANERequest`	`MLE5Engine` + `MLE5ExecutionStreamOperation`
Chaining	`_ANEChainingRequest`	`e5rt_execution_stream_operation` (unknown)
Compile	`_ANEClient.compileModel:`	`e5rt_program_library` AOT compilation
Sync	`_ANESharedSignalEvent`	`IOSurfaceSharedEventListener` + `MTLSharedEvent`

Key findings:

MLModel.compileModelAtURL: produces .mlmodelc with model.mil (NOT model.espresso.net)
Loading an MLModel creates MLDelegateModel -> MLE5Engine -> MLE5ProgramLibrary -> MLE5ProgramLibraryOnDeviceAOTCompilationImpl
No _ANEModel exists anywhere in the E5 object graph
_ANEClient.loadModel: / compileModel: both require model.espresso.net which isn't generated
Prediction succeeds (model runs on ANE), confirming E5 runtime works independently of _ANEModel

Internal E5 class hierarchy:

MLDelegateModel
  └── _internalEngine: MLE5Engine
        ├── _programLibrary: MLE5ProgramLibrary
        │     ├── _programLibraryHandle: e5rt_program_library* (opaque C struct)
        │     ├── _impl: MLE5ProgramLibraryOnDeviceAOTCompilationImpl
        │     │     ├── _milTextURL: NSURL
        │     │     ├── _irProgram: shared_ptr<MIL::IRProgram> (C++)
        │     │     └── _container: MLProgramE5Container
        │     └── _container: MLProgramE5Container
        │           ├── _modelAssetDescription
        │           ├── _compilerVersionInfo
        │           └── _functionInfoArray
        └── _operationPool: MLE5StaticShapeExecutionStreamOperationPool
              └── _pool: NSMutableSet of MLE5ExecutionStreamOperation
                    ├── _operationHandle: e5rt_execution_stream_operation* (opaque)
                    ├── _programLibrary: MLE5ProgramLibrary
                    ├── _inputPorts / _outputPorts: NSArray
                    ├── _waitEventListener: IOSurfaceSharedEventListener
                    └── _completionSharedEventBoundToESOP: MTLSharedEvent

Experiment R: Chaining with CoreML model -- BLOCKED

No _ANEModel extracted from E5 runtime, so prepareChainingWithModel: cannot be tested with a CoreML-compiled model. The E5 runtime is a completely separate execution path.

Experiment S: Two-Kernel Chaining -- BLOCKED

Blocked by Experiment R. The _ANEChainingRequest API appears to be from the older Espresso-based runtime and may not be usable with models compiled through the E5/MIL path.

Experiments T-V Results (2026-03-04)

Experiment T: E5 Runtime Symbol Scan

Found 4 exported C functions from the e5rt_* API:

e5rt_program_library_create -- creates program library handle
e5rt_execution_stream_create -- creates execution stream handle
e5rt_async_event_create -- creates async event for synchronization
e5rt_async_event_signal -- signals an async event

Key ObjC classes in the E5 runtime:

MLE5ExecutionStreamOperation (63 instance methods) -- holds e5rt_execution_stream_operation*, manages input/output ports
MLE5ExecutionStream (29 instance methods) -- holds e5rt_execution_stream*, executes operations array
MLE5ExecutionStreamPool -- manages streams via takeOut / putBack:
MLE5InputPort / MLE5OutputPort -- hold e5rt_io_port*, bind features to ports
MLE5InputPortBinder / MLE5OutputPortBinder -- handle memory binding for ports
MLE5ProgramLibrary -- holds e5rt_program_library*

Critical method: MLE5ExecutionStream._executeStream:error: takes e5rt_execution_stream* and executes all operations in the operations array in sequence.

Experiment U: E5 Multi-Op Stream -- MAJOR BREAKTHROUGH

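Executed multiple ANE operations in a single E5 stream, with an apparent speedup of up to 4.87x over sequential CoreML. Note: Experiment W1 later showed that these operations produced all-zero output, so the speedup figures in this section are invalid.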

Method:

Load multiple CoreML models (.mlpackage -> MLModel)
Extract MLE5ProgramLibrary from each model's MLE5Engine
Create MLE5ExecutionStreamOperation for each, backed by each program library
Preload operations (preloadAndReturnError:) to compile ANE programs
Borrow an MLE5ExecutionStream from the stream pool
Set multiple operations on the stream via setOperations:
Prepare each operation's input features via prepareForInputFeatures:options:error:
Execute all operations in one call via _executeStream:error:

Benchmark Results (M4 Max, macOS 15, N=500)

Kernels	CoreML Sequential	E5 Multi-Op Stream	Speedup
1 (256ch)	0.0359 ms	0.0272 ms	1.32x
2 (256+512ch)	0.0623 ms	0.0406 ms	1.53x
3 (256+512+1024ch)	0.1599 ms	0.0578 ms	2.77x
4 (256+512+1024+2048ch)	0.3781 ms	0.0776 ms	4.87x

Key observations:

E5 stream per-kernel overhead is remarkably consistent: ~0.02 ms/kernel regardless of count
CoreML sequential overhead grows non-linearly (0.036 -> 0.095 ms/kernel with 4 kernels)
The speedup increases with more kernels: the dispatch overhead is amortized
All operations execute on ANE with a single _executeStream: call

Code path for E5 multi-op stream:

// 1. Extract internals from CoreML-loaded model
id e5engine = [mlModel valueForKey:@"_internalEngine"];  // MLE5Engine
id progLib  = [e5engine valueForKey:@"programLibrary"];   // MLE5ProgramLibrary
id pool     = [e5engine valueForKey:@"streamPool"];       // MLE5ExecutionStreamPool

// 2. Create operation from program library
id op = [[MLE5ExecutionStreamOperation alloc]
    initWithProgramLibrary:progLib functionName:@"main"
    modelDescription:desc configuration:cfg
    debugLabel:@"myOp" modelSignpostId:0];
[op preloadAndReturnError:nil];

// 3. Get stream and set operations
id stream = [pool takeOut];
void *sh = stream._streamHandle;  // e5rt_execution_stream*
NSArray *operations = @[op1, op2, op3];  // each created as in step 2
[stream setOperations:operations];

// 4. Prepare and execute
for (id op in operations)
    [op prepareForInputFeatures:features options:predOpts error:nil];
[stream _executeStream:sh error:nil];

Revised Assessment (after T-V)

~~The E5 runtime (MLE5ExecutionStream + MLE5ExecutionStreamOperation) is the correct path for multi-kernel pipelining on macOS 15+.~~ CORRECTED in Experiment W1 (see below).

Experiments W1-W5: Validation & Deep API Documentation (2026-03-04)

W1: Output Correctness Validation

CRITICAL CORRECTION: The previously reported "4.87x speedup" from multi-op streams was invalid. Validation revealed:

MLE5Engine.predictionFromFeatures:options:error: produces EXACT (bit-identical) output to MLModel.predictionFromFeatures:error: for all tested sizes (256, 512, 1024, 2048 channels). This confirms the E5 engine is the correct computation path.
Our manually-created MLE5ExecutionStreamOperation objects via initWithProgramLibrary: do not produce correct output -- they return all zeros. The _executeStream: call returns YES but no actual ANE compute occurs. The operation handles are 0x0 (not compiled), meaning our manually-created ops were never wired to actual ANE programs.
The "speedup" was measuring the overhead of a no-op function returning immediately vs CoreML doing actual computation.
MLE5StaticShapeExecutionStreamOperationPool.takeOutOperationForFeatures:error: returns pool-managed operations with valid handles, but using them with _executeStream: still produces zeros -- the output port bindings are not correctly populated.
Stream reuse via _predictionFromFeatures:stream:options:error: fails with "E5RT: Port bindings cannot be changed while operation is in use in an execution stream" -- streams are locked after first use and cannot be reconfigured.
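
The failure mode found in W1 (a call that returns YES while the output buffer stays zeroed) is cheap to detect before any timing is trusted. The following is a minimal sketch in Python, not code from the test files; `max_abs_diff`, `validate_output`, the sample values, and the tolerance are all hypothetical.

```python
def max_abs_diff(candidate, reference):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(candidate) != len(reference):
        raise ValueError("length mismatch")
    return max(abs(c - r) for c, r in zip(candidate, reference))


def validate_output(candidate, reference, tol=1e-2):
    """Return (ok, reason). Rejects an all-zero output before comparing values."""
    if any(r != 0.0 for r in reference) and all(c == 0.0 for c in candidate):
        return False, "all zeros: the kernel probably never ran"
    diff = max_abs_diff(candidate, reference)
    if diff > tol:
        return False, f"max abs diff {diff:.6f} exceeds {tol}"
    return True, "ok"


reference = [0.0, 0.5, 1.25, 2.0]      # hypothetical CPU reference output
good = [0.0, 0.5002, 1.2497, 2.0001]   # hypothetical fp16-rounded accelerator output
zeros = [0.0] * 4                      # what the manually built stream returned

print(validate_output(good, reference))
print(validate_output(zeros, reference))
```

Running a check like this ahead of the benchmark loop would have flagged the no-op path before the speedup was recorded.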

W1 Performance Profile

Path	256ch (ms)	2048ch (ms)
CoreML API (`predictionFromFeatures:error:`)	0.035	0.217
Engine direct (`predictionFromFeatures:options:error:`)	0.074	0.284
Engine private (`_predictionFromFeatures:options:error:`)	0.100	0.332
Stream pool cycle (takeOut + putBack)	0.008	0.008
Op pool cycle	<0.001	<0.001

Key finding: CoreML API is FASTER than calling the engine directly. MLDelegateModel implements internal caching (likely keeping a hot stream + operation) that avoids the per-call pool acquire/release overhead. The engine's predictionFromFeatures: method performs pool management on every call.
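
To put the gap in absolute terms, the sketch below recomputes extra latency and slowdown from the table above. The numbers are copied from the W1 profile; the dictionary layout and the `overhead_vs_coreml` helper are illustrative only.

```python
# Latencies in ms, copied from the W1 performance profile table.
latency_ms = {
    "coreml_api":     {"256ch": 0.035, "2048ch": 0.217},
    "engine_direct":  {"256ch": 0.074, "2048ch": 0.284},
    "engine_private": {"256ch": 0.100, "2048ch": 0.332},
}
stream_pool_cycle_ms = 0.008


def overhead_vs_coreml(path, size):
    """Extra latency (ms) and slowdown factor of `path` relative to the CoreML API."""
    base = latency_ms["coreml_api"][size]
    t = latency_ms[path][size]
    return t - base, t / base


for path in ("engine_direct", "engine_private"):
    for size in ("256ch", "2048ch"):
        extra, factor = overhead_vs_coreml(path, size)
        print(f"{path:15s} {size:7s} +{extra:.3f} ms  {factor:.2f}x")
```

On these figures the direct engine call adds roughly 0.04 to 0.07 ms per prediction, which is several times the 0.008 ms stream pool cycle, so pool acquire/release alone may not account for the whole difference.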

W2: Exhaustive E5 Runtime API

Full class dumps captured for all E5 runtime classes. Key classes and their roles:

MLE5Engine (49 instance methods, 10 ivars)

Superclass: MLModelEngine
Entry point: predictionFromFeatures:options:error: (public), _predictionFromFeatures:stream:options:error: (internal)
Key properties: streamPool (MLE5ExecutionStreamPool), operationPool (), programLibrary (MLE5ProgramLibrary)
Manages: stream acquisition, operation preparation, input conforming, output post-processing

MLE5ProgramLibrary (17 instance methods, 5 ivars)

Holds _programLibraryHandle (C struct e5rt_program_library*)
Key method: createOperationForFunctionName:forceRespecialization:hasRangeShapeInputs:error: -- returns C-level e5rt_execution_stream_operation*
Contains: compiled MIL program, model configuration, implementation object

MLE5ExecutionStreamOperation (63 instance methods, ~20 ivars)

Holds _operationHandle (C struct e5rt_execution_stream_operation*)
States: 0=created, transitions through prepare/execute
Key methods: prepareForInputFeatures:options:error:, preloadAndReturnError:, outputFeatures
Has input/output/state ports (MLE5InputPort, MLE5OutputPort)
Internal binding: _bindInputFeaturesAndWaitEvents:options:error:, _bindOutputPortsWithOptions:error:
Port binding modes: directlyBoundFeatureValue (zero-copy) vs copyFeatureValue (memcpy)

MLE5ExecutionStream (21 instance methods, 5 ivars)

Holds _streamHandle (C struct e5rt_execution_stream*)
Key methods: _executeStream:error:, executeForInputFeatures:options:error:, submitWithCompletionHandler:
Operations set via setOperations: (NSArray of MLE5ExecutionStreamOperation)
Reset via _cleanUpStream: on engine

MLE5ExecutionStreamPool (11 instance methods)

Pool pattern: takeOut / putBack:
Creates streams on demand with e5rt_execution_stream_create
Tracks all streams via allStreams

MLE5StaticShapeExecutionStreamOperationPool (17 instance methods)

Pool for operations with fixed input shapes
Key method: takeOutOperationForFeatures:error: -- matches feature shape to pooled operation

MLE5InputPort / MLE5OutputPort

Wraps e5rt_io_port* handles
Each has a binder (MLE5InputPortBinder / MLE5OutputPortBinder)
Input binder has bindingMode (char): controls copy vs direct binding
Output binder has outputBacking and featureValue for result retrieval

MLE5InputPortBinder (16 instance methods, 6 ivars)

bindingMode (char): 0=copy, 1=direct
bindMemoryObjectForFeatureValue:error: -- zero-copy IOSurface binding
copyFeatureValue:error: -- memcpy binding

MLE5OutputPortBinder (27 instance methods, 9 ivars)

outputBacking -- output buffer
boundFeatureDirectly (BOOL) -- tracks binding mode
_makeFeatureValueFromPort:featureDescription:error: -- read ANE output

MLProgramE5Container (11 instance methods, 6 ivars)

Container for compiled model assets
URLOfMILText -- path to MIL source
compilerOutput -- MLCompilerNeuralNetworkOutput
findPrecompiledE5BundleAndReturnError: -- looks for pre-compiled E5 bundle

e5rt_* C API (found via dlsym):

e5rt_program_library_create -- creates program library from MIL
e5rt_execution_stream_create -- creates execution stream
e5rt_async_event_create -- creates async event for synchronization
e5rt_async_event_signal -- signals async event

W4: Async Stream Submission

submitWithCompletionHandler: FAILED with: "Failed to add operation to E5 stream. E5RT: Reset stream to add more operations to stream. (2)". The stream must be in a specific state (reset) before async submission is possible. The stream state becomes locked after _executeStream: or executeForInputFeatures:.

W5: Port-Based Data Flow

Each operation has inputPorts (array of MLE5InputPort) and outputPorts (array of MLE5OutputPort)
Input binding mode 1 = direct binding (zero-copy from MLMultiArray)
Output outputBacking is nil after manual execution -- bindings are not populated by our manual path
Port handles are e5rt_io_port* C structs -- connecting ports across operations would require knowing the C API for port linking

Revised Assessment (after W1-W5)

CoreML API is already near-optimal for single-model inference. The MLDelegateModel wrapper is faster than calling engine methods directly due to internal stream/operation caching.
Manual _executeStream: with custom operations is invalid -- it produces zero output. The operations must be created through the engine's internal pipeline (via _predictionFromFeatures:stream:options:error:) which handles binding correctly.
The opportunity for speedup lies in:
- Eliminating ObjC overhead via direct e5rt_* C API calls
- Batching multiple models into a single stream (requires understanding e5rt_execution_stream_operation lifecycle)
- Direct MIL compilation to e5rt_program_library without going through CoreML

Experiment X1: Custom MIL -> ANE Execution (BREAKTHROUGH)

Pipeline discovered: Write MIL text file -> MLE5ProgramLibraryOnDeviceAOTCompilationImpl -> MLE5ProgramLibrary -> MLE5Engine -> predictionFromFeatures:

// 1. Write MIL text to file
NSString *mil = @"program(1.3)\n{\n    func main<ios18>(...) { ... } -> (cast_out);\n}\n";
[mil writeToFile:@"/tmp/custom.mil" ...];

// 2. Compile MIL to E5 program library
id aotImpl = [[MLE5ProgramLibraryOnDeviceAOTCompilationImpl alloc]
    initWithMILTextAtURL:milURL container:refContainer configuration:cfg];
void *plHandle = [aotImpl createProgramLibraryHandleWithRespecialization:NO error:&err];

// 3. Create program library + engine
id progLib = [[MLE5ProgramLibrary alloc] initWithImpl:aotImpl container:refContainer configuration:cfg];
id engine = [[MLE5Engine alloc] initWithProgramLibrary:progLib modelDescription:desc ...];
[engine prepareWithConcurrencyHint:1 error:nil];

// 4. Execute
id result = [engine predictionFromFeatures:fp options:opts error:&err];

Requirements:

MIL input/output variable names must match the model description (e.g., x for input, cast_out for output)
MIL shapes must match the model description shapes
A "container" (MLProgramE5Container) is borrowed from a pre-compiled CoreML model (needed for compilation context)
Input/output types should be fp32 with internal fp16 compute (cast in/out) for ANE compatibility

Verified kernels (all produce EXACT correct output on ANE):

Kernel	MIL Op	Verification
ReLU	`relu(x=x16)`	Max diff = 0.000000, 0/16384 wrong
GELU	`gelu(x=x16, mode="TANH_APPROXIMATION")`	Verified against reference
Elementwise (x*2+1)	`mul` + `add` with scalar constants	Verified against reference
Softmax	`softmax(x=x16, axis=-1)`	Sum = 1.000000
Layer Norm	`layer_norm(x=x16, axes=[3], epsilon=1e-5)`	Mean = 0.000000, Var = 0.999975
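
The softmax and layer norm checks in the table (row sum, mean, variance) can be reproduced on the CPU with a few lines. This is a reference sketch in plain Python under the usual definitions of the two ops; it is not the verification code used in the experiment, and the input row is made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    """Layer norm without gamma/beta: (x - mean) / sqrt(var + eps)."""
    n = len(row)
    mean = sum(row) / n
    var = sum((v - mean) ** 2 for v in row) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in row]


def mean_var(row):
    """Population mean and variance of one row."""
    n = len(row)
    mean = sum(row) / n
    return mean, sum((v - mean) ** 2 for v in row) / n


row = [0.1 * i for i in range(8)]   # hypothetical input row
_, var_in = mean_var(row)
out_mean, out_var = mean_var(layer_norm(row))

print("softmax sum:", sum(softmax(row)))
print("layer_norm mean:", out_mean)
print("layer_norm var:", out_var, "expected:", var_in / (var_in + 1e-5))
```

Because epsilon sits inside the square root, the output variance is `var / (var + eps)`, which is always slightly below 1. That is one possible contributor to a reading such as 0.999975; fp16 rounding is another.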

Significance: This allows compiling arbitrary MIL programs (any operation supported by Apple's MIL spec) to run on the ANE, without going through CoreML's .mlpackage pipeline. This is the foundation for custom training/inference kernels.

Experiment Y1: Fused SDPA on ANE (PASSED)

Operation: scaled_dot_product_attention(query=Q, key=K, value=V) -- single fused op for entire attention computation.

Config: B=1, nHeads=1, seqLen=256, headDim=64 (self-attention: Q=K=V=reshape(input))

Metric	Value
Max abs diff (vs CPU)	0.000021
Relative error	1.40e-03
Latency (first call)	2.454 ms
Benchmark	0.1708 ms/eval
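
For reference, the computation that the fused op replaces is `softmax(Q @ K.T / sqrt(d)) @ V`. The sketch below is a plain-Python CPU version for a single head, with no mask; the helper names are hypothetical and this is not the reference implementation used to produce the table.

```python
import math


def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise numerically stable softmax."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def sdpa(q, k, v):
    """Single-head scaled dot-product attention, no mask."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


# A zero query gives uniform attention weights, so the output is the mean of V's rows.
q = [[0.0, 0.0]]
k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(sdpa(q, k, v))
```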

Experiment Y2: Linear with Embedded Weights (PASSED)

Operation: linear(x=flat, weight=Wc, bias=Bc) where Wc and Bc are compile-time const tensors embedded in the MIL program.

Config: input [256, 64], linear 64->64 with embedded weight matrix and bias vector.

Metric	Value
Max abs diff (vs CPU)	0.001106
Relative error	1.05e-02
Benchmark	0.0610 ms/eval

Significance: Confirms that compile-time weight constants work in MIL text format. This is the foundation for transformer inference (where weights are frozen).
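
The layout convention matters when exporting weights into const tensors: `linear` computes `x @ W.T + b`, with the weight stored as `[D_out, D_in]`. A small CPU reference, written here as an illustrative Python sketch with made-up values, makes the indexing explicit.

```python
def linear(x, weight, bias=None):
    """Reference for MIL linear: x is [N, D_in], weight is [D_out, D_in], bias is [D_out]."""
    d_out = len(weight)
    if bias is None:
        bias = [0.0] * d_out
    return [[sum(xi * wi for xi, wi in zip(row, weight[o])) + bias[o]
             for o in range(d_out)] for row in x]


x = [[1.0, 2.0]]                              # N=1, D_in=2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # D_out=3, D_in=2
bias = [0.5, 0.5, 0.5]

print(linear(x, weight, bias))
```

Each output column is the dot product of the input row with one row of the weight, so a weight matrix stored as `[D_in, D_out]` would need transposing before being embedded.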

Experiment Y3: Complete Transformer Block on ANE (PASSED)

Pipeline: LayerNorm -> SDPA (self-attention) -> Residual Add -> LayerNorm -> FFN (linear+GELU+linear) -> Residual Add

All in a single MIL program, compiled and executed as one ANE operation.

Config: seqLen=256, dim=64, ffnDim=128, 1-head attention, embedded FFN weights.

Metric	Value
Output mean abs	1.017404 (non-zero, correct)
Benchmark	0.2091 ms/eval

Significance: A full transformer layer runs on ANE in ~0.2ms. This proves that complex multi-op pipelines can be compiled as single MIL programs with no CPU round-trips between ops. The ANE compiler fuses the entire graph.

Experiment Z1: Backward Pass (Gradient Computation) on ANE (PASSED)

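Operations: matmul(x=dY, y=W) for dX (input gradient), matmul(x=dY, y=dY, transpose_x=true) for dW (weight gradient). Both use runtime tensors (not const), proving backward-pass operations work on ANE. Note that the dW test multiplies dY^T by dY itself, whereas the weight gradient of a linear layer is dY^T @ X (see section 9); the op and the shapes are the same in this configuration.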

Also tests: slice_by_index for tensor slicing, concat for packing results.

Config: dY [128,64] @ W [64,64] -> dX [128,64]; dY^T [64,128] @ dY [128,64] -> dW [64,64]

Metric	dX	dW
Max abs diff	0.001940	0.012828
Relative error	1.02e-02	3.92e-02
Benchmark	0.0593 ms/eval (both combined)

Significance: This is the first demonstration in this research of the ANE executing gradient computation operations. The matmul with transpose_x=true works correctly, producing valid weight gradients. Combined with Y3's forward pass, this establishes the complete pipeline for manual ANE training:

Forward pass: Y3-style MIL (0.2 ms)
Backward pass: Z1-style MIL (0.06 ms)
Weight update: CPU (trivial)
Recompile: (~10-50 ms, dominates training time)
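
A quick budget calculation shows how strongly the recompile dominates. The sketch below uses the rounded timings from the list above; `step_budget` is a hypothetical helper, and the Z1 backward figure covers only a single matmul pair, so a full backward pass would cost more.

```python
def step_budget(forward_ms, backward_ms, recompile_ms, update_ms=0.0):
    """Return (total step time in ms, fraction of the step spent recompiling)."""
    total = forward_ms + backward_ms + update_ms + recompile_ms
    return total, recompile_ms / total


for recompile_ms in (10.0, 50.0):
    total, frac = step_budget(0.2, 0.06, recompile_ms)
    print(f"recompile {recompile_ms:4.0f} ms -> step {total:.2f} ms, "
          f"{frac * 100:.1f}% spent recompiling")
```

Even at the low end of the recompile range, the ANE compute itself is under 3% of the step.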

MIL Text Syntax Lessons Learned

Key syntax rules discovered during Y/Z experiments:

epsilon in layer_norm: Must be same dtype as gamma/beta. Use fp16 eps = const()[..., val = fp16(1e-5)] when gamma is fp16.
Boolean params: Use bool tx = const()[..., val = bool(true)] for params like transpose_x.
concat axis: Must be int32 scalar, not tensor<int32, [1]>. Use int32 ax = const()[..., val = int32(0)].
concat interleave: Required param, use bool il = const()[..., val = bool(false)].
MLE5Engine init: Correct selector is initWithProgramLibrary:modelDescription:configuration:functionName:classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo: (7 args).
Container path: On macOS 15+, models may use Espresso backend. Create MLProgramE5Container via initWithModelAssetPath:configuration: using the .mlmodelc path.
Sandbox: E5RT needs write access to ~/Library/Caches/ for model specialization cache.

Next Steps

[HIGH] Multi-head attention -- test SDPA with multiple heads (reshape to [B, nHeads, seqLen, headDim])
[HIGH] Real Qwen2.5 layer weights -- load actual model weights into MIL const tensors
[HIGH] Full backward pass -- implement complete transformer backward pass (attention + FFN gradients)
[MEDIUM] Training loop -- forward + backward + weight update + recompile cycle
[MEDIUM] Explore e5rt_* C API directly -- bypass ObjC wrappers for lower overhead
[LOW] Runtime weight injection -- investigate if weights can be updated without recompilation

Phase 7: OutputSets with stats IOSurface -- BREAKTHROUGH

  statsSurRef size=64 bytes:
    objectWithstatsSurRef: _ANEIOSurfaceOutputSets: { statsSurRef=<IOSurface: 0x...>
    id = 0x... width = 64 height = 1 pixelFormat = 0
    name = test_chaining_v2 ; outputBuffer=(
      "_ANEBuffer: { ... symbolIndex=0 ; ANEBufferProducerAgent=1}"
    )}

    Attempting ChainingRequest with valid outputSet...
    ChainingRequest created | validate: YES     <-- FIRST TIME VALIDATE PASSES!
    prepareChainingWithModel EXCEPTION:
      -[_ANEInMemoryModel getUUID]: unrecognized selector

Phase 8: Disk-based _ANEModel

  _ANEModel class found (12 class methods, 52 instance methods, 17 properties)
  Has: getUUID, inputSymbolIndicesForProcedureIndex:,
       outputSymbolIndicesForProcedureIndex:, mapper, program
  Factory: +modelAtURL:key:, +modelAtURL:key:modelAttributes:, etc.

  tmpDir contents: (weights, model.mil, net.plist, data)
  +modelAtURL: NOT available (needs key: parameter)
  -> _ANEModel could not be loaded (need correct factory + key)

Phase 9: processRequest via ProgramForEvaluation

  k1.model.program: _ANEProgramForEvaluation: { programHandle=1319967543575
    intermediateBufferHandle=0 queueDepth=127 }
  processRequest single call: YES (rv=NO)
  processRequest: 0.131 ms/eval (50 iters)
  vs RT eval: 1.45x (slower than RT but faster than standard)

Phase 10: Shared Events

  _ANESharedEvents: found (+sharedEventsWithSignalEvents:waitEvents:)
  _ANESharedSignalEvent: found
    +signalEventWithValue:symbolIndex:eventType:sharedEvent:
    Properties: sharedEvent (IOSurfaceSharedEvent), value, symbolIndex, agentMask, eventType
    alloc/init: nil (needs sharedEvent parameter)
  _ANESharedWaitEvent: found
    +waitEventWithValue:sharedEvent:
    alloc/init: nil (needs sharedEvent parameter)
  -> Both require IOSurfaceSharedEvent objects, not available from bare init

6. Architecture: Chaining Data Flow

Current (sequential):
  CPU -> IOSurface -> ANE eval layer 1 -> IOSurface -> CPU memcpy
  CPU -> IOSurface -> ANE eval layer 2 -> IOSurface -> CPU memcpy
  ... (23 round-trips for 12-layer model)

Target (chained):
  CPU -> IOSurface -> ANE eval layer 1 -> [on-chip] -> ANE eval layer 2
                   -> [on-chip] -> ... -> IOSurface -> CPU
  (1 round-trip for entire model)

Current best (sequential with standard path):
  At production dims (768x256), all paths are ~0.2ms/kernel.
  RT path only helps for small kernels (64x32: 1.88x speedup).
  For 24 evals/token at ~0.2ms each: ~4.8ms total ANE time per token.
  Chaining target: 1 round-trip instead of 24, saving ~23 x overhead per trip.
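
The per-token arithmetic behind these figures can be written out as follows. The eval count and per-eval latency come from the text above; the per-round-trip overhead is not measured in this document, so the 0.05 ms value below is purely a placeholder.

```python
def sequential_ane_time_ms(evals_per_token, ms_per_eval):
    """Total ANE time per token when every eval is dispatched separately."""
    return evals_per_token * ms_per_eval


def chaining_savings_ms(evals_per_token, overhead_ms_per_trip):
    """Time saved if all evals share one round-trip instead of one each."""
    return (evals_per_token - 1) * overhead_ms_per_trip


total = sequential_ane_time_ms(24, 0.2)
saved = chaining_savings_ms(24, 0.05)   # 0.05 ms is a hypothetical overhead
print(f"sequential: {total:.1f} ms/token, "
      f"hypothetical saving: {saved:.2f} ms/token")
```

The real saving depends on how much of the ~0.2 ms per kernel is dispatch overhead rather than compute, which remains to be measured.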

7. Class Hierarchy (inferred)

NSObject
├── _ANEClient (singleton, daemon connection)
├── _ANEInMemoryModelDescriptor (MIL + weights spec)
├── _ANEInMemoryModel (compile/load/run -- in-memory MIL path)
│   └── .program -> _ANEProgramForEvaluation
├── _ANEModel (disk-based compiled model -- 52 methods, has getUUID)
│   └── .program -> _ANEProgramForEvaluation
│   └── .mapper -> _ANEProgramIOSurfacesMapper
├── _ANERequest (I/O surface packaging)
├── _ANEIOSurfaceObject (thin IOSurface wrapper)
├── _ANEBuffer (IOSurfaceObject + symbolIndex + source)
├── _ANEChainingRequest (multi-op pipeline)
├── _ANEIOSurfaceOutputSets (output packaging for chaining)
├── _ANEInputBuffersReady (input signaling for chaining)
├── _ANEOutputSetEnqueue (output enqueue config for chaining)
├── _ANEProgramIOSurfacesMapper (symbol-to-surface mapping)
├── _ANEProgramForEvaluation (lower-level eval program)
├── _ANEModelInstanceParameters (model config)
├── _ANEDeviceController (device-level control)
├── _ANEQoSMapper (QoS level mapping)
├── _ANEPerformanceStats (perf counters)
├── _ANESharedSignalEvent (hardware signal fence)
└── _ANESharedWaitEvent (hardware wait fence)

8. MIL Operations Reference (for Custom ANE Kernels)

Source: coremltools MIL Ops API Reference

The following MIL operations are available for writing custom ANE kernels via our MLE5ProgramLibraryOnDeviceAOTCompilationImpl pipeline (Experiment X1). All ops below have been confirmed available in the MIL text format used by the E5 compiler on macOS 15+.

Transformer-Critical Ops

Op	Signature	Notes
`scaled_dot_product_attention` (iOS 18+)	`(query:[B,?,L,E], key:[B,?,S,E], value:[B,?,S,EV], attn_mask?) -> [B,?,L,EV]`	Fused `softmax(Q@K.T/sqrt(d))@V`. Single op for entire attention computation.
`linear`	`(x:[D,D_in], weight:const[D_out,D_in], bias:const[D_out]?) -> [D,D_out]`	`x @ W.T + b`. Weight/bias must be compile-time constants. Rank 1-3 input.
`matmul`	`(x:[*,K1], y:[*,K2], transpose_x?, transpose_y?) -> [*,T]`	N-D batch matmul with broadcasting. Supports runtime (non-const) inputs.
`layer_norm`	`(x, axes, gamma?, beta?, epsilon?) -> same shape`	Verified working on ANE (Experiment X1).
`gelu`	`(x, mode=EXACT/TANH_APPROXIMATION/SIGMOID_APPROXIMATION) -> same shape`	Verified working on ANE (Experiment X1).
`softmax`	`(x, axis) -> same shape`	Verified working on ANE (Experiment X1).
`relu`	`(x) -> same shape`	Verified working on ANE (Experiment X1).

Data Movement Ops

Op	Signature	Notes
`gather`	`(x, indices, axis?) -> gathered`	For embedding table lookups.
`gather_along_axis`	`(x, indices, axis?) -> gathered`	Take values along axis at index locations.
`scatter`	`(data, indices, updates, axis?, mode?) -> scattered`	For KV cache writes. Mode: update/add/sub/mul/div/max/min.
`scatter_along_axis`	`(data, indices, updates, axis?, mode?) -> scattered`	Scatter updates along axis.

Elementwise / Reduction Ops

Op	Notes
`add`, `sub`, `mul`, `real_div`	Elementwise with broadcasting.
`cast`	Type conversion (fp32 <-> fp16). Required for ANE I/O (fp32 in, fp16 compute, fp32 out).
`reduce_sum`, `reduce_mean`, `reduce_max`	Reduction along axes.
`rsqrt`, `sqrt`, `exp`, `log`, `tanh`	Unary elementwise. Useful for manual norm/activation implementations.
`concat`, `split`, `reshape`, `transpose`	Shape manipulation.
`slice_by_index`, `slice_by_size`	Tensor slicing for KV cache windowing.

Key Constraints

linear weights must be const: For inference this is fine (weights don't change). For training, use matmul with runtime tensors instead.
MIL text format: Programs use program(1.3) { func main<ios18>(...) { ... } -> (output); } syntax. Constants use const()[name=..., val=...]. Weights reference blob files via BLOBFILE(path=..., offset=...).
ANE I/O convention: Input/output should be fp32; internal compute should be fp16. Use cast ops at boundaries.
Shape constraints: ANE prefers NCHW layout. Most ops work with rank-4 tensors [B, C, H, W] but linear/matmul work with lower ranks.
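
The fp32-in, fp16-compute convention puts a floor on achievable accuracy. Half precision carries an 11-bit significand, so a single rounding can introduce a relative error of up to 2^-11 (about 4.9e-4), and values below 2^-14 (about 6.1e-5), such as a 1e-5 epsilon, are stored as subnormals with even less precision. The sketch below illustrates this with Python's `struct` module; `to_fp16` and `rel_err` are hypothetical helper names.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rel_err(x):
    """Relative error introduced by one fp32/fp64 -> fp16 rounding."""
    return abs(to_fp16(x) - x) / abs(x)


for value in (0.1, 1000.3, 1e-5):
    print(f"{value!r:>8} -> {to_fp16(value)!r}  rel err {rel_err(value):.2e}")

# Values above the fp16 maximum (65504) cannot be represented at all.
try:
    to_fp16(70000.0)
except OverflowError as exc:
    print("overflow:", exc)
```

Errors of this size accumulate across a matmul, which is in line with the 1e-3 to 1e-2 relative errors reported for the Y and Z experiments.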

9. ANE Training Feasibility Analysis

Apple's Official Position

Apple's deprecated MLCompute framework (MLCDevice.ane()) explicitly states:

"This device applies to inference graphs only. It doesn't work with a training graph or inference graph that shares layers with a training graph."

This means Apple never shipped ANE-based training, even in their own training framework. The MLCTrainingGraph class supported executeForward, executeGradient, and executeOptimizerUpdate but only on CPU and GPU devices.

WWDC 2025 Confirmation

WWDC 2025 Session 360 ("Discover ML & AI frameworks") confirms:

CoreML dispatches to CPU, GPU, and Neural Engine at runtime for inference
MLX is the recommended tool for training/fine-tuning but uses Metal GPU, not ANE
No mention of ANE training APIs in any Apple framework
BNNSGraph (Accelerate) added BNNSGraphBuilder for CPU-only real-time inference

Why ANE Lacks Native Training Support

The ANE is a fixed-function inference accelerator. It likely lacks:

Hardware support for automatic differentiation / backward passes
Ability to write to weight storage during execution (weights are read-only constants in the e5rt_program_library)
Dynamic memory allocation needed for activation checkpointing

Manual ANE Training Approach

Despite the lack of native support, training on ANE is theoretically possible using our custom MIL pipeline:

Forward pass: Write MIL program with linear/matmul/layer_norm/gelu ops. Weights embedded as constants. Execute on ANE. Save activations.
Backward pass: Write separate MIL programs for each layer's gradient computation:
- Linear backward: dX = dY @ W (matmul), dW = dY.T @ X (matmul)
- ReLU backward: dX = dY * (X > 0) (elementwise)
- LayerNorm backward: Multiple reduction + elementwise ops
Optimizer step: Run on CPU (simple elementwise: W -= lr * dW)
Recompile: After weight update, recompile MIL with new weights for next forward pass
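
The linear-layer gradients and the optimizer step above can be sanity-checked on the CPU. The following plain-Python sketch assumes a bias-free linear layer `y = x @ W.T` with `W` stored as `[D_out, D_in]`; all helper names and values are hypothetical, and the finite-difference check uses the loss `sum(y * dy)`, whose gradient with respect to `W` is exactly `dy.T @ x`.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def linear_backward(x, w, dy):
    """Gradients of y = x @ w.T given the upstream gradient dy."""
    dx = matmul(dy, w)              # [N, D_out] @ [D_out, D_in] -> [N, D_in]
    dw = matmul(transpose(dy), x)   # [D_out, N] @ [N, D_in]     -> [D_out, D_in]
    return dx, dw


def sgd_step(w, dw, lr):
    """Plain SGD update: W -= lr * dW."""
    return [[wv - lr * gv for wv, gv in zip(wr, gr)] for wr, gr in zip(w, dw)]


def loss(x, w, dy):
    """Scalar loss sum(y * dy), linear in w, used only for the gradient check."""
    y = matmul(x, transpose(w))
    return sum(yv * dv for yr, dr in zip(y, dy) for yv, dv in zip(yr, dr))


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [[0.5, -1.0], [2.0, 0.25]]
dy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

dx, dw = linear_backward(x, w, dy)

# Finite-difference estimate of dL/dW[0][1]
h = 1e-3
w_pert = [row[:] for row in w]
w_pert[0][1] += h
numeric = (loss(x, w_pert, dy) - loss(x, w, dy)) / h

print("dw:", dw, "numeric dW[0][1]:", numeric)
print("updated w:", sgd_step(w, dw, lr=0.01))
```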

The key bottleneck is step 4: recompiling MIL after every weight update. The createProgramLibraryHandleWithRespecialization: call takes ~10-50ms, which would dominate training time. This makes per-step ANE training impractical unless we can find a way to update weights without recompilation (e.g., via the e5rt_* C API or runtime weight injection).
