M5 ANE Probe Results
Machine: Apple M5, macOS 26.3 (Darwin 25.3.0)
Date: 2026-03-01
ANE Family: H16 (same as M4)
test_weight_reload — FAIL
Question: Can we skip recompilation by overwriting weight blobs on disk and calling unload+load?
Result: No. Weights are baked at compile time. Overwriting weights/weight.bin in tmpDir and doing unload→load produces identical output — the ANE ignores the file change.
Kernel: 64x64 conv, spatial=32
Compile+load: 33.3ms | Unload: 0.5ms | Reload: 3.8ms
Output A (identity): [0.0100, 0.0200, 0.0300, 0.0400]
Output B (3x identity, after file overwrite + reload): [0.0100, 0.0200, 0.0300, 0.0400]
Max A-B diff: 0.000000
Implication: Cannot eliminate compilation bottleneck via file swap. Must use async recompile, raise ACCUM_STEPS, or find another path.
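To make the ACCUM_STEPS option concrete, the sketch below estimates what fraction of wall time goes to recompilation if one compile+load follows every ACCUM_STEPS steps. The 33.3 ms figure is the compile+load time measured above. The per-step cost `STEP_MS` is a placeholder assumption, not a measurement from this probe, and the function name is hypothetical.

```python
def compile_overhead_fraction(compile_ms: float, step_ms: float, accum_steps: int) -> float:
    """Fraction of wall time spent compiling when one recompile
    follows every `accum_steps` training steps."""
    if accum_steps <= 0:
        raise ValueError("accum_steps must be positive")
    total_ms = compile_ms + accum_steps * step_ms
    return compile_ms / total_ms


COMPILE_LOAD_MS = 33.3  # measured above: compile+load for the 64x64 conv
STEP_MS = 1.0           # ASSUMPTION: placeholder per-step cost, not measured here

for steps in (10, 100, 1000):
    frac = compile_overhead_fraction(COMPILE_LOAD_MS, STEP_MS, steps)
    print(f"ACCUM_STEPS={steps:5d}  compile overhead = {frac:.1%}")
```

Under these assumed numbers the overhead only falls below half once ACCUM_STEPS exceeds the compile time divided by the step time, so the real step cost should be substituted before choosing a value.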
test_perf_stats — Partial Success
Question: What hardware counters does _ANEPerformanceStats expose?
Result: The class exists with useful properties, but alloc/init returns nil. Must be created via factory methods that require internal buffers.
Available Properties
| Property | Type | Description |
|---|---|---|
| `hwExecutionTime` | uint64 | Hardware execution time in nanoseconds |
| `perfCounterData` | NSData | Raw performance counter data blob |
| `pStatsRawData` | NSData | Raw stats data |
Factory Methods
- `+statsWithHardwareExecutionNS:`: create from hw execution time
- `+statsWithRequestPerformanceBuffer:statsBufferSize:`: create from raw buffer
- `+statsWithReconstructed:hardwareExecutionNS:aneStatsRawData:`: reconstruct from components
- `+driverMaskForANEFMask:`: convert ANE feature mask to driver mask
Instance Methods
- `-performanceCounters`: returns counter object
- `-stringForPerfCounter:`: human-readable counter name
- `-emitPerfcounterSignpostsWithModelStringID:`: emit signposts for profiling
Key Finding: _ANEModel has a perfStatsMask property. Setting it on the model before eval likely enables perf stats population in the request. The _ANEPerformanceStats object passed to the request is populated by the driver, so the mask must be set first and the stats read after eval.
test_qos_sweep — All QoS Values Work
Question: Does QoS affect ANE frequency or latency?
Result: All QoS values 0-63 compile, load, and eval successfully. No measurable latency difference — ANE appears to run at fixed frequency regardless of QoS.
Kernel: 256x256 conv, spatial=64 (8.4 MFLOP per eval)
| QoS | Compile | Load | Eval(1) | Eval(avg10) | Status |
|---|---|---|---|---|---|
| 0 | 13.9ms | 15.6ms | 0.22ms | 0.11ms | OK |
| 1 | 11.6ms | 1.8ms | 0.17ms | 0.07ms | OK |
| 5 | 11.4ms | 1.7ms | 0.17ms | 0.07ms | OK |
| 10 | 12.0ms | 1.8ms | 0.18ms | 0.06ms | OK |
| 21 | 11.8ms | 1.7ms | 0.18ms | 0.08ms | OK |
| 33 | 11.5ms | 1.7ms | 0.17ms | 0.06ms | OK |
| 47 | 10.8ms | 1.7ms | 0.18ms | 0.06ms | OK |
| 63 | 11.3ms | 1.7ms | 0.17ms | 0.07ms | OK |
Notes:
- QoS 0 has elevated load time (15.6ms vs ~1.7ms) — possibly first-use initialization
- Compile time ~11ms, load ~1.7ms, eval ~0.07ms avg for the 8.4 MFLOP kernel
- Eval throughput: 8.4 MFLOP / 0.07ms ≈ 120 GFLOP/s for a single 256×256 conv
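The throughput figure in the notes can be reproduced with a short calculation. This sketch assumes the kernel is a 1×1 convolution with 256 input channels, 256 output channels and 64 spatial positions, counted at 2 FLOP per multiply-accumulate. That reading matches the 8.4M figure but is not stated explicitly in the probe output. The helper names are hypothetical.

```python
def conv1x1_flop(c_in: int, c_out: int, spatial: int) -> int:
    """FLOP count for a 1x1 conv: one multiply and one add per MAC."""
    return 2 * c_in * c_out * spatial


def gflop_per_s(flop: int, latency_ms: float) -> float:
    """Convert a FLOP count and a latency in milliseconds to GFLOP/s."""
    seconds = latency_ms * 1e-3
    return flop / seconds / 1e9


# ASSUMPTION: "256x256 conv, spatial=64" means 256 in, 256 out, 64 positions.
flop = conv1x1_flop(c_in=256, c_out=256, spatial=64)
print(f"work per eval: {flop / 1e6:.1f} MFLOP")
print(f"throughput at 0.07 ms: {gflop_per_s(flop, 0.07):.0f} GFLOP/s")
```

The same helper can be applied to the other Eval(avg10) values in the table to see how much the throughput estimate moves with a 0.01 ms change in measured latency.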
test_ane_advanced — Key Findings
weightsBuffer IOSurface — Does NOT Override
Passing a weightsBuffer IOSurface with different weights to the request does not change output. The compiled weights are still used.
Baseline (1x identity): Output[0..3] = [0.1000, 0.2000, 0.3000, 0.3999]
weightsBuffer (3x identity): Output[0..3] = [0.1000, 0.2000, 0.3000, 0.3999]
The weightsBuffer parameter likely serves a different purpose (perhaps for models that declare runtime weights vs baked constants).
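The reasoning behind "does not override" can be sketched as a comparison against the output that would be expected if the 3x weights had been applied. The two output vectors are copied from the results above. The assumption that a 3x identity weight would scale a linear conv's output by exactly 3 is ours, and the helper name is hypothetical.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


baseline = [0.1000, 0.2000, 0.3000, 0.3999]     # 1x identity, from the results above
with_buffer = [0.1000, 0.2000, 0.3000, 0.3999]  # weightsBuffer holding 3x identity

# ASSUMPTION: if the buffer were honored, the output would be 3x the baseline.
expected_if_applied = [3 * x for x in baseline]

diff_vs_baseline = max_abs_diff(with_buffer, baseline)
diff_vs_expected = max_abs_diff(with_buffer, expected_if_applied)
override_took_effect = diff_vs_expected < diff_vs_baseline

print(f"diff vs baseline: {diff_vs_baseline:.6f}")
print(f"diff vs 3x expectation: {diff_vs_expected:.6f}")
print(f"override took effect: {override_took_effect}")
```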
procedureIndex — All 0-15 Succeed
All procedure indices 0-15 return OK. Single-procedure models work with any index (they probably ignore non-zero indices). Multi-procedure models compiled from _ANEChainingRequest would use different indices for different subgraphs.
SharedEvents — Classes Exist, Need IOSurfaceSharedEvent
- `_ANESharedEvents`, `_ANESharedSignalEvent`, `_ANESharedWaitEvent` all exist
- `alloc/init` returns nil: they need `IOSurfaceSharedEvent` objects (Metal shared events)
- `_ANESharedSignalEvent` has `symbolIndex` and `agentMask`, for GPU↔ANE sync
- Signal API: `+signalEventWithValue:symbolIndex:eventType:sharedEvent:`
- Wait API: `+waitEventWithValue:sharedEvent:eventType:`
ChainingRequest — Exists with Loopback Support
_ANEChainingRequest supports chained execution:
- `inputBuffer`, `outputSets`: multiple output sets for pipeline
- `loopbackInputSymbolIndex`, `loopbackOutputSymbolIndex`: feed output back as input
- `fwEnqueueDelay`: firmware-level enqueue timing
- `memoryPoolId`: shared memory pool across chained ops
- `signalEvents`: sync with other agents
Notable _ANEClient Methods
- `evaluateRealTimeWithModel:options:request:error:`: real-time eval path
- `loadRealTimeModel:options:qos:error:`: RT model loading
- `beginRealTimeTask` / `endRealTimeTask`: RT task bracketing
- `prepareChainingWithModel:options:chainingReq:qos:error:`: set up chaining
- `enqueueSetsWithModel:outputSet:options:qos:error:`: enqueue output sets
- `buffersReadyWithModel:inputBuffers:options:qos:error:`: signal input ready
All ANE Classes Found (67 total)
Key unexplored classes: _ANEDeviceController, _ANEQoSMapper, _ANEBuffer, _ANEIOSurfaceOutputSets, _ANEProgramForEvaluation, _ANEProgramIOSurfacesMapper, _ANEModelInstanceParameters, _ANEInputBuffersReady, _ANEOutputSetEnqueue
Strategic Implications
Compilation Bottleneck (Primary)
Weight reload and weightsBuffer both fail. Weights are irrevocably baked at compile time. The only paths forward:
- Raise ACCUM_STEPS significantly (10→100+) to amortize compile cost
- Async background compilation while training continues with old weights
- Chaining API (`_ANEChainingRequest`) to pipeline multiple layers in one dispatch
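The async background compilation option can be sketched as a control-flow pattern: keep stepping with the currently loaded kernel while a worker thread compiles the next one, and swap only once the compile has finished. This is a minimal simulation, not ANE code. `compile_kernel` and `train_step` are hypothetical stand-ins that just sleep, and the sleep durations are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compile_kernel(weights_version: int) -> str:
    """Hypothetical stand-in for the ANE compile+load step."""
    time.sleep(0.033)  # illustrative: roughly the compile+load time measured earlier
    return f"kernel-v{weights_version}"


def train_step(kernel: str) -> None:
    """Hypothetical stand-in for one eval with the currently loaded kernel."""
    time.sleep(0.001)  # ASSUMPTION: placeholder step cost


def train(total_steps: int, accum_steps: int) -> list:
    """Run steps, recompiling in the background every `accum_steps` steps."""
    used = []
    version = 0
    kernel = compile_kernel(version)  # the first compile has to block
    pending = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(total_steps):
            # Swap in the new kernel only once its compile has completed.
            if pending is not None and pending.done():
                kernel = pending.result()
                pending = None
            train_step(kernel)  # may still be running on stale weights
            used.append(kernel)
            # Start the next compile; skip if one is already in flight.
            if (step + 1) % accum_steps == 0 and pending is None:
                version += 1
                pending = pool.submit(compile_kernel, version)
    return used


used = train(total_steps=30, accum_steps=10)
print(sorted(set(used)))
```

The cost of this pattern is that some steps run against weights that are one or more updates old, so the tolerated staleness is a training decision rather than something this sketch settles.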
Performance Monitoring
hwExecutionTime from _ANEPerformanceStats gives wall-clock ANE time per eval. To enable:
- Set `perfStatsMask` on the `_ANEInMemoryModel` before eval
- Pass an `_ANEPerformanceStats` to the request
- Read `hwExecutionTime` after eval
Real-Time Path
_ANEClient has a dedicated real-time evaluation path (evaluateRealTimeWithModel:) with RT load/unload. This may provide lower/more predictable latency.
Chaining (Most Promising for Utilization)
_ANEChainingRequest with loopback could allow multiple layers to execute as a single ANE program without CPU round-trips between layers. Combined with _ANEIOSurfaceOutputSets and _ANEInputBuffersReady, this could dramatically reduce idle time between kernel dispatches.