ANE/training/m5result.md

6.9 KiB
Raw Blame History

M5 ANE Probe Results

Machine: Apple M5, macOS 26.3 (Darwin 25.3.0) Date: 2026-03-01 ANE Family: H16 (same as M4)


test_weight_reload — FAIL

Question: Can we skip recompilation by overwriting weight blobs on disk and calling unload+load?

Result: No. Weights are baked at compile time. Overwriting weights/weight.bin in tmpDir and doing unload→load produces identical output — the ANE ignores the file change.

Kernel: 64x64 conv, spatial=32
Compile+load: 33.3ms | Unload: 0.5ms | Reload: 3.8ms
Output A (identity): [0.0100, 0.0200, 0.0300, 0.0400]
Output B (3x identity, after file overwrite + reload): [0.0100, 0.0200, 0.0300, 0.0400]
Max A-B diff: 0.000000

Implication: Cannot eliminate compilation bottleneck via file swap. Must use async recompile, raise ACCUM_STEPS, or find another path.


test_perf_stats — Partial Success

Question: What hardware counters does _ANEPerformanceStats expose?

Result: The class exists with useful properties, but alloc/init returns nil. Must be created via factory methods that require internal buffers.

Available Properties

Property Type Description
hwExecutionTime uint64 Hardware execution time in nanoseconds
perfCounterData NSData Raw performance counter data blob
pStatsRawData NSData Raw stats data

Factory Methods

  • +statsWithHardwareExecutionNS: — create from hw execution time
  • +statsWithRequestPerformanceBuffer:statsBufferSize: — create from raw buffer
  • +statsWithReconstructed:hardwareExecutionNS:aneStatsRawData: — reconstruct from components
  • +driverMaskForANEFMask: — convert ANE feature mask to driver mask

Instance Methods

  • -performanceCounters — returns counter object
  • -stringForPerfCounter: — human-readable counter name
  • -emitPerfcounterSignpostsWithModelStringID: — emit signposts for profiling

Key Finding: _ANEModel has perfStatsMask property. Setting this on the model before eval likely enables perf stats population in the request. The _ANEPerformanceStats object passed to request gets populated by the driver — we need to set the mask first, then read stats after eval.


test_qos_sweep — All QoS Values Work

Question: Does QoS affect ANE frequency or latency?

Result: All QoS values 0-63 compile, load, and eval successfully. No measurable latency difference — ANE appears to run at fixed frequency regardless of QoS.

Kernel: 256x256 conv, spatial=64 (8.4 MFLOPS)
 QoS    Compile       Load    Eval(1) Eval(avg10)  Status
   0     13.9ms     15.6ms     0.22ms     0.11ms  OK
   1     11.6ms      1.8ms     0.17ms     0.07ms  OK
   5     11.4ms      1.7ms     0.17ms     0.07ms  OK
  10     12.0ms      1.8ms     0.18ms     0.06ms  OK
  21     11.8ms      1.7ms     0.18ms     0.08ms  OK
  33     11.5ms      1.7ms     0.17ms     0.06ms  OK
  47     10.8ms      1.7ms     0.18ms     0.06ms  OK
  63     11.3ms      1.7ms     0.17ms     0.07ms  OK

Notes:

  • QoS 0 has elevated load time (15.6ms vs ~1.7ms) — possibly first-use initialization
  • Compile time ~11ms, load ~1.7ms, eval ~0.07ms avg for 8.4 MFLOPS kernel
  • Eval throughput: 8.4M / 0.07ms = 120 GFLOPS for a single 256×256 conv

test_ane_advanced — Key Findings

weightsBuffer IOSurface — Does NOT Override

Passing a weightsBuffer IOSurface with different weights to the request does not change output. The compiled weights are still used.

Baseline (1x identity): Output[0..3] = [0.1000, 0.2000, 0.3000, 0.3999]
weightsBuffer (3x identity): Output[0..3] = [0.1000, 0.2000, 0.3000, 0.3999]

The weightsBuffer parameter likely serves a different purpose (perhaps for models that declare runtime weights vs baked constants).

procedureIndex — All 0-15 Succeed

All procedure indices 0-15 return OK. Single-procedure models work with any index (they probably ignore non-zero indices). Multi-procedure models compiled from _ANEChainingRequest would use different indices for different subgraphs.

SharedEvents — Classes Exist, Need IOSurfaceSharedEvent

  • _ANESharedEvents, _ANESharedSignalEvent, _ANESharedWaitEvent all exist
  • alloc/init returns nil — they need IOSurfaceSharedEvent objects (Metal shared events)
  • _ANESharedSignalEvent has symbolIndex and agentMask — for GPU↔ANE sync
  • Signal API: +signalEventWithValue:symbolIndex:eventType:sharedEvent:
  • Wait API: +waitEventWithValue:sharedEvent:eventType:

ChainingRequest — Exists with Loopback Support

_ANEChainingRequest supports chained execution:

  • inputBuffer, outputSets — multiple output sets for pipeline
  • loopbackInputSymbolIndex, loopbackOutputSymbolIndex — feed output back as input
  • fwEnqueueDelay — firmware-level enqueue timing
  • memoryPoolId — shared memory pool across chained ops
  • signalEvents — sync with other agents

Notable _ANEClient Methods

  • evaluateRealTimeWithModel:options:request:error: — real-time eval path
  • loadRealTimeModel:options:qos:error: — RT model loading
  • beginRealTimeTask / endRealTimeTask — RT task bracketing
  • prepareChainingWithModel:options:chainingReq:qos:error: — set up chaining
  • enqueueSetsWithModel:outputSet:options:qos:error: — enqueue output sets
  • buffersReadyWithModel:inputBuffers:options:qos:error: — signal input ready

All ANE Classes Found (67 total)

Key unexplored classes: _ANEDeviceController, _ANEQoSMapper, _ANEBuffer, _ANEIOSurfaceOutputSets, _ANEProgramForEvaluation, _ANEProgramIOSurfacesMapper, _ANEModelInstanceParameters, _ANEInputBuffersReady, _ANEOutputSetEnqueue


Strategic Implications

Compilation Bottleneck (Primary)

Weight reload and weightsBuffer both fail. Weights are irrevocably baked at compile time. The only paths forward:

  1. Raise ACCUM_STEPS significantly (10→100+) to amortize compile cost
  2. Async background compilation while training continues with old weights
  3. Chaining API (_ANEChainingRequest) to pipeline multiple layers in one dispatch

Performance Monitoring

hwExecutionTime from _ANEPerformanceStats gives wall-clock ANE time per eval. To enable:

  1. Set perfStatsMask on the _ANEInMemoryModel before eval
  2. Pass an _ANEPerformanceStats to the request
  3. Read hwExecutionTime after eval

Real-Time Path

_ANEClient has a dedicated real-time evaluation path (evaluateRealTimeWithModel:) with RT load/unload. This may provide lower/more predictable latency.

Chaining (Most Promising for Utilization)

_ANEChainingRequest with loopback could allow multiple layers to execute as a single ANE program without CPU round-trips between layers. Combined with _ANEIOSurfaceOutputSets and _ANEInputBuffersReady, this could dramatically reduce idle time between kernel dispatches.