Merge pull request #2 from m0at/m5-maximized

ANE probe tests + training telemetry for M5 optimization
Manjeet Singh 2026-03-02 14:57:12 +05:30 committed by GitHub
commit 893f58e725
7 changed files with 1075 additions and 2 deletions


@ -11,10 +11,26 @@ train: train.m ane_runtime.h ane_mil_gen.h model.h forward.h backward.h
 train_large: train_large.m $(HEADERS_LARGE)
 	$(CC) $(CFLAGS) -o $@ train_large.m $(LDFLAGS) -framework Accelerate
+PROBES = test_weight_reload test_perf_stats test_qos_sweep test_ane_advanced
+test_weight_reload: test_weight_reload.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+test_perf_stats: test_perf_stats.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+test_qos_sweep: test_qos_sweep.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+test_ane_advanced: test_ane_advanced.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+probes: $(PROBES)
 tokenize:
 	python3 tokenize.py
 clean:
-	rm -f train train_large
+	rm -f train train_large $(PROBES)
-.PHONY: clean tokenize
+.PHONY: clean tokenize probes

training/m5result.md Normal file

@ -0,0 +1,146 @@
# M5 ANE Probe Results
**Machine**: Apple M5, macOS 26.3 (Darwin 25.3.0)
**Date**: 2026-03-01
**ANE Family**: H16 (same as M4)
---
## test_weight_reload — FAIL
**Question**: Can we skip recompilation by overwriting weight blobs on disk and calling unload+load?
**Result**: **No.** Weights are baked at compile time. Overwriting `weights/weight.bin` in tmpDir and doing unload→load produces identical output — the ANE ignores the file change.
```
Kernel: 64x64 conv, spatial=32
Compile+load: 33.3ms | Unload: 0.5ms | Reload: 3.8ms
Output A (identity): [0.0100, 0.0200, 0.0300, 0.0400]
Output B (3x identity, after file overwrite + reload): [0.0100, 0.0200, 0.0300, 0.0400]
Max A-B diff: 0.000000
```
**Implication**: Cannot eliminate compilation bottleneck via file swap. Must use async recompile, raise ACCUM_STEPS, or find another path.
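
A minimal C sketch of the comparison behind the `Max A-B diff` line, not the code from `test_weight_reload.m`: it takes the largest element-wise difference between the two output buffers and treats anything within a tolerance as "unchanged". The names `max_abs_diff` and `reload_took_effect`, and the tolerance parameter, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Largest element-wise absolute difference between two output buffers. */
float max_abs_diff(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;        /* absolute value without <math.h> */
        if (d > worst) worst = d;
    }
    return worst;
}

/* Hypothetical helper: 1 if the output moved by more than tol, else 0. */
int reload_took_effect(const float *before, const float *after,
                       size_t n, float tol) {
    return max_abs_diff(before, after, n) > tol;
}
```

If the 3x-identity weights had been picked up, output B would be roughly three times output A and the helper would return 1. The identical outputs logged above give a difference of zero.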
---
## test_perf_stats — Partial Success
**Question**: What hardware counters does `_ANEPerformanceStats` expose?
**Result**: The class exists with useful properties, but `alloc/init` returns `nil`. Must be created via factory methods that require internal buffers.
### Available Properties
| Property | Type | Description |
|----------|------|-------------|
| `hwExecutionTime` | uint64 | Hardware execution time in nanoseconds |
| `perfCounterData` | NSData | Raw performance counter data blob |
| `pStatsRawData` | NSData | Raw stats data |
### Factory Methods
- `+statsWithHardwareExecutionNS:` — create from hw execution time
- `+statsWithRequestPerformanceBuffer:statsBufferSize:` — create from raw buffer
- `+statsWithReconstructed:hardwareExecutionNS:aneStatsRawData:` — reconstruct from components
- `+driverMaskForANEFMask:` — convert ANE feature mask to driver mask
### Instance Methods
- `-performanceCounters` — returns counter object
- `-stringForPerfCounter:` — human-readable counter name
- `-emitPerfcounterSignpostsWithModelStringID:` — emit signposts for profiling
**Key Finding**: `_ANEModel` has `perfStatsMask` property. Setting this on the model before eval likely enables perf stats population in the request. The `_ANEPerformanceStats` object passed to request gets populated *by the driver* — we need to set the mask first, then read stats after eval.
---
## test_qos_sweep — All QoS Values Work
**Question**: Does QoS affect ANE frequency or latency?
**Result**: Every sampled QoS value in the 0-63 range (17 values were swept; a subset is shown below) compiles, loads, and evals successfully. **No measurable latency difference**: the ANE appears to run at a fixed frequency regardless of QoS.
```
Kernel: 256x256 conv, spatial=64 (8.4 MFLOPS)
QoS Compile Load Eval(1) Eval(avg10) Status
0 13.9ms 15.6ms 0.22ms 0.11ms OK
1 11.6ms 1.8ms 0.17ms 0.07ms OK
5 11.4ms 1.7ms 0.17ms 0.07ms OK
10 12.0ms 1.8ms 0.18ms 0.06ms OK
21 11.8ms 1.7ms 0.18ms 0.08ms OK
33 11.5ms 1.7ms 0.17ms 0.06ms OK
47 10.8ms 1.7ms 0.18ms 0.06ms OK
63 11.3ms 1.7ms 0.17ms 0.07ms OK
```
**Notes**:
- QoS 0 has elevated load time (15.6ms vs ~1.7ms) — possibly first-use initialization
- Compile time ~11ms, load ~1.7ms, eval ~0.07ms avg for the 8.4 MFLOP kernel
- Eval throughput: 8.4 MFLOP / 0.07ms ≈ **120 GFLOP/s** for a single 256×256 conv
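
The throughput estimate can be reproduced with a short C sketch. It uses the same FLOP count the probe prints (`2 * CH * CH * SP` for a 1x1 conv) and divides by the measured latency. The function names `conv1x1_flops` and `gflops_per_sec` are hypothetical.

```c
#include <assert.h>

/* FLOPs for a 1x1 conv: one multiply and one add per
 * (output channel, input channel, spatial position). */
double conv1x1_flops(int in_ch, int out_ch, int spatial) {
    assert(in_ch > 0 && out_ch > 0 && spatial > 0);
    return 2.0 * in_ch * out_ch * spatial;
}

/* Effective rate in GFLOP/s for a measured latency in milliseconds. */
double gflops_per_sec(double flops, double latency_ms) {
    assert(latency_ms > 0.0);
    return flops / (latency_ms * 1e-3) / 1e9;
}
```

For `CH = 256`, `SP = 64` the count is 8,388,608 FLOP. At 0.07ms that works out to just under 120 GFLOP/s, which matches the rounded figure in the notes.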
---
## test_ane_advanced — Key Findings
### weightsBuffer IOSurface — Does NOT Override
Passing a `weightsBuffer` IOSurface with different weights to the request **does not change output**. The compiled weights are still used.
```
Baseline (1x identity): Output[0..3] = [0.1000, 0.2000, 0.3000, 0.3999]
weightsBuffer (3x identity): Output[0..3] = [0.1000, 0.2000, 0.3000, 0.3999]
```
The `weightsBuffer` parameter likely serves a different purpose (perhaps for models that declare runtime weights vs baked constants).
### procedureIndex — All 0-15 Succeed
All procedure indices 0-15 return OK. Single-procedure models work with any index (they probably ignore non-zero indices). Multi-procedure models compiled from `_ANEChainingRequest` would use different indices for different subgraphs.
### SharedEvents — Classes Exist, Need IOSurfaceSharedEvent
- `_ANESharedEvents`, `_ANESharedSignalEvent`, `_ANESharedWaitEvent` all exist
- `alloc/init` returns nil — they need `IOSurfaceSharedEvent` objects (Metal shared events)
- `_ANESharedSignalEvent` has `symbolIndex` and `agentMask` — for GPU↔ANE sync
- Signal API: `+signalEventWithValue:symbolIndex:eventType:sharedEvent:`
- Wait API: `+waitEventWithValue:sharedEvent:eventType:`
### ChainingRequest — Exists with Loopback Support
`_ANEChainingRequest` supports chained execution:
- `inputBuffer`, `outputSets` — multiple output sets for pipeline
- `loopbackInputSymbolIndex`, `loopbackOutputSymbolIndex` — feed output back as input
- `fwEnqueueDelay` — firmware-level enqueue timing
- `memoryPoolId` — shared memory pool across chained ops
- `signalEvents` — sync with other agents
### Notable _ANEClient Methods
- `evaluateRealTimeWithModel:options:request:error:` — real-time eval path
- `loadRealTimeModel:options:qos:error:` — RT model loading
- `beginRealTimeTask` / `endRealTimeTask` — RT task bracketing
- `prepareChainingWithModel:options:chainingReq:qos:error:` — set up chaining
- `enqueueSetsWithModel:outputSet:options:qos:error:` — enqueue output sets
- `buffersReadyWithModel:inputBuffers:options:qos:error:` — signal input ready
### All ANE Classes Found (67 total)
Key unexplored classes: `_ANEDeviceController`, `_ANEQoSMapper`, `_ANEBuffer`, `_ANEIOSurfaceOutputSets`, `_ANEProgramForEvaluation`, `_ANEProgramIOSurfacesMapper`, `_ANEModelInstanceParameters`, `_ANEInputBuffersReady`, `_ANEOutputSetEnqueue`
---
## Strategic Implications
### Compilation Bottleneck (Primary)
Weight reload and weightsBuffer both fail. **Weights are irrevocably baked at compile time.** The only paths forward:
1. **Raise ACCUM_STEPS significantly** (10→100+) to amortize compile cost
2. **Async background compilation** while training continues with old weights
3. **Chaining API** (`_ANEChainingRequest`) to pipeline multiple layers in one dispatch
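
To see why raising ACCUM_STEPS helps, here is a rough C sketch of the amortization arithmetic. It assumes one recompile per ACCUM_STEPS training steps and a fixed per-step cost. `compile_overhead_fraction` and its `step_ms` parameter are hypothetical, and these results do not include a per-step timing, so measured values need to be plugged in.

```c
#include <assert.h>

/* Fraction of wall time spent compiling when one recompile is paid for
 * every accum_steps training steps. step_ms is an assumed per-step cost. */
double compile_overhead_fraction(double compile_ms, double step_ms,
                                 int accum_steps) {
    assert(compile_ms >= 0.0 && step_ms > 0.0 && accum_steps > 0);
    double work_ms = step_ms * accum_steps;
    return compile_ms / (compile_ms + work_ms);
}
```

With illustrative numbers (a 100ms recompile and 10ms steps, both assumed), going from 10 to 100 accumulation steps cuts the compile share from 50% to about 9%.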
### Performance Monitoring
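`hwExecutionTime` from `_ANEPerformanceStats` reports hardware execution time per eval in nanoseconds, not host wall-clock time. Likely steps to enable it (not yet verified):
1. Set `perfStatsMask` on the model before eval (the property was found on `_ANEModel`; whether `_ANEInMemoryModel` exposes it still needs checking)
2. Pass an `_ANEPerformanceStats` to the request, created through one of the factory methods because `alloc/init` returns `nil`
3. Read `hwExecutionTime` after eval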
### Real-Time Path
`_ANEClient` has a dedicated real-time evaluation path (`evaluateRealTimeWithModel:`) with RT load/unload. This may provide lower/more predictable latency.
### Chaining (Most Promising for Utilization)
`_ANEChainingRequest` with loopback could allow multiple layers to execute as a single ANE program without CPU round-trips between layers. Combined with `_ANEIOSurfaceOutputSets` and `_ANEInputBuffersReady`, this could dramatically reduce idle time between kernel dispatches.


@ -0,0 +1,245 @@
// test_ane_advanced.m: probe advanced ANE interfaces
// SharedEvents, weightsBuffer, procedureIndex, VirtualClient, ChainingRequest
#import <Foundation/Foundation.h>
#import <objc/runtime.h>
#import <objc/message.h>
#import <dlfcn.h>
#import <IOSurface/IOSurface.h>
#import <mach/mach_time.h>
#include <math.h>
static mach_timebase_info_data_t g_tb;
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
static void dump_class(const char *name) {
Class cls = NSClassFromString([NSString stringWithUTF8String:name]);
if (!cls) { printf(" %s: NOT FOUND\n", name); return; }
printf("\n=== %s ===\n", name);
unsigned int count;
Method *methods = class_copyMethodList(object_getClass(cls), &count);
if (count) printf(" Class methods:\n");
for (unsigned int i = 0; i < count; i++) {
SEL s = method_getName(methods[i]);
const char *enc = method_getTypeEncoding(methods[i]);
printf(" + %s [%s]\n", sel_getName(s), enc ? enc : "?");
}
free(methods);
methods = class_copyMethodList(cls, &count);
if (count) printf(" Instance methods:\n");
for (unsigned int i = 0; i < count; i++) {
SEL s = method_getName(methods[i]);
const char *enc = method_getTypeEncoding(methods[i]);
printf(" - %s [%s]\n", sel_getName(s), enc ? enc : "?");
}
free(methods);
unsigned int pcount;
objc_property_t *props = class_copyPropertyList(cls, &pcount);
if (pcount) printf(" Properties:\n");
for (unsigned int i = 0; i < pcount; i++) {
const char *pname = property_getName(props[i]);
const char *pattr = property_getAttributes(props[i]);
printf(" @property %s [%s]\n", pname, pattr ? pattr : "?");
}
free(props);
}
static IOSurfaceRef make_surface(size_t bytes) {
return IOSurfaceCreate((__bridge CFDictionaryRef)@{
(id)kIOSurfaceWidth:@(bytes), (id)kIOSurfaceHeight:@1,
(id)kIOSurfaceBytesPerElement:@1, (id)kIOSurfaceBytesPerRow:@(bytes),
(id)kIOSurfaceAllocSize:@(bytes), (id)kIOSurfacePixelFormat:@0});
}
int main() {
@autoreleasepool {
setbuf(stdout, NULL);
mach_timebase_info(&g_tb);
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
printf("=== ANE Advanced Interface Probe ===\n");
// === Part 1: Event/Sync classes ===
printf("\n--- Part 1: Event/Sync Classes ---\n");
dump_class("_ANESharedEvents");
dump_class("_ANESharedSignalEvent");
dump_class("_ANESharedWaitEvent");
dump_class("_ANEEvent");
dump_class("_ANEFenceEvent");
const char *event_classes[] = {
"_ANESharedEvents", "_ANESharedSignalEvent", "_ANESharedWaitEvent",
"_ANEEvent", "_ANEFenceEvent", NULL
};
for (int i = 0; event_classes[i]; i++) {
Class cls = NSClassFromString([NSString stringWithUTF8String:event_classes[i]]);
if (!cls) continue;
@try {
id obj = [[cls alloc] init];
printf(" %s alloc/init: %s\n", event_classes[i],
obj ? [[obj description] UTF8String] : "nil");
} @catch (NSException *ex) {
printf(" %s alloc/init: EXCEPTION: %s\n", event_classes[i], [[ex reason] UTF8String]);
}
}
// === Part 2: VirtualClient and ChainingRequest ===
printf("\n--- Part 2: VirtualClient / ChainingRequest ---\n");
dump_class("_ANEVirtualClient");
dump_class("_ANEChainingRequest");
dump_class("_ANEMultiRequest");
dump_class("_ANEBatchRequest");
// === Part 3: Compile working kernel for weightsBuffer + procedureIndex tests ===
printf("\n--- Part 3: weightsBuffer IOSurface test ---\n");
Class g_D = NSClassFromString(@"_ANEInMemoryModelDescriptor");
Class g_I = NSClassFromString(@"_ANEInMemoryModel");
Class g_AR = NSClassFromString(@"_ANERequest");
Class g_AIO= NSClassFromString(@"_ANEIOSurfaceObject");
int CH = 64, SP = 32;
_Float16 *w = (_Float16*)calloc(CH*CH, sizeof(_Float16));
for (int i = 0; i < CH; i++) w[i*CH+i] = (_Float16)1.0f;
int ws = CH*CH*2, tot = 128+ws;
uint8_t *blob = (uint8_t*)calloc(tot,1);
blob[0]=1; blob[4]=2; blob[64]=0xEF; blob[65]=0xBE; blob[66]=0xAD; blob[67]=0xDE; blob[68]=1;
*(uint32_t*)(blob+72)=ws; *(uint32_t*)(blob+80)=128;
memcpy(blob+128, w, ws);
NSData *wdata = [NSData dataWithBytesNoCopy:blob length:tot freeWhenDone:YES];
NSString *mil = [NSString stringWithFormat:
@"program(1.3)\n"
"[buildInfo = dict<string, string>({{\"coremlc-component-MIL\", \"3510.2.1\"}, "
"{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, "
"{\"coremltools-version\", \"9.0\"}})]\n"
"{\n"
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
" string pt = const()[name=string(\"pt\"), val=string(\"valid\")];\n"
" tensor<int32, [2]> st = const()[name=string(\"st\"), val=tensor<int32, [2]>([1,1])];\n"
" tensor<int32, [4]> pd = const()[name=string(\"pd\"), val=tensor<int32, [4]>([0,0,0,0])];\n"
" tensor<int32, [2]> dl = const()[name=string(\"dl\"), val=tensor<int32, [2]>([1,1])];\n"
" int32 gr = const()[name=string(\"gr\"), val=int32(1)];\n"
" string to16 = const()[name=string(\"to16\"), val=string(\"fp16\")];\n"
" tensor<fp16, [1,%d,1,%d]> x16 = cast(dtype=to16,x=x)[name=string(\"cin\")];\n"
" tensor<fp16, [%d,%d,1,1]> W = const()[name=string(\"W\"), "
"val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=string(\"@model_path/weights/weight.bin\"), offset=uint64(64)))];\n"
" tensor<fp16, [1,%d,1,%d]> y16 = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=W,x=x16)"
"[name=string(\"conv\")];\n"
" string to32 = const()[name=string(\"to32\"), val=string(\"fp32\")];\n"
" tensor<fp32, [1,%d,1,%d]> y = cast(dtype=to32,x=y16)[name=string(\"cout\")];\n"
" } -> (y);\n"
"}\n", CH, SP, CH, SP, CH, CH, CH, CH, CH, SP, CH, SP];
NSData *md = [mil dataUsingEncoding:NSUTF8StringEncoding];
id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(g_D, @selector(modelWithMILText:weights:optionsPlist:),
md, @{@"@model_path/weights/weight.bin": @{@"offset":@0, @"data":wdata}}, nil);
id mdl = ((id(*)(Class,SEL,id))objc_msgSend)(g_I, @selector(inMemoryModelWithDescriptor:), desc);
id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier));
NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx];
NSFileManager *fm = [NSFileManager defaultManager];
[fm createDirectoryAtPath:[td stringByAppendingPathComponent:@"weights"]
withIntermediateDirectories:YES attributes:nil error:nil];
[md writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES];
[wdata writeToFile:[td stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES];
NSError *e = nil;
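// Stop early if compile or load fails; later evals would only report misleading results.
BOOL cok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(compileWithQoS:options:error:), 21, @{}, &e);
BOOL lok = cok && ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e);
if (!lok) { printf(" compile/load FAILED: %s\n", e ? [[e description] UTF8String] : "unknown error"); return 1; }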
int ioBytes = CH * SP * 4;
IOSurfaceRef ioIn = make_surface(ioBytes);
IOSurfaceRef ioOut = make_surface(ioBytes);
IOSurfaceLock(ioIn, 0, NULL);
float *inp = (float*)IOSurfaceGetBaseAddress(ioIn);
for (int c = 0; c < CH; c++) for (int s = 0; s < SP; s++) inp[c*SP+s] = (float)(s+1) * 0.1f;
IOSurfaceUnlock(ioIn, 0, NULL);
// Baseline eval
id wI = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioIn);
id wO = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioOut);
id req0 = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR,
@selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:),
@[wI], @[@0], @[wO], @[@0], nil, nil, @0);
BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req0, &e);
printf(" Baseline eval (weightsBuffer=nil, procIdx=0): %s\n", ok ? "OK" : "FAIL");
IOSurfaceLock(ioOut, kIOSurfaceLockReadOnly, NULL);
float *out0 = (float*)IOSurfaceGetBaseAddress(ioOut);
float baseline_0 = out0[0]; // only element 0 is compared below
printf(" Output[0..3]: [%.4f, %.4f, %.4f, %.4f]\n", out0[0], out0[1], out0[2], out0[3]);
IOSurfaceUnlock(ioOut, kIOSurfaceLockReadOnly, NULL);
// Test weightsBuffer: IOSurface with 3x identity weights
printf("\n Testing weightsBuffer IOSurface...\n");
_Float16 *w3 = (_Float16*)calloc(CH*CH, sizeof(_Float16));
for (int i = 0; i < CH; i++) w3[i*CH+i] = (_Float16)3.0f;
IOSurfaceRef ioW = make_surface(ws);
IOSurfaceLock(ioW, 0, NULL);
memcpy(IOSurfaceGetBaseAddress(ioW), w3, ws);
IOSurfaceUnlock(ioW, 0, NULL);
free(w3);
id wW = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioW);
wI = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioIn);
wO = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioOut);
id req_wb = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR,
@selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:),
@[wI], @[@0], @[wO], @[@0], wW, nil, @0);
printf(" Request with weightsBuffer: %s\n", req_wb ? "created" : "nil");
if (req_wb) {
ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req_wb, &e);
printf(" Eval with weightsBuffer: %s\n", ok ? "OK" : e ? [[e description] UTF8String] : "FAIL");
if (ok) {
IOSurfaceLock(ioOut, kIOSurfaceLockReadOnly, NULL);
float *outW = (float*)IOSurfaceGetBaseAddress(ioOut);
printf(" Output[0..3]: [%.4f, %.4f, %.4f, %.4f]\n", outW[0], outW[1], outW[2], outW[3]);
bool changed = fabsf(outW[0] - baseline_0) > 0.001f;
bool is_3x = fabsf(outW[0] - baseline_0 * 3.0f) < 0.1f;
printf(" weightsBuffer: output %s", changed ? "CHANGED" : "unchanged");
if (changed) printf(" (%s)", is_3x ? "matches 3x — WORKS!" : "but not 3x as expected");
printf("\n");
IOSurfaceUnlock(ioOut, kIOSurfaceLockReadOnly, NULL);
}
}
CFRelease(ioW);
// === Part 4: procedureIndex sweep ===
printf("\n--- Part 4: procedureIndex sweep (0-15) ---\n");
for (int pi = 0; pi < 16; pi++) {
wI = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioIn);
wO = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioOut);
id req_p = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR,
@selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:),
@[wI], @[@0], @[wO], @[@0], nil, nil, @(pi));
if (!req_p) { printf(" procIdx %2d: request=nil\n", pi); continue; }
ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req_p, &e);
printf(" procIdx %2d: %s%s\n", pi, ok ? "OK" : "FAIL",
!ok && e ? [NSString stringWithFormat:@" (%@)", [e localizedDescription]].UTF8String : "");
}
// === Part 5: Scan all ANE classes ===
printf("\n--- Part 5: All ANE-prefixed classes ---\n");
unsigned int classCount;
Class *allClasses = objc_copyClassList(&classCount);
for (unsigned int i = 0; i < classCount; i++) {
const char *name = class_getName(allClasses[i]);
if (strstr(name, "ANE") || strstr(name, "ane")) {
printf(" %s\n", name);
}
}
free(allClasses);
free(w);
// Cleanup
((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)(mdl, @selector(unloadWithQoS:error:), 21, &e);
[fm removeItemAtPath:td error:nil];
CFRelease(ioIn); CFRelease(ioOut);
printf("\nDone.\n");
}
return 0;
}

training/test_perf_stats.m Normal file

@ -0,0 +1,233 @@
// test_perf_stats.m: what does _ANEPerformanceStats expose?
// Probe class methods, properties, instantiate, pass to request, read back.
#import <Foundation/Foundation.h>
#import <objc/runtime.h>
#import <objc/message.h>
#import <dlfcn.h>
#import <IOSurface/IOSurface.h>
#import <mach/mach_time.h>
static mach_timebase_info_data_t g_tb;
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
static void dump_class(const char *name) {
Class cls = NSClassFromString([NSString stringWithUTF8String:name]);
if (!cls) { printf(" %s: NOT FOUND\n", name); return; }
printf("\n=== %s ===\n", name);
unsigned int count;
Method *methods = class_copyMethodList(object_getClass(cls), &count);
if (count) printf(" Class methods:\n");
for (unsigned int i = 0; i < count; i++) {
SEL s = method_getName(methods[i]);
const char *enc = method_getTypeEncoding(methods[i]);
printf(" + %s [%s]\n", sel_getName(s), enc ? enc : "?");
}
free(methods);
methods = class_copyMethodList(cls, &count);
if (count) printf(" Instance methods:\n");
for (unsigned int i = 0; i < count; i++) {
SEL s = method_getName(methods[i]);
const char *enc = method_getTypeEncoding(methods[i]);
printf(" - %s [%s]\n", sel_getName(s), enc ? enc : "?");
}
free(methods);
unsigned int pcount;
objc_property_t *props = class_copyPropertyList(cls, &pcount);
if (pcount) printf(" Properties:\n");
for (unsigned int i = 0; i < pcount; i++) {
const char *pname = property_getName(props[i]);
const char *pattr = property_getAttributes(props[i]);
printf(" @property %s [%s]\n", pname, pattr ? pattr : "?");
}
free(props);
}
static IOSurfaceRef make_surface(size_t bytes) {
return IOSurfaceCreate((__bridge CFDictionaryRef)@{
(id)kIOSurfaceWidth:@(bytes), (id)kIOSurfaceHeight:@1,
(id)kIOSurfaceBytesPerElement:@1, (id)kIOSurfaceBytesPerRow:@(bytes),
(id)kIOSurfaceAllocSize:@(bytes), (id)kIOSurfacePixelFormat:@0});
}
int main() {
@autoreleasepool {
setbuf(stdout, NULL);
mach_timebase_info(&g_tb);
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
printf("=== ANE Performance Stats Probe ===\n");
dump_class("_ANEPerformanceStats");
dump_class("_ANEPerfRequest");
dump_class("ANEPerfRequest");
dump_class("_ANEPerformanceCounters");
dump_class("_ANEDeviceInfo");
dump_class("_ANEModel");
dump_class("_ANEInMemoryModel");
dump_class("_ANERequest");
dump_class("_ANEIOSurfaceObject");
dump_class("_ANEInMemoryModelDescriptor");
dump_class("_ANEClient");
dump_class("_ANEVirtualClient");
// Try to instantiate _ANEPerformanceStats
printf("\n=== Instantiation Tests ===\n");
Class perfClass = NSClassFromString(@"_ANEPerformanceStats");
if (perfClass) {
@try {
id perfStats = [[perfClass alloc] init];
printf("_ANEPerformanceStats alloc/init: %s\n",
perfStats ? [[perfStats description] UTF8String] : "nil");
if (perfStats) {
unsigned int pcount;
objc_property_t *props = class_copyPropertyList(perfClass, &pcount);
for (unsigned int i = 0; i < pcount; i++) {
const char *pname = property_getName(props[i]);
@try {
id val = [perfStats valueForKey:[NSString stringWithUTF8String:pname]];
printf(" %s = %s\n", pname, val ? [[val description] UTF8String] : "nil");
} @catch (NSException *ex) {
printf(" %s = <exception: %s>\n", pname, [[ex reason] UTF8String]);
}
}
free(props);
}
} @catch (NSException *ex) {
printf("Exception: %s\n", [[ex reason] UTF8String]);
}
}
// Compile a working kernel and test perfStats in request
printf("\n=== Compile kernel and test perfStats in request ===\n");
Class g_D = NSClassFromString(@"_ANEInMemoryModelDescriptor");
Class g_I = NSClassFromString(@"_ANEInMemoryModel");
Class g_AR = NSClassFromString(@"_ANERequest");
Class g_AIO= NSClassFromString(@"_ANEIOSurfaceObject");
int CH = 64, SP = 32;
_Float16 *w = (_Float16*)calloc(CH*CH, sizeof(_Float16));
for (int i = 0; i < CH; i++) w[i*CH+i] = (_Float16)1.0f;
int ws = CH*CH*2, tot = 128+ws;
uint8_t *blob = (uint8_t*)calloc(tot,1);
blob[0]=1; blob[4]=2; blob[64]=0xEF; blob[65]=0xBE; blob[66]=0xAD; blob[67]=0xDE; blob[68]=1;
*(uint32_t*)(blob+72)=ws; *(uint32_t*)(blob+80)=128;
memcpy(blob+128, w, ws);
NSData *wdata = [NSData dataWithBytesNoCopy:blob length:tot freeWhenDone:YES];
free(w);
NSString *mil = [NSString stringWithFormat:
@"program(1.3)\n"
"[buildInfo = dict<string, string>({{\"coremlc-component-MIL\", \"3510.2.1\"}, "
"{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, "
"{\"coremltools-version\", \"9.0\"}})]\n"
"{\n"
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
" string pt = const()[name=string(\"pt\"), val=string(\"valid\")];\n"
" tensor<int32, [2]> st = const()[name=string(\"st\"), val=tensor<int32, [2]>([1,1])];\n"
" tensor<int32, [4]> pd = const()[name=string(\"pd\"), val=tensor<int32, [4]>([0,0,0,0])];\n"
" tensor<int32, [2]> dl = const()[name=string(\"dl\"), val=tensor<int32, [2]>([1,1])];\n"
" int32 gr = const()[name=string(\"gr\"), val=int32(1)];\n"
" string to16 = const()[name=string(\"to16\"), val=string(\"fp16\")];\n"
" tensor<fp16, [1,%d,1,%d]> x16 = cast(dtype=to16,x=x)[name=string(\"cin\")];\n"
" tensor<fp16, [%d,%d,1,1]> W = const()[name=string(\"W\"), "
"val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=string(\"@model_path/weights/weight.bin\"), offset=uint64(64)))];\n"
" tensor<fp16, [1,%d,1,%d]> y16 = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=W,x=x16)"
"[name=string(\"conv\")];\n"
" string to32 = const()[name=string(\"to32\"), val=string(\"fp32\")];\n"
" tensor<fp32, [1,%d,1,%d]> y = cast(dtype=to32,x=y16)[name=string(\"cout\")];\n"
" } -> (y);\n"
"}\n", CH, SP, CH, SP, CH, CH, CH, CH, CH, SP, CH, SP];
NSData *md = [mil dataUsingEncoding:NSUTF8StringEncoding];
id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(g_D, @selector(modelWithMILText:weights:optionsPlist:),
md, @{@"@model_path/weights/weight.bin": @{@"offset":@0, @"data":wdata}}, nil);
id mdl = ((id(*)(Class,SEL,id))objc_msgSend)(g_I, @selector(inMemoryModelWithDescriptor:), desc);
id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier));
NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx];
[[NSFileManager defaultManager] createDirectoryAtPath:[td stringByAppendingPathComponent:@"weights"]
withIntermediateDirectories:YES attributes:nil error:nil];
[md writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES];
[wdata writeToFile:[td stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES];
NSError *e = nil;
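// Stop early if compile or load fails; the perfStats eval below depends on a loaded model.
BOOL cok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(compileWithQoS:options:error:), 21, @{}, &e);
BOOL lok = cok && ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e);
if (!lok) { printf(" compile/load FAILED: %s\n", e ? [[e description] UTF8String] : "unknown error"); return 1; }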
int ioBytes = CH * SP * 4; // fp32
IOSurfaceRef ioIn = make_surface(ioBytes);
IOSurfaceRef ioOut = make_surface(ioBytes);
id wI = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioIn);
id wO = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioOut);
// Try creating request WITH perfStats
if (perfClass) {
id perfStats = [[perfClass alloc] init];
printf(" Creating request with perfStats=%s\n", perfStats ? "non-nil" : "nil");
id req = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR,
@selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:),
@[wI], @[@0], @[wO], @[@0], nil, perfStats, @0);
printf(" Request: %s\n", req ? "created" : "nil");
if (req) {
IOSurfaceLock(ioIn, 0, NULL);
float *inp = (float*)IOSurfaceGetBaseAddress(ioIn);
for (int i = 0; i < CH*SP; i++) inp[i] = 1.0f;
IOSurfaceUnlock(ioIn, 0, NULL);
BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e);
printf(" Eval: %s\n", ok ? "OK" : e ? [[e description] UTF8String] : "FAIL"); // avoid passing NULL to %s when e is nil
if (ok && perfStats) {
printf("\n PerfStats after 1 eval:\n");
unsigned int pcount;
objc_property_t *props = class_copyPropertyList(perfClass, &pcount);
for (unsigned int i = 0; i < pcount; i++) {
const char *pname = property_getName(props[i]);
@try {
id val = [perfStats valueForKey:[NSString stringWithUTF8String:pname]];
printf(" %s = %s\n", pname, val ? [[val description] UTF8String] : "nil");
} @catch (NSException *ex) {
printf(" %s = <exception>\n", pname);
}
}
free(props);
printf("\n Running 100 evals...\n");
uint64_t t0 = mach_absolute_time();
for (int i = 0; i < 100; i++) {
((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e);
}
double elapsed_ms = tb_ms(mach_absolute_time() - t0); // sample the clock once so total and per-eval agree
printf(" 100 evals in %.1fms (%.2fms/eval)\n", elapsed_ms, elapsed_ms / 100.0);
printf("\n PerfStats after 101 evals:\n");
props = class_copyPropertyList(perfClass, &pcount);
for (unsigned int i = 0; i < pcount; i++) {
const char *pname = property_getName(props[i]);
@try {
id val = [perfStats valueForKey:[NSString stringWithUTF8String:pname]];
printf(" %s = %s\n", pname, val ? [[val description] UTF8String] : "nil");
} @catch (NSException *ex) {
printf(" %s = <exception>\n", pname);
}
}
free(props);
}
}
} else {
printf(" _ANEPerformanceStats class NOT FOUND\n");
}
// Cleanup
((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)(mdl, @selector(unloadWithQoS:error:), 21, &e);
[[NSFileManager defaultManager] removeItemAtPath:td error:nil];
CFRelease(ioIn); CFRelease(ioOut);
}
return 0;
}

training/test_qos_sweep.m Normal file

@ -0,0 +1,157 @@
// test_qos_sweep.m: does QoS affect frequency/latency?
// Sweep QoS 0-63 on compile, load, eval of a working kernel.
#import <Foundation/Foundation.h>
#import <objc/runtime.h>
#import <objc/message.h>
#import <dlfcn.h>
#import <IOSurface/IOSurface.h>
#import <mach/mach_time.h>
static mach_timebase_info_data_t g_tb;
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
static IOSurfaceRef make_surface(size_t bytes) {
return IOSurfaceCreate((__bridge CFDictionaryRef)@{
(id)kIOSurfaceWidth:@(bytes), (id)kIOSurfaceHeight:@1,
(id)kIOSurfaceBytesPerElement:@1, (id)kIOSurfaceBytesPerRow:@(bytes),
(id)kIOSurfaceAllocSize:@(bytes), (id)kIOSurfacePixelFormat:@0});
}
int main() {
@autoreleasepool {
setbuf(stdout, NULL);
mach_timebase_info(&g_tb);
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
Class g_D = NSClassFromString(@"_ANEInMemoryModelDescriptor");
Class g_I = NSClassFromString(@"_ANEInMemoryModel");
Class g_AR = NSClassFromString(@"_ANERequest");
Class g_AIO= NSClassFromString(@"_ANEIOSurfaceObject");
// 256x256 conv, spatial=64 for measurable latency
int CH = 256, SP = 64;
int ws = CH*CH*2, tot = 128+ws;
uint8_t *blob = (uint8_t*)calloc(tot, 1);
blob[0]=1; blob[4]=2; blob[64]=0xEF; blob[65]=0xBE; blob[66]=0xAD; blob[67]=0xDE; blob[68]=1;
*(uint32_t*)(blob+72)=ws; *(uint32_t*)(blob+80)=128;
_Float16 *wp = (_Float16*)(blob+128);
for (int i = 0; i < CH*CH; i++) wp[i] = (_Float16)(0.01f * (i % 100 - 50));
NSData *wdata = [NSData dataWithBytesNoCopy:blob length:tot freeWhenDone:YES];
NSString *mil = [NSString stringWithFormat:
@"program(1.3)\n"
"[buildInfo = dict<string, string>({{\"coremlc-component-MIL\", \"3510.2.1\"}, "
"{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, "
"{\"coremltools-version\", \"9.0\"}})]\n"
"{\n"
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
" string pt = const()[name=string(\"pt\"), val=string(\"valid\")];\n"
" tensor<int32, [2]> st = const()[name=string(\"st\"), val=tensor<int32, [2]>([1,1])];\n"
" tensor<int32, [4]> pd = const()[name=string(\"pd\"), val=tensor<int32, [4]>([0,0,0,0])];\n"
" tensor<int32, [2]> dl = const()[name=string(\"dl\"), val=tensor<int32, [2]>([1,1])];\n"
" int32 gr = const()[name=string(\"gr\"), val=int32(1)];\n"
" string to16 = const()[name=string(\"to16\"), val=string(\"fp16\")];\n"
" tensor<fp16, [1,%d,1,%d]> x16 = cast(dtype=to16,x=x)[name=string(\"cin\")];\n"
" tensor<fp16, [%d,%d,1,1]> W = const()[name=string(\"W\"), "
"val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=string(\"@model_path/weights/weight.bin\"), offset=uint64(64)))];\n"
" tensor<fp16, [1,%d,1,%d]> y16 = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=W,x=x16)"
"[name=string(\"conv\")];\n"
" string to32 = const()[name=string(\"to32\"), val=string(\"fp32\")];\n"
" tensor<fp32, [1,%d,1,%d]> y = cast(dtype=to32,x=y16)[name=string(\"cout\")];\n"
" } -> (y);\n"
"}\n", CH, SP, CH, SP, CH, CH, CH, CH, CH, SP, CH, SP];
NSDictionary *weights = @{@"@model_path/weights/weight.bin": @{@"offset":@0, @"data":wdata}};
NSData *milData = [mil dataUsingEncoding:NSUTF8StringEncoding];
NSFileManager *fm = [NSFileManager defaultManager];
printf("=== QoS Sweep: compile/load/eval with varying QoS ===\n");
printf("Kernel: %dx%d conv, spatial=%d (%.1f MFLOPS)\n", CH, CH, SP, 2.0*CH*CH*SP/1e6);
printf("%4s %10s %10s %10s %10s %s\n", "QoS", "Compile", "Load", "Eval(1)", "Eval(avg10)", "Status");
unsigned int qos_values[] = {0, 1, 5, 10, 15, 17, 19, 21, 25, 31, 33, 40, 47, 50, 55, 60, 63};
int n_qos = sizeof(qos_values)/sizeof(qos_values[0]);
for (int qi = 0; qi < n_qos; qi++) {
unsigned int qos = qos_values[qi];
NSError *e = nil;
// Perturb one weight per iteration so the model's hexStringIdentifier differs for each QoS value
_Float16 *wq = (_Float16*)(blob+128);
wq[0] = (_Float16)(0.001f * qi);
NSData *wdata_q = [NSData dataWithBytes:blob length:tot];
NSDictionary *weights_q = @{@"@model_path/weights/weight.bin": @{@"offset":@0, @"data":wdata_q}};
id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(g_D, @selector(modelWithMILText:weights:optionsPlist:),
milData, weights_q, nil);
id mdl = ((id(*)(Class,SEL,id))objc_msgSend)(g_I, @selector(inMemoryModelWithDescriptor:), desc);
id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier));
NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx];
[fm createDirectoryAtPath:[td stringByAppendingPathComponent:@"weights"]
withIntermediateDirectories:YES attributes:nil error:nil];
[milData writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES];
[wdata_q writeToFile:[td stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES];
uint64_t t0 = mach_absolute_time();
BOOL cok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(
mdl, @selector(compileWithQoS:options:error:), qos, @{}, &e);
double cms = tb_ms(mach_absolute_time() - t0);
if (!cok) {
printf("%4u %10s %10s %10s %10s COMPILE_FAIL\n", qos, "-", "-", "-", "-");
[fm removeItemAtPath:td error:nil];
continue;
}
t0 = mach_absolute_time();
BOOL lok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(
mdl, @selector(loadWithQoS:options:error:), qos, @{}, &e);
double lms = tb_ms(mach_absolute_time() - t0);
if (!lok) {
printf("%4u %8.1fms %10s %10s %10s LOAD_FAIL\n", qos, cms, "-", "-", "-");
((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)(mdl, @selector(unloadWithQoS:error:), 21, &e);
[fm removeItemAtPath:td error:nil];
continue;
}
int ioBytes = CH * SP * 4;
IOSurfaceRef ioIn = make_surface(ioBytes);
IOSurfaceRef ioOut = make_surface(ioBytes);
id wI = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioIn);
id wO = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioOut);
id req = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR,
@selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:),
@[wI], @[@0], @[wO], @[@0], nil, nil, @0);
IOSurfaceLock(ioIn, 0, NULL);
float *inp = (float*)IOSurfaceGetBaseAddress(ioIn);
for (int i = 0; i < CH*SP; i++) inp[i] = 0.5f;
IOSurfaceUnlock(ioIn, 0, NULL);
t0 = mach_absolute_time();
BOOL eok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
mdl, @selector(evaluateWithQoS:options:request:error:), qos, @{}, req, &e);
double ems1 = tb_ms(mach_absolute_time() - t0);
if (!eok) {
printf("%4u %8.1fms %8.1fms %10s %10s EVAL_FAIL\n", qos, cms, lms, "-", "-");
} else {
t0 = mach_absolute_time();
for (int i = 0; i < 10; i++) {
((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
mdl, @selector(evaluateWithQoS:options:request:error:), qos, @{}, req, &e);
}
double ems_avg = tb_ms(mach_absolute_time() - t0) / 10.0;
printf("%4u %8.1fms %8.1fms %8.2fms %8.2fms OK\n", qos, cms, lms, ems1, ems_avg);
}
((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)(mdl, @selector(unloadWithQoS:error:), 21, &e);
CFRelease(ioIn); CFRelease(ioOut);
[fm removeItemAtPath:td error:nil];
}
printf("\nDone.\n");
}
return 0;
}
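
For reference, the two numbers the sweep reports per row, elapsed milliseconds and the kernel's FLOP count, reduce to the arithmetic below. This is a minimal portable C sketch rather than the probe itself: `ticks_to_ms`, `conv1x1_flop`, and `mean_ms` are illustrative names that do not appear in the PR, and the timebase is passed in explicitly so the code does not depend on `mach_timebase_info`.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick delta to milliseconds using a mach-style timebase,
 * where nanoseconds = ticks * numer / denom. */
static double ticks_to_ms(uint64_t ticks, uint32_t numer, uint32_t denom) {
    assert(denom != 0);
    return (double)ticks * numer / denom / 1e6;
}

/* FLOP count for a 1x1 conv: one multiply and one add for every
 * (output channel, input channel, spatial position) triple. */
static double conv1x1_flop(int ch_in, int ch_out, int sp) {
    return 2.0 * ch_in * ch_out * sp;
}

/* Mean per-call latency when n calls were timed as one loop. */
static double mean_ms(double total_ms, int n) {
    assert(n > 0);
    return total_ms / n;
}
```

With a timebase of 125/3 (a 24 MHz tick, used here only as an example), 24,000,000 ticks come out as 1000 ms, and a 64x64 conv at spatial=32 is 262,144 FLOP per eval.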


@ -0,0 +1,253 @@
// test_weight_reload.m: Can we skip recompilation by rewriting weight blobs on disk?
// Compile a conv kernel with weights A, eval, verify output.
// Overwrite weights/weight.bin in tmpDir with weights B.
// unloadWithQoS: then loadWithQoS: (no recompile).
// Eval again: if output matches B @ x, the compilation bottleneck is eliminated.
#import <Foundation/Foundation.h>
#import <objc/runtime.h>
#import <objc/message.h>
#import <dlfcn.h>
#import <IOSurface/IOSurface.h>
#import <mach/mach_time.h>
#include <math.h>
static mach_timebase_info_data_t g_tb;
static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
static IOSurfaceRef make_surface(size_t bytes) {
return IOSurfaceCreate((__bridge CFDictionaryRef)@{
(id)kIOSurfaceWidth:@(bytes), (id)kIOSurfaceHeight:@1,
(id)kIOSurfaceBytesPerElement:@1, (id)kIOSurfaceBytesPerRow:@(bytes),
(id)kIOSurfaceAllocSize:@(bytes), (id)kIOSurfacePixelFormat:@0});
}
// Build weight blob matching inmem_peak format (single chunk)
static NSData *build_weight_blob(_Float16 *w, int rows, int cols) {
int ws = rows * cols * 2;
int tot = 128 + ws;
uint8_t *b = (uint8_t*)calloc(tot, 1);
b[0] = 1; b[4] = 2;
b[64] = 0xEF; b[65] = 0xBE; b[66] = 0xAD; b[67] = 0xDE; b[68] = 1;
*(uint32_t*)(b+72) = ws;
*(uint32_t*)(b+80) = 128;
memcpy(b + 128, w, ws);
return [NSData dataWithBytesNoCopy:b length:tot freeWhenDone:YES];
}
// Generate MIL for a simple conv: fp32 in -> cast fp16 -> conv -> cast fp32 out
static NSString *gen_mil(int ch, int sp) {
return [NSString stringWithFormat:
@"program(1.3)\n"
"[buildInfo = dict<string, string>({{\"coremlc-component-MIL\", \"3510.2.1\"}, "
"{\"coremlc-version\", \"3505.4.1\"}, {\"coremltools-component-milinternal\", \"\"}, "
"{\"coremltools-version\", \"9.0\"}})]\n"
"{\n"
" func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
" string pt = const()[name=string(\"pt\"), val=string(\"valid\")];\n"
" tensor<int32, [2]> st = const()[name=string(\"st\"), val=tensor<int32, [2]>([1,1])];\n"
" tensor<int32, [4]> pd = const()[name=string(\"pd\"), val=tensor<int32, [4]>([0,0,0,0])];\n"
" tensor<int32, [2]> dl = const()[name=string(\"dl\"), val=tensor<int32, [2]>([1,1])];\n"
" int32 gr = const()[name=string(\"gr\"), val=int32(1)];\n"
" string to16 = const()[name=string(\"to16\"), val=string(\"fp16\")];\n"
" tensor<fp16, [1,%d,1,%d]> x16 = cast(dtype=to16,x=x)[name=string(\"cin\")];\n"
" tensor<fp16, [%d,%d,1,1]> W = const()[name=string(\"W\"), "
"val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=string(\"@model_path/weights/weight.bin\"), offset=uint64(64)))];\n"
" tensor<fp16, [1,%d,1,%d]> y16 = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=W,x=x16)"
"[name=string(\"conv\")];\n"
" string to32 = const()[name=string(\"to32\"), val=string(\"fp32\")];\n"
" tensor<fp32, [1,%d,1,%d]> y = cast(dtype=to32,x=y16)[name=string(\"cout\")];\n"
" } -> (y);\n"
"}\n", ch, sp, ch, sp, ch, ch, ch, ch, ch, sp, ch, sp];
}
int main() {
@autoreleasepool {
setbuf(stdout, NULL);
mach_timebase_info(&g_tb);
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
Class g_D = NSClassFromString(@"_ANEInMemoryModelDescriptor");
Class g_I = NSClassFromString(@"_ANEInMemoryModel");
Class g_AR = NSClassFromString(@"_ANERequest");
Class g_AIO= NSClassFromString(@"_ANEIOSurfaceObject");
if (!g_D || !g_I || !g_AR || !g_AIO) {
printf("FAIL: ANE classes not found\n");
return 1;
}
// Use 64-channel conv, spatial=32 (known to work on ANE)
int CH = 64, SP = 32;
// Weight set A: identity (1.0 on the diagonal)
_Float16 *weightsA = (_Float16*)calloc(CH*CH, sizeof(_Float16));
for (int i = 0; i < CH; i++) weightsA[i*CH+i] = (_Float16)1.0f;
// Weight set B: 3x identity
_Float16 *weightsB = (_Float16*)calloc(CH*CH, sizeof(_Float16));
for (int i = 0; i < CH; i++) weightsB[i*CH+i] = (_Float16)3.0f;
NSData *wdataA = build_weight_blob(weightsA, CH, CH);
NSString *mil = gen_mil(CH, SP);
NSDictionary *weights = @{
@"@model_path/weights/weight.bin": @{@"offset": @0, @"data": wdataA}
};
NSData *milData = [mil dataUsingEncoding:NSUTF8StringEncoding];
// === Compile with weights A ===
printf("=== Step 1: Compile with weights A (identity) ===\n");
printf(" Kernel: %dx%d conv, spatial=%d\n", CH, CH, SP);
uint64_t t0 = mach_absolute_time();
id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(g_D, @selector(modelWithMILText:weights:optionsPlist:), milData, weights, nil);
if (!desc) { printf("FAIL: desc=NULL\n"); return 1; }
id mdl = ((id(*)(Class,SEL,id))objc_msgSend)(g_I, @selector(inMemoryModelWithDescriptor:), desc);
id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier));
NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx];
NSFileManager *fm = [NSFileManager defaultManager];
[fm createDirectoryAtPath:[td stringByAppendingPathComponent:@"weights"] withIntermediateDirectories:YES attributes:nil error:nil];
[milData writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES];
[wdataA writeToFile:[td stringByAppendingPathComponent:@"weights/weight.bin"] atomically:YES];
NSError *e = nil;
BOOL ok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(compileWithQoS:options:error:), 21, @{}, &e);
if (!ok) { printf("FAIL: compile: %s\n", e ? [[e description] UTF8String] : "?"); return 1; }
ok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e);
if (!ok) { printf("FAIL: load: %s\n", e ? [[e description] UTF8String] : "?"); return 1; }
double compile_ms = tb_ms(mach_absolute_time() - t0);
printf(" Compile+load: %.1fms\n", compile_ms);
printf(" tmpDir: %s\n", [td UTF8String]);
// Build request and IOSurfaces (fp32 I/O)
int inBytes = CH * SP * 4; // fp32
int outBytes = CH * SP * 4;
IOSurfaceRef ioIn = make_surface(inBytes);
IOSurfaceRef ioOut = make_surface(outBytes);
id wI = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioIn);
id wO = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioOut);
id req = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR,
@selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:),
@[wI], @[@0], @[wO], @[@0], nil, nil, @0);
// Write input: channel c, spatial s = (c*SP + s + 1) * 0.01
IOSurfaceLock(ioIn, 0, NULL);
float *inp = (float*)IOSurfaceGetBaseAddress(ioIn);
for (int c = 0; c < CH; c++)
for (int s = 0; s < SP; s++)
inp[c*SP+s] = (float)(c*SP + s + 1) * 0.01f;
IOSurfaceUnlock(ioIn, 0, NULL);
// Eval with weights A
printf("\n=== Step 2: Eval with weights A ===\n");
ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e);
if (!ok) { printf("FAIL: eval: %s\n", e ? [[e description] UTF8String] : "?"); return 1; }
IOSurfaceLock(ioOut, kIOSurfaceLockReadOnly, NULL);
float *outA = (float*)IOSurfaceGetBaseAddress(ioOut);
printf(" Output A[0..3]: [%.4f, %.4f, %.4f, %.4f]\n", outA[0], outA[1], outA[2], outA[3]);
printf(" Output A[%d..%d]: [%.4f, %.4f, %.4f, %.4f]\n", CH*SP-4, CH*SP-1,
outA[CH*SP-4], outA[CH*SP-3], outA[CH*SP-2], outA[CH*SP-1]);
// Save copy
float *outA_copy = (float*)malloc(outBytes);
memcpy(outA_copy, outA, outBytes);
IOSurfaceUnlock(ioOut, kIOSurfaceLockReadOnly, NULL);
// === Step 3: Overwrite weight file with B, unload+load ===
printf("\n=== Step 3: Overwrite weight.bin with B (3x identity), unload+load ===\n");
NSData *wdataB = build_weight_blob(weightsB, CH, CH);
NSString *weightPath = [td stringByAppendingPathComponent:@"weights/weight.bin"];
[wdataB writeToFile:weightPath atomically:YES];
printf(" Wrote new weight.bin\n");
// Unload
t0 = mach_absolute_time();
ok = ((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)(mdl, @selector(unloadWithQoS:error:), 21, &e);
double unload_ms = tb_ms(mach_absolute_time() - t0);
printf(" Unload: %s (%.2fms)\n", ok ? "OK" : "FAIL", unload_ms);
// Reload (no compile!)
t0 = mach_absolute_time();
ok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e);
double reload_ms = tb_ms(mach_absolute_time() - t0);
printf(" Load (no recompile): %s (%.2fms)\n", ok ? "OK" : (e ? [[e description] UTF8String] : "FAIL"), reload_ms);
if (!ok) {
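printf("\n*** Load-after-overwrite FAILED; trying recompile+load ***\n");
printf(" NOTE: any output change below comes from a full recompile, not a reload-only weight swap\n");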
t0 = mach_absolute_time();
ok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(compileWithQoS:options:error:), 21, @{}, &e);
printf(" Re-compile: %s (%.2fms)\n", ok ? "OK" : "FAIL", tb_ms(mach_absolute_time() - t0));
t0 = mach_absolute_time();
ok = ((BOOL(*)(id,SEL,unsigned int,id,NSError**))objc_msgSend)(mdl, @selector(loadWithQoS:options:error:), 21, @{}, &e);
printf(" Re-load: %s (%.2fms)\n", ok ? "OK" : "FAIL", tb_ms(mach_absolute_time() - t0));
}
// Build new request (re-use same surfaces)
wI = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioIn);
wO = ((id(*)(Class,SEL,IOSurfaceRef))objc_msgSend)(g_AIO, @selector(objectWithIOSurface:), ioOut);
req = ((id(*)(Class,SEL,id,id,id,id,id,id,id))objc_msgSend)(g_AR,
@selector(requestWithInputs:inputIndices:outputs:outputIndices:weightsBuffer:perfStats:procedureIndex:),
@[wI], @[@0], @[wO], @[@0], nil, nil, @0);
// Re-write same input
IOSurfaceLock(ioIn, 0, NULL);
inp = (float*)IOSurfaceGetBaseAddress(ioIn);
for (int c = 0; c < CH; c++)
for (int s = 0; s < SP; s++)
inp[c*SP+s] = (float)(c*SP + s + 1) * 0.01f;
IOSurfaceUnlock(ioIn, 0, NULL);
// Eval with (possibly reloaded) weights B
printf("\n=== Step 4: Eval after reload ===\n");
ok = ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e);
if (!ok) { printf("FAIL: eval after reload: %s\n", e ? [[e description] UTF8String] : "?"); return 1; }
IOSurfaceLock(ioOut, kIOSurfaceLockReadOnly, NULL);
float *outB = (float*)IOSurfaceGetBaseAddress(ioOut);
printf(" Output B[0..3]: [%.4f, %.4f, %.4f, %.4f]\n", outB[0], outB[1], outB[2], outB[3]);
printf(" Output B[%d..%d]: [%.4f, %.4f, %.4f, %.4f]\n", CH*SP-4, CH*SP-1,
outB[CH*SP-4], outB[CH*SP-3], outB[CH*SP-2], outB[CH*SP-1]);
// Check: did the output change?
bool changed = false;
float max_diff = 0;
for (int i = 0; i < CH*SP; i++) {
float d = fabsf(outB[i] - outA_copy[i]);
if (d > max_diff) max_diff = d;
if (d > 0.001f) changed = true;
}
// Expected: output B should be 3x output A
bool correct_3x = true;
float max_3x_err = 0;
for (int i = 0; i < CH*SP; i++) {
float expected = outA_copy[i] * 3.0f;
float err = fabsf(outB[i] - expected);
if (err > max_3x_err) max_3x_err = err;
if (err > 0.1f) correct_3x = false;
}
IOSurfaceUnlock(ioOut, kIOSurfaceLockReadOnly, NULL);
printf("\n=== RESULT ===\n");
printf(" Max A-B diff: %.6f\n", max_diff);
printf(" Max 3x error: %.6f\n", max_3x_err);
printf(" Compile+load: %.1fms | Unload: %.1fms | Reload: %.1fms\n", compile_ms, unload_ms, reload_ms);
if (changed && correct_3x) {
printf("\nSUCCESS: Weight reload works! Output matches 3x identity.\n");
printf(" Speedup: compile+load=%.1fms vs unload+reload=%.1fms (%.1fx faster)\n",
compile_ms, unload_ms + reload_ms, compile_ms / (unload_ms + reload_ms));
printf(">>> Compilation bottleneck can be eliminated <<<\n");
} else if (changed && !correct_3x) {
printf("\nPARTIAL: Output changed but doesn't match expected 3x.\n");
} else {
printf("\nFAIL: Output did NOT change. Weight reload does not work.\n");
printf(" ANE cached the compiled model — weights baked at compile time.\n");
printf(">>> Need alternative: weightsBuffer IOSurface or async recompile <<<\n");
}
// Cleanup
((BOOL(*)(id,SEL,unsigned int,NSError**))objc_msgSend)(mdl, @selector(unloadWithQoS:error:), 21, &e);
[fm removeItemAtPath:td error:nil];
CFRelease(ioIn); CFRelease(ioOut);
free(outA_copy); free(weightsA); free(weightsB);
}
return 0;
}
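
The 128-byte header written by `build_weight_blob` above can be spelled out field by field. The sketch below is plain C and writes every multi-byte field little-endian explicitly. The field meanings in the comments (chunk marker, payload size, payload offset) are inferred from the probe code, not from any Apple documentation, and `build_blob` and `put_u32le` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { BLOB_HEADER_BYTES = 128, CHUNK_HEADER_OFFSET = 64 };

/* Store a 32-bit value little-endian, independent of host byte order. */
static void put_u32le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Build header + payload in one zeroed buffer. The caller frees the result.
 * Returns NULL on allocation failure. */
static uint8_t *build_blob(const void *payload, uint32_t payload_bytes,
                           size_t *out_len) {
    assert(payload != NULL && out_len != NULL);
    size_t total = (size_t)BLOB_HEADER_BYTES + payload_bytes;
    uint8_t *b = calloc(total, 1);
    if (!b) return NULL;
    b[0] = 1;                                          /* same bytes the probe sets */
    b[4] = 2;
    put_u32le(b + CHUNK_HEADER_OFFSET, 0xDEADBEEFu);   /* marker: EF BE AD DE */
    b[CHUNK_HEADER_OFFSET + 4] = 1;
    put_u32le(b + 72, payload_bytes);                  /* inferred: payload size */
    put_u32le(b + 80, BLOB_HEADER_BYTES);              /* inferred: payload offset */
    memcpy(b + BLOB_HEADER_BYTES, payload, payload_bytes);
    *out_len = total;
    return b;
}
```

The MIL text references the blob with `offset=uint64(64)`, which lines up with the marker at byte 64, while the fp16 payload itself starts at byte 128.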


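The batch telemetry in the hunk that follows derives `ane_tflops` and `ane_util_pct` from a FLOP model of the network. The sketch below restates that arithmetic in standalone C so the unit conversion can be checked in isolation. `ModelDims` and the function names are hypothetical; the formulas are copied from the hunk, and the 15.8 TFLOPS peak hard-coded there is passed in here as a parameter.

```c
#include <assert.h>

/* Hypothetical container for the compile-time constants used in the hunk. */
typedef struct { int nlayers, dim, hidden, heads, hd, seq; } ModelDims;

/* Mirrors `bf`: FLOP in the linear projections for one step. */
static double linear_flop_per_step(const ModelDims *m) {
    return m->nlayers * (4.0 * 2 * m->dim * m->dim * m->seq
                       + 2.0 * 2 * m->dim * m->hidden * m->seq
                       + 2.0 * m->hidden * m->dim * m->seq);
}

/* Mirrors `bs`: FLOP in the attention term for one step. */
static double attn_flop_per_step(const ModelDims *m) {
    return m->nlayers * 2.0 * m->heads * 5 * m->seq * m->seq * m->hd;
}

/* Mirrors `ane_f_batch`: (bf*2 + bs) scaled by the steps in the batch. */
static double ane_flop_batch(const ModelDims *m, int steps) {
    return (linear_flop_per_step(m) * 2 + attn_flop_per_step(m)) * steps;
}

/* FLOP / (ms * 1e9) == FLOP / s / 1e12, i.e. TFLOP/s. */
static double tflops(double flop, double elapsed_ms) {
    assert(elapsed_ms > 0);
    return flop / (elapsed_ms * 1e9);
}

/* Utilization as a percentage of an assumed peak throughput. */
static double util_pct(double achieved_tflops, double peak_tflops) {
    assert(peak_tflops > 0);
    return 100.0 * achieved_tflops / peak_tflops;
}
```

Because `tms` is measured in milliseconds, dividing by `tms * 1e9` rather than `tms * 1e12` is what yields TFLOP/s: 1e12 FLOP in 1000 ms is exactly 1 TFLOP/s.
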
@ -581,6 +581,16 @@ int main(int argc, char *argv[]) {
steps_batch++;
if (step % 10 == 0 || step == start_step)
printf("step %-4d loss=%.4f\n", step, loss);
// JSON telemetry to stderr: one object per step; timings are running per-step averages (ms) over the current batch
double step_ane = t_ane/steps_batch, step_io = t_io/steps_batch;
double step_cls = t_cls/steps_batch, step_elem = t_elem/steps_batch;
double step_rms = t_rms/steps_batch, step_cbw = t_cblas_wait/steps_batch;
fprintf(stderr, "{\"type\":\"step\",\"step\":%d,\"loss\":%.6f,"
"\"t_ane\":%.3f,\"t_io\":%.3f,\"t_cls\":%.3f,"
"\"t_elem\":%.3f,\"t_rms\":%.3f,\"t_cblas_wait\":%.3f,"
"\"compiles\":%d}\n",
step, loss, step_ane, step_io, step_cls, step_elem, step_rms, step_cbw, g_compile_count);
}
double tms = tb_ms(mach_absolute_time() - tt);
total_train_ms += tms;
@ -622,6 +632,19 @@ int main(int argc, char *argv[]) {
printf(" ane=%.1f io=%.1f cls=%.1f elem=%.1f rms=%.1f cblas_wait=%.1f ms/step\n",
t_ane/steps_batch, t_io/steps_batch, t_cls/steps_batch, t_elem/steps_batch,
t_rms/steps_batch, t_cblas_wait/steps_batch);
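// JSON batch telemetry to stderr. tms is in ms, so FLOP / (tms * 1e9) is TFLOP/s; utilization is relative to a hard-coded 15.8 TFLOPS peak.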
{
double bf = NLAYERS * (4.0*2*DIM*DIM*SEQ + 2.0*2*DIM*HIDDEN*SEQ + 2.0*HIDDEN*DIM*SEQ);
double bs = NLAYERS * 2.0*HEADS*5*SEQ*SEQ*HD;
double ane_f_batch = (bf*2 + bs) * steps_batch;
double ane_tflops = ane_f_batch / (tms * 1e9);
fprintf(stderr, "{\"type\":\"batch\",\"batch\":%d,\"compile_ms\":%.1f,"
"\"train_ms\":%.1f,\"ms_per_step\":%.1f}\n",
steps_batch, cms, tms, tms/steps_batch);
fprintf(stderr, "{\"type\":\"perf\",\"ane_tflops\":%.3f,\"ane_util_pct\":%.2f}\n",
ane_tflops, 100.0*ane_tflops/15.8);
}
}
// Efficiency report