[test] ANE private API research: chaining, E5 runtime, custom MIL compilation experiments

2026-03-04 21:39:24 +01:00 · 2026-03-04 21:39:24 +01:00 · 99ba013d9b
parent efcf193075
commit 99ba013d9b
11 changed files with 8855 additions and 8 deletions
--- a/docs/ANE_CHAINING_RESEARCH.md
+++ b/docs/ANE_CHAINING_RESEARCH.md
--- a/docs/ANE_INTERNALS.md
+++ b/docs/ANE_INTERNALS.md
@@ -0,0 +1,563 @@
+# ANE Internals: What We Know
+
+A comprehensive guide to Apple's Neural Engine (ANE) based on reverse engineering, private API exploration, and community research. This extends and updates [hollance/neural-engine](https://github.com/hollance/neural-engine/tree/master/docs) with findings from direct hardware experimentation on M4 Max / macOS 15.
+
+---
+
+## Table of Contents
+
+1. [How does the ANE work internally?](#1-how-does-the-ane-work-internally)
+2. [Can I program the ANE directly?](#2-can-i-program-the-ane-directly)
+3. [What can be compiled and run on ANE?](#3-what-can-be-compiled-and-run-on-ane)
+4. [Security and safety mechanisms](#4-security-and-safety-mechanisms)
+5. [Is the ANE 16-bit?](#5-is-the-ane-16-bit)
+6. [ANE vs GPU vs CPU](#6-ane-vs-gpu-vs-cpu)
+7. [Reverse engineering the ANE](#7-reverse-engineering-the-ane)
+8. [How to verify ANE execution](#8-how-to-verify-ane-execution)
+9. [References and external resources](#9-references-and-external-resources)
+
+---
+
+## 1. How does the ANE work internally?
+
+> hollance/neural-engine says: "I don't think anyone outside Apple knows."
+
+We now know substantially more.
+
+### Hardware Architecture
+
+The ANE is a fixed-function neural network accelerator integrated into Apple Silicon SoCs:
+
+| Chip | ANE Cores | Peak TOPS | SRAM Budget |
+|------|-----------|-----------|-------------|
+| A12-A13 | 8 | 5 | ~4 MB |
+| A14/M1 | 16 | 11 | ~16 MB |
+| A15/M2 | 16 | 15.8 | ~24 MB |
+| M4/M4 Pro/M4 Max | 16 | 38 | ~24-32 MB |
+
+SRAM budget measured via `sram_probe.m` performance cliff detection on M4 Max:
+- Peak efficiency at ~12.5 MB weights (282.6 GFLOPS/MB)
+- First spill at ~32 MB (drops to 59.2 GFLOPS/MB)
+- Catastrophic spilling at 128 MB (8.0 GFLOPS/MB)
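+
+The cliff detection can be sketched as follows. This is a minimal illustration, not the contents of `sram_probe.m`: the `find_cliffs` helper and its `drop_ratio` threshold are hypothetical, and the only data points are the three measurements listed above.
+
+```python
+# Hypothetical helper: flag sizes where efficiency falls sharply
+# relative to the previous measurement.
+MEASUREMENTS = [  # (weight size in MB, GFLOPS per MB), taken from the list above
+    (12.5, 282.6),
+    (32.0, 59.2),
+    (128.0, 8.0),
+]
+
+def find_cliffs(points, drop_ratio=0.5):
+    """Return sizes whose efficiency is below drop_ratio x the previous point."""
+    cliffs = []
+    for (_, prev_eff), (size, eff) in zip(points, points[1:]):
+        if eff < drop_ratio * prev_eff:
+            cliffs.append(size)
+    return cliffs
+
+print(find_cliffs(MEASUREMENTS))
+```
+
+With the default threshold both the 32 MB and the 128 MB points are flagged; a real probe would sweep many more sizes to locate the boundary more precisely.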
+
+The ANE operates on FP16 data exclusively. All I/O is through IOSurface shared memory buffers in `[1, C, 1, S]` channel-first FP16 layout.
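+
+As an illustration of that layout, the sketch below packs a row-major `[S][C]` activation matrix into channel-first FP16 bytes. The `pack_channel_first_fp16` name is hypothetical, and the sketch assumes a densely packed buffer; a real IOSurface may add row padding or alignment that is not modelled here.
+
+```python
+import struct
+
+def pack_channel_first_fp16(rows):
+    """rows[s][c] -> bytes in [1, C, 1, S] order: all S values of channel 0, then channel 1, ..."""
+    seq_len = len(rows)
+    channels = len(rows[0])
+    flat = [rows[s][c] for c in range(channels) for s in range(seq_len)]
+    # 'e' is IEEE 754 half precision; '<' selects little-endian byte order.
+    return struct.pack(f"<{len(flat)}e", *flat)
+
+x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # S=3 positions, C=2 channels
+buf = pack_channel_first_fp16(x)
+print(len(buf), struct.unpack("<6e", buf))
+```
+
+Each channel's values end up contiguous in memory, which is the property the `[1, C, 1, S]` convention describes.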
+
+### Compilation Pipeline
+
+There are two paths from a neural network to ANE hardware execution:
+
+**Standard CoreML path** (from [Black Hat Asia 2021, Wish Wu](https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers)):
+
+```
+ML model (TF/PyTorch/Caffe)
+  -> coremltools -> .mlmodel
+  -> coremlc (CoreML compiler) -> .mlmodelc/
+  -> espresso precompile -> net.plist + weights
+  -> ANECompiler (in ane_compiler_service) -> model.hwx
+  -> aned daemon -> H11ANEIn kernel driver (IOKit)
+  -> ANE firmware -> hardware registers
+```
+
+**Direct private API path** (what this project uses):
+
+```
+MIL text + weight blobs (in memory)
+  -> _ANEInMemoryModelDescriptor (ObjC object)
+  -> _ANEInMemoryModel.compileWithQoS: -> ANE binary (in temp dir)
+  -> _ANEInMemoryModel.loadWithQoS: -> loaded onto ANE hardware
+  -> _ANEInMemoryModel.evaluateWithQoS: -> execution via aned
+```
+
+The direct path bypasses CoreML, espresso, and the `.hwx` file format entirely. It compiles MIL (Model Intermediate Language) text directly into an ANE-executable binary, loads it, and runs it. This is how we achieve both training and inference on the ANE without any CoreML dependency.
+
+### System Architecture
+
+```
++------------------+     +------------------+     +------------------+
+| User Process     |     | aned daemon      |     | Kernel           |
+|                  |     |                  |     |                  |
+| _ANEClient  -----+---->| ANE scheduler    +---->| H11ANEIn driver  |
+| (sharedConnection)|    | (all interfaces) |     | (IOKit)          |
+|                  |     |                  |     |                  |
+| App gets 3 IOKit |     | Compiles models  |     | Passes model.hwx |
+| interfaces:      |     | Manages loading  |     | to ANE firmware  |
+|  - open          |     | Handles requests |     |                  |
+|  - close         |     +------------------+     +------------------+
+|  - programSend   |                                      |
+|    Request       |                                      v
++------------------+                              +------------------+
+                                                  | ANE Firmware     |
+                                                  | (co-processor)   |
+                                                  |                  |
+                                                  | Parses register  |
+                                                  | operations from  |
+                                                  | compiled binary  |
+                                                  +------------------+
+```
+
+The `aned` daemon mediates between user processes and the kernel driver. Apps only get 3 IOKit interfaces (open, close, programSendRequest). The daemon has access to all driver interfaces, which is why `_ANEClient.sharedConnection` communicates through the daemon rather than directly to the kernel.
+
+### Execution Paths
+
+We have benchmarked four distinct ways to trigger ANE kernel execution:
+
+| Method | API | Latency (64x32) | Latency (768x256) |
+|--------|-----|------------------|--------------------|
+| Standard | `model.evaluateWithQoS:options:request:error:` | 0.175 ms | 0.205 ms |
+| Real-Time | `client.evaluateRealTimeWithModel:options:request:error:` | 0.093 ms | 0.246 ms |
+| processRequest | `program.processRequest:model:qos:...` | 0.131 ms | 0.185 ms |
+| Direct | `client.doEvaluateDirectWithModel:options:request:qos:error:` | 0.225 ms | N/A |
+
+**Key finding**: At production kernel dimensions (768x256, matching Stories110M), all paths converge to ~0.2 ms per kernel. The RT speedup (1.88x) observed on small 64x32 kernels does not hold at production scale. The standard path remains the most reliable.
+
+### Resource Limits
+
+The ANE runtime leaks internal resources during compilation. After ~119 compiles per process, subsequent compilations fail silently. The workaround is checkpoint-and-restart: save weights and optimizer state, terminate the process, and re-launch with `--resume`.
+
+With `MAX_COMPILES=100` (conservative) and 60 weight-bearing kernels per batch (12 layers x 5 kernels), only 1 training batch fits per process lifetime.
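+
+The budget arithmetic behind that statement, as a small sketch. The constant names other than `MAX_COMPILES` are illustrative and not taken from the codebase.
+
+```python
+MAX_COMPILES = 100        # conservative cap, below the ~119 observed failure point
+LAYERS = 12               # illustrative name
+KERNELS_PER_LAYER = 5     # illustrative name
+
+kernels_per_batch = LAYERS * KERNELS_PER_LAYER
+batches_per_process = MAX_COMPILES // kernels_per_batch
+leftover_compiles = MAX_COMPILES - batches_per_process * kernels_per_batch
+
+print(kernels_per_batch, batches_per_process, leftover_compiles)
+```
+
+A second batch would need 120 compiles in total, which exceeds both the conservative cap and the observed ~119 limit, hence the restart after every batch.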
+
+---
+
+## 2. Can I program the ANE directly?
+
+> hollance/neural-engine says: "Unfortunately not. You can only use the Neural Engine through Core ML."
+
+**Yes, you can.** The `AppleNeuralEngine.framework` contains 67+ private Objective-C classes that provide direct access to the ANE without CoreML. This project uses them for both training and inference.
+
+### Minimal Example
+
+The core compilation/load/execution cycle in pseudocode:
+
+```objc
+#import <dlfcn.h>
+#import <objc/runtime.h>
+
+// Load the private framework
+dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
+
+// Write MIL program as text
+NSData *milData = [@"program(1.0) { ... }" dataUsingEncoding:NSUTF8StringEncoding];
+
+// Create descriptor
+id descriptor = [_ANEInMemoryModelDescriptor modelWithMILText:milData
+                                                      weights:weightDict
+                                                  optionsPlist:nil];
+
+// Compile -> Load -> Run
+NSError *error = nil;
+id model = [_ANEInMemoryModel inMemoryModelWithDescriptor:descriptor];
+[model compileWithQoS:21 options:nil error:&error];
+[model loadWithQoS:21 options:nil error:&error];
+
+// Create IOSurface I/O and request
+id request = [_ANERequest requestWithInputs:@[inputSurface]
+                               inputIndices:@[@0]
+                                    outputs:@[outputSurface]
+                              outputIndices:@[@0]
+                              weightsBuffer:nil
+                                  perfStats:nil
+                             procedureIndex:0];
+
+[model evaluateWithQoS:21 options:nil request:request error:&error];
+```
+
+A complete reusable wrapper is implemented in [`training/ane_runtime.h`](../training/ane_runtime.h) with functions:
+- `ane_init()` -- load framework, resolve classes
+- `ane_compile(kernel, mil_text, weight_dict)` -- compile MIL to ANE binary
+- `ane_run(kernel)` -- standard execution path
+- `ane_free(kernel)` -- unload and release resources
+
+### MIL (Model Intermediate Language)
+
+MIL is Apple's intermediate representation for neural network operations. Key facts:
+
+- Text-based format: `program(1.0) { func main(...) { ... } }`
+- Targets: `ios16`, `ios17`, `ios18` (determines available ops)
+- All tensors are 4D: `[batch, channels, height, width]` or equivalently `[1, C, 1, S]`
+- Convolutions (`conv`) are the workhorse: a 1x1 conv with `[out_ch, in_ch, 1, 1]` weights = matrix multiply
+- Weights referenced via `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(64))`
+- Weights are baked at compile time and cannot be swapped at runtime
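+
+The equivalence between a 1x1 convolution and a matrix multiply can be checked numerically. The sketch below uses plain Python lists with the unit dimensions dropped; `conv1x1` and `matmul` are illustrative helpers, not functions from this project.
+
+```python
+def conv1x1(weights, x):
+    """Apply a 1x1 conv: weights[out_ch][in_ch], x[in_ch][s] -> y[out_ch][s]."""
+    out_ch, in_ch, seq_len = len(weights), len(x), len(x[0])
+    y = [[0.0] * seq_len for _ in range(out_ch)]
+    for s in range(seq_len):  # every position is processed independently
+        for o in range(out_ch):
+            y[o][s] = sum(weights[o][c] * x[c][s] for c in range(in_ch))
+    return y
+
+def matmul(a, b):
+    """Plain matrix product a @ b for lists of rows."""
+    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]
+
+W = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]          # [out_ch=3, in_ch=2, 1, 1]
+X = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]   # [1, C=2, 1, S=4]
+assert conv1x1(W, X) == matmul(W, X)
+print(conv1x1(W, X))
+```
+
+Because the kernel covers a single position, the convolution reduces to one dot product over channels per output element, which is exactly `W @ X`.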
+
+Supported operations include: `conv`, `matmul`, `add`, `mul`, `sigmoid`, `softmax`, `reshape`, `transpose`, `concat`, `reduce_mean`, `rsqrt`, `cast`, `constexpr_affine_dequantize`, and more.
+
+### Alternative: ANECompiler CLI
+
+[ANETools](https://github.com/antgroup-skyward/ANETools) (from Wish Wu / Ant Group) provides command-line tools that invoke the ANECompiler module directly:
+
+```bash
+# Convert mlmodelc to ANE-compatible format
+MLModelCToANECompiler input.mlmodelc output/
+
+# Compile to hardware format
+ANECompiler --target-arch ane_v5 --debug-mask 2147483647 net.plist weights/ output.hwx
+
+# Disassemble compiled binary
+ANEDisassembler output.hwx
+```
+
+The `--debug-mask` flag, set here to 2147483647 (the maximum signed 32-bit integer), generates intermediate files during compilation, revealing internal register operations.
+
+---
+
+## 3. What can be compiled and run on ANE?
+
+The ANE can run any computation expressible as a static MIL (Model Intermediate Language) dataflow graph that the E5 compiler accepts. It is a fixed-function accelerator, not a general-purpose processor: it executes predefined operation graphs, not arbitrary code.
+
+### Verified Operations
+
+These operations have been compiled to custom MIL programs and executed on ANE hardware with output validated against CPU reference implementations (see `test_mil_custom.m`):
+
+| Category | Operations | Notes |
+|----------|-----------|-------|
+| Activations | `relu`, `gelu`, `softmax` | GELU supports EXACT, TANH_APPROXIMATION, SIGMOID_APPROXIMATION modes |
+| Normalization | `layer_norm` | Epsilon type must match gamma/beta dtype |
+| Attention | `scaled_dot_product_attention` | Fused Q@K^T/sqrt(d) + softmax + @V in a single op (iOS 18+) |
+| Linear algebra | `linear` (const weights), `matmul` (runtime tensors) | `linear` requires compile-time constant weights; `matmul` supports runtime inputs |
+| Type conversion | `cast` | fp32 <-> fp16. Required at ANE I/O boundaries |
+| Elementwise | `add`, `mul`, `real_div` | Broadcasting supported |
+| Shape | `reshape`, `transpose`, `concat`, `slice_by_index` | `concat` requires `interleave` param |
+| Composite | Full transformer block (LN + SDPA + Residual + FFN + GELU) | Compiles and runs as a single ANE program (~0.21 ms) |
+
+### Available but Not Yet Tested
+
+These are valid MIL operations that the E5 compiler should accept:
+
+- `conv` -- convolutions (the upstream maderix/ANE repo uses these extensively for training)
+- `reduce_sum`, `reduce_mean`, `reduce_max` -- reductions
+- `gather`, `scatter` -- embedding lookups, KV cache writes
+- `rsqrt`, `sqrt`, `exp`, `log`, `tanh` -- unary math
+- `split`, `slice_by_size` -- tensor slicing
+- `batch_norm`, `instance_norm` -- normalization variants
+- Various pooling, padding, upsampling operations
+
+### What Cannot Run on ANE
+
+| Limitation | Detail |
+|-----------|--------|
+| No control flow | No loops, conditionals, or branching. MIL is a static dataflow graph. |
+| No dynamic shapes | All tensor dimensions must be known at compile time. |
+| No runtime weight updates | Weights are `const`, baked into the compiled binary. Changing weights requires recompilation (~10-50 ms). |
+| No arbitrary memory access | No pointers or indexing beyond what `gather`/`scatter` provide. |
+| No custom ops | Only operations in Apple's MIL op set. No user-defined kernels at the hardware level. |
+| No FP32 compute | ANE computes in FP16 only. FP32 inputs are cast to FP16 internally. |
+
+### Implications for Training
+
+The ANE can execute the forward pass and the matrix math of backpropagation (`matmul` for dX and dW gradients). However, training is impractical because weights are read-only constants. After computing weight gradients on ANE, the optimizer step (W -= lr * dW) must run on CPU, and the MIL program must be recompiled with updated weights before the next forward pass. This recompilation costs ~10-50 ms per step, dominating training time. See [ANE_CHAINING_RESEARCH.md, Section 9](ANE_CHAINING_RESEARCH.md#9-ane-training-feasibility-analysis) for detailed analysis.
+
+---
+
+## 4. Security and safety mechanisms
+
+The ANE has multiple layers of safety enforcement, but Apple's security model assumes access goes through CoreML. The private APIs we use bypass CoreML but still pass through the `aned` daemon and the E5 compiler.
+
+### Compile-Time Safety
+
+| Mechanism | What it does |
+|-----------|-------------|
+| MIL syntax validation | The E5 compiler rejects malformed MIL with `InvalidMILProgram` errors |
+| Type checking | Tensor dtypes, shapes, and parameter types must match exactly. Mismatches cause compile errors (e.g., `layer_norm` epsilon must match gamma/beta dtype; `concat` axis must be `int32` scalar, not tensor) |
+| Op validation | Unknown or unsupported operations are rejected |
+| I/O matching | MIL input/output names and shapes must match the `MLModelDescription` passed to `MLE5Engine` |
+
+### Runtime Safety
+
+| Mechanism | What it does |
+|-----------|-------------|
+| Shape enforcement | Input tensors must match declared shape exactly -- `MultiArray shape doesn't match ML Program's expected shape` error on mismatch |
+| Daemon mediation | ANE runs through the `aned` daemon (system service). User processes only get 3 IOKit interfaces: open, close, `programSendRequest` |
+| IOSurface isolation | I/O memory is managed by the kernel via IOSurface. Cannot read/write arbitrary memory through them |
+| SRAM limits | Programs exceeding the ANE SRAM budget (~24-32 MB on M4 Max) are rejected or fall back to CPU/GPU |
+| Compile limit | ~119 compiled programs per process before the compiler leaks enough resources to fail (resource exhaustion, not a security boundary) |
+
+### Sandbox Interaction
+
+The E5 runtime needs write access to `~/Library/Caches/<binary_name>/` for its ANE specialization cache. The macOS App Sandbox can block this, causing compilation to fail with permission errors. Outside a sandbox (e.g., in command-line tools), the directory is created automatically.
+
+### What is NOT Protected
+
+| Gap | Detail |
+|-----|--------|
+| No access control | No authentication or entitlement check for using the private APIs. Any process can call `_ANEClient.sharedConnection` |
+| No rate limiting | Programs can be compiled in a loop until the ~119-compile limit is hit and resources are exhausted |
+| No MIL signing | No code signing validation on MIL text -- any syntactically valid program that passes the compiler's type checks will execute |
+| No isolation between programs | Multiple programs from the same process share the ANE with no hardware-level isolation (the daemon schedules them) |
+
+### Practical Risk Assessment
+
+The ANE attack surface is limited because:
+
+1. **Fixed-function hardware**: The ANE executes predefined neural network operations, not arbitrary instructions. There is no instruction pointer, no stack, and no way to jump to arbitrary code.
+2. **Typed dataflow**: MIL programs operate on typed tensors with fixed shapes. There are no buffer overflows in the traditional sense -- the compiler enforces all dimensions at compile time.
+3. **Daemon intermediary**: All ANE access goes through `aned`, which validates requests before forwarding to the kernel driver. Direct IOKit access to the ANE is restricted to 3 interfaces.
+4. **No persistent state**: ANE programs don't persist across reboots. Compiled programs live in temp directories and caches that are cleaned by the OS.
+
+The main risk of the private APIs is **stability**: these APIs are undocumented and may change with any macOS update, potentially breaking programs that depend on them.
+
+---
+
+## 5. Is the ANE 16-bit?
+
+> hollance/neural-engine says: "It appears so."
+
+**Confirmed.** The ANE operates in FP16 for both compute and storage:
+
+- All IOSurface I/O must be FP16. Passing FP32 data produces zeros.
+- MIL programs must use `fp16` I/O types (setting `g_fp16_io=1` in our codebase)
+- FP32-to-FP16 conversion happens on the CPU before writing to IOSurfaces
+- FP16 range limits: magnitudes above 65504 overflow, and magnitudes below ~5.96e-8 (the smallest subnormal) underflow to zero
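+
+These limits can be reproduced on the CPU with Python's half-precision `struct` format. This is a sketch of the number format only, not of the project's conversion code; `to_fp16` is a hypothetical helper. Note that `struct` raises `OverflowError` for out-of-range values, whereas a hardware conversion would typically produce infinity.
+
+```python
+import struct
+
+def to_fp16(value):
+    """Round a Python float to FP16 and back to a float."""
+    return struct.unpack("<e", struct.pack("<e", value))[0]
+
+print(to_fp16(65504.0))     # largest finite FP16 value is preserved
+print(to_fp16(2.0 ** -24))  # smallest subnormal (~5.96e-8) is preserved
+print(to_fp16(1e-8))        # below the subnormal range: rounds to zero
+try:
+    to_fp16(70000.0)
+except OverflowError as exc:  # struct refuses values beyond the FP16 range
+    print("overflow:", exc)
+```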
+
+### Quantization Support
+
+| Format | ANE Native? | Notes |
+|--------|------------|-------|
+| FP16 | Yes | Native compute and storage format |
+| INT8 | Partial | Memory bandwidth savings only, no compute speedup. `constexpr_affine_dequantize` in MIL dequantizes to FP16 before compute |
+| Q4 | No | Not supported. Requires GPU (Metal) or CPU dequantization |
+| FP32 | No | Internally converted to FP16; higher precision lost |
+
+Apple quotes ANE TOPS for INT8 operations, so the 38 TOPS figure for M4 corresponds to roughly 19 TFLOPS in FP16: the marketing number counts each FP16 operation as two INT8 operations.
+
+---
+
+## 6. ANE vs GPU vs CPU
+
+Benchmarked on Qwen2.5-0.5B (dim=896, 24 layers, 494M params) on M4 Max:
+
+### Decode Performance (single-token generation)
+
+| Engine | Format | Weight Size | Decode t/s | Bottleneck |
+|--------|--------|-------------|------------|------------|
+| CPU AMX (cblas_sgemv) | F32 | 1.97 GB | ~91 t/s | Memory bandwidth |
+| CPU AMX (cblas_sgemv) | F16->F32 | 658 MB disk | ~91 t/s | Memory bandwidth (F32 in RAM) |
+| CPU AMX (cblas_sgemv) | Q4->F32 | 188 MB disk | ~91 t/s | Memory bandwidth (dequant at load) |
+| Metal GPU (Q4 SIMD) | Q4 | 188 MB | ~10 t/s | Dispatch overhead (~400 dispatches/token) |
+| LM Studio (MLX) | Q4 MLX | ~188 MB | 258-496 t/s | Optimized Metal kernels |
+
+### Prefill Performance (batch prompt processing)
+
+| Engine | Format | Prefill t/s | Method |
+|--------|--------|-------------|--------|
+| CPU AMX (cblas_sgemm) | F32 | 880-960 t/s | Batched matmul |
+| CPU AMX (cblas_sgemv) | F32 | ~40 t/s | Sequential per-token |
+
+### ANE Training Kernel Performance
+
+| Metric | Value |
+|--------|-------|
+| Kernel latency | ~0.2 ms per kernel (768x256 production dims) |
+| Peak TFLOPS | 11.14 (128x conv 512ch sp64) |
+| Sustained training | 1.29-1.68 TFLOPS |
+| ANE utilization | 8-11% of peak |
+
+### When to use each
+
+- **ANE**: Best for parallel FP16 operations where data stays on-chip (training kernels, fused attention). The ~119 compile limit and FP16-only restriction are significant constraints.
+- **GPU (Metal)**: Best for large models (dim >= 4096) where native quantized matmul kernels (as in MLX/llama.cpp) can read Q4/Q8 data directly from GPU memory. Dispatch overhead dominates for small models.
+- **CPU AMX**: Best for small/medium model decode (dim <= 896). `cblas_sgemv` uses the AMX coprocessor internally and achieves ~33% of theoretical memory bandwidth. Our own NEON, multithreaded, and Metal implementations did not beat it at this model size; only the optimized MLX Metal kernels in the decode table do.
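+
+The ~33% bandwidth figure can be approximated from the decode table. The sketch assumes that every token streams the full F32 weight set from memory once, and that the peak is 546 GB/s, Apple's quoted figure for the top M4 Max configuration; neither assumption is stated in the measurements above.
+
+```python
+WEIGHT_BYTES = 1.97e9      # F32 weights, from the decode table
+TOKENS_PER_SECOND = 91     # measured cblas_sgemv decode rate
+PEAK_BANDWIDTH = 546e9     # assumption: quoted peak for the top M4 Max configuration
+
+achieved = WEIGHT_BYTES * TOKENS_PER_SECOND  # bytes streamed per second
+utilization = achieved / PEAK_BANDWIDTH
+print(f"{achieved / 1e9:.0f} GB/s, {utilization:.0%} of peak")
+```
+
+On a lower-bandwidth M4 Max configuration the same decode rate would correspond to a higher utilization.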
+
+---
+
+## 7. Reverse engineering the ANE
+
+### Prior Work
+
+| Project | Focus | Key Contribution |
+|---------|-------|-------------------|
+| [hollance/neural-engine](https://github.com/hollance/neural-engine) | CoreML-level documentation | Comprehensive device list, layer compatibility, model surgery guides |
+| [geohot/tinygrad ANE](https://github.com/tinygrad/tinygrad) | Driver-level reverse engineering | Initial IOKit driver analysis, ANE instruction format exploration |
+| [Black Hat Asia 2021 (Wish Wu)](https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers) | Full stack: ML to HW registers | Documented compilation pipeline, .hwx format, security attack surfaces, FaceID ANE usage. Created ANEDisassembler. [Video](https://www.youtube.com/watch?v=1wvBDUnPNEo) |
+| [ANETools](https://github.com/antgroup-skyward/ANETools) | CLI compilation and disassembly | ANECompiler CLI wrapper, ANEDisassembler for .hwx files, `debug_mask` flag for intermediate output |
+| [eiln/anecc](https://github.com/eiln/anecc) | Independent ANE compiler | CoreML-to-ANE compiler for Asahi Linux, alternative compilation path |
+| [freedomtan/coreml_to_ane_hwx](https://github.com/freedomtan/coreml_to_ane_hwx) | CoreML to .hwx conversion | Direct converter bypassing some CoreML steps |
+| [maderix/ANE](https://github.com/maderix/ANE) | Training on ANE | First neural network training on ANE via private APIs |
+| [maderix Substack](https://open.substack.com/pub/maderix/p/inside-the-m4-apple-neural-engine) | M4 ANE deep-dive | Detailed M4 ANE architecture analysis, SRAM probing, kernel fusion |
+
+### Our Discoveries: Private API Class Hierarchy
+
+We have documented 20+ private Objective-C classes in `AppleNeuralEngine.framework`:
+
+```
+NSObject
+|-- _ANEClient (singleton, daemon connection)
+|   Methods: sharedConnection, evaluateWithModel:, evaluateRealTimeWithModel:,
+|            doEvaluateDirectWithModel:, prepareChainingWithModel:,
+|            enqueueSetsWithModel:, buffersReadyWithModel:,
+|            beginRealTimeTask, endRealTimeTask
+|
+|-- _ANEInMemoryModelDescriptor (MIL + weights spec)
+|   Factory: +modelWithMILText:weights:optionsPlist:
+|
+|-- _ANEInMemoryModel (compile/load/run)
+|   Methods: compileWithQoS:, loadWithQoS:, evaluateWithQoS:, unloadWithQoS:
+|   Props: hexStringIdentifier, programHandle (uint64), program, perfStatsMask
+|
+|-- _ANEModel (disk-based compiled model -- 52 instance methods)
+|   Factory: +modelAtURL:key:, +modelAtURL:key:modelAttributes:
+|   Methods: getUUID, inputSymbolIndicesForProcedureIndex:,
+|            outputSymbolIndicesForProcedureIndex:
+|   Props: mapper, program
+|
+|-- _ANERequest (I/O surface packaging)
+|   Factory: +requestWithInputs:inputIndices:outputs:outputIndices:
+|             weightsBuffer:perfStats:procedureIndex:
+|
+|-- _ANEIOSurfaceObject (thin IOSurface wrapper)
+|   Factory: +objectWithIOSurface:
+|
+|-- _ANEBuffer (IOSurfaceObject + symbolIndex + source) [KEY DISCOVERY]
+|   Factory: +bufferWithIOSurfaceObject:symbolIndex:source:
+|   source: 0=ANE, 1=output, 2=unknown
+|
+|-- _ANEChainingRequest (multi-op pipeline)
+|   Factory: +chainingRequestWithInputs:outputSets:lbInputSymbolId:
+|             lbOutputSymbolId:procedureIndex:signalEvents:
+|             transactionHandle:fwEnqueueDelay:memoryPoolId:
+|   Methods: validate
+|
+|-- _ANEIOSurfaceOutputSets (output packaging for chaining)
+|   Factory: +objectWithstatsSurRef:outputBuffer:
+|   Note: requires non-NULL statsSurRef (any IOSurface works, even 64 bytes)
+|
+|-- _ANEInputBuffersReady (input signaling for chaining)
+|   Factory: +inputBuffersWithProcedureIndex:inputBufferInfoIndex:
+|             inputFreeValue:executionDelay:
+|
+|-- _ANEOutputSetEnqueue (output pipeline config for chaining)
+|   Factory: +outputSetWithProcedureIndex:setIndex:signalValue:
+|             signalNotRequired:isOpenLoop:
+|
+|-- _ANEProgramForEvaluation (lower-level program)
+|   Factory: +programWithHandle:intermediateBufferHandle:queueDepth:
+|   Methods: processRequest:model:qos:qIndex:modelStringID:options:
+|             returnValue:error:
+|
+|-- _ANEProgramIOSurfacesMapper (symbol-to-surface mapping)
+|   Factory: +mapperWithProgramHandle:, +mapperWithController:
+|   Note: only works with _ANEModel, not _ANEInMemoryModel
+|
+|-- _ANEPerformanceStats
+|   Factory: +statsWithHardwareExecutionNS:
+|   Props: hwExecutionTime, performanceCounters
+|
+|-- _ANESharedSignalEvent (hardware signal fence)
+|   Factory: +signalEventWithValue:symbolIndex:eventType:sharedEvent:
+|   Requires IOSurfaceSharedEvent objects
+|
+|-- _ANESharedWaitEvent (hardware wait fence)
+|   Factory: +waitEventWithValue:sharedEvent:
+|   Requires IOSurfaceSharedEvent objects
+|
+|-- _ANEModelInstanceParameters, _ANEDeviceController, _ANEQoSMapper
+```
+
+Full details with experiment logs: [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md)
+
+### ChainingRequest API Status
+
+The `_ANEChainingRequest` API is designed to pipeline multiple ANE operations without CPU round-trips. Current status:
+
+- `_ANEChainingRequest.validate` returns **YES** (with `_ANEBuffer` inputs + `_ANEIOSurfaceOutputSets` outputs)
+- `prepareChainingWithModel:` **fails** -- calls `getUUID` on `_ANEInMemoryModel` which lacks it
+- Requires `_ANEModel` (disk-based compiled model) which has `getUUID` and symbol index methods
+- `_ANEModel` factory methods require a `key:` parameter; the hex identifier from `_ANEInMemoryModel` is the likely key
+
+This is the highest-priority research area. Chaining would eliminate the ~23 CPU-ANE round-trips per token in a 12-layer model, potentially enabling on-chip pipeline execution.
+
+### model.hwx Binary Format
+
+The `.hwx` file is the compiled hardware representation loaded by the ANE kernel driver. From Wu's Black Hat research:
+
+- Mach-O format binary containing register operations
+- Compiled from `net.plist` + weights by the ANECompiler module
+- Loaded by the `H11ANEIn` kernel driver via the `programCreate` interface
+- ANE firmware parses it to extract register addresses and values
+- Can be disassembled with [ANETools/ANEDisassembler](https://github.com/antgroup-skyward/ANETools)
+
+Our `_ANEInMemoryModel` path bypasses `.hwx` generation -- the model goes directly from MIL to an internal binary format in a temp directory. Whether this temp directory contains an equivalent to `.hwx` is an open question (see [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md) for next steps).
+
+---
+
+## 8. How to verify ANE execution
+
+### Power Monitoring
+
+```bash
+sudo powermetrics --samplers ane_power -i 1000
+```
+
+Shows real-time ANE power draw. Active ANE usage typically shows 2-4 W on M4 Max during training.
+
+### Performance Statistics
+
+```objc
+model.perfStatsMask = 0xFF;
+// After execution:
+// model.performanceCounters -- returns nil on current macOS (limited API)
+```
+
+The `_ANEPerformanceStats` class exists and can be instantiated via `+statsWithHardwareExecutionNS:`, but the hardware counters are not populated on the current macOS/M4 combination. The `perfStatsMask` property is accepted but `performanceCounters` returns nil after execution.
+
+### IOSurface Output Validation
+
+Read back FP16 data from output IOSurfaces and compare against CPU reference:
+
+```objc
+// Lock first: IOSurface memory should only be read by the CPU while locked.
+IOSurfaceLock(surface, kIOSurfaceLockReadOnly, NULL);
+_Float16 *out = (_Float16 *)IOSurfaceGetBaseAddress(surface);
+for (int i = 0; i < n; i++) {   // n = number of FP16 elements in the output
+    float val = (float)out[i];
+    // Compare against CPU reference
+}
+IOSurfaceUnlock(surface, kIOSurfaceLockReadOnly, NULL);
+```
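+
+The comparison step itself needs an FP16-aware tolerance, since exact equality with an FP32 reference is not expected. A minimal sketch follows; the `max_mismatch` helper and both tolerance values are assumptions chosen for illustration, not thresholds used by this project.
+
+```python
+import math
+
+FP16_EPS = 2.0 ** -10  # spacing of FP16 values near 1.0 (10 mantissa bits)
+
+def max_mismatch(ane_out, cpu_ref, rel_tol=4 * FP16_EPS, abs_tol=1e-3):
+    """Return (index, ane, ref) for the worst out-of-tolerance element, or None."""
+    worst = None
+    for i, (a, r) in enumerate(zip(ane_out, cpu_ref)):
+        if not math.isclose(a, r, rel_tol=rel_tol, abs_tol=abs_tol):
+            err = abs(a - r)
+            if worst is None or err > worst[0]:
+                worst = (err, i, a, r)
+    return None if worst is None else worst[1:]
+
+print(max_mismatch([1.0, 2.001], [1.0, 2.0]))  # within tolerance
+print(max_mismatch([1.0, 2.5], [1.0, 2.0]))    # reports the failing element
+```
+
+Deep graphs accumulate rounding error, so the relative tolerance may need to be loosened for outputs that pass through many FP16 operations.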
+
+### ANE Compiler Debug Output
+
+From Wu's research, the ANECompiler module has a `debug_mask` flag. Setting it to `2147483647` (the maximum signed 32-bit integer) generates intermediate files during compilation, revealing:
+- Register operation sequences
+- Memory allocation decisions
+- Tiling strategies
+- Weight layout in SRAM
+
+This can be applied when using the ANECompiler CLI tools from [ANETools](https://github.com/antgroup-skyward/ANETools).
+
+---
+
+## 9. References and external resources
+
+### Documentation and Research
+
+| Resource | URL | Focus |
+|----------|-----|-------|
+| hollance/neural-engine | https://github.com/hollance/neural-engine | CoreML-level ANE docs |
+| maderix Substack | https://open.substack.com/pub/maderix/p/inside-the-m4-apple-neural-engine | M4 ANE architecture |
+| Black Hat Asia 2021 | https://infocondb.org/con/black-hat/black-hat-asia-2021/apple-neural-engine-internal-from-ml-algorithm-to-hw-registers | Full stack reverse engineering |
+| BH Asia 2021 Video | https://www.youtube.com/watch?v=1wvBDUnPNEo | 30-min talk by Wish Wu |
+| Apple ML Research | https://machinelearning.apple.com/research/neural-engine-transformers | Deploying transformers on ANE |
+| ANE Supported Devices | https://github.com/hollance/neural-engine/blob/master/docs/supported-devices.md | Comprehensive device/chip list |
+
+### Tools
+
+| Tool | URL | Purpose |
+|------|-----|---------|
+| ANETools | https://github.com/antgroup-skyward/ANETools | ANECompiler CLI, ANEDisassembler |
+| eiln/anecc | https://github.com/eiln/anecc | Independent ANE compiler (Asahi Linux) |
+| freedomtan/coreml_to_ane_hwx | https://github.com/freedomtan/coreml_to_ane_hwx | CoreML to .hwx converter |
+| coremltools | https://github.com/apple/coremltools | Apple's official ML model tools |
+
+### Projects Using ANE Directly
+
+| Project | URL | What it does |
+|---------|-----|-------------|
+| maderix/ANE | https://github.com/maderix/ANE | Training on ANE (this project's upstream) |
+| dev-erik/ANE | https://github.com/dev-erik/ANE | This fork: inference optimization, ChainingRequest research |
+
+### This Project's ANE Documentation
+
+| Document | Description |
+|----------|-------------|
+| [ANE_INTERNALS.md](ANE_INTERNALS.md) | This file -- comprehensive ANE internals guide |
+| [ANE_CHAINING_RESEARCH.md](ANE_CHAINING_RESEARCH.md) | ChainingRequest API research, experiment logs, benchmarks |
+| [ARCHITECTURE.md](ARCHITECTURE.md) | Training system architecture, kernel fusion map, data flow |
+| [API_REFERENCE.md](API_REFERENCE.md) | Complete function index for all source files |
+| [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) | M4 Max benchmark results (training, TFLOPS, SRAM) |
--- a/training/Makefile
+++ b/training/Makefile
@@ -1,14 +1,21 @@
 CC = xcrun clang
-CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc
+CC_C = xcrun clang
+
+ANE_COMPAT = -Wno-deprecated-declarations
+SEC_FLAGS = -fstack-protector-strong -Wformat-security
+
+CFLAGS = -O2 -Wall $(ANE_COMPAT) -fobjc-arc $(SEC_FLAGS)
+CFLAGS_C = -O2 -Wall -Wextra -Werror -std=c11
+CFLAGS_DEBUG = -O0 -g -Wall $(ANE_COMPAT) -fobjc-arc -fsanitize=address,undefined
 FRAMEWORKS = -framework Foundation -framework CoreML -framework IOSurface
 LDFLAGS = $(FRAMEWORKS) -ldl

-HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h
+HEADERS_LARGE = stories_config.h stories_io.h stories_mil.h stories_cpu_ops.h data_validation.h

 HEADERS_ANE = $(HEADERS_LARGE) ane_rmsnorm_bwd.h ane_classifier.h

 train: train.m ane_runtime.h ane_mil_gen.h model.h forward.h backward.h
-	$(CC) $(CFLAGS) -o $@ train.m $(LDFLAGS)
+	$(CC) $(CFLAGS) -o $@ train.m $(LDFLAGS) -framework Accelerate

 train_large: train_large.m $(HEADERS_LARGE)
 	$(CC) $(CFLAGS) -o $@ train_large.m $(LDFLAGS) -framework Accelerate
@@ -16,6 +23,14 @@ train_large: train_large.m $(HEADERS_LARGE)
 train_large_ane: train_large_ane.m $(HEADERS_ANE)
 	$(CC) $(CFLAGS) -o $@ train_large_ane.m $(LDFLAGS) -framework Accelerate

+HEADERS_OPT = $(HEADERS_LARGE) stories_cpu_ops_opt.h
+
+train_opt: train_opt.m $(HEADERS_OPT)
+	$(CC) $(CFLAGS) -o $@ train_opt.m $(LDFLAGS) -framework Accelerate -framework Metal -framework MetalPerformanceShaders
+
+train_double_buffer: train_double_buffer.m $(HEADERS_LARGE)
+	$(CC) $(CFLAGS) -o $@ train_double_buffer.m $(LDFLAGS) -framework Accelerate
+
 PROBES = test_weight_reload test_perf_stats test_qos_sweep test_ane_advanced

 test_rmsnorm_bwd: test_rmsnorm_bwd.m $(HEADERS_ANE)
@@ -36,13 +51,56 @@ test_qos_sweep: test_qos_sweep.m
 test_ane_advanced: test_ane_advanced.m
 	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)

+test_chaining: test_chaining.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+
+test_chaining_v2: test_chaining_v2.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+
+test_bench_paths: test_bench_paths.m ane_runtime.h
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+
+test_ane_model: test_ane_model.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Metal
+
+test_throughput_ceiling: test_throughput_ceiling.m ane_runtime.h
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
+
+test_coreml_chaining: test_coreml_chaining.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Metal
+
+test_e5_validate: test_e5_validate.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Metal
+
+test_mil_custom: test_mil_custom.m
+	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) -framework Accelerate
+
+test_data_validation: test_data_validation.c data_validation.h
+	$(CC_C) $(CFLAGS_C) -o $@ $<
+
 probes: $(PROBES)

+security-tests: test_data_validation
+
+data: tokenize
+	@bash download_data.sh
+
 tokenize:
 	python3 tokenize.py

+setup: data
+	@echo "=== Setup complete ==="
+	@echo "Data:  tinystories_data00.bin"
+	@echo "To train: make train_large && ./train_large"
+	@echo "Override paths: ANE_MODEL_PATH=... ANE_DATA_PATH=... ./train_large"
+
+verify-flags:
+	@echo "=== Active CFLAGS ==="
+	@echo "$(CFLAGS)"
+	@echo "=== Compiler version ==="
+	@xcrun clang --version
+
 clean:
-	rm -f train train_large train_large_ane $(PROBES) test_rmsnorm_bwd test_classifier
-
-.PHONY: clean tokenize probes
+	rm -f train train_large train_large_ane train_opt train_double_buffer $(PROBES) test_rmsnorm_bwd test_classifier test_data_validation test_chaining test_chaining_v2 test_bench_paths test_ane_model test_throughput_ceiling test_coreml_chaining test_e5_validate test_mil_custom

+.PHONY: clean tokenize probes security-tests verify-flags data setup
--- a/training/ane_runtime.h
+++ b/training/ane_runtime.h
@ -20,15 +20,33 @@ typedef struct {

 static Class g_ANEDesc, g_ANEInMem, g_ANEReq, g_ANEIO;
 static bool g_ane_loaded = false;
+static id g_ane_client = nil;
+static bool g_ane_ok = false;

 static void ane_init(void) {
    if (g_ane_loaded) return;
-    dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
+    g_ane_loaded = true;  // Set first to prevent re-entry (ref: CRIT-01)
+    void *handle = dlopen(
+        "/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine",
+        RTLD_NOW);
+    if (!handle) {
+        fprintf(stderr, "ANE: dlopen failed: %s\n", dlerror());
+        return;
+    }
    g_ANEDesc  = NSClassFromString(@"_ANEInMemoryModelDescriptor");
    g_ANEInMem = NSClassFromString(@"_ANEInMemoryModel");
    g_ANEReq   = NSClassFromString(@"_ANERequest");
    g_ANEIO    = NSClassFromString(@"_ANEIOSurfaceObject");
-    g_ane_loaded = true;
+    if (!g_ANEDesc || !g_ANEInMem || !g_ANEReq || !g_ANEIO) {
+        fprintf(stderr, "ANE: Private classes not found (macOS version mismatch?)\n");
+        return;
+    }
+    g_ane_ok = true;
+
+    Class clientCls = NSClassFromString(@"_ANEClient");
+    if (clientCls) {
+        g_ane_client = [clientCls performSelector:@selector(sharedConnection)];
+    }
 }

 static IOSurfaceRef ane_create_surface(size_t bytes) {
@ -50,6 +68,7 @@ static ANEKernel *ane_compile(NSData *milText, NSData *weightData,
                               int nInputs, size_t *inputSizes,
                               int nOutputs, size_t *outputSizes) {
    ane_init();
+    if (!g_ane_ok) { fprintf(stderr, "ANE: not available\n"); return NULL; }  // CRIT-01/02
    NSError *e = nil;

    NSDictionary *wdict = nil;
@ -63,6 +82,7 @@ static ANEKernel *ane_compile(NSData *milText, NSData *weightData,

    id mdl = ((id(*)(Class,SEL,id))objc_msgSend)(
        g_ANEInMem, @selector(inMemoryModelWithDescriptor:), desc);
+    if (!mdl) { fprintf(stderr, "ANE: inMemoryModel allocation failed\n"); return NULL; }  // CRIT-02

    // Pre-populate temp dir with MIL + weights
    id hx = ((id(*)(id,SEL))objc_msgSend)(mdl, @selector(hexStringIdentifier));
@ -151,6 +171,20 @@ static bool ane_eval(ANEKernel *k) {
    return ok;
 }

+static bool ane_eval_rt(ANEKernel *k) {
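+    // Fall back when the client is missing or lacks the private real-time selector.
+    if (!g_ane_client || ![g_ane_client respondsToSelector:
+            @selector(evaluateRealTimeWithModel:options:request:error:)])
+        return ane_eval(k);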
+    NSError *e = nil;
+    BOOL ok = ((BOOL(*)(id,SEL,id,id,id,NSError**))objc_msgSend)(
+        g_ane_client, @selector(evaluateRealTimeWithModel:options:request:error:),
+        k->model, @{}, k->request, &e);
+    if (!ok) {
+        fprintf(stderr, "ANE RT eval failed, falling back to standard: %s\n",
+                e ? [[e description] UTF8String] : "unknown");
+        return ane_eval(k);
+    }
+    return true;
+}
+
 static void ane_free(ANEKernel *k) {
    if (!k) return;
    NSError *e = nil;
--- a/training/test_ane_model.m
+++ b/training/test_ane_model.m
--- a/training/test_bench_paths.m
+++ b/training/test_bench_paths.m
@ -0,0 +1,148 @@
+// test_bench_paths.m — Benchmark ANE evaluation paths at production dimensions
+// Compares: standard, RT, processRequest, and ane_eval_rt wrapper
+#import <Foundation/Foundation.h>
+#import <objc/runtime.h>
+#import <objc/message.h>
+#import <dlfcn.h>
+#import <IOSurface/IOSurface.h>
+#import <mach/mach_time.h>
+
+static mach_timebase_info_data_t g_tb;
+static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
+static int g_fp16_io = 0;
+
+#include "ane_runtime.h"
+
+static NSString *gen_bench_conv(int ch, int sp) {
+    return [NSString stringWithFormat:
+        @"program(1.0)\n[buildInfo = dict<tensor<string, []>, tensor<string, []>>({{\"coremlc-version\", \"3505.4.1\"}})]\n{\n"
+        "    func main<ios16>(tensor<fp16, [1, %d, 1, %d]> x) {\n"
+        "        tensor<string, []> pt = const()[name=tensor<string, []>(\"pt\"), val=tensor<string, []>(\"valid\")];\n"
+        "        tensor<int32, [2]> st = const()[name=tensor<string, []>(\"st\"), val=tensor<int32, [2]>([1,1])];\n"
+        "        tensor<int32, [4]> pd = const()[name=tensor<string, []>(\"pd\"), val=tensor<int32, [4]>([0,0,0,0])];\n"
+        "        tensor<int32, [2]> dl = const()[name=tensor<string, []>(\"dl\"), val=tensor<int32, [2]>([1,1])];\n"
+        "        tensor<int32, []> gr = const()[name=tensor<string, []>(\"gr\"), val=tensor<int32, []>(1)];\n"
+        "        tensor<fp16, [%d,%d,1,1]> W = const()[name=tensor<string, []>(\"W\"), "
+        "val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=tensor<string, []>(\"@model_path/weights/weight.bin\"), offset=tensor<uint64, []>(64)))];\n"
+        "        tensor<fp16, [1,%d,1,%d]> y = conv(dilations=dl,groups=gr,pad=pd,pad_type=pt,strides=st,weight=W,x=x)"
+        "[name=tensor<string, []>(\"conv\")];\n"
+        "    } -> (y);\n}\n", ch, sp, ch, ch, ch, ch, ch, sp];
+}
+
+int main(int argc, char **argv) {
+    @autoreleasepool {
+        setbuf(stdout, NULL);
+        mach_timebase_info(&g_tb);
+
+        printf("=== ANE Eval Path Benchmark (production dimensions) ===\n\n");
+
+        ane_init();
+        if (!g_ane_ok) { printf("FATAL: ANE not available\n"); return 1; }
+
+        typedef struct { int ch; int sp; const char *label; } TestConfig;
+        TestConfig configs[] = {
+            {64,  32,  "64x32  (test)"},
+            {128, 64,  "128x64 (small)"},
+            {256, 64,  "256x64 (med)"},
+            {768, 256, "768x256 (prod)"},
+            {512, 64,  "512x64 (large)"},
+        };
+        int nconfigs = sizeof(configs) / sizeof(configs[0]);
+        int WARMUP = 20, ITERS = 200;
+
+        id client = g_ane_client;
+        printf("  Client: %s | Warmup: %d | Iters: %d\n\n", client ? "OK" : "NO", WARMUP, ITERS);
+        printf("%-18s %10s %14s %14s %14s\n", "Config", "Standard", "RT", "ProcReq", "ane_eval_rt");
+        printf("%-18s %10s %14s %14s %14s\n", "------", "--------", "--", "-------", "-----------");
+
+        for (int ci = 0; ci < nconfigs; ci++) {
+            int CH = configs[ci].ch, SP = configs[ci].sp;
+
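+            // Diagonal fp16 weights (0.5 on the diagonal), so the 1x1 conv computes y = 0.5 * x.
+            _Float16 *w = (_Float16*)calloc(CH*CH, sizeof(_Float16));
+            for (int i = 0; i < CH; i++) w[i*CH+i] = (_Float16)0.5f;
+            // Blob = 128-byte header + fp16 payload; the MIL BLOBFILE offset (64) points at the second half of the header.
+            int ws = CH*CH*2, tot = 128+ws;
+            uint8_t *blob = (uint8_t*)calloc(tot, 1);
+            blob[0]=1; blob[4]=2; blob[64]=0xEF; blob[65]=0xBE; blob[66]=0xAD; blob[67]=0xDE; blob[68]=1;  // 0xDEADBEEF, little-endian, at offset 64
+            *(uint32_t*)(blob+72)=ws; *(uint32_t*)(blob+80)=128;  // payload size in bytes at 72, payload start offset at 80
+            memcpy(blob+128, w, ws);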
+            NSData *wdata = [NSData dataWithBytesNoCopy:blob length:tot freeWhenDone:YES];
+            free(w);
+
+            g_fp16_io = 1;
+            NSString *mil = gen_bench_conv(CH, SP);
+            NSData *milData = [mil dataUsingEncoding:NSUTF8StringEncoding];
+            size_t ioBytes = CH * SP * 2;
+            ANEKernel *k = ane_compile(milData, wdata, 1, &ioBytes, 1, &ioBytes);
+            if (!k) { printf("%-18s  (compile failed)\n", configs[ci].label); continue; }
+
+            IOSurfaceLock(k->ioInputs[0], 0, NULL);
+            _Float16 *inp = (_Float16*)IOSurfaceGetBaseAddress(k->ioInputs[0]);
+            for (int i = 0; i < CH*SP; i++) inp[i] = (_Float16)1.0f;
+            IOSurfaceUnlock(k->ioInputs[0], 0, NULL);
+
+            NSError *e = nil;
+
+            for (int i = 0; i < WARMUP; i++) ane_eval(k);
+            uint64_t t0 = mach_absolute_time();
+            for (int i = 0; i < ITERS; i++) ane_eval(k);
+            double std_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
+
+            double rt_ms = -1;
+            if (client) {
+                @try {
+                    for (int i = 0; i < WARMUP; i++)
+                        ((BOOL(*)(id,SEL,id,id,id,NSError**))objc_msgSend)(
+                            client, @selector(evaluateRealTimeWithModel:options:request:error:),
+                            k->model, @{}, k->request, &e);
+                    t0 = mach_absolute_time();
+                    for (int i = 0; i < ITERS; i++)
+                        ((BOOL(*)(id,SEL,id,id,id,NSError**))objc_msgSend)(
+                            client, @selector(evaluateRealTimeWithModel:options:request:error:),
+                            k->model, @{}, k->request, &e);
+                    rt_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
+                } @catch (NSException *ex) { rt_ms = -1; }
+            }
+
+            double proc_ms = -1;
+            @try {
+                id prog = [k->model valueForKey:@"program"];
+                id hexId = [k->model valueForKey:@"hexStringIdentifier"];
+                SEL procSel = @selector(processRequest:model:qos:qIndex:modelStringID:options:returnValue:error:);
+                if (prog && [prog respondsToSelector:procSel]) {
+                    for (int i = 0; i < WARMUP; i++) {
+                        BOOL rv = NO;
+                        ((BOOL(*)(id,SEL,id,id,unsigned int,int,id,id,BOOL*,NSError**))objc_msgSend)(
+                            prog, procSel, k->request, k->model, 21, 0, hexId, @{}, &rv, &e);
+                    }
+                    t0 = mach_absolute_time();
+                    for (int i = 0; i < ITERS; i++) {
+                        BOOL rv = NO;
+                        ((BOOL(*)(id,SEL,id,id,unsigned int,int,id,id,BOOL*,NSError**))objc_msgSend)(
+                            prog, procSel, k->request, k->model, 21, 0, hexId, @{}, &rv, &e);
+                    }
+                    proc_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
+                }
+            } @catch (NSException *ex) { (void)ex; }
+
+            double wrap_ms = -1;
+            @try {
+                for (int i = 0; i < WARMUP; i++) ane_eval_rt(k);
+                t0 = mach_absolute_time();
+                for (int i = 0; i < ITERS; i++) ane_eval_rt(k);
+                wrap_ms = tb_ms(mach_absolute_time() - t0) / ITERS;
+            } @catch (NSException *ex) { wrap_ms = -1; }
+
+            char s[32], r[32], p[32], w2[32];
+            snprintf(s, 32, "%.3f ms", std_ms);
+            snprintf(r, 32, rt_ms >= 0 ? "%.3f (%.1fx)" : "N/A", rt_ms, std_ms/rt_ms);
+            snprintf(p, 32, proc_ms >= 0 ? "%.3f (%.1fx)" : "N/A", proc_ms, std_ms/proc_ms);
+            snprintf(w2, 32, wrap_ms >= 0 ? "%.3f (%.1fx)" : "N/A", wrap_ms, std_ms/wrap_ms);
+            printf("%-18s %10s %14s %14s %14s\n", configs[ci].label, s, r, p, w2);
+
+            ane_free(k);
+        }
+
+        printf("\n=== Benchmark complete ===\n");
+    }
+    return 0;
+}
--- a/training/test_chaining_v2.m
+++ b/training/test_chaining_v2.m
--- a/training/test_coreml_chaining.m
+++ b/training/test_coreml_chaining.m
--- a/training/test_e5_validate.m
+++ b/training/test_e5_validate.m
@ -0,0 +1,817 @@
+// test_e5_validate.m — Experiments W1-W5: E5 Runtime Validation & Deep API Exploration
+// Build: make test_e5_validate && ./test_e5_validate
+#import <Foundation/Foundation.h>
+#import <CoreML/CoreML.h>
+#import <objc/runtime.h>
+#import <objc/message.h>
+#import <dlfcn.h>
+#import <mach/mach_time.h>
+#import <IOSurface/IOSurface.h>
+
+static mach_timebase_info_data_t g_tb;
+static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
+
+#pragma mark - Helpers
+
+static void dump_all_methods(Class cls, const char *label) {
+    if (!cls) { printf("  %s: NOT FOUND\n", label); return; }
+    printf("\n--- %s ---\n", label);
+
+    unsigned int mc;
+    Method *cm = class_copyMethodList(object_getClass(cls), &mc);
+    if (mc > 0) {
+        printf("  Class methods (%u):\n", mc);
+        for (unsigned int i = 0; i < mc; i++) {
+            const char *sel = sel_getName(method_getName(cm[i]));
+            const char *enc = method_getTypeEncoding(cm[i]);
+            printf("    + %s  [%s]\n", sel, enc ? enc : "?");
+        }
+    }
+    free(cm);
+
+    Method *im = class_copyMethodList(cls, &mc);
+    if (mc > 0) {
+        printf("  Instance methods (%u):\n", mc);
+        for (unsigned int i = 0; i < mc; i++) {
+            const char *sel = sel_getName(method_getName(im[i]));
+            const char *enc = method_getTypeEncoding(im[i]);
+            printf("    - %s  [%s]\n", sel, enc ? enc : "?");
+        }
+    }
+    free(im);
+
+    unsigned int pc;
+    objc_property_t *props = class_copyPropertyList(cls, &pc);
+    if (pc > 0) {
+        printf("  Properties (%u):\n", pc);
+        for (unsigned int i = 0; i < pc; i++)
+            printf("    %s  [%s]\n", property_getName(props[i]),
+                   property_getAttributes(props[i]));
+    }
+    free(props);
+
+    unsigned int ic;
+    Ivar *ivars = class_copyIvarList(cls, &ic);
+    if (ic > 0) {
+        printf("  Ivars (%u):\n", ic);
+        for (unsigned int i = 0; i < ic; i++) {
+            const char *n = ivar_getName(ivars[i]);
+            const char *t = ivar_getTypeEncoding(ivars[i]);
+            printf("    %s  type=%s\n", n, t ? t : "?");
+        }
+    }
+    free(ivars);
+
+    Class superCls = class_getSuperclass(cls);
+    if (superCls && superCls != [NSObject class])
+        printf("  Superclass: %s\n", class_getName(superCls));
+}
+
+static float max_abs_diff(float *a, float *b, int n) {
+    float m = 0;
+    for (int i = 0; i < n; i++) {
+        float d = fabsf(a[i] - b[i]);
+        if (d > m) m = d;
+    }
+    return m;
+}
+
+static float mean_abs(float *a, int n) {
+    float s = 0;
+    for (int i = 0; i < n; i++) s += fabsf(a[i]);
+    return s / n;
+}
+
+#pragma mark - Main
+
+int main(int argc, const char *argv[]) {
+    (void)argc; (void)argv;
+    @autoreleasepool {
+        mach_timebase_info(&g_tb);
+        printf("================================================================\n");
+        printf("  E5 Runtime: Validation & Exhaustive API Documentation\n");
+        printf("================================================================\n\n");
+
+        if (!dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/"
+                    "AppleNeuralEngine", RTLD_NOW))
+            printf("  WARNING: dlopen(AppleNeuralEngine) failed: %s\n", dlerror());
+
+        // ============================================================
+        // W2: Exhaustive API Documentation (dump first so we have it)
+        // ============================================================
+        printf("================================================================\n");
+        printf("  W2: Exhaustive E5 Runtime API Documentation\n");
+        printf("================================================================\n");
+
+        const char *classNames[] = {
+            "MLE5Engine",
+            "MLE5ProgramLibrary",
+            "MLE5ProgramLibraryOnDeviceAOTCompilationImpl",
+            "MLE5ProgramLibraryE5BundleImpl",
+            "MLE5ExecutionStreamOperation",
+            "MLE5ExecutionStream",
+            "MLE5ExecutionStreamPool",
+            "MLE5StaticShapeExecutionStreamOperationPool",
+            "MLE5RangeShapeExecutionStreamOperationPool",
+            "MLE5EnumeratedShapeExecutionStreamOperationPool",
+            "MLE5ExecutionStreamOperationPoolFactory",
+            "MLE5InputPort",
+            "MLE5OutputPort",
+            "MLE5InputPortBinder",
+            "MLE5OutputPortBinder",
+            "MLProgramE5Container",
+            NULL
+        };
+        for (int i = 0; classNames[i]; i++) {
+            Class cls = NSClassFromString(
+                [NSString stringWithUTF8String:classNames[i]]);
+            dump_all_methods(cls, classNames[i]);
+        }
+
+        printf("\n--- e5rt_* C API Symbols ---\n");
+        const char *cFuncs[] = {
+            "e5rt_program_library_create",
+            "e5rt_program_library_destroy",
+            "e5rt_program_library_compile",
+            "e5rt_program_library_get_function",
+            "e5rt_program_library_load_function",
+            "e5rt_execution_stream_create",
+            "e5rt_execution_stream_destroy",
+            "e5rt_execution_stream_submit",
+            "e5rt_execution_stream_wait",
+            "e5rt_execution_stream_execute",
+            "e5rt_execution_stream_sync",
+            "e5rt_execution_stream_operation_create",
+            "e5rt_execution_stream_operation_destroy",
+            "e5rt_execution_stream_operation_set_input",
+            "e5rt_execution_stream_operation_set_output",
+            "e5rt_execution_stream_operation_execute",
+            "e5rt_async_event_create",
+            "e5rt_async_event_destroy",
+            "e5rt_async_event_signal",
+            "e5rt_async_event_wait",
+            "e5rt_buffer_create",
+            "e5rt_buffer_destroy",
+            "e5rt_io_port_create",
+            "e5rt_io_port_bind",
+            "e5rt_context_create",
+            "e5rt_init",
+            "e5rt_get_version",
+            NULL
+        };
+        for (int i = 0; cFuncs[i]; i++) {
+            void *sym = dlsym(RTLD_DEFAULT, cFuncs[i]);
+            if (sym) printf("  FOUND: %s at %p\n", cFuncs[i], sym);
+        }
+        fflush(stdout);
+
+        // ============================================================
+        // W1: Output Validation
+        // ============================================================
+        printf("\n================================================================\n");
+        printf("  W1: Output Correctness Validation\n");
+        printf("================================================================\n\n");
+
+        int ch = 256, sp = 64;
+        NSString *pkgPath = [NSString stringWithFormat:
+            @"/tmp/ane_sram_%dch_%dsp.mlpackage", ch, sp];
+        if (![[NSFileManager defaultManager] fileExistsAtPath:pkgPath]) {
+            printf("  FATAL: %s not found. Run gen_mlpackages.py\n",
+                   [pkgPath UTF8String]);
+            return 1;
+        }
+
+        NSError *err = nil;
+        MLModelConfiguration *cfg = [[MLModelConfiguration alloc] init];
+        cfg.computeUnits = MLComputeUnitsAll;
+        MLPredictionOptions *predOpts = [[MLPredictionOptions alloc] init];
+        Class opCls = NSClassFromString(@"MLE5ExecutionStreamOperation");
+
+        NSURL *compiled = [MLModel compileModelAtURL:
+            [NSURL fileURLWithPath:pkgPath] error:&err];
+        if (!compiled) { printf("  Compile FAILED\n"); return 1; }
+        err = nil;
+        MLModel *model = [MLModel modelWithContentsOfURL:compiled
+                                           configuration:cfg error:&err];
+        if (!model) { printf("  Load FAILED\n"); return 1; }
+
+        int nElems = 1 * ch * 1 * sp;
+        MLMultiArray *inputArr = [[MLMultiArray alloc]
+            initWithShape:@[@1, @(ch), @1, @(sp)]
+            dataType:MLMultiArrayDataTypeFloat32 error:nil];
+
+        float *inPtr = (float *)[inputArr dataPointer];
+        for (int i = 0; i < nElems; i++)
+            inPtr[i] = sinf((float)i * 0.01f) * 0.5f;
+
+        NSString *inName = [[[[model modelDescription] inputDescriptionsByName]
+            allKeys] firstObject];
+        NSString *outName = [[[[model modelDescription] outputDescriptionsByName]
+            allKeys] firstObject];
+        MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
+            initWithDictionary:@{inName: inputArr} error:nil];
+
+        printf("  Input: %s [1,%d,1,%d], first 5: [%.4f %.4f %.4f %.4f %.4f]\n",
+               [inName UTF8String], ch, sp,
+               inPtr[0], inPtr[1], inPtr[2], inPtr[3], inPtr[4]);
+        printf("  Output: %s\n", [outName UTF8String]);
+        fflush(stdout);
+
+        // --- Reference: CoreML sequential prediction ---
+        printf("\n  --- W1.1: CoreML reference prediction ---\n");
+        err = nil;
+        id<MLFeatureProvider> refResult = [model predictionFromFeatures:fp error:&err];
+        if (!refResult) { printf("  Prediction FAILED\n"); return 1; }
+
+        MLMultiArray *refOut = [refResult featureValueForName:outName].multiArrayValue;
+        float *refPtr = (float *)[refOut dataPointer];
+        int outElems = 1;
+        for (int d = 0; d < (int)refOut.shape.count; d++)
+            outElems *= [refOut.shape[d] intValue];
+        printf("  Output shape: [");
+        for (int d = 0; d < (int)refOut.shape.count; d++)
+            printf("%s%d", d ? "," : "", [refOut.shape[d] intValue]);
+        printf("] (%d elements)\n", outElems);
+        printf("  First 5 ref: [%.6f %.6f %.6f %.6f %.6f]\n",
+               refPtr[0], refPtr[1], refPtr[2], refPtr[3], refPtr[4]);
+        printf("  Mean |ref|: %.6f\n", mean_abs(refPtr, outElems));
+        fflush(stdout);
+
+        // --- E5 stream prediction ---
+        printf("\n  --- W1.2: E5 stream prediction ---\n");
+
+        id e5engine = nil;
+        @try { e5engine = [model valueForKey:@"_internalEngine"]; }
+        @catch (NSException *e) { (void)e; }
+        id progLib = nil;
+        @try { progLib = [e5engine valueForKey:@"programLibrary"]; }
+        @catch (NSException *e) { (void)e; }
+        id streamPool = nil;
+        @try { streamPool = [e5engine valueForKey:@"streamPool"]; }
+        @catch (NSException *e) { (void)e; }
+
+        id op = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))objc_msgSend)(
+            [opCls alloc],
+            @selector(initWithProgramLibrary:functionName:modelDescription:
+                configuration:debugLabel:modelSignpostId:),
+            progLib, @"main", [model modelDescription], cfg,
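+            @"validate_op", (unsigned long long)0);
+        if (!op) { printf("  FATAL: MLE5ExecutionStreamOperation init failed\n"); return 1; }  // @[op] below throws on nil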
+
+        NSError *plErr = nil;
+        BOOL plOk = ((BOOL(*)(id,SEL,NSError**))objc_msgSend)(
+            op, @selector(preloadAndReturnError:), &plErr);
+        printf("  preload: %s\n", plOk ? "YES" : "NO");
+        if (plErr) printf("  Error: %s\n", [[plErr description] UTF8String]);
+        fflush(stdout);
+
+        id stream = [streamPool performSelector:@selector(takeOut)];
+        Ivar shIvar = class_getInstanceVariable([stream class], "_streamHandle");
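+        // _streamHandle is passed to _executeStream: as a raw pointer, so read it by offset
+        // rather than via object_getIvar, whose result ARC treats as an object reference.
+        void *sh = (stream && shIvar)
+            ? *(void **)((uint8_t *)(__bridge void *)stream + ivar_getOffset(shIvar)) : NULL;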
+        printf("  stream: %p, handle: %p\n", (__bridge void *)stream, sh);
+
+        [stream setValue:@[op] forKey:@"operations"];
+
+        NSError *prepErr = nil;
+        BOOL prepOk = ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
+            op, @selector(prepareForInputFeatures:options:error:),
+            fp, predOpts, &prepErr);
+        printf("  prepare: %s\n", prepOk ? "YES" : "NO");
+        if (prepErr) printf("  Error: %s\n", [[prepErr description] UTF8String]);
+        fflush(stdout);
+
+        NSError *execErr = nil;
+        BOOL execOk = ((BOOL(*)(id,SEL,void*,NSError**))objc_msgSend)(
+            stream, @selector(_executeStream:error:), sh, &execErr);
+        printf("  execute: %s\n", execOk ? "YES" : "NO");
+        if (execErr) printf("  Error: %s\n", [[execErr description] UTF8String]);
+        fflush(stdout);
+
+        // Read output from the operation
+        printf("\n  --- W1.3: Read E5 output features ---\n");
+        fflush(stdout);
+        id e5Result = nil;
+        @try {
+            e5Result = [op valueForKey:@"outputFeatures"];
+            printf("  outputFeatures: %s\n",
+                   e5Result ? [NSStringFromClass([e5Result class]) UTF8String]
+                            : "nil");
+        } @catch (NSException *ex) {
+            printf("  outputFeatures EXCEPTION: %s\n",
+                   [[ex reason] UTF8String]);
+        }
+
+        if (e5Result && [e5Result conformsToProtocol:@protocol(MLFeatureProvider)]) {
+            MLMultiArray *e5Out = [(id<MLFeatureProvider>)e5Result
+                featureValueForName:outName].multiArrayValue;
+            if (e5Out) {
+                float *e5Ptr = (float *)[e5Out dataPointer];
+                printf("  E5 first 5: [%.6f %.6f %.6f %.6f %.6f]\n",
+                       e5Ptr[0], e5Ptr[1], e5Ptr[2], e5Ptr[3], e5Ptr[4]);
+                printf("  Mean |e5|: %.6f\n", mean_abs(e5Ptr, outElems));
+
+                float mad = max_abs_diff(refPtr, e5Ptr, outElems);
+                printf("  Max abs diff: %.8f\n", mad);
+                printf("  Relative error: %.2e\n",
+                       mad / (mean_abs(refPtr, outElems) + 1e-10f));
+
+                if (mad < 1e-3f) {
+                    printf("  *** VALIDATION PASSED: outputs match ***\n");
+                } else if (mad < 1e-1f) {
+                    printf("  VALIDATION WARNING: small differences (FP16 expected)\n");
+                } else {
+                    printf("  VALIDATION FAILED: outputs diverge!\n");
+                }
+            } else {
+                printf("  E5 output array is nil for key '%s'\n",
+                       [outName UTF8String]);
+
+                NSArray *ofNames = [(id<MLFeatureProvider>)e5Result
+                    featureNames].allObjects;
+                printf("  Available features: %s\n",
+                       [[ofNames description] UTF8String]);
+            }
+        } else {
+            printf("  Cannot read output features\n");
+        }
+
+        // Also read output via outputPorts
+        printf("\n  --- W1.4: Read via output ports ---\n");
+        fflush(stdout);
+        @try {
+            id outPorts = [op valueForKey:@"outputPorts"];
+            printf("  outputPorts: %s (count=%lu)\n",
+                   outPorts ? [NSStringFromClass([outPorts class]) UTF8String]
+                            : "nil",
+                   outPorts ? (unsigned long)[(NSArray *)outPorts count] : 0);
+
+            if (outPorts && [(NSArray *)outPorts count] > 0) {
+                for (NSUInteger pi = 0; pi < [(NSArray *)outPorts count]; pi++) {
+                    id port = [(NSArray *)outPorts objectAtIndex:pi];
+                    printf("    Port[%lu]: %s\n", (unsigned long)pi,
+                           [[port description] UTF8String]);
+                    @try {
+                        id portName = [port valueForKey:@"name"];
+                        printf("      name: %s\n",
+                               portName ? [(NSString *)portName UTF8String] : "nil");
+                    } @catch (NSException *ex) { (void)ex; }
+                    @try {
+                        id portFD = [port valueForKey:@"featureDescription"];
+                        printf("      featureDescription: %s\n",
+                               portFD ? [[portFD description] UTF8String] : "nil");
+                    } @catch (NSException *ex) { (void)ex; }
+                    @try {
+                        id binder = [port valueForKey:@"binder"];
+                        printf("      binder: %s\n",
+                               binder ? [NSStringFromClass([binder class])
+                                            UTF8String] : "nil");
+                        if (binder) {
+                            @try {
+                                id fv = [binder valueForKey:@"featureValue"];
+                                printf("      featureValue: %s\n",
+                                       fv ? [NSStringFromClass([fv class])
+                                                UTF8String] : "nil");
+                                if (fv) {
+                                    MLMultiArray *ma = [(MLFeatureValue *)fv
+                                        multiArrayValue];
+                                    if (ma) {
+                                        float *ptr = (float *)[ma dataPointer];
+                                        printf("      first 5: [%.6f %.6f %.6f"
+                                               " %.6f %.6f]\n",
+                                               ptr[0], ptr[1], ptr[2],
+                                               ptr[3], ptr[4]);
+                                        float mad2 = max_abs_diff(refPtr, ptr,
+                                            outElems);
+                                        printf("      Max abs diff vs ref: %.8f\n",
+                                               mad2);
+                                    }
+                                }
+                            } @catch (NSException *ex) {
+                                printf("      featureValue EXCEPTION: %s\n",
+                                       [[ex reason] UTF8String]);
+                            }
+                        }
+                    } @catch (NSException *ex) { (void)ex; }
+                }
+            }
+        } @catch (NSException *ex) {
+            printf("  outputPorts EXCEPTION: %s\n", [[ex reason] UTF8String]);
+        }
+
+        // Also read input ports
+        printf("\n  --- W1.5: Inspect input ports ---\n");
+        fflush(stdout);
+        @try {
+            id inPorts = [op valueForKey:@"inputPorts"];
+            printf("  inputPorts: %s (count=%lu)\n",
+                   inPorts ? [NSStringFromClass([inPorts class]) UTF8String]
+                           : "nil",
+                   inPorts ? (unsigned long)[(NSArray *)inPorts count] : 0);
+            if (inPorts) {
+                for (NSUInteger pi = 0; pi < [(NSArray *)inPorts count]; pi++) {
+                    id port = [(NSArray *)inPorts objectAtIndex:pi];
+                    printf("    Port[%lu]: %s\n", (unsigned long)pi,
+                           [[port description] UTF8String]);
+                    @try {
+                        printf("      name: %s\n",
+                               [[(id)[port valueForKey:@"name"] description]
+                                   UTF8String]);
+                        printf("      portHandle: %p\n",
+                               (__bridge void *)[port valueForKey:@"portHandle"]);
+                    } @catch (NSException *ex) { (void)ex; }
+                    @try {
+                        id binder = [port valueForKey:@"binder"];
+                        if (binder) {
+                            printf("      binder: %s\n",
+                                   [NSStringFromClass([binder class]) UTF8String]);
+                            printf("      bindingMode: %d\n",
+                                   ((char(*)(id,SEL))objc_msgSend)(
+                                       binder, @selector(bindingMode)));
+                            id dfv = nil;
+                            @try {
+                                dfv = [binder valueForKey:@"directlyBoundFeatureValue"];
+                            } @catch (NSException *ex) { (void)ex; }
+                            printf("      directlyBound: %s\n",
+                                   dfv ? "YES" : "NO");
+                        }
+                    } @catch (NSException *ex) { (void)ex; }
+                }
+            }
+        } @catch (NSException *ex) {
+            printf("  inputPorts EXCEPTION: %s\n", [[ex reason] UTF8String]);
+        }
+
+        // Return stream
+        [stream setValue:@[op] forKey:@"operations"];
+        ((void(*)(id,SEL,id))objc_msgSend)(
+            streamPool, @selector(putBack:), stream);
+
+        // ============================================================
+        // W1.6: Multi-op output validation
+        // ============================================================
+        printf("\n  --- W1.6: Multi-op output validation ---\n");
+        fflush(stdout);
+
+        {
+            NSString *pkg2Path = @"/tmp/ane_sram_512ch_64sp.mlpackage";
+            err = nil;
+            NSURL *c2 = [MLModel compileModelAtURL:
+                [NSURL fileURLWithPath:pkg2Path] error:&err];
+            if (err) { printf("  Compile2 FAILED\n"); goto skip_multiop; }
+            err = nil;
+            MLModel *model2 = [MLModel modelWithContentsOfURL:c2
+                                                 configuration:cfg error:&err];
+            if (err) { printf("  Load2 FAILED\n"); goto skip_multiop; }
+            int ch2 = 512;
+            int nElems2 = 1 * ch2 * 1 * sp;
+            MLMultiArray *inputArr2 = [[MLMultiArray alloc]
+                initWithShape:@[@1, @(ch2), @1, @(sp)]
+                dataType:MLMultiArrayDataTypeFloat32 error:nil];
+            float *in2Ptr = (float *)[inputArr2 dataPointer];
+            for (int i = 0; i < nElems2; i++)
+                in2Ptr[i] = cosf((float)i * 0.02f) * 0.3f;
+
+            NSString *in2Name = [[[[model2 modelDescription] inputDescriptionsByName]
+                allKeys] firstObject];
+            NSString *out2Name = [[[[model2 modelDescription] outputDescriptionsByName]
+                allKeys] firstObject];
+            MLDictionaryFeatureProvider *fp2 = [[MLDictionaryFeatureProvider alloc]
+                initWithDictionary:@{in2Name: inputArr2} error:nil];
+
+            // Reference predictions
+            err = nil;
+            id<MLFeatureProvider> ref1 = [model predictionFromFeatures:fp error:&err];
+            err = nil;
+            id<MLFeatureProvider> ref2 = [model2 predictionFromFeatures:fp2 error:&err];
+            float *ref1Ptr = (float *)[[ref1 featureValueForName:outName].multiArrayValue dataPointer];
+            float *ref2Ptr = (float *)[[ref2 featureValueForName:out2Name].multiArrayValue dataPointer];
+            if (!ref1Ptr || !ref2Ptr) { printf("  Reference prediction FAILED\n"); goto skip_multiop; }
+
+            // E5 multi-op stream
+            id e5_2 = nil;
+            @try { e5_2 = [model2 valueForKey:@"_internalEngine"]; }
+            @catch (NSException *e) { (void)e; }
+            id pLib2 = nil;
+            @try { pLib2 = [e5_2 valueForKey:@"programLibrary"]; }
+            @catch (NSException *e) { (void)e; }
+
+            id op1 = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))objc_msgSend)(
+                [opCls alloc],
+                @selector(initWithProgramLibrary:functionName:modelDescription:
+                    configuration:debugLabel:modelSignpostId:),
+                progLib, @"main", [model modelDescription], cfg,
+                @"val_op1", (unsigned long long)0);
+            id op2 = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))objc_msgSend)(
+                [opCls alloc],
+                @selector(initWithProgramLibrary:functionName:modelDescription:
+                    configuration:debugLabel:modelSignpostId:),
+                pLib2, @"main", [model2 modelDescription], cfg,
+                @"val_op2", (unsigned long long)0);
+
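+            // The @[op1, op2] literal below throws on nil, so bail out early.
+            if (!op1 || !op2) { printf("  Op init FAILED\n"); goto skip_multiop; }
+            ((BOOL(*)(id,SEL,NSError**))objc_msgSend)(op1, @selector(preloadAndReturnError:), nil);
+            ((BOOL(*)(id,SEL,NSError**))objc_msgSend)(op2, @selector(preloadAndReturnError:), nil);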
+
+            id stream2 = [streamPool performSelector:@selector(takeOut)];
+            Ivar shIvar2 = class_getInstanceVariable([stream2 class], "_streamHandle");
+            void *sh2 = (__bridge void *)object_getIvar(stream2, shIvar2);
+
+            [stream2 setValue:@[op1, op2] forKey:@"operations"];
+
+            ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
+                op1, @selector(prepareForInputFeatures:options:error:),
+                fp, predOpts, nil);
+            ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
+                op2, @selector(prepareForInputFeatures:options:error:),
+                fp2, predOpts, nil);
+
+            NSError *mErr = nil;
+            BOOL mOk = ((BOOL(*)(id,SEL,void*,NSError**))objc_msgSend)(
+                stream2, @selector(_executeStream:error:), sh2, &mErr);
+            printf("  Multi-op execute: %s\n", mOk ? "YES" : "NO");
+            if (mErr) printf("  Error: %s\n", [[mErr description] UTF8String]);
+            fflush(stdout);
+
+            if (mOk) {
+                // Read outputs
+                @try {
+                    id out1 = [op1 valueForKey:@"outputFeatures"];
+                    id out2 = [op2 valueForKey:@"outputFeatures"];
+
+                    if (out1 && out2) {
+                        MLMultiArray *ma1 = [(id<MLFeatureProvider>)out1
+                            featureValueForName:outName].multiArrayValue;
+                        MLMultiArray *ma2 = [(id<MLFeatureProvider>)out2
+                            featureValueForName:out2Name].multiArrayValue;
+
+                        if (ma1 && ma2) {
+                            float *p1 = (float *)[ma1 dataPointer];
+                            float *p2 = (float *)[ma2 dataPointer];
+
+                            float mad1 = max_abs_diff(ref1Ptr, p1, outElems);
+                            float mad2 = max_abs_diff(ref2Ptr, p2, (int)[ma2 count]);
+
+                            printf("  Op1 max diff: %.8f  (mean_ref=%.6f)\n",
+                                   mad1, mean_abs(ref1Ptr, outElems));
+                            printf("  Op2 max diff: %.8f  (mean_ref=%.6f)\n",
+                                   mad2, mean_abs(ref2Ptr, (int)[ma2 count]));
+
+                            if (mad1 < 1e-3f && mad2 < 1e-3f) {
+                                printf("  *** MULTI-OP VALIDATION PASSED ***\n");
+                            } else {
+                                printf("  MULTI-OP VALIDATION: differences detected\n");
+                            }
+                        } else {
+                            printf("  Could not extract MLMultiArray from outputs\n");
+                        }
+                    } else {
+                        printf("  outputFeatures nil for op1 or op2\n");
+                    }
+                } @catch (NSException *ex) {
+                    printf("  Output read EXCEPTION: %s\n",
+                           [[ex reason] UTF8String]);
+                }
+            }
+
+            [stream2 setValue:@[op1] forKey:@"operations"];
+            ((void(*)(id,SEL,id))objc_msgSend)(
+                streamPool, @selector(putBack:), stream2);
+        }
+skip_multiop:
+
+        // ============================================================
+        // W4: Async stream submission
+        // ============================================================
+        printf("\n================================================================\n");
+        printf("  W4: Async Stream Submission\n");
+        printf("================================================================\n\n");
+        fflush(stdout);
+
+        {
+            id asyncStream = [streamPool performSelector:@selector(takeOut)];
+            Ivar ashIvar = class_getInstanceVariable([asyncStream class], "_streamHandle");
+            void *ash = (__bridge void *)object_getIvar(asyncStream, ashIvar);
+
+            id asyncOp = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))
+                objc_msgSend)([opCls alloc],
+                @selector(initWithProgramLibrary:functionName:modelDescription:
+                    configuration:debugLabel:modelSignpostId:),
+                progLib, @"main", [model modelDescription], cfg,
+                @"async_op", (unsigned long long)0);
+            ((BOOL(*)(id,SEL,NSError**))objc_msgSend)(
+                asyncOp, @selector(preloadAndReturnError:), nil);
+            [asyncStream setValue:@[asyncOp] forKey:@"operations"];
+
+            ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
+                asyncOp, @selector(prepareForInputFeatures:options:error:),
+                fp, predOpts, nil);
+
+            // Try async submission
+            __block BOOL asyncDone = NO;
+            __block double asyncMs = 0;
+            uint64_t asyncT0 = mach_absolute_time();
+
+            @try {
+                // prepareAsyncSubmissionForInputFeatures
+                NSError *asyncPrepErr = nil;
+                BOOL asyncPrepOk = ((BOOL(*)(id,SEL,id,id,NSError**))
+                    objc_msgSend)(asyncStream,
+                    @selector(prepareAsyncSubmissionForInputFeatures:options:error:),
+                    fp, predOpts, &asyncPrepErr);
+                printf("  prepareAsyncSubmission: %s\n",
+                       asyncPrepOk ? "YES" : "NO");
+                if (asyncPrepErr) printf("  Error: %s\n",
+                    [[asyncPrepErr description] UTF8String]);
+                fflush(stdout);
+
+                if (asyncPrepOk) {
+                    ((void(*)(id,SEL,void(^)(void)))objc_msgSend)(
+                        asyncStream, @selector(submitWithCompletionHandler:),
+                        ^{
+                            asyncMs = tb_ms(mach_absolute_time() - asyncT0);
+                            asyncDone = YES;
+                        });
+                    printf("  Submitted async, waiting...\n");
+                    fflush(stdout);
+
+                    for (int w = 0; w < 100 && !asyncDone; w++)
+                        usleep(1000);
+
+                    printf("  Async completed: %s (%.3f ms)\n",
+                           asyncDone ? "YES" : "TIMEOUT", asyncMs);
+                    fflush(stdout);
+
+                    if (asyncDone) {
+                        // Benchmark async vs sync
+                        int N = 200;
+
+                        // Sync benchmark
+                        uint64_t t0 = mach_absolute_time();
+                        for (int i = 0; i < N; i++) {
+                            ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
+                                asyncOp,
+                                @selector(prepareForInputFeatures:options:error:),
+                                fp, predOpts, nil);
+                            ((BOOL(*)(id,SEL,void*,NSError**))objc_msgSend)(
+                                asyncStream,
+                                @selector(_executeStream:error:), ash, nil);
+                        }
+                        double syncMs = tb_ms(mach_absolute_time() - t0) / N;
+
+                        // Async benchmark
+                        t0 = mach_absolute_time();
+                        for (int i = 0; i < N; i++) {
+                            ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
+                                asyncOp,
+                                @selector(prepareForInputFeatures:options:error:),
+                                fp, predOpts, nil);
+                            ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
+                                asyncStream,
+                                @selector(prepareAsyncSubmissionForInputFeatures:
+                                    options:error:),
+                                fp, predOpts, nil);
+
+                            __block BOOL done = NO;
+                            ((void(*)(id,SEL,void(^)(void)))objc_msgSend)(
+                                asyncStream,
+                                @selector(submitWithCompletionHandler:),
+                                ^{ done = YES; });
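+                            // Bounded wait (~1 s) so a dropped completion handler cannot
+                            // hang the run; 100 us polling inflates the async figure slightly.
+                            for (int w = 0; w < 10000 && !done; w++) usleep(100);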
+                        }
+                        double asyncBenchMs = tb_ms(mach_absolute_time() - t0) / N;
+
+                        printf("  Sync: %.4f ms/eval\n", syncMs);
+                        printf("  Async (wait): %.4f ms/eval\n", asyncBenchMs);
+                    }
+                }
+            } @catch (NSException *ex) {
+                printf("  Async EXCEPTION: %s\n", [[ex reason] UTF8String]);
+            }
+
+            [asyncStream setValue:@[asyncOp] forKey:@"operations"];
+            ((void(*)(id,SEL,id))objc_msgSend)(
+                streamPool, @selector(putBack:), asyncStream);
+        }
+
+        // ============================================================
+        // W5: Port-Based Data Flow
+        // ============================================================
+        printf("\n================================================================\n");
+        printf("  W5: Port-Based Data Flow Investigation\n");
+        printf("================================================================\n\n");
+        fflush(stdout);
+
+        {
+            id portOp = ((id(*)(id,SEL,id,id,id,id,id,unsigned long long))
+                objc_msgSend)([opCls alloc],
+                @selector(initWithProgramLibrary:functionName:modelDescription:
+                    configuration:debugLabel:modelSignpostId:),
+                progLib, @"main", [model modelDescription], cfg,
+                @"port_op", (unsigned long long)0);
+            ((BOOL(*)(id,SEL,NSError**))objc_msgSend)(
+                portOp, @selector(preloadAndReturnError:), nil);
+
+            // Inspect ports before prepare
+            printf("  --- Before prepare ---\n");
+            @try {
+                id inP = [portOp valueForKey:@"inputPorts"];
+                id outP = [portOp valueForKey:@"outputPorts"];
+                id stP = [portOp valueForKey:@"statePorts"];
+                printf("  inputPorts: %lu, outputPorts: %lu, statePorts: %lu\n",
+                       inP ? (unsigned long)[(NSArray *)inP count] : 0,
+                       outP ? (unsigned long)[(NSArray *)outP count] : 0,
+                       stP ? (unsigned long)[(NSArray *)stP count] : 0);
+
+                if (inP) {
+                    for (id p in (NSArray *)inP) {
+                        printf("    in: %s  portHandle=%p  name=%s\n",
+                               [NSStringFromClass([p class]) UTF8String],
+                               (__bridge void *)[p valueForKey:@"portHandle"],
+                               [[(id)[p valueForKey:@"name"] description] UTF8String]);
+                    }
+                }
+                if (outP) {
+                    for (id p in (NSArray *)outP) {
+                        printf("    out: %s  portHandle=%p  name=%s\n",
+                               [NSStringFromClass([p class]) UTF8String],
+                               (__bridge void *)[p valueForKey:@"portHandle"],
+                               [[(id)[p valueForKey:@"name"] description] UTF8String]);
+                        @try {
+                            id fd = [p valueForKey:@"featureDescription"];
+                            if (fd) printf("         featureDesc: %s\n",
+                                           [[fd description] UTF8String]);
+                        } @catch (NSException *ex) { (void)ex; }
+                    }
+                }
+            } @catch (NSException *ex) {
+                printf("  Port inspection EXCEPTION: %s\n",
+                       [[ex reason] UTF8String]);
+            }
+
+            // Prepare and inspect after
+            ((BOOL(*)(id,SEL,id,id,NSError**))objc_msgSend)(
+                portOp, @selector(prepareForInputFeatures:options:error:),
+                fp, predOpts, nil);
+
+            printf("\n  --- After prepare ---\n");
+            @try {
+                id inP = [portOp valueForKey:@"inputPorts"];
+                if (inP) {
+                    for (id p in (NSArray *)inP) {
+                        id binder = [p valueForKey:@"binder"];
+                        BOOL directBound = ((BOOL(*)(id,SEL))objc_msgSend)(
+                            p, @selector(boundFeatureDirectly));
+                        printf("    in: name=%s  directBound=%s  binder=%s\n",
+                               [[(id)[p valueForKey:@"name"] description] UTF8String],
+                               directBound ? "YES" : "NO",
+                               binder ? [NSStringFromClass([binder class])
+                                            UTF8String] : "nil");
+                        if (binder) {
+                            char mode = ((char(*)(id,SEL))objc_msgSend)(
+                                binder, @selector(bindingMode));
+                            printf("         bindingMode=%d\n", (int)mode);
+                        }
+                    }
+                }
+                id outP = [portOp valueForKey:@"outputPorts"];
+                if (outP) {
+                    for (id p in (NSArray *)outP) {
+                        BOOL directBound = ((BOOL(*)(id,SEL))objc_msgSend)(
+                            p, @selector(boundFeatureDirectly));
+                        BOOL obDirectBound = ((BOOL(*)(id,SEL))objc_msgSend)(
+                            p, @selector(outputBackingWasDirectlyBound));
+                        printf("    out: name=%s  directBound=%s"
+                               "  outputBackingDirectBound=%s\n",
+                               [[(id)[p valueForKey:@"name"] description] UTF8String],
+                               directBound ? "YES" : "NO",
+                               obDirectBound ? "YES" : "NO");
+                        id binder = [p valueForKey:@"binder"];
+                        if (binder) {
+                            printf("         binder: %s\n",
+                                   [NSStringFromClass([binder class]) UTF8String]);
+                            @try {
+                                id ob = [binder valueForKey:@"outputBacking"];
+                                printf("         outputBacking: %s\n",
+                                       ob ? [NSStringFromClass([ob class])
+                                                UTF8String] : "nil");
+                            } @catch (NSException *ex) { (void)ex; }
+                        }
+                    }
+                }
+            } @catch (NSException *ex) {
+                printf("  Post-prepare EXCEPTION: %s\n",
+                       [[ex reason] UTF8String]);
+            }
+        }
+
+        // ============================================================
+        // Summary
+        // ============================================================
+        printf("\n================================================================\n");
+        printf("  SUMMARY\n");
+        printf("================================================================\n");
+        printf("  W1: Output validation          -- see above\n");
+        printf("  W2: API documentation           -- complete (all classes dumped)\n");
+        printf("  W4: Async submission            -- see above\n");
+        printf("  W5: Port data flow              -- see above\n");
+        printf("================================================================\n");
+        printf("\nDone.\n");
+    }
+    return 0;
+}
--- a/training/test_mil_custom.m
+++ b/training/test_mil_custom.m
@ -0,0 +1,915 @@
+// test_mil_custom.m — Experiments Y1-Y3, Z1: Custom MIL -> ANE Execution
+// Build: make test_mil_custom && ./test_mil_custom
+#import <Foundation/Foundation.h>
+#import <CoreML/CoreML.h>
+#import <objc/runtime.h>
+#import <objc/message.h>
+#import <dlfcn.h>
+#import <mach/mach_time.h>
+#import <Accelerate/Accelerate.h>
+
+static mach_timebase_info_data_t g_tb;
+static double tb_ms(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
+
+#pragma mark - MIL Compilation Pipeline
+
+static id compileAndCreateEngine(NSString *milText, NSString *label,
+                                  id container, MLModelConfiguration *cfg,
+                                  MLModelDescription *desc, NSError **outErr) {
+    NSString *milPath = [NSString stringWithFormat:@"/tmp/%@.mil", label];
+    NSError *writeErr = nil;
+    if (![milText writeToFile:milPath atomically:YES
+                     encoding:NSUTF8StringEncoding error:&writeErr]) {
+        if (outErr) *outErr = writeErr;
+        return nil;
+    }
+    NSURL *milURL = [NSURL fileURLWithPath:milPath];
+
+    Class aotCls = NSClassFromString(@"MLE5ProgramLibraryOnDeviceAOTCompilationImpl");
+    if (!aotCls) {
+        if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:1
+            userInfo:@{NSLocalizedDescriptionKey: @"AOT class not found"}];
+        return nil;
+    }
+
+    id aotImpl = ((id(*)(id,SEL,id,id,id))objc_msgSend)(
+        [aotCls alloc],
+        NSSelectorFromString(@"initWithMILTextAtURL:container:configuration:"),
+        milURL, container, cfg);
+    if (!aotImpl) {
+        if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:2
+            userInfo:@{NSLocalizedDescriptionKey: @"AOT init failed"}];
+        return nil;
+    }
+
+    NSError *plErr = nil;
+    void *plHandle = ((void*(*)(id,SEL,BOOL,NSError**))objc_msgSend)(
+        aotImpl,
+        NSSelectorFromString(@"createProgramLibraryHandleWithRespecialization:error:"),
+        NO, &plErr);
+    if (!plHandle) {
+        printf("  [%s] PL handle failed: %s\n", [label UTF8String],
+               plErr ? [[plErr description] UTF8String] : "unknown");
+        if (outErr) *outErr = plErr;
+        return nil;
+    }
+
+    Class plCls = NSClassFromString(@"MLE5ProgramLibrary");
+    id progLib = ((id(*)(id,SEL,id,id,id))objc_msgSend)(
+        [plCls alloc],
+        NSSelectorFromString(@"initWithImpl:container:configuration:"),
+        aotImpl, container, cfg);
+    if (!progLib) {
+        if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:4
+            userInfo:@{NSLocalizedDescriptionKey: @"ProgramLibrary init failed"}];
+        return nil;
+    }
+
+    Class engCls = NSClassFromString(@"MLE5Engine");
+
+    // Find the correct init selector
+    static dispatch_once_t once;
+    static SEL engInitSel = NULL;
+    dispatch_once(&once, ^{
+        unsigned int mc;
+        Method *ims = class_copyMethodList(engCls, &mc);
+        printf("  MLE5Engine init selectors:\n");
+        for (unsigned int i = 0; i < mc; i++) {
+            const char *sel = sel_getName(method_getName(ims[i]));
+            if (strstr(sel, "init")) {
+                printf("    - %s  [%s]\n", sel, method_getTypeEncoding(ims[i]));
+                if (strstr(sel, "ProgramLibrary") && strstr(sel, "modelDescription"))
+                    engInitSel = method_getName(ims[i]);
+            }
+        }
+        free(ims);
+    });
+
+    if (!engInitSel) {
+        if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:5
+            userInfo:@{NSLocalizedDescriptionKey: @"No MLE5Engine init selector found"}];
+        return nil;
+    }
+
+    printf("  Using init: %s\n", sel_getName(engInitSel));
+
+    // Count colons to determine argument count
+    const char *selName = sel_getName(engInitSel);
+    int argCount = 0;
+    for (const char *p = selName; *p; p++) if (*p == ':') argCount++;
+
+    id engine = nil;
+    if (argCount == 7) {
+        // initWithProgramLibrary:modelDescription:configuration:functionName:
+        //   classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo:
+        engine = ((id(*)(id,SEL,id,id,id,id,id,id,id))objc_msgSend)(
+            [engCls alloc], engInitSel, progLib, desc, cfg,
+            @"main", nil, nil, nil);
+    } else if (argCount == 5) {
+        engine = ((id(*)(id,SEL,id,id,id,id,id))objc_msgSend)(
+            [engCls alloc], engInitSel, progLib, desc, cfg, nil, label);
+    } else if (argCount == 6) {
+        engine = ((id(*)(id,SEL,id,id,id,id,id,id))objc_msgSend)(
+            [engCls alloc], engInitSel, progLib, desc, cfg, nil, nil, label);
+    } else {
+        printf("  Unexpected arg count %d for MLE5Engine init\n", argCount);
+    }
+
+    if (!engine) {
+        if (outErr) *outErr = [NSError errorWithDomain:@"MIL" code:6
+            userInfo:@{NSLocalizedDescriptionKey: @"Engine init failed"}];
+        return nil;
+    }
+
+    NSError *prepErr = nil;
+    BOOL prepOk = ((BOOL(*)(id,SEL,long long,NSError**))objc_msgSend)(
+        engine, NSSelectorFromString(@"prepareWithConcurrencyHint:error:"),
+        (long long)1, &prepErr);
+    if (!prepOk) {
+        printf("  [%s] Prepare failed: %s\n", [label UTF8String],
+               prepErr ? [[prepErr description] UTF8String] : "unknown");
+        if (outErr) *outErr = prepErr;
+        return nil;
+    }
+
+    return engine;
+}
+
+static id<MLFeatureProvider> runEngine(id engine, id<MLFeatureProvider> features,
+                                       MLPredictionOptions *opts, NSError **outErr) {
+    return ((id(*)(id,SEL,id,id,NSError**))objc_msgSend)(
+        engine, NSSelectorFromString(@"predictionFromFeatures:options:error:"),
+        features, opts, outErr);
+}
+
+#pragma mark - Numeric Helpers
+
+static float max_abs_diff(const float *a, const float *b, int n) {
+    float m = 0;
+    for (int i = 0; i < n; i++) {
+        float d = fabsf(a[i] - b[i]);
+        if (d > m) m = d;
+    }
+    return m;
+}
+
+static float mean_abs(const float *a, int n) {
+    float s = 0;
+    for (int i = 0; i < n; i++) s += fabsf(a[i]);
+    return s / n;
+}
+
+static void fill_random(float *buf, int n, float scale) {
+    for (int i = 0; i < n; i++)
+        buf[i] = ((float)arc4random() / (float)UINT32_MAX - 0.5f) * 2.0f * scale;
+}
+
+static void print_first(const char *label, const float *buf, int total) {
+    int n = total < 8 ? total : 8;
+    printf("  %s: [", label);
+    for (int i = 0; i < n; i++)
+        printf("%s%.4f", i ? ", " : "", buf[i]);
+    printf("]\n");
+}
+
+#pragma mark - CPU Reference Implementations
+
+static void cpu_sdpa(const float *Q, const float *K, const float *V,
+                     float *out, int seqLen, int headDim) {
+    float scale = 1.0f / sqrtf((float)headDim);
+    float *scores = (float *)calloc((size_t)seqLen * (size_t)seqLen, sizeof(float));
+    if (!scores) return;  // allocation failed; leave `out` untouched
+
+    for (int i = 0; i < seqLen; i++) {
+        for (int j = 0; j < seqLen; j++) {
+            float dot = 0;
+            for (int d = 0; d < headDim; d++)
+                dot += Q[i * headDim + d] * K[j * headDim + d];
+            scores[i * seqLen + j] = dot * scale;
+        }
+    }
+    for (int i = 0; i < seqLen; i++) {
+        float maxv = scores[i * seqLen];
+        for (int j = 1; j < seqLen; j++)
+            if (scores[i * seqLen + j] > maxv) maxv = scores[i * seqLen + j];
+        float sum = 0;
+        for (int j = 0; j < seqLen; j++) {
+            scores[i * seqLen + j] = expf(scores[i * seqLen + j] - maxv);
+            sum += scores[i * seqLen + j];
+        }
+        for (int j = 0; j < seqLen; j++)
+            scores[i * seqLen + j] /= sum;
+    }
+    for (int i = 0; i < seqLen; i++) {
+        for (int d = 0; d < headDim; d++) {
+            float acc = 0;
+            for (int j = 0; j < seqLen; j++)
+                acc += scores[i * seqLen + j] * V[j * headDim + d];
+            out[i * headDim + d] = acc;
+        }
+    }
+    free(scores);
+}
+
+#pragma mark - Container Discovery
+
+static id findE5Container(MLModel *model, NSURL *compiledURL, MLModelConfiguration *cfg) {
+    // Try standard paths first
+    @try {
+        id eng = [model valueForKey:@"_internalEngine"];
+        if ([NSStringFromClass([eng class]) containsString:@"MLE5"]) {
+            id pl = [eng valueForKey:@"programLibrary"];
+            if (pl) {
+                id c = nil;
+                @try { c = [pl valueForKey:@"_container"]; } @catch(id e) { (void)e; }
+                if (!c) {
+                    @try {
+                        id impl = [pl valueForKey:@"_impl"];
+                        if (impl) c = [impl valueForKey:@"_container"];
+                    } @catch(id e) { (void)e; }
+                }
+                if (c) return c;
+            }
+        }
+
+        // MLMultiFunctionProgramEngine path
+        if ([NSStringFromClass([eng class]) isEqualToString:@"MLMultiFunctionProgramEngine"]) {
+            NSDictionary *map = [eng valueForKey:@"_functionNameToEngineMap"];
+            for (id key in map) {
+                id sub = map[key];
+                if ([NSStringFromClass([sub class]) containsString:@"MLE5"]) {
+                    id pl = [sub valueForKey:@"programLibrary"];
+                    if (pl) {
+                        id c = nil;
+                        @try { c = [pl valueForKey:@"_container"]; } @catch(id e) { (void)e; }
+                        if (!c) {
+                            @try {
+                                id impl = [pl valueForKey:@"_impl"];
+                                if (impl) c = [impl valueForKey:@"_container"];
+                            } @catch(id e) { (void)e; }
+                        }
+                        if (c) return c;
+                    }
+                }
+            }
+        }
+    } @catch(id e) { (void)e; }
+
+    // Create MLProgramE5Container directly from compiled model
+    Class e5Cls = NSClassFromString(@"MLProgramE5Container");
+    if (!e5Cls) return nil;
+
+    // Find model.mil path inside the compiled model
+    NSString *compiledPath = [compiledURL path];
+    NSString *milPath = [compiledPath stringByAppendingPathComponent:@"model.mil"];
+    if (![[NSFileManager defaultManager] fileExistsAtPath:milPath]) {
+        printf("  No model.mil at %s\n", [milPath UTF8String]);
+
+        // List contents
+        NSArray *contents = [[NSFileManager defaultManager]
+            contentsOfDirectoryAtPath:compiledPath error:nil];
+        printf("  Compiled model contents: %s\n", [[contents description] UTF8String]);
+    }
+
+    // Try to create E5 container with the model asset description from NN container
+    @try {
+        id eng = [model valueForKey:@"_internalEngine"];
+        id nnContainer = [eng valueForKey:@"_container"];
+        if (nnContainer) {
+            // Get model file path
+            NSString *modelFilePath = nil;
+            @try { modelFilePath = [nnContainer valueForKey:@"_modelFilePath"]; }
+            @catch(id e) { (void)e; }
+
+            if (modelFilePath) {
+                printf("  Model file path: %s\n", [modelFilePath UTF8String]);
+
+                // Try to create E5 container with this path
+                @try {
+                    id c = ((id(*)(id,SEL,id,id))objc_msgSend)(
+                        [e5Cls alloc],
+                        NSSelectorFromString(@"initWithModelAssetPath:configuration:"),
+                        modelFilePath, cfg);
+                    if (c) return c;
+                } @catch(id e) { (void)e; }
+            }
+
+            // Try initWithModelAssetDescription
+            @try {
+                id assetDesc = nil;
+                @try { assetDesc = [nnContainer valueForKey:@"_modelAssetDescription"]; }
+                @catch(id e) { (void)e; }
+                if (!assetDesc) {
+                    @try { assetDesc = [nnContainer valueForKey:@"modelAssetDescription"]; }
+                    @catch(id e) { (void)e; }
+                }
+                if (assetDesc) {
+                    printf("  Asset description: %s\n",
+                           [NSStringFromClass([assetDesc class]) UTF8String]);
+                    id c = ((id(*)(id,SEL,id,id))objc_msgSend)(
+                        [e5Cls alloc],
+                        NSSelectorFromString(@"initWithModelAssetDescription:configuration:"),
+                        assetDesc, cfg);
+                    if (c) return c;
+                }
+            } @catch(id e) { (void)e; }
+        }
+    } @catch(id e) { (void)e; }
+
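+    // Fallback: every attempt above failed, so list each instance method whose
+    // selector contains "init" (substring match) to guide the next experiment.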
+    unsigned int mc;
+    Method *ims = class_copyMethodList(e5Cls, &mc);
+    printf("  MLProgramE5Container init methods:\n");
+    for (unsigned int i = 0; i < mc; i++) {
+        const char *sel = sel_getName(method_getName(ims[i]));
+        if (strstr(sel, "init"))
+            printf("    - %s\n", sel);
+    }
+    free(ims);
+
+    return nil;
+}
+
+#pragma mark - Main
+
+int main(int argc, const char *argv[]) {
+    (void)argc; (void)argv;
+    @autoreleasepool {
+        mach_timebase_info(&g_tb);
+        printf("================================================================\n");
+        printf("  Custom MIL -> ANE: Experiments Y1, Y2, Y3, Z1\n");
+        printf("================================================================\n\n");
+
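+        // Load the private framework up front and report a failed load instead of
+        // silently continuing.
+        if (!dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/"
+                    "AppleNeuralEngine", RTLD_NOW))
+            printf("  WARN: dlopen(AppleNeuralEngine) failed: %s\n", dlerror());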
+
+        NSString *pkgPath = @"/tmp/ane_sram_256ch_64sp.mlpackage";
+        if (![[NSFileManager defaultManager] fileExistsAtPath:pkgPath]) {
+            printf("FATAL: %s not found. Run: python3 scripts/gen_mlpackages.py\n",
+                   [pkgPath UTF8String]);
+            return 1;
+        }
+
+        NSError *err = nil;
+        MLModelConfiguration *cfg = [[MLModelConfiguration alloc] init];
+        cfg.computeUnits = MLComputeUnitsAll;
+        MLPredictionOptions *opts = [[MLPredictionOptions alloc] init];
+
+        NSURL *compiled = [MLModel compileModelAtURL:
+            [NSURL fileURLWithPath:pkgPath] error:&err];
+        // Cocoa convention: test the return value, not the NSError out-parameter.
+        if (!compiled) { printf("FATAL: compile: %s\n", [[err description] UTF8String]); return 1; }
+
+        MLModel *refModel = [MLModel modelWithContentsOfURL:compiled
+                                              configuration:cfg error:&err];
+        // Cocoa convention: test the return value, not the NSError out-parameter.
+        if (!refModel) { printf("FATAL: load: %s\n", [[err description] UTF8String]); return 1; }
+        printf("  Ref model: %s\n", [NSStringFromClass([refModel class]) UTF8String]);
+
+        MLModelDescription *refDesc = [refModel modelDescription];
+
+        // Find or create E5 container
+        id refContainer = findE5Container(refModel, compiled, cfg);
+        if (refContainer) {
+            printf("  Container: %s\n\n", [NSStringFromClass([refContainer class]) UTF8String]);
+        } else {
+            printf("  No E5 container found. Trying nil container...\n\n");
+        }
+
+        int ch = 256, sp = 64;
+        int nElems = ch * sp;
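+        // Assumes the reference model has exactly one input and one output;
+        // -allKeys order is unspecified, so firstObject is only safe in that case.
+        NSString *inName = [[[refDesc inputDescriptionsByName] allKeys] firstObject];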
+        NSString *outName = [[[refDesc outputDescriptionsByName] allKeys] firstObject];
+        printf("  I/O: %s -> %s, shape [1,%d,1,%d]\n\n", [inName UTF8String],
+               [outName UTF8String], ch, sp);
+
+        // ============================================================
+        // Y1: Scaled Dot-Product Attention
+        // ============================================================
+        printf("================================================================\n");
+        printf("  Y1: scaled_dot_product_attention on ANE\n");
+        printf("================================================================\n\n");
+
+        {
+            int seqLen = ch, headDim = sp;
+
+            NSString *sdpaMIL = [NSString stringWithFormat:
+                @"program(1.3)\n"
+                "{\n"
+                "    func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
+                "        string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
+                "        tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
+                "        tensor<int32, [4]> sr = const()[name = string(\"sr\"), val = tensor<int32, [4]>([1, 1, %d, %d])];\n"
+                "        tensor<fp16, [1, 1, %d, %d]> q = reshape(x = x16, shape = sr)[name = string(\"q\")];\n"
+                "        tensor<fp16, [1, 1, %d, %d]> k = reshape(x = x16, shape = sr)[name = string(\"k\")];\n"
+                "        tensor<fp16, [1, 1, %d, %d]> v = reshape(x = x16, shape = sr)[name = string(\"v\")];\n"
+                "        tensor<fp16, [1, 1, %d, %d]> attn = scaled_dot_product_attention(query = q, key = k, value = v)[name = string(\"attn\")];\n"
+                "        tensor<int32, [4]> or = const()[name = string(\"or\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
+                "        tensor<fp16, [1, %d, 1, %d]> rs = reshape(x = attn, shape = or)[name = string(\"rs\")];\n"
+                "        string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
+                "        tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = rs)[name = string(\"cast_out\")];\n"
+                "    } -> (cast_out);\n"
+                "}\n",
+                ch, sp, ch, sp,
+                seqLen, headDim, seqLen, headDim, seqLen, headDim, seqLen, headDim,
+                seqLen, headDim,
+                ch, sp, ch, sp,
+                ch, sp];
+
+            printf("  Self-attention: B=1, nHeads=1, seqLen=%d, headDim=%d\n\n", seqLen, headDim);
+
+            err = nil;
+            id engine = compileAndCreateEngine(sdpaMIL, @"y1_sdpa", refContainer, cfg, refDesc, &err);
+
+            if (!engine) {
+                printf("  Y1 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
+            } else {
+                printf("  Y1: Engine created\n");
+                MLMultiArray *inputArr = [[MLMultiArray alloc]
+                    initWithShape:@[@1, @(ch), @1, @(sp)]
+                    dataType:MLMultiArrayDataTypeFloat32 error:nil];
+                float *inPtr = (float *)[inputArr dataPointer];
+                fill_random(inPtr, nElems, 0.5f);
+
+                MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
+                    initWithDictionary:@{inName: inputArr} error:nil];
+
+                NSError *runErr = nil;
+                uint64_t t0 = mach_absolute_time();
+                id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
+                double ms = tb_ms(mach_absolute_time() - t0);
+
+                if (runErr || !result) {
+                    printf("  Y1 prediction FAILED: %s\n\n",
+                           runErr ? [[runErr description] UTF8String] : "nil");
+                } else {
+                    MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
+                    if (!outArr) {
+                        printf("  Y1 output nil\n\n");
+                    } else {
+                        float *outPtr = (float *)[outArr dataPointer];
+                        print_first("ANE out", outPtr, nElems);
+                        printf("  Time: %.3f ms\n", ms);
+
+                        float *cpuOut = (float *)calloc(nElems, sizeof(float));
+                        cpu_sdpa(inPtr, inPtr, inPtr, cpuOut, seqLen, headDim);
+                        print_first("CPU ref", cpuOut, nElems);
+
+                        float mad = max_abs_diff(outPtr, cpuOut, nElems);
+                        printf("  Max diff: %.6f, Rel: %.2e\n",
+                               mad, mad / (mean_abs(cpuOut, nElems) + 1e-10f));
+                        printf("  %s\n\n", mad < 0.02f ? "*** Y1 PASSED ***" :
+                               (mad < 0.1f ? "Y1 WARNING" : "Y1 FAILED"));
+
+                        int N = 100;
+                        t0 = mach_absolute_time();
+                        for (int i = 0; i < N; i++) runEngine(engine, fp, opts, nil);
+                        printf("  Bench: %.4f ms/eval (%d iters)\n\n",
+                               tb_ms(mach_absolute_time() - t0) / N, N);
+                        free(cpuOut);
+                    }
+                }
+            }
+        }
+
+        // ============================================================
+        // Y2: Linear with Embedded Weights
+        // ============================================================
+        printf("================================================================\n");
+        printf("  Y2: linear op with embedded weights on ANE\n");
+        printf("================================================================\n\n");
+
+        {
+            int inDim = sp, outDim = sp;
+
+            float *W = (float *)malloc(outDim * inDim * sizeof(float));
+            float *B = (float *)malloc(outDim * sizeof(float));
+            fill_random(W, outDim * inDim, 0.1f);
+            fill_random(B, outDim, 0.01f);
+
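+            // Render W and B as MIL tensor literals so the weights are embedded
+            // directly in the program text.
+            NSMutableString *wLit = [NSMutableString stringWithString:@"["];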
+            for (int i = 0; i < outDim; i++) {
+                if (i > 0) [wLit appendString:@", "];
+                [wLit appendString:@"["];
+                for (int j = 0; j < inDim; j++) {
+                    if (j > 0) [wLit appendString:@", "];
+                    [wLit appendFormat:@"%.8e", W[i * inDim + j]];
+                }
+                [wLit appendString:@"]"];
+            }
+            [wLit appendString:@"]"];
+
+            NSMutableString *bLit = [NSMutableString stringWithString:@"["];
+            for (int j = 0; j < outDim; j++) {
+                if (j > 0) [bLit appendString:@", "];
+                [bLit appendFormat:@"%.8e", B[j]];
+            }
+            [bLit appendString:@"]"];
+
+            NSString *linearMIL = [NSString stringWithFormat:
+                @"program(1.3)\n"
+                "{\n"
+                "    func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
+                "        string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
+                "        tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
+                "        tensor<int32, [2]> rs = const()[name = string(\"rs\"), val = tensor<int32, [2]>([%d, %d])];\n"
+                "        tensor<fp16, [%d, %d]> flat = reshape(x = x16, shape = rs)[name = string(\"flat\")];\n"
+                "        tensor<fp16, [%d, %d]> Wc = const()[name = string(\"Wc\"), val = tensor<fp16, [%d, %d]>(%@)];\n"
+                "        tensor<fp16, [%d]> Bc = const()[name = string(\"Bc\"), val = tensor<fp16, [%d]>(%@)];\n"
+                "        tensor<fp16, [%d, %d]> lin = linear(x = flat, weight = Wc, bias = Bc)[name = string(\"lin\")];\n"
+                "        tensor<int32, [4]> rs2 = const()[name = string(\"rs2\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
+                "        tensor<fp16, [1, %d, 1, %d]> rso = reshape(x = lin, shape = rs2)[name = string(\"rso\")];\n"
+                "        string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
+                "        tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = rso)[name = string(\"cast_out\")];\n"
+                "    } -> (cast_out);\n"
+                "}\n",
+                ch, sp, ch, sp,
+                ch, sp, ch, sp,
+                outDim, inDim, outDim, inDim, wLit,
+                outDim, outDim, bLit,
+                ch, outDim,
+                ch, sp, ch, sp,
+                ch, sp];
+
+            printf("  Config: [%d,%d] linear %d->%d with embedded W+b\n\n", ch, sp, inDim, outDim);
+
+            err = nil;
+            id engine = compileAndCreateEngine(linearMIL, @"y2_linear", refContainer, cfg, refDesc, &err);
+
+            if (!engine) {
+                printf("  Y2 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
+            } else {
+                printf("  Y2: Engine created\n");
+                MLMultiArray *inputArr = [[MLMultiArray alloc]
+                    initWithShape:@[@1, @(ch), @1, @(sp)]
+                    dataType:MLMultiArrayDataTypeFloat32 error:nil];
+                float *inPtr = (float *)[inputArr dataPointer];
+                fill_random(inPtr, nElems, 0.5f);
+
+                MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
+                    initWithDictionary:@{inName: inputArr} error:nil];
+
+                NSError *runErr = nil;
+                uint64_t t0 = mach_absolute_time();
+                id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
+                double ms = tb_ms(mach_absolute_time() - t0);
+
+                if (runErr || !result) {
+                    printf("  Y2 prediction FAILED: %s\n\n",
+                           runErr ? [[runErr description] UTF8String] : "nil");
+                } else {
+                    MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
+                    if (outArr) {
+                        float *outPtr = (float *)[outArr dataPointer];
+                        print_first("ANE out", outPtr, nElems);
+                        printf("  Time: %.3f ms\n", ms);
+
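+                        // CPU reference in fp32: out[ch, outDim] = x[ch, inDim] @ W^T + b.
+                        // The ANE path rounds x, W, and b to fp16, hence the loose tolerance below.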
+                        float *cpuOut = (float *)calloc(nElems, sizeof(float));
+                        for (int i = 0; i < ch; i++) {
+                            for (int j = 0; j < outDim; j++) {
+                                float acc = 0;
+                                for (int k = 0; k < inDim; k++)
+                                    acc += inPtr[i * inDim + k] * W[j * inDim + k];
+                                cpuOut[i * outDim + j] = acc + B[j];
+                            }
+                        }
+                        print_first("CPU ref", cpuOut, nElems);
+
+                        float mad = max_abs_diff(outPtr, cpuOut, nElems);
+                        printf("  Max diff: %.6f, Rel: %.2e\n",
+                               mad, mad / (mean_abs(cpuOut, nElems) + 1e-10f));
+                        printf("  %s\n\n", mad < 0.05f ? "*** Y2 PASSED ***" :
+                               (mad < 0.5f ? "Y2 WARNING" : "Y2 FAILED"));
+
+                        int N = 100;
+                        t0 = mach_absolute_time();
+                        for (int i = 0; i < N; i++) runEngine(engine, fp, opts, nil);
+                        printf("  Bench: %.4f ms/eval (%d iters)\n\n",
+                               tb_ms(mach_absolute_time() - t0) / N, N);
+                        free(cpuOut);
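+                    } else {
+                        // Report a missing output instead of failing silently (matches Y1).
+                        printf("  Y2 output nil\n\n");
+                    }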
+                }
+            }
+            free(W); free(B);
+        }
+
+        // ============================================================
+        // Y3: Transformer Block (Attention + FFN)
+        // ============================================================
+        printf("================================================================\n");
+        printf("  Y3: Transformer Block (LN + SDPA + Residual + LN + FFN + Residual)\n");
+        printf("================================================================\n\n");
+
+        {
+            int seqLen = ch, dim = sp, ffnDim = 128;
+
+            float *w1 = (float *)malloc(ffnDim * dim * sizeof(float));
+            float *b1 = (float *)malloc(ffnDim * sizeof(float));
+            float *w2 = (float *)malloc(dim * ffnDim * sizeof(float));
+            float *b2 = (float *)malloc(dim * sizeof(float));
+            fill_random(w1, ffnDim * dim, 0.05f);
+            fill_random(b1, ffnDim, 0.01f);
+            fill_random(w2, dim * ffnDim, 0.05f);
+            fill_random(b2, dim, 0.01f);
+
+            // Blocks that render float arrays as MIL tensor literals,
+            // e.g. [[a, b], [c, d]] for a matrix or [a, b] for a vector.
+            NSMutableString *(^buildMat)(float*, int, int) = ^(float *m, int rows, int cols) {
+                NSMutableString *s = [NSMutableString stringWithString:@"["];
+                for (int i = 0; i < rows; i++) {
+                    if (i > 0) [s appendString:@", "];
+                    [s appendString:@"["];
+                    for (int j = 0; j < cols; j++) {
+                        if (j > 0) [s appendString:@", "];
+                        [s appendFormat:@"%.8e", m[i * cols + j]];
+                    }
+                    [s appendString:@"]"];
+                }
+                [s appendString:@"]"];
+                return s;
+            };
+
+            NSMutableString *(^buildVec)(float*, int) = ^(float *v, int n) {
+                NSMutableString *s = [NSMutableString stringWithString:@"["];
+                for (int i = 0; i < n; i++) {
+                    if (i > 0) [s appendString:@", "];
+                    [s appendFormat:@"%.8e", v[i]];
+                }
+                [s appendString:@"]"];
+                return s;
+            };
+
+            NSMutableString *(^buildOnes)(int) = ^(int n) {
+                NSMutableString *s = [NSMutableString stringWithString:@"["];
+                for (int i = 0; i < n; i++) {
+                    if (i > 0) [s appendString:@", "];
+                    [s appendString:@"1.0"];
+                }
+                [s appendString:@"]"];
+                return s;
+            };
+
+            NSMutableString *(^buildZeros)(int) = ^(int n) {
+                NSMutableString *s = [NSMutableString stringWithString:@"["];
+                for (int i = 0; i < n; i++) {
+                    if (i > 0) [s appendString:@", "];
+                    [s appendString:@"0.0"];
+                }
+                [s appendString:@"]"];
+                return s;
+            };
+
+            NSString *tfMIL = [NSString stringWithFormat:
+                @"program(1.3)\n"
+                "{\n"
+                "    func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
+                "        string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
+                "        tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
+                "        tensor<int32, [2]> r2 = const()[name = string(\"r2\"), val = tensor<int32, [2]>([%d, %d])];\n"
+                "        tensor<fp16, [%d, %d]> flat = reshape(x = x16, shape = r2)[name = string(\"flat\")];\n"
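+                // LN1 (note: eps = 1e-5 is below the smallest normal fp16 value, ~6.1e-5,
+                // so it is stored as a subnormal)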
+                "        tensor<fp16, [%d]> g1 = const()[name = string(\"g1\"), val = tensor<fp16, [%d]>(%@)];\n"
+                "        tensor<fp16, [%d]> b1 = const()[name = string(\"b1\"), val = tensor<fp16, [%d]>(%@)];\n"
+                "        tensor<int32, [1]> la = const()[name = string(\"la\"), val = tensor<int32, [1]>([-1])];\n"
+                "        fp16 eps = const()[name = string(\"eps\"), val = fp16(1e-5)];\n"
+                "        tensor<fp16, [%d, %d]> ln1 = layer_norm(x = flat, axes = la, gamma = g1, beta = b1, epsilon = eps)[name = string(\"ln1\")];\n"
+                // SDPA
+                "        tensor<int32, [4]> sr = const()[name = string(\"sr\"), val = tensor<int32, [4]>([1, 1, %d, %d])];\n"
+                "        tensor<fp16, [1, 1, %d, %d]> q = reshape(x = ln1, shape = sr)[name = string(\"q\")];\n"
+                "        tensor<fp16, [1, 1, %d, %d]> k = reshape(x = ln1, shape = sr)[name = string(\"k\")];\n"
+                "        tensor<fp16, [1, 1, %d, %d]> v = reshape(x = ln1, shape = sr)[name = string(\"v\")];\n"
+                "        tensor<fp16, [1, 1, %d, %d]> at = scaled_dot_product_attention(query = q, key = k, value = v)[name = string(\"at\")];\n"
+                "        tensor<fp16, [%d, %d]> af = reshape(x = at, shape = r2)[name = string(\"af\")];\n"
+                // Residual 1
+                "        tensor<fp16, [%d, %d]> r1 = add(x = flat, y = af)[name = string(\"r1\")];\n"
+                // LN2
+                "        tensor<fp16, [%d]> g2 = const()[name = string(\"g2\"), val = tensor<fp16, [%d]>(%@)];\n"
+                "        tensor<fp16, [%d]> b2 = const()[name = string(\"b2\"), val = tensor<fp16, [%d]>(%@)];\n"
+                "        tensor<fp16, [%d, %d]> ln2 = layer_norm(x = r1, axes = la, gamma = g2, beta = b2, epsilon = eps)[name = string(\"ln2\")];\n"
+                // FFN
+                "        tensor<fp16, [%d, %d]> W1 = const()[name = string(\"W1\"), val = tensor<fp16, [%d, %d]>(%@)];\n"
+                "        tensor<fp16, [%d]> B1 = const()[name = string(\"B1\"), val = tensor<fp16, [%d]>(%@)];\n"
+                "        tensor<fp16, [%d, %d]> f1 = linear(x = ln2, weight = W1, bias = B1)[name = string(\"f1\")];\n"
+                "        tensor<fp16, [%d, %d]> ga = gelu(x = f1, mode = string(\"TANH_APPROXIMATION\"))[name = string(\"ga\")];\n"
+                "        tensor<fp16, [%d, %d]> W2 = const()[name = string(\"W2\"), val = tensor<fp16, [%d, %d]>(%@)];\n"
+                "        tensor<fp16, [%d]> B2 = const()[name = string(\"B2\"), val = tensor<fp16, [%d]>(%@)];\n"
+                "        tensor<fp16, [%d, %d]> f2 = linear(x = ga, weight = W2, bias = B2)[name = string(\"f2\")];\n"
+                // Residual 2
+                "        tensor<fp16, [%d, %d]> r2o = add(x = r1, y = f2)[name = string(\"r2o\")];\n"
+                // Output
+                "        tensor<int32, [4]> r4 = const()[name = string(\"r4\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
+                "        tensor<fp16, [1, %d, 1, %d]> o16 = reshape(x = r2o, shape = r4)[name = string(\"o16\")];\n"
+                "        string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
+                "        tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = o16)[name = string(\"cast_out\")];\n"
+                "    } -> (cast_out);\n"
+                "}\n",
+                ch, sp, ch, sp,
+                seqLen, dim, seqLen, dim,
+                dim, dim, buildOnes(dim),
+                dim, dim, buildZeros(dim),
+                seqLen, dim,
+                seqLen, dim, seqLen, dim, seqLen, dim, seqLen, dim,
+                seqLen, dim,
+                seqLen, dim,
+                seqLen, dim,
+                dim, dim, buildOnes(dim),
+                dim, dim, buildZeros(dim),
+                seqLen, dim,
+                ffnDim, dim, ffnDim, dim, buildMat(w1, ffnDim, dim),
+                ffnDim, ffnDim, buildVec(b1, ffnDim),
+                seqLen, ffnDim,
+                seqLen, ffnDim,
+                dim, ffnDim, dim, ffnDim, buildMat(w2, dim, ffnDim),
+                dim, dim, buildVec(b2, dim),
+                seqLen, dim,
+                seqLen, dim,
+                ch, sp, ch, sp,
+                ch, sp];
+
+            printf("  Pipeline: LN->SDPA->Res->LN->FFN(%d->%d->%d)->Res\n\n", dim, ffnDim, dim);
+
+            err = nil;
+            id engine = compileAndCreateEngine(tfMIL, @"y3_transformer",
+                refContainer, cfg, refDesc, &err);
+
+            if (!engine) {
+                printf("  Y3 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
+            } else {
+                printf("  Y3: Engine created!\n");
+                MLMultiArray *inputArr = [[MLMultiArray alloc]
+                    initWithShape:@[@1, @(ch), @1, @(sp)]
+                    dataType:MLMultiArrayDataTypeFloat32 error:nil];
+                float *inPtr = (float *)[inputArr dataPointer];
+                fill_random(inPtr, nElems, 0.5f);
+
+                MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
+                    initWithDictionary:@{inName: inputArr} error:nil];
+
+                NSError *runErr = nil;
+                uint64_t t0 = mach_absolute_time();
+                id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
+                double ms = tb_ms(mach_absolute_time() - t0);
+
+                if (runErr || !result) {
+                    printf("  Y3 prediction FAILED: %s\n\n",
+                           runErr ? [[runErr description] UTF8String] : "nil");
+                } else {
+                    MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
+                    if (outArr) {
+                        float *outPtr = (float *)[outArr dataPointer];
+                        print_first("ANE out", outPtr, nElems);
+                        printf("  Time: %.3f ms\n", ms);
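+                        // Smoke test only: there is no CPU reference for the full block, so this
+                        // checks that the output is finite and non-zero, not that it is correct.
+                        float m = mean_abs(outPtr, nElems);
+                        int ok = isfinite(m) && m > 1e-6f;
+                        printf("  Finite and non-zero: %s (mean_abs=%.6f)\n", ok ? "YES" : "NO", m);
+                        printf("  %s\n\n", ok ? "*** Y3 PASSED (smoke test) ***" : "Y3 FAILED");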
+
+                        int N = 100;
+                        t0 = mach_absolute_time();
+                        for (int i = 0; i < N; i++) runEngine(engine, fp, opts, nil);
+                        printf("  Bench: %.4f ms/eval (%d iters)\n\n",
+                               tb_ms(mach_absolute_time() - t0) / N, N);
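+                    } else {
+                        // Report a missing output instead of failing silently (matches Y1).
+                        printf("  Y3 output nil\n\n");
+                    }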
+                }
+            }
+            free(w1); free(b1); free(w2); free(b2);
+        }
+
+        // ============================================================
+        // Z1: Linear Backward Pass (Gradient Computation)
+        // ============================================================
+        printf("================================================================\n");
+        printf("  Z1: Backward Pass (matmul with runtime tensors) on ANE\n");
+        printf("================================================================\n\n");
+
+        {
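+            // The single [ch, sp] input packs dY in rows [0, M), W in rows [M, M+K),
+            // and padding in the remaining rows, so M + K <= ch and K == N == sp must hold.
+            int M = 128, K = 64, N = 64;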
+
+            NSString *bwdMIL = [NSString stringWithFormat:
+                @"program(1.3)\n"
+                "{\n"
+                "    func main<ios18>(tensor<fp32, [1, %d, 1, %d]> x) {\n"
+                "        string c16 = const()[name = string(\"c16\"), val = string(\"fp16\")];\n"
+                "        tensor<fp16, [1, %d, 1, %d]> x16 = cast(dtype = c16, x = x)[name = string(\"x16\")];\n"
+                "        tensor<int32, [2]> r2 = const()[name = string(\"r2\"), val = tensor<int32, [2]>([%d, %d])];\n"
+                "        tensor<fp16, [%d, %d]> flat = reshape(x = x16, shape = r2)[name = string(\"flat\")];\n"
+                // Slice dY [0:128, :]
+                "        tensor<int32, [2]> db = const()[name = string(\"db\"), val = tensor<int32, [2]>([0, 0])];\n"
+                "        tensor<int32, [2]> de = const()[name = string(\"de\"), val = tensor<int32, [2]>([%d, %d])];\n"
+                "        tensor<fp16, [%d, %d]> dY = slice_by_index(x = flat, begin = db, end = de)[name = string(\"dY\")];\n"
+                // Slice W [128:192, :]
+                "        tensor<int32, [2]> wb = const()[name = string(\"wb\"), val = tensor<int32, [2]>([%d, 0])];\n"
+                "        tensor<int32, [2]> we = const()[name = string(\"we\"), val = tensor<int32, [2]>([%d, %d])];\n"
+                "        tensor<fp16, [%d, %d]> W = slice_by_index(x = flat, begin = wb, end = we)[name = string(\"W\")];\n"
+                // Slice pad [192:256, :]
+                "        tensor<int32, [2]> pb = const()[name = string(\"pb\"), val = tensor<int32, [2]>([%d, 0])];\n"
+                "        tensor<int32, [2]> pe = const()[name = string(\"pe\"), val = tensor<int32, [2]>([%d, %d])];\n"
+                "        tensor<fp16, [%d, %d]> pad = slice_by_index(x = flat, begin = pb, end = pe)[name = string(\"pad\")];\n"
+                // dX = dY @ W
+                "        bool txf = const()[name = string(\"txf\"), val = bool(false)];\n"
+                "        bool tyf = const()[name = string(\"tyf\"), val = bool(false)];\n"
+                "        bool txt = const()[name = string(\"txt\"), val = bool(true)];\n"
+                "        tensor<fp16, [%d, %d]> dX = matmul(x = dY, y = W, transpose_x = txf, transpose_y = tyf)[name = string(\"dX\")];\n"
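+                // dW stand-in = dY^T @ dY. A true weight gradient would be dY^T @ X, but no
+                // activations are packed into the input, so dY is reused to exercise transpose_x.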
+                "        tensor<fp16, [%d, %d]> dW = matmul(x = dY, y = dY, transpose_x = txt, transpose_y = tyf)[name = string(\"dW\")];\n"
+                // Concat [dX, dW, pad]
+                "        int32 ax = const()[name = string(\"ax\"), val = int32(0)];\n"
+                "        bool il = const()[name = string(\"il\"), val = bool(false)];\n"
+                "        tensor<fp16, [%d, %d]> pk = concat(values = (dX, dW, pad), axis = ax, interleave = il)[name = string(\"pk\")];\n"
+                "        tensor<int32, [4]> r4 = const()[name = string(\"r4\"), val = tensor<int32, [4]>([1, %d, 1, %d])];\n"
+                "        tensor<fp16, [1, %d, 1, %d]> o16 = reshape(x = pk, shape = r4)[name = string(\"o16\")];\n"
+                "        string c32 = const()[name = string(\"c32\"), val = string(\"fp32\")];\n"
+                "        tensor<fp32, [1, %d, 1, %d]> cast_out = cast(dtype = c32, x = o16)[name = string(\"cast_out\")];\n"
+                "    } -> (cast_out);\n"
+                "}\n",
+                ch, sp, ch, sp,
+                ch, sp, ch, sp,
+                M, K, M, K,
+                M, M + K, K, K, K,
+                M + K, ch, sp, ch - M - K, sp,
+                M, N,
+                K, K,
+                ch, sp,
+                ch, sp, ch, sp,
+                ch, sp];
+
+            printf("  dX = dY[%d,%d] @ W[%d,%d] -> [%d,%d]\n", M, K, K, N, M, N);
+            printf("  dW = dY^T @ dY -> [%d,%d]\n\n", K, K);
+
+            err = nil;
+            id engine = compileAndCreateEngine(bwdMIL, @"z1_backward",
+                refContainer, cfg, refDesc, &err);
+
+            if (!engine) {
+                printf("  Z1 FAILED: %s\n\n", err ? [[err description] UTF8String] : "unknown");
+            } else {
+                printf("  Z1: Engine created\n");
+                MLMultiArray *inputArr = [[MLMultiArray alloc]
+                    initWithShape:@[@1, @(ch), @1, @(sp)]
+                    dataType:MLMultiArrayDataTypeFloat32 error:nil];
+                float *inPtr = (float *)[inputArr dataPointer];
+                fill_random(inPtr, nElems, 0.3f);
+
+                MLDictionaryFeatureProvider *fp = [[MLDictionaryFeatureProvider alloc]
+                    initWithDictionary:@{inName: inputArr} error:nil];
+
+                NSError *runErr = nil;
+                uint64_t t0 = mach_absolute_time();
+                id<MLFeatureProvider> result = runEngine(engine, fp, opts, &runErr);
+                double ms = tb_ms(mach_absolute_time() - t0);
+
+                if (runErr || !result) {
+                    printf("  Z1 prediction FAILED: %s\n\n",
+                           runErr ? [[runErr description] UTF8String] : "nil");
+                } else {
+                    MLMultiArray *outArr = [result featureValueForName:outName].multiArrayValue;
+                    if (outArr) {
+                        float *outPtr = (float *)[outArr dataPointer];
+
+                        // CPU: dX = dY @ W
+                        float *dY_cpu = inPtr;
+                        float *W_cpu = inPtr + M * K;
+                        float *dX_cpu = (float *)calloc(M * N, sizeof(float));
+                        for (int i = 0; i < M; i++)
+                            for (int j = 0; j < N; j++) {
+                                float a = 0;
+                                for (int k = 0; k < K; k++)
+                                    a += dY_cpu[i*K+k] * W_cpu[k*N+j];
+                                dX_cpu[i*N+j] = a;
+                            }
+
+                        // CPU: dW = dY^T @ dY
+                        float *dW_cpu = (float *)calloc(K * K, sizeof(float));
+                        for (int i = 0; i < K; i++)
+                            for (int j = 0; j < K; j++) {
+                                float a = 0;
+                                for (int m = 0; m < M; m++)
+                                    a += dY_cpu[m*K+i] * dY_cpu[m*K+j];
+                                dW_cpu[i*K+j] = a;
+                            }
+
+                        print_first("ANE dX", outPtr, M * N);
+                        print_first("CPU dX", dX_cpu, M * N);
+                        float mad_dx = max_abs_diff(outPtr, dX_cpu, M * N);
+                        printf("  dX diff: %.6f, Rel: %.2e\n",
+                               mad_dx, mad_dx / (mean_abs(dX_cpu, M*N) + 1e-10f));
+
+                        print_first("ANE dW", outPtr + M*N, K*K);
+                        print_first("CPU dW", dW_cpu, K*K);
+                        float mad_dw = max_abs_diff(outPtr + M*N, dW_cpu, K * K);
+                        printf("  dW diff: %.6f, Rel: %.2e\n",
+                               mad_dw, mad_dw / (mean_abs(dW_cpu, K*K) + 1e-10f));
+                        printf("  Time: %.3f ms\n", ms);
+                        // Absolute tolerances are loose because the ANE computes in fp16.
+                        printf("  %s\n\n",
+                               (mad_dx < 0.5f && mad_dw < 1.0f)
+                               ? "*** Z1 PASSED ***"
+                               : "Z1: differences exceed tolerance (possibly fp16 precision)");
+
+                        // Benchmark: NN back-to-back predictions on the same input.
+                        int NN = 100;
+                        t0 = mach_absolute_time();
+                        for (int i = 0; i < NN; i++) runEngine(engine, fp, opts, nil);
+                        printf("  Bench: %.4f ms/eval (%d iters)\n\n",
+                               tb_ms(mach_absolute_time() - t0) / NN, NN);
+
+                        free(dX_cpu); free(dW_cpu);
+                    }
+                }
+            }
+        }
+
+        printf("================================================================\n");
+        printf("  DONE\n");
+        printf("================================================================\n");
+    }
+    return 0;
+}
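
The Z1 check above compares the ANE output against two CPU reference products and reports a maximum absolute difference plus a relative error. A minimal sketch of that comparison in portable C follows. The helper bodies are assumptions: the definitions of `max_abs_diff` and `mean_abs` are not shown in this section, so these are plausible re-implementations, and `ref_dx` / `ref_dw` are hypothetical names for the two inline loops.

```c
#include <assert.h>
#include <math.h>

/* Largest element-wise absolute difference between two buffers.
 * Assumed behavior; the original helper is not shown in this section. */
static float max_abs_diff(const float *a, const float *b, int n) {
    float m = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > m) m = d;
    }
    return m;
}

/* Mean absolute value, used to normalize the error (assumed behavior). */
static float mean_abs(const float *a, int n) {
    assert(n > 0);
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += fabsf(a[i]);
    return s / (float)n;
}

/* dX (M x N) = dY (M x K) @ W (K x N), all row-major. */
static void ref_dx(const float *dY, const float *W, float *dX,
                   int M, int K, int N) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += dY[i * K + k] * W[k * N + j];
            dX[i * N + j] = acc;
        }
}

/* dW (K x K) = dY^T @ dY, row-major. */
static void ref_dw(const float *dY, float *dW, int M, int K) {
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++) {
            float acc = 0.0f;
            for (int m = 0; m < M; m++)
                acc += dY[m * K + i] * dY[m * K + j];
            dW[i * K + j] = acc;
        }
}

/* Relative error as printed by the test: max abs diff over mean magnitude,
 * with a small epsilon to avoid dividing by zero. */
static float rel_error(const float *out, const float *ref, int n) {
    return max_abs_diff(out, ref, n) / (mean_abs(ref, n) + 1e-10f);
}
```

To use it, compute the references with `ref_dx` and `ref_dw`, then pass the accelerator output and each reference to `max_abs_diff` and `rel_error`. Normalizing by the mean magnitude rather than per element keeps near-zero reference values from inflating the reported error.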
--- a/training/test_throughput_ceiling.m
+++ b/training/test_throughput_ceiling.m
@ -0,0 +1,238 @@
+// test_throughput_ceiling.m — Experiment I: Multi-kernel throughput ceiling
+// Measures CPU round-trip overhead for sequential ANE kernel execution
+// Build: make test_throughput_ceiling && ./test_throughput_ceiling
+#import <Foundation/Foundation.h>
+#import <mach/mach_time.h>
+#include <dispatch/dispatch.h>
+#include "ane_runtime.h"
+
+static int g_fp16_io = 1;
+
+static NSString *gen_conv_mil_fp16(int ch, int sp) {
+    return [NSString stringWithFormat:
+        @"program(1.0)\n[buildInfo = dict<tensor<string, []>, tensor<string, []>>"
+        "({{\"coremlc-version\", \"3505.4.1\"}})]\n{\n"
+        "    func main<ios16>(tensor<fp16, [1, %d, 1, %d]> x) {\n"
+        "        tensor<string, []> pt = const()[name=tensor<string, []>(\"pt\"),"
+        " val=tensor<string, []>(\"valid\")];\n"
+        "        tensor<int32, [2]> st = const()[name=tensor<string, []>(\"st\"),"
+        " val=tensor<int32, [2]>([1,1])];\n"
+        "        tensor<int32, [4]> pd = const()[name=tensor<string, []>(\"pd\"),"
+        " val=tensor<int32, [4]>([0,0,0,0])];\n"
+        "        tensor<int32, [2]> dl = const()[name=tensor<string, []>(\"dl\"),"
+        " val=tensor<int32, [2]>([1,1])];\n"
+        "        tensor<int32, []> gr = const()[name=tensor<string, []>(\"gr\"),"
+        " val=tensor<int32, []>(1)];\n"
+        "        tensor<fp16, [%d,%d,1,1]> W = const()[name=tensor<string, []>(\"W\"), "
+        "val=tensor<fp16, [%d,%d,1,1]>(BLOBFILE(path=tensor<string, []>"
+        "(\"@model_path/weights/weight.bin\"), offset=tensor<uint64, []>(64)))];\n"
+        "        tensor<fp16, [1,%d,1,%d]> y = conv(dilations=dl,groups=gr,"
+        "pad=pd,pad_type=pt,strides=st,weight=W,x=x)"
+        "[name=tensor<string, []>(\"conv\")];\n"
+        "    } -> (y);\n}\n", ch, sp, ch, ch, ch, ch, ch, sp];
+}
+
+static ANEKernel *compile_fp16_kernel(int ch, int sp) {
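+    // Weight blob: 128 header bytes followed by ch*ch fp16 weights.
+    int ws = ch * ch * 2;
+    int tot = 128 + ws;
+    uint8_t *blob = (uint8_t *)calloc((size_t)tot, 1);
+    if (!blob) return NULL;  // caller reports this as a compile failure
+    blob[0] = 1; blob[4] = 2;
+    // Chunk header at byte 64 (the offset used by the MIL BLOBFILE reference):
+    // little-endian magic 0xDEADBEEF, then payload size and payload offset.
+    blob[64] = 0xEF; blob[65] = 0xBE; blob[66] = 0xAD; blob[67] = 0xDE;
+    blob[68] = 1;
+    *(uint32_t *)(blob + 72) = (uint32_t)ws;  // weight payload size in bytes
+    *(uint32_t *)(blob + 80) = 128;           // payload starts at byte 128
+    // Identity weights: the 1x1 conv passes its input through unchanged.
+    _Float16 *wp = (_Float16 *)(blob + 128);
+    for (int i = 0; i < ch; i++) wp[i * ch + i] = (_Float16)1.0f;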
+    NSData *wdata = [NSData dataWithBytesNoCopy:blob length:(NSUInteger)tot
+                                   freeWhenDone:YES];
+
+    NSString *mil = gen_conv_mil_fp16(ch, sp);
+    NSData *md = [mil dataUsingEncoding:NSUTF8StringEncoding];
+    size_t ioBytes = (size_t)ch * sp * 2;
+    return ane_compile(md, wdata, 1, &ioBytes, 1, &ioBytes);
+}
+
+int main(int argc, const char *argv[]) {
+    (void)argc; (void)argv;
+    @autoreleasepool {
+        mach_timebase_info_data_t tb;
+        mach_timebase_info(&tb);
+
+        printf("============================================================\n");
+        printf("  Experiment I: Multi-Kernel Throughput Ceiling\n");
+        printf("  Measuring CPU round-trip overhead for sequential ANE ops\n");
+        printf("============================================================\n\n");
+
+        ane_init();
+        if (!g_ane_ok) { printf("ANE not available\n"); return 1; }
+
+        typedef struct { int ch; int sp; const char *name; } Config;
+        Config configs[] = {
+            {64,  32,  "64x32 (test)"},
+            {256, 64,  "256x64 (small)"},
+            {768, 256, "768x256 (prod)"},
+        };
+        int nconfigs = sizeof(configs) / sizeof(configs[0]);
+
+        for (int ci = 0; ci < nconfigs; ci++) {
+            Config cfg = configs[ci];
+            printf("=== Config: %s ===\n", cfg.name);
+
+            enum { nlayers = 12 };  // one constant sizes both the loop and the array
+            ANEKernel *kernels[nlayers] = {NULL};
+            int compiled = 0;
+            for (int i = 0; i < nlayers; i++) {
+                @try {
+                    kernels[i] = compile_fp16_kernel(cfg.ch, cfg.sp);
+                    if (!kernels[i]) {
+                        printf("  Kernel %d compile failed\n", i);
+                        break;
+                    }
+                    compiled++;
+                } @catch (NSException *ex) {
+                    printf("  Kernel %d exception: %s\n", i,
+                           [[ex reason] UTF8String]);
+                    break;
+                }
+            }
+            printf("  Compiled %d/%d kernels\n", compiled, nlayers);
+            if (compiled < 2) {
+                printf("  Need at least 2 kernels, skipping\n\n");
+                for (int i = 0; i < compiled; i++) ane_free(kernels[i]);
+                continue;
+            }
+
+            size_t ioBytes = (size_t)cfg.ch * cfg.sp * 2;
+            int warmup = 5;
+            int iters = 50;
+
+            // --- Test 1: Sequential (run + memcpy chain) ---
+            printf("\n  --- Test 1: Sequential (run + memcpy) ---\n");
+            {
+                for (int w = 0; w < warmup; w++) {
+                    @try {
+                        for (int i = 0; i < compiled; i++)
+                            ane_eval(kernels[i]);
+                    } @catch (NSException *ex) { (void)ex; }
+                }
+
+                uint64_t t0 = mach_absolute_time();
+                for (int it = 0; it < iters; it++) {
+                    for (int i = 0; i < compiled - 1; i++) {
+                        @try {
+                            ane_eval(kernels[i]);
+                            IOSurfaceLock(kernels[i]->ioOutputs[0],
+                                kIOSurfaceLockReadOnly, NULL);
+                            IOSurfaceLock(kernels[i+1]->ioInputs[0], 0, NULL);
+                            memcpy(
+                                IOSurfaceGetBaseAddress(kernels[i+1]->ioInputs[0]),
+                                IOSurfaceGetBaseAddress(kernels[i]->ioOutputs[0]),
+                                ioBytes);
+                            IOSurfaceUnlock(kernels[i+1]->ioInputs[0], 0, NULL);
+                            IOSurfaceUnlock(kernels[i]->ioOutputs[0],
+                                kIOSurfaceLockReadOnly, NULL);
+                        } @catch (NSException *ex) { (void)ex; }
+                    }
+                    @try {
+                        ane_eval(kernels[compiled - 1]);
+                    } @catch (NSException *ex) { (void)ex; }
+                }
+                double totalMs = (double)(mach_absolute_time() - t0) * tb.numer / tb.denom / 1e6;
+                double perIter = totalMs / iters;
+                double perKernel = perIter / compiled;
+                printf("  Total: %.2f ms/pass (%d kernels)\n", perIter, compiled);
+                printf("  Per kernel: %.3f ms\n", perKernel);
+                printf("  Throughput: %.0f kernels/s\n", compiled * 1000.0 / perIter);
+            }
+
+            // --- Test 2: Run-only (no memcpy; ANE dispatch + execution only) ---
+            printf("\n  --- Test 2: Run-only (no memcpy between) ---\n");
+            {
+                uint64_t t0 = mach_absolute_time();
+                for (int it = 0; it < iters; it++) {
+                    for (int i = 0; i < compiled; i++) {
+                        @try {
+                            ane_eval(kernels[i]);
+                        } @catch (NSException *ex) { (void)ex; }
+                    }
+                }
+                double totalMs = (double)(mach_absolute_time() - t0) * tb.numer / tb.denom / 1e6;
+                double perIter = totalMs / iters;
+                double perKernel = perIter / compiled;
+                printf("  Total: %.2f ms/pass (%d kernels)\n", perIter, compiled);
+                printf("  Per kernel: %.3f ms\n", perKernel);
+                printf("  Throughput: %.0f kernels/s\n", compiled * 1000.0 / perIter);
+            }
+
+            // --- Test 3: Memcpy-only overhead ---
+            printf("\n  --- Test 3: Memcpy-only overhead ---\n");
+            {
+                uint64_t t0 = mach_absolute_time();
+                for (int it = 0; it < iters * 10; it++) {
+                    for (int i = 0; i < compiled - 1; i++) {
+                        IOSurfaceLock(kernels[i]->ioOutputs[0], kIOSurfaceLockReadOnly, NULL);
+                        IOSurfaceLock(kernels[i+1]->ioInputs[0], 0, NULL);
+                        memcpy(
+                            IOSurfaceGetBaseAddress(kernels[i+1]->ioInputs[0]),
+                            IOSurfaceGetBaseAddress(kernels[i]->ioOutputs[0]),
+                            ioBytes);
+                        IOSurfaceUnlock(kernels[i+1]->ioInputs[0], 0, NULL);
+                        IOSurfaceUnlock(kernels[i]->ioOutputs[0], kIOSurfaceLockReadOnly, NULL);
+                    }
+                }
+                double totalMs = (double)(mach_absolute_time() - t0) * tb.numer / tb.denom / 1e6;
+                double perIter = totalMs / (iters * 10);
+                double perCopy = perIter / (compiled - 1);
+                printf("  Total: %.3f ms/pass (%d copies)\n", perIter, compiled - 1);
+                printf("  Per memcpy: %.4f ms (%lu bytes)\n", perCopy, (unsigned long)ioBytes);
+            }
+
+            // --- Test 4: GCD serial queue ---
+            printf("\n  --- Test 4: GCD serial queue ---\n");
+            {
+                ANEKernel **kptrs = (ANEKernel **)malloc(
+                    (size_t)compiled * sizeof(ANEKernel *));
+                for (int i = 0; i < compiled; i++) kptrs[i] = kernels[i];
+
+                dispatch_queue_t q = dispatch_queue_create(
+                    "ane.throughput", DISPATCH_QUEUE_SERIAL);
+                dispatch_semaphore_t sem = dispatch_semaphore_create(0);
+                const int ncomp = compiled;
+
+                uint64_t t0 = mach_absolute_time();
+                for (int it = 0; it < iters; it++) {
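+                    // Blocks run one at a time on the serial queue, so this
+                    // counter needs no extra synchronization.
+                    __block int done = 0;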
+                    for (int i = 0; i < ncomp; i++) {
+                        ANEKernel *kp = kptrs[i];
+                        dispatch_async(q, ^{
+                            @try {
+                                ane_eval(kp);
+                            } @catch (NSException *ex) { (void)ex; }
+                            done++;
+                            if (done == ncomp)
+                                dispatch_semaphore_signal(sem);
+                        });
+                    }
+                    dispatch_semaphore_wait(sem, DISPATCH_TIME_FOREVER);
+                }
+                double totalMs = (double)(mach_absolute_time() - t0)
+                    * tb.numer / tb.denom / 1e6;
+                double perIter = totalMs / iters;
+                printf("  Total: %.2f ms/pass (%d kernels, serial queue)\n",
+                       perIter, ncomp);
+                printf("  Per kernel: %.3f ms\n", perIter / ncomp);
+                free(kptrs);
+            }
+
+            printf("\n  --- CPU Round-trip Overhead ---\n");
+            printf("  Not computed here: take (Test 1 ms/pass - Test 2 ms/pass) / %d copies\n",
+                   compiled - 1);
+            printf("  That per-copy cost is what chaining would eliminate per layer.\n");
+
+            for (int i = 0; i < compiled; i++) ane_free(kernels[i]);
+            printf("\n");
+        }
+
+        printf("Done.\n");
+    }
+    return 0;
+}
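
The test above prints the overhead formula but leaves the arithmetic to the reader. A minimal sketch of that calculation in portable C is given below, together with the tick-to-millisecond conversion the timing code performs. The function names `ticks_to_ms` and `roundtrip_overhead_ms` are hypothetical; `numer` and `denom` stand for the fields returned by `mach_timebase_info`, passed in explicitly here so the sketch compiles without the Mach headers.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a mach_absolute_time() tick delta to milliseconds.
 * One tick lasts numer/denom nanoseconds. */
static double ticks_to_ms(uint64_t ticks, uint32_t numer, uint32_t denom) {
    assert(denom != 0);
    return (double)ticks * numer / denom / 1e6;
}

/* Per-copy CPU round-trip overhead:
 *   (sequential ms/pass - run-only ms/pass) / (kernels - 1)
 * A pass over `kernels` kernels performs kernels - 1 copies between them. */
static double roundtrip_overhead_ms(double seq_ms_per_pass,
                                    double run_only_ms_per_pass,
                                    int kernels) {
    assert(kernels >= 2);
    return (seq_ms_per_pass - run_only_ms_per_pass) / (kernels - 1);
}
```

Feed it the "Total" values printed by Test 1 and Test 2 for the same config. Because the two tests are timed in separate loops, run-to-run noise can make the difference small or even negative for the smallest configs, so treat the result as an estimate and cross-check it against the "Per memcpy" figure from Test 3.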