[docs] Developer documentation: architecture diagrams, complete API reference, benchmark guide, M4 Max results, security audit report

2026-03-03 14:22:22 +01:00 · 2026-03-03 14:22:22 +01:00 · 37cac988b8
parent 680f8c7e20
commit 37cac988b8
6 changed files with 1701 additions and 0 deletions
--- a/docs/API_REFERENCE.md
+++ b/docs/API_REFERENCE.md
@@ -0,0 +1,429 @@
+# ANE Training -- API Reference
+
+Complete index of all public functions, structs, and macros, organized by source file.
+
+---
+
+## Table of Contents
+
+1. [stories_config.h -- Model Configuration](#stories_configh)
+2. [stories_io.h -- IOSurface I/O and Compilation](#stories_ioh)
+3. [stories_mil.h -- MIL Program Generators](#stories_milh)
+4. [stories_cpu_ops.h -- CPU Operations](#stories_cpu_opsh)
+5. [ane_runtime.h -- Generalized ANE Wrapper](#ane_runtimeh)
+6. [ane_mil_gen.h -- Composable MIL Helpers](#ane_mil_genh)
+7. [ane_rmsnorm_bwd.h -- RMSNorm Backward on ANE](#ane_rmsnorm_bwdh)
+8. [ane_classifier.h -- Classifier and Softmax on ANE](#ane_classifierh)
+9. [bridge/ane_bridge.h -- C Bridge API](#bridgeane_bridgeh)
+10. [MIL Operation Reference](#mil-operation-reference)
+11. [Weight Blob Format](#weight-blob-format)
+
+---
+
+## stories_config.h
+
+Model constants, data structures, and memory allocation helpers.
+
+### Macros
+
+| Macro | Value | Description |
+|-------|-------|-------------|
+| `DIM` | 768 | Model hidden dimension |
+| `HIDDEN` | 2048 | FFN intermediate dimension |
+| `HEADS` | 12 | Number of attention heads |
+| `HD` | 64 (`DIM/HEADS`) | Per-head dimension |
+| `SEQ` | 256 | Sequence length |
+| `NLAYERS` | 12 | Number of transformer layers |
+| `VOCAB` | 32000 | Vocabulary size |
+| `ACCUM_STEPS` | 10 | Gradient accumulation steps per compile batch |
+| `MAX_COMPILES` | 100 | ANE compile budget before process restart |
+| `KERNELS_PER_LAYER` | 5 | Weight-bearing ANE kernels per layer |
+| `TOTAL_WEIGHT_KERNELS` | 60 | Total weight-bearing compiles per batch |
+| `SCORE_CH` | 3072 (`HEADS*SEQ`) | Attention score channels for SDPA backward |
+| `WQ_SZ` | 589824 (`DIM*DIM`) | Size of Q/K/V/O projection weight matrices |
+| `WO_SZ` | 589824 (`DIM*DIM`) | Size of output projection |
+| `W1_SZ` | 1572864 (`HIDDEN*DIM`) | FFN gate projection size |
+| `W2_SZ` | 1572864 (`DIM*HIDDEN`) | FFN down-projection size |
+| `W3_SZ` | 1572864 (`HIDDEN*DIM`) | FFN value projection size |
+| `LAYER_PARAMS` | -- | Total floats per layer: `4*WQ_SZ + W1_SZ + W2_SZ + W3_SZ + 2*DIM` |
+| `TOTAL_PARAMS` | -- | Total model params: `NLAYERS * LAYER_PARAMS + DIM + VOCAB*DIM` |
+
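The size macros above can be cross-checked with a few lines of arithmetic. The sketch below recomputes `LAYER_PARAMS` and `TOTAL_PARAMS` from the table values; the function names are hypothetical and exist only for this illustration, and reading the lone `+ DIM` term as the final RMSNorm scale is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Constants copied from the macro table above. */
enum { DIM = 768, HIDDEN = 2048, HEADS = 12, NLAYERS = 12, VOCAB = 32000 };

/* Per-head dimension (HD); DIM must divide evenly across heads. */
static int head_dim(void) {
    assert(DIM % HEADS == 0);
    return DIM / HEADS;
}

/* Floats in one layer: four DIM x DIM attention matrices,
   three FFN matrices, and two RMSNorm scale vectors. */
static size_t layer_params(void) {
    size_t wq = (size_t)DIM * DIM;     /* WQ_SZ = WO_SZ */
    size_t w1 = (size_t)HIDDEN * DIM;  /* W1_SZ = W2_SZ = W3_SZ */
    return 4 * wq + 3 * w1 + 2 * (size_t)DIM;
}

/* Whole model: all layers, one extra DIM-sized vector (assumed to be the
   final RMSNorm scale), and the VOCAB x DIM embedding table. */
static size_t total_params(void) {
    return (size_t)NLAYERS * layer_params() + DIM + (size_t)VOCAB * DIM;
}
```

With these values `layer_params()` is 7,079,424 and `total_params()` is 109,529,856, about 109.53M parameters.
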
+### Structs
+
+#### `LayerWeights`
+Per-layer weight matrices (all `float*`).
+
+| Field | Shape | Description |
+|-------|-------|-------------|
+| `Wq`, `Wk`, `Wv`, `Wo` | `[DIM, DIM]` | Attention projection weights |
+| `W1`, `W3` | `[HIDDEN, DIM]` | FFN gate and value up-projections |
+| `W2` | `[DIM, HIDDEN]` | FFN down-projection |
+| `rms_att` | `[DIM]` | RMSNorm scale for attention sublayer |
+| `rms_ffn` | `[DIM]` | RMSNorm scale for FFN sublayer |
+
+#### `AdamState`
+First/second moment buffers for a single parameter group.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `m` | `float*` | First moment (mean) estimate |
+| `v` | `float*` | Second moment (variance) estimate |
+| `n` | `size_t` | Number of parameters |
+
+#### `LayerAdam`
+Per-layer Adam optimizer state. Contains one `AdamState` per weight matrix: `Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`, `W3`, `rms_att`, `rms_ffn`.
+
+#### `LayerActs`
+Per-layer activation tensors saved for the backward pass.
+
+| Field | Shape | Description |
+|-------|-------|-------------|
+| `layer_in` | `[DIM, SEQ]` | Input to this layer (for rmsnorm1 backward) |
+| `xnorm` | `[DIM, SEQ]` | RMSNorm1 output |
+| `Q`, `K`, `V` | `[DIM, SEQ]` | QKV projections |
+| `attn_out` | `[DIM, SEQ]` | Attention output (before Wo) |
+| `o_out` | `[DIM, SEQ]` | Wo projection output |
+| `x2` | `[DIM, SEQ]` | Residual after attention |
+| `x2norm` | `[DIM, SEQ]` | RMSNorm2 output |
+| `h1`, `h3` | `[HIDDEN, SEQ]` | FFN intermediates (W1 and W3 outputs) |
+| `silu_out` | `[HIDDEN, SEQ]` | SiLU(h1) * h3 gated output |
+| `ffn_out` | `[DIM, SEQ]` | FFN final output |
+
+#### `LayerGrads`
+Per-layer gradient accumulators. Same field names as `LayerWeights` (all `float*`): `Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`, `W3`, `rms_att`, `rms_ffn`.
+
+#### `Kern`
+Single ANE kernel handle (stories-specific, single I/O).
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `model` | `void*` | Retained `_ANEInMemoryModel` |
+| `ioIn` | `IOSurfaceRef` | Input IOSurface |
+| `ioOut` | `IOSurfaceRef` | Output IOSurface |
+| `request` | `void*` | Retained `_ANERequest` |
+| `tmpDir` | `void*` | Retained temp directory path |
+
+#### `LayerKernels`
+ANE kernels for one transformer layer.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `fwdAttn` | `Kern*` | SDPA forward + taps |
+| `fwdFFN` | `Kern*` | FFN forward + taps |
+| `ffnBwd` | `Kern*` | FFN backward |
+| `sdpaBwd1` | `Kern*` | SDPA backward part 1 (Wo^T + dV + scores) |
+| `sdpaBwd2` | `Kern*` | SDPA backward part 2 (dQ + dK) |
+| `qkvBwd` | `Kern*` | QKV backward (Wq^T, Wk^T, Wv^T) |
+
+#### `CkptHdr`
+Checkpoint file header (128 bytes, version 2).
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `magic` | `int` | `0x424C5A54` ("BLZT") |
+| `version` | `int` | 2 |
+| `step`, `total_steps` | `int` | Training progress |
+| `n_layers`, `vocab_size`, `dim`, `hidden_dim`, `n_heads`, `seq_len` | `int` | Model shape |
+| `lr`, `loss` | `float` | Learning rate, last loss |
+| `cum_compile`, `cum_train`, `cum_wall` | `double` | Cumulative timing (ms) |
+| `cum_steps`, `cum_batches` | `int` | Cumulative counters |
+| `adam_t` | `int` | Adam timestep (for bias correction) |
+| `pad[3]` | `int` | Alignment padding |
+
+#### `Llama2Config`
+Header from llama2.c model files (7 ints): `dim`, `hidden_dim`, `n_layers`, `n_heads`, `n_kv_heads`, `vocab_size`, `seq_len`.
+
+### Global Variables
+
+| Name | Type | Description |
+|------|------|-------------|
+| `g_D` | `Class` | `_ANEInMemoryModelDescriptor` ObjC class |
+| `g_I` | `Class` | `_ANEInMemoryModel` ObjC class |
+| `g_AR` | `Class` | `_ANERequest` ObjC class |
+| `g_AIO` | `Class` | `_ANEIOSurfaceObject` ObjC class |
+| `g_tb` | `mach_timebase_info_data_t` | Mach time base for timing |
+| `g_compile_count` | `int` | Running count of ANE compiles |
+
+### Functions
+
+| Function | Returns | Description |
+|----------|---------|-------------|
+| `ane_init(void)` | `void` | Load AppleNeuralEngine.framework, resolve 4 private class references |
+| `tb_ms(uint64_t t)` | `double` | Convert Mach absolute time to milliseconds |
+| `adam_alloc(size_t n)` | `AdamState` | Allocate zeroed first/second moment buffers for n parameters |
+| `adam_free(AdamState *s)` | `void` | Free an AdamState's buffers |
+| `layer_weights_alloc(void)` | `LayerWeights` | Allocate all weight matrices for one layer |
+| `layer_weights_free(LayerWeights *w)` | `void` | Free all weight matrices for one layer |
+| `layer_adam_alloc(void)` | `LayerAdam` | Allocate Adam state for all weights in one layer |
+| `layer_adam_free(LayerAdam *a)` | `void` | Free Adam state for one layer |
+| `layer_acts_alloc(void)` | `LayerActs` | Allocate all activation buffers for one layer |
+| `layer_acts_free(LayerActs *a)` | `void` | Free all activation buffers for one layer |
+| `layer_grads_alloc(void)` | `LayerGrads` | Allocate zeroed gradient accumulators for one layer |
+| `layer_grads_zero(LayerGrads *g)` | `void` | Zero all gradient accumulators (between accumulation steps) |
+| `layer_grads_free(LayerGrads *g)` | `void` | Free gradient accumulators for one layer |
+
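As an illustration of the allocation helpers listed above, here is a minimal sketch of `adam_alloc` and `adam_free` built only from the documented signatures and the `AdamState` fields. The bodies are assumptions, not the project's source; in particular, clearing the struct after freeing is a choice made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Mirrors the AdamState fields documented above. */
typedef struct {
    float *m;   /* first moment estimate */
    float *v;   /* second moment estimate */
    size_t n;   /* number of parameters */
} AdamState;

/* Allocate zeroed moment buffers; calloc provides the zero fill. */
static AdamState adam_alloc(size_t n) {
    assert(n > 0);
    AdamState s;
    s.m = calloc(n, sizeof(float));
    s.v = calloc(n, sizeof(float));
    s.n = n;
    assert(s.m != NULL && s.v != NULL);
    return s;
}

/* Release both buffers and clear the struct (assumed behavior). */
static void adam_free(AdamState *s) {
    free(s->m);
    free(s->v);
    s->m = NULL;
    s->v = NULL;
    s->n = 0;
}
```

The layer-level helpers such as `layer_adam_alloc` would presumably call an allocator like this once per weight matrix.
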
+---
+
+## stories_io.h
+
+IOSurface creation, fp16/fp32 conversion, weight blob building, and ANE kernel compile/run.
+
+**Depends on**: `stories_config.h`, `<arm_neon.h>`
+
+### Functions
+
+| Function | Returns | Description |
+|----------|---------|-------------|
+| `make_surface(size_t bytes)` | `IOSurfaceRef` | Create a 1D IOSurface with given byte allocation |
+| `build_blob(const float *w, int rows, int cols)` | `NSData*` | Build fp16 weight blob (128B header + row-major fp16 data) from fp32 weights |
+| `build_blob_t(const float *w, int rows, int cols)` | `NSData*` | Build fp16 weight blob with transposed layout (col-major fp16 from row-major fp32) |
+| `build_blob_fp16(_Float16 *d, int cnt)` | `NSData*` | Build weight blob from pre-existing fp16 data (no conversion) |
+| `cvt_f16_f32(float *dst, const _Float16 *src, int n)` | `void` | NEON-vectorized fp16-to-fp32 conversion (8-wide SIMD) |
+| `cvt_f32_f16(_Float16 *dst, const float *src, int n)` | `void` | NEON-vectorized fp32-to-fp16 conversion (8-wide SIMD) |
+| `io_write_fp16(IOSurfaceRef s, const float *data, int channels, int sp)` | `void` | Write fp32 data to IOSurface as fp16 in channel-first `[C,S]` layout |
+| `io_read_fp16(IOSurfaceRef s, float *data, int ch_off, int channels, int sp)` | `void` | Read fp16 data from IOSurface at channel offset, convert to fp32 |
+| `io_copy(IOSurfaceRef dst, int dst_ch, IOSurfaceRef src, int src_ch, int channels, int sp)` | `void` | Copy fp16 data between IOSurfaces at specified channel offsets |
+| `io_write_fp16_at(IOSurfaceRef s, int ch_off, const float *data, int channels, int sp)` | `void` | Write fp32 data to IOSurface at specific channel offset as fp16 |
+| `compile_kern_mil_w(NSString *mil, NSDictionary *weights, int ic_bytes, int oc_bytes)` | `Kern*` | Compile MIL text + weight dictionary into a loaded ANE kernel with IOSurfaces. Increments `g_compile_count`. |
+| `free_kern(Kern *k)` | `void` | Unload ANE model, release IOSurfaces, remove temp directory, free kernel |
+| `ane_run(Kern *k)` | `void` | Run a compiled ANE kernel on current IOSurface contents |
+
+---
+
+## stories_mil.h
+
+MIL program generators for the 6 fused ANE kernel types. Each returns an `NSString*` containing the full MIL program text.
+
+**Depends on**: `stories_io.h`
+
+### Macros
+
+| Macro | Description |
+|-------|-------------|
+| `MIL_HDR` | Standard MIL program header (version 1.3, buildInfo with coremlc/coremltools versions) |
+| `CONV_CONST` | Common conv parameter constants (pad_type, strides, pad, dilations, groups) |
+
+### Functions
+
+| Function | Returns | Description |
+|----------|---------|-------------|
+| `gen_sdpa_fwd_taps(void)` | `NSString*` | SDPA forward: RMSNorm + QKV + attention + Wo. Output: `concat(o_out, Q, K, V, attn_out, xnorm)` `[1, 6*DIM, 1, SEQ]` |
+| `gen_ffn_fwd_taps(void)` | `NSString*` | FFN forward: RMSNorm + W1/W3 + SiLU + W2. Output: `concat(ffn_out, h1, h3, silu_out, x2norm)` `[1, 2*DIM+3*HIDDEN, 1, SEQ]` |
+| `gen_ffn_bwd(void)` | `NSString*` | FFN backward: Input `concat(dffn, h1, h3)`. Output: `concat(dx, dh1, dh3)` `[1, DIM+2*HIDDEN, 1, SEQ]` |
+| `gen_qkvb(void)` | `NSString*` | QKV backward: Input `concat(dQ, dK, dV)`. Output: `dx` `[1, DIM, 1, SEQ]` |
+| `gen_sdpa_bwd1(void)` | `NSString*` | SDPA backward part 1: Input `concat(Q, K, V, dx2)`. Output: `concat(dV, probs, dP)` `[1, DIM+2*SCORE_CH, 1, SEQ]` |
+| `gen_sdpa_bwd2(void)` | `NSString*` | SDPA backward part 2: Input `concat(probs, dP, Q, K)`. Output: `concat(dQ, dK)` `[1, 2*DIM, 1, SEQ]` |
+| `get_mask_blob(void)` | `NSData*` | Lazily build and cache causal attention mask as fp16 blob. Lower-triangular 0, upper -65504. |
+
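The causal mask described for `get_mask_blob` can be sketched as a plain fill loop. This version writes fp32 values for portability, whereas the documented function produces an fp16 blob; the name `fill_causal_mask`, the row-major `[seq, seq]` layout, and treating the diagonal as unmasked are assumptions of this example.

```c
#include <assert.h>

/* Largest-magnitude finite fp16 value, used in place of minus infinity. */
#define MASK_NEG (-65504.0f)

/* Fill a row-major [seq, seq] mask. Row i is the query position and
   column j the key position; keys at or before the query get 0. */
static void fill_causal_mask(float *mask, int seq) {
    assert(mask != 0 && seq > 0);
    for (int i = 0; i < seq; i++) {
        for (int j = 0; j < seq; j++) {
            mask[i * seq + j] = (j <= i) ? 0.0f : MASK_NEG;
        }
    }
}
```

Adding this mask to the attention scores before `softmax` drives the weights of future positions to zero.
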
+### Global Variables
+
+| Name | Type | Description |
+|------|------|-------------|
+| `g_mask_blob` | `NSData*` | Cached causal mask blob (built on first call to `get_mask_blob`) |
+
+---
+
+## stories_cpu_ops.h
+
+CPU-side operations using Accelerate framework (vDSP, vvrsqrtf, vvexpf).
+
+**Depends on**: `stories_config.h`
+
+### Functions
+
+| Function | Returns | Description |
+|----------|---------|-------------|
+| `rmsnorm(float *out, const float *x, const float *w, int d, int S)` | `void` | RMSNorm forward: `out = x * rsqrt(mean(x^2) + eps) * w`. Vectorized via vDSP. Layout: channel-first `[d, S]`. |
+| `rmsnorm_bwd(float *dx, float *dw, const float *dy, const float *x, const float *w, int d, int S)` | `void` | RMSNorm backward: computes `dx` (input gradient) and accumulates `dw` (scale gradient). |
+| `adam_update(float *w, const float *g, AdamState *s, int t, float lr, float b1, float b2, float eps)` | `void` | Adam optimizer step with bias correction. Updates weights in-place. `t` is the timestep for bias correction. |
+| `cross_entropy_loss(float *dlogits, const float *logits, const uint16_t *targets, int V, int S)` | `float` | Compute mean cross-entropy loss. Writes `dlogits = (softmax(logits) - one_hot(targets)) / S`. Column-major `[V, S]` layout. Uses vDSP transpose + vvexpf for vectorized softmax. |
+| `embed_lookup(float *x, const float *embed, const uint16_t *tokens, int dim, int seq)` | `void` | Embedding forward: gather rows from `embed[VOCAB, DIM]` into channel-first `x[DIM, SEQ]`. |
+| `embed_backward(float *d_embed, const float *dx, const uint16_t *tokens, int dim, int seq)` | `void` | Embedding backward: scatter-add `dx` back into embedding table gradient `d_embed`. |
+
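To make the channel-first indexing of the embedding operations concrete, the following scalar reference versions follow the documented signatures. The `_ref` suffix marks them as hypothetical illustrations; the project's own implementations may be vectorized or structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather: x[c, s] = embed[tokens[s], c], with x stored channel-first [dim, seq]. */
static void embed_lookup_ref(float *x, const float *embed,
                             const uint16_t *tokens, int dim, int seq) {
    assert(x && embed && tokens);
    for (int s = 0; s < seq; s++) {
        const float *row = embed + (size_t)tokens[s] * dim;
        for (int c = 0; c < dim; c++) x[c * seq + s] = row[c];
    }
}

/* Scatter-add: d_embed[tokens[s], c] += dx[c, s]. Repeated tokens accumulate. */
static void embed_backward_ref(float *d_embed, const float *dx,
                               const uint16_t *tokens, int dim, int seq) {
    assert(d_embed && dx && tokens);
    for (int s = 0; s < seq; s++) {
        float *row = d_embed + (size_t)tokens[s] * dim;
        for (int c = 0; c < dim; c++) row[c] += dx[c * seq + s];
    }
}
```

Note that the backward pass accumulates rather than overwrites, so a token that occurs several times in the sequence receives the sum of its position gradients.
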
+### Global Variables
+
+| Name | Type | Description |
+|------|------|-------------|
+| `g_rms_tmp` | `float*` | Lazily-allocated scratch buffer for RMSNorm (size SEQ) |
+
+---
+
+## ane_runtime.h
+
+Generalized ANE wrapper with multi-input/output support. Used in bridge, tests, and newer training variants.
+
+### Structs
+
+#### `ANEKernel`
+Generalized kernel handle supporting multiple inputs and outputs.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `model` | `id` | `_ANEInMemoryModel` instance |
+| `ioInputs` | `IOSurfaceRef*` | Array of input IOSurfaces |
+| `ioOutputs` | `IOSurfaceRef*` | Array of output IOSurfaces |
+| `request` | `id` | `_ANERequest` instance |
+| `tmpDir` | `NSString*` | Temp directory for MIL/weights on disk |
+| `nInputs`, `nOutputs` | `int` | Number of I/O tensors |
+| `inputBytes`, `outputBytes` | `size_t*` | Byte sizes for each I/O tensor |
+
+### Global Variables
+
+| Name | Type | Description |
+|------|------|-------------|
+| `g_ANEDesc` | `Class` | `_ANEInMemoryModelDescriptor` |
+| `g_ANEInMem` | `Class` | `_ANEInMemoryModel` |
+| `g_ANEReq` | `Class` | `_ANERequest` |
+| `g_ANEIO` | `Class` | `_ANEIOSurfaceObject` |
+| `g_ane_loaded` | `bool` | Guard to avoid re-loading the framework |
+
+### Functions
+
+| Function | Returns | Description |
+|----------|---------|-------------|
+| `ane_init(void)` | `void` | Load AppleNeuralEngine.framework (idempotent), resolve 4 private ObjC classes |
+| `ane_create_surface(size_t bytes)` | `IOSurfaceRef` | Create a 1D IOSurface of given byte size |
+| `ane_compile(NSData *milText, NSData *weightData, int nInputs, size_t *inputSizes, int nOutputs, size_t *outputSizes)` | `ANEKernel*` | Full compile pipeline: build descriptor, compile MIL, load model, create IOSurfaces + request. Returns NULL on failure. |
+| `ane_write_input(ANEKernel *k, int idx, const void *data, size_t bytes)` | `void` | Write raw bytes to the idx-th input IOSurface (lock/memcpy/unlock) |
+| `ane_read_output(ANEKernel *k, int idx, void *data, size_t bytes)` | `void` | Read raw bytes from the idx-th output IOSurface (read-lock/memcpy/unlock) |
+| `ane_run_kernel(ANEKernel *k)` | `bool` | Run the compiled ANE kernel. Returns true on success. |
+| `ane_free(ANEKernel *k)` | `void` | Unload model, release all IOSurfaces, remove temp dir, free struct |
+
+---
+
+## ane_mil_gen.h
+
+Composable MIL generation helpers for common patterns, plus weight blob builders.
+
+### Functions
+
+| Function | Returns | Description |
+|----------|---------|-------------|
+| `mil_build_weight_blob(const float *w, int out_ch, int in_ch)` | `NSData*` | Build fp16 weight blob with 128B header from fp32 row-major `[out_ch, in_ch]` weights |
+| `mil_gen_matmul(int in_ch, int out_ch, int spatial)` | `NSString*` | Generate MIL for matmul `y = W @ x` with both as runtime inputs. Includes fp32-to-fp16-to-fp32 casts. |
+| `mil_gen_conv(int in_ch, int out_ch, int spatial)` | `NSString*` | Generate MIL for conv-based linear with baked weights from blob file (inference-only) |
+| `mil_gen_qkv(int dim, int spatial)` | `NSString*` | Generate MIL for fused QKV: 3 parallel convs from single input, weights from concatenated blob |
+| `mil_build_qkv_weight_blob(const float *wq, const float *wk, const float *wv, int dim)` | `NSData*` | Build concatenated weight blob for fused QKV (3 chunks, each with 64B header + fp16 data) |
+| `mil_build_ffn_up_weight_blob(const float *w1, const float *w3, int hidden_dim, int dim)` | `NSData*` | Build concatenated weight blob for fused FFN up-projection (W1 + W3 chunks) |
+| `mil_gen_ffn_up(int dim, int hidden_dim, int spatial)` | `NSString*` | Generate MIL for fused FFN up: W1 + W3 parallel convs, outputs h1 and h3 |
+
+---
+
+## ane_rmsnorm_bwd.h
+
+MIL generator for RMSNorm backward on ANE (used by `train_large_ane.m`).
+
+**Depends on**: `stories_mil.h`
+
+### Functions
+
+| Function | Returns | Description |
+|----------|---------|-------------|
+| `gen_rmsnorm_bwd(void)` | `NSString*` | Generate MIL for RMSNorm backward. Input: `concat(dy, x)` as `[1, 2*DIM, 1, SEQ]`. Baked weight: RMSNorm scale `w[DIM]`. Output: `dx` as `[1, DIM, 1, SEQ]`. Note: `dw` (weight gradient) stays on CPU. |
+
+---
+
+## ane_classifier.h
+
+MIL generators for classifier operations on ANE (used by `train_large_ane.m`).
+
+**Depends on**: `stories_mil.h`
+
+### Functions
+
+| Function | Returns | Description |
+|----------|---------|-------------|
+| `gen_classifier_fwd(void)` | `NSString*` | Classifier forward: single 32000-output-channel conv. Input: `[1, DIM, 1, SEQ]`. Baked: embedding weights `[VOCAB, DIM, 1, 1]`. Output: `[1, VOCAB, 1, SEQ]`. |
+| `gen_classifier_bwd(void)` | `NSString*` | Classifier backward: `dx = embed^T @ dlogits`. Uses `matmul` op (not conv, since ANE rejects conv with 32000 input channels). Input: `[1, VOCAB, 1, SEQ]`. Baked: `embed^T [1, DIM, VOCAB]`. Output: `[1, DIM, 1, SEQ]`. |
+| `gen_softmax_vocab(void)` | `NSString*` | Softmax over VOCAB dimension: `softmax(x, axis=1)`. Input: `[1, VOCAB, 1, SEQ]`. Output: `[1, VOCAB, 1, SEQ]`. |
+| `gen_final_rmsnorm(void)` | `NSString*` | Final RMSNorm (standalone, not fused). Input: `[1, DIM, 1, SEQ]`. Baked: `rms_final[DIM]`. Output: `[1, DIM, 1, SEQ]`. |
+
+---
+
+## bridge/ane_bridge.h
+
+C-callable bridge to ANE private APIs for Python ctypes integration.
+
+### Types
+
+| Type | Description |
+|------|-------------|
+| `ANEKernelHandle` | Opaque kernel handle (pointer to internal struct) |
+
+### Functions
+
+| Function | Returns | Description |
+|----------|---------|-------------|
+| `ane_bridge_init(void)` | `int` | Initialize ANE runtime (load private framework, resolve classes). Returns 0 on success, -1 on failure. |
+| `ane_bridge_compile(const char *mil_text, size_t mil_len, const uint8_t *weight_data, size_t weight_len, int n_inputs, const size_t *input_sizes, int n_outputs, const size_t *output_sizes)` | `ANEKernelHandle*` | Compile MIL text + single weight blob into ANE kernel. Returns NULL on failure. |
+| `ane_bridge_compile_multi_weights(const char *mil_text, size_t mil_len, const char **weight_names, const uint8_t **weight_datas, const size_t *weight_lens, int n_weights, int n_inputs, const size_t *input_sizes, int n_outputs, const size_t *output_sizes)` | `ANEKernelHandle*` | Compile MIL text + multiple named weight files. Weight names use `@model_path/` prefix convention. |
+| `ane_bridge_run(ANEKernelHandle *kernel)` | `bool` | Execute a compiled kernel on ANE. Returns true on success. |
+| `ane_bridge_write_input(ANEKernelHandle *kernel, int idx, const void *data, size_t bytes)` | `void` | Write data to kernel input IOSurface at index `idx` |
+| `ane_bridge_read_output(ANEKernelHandle *kernel, int idx, void *data, size_t bytes)` | `void` | Read data from kernel output IOSurface at index `idx` |
+| `ane_bridge_free(ANEKernelHandle *kernel)` | `void` | Unload model, release all IOSurfaces, remove temp dir, free handle |
+| `ane_bridge_get_compile_count(void)` | `int` | Get current compile count (for restart budgeting) |
+| `ane_bridge_reset_compile_count(void)` | `void` | Reset compile count to zero |
+| `ane_bridge_build_weight_blob(const float *src, int rows, int cols, size_t *out_len)` | `uint8_t*` | Build weight blob in ANE format (128B header + fp16). Caller must free via `ane_bridge_free_blob()`. |
+| `ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols, size_t *out_len)` | `uint8_t*` | Build transposed weight blob. Caller must free via `ane_bridge_free_blob()`. |
+| `ane_bridge_free_blob(void *ptr)` | `void` | Free a blob allocated by `ane_bridge_build_weight_blob*` |
+
+---
+
+## MIL Operation Reference
+
+All MIL programs target `ios18` and use fp16 tensors in `[1, C, 1, S]` layout (or `[1, H, S, S]` for attention scores).
+
+| Operation | MIL Syntax | Purpose |
+|-----------|-----------|---------|
+| `conv` | `conv(dilations=dl, groups=gr, pad=pd, pad_type=pt, strides=st, weight=W, x=xn)` | Linear projections (all Wq, Wk, Wv, Wo, W1, W2, W3). 1x1 conv = matmul. Weight shape: `[out_ch, in_ch, 1, 1]`. |
+| `matmul` | `matmul(transpose_x=tx, transpose_y=ty, x=a, y=b)` | Attention score computation (Q @ K^T, scores @ V) and classifier backward. |
+| `softmax` | `softmax(axis=ax, x=ms)` | Attention weight normalization (`axis=-1`) and vocab softmax (`axis=1`). |
+| `mul` | `mul(x=a, y=b)` | Element-wise multiply: RMSNorm scaling, SiLU gating, attention scaling, softmax Jacobian. |
+| `add` | `add(x=a, y=b)` | Causal mask application, SiLU derivative `(1 + h*(1-sig))`, gradient accumulation. |
+| `sub` | `sub(x=a, y=b)` | SiLU derivative: `1 - sigmoid(h1)`, softmax backward: `dP - sum(P*dP)`. |
+| `sigmoid` | `sigmoid(x=h1)` | SiLU activation component (SiLU = x * sigmoid(x)). |
+| `pow` | `pow(x=ss3, y=nhalf)` | RMSNorm: `x^(-0.5)` = reciprocal sqrt. |
+| `reduce_sum` | `reduce_sum(x=sq, axes=rax, keep_dims=kd)` | RMSNorm: sum of squares along channel dim. Softmax backward: row-wise dot product. |
+| `reshape` | `reshape(shape=sh, x=xf)` | `[1,DIM,1,SEQ]` to `[1,HEADS,HD,SEQ]` for multi-head attention. Flatten attention scores. |
+| `transpose` | `transpose(perm=pm, x=q4)` | Permute `[0,1,3,2]`: swap spatial and head_dim for matmul compatibility. |
+| `concat` | `concat(axis=cax, interleave=cid, values=(a,b,c))` | Pack multiple outputs into single IOSurface ("taps"). Always `axis=1`, `interleave=false`. |
+| `slice_by_size` | `slice_by_size(x=x, begin=b, size=sz)` | Split concatenated inputs in backward kernels. `begin=[0,offset,0,0]`, `size=[1,channels,1,SEQ]`. |
+| `cast` | `cast(dtype=to_fp16, x=x)` | fp32-to-fp16 or fp16-to-fp32 precision conversion (used in ane_mil_gen.h generators). |
+| `const` | `const()[name=..., val=...]` | Declare scalar/tensor constants, conv parameters, weight blob references via `BLOBFILE`. |
+
+---
+
+## Weight Blob Format
+
+### Single-weight blob (128 bytes header + data)
+
+```
+Offset  Size   Content
+------  -----  -------
+0       1      0x01 (format marker)
+4       1      0x02 (format marker)
+5-63    59     zeros (global header padding)
+64      4      0xDEADBEEF (chunk magic, little-endian: EF BE AD DE)
+68      1      0x01 (chunk marker)
+72      4      uint32 data_size (total fp16 bytes = out_ch * in_ch * 2)
+80      4      uint32 data_offset (always 128 = 64 global + 64 chunk)
+84-127  44     zeros (chunk header padding)
+128+    N      fp16 weight data, row-major [out_ch, in_ch]
+```
+
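The header layout in the table above can be written out byte by byte. The sketch below fills only the 128-byte header and leaves the fp16 payload to the caller; `write_blob_header` and `put_u32_le` are hypothetical names for this illustration, and bytes not listed in the table are assumed to be zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { BLOB_HDR_BYTES = 128 };

/* Store a 32-bit value little-endian regardless of host byte order. */
static void put_u32_le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Fill the 128-byte header for a single [out_ch, in_ch] fp16 weight.
   The fp16 data starts at hdr + 128 and is written by the caller. */
static void write_blob_header(uint8_t *hdr, int out_ch, int in_ch) {
    assert(hdr != NULL && out_ch > 0 && in_ch > 0);
    memset(hdr, 0, BLOB_HDR_BYTES);
    hdr[0] = 0x01;                         /* format marker */
    hdr[4] = 0x02;                         /* format marker */
    put_u32_le(hdr + 64, 0xDEADBEEFu);     /* chunk magic: EF BE AD DE */
    hdr[68] = 0x01;                        /* chunk marker */
    put_u32_le(hdr + 72, (uint32_t)out_ch * (uint32_t)in_ch * 2u); /* fp16 bytes */
    put_u32_le(hdr + 80, BLOB_HDR_BYTES);  /* data offset: 64 global + 64 chunk */
}
```

For a `[768, 768]` matrix the `data_size` field holds 1,179,648 (0x120000).
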
+### Multi-weight blob (fused QKV, FFN up)
+
+```
+Offset      Content
+------      -------
+0-63        Global header (same as above)
+64          Chunk 0 header (64 bytes): magic, data_size, data_offset
+64+64       Chunk 0 data (fp16 weights)
+64+cs       Chunk 1 header (64 bytes)
+64+cs+64    Chunk 1 data (fp16 weights)
+...
+```
+
+Where `cs = 64 + n_elements * 2` (chunk header size + data size).
+
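The chunk arithmetic above can be expressed as small helpers. This is a sketch that assumes all chunks in a blob hold the same number of elements, as is the case for fused QKV (three `DIM*DIM` matrices); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { GLOBAL_HDR = 64, CHUNK_HDR = 64 };

/* Bytes occupied by one chunk: its header plus n_elements fp16 values (cs). */
static size_t chunk_size(size_t n_elements) {
    return CHUNK_HDR + n_elements * 2;
}

/* File offset of the header of chunk k (0-based), all chunks equal-sized. */
static size_t chunk_header_offset(int k, size_t n_elements) {
    assert(k >= 0);
    return GLOBAL_HDR + (size_t)k * chunk_size(n_elements);
}

/* File offset where the fp16 data of chunk k begins. */
static size_t chunk_data_offset(int k, size_t n_elements) {
    return chunk_header_offset(k, n_elements) + CHUNK_HDR;
}
```

For fused QKV with `DIM = 768`, each chunk is 1,179,712 bytes, so the three chunk headers sit at offsets 64, 1,179,776, and 2,359,488.
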
+MIL references use `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(X))` where X is the chunk header offset within the file (64 for first chunk, 64+cs for second, etc.).
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,370 @@
+# ANE Training -- System Architecture
+
+Training neural networks directly on Apple's Neural Engine via reverse-engineered private APIs (`_ANEClient`, `_ANECompiler`). No CoreML training APIs, no Metal, no GPU.
+
+## Project Structure
+
+```
+ANE/
+-- api_exploration.m          # ANE private API discovery
+-- inmem_basic.m              # In-memory MIL compilation proof-of-concept
+-- inmem_bench.m              # ANE dispatch latency across model sizes
+-- inmem_peak.m               # Peak TFLOPS via deep conv chains (self-contained)
+-- sram_bench.m               # SRAM capacity probing (performance cliff detection)
+-- sram_probe.m               # Fine-grained SRAM size exploration
+-- bridge/
+|   +-- ane_bridge.h           # C-callable API for Python ctypes
+|   +-- ane_bridge.m           # Bridge implementation
+|   +-- Makefile               # Builds libane_bridge.dylib
+|   +-- libane_bridge.dylib    # Pre-built shared library
+-- training/
+|   +-- train_large.m          # Main: 12-layer training (CPU classifier)
+|   +-- train_large_ane.m      # Variant: classifier + softmax on ANE
+|   +-- stories_config.h       # Model constants, structs, alloc helpers
+|   +-- stories_io.h           # IOSurface I/O, NEON fp16, compile/run
+|   +-- stories_mil.h          # MIL generators for 6 fused ANE kernels
+|   +-- stories_cpu_ops.h      # vDSP RMSNorm, cross-entropy, Adam, embedding
+|   +-- ane_runtime.h          # Generalized ANE wrapper (multi-I/O)
+|   +-- ane_mil_gen.h          # Composable MIL helpers (conv, matmul, fused QKV)
+|   +-- ane_rmsnorm_bwd.h      # RMSNorm backward MIL (train_large_ane only)
+|   +-- ane_classifier.h       # Classifier/softmax MIL (train_large_ane only)
+|   +-- forward.h              # Gen1 forward pass (per-linear-kernel, all-CPU)
+|   +-- backward.h             # Gen1 backward pass (all-CPU reference)
+|   +-- model.h                # Gen1 Model struct, per-kernel compile
+|   +-- dashboard.py           # TUI monitoring (loss, power, text generation)
+|   +-- tokenize.py            # Extract pretokenized TinyStories data
+|   +-- download_data.sh       # Download TinyStories from HuggingFace
+|   +-- Makefile               # Build targets for training + tests
+|   +-- test_*.m               # 12 unit test files
+-- docs/                      # This documentation
+-- scripts/                   # Automation scripts
+```
+
+## Two Generations of Training Code
+
+### Gen1: `model.h` + `forward.h` + `backward.h`
+
+The original correctness reference. One ANE kernel per linear projection (7 per layer + 1 classifier = 85 kernels total). Forward and backward are sequential all-CPU operations with optional ANE for the matmuls. No kernel fusion, no async overlap. Used to verify that Gen2's fused kernels produce correct results.
+
+### Gen2: `train_large.m` + `stories_*.h` (production)
+
+The performance-optimized system. Uses **6 fused ANE kernels per layer**, 5 of them weight-bearing, each performing multiple operations in a single dispatch. Weight gradients (`dW`) run asynchronously on CPU via GCD to overlap with ANE. All data is channel-first `[C, S]` fp16 on IOSurfaces.
+
+The rest of this document describes Gen2.
+
+---
+
+## Model Configuration
+
+Stories110M -- a Llama2-architecture transformer:
+
+| Parameter | Value | Macro |
+|-----------|-------|-------|
+| Hidden dimension | 768 | `DIM` |
+| FFN intermediate | 2048 | `HIDDEN` |
+| Attention heads | 12 | `HEADS` |
+| Head dimension | 64 | `HD` |
+| Sequence length | 256 | `SEQ` |
+| Layers | 12 | `NLAYERS` |
+| Vocabulary | 32000 | `VOCAB` |
+| Total parameters | 109.53M | `TOTAL_PARAMS` |
+| Accumulation steps | 10 | `ACCUM_STEPS` |
+| Max ANE compiles | 100 | `MAX_COMPILES` |
+
+---
+
+## ANE Kernel Fusion Map
+
+Each training step dispatches 6 kernel types per layer. Five are weight-bearing (recompiled each batch); one, `sdpaBwd2`, is weight-free (compiled once).
+
+| Kernel | Generator | Fused Operations | Baked Weights | Input Shape | Output Shape |
+|--------|-----------|-----------------|---------------|-------------|--------------|
+| `fwdAttn` | `gen_sdpa_fwd_taps()` | RMSNorm1, Wq/Wk/Wv conv, reshape, transpose, Q @ K^T matmul, scale, causal mask, softmax, scores @ V matmul, Wo conv | rms_att, Wq, Wk, Wv, Wo, mask | `[1,DIM,1,SEQ]` | `[1,6*DIM,1,SEQ]` |
+| `fwdFFN` | `gen_ffn_fwd_taps()` | RMSNorm2, W1/W3 conv, sigmoid, SiLU gating, W2 conv | rms_ffn, W1, W3, W2 | `[1,DIM,1,SEQ]` | `[1,2D+3H,1,SEQ]` |
+| `ffnBwd` | `gen_ffn_bwd()` | W2^T conv, SiLU derivative, W1^T/W3^T conv, add | W2^T, W1^T, W3^T | `[1,D+2H,1,SEQ]` | `[1,D+2H,1,SEQ]` |
+| `sdpaBwd1` | `gen_sdpa_bwd1()` | Wo^T conv, reshape, Q @ K^T recompute, softmax, dV matmul, dP matmul | Wo^T, mask | `[1,4*DIM,1,SEQ]` | `[1,D+2*SC,1,SEQ]` |
+| `sdpaBwd2` | `gen_sdpa_bwd2()` | softmax Jacobian, scale, dQ = dS @ K matmul, dK = dS^T @ Q matmul | _(none)_ | `[1,2SC+2D,1,SEQ]` | `[1,2*DIM,1,SEQ]` |
+| `qkvBwd` | `gen_qkvb()` | Wq^T/Wk^T/Wv^T conv, sum | Wq^T, Wk^T, Wv^T | `[1,3*DIM,1,SEQ]` | `[1,DIM,1,SEQ]` |
+
+Where D=DIM=768, H=HIDDEN=2048, SC=SCORE_CH=HEADS*SEQ=3072.
+
+"Taps" in forward kernels: intermediate values (Q, K, V, attention output, norms) are concatenated onto the output via `concat(axis=1)` so backward kernels can read them without CPU recomputation.
+
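Because the taps are packed with `concat(axis=1)`, each one lives at a fixed channel offset in the output surface. The sketch below derives those offsets from the concat orders given in the kernel descriptions; the enum and function names are hypothetical and only illustrate the layout.

```c
#include <assert.h>

enum { DIM = 768, HIDDEN = 2048 };

/* Tap order in the fwdAttn output: concat(o_out, Q, K, V, attn_out, xnorm). */
typedef enum { TAP_O_OUT, TAP_Q, TAP_K, TAP_V, TAP_ATTN_OUT, TAP_XNORM } AttnTap;

/* Tap order in the fwdFFN output: concat(ffn_out, h1, h3, silu_out, x2norm). */
typedef enum { TAP_FFN_OUT, TAP_H1, TAP_H3, TAP_SILU_OUT, TAP_X2NORM } FfnTap;

/* Every attention tap is DIM channels wide, so offsets are multiples of DIM. */
static int attn_tap_offset(AttnTap t) {
    assert(t >= TAP_O_OUT && t <= TAP_XNORM);
    return (int)t * DIM;
}

/* ffn_out is DIM wide; the three taps after it are HIDDEN wide each. */
static int ffn_tap_offset(FfnTap t) {
    assert(t >= TAP_FFN_OUT && t <= TAP_X2NORM);
    return (t == TAP_FFN_OUT) ? 0 : DIM + ((int)t - 1) * HIDDEN;
}
```

A value such as `ffn_tap_offset(TAP_H3)` is the kind of channel offset one would pass as `ch_off` to `io_read_fp16` to pull a single tap out of the fused output.
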
+---
+
+## CPU vs ANE Operation Split
+
+| Operation | Location | Reason |
+|-----------|----------|--------|
+| Embedding lookup/backward | CPU | Scatter/gather by token index |
+| RMSNorm forward | ANE | Fused into fwdAttn/fwdFFN kernels |
+| QKV projections | ANE | 1x1 conv = matmul |
+| Multi-head attention (SDPA) | ANE | Decomposed Q @ K^T + mask + softmax + scores @ V |
+| FFN (SwiGLU) | ANE | W1,W3 conv + sigmoid + gate + W2 conv |
+| Residual connections | CPU | Simple `vDSP_vadd` |
+| Final RMSNorm | CPU (or ANE in `_ane` variant) | Standalone, not fused with other ops |
+| Classifier matmul | CPU cblas (or ANE in `_ane` variant) | `[VOCAB,DIM] x [DIM,SEQ]` |
+| Cross-entropy + softmax | CPU (partially ANE in `_ane`) | Target indexing requires CPU |
+| dW weight gradients | CPU (async cblas) | Outer products, independent of backward data flow |
+| RMSNorm backward | CPU (or ANE in `_ane` variant) | vDSP vectorized |
+| Adam optimizer | CPU | In-place weight mutation |
+
+---
+
+## Training Step Swim-Lane Diagram
+
+One complete training step showing CPU, ANE, and async GCD operations interleaved:
+
+```mermaid
+sequenceDiagram
+    participant CPU
+    participant ANE
+    participant GCD as GCD Async Queue
+
+    Note over CPU: FORWARD PASS (per layer L=0..11)
+
+    CPU->>CPU: embed_lookup(tokens to x_cur)
+
+    loop Layer L = 0..11
+        CPU->>CPU: wait for prior async dW
+        CPU->>CPU: save layer_in, write fp16 to IOSurface
+        CPU->>ANE: run fwdAttn kernel
+        ANE-->>CPU: concat(o_out, Q, K, V, attn_out, xnorm)
+        CPU->>CPU: read fp16 taps, residual add to x2
+
+        CPU->>CPU: write fp16 x2 to IOSurface
+        CPU->>ANE: run fwdFFN kernel
+        ANE-->>CPU: concat(ffn_out, h1, h3, silu_out, x2norm)
+        CPU->>CPU: read fp16 taps, residual add to x_cur
+    end
+
+    Note over CPU: CLASSIFIER + LOSS
+    CPU->>CPU: rmsnorm(x_cur to x_final)
+    CPU->>CPU: cblas_sgemm(embed x x_final to logits)
+    CPU->>CPU: cross_entropy_loss(logits to loss, dlogits)
+
+    Note over CPU: BACKWARD PASS
+    CPU->>CPU: cblas_sgemm(embed^T x dlogits to dy)
+    CPU->>GCD: async dEmbed += dlogits x x_final^T
+    CPU->>CPU: rmsnorm_bwd(dy to dx)
+
+    loop Layer L = 11..0
+        Note over CPU,GCD: FFN Backward
+        CPU->>CPU: write dffn + copy h1,h3 from fwd taps
+        CPU->>ANE: run ffnBwd kernel
+        ANE-->>CPU: concat(dx_ffn, dh1, dh3)
+        CPU->>GCD: async dW2, dW1, dW3 accumulation
+
+        Note over CPU,GCD: RMSNorm2 Backward + Residual
+        CPU->>CPU: rmsnorm_bwd, add residual gradient
+
+        Note over CPU,GCD: SDPA Backward
+        CPU->>GCD: async dWo accumulation
+        CPU->>CPU: copy Q,K,V from fwd taps, write dx2
+        CPU->>ANE: run sdpaBwd1 kernel
+        ANE-->>CPU: concat(dV, probs, dP)
+
+        CPU->>CPU: copy probs,dP,Q,K
+        CPU->>ANE: run sdpaBwd2 kernel
+        ANE-->>CPU: concat(dQ, dK)
+
+        CPU->>GCD: async dWq, dWk, dWv accumulation
+
+        Note over CPU,GCD: QKV Backward
+        CPU->>CPU: copy dQ,dK,dV
+        CPU->>ANE: run qkvBwd kernel
+        ANE-->>CPU: dx_attn
+
+        Note over CPU,GCD: RMSNorm1 Backward + Residual
+        CPU->>CPU: rmsnorm_bwd, add both skip gradients
+    end
+
+    CPU->>CPU: dispatch_group_wait(all async dW)
+    CPU->>CPU: embed_backward(dy to d_embed)
+```
+
+---
+
+## Async CPU/ANE Overlap Strategy
+
+The key insight: **dW gradients (weight gradients) are independent of the backward data flow**. They are outer products `dW += dy x x^T` that only accumulate into gradient buffers. The data-path gradients (`dx`) flow backward through the network on ANE.
+
+```
+Timeline for one backward layer:
+  ANE:  [ffnBwd]    [sdpaBwd1]   [sdpaBwd2]   [qkvBwd]
+  CPU:       [dW_FFN (3x sgemm)]    [dWo]    [dWqkv (3x sgemm)]
+```
+
+A serial GCD dispatch queue (`"dw_cblas"`) ensures that dW operations do not overlap one another, since they share scratch buffers. The `dispatch_group_wait` at the start of each forward layer ensures that async dW work from the previous step's backward pass has finished before IOSurfaces are reused.
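
As a rough illustration of why the dW work can leave the critical path, the sketch below accumulates `dW += dy x x^T` in plain C. It is not the project's code, which the text describes as `cblas_sgemm` calls on a GCD queue; the name `accum_dw` and the assumed layouts (`dy` as `[out_ch, seq]`, `x` as `[in_ch, seq]`, `dW` as row-major `[out_ch, in_ch]`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dW[o][i] += sum over s of dy[o][s] * x[i][s].
 * It only reads dy and x and only writes dW, so nothing it produces
 * feeds back into the dx data path. */
void accum_dw(float *dW, const float *dy, const float *x,
              size_t out_ch, size_t in_ch, size_t seq) {
    assert(dW && dy && x);
    for (size_t o = 0; o < out_ch; o++) {
        for (size_t i = 0; i < in_ch; i++) {
            float acc = 0.0f;
            for (size_t s = 0; s < seq; s++)
                acc += dy[o * seq + s] * x[i * seq + s];
            dW[o * in_ch + i] += acc;   /* accumulate, never overwrite */
        }
    }
}
```

Because the function has no output other than the gradient buffer, it can run while the ANE computes the next `dx`, provided the inputs are copied or kept alive until it finishes.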
+
+---
+
+## Compile/Restart Lifecycle
+
+The ANE runtime leaks resources internally, limiting compiles to ~119 per process. The system manages this with checkpoint-and-restart:
+
+```mermaid
+flowchart TD
+    Start["Process starts (fresh or --resume)"] --> LoadCkpt{"--resume flag?"}
+    LoadCkpt -->|Yes| Resume["Load checkpoint: weights, Adam state, step counter"]
+    LoadCkpt -->|No| Init["Xavier init weights, zero Adam state"]
+    Resume --> CompileCheck
+    Init --> CompileCheck
+
+    CompileCheck{"g_compile_count + 60 > MAX_COMPILES?"} -->|Yes| SaveCheckpoint["Save checkpoint to ane_stories110M_ckpt.bin"]
+    SaveCheckpoint --> FreeAll["Free all ANE kernels"]
+    FreeAll --> RestartProcess["Re-launch process with --resume flag"]
+    RestartProcess --> Start
+
+    CompileCheck -->|No| Compile["Compile 60 weight-bearing kernels (5 per layer x 12)"]
+    Compile --> ZeroGrads["Zero gradient accumulators"]
+    ZeroGrads --> AccumLoop
+
+    subgraph AccumLoop ["Gradient Accumulation (10 steps)"]
+        SingleStep["Forward + Backward + async dW"] --> MoreSteps{"More accum steps?"}
+        MoreSteps -->|Yes| SingleStep
+    end
+
+    MoreSteps -->|No| WaitDW["dispatch_group_wait (all async dW)"]
+    WaitDW --> ScaleGrad["Scale gradients by 1/ACCUM_STEPS"]
+    ScaleGrad --> AdamUpdate["Adam update (mutates weights in-place)"]
+    AdamUpdate --> FreeKernels["Free all weight-bearing kernels"]
+    FreeKernels --> CompileCheck
+```
+
+With `MAX_COMPILES=100` and 60 weight-bearing kernels per batch, only **1 batch** (10 accumulation steps) fits per process lifetime. The checkpoint preserves:
+
+- Training step and total_steps
+- All weights and Adam (m, v) state per layer
+- Cumulative timing statistics
+- Adam timestep counter
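
The budget arithmetic behind the "1 batch per process" statement can be sketched as follows. The two macros use the values documented for `stories_config.h`; the function names `batch_fits` and `batches_per_process` are hypothetical and only mirror the `g_compile_count + 60 > MAX_COMPILES?` decision in the flowchart.

```c
#include <assert.h>

#define MAX_COMPILES 100          /* documented compile budget */
#define TOTAL_WEIGHT_KERNELS 60   /* 5 kernels per layer x 12 layers */

/* Returns 1 if another batch of kernels fits in the budget, 0 if the
 * process should checkpoint and re-launch with --resume. */
int batch_fits(int compile_count, int kernels_per_batch) {
    assert(compile_count >= 0 && kernels_per_batch > 0);
    return compile_count + kernels_per_batch <= MAX_COMPILES;
}

/* Number of whole batches a fresh process can compile before restarting. */
int batches_per_process(int kernels_per_batch) {
    int batches = 0;
    int count = 0;
    while (batch_fits(count, kernels_per_batch)) {
        count += kernels_per_batch;
        batches++;
    }
    return batches;
}
```

With 60 kernels per batch the first batch fits (60 <= 100) and the second does not (120 > 100), so `batches_per_process(TOTAL_WEIGHT_KERNELS)` is 1.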
+
+---
+
+## Data Flow Through One Layer
+
+Tensor shapes as they flow through forward and backward passes:
+
+```mermaid
+flowchart LR
+    subgraph fwdAttnKernel ["fwdAttn Kernel (ANE)"]
+        xIn["x_in\n[1,768,1,256]"] --> RMS1["RMSNorm1"]
+        RMS1 --> QKVConv["Wq,Wk,Wv conv\n[768,768,1,1]"]
+        QKVConv --> ReshapeHeads["reshape\n[1,12,64,256]"]
+        ReshapeHeads --> TransposeHeads["transpose\n[1,12,256,64]"]
+        TransposeHeads --> QKT["Q x K^T\n[1,12,256,256]"]
+        QKT --> ScaleMask["scale + mask\n+ softmax"]
+        ScaleMask --> AV["scores x V\n[1,12,256,64]"]
+        AV --> ReshapeBackFlat["reshape\n[1,768,1,256]"]
+        ReshapeBackFlat --> WoConv["Wo conv\n[768,768,1,1]"]
+    end
+
+    subgraph taps1 ["Taps via concat"]
+        WoConv --> T1["o_out [768]"]
+        QKVConv --> T2["Q,K,V [768 each]"]
+        AV --> T3["attn_out [768]"]
+        RMS1 --> T4["xnorm [768]"]
+    end
+
+    subgraph cpuResid1 ["CPU"]
+        T1 --> ResAdd1["x + o_out = x2"]
+    end
+
+    subgraph fwdFFNKernel ["fwdFFN Kernel (ANE)"]
+        ResAdd1 --> RMS2["RMSNorm2"]
+        RMS2 --> W1W3["W1,W3 conv\n[2048,768,1,1]"]
+        W1W3 --> SiLUGate["sigmoid + SiLU\n+ gating"]
+        SiLUGate --> W2Conv["W2 conv\n[768,2048,1,1]"]
+    end
+
+    subgraph taps2 ["Taps via concat"]
+        W2Conv --> T5["ffn_out [768]"]
+        W1W3 --> T6["h1,h3 [2048 each]"]
+        SiLUGate --> T7["silu_out [2048]"]
+        RMS2 --> T8["x2norm [768]"]
+    end
+
+    subgraph cpuResid2 ["CPU"]
+        T5 --> ResAdd2["x2 + ffn_out = x_next"]
+    end
+```
+
+---
+
+## IOSurface Memory Layout
+
+All tensors use channel-first `[1, C, 1, S]` fp16 layout on IOSurfaces, matching ANE's native format:
+
+```
+IOSurface memory (contiguous fp16):
+  channel_0:   [pos_0, pos_1, ..., pos_255]   (256 values)
+  channel_1:   [pos_0, pos_1, ..., pos_255]
+  ...
+  channel_767: [pos_0, pos_1, ..., pos_255]
+```
+
+Fused kernel outputs use `concat(axis=1)` to pack multiple tensors into a single IOSurface:
+
+```
+fwdAttn output [1, 6*768, 1, 256]:
+  channels    0-767:  o_out (Wo projection output)
+  channels  768-1535: Q (query projection)
+  channels 1536-2303: K (key projection)
+  channels 2304-3071: V (value projection)
+  channels 3072-3839: attn_out (pre-Wo attention output)
+  channels 3840-4607: xnorm (RMSNorm1 output)
+```
+
+CPU reads specific taps via `io_read_fp16(surface, data, ch_offset, n_channels, spatial)`.
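
The channel offsets listed above follow directly from the channel-first layout. The sketch below shows the index arithmetic only; the enum and the function names `tap_ch_offset` and `elem_index` are hypothetical and are not part of `stories_io.h`.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 768
#define SEQ 256

/* Tap order in the fwdAttn concat output, as listed above. */
enum fwd_attn_tap {
    TAP_O_OUT, TAP_Q, TAP_K, TAP_V, TAP_ATTN_OUT, TAP_XNORM, TAP_COUNT
};

/* First channel of a tap: taps are packed back to back along axis 1. */
size_t tap_ch_offset(enum fwd_attn_tap tap) {
    assert(tap < TAP_COUNT);
    return (size_t)tap * DIM;
}

/* Index, in fp16 elements rather than bytes, of channel c at position s
 * in a contiguous [1, C, 1, S] surface. */
size_t elem_index(size_t c, size_t s, size_t spatial) {
    assert(s < spatial);
    return c * spatial + s;
}
```

Since each element is a 2-byte fp16 value, the byte offset is twice the element index; for example, V starts at channel `tap_ch_offset(TAP_V)`, which is 2304.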
+
+---
+
+## Weight Blob Format
+
+ANE weight blobs follow a binary format with a 128-byte header:
+
+```
+Offset  Size   Content
+------  -----  -------
+0       1      0x01 (format marker)
+4       1      0x02 (format marker)
+5-63    59     zeros (padding)
+64      4      0xDEADBEEF (chunk magic, little-endian)
+68      1      0x01 (chunk marker)
+72      4      uint32 data_size (fp16 weight bytes)
+80      4      uint32 data_offset (always 128)
+84-127  44     zeros (padding)
+128+    N      fp16 weight data, row-major [out_ch, in_ch]
+```
+
+Multi-weight blobs (fused QKV, FFN up) concatenate chunks: `[64B global header] [64B chunk0 header] [chunk0 data] [64B chunk1 header] [chunk1 data] ...`
+
+MIL programs reference weights via `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(64))` where offset 64 points to the chunk header within the file.
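
A minimal sketch of filling in the single-weight header from the table above, assuming every byte not listed in the table is zero. The names `make_blob_header` and `put_u32le` are hypothetical; the project's own `build_blob*` functions may differ in detail, and the fp16 payload is left zeroed here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOB_HEADER_BYTES 128u

/* Store a uint32 in little-endian byte order regardless of host endianness. */
static void put_u32le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Allocates header + zeroed payload for a [rows, cols] fp16 weight.
 * The caller would copy rows*cols fp16 values to blob + 128 and free()
 * the result. Returns NULL on bad input or allocation failure. */
uint8_t *make_blob_header(size_t rows, size_t cols, size_t *total_out) {
    assert(total_out);
    /* data_size must fit the uint32 field at offset 72 */
    if (rows == 0 || cols == 0 ||
        rows > ((UINT32_MAX - BLOB_HEADER_BYTES) / 2u) / cols)
        return NULL;
    size_t data_size = rows * cols * 2u;           /* fp16 = 2 bytes */
    uint8_t *blob = calloc(1, BLOB_HEADER_BYTES + data_size);
    if (!blob) return NULL;
    blob[0] = 0x01;                                /* format marker */
    blob[4] = 0x02;                                /* format marker */
    put_u32le(blob + 64, 0xDEADBEEFu);             /* chunk magic */
    blob[68] = 0x01;                               /* chunk marker */
    put_u32le(blob + 72, (uint32_t)data_size);     /* data_size */
    put_u32le(blob + 80, BLOB_HEADER_BYTES);       /* data_offset */
    *total_out = BLOB_HEADER_BYTES + data_size;
    return blob;
}
```

For a 768 x 768 weight the payload is 1,179,648 bytes, so the total blob size is 1,179,776 bytes.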
+
+---
+
+## Key Constraints
+
+| Constraint | Impact | Workaround |
+|-----------|--------|------------|
+| ~119 compile limit per process | ANE compiler leaks resources | Checkpoint, then re-launch with `--resume` |
+| Weights baked at compile time | Cannot hot-swap weights; must recompile | Gradient accumulation amortizes compile cost |
+| SDPA ignores `attn_mask` | Causal attention cannot use native SDPA mask | Decompose into `Q @ K^T` + explicit mask + softmax + `scores @ V` |
+| ANE SRAM capacity ~32 MB | Large weight matrices spill to DRAM | Performance cliff above ~3072 channels |
+| 32000 input channels rejected | ANE refuses conv with VOCAB input channels | Classifier backward uses `matmul` op with reshape instead of conv |
+| fp16 compute only | Precision limited on ANE | fp32 on CPU for loss, Adam; fp16 for ANE forward/backward |
+
+---
+
+## `train_large.m` vs `train_large_ane.m`
+
+`train_large_ane.m` moves additional operations from CPU to ANE:
+
+| Operation | `train_large.m` | `train_large_ane.m` |
+|-----------|-----------------|---------------------|
+| Final RMSNorm | CPU (`rmsnorm()` via vDSP) | ANE (`gen_final_rmsnorm()`) |
+| Classifier forward | CPU (`cblas_sgemm`) | ANE (`gen_classifier_fwd()`, 32000-ch conv) |
+| Softmax | CPU (inside `cross_entropy_loss()`) | ANE (`gen_softmax_vocab()`) |
+| Per-layer RMSNorm backward | CPU (`rmsnorm_bwd()` via vDSP) | ANE (`gen_rmsnorm_bwd()`) |
+
+This increases compile budget pressure: 86 weight-bearing kernels per batch (vs 60), leaving less headroom within MAX_COMPILES=100.
--- a/docs/BENCHMARKS.md
+++ b/docs/BENCHMARKS.md
@ -0,0 +1,253 @@
+# ANE Training -- Benchmarks and Tests Guide
+
+All benchmarks and tests require **macOS 15+ on Apple Silicon** (tested on M4, M5).
+
+---
+
+## Quick Start
+
+```bash
+# Build and run training benchmark (100 steps)
+cd training
+make train_large && ./train_large --steps 100
+
+# Run the automated benchmark suite
+cd ..
+bash scripts/run_benchmarks.sh
+```
+
+---
+
+## Training Benchmarks
+
+### train_large (CPU classifier)
+
+The main 12-layer Stories110M training loop with classifier on CPU.
+
+| Item | Details |
+|------|---------|
+| **Purpose** | Full transformer training benchmark |
+| **Measures** | ms/step, ANE TFLOPS, ANE utilization %, per-component timing |
+| **Prerequisites** | Training data: `bash download_data.sh` (or runs on random data if absent) |
+| **Build** | `cd training && make train_large` |
+| **Run** | `./train_large --steps 100` |
+| **CLI flags** | `--steps N` (default 10000), `--lr F` (default 3e-4), `--resume` |
+
+**Expected output:**
+
+```
+ane=9.6 io=4.1 cls=9.1 elem=14.4 rms=0.1 cblas_wait=2.3 ms/step
+
+=== Efficiency Report ===
+Total steps:     100
+Avg train:       107.0 ms/step
+ANE TFLOPS:      2.45 sustained
+ANE utilization: 15.5% of 15.8 TFLOPS
+```
+
+### train_large_ane (ANE classifier)
+
+Same training with classifier, softmax, and RMSNorm backward offloaded to ANE.
+
+| Item | Details |
+|------|---------|
+| **Purpose** | Measure ANE-offloaded training (16% faster) |
+| **Build** | `cd training && make train_large_ane` |
+| **Run** | `./train_large_ane --steps 100` |
+
+**Compare baseline vs ANE-offloaded:**
+
+```bash
+make train_large && ./train_large --steps 100
+make train_large_ane && ./train_large_ane --steps 100
+```
+
+### Dashboard (live monitoring)
+
+```bash
+pip install blessed psutil numpy
+sudo python3 dashboard.py            # live mode (needs powermetrics)
+sudo python3 dashboard.py --resume   # attach to resumed training
+```
+
+| Flag | Description |
+|------|-------------|
+| `--resume` | Resume from checkpoint |
+| `--infinite` | Train indefinitely |
+| `--no-powermetrics` | Disable power monitoring |
+| `--no-generate` | Disable text generation preview |
+| `--steps N` | Total steps (default 10000) |
+
+---
+
+## Root-Level Benchmark Scripts
+
+All root-level scripts are standalone Objective-C programs. Common build pattern:
+
+```bash
+xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML \
+  -framework IOSurface -ldl -o <output> <source>.m
+```
+
+### inmem_peak.m -- Peak TFLOPS (self-contained)
+
+**No prerequisites.** Generates MIL and weight blobs programmatically.
+
+| Item | Details |
+|------|---------|
+| **Purpose** | Maximum sustained TFLOPS via deep conv chains (32-256 layers deep) |
+| **Measures** | ms per run, TFLOPS, % peak across 10 configurations |
+| **Prerequisites** | None (self-contained MIL generation) |
+| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_peak inmem_peak.m` |
+| **Run** | `./inmem_peak` |
+
+**Expected output:**
+
+```
+=== Programmatic MIL to In-Memory ANE Peak ===
+
+Config                       W(MB)   GFLOP   ms/run   TFLOPS  %peak
+----------------------------------------------------------------------
+32x conv 512ch sp64            16.0    1.07    X.XXX ms   Y.YY   Z.Z%
+64x conv 512ch sp64            32.0    2.15    X.XXX ms   Y.YY   Z.Z%
+...
+```
+
+### inmem_basic.m -- In-Memory Proof-of-Concept
+
+| Item | Details |
+|------|---------|
+| **Purpose** | End-to-end test: compile, load, run, benchmark using `_ANEInMemoryModel` |
+| **Prerequisites** | Pre-built mlpackage at `/tmp/ane_sram_256ch_64sp.mlpackage` |
+| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_basic inmem_basic.m` |
+| **Run** | `./inmem_basic` |
+
+### inmem_bench.m -- Dispatch Latency
+
+| Item | Details |
+|------|---------|
+| **Purpose** | ANE dispatch latency across 6 model sizes (256-4096 channels) |
+| **Measures** | ms per run, TFLOPS at each configuration |
+| **Prerequisites** | Pre-built mlpackages for all 6 configs |
+| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_bench inmem_bench.m` |
+| **Run** | `./inmem_bench` |
+
+### sram_bench.m -- SRAM Capacity Probe
+
+| Item | Details |
+|------|---------|
+| **Purpose** | Find SRAM capacity by detecting performance cliff at increasing weight sizes |
+| **Measures** | ms per run, TFLOPS, weight/activation/total memory at 9 configurations |
+| **Prerequisites** | Pre-built mlpackages for 9 configs (256-8192 channels) |
+| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o sram_bench sram_bench.m` |
+| **Run** | `./sram_bench` |
+
+### sram_probe.m -- Fine-Grained SRAM Exploration
+
+| Item | Details |
+|------|---------|
+| **Purpose** | Finer-grained SRAM probe with 13 data points and GFLOPS/MB efficiency |
+| **Measures** | ms per run, TFLOPS, GFLOPS/MB with spilling indicators |
+| **Prerequisites** | Pre-built mlpackages for 13 configs |
+| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o sram_probe sram_probe.m` |
+| **Run** | `./sram_probe` |
+
+### api_exploration.m -- API Discovery
+
+| Item | Details |
+|------|---------|
+| **Purpose** | Explore ANE private API surface (class methods, file structures, internal objects) |
+| **Prerequisites** | Pre-built mlpackage at `/tmp/ane_sram_1024ch_64sp.mlpackage` |
+| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o api_exploration api_exploration.m` |
+| **Run** | `./api_exploration` |
+
+---
+
+## Test Files
+
+### Tests with Makefile targets (cd training/)
+
+| Test | Build | What It Tests |
+|------|-------|---------------|
+| `test_rmsnorm_bwd` | `make test_rmsnorm_bwd` | RMSNorm backward on ANE vs CPU reference. PASS: max diff < 0.05, mean < 0.01. Benchmarks 100 runs. |
+| `test_classifier` | `make test_classifier` | 4-part: final RMSNorm, classifier forward (32000-ch conv), softmax over VOCAB, classifier backward. |
+| `test_weight_reload` | `make test_weight_reload` | Tests if weights can be hot-swapped by overwriting blob files + unload/reload. Key finding: NO, weights are baked. |
+| `test_perf_stats` | `make test_perf_stats` | Probes `_ANEPerformanceStats` class methods, properties, and instantiation. Tests perfStats in `_ANERequest`. |
+| `test_qos_sweep` | `make test_qos_sweep` | QoS parameter sweep (0-63) across compile, load, run. Finding: no measurable latency difference. |
+| `test_ane_advanced` | `make test_ane_advanced` | Probes SharedEvents, weightsBuffer IOSurface, procedureIndex, ChainingRequest. Enumerates all 67 ANE classes. |
+
+Build all probe tests at once: `make probes`
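
The PASS rule quoted for `test_rmsnorm_bwd` (max diff < 0.05, mean < 0.01) could be checked along the following lines. This is a sketch, not the test's actual code: the name `within_tolerance` is hypothetical, and reading "mean" as the mean absolute difference is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if max |ane - ref| < max_tol and mean |ane - ref| < mean_tol. */
int within_tolerance(const float *ane, const float *ref, size_t n,
                     float max_tol, float mean_tol) {
    assert(ane && ref && n > 0);
    float max_diff = 0.0f;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        float d = ane[i] - ref[i];
        if (d < 0.0f) d = -d;              /* absolute difference */
        if (d > max_diff) max_diff = d;
        sum += d;
    }
    return max_diff < max_tol && (sum / (double)n) < (double)mean_tol;
}
```

Checking both bounds catches a single large outlier as well as a small but systematic fp16 drift across all elements.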
+
+### Tests without Makefile targets (manual build)
+
+| Test | Build Command | What It Tests |
+|------|---------------|---------------|
+| `test_ane_causal_attn` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_ane_causal_attn test_ane_causal_attn.m` | Decomposed causal attention: `Q @ K^T` on ANE, mask+softmax on CPU, `scores @ V` on ANE |
+| `test_ane_sdpa5` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_ane_sdpa5 test_ane_sdpa5.m` | 4 approaches to causal masking with `scaled_dot_product_attention` |
+| `test_conv_attn3` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_conv_attn3 test_conv_attn3.m` | Grouped conv approach to attention (K,V baked as conv weights) |
+| `test_full_fused` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o test_full_fused test_full_fused.m` | Full fused attention + FFN in single MIL dispatch at DIM=768, HEADS=12, SEQ=64 |
+| `test_fused_qkv` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_fused_qkv test_fused_qkv.m` | Fused QKV (3 convs + concat in one dispatch) vs separate dispatches |
+| `test_fused_bwd` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_fused_bwd test_fused_bwd.m` | Fused backward: slice_by_size + 2 convs + add in one kernel |
+
+---
+
+## Bridge Library
+
+```bash
+cd bridge
+make                          # Build libane_bridge.dylib
+make test                     # Build and link test_bridge
+./test_bridge                 # Run bridge tests
+```
+
+---
+
+## Known Results
+
+### M4 (from README)
+
+**Single-layer (dim=768, seq=512):**
+
+| Optimization | ms/step | ANE utilization |
+|---|---|---|
+| Baseline (vDSP transpose) | 33.5 | 3.1% |
+| Channel-first layout | 20.3 | 5.2% |
+| vDSP vectorized RMSNorm | 14.2 | 7.4% |
+| GCD async cblas overlap | 11.4 | 9.2% |
+| ANE RMSNorm fusion | 11.4 | 9.2% |
+| Wo^T fusion (7 to 6 kernels) | 11.4 | 9.2% |
+| Deferred cblas wait | **9.3** | **11.2%** |
+
+**Full Stories110M (12 layers):**
+
+| Component | Time (ms/step) |
+|-----------|---------------|
+| ANE runs | 9.6 |
+| IO (fp16 conversion) | 4.1 |
+| Classifier (cblas) | 9.1 |
+| Cross-entropy + residuals | 14.4 |
+| RMSNorm | 0.1 |
+| **Total** | **~107** |
+
+### M5 Probe Results (from m5result.md)
+
+**Machine**: Apple M5, macOS 26.3, ANE Family H16 (same as M4)
+
+- **Weight reload**: FAIL -- weights baked at compile time, cannot be overwritten
+- **QoS sweep**: All QoS 0-63 work, no measurable latency difference
+- **Performance stats**: `_ANEPerformanceStats` class exists, `alloc/init` returns nil (needs factory methods)
+- **weightsBuffer IOSurface**: Does NOT override compiled weights
+- **ChainingRequest**: Exists with loopback and pipeline support -- most promising for utilization improvement
+
+---
+
+## Timing Metrics Key
+
+| Metric | What it measures |
+|--------|-----------------|
+| `ane` | ANE kernel runs (all 6 kernels per layer x 12 layers) |
+| `io` | fp16-to-fp32 IOSurface data transfer (NEON conversion) |
+| `cls` | Classifier matmul (CPU cblas_sgemm) |
+| `elem` | Embedding lookup, residual adds, cross-entropy |
+| `rms` | RMSNorm forward/backward (CPU vDSP) |
+| `cblas_wait` | Time waiting for async dW gradient sgemms to complete |
--- a/docs/BENCHMARK_RESULTS.md
+++ b/docs/BENCHMARK_RESULTS.md
@ -0,0 +1,156 @@
+# ANE Benchmark Results: Apple M4 Max
+
+**Date**: March 3, 2026
+**Machine**: Mac16,5 (MacBook Pro, Apple M4 Max)
+**macOS**: 26.2
+**ANE Peak**: 15.8 TFLOPS (theoretical)
+
+## Training Performance
+
+### train_large (CPU classifier path)
+
+| Metric | Value |
+|--------|-------|
+| Model | Stories110M (12 layers, dim=768, hidden=2048) |
+| Kernels | 72 (60 weight-bearing + 12 static sdpaBwd2) |
+| Avg step time | 72.4 ms/step |
+| ANE TFLOPS | 1.29 sustained |
+| Total TFLOPS | 2.41 (ANE+CPU) |
+| ANE utilization | 8.1% of 15.8 TFLOPS |
+| Compile time | 79.7% of wall time |
+| Train time | 16.4% of wall time |
+
+### train_large_ane (ANE-offloaded classifier)
+
+| Metric | Value |
+|--------|-------|
+| Model | Stories110M (same as above) |
+| Kernels | 99 (86 weight-bearing + 13 static) |
+| Avg step time | 62.9 ms/step |
+| ANE TFLOPS | 1.68 sustained |
+| Total TFLOPS | 2.77 (ANE+CPU) |
+| ANE utilization | 10.6% of 15.8 TFLOPS |
+| Compile time | 84.5% of wall time |
+| Train time | 12.5% of wall time |
+
+**Step time breakdown (ms/step, ANE classifier path):**
+
+| Component | Time (ms) | Description |
+|-----------|-----------|-------------|
+| ane | 10-12 | ANE kernel dispatch + evaluation |
+| elem | 12-13 | Elementwise ops (residuals, activations) |
+| cls | 5-6 | Classifier forward + backward |
+| io | 3-5 | IOSurface data transfers |
+| rms | 0.1 | RMSNorm |
+| cblas_wait | 0.0 | BLAS sync overhead |
+
+## Programmatic MIL Peak TFLOPS
+
+```
+Config                         W(MB)   GFLOP   ms/eval  TFLOPS
+----------------------------------------------------------------------
+32x conv 512ch sp64            16.0    1.07    0.408 ms   2.63
+48x conv 512ch sp64            24.0    1.61    0.262 ms   6.15
+64x conv 512ch sp64            32.0    2.15    0.244 ms   8.80
+96x conv 512ch sp64            48.0    3.22    0.326 ms   9.89
+128x conv 512ch sp64           64.0    4.29    0.385 ms  11.14
+64x conv 256ch sp64             8.0    0.54    0.365 ms   1.47
+128x conv 256ch sp64           16.0    1.07    0.454 ms   2.37
+256x conv 256ch sp64           32.0    2.15    0.351 ms   6.11
+64x conv 384ch sp64            18.0    1.21    0.429 ms   2.82
+128x conv 384ch sp64           36.0    2.42    0.354 ms   6.82
+```
+
+**Peak observed: 11.14 TFLOPS** (128x conv 512ch sp64, 64 MB weights)
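
The GFLOP and W(MB) columns in the table appear consistent with counting `2 * ch * ch * spatial` FLOPs and `ch * ch * 2` bytes of fp16 weights per 1x1 conv. A small sketch under that assumption (the function names are hypothetical and are not taken from `inmem_peak.m`):

```c
#include <assert.h>

/* GFLOP for a chain of `depth` 1x1 convs with `ch` input and output
 * channels over `spatial` positions (one multiply-add = 2 FLOPs). */
double chain_gflop(int depth, int ch, int spatial) {
    assert(depth > 0 && ch > 0 && spatial > 0);
    return 2.0 * ch * ch * spatial * depth / 1e9;
}

/* fp16 weight footprint of the chain in MiB. */
double chain_weight_mb(int depth, int ch) {
    assert(depth > 0 && ch > 0);
    return (double)depth * ch * ch * 2.0 / (1024.0 * 1024.0);
}

/* GFLOP per millisecond is numerically equal to TFLOP per second. */
double tflops(double gflop, double ms) {
    assert(ms > 0.0);
    return gflop / ms;
}
```

For the peak row, `chain_gflop(128, 512, 64)` is about 4.29 and `chain_weight_mb(128, 512)` is 64.0; dividing 4.29 GFLOP by 0.385 ms gives roughly 11.14 TFLOPS, which is about 70% of the 15.8 TFLOPS theoretical peak.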
+
+## In-Memory ANE Benchmark (via mlpackage)
+
+```
+Config         W (MB)    ms/eval   TFLOPS
+---------------------------------------------
+ 256ch x64sp     0.1     0.319 ms    0.03
+ 512ch x64sp     0.5     0.357 ms    0.09
+1024ch x64sp     2.0     0.457 ms    0.29
+2048ch x64sp     8.0     0.254 ms    2.11
+3072ch x64sp    18.0     0.389 ms    3.10
+4096ch x64sp    32.0     1.148 ms    1.87
+```
+
+## SRAM Probe Results
+
+### Coarse Probe (varying channels + spatial)
+
+```
+Config                      W (MB)  Act(MB)  Tot(MB)    ms/eval   TFLOPS
+--------------------------------------------------------------------------
+256ch x 64sp                  0.1     0.03      0.2     0.378 ms    0.02
+512ch x 64sp                  0.5     0.06      0.6     0.389 ms    0.09
+1024ch x 64sp                 2.0     0.12      2.2     0.392 ms    0.34
+2048ch x 64sp                 8.0     0.25      8.5     0.218 ms    2.47
+3072ch x 64sp                18.0     0.38     18.8     0.396 ms    3.05
+4096ch x 64sp                32.0     0.50     33.0     1.116 ms    1.92
+5120ch x 64sp                50.0     0.62     51.2     0.767 ms    4.38
+6144ch x 64sp                72.0     0.75     73.5     0.872 ms    5.54
+8192ch x 32sp               128.0     0.50    129.0     4.195 ms    1.02
+```
+
+### Fine Probe (spatial=64, weights only)
+
+```
+Channels       W (MB)    ms/eval   TFLOPS    GFLOPS/MB
+--------------------------------------------------------------
+   256 ch       0.1     0.378 ms    0.02       177.7
+   512 ch       0.5     0.431 ms    0.08       155.6
+  1024 ch       2.0     0.411 ms    0.33       163.5
+  1536 ch       4.5     0.493 ms    0.61       136.1
+  2048 ch       8.0     0.410 ms    1.31       163.9
+  2560 ch      12.5     0.237 ms    3.53       282.6  <-- peak efficiency
+  3072 ch      18.0     0.335 ms    3.60       200.1
+  3584 ch      24.5     0.414 ms    3.97       162.1
+  4096 ch      32.0     1.134 ms    1.89        59.2  <-- spilling
+  4608 ch      40.5     0.563 ms    4.83       119.2
+  5120 ch      50.0     0.659 ms    5.09       101.8
+  6144 ch      72.0     0.844 ms    5.73        79.5  <-- spilling
+  8192 ch     128.0     4.203 ms    1.02         8.0  <-- catastrophic spilling
+```
+
+### SRAM Analysis
+
+The M4 Max ANE SRAM appears to be approximately **24-32 MB**:
+
+- **Peak efficiency** at 2560ch (12.5 MB weights): 282.6 GFLOPS/MB, 3.53 TFLOPS
+- **First spill** at 4096ch (32.0 MB): drops to 59.2 GFLOPS/MB (1.89 TFLOPS)
+- **Catastrophic** at 8192ch (128.0 MB): 8.0 GFLOPS/MB (1.02 TFLOPS)
+
+The 4608ch recovery (4.83 TFLOPS despite 40.5 MB weights) suggests the ANE may use tiling strategies for some weight configurations.
+
+Training kernels (dim=768, weight matrices ~1.2 MB fp16 each) stay well within the SRAM budget.
+
+## Known Test Results
+
+| Test | Status | Notes |
+|------|--------|-------|
+| test_rmsnorm_bwd | PASS | ANE-accelerated RMSNorm backward |
+| test_classifier | PASS | 4 tests passed; ANE backward 3x slower than CPU cblas for matmul |
+| test_weight_reload | FAIL (expected) | ANE bakes weights at compile time; IOSurface override doesn't work |
+| test_perf_stats | PASS | _ANEPerformanceStats API accessible |
+| test_qos_sweep | PASS | QoS parameter has no measurable effect on latency |
+| test_ane_advanced | PASS | Advanced ANE operations verified |
+| inmem_basic | PASS | In-memory compilation and execution verified |
+| inmem_bench | PASS | Multi-config benchmarks via mlpackage |
+| inmem_peak | PASS | Peak TFLOPS measurement via programmatic MIL |
+| sram_bench | PASS | SRAM capacity probing |
+| sram_probe | PASS | Fine-grained SRAM spilling detection |
+
+## Reproducing
+
+```bash
+cd scripts && bash run_benchmarks.sh
+```
+
+The benchmark script auto-generates required `.mlpackage` models (needs Python 3.11-3.13 with `coremltools`).
+
+Override training data paths:
+```bash
+ANE_MODEL_PATH=/path/to/stories110M.bin ANE_DATA_PATH=/path/to/data.bin ./train_large
+```
--- a/docs/diaries/001-initial-setup-and-security-audit.md
+++ b/docs/diaries/001-initial-setup-and-security-audit.md
@ -0,0 +1,74 @@
+# Development Diary #001: Initial Setup & Security Audit
+**Date:** 2026-03-02
+**Status:** Completed
+
+## Tasks
+
+### 1. Repository Synchronization
+- **Starting point:** The local directory `/Volumes/ExtremePro/projects/ANE` contained only `firebase-debug.log`
+- **Steps performed:**
+  ```bash
+  git init
+  git remote add origin https://github.com/maderix/ANE.git
+  git fetch origin
+  git checkout -b main --track origin/main
+  ```
+- **Result:** 29 files synchronized in the `training/` directory; `firebase-debug.log` left untouched
+- **Commit state:** HEAD = origin/main (up to date)
+
+### 2. Security Audit
+- **Steps performed:** Full analysis of all 38 source files (Objective-C/C/Python)
+- **Findings:** 19 security issues identified (4 CRITICAL, 5 HIGH, 6 MEDIUM, 4 LOW)
+- **Report:** `docs/reports/security-audit-2026-03-02.md`
+
+## Key Takeaways
+
+The ANE project is an innovative research project on using the Apple Neural Engine directly for training. It uses reverse-engineered private APIs (`_ANEInMemoryModelDescriptor`, `_ANEInMemoryModel`, etc.) via `dlopen` + `objc_msgSend`.
+
+**Most critical findings:**
+- CRIT-01: `dlopen()` without error handling → silent crash
+- CRIT-03: `fread()` without a return-value check → uninitialized memory
+- CRIT-04: Integer overflow in the blob size calculation (`int` instead of `size_t`)
+
+**Architecture highlights (of interest):**
+- Uses `execl()` to restart the process when the ANE compiler limit is reached
+- IOSurface as shared memory between CPU and ANE
+- Gradient accumulation with async CBLAS on a separate dispatch queue
+
+## LOW Finding Fixes (2026-03-02)
+
+Created the GitHub fork `manni07/ANE` and the branch `fix/low-security-findings`.
+All 4 LOW findings fixed:
+
+| Finding | File | Change |
+|---------|------|--------|
+| LOW-01 | `training/Makefile` | `SEC_FLAGS = -fstack-protector-strong -Wformat-security`, `CFLAGS_DEBUG`, `verify-flags` target |
+| LOW-02 | `training/Makefile` | `ANE_COMPAT` variable with documentation, `check-deprecated` target |
+| LOW-03 | `training/tokenize.py` | 5 input validations, configurable size limit via `MAX_ZIP_BYTES` |
+| LOW-04 | `.gitignore` (new) | Binaries, logs, macOS metadata, and training data excluded |
+
+**Simulation:** 3 iteration rounds, overall score 96.35% (all criteria ≥ 95%)
+**Remote:** `origin=manni07/ANE`, `upstream=maderix/ANE`
+
+## CRIT Finding Fixes (2026-03-02)
+
+Created the branch `fix/crit-security-findings`. All 4 CRIT findings fixed:
+
+| Finding | Files | Core change |
+|---------|-------|-------------|
+| CRIT-01 | `training/ane_runtime.h`, `training/stories_config.h` | `dlopen()` return check; `NSClassFromString()` validation; `g_ane_ok`/`g_ane_ok_large` flags; `stories_config.h` re-entry guard |
+| CRIT-02 | `training/ane_runtime.h`, `training/stories_io.h` | `g_ane_ok` guard in `ane_compile()`; `g_ane_ok_large` guard in `compile_kern_mil_w()`; `mdl` NULL check before `hexStringIdentifier` |
+| CRIT-03 | `training/model.h`, `training/train_large.m` | `fread()` config/header check as gatekeeper; `fopen()` NULL check in `save_checkpoint()`; design decision documented |
+| CRIT-04 | `training/stories_io.h`, `training/model.h` | `int`→`size_t` in all `build_blob*` functions; `(size_t)` cast in `malloc()` sizes; `calloc()` NULL checks |
+
+**Simulation:** 3 iteration rounds (CRIT-03 needed 3 runs), overall score 96.15% (all criteria ≥ 95%)
+**Branch:** `fix/crit-security-findings` on `manni07/ANE`
+
+## Status
+
+| Finding type | Count | Status |
+|--------------|-------|--------|
+| CRITICAL (CRIT-01–04) | 4 | ✅ FIXED |
+| HIGH (HIGH-01–05) | 5 | Open |
+| MEDIUM (MED-01–06) | 6 | Open |
+| LOW (LOW-01–04) | 4 | ✅ FIXED |
--- a/docs/reports/security-audit-2026-03-02.md
+++ b/docs/reports/security-audit-2026-03-02.md
@ -0,0 +1,419 @@
+# Security Audit: ANE (Apple Neural Engine Training Framework)
+**Date:** 2026-03-02
+**Repository:** https://github.com/maderix/ANE
+**Auditor:** Claude Code (claude-sonnet-4-6)
+**Scope:** Full codebase analysis (38 source files, Objective-C/C/Python)
+
+---
+
+## Executive Summary
+
+The ANE project implements neural network training directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. It is a **research/experimental project** with substantial inherent security risks arising from its use of undocumented Apple interfaces.
+
+**Overall assessment: HIGH RISK** for production use.
+
+| Category | Count |
+|----------|-------|
+| CRITICAL | 4     |
+| HIGH     | 5     |
+| MEDIUM   | 6     |
+| LOW      | 4     |
+| **Total**| **19**|
+
+---
+
+## CRITICAL Findings
+
+### [CRIT-01] No error handling for `dlopen()` of the private framework
+**Files:** `training/ane_runtime.h:26`, `api_exploration.m:15`
+**Severity:** CRITICAL
+**Status: FIXED** (2026-03-02, branch `fix/crit-security-findings`)
+
+```objc
+// ane_runtime.h:26
+dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
+```
+
+**Problem:**
+- The return value of `dlopen()` is not checked. If the framework is not found (after a macOS update or on non-Apple-Silicon hardware), `dlopen()` returns NULL, but execution continues.
+- All subsequent `NSClassFromString()` calls then return NULL as well.
+- `g_ane_loaded = true` is set even if loading failed.
+
+**Consequence:** Null-pointer dereferences on the first API call, and an uncontrolled crash without a meaningful error message.
+
+**Recommendation:**
+```objc
+void *handle = dlopen("...", RTLD_NOW);
+if (!handle) {
+    fprintf(stderr, "ANE framework not found: %s\n", dlerror());
+    abort();
+}
+if (!g_ANEDesc || !g_ANEInMem || !g_ANEReq || !g_ANEIO) {
+    fprintf(stderr, "ANE private classes not found (API changed?)\n");
+    abort();
+}
+```
+
+---
+
+### [CRIT-02] Unsafe `objc_msgSend` casts without type validation
+**Files:** `training/ane_runtime.h:59-125`, `training/stories_io.h:90-117`
+**Severity:** CRITICAL
+**Status: FIXED** (2026-03-02, branch `fix/crit-security-findings`)
+
+```objc
+// ane_runtime.h:59-61
+id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(
+    g_ANEDesc, @selector(modelWithMILText:weights:optionsPlist:),
+    milText, wdict, nil);
+```
+
+**Problems:**
+1. The class `g_ANEDesc` could be NULL (if `dlopen` failed, see CRIT-01)
+2. The method signature is hard-coded; if Apple changes the API, the wrong cast results in undefined behavior or memory corruption
+3. No `@try/@catch` to catch possible Objective-C exceptions
+4. The global variables `g_D`, `g_I`, `g_AIO`, `g_AR` in `stories_io.h` could be NULL
+
+**Consequence:** Memory corruption, SIGBUS, uncontrolled crash.
+
+**Recommendation:** At a minimum, add NULL checks before every `objc_msgSend`:
+```objc
+if (!g_ANEDesc) { fprintf(stderr, "g_ANEDesc is NULL\n"); return NULL; }
+```
+
+---
+
+### [CRIT-03] `fread()` return values never checked: uninitialized memory
+**Files:** `training/model.h:81-146`, `training/train_large.m:17-55`
+**Severity:** CRITICAL
+**Status: FIXED** (2026-03-02, branch `fix/crit-security-findings`)
+
+```c
+// model.h:81
+fread(&m->cfg, sizeof(Config), 1, f);  // return value ignored!
+
+// train_large.m:29
+fread(embed, 4, V * DIM, f);  // no check that V*DIM floats were actually read
+```
+
+**Problems:**
+1. If the model file is smaller than expected (corrupt, truncated), structs are filled with garbage values
+2. No check that `cfg.dim`, `cfg.hidden_dim`, `cfg.n_layers` are plausible before memory is allocated
+3. `fread(embed, 4, V * DIM, f)`: with V=32000, DIM=768 this reads 98,304,000 bytes. No size validation.
+4. In `load_checkpoint()`: if the file ends after the header, the weights receive 0 bytes of data and no warning is issued
+
+**Recommendation:**
+```c
+size_t n = fread(&m->cfg, sizeof(Config), 1, f);
+if (n != 1) { fprintf(stderr, "Config read failed\n"); fclose(f); return -1; }
+if (m->cfg.dim <= 0 || m->cfg.dim > 65536 || m->cfg.n_layers <= 0) {
+    fprintf(stderr, "Invalid model config\n"); fclose(f); return -1;
+}
+```
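The same pattern extends to the bulk weight reads from items 3 and 4. A minimal sketch, not the project's code: `read_floats` and its `what` label are hypothetical names introduced here for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: reads exactly `count` floats from `f` into `dst`.
 * Returns 0 on success, -1 (after logging) on a short read, which is what
 * a truncated or corrupt model file produces. */
int read_floats(FILE *f, float *dst, size_t count, const char *what) {
    size_t got = fread(dst, sizeof(float), count, f);
    if (got != count) {
        fprintf(stderr, "%s: expected %zu floats, got %zu\n", what, count, got);
        return -1;
    }
    return 0;
}
```

A caller would replace a bare `fread(embed, 4, V * DIM, f)` with `read_floats(f, embed, (size_t)V * DIM, "embed")` and stop loading when it returns -1.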
+
+---
+
+### [CRIT-04] Integer overflow in memory size calculation
+**Files:** `training/stories_io.h:13-14`, `training/ane_mil_gen.h:12-13`
+**Severity:** CRITICAL
+**Status: FIXED** (2026-03-02, branch `fix/crit-security-findings`)
+
+```c
+// stories_io.h:13-14
+static NSData *build_blob(const float *w, int rows, int cols) {
+    int ws = rows * cols * 2;   // int multiplication, not size_t!
+    int tot = 128 + ws;
+```
+
+**Problem:** For larger models with `dim >= 2048, hidden >= 16384`, integer overflows could occur. The `int` product overflows once `rows * cols * 2` exceeds `INT_MAX` (2^31 - 1), i.e. for `rows * cols >= 2^30`. In `*(uint32_t*)(chunk + 8) = (uint32_t)wsize;`, a `wsize` that has gone negative as an `int` (overflow) is written as a uint32, which yields a wrong blob size and then ANE errors or memory corruption.
+
+**Recommendation:** Use `size_t` for all memory size calculations:
+```c
+size_t ws = (size_t)rows * cols * sizeof(_Float16);
+size_t tot = 128 + ws;
+```
+
+---
+
+## HIGH Findings
+
+### [HIGH-01] No input validation for token indices
+**File:** `training/train_large.m:375-376`
+**Severity:** HIGH
+
+```c
+size_t max_pos = n_tokens - SEQ - 1;
+size_t pos = (size_t)(drand48() * max_pos);
+uint16_t *input_tokens = token_data + pos;
+```
+
+**Problems:**
+1. Token values from `token_data` are used directly as embedding indices without checking that `token < VOCAB`
+2. If the `.bin` file contains corrupt token values (>= 32000), out-of-bounds accesses to `embed[]` occur
+3. No check that `n_tokens >= SEQ + 1` before the `max_pos` calculation
+
+**Consequence:** Heap buffer overflow; a corrupt `.bin` file can lead to memory corruption.
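The three checks listed above could be combined roughly as follows. This is a minimal sketch, not the project's code: `pick_window`, its parameters and the `-1` error convention are hypothetical, `SEQ` and `VOCAB` simply mirror the values documented for `stories_config.h`, and the window length of `SEQ + 1` assumes the targets are the inputs shifted by one token.

```c
#include <stddef.h>
#include <stdint.h>

#define SEQ   256    /* mirrors stories_config.h */
#define VOCAB 32000  /* mirrors stories_config.h */

/* Hypothetical helper: chooses the start of one training window and checks
 * every token in it. `u` is a uniform random number in [0, 1), for example
 * from drand48(). Returns 0 on success, -1 on any validation failure. */
int pick_window(const uint16_t *tokens, size_t n_tokens, double u, size_t *pos_out) {
    if (tokens == NULL || pos_out == NULL) return -1;
    if (!(u >= 0.0 && u < 1.0)) return -1;          /* also rejects NaN */
    /* Guard the unsigned subtraction below so it cannot wrap around. */
    if (n_tokens < (size_t)SEQ + 2) return -1;
    size_t max_pos = n_tokens - SEQ - 1;            /* at least 1 here */
    size_t pos = (size_t)(u * (double)max_pos);
    if (pos >= max_pos) pos = max_pos - 1;          /* guard against rounding */
    /* Assumed window: SEQ inputs plus one extra token for the shifted target. */
    for (size_t i = 0; i <= (size_t)SEQ; i++) {
        if (tokens[pos + i] >= VOCAB) return -1;    /* would index past embed[] */
    }
    *pos_out = pos;
    return 0;
}
```

Scanning each window costs `SEQ + 1` comparisons per step; validating the whole token file once after `mmap` would be an alternative with the same effect.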
+
+---
+
+### [HIGH-02] Checkpoint path with relative directory navigation
+**File:** `training/train_large.m:8-10`
+**Severity:** HIGH
+
+```c
+#define CKPT_PATH "ane_stories110M_ckpt.bin"
+#define MODEL_PATH "../../assets/models/stories110M.bin"  // ← relative path!
+#define DATA_PATH "tinystories_data00.bin"
+```
+
+**Problems:**
+1. `MODEL_PATH` contains `../../`, i.e. relative path navigation. If the binary is started from an unexpected directory, the wrong files are read.
+2. No `realpath()` call to normalize the path
+3. A manipulated checkpoint combined with `--resume` means uncontrolled binary data is loaded as weights
+
+---
+
+### [HIGH-03] `execl()` process restart without argument validation
+**File:** `training/train_large.m:331`
+**Severity:** HIGH
+
+```c
+execl(argv[0], argv[0], "--resume", NULL);
+```
+
+**Problems:**
+1. `argv[0]` is passed on without validation. Via a symlink, an arbitrary binary could be started.
+2. `data_fd` (the mmap'd token file) is not closed before `execl()`, so the file descriptor leaks into the new process
+3. `munmap(token_data)` is not called before `execl()`
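For the descriptor leak in item 2, one common POSIX approach is to mark the descriptor close-on-exec right after opening it. A minimal sketch, assuming `data_fd` is an ordinary file descriptor; `set_cloexec` is a hypothetical helper name, not a function from the project.

```c
#include <fcntl.h>

/* Hypothetical helper: marks `fd` close-on-exec so that the kernel closes it
 * automatically when execl() replaces the process image.
 * Returns 0 on success, -1 on failure (errno is set by fcntl). */
int set_cloexec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}
```

Calling `set_cloexec(data_fd)` once after `open()` removes the need to remember a `close()` on every restart path. Opening the file with `O_CLOEXEC` directly would achieve the same in one step.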
+
+---
+
+### [HIGH-04] Missing `malloc()`/`calloc()` return value checks
+**Files:** All `.m` and `.h` files
+**Severity:** HIGH
+
+```c
+// train_large.m:219
+float *embed = (float*)malloc(VOCAB*DIM*4);  // 32000*768*4 = 98 MB, no NULL check!
+```
+
+None of the `malloc()`/`calloc()` calls checks the return value for NULL. Under memory pressure (110M model + Adam state = several GB), allocations can fail, which leads to a null-pointer dereference.
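A small wrapper is one way to make the check hard to forget. The following is a sketch under assumptions: `alloc_array` and its `what` label are hypothetical names, and returning NULL (rather than aborting) is a design choice made here for illustration only.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: allocates `count` elements of `elem_size` bytes.
 * Returns NULL (after logging) if the byte count would overflow size_t or
 * if malloc itself fails; the caller decides how to recover. */
void *alloc_array(size_t count, size_t elem_size, const char *what) {
    if (count == 0 || elem_size == 0) return NULL;
    if (count > SIZE_MAX / elem_size) {
        fprintf(stderr, "%s: size overflow (%zu x %zu)\n", what, count, elem_size);
        return NULL;
    }
    void *p = malloc(count * elem_size);
    if (p == NULL) {
        fprintf(stderr, "%s: out of memory (%zu bytes)\n", what, count * elem_size);
    }
    return p;
}
```

The allocation quoted above would then read `float *embed = alloc_array((size_t)VOCAB * DIM, sizeof(float), "embed");` followed by an explicit `if (!embed)` branch. The overflow guard also covers the size calculation issue described in CRIT-04.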
+
+---
+
+### [HIGH-05] ANE inference without error checking in the training hot path
+**File:** `training/stories_io.h:131-134`
+**Severity:** HIGH
+
+```c
+static void ane_run(Kern *k) {
+    id mdl = (__bridge id)k->model; id req = (__bridge id)k->request; NSError *e = nil;
+    ((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
+        mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e);
+    // BOOL return value and NSError *e are ignored!
+}
+```
+
+**Problem:** ANE execution can fail (thermal throttling, hardware errors, API changes). Silent failures lead to undetected gradient corruption.
+
+---
+
+## MEDIUM Findings
+
+### [MED-01] IOSurface lock without error handling
+**File:** `training/stories_io.h:62-83`
+**Severity:** MEDIUM
+
+```c
+IOSurfaceLock(s, 0, NULL);  // return code ignored
+```
+
+`IOSurfaceLock()` returns `kIOReturnSuccess` or an error code. If the lock fails, the memory is accessed anyway, which can cause a data race.
+
+---
+
+### [MED-02] Temporary directory not created securely (TOCTOU risk)
+**Files:** `training/ane_runtime.h:68-80`, `training/stories_io.h:94-100`
+**Severity:** MEDIUM
+
+```objc
+NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx];
+[milText writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES];
+```
+
+There is a TOCTOU race between `createDirectoryAtPath` and `writeToFile`. Another process could guess the `hexStringIdentifier` and tamper with the directory.
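One standard POSIX way to close this kind of window is `mkdtemp()`, which picks an unpredictable name and creates the directory in a single step. A minimal sketch, assuming a POSIX system: `make_private_tmpdir` and the `ane_` prefix are hypothetical. Whether the ANE framework accepts a directory that is not named after the hex identifier is not established here; if it does not, the private directory could serve as the parent of the identifier-named one.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: creates a fresh private directory under `base`
 * (for example the value of TMPDIR). mkdtemp() replaces the XXXXXX suffix
 * with a unique name and creates the directory with mode 0700.
 * Returns 0 and leaves the path in `out`, or -1 on failure. */
int make_private_tmpdir(const char *base, char *out, size_t out_len) {
    if (base == NULL || out == NULL) return -1;
    int n = snprintf(out, out_len, "%s/ane_XXXXXX", base);
    if (n < 0 || (size_t)n >= out_len) return -1;   /* template was truncated */
    return mkdtemp(out) == NULL ? -1 : 0;
}
```

Because the directory is created with owner-only permissions, other local users cannot place or replace files inside it between creation and the subsequent write.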
+
+---
+
+### [MED-03] MIL text generation without parameter validation
+**File:** `training/ane_mil_gen.h:32-52`
+**Severity:** MEDIUM
+
+```objc
+return [NSString stringWithFormat:
+    @"...tensor<fp32, [1, %d, %d]> x...", in_ch, spatial, ...];
+```
+
+Negative or extremely large `in_ch`/`out_ch`/`spatial` values caused by a faulty configuration produce invalid MIL that is passed to the undocumented ANE compiler.
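A guard in front of the format call could look like the sketch below. The helper name `mil_dims_valid` is hypothetical, and the upper bound of 65536 is an assumption borrowed from the CRIT-03 recommendation; the real ANE limits are undocumented and would have to be determined empirically.

```c
#include <stdbool.h>

/* Assumed upper bound, for illustration only. */
#define MIL_MAX_DIM 65536

/* Hypothetical helper: rejects non-positive or implausibly large tensor
 * dimensions before they are formatted into MIL text. */
bool mil_dims_valid(int in_ch, int out_ch, int spatial) {
    const int dims[3] = { in_ch, out_ch, spatial };
    for (int i = 0; i < 3; i++) {
        if (dims[i] <= 0 || dims[i] > MIL_MAX_DIM) return false;
    }
    return true;
}
```

A generator would call `mil_dims_valid(in_ch, out_ch, spatial)` first and return `nil` instead of a MIL string when the check fails.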
+
+---
+
+### [MED-04] No endianness check in checkpoint serialization
+**File:** `training/train_large.m:110-181`
+**Severity:** MEDIUM
+
+```c
+h.magic = 0x424C5A54;
+fwrite(&h, sizeof(h), 1, f);
+```
+
+The `CkptHdr` struct is written as a raw binary dump without an endianness marker. This is not portable.
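A portable alternative is to serialize each field byte by byte in a fixed order. The sketch below shows this for a single 32-bit value such as the magic number; `put_u32_le` and `get_u32_le` are hypothetical helper names, and little-endian is chosen here only because it matches the byte order of the Apple Silicon hosts the project targets.

```c
#include <stdint.h>

/* Hypothetical helper: stores `v` as four little-endian bytes,
 * independent of the host byte order. */
void put_u32_le(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}

/* Hypothetical helper: reads four little-endian bytes back into a uint32_t. */
uint32_t get_u32_le(const uint8_t in[4]) {
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}
```

Writing every header field this way instead of `fwrite(&h, sizeof(h), 1, f)` would also sidestep differences in struct padding. The field layout of `CkptHdr` is not shown in this report, so the per-field order is left open.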
+
+---
+
+### [MED-05] NEON vectorization without alignment guarantee
+**File:** `training/stories_io.h:41-58`
+**Severity:** MEDIUM
+
+```c
+float16x8_t h = vld1q_f16((const __fp16*)(src + i));
+```
+
+Pointer arithmetic with `ch_off * sp` could violate the alignment required by NEON if `ch_off * sp` is not a multiple of 8.
+
+---
+
+### [MED-06] Global variables without thread safety
+**Files:** `training/stories_io.h`, `training/stories_config.h`
+**Severity:** MEDIUM
+
+```c
+static bool g_ane_loaded = false;
+static int g_compile_count = 0;
+```
+
+`g_compile_count` is incremented atomically via `__sync_fetch_and_add()`, but `g_ane_loaded` and the class variables are not set atomically, so multi-threaded use leads to a race condition in `ane_init()`.
+
+---
+
+## LOW Findings
+
+### [LOW-01] Missing compiler security flags
+**File:** `training/Makefile:2`
+**Severity:** LOW
+**Status: FIXED** (2026-03-02, branch `fix/low-security-findings`)
+
+```makefile
+CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc
+```
+
+Missing flags: `-fstack-protector-strong`, `-D_FORTIFY_SOURCE=2`, `-Wformat=2`
+
+**Fix:** Introduced `SEC_FLAGS = -fstack-protector-strong -Wformat-security`. Note:
+`-D_FORTIFY_SOURCE=2` is implicitly active on macOS (Apple LLVM) at `-O2`, so defining it
+explicitly would produce a "macro redefinition" warning. Added `CFLAGS_DEBUG` with
+`-fsanitize=address,undefined` for debug builds. `make verify-flags`
+shows the active flags.
+
+---
+
+### [LOW-02] `-Wno-deprecated-declarations` suppresses important warnings
+**File:** `training/Makefile:2`
+**Severity:** LOW
+**Status: FIXED** (2026-03-02, branch `fix/low-security-findings`)
+
+Suppresses warnings about deprecated API calls, which could hide important hints about deprecated private APIs.
+
+**Fix:** Extracted the flag into a named variable `ANE_COMPAT` with an explanatory comment
+(deliberate suppression because of the private `_ANE*` APIs called via `objc_msgSend`). The new target
+`make check-deprecated` builds without the suppression and shows all hidden warnings.
+
+---
+
+### [LOW-03] Python script without input validation
+**File:** `training/tokenize.py`
+**Severity:** LOW
+**Status: FIXED** (2026-03-02, branch `fix/low-security-findings`)
+
+No validation of the input file size; very large inputs can cause an out-of-memory condition.
+
+**Fix:** 5 validations implemented:
+1. ZIP existence check with a helpful error message
+2. Configurable size limit (default 10 GB, overridable via the `MAX_ZIP_BYTES` env var)
+3. Check that `data00.bin` is contained in the ZIP
+4. Error handling for `struct.unpack` when the output is < 20 bytes
+5. Token range validation (all tokens must be < `VOCAB_SIZE=32000`)
+
+---
+
+### [LOW-04] No `.gitignore` for sensitive artifacts
+**File:** Repository root
+**Severity:** LOW
+**Status: FIXED** (2026-03-02, branch `fix/low-security-findings`)
+
+No `.gitignore` file. Binary artifacts (checkpoints, training data, `firebase-debug.log`) could be committed accidentally.
+
+**Fix:** Created `.gitignore` with rules for: macOS metadata (`.DS_Store`),
+log files (`*.log`), compiled binaries (`training/train`, `training/train_large`,
+all probe binaries), training data (`training/*.bin`), ANE artifacts
+(`*.mlmodelc/`, `*.mlpackage/`), external assets (`assets/`).
+
+---
+
+## Positive Findings (Strengths)
+
+### Correct memory release
+`ane_free()` (`ane_runtime.h:149-160`) and `free_kern()` (`stories_io.h:122-130`) implement complete cleanup routines with `CFRelease()`, `unloadWithQoS:error:`, and temporary directory cleanup.
+
+### Magic-byte validation in checkpoints
+```c
+if (h.magic != 0x424C5A54 || h.version != 2) { fclose(f); return false; }
+```
+Basic protection against corrupt checkpoint files.
+
+### Atomic compile counter
+```c
+__sync_fetch_and_add(&g_compile_count, 1);
+```
+Thread-safe counter for the number of ANE compilations.
+
+### Gradient accumulation with async CBLAS
+Correct parallelization of the CPU weight-gradient computation via `dispatch_group_async`.
+
+---
+
+## Risk Assessment for Production Use
+
+| Aspect | Assessment |
+|--------|------------|
+| Apple Silicon required | macOS 15+, M-series only |
+| Private API stability | **VERY LOW**: any macOS update can break it |
+| Memory safety | **MEDIUM**: no bounds checks, no sanitizers |
+| Input validation | **LOW**: files are read without validation |
+| Error handling | **LOW**: many critical errors are ignored |
+| Suitability for production | **NO**: research/experimental project |
+
+---
+
+## Recommendations by Priority
+
+### Immediate actions (CRITICAL)
+1. Check the `dlopen()` return value and abort on failure
+2. Check all `fread()` return values and validate file sizes
+3. Add NULL checks before all `objc_msgSend` calls
+4. `int` → `size_t` for all memory size calculations
+
+### Short-term actions (HIGH)
+5. Token index validation: `if (token >= VOCAB) abort()`
+6. Check the ANE inference return value and NSError
+7. Compiler flags: `-fstack-protector-strong -D_FORTIFY_SOURCE=2`
+8. Create a `.gitignore` for binary artifacts
+
+### Medium-term actions (MEDIUM)
+9. Check IOSurface lock return values
+10. `__atomic_store_n()` for `g_ane_loaded`
+11. MIL parameter validation before formatting
+
+---
+
+*This report was prepared for the ANE research project. The project is explicitly designed as proof-of-concept/research code and is not intended for production use.*