[docs] Developer documentation: architecture diagrams, complete API reference, benchmark guide, M4 Max results, security audit report

Erik Bray 2026-03-03 14:22:22 +01:00
parent 680f8c7e20
commit 37cac988b8
6 changed files with 1701 additions and 0 deletions

429
docs/API_REFERENCE.md Normal file

@@ -0,0 +1,429 @@
# ANE Training -- API Reference
Complete index of public functions, structs, macros, and global variables, organized by source file.
---
## Table of Contents
1. [stories_config.h -- Model Configuration](#stories_configh)
2. [stories_io.h -- IOSurface I/O and Compilation](#stories_ioh)
3. [stories_mil.h -- MIL Program Generators](#stories_milh)
4. [stories_cpu_ops.h -- CPU Operations](#stories_cpu_opsh)
5. [ane_runtime.h -- Generalized ANE Wrapper](#ane_runtimeh)
6. [ane_mil_gen.h -- Composable MIL Helpers](#ane_mil_genh)
7. [ane_rmsnorm_bwd.h -- RMSNorm Backward on ANE](#ane_rmsnorm_bwdh)
8. [ane_classifier.h -- Classifier and Softmax on ANE](#ane_classifierh)
9. [bridge/ane_bridge.h -- C Bridge API](#bridgeane_bridgeh)
10. [MIL Operation Reference](#mil-operation-reference)
11. [Weight Blob Format](#weight-blob-format)
---
## stories_config.h
Model constants, data structures, and memory allocation helpers.
### Macros
| Macro | Value | Description |
|-------|-------|-------------|
| `DIM` | 768 | Model hidden dimension |
| `HIDDEN` | 2048 | FFN intermediate dimension |
| `HEADS` | 12 | Number of attention heads |
| `HD` | 64 (`DIM/HEADS`) | Per-head dimension |
| `SEQ` | 256 | Sequence length |
| `NLAYERS` | 12 | Number of transformer layers |
| `VOCAB` | 32000 | Vocabulary size |
| `ACCUM_STEPS` | 10 | Gradient accumulation steps per compile batch |
| `MAX_COMPILES` | 100 | ANE compile budget before process restart |
| `KERNELS_PER_LAYER` | 5 | Weight-bearing ANE kernels per layer |
| `TOTAL_WEIGHT_KERNELS` | 60 | Total weight-bearing compiles per batch |
| `SCORE_CH` | 3072 (`HEADS*SEQ`) | Attention score channels for SDPA backward |
| `WQ_SZ` | 589824 (`DIM*DIM`) | Size of Q/K/V/O projection weight matrices |
| `WO_SZ` | 589824 (`DIM*DIM`) | Size of output projection |
| `W1_SZ` | 1572864 (`HIDDEN*DIM`) | FFN gate up-projection size |
| `W2_SZ` | 1572864 (`DIM*HIDDEN`) | FFN down-projection size |
| `W3_SZ` | 1572864 (`HIDDEN*DIM`) | FFN value up-projection size |
| `LAYER_PARAMS` | -- | Total floats per layer: `4*WQ_SZ + W1_SZ + W2_SZ + W3_SZ + 2*DIM` |
| `TOTAL_PARAMS` | -- | Total model params: `NLAYERS * LAYER_PARAMS + DIM + VOCAB*DIM` |
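The two formulas above can be checked by hand. The following is a minimal sketch that recomputes them from the constants in the table; the function names are illustrative and are not part of `stories_config.h`.

```c
#include <assert.h>
#include <stddef.h>

/* Constants copied from the macro table above. */
enum { DIM = 768, HIDDEN = 2048, NLAYERS = 12, VOCAB = 32000 };

/* Floats in one transformer layer: four DIM x DIM attention projections,
 * three FFN matrices of HIDDEN * DIM elements each, and two RMSNorm
 * scale vectors of DIM elements. */
size_t layer_params(void) {
    size_t wq_sz = (size_t)DIM * DIM;     /* same count for Wk, Wv, Wo */
    size_t w1_sz = (size_t)HIDDEN * DIM;  /* same count for W2, W3 */
    return 4 * wq_sz + 3 * w1_sz + 2 * (size_t)DIM;
}

/* Whole model: all layers, the final RMSNorm scale, and the embedding table. */
size_t total_params(void) {
    return (size_t)NLAYERS * layer_params() + DIM + (size_t)VOCAB * DIM;
}

/* Sanity check: 109,529,856 floats, i.e. roughly 109.53M parameters. */
void check_param_counts(void) {
    assert(layer_params() == 7079424u);
    assert(total_params() == 109529856u);
}
```

The embedding table (`VOCAB*DIM`) is counted once, which is consistent with the classifier reusing the embedding weights as described in the `ane_classifier.h` section.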
### Structs
#### `LayerWeights`
Per-layer weight matrices (all `float*`).
| Field | Shape | Description |
|-------|-------|-------------|
| `Wq`, `Wk`, `Wv`, `Wo` | `[DIM, DIM]` | Attention projection weights |
| `W1`, `W3` | `[HIDDEN, DIM]` | FFN gate and value up-projections |
| `W2` | `[DIM, HIDDEN]` | FFN down-projection |
| `rms_att` | `[DIM]` | RMSNorm scale for attention sublayer |
| `rms_ffn` | `[DIM]` | RMSNorm scale for FFN sublayer |
#### `AdamState`
First/second moment buffers for a single parameter group.
| Field | Type | Description |
|-------|------|-------------|
| `m` | `float*` | First moment (mean) estimate |
| `v` | `float*` | Second moment (variance) estimate |
| `n` | `size_t` | Number of parameters |
#### `LayerAdam`
Per-layer Adam optimizer state. Contains one `AdamState` per parameter tensor: `Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`, `W3`, `rms_att`, `rms_ffn`.
#### `LayerActs`
Per-layer activation tensors saved for the backward pass.
| Field | Shape | Description |
|-------|-------|-------------|
| `layer_in` | `[DIM, SEQ]` | Input to this layer (for rmsnorm1 backward) |
| `xnorm` | `[DIM, SEQ]` | RMSNorm1 output |
| `Q`, `K`, `V` | `[DIM, SEQ]` | QKV projections |
| `attn_out` | `[DIM, SEQ]` | Attention output (before Wo) |
| `o_out` | `[DIM, SEQ]` | Wo projection output |
| `x2` | `[DIM, SEQ]` | Residual after attention |
| `x2norm` | `[DIM, SEQ]` | RMSNorm2 output |
| `h1`, `h3` | `[HIDDEN, SEQ]` | FFN intermediates (W1 and W3 outputs) |
| `silu_out` | `[HIDDEN, SEQ]` | SiLU(h1) * h3 gated output |
| `ffn_out` | `[DIM, SEQ]` | FFN final output |
#### `LayerGrads`
Per-layer gradient accumulators. Same field names as `LayerWeights` (all `float*`): `Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`, `W3`, `rms_att`, `rms_ffn`.
#### `Kern`
Single ANE kernel handle (stories-specific, single I/O).
| Field | Type | Description |
|-------|------|-------------|
| `model` | `void*` | Retained `_ANEInMemoryModel` |
| `ioIn` | `IOSurfaceRef` | Input IOSurface |
| `ioOut` | `IOSurfaceRef` | Output IOSurface |
| `request` | `void*` | Retained `_ANERequest` |
| `tmpDir` | `void*` | Retained temp directory path |
#### `LayerKernels`
ANE kernels for one transformer layer.
| Field | Type | Description |
|-------|------|-------------|
| `fwdAttn` | `Kern*` | SDPA forward + taps |
| `fwdFFN` | `Kern*` | FFN forward + taps |
| `ffnBwd` | `Kern*` | FFN backward |
| `sdpaBwd1` | `Kern*` | SDPA backward part 1 (Wo^T + dV + scores) |
| `sdpaBwd2` | `Kern*` | SDPA backward part 2 (dQ + dK) |
| `qkvBwd` | `Kern*` | QKV backward (Wq^T, Wk^T, Wv^T) |
#### `CkptHdr`
Checkpoint file header (128 bytes, version 2).
| Field | Type | Description |
|-------|------|-------------|
| `magic` | `int` | `0x424C5A54` ("BLZT") |
| `version` | `int` | 2 |
| `step`, `total_steps` | `int` | Training progress |
| `n_layers`, `vocab_size`, `dim`, `hidden_dim`, `n_heads`, `seq_len` | `int` | Model shape |
| `lr`, `loss` | `float` | Learning rate, last loss |
| `cum_compile`, `cum_train`, `cum_wall` | `double` | Cumulative timing (ms) |
| `cum_steps`, `cum_batches` | `int` | Cumulative counters |
| `adam_t` | `int` | Adam timestep (for bias correction) |
| `pad[3]` | `int` | Alignment padding |
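As an illustration of how the header fields might be used on resume, here is a hedged sketch of a compatibility check. `CkptHdrSketch` and `ckpt_hdr_compatible` are hypothetical names; the field order follows the table, but the real struct's layout and padding are assumptions, so the sketch never relies on `sizeof`. Whether the training code performs such a check is not stated here.

```c
#include <assert.h>

/* Hypothetical mirror of the CkptHdr fields listed above. Field order and
 * padding in the real header may differ. */
typedef struct {
    int magic, version;
    int step, total_steps;
    int n_layers, vocab_size, dim, hidden_dim, n_heads, seq_len;
    float lr, loss;
    double cum_compile, cum_train, cum_wall;
    int cum_steps, cum_batches;
    int adam_t;
    int pad[3];
} CkptHdrSketch;

enum { CKPT_MAGIC = 0x424C5A54, CKPT_VERSION = 2 };

/* Returns 1 when the header identifies a version-2 checkpoint whose model
 * shape matches the Stories110M constants, 0 otherwise. */
int ckpt_hdr_compatible(const CkptHdrSketch *h) {
    assert(h != 0);
    if (h->magic != CKPT_MAGIC || h->version != CKPT_VERSION) return 0;
    return h->n_layers == 12 && h->vocab_size == 32000 && h->dim == 768 &&
           h->hidden_dim == 2048 && h->n_heads == 12 && h->seq_len == 256;
}
```

Read from most to least significant byte, the magic value spells the ASCII characters `B`, `L`, `Z`, `T`.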
#### `Llama2Config`
Header from llama2.c model files (7 ints): `dim`, `hidden_dim`, `n_layers`, `n_heads`, `n_kv_heads`, `vocab_size`, `seq_len`.
### Global Variables
| Name | Type | Description |
|------|------|-------------|
| `g_D` | `Class` | `_ANEInMemoryModelDescriptor` ObjC class |
| `g_I` | `Class` | `_ANEInMemoryModel` ObjC class |
| `g_AR` | `Class` | `_ANERequest` ObjC class |
| `g_AIO` | `Class` | `_ANEIOSurfaceObject` ObjC class |
| `g_tb` | `mach_timebase_info_data_t` | Mach time base for timing |
| `g_compile_count` | `int` | Running count of ANE compiles |
### Functions
| Function | Returns | Description |
|----------|---------|-------------|
| `ane_init(void)` | `void` | Load AppleNeuralEngine.framework, resolve 4 private class references |
| `tb_ms(uint64_t t)` | `double` | Convert Mach absolute time to milliseconds |
| `adam_alloc(size_t n)` | `AdamState` | Allocate zeroed first/second moment buffers for n parameters |
| `adam_free(AdamState *s)` | `void` | Free an AdamState's buffers |
| `layer_weights_alloc(void)` | `LayerWeights` | Allocate all weight matrices for one layer |
| `layer_weights_free(LayerWeights *w)` | `void` | Free all weight matrices for one layer |
| `layer_adam_alloc(void)` | `LayerAdam` | Allocate Adam state for all weights in one layer |
| `layer_adam_free(LayerAdam *a)` | `void` | Free Adam state for one layer |
| `layer_acts_alloc(void)` | `LayerActs` | Allocate all activation buffers for one layer |
| `layer_acts_free(LayerActs *a)` | `void` | Free all activation buffers for one layer |
| `layer_grads_alloc(void)` | `LayerGrads` | Allocate zeroed gradient accumulators for one layer |
| `layer_grads_zero(LayerGrads *g)` | `void` | Zero all gradient accumulators (between accumulation steps) |
| `layer_grads_free(LayerGrads *g)` | `void` | Free gradient accumulators for one layer |
---
## stories_io.h
IOSurface creation, fp16/fp32 conversion, weight blob building, and ANE kernel compile/run.
**Depends on**: `stories_config.h`, `<arm_neon.h>`
### Functions
| Function | Returns | Description |
|----------|---------|-------------|
| `make_surface(size_t bytes)` | `IOSurfaceRef` | Create a 1D IOSurface with given byte allocation |
| `build_blob(const float *w, int rows, int cols)` | `NSData*` | Build fp16 weight blob (128B header + row-major fp16 data) from fp32 weights |
| `build_blob_t(const float *w, int rows, int cols)` | `NSData*` | Build fp16 weight blob with transposed layout (col-major fp16 from row-major fp32) |
| `build_blob_fp16(_Float16 *d, int cnt)` | `NSData*` | Build weight blob from pre-existing fp16 data (no conversion) |
| `cvt_f16_f32(float *dst, const _Float16 *src, int n)` | `void` | NEON-vectorized fp16-to-fp32 conversion (8-wide SIMD) |
| `cvt_f32_f16(_Float16 *dst, const float *src, int n)` | `void` | NEON-vectorized fp32-to-fp16 conversion (8-wide SIMD) |
| `io_write_fp16(IOSurfaceRef s, const float *data, int channels, int sp)` | `void` | Write fp32 data to IOSurface as fp16 in channel-first `[C,S]` layout |
| `io_read_fp16(IOSurfaceRef s, float *data, int ch_off, int channels, int sp)` | `void` | Read fp16 data from IOSurface at channel offset, convert to fp32 |
| `io_copy(IOSurfaceRef dst, int dst_ch, IOSurfaceRef src, int src_ch, int channels, int sp)` | `void` | Copy fp16 data between IOSurfaces at specified channel offsets |
| `io_write_fp16_at(IOSurfaceRef s, int ch_off, const float *data, int channels, int sp)` | `void` | Write fp32 data to IOSurface at specific channel offset as fp16 |
| `compile_kern_mil_w(NSString *mil, NSDictionary *weights, int ic_bytes, int oc_bytes)` | `Kern*` | Compile MIL text + weight dictionary into a loaded ANE kernel with IOSurfaces. Increments `g_compile_count`. |
| `free_kern(Kern *k)` | `void` | Unload ANE model, release IOSurfaces, remove temp directory, free kernel |
| `ane_run(Kern *k)` | `void` | Run a compiled ANE kernel on current IOSurface contents |
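The channel-offset helpers (`io_read_fp16`, `io_write_fp16_at`, `io_copy`) all rely on the same property of the channel-first `[C, S]` layout: channel `c` starts at element `c * sp`, so any range of channels is one contiguous run. A minimal sketch on plain memory, assuming `uint16_t` as a stand-in for raw fp16 bits (the real `io_copy` works on IOSurfaces, and `copy_channels_fp16` is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy `channels` rows of `sp` fp16 values between two channel-first [C, S]
 * buffers, starting at channel src_ch in src and channel dst_ch in dst. */
void copy_channels_fp16(uint16_t *dst, int dst_ch,
                        const uint16_t *src, int src_ch,
                        int channels, int sp) {
    assert(dst && src && dst_ch >= 0 && src_ch >= 0 && channels >= 0 && sp > 0);
    /* A channel range is contiguous, so a single memcpy is enough. */
    memcpy(dst + (size_t)dst_ch * sp,
           src + (size_t)src_ch * sp,
           (size_t)channels * sp * sizeof(uint16_t));
}
```

Because no conversion is involved, data that stays in fp16 (for example forward-pass outputs reused by a backward kernel) can be moved without a round trip through fp32.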
---
## stories_mil.h
MIL program generators for the 6 fused ANE kernel types. Each returns an `NSString*` containing the full MIL program text.
**Depends on**: `stories_io.h`
### Macros
| Macro | Description |
|-------|-------------|
| `MIL_HDR` | Standard MIL program header (version 1.3, buildInfo with coremlc/coremltools versions) |
| `CONV_CONST` | Common conv parameter constants (pad_type, strides, pad, dilations, groups) |
### Functions
| Function | Returns | Description |
|----------|---------|-------------|
| `gen_sdpa_fwd_taps(void)` | `NSString*` | SDPA forward: RMSNorm + QKV + attention + Wo. Output: `concat(o_out, Q, K, V, attn_out, xnorm)` `[1, 6*DIM, 1, SEQ]` |
| `gen_ffn_fwd_taps(void)` | `NSString*` | FFN forward: RMSNorm + W1/W3 + SiLU + W2. Output: `concat(ffn_out, h1, h3, silu_out, x2norm)` `[1, 2*DIM+3*HIDDEN, 1, SEQ]` |
| `gen_ffn_bwd(void)` | `NSString*` | FFN backward: Input `concat(dffn, h1, h3)`. Output: `concat(dx, dh1, dh3)` `[1, DIM+2*HIDDEN, 1, SEQ]` |
| `gen_qkvb(void)` | `NSString*` | QKV backward: Input `concat(dQ, dK, dV)`. Output: `dx` `[1, DIM, 1, SEQ]` |
| `gen_sdpa_bwd1(void)` | `NSString*` | SDPA backward part 1: Input `concat(Q, K, V, dx2)`. Output: `concat(dV, probs, dP)` `[1, DIM+2*SCORE_CH, 1, SEQ]` |
| `gen_sdpa_bwd2(void)` | `NSString*` | SDPA backward part 2: Input `concat(probs, dP, Q, K)`. Output: `concat(dQ, dK)` `[1, 2*DIM, 1, SEQ]` |
| `get_mask_blob(void)` | `NSData*` | Lazily build and cache the causal attention mask as an fp16 blob: 0 on and below the diagonal, -65504 (the most negative finite fp16 value) above it. |
### Global Variables
| Name | Type | Description |
|------|------|-------------|
| `g_mask_blob` | `NSData*` | Cached causal mask blob (built on first call to `get_mask_blob`) |
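The mask contents can be sketched in a few lines. This version stores `float` values for portability and assumes rows index the query position and columns the key position; the real blob holds fp16 and its exact orientation is not documented here. `fill_causal_mask` is a hypothetical name.

```c
#include <assert.h>

/* Fill a row-major [n, n] additive causal mask: 0 where key position j is at
 * or before query position i, and -65504 (the most negative finite fp16
 * value) where j lies in the future. */
void fill_causal_mask(float *mask, int n) {
    assert(mask && n > 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            mask[i * n + j] = (j <= i) ? 0.0f : -65504.0f;
}
```

Adding this mask to the attention scores before softmax drives the probability of future positions to effectively zero while leaving permitted positions unchanged.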
---
## stories_cpu_ops.h
CPU-side operations using Accelerate framework (vDSP, vvrsqrtf, vvexpf).
**Depends on**: `stories_config.h`
### Functions
| Function | Returns | Description |
|----------|---------|-------------|
| `rmsnorm(float *out, const float *x, const float *w, int d, int S)` | `void` | RMSNorm forward: `out = x * rsqrt(mean(x^2) + eps) * w`. Vectorized via vDSP. Layout: channel-first `[d, S]`. |
| `rmsnorm_bwd(float *dx, float *dw, const float *dy, const float *x, const float *w, int d, int S)` | `void` | RMSNorm backward: computes `dx` (input gradient) and accumulates `dw` (scale gradient). |
| `adam_update(float *w, const float *g, AdamState *s, int t, float lr, float b1, float b2, float eps)` | `void` | Adam optimizer step with bias correction. Updates weights in-place. `t` is the timestep for bias correction. |
| `cross_entropy_loss(float *dlogits, const float *logits, const uint16_t *targets, int V, int S)` | `float` | Compute mean cross-entropy loss. Writes `dlogits = (softmax(logits) - one_hot(targets)) / S`. Column-major `[V, S]` layout. Uses vDSP transpose + vvexpf for vectorized softmax. |
| `embed_lookup(float *x, const float *embed, const uint16_t *tokens, int dim, int seq)` | `void` | Embedding forward: gather rows from `embed[VOCAB, DIM]` into channel-first `x[DIM, SEQ]`. |
| `embed_backward(float *d_embed, const float *dx, const uint16_t *tokens, int dim, int seq)` | `void` | Embedding backward: scatter-add `dx` back into embedding table gradient `d_embed`. |
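The two embedding functions are a gather and its matching scatter-add. The scalar reference below is a sketch of that index arithmetic only, assuming a row-major `[VOCAB, DIM]` table and channel-first `[DIM, SEQ]` activations as described in the table; the `_ref` names are hypothetical and the real implementations may be vectorized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather: x[d, t] = embed[tokens[t], d]. */
void embed_lookup_ref(float *x, const float *embed, const uint16_t *tokens,
                      int dim, int seq) {
    assert(x && embed && tokens);
    for (int t = 0; t < seq; t++)
        for (int d = 0; d < dim; d++)
            x[d * seq + t] = embed[(size_t)tokens[t] * dim + d];
}

/* Scatter-add: d_embed[tokens[t], d] += dx[d, t].
 * A token that occurs several times accumulates all of its gradients. */
void embed_backward_ref(float *d_embed, const float *dx, const uint16_t *tokens,
                        int dim, int seq) {
    assert(d_embed && dx && tokens);
    for (int t = 0; t < seq; t++)
        for (int d = 0; d < dim; d++)
            d_embed[(size_t)tokens[t] * dim + d] += dx[d * seq + t];
}
```

The data-dependent row index `tokens[t]` is the reason these operations are kept on the CPU.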
### Global Variables
| Name | Type | Description |
|------|------|-------------|
| `g_rms_tmp` | `float*` | Lazily allocated scratch buffer for RMSNorm (size SEQ) |
---
## ane_runtime.h
Generalized ANE wrapper with multi-input/output support. Used in bridge, tests, and newer training variants.
### Structs
#### `ANEKernel`
Generalized kernel handle supporting multiple inputs and outputs.
| Field | Type | Description |
|-------|------|-------------|
| `model` | `id` | `_ANEInMemoryModel` instance |
| `ioInputs` | `IOSurfaceRef*` | Array of input IOSurfaces |
| `ioOutputs` | `IOSurfaceRef*` | Array of output IOSurfaces |
| `request` | `id` | `_ANERequest` instance |
| `tmpDir` | `NSString*` | Temp directory for MIL/weights on disk |
| `nInputs`, `nOutputs` | `int` | Number of I/O tensors |
| `inputBytes`, `outputBytes` | `size_t*` | Byte sizes for each I/O tensor |
### Global Variables
| Name | Type | Description |
|------|------|-------------|
| `g_ANEDesc` | `Class` | `_ANEInMemoryModelDescriptor` |
| `g_ANEInMem` | `Class` | `_ANEInMemoryModel` |
| `g_ANEReq` | `Class` | `_ANERequest` |
| `g_ANEIO` | `Class` | `_ANEIOSurfaceObject` |
| `g_ane_loaded` | `bool` | Guard to avoid re-loading the framework |
### Functions
| Function | Returns | Description |
|----------|---------|-------------|
| `ane_init(void)` | `void` | Load AppleNeuralEngine.framework (idempotent), resolve 4 private ObjC classes |
| `ane_create_surface(size_t bytes)` | `IOSurfaceRef` | Create a 1D IOSurface of given byte size |
| `ane_compile(NSData *milText, NSData *weightData, int nInputs, size_t *inputSizes, int nOutputs, size_t *outputSizes)` | `ANEKernel*` | Full compile pipeline: build descriptor, compile MIL, load model, create IOSurfaces + request. Returns NULL on failure. |
| `ane_write_input(ANEKernel *k, int idx, const void *data, size_t bytes)` | `void` | Write raw bytes to the idx-th input IOSurface (lock/memcpy/unlock) |
| `ane_read_output(ANEKernel *k, int idx, void *data, size_t bytes)` | `void` | Read raw bytes from the idx-th output IOSurface (read-lock/memcpy/unlock) |
| `ane_run_kernel(ANEKernel *k)` | `bool` | Run the compiled ANE kernel. Returns true on success. |
| `ane_free(ANEKernel *k)` | `void` | Unload model, release all IOSurfaces, remove temp dir, free struct |
---
## ane_mil_gen.h
Composable MIL generation helpers for common patterns, plus weight blob builders.
### Functions
| Function | Returns | Description |
|----------|---------|-------------|
| `mil_build_weight_blob(const float *w, int out_ch, int in_ch)` | `NSData*` | Build fp16 weight blob with 128B header from fp32 row-major `[out_ch, in_ch]` weights |
| `mil_gen_matmul(int in_ch, int out_ch, int spatial)` | `NSString*` | Generate MIL for matmul `y = W @ x` with both as runtime inputs. Includes fp32-to-fp16-to-fp32 casts. |
| `mil_gen_conv(int in_ch, int out_ch, int spatial)` | `NSString*` | Generate MIL for conv-based linear with baked weights from blob file (inference-only) |
| `mil_gen_qkv(int dim, int spatial)` | `NSString*` | Generate MIL for fused QKV: 3 parallel convs from single input, weights from concatenated blob |
| `mil_build_qkv_weight_blob(const float *wq, const float *wk, const float *wv, int dim)` | `NSData*` | Build concatenated weight blob for fused QKV (3 chunks, each with 64B header + fp16 data) |
| `mil_build_ffn_up_weight_blob(const float *w1, const float *w3, int hidden_dim, int dim)` | `NSData*` | Build concatenated weight blob for fused FFN up-projection (W1 + W3 chunks) |
| `mil_gen_ffn_up(int dim, int hidden_dim, int spatial)` | `NSString*` | Generate MIL for fused FFN up: W1 + W3 parallel convs, outputs h1 and h3 |
---
## ane_rmsnorm_bwd.h
MIL generator for RMSNorm backward on ANE (used by `train_large_ane.m`).
**Depends on**: `stories_mil.h`
### Functions
| Function | Returns | Description |
|----------|---------|-------------|
| `gen_rmsnorm_bwd(void)` | `NSString*` | Generate MIL for RMSNorm backward. Input: `concat(dy, x)` as `[1, 2*DIM, 1, SEQ]`. Baked weight: RMSNorm scale `w[DIM]`. Output: `dx` as `[1, DIM, 1, SEQ]`. Note: `dw` (weight gradient) stays on CPU. |
---
## ane_classifier.h
MIL generators for classifier operations on ANE (used by `train_large_ane.m`).
**Depends on**: `stories_mil.h`
### Functions
| Function | Returns | Description |
|----------|---------|-------------|
| `gen_classifier_fwd(void)` | `NSString*` | Classifier forward: single 32000-output-channel conv. Input: `[1, DIM, 1, SEQ]`. Baked: embedding weights `[VOCAB, DIM, 1, 1]`. Output: `[1, VOCAB, 1, SEQ]`. |
| `gen_classifier_bwd(void)` | `NSString*` | Classifier backward: `dx = embed^T @ dlogits`. Uses `matmul` op (not conv, since ANE rejects conv with 32000 input channels). Input: `[1, VOCAB, 1, SEQ]`. Baked: `embed^T [1, DIM, VOCAB]`. Output: `[1, DIM, 1, SEQ]`. |
| `gen_softmax_vocab(void)` | `NSString*` | Softmax over VOCAB dimension: `softmax(x, axis=1)`. Input: `[1, VOCAB, 1, SEQ]`. Output: `[1, VOCAB, 1, SEQ]`. |
| `gen_final_rmsnorm(void)` | `NSString*` | Final RMSNorm (standalone, not fused). Input: `[1, DIM, 1, SEQ]`. Baked: `rms_final[DIM]`. Output: `[1, DIM, 1, SEQ]`. |
---
## bridge/ane_bridge.h
C-callable bridge to ANE private APIs for Python ctypes integration.
### Types
| Type | Description |
|------|-------------|
| `ANEKernelHandle` | Opaque kernel handle (pointer to internal struct) |
### Functions
| Function | Returns | Description |
|----------|---------|-------------|
| `ane_bridge_init(void)` | `int` | Initialize ANE runtime (load private framework, resolve classes). Returns 0 on success, -1 on failure. |
| `ane_bridge_compile(const char *mil_text, size_t mil_len, const uint8_t *weight_data, size_t weight_len, int n_inputs, const size_t *input_sizes, int n_outputs, const size_t *output_sizes)` | `ANEKernelHandle*` | Compile MIL text + single weight blob into ANE kernel. Returns NULL on failure. |
| `ane_bridge_compile_multi_weights(const char *mil_text, size_t mil_len, const char **weight_names, const uint8_t **weight_datas, const size_t *weight_lens, int n_weights, int n_inputs, const size_t *input_sizes, int n_outputs, const size_t *output_sizes)` | `ANEKernelHandle*` | Compile MIL text + multiple named weight files. Weight names use `@model_path/` prefix convention. |
| `ane_bridge_run(ANEKernelHandle *kernel)` | `bool` | Execute a compiled kernel on ANE. Returns true on success. |
| `ane_bridge_write_input(ANEKernelHandle *kernel, int idx, const void *data, size_t bytes)` | `void` | Write data to kernel input IOSurface at index `idx` |
| `ane_bridge_read_output(ANEKernelHandle *kernel, int idx, void *data, size_t bytes)` | `void` | Read data from kernel output IOSurface at index `idx` |
| `ane_bridge_free(ANEKernelHandle *kernel)` | `void` | Unload model, release all IOSurfaces, remove temp dir, free handle |
| `ane_bridge_get_compile_count(void)` | `int` | Get current compile count (for restart budgeting) |
| `ane_bridge_reset_compile_count(void)` | `void` | Reset compile count to zero |
| `ane_bridge_build_weight_blob(const float *src, int rows, int cols, size_t *out_len)` | `uint8_t*` | Build weight blob in ANE format (128B header + fp16). Caller must free via `ane_bridge_free_blob()`. |
| `ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols, size_t *out_len)` | `uint8_t*` | Build transposed weight blob. Caller must free via `ane_bridge_free_blob()`. |
| `ane_bridge_free_blob(void *ptr)` | `void` | Free a blob allocated by `ane_bridge_build_weight_blob*` |
---
## MIL Operation Reference
All MIL programs target `ios18` and use fp16 tensors in `[1, C, 1, S]` layout (or `[1, H, S, S]` for attention scores).
| Operation | MIL Syntax | Purpose |
|-----------|-----------|---------|
| `conv` | `conv(dilations=dl, groups=gr, pad=pd, pad_type=pt, strides=st, weight=W, x=xn)` | Linear projections (all Wq, Wk, Wv, Wo, W1, W2, W3). 1x1 conv = matmul. Weight shape: `[out_ch, in_ch, 1, 1]`. |
| `matmul` | `matmul(transpose_x=tx, transpose_y=ty, x=a, y=b)` | Attention score computation (Q @ K^T, scores @ V, classifier backward). |
| `softmax` | `softmax(axis=ax, x=ms)` | Attention weight normalization (`axis=-1`) and vocab softmax (`axis=1`). |
| `mul` | `mul(x=a, y=b)` | Element-wise multiply: RMSNorm scaling, SiLU gating, attention scaling, softmax Jacobian. |
| `add` | `add(x=a, y=b)` | Causal mask application, SiLU derivative `(1 + h*(1-sig))`, gradient accumulation. |
| `sub` | `sub(x=a, y=b)` | SiLU derivative: `1 - sigmoid(h1)`; softmax backward: `dP - sum(P*dP)`. |
| `sigmoid` | `sigmoid(x=h1)` | SiLU activation component (SiLU = x * sigmoid(x)). |
| `pow` | `pow(x=ss3, y=nhalf)` | RMSNorm: `x^(-0.5)` = reciprocal sqrt. |
| `reduce_sum` | `reduce_sum(x=sq, axes=rax, keep_dims=kd)` | RMSNorm: sum of squares along channel dim. Softmax backward: row-wise dot product. |
| `reshape` | `reshape(shape=sh, x=xf)` | `[1,DIM,1,SEQ]` to `[1,HEADS,HD,SEQ]` for multi-head attention. Flatten attention scores. |
| `transpose` | `transpose(perm=pm, x=q4)` | Permute `[0,1,3,2]`: swap spatial and head_dim for matmul compatibility. |
| `concat` | `concat(axis=cax, interleave=cid, values=(a,b,c))` | Pack multiple outputs into single IOSurface ("taps"). Always `axis=1`, `interleave=false`. |
| `slice_by_size` | `slice_by_size(x=x, begin=b, size=sz)` | Split concatenated inputs in backward kernels. `begin=[0,offset,0,0]`, `size=[1,channels,1,SEQ]`. |
| `cast` | `cast(dtype=to_fp16, x=x)` | fp32-to-fp16 or fp16-to-fp32 precision conversion (used in ane_mil_gen.h generators). |
| `const` | `const()[name=..., val=...]` | Declare scalar/tensor constants, conv parameters, weight blob references via `BLOBFILE`. |
---
## Weight Blob Format
### Single-weight blob (128 bytes header + data)
```
Offset   Size  Content
------   ----  -------
0        1     0x01 (format marker)
4        1     0x02 (format marker)
5-63     59    zeros (global header padding)
64       4     0xDEADBEEF (chunk magic, little-endian: EF BE AD DE)
68       1     0x01 (chunk marker)
72       4     uint32 data_size (total fp16 bytes = out_ch * in_ch * 2)
80       4     uint32 data_offset (always 128 = 64 global + 64 chunk)
84-127   44    zeros (chunk header padding)
128+     N     fp16 weight data, row-major [out_ch, in_ch]
```
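The header layout above can be written out byte by byte. The sketch below fills only the 128-byte header and assumes the `uint32` size fields are stored little-endian, like the chunk magic; `write_blob_header` and `put_u32_le` are hypothetical helpers, not functions from the headers documented here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { BLOB_HDR_BYTES = 128 };

/* Store a 32-bit value in little-endian byte order. */
static void put_u32_le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Fill the header for a single-weight blob holding n_elements fp16 values.
 * The fp16 data itself would follow at offset 128. */
void write_blob_header(uint8_t hdr[BLOB_HDR_BYTES], uint32_t n_elements) {
    assert(hdr != 0);
    memset(hdr, 0, BLOB_HDR_BYTES);         /* all padding bytes are zero */
    hdr[0] = 0x01;                          /* format marker */
    hdr[4] = 0x02;                          /* format marker */
    put_u32_le(hdr + 64, 0xDEADBEEFu);      /* chunk magic: EF BE AD DE */
    hdr[68] = 0x01;                         /* chunk marker */
    put_u32_le(hdr + 72, n_elements * 2u);  /* data_size in bytes */
    put_u32_le(hdr + 80, BLOB_HDR_BYTES);   /* data_offset: always 128 */
}
```

For a `DIM x DIM` projection, `n_elements` is 589824 and `data_size` comes out as 1179648 bytes.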
### Multi-weight blob (fused QKV, FFN up)
```
Offset     Content
------     -------
0-63       Global header (same as above)
64         Chunk 0 header (64 bytes): magic, data_size, data_offset
64+64      Chunk 0 data (fp16 weights)
64+cs      Chunk 1 header (64 bytes)
64+cs+64   Chunk 1 data (fp16 weights)
...
```
Where `cs = 64 + n_elements * 2` (chunk header size + data size).
MIL references use `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(X))` where X is the chunk header offset within the file (64 for first chunk, 64+cs for second, etc.).
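The `BLOBFILE` offsets follow directly from `cs`. Here is a small sketch of that arithmetic, assuming every chunk in the blob has the same element count (as in fused QKV, where all three matrices are `DIM x DIM`); the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

enum { GLOBAL_HDR_BYTES = 64, CHUNK_HDR_BYTES = 64 };

/* Byte offset of chunk `idx`'s header in a blob of equal-sized chunks.
 * This is the value a BLOBFILE offset would carry for that chunk. */
uint64_t chunk_header_offset(int idx, uint64_t n_elements) {
    assert(idx >= 0);
    uint64_t cs = CHUNK_HDR_BYTES + n_elements * 2;  /* header + fp16 data */
    return GLOBAL_HDR_BYTES + (uint64_t)idx * cs;
}

/* Byte offset where chunk `idx`'s fp16 data begins. */
uint64_t chunk_data_offset(int idx, uint64_t n_elements) {
    return chunk_header_offset(idx, n_elements) + CHUNK_HDR_BYTES;
}
```

For fused QKV with `DIM = 768`, `cs` is 1179712, so the three chunk headers sit at offsets 64, 1179776, and 2359488. A blob with chunks of different sizes (none is described here) would need a running sum instead of a multiplication.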

370
docs/ARCHITECTURE.md Normal file

@@ -0,0 +1,370 @@
# ANE Training -- System Architecture
Training neural networks directly on Apple's Neural Engine via reverse-engineered private APIs (`_ANEClient`, `_ANECompiler`). No CoreML training APIs, no Metal, no GPU.
## Project Structure
```
ANE/
+-- api_exploration.m # ANE private API discovery
+-- inmem_basic.m # In-memory MIL compilation proof-of-concept
+-- inmem_bench.m # ANE dispatch latency across model sizes
+-- inmem_peak.m # Peak TFLOPS via deep conv chains (self-contained)
+-- sram_bench.m # SRAM capacity probing (performance cliff detection)
+-- sram_probe.m # Fine-grained SRAM size exploration
+-- bridge/
| +-- ane_bridge.h # C-callable API for Python ctypes
| +-- ane_bridge.m # Bridge implementation
| +-- Makefile # Builds libane_bridge.dylib
| +-- libane_bridge.dylib # Pre-built shared library
+-- training/
| +-- train_large.m # Main: 12-layer training (CPU classifier)
| +-- train_large_ane.m # Variant: classifier + softmax on ANE
| +-- stories_config.h # Model constants, structs, alloc helpers
| +-- stories_io.h # IOSurface I/O, NEON fp16, compile/run
| +-- stories_mil.h # MIL generators for 6 fused ANE kernels
| +-- stories_cpu_ops.h # vDSP RMSNorm, cross-entropy, Adam, embedding
| +-- ane_runtime.h # Generalized ANE wrapper (multi-I/O)
| +-- ane_mil_gen.h # Composable MIL helpers (conv, matmul, fused QKV)
| +-- ane_rmsnorm_bwd.h # RMSNorm backward MIL (train_large_ane only)
| +-- ane_classifier.h # Classifier/softmax MIL (train_large_ane only)
| +-- forward.h # Gen1 forward pass (per-linear-kernel, all-CPU)
| +-- backward.h # Gen1 backward pass (all-CPU reference)
| +-- model.h # Gen1 Model struct, per-kernel compile
| +-- dashboard.py # TUI monitoring (loss, power, text generation)
| +-- tokenize.py # Extract pretokenized TinyStories data
| +-- download_data.sh # Download TinyStories from HuggingFace
| +-- Makefile # Build targets for training + tests
| +-- test_*.m # 12 unit test files
+-- docs/ # This documentation
+-- scripts/ # Automation scripts
```
## Two Generations of Training Code
### Gen1: `model.h` + `forward.h` + `backward.h`
The original correctness reference. One ANE kernel per linear projection (7 per layer + 1 classifier = 85 kernels total). Forward and backward are sequential all-CPU operations with optional ANE for the matmuls. No kernel fusion, no async overlap. Used for verifying Gen2's fused kernels produce correct results.
### Gen2: `train_large.m` + `stories_*.h` (production)
The performance-optimized system. Uses **6 fused ANE kernels per layer**, 5 of them weight-bearing, each performing multiple operations in a single dispatch. Weight gradients (`dW`) run asynchronously on CPU via GCD to overlap with ANE. All data is channel-first `[C, S]` fp16 on IOSurfaces.
The rest of this document describes Gen2.
---
## Model Configuration
Stories110M -- a Llama2-architecture transformer:
| Parameter | Value | Macro |
|-----------|-------|-------|
| Hidden dimension | 768 | `DIM` |
| FFN intermediate | 2048 | `HIDDEN` |
| Attention heads | 12 | `HEADS` |
| Head dimension | 64 | `HD` |
| Sequence length | 256 | `SEQ` |
| Layers | 12 | `NLAYERS` |
| Vocabulary | 32000 | `VOCAB` |
| Total parameters | 109.53M | `TOTAL_PARAMS` |
| Accumulation steps | 10 | `ACCUM_STEPS` |
| Max ANE compiles | 100 | `MAX_COMPILES` |
---
## ANE Kernel Fusion Map
Each training step dispatches 6 kernel types per layer. Five are weight-bearing (recompiled each batch); the sixth, `sdpaBwd2`, has no baked weights and is compiled once.
| Kernel | Generator | Fused Operations | Baked Weights | Input Shape | Output Shape |
|--------|-----------|-----------------|---------------|-------------|--------------|
| `fwdAttn` | `gen_sdpa_fwd_taps()` | RMSNorm1, Wq/Wk/Wv conv, reshape, transpose, Q @ K^T matmul, scale, causal mask, softmax, scores @ V matmul, Wo conv | rms_att, Wq, Wk, Wv, Wo, mask | `[1,DIM,1,SEQ]` | `[1,6*DIM,1,SEQ]` |
| `fwdFFN` | `gen_ffn_fwd_taps()` | RMSNorm2, W1/W3 conv, sigmoid, SiLU gating, W2 conv | rms_ffn, W1, W3, W2 | `[1,DIM,1,SEQ]` | `[1,2D+3H,1,SEQ]` |
| `ffnBwd` | `gen_ffn_bwd()` | W2^T conv, SiLU derivative, W1^T/W3^T conv, add | W2^T, W1^T, W3^T | `[1,D+2H,1,SEQ]` | `[1,D+2H,1,SEQ]` |
| `sdpaBwd1` | `gen_sdpa_bwd1()` | Wo^T conv, reshape, Q @ K^T recompute, softmax, dV matmul, dP matmul | Wo^T, mask | `[1,4*DIM,1,SEQ]` | `[1,D+2*SC,1,SEQ]` |
| `sdpaBwd2` | `gen_sdpa_bwd2()` | softmax Jacobian, scale, dQ=dS @ K matmul, dK=dS^T @ Q matmul | _(none)_ | `[1,2SC+2D,1,SEQ]` | `[1,2*DIM,1,SEQ]` |
| `qkvBwd` | `gen_qkvb()` | Wq^T/Wk^T/Wv^T conv, sum | Wq^T, Wk^T, Wv^T | `[1,3*DIM,1,SEQ]` | `[1,DIM,1,SEQ]` |
Where D=DIM=768, H=HIDDEN=2048, SC=SCORE_CH=HEADS*SEQ=3072.
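Because the forward kernels concatenate their saved intermediates along the channel axis, each one sits at a fixed channel offset in the output. The sketch below derives those offsets from the concat order given for `gen_sdpa_fwd_taps` and `gen_ffn_fwd_taps`; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { DIM = 768, HIDDEN = 2048, SEQ = 256 };

/* Order follows concat(o_out, Q, K, V, attn_out, xnorm). */
typedef enum { TAP_O_OUT, TAP_Q, TAP_K, TAP_V, TAP_ATTN_OUT, TAP_XNORM } AttnTap;

/* Order follows concat(ffn_out, h1, h3, silu_out, x2norm). */
typedef enum { TAP_FFN_OUT, TAP_H1, TAP_H3, TAP_SILU_OUT, TAP_X2NORM } FfnTap;

/* Starting channel of an fwdAttn tap; every tap is DIM channels wide. */
int attn_tap_channel(AttnTap tap) {
    return (int)tap * DIM;
}

/* Starting channel of an fwdFFN tap: ffn_out is DIM channels wide and is
 * followed by three HIDDEN-wide taps, then the DIM-wide x2norm. */
int ffn_tap_channel(FfnTap tap) {
    assert((int)tap >= 0 && (int)tap <= (int)TAP_X2NORM);
    return tap == TAP_FFN_OUT ? 0 : DIM + ((int)tap - 1) * HIDDEN;
}

/* Element offset of a channel within a channel-first [C, SEQ] buffer. */
size_t channel_element_offset(int channel) {
    return (size_t)channel * SEQ;
}
```

These channel values are the kind of argument one would pass as `ch_off` to `io_read_fp16` or as `src_ch` to `io_copy`. The last fwdFFN tap starts at channel 6912 and is 768 wide, which gives the 7680 (`2D+3H`) output channels listed in the table.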
"Taps" in forward kernels: intermediate values (Q, K, V, attention output, norms) are concatenated onto the output via `concat(axis=1)` so backward kernels can read them without CPU recomputation.
---
## CPU vs ANE Operation Split
| Operation | Location | Reason |
|-----------|----------|--------|
| Embedding lookup/backward | CPU | Scatter/gather by token index |
| RMSNorm forward | ANE | Fused into fwdAttn/fwdFFN kernels |
| QKV projections | ANE | 1x1 conv = matmul |
| Multi-head attention (SDPA) | ANE | Decomposed Q @ K^T + mask + softmax + scores @ V |
| FFN (SwiGLU) | ANE | W1,W3 conv + sigmoid + gate + W2 conv |
| Residual connections | CPU | Simple `vDSP_vadd` |
| Final RMSNorm | CPU (or ANE in `_ane` variant) | Standalone, not fused with other ops |
| Classifier matmul | CPU cblas (or ANE in `_ane` variant) | `[VOCAB,DIM] x [DIM,SEQ]` |
| Cross-entropy + softmax | CPU (partially ANE in `_ane`) | Target indexing requires CPU |
| dW weight gradients | CPU (async cblas) | Outer products, independent of backward data flow |
| RMSNorm backward | CPU (or ANE in `_ane` variant) | vDSP vectorized |
| Adam optimizer | CPU | In-place weight mutation |
---
## Training Step Swim-Lane Diagram
One complete training step showing CPU, ANE, and async GCD operations interleaved:
```mermaid
sequenceDiagram
participant CPU
participant ANE
participant GCD as GCD Async Queue
Note over CPU: FORWARD PASS (per layer L=0..11)
CPU->>CPU: embed_lookup(tokens to x_cur)
loop Layer L = 0..11
CPU->>CPU: wait for prior async dW
CPU->>CPU: save layer_in, write fp16 to IOSurface
CPU->>ANE: run fwdAttn kernel
ANE-->>CPU: concat(o_out, Q, K, V, attn_out, xnorm)
CPU->>CPU: read fp16 taps, residual add to x2
CPU->>CPU: write fp16 x2 to IOSurface
CPU->>ANE: run fwdFFN kernel
ANE-->>CPU: concat(ffn_out, h1, h3, silu_out, x2norm)
CPU->>CPU: read fp16 taps, residual add to x_cur
end
Note over CPU: CLASSIFIER + LOSS
CPU->>CPU: rmsnorm(x_cur to x_final)
CPU->>CPU: cblas_sgemm(embed x x_final to logits)
CPU->>CPU: cross_entropy_loss(logits to loss, dlogits)
Note over CPU: BACKWARD PASS
CPU->>CPU: cblas_sgemm(embed^T x dlogits to dy)
CPU->>GCD: async dEmbed += dlogits x x_final^T
CPU->>CPU: rmsnorm_bwd(dy to dx)
loop Layer L = 11..0
Note over CPU,GCD: FFN Backward
CPU->>CPU: write dffn + copy h1,h3 from fwd taps
CPU->>ANE: run ffnBwd kernel
ANE-->>CPU: concat(dx_ffn, dh1, dh3)
CPU->>GCD: async dW2, dW1, dW3 accumulation
Note over CPU,GCD: RMSNorm2 Backward + Residual
CPU->>CPU: rmsnorm_bwd, add residual gradient
Note over CPU,GCD: SDPA Backward
CPU->>GCD: async dWo accumulation
CPU->>CPU: copy Q,K,V from fwd taps, write dx2
CPU->>ANE: run sdpaBwd1 kernel
ANE-->>CPU: concat(dV, probs, dP)
CPU->>CPU: copy probs,dP,Q,K
CPU->>ANE: run sdpaBwd2 kernel
ANE-->>CPU: concat(dQ, dK)
CPU->>GCD: async dWq, dWk, dWv accumulation
Note over CPU,GCD: QKV Backward
CPU->>CPU: copy dQ,dK,dV
CPU->>ANE: run qkvBwd kernel
ANE-->>CPU: dx_attn
Note over CPU,GCD: RMSNorm1 Backward + Residual
CPU->>CPU: rmsnorm_bwd, add both skip gradients
end
CPU->>CPU: dispatch_group_wait(all async dW)
CPU->>CPU: embed_backward(dy to d_embed)
```
---
## Async CPU/ANE Overlap Strategy
The key insight is that **weight gradients (`dW`) are independent of the backward data flow**. Each one is an outer-product accumulation, `dW += dy @ x^T`, that only writes into gradient buffers, so no later backward step waits on its result. The data-path gradients (`dx`) flow backward through the network on ANE while the `dW` work runs on CPU.
```
Timeline for one backward layer:
ANE: [ffnBwd] [sdpaBwd1] [sdpaBwd2] [qkvBwd]
CPU: [dW_FFN (3x sgemm)] [dWo] [dWqkv (3x sgemm)]
```
A serial GCD dispatch queue, `"dw_cblas"`, ensures that dW operations never overlap each other (they share scratch buffers). The `dispatch_group_wait` at the start of each forward layer ensures that async dW work from the previous step's backward pass has finished before IOSurfaces are reused.
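For reference, the accumulation that each async dW task performs can be written as a plain triple loop. This is a scalar sketch of the math only, assuming channel-first `[C, S]` inputs; the actual code uses cblas sgemm on the GCD queue, and `accum_dw_ref` is a hypothetical name.

```c
#include <assert.h>

/* Accumulate dW[o, i] += sum over s of dy[o, s] * x[i, s].
 * dy is channel-first [out_ch, sp], x is channel-first [in_ch, sp],
 * and dW is row-major [out_ch, in_ch]. */
void accum_dw_ref(float *dW, const float *dy, const float *x,
                  int out_ch, int in_ch, int sp) {
    assert(dW && dy && x);
    for (int o = 0; o < out_ch; o++) {
        for (int i = 0; i < in_ch; i++) {
            float acc = 0.0f;
            for (int s = 0; s < sp; s++)
                acc += dy[o * sp + s] * x[i * sp + s];
            dW[o * in_ch + i] += acc;  /* accumulate, never overwrite */
        }
    }
}
```

The function reads `dy` and `x` and writes only `dW`, which is why it can run off the critical path: the next backward kernel needs `dx`, not `dW`.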
---
## Compile/Restart Lifecycle
The ANE runtime leaks resources internally, limiting compiles to ~119 per process. The system manages this with checkpoint-and-restart:
```mermaid
flowchart TD
Start["Process starts (fresh or --resume)"] --> LoadCkpt{"--resume flag?"}
LoadCkpt -->|Yes| Resume["Load checkpoint: weights, Adam state, step counter"]
LoadCkpt -->|No| Init["Xavier init weights, zero Adam state"]
Resume --> CompileCheck
Init --> CompileCheck
CompileCheck{"g_compile_count + 60 > MAX_COMPILES?"} -->|Yes| SaveCheckpoint["Save checkpoint to ane_stories110M_ckpt.bin"]
SaveCheckpoint --> FreeAll["Free all ANE kernels"]
FreeAll --> RestartProcess["Re-launch process with --resume flag"]
RestartProcess --> Start
CompileCheck -->|No| Compile["Compile 60 weight-bearing kernels (5 per layer x 12)"]
Compile --> ZeroGrads["Zero gradient accumulators"]
ZeroGrads --> AccumLoop
subgraph AccumLoop ["Gradient Accumulation (10 steps)"]
SingleStep["Forward + Backward + async dW"] --> MoreSteps{"More accum steps?"}
MoreSteps -->|Yes| SingleStep
end
MoreSteps -->|No| WaitDW["dispatch_group_wait (all async dW)"]
WaitDW --> ScaleGrad["Scale gradients by 1/ACCUM_STEPS"]
ScaleGrad --> AdamUpdate["Adam update (mutates weights in-place)"]
AdamUpdate --> FreeKernels["Free all weight-bearing kernels"]
FreeKernels --> CompileCheck
```
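With `MAX_COMPILES=100` and 60 weight-bearing kernels per batch, only **1 batch** (10 accumulation steps) fits per process lifetime: after the first batch the counter is already at 60 or more, so the next check (`g_compile_count + 60 > MAX_COMPILES`) triggers a restart. The checkpoint preserves: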
- Training step and total_steps
- All weights and Adam (m, v) state per layer
- Cumulative timing statistics
- Adam timestep counter
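The budget check in the flowchart can be sketched in C as follows. This is a simplified model, assuming that only the weight-bearing kernels of a batch count toward the budget; the helper names `needs_restart` and `batches_per_process` are hypothetical and not taken from the source.

```c
#include <stdbool.h>

/* Mirrors the flowchart condition: restart when compiling the next batch
 * would push the compile counter past the budget. */
static bool needs_restart(int compile_count, int kernels_per_batch, int max_compiles) {
    return compile_count + kernels_per_batch > max_compiles;
}

/* Counts how many batches fit into one process before a restart is due. */
static int batches_per_process(int kernels_per_batch, int max_compiles) {
    if (kernels_per_batch <= 0)
        return 0;                       /* guard against an endless loop */
    int count = 0, batches = 0;
    while (!needs_restart(count, kernels_per_batch, max_compiles)) {
        count += kernels_per_batch;     /* one batch worth of compiles */
        batches++;
    }
    return batches;
}
```

Under these assumptions, `batches_per_process(60, 100)` yields one batch per process, and the 86-kernel variant also yields one.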
---
## Data Flow Through One Layer
Tensor shapes as they flow through forward and backward passes:
```mermaid
flowchart LR
subgraph fwdAttnKernel ["fwdAttn Kernel (ANE)"]
xIn["x_in\n[1,768,1,256]"] --> RMS1["RMSNorm1"]
RMS1 --> QKVConv["Wq,Wk,Wv conv\n[768,768,1,1]"]
QKVConv --> ReshapeHeads["reshape\n[1,12,64,256]"]
ReshapeHeads --> TransposeHeads["transpose\n[1,12,256,64]"]
TransposeHeads --> QKT["Q x K^T\n[1,12,256,256]"]
QKT --> ScaleMask["scale + mask\n+ softmax"]
ScaleMask --> AV["scores x V\n[1,12,256,64]"]
AV --> ReshapeBackFlat["reshape\n[1,768,1,256]"]
ReshapeBackFlat --> WoConv["Wo conv\n[768,768,1,1]"]
end
subgraph taps1 ["Taps via concat"]
WoConv --> T1["o_out [768]"]
QKVConv --> T2["Q,K,V [768 each]"]
AV --> T3["attn_out [768]"]
RMS1 --> T4["xnorm [768]"]
end
subgraph cpuResid1 ["CPU"]
T1 --> ResAdd1["x + o_out = x2"]
end
subgraph fwdFFNKernel ["fwdFFN Kernel (ANE)"]
ResAdd1 --> RMS2["RMSNorm2"]
RMS2 --> W1W3["W1,W3 conv\n[2048,768,1,1]"]
W1W3 --> SiLUGate["sigmoid + SiLU\n+ gating"]
SiLUGate --> W2Conv["W2 conv\n[768,2048,1,1]"]
end
subgraph taps2 ["Taps via concat"]
W2Conv --> T5["ffn_out [768]"]
W1W3 --> T6["h1,h3 [2048 each]"]
SiLUGate --> T7["silu_out [2048]"]
RMS2 --> T8["x2norm [768]"]
end
subgraph cpuResid2 ["CPU"]
T5 --> ResAdd2["x2 + ffn_out = x_next"]
end
```
---
## IOSurface Memory Layout
All tensors use channel-first `[1, C, 1, S]` fp16 layout on IOSurfaces, matching ANE's native format:
```
IOSurface memory (contiguous fp16):
channel_0: [pos_0, pos_1, ..., pos_255] (256 values)
channel_1: [pos_0, pos_1, ..., pos_255]
...
channel_767: [pos_0, pos_1, ..., pos_255]
```
Fused kernel outputs use `concat(axis=1)` to pack multiple tensors into a single IOSurface:
```
fwdAttn output [1, 6*768, 1, 256]:
channels 0-767: o_out (Wo projection output)
channels 768-1535: Q (query projection)
channels 1536-2303: K (key projection)
channels 2304-3071: V (value projection)
channels 3072-3839: attn_out (pre-Wo attention output)
channels 3840-4607: xnorm (RMSNorm1 output)
```
CPU reads specific taps via `io_read_fp16(surface, data, ch_offset, n_channels, spatial)`.
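The offset arithmetic behind such a tap read can be sketched in C. The sketch assumes a tightly packed buffer without row padding, and the names `cf_index`, `tap_channel_offset`, and the `fwd_attn_tap` enum are hypothetical; only the tap order and the `DIM`/`SEQ` values come from the layout shown above.

```c
#include <stddef.h>

enum { DIM = 768, SEQ = 256 };

/* Tap order of the fwdAttn concat output, as listed in the layout above. */
enum fwd_attn_tap { TAP_O_OUT, TAP_Q, TAP_K, TAP_V, TAP_ATTN_OUT, TAP_XNORM };

/* Element index of (channel, position) in a channel-first [1, C, 1, S] buffer. */
static size_t cf_index(size_t channel, size_t pos, size_t spatial) {
    return channel * spatial + pos;
}

/* First channel of a tap inside the fused output; each tap spans DIM channels. */
static size_t tap_channel_offset(enum fwd_attn_tap tap) {
    return (size_t)tap * DIM;
}
```

For example, the K tap starts at channel `tap_channel_offset(TAP_K)`, and its first element sits at `cf_index(tap_channel_offset(TAP_K), 0, SEQ)` in units of fp16 values.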
---
## Weight Blob Format
ANE weight blobs follow a binary format with a 128-byte header:
```
Offset Size Content
------ ----- -------
0 1 0x01 (format marker)
4 1 0x02 (format marker)
5-63 59 zeros (padding)
64 4 0xDEADBEEF (chunk magic, little-endian)
68 1 0x01 (chunk marker)
72 4 uint32 data_size (fp16 weight bytes)
80 4 uint32 data_offset (always 128)
84-127 44 zeros (padding)
128+ N fp16 weight data, row-major [out_ch, in_ch]
```
Multi-weight blobs (fused QKV, FFN up) concatenate chunks: `[64B global header] [64B chunk0 header] [chunk0 data] [64B chunk1 header] [chunk1 data] ...`
MIL programs reference weights via `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(64))` where offset 64 points to the chunk header within the file.
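As an illustration of the single-weight layout in the offset table, the following C sketch fills the 128-byte header. It assumes that the two `uint32` fields are stored little-endian like the chunk magic; the function names `put_u32_le` and `blob_write_header` are hypothetical and are not the project's `build_blob` implementation.

```c
#include <stdint.h>
#include <string.h>

enum { BLOB_HEADER_BYTES = 128, CHUNK_HEADER_OFFSET = 64 };

/* Store a 32-bit value in little-endian byte order on any host. */
static void put_u32_le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Fill the header of a single-weight blob according to the offset table.
 * data_size is the number of fp16 payload bytes that follow the header. */
static void blob_write_header(uint8_t hdr[BLOB_HEADER_BYTES], uint32_t data_size) {
    memset(hdr, 0, BLOB_HEADER_BYTES);           /* all padding is zero */
    hdr[0] = 0x01;                               /* format marker, offset 0 */
    hdr[4] = 0x02;                               /* format marker, offset 4 */
    uint8_t *chunk = hdr + CHUNK_HEADER_OFFSET;
    put_u32_le(chunk + 0, 0xDEADBEEFu);          /* chunk magic, offset 64 */
    chunk[4] = 0x01;                             /* chunk marker, offset 68 */
    put_u32_le(chunk + 8, data_size);            /* data_size, offset 72 */
    put_u32_le(chunk + 16, BLOB_HEADER_BYTES);   /* data_offset, offset 80 */
}
```

For a 768 x 768 fp16 matrix, `data_size` would be `768 * 768 * 2` bytes, and the weight data would start directly after the header at byte 128.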
---
## Key Constraints
| Constraint | Impact | Workaround |
|-----------|--------|------------|
| ~119 compile limit per process | ANE compiler leaks resources | `checkpoint + re-launch with --resume` |
| Weights baked at compile time | Cannot hot-swap weights; must recompile | Gradient accumulation amortizes compile cost |
| SDPA ignores `attn_mask` | Causal attention cannot use native SDPA mask | Decompose into Q @ K^T + explicit mask + softmax + scores @ V |
| ANE SRAM capacity ~32 MB | Large weight matrices spill to DRAM | Keep channel counts at or below ~3072 to stay under the performance cliff |
| 32000 input channels rejected | ANE refuses conv with VOCAB input channels | Classifier backward uses `matmul` op with reshape instead of conv |
| fp16 compute only | Precision limited on ANE | fp32 on CPU for loss, Adam; fp16 for ANE forward/backward |
---
## `train_large.m` vs `train_large_ane.m`
`train_large_ane.m` moves additional operations from CPU to ANE:
| Operation | `train_large.m` | `train_large_ane.m` |
|-----------|-----------------|---------------------|
| Final RMSNorm | CPU (`rmsnorm()` via vDSP) | ANE (`gen_final_rmsnorm()`) |
| Classifier forward | CPU (`cblas_sgemm`) | ANE (`gen_classifier_fwd()`, 32000-ch conv) |
| Softmax | CPU (inside `cross_entropy_loss()`) | ANE (`gen_softmax_vocab()`) |
| Per-layer RMSNorm backward | CPU (`rmsnorm_bwd()` via vDSP) | ANE (`gen_rmsnorm_bwd()`) |
This increases compile budget pressure: 86 weight-bearing kernels per batch (vs 60), leaving less headroom within MAX_COMPILES=100.

253
docs/BENCHMARKS.md Normal file

@ -0,0 +1,253 @@
# ANE Training -- Benchmarks and Tests Guide
All benchmarks and tests require **macOS 15+ on Apple Silicon** (tested on M4, M5).
---
## Quick Start
```bash
# Build and run training benchmark (100 steps)
cd training
make train_large && ./train_large --steps 100
# Run the automated benchmark suite
cd ..
bash scripts/run_benchmarks.sh
```
---
## Training Benchmarks
### train_large (CPU classifier)
The main 12-layer Stories110M training loop with classifier on CPU.
| Item | Details |
|------|---------|
| **Purpose** | Full transformer training benchmark |
| **Measures** | ms/step, ANE TFLOPS, ANE utilization %, per-component timing |
| **Prerequisites** | Training data: `bash download_data.sh` (or runs on random data if absent) |
| **Build** | `cd training && make train_large` |
| **Run** | `./train_large --steps 100` |
| **CLI flags** | `--steps N` (default 10000), `--lr F` (default 3e-4), `--resume` |
**Expected output:**
```
ane=9.6 io=4.1 cls=9.1 elem=14.4 rms=0.1 cblas_wait=2.3 ms/step
=== Efficiency Report ===
Total steps: 100
Avg train: 107.0 ms/step
ANE TFLOPS: 2.45 sustained
ANE utilization: 15.5% of 15.8 TFLOPS
```
### train_large_ane (ANE classifier)
Same training with classifier, softmax, and RMSNorm backward offloaded to ANE.
| Item | Details |
|------|---------|
| **Purpose** | Measure ANE-offloaded training (16% faster) |
| **Build** | `cd training && make train_large_ane` |
| **Run** | `./train_large_ane --steps 100` |
**Compare baseline vs ANE-offloaded:**
```bash
make train_large && ./train_large --steps 100
make train_large_ane && ./train_large_ane --steps 100
```
### Dashboard (live monitoring)
```bash
pip install blessed psutil numpy
sudo python3 dashboard.py # live mode (needs powermetrics)
sudo python3 dashboard.py --resume # attach to resumed training
```
| Flag | Description |
|------|-------------|
| `--resume` | Resume from checkpoint |
| `--infinite` | Train indefinitely |
| `--no-powermetrics` | Disable power monitoring |
| `--no-generate` | Disable text generation preview |
| `--steps N` | Total steps (default 10000) |
---
## Root-Level Benchmark Scripts
All root-level scripts are standalone Objective-C programs. Common build pattern:
```bash
xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML \
-framework IOSurface -ldl -o <output> <source>.m
```
### inmem_peak.m -- Peak TFLOPS (self-contained)
**No prerequisites.** Generates MIL and weight blobs programmatically.
| Item | Details |
|------|---------|
| **Purpose** | Maximum sustained TFLOPS via deep conv chains (32-256 layers deep) |
| **Measures** | ms per run, TFLOPS, % peak across 10 configurations |
| **Prerequisites** | None (self-contained MIL generation) |
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_peak inmem_peak.m` |
| **Run** | `./inmem_peak` |
**Expected output:**
```
=== Programmatic MIL to In-Memory ANE Peak ===
Config W(MB) GFLOP ms/run TFLOPS %peak
----------------------------------------------------------------------
32x conv 512ch sp64 16.0 1.07 X.XXX ms Y.YY Z.Z%
64x conv 512ch sp64 32.0 2.15 X.XXX ms Y.YY Z.Z%
...
```
### inmem_basic.m -- In-Memory Proof-of-Concept
| Item | Details |
|------|---------|
| **Purpose** | End-to-end test: compile, load, run, benchmark using `_ANEInMemoryModel` |
| **Prerequisites** | Pre-built mlpackage at `/tmp/ane_sram_256ch_64sp.mlpackage` |
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_basic inmem_basic.m` |
| **Run** | `./inmem_basic` |
### inmem_bench.m -- Dispatch Latency
| Item | Details |
|------|---------|
| **Purpose** | ANE dispatch latency across 6 model sizes (256-4096 channels) |
| **Measures** | ms per run, TFLOPS at each configuration |
| **Prerequisites** | Pre-built mlpackages for all 6 configs |
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_bench inmem_bench.m` |
| **Run** | `./inmem_bench` |
### sram_bench.m -- SRAM Capacity Probe
| Item | Details |
|------|---------|
| **Purpose** | Find SRAM capacity by detecting performance cliff at increasing weight sizes |
| **Measures** | ms per run, TFLOPS, weight/activation/total memory at 9 configurations |
| **Prerequisites** | Pre-built mlpackages for 9 configs (256-8192 channels) |
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o sram_bench sram_bench.m` |
| **Run** | `./sram_bench` |
### sram_probe.m -- Fine-Grained SRAM Exploration
| Item | Details |
|------|---------|
| **Purpose** | Finer-grained SRAM probe with 13 data points and GFLOPS/MB efficiency |
| **Measures** | ms per run, TFLOPS, GFLOPS/MB with spilling indicators |
| **Prerequisites** | Pre-built mlpackages for 13 configs |
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o sram_probe sram_probe.m` |
| **Run** | `./sram_probe` |
### api_exploration.m -- API Discovery
| Item | Details |
|------|---------|
| **Purpose** | Explore ANE private API surface (class methods, file structures, internal objects) |
| **Prerequisites** | Pre-built mlpackage at `/tmp/ane_sram_1024ch_64sp.mlpackage` |
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o api_exploration api_exploration.m` |
| **Run** | `./api_exploration` |
---
## Test Files
### Tests with Makefile targets (cd training/)
| Test | Build | What It Tests |
|------|-------|---------------|
| `test_rmsnorm_bwd` | `make test_rmsnorm_bwd` | RMSNorm backward on ANE vs CPU reference. PASS: max diff < 0.05, mean < 0.01. Benchmarks 100 runs. |
| `test_classifier` | `make test_classifier` | 4-part: final RMSNorm, classifier forward (32000-ch conv), softmax over VOCAB, classifier backward. |
| `test_weight_reload` | `make test_weight_reload` | Tests if weights can be hot-swapped by overwriting blob files + unload/reload. Key finding: NO, weights are baked. |
| `test_perf_stats` | `make test_perf_stats` | Probes `_ANEPerformanceStats` class methods, properties, and instantiation. Tests perfStats in `_ANERequest`. |
| `test_qos_sweep` | `make test_qos_sweep` | QoS parameter sweep (0-63) across compile, load, run. Finding: no measurable latency difference. |
| `test_ane_advanced` | `make test_ane_advanced` | Probes SharedEvents, weightsBuffer IOSurface, procedureIndex, ChainingRequest. Enumerates all 67 ANE classes. |
Build all probe tests at once: `make probes`
### Tests without Makefile targets (manual build)
| Test | Build Command | What It Tests |
|------|---------------|---------------|
| `test_ane_causal_attn` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_ane_causal_attn test_ane_causal_attn.m` | Decomposed causal attention: Q @ K^T on ANE, mask+softmax on CPU, scores @ V on ANE |
| `test_ane_sdpa5` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_ane_sdpa5 test_ane_sdpa5.m` | 4 approaches to causal masking with `scaled_dot_product_attention` |
| `test_conv_attn3` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_conv_attn3 test_conv_attn3.m` | Grouped conv approach to attention (K,V baked as conv weights) |
| `test_full_fused` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o test_full_fused test_full_fused.m` | Full fused attention + FFN in single MIL dispatch at DIM=768, HEADS=12, SEQ=64 |
| `test_fused_qkv` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_fused_qkv test_fused_qkv.m` | Fused QKV (3 convs + concat in one dispatch) vs separate dispatches |
| `test_fused_bwd` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_fused_bwd test_fused_bwd.m` | Fused backward: slice_by_size + 2 convs + add in one kernel |
---
## Bridge Library
```bash
cd bridge
make # Build libane_bridge.dylib
make test # Build and link test_bridge
./test_bridge # Run bridge tests
```
---
## Known Results
### M4 (from README)
**Single-layer (dim=768, seq=512):**
| Optimization | ms/step | ANE utilization |
|---|---|---|
| Baseline (vDSP transpose) | 33.5 | 3.1% |
| Channel-first layout | 20.3 | 5.2% |
| vDSP vectorized RMSNorm | 14.2 | 7.4% |
| GCD async cblas overlap | 11.4 | 9.2% |
| ANE RMSNorm fusion | 11.4 | 9.2% |
| Wo^T fusion (7 to 6 kernels) | 11.4 | 9.2% |
| Deferred cblas wait | **9.3** | **11.2%** |
**Full Stories110M (12 layers):**
| Component | Time (ms/step) |
|-----------|---------------|
| ANE runs | 9.6 |
| IO (fp16 conversion) | 4.1 |
| Classifier (cblas) | 9.1 |
| Cross-entropy + residuals | 14.4 |
| RMSNorm | 0.1 |
| **Total** | **~107** |
### M5 Probe Results (from m5result.md)
**Machine**: Apple M5, macOS 26.3, ANE Family H16 (same as M4)
- **Weight reload**: FAIL -- weights baked at compile time, cannot be overwritten
- **QoS sweep**: All QoS 0-63 work, no measurable latency difference
- **Performance stats**: `_ANEPerformanceStats` class exists, `alloc/init` returns nil (needs factory methods)
- **weightsBuffer IOSurface**: Does NOT override compiled weights
- **ChainingRequest**: Exists with loopback and pipeline support -- most promising for utilization improvement
---
## Timing Metrics Key
| Metric | What it measures |
|--------|-----------------|
| `ane` | ANE kernel runs (all 6 kernels per layer x 12 layers) |
| `io` | fp16-to-fp32 IOSurface data transfer (NEON conversion) |
| `cls` | Classifier matmul (CPU cblas_sgemm) |
| `elem` | Embedding lookup, residual adds, cross-entropy |
| `rms` | RMSNorm forward/backward (CPU vDSP) |
| `cblas_wait` | Time waiting for async dW gradient sgemms to complete |

156
docs/BENCHMARK_RESULTS.md Normal file

@ -0,0 +1,156 @@
# ANE Benchmark Results: Apple M4 Max
**Date**: March 3, 2026
**Machine**: Mac16,5 (MacBook Pro, Apple M4 Max)
**macOS**: 26.2
**ANE Peak**: 15.8 TFLOPS (theoretical)
## Training Performance
### train_large (CPU classifier path)
| Metric | Value |
|--------|-------|
| Model | Stories110M (12 layers, dim=768, hidden=2048) |
| Kernels | 72 (60 weight-bearing + 12 static sdpaBwd2) |
| Avg step time | 72.4 ms/step |
| ANE TFLOPS | 1.29 sustained |
| Total TFLOPS | 2.41 (ANE+CPU) |
| ANE utilization | 8.1% of 15.8 TFLOPS |
| Compile time | 79.7% of wall time |
| Train time | 16.4% of wall time |
### train_large_ane (ANE-offloaded classifier)
| Metric | Value |
|--------|-------|
| Model | Stories110M (same as above) |
| Kernels | 99 (86 weight-bearing + 13 static) |
| Avg step time | 62.9 ms/step |
| ANE TFLOPS | 1.68 sustained |
| Total TFLOPS | 2.77 (ANE+CPU) |
| ANE utilization | 10.6% of 15.8 TFLOPS |
| Compile time | 84.5% of wall time |
| Train time | 12.5% of wall time |
**Step time breakdown (ms/step, ANE classifier path):**
| Component | Time (ms) | Description |
|-----------|-----------|-------------|
| ane | 10-12 | ANE kernel dispatch + evaluation |
| elem | 12-13 | Elementwise ops (residuals, activations) |
| cls | 5-6 | Classifier forward + backward |
| io | 3-5 | IOSurface data transfers |
| rms | 0.1 | RMSNorm |
| cblas_wait | 0.0 | BLAS sync overhead |
## Programmatic MIL Peak TFLOPS
```
Config W(MB) GFLOP ms/eval TFLOPS
----------------------------------------------------------------------
32x conv 512ch sp64 16.0 1.07 0.408 ms 2.63
48x conv 512ch sp64 24.0 1.61 0.262 ms 6.15
64x conv 512ch sp64 32.0 2.15 0.244 ms 8.80
96x conv 512ch sp64 48.0 3.22 0.326 ms 9.89
128x conv 512ch sp64 64.0 4.29 0.385 ms 11.14
64x conv 256ch sp64 8.0 0.54 0.365 ms 1.47
128x conv 256ch sp64 16.0 1.07 0.454 ms 2.37
256x conv 256ch sp64 32.0 2.15 0.351 ms 6.11
64x conv 384ch sp64 18.0 1.21 0.429 ms 2.82
128x conv 384ch sp64 36.0 2.42 0.354 ms 6.82
```
**Peak observed: 11.14 TFLOPS** (128x conv 512ch sp64, 64 MB weights)
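The table columns can be reproduced with a few lines of C. The sketch assumes each layer is a 1x1 convolution with equal input and output channel counts, costing `2 * ch * ch * spatial` FLOPs; that formula is an inference that matches the GFLOP and W(MB) columns, not something the source states, and the function names are hypothetical.

```c
/* GFLOP for a chain of `depth` 1x1 convolutions with ch input and ch output
 * channels over `spatial` positions (2 FLOPs per multiply-add). */
static double conv_chain_gflop(int depth, int ch, int spatial) {
    return (double)depth * 2.0 * ch * ch * spatial / 1e9;
}

/* GFLOP per millisecond is numerically equal to TFLOP per second. */
static double tflops_from_ms(double gflop, double ms) {
    return gflop / ms;
}

/* fp16 weight footprint of the chain in units of 2^20 bytes. */
static double conv_chain_weight_mb(int depth, int ch) {
    return (double)depth * ch * ch * 2.0 / (1024.0 * 1024.0);
}
```

For the 128x 512ch sp64 row this gives about 4.29 GFLOP and 64 MB of weights; dividing by the measured 0.385 ms lands within rounding of the reported 11.14 TFLOPS.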
## In-Memory ANE Benchmark (via mlpackage)
```
Config W (MB) ms/eval TFLOPS
---------------------------------------------
256ch x64sp 0.1 0.319 ms 0.03
512ch x64sp 0.5 0.357 ms 0.09
1024ch x64sp 2.0 0.457 ms 0.29
2048ch x64sp 8.0 0.254 ms 2.11
3072ch x64sp 18.0 0.389 ms 3.10
4096ch x64sp 32.0 1.148 ms 1.87
```
## SRAM Probe Results
### Coarse Probe (varying channels + spatial)
```
Config W (MB) Act(MB) Tot(MB) ms/eval TFLOPS
--------------------------------------------------------------------------
256ch x 64sp 0.1 0.03 0.2 0.378 ms 0.02
512ch x 64sp 0.5 0.06 0.6 0.389 ms 0.09
1024ch x 64sp 2.0 0.12 2.2 0.392 ms 0.34
2048ch x 64sp 8.0 0.25 8.5 0.218 ms 2.47
3072ch x 64sp 18.0 0.38 18.8 0.396 ms 3.05
4096ch x 64sp 32.0 0.50 33.0 1.116 ms 1.92
5120ch x 64sp 50.0 0.62 51.2 0.767 ms 4.38
6144ch x 64sp 72.0 0.75 73.5 0.872 ms 5.54
8192ch x 32sp 128.0 0.50 129.0 4.195 ms 1.02
```
### Fine Probe (spatial=64, weights only)
```
Channels W (MB) ms/eval TFLOPS GFLOPS/MB
--------------------------------------------------------------
256 ch 0.1 0.378 ms 0.02 177.7
512 ch 0.5 0.431 ms 0.08 155.6
1024 ch 2.0 0.411 ms 0.33 163.5
1536 ch 4.5 0.493 ms 0.61 136.1
2048 ch 8.0 0.410 ms 1.31 163.9
2560 ch 12.5 0.237 ms 3.53 282.6 <-- peak efficiency
3072 ch 18.0 0.335 ms 3.60 200.1
3584 ch 24.5 0.414 ms 3.97 162.1
4096 ch 32.0 1.134 ms 1.89 59.2 <-- spilling
4608 ch 40.5 0.563 ms 4.83 119.2
5120 ch 50.0 0.659 ms 5.09 101.8
6144 ch 72.0 0.844 ms 5.73 79.5 <-- spilling
8192 ch 128.0 4.203 ms 1.02 8.0 <-- catastrophic spilling
```
### SRAM Analysis
The M4 Max ANE SRAM appears to be approximately **24-32 MB**:
- **Peak efficiency** at 2560ch (12.5 MB weights): 282.6 GFLOPS/MB, 3.53 TFLOPS
- **First spill** at 4096ch (32.0 MB): drops to 59.2 GFLOPS/MB (1.89 TFLOPS)
- **Catastrophic** at 8192ch (128.0 MB): 8.0 GFLOPS/MB (1.02 TFLOPS)
The 4608ch recovery (4.83 TFLOPS despite 40.5 MB weights) suggests the ANE may use tiling strategies for some weight configurations.
Training kernels (dim=768; attention projection matrices are ~1.2 MB fp16 each, FFN matrices ~3.1 MB each) stay well within the SRAM budget.
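The GFLOPS/MB column used in this analysis can be recomputed from the rounded table values with a short C sketch. The struct and function names are hypothetical, and results differ slightly from the table (for example 282.4 instead of 282.6) because the table was presumably produced from unrounded timings.

```c
#include <stddef.h>

/* One row of the fine probe table. */
struct probe_row {
    int channels;
    double weight_mb;
    double tflops;
};

/* Throughput per MB of weights; TFLOPS are converted to GFLOPS first. */
static double gflops_per_mb(const struct probe_row *r) {
    return r->tflops * 1000.0 / r->weight_mb;
}

/* Index of the row with the highest GFLOPS/MB; n must be at least 1. */
static size_t peak_efficiency_row(const struct probe_row *rows, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (gflops_per_mb(&rows[i]) > gflops_per_mb(&rows[best]))
            best = i;
    return best;
}
```

Fed with the 2048, 2560, 3072, and 4096 channel rows, the helper picks the 2560-channel configuration as the efficiency peak, in line with the annotation in the table.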
## Known Test Results
| Test | Status | Notes |
|------|--------|-------|
| test_rmsnorm_bwd | PASS | ANE-accelerated RMSNorm backward |
| test_classifier | PASS | 4 tests passed; ANE backward 3x slower than CPU cblas for matmul |
| test_weight_reload | FAIL (expected) | ANE bakes weights at compile time; IOSurface override doesn't work |
| test_perf_stats | PASS | _ANEPerformanceStats API accessible |
| test_qos_sweep | PASS | QoS parameter has no measurable effect on latency |
| test_ane_advanced | PASS | Advanced ANE operations verified |
| inmem_basic | PASS | In-memory compilation and execution verified |
| inmem_bench | PASS | Multi-config benchmarks via mlpackage |
| inmem_peak | PASS | Peak TFLOPS measurement via programmatic MIL |
| sram_bench | PASS | SRAM capacity probing |
| sram_probe | PASS | Fine-grained SRAM spilling detection |
## Reproducing
```bash
cd scripts && bash run_benchmarks.sh
```
The benchmark script auto-generates required `.mlpackage` models (needs Python 3.11-3.13 with `coremltools`).
Override training data paths:
```bash
ANE_MODEL_PATH=/path/to/stories110M.bin ANE_DATA_PATH=/path/to/data.bin ./train_large
```


@ -0,0 +1,74 @@
# Development Diary #001: Initial Setup & Security Audit
**Date:** 2026-03-02
**Status:** Completed
## Tasks
### 1. Repository Synchronization
- **Starting point:** Local directory `/Volumes/ExtremePro/projects/ANE` contained only `firebase-debug.log`
- **Performed:**
```bash
git init
git remote add origin https://github.com/maderix/ANE.git
git fetch origin
git checkout -b main --track origin/main
```
- **Result:** 29 files in the `training/` directory synchronized, `firebase-debug.log` untouched
- **Commit state:** HEAD = origin/main (up to date)
### 2. Security Audit
- **Performed:** Full analysis of all 38 source files (Objective-C/C/Python)
- **Findings:** 19 security issues identified (4 CRITICAL, 5 HIGH, 6 MEDIUM, 4 LOW)
- **Report:** `docs/reports/security-audit-2026-03-02.md`
## Key Takeaways
The ANE project is an innovative research project on using the Apple Neural Engine directly for training. It uses reverse-engineered private APIs (`_ANEInMemoryModelDescriptor`, `_ANEInMemoryModel`, etc.) via `dlopen` + `objc_msgSend`.
**Most critical findings:**
- CRIT-01: `dlopen()` without error handling → silent crash
- CRIT-03: `fread()` without return-value check → uninitialized memory
- CRIT-04: Integer overflow in blob size calculation (`int` instead of `size_t`)
**Architecture highlights (of interest):**
- Uses `execl()` to restart the process when the ANE compiler limit is reached
- IOSurface as shared memory between CPU and ANE
- Gradient accumulation with async CBLAS on a separate dispatch queue
## LOW-Finding Fixes (2026-03-02)
GitHub fork `manni07/ANE` set up, branch `fix/low-security-findings` created.
All 4 LOW findings fixed:
| Finding | File | Change |
|---------|------|--------|
| LOW-01 | `training/Makefile` | `SEC_FLAGS = -fstack-protector-strong -Wformat-security`, `CFLAGS_DEBUG`, `verify-flags` target |
| LOW-02 | `training/Makefile` | `ANE_COMPAT` variable with documentation, `check-deprecated` target |
| LOW-03 | `training/tokenize.py` | 5 input validations, configurable size limit via `MAX_ZIP_BYTES` |
| LOW-04 | `.gitignore` (new) | Binaries, logs, macOS metadata, training data excluded |
**Simulation:** 3 iteration rounds, overall score 96.35% (all criteria ≥ 95%)
**Remote:** `origin=manni07/ANE`, `upstream=maderix/ANE`
## CRIT-Finding Fixes (2026-03-02)
Branch `fix/crit-security-findings` created. All 4 CRIT findings fixed:
| Finding | Files | Core change |
|---------|-------|-------------|
| CRIT-01 | `training/ane_runtime.h`, `training/stories_config.h` | `dlopen()` return check; `NSClassFromString()` validation; `g_ane_ok`/`g_ane_ok_large` flags; `stories_config.h` re-entry guard |
| CRIT-02 | `training/ane_runtime.h`, `training/stories_io.h` | `g_ane_ok` guard in `ane_compile()`; `g_ane_ok_large` guard in `compile_kern_mil_w()`; `mdl` NULL check before `hexStringIdentifier` |
| CRIT-03 | `training/model.h`, `training/train_large.m` | `fread()` config/header check as gatekeeper; `fopen()` NULL check in `save_checkpoint()`; design decision documented |
| CRIT-04 | `training/stories_io.h`, `training/model.h` | `int`→`size_t` in all `build_blob*` functions; `(size_t)` cast in `malloc()` sizes; `calloc()` NULL checks |
**Simulation:** 3 iteration rounds (CRIT-03 required 3 runs), overall score 96.15% (all criteria ≥ 95%)
**Branch:** `fix/crit-security-findings` on `manni07/ANE`
## Status
| Finding type | Count | Status |
|--------------|-------|--------|
| CRITICAL (CRIT-01 to CRIT-04) | 4 | ✅ FIXED |
| HIGH (HIGH-01 to HIGH-05) | 5 | Open |
| MEDIUM (MED-01 to MED-06) | 6 | Open |
| LOW (LOW-01 to LOW-04) | 4 | ✅ FIXED |


@ -0,0 +1,419 @@
# Security Audit: ANE (Apple Neural Engine Training Framework)
**Date:** 2026-03-02
**Repository:** https://github.com/maderix/ANE
**Auditor:** Claude Code (claude-sonnet-4-6)
**Scope:** Full codebase analysis (38 source files, Objective-C/C/Python)
---
## Executive Summary
The ANE project implements neural network training directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. It is a **research/experimental project** with considerable inherent security risks arising from its use of undocumented Apple interfaces.
**Overall rating: HIGH RISK** for production use.
| Category | Count |
|----------|-------|
| CRITICAL | 4 |
| HIGH | 5 |
| MEDIUM | 6 |
| LOW | 4 |
| **Total** | **19** |
---
## CRITICAL Findings
### [CRIT-01] No Error Handling for `dlopen()` of the Private Framework
**File:** `training/ane_runtime.h:26`, `api_exploration.m:15`
**Severity:** CRITICAL
**Status: FIXED** (2026-03-02, branch `fix/crit-security-findings`)
```objc
// ane_runtime.h:26
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
```
**Problem:**
- The return value of `dlopen()` is not checked. If the framework is not found (after a macOS update or on non-Apple-Silicon hardware), `dlopen()` returns NULL, but execution continues.
- All subsequent `NSClassFromString()` calls then return NULL as well.
- `g_ane_loaded = true` is set even if loading failed.
**Consequence:** Null-pointer dereferences on the first API call, uncontrolled crash without a meaningful error message.
**Recommendation:**
```objc
void *handle = dlopen("...", RTLD_NOW);
if (!handle) {
    fprintf(stderr, "ANE framework not found: %s\n", dlerror());
    abort();
}
if (!g_ANEDesc || !g_ANEInMem || !g_ANEReq || !g_ANEIO) {
    fprintf(stderr, "ANE private classes not found (API changed?)\n");
    abort();
}
```
---
### [CRIT-02] Unsafe `objc_msgSend` Casts Without Type Validation
**Files:** `training/ane_runtime.h:59-125`, `training/stories_io.h:90-117`
**Severity:** CRITICAL
**Status: FIXED** (2026-03-02, branch `fix/crit-security-findings`)
```objc
// ane_runtime.h:59-61
id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(
    g_ANEDesc, @selector(modelWithMILText:weights:optionsPlist:),
    milText, wdict, nil);
```
**Problems:**
1. The class `g_ANEDesc` could be NULL (if `dlopen` failed, see CRIT-01)
2. The method signature is hardcoded; if Apple changes the API, the wrong cast results in undefined behavior or memory corruption
3. No `@try/@catch` to catch possible Objective-C exceptions
4. Global variables `g_D`, `g_I`, `g_AIO`, `g_AR` in `stories_io.h` could be NULL
**Consequence:** Memory corruption, SIGBUS, uncontrolled crash.
**Recommendation:** At minimum, NULL checks before every `objc_msgSend`:
```objc
if (!g_ANEDesc) { fprintf(stderr, "g_ANEDesc is NULL\n"); return NULL; }
```
---
### [CRIT-03] `fread()` Return Values Never Checked: Uninitialized Memory
**Files:** `training/model.h:81-146`, `training/train_large.m:17-55`
**Severity:** CRITICAL
**Status: FIXED** (2026-03-02, branch `fix/crit-security-findings`)
```c
// model.h:81
fread(&m->cfg, sizeof(Config), 1, f); // return value ignored!
// train_large.m:29
fread(embed, 4, V * DIM, f); // no check that V*DIM floats were read
```
**Problems:**
1. If the model file is smaller than expected (corrupt, truncated), structs are filled with garbage values
2. No check that `cfg.dim`, `cfg.hidden_dim`, `cfg.n_layers` are plausible before memory is allocated
3. `fread(embed, 4, V * DIM, f)`: with V=32000, DIM=768 this reads 98,304,000 bytes. No size validation.
4. In `load_checkpoint()`: if the file ends after the header, weights are populated with zero bytes without any warning
**Recommendation:**
```c
size_t n = fread(&m->cfg, sizeof(Config), 1, f);
if (n != 1) { fprintf(stderr, "Config read failed\n"); fclose(f); return -1; }
if (m->cfg.dim <= 0 || m->cfg.dim > 65536 || m->cfg.n_layers <= 0) {
    fprintf(stderr, "Invalid model config\n"); fclose(f); return -1;
}
```
---
### [CRIT-04] Integer Overflow in Memory Size Calculation
**Files:** `training/stories_io.h:13-14`, `training/ane_mil_gen.h:12-13`
**Severity:** CRITICAL
**Status: FIXED** (2026-03-02, branch `fix/crit-security-findings`)
```c
// stories_io.h:13-14
static NSData *build_blob(const float *w, int rows, int cols) {
    int ws = rows * cols * 2; // int multiplication, not size_t!
    int tot = 128 + ws;
```
**Problem:** With larger models, integer overflow can occur once `rows * cols * 2` exceeds `INT_MAX`, that is, at `rows * cols` of about 2^30. The thresholds `dim >= 2048, hidden >= 16384` (2^25 elements at the lower bound) do not reach this by themselves, so the risk applies to considerably larger matrices. `*(uint32_t*)(chunk + 8) = (uint32_t)wsize;`: if `wsize` becomes negative as an `int` (overflow), a negative value is written as uint32, which yields a wrong blob size and in turn ANE errors or memory corruption.
**Recommendation:** `size_t` for all memory size calculations:
```c
size_t ws = (size_t)rows * cols * sizeof(_Float16);
size_t tot = 128 + ws;
```
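Beyond switching to `size_t`, the size computation can also reject inputs that would overflow or that no longer fit the 32-bit size field of the blob header. The following C sketch is one possible shape for such a check; the function name `blob_total_size` is hypothetical and this is not the fix applied on the branch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Computes 128 + rows * cols * 2 without wrapping.
 * Returns false (leaving *total untouched) for empty dimensions, for
 * arithmetic overflow, or when the payload exceeds the uint32 size field. */
static bool blob_total_size(size_t rows, size_t cols, size_t *total) {
    const size_t header = 128, elem = 2;          /* fp16 = 2 bytes */
    if (rows == 0 || cols == 0)
        return false;
    if (rows > SIZE_MAX / cols)                   /* rows * cols would wrap */
        return false;
    size_t n = rows * cols;
    if (n > (SIZE_MAX - header) / elem)           /* n * 2 + 128 would wrap */
        return false;
    size_t payload = n * elem;
    if (payload > UINT32_MAX)                     /* header stores uint32 */
        return false;
    *total = header + payload;
    return true;
}
```

A caller would allocate only after the function returns true, so a malformed model configuration turns into an explicit error instead of an undersized buffer.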
---
## HIGH Findings
### [HIGH-01] No Input Validation for Token Indices
**File:** `training/train_large.m:375-376`
**Severity:** HIGH
```c
size_t max_pos = n_tokens - SEQ - 1;
size_t pos = (size_t)(drand48() * max_pos);
uint16_t *input_tokens = token_data + pos;
```
**Problems:**
1. Token values from `token_data` are used directly as embedding indices without checking that `token < VOCAB`
2. If the `.bin` file contains corrupt token values (>= 32000), out-of-bounds accesses to `embed[]` result
3. No check that `n_tokens >= SEQ + 1` before computing `max_pos`
**Consequence:** Heap buffer overflow; a corrupt `.bin` file can cause memory damage.
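A minimal C sketch of the two missing checks follows, assuming tokens are stored as `uint16_t` as in the snippet above. The helper names `tokens_in_vocab` and `sample_range` are hypothetical; the audit itself does not prescribe a concrete fix for this finding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every token is a valid row index into an embedding table
 * with `vocab` rows. */
static bool tokens_in_vocab(const uint16_t *tokens, size_t n, uint32_t vocab) {
    for (size_t i = 0; i < n; i++)
        if (tokens[i] >= vocab)
            return false;
    return true;
}

/* Computes max_pos = n_tokens - seq - 1 only when it cannot underflow.
 * Returns false and leaves *max_pos untouched if the data is too short. */
static bool sample_range(size_t n_tokens, size_t seq, size_t *max_pos) {
    if (n_tokens < seq + 1)
        return false;
    *max_pos = n_tokens - seq - 1;
    return true;
}
```

Validating the whole token file once after mapping it keeps the per-step sampling path unchanged; validating only the sampled window per step would be a cheaper alternative for very large files.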
---
### [HIGH-02] Checkpoint Path with Relative Directory Navigation
**File:** `training/train_large.m:8-10`
**Severity:** HIGH
```c
#define CKPT_PATH "ane_stories110M_ckpt.bin"
#define MODEL_PATH "../../assets/models/stories110M.bin" // ← relative path!
#define DATA_PATH "tinystories_data00.bin"
```
**Problems:**
1. `MODEL_PATH` contains `../../`, i.e. relative path navigation. If the binary is started from an unexpected directory, the wrong files are read.
2. No `realpath()` call to normalize the path
3. Manipulated checkpoint + `--resume` → uncontrolled binary data is loaded as weights
---
### [HIGH-03] `execl()` Process Restart Without Argument Validation
**File:** `training/train_large.m:331`
**Severity:** HIGH
```c
execl(argv[0], argv[0], "--resume", NULL);
```
**Problems:**
1. `argv[0]` is passed on without validation. Via a symlink, an arbitrary binary could be launched.
2. `data_fd` (mmap'd token file) is not closed before `execl()`, which leaks the file descriptor into the new process
3. `munmap(token_data)` is not called before `execl()`
---
### [HIGH-04] Fehlende `malloc()`/`calloc()`-Rückgabewert-Prüfungen
**Dateien:** Alle `.m` und `.h` Dateien
**Schweregrad:** HOCH
```c
// train_large.m:219
float *embed = (float*)malloc(VOCAB*DIM*4); // 32000*768*4 = 98MB — kein NULL-Check!
```
None of the `malloc()`/`calloc()` calls checks the return value for NULL. Under memory pressure (110M model + Adam state = several GB), allocations can fail, leading to a null pointer dereference.
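
A small wrapper is one way to make such checks hard to forget. The sketch below is an illustration only; `alloc_array` and `xalloc_array` are hypothetical names, and aborting on failure is an assumed policy that seems reasonable for a training tool but is not prescribed by the report.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: allocate count * size bytes, rejecting products
 * that would overflow size_t. Returns NULL on overflow or failure. */
static void *alloc_array(size_t count, size_t size) {
    if (size != 0 && count > SIZE_MAX / size) return NULL;
    return malloc(count * size);
}

/* Hypothetical helper: like alloc_array(), but aborts with a message
 * instead of returning NULL. Intended for non-zero sizes only, since
 * malloc(0) may legitimately return NULL.
 * Example: float *embed = xalloc_array((size_t)VOCAB * DIM, sizeof(float), "embed"); */
static void *xalloc_array(size_t count, size_t size, const char *what) {
    void *p = alloc_array(count, size);
    if (p == NULL) {
        fprintf(stderr, "allocation failed: %s (%zu x %zu bytes)\n",
                what, count, size);
        abort();
    }
    return p;
}
```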
---
### [HIGH-05] ANE Inference Without Error Checking in the Training Hot Path
**File:** `training/stories_io.h:131-134`
**Severity:** HIGH
```c
static void ane_run(Kern *k) {
id mdl = (__bridge id)k->model; id req = (__bridge id)k->request; NSError *e = nil;
((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e);
// BOOL return value and NSError *e are ignored!
}
```
**Problem:** ANE execution can fail (thermal throttling, hardware errors, API changes). Silent failures lead to undetected gradient corruption.
---
## MEDIUM Findings
### [MED-01] IOSurface Lock Without Error Handling
**File:** `training/stories_io.h:62-83`
**Severity:** MEDIUM
```c
IOSurfaceLock(s, 0, NULL); // return code ignored
```
`IOSurfaceLock()` returns `kIOReturnSuccess` or an error code. If the lock fails, the memory is accessed anyway, which is a possible data race.
---
### [MED-02] Temporary Directory Not Created Securely (TOCTOU Risk)
**Files:** `training/ane_runtime.h:68-80`, `training/stories_io.h:94-100`
**Severity:** MEDIUM
```objc
NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx];
[milText writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES];
```
There is a TOCTOU race between `createDirectoryAtPath` and `writeToFile`. The `hexStringIdentifier` could be guessed by another process, which could then manipulate the directory.
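
In plain C, POSIX `mkdtemp()` avoids this class of race by creating a uniquely named directory with mode 0700 in a single step. The sketch below only illustrates that technique; `make_private_tmpdir` and the `ane_` prefix are hypothetical, and whether such a directory can replace the `hexStringIdentifier`-based path depends on what the ANE compiler expects, which the report does not state.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: create a private, uniquely named directory below
 * `base`. On success the resulting path is stored in `out`. mkdtemp()
 * replaces the XXXXXX suffix and creates the directory atomically with
 * permissions 0700, so no other user can pre-create or enter it. */
static bool make_private_tmpdir(const char *base, char *out, size_t out_len) {
    int n = snprintf(out, out_len, "%s/ane_XXXXXX", base);
    if (n < 0 || (size_t)n >= out_len) return false;   /* path truncated */
    return mkdtemp(out) != NULL;
}
```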
---
### [MED-03] MIL Text Generation Without Parameter Validation
**File:** `training/ane_mil_gen.h:32-52`
**Severity:** MEDIUM
```objc
return [NSString stringWithFormat:
@"...tensor<fp32, [1, %d, %d]> x...", in_ch, spatial, ...];
```
Negative or extremely large `in_ch`/`out_ch`/`spatial` values caused by a faulty configuration produce invalid MIL, which is then passed to the undocumented ANE compiler.
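
A range check before formatting could look like the sketch below. `mil_dims_valid` is a hypothetical helper, and `MAX_MIL_DIM` is an assumed bound chosen purely for illustration, since the actual limits of the ANE compiler are undocumented.

```c
#include <stdbool.h>

/* Assumed upper bound for illustration only; the real limits of the ANE
 * compiler are undocumented. */
#define MAX_MIL_DIM 65536

/* Hypothetical helper: reject non-positive or implausibly large
 * dimensions before they are formatted into MIL text. */
static bool mil_dims_valid(int in_ch, int out_ch, int spatial) {
    const int dims[] = { in_ch, out_ch, spatial };
    for (int i = 0; i < 3; i++)
        if (dims[i] <= 0 || dims[i] > MAX_MIL_DIM) return false;
    return true;
}
```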
---
### [MED-04] No Endianness Check in Checkpoint Serialization
**File:** `training/train_large.m:110-181`
**Severity:** MEDIUM
```c
h.magic = 0x424C5A54;
fwrite(&h, sizeof(h), 1, f);
```
The `CkptHdr` struct is written as a raw binary dump without an endianness marker, so the format is not portable.
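
A portable alternative is to serialize each header field in a fixed byte order instead of dumping the struct. The sketch below shows this for 32-bit fields; the helper names are hypothetical, and because the layout of `CkptHdr` is not shown here, applying it to the real header is left as an assumption. Little-endian is chosen because it matches the native byte order of Apple Silicon.

```c
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order, independent of the
 * host's native endianness. */
static void put_u32_le(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}

/* Inverse of put_u32_le(): rebuild the value from four bytes. */
static uint32_t get_u32_le(const uint8_t in[4]) {
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}
```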
---
### [MED-05] NEON Vectorization Without Alignment Guarantee
**File:** `training/stories_io.h:41-58`
**Severity:** MEDIUM
```c
float16x8_t h = vld1q_f16((const __fp16*)(src + i));
```
Pointer arithmetic with `ch_off * sp` could violate the alignment required by NEON if `ch_off * sp` is not a multiple of 8.
---
### [MED-06] Global Variables Without Thread Safety
**Files:** `training/stories_io.h`, `training/stories_config.h`
**Severity:** MEDIUM
```c
static bool g_ane_loaded = false;
static int g_compile_count = 0;
```
`g_compile_count` is incremented atomically via `__sync_fetch_and_add()`, but `g_ane_loaded` and the class variables are not set atomically. With multi-threaded use this creates a race condition in `ane_init()`.
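
The race in a lazily initialized global can be closed with a once-only guard. The sketch below uses C11 `<stdatomic.h>` and placeholder names (`g_init_state`, `do_init`, `init_once`) that are not taken from the project; in real code, `pthread_once()` or GCD's `dispatch_once` would be the more conventional choice.

```c
#include <stdatomic.h>

static atomic_int g_init_state = 0;  /* 0 = not started, 1 = running, 2 = done */
static int g_init_calls = 0;         /* counts real initializations (illustration) */

/* Placeholder for the actual one-time setup work. */
static void do_init(void) { g_init_calls++; }

/* Run do_init() exactly once, even if called from several threads.
 * The thread that wins the 0 -> 1 transition performs the setup; all
 * other callers wait until the state reaches 2. */
static void init_once(void) {
    int expected = 0;
    if (atomic_compare_exchange_strong(&g_init_state, &expected, 1)) {
        do_init();
        atomic_store(&g_init_state, 2);
    } else {
        while (atomic_load(&g_init_state) != 2) { /* spin until initialized */ }
    }
}
```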
---
## LOW Findings
### [LOW-01] Missing Compiler Security Flags
**File:** `training/Makefile:2`
**Severity:** LOW
**Status: FIXED** (2026-03-02, branch `fix/low-security-findings`)
```makefile
CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc
```
Missing flags: `-fstack-protector-strong`, `-D_FORTIFY_SOURCE=2`, `-Wformat=2`
**Fix:** Introduced `SEC_FLAGS = -fstack-protector-strong -Wformat-security`. Note:
`-D_FORTIFY_SOURCE=2` is implicitly active on macOS (Apple LLVM) at `-O2`; defining it
explicitly would produce a "macro redefinition" warning. Added `CFLAGS_DEBUG` with
`-fsanitize=address,undefined` for debug builds. `make verify-flags`
shows the active flags.
---
### [LOW-02] `-Wno-deprecated-declarations` Suppresses Important Warnings
**File:** `training/Makefile:2`
**Severity:** LOW
**Status: FIXED** (2026-03-02, branch `fix/low-security-findings`)
The flag suppresses warnings about deprecated API calls, which could hide important hints about deprecated private APIs.
**Fix:** Extracted the flag into a named variable `ANE_COMPAT` with an explanatory comment
(deliberate suppression because of the private `_ANE*` APIs called via `objc_msgSend`). The new target
`make check-deprecated` builds without the suppression and shows all hidden warnings.
---
### [LOW-03] Python Script Without Input Validation
**File:** `training/tokenize.py`
**Severity:** LOW
**Status: FIXED** (2026-03-02, branch `fix/low-security-findings`)
The input file size is not validated; very large inputs can exhaust memory.
**Fix:** Five validations implemented:
1. ZIP existence check with a helpful error message
2. Configurable size limit (default 10 GB, overridable via the `MAX_ZIP_BYTES` environment variable)
3. Check that `data00.bin` is present in the ZIP
4. Error handling for `struct.unpack` when the output is < 20 bytes
5. Token range validation (all tokens must be < `VOCAB_SIZE=32000`)
---
### [LOW-04] No `.gitignore` for Sensitive Artifacts
**File:** Repository root
**Severity:** LOW
**Status: FIXED** (2026-03-02, branch `fix/low-security-findings`)
There is no `.gitignore` file. Binary artifacts (checkpoints, training data, `firebase-debug.log`) could be committed accidentally.
**Fix:** Created `.gitignore` with rules for: macOS metadata (`.DS_Store`),
log files (`*.log`), compiled binaries (`training/train`, `training/train_large`,
all probe binaries), training data (`training/*.bin`), ANE artifacts
(`*.mlmodelc/`, `*.mlpackage/`), external assets (`assets/`).
---
## Positive Findings (Strengths)
### Correct Memory Release
`ane_free()` (`ane_runtime.h:149-160`) and `free_kern()` (`stories_io.h:122-130`) implement complete cleanup routines with `CFRelease()`, `unloadWithQoS:error:`, and removal of the temporary directory.
### Magic-Byte Validation in Checkpoints
```c
if (h.magic != 0x424C5A54 || h.version != 2) { fclose(f); return false; }
```
Basic protection against corrupt checkpoint files.
### Atomic Compile Counter
```c
__sync_fetch_and_add(&g_compile_count, 1);
```
Thread-safe counter for the number of ANE compilations.
### Gradient Accumulation with Async CBLAS
Correct parallelization of the CPU weight-gradient computation via `dispatch_group_async`.
---
## Risk Assessment for Production Use
| Aspect | Assessment |
|--------|------------|
| Apple Silicon required | macOS 15+, M-series only |
| Private API stability | **VERY LOW**: any macOS update can break it |
| Memory safety | **MEDIUM**: no bounds checks, no sanitizers |
| Input validation | **LOW**: files are read without validation |
| Error handling | **LOW**: many critical errors are ignored |
| Suitability for production | **NO**: research/experimental project |
---
## Recommendations by Priority
### Immediate Actions (CRITICAL)
1. Check the `dlopen()` return value and abort on failure
2. Check all `fread()` return values and validate file sizes
3. Add NULL checks before all `objc_msgSend` calls
4. Use `size_t` instead of `int` for all memory size calculations
### Short-Term Actions (HIGH)
5. Token index validation: `if (token >= VOCAB) abort()`
6. Check the ANE inference return value and the `NSError`
7. Compiler flags: `-fstack-protector-strong -D_FORTIFY_SOURCE=2`
8. Create a `.gitignore` for binary artifacts
### Medium-Term Actions (MEDIUM)
9. Check IOSurface lock return values
10. Use `__atomic_store_n()` for `g_ane_loaded`
11. Validate MIL parameters before formatting
---
*This report was prepared for the ANE research project. The project is explicitly designed as proof-of-concept/research code and is not intended for production use.*