mirror of https://github.com/maderix/ANE.git
[docs] Developer documentation: architecture diagrams, complete API reference, benchmark guide, M4 Max results, security audit report
This commit is contained in:
parent
680f8c7e20
commit
37cac988b8
|
|
@ -0,0 +1,429 @@
|
|||
# ANE Training -- API Reference
|
||||
|
||||
Complete function index for all public functions, structs, and macros organized by source file.
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [stories_config.h -- Model Configuration](#stories_configh)
|
||||
2. [stories_io.h -- IOSurface I/O and Compilation](#stories_ioh)
|
||||
3. [stories_mil.h -- MIL Program Generators](#stories_milh)
|
||||
4. [stories_cpu_ops.h -- CPU Operations](#stories_cpu_opsh)
|
||||
5. [ane_runtime.h -- Generalized ANE Wrapper](#ane_runtimeh)
|
||||
6. [ane_mil_gen.h -- Composable MIL Helpers](#ane_mil_genh)
|
||||
7. [ane_rmsnorm_bwd.h -- RMSNorm Backward on ANE](#ane_rmsnorm_bwdh)
|
||||
8. [ane_classifier.h -- Classifier and Softmax on ANE](#ane_classifierh)
|
||||
9. [bridge/ane_bridge.h -- C Bridge API](#bridgeane_bridgeh)
|
||||
10. [MIL Operation Reference](#mil-operation-reference)
|
||||
11. [Weight Blob Format](#weight-blob-format)
|
||||
|
||||
---
|
||||
|
||||
## stories_config.h
|
||||
|
||||
Model constants, data structures, and memory allocation helpers.
|
||||
|
||||
### Macros
|
||||
|
||||
| Macro | Value | Description |
|
||||
|-------|-------|-------------|
|
||||
| `DIM` | 768 | Model hidden dimension |
|
||||
| `HIDDEN` | 2048 | FFN intermediate dimension |
|
||||
| `HEADS` | 12 | Number of attention heads |
|
||||
| `HD` | 64 (`DIM/HEADS`) | Per-head dimension |
|
||||
| `SEQ` | 256 | Sequence length |
|
||||
| `NLAYERS` | 12 | Number of transformer layers |
|
||||
| `VOCAB` | 32000 | Vocabulary size |
|
||||
| `ACCUM_STEPS` | 10 | Gradient accumulation steps per compile batch |
|
||||
| `MAX_COMPILES` | 100 | ANE compile budget before process restart |
|
||||
| `KERNELS_PER_LAYER` | 5 | Weight-bearing ANE kernels per layer |
|
||||
| `TOTAL_WEIGHT_KERNELS` | 60 | Total weight-bearing compiles per batch |
|
||||
| `SCORE_CH` | 3072 (`HEADS*SEQ`) | Attention score channels for SDPA backward |
|
||||
| `WQ_SZ` | 589824 (`DIM*DIM`) | Size of Q/K/V/O projection weight matrices |
|
||||
| `WO_SZ` | 589824 (`DIM*DIM`) | Size of output projection |
|
||||
| `W1_SZ` | 1572864 (`HIDDEN*DIM`) | FFN gate/value projection size |
|
||||
| `W2_SZ` | 1572864 (`DIM*HIDDEN`) | FFN down-projection size |
|
||||
| `W3_SZ` | 1572864 (`HIDDEN*DIM`) | FFN value projection size |
|
||||
| `LAYER_PARAMS` | -- | Total floats per layer: `4*WQ_SZ + W1_SZ + W2_SZ + W3_SZ + 2*DIM` |
|
||||
| `TOTAL_PARAMS` | -- | Total model params: `NLAYERS * LAYER_PARAMS + DIM + VOCAB*DIM` |
|
||||
|
||||
### Structs
|
||||
|
||||
#### `LayerWeights`
|
||||
Per-layer weight matrices (all `float*`).
|
||||
|
||||
| Field | Shape | Description |
|
||||
|-------|-------|-------------|
|
||||
| `Wq`, `Wk`, `Wv`, `Wo` | `[DIM, DIM]` | Attention projection weights |
|
||||
| `W1`, `W3` | `[HIDDEN, DIM]` | FFN gate and value up-projections |
|
||||
| `W2` | `[DIM, HIDDEN]` | FFN down-projection |
|
||||
| `rms_att` | `[DIM]` | RMSNorm scale for attention sublayer |
|
||||
| `rms_ffn` | `[DIM]` | RMSNorm scale for FFN sublayer |
|
||||
|
||||
#### `AdamState`
|
||||
First/second moment buffers for a single parameter group.
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `m` | `float*` | First moment (mean) estimate |
|
||||
| `v` | `float*` | Second moment (variance) estimate |
|
||||
| `n` | `size_t` | Number of parameters |
|
||||
|
||||
#### `LayerAdam`
|
||||
Per-layer Adam optimizer state. Contains one `AdamState` per weight matrix: `Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`, `W3`, `rms_att`, `rms_ffn`.
|
||||
|
||||
#### `LayerActs`
|
||||
Per-layer activation tensors saved for the backward pass.
|
||||
|
||||
| Field | Shape | Description |
|
||||
|-------|-------|-------------|
|
||||
| `layer_in` | `[DIM, SEQ]` | Input to this layer (for rmsnorm1 backward) |
|
||||
| `xnorm` | `[DIM, SEQ]` | RMSNorm1 output |
|
||||
| `Q`, `K`, `V` | `[DIM, SEQ]` | QKV projections |
|
||||
| `attn_out` | `[DIM, SEQ]` | Attention output (before Wo) |
|
||||
| `o_out` | `[DIM, SEQ]` | Wo projection output |
|
||||
| `x2` | `[DIM, SEQ]` | Residual after attention |
|
||||
| `x2norm` | `[DIM, SEQ]` | RMSNorm2 output |
|
||||
| `h1`, `h3` | `[HIDDEN, SEQ]` | FFN intermediates (W1 and W3 outputs) |
|
||||
| `silu_out` | `[HIDDEN, SEQ]` | SiLU(h1) * h3 gated output |
|
||||
| `ffn_out` | `[DIM, SEQ]` | FFN final output |
|
||||
|
||||
#### `LayerGrads`
|
||||
Per-layer gradient accumulators. Same field names as `LayerWeights` (all `float*`): `Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`, `W3`, `rms_att`, `rms_ffn`.
|
||||
|
||||
#### `Kern`
|
||||
Single ANE kernel handle (stories-specific, single I/O).
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `model` | `void*` | Retained `_ANEInMemoryModel` |
|
||||
| `ioIn` | `IOSurfaceRef` | Input IOSurface |
|
||||
| `ioOut` | `IOSurfaceRef` | Output IOSurface |
|
||||
| `request` | `void*` | Retained `_ANERequest` |
|
||||
| `tmpDir` | `void*` | Retained temp directory path |
|
||||
|
||||
#### `LayerKernels`
|
||||
ANE kernels for one transformer layer.
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `fwdAttn` | `Kern*` | SDPA forward + taps |
|
||||
| `fwdFFN` | `Kern*` | FFN forward + taps |
|
||||
| `ffnBwd` | `Kern*` | FFN backward |
|
||||
| `sdpaBwd1` | `Kern*` | SDPA backward part 1 (Wo^T + dV + scores) |
|
||||
| `sdpaBwd2` | `Kern*` | SDPA backward part 2 (dQ + dK) |
|
||||
| `qkvBwd` | `Kern*` | QKV backward (Wq^T, Wk^T, Wv^T) |
|
||||
|
||||
#### `CkptHdr`
|
||||
Checkpoint file header (128 bytes, version 2).
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `magic` | `int` | `0x424C5A54` ("BLZT") |
|
||||
| `version` | `int` | 2 |
|
||||
| `step`, `total_steps` | `int` | Training progress |
|
||||
| `n_layers`, `vocab_size`, `dim`, `hidden_dim`, `n_heads`, `seq_len` | `int` | Model shape |
|
||||
| `lr`, `loss` | `float` | Learning rate, last loss |
|
||||
| `cum_compile`, `cum_train`, `cum_wall` | `double` | Cumulative timing (ms) |
|
||||
| `cum_steps`, `cum_batches` | `int` | Cumulative counters |
|
||||
| `adam_t` | `int` | Adam timestep (for bias correction) |
|
||||
| `pad[3]` | `int` | Alignment padding |
|
||||
|
||||
#### `Llama2Config`
|
||||
Header from llama2.c model files (7 ints): `dim`, `hidden_dim`, `n_layers`, `n_heads`, `n_kv_heads`, `vocab_size`, `seq_len`.
|
||||
|
||||
### Global Variables
|
||||
|
||||
| Name | Type | Description |
|
||||
|------|------|-------------|
|
||||
| `g_D` | `Class` | `_ANEInMemoryModelDescriptor` ObjC class |
|
||||
| `g_I` | `Class` | `_ANEInMemoryModel` ObjC class |
|
||||
| `g_AR` | `Class` | `_ANERequest` ObjC class |
|
||||
| `g_AIO` | `Class` | `_ANEIOSurfaceObject` ObjC class |
|
||||
| `g_tb` | `mach_timebase_info_data_t` | Mach time base for timing |
|
||||
| `g_compile_count` | `int` | Running count of ANE compiles |
|
||||
|
||||
### Functions
|
||||
|
||||
| Function | Returns | Description |
|
||||
|----------|---------|-------------|
|
||||
| `ane_init(void)` | `void` | Load AppleNeuralEngine.framework, resolve 4 private class references |
|
||||
| `tb_ms(uint64_t t)` | `double` | Convert Mach absolute time to milliseconds |
|
||||
| `adam_alloc(size_t n)` | `AdamState` | Allocate zeroed first/second moment buffers for n parameters |
|
||||
| `adam_free(AdamState *s)` | `void` | Free an AdamState's buffers |
|
||||
| `layer_weights_alloc(void)` | `LayerWeights` | Allocate all weight matrices for one layer |
|
||||
| `layer_weights_free(LayerWeights *w)` | `void` | Free all weight matrices for one layer |
|
||||
| `layer_adam_alloc(void)` | `LayerAdam` | Allocate Adam state for all weights in one layer |
|
||||
| `layer_adam_free(LayerAdam *a)` | `void` | Free Adam state for one layer |
|
||||
| `layer_acts_alloc(void)` | `LayerActs` | Allocate all activation buffers for one layer |
|
||||
| `layer_acts_free(LayerActs *a)` | `void` | Free all activation buffers for one layer |
|
||||
| `layer_grads_alloc(void)` | `LayerGrads` | Allocate zeroed gradient accumulators for one layer |
|
||||
| `layer_grads_zero(LayerGrads *g)` | `void` | Zero all gradient accumulators (between accumulation steps) |
|
||||
| `layer_grads_free(LayerGrads *g)` | `void` | Free gradient accumulators for one layer |
|
||||
|
||||
---
|
||||
|
||||
## stories_io.h
|
||||
|
||||
IOSurface creation, fp16/fp32 conversion, weight blob building, and ANE kernel compile/run.
|
||||
|
||||
**Depends on**: `stories_config.h`, `<arm_neon.h>`
|
||||
|
||||
### Functions
|
||||
|
||||
| Function | Returns | Description |
|
||||
|----------|---------|-------------|
|
||||
| `make_surface(size_t bytes)` | `IOSurfaceRef` | Create a 1D IOSurface with given byte allocation |
|
||||
| `build_blob(const float *w, int rows, int cols)` | `NSData*` | Build fp16 weight blob (128B header + row-major fp16 data) from fp32 weights |
|
||||
| `build_blob_t(const float *w, int rows, int cols)` | `NSData*` | Build fp16 weight blob with transposed layout (col-major fp16 from row-major fp32) |
|
||||
| `build_blob_fp16(_Float16 *d, int cnt)` | `NSData*` | Build weight blob from pre-existing fp16 data (no conversion) |
|
||||
| `cvt_f16_f32(float *dst, const _Float16 *src, int n)` | `void` | NEON-vectorized fp16-to-fp32 conversion (8-wide SIMD) |
|
||||
| `cvt_f32_f16(_Float16 *dst, const float *src, int n)` | `void` | NEON-vectorized fp32-to-fp16 conversion (8-wide SIMD) |
|
||||
| `io_write_fp16(IOSurfaceRef s, const float *data, int channels, int sp)` | `void` | Write fp32 data to IOSurface as fp16 in channel-first `[C,S]` layout |
|
||||
| `io_read_fp16(IOSurfaceRef s, float *data, int ch_off, int channels, int sp)` | `void` | Read fp16 data from IOSurface at channel offset, convert to fp32 |
|
||||
| `io_copy(IOSurfaceRef dst, int dst_ch, IOSurfaceRef src, int src_ch, int channels, int sp)` | `void` | Copy fp16 data between IOSurfaces at specified channel offsets |
|
||||
| `io_write_fp16_at(IOSurfaceRef s, int ch_off, const float *data, int channels, int sp)` | `void` | Write fp32 data to IOSurface at specific channel offset as fp16 |
|
||||
| `compile_kern_mil_w(NSString *mil, NSDictionary *weights, int ic_bytes, int oc_bytes)` | `Kern*` | Compile MIL text + weight dictionary into a loaded ANE kernel with IOSurfaces. Increments `g_compile_count`. |
|
||||
| `free_kern(Kern *k)` | `void` | Unload ANE model, release IOSurfaces, remove temp directory, free kernel |
|
||||
| `ane_run(Kern *k)` | `void` | Run a compiled ANE kernel on current IOSurface contents |
|
||||
|
||||
---
|
||||
|
||||
## stories_mil.h
|
||||
|
||||
MIL program generators for the 6 fused ANE kernel types. Each returns an `NSString*` containing the full MIL program text.
|
||||
|
||||
**Depends on**: `stories_io.h`
|
||||
|
||||
### Macros
|
||||
|
||||
| Macro | Description |
|
||||
|-------|-------------|
|
||||
| `MIL_HDR` | Standard MIL program header (version 1.3, buildInfo with coremlc/coremltools versions) |
|
||||
| `CONV_CONST` | Common conv parameter constants (pad_type, strides, pad, dilations, groups) |
|
||||
|
||||
### Functions
|
||||
|
||||
| Function | Returns | Description |
|
||||
|----------|---------|-------------|
|
||||
| `gen_sdpa_fwd_taps(void)` | `NSString*` | SDPA forward: RMSNorm + QKV + attention + Wo. Output: `concat(o_out, Q, K, V, attn_out, xnorm)` `[1, 6*DIM, 1, SEQ]` |
|
||||
| `gen_ffn_fwd_taps(void)` | `NSString*` | FFN forward: RMSNorm + W1/W3 + SiLU + W2. Output: `concat(ffn_out, h1, h3, silu_out, x2norm)` `[1, 2*DIM+3*HIDDEN, 1, SEQ]` |
|
||||
| `gen_ffn_bwd(void)` | `NSString*` | FFN backward: Input `concat(dffn, h1, h3)`. Output: `concat(dx, dh1, dh3)` `[1, DIM+2*HIDDEN, 1, SEQ]` |
|
||||
| `gen_qkvb(void)` | `NSString*` | QKV backward: Input `concat(dQ, dK, dV)`. Output: `dx` `[1, DIM, 1, SEQ]` |
|
||||
| `gen_sdpa_bwd1(void)` | `NSString*` | SDPA backward part 1: Input `concat(Q, K, V, dx2)`. Output: `concat(dV, probs, dP)` `[1, DIM+2*SCORE_CH, 1, SEQ]` |
|
||||
| `gen_sdpa_bwd2(void)` | `NSString*` | SDPA backward part 2: Input `concat(probs, dP, Q, K)`. Output: `concat(dQ, dK)` `[1, 2*DIM, 1, SEQ]` |
|
||||
| `get_mask_blob(void)` | `NSData*` | Lazily build and cache causal attention mask as fp16 blob. Lower-triangular 0, upper -65504. |
|
||||
|
||||
### Global Variables
|
||||
|
||||
| Name | Type | Description |
|
||||
|------|------|-------------|
|
||||
| `g_mask_blob` | `NSData*` | Cached causal mask blob (built on first call to `get_mask_blob`) |
|
||||
|
||||
---
|
||||
|
||||
## stories_cpu_ops.h
|
||||
|
||||
CPU-side operations using Accelerate framework (vDSP, vvrsqrtf, vvexpf).
|
||||
|
||||
**Depends on**: `stories_config.h`
|
||||
|
||||
### Functions
|
||||
|
||||
| Function | Returns | Description |
|
||||
|----------|---------|-------------|
|
||||
| `rmsnorm(float *out, const float *x, const float *w, int d, int S)` | `void` | RMSNorm forward: `out = x * rsqrt(mean(x^2) + eps) * w`. Vectorized via vDSP. Layout: channel-first `[d, S]`. |
|
||||
| `rmsnorm_bwd(float *dx, float *dw, const float *dy, const float *x, const float *w, int d, int S)` | `void` | RMSNorm backward: computes `dx` (input gradient) and accumulates `dw` (scale gradient). |
|
||||
| `adam_update(float *w, const float *g, AdamState *s, int t, float lr, float b1, float b2, float eps)` | `void` | Adam optimizer step with bias correction. Updates weights in-place. `t` is the timestep for bias correction. |
|
||||
| `cross_entropy_loss(float *dlogits, const float *logits, const uint16_t *targets, int V, int S)` | `float` | Compute mean cross-entropy loss. Writes `dlogits = (softmax(logits) - one_hot(targets)) / S`. Column-major `[V, S]` layout. Uses vDSP transpose + vvexpf for vectorized softmax. |
|
||||
| `embed_lookup(float *x, const float *embed, const uint16_t *tokens, int dim, int seq)` | `void` | Embedding forward: gather rows from `embed[VOCAB, DIM]` into channel-first `x[DIM, SEQ]`. |
|
||||
| `embed_backward(float *d_embed, const float *dx, const uint16_t *tokens, int dim, int seq)` | `void` | Embedding backward: scatter-add `dx` back into embedding table gradient `d_embed`. |
|
||||
|
||||
### Global Variables
|
||||
|
||||
| Name | Type | Description |
|
||||
|------|------|-------------|
|
||||
| `g_rms_tmp` | `float*` | Lazily-allocated scratch buffer for RMSNorm (size SEQ) |
|
||||
|
||||
---
|
||||
|
||||
## ane_runtime.h
|
||||
|
||||
Generalized ANE wrapper with multi-input/output support. Used in bridge, tests, and newer training variants.
|
||||
|
||||
### Structs
|
||||
|
||||
#### `ANEKernel`
|
||||
Generalized kernel handle supporting multiple inputs and outputs.
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `model` | `id` | `_ANEInMemoryModel` instance |
|
||||
| `ioInputs` | `IOSurfaceRef*` | Array of input IOSurfaces |
|
||||
| `ioOutputs` | `IOSurfaceRef*` | Array of output IOSurfaces |
|
||||
| `request` | `id` | `_ANERequest` instance |
|
||||
| `tmpDir` | `NSString*` | Temp directory for MIL/weights on disk |
|
||||
| `nInputs`, `nOutputs` | `int` | Number of I/O tensors |
|
||||
| `inputBytes`, `outputBytes` | `size_t*` | Byte sizes for each I/O tensor |
|
||||
|
||||
### Global Variables
|
||||
|
||||
| Name | Type | Description |
|
||||
|------|------|-------------|
|
||||
| `g_ANEDesc` | `Class` | `_ANEInMemoryModelDescriptor` |
|
||||
| `g_ANEInMem` | `Class` | `_ANEInMemoryModel` |
|
||||
| `g_ANEReq` | `Class` | `_ANERequest` |
|
||||
| `g_ANEIO` | `Class` | `_ANEIOSurfaceObject` |
|
||||
| `g_ane_loaded` | `bool` | Guard to avoid re-loading the framework |
|
||||
|
||||
### Functions
|
||||
|
||||
| Function | Returns | Description |
|
||||
|----------|---------|-------------|
|
||||
| `ane_init(void)` | `void` | Load AppleNeuralEngine.framework (idempotent), resolve 4 private ObjC classes |
|
||||
| `ane_create_surface(size_t bytes)` | `IOSurfaceRef` | Create a 1D IOSurface of given byte size |
|
||||
| `ane_compile(NSData *milText, NSData *weightData, int nInputs, size_t *inputSizes, int nOutputs, size_t *outputSizes)` | `ANEKernel*` | Full compile pipeline: build descriptor, compile MIL, load model, create IOSurfaces + request. Returns NULL on failure. |
|
||||
| `ane_write_input(ANEKernel *k, int idx, const void *data, size_t bytes)` | `void` | Write raw bytes to the idx-th input IOSurface (lock/memcpy/unlock) |
|
||||
| `ane_read_output(ANEKernel *k, int idx, void *data, size_t bytes)` | `void` | Read raw bytes from the idx-th output IOSurface (read-lock/memcpy/unlock) |
|
||||
| `ane_run_kernel(ANEKernel *k)` | `bool` | Run the compiled ANE kernel. Returns true on success. |
|
||||
| `ane_free(ANEKernel *k)` | `void` | Unload model, release all IOSurfaces, remove temp dir, free struct |
|
||||
|
||||
---
|
||||
|
||||
## ane_mil_gen.h
|
||||
|
||||
Composable MIL generation helpers for common patterns, plus weight blob builders.
|
||||
|
||||
### Functions
|
||||
|
||||
| Function | Returns | Description |
|
||||
|----------|---------|-------------|
|
||||
| `mil_build_weight_blob(const float *w, int out_ch, int in_ch)` | `NSData*` | Build fp16 weight blob with 128B header from fp32 row-major `[out_ch, in_ch]` weights |
|
||||
| `mil_gen_matmul(int in_ch, int out_ch, int spatial)` | `NSString*` | Generate MIL for matmul `y = W @ x` with both as runtime inputs. Includes fp32-to-fp16-to-fp32 casts. |
|
||||
| `mil_gen_conv(int in_ch, int out_ch, int spatial)` | `NSString*` | Generate MIL for conv-based linear with baked weights from blob file (inference-only) |
|
||||
| `mil_gen_qkv(int dim, int spatial)` | `NSString*` | Generate MIL for fused QKV: 3 parallel convs from single input, weights from concatenated blob |
|
||||
| `mil_build_qkv_weight_blob(const float *wq, const float *wk, const float *wv, int dim)` | `NSData*` | Build concatenated weight blob for fused QKV (3 chunks, each with 64B header + fp16 data) |
|
||||
| `mil_build_ffn_up_weight_blob(const float *w1, const float *w3, int hidden_dim, int dim)` | `NSData*` | Build concatenated weight blob for fused FFN up-projection (W1 + W3 chunks) |
|
||||
| `mil_gen_ffn_up(int dim, int hidden_dim, int spatial)` | `NSString*` | Generate MIL for fused FFN up: W1 + W3 parallel convs, outputs h1 and h3 |
|
||||
|
||||
---
|
||||
|
||||
## ane_rmsnorm_bwd.h
|
||||
|
||||
MIL generator for RMSNorm backward on ANE (used by `train_large_ane.m`).
|
||||
|
||||
**Depends on**: `stories_mil.h`
|
||||
|
||||
### Functions
|
||||
|
||||
| Function | Returns | Description |
|
||||
|----------|---------|-------------|
|
||||
| `gen_rmsnorm_bwd(void)` | `NSString*` | Generate MIL for RMSNorm backward. Input: `concat(dy, x)` as `[1, 2*DIM, 1, SEQ]`. Baked weight: RMSNorm scale `w[DIM]`. Output: `dx` as `[1, DIM, 1, SEQ]`. Note: `dw` (weight gradient) stays on CPU. |
|
||||
|
||||
---
|
||||
|
||||
## ane_classifier.h
|
||||
|
||||
MIL generators for classifier operations on ANE (used by `train_large_ane.m`).
|
||||
|
||||
**Depends on**: `stories_mil.h`
|
||||
|
||||
### Functions
|
||||
|
||||
| Function | Returns | Description |
|
||||
|----------|---------|-------------|
|
||||
| `gen_classifier_fwd(void)` | `NSString*` | Classifier forward: single 32000-output-channel conv. Input: `[1, DIM, 1, SEQ]`. Baked: embedding weights `[VOCAB, DIM, 1, 1]`. Output: `[1, VOCAB, 1, SEQ]`. |
|
||||
| `gen_classifier_bwd(void)` | `NSString*` | Classifier backward: `dx = embed^T @ dlogits`. Uses `matmul` op (not conv, since ANE rejects conv with 32000 input channels). Input: `[1, VOCAB, 1, SEQ]`. Baked: `embed^T [1, DIM, VOCAB]`. Output: `[1, DIM, 1, SEQ]`. |
|
||||
| `gen_softmax_vocab(void)` | `NSString*` | Softmax over VOCAB dimension: `softmax(x, axis=1)`. Input: `[1, VOCAB, 1, SEQ]`. Output: `[1, VOCAB, 1, SEQ]`. |
|
||||
| `gen_final_rmsnorm(void)` | `NSString*` | Final RMSNorm (standalone, not fused). Input: `[1, DIM, 1, SEQ]`. Baked: `rms_final[DIM]`. Output: `[1, DIM, 1, SEQ]`. |
|
||||
|
||||
---
|
||||
|
||||
## bridge/ane_bridge.h
|
||||
|
||||
C-callable bridge to ANE private APIs for Python ctypes integration.
|
||||
|
||||
### Types
|
||||
|
||||
| Type | Description |
|
||||
|------|-------------|
|
||||
| `ANEKernelHandle` | Opaque kernel handle (pointer to internal struct) |
|
||||
|
||||
### Functions
|
||||
|
||||
| Function | Returns | Description |
|
||||
|----------|---------|-------------|
|
||||
| `ane_bridge_init(void)` | `int` | Initialize ANE runtime (load private framework, resolve classes). Returns 0 on success, -1 on failure. |
|
||||
| `ane_bridge_compile(const char *mil_text, size_t mil_len, const uint8_t *weight_data, size_t weight_len, int n_inputs, const size_t *input_sizes, int n_outputs, const size_t *output_sizes)` | `ANEKernelHandle*` | Compile MIL text + single weight blob into ANE kernel. Returns NULL on failure. |
|
||||
| `ane_bridge_compile_multi_weights(const char *mil_text, size_t mil_len, const char **weight_names, const uint8_t **weight_datas, const size_t *weight_lens, int n_weights, int n_inputs, const size_t *input_sizes, int n_outputs, const size_t *output_sizes)` | `ANEKernelHandle*` | Compile MIL text + multiple named weight files. Weight names use `@model_path/` prefix convention. |
|
||||
| `ane_bridge_run(ANEKernelHandle *kernel)` | `bool` | Execute a compiled kernel on ANE. Returns true on success. |
|
||||
| `ane_bridge_write_input(ANEKernelHandle *kernel, int idx, const void *data, size_t bytes)` | `void` | Write data to kernel input IOSurface at index `idx` |
|
||||
| `ane_bridge_read_output(ANEKernelHandle *kernel, int idx, void *data, size_t bytes)` | `void` | Read data from kernel output IOSurface at index `idx` |
|
||||
| `ane_bridge_free(ANEKernelHandle *kernel)` | `void` | Unload model, release all IOSurfaces, remove temp dir, free handle |
|
||||
| `ane_bridge_get_compile_count(void)` | `int` | Get current compile count (for restart budgeting) |
|
||||
| `ane_bridge_reset_compile_count(void)` | `void` | Reset compile count to zero |
|
||||
| `ane_bridge_build_weight_blob(const float *src, int rows, int cols, size_t *out_len)` | `uint8_t*` | Build weight blob in ANE format (128B header + fp16). Caller must free via `ane_bridge_free_blob()`. |
|
||||
| `ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols, size_t *out_len)` | `uint8_t*` | Build transposed weight blob. Caller must free via `ane_bridge_free_blob()`. |
|
||||
| `ane_bridge_free_blob(void *ptr)` | `void` | Free a blob allocated by `ane_bridge_build_weight_blob*` |
|
||||
|
||||
---
|
||||
|
||||
## MIL Operation Reference
|
||||
|
||||
All MIL programs target `ios18` and use fp16 tensors in `[1, C, 1, S]` layout (or `[1, H, S, S]` for attention scores).
|
||||
|
||||
| Operation | MIL Syntax | Purpose |
|
||||
|-----------|-----------|---------|
|
||||
| `conv` | `conv(dilations=dl, groups=gr, pad=pd, pad_type=pt, strides=st, weight=W, x=xn)` | Linear projections (all Wq, Wk, Wv, Wo, W1, W2, W3). 1x1 conv = matmul. Weight shape: `[out_ch, in_ch, 1, 1]`. |
|
||||
| `matmul` | `matmul(transpose_x=tx, transpose_y=ty, x=a, y=b)` | Attention score computation (Q at K^T, scores at V, classifier backward). |
|
||||
| `softmax` | `softmax(axis=ax, x=ms)` | Attention weight normalization (`axis=-1`) and vocab softmax (`axis=1`). |
|
||||
| `mul` | `mul(x=a, y=b)` | Element-wise multiply: RMSNorm scaling, SiLU gating, attention scaling, softmax Jacobian. |
|
||||
| `add` | `add(x=a, y=b)` | Causal mask application, SiLU derivative `(1 + h*(1-sig))`, gradient accumulation. |
|
||||
| `sub` | `sub(x=a, y=b)` | SiLU derivative: `1 - sigmoid(h1)`, softmax backward: `dp - sum(P*dP)`. |
|
||||
| `sigmoid` | `sigmoid(x=h1)` | SiLU activation component (SiLU = x * sigmoid(x)). |
|
||||
| `pow` | `pow(x=ss3, y=nhalf)` | RMSNorm: `x^(-0.5)` = reciprocal sqrt. |
|
||||
| `reduce_sum` | `reduce_sum(x=sq, axes=rax, keep_dims=kd)` | RMSNorm: sum of squares along channel dim. Softmax backward: row-wise dot product. |
|
||||
| `reshape` | `reshape(shape=sh, x=xf)` | `[1,DIM,1,SEQ]` to `[1,HEADS,HD,SEQ]` for multi-head attention. Flatten attention scores. |
|
||||
| `transpose` | `transpose(perm=pm, x=q4)` | Permute `[0,1,3,2]`: swap spatial and head_dim for matmul compatibility. |
|
||||
| `concat` | `concat(axis=cax, interleave=cid, values=(a,b,c))` | Pack multiple outputs into single IOSurface ("taps"). Always `axis=1`, `interleave=false`. |
|
||||
| `slice_by_size` | `slice_by_size(x=x, begin=b, size=sz)` | Split concatenated inputs in backward kernels. `begin=[0,offset,0,0]`, `size=[1,channels,1,SEQ]`. |
|
||||
| `cast` | `cast(dtype=to_fp16, x=x)` | fp32-to-fp16 or fp16-to-fp32 precision conversion (used in ane_mil_gen.h generators). |
|
||||
| `const` | `const()[name=..., val=...]` | Declare scalar/tensor constants, conv parameters, weight blob references via `BLOBFILE`. |
|
||||
|
||||
---
|
||||
|
||||
## Weight Blob Format
|
||||
|
||||
### Single-weight blob (128 bytes header + data)
|
||||
|
||||
```
|
||||
Offset Size Content
|
||||
------ ----- -------
|
||||
0 1 0x01 (format marker)
|
||||
4 1 0x02 (format marker)
|
||||
5-63 59 zeros (global header padding)
|
||||
64 4 0xDEADBEEF (chunk magic, little-endian: EF BE AD DE)
|
||||
68 1 0x01 (chunk marker)
|
||||
72 4 uint32 data_size (total fp16 bytes = out_ch * in_ch * 2)
|
||||
80 4 uint32 data_offset (always 128 = 64 global + 64 chunk)
|
||||
84-127 44 zeros (chunk header padding)
|
||||
128+ N fp16 weight data, row-major [out_ch, in_ch]
|
||||
```
|
||||
|
||||
### Multi-weight blob (fused QKV, FFN up)
|
||||
|
||||
```
|
||||
Offset Content
|
||||
------ -------
|
||||
0-63 Global header (same as above)
|
||||
64 Chunk 0 header (64 bytes): magic, data_size, data_offset
|
||||
64+64 Chunk 0 data (fp16 weights)
|
||||
64+cs Chunk 1 header (64 bytes)
|
||||
64+cs+64 Chunk 1 data (fp16 weights)
|
||||
...
|
||||
```
|
||||
|
||||
Where `cs = 64 + n_elements * 2` (chunk header size + data size).
|
||||
|
||||
MIL references use `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(X))` where X is the chunk header offset within the file (64 for first chunk, 64+cs for second, etc.).
|
||||
|
|
@ -0,0 +1,370 @@
|
|||
# ANE Training -- System Architecture
|
||||
|
||||
Training neural networks directly on Apple's Neural Engine via reverse-engineered private APIs (`_ANEClient`, `_ANECompiler`). No CoreML training APIs, no Metal, no GPU.
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
ANE/
|
||||
+-- api_exploration.m # ANE private API discovery
|
||||
+-- inmem_basic.m # In-memory MIL compilation proof-of-concept
|
||||
+-- inmem_bench.m # ANE dispatch latency across model sizes
|
||||
+-- inmem_peak.m # Peak TFLOPS via deep conv chains (self-contained)
|
||||
+-- sram_bench.m # SRAM capacity probing (performance cliff detection)
|
||||
+-- sram_probe.m # Fine-grained SRAM size exploration
|
||||
+-- bridge/
|
||||
| +-- ane_bridge.h # C-callable API for Python ctypes
|
||||
| +-- ane_bridge.m # Bridge implementation
|
||||
| +-- Makefile # Builds libane_bridge.dylib
|
||||
| +-- libane_bridge.dylib # Pre-built shared library
|
||||
+-- training/
|
||||
| +-- train_large.m # Main: 12-layer training (CPU classifier)
|
||||
| +-- train_large_ane.m # Variant: classifier + softmax on ANE
|
||||
| +-- stories_config.h # Model constants, structs, alloc helpers
|
||||
| +-- stories_io.h # IOSurface I/O, NEON fp16, compile/run
|
||||
| +-- stories_mil.h # MIL generators for 6 fused ANE kernels
|
||||
| +-- stories_cpu_ops.h # vDSP RMSNorm, cross-entropy, Adam, embedding
|
||||
| +-- ane_runtime.h # Generalized ANE wrapper (multi-I/O)
|
||||
| +-- ane_mil_gen.h # Composable MIL helpers (conv, matmul, fused QKV)
|
||||
| +-- ane_rmsnorm_bwd.h # RMSNorm backward MIL (train_large_ane only)
|
||||
| +-- ane_classifier.h # Classifier/softmax MIL (train_large_ane only)
|
||||
| +-- forward.h # Gen1 forward pass (per-linear-kernel, all-CPU)
|
||||
| +-- backward.h # Gen1 backward pass (all-CPU reference)
|
||||
| +-- model.h # Gen1 Model struct, per-kernel compile
|
||||
| +-- dashboard.py # TUI monitoring (loss, power, text generation)
|
||||
| +-- tokenize.py # Extract pretokenized TinyStories data
|
||||
| +-- download_data.sh # Download TinyStories from HuggingFace
|
||||
| +-- Makefile # Build targets for training + tests
|
||||
| +-- test_*.m # 12 unit test files
|
||||
+-- docs/ # This documentation
|
||||
+-- scripts/ # Automation scripts
|
||||
```
|
||||
|
||||
## Two Generations of Training Code
|
||||
|
||||
### Gen1: `model.h` + `forward.h` + `backward.h`
|
||||
|
||||
The original correctness reference. One ANE kernel per linear projection (7 per layer + 1 classifier = 85 kernels total). Forward and backward are sequential all-CPU operations with optional ANE for the matmuls. No kernel fusion, no async overlap. Used for verifying Gen2's fused kernels produce correct results.
|
||||
|
||||
### Gen2: `train_large.m` + `stories_*.h` (production)
|
||||
|
||||
The performance-optimized system. Uses **5 fused ANE kernels per layer** (each performing multiple operations in a single dispatch). Weight gradients (`dW`) run asynchronously on CPU via GCD to overlap with ANE. All data is channel-first `[C, S]` fp16 on IOSurfaces.
|
||||
|
||||
The rest of this document describes Gen2.
|
||||
|
||||
---
|
||||
|
||||
## Model Configuration
|
||||
|
||||
Stories110M -- a Llama2-architecture transformer:
|
||||
|
||||
| Parameter | Value | Macro |
|
||||
|-----------|-------|-------|
|
||||
| Hidden dimension | 768 | `DIM` |
|
||||
| FFN intermediate | 2048 | `HIDDEN` |
|
||||
| Attention heads | 12 | `HEADS` |
|
||||
| Head dimension | 64 | `HD` |
|
||||
| Sequence length | 256 | `SEQ` |
|
||||
| Layers | 12 | `NLAYERS` |
|
||||
| Vocabulary | 32000 | `VOCAB` |
|
||||
| Total parameters | 109.53M | `TOTAL_PARAMS` |
|
||||
| Accumulation steps | 10 | `ACCUM_STEPS` |
|
||||
| Max ANE compiles | 100 | `MAX_COMPILES` |
|
||||
|
||||
---
|
||||
|
||||
## ANE Kernel Fusion Map
|
||||
|
||||
Each training step dispatches 6 kernel types per layer. 5 are weight-bearing (recompiled each batch), 1 is weight-free (compiled once).
|
||||
|
||||
| Kernel | Generator | Fused Operations | Baked Weights | Input Shape | Output Shape |
|
||||
|--------|-----------|-----------------|---------------|-------------|--------------|
|
||||
| `fwdAttn` | `gen_sdpa_fwd_taps()` | RMSNorm1, Wq/Wk/Wv conv, reshape, transpose, Q at K^T matmul, scale, causal mask, softmax, scores at V matmul, Wo conv | rms_att, Wq, Wk, Wv, Wo, mask | `[1,DIM,1,SEQ]` | `[1,6*DIM,1,SEQ]` |
|
||||
| `fwdFFN` | `gen_ffn_fwd_taps()` | RMSNorm2, W1/W3 conv, sigmoid, SiLU gating, W2 conv | rms_ffn, W1, W3, W2 | `[1,DIM,1,SEQ]` | `[1,2D+3H,1,SEQ]` |
|
||||
| `ffnBwd` | `gen_ffn_bwd()` | W2^T conv, SiLU derivative, W1^T/W3^T conv, add | W2^T, W1^T, W3^T | `[1,D+2H,1,SEQ]` | `[1,D+2H,1,SEQ]` |
|
||||
| `sdpaBwd1` | `gen_sdpa_bwd1()` | Wo^T conv, reshape, Q at K^T recompute, softmax, dV matmul, dP matmul | Wo^T, mask | `[1,4*DIM,1,SEQ]` | `[1,D+2*SC,1,SEQ]` |
|
||||
| `sdpaBwd2` | `gen_sdpa_bwd2()` | softmax Jacobian, scale, dQ=dS at K matmul, dK=dS^T at Q matmul | _(none)_ | `[1,2SC+2D,1,SEQ]` | `[1,2*DIM,1,SEQ]` |
|
||||
| `qkvBwd` | `gen_qkvb()` | Wq^T/Wk^T/Wv^T conv, sum | Wq^T, Wk^T, Wv^T | `[1,3*DIM,1,SEQ]` | `[1,DIM,1,SEQ]` |
|
||||
|
||||
Where D=DIM=768, H=HIDDEN=2048, SC=SCORE_CH=HEADS*SEQ=3072.
|
||||
|
||||
"Taps" in forward kernels: intermediate values (Q, K, V, attention output, norms) are concatenated onto the output via `concat(axis=1)` so backward kernels can read them without CPU recomputation.
|
||||
|
||||
---
|
||||
|
||||
## CPU vs ANE Operation Split
|
||||
|
||||
| Operation | Location | Reason |
|
||||
|-----------|----------|--------|
|
||||
| Embedding lookup/backward | CPU | Scatter/gather by token index |
|
||||
| RMSNorm forward | ANE | Fused into fwdAttn/fwdFFN kernels |
|
||||
| QKV projections | ANE | 1x1 conv = matmul |
|
||||
| Multi-head attention (SDPA) | ANE | Decomposed Q at K^T + mask + softmax + scores at V |
|
||||
| FFN (SwiGLU) | ANE | W1,W3 conv + sigmoid + gate + W2 conv |
|
||||
| Residual connections | CPU | Simple `vDSP_vadd` |
|
||||
| Final RMSNorm | CPU (or ANE in `_ane` variant) | Standalone, not fused with other ops |
|
||||
| Classifier matmul | CPU cblas (or ANE in `_ane` variant) | `[VOCAB,DIM] x [DIM,SEQ]` |
|
||||
| Cross-entropy + softmax | CPU (partially ANE in `_ane`) | Target indexing requires CPU |
|
||||
| dW weight gradients | CPU (async cblas) | Outer products, independent of backward data flow |
|
||||
| RMSNorm backward | CPU (or ANE in `_ane` variant) | vDSP vectorized |
|
||||
| Adam optimizer | CPU | In-place weight mutation |
|
||||
|
||||
---
|
||||
|
||||
## Training Step Swim-Lane Diagram
|
||||
|
||||
One complete training step showing CPU, ANE, and async GCD operations interleaved:
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant CPU
|
||||
participant ANE
|
||||
participant GCD as GCD Async Queue
|
||||
|
||||
Note over CPU: FORWARD PASS (per layer L=0..11)
|
||||
|
||||
CPU->>CPU: embed_lookup(tokens to x_cur)
|
||||
|
||||
loop Layer L = 0..11
|
||||
CPU->>CPU: wait for prior async dW
|
||||
CPU->>CPU: save layer_in, write fp16 to IOSurface
|
||||
CPU->>ANE: run fwdAttn kernel
|
||||
ANE-->>CPU: concat(o_out, Q, K, V, attn_out, xnorm)
|
||||
CPU->>CPU: read fp16 taps, residual add to x2
|
||||
|
||||
CPU->>CPU: write fp16 x2 to IOSurface
|
||||
CPU->>ANE: run fwdFFN kernel
|
||||
ANE-->>CPU: concat(ffn_out, h1, h3, silu_out, x2norm)
|
||||
CPU->>CPU: read fp16 taps, residual add to x_cur
|
||||
end
|
||||
|
||||
Note over CPU: CLASSIFIER + LOSS
|
||||
CPU->>CPU: rmsnorm(x_cur to x_final)
|
||||
CPU->>CPU: cblas_sgemm(embed x x_final to logits)
|
||||
CPU->>CPU: cross_entropy_loss(logits to loss, dlogits)
|
||||
|
||||
Note over CPU: BACKWARD PASS
|
||||
CPU->>CPU: cblas_sgemm(embed^T x dlogits to dy)
|
||||
CPU->>GCD: async dEmbed += dlogits x x_final^T
|
||||
CPU->>CPU: rmsnorm_bwd(dy to dx)
|
||||
|
||||
loop Layer L = 11..0
|
||||
Note over CPU,GCD: FFN Backward
|
||||
CPU->>CPU: write dffn + copy h1,h3 from fwd taps
|
||||
CPU->>ANE: run ffnBwd kernel
|
||||
ANE-->>CPU: concat(dx_ffn, dh1, dh3)
|
||||
CPU->>GCD: async dW2, dW1, dW3 accumulation
|
||||
|
||||
Note over CPU,GCD: RMSNorm2 Backward + Residual
|
||||
CPU->>CPU: rmsnorm_bwd, add residual gradient
|
||||
|
||||
Note over CPU,GCD: SDPA Backward
|
||||
CPU->>GCD: async dWo accumulation
|
||||
CPU->>CPU: copy Q,K,V from fwd taps, write dx2
|
||||
CPU->>ANE: run sdpaBwd1 kernel
|
||||
ANE-->>CPU: concat(dV, probs, dP)
|
||||
|
||||
CPU->>CPU: copy probs,dP,Q,K
|
||||
CPU->>ANE: run sdpaBwd2 kernel
|
||||
ANE-->>CPU: concat(dQ, dK)
|
||||
|
||||
CPU->>GCD: async dWq, dWk, dWv accumulation
|
||||
|
||||
Note over CPU,GCD: QKV Backward
|
||||
CPU->>CPU: copy dQ,dK,dV
|
||||
CPU->>ANE: run qkvBwd kernel
|
||||
ANE-->>CPU: dx_attn
|
||||
|
||||
Note over CPU,GCD: RMSNorm1 Backward + Residual
|
||||
CPU->>CPU: rmsnorm_bwd, add both skip gradients
|
||||
end
|
||||
|
||||
CPU->>CPU: dispatch_group_wait(all async dW)
|
||||
CPU->>CPU: embed_backward(dy to d_embed)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Async CPU/ANE Overlap Strategy
|
||||
|
||||
The key insight: **dW gradients (weight gradients) are independent of the backward data flow**. They are outer products `dW += dy x x^T` that only accumulate into gradient buffers. The data-path gradients (`dx`) flow backward through the network on ANE.
|
||||
|
||||
```
|
||||
Timeline for one backward layer:
|
||||
ANE: [ffnBwd] [sdpaBwd1] [sdpaBwd2] [qkvBwd]
|
||||
CPU: [dW_FFN (3x sgemm)] [dWo] [dWqkv (3x sgemm)]
|
||||
```
|
||||
|
||||
GCD serial dispatch queue `"dw_cblas"` ensures dW operations don't overlap each other (they share scratch buffers). The `dispatch_group_wait` at the start of each forward layer ensures async dW from the previous step's backward has finished before IOSurfaces are reused.
|
||||
|
||||
---
|
||||
|
||||
## Compile/Restart Lifecycle
|
||||
|
||||
The ANE runtime leaks resources internally, limiting compiles to ~119 per process. The system manages this with checkpoint-and-restart:
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
Start["Process starts (fresh or --resume)"] --> LoadCkpt{"--resume flag?"}
|
||||
LoadCkpt -->|Yes| Resume["Load checkpoint: weights, Adam state, step counter"]
|
||||
LoadCkpt -->|No| Init["Xavier init weights, zero Adam state"]
|
||||
Resume --> CompileCheck
|
||||
Init --> CompileCheck
|
||||
|
||||
CompileCheck{"g_compile_count + 60 > MAX_COMPILES?"} -->|Yes| SaveCheckpoint["Save checkpoint to ane_stories110M_ckpt.bin"]
|
||||
SaveCheckpoint --> FreeAll["Free all ANE kernels"]
|
||||
FreeAll --> RestartProcess["Re-launch process with --resume flag"]
|
||||
RestartProcess --> Start
|
||||
|
||||
CompileCheck -->|No| Compile["Compile 60 weight-bearing kernels (5 per layer x 12)"]
|
||||
Compile --> ZeroGrads["Zero gradient accumulators"]
|
||||
ZeroGrads --> AccumLoop
|
||||
|
||||
subgraph AccumLoop ["Gradient Accumulation (10 steps)"]
|
||||
SingleStep["Forward + Backward + async dW"] --> MoreSteps{"More accum steps?"}
|
||||
MoreSteps -->|Yes| SingleStep
|
||||
end
|
||||
|
||||
MoreSteps -->|No| WaitDW["dispatch_group_wait (all async dW)"]
|
||||
WaitDW --> ScaleGrad["Scale gradients by 1/ACCUM_STEPS"]
|
||||
ScaleGrad --> AdamUpdate["Adam update (mutates weights in-place)"]
|
||||
AdamUpdate --> FreeKernels["Free all weight-bearing kernels"]
|
||||
FreeKernels --> CompileCheck
|
||||
```
|
||||
|
||||
With `MAX_COMPILES=100` and 60 weight-bearing kernels per batch, only **1 batch** (10 accumulation steps) fits per process lifetime. The checkpoint preserves:
|
||||
|
||||
- Training step and total_steps
|
||||
- All weights and Adam (m, v) state per layer
|
||||
- Cumulative timing statistics
|
||||
- Adam timestep counter
|
||||
|
||||
---
|
||||
|
||||
## Data Flow Through One Layer
|
||||
|
||||
Tensor shapes as they flow through forward and backward passes:
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph fwdAttnKernel ["fwdAttn Kernel (ANE)"]
|
||||
xIn["x_in\n[1,768,1,256]"] --> RMS1["RMSNorm1"]
|
||||
RMS1 --> QKVConv["Wq,Wk,Wv conv\n[768,768,1,1]"]
|
||||
QKVConv --> ReshapeHeads["reshape\n[1,12,64,256]"]
|
||||
ReshapeHeads --> TransposeHeads["transpose\n[1,12,256,64]"]
|
||||
TransposeHeads --> QKT["Q x K^T\n[1,12,256,256]"]
|
||||
QKT --> ScaleMask["scale + mask\n+ softmax"]
|
||||
ScaleMask --> AV["scores x V\n[1,12,256,64]"]
|
||||
AV --> ReshapeBackFlat["reshape\n[1,768,1,256]"]
|
||||
ReshapeBackFlat --> WoConv["Wo conv\n[768,768,1,1]"]
|
||||
end
|
||||
|
||||
subgraph taps1 ["Taps via concat"]
|
||||
WoConv --> T1["o_out [768]"]
|
||||
QKVConv --> T2["Q,K,V [768 each]"]
|
||||
AV --> T3["attn_out [768]"]
|
||||
RMS1 --> T4["xnorm [768]"]
|
||||
end
|
||||
|
||||
subgraph cpuResid1 ["CPU"]
|
||||
T1 --> ResAdd1["x + o_out = x2"]
|
||||
end
|
||||
|
||||
subgraph fwdFFNKernel ["fwdFFN Kernel (ANE)"]
|
||||
ResAdd1 --> RMS2["RMSNorm2"]
|
||||
RMS2 --> W1W3["W1,W3 conv\n[2048,768,1,1]"]
|
||||
W1W3 --> SiLUGate["sigmoid + SiLU\n+ gating"]
|
||||
SiLUGate --> W2Conv["W2 conv\n[768,2048,1,1]"]
|
||||
end
|
||||
|
||||
subgraph taps2 ["Taps via concat"]
|
||||
W2Conv --> T5["ffn_out [768]"]
|
||||
W1W3 --> T6["h1,h3 [2048 each]"]
|
||||
SiLUGate --> T7["silu_out [2048]"]
|
||||
RMS2 --> T8["x2norm [768]"]
|
||||
end
|
||||
|
||||
subgraph cpuResid2 ["CPU"]
|
||||
T5 --> ResAdd2["x2 + ffn_out = x_next"]
|
||||
end
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## IOSurface Memory Layout
|
||||
|
||||
All tensors use channel-first `[1, C, 1, S]` fp16 layout on IOSurfaces, matching ANE's native format:
|
||||
|
||||
```
|
||||
IOSurface memory (contiguous fp16):
|
||||
channel_0: [pos_0, pos_1, ..., pos_255] (256 values)
|
||||
channel_1: [pos_0, pos_1, ..., pos_255]
|
||||
...
|
||||
channel_767: [pos_0, pos_1, ..., pos_255]
|
||||
```
|
||||
|
||||
Fused kernel outputs use `concat(axis=1)` to pack multiple tensors into a single IOSurface:
|
||||
|
||||
```
|
||||
fwdAttn output [1, 6*768, 1, 256]:
|
||||
channels 0-767: o_out (Wo projection output)
|
||||
channels 768-1535: Q (query projection)
|
||||
channels 1536-2303: K (key projection)
|
||||
channels 2304-3071: V (value projection)
|
||||
channels 3072-3839: attn_out (pre-Wo attention output)
|
||||
channels 3840-4607: xnorm (RMSNorm1 output)
|
||||
```
|
||||
|
||||
CPU reads specific taps via `io_read_fp16(surface, data, ch_offset, n_channels, spatial)`.
|
||||
|
||||
---
|
||||
|
||||
## Weight Blob Format
|
||||
|
||||
ANE weight blobs follow a binary format with a 128-byte header:
|
||||
|
||||
```
|
||||
Offset Size Content
|
||||
------ ----- -------
|
||||
0 1 0x01 (format marker)
|
||||
4 1 0x02 (format marker)
|
||||
5-63 59 zeros (padding)
|
||||
64 4 0xDEADBEEF (chunk magic, little-endian)
|
||||
68 1 0x01 (chunk marker)
|
||||
72 4 uint32 data_size (fp16 weight bytes)
|
||||
80 4 uint32 data_offset (always 128)
|
||||
84-127 44 zeros (padding)
|
||||
128+ N fp16 weight data, row-major [out_ch, in_ch]
|
||||
```
|
||||
|
||||
Multi-weight blobs (fused QKV, FFN up) concatenate chunks: `[64B global header] [64B chunk0 header] [chunk0 data] [64B chunk1 header] [chunk1 data] ...`
|
||||
|
||||
MIL programs reference weights via `BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(64))` where offset 64 points to the chunk header within the file.
|
||||
|
||||
---
|
||||
|
||||
## Key Constraints
|
||||
|
||||
| Constraint | Impact | Workaround |
|
||||
|-----------|--------|------------|
|
||||
| ~119 compile limit per process | ANE compiler leaks resources | `checkpoint + re-launch with --resume` |
|
||||
| Weights baked at compile time | Cannot hot-swap weights; must recompile | Gradient accumulation amortizes compile cost |
|
||||
| SDPA ignores `attn_mask` | Causal attention cannot use native SDPA mask | Decompose into Q at K^T + explicit mask + softmax + scores at V |
|
||||
| ANE SRAM capacity ~32 MB | Large weight matrices spill to DRAM | Performance cliff above ~3072 channels |
|
||||
| 32000 input channels rejected | ANE refuses conv with VOCAB input channels | Classifier backward uses `matmul` op with reshape instead of conv |
|
||||
| fp16 compute only | Precision limited on ANE | fp32 on CPU for loss, Adam; fp16 for ANE forward/backward |
|
||||
|
||||
---
|
||||
|
||||
## `train_large.m` vs `train_large_ane.m`
|
||||
|
||||
`train_large_ane.m` moves additional operations from CPU to ANE:
|
||||
|
||||
| Operation | `train_large.m` | `train_large_ane.m` |
|
||||
|-----------|-----------------|---------------------|
|
||||
| Final RMSNorm | CPU (`rmsnorm()` via vDSP) | ANE (`gen_final_rmsnorm()`) |
|
||||
| Classifier forward | CPU (`cblas_sgemm`) | ANE (`gen_classifier_fwd()`, 32000-ch conv) |
|
||||
| Softmax | CPU (inside `cross_entropy_loss()`) | ANE (`gen_softmax_vocab()`) |
|
||||
| Per-layer RMSNorm backward | CPU (`rmsnorm_bwd()` via vDSP) | ANE (`gen_rmsnorm_bwd()`) |
|
||||
|
||||
This increases compile budget pressure: 86 weight-bearing kernels per batch (vs 60), leaving less headroom within MAX_COMPILES=100.
|
||||
|
|
@ -0,0 +1,253 @@
|
|||
# ANE Training -- Benchmarks and Tests Guide
|
||||
|
||||
All benchmarks and tests require **macOS 15+ on Apple Silicon** (tested on M4, M5).
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Build and run training benchmark (100 steps)
|
||||
cd training
|
||||
make train_large && ./train_large --steps 100
|
||||
|
||||
# Run the automated benchmark suite
|
||||
cd ..
|
||||
bash scripts/run_benchmarks.sh
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Training Benchmarks
|
||||
|
||||
### train_large (CPU classifier)
|
||||
|
||||
The main 12-layer Stories110M training loop with classifier on CPU.
|
||||
|
||||
| Item | Details |
|
||||
|------|---------|
|
||||
| **Purpose** | Full transformer training benchmark |
|
||||
| **Measures** | ms/step, ANE TFLOPS, ANE utilization %, per-component timing |
|
||||
| **Prerequisites** | Training data: `bash download_data.sh` (or runs on random data if absent) |
|
||||
| **Build** | `cd training && make train_large` |
|
||||
| **Run** | `./train_large --steps 100` |
|
||||
| **CLI flags** | `--steps N` (default 10000), `--lr F` (default 3e-4), `--resume` |
|
||||
|
||||
**Expected output:**
|
||||
|
||||
```
|
||||
ane=9.6 io=4.1 cls=9.1 elem=14.4 rms=0.1 cblas_wait=2.3 ms/step
|
||||
|
||||
=== Efficiency Report ===
|
||||
Total steps: 100
|
||||
Avg train: 107.0 ms/step
|
||||
ANE TFLOPS: 2.45 sustained
|
||||
ANE utilization: 15.5% of 15.8 TFLOPS
|
||||
```
|
||||
|
||||
### train_large_ane (ANE classifier)
|
||||
|
||||
Same training with classifier, softmax, and RMSNorm backward offloaded to ANE.
|
||||
|
||||
| Item | Details |
|
||||
|------|---------|
|
||||
| **Purpose** | Measure ANE-offloaded training (16% faster) |
|
||||
| **Build** | `cd training && make train_large_ane` |
|
||||
| **Run** | `./train_large_ane --steps 100` |
|
||||
|
||||
**Compare baseline vs ANE-offloaded:**
|
||||
|
||||
```bash
|
||||
make train_large && ./train_large --steps 100
|
||||
make train_large_ane && ./train_large_ane --steps 100
|
||||
```
|
||||
|
||||
### Dashboard (live monitoring)
|
||||
|
||||
```bash
|
||||
pip install blessed psutil numpy
|
||||
sudo python3 dashboard.py # live mode (needs powermetrics)
|
||||
sudo python3 dashboard.py --resume # attach to resumed training
|
||||
```
|
||||
|
||||
| Flag | Description |
|
||||
|------|-------------|
|
||||
| `--resume` | Resume from checkpoint |
|
||||
| `--infinite` | Train indefinitely |
|
||||
| `--no-powermetrics` | Disable power monitoring |
|
||||
| `--no-generate` | Disable text generation preview |
|
||||
| `--steps N` | Total steps (default 10000) |
|
||||
|
||||
---
|
||||
|
||||
## Root-Level Benchmark Scripts
|
||||
|
||||
All root-level scripts are standalone Objective-C programs. Common build pattern:
|
||||
|
||||
```bash
|
||||
xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML \
|
||||
-framework IOSurface -ldl -o <output> <source>.m
|
||||
```
|
||||
|
||||
### inmem_peak.m -- Peak TFLOPS (self-contained)
|
||||
|
||||
**No prerequisites.** Generates MIL and weight blobs programmatically.
|
||||
|
||||
| Item | Details |
|
||||
|------|---------|
|
||||
| **Purpose** | Maximum sustained TFLOPS via deep conv chains (32-256 layers deep) |
|
||||
| **Measures** | ms per run, TFLOPS, % peak across 10 configurations |
|
||||
| **Prerequisites** | None (self-contained MIL generation) |
|
||||
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_peak inmem_peak.m` |
|
||||
| **Run** | `./inmem_peak` |
|
||||
|
||||
**Expected output:**
|
||||
|
||||
```
|
||||
=== Programmatic MIL to In-Memory ANE Peak ===
|
||||
|
||||
Config W(MB) GFLOP ms/run TFLOPS %peak
|
||||
----------------------------------------------------------------------
|
||||
32x conv 512ch sp64 16.0 1.07 X.XXX ms Y.YY Z.Z%
|
||||
64x conv 512ch sp64 32.0 2.15 X.XXX ms Y.YY Z.Z%
|
||||
...
|
||||
```
|
||||
|
||||
### inmem_basic.m -- In-Memory Proof-of-Concept
|
||||
|
||||
| Item | Details |
|
||||
|------|---------|
|
||||
| **Purpose** | End-to-end test: compile, load, run, benchmark using `_ANEInMemoryModel` |
|
||||
| **Prerequisites** | Pre-built mlpackage at `/tmp/ane_sram_256ch_64sp.mlpackage` |
|
||||
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_basic inmem_basic.m` |
|
||||
| **Run** | `./inmem_basic` |
|
||||
|
||||
### inmem_bench.m -- Dispatch Latency
|
||||
|
||||
| Item | Details |
|
||||
|------|---------|
|
||||
| **Purpose** | ANE dispatch latency across 6 model sizes (256-4096 channels) |
|
||||
| **Measures** | ms per run, TFLOPS at each configuration |
|
||||
| **Prerequisites** | Pre-built mlpackages for all 6 configs |
|
||||
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o inmem_bench inmem_bench.m` |
|
||||
| **Run** | `./inmem_bench` |
|
||||
|
||||
### sram_bench.m -- SRAM Capacity Probe
|
||||
|
||||
| Item | Details |
|
||||
|------|---------|
|
||||
| **Purpose** | Find SRAM capacity by detecting performance cliff at increasing weight sizes |
|
||||
| **Measures** | ms per run, TFLOPS, weight/activation/total memory at 9 configurations |
|
||||
| **Prerequisites** | Pre-built mlpackages for 9 configs (256-8192 channels) |
|
||||
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o sram_bench sram_bench.m` |
|
||||
| **Run** | `./sram_bench` |
|
||||
|
||||
### sram_probe.m -- Fine-Grained SRAM Exploration
|
||||
|
||||
| Item | Details |
|
||||
|------|---------|
|
||||
| **Purpose** | Finer-grained SRAM probe with 13 data points and GFLOPS/MB efficiency |
|
||||
| **Measures** | ms per run, TFLOPS, GFLOPS/MB with spilling indicators |
|
||||
| **Prerequisites** | Pre-built mlpackages for 13 configs |
|
||||
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o sram_probe sram_probe.m` |
|
||||
| **Run** | `./sram_probe` |
|
||||
|
||||
### api_exploration.m -- API Discovery
|
||||
|
||||
| Item | Details |
|
||||
|------|---------|
|
||||
| **Purpose** | Explore ANE private API surface (class methods, file structures, internal objects) |
|
||||
| **Prerequisites** | Pre-built mlpackage at `/tmp/ane_sram_1024ch_64sp.mlpackage` |
|
||||
| **Build** | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o api_exploration api_exploration.m` |
|
||||
| **Run** | `./api_exploration` |
|
||||
|
||||
---
|
||||
|
||||
## Test Files
|
||||
|
||||
### Tests with Makefile targets (cd training/)
|
||||
|
||||
| Test | Build | What It Tests |
|
||||
|------|-------|---------------|
|
||||
| `test_rmsnorm_bwd` | `make test_rmsnorm_bwd` | RMSNorm backward on ANE vs CPU reference. PASS: max diff < 0.05, mean < 0.01. Benchmarks 100 runs. |
|
||||
| `test_classifier` | `make test_classifier` | 4-part: final RMSNorm, classifier forward (32000-ch conv), softmax over VOCAB, classifier backward. |
|
||||
| `test_weight_reload` | `make test_weight_reload` | Tests if weights can be hot-swapped by overwriting blob files + unload/reload. Key finding: NO, weights are baked. |
|
||||
| `test_perf_stats` | `make test_perf_stats` | Probes `_ANEPerformanceStats` class methods, properties, and instantiation. Tests perfStats in `_ANERequest`. |
|
||||
| `test_qos_sweep` | `make test_qos_sweep` | QoS parameter sweep (0-63) across compile, load, run. Finding: no measurable latency difference. |
|
||||
| `test_ane_advanced` | `make test_ane_advanced` | Probes SharedEvents, weightsBuffer IOSurface, procedureIndex, ChainingRequest. Enumerates all 67 ANE classes. |
|
||||
|
||||
Build all probe tests at once: `make probes`
|
||||
|
||||
### Tests without Makefile targets (manual build)
|
||||
|
||||
| Test | Build Command | What It Tests |
|
||||
|------|---------------|---------------|
|
||||
| `test_ane_causal_attn` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_ane_causal_attn test_ane_causal_attn.m` | Decomposed causal attention: Q at K^T on ANE, mask+softmax on CPU, scores at V on ANE |
|
||||
| `test_ane_sdpa5` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_ane_sdpa5 test_ane_sdpa5.m` | 4 approaches to causal masking with `scaled_dot_product_attention` |
|
||||
| `test_conv_attn3` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_conv_attn3 test_conv_attn3.m` | Grouped conv approach to attention (K,V baked as conv weights) |
|
||||
| `test_full_fused` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML -framework IOSurface -ldl -o test_full_fused test_full_fused.m` | Full fused attention + FFN in single MIL dispatch at DIM=768, HEADS=12, SEQ=64 |
|
||||
| `test_fused_qkv` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_fused_qkv test_fused_qkv.m` | Fused QKV (3 convs + concat in one dispatch) vs separate dispatches |
|
||||
| `test_fused_bwd` | `xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl -o test_fused_bwd test_fused_bwd.m` | Fused backward: slice_by_size + 2 convs + add in one kernel |
|
||||
|
||||
---
|
||||
|
||||
## Bridge Library
|
||||
|
||||
```bash
|
||||
cd bridge
|
||||
make # Build libane_bridge.dylib
|
||||
make test # Build and link test_bridge
|
||||
./test_bridge # Run bridge tests
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Known Results
|
||||
|
||||
### M4 (from README)
|
||||
|
||||
**Single-layer (dim=768, seq=512):**
|
||||
|
||||
| Optimization | ms/step | ANE utilization |
|
||||
|---|---|---|
|
||||
| Baseline (vDSP transpose) | 33.5 | 3.1% |
|
||||
| Channel-first layout | 20.3 | 5.2% |
|
||||
| vDSP vectorized RMSNorm | 14.2 | 7.4% |
|
||||
| GCD async cblas overlap | 11.4 | 9.2% |
|
||||
| ANE RMSNorm fusion | 11.4 | 9.2% |
|
||||
| Wo^T fusion (7 to 6 kernels) | 11.4 | 9.2% |
|
||||
| Deferred cblas wait | **9.3** | **11.2%** |
|
||||
|
||||
**Full Stories110M (12 layers):**
|
||||
|
||||
| Component | Time (ms/step) |
|
||||
|-----------|---------------|
|
||||
| ANE runs | 9.6 |
|
||||
| IO (fp16 conversion) | 4.1 |
|
||||
| Classifier (cblas) | 9.1 |
|
||||
| Cross-entropy + residuals | 14.4 |
|
||||
| RMSNorm | 0.1 |
|
||||
| **Total** | **~107** |
|
||||
|
||||
### M5 Probe Results (from m5result.md)
|
||||
|
||||
**Machine**: Apple M5, macOS 26.3, ANE Family H16 (same as M4)
|
||||
|
||||
- **Weight reload**: FAIL -- weights baked at compile time, cannot be overwritten
|
||||
- **QoS sweep**: All QoS 0-63 work, no measurable latency difference
|
||||
- **Performance stats**: `_ANEPerformanceStats` class exists, `alloc/init` returns nil (needs factory methods)
|
||||
- **weightsBuffer IOSurface**: Does NOT override compiled weights
|
||||
- **ChainingRequest**: Exists with loopback and pipeline support -- most promising for utilization improvement
|
||||
|
||||
---
|
||||
|
||||
## Timing Metrics Key
|
||||
|
||||
| Metric | What it measures |
|
||||
|--------|-----------------|
|
||||
| `ane` | ANE kernel runs (all 6 kernels per layer x 12 layers) |
|
||||
| `io` | fp16-to-fp32 IOSurface data transfer (NEON conversion) |
|
||||
| `cls` | Classifier matmul (CPU cblas_sgemm) |
|
||||
| `elem` | Embedding lookup, residual adds, cross-entropy |
|
||||
| `rms` | RMSNorm forward/backward (CPU vDSP) |
|
||||
| `cblas_wait` | Time waiting for async dW gradient sgemms to complete |
|
||||
|
|
@ -0,0 +1,156 @@
|
|||
# ANE Benchmark Results: Apple M4 Max
|
||||
|
||||
**Date**: March 3, 2026
|
||||
**Machine**: Mac16,5 (MacBook Pro, Apple M4 Max)
|
||||
**macOS**: 26.2
|
||||
**ANE Peak**: 15.8 TFLOPS (theoretical)
|
||||
|
||||
## Training Performance
|
||||
|
||||
### train_large (CPU classifier path)
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Model | Stories110M (12 layers, dim=768, hidden=2048) |
|
||||
| Kernels | 72 (60 weight-bearing + 12 static sdpaBwd2) |
|
||||
| Avg step time | 72.4 ms/step |
|
||||
| ANE TFLOPS | 1.29 sustained |
|
||||
| Total TFLOPS | 2.41 (ANE+CPU) |
|
||||
| ANE utilization | 8.1% of 15.8 TFLOPS |
|
||||
| Compile time | 79.7% of wall time |
|
||||
| Train time | 16.4% of wall time |
|
||||
|
||||
### train_large_ane (ANE-offloaded classifier)
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Model | Stories110M (same as above) |
|
||||
| Kernels | 99 (86 weight-bearing + 13 static) |
|
||||
| Avg step time | 62.9 ms/step |
|
||||
| ANE TFLOPS | 1.68 sustained |
|
||||
| Total TFLOPS | 2.77 (ANE+CPU) |
|
||||
| ANE utilization | 10.6% of 15.8 TFLOPS |
|
||||
| Compile time | 84.5% of wall time |
|
||||
| Train time | 12.5% of wall time |
|
||||
|
||||
**Step time breakdown (ms/step, ANE classifier path):**
|
||||
|
||||
| Component | Time (ms) | Description |
|
||||
|-----------|-----------|-------------|
|
||||
| ane | 10-12 | ANE kernel dispatch + evaluation |
|
||||
| elem | 12-13 | Elementwise ops (residuals, activations) |
|
||||
| cls | 5-6 | Classifier forward + backward |
|
||||
| io | 3-5 | IOSurface data transfers |
|
||||
| rms | 0.1 | RMSNorm |
|
||||
| cblas_wait | 0.0 | BLAS sync overhead |
|
||||
|
||||
## Programmatic MIL Peak TFLOPS
|
||||
|
||||
```
|
||||
Config W(MB) GFLOP ms/eval TFLOPS
|
||||
----------------------------------------------------------------------
|
||||
32x conv 512ch sp64 16.0 1.07 0.408 ms 2.63
|
||||
48x conv 512ch sp64 24.0 1.61 0.262 ms 6.15
|
||||
64x conv 512ch sp64 32.0 2.15 0.244 ms 8.80
|
||||
96x conv 512ch sp64 48.0 3.22 0.326 ms 9.89
|
||||
128x conv 512ch sp64 64.0 4.29 0.385 ms 11.14
|
||||
64x conv 256ch sp64 8.0 0.54 0.365 ms 1.47
|
||||
128x conv 256ch sp64 16.0 1.07 0.454 ms 2.37
|
||||
256x conv 256ch sp64 32.0 2.15 0.351 ms 6.11
|
||||
64x conv 384ch sp64 18.0 1.21 0.429 ms 2.82
|
||||
128x conv 384ch sp64 36.0 2.42 0.354 ms 6.82
|
||||
```
|
||||
|
||||
**Peak observed: 11.14 TFLOPS** (128x conv 512ch sp64, 64 MB weights)
|
||||
|
||||
## In-Memory ANE Benchmark (via mlpackage)
|
||||
|
||||
```
|
||||
Config W (MB) ms/eval TFLOPS
|
||||
---------------------------------------------
|
||||
256ch x64sp 0.1 0.319 ms 0.03
|
||||
512ch x64sp 0.5 0.357 ms 0.09
|
||||
1024ch x64sp 2.0 0.457 ms 0.29
|
||||
2048ch x64sp 8.0 0.254 ms 2.11
|
||||
3072ch x64sp 18.0 0.389 ms 3.10
|
||||
4096ch x64sp 32.0 1.148 ms 1.87
|
||||
```
|
||||
|
||||
## SRAM Probe Results
|
||||
|
||||
### Coarse Probe (varying channels + spatial)
|
||||
|
||||
```
|
||||
Config W (MB) Act(MB) Tot(MB) ms/eval TFLOPS
|
||||
--------------------------------------------------------------------------
|
||||
256ch x 64sp 0.1 0.03 0.2 0.378 ms 0.02
|
||||
512ch x 64sp 0.5 0.06 0.6 0.389 ms 0.09
|
||||
1024ch x 64sp 2.0 0.12 2.2 0.392 ms 0.34
|
||||
2048ch x 64sp 8.0 0.25 8.5 0.218 ms 2.47
|
||||
3072ch x 64sp 18.0 0.38 18.8 0.396 ms 3.05
|
||||
4096ch x 64sp 32.0 0.50 33.0 1.116 ms 1.92
|
||||
5120ch x 64sp 50.0 0.62 51.2 0.767 ms 4.38
|
||||
6144ch x 64sp 72.0 0.75 73.5 0.872 ms 5.54
|
||||
8192ch x 32sp 128.0 0.50 129.0 4.195 ms 1.02
|
||||
```
|
||||
|
||||
### Fine Probe (spatial=64, weights only)
|
||||
|
||||
```
|
||||
Channels W (MB) ms/eval TFLOPS GFLOPS/MB
|
||||
--------------------------------------------------------------
|
||||
256 ch 0.1 0.378 ms 0.02 177.7
|
||||
512 ch 0.5 0.431 ms 0.08 155.6
|
||||
1024 ch 2.0 0.411 ms 0.33 163.5
|
||||
1536 ch 4.5 0.493 ms 0.61 136.1
|
||||
2048 ch 8.0 0.410 ms 1.31 163.9
|
||||
2560 ch 12.5 0.237 ms 3.53 282.6 <-- peak efficiency
|
||||
3072 ch 18.0 0.335 ms 3.60 200.1
|
||||
3584 ch 24.5 0.414 ms 3.97 162.1
|
||||
4096 ch 32.0 1.134 ms 1.89 59.2 <-- spilling
|
||||
4608 ch 40.5 0.563 ms 4.83 119.2
|
||||
5120 ch 50.0 0.659 ms 5.09 101.8
|
||||
6144 ch 72.0 0.844 ms 5.73 79.5 <-- spilling
|
||||
8192 ch 128.0 4.203 ms 1.02 8.0 <-- catastrophic spilling
|
||||
```
|
||||
|
||||
### SRAM Analysis
|
||||
|
||||
The M4 Max ANE SRAM appears to be approximately **24-32 MB**:
|
||||
|
||||
- **Peak efficiency** at 2560ch (12.5 MB weights): 282.6 GFLOPS/MB, 3.53 TFLOPS
|
||||
- **First spill** at 4096ch (32.0 MB): drops to 59.2 GFLOPS/MB (1.89 TFLOPS)
|
||||
- **Catastrophic** at 8192ch (128.0 MB): 8.0 GFLOPS/MB (1.02 TFLOPS)
|
||||
|
||||
The 4608ch recovery (4.83 TFLOPS despite 40.5 MB weights) suggests the ANE may use tiling strategies for some weight configurations.
|
||||
|
||||
Training kernels (dim=768, weight matrices ~1.2 MB fp16 each) stay well within the SRAM budget.
|
||||
|
||||
## Known Test Results
|
||||
|
||||
| Test | Status | Notes |
|
||||
|------|--------|-------|
|
||||
| test_rmsnorm_bwd | PASS | ANE-accelerated RMSNorm backward |
|
||||
| test_classifier | PASS | 4 tests passed; ANE backward 3x slower than CPU cblas for matmul |
|
||||
| test_weight_reload | FAIL (expected) | ANE bakes weights at compile time; IOSurface override doesn't work |
|
||||
| test_perf_stats | PASS | _ANEPerformanceStats API accessible |
|
||||
| test_qos_sweep | PASS | QoS parameter has no measurable effect on latency |
|
||||
| test_ane_advanced | PASS | Advanced ANE operations verified |
|
||||
| inmem_basic | PASS | In-memory compilation and execution verified |
|
||||
| inmem_bench | PASS | Multi-config benchmarks via mlpackage |
|
||||
| inmem_peak | PASS | Peak TFLOPS measurement via programmatic MIL |
|
||||
| sram_bench | PASS | SRAM capacity probing |
|
||||
| sram_probe | PASS | Fine-grained SRAM spilling detection |
|
||||
|
||||
## Reproducing
|
||||
|
||||
```bash
|
||||
cd scripts && bash run_benchmarks.sh
|
||||
```
|
||||
|
||||
The benchmark script auto-generates required `.mlpackage` models (needs Python 3.11-3.13 with `coremltools`).
|
||||
|
||||
Override training data paths:
|
||||
```bash
|
||||
ANE_MODEL_PATH=/path/to/stories110M.bin ANE_DATA_PATH=/path/to/data.bin ./train_large
|
||||
```
|
||||
|
|
@ -0,0 +1,74 @@
|
|||
# Development Diary #001 — Initial Setup & Sicherheitsaudit
|
||||
**Datum:** 2026-03-02
|
||||
**Status:** Abgeschlossen
|
||||
|
||||
## Aufgaben
|
||||
|
||||
### 1. Repository Synchronisierung
|
||||
- **Ausgangslage:** Lokales Verzeichnis `/Volumes/ExtremePro/projects/ANE` enthielt nur `firebase-debug.log`
|
||||
- **Durchgeführt:**
|
||||
```bash
|
||||
git init
|
||||
git remote add origin https://github.com/maderix/ANE.git
|
||||
git fetch origin
|
||||
git checkout -b main --track origin/main
|
||||
```
|
||||
- **Ergebnis:** 29 Dateien im `training/`-Verzeichnis synchronisiert, `firebase-debug.log` unberührt
|
||||
- **Commit-Stand:** HEAD = origin/main (up to date)
|
||||
|
||||
### 2. Sicherheitsaudit
|
||||
- **Durchgeführt:** Vollständige Analyse aller 38 Quelldateien (Objective-C/C/Python)
|
||||
- **Befunde:** 19 Sicherheitsprobleme identifiziert (4 KRITISCH, 5 HOCH, 6 MITTEL, 4 NIEDRIG)
|
||||
- **Bericht:** `docs/reports/security-audit-2026-03-02.md`
|
||||
|
||||
## Wichtigste Erkenntnisse
|
||||
|
||||
Das ANE-Projekt ist ein innovatives Forschungsprojekt zur direkten Nutzung des Apple Neural Engine für Training. Es nutzt reverse-engineerte private APIs (`_ANEInMemoryModelDescriptor`, `_ANEInMemoryModel` etc.) via `dlopen` + `objc_msgSend`.
|
||||
|
||||
**Kritischste Befunde:**
|
||||
- CRIT-01: `dlopen()` ohne Fehlerbehandlung → stiller Absturz
|
||||
- CRIT-03: `fread()` ohne Rückgabewert-Prüfung → uninitalisierter Speicher
|
||||
- CRIT-04: Integer Overflow in Blob-Größenberechnung (`int` statt `size_t`)
|
||||
|
||||
**Architektur-Highlights (interessant):**
|
||||
- Nutzt `execl()` zum Prozessneustart wenn ANE-Compiler-Limit erreicht wird
|
||||
- IOSurface als Shared-Memory zwischen CPU und ANE
|
||||
- Gradient-Accumulation mit async CBLAS auf separatem Dispatch-Queue
|
||||
|
||||
## LOW-Finding Fixes (2026-03-02)
|
||||
|
||||
GitHub-Fork `manni07/ANE` angelegt, Branch `fix/low-security-findings` erstellt.
|
||||
Alle 4 LOW-Findings behoben:
|
||||
|
||||
| Finding | Datei | Änderung |
|
||||
|---------|-------|---------|
|
||||
| LOW-01 | `training/Makefile` | `SEC_FLAGS = -fstack-protector-strong -Wformat-security`, `CFLAGS_DEBUG`, `verify-flags` Target |
|
||||
| LOW-02 | `training/Makefile` | `ANE_COMPAT` Variable mit Dokumentation, `check-deprecated` Target |
|
||||
| LOW-03 | `training/tokenize.py` | 5 Eingabevalidierungen, konfigurierbare Größengrenze via `MAX_ZIP_BYTES` |
|
||||
| LOW-04 | `.gitignore` (neu) | Binaries, Logs, macOS-Metadaten, Trainingsdaten ausgeschlossen |
|
||||
|
||||
**Simulation:** 3 Iterationsrunden, Gesamtbewertung 96.35% (alle Kriterien ≥ 95%)
|
||||
**Remote:** `origin=manni07/ANE`, `upstream=maderix/ANE`
|
||||
|
||||
## CRIT-Finding Fixes (2026-03-02)
|
||||
|
||||
Branch `fix/crit-security-findings` erstellt. Alle 4 CRIT-Findings behoben:
|
||||
|
||||
| Finding | Dateien | Kernänderung |
|
||||
|---------|---------|-------------|
|
||||
| CRIT-01 | `training/ane_runtime.h`, `training/stories_config.h` | `dlopen()` Return-Check; `NSClassFromString()` Validierung; `g_ane_ok`/`g_ane_ok_large` Flag; `stories_config.h` Re-Entry-Guard |
|
||||
| CRIT-02 | `training/ane_runtime.h`, `training/stories_io.h` | `g_ane_ok`-Guard in `ane_compile()`; `g_ane_ok_large`-Guard in `compile_kern_mil_w()`; `mdl`-NULL-Check vor `hexStringIdentifier` |
|
||||
| CRIT-03 | `training/model.h`, `training/train_large.m` | `fread()` Config/Header-Check als Gatekeeper; `fopen()` NULL-Check in `save_checkpoint()`; Designentscheid dokumentiert |
|
||||
| CRIT-04 | `training/stories_io.h`, `training/model.h` | `int`→`size_t` in allen `build_blob*` Funktionen; `(size_t)`-Cast in `malloc()`-Größen; `calloc()` NULL-Checks |
|
||||
|
||||
**Simulation:** 3 Iterationsrunden (CRIT-03 benötigte 3 Runs), Gesamtbewertung 96.15% (alle Kriterien ≥ 95%)
|
||||
**Branch:** `fix/crit-security-findings` auf `manni07/ANE`
|
||||
|
||||
## Status
|
||||
|
||||
| Finding-Typ | Anzahl | Status |
|
||||
|-------------|--------|--------|
|
||||
| KRITISCH (CRIT-01–04) | 4 | ✅ BEHOBEN |
|
||||
| HOCH (HIGH-01–05) | 5 | Offen |
|
||||
| MITTEL (MED-01–06) | 6 | Offen |
|
||||
| NIEDRIG (LOW-01–04) | 4 | ✅ BEHOBEN |
|
||||
|
|
@ -0,0 +1,419 @@
|
|||
# Sicherheitsaudit: ANE (Apple Neural Engine Training Framework)
|
||||
**Datum:** 2026-03-02
|
||||
**Repository:** https://github.com/maderix/ANE
|
||||
**Prüfer:** Claude Code (claude-sonnet-4-6)
|
||||
**Scope:** Vollständige Codebase-Analyse (38 Quelldateien, Objective-C/C/Python)
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Das ANE-Projekt implementiert Neural-Network-Training direkt auf Apples Neural Engine (ANE) via reverse-engineerter privater APIs. Es handelt sich um ein **Forschungs-/Experimental-Projekt** mit erheblichen inhärenten Sicherheitsrisiken durch die Nutzung undokumentierter Apple-Schnittstellen.
|
||||
|
||||
**Gesamtbewertung: HOHES RISIKO** für produktiven Einsatz.
|
||||
|
||||
| Kategorie | Anzahl |
|
||||
|-----------|--------|
|
||||
| KRITISCH | 4 |
|
||||
| HOCH | 5 |
|
||||
| MITTEL | 6 |
|
||||
| NIEDRIG | 4 |
|
||||
| **Gesamt**| **19** |
|
||||
|
||||
---
|
||||
|
||||
## KRITISCHE Befunde
|
||||
|
||||
### [CRIT-01] Keine Fehlerbehandlung bei `dlopen()` für Private Framework
|
||||
**Datei:** `training/ane_runtime.h:26`, `api_exploration.m:15`
|
||||
**Schweregrad:** KRITISCH
|
||||
**Status: BEHOBEN** (2026-03-02, Branch `fix/crit-security-findings`)
|
||||
|
||||
```objc
|
||||
// ane_runtime.h:26
|
||||
dlopen("/System/Library/PrivateFrameworks/AppleNeuralEngine.framework/AppleNeuralEngine", RTLD_NOW);
|
||||
```
|
||||
|
||||
**Problem:**
|
||||
- Der Rückgabewert von `dlopen()` wird nicht geprüft. Wenn das Framework nicht gefunden wird (nach macOS-Update oder auf nicht-Apple-Silicon-Hardware), gibt `dlopen()` NULL zurück — aber die Ausführung läuft weiter.
|
||||
- Alle nachfolgenden `NSClassFromString()`-Aufrufe geben dann ebenfalls NULL zurück.
|
||||
- `g_ane_loaded = true` wird gesetzt auch wenn das Laden fehlschlug.
|
||||
|
||||
**Folge:** Nullzeiger-Dereferenzierungen beim ersten API-Aufruf, unkontrollierter Absturz ohne aussagekräftige Fehlermeldung.
|
||||
|
||||
**Empfehlung:**
|
||||
```objc
|
||||
void *handle = dlopen("...", RTLD_NOW);
|
||||
if (!handle) {
|
||||
fprintf(stderr, "ANE framework not found: %s\n", dlerror());
|
||||
abort();
|
||||
}
|
||||
if (!g_ANEDesc || !g_ANEInMem || !g_ANEReq || !g_ANEIO) {
|
||||
fprintf(stderr, "ANE private classes not found (API changed?)\n");
|
||||
abort();
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### [CRIT-02] Unsichere `objc_msgSend`-Casts ohne Typ-Validierung
|
||||
**Dateien:** `training/ane_runtime.h:59-125`, `training/stories_io.h:90-117`
|
||||
**Schweregrad:** KRITISCH
|
||||
**Status: BEHOBEN** (2026-03-02, Branch `fix/crit-security-findings`)
|
||||
|
||||
```objc
|
||||
// ane_runtime.h:59-61
|
||||
id desc = ((id(*)(Class,SEL,id,id,id))objc_msgSend)(
|
||||
g_ANEDesc, @selector(modelWithMILText:weights:optionsPlist:),
|
||||
milText, wdict, nil);
|
||||
```
|
||||
|
||||
**Probleme:**
|
||||
1. Die Klasse `g_ANEDesc` könnte NULL sein (wenn `dlopen` fehlschlug, s. CRIT-01)
|
||||
2. Die Methodensignatur ist hardcodiert — bei Apple-API-Änderungen falsches Casting = undefiniertes Verhalten / Speicherkorruption
|
||||
3. Kein `@try/@catch` um mögliche Objective-C Exceptions abzufangen
|
||||
4. Globale Variablen `g_D`, `g_I`, `g_AIO`, `g_AR` in `stories_io.h` könnten NULL sein
|
||||
|
||||
**Folge:** Speicherkorruption, SIGBUS, unkontrollierter Absturz.
|
||||
|
||||
**Empfehlung:** Mindestens NULL-Checks vor jedem `objc_msgSend`:
|
||||
```objc
|
||||
if (!g_ANEDesc) { fprintf(stderr, "g_ANEDesc is NULL\n"); return NULL; }
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### [CRIT-03] `fread()`-Rückgabewerte nie geprüft — uninitalisierter Speicher
|
||||
**Dateien:** `training/model.h:81-146`, `training/train_large.m:17-55`
|
||||
**Schweregrad:** KRITISCH
|
||||
**Status: BEHOBEN** (2026-03-02, Branch `fix/crit-security-findings`)
|
||||
|
||||
```c
|
||||
// model.h:81
|
||||
fread(&m->cfg, sizeof(Config), 1, f); // Rückgabewert ignoriert!
|
||||
|
||||
// train_large.m:29
|
||||
fread(embed, 4, V * DIM, f); // Kein Check ob V*DIM floats gelesen wurden
|
||||
```
|
||||
|
||||
**Probleme:**
|
||||
1. Wenn die Model-Datei kleiner als erwartet ist (korrupt, abgeschnitten), werden Structs mit Garbage-Werten befüllt
|
||||
2. Kein Check ob `cfg.dim`, `cfg.hidden_dim`, `cfg.n_layers` plausibel sind bevor Speicher allokiert wird
|
||||
3. `fread(embed, 4, V * DIM, f)` — bei V=32000, DIM=768: liest 98,304,000 Bytes. Keine Größenvalidierung.
|
||||
4. In `load_checkpoint()`: wenn die Datei nach dem Header endet, werden Gewichte mit 0-Bytes befüllt ohne Warnung
|
||||
|
||||
**Empfehlung:**
|
||||
```c
|
||||
size_t n = fread(&m->cfg, sizeof(Config), 1, f);
|
||||
if (n != 1) { fprintf(stderr, "Config read failed\n"); fclose(f); return -1; }
|
||||
if (m->cfg.dim <= 0 || m->cfg.dim > 65536 || m->cfg.n_layers <= 0) {
|
||||
fprintf(stderr, "Invalid model config\n"); fclose(f); return -1;
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### [CRIT-04] Integer Overflow in Speicher-Berechnung
|
||||
**Dateien:** `training/stories_io.h:13-14`, `training/ane_mil_gen.h:12-13`
|
||||
**Schweregrad:** KRITISCH
|
||||
**Status: BEHOBEN** (2026-03-02, Branch `fix/crit-security-findings`)
|
||||
|
||||
```c
|
||||
// stories_io.h:13-14
|
||||
static NSData *build_blob(const float *w, int rows, int cols) {
|
||||
int ws = rows * cols * 2; // INT-Multiplikation, kein size_t!
|
||||
int tot = 128 + ws;
|
||||
```
|
||||
|
||||
**Problem:** Bei grösseren Modellen mit `dim >= 2048, hidden >= 16384` könnten Integer-Overflows entstehen. `*(uint32_t*)(chunk + 8) = (uint32_t)wsize;` — wenn `wsize` als `int` negativ wird (Overflow), wird ein negativer Wert als uint32 geschrieben = falsche Blob-Größe → ANE-Fehler oder Speicherkorruption.
|
||||
|
||||
**Empfehlung:** `size_t` für alle Speichergrößenberechnungen:
|
||||
```c
|
||||
size_t ws = (size_t)rows * cols * sizeof(_Float16);
|
||||
size_t tot = 128 + ws;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## HOHE Befunde
|
||||
|
||||
### [HIGH-01] Keine Eingabevalidierung für Token-Indizes
|
||||
**Datei:** `training/train_large.m:375-376`
|
||||
**Schweregrad:** HOCH
|
||||
|
||||
```c
|
||||
size_t max_pos = n_tokens - SEQ - 1;
|
||||
size_t pos = (size_t)(drand48() * max_pos);
|
||||
uint16_t *input_tokens = token_data + pos;
|
||||
```
|
||||
|
||||
**Probleme:**
|
||||
1. Token-Werte aus `token_data` werden direkt als Embedding-Indizes verwendet ohne Prüfung ob `token < VOCAB`
|
||||
2. Wenn die `.bin`-Datei korrupte Token-Werte enthält (> 32000), entstehen Out-of-Bounds-Zugriffe auf `embed[]`
|
||||
3. Kein Check ob `n_tokens >= SEQ + 1` vor der `max_pos`-Berechnung
|
||||
|
||||
**Folge:** Heap-Buffer-Overflow, korrupte `.bin`-Datei kann zu Speicherschäden führen.
|
||||
|
||||
---
|
||||
|
||||
### [HIGH-02] Checkpoint-Pfad mit relativer Verzeichnis-Navigation
|
||||
**Datei:** `training/train_large.m:8-10`
|
||||
**Schweregrad:** HOCH
|
||||
|
||||
```c
|
||||
#define CKPT_PATH "ane_stories110M_ckpt.bin"
|
||||
#define MODEL_PATH "../../assets/models/stories110M.bin" // ← relativer Pfad!
|
||||
#define DATA_PATH "tinystories_data00.bin"
|
||||
```
|
||||
|
||||
**Probleme:**
|
||||
1. `MODEL_PATH` enthält `../../` — relative Pfadnavigation. Wenn das Binary aus einem unerwarteten Verzeichnis gestartet wird, werden falsche Dateien gelesen.
|
||||
2. Kein `realpath()`-Aufruf zur Normalisierung des Pfades
|
||||
3. Manipulierter Checkpoint + `--resume` → unkontrollierte Binärdaten werden als Gewichte geladen
|
||||
|
||||
---
|
||||
|
||||
### [HIGH-03] `execl()` zur Prozessneustart ohne Argument-Validierung
|
||||
**Datei:** `training/train_large.m:331`
|
||||
**Schweregrad:** HOCH
|
||||
|
||||
```c
|
||||
execl(argv[0], argv[0], "--resume", NULL);
|
||||
```
|
||||
|
||||
**Probleme:**
|
||||
1. `argv[0]` wird ohne Validierung übergeben. Via Symlink könnte ein beliebiges Binary gestartet werden.
|
||||
2. `data_fd` (mmap'd Token-Datei) wird vor `execl()` nicht geschlossen — Dateideskriptor-Leak in neuen Prozess
|
||||
3. `munmap(token_data)` wird vor `execl()` nicht aufgerufen
|
||||
|
||||
---
|
||||
|
||||
### [HIGH-04] Fehlende `malloc()`/`calloc()`-Rückgabewert-Prüfungen
|
||||
**Dateien:** Alle `.m` und `.h` Dateien
|
||||
**Schweregrad:** HOCH
|
||||
|
||||
```c
|
||||
// train_large.m:219
|
||||
float *embed = (float*)malloc(VOCAB*DIM*4); // 32000*768*4 = 98MB — kein NULL-Check!
|
||||
```
|
||||
|
||||
Keiner der `malloc()`/`calloc()`-Aufrufe prüft den Rückgabewert auf NULL. Bei Memory-Pressure (110M Model + Adam-State = mehrere GB) können Allokierungen fehlschlagen → Nullzeiger-Dereferenzierung.
|
||||
|
||||
---
|
||||
|
||||
### [HIGH-05] ANE-Inferenz ohne Fehlerprüfung im Trainings-Hot-Path
|
||||
**Datei:** `training/stories_io.h:131-134`
|
||||
**Schweregrad:** HOCH
|
||||
|
||||
```c
|
||||
static void ane_run(Kern *k) {
|
||||
id mdl = (__bridge id)k->model; id req = (__bridge id)k->request; NSError *e = nil;
|
||||
((BOOL(*)(id,SEL,unsigned int,id,id,NSError**))objc_msgSend)(
|
||||
mdl, @selector(evaluateWithQoS:options:request:error:), 21, @{}, req, &e);
|
||||
// BOOL-Rückgabewert und NSError *e werden ignoriert!
|
||||
}
|
||||
```
|
||||
|
||||
**Problem:** ANE-Ausführung kann fehlschlagen (Thermal-Throttling, Hardware-Fehler, API-Änderungen). Stille Fehler führen zu unerkannter Gradientenkorruption.
|
||||
|
||||
---
|
||||
|
||||
## MITTLERE Befunde
|
||||
|
||||
### [MED-01] IOSurface Lock ohne Fehlerbehandlung
|
||||
**Datei:** `training/stories_io.h:62-83`
|
||||
**Schweregrad:** MITTEL
|
||||
|
||||
```c
|
||||
IOSurfaceLock(s, 0, NULL); // Return-Code ignoriert
|
||||
```
|
||||
|
||||
`IOSurfaceLock()` gibt `kIOReturnSuccess` oder einen Fehlercode zurück. Bei Lock-Fehler wird trotzdem auf den Speicher zugegriffen — mögliche Data-Race-Condition.
|
||||
|
||||
---
|
||||
|
||||
### [MED-02] Temporäres Verzeichnis nicht sicher erstellt (TOCTOU-Risiko)
|
||||
**Datei:** `training/ane_runtime.h:68-80`, `training/stories_io.h:94-100`
|
||||
**Schweregrad:** MITTEL
|
||||
|
||||
```objc
|
||||
NSString *td = [NSTemporaryDirectory() stringByAppendingPathComponent:hx];
|
||||
[milText writeToFile:[td stringByAppendingPathComponent:@"model.mil"] atomically:YES];
|
||||
```
|
||||
|
||||
TOCTOU-Race zwischen `createDirectoryAtPath` und `writeToFile`. Der `hexStringIdentifier` könnte von einem anderen Prozess erraten und das Verzeichnis manipuliert werden.
|
||||
|
||||
---
|
||||
|
||||
### [MED-03] MIL-Text-Generierung ohne Parameter-Validierung
|
||||
**Datei:** `training/ane_mil_gen.h:32-52`
|
||||
**Schweregrad:** MITTEL
|
||||
|
||||
```objc
|
||||
return [NSString stringWithFormat:
|
||||
@"...tensor<fp32, [1, %d, %d]> x...", in_ch, spatial, ...];
|
||||
```
|
||||
|
||||
Negative oder extrem große `in_ch`/`out_ch`/`spatial`-Werte durch fehlerhafte Konfiguration erzeugen invalides MIL das an den undokumentierten ANE-Compiler übergeben wird.
|
||||
|
||||
---
|
||||
|
||||
### [MED-04] Keine Endianness-Prüfung bei Checkpoint-Serialisierung
|
||||
**Datei:** `training/train_large.m:110-181`
|
||||
**Schweregrad:** MITTEL
|
||||
|
||||
```c
|
||||
h.magic = 0x424C5A54;
|
||||
fwrite(&h, sizeof(h), 1, f);
|
||||
```
|
||||
|
||||
Das `CkptHdr`-Struct wird als binärer Dump ohne Endianness-Marker geschrieben. Nicht portabel.
|
||||
|
||||
---
|
||||
|
||||
### [MED-05] NEON-Vektorisierung ohne Alignment-Garantie
|
||||
**Datei:** `training/stories_io.h:41-58`
|
||||
**Schweregrad:** MITTEL
|
||||
|
||||
```c
|
||||
float16x8_t h = vld1q_f16((const __fp16*)(src + i));
|
||||
```
|
||||
|
||||
Zeiger-Arithmetik mit `ch_off * sp` könnte das für NEON benötigte Alignment verletzen wenn `ch_off * sp` kein Vielfaches von 8 ist.
|
||||
|
||||
---
|
||||
|
||||
### [MED-06] Globale Variablen ohne Thread-Safety
|
||||
**Datei:** `training/stories_io.h`, `training/stories_config.h`
|
||||
**Schweregrad:** MITTEL
|
||||
|
||||
```c
|
||||
static bool g_ane_loaded = false;
|
||||
static int g_compile_count = 0;
|
||||
```
|
||||
|
||||
`g_compile_count` wird via `__sync_fetch_and_add()` atomar inkrementiert, aber `g_ane_loaded` und Klassen-Variablen nicht atomar gesetzt — bei Multi-Thread-Nutzung Race-Condition in `ane_init()`.
|
||||
|
||||
---
|
||||
|
||||
## NIEDRIGE Befunde
|
||||
|
||||
### [LOW-01] Fehlende Compiler-Sicherheitsflags
|
||||
**Datei:** `training/Makefile:2`
|
||||
**Schweregrad:** NIEDRIG
|
||||
**Status: BEHOBEN** (2026-03-02, Branch `fix/low-security-findings`)
|
||||
|
||||
```makefile
|
||||
CFLAGS = -O2 -Wall -Wno-deprecated-declarations -fobjc-arc
|
||||
```
|
||||
|
||||
Fehlende Flags: `-fstack-protector-strong`, `-D_FORTIFY_SOURCE=2`, `-Wformat=2`
|
||||
|
||||
**Fix:** `SEC_FLAGS = -fstack-protector-strong -Wformat-security` eingeführt. Hinweis:
|
||||
`-D_FORTIFY_SOURCE=2` ist auf macOS (Apple LLVM) bei `-O2` implizit aktiv — explizite
|
||||
Definition würde "macro redefinition"-Warnung erzeugen. `CFLAGS_DEBUG` mit
|
||||
`-fsanitize=address,undefined` für Debug-Builds hinzugefügt. `make verify-flags`
|
||||
zeigt aktive Flags.
|
||||
|
||||
---
|
||||
|
||||
### [LOW-02] `-Wno-deprecated-declarations` unterdrückt wichtige Warnungen
|
||||
**Datei:** `training/Makefile:2`
|
||||
**Schweregrad:** NIEDRIG
|
||||
**Status: BEHOBEN** (2026-03-02, Branch `fix/low-security-findings`)
|
||||
|
||||
Unterdrückt Warnungen über veraltete API-Aufrufe — könnte wichtige Hinweise auf deprecated private APIs verstecken.
|
||||
|
||||
**Fix:** Flag in benannte Variable `ANE_COMPAT` extrahiert mit erklärendem Kommentar
|
||||
(bewusste Unterdrückung wegen privater `_ANE*`-APIs via `objc_msgSend`). Neues Target
|
||||
`make check-deprecated` baut ohne Unterdrückung und zeigt alle verborgenen Warnungen.
|
||||
|
||||
---
|
||||
|
||||
### [LOW-03] Python-Skript ohne Eingabevalidierung
|
||||
**Datei:** `training/tokenize.py`
|
||||
**Schweregrad:** NIEDRIG
|
||||
**Status: BEHOBEN** (2026-03-02, Branch `fix/low-security-findings`)
|
||||
|
||||
Keine Validierung der Eingabedateigröße — bei sehr großen Eingaben Out-of-Memory möglich.
|
||||
|
||||
**Fix:** 5 Validierungen implementiert:
|
||||
1. ZIP-Existenzprüfung mit hilfreicher Fehlermeldung
|
||||
2. Konfigurierbare Größengrenze (Standard 10GB, via `MAX_ZIP_BYTES` env var überschreibbar)
|
||||
3. Prüfung ob `data00.bin` im ZIP enthalten ist
|
||||
4. Fehlerbehandlung bei `struct.unpack` wenn Output < 20 Bytes
|
||||
5. Token-Range-Validierung (alle Token müssen < `VOCAB_SIZE=32000` sein)
|
||||
|
||||
---
|
||||
|
||||
### [LOW-04] Keine `.gitignore` für sensible Artefakte
|
||||
**Datei:** Repository-Root
|
||||
**Schweregrad:** NIEDRIG
|
||||
**Status: BEHOBEN** (2026-03-02, Branch `fix/low-security-findings`)
|
||||
|
||||
Keine `.gitignore`-Datei. Binäre Artefakte (Checkpoints, Trainingsdaten, `firebase-debug.log`) könnten versehentlich committed werden.
|
||||
|
||||
**Fix:** `.gitignore` erstellt mit Regeln für: macOS-Metadaten (`.DS_Store`),
|
||||
Log-Dateien (`*.log`), kompilierte Binaries (`training/train`, `training/train_large`,
|
||||
alle Probe-Binaries), Trainingsdaten (`training/*.bin`), ANE-Artefakte
|
||||
(`*.mlmodelc/`, `*.mlpackage/`), externe Assets (`assets/`).
|
||||
|
||||
---
|
||||
|
||||
## Positive Befunde (Stärken)
|
||||
|
||||
### Korrekte Speicherfreigabe
|
||||
`ane_free()` (`ane_runtime.h:149-160`) und `free_kern()` (`stories_io.h:122-130`) implementieren vollständige Cleanup-Routinen mit `CFRelease()`, `unloadWithQoS:error:` und Temporärverzeichnis-Bereinigung.
|
||||
|
||||
### Magic-Byte Validierung in Checkpoints
|
||||
```c
|
||||
if (h.magic != 0x424C5A54 || h.version != 2) { fclose(f); return false; }
|
||||
```
|
||||
Grundlegender Schutz gegen korrupte Checkpoint-Dateien.
|
||||
|
||||
### Atomare Compile-Counter
|
||||
```c
|
||||
__sync_fetch_and_add(&g_compile_count, 1);
|
||||
```
|
||||
Thread-sicherer Zähler für ANE-Kompilierungsanzahl.
|
||||
|
||||
### Gradient-Accumulation mit async CBLAS
|
||||
Korrekte Parallelisierung von CPU-Gewichtsgradienten-Berechnung via `dispatch_group_async`.
|
||||
|
||||
---
|
||||
|
||||
## Risikobewertung für Produktionseinsatz
|
||||
|
||||
| Aspekt | Bewertung |
|
||||
|--------|-----------|
|
||||
| Apple Silicon erforderlich | macOS 15+, M-Series only |
|
||||
| Private API Stabilität | **SEHR GERING** — jedes macOS-Update kann brechen |
|
||||
| Memory Safety | **MITTEL** — keine Bounds-Checks, keine Sanitizer |
|
||||
| Input Validation | **GERING** — Dateien werden unkritisch gelesen |
|
||||
| Error Handling | **GERING** — viele kritische Fehler werden ignoriert |
|
||||
| Eignung für Produktion | **NEIN** — Forschungs-/Experimental-Projekt |
|
||||
|
||||
---
|
||||
|
||||
## Empfehlungen nach Priorität
|
||||
|
||||
### Sofortige Maßnahmen (KRITISCH)
|
||||
1. `dlopen()` Rückgabewert prüfen und bei Fehler abbrechen
|
||||
2. Alle `fread()`-Rückgabewerte prüfen + Dateigrößenvalidierung
|
||||
3. NULL-Checks vor allen `objc_msgSend`-Aufrufen
|
||||
4. `int` → `size_t` für alle Speichergrößenberechnungen
|
||||
|
||||
### Kurzfristige Maßnahmen (HOCH)
|
||||
5. Token-Index-Validierung: `if (token >= VOCAB) abort()`
|
||||
6. ANE-Inferenz-Rückgabewert und NSError prüfen
|
||||
7. Compiler-Flags: `-fstack-protector-strong -D_FORTIFY_SOURCE=2`
|
||||
8. `.gitignore` für binäre Artefakte erstellen
|
||||
|
||||
### Mittelfristige Maßnahmen (MITTEL)
|
||||
9. IOSurface Lock-Rückgabewerte prüfen
|
||||
10. `__atomic_store_n()` für `g_ane_loaded`
|
||||
11. MIL-Parameter-Validierung vor Formatierung
|
||||
|
||||
---
|
||||
|
||||
*Dieser Bericht ist für das ANE-Forschungsprojekt erstellt. Das Projekt ist explizit als Proof-of-Concept/Forschungscode konzipiert und nicht für Produktionseinsatz gedacht.*
|
||||
Loading…
Reference in New Issue