ANE Training -- API Reference

Complete index of all public functions, structs, macros, and global variables, organized by source file.

stories_config.h -- Model Configuration
stories_io.h -- IOSurface I/O and Compilation
stories_mil.h -- MIL Program Generators
stories_cpu_ops.h -- CPU Operations
ane_runtime.h -- Generalized ANE Wrapper
ane_mil_gen.h -- Composable MIL Helpers
ane_rmsnorm_bwd.h -- RMSNorm Backward on ANE
ane_classifier.h -- Classifier and Softmax on ANE
bridge/ane_bridge.h -- C Bridge API
MIL Operation Reference
Weight Blob Format

stories_config.h

Model constants, data structures, and memory allocation helpers.

Macros

Macro	Value	Description
`DIM`	768	Model hidden dimension
`HIDDEN`	2048	FFN intermediate dimension
`HEADS`	12	Number of attention heads
`HD`	64 (`DIM/HEADS`)	Per-head dimension
`SEQ`	256	Sequence length
`NLAYERS`	12	Number of transformer layers
`VOCAB`	32000	Vocabulary size
`ACCUM_STEPS`	10	Gradient accumulation steps per compile batch
`MAX_COMPILES`	100	ANE compile budget before process restart
`KERNELS_PER_LAYER`	5	Weight-bearing ANE kernels per layer
`TOTAL_WEIGHT_KERNELS`	60	Total weight-bearing compiles per batch
`SCORE_CH`	3072 (`HEADS*SEQ`)	Attention score channels for SDPA backward
`WQ_SZ`	589824 (`DIM*DIM`)	Size of Q/K/V/O projection weight matrices
`WO_SZ`	589824 (`DIM*DIM`)	Size of output projection
`W1_SZ`	1572864 (`HIDDEN*DIM`)	FFN gate projection size
`W2_SZ`	1572864 (`DIM*HIDDEN`)	FFN down-projection size
`W3_SZ`	1572864 (`HIDDEN*DIM`)	FFN value projection size
`LAYER_PARAMS`	--	Total floats per layer: `4*WQ_SZ + W1_SZ + W2_SZ + W3_SZ + 2*DIM`
`TOTAL_PARAMS`	--	Total model params: `NLAYERS * LAYER_PARAMS + DIM + VOCAB*DIM`
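
The two derived sizes at the end of the table follow from the base constants. As a cross-check, the sketch below evaluates both formulas; the `sk_` function names are hypothetical and the shape values are passed as arguments rather than read from the real header.

```c
#include <assert.h>
#include <stddef.h>

/* Floats in one transformer layer: 4 attention matrices (dim x dim),
 * 3 FFN matrices (hidden x dim), and 2 RMSNorm scale vectors (dim). */
static size_t sk_layer_params(size_t dim, size_t hidden) {
    assert(dim > 0 && hidden > 0);
    size_t wq = dim * dim;        /* WQ_SZ == WO_SZ */
    size_t w1 = hidden * dim;     /* W1_SZ == W2_SZ == W3_SZ */
    return 4 * wq + 3 * w1 + 2 * dim;
}

/* Whole model: every layer, the final RMSNorm scale, and the embedding table. */
static size_t sk_total_params(size_t n_layers, size_t dim,
                              size_t hidden, size_t vocab) {
    return n_layers * sk_layer_params(dim, hidden) + dim + vocab * dim;
}
```

With `DIM=768`, `HIDDEN=2048`, `NLAYERS=12`, and `VOCAB=32000` this gives 7,079,424 floats per layer and 109,529,856 in total.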

Structs

`LayerWeights`

Per-layer weight matrices (all float*).

Field	Shape	Description
`Wq`, `Wk`, `Wv`, `Wo`	`[DIM, DIM]`	Attention projection weights
`W1`, `W3`	`[HIDDEN, DIM]`	FFN gate and value up-projections
`W2`	`[DIM, HIDDEN]`	FFN down-projection
`rms_att`	`[DIM]`	RMSNorm scale for attention sublayer
`rms_ffn`	`[DIM]`	RMSNorm scale for FFN sublayer

`AdamState`

First/second moment buffers for a single parameter group.

Field	Type	Description
`m`	`float*`	First moment (mean) estimate
`v`	`float*`	Second moment (variance) estimate
`n`	`size_t`	Number of parameters
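
A minimal sketch of how a struct with these three fields could be allocated and released, assuming plain `calloc`/`free`. The `Sketch*` names are hypothetical; the real `adam_alloc` and `adam_free` may differ in error handling.

```c
#include <assert.h>
#include <stdlib.h>

/* Mirrors the AdamState fields listed above. */
typedef struct {
    float *m;   /* first moment (mean) estimate */
    float *v;   /* second moment (variance) estimate */
    size_t n;   /* number of parameters */
} SketchAdamState;

/* Allocate zero-initialized moment buffers; calloc provides the zeroing. */
static SketchAdamState sketch_adam_alloc(size_t n) {
    SketchAdamState s;
    s.m = calloc(n, sizeof(float));
    s.v = calloc(n, sizeof(float));
    s.n = n;
    /* Assumption: abort on allocation failure; real code might report it. */
    assert(s.m != NULL && s.v != NULL);
    return s;
}

/* Release both buffers and reset the struct so a second call is harmless. */
static void sketch_adam_free(SketchAdamState *s) {
    free(s->m);
    free(s->v);
    s->m = NULL;
    s->v = NULL;
    s->n = 0;
}
```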

`LayerAdam`

Per-layer Adam optimizer state. Contains one AdamState per weight matrix: Wq, Wk, Wv, Wo, W1, W2, W3, rms_att, rms_ffn.

`LayerActs`

Per-layer activation tensors saved for the backward pass.

Field	Shape	Description
`layer_in`	`[DIM, SEQ]`	Input to this layer (for rmsnorm1 backward)
`xnorm`	`[DIM, SEQ]`	RMSNorm1 output
`Q`, `K`, `V`	`[DIM, SEQ]`	QKV projections
`attn_out`	`[DIM, SEQ]`	Attention output (before Wo)
`o_out`	`[DIM, SEQ]`	Wo projection output
`x2`	`[DIM, SEQ]`	Residual after attention
`x2norm`	`[DIM, SEQ]`	RMSNorm2 output
`h1`, `h3`	`[HIDDEN, SEQ]`	FFN intermediates (W1 and W3 outputs)
`silu_out`	`[HIDDEN, SEQ]`	SiLU(h1) * h3 gated output
`ffn_out`	`[DIM, SEQ]`	FFN final output

`LayerGrads`

Per-layer gradient accumulators. Same field names as LayerWeights (all float*): Wq, Wk, Wv, Wo, W1, W2, W3, rms_att, rms_ffn.

`Kern`

Single ANE kernel handle (stories-specific, single I/O).

Field	Type	Description
`model`	`void*`	Retained `_ANEInMemoryModel`
`ioIn`	`IOSurfaceRef`	Input IOSurface
`ioOut`	`IOSurfaceRef`	Output IOSurface
`request`	`void*`	Retained `_ANERequest`
`tmpDir`	`void*`	Retained temp directory path

`LayerKernels`

ANE kernels for one transformer layer.

Field	Type	Description
`fwdAttn`	`Kern*`	SDPA forward + taps
`fwdFFN`	`Kern*`	FFN forward + taps
`ffnBwd`	`Kern*`	FFN backward
`sdpaBwd1`	`Kern*`	SDPA backward part 1 (Wo^T + dV + scores)
`sdpaBwd2`	`Kern*`	SDPA backward part 2 (dQ + dK)
`qkvBwd`	`Kern*`	QKV backward (Wq^T, Wk^T, Wv^T)

`CkptHdr`

Checkpoint file header (128 bytes, version 2).

Field	Type	Description
`magic`	`int`	`0x424C5A54` ("BLZT")
`version`	`int`	2
`step`, `total_steps`	`int`	Training progress
`n_layers`, `vocab_size`, `dim`, `hidden_dim`, `n_heads`, `seq_len`	`int`	Model shape
`lr`, `loss`	`float`	Learning rate, last loss
`cum_compile`, `cum_train`, `cum_wall`	`double`	Cumulative timing (ms)
`cum_steps`, `cum_batches`	`int`	Cumulative counters
`adam_t`	`int`	Adam timestep (for bias correction)
`pad[3]`	`int`	Alignment padding

`Llama2Config`

Header from llama2.c model files (7 ints): dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len.

Global Variables

Name	Type	Description
`g_D`	`Class`	`_ANEInMemoryModelDescriptor` ObjC class
`g_I`	`Class`	`_ANEInMemoryModel` ObjC class
`g_AR`	`Class`	`_ANERequest` ObjC class
`g_AIO`	`Class`	`_ANEIOSurfaceObject` ObjC class
`g_tb`	`mach_timebase_info_data_t`	Mach time base for timing
`g_compile_count`	`int`	Running count of ANE compiles

Functions

Function	Returns	Description
`ane_init(void)`	`void`	Load AppleNeuralEngine.framework, resolve 4 private class references
`tb_ms(uint64_t t)`	`double`	Convert Mach absolute time to milliseconds
`adam_alloc(size_t n)`	`AdamState`	Allocate zeroed first/second moment buffers for n parameters
`adam_free(AdamState *s)`	`void`	Free an AdamState's buffers
`layer_weights_alloc(void)`	`LayerWeights`	Allocate all weight matrices for one layer
`layer_weights_free(LayerWeights *w)`	`void`	Free all weight matrices for one layer
`layer_adam_alloc(void)`	`LayerAdam`	Allocate Adam state for all weights in one layer
`layer_adam_free(LayerAdam *a)`	`void`	Free Adam state for one layer
`layer_acts_alloc(void)`	`LayerActs`	Allocate all activation buffers for one layer
`layer_acts_free(LayerActs *a)`	`void`	Free all activation buffers for one layer
`layer_grads_alloc(void)`	`LayerGrads`	Allocate zeroed gradient accumulators for one layer
`layer_grads_zero(LayerGrads *g)`	`void`	Zero all gradient accumulators (between accumulation steps)
`layer_grads_free(LayerGrads *g)`	`void`	Free gradient accumulators for one layer
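
The conversion done by `tb_ms` can be sketched portably. A Mach timebase supplies a `numer/denom` ratio that scales ticks to nanoseconds; here the two values are passed explicitly instead of being read from `g_tb`, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert timebase ticks to milliseconds.
 * numer/denom stand in for the fields of mach_timebase_info_data_t. */
static double sketch_tb_ms(uint64_t ticks, uint32_t numer, uint32_t denom) {
    assert(denom != 0);
    /* ticks * numer / denom is nanoseconds; 1e6 ns per millisecond. */
    return (double)ticks * (double)numer / (double)denom / 1e6;
}
```

For example, a 24 MHz tick rate corresponds to `numer = 125`, `denom = 3`, so 24000 ticks map to 1 ms.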

stories_io.h

IOSurface creation, fp16/fp32 conversion, weight blob building, and ANE kernel compile/run.

Depends on: stories_config.h, <arm_neon.h>

Functions

Function	Returns	Description
`make_surface(size_t bytes)`	`IOSurfaceRef`	Create a 1D IOSurface with given byte allocation
`build_blob(const float *w, int rows, int cols)`	`NSData*`	Build fp16 weight blob (128B header + row-major fp16 data) from fp32 weights
`build_blob_t(const float *w, int rows, int cols)`	`NSData*`	Build fp16 weight blob with transposed layout (col-major fp16 from row-major fp32)
`build_blob_fp16(_Float16 *d, int cnt)`	`NSData*`	Build weight blob from pre-existing fp16 data (no conversion)
`cvt_f16_f32(float *dst, const _Float16 *src, int n)`	`void`	NEON-vectorized fp16-to-fp32 conversion (8-wide SIMD)
`cvt_f32_f16(_Float16 *dst, const float *src, int n)`	`void`	NEON-vectorized fp32-to-fp16 conversion (8-wide SIMD)
`io_write_fp16(IOSurfaceRef s, const float *data, int channels, int sp)`	`void`	Write fp32 data to IOSurface as fp16 in channel-first `[C,S]` layout
`io_read_fp16(IOSurfaceRef s, float *data, int ch_off, int channels, int sp)`	`void`	Read fp16 data from IOSurface at channel offset, convert to fp32
`io_copy(IOSurfaceRef dst, int dst_ch, IOSurfaceRef src, int src_ch, int channels, int sp)`	`void`	Copy fp16 data between IOSurfaces at specified channel offsets
`io_write_fp16_at(IOSurfaceRef s, int ch_off, const float *data, int channels, int sp)`	`void`	Write fp32 data to IOSurface at specific channel offset as fp16
`compile_kern_mil_w(NSString *mil, NSDictionary *weights, int ic_bytes, int oc_bytes)`	`Kern*`	Compile MIL text + weight dictionary into a loaded ANE kernel with IOSurfaces. Increments `g_compile_count`.
`free_kern(Kern *k)`	`void`	Unload ANE model, release IOSurfaces, remove temp directory, free kernel
`ane_run(Kern *k)`	`void`	Run a compiled ANE kernel on current IOSurface contents
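
The channel-offset arguments of `io_copy`, `io_read_fp16`, and `io_write_fp16_at` rely on the channel-first `[C, S]` layout, where channel `c` begins at element `c * sp`. The sketch below shows that addressing on ordinary arrays. It is an illustration only: `uint16_t` stands in for raw fp16 bits, plain buffers stand in for IOSurface memory, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy `channels` rows of `sp` fp16 values between two channel-first
 * [C, sp] buffers, starting at channel src_ch in src and dst_ch in dst. */
static void sketch_copy_channels(uint16_t *dst, int dst_ch,
                                 const uint16_t *src, int src_ch,
                                 int channels, int sp) {
    assert(dst_ch >= 0 && src_ch >= 0 && channels >= 0 && sp > 0);
    /* Consecutive channels are contiguous, so one memcpy covers the run. */
    memcpy(dst + (size_t)dst_ch * (size_t)sp,
           src + (size_t)src_ch * (size_t)sp,
           (size_t)channels * (size_t)sp * sizeof(uint16_t));
}
```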

stories_mil.h

MIL program generators for the 6 fused ANE kernel types. Each returns an NSString* containing the full MIL program text.

Depends on: stories_io.h

Macros

Macro	Description
`MIL_HDR`	Standard MIL program header (version 1.3, buildInfo with coremlc/coremltools versions)
`CONV_CONST`	Common conv parameter constants (pad_type, strides, pad, dilations, groups)

Functions

Function	Returns	Description
`gen_sdpa_fwd_taps(void)`	`NSString*`	SDPA forward: RMSNorm + QKV + attention + Wo. Output: `concat(o_out, Q, K, V, attn_out, xnorm)` `[1, 6*DIM, 1, SEQ]`
`gen_ffn_fwd_taps(void)`	`NSString*`	FFN forward: RMSNorm + W1/W3 + SiLU + W2. Output: `concat(ffn_out, h1, h3, silu_out, x2norm)` `[1, 2*DIM+3*HIDDEN, 1, SEQ]`
`gen_ffn_bwd(void)`	`NSString*`	FFN backward: Input `concat(dffn, h1, h3)`. Output: `concat(dx, dh1, dh3)` `[1, DIM+2*HIDDEN, 1, SEQ]`
`gen_qkvb(void)`	`NSString*`	QKV backward: Input `concat(dQ, dK, dV)`. Output: `dx` `[1, DIM, 1, SEQ]`
`gen_sdpa_bwd1(void)`	`NSString*`	SDPA backward part 1: Input `concat(Q, K, V, dx2)`. Output: `concat(dV, probs, dP)` `[1, DIM+2*SCORE_CH, 1, SEQ]`
`gen_sdpa_bwd2(void)`	`NSString*`	SDPA backward part 2: Input `concat(probs, dP, Q, K)`. Output: `concat(dQ, dK)` `[1, 2*DIM, 1, SEQ]`
`get_mask_blob(void)`	`NSData*`	Lazily build and cache causal attention mask as fp16 blob. Lower-triangular 0, upper -65504.
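
Because the forward kernels pack their taps with `concat` along the channel axis, each tap can be located by a fixed channel offset. The sketch below derives those offsets from the concat orders listed above. The enum and function names are hypothetical, and it assumes the taps appear in the output in exactly the listed order.

```c
#include <assert.h>
#include <stddef.h>

/* Constants from stories_config.h, re-declared so the sketch stands alone. */
enum { TAP_DIM = 768, TAP_HIDDEN = 2048 };

/* gen_sdpa_fwd_taps: concat(o_out, Q, K, V, attn_out, xnorm), DIM channels each. */
enum {
    SDPA_TAP_O_OUT    = 0 * TAP_DIM,
    SDPA_TAP_Q        = 1 * TAP_DIM,
    SDPA_TAP_K        = 2 * TAP_DIM,
    SDPA_TAP_V        = 3 * TAP_DIM,
    SDPA_TAP_ATTN_OUT = 4 * TAP_DIM,
    SDPA_TAP_XNORM    = 5 * TAP_DIM,
    SDPA_TAP_TOTAL    = 6 * TAP_DIM
};

/* gen_ffn_fwd_taps: concat(ffn_out, h1, h3, silu_out, x2norm). */
enum {
    FFN_TAP_FFN_OUT  = 0,
    FFN_TAP_H1       = TAP_DIM,
    FFN_TAP_H3       = TAP_DIM + 1 * TAP_HIDDEN,
    FFN_TAP_SILU_OUT = TAP_DIM + 2 * TAP_HIDDEN,
    FFN_TAP_X2NORM   = TAP_DIM + 3 * TAP_HIDDEN,
    FFN_TAP_TOTAL    = 2 * TAP_DIM + 3 * TAP_HIDDEN
};

/* Element index of channel c, position s within a tap that starts at
 * channel ch_off, for a channel-first [C, seq] buffer. */
static size_t sketch_tap_index(int ch_off, int c, int s, int seq) {
    assert(ch_off >= 0 && c >= 0 && s >= 0 && s < seq);
    return ((size_t)ch_off + (size_t)c) * (size_t)seq + (size_t)s;
}
```

An offset such as `SDPA_TAP_Q` is the kind of value that would be passed as `ch_off` to `io_read_fp16`.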

Global Variables

Name	Type	Description
`g_mask_blob`	`NSData*`	Cached causal mask blob (built on first call to `get_mask_blob`)
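
The mask contents described for `get_mask_blob` can be sketched as follows. This is an illustration, not the actual implementation: it emits only the raw fp16 bit patterns (no blob header), assumes rows index queries and columns index keys, and uses a hypothetical function name. The bit pattern `0xFBFF` is -65504, the most negative finite fp16 value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_FP16_ZERO    0x0000u  /* +0.0 */
#define SKETCH_FP16_NEG_MAX 0xFBFFu  /* -65504 */

/* Build a seq x seq causal mask as raw fp16 bits, row-major.
 * Entry (q, k) is 0 where k <= q (lower triangle, diagonal included)
 * and -65504 where k > q. Caller frees the result. */
static uint16_t *sketch_causal_mask(int seq) {
    assert(seq > 0);
    uint16_t *mask = malloc((size_t)seq * (size_t)seq * sizeof(uint16_t));
    if (!mask) return NULL;
    for (int q = 0; q < seq; q++) {
        for (int k = 0; k < seq; k++) {
            mask[(size_t)q * (size_t)seq + (size_t)k] =
                (k <= q) ? SKETCH_FP16_ZERO : SKETCH_FP16_NEG_MAX;
        }
    }
    return mask;
}
```

Adding this mask to the attention scores before softmax drives the weights for future positions to effectively zero.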

stories_cpu_ops.h

CPU-side operations using Accelerate framework (vDSP, vvrsqrtf, vvexpf).

Depends on: stories_config.h

Functions

Function	Returns	Description
`rmsnorm(float *out, const float *x, const float *w, int d, int S)`	`void`	RMSNorm forward: `out = x * rsqrt(mean(x^2) + eps) * w`. Vectorized via vDSP. Layout: channel-first `[d, S]`.
`rmsnorm_bwd(float *dx, float *dw, const float *dy, const float *x, const float *w, int d, int S)`	`void`	RMSNorm backward: computes `dx` (input gradient) and accumulates `dw` (scale gradient).
`adam_update(float *w, const float *g, AdamState *s, int t, float lr, float b1, float b2, float eps)`	`void`	Adam optimizer step with bias correction. Updates weights in-place. `t` is the timestep for bias correction.
`cross_entropy_loss(float *dlogits, const float *logits, const uint16_t *targets, int V, int S)`	`float`	Compute mean cross-entropy loss. Writes `dlogits = (softmax(logits) - one_hot(targets)) / S`. Column-major `[V, S]` layout. Uses vDSP transpose + vvexpf for vectorized softmax.
`embed_lookup(float *x, const float *embed, const uint16_t *tokens, int dim, int seq)`	`void`	Embedding forward: gather rows from `embed[VOCAB, DIM]` into channel-first `x[DIM, SEQ]`.
`embed_backward(float *d_embed, const float *dx, const uint16_t *tokens, int dim, int seq)`	`void`	Embedding backward: scatter-add `dx` back into embedding table gradient `d_embed`.
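
The embedding gather and scatter-add described in the last two rows involve a layout change from row-major `[VOCAB, DIM]` to channel-first `[DIM, SEQ]`. A scalar sketch of both directions is shown below. The `sketch_` names are hypothetical, and the real functions may be vectorized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather: x[d, t] = embed[tokens[t], d].
 * embed is row-major [vocab, dim]; x is channel-first [dim, seq]. */
static void sketch_embed_lookup(float *x, const float *embed,
                                const uint16_t *tokens, int dim, int seq) {
    assert(dim > 0 && seq > 0);
    for (int t = 0; t < seq; t++) {
        const float *row = embed + (size_t)tokens[t] * (size_t)dim;
        for (int d = 0; d < dim; d++)
            x[(size_t)d * (size_t)seq + (size_t)t] = row[d];
    }
}

/* Scatter-add: d_embed[tokens[t], d] += dx[d, t].
 * A token that occurs several times accumulates into the same row. */
static void sketch_embed_backward(float *d_embed, const float *dx,
                                  const uint16_t *tokens, int dim, int seq) {
    assert(dim > 0 && seq > 0);
    for (int t = 0; t < seq; t++) {
        float *row = d_embed + (size_t)tokens[t] * (size_t)dim;
        for (int d = 0; d < dim; d++)
            row[d] += dx[(size_t)d * (size_t)seq + (size_t)t];
    }
}
```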

Global Variables

Name	Type	Description
`g_rms_tmp`	`float*`	Lazily-allocated scratch buffer for RMSNorm (size SEQ)

ane_runtime.h

Generalized ANE wrapper with multi-input/output support. Used in bridge, tests, and newer training variants.

Structs

`ANEKernel`

Generalized kernel handle supporting multiple inputs and outputs.

Field	Type	Description
`model`	`id`	`_ANEInMemoryModel` instance
`ioInputs`	`IOSurfaceRef*`	Array of input IOSurfaces
`ioOutputs`	`IOSurfaceRef*`	Array of output IOSurfaces
`request`	`id`	`_ANERequest` instance
`tmpDir`	`NSString*`	Temp directory for MIL/weights on disk
`nInputs`, `nOutputs`	`int`	Number of I/O tensors
`inputBytes`, `outputBytes`	`size_t*`	Byte sizes for each I/O tensor

Global Variables

Name	Type	Description
`g_ANEDesc`	`Class`	`_ANEInMemoryModelDescriptor`
`g_ANEInMem`	`Class`	`_ANEInMemoryModel`
`g_ANEReq`	`Class`	`_ANERequest`
`g_ANEIO`	`Class`	`_ANEIOSurfaceObject`
`g_ane_loaded`	`bool`	Guard to avoid re-loading the framework

Functions

Function	Returns	Description
`ane_init(void)`	`void`	Load AppleNeuralEngine.framework (idempotent), resolve 4 private ObjC classes
`ane_create_surface(size_t bytes)`	`IOSurfaceRef`	Create a 1D IOSurface of given byte size
`ane_compile(NSData *milText, NSData *weightData, int nInputs, size_t *inputSizes, int nOutputs, size_t *outputSizes)`	`ANEKernel*`	Full compile pipeline: build descriptor, compile MIL, load model, create IOSurfaces + request. Returns NULL on failure.
`ane_write_input(ANEKernel *k, int idx, const void *data, size_t bytes)`	`void`	Write raw bytes to the idx-th input IOSurface (lock/memcpy/unlock)
`ane_read_output(ANEKernel *k, int idx, void *data, size_t bytes)`	`void`	Read raw bytes from the idx-th output IOSurface (read-lock/memcpy/unlock)
`ane_run_kernel(ANEKernel *k)`	`bool`	Run the compiled ANE kernel. Returns true on success.
`ane_free(ANEKernel *k)`	`void`	Unload model, release all IOSurfaces, remove temp dir, free struct

ane_mil_gen.h

Composable MIL generation helpers for common patterns, plus weight blob builders.

Functions

Function	Returns	Description
`mil_build_weight_blob(const float *w, int out_ch, int in_ch)`	`NSData*`	Build fp16 weight blob with 128B header from fp32 row-major `[out_ch, in_ch]` weights
`mil_gen_matmul(int in_ch, int out_ch, int spatial)`	`NSString*`	Generate MIL for matmul `y = W @ x` with both as runtime inputs. Includes fp32-to-fp16-to-fp32 casts.
`mil_gen_conv(int in_ch, int out_ch, int spatial)`	`NSString*`	Generate MIL for conv-based linear with baked weights from blob file (inference-only)
`mil_gen_qkv(int dim, int spatial)`	`NSString*`	Generate MIL for fused QKV: 3 parallel convs from single input, weights from concatenated blob
`mil_build_qkv_weight_blob(const float *wq, const float *wk, const float *wv, int dim)`	`NSData*`	Build concatenated weight blob for fused QKV (3 chunks, each with 64B header + fp16 data)
`mil_build_ffn_up_weight_blob(const float *w1, const float *w3, int hidden_dim, int dim)`	`NSData*`	Build concatenated weight blob for fused FFN up-projection (W1 + W3 chunks)
`mil_gen_ffn_up(int dim, int hidden_dim, int spatial)`	`NSString*`	Generate MIL for fused FFN up: W1 + W3 parallel convs, outputs h1 and h3

ane_rmsnorm_bwd.h

MIL generator for RMSNorm backward on ANE (used by train_large_ane.m).

Depends on: stories_mil.h

Functions

Function	Returns	Description
`gen_rmsnorm_bwd(void)`	`NSString*`	Generate MIL for RMSNorm backward. Input: `concat(dy, x)` as `[1, 2*DIM, 1, SEQ]`. Baked weight: RMSNorm scale `w[DIM]`. Output: `dx` as `[1, DIM, 1, SEQ]`. Note: `dw` (weight gradient) stays on CPU.

ane_classifier.h

MIL generators for classifier operations on ANE (used by train_large_ane.m).

Depends on: stories_mil.h

Functions

Function	Returns	Description
`gen_classifier_fwd(void)`	`NSString*`	Classifier forward: single 32000-output-channel conv. Input: `[1, DIM, 1, SEQ]`. Baked: embedding weights `[VOCAB, DIM, 1, 1]`. Output: `[1, VOCAB, 1, SEQ]`.
`gen_classifier_bwd(void)`	`NSString*`	Classifier backward: `dx = embed^T @ dlogits`. Uses `matmul` op (not conv, since ANE rejects conv with 32000 input channels). Input: `[1, VOCAB, 1, SEQ]`. Baked: `embed^T [1, DIM, VOCAB]`. Output: `[1, DIM, 1, SEQ]`.
`gen_softmax_vocab(void)`	`NSString*`	Softmax over VOCAB dimension: `softmax(x, axis=1)`. Input: `[1, VOCAB, 1, SEQ]`. Output: `[1, VOCAB, 1, SEQ]`.
`gen_final_rmsnorm(void)`	`NSString*`	Final RMSNorm (standalone, not fused). Input: `[1, DIM, 1, SEQ]`. Baked: `rms_final[DIM]`. Output: `[1, DIM, 1, SEQ]`.

bridge/ane_bridge.h

C-callable bridge to ANE private APIs for Python ctypes integration.

Types

Type	Description
`ANEKernelHandle`	Opaque kernel handle (pointer to internal struct)

Functions

Function	Returns	Description
`ane_bridge_init(void)`	`int`	Initialize ANE runtime (load private framework, resolve classes). Returns 0 on success, -1 on failure.
`ane_bridge_compile(const char *mil_text, size_t mil_len, const uint8_t *weight_data, size_t weight_len, int n_inputs, const size_t *input_sizes, int n_outputs, const size_t *output_sizes)`	`ANEKernelHandle*`	Compile MIL text + single weight blob into ANE kernel. Returns NULL on failure.
`ane_bridge_compile_multi_weights(const char *mil_text, size_t mil_len, const char **weight_names, const uint8_t **weight_datas, const size_t *weight_lens, int n_weights, int n_inputs, const size_t *input_sizes, int n_outputs, const size_t *output_sizes)`	`ANEKernelHandle*`	Compile MIL text + multiple named weight files. Weight names use `@model_path/` prefix convention.
`ane_bridge_run(ANEKernelHandle *kernel)`	`bool`	Execute a compiled kernel on ANE. Returns true on success.
`ane_bridge_write_input(ANEKernelHandle *kernel, int idx, const void *data, size_t bytes)`	`void`	Write data to kernel input IOSurface at index `idx`
`ane_bridge_read_output(ANEKernelHandle *kernel, int idx, void *data, size_t bytes)`	`void`	Read data from kernel output IOSurface at index `idx`
`ane_bridge_free(ANEKernelHandle *kernel)`	`void`	Unload model, release all IOSurfaces, remove temp dir, free handle
`ane_bridge_get_compile_count(void)`	`int`	Get current compile count (for restart budgeting)
`ane_bridge_reset_compile_count(void)`	`void`	Reset compile count to zero
`ane_bridge_build_weight_blob(const float *src, int rows, int cols, size_t *out_len)`	`uint8_t*`	Build weight blob in ANE format (128B header + fp16). Caller must free via `ane_bridge_free_blob()`.
`ane_bridge_build_weight_blob_transposed(const float *src, int rows, int cols, size_t *out_len)`	`uint8_t*`	Build transposed weight blob. Caller must free via `ane_bridge_free_blob()`.
`ane_bridge_free_blob(void *ptr)`	`void`	Free a blob allocated by `ane_bridge_build_weight_blob*`
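
The compile counter exists so that a caller can restart the process before the compile budget runs out. One possible check is sketched below. The policy (restart when the next batch would exceed the budget) and the function name are assumptions made for illustration; the source does not specify how the budget is enforced.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy: report that a restart is needed when compiling
 * another batch would push the running count past the budget. */
static bool sketch_needs_restart(int compile_count, int compiles_per_batch,
                                 int max_compiles) {
    assert(compile_count >= 0 && compiles_per_batch >= 0 && max_compiles >= 0);
    return compile_count + compiles_per_batch > max_compiles;
}
```

A caller would pass the value returned by `ane_bridge_get_compile_count()` as `compile_count`. With `MAX_COMPILES = 100` and `TOTAL_WEIGHT_KERNELS = 60` from `stories_config.h`, the first batch fits within the budget and a second one does not.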

MIL Operation Reference

All MIL programs target ios18 and use fp16 tensors in [1, C, 1, S] layout (or [1, H, S, S] for attention scores).

Operation	MIL Syntax	Purpose
`conv`	`conv(dilations=dl, groups=gr, pad=pd, pad_type=pt, strides=st, weight=W, x=xn)`	Linear projections (all Wq, Wk, Wv, Wo, W1, W2, W3). 1x1 conv = matmul. Weight shape: `[out_ch, in_ch, 1, 1]`.
`matmul`	`matmul(transpose_x=tx, transpose_y=ty, x=a, y=b)`	Attention score computation (`Q @ K^T`, `scores @ V`) and classifier backward.
`softmax`	`softmax(axis=ax, x=ms)`	Attention weight normalization (`axis=-1`) and vocab softmax (`axis=1`).
`mul`	`mul(x=a, y=b)`	Element-wise multiply: RMSNorm scaling, SiLU gating, attention scaling, softmax Jacobian.
`add`	`add(x=a, y=b)`	Causal mask application, SiLU derivative `(1 + h*(1-sig))`, gradient accumulation.
`sub`	`sub(x=a, y=b)`	SiLU derivative: `1 - sigmoid(h1)`, softmax backward: `dp - sum(P*dP)`.
`sigmoid`	`sigmoid(x=h1)`	SiLU activation component (SiLU = x * sigmoid(x)).
`pow`	`pow(x=ss3, y=nhalf)`	RMSNorm: `x^(-0.5)` = reciprocal sqrt.
`reduce_sum`	`reduce_sum(x=sq, axes=rax, keep_dims=kd)`	RMSNorm: sum of squares along channel dim. Softmax backward: row-wise dot product.
`reshape`	`reshape(shape=sh, x=xf)`	`[1,DIM,1,SEQ]` to `[1,HEADS,HD,SEQ]` for multi-head attention. Flatten attention scores.
`transpose`	`transpose(perm=pm, x=q4)`	Permute `[0,1,3,2]`: swap spatial and head_dim for matmul compatibility.
`concat`	`concat(axis=cax, interleave=cid, values=(a,b,c))`	Pack multiple outputs into single IOSurface ("taps"). Always `axis=1`, `interleave=false`.
`slice_by_size`	`slice_by_size(x=x, begin=b, size=sz)`	Split concatenated inputs in backward kernels. `begin=[0,offset,0,0]`, `size=[1,channels,1,SEQ]`.
`cast`	`cast(dtype=to_fp16, x=x)`	fp32-to-fp16 or fp16-to-fp32 precision conversion (used in ane_mil_gen.h generators).
`const`	`const()[name=..., val=...]`	Declare scalar/tensor constants, conv parameters, weight blob references via `BLOBFILE`.
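
Several rows above compress a derivative into a handful of ops. Written out under the standard definitions of SiLU, softmax, and RMSNorm, the expressions those ops evaluate are as follows. The symbols ($\sigma$ for sigmoid, $P$ for attention probabilities, $C$ for the channel count) are notation introduced here, not names from the source.

```latex
\begin{aligned}
\operatorname{SiLU}(h) &= h\,\sigma(h) \\
\frac{d}{dh}\operatorname{SiLU}(h)
  &= \sigma(h) + h\,\sigma(h)\bigl(1-\sigma(h)\bigr)
   = \sigma(h)\,\bigl(1 + h\,(1-\sigma(h))\bigr) \\
P = \operatorname{softmax}(S) \;\Rightarrow\;
dS &= P \odot \Bigl(dP - \sum_{j} P_{j}\,dP_{j}\Bigr) \\
\operatorname{RMSNorm}(x)_{c}
  &= x_{c}\,\Bigl(\tfrac{1}{C}\sum_{c'} x_{c'}^{2} + \epsilon\Bigr)^{-1/2}\, w_{c}
\end{aligned}
```

The second line accounts for the `sigmoid`, `sub`, `add`, and `mul` entries; the third for `mul`, `reduce_sum`, and `sub`, with the sum taken along the softmax axis; the fourth for `reduce_sum` and `pow` with exponent -0.5.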

Weight Blob Format

Single-weight blob (128 bytes header + data)

Offset  Size   Content
------  -----  -------
0       1      0x01 (format marker)
4       1      0x02 (format marker)
5-63    59     zeros (global header padding)
64      4      0xDEADBEEF (chunk magic, little-endian: EF BE AD DE)
68      1      0x01 (chunk marker)
72      4      uint32 data_size (total fp16 bytes = out_ch * in_ch * 2)
80      4      uint32 data_offset (always 128 = 64 global + 64 chunk)
84-127  44     zeros (chunk header padding)
128+    N      fp16 weight data, row-major [out_ch, in_ch]
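
The layout above can be sketched as a small builder. This is an illustration under stated assumptions, not the project's `build_blob_fp16`: it takes raw fp16 bit patterns as `uint16_t`, returns a `malloc`'d buffer instead of `NSData*`, copies the data in host byte order (little-endian on the target hardware), and uses hypothetical function names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Store a 32-bit value little-endian, independent of host byte order. */
static void sketch_put_u32le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Wrap `count` fp16 values in the 128-byte header from the table above.
 * Stores the total length in *out_len; caller frees the result. */
static uint8_t *sketch_build_blob_fp16(const uint16_t *fp16, size_t count,
                                       size_t *out_len) {
    assert(fp16 != NULL && count > 0 && out_len != NULL);
    size_t data_bytes = count * sizeof(uint16_t);
    assert(data_bytes <= UINT32_MAX);                /* data_size is a uint32 */
    uint8_t *blob = calloc(1, 128 + data_bytes);     /* padding stays zero */
    if (!blob) return NULL;
    blob[0] = 0x01;                                  /* format marker */
    blob[4] = 0x02;                                  /* format marker */
    sketch_put_u32le(blob + 64, 0xDEADBEEFu);        /* chunk magic */
    blob[68] = 0x01;                                 /* chunk marker */
    sketch_put_u32le(blob + 72, (uint32_t)data_bytes);
    sketch_put_u32le(blob + 80, 128);                /* data_offset */
    memcpy(blob + 128, fp16, data_bytes);            /* row-major weights */
    *out_len = 128 + data_bytes;
    return blob;
}
```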

Multi-weight blob (fused QKV, FFN up)

Offset      Content
------      -------
0-63        Global header (same as above)
64          Chunk 0 header (64 bytes): magic, data_size, data_offset
64+64       Chunk 0 data (fp16 weights)
64+cs       Chunk 1 header (64 bytes)
64+cs+64    Chunk 1 data (fp16 weights)
...

Where cs = 64 + n_elements * 2 (chunk header size + data size).

MIL references use BLOBFILE(path="@model_path/weights/name.bin", offset=uint64(X)) where X is the chunk header offset within the file (64 for first chunk, 64+cs for second, etc.).
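
The `offset` values used in `BLOBFILE` follow from the chunk sizes. The sketch below computes the header offset of any chunk, assuming chunks are packed back to back with no extra alignment padding, as the `cs` formula implies. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of chunk `index`'s 64-byte header in a multi-weight blob.
 * n_elements[i] is the number of fp16 weights stored in chunk i. */
static size_t sketch_chunk_header_offset(const size_t *n_elements, size_t index) {
    assert(n_elements != NULL);
    size_t off = 64;                         /* skip the global header */
    for (size_t i = 0; i < index; i++)
        off += 64 + n_elements[i] * 2;       /* cs: chunk header + fp16 data */
    return off;
}
```

For a fused QKV blob with `dim = 768`, each chunk holds 589824 weights, so the three header offsets are 64, 1179776, and 2359488. Each chunk's data begins 64 bytes after its header offset.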
