Erik Bray 680f8c7e20 [feat] ANE ChainingRequest API prototype: baseline measurement for multi-kernel pipelining without recompile overhead		2026-03-03 14:22:18 +01:00
Makefile	[fix] Security hardening (upstream PRs #5, #7): stack-protector-strong, format-security flags, NULL guards on ane_compile/fread/fopen, tokenize.py input validation	2026-03-03 14:22:03 +01:00
README.md	Python Bridge+Memory leak fix+More functions	2026-03-03 02:04:36 -05:00
ane_classifier.h	Python Bridge+Memory leak fix+More functions	2026-03-03 02:04:36 -05:00
ane_mil_gen.h	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
ane_rmsnorm_bwd.h	Python Bridge+Memory leak fix+More functions	2026-03-03 02:04:36 -05:00
ane_runtime.h	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
backward.h	Initial release	2026-02-28 00:22:06 -08:00
dashboard.gif	stories110M: 12-layer ANE training with dashboard, 107ms/step	2026-03-01 03:14:39 -08:00
dashboard.py	[fix] Dashboard sudo hang fix (upstream PR #20): prevent blocking when password is required for powermetrics	2026-03-03 14:21:57 +01:00
download_data.sh	Python Bridge+Memory leak fix+More functions	2026-03-03 02:04:36 -05:00
forward.h	Initial release	2026-02-28 00:22:06 -08:00
m5result.md	Add M5 probe results: weight reload fails, all QoS work, chaining API found	2026-03-01 23:16:38 -08:00
model.h	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
stories_config.h	[fix] Security hardening (upstream PRs #5, #7): stack-protector-strong, format-security flags, NULL guards on ane_compile/fread/fopen, tokenize.py input validation	2026-03-03 14:22:03 +01:00
stories_cpu_ops.h	stories110M: 12-layer ANE training with dashboard, 107ms/step	2026-03-01 03:14:39 -08:00
stories_io.h	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
stories_mil.h	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_ane_advanced.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_ane_causal_attn.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_ane_sdpa5.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_chaining.m	[feat] ANE ChainingRequest API prototype: baseline measurement for multi-kernel pipelining without recompile overhead	2026-03-03 14:22:18 +01:00
test_classifier.m	Python Bridge+Memory leak fix+More functions	2026-03-03 02:04:36 -05:00
test_conv_attn3.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_full_fused.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_fused_bwd.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_fused_qkv.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_perf_stats.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_qos_sweep.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
test_rmsnorm_bwd.m	Python Bridge+Memory leak fix+More functions	2026-03-03 02:04:36 -05:00
test_weight_reload.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
tiny_train.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
tiny_train_old.m	[fix] M1/M2/M3 MIL syntax compatibility (upstream PR #6): use program(1.0), ios16 target, tensor types across 18 files	2026-03-03 14:21:48 +01:00
tokenize.py	[fix] Security hardening (upstream PRs #5, #7): stack-protector-strong, format-security flags, NULL guards on ane_compile/fread/fopen, tokenize.py input validation	2026-03-03 14:22:03 +01:00
train.m	Initial release	2026-02-28 00:22:06 -08:00
train_large.m	[fix] Token sampling underflow fix (upstream PR #17): prevent size_t wraparound on short datasets in both train_large variants	2026-03-03 14:21:53 +01:00
train_large_ane.m	[fix] Token sampling underflow fix (upstream PR #17): prevent size_t wraparound on short datasets in both train_large variants	2026-03-03 14:21:53 +01:00

README.md

ANE Training — Stories110M on Apple Neural Engine

Training a 109M-parameter Llama2-architecture transformer (Stories110M) directly on Apple's Neural Engine using private ANE APIs.

Architecture

Model: Stories110M — dim=768, hidden=2048, heads=12, layers=12, vocab=32000, seq=256
109.53M params (84.95M transformer + 24.58M embedding)
72 ANE kernels per compile (60 weight-bearing, 12 weight-free sdpaBwd2)
6 kernel types per layer: fwdAttn, fwdFFN, ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd
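
The parameter and kernel counts above can be reproduced with a little arithmetic. The sketch below is not taken from the repository; it assumes bias-free linear layers, two RMSNorm weight vectors per layer plus one final RMSNorm, and a classifier that shares the embedding matrix. Under those assumptions the totals match the figures quoted above. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(dim, hidden, layers, vocab):
    """Return (transformer_params, embedding_params) for a Llama2-style model.

    Assumptions (not confirmed by the source): no biases, two RMSNorm
    vectors per layer, one final RMSNorm, classifier tied to the embedding.
    """
    attn = 4 * dim * dim        # Wq, Wk, Wv, Wo
    ffn = 3 * dim * hidden      # W1, W2, W3
    norms = 2 * dim             # attention norm + FFN norm
    transformer = layers * (attn + ffn + norms) + dim  # + final RMSNorm
    embedding = vocab * dim
    return transformer, embedding


def kernels_per_compile(kernel_types_per_layer, layers):
    """Total ANE kernels built in one compile pass."""
    return kernel_types_per_layer * layers


if __name__ == "__main__":
    t, e = llama_param_count(dim=768, hidden=2048, layers=12, vocab=32000)
    print(f"transformer={t / 1e6:.2f}M embedding={e / 1e6:.2f}M total={(t + e) / 1e6:.2f}M")
    print("kernels per compile:", kernels_per_compile(6, 12))
```

The head count does not appear in the formula because splitting `dim` across heads leaves the projection matrix sizes unchanged.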

Performance

Component	Time (ms/step)
ANE eval	9.6
IO (fp16 conversion)	4.1
Classifier (cblas)	9.1
Cross-entropy + residuals	14.4
RMSNorm	0.1
Total (the rows above sum to 37.3; the remainder is not itemized here)	107

Files

File	Description
`train_large.m`	Main training loop — 12-layer forward/backward, checkpoint, exec() restart
`stories_config.h`	Model config, structs, alloc helpers
`stories_io.h`	IOSurface I/O, NEON fp16 conversion, kernel compile/eval
`stories_mil.h`	MIL program generators for all 6 ANE kernel types
`stories_cpu_ops.h`	vDSP-vectorized RMSNorm, cross-entropy, Adam, embedding ops
`dashboard.py`	TUI dashboard — loss curve, power/CPU/memory graphs, text generation
`tokenize.py`	Extract pretokenized TinyStories data
`Makefile`	Build targets

How it works

Forward pass: Each layer runs fwdAttn (QKV + SDPA + Wo) and fwdFFN (W1 + SiLU(W3) + W2) on ANE via MIL-compiled kernels. Final RMSNorm + classifier matmul on CPU (cblas).
Backward pass: Reverse layer order. ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd on ANE. Weight gradients (dW) via async cblas_sgemm on CPU. RMSNorm backward via vDSP.
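Compile budget: The ANE allows roughly 119 kernel compiles per process. Each compile batch builds 72 kernels, so a second batch would exceed the budget (144 > 119). We therefore run 10 accumulation steps per batch, then exec() restarts the process and resumes from the checkpoint.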
Data: Real TinyStories text (20M tokens), mmap'd uint16 token IDs, random position sampling per step.
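
The compile-budget reasoning can be sketched as a small check. This is an illustration only, not code from `train_large.m`; the helper names are hypothetical, and the limit of 119 is the approximate figure quoted above rather than a documented constant.

```python
def compile_batches_per_process(compile_limit, kernels_per_batch):
    """How many full compile batches fit inside the per-process budget."""
    return compile_limit // kernels_per_batch


def must_restart(compiled_so_far, kernels_per_batch, compile_limit):
    """True if compiling one more batch would exceed the budget,
    meaning the process should checkpoint and exec() itself."""
    return compiled_so_far + kernels_per_batch > compile_limit


if __name__ == "__main__":
    limit, per_batch = 119, 72  # approximate limit; 6 kernel types x 12 layers
    print("batches per process:", compile_batches_per_process(limit, per_batch))
    print("restart after first batch:", must_restart(per_batch, per_batch, limit))
```

With these numbers only one batch fits per process, which is why the restart happens after every batch of accumulation steps.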

Usage

1. Download Training Data

bash download_data.sh

Downloads pretokenized TinyStories (Llama 2 BPE, 32K vocab) from enio/TinyStories on HuggingFace. Produces tinystories_data00.bin (~41 MB, ~20M tokens).
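
As a rough illustration of how such a file can be consumed, the sketch below memory-maps a flat array of uint16 token IDs and draws one random training window. It is not the repository's loader: it assumes the file has no header, that tokens are little-endian, and that targets are the inputs shifted by one position. The name `sample_window` is hypothetical. The explicit length check mirrors the kind of short-dataset guard described in the token sampling underflow fix.

```python
import mmap
import random
import struct


def sample_window(path, seq_len, rng):
    """Return (inputs, targets), each seq_len tokens, from a uint16 token file."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            n_tokens = len(mm) // 2  # 2 bytes per uint16 token
            if n_tokens < seq_len + 1:
                # Guard against short datasets instead of letting the
                # start-position range go negative (or wrap, in C).
                raise ValueError("dataset shorter than seq_len + 1 tokens")
            start = rng.randrange(n_tokens - seq_len)  # last valid start included
            raw = mm[2 * start: 2 * (start + seq_len + 1)]
        finally:
            mm.close()
    tokens = struct.unpack("<%dH" % (seq_len + 1), raw)
    return list(tokens[:-1]), list(tokens[1:])


if __name__ == "__main__":
    x, y = sample_window("tinystories_data00.bin", 256, random.Random(0))
    print(len(x), len(y), x[:8])
```

At 2 bytes per token, a file of about 41 MB holds roughly 20M tokens, which is consistent with the sizes quoted above.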

2. Build & Train

# Baseline: classifier + softmax on CPU
make train_large
./train_large --steps 100        # quick test
./train_large                    # full 10k steps
./train_large --resume           # resume from checkpoint

# ANE-offloaded: classifier + softmax on ANE (faster)
make train_large_ane
./train_large_ane --steps 100

CLI flags: --steps N (default 10000), --lr F (default 3e-4), --resume.

3. Monitor with Dashboard

pip install blessed psutil numpy
sudo python3 dashboard.py          # live mode (needs powermetrics)
sudo python3 dashboard.py --resume  # attach to resumed training

4. Benchmarking

Both programs print an Efficiency Report at completion:

=== Efficiency Report ===
Total steps:     100
Avg train:       107.0 ms/step
ANE TFLOPS:      2.45 sustained
ANE utilization: 15.5% of 15.8 TFLOPS
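
The utilization line in the report is the sustained throughput divided by the peak figure. A minimal sketch, assuming that is exactly how the percentage is derived (the helper name is hypothetical):

```python
def ane_utilization(sustained_tflops, peak_tflops):
    """Utilization in percent: sustained throughput relative to peak."""
    return 100.0 * sustained_tflops / peak_tflops


if __name__ == "__main__":
    print(f"{ane_utilization(2.45, 15.8):.1f}% of 15.8 TFLOPS")
```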

Per-batch timing breakdown during training:

ane=9.6 io=4.1 cls=9.1 elem=14.4 rms=0.1 cblas_wait=2.3 ms/step

Metric	What it measures
`ane`	ANE kernel evaluation
`io`	fp16↔fp32 IOSurface transfer
`cls`	Classifier matmul (CPU cblas)
`elem`	Embedding, residual adds, cross-entropy
`rms`	RMSNorm forward/backward
`cblas_wait`	Waiting for async dW gradient sgemms
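
When comparing runs it can help to parse the timing line into numbers. The sketch below is a hypothetical helper, not part of the repository; it assumes the line keeps the `key=value` layout shown above.

```python
import re


def parse_timing(line):
    """Parse 'key=value' pairs from a timing line into a dict of floats (ms/step)."""
    return {k: float(v) for k, v in re.findall(r"(\w+)=([0-9.]+)", line)}


def accounted_ms(timing):
    """Sum of all itemized components, in ms/step."""
    return sum(timing.values())


if __name__ == "__main__":
    line = "ane=9.6 io=4.1 cls=9.1 elem=14.4 rms=0.1 cblas_wait=2.3 ms/step"
    timing = parse_timing(line)
    print(timing)
    print(f"itemized: {accounted_ms(timing):.1f} ms/step")
```

For the sample line the itemized components add up to 39.6 ms/step, so they account for only part of the 107 ms/step average in the Efficiency Report.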

Compare baseline vs ANE-offloaded:

make train_large && ./train_large --steps 100
make train_large_ane && ./train_large_ane --steps 100

Key techniques

NEON vectorized fp16↔fp32: ARM NEON intrinsics for fast IOSurface data transfer
vDSP cross-entropy: vDSP_mtrans + vvexpf + vDSP_sve — 8x faster than scalar
Async weight gradients: cblas_sgemm dispatched to background queue, overlapped with ANE
SDPA causal mask workaround: ANE hardware ignores attn_mask, so we decompose attention into Q@K^T (ANE conv) + mask+softmax (CPU) + scores@V (ANE conv)
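
The three-stage attention decomposition can be illustrated in plain Python. This is a reference sketch for a single head, not the ANE kernels themselves: the two matmuls stand in for the stages the README runs as ANE convolutions, the masked softmax stands in for the CPU stage, and the 1/sqrt(d) scaling is the standard SDPA convention, assumed here rather than confirmed by the source. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def causal_softmax(scores):
    """Row-wise softmax where position i only sees positions 0..i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


def causal_attention(q, k, v):
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]  # stage 1: Q@K^T
    probs = causal_softmax(scores)                                          # stage 2: mask + softmax
    return matmul(probs, v)                                                 # stage 3: scores@V


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in causal_attention(x, x, x):
        print(row)
```

Because the first position can only attend to itself, the first output row equals the first row of V, which is a convenient sanity check for any implementation of the mask.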