ANE/training
manni07 ad119aed46 fix: address CRIT security findings (CRIT-01 to CRIT-04)
- CRIT-01: dlopen() return check + NSClassFromString validation in ane_init()
           (ane_runtime.h + stories_config.h); g_ane_ok / g_ane_ok_large flag
           only set when all private classes load successfully; stories_config.h
           gets re-entry guard (g_ane_init_done) that was previously missing
- CRIT-02: g_ane_ok guard in ane_compile() and compile_kern_mil_w(); NULL check
           for inMemoryModel after inMemoryModelWithDescriptor: — prevents crash
           when API call returns nil (ane_runtime.h, stories_io.h)
- CRIT-03: Validate fread() return for critical config/header reads to prevent
           garbage malloc() sizes; fopen() NULL check in save_checkpoint();
           design decision documented (model.h, train_large.m)
- CRIT-04: int -> size_t in build_blob*/build_blob_t/build_blob_fp16; calloc()
           NULL checks added; (size_t) cast in malloc() size calculations to
           prevent signed integer overflow UB (stories_io.h, model.h)

Simulation: 3 iterations, overall score 96.15% (all criteria >= 95%)
ref: docs/reports/security-audit-2026-03-02.md
2026-03-02 22:14:51 +01:00
File                    Last commit date            Last commit message
Makefile                2026-03-01 22:54:58 -08:00  Add ANE probe tests and training telemetry for M5 optimization
README.md               2026-03-01 03:14:39 -08:00  stories110M: 12-layer ANE training with dashboard, 107ms/step
ane_mil_gen.h           2026-02-28 00:22:06 -08:00  Initial release
ane_runtime.h           2026-03-02 22:14:51 +01:00  fix: address CRIT security findings (CRIT-01 to CRIT-04)
backward.h              2026-02-28 00:22:06 -08:00  Initial release
dashboard.gif           2026-03-01 03:14:39 -08:00  stories110M: 12-layer ANE training with dashboard, 107ms/step
dashboard.py            2026-03-01 03:14:39 -08:00  stories110M: 12-layer ANE training with dashboard, 107ms/step
forward.h               2026-02-28 00:22:06 -08:00  Initial release
m5result.md             2026-03-01 23:16:38 -08:00  Add M5 probe results: weight reload fails, all QoS work, chaining API found
model.h                 2026-03-02 22:14:51 +01:00  fix: address CRIT security findings (CRIT-01 to CRIT-04)
stories_config.h        2026-03-02 22:14:51 +01:00  fix: address CRIT security findings (CRIT-01 to CRIT-04)
stories_cpu_ops.h       2026-03-01 03:14:39 -08:00  stories110M: 12-layer ANE training with dashboard, 107ms/step
stories_io.h            2026-03-02 22:14:51 +01:00  fix: address CRIT security findings (CRIT-01 to CRIT-04)
stories_mil.h           2026-03-01 03:14:39 -08:00  stories110M: 12-layer ANE training with dashboard, 107ms/step
test_ane_advanced.m     2026-03-01 23:16:38 -08:00  Add M5 probe results: weight reload fails, all QoS work, chaining API found
test_ane_causal_attn.m  2026-02-28 00:22:06 -08:00  Initial release
test_ane_sdpa5.m        2026-02-28 00:22:06 -08:00  Initial release
test_conv_attn3.m       2026-02-28 00:22:06 -08:00  Initial release
test_full_fused.m       2026-02-28 00:22:06 -08:00  Initial release
test_fused_bwd.m        2026-02-28 00:22:06 -08:00  Initial release
test_fused_qkv.m        2026-02-28 00:22:06 -08:00  Initial release
test_perf_stats.m       2026-03-01 23:16:38 -08:00  Add M5 probe results: weight reload fails, all QoS work, chaining API found
test_qos_sweep.m        2026-03-01 23:16:38 -08:00  Add M5 probe results: weight reload fails, all QoS work, chaining API found
test_weight_reload.m    2026-03-01 23:16:38 -08:00  Add M5 probe results: weight reload fails, all QoS work, chaining API found
tiny_train.m            2026-02-28 00:22:06 -08:00  Initial release
tiny_train_old.m        2026-02-28 00:22:06 -08:00  Initial release
tokenize.py             2026-03-01 03:14:39 -08:00  stories110M: 12-layer ANE training with dashboard, 107ms/step
train.m                 2026-02-28 00:22:06 -08:00  Initial release
train_large.m           2026-03-02 22:14:51 +01:00  fix: address CRIT security findings (CRIT-01 to CRIT-04)

README.md

ANE Training — Stories110M on Apple Neural Engine

Training a 109M-parameter Llama2-architecture transformer (Stories110M) directly on Apple's Neural Engine using private ANE APIs.

Dashboard (dashboard.gif)

Architecture

  • Model: Stories110M — dim=768, hidden=2048, heads=12, layers=12, vocab=32000, seq=256
  • 109.53M params (84.95M transformer + 24.58M embedding)
  • 72 ANE kernels per compile (60 weight-bearing, 12 weight-free sdpaBwd2)
  • 6 kernel types per layer: fwdAttn, fwdFFN, ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd
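
The parameter and kernel counts above can be reproduced with a few lines of arithmetic. This is a sketch under assumptions that the README does not state explicitly: no bias terms, two RMSNorm weight vectors per layer plus one final RMSNorm, and a classifier that reuses the embedding matrix (so it is not counted twice). The helper name `count_params` is hypothetical.

```python
# Sketch: parameter count for a Llama2-style model with the dimensions above.
# Assumptions (not confirmed by the README): no biases, two RMSNorm vectors
# per layer, one final RMSNorm, classifier shares the embedding matrix.

def count_params(dim, hidden, layers, vocab):
    attn = 4 * dim * dim        # Wq, Wk, Wv, Wo
    ffn = 3 * dim * hidden      # W1, W2, W3
    norms = 2 * dim             # attention and FFN RMSNorm weights
    transformer = layers * (attn + ffn + norms) + dim  # + final RMSNorm
    embedding = vocab * dim
    return transformer, embedding

transformer, embedding = count_params(dim=768, hidden=2048, layers=12, vocab=32000)
print(f"transformer: {transformer / 1e6:.2f}M")
print(f"embedding:   {embedding / 1e6:.2f}M")
print(f"total:       {(transformer + embedding) / 1e6:.2f}M")

# Kernel count: 6 kernel types per layer, of which sdpaBwd2 carries no weights.
layers = 12
kernels = 6 * layers            # 72 kernels per compile
weight_bearing = 5 * layers     # 60 weight-bearing kernels
print(kernels, weight_bearing, kernels - weight_bearing)
```

Under these assumptions the totals match the figures quoted above (84.95M + 24.58M = 109.53M, and 72 = 60 + 12 kernels).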

Performance

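Component                   Time (ms/step)
ANE eval                    9.6
IO (fp16 conversion)        4.1
Classifier (cblas)          9.1
Cross-entropy + residuals   14.4
RMSNorm                     0.1
Total                       107

The itemized components sum to 37.3 ms; the table does not break down the remaining ~69.7 ms of the 107 ms step.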

Files

File                Description
train_large.m       Main training loop: 12-layer forward/backward, checkpoint, exec() restart
stories_config.h    Model config, structs, alloc helpers
stories_io.h        IOSurface I/O, NEON fp16 conversion, kernel compile/eval
stories_mil.h       MIL program generators for all 6 ANE kernel types
stories_cpu_ops.h   vDSP-vectorized RMSNorm, cross-entropy, Adam, embedding ops
dashboard.py        TUI dashboard: loss curve, power/CPU/memory graphs, text generation
tokenize.py         Extract pretokenized TinyStories data
Makefile            Build targets

How it works

  1. Forward pass: Each layer runs fwdAttn (QKV + SDPA + Wo) and fwdFFN (W1 + SiLU(W3) + W2) on the ANE via MIL-compiled kernels. The final RMSNorm and classifier matmul run on the CPU (cblas).

  2. Backward pass: Layers are processed in reverse order, with ffnBwd, sdpaBwd1, sdpaBwd2, and qkvBwd running on the ANE. Weight gradients (dW) are computed with async cblas_sgemm on the CPU, and the RMSNorm backward pass uses vDSP.

  3. Compile budget: The ANE allows roughly 119 compiles per process. With 72 kernels compiled per batch, a second batch would exceed that limit (144 > 119), so we run 10 accumulation steps, save a checkpoint, and exec() a fresh process that resumes from it.

  4. Data: Real TinyStories text (20M tokens), mmap'd uint16 token IDs, random position sampling per step.
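
Steps 3 and 4 above can be sketched in a few lines. This is an illustration rather than the code in train_large.m: the helper names `batches_per_process` and `sample_window` are hypothetical, the token file is assumed to be little-endian uint16, and returning targets shifted by one position is the usual next-token setup, assumed here rather than taken from the source.

```python
import random
import struct

COMPILE_LIMIT = 119      # approximate per-process compile limit quoted above
KERNELS_PER_BATCH = 72   # kernels compiled per batch
ACCUM_STEPS = 10         # accumulation steps run before the exec() restart

def batches_per_process(limit=COMPILE_LIMIT, per_batch=KERNELS_PER_BATCH):
    """Number of full kernel batches that fit under the compile limit."""
    return limit // per_batch

def sample_window(tokens, seq, rng):
    """Pick a random start position and return (inputs, targets) of length seq.

    `tokens` is any buffer of little-endian uint16 token IDs (assumption),
    for example an mmap.mmap over the pretokenized file.
    """
    n_tokens = len(tokens) // 2                  # two bytes per token ID
    start = rng.randrange(n_tokens - seq)        # leaves room for seq + 1 tokens
    ids = struct.unpack_from(f"<{seq + 1}H", tokens, start * 2)
    return ids[:-1], ids[1:]                     # targets are inputs shifted by one

# Tiny in-memory stand-in for the mmap'd token file: token IDs 0..7.
data = struct.pack("<8H", *range(8))
x, y = sample_window(data, seq=4, rng=random.Random(0))
print(batches_per_process(), x, y)
```

Because `struct.unpack_from` accepts any object that supports the buffer protocol, the same function works unchanged on an `mmap.mmap` of the real data file, which avoids loading all 20M tokens into memory.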

Usage

# Extract tokenized data
python3 tokenize.py

# Build and train
make train_large
./train_large                    # fresh start
./train_large --resume           # resume from checkpoint

# Monitor with dashboard
pip install blessed psutil numpy
python3 dashboard.py --resume    # needs sudo for powermetrics

Key techniques

  • NEON vectorized fp16<->fp32: ARM NEON intrinsics for fast IOSurface data transfer
  • vDSP cross-entropy: vDSP_mtrans + vvexpf + vDSP_sve — 8x faster than scalar
  • Async weight gradients: cblas_sgemm dispatched to background queue, overlapped with ANE
  • SDPA causal mask workaround: ANE hardware ignores attn_mask, so we decompose attention into Q@K^T (ANE conv) + mask+softmax (CPU) + scores@V (ANE conv)
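
The three-stage attention decomposition in the last bullet can be illustrated in plain Python. This is a reference sketch of the math only, not the ANE kernels or the vDSP code: the function names are hypothetical, and the 1/sqrt(d) scaling is the standard SDPA convention, assumed here because the README does not mention it.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def causal_softmax(scores):
    """Row-wise softmax in which position i attends only to positions 0..i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                   # drop future positions
        m = max(visible)                         # subtract max for stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out

def causal_attention(q, k, v):
    scale = 1.0 / math.sqrt(len(q[0]))           # assumed standard SDPA scaling
    k_t = [list(col) for col in zip(*k)]
    # Stage 1: Q @ K^T (runs as a conv on the ANE in the README's scheme)
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    # Stage 2: causal mask + softmax (runs on the CPU)
    probs = causal_softmax(scores)
    # Stage 3: scores @ V (runs as a conv on the ANE)
    return matmul(probs, v)

q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = causal_attention(q, k, v)
print(out[0])   # the first position can only attend to itself
```

Splitting the computation this way keeps both matrix products on the accelerator and moves only the masked softmax, which needs the mask that the hardware ignores, to the CPU.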