ANE/training
Andy Huang e113fae683 feat: implement ANE SDK for general-purpose neural engine development
- Implement modular ANE-MIL layer library (Linear, Conv2D, Softmax, LayerNorm, etc.)
- Add Sequential model container with automated activation surface chaining (ping-ponging)
- Implement optimized 'Weights-as-Tensors' pattern across all SDK layers for zero-recompile weight updates
- Add comprehensive automated regression testing suite (regression_test.py)
- Standardize verification for legacy Transformer training and new modular SDK components
- Update README.md and roadmap to reflect SDK capabilities and usage instructions
- Refactor hardcoded paths and unify checkpoint naming conventions for stability
2026-03-03 15:35:55 +11:00
layers feat: implement ANE SDK for general-purpose neural engine development 2026-03-03 15:35:55 +11:00
.gitignore Optimize ANE training with weights-as-tensors, add inference and benchmarking tools 2026-03-03 14:10:44 +11:00
ANESDK_roadmap.md feat: implement ANE SDK for general-purpose neural engine development 2026-03-03 15:35:55 +11:00
Makefile feat: implement ANE SDK for general-purpose neural engine development 2026-03-03 15:35:55 +11:00
PR-01.md Refactor hardcoded absolute paths to script-relative paths 2026-03-03 14:32:43 +11:00
README.md feat: implement ANE SDK for general-purpose neural engine development 2026-03-03 15:35:55 +11:00
ane_mil_gen.h Initial release 2026-02-28 00:22:06 -08:00
ane_runtime.h Initial release 2026-02-28 00:22:06 -08:00
backward.h Initial release 2026-02-28 00:22:06 -08:00
benchmark_ane.m Optimize ANE training with weights-as-tensors, add inference and benchmarking tools 2026-03-03 14:10:44 +11:00
dashboard.gif stories110M: 12-layer ANE training with dashboard, 107ms/step 2026-03-01 03:14:39 -08:00
dashboard.py stories110M: 12-layer ANE training with dashboard, 107ms/step 2026-03-01 03:14:39 -08:00
encode_bpe.py Refactor hardcoded absolute paths to script-relative paths 2026-03-03 14:32:43 +11:00
forward.h Initial release 2026-02-28 00:22:06 -08:00
m5result.md Add M5 probe results: weight reload fails, all QoS work, chaining API found 2026-03-01 23:16:38 -08:00
model.h Initial release 2026-02-28 00:22:06 -08:00
regression_test.py feat: implement ANE SDK for general-purpose neural engine development 2026-03-03 15:35:55 +11:00
sample.py Optimize ANE training with weights-as-tensors, add inference and benchmarking tools 2026-03-03 14:10:44 +11:00
stories_config.h Optimize ANE training with weights-as-tensors, add inference and benchmarking tools 2026-03-03 14:10:44 +11:00
stories_cpu_ops.h stories110M: 12-layer ANE training with dashboard, 107ms/step 2026-03-01 03:14:39 -08:00
stories_io.h Optimize ANE training with weights-as-tensors, add inference and benchmarking tools 2026-03-03 14:10:44 +11:00
stories_mil.h Optimize ANE training with weights-as-tensors, add inference and benchmarking tools 2026-03-03 14:10:44 +11:00
test_ane_advanced.m Add M5 probe results: weight reload fails, all QoS work, chaining API found 2026-03-01 23:16:38 -08:00
test_ane_causal_attn.m Initial release 2026-02-28 00:22:06 -08:00
test_ane_sdpa5.m Initial release 2026-02-28 00:22:06 -08:00
test_conv_attn3.m Initial release 2026-02-28 00:22:06 -08:00
test_full_fused.m Initial release 2026-02-28 00:22:06 -08:00
test_fused_bwd.m Initial release 2026-02-28 00:22:06 -08:00
test_fused_qkv.m Initial release 2026-02-28 00:22:06 -08:00
test_perf_stats.m Add M5 probe results: weight reload fails, all QoS work, chaining API found 2026-03-01 23:16:38 -08:00
test_qos_sweep.m Add M5 probe results: weight reload fails, all QoS work, chaining API found 2026-03-01 23:16:38 -08:00
test_sdk_layers.m feat: implement ANE SDK for general-purpose neural engine development 2026-03-03 15:35:55 +11:00
test_sdk_model.m feat: implement ANE SDK for general-purpose neural engine development 2026-03-03 15:35:55 +11:00
test_weight_reload.m Add M5 probe results: weight reload fails, all QoS work, chaining API found 2026-03-01 23:16:38 -08:00
tiny_train.m Refactor hardcoded absolute paths to script-relative paths 2026-03-03 14:32:43 +11:00
tiny_train_old.m Initial release 2026-02-28 00:22:06 -08:00
tokenize.py Refactor hardcoded absolute paths to script-relative paths 2026-03-03 14:32:43 +11:00
tokenize_text.py Optimize ANE training with weights-as-tensors, add inference and benchmarking tools 2026-03-03 14:10:44 +11:00
tokenizer.py Optimize ANE training with weights-as-tensors, add inference and benchmarking tools 2026-03-03 14:10:44 +11:00
train.m Initial release 2026-02-28 00:22:06 -08:00
train_bpe.py Refactor hardcoded absolute paths to script-relative paths 2026-03-03 14:32:43 +11:00
train_large.m feat: implement ANE SDK for general-purpose neural engine development 2026-03-03 15:35:55 +11:00
vocab.json Optimize ANE training with weights-as-tensors, add inference and benchmarking tools 2026-03-03 14:10:44 +11:00

README.md

ANE Training & SDK — General-Purpose Neural Engine Platform

This repository trains a 109M-parameter Llama2-architecture transformer (Stories110M) directly on Apple's Neural Engine. It has since grown into an ANE SDK for developing and training arbitrary neural network architectures on Apple Silicon.

Dashboard

🚀 The ANE SDK

The ANE SDK provides a high-level API for defining, training, and benchmarking models on the Neural Engine without manual MIL (Model Intermediate Language) string concatenation.

Key Features

  • Modular Layer Library: High-level builders for NLP and Vision (Linear, Conv2D, LayerNorm, Softmax, etc.).
  • Graph Orchestration: Automatic activation chaining and IOSurface management via a Sequential model container.
  • Weights-as-Tensors: Every layer takes its weights as runtime tensor inputs rather than compiled-in constants, so weights can be updated during training without recompiling.
  • Native Performance: Sustained throughput of >90 TFLOPS across modular components.

Architecture Comparison

Specialized (Legacy)                | ANE SDK (General-Purpose)
Fixed Topology: Transformer only    | Dynamic Topology: Arbitrary layers
Manual I/O: Manual surface pointers | Automated Chaining: Sequential runner
Hardcoded MIL: stories_mil.h        | Modular MIL: layers/core.h, layers/cnn.h
Optimized Path: Hand-tuned SDPA     | Ease of Use: PyTorch-like API

Performance (Optimized)

Metric                      | Value
Training Latency            | ~79.6 ms/step
Inference Latency (SEQ=256) | 0.60 ms
Sustained ANE Throughput    | ~94.4 TFLOPS
Theoretical Inference TPS   | ~429,000 tokens/sec
Weight Sync                 | ~3.4 ms per layer (NEON-accelerated)
Compile Budget              | 0 restarts (dynamic weight updates)
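
As a rough cross-check on these figures, the tokens-per-second number follows from sequence length divided by forward-pass latency. The sketch below uses the rounded 0.60 ms value, so it lands at about 427,000 tokens/sec, slightly under the ~429,000 quoted (which would correspond to an unrounded latency near 0.597 ms). The FLOP estimate uses the common rule of thumb of 2 FLOPs per parameter per token. That formula is an assumption for illustration, not necessarily how the benchmark derives its TFLOPS figure.

```python
# Back-of-the-envelope check of the reported inference figures.
SEQ = 256            # tokens per forward pass (stories_config.h default)
LATENCY_MS = 0.60    # reported forward-pass latency, rounded
PARAMS = 109e6       # parameter count quoted in the README


def tokens_per_second(seq: int, latency_ms: float) -> float:
    """Tokens processed per second if every pass takes latency_ms."""
    return seq / (latency_ms / 1000.0)


def approx_tflops(params: float, seq: int, latency_ms: float) -> float:
    """Assumed estimate: about 2 FLOPs per parameter per token."""
    flops_per_pass = 2.0 * params * seq
    return flops_per_pass / (latency_ms / 1000.0) / 1e12


tps = tokens_per_second(SEQ, LATENCY_MS)
tflops = approx_tflops(PARAMS, SEQ, LATENCY_MS)
print(f"{tps:,.0f} tokens/sec, ~{tflops:.1f} TFLOPS")
```

Under these assumptions the estimate comes out near 93 TFLOPS, in the same range as the reported value.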

Configuration Variables

Most configuration is handled in stories_config.h and train_large.m.

Model Hyperparameters (stories_config.h)

  • DIM: Model dimension (default: 768)
  • HIDDEN: FFN hidden dimension (default: 2048)
  • NLAYERS: Number of transformer layers (default: 12)
  • VOCAB: Vocabulary size (default: 5000)
  • SEQ: Sequence length / context window (default: 256)
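
To see how these hyperparameters translate into model size, the sketch below counts parameters for an assumed llama2.c-style layout: four DIM×DIM attention projections, a three-matrix SwiGLU FFN, two norm vectors per layer, and an embedding table shared with the output classifier. The layout and the `llama2_param_count` helper are illustrative assumptions, not code from this repository.

```python
def llama2_param_count(dim: int, hidden: int, n_layers: int, vocab: int,
                       tied_embeddings: bool = True) -> int:
    """Parameter count for an assumed llama2.c-style layout (no GQA, no biases)."""
    attention = 4 * dim * dim      # wq, wk, wv, wo
    ffn = 3 * dim * hidden         # w1, w2, w3 (SwiGLU)
    norms = 2 * dim                # two norm weight vectors per layer
    per_layer = attention + ffn + norms
    embedding = vocab * dim
    classifier = 0 if tied_embeddings else vocab * dim
    final_norm = dim
    return n_layers * per_layer + embedding + classifier + final_norm


# Defaults listed above (VOCAB=5000).
default_count = llama2_param_count(dim=768, hidden=2048, n_layers=12, vocab=5000)
# Same model with a 32,000-entry vocabulary (assumed for comparison).
large_vocab_count = llama2_param_count(dim=768, hidden=2048, n_layers=12, vocab=32000)

print(f"VOCAB=5000:  {default_count / 1e6:.1f}M parameters")
print(f"VOCAB=32000: {large_vocab_count / 1e6:.1f}M parameters")
```

With the defaults this gives about 88.8M parameters. A 32,000-entry vocabulary gives about 109.5M, which suggests (as an assumption) that the 109M headline figure refers to the original Stories110M vocabulary size rather than the 5000-token default.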

Training Paths (train_large.m)

  • DATA_PATH: Path to the tokenized binary dataset (default: tinystories_data00.bin)
  • MODEL_PATH: Path to the initial pretrained weights in llama2.c format.
  • CKPT_PATH: Output path for training checkpoints.

Compiling & Running

1. Prerequisites

You need a Mac with Apple Silicon (M1/M2/M3/M4), xcrun (part of the Xcode Command Line Tools), and Python 3 with the packages used for data prep and monitoring (the dashboard needs blessed, psutil, and numpy).

2. Prepare Data

The trainer expects a flat binary file of uint16_t token IDs.

# Tokenize raw text into the expected format
python3 tokenize.py
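
Before training, it can help to sanity-check the token file. The sketch below assumes the file is a headerless stream of little-endian uint16 values (Apple Silicon is little-endian). The `inspect_token_file` helper is hypothetical and not part of the repository.

```python
import os
import struct


def inspect_token_file(path: str, preview: int = 8):
    """Return (token_count, first_ids) for a flat little-endian uint16 file."""
    size = os.path.getsize(path)
    if size % 2 != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of 2 bytes")
    count = size // 2
    n = min(preview, count)
    with open(path, "rb") as f:
        first = list(struct.unpack(f"<{n}H", f.read(2 * n)))
    return count, first


if __name__ == "__main__":
    import tempfile

    # Demo on a tiny throwaway file so the example runs without a real dataset.
    with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
        tmp.write(struct.pack("<4H", 1, 42, 4999, 7))
    print(inspect_token_file(tmp.name))
    os.remove(tmp.name)
```

An odd file size or an ID at or above VOCAB usually means the file was written with a different dtype or tokenizer.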

3. Build and Train

# Compile the training binary
make train_large

# Start training from scratch with the default number of steps
./train_large

# Resume with custom steps and learning rate
./train_large --resume --steps 1000 --lr 1e-4

Dataset Adaptation

To adapt this trainer to any custom text dataset:

  1. Tokenize: Use a tokenizer to convert your text corpus into a sequence of IDs.
  2. Export: Save the IDs as a raw binary file of uint16_t values.
  3. Configure: Update VOCAB and SEQ in stories_config.h and DATA_PATH in train_large.m to match your dataset.
  4. Compile: Re-run make train_large. The ANE kernels will automatically adjust to your new shapes.
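
The export step can be sketched as follows, assuming the trainer reads a headerless little-endian uint16 stream. Because the IDs are stored as uint16_t, the vocabulary can hold at most 65,536 entries. The `export_tokens` helper is hypothetical, and the IDs can come from any tokenizer.

```python
import struct

UINT16_LIMIT = 1 << 16  # uint16_t can hold IDs 0..65535


def export_tokens(ids, path: str, vocab_size: int) -> int:
    """Write token IDs as flat little-endian uint16 values; return bytes written."""
    if not 0 < vocab_size <= UINT16_LIMIT:
        raise ValueError("vocab_size must fit in uint16_t (at most 65536 entries)")
    ids = list(ids)
    for i, tok in enumerate(ids):
        if not 0 <= tok < vocab_size:
            raise ValueError(f"token {tok} at position {i} is outside [0, {vocab_size})")
    payload = struct.pack(f"<{len(ids)}H", *ids)
    with open(path, "wb") as f:
        f.write(payload)
    return len(payload)


if __name__ == "__main__":
    # Example: three token IDs for a 5000-entry vocabulary.
    written = export_tokens([12, 4999, 0], "my_dataset.bin", vocab_size=5000)
    print(f"wrote {written} bytes")
```

Validating against `vocab_size` at export time catches a tokenizer/VOCAB mismatch before it turns into an out-of-range embedding lookup during training.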

Monitoring with Dashboard

The TUI dashboard provides real-time telemetry on loss, power usage, and model generation.

pip install blessed psutil numpy
# Dashboard may require sudo for powermetrics access
python3 dashboard.py --resume

Testing the Model

You can test the trained model with the standalone inference script, sample.py. It runs the forward pass on the CPU in plain NumPy, which makes it easy to inspect.

Generate Text

# Test with a custom prompt and checkpoint
python3 sample.py --prompt "Once upon a time" --ckpt ane_stories110M_ckpt.bin --steps 100

Parameters

  • --prompt: The starting text for generation.
  • --ckpt: Path to the training checkpoint (.bin).
  • --vocab: Path to the BPE vocabulary (vocab.json).
  • --steps: Maximum number of tokens to generate.
  • --temp: Sampling temperature (default 0.8).
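
For readers unfamiliar with the temperature parameter, the sketch below shows the usual technique: divide the logits by the temperature, apply softmax, and draw one token from the resulting distribution. It is a generic illustration using only the standard library and is not taken from sample.py, whose internals may differ.

```python
import math
import random


def softmax_with_temperature(logits, temp: float):
    """Convert logits to probabilities after scaling by 1/temp."""
    if temp <= 0:
        raise ValueError("temp must be positive")
    scaled = [x / temp for x in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temp: float = 0.8, rng=random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temp)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]
    print(softmax_with_temperature(logits, 0.8))
    print(sample_token(logits, temp=0.8))
```

Values below 1.0 sharpen the distribution toward the most likely token, while values above 1.0 flatten it and produce more varied text.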

ANE SDK Usage

You can build arbitrary models using the modular layer library in layers/.

1. Define Model Architecture

#import "layers/anesdk.h"

// Define layers
ANESDKLayer l1 = anesdk_linear_create("fc1", 768, 2048, 256);
ANESDKLayer l2 = anesdk_relu_create("relu1", 2048, 1, 256);
ANESDKLayer l3 = anesdk_layernorm_create("ln1", 2048, 256);

// Assemble into Sequential model
ANESDKLayer layers[] = { l1, l2, l3 };
ANESDKModel model = anesdk_model_sequential_create(layers, 3);

2. Run Forward Pass

The SDK automatically manages IOSurface chaining between layers.

// Write input to the first layer
io_write_fp16(model.layers[0].kern->inputs[0], input_data, 768, 256);

// Run the whole graph on ANE
anesdk_model_forward(&model);

// Read result from the last layer
io_read_fp16(model.layers[2].kern->ioOut, output_data, 0, 2048, 256);
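
The chaining idea can be modelled in a few lines of Python: each layer's output shape must match the next layer's input shape, so the runner can hand one layer's output buffer to the next. This is a conceptual sketch only. `LayerSpec` and `chain_shapes` are hypothetical names, and reading the layer arguments as (channels, sequence length) is an assumption inferred from the 768×256 write and 2048×256 read above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    name: str
    in_ch: int
    out_ch: int


def chain_shapes(layers, seq: int):
    """Return [(name, in_shape, out_shape)], checking each output feeds the next input."""
    plan = []
    prev_out = None
    for layer in layers:
        if prev_out is not None and layer.in_ch != prev_out:
            raise ValueError(
                f"{layer.name} expects {layer.in_ch} channels, "
                f"but the previous layer produces {prev_out}"
            )
        plan.append((layer.name, (layer.in_ch, seq), (layer.out_ch, seq)))
        prev_out = layer.out_ch
    return plan


# Mirrors the fc1 -> relu1 -> ln1 example above.
model = [
    LayerSpec("fc1", 768, 2048),
    LayerSpec("relu1", 2048, 2048),
    LayerSpec("ln1", 2048, 2048),
]
for name, in_shape, out_shape in chain_shapes(model, seq=256):
    print(f"{name}: {in_shape} -> {out_shape}")
```

A shape mismatch is reported before anything runs, which is the kind of check a Sequential container makes possible.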

3. Automated Verification

The repository includes a regression suite that verifies both the legacy Transformer and your new SDK layers.

# Build and run all tests (Fast SDK tests -> Training -> Inference)
make regression

Performance Utilities

ANE Hardware Benchmark

To measure raw hardware throughput and verify the Weights-as-Tensors optimization, use the native C-based benchmark:

make benchmark_ane
./benchmark_ane

Reported result: average forward pass (SEQ=256) of 0.60 ms, throughput of ~94.4 TFLOPS.

Model Inference Utility (sample.py)

Verify trained checkpoints on the CPU using vanilla NumPy.

python3 sample.py --prompt "Once upon a time" --ckpt ane_stories110M_ckpt.bin

Key Optimization: Weights as Tensors

Previously, ANE training required recompiling the kernels every time the weights changed, which ran into an OS-enforced limit of 119 compiles.

The current implementation defines weights as formal function parameters (tensor<fp16, [dim, dim]>) in the MIL program. This allows us to:

  1. Compile the kernel logic once.
  2. Update weights between batches by writing directly to IOSurfaces via NEON-accelerated loops (io_write_fp16_t).
  3. Keep the model resident in memory, eliminating the need for exec() restarts.
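
The difference between the two approaches can be illustrated with a toy model in Python. `ToyCompiler` is a hypothetical stand-in that only counts compiles and enforces the 119-compile limit quoted above. The exact behaviour of the real limit is an assumption, and nothing here calls the ANE.

```python
COMPILE_LIMIT = 119  # limit quoted in the README


class CompileBudgetExceeded(RuntimeError):
    pass


class ToyCompiler:
    """Counts compiles and refuses to go past the limit (toy model only)."""

    def __init__(self, limit: int = COMPILE_LIMIT):
        self.limit = limit
        self.count = 0

    def compile(self, fn):
        if self.count >= self.limit:
            raise CompileBudgetExceeded(f"compile limit of {self.limit} reached")
        self.count += 1
        return fn


def train_with_baked_weights(compiler: ToyCompiler, steps: int) -> None:
    """Old pattern: the weight is a constant inside the kernel, so every update recompiles."""
    w = 1.0
    for _ in range(steps):
        kernel = compiler.compile(lambda x, w=w: w * x)
        kernel(2.0)
        w -= 0.01  # stand-in for an optimizer step


def train_with_weight_params(compiler: ToyCompiler, steps: int) -> None:
    """Weights-as-tensors pattern: compile once, pass the current weight on every call."""
    kernel = compiler.compile(lambda x, w: w * x)
    w = 1.0
    for _ in range(steps):
        kernel(2.0, w)
        w -= 0.01


if __name__ == "__main__":
    c = ToyCompiler()
    train_with_weight_params(c, steps=1000)
    print("compiles with weight parameters:", c.count)
```

In this toy model the parameterized version uses one compile for any number of steps, while the baked-in version exhausts the budget after 119 updates.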