From 4c14ed0e252d6f805c6fdb40bac5a8704f7be894 Mon Sep 17 00:00:00 2001
From: maderix
Date: Tue, 3 Mar 2026 04:33:30 -0800
Subject: [PATCH] CLI fixes + --no-ane-extras flag + README benchmark table

- Fix positional arg parsing (model_path, steps, lr were silently ignored)
- Add --model, --ckpt flags; forward ckpt_path across exec() restarts
- Add --no-ane-extras to disable ANE classifier/softmax/rmsnorm_bwd
- CPU fallback for softmax/classifier/rmsnorm_bwd when extras disabled
- Update README with 4-way benchmark comparison table (20 steps)
---
 training/README.md         | 153 ++++++++++++++++++---------------
 training/train_large.m     |  24 ++++--
 training/train_large_ane.m | 169 ++++++++++++++++++++++++-------------
 3 files changed, 213 insertions(+), 133 deletions(-)

diff --git a/training/README.md b/training/README.md
index 9c4fb00..8ccde88 100644
--- a/training/README.md
+++ b/training/README.md
@@ -8,43 +8,68 @@ Training a 109M-parameter Llama2-architecture transformer (Stories110M) directly
 
 - **Model**: Stories110M — dim=768, hidden=2048, heads=12, layers=12, vocab=32000, seq=256
 - **109.53M params** (84.95M transformer + 24.58M embedding)
-- **72 ANE kernels** per compile (60 weight-bearing, 12 weight-free sdpaBwd2)
-- **6 kernel types per layer**: fwdAttn, fwdFFN, ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd
+- **SDPA causal mask workaround**: ANE hardware ignores attn_mask — decompose into Q@K^T (ANE conv) + mask+softmax (CPU) + scores@V (ANE conv)
 
-## Performance
+## Three Training Pipelines
 
-| Component | Time (ms/step) |
-|-----------|---------------|
-| ANE eval | 9.6 |
-| IO (fp16 conversion) | 4.1 |
-| Classifier (cblas) | 9.1 |
-| Cross-entropy + residuals | 14.4 |
-| RMSNorm | 0.1 |
-| **Total** | **107 ms/step** |
+### 1. Static Baseline (`train_large`)
+Original pipeline. Weights baked as constants in MIL kernels — recompile every 10 steps via `exec()` restart.
+
+- 60 weight-bearing + 12 weight-free kernels = 72 per compile batch
+- Classifier + softmax + RMSNorm backward on CPU
+- **106.7 ms/step**, 7.6s compile per restart
+
+### 2. Static + ANE Extras (`train_large_ane`) — PR#19
+Offloads classifier forward (32K conv), softmax, final RMSNorm, and RMSNorm backward to ANE. Bridge API for C-callable ANE access.
+
+- 86 kernels per compile batch (+24 rmsnorm_bwd, +1 classifier, +1 finalRms)
+- **91.8 ms/step** (14% faster), 9.6s compile per restart
+- Use `--no-ane-extras` to disable and fall back to CPU (for debugging)
+
+### 3. Dynamic Weight Pipeline (`training_dynamic/`)
+Weights passed via IOSurface spatial dimension — compile 9 kernels once at startup, no recompilation needed.
+
+- 9 shared kernels across all 12 layers
+- **111 ms/step**, 0.4s one-time compile
+- No exec() restart, no compile limit issues
+
+## Performance Comparison (20 Steps)
+
+| | Static Baseline | PR#19 + ANE extras | PR#19 no extras | Dynamic |
+|---|---|---|---|---|
+| **Wall time** | **10.1s** | **11.7s** | **10.7s** | **~2.6s** |
+| Compile | 7.6s (75.7%) | 9.6s (81.6%) | 7.5s (69.7%) | 0.4s (15%) |
+| Train | 2.1s (21.2%) | 1.8s (15.6%) | 2.9s (27.4%) | 2.2s (85%) |
+| **ms/step** | **106.7** | **91.8** | **147.0** | **111** |
+| Kernels/restart | 72 | 86 | 60 | 9 (once) |
+| ANE TFLOPS | 0.87 | 1.15 | 0.72 | — |
+| Total TFLOPS | 1.63 | 1.90 | 1.19 | — |
+
+**Key insights:**
+- Dynamic wins on wall time for any practical run length (3.9x faster at 20 steps)
+- PR#19 has the best per-step throughput (92ms) but compile overhead dominates short runs
+- Static restarts every 10 steps, so dynamic's zero-recompile advantage compounds
 
 ## Files
 
 | File | Description |
 |------|-------------|
-| `train_large.m` | Main training loop — 12-layer forward/backward, checkpoint, exec() restart |
-| `stories_config.h` | Model config, structs, alloc helpers |
+| `train_large.m` | Static baseline — 72 kernels, classifier/softmax on CPU |
+| `train_large_ane.m` | PR#19 — 86 kernels, classifier/softmax/rmsnorm_bwd on ANE |
+| `training_dynamic/train.m` | Dynamic pipeline — 9 kernels, weights via IOSurface |
+| `training_dynamic/mil_dynamic.h` | MIL generators for dynamic weight kernels |
+| `training_dynamic/config.h` | Model config (DIM=768, HIDDEN=2048, etc.) |
+| `training_dynamic/io.h` | IOSurface I/O + MIL compilation helpers |
+| `training_dynamic/cpu_ops.h` | CPU ops (SiLU backward, cross-entropy, Adam) |
+| `stories_config.h` | Static pipeline config, structs, alloc helpers |
 | `stories_io.h` | IOSurface I/O, NEON fp16 conversion, kernel compile/eval |
-| `stories_mil.h` | MIL program generators for all 6 ANE kernel types |
-| `stories_cpu_ops.h` | vDSP-vectorized RMSNorm, cross-entropy, Adam, embedding ops |
-| `dashboard.py` | TUI dashboard — loss curve, power/CPU/memory graphs, text generation |
-| `tokenize.py` | Extract pretokenized TinyStories data |
+| `stories_mil.h` | MIL generators for static pipeline (6 kernel types) |
+| `stories_cpu_ops.h` | vDSP-vectorized RMSNorm, cross-entropy, Adam |
+| `ane_classifier.h` | ANE classifier fwd (32K conv), softmax kernels |
+| `ane_rmsnorm_bwd.h` | ANE rmsnorm backward kernel |
+| `dashboard.py` | TUI dashboard — loss curve, power/CPU/memory graphs |
 | `Makefile` | Build targets |
 
-## How it works
-
-1. **Forward pass**: Each layer runs fwdAttn (QKV + SDPA + Wo) and fwdFFN (W1 + SiLU(W3) + W2) on ANE via MIL-compiled kernels. Final RMSNorm + classifier matmul on CPU (cblas).
-
-2. **Backward pass**: Reverse layer order. ffnBwd, sdpaBwd1, sdpaBwd2, qkvBwd on ANE. Weight gradients (dW) via async cblas_sgemm on CPU. RMSNorm backward via vDSP.
-
-3. **Compile budget**: ANE has a ~119 compile limit per process. With 72 kernels per batch, we run 10 accumulation steps then `exec()` restart with checkpoint resume.
-
-4. **Data**: Real TinyStories text (20M tokens), mmap'd uint16 token IDs, random position sampling per step.
-
 ## Usage
 
 ### 1. Download Training Data
@@ -53,69 +78,63 @@ Training a 109M-parameter Llama2-architecture transformer (Stories110M) directly
 bash download_data.sh
 ```
 
-Downloads pretokenized TinyStories (Llama 2 BPE, 32K vocab) from [enio/TinyStories](https://huggingface.co/datasets/enio/TinyStories) on HuggingFace. Produces `tinystories_data00.bin` (~41 MB, ~20M tokens).
+Downloads pretokenized TinyStories (Llama 2 BPE, 32K vocab) from HuggingFace. Produces `tinystories_data00.bin` (~41 MB, ~20M tokens).
 
 ### 2. Build & Train
 
 ```bash
-# Baseline: classifier + softmax on CPU
+# Static baseline (classifier + softmax on CPU)
 make train_large
-./train_large --steps 100 # quick test
-./train_large # full 10k steps
-./train_large --resume # resume from checkpoint
+./train_large stories110M.bin 256 100 1e-4
+./train_large --model stories110M.bin --steps 100 --lr 1e-4
 
-# ANE-offloaded: classifier + softmax on ANE (faster)
+# PR#19: ANE-offloaded classifier + softmax + rmsnorm_bwd
 make train_large_ane
-./train_large_ane --steps 100
+./train_large_ane stories110M.bin 256 100 1e-4
+./train_large_ane --no-ane-extras --steps 100 # disable ANE extras
+
+# Dynamic pipeline (no recompilation)
+cd training_dynamic && make train
+./train --scratch # train from random init
+./train # resume from checkpoint
+./train --steps 200 --lr 1e-4 # custom steps/lr
 ```
 
-**CLI flags:** `--steps N` (default 10000), `--lr F` (default 3e-4), `--resume`.
+**CLI flags (all pipelines):**
+- `--steps N` (default 10000)
+- `--lr F` (default 3e-4)
+- `--model PATH` — pretrained weights file
+- `--ckpt PATH` — checkpoint file (preserved across exec() restarts)
+- `--resume` — resume from checkpoint
+- `--no-ane-extras` — (train_large_ane only) disable ANE classifier/softmax/rmsnorm_bwd
 
 ### 3. Monitor with Dashboard
 
 ```bash
 pip install blessed psutil numpy
-sudo python3 dashboard.py # live mode (needs powermetrics)
-sudo python3 dashboard.py --resume # attach to resumed training
+sudo python3 dashboard.py # static pipeline
+sudo python3 dashboard.py --dynamic # dynamic pipeline
 ```
 
 ### 4. Benchmarking
 
-Both programs print an **Efficiency Report** at completion:
+All programs print an **Efficiency Report** at completion:
 
 ```
 === Efficiency Report ===
-Total steps: 100
-Avg train: 107.0 ms/step
-ANE TFLOPS: 2.45 sustained
-ANE utilization: 15.5% of 15.8 TFLOPS
+Total steps: 20
+Wall time: 11738 ms (11.7 s)
+Compile time: 9583 ms (81.6%)
+Train time: 1835 ms (15.6%)
+Avg train: 91.8 ms/step
+ANE TFLOPS: 1.15 sustained
 ```
 
-Per-batch timing breakdown during training:
+## Key Techniques
 
-```
-ane=9.6 io=4.1 cls=9.1 elem=14.4 rms=0.1 cblas_wait=2.3 ms/step
-```
-
-| Metric | What it measures |
-|--------|-----------------|
-| `ane` | ANE kernel evaluation |
-| `io` | fp16↔fp32 IOSurface transfer |
-| `cls` | Classifier matmul (CPU cblas) |
-| `elem` | Embedding, residual adds, cross-entropy |
-| `rms` | RMSNorm forward/backward |
-| `cblas_wait` | Waiting for async dW gradient sgemms |
-
-Compare baseline vs ANE-offloaded:
-
-```bash
-make train_large && ./train_large --steps 100
-make train_large_ane && ./train_large_ane --steps 100
-```
-
-## Key techniques
-
-- **NEON vectorized fp16<->fp32**: ARM NEON intrinsics for fast IOSurface data transfer
+- **NEON vectorized fp16↔fp32**: ARM NEON intrinsics for fast IOSurface data transfer
 - **vDSP cross-entropy**: `vDSP_mtrans` + `vvexpf` + `vDSP_sve` — 8x faster than scalar
 - **Async weight gradients**: cblas_sgemm dispatched to background queue, overlapped with ANE
-- **SDPA causal mask workaround**: ANE hardware ignores attn_mask, so we decompose attention into Q@K^T (ANE conv) + mask+softmax (CPU) + scores@V (ANE conv)
+- **Vocab compaction** (dynamic): 32K → 9.2K active tokens, 3.5x reduction in classifier work
+- **Dynamic weight packing**: Activations + weights concatenated in IOSurface spatial dimension — one kernel serves all 12 layers
+- **exec() restart**: Workaround for ANE ~119 compile limit per process
diff --git a/training/train_large.m b/training/train_large.m
index e58ce08..17fb1c5 100644
--- a/training/train_large.m
+++ b/training/train_large.m
@@ -5,8 +5,8 @@
 #include "stories_mil.h"
 #include "stories_cpu_ops.h"
 
-#define CKPT_PATH "ane_stories110M_ckpt.bin"
-#define MODEL_PATH "../../assets/models/stories110M.bin"
+#define CKPT_PATH_DEFAULT "ane_stories110M_ckpt.bin"
+#define MODEL_PATH_DEFAULT "stories110M.bin"
 #define DATA_PATH "tinystories_data00.bin"
 
 // ===== Weight loading from llama2.c format =====
@@ -193,11 +193,23 @@ int main(int argc, char *argv[]) {
     int adam_t = 0, start_step = 0;
     // Parse args
+    const char *ckpt_path = CKPT_PATH_DEFAULT;
+    const char *model_path = MODEL_PATH_DEFAULT;
     bool do_resume = false;
+    int pos = 0;
     for (int i=1; i