Dashboard: multi-model support (Stories110M + Qwen3-0.6B) with GQA-aware
text generation and KV cache. Weights & Biases logging (--wandb flag) for
loss, timing, power, and checkpoint events. Top-k=50 sampling to eliminate
garbage tokens from untrained vocab entries. Tokenizer reads any vocab size.
train.m: only save checkpoint when loss improves (best_loss tracking).
- Parse static pipeline JSON step/batch/perf lines for real-time updates
- Running elapsed time, ms/step from wall-clock timestamps, steps/sec
- Compute ANE + Total TFLOPS from FLOPs/step when not reported directly
- Support --ane (train_large_ane) and --no-ane-extras flags
- Dynamic pipeline timing breakdown + CKPT_PATH per mode
Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps
bottleneck. Weights are passed via IOSurface spatial dimension instead of
baked as constants, so kernels compile once at startup (345ms) and run
indefinitely without exec() restart.
Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
- 9 dynamic kernels shared across all 12 layers
- Vocab compaction 32K→9.2K for faster classifier
- Vectorized cross-entropy with vDSP/NEON
- Adam optimizer with gradient clipping + cosine LR schedule
- Checkpoint save/resume
- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface
- dashboard.py — updated with --dynamic flag for v2 pipeline support,
improved step regex parsing, --scratch/--lr/--accum CLI args
Performance: 110ms/step steady-state (no recompile overhead)
ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms