Dashboard: multi-model support (Stories110M + Qwen3-0.6B) with GQA-aware
text generation and KV cache. Weights & Biases logging (--wandb flag) for
loss, timing, power, and checkpoint events. Top-k=50 sampling to eliminate
garbage tokens from untrained vocab entries. Tokenizer reads any vocab size.
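A minimal sketch of the top-k sampling idea, for illustration only (this is not the dashboard's actual code): keep the k highest logits, apply softmax over those alone, and sample. Tokens outside the top k, such as untrained vocab entries with arbitrary logits, can never be picked. The function name and the rng parameter are hypothetical.

```python
import math
import random

def sample_top_k(logits, k=50, rng=random):
    """Sample a token id from the k highest-scoring logits."""
    k = min(k, len(logits))
    # Indices of the k largest logits; every other token gets probability zero.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax weights over the surviving logits only.
    m = logits[top[0]]
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```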
train.m: only save checkpoint when loss improves (best_loss tracking).
Implement Grouped-Query Attention (16q/8kv heads, head_dim=128) for
Qwen3-0.6B (28 layers, 596M params). Model configs moved to
models/*.h headers selected at build time via make MODEL=xxx.
Key changes:
- GQA-aware MIL kernels: sdpaFwd split from woFwd (Q_DIM!=DIM),
qBwd/kvBwd split from qkvBwd (different IC dimensions)
- K/V tiled from KV_HEADS to HEADS before SDPA backward; dK/dV reduced back to KV_HEADS afterward
- 10 kernels total, all model-agnostic via compile-time defines
- Makefile: make MODEL=qwen3_06b (default) or MODEL=stories110m
- Both models verified: Stories110M ~115ms/step, Qwen3 ~412ms/step
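The K/V tile and reduce step can be sketched as follows, using the 16q/8kv head counts above. This is an illustration in plain Python, not the MIL kernel code. It assumes consecutive query heads share one K/V head (index h // GROUP); the actual kernel layout may differ. Each head is modelled as a flat list of floats.

```python
HEADS, KV_HEADS = 16, 8          # Qwen3-0.6B head counts from the notes above
GROUP = HEADS // KV_HEADS        # query heads sharing one K/V head

def tile_kv(kv):
    """Expand KV_HEADS per-head tensors to HEADS by repeating each head."""
    return [kv[h // GROUP] for h in range(HEADS)]

def reduce_kv_grad(d_tiled):
    """Sum the gradients of the tiled copies back onto their source K/V head."""
    d_kv = [[0.0] * len(d_tiled[0]) for _ in range(KV_HEADS)]
    for h, grad in enumerate(d_tiled):
        for i, g in enumerate(grad):
            d_kv[h // GROUP][i] += g
    return d_kv
```

The reduce is a sum rather than a mean because each K/V head contributes to GROUP query heads in the forward pass, so its gradient is the sum over those copies.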
Three changes were needed to get loss below 5.5 (unigram plateau): two bug fixes, then training improvements:
1. FP16 underflow in ANE backward matmuls: gradient (~8e-5) × weight (~0.036)
products flushed to zero in fp16. Fixed with global loss scaling (256×)
applied once to dlogits, divided out before Adam update.
2. Backward weight staging used raw weights instead of transposed ones. All 4
backward kernels (wotBwd, qkvBwd, ffnBwdW2t, ffnBwdW13t) now use
pre-transposed buffers (Wot_buf, Wqt_buf, etc.).
3. Added AdamW (decoupled weight decay, wd=0.1 for weights, 0.0 for norms),
activation clipping (act_clip=20), gradient clipping, cosine LR schedule,
per-layer IOSurface weight pre-staging, and vocab compaction.
Loss now drops 9.14 → 5.74 in 500 steps from random init (87ms/step).
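The fp16 underflow in item 1 can be reproduced with a small sketch using the magnitudes quoted there. It assumes the hardware flushes fp16 subnormals to zero, which is emulated here by hand; Python's struct "e" format keeps subnormals by itself. For brevity the scale is divided out right after one product, whereas the change described above applies it once to dlogits and divides it out before the Adam update. Function names are hypothetical.

```python
import struct

FP16_MIN_NORMAL = 2.0 ** -14     # smallest normal fp16 magnitude (~6.1e-5)
LOSS_SCALE = 256.0               # scale factor quoted in item 1

def fp16_ftz(x):
    """Round x to fp16, flushing subnormal results to zero (assumed behaviour)."""
    y = struct.unpack("<e", struct.pack("<e", x))[0]
    return 0.0 if abs(y) < FP16_MIN_NORMAL else y

def backward_product(grad, weight, scale=1.0):
    """One gradient*weight product in fp16; the scale is divided out in full precision."""
    scaled = fp16_ftz(fp16_ftz(grad * scale) * fp16_ftz(weight))
    return scaled / scale

unscaled = backward_product(8e-5, 0.036)             # ~2.9e-6 is below the normal range
scaled = backward_product(8e-5, 0.036, LOSS_SCALE)   # ~7.4e-4 in fp16, then divided by 256
```

Under these assumptions the unscaled product is lost entirely, while the scaled one stays in the normal fp16 range and recovers a value close to 2.88e-6.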
Update the file structure section to reflect the current repository
layout, including benchmarks/, bridge/, training_dynamic/, and newly
added header files, scripts, and training variants. Fix missing space
in "Fork it, build on it" section.
Benchmark report now includes full Stories110M model configuration
(arch, layers, dims, kernels). README updated: 12-layer results
replace stale single-layer numbers, limitations reflect current state.
Community-submitted results for M1 Pro/Max, M3 Pro, M4 Pro/Max, M5.
Includes training performance, peak throughput, MIL compatibility
matrix, and structured JSON data.
- Validate all fread() return values in model_load_weights (model.h)
- Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m)
- Log error details on ANE eval failure (ane_runtime.h)
- Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h)
- Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward
- Atomic checkpoint writes via tmp+rename pattern (tiny_train.m)
- Non-destructive recompile: compile new kernels first, swap only on success (model.h)
- Validate fread() in load_checkpoint (tiny_train.m)
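The tmp+rename pattern for atomic checkpoint writes, sketched in Python for illustration (tiny_train.m itself is Objective-C/C, where the same idea would use a temporary file plus rename()). The function name is hypothetical.

```python
import os
import tempfile

def atomic_write(path, data):
    """Write bytes to path so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # push the bytes to disk before the swap
        os.replace(tmp_path, path)   # atomic rename over the old checkpoint
    except BaseException:
        # On any failure, remove the partial temp file and keep the old checkpoint.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A crash mid-write leaves at most a stray .tmp file; the previous checkpoint is never truncated.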
+[MLModel compileModelAtURL:error:] fails on macOS 26, breaking inmem_bench,
sram_bench, and sram_probe. This switches all three to generate MIL text
and weight blobs programmatically in memory (matching the working
inmem_peak.m approach), bypassing CoreML disk compilation entirely.
- inmem_bench.m: replace CoreML compile + file read with genMIL/buildWeightBlob
- sram_bench.m: switch from _ANEClient/_ANEModel to _ANEInMemoryModel API
- sram_probe.m: same _ANEClient → _ANEInMemoryModel conversion
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Parse static pipeline JSON step/batch/perf lines for real-time updates
- Running elapsed time, ms/step from wall-clock timestamps, steps/sec
- Compute ANE + Total TFLOPS from FLOPs/step when not reported directly
- Support --ane (train_large_ane) and --no-ane-extras flags
- Dynamic pipeline timing breakdown + CKPT_PATH per mode
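A rough sketch of how ms/step, steps/sec, and TFLOPS can be derived from timestamped step lines. The JSON schema here (a "type" of "step", a "step" counter, and a "t" timestamp in seconds) is hypothetical and not the pipeline's actual format, and the function name is made up for illustration.

```python
import json

def step_rates(lines, flops_per_step):
    """Derive ms/step, steps/sec and TFLOPS from timestamped JSON step lines."""
    # Hypothetical schema: {"type": "step", "step": <int>, "t": <seconds>}
    steps = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip non-JSON log output
        if isinstance(rec, dict) and rec.get("type") == "step":
            steps.append((rec["step"], rec["t"]))
    if len(steps) < 2:
        return None                       # need two samples to measure a rate
    (s0, t0), (s1, t1) = steps[0], steps[-1]
    if s1 == s0 or t1 <= t0:
        return None
    sec_per_step = (t1 - t0) / (s1 - s0)
    return {
        "ms_per_step": sec_per_step * 1e3,
        "steps_per_sec": 1.0 / sec_per_step,
        "tflops": flops_per_step / sec_per_step / 1e12,
    }
```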
Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps
bottleneck. Weights are passed via IOSurface spatial dimension instead of
baked as constants, so kernels compile once at startup (345ms) and run
indefinitely without exec() restart.
Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
- 9 dynamic kernels shared across all 12 layers
- Vocab compaction 32K→9.2K for faster classifier
- Vectorized cross-entropy with vDSP/NEON
- Adam optimizer with gradient clipping + cosine LR schedule
- Checkpoint save/resume
- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface
- dashboard.py — updated with --dynamic flag for v2 pipeline support,
improved step regex parsing, --scratch/--lr/--accum CLI args
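One way vocab compaction could work, sketched under a loud assumption: that the compact vocabulary is simply the set of token ids that occur in the training data. The notes above do not say how the 32K→9.2K mapping is built, so this is a guess at the general technique, with hypothetical function names.

```python
def build_compact_vocab(token_ids):
    """Map the token ids that actually occur to a dense 0..N-1 range."""
    used = sorted(set(token_ids))
    full_to_compact = {tok: i for i, tok in enumerate(used)}
    compact_to_full = used                  # list index = compact id
    return full_to_compact, compact_to_full

def compact_rows(matrix, compact_to_full):
    """Keep only the embedding/classifier rows for tokens that are in use."""
    return [matrix[tok] for tok in compact_to_full]
```

The classifier then produces N logits instead of one per full-vocab entry, and compact_to_full maps a sampled compact id back to the original token id.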
Performance: 110ms/step steady-state (no recompile overhead)
ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms
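For reference, a minimal sketch of a cosine LR schedule and global-norm gradient clipping as commonly defined. The parameter names, the absence of warmup, and clipping by global L2 norm are assumptions; the training loop's actual variants may differ.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat gradient list so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```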
Weave in scope notice near the top covering project intent, what it
is/isn't, hype clarification, maintenance expectations, and fork
encouragement. Consolidate private API disclaimer with existing
disclaimer section to avoid duplication.
https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv
Key findings from running all 4 probes on Apple M5:
- Weight reload (unload+load after file overwrite) does NOT work — weights
are baked at compile time, output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models
Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern)
and 64+ channel kernels (ANE minimum size requirement).
Full analysis in training/m5result.md.
Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient
Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time