berkus/ANE - ANE

Commit Graph

Author	SHA1	Message	Date

maderix

475348ad14

Add Qwen3-0.6B GQA support and multi-model build system

Implement Grouped-Query Attention (16q/8kv heads, head_dim=128) for
Qwen3-0.6B (28 layers, 596M params). Model configs moved to
models/*.h headers selected at build time via make MODEL=xxx.

Key changes:
- GQA-aware MIL kernels: sdpaFwd split from woFwd (Q_DIM!=DIM),
  qBwd/kvBwd split from qkvBwd (different IC dimensions)
- K/V tile (KV_HEADS→HEADS) before SDPA backward, reduce after
- 10 kernels total, all model-agnostic via compile-time defines
- Makefile: make MODEL=qwen3_06b (default) or MODEL=stories110m
- Both models verified: Stories110M ~115ms/step, Qwen3 ~412ms/step
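
The K/V tile and reduce step can be illustrated with a minimal sketch. It is not the repository's MIL kernel: each head is reduced to a single float, and the mapping of query head h to K/V head h // GROUP is an assumption (the commit does not state the head layout). Only HEADS=16 and KV_HEADS=8 come from the commit message; the function names are hypothetical.

```python
# Hypothetical sketch: each head is a single float to keep the example short.
HEADS = 16                  # query heads (from the commit message)
KV_HEADS = 8                # key/value heads (from the commit message)
GROUP = HEADS // KV_HEADS   # query heads sharing one K/V head

def tile_kv(kv):
    """Expand KV_HEADS entries to HEADS by repeating each one GROUP times.

    Assumes query head h reads K/V head h // GROUP (contiguous grouping).
    """
    assert len(kv) == KV_HEADS
    return [kv[h // GROUP] for h in range(HEADS)]

def reduce_kv_grad(grad_tiled):
    """Sum the gradients of the GROUP copies back onto each original K/V head."""
    assert len(grad_tiled) == HEADS
    return [sum(grad_tiled[g * GROUP:(g + 1) * GROUP]) for g in range(KV_HEADS)]

k = [float(i) for i in range(KV_HEADS)]
k_tiled = tile_kv(k)             # length 16: 0, 0, 1, 1, 2, 2, ...
dk_tiled = [1.0] * HEADS         # stand-in gradient from the SDPA backward pass
dk = reduce_kv_grad(dk_tiled)    # length 8: each entry sums its two copies
```

The reduce is a sum because every tiled copy is the same tensor, so its gradient is the sum of the gradients flowing into each copy.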

2026-03-06 06:23:15 -08:00

maderix

926f977b40

Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping

Three bugs prevented loss from converging below 5.5 (unigram plateau):

1. FP16 underflow in ANE backward matmuls: gradient (~8e-5) × weight (~0.036)
   products flushed to zero in fp16. Fixed with global loss scaling (256×)
   applied once to dlogits, divided out before Adam update.

2. Backward weight staging used raw weights instead of transposed — all 4
   backward kernels (wotBwd, qkvBwd, ffnBwdW2t, ffnBwdW13t) now use
   pre-transposed buffers (Wot_buf, Wqt_buf, etc.).

3. Added AdamW (decoupled weight decay, wd=0.1 for weights, 0.0 for norms),
   activation clipping (act_clip=20), gradient clipping, cosine LR schedule,
   per-layer IOSurface weight pre-staging, and vocab compaction.
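
The underflow in item 1 can be reproduced with a small sketch using the magnitudes quoted there. It assumes the hardware flushes fp16 subnormals to zero, which is how the "flushed to zero" wording is read here; under strict IEEE 754 the product would survive as a low-precision subnormal instead. The helper name to_fp16_ftz is hypothetical.

```python
import struct

FP16_MIN_NORMAL = 2.0 ** -14   # smallest normal half-precision value (~6.1e-5)

def to_fp16_ftz(x):
    """Round to half precision, then flush subnormal results to zero.

    The flush-to-zero step is an assumption about the hardware, not IEEE behavior.
    """
    h = struct.unpack("<e", struct.pack("<e", x))[0]
    return 0.0 if abs(h) < FP16_MIN_NORMAL else h

grad = 8e-5      # gradient magnitude quoted in the commit message
weight = 0.036   # weight magnitude quoted in the commit message
SCALE = 256.0    # global loss scale quoted in the commit message

# Without scaling, the product (~2.9e-6) falls below the normal range.
unscaled = to_fp16_ftz(to_fp16_ftz(grad) * to_fp16_ftz(weight))

# With scaling, the product (~7.4e-4) stays normal; unscale in higher precision.
scaled = to_fp16_ftz(to_fp16_ftz(grad * SCALE) * to_fp16_ftz(weight))
recovered = scaled / SCALE

print(unscaled, recovered)
```

Because the scale is applied once to dlogits, it propagates through every backward matmul by linearity, and a single division before the optimizer step removes it.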

Loss now drops 9.14 → 5.74 in 500 steps from random init (87ms/step).
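
For reference, a scalar sketch of one AdamW step with decoupled weight decay. Only wd=0.1 (weights) and wd=0.0 (norms) come from the commit message; the learning rate, betas, and epsilon below are placeholder assumptions, and the function name is hypothetical.

```python
import math

def adamw_step(p, g, m, v, t, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    """One AdamW update for a single scalar parameter (t is the 1-based step)."""
    m = b1 * m + (1 - b1) * g          # first-moment estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * wd * p                # decoupled decay: applied to p, not to g
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# With a zero gradient only the decay term moves the parameter.
w, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.1)     # weight matrix entry
norm, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.0)  # norm gain, no decay
```

Keeping the decay out of the gradient means it is not rescaled by the adaptive denominator, which is the difference from Adam with L2 regularization.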

2026-03-05 07:23:08 -08:00

maderix

cb474e1537

Add dynamic weight training pipeline — 110ms/step without recompilation

Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps
bottleneck. Weights are passed via IOSurface spatial dimension instead of
baked as constants, so kernels compile once at startup (345ms) and run
indefinitely without exec() restart.

Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
  - 9 dynamic kernels shared across all 12 layers
  - Vocab compaction 32K→9.2K for faster classifier
  - Vectorized cross-entropy with vDSP/NEON
  - Adam optimizer with gradient clipping + cosine LR schedule
  - Checkpoint save/resume

- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface

- dashboard.py — updated with --dynamic flag for v2 pipeline support,
  improved step regex parsing, --scratch/--lr/--accum CLI args
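
One plausible reading of "vocab compaction" is remapping the token ids that actually occur in the training data onto a dense range, so the classifier only spans those ids. The commit does not describe the method, so the sketch below is an assumption, with hypothetical names and toy data.

```python
def build_compaction(token_ids):
    """Map each token id that occurs in the data to a dense id in [0, n_used)."""
    used = sorted(set(token_ids))                    # new id -> original id
    old_to_new = {old: new for new, old in enumerate(used)}
    return used, old_to_new

corpus = [5, 31999, 5, 42, 7, 42]                    # toy ids from a 32K vocabulary
new_to_old, old_to_new = build_compaction(corpus)

compact = [old_to_new[t] for t in corpus]            # ids the classifier would see
restored = [new_to_old[c] for c in compact]          # map back to original ids
```

Under this reading, the classifier's output dimension shrinks from the full vocabulary to len(new_to_old), which is where a speedup would come from.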

Performance: 110ms/step steady-state (no recompile overhead)
  ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms
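
A quick arithmetic check of the figures above, assuming the listed components do not overlap in time (the commit does not say whether they do):

```python
# Per-step breakdown in milliseconds, copied from the commit message.
breakdown_ms = {
    "ane_fwd": 21, "ane_bwd": 28, "io_fwd": 12, "io_bwd": 15,
    "silu": 10, "cls": 13, "rms": 5,
}
step_ms = 110                          # reported steady-state step time

accounted_ms = sum(breakdown_ms.values())
other_ms = step_ms - accounted_ms      # time not itemized in the breakdown

# Amortized cost of the old pipeline's ~3.7 s recompile every 10 steps.
recompile_per_step_ms = 3700 / 10

print(accounted_ms, other_ms, recompile_per_step_ms)
```

The itemized components sum to 104 ms, leaving about 6 ms of the 110 ms step unitemized, and the removed recompile amounted to roughly 370 ms per step when amortized.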

2026-03-03 04:34:55 -08:00

3 Commits