ANE/training/training_dynamic
maderix 926f977b40 Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping
Three bugs prevented loss from converging below 5.5 (unigram plateau):

1. FP16 underflow in ANE backward matmuls: gradient (~8e-5) × weight (~0.036)
   products (~2.9e-6) fell below the smallest normal fp16 value (~6.1e-5) and
   flushed to zero. Fixed with global loss scaling (256×) applied once to
   dlogits and divided out before the AdamW update.

2. Backward weight staging used raw weights instead of their transposes. All 4
   backward kernels (wotBwd, qkvBwd, ffnBwdW2t, ffnBwdW13t) now use
   pre-transposed buffers (Wot_buf, Wqt_buf, etc.).

3. Added AdamW (decoupled weight decay, wd=0.1 for weights, 0.0 for norms),
   activation clipping (act_clip=20), gradient clipping, cosine LR schedule,
   per-layer IOSurface weight pre-staging, and vocab compaction.
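
A minimal sketch of why the loss scaling in item 1 helps, using the magnitudes quoted there. It is not taken from train.m: `fp16_flush` is a hypothetical stand-in for an fp16 unit that flushes subnormals to zero (rounding of normal values is ignored), and `scaled_product` is an illustrative helper name.

```c
#include <assert.h>

/* Smallest positive normal fp16 value, 2^-14. */
#define FP16_MIN_NORMAL 6.103515625e-5f

/* Assumption: model fp16 as "flush anything below the smallest normal value
   to zero". Real fp16 also rounds normal values; that is not modeled here. */
static float fp16_flush(float x) {
    float mag = (x < 0.0f) ? -x : x;
    return (mag < FP16_MIN_NORMAL) ? 0.0f : x;
}

/* One backward product grad * weight. The gradient is multiplied by `scale`
   before the fp16 product, and the result is unscaled in fp32 afterwards. */
static float scaled_product(float grad, float weight, float scale) {
    assert(scale > 0.0f);
    float scaled = fp16_flush(fp16_flush(grad * scale) * weight);
    return scaled / scale;
}
```

With `scale = 1` the product 8e-5 × 0.036 is flushed to zero under this model; with `scale = 256` the intermediate value is about 7.4e-4, which survives, and dividing by 256 afterwards recovers roughly 2.88e-6.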

Loss now drops 9.14 → 5.74 in 500 steps from random init (87ms/step).
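
To illustrate item 2: for a forward product y = W x, the input gradient is dx = W^T dy, so the backward kernel needs the transposed weights. The sketch below assumes row-major fp32 storage; `transpose` and `matvec` are illustrative helpers, not the staging code behind Wot_buf or Wqt_buf.

```c
#include <assert.h>
#include <stddef.h>

/* Write the transpose of row-major W (rows x cols) into Wt (cols x rows).
   W and Wt must not alias. */
static void transpose(const float *W, float *Wt, size_t rows, size_t cols) {
    assert(W != Wt);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            Wt[c * rows + r] = W[r * cols + c];
}

/* y = M x for a row-major matrix M (rows x cols). */
static void matvec(const float *M, const float *x, float *y,
                   size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++)
            acc += M[r * cols + c] * x[c];
        y[r] = acc;
    }
}
```

For a 2×3 weight matrix, the forward pass calls `matvec(W, x, y, 2, 3)` and the backward pass calls `matvec(Wt, dy, dx, 3, 2)` on the pre-transposed copy. Passing the raw `W` to the backward call would read the same memory with the wrong shape and produce incorrect gradients.
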
2026-03-05 07:23:08 -08:00
Makefile Add dynamic weight training pipeline — 110ms/step without recompilation 2026-03-03 04:34:55 -08:00
config.h Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping 2026-03-05 07:23:08 -08:00
cpu_ops.h Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping 2026-03-05 07:23:08 -08:00
io.h Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping 2026-03-05 07:23:08 -08:00
mil_dynamic.h Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping 2026-03-05 07:23:08 -08:00
train.m Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping 2026-03-05 07:23:08 -08:00