ANE/training_dynamic at 9595b1a499c548bc4b7201b84e0e1adda1a71df4 - ANE

maderix 926f977b40	2026-03-05 07:23:08 -08:00
Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping

Three issues kept the loss from converging below 5.5 (the unigram plateau):

1. FP16 underflow in the ANE backward matmuls: gradient (~8e-5) × weight (~0.036) products flushed to zero in fp16. Fixed with global loss scaling (256×), applied once to dlogits and divided out before the Adam update.
2. Backward weight staging used the raw weights instead of the transposed ones. All 4 backward kernels (wotBwd, qkvBwd, ffnBwdW2t, ffnBwdW13t) now use pre-transposed buffers (Wot_buf, Wqt_buf, etc.).
3. Added AdamW (decoupled weight decay, wd=0.1 for weights, 0.0 for norms), activation clipping (act_clip=20), gradient clipping, a cosine LR schedule, per-layer IOSurface weight pre-staging, and vocab compaction.

Loss now drops 9.14 → 5.74 in 500 steps from random init (87ms/step).
Makefile	Add dynamic weight training pipeline — 110ms/step without recompilation	2026-03-03 04:34:55 -08:00
config.h	Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping	2026-03-05 07:23:08 -08:00
cpu_ops.h	Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping	2026-03-05 07:23:08 -08:00
io.h	Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping	2026-03-05 07:23:08 -08:00
mil_dynamic.h	Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping	2026-03-05 07:23:08 -08:00
train.m	Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping	2026-03-05 07:23:08 -08:00
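
The underflow described in the commit message above can be illustrated with a small sketch. This is not code from train.m: the function names (fp16_flush, backward_product, scaled_backward_product) are hypothetical, and fp16 is only approximated by flushing any value below the smallest normal half-precision number (2^-14) to zero, which assumes the hardware does not keep subnormals and ignores mantissa rounding. The 256× factor and the sample magnitudes are the ones quoted in the commit message.

```c
#include <math.h>

/* Smallest normal fp16 value, 2^-14. Assumption: anything smaller is
 * flushed to zero; mantissa rounding is not modelled. */
#define FP16_MIN_NORMAL 6.103515625e-5f

/* Global loss scale quoted in the commit message. */
#define LOSS_SCALE 256.0f

/* Hypothetical helper: crude stand-in for storing a value in fp16. */
float fp16_flush(float x) {
    return (fabsf(x) < FP16_MIN_NORMAL) ? 0.0f : x;
}

/* One term of a backward matmul, dx = dy * w, stored in simulated fp16. */
float backward_product(float dy, float w) {
    return fp16_flush(dy * w);
}

/* Same term with loss scaling: the upstream gradient is scaled once,
 * the product is formed in simulated fp16, and the scale is divided
 * out in fp32 before the value would reach the optimizer. */
float scaled_backward_product(float dy, float w) {
    float scaled = backward_product(dy * LOSS_SCALE, w);
    return scaled / LOSS_SCALE;
}
```

With the magnitudes from the commit message, backward_product(8e-5f, 0.036f) yields 0 because the product (about 2.88e-6) is below the fp16 normal range, while scaled_backward_product(8e-5f, 0.036f) forms the product at about 7.4e-4 and recovers roughly 2.88e-6 after unscaling.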