berkus/ANE - ANE

Commit Graph

Author	SHA1	Message	Date
maderix	06535fc5be	Fix dashboard text generation: add KV cache for proper autoregressive attention	2026-03-05 08:14:21 -08:00
maderix	19da850fca	Use ACCELERATE_NEW_LAPACK to fix deprecated cblas warnings	2026-03-05 08:07:47 -08:00
maderix	389ee0dc77	Add --data flag to pass training data path from dashboard to binary	2026-03-05 08:03:54 -08:00
maderix	9595b1a499	Add tokenizer via git-lfs, fix dashboard tokenizer path - Add tokenizer.bin (434KB) to assets/models/ via git-lfs - Fix dashboard tokenizer path (was one parent too many)	2026-03-05 07:41:33 -08:00
maderix	926f977b40	Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping Three bugs prevented loss from converging below 5.5 (unigram plateau): 1. FP16 underflow in ANE backward matmuls: gradient (~8e-5) × weight (~0.036) products flushed to zero in fp16. Fixed with global loss scaling (256×) applied once to dlogits, divided out before Adam update. 2. Backward weight staging used raw weights instead of transposed — all 4 backward kernels (wotBwd, qkvBwd, ffnBwdW2t, ffnBwdW13t) now use pre-transposed buffers (Wot_buf, Wqt_buf, etc.). 3. Added AdamW (decoupled weight decay, wd=0.1 for weights, 0.0 for norms), activation clipping (act_clip=20), gradient clipping, cosine LR schedule, per-layer IOSurface weight pre-staging, and vocab compaction. Loss now drops 9.14 → 5.74 in 500 steps from random init (87ms/step).	2026-03-05 07:23:08 -08:00
maderix	e986572e90	Replace assert() with non-fatal bounds checks on token IDs Follow-up to PR #31 — assert() aborts on bad tokens, which is too harsh for training. Skip bad tokens with a warning instead.	2026-03-04 04:41:38 -08:00
Manjeet Singh	05fc8f85e3	Merge pull request #31 from alvgeppetto-debug/fix/safety-correctness fix: correctness and safety improvements for training	2026-03-04 18:09:56 +05:30
Manjeet Singh	032f866f2d	Merge pull request #29 from nabbilkhan/contrib/fix-training-data-paths Fix hardcoded TinyStories data path in train_large/train_large_ane	2026-03-04 17:48:43 +05:30
Manjeet Singh	7fbb912a89	Merge pull request #20 from guitared/main Optimize dashboard and prevent sudo hang when password needed	2026-03-04 17:48:30 +05:30
Manjeet Singh	3efa27d7a3	Merge pull request #17 from TastyHeadphones/tastyheadphones/short-dataset-underflow-fix Fix token sampling underflow for short token datasets	2026-03-04 17:48:22 +05:30
Alvaro GPT	541bf4ec90	fix: correctness & safety improvements - Validate all fread() return values in model_load_weights (model.h) - Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m) - Log error details on ANE eval failure (ane_runtime.h) - Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h) - Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward - Atomic checkpoint writes via tmp+rename pattern (tiny_train.m) - Non-destructive recompile: compile new kernels first, swap only on success (model.h) - Validate fread() in load_checkpoint (tiny_train.m)	2026-03-03 20:46:58 +01:00
nabbilkhan	c04168ee17	Add --data path support for static training pipelines	2026-03-03 19:19:49 +00:00
maderix	443194bca4	Dashboard v2: live stats, JSON parsing, all three pipelines - Parse static pipeline JSON step/batch/perf lines for real-time updates - Running elapsed time, ms/step from wall-clock timestamps, steps/sec - Compute ANE + Total TFLOPS from FLOPs/step when not reported directly - Support --ane (train_large_ane) and --no-ane-extras flags - Dynamic pipeline timing breakdown + CKPT_PATH per mode	2026-03-03 05:24:35 -08:00
maderix	4c14ed0e25	CLI fixes + --no-ane-extras flag + README benchmark table - Fix positional arg parsing (model_path, steps, lr were silently ignored) - Add --model, --ckpt flags; forward ckpt_path across exec() restarts - Add --no-ane-extras to disable ANE classifier/softmax/rmsnorm_bwd - CPU fallback for softmax/classifier/rmsnorm_bwd when extras disabled - Update README with 4-way benchmark comparison table (20 steps)	2026-03-03 04:34:55 -08:00
maderix	cb474e1537	Add dynamic weight training pipeline — 110ms/step without recompilation Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps bottleneck. Weights are passed via IOSurface spatial dimension instead of baked as constants, so kernels compile once at startup (345ms) and run indefinitely without exec() restart. Key components: - training_dynamic/ — full pipeline (config, IO, MIL generators, train loop) - 9 dynamic kernels shared across all 12 layers - Vocab compaction 32K→9.2K for faster classifier - Vectorized cross-entropy with vDSP/NEON - Adam optimizer with gradient clipping + cosine LR schedule - Checkpoint save/resume - test_dynamic_matmul.m — validates dynamic weight matmul vs cblas - test_weight_patch.m — tests weight update via IOSurface - dashboard.py — updated with --dynamic flag for v2 pipeline support, improved step regex parsing, --scratch/--lr/--accum CLI args Performance: 110ms/step steady-state (no recompile overhead) ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms	2026-03-03 04:34:55 -08:00
Guitared	b8f09a6853	fix non-interactive session error and sudo password input for powermetrics	2026-03-03 14:14:30 +07:00
Guitared	65cfc3255f	optimize singleton token params in generate_text	2026-03-03 14:11:42 +07:00
Vipul	ebac5dd73f	Python Bridge+Memory leak fix+More functions	2026-03-03 02:04:36 -05:00
tastyheadphones	2b3b7ae5cc	Fix token sampling underflow on short datasets	2026-03-03 11:42:42 +09:00
m0at	184b182bfc	Add M5 probe results: weight reload fails, all QoS work, chaining API found Key findings from running all 4 probes on Apple M5: - Weight reload (unload+load after file overwrite) does NOT work — weights are baked at compile time, output is identical regardless of file changes - weightsBuffer IOSurface parameter also does not override compiled weights - All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval) - _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData - _ANEChainingRequest supports loopback execution (output→input chaining) - _ANEClient has real-time eval path and chaining preparation methods - procedureIndex 0-15 all succeed on single-procedure models Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern) and 64+ channel kernels (ANE minimum size requirement). Full analysis in training/m5result.md. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 23:16:38 -08:00
m0at	40d3f45631	Add ANE probe tests and training telemetry for M5 optimization Four standalone probe tests to characterize the M5 ANE: - test_weight_reload: Can weights be hot-swapped via unload+load without recompilation? - test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters - test_qos_sweep: Measure compile/load/eval latency across QoS 0-63 - test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient Training telemetry (train_large.m): - JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics - Enables external monitoring tools to visualize ANE utilization in real-time Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 22:54:58 -08:00
maderix	4d67db1bdb	stories110M: 12-layer ANE training with dashboard, 107ms/step - Scale to full stories110M (109M params, 12 layers) with real TinyStories data - vDSP-vectorized cross-entropy (110ms→14ms), NEON fp16 IO, async dW - TUI dashboard: loss curve, ANE/CPU power, CPU/memory graphs, text generation - Split into modular headers: config, io, mil, cpu_ops	2026-03-01 03:14:39 -08:00
maderix	f213c8db68	Initial release	2026-02-28 00:22:06 -08:00
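
Commit 541bf4ec90 above mentions "atomic checkpoint writes via tmp+rename pattern" in tiny_train.m. The repository's version is Objective-C and is not shown on this page. What follows is only a minimal Python sketch of the general pattern, and the function name `save_checkpoint_atomic` and the `.ckpt-` prefix are hypothetical. The new data is written to a temporary file in the same directory, flushed to disk, and then renamed over the target, so a reader sees either the old checkpoint or the complete new one.

```python
import os
import tempfile


def save_checkpoint_atomic(path: str, payload: bytes) -> None:
    """Write payload to path so that a crash never leaves a half-written file.

    Hypothetical helper illustrating the tmp+rename pattern; not the repo's code.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic replacement of the old checkpoint
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Calling `save_checkpoint_atomic("ckpt.bin", data)` twice in a row replaces the first checkpoint with the second and leaves no temporary file behind.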

23 Commits
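
Commit 926f977b40 describes gradient × weight products of roughly 8e-5 × 0.036 being flushed to zero in fp16, and fixes this with a global 256× loss scale that is divided out before the optimizer update. The sketch below reproduces that arithmetic in Python under one stated assumption: that values below the smallest normal fp16 magnitude are flushed to zero (plain IEEE half precision would keep them as subnormals). The helpers `to_fp16_ftz` and `fp16_mul` are hypothetical and only model the rounding; they are not taken from the repository.

```python
import struct

FP16_MIN_NORMAL = 2.0 ** -14  # smallest normal fp16 magnitude (~6.1e-5)
LOSS_SCALE = 256.0            # scale factor quoted in commit 926f977b40


def to_fp16_ftz(x: float) -> float:
    """Round x to IEEE half precision, then flush subnormals to zero.

    The flush-to-zero step is an assumption about the hardware,
    not part of IEEE 754 rounding itself.
    """
    rounded = struct.unpack("<e", struct.pack("<e", x))[0]
    return 0.0 if abs(rounded) < FP16_MIN_NORMAL else rounded


def fp16_mul(a: float, b: float) -> float:
    """Multiply two values with fp16 inputs and an fp16 result."""
    return to_fp16_ftz(to_fp16_ftz(a) * to_fp16_ftz(b))


grad, weight = 8e-5, 0.036  # magnitudes quoted in the commit message

# Without scaling, the product (~2.9e-6) falls below the normal range.
unscaled = fp16_mul(grad, weight)

# With scaling, the gradient is multiplied by 256 once, up front,
# so the fp16 product (~7.4e-4) stays in the normal range.
scaled = fp16_mul(grad * LOSS_SCALE, weight)

# The scale is divided out in full precision before the optimizer step.
recovered = scaled / LOSS_SCALE

print(unscaled)   # 0.0 under the flush-to-zero assumption
print(recovered)  # close to grad * weight
```

The scale only has to be large enough to lift the small products into the normal fp16 range without pushing the largest gradients past the fp16 maximum of 65504. The sketch does not check that the commit's value of 256 meets both conditions for the actual model.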