Commit Graph

43 Commits

Author SHA1 Message Date
Alvaro Videla 668c236a08
Merge 7ea45c2fab into 20cd236f61 2026-03-10 15:12:41 +05:30
maderix 20cd236f61 Add INT8 W8A8 support: 1.88x ANE throughput via quantize/dequantize MIL ops
- ane_int8_bench.m: standalone INT8 W8A8 vs FP16 benchmark (35.1 vs 18.6 TOPS on M4)
- bridge: add int8 weight blob builders (ane_bridge_build_weight_blob_int8, quantized)
- bridge: fix weight dict nil → @{} (prevents silent compile failure)
- README: update with Qwen3-0.6B, GQA, GPU↔ANE pipeline, INT8 results, file structure
2026-03-09 19:47:01 -07:00
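
A minimal CPU-side sketch of what an int8 quantize/dequantize round trip does, for readers unfamiliar with W8A8. The commit itself uses MIL quantize/dequantize ops on the ANE; the function names, the symmetric per-tensor scale, and the rounding mode below are all assumptions made for illustration and are not taken from the repository.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: symmetric per-tensor scale so that max|x| maps to 127. */
float int8_scale(const float *x, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > max_abs) max_abs = a;
    }
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

/* Quantize: q = clamp(round(x / scale), -127, 127). */
void quantize_int8(const float *x, int8_t *q, size_t n, float scale) {
    for (size_t i = 0; i < n; i++) {
        float v = x[i] / scale;
        int r = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f); /* round half away from zero */
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q[i] = (int8_t)r;
    }
}

/* Dequantize: x ~= q * scale. */
void dequantize_int8(const int8_t *q, float *x, size_t n, float scale) {
    for (size_t i = 0; i < n; i++) x[i] = (float)q[i] * scale;
}
```

The round trip loses at most about half a quantization step per element, which is the accuracy cost paid for the throughput gain reported above.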
maderix 7d61ee4d25 Multi-model dashboard with GQA, W&B integration, and best-loss checkpointing
Dashboard: multi-model support (Stories110M + Qwen3-0.6B) with GQA-aware
text generation and KV cache. Weights & Biases logging (--wandb flag) for
loss, timing, power, and checkpoint events. Top-k=50 sampling to eliminate
garbage tokens from untrained vocab entries. Tokenizer reads any vocab size.

train.m: only save checkpoint when loss improves (best_loss tracking).
2026-03-07 02:56:27 -08:00
maderix 475348ad14 Add Qwen3-0.6B GQA support and multi-model build system
Implement Grouped-Query Attention (16q/8kv heads, head_dim=128) for
Qwen3-0.6B (28 layers, 596M params). Model configs moved to
models/*.h headers selected at build time via make MODEL=xxx.

Key changes:
- GQA-aware MIL kernels: sdpaFwd split from woFwd (Q_DIM!=DIM),
  qBwd/kvBwd split from qkvBwd (different IC dimensions)
- K/V tile (KV_HEADS→HEADS) before SDPA backward, reduce after
- 10 kernels total, all model-agnostic via compile-time defines
- Makefile: make MODEL=qwen3_06b (default) or MODEL=stories110m
- Both models verified: Stories110M ~115ms/step, Qwen3 ~412ms/step
2026-03-06 06:23:15 -08:00
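
The K/V tile and reduce steps mentioned above can be sketched on the CPU as follows. This is an illustration only: it assumes each head is stored contiguously and that query head h reads KV head h / (HEADS / KV_HEADS), which is a common GQA convention but is not confirmed by the commit. The function names are hypothetical.

```c
#include <stddef.h>

/* Expand K or V from kv_heads to heads.
   Assumed mapping: query head h uses KV head h / group. */
void gqa_tile(const float *kv, float *out,
              int heads, int kv_heads, size_t head_elems) {
    int group = heads / kv_heads;
    for (int h = 0; h < heads; h++) {
        const float *src = kv + (size_t)(h / group) * head_elems;
        float *dst = out + (size_t)h * head_elems;
        for (size_t i = 0; i < head_elems; i++) dst[i] = src[i];
    }
}

/* Backward of gqa_tile: sum the gradients of all query heads
   that share one KV head. */
void gqa_reduce(const float *d_tiled, float *d_kv,
                int heads, int kv_heads, size_t head_elems) {
    int group = heads / kv_heads;
    for (size_t i = 0; i < (size_t)kv_heads * head_elems; i++) d_kv[i] = 0.0f;
    for (int h = 0; h < heads; h++) {
        const float *src = d_tiled + (size_t)h * head_elems;
        float *dst = d_kv + (size_t)(h / group) * head_elems;
        for (size_t i = 0; i < head_elems; i++) dst[i] += src[i];
    }
}
```

With the 16q/8kv configuration above, group is 2, so each KV head is copied twice in the forward direction and receives the sum of two gradients in the backward direction.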
maderix c3c5094865 Fixed the dynamic pipeline logit generation 2026-03-06 04:51:32 -08:00
maderix 06535fc5be Fix dashboard text generation: add KV cache for proper autoregressive attention 2026-03-05 08:14:21 -08:00
maderix 19da850fca Use ACCELERATE_NEW_LAPACK to fix deprecated cblas warnings 2026-03-05 08:07:47 -08:00
maderix 389ee0dc77 Add --data flag to pass training data path from dashboard to binary 2026-03-05 08:03:54 -08:00
maderix 9595b1a499 Add tokenizer via git-lfs, fix dashboard tokenizer path
- Add tokenizer.bin (434KB) to assets/models/ via git-lfs
- Fix dashboard tokenizer path (was one parent too many)
2026-03-05 07:41:33 -08:00
maderix 926f977b40 Fix backward pass: global loss scaling, weight transpose, AdamW, activation clipping
Three issues prevented loss from converging below 5.5 (unigram plateau):

1. FP16 underflow in ANE backward matmuls: gradient (~8e-5) × weight (~0.036)
   products flushed to zero in fp16. Fixed with global loss scaling (256×)
   applied once to dlogits, divided out before Adam update.

2. Backward weight staging used raw weights instead of transposed — all 4
   backward kernels (wotBwd, qkvBwd, ffnBwdW2t, ffnBwdW13t) now use
   pre-transposed buffers (Wot_buf, Wqt_buf, etc.).

3. Added AdamW (decoupled weight decay, wd=0.1 for weights, 0.0 for norms),
   activation clipping (act_clip=20), gradient clipping, cosine LR schedule,
   per-layer IOSurface weight pre-staging, and vocab compaction.

Loss now drops 9.14 → 5.74 in 500 steps from random init (87ms/step).
2026-03-05 07:23:08 -08:00
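
The arithmetic behind item 1 can be checked with a small model. The sketch below is not the repository's code: it assumes the hardware flushes fp16 subnormals to zero, models only that flush (not fp16 rounding), and uses hypothetical function names. The smallest normal fp16 value is 2^-14, about 6.1e-5, so the unscaled product 8e-5 × 0.036 ≈ 2.9e-6 falls below it, while the 256× scaled product ≈ 7.4e-4 does not.

```c
/* Smallest normal fp16 magnitude, 2^-14. */
#define FP16_MIN_NORMAL 6.103515625e-5f

/* Model of fp16 storage that flushes subnormals to zero.
   This flush behaviour is an assumption about the hardware. */
float fp16_flush(float x) {
    float a = x < 0.0f ? -x : x;
    return a < FP16_MIN_NORMAL ? 0.0f : x;
}

/* One backward product grad * weight. The gradient is multiplied by
   loss_scale before the fp16 multiply, and the result is divided by
   loss_scale in fp32 before the optimizer would use it. */
float scaled_backward_product(float grad, float weight, float loss_scale) {
    float scaled = fp16_flush(grad * loss_scale);
    float prod = fp16_flush(scaled * weight);
    return prod / loss_scale;
}
```

Calling scaled_backward_product(8e-5f, 0.036f, 1.0f) returns zero under this model, whereas a loss scale of 256.0f recovers a value close to 2.88e-6.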
maderix efcf193075 Add model config to benchmark report, update README with current results
Benchmark report now includes full Stories110M model configuration
(arch, layers, dims, kernels). README updated: 12-layer results
replace stale single-layer numbers, limitations reflect current state.
2026-03-04 06:13:21 -08:00
maderix 1a7d8846b2 Add NE core counts, clarify FP16 vs rated TOPS methodology
All chips have 16 NE cores except Ultra (32 via UltraFusion).
M4 38 TOPS is INT8/mixed-precision, not comparable to M3 FP16 spec.
2026-03-04 06:11:29 -08:00
maderix 050bc4fdf0 Add cross-generation ANE benchmark report from issue #3
Community-submitted results for M1 Pro/Max, M3 Pro, M4 Pro/Max, M5.
Includes training performance, peak throughput, MIL compatibility
matrix, and structured JSON data.
2026-03-04 05:30:00 -08:00
maderix e986572e90 Replace assert() with non-fatal bounds checks on token IDs
Follow-up to PR #31 — assert() aborts on bad tokens, which is too
harsh for training. Skip bad tokens with a warning instead.
2026-03-04 04:41:38 -08:00
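
A rough sketch of the skip-with-warning pattern described above, in contrast to assert(), which would abort the whole run. The function name, signature, and warning text are hypothetical; the repository may apply the check inline at each use site instead.

```c
#include <stdio.h>
#include <stddef.h>

/* Copy valid token IDs into out and skip invalid ones with a warning.
   Returns the number of tokens kept. out must hold at least n ints. */
size_t filter_tokens(const int *tokens, size_t n, int vocab_size, int *out) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (tokens[i] < 0 || tokens[i] >= vocab_size) {
            fprintf(stderr,
                    "warning: skipping out-of-range token %d at position %zu\n",
                    tokens[i], i);
            continue; /* non-fatal: training carries on */
        }
        out[kept++] = tokens[i];
    }
    return kept;
}
```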
Manjeet Singh 05fc8f85e3
Merge pull request #31 from alvgeppetto-debug/fix/safety-correctness
fix: correctness and safety improvements for training
2026-03-04 18:09:56 +05:30
Manjeet Singh 032f866f2d
Merge pull request #29 from nabbilkhan/contrib/fix-training-data-paths
Fix hardcoded TinyStories data path in train_large/train_large_ane
2026-03-04 17:48:43 +05:30
Manjeet Singh 44309b7625
Merge pull request #27 from jskromer/fix/macos26-inmemory-benchmarks
Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory MIL
2026-03-04 17:48:39 +05:30
Manjeet Singh 7fbb912a89
Merge pull request #20 from guitared/main
Optimize dashboard and prevent sudo hang when password needed
2026-03-04 17:48:30 +05:30
Manjeet Singh 37939c8a60
Merge pull request #34 from 04cb/fix/docs-add-training-data-link
Fix docs: add training data download instructions
2026-03-04 17:48:25 +05:30
Manjeet Singh 3efa27d7a3
Merge pull request #17 from TastyHeadphones/tastyheadphones/short-dataset-underflow-fix
Fix token sampling underflow for short token datasets
2026-03-04 17:48:22 +05:30
Manjeet Singh 4a6f3e40a9
Revise README for clarity and project details
Updated README to reflect project scope, architecture, and limitations.
2026-03-04 12:59:09 +05:30
04cb 0d9e139567 Fix docs: add training data download instructions 2026-03-04 08:16:20 +08:00
Alvaro GPT 7ea45c2fab perf: vectorize CPU bottlenecks with vDSP and cblas
- Vectorize adam_update with vDSP batch ops (stories_cpu_ops.h)
  Replaces scalar per-element loop with vDSP_vsmul/vsma/vsq/vdiv
  Expected ~3-4x faster for 2.4M parameter updates

- Vectorize model_adam_step ADAM_UPDATE macro with vDSP (backward.h)
  Same batch ops pattern for the train.m model pipeline

- Replace cpu_accum_dW with cblas_sgemm (backward.h)
  dW += dy^T @ x is a standard BLAS GEMM operation
  Expected 5-10x faster for weight gradient accumulation

- Replace cpu_matmul_backward_dx with cblas_sgemm (backward.h)
  dx = dy @ W^T is also a standard BLAS GEMM

- Add -framework Accelerate to train target (Makefile)
2026-03-03 20:47:03 +01:00
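
For reference, the weight-gradient accumulation that the commit maps onto cblas_sgemm can be written as plain loops. The sketch below assumes row-major buffers with dy shaped [T x out], x shaped [T x in], and dW shaped [out x in]; the actual layout in backward.h is not stated in the commit, and the function name is hypothetical.

```c
#include <stddef.h>

/* Reference for dW += dy^T @ x.
   Assumed row-major shapes: dy is [T x out], x is [T x in], dW is [out x in].
   Under these assumptions the equivalent BLAS call would be:
     cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                 out, in, T, 1.0f, dy, out, x, in, 1.0f, dW, in);
   where beta = 1.0f makes the call accumulate into dW. */
void accum_dW_ref(float *dW, const float *dy, const float *x,
                  int T, int out, int in) {
    for (int t = 0; t < T; t++) {
        for (int o = 0; o < out; o++) {
            float g = dy[(size_t)t * out + o];
            for (int i = 0; i < in; i++)
                dW[(size_t)o * in + i] += g * x[(size_t)t * in + i];
        }
    }
}
```

A loop version like this is also useful as a correctness check against the BLAS path on small shapes.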
Alvaro GPT 541bf4ec90 fix: correctness & safety improvements
- Validate all fread() return values in model_load_weights (model.h)
- Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m)
- Log error details on ANE eval failure (ane_runtime.h)
- Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h)
- Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward
- Atomic checkpoint writes via tmp+rename pattern (tiny_train.m)
- Non-destructive recompile: compile new kernels first, swap only on success (model.h)
- Validate fread() in load_checkpoint (tiny_train.m)
2026-03-03 20:46:58 +01:00
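
The tmp+rename and fread() validation items above follow a well-known pattern, sketched here under stated assumptions: the function names, the ".tmp" suffix, and the flat byte-buffer checkpoint format are hypothetical and not taken from tiny_train.m. The sketch relies on POSIX rename() replacing the destination atomically and does not call fsync, so it protects against a crash mid-write but not necessarily against power loss.

```c
#include <stdio.h>
#include <stddef.h>

/* Write data to "<path>.tmp", then rename it over path.
   Returns 0 on success, -1 on failure. A reader never sees a
   half-written checkpoint at path. */
int save_checkpoint_atomic(const char *path, const void *data, size_t size) {
    char tmp[1024];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    int ok = fwrite(data, 1, size, f) == size;
    if (fclose(f) != 0) ok = 0; /* fclose flushes; failure means data may be missing */
    if (!ok || rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}

/* Read exactly size bytes; a short read is reported as an error
   instead of leaving part of the buffer uninitialized. */
int load_checkpoint(const char *path, void *data, size_t size) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t got = fread(data, 1, size, f);
    fclose(f);
    return got == size ? 0 : -1;
}
```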
nabbilkhan c04168ee17 Add --data path support for static training pipelines 2026-03-03 19:19:49 +00:00
John Stephen Kromer d3d00307c0 Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory MIL pipeline
[MLModel compileModelAtURL:] fails on macOS 26, breaking inmem_bench,
sram_bench, and sram_probe. This switches all three to generate MIL text
and weight blobs programmatically in memory (matching the working
inmem_peak.m approach), bypassing CoreML disk compilation entirely.

- inmem_bench.m: replace CoreML compile + file read with genMIL/buildWeightBlob
- sram_bench.m: switch from _ANEClient/_ANEModel to _ANEInMemoryModel API
- sram_probe.m: same _ANEClient → _ANEInMemoryModel conversion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:20:05 -08:00
maderix 443194bca4 Dashboard v2: live stats, JSON parsing, all three pipelines
- Parse static pipeline JSON step/batch/perf lines for real-time updates
- Running elapsed time, ms/step from wall-clock timestamps, steps/sec
- Compute ANE + Total TFLOPS from FLOPs/step when not reported directly
- Support --ane (train_large_ane) and --no-ane-extras flags
- Dynamic pipeline timing breakdown + CKPT_PATH per mode
2026-03-03 05:24:35 -08:00
maderix 3c1aae65d7 Merge dynamic training pipeline + CLI fixes + benchmark comparison 2026-03-03 04:36:03 -08:00
maderix 4c14ed0e25 CLI fixes + --no-ane-extras flag + README benchmark table
- Fix positional arg parsing (model_path, steps, lr were silently ignored)
- Add --model, --ckpt flags; forward ckpt_path across exec() restarts
- Add --no-ane-extras to disable ANE classifier/softmax/rmsnorm_bwd
- CPU fallback for softmax/classifier/rmsnorm_bwd when extras disabled
- Update README with 4-way benchmark comparison table (20 steps)
2026-03-03 04:34:55 -08:00
maderix cb474e1537 Add dynamic weight training pipeline — 110ms/step without recompilation
Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps
bottleneck. Weights are passed via IOSurface spatial dimension instead of
baked as constants, so kernels compile once at startup (345ms) and run
indefinitely without exec() restart.

Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
  - 9 dynamic kernels shared across all 12 layers
  - Vocab compaction 32K→9.2K for faster classifier
  - Vectorized cross-entropy with vDSP/NEON
  - Adam optimizer with gradient clipping + cosine LR schedule
  - Checkpoint save/resume

- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface

- dashboard.py — updated with --dynamic flag for v2 pipeline support,
  improved step regex parsing, --scratch/--lr/--accum CLI args

Performance: 110ms/step steady-state (no recompile overhead)
  ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms
2026-03-03 04:34:55 -08:00
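
Vocab compaction, as listed above, shrinks the classifier by keeping only token IDs that occur in the training data. A minimal sketch of building such a mapping follows; the function name and the choice to assign compact IDs in increasing order of the original ID are assumptions, not details from the commit.

```c
#include <stddef.h>

/* Build a map from original token ID to compact ID.
   map must hold vocab_size ints; IDs that never occur map to -1.
   Returns the compact vocabulary size. */
int build_vocab_map(const int *tokens, size_t n, int vocab_size, int *map) {
    for (int v = 0; v < vocab_size; v++) map[v] = -1;

    /* First pass: mark every ID that appears in the data. */
    for (size_t i = 0; i < n; i++)
        if (tokens[i] >= 0 && tokens[i] < vocab_size) map[tokens[i]] = 0;

    /* Second pass: assign consecutive compact IDs to the marked entries. */
    int next = 0;
    for (int v = 0; v < vocab_size; v++)
        if (map[v] == 0) map[v] = next++;
    return next;
}
```

The classifier then only needs one output row per compact ID, which is where the 32K→9.2K reduction would save work.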
Manjeet Singh c33077430e
Merge PR #19: Bridge API + ANE classifier/softmax/rmsnorm_bwd offload (16% faster)
Bridge+Memory leak fix+More functions
2026-03-03 13:10:57 +05:30
Guitared a14ce098fb
Capitalize doc header 2026-03-03 14:18:35 +07:00
Guitared b8f09a6853
fix non-interactive session error and sudo password input for powermetrics 2026-03-03 14:14:30 +07:00
Guitared 65cfc3255f
optimize singleton token params in generate_text 2026-03-03 14:11:42 +07:00
Vipul ebac5dd73f Python Bridge+Memory leak fix+More functions 2026-03-03 02:04:36 -05:00
tastyheadphones 2b3b7ae5cc Fix token sampling underflow on short datasets 2026-03-03 11:42:42 +09:00
Manjeet Singh 1b792fce34
Merge pull request #15 from maderix/claude/add-readme-scope-notice-EL9sS
Add Project Scope & Intent notice to README
2026-03-03 06:26:35 +05:30
Claude 752a3be81a
Add Project Scope & Intent notice to README
Weave in scope notice near the top covering project intent, what it
is/isn't, hype clarification, maintenance expectations, and fork
encouragement. Consolidate private API disclaimer with existing
disclaimer section to avoid duplication.

https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv
2026-03-03 00:54:46 +00:00
Manjeet Singh 893f58e725
Merge pull request #2 from m0at/m5-maximized
ANE probe tests + training telemetry for M5 optimization
2026-03-02 14:57:12 +05:30
m0at 184b182bfc Add M5 probe results: weight reload fails, all QoS work, chaining API found
Key findings from running all 4 probes on Apple M5:

- Weight reload (unload+load after file overwrite) does NOT work — weights
  are baked at compile time, output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models

Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern)
and 64+ channel kernels (ANE minimum size requirement).

Full analysis in training/m5result.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 23:16:38 -08:00
m0at 40d3f45631 Add ANE probe tests and training telemetry for M5 optimization
Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 22:54:58 -08:00
maderix 4d67db1bdb stories110M: 12-layer ANE training with dashboard, 107ms/step
- Scale to full stories110M (109M params, 12 layers) with real TinyStories data
- vDSP-vectorized cross-entropy (110ms→14ms), NEON fp16 IO, async dW
- TUI dashboard: loss curve, ANE/CPU power, CPU/memory graphs, text generation
- Split into modular headers: config, io, mil, cpu_ops
2026-03-01 03:14:39 -08:00
maderix f213c8db68 Initial release 2026-02-28 00:22:06 -08:00