Commit Graph

32 Commits

Author SHA1 Message Date
maderix efcf193075 Add model config to benchmark report, update README with current results
Benchmark report now includes full Stories110M model configuration
(arch, layers, dims, kernels). README updated: 12-layer results
replace stale single-layer numbers, limitations reflect current state.
2026-03-04 06:13:21 -08:00
maderix 1a7d8846b2 Add NE core counts, clarify FP16 vs rated TOPS methodology
All chips have 16 NE cores except Ultra (32 via UltraFusion).
M4 38 TOPS is INT8/mixed-precision, not comparable to M3 FP16 spec.
2026-03-04 06:11:29 -08:00
maderix 050bc4fdf0 Add cross-generation ANE benchmark report from issue #3
Community-submitted results for M1 Pro/Max, M3 Pro, M4 Pro/Max, M5.
Includes training performance, peak throughput, MIL compatibility
matrix, and structured JSON data.
2026-03-04 05:30:00 -08:00
maderix e986572e90 Replace assert() with non-fatal bounds checks on token IDs
Follow-up to PR #31 — assert() aborts on bad tokens, which is too
harsh for training. Skip bad tokens with a warning instead.
2026-03-04 04:41:38 -08:00
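
Illustrative note on e986572e90: the idea of skipping bad tokens with a warning instead of aborting can be sketched as below. This is a minimal Python sketch, not the repository's C code; the function name `filter_valid_tokens` and its parameters are hypothetical.

```python
import sys

def filter_valid_tokens(token_ids, vocab_size):
    """Keep IDs in [0, vocab_size); report and skip anything else.

    Hypothetical helper: the actual training code performs its checks
    inside the C loss and embedding routines.
    """
    valid = []
    for pos, tok in enumerate(token_ids):
        if 0 <= tok < vocab_size:
            valid.append(tok)
        else:
            # Non-fatal: warn on stderr and continue, rather than assert().
            print(f"warning: skipping out-of-range token {tok} at position {pos}",
                  file=sys.stderr)
    return valid
```

The trade-off is that a corrupted dataset no longer stops a long training run, at the cost of silently shorter sequences unless the warnings are monitored.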
Manjeet Singh 05fc8f85e3
Merge pull request #31 from alvgeppetto-debug/fix/safety-correctness
fix: correctness and safety improvements for training
2026-03-04 18:09:56 +05:30
Manjeet Singh 032f866f2d
Merge pull request #29 from nabbilkhan/contrib/fix-training-data-paths
Fix hardcoded TinyStories data path in train_large/train_large_ane
2026-03-04 17:48:43 +05:30
Manjeet Singh 44309b7625
Merge pull request #27 from jskromer/fix/macos26-inmemory-benchmarks
Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory MIL
2026-03-04 17:48:39 +05:30
Manjeet Singh 7fbb912a89
Merge pull request #20 from guitared/main
Optimize dashboard and prevent sudo hang when password needed
2026-03-04 17:48:30 +05:30
Manjeet Singh 37939c8a60
Merge pull request #34 from 04cb/fix/docs-add-training-data-link
Fix docs: add training data download instructions
2026-03-04 17:48:25 +05:30
Manjeet Singh 3efa27d7a3
Merge pull request #17 from TastyHeadphones/tastyheadphones/short-dataset-underflow-fix
Fix token sampling underflow for short token datasets
2026-03-04 17:48:22 +05:30
Manjeet Singh 4a6f3e40a9
Revise README for clarity and project details
Updated README to reflect project scope, architecture, and limitations.
2026-03-04 12:59:09 +05:30
04cb 0d9e139567 Fix docs: add training data download instructions 2026-03-04 08:16:20 +08:00
Alvaro GPT 541bf4ec90 fix: correctness & safety improvements
- Validate all fread() return values in model_load_weights (model.h)
- Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m)
- Log error details on ANE eval failure (ane_runtime.h)
- Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h)
- Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward
- Atomic checkpoint writes via tmp+rename pattern (tiny_train.m)
- Non-destructive recompile: compile new kernels first, swap only on success (model.h)
- Validate fread() in load_checkpoint (tiny_train.m)
2026-03-03 20:46:58 +01:00
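
Illustrative note on 541bf4ec90: the "tmp+rename" pattern named above writes the checkpoint to a temporary file in the same directory and then renames it over the target, so a reader never sees a half-written file. The sketch below shows the general pattern in Python, assuming a POSIX-style filesystem; `atomic_write` is a hypothetical name, and the code in tiny_train.m may differ in detail.

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Write data to path so that path holds either the old or the new content."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic replacement of the target
    except BaseException:
        # On any failure, remove the partial temp file and leave path untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies mid-write, only the temporary file is affected and the previous checkpoint remains loadable.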
nabbilkhan c04168ee17 Add --data path support for static training pipelines 2026-03-03 19:19:49 +00:00
John Stephen Kromer d3d00307c0 Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory MIL pipeline
[MLModel compileModelAtURL:] fails on macOS 26, breaking inmem_bench,
sram_bench, and sram_probe. This switches all three to generate MIL text
and weight blobs programmatically in memory (matching the working
inmem_peak.m approach), bypassing CoreML disk compilation entirely.

- inmem_bench.m: replace CoreML compile + file read with genMIL/buildWeightBlob
- sram_bench.m: switch from _ANEClient/_ANEModel to _ANEInMemoryModel API
- sram_probe.m: same _ANEClient → _ANEInMemoryModel conversion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:20:05 -08:00
maderix 443194bca4 Dashboard v2: live stats, JSON parsing, all three pipelines
- Parse static pipeline JSON step/batch/perf lines for real-time updates
- Running elapsed time, ms/step from wall-clock timestamps, steps/sec
- Compute ANE + Total TFLOPS from FLOPs/step when not reported directly
- Support --ane (train_large_ane) and --no-ane-extras flags
- Dynamic pipeline timing breakdown + CKPT_PATH per mode
2026-03-03 05:24:35 -08:00
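
Illustrative note on 443194bca4: deriving ms/step, steps/sec, and TFLOPS from wall-clock timestamps and a known FLOPs/step figure is simple arithmetic. The sketch below is an assumption about how such a calculation could look; `step_stats` and its argument names are hypothetical and are not taken from dashboard.py.

```python
def step_stats(timestamps, flops_per_step):
    """Derive timing metrics from step-completion timestamps (in seconds).

    Returns None when fewer than two timestamps are available or when
    no time has elapsed, since no rate can be computed in those cases.
    """
    if len(timestamps) < 2:
        return None
    elapsed = timestamps[-1] - timestamps[0]
    if elapsed <= 0:
        return None
    steps = len(timestamps) - 1          # intervals between completions
    sec_per_step = elapsed / steps
    return {
        "ms_per_step": sec_per_step * 1000.0,
        "steps_per_sec": 1.0 / sec_per_step,
        # FLOPs per second, scaled to tera (1e12).
        "tflops": flops_per_step / sec_per_step / 1e12,
    }
```

Averaging over the whole window smooths out per-step jitter; a sliding window would track recent performance more closely.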
maderix 3c1aae65d7 Merge dynamic training pipeline + CLI fixes + benchmark comparison 2026-03-03 04:36:03 -08:00
maderix 4c14ed0e25 CLI fixes + --no-ane-extras flag + README benchmark table
- Fix positional arg parsing (model_path, steps, lr were silently ignored)
- Add --model, --ckpt flags; forward ckpt_path across exec() restarts
- Add --no-ane-extras to disable ANE classifier/softmax/rmsnorm_bwd
- CPU fallback for softmax/classifier/rmsnorm_bwd when extras disabled
- Update README with 4-way benchmark comparison table (20 steps)
2026-03-03 04:34:55 -08:00
maderix cb474e1537 Add dynamic weight training pipeline — 110ms/step without recompilation
Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps
bottleneck. Weights are passed via IOSurface spatial dimension instead of
baked as constants, so kernels compile once at startup (345ms) and run
indefinitely without exec() restart.

Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
  - 9 dynamic kernels shared across all 12 layers
  - Vocab compaction 32K→9.2K for faster classifier
  - Vectorized cross-entropy with vDSP/NEON
  - Adam optimizer with gradient clipping + cosine LR schedule
  - Checkpoint save/resume

- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface

- dashboard.py — updated with --dynamic flag for v2 pipeline support,
  improved step regex parsing, --scratch/--lr/--accum CLI args

Performance: 110ms/step steady-state (no recompile overhead)
  ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms
2026-03-03 04:34:55 -08:00
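
Illustrative note on cb474e1537: the commit lists gradient clipping and a cosine LR schedule without giving their exact form. The sketch below shows one common formulation of each, assuming clipping by global L2 norm and a schedule with no warmup; both choices, and the names `cosine_lr` and `clip_by_global_norm`, are assumptions rather than the pipeline's confirmed behavior.

```python
import math

def cosine_lr(step, total_steps, max_lr, min_lr=0.0):
    """Cosine decay from max_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm):
    """Scale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

In a training loop, the clipped gradients would be passed to the Adam update together with the learning rate returned by `cosine_lr` for the current step.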
Manjeet Singh c33077430e
Merge PR #19: Bridge API + ANE classifier/softmax/rmsnorm_bwd offload (16% faster)
Bridge+Memory leak fix+More functions
2026-03-03 13:10:57 +05:30
Guitared a14ce098fb
Capitalize doc header 2026-03-03 14:18:35 +07:00
Guitared b8f09a6853
fix non-interactive session error and sudo password input for powermetrics 2026-03-03 14:14:30 +07:00
Guitared 65cfc3255f
optimize singleton token params in generate_text 2026-03-03 14:11:42 +07:00
Vipul ebac5dd73f Python Bridge+Memory leak fix+More functions 2026-03-03 02:04:36 -05:00
tastyheadphones 2b3b7ae5cc Fix token sampling underflow on short datasets 2026-03-03 11:42:42 +09:00
Manjeet Singh 1b792fce34
Merge pull request #15 from maderix/claude/add-readme-scope-notice-EL9sS
Add Project Scope & Intent notice to README
2026-03-03 06:26:35 +05:30
Claude 752a3be81a
Add Project Scope & Intent notice to README
Weave in scope notice near the top covering project intent, what it
is/isn't, hype clarification, maintenance expectations, and fork
encouragement. Consolidate private API disclaimer with existing
disclaimer section to avoid duplication.

https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv
2026-03-03 00:54:46 +00:00
Manjeet Singh 893f58e725
Merge pull request #2 from m0at/m5-maximized
ANE probe tests + training telemetry for M5 optimization
2026-03-02 14:57:12 +05:30
m0at 184b182bfc Add M5 probe results: weight reload fails, all QoS work, chaining API found
Key findings from running all 4 probes on Apple M5:

- Weight reload (unload+load after file overwrite) does NOT work — weights
  are baked at compile time, output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models

Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern)
and 64+ channel kernels (ANE minimum size requirement).

Full analysis in training/m5result.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 23:16:38 -08:00
m0at 40d3f45631 Add ANE probe tests and training telemetry for M5 optimization
Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 22:54:58 -08:00
maderix 4d67db1bdb stories110M: 12-layer ANE training with dashboard, 107ms/step
- Scale to full stories110M (109M params, 12 layers) with real TinyStories data
- vDSP-vectorized cross-entropy (110ms→14ms), NEON fp16 IO, async dW
- TUI dashboard: loss curve, ANE/CPU power, CPU/memory graphs, text generation
- Split into modular headers: config, io, mil, cpu_ops
2026-03-01 03:14:39 -08:00
maderix f213c8db68 Initial release 2026-02-28 00:22:06 -08:00