berkus/ANE - ANE

Commit Graph

Author	SHA1	Message	Date
maderix	efcf193075	Add model config to benchmark report, update README with current results. Benchmark report now includes full Stories110M model configuration (arch, layers, dims, kernels). README updated: 12-layer results replace stale single-layer numbers, limitations reflect current state.	2026-03-04 06:13:21 -08:00
maderix	1a7d8846b2	Add NE core counts, clarify FP16 vs rated TOPS methodology. All chips have 16 NE cores except Ultra (32 via UltraFusion). M4 38 TOPS is INT8/mixed-precision, not comparable to M3 FP16 spec.	2026-03-04 06:11:29 -08:00
maderix	050bc4fdf0	Add cross-generation ANE benchmark report from issue #3. Community-submitted results for M1 Pro/Max, M3 Pro, M4 Pro/Max, M5. Includes training performance, peak throughput, MIL compatibility matrix, and structured JSON data.	2026-03-04 05:30:00 -08:00
maderix	e986572e90	Replace assert() with non-fatal bounds checks on token IDs Follow-up to PR #31 — assert() aborts on bad tokens, which is too harsh for training. Skip bad tokens with a warning instead.	2026-03-04 04:41:38 -08:00
Manjeet Singh	05fc8f85e3	Merge pull request #31 from alvgeppetto-debug/fix/safety-correctness fix: correctness and safety improvements for training	2026-03-04 18:09:56 +05:30
Manjeet Singh	032f866f2d	Merge pull request #29 from nabbilkhan/contrib/fix-training-data-paths Fix hardcoded TinyStories data path in train_large/train_large_ane	2026-03-04 17:48:43 +05:30
Manjeet Singh	44309b7625	Merge pull request #27 from jskromer/fix/macos26-inmemory-benchmarks Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory MIL	2026-03-04 17:48:39 +05:30
Manjeet Singh	7fbb912a89	Merge pull request #20 from guitared/main Optimize dashboard and prevent sudo hang when password needed	2026-03-04 17:48:30 +05:30
Manjeet Singh	37939c8a60	Merge pull request #34 from 04cb/fix/docs-add-training-data-link Fix docs: add training data download instructions	2026-03-04 17:48:25 +05:30
Manjeet Singh	3efa27d7a3	Merge pull request #17 from TastyHeadphones/tastyheadphones/short-dataset-underflow-fix Fix token sampling underflow for short token datasets	2026-03-04 17:48:22 +05:30
Manjeet Singh	4a6f3e40a9	Revise README for clarity and project details. Updated README to reflect project scope, architecture, and limitations.	2026-03-04 12:59:09 +05:30
04cb	0d9e139567	Fix docs: add training data download instructions	2026-03-04 08:16:20 +08:00
Alvaro GPT	541bf4ec90	fix: correctness & safety improvements - Validate all fread() return values in model_load_weights (model.h) - Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m) - Log error details on ANE eval failure (ane_runtime.h) - Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h) - Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward - Atomic checkpoint writes via tmp+rename pattern (tiny_train.m) - Non-destructive recompile: compile new kernels first, swap only on success (model.h) - Validate fread() in load_checkpoint (tiny_train.m)	2026-03-03 20:46:58 +01:00
nabbilkhan	c04168ee17	Add --data path support for static training pipelines	2026-03-03 19:19:49 +00:00
John Stephen Kromer	d3d00307c0	Fix benchmarks for macOS 26: replace compileModelAtURL with in-memory MIL pipeline [MLModel compileModelAtURL:] fails on macOS 26, breaking inmem_bench, sram_bench, and sram_probe. This switches all three to generate MIL text and weight blobs programmatically in memory (matching the working inmem_peak.m approach), bypassing CoreML disk compilation entirely. - inmem_bench.m: replace CoreML compile + file read with genMIL/buildWeightBlob - sram_bench.m: switch from _ANEClient/_ANEModel to _ANEInMemoryModel API - sram_probe.m: same _ANEClient → _ANEInMemoryModel conversion Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:20:05 -08:00
maderix	443194bca4	Dashboard v2: live stats, JSON parsing, all three pipelines - Parse static pipeline JSON step/batch/perf lines for real-time updates - Running elapsed time, ms/step from wall-clock timestamps, steps/sec - Compute ANE + Total TFLOPS from FLOPs/step when not reported directly - Support --ane (train_large_ane) and --no-ane-extras flags - Dynamic pipeline timing breakdown + CKPT_PATH per mode	2026-03-03 05:24:35 -08:00
maderix	3c1aae65d7	Merge dynamic training pipeline + CLI fixes + benchmark comparison	2026-03-03 04:36:03 -08:00
maderix	4c14ed0e25	CLI fixes + --no-ane-extras flag + README benchmark table - Fix positional arg parsing (model_path, steps, lr were silently ignored) - Add --model, --ckpt flags; forward ckpt_path across exec() restarts - Add --no-ane-extras to disable ANE classifier/softmax/rmsnorm_bwd - CPU fallback for softmax/classifier/rmsnorm_bwd when extras disabled - Update README with 4-way benchmark comparison table (20 steps)	2026-03-03 04:34:55 -08:00
maderix	cb474e1537	Add dynamic weight training pipeline — 110ms/step without recompilation Dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps bottleneck. Weights are passed via IOSurface spatial dimension instead of baked as constants, so kernels compile once at startup (345ms) and run indefinitely without exec() restart. Key components: - training_dynamic/ — full pipeline (config, IO, MIL generators, train loop) - 9 dynamic kernels shared across all 12 layers - Vocab compaction 32K→9.2K for faster classifier - Vectorized cross-entropy with vDSP/NEON - Adam optimizer with gradient clipping + cosine LR schedule - Checkpoint save/resume - test_dynamic_matmul.m — validates dynamic weight matmul vs cblas - test_weight_patch.m — tests weight update via IOSurface - dashboard.py — updated with --dynamic flag for v2 pipeline support, improved step regex parsing, --scratch/--lr/--accum CLI args Performance: 110ms/step steady-state (no recompile overhead) ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms	2026-03-03 04:34:55 -08:00
Manjeet Singh	c33077430e	Merge PR #19 : Bridge API + ANE classifier/softmax/rmsnorm_bwd offload (16% faster) Bridge+Memory leak fix+More functions	2026-03-03 13:10:57 +05:30
Guitared	a14ce098fb	Capitalize doc header	2026-03-03 14:18:35 +07:00
Guitared	b8f09a6853	fix non-interactive session error and sudo password input for powermetrics	2026-03-03 14:14:30 +07:00
Guitared	65cfc3255f	optimize singleton token params in generate_text	2026-03-03 14:11:42 +07:00
Vipul	ebac5dd73f	Python Bridge+Memory leak fix+More functions	2026-03-03 02:04:36 -05:00
tastyheadphones	2b3b7ae5cc	Fix token sampling underflow on short datasets	2026-03-03 11:42:42 +09:00
Manjeet Singh	1b792fce34	Merge pull request #15 from maderix/claude/add-readme-scope-notice-EL9sS Add Project Scope & Intent notice to README	2026-03-03 06:26:35 +05:30
Claude	752a3be81a	Add Project Scope & Intent notice to README. Weave in scope notice near the top covering project intent, what it is/isn't, hype clarification, maintenance expectations, and fork encouragement. Consolidate private API disclaimer with existing disclaimer section to avoid duplication. https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv	2026-03-03 00:54:46 +00:00
Manjeet Singh	893f58e725	Merge pull request #2 from m0at/m5-maximized ANE probe tests + training telemetry for M5 optimization	2026-03-02 14:57:12 +05:30
m0at	184b182bfc	Add M5 probe results: weight reload fails, all QoS work, chaining API found Key findings from running all 4 probes on Apple M5: - Weight reload (unload+load after file overwrite) does NOT work — weights are baked at compile time, output is identical regardless of file changes - weightsBuffer IOSurface parameter also does not override compiled weights - All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval) - _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData - _ANEChainingRequest supports loopback execution (output→input chaining) - _ANEClient has real-time eval path and chaining preparation methods - procedureIndex 0-15 all succeed on single-procedure models Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern) and 64+ channel kernels (ANE minimum size requirement). Full analysis in training/m5result.md. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 23:16:38 -08:00
m0at	40d3f45631	Add ANE probe tests and training telemetry for M5 optimization Four standalone probe tests to characterize the M5 ANE: - test_weight_reload: Can weights be hot-swapped via unload+load without recompilation? - test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters - test_qos_sweep: Measure compile/load/eval latency across QoS 0-63 - test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient Training telemetry (train_large.m): - JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics - Enables external monitoring tools to visualize ANE utilization in real-time Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 22:54:58 -08:00
maderix	4d67db1bdb	stories110M: 12-layer ANE training with dashboard, 107ms/step - Scale to full stories110M (109M params, 12 layers) with real TinyStories data - vDSP-vectorized cross-entropy (110ms→14ms), NEON fp16 IO, async dW - TUI dashboard: loss curve, ANE/CPU power, CPU/memory graphs, text generation - Split into modular headers: config, io, mil, cpu_ops	2026-03-01 03:14:39 -08:00
maderix	f213c8db68	Initial release	2026-02-28 00:22:06 -08:00

32 Commits

All Branches