13 Commits

Author SHA1 Message Date
fspecii 98ddd2d190 bridge: add compile_dyn + write_weight — function parameter IOSurfaces
Adds a second dynamic weight approach to the bridge alongside the existing
BLOBFILE compile path. Instead of packing weights into the spatial dimension
of a single large input tensor and slicing them inside MIL (the training_dynamic/
approach), weights are declared as native MIL function parameters backed by
persistent IOSurfaces:

  // training_dynamic/ approach: spatial packing
  func main<ios18>(tensor<fp32, [1, DIM, 1, SEQ + 4*DIM]> x) {
      Wq = slice_by_size(x=x, begin=..., size=...);  // overhead
      ...

  // this PR: native function parameters
  func main<ios18>(tensor<fp16,[1,K,1,M]> x, tensor<fp16,[1,N,K]> W) { ... }
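The difference between the two layouts can be illustrated with a toy model in plain Python. This is a sketch, not code from the repository: DIM, SEQ, pack_spatial, and the list-of-lists tensors are illustrative assumptions, and slice_by_size here only imitates the MIL op of the same name on the last axis (batch and height dimensions are dropped).

```python
# Toy model of spatial packing. All names and sizes are illustrative assumptions.
DIM, SEQ = 2, 3
N_WEIGHTS = 4  # matches the 4*DIM extra columns in the packed shape above


def pack_spatial(acts, weights):
    """Append each [DIM][DIM] weight to the [DIM][SEQ] activations along the last axis."""
    packed = []
    for r in range(DIM):
        row = list(acts[r])
        for w in weights:
            row.extend(w[r])
        packed.append(row)
    return packed


def slice_by_size(x, begin, size):
    """Column slice standing in for MIL slice_by_size on the spatial axis."""
    return [row[begin:begin + size] for row in x]


acts = [[float(r * SEQ + c) for c in range(SEQ)] for r in range(DIM)]
weights = [
    [[float(100 * (i + 1) + r * DIM + c) for c in range(DIM)] for r in range(DIM)]
    for i in range(N_WEIGHTS)
]

# Spatial packing: one wide input, then one slice per weight inside the kernel.
packed = pack_spatial(acts, weights)
unpacked = [slice_by_size(packed, SEQ + i * DIM, DIM) for i in range(N_WEIGHTS)]

# Function parameters: each weight would simply arrive as its own argument,
# so the per-weight slicing step above disappears.
```

Each packed row has SEQ + 4*DIM entries, and every weight has to be sliced back out before use; with function parameters that step is not needed.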

New API:
  ane_bridge_compile_dyn()      — compile with n_weights IOSurface parameters
  ane_bridge_write_weight()     — write fp16 to weight IOSurface (~0.001ms)
  ane_bridge_write_weight_f32() — write fp32 with NEON conversion
  ane_bridge_copy_io()          — direct output→input copy, no CPU round-trip
  ane_bridge_begin/end_realtime() — 90.6% p99 jitter reduction
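
As a rough illustration of what the fp32 write path has to do before bytes land in a weight IOSurface, the conversion can be sketched in Python with the standard struct module. This is an assumption-laden sketch: the helper names are hypothetical, the bridge itself uses NEON according to the message above, and the IOSurface write is not modeled here.

```python
import struct


def f32_to_f16_bytes(values):
    """Pack floats as little-endian IEEE 754 half-precision (struct format 'e')."""
    return struct.pack(f"<{len(values)}e", *values)


def f16_bytes_to_floats(buf):
    """Inverse of f32_to_f16_bytes; each half-precision value occupies 2 bytes."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))


weights = [0.5, -1.25, 3.0, 0.1]
buf = f32_to_f16_bytes(weights)       # bytes that would be copied into the surface
roundtrip = f16_bytes_to_floats(buf)  # 0.1 is not exactly representable in fp16
```

Values such as 0.5 survive the round trip exactly, while 0.1 comes back slightly off, which is the precision cost of storing weights as fp16.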

Compile cache fix: for parameter-based models, ANE writes only net.plist (no
data file). try_cache_restore now checks for net.plist alone; the data file is
saved and restored only for BLOBFILE models, which do produce it.
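
The cache rule described above can be sketched as follows. This is a hypothetical Python rendering, not the bridge's implementation: the directory layout, cache_save, and the Python signature of try_cache_restore are assumptions; only the file names net.plist and data come from the message.

```python
import shutil
import tempfile
from pathlib import Path


def cache_save(model_dir: Path, cache_dir: Path) -> None:
    """Always save net.plist; save data only if the compile produced it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(model_dir / "net.plist", cache_dir / "net.plist")
    data = model_dir / "data"
    if data.exists():  # BLOBFILE-style models only
        shutil.copy2(data, cache_dir / "data")


def try_cache_restore(cache_dir: Path, model_dir: Path) -> bool:
    """A cache hit requires net.plist only; data is restored when present."""
    plist = cache_dir / "net.plist"
    if not plist.exists():
        return False
    model_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(plist, model_dir / "net.plist")
    data = cache_dir / "data"
    if data.exists():
        shutil.copy2(data, model_dir / "data")
    return True


# Demo: a parameter-based model that produced net.plist but no data file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "model"
    src.mkdir()
    (src / "net.plist").write_text("<plist/>")
    cache_save(src, root / "cache")
    hit = try_cache_restore(root / "cache", root / "restored")
    miss = try_cache_restore(root / "empty", root / "other")
    restored_files = sorted(p.name for p in (root / "restored").iterdir())
```

Requiring the data file for a hit would make every parameter-based model miss the cache, which appears to be the failure the fix addresses.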

Also removes the pre-built libane_bridge.dylib binary from version control.

Performance vs spatial packing (Stories110M, 12 layers, M-series):
  training_dynamic/ (slice approach): 110ms/step
  function parameter approach:         76.9ms/step  (-30%)

The slice/reshape/transpose overhead per weight matrix explains the gap.
Both compile once at startup; weight updates are IOSurface writes in both cases.

Tested: test_bridge.m — 15/15 assertions across all new API functions.
2026-03-03 15:00:51 +02:00
maderix 3c1aae65d7 Merge dynamic training pipeline + CLI fixes + benchmark comparison 2026-03-03 04:36:03 -08:00
maderix 4c14ed0e25 CLI fixes + --no-ane-extras flag + README benchmark table
- Fix positional arg parsing (model_path, steps, lr were silently ignored)
- Add --model, --ckpt flags; forward ckpt_path across exec() restarts
- Add --no-ane-extras to disable ANE classifier/softmax/rmsnorm_bwd
- CPU fallback for softmax/classifier/rmsnorm_bwd when extras disabled
- Update README with 4-way benchmark comparison table (20 steps)
2026-03-03 04:34:55 -08:00
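
The argument layout named in the commit above (positional model_path, steps, lr plus the --model, --ckpt, and --no-ane-extras flags) could look roughly like this in Python's argparse. This is only a sketch: the real CLI is not Python, and the program name, types, and defaults are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the arguments listed in the commit message."""
    p = argparse.ArgumentParser(prog="train")
    # Optional positionals: each may be omitted, but none is silently dropped.
    p.add_argument("model_path", nargs="?", default=None)
    p.add_argument("steps", nargs="?", type=int, default=None)
    p.add_argument("lr", nargs="?", type=float, default=None)
    p.add_argument("--model", default=None)
    p.add_argument("--ckpt", default=None)
    # When set, softmax/classifier/rmsnorm_bwd would fall back to the CPU.
    p.add_argument("--no-ane-extras", action="store_true")
    return p


parser = build_parser()
positional = parser.parse_args(["model.bin", "20", "3e-4"])
flagged = parser.parse_args(["--no-ane-extras", "--ckpt", "ckpt.bin", "model.bin", "20"])
```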
maderix cb474e1537 Add dynamic weight training pipeline — 110ms/step without recompilation
Adds a dynamic weight pipeline that eliminates the ~3.7s recompile-every-10-steps
bottleneck. Weights are passed via the IOSurface spatial dimension instead of
being baked in as constants, so kernels compile once at startup (345ms) and run
indefinitely without an exec() restart.

Key components:
- training_dynamic/ — full pipeline (config, IO, MIL generators, train loop)
  - 9 dynamic kernels shared across all 12 layers
  - Vocab compaction 32K→9.2K for faster classifier
  - Vectorized cross-entropy with vDSP/NEON
  - Adam optimizer with gradient clipping + cosine LR schedule
  - Checkpoint save/resume

- test_dynamic_matmul.m — validates dynamic weight matmul vs cblas
- test_weight_patch.m — tests weight update via IOSurface

- dashboard.py — updated with --dynamic flag for v2 pipeline support,
  improved step regex parsing, --scratch/--lr/--accum CLI args

Performance: 110ms/step steady-state (no recompile overhead)
  ane_fwd=21 ane_bwd=28 io_fwd=12 io_bwd=15 silu=10 cls=13 rms=5 ms
2026-03-03 04:34:55 -08:00
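
The figures quoted in the commit above can be cross-checked with a little arithmetic. The sketch below only restates numbers from the message; it assumes the per-phase timers do not overlap, which the message does not state.

```python
# Per-phase timings (ms) as quoted in the commit message.
breakdown_ms = {
    "ane_fwd": 21, "ane_bwd": 28, "io_fwd": 12, "io_bwd": 15,
    "silu": 10, "cls": 13, "rms": 5,
}
step_ms = 110  # reported steady-state step time

accounted_ms = sum(breakdown_ms.values())  # 104
unaccounted_ms = step_ms - accounted_ms    # 6

# Old pipeline: ~3.7 s recompile every 10 steps, amortized per step.
recompile_ms, recompile_every = 3700, 10
amortized_recompile_ms = recompile_ms / recompile_every  # 370.0

# New pipeline: a single 345 ms compile at startup, spread over n steps.
compile_once_ms = 345


def amortized_compile_ms(n_steps):
    return compile_once_ms / n_steps
```

Under that assumption the listed phases account for 104 of the 110 ms, and the removed recompile cost works out to about 370 ms per step when amortized.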
Manjeet Singh c33077430e
Merge PR #19: Bridge API + ANE classifier/softmax/rmsnorm_bwd offload (16% faster)
Bridge+Memory leak fix+More functions
2026-03-03 13:10:57 +05:30
Vipul ebac5dd73f Python Bridge+Memory leak fix+More functions 2026-03-03 02:04:36 -05:00
Manjeet Singh 1b792fce34
Merge pull request #15 from maderix/claude/add-readme-scope-notice-EL9sS
Add Project Scope & Intent notice to README
2026-03-03 06:26:35 +05:30
Claude 752a3be81a
Add Project Scope & Intent notice to README
Weave in scope notice near the top covering project intent, what it
is/isn't, hype clarification, maintenance expectations, and fork
encouragement. Consolidate private API disclaimer with existing
disclaimer section to avoid duplication.

https://claude.ai/code/session_01NNL4MVEY1aKp19eGHTYJUv
2026-03-03 00:54:46 +00:00
Manjeet Singh 893f58e725
Merge pull request #2 from m0at/m5-maximized
ANE probe tests + training telemetry for M5 optimization
2026-03-02 14:57:12 +05:30
m0at 184b182bfc Add M5 probe results: weight reload fails, all QoS work, chaining API found
Key findings from running all 4 probes on Apple M5:

- Weight reload (unload+load after file overwrite) does NOT work: weights
  are baked at compile time, so output is identical regardless of file changes
- weightsBuffer IOSurface parameter also does not override compiled weights
- All QoS values 0-63 work, no measurable latency difference (~0.07ms/eval)
- _ANEPerformanceStats has hwExecutionTime (ns) + perfCounterData
- _ANEChainingRequest supports loopback execution (output→input chaining)
- _ANEClient has real-time eval path and chaining preparation methods
- procedureIndex 0-15 all succeed on single-procedure models

Fixed probe tests to use fp32 I/O with cast (matching inmem_peak pattern)
and 64+ channel kernels (ANE minimum size requirement).

Full analysis in training/m5result.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 23:16:38 -08:00
m0at 40d3f45631 Add ANE probe tests and training telemetry for M5 optimization
Four standalone probe tests to characterize the M5 ANE:
- test_weight_reload: Can weights be hot-swapped via unload+load without recompilation?
- test_perf_stats: Enumerate _ANEPerformanceStats methods/properties and hardware counters
- test_qos_sweep: Measure compile/load/eval latency across QoS 0-63
- test_ane_advanced: Probe SharedEvents, weightsBuffer IOSurface, procedureIndex, VirtualClient

Training telemetry (train_large.m):
- JSON lines to stderr with per-step timing breakdown and per-batch TFLOPS metrics
- Enables external monitoring tools to visualize ANE utilization in real-time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 22:54:58 -08:00
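
An external monitor could consume that JSON-lines telemetry along these lines. This is a minimal sketch: the field names step, ms_per_step, and tflops are hypothetical, since the commit message does not list the actual keys.

```python
import json


def parse_telemetry(lines):
    """Keep lines that parse as JSON objects; skip ordinary log output."""
    records = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate truncated or interleaved lines
    return records


# Sample stderr capture with made-up values and a non-JSON log line mixed in.
sample = [
    '{"step": 1, "ms_per_step": 107.0, "tflops": 1.5}',
    "loading checkpoint...",
    '{"step": 2, "ms_per_step": 109.0, "tflops": 1.4}',
    '{"step": 3, "ms_per_',
]
records = parse_telemetry(sample)
mean_ms = sum(r["ms_per_step"] for r in records) / len(records)
```

Skipping lines that fail to parse keeps the monitor robust when stderr also carries plain log messages.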
maderix 4d67db1bdb stories110M: 12-layer ANE training with dashboard, 107ms/step
- Scale to full stories110M (109M params, 12 layers) with real TinyStories data
- vDSP-vectorized cross-entropy (110ms→14ms), NEON fp16 IO, async dW
- TUI dashboard: loss curve, ANE/CPU power, CPU/memory graphs, text generation
- Split into modular headers: config, io, mil, cpu_ops
2026-03-01 03:14:39 -08:00
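
For reference, the quantity being vectorized in that commit is the standard softmax cross-entropy. The scalar Python version below is a sketch for checking results, not the vDSP code from the repository; the function name and list-based inputs are assumptions.

```python
import math


def cross_entropy(logits, target):
    """Numerically stable -log(softmax(logits)[target]) via log-sum-exp."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)  # equals log(4)
confident_loss = cross_entropy([1000.0, 0.0], 0)       # near zero, no overflow
```

A vectorized implementation computes the same max, exp, sum, and log steps over whole rows of the logits matrix at once.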
maderix f213c8db68 Initial release 2026-02-28 00:22:06 -08:00