- Vectorize adam_update with vDSP batch ops (stories_cpu_ops.h)
  Replaces the scalar per-element loop with vDSP_vsmul/vsma/vsq/vdiv
  Expected ~3-4x speedup for 2.4M parameter updates
- Vectorize the ADAM_UPDATE macro in model_adam_step with vDSP (backward.h)
  Same batch-ops pattern, applied to the train.m model pipeline
- Replace cpu_accum_dW with cblas_sgemm (backward.h)
  dW += dy^T @ x is a standard BLAS GEMM operation
  Expected 5-10x speedup for weight gradient accumulation
- Replace cpu_matmul_backward_dx with cblas_sgemm (backward.h)
  dx = dy @ W^T is also a standard BLAS GEMM
- Add -framework Accelerate to train target (Makefile)
- Validate all fread() return values in model_load_weights (model.h)
- Check ane_eval() return values in ane_conv_eval (forward.h) and ane_eval_k (tiny_train.m)
- Log error details on ANE eval failure (ane_runtime.h)
- Thread-safe RMSNorm: replace global g_rms_tmp with local allocation (stories_cpu_ops.h)
- Bounds-check token indices in cross_entropy_loss, embed_lookup, embed_backward
- Atomic checkpoint writes via tmp+rename pattern (tiny_train.m)
- Non-destructive recompile: compile new kernels first, swap only on success (model.h)
- Validate fread() in load_checkpoint (tiny_train.m)
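
For reference, the tmp+rename write and the checked fread() load could look roughly like the sketch below. This is a minimal illustration in portable C, not the code in tiny_train.m: the names save_atomic, load_checked and read_floats, the ".tmp" suffix, and the flat float-array layout are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read exactly n floats, or report failure.
 * fread() returns the number of items read; a short count means a
 * truncated or unreadable file. */
static int read_floats(FILE *f, float *dst, size_t n) {
    return fread(dst, sizeof(float), n, f) == n ? 0 : -1;
}

/* Hypothetical helper: write the data to "<path>.tmp", then rename it
 * over the final path, so a failed write never clobbers the previous
 * checkpoint. */
static int save_atomic(const char *path, const float *w, size_t n) {
    assert(path != NULL && w != NULL);

    char tmp[1024];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;

    /* Always flush and close, even if the write itself failed. */
    int ok = fwrite(w, sizeof(float), n, f) == n;
    ok = (fflush(f) == 0) && ok;
    ok = (fclose(f) == 0) && ok;

    if (!ok || rename(tmp, path) != 0) {
        remove(tmp); /* leave no partial file behind */
        return -1;
    }
    return 0;
}

/* Hypothetical helper: load n floats and fail on a short read. */
static int load_checked(const char *path, float *w, size_t n) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int rc = read_floats(f, w, n);
    fclose(f);
    return rc;
}
```

On POSIX systems rename() replaces the destination atomically, so a reader sees either the old checkpoint or the new one. The sketch does not call fsync(), so it makes no durability guarantee across power loss.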