berkus/ANE - ANE

Commit Graph

Author

SHA1

Message

Date

Alvaro GPT

7ea45c2fab

perf: vectorize CPU bottlenecks with vDSP and cblas

- Vectorize adam_update with vDSP batch ops (stories_cpu_ops.h)
  Replaces scalar per-element loop with vDSP_vsmul/vsma/vsq/vdiv
  Expected ~3-4x faster for 2.4M parameter updates

- Vectorize model_adam_step ADAM_UPDATE macro with vDSP (backward.h)
  Same batch ops pattern for the train.m model pipeline

- Replace cpu_accum_dW with cblas_sgemm (backward.h)
  dW += dy^T @ x is a standard BLAS GEMM operation
  Expected 5-10x faster for weight gradient accumulation

- Replace cpu_matmul_backward_dx with cblas_sgemm (backward.h)
  dx = dy @ W^T is also a standard BLAS GEMM

- Add -framework Accelerate to train target (Makefile)

2026-03-03 20:47:03 +01:00

maderix

f213c8db68

Initial release

2026-02-28 00:22:06 -08:00

2 Commits
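
For the weight-gradient change in commit 7ea45c2fab, the accumulation `dW += dy^T @ x` maps onto a single `cblas_sgemm` call with `beta = 1`. The sketch below is illustrative only: the function name `accum_dW_sketch` and the row-major shapes (`dy` as rows x out_dim, `x` as rows x in_dim, `dW` as out_dim x in_dim) are assumptions, and the repository's `cpu_accum_dW` may use a different layout. Outside macOS it falls back to the plain triple loop so the two paths can be compared.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical helper: dW += dy^T @ x with row-major storage.
   dy: rows x out_dim, x: rows x in_dim, dW: out_dim x in_dim. */
static void accum_dW_sketch(float *dW, const float *dy, const float *x,
                            int rows, int out_dim, int in_dim)
{
    assert(rows > 0 && out_dim > 0 && in_dim > 0);

#ifdef __APPLE__
    /* C = alpha * A^T * B + beta * C, with A = dy, B = x, C = dW.
       M = out_dim, N = in_dim, K = rows. beta = 1 keeps the existing
       contents of dW, which is what makes this an accumulation. */
    cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                out_dim, in_dim, rows,
                1.0f, dy, out_dim,
                x, in_dim,
                1.0f, dW, in_dim);
#else
    /* Reference triple loop computing the same accumulation. */
    for (int s = 0; s < rows; s++) {
        for (int o = 0; o < out_dim; o++) {
            const float d = dy[(size_t)s * out_dim + o];
            for (int i = 0; i < in_dim; i++) {
                dW[(size_t)o * in_dim + i] += d * x[(size_t)s * in_dim + i];
            }
        }
    }
#endif
}
```

Passing `CblasTrans` for `dy` avoids materializing the transpose, and the leading dimensions are the stored row lengths of each matrix. As with any Accelerate call, the macOS path has to be linked with `-framework Accelerate`.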