Add train_opt target with NEON-vectorized Adam, fp16 activation/gradient
caching, concurrent dW dispatch, pre-allocated buffers, and optional
Metal GPU support. Tested on M3 Max with stories110M.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>