Add reproducible M3 Ultra benchmark submission package

2026-03-03 18:39:34 +00:00 · 7fceb99988
parent 443194bca4
commit 7fceb99988
17 changed files with 632 additions and 5 deletions
--- a/README.md
+++ b/README.md
@@ -148,11 +148,17 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
 | Channel-first layout | 20.3 | 5.2% |
 | vDSP vectorized RMSNorm | 14.2 | 7.4% |
 | GCD async cblas overlap | 11.4 | 9.2% |
-| ANE RMSNorm fusion | 11.4 | 9.2% |
-| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
-| Deferred cblas wait | **9.3** | **11.2%** |
-
-## Disclaimer
+| ANE RMSNorm fusion | 11.4 | 9.2% |
+| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
+| Deferred cblas wait | **9.3** | **11.2%** |
+
+## Community Benchmarks
+
+Community hardware benchmark submissions live in [`benchmarks/submissions/`](benchmarks/submissions/).
+
+- [Mac Studio (Apple M3 Ultra, 256 GB) — 2026-03-03](benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/README.md)
+
+## Disclaimer

 This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.

--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -0,0 +1,34 @@
+# Community Benchmark Submissions
+
+This folder is for reproducible hardware benchmark submissions from the community.
+
+## Goals
+
+- Make cross-chip results easy to compare.
+- Keep raw logs attached so numbers are auditable.
+- Keep submissions lightweight and low-maintenance.
+
+## Submission Layout
+
+Use one directory per machine/date:
+
+`benchmarks/submissions/<chip>-<machine>-<YYYY-MM-DD>/`
+
+Required files:
+
+- `README.md` — short summary of machine, commands, and key results
+- `metrics.json` — machine-readable summary of key metrics
+- `raw/` — raw command outputs (`*.log`, `system_info.txt`, `upstream_commit.txt`)
+
+## Privacy
+
+Please redact machine serial numbers, UUIDs, and other unique identifiers before committing logs.
+
+## Minimal Repro Guidance
+
+Each submission should include:
+
+- exact upstream commit hash tested
+- exact commands run
+- fixed step counts for training comparisons (for example, `--steps 20`)
+- clear pass/fail status for each benchmark
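The layout rules above can be checked mechanically before a submission is committed. The following is a minimal sketch, not a script that exists in the repository: the function name `check_submission` is hypothetical, and it only tests for the required files named in the Submission Layout section (it does not validate their contents).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): verify that a submission
# directory contains the required files from the Submission Layout section.
check_submission() {
  local dir="$1" missing=0 f
  [ -f "$dir/README.md" ]    || { echo "missing: README.md" >&2; missing=1; }
  [ -f "$dir/metrics.json" ] || { echo "missing: metrics.json" >&2; missing=1; }
  [ -d "$dir/raw" ]          || { echo "missing: raw/" >&2; missing=1; }
  # raw/ should also carry the system capture and the tested commit hash.
  for f in system_info.txt upstream_commit.txt; do
    [ -f "$dir/raw/$f" ] || { echo "missing: raw/$f" >&2; missing=1; }
  done
  # Exit status 0 means every required file was found.
  return "$missing"
}
```

A possible invocation would be `check_submission benchmarks/submissions/<chip>-<machine>-<YYYY-MM-DD> && echo "layout ok"`.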
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/README.md
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/README.md
@@ -0,0 +1,81 @@
+# Mac Studio M3 Ultra Benchmark Submission (2026-03-03)
+
+This submission targets upstream issue `#3` (collecting results across Apple Silicon variants).
+
+## Environment
+
+- Upstream commit: `443194bca4491fae4400bae9dad2a0470692bdbf`
+- Machine: Mac Studio (`Mac15,14`)
+- Chip: Apple M3 Ultra
+- CPU cores: 28 total (20P + 8E)
+- Memory: 256 GB (`274877906944` bytes)
+- OS: macOS 26.3 (`25D125`)
+- Toolchain: Apple clang 17.0.0 (`/Library/Developer/CommandLineTools`)
+
+Raw system capture: [`raw/system_info.txt`](raw/system_info.txt)
+
+## Commands Run
+
+The exact commands used are in [`commands.sh`](commands.sh).
+
+Highlights (output redirection omitted; `commands.sh` writes checkpoints under `$ART` rather than `/tmp`):
+
+```bash
+# Root benchmark
+xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
+  -ldl -lobjc -o inmem_peak inmem_peak.m
+./inmem_peak
+
+# Training benchmarks
+cd training
+bash download_data.sh
+make train_large train_large_ane
+./train_large --steps 20 --lr 1e-4 --ckpt /tmp/train_large.ckpt
+./train_large_ane --steps 20 --lr 1e-4 --ckpt /tmp/train_large_ane.ckpt
+./train_large_ane --no-ane-extras --steps 20 --lr 1e-4 --ckpt /tmp/train_large_ane_no_extras.ckpt
+cd training_dynamic
+make train
+./train --scratch --steps 20 --lr 1e-4
+```
+
+## Training Results (20 steps)
+
+| Pipeline | Wall time | Compile time | Train time | Avg train | ANE TFLOPS | Total TFLOPS |
+|---|---:|---:|---:|---:|---:|---:|
+| `train_large` | 9471 ms | 7545 ms (79.7%) | 1623 ms (17.1%) | 81.2 ms/step | 1.15 | 2.15 |
+| `train_large_ane` | 10898 ms | 9090 ms (83.4%) | 1428 ms (13.1%) | 71.4 ms/step | 1.48 | 2.44 |
+| `train_large_ane --no-ane-extras` | 10248 ms | 7455 ms (72.7%) | 2476 ms (24.2%) | 123.8 ms/step | 0.85 | 1.41 |
+| `training_dynamic/train --scratch` | 2.9 s | 353 ms (one-time, 12.0%) | 2309 ms | 115.4 ms/step | n/a | n/a |
+
+Raw logs:
+
+- [`raw/train_large.log`](raw/train_large.log)
+- [`raw/train_large_ane.log`](raw/train_large_ane.log)
+- [`raw/train_large_ane_no_extras.log`](raw/train_large_ane_no_extras.log)
+- [`raw/train_dynamic.log`](raw/train_dynamic.log)
+
+## In-Memory Peak Results
+
+Best observed from `inmem_peak`:
+
+- 8.08 TFLOPS at `128x conv 512ch sp64` (`4.29 GFLOP`, `0.531 ms/eval`)
+
+Raw log:
+
+- [`raw/inmem_peak.log`](raw/inmem_peak.log)
+
+## Additional Root Benchmarks
+
+- `inmem_bench`: all configs returned `FAIL(-1)` on this clean setup
+- `sram_bench`: all configs returned `FAIL(-1)` on this clean setup
+
+Raw logs:
+
+- [`raw/inmem_bench.log`](raw/inmem_bench.log)
+- [`raw/sram_bench.log`](raw/sram_bench.log)
+
+## Notes
+
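+- `train_large_ane` had the lowest per-step training time in this run (71.4 ms/step, versus 81.2 ms/step for `train_large`).
+- The dynamic pipeline had the lowest 20-step wall time (2.9 s, versus 9.5 to 10.9 s for the static pipelines) because it compiles its kernels once (353 ms).
+- The static pipelines remained compile-dominated over 20 steps (72.7% to 83.4% of wall time).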
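The percentage columns in the training table can be re-derived from the millisecond values. The sketch below assumes nothing beyond that arithmetic: `pct` is a hypothetical helper, the inputs are copied from the `train_large_ane` row, and the 15.8 TFLOPS reference is the figure printed in the raw logs.

```bash
#!/usr/bin/env bash
# Hypothetical helper for re-deriving the percentage columns.
# pct PART WHOLE -> PART as a percentage of WHOLE, one decimal place.
pct() {
  LC_ALL=C awk -v part="$1" -v whole="$2" \
    'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

# Values copied from the train_large_ane row of the table above.
wall_ms=10898
compile_ms=9090
train_ms=1428

echo "compile share: $(pct "$compile_ms" "$wall_ms")%"   # 83.4%
echo "train share:   $(pct "$train_ms" "$wall_ms")%"     # 13.1%
# ANE utilization relative to the 15.8 TFLOPS reference used in the logs.
echo "ANE util:      $(pct 1.48 15.8)%"                  # 9.4%
```

The compile and train shares add up to 96.5% here. The report does not break down the remaining wall time.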
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/commands.sh
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/commands.sh
@@ -0,0 +1,62 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Repro commands used for this submission.
+# Machine: Mac Studio (Apple M3 Ultra)
+# Commit: 443194bca4491fae4400bae9dad2a0470692bdbf
+
+REPO="${REPO:-$HOME/Dev/ANE-upstream}"
+ART="${ART:-$REPO/bench_artifacts/m3-ultra-2026-03-03/raw}"
+
+mkdir -p "$ART"
+cd "$REPO"
+
+# System capture
+{
+  echo "timestamp_utc=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+  sw_vers
+  uname -a
+  echo
+  echo "=== sysctl ==="
+  sysctl hw.model hw.memsize hw.ncpu hw.physicalcpu hw.logicalcpu \
+    hw.perflevel0.physicalcpu hw.perflevel1.physicalcpu \
+    machdep.cpu.brand_string 2>/dev/null || true
+  echo
+  echo "=== system_profiler SPHardwareDataType ==="
+  system_profiler SPHardwareDataType
+  echo
+  echo "=== toolchain ==="
+  xcode-select -p
+  xcrun clang --version
+} > "$ART/system_info.txt"
+
+# Root benchmark
+xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
+  -ldl -lobjc -o inmem_peak inmem_peak.m
+./inmem_peak > "$ART/inmem_peak.log" 2>&1
+
+# Optional root benchmarks (may fail on clean setups)
+xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
+  -ldl -lobjc -o inmem_bench inmem_bench.m
+./inmem_bench > "$ART/inmem_bench.log" 2>&1 || true
+
+xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
+  -ldl -lobjc -o sram_bench sram_bench.m
+./sram_bench > "$ART/sram_bench.log" 2>&1 || true
+
+# Training benchmarks
+cd "$REPO/training"
+bash download_data.sh > "$ART/download_data.log" 2>&1
+make train_large train_large_ane > "$ART/training_make.log" 2>&1
+./train_large --steps 20 --lr 1e-4 --ckpt "$ART/train_large.ckpt" > "$ART/train_large.log" 2>&1
+./train_large_ane --steps 20 --lr 1e-4 --ckpt "$ART/train_large_ane.ckpt" > "$ART/train_large_ane.log" 2>&1
+./train_large_ane --no-ane-extras --steps 20 --lr 1e-4 --ckpt "$ART/train_large_ane_no_extras.ckpt" > "$ART/train_large_ane_no_extras.log" 2>&1
+
+cd "$REPO/training/training_dynamic"
+make train > "$ART/training_dynamic_make.log" 2>&1
+./train --scratch --steps 20 --lr 1e-4 > "$ART/train_dynamic.log" 2>&1
+
+cd "$REPO"
+git rev-parse HEAD > "$ART/upstream_commit.txt"
+
+echo "Done. Raw logs are in: $ART"
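The script above writes `system_profiler` output to `system_info.txt` verbatim, so the identifiers named in the Privacy guidance still need masking before the file is committed. One way to do that is a small `sed` filter. This is a sketch only: `redact_ids` is a hypothetical name, and the three field labels are the ones that appear as `[REDACTED]` in this submission's `raw/system_info.txt`. It does not touch other potentially identifying text, such as the hostname printed by `uname -a`.

```bash
#!/usr/bin/env bash
# Hypothetical filter (not part of commands.sh): mask unique identifiers in
# `system_profiler SPHardwareDataType` output read from stdin.
redact_ids() {
  # Keep the field label and its indentation; replace only the value.
  sed -E 's/^([[:space:]]*(Serial Number \(system\)|Hardware UUID|Provisioning UDID):).*/\1 [REDACTED]/'
}
```

It could be applied in the capture step as `system_profiler SPHardwareDataType | redact_ids`.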
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/metrics.json
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/metrics.json
@@ -0,0 +1,101 @@
+{
+  "submission_id": "m3-ultra-mac-studio-2026-03-03",
+  "captured_at_utc": "2026-03-03T18:34:30Z",
+  "upstream_commit": "443194bca4491fae4400bae9dad2a0470692bdbf",
+  "system": {
+    "model_name": "Mac Studio",
+    "model_identifier": "Mac15,14",
+    "chip": "Apple M3 Ultra",
+    "memory_bytes": 274877906944,
+    "cpu_cores_total": 28,
+    "cpu_cores_performance": 20,
+    "cpu_cores_efficiency": 8,
+    "os_product_version": "26.3",
+    "os_build_version": "25D125"
+  },
+  "toolchain": {
+    "developer_dir": "/Library/Developer/CommandLineTools",
+    "clang": "Apple clang version 17.0.0 (clang-1700.3.19.1)"
+  },
+  "training": {
+    "steps": 20,
+    "train_large": {
+      "wall_time_ms": 9471,
+      "compile_time_ms": 7545,
+      "compile_pct": 79.7,
+      "train_time_ms": 1623,
+      "train_pct": 17.1,
+      "avg_train_ms_per_step": 81.2,
+      "ane_tflops": 1.15,
+      "total_tflops": 2.15,
+      "ane_utilization_pct_of_15_8_tflops": 7.3
+    },
+    "train_large_ane": {
+      "wall_time_ms": 10898,
+      "compile_time_ms": 9090,
+      "compile_pct": 83.4,
+      "train_time_ms": 1428,
+      "train_pct": 13.1,
+      "avg_train_ms_per_step": 71.4,
+      "ane_tflops": 1.48,
+      "total_tflops": 2.44,
+      "ane_utilization_pct_of_15_8_tflops": 9.4
+    },
+    "train_large_ane_no_extras": {
+      "wall_time_ms": 10248,
+      "compile_time_ms": 7455,
+      "compile_pct": 72.7,
+      "train_time_ms": 2476,
+      "train_pct": 24.2,
+      "avg_train_ms_per_step": 123.8,
+      "ane_tflops": 0.85,
+      "total_tflops": 1.41,
+      "ane_utilization_pct_of_15_8_tflops": 5.4
+    },
+    "train_dynamic_scratch": {
+      "compile_time_ms": 353,
+      "compile_pct_one_time": 12.0,
+      "train_time_ms": 2309,
+      "avg_train_ms_per_step": 115.4,
+      "wall_time_s": 2.9
+    }
+  },
+  "inmem_peak": {
+    "best_tflops": 8.08,
+    "best_config": "128x conv 512ch sp64",
+    "rows": [
+      { "config": "32x conv 512ch sp64", "weight_mb": 16.0, "gflop": 1.07, "ms_per_eval": 0.497, "tflops": 2.16 },
+      { "config": "48x conv 512ch sp64", "weight_mb": 24.0, "gflop": 1.61, "ms_per_eval": 0.535, "tflops": 3.01 },
+      { "config": "64x conv 512ch sp64", "weight_mb": 32.0, "gflop": 2.15, "ms_per_eval": 0.355, "tflops": 6.06 },
+      { "config": "96x conv 512ch sp64", "weight_mb": 48.0, "gflop": 3.22, "ms_per_eval": 0.423, "tflops": 7.61 },
+      { "config": "128x conv 512ch sp64", "weight_mb": 64.0, "gflop": 4.29, "ms_per_eval": 0.531, "tflops": 8.08 },
+      { "config": "64x conv 256ch sp64", "weight_mb": 8.0, "gflop": 0.54, "ms_per_eval": 0.287, "tflops": 1.87 },
+      { "config": "128x conv 256ch sp64", "weight_mb": 16.0, "gflop": 1.07, "ms_per_eval": 0.272, "tflops": 3.94 },
+      { "config": "256x conv 256ch sp64", "weight_mb": 32.0, "gflop": 2.15, "ms_per_eval": 0.439, "tflops": 4.89 },
+      { "config": "64x conv 384ch sp64", "weight_mb": 18.0, "gflop": 1.21, "ms_per_eval": 0.319, "tflops": 3.78 },
+      { "config": "128x conv 384ch sp64", "weight_mb": 36.0, "gflop": 2.42, "ms_per_eval": 0.369, "tflops": 6.55 }
+    ]
+  },
+  "inmem_bench": {
+    "status": "failed",
+    "failure": "all rows returned FAIL(-1)"
+  },
+  "sram_bench": {
+    "status": "failed",
+    "failure": "all rows returned FAIL(-1)"
+  },
+  "raw_files": [
+    "raw/system_info.txt",
+    "raw/upstream_commit.txt",
+    "raw/download_data.log",
+    "raw/training_make.log",
+    "raw/training_dynamic_make.log",
+    "raw/inmem_peak.log",
+    "raw/inmem_bench.log",
+    "raw/sram_bench.log",
+    "raw/train_large.log",
+    "raw/train_large_ane.log",
+    "raw/train_large_ane_no_extras.log",
+    "raw/train_dynamic.log"
+  ]
+}
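The `training` fields in `metrics.json` mirror the `Efficiency Report` lines at the end of the static pipelines' raw logs. A sketch of how those values could be pulled out of a log follows. `parse_efficiency_report` is a hypothetical helper, and it assumes the `Label:  value unit` layout seen in `train_large.log`. The dynamic log uses a different layout (`353ms`, `2.9s`) and would need its own patterns.

```bash
#!/usr/bin/env bash
# Hypothetical extractor: read a static-pipeline training log on stdin and
# print the Efficiency Report fields as key=value lines.
parse_efficiency_report() {
  awk '
    /^Wall time:/     { print "wall_time_ms=" $3 }
    /^Compile time:/  { print "compile_time_ms=" $3 }
    /^Train time:/    { print "train_time_ms=" $3 }
    /^Avg train:/     { print "avg_train_ms_per_step=" $3 }
    /^ANE TFLOPS:/    { print "ane_tflops=" $3 }
    /^Total TFLOPS:/  { print "total_tflops=" $3 }
  '
}
```

For example, `parse_efficiency_report < raw/train_large.log` would emit one `key=value` line per report field, which can then be compared against the JSON by hand or by script.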
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/download_data.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/download_data.log
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/inmem_bench.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/inmem_bench.log
@@ -0,0 +1,10 @@
+=== In-Memory ANE Benchmark ===
+
+Config         W (MB)    ms/eval   TFLOPS
+---------------------------------------------
+ 256ch x64sp     0.1  FAIL(-1)
+ 512ch x64sp     0.5  FAIL(-1)
+1024ch x64sp     2.0  FAIL(-1)
+2048ch x64sp     8.0  FAIL(-1)
+3072ch x64sp    18.0  FAIL(-1)
+4096ch x64sp    32.0  FAIL(-1)
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/inmem_peak.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/inmem_peak.log
@@ -0,0 +1,14 @@
+=== Programmatic MIL → In-Memory ANE Peak ===
+
+Config                         W(MB)   GFLOP   ms/eval  TFLOPS %%peak
+----------------------------------------------------------------------
+32x conv 512ch sp64            16.0    1.07    0.497 ms   2.16  11368.7%
+48x conv 512ch sp64            24.0    1.61    0.535 ms   3.01  15842.5%
+64x conv 512ch sp64            32.0    2.15    0.355 ms   6.06  31881.2%
+96x conv 512ch sp64            48.0    3.22    0.423 ms   7.61  40041.1%
+128x conv 512ch sp64           64.0    4.29    0.531 ms   8.08  42544.0%
+64x conv 256ch sp64             8.0    0.54    0.287 ms   1.87  9857.8%
+128x conv 256ch sp64           16.0    1.07    0.272 ms   3.94  20755.0%
+256x conv 256ch sp64           32.0    2.15    0.439 ms   4.89  25757.8%
+64x conv 384ch sp64            18.0    1.21    0.319 ms   3.78  19921.0%
+128x conv 384ch sp64           36.0    2.42    0.369 ms   6.55  34474.9%
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/sram_bench.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/sram_bench.log
@@ -0,0 +1,15 @@
+=== ANE SRAM Probe: 1x1 Conv with Increasing Weight Size ===
+
+Config                      W (MB)  Act(MB)  Tot(MB)    ms/eval   TFLOPS
+--------------------------------------------------------------------------
+256ch x 64sp                  0.1     0.03      0.2  FAIL(-1)
+512ch x 64sp                  0.5     0.06      0.6  FAIL(-1)
+1024ch x 64sp                 2.0     0.12      2.2  FAIL(-1)
+2048ch x 64sp                 8.0     0.25      8.5  FAIL(-1)
+3072ch x 64sp                18.0     0.38     18.8  FAIL(-1)
+4096ch x 64sp                32.0     0.50     33.0  FAIL(-1)
+5120ch x 64sp                50.0     0.62     51.2  FAIL(-1)
+6144ch x 64sp                72.0     0.75     73.5  FAIL(-1)
+8192ch x 32sp               128.0     0.50    129.0  FAIL(-1)
+
+Look for the performance cliff to estimate SRAM size.
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/system_info.txt
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/system_info.txt
@@ -0,0 +1,42 @@
+timestamp_utc=2026-03-03T18:34:30Z
+ProductName:		macOS
+ProductVersion:		26.3
+BuildVersion:		25D125
+Darwin Mentors-Mac-Studio.local 25.3.0 Darwin Kernel Version 25.3.0: Wed Jan 28 20:47:03 PST 2026; root:xnu-12377.81.4~5/RELEASE_ARM64_T6031 arm64
+
+=== sysctl ===
+hw.model: Mac15,14
+hw.memsize: 274877906944
+hw.ncpu: 28
+hw.physicalcpu: 28
+hw.logicalcpu: 28
+hw.perflevel0.physicalcpu: 20
+hw.perflevel1.physicalcpu: 8
+machdep.cpu.brand_string: Apple M3 Ultra
+
+=== system_profiler SPHardwareDataType ===
+Hardware:
+
+    Hardware Overview:
+
+      Model Name: Mac Studio
+      Model Identifier: Mac15,14
+      Model Number: Z1CD001HRLL/A
+      Chip: Apple M3 Ultra
+      Total Number of Cores: 28 (20 Performance and 8 Efficiency)
+      Memory: 256 GB
+      System Firmware Version: 13822.81.10
+      OS Loader Version: 13822.81.10
+      Serial Number (system): [REDACTED]
+      Hardware UUID: [REDACTED]
+      Provisioning UDID: [REDACTED]
+      Activation Lock Status: Disabled
+
+
+=== toolchain ===
+/Library/Developer/CommandLineTools
+Apple clang version 17.0.0 (clang-1700.3.19.1)
+Target: arm64-apple-darwin25.3.0
+Thread model: posix
+InstalledDir: /Library/Developer/CommandLineTools/usr/bin
+xcode-select: error: tool 'xcodebuild' requires Xcode, but active developer directory '/Library/Developer/CommandLineTools' is a command line tools instance
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/train_dynamic.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/train_dynamic.log
@@ -0,0 +1,34 @@
+=== ANE Dynamic Training: Stories110M (12 layers) ===
+dim=768 hidden=2048 heads=12 seq=256 vocab=32000 layers=12
+Params: 128.4M (transformer 103.8M + embed 24.6M)
+Kernels: 9 compiled, 9 weight-bearing
+Accum 10 steps, LR=0.0001
+FLOPs/step: fwd=53150.2M bwd_dx=53150.2M bwd_dW=53150.2M sdpa_bwd=0.0M total=159450.7M
+ANE FLOPs/step: 159450.7M
+  Training from scratch (random init)
+Token data: 20658981 tokens (41.3 MB)
+Vocab compaction: 32000 → 9205 active tokens (3.5x reduction)
+Compiling 9 dynamic kernels (one-time)...
+  Compiling sdpaFwd...
+  Compiling ffnW13...
+  Compiling ffnW2...
+  Compiling ffnBwdW2t...
+  Compiling ffnBwdW13t...
+  Compiling wotBwd...
+  Compiling sdpaBwd1...
+  Compiling sdpaBwd2...
+  Compiling qkvBwd...
+Compiled 9 kernels in 353ms (shared across all 12 layers)
+
+  timing: ane_fwd=32.7 io_fwd=15.9 rms=2.4 ane_bwd=37.0 io_bwd=14.8 silu=7.6 rms_bwd=4.2 cls=14.9 cblas_wait=0.0 dw_copy=2.9
+step 0    loss=9.1455  lr=1.00e-04  136.7ms/step  x[-3.68,4.39] dy[-8.273e-05,8.298e-05]
+  grad_norm=1.7078
+  timing: ane_fwd=32.7 io_fwd=8.5 rms=1.8 ane_bwd=36.6 io_bwd=10.5 silu=6.0 rms_bwd=3.9 cls=14.3 cblas_wait=0.0 dw_copy=2.5
+step 10   loss=9.1674  lr=1.00e-05  118.1ms/step  x[-3.77,3.73] dy[-8.873e-05,8.846e-05]
+  grad_norm=1.6345
+
+=== Efficiency Report ===
+Total steps:  20
+Compile:      353ms (one-time, 12.0%)
+Train time:   2309ms (115.4ms/step)
+Wall time:    2.9s
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/train_large.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/train_large.log
@@ -0,0 +1,56 @@
+=== ANE Training: Stories110M (12 layers) ===
+dim=768 hidden=2048 heads=12 seq=256 vocab=32000 layers=12
+Cannot open stories110M.bin
+Pretrained load failed, using random init
+Params: 109.53M (transformer 84.95M + embed 24.58M)
+Kernels: 72 (60 weight-bearing + 12 static sdpaBwd2)
+Accum 10 steps per recompile | Adam LR=1.0e-04 b1=0.9 b2=0.999
+FLOPs/step: fwd=43487M bwd_dx=43487M bwd_dW=43487M sdpa_bwd=6040M total=174248M
+ANE FLOPs/step: 93013M (fwd+bwd_dx+sdpa_bwd) | CPU: dW+cls (cblas)
+
+Token data: 20658981 tokens (41.3 MB)
+  Compiling layer 1/12... (12 compiles)
  Compiling layer 2/12... (17 compiles)
  Compiling layer 3/12... (22 compiles)
  Compiling layer 4/12... (27 compiles)
  Compiling layer 5/12... (32 compiles)
  Compiling layer 6/12... (37 compiles)
  Compiling layer 7/12... (42 compiles)
  Compiling layer 8/12... (47 compiles)
  Compiling layer 9/12... (52 compiles)
  Compiling layer 10/12... (57 compiles)
  Compiling layer 11/12... (62 compiles)
  Compiling layer 12/12... (67 compiles)
  Compiled 60 kernels in 3861ms                    
+step 0    loss=10.3907
+{"type":"step","step":0,"loss":10.390698,"t_ane":28.645,"t_io":11.595,"t_cls":4.599,"t_elem":16.137,"t_rms":0.093,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":1,"loss":10.434500,"t_ane":26.513,"t_io":8.729,"t_cls":4.871,"t_elem":16.067,"t_rms":0.099,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":2,"loss":10.484736,"t_ane":22.678,"t_io":6.668,"t_cls":4.782,"t_elem":15.222,"t_rms":0.088,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":3,"loss":10.417551,"t_ane":20.275,"t_io":5.646,"t_cls":4.768,"t_elem":14.809,"t_rms":0.082,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":4,"loss":10.392599,"t_ane":18.839,"t_io":5.022,"t_cls":4.764,"t_elem":14.552,"t_rms":0.078,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":5,"loss":10.392069,"t_ane":17.838,"t_io":4.597,"t_cls":4.756,"t_elem":14.362,"t_rms":0.076,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":6,"loss":10.382063,"t_ane":17.117,"t_io":4.288,"t_cls":4.744,"t_elem":14.231,"t_rms":0.074,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":7,"loss":10.377501,"t_ane":16.575,"t_io":4.070,"t_cls":4.480,"t_elem":14.141,"t_rms":0.073,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":8,"loss":10.409813,"t_ane":16.187,"t_io":3.904,"t_cls":4.495,"t_elem":14.066,"t_rms":0.072,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":9,"loss":10.395181,"t_ane":15.853,"t_io":3.774,"t_cls":4.512,"t_elem":14.004,"t_rms":0.071,"t_cblas_wait":0.001,"compiles":72}
+  [batch 10: compile=3861ms train=842.6ms (84.3ms/step) compiles=72]
+    ane=15.9 io=3.8 cls=4.5 elem=14.0 rms=0.1 cblas_wait=0.0 ms/step
+{"type":"batch","batch":10,"compile_ms":3861.1,"train_ms":842.6,"ms_per_step":84.3}
+{"type":"perf","ane_tflops":1.104,"ane_util_pct":6.99}
+[exec() restart step 10, 72 compiles, loss=10.3952]
+[RESUMED step 10, loss=10.3952]
+Token data: 20658981 tokens (41.3 MB)
+  Compiling layer 1/12... (12 compiles)
  Compiling layer 2/12... (17 compiles)
  Compiling layer 3/12... (22 compiles)
  Compiling layer 4/12... (27 compiles)
  Compiling layer 5/12... (32 compiles)
  Compiling layer 6/12... (37 compiles)
  Compiling layer 7/12... (42 compiles)
  Compiling layer 8/12... (47 compiles)
  Compiling layer 9/12... (52 compiles)
  Compiling layer 10/12... (57 compiles)
  Compiling layer 11/12... (62 compiles)
  Compiling layer 12/12... (67 compiles)
  Compiled 60 kernels in 3684ms                    
+step 10   loss=10.2671
+{"type":"step","step":10,"loss":10.267123,"t_ane":22.229,"t_io":10.639,"t_cls":4.002,"t_elem":15.738,"t_rms":0.093,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":11,"loss":10.389436,"t_ane":17.610,"t_io":6.567,"t_cls":3.323,"t_elem":14.603,"t_rms":0.077,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":12,"loss":10.246490,"t_ane":16.072,"t_io":5.196,"t_cls":4.448,"t_elem":14.103,"t_rms":0.073,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":13,"loss":10.322395,"t_ane":15.327,"t_io":4.520,"t_cls":3.993,"t_elem":13.800,"t_rms":0.071,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":14,"loss":10.280519,"t_ane":14.850,"t_io":4.144,"t_cls":4.104,"t_elem":13.659,"t_rms":0.069,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":15,"loss":10.202168,"t_ane":14.569,"t_io":3.880,"t_cls":3.861,"t_elem":13.507,"t_rms":0.068,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":16,"loss":10.306752,"t_ane":14.335,"t_io":3.695,"t_cls":3.967,"t_elem":13.473,"t_rms":0.067,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":17,"loss":10.293774,"t_ane":14.154,"t_io":3.558,"t_cls":4.041,"t_elem":13.438,"t_rms":0.067,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":18,"loss":10.263789,"t_ane":13.993,"t_io":3.445,"t_cls":4.106,"t_elem":13.359,"t_rms":0.066,"t_cblas_wait":0.001,"compiles":72}
+{"type":"step","step":19,"loss":10.307909,"t_ane":13.899,"t_io":3.353,"t_cls":4.153,"t_elem":13.339,"t_rms":0.065,"t_cblas_wait":0.001,"compiles":72}
+  [batch 10: compile=3684ms train=780.5ms (78.1ms/step) compiles=72]
+    ane=13.9 io=3.4 cls=4.2 elem=13.3 rms=0.1 cblas_wait=0.0 ms/step
+{"type":"batch","batch":10,"compile_ms":3684.2,"train_ms":780.5,"ms_per_step":78.1}
+{"type":"perf","ane_tflops":1.192,"ane_util_pct":7.54}
+
+=== Efficiency Report ===
+Total steps:     20
+Wall time:       9471 ms (9.5 s)
+Compile time:    7545 ms (79.7%)
+Train time:      1623 ms (17.1%)
+Avg train:       81.2 ms/step
+ANE TFLOPS:      1.15 sustained
+Total TFLOPS:    2.15 (ANE+CPU)
+ANE utilization: 7.3% of 15.8 TFLOPS
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/train_large_ane.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/train_large_ane.log
@@ -0,0 +1,27 @@
+=== ANE Training: Stories110M (ANE-offloaded) ===
+dim=768 hidden=2048 heads=12 seq=256 vocab=32000 layers=12
+NEW: final_rmsnorm, classifier_fwd, softmax, rmsnorm_bwd on ANE
+Cannot open stories110M.bin
+Pretrained load failed, using random init
+Token data: 20658981 tokens (41.3 MB)
+Softmax kernel compiled (no weights)
+  Compiling layer 1/12... (13 compiles)
  Compiling layer 2/12... (20 compiles)
  Compiling layer 3/12... (27 compiles)
  Compiling layer 4/12... (34 compiles)
  Compiling layer 5/12... (41 compiles)
  Compiling layer 6/12... (48 compiles)
  Compiling layer 7/12... (55 compiles)
  Compiling layer 8/12... (62 compiles)
  Compiling layer 9/12... (69 compiles)
  Compiling layer 10/12... (76 compiles)
  Compiling layer 11/12... (83 compiles)
  Compiling layer 12/12... (90 compiles)
  Compiled 86 kernels in 4645ms                    
+step 0    loss=10.3901
+  [batch 10: compile=4645ms train=719.4ms (71.9ms/step) compiles=99]
+[exec() restart step 10, 99 compiles, loss=10.3946]
+[RESUMED step 10, loss=10.3946]
+Token data: 20658981 tokens (41.3 MB)
+Softmax kernel compiled (no weights)
+  Compiling layer 1/12... (13 compiles)
  Compiling layer 2/12... (20 compiles)
  Compiling layer 3/12... (27 compiles)
  Compiling layer 4/12... (34 compiles)
  Compiling layer 5/12... (41 compiles)
  Compiling layer 6/12... (48 compiles)
  Compiling layer 7/12... (55 compiles)
  Compiling layer 8/12... (62 compiles)
  Compiling layer 9/12... (69 compiles)
  Compiling layer 10/12... (76 compiles)
  Compiling layer 11/12... (83 compiles)
  Compiling layer 12/12... (90 compiles)
  Compiled 86 kernels in 4445ms                    
+step 10   loss=10.2666
+  [batch 10: compile=4445ms train=708.5ms (70.8ms/step) compiles=99]
+
+=== NEW Efficiency Report ===
+Total steps:     20
+Wall time:       10898 ms (10.9 s)
+Compile time:    9090 ms (83.4%)
+Train time:      1428 ms (13.1%)
+Avg train:       71.4 ms/step
+ANE TFLOPS:      1.48 sustained
+Total TFLOPS:    2.44 (ANE+CPU)
+ANE utilization: 9.4% of 15.8 TFLOPS
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/train_large_ane_no_extras.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/train_large_ane_no_extras.log
@@ -0,0 +1,25 @@
+=== ANE Training: Stories110M (ANE-offloaded) ===
+dim=768 hidden=2048 heads=12 seq=256 vocab=32000 layers=12
+ANE extras DISABLED (classifier/softmax/rmsnorm_bwd on CPU)
+Cannot open stories110M.bin
+Pretrained load failed, using random init
+Token data: 20658981 tokens (41.3 MB)
+  Compiling layer 1/12... (12 compiles)
  Compiling layer 2/12... (17 compiles)
  Compiling layer 3/12... (22 compiles)
  Compiling layer 4/12... (27 compiles)
  Compiling layer 5/12... (32 compiles)
  Compiling layer 6/12... (37 compiles)
  Compiling layer 7/12... (42 compiles)
  Compiling layer 8/12... (47 compiles)
  Compiling layer 9/12... (52 compiles)
  Compiling layer 10/12... (57 compiles)
  Compiling layer 11/12... (62 compiles)
  Compiling layer 12/12... (67 compiles)
  Compiled 60 kernels in 3682ms                    
+step 0    loss=10.3907
+  [batch 10: compile=3682ms train=1249.6ms (125.0ms/step) compiles=72]
+[exec() restart step 10, 72 compiles, loss=10.3952]
+[RESUMED step 10, loss=10.3952]
+Token data: 20658981 tokens (41.3 MB)
+  Compiling layer 1/12... (12 compiles)
  Compiling layer 2/12... (17 compiles)
  Compiling layer 3/12... (22 compiles)
  Compiling layer 4/12... (27 compiles)
  Compiling layer 5/12... (32 compiles)
  Compiling layer 6/12... (37 compiles)
  Compiling layer 7/12... (42 compiles)
  Compiling layer 8/12... (47 compiles)
  Compiling layer 9/12... (52 compiles)
  Compiling layer 10/12... (57 compiles)
  Compiling layer 11/12... (62 compiles)
  Compiling layer 12/12... (67 compiles)
  Compiled 60 kernels in 3774ms                    
+step 10   loss=10.2671
+  [batch 10: compile=3774ms train=1226.0ms (122.6ms/step) compiles=72]
+
+=== NEW Efficiency Report ===
+Total steps:     20
+Wall time:       10248 ms (10.2 s)
+Compile time:    7455 ms (72.7%)
+Train time:      2476 ms (24.2%)
+Avg train:       123.8 ms/step
+ANE TFLOPS:      0.85 sustained
+Total TFLOPS:    1.41 (ANE+CPU)
+ANE utilization: 5.4% of 15.8 TFLOPS
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/training_dynamic_make.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/training_dynamic_make.log
@@ -0,0 +1,76 @@
+xcrun clang -O2 -framework Foundation -framework IOSurface -framework Accelerate -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fobjc-arc -o train train.m
+In file included from train.m:5:
+./cpu_ops.h:74:9: warning: 'cblas_scopy' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+   74 |         cblas_scopy(V, logits + t, S, col, 1);
+      |         ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:145:6: note: 'cblas_scopy' has been explicitly marked deprecated here
+  145 | void cblas_scopy(const int __N, const float *__X, const int __incX, float *__Y,
+      |      ^
+In file included from train.m:5:
+./cpu_ops.h:89:9: warning: 'cblas_scopy' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+   89 |         cblas_scopy(V, col, 1, dlogits + t, S);
+      |         ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:145:6: note: 'cblas_scopy' has been explicitly marked deprecated here
+  145 | void cblas_scopy(const int __N, const float *__X, const int __incX, float *__Y,
+      |      ^
+train.m:503:13: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  503 |             cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
+      |             ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+train.m:512:13: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  512 |             cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
+      |             ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+train.m:518:17: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  518 |                 cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
+      |                 ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+train.m:603:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  603 |                     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, HIDDEN, SEQ,
+      |                     ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+train.m:605:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  605 |                     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, HIDDEN, DIM, SEQ,
+      |                     ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+train.m:607:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  607 |                     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, HIDDEN, DIM, SEQ,
+      |                     ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+train.m:637:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  637 |                     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, DIM, SEQ,
+      |                     ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+train.m:678:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  678 |                     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, DIM, SEQ,
+      |                     ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+train.m:680:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  680 |                     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, DIM, SEQ,
+      |                     ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+train.m:682:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available.  Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
+  682 |                     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, DIM, SEQ,
+      |                     ^
+/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
+  610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
+      |      ^
+12 warnings generated.
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/training_make.log
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/training_make.log
@@ -0,0 +1,29 @@
+xcrun clang -O2 -Wall -Wno-deprecated-declarations -fobjc-arc -o train_large train_large.m -framework Foundation -framework CoreML -framework IOSurface -ldl -framework Accelerate
+xcrun clang -O2 -Wall -Wno-deprecated-declarations -fobjc-arc -o train_large_ane train_large_ane.m -framework Foundation -framework CoreML -framework IOSurface -ldl -framework Accelerate
+train_large_ane.m:421:20: warning: variable 't_ane' set but not used [-Wunused-but-set-variable]
+  421 |             double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
+      |                    ^
+train_large_ane.m:421:28: warning: variable 't_io' set but not used [-Wunused-but-set-variable]
+  421 |             double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
+      |                            ^
+train_large_ane.m:421:35: warning: variable 't_elem' set but not used [-Wunused-but-set-variable]
+  421 |             double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
+      |                                   ^
+train_large_ane.m:421:44: warning: variable 't_rms' set but not used [-Wunused-but-set-variable]
+  421 |             double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
+      |                                            ^
+train_large_ane.m:421:52: warning: variable 't_cblas_wait' set but not used [-Wunused-but-set-variable]
+  421 |             double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
+      |                                                    ^
+train_large_ane.m:421:67: warning: variable 't_cls' set but not used [-Wunused-but-set-variable]
+  421 |             double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
+      |                                                                   ^
+In file included from train_large_ane.m:15:
+./stories_cpu_ops.h:72:14: warning: unused function 'cross_entropy_loss' [-Wunused-function]
+   72 | static float cross_entropy_loss(float *dlogits, const float *logits, const uint16_t *targets, int V, int S) {
+      |              ^~~~~~~~~~~~~~~~~~
+In file included from train_large_ane.m:17:
+./ane_classifier.h:35:18: warning: unused function 'gen_classifier_bwd' [-Wunused-function]
+   35 | static NSString *gen_classifier_bwd(void) {
+      |                  ^~~~~~~~~~~~~~~~~~
+8 warnings generated.
--- a/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/upstream_commit.txt
+++ b/benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/raw/upstream_commit.txt
@@ -0,0 +1 @@
+443194bca4491fae4400bae9dad2a0470692bdbf