Add reproducible M3 Ultra benchmark submission package

nabbilkhan 2026-03-03 18:39:34 +00:00
parent 443194bca4
commit 7fceb99988
17 changed files with 632 additions and 5 deletions


@@ -148,11 +148,17 @@ No external dependencies. Uses only system frameworks + private ANE APIs resolve
| Channel-first layout | 20.3 | 5.2% |
| vDSP vectorized RMSNorm | 14.2 | 7.4% |
| GCD async cblas overlap | 11.4 | 9.2% |
| ANE RMSNorm fusion | 11.4 | 9.2% |
| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
| Deferred cblas wait | **9.3** | **11.2%** |
## Community Benchmarks
Community hardware benchmark submissions live in [`benchmarks/submissions/`](benchmarks/submissions/).
- [Mac Studio (Apple M3 Ultra, 256 GB) — 2026-03-03](benchmarks/submissions/m3-ultra-mac-studio-2026-03-03/README.md)
## Disclaimer
This project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.

benchmarks/README.md Normal file

@@ -0,0 +1,34 @@
# Community Benchmark Submissions
This folder is for reproducible hardware benchmark submissions from the community.
## Goals
- Make cross-chip results easy to compare.
- Keep raw logs attached so numbers are auditable.
- Keep submissions lightweight and low-maintenance.
## Submission Layout
Use one directory per machine/date:
`benchmarks/submissions/<chip>-<machine>-<YYYY-MM-DD>/`
Required files:
- `README.md` — short summary of machine, commands, and key results
- `metrics.json` — machine-readable summary of key metrics
- `raw/` — raw command outputs (`*.log`, `system_info.txt`, `upstream_commit.txt`)
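As a rough illustration of the layout above, the following sketch checks a submission directory for the listed files. The function name `check_submission` is hypothetical and not part of this repository, and treating `system_info.txt`, `upstream_commit.txt`, and at least one `*.log` as mandatory is an assumption about how strictly the list is meant.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): verifies that a submission
# directory contains the files named in "Submission Layout".
check_submission() {
  local dir="$1" missing=0 path
  for path in README.md metrics.json raw/system_info.txt raw/upstream_commit.txt; do
    if [ ! -f "$dir/$path" ]; then
      echo "missing: $path"
      missing=1
    fi
  done
  # Assumption: at least one raw *.log should accompany the system capture.
  if ! ls "$dir"/raw/*.log >/dev/null 2>&1; then
    echo "missing: raw/*.log"
    missing=1
  fi
  return "$missing"
}
```

Running `check_submission benchmarks/submissions/<chip>-<machine>-<YYYY-MM-DD>` prints each missing path and returns a non-zero status if anything is absent.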
## Privacy
Please redact machine serial numbers, UUIDs, and other unique identifiers before committing logs.
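One possible way to do the redaction is a small `sed` filter, sketched below. The function name `redact_ids` is hypothetical, and the three field labels are simply the ones that `system_profiler SPHardwareDataType` printed in the M3 Ultra submission; other tools may emit identifiers under different labels.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads text on stdin and masks the value of the
# identifier fields printed by `system_profiler SPHardwareDataType`.
redact_ids() {
  sed -E 's/^([[:space:]]*(Serial Number \(system\)|Hardware UUID|Provisioning UDID):).*/\1 [REDACTED]/'
}
```

For example, `system_profiler SPHardwareDataType | redact_ids` leaves all other lines unchanged. Identifiers that appear elsewhere, such as the hostname in `uname -a` output, are not covered by this filter and still need a manual check.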
## Minimal Repro Guidance
Each submission should include:
- exact upstream commit hash tested
- exact commands run
- fixed step counts for training comparisons (for example, `--steps 20`)
- clear pass/fail status for each benchmark


@@ -0,0 +1,81 @@
# Mac Studio M3 Ultra Benchmark Submission (2026-03-03)
This submission responds to upstream issue `#3`, which collects results across Apple Silicon variants.
## Environment
- Upstream commit: `443194bca4491fae4400bae9dad2a0470692bdbf`
- Machine: Mac Studio (`Mac15,14`)
- Chip: Apple M3 Ultra
- CPU cores: 28 total (20P + 8E)
- Memory: 256 GB (`274877906944` bytes)
- OS: macOS 26.3 (`25D125`)
- Toolchain: Apple clang 17.0.0 (`/Library/Developer/CommandLineTools`)
Raw system capture: [`raw/system_info.txt`](raw/system_info.txt)
## Commands Run
The exact commands are in [`commands.sh`](commands.sh).
Highlights (the script writes checkpoints under `$ART` instead of `/tmp` and redirects output to log files):
```bash
# Root benchmark
xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
  -ldl -lobjc -o inmem_peak inmem_peak.m
./inmem_peak

# Training benchmarks
cd training
bash download_data.sh
make train_large train_large_ane
./train_large --steps 20 --lr 1e-4 --ckpt /tmp/train_large.ckpt
./train_large_ane --steps 20 --lr 1e-4 --ckpt /tmp/train_large_ane.ckpt
./train_large_ane --no-ane-extras --steps 20 --lr 1e-4 --ckpt /tmp/train_large_ane_no_extras.ckpt

cd training_dynamic
make train
./train --scratch --steps 20 --lr 1e-4
```
## Training Results (20 steps)
| Pipeline | Wall time | Compile time | Train time | Avg train | ANE TFLOPS | Total TFLOPS |
|---|---:|---:|---:|---:|---:|---:|
| `train_large` | 9471 ms | 7545 ms (79.7%) | 1623 ms (17.1%) | 81.2 ms/step | 1.15 | 2.15 |
| `train_large_ane` | 10898 ms | 9090 ms (83.4%) | 1428 ms (13.1%) | 71.4 ms/step | 1.48 | 2.44 |
| `train_large_ane --no-ane-extras` | 10248 ms | 7455 ms (72.7%) | 2476 ms (24.2%) | 123.8 ms/step | 0.85 | 1.41 |
| `training_dynamic/train --scratch` | 2.9 s | 353 ms (one-time, 12.0%) | 2309 ms | 115.4 ms/step | n/a | n/a |
Raw logs:
- [`raw/train_large.log`](raw/train_large.log)
- [`raw/train_large_ane.log`](raw/train_large_ane.log)
- [`raw/train_large_ane_no_extras.log`](raw/train_large_ane_no_extras.log)
- [`raw/train_dynamic.log`](raw/train_dynamic.log)
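The percentage and per-step columns can be cross-checked from the raw totals. The sketch below assumes that each percentage is a share of wall time, that the per-step average is train time divided by the 20 steps, and that ANE utilization is measured against the 15.8 TFLOPS reference printed in the logs. The helper names `pct` and `per_step` are made up for illustration.

```bash
#!/usr/bin/env bash
# pct PART WHOLE -> PART as a percentage of WHOLE, one decimal place.
pct() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'; }

# per_step TOTAL_MS STEPS -> average milliseconds per step, one decimal place.
per_step() { awk -v t="$1" -v n="$2" 'BEGIN { printf "%.1f\n", t / n }'; }

# Values from the train_large_ane row of the table above.
pct 9090 10898      # compile share of wall time
per_step 1428 20    # average train time per step
pct 1.48 15.8       # ANE utilization against the 15.8 TFLOPS reference
```

With the `train_large_ane` totals this reproduces the 83.4%, 71.4 ms/step, and 9.4% figures reported for that pipeline.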
## In-Memory Peak Results
Best observed from `inmem_peak`:
- 8.08 TFLOPS at `128x conv 512ch sp64` (`4.29 GFLOP`, `0.531 ms/eval`)
Raw log:
- [`raw/inmem_peak.log`](raw/inmem_peak.log)
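The TFLOPS figure follows from the other two numbers, since one GFLOP per millisecond is 10^9 FLOP / 10^-3 s = 10^12 FLOP/s. A minimal check, with the helper name `tflops` chosen only for this example:

```bash
#!/usr/bin/env bash
# tflops GFLOP MS_PER_EVAL -> throughput in TFLOPS, two decimal places.
# GFLOP divided by milliseconds is numerically equal to TFLOPS.
tflops() { awk -v g="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", g / ms }'; }

tflops 4.29 0.531   # best row: 128x conv 512ch sp64
```

For the best row this gives 8.08. Other rows in the raw log can differ in the last digit when recomputed this way, because the logged GFLOP and ms/eval values are themselves rounded.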
## Additional Root Benchmarks
- `inmem_bench`: all configs returned `FAIL(-1)` on this clean setup
- `sram_bench`: all configs returned `FAIL(-1)` on this clean setup
Raw logs:
- [`raw/inmem_bench.log`](raw/inmem_bench.log)
- [`raw/sram_bench.log`](raw/sram_bench.log)
## Notes
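- `train_large_ane` had the best per-step throughput in this run: 71.4 ms/step, versus 81.2 for `train_large`, 115.4 for the dynamic pipeline, and 123.8 with `--no-ane-extras`.
- The dynamic pipeline had the lowest wall-clock time for this 20-step run (2.9 s) because its 9 kernels compile only once (353 ms).
- The static pipelines remained compile-dominated over 20 steps, with compilation taking 72.7% to 83.4% of wall time.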


@@ -0,0 +1,62 @@
#!/usr/bin/env bash
set -euo pipefail
# Repro commands used for this submission.
# Machine: Mac Studio (Apple M3 Ultra)
# Commit: 443194bca4491fae4400bae9dad2a0470692bdbf
REPO="${REPO:-$HOME/Dev/ANE-upstream}"
ART="${ART:-$REPO/bench_artifacts/m3-ultra-2026-03-03/raw}"
mkdir -p "$ART"
cd "$REPO"
# System capture
{
  echo "timestamp_utc=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  sw_vers
  uname -a
  echo
  echo "=== sysctl ==="
  sysctl hw.model hw.memsize hw.ncpu hw.physicalcpu hw.logicalcpu \
    hw.perflevel0.physicalcpu hw.perflevel1.physicalcpu \
    machdep.cpu.brand_string 2>/dev/null || true
  echo
  echo "=== system_profiler SPHardwareDataType ==="
  system_profiler SPHardwareDataType
  echo
  echo "=== toolchain ==="
  xcode-select -p
  xcrun clang --version
} > "$ART/system_info.txt"
# Root benchmark
xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
  -ldl -lobjc -o inmem_peak inmem_peak.m
./inmem_peak > "$ART/inmem_peak.log" 2>&1
# Optional root benchmarks (may fail on clean setups)
xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
  -ldl -lobjc -o inmem_bench inmem_bench.m
./inmem_bench > "$ART/inmem_bench.log" 2>&1 || true
xcrun clang -O2 -framework Foundation -framework IOSurface -framework CoreML \
  -ldl -lobjc -o sram_bench sram_bench.m
./sram_bench > "$ART/sram_bench.log" 2>&1 || true
# Training benchmarks
cd "$REPO/training"
bash download_data.sh > "$ART/download_data.log" 2>&1
make train_large train_large_ane > "$ART/training_make.log" 2>&1
./train_large --steps 20 --lr 1e-4 --ckpt "$ART/train_large.ckpt" > "$ART/train_large.log" 2>&1
./train_large_ane --steps 20 --lr 1e-4 --ckpt "$ART/train_large_ane.ckpt" > "$ART/train_large_ane.log" 2>&1
./train_large_ane --no-ane-extras --steps 20 --lr 1e-4 --ckpt "$ART/train_large_ane_no_extras.ckpt" > "$ART/train_large_ane_no_extras.log" 2>&1
cd "$REPO/training/training_dynamic"
make train > "$ART/training_dynamic_make.log" 2>&1
./train --scratch --steps 20 --lr 1e-4 > "$ART/train_dynamic.log" 2>&1
cd "$REPO"
git rev-parse HEAD > "$ART/upstream_commit.txt"
echo "Done. Raw logs are in: $ART"


@@ -0,0 +1,101 @@
{
"submission_id": "m3-ultra-mac-studio-2026-03-03",
"captured_at_utc": "2026-03-03T18:34:30Z",
"upstream_commit": "443194bca4491fae4400bae9dad2a0470692bdbf",
"system": {
"model_name": "Mac Studio",
"model_identifier": "Mac15,14",
"chip": "Apple M3 Ultra",
"memory_bytes": 274877906944,
"cpu_cores_total": 28,
"cpu_cores_performance": 20,
"cpu_cores_efficiency": 8,
"os_product_version": "26.3",
"os_build_version": "25D125"
},
"toolchain": {
"developer_dir": "/Library/Developer/CommandLineTools",
"clang": "Apple clang version 17.0.0 (clang-1700.3.19.1)"
},
"training": {
"steps": 20,
"train_large": {
"wall_time_ms": 9471,
"compile_time_ms": 7545,
"compile_pct": 79.7,
"train_time_ms": 1623,
"train_pct": 17.1,
"avg_train_ms_per_step": 81.2,
"ane_tflops": 1.15,
"total_tflops": 2.15,
"ane_utilization_pct_of_15_8_tflops": 7.3
},
"train_large_ane": {
"wall_time_ms": 10898,
"compile_time_ms": 9090,
"compile_pct": 83.4,
"train_time_ms": 1428,
"train_pct": 13.1,
"avg_train_ms_per_step": 71.4,
"ane_tflops": 1.48,
"total_tflops": 2.44,
"ane_utilization_pct_of_15_8_tflops": 9.4
},
"train_large_ane_no_extras": {
"wall_time_ms": 10248,
"compile_time_ms": 7455,
"compile_pct": 72.7,
"train_time_ms": 2476,
"train_pct": 24.2,
"avg_train_ms_per_step": 123.8,
"ane_tflops": 0.85,
"total_tflops": 1.41,
"ane_utilization_pct_of_15_8_tflops": 5.4
},
"train_dynamic_scratch": {
"compile_time_ms": 353,
"compile_pct_one_time": 12.0,
"train_time_ms": 2309,
"avg_train_ms_per_step": 115.4,
"wall_time_s": 2.9
}
},
"inmem_peak": {
"best_tflops": 8.08,
"best_config": "128x conv 512ch sp64",
"rows": [
{ "config": "32x conv 512ch sp64", "weight_mb": 16.0, "gflop": 1.07, "ms_per_eval": 0.497, "tflops": 2.16 },
{ "config": "48x conv 512ch sp64", "weight_mb": 24.0, "gflop": 1.61, "ms_per_eval": 0.535, "tflops": 3.01 },
{ "config": "64x conv 512ch sp64", "weight_mb": 32.0, "gflop": 2.15, "ms_per_eval": 0.355, "tflops": 6.06 },
{ "config": "96x conv 512ch sp64", "weight_mb": 48.0, "gflop": 3.22, "ms_per_eval": 0.423, "tflops": 7.61 },
{ "config": "128x conv 512ch sp64", "weight_mb": 64.0, "gflop": 4.29, "ms_per_eval": 0.531, "tflops": 8.08 },
{ "config": "64x conv 256ch sp64", "weight_mb": 8.0, "gflop": 0.54, "ms_per_eval": 0.287, "tflops": 1.87 },
{ "config": "128x conv 256ch sp64", "weight_mb": 16.0, "gflop": 1.07, "ms_per_eval": 0.272, "tflops": 3.94 },
{ "config": "256x conv 256ch sp64", "weight_mb": 32.0, "gflop": 2.15, "ms_per_eval": 0.439, "tflops": 4.89 },
{ "config": "64x conv 384ch sp64", "weight_mb": 18.0, "gflop": 1.21, "ms_per_eval": 0.319, "tflops": 3.78 },
{ "config": "128x conv 384ch sp64", "weight_mb": 36.0, "gflop": 2.42, "ms_per_eval": 0.369, "tflops": 6.55 }
]
},
"inmem_bench": {
"status": "failed",
"failure": "all rows returned FAIL(-1)"
},
"sram_bench": {
"status": "failed",
"failure": "all rows returned FAIL(-1)"
},
"raw_files": [
"raw/system_info.txt",
"raw/upstream_commit.txt",
"raw/download_data.log",
"raw/training_make.log",
"raw/training_dynamic_make.log",
"raw/inmem_peak.log",
"raw/inmem_bench.log",
"raw/sram_bench.log",
"raw/train_large.log",
"raw/train_large_ane.log",
"raw/train_large_ane_no_extras.log",
"raw/train_dynamic.log"
]
}

File diff suppressed because one or more lines are too long


@@ -0,0 +1,10 @@
=== In-Memory ANE Benchmark ===
Config W (MB) ms/eval TFLOPS
---------------------------------------------
256ch x64sp 0.1 FAIL(-1)
512ch x64sp 0.5 FAIL(-1)
1024ch x64sp 2.0 FAIL(-1)
2048ch x64sp 8.0 FAIL(-1)
3072ch x64sp 18.0 FAIL(-1)
4096ch x64sp 32.0 FAIL(-1)


@@ -0,0 +1,14 @@
=== Programmatic MIL → In-Memory ANE Peak ===
Config W(MB) GFLOP ms/eval TFLOPS %%peak
----------------------------------------------------------------------
32x conv 512ch sp64 16.0 1.07 0.497 ms 2.16 11368.7%
48x conv 512ch sp64 24.0 1.61 0.535 ms 3.01 15842.5%
64x conv 512ch sp64 32.0 2.15 0.355 ms 6.06 31881.2%
96x conv 512ch sp64 48.0 3.22 0.423 ms 7.61 40041.1%
128x conv 512ch sp64 64.0 4.29 0.531 ms 8.08 42544.0%
64x conv 256ch sp64 8.0 0.54 0.287 ms 1.87 9857.8%
128x conv 256ch sp64 16.0 1.07 0.272 ms 3.94 20755.0%
256x conv 256ch sp64 32.0 2.15 0.439 ms 4.89 25757.8%
64x conv 384ch sp64 18.0 1.21 0.319 ms 3.78 19921.0%
128x conv 384ch sp64 36.0 2.42 0.369 ms 6.55 34474.9%


@@ -0,0 +1,15 @@
=== ANE SRAM Probe: 1x1 Conv with Increasing Weight Size ===
Config W (MB) Act(MB) Tot(MB) ms/eval TFLOPS
--------------------------------------------------------------------------
256ch x 64sp 0.1 0.03 0.2 FAIL(-1)
512ch x 64sp 0.5 0.06 0.6 FAIL(-1)
1024ch x 64sp 2.0 0.12 2.2 FAIL(-1)
2048ch x 64sp 8.0 0.25 8.5 FAIL(-1)
3072ch x 64sp 18.0 0.38 18.8 FAIL(-1)
4096ch x 64sp 32.0 0.50 33.0 FAIL(-1)
5120ch x 64sp 50.0 0.62 51.2 FAIL(-1)
6144ch x 64sp 72.0 0.75 73.5 FAIL(-1)
8192ch x 32sp 128.0 0.50 129.0 FAIL(-1)
Look for the performance cliff to estimate SRAM size.


@@ -0,0 +1,42 @@
timestamp_utc=2026-03-03T18:34:30Z
ProductName: macOS
ProductVersion: 26.3
BuildVersion: 25D125
Darwin Mentors-Mac-Studio.local 25.3.0 Darwin Kernel Version 25.3.0: Wed Jan 28 20:47:03 PST 2026; root:xnu-12377.81.4~5/RELEASE_ARM64_T6031 arm64
=== sysctl ===
hw.model: Mac15,14
hw.memsize: 274877906944
hw.ncpu: 28
hw.physicalcpu: 28
hw.logicalcpu: 28
hw.perflevel0.physicalcpu: 20
hw.perflevel1.physicalcpu: 8
machdep.cpu.brand_string: Apple M3 Ultra
=== system_profiler SPHardwareDataType ===
Hardware:
Hardware Overview:
Model Name: Mac Studio
Model Identifier: Mac15,14
Model Number: Z1CD001HRLL/A
Chip: Apple M3 Ultra
Total Number of Cores: 28 (20 Performance and 8 Efficiency)
Memory: 256 GB
System Firmware Version: 13822.81.10
OS Loader Version: 13822.81.10
Serial Number (system): [REDACTED]
Hardware UUID: [REDACTED]
Provisioning UDID: [REDACTED]
Activation Lock Status: Disabled
=== toolchain ===
/Library/Developer/CommandLineTools
Apple clang version 17.0.0 (clang-1700.3.19.1)
Target: arm64-apple-darwin25.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
xcode-select: error: tool 'xcodebuild' requires Xcode, but active developer directory '/Library/Developer/CommandLineTools' is a command line tools instance


@@ -0,0 +1,34 @@
=== ANE Dynamic Training: Stories110M (12 layers) ===
dim=768 hidden=2048 heads=12 seq=256 vocab=32000 layers=12
Params: 128.4M (transformer 103.8M + embed 24.6M)
Kernels: 9 compiled, 9 weight-bearing
Accum 10 steps, LR=0.0001
FLOPs/step: fwd=53150.2M bwd_dx=53150.2M bwd_dW=53150.2M sdpa_bwd=0.0M total=159450.7M
ANE FLOPs/step: 159450.7M
Training from scratch (random init)
Token data: 20658981 tokens (41.3 MB)
Vocab compaction: 32000 → 9205 active tokens (3.5x reduction)
Compiling 9 dynamic kernels (one-time)...
Compiling sdpaFwd...
Compiling ffnW13...
Compiling ffnW2...
Compiling ffnBwdW2t...
Compiling ffnBwdW13t...
Compiling wotBwd...
Compiling sdpaBwd1...
Compiling sdpaBwd2...
Compiling qkvBwd...
Compiled 9 kernels in 353ms (shared across all 12 layers)
timing: ane_fwd=32.7 io_fwd=15.9 rms=2.4 ane_bwd=37.0 io_bwd=14.8 silu=7.6 rms_bwd=4.2 cls=14.9 cblas_wait=0.0 dw_copy=2.9
step 0 loss=9.1455 lr=1.00e-04 136.7ms/step x[-3.68,4.39] dy[-8.273e-05,8.298e-05]
grad_norm=1.7078
timing: ane_fwd=32.7 io_fwd=8.5 rms=1.8 ane_bwd=36.6 io_bwd=10.5 silu=6.0 rms_bwd=3.9 cls=14.3 cblas_wait=0.0 dw_copy=2.5
step 10 loss=9.1674 lr=1.00e-05 118.1ms/step x[-3.77,3.73] dy[-8.873e-05,8.846e-05]
grad_norm=1.6345
=== Efficiency Report ===
Total steps: 20
Compile: 353ms (one-time, 12.0%)
Train time: 2309ms (115.4ms/step)
Wall time: 2.9s


@@ -0,0 +1,56 @@
=== ANE Training: Stories110M (12 layers) ===
dim=768 hidden=2048 heads=12 seq=256 vocab=32000 layers=12
Cannot open stories110M.bin
Pretrained load failed, using random init
Params: 109.53M (transformer 84.95M + embed 24.58M)
Kernels: 72 (60 weight-bearing + 12 static sdpaBwd2)
Accum 10 steps per recompile | Adam LR=1.0e-04 b1=0.9 b2=0.999
FLOPs/step: fwd=43487M bwd_dx=43487M bwd_dW=43487M sdpa_bwd=6040M total=174248M
ANE FLOPs/step: 93013M (fwd+bwd_dx+sdpa_bwd) | CPU: dW+cls (cblas)
Token data: 20658981 tokens (41.3 MB)
Compiling layer 1/12... (12 compiles) Compiling layer 2/12... (17 compiles) Compiling layer 3/12... (22 compiles) Compiling layer 4/12... (27 compiles) Compiling layer 5/12... (32 compiles) Compiling layer 6/12... (37 compiles) Compiling layer 7/12... (42 compiles) Compiling layer 8/12... (47 compiles) Compiling layer 9/12... (52 compiles) Compiling layer 10/12... (57 compiles) Compiling layer 11/12... (62 compiles) Compiling layer 12/12... (67 compiles) Compiled 60 kernels in 3861ms
step 0 loss=10.3907
{"type":"step","step":0,"loss":10.390698,"t_ane":28.645,"t_io":11.595,"t_cls":4.599,"t_elem":16.137,"t_rms":0.093,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":1,"loss":10.434500,"t_ane":26.513,"t_io":8.729,"t_cls":4.871,"t_elem":16.067,"t_rms":0.099,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":2,"loss":10.484736,"t_ane":22.678,"t_io":6.668,"t_cls":4.782,"t_elem":15.222,"t_rms":0.088,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":3,"loss":10.417551,"t_ane":20.275,"t_io":5.646,"t_cls":4.768,"t_elem":14.809,"t_rms":0.082,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":4,"loss":10.392599,"t_ane":18.839,"t_io":5.022,"t_cls":4.764,"t_elem":14.552,"t_rms":0.078,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":5,"loss":10.392069,"t_ane":17.838,"t_io":4.597,"t_cls":4.756,"t_elem":14.362,"t_rms":0.076,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":6,"loss":10.382063,"t_ane":17.117,"t_io":4.288,"t_cls":4.744,"t_elem":14.231,"t_rms":0.074,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":7,"loss":10.377501,"t_ane":16.575,"t_io":4.070,"t_cls":4.480,"t_elem":14.141,"t_rms":0.073,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":8,"loss":10.409813,"t_ane":16.187,"t_io":3.904,"t_cls":4.495,"t_elem":14.066,"t_rms":0.072,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":9,"loss":10.395181,"t_ane":15.853,"t_io":3.774,"t_cls":4.512,"t_elem":14.004,"t_rms":0.071,"t_cblas_wait":0.001,"compiles":72}
[batch 10: compile=3861ms train=842.6ms (84.3ms/step) compiles=72]
ane=15.9 io=3.8 cls=4.5 elem=14.0 rms=0.1 cblas_wait=0.0 ms/step
{"type":"batch","batch":10,"compile_ms":3861.1,"train_ms":842.6,"ms_per_step":84.3}
{"type":"perf","ane_tflops":1.104,"ane_util_pct":6.99}
[exec() restart step 10, 72 compiles, loss=10.3952]
[RESUMED step 10, loss=10.3952]
Token data: 20658981 tokens (41.3 MB)
Compiling layer 1/12... (12 compiles) Compiling layer 2/12... (17 compiles) Compiling layer 3/12... (22 compiles) Compiling layer 4/12... (27 compiles) Compiling layer 5/12... (32 compiles) Compiling layer 6/12... (37 compiles) Compiling layer 7/12... (42 compiles) Compiling layer 8/12... (47 compiles) Compiling layer 9/12... (52 compiles) Compiling layer 10/12... (57 compiles) Compiling layer 11/12... (62 compiles) Compiling layer 12/12... (67 compiles) Compiled 60 kernels in 3684ms
step 10 loss=10.2671
{"type":"step","step":10,"loss":10.267123,"t_ane":22.229,"t_io":10.639,"t_cls":4.002,"t_elem":15.738,"t_rms":0.093,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":11,"loss":10.389436,"t_ane":17.610,"t_io":6.567,"t_cls":3.323,"t_elem":14.603,"t_rms":0.077,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":12,"loss":10.246490,"t_ane":16.072,"t_io":5.196,"t_cls":4.448,"t_elem":14.103,"t_rms":0.073,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":13,"loss":10.322395,"t_ane":15.327,"t_io":4.520,"t_cls":3.993,"t_elem":13.800,"t_rms":0.071,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":14,"loss":10.280519,"t_ane":14.850,"t_io":4.144,"t_cls":4.104,"t_elem":13.659,"t_rms":0.069,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":15,"loss":10.202168,"t_ane":14.569,"t_io":3.880,"t_cls":3.861,"t_elem":13.507,"t_rms":0.068,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":16,"loss":10.306752,"t_ane":14.335,"t_io":3.695,"t_cls":3.967,"t_elem":13.473,"t_rms":0.067,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":17,"loss":10.293774,"t_ane":14.154,"t_io":3.558,"t_cls":4.041,"t_elem":13.438,"t_rms":0.067,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":18,"loss":10.263789,"t_ane":13.993,"t_io":3.445,"t_cls":4.106,"t_elem":13.359,"t_rms":0.066,"t_cblas_wait":0.001,"compiles":72}
{"type":"step","step":19,"loss":10.307909,"t_ane":13.899,"t_io":3.353,"t_cls":4.153,"t_elem":13.339,"t_rms":0.065,"t_cblas_wait":0.001,"compiles":72}
[batch 10: compile=3684ms train=780.5ms (78.1ms/step) compiles=72]
ane=13.9 io=3.4 cls=4.2 elem=13.3 rms=0.1 cblas_wait=0.0 ms/step
{"type":"batch","batch":10,"compile_ms":3684.2,"train_ms":780.5,"ms_per_step":78.1}
{"type":"perf","ane_tflops":1.192,"ane_util_pct":7.54}
=== Efficiency Report ===
Total steps: 20
Wall time: 9471 ms (9.5 s)
Compile time: 7545 ms (79.7%)
Train time: 1623 ms (17.1%)
Avg train: 81.2 ms/step
ANE TFLOPS: 1.15 sustained
Total TFLOPS: 2.15 (ANE+CPU)
ANE utilization: 7.3% of 15.8 TFLOPS


@@ -0,0 +1,27 @@
=== ANE Training: Stories110M (ANE-offloaded) ===
dim=768 hidden=2048 heads=12 seq=256 vocab=32000 layers=12
NEW: final_rmsnorm, classifier_fwd, softmax, rmsnorm_bwd on ANE
Cannot open stories110M.bin
Pretrained load failed, using random init
Token data: 20658981 tokens (41.3 MB)
Softmax kernel compiled (no weights)
Compiling layer 1/12... (13 compiles) Compiling layer 2/12... (20 compiles) Compiling layer 3/12... (27 compiles) Compiling layer 4/12... (34 compiles) Compiling layer 5/12... (41 compiles) Compiling layer 6/12... (48 compiles) Compiling layer 7/12... (55 compiles) Compiling layer 8/12... (62 compiles) Compiling layer 9/12... (69 compiles) Compiling layer 10/12... (76 compiles) Compiling layer 11/12... (83 compiles) Compiling layer 12/12... (90 compiles) Compiled 86 kernels in 4645ms
step 0 loss=10.3901
[batch 10: compile=4645ms train=719.4ms (71.9ms/step) compiles=99]
[exec() restart step 10, 99 compiles, loss=10.3946]
[RESUMED step 10, loss=10.3946]
Token data: 20658981 tokens (41.3 MB)
Softmax kernel compiled (no weights)
Compiling layer 1/12... (13 compiles) Compiling layer 2/12... (20 compiles) Compiling layer 3/12... (27 compiles) Compiling layer 4/12... (34 compiles) Compiling layer 5/12... (41 compiles) Compiling layer 6/12... (48 compiles) Compiling layer 7/12... (55 compiles) Compiling layer 8/12... (62 compiles) Compiling layer 9/12... (69 compiles) Compiling layer 10/12... (76 compiles) Compiling layer 11/12... (83 compiles) Compiling layer 12/12... (90 compiles) Compiled 86 kernels in 4445ms
step 10 loss=10.2666
[batch 10: compile=4445ms train=708.5ms (70.8ms/step) compiles=99]
=== NEW Efficiency Report ===
Total steps: 20
Wall time: 10898 ms (10.9 s)
Compile time: 9090 ms (83.4%)
Train time: 1428 ms (13.1%)
Avg train: 71.4 ms/step
ANE TFLOPS: 1.48 sustained
Total TFLOPS: 2.44 (ANE+CPU)
ANE utilization: 9.4% of 15.8 TFLOPS


@@ -0,0 +1,25 @@
=== ANE Training: Stories110M (ANE-offloaded) ===
dim=768 hidden=2048 heads=12 seq=256 vocab=32000 layers=12
ANE extras DISABLED (classifier/softmax/rmsnorm_bwd on CPU)
Cannot open stories110M.bin
Pretrained load failed, using random init
Token data: 20658981 tokens (41.3 MB)
Compiling layer 1/12... (12 compiles) Compiling layer 2/12... (17 compiles) Compiling layer 3/12... (22 compiles) Compiling layer 4/12... (27 compiles) Compiling layer 5/12... (32 compiles) Compiling layer 6/12... (37 compiles) Compiling layer 7/12... (42 compiles) Compiling layer 8/12... (47 compiles) Compiling layer 9/12... (52 compiles) Compiling layer 10/12... (57 compiles) Compiling layer 11/12... (62 compiles) Compiling layer 12/12... (67 compiles) Compiled 60 kernels in 3682ms
step 0 loss=10.3907
[batch 10: compile=3682ms train=1249.6ms (125.0ms/step) compiles=72]
[exec() restart step 10, 72 compiles, loss=10.3952]
[RESUMED step 10, loss=10.3952]
Token data: 20658981 tokens (41.3 MB)
Compiling layer 1/12... (12 compiles) Compiling layer 2/12... (17 compiles) Compiling layer 3/12... (22 compiles) Compiling layer 4/12... (27 compiles) Compiling layer 5/12... (32 compiles) Compiling layer 6/12... (37 compiles) Compiling layer 7/12... (42 compiles) Compiling layer 8/12... (47 compiles) Compiling layer 9/12... (52 compiles) Compiling layer 10/12... (57 compiles) Compiling layer 11/12... (62 compiles) Compiling layer 12/12... (67 compiles) Compiled 60 kernels in 3774ms
step 10 loss=10.2671
[batch 10: compile=3774ms train=1226.0ms (122.6ms/step) compiles=72]
=== NEW Efficiency Report ===
Total steps: 20
Wall time: 10248 ms (10.2 s)
Compile time: 7455 ms (72.7%)
Train time: 2476 ms (24.2%)
Avg train: 123.8 ms/step
ANE TFLOPS: 0.85 sustained
Total TFLOPS: 1.41 (ANE+CPU)
ANE utilization: 5.4% of 15.8 TFLOPS


@@ -0,0 +1,76 @@
xcrun clang -O2 -framework Foundation -framework IOSurface -framework Accelerate -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -fobjc-arc -o train train.m
In file included from train.m:5:
./cpu_ops.h:74:9: warning: 'cblas_scopy' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
74 | cblas_scopy(V, logits + t, S, col, 1);
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:145:6: note: 'cblas_scopy' has been explicitly marked deprecated here
145 | void cblas_scopy(const int __N, const float *__X, const int __incX, float *__Y,
| ^
In file included from train.m:5:
./cpu_ops.h:89:9: warning: 'cblas_scopy' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
89 | cblas_scopy(V, col, 1, dlogits + t, S);
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:145:6: note: 'cblas_scopy' has been explicitly marked deprecated here
145 | void cblas_scopy(const int __N, const float *__X, const int __incX, float *__Y,
| ^
train.m:503:13: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
503 | cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
train.m:512:13: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
512 | cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
train.m:518:17: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
518 | cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
train.m:603:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
603 | cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, HIDDEN, SEQ,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
train.m:605:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
605 | cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, HIDDEN, DIM, SEQ,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
train.m:607:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
607 | cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, HIDDEN, DIM, SEQ,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
train.m:637:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
637 | cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, DIM, SEQ,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
train.m:678:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
678 | cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, DIM, SEQ,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
train.m:680:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
680 | cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, DIM, SEQ,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
train.m:682:21: warning: 'cblas_sgemm' is deprecated: first deprecated in macOS 13.3 - An updated CBLAS interface supporting ILP64 is available. Please compile with -DACCELERATE_NEW_LAPACK to access the new headers and -DACCELERATE_LAPACK_ILP64 for ILP64 support. [-Wdeprecated-declarations]
682 | cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, DIM, DIM, SEQ,
| ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:610:6: note: 'cblas_sgemm' has been explicitly marked deprecated here
610 | void cblas_sgemm(const enum CBLAS_ORDER __Order,
| ^
12 warnings generated.


@@ -0,0 +1,29 @@
xcrun clang -O2 -Wall -Wno-deprecated-declarations -fobjc-arc -o train_large train_large.m -framework Foundation -framework CoreML -framework IOSurface -ldl -framework Accelerate
xcrun clang -O2 -Wall -Wno-deprecated-declarations -fobjc-arc -o train_large_ane train_large_ane.m -framework Foundation -framework CoreML -framework IOSurface -ldl -framework Accelerate
train_large_ane.m:421:20: warning: variable 't_ane' set but not used [-Wunused-but-set-variable]
421 | double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
| ^
train_large_ane.m:421:28: warning: variable 't_io' set but not used [-Wunused-but-set-variable]
421 | double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
| ^
train_large_ane.m:421:35: warning: variable 't_elem' set but not used [-Wunused-but-set-variable]
421 | double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
| ^
train_large_ane.m:421:44: warning: variable 't_rms' set but not used [-Wunused-but-set-variable]
421 | double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
| ^
train_large_ane.m:421:52: warning: variable 't_cblas_wait' set but not used [-Wunused-but-set-variable]
421 | double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
| ^
train_large_ane.m:421:67: warning: variable 't_cls' set but not used [-Wunused-but-set-variable]
421 | double t_ane=0,t_io=0,t_elem=0,t_rms=0,t_cblas_wait=0,t_cls=0;
| ^
In file included from train_large_ane.m:15:
./stories_cpu_ops.h:72:14: warning: unused function 'cross_entropy_loss' [-Wunused-function]
72 | static float cross_entropy_loss(float *dlogits, const float *logits, const uint16_t *targets, int V, int S) {
| ^~~~~~~~~~~~~~~~~~
In file included from train_large_ane.m:17:
./ane_classifier.h:35:18: warning: unused function 'gen_classifier_bwd' [-Wunused-function]
35 | static NSString *gen_classifier_bwd(void) {
| ^~~~~~~~~~~~~~~~~~
8 warnings generated.


@@ -0,0 +1 @@
443194bca4491fae4400bae9dad2a0470692bdbf