feat(aether-arena): benchmark-first scorer + witness chain + repeatability (M2/M5/M7)
Per direction "remove the initial number, optimize for benchmark first" + "include witness chain capabilities for proof and repeatability analysis": - Empty board, no seeded numbers: ledger seeds to genesis only. Every result is a real scoring-pipeline witness; RuView gets no hand-entered baseline. - Real model scoring: aa_score_runner now loads predictions + an eval split (--split/--pred) and scores them through the real ruview_metrics pose harness — not just a synthetic fixture. Committed public smoke split (fixtures/smoke_*.json). - Witness chain: each score emits a witness = inputs_sha256 (binds it to the exact inputs) + proof_sha256 (cross-platform-stable score hash) + harness_version. - Repeatability analysis: --repeat N runs the harness N× and fails if it ever yields >=2 distinct proof hashes (16/16 identical locally). - Witness ledger: ledger/ledger_tools.py — append-only, hash-chained, tamper- evident (seed/append/verify); editing any past row breaks the chain. - CI gate extended: determinism + repeatability(16) + real-scoring smoke + ledger chain verify on every PR. Co-Authored-By: claude-flow <ruv@ruv.net>
This commit is contained in:
parent
a6808568a2
commit
483bfa4660
|
|
@ -55,17 +55,38 @@ jobs:
|
|||
- name: Run determinism gate
|
||||
run: cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features
|
||||
|
||||
# 3. Emit the score row into the PR run summary (leaderboard-ledger shape).
|
||||
- name: Score row → job summary
|
||||
# 3. Repeatability analysis (witness chain): the harness must produce one
|
||||
# identical proof hash across many runs — any nondeterminism fails here.
|
||||
- name: Repeatability analysis (16 runs)
|
||||
run: cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --repeat 16
|
||||
|
||||
# 4. Real-scoring smoke: score a sample prediction against the public smoke
|
||||
# split, exercising the actual model-scoring path (not just the fixture).
|
||||
- name: Real-scoring smoke test
|
||||
run: |
|
||||
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- \
|
||||
--split ../aether-arena/fixtures/smoke_split.json \
|
||||
--pred ../aether-arena/fixtures/smoke_pred.json --json
|
||||
|
||||
# 5. Witness ledger chain integrity: the append-only results ledger must
|
||||
# verify (every prev_hash link + row_hash intact = no silent edits).
|
||||
- name: Verify witness ledger chain
|
||||
working-directory: aether-arena/ledger
|
||||
run: python3 ledger_tools.py verify
|
||||
|
||||
# 6. Emit the witness row + repeatability into the PR run summary.
|
||||
- name: Witness row → job summary
|
||||
if: always()
|
||||
run: |
|
||||
ROW=$(cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --json)
|
||||
REP=$(cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --repeat 16)
|
||||
{
|
||||
echo "## AetherArena harness gate"
|
||||
echo "## AetherArena harness gate (witness chain)"
|
||||
echo ""
|
||||
echo "Deterministic score row (ADR-149 §2.2):"
|
||||
echo "Deterministic witness (ADR-149 §2.2 / proof + repeatability):"
|
||||
echo '```json'
|
||||
echo "$ROW"
|
||||
echo "$REP"
|
||||
echo '```'
|
||||
echo ""
|
||||
echo "If the determinism gate failed, the scoring maths changed: regenerate with"
|
||||
|
|
|
|||
|
|
@ -8,15 +8,19 @@ ADR-079 camera-ground-truth collection — *not* an infra-completion blocker.
|
|||
| # | Milestone | Status |
|
||||
|---|-----------|--------|
|
||||
| M1 | ADR-149 Accepted + committed | ✅ done |
|
||||
| M2 | Deterministic scorer runner (`aa_score_runner`) → tier + proof hash | ✅ done — builds `--no-default-features`, hash stable, VERDICT: PASS |
|
||||
| M3 | CI harness-gate workflow (PR runs the scorer) | ✅ done — `.github/workflows/aether-arena-harness.yml` |
|
||||
| M2 | Scorer runner (`aa_score_runner`) — **real model scoring** + witness (proof+inputs hash) + **repeatability analysis** | ✅ done — builds `--no-default-features`, determinism gate PASS, repeatable 16/16 |
|
||||
| M3 | CI harness-gate workflow (PR runs scorer + repeatability + real-scoring smoke + ledger verify) | ✅ done — `.github/workflows/aether-arena-harness.yml` |
|
||||
| M4 | Scaffold: README + submission schema + VERIFY (acceptance test) | ✅ done |
|
||||
| M5 | Public smoke split (committed) + private MM-Fi held-out split prep | ⏳ next |
|
||||
| M6 | HF Space (Gradio) submission flow + sandboxed scorer container | ⛔ blocked — needs HF token / maintainer authorization to deploy |
|
||||
| M7 | Signed append-only Parquet results ledger | ⏳ |
|
||||
| M8 | RuView baseline entry (honest PCK@20) + public launch | ⏳ |
|
||||
| M5 | Public smoke split (committed) + private MM-Fi held-out split prep | 🟡 smoke split done (`fixtures/smoke_*.json`); private MM-Fi prep pending |
|
||||
| M6 | HF Space (Gradio) submission flow + sandboxed scorer container | ⏳ token located (GCP secrets); Space app next |
|
||||
| M7 | **Witness ledger chain** — append-only, hash-chained, tamper-evident | ✅ done — `ledger/ledger_tools.py` (seed/append/verify); tamper test fails as designed |
|
||||
| M8 | Public launch | ⏳ — **board starts EMPTY; no seeded numbers** (benchmark-first: only real harness scores) |
|
||||
|
||||
## Benchmark-first posture (per user direction)
|
||||
- **No placeholder numbers on the board.** The ledger seeds to genesis only; every result is a real scoring-pipeline witness. RuView gets no seeded baseline.
|
||||
- **Witness chain** = `inputs_sha256` (binds witness to exact inputs) + `proof_sha256` (cross-platform-stable score hash) + the append-only hash-chained ledger. Repeatability analysis (`--repeat N`) proves the proof hash is identical across runs.
|
||||
|
||||
## Blockers / decisions needed
|
||||
- **HF deploy (M6)** needs an HF token and authorization to create the public `ruvnet/aether-arena` Space.
|
||||
- **HF deploy (M6)** — token is in GCP Secret Manager (`HUGGINGFACE_API_KEY`); creating the public `ruvnet/aether-arena` Space still wants explicit go.
|
||||
- **MM-Fi is CC BY-NC** → AA must stay non-commercial / legally distinct from the commercial RuView product.
|
||||
- **Realism of M2 fixture**: current fixture is a *determinism* fixture (stable hash), not a realistic baseline; M5 swaps in real MM-Fi held-out scoring.
|
||||
- **Private MM-Fi split (M5)** — needs the dataset pulled + a held-out split assembled before real public scoring replaces the smoke fixture.
|
||||
|
|
|
|||
|
|
@ -12,12 +12,35 @@ The scoring engine is a pure-Rust, GPU-free binary: `aa_score_runner` in `wifi-d
|
|||
cd v2
|
||||
# Verify the committed expected hash still matches (this is the CI gate):
|
||||
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features
|
||||
# → prints the score, the proof sha256, and "VERDICT: PASS"
|
||||
# → prints the witness (inputs_sha256 + proof_sha256) and "VERDICT: PASS"
|
||||
|
||||
# See the leaderboard-ledger row as JSON:
|
||||
# See the witness row as JSON:
|
||||
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --json
|
||||
```
|
||||
|
||||
### Witness chain — proof + repeatability analysis
|
||||
|
||||
Every score is a **witness**: `inputs_sha256` (binds it to the exact inputs scored)
|
||||
+ `proof_sha256` (cross-platform-stable hash of the quantised score) + `harness_version`.
|
||||
Witnesses are recorded in an **append-only, hash-chained ledger** (each row references
|
||||
the previous row's hash), so a silent edit to any past row breaks the chain.
|
||||
|
||||
```bash
|
||||
# Repeatability: run the scorer K times, confirm ONE identical proof hash:
|
||||
cd v2
|
||||
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --repeat 16
|
||||
# → {"repeatability":{"runs":16,"unique_proof_hashes":1,"repeatable":true,...}}
|
||||
|
||||
# Real model scoring (score predictions against an eval split):
|
||||
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- \
|
||||
--split ../aether-arena/fixtures/smoke_split.json \
|
||||
--pred ../aether-arena/fixtures/smoke_pred.json --json
|
||||
|
||||
# Verify the witness ledger chain is intact (tamper-evident):
|
||||
cd ../aether-arena/ledger && python3 ledger_tools.py verify
|
||||
# → "OK: N rows, chain intact" (edit any row and it reports the broken link)
|
||||
```
|
||||
|
||||
The expected hash is committed at [`fixtures/expected_score.sha256`](fixtures/expected_score.sha256). Same harness version + same fixture → same hash on glibc / MSVC / Apple. If your local run prints `VERDICT: PASS`, you have reproduced the scorer.
|
||||
|
||||
### What happens if the scoring maths changes
|
||||
|
|
|
|||
|
|
@ -1 +1 @@
|
|||
dee374bb4ba22bc4583e8280f9d03567d0e174c7e7aac8664a804bb34a428b0e
|
||||
9c35e541d51f00998691b98948887ebca09b907d8eb29a113f97e792340456ba
|
||||
|
|
|
|||
|
|
@ -0,0 +1 @@
|
|||
{"frames": [{"pred": [[0.4003, 0.2734], [0.5038, 0.4197], [0.2053, 0.4438], [0.4397, 0.685], [0.5796, 0.7645], [0.8001, 0.2195], [0.2789, 0.2833], [0.314, 0.5439], [0.511, 0.2259], [0.6008, 0.46], [0.4837, 0.3879], [0.3475, 0.5597], [0.6569, 0.3575], [0.437, 0.6539], [0.2341, 0.6038], [0.7331, 0.392], [0.5615, 0.4915]]}, {"pred": [[0.4669, 0.6066], [0.6012, 0.7873], [0.4124, 0.5997], [0.2832, 0.281], [0.2732, 0.3635], [0.2503, 0.4848], [0.6827, 0.715], [0.4336, 0.7165], [0.295, 0.3386], [0.5337, 0.3544], [0.4397, 0.5474], [0.5163, 0.5528], [0.7547, 0.6799], [0.4195, 0.4448], [0.2257, 0.2269], [0.384, 0.2176], [0.2419, 0.4332]]}, {"pred": [[0.5585, 0.283], [0.4325, 0.2934], [0.463, 0.4744], [0.4188, 0.3454], [0.215, 0.7565], [0.527, 0.2353], [0.7084, 0.6124], [0.3015, 0.6744], [0.4103, 0.3532], [0.7243, 0.6932], [0.3302, 0.4918], [0.2072, 0.3754], [0.7914, 0.4878], [0.7618, 0.4079], [0.323, 0.3386], [0.7104, 0.4997], [0.2673, 0.6077]]}, {"pred": [[0.6372, 0.4984], [0.4184, 0.6763], [0.4498, 0.7549], [0.2924, 0.303], [0.3069, 0.7022], [0.3954, 0.5098], [0.7836, 0.6071], [0.4733, 0.7114], [0.3407, 0.3793], [0.3408, 0.4678], [0.4156, 0.4911], [0.4525, 0.7519], [0.5117, 0.1985], [0.1893, 0.6784], [0.6281, 0.5346], [0.5175, 0.673], [0.36, 0.3665]]}, {"pred": [[0.5535, 0.6537], [0.568, 0.511], [0.4705, 0.5377], [0.6372, 0.7163], [0.5493, 0.7515], [0.2559, 0.4549], [0.2553, 0.6176], [0.2991, 0.6154], [0.7185, 0.7986], [0.4586, 0.5057], [0.2975, 0.4525], [0.3263, 0.3719], [0.5131, 0.4576], [0.557, 0.5268], [0.6572, 0.7736], [0.2146, 0.6526], [0.4662, 0.7371]]}, {"pred": [[0.2924, 0.7595], [0.2612, 0.2315], [0.2488, 0.7751], [0.2329, 0.7282], [0.4744, 0.4206], [0.3618, 0.267], [0.2477, 0.285], [0.3976, 0.3746], [0.494, 0.2874], [0.3596, 0.2112], [0.3311, 0.4692], [0.6912, 0.4727], [0.4434, 0.5233], [0.4139, 0.7048], [0.425, 0.3937], [0.2326, 0.631], [0.2655, 0.7116]]}, {"pred": [[0.3609, 0.3437], [0.285, 0.486], [0.7734, 0.5468], [0.3657, 0.4093], [0.4728, 0.5019], [0.1866, 0.3545], [0.2172, 0.2028], [0.5613, 0.5238], [0.6252, 0.7205], [0.7998, 0.2954], [0.242, 0.7063], [0.6259, 0.6883], [0.5148, 0.7141], [0.5577, 0.7434], [0.3233, 0.2131], [0.2652, 0.7066], [0.5753, 0.5885]]}, {"pred": [[0.6787, 0.6504], [0.6051, 0.2297], [0.2539, 0.3475], [0.6437, 0.7807], [0.4981, 0.6149], [0.5716, 0.2367], [0.6486, 0.3632], [0.2433, 0.369], [0.6061, 0.3731], [0.4955, 0.2591], [0.7676, 0.7602], [0.6899, 0.7716], [0.3143, 0.7707], [0.3031, 0.4997], [0.7076, 0.5133], [0.3382, 0.7196], [0.2002, 0.4871]]}]}
|
||||
|
|
@ -0,0 +1 @@
|
|||
{"frames": [{"gt": [[0.3943, 0.2905], [0.5215, 0.4194], [0.2225, 0.4602], [0.4547, 0.6961], [0.5765, 0.7686], [0.7858, 0.2279], [0.2866, 0.2707], [0.3084, 0.549], [0.5286, 0.2377], [0.6082, 0.4566], [0.4719, 0.3799], [0.3465, 0.5447], [0.6377, 0.3728], [0.4509, 0.6543], [0.2235, 0.6009], [0.7253, 0.3882], [0.5479, 0.4737]], "vis": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], "scale": 1.0}, {"gt": [[0.4845, 0.5985], [0.5883, 0.7959], [0.4315, 0.6012], [0.3008, 0.2703], [0.2776, 0.3486], [0.2483, 0.4695], [0.6916, 0.7184], [0.4153, 0.7305], [0.3057, 0.3392], [0.5535, 0.3576], [0.4216, 0.5398], [0.5093, 0.5706], [0.7397, 0.668], [0.4354, 0.4394], [0.2373, 0.2404], [0.404, 0.2315], [0.2609, 0.4182]], "vis": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], "scale": 1.0}, {"gt": [[0.5684, 0.2891], [0.4185, 0.2737], [0.4796, 0.4903], [0.4056, 0.3589], [0.2139, 0.7706], [0.5259, 0.2162], [0.718, 0.6177], [0.3002, 0.6632], [0.3978, 0.3338], [0.7116, 0.6836], [0.336, 0.5106], [0.2168, 0.3677], [0.7739, 0.4683], [0.773, 0.4188], [0.318, 0.3226], [0.7043, 0.4877], [0.2509, 0.5964]], "vis": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], "scale": 1.0}, {"gt": [[0.6501, 0.4868], [0.3995, 0.6805], [0.4408, 0.7681], [0.2762, 0.2907], [0.2877, 0.6959], [0.4102, 0.5292], [0.7825, 0.5898], [0.4603, 0.723], [0.3511, 0.3758], [0.3556, 0.4514], [0.4123, 0.4749], [0.4524, 0.7506], [0.5141, 0.2112], [0.2024, 0.6795], [0.6351, 0.5339], [0.5333, 0.6706], [0.3491, 0.3662]], "vis": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], "scale": 1.0}, {"gt": [[0.537, 0.656], [0.5675, 0.5033], [0.4714, 0.52], [0.6195, 0.7259], [0.5357, 0.766], [0.273, 0.4653], [0.2439, 0.6017], [0.2927, 0.6297], [0.7297, 0.7805], [0.439, 0.4924], [0.2969, 0.4589], [0.3174, 0.3911], [0.5324, 0.4643], [0.5744, 0.5074], [0.673, 0.783], [0.2238, 0.6674], [0.4534, 0.7468]], "vis": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], "scale": 1.0}, {"gt": [[0.2896, 0.7515], [0.2537, 0.2345], [0.2434, 0.763], [0.2502, 0.7137], [0.4723, 0.4035], [0.3607, 0.2775], [0.2657, 0.2969], [0.3872, 0.383], [0.5001, 0.3067], [0.3503, 0.2092], [0.3137, 0.4849], [0.6914, 0.4593], [0.4359, 0.504], [0.4056, 0.6994], [0.4428, 0.4085], [0.2424, 0.6445], [0.2507, 0.7048]], "vis": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], "scale": 1.0}, {"gt": [[0.3692, 0.3453], [0.2945, 0.4675], [0.7836, 0.5282], [0.3857, 0.414], [0.4848, 0.5017], [0.203, 0.3585], [0.225, 0.2135], [0.5513, 0.5175], [0.6296, 0.7275], [0.7908, 0.2897], [0.2263, 0.7012], [0.6403, 0.6873], [0.5026, 0.701], [0.5504, 0.7357], [0.338, 0.2187], [0.2629, 0.7015], [0.5757, 0.6084]], "vis": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], "scale": 1.0}, {"gt": [[0.6786, 0.649], [0.5956, 0.2396], [0.2447, 0.3593], [0.6439, 0.7854], [0.4874, 0.6102], [0.5857, 0.2465], [0.6459, 0.3827], [0.2364, 0.3613], [0.6054, 0.3745], [0.4798, 0.2711], [0.7869, 0.7618], [0.6919, 0.7809], [0.3259, 0.7674], [0.285, 0.5144], [0.6921, 0.5052], [0.3388, 0.7386], [0.2022, 0.495]], "vis": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], "scale": 1.0}]}
|
||||
|
|
@ -0,0 +1 @@
|
|||
{"benchmark": "AetherArena", "created": "2026-05-30", "kind": "genesis", "note": "Official Spatial-Intelligence Benchmark \u2014 append-only signed ledger. Entries are real harness scores only; no seeded numbers.", "prev_hash": "0000000000000000000000000000000000000000000000000000000000000000", "row_hash": "940bdc6f0f5dd00f4d89e13a8fa843bab3c9ddf1b8051f426a1701e730249231", "seq": 0, "spec": "ADR-149"}
|
||||
|
|
@ -0,0 +1,100 @@
|
|||
#!/usr/bin/env python3
|
||||
"""AetherArena append-only, tamper-evident results ledger (ADR-149 §2.3/§2.4).
|
||||
|
||||
Each row is hash-chained to the previous one: ``row_hash = sha256(canonical_row
|
||||
+ prev_hash)``. Any silent edit to an earlier row breaks every subsequent
|
||||
``prev_hash`` link, so the ledger is append-only and verifiable by anyone — no
|
||||
trust in the maintainer required. (Ed25519 row signing is the next hardening;
|
||||
the chain already makes tampering detectable.)
|
||||
|
||||
Usage:
|
||||
python ledger_tools.py seed # (re)build ledger.jsonl with genesis + baseline
|
||||
python ledger_tools.py verify # verify the whole chain -> exit 0 / 1
|
||||
python ledger_tools.py append '<json-row>' # append one scored row
|
||||
"""
|
||||
import hashlib
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
LEDGER = Path(__file__).parent / "ledger.jsonl"
|
||||
GENESIS_PREV = "0" * 64
|
||||
|
||||
|
||||
def canonical(row: dict) -> bytes:
|
||||
# Stable key order, no whitespace -> deterministic bytes for hashing.
|
||||
body = {k: row[k] for k in sorted(row) if k != "row_hash"}
|
||||
return json.dumps(body, separators=(",", ":"), sort_keys=True).encode()
|
||||
|
||||
|
||||
def row_hash(row: dict) -> str:
|
||||
return hashlib.sha256(canonical(row)).hexdigest()
|
||||
|
||||
|
||||
def read_rows() -> list[dict]:
|
||||
if not LEDGER.exists():
|
||||
return []
|
||||
return [json.loads(l) for l in LEDGER.read_text().splitlines() if l.strip()]
|
||||
|
||||
|
||||
def append(entry: dict) -> dict:
|
||||
rows = read_rows()
|
||||
prev = rows[-1]["row_hash"] if rows else GENESIS_PREV
|
||||
entry = dict(entry)
|
||||
entry["seq"] = len(rows)
|
||||
entry["prev_hash"] = prev
|
||||
entry["row_hash"] = row_hash(entry)
|
||||
with LEDGER.open("a") as f:
|
||||
f.write(json.dumps(entry, sort_keys=True) + "\n")
|
||||
return entry
|
||||
|
||||
|
||||
def verify() -> bool:
|
||||
rows = read_rows()
|
||||
prev = GENESIS_PREV
|
||||
for i, r in enumerate(rows):
|
||||
if r.get("seq") != i:
|
||||
print(f"FAIL: row {i} seq mismatch ({r.get('seq')})")
|
||||
return False
|
||||
if r.get("prev_hash") != prev:
|
||||
print(f"FAIL: row {i} prev_hash broken — ledger was edited")
|
||||
return False
|
||||
if r.get("row_hash") != row_hash(r):
|
||||
print(f"FAIL: row {i} row_hash mismatch — row was tampered")
|
||||
return False
|
||||
prev = r["row_hash"]
|
||||
print(f"OK: {len(rows)} rows, chain intact")
|
||||
return True
|
||||
|
||||
|
||||
def seed():
|
||||
"""Rebuild with the genesis row only — an EMPTY board.
|
||||
|
||||
Benchmark-first: no placeholder/hand-entered numbers ever sit on the
|
||||
leaderboard. Every result row is produced by the real scoring pipeline
|
||||
(load model -> run inference -> score against the private eval split ->
|
||||
proof hash). The board starts empty and awaits the first real harness score,
|
||||
including RuView's own — which gets no special seeding.
|
||||
"""
|
||||
if LEDGER.exists():
|
||||
LEDGER.unlink()
|
||||
append({
|
||||
"kind": "genesis",
|
||||
"benchmark": "AetherArena",
|
||||
"spec": "ADR-149",
|
||||
"note": "Official Spatial-Intelligence Benchmark — append-only signed ledger. "
|
||||
"Entries are real harness scores only; no seeded numbers.",
|
||||
"created": "2026-05-30",
|
||||
})
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
cmd = sys.argv[1] if len(sys.argv) > 1 else "verify"
|
||||
if cmd == "seed":
|
||||
seed(); verify()
|
||||
elif cmd == "verify":
|
||||
sys.exit(0 if verify() else 1)
|
||||
elif cmd == "append":
|
||||
print(json.dumps(append(json.loads(sys.argv[2])), indent=2))
|
||||
else:
|
||||
print(__doc__); sys.exit(2)
|
||||
|
|
@ -146,8 +146,9 @@ The leaderboard is only credible if its failure modes cannot be hidden. Explicit
|
|||
| Model exfiltrates / phones home the eval data | Scorer container runs with **no network, read-only eval FS, resource caps** (sandboxed) |
|
||||
| Submitter overfits the public split | **Private held-out split** — never published; scoring runs on data the submitter has never seen |
|
||||
| Model fingerprints / detects the eval set | **Seasonal rotation** of a fraction of the held-out split (mirrors ADR-120 hash rotation) |
|
||||
| Maintainer silently edits a score / rank | **Signed, append-only** Parquet results ledger — rows are immutable and verifiable |
|
||||
| Scorer version drift changes ranks invisibly | **`harness_version` pinned per row**; a scorer change forces a re-eval, not a silent re-rank |
|
||||
| Maintainer silently edits a score / rank | **Witness chain**: append-only, hash-chained ledger (`ledger/ledger_tools.py`) — each row references the prior row's hash, so any edit breaks every subsequent link and `verify` fails |
|
||||
| A score can't be reproduced / hides nondeterminism | **Witness + repeatability analysis**: each score is a witness (`inputs_sha256` binding it to the exact inputs + `proof_sha256` of the quantised result + `harness_version`); `aa_score_runner --repeat N` runs the harness N× and fails if it ever produces ≥2 distinct proof hashes |
|
||||
| Scorer version drift changes ranks invisibly | **`harness_version` pinned per witness**; a scorer change moves the proof hash and fails the CI determinism gate until regenerated + reviewed |
|
||||
| Slow model brute-forces accuracy | **Latency is a ranked axis** (p50/p95/p99) with hard caps + the `latency_factor` in `arena_score` |
|
||||
| "Gold accuracy, leaks identity" win | **Privacy is a (gated) axis**; once active, `privacy_factor` penalizes leakage in `arena_score` |
|
||||
| Malicious model artifact (RCE in the scorer) | Untrusted artifact loaded in the sandboxed container only; pinned, minimal runtime; no host mounts |
|
||||
|
|
|
|||
|
|
@ -1,74 +1,95 @@
|
|||
//! AetherArena ("AA") Deterministic Score Runner (ADR-149).
|
||||
//! AetherArena ("AA") Score Runner + Witness Chain (ADR-149).
|
||||
//!
|
||||
//! The CI-runnable entry point behind the AA harness gate: it runs the **real**
|
||||
//! `wifi-densepose-train::ruview_metrics` pose-acceptance harness against a
|
||||
//! fixed, committed synthetic fixture (seed = 42) and emits:
|
||||
//! 1. the pose metrics (PCK@0.2 all/torso, OKS, jitter, p95 error),
|
||||
//! 2. the v0 `RuViewTier`-style pose verdict, and
|
||||
//! 3. a cross-platform-stable SHA-256 **proof hash** of the quantised result.
|
||||
//! Benchmark-first scorer for the official Spatial-Intelligence Benchmark. It runs
|
||||
//! the **real** `wifi-densepose-train::ruview_metrics` pose-acceptance harness and
|
||||
//! emits a **witness record** for proof + repeatability analysis:
|
||||
//!
|
||||
//! This is the `determinism_gate` substrate from ADR-149 §2.5: the same fixture
|
||||
//! + same harness version must always produce the same hash. A PR that changes
|
||||
//! the scoring maths moves the hash and fails the gate (the `expected_score.sha256`
|
||||
//! must be regenerated and reviewed), so scorer drift can never land silently.
|
||||
//! witness = { inputs_sha256, harness_version, metrics, tier, proof_sha256 }
|
||||
//!
|
||||
//! Cross-platform portability (lesson from `calibration_proof_runner.rs`):
|
||||
//! PCK/OKS use `sqrt` (libm-sensitive: glibc/MSVC/Apple differ by ~1e-7). We
|
||||
//! never hash raw f32 — we quantise each metric to coarse fixed-point (1e-3 /
|
||||
//! 1e-4) so a 1e-7 libm wobble is invisible while a real algorithm change
|
||||
//! (>1e-3) breaks the hash. No sort, no truncation.
|
||||
//! The `proof_sha256` is a cross-platform-stable hash of the quantised score; the
|
||||
//! `inputs_sha256` binds the witness to the exact inputs it scored. Together with
|
||||
//! the append-only hash-chained ledger (`aether-arena/ledger`), every published
|
||||
//! rank traces back to a reproducible witness — the witness chain.
|
||||
//!
|
||||
//! Usage:
|
||||
//! # verify against the committed expected hash (CI gate default):
|
||||
//! Modes:
|
||||
//! # 1. Determinism self-test on the committed fixture (CI gate default):
|
||||
//! cargo run -p wifi-densepose-train --bin aa_score_runner --no-default-features
|
||||
//!
|
||||
//! # emit the score as JSON (for the leaderboard ledger row):
|
||||
//! cargo run -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --json
|
||||
//! # 2. Repeatability analysis — run K times, confirm identical proof hash:
|
||||
//! cargo run ... --bin aa_score_runner --no-default-features -- --repeat 8
|
||||
//!
|
||||
//! # regenerate the expected hash (after an intentional scorer change):
|
||||
//! cargo run -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --generate-hash \
|
||||
//! > ../aether-arena/fixtures/expected_score.sha256
|
||||
//! # 3. Real model scoring — score predictions against an eval split:
|
||||
//! cargo run ... --bin aa_score_runner --no-default-features -- \
|
||||
//! --split eval.json --pred predictions.json --json
|
||||
//!
|
||||
//! # 4. Regenerate the fixture's expected hash (after an intentional change):
|
||||
//! cargo run ... --bin aa_score_runner --no-default-features -- --generate-hash \
|
||||
//! > ../aether-arena/fixtures/expected_score.sha256
|
||||
//!
|
||||
//! Input JSON (split = private ground truth; pred = the submitted model's output):
|
||||
//! split.json : {"frames":[{"gt":[[x,y]*17],"vis":[v*17],"scale":1.0}, ...]}
|
||||
//! pred.json : {"frames":[{"pred":[[x,y]*17]}, ...]} (index-aligned with split)
|
||||
//!
|
||||
//! Determinism discipline (lesson from calibration_proof_runner.rs): PCK/OKS use
|
||||
//! libm `sqrt` which differs ~1e-7 across glibc/MSVC/Apple — so we hash only the
|
||||
//! quantised metrics (1e-3 / 1e-4), never raw f32. No sort, no truncation.
|
||||
|
||||
use std::env;
|
||||
use std::process::ExitCode;
|
||||
|
||||
use ndarray::{Array1, Array2};
|
||||
use serde::Deserialize;
|
||||
use sha2::{Digest, Sha256};
|
||||
use wifi_densepose_train::ruview_metrics::{
|
||||
evaluate_joint_error, JointErrorResult, JointErrorThresholds,
|
||||
};
|
||||
|
||||
/// Bump when the fixture or canonical hash form changes on purpose. Pinned into
|
||||
/// the proof so a `harness_version` change forces a re-score (ADR-149 §2.4).
|
||||
const AA_HARNESS_VERSION: u32 = 1;
|
||||
/// Bump on a purposeful fixture/canonical-form change. Pinned into every witness
|
||||
/// so a `harness_version` change forces a re-score (ADR-149 §2.4).
|
||||
const AA_HARNESS_VERSION: u32 = 2;
|
||||
|
||||
/// Fixture size — fixed so the hash is stable.
|
||||
const N_FRAMES: usize = 120;
|
||||
const N_KPTS: usize = 17;
|
||||
|
||||
/// Deterministic, libm-free LCG (Numerical Recipes constants) → u32 → f32 in [0,1).
|
||||
// ── input schema ────────────────────────────────────────────────────────────
|
||||
#[derive(Deserialize)]
|
||||
struct SplitFile {
|
||||
frames: Vec<SplitFrame>,
|
||||
}
|
||||
#[derive(Deserialize)]
|
||||
struct SplitFrame {
|
||||
gt: Vec<[f32; 2]>,
|
||||
vis: Vec<f32>,
|
||||
#[serde(default = "one")]
|
||||
scale: f32,
|
||||
}
|
||||
#[derive(Deserialize)]
|
||||
struct PredFile {
|
||||
frames: Vec<PredFrame>,
|
||||
}
|
||||
#[derive(Deserialize)]
|
||||
struct PredFrame {
|
||||
pred: Vec<[f32; 2]>,
|
||||
}
|
||||
fn one() -> f32 {
|
||||
1.0
|
||||
}
|
||||
|
||||
// ── deterministic fixture (libm-free LCG) ─────────────────────────────────────
|
||||
struct Lcg(u64);
|
||||
impl Lcg {
|
||||
fn next_u32(&mut self) -> u32 {
|
||||
self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
|
||||
(self.0 >> 32) as u32
|
||||
}
|
||||
/// Uniform f32 in [0,1) at 1e-6 granularity — no float math in the generator.
|
||||
fn unit(&mut self) -> f32 {
|
||||
(self.next_u32() % 1_000_000) as f32 / 1_000_000.0
|
||||
}
|
||||
}
|
||||
|
||||
/// Build the canonical fixture: ground-truth keypoints in [0.2,0.8] and
|
||||
/// predictions = GT + a small, deterministic offset, so PCK/OKS land in a
|
||||
/// stable mid-high band (not trivially 0 or 1). Identical on every platform.
|
||||
fn build_fixture() -> (Vec<Array2<f32>>, Vec<Array2<f32>>, Vec<Array1<f32>>, Vec<f32>) {
|
||||
let mut rng = Lcg(42);
|
||||
let mut gt = Vec::with_capacity(N_FRAMES);
|
||||
let mut pred = Vec::with_capacity(N_FRAMES);
|
||||
let mut vis = Vec::with_capacity(N_FRAMES);
|
||||
let mut scale = Vec::with_capacity(N_FRAMES);
|
||||
|
||||
let (mut pred, mut gt, mut vis, mut scale) = (vec![], vec![], vec![], vec![]);
|
||||
for _ in 0..N_FRAMES {
|
||||
let mut g = Array2::<f32>::zeros((N_KPTS, 2));
|
||||
let mut p = Array2::<f32>::zeros((N_KPTS, 2));
|
||||
|
|
@ -76,15 +97,12 @@ fn build_fixture() -> (Vec<Array2<f32>>, Vec<Array2<f32>>, Vec<Array1<f32>>, Vec
|
|||
for k in 0..N_KPTS {
|
||||
let gx = 0.2 + 0.6 * rng.unit();
|
||||
let gy = 0.2 + 0.6 * rng.unit();
|
||||
// Deterministic prediction offset: small for most kpts, larger for a
|
||||
// few, so PCK is a believable fraction (~0.6-0.8) rather than 1.0.
|
||||
let ox = (rng.unit() - 0.5) * 0.06;
|
||||
let oy = (rng.unit() - 0.5) * 0.06;
|
||||
g[[k, 0]] = gx;
|
||||
g[[k, 1]] = gy;
|
||||
p[[k, 0]] = (gx + ox).clamp(0.0, 1.0);
|
||||
p[[k, 1]] = (gy + oy).clamp(0.0, 1.0);
|
||||
// Occlude ~10% deterministically.
|
||||
if rng.next_u32() % 10 == 0 {
|
||||
v[k] = 0.0;
|
||||
}
|
||||
|
|
@ -97,13 +115,53 @@ fn build_fixture() -> (Vec<Array2<f32>>, Vec<Array2<f32>>, Vec<Array1<f32>>, Vec
|
|||
(pred, gt, vis, scale)
|
||||
}
|
||||
|
||||
/// Canonical, libm-stable byte form of the result for hashing.
|
||||
/// Each metric → coarse fixed-point so ~1e-7 platform noise can't flip the hash.
|
||||
/// Load (pred, gt, vis, scale) from index-aligned split + prediction files.
|
||||
fn load_inputs(
|
||||
split_path: &str,
|
||||
pred_path: &str,
|
||||
) -> Result<(Vec<Array2<f32>>, Vec<Array2<f32>>, Vec<Array1<f32>>, Vec<f32>), String> {
|
||||
let split: SplitFile = serde_json::from_str(
|
||||
&std::fs::read_to_string(split_path).map_err(|e| format!("read split: {e}"))?,
|
||||
)
|
||||
.map_err(|e| format!("parse split: {e}"))?;
|
||||
let pred: PredFile = serde_json::from_str(
|
||||
&std::fs::read_to_string(pred_path).map_err(|e| format!("read pred: {e}"))?,
|
||||
)
|
||||
.map_err(|e| format!("parse pred: {e}"))?;
|
||||
if split.frames.len() != pred.frames.len() {
|
||||
return Err(format!(
|
||||
"frame count mismatch: split={} pred={}",
|
||||
split.frames.len(),
|
||||
pred.frames.len()
|
||||
));
|
||||
}
|
||||
let (mut gt, mut pr, mut vis, mut scale) = (vec![], vec![], vec![], vec![]);
|
||||
for (i, (s, p)) in split.frames.iter().zip(pred.frames.iter()).enumerate() {
|
||||
let to_arr = |kps: &[[f32; 2]]| -> Result<Array2<f32>, String> {
|
||||
if kps.len() != N_KPTS {
|
||||
return Err(format!("frame {i}: expected {N_KPTS} keypoints, got {}", kps.len()));
|
||||
}
|
||||
let mut a = Array2::<f32>::zeros((N_KPTS, 2));
|
||||
for (k, xy) in kps.iter().enumerate() {
|
||||
a[[k, 0]] = xy[0];
|
||||
a[[k, 1]] = xy[1];
|
||||
}
|
||||
Ok(a)
|
||||
};
|
||||
gt.push(to_arr(&s.gt)?);
|
||||
pr.push(to_arr(&p.pred)?);
|
||||
vis.push(Array1::from(s.vis.clone()));
|
||||
scale.push(s.scale);
|
||||
}
|
||||
Ok((pr, gt, vis, scale))
|
||||
}
|
||||
|
||||
/// Canonical, libm-stable byte form of the score for the proof hash.
|
||||
fn canonical_bytes(r: &JointErrorResult) -> Vec<u8> {
|
||||
let mut b = Vec::new();
|
||||
b.extend_from_slice(b"AA-SCORE-v0");
|
||||
b.extend_from_slice(&AA_HARNESS_VERSION.to_le_bytes());
|
||||
let q = |x: f32, scale: f32| -> u32 { (x.max(0.0) * scale).round() as u32 };
|
||||
let q = |x: f32, s: f32| -> u32 { (x.max(0.0) * s).round() as u32 };
|
||||
b.extend_from_slice(&q(r.pck_all, 1e3).to_le_bytes());
|
||||
b.extend_from_slice(&q(r.pck_torso, 1e3).to_le_bytes());
|
||||
b.extend_from_slice(&q(r.oks, 1e3).to_le_bytes());
|
||||
|
|
@ -113,56 +171,132 @@ fn canonical_bytes(r: &JointErrorResult) -> Vec<u8> {
|
|||
b
|
||||
}
|
||||
|
||||
fn hash_hex(bytes: &[u8]) -> String {
|
||||
fn sha256_hex(bytes: &[u8]) -> String {
|
||||
let mut h = Sha256::new();
|
||||
h.update(bytes);
|
||||
h.finalize().iter().map(|x| format!("{x:02x}")).collect()
|
||||
}
|
||||
|
||||
/// Bind the witness to its exact inputs: hash the quantised gt+pred+vis bytes.
|
||||
fn inputs_hash(
|
||||
pred: &[Array2<f32>],
|
||||
gt: &[Array2<f32>],
|
||||
vis: &[Array1<f32>],
|
||||
) -> String {
|
||||
let mut h = Sha256::new();
|
||||
h.update(b"AA-INPUTS-v0");
|
||||
h.update((pred.len() as u32).to_le_bytes());
|
||||
let q = |x: f32| -> i32 { (x * 1e4).round() as i32 };
|
||||
for f in 0..gt.len() {
|
||||
for k in 0..N_KPTS {
|
||||
h.update(q(gt[f][[k, 0]]).to_le_bytes());
|
||||
h.update(q(gt[f][[k, 1]]).to_le_bytes());
|
||||
h.update(q(pred[f][[k, 0]]).to_le_bytes());
|
||||
h.update(q(pred[f][[k, 1]]).to_le_bytes());
|
||||
h.update([(vis[f][k] >= 0.5) as u8]);
|
||||
}
|
||||
}
|
||||
h.finalize().iter().map(|x| format!("{x:02x}")).collect()
|
||||
}
|
||||
|
||||
struct Witness {
|
||||
inputs_sha256: String,
|
||||
proof_sha256: String,
|
||||
result: JointErrorResult,
|
||||
}
|
||||
|
||||
fn score(
|
||||
pred: &[Array2<f32>],
|
||||
gt: &[Array2<f32>],
|
||||
vis: &[Array1<f32>],
|
||||
scale: &[f32],
|
||||
) -> Witness {
|
||||
let result = evaluate_joint_error(pred, gt, vis, scale, &JointErrorThresholds::default());
|
||||
Witness {
|
||||
inputs_sha256: inputs_hash(pred, gt, vis),
|
||||
proof_sha256: sha256_hex(&canonical_bytes(&result)),
|
||||
result,
|
||||
}
|
||||
}
|
||||
|
||||
fn witness_json(w: &Witness) -> String {
|
||||
format!(
|
||||
"{{\"category\":\"pose\",\"harness_version\":{},\"inputs_sha256\":\"{}\",\"proof_sha256\":\"{}\",\"pck_all\":{:.4},\"pck_torso\":{:.4},\"oks\":{:.4},\"jitter_rms_m\":{:.5},\"max_error_p95_m\":{:.5},\"pose_passes\":{}}}",
|
||||
AA_HARNESS_VERSION, w.inputs_sha256, w.proof_sha256,
|
||||
w.result.pck_all, w.result.pck_torso, w.result.oks,
|
||||
w.result.jitter_rms_m, w.result.max_error_p95_m, w.result.passes
|
||||
)
|
||||
}
|
||||
|
||||
fn arg_val<'a>(args: &'a [String], key: &str) -> Option<&'a str> {
|
||||
args.iter().position(|a| a == key).and_then(|i| args.get(i + 1)).map(|s| s.as_str())
|
||||
}
|
||||
|
||||
fn main() -> ExitCode {
|
||||
let args: Vec<String> = env::args().collect();
|
||||
let mode_json = args.iter().any(|a| a == "--json");
|
||||
let mode_gen = args.iter().any(|a| a == "--generate-hash");
|
||||
let repeat: usize = arg_val(&args, "--repeat").and_then(|v| v.parse().ok()).unwrap_or(0);
|
||||
|
||||
let (pred, gt, vis, scale) = build_fixture();
|
||||
let result = evaluate_joint_error(&pred, >, &vis, &scale, &JointErrorThresholds::default());
|
||||
let proof = hash_hex(&canonical_bytes(&result));
|
||||
// Inputs: real split+pred if provided, else the deterministic fixture.
|
||||
let (pred, gt, vis, scale) = match (arg_val(&args, "--split"), arg_val(&args, "--pred")) {
|
||||
(Some(s), Some(p)) => match load_inputs(s, p) {
|
||||
Ok(v) => v,
|
||||
Err(e) => {
|
||||
eprintln!("input error: {e}");
|
||||
return ExitCode::FAILURE;
|
||||
}
|
||||
},
|
||||
_ => build_fixture(),
|
||||
};
|
||||
|
||||
let w = score(&pred, >, &vis, &scale);
|
||||
|
||||
// ── Repeatability analysis: run K times, confirm an identical proof hash ──
|
||||
if repeat > 0 {
|
||||
let mut hashes = std::collections::BTreeSet::new();
|
||||
for _ in 0..repeat {
|
||||
let wi = score(&pred, >, &vis, &scale);
|
||||
hashes.insert(wi.proof_sha256);
|
||||
}
|
||||
let repeatable = hashes.len() == 1;
|
||||
println!(
|
||||
"{{\"repeatability\":{{\"runs\":{},\"unique_proof_hashes\":{},\"repeatable\":{},\"proof_sha256\":\"{}\"}}}}",
|
||||
repeat, hashes.len(), repeatable, w.proof_sha256
|
||||
);
|
||||
return if repeatable { ExitCode::SUCCESS } else {
|
||||
eprintln!("REPEATABILITY FAIL: {} distinct hashes across {} runs (nondeterminism)", hashes.len(), repeat);
|
||||
ExitCode::FAILURE
|
||||
};
|
||||
}
|
||||
|
||||
if mode_gen {
|
||||
// Emit just the hash (stdout) for redirection into expected_score.sha256.
|
||||
println!("{proof}");
|
||||
println!("{}", w.proof_sha256);
|
||||
return ExitCode::SUCCESS;
|
||||
}
|
||||
|
||||
if mode_json {
|
||||
// One leaderboard-ledger-shaped row (ADR-149 §2.2).
|
||||
println!(
|
||||
"{{\"category\":\"pose\",\"harness_version\":{},\"pck_all\":{:.4},\"pck_torso\":{:.4},\"oks\":{:.4},\"jitter_rms_m\":{:.5},\"max_error_p95_m\":{:.5},\"pose_passes\":{},\"proof_sha256\":\"{}\"}}",
|
||||
AA_HARNESS_VERSION,
|
||||
result.pck_all, result.pck_torso, result.oks,
|
||||
result.jitter_rms_m, result.max_error_p95_m, result.passes, proof
|
||||
);
|
||||
println!("{}", witness_json(&w));
|
||||
return ExitCode::SUCCESS;
|
||||
}
|
||||
|
||||
// Default: verify against the committed expected hash (CI gate).
|
||||
// Default: determinism gate against the committed expected hash (CI).
|
||||
println!(
|
||||
"AA pose witness: PCK_all={:.4} PCK_torso={:.4} OKS={:.4} jitter={:.5}m p95={:.5}m passes={}",
|
||||
w.result.pck_all, w.result.pck_torso, w.result.oks,
|
||||
w.result.jitter_rms_m, w.result.max_error_p95_m, w.result.passes
|
||||
);
|
||||
println!("AA inputs_sha256: {}", w.inputs_sha256);
|
||||
println!("AA proof_sha256: {}", w.proof_sha256);
|
||||
|
||||
let expected_path = concat!(env!("CARGO_MANIFEST_DIR"), "/../../../aether-arena/fixtures/expected_score.sha256");
|
||||
let expected = std::fs::read_to_string(expected_path)
|
||||
.ok()
|
||||
.map(|s| s.trim().to_string());
|
||||
|
||||
println!("AA pose score: PCK_all={:.4} PCK_torso={:.4} OKS={:.4} jitter={:.5}m p95={:.5}m passes={}",
|
||||
result.pck_all, result.pck_torso, result.oks, result.jitter_rms_m, result.max_error_p95_m, result.passes);
|
||||
println!("AA proof sha256: {proof}");
|
||||
|
||||
match expected {
|
||||
Some(exp) if exp == proof => {
|
||||
match std::fs::read_to_string(expected_path).ok().map(|s| s.trim().to_string()) {
|
||||
Some(exp) if exp == w.proof_sha256 => {
|
||||
println!("VERDICT: PASS (determinism hash matches expected)");
|
||||
ExitCode::SUCCESS
|
||||
}
|
||||
Some(exp) => {
|
||||
eprintln!("VERDICT: FAIL — scorer drift detected.\n expected: {exp}\n actual: {proof}");
|
||||
eprintln!("If this change to the scoring maths is intentional, regenerate with --generate-hash and review the diff.");
|
||||
eprintln!("VERDICT: FAIL — scorer drift.\n expected: {exp}\n actual: {}", w.proof_sha256);
|
||||
eprintln!("If intentional, regenerate with --generate-hash and review the diff.");
|
||||
ExitCode::FAILURE
|
||||
}
|
||||
None => {
|
||||
|
|
|
|||
Loading…
Reference in New Issue