wifi-densepose/aether-arena/VERIFY.md

4.1 KiB
Raw Permalink Blame History

Verifying AetherArena (you don't have to trust us)

AA's credibility rests on a stranger being able to reproduce a score and see that the rules are fair. This is the launch gate (ADR-149 §7): v0 does not ship until all five checks below pass for someone with no insider access.

Wider context: this page covers the leaderboard scorer. For the whole-platform answer to "is this real / does it actually work?" — including the deterministic pipeline proof, the published models + public-benchmark numbers, and the built-in-public development trail — see docs/proof-of-capabilities.md.

The open scorer

The scoring engine is a pure-Rust, GPU-free binary: aa_score_runner in wifi-densepose-train. It runs the real ruview_metrics pose-acceptance harness on a fixed fixture and emits a cross-platform-stable SHA-256 determinism proof.

Reproduce the determinism hash locally

cd v2
# Verify the committed expected hash still matches (this is the CI gate):
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features
# → prints the witness (inputs_sha256 + proof_sha256) and "VERDICT: PASS"

# See the witness row as JSON:
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --json

Witness chain — proof + repeatability analysis

Every score is a witness: inputs_sha256 (binds it to the exact inputs scored)

  • proof_sha256 (cross-platform-stable hash of the quantised score) + harness_version. Witnesses are recorded in an append-only, hash-chained ledger (each row references the previous row's hash), so a silent edit to any past row breaks the chain.
# Repeatability: run the scorer K times, confirm ONE identical proof hash:
cd v2
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --repeat 16
# → {"repeatability":{"runs":16,"unique_proof_hashes":1,"repeatable":true,...}}

# Real model scoring (score predictions against an eval split):
cargo run -q -p wifi-densepose-train --bin aa_score_runner --no-default-features -- \
  --split ../aether-arena/fixtures/smoke_split.json \
  --pred  ../aether-arena/fixtures/smoke_pred.json --json

# Verify the witness ledger chain is intact (tamper-evident):
cd ../aether-arena/ledger && python3 ledger_tools.py verify
# → "OK: N rows, chain intact"   (edit any row and it reports the broken link)

The expected hash is committed at fixtures/expected_score.sha256. Same harness version + same fixture → same hash on glibc / MSVC / Apple. If your local run prints VERDICT: PASS, you have reproduced the scorer.

What happens if the scoring maths changes

Any edit to ruview_metrics.rs, ablation.rs, or aa_score_runner.rs moves the hash and fails the CI gate (.github/workflows/aether-arena-harness.yml) until the maintainer regenerates and reviews:

cargo run -p wifi-densepose-train --bin aa_score_runner --no-default-features -- --generate-hash \
  > aether-arena/fixtures/expected_score.sha256

So a scorer change is always a reviewed, public diff — never silent. That's harness_version pinning + determinism_gate in action (ADR-149 §2.4§2.5).

The five-step acceptance test (v0 launch gate)

A stranger must be able to:

  1. Submit a model (artifact + schema/aa-submission.toml) with no insider help.
  2. Get a deterministic score — same model + same harness_version → same numbers.
  3. See the signed row appended to the public results ledger.
  4. Rerun the scorer locally on the public smoke split and reproduce the logic (the command above).
  5. Understand why the rank is fair — private split, open scorer, pinned version, proof hash — from these docs alone.

If any step fails, v0 is not ready.

Current status

  • Step 4 (rerun the open scorer locally, reproduce the hash) — works today via aa_score_runner.
  • CI harness gate runs the scorer on every PR.
  • Steps 13, 5 (HF Space submission flow + signed ledger) — in progress; require the HF Space deploy (needs an HF token / maintainer authorization).