wifi-densepose/aether-arena
ruv 483bfa4660 feat(aether-arena): benchmark-first scorer + witness chain + repeatability (M2/M5/M7)
Per direction "remove the initial number, optimize for benchmark first" + "include
witness chain capabilities for proof and repeatability analysis":

- Empty board, no seeded numbers: ledger seeds to genesis only. Every result is a
  real scoring-pipeline witness; RuView gets no hand-entered baseline.
- Real model scoring: aa_score_runner now loads predictions + an eval split
  (--split/--pred) and scores them through the real ruview_metrics pose harness —
  not just a synthetic fixture. Committed public smoke split (fixtures/smoke_*.json).
- Witness chain: each score emits a witness = inputs_sha256 (binds it to the exact
  inputs) + proof_sha256 (cross-platform-stable score hash) + harness_version.
- Repeatability analysis: --repeat N runs the harness N× and fails if it ever
  yields >=2 distinct proof hashes (16/16 identical locally).
- Witness ledger: ledger/ledger_tools.py — append-only, hash-chained, tamper-
  evident (seed/append/verify); editing any past row breaks the chain.
- CI gate extended: determinism + repeatability(16) + real-scoring smoke + ledger
  chain verify on every PR.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-30 16:59:11 -04:00
fixtures feat(aether-arena): benchmark-first scorer + witness chain + repeatability (M2/M5/M7) 2026-05-30 16:59:11 -04:00
ledger feat(aether-arena): benchmark-first scorer + witness chain + repeatability (M2/M5/M7) 2026-05-30 16:59:11 -04:00
schema feat(aether-arena): ADR-149 spatial-intelligence benchmark — scorer + CI harness gate (M1-M4) 2026-05-30 16:47:22 -04:00
README.md feat(aether-arena): ADR-149 spatial-intelligence benchmark — scorer + CI harness gate (M1-M4) 2026-05-30 16:47:22 -04:00
STATUS.md feat(aether-arena): benchmark-first scorer + witness chain + repeatability (M2/M5/M7) 2026-05-30 16:59:11 -04:00
VERIFY.md feat(aether-arena): benchmark-first scorer + witness chain + repeatability (M2/M5/M7) 2026-05-30 16:59:11 -04:00

README.md

AetherArena ("AA") — The Official Spatial-Intelligence Benchmark

Public leaderboard. Private evaluation split. Open scorer. Signed results.

AetherArena is a standalone, project-agnostic benchmark for camera-free spatial intelligence — pose, presence, occupancy, tracking, and vitals from RF/WiFi (and, over time, mmWave / UWB / radar / lidar / multimodal). It is not a single-vendor leaderboard: any team, framework, or sensing modality can enter, and every entrant — including the RuView baseline that donated the seed scorer — is scored by the identical, open, pinned harness.

Specified in ADR-149 (Accepted).

Canonical home: ruvnet/aether-arena + a Hugging Face Space (deploy pending — see STATUS).


Why

WiFi/RF spatial sensing has no shared yardstick — papers self-report against inconsistent splits and metrics, with no accounting for latency, reproducibility, or privacy leakage. AA fixes the measurement, not just the models: a single deterministic scorer, a private held-out split nobody can train on, and a signed result ledger that can't be silently edited.

What gets measured (v0)

| Category | Metric | Status |
|---|---|---|
| Pose | PCK@0.2 (all / torso), OKS | Ranked |
| Presence | accuracy, FP/FN | Ranked |
| Edge latency | p50 / p95 / p99 ms | Ranked |
| Determinism | proof-hash pass/fail | Ranked (gate) |
| Tracking | (MOTA) | activates when multi-person clips land |
| Vitals | (BPM err) | activates when paired vitals ground truth lands |
| Privacy leakage | membership-inference ∈ [0,1] | gated; not ranked until the attacker ships |
| Cross-room | degradation ratio | coming soon |
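
For readers unfamiliar with the headline pose metric, here is a minimal sketch of PCK@0.2 (percentage of correct keypoints). It is not the ruview_metrics implementation: the function name `pck`, the 2D keypoint layout, and the caller-supplied normalization length `norm` are assumptions for illustration, and the real harness may normalize differently (for example by torso size or bounding box).

```python
import math


def pck(pred, truth, norm, threshold=0.2):
    """Fraction of keypoints whose error is at most threshold * norm.

    pred, truth: equal-length lists of (x, y) keypoints.
    norm: reference length for normalization (assumed to be supplied by the caller).
    """
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and the same length")
    limit = threshold * norm
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(pred, truth)
        if math.hypot(px - tx, py - ty) <= limit
    )
    return hits / len(truth)


# Four keypoints with errors 0.05, 0.1, 0.5 and 0.3 against a unit norm.
truth = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
pred = [(0.05, 0.0), (1.0, 0.1), (0.0, 1.5), (1.3, 1.0)]
score = pck(pred, truth, norm=1.0)  # two of four keypoints fall within 0.2
```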

The headline rank is the category metric. An optional arena_score = quality × latency_factor × privacy_factor × determinism_gate is reported alongside it, never in place of it, so that accuracy cannot win at any cost to latency, privacy, or determinism. See ADR-149 §2.5.
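
The product form of arena_score can be sketched as follows. Only the four factor names come from the formula above. The [0, 1] range for each factor and the binary 0/1 determinism gate are assumptions; ADR-149 §2.5 is the authority on how each factor is actually derived.

```python
def arena_score(quality, latency_factor, privacy_factor, deterministic):
    """Multiply the four factors; a failed determinism gate zeroes the score.

    Assumption: each factor is already normalized to [0, 1].
    """
    factors = {
        "quality": quality,
        "latency_factor": latency_factor,
        "privacy_factor": privacy_factor,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    determinism_gate = 1.0 if deterministic else 0.0  # assumed binary gate
    return quality * latency_factor * privacy_factor * determinism_gate


accurate_but_slow = arena_score(0.8, 0.5, 1.0, deterministic=True)
non_deterministic = arena_score(0.9, 1.0, 1.0, deterministic=False)
```

Because the factors multiply, a weak latency or privacy factor drags down even a high-quality entry, and under the binary-gate assumption a non-reproducible entry scores zero.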

How scoring works

The scorer is RuView's already-published wifi-densepose-train acceptance harness (ruview_metrics + ADR-145 ablation), run in a pinned sandbox. You submit a model, not predictions — predictions on data you hold prove nothing. Your model is scored against a private MM-Fi held-out split (CC BY-NC 4.0; Wi-Pose excluded for redistribution reasons), and one signed, append-only row is written to the results ledger with a determinism proof hash.
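
To illustrate why an append-only, hash-chained ledger is tamper-evident, here is a minimal sketch. It is not the contents of ledger/ledger_tools.py: the row layout, field names, and genesis payload are assumptions, and cryptographic signing of rows is left out. Only the seed/append/verify operation names mirror the commit description.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # assumed placeholder for "no previous row"


def _row_hash(prev_hash, payload):
    # Canonical JSON so the same row always hashes to the same digest.
    body = json.dumps(
        {"prev": prev_hash, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def seed():
    """Start a ledger that contains only a genesis row."""
    payload = {"kind": "genesis"}
    return [{"prev": GENESIS_PREV, "payload": payload,
             "hash": _row_hash(GENESIS_PREV, payload)}]


def append(ledger, payload):
    """Add a row whose hash covers the previous row's hash."""
    prev = ledger[-1]["hash"]
    ledger.append({"prev": prev, "payload": payload,
                   "hash": _row_hash(prev, payload)})


def verify(ledger):
    """Recompute every hash and check each link back to genesis."""
    prev = GENESIS_PREV
    for row in ledger:
        if row["prev"] != prev:
            return False
        if row["hash"] != _row_hash(row["prev"], row["payload"]):
            return False
        prev = row["hash"]
    return True


ledger = seed()
append(ledger, {"entrant": "example-a", "pck_all": 0.5})
append(ledger, {"entrant": "example-b", "pck_all": 0.6})
ok_before = verify(ledger)

ledger[1]["payload"]["pck_all"] = 0.99  # silently edit a past row
ok_after = verify(ledger)
```

Editing the earlier row changes its recomputed hash, so verification fails at that row. Rewriting its stored hash as well would break the link from the next row instead.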

Submission lifecycle: submitted → validated → quarantined → smoke_scored → full_scored → published (or rejected with a reason). The model only ever runs inside a no-network, read-only-FS sandbox.
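
The lifecycle above reads naturally as a small state machine. The sketch below is illustrative only: the `State` enum and the `advance` and `reject` helpers are hypothetical names, and it assumes a submission may be rejected from any non-terminal state, which the text does not spell out.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    VALIDATED = "validated"
    QUARANTINED = "quarantined"
    SMOKE_SCORED = "smoke_scored"
    FULL_SCORED = "full_scored"
    PUBLISHED = "published"
    REJECTED = "rejected"


# The forward path, in the order given in the text.
HAPPY_PATH = [
    State.SUBMITTED,
    State.VALIDATED,
    State.QUARANTINED,
    State.SMOKE_SCORED,
    State.FULL_SCORED,
    State.PUBLISHED,
]
TERMINAL = {State.PUBLISHED, State.REJECTED}


def advance(state):
    """Move one step forward; terminal states cannot advance."""
    if state in TERMINAL:
        raise ValueError(f"{state.value} is terminal")
    return HAPPY_PATH[HAPPY_PATH.index(state) + 1]


def reject(state, reason):
    """Reject a non-terminal submission; a reason is mandatory."""
    if state in TERMINAL:
        raise ValueError(f"{state.value} is terminal")
    if not reason.strip():
        raise ValueError("a rejection must carry a reason")
    return State.REJECTED, reason


state = State.SUBMITTED
trail = [state.value]
while state not in TERMINAL:
    state = advance(state)
    trail.append(state.value)
```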

Submit (when the Space is live)

  1. Write a manifest: schema/aa-submission.toml.
  2. Push your model artifact (.safetensors / .rvf / LoRA adapter) + manifest to the Space.
  3. Watch it move through the lifecycle; your signed row appears on the board.

Verify it's fair (you don't have to trust us)

See VERIFY.md — run the open scorer locally on the public smoke split, reproduce the determinism hash, and confirm RuView's own entries were scored by the identical path. That five-step check is the launch gate (ADR-149 §7).
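
As a rough picture of what "reproduce the determinism hash" involves, the sketch below hashes a canonical form of the scores and checks that repeated runs agree. It is not the aa_score_runner code: `proof_hash`, `witness`, and `check_repeatable` are hypothetical helpers, and rounding scores before hashing is an assumed way to keep the hash stable across platforms. The witness field names (inputs_sha256, proof_sha256, harness_version) follow the commit description.

```python
import hashlib
import json


def proof_hash(scores, ndigits=6):
    """SHA-256 over rounded scores in canonical JSON (rounding is an assumption)."""
    rounded = {name: round(value, ndigits) for name, value in scores.items()}
    canonical = json.dumps(rounded, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def witness(input_bytes, scores, harness_version):
    """Bind a score to its exact inputs and to the harness version."""
    return {
        "inputs_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "proof_sha256": proof_hash(scores),
        "harness_version": harness_version,
    }


def check_repeatable(run_harness, n):
    """Run the harness n times; pass only if every run yields the same hash."""
    hashes = {proof_hash(run_harness()) for _ in range(n)}
    return len(hashes) == 1, hashes


def stable_harness():
    # Stand-in for a real scoring run over a fixed split and predictions.
    return {"pck_all": 0.5, "pck_torso": 0.75}


stable_ok, stable_hashes = check_repeatable(stable_harness, 16)
example_witness = witness(b"split-bytes|pred-bytes", stable_harness(), "v0-example")
```

A harness that returned even slightly different scores on one of the runs would produce two distinct hashes and fail the check.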

Neutrality

AA is a neutral commons. The scorer is open and versioned; any metric change is a public harness_version bump that re-scores all entries. RuView donated the seed harness and enters as one baseline — it gets no special treatment (ADR-149 §2.8).