wifi-densepose/docs/adr/ADR-161-homecore-server-lay...

14 KiB
Raw Blame History

ADR-161: HOMECORE Server Layer — WebSocket Auth Bypass, Reply-Theater & Documented-but-No-Op Automation (Security & Honest Labeling)

  • Status: accepted
  • Date: 2026-06-12
  • Deciders: ruv
  • Tags: homecore, http-ws-boundary, websocket-auth-bypass, security, automation-engine, documented-no-op, prove-everything, soundness, honest-labeling
  • Amends: ADR-130 (HOMECORE-API WS protocol), ADR-129 (HOMECORE-AUTO automation engine), ADR-128 (plugin manifest)

Context

Beyond-SOTA sweep Milestone 7, over the HOMECORE server/network layer crates only — homecore-api, homecore-server, homecore-automation, homecore-hap, homecore-plugins — executed under the project's prove-everything / anti-"AI-slop" directive.

Headline — the library cores are real, but the network boundary was unsound

The same audit pattern as ADR-160 held for the library logic: the automation trigger/condition/template/action evaluators, the REST handlers, the HAP mapping, and the plugin manifest parser are real, tested code — not stubs. That is the anti-slop positive and it is cited here as such.

What the audit found was not fake business logic but an unsound trust boundary plus documented-but-no-op features:

  1. A CRITICAL WebSocket authentication bypass — the WS handshake accepted any non-empty token, ignoring the provisioned token whitelist the REST path enforces.
  2. Reply-theater — WS command responses were computed, then logged and discarded; no result/pong/event ever reached the client.
  3. Documented-but-idle automation — the engine was constructed and dropped (never started); time triggers, RunMode, Choose branches, and template conditions were each documented as working but were no-ops in the live path.

This is a worse class than ADR-160's over-naming: here the doc claimed a capability the code did not deliver (auth enforcement, reply transport, running automations). The fix is implement where feasible, honestly relabel where not — never leave a false doc. Every fix is pinned by a test that fails on the old code.

Grading vocabulary (ADR-152 / ADR-158 / ADR-160):

  • MEASURED — reproduced in this worktree, command + failing-on-old test recorded.
  • NO-ACTION (already-honest/already-hardened) — audited, found correct, cited as a positive.
  • ACCEPTED-FUTURE — deliberately deferred, nothing dropped.

Decision — Fixes Landed

§A1 — WebSocket auth bypass (CRITICAL, security) — MEASURED

homecore-api/src/ws.rs handshake checked only token.trim().is_empty() and sent auth_ok for any non-empty token. It never called state.tokens().is_valid() — the check the REST path uses via auth::BearerAuth. With a provisioned HOMECORE_TOKENS whitelist, any attacker-chosen non-empty token got full WS access (read all states, call any service, subscribe to all events).

Real fix: the handshake now calls state.tokens().is_valid(&token).await (the same store + method as REST). A wrong token receives auth_invalid and the socket closes. DEV (allow_any) mode still accepts any non-empty bearer with a warn, so smoke tests keep working; the empty token is rejected inside is_valid.

Failing-on-old test (tests/ws_handshake.rs): wrong_token_is_rejected — provisions a real (non-dev) store with one good token, sends a DIFFERENT non-empty token over the WS handshake, asserts auth_invalid. On the old source the client received {"type":"auth_ok",…} (verified: the test panics on old ws.rs with left: "auth_ok", right: "auth_invalid"). Companion: correct_token_is_accepted. Grade: MEASURED. This is the milestone headline.

§A2 — WS replies never transmitted (HIGH, functional) — MEASURED

ws.rs::Connection::run moved the socket into a recv-only task; the only consumer of the response mpsc just did debug!("ws emit: {msg}") and dropped every message. No command reply ever reached the wire.

Real fix: the socket is split with futures_util::StreamExt::split. A dedicated writer task drains the response channel onto sink.send(...) (text frames; a __pong:<n> sentinel maps to a Pong control frame); the reader task parses commands concurrently. On reader exit the senders drop and the writer task ends cleanly.

Failing-on-old tests: result_reply_is_received (connect → auth → get_states → assert a result reply is RECEIVED within 5s) and ping_pong_reply_is_received. Both time out on the old source (verified: Elapsed panic). Grade: MEASURED.

§A8 — homecore-api bin: no env-token path, network-exposed (HIGH, security) — MEASURED

homecore-api/src/bin/server.rs bound 0.0.0.0:8123 with SharedState::new()allow_any_non_empty() and no HOMECORE_TOKENS path (unlike homecore-server), so a provisioned operator had no way to lock it down.

Real fix: the bin now mirrors homecore-server's provisioning — prefer the HOMECORE_TOKENS whitelist (LongLivedTokenStore::from_env()), fall back to an explicitly warn-logged DEV mode only when unset. It also defaults the bind address to 127.0.0.1 (loopback) so a bare cargo run is not network-exposed, with HOMECORE_BIND to opt into LAN.

Failing-on-old test (tests/server_bin_auth.rs): provisioned_bin_rejects_wrong_bearer reproduces the bin's exact provisioning path (a populated, non-dev store) and asserts a wrong bearer → 401; from_env_path_enforces_whitelist proves from_env() is not dev mode and enforces the list. The old bin's allow_any_non_empty() accepted the wrong bearer. Grade: MEASURED.

§A3 — Automation engine never started (HIGH) — MEASURED

homecore-server/src/main.rs did let _automation_engine = AutomationEngine::new(...) then dropped it immediately, while the header doc claimed "Automation engine subscribed to the state machine."

Real fix: the engine is now built into a long-lived binding and .start() is called, spawning the event loop + timer task; the header/log lines state it is started with N automations and which trigger classes are active. (With A4A7 the running engine is genuinely functional, not theater.)

Evidence: the engine-behavior tests below run against the same AutomationEngine::start() path now wired into the bin. Grade: MEASURED.

§A4 — Trigger::Time hard-coded false, no timer (HIGH) — MEASURED

trigger.rs::matches_sync returned false for Time and there was no timer task anywhere, so time automations could never fire.

Real fix: AutomationEngine::start_timer — a 1 Hz tokio interval that compares each time: automation's at (HH:MM or HH:MM:SS) against the local wall-clock second and fires it once per match (conditions still gate it). matches_sync returning false for Time is now correct and documented (it is a wall-clock trigger with no state-change context); a public fire_time_for_test exposes the same path deterministically.

Failing-on-old test (tests/engine_behaviors.rs): time_trigger_fires_via_timer_path (+ unit time_at_matches_handles_hh_mm_and_hh_mm_ss). The method does not exist on the old engine. Grade: MEASURED.

§A5 — RunMode documented as AtomicBool-enforced but unbounded-parallel (HIGH) — MEASURED

engine.rs doc claimed "RunMode::Single is enforced via a per-automation AtomicBool" — but no such code existed and every trigger spawned an unbounded parallel task regardless of mode.

Real fix: each registered automation carries a running: Arc<AtomicBool>. Single/IgnoreFirst modes compare_exchange the flag before spawning and skip the trigger if a run is already in flight, clearing it on completion; Parallel (and, for now, Restart/Queued) spawn on every trigger.

Failing-on-old tests (tests/engine_behaviors.rs): single_mode_does_not_double_fire_on_rapid_triggers (two rapid triggers while the first run sleeps → exactly 1 run; old code fired 2, verified) and parallel_mode_does_fire_concurrently (→ 2). Grade: MEASURED (Single/Parallel honored; bounded Queued/Restart/max ordering → ACCEPTED-FUTURE, see below).

§A6 — Action::Choose ignored branches (HIGH) — MEASURED

action.rs discarded choices and always ran default.

Real fix: ChoiceBranch::matches deserialises each branch's serde_yaml::Value conditions into Condition and evaluates them (AND semantics, against an EvalContext now carried on ExecutionContext). Choose runs the first matching branch's sequence and falls to default only if none match.

Failing-on-old tests (action.rs inline): choose_runs_matching_branch_not_default (matching branch runs, default does NOT — old code ran default, verified) and choose_falls_to_default_when_no_branch_matches. Grade: MEASURED.

§A7 — Template conditions always false in the live engine (MEDIUM) — MEASURED

condition.rs returned false for Template whenever template_env was None, and the engine built every EvalContext with template_env: None (EvalContext::new), so template: conditions could never be true in production — only in unit tests that hand-built a template env.

Real fix: the engine constructs one TemplateEnvironment over the state machine and threads it into every EvalContext via EvalContext::with_templates (event loop, timer task, and ExecutionContext for Choose branches).

Failing-on-old tests (tests/engine_behaviors.rs): template_condition_evaluates_true_in_engine (a {{ is_state(...) }} condition gates an action true) and template_condition_evaluates_false_blocks_action. On the old engine the action never ran (template always false, verified). Grade: MEASURED.

§B5 — Plugin manifest sig/hash "verified before execution" doc was false (LOW, honesty) — relabeled

homecore-plugins/src/manifest.rs documented wasm_module_hash as "verified before execution" and carried wasm_module_sig / publisher_key, but these fields are never read for verification (only ever set to None in tests).

Fix (honest labeling — no false capability claimed): the three fields are re-doc'd "(P4 — not yet enforced, ADR-161/B5)" — parsed and round-tripped, but no integrity/signature check happens before a plugin runs. No verification code was added (that is P4); the doc now matches the code. Grade: doc-honesty (no behavior change). (Superseded by ADR-162 §P4: the hash/signature gate is now implemented and enforced.)

Negative Results (NO-ACTION positives — audited, found correct, cited not edited)

These were checked and are genuinely sound/honest; cited as positives, not touched:

  • CSPRNG correctness — all IDs are uuid::v4; the rng/randn suspicion was REFUTED. No weak-randomness issue exists.
  • CORS allowlist (app.rs) — already hardened (explicit AllowOrigin::list, no permissive(), allow_credentials(false), env override). NO-ACTION.
  • No path traversal in homecore-migrate — audited, clean.
  • No secrets in logs — audited, clean.
  • HAP pairing stub — honestly disclaimed as a surface stub; not over-claimed.
  • InProcessRuntime "no sandbox" disclaimer — honest; left as-is.

Deferred Backlog (Nothing Dropped)

  • Plugin authority-isolation (P5)homecore_permissions claims are parsed but not enforced at the host-call boundary. DONE — ADR-162 §P5. hc_state_set now consults a PermissionSet distilled from the manifest; an undeclared write returns a typed -3 to the guest.
  • Plugin signature/hash verification (P4)implement the wasm_module_hash/wasm_module_sig/publisher_key gate that B5 now honestly says is absent. DONE — ADR-162 §P4. WasmtimeRuntime::load_plugin now SHA-256-checks the module, Ed25519-verifies the signature against publisher_key, and enforces a PluginPolicy trust allowlist (secure-default rejects unsigned/untrusted/tampered modules).
  • HAP real pairing (P2) — SRP/HKDF pairing + encrypted sessions; current bridge is an accessory-mapping surface. ACCEPTED-FUTURE (honestly stubbed).
  • RunMode::Queued/Restart/max orderingSingle/Parallel are honored; bounded queueing, restart-kill, and max concurrency are not yet wired (every non-Single mode is parallel). DONE — ADR-162 §A5. Restart aborts the in-flight task, Queued serializes via a per-automation async mutex, and max: N caps concurrency via a per-automation semaphore.
  • Automation YAML load-at-boot — the engine starts empty; a YAML loader is P-next. The bin log states "0 automations registered" honestly.

Reproduction (MEASURED)

cd v2
cargo test -p homecore-api -p homecore-server -p homecore-automation -p homecore-hap --no-default-features
cargo test -p homecore-plugins --features wasmtime
cargo build --workspace --no-default-features

Result at time of writing (all 0 failed):

  • homecore-api25 passed (lib 18; server_bin_auth 3; ws_handshake 4)
  • homecore-automation42 passed (lib 37; engine_behaviors 5)
  • homecore-hap17 passed
  • homecore-server — bin, 0 tests
  • (homecore-plugins15 passed: lib 12; integration 3)
  • Full workspace cargo build --workspace --no-default-features succeeds.

Consequences

  • The WebSocket path can no longer be entered with a forged token — it enforces the same LongLivedTokenStore whitelist as REST (A1).
  • WS clients now actually receive result/pong/event frames (A2).
  • The homecore-api dev bin defaults to loopback and honors HOMECORE_TOKENS (A8); it is no longer an open 0.0.0.0 accept-any endpoint by default.
  • The automation engine is started for real and its time triggers, Single run-mode, Choose branches, and template: conditions all function — no doc claims a capability the code lacks (A3A7).
  • The plugin manifest no longer claims signature verification it does not perform (B5).
  • Files kept under the 500-line guideline (engine.rs 462; behavioral tests moved to tests/engine_behaviors.rs).