# ADR-161: HOMECORE Server Layer — WebSocket Auth Bypass, Reply-Theater & Documented-but-No-Op Automation (Security & Honest Labeling) - **Status**: accepted - **Date**: 2026-06-12 - **Deciders**: ruv - **Tags**: homecore, http-ws-boundary, websocket-auth-bypass, security, automation-engine, documented-no-op, prove-everything, soundness, honest-labeling - **Amends**: ADR-130 (HOMECORE-API WS protocol), ADR-129 (HOMECORE-AUTO automation engine), ADR-128 (plugin manifest) ## Context Beyond-SOTA sweep **Milestone 7**, over the HOMECORE **server/network layer** crates only — `homecore-api`, `homecore-server`, `homecore-automation`, `homecore-hap`, `homecore-plugins` — executed under the project's **prove-everything / anti-"AI-slop"** directive. ### Headline — the library cores are real, but the network boundary was unsound The same audit pattern as ADR-160 held for the *library logic*: the automation trigger/condition/template/action evaluators, the REST handlers, the HAP mapping, and the plugin manifest parser are **real, tested code** — not stubs. That is the anti-slop positive and it is cited here as such. What the audit found was **not fake business logic but an unsound trust boundary plus documented-but-no-op features**: 1. A **CRITICAL WebSocket authentication bypass** — the WS handshake accepted any non-empty token, ignoring the provisioned token whitelist the REST path enforces. 2. **Reply-theater** — WS command responses were computed, then logged and **discarded**; no `result`/`pong`/`event` ever reached the client. 3. **Documented-but-idle automation** — the engine was constructed and dropped (never started); time triggers, `RunMode`, `Choose` branches, and template conditions were each **documented as working but were no-ops in the live path**. This is a worse class than ADR-160's over-naming: here the **doc claimed a capability the code did not deliver** (auth enforcement, reply transport, running automations). The fix is **implement where feasible, honestly relabel where not — never leave a false doc.** Every fix is pinned by a test that **fails on the old code**. Grading vocabulary (ADR-152 / ADR-158 / ADR-160): - **MEASURED** — reproduced in this worktree, command + failing-on-old test recorded. - **NO-ACTION (already-honest/already-hardened)** — audited, found correct, cited as a positive. - **ACCEPTED-FUTURE** — deliberately deferred, nothing dropped. ## Decision — Fixes Landed ### §A1 — WebSocket auth bypass (CRITICAL, security) — MEASURED `homecore-api/src/ws.rs` handshake checked only `token.trim().is_empty()` and sent `auth_ok` for **any** non-empty token. It never called `state.tokens().is_valid()` — the check the REST path uses via `auth::BearerAuth`. With a provisioned `HOMECORE_TOKENS` whitelist, **any attacker-chosen non-empty token got full WS access** (read all states, call any service, subscribe to all events). **Real fix:** the handshake now calls `state.tokens().is_valid(&token).await` (the *same* store + method as REST). A wrong token receives `auth_invalid` and the socket closes. DEV (`allow_any`) mode still accepts any non-empty bearer with a warn, so smoke tests keep working; the empty token is rejected inside `is_valid`. **Failing-on-old test** (`tests/ws_handshake.rs`): `wrong_token_is_rejected` — provisions a real (non-dev) store with one good token, sends a DIFFERENT non-empty token over the WS handshake, asserts `auth_invalid`. On the old source the client received `{"type":"auth_ok",…}` (verified: the test panics on old `ws.rs` with `left: "auth_ok", right: "auth_invalid"`). Companion: `correct_token_is_accepted`. **Grade: MEASURED. This is the milestone headline.** ### §A2 — WS replies never transmitted (HIGH, functional) — MEASURED `ws.rs::Connection::run` moved the socket into a recv-only task; the only consumer of the response mpsc just did `debug!("ws emit: {msg}")` and dropped every message. No command reply ever reached the wire. **Real fix:** the socket is split with `futures_util::StreamExt::split`. A dedicated **writer task** drains the response channel onto `sink.send(...)` (text frames; a `__pong:` sentinel maps to a Pong control frame); the reader task parses commands concurrently. On reader exit the senders drop and the writer task ends cleanly. **Failing-on-old tests:** `result_reply_is_received` (connect → auth → `get_states` → assert a `result` reply is RECEIVED within 5s) and `ping_pong_reply_is_received`. Both time out on the old source (verified: `Elapsed` panic). **Grade: MEASURED.** ### §A8 — `homecore-api` bin: no env-token path, network-exposed (HIGH, security) — MEASURED `homecore-api/src/bin/server.rs` bound `0.0.0.0:8123` with `SharedState::new()` → `allow_any_non_empty()` and **no** `HOMECORE_TOKENS` path (unlike `homecore-server`), so a provisioned operator had no way to lock it down. **Real fix:** the bin now mirrors `homecore-server`'s provisioning — prefer the `HOMECORE_TOKENS` whitelist (`LongLivedTokenStore::from_env()`), fall back to an **explicitly warn-logged** DEV mode only when unset. It also defaults the bind address to **`127.0.0.1`** (loopback) so a bare `cargo run` is not network-exposed, with `HOMECORE_BIND` to opt into LAN. **Failing-on-old test** (`tests/server_bin_auth.rs`): `provisioned_bin_rejects_wrong_bearer` reproduces the bin's exact provisioning path (a populated, non-dev store) and asserts a wrong bearer → 401; `from_env_path_enforces_whitelist` proves `from_env()` is not dev mode and enforces the list. The old bin's `allow_any_non_empty()` accepted the wrong bearer. **Grade: MEASURED.** ### §A3 — Automation engine never started (HIGH) — MEASURED `homecore-server/src/main.rs` did `let _automation_engine = AutomationEngine::new(...)` then dropped it immediately, while the header doc claimed "Automation engine subscribed to the state machine." **Real fix:** the engine is now built into a long-lived binding and `.start()` is called, spawning the event loop + timer task; the header/log lines state it is started with N automations and which trigger classes are active. (With A4–A7 the running engine is genuinely functional, not theater.) **Evidence:** the engine-behavior tests below run against the same `AutomationEngine::start()` path now wired into the bin. **Grade: MEASURED.** ### §A4 — `Trigger::Time` hard-coded `false`, no timer (HIGH) — MEASURED `trigger.rs::matches_sync` returned `false` for `Time` and there was **no timer task** anywhere, so time automations could never fire. **Real fix:** `AutomationEngine::start_timer` — a 1 Hz tokio interval that compares each `time:` automation's `at` (`HH:MM` or `HH:MM:SS`) against the local wall-clock second and fires it once per match (conditions still gate it). `matches_sync` returning `false` for `Time` is now **correct and documented** (it is a wall-clock trigger with no state-change context); a public `fire_time_for_test` exposes the same path deterministically. **Failing-on-old test** (`tests/engine_behaviors.rs`): `time_trigger_fires_via_timer_path` (+ unit `time_at_matches_handles_hh_mm_and_hh_mm_ss`). The method does not exist on the old engine. **Grade: MEASURED.** ### §A5 — `RunMode` documented as AtomicBool-enforced but unbounded-parallel (HIGH) — MEASURED `engine.rs` doc claimed "RunMode::Single is enforced via a per-automation AtomicBool" — but no such code existed and **every** trigger spawned an unbounded parallel task regardless of `mode`. **Real fix:** each registered automation carries a `running: Arc`. `Single`/`IgnoreFirst` modes `compare_exchange` the flag before spawning and **skip** the trigger if a run is already in flight, clearing it on completion; `Parallel` (and, for now, `Restart`/`Queued`) spawn on every trigger. **Failing-on-old tests** (`tests/engine_behaviors.rs`): `single_mode_does_not_double_fire_on_rapid_triggers` (two rapid triggers while the first run sleeps → exactly **1** run; old code fired **2**, verified) and `parallel_mode_does_fire_concurrently` (→ 2). **Grade: MEASURED (Single/Parallel honored; bounded `Queued`/`Restart`/`max` ordering → ACCEPTED-FUTURE, see below).** ### §A6 — `Action::Choose` ignored branches (HIGH) — MEASURED `action.rs` discarded `choices` and always ran `default`. **Real fix:** `ChoiceBranch::matches` deserialises each branch's `serde_yaml::Value` conditions into `Condition` and evaluates them (AND semantics, against an `EvalContext` now carried on `ExecutionContext`). `Choose` runs the **first matching branch's** sequence and falls to `default` only if none match. **Failing-on-old tests** (`action.rs` inline): `choose_runs_matching_branch_not_default` (matching branch runs, default does NOT — old code ran default, verified) and `choose_falls_to_default_when_no_branch_matches`. **Grade: MEASURED.** ### §A7 — Template conditions always false in the live engine (MEDIUM) — MEASURED `condition.rs` returned `false` for `Template` whenever `template_env` was `None`, and the engine built every `EvalContext` with `template_env: None` (`EvalContext::new`), so `template:` conditions could never be true in production — only in unit tests that hand-built a template env. **Real fix:** the engine constructs one `TemplateEnvironment` over the state machine and threads it into every `EvalContext` via `EvalContext::with_templates` (event loop, timer task, and `ExecutionContext` for `Choose` branches). **Failing-on-old tests** (`tests/engine_behaviors.rs`): `template_condition_evaluates_true_in_engine` (a `{{ is_state(...) }}` condition gates an action true) and `template_condition_evaluates_false_blocks_action`. On the old engine the action never ran (template always false, verified). **Grade: MEASURED.** ### §B5 — Plugin manifest sig/hash "verified before execution" doc was false (LOW, honesty) — relabeled `homecore-plugins/src/manifest.rs` documented `wasm_module_hash` as "verified before execution" and carried `wasm_module_sig` / `publisher_key`, but these fields are **never read** for verification (only ever set to `None` in tests). **Fix (honest labeling — no false capability claimed):** the three fields are re-doc'd **"(P4 — not yet enforced, ADR-161/B5)"** — parsed and round-tripped, but no integrity/signature check happens before a plugin runs. No verification code was added (that is P4); the doc now matches the code. **Grade: doc-honesty (no behavior change).** *(Superseded by ADR-162 §P4: the hash/signature gate is now implemented and enforced.)* ## Negative Results (NO-ACTION positives — audited, found correct, cited not edited) These were checked and are genuinely sound/honest; cited as positives, **not** touched: - **CSPRNG correctness** — all IDs are `uuid::v4`; the rng/`randn` suspicion was **REFUTED**. No weak-randomness issue exists. - **CORS allowlist** (`app.rs`) — already hardened (explicit `AllowOrigin::list`, no `permissive()`, `allow_credentials(false)`, env override). NO-ACTION. - **No path traversal in `homecore-migrate`** — audited, clean. - **No secrets in logs** — audited, clean. - **HAP pairing stub** — honestly disclaimed as a surface stub; not over-claimed. - **`InProcessRuntime` "no sandbox" disclaimer** — honest; left as-is. ## Deferred Backlog (Nothing Dropped) - **Plugin authority-isolation (P5)** — ~~`homecore_permissions` claims are parsed but not enforced at the host-call boundary.~~ **DONE — ADR-162 §P5.** `hc_state_set` now consults a `PermissionSet` distilled from the manifest; an undeclared write returns a typed `-3` to the guest. - **Plugin signature/hash verification (P4)** — ~~implement the `wasm_module_hash`/`wasm_module_sig`/`publisher_key` gate that B5 now honestly says is absent.~~ **DONE — ADR-162 §P4.** `WasmtimeRuntime::load_plugin` now SHA-256-checks the module, Ed25519-verifies the signature against `publisher_key`, and enforces a `PluginPolicy` trust allowlist (secure-default rejects unsigned/untrusted/tampered modules). - **HAP real pairing (P2)** — SRP/HKDF pairing + encrypted sessions; current bridge is an accessory-mapping surface. **ACCEPTED-FUTURE (honestly stubbed).** - **`RunMode::Queued`/`Restart`/`max` ordering** — ~~`Single`/`Parallel` are honored; bounded queueing, restart-kill, and `max` concurrency are not yet wired (every non-Single mode is parallel).~~ **DONE — ADR-162 §A5.** Restart aborts the in-flight task, Queued serializes via a per-automation async mutex, and `max: N` caps concurrency via a per-automation semaphore. - **Automation YAML load-at-boot** — the engine starts empty; a YAML loader is P-next. The bin log states "0 automations registered" honestly. ## Reproduction (MEASURED) ```bash cd v2 cargo test -p homecore-api -p homecore-server -p homecore-automation -p homecore-hap --no-default-features cargo test -p homecore-plugins --features wasmtime cargo build --workspace --no-default-features ``` Result at time of writing (all 0 failed): - **homecore-api** — **25 passed** (lib 18; `server_bin_auth` 3; `ws_handshake` 4) - **homecore-automation** — **42 passed** (lib 37; `engine_behaviors` 5) - **homecore-hap** — **17 passed** - **homecore-server** — bin, **0 tests** - (**homecore-plugins** — **15 passed**: lib 12; integration 3) - Full workspace `cargo build --workspace --no-default-features` succeeds. ## Consequences - The WebSocket path can no longer be entered with a forged token — it enforces the same `LongLivedTokenStore` whitelist as REST (A1). - WS clients now actually receive `result`/`pong`/`event` frames (A2). - The `homecore-api` dev bin defaults to loopback and honors `HOMECORE_TOKENS` (A8); it is no longer an open `0.0.0.0` accept-any endpoint by default. - The automation engine is started for real and its time triggers, `Single` run-mode, `Choose` branches, and `template:` conditions all function — no doc claims a capability the code lacks (A3–A7). - The plugin manifest no longer claims signature verification it does not perform (B5). - Files kept under the 500-line guideline (`engine.rs` 462; behavioral tests moved to `tests/engine_behaviors.rs`).