docs(adr): ADR-161 — HOMECORE server-layer security & honest-labeling sweep (M7)

Records the Milestone 7 audit: library cores are real (anti-slop positive) but the network boundary had a CRITICAL WS auth bypass (A1) + reply-theater (A2) + documented-but-no-op automation (A3-A7) + a network-exposed dev bin (A8), all fixed and graded MEASURED with failing-on-old tests. Cites the NO-ACTION security positives (uuid::v4 CSPRNG refuted-suspicion, hardened CORS, no-traversal migrate, no-secrets-in-logs, honest HAP stub) and the deferred backlog (plugin authority-isolation P5, sig-verification P4, HAP real pairing P2, bounded run-modes, YAML load-at-boot). Co-Authored-By: claude-flow <ruv@ruv.net>
2026-06-12 00:55:52 -04:00 · 2026-06-12 00:55:52 -04:00 · d0da5888e3
parent e51704cd25
commit d0da5888e3
1 changed files with 260 additions and 0 deletions
--- a/docs/adr/ADR-161-homecore-server-layer-security.md
+++ b/docs/adr/ADR-161-homecore-server-layer-security.md
@ -0,0 +1,260 @@
+# ADR-161: HOMECORE Server Layer — WebSocket Auth Bypass, Reply-Theater & Documented-but-No-Op Automation (Security & Honest Labeling)
+
+- **Status**: accepted
+- **Date**: 2026-06-12
+- **Deciders**: ruv
+- **Tags**: homecore, http-ws-boundary, websocket-auth-bypass, security, automation-engine, documented-no-op, prove-everything, soundness, honest-labeling
+- **Amends**: ADR-130 (HOMECORE-API WS protocol), ADR-129 (HOMECORE-AUTO automation engine), ADR-128 (plugin manifest)
+
+## Context
+
+Beyond-SOTA sweep **Milestone 7**, over the HOMECORE **server/network layer**
+crates only — `homecore-api`, `homecore-server`, `homecore-automation`,
+`homecore-hap`, `homecore-plugins` — executed under the project's
+**prove-everything / anti-"AI-slop"** directive.
+
+### Headline — the library cores are real, but the network boundary was unsound
+
+The same audit pattern as ADR-160 held for the *library logic*: the automation
+trigger/condition/template/action evaluators, the REST handlers, the HAP
+mapping, and the plugin manifest parser are **real, tested code** — not stubs.
+That is the anti-slop positive and it is cited here as such.
+
+What the audit found was **not fake business logic but an unsound trust
+boundary plus documented-but-no-op features**:
+
+1. A **CRITICAL WebSocket authentication bypass** — the WS handshake accepted
+   any non-empty token, ignoring the provisioned token whitelist the REST path
+   enforces.
+2. **Reply-theater** — WS command responses were computed, then logged and
+   **discarded**; no `result`/`pong`/`event` ever reached the client.
+3. **Documented-but-idle automation** — the engine was constructed and dropped
+   (never started); time triggers, `RunMode`, `Choose` branches, and template
+   conditions were each **documented as working but were no-ops in the live
+   path**.
+
+This is a worse class than ADR-160's over-naming: here the **doc claimed a
+capability the code did not deliver** (auth enforcement, reply transport,
+running automations). The fix is **implement where feasible, honestly relabel
+where not — never leave a false doc.** Every fix is pinned by a test that
+**fails on the old code**.
+
+Grading vocabulary (ADR-152 / ADR-158 / ADR-160):
+- **MEASURED** — reproduced in this worktree, command + failing-on-old test recorded.
+- **NO-ACTION (already-honest/already-hardened)** — audited, found correct, cited as a positive.
+- **ACCEPTED-FUTURE** — deliberately deferred, nothing dropped.
+
+## Decision — Fixes Landed
+
+### §A1 — WebSocket auth bypass (CRITICAL, security) — MEASURED
+
+`homecore-api/src/ws.rs` handshake checked only `token.trim().is_empty()` and
+sent `auth_ok` for **any** non-empty token. It never called
+`state.tokens().is_valid()` — the check the REST path uses via
+`auth::BearerAuth`. With a provisioned `HOMECORE_TOKENS` whitelist, **any
+attacker-chosen non-empty token got full WS access** (read all states, call any
+service, subscribe to all events).
+
+**Real fix:** the handshake now calls
+`state.tokens().is_valid(&token).await` (the *same* store + method as REST).
+A wrong token receives `auth_invalid` and the socket closes. DEV (`allow_any`)
+mode still accepts any non-empty bearer with a warn, so smoke tests keep
+working; the empty token is rejected inside `is_valid`.
+
+**Failing-on-old test** (`tests/ws_handshake.rs`):
+`wrong_token_is_rejected` — provisions a real (non-dev) store with one good
+token, sends a DIFFERENT non-empty token over the WS handshake, asserts
+`auth_invalid`. On the old source the client received
+`{"type":"auth_ok",…}` (verified: the test panics on old `ws.rs` with
+`left: "auth_ok", right: "auth_invalid"`). Companion: `correct_token_is_accepted`.
+**Grade: MEASURED. This is the milestone headline.**
+
+### §A2 — WS replies never transmitted (HIGH, functional) — MEASURED
+
+`ws.rs::Connection::run` moved the socket into a recv-only task; the only
+consumer of the response mpsc just did `debug!("ws emit: {msg}")` and dropped
+every message. No command reply ever reached the wire.
+
+**Real fix:** the socket is split with `futures_util::StreamExt::split`. A
+dedicated **writer task** drains the response channel onto `sink.send(...)`
+(text frames; a `__pong:<n>` sentinel maps to a Pong control frame); the reader
+task parses commands concurrently. On reader exit the senders drop and the
+writer task ends cleanly.
+
+**Failing-on-old tests:** `result_reply_is_received` (connect → auth →
+`get_states` → assert a `result` reply is RECEIVED within 5s) and
+`ping_pong_reply_is_received`. Both time out on the old source (verified:
+`Elapsed` panic). **Grade: MEASURED.**
+
+### §A8 — `homecore-api` bin: no env-token path, network-exposed (HIGH, security) — MEASURED
+
+`homecore-api/src/bin/server.rs` bound `0.0.0.0:8123` with
+`SharedState::new()` → `allow_any_non_empty()` and **no** `HOMECORE_TOKENS`
+path (unlike `homecore-server`), so a provisioned operator had no way to lock
+it down.
+
+**Real fix:** the bin now mirrors `homecore-server`'s provisioning — prefer the
+`HOMECORE_TOKENS` whitelist (`LongLivedTokenStore::from_env()`), fall back to an
+**explicitly warn-logged** DEV mode only when unset. It also defaults the bind
+address to **`127.0.0.1`** (loopback) so a bare `cargo run` is not
+network-exposed, with `HOMECORE_BIND` to opt into LAN.
+
+**Failing-on-old test** (`tests/server_bin_auth.rs`):
+`provisioned_bin_rejects_wrong_bearer` reproduces the bin's exact provisioning
+path (a populated, non-dev store) and asserts a wrong bearer → 401;
+`from_env_path_enforces_whitelist` proves `from_env()` is not dev mode and
+enforces the list. The old bin's `allow_any_non_empty()` accepted the wrong
+bearer. **Grade: MEASURED.**
+
+### §A3 — Automation engine never started (HIGH) — MEASURED
+
+`homecore-server/src/main.rs` did `let _automation_engine = AutomationEngine::new(...)`
+then dropped it immediately, while the header doc claimed "Automation engine
+subscribed to the state machine."
+
+**Real fix:** the engine is now built into a long-lived binding and `.start()`
+is called, spawning the event loop + timer task; the header/log lines state it
+is started with N automations and which trigger classes are active. (With A4–A7
+the running engine is genuinely functional, not theater.)
+
+**Evidence:** the engine-behavior tests below run against the same
+`AutomationEngine::start()` path now wired into the bin. **Grade: MEASURED.**
+
+### §A4 — `Trigger::Time` hard-coded `false`, no timer (HIGH) — MEASURED
+
+`trigger.rs::matches_sync` returned `false` for `Time` and there was **no timer
+task** anywhere, so time automations could never fire.
+
+**Real fix:** `AutomationEngine::start_timer` — a 1 Hz tokio interval that
+compares each `time:` automation's `at` (`HH:MM` or `HH:MM:SS`) against the
+local wall-clock second and fires it once per match (conditions still gate it).
+`matches_sync` returning `false` for `Time` is now **correct and documented**
+(it is a wall-clock trigger with no state-change context); a public
+`fire_time_for_test` exposes the same path deterministically.
+
+**Failing-on-old test** (`tests/engine_behaviors.rs`):
+`time_trigger_fires_via_timer_path` (+ unit `time_at_matches_handles_hh_mm_and_hh_mm_ss`).
+The method does not exist on the old engine. **Grade: MEASURED.**
+
+### §A5 — `RunMode` documented as AtomicBool-enforced but unbounded-parallel (HIGH) — MEASURED
+
+`engine.rs` doc claimed "RunMode::Single is enforced via a per-automation
+AtomicBool" — but no such code existed and **every** trigger spawned an
+unbounded parallel task regardless of `mode`.
+
+**Real fix:** each registered automation carries a `running: Arc<AtomicBool>`.
+`Single`/`IgnoreFirst` modes `compare_exchange` the flag before spawning and
+**skip** the trigger if a run is already in flight, clearing it on completion;
+`Parallel` (and, for now, `Restart`/`Queued`) spawn on every trigger.
+
+**Failing-on-old tests** (`tests/engine_behaviors.rs`):
+`single_mode_does_not_double_fire_on_rapid_triggers` (two rapid triggers while
+the first run sleeps → exactly **1** run; old code fired **2**, verified) and
+`parallel_mode_does_fire_concurrently` (→ 2). **Grade: MEASURED (Single/Parallel
+honored; bounded `Queued`/`Restart`/`max` ordering → ACCEPTED-FUTURE, see below).**
+
+### §A6 — `Action::Choose` ignored branches (HIGH) — MEASURED
+
+`action.rs` discarded `choices` and always ran `default`.
+
+**Real fix:** `ChoiceBranch::matches` deserialises each branch's
+`serde_yaml::Value` conditions into `Condition` and evaluates them (AND
+semantics, against an `EvalContext` now carried on `ExecutionContext`). `Choose`
+runs the **first matching branch's** sequence and falls to `default` only if
+none match.
+
+**Failing-on-old tests** (`action.rs` inline):
+`choose_runs_matching_branch_not_default` (matching branch runs, default does
+NOT — old code ran default, verified) and
+`choose_falls_to_default_when_no_branch_matches`. **Grade: MEASURED.**
+
+### §A7 — Template conditions always false in the live engine (MEDIUM) — MEASURED
+
+`condition.rs` returned `false` for `Template` whenever `template_env` was
+`None`, and the engine built every `EvalContext` with `template_env: None`
+(`EvalContext::new`), so `template:` conditions could never be true in
+production — only in unit tests that hand-built a template env.
+
+**Real fix:** the engine constructs one `TemplateEnvironment` over the state
+machine and threads it into every `EvalContext` via
+`EvalContext::with_templates` (event loop, timer task, and
+`ExecutionContext` for `Choose` branches).
+
+**Failing-on-old tests** (`tests/engine_behaviors.rs`):
+`template_condition_evaluates_true_in_engine` (a `{{ is_state(...) }}` condition
+gates an action true) and `template_condition_evaluates_false_blocks_action`.
+On the old engine the action never ran (template always false, verified).
+**Grade: MEASURED.**
+
+### §B5 — Plugin manifest sig/hash "verified before execution" doc was false (LOW, honesty) — relabeled
+
+`homecore-plugins/src/manifest.rs` documented `wasm_module_hash` as "verified
+before execution" and carried `wasm_module_sig` / `publisher_key`, but these
+fields are **never read** for verification (only ever set to `None` in tests).
+
+**Fix (honest labeling — no false capability claimed):** the three fields are
+re-doc'd **"(P4 — not yet enforced, ADR-161/B5)"** — parsed and round-tripped,
+but no integrity/signature check happens before a plugin runs. No verification
+code was added (that is P4); the doc now matches the code.
+**Grade: doc-honesty (no behavior change).**
+
+## Negative Results (NO-ACTION positives — audited, found correct, cited not edited)
+
+These were checked and are genuinely sound/honest; cited as positives, **not**
+touched:
+- **CSPRNG correctness** — all IDs are `uuid::v4`; the rng/`randn` suspicion was
+  **REFUTED**. No weak-randomness issue exists.
+- **CORS allowlist** (`app.rs`) — already hardened (explicit `AllowOrigin::list`,
+  no `permissive()`, `allow_credentials(false)`, env override). NO-ACTION.
+- **No path traversal in `homecore-migrate`** — audited, clean.
+- **No secrets in logs** — audited, clean.
+- **HAP pairing stub** — honestly disclaimed as a surface stub; not over-claimed.
+- **`InProcessRuntime` "no sandbox" disclaimer** — honest; left as-is.
+
+## Deferred Backlog (Nothing Dropped)
+
+- **Plugin authority-isolation (P5)** — `homecore_permissions` claims are parsed
+  but not enforced at the host-call boundary. **ACCEPTED-FUTURE.**
+- **Plugin signature/hash verification (P4)** — implement the
+  `wasm_module_hash`/`wasm_module_sig`/`publisher_key` gate that B5 now honestly
+  says is absent. **ACCEPTED-FUTURE.**
+- **HAP real pairing (P2)** — SRP/HKDF pairing + encrypted sessions; current
+  bridge is an accessory-mapping surface. **ACCEPTED-FUTURE (honestly stubbed).**
+- **`RunMode::Queued`/`Restart`/`max` ordering** — `Single`/`Parallel` are
+  honored; bounded queueing, restart-kill, and `max` concurrency are not yet
+  wired (every non-Single mode is parallel). **ACCEPTED-FUTURE** — the
+  `engine.rs` doc states exactly this, no over-claim.
+- **Automation YAML load-at-boot** — the engine starts empty; a YAML loader is
+  P-next. The bin log states "0 automations registered" honestly.
+
+## Reproduction (MEASURED)
+
+```bash
+cd v2
+cargo test -p homecore-api -p homecore-server -p homecore-automation -p homecore-hap --no-default-features
+cargo test -p homecore-plugins --features wasmtime
+cargo build --workspace --no-default-features
+```
+
+Result at time of writing (all 0 failed):
+- **homecore-api** — **25 passed** (lib 18; `server_bin_auth` 3; `ws_handshake` 4)
+- **homecore-automation** — **42 passed** (lib 37; `engine_behaviors` 5)
+- **homecore-hap** — **17 passed**
+- **homecore-server** — bin, **0 tests**
+- (**homecore-plugins** — **15 passed**: lib 12; integration 3)
+- Full workspace `cargo build --workspace --no-default-features` succeeds.
+
+## Consequences
+
+- The WebSocket path can no longer be entered with a forged token — it enforces
+  the same `LongLivedTokenStore` whitelist as REST (A1).
+- WS clients now actually receive `result`/`pong`/`event` frames (A2).
+- The `homecore-api` dev bin defaults to loopback and honors `HOMECORE_TOKENS`
+  (A8); it is no longer an open `0.0.0.0` accept-any endpoint by default.
+- The automation engine is started for real and its time triggers, `Single`
+  run-mode, `Choose` branches, and `template:` conditions all function — no doc
+  claims a capability the code lacks (A3–A7).
+- The plugin manifest no longer claims signature verification it does not
+  perform (B5).
+- Files kept under the 500-line guideline (`engine.rs` 462; behavioral tests
+  moved to `tests/engine_behaviors.rs`).