268 lines
14 KiB
Markdown
268 lines
14 KiB
Markdown
# ADR-161: HOMECORE Server Layer — WebSocket Auth Bypass, Reply-Theater & Documented-but-No-Op Automation (Security & Honest Labeling)
|
||
|
||
- **Status**: accepted
|
||
- **Date**: 2026-06-12
|
||
- **Deciders**: ruv
|
||
- **Tags**: homecore, http-ws-boundary, websocket-auth-bypass, security, automation-engine, documented-no-op, prove-everything, soundness, honest-labeling
|
||
- **Amends**: ADR-130 (HOMECORE-API WS protocol), ADR-129 (HOMECORE-AUTO automation engine), ADR-128 (plugin manifest)
|
||
|
||
## Context
|
||
|
||
Beyond-SOTA sweep **Milestone 7**, over the HOMECORE **server/network layer**
|
||
crates only — `homecore-api`, `homecore-server`, `homecore-automation`,
|
||
`homecore-hap`, `homecore-plugins` — executed under the project's
|
||
**prove-everything / anti-"AI-slop"** directive.
|
||
|
||
### Headline — the library cores are real, but the network boundary was unsound
|
||
|
||
The same audit pattern as ADR-160 held for the *library logic*: the automation
|
||
trigger/condition/template/action evaluators, the REST handlers, the HAP
|
||
mapping, and the plugin manifest parser are **real, tested code** — not stubs.
|
||
That is the anti-slop positive and it is cited here as such.
|
||
|
||
What the audit found was **not fake business logic but an unsound trust
|
||
boundary plus documented-but-no-op features**:
|
||
|
||
1. A **CRITICAL WebSocket authentication bypass** — the WS handshake accepted
|
||
any non-empty token, ignoring the provisioned token whitelist the REST path
|
||
enforces.
|
||
2. **Reply-theater** — WS command responses were computed, then logged and
|
||
**discarded**; no `result`/`pong`/`event` ever reached the client.
|
||
3. **Documented-but-idle automation** — the engine was constructed and dropped
|
||
(never started); time triggers, `RunMode`, `Choose` branches, and template
|
||
conditions were each **documented as working but were no-ops in the live
|
||
path**.
|
||
|
||
This is a worse class than ADR-160's over-naming: here the **doc claimed a
|
||
capability the code did not deliver** (auth enforcement, reply transport,
|
||
running automations). The fix is **implement where feasible, honestly relabel
|
||
where not — never leave a false doc.** Every fix is pinned by a test that
|
||
**fails on the old code**.
|
||
|
||
Grading vocabulary (ADR-152 / ADR-158 / ADR-160):
|
||
- **MEASURED** — reproduced in this worktree, command + failing-on-old test recorded.
|
||
- **NO-ACTION (already-honest/already-hardened)** — audited, found correct, cited as a positive.
|
||
- **ACCEPTED-FUTURE** — deliberately deferred, nothing dropped.
|
||
|
||
## Decision — Fixes Landed
|
||
|
||
### §A1 — WebSocket auth bypass (CRITICAL, security) — MEASURED
|
||
|
||
`homecore-api/src/ws.rs` handshake checked only `token.trim().is_empty()` and
|
||
sent `auth_ok` for **any** non-empty token. It never called
|
||
`state.tokens().is_valid()` — the check the REST path uses via
|
||
`auth::BearerAuth`. With a provisioned `HOMECORE_TOKENS` whitelist, **any
|
||
attacker-chosen non-empty token got full WS access** (read all states, call any
|
||
service, subscribe to all events).
|
||
|
||
**Real fix:** the handshake now calls
|
||
`state.tokens().is_valid(&token).await` (the *same* store + method as REST).
|
||
A wrong token receives `auth_invalid` and the socket closes. DEV (`allow_any`)
|
||
mode still accepts any non-empty bearer with a warn, so smoke tests keep
|
||
working; the empty token is rejected inside `is_valid`.
|
||
|
||
**Failing-on-old test** (`tests/ws_handshake.rs`):
|
||
`wrong_token_is_rejected` — provisions a real (non-dev) store with one good
|
||
token, sends a DIFFERENT non-empty token over the WS handshake, asserts
|
||
`auth_invalid`. On the old source the client received
|
||
`{"type":"auth_ok",…}` (verified: the test panics on old `ws.rs` with
|
||
`left: "auth_ok", right: "auth_invalid"`). Companion: `correct_token_is_accepted`.
|
||
**Grade: MEASURED. This is the milestone headline.**
|
||
|
||
### §A2 — WS replies never transmitted (HIGH, functional) — MEASURED
|
||
|
||
`ws.rs::Connection::run` moved the socket into a recv-only task; the only
|
||
consumer of the response mpsc just did `debug!("ws emit: {msg}")` and dropped
|
||
every message. No command reply ever reached the wire.
|
||
|
||
**Real fix:** the socket is split with `futures_util::StreamExt::split`. A
|
||
dedicated **writer task** drains the response channel onto `sink.send(...)`
|
||
(text frames; a `__pong:<n>` sentinel maps to a Pong control frame); the reader
|
||
task parses commands concurrently. On reader exit the senders drop and the
|
||
writer task ends cleanly.
|
||
|
||
**Failing-on-old tests:** `result_reply_is_received` (connect → auth →
|
||
`get_states` → assert a `result` reply is RECEIVED within 5s) and
|
||
`ping_pong_reply_is_received`. Both time out on the old source (verified:
|
||
`Elapsed` panic). **Grade: MEASURED.**
|
||
|
||
### §A8 — `homecore-api` bin: no env-token path, network-exposed (HIGH, security) — MEASURED
|
||
|
||
`homecore-api/src/bin/server.rs` bound `0.0.0.0:8123` with
|
||
`SharedState::new()` → `allow_any_non_empty()` and **no** `HOMECORE_TOKENS`
|
||
path (unlike `homecore-server`), so a provisioned operator had no way to lock
|
||
it down.
|
||
|
||
**Real fix:** the bin now mirrors `homecore-server`'s provisioning — prefer the
|
||
`HOMECORE_TOKENS` whitelist (`LongLivedTokenStore::from_env()`), fall back to an
|
||
**explicitly warn-logged** DEV mode only when unset. It also defaults the bind
|
||
address to **`127.0.0.1`** (loopback) so a bare `cargo run` is not
|
||
network-exposed, with `HOMECORE_BIND` to opt into LAN.
|
||
|
||
**Failing-on-old test** (`tests/server_bin_auth.rs`):
|
||
`provisioned_bin_rejects_wrong_bearer` reproduces the bin's exact provisioning
|
||
path (a populated, non-dev store) and asserts a wrong bearer → 401;
|
||
`from_env_path_enforces_whitelist` proves `from_env()` is not dev mode and
|
||
enforces the list. The old bin's `allow_any_non_empty()` accepted the wrong
|
||
bearer. **Grade: MEASURED.**
|
||
|
||
### §A3 — Automation engine never started (HIGH) — MEASURED
|
||
|
||
`homecore-server/src/main.rs` did `let _automation_engine = AutomationEngine::new(...)`
|
||
then dropped it immediately, while the header doc claimed "Automation engine
|
||
subscribed to the state machine."
|
||
|
||
**Real fix:** the engine is now built into a long-lived binding and `.start()`
|
||
is called, spawning the event loop + timer task; the header/log lines state it
|
||
is started with N automations and which trigger classes are active. (With A4–A7
|
||
the running engine is genuinely functional, not theater.)
|
||
|
||
**Evidence:** the engine-behavior tests below run against the same
|
||
`AutomationEngine::start()` path now wired into the bin. **Grade: MEASURED.**
|
||
|
||
### §A4 — `Trigger::Time` hard-coded `false`, no timer (HIGH) — MEASURED
|
||
|
||
`trigger.rs::matches_sync` returned `false` for `Time` and there was **no timer
|
||
task** anywhere, so time automations could never fire.
|
||
|
||
**Real fix:** `AutomationEngine::start_timer` — a 1 Hz tokio interval that
|
||
compares each `time:` automation's `at` (`HH:MM` or `HH:MM:SS`) against the
|
||
local wall-clock second and fires it once per match (conditions still gate it).
|
||
`matches_sync` returning `false` for `Time` is now **correct and documented**
|
||
(it is a wall-clock trigger with no state-change context); a public
|
||
`fire_time_for_test` exposes the same path deterministically.
|
||
|
||
**Failing-on-old test** (`tests/engine_behaviors.rs`):
|
||
`time_trigger_fires_via_timer_path` (+ unit `time_at_matches_handles_hh_mm_and_hh_mm_ss`).
|
||
The method does not exist on the old engine. **Grade: MEASURED.**
|
||
|
||
### §A5 — `RunMode` documented as AtomicBool-enforced but unbounded-parallel (HIGH) — MEASURED
|
||
|
||
`engine.rs` doc claimed "RunMode::Single is enforced via a per-automation
|
||
AtomicBool" — but no such code existed and **every** trigger spawned an
|
||
unbounded parallel task regardless of `mode`.
|
||
|
||
**Real fix:** each registered automation carries a `running: Arc<AtomicBool>`.
|
||
`Single`/`IgnoreFirst` modes `compare_exchange` the flag before spawning and
|
||
**skip** the trigger if a run is already in flight, clearing it on completion;
|
||
`Parallel` (and, for now, `Restart`/`Queued`) spawn on every trigger.
|
||
|
||
**Failing-on-old tests** (`tests/engine_behaviors.rs`):
|
||
`single_mode_does_not_double_fire_on_rapid_triggers` (two rapid triggers while
|
||
the first run sleeps → exactly **1** run; old code fired **2**, verified) and
|
||
`parallel_mode_does_fire_concurrently` (→ 2). **Grade: MEASURED (Single/Parallel
|
||
honored; bounded `Queued`/`Restart`/`max` ordering → ACCEPTED-FUTURE, see below).**
|
||
|
||
### §A6 — `Action::Choose` ignored branches (HIGH) — MEASURED
|
||
|
||
`action.rs` discarded `choices` and always ran `default`.
|
||
|
||
**Real fix:** `ChoiceBranch::matches` deserialises each branch's
|
||
`serde_yaml::Value` conditions into `Condition` and evaluates them (AND
|
||
semantics, against an `EvalContext` now carried on `ExecutionContext`). `Choose`
|
||
runs the **first matching branch's** sequence and falls to `default` only if
|
||
none match.
|
||
|
||
**Failing-on-old tests** (`action.rs` inline):
|
||
`choose_runs_matching_branch_not_default` (matching branch runs, default does
|
||
NOT — old code ran default, verified) and
|
||
`choose_falls_to_default_when_no_branch_matches`. **Grade: MEASURED.**
|
||
|
||
### §A7 — Template conditions always false in the live engine (MEDIUM) — MEASURED
|
||
|
||
`condition.rs` returned `false` for `Template` whenever `template_env` was
|
||
`None`, and the engine built every `EvalContext` with `template_env: None`
|
||
(`EvalContext::new`), so `template:` conditions could never be true in
|
||
production — only in unit tests that hand-built a template env.
|
||
|
||
**Real fix:** the engine constructs one `TemplateEnvironment` over the state
|
||
machine and threads it into every `EvalContext` via
|
||
`EvalContext::with_templates` (event loop, timer task, and
|
||
`ExecutionContext` for `Choose` branches).
|
||
|
||
**Failing-on-old tests** (`tests/engine_behaviors.rs`):
|
||
`template_condition_evaluates_true_in_engine` (a `{{ is_state(...) }}` condition
|
||
gates an action true) and `template_condition_evaluates_false_blocks_action`.
|
||
On the old engine the action never ran (template always false, verified).
|
||
**Grade: MEASURED.**
|
||
|
||
### §B5 — Plugin manifest sig/hash "verified before execution" doc was false (LOW, honesty) — relabeled
|
||
|
||
`homecore-plugins/src/manifest.rs` documented `wasm_module_hash` as "verified
|
||
before execution" and carried `wasm_module_sig` / `publisher_key`, but these
|
||
fields are **never read** for verification (only ever set to `None` in tests).
|
||
|
||
**Fix (honest labeling — no false capability claimed):** the three fields are
|
||
re-doc'd **"(P4 — not yet enforced, ADR-161/B5)"** — parsed and round-tripped,
|
||
but no integrity/signature check happens before a plugin runs. No verification
|
||
code was added (that is P4); the doc now matches the code.
|
||
**Grade: doc-honesty (no behavior change).** *(Superseded by ADR-162 §P4:
|
||
the hash/signature gate is now implemented and enforced.)*
|
||
|
||
## Negative Results (NO-ACTION positives — audited, found correct, cited not edited)
|
||
|
||
These were checked and are genuinely sound/honest; cited as positives, **not**
|
||
touched:
|
||
- **CSPRNG correctness** — all IDs are `uuid::v4`; the rng/`randn` suspicion was
|
||
**REFUTED**. No weak-randomness issue exists.
|
||
- **CORS allowlist** (`app.rs`) — already hardened (explicit `AllowOrigin::list`,
|
||
no `permissive()`, `allow_credentials(false)`, env override). NO-ACTION.
|
||
- **No path traversal in `homecore-migrate`** — audited, clean.
|
||
- **No secrets in logs** — audited, clean.
|
||
- **HAP pairing stub** — honestly disclaimed as a surface stub; not over-claimed.
|
||
- **`InProcessRuntime` "no sandbox" disclaimer** — honest; left as-is.
|
||
|
||
## Deferred Backlog (Nothing Dropped)
|
||
|
||
- **Plugin authority-isolation (P5)** — ~~`homecore_permissions` claims are parsed
|
||
but not enforced at the host-call boundary.~~ **DONE — ADR-162 §P5.**
|
||
`hc_state_set` now consults a `PermissionSet` distilled from the manifest;
|
||
an undeclared write returns a typed `-3` to the guest.
|
||
- **Plugin signature/hash verification (P4)** — ~~implement the
|
||
`wasm_module_hash`/`wasm_module_sig`/`publisher_key` gate that B5 now honestly
|
||
says is absent.~~ **DONE — ADR-162 §P4.** `WasmtimeRuntime::load_plugin` now
|
||
SHA-256-checks the module, Ed25519-verifies the signature against
|
||
`publisher_key`, and enforces a `PluginPolicy` trust allowlist
|
||
(secure-default rejects unsigned/untrusted/tampered modules).
|
||
- **HAP real pairing (P2)** — SRP/HKDF pairing + encrypted sessions; current
|
||
bridge is an accessory-mapping surface. **ACCEPTED-FUTURE (honestly stubbed).**
|
||
- **`RunMode::Queued`/`Restart`/`max` ordering** — ~~`Single`/`Parallel` are
|
||
honored; bounded queueing, restart-kill, and `max` concurrency are not yet
|
||
wired (every non-Single mode is parallel).~~ **DONE — ADR-162 §A5.** Restart
|
||
aborts the in-flight task, Queued serializes via a per-automation async mutex,
|
||
and `max: N` caps concurrency via a per-automation semaphore.
|
||
- **Automation YAML load-at-boot** — the engine starts empty; a YAML loader is
|
||
P-next. The bin log states "0 automations registered" honestly.
|
||
|
||
## Reproduction (MEASURED)
|
||
|
||
```bash
|
||
cd v2
|
||
cargo test -p homecore-api -p homecore-server -p homecore-automation -p homecore-hap --no-default-features
|
||
cargo test -p homecore-plugins --features wasmtime
|
||
cargo build --workspace --no-default-features
|
||
```
|
||
|
||
Result at time of writing (all 0 failed):
|
||
- **homecore-api** — **25 passed** (lib 18; `server_bin_auth` 3; `ws_handshake` 4)
|
||
- **homecore-automation** — **42 passed** (lib 37; `engine_behaviors` 5)
|
||
- **homecore-hap** — **17 passed**
|
||
- **homecore-server** — bin, **0 tests**
|
||
- (**homecore-plugins** — **15 passed**: lib 12; integration 3)
|
||
- Full workspace `cargo build --workspace --no-default-features` succeeds.
|
||
|
||
## Consequences
|
||
|
||
- The WebSocket path can no longer be entered with a forged token — it enforces
|
||
the same `LongLivedTokenStore` whitelist as REST (A1).
|
||
- WS clients now actually receive `result`/`pong`/`event` frames (A2).
|
||
- The `homecore-api` dev bin defaults to loopback and honors `HOMECORE_TOKENS`
|
||
(A8); it is no longer an open `0.0.0.0` accept-any endpoint by default.
|
||
- The automation engine is started for real and its time triggers, `Single`
|
||
run-mode, `Choose` branches, and `template:` conditions all function — no doc
|
||
claims a capability the code lacks (A3–A7).
|
||
- The plugin manifest no longer claims signature verification it does not
|
||
perform (B5).
|
||
- Files kept under the 500-line guideline (`engine.rs` 462; behavioral tests
|
||
moved to `tests/engine_behaviors.rs`).
|