docs: add persistent node registry, OTA safety gate, plugin architecture to ADR-052

Incorporates engineering review feedback:
- Persistent node registry (~/.ruview/nodes.db) — discovery becomes reconciliation
- BatchOtaSession aggregate with TdmSafe rolling update strategy
- Plugin architecture section — control plane extensibility trajectory
- Renumbered sections for new content (9-11 added, security considerations now section 12, impl phases now section 13)

Co-Authored-By: claude-flow <ruv@ruv.net>
ruv 2026-03-06 15:57:12 -05:00
parent 2ba8b3b93d
commit 50a82165c9
2 changed files with 133 additions and 2 deletions

@@ -51,6 +51,12 @@ address.
**Invariant**: No two nodes may share the same MAC address. If a node is
discovered via multiple strategies, the most recent data wins.
**Persistence**: The registry is persisted to `~/.ruview/nodes.db` (SQLite via
`rusqlite`). On startup, all previously known nodes are loaded as `Offline` and
reconciled against a fresh discovery scan. This means the app **remembers the
mesh** across restarts — critical for field deployments where nodes may be
temporarily powered off.
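The startup behavior described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the ADR's implementation: the `KnownNode` and `NodeStatus` types and the `reconcile` function are hypothetical names, and a plain `HashMap` stands in for the SQLite-backed registry.

```rust
use std::collections::{HashMap, HashSet};

/// Connectivity status tracked per node (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum NodeStatus {
    Online,
    Offline,
}

/// Minimal registry entry keyed by MAC address (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct KnownNode {
    pub mac: String,
    pub status: NodeStatus,
}

/// Load every persisted MAC as `Offline`, then mark nodes found by the fresh
/// discovery scan as `Online`. MACs seen only in the scan become new entries.
/// Keying by MAC keeps the "no two nodes share a MAC" invariant.
pub fn reconcile(persisted: &[String], scanned: &HashSet<String>) -> HashMap<String, KnownNode> {
    let mut registry: HashMap<String, KnownNode> = persisted
        .iter()
        .map(|mac| {
            let node = KnownNode { mac: mac.clone(), status: NodeStatus::Offline };
            (mac.clone(), node)
        })
        .collect();

    for mac in scanned {
        registry
            .entry(mac.clone())
            .and_modify(|n| n.status = NodeStatus::Online)
            .or_insert_with(|| KnownNode { mac: mac.clone(), status: NodeStatus::Online });
    }
    registry
}
```

A node that is powered off during the scan stays in the map as `Offline` instead of disappearing, which is the property the paragraph above relies on.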
#### `Node` (Entity)
| Field | Type | Description |
@@ -156,6 +162,32 @@ Represents an over-the-air firmware update to a running node.
| `phase` | `OtaPhase` | Uploading / Rebooting / Verifying / Done / Failed |
| `progress` | `Progress` | Upload progress |
#### `BatchOtaSession` (Aggregate Root)
Coordinates rolling firmware updates across multiple mesh nodes. Prevents all
nodes from rebooting simultaneously, which would collapse the sensing network.
| Field | Type | Description |
|-------|------|-------------|
| `id` | `Uuid` | Batch session identifier |
| `firmware` | `FirmwareBinary` | The binary being deployed |
| `strategy` | `OtaStrategy` | `Sequential`, `TdmSafe`, `Parallel` |
| `max_concurrent` | `usize` | Max nodes updating at once |
| `batch_delay_secs` | `u64` | Delay between batches |
| `fail_fast` | `bool` | Abort remaining on first failure |
| `node_states` | `Map<MacAddress, BatchNodeState>` | Per-node progress |
**Invariant**: In `TdmSafe` mode, adjacent TDM slots are never updated
concurrently. Even-slot nodes update first, then odd-slot nodes.
**Lifecycle**: `Planning → InProgress → Completed | PartialFailure | Aborted`
- `BatchNodeState` (enum): `Queued`, `Uploading(Progress)`, `Rebooting`, `Verifying`, `Done`, `Failed(String)`, `Skipped`
- `OtaStrategy` (enum):
  - `Sequential`: one node at a time, waiting for each to rejoin
  - `TdmSafe`: update non-adjacent slots to maintain sensing coverage
  - `Parallel`: all at once (development only)
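One way to read the lifecycle above is as a function of the per-node states. The sketch below is an assumption-laden illustration: `BatchStatus`, `is_terminal`, and `batch_status` are hypothetical names, a `u8` percentage stands in for the ADR's `Progress` type, and the rules for choosing between `Completed` and `PartialFailure` are my reading of the lifecycle, not something the ADR spells out.

```rust
use std::collections::HashMap;

/// Per-node progress, mirroring the `BatchNodeState` enum listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum BatchNodeState {
    Queued,
    Uploading(u8), // ASSUMPTION: percent complete stands in for `Progress`
    Rebooting,
    Verifying,
    Done,
    Failed(String),
    Skipped,
}

impl BatchNodeState {
    /// A node is finished once it is done, failed, or skipped.
    pub fn is_terminal(&self) -> bool {
        matches!(self, Self::Done | Self::Failed(_) | Self::Skipped)
    }
}

/// Batch lifecycle states (hypothetical enum named after the lifecycle line).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BatchStatus {
    Planning,
    InProgress,
    Completed,
    PartialFailure,
    Aborted,
}

/// Derive the batch status from node states. ASSUMPTION: an explicit abort
/// wins, an all-queued batch is still planning, and any failure among
/// otherwise finished nodes yields `PartialFailure`.
pub fn batch_status(nodes: &HashMap<String, BatchNodeState>, aborted: bool) -> BatchStatus {
    if aborted {
        return BatchStatus::Aborted;
    }
    if nodes.values().all(|s| *s == BatchNodeState::Queued) {
        return BatchStatus::Planning;
    }
    if nodes.values().any(|s| !s.is_terminal()) {
        return BatchStatus::InProgress;
    }
    if nodes.values().any(|s| matches!(s, BatchNodeState::Failed(_))) {
        BatchStatus::PartialFailure
    } else {
        BatchStatus::Completed
    }
}
```

Deriving the status rather than storing it separately keeps `node_states` as the single source of truth for the aggregate, at the cost of recomputing on each query.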
### Value Objects
- `SerialPort`: `{ name: String, vid: u16, pid: u16, manufacturer: Option<String> }`
@@ -177,6 +209,9 @@ Represents an over-the-air firmware update to a running node.
| `OtaStarted` | `{ session_id, target_mac, firmware_version }` | Discovery (mark node as updating) |
| `OtaCompleted` | `{ session_id, target_mac, new_version }` | Discovery (refresh node info) |
| `OtaFailed` | `{ session_id, target_mac, error }` | UI (show error) |
| `BatchOtaStarted` | `{ batch_id, strategy, node_count }` | UI (show batch progress) |
| `BatchNodeUpdated` | `{ batch_id, mac, state }` | UI (update per-node status), Discovery (refresh) |
| `BatchOtaCompleted` | `{ batch_id, succeeded, failed, skipped }` | UI (show summary), Discovery (full rescan) |
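The three batch events in the table could be modeled as a single enum, with the completion summary counted from per-node outcomes. This is only a sketch: the `BatchOtaEvent` and `Outcome` types and the `completed_event` helper are hypothetical, and plain `String` fields stand in for the ADR's `Uuid`, `MacAddress`, `OtaStrategy`, and `BatchNodeState` types.

```rust
/// Final result for one node in a batch (hypothetical, simplified type).
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Succeeded,
    Failed,
    Skipped,
}

/// Batch OTA events with the payload fields from the table above.
/// ASSUMPTION: strings replace the ADR's richer ID and state types.
#[derive(Debug, Clone, PartialEq)]
pub enum BatchOtaEvent {
    Started { batch_id: String, strategy: String, node_count: usize },
    NodeUpdated { batch_id: String, mac: String, state: String },
    Completed { batch_id: String, succeeded: usize, failed: usize, skipped: usize },
}

/// Build the `BatchOtaCompleted` summary by counting per-node outcomes.
pub fn completed_event(batch_id: &str, outcomes: &[Outcome]) -> BatchOtaEvent {
    let count = |wanted: &Outcome| outcomes.iter().filter(|o| *o == wanted).count();
    BatchOtaEvent::Completed {
        batch_id: batch_id.to_string(),
        succeeded: count(&Outcome::Succeeded),
        failed: count(&Outcome::Failed),
        skipped: count(&Outcome::Skipped),
    }
}
```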
### Anti-Corruption Layer

@@ -637,7 +637,103 @@ cargo build --release -p wifi-densepose-sensing-server
# target/release/sensing-server -> crates/wifi-densepose-desktop/binaries/sensing-server-{arch}
```
### 9. Security Considerations
### 9. Persistent Node Registry
Discovery alone is transient — nodes appear when they broadcast, disappear when they don't. A persistent local registry transforms discovery into **reconciliation**.
```
~/.ruview/nodes.db (SQLite via rusqlite)
```
**Schema:**
```sql
CREATE TABLE nodes (
    mac           TEXT PRIMARY KEY,       -- e.g. "AA:BB:CC:DD:EE:FF"
    last_ip       TEXT,                   -- last known IP
    last_seen     INTEGER NOT NULL,       -- Unix timestamp
    firmware      TEXT,                   -- e.g. "0.3.1"
    chip          TEXT DEFAULT 'esp32s3', -- esp32, esp32s3, esp32c3
    mesh_role     TEXT DEFAULT 'node',    -- 'coordinator' | 'node' | 'aggregator'
    tdm_slot      INTEGER,                -- assigned TDM slot index
    capabilities  TEXT,                   -- JSON: {"wasm": true, "ota": true, "csi": true}
    friendly_name TEXT,                   -- user-assigned label
    notes         TEXT                    -- free-form notes
);
```
**Behavior:**
- On discovery broadcast, upsert into registry (update `last_ip`, `last_seen`, `firmware`)
- Dashboard shows **all registered nodes**, dimming those not seen recently
- User can manually add nodes by MAC/IP (for networks without mDNS)
- Export/import registry as JSON for fleet management across machines
- Node health history (uptime, last OTA, error count) tracked over time
This means the desktop app **remembers the mesh** across restarts, which is critical for field deployments where nodes may be offline temporarily.
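The upsert and "dimming" behavior listed above can be sketched without a database. The following is a minimal in-memory illustration under stated assumptions: `NodeRow`, `upsert_seen`, and `is_stale` are hypothetical names, only a subset of the schema's columns is modeled, and the staleness threshold is a parameter the ADR does not specify.

```rust
use std::collections::HashMap;

/// Subset of the `nodes` table columns (hypothetical struct for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct NodeRow {
    pub mac: String,
    pub last_ip: Option<String>,
    pub last_seen: i64, // Unix timestamp, as in the schema
    pub firmware: Option<String>,
    pub friendly_name: Option<String>,
}

/// Upsert on a discovery broadcast: refresh `last_ip`, `last_seen`, and
/// `firmware`, leaving user-owned fields such as `friendly_name` untouched.
pub fn upsert_seen(
    registry: &mut HashMap<String, NodeRow>,
    mac: &str,
    ip: &str,
    firmware: &str,
    now: i64,
) {
    let row = registry.entry(mac.to_string()).or_insert_with(|| NodeRow {
        mac: mac.to_string(),
        last_ip: None,
        last_seen: now,
        firmware: None,
        friendly_name: None,
    });
    row.last_ip = Some(ip.to_string());
    row.last_seen = now;
    row.firmware = Some(firmware.to_string());
}

/// Decide whether the dashboard should dim a node.
/// ASSUMPTION: "not seen recently" means older than `max_age_secs`.
pub fn is_stale(row: &NodeRow, now: i64, max_age_secs: i64) -> bool {
    now - row.last_seen > max_age_secs
}
```

Updating only the discovery-owned columns matters here: a blind replace of the whole row would erase labels and notes the user assigned earlier.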
### 10. OTA Safety Gate — Rolling Updates
Mesh deployments cannot tolerate all nodes rebooting simultaneously. The OTA subsystem includes a **rolling update mode** that preserves sensing continuity:
```rust
use serde::{Deserialize, Serialize};

/// Configuration for a rolling (batch) OTA update.
#[derive(Serialize, Deserialize)]
pub struct BatchOtaConfig {
    /// Update strategy
    pub strategy: OtaStrategy,
    /// Max nodes updating concurrently
    pub max_concurrent: usize,
    /// Delay between batches (seconds)
    pub batch_delay_secs: u64,
    /// Abort if any node fails
    pub fail_fast: bool,
}

/// How nodes are scheduled within a batch update.
#[derive(Serialize, Deserialize)]
pub enum OtaStrategy {
    /// Update one node at a time, wait for it to rejoin mesh
    Sequential,
    /// Update non-adjacent TDM slots to maintain coverage
    TdmSafe,
    /// Update all nodes simultaneously (development only)
    Parallel,
}
```
**`TdmSafe` strategy:**
1. Sort nodes by TDM slot index
2. Update even-slot nodes first (slots 0, 2, 4...)
3. Wait for each to reboot and rejoin mesh (verified via beacon)
4. Then update odd-slot nodes (slots 1, 3, 5...)
5. At no point are adjacent nodes offline simultaneously
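The ordering steps above can be sketched as a planning function that turns a node list into update batches. This is an illustration under assumptions, not the ADR's scheduler: `SlotNode` and `plan_tdm_safe` are hypothetical names, adjacency is taken to mean a slot index difference of 1 (wrap-around between the last and first slot is not considered), and waiting for beacons between batches is left to the caller.

```rust
/// A node and its assigned TDM slot (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct SlotNode {
    pub mac: String,
    pub tdm_slot: u32,
}

/// Plan `TdmSafe` batches: all even-slot nodes first, then all odd-slot
/// nodes, each wave split into groups of at most `max_concurrent` MACs.
/// Two nodes in the same batch never sit in slots that differ by exactly 1.
pub fn plan_tdm_safe(nodes: &[SlotNode], max_concurrent: usize) -> Vec<Vec<String>> {
    // Step 1: sort by TDM slot index.
    let mut sorted: Vec<&SlotNode> = nodes.iter().collect();
    sorted.sort_by_key(|n| n.tdm_slot);

    // Guard against a zero limit, which would make `chunks` panic.
    let group_size = max_concurrent.max(1);

    let mut batches = Vec::new();
    // Steps 2 and 4: the even wave (parity 0), then the odd wave (parity 1).
    for parity in [0u32, 1] {
        let wave: Vec<String> = sorted
            .iter()
            .filter(|n| n.tdm_slot % 2 == parity)
            .map(|n| n.mac.clone())
            .collect();
        for group in wave.chunks(group_size) {
            batches.push(group.to_vec());
        }
    }
    batches
}
```

The caller would process the returned batches in order, waiting for every node in a batch to rejoin the mesh (step 3) and applying `batch_delay_secs` before starting the next one.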
**UI flow:**
- User selects target firmware + target nodes
- App shows pre-update diff (current vs new version per node)
- Progress bar per node with states: `queued → uploading → rebooting → verifying → done`
- Abort button halts remaining updates without rolling back completed ones
- Post-update health check confirms all nodes are sensing
### 11. Plugin Architecture (Future)
This desktop tool is quietly becoming the **control plane for RuView**. Once it manages discovery, firmware, OTA, WASM, sensing, and mesh topology, plugin extensibility becomes inevitable:
- **Firmware management** today → **swarm orchestration** tomorrow
- **WASM upload** today → **edge module marketplace** tomorrow
- **Sensing view** today → **activity classification dashboard** tomorrow
The Tauri command surface should be designed with this trajectory in mind:
- Commands are grouped by bounded context (already done)
- Each context can be extended by loading additional Tauri plugins
- The node registry becomes the source of truth for all plugins
- Event bus (Tauri's `emit`/`listen`) provides cross-plugin communication
This does NOT mean building a plugin system in Phase 1. It means keeping the architecture open to it: no hardcoded views, state flows through the registry, commands are typed and versioned.
### 12. Security Considerations
1. **PSK Storage**: OTA PSK tokens are stored in an encrypted vault via `tauri-plugin-stronghold` or in the platform's native credential store (OS keychain), never in plaintext config files.
@@ -649,7 +745,7 @@ cargo build --release -p wifi-densepose-sensing-server
5. **WASM Signature Verification**: The desktop app can sign WASM modules before upload using a locally stored Ed25519 key pair, complementing the node-side verification (ADR-040).
### 10. Implementation Phases
### 13. Implementation Phases
| Phase | Scope | Effort | Priority |
|-------|-------|--------|----------|