# OTA Pipeline — Full Reproduction Recipe Verbatim agent contribution (2026-05-17), saved as authoritative reference for the WiFi-OTA flow on this RuView fork. Kept whole deliberately — splitting it would lose the diagnostic flowchart. ## TL;DR OTA works because **three FW-side fixes** are in place. Without them the chip receives the firmware, reboots, **panics during early boot of the new partition**, the bootloader rolls back, and from outside it looks like "OTA didn't work" even though the upload succeeded. Most agents focus on the network side (curl, gh-action) and miss it, because the bug lives inside the firmware. --- ## 0 · Prerequisites (without them OTA = panic loop) These three things **must already be in the firmware running on the chip** (i.e. in ota_0/factory before the first OTA). If they're not there, fix once via USB-flash; after that, OTA works. ### A. `OTA_SIZE_UNKNOWN` instead of `OTA_WITH_SEQUENTIAL_WRITES` **File:** `firmware/esp32-csi-node/main/ota_update.c:137` ```c esp_err_t err = esp_ota_begin(update_partition, OTA_SIZE_UNKNOWN, &ota_handle); ``` **Why:** `OTA_WITH_SEQUENTIAL_WRITES` erases 4 KB pages on the fly as it writes. If the new binary (~870 KB) is smaller than the previous one in the same partition (~1.1 MB), **tail of the old code stays in the partition**. The SHA-image-verify in `esp_ota_end()` only checks the declared image-header length — residual code isn't covered. After reboot the new app may jump into IRAM / a .literal pool address overlapped by stale code → **Guru Meditation Error** → bootloader rolls back. `OTA_SIZE_UNKNOWN` forces a **full partition erase before write** (~1.5 s overhead, unnoticeable). ### B. `config.stack_size = 8192` for httpd **File:** `firmware/esp32-csi-node/main/ota_update.c:225` ```c httpd_config_t config = HTTPD_DEFAULT_CONFIG(); // default stack_size = 4096 config.server_port = OTA_PORT; config.max_uri_handlers = 12; config.recv_wait_timeout = 30; config.stack_size = 8192; // ← critical ``` **Why:** `esp_ota_end()` streams a SHA-256 verify over the entire image and walks the mmap segments = >5 KB of local variables. On the standard 4 KB httpd-task stack → **stack overflow** at validation time. The chip panics **inside the handler**, before `esp_ota_set_boot_partition()`. From outside you see `{"status":"ok"}` (it's sent before `esp_ota_end`), but the partition doesn't switch. ### C. Reset reason logged in `app_main` **File:** `firmware/esp32-csi-node/main/main.c:130-153` ```c static const char *reset_reason_str(esp_reset_reason_t r) { switch (r) { case ESP_RST_PANIC: return "PANIC"; case ESP_RST_TASK_WDT: return "TASK_WDT"; case ESP_RST_SW: return "SW"; ... } } void app_main(void) { esp_reset_reason_t rr = esp_reset_reason(); const esp_partition_t *running = esp_ota_get_running_partition(); ESP_LOGI(TAG, "boot: reset_reason=%s running_partition=%s", reset_reason_str(rr), running ? running->label : "?"); ... } ``` **Why:** Without this line you **cannot tell** "new image booted cleanly after OTA" from "new image panicked → rolled back". `/ota/status` looks the same (or suspicious) in both cases. With this line the first UART line after boot tells the truth: - `reset_reason=SW running_partition=ota_1` → OTA OK, new image in ota_1. - `reset_reason=PANIC running_partition=ota_0` → new image panicked, rollback worked. **This is the case other agents get stuck in — without the log it's impossible to diagnose.** --- ## 1 · Wire format of POST /ota **Endpoint:** `POST http://:8032/ota` **Headers:** - `Content-Type: application/octet-stream` (required) - `Content-Length: ` (curl/urllib sets it) - `Authorization: Bearer ` (only if `security/ota_psk` is in NVS) **Body:** raw bytes of `build/esp32-csi-node.bin` — no multipart, no base64. **Response on success:** ```json {"status":"ok","message":"OTA update successful. Rebooting..."} ``` **Important about the response:** the chip sends it **before `esp_restart()`**, but `vTaskDelay(1000ms)` between response and restart **does not guarantee delivery**. On macOS / Linux curl will see: - `{"status":"ok"...}`, or - `Connection reset by peer` (TCP RST from the dying side), or - `Recv failure`. **All three are upload success.** The real check is NOT curl's status — it's a **second GET `/ota/status` after reboot**. --- ## 2 · Chip's path through the handler ``` HTTP POST /ota │ ▼ ota_check_auth(req) ← if PSK in NVS, verifies Authorization header │ ▼ esp_ota_get_next_update_partition(NULL) │ ← running in ota_0 → returns ota_1, and vice-versa ▼ esp_ota_begin(part, OTA_SIZE_UNKNOWN, &handle) │ ← full erase of target partition (~1.5 s) ▼ loop { received = httpd_req_recv(req, buf, 1024) esp_ota_write(handle, buf, received) } ← writes in 1 KB chunks │ ▼ esp_ota_end(handle) ← SHA-256 verify over the entire image (>5 KB stack) │ ▼ esp_ota_set_boot_partition(part) ← writes "boot from target" into otadata │ ▼ httpd_resp_send(JSON) ← replies {"status":"ok"...} │ ▼ vTaskDelay(1000ms) ← window so TCP flush goes out (best-effort) │ ▼ esp_restart() ← soft reset via RTC_SW_CPU_RST │ ▼ [bootloader picks ota_1 from otadata → loads new image → app_main] │ ▼ "I (335) main: boot: reset_reason=SW running_partition=ota_1" ``` --- ## 3 · Flashing via `scripts/ota-deploy.sh` ```bash # Scenario A — deploy to all nodes on local /24 (auto-discover): scripts/ota-deploy.sh # Scenario B — specific IPs: scripts/ota-deploy.sh 192.168.0.100 192.168.0.101 # Scenario C — build before deploy: scripts/ota-deploy.sh --build # Scenario D — with auth: OTA_PSK=your_token scripts/ota-deploy.sh ``` **What the script does under the hood (4 phases):** ### Phase 1 — discovery ```python arp -a -n → ['192.168.0.100', '192.168.0.101', ...] # parallel GET /ota/status:8032 (timeout 1.5s) # only IPs that return valid JSON survive ``` If ARP is empty (fresh Mac boot) → fallback ping-sweep `.100`–`.110`. ### Phase 2 — snapshot before ``` GET /ota/status:8032 on each node → remember running_partition (ota_0 or ota_1) ``` ### Phase 3 — parallel upload ```python ThreadPoolExecutor(max_workers=len(targets)) for each node: urllib POST with body = read_bytes(esp32-csi-node.bin) ConnectionResetError caught as expected (that's the reboot) ``` ### Phase 4 — verify ``` sleep 10 ← wait for boot to finish for each node (up to 6 retries, 3-s delay): GET /ota/status:8032 new_part != old_part → ✓ new_part == old_part → ✗ FAIL (panicked) exit 0 if all OK, 1 if any node didn't confirm ``` --- ## 4 · Diagnosis when "OTA doesn't work" Flowchart that catches **every observable failure mode** on ESP32-S3 in this FW: ``` GET /ota/status works? ├── 404/timeout → node offline / wrong network / IP changed (check `arp -a`) ├── 200, time=OLD → OTA didn't take (see below) └── 200, time=NEW → OTA OK ✓ OTA didn't take — diagnose via UART (USB!): See "boot: reset_reason=..." in UART? ├── reset_reason=POWERON → chip didn't reboot — POST didn't arrive, check curl ├── reset_reason=SW AND running_partition=ota_X → OTA OK, may be server-side cache ├── reset_reason=PANIC AND running_partition=ota_0 │ → NEW image panics at boot │ → causes (most likely first): │ 1. OTA_WITH_SEQUENTIAL_WRITES → tail of old code (fix A above) │ 2. esp_ota_end stack overflow (fix B above) │ 3. ABI mismatch bootloader vs new app (USB-flash bootloader.bin) │ 4. real bug in new code (read the backtrace before PANIC) ├── reset_reason=TASK_WDT → handler hung mid-upload └── reset_reason=BROWNOUT → power supply browned out under stress (USB on bus power?) ``` If UART is unavailable (no USB) but HTTP works: POST then GET `/ota/status` three times at 5 s intervals. If `next_partition` flip-flops, the chip is in a panic loop. That's a definitive diagnosis. --- ## 5 · Why other agents fail (common pitfalls) | Pitfall | Symptom | Fix | |---|---|---| | Treat OTA as a pure network problem, never look at FW | "POST returned 200 but time doesn't change" → endless curl-header experiments | **Verify the three FW prerequisites first**, before any curl | | Use `OTA_WITH_SEQUENTIAL_WRITES` (it's in IDF examples) | OTA works once, stops working after binary size changes | Switch to `OTA_SIZE_UNKNOWN` | | Leave httpd stack at 4 KB | Sometimes works (fast SHA), sometimes doesn't — looks flaky | `config.stack_size = 8192` | | Enable `CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE=y` "for safety" | Every OTA rolled back because nobody calls `esp_ota_mark_app_valid_cancel_rollback()` | Either disable, or call the API after 10 s | | `curl` without `--data-binary` (only `-d`) | Binary corrupted by HTML-encoding | Use `--data-binary @file.bin` or urllib bytes | | Measure success by HTTP response code | Connection reset = normal (esp_restart kills socket), not failure | Re-check via **GET /ota/status after reboot** | | Don't wait 10 s after reboot before verify | Verify times out, agent thinks OTA failed | `sleep 10` (or backoff retries) | | Ignore that mDNS names drift | Flash the wrong node, or stale ARP cache | Auto-discover by IP **at deploy time**, not by hostname | | Share a single file descriptor across upload threads | Race conditions, partial reads | Each upload-thread opens its own file | | Rely on bootloader rollback instead of explicit app_valid | Image sometimes flagged BAD, OTA becomes non-idempotent | If rollback enabled, MUST call `esp_ota_mark_app_valid_cancel_rollback()` | --- ## 6 · Things other agents do **wrong** From recurring patterns in others' logs: 1. **Rely on `idf.py flash --port .../ota`** — that mode does NOT exist in idf.py. OTA is only via the HTTP handler. 2. **Send via `ssh esp32 'esp_ota_write ...'`** — ESP32 has no shell; OTA is only via the HTTP endpoint. 3. **Run MQTT-based OTA** — this FW has no MQTT client; only HTTP POST on 8032. 4. **Use ESP RainMaker / esp_https_ota** — those require HTTPS + cert; we serve plain HTTP. Don't confuse the APIs. 5. **Re-use an old build of `firmware/esp32-csi-node/build/esp32-csi-node.bin`** — forget to run `idf.py build`. The script's `--build` solves that. --- ## 7 · Quick reference (for the next agent) ```bash # Once over USB if the nodes still run pre-fix firmware: cd /Users/arsen/Desktop/RuView/firmware/esp32-csi-node source ~/esp/esp-idf-v5.2/export.sh idf.py build # Hold BOOT+RESET on the device cd build esptool.py --chip esp32s3 --port /dev/cu.usbmodem... -b 460800 \ --before default-reset --after hard-reset write-flash \ --flash-mode dio --flash-size 8MB --flash-freq 80m \ 0x0 bootloader/bootloader.bin \ 0x8000 partition_table/partition-table.bin \ 0xf000 ota_data_initial.bin \ 0x20000 esp32-csi-node.bin # Forever after, over WiFi: scripts/ota-deploy.sh --build # (auto-discover, parallel POST, verify, exit code) ``` ## Operator REST endpoints on the running FW (port 8032) After the first OTA the FW exposes three control endpoints. They share the same Bearer-PSK auth as `/ota` (open when `security/ota_psk` NVS key is unset, gated when set). All accept plain HTTP — no JSON dependency on the FW side. | Method | Path | Body | Purpose | ADR | |---|---|---|---|---| | `GET` | `/ota/status` | — | Version, date, running/next partition, max image size | ADR-045 | | `POST` | `/ota` | image bin | Upload + flash (auth-gated) | ADR-045 | | `POST` | `/ota/recalibrate` | — | Clear `csi_cfg/gl_agc` + `gl_fft` + `gl_ap_mac`, reboot — forces fresh gain-lock at next boot | ADR-109 | | `POST` | `/ota/set-target` | `IPv4:PORT` plain text | Write `csi_cfg/target_ip` + `target_port` to NVS, reboot — repoints the CSI aggregator after Mac IP move / router swap without USB | ADR-115 | Examples (operator side, no USB): ```bash # After moving Mac to a new LAN / changing routers: curl -s -X POST -d '192.168.0.103:5005' http://192.168.0.100:8032/ota/set-target curl -s -X POST -d '192.168.0.103:5005' http://192.168.0.101:8032/ota/set-target # Each returns {"status":"ok","target_ip":"...","target_port":...,"message":"rebooting"} # After AP swap that changed the indoor path geometry: curl -X POST http://192.168.0.100:8032/ota/recalibrate # Sensor reboots, re-runs the 300-packet gain-lock sampler (~3–12s). # Sanity probe: curl http://192.168.0.100:8032/ota/status ``` With auth provisioned (`security/ota_psk` in NVS): ```bash curl -X POST -H "Authorization: Bearer $RUVIEW_OTA_PSK" \ -d '192.168.0.103:5005' \ http://192.168.0.100:8032/ota/set-target ``` --- **Bottom line:** OTA is not "send a file via curl", it's an **end-to-end protocol** between the on-chip handler and the host tooling. 80 % of the work lives on the FW side (correct erase, correct stack, correct log). The network part is trivial (`urllib.request.urlopen(POST)`). Agents who "can't" usually stopped at the network layer and didn't realise the chip is panicking.