OTA Pipeline — Full Reproduction Recipe

Verbatim agent contribution (2026-05-17), saved as authoritative reference for the WiFi-OTA flow on this RuView fork. Kept whole deliberately — splitting it would lose the diagnostic flowchart.

TL;DR

OTA works because three FW-side fixes are in place. Without them the chip receives the firmware, reboots, panics during early boot of the new partition, the bootloader rolls back, and from outside it looks like "OTA didn't work" even though the upload succeeded. Most agents focus on the network side (curl, gh-action) and miss it, because the bug lives inside the firmware.

0 · Prerequisites (without them OTA = panic loop)

These three things must already be in the firmware running on the chip (i.e. in ota_0/factory before the first OTA). If they're not there, fix once via USB-flash; after that, OTA works.

A. `OTA_SIZE_UNKNOWN` instead of `OTA_WITH_SEQUENTIAL_WRITES`

File: firmware/esp32-csi-node/main/ota_update.c:137

esp_err_t err = esp_ota_begin(update_partition, OTA_SIZE_UNKNOWN, &ota_handle);

Why: OTA_WITH_SEQUENTIAL_WRITES erases 4 KB pages on the fly as it writes. If the new binary (~870 KB) is smaller than the previous one in the same partition (~1.1 MB), the tail of the old code stays in the partition. The SHA-256 image verification in esp_ota_end() only covers the length declared in the image header, so the residual code is never checked. After reboot the new app may jump into an IRAM or .literal pool address overlapped by stale code → Guru Meditation Error → bootloader rolls back.

OTA_SIZE_UNKNOWN forces a full partition erase before write (~1.5 s overhead, unnoticeable).

B. `config.stack_size = 8192` for httpd

File: firmware/esp32-csi-node/main/ota_update.c:225

httpd_config_t config = HTTPD_DEFAULT_CONFIG();   // default stack_size = 4096
config.server_port = OTA_PORT;
config.max_uri_handlers = 12;
config.recv_wait_timeout = 30;
config.stack_size = 8192;                          // ← critical

Why: esp_ota_end() streams a SHA-256 verify over the entire image and walks the mmap segments, which needs more than 5 KB of local variables on the stack. On the default 4 KB httpd task stack that means a stack overflow at validation time. The chip panics inside the handler, before esp_ota_set_boot_partition(). From outside you see {"status":"ok"} (it's sent before esp_ota_end), but the partition doesn't switch.

C. Reset reason logged in `app_main`

File: firmware/esp32-csi-node/main/main.c:130-153

static const char *reset_reason_str(esp_reset_reason_t r) {
    switch (r) {
        case ESP_RST_PANIC:    return "PANIC";
        case ESP_RST_TASK_WDT: return "TASK_WDT";
        case ESP_RST_SW:       return "SW";
        ...
    }
}
void app_main(void) {
    esp_reset_reason_t rr = esp_reset_reason();
    const esp_partition_t *running = esp_ota_get_running_partition();
    ESP_LOGI(TAG, "boot: reset_reason=%s running_partition=%s",
             reset_reason_str(rr),
             running ? running->label : "?");
    ...
}

Why: Without this line you cannot tell "new image booted cleanly after OTA" from "new image panicked → rolled back". /ota/status looks the same (or suspicious) in both cases. With this line the first UART line after boot tells the truth:

reset_reason=SW running_partition=ota_1 → OTA OK, new image in ota_1.
reset_reason=PANIC running_partition=ota_0 → new image panicked, rollback worked. This is the case other agents get stuck in — without the log it's impossible to diagnose.

1 · Wire format of POST /ota

Endpoint: POST http://<node-ip>:8032/ota

Headers:

Content-Type: application/octet-stream (required)
Content-Length: <bytes> (curl/urllib sets it)
Authorization: Bearer <psk> (only if security/ota_psk is in NVS)

Body: raw bytes of build/esp32-csi-node.bin — no multipart, no base64.

Response on success:

{"status":"ok","message":"OTA update successful. Rebooting..."}

Note on the response: the chip sends it before esp_restart(), but the 1000 ms vTaskDelay between response and restart does not guarantee delivery. On macOS / Linux, curl will report one of:

{"status":"ok"...}, or
Connection reset by peer (TCP RST from the dying side), or
Recv failure.

All three mean the upload succeeded. The real check is NOT curl's status; it is a second GET /ota/status after reboot.
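
A minimal host-side sketch of the request and response handling described above, using only the Python standard library. The helper names (`build_ota_request`, `classify_upload`, `upload`) and the `"sent"` / `"failed"` labels are hypothetical and are not taken from `scripts/ota-deploy.sh`. It assumes that a connection reset or truncated read means "reboot in progress", as the text states.

```python
import http.client
import urllib.error
import urllib.request

OTA_PORT = 8032  # port from the endpoint description above


def build_ota_request(node_ip, image, psk=None):
    """Build POST /ota with the raw image bytes (no multipart, no base64)."""
    headers = {"Content-Type": "application/octet-stream"}
    if psk:  # only needed when security/ota_psk is provisioned in NVS
        headers["Authorization"] = f"Bearer {psk}"
    url = f"http://{node_ip}:{OTA_PORT}/ota"
    # urllib adds Content-Length for a bytes body when the request is sent.
    return urllib.request.Request(url, data=image, headers=headers, method="POST")


def classify_upload(exc):
    """Map the POST outcome to 'sent' or 'failed'.

    'sent' only means the upload probably went through; the real check is
    a later GET /ota/status.
    """
    if exc is None:
        return "sent"  # the JSON reply arrived
    # URLError wraps the underlying socket error in .reason
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    reboot_errors = (ConnectionResetError,
                     http.client.RemoteDisconnected,
                     http.client.IncompleteRead)
    if isinstance(reason, reboot_errors):
        return "sent"  # socket killed by the reboot
    return "failed"  # e.g. timeout, connection refused, HTTP 401


def upload(node_ip, image, psk=None, timeout=60):
    """Send the image and classify the result; never raises on network errors."""
    request = build_ota_request(node_ip, image, psk)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            resp.read()
        return classify_upload(None)
    except (OSError, http.client.HTTPException) as exc:
        return classify_upload(exc)
```

A caller would still follow `upload(...)` with a status poll, since `"sent"` says nothing about whether the new image booted.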

2 · Chip's path through the handler

HTTP POST /ota
    │
    ▼
ota_check_auth(req)              ← if PSK in NVS, verifies Authorization header
    │
    ▼
esp_ota_get_next_update_partition(NULL)
    │                            ← running in ota_0 → returns ota_1, and vice-versa
    ▼
esp_ota_begin(part, OTA_SIZE_UNKNOWN, &handle)
    │                            ← full erase of target partition (~1.5 s)
    ▼
loop {
    received = httpd_req_recv(req, buf, 1024)
    esp_ota_write(handle, buf, received)
}                                ← writes in 1 KB chunks
    │
    ▼
esp_ota_end(handle)              ← SHA-256 verify over the entire image (>5 KB stack)
    │
    ▼
esp_ota_set_boot_partition(part) ← writes "boot from target" into otadata
    │
    ▼
httpd_resp_send(JSON)            ← replies {"status":"ok"...}
    │
    ▼
vTaskDelay(1000ms)               ← window so TCP flush goes out (best-effort)
    │
    ▼
esp_restart()                    ← soft reset via RTC_SW_CPU_RST
    │
    ▼
[bootloader picks ota_1 from otadata → loads new image → app_main]
    │
    ▼
"I (335) main: boot: reset_reason=SW running_partition=ota_1"

3 · Flashing via `scripts/ota-deploy.sh`

# Scenario A — deploy to all nodes on local /24 (auto-discover):
scripts/ota-deploy.sh

# Scenario B — specific IPs:
scripts/ota-deploy.sh 192.168.0.100 192.168.0.101

# Scenario C — build before deploy:
scripts/ota-deploy.sh --build

# Scenario D — with auth:
OTA_PSK=your_token scripts/ota-deploy.sh

What the script does under the hood (4 phases):

Phase 1 — discovery

arp -a -n  →  ['192.168.0.100', '192.168.0.101', ...]
# parallel GET http://<ip>:8032/ota/status (timeout 1.5 s)
# only IPs that return valid JSON survive

If ARP is empty (fresh Mac boot) → fallback ping-sweep .100–.110.
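
The discovery step could be sketched as follows. This is an illustration, not the script's actual code: the function names are hypothetical, the regular expression assumes the macOS-style `? (ip) at mac on en0` layout of `arp -a -n`, and dropping addresses that end in `.255` assumes a /24 network. The fallback only lists candidate addresses; the HTTP probe would still decide which ones are real nodes.

```python
import ipaddress
import re

# Matches the "(a.b.c.d)" part of each arp line.
_ARP_IP = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\)")


def parse_arp_ips(arp_output):
    """Extract unique, valid unicast IPv4 addresses, in order of appearance."""
    found = []
    for match in _ARP_IP.finditer(arp_output):
        try:
            addr = ipaddress.IPv4Address(match.group(1))
        except ipaddress.AddressValueError:
            continue  # e.g. an octet above 255
        text = str(addr)
        if addr.is_multicast or text.endswith(".255"):
            continue  # skip multicast and (assuming /24) broadcast entries
        if text not in found:
            found.append(text)
    return found


def fallback_sweep(prefix, first=100, last=110):
    """Candidate addresses for the .100 to .110 fallback sweep."""
    return [f"{prefix}.{host}" for host in range(first, last + 1)]


def candidate_ips(arp_output, prefix="192.168.0"):
    """Use ARP entries when present, otherwise the fallback range."""
    return parse_arp_ips(arp_output) or fallback_sweep(prefix)
```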

Phase 2 — snapshot before

GET http://<ip>:8032/ota/status on each node
→  remember running_partition (ota_0 or ota_1)

Phase 3 — parallel upload

ThreadPoolExecutor(max_workers=len(targets))
for each node:
    urllib POST with body = read_bytes(esp32-csi-node.bin)
    ConnectionResetError caught as expected (that's the reboot)
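
One way the parallel phase might look, as a sketch under stated assumptions: `upload_all` and the `upload_one(ip, image_bytes)` callback are hypothetical names, and treating any other `OSError` as a failed upload is a choice made for this example. Each worker reads the image itself, so no file descriptor is shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def upload_all(targets, image_path, upload_one):
    """Run upload_one(ip, image_bytes) for every target in parallel.

    upload_one is a caller-supplied function that performs the POST and
    returns True when the upload was sent.
    Returns a dict mapping each IP to True (sent) or False (failed).
    """
    if not targets:
        return {}  # ThreadPoolExecutor rejects max_workers=0

    def worker(ip):
        # Each worker opens and reads the file on its own.
        image = Path(image_path).read_bytes()
        try:
            return bool(upload_one(ip, image))
        except ConnectionResetError:
            return True  # expected: the node reboots and drops the socket
        except OSError:
            return False  # timeout, refused connection, unreachable host

    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        return dict(zip(targets, pool.map(worker, targets)))
```

A `True` here still only means "sent"; the verify phase decides whether the OTA took.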

Phase 4 — verify

sleep 10  ← wait for boot to finish
for each node (up to 6 retries, 3-s delay):
    GET http://<ip>:8032/ota/status
    new_part != old_part   →  ✓
    new_part == old_part   →  ✗ FAIL (panicked)
exit 0 if all OK, 1 if any node didn't confirm
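
The verify loop above can be sketched like this. `verify_flip` and the `fetch_partition` callback are hypothetical names; the callback is assumed to perform GET /ota/status and return the running partition label, raising `OSError` while the node is unreachable. Deciding as soon as the node answers (rather than retrying on an unchanged partition) is an assumption of this sketch.

```python
import time


def verify_flip(old_partition, fetch_partition, retries=6, delay=3.0,
                boot_wait=10.0, sleep=time.sleep):
    """Return True if the node comes back running a different partition.

    sleep is injectable so the logic can be exercised without real waits.
    """
    sleep(boot_wait)  # wait for boot to finish before the first probe
    for attempt in range(retries):
        try:
            current = fetch_partition()
        except OSError:
            current = None  # node not reachable yet
        if current is not None:
            # Reachable: a changed partition means the new image is running,
            # an unchanged one means the OTA did not take.
            return current != old_partition
        if attempt < retries - 1:
            sleep(delay)
    return False  # never came back within the retry budget
```

A deploy script would call this once per node and exit non-zero if any call returns False.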

4 · Diagnosis when "OTA doesn't work"

Flowchart that catches every observable failure mode on ESP32-S3 in this FW:

GET /ota/status works?
├── 404/timeout    → node offline / wrong network / IP changed (check `arp -a`)
├── 200, time=OLD  → OTA didn't take (see below)
└── 200, time=NEW  → OTA OK ✓

OTA didn't take — diagnose via UART (USB!):

See "boot: reset_reason=..." in UART?
├── reset_reason=POWERON  → chip didn't reboot — POST didn't arrive, check curl
├── reset_reason=SW  AND  running_partition=ota_X  → OTA OK, may be server-side cache
├── reset_reason=PANIC AND running_partition=ota_0
│       → NEW image panics at boot
│       → causes (most likely first):
│           1. OTA_WITH_SEQUENTIAL_WRITES → tail of old code (fix A above)
│           2. esp_ota_end stack overflow (fix B above)
│           3. ABI mismatch bootloader vs new app (USB-flash bootloader.bin)
│           4. real bug in new code (read the backtrace before PANIC)
├── reset_reason=TASK_WDT → handler hung mid-upload
└── reset_reason=BROWNOUT → power supply browned out under stress
                            (USB on bus power?)

If UART is unavailable (no USB) but HTTP works: POST then GET /ota/status three times at 5 s intervals. If next_partition flip-flops, the chip is in a panic loop. That's a definitive diagnosis.
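
The UART branch of the flowchart and the no-UART flip-flop check can be expressed as a small classifier. This is a sketch: the function names and returned labels are hypothetical, and the `before` argument (the partition the node ran prior to the upload) generalizes the flowchart's hard-coded `ota_0`. The log format is the one printed by the `app_main` snippet in prerequisite C.

```python
import re

_BOOT_LINE = re.compile(r"boot: reset_reason=(\w+) running_partition=(\w+)")


def parse_boot_line(line):
    """Return (reset_reason, running_partition), or None if the line differs."""
    match = _BOOT_LINE.search(line)
    return (match.group(1), match.group(2)) if match else None


def diagnose(reset_reason, running, before):
    """Map one boot log entry to a flowchart outcome."""
    if reset_reason == "SW" and running != before:
        return "ota_ok"
    if reset_reason == "PANIC" and running == before:
        return "new_image_panicked_rolled_back"
    if reset_reason == "POWERON":
        return "no_ota_reboot_check_post"
    if reset_reason == "TASK_WDT":
        return "handler_hung_mid_upload"
    if reset_reason == "BROWNOUT":
        return "power_supply_brownout"
    return "unknown"


def is_panic_loop(next_partitions):
    """No-UART check: True if next_partition changes between successive polls."""
    return any(a != b for a, b in zip(next_partitions, next_partitions[1:]))
```

For the no-UART case, `next_partitions` would hold the `next_partition` values from the three polls taken 5 s apart.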

5 · Why other agents fail (common pitfalls)

Pitfall	Symptom	Fix
Treat OTA as a pure network problem, never look at FW	"POST returned 200 but time doesn't change" → endless curl-header experiments	Verify the three FW prerequisites first, before any curl
Use `OTA_WITH_SEQUENTIAL_WRITES` (it's in IDF examples)	OTA works once, stops working after binary size changes	Switch to `OTA_SIZE_UNKNOWN`
Leave httpd stack at 4 KB	Sometimes works (fast SHA), sometimes doesn't — looks flaky	`config.stack_size = 8192`
Enable `CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE=y` "for safety"	Every OTA rolled back because nobody calls `esp_ota_mark_app_valid_cancel_rollback()`	Either disable, or call the API after 10 s
`curl` without `--data-binary` (only `-d`)	Binary corrupted: `-d` strips CR/LF bytes and posts the body as form data	Use `--data-binary @file.bin` or urllib bytes
Measure success by HTTP response code	Connection reset = normal (esp_restart kills socket), not failure	Re-check via GET /ota/status after reboot
Don't wait 10 s after reboot before verify	Verify times out, agent thinks OTA failed	`sleep 10` (or backoff retries)
Ignore that mDNS names drift	Flash the wrong node, or stale ARP cache	Auto-discover by IP at deploy time, not by hostname
Share a single file descriptor across upload threads	Race conditions, partial reads	Each upload-thread opens its own file
Rely on bootloader rollback instead of explicit app_valid	Image sometimes flagged BAD, OTA becomes non-idempotent	If rollback enabled, MUST call `esp_ota_mark_app_valid_cancel_rollback()`

6 · Things other agents do wrong

From recurring patterns in others' logs:

Rely on idf.py flash --port .../ota — that mode does NOT exist in idf.py. OTA is only via the HTTP handler.
Send via ssh esp32 'esp_ota_write ...' — ESP32 has no shell; OTA is only via the HTTP endpoint.
Run MQTT-based OTA — this FW has no MQTT client; only HTTP POST on 8032.
Use ESP RainMaker / esp_https_ota — those require HTTPS + cert; we serve plain HTTP. Don't confuse the APIs.
Re-use an old build of firmware/esp32-csi-node/build/esp32-csi-node.bin — forget to run idf.py build. The script's --build solves that.

7 · Quick reference (for the next agent)

# Once over USB if the nodes still run pre-fix firmware:
cd /Users/arsen/Desktop/RuView/firmware/esp32-csi-node
source ~/esp/esp-idf-v5.2/export.sh
idf.py build

# Hold BOOT+RESET on the device
cd build
esptool.py --chip esp32s3 --port /dev/cu.usbmodem... -b 460800 \
  --before default_reset --after hard_reset write_flash \
  --flash_mode dio --flash_size 8MB --flash_freq 80m \
  0x0 bootloader/bootloader.bin \
  0x8000 partition_table/partition-table.bin \
  0xf000 ota_data_initial.bin \
  0x20000 esp32-csi-node.bin

# Forever after, over WiFi:
scripts/ota-deploy.sh --build
# (auto-discover, parallel POST, verify, exit code)

Operator REST endpoints on the running FW (port 8032)

After the first OTA the FW exposes three control endpoints. They share the same Bearer-PSK auth as /ota (open when security/ota_psk NVS key is unset, gated when set). All accept plain HTTP — no JSON dependency on the FW side.

Method	Path	Body	Purpose	ADR
`GET`	`/ota/status`	—	Version, date, running/next partition, max image size	ADR-045
`POST`	`/ota`	image bin	Upload + flash (auth-gated)	ADR-045
`POST`	`/ota/recalibrate`	—	Clear `csi_cfg/gl_agc` + `gl_fft` + `gl_ap_mac`, reboot — forces fresh gain-lock at next boot	ADR-109
`POST`	`/ota/set-target`	`IPv4:PORT` plain text	Write `csi_cfg/target_ip` + `target_port` to NVS, reboot — repoints the CSI aggregator after Mac IP move / router swap without USB	ADR-115

Examples (operator side, no USB):

# After moving Mac to a new LAN / changing routers:
curl -s -X POST -d '192.168.0.103:5005' http://192.168.0.100:8032/ota/set-target
curl -s -X POST -d '192.168.0.103:5005' http://192.168.0.101:8032/ota/set-target
# Each returns {"status":"ok","target_ip":"...","target_port":...,"message":"rebooting"}

# After AP swap that changed the indoor path geometry:
curl -X POST http://192.168.0.100:8032/ota/recalibrate
# Sensor reboots, re-runs the 300-packet gain-lock sampler (~3–12s).

# Sanity probe:
curl http://192.168.0.100:8032/ota/status

With auth provisioned (security/ota_psk in NVS):

curl -X POST -H "Authorization: Bearer $RUVIEW_OTA_PSK" \
     -d '192.168.0.103:5005' \
     http://192.168.0.100:8032/ota/set-target
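
For scripted use, the same `/ota/set-target` call could be built in Python with the body validated before it is sent, since a typo here reboots the node toward the wrong aggregator. This is a sketch: `build_set_target_request` is a hypothetical helper, and the `text/plain` content type is an assumption (the table only says the body is plain-text `IPv4:PORT`).

```python
import ipaddress
import urllib.request

OTA_PORT = 8032  # operator endpoints share the OTA port


def build_set_target_request(node_ip, target_ip, target_port, psk=None):
    """Build POST /ota/set-target with an 'IPv4:PORT' plain-text body."""
    ipaddress.IPv4Address(target_ip)  # raises ValueError on a malformed address
    port = int(target_port)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {target_port}")
    headers = {"Content-Type": "text/plain"}  # assumption, see lead-in
    if psk:  # only when security/ota_psk is provisioned in NVS
        headers["Authorization"] = f"Bearer {psk}"
    return urllib.request.Request(
        f"http://{node_ip}:{OTA_PORT}/ota/set-target",
        data=f"{target_ip}:{port}".encode("ascii"),
        headers=headers,
        method="POST",
    )
```

The request would be sent with `urllib.request.urlopen(request, timeout=...)`; because the node reboots afterwards, a dropped connection is to be expected here as well.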

Bottom line: OTA is not "send a file via curl", it's an end-to-end protocol between the on-chip handler and the host tooling. 80 % of the work lives on the FW side (correct erase, correct stack, correct log). The network part is trivial (urllib.request.urlopen(POST)). Agents who "can't" usually stopped at the network layer and didn't realise the chip is panicking.
