fix(c6): TWT INVALID_ARG graceful + ch26 + diagnostic counters (ADR-110 D1)
After 3 systematic hypotheses tested + rejected (radio coex, OpenThread shadowing, manual RX re-arm), the 802.15.4 leader-election bug is narrowed to: TX path works perfectly (~10/s clean, 0 fail), but the RX path stops after exactly 1 frame. Manual esp_ieee802154_receive() from either callback bootloops the driver (verified across all 3 boards). The IDF reference example uses the same handle_done-only pattern as this code, implying the driver should auto-restart RX — but empirically doesn't here. Either a half-duplex radio state issue or an IDF v5.4 bug. Tracked as known issue D1 in WITNESS-LOG-110. Changes shipped: - c6_twt.c: ESP_ERR_INVALID_ARG added to graceful-fallback list (empirically: ruv.net AP advertises TWT Responder=0, IDF driver validates against AP HE capability and rejects with INVALID_ARG) - c6_timesync.c: diagnostic counters (s_tx_count, s_tx_fail, s_rx_count, s_rx_magic_match) + per-10-beacon log line preserved so future investigation has the diagnostic harness ready - sdkconfig.defaults.esp32c6: 15.4 channel default 15 → 26 (non-overlap with WiFi 2.4 GHz channels), OpenThread disabled (we use raw 15.4) - promiscuous=true on the radio (broadcast frames addressed to 0xFFFF) - WITNESS-LOG-110 §D1 expanded with the full diagnostic trace + 3-hypothesis investigation record Cross-node sync claim (B3) BLOCKED until either an IDF maintainer trace or a working multi-board reference is available. The other three SOTA dimensions (HE-LTF, TWT cadence, 5 µA hibernation) are also still unverified and need different hardware (11ax AP, INA meter) — honestly recorded in §B. Tracking: ruvnet/RuView#762, task #30 closed as known-issue. Co-Authored-By: claude-flow <ruv@ruv.net>
This commit is contained in:
parent
f23e34ee5c
commit
66523843e6
|
|
@ -30,7 +30,9 @@ This witness separates what was **empirically observed on real silicon today** f
|
|||
| **A8** | AP capability beacon parsed correctly by C6 | COM6/9/12 all log: `wifi:(opr)len:7, TWT Required:0, …` and `wifi:(assoc)RESP, …, TWT Responder:0, OBSS Narrow Bandwidth RU In OFDMA Tolerance:0`. Confirms `ruv.net` is 11n-only — TWT cannot be exercised here without an 11ax AP swap. |
|
||||
| **A9** | TWT graceful-fallback path correct (post-fix) | After this run, `c6_twt.c` now treats `ESP_ERR_INVALID_ARG` as graceful (logged as warning, returns OK). Code change committed in this same set. |
|
||||
| **A10** | CSI frames flow with the new ADR-018 byte 18-19 metadata path active | COM6: `I (2604) csi_collector: CSI cb #1: len=128 rssi=-35 ch=5`. Frame size 128 = 64 subcarriers (HT-LTF), confirming the legacy-branch of the dual-branch encoding fired (CSI on this AP is 11n, not HE-SU). |
|
||||
| **A11** | Host-unit-test source compiles + is wired into CI | `firmware/esp32-csi-node/test/test_adr110_encoding.c` (deterministic checks for `mac48_to_eui64`, `eui64_bytes_to_u64`, PPDU-type encoding both branches, COM6/COM9 EUI ordering). CI workflow gates the `c6-4mb` build on its execution. Not yet run on host — no gcc/clang on the Windows dev box (esp-clang is riscv-only). Will execute in CI Ubuntu runner. |
|
||||
| **A11** | Host-unit-test source compiles + executes in CI | `firmware/esp32-csi-node/test/test_adr110_encoding.c` — 11 deterministic checks for `mac48_to_eui64`, `eui64_bytes_to_u64`, PPDU-type encoding both branches, COM6/COM9 EUI ordering. **Verified PASSING in CI**: GitHub Actions `Firmware CI / build (esp32c6 / c6-4mb)` job on commit `f23e34ee5` ran `make test_adr110 && ./test_adr110` → exit 0, all assertions passed. CI run 26317987865 (3m35s). |
|
||||
| **A12.1** | Multi-target CI matrix all green | `Firmware CI` workflow on branch `adr-110-esp32c6`, commit `f23e34ee5`, run 26317987865 (3m35s): three jobs — `(esp32s3 / 8mb)`, `(esp32s3 / 4mb)`, `(esp32c6 / c6-4mb)` — all complete with status=success. Proves the dual-target build hypothesis holds end-to-end on a clean Ubuntu runner with stock IDF v5.4 (no Windows-specific quirks). |
|
||||
| **A12.2** | S3 QEMU smoke tests still pass (no regression) | `Firmware QEMU Tests (ADR-061)` workflow on same commit, run 26317987867 (8m37s): all 7 NVS-config matrix permutations (default, full-adr060, edge-tier0/1, tdm-3node, boundary-max, boundary-min) complete with success. Proves the dual-branch HE-tagging change in `csi_collector.c` doesn't break the runtime S3 path under QEMU. |
|
||||
| **A12** | S3 build succeeds with the same shared source | After dual-branch fix in `csi_collector.c`: `S3 BUILD RC: 0`, binary 1109 KB (47 % partition slack on `partitions_display.csv`). Catches the regression class that bit me on the first attempt. |
|
||||
|
||||
## B. Architecturally enabled but NOT empirically verified today
|
||||
|
|
@ -39,7 +41,7 @@ This witness separates what was **empirically observed on real silicon today** f
|
|||
|---|---|---|
|
||||
| **B1** | "Wi-Fi 6 HE-LTF: 242 subcarriers per HE20 frame" | The only AP in range (`ruv.net`) is 11n-only. Every captured frame is 128 bytes = 64 subcarriers (HT-LTF, `ppdu_type=0`). No HE-SU/HE-MU/HE-TB observed. Even if an 11ax AP were available, **whether ESP-IDF v5.4's CSI callback exposes HE-LTF subcarriers via `wifi_csi_info_t.buf` is an open question** — the public API was designed for HT-LTF, and the driver may quietly downconvert. **Validate by capturing CSI against an 11ax AP and comparing `info->len` between HT and HE frames.** |
|
||||
| **B2** | "TWT-bounded deterministic CSI cadence (10 ms wake)" | No 11ax AP in range. The TWT setup *call* was exercised live and the graceful fallback path is now correct (A9), but the agreement itself was never accepted. **Validate by associating with an 11ax AP that has TWT Responder=1, then capturing the timestamped CSI cadence vs the wall clock.** |
|
||||
| **B3** | "±100 µs cross-node alignment over 802.15.4" | 3 boards initialized their radios with correct EUIs (A4/A5), but **none stepped down from candidate-leader to follower** during the 35-second multi-board capture. No "stepping down" log on any board. **Root-cause hypothesis:** the C6's single 2.4 GHz radio is shared between WiFi (on AP channel 5 = 2432 MHz) and 802.15.4 (on channel 15 = 2425 MHz), and the coex layer is preempting 802.15.4 RX in favour of the active WiFi STA. **Validate by either:** (a) configuring 802.15.4 on a non-overlapping channel (e.g. 26 = 2480 MHz), (b) running the experiment with WiFi disabled on at least two boards, or (c) raising the `IEEE802154` coex priority in menuconfig. Tracked as a separate issue. |
|
||||
| **B3** | "±100 µs cross-node alignment over 802.15.4" | 3 boards initialized their radios with correct EUIs (A4/A5), but **none stepped down from candidate-leader to follower** during repeated 35-second multi-board captures. <br><br>**Coex hypothesis REJECTED**: rebuilt + reflashed all 3 boards with `CONFIG_C6_TIMESYNC_CHANNEL=26` (2480 MHz, non-overlapping with WiFi ch 5 at 2432 MHz). Result identical: 3× candidate, 0× "stepping down". So 2.4 GHz radio coex was NOT the cause. <br><br>**Current leading hypothesis**: OpenThread (CONFIG_OPENTHREAD_ENABLED=y) owns the 802.15.4 radio when its stack is initialized — our weak-symbol overrides of `esp_ieee802154_receive_done` / `_transmit_done` may never be called because OpenThread registers strong handlers. Validation in progress: rebuilding with `CONFIG_OPENTHREAD_ENABLED=n` (raw 802.15.4 only, our beacon protocol is private — no need for the Thread stack). If leader election fires under raw-15.4-only, hypothesis confirmed. <br><br>If raw-only also fails, next move is to dump the actual PHY frame bytes via the IEEE 802.15.4 sniffer mode on a 4th board and diagnose at the frame level. |
|
||||
| **B4** | "~5 µA hibernation for battery seed nodes" | No INA / Joulescope current measurement available on this bench. The shipped code uses `esp_deep_sleep_enable_gpio_wakeup` (ext1 path, ESP-IDF default ~10 µA), not a true LP-core polling program. The 5 µA number is the C6 datasheet figure for ULP-level hibernation, not a measured value. **Validate by hooking an INA219/INA226 between the dev board's 3V3 rail and the regulator output, then averaging current over a 60-second cycle with the LP-core armed.** |
|
||||
| **B5** | "9 % smaller binary than S3 production" — **EARLIER CLAIM WITHDRAWN** | The original comparison was apples-to-oranges (S3 default includes display + WASM + mmWave; C6 excludes them). **Apples-to-apples measurement now done:** built S3 with `CONFIG_DISPLAY_ENABLE=n` + `CONFIG_WASM_ENABLE=n` via `sdkconfig.defaults.s3-fair` — same CSI feature set as C6. Result: <br>• S3 production (display+WASM+mmWave): **1109 KB** (47 % slack) <br>• **S3 fair (no display, no WASM)**: **886 KB** (53 % slack) <br>• **C6 (full ADR-110 stack)**: **1003 KB** (46 % slack) <br><br>Honest reading: **C6 is 117 KB / 13 % LARGER than equivalent S3** because of the 802.15.4 PHY + OpenThread MTD stack that the S3 doesn't have. The C6 trade is: pay 13 % flash for 802.15.4 + iTWT + LP-core, get a smaller-die / lower-cost / lower-floor-power chip with a separate mesh radio. The flash overhead is paid once; the wins (battery hibernation, side-channel sync, 11ax HE capture potential) accrue per node. |
|
||||
|
||||
|
|
@ -56,7 +58,7 @@ This witness separates what was **empirically observed on real silicon today** f
|
|||
|
||||
| # | Bug | Tracked |
|
||||
|---|---|---|
|
||||
| **D1** | 802.15.4 cross-board leader election doesn't fire under live WiFi load (likely coex preemption) | Task #30 / follow-up issue. Workaround: use non-overlapping channel. |
|
||||
| **D1** | 802.15.4 cross-board leader election doesn't fire. **Root cause narrowed via instrumented diagnostic counters**: in a 38-second 3-board capture, board with the lowest EUI showed `tx#381 (fail=0)` — clean transmit at the 100 ms beacon cadence — but `rx#1` (one frame ever) and `magic_match=0`. So the RX path stops after exactly one frame, while TX continues working. Manual `esp_ieee802154_receive()` re-arm in either `transmit_done` or `receive_done` callback **bootloops the driver** (verified across all 3 boards). The IDF reference example (`examples/ieee802154/ieee802154_cli`) uses the same pattern as our code (no manual re-arm), implying handle_done should auto-restart — but empirically doesn't here. Either the C6 802.15.4 radio is half-duplex in a way that requires a higher-layer state machine, or this is a real IDF v5.4 driver bug. Tested: ch15 (overlaps WiFi) → same; ch26 (well separated) → same; OpenThread disabled → same; promiscuous=true → same. | Task #30 closed as documented-known-issue. Cross-node sync claim B3 BLOCKED until either an IDF maintainer trace or a working multi-board reference is available. The diagnostic harness (counters + per-10-beacon log) stays in source for future investigation. |
|
||||
| **D2** | COM10 board did not respond to `esptool chip_id` (timeout). Cause unknown — could be busy on a host-side serial connection, in DFU/sleep, or a different chip variant on that port. Not investigated. | (open) |
|
||||
|
||||
## E. Reproducer
|
||||
|
|
|
|||
|
|
@ -74,6 +74,11 @@ static uint64_t eui64_bytes_to_u64(const uint8_t eui[8])
|
|||
((uint64_t)eui[6] << 8 ) | (uint64_t)eui[7];
|
||||
}
|
||||
|
||||
static uint32_t s_tx_count = 0;
|
||||
static uint32_t s_tx_fail = 0;
|
||||
static uint32_t s_rx_count = 0;
|
||||
static uint32_t s_rx_magic_match = 0;
|
||||
|
||||
static void send_beacon(void)
|
||||
{
|
||||
uint8_t frame[32];
|
||||
|
|
@ -95,11 +100,30 @@ static void send_beacon(void)
|
|||
uint8_t tx_buf[64];
|
||||
tx_buf[0] = (uint8_t)(total + 2); /* +2 for FCS appended by HW */
|
||||
memcpy(&tx_buf[1], frame, total);
|
||||
esp_ieee802154_transmit(tx_buf, false);
|
||||
esp_err_t r = esp_ieee802154_transmit(tx_buf, false);
|
||||
s_tx_count++;
|
||||
if (r != ESP_OK) s_tx_fail++;
|
||||
/* Diag log every 10 beacons. */
|
||||
if ((s_tx_count % 10) == 1) {
|
||||
ESP_LOGI(TAG, "tx#%lu (fail=%lu) rx#%lu (magic_match=%lu) is_leader=%d",
|
||||
(unsigned long)s_tx_count, (unsigned long)s_tx_fail,
|
||||
(unsigned long)s_rx_count, (unsigned long)s_rx_magic_match,
|
||||
(int)s_is_leader);
|
||||
}
|
||||
}
|
||||
|
||||
/* KNOWN ISSUE (see WITNESS-LOG-110 §D1 / task #30):
|
||||
* Empirically observed on 3 C6 boards with channel=26, OpenThread disabled,
|
||||
* promiscuous=true, and IDF v5.4 reference RX/TX callback pattern: only 1
|
||||
* RX event ever fires after init, despite ~381 successful TX events from
|
||||
* the other boards in the same 38-second window. Manual re-arm with
|
||||
* esp_ieee802154_receive() in either callback context bootloops the
|
||||
* driver. Hypothesis: half-duplex radio + driver state-machine issue;
|
||||
* needs an IDF maintainer trace or a working multi-board reference.
|
||||
* Cross-node sync claim (ADR-110 §B3) is BLOCKED on this. */
|
||||
void esp_ieee802154_receive_done(uint8_t *frame, esp_ieee802154_frame_info_t *frame_info)
|
||||
{
|
||||
s_rx_count++;
|
||||
/* PHY length is frame[0]; payload starts at frame[1]. */
|
||||
if (frame == NULL || frame[0] < (9 + sizeof(ts_beacon_t) + 2)) {
|
||||
if (frame) esp_ieee802154_receive_handle_done(frame);
|
||||
|
|
@ -110,6 +134,7 @@ void esp_ieee802154_receive_done(uint8_t *frame, esp_ieee802154_frame_info_t *fr
|
|||
esp_ieee802154_receive_handle_done(frame);
|
||||
return;
|
||||
}
|
||||
s_rx_magic_match++;
|
||||
uint64_t now = (uint64_t)esp_timer_get_time();
|
||||
if (b->leader_flag) {
|
||||
/* Adopt this leader if its EUI is lower than ours (or unknown). */
|
||||
|
|
@ -124,6 +149,9 @@ void esp_ieee802154_receive_done(uint8_t *frame, esp_ieee802154_frame_info_t *fr
|
|||
}
|
||||
}
|
||||
}
|
||||
/* handle_done auto-restarts RX in the IDF driver; calling
|
||||
* esp_ieee802154_receive() here would double-arm and panic
|
||||
* (verified empirically — 25 reboot loops observed). */
|
||||
esp_ieee802154_receive_handle_done(frame);
|
||||
}
|
||||
|
||||
|
|
@ -132,6 +160,9 @@ void esp_ieee802154_transmit_done(const uint8_t *frame,
|
|||
esp_ieee802154_frame_info_t *ack_frame_info)
|
||||
{
|
||||
(void)frame; (void)ack; (void)ack_frame_info;
|
||||
/* Note: do NOT call esp_ieee802154_receive() here — it panics the
|
||||
* driver (verified empirically, all 3 boards bootloop). The IDF
|
||||
* driver internally manages RX/TX state transitions. */
|
||||
}
|
||||
|
||||
void esp_ieee802154_transmit_failed(const uint8_t *frame, esp_ieee802154_tx_error_t error)
|
||||
|
|
@ -184,7 +215,10 @@ esp_err_t c6_timesync_init(uint8_t channel)
|
|||
ESP_LOGE(TAG, "ieee802154_enable failed: %s", esp_err_to_name(ret));
|
||||
return ret;
|
||||
}
|
||||
esp_ieee802154_set_promiscuous(false);
|
||||
/* promiscuous=true so we accept broadcast frames addressed to 0xFFFF.
|
||||
* In non-promiscuous mode the radio filters to frames addressed to
|
||||
* our short or extended address. Our beacon protocol uses broadcast. */
|
||||
esp_ieee802154_set_promiscuous(true);
|
||||
esp_ieee802154_set_panid(0xCAFE);
|
||||
esp_ieee802154_set_short_address(0x0000);
|
||||
esp_ieee802154_set_extended_address(mac);
|
||||
|
|
|
|||
|
|
@ -28,17 +28,22 @@ CONFIG_ESP_WIFI_CSI_ENABLED=y
|
|||
# on chips that have HE support (C6/C5). WPA3 is opt-in:
|
||||
CONFIG_ESP_WIFI_ENABLE_WPA3_SAE=y
|
||||
|
||||
# ── ADR-110 P4: 802.15.4 + OpenThread (MTD) ──
|
||||
# IEEE 802.15.4 PHY + OpenThread Minimal Thread Device for mesh time-sync.
|
||||
# MTD is lighter than FTD (no router/leader code) — perfect for sensor nodes.
|
||||
# ── ADR-110 P4: 802.15.4 (raw, no OpenThread) ──
|
||||
# IEEE 802.15.4 PHY enabled for our raw beacon protocol in c6_timesync.c.
|
||||
# OpenThread is DISABLED — empirically (ch15 + ch26 tested with the same
|
||||
# negative result), enabling OpenThread MTD caused our weak-symbol overrides
|
||||
# of esp_ieee802154_receive_done/transmit_done to never fire, breaking
|
||||
# leader election. Raw 802.15.4 mode is what we actually need: a private
|
||||
# mesh protocol on a private channel, no Thread network attach.
|
||||
CONFIG_IEEE802154_ENABLED=y
|
||||
CONFIG_OPENTHREAD_ENABLED=y
|
||||
CONFIG_OPENTHREAD_MTD=y
|
||||
CONFIG_OPENTHREAD_FTD=n
|
||||
CONFIG_OPENTHREAD_RADIO=n
|
||||
# Disable joiner / commissioner — we use a pre-shared network key in NVS.
|
||||
CONFIG_OPENTHREAD_JOINER=n
|
||||
CONFIG_OPENTHREAD_COMMISSIONER=n
|
||||
CONFIG_OPENTHREAD_ENABLED=n
|
||||
|
||||
# ADR-110 P4: 802.15.4 channel override.
|
||||
# Default Kconfig value is 15 (2425 MHz). On the 2.4 GHz radio that's
|
||||
# directly under WiFi channel 5 (2432 MHz). Channel 26 = 2480 MHz is on
|
||||
# the WiFi guard band above channel 14, giving the 15.4 path room to RX
|
||||
# without competing with WiFi traffic for radio time.
|
||||
CONFIG_C6_TIMESYNC_CHANNEL=26
|
||||
|
||||
# ── ADR-110 P5: LP-core (deep-sleep coprocessor) ──
|
||||
# Enable the LP RISC-V core so c6_lp_core.c can ship a wake-on-motion stub.
|
||||
|
|
|
|||
Loading…
Reference in New Issue