Skip to content

Add option to persist boot/update failure info to flash#809

Merged
danielinux merged 11 commits into
wolfSSL:masterfrom
mattia-moffa:20260626-failure-diagnostics
Jul 2, 2026
Merged

Add option to persist boot/update failure info to flash#809
danielinux merged 11 commits into
wolfSSL:masterfrom
mattia-moffa:20260626-failure-diagnostics

Conversation

@mattia-moffa

Copy link
Copy Markdown
Member

When boot/update partition verification fails during boot or update, with this option the event is logged to flash in an ad-hoc partition. Information about logged failures is made available to the application through an API.

For a more detailed description see the changes to docs/API.md.

When boot/update partition verification fails during boot or update,
with this option the event is logged to flash in an ad-hoc partition.
Information about logged failures is made available to the application
through an API.
@mattia-moffa mattia-moffa self-assigned this Jun 26, 2026
Copilot AI review requested due to automatic review settings June 26, 2026 19:54

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an optional “persistent failure diagnostics” feature to wolfBoot: when enabled, boot/update/rollback verification failures are recorded to a dedicated flash region and made available to the application via a small read/clear API.

Changes:

  • Add a flash-backed, circular log for failure records and expose read/clear APIs in libwolfboot.
  • Record verification failures during update/boot and record rollback-not-confirmed events when rollback occurs.
  • Add build options and document the new feature and API.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/update_flash.c Records boot/update verification failures and rollback-not-confirmed events when diagnostics are enabled.
src/libwolfboot.c Implements the on-flash diagnostics log format, scanning/ordering logic, and the public read/clear APIs.
options.mk Adds build-time options/macros to enable diagnostics and configure the reserved flash region.
include/wolfboot/wolfboot.h Defines failure phases/causes, the persisted record structure, and the new public API prototypes.
docs/API.md Documents the new failure diagnostics feature and how applications consume the records.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/libwolfboot.c Outdated
Comment thread src/libwolfboot.c Outdated
Comment thread src/libwolfboot.c Outdated
Comment thread docs/API.md
Comment thread src/libwolfboot.c
@mattia-moffa mattia-moffa marked this pull request as ready for review July 1, 2026 19:28
@mattia-moffa mattia-moffa requested a review from danielinux July 1, 2026 19:29

@danielinux danielinux left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check word size comment.

Secondary notes (non-blocking):

  • No overlap/alignment validation. Address-must-be-sector-aligned and must-not-overlap-partitions are documented but not enforced at compile time. A BUILD_BUG-style check against partition bounds would prevent a footgun.
  • App-side wolfBoot_clear_failures() needs the flash driver. get_failure/clear_failures compile outside the __WOLFBOOT guard, so an application calling clear pulls in hal_flash_unlock/erase/lock. That's presumably intended, but worth a doc note that clearing requires the HAL flash driver linked into the app.
  • diag_read non-ext path ignores unmapped/secure memory faults — fine for memory-mapped internal flash, just noting the XMEMCPY assumes the region is always readable.

Comment thread include/wolfboot/wolfboot.h Outdated
@danielinux danielinux assigned mattia-moffa and unassigned danielinux Jul 2, 2026
@KotMorderca

Copy link
Copy Markdown

Currently the empty slot for storage is picked as the first one with wrong CRC (and it works fine as long as we got happy path and the memory is indeed empty). In case of an interrupted record write (e.g. power failure) this slot will be chosen for the next record infinitely and the whole feature gets stuck. On write-once/ECC flash reprogramming such partially written word is also not allowed, so the write itself can fail.
There is need to distinguish between actually empty slot and unusable dirty slot.

@KotMorderca

Copy link
Copy Markdown

diag_write_header returns the ext_flash_write/hal_flash_write result but it is not consistent across HALs, some return 0 on success and some return len. So this value can't be checked with != 0 - on those HALs a successful header write would be treated as an error and with WOLFBOOT_DIAGNOSTICS_EXT the whole feature stops working. Possibly checking < 0 should be safe for both conventions.

@KotMorderca

Copy link
Copy Markdown

Two verification failure paths don't get recorded at all:

  1. In wolfBoot_start(), when the emergency update succeeds but the re-verification of the new boot image fails ("Boot (try 2) failed"), wolfBoot panics without writing any record. This is probably the most valuable event to persist, because right after it the device becomes unbootable and the diagnostics log is the only trace of what happened.
  2. Rejection of a self-update image in wolfBoot_check_self_update() is also not recorded.

danielinux
danielinux previously approved these changes Jul 2, 2026
- Failed boot after emergency update (WOLFBOOT_FAILURE_PHASE_RECOVERY)
- Failed verification of self-update image
@mattia-moffa

Copy link
Copy Markdown
Member Author

Thanks @KotMorderca!

@KotMorderca

Copy link
Copy Markdown

Thanks @mattia-moffa for fixes! Awesome work!

Few additional smaller findings:

  1. diag_erase() results are ignored at both call sites in wolfBoot_record_failure() (region init and sector rotation). Now that the value is normalized it is a one line check.
  2. A rejected self-update image is recorded with the same phase/partition as a normal update rejection (PHASE_UPDATE + PART_UPDATE), so the application can't tell them apart. Maybe a dedicated phase would be worth it, or at least a note in the docs that PHASE_UPDATE covers both cases.
  3. The docs say the record size must be a multiple of the flash write granularity, but nothing enforces it - a static assert would catch obvious misconfigurations. Related: the slot[]/buf[] byte arrays are not guaranteed to be 32-bit aligned and some HALs cast the write source to uint32_t*, so declaring them as uint32_t arrays would be safer.
  4. The new overlap checks don't cover the bootloader's own region (WOLFBOOT_ORIGIN + BOOTLOADER_PARTITION_SIZE) - placing the diagnostics region there would erase the bootloader at runtime. Same #if pattern as the partition checks would do.

@mattia-moffa

Copy link
Copy Markdown
Member Author

The docs say the record size must be a multiple of the flash write granularity, but nothing enforces it - a static assert would catch obvious misconfigurations.

That would be tricky to enforce, because at the moment we don't have a generic wolfBoot macro that specifies the flash write granularity.

- Check diag_erase() return value in wolfBoot_record_failure()
- Dedicated WOLFBOOT_FAILURE_PHASE_SELF_UPDATE
- Mark the diagnostics flash write buffers XALIGNED_STACK(4) for HALs
  that access the source word by word
- Ensure diagnostics, update, or swap partitions don't overlap the
  bootloader
@KotMorderca

Copy link
Copy Markdown

The docs say the record size must be a multiple of the flash write granularity, but nothing enforces it - a static assert would catch obvious misconfigurations.

That would be tricky to enforce, because at the moment we don't have a generic wolfBoot macro that specifies the flash write granularity.

Maybe those two static asserts would catch the obvious misconfigurations:

  1. (DIAG_RECORD_SIZE % 4) == 0 - odd sizes like 20 silently misalign every slot for HALs that write word by word
  2. DIAG_SLOTS_PER_SECTOR >= 1 - on platforms where the program unit is as large as the erase sector, setting the record size accordingly makes (SECTOR_SIZE - HDR_SIZE) / RECORD_SIZE evaluate to 0. It compiles fine and the feature just silently stores nothing

Apart from that the feature looks good to me now.

Some configs may exploit that possibility
@mattia-moffa

mattia-moffa commented Jul 2, 2026

Copy link
Copy Markdown
Member Author

DIAG_SLOTS_PER_SECTOR >= 1 - on platforms where the program unit is as large as the erase sector, setting the record size accordingly makes (SECTOR_SIZE - HDR_SIZE) / RECORD_SIZE evaluate to 0. It compiles fine and the feature just silently stores nothing

This one is reasonable. Thanks, I added it.

@KotMorderca

Copy link
Copy Markdown

Looks good to me, thanks for all fixes.

@mattia-moffa grat work!

@danielinux danielinux merged commit ca2c7f6 into wolfSSL:master Jul 2, 2026
389 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants