How Sure Are You of Device Reliability?

Posted by: Thom Denholm

For an embedded system, reliability means no unexpected loss of data. Looking below the application, this breaks down into two main categories:

  1. Whether the device remains functional, often after being shelved for a significant period of time. The concern here is usually the flash media, and it depends primarily on hardware specifications and environmental conditions. This has also been referred to in the industry as “data integrity.”
  2. Whether data just written actually resides on the media, usually after a system crash or unexpected loss of power. File system and flash management software are integral to this version of reliability, and the best way to demonstrate that is through effective testing.

When an embedded device isn’t writing data to the media, a power interruption or system crash will lose only the uncommitted file data still held in RAM. Application programmers are usually familiar with this, and tend to issue flush (and related) commands to make sure data isn’t lost. Beyond that, a power interruption causes problems only if it cuts off an atomic multi-block write AND the file system has no way to handle this (e.g., by discarding the results of the partial commit).
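As a rough illustration of the flush pattern described above, here is a minimal Python sketch. The helper name `durable_append` and the log file name are hypothetical, chosen only for this example. It assumes a POSIX-like system where `os.fsync` asks the operating system to push its cached data to the device; whether the data truly reaches the flash still depends on the file system and the media firmware underneath.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> None:
    """Append a record and request that it be committed before returning."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # move data from the program's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write its cache to the device


# Hypothetical usage: a small event log in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "events.log")
durable_append(log_path, b"boot ok\n")
durable_append(log_path, b"sensor=42\n")
```

Anything written but not yet flushed and synced is exactly the uncommitted RAM data that a power interruption would lose.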

It is also safe to say that most embedded systems spend the majority of their time in this state – basically not writing data. This means that random power interruption testing will hit this state most frequently, proving only what the system designer has already planned for.

The more interesting failure location is at the point of media write. It is here that we believe focused power interruption testing needs to be conducted. This will enable system designers to discover how the file system and flash media firmware or drivers handle an interrupted write operation, including what sorts of errors (if any) are returned to the application. Testing here will also examine how interrupted atomic (and non-atomic) writes are handled, and under what conditions files can be corrupted. We focused on both of these topics for the FAT file system in a white paper (“Where does FAT fail?”) and will further expand on them in another paper later this year.
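To make the idea of focused testing concrete, here is a small simulation sketch in Python. It is not our test harness or any real file system. Every name in it (`SimulatedMedia`, `naive_update`, `transactional_update`, `survey`) is hypothetical, and it assumes that a single block write either completes or does not happen at all, which real flash does not always guarantee. The sketch cuts power before every possible block write and records what a reader would see afterward, comparing an in-place multi-block update with one that writes to a spare area and then flips a commit marker.

```python
N = 3                      # blocks per update
OLD = [b"old"] * N
NEW = [b"new"] * N


class PowerLoss(Exception):
    """Raised by the simulated media when power is cut."""


class SimulatedMedia:
    # Layout: block 0 = commit marker, blocks 1..N = area A, N+1..2N = area B.
    def __init__(self):
        self.blocks = [b"A"] + OLD + OLD
        self.writes_until_cut = None  # None means power never fails

    def write_block(self, index, data):
        if self.writes_until_cut is not None:
            if self.writes_until_cut == 0:
                raise PowerLoss()     # assumption: this block is not written
            self.writes_until_cut -= 1
        self.blocks[index] = data


def naive_update(media, data):
    """Overwrite area A in place, one block at a time."""
    for i, chunk in enumerate(data):
        media.write_block(1 + i, chunk)


def naive_read(media):
    return media.blocks[1:1 + N]


def transactional_update(media, data):
    """Write the inactive area, then commit by flipping the marker."""
    active_is_a = media.blocks[0] == b"A"
    start = 1 + N if active_is_a else 1
    for i, chunk in enumerate(data):
        media.write_block(start + i, chunk)
    media.write_block(0, b"B" if active_is_a else b"A")


def transactional_read(media):
    start = 1 if media.blocks[0] == b"A" else 1 + N
    return media.blocks[start:start + N]


def survey(update, read):
    """Cut power before each possible write; collect the torn results."""
    torn = []
    for cut in range(N + 2):          # covers every write plus a clean run
        media = SimulatedMedia()
        media.writes_until_cut = cut
        try:
            update(media, NEW)
        except PowerLoss:
            pass
        result = read(media)
        if result not in (OLD, NEW):
            torn.append((cut, result))
    return torn


naive_torn = survey(naive_update, naive_read)
transactional_torn = survey(transactional_update, transactional_read)
print("naive torn states:", naive_torn)
print("transactional torn states:", transactional_torn)
```

In this toy model the in-place update leaves a mix of old and new blocks at two of the cut points, while the marker-based update always reads back as entirely old or entirely new. A real test would interrupt actual hardware at the point of the media write, but the structure is the same: enumerate the write points rather than waiting for a random interruption to land on one.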

If effort isn’t spent validating all the power interruption scenarios, some small measure of benefit could be gained by reducing the amount of writing an embedded system has to do. I believe this is an area of rapidly diminishing returns – what is the point of devices that can log data, usage and even failure statistics if that data isn’t actually written for fear of interruption? Even systems that seldom write do write occasionally, and an interruption at that moment that was not planned for could be a major expense to the company later. It is always better (and far less expensive) to use software designed to handle power interruption.

Download our Where Does FAT Fail whitepaper to explore this topic further.
