Effective Power Interruption Testing - USB Removal and NAND Corruption

Posted by: Thom Denholm

Unsafe USB removal

We had some excellent questions in our April web seminar. These answers delve into removing USB media and corruption on flash media. If you haven't read the first part, see that post here; the link to the original recording is below.

Is removing USB media similar to interrupting the power?

There are some similarities, since the write is interrupted in both cases. That said, removing USB, CF, or SD media will likely cause additional problems that are not easy to track down. The primary reason is that the media loses power before it can complete write operations on the NAND.

We briefly discussed interrupted NAND write operations during the presentation. When a write or erase operation is interrupted, the cells may not be fully charged (or fully erased), and the resulting voltage may sit close to the read threshold. Such a cell may read fine one time, then produce uncorrectable bit errors on a subsequent read. I suspect that industrial-grade SD media contains additional protection against this - better-tested firmware, potential hardware detection of removal - but I can't say for sure.

What are the most common causes of SD card corruption?

As discussed in the previous question, an interrupted programming or erase operation (resulting in a weakly programmed page) is likely the most common failure for SD media. There isn't a "signature" to detect this sort of failure, which could result in uncorrectable errors (and therefore unreadable blocks) at any time, even after a previous read was good. The next most common failure is most likely read disturb.

The very operation of reading NAND flash media can slowly remove charge from a cell, eventually causing a bit error on a read. This type of error is planned for and is detectable through the Error Detection and Correction (EDC, or sometimes ECC) process. The usual way to correct it is to relocate the data elsewhere and then erase the affected block of cells; vendors usually refer to that procedure as "scrubbing." Well-written firmware will handle this properly; Datalight flash software goes further and allows the developer to control the scrubbing threshold.
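To make the idea of a scrubbing threshold concrete, here is a minimal sketch in Python. It is not Datalight's API or any vendor's firmware logic; the names `ReadResult`, `scrub_action`, and `scrub_threshold` are hypothetical, and the sketch assumes the ECC engine reports how many bits it corrected on each read.

```python
from dataclasses import dataclass


@dataclass
class ReadResult:
    """Hypothetical summary of what the ECC engine reported for one page read."""
    corrected_bits: int   # bit errors the ECC repaired on this read
    uncorrectable: bool   # True if the errors exceeded the ECC strength


def scrub_action(result: ReadResult, scrub_threshold: int) -> str:
    """Decide what to do with a block after reading one of its pages.

    Returns "data_lost" if the read could not be corrected, "scrub" if the
    corrected-bit count has reached the threshold (relocate the data, then
    erase the block), and "ok" otherwise.
    """
    if result.uncorrectable:
        return "data_lost"
    if result.corrected_bits >= scrub_threshold:
        return "scrub"
    return "ok"
```

A lower threshold scrubs earlier, which trades extra erase cycles (and therefore wear) for a wider safety margin before errors become uncorrectable.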

Beyond those two failure modes are flash wear (addressed by wear leveling) and block retirement. As NAND wears out, it becomes more error-prone. It seems very likely that good wear-leveling and retirement algorithms are another reason why "industrial grade" managed NAND and SD cards are more reliable.

What techniques can be used to recover from corruption in the field?

The easiest way to repair corruption is to avoid corruption in the first place. One component of the ounce of prevention we are talking about here is power fail-safe file systems and NAND media drivers. Whether through transactions or journaling, a reliable software package is the number one way to prevent corruption.

The other key requirement is following the media vendor specifications for maintaining power during a NAND media write. A design needs to be able to stop writing when power loss is imminent.
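As a rough illustration of "stop writing when power loss is imminent," the sketch below checks a power-good signal before starting each page program. Everything here is an assumption for illustration: `write_pages`, `program_page`, and `power_ok` are hypothetical names, and a real design would rely on hardware (a voltage supervisor plus enough hold-up capacitance) to guarantee that a page program already in progress can finish.

```python
from typing import Callable, Sequence


def write_pages(pages: Sequence[bytes],
                program_page: Callable[[int, bytes], None],
                power_ok: Callable[[], bool]) -> int:
    """Program pages in order, refusing to start a page once power is failing.

    power_ok is assumed to return False early enough that a page program
    already under way still has time to complete.
    Returns the number of pages actually written.
    """
    written = 0
    for index, data in enumerate(pages):
        if not power_ok():
            break  # never begin a program operation we may not finish
        program_page(index, data)
        written += 1
    return written
```

The caller can compare the return value with `len(pages)` to learn whether the write was cut short and needs to be retried after power returns.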

Those two measures avoid the majority of problems; beyond them, a few other strategies can help. Keeping a second copy of the data can be fruitful, if expensive. A similar option is to write a new file instead of overwriting an existing file. If techniques are used to detect when corruption has occurred, the second copy of the data can be used instead. Depending on the data, reacquiring it may also be an option - a corrupt file in a browser's cache can be discarded and re-downloaded. Finally, at the system level, corrupted files could be replaced by reinstalling the system, though this may be rather more than a pound of cure.
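The "write a new file" and "second copy" strategies can be sketched at the application level as follows. This is only one possible approach, not something from the seminar: the function names are hypothetical, a CRC32 stored in front of the payload stands in for whatever corruption detection a design actually uses, and whether the rename survives a power loss still depends on the file system and the media underneath.

```python
import os
import struct
import zlib


def save_with_checksum(path: str, payload: bytes) -> None:
    """Write payload to a new file, then rename it over the old one."""
    tmp = path + ".tmp"
    record = struct.pack("<I", zlib.crc32(payload)) + payload
    with open(tmp, "wb") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the media
    os.replace(tmp, path)     # the old file stays intact until this point


def load_checked(path: str):
    """Return the payload, or None if the file is missing or fails its CRC."""
    try:
        with open(path, "rb") as f:
            record = f.read()
    except OSError:
        return None
    if len(record) < 4:
        return None
    (stored_crc,) = struct.unpack("<I", record[:4])
    payload = record[4:]
    return payload if zlib.crc32(payload) == stored_crc else None


def load_with_fallback(primary: str, backup: str):
    """Prefer the primary copy; fall back to the second copy if it is corrupt."""
    data = load_checked(primary)
    return data if data is not None else load_checked(backup)
```

Saving the same payload to two paths (ideally on different media or partitions) and reading with `load_with_fallback` gives the "second copy" behavior described above; if both copies fail the check, the caller gets `None` and can fall back to reacquiring the data.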

If you have any questions or want clarification, please respond below. Please subscribe using the form on the right for the most up-to-date content!

Watch the original recording
