On a recent call with a customer, we were asked about reliability and power interruption. Specifically, the customer wanted to know why FlashFX Tera requires that page program operations not be interrupted, because he had conducted power interruption tests on eMMC and didn’t observe any failures.
To answer his questions, one of the FlashFX Tera architects provided this summary of the problems caused by interrupted operations, what FlashFX Tera does to mitigate them, and why casual testing is insufficient to determine how well the storage system deals with the potential consequences.
All NAND flash chips are vulnerable when erase or program operations are interrupted, most frequently by power disruptions. NAND datasheets rarely call attention to the consequences. These problems occur at a fundamental physical level, so understanding the effects of power failure requires a brief review of how NAND flash works.
Although NAND is used to store digital data, at a physical level it is an analog device. Each flash memory cell stores some level of charge. A cell is said to be erased when it contains no excess charge, and is programmed by adding charge until it surpasses the required threshold. In SLC (single-level cell) NAND, the charge level is sensed and compared to a single threshold to decide whether the cell contains a binary 1 or a 0 (figure 1). Multiple thresholds can be used to store more than one bit per cell (multi-level cell, or MLC). Bits are grouped into pages, which are programmed as a unit, and pages are grouped into erase blocks, which are erased as a unit (figure 2).
Programming and erasing are not instantaneous operations. To program a cell, a pulse of programming voltage is applied and the cell's charge level is checked; this pulse-and-verify cycle repeats until the cell reaches a charge level that will be sensed reliably as programmed. Erasing works similarly.
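The pulse-and-verify cycle can be sketched in a few lines. This is a toy model, not a real device interface: the cell representation, pulse size, and threshold values are all invented for illustration.

```python
# Hypothetical sketch of NAND's internal program-and-verify loop
# (not a real device API): repeated voltage pulses nudge the cell's
# charge upward until a verify step confirms it exceeds the target
# threshold. A power loss mid-loop strands the charge partway up.

def program_cell(cell, target_threshold, pulse_step=0.1, max_pulses=50):
    """Apply program pulses until the cell verifies at the target level."""
    for _ in range(max_pulses):
        if cell["charge"] >= target_threshold:
            return True              # verify passed: reliably programmed
        cell["charge"] += pulse_step # one incremental program pulse
    return False                     # cell never verified (e.g. worn out)

cell = {"charge": 0.0}               # a freshly erased cell
program_cell(cell, target_threshold=1.0)
```

An interruption is equivalent to exiting this loop early: the charge is left wherever the last completed pulse put it, which may be close to the threshold without reliably exceeding it.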
If programming of a cell is interrupted, it can be left in a state where its charge level cannot be detected reliably, causing immediate, obvious errors. Worse, this programming insufficiency can also cause errors later. A weakly programmed cell may have poor data retention: "program disturb" and "read disturb" effects, or simple charge leakage, can cause its charge level to drop below the threshold. These disturbances always occur in NAND flash, but in a properly programmed cell they are either insignificant or manageable.
Additional problems exist in MLC NAND. In MLC, to reduce the likelihood that an error in a cell will be uncorrectable, the two bits are assigned to different pages, designated Page A and Page B in the encoding diagram; these are called "paired" or "associated" pages. Programming a given page therefore involves changing the charge level in cells that also hold data for a different page, and interrupting the programming of one page can corrupt data in its paired page. Say you have applied programming pulses to move the charge to the level that changes Page A to 0 while leaving Page B a 1. Now you want to program Page B as well. If programming is interrupted before the charge reaches the required threshold, you may find that Page A has changed back to a 1 (figures 3 & 4).
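The paired-page scenario above can be made concrete with a small model. The charge thresholds and the four-level bit encoding here are illustrative assumptions chosen to match the scenario described in the text, not the mapping of any particular NAND part.

```python
# Illustrative MLC model (hypothetical thresholds and encoding,
# following the scenario in the text): four charge levels per cell,
# each level encoding one bit for Page A and one bit for Page B.

THRESHOLDS = [1.0, 2.0, 3.0]                 # charge boundaries between levels
# level:        0       1       2       3
ENCODING = [(1, 1), (0, 1), (1, 0), (0, 0)]  # (Page A bit, Page B bit)

def read_cell(charge):
    """Decode a stored charge level into (page_a_bit, page_b_bit)."""
    level = sum(charge >= t for t in THRESHOLDS)
    return ENCODING[level]

charge = 1.5                       # Page A programmed to 0, Page B still 1
assert read_cell(charge) == (0, 1)

# Programming Page B to 0 means raising the charge toward level 3
# (above 3.0). If power fails when the charge has only reached 2.5,
# the cell decodes as level 2 -- and Page A has flipped back to 1:
interrupted_charge = 2.5
assert read_cell(interrupted_charge) == (1, 0)
```

The point of the sketch is that the stranded intermediate charge is a perfectly valid-looking level; nothing about the cell itself signals that it holds neither the old data nor the new.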
FlashFX Tera handling of interrupted operations
The requirement that operations not be interrupted is not unique to FlashFX Tera.
For FlashFX Tera to operate reliably on NAND flash, every page program operation must complete without interruption. Erase operations are preceded by a program operation that unambiguously marks the erase block as invalid, so an interrupted erase can be detected. Reliable operation on MLC NAND also requires that erase operations complete, to ensure that the ambiguous charge level problem described above does not manifest.
FlashFX Tera contains logic that attempts to mitigate interrupted program operations. "Mitigate" means to make the problem less severe; it does not mean to solve it completely. This logic can catch some interrupted operations, but not all of them. A page that is almost completely programmed may have some cells that were not programmed sufficiently to ensure reliable long-term data storage; it can appear error-free at first, but as the weakly programmed cells lose charge they can produce a bit error rate beyond what the ECC can correct. A page that is only very slightly programmed can appear completely blank, yet its slightly programmed bits may suffer from disturb effects at an excessive bit error rate. These conditions cannot be detected by software when they first occur; they only become evident over time.
Why testing is difficult
So back to the original question of why the customer's testing of eMMC had not found any errors from power interruption: There are at least three reasons why it is difficult to test the effects of interrupted program and erase operations.
In normal operation of a typical device, much of the time it is not writing, so there is no program operation to interrupt. Ensuring that power interruptions occur during program operations likely requires running special software to increase write activity. For example, a test program could continually write and delete files.
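A write/delete stress loop of the kind described is straightforward to sketch. The function name, file name, and sizes below are arbitrary choices for illustration; the essential parts are the fsync, which forces data out to the device so page programs actually occur, and the tight loop, which maximizes the time spent writing.

```python
# Hypothetical stress writer: keeps the storage device busy with
# writes so that a power interruption is likely to land inside a
# page program operation. Names and sizes are arbitrary.
import os

def write_delete_loop(mount_point, file_size=64 * 1024, iterations=1000):
    path = os.path.join(mount_point, "stress.tmp")
    for _ in range(iterations):
        with open(path, "wb") as f:
            f.write(os.urandom(file_size))  # fresh data forces new page programs
            f.flush()
            os.fsync(f.fileno())            # push the data out to the device
        os.remove(path)                     # delete, then write again
```

Run against the device under test while power is cut at random (or, as below, at deliberately chosen) moments.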
Another method is one we used in our side-by-side comparison demos for marketing: we wrote an application that flashed an LED during a write so that one could cut power at the most inopportune moment (link to RE demo).
Errors may not appear for a period of time that could vary from minutes to months, and may not be caught at all before being overwritten unless the interruption rate is slowed to allow them to develop. The read disturb effect is weak, and many reads may be needed in a block to cause even a weakly-programmed bit to be affected. Charge loss also happens slowly, and may require elevated temperatures in order to become significant.
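Why a weak bit fails only after many reads can be shown with a toy model. The numbers below are entirely made up; the model simply follows the text's description of a marginal cell whose charge gradually drops below the sense threshold under disturb effects and leakage.

```python
# Toy model (made-up numbers) of delayed failure: each read disturbs
# the cell's charge by a tiny amount, and a weakly programmed cell,
# sitting just above the sense threshold, fails long before a
# properly programmed one would.

SENSE_THRESHOLD = 1.0
DISTURB_PER_READ = 0.0001   # tiny per-read charge loss (illustrative)

def reads_until_failure(charge):
    """Count reads until the cell's charge drops below the threshold."""
    reads = 0
    while charge >= SENSE_THRESHOLD:
        charge -= DISTURB_PER_READ
        reads += 1
    return reads

weak = reads_until_failure(1.002)    # barely above threshold
strong = reads_until_failure(1.5)    # well above threshold
assert weak < strong
```

The asymmetry is the testing problem in miniature: both cells pass every read for a while, and only sustained, patient exercise separates them.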
Failure rates that are unacceptable across a large volume of shipped units may still be so low that detecting them requires extended testing of a large sample. The design goal for the failure rate in the field (which may be very small but is never zero in the real world) needs to be considered when designing a validation test. If a 1% failure rate is the maximum acceptable, do you have the time and resources, not to mention bench space, to test a sufficiently large number of units while simulating the lifetime of the device? Furthermore, if only 1 unit in 100 fails, the sample size is likely too small to allow root cause analysis of the failure.
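The sample-size arithmetic can be made explicit. Assuming independent units each failing with probability p, the chance of seeing at least one failure in n units is 1 - (1 - p)^n; solving for n gives the minimum sample needed for a chosen confidence. The function name and numbers are illustrative.

```python
# Back-of-the-envelope sample size (illustrative): how many units must
# complete lifetime testing to be reasonably confident of catching a
# given field failure rate? Assumes independent, identical units.
import math

def units_needed(failure_rate, confidence):
    """Smallest n with P(at least one failure) >= confidence,
    i.e. 1 - (1 - p)**n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))

print(units_needed(0.01, 0.95))   # 299
```

So for a 1% failure rate, roughly 300 units must each survive full lifetime simulation just to be 95% confident of observing a single failure, never mind collecting enough failed units for root cause analysis.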
Just running "normal" use-case testing is often insufficient – in terms of both control of environmental factors and statistical power – to expose NAND misbehavior. Over the decades that Datalight has been supporting flash, we've designed solutions and mitigations for common and not-so-common failure scenarios. The hundreds of millions of devices that our customers have in operation attest to the effectiveness of the data reliability techniques we use.