Datalight’s Reliance Nitro and journaling file systems such as ext4 are designed to recover from unexpected power interruption. These kinds of “post mortem” recoveries typically consists of determining which files are in which states, and restoring them to the proper working state. Methods like these are fine for recovering from a power failure, but what about a media failure?
When a media block fails, it is either in the empty space, the user data, or the file system data. A block from the empty space can be detected on the next write, which will either cause failure at the application, or will be marked bad internally and the system will move on to another block. When a media block in the user space fails, it cannot be reliably read. Often, the media driver will detect and report an unreadable sector, resulting in an error status (and probably no data) to the user application. When a media block containing file system data or metadata fails, it is the responsibility of the file system to detect and (if possible) repair that damage. Often the best thing that can be done is to stop writing to the media immediately.
In some ways, blocks lost due to media corruption present a problem similar to recovering deleted files. If it is detected quickly enough, user analysis can be done on the cyclical journal file, and this might help determine the previous state of the file system metadata. Information about the previous state can then be used to create a replacement for that block, effectively restoring a file.
Metadata checksums have been added to several file system data blocks for ext4 in the 3.5 kernel release. Noticeably absent from this list are the indirect and double indirect point blocks, used to allocate trees of blocks for a very large file. The latest release of Datalight’s Reliance Nitro file system (version 3.0) adds CRCs to all file system metadata and internal blocks, allowing for rapid and thorough detection of media failures.
Optional within this new version of Reliance Nitro is using CRCs on user data blocks, for individual files or entire volumes. This failsafe can be configured to write protect the volume or halt system operations. Diagnostic messages are also available to indicate the specific logical block number of the corrupted block.
The combination of full CRC protection on every metadata block and optional protection of user file data blocks is one of the key attributes of this release of Reliance Nitro. Embedded system designers can detect more media failures in testing, and can diagnose failed units more quickly, leading to greater success in the marketplace.
Thom Denholm | January 26, 2013 | Flash File System, Flash Memory, Reliability |
Before embedded devices, file systems were designed to work in servers and desktops. Power loss was an infrequent occurrence, so little consideration was given to protecting the data. Frequent checks of the file system structures were important, and were often handled at system startup by a program such as chkdsk (for FAT) or fsck (for Linux file systems). In each case, the OS could also request a run of these utilities when an inconsistency is detected, or when the power was interrupted.
The method behind these tools is a check of the entire disk – reading each block to determine if it is allocated for use, then cross checking with an allocated list located elsewhere on the media. FAT file systems have little other protection, and can only flag sections of the media without matching metadata by creating a CHK file for later user analysis. Linux file systems add in a journal mechanism to detect which files are affected, and can often correct the damage without user intervention.
These utilities are necessary because these basic file systems are not atomic in nature – data and metadata are written separately. Datalight’s Reliance Nitro file system treats updates as a single operation, and thus the file system is never in a state where it would need to be corrected. Our Dynamic Transaction Point technology allows the user to customize just how atomic their design is, protecting not just a block of data and metadata but the whole file – half a JPG is pretty much useless, from a user perspective.
The repairs that fsck and chkdsk can perform are completely unnecessary with the Reliance Nitro file system. At the device design level, this results in quicker boot times for a system that is completely protected from power failure. A file system checker is of course provided, and is useful for detecting failures caused by media corruption.
Taking chkdsk and fsk to the next level of protection would be a tool to repair some media corruption. If a block of data on the media becomes only partially readable, this tool could read it multiple times (to try and collect the most data) and store the results in a newly allocated block, correcting the file system structures appropriately. User intervention would likely be required to understand if enough data was recovered to make this effort worthwhile. Stay tuned for more updates on this topic.
Thom Denholm | January 7, 2013 | Reliability |
The challenge – making ext4 just as reliable as Datalight’s Reliance Nitro file system, within limitations of the POSIX specification. Unlike most real world embedded designs, performance and media lifetime are not a consideration for this exercise.
1) Change the way programs are written. By changing the habits of coders (and underlying libraries), it is possible to gain some measure of reliability. I’m talking specifically about the write(), fsync(), close(), rename() combination discussed on some forums. This gets around the aggressive buffering of ext4, and gives some assurance that a file either exists or doesn’t. What this does not do is handle an update, which overwrites part of the file, unless an entire new copy of the file is written.
What this also fails to handle is a multithreaded environment. As each of these operations is not atomic, cohesiveness can be lost if power is interrupted while one thread is performing an fsync() while another is writing, or issuing a rename.
2) Journal the data as well as the metadata. For smaller files or overwrites, this solution will have all the data in the journal and available for playback when recovering from a power loss. While this worked well in ext3, this option was changed in ext4. The delayed allocation strategy gives an increase in performance, but can cause more loss of data than is expected. Applications which frequently rewrite many small files seem especially vulnerable.
Shortening the system’s writeback time, stored in shell variables such as dirty_expire_centisecs, can help mitigate this by reducing the amount of data which can be lost.
Some changes just aren’t possible, however. Writes are not atomic, as the metadata and data are written separately. The power can be interrupted between the two types of writes, no matter what is done at the application level. While these suggestions make ext4 more reliable, the cost in changes to user code, loss of performance and additional wear are not insubstantial. Far better to use a file system designed for unexpected power loss, where atomic writes allow the system designer to decide when to put data at risk.
Thom Denholm | October 31, 2012 | Reliability |
Embedded device designers have to come up with systems that handle variations in network traffic, resource constrained environments, and battery limitations. All that pales in comparison with challenges at system start and power down times.
In powering down a multi-threaded device, each application will want to get state information committed to the media. The operating system will have its own set of files or registries to update before the power goes off. At the low end, the MLC NAND flash or eMMC media requires power to be maintained while writing a block, in order to prevent corruption of that or other blocks on the media. Therefore a large group of block writes arrive at the media level simultaneously, all with the requirement to finish up before the power drops.
The only thing worse than having to perform several writes in a short time frame is when those writes display the worst performance characteristics of eMMC media. These block writes are often located in different locations, essentially defining Random I/O.
When restoring power, the same situation occurs in reverse, with each thread requesting state information. For some file systems, the interrupted writes will need to be validated, the file system checked, or the journal replayed. Here the time pressure is not with actual loss of power, but with frustrated users, who want the device to be on, now!
This is a very important use case to consider for designers of file systems and block device drivers. While shutdown consists of primarily writes, a multithreaded file system should not block reads. The same applies at startup time, where a write from one thread should not block the other read threads. Where possible, the block device driver should sequence the reads and writes to match the high performance characteristics of the media, allowing the most blocks to be accessed in the least amount of time.
We are currently working on an eMMC-specific driver that will manage the sequencing of writes. It promises to improve write performance by many times the rate of standard drivers. Check back with us for more on this topic next month!
If possible, applications should also be written to minimize the startup and shutdown impact. With more applications being written by outside developers, this is getting harder to control. The system designer must focus on what they can control – choice of hardware, device drivers and file system to mitigate these problems.
Thom Denholm | June 26, 2012 | Extended Flash Life, Flash Memory, Performance, Product Benefit, Reliability, Uncategorized |
The new chief executive for Research in Motion Ltd., Thorsten Heins, mentioned recently that 80 to 90 percent of all BlackBerry users in the U.S. are still using older devices, rather than the latest Blackberry 7.
Longevity of a consumer device is something that we at Datalight know belongs firmly in the hands of the product designer, rather than being limited by the shortened lifespan of incorrectly programmed NAND flash media. Both Datalight’s FlashFX Tera and Reliance Nitro incorporate algorithms which reduce the Write Amplification on all Flash media. These methods are especially important on e-MMC, which is at its heart NAND flash. In addition, the static and dynamic wear leveling in FlashFX Tera provides even wearing of all flash for maximum achievable lifetime.
Shorter lifetime for some consumer devices, such as low end cell phones, may be found acceptable. However, many newer converged mobile devices that command a higher price, such as tablets, are expected by consumers to have a much longer lifetime. These devices may be replaced by the primary user with some frequency, although since they are viewed as mini-computers and therefore less “disposable,” they will likely be handed down to younger users rather than being discarded or recycled. Consumers will protest in if they discover their $500 tablet only has a lifespan of 3 years, and they will be even more upset if due to flash densities and write amplification that the next version they purchase may have even a shorter lifespan.
How will flash longevity affect your new embedded design?
Thom Denholm | March 6, 2012 | Extended Flash Life, Flash Industry Info, Flash Memory, Flash Memory Manager |
There’s a lot of buzz on the MSDN blog site regarding their latest file system post. http://blogs.msdn.com/b/b8/archive/2012/01/16/building-the-next-generation-file-system-for-windows-refs.aspx – and plenty of insightful comments as well.
I for one am happy to see people talking about file system features, especially Data Integrity, knowledge of Flash Media, and faster access through B+ trees. Of course, Datalight’s own Reliance Nitro file system has had all this and more for some time now…
Microsoft has a new term for a thing we’ve seen often in the case of unexpected power loss – a “Torn Write”. They point this out as a specific problem for their journalling file system, NTFS, but updating any file system metadata in place can be problematic. It looks to me like this new file system, ReFS, handles this by bundling the metadata writes with other metadata writes or with the file data. If the former, this demonstrates the trade-off between Reliability and Performance that we are very familiar with at Datalight. Bundling smaller writes will help with spinning media and flash. In time we will see how much control the application developer has over this configuration – another important point for our customers.
One of the commenters posted that error correction belongs at the block device layer, and I tend to agree. Microsoft’s design goal “to detect and correct corruption” is a noble one, but how would they detect corruption for user data? Additional file checksums and ECC algorithms would be intrusive and potentially time consuming. Keeping a watch on vital file system structures is important, of course, and a good backup in case block level error detection fails.
I look forward to reading more from Microsoft’s file system team in the future, and especially hope to see a roadmap for when these important changes will make it down to the embedded space.
Thom Denholm | January 18, 2012 | Reliability, Uncategorized |
A recent article by Doug Wong compared performance characteristics of eMMC and ONFI specification EZ-NAND, specifically Toshiba’s SmartNAND here: http://www.eetimes.com/design/memory-design/4218886
One consideration I would add to this quite excellent summary is about the availability of drivers. Raw NAND has been around for quite a while and the market supplies a large range of drivers. Many of these will utilize the basic functionality of SmartNAND and other EZ NAND chips with only small modifications. Drivers for eMMC, on the other hand, are much harder to find. Only Linux has a freely available driver, which Google’s Android has taken advantage of in recent releases.
At Datalight, we continue to be excited by both of these new technologies. From the JEDEC eMMC parts, the cool features such as Secure Delete and Replay Protected Memory Block are very exciting. On the other hand, the sheer performance of Toshiba’s SmartNAND and other EZ NAND solutions is very much in demand.
Thom Denholm | November 8, 2011 | Flash Industry Info, Flash Memory, Performance, Uncategorized |
If you’ve noticed the numerous posts lately on the Datalight blog regarding JEDEC and eMMC, you might be wondering why we’re so excited about this particular standard. There are many features that this “smarter” memory will enable for OEMs; In this post I’ll focus on one of those features in the eMMC specification –secure delete.
Securely deleting information on flash memory is more complicated than it seems. For one thing, files are constantly being moved around to ensure even wear of the flash, resulting in multiple copies of file data on the media. Furthermore, when a file is marked for delete, it is typically not physically deleted, rather the space is only marked as available to be overwritten. Until that happens, the “deleted” data is still present and recoverable on the media. In fact, the University of California San Diego Non-volatile systems lab has produced an in depth study of file deletion on flash memory, where they found significant data still present on the media even after deleting the files. A copy of the report can be found at: http://cseweb.ucsd.edu/users/swanson/papers/Fast2011SecErase.pdf
In order to securely delete a file on raw flash, you must use a controller that will either track every block where the file has been stored, or will overwrite the space the file was stored in each time it is moved. The latter describes exactly the secure erase and secure trim features found in the eMMC 4.41 standard. This means that the hardware will finally be capable of securely deleting files –brilliant! There is just one problem: Who has software to support this functionality? As of this writing, there is no file system which supports the feature. While an application can make a call to the media to delete a file securely, the file system may have a backup copy stored somewhere. Fact is, the file system must support the secure delete capabilities of the hardware in order for these features to function correctly.
If an OEM wants to take advantage of the secure erase and secure trim features, their application will need to communicate with the eMMC driver, which may differ from part to part. As the only software company that is an active member of JEDEC, we are excited offer support for quite a few eMMC features. File system support for secure erase and secure trim will be coming later this summer!
Michele Pike | June 29, 2011 | Flash File System, Reliability |
To protect against unexpected power loss, so common in the embedded world, file writes need to be atomic.
Linux file systems ext3 and ext4 were designed for server or desktop environments. Google developer Tim Bray suggests that appropriate use of fsync() can mitigate the risk of data loss, but I am sure that’s not the best solution. The use of delayed allocation means that metadata is committed but the data is not. Alternatively, both can be committed to the journal at a performance penalty. Performance is crucial in both desktops and devices, but not at the expense of data corruption.
This problem is readily demonstrated when updating files, an action which usually happens “in place”. This is quite common in database and other important system files. When power is lost, data can be overwritten only partially, or else metadata can be altered to point to where the data will be updated but has not been. Another alternative to liberal use of fsync() is a rename strategy, that is, write only new data, then rename and replace the old file. Rename is atomic, at least.
The best solution, and one which does not require applications to change the way they do writes, is to perform all data writes atomically. In addition to that, the file system should never overwrite live data and always retain a “known good” state on the media. This way caching does not have to be removed – either user data changes get to the disk fully or not at all. No partial writes or incorrect metadata, and no mount-time journal rebuilds or disk checks either.
Instead of adapting a desktop or server file system for embedded use, it is far better to use a file system designed specifically for embedded use.
View whitepaper: Breakthrough Performance with Tree-based File Systems
Thom Denholm | June 8, 2011 | Performance, Reliability |