The challenge – making ext4 just as reliable as Datalight’s Reliance Nitro file system, within the limitations of the POSIX specification. Unlike most real-world embedded designs, performance and media lifetime are not considerations for this exercise.
1) Change the way programs are written. By changing the habits of coders (and underlying libraries), it is possible to gain some measure of reliability. I’m talking specifically about the write(), fsync(), close(), rename() combination discussed on some forums. This gets around the aggressive buffering of ext4 and gives some assurance that a file either exists or doesn’t. What it does not handle is an update that overwrites part of a file, unless an entire new copy of the file is written.
What this also fails to handle is a multithreaded environment. Because this sequence of operations is not atomic, coherence can be lost if power is interrupted while one thread is performing an fsync() and another is writing or issuing a rename.
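The write/fsync/close/rename combination can be sketched in C as follows – a minimal sketch with abbreviated error handling; the safe_replace name and the .tmp suffix are my own, not from any standard:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace the contents of 'path' so that after a power loss the file
 * holds either the old data or the new data, never a mix. */
int safe_replace(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    if (fsync(fd) != 0) {           /* force the data past the page cache */
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return rename(tmp, path);       /* atomic swap of directory entries */
}
```

A fully robust version would also fsync() the containing directory so the rename itself is persisted – one more step that every application author has to remember, which is rather the point of this post.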
2) Journal the data as well as the metadata (the data=journal mount option). For smaller files or overwrites, this solution will have all the data in the journal and available for playback when recovering from a power loss. While this worked well in ext3, the behavior changed in ext4: its delayed allocation strategy improves performance, but can cause more data loss than expected. Applications that frequently rewrite many small files seem especially vulnerable.
Shortening the system’s writeback interval, controlled by kernel tunables such as /proc/sys/vm/dirty_expire_centisecs, can help mitigate this by reducing the amount of data which can be lost.
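For example, the writeback window could be tightened like this (the values are illustrative, not recommendations – shorter intervals mean more frequent flushes and more flash wear):

```
# /etc/sysctl.conf fragment:
# expire dirty pages after 1 s instead of the 30 s default,
# and wake the writeback threads every 0.5 s instead of every 5 s
vm.dirty_expire_centisecs = 100
vm.dirty_writeback_centisecs = 50
```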
Some changes just aren’t possible, however. Writes are not atomic, as the metadata and data are written separately. The power can be interrupted between the two types of writes, no matter what is done at the application level. While these suggestions make ext4 more reliable, the cost in changes to user code, loss of performance and additional wear are not insubstantial. Far better to use a file system designed for unexpected power loss, where atomic writes allow the system designer to decide when to put data at risk.
Thom Denholm | October 31, 2012 | Reliability |
Embedded device designers have to come up with systems that handle variations in network traffic, resource-constrained environments, and battery limitations. All of that pales in comparison with the challenges at system startup and power-down time.
In powering down a multi-threaded device, each application will want to get its state information committed to the media. The operating system will have its own set of files or registries to update before the power goes off. At the lowest level, MLC NAND flash or eMMC media requires power to be maintained while writing a block, in order to prevent corruption of that block or of other blocks on the media. The result is a large group of block writes arriving at the media level simultaneously, all of which must finish before the power drops.
The only thing worse than having to perform several writes in a short time frame is when those writes exhibit the worst performance characteristics of eMMC media. These block writes are often scattered across different locations – essentially random I/O.
When restoring power, the same situation occurs in reverse, with each thread requesting state information. For some file systems, the interrupted writes will need to be validated, the file system checked, or the journal replayed. Here the time pressure is not with actual loss of power, but with frustrated users, who want the device to be on, now!
This is a very important use case to consider for designers of file systems and block device drivers. While shutdown consists of primarily writes, a multithreaded file system should not block reads. The same applies at startup time, where a write from one thread should not block the other read threads. Where possible, the block device driver should sequence the reads and writes to match the high performance characteristics of the media, allowing the most blocks to be accessed in the least amount of time.
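As a sketch of that sequencing idea (the io_req structure and all names here are illustrative, not taken from any real driver), a pending request queue could be ordered so that reads are served first and writes land in ascending block order, turning random write traffic into something closer to sequential I/O:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical request descriptor for the sketch below. */
struct io_req {
    int      is_write;  /* 0 = read, 1 = write */
    uint32_t block;     /* starting block address */
};

/* Reads sort before writes so startup threads are never blocked
 * behind bulk writes; within each class, ascending block order lets
 * the media see sequential rather than random access. */
static int req_cmp(const void *a, const void *b)
{
    const struct io_req *x = a, *y = b;
    if (x->is_write != y->is_write)
        return x->is_write - y->is_write;     /* reads first */
    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;  /* ascending blocks */
    return 0;
}

void sequence_queue(struct io_req *q, size_t n)
{
    qsort(q, n, sizeof(*q), req_cmp);
}
```

A production driver would of course sequence requests incrementally as they arrive rather than sorting a static array, but the ordering policy is the same.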
We are currently working on an eMMC-specific driver that will manage the sequencing of writes. It promises to improve write performance by many times the rate of standard drivers. Check back with us for more on this topic next month!
If possible, applications should also be written to minimize the startup and shutdown impact. With more applications being written by outside developers, this is getting harder to control. The system designer must focus on what they can control – choice of hardware, device drivers and file system to mitigate these problems.
Thom Denholm | June 26, 2012 | Extended Flash Life, Flash Memory, Performance, Product Benefit, Reliability, Uncategorized |
The new chief executive for Research in Motion Ltd., Thorsten Heins, mentioned recently that 80 to 90 percent of all BlackBerry users in the U.S. are still using older devices, rather than the latest BlackBerry 7.
Longevity of a consumer device is something that we at Datalight know belongs firmly in the hands of the product designer, rather than being limited by the shortened lifespan of incorrectly programmed NAND flash media. Both Datalight’s FlashFX Tera and Reliance Nitro incorporate algorithms which reduce write amplification on all flash media. These methods are especially important on eMMC, which is at its heart NAND flash. In addition, the static and dynamic wear leveling in FlashFX Tera provides even wearing of all flash for maximum achievable lifetime.
A shorter lifetime for some consumer devices, such as low-end cell phones, may be acceptable. However, many newer converged mobile devices that command a higher price, such as tablets, are expected by consumers to have a much longer lifetime. These devices may be replaced by the primary user with some frequency, although since they are viewed as mini-computers and therefore less “disposable,” they will likely be handed down to younger users rather than being discarded or recycled. Consumers will protest if they discover their $500 tablet only has a lifespan of 3 years, and they will be even more upset if, due to flash densities and write amplification, the next version they purchase has an even shorter lifespan.
How will flash longevity affect your new embedded design?
Thom Denholm | March 6, 2012 | Extended Flash Life, Flash Industry Info, Flash Memory, Flash Memory Manager |
There’s a lot of buzz on the MSDN blog site regarding their latest file system post. http://blogs.msdn.com/b/b8/archive/2012/01/16/building-the-next-generation-file-system-for-windows-refs.aspx – and plenty of insightful comments as well.
I for one am happy to see people talking about file system features, especially Data Integrity, knowledge of Flash Media, and faster access through B+ trees. Of course, Datalight’s own Reliance Nitro file system has had all this and more for some time now…
Microsoft has a new term for a thing we’ve seen often in the case of unexpected power loss – a “Torn Write”. They point this out as a specific problem for their journalling file system, NTFS, but updating any file system metadata in place can be problematic. It looks to me like this new file system, ReFS, handles this by bundling the metadata writes with other metadata writes or with the file data. If the former, this demonstrates the trade-off between Reliability and Performance that we are very familiar with at Datalight. Bundling smaller writes will help with spinning media and flash. In time we will see how much control the application developer has over this configuration – another important point for our customers.
One of the commenters posted that error correction belongs at the block device layer, and I tend to agree. Microsoft’s design goal “to detect and correct corruption” is a noble one, but how would they detect corruption for user data? Additional file checksums and ECC algorithms would be intrusive and potentially time consuming. Keeping a watch on vital file system structures is important, of course, and a good backup in case block level error detection fails.
I look forward to reading more from Microsoft’s file system team in the future, and especially hope to see a roadmap for when these important changes will make it down to the embedded space.
Thom Denholm | January 18, 2012 | Reliability, Uncategorized |
A recent article by Doug Wong compared the performance characteristics of eMMC and ONFI-specification EZ NAND, specifically Toshiba’s SmartNAND: http://www.eetimes.com/design/memory-design/4218886
One consideration I would add to this quite excellent summary is about the availability of drivers. Raw NAND has been around for quite a while and the market supplies a large range of drivers. Many of these will utilize the basic functionality of SmartNAND and other EZ NAND chips with only small modifications. Drivers for eMMC, on the other hand, are much harder to find. Only Linux has a freely available driver, which Google’s Android has taken advantage of in recent releases.
At Datalight, we continue to be excited by both of these new technologies. From the JEDEC eMMC parts, the cool features such as Secure Delete and Replay Protected Memory Block are very exciting. On the other hand, the sheer performance of Toshiba’s SmartNAND and other EZ NAND solutions is very much in demand.
Thom Denholm | November 8, 2011 | Flash Industry Info, Flash Memory, Performance, Uncategorized |
If you’ve noticed the numerous posts lately on the Datalight blog regarding JEDEC and eMMC, you might be wondering why we’re so excited about this particular standard. There are many features that this “smarter” memory will enable for OEMs. In this post I’ll focus on one of those features in the eMMC specification – secure delete.
Securely deleting information on flash memory is more complicated than it seems. For one thing, files are constantly being moved around to ensure even wear of the flash, resulting in multiple copies of file data on the media. Furthermore, when a file is marked for deletion, it is typically not physically erased; rather, the space is simply marked as available to be overwritten. Until that happens, the “deleted” data is still present and recoverable on the media. In fact, the Non-Volatile Systems Laboratory at the University of California, San Diego has produced an in-depth study of file deletion on flash memory, where they found significant data still present on the media even after deleting the files. A copy of the report can be found at: http://cseweb.ucsd.edu/users/swanson/papers/Fast2011SecErase.pdf
In order to securely delete a file on raw flash, you must use a controller that will either track every block where the file has been stored, or overwrite the space the file occupied each time it is moved. The latter describes exactly the secure erase and secure trim features found in the eMMC 4.41 standard. This means the hardware will finally be capable of securely deleting files – brilliant! There is just one problem: who has software to support this functionality? As of this writing, there is no file system which supports the feature. While an application can make a call to the media to delete a file securely, the file system may have a backup copy stored somewhere. The fact is, the file system must support the secure delete capabilities of the hardware in order for these features to function correctly.
If an OEM wants to take advantage of the secure erase and secure trim features, their application will need to communicate with the eMMC driver, which may differ from part to part. As the only software company that is an active member of JEDEC, we are excited to offer support for quite a few eMMC features. File system support for secure erase and secure trim will be coming later this summer!
Michele Pike | June 29, 2011 | Flash File System, Reliability |
To protect against unexpected power loss, so common in the embedded world, file writes need to be atomic.
Linux file systems ext3 and ext4 were designed for server or desktop environments. Google developer Tim Bray suggests that appropriate use of fsync() can mitigate the risk of data loss, but I am sure that’s not the best solution. The use of delayed allocation means that metadata is committed but the data is not. Alternatively, both can be committed to the journal at a performance penalty. Performance is crucial in both desktops and devices, but not at the expense of data corruption.
This problem is readily demonstrated when updating files, an action which usually happens “in place.” This is quite common for databases and other important system files. When power is lost, data can be only partially overwritten, or metadata can be altered to point to where the updated data should be but was never written. Another alternative to liberal use of fsync() is a rename strategy: write the new data to a new file, then rename it over the old file. Rename, at least, is atomic.
The best solution, and one which does not require applications to change the way they do writes, is to perform all data writes atomically. In addition to that, the file system should never overwrite live data and always retain a “known good” state on the media. This way caching does not have to be removed – either user data changes get to the disk fully or not at all. No partial writes or incorrect metadata, and no mount-time journal rebuilds or disk checks either.
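The never-overwrite-live-data idea can be shown with a toy copy-on-write model (everything here is illustrative – a real file system transacts whole trees of blocks, not two fixed slots):

```c
#include <string.h>

/* Toy copy-on-write store: two block slots plus a "root" word that
 * records which slot holds the known-good copy. */
struct cow_store {
    char slots[2][64];
    int  live;              /* index of the known-good slot */
};

/* An update never touches the live slot: new data goes to the spare
 * slot, and only then does the root flip.  If power fails before the
 * flip, the old data is intact; after it, the new data is. */
void cow_write(struct cow_store *s, const char *data)
{
    int spare = 1 - s->live;
    strncpy(s->slots[spare], data, sizeof(s->slots[spare]) - 1);
    s->slots[spare][sizeof(s->slots[spare]) - 1] = '\0';
    s->live = spare;        /* the single atomic "commit" point */
}

const char *cow_read(const struct cow_store *s)
{
    return s->slots[s->live];
}
```

The key property is that there is exactly one commit point, and at every instant one complete, consistent copy exists on the media – which is why no journal replay or disk check is needed at mount time.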
Instead of adapting a desktop or server file system for embedded use, it is far better to use a file system designed specifically for embedded use.
View whitepaper: Breakthrough Performance with Tree-based File Systems
Thom Denholm | June 8, 2011 | Performance, Reliability |
The JEDEC eMMC 4.4 specification added two variations to the basic erase command for data security. These were:
Secure Erase – A command indicating a secure purge should be performed on an erase group. The specification states that this should be done not only to the data in this erase group but also on any copies of this data in separate erase groups. This command must be executed immediately, and the eMMC device will not return control to the host until all necessary erase groups are purged. One erase group is the minimum block of memory that can be erased on a particular NAND flash.
Secure Trim – Similar to Secure Erase, this command operates on write blocks instead of erase groups. To handle this properly, the specification breaks this into two steps. The first step marks blocks for secure purge, and this can be done to multiple sets of blocks before the second step is called. The second step is an erase with a separate bit flag sequence that performs all the requested secure trims.
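The two-step sequence can be sketched as follows. The command numbers (CMD35/CMD36 to set the range, CMD38 to erase) and the CMD38 argument values are taken from the eMMC 4.41 specification, and match the constants used by the Linux mmc core; the send_cmd_fn transport hook is hypothetical, standing in for a real bus driver:

```c
#include <stdint.h>

/* CMD38 argument values from the eMMC 4.41 specification. */
#define ERASE_ARG         0x00000000u  /* plain erase              */
#define TRIM_ARG          0x00000001u  /* trim (write blocks)      */
#define SECURE_ERASE_ARG  0x80000000u  /* secure erase             */
#define SECURE_TRIM1_ARG  0x80000001u  /* secure trim, step 1      */
#define SECURE_TRIM2_ARG  0x80008000u  /* secure trim, step 2      */

/* Hypothetical transport hook: issue one command with its argument. */
typedef int (*send_cmd_fn)(uint8_t cmd, uint32_t arg);

/* Secure-trim a block range in the two steps the spec requires:
 * mark the range for purge (step 1), then purge everything marked
 * (step 2).  Step 1 may be repeated for several ranges before the
 * single step-2 command that executes them all. */
int secure_trim(send_cmd_fn send, uint32_t first_block, uint32_t last_block)
{
    if (send(35, first_block))      return -1;  /* ERASE_GROUP_START */
    if (send(36, last_block))       return -1;  /* ERASE_GROUP_END   */
    if (send(38, SECURE_TRIM1_ARG)) return -1;  /* mark for purge    */
    if (send(38, SECURE_TRIM2_ARG)) return -1;  /* execute purge     */
    return 0;
}
```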
This feature was changed in the eMMC 4.5 specification, due out later this year, and neither of these commands will be functional. To properly handle this change and allow a board design to support multiple types of eMMC parts, the file system or driver will need built-in flexibility. The alternative, assuming both eMMC vendor drivers work in the design, is still a complete recoding phase and a full software test cycle.
Thom Denholm | March 25, 2011 | Reliability |
Do you need defrag? It mostly depends on your hardware and your use case. While defragmenting a file system can make the computer run faster, it’s not the only answer.
Fragmentation is usually caused when modifying a file. Overwriting part of a file or making it larger means storing a fragment of the file in a new place, unless the file system creates a complete new copy of the file. Databases are particularly susceptible here – they are usually large files and are often updated in the middle.
Another way fragmentation happens is when the file system initially stores the file in pieces. This could happen if the file system is not configured to keep file blocks together, or if the media is fairly full and there are no spaces of sufficient size for the new file.
What about the impact of fragmentation? In the days of rotating media, a fragmented file meant extra head movement and platter rotation to read the file. With flash media, the extra overhead is just additional block reads – a far smaller cost.
Avoiding fragmentation if you’re using Reliance Nitro can be as simple as customizing your transaction points. Instead of transacting on a timed basis, create a new transaction point only when the entire file is on the media, at “file close”. Similar settings may be available on other file systems.
If your use case causes fragmentation, a valid workaround might be to reformat the media after backing up the database files. A fresh file format is fairly quick on modern hardware, and can be coupled with a bad block test as well.
Thom Denholm | January 26, 2011 | Performance, Product Benefit |