At Datalight, we frequently find ourselves helping customers on what we call "rescue missions" – when a device is failing in the field and the design team is under pressure to quickly resolve a data corruption or data loss issue. Often, the failure happens because data never reached the media, usually because a cache or other performance optimization delayed those slow flash writes. For our talk at Embedded World 2017 and our recent seminar, we shared what we've learned about reliability on Linux, concentrating on when the data is actually on the media. Probably the best place to start is defining just what a system failure is.
We can all agree that when a device fails to start up, that is a failure. The system files have become corrupted, or something else in the design is in an unknown state that the boot code or application can't deal with. Reliability solutions on Linux and VxWorks deal with those situations by making sure to commit the system files and metadata first, even if some user data is lost. On Linux, the O_DIRECT flag can fall into this category: it relies on the system to commit the metadata without an explicit flush call.
The user data in that situation is secondary, and, for some embedded designs, that is perfectly acceptable. For other designs – such as medical devices – the user data is just as important as (or even more important than!) the device simply being able to start up. In these situations, we must understand the complete path to the media because it’s key to fully committing the data. We explored the whole route in our seminar by taking the "data metro" from the application end of the line all the way down to the media end of the line, examining each stop along the way.
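The stops on that route can be sketched in a few lines of Python, whose `os` module wraps the underlying Linux system calls directly. This is a minimal sketch, not the method from the seminar: the function name `durable_write` is hypothetical, and whether `fsync()` also empties the storage device's own volatile write cache depends on the file system, mount options, and hardware.

```python
import os


def durable_write(path: str, payload: bytes) -> int:
    """Write payload to path and push it past each cache we can reach.

    Hypothetical helper for illustration only.
    """
    # Stop 1: the application's own buffer (Python's buffered file object).
    with open(path, "wb") as f:
        written = f.write(payload)
        # Stop 2: move the data from the user-space buffer into the
        # kernel page cache.
        f.flush()
        # Stop 3: ask the kernel to write the file's data and metadata
        # out to the storage device.
        os.fsync(f.fileno())

    # Stop 4: fsync the parent directory as well, so that the directory
    # entry for a newly created file is also written out.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return written
```

Skipping any one of these steps leaves data parked at that stop until the system decides to move it on its own schedule.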
Sometimes a flush is necessary, and sometimes a cache commit must be requested. Timeouts must be set properly, and any additional buffers where data can gather before the commit must be cleared. This can be complicated stuff, and using O_DIRECT is not enough to clear all the hurdles. Probably the simplest way to prove that O_DIRECT doesn't commit all the data is to look at the throughput using that flag.
An I/O speed greater than the speed of the media (the red line) indicates that some of the data is still in a cache. This simple test was enough to show that O_DIRECT alone does not commit all the data. Since committing all the data is the goal of our exercise, an fsync() must be used as well. A Linux file system such as Reliance Nitro gives clear access to the necessary flags, and the data-at-risk conditions can be changed with a simple API, which is far easier than unmounting and remounting the media.
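The same kind of throughput check is easy to reproduce. The sketch below is an assumption-laden stand-in for the test described above: it uses ordinary buffered writes rather than O_DIRECT (which requires aligned buffers and is not supported on every file system), and the function name `write_throughput` and its parameters are hypothetical. It times a run of writes with and without a final `fsync()` so the two apparent speeds can be compared against the known speed of the media.

```python
import os
import time


def write_throughput(path: str, total_bytes: int, block_size: int,
                     sync: bool) -> float:
    """Return apparent write throughput in MB/s (1 MB = 10**6 bytes).

    Hypothetical helper for illustration only.
    """
    block = b"\xa5" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter()
        remaining = total_bytes
        while remaining > 0:
            # os.write may write fewer bytes than requested, so count
            # what was actually accepted.
            written = os.write(fd, block[:min(block_size, remaining)])
            remaining -= written
        if sync:
            # Include the time needed to push the data out of the
            # kernel page cache toward the media.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    # Guard against a zero interval on very small runs.
    return total_bytes / max(elapsed, 1e-9) / 1e6
```

Calling it once with `sync=False` and once with `sync=True` on the target file system gives two figures to compare. If the unsynced figure is higher than the media can physically sustain, part of the data was still sitting in a cache when the timer stopped.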