The Electronic Engineering Journal published an interesting article by Jim Turley this week, discussing file systems and the popular SD media they run on. While the article brings up some good points about media reliability, I’d like to dive a little deeper into two of the points he raises – hopefully giving a bit more perspective. A file system designed for better reliability can be less tricky than you think.
Definitions of Reliability
The users of embedded devices are probably not file system experts, and sometimes the designers of the devices aren't either. From the user's perspective, they just want their data to be on the device when they expect it to be there. We think of this as data integrity. As the device ages, data retention also becomes a consideration – but that’s a topic for another blog post. Techniques that protect data integrity include journaling the data, redundant writes, atomic updates such as those in the Tuxera product family, and the transaction points provided by Datalight's Reliance family of file systems.
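To make the atomic-update idea concrete, here is a minimal sketch in C of one common technique: write the new contents to a temporary file, then rename it over the original. This is a generic POSIX illustration, not how the Tuxera or Reliance products are implemented; the `atomic_update` function and the `.tmp` suffix are names invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* Replace the contents of `path` so that a reader (or a reboot after
 * power loss) sees either the old contents or the new contents,
 * never a partial write. Returns 0 on success, -1 on failure. */
int atomic_update(const char *path, const char *data)
{
    char tmp[256];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    if (fwrite(data, 1, strlen(data), f) != strlen(data)) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    fclose(f);  /* production code would also fsync the file and its
                   directory here before renaming */

    /* rename() is atomic on POSIX file systems: the old file is
     * replaced in a single step, so an interruption leaves one
     * complete version or the other. */
    return rename(tmp, path) == 0 ? 0 : -1;
}
```

The same write-elsewhere-then-commit shape underlies transactional file systems generally; the commit step just moves from a directory entry to an on-media transaction point.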
The designer of the device may – or may not – care about the user data, but the absolute requirement from their perspective is that the device be able to boot and operate. This has been the primary focus of most reliability improvements to file systems over the last few decades – making the file system fail-safe. Some of the techniques used include logging or journaling the metadata, atomic operations, and using the second FAT table to provide a pseudo-transaction – as Microsoft's TexFAT does. Most operations that protect user data integrity also provide a fail-safe environment for the system data.
Underlying all of this is the hardware, and as Jim Turley pointed out, reliability has to be a design concern from top to bottom, not just an add-on or an afterthought. The file system certainly can't prevent failures of the media – blocks or sectors going bad, in other words – but it should be able to detect and mitigate them.
Mitigating Media Problems
SD media fails in a number of ways: failing to read, failing to write, and silently returning erroneous data. The first two are easily detected by the file system, but the third can be a bit trickier.
Detecting erroneous data in the system data provides a different level of fail-safety, and this is often done by storing a CRC with the file system structures and metadata. Ext4, the default file system on most Linux distributions, supports metadata checksums, but the feature is not enabled by default.
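As an illustration of how a metadata CRC catches silently corrupted data, the sketch below computes a CRC-32 over a hypothetical metadata block and verifies it on read. The `meta_block` layout and field names are invented for this example; a real file system defines its own on-media structures, and production code would use a table-driven CRC rather than this small bitwise version.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (IEEE polynomial, reflected form). */
uint32_t crc32(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* A hypothetical on-media metadata block with a trailing CRC field. */
struct meta_block {
    uint32_t sequence;
    uint32_t free_blocks;
    uint32_t crc;        /* CRC of the fields above */
};

/* On read, recompute the CRC over everything before the stored CRC;
 * a mismatch means the media returned erroneous data. */
int meta_block_valid(const struct meta_block *m)
{
    return crc32(m, offsetof(struct meta_block, crc)) == m->crc;
}
```

The CRC is computed once when the block is written; any single corrupted bit on the media then makes `meta_block_valid` fail, turning a silent error into a detected one.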
Once detected, the next step is handling the error – is recovery possible? For user files and folders, a disk check can mark those files – or restore data to a fixed name like FILE0000.CHK – and move on. While the user may lose data, at least the system continues to function. For system files and folders, the solution can be a lot more difficult.
Our file systems either transparently recover on-the-fly or optionally throw an exception for these situations, allowing the system designer to handle some situations gracefully. As an example, an error in an automotive navigation system's map data could result in an error message letting the driver know that map data is unavailable or corrupt, and that they should return to a dealer for an update.
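The recover-or-report choice can be modeled as an error callback that the system designer registers with the file system. The sketch below is a hypothetical C API shaped after that description – the type and function names are invented, and this is not the actual Datalight interface.

```c
#include <stddef.h>

typedef enum { FS_ERR_READ, FS_ERR_WRITE, FS_ERR_BAD_DATA } fs_error_t;
typedef enum { FS_RECOVER, FS_READONLY } fs_action_t;

/* The designer's handler: decide what to do about an error on `path`. */
typedef fs_action_t (*fs_error_cb)(fs_error_t err, const char *path);

static fs_error_cb g_error_cb = NULL;

void fs_set_error_handler(fs_error_cb cb) { g_error_cb = cb; }

/* The file system calls this when it detects an error it cannot
 * transparently recover. With no handler installed it fails safe,
 * locking the volume read-only. */
fs_action_t fs_report_error(fs_error_t err, const char *path)
{
    if (g_error_cb != NULL)
        return g_error_cb(err, path);
    return FS_READONLY;
}

/* Example handler: tolerate bad map data (show the driver a message
 * and carry on), fall back to read-only for anything else. */
fs_action_t map_error_handler(fs_error_t err, const char *path)
{
    (void)path;
    return (err == FS_ERR_BAD_DATA) ? FS_RECOVER : FS_READONLY;
}
```

A navigation system would install `map_error_handler` at startup; corrupt map data then produces a user-visible message rather than a dead unit, while read and write failures still lock the system down safely.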
The unhandled exception, used primarily during system validation, is also useful because it can lock the system down in a read-only state. This lets test engineers step in and see exactly where the failure occurred, helping them quickly determine its root cause.
We can go one step further and provide optional CRC protection of user data files, taking user data integrity to a much higher level.
While Turley's article points out key design concerns, he suggests that the media is most of the problem. I've used this space to explain some of the file system choices that affect reliability, and how data integrity differs from fail-safety. We also saw how detecting a problem can lead to possible solutions – or at least to more graceful failures.