Last week, we recorded a web seminar of the talk Datalight gave at Embedded World 2018. There were some excellent questions at that session (thanks Mark!), and I wanted to share more complete responses with you.
What are some examples of how you would test the various scenarios (e.g. between block, in block)?
As we discussed in the seminar, in-block interruptions should only be a problem for magnetic media. Testing can confirm that, but it isn't easy to do. Use the largest block size available for your media (so that each write takes as long as possible) and issue large sequential writes (which should bypass any cache). Then keep interrupting the power while those writes are in progress.
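As a rough illustration of that workload, here is a minimal Python sketch that writes large sequential blocks with a predictable pattern, plus a checker to run after power is restored. The names `pattern_block`, `write_workload`, and `first_bad_block` are hypothetical, and the 1 MiB block size is only an assumption; substitute the largest block size your media supports. Note that `os.fsync` pushes data past the operating system cache, but the device itself may still buffer writes.

```python
import os

BLOCK_SIZE = 1024 * 1024  # assumed value; use your media's largest block size

def pattern_block(index, size=BLOCK_SIZE):
    """Deterministic content for block `index`, so a checker can regenerate it."""
    stamp = index.to_bytes(8, "little")
    return (stamp * (size // 8 + 1))[:size]

def write_workload(path, blocks, size=BLOCK_SIZE):
    """Write `blocks` sequential blocks, flushing each one toward the media."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(pattern_block(i, size))
            f.flush()
            os.fsync(f.fileno())

def first_bad_block(path, size=BLOCK_SIZE):
    """Return the index of the first short or mismatched block, or None."""
    with open(path, "rb") as f:
        i = 0
        while True:
            data = f.read(size)
            if not data:
                return None
            if data != pattern_block(i, size):
                return i
            i += 1
```

Run `write_workload` while power is being cut, then call `first_bad_block` after restart. A mismatch at the very end of the file is the expected interrupted write; a mismatch anywhere earlier deserves a closer look.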
The other case, an interruption between writes, is easier to test and can even be handled from a debugger. At the block level, set a breakpoint between block writes (after the first few, before the last few, etc.) and then power the device off while execution is stopped at the breakpoint. Upon restart, check that the state is what you expect and that the file system and application software handle it properly.
What tools can be used to triage corruption? fsck? Anything else?
Disk checks (such as fsck) can detect corruption of the file system metadata, but not of the data within a file. Even at the metadata level, catching a problem is not guaranteed, though turning on CRCs for the metadata (a capability of Reliance Nitro, Reliance Edge and ext4 on Linux) will help.
For corruption of the user data, I would suggest (for testing purposes) filling that space with serialized data. Each data packet should carry its serial (sequence) number within the file and also contain a CRC, which allows you to check for both corruption and out-of-order writes.
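One way to read that suggestion is sketched below in Python. The record layout (sequence number, payload length, payload, CRC-32) and the names `pack_record` and `check_records` are assumptions made for illustration, not a format that any particular file system defines.

```python
import struct
import zlib

# Hypothetical record layout: 4-byte sequence number, 4-byte payload length,
# the payload itself, then a 4-byte CRC-32 covering everything before it.
HEADER = struct.Struct("<II")
TRAILER = struct.Struct("<I")

def pack_record(seq, payload):
    """Build one test record with its sequence number and CRC."""
    body = HEADER.pack(seq, len(payload)) + payload
    return body + TRAILER.pack(zlib.crc32(body))

def check_records(data):
    """Return (valid_count, problem).

    problem is None, "truncated", "corrupt" or "out-of-order", and
    valid_count is the number of good records seen before the problem.
    """
    offset = 0
    expected = 0
    while offset < len(data):
        if offset + HEADER.size > len(data):
            return expected, "truncated"
        seq, length = HEADER.unpack_from(data, offset)
        end = offset + HEADER.size + length
        if end + TRAILER.size > len(data):
            # Also reached when a damaged length field points past the end.
            return expected, "truncated"
        (stored_crc,) = TRAILER.unpack_from(data, end)
        if zlib.crc32(data[offset:end]) != stored_crc:
            return expected, "corrupt"
        if seq != expected:
            return expected, "out-of-order"
        expected += 1
        offset = end + TRAILER.size
    return expected, None
```

After a power interruption, read the test file back and pass its contents to `check_records`. A "truncated" result on the final record is what an interrupted write should look like, while "corrupt" or "out-of-order" results point to a problem lower in the stack.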
Another option, which is available in live systems for a small performance penalty, is to use Reliance Nitro and enable CRCs on all user data. Then write the application to handle the exception for invalid user data appropriately.
What is the best way to test a system update?
Recent problems in the industry indicate that the most important thing to do is validate that the system update actually works before pushing it to devices. After that, confirm that it works in every situation that might occur. Uconnect’s recent problem had more to do with the Travel Link feature from Sirius XM than their own software, and that option should have been accounted for in testing.
Testing the delivery of the update is considerably easier, and can be handled in the source code. After each write to the media, add a hook (such as an exception), then have a test program interrupt power at each occurrence. The on-media format should be stable at each power interruption, with the update either continuing or recovering as appropriate.
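The idea can be tried out on a desk before touching hardware. The Python sketch below is a simulation only: `Media` is a hypothetical in-memory stand-in for the storage device, the update uses an assumed staging-slot-plus-commit scheme, and `PowerLoss` plays the role of the exception raised after a write. A real test would cut actual power at the same points.

```python
class PowerLoss(Exception):
    """Raised by the test hook to simulate power being cut."""

class Media:
    """Hypothetical in-memory stand-in for the storage device."""
    def __init__(self, old_image):
        self.slots = {"active": old_image, "staging": b""}
        self.commit = "active"   # slot the boot code should use
        self.writes = 0
        self.fail_after = None   # cut power once this many writes are done

    def _after_write(self):
        # The hook: called after every write to the media.
        self.writes += 1
        if self.fail_after is not None and self.writes >= self.fail_after:
            raise PowerLoss()

    def erase_staging(self):
        self.slots["staging"] = b""
        self._after_write()

    def append_staging(self, chunk):
        self.slots["staging"] += chunk
        self._after_write()

    def set_commit(self, slot):
        self.commit = slot
        self._after_write()

def apply_update(media, new_image, chunk_size=4):
    """Assumed update scheme: fill the staging slot, then switch to it."""
    media.erase_staging()
    for i in range(0, len(new_image), chunk_size):
        media.append_staging(new_image[i:i + chunk_size])
    media.set_commit("staging")

def boot_image(media):
    """What the device would run after power comes back."""
    return media.slots[media.commit]

def run_interruption_test(old_image, new_image):
    """Interrupt the update after every write and record what boots."""
    clean = Media(old_image)
    apply_update(clean, new_image)
    total = clean.writes
    results = []
    for n in range(1, total + 1):
        media = Media(old_image)
        media.fail_after = n
        try:
            apply_update(media, new_image)
        except PowerLoss:
            pass
        media.fail_after = None  # power restored
        results.append(boot_image(media))
    return total, results
```

Every entry in `results` should be either the complete old image or the complete new one. Anything else means the update can leave the device in a state it cannot boot from.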
That only leaves problems with over-the-air downloads (validate with a CRC or something stronger first) and the media itself (a much larger topic of discussion).
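For the download check, a minimal Python sketch of the "something stronger" case is shown here, using SHA-256 from the standard library. The function name `image_is_valid` is hypothetical, and the sketch assumes the expected digest reaches the device through a channel you already trust.

```python
import hashlib
import hmac

def image_is_valid(image, expected_sha256_hex):
    """Return True only if the downloaded image matches the expected digest."""
    actual = hashlib.sha256(image).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, expected_sha256_hex.lower())
```

Only hand the image to the update routine once this returns True; otherwise discard the download and fetch it again.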
If you have any questions or want clarification, please respond below. More answers in the next blog post!