Help! Why are my devices failing?

Posted by: Thom Denholm

In conversations with the embedded OEMs we work with, one issue comes up with almost every manufacturer: the cost of diagnosing and fixing the causes of field failures. This affects time-to-market and pulls resources away from development and toward field diagnostics and post-mortem analysis. The issue is especially relevant for the following reasons:

  • Need for defect prevention during field operations: Protecting critical data demands a high degree of reliability, so devices must not fail. To ensure that devices are safe against wear-related failures, manufacturers must run extensive tests across a range of user scenarios to guard against edge cases. Analyzing the test results can be a daunting task because of the many interfaces between the hardware, software, and application layers. Hence, there is a need to track these interactions continuously so that, when a failure occurs, any difference in the interactions can be discovered and corrected.
  • Vulnerability of devices to wear-related failures: As flash media continues to increase in density and complexity, it is becoming more vulnerable to wear-related failures. Shrinking lithography brings increased ECC requirements, and with the move to more bits per cell there is a concern that what was written to the disk may not be what is read back. However, most applications assume that data written to the file system will be completely accurate when read back. If the application does not fully validate the data it reads, errors in that data may cause the application to fail, hang, or simply misbehave. These complications call for checks that validate the data read against the data written, so as to prevent device failures caused by data corruption.
  • Complexity of hardware and software integration: The complex nature of hardware and software integration within embedded devices makes finding the cause of a failure a painstaking job, one that requires coordination between several hardware and software vendors. For this reason, it often takes OEMs days to investigate causes at the file system layer alone. Problems below that layer can require more extensive testing and the involvement of multiple vendors. Log messages can help manufacturers pinpoint the location of a failure so that the correct vendor can be notified.
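
To make the idea of validating data read against data written more concrete, here is a minimal Python sketch. It assumes the application stores each record with a CRC32 trailer and checks it on every read. The function names `frame_record` and `check_record` and the record layout are hypothetical and chosen only for illustration; a real device would pick a checksum and layout suited to its media and data.

```python
import struct
import zlib


def frame_record(payload: bytes) -> bytes:
    """Append a CRC32 of the payload (4 bytes, little-endian) before writing."""
    return payload + struct.pack("<I", zlib.crc32(payload) & 0xFFFFFFFF)


def check_record(stored: bytes) -> bytes:
    """Return the payload if its CRC32 trailer matches; raise ValueError if not."""
    if len(stored) < 4:
        raise ValueError("record too short to contain a checksum")
    payload, trailer = stored[:-4], stored[-4:]
    (expected,) = struct.unpack("<I", trailer)
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise ValueError("checksum mismatch: data read differs from data written")
    return payload


# Usage: frame the data before it goes to the file system, check it on the way back.
record = frame_record(b"sensor=42")
assert check_record(record) == b"sensor=42"
```

A check like this lets the application detect a corrupted read and report it, rather than acting on bad data and failing later in a way that is harder to trace.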

This ability to pinpoint the cause of failure is especially helpful when an OEM is:

  • Troubleshooting during the manufacturing and testing process to make sure that their devices do not fail for the given user scenarios
  • Doing post-mortem analysis on parts returned from their customers to understand the reasons for failures and how to resolve them
  • Required to maintain a log of interactions between the various parts of the device for future assistance with failure prevention or optimization
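
As a rough illustration of a log of interactions between the parts of a device, the Python sketch below keeps a bounded in-memory log of events tagged by layer and reports the lowest layer that recorded a failure. The layer names, the `InteractionLog` class, and the "lowest failing layer" heuristic are all assumptions made for this example; they are not a description of any particular product.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical layer names, ordered from the top of the stack to the bottom.
LAYERS = ("application", "file_system", "block_driver", "flash_media")


@dataclass(frozen=True)
class Event:
    seq: int      # monotonically increasing sequence number
    layer: str    # which layer reported the event
    op: str       # short description of the operation
    ok: bool      # False if the operation failed


class InteractionLog:
    """Bounded log; the oldest events are discarded once capacity is reached."""

    def __init__(self, capacity: int = 256):
        self._events = deque(maxlen=capacity)
        self._seq = 0

    def record(self, layer: str, op: str, ok: bool = True) -> None:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self._seq += 1
        self._events.append(Event(self._seq, layer, op, ok))

    def lowest_failing_layer(self):
        """Return the deepest layer that logged a failure, or None if none did."""
        failed = {e.layer for e in self._events if not e.ok}
        for layer in reversed(LAYERS):
            if layer in failed:
                return layer
        return None


# Usage: a write that fails in the file system because the driver below it failed.
log = InteractionLog()
log.record("application", "save_settings")
log.record("file_system", "write", ok=False)
log.record("block_driver", "program_page", ok=False)
assert log.lowest_failing_layer() == "block_driver"
```

The lowest failing layer is only a starting point for deciding which vendor to contact first; it is not a diagnosis on its own.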

So, what are we doing to help our customers find and correct the causes of field failures? Stay tuned, as we have some upcoming news on this topic. Also, do you have an embedded-device issue that you wish someone would look into? Please share your issues and solutions in the comment section of this blog.
