When reading the article on Tesla MCUv1 failures (which may now include MCUv2), I was struck by an interesting fact, one that isn't being reported. The article at insideevs.com describes a problem which is more evident when the storage media is nearly full. This is especially the case when "Tesla's firmware image size has gone from about 300MB to the full 1GB maximum size." The article goes on to state that "With Tesla utilizing near 100 percent of the flash memory today, there is no free space left for additional wear leveling to compensate for the excessive log writing." So does wear leveling require free space, and how much?
Proper wear leveling doesn't require a lot of cushion, which we define as blocks reserved for improving performance and replacing blocks that wear out through use. The amount allocated for this in eMMC parts is defined by the firmware and is typically not disclosed. Using a software solution such as FlashFX Tera allows our customers to define how many additional NAND blocks are reserved for the purposes of performance and wear leveling. Data can be written to this additional cushion at full performance. If a design does not have sufficient cushion, a block must be erased before new data can be written, and that causes noticeable latency.
However, the description of the problem hints at another issue related to wear leveling. As the name implies, wear leveling describes techniques used to ensure even use of the media. When a single block yields uncorrectable write or erase errors, indicating it is close to the end of its usable life, good wear leveling means that the other blocks are also close to that limit, and the device has achieved the longest possible lifetime.
The most basic wear leveling design is this: for each data write, make sure that data is written to the least erased block. This is known as dynamic wear leveling, and unfortunately this "quick and dirty" design has a flaw that, in most use cases, will reduce the lifetime of the flash. Any blocks containing data that is written only once will never be wear leveled. Such data is referred to as static data, and if a system contains a large proportion of it, the total number of blocks available for wear leveling is greatly reduced.
The heat map above shows the erase counts on simulated NAND media when only dynamic wear leveling is used. As you can see, most erases are concentrated on a small portion of the blocks, while the others were erased only once.
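To make the effect concrete, here is a minimal Python sketch of dynamic-only wear leveling. It is a toy model, not the simulation behind the heat map: the block counts, the number of writes, and the function name simulate_dynamic_only are all assumptions chosen for illustration, and real flash translation layers are considerably more involved.

```python
def simulate_dynamic_only(num_blocks=100, static_blocks=80, writes=10_000):
    """Toy model of dynamic wear leveling (all parameters are assumptions).

    Blocks 0..static_blocks-1 hold data that is written once and never
    rewritten. Every later write goes to the least-erased block among the
    remaining blocks, which is the basic dynamic wear leveling rule.
    """
    # Assume every block was erased once before its initial programming.
    erase_counts = [1] * num_blocks
    # Only blocks that do not hold static data are ever reused.
    dynamic_pool = range(static_blocks, num_blocks)

    for _ in range(writes):
        # Dynamic wear leveling: pick the least-erased block in the pool.
        target = min(dynamic_pool, key=erase_counts.__getitem__)
        erase_counts[target] += 1

    return erase_counts


if __name__ == "__main__":
    counts = simulate_dynamic_only()
    print("least-erased block:", min(counts))
    print("most-erased block: ", max(counts))
```

In this model the blocks holding static data never leave an erase count of one, while the small pool of reusable blocks absorbs every subsequent erase, which is the same lopsided pattern described above.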
Moving the data written once – Tesla firmware for example – requires static wear leveling. While moving that firmware means an additional write, the overall benefit is to greatly increase the available pool of blocks to be used. Typically this is a very small price to pay for wear leveling that results in the longest life for the media. Using the same test conditions as above, we simulated a system with both static and dynamic wear leveling, resulting in a much better picture.
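The trade-off can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the simulation used for this article and not any vendor's firmware: the threshold value, the relocation rule (move the coldest static data into the most-worn reusable block), and the name simulate_static_and_dynamic are all hypothetical.

```python
def simulate_static_and_dynamic(num_blocks=100, static_blocks=80,
                                writes=10_000, threshold=16):
    """Toy model combining dynamic and static wear leveling.

    Assumed rule: after each write, if the most-worn reusable block has
    been erased more than `threshold` times more than the least-worn block
    holding static data, the static data is relocated into the worn block
    and the lightly used block joins the reusable pool.
    """
    erase_counts = [1] * num_blocks          # one erase for initial programming
    holds_static = [b < static_blocks for b in range(num_blocks)]
    relocations = 0                          # extra erases spent moving static data
    count_of = erase_counts.__getitem__

    for _ in range(writes):
        pool = [b for b in range(num_blocks) if not holds_static[b]]
        cold = [b for b in range(num_blocks) if holds_static[b]]

        # Dynamic step: write to the least-erased reusable block.
        target = min(pool, key=count_of)
        erase_counts[target] += 1

        # Static step: relocate static data if the wear gap is too large.
        hottest = max(pool, key=count_of)
        coldest = min(cold, key=count_of)
        if erase_counts[hottest] - erase_counts[coldest] > threshold:
            erase_counts[hottest] += 1       # erase worn block to receive the copy
            holds_static[hottest] = True     # worn block now rests with static data
            holds_static[coldest] = False    # lightly used block becomes reusable
            relocations += 1

    return erase_counts, relocations


if __name__ == "__main__":
    counts, relocations = simulate_static_and_dynamic()
    print("erase count spread:", max(counts) - min(counts))
    print("extra erases for relocation:", relocations)
```

The relocations counter is the "additional write" cost mentioned above. In this model it buys a spread between the most- and least-erased blocks that stays close to the chosen threshold, instead of growing with every write.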
So does the 32GB eMMC media used in this design have enough cushion, and does its firmware use static wear leveling? One solution described in the article – replacing it with eMMC from another vendor – would tend to indicate that the eMMC firmware isn’t good enough. Unfortunately, the media firmware is usually a black box, with little to no visibility into what it is doing.
Short of asking the vendor pointed questions and getting clear and complete answers, how can you tell whether your choice of eMMC will be right for your design? Tuxera has worked with customers to develop a rigorous test program to help them choose parts with the best characteristics for their use cases. See this page on Tuxera’s Flash Testing Service for more information.