For much of 2020, we’ve been talking about avoiding end-of-life from NAND correctable errors. I spoke about this very topic at the Embedded Online Conference, where I got to interact digitally with many of you and take your questions. For those not up to speed on the topic, the whitepaper we produced is available here. The talk brought up some interesting questions that I think warrant a little more discussion and digging.

All about the firmware

Perhaps the most common question was, “Where is the error management actually being handled?” For an example project – an ARM single board computer running Linux, with ext3 file systems on both microSD and eMMC – the answer starts with the firmware. This is special code written to work with the NAND flash media and controller. On Linux, there are also drivers to connect that firmware to a standard block device layer, allowing the developer to use block tools like encryption.

While error management is handled by the firmware, the file system can make requests that make that management much easier on the media, adding lifetime to the design. In this case, the interface used is known as Trim or Discard – a notification from the file system that blocks are no longer being used. Developers can use flash storage with the Trim or Discard notifications turned off, and they may see higher short-term performance – but both long-term performance and media lifetime will suffer.
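One quick sanity check on a Linux target is whether a file system was actually mounted with the discard option. The sketch below is only an illustration: it assumes the usual /proc/mounts line layout (device, mount point, type, comma-separated options), and the function name, the sample text, and the device names in it are made up for this example.

```python
# Sample text in the layout of /proc/mounts. The devices and mount points
# below are hypothetical, chosen to resemble the example project.
SAMPLE_MOUNTS = """\
/dev/mmcblk0p2 / ext3 rw,relatime,discard 0 0
/dev/mmcblk1p1 /data ext3 rw,relatime 0 0
"""


def discard_enabled(mounts_text, mount_point):
    """Return True if mount_point is mounted with the 'discard' option.

    mounts_text is the content of /proc/mounts (or text in the same layout).
    If a mount point appears more than once, the last entry is used.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        if fields[1] == mount_point:
            options = fields[3].split(",")
            result = "discard" in options
    if result is None:
        raise LookupError(f"{mount_point} not found in mount table")
    return result


print(discard_enabled(SAMPLE_MOUNTS, "/"))      # root is mounted with discard
print(discard_enabled(SAMPLE_MOUNTS, "/data"))  # /data is not
```

On a real system, the text would come from reading /proc/mounts. A missing discard option does not necessarily mean the media never hears about freed blocks, since some systems run a periodic trim with fstrim instead, so treat the result as a starting point for investigation.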

Handling errors on flash media designs

Another question I received was related to special flash media designs that contain a one-time programmable (OTP) section. This sort of read-only area can be used for system firmware or default configuration settings. Read-only use does not rule out bit errors, however; they can still occur there. If the OTP section is provided by the vendor (and their firmware), they may have a contingency to handle the situation – reprogramming in place while maintaining power. This is a question worth asking. If the OTP section is more of a design choice by the development team, I would suggest working with the vendor and a flash software team to make sure errors are properly handled. In such cases, optimized and tailored support is crucial. Our team at Tuxera offers design review services which may be helpful.

Some designs, however, use flash media that doesn’t have firmware. We refer to this as “raw flash”, and on Linux that can mean using a flash file system, such as YAFFS, JFFS2 or UBIFS. That file system must include the error handling logic, which decides whether to ignore a bit error for now or correct errors by relocating the data. The right balance depends on the use case and desired lifetime, and it’s something I discuss in our whitepaper. Unfortunately, the Linux flash file systems relocate the data on the first bit error, which can reduce lifetime considerably. This was a good choice when NAND controllers could only handle error correction on 4 bits of data, but modern controllers can perform bit correction on 40 or more bits per NAND block.
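To make the trade-off concrete, here is a minimal sketch of the two policies: relocating on the first corrected bit, versus relocating only once the corrected-bit count uses up some fraction of what the controller can correct. The function names and the 0.5 fraction are assumptions picked purely for illustration; they are not taken from any Linux flash file system or from FlashFX Tera. The 4-bit and 40-bit figures are the ones mentioned above.

```python
def relocate_on_first_error(corrected_bits):
    """Policy described above: relocate as soon as any bit was corrected."""
    return corrected_bits >= 1


def should_relocate(corrected_bits, ecc_capability, margin_fraction=0.5):
    """Hypothetical threshold policy.

    Relocate only once the number of corrected bits reaches
    margin_fraction of the controller's correction capability.
    margin_fraction=0.5 is an arbitrary illustrative value.
    """
    if ecc_capability <= 0:
        raise ValueError("ecc_capability must be positive")
    if not 0 < margin_fraction <= 1:
        raise ValueError("margin_fraction must be in (0, 1]")
    if corrected_bits <= 0:
        return False  # clean read, nothing to do
    return corrected_bits >= margin_fraction * ecc_capability


# Compare the two policies for a controller that can correct 40 bits.
for bits in (0, 1, 3, 20, 35):
    print(bits, relocate_on_first_error(bits), should_relocate(bits, 40))
```

With a 40-bit controller, a read with 3 corrected bits triggers a relocation under the first-error policy but not under the threshold policy, which is where the extra program/erase cycles (and lost lifetime) come from. Choosing the actual threshold depends on the use case and desired lifetime.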

Tuxera’s FlashFX Tera is a Linux solution which can handle these situations with ease. To learn more about it, click here.

Let us help you solve your data storage challenges.