What are some of the challenges posed by increasing error rates in modern MLC NAND devices? We’ll cover some strategies required to ensure data integrity and reliability, with a focus on software-based ECC solutions.
Earlier, we covered choosing the right flash technology for your embedded systems design. Despite all your best efforts to choose the optimal flash technology and device, you might still see errors and corruption as the data is handled and stored by the system. Errors can happen due to a variety of factors, including electromagnetic interference, physical limitations of the storage device, temperature, or other environmental factors. Luckily, there’s a way to tackle these inevitable errors: Error Correcting Codes (ECCs). Like noble guardians of the flash galaxy, ECCs ensure the integrity and reliability of data stored on NAND devices. These specialized algorithms function as the backbone of data reliability, diligently detecting and fixing errors that may arise during data storage or handling processes.
Error correction is a silent process: the caller simply receives the corrected data. When more bit errors occur than the code can correct, however, even ECCs can’t save you. Relocating the data before that bit error threshold is reached is referred to as scrubbing.
How to handle ECC in flash devices: hardware vs. software
ECC mechanisms can be implemented at different levels in memory systems. Many modern flash memory devices, including NAND flash, have hardware-based ECCs built directly into their architecture. This means the hardware can automatically detect and correct errors without relying on external intervention. The trade-off is flexibility: hardware solutions (eMMC, UFS, and SD) have exactly one fixed setting for how often they scrub.
In some cases, error correction might instead be handled through software. This involves algorithms and routines executed by the system’s software or operating system to detect and fix the errors. Customizing error handling with flash translation layer (FTL) software, for example, allows for tailored approaches in mitigating errors specific to NAND flash memory. At Tuxera, we specialize in storage software, so we’ll shift the focus specifically to software-based ECCs.
The advantage of software-based ECCs lies in adaptability and improved system resiliency. Bit errors accumulate over time, and error correction requirements vary with the use case. If a device is powered on continually, it can wait until a block gets close to the maximum number of correctable bit errors before scrubbing. If it is powered off for weeks at a time, the bit error count can grow from few to many between uses. In that case, the device should scrub after only a few bit errors.
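To make the trade-off concrete, here is a minimal sketch of how a scrub threshold might be chosen from the expected power profile. The names (ScrubPolicy, choose_scrub_policy, needs_scrub) and the 25% and 75% fractions are our own illustrative assumptions, not values from any vendor or from a Tuxera product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrubPolicy:
    """Hypothetical policy record used only for this illustration."""
    max_correctable_bits: int  # ECC strength per codeword (assumed input)
    scrub_threshold_bits: int  # corrected-bit count that triggers a scrub


def choose_scrub_policy(max_correctable_bits: int, long_power_off: bool) -> ScrubPolicy:
    """Pick a scrub threshold that matches the expected power profile.

    The fractions are illustrative assumptions: scrub early (25% of ECC
    strength) if the device sits unpowered for long periods, and late
    (75%) if it is powered on continually.
    """
    if max_correctable_bits < 1:
        raise ValueError("ECC must be able to correct at least one bit")
    fraction = 0.25 if long_power_off else 0.75
    threshold = max(1, int(max_correctable_bits * fraction))
    return ScrubPolicy(max_correctable_bits, threshold)


def needs_scrub(policy: ScrubPolicy, corrected_bits: int) -> bool:
    """Return True once a read needed at least the threshold number of corrections."""
    return corrected_bits >= policy.scrub_threshold_bits
```

With an assumed 8-bit ECC, a read that needed three corrections would trigger a scrub on the intermittently powered device (threshold 2) but not on the always-on device (threshold 6).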
Adapting ECC strategies for modern MLC technology
Back when NAND flash was just single-level cell (SLC) technology, the demand for data integrity was typically met with ECCs that detect 2-bit errors and correct single-bit errors. Error rates on those devices were much lower than on today’s NAND flash. That includes newer, smaller-lithography SLC designs, which require stronger ECCs than their predecessors, as well as multi-level cell (MLC) NAND flash devices, in which multi-bit errors can be seen as part of normal device operation.
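For readers who have not seen a single-error-correct, double-error-detect code in action, the sketch below uses an extended Hamming(8,4) code on a 4-bit value. It is a toy chosen for readability: real NAND ECCs protect hundreds of bytes per codeword, and modern devices generally need stronger codes. The function names are our own.

```python
from typing import List, Optional, Tuple


def encode_secded(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4)
    code with parity bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    if not 0 <= nibble <= 0xF:
        raise ValueError("expected a 4-bit value")
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def decode_secded(received: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(received)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of set positions is 0 for a valid codeword
    parity_odd = sum(code) % 2 == 1
    if parity_odd:
        # Odd overall parity: treat as a single flipped bit. The syndrome is
        # its position (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even parity but a nonzero syndrome: two bits flipped, detect only.
        return None, "uncorrectable"
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of a codeword still decodes to the original value with status "corrected", while flipping any two bits is reported as "uncorrectable" rather than silently returning wrong data.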
The usual recommendation for scrubbing a block on NAND flash is to read the data, correct any bit errors using ECCs, and then write the corrected data to an unused block. The existing block is then marked for erasure and reuse. Erasing the block resets its cells, clearing the effects of charge leakage or disturbance that can make a cell appear to be inconsistently programmed.
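The scrub sequence described above can be sketched against a toy in-memory model. ToyFlash, scrub_block, and erase_pending are hypothetical names for illustration only, and the read_corrected callback stands in for whatever ECC decode step the real driver or controller provides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ToyFlash:
    """In-memory stand-in for a NAND device; not a real driver API."""
    blocks: Dict[int, Optional[bytes]]  # physical block -> raw data, None if erased
    pending_erase: List[int] = field(default_factory=list)

    def find_free_block(self) -> int:
        """Return the lowest-numbered erased block."""
        for number in sorted(self.blocks):
            if self.blocks[number] is None:
                return number
        raise RuntimeError("no free block available for scrubbing")


def scrub_block(flash: ToyFlash,
                logical_map: Dict[int, int],
                logical_block: int,
                read_corrected: Callable[[bytes], bytes]) -> int:
    """Relocate one logical block and return its new physical block."""
    old = logical_map[logical_block]
    data = read_corrected(flash.blocks[old])  # read and apply ECC correction
    new = flash.find_free_block()
    flash.blocks[new] = data                  # write corrected data to an unused block
    logical_map[logical_block] = new          # point the logical block at the new copy
    flash.pending_erase.append(old)           # mark the old block for erasure and reuse
    return new


def erase_pending(flash: ToyFlash) -> None:
    """Erase every block that was queued by scrub_block."""
    for number in flash.pending_erase:
        flash.blocks[number] = None
    flash.pending_erase.clear()
```

Deferring the erase, as this sketch does, keeps the old copy intact until the new one has been written, which is one reasonable way to avoid losing data if power fails mid-scrub.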
In modern MLC NAND flash, errors occur more frequently. Partial programming and associated pages can also cause errors to occur. The severity of an error (and the recommended response) can also vary among manufacturers. This variation creates a clear need for flexibility in how NAND errors are addressed through ECCs.
Related whitepaper: Avoid end of life from NAND correctable errors
Learn more about extending the lifetime of NAND flash memory in our whitepaper, “How to avoid end of life from NAND correctable errors.”
Feature highlight: Tuxera Error Policy Manager for ECCs
Our flash management software, Tuxera FlashFX Tera and Tuxera SafeFTL, was developed to customize how these different levels of errors are handled for various flash devices. With Tuxera FlashFX Tera, we introduced the Error Policy Manager feature. This feature allows the software to react to bit errors during a read and decide (based on the configured Error Policy) whether a page requires an action “soon” or “now”.
Other functions that the Error Policy Manager can employ – in addition to scrubbing – include Relocate the Data, Retire the Block, and Abandon Operations. Error Policies can be specified for any number of bits, from zero to the maximum supported by the media and NAND controller. This flexibility allows the device designer to completely control how their software works with media from any NAND flash vendor. Error handling requirements can easily be translated from the design document into the software stack.
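To illustrate the general idea of a per-bit-count policy table, here is a small sketch. It is not the Error Policy Manager API: the ErrorPolicy class, the Action enum, and the thresholds in the usage note are all assumptions made for this example, and only the action names are borrowed from the text above.

```python
from bisect import bisect_right
from enum import Enum
from typing import Dict


class Action(Enum):
    NONE = "none"
    SCRUB = "scrub"
    RELOCATE = "relocate the data"
    RETIRE = "retire the block"
    ABANDON = "abandon operations"


class ErrorPolicy:
    """Hypothetical lookup table from corrected-bit count to an action."""

    def __init__(self, rules: Dict[int, Action], max_bits: int) -> None:
        # rules maps "this many bit errors or more" to an action.
        if 0 not in rules:
            raise ValueError("a rule for zero bit errors is required")
        if any(bits < 0 or bits > max_bits for bits in rules):
            raise ValueError("rule thresholds must lie in 0..max_bits")
        self._thresholds = sorted(rules)
        self._actions = [rules[bits] for bits in self._thresholds]
        self._max_bits = max_bits

    def action_for(self, bit_errors: int) -> Action:
        """Return the action of the highest threshold not above bit_errors."""
        if not 0 <= bit_errors <= self._max_bits:
            raise ValueError("bit error count outside the ECC's range")
        index = bisect_right(self._thresholds, bit_errors) - 1
        return self._actions[index]
```

A designer working with an assumed 8-bit ECC might configure ErrorPolicy({0: Action.NONE, 3: Action.SCRUB, 6: Action.RELOCATE, 8: Action.RETIRE}, max_bits=8), so that a read with 5 corrected bits schedules a scrub and a read with 6 relocates the data right away.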
The Tuxera Error Policy Manager works equally well with newer SLC NAND flash that has higher ECC requirements. The Error Policy Manager requires that the Tuxera VBF layer be present on the media. Correctable errors in any NAND flash outside this coverage must currently be handled by the OS read routines.
Want to know more about our flash management software with features such as Tuxera Error Policy Manager?
Check out our flash storage software offerings.
Thom Denholm
Thom is the Technical Product Manager at Tuxera, and also a former developer at Datalight with over 20 years embedded software experience in file systems and flash media management. In his spare time, he works as a professional baseball umpire and an internet librarian.