Electronic – How does UBI determine a NAND flash block is bad

charge-pump, flash, linux

I understand that if a block erase fails, the block can be marked as bad. If a page write fails immediately, the block can also be marked bad. I have a board that shows "read disturb" and/or "program disturb" errors after very few reads or writes. Should UBI mark the block as bad?

Should erasing a block always reset the disturb level? If not, as in my case, what could be wrong? Could the charge pump within the flash chip be damaged? If so, how could this have happened?

I have written a Linux driver to test the flash at the MTD level. After a "read disturb" has flipped bits in an erased page, subsequent reads of that page may be OK (no ECC errors). How can this happen? Is this expected?

Best Answer

Here is a useful link when using UBIFS; as it's the official documentation, you've probably already read it.

The first thing to know is that UBI's behavior depends on MTD, which in turn relies on your NAND controller driver. I will try to answer in terms of the NAND flash requirements, the MTD ones and the UBI ones.

According to NAND datasheets (I have a Toshiba SLC NAND with 8-bit ECC here, but the recommendation is common to Micron, Hynix and Samsung), if a block erase fails, the block should be marked as bad. If a page write fails, the block should also be marked as bad. The definition of "fails" has evolved with the technology. With flash parts that require 4-bit or 8-bit ECC, it is common to have some bits stuck at 0 or 1. A write is commonly regarded as OK if, when the page is read back just after the write, there are strictly fewer bitflips than the acceptable maximum. You may also choose to lower your acceptance threshold to improve robustness, but being too restrictive reduces the number of blocks you can use and shortens the lifetime of the part, because wear is leveled over fewer blocks and blocks are marked bad sooner.
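As a rough illustration of the read-back check described above, here is a minimal sketch in C. It assumes you can read the page back raw (without ECC correction) and compare it with what was written, one ECC chunk at a time. The function names and the `max_bitflips` parameter are hypothetical, not part of MTD or UBI.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Count the bits that differ between the data written and the raw data
 * read back from the page. */
static unsigned int count_bitflips(const uint8_t *written,
                                   const uint8_t *readback, size_t len)
{
    unsigned int flips = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t diff = written[i] ^ readback[i];

        /* Clear the lowest set bit until none remain. */
        while (diff) {
            diff &= (uint8_t)(diff - 1);
            flips++;
        }
    }
    return flips;
}

/* Hypothetical acceptance policy: the write is OK only if there are
 * strictly fewer bitflips than max_bitflips (an assumed threshold,
 * at most the ECC strength for one chunk). */
static bool write_is_acceptable(const uint8_t *written,
                                const uint8_t *readback, size_t len,
                                unsigned int max_bitflips)
{
    return count_bitflips(written, readback, len) < max_bitflips;
}
```

Lowering `max_bitflips` below the ECC strength gives more margin, at the cost of rejecting more blocks, as explained above.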

Read disturb may also show up later. NAND flash is made of floating-gate transistors, whose stored charge may change over time, or because of activity on the current page or even on the other pages of the block. This is known behavior of NAND flash, and it is why flash filesystems should periodically read all data: to detect when some pages are reaching the maximum number of correctable bitflips. In that case, the data should be relocated, but the block isn't bad.
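The decision described above can be sketched as a three-way classification. This is only an illustration: the enum, the function and both thresholds are hypothetical names. As far as I know, Linux MTD reports the last two cases to its callers as `-EUCLEAN` (corrected, but at or above the bitflip threshold) and `-EBADMSG` (uncorrectable), and UBI reacts to the former by scheduling the block for scrubbing rather than marking it bad.

```c
/* Possible outcomes of reading one ECC chunk (hypothetical names). */
enum read_action {
    READ_OK,            /* few or no bitflips, nothing to do          */
    READ_RELOCATE,      /* data corrected, but should be moved        */
    READ_UNCORRECTABLE  /* more bitflips than the ECC can correct     */
};

/* bitflips:           number of flipped bits found in the chunk
 * ecc_strength:       maximum number of bitflips the ECC can correct
 * relocate_threshold: assumed policy value, <= ecc_strength, at which
 *                     the data is moved to another block */
static enum read_action classify_read(unsigned int bitflips,
                                      unsigned int ecc_strength,
                                      unsigned int relocate_threshold)
{
    if (bitflips > ecc_strength)
        return READ_UNCORRECTABLE;
    if (bitflips >= relocate_threshold)
        return READ_RELOCATE;
    return READ_OK;
}
```

Note that none of these outcomes marks the block as bad; relocation only moves the data, and the block is erased and reused.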

You may also get log traces when ECC correction happens, even if it doesn't trigger a relocation. It depends on your driver.

Finally, an erased page may contain some bits stuck at 0. In that case, ECC correction fails on read even though the page is OK. To avoid this issue, there is often a "page written" marker added in the spare area, with more than twice as many bits as the maximum number of bitflips for the NAND, which allows ECC errors to be ignored on unwritten pages. In this case, your driver may fix the bitflips before returning the page to the upper layers, resulting in a "read disturb" notice and no error reported to the upper layer.
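One way such a marker could be checked is a majority vote over its bits. The sketch below assumes the marker is programmed to all zeros when the page is written and reads as all ones on an erased page; the marker layout and the function names are hypothetical and will differ between drivers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Count the bits that read as 0 in a buffer. */
static unsigned int count_zero_bits(const uint8_t *buf, size_t len)
{
    unsigned int zeros = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t inv = (uint8_t)~buf[i];

        /* Each set bit in the inverted byte is a 0 bit in the original. */
        while (inv) {
            inv &= (uint8_t)(inv - 1);
            zeros++;
        }
    }
    return zeros;
}

/* Majority vote on the "page written" marker (assumed: all zeros when
 * written, all ones when erased). Because the marker has more than
 * twice as many bits as the maximum number of bitflips, flipped bits
 * alone cannot change the majority. */
static bool page_is_written(const uint8_t *marker, size_t marker_len)
{
    unsigned int total_bits = (unsigned int)(marker_len * 8);

    return count_zero_bits(marker, marker_len) * 2 > total_bits;
}
```

When `page_is_written()` returns false, a driver following this scheme could skip ECC on that page and hand an all-0xFF buffer to the upper layers.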

As you can see, the behavior mainly depends on your NAND controller driver. Is your driver open source? If so, reading its code would show the actual behavior.