Electronic – Checking EEPROM for corruption / methods to avoid corruption

eeprom

I'm developing methods to extend the life of an EEPROM as long as possible, to check it for corruption, and to do everything I can to avoid corruption from events that may happen on a daily basis, such as power failure.

  1. What are some methods I can develop to protect the EEPROM on the controller I'm using?

  2. "Have a plan for power loss during a write cycle. Will your hold-up cap keep things afloat long enough to finish the write? Do you re-initialize the EEPROM when the CPU powers up in case the EEPROM was left half-way through a write cycle when the CPU was reset?" Found at: https://betterembsw.blogspot.com/2011/11/avoiding-eeprom-corruption-problems.html

How would this be possible to check if it was in the middle of a write cycle?

  3. "Don't use address zero of the EEPROM. It is common for corruption problems to hit address zero, which can be the default address pointed to in the EEPROM if that chip is reset or otherwise has a problem during a write cycle."

Is this true? Is address 0 considered a bad seed that shouldn't be used?

Best Answer

Is address 0 considered a bad seed that shouldn't be used?

This is about what the address happens to be when something goes wrong in your system. Thinking this way, the two cases I'd worry about most are where the address lines are either all-0 (top) or all-1 (bottom). I can't provide you with a specific mechanism in the case of your MCU, of course. And perhaps this applies more to cases where the EEPROM is an external part than one where it is internal. But it sits in my head as a concern, certainly in the case of an external bus. So I'd guess that's also why the author of that comment said what they said.

(The same idea also applies to the top and bottom addresses of any physical page layout for the EEPROM, if any. So I might also worry about the top and bottom of each page, as well as the first and last pages.)

How would this be possible to check if it was in the middle of a write cycle?

One possible symptom is that a write never terminates. This could be due to some hardware problem (MCU is suffering from an undetected brown-out and behaving badly, for example) or it could be due to your own software getting stuck. Or other reasons. Regardless, it helps to know your worst-case write time and then use the watchdog timer (or a different timer) to reset the MCU or to bring attention to the problem. Nothing is perfect, so this is no panacea. But knowing your write times and setting up a means to detect a failure on this basis alone can help.
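As a minimal sketch of the timeout idea, assuming the EEPROM exposes a write-busy flag that can be polled: the `hal_*` functions and `sim_*` variables below are hypothetical stand-ins that simulate the hardware on a host so the logic can be exercised. On a real MCU they would be replaced by the vendor's register accesses, and the timeout limit would come from the worst-case write time in the datasheet.

```c
#include <stdbool.h>
#include <stdint.h>

/* --- Simulated hardware (hypothetical stand-ins for MCU registers) --- */
static uint8_t  sim_eeprom[64];
static uint32_t sim_clock_us;              /* advances on every read of the clock */
static uint32_t sim_write_done_at;         /* simulated time at which the write ends */
static uint32_t sim_write_time_us = 3400;  /* illustrative write duration */
static bool     sim_stuck;                 /* true: the busy flag never clears */

static uint32_t hal_now_us(void)
{
    sim_clock_us += 100;
    return sim_clock_us;
}

static void hal_start_write(uint16_t addr, uint8_t value)
{
    sim_eeprom[addr % sizeof sim_eeprom] = value;
    sim_write_done_at = sim_clock_us + sim_write_time_us;
}

static bool hal_write_busy(void)
{
    return sim_stuck || sim_clock_us < sim_write_done_at;
}

/* --- Write with a bounded wait --- */
typedef enum { EE_OK = 0, EE_TIMEOUT = 1 } ee_status_t;

/* Starts a one-byte write and polls the busy flag, giving up after max_write_us. */
ee_status_t eeprom_write_bounded(uint16_t addr, uint8_t value, uint32_t max_write_us)
{
    uint32_t start = hal_now_us();
    hal_start_write(addr, value);
    while (hal_write_busy()) {
        /* Unsigned subtraction stays correct if the counter wraps around. */
        if ((uint32_t)(hal_now_us() - start) > max_write_us) {
            return EE_TIMEOUT;
        }
    }
    return EE_OK;
}
```

On `EE_TIMEOUT` the caller can record the failure, retry, or simply stop servicing the watchdog and let it reset the MCU. Which of those is appropriate depends on the system.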

Suppose that your processor resets, though, and you return knowing the fact that something went awry. (Usually, there is a flag that tells you the watchdog caused a reset or else you can use a specialized vector in SRAM that tells you this fact.) You know there may be a problem in the EEPROM.

Or, you might have gone through a complete, normal, power-off and power-on cycle and are just starting up and you find a problem in the EEPROM.

How do you detect that a write didn't fully complete? How do you find sections that weren't properly written?

Well, you need to apply a number of techniques:

  1. Provide a method for error detection. This may include a checksum, a CRC, a hash function like MD5 or SHA256, or one of many ECC choices; or a combination of them. If you can afford the idea, I'd recommend including two methods: a checksum that is very simple to implement and provides a first test which, if it succeeds, leads to the second test -- a CRC or ECC.
  2. Provide a means for error correction. This can be achieved using ECC at whatever level of correction you feel you want (1-bit or 2-bit or more.) But another way to approach this is redundancy, by providing duplicate segments of your data so that if one fails its testing you can use the other (which is then replicated again to provide redundancy again.)
  3. Don't rely on fixed locations. Instead, provide segments of data which include a SIZE, a TYPE, a CHECKSUM, and an ECC/CRC component. You skip the ones that don't check out and move on, using the SIZE field. This allows you to recover from cases where a segment is bad and cannot be trusted. A problem here is that the SIZE field itself may be flawed. You can improve that situation by providing very good ECC on the SIZE field itself. So that might be an approach. There are others, though.
  4. Include a wear-leveling scheme. I'd recommend that you look at UBIFS in particular, as it deals with raw memory and is open source. Some of the ideas may be applicable even if you don't bother with wear leveling, because of the shared goals of data integrity.
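
To make the segment idea concrete, here is a minimal sketch under several assumptions of mine: each record is laid out as TYPE, SIZE, payload, CHECKSUM, CRC; an erased cell reads 0xFF and marks the end of the records; and the EEPROM is modelled as a plain byte array. The names `rec_write` and `rec_find` are hypothetical. The scanner runs the cheap checksum first, then the CRC, and uses SIZE to step over records that fail or that have a different type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_HDR  2u     /* TYPE, SIZE */
#define REC_TRL  3u     /* CHECKSUM, CRC high byte, CRC low byte */
#define TYPE_END 0xFFu  /* assumed: an erased cell marks the end of the records */

/* Cheap first test: 8-bit sum of the bytes. */
static uint8_t sum8(const uint8_t *p, size_t n)
{
    uint8_t s = 0;
    while (n--) s = (uint8_t)(s + *p++);
    return s;
}

/* Stronger second test: bitwise CRC-16 (polynomial 0x1021, initial value 0xFFFF). */
static uint16_t crc16(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFFu;
    while (n--) {
        crc = (uint16_t)(crc ^ ((uint16_t)*p++ << 8));
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Appends one record at `off`; returns the offset just past it, or 0 if it does not fit. */
size_t rec_write(uint8_t *mem, size_t cap, size_t off,
                 uint8_t type, const uint8_t *data, uint8_t size)
{
    size_t body = REC_HDR + size;
    if (off >= cap || body + REC_TRL > cap - off) return 0;
    uint8_t *rec = &mem[off];
    rec[0] = type;
    rec[1] = size;
    memcpy(&rec[REC_HDR], data, size);
    uint16_t crc = crc16(rec, body);
    rec[body]     = sum8(rec, body);
    rec[body + 1] = (uint8_t)(crc >> 8);
    rec[body + 2] = (uint8_t)(crc & 0xFFu);
    return off + body + REC_TRL;
}

/* Returns the payload offset of the first valid record of `type`, or -1 if none is found. */
int rec_find(const uint8_t *mem, size_t cap, size_t off,
             uint8_t type, uint8_t *size_out)
{
    while (off < cap && cap - off >= REC_HDR + REC_TRL && mem[off] != TYPE_END) {
        const uint8_t *rec = &mem[off];
        size_t body = REC_HDR + rec[1];
        if (body + REC_TRL > cap - off) break;   /* SIZE points past the end: stop */
        uint16_t stored = (uint16_t)((rec[body + 1] << 8) | rec[body + 2]);
        /* The CRC is only computed when the cheap checksum already passed. */
        int ok = sum8(rec, body) == rec[body] && crc16(rec, body) == stored;
        if (ok && rec[0] == type) {
            *size_out = rec[1];
            return (int)(off + REC_HDR);
        }
        off += body + REC_TRL;                   /* skip using the SIZE field */
    }
    return -1;
}
```

Note that this sketch trusts SIZE when skipping a bad record, which is exactly the weakness mentioned in item 3: a corrupted SIZE (or a TYPE corrupted to 0xFF) derails the scan, so those fields deserve stronger protection.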

I wouldn't bother with wear leveling, directly. But I would recommend that you read thoroughly about the techniques applied in UBIFS so that you make sure you don't miss any good ideas there. One of the nice things about its existence is that you can find exact implementations of various ideas, so you have no question about how to implement something if you decide you want to, and you have something you can use to cross-validate your own coding if there's a remaining question. No amount of white-paper talking points will do that for you.

I think you should consider the idea of not placing things in fixed locations, though. And I would recommend that you either consider duplication of data or else decent ECC, so that you have a way to automatically recover from the more likely errors. I think you will need ECC for some elements within structures, like the SIZE field I mentioned, because certain kinds of errors at the element level might make it impossible to reliably function, afterwards. So you need to identify single-point failures and cover those with ECC, I think. Once that is handled, you can build on that with other flexible structures which can be skipped or used, as you determine appropriate.
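
Two of these ideas can be sketched briefly. This is only an illustration with hypothetical names, and it swaps in simpler techniques than a real ECC: a bitwise majority vote over three stored copies of a critical byte (such as SIZE), and a mirrored data block where each copy carries an inverted 8-bit sum so that a bad copy can be rebuilt from the good one. The block length and the check function are my assumptions; a CRC would be a stronger check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise majority of three stored copies of one critical byte (e.g. SIZE).
 * Recovers the value as long as no bit position is wrong in two copies. */
static uint8_t vote3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

#define BLOCK_LEN 8u   /* payload bytes per copy (illustrative) */

typedef struct {
    uint8_t data[BLOCK_LEN];
    uint8_t check;     /* inverted 8-bit sum of data */
} copy_t;

/* Inverting the sum makes all-0x00 and all-0xFF (erased) blocks fail the check. */
static uint8_t check8(const uint8_t *p, size_t n)
{
    uint8_t s = 0;
    while (n--) s = (uint8_t)(s + *p++);
    return (uint8_t)~s;
}

static bool copy_ok(const copy_t *c)
{
    return check8(c->data, BLOCK_LEN) == c->check;
}

/* Writes copy 0 completely before touching copy 1, so a power loss
 * can leave at most one of the two copies half-written. */
void mirror_store(copy_t pair[2], const uint8_t *data)
{
    for (int i = 0; i < 2; i++) {
        memcpy(pair[i].data, data, BLOCK_LEN);
        pair[i].check = check8(data, BLOCK_LEN);
    }
}

/* Reads from whichever copy verifies and rewrites a bad copy from the good one.
 * Returns false when both copies fail their check. */
bool mirror_load(copy_t pair[2], uint8_t *out)
{
    bool ok0 = copy_ok(&pair[0]);
    bool ok1 = copy_ok(&pair[1]);
    if (!ok0 && !ok1) return false;
    if (!ok0) pair[0] = pair[1];   /* restore redundancy */
    if (!ok1) pair[1] = pair[0];
    memcpy(out, pair[0].data, BLOCK_LEN);
    return true;
}
```

Calling `mirror_load` at start-up both retrieves the data and repairs a copy that was torn by a power loss, so the redundancy is back in place before the next write.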

It sounds as though you have enough EEPROM to do a credible job at this. And none of the above should be taken to imply that you don't do other things such as use your watchdog timer (or other timers) to provide still more ability to at least detect problems, if not completely recover from them. You'll need to carefully think about the features provided by your MCU to see which of them may help, and how.

It can get even more involved, of course. You could get transient errors that affect the program counter. (This is more of a present and continuing problem when sending MCUs into orbit or outer space.) Also, transient errors not only can affect the PC, but may momentarily affect module behavior. More persistent failures can affect modules within your MCU and might hamper its ability to function well. Etc. Let your imagination run wild, if you want. There are lots of ways things can go wrong.