I've been reading up on AVRs myself; I'm not the most knowledgeable, but this is what I've picked up in my research.
First, I realize you are disabling your BOD to save power, but enabling the BOD is listed as a preventive measure against EEPROM corruption in the ATmega328P datasheet. From section 8.4.2:
Keep the AVR RESET active (low) during periods of insufficient power
supply voltage. This can be done by enabling the internal Brown-out
Detector (BOD).
If you are really keen on disabling the BOD, it is possible to have it disabled only while the device is in sleep mode. To accomplish this, set the BODS bit in the MCUCR register just before entering sleep; the datasheet requires a timed write sequence using the BODSE enable bit for this.
Like you, I have become frustrated with the ambiguity about the voltage at which the AVR's EEPROM becomes corrupted. I see nothing in the datasheet. However, quoting Atmel's EEPROM Corruption article on their website:
a regular write sequence to the EEPROM requires a minimum
voltage to operate correctly
Minimum voltage? This is a generic article, so I'd guess this refers to the lowest voltage at which the ATmega328P operates, which is 1.8V.
Hope I've helped.
While the suggested approach from PeterJ is fine, it's cleaner to decouple the "data layer" logic (EEPROM access) from the CRC routine.
This CRC-32 routine can be used universally; it's not bound to program-specific behavior:
    // CCITT CRC-32 (Autodin II) polynomial
    uint32_t CalcCRC32(uint32_t crc, uint8_t *buffer, uint16_t length) {
        while(length--) {
            crc = crc ^ *buffer++;
            for (uint8_t j=0; j < 8; j++) {
                if (crc & 1)
                    crc = (crc >> 1) ^ 0xEDB88320;
                else
                    crc = crc >> 1;
            }
        }
        return crc;
    }
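A note on comparing results with other tools: when started from 0xFFFFFFFF, the routine above leaves out the final bitwise inversion that the common CRC-32 (the variant used by zlib and Ethernet) applies. The sketch below repeats the same bitwise update so that it compiles on its own, and adds a wrapper that performs the inversion. The names `crc32_update` and `crc32_standard` are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Same bitwise update as CalcCRC32 above, repeated so the sketch stands alone. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t length)
{
    while (length--) {
        crc ^= *data++;
        for (uint8_t bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc;
}

/* Hypothetical wrapper: initial value 0xFFFFFFFF plus final inversion. */
uint32_t crc32_standard(const uint8_t *data, size_t length)
{
    return crc32_update(0xFFFFFFFFu, data, length) ^ 0xFFFFFFFFu;
}
```

With the wrapper, the ASCII string "123456789" gives 0xCBF43926, the usual CRC-32 check value. For a purely internal EEPROM check the inversion is not needed, as long as the stored and the recomputed CRC are produced the same way.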
Then you just feed blocks of data to the CRC function like you see fit and as required by your application.
This usage example was just written down and has not been tested. It uses the fictional function eeprom_read(), which reads a block of data from EEPROM. We start at EEPROM address 0.
    const uint8_t BLOCKLENGTH = 128;
    uint32_t crc32;
    uint8_t buffer[BLOCKLENGTH]; // temporary data buffer
    crc32 = 0xFFFFFFFF; // initial CRC value
    for (uint16_t i=0; i < NUMBEROFBLOCKS; i++) {
        eeprom_read(BLOCKLENGTH * i, buffer, BLOCKLENGTH); // read block number i from EEPROM into buffer
        crc32 = CalcCRC32(crc32, buffer, BLOCKLENGTH); // update CRC
    }
    // crc32 is your final CRC here
Note that NUMBEROFBLOCKS is just a placeholder. I hope you get the idea.
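To turn the computed CRC into an integrity check, one option is to store it next to the data and compare on every start-up. The following is a minimal host-side sketch of that idea, not a tested AVR implementation. It assumes a 1 KB EEPROM simulated by a RAM array, reserves the last four bytes for the CRC, and uses the made-up helpers `sim_eeprom_read`, `data_area_crc`, `seal_data_area` and `data_area_intact`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EEPROM_SIZE 1024u               /* assumed EEPROM size in bytes */
#define DATA_SIZE   (EEPROM_SIZE - 4u)  /* last 4 bytes hold the stored CRC */

uint8_t sim_eeprom[EEPROM_SIZE];        /* RAM array standing in for the EEPROM */

/* Hypothetical block read, playing the role of eeprom_read() above. */
void sim_eeprom_read(uint16_t addr, uint8_t *dst, uint16_t len)
{
    memcpy(dst, &sim_eeprom[addr], len);
}

/* Bitwise CRC-32 update, same algorithm as CalcCRC32 above. */
uint32_t CalcCRC32(uint32_t crc, const uint8_t *buffer, uint16_t length)
{
    while (length--) {
        crc ^= *buffer++;
        for (uint8_t j = 0; j < 8; j++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc;
}

/* CRC over the data area, read in small chunks to keep RAM use low. */
uint32_t data_area_crc(void)
{
    uint8_t buffer[32];
    uint32_t crc = 0xFFFFFFFFu;
    uint16_t addr = 0;
    while (addr < DATA_SIZE) {
        uint16_t chunk = (uint16_t)(DATA_SIZE - addr);
        if (chunk > sizeof buffer) {
            chunk = (uint16_t)sizeof buffer;   /* last chunk may be shorter */
        }
        sim_eeprom_read(addr, buffer, chunk);
        crc = CalcCRC32(crc, buffer, chunk);
        addr = (uint16_t)(addr + chunk);
    }
    return crc;
}

/* Store the CRC, least significant byte first, right after the data area. */
void seal_data_area(void)
{
    uint32_t crc = data_area_crc();
    for (uint8_t i = 0; i < 4; i++) {
        sim_eeprom[DATA_SIZE + i] = (uint8_t)(crc >> (8u * i));
    }
}

/* Recompute the CRC and compare it with the stored one. */
bool data_area_intact(void)
{
    uint32_t stored = 0;
    for (uint8_t i = 0; i < 4; i++) {
        stored |= (uint32_t)sim_eeprom[DATA_SIZE + i] << (8u * i);
    }
    return stored == data_area_crc();
}
```

Call `seal_data_area()` after each deliberate change to the data and `data_area_intact()` at start-up. On a real device the array access would be replaced by the actual EEPROM read and write routines.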
This is about what the address happens to be when something goes wrong in your system. The two cases I'd worry about most, thinking this way, are those where the address lines are either all-0 (top) or all-1 (bottom). I can't provide you with a specific mechanism in the case of your MCU, of course. And perhaps this applies more to cases where the EEPROM is an external part than to ones where it is internal. But it sits in my head as a concern, certainly in the case of an external bus. So I'd guess that's also why the author of that comment said what they said.
(The same idea also applies to the top and bottom addresses of any physical page layout for the EEPROM, if any. So I might also worry about the top and bottom of each page, as well as the first and last pages.)
One possible symptom is that a write never terminates. This could be due to some hardware problem (the MCU is suffering from an undetected brown-out and behaving badly, for example), or it could be due to your own software getting stuck, or to other reasons. Regardless, knowing your worst-case write time and then using the watchdog timer (or a different timer) to reset the MCU or to bring attention to the problem can help here. Nothing is perfect, so this is no panacea. But knowing your write times and setting up a means to detect a failure on this basis alone can help.
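The idea of bounding the write time can be sketched in software as a polling loop with a budget. This is only an illustration under stated assumptions: `busy_fn`, `wait_write_done` and the fake device below are hypothetical, and on real hardware the busy check would read the MCU's EEPROM status flag while the poll budget would be derived from the worst-case write time in the datasheet. A hardware watchdog is still needed for the case where the software itself gets stuck.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true while the EEPROM write is still in progress. */
typedef bool (*busy_fn)(void);

typedef enum { WRITE_OK, WRITE_TIMEOUT } write_status;

/* Poll until the write finishes or the poll budget is used up. */
write_status wait_write_done(busy_fn is_busy, uint32_t max_polls)
{
    while (max_polls--) {
        if (!is_busy()) {
            return WRITE_OK;
        }
    }
    return WRITE_TIMEOUT;   /* caller can log, retry, or force a reset */
}

/* Fake device for host-side experiments: busy for a set number of polls. */
uint32_t fake_busy_polls_left;

bool fake_is_busy(void)
{
    if (fake_busy_polls_left == 0) {
        return false;
    }
    fake_busy_polls_left--;
    return true;
}
```

A caller would start the write, call `wait_write_done()` with a budget somewhat above the expected worst case, and treat `WRITE_TIMEOUT` as a reason to distrust the location just written.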
Suppose that your processor resets, though, and you return knowing the fact that something went awry. (Usually, there is a flag that tells you the watchdog caused a reset or else you can use a specialized vector in SRAM that tells you this fact.) You know there may be a problem in the EEPROM.
Or, you might have gone through a complete, normal, power-off and power-on cycle and are just starting up and you find a problem in the EEPROM.
How do you detect that a write didn't fully complete? How do you find sections that weren't properly written?
Well, you need to apply a number of techniques:
I wouldn't bother with wear leveling, directly. But I would recommend that you read thoroughly about the techniques applied in UBIFS so that you make sure you don't miss any good ideas there. One of the nice things about its existence is that you can find exact implementations of various ideas, so that you have no question about how to implement something if you decide you want to, and you have something you can use to cross-validate your own coding if there's a remaining question. No amount of white-paper talking points will do that for you.
I think you should consider the idea of not placing things in fixed locations, though. And I would recommend that you either consider duplication of data or else decent ECC, so that you have a way to automatically recover from the more likely errors. I think you will need ECC for some elements within structures, like the SIZE field I mentioned, because certain kinds of errors at the element level might make it impossible to reliably function, afterwards. So you need to identify single-point failures and cover those with ECC, I think. Once that is handled, you can build on that with other flexible structures which can be skipped or used, as you determine appropriate.
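As one possible shape for the duplication idea, the sketch below keeps two copies of a record, each with a sequence number and a CRC, and always overwrites the copy that is not the newest valid one. It is an assumption-laden illustration rather than a description of any particular scheme: the record layout, `PAYLOAD_SIZE` and all function names are invented here, the slots live in RAM instead of EEPROM, and it shows duplication with error detection, not ECC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_SIZE 8u   /* assumed payload size */

typedef struct {
    uint16_t sequence;               /* increases with every store */
    uint8_t  payload[PAYLOAD_SIZE];
    uint32_t crc;                    /* covers sequence and payload */
} record_t;

/* Bitwise CRC-32 update with the reflected polynomial 0xEDB88320. */
uint32_t crc32_feed(uint32_t crc, const uint8_t *data, size_t length)
{
    while (length--) {
        crc ^= *data++;
        for (uint8_t bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc;
}

/* CRC over the fields one by one, so struct padding is never included. */
uint32_t record_crc(const record_t *r)
{
    uint32_t crc = 0xFFFFFFFFu;
    crc = crc32_feed(crc, (const uint8_t *)&r->sequence, sizeof r->sequence);
    return crc32_feed(crc, r->payload, PAYLOAD_SIZE);
}

bool record_valid(const record_t *r)
{
    return r->crc == record_crc(r);
}

/* Return the newest valid copy, or NULL if neither copy checks out. */
const record_t *record_load(const record_t slots[2])
{
    bool ok0 = record_valid(&slots[0]);
    bool ok1 = record_valid(&slots[1]);
    if (ok0 && ok1) {
        /* Signed difference keeps the comparison correct across wrap-around. */
        return ((int16_t)(slots[0].sequence - slots[1].sequence) > 0)
                   ? &slots[0] : &slots[1];
    }
    if (ok0) return &slots[0];
    if (ok1) return &slots[1];
    return NULL;
}

/* Overwrite the copy that is NOT the newest valid one; CRC goes in last. */
void record_store(record_t slots[2], const uint8_t payload[PAYLOAD_SIZE])
{
    const record_t *newest = record_load(slots);
    record_t *target = (newest == &slots[0]) ? &slots[1] : &slots[0];
    target->sequence = newest ? (uint16_t)(newest->sequence + 1u) : 0u;
    memcpy(target->payload, payload, PAYLOAD_SIZE);
    target->crc = record_crc(target);
}
```

If power fails in the middle of `record_store()`, the half-written copy fails its CRC check and `record_load()` falls back to the older copy, so at most the latest update is lost.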
It sounds as though you have enough EEPROM to do a credible job at this. And none of the above should be taken to imply that you shouldn't also do other things, such as using your watchdog timer (or other timers) to provide still more ability to at least detect problems, if not completely recover from them. You'll need to carefully think about the features provided by your MCU to see which of them may help, and how.
It can get even more involved, of course. You could get transient errors that affect the program counter. (This is more of a present and continuing problem when sending MCUs into orbit or outer space.) Also, transient errors not only can affect the PC, but may momentarily affect module behavior. More persistent failures can affect modules within your MCU and might hamper its ability to function well. Etc. Let your imagination run wild, if you want. There are lots of ways things can go wrong.