Electronic – EEPROM only retaining a value for a short duration

eeprom embedded

Background

I have a product that includes an SPI EEPROM connected to a Microcontroller.

Address 0 of the EEPROM contains what we call the status word. In production the value of the status word is set to 0x2152, which indicates that the EEPROM is "alive" and the rest of the data stored in the EEPROM is sane.

If an erase/write/read/verify failure occurs, we mark the status word as 0xDEAD. We also mark the status word as 0xDEAD if we detect corrupt data at boot. Note that 0xDEAD == ~0x2152 when both are treated as 16-bit values.
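The status-word convention can be sketched in C as follows. This is only an illustration: the enum and the `classify_status` function are hypothetical names, not taken from the actual firmware. It also shows why the complement has to be masked to 16 bits before comparing, since C promotes the operand of `~` to a wider type.

```c
#include <stdint.h>

/* Values taken from the question. */
#define STATUS_ALIVE 0x2152u
#define STATUS_DEAD  0xDEADu

/* ~0x2152u is evaluated in a type wider than 16 bits, so mask the
   result back to 16 bits before comparing it with 0xDEAD. */
_Static_assert(((~STATUS_ALIVE) & 0xFFFFu) == STATUS_DEAD,
               "0xDEAD must be the 16-bit complement of 0x2152");

/* Hypothetical classification of the status word read at boot. */
typedef enum {
    EEPROM_ALIVE,   /* status word is 0x2152 */
    EEPROM_DEAD,    /* status word is 0xDEAD */
    EEPROM_UNKNOWN  /* anything else, e.g. a decayed 0x2142 */
} eeprom_state_t;

eeprom_state_t classify_status(uint16_t status_word)
{
    if (status_word == STATUS_ALIVE) {
        return EEPROM_ALIVE;
    }
    if (status_word == STATUS_DEAD) {
        return EEPROM_DEAD;
    }
    return EEPROM_UNKNOWN;
}
```

Treating any value other than the two known markers as a separate "unknown" state is a design choice of this sketch; it keeps a partially decayed word from being mistaken for either marker.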

The Problem

I've noticed on a small population of our units that when I write a value of 0x2152 to the EEPROM's status word and read it back immediately, it is still 0x2152, but if I then perform a read several seconds later, the value seems to "decay" to 0x2142 or 0x2102. On one particular unit I read the value back five minutes later and it was 0x0000. All of the other locations in the EEPROM can be written to and appear to retain the proper values for long periods of time.
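One way to look at the readings above is to split the difference between the written and the read-back value into bits that dropped from 1 to 0 and bits that rose from 0 to 1. A minimal sketch (the function names are made up for illustration):

```c
#include <stdint.h>

/* Bits that were 1 when written but read back as 0. */
uint16_t bits_lost(uint16_t written, uint16_t read_back)
{
    return (uint16_t)(written & (uint16_t)~read_back);
}

/* Bits that were 0 when written but read back as 1. */
uint16_t bits_gained(uint16_t written, uint16_t read_back)
{
    return (uint16_t)((uint16_t)~written & read_back);
}
```

Applied to the values reported here with 0x2152 as the written word, 0x2142 has lost 0x0010, 0x2102 has lost 0x0050, and 0x0000 has lost every set bit, while `bits_gained` is zero in all three cases. In other words, every observed change is a 1 turning into a 0, never the reverse.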

We do not think we write/erase to that EEPROM location frequently, nominally just once ever. We have identified a situation though where we could perform a lot of writes/erases to that location in production if some steps are not performed correctly. The endurance is a million writes and we could be hitting that.

We perform frequent reads from this location over the life of the product; we generally read every five minutes.

The Question

Previously in my career I've always seen write endurance failures look like sticking bits that seem to never take on a new value. Could this "decay" phenomenon that I am seeing also be explained by excessive writes? Or is there another way an EEPROM could become damaged that could explain this failure mode?

EDIT:

Answers to questions in the comments, and some tangential things:

  • I am deliberately not including the part number or data sheet because we have an open case with the vendor and I do not want to disclose too much if we end up uncovering a quality issue.
  • The SPI clock speed is 1MHz.
  • Writes are self-timed by the part. We confirm the part is done with its write before attempting any other operations or powering it down (the part signals it is done on its MISO line).
  • We're using a hardware SPI peripheral with software control of CS.
  • This is a bare metal system.
  • We have adequate delays on power up before attempting to communicate with the part.
  • We always enable writes before writing.
  • Interrupts are not a factor; we do a blocking write in the main thread.
  • The minimum erase/write block is one 16 bit word, this part is word addressable.
  • This part has an erase/write endurance of 1M cycles per word.
  • The power supply to the system is very stable, the system is powered by a lithium thionyl chloride battery that has tab welded leads. It is connected to the PCB with a robust connector that is potted over so vibration/contact bounce isn't possible. The system is "always on", the microcontroller is in control of when it goes to sleep.
  • The voltage at the VCC pin of the EEPROM is stable and within spec throughout the duration of a write. This was measured with an o'scope.
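For reference, the write sequence described in the list above (enable writes, start the self-timed write, block until the part reports done, then read back) could look roughly like the sketch below. The part is deliberately not identified, so the real SPI driver is replaced here by an in-memory stand-in; every function name and the simulated timing are assumptions made only so the example runs on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* In-memory stand-in for the real part (hypothetical, for illustration). */
#define SIM_WORDS 64u
static uint16_t sim_mem[SIM_WORDS];
static bool     sim_write_enabled;
static unsigned sim_busy_polls;  /* polls left until the write completes */

void eeprom_write_enable(void)
{
    sim_write_enabled = true;
}

/* Stand-in for sampling the ready/busy indication on MISO. */
bool eeprom_is_busy(void)
{
    if (sim_busy_polls > 0u) {
        sim_busy_polls--;
        return true;
    }
    return false;
}

void eeprom_start_write(uint16_t addr, uint16_t value)
{
    if (!sim_write_enabled || addr >= SIM_WORDS) {
        return;                 /* write is ignored */
    }
    sim_mem[addr] = value;
    sim_busy_polls = 3u;        /* pretend the self-timed cycle takes 3 polls */
    sim_write_enabled = false;  /* simulation choice, not a claim about the part */
}

uint16_t eeprom_read(uint16_t addr)
{
    return (addr < SIM_WORDS) ? sim_mem[addr] : 0xFFFFu;
}

/* Enable, write, wait for completion, then read back and compare. */
bool eeprom_write_verified(uint16_t addr, uint16_t value)
{
    eeprom_write_enable();
    eeprom_start_write(addr, value);
    while (eeprom_is_busy()) {
        /* blocking wait in the main thread, as described above */
    }
    return eeprom_read(addr) == value;
}
```

Note that on the affected units an immediate read-back like this one still returns 0x2152, so this check passes even though the value decays seconds later.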

Best Answer

In your comments you ask "if a write endurance issue can explain this specific type of EEPROM failure mode." From my past experience I would say the answer is absolutely yes.

"We have identified a situation though where we could perform a lot of writes/erases to that location in production if some steps are not performed correctly. The endurance is a million writes and we could be hitting that."

As you may know, the endurance spec of an EEPROM only applies to normal usage. If the device is written rapid-fire (for example, a firmware bug causing the device to get stuck in a loop performing writes immediately one after the other), then the endurance will be much shorter. It sounds like that may be happening here.
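To get a feel for how quickly a stuck loop could burn through even the nominal rating, here is a back-of-the-envelope sketch. The write cycle time is an assumption (the question does not give one, and the part is not identified), and the function name is made up for illustration.

```c
#include <stdint.h>

/* Seconds needed to reach the rated number of cycles when writes to one
   word run back to back. write_cycle_ms is an assumed figure, not a
   value from the question or from any data sheet. */
uint32_t seconds_to_endurance_limit(uint32_t endurance_cycles,
                                    uint32_t write_cycle_ms)
{
    /* Widen before multiplying so the product cannot overflow 32 bits. */
    return (uint32_t)(((uint64_t)endurance_cycles * write_cycle_ms) / 1000u);
}
```

With the 1,000,000-cycle rating from the question and an assumed 5 ms per write, this comes to 5000 seconds, or roughly 83 minutes of continuous writing. If rapid-fire writing shortens the real endurance as described above, the part would be in trouble even sooner.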

"Previously in my career I've always seen write endurance failures look like sticking bits that seem to never take on a new value. Could this "decay" phenomenon that I am seeing also be explained by excessive writes?"

Yes. While completely "burnt" (i.e. fatigued) EEPROM cells will be stuck at a single value, it is also entirely possible for EEPROM fatigue to cause the memory operation to just degrade, rather than fail completely.


Footnote / war story illustrating this phenomenon:

I was on a team where we built a device with EEPROM memory storage. The customer complained that the EEPROM was failing to hold its value. They sent it back to us, we tested it and it worked fine. We sent it back to them and it failed again. This whole loop happened one more time until we visited the customer on site and found the real problem. The basic root cause:

  • Customer was operating the device in a manner which caused the EEPROM to erase over and over in rapid succession, fatiguing the part. This was a surprise to us, another case where "no customer would ever do it that way" turned out to be a faulty assumption.
  • Every time we tested the product at our facility we operated it "normally", so we did not see the problem.
  • Here's the key: luckily we had device-level components engineers on our team, and one of those engineers informed us that EEPROM memory cells can have a self-healing effect over time. If you let the device rest, it will actually start to operate somewhat normally again, but obviously that device should no longer be trusted. (Note this was very surprising to me and I still don't understand the physics behind it, but all my empirical observations tell me this engineer was correct.)

So the reason this problem was so infuriatingly difficult to troubleshoot is that the EEPROM cells got fatigued by the customer to the point of failure, but then they had a chance to rest during their time being shipped from the customer facility to ours, so they worked fine in our testing! Then we returned them to the customer, where they would promptly get fatigued again and fail again.