Electronic – RTC clock skipping months back

rtc

I will come to a question further down, but first a little background

We are struggling with reproducing a nasty bug that we have been getting reports for.

The symptoms clearly show that the RTC (a DS1305) is skipping from November 30 to April 1 of the same year (i.e. backwards).

We have received too many reports to write it off as a hardware fault, a solar flare or some other unlikely one-time error. However, all attempts at reproducing this behavior in-house have failed, even with the exact same hardware and settings that our customer used when the error occurred.

Since it doesn't always happen, nor on all devices, it doesn't feel like a software bug. At least not one acting on its own.

Question

Any ideas for how to go about reproducing this kind of behavior, fault-finding methods, what to look for, etc.?

Does anyone else have any experience with this kind of error?

We are aware of one other case with a very similar symptom, but it is unclear whether it is related at all.

I know there are a lot of details missing. I can't disclose any source, and simply stating everything I know would be a little too much to type; I can fill you in if you post concrete questions.

Update

Finally!
We have been able to reproduce this erratic behaviour in the lab!

Pressed for time as we are, all our attempts at reproducing the problem were started one or a few days before 30/11 to see how it went, and all of them passed over to 1/12 just fine. Only after that did we notice that all affected customer devices had been started during October.

We can't really work with waiting over a month for reproducing, so we came up with a work-around that to my surprise actually seems to work.

By speeding up the clock!

We have replaced the standard 32.768 kHz oscillator with a 1 MHz signal, and can now reproduce the problem in about a day.
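As a rough sanity check on the "about a day" figure, the arithmetic can be sketched as below. This assumes the RTC counts time in direct proportion to its input frequency, which the successful reproduction suggests but which is not something I have verified against the datasheet. The function names are made up for illustration.

```c
/* Ratio between the substitute clock and the nominal crystal frequency. */
double speedup_factor(double test_hz, double nominal_hz)
{
    return test_hz / nominal_hz;
}

/* Wall-clock hours needed for the RTC to advance by rtc_days days,
 * assuming (hypothetically) that RTC time scales linearly with the clock. */
double wall_clock_hours(double rtc_days, double factor)
{
    return rtc_days * 24.0 / factor;
}
```

With 1 MHz in place of 32.768 kHz the factor is roughly 30.5, so about 31 RTC days (late October to December 1) pass in a little over 24 hours of lab time.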

I'll keep you posted as to what we will discover about this.

Thank you all for excellent brainstorming. I appreciate it a lot.

Now, I'm off trying to further trim the reproduction time, and dig out more facts about it.

Solved

I have posted the root cause of this as the accepted answer.

Summary: month value used was not a valid BCD value.

Best Answer

Ok, the original question asked for methods, not the root cause, which I'll give here.

But I don't like to leave the question unanswered. No disrespect to all the fine suggestions that I've received. Thanks to everyone who has contributed to us finally being able to resolve this.

It all went a lot better after we realized that the issue was related to setting the time in October, rather than some obscure bug going into December.

The culprit was a bug in the INT to BCD encoding of the month value, where the original author mistakenly added 1 to the BCD-encoded value rather than to the INT value before encoding it. As a result, October was sent to the RTC as 0x0A rather than 0x10.
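Since I can't post the real source, here is a minimal sketch of what the bug and the fix look like. It assumes the firmware keeps the month as a 0-based integer (0 = January), which is what the "+1" implies; all function names are hypothetical and not taken from our code base.

```c
#include <stdint.h>

/* Encode an integer 0..99 as packed BCD (tens in the high nibble). */
uint8_t int_to_bcd(uint8_t value)
{
    return (uint8_t)(((value / 10) << 4) | (value % 10));
}

/* Buggy variant: the +1 is applied AFTER encoding, so it is a plain
 * binary add on a BCD value and never carries into the tens nibble. */
uint8_t month_to_rtc_buggy(uint8_t month0)
{
    return (uint8_t)(int_to_bcd(month0) + 1);
}

/* Fixed variant: make the month 1-based first, then encode it. */
uint8_t month_to_rtc_fixed(uint8_t month0)
{
    return int_to_bcd((uint8_t)(month0 + 1));
}
```

Under that assumption the two variants agree for every month except October: index 9 encodes to 0x09, and adding 1 gives 0x0A instead of 0x10. For November and December the low nibble is 0 or 1 before the add, so no carry is needed and the result is correct. That would explain why only devices set in October were affected.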

The clock happily steps from 0x0A to 0x0B when going into November, and the BCD to INT routine wasn't picky about invalid BCD values. It still decoded 0x0B correctly (BCD 0x0B to INT = 0x0B, i.e. 11, even though 0x0B is not a valid BCD value).
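To illustrate why the read path masked the problem, here is a sketch of a naive decoder next to one that validates its input. Again, this is not our actual code; the names are invented and the checked variant is just one possible way to reject bad values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive decode: never checks that each nibble is in the range 0..9. */
uint8_t bcd_to_int_naive(uint8_t bcd)
{
    return (uint8_t)((bcd >> 4) * 10 + (bcd & 0x0F));
}

/* A packed BCD byte is valid only if both nibbles are decimal digits. */
bool bcd_is_valid(uint8_t bcd)
{
    return (bcd >> 4) <= 9 && (bcd & 0x0F) <= 9;
}

/* Checked decode: returns false and leaves *out untouched on bad input. */
bool bcd_to_int_checked(uint8_t bcd, uint8_t *out)
{
    if (!bcd_is_valid(bcd)) {
        return false;
    }
    *out = bcd_to_int_naive(bcd);
    return true;
}
```

The naive version turns 0x0A into 10 and 0x0B into 11, so October and November looked perfectly fine to the rest of the firmware. A checked decode on the read path would have flagged the invalid register contents as soon as the device was set in October.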

I have not yet confirmed how we ended up in April; that is still on my TODO list for this.

I am rather confident however that I am finally on to the real root cause of this issue.

@Kellenjb: and you were right, too: it was a firmware bug :)

Update

OK, I have now confirmed that when the DS1305 increments the invalid 0x0A month value (October), it ends up with 0x0B for November. With a naive implementation of BCD to INT (one that doesn't check the BCD input for validity), BCD 0x0B == INT 0x0B, so it still works.

The failure comes when the DS1305 has to increment 0x0B going into December. Having seen 0x0A go to 0x0B, one could expect 0x0C, but no: it ends up with 0x04, which reads as April.

Mystery solved.