Electronic – Debugging DDR bus issues

Tags: bus, ddr, debugging, ram, timing

We have an SBC board, in the style of the Leopardboard or Beagleboard, that is misbehaving. It's based on the Leopardboard design (TI-DM368 CPU, DDR2 RAM, NAND Flash).

Developing software on the Leopardboard works well. However, the first prototype board turned up and refused to boot. Investigation led us to the point where slowing the clocks (ARM and DDR) down lets the board boot.

The hardware (board layout, termination, DDR chip, whatever) is the #1 suspect, as we can run identical software on the Leopardboard and it works fine. Unfortunately, the nature of the fault means we can't boot into Linux to run strenuous RAM tests that might provide better diagnostics.

From the hardware side, the DDR clock is one of 345MHz, 486MHz, 680MHz depending on the clocking settings – beyond the scope of any of our scopes or logic analysers.

So – it's sort of two questions in one really:

From the hardware point of view, other than "rent a faster scope", is there an approach to diagnosing this with what's to hand? We have 200MHz DSOs, a <100MHz logic sniffer, and a 1.5GHz spectrum analyser to play with.

From the software point of view (I know, wrong forum) if anyone has tips or code snippets on exercising DDR RAM, they'd be greatly appreciated.

Edited to add the answer:

Well we borrowed a Tektronix 7104 and it worked so well we didn't even have to touch the board with it 😉

The problem turned out to be a sagging 1V3 (1.3 V) power supply line being strangled by 0402-size SMT ferrites.

The symptoms were that, near the marginal operating frequency, the board would boot but lock when a memory-hungry / high-bandwidth video streaming process tried to start. This, coupled with the fact that running slower made it work OK, led us to believe it was a frequency-related issue when in fact the lower clock frequency was also putting less load on the power supply components.

The 0402 ferrite beads used for filtering were going surprisingly high impedance as the current went up, dropping a critical supply line below the allowable operating point.
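To put rough numbers on that reasoning, here is a minimal sketch that treats the bead as a plain series resistance and checks whether the rail at the load stays above a minimum. The function names and every value in the usage note are hypothetical, chosen only for illustration; they are not measurements from our board, and a real bead's behaviour under load is more complicated than a fixed resistance.

```c
#include <stdio.h>

/* Voltage seen at the load after a series element (here: a ferrite bead),
 * modelled as a fixed resistance. This is a simplifying assumption. */
double rail_at_load(double v_source, double i_load_amps, double r_series_ohms)
{
    return v_source - i_load_amps * r_series_ohms;
}

/* Returns 1 if the rail at the load is at or above v_min, otherwise 0. */
int rail_ok(double v_source, double i_load_amps, double r_series_ohms,
            double v_min)
{
    return rail_at_load(v_source, i_load_amps, r_series_ohms) >= v_min;
}

/* Print a one-line summary for a given load current (illustrative helper). */
void report_rail(double v_source, double i_load_amps, double r_series_ohms,
                 double v_min)
{
    printf("I = %.2f A -> %.3f V at load (%s)\n", i_load_amps,
           rail_at_load(v_source, i_load_amps, r_series_ohms),
           rail_ok(v_source, i_load_amps, r_series_ohms, v_min)
               ? "ok" : "below minimum");
}
```

With made-up figures of a 1.3 V source, 0.5 ohm of series resistance and a 1.2 V minimum, a 0.1 A load passes (1.25 V at the load) while a 0.4 A load fails (1.1 V). That is the same pattern we saw: a lower clock draws less current, so the drop shrinks and the fault looks frequency-related.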

Unfortunately this means I can't give the "winning answer" to Dave Tweed, but it does mean our board now works. Even better, it's the boss's fault not mine!

Oh, and Tek 7104s are freakin' awesome feats of electronic engineering. If you've never looked at how they work, it's pure analogue kung-fu.

Best Answer

This is more of an "extended comment" than an answer, but let me start by saying no, I don't think you can debug this issue with such a limited set of test equipment. A person who has had a lot of experience doing these designs might be able to get some clues about what's going wrong using them, but I get the impression that you're not such a person.

For example, issues with risetime and ringing don't scale with clock frequency. If you can't see them at the high clock rate, you won't see them at the low clock rate, either.

The degree of success at this sort of thing depends on just how closely you duplicated the reference design — not just the schematic, but also the physical design, such as relative parts placement, PCB stackup, the routing and length-matching of traces, etc. Unless you know exactly what you're doing, you must match every detail of the reference design.

The fact that it does run at a lower clock rate suggests that you have issues with timing skews, possibly due to mismatched trace lengths, but also possibly due to mismatched terminations. You could verify this by renting the high-speed scope, but your time might be better spent getting started on a respin of the board right away.
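As a rough illustration of why skew matters more as the clock goes up, the sketch below converts a trace-length mismatch into a delay and compares it with the bit time. It assumes data is transferred on both clock edges (so the bit time is half the clock period) and takes the propagation delay per millimetre as an input; something in the region of 6 to 7 ps/mm is a common rule of thumb for FR-4, but the real figure depends on the stackup. The function names are hypothetical.

```c
#include <stdio.h>

/* Delay difference (ps) caused by a trace-length mismatch.
 * ps_per_mm is an assumed propagation delay for the board material. */
double skew_ps(double mismatch_mm, double ps_per_mm)
{
    return mismatch_mm * ps_per_mm;
}

/* Bit time (ps) for a double-data-rate bus: half of one clock period. */
double bit_time_ps(double clock_mhz)
{
    return 1.0e6 / (2.0 * clock_mhz);
}

/* Fraction of the bit time consumed by the skew (0.0 to 1.0 and beyond). */
double skew_fraction(double mismatch_mm, double ps_per_mm, double clock_mhz)
{
    return skew_ps(mismatch_mm, ps_per_mm) / bit_time_ps(clock_mhz);
}

/* Print the skew budget for one clock setting (illustrative helper). */
void report_skew(double mismatch_mm, double ps_per_mm, double clock_mhz)
{
    printf("%.0f MHz: %.0f ps skew is %.1f%% of a %.0f ps bit time\n",
           clock_mhz, skew_ps(mismatch_mm, ps_per_mm),
           100.0 * skew_fraction(mismatch_mm, ps_per_mm, clock_mhz),
           bit_time_ps(clock_mhz));
}
```

For example, a 10 mm mismatch at an assumed 7 ps/mm is 70 ps of skew. At a 250 MHz clock that is 3.5% of the 2000 ps bit time, and the same 70 ps takes a proportionally larger share as the clock rises. Skew is only one term in the timing budget, alongside setup, hold and jitter, so treat this as a sanity check rather than a sign-off.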

Also, it's foolish to take a brand new board design and expect to boot a full-blown OS and application on it right away. You should always plan to develop (or find) some basic functional tests on individual functional units such as memory and communications interfaces.
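For the memory interface, a basic functional test of that kind might look like the sketch below. It follows the widely used three-stage pattern for RAM tests: walking ones on the data bus, power-of-two offsets on the address bus, then an increment-and-invert pass over the whole region. The function names are hypothetical. It assumes the code runs from somewhere other than the RAM under test (for example, an early boot stage in on-chip memory), that the data cache is off for the region, and that the region size is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Walking-ones test at one address: catches stuck or shorted data lines.
 * Returns 0 on success, or the first pattern that failed to read back. */
uint32_t test_data_bus(volatile uint32_t *addr)
{
    for (uint32_t pattern = 1; pattern != 0; pattern <<= 1) {
        *addr = pattern;
        if (*addr != pattern)
            return pattern;
    }
    return 0;
}

/* Address-bus test using power-of-two word offsets. n_bytes must be a
 * power of two. Returns NULL on success, or the address where a stuck or
 * shorted address line was detected. */
volatile uint32_t *test_address_bus(volatile uint32_t *base, size_t n_bytes)
{
    const size_t mask = n_bytes / sizeof(uint32_t) - 1;
    const uint32_t pattern = 0xAAAAAAAAu, antipattern = 0x55555555u;
    size_t offset, test_offset;

    /* Write the default pattern at each power-of-two offset. */
    for (offset = 1; (offset & mask) != 0; offset <<= 1)
        base[offset] = pattern;

    /* Address bits stuck high: writing offset 0 must not disturb others. */
    base[0] = antipattern;
    for (offset = 1; (offset & mask) != 0; offset <<= 1)
        if (base[offset] != pattern)
            return &base[offset];
    base[0] = pattern;

    /* Address bits stuck low or shorted together. */
    for (test_offset = 1; (test_offset & mask) != 0; test_offset <<= 1) {
        base[test_offset] = antipattern;
        if (base[0] != pattern)
            return &base[test_offset];
        for (offset = 1; (offset & mask) != 0; offset <<= 1)
            if (offset != test_offset && base[offset] != pattern)
                return &base[test_offset];
        base[test_offset] = pattern;
    }
    return NULL;
}

/* Device test: fill with an incrementing pattern, verify and invert,
 * then verify the inverted values. Returns NULL on success, or the
 * first failing address. */
volatile uint32_t *test_device(volatile uint32_t *base, size_t n_bytes)
{
    const size_t n_words = n_bytes / sizeof(uint32_t);
    uint32_t pattern;
    size_t i;

    for (pattern = 1, i = 0; i < n_words; pattern++, i++)
        base[i] = pattern;

    for (pattern = 1, i = 0; i < n_words; pattern++, i++) {
        if (base[i] != pattern)
            return &base[i];
        base[i] = (uint32_t)~pattern;
    }

    for (pattern = 1, i = 0; i < n_words; pattern++, i++)
        if (base[i] != (uint32_t)~pattern)
            return &base[i];

    return NULL;
}
```

On a target you would point `base` at the start of the DDR region and report the result over a UART or an LED, running the three tests in the order shown so that a data-bus fault is not misread as an address fault. A test like this finds wiring and gross timing faults; it does not load the bus or the supplies the way a real application does, so a board can pass it and still fail under heavy traffic.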