FBDIMM Thermal/TDP Issue

ecc memory physical-environment

I've got a 2U dual Xeon server with 8x 2GB DDR2 FB-DIMM ECC RAM on an Intel S5000PSL board. It's stable, the RAM memtests clean, and both CPUs are running cool (35C). Half of the sticks run ~60-65C, which seems hot to me but is well within TDP… BUT there are four that run 75-90C+ depending on load.

I'd expect it to be bad heat spreaders, but it's ANY stick in those four slots, no matter how I shuffle them. The ram's next to the PSU, there's about 3/4-1" between the edge of the sockets and the side of the PSU, but the stick closest to the PSU is one of the cool ones, so it's not overheating from that somehow.

Physically it's laid out: C C H C H H H C [PSU]

C-cool, H-hot

I tried adding a pair of 30mm fans to the back for outflow, and even some stick-on (but removable) heat sinks across the top of the sticks, stuck to the spreaders to help spread the heat out some, but both of those seemed to only make it worse for some reason, so I'm completely stumped.

Anyone have any ideas at all what the heck's going on, and especially how to fix?

EDIT: I put in a temporary duct to route the airflow from the CPU, which had been blowing over the modules, away from them. When I looked 15 minutes later they were even hotter, and one broke 97C. Needless to say I shut it down instantly. I'll remove the duct and rerun a memtest later to be sure nothing was damaged.

EDIT #2: I ran memtest86+ overnight, results were 100% clean, the SEL is clear, the BIOS error log is clear, the system status LED is solid green, everything is 100% rock solid and clean….

Except those RAM temps (slots B1, C1, C2, D1 if I'm reading the layout right), and now the beeps from, I assume, the BIOS, that started a couple of days ago after I pulled and reseated everything: two short, short pause, three short. I can't find that pattern in any manual I've got access to, but every test and burn-in that I can throw at it says it's clean and rock solid.

I can live with the beeps though I'd like to know what they mean, but the temps are concerning me. The only thing I haven't tried is modding the case top with a 120/240mm exhaust fan and I'd seriously rather not – but even with the lid off they still run 75ish.

EDIT #3: I did a little more digging. The RAM slots are broken into two branches, each branch having two channels with two slots: A1/A2, B1/B2, C1/C2, D1/D2. As of now, idle with the cover off, the temps are as follows: A1: 63C / A2: 66C / B1: 71C / B2: 60C / C1: 76C / C2: 81C / D1: 81C / D2: 67C. If it was one channel or even one branch I'd think controller or something, but it's B1, C1, C2, and D1 that are so much higher than the others (I didn't notice B1 earlier), not even B2/C1/C2/D1 all in a block. And it's the same regardless of the order I switch the sticks around, so I can't see how it's the sticks themselves.
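To make the pattern in those readings easier to see, here is a rough Python sketch that groups the eight idle temperatures by channel and by branch and lists the slots above a cutoff. The temperatures are the ones quoted above; the 70C cutoff and the assumption that channels A/B form one branch and C/D the other are mine, not something read off the board.

```python
from statistics import mean

# Idle, cover-off readings quoted in the post, in degrees C.
temps_c = {
    "A1": 63, "A2": 66,
    "B1": 71, "B2": 60,
    "C1": 76, "C2": 81,
    "D1": 81, "D2": 67,
}

# Assumption: channels A and B share one branch, C and D the other.
branches = {"branch 0": "AB", "branch 1": "CD"}

# Hypothetical cutoff, chosen only to separate the two groups of readings.
HOT_THRESHOLD_C = 70


def channel_means(temps):
    """Average temperature per channel (first letter of the slot name)."""
    by_channel = {}
    for slot, temp in temps.items():
        by_channel.setdefault(slot[0], []).append(temp)
    return {ch: mean(vals) for ch, vals in by_channel.items()}


def branch_means(temps, branch_map):
    """Average temperature per branch, using the assumed channel grouping."""
    return {
        name: mean(t for slot, t in temps.items() if slot[0] in chans)
        for name, chans in branch_map.items()
    }


def hot_slots(temps, threshold):
    """Slots reading strictly above the threshold, sorted by name."""
    return sorted(slot for slot, t in temps.items() if t > threshold)


if __name__ == "__main__":
    print("hot slots:", hot_slots(temps_c, HOT_THRESHOLD_C))
    print("per channel:", channel_means(temps_c))
    print("per branch:", branch_means(temps_c, branches))
```

With these numbers the slots over the cutoff are B1, C1, C2 and D1, so they span three channels and (under the assumed grouping) both branches, although the C/D side averages noticeably hotter than the A/B side.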

If it's not a specific channel, or a specific stick, I don't know what could be going on. I mentioned the beeps during POST earlier, but I can't find them in any manual, and nothing I can test shows any problems with anything anywhere, except the temperatures, which seem to have no reason at all.

Best Answer

I'm pretty sure you've got a problem with the VRM(s) and/or chokes delivering power to those memory slots. I've seen this exact thing happen on an old HP DL380 G5 with Xeon 54xx CPUs and FBDIMMs. We had to swap out the system board; in our case it was bad enough to actually kill a couple of DIMMs.

Ironically the overclocking boys over on Superuser.com consciously go out of their way to do this so they can get more memory performance :)