Ubuntu – 8 GPU machine freezes

cuda, nvidia, supermicro, ubuntu

We have a SuperMicro GPU server with:

  • 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
  • 512GB memory
  • more than enough disk space
  • X10DRG-O+-CPU (BIOS Version : 2.0a [current])
  • X9DRG-O-PCIE PCI-E expander card
  • 8x GTX 1080

It is set up with Ubuntu 16.04.1 LTS, NVIDIA driver 367.57 and CUDA 8.0.
When it runs, it only runs fine for a while. It is, however, completely useless with the stock kernel (v4.4): the system freezes almost immediately when doing anything non-trivial on any GPU. We suspected a hardware issue, but cooling is fine, and a second, almost identical machine (just a different maker of the GPUs) shows the exact same behaviour.

To make it run fine for some time, you have to downgrade the kernel to v3.14.1-trusty (we tested almost every version before that one). But there are still random freezes, usually with nothing in the logs. Sometimes the whole machine freezes, other times only the GPU-related processes.

There seem to be other [1] people [2] having this issue, but no solution there.

Is anyone having the same experience with this type of machine?

Update:
The machines seem to run stably (regardless of the software) if the cards are inserted on only one side of the PCI-E expander, which means all cards are driven by the same CPU.
Another machine, however, seems to run stably with 8 cards (uptime of about 4 months right now) on kernel 3.19, after months of showing the problems described above. Bizarre.
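
For anyone who wants to check which CPU each card hangs off before reshuffling hardware, here is a minimal sketch that groups the NVIDIA GPUs by NUMA node using the standard Linux sysfs attributes `vendor`, `class` and `numa_node`. The function name `gpus_by_numa_node` is made up for illustration, and it assumes the firmware reports NUMA locality (a value of -1 means the kernel has no such information).

```python
from collections import defaultdict
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def gpus_by_numa_node(sysfs_root="/sys/bus/pci/devices"):
    """Group NVIDIA display devices by the NUMA node (CPU socket) they attach to.

    Hypothetical helper: returns {numa_node: [pci_address, ...]}.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}  # not a Linux system, or sysfs is not mounted
    groups = defaultdict(list)
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries that lack these attributes
        # Class 0x03xxxx is a display controller; this skips the GPUs' audio functions.
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            groups[node].append(dev.name)
    return dict(groups)


if __name__ == "__main__":
    for node, devices in sorted(gpus_by_numa_node().items()):
        print(f"NUMA node {node}: {len(devices)} GPU(s): {', '.join(devices)}")
```

If all eight cards show up under a single node, they are all driven by the same CPU, which is the configuration that appeared stable here.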

[1] https://devtalk.nvidia.com/default/topic/958927/gpu-job-fail-/

[2] https://devtalk.nvidia.com/default/topic/959699/linux/nvidia-smi-periodically-crashes-system-on-ubuntu-16-04-lts/

Best Answer

I had the exact same issue on the same computer. To fix this, you will need to disable the on-board VGA by changing jumper JPG1 on the motherboard. Unfortunately, you'll need to remove the daughterboard to do so. Note that, to re-install the daughterboard, you may need to apply quite a bit of pressure for it to connect properly with the motherboard again.
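
To confirm the jumper change took effect, one option is to list any display controller on the PCI bus that is not an NVIDIA card. This is only a sketch: the helper name `non_nvidia_display_controllers` is hypothetical, and it assumes that a disabled on-board VGA no longer shows up as a PCI device, which I have not verified on this board.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def non_nvidia_display_controllers(sysfs_root="/sys/bus/pci/devices"):
    """Return [(pci_address, vendor_id), ...] for display controllers not made by NVIDIA.

    Hypothetical helper; reads the standard sysfs `vendor` and `class` attributes.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # not a Linux system, or sysfs is not mounted
    found = []
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
        except OSError:
            continue  # skip entries that lack these attributes
        # PCI base class 0x03 covers VGA and other display controllers.
        if pci_class.startswith("0x03") and vendor != NVIDIA_VENDOR_ID:
            found.append((dev.name, vendor))
    return found


if __name__ == "__main__":
    leftovers = non_nvidia_display_controllers()
    if leftovers:
        for address, vendor in leftovers:
            print(f"Non-NVIDIA display controller at {address} (vendor {vendor})")
    else:
        print("No non-NVIDIA display controller found")
```

Running it before and after moving JPG1 should show whether the on-board adapter has disappeared from the bus.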
