CentOS – NVMe device dropouts – I/O 0 QID 0 timeout, controller disabled

Tags: centos, intel, kernel, nvme, supermicro

We have 6 Supermicro servers, all of the same (or very similar) spec.
Over the last two weeks one of them has been dropping an NVMe disk at random times due to:


[ 66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller
[ 66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)
[ 66.957961] nvme 0000:03:00.0: Removing after probe failure status: -5

We have tried:

  • Swapping the disk
  • Swapping the NVMe cables
  • Swapping the NVMe controller (motherboard)
  • Swapping the backplane
  • Downgrading the kernel from 4.5.0 to 4.4.2, given recent changes to the storage subsystem
  • Upgrading the disk and motherboard firmware
  • Swapping the motherboard

So it's essentially a whole new server, except that we haven't done a reinstall. Why? Because I want to understand the problem: if reinstalling fixes it, we'll never know why it's happening on this machine and not on our other 5.

Best Answer

I've had a similar failure with Intel P4600 drives (different from yours). Intel's ruling in our case was a rare firmware issue, with the action items being to replace the specific drives and update the firmware to the latest version on all remaining drives. YMMV.
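If you want to check whether the drives in a box are all on the same firmware before and after such an update, something along these lines may help. It is only a sketch: it assumes the kernel exposes the `model` and `firmware_rev` attributes under `/sys/class/nvme/<controller>/`, which may not hold on every kernel version, and the function name `firmware_inventory` is made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def firmware_inventory(sysfs_root="/sys/class/nvme"):
    """Group NVMe controllers by (model, firmware revision).

    Assumption: each controller directory under sysfs_root contains
    'model' and 'firmware_rev' text attributes.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}  # no NVMe class directory on this system
    groups = defaultdict(list)
    for ctrl in sorted(root.iterdir()):
        try:
            model = (ctrl / "model").read_text().strip()
            firmware = (ctrl / "firmware_rev").read_text().strip()
        except OSError:
            continue  # attribute missing, or controller vanished mid-scan
        groups[(model, firmware)].append(ctrl.name)
    return dict(groups)


if __name__ == "__main__":
    for (model, firmware), ctrls in sorted(firmware_inventory().items()):
        print(f"{model}  fw={firmware}  controllers={', '.join(ctrls)}")
```

More than one firmware revision for the same model is a hint that some drives were missed. Note that a drive that has already been removed after a probe failure will not show up here at all.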

The error you are getting means that the drive is present at the PCIe level and can even be communicated with at a basic NVMe level, but it cannot complete full initialization because of an internal assert on the drive (again, this is based on the failure analysis (FA) results for our drives; it may differ for you).
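To see how often this sequence occurs and on which PCI address, you could scan saved kernel logs for it. The sketch below is a rough illustration, not a diagnostic tool: the regular expressions are modelled only on the three log lines quoted in the question, and the labels (`admin_timeout`, `identify_failed`, `removed`) and the function name `classify` are invented here. QID 0 is the NVMe admin queue, which fits the picture of the drive failing during initialization rather than during normal I/O.

```python
import re
from collections import defaultdict

# Matches kernel log lines of the form seen in the question, e.g.
# [ 66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme\s+"
    r"(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):\s+"
    r"(?P<msg>.*)"
)

# Hypothetical labels for the three messages in the failure sequence.
PATTERNS = {
    "admin_timeout": re.compile(r"I/O \d+ QID 0 timeout"),
    "identify_failed": re.compile(r"Identify Controller failed"),
    "removed": re.compile(r"Removing after probe failure"),
}


def classify(log_text):
    """Return {pci_address: [(timestamp, label), ...]} for matching lines."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        for label, pattern in PATTERNS.items():
            if pattern.search(match.group("msg")):
                events[match.group("pci")].append(
                    (float(match.group("ts")), label)
                )
    return dict(events)


if __name__ == "__main__":
    sample = """\
[ 66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller
[ 66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)
[ 66.957961] nvme 0000:03:00.0: Removing after probe failure status: -5
"""
    for pci, seq in classify(sample).items():
        print(pci, seq)
```

Feeding it the output of `dmesg` or a saved log from each of the six servers would show whether the healthy machines ever log the same admin-queue timeout.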
