Linux – Linus/ext4/nvme crashes during high io

ext4iolinuxnvmessd

During mvn compilation, I have random crashes.

The problem seems related to high IO and in kern.log, I can see things like:

kernel: [158430.895045] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
kernel: [158430.951331] blk_update_request: I/O error, dev nvme0n1, sector 819134096 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
kernel: [158430.995307] nvme nvme1: Removing after probe failure status: -19
kernel: [158431.035065] blk_update_request: I/O error, dev nvme0n1, sector 253382656 op 0x1:(WRITE) flags 0x4000 phys_seg 127 prio class 0
kernel: [158431.035083] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:309: I/O error 10 writing to inode 3933601 (offset 16777216 size 2101248 starting block 31672832)
kernel: [158431.035085] Buffer I/O error on device nvme0n1p1, logical block 31672320
kernel: [158431.035090] ecryptfs_write_inode_size_to_header: Error writing file size to header; rc = [-5]

To replicate the error, I use:

stress-ng --all 8  --timeout 60s --metrics-brief --tz

I've tried some boot options, like adding acpiphp.disable=1 pcie_aspm=off to /etc/default/grup, this seemed to help stress-ng test, but not my compilation.

  • Distribution: Ubuntu 19.10
  • Kernel: 5.3.0-45-generic #37-Ubuntu SMP Thu Mar 26 20:41:27 UTC 2020

nvme list shows:

Node             SN                   Model                            Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     28FF72PTFQAS         KXG50ZNV256G NVMe TOSHIBA 256GB          1        256,06  GB / 256,06  GB    512   B +  0 B   AADA4102
/dev/nvme1n1     37DS103NTEQT         THNSN5512GPU7 NVMe TOSHIBA 512GB         1         512,11 GB / 512,11  GB    512   B +  0 B   57DC4102

Best Answer

I can't exactly tell you where the problem is as this is just a "generic failure" somewhere in NVMe subsystem. But I can suggest what you can try to pinpoint the problem.

  1. Try adding nvme_core.default_ps_max_latency_us=5500 kernel boot option.
  2. Install nvme-cli package (or even better build a most recent one from sources) and check various logs with it, like smart-log and error-log. That might help to diagnose error further.
  3. Try booting some other distros (live) and stress test under them to see if this is kernel version / distro related. Systemrescuecd distro might be a good starting point.
  4. If that doesn't helps you can try updating your MB firmware ("BIOS", which is not BIOS in fact now with UEFI) to a most recent one. While this doesn't sound obvious and even the patch notes might not have anything directly related to NVMe/PCI-E subsystems, sometimes it helps (practical knowledge).
  5. Update your NVMe drive firmware. Look for a vendor supplied tools and manual for this.
  6. If everything above won't help or give any clues you might have faced yet unknown bug or hardware failure.
Related Topic