Linux – How to override IRQ affinity for NVMe devices

irq, linux, linux-kernel, nvme, smp

I am trying to move all interrupts over to cores 0-3 to keep the rest of my cores free for high speed, low latency virtualization.

I wrote a quick script to set IRQ affinity to 0-3:

#!/bin/bash
# Pin every IRQ on the system to CPUs 0-3 by writing to each
# interrupt's smp_affinity_list file under /proc/irq/.

while IFS= read -r LINE; do
    echo "0-3 -> \"$LINE\""
    # Root is required to write the affinity files, so run the
    # redirection itself under sudo.
    sudo bash -c "echo 0-3 > \"$LINE\""
done <<< "$(find /proc/irq/ -name smp_affinity_list)"

This appears to work for USB devices and network devices, but not for NVMe devices. They all produce this error:

bash: line 1: echo: write error: Input/output error

And they stubbornly continue to produce interrupts evenly across almost all my cores.
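One way to observe this behaviour is to look at the per-CPU interrupt counters. This is a sketch for confirming the spread; on the affected system the NVMe queue counters grow across nearly all columns (CPUs), not just 0-3:

```shell
# Per-CPU interrupt counters for the NVMe queues (one column per CPU;
# the IRQ number is the first field). Prints nothing if no NVMe device
# is present on the machine running it.
grep -i nvme /proc/interrupts || true

# To watch the counters grow in real time:
#   watch -n1 'grep -i nvme /proc/interrupts'
```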

If I check the current affinities of those devices:

$ cat /proc/irq/81/smp_affinity_list 
0-1,16-17
$ cat /proc/irq/82/smp_affinity_list
2-3,18-19
$ cat /proc/irq/83/smp_affinity_list
4-5,20-21
$ cat /proc/irq/84/smp_affinity_list
6-7,22-23
...

It appears "something" is taking full control of spreading IRQs across cores and not letting me change it.

It is critical that I move these interrupts to other cores: I'm running heavy-I/O virtual machines on the remaining cores, and the NVMe drives produce a crap load of interrupts. This isn't Windows; I'm supposed to be able to decide what my machine does.

What is controlling IRQ affinity for these devices and how do I override it?


I am using a Ryzen 3950X CPU on a Gigabyte Auros X570 Master motherboard with 3 NVME drives connected to the M.2 ports on the motherboard.

(Update: I am now using a 5950X, still having the exact same issue)

Kernel: 5.12.2-arch1-1

Output of lspci -v for the NVMe controllers:

01:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Phison Electronics Corporation E12 NVMe Controller
    Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 0, IOMMU group 14
    Memory at fc100000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [80] Express Endpoint, MSI 00
    Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
    Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
    Capabilities: [f8] Power Management version 3
    Capabilities: [100] Latency Tolerance Reporting
    Capabilities: [110] L1 PM Substates
    Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [200] Advanced Error Reporting
    Capabilities: [300] Secondary PCI Express
    Kernel driver in use: nvme

04:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Phison Electronics Corporation E12 NVMe Controller
    Flags: bus master, fast devsel, latency 0, IRQ 24, NUMA node 0, IOMMU group 25
    Memory at fbd00000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [80] Express Endpoint, MSI 00
    Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
    Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
    Capabilities: [f8] Power Management version 3
    Capabilities: [100] Latency Tolerance Reporting
    Capabilities: [110] L1 PM Substates
    Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [200] Advanced Error Reporting
    Capabilities: [300] Secondary PCI Express
    Kernel driver in use: nvme

05:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Phison Electronics Corporation E12 NVMe Controller
    Flags: bus master, fast devsel, latency 0, IRQ 40, NUMA node 0, IOMMU group 26
    Memory at fbc00000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [80] Express Endpoint, MSI 00
    Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
    Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
    Capabilities: [f8] Power Management version 3
    Capabilities: [100] Latency Tolerance Reporting
    Capabilities: [110] L1 PM Substates
    Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [200] Advanced Error Reporting
    Capabilities: [300] Secondary PCI Express
    Kernel driver in use: nvme
$ dmesg | grep -i nvme
[    2.042888] nvme nvme0: pci function 0000:01:00.0
[    2.042912] nvme nvme1: pci function 0000:04:00.0
[    2.042941] nvme nvme2: pci function 0000:05:00.0
[    2.048103] nvme nvme0: missing or invalid SUBNQN field.
[    2.048109] nvme nvme2: missing or invalid SUBNQN field.
[    2.048109] nvme nvme1: missing or invalid SUBNQN field.
[    2.048112] nvme nvme0: Shutdown timeout set to 10 seconds
[    2.048120] nvme nvme1: Shutdown timeout set to 10 seconds
[    2.048127] nvme nvme2: Shutdown timeout set to 10 seconds
[    2.049578] nvme nvme0: 8/0/0 default/read/poll queues
[    2.049668] nvme nvme1: 8/0/0 default/read/poll queues
[    2.049716] nvme nvme2: 8/0/0 default/read/poll queues
[    2.051211]  nvme1n1: p1
[    2.051260]  nvme2n1: p1
[    2.051577]  nvme0n1: p1 p2

Best Answer

What is controlling IRQ affinity for these devices?

Since v4.8, the Linux kernel has used MSI/MSI-X interrupts in the NVMe driver unconditionally; and with IRQD_AFFINITY_MANAGED, the kernel manages the affinity of those MSI/MSI-X interrupts itself, refusing userspace writes to smp_affinity (hence the EIO you see).

See these commits:

  1. 90c9712fbb388077b5e53069cae43f1acbb0102a - NVMe: Always use MSI/MSI-X interrupts
  2. 9c2555835bb3d34dfac52a0be943dcc4bedd650f - genirq: Introduce IRQD_AFFINITY_MANAGED flag

Given your kernel version and the device capabilities shown in your lspci -v output (MSI-X enabled on all three controllers), this is apparently what is happening here.
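If your kernel was built with CONFIG_GENERIC_IRQ_DEBUGFS, you can check the managed flag directly in debugfs. A sketch (the IRQ number 81 is taken from the question; the exact debugfs layout depends on the kernel config):

```shell
# Requires CONFIG_GENERIC_IRQ_DEBUGFS and a mounted debugfs
# (usually /sys/kernel/debug). For a kernel-managed interrupt,
# the flags dump lists IRQD_AFFINITY_MANAGED.
IRQ=81
sudo grep -m1 AFFINITY_MANAGED "/sys/kernel/debug/irq/irqs/$IRQ" \
    && echo "IRQ $IRQ is kernel-managed"
```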

and how do I override it?

Short of disabling those flags and recompiling the kernel, the practical option is to disable MSI/MSI-X at the PCI bridge above the devices (rather than on the devices themselves):

echo 0 > /sys/bus/pci/devices/$bridge/msi_bus

Writing 0 to a bridge's msi_bus disables MSI/MSI-X for devices below it (writing 1 re-enables). The setting only affects drivers bound afterwards, so the nvme module has to be reloaded, or the devices rebound, for it to take effect.

Note that disabling MSI/MSI-X carries a performance cost, as the devices below the bridge fall back to legacy line-based interrupts. See the kernel's MSI-HOWTO documentation for details.
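The $bridge placeholder above is the upstream PCI bridge, not the NVMe device itself. One way to find it, assuming the usual sysfs layout where the parent directory of a PCI device's resolved path is its upstream bridge (the device address 0000:01:00.0 is taken from the question's lspci output):

```shell
# Resolve the bridge above one of the NVMe controllers.
dev=0000:01:00.0
bridge=$(basename "$(dirname "$(readlink -f "/sys/bus/pci/devices/$dev")")")
echo "bridge above $dev: $bridge"

# Then disable MSI/MSI-X for everything below that bridge:
#   echo 0 | sudo tee "/sys/bus/pci/devices/$bridge/msi_bus"
```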

Instead of disabling MSI/MSI-X, a better approach is to keep MSI-X but enable polling mode in the NVMe driver. See Andrew H's answer.
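A sketch of enabling polled queues, assuming the mainline nvme PCI driver's poll_queues module parameter (the queue count of 4 is arbitrary; polled queues are only used by I/O submitted with RWF_HIPRI or io_uring's IORING_SETUP_IOPOLL):

```shell
# Persist the module option so it applies on the next boot / module load.
echo "options nvme poll_queues=4" | sudo tee /etc/modprobe.d/nvme-poll.conf

# After a reboot, the driver reports the split in dmesg, e.g.:
#   nvme nvme0: 8/0/4 default/read/poll queues
```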