Fibre Channel multipath fails: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

centos6, hardware, multipath, storage

I am trying to copy ~7 TB of data using rsync between two servers in the same data center; on the backend both use an EMC VMAX3.

After copying ~30-40 GB of data, multipath starts failing:

Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: Recovered to normal mode
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: remaining active paths: 1
Dec 15 01:57:53 test.example.com kernel: sd 1:0:2:20: [sdeu]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK 

[root@test log]# multipath -ll |grep -i fail
 |- 1:0:0:15 sdq  65:0   failed ready running
  - 3:0:0:15 sdai 66:32  failed ready running

We are using the default multipath.conf.
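For reference, a minimal sketch of what an explicit VMAX device section would look like if we moved away from the built-in defaults (the values below are illustrative only, using standard multipath.conf syntax, and should be checked against EMC's host connectivity guide; the effective built-in settings can be dumped with multipathd -k"show config"):

  devices {
      device {
          vendor                "EMC"
          product               "SYMMETRIX"
          path_grouping_policy  multibus
          path_checker          tur
          no_path_retry         6
      }
  }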

HBA driver version: 8.07.00.26.06.8-k

HBA model: QLogic Corp. ISP8324-based 16Gb Fibre Channel to PCI Express Adapter

OS: CentOS 6 64-bit / kernel 2.6.32-642.6.2.el6.x86_64
Hardware: Intel / HP ProLiant DL380 Gen9

Already verified this solution and checked with EMC; everything looks good: https://access.redhat.com/solutions/438403

Some more info


There are no dropped or errored packets on the network side.

  • Filesystem is mounted with noatime, nodiratime
  • Filesystem is ext4 (already tried XFS, but the same error occurs)
  • LVM is in striped mode (started with the linear option and then converted to striped)
  • Already disabled THP: echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
  • Whenever multipath starts failing, the copying process goes into the D state
  • System firmware upgraded
  • Tried the latest version of the QLogic driver
  • Tried different I/O schedulers (noop, deadline, cfq)
  • Tried a different tuned profile (enterprise-storage); a quick verification sketch follows this list
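For reference, a short shell sketch of how these settings can be re-checked (sdeu is just one of the affected paths from the log above; substitute your own devices):

  # confirm THP is really off
  cat /sys/kernel/mm/redhat_transparent_hugepage/enabled

  # check and change the elevator on one of the affected paths
  cat /sys/block/sdeu/queue/scheduler
  echo deadline > /sys/block/sdeu/queue/scheduler

  # confirm which tuned profile is active
  tuned-adm active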

Vmcore collected at the time of the issue

I was able to collect a vmcore while the issue was occurring:

  KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.6.2.el6.x86_64/vmlinux
DUMPFILE: vmcore  [PARTIAL DUMP]
    CPUS: 36
    DATE: Fri Dec 16 00:11:26 2016
  UPTIME: 01:48:57
  LOAD AVERAGE: 0.41, 0.49, 0.60
   TASKS: 1238
NODENAME: test.example.com
 RELEASE: 2.6.32-642.6.2.el6.x86_64
 VERSION: #1 SMP Wed Oct 26 06:52:09 UTC 2016
 MACHINE: x86_64  (2297 Mhz)
  MEMORY: 511.9 GB
   PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000018"
     PID: 15840
 COMMAND: "kjournald"
    TASK: ffff884023446ab0  [THREAD_INFO: ffff88103def4000]
     CPU: 2
   STATE: TASK_RUNNING (PANIC)
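The banner above is from the crash utility; a minimal sketch of follow-up commands that can be run against the dump (paths are the ones shown in the banner):

  crash /usr/lib/debug/lib/modules/2.6.32-642.6.2.el6.x86_64/vmlinux vmcore
  crash> bt              # backtrace of the panicking task (kjournald)
  crash> log             # kernel ring buffer leading up to the panic
  crash> ps | grep UN    # tasks stuck in uninterruptible (D) state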

After enabling debug mode on the QLogic side:

qla2xxx [0000:0b:00.0]-3822:5: FCP command status: 0x2-0x0 (0x70000) nexus=5:1:0 portid=1f0160 oxid=0x800 cdb=2a200996238000038000 len=0x70000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d42580 cp=ffff88276d249480.
qla2xxx [0000:84:00.0]-3822:7: FCP command status: 0x2-0x0 (0x70000) nexus=7:0:3 portid=450000 oxid=0x4de cdb=2a20098a5b0000010000 len=0x20000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d421c0 cp=ffff8880237e0880.

Best Answer

This is an HP ProLiant DL380 Gen9 server. Pretty standard enterprise-class server.

Can you give me information on the server's firmware revision?
Is EMC PowerPath actually installed? If so, check here.
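If you're not sure, a quick way to check (the package and command names below are the usual PowerPath ones; adjust for your install):

  rpm -qa | grep -i emcpower     # any PowerPath packages installed?
  powermt display dev=all        # PowerPath's own path view, if it is installed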

Do you have the HP Management Agents installed? If so, can you post the output of hplog -v?

Have you seen anything in the iLO 4 log? Is the iLO accessible?

Can you describe all of the PCIe cards installed in the system's slots?

For RHEL6-specific tuning, I highly recommend XFS, running the tuned-adm profile enterprise-storage, and ensuring your filesystems are mounted with nobarrier (the tuned profile should handle that).
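Roughly, assuming a mount point of /data (substitute your own):

  tuned-adm profile enterprise-storage
  tuned-adm active
  mount -o remount,nobarrier /data    # or add nobarrier to the fstab options
  grep /data /proc/mounts             # confirm nobarrier is in effect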

For the volumes, please ensure that you're using the dm (multipath) devices instead of /dev/sdX. See: https://access.redhat.com/solutions/1212233
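A quick way to confirm the LVM stack sits on the multipath maps rather than on individual paths (the WWID shown is just the one from your log):

  multipath -ll 360000970000196801239533037303434   # the dm map and its member paths
  pvs -o pv_name,vg_name    # PVs should be /dev/mapper/<wwid> or dm-N, not /dev/sdX
  lvs -o lv_name,devices    # the striped LV should reference those mpath devices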


Looking at what you've presented so far and the check listed at Red Hat's support site (and the description here), I can't rule out the potential for HBA failure or PCIe riser problems. There's also a slight possibility that the issue is on the VMAX side.

Can you swap PCIe slots and try again? Can you swap cards and try again?

Is the firmware on the HBA current? Here's the most recent package from December 2016.

Firmware 6.07.02, BIOS 3.21
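On a running system, the HBA firmware and driver versions are normally visible via the FC transport class in sysfs (the exact output format varies by driver release):

  cat /sys/class/fc_host/host*/symbolic_name    # typically lists the FW: and DVR: versions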

A DID_ERROR typically indicates the driver software detected some type of hardware error via an anomaly within the returned data from the HBA.

A hardware or SAN-based issue is present within the storage subsystem such that received Fibre Channel response frames contain invalid or conflicting information that the driver is not able to use or reconcile.

Please review the system's hardware, switch error counters, etc., to see if there is any indication of where the issue might lie. The most likely candidate is the HBA itself.
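On the host side, the FC transport class also keeps per-port error counters worth watching while the copy runs (the counter names below are the standard fc_host statistics):

  for h in /sys/class/fc_host/host*; do
      echo "== $h =="
      grep . $h/statistics/link_failure_count \
             $h/statistics/loss_of_sync_count \
             $h/statistics/invalid_crc_count \
             $h/statistics/error_frames
  done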