I am trying to copy(~7TB of data using rsync) between two server in same data center in the backend its using EMC VMAX3
After copying ~30-40GB of data multipath start failing
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: Recovered to normal mode
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: remaining active paths: 1
Dec 15 01:57:53 test.example.com kernel: sd 1:0:2:20: [sdeu] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[root@test log]# multipath -ll |grep -i fail
|- 1:0:0:15 sdq 65:0 failed ready running
- 3:0:0:15 sdai 66:32 failed ready running
We are using default multipath.conf
HBA driver version 8.07.00.26.06.8-k
HBA model QLogic Corp. ISP8324-based 16Gb Fibre Channel to PCI Express Adapter
OS: CentOS 64-bit/2.6.32-642.6.2.el6.x86_64
Hardware:Intel/HP ProLiant DL380 Gen9
Already verified this solution and checked with EMC everything looks good https://access.redhat.com/solutions/438403
Some more info
–
There is no drop/error packet on the network side.
- Filesystem is mounted with noatime,nodiratime
- Filesystem ext4(Already tried xfs but same error)
- LVM is in striped mode(Started with linear option and then converted to striped)
-
Already disabled THP
-
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- Whenever multipath start failing process goes to D state
- System firmware upgraded
- Tried with latest version of qlogic driver
- Tried with different scheduler(noop,deadline,cfq)
- Tried with different tuned profile(enterprise-storage)
Vmcore collected during the time of issue
I am able to collect vmcore during the time of issue
KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.6.2.el6.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 36
DATE: Fri Dec 16 00:11:26 2016
UPTIME: 01:48:57
LOAD AVERAGE: 0.41, 0.49, 0.60
TASKS: 1238
NODENAME: test.example.com
RELEASE: 2.6.32-642.6.2.el6.x86_64
VERSION: #1 SMP Wed Oct 26 06:52:09 UTC 2016
MACHINE: x86_64 (2297 Mhz)
MEMORY: 511.9 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000018"
PID: 15840
COMMAND: "kjournald"
TASK: ffff884023446ab0 [THREAD_INFO: ffff88103def4000]
CPU: 2
STATE: TASK_RUNNING (PANIC)
After Enbaling Debug mode on the qlogic sid
qla2xxx [0000:0b:00.0]-3822:5: FCP command status: 0x2-0x0 (0x70000) nexus=5:1:0 portid=1f0160 oxid=0x800 cdb=2a200996238000038000 len=0x70000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d42580 cp=ffff88276d249480.
qla2xxx [0000:84:00.0]-3822:7: FCP command status: 0x2-0x0 (0x70000) nexus=7:0:3 portid=450000 oxid=0x4de cdb=2a20098a5b0000010000 len=0x20000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d421c0 cp=ffff8880237e0880.
Best Answer
This is an HP ProLiant DL380 Gen9 server. Pretty standard enterprise-class server.
Can you give me information on the server's firmware revision?
Is EMC PowerPath actually installed? If so, check here.
Do you have the HP Management Agents installed? If so, do you have the ability to post the output of
hplog -v
.Have you seen anything in the ILO4 log? Is the ILO accessible?
Can you describe all of the PCIe cards installed in the system's slots?
For RHEL6-specific tuning, I highly recommend XFS, running
tuned-adm profile enterprise-storage
and ensuring your filesystems are mountednobarrier
(the tuned profile should handle that).For the volumes, please ensure that you're using the
dm
(multipath) devices instead of/dev/sdX
. See: https://access.redhat.com/solutions/1212233Looking at what you've presented so far and the check listed at Redhat's support site (and the description here), I can't rule out the potential for HBA failure or PCIe riser problems. Also, there's a slight possibility that there's an issue on the VMAX side.
Can you swap PCIe slots and try again? Can you swap cards and try again?
Is the firmware on the HBA current? Here's the most recent package from December 2016.