CentOS – How to "fix" a faulty path in device-mapper-multipath

centos, multipath, rhel5, storage-area-network

I have a multipath config that was working but now shows a "faulty" path:

[root@nas ~]# multipath -ll
sdd: checker msg is "readsector0 checker reports path is down"
mpath1 (36001f93000a63000019f000200000000) dm-2 XIOTECH,ISE1400
[size=200G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=1][active]
 \_ 1:0:0:1 sdb 8:16  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:1 sdd 8:48  [active][faulty]

At the same time, I'm seeing these three lines repeated over and over in /var/log/messages:

Feb  5 12:52:57 nas kernel: sd 2:0:0:1: SCSI error: return code = 0x00010000
Feb  5 12:52:57 nas kernel: end_request: I/O error, dev sdd, sector 0
Feb  5 12:52:57 nas kernel: Buffer I/O error on device sdd, logical block 0

And this line shows up fairly often too:

Feb  5 12:52:58 nas multipathd: sdd: readsector0 checker reports path is down

One thing I don't understand is why it's using the readsector0 checking method when my /etc/multipath.conf file says to use tur:

[root@nas ~]# tail -n15 /etc/multipath.conf

devices {
        device {
                vendor                  "XIOTECH "
                product                 "ISE1400         "
                path_grouping_policy    multibus
                getuid_callout          "/sbin/scsi_id -g -u -d /dev/%n"
                path_checker            tur
                prio_callout            "none"
                path_selector           "round-robin 0"
                failback                immediate
                no_path_retry           12
                user_friendly_names     yes
        }
}
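
For what it's worth, here is my rough attempt at exercising the two checks by hand. This is only my approximation of what the checkers do, not their actual code (sg_turs comes from the sg3_utils package):

# roughly what readsector0 does: a direct read of sector 0
dd if=/dev/sdd of=/dev/null bs=512 count=1 iflag=direct

# roughly what tur does: issue a SCSI TEST UNIT READY
sg_turs -v /dev/sdd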

Looking at the upstream documentation (http://christophe.varoqui.free.fr/usage.html), this paragraph seems relevant:

For each path:

\_ host:channel:id:lun devnode major:minor [path_status][dm_status_if_known]

The dm status (dm_status_if_known) is like the path status
(path_status), but from the kernel's point of view. The dm status has two
states: "failed", which is analogous to "faulty", and "active" which
covers all other path states. Occasionally, the path state and the 
dm state of a device will temporarily not agree. 

It's been well over 24 hours for me, so it's not temporary.
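
If I'm reading that right, the kernel's side of things can be queried directly with dmsetup; in its status output each path is flagged A (active) or F (failed):

dmsetup status mpath1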

So with all that as background, my questions are:
– how can I determine the root cause here?
– how can I manually perform whatever check it's doing from the command line (is my dd/sg_turs approximation above even close?)
– why is it ignoring my multipath.conf (did I do it wrong?)

Thanks in advance for any ideas. If there's any other info I can provide, let me know in a comment and I'll edit it into the post.

Best Answer

There's a subtle bug in your multipath.conf: vendor and product are matched as regular expressions, and the series of trailing spaces you've added is causing multipathd to fail to match your configuration against the actual devices on the system. If you examine the output of echo 'show config' | multipathd -k, you'll find two device sections for your SAN: one that matches all the extra spaces you added, and the default entry (should it exist) provided by the internal hardware database.
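
For example, something along these lines will dump the live config and show whatever device stanzas the daemon has loaded for this array (the grep context sizes are a guess; widen them if the stanza runs longer):

echo 'show config' | multipathd -k | grep -B2 -A12 XIOTECH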

Adjust your multipath.conf to look like this:

            vendor                  "XIOTECH "
            product                 "ISE1400.*"

SCSI Inquiry defines the vendor field as a fixed 8 ASCII characters; if the name doesn't use all 8, it must be padded with spaces to reach 8. Multipathd interprets the spec to the letter of the law here. You could also have used "XIOTECH.*" if you really want to be sure.
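
A quick way to see the strings exactly as the kernel reports them is through sysfs; cat -A marks the end of each line with a $, which makes the trailing padding visible:

cat -A /sys/block/sdd/device/vendor
cat -A /sys/block/sdd/device/model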

Once you've made these changes, stop multipathd using its init script, run multipath -F to flush the existing multipath maps, and then start multipathd again. Your config file should be honored now. If you still have problems, reboot.
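
On CentOS 5 the whole sequence would look roughly like this:

service multipathd stop
multipath -F            # flush the existing multipath device maps
service multipathd start
multipath -ll           # verify both paths come back [active][ready]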

If there's ever any doubt about whether your config file is being honored, examine the running config with the echo incantation above and compare what's loaded in the daemon's database to your config file.