VM iscsi disk crashes on one VM Host, not on the other

blade-server lefthand ocfs2 open-iscsi vmware-vsphere

I have a VMware solution running on an HP BladeSystem with a LeftHand iSCSI SAN. There are currently two VMware hosts in that environment.

I have two Debian VMs sharing an iSCSI disk (with ocfs2), mounted directly from the SAN using open-iscsi. It all worked perfectly, but yesterday one client crashed as soon as it attempted to write to the shared ocfs2 partition.

I tried setting some iSCSI parameters to more conservative values, to no avail. Only moving the client (via vMotion) to the other VM host resolved the problem. Today, moving the other client to the problematic host provokes the same errors:

connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4294971299, last ping 4294966612, now 4294973799
connection1:0: detected conn error (1011)
iscsid: Kernel reported iSCSI connection 1:0 error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
kernel: [  328.558970]  connection1:0: detected conn error (1020)
iscsid: connection1:0 is operational after recovery (1 attempts)
[repeat until hard reset]

It seems to be related to that VM host, which has exactly the same configuration as the other one. Being blades, they use the same networking hardware, a Flex-10 interconnect.

Does anyone have any idea what this could be related to? I'd like to find the cause, as both VM hosts could end up having the same problem (I would then have to switch to networked disks, which seem more stable and less prone to hard resets).

Best Answer

This error is related to write requests timing out. Some people recommend using a VLAN for the storage traffic to get better throughput. The problem can therefore lie anywhere along the path: the host's IP stack, its network adapter, the network switch, the SAN's network adapter, and so on.

Another thing you can do is increase the disk's command timeout (the value is in seconds):

echo 180 > /sys/block/sdX/device/timeout
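If several shared disks are involved, the same write can be wrapped in a small function. This is only a sketch: the function name `set_disk_timeout`, the `SYSFS_BLOCK` override (there so it can be tried against a scratch directory), and the device names `sdb` and `sdc` are all assumptions for illustration. A value written to sysfs this way does not survive a reboot.

```shell
#!/bin/sh
# Sketch: set the SCSI command timeout (in seconds) on one or more disks.
# SYSFS_BLOCK is a hypothetical override for testing; it defaults to the
# real /sys/block.
set_disk_timeout() {
    secs="$1"
    shift
    for dev in "$@"; do
        f="${SYSFS_BLOCK:-/sys/block}/$dev/device/timeout"
        if [ -w "$f" ]; then
            echo "$secs" > "$f"
            echo "$dev: timeout is now $(cat "$f")s"
        else
            # Missing device or insufficient privileges: report and go on.
            echo "$dev: $f is not writable, skipped" >&2
        fi
    done
}

# Usage (as root); sdb and sdc are placeholder device names:
# set_disk_timeout 180 sdb sdc
```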

In the iSCSI initiator configuration I usually set:

node.session.iscsi.InitialR2T = No
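For a target that has already been discovered, the setting can also be written into the existing node record with `iscsiadm` instead of editing the configuration file. The sketch below assumes a wrapper named `iscsi_node_update` and an `ISCSIADM` override for previewing the command; both are hypothetical, and the target name and portal in the usage line are placeholders. As far as I know the new value only applies to sessions logged in after the change.

```shell
#!/bin/sh
# Sketch: update one setting in an existing open-iscsi node record.
# ISCSIADM is a hypothetical override (for example ISCSIADM=echo) that
# lets you preview the command line without touching any record.
iscsi_node_update() {
    target="$1"   # target IQN
    portal="$2"   # portal as address:port
    name="$3"     # setting name, e.g. node.session.iscsi.InitialR2T
    value="$4"    # new value
    ${ISCSIADM:-iscsiadm} -m node -T "$target" -p "$portal" \
        -o update -n "$name" -v "$value"
}

# Usage (as root); the IQN and portal below are placeholders:
# iscsi_node_update iqn.2003-10.com.example:vol1 192.0.2.10:3260 \
#     node.session.iscsi.InitialR2T No
```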

The following parameters increase the verbosity of the iSCSI logs. Enable only the ones you need:

# echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_session 
# echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh
# echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_conn
# echo 1 > /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp
# echo 1 > /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp
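Because these switches make the kernel log very noisy, it helps to be able to turn them all off again in one step. A minimal sketch, assuming a helper named `set_iscsi_debug` and a `SYSFS_MODULE` override for trying it against a scratch directory (both hypothetical); parameters that do not exist on the running kernel are skipped.

```shell
#!/bin/sh
# Sketch: write the same value (1 = on, 0 = off) to every iSCSI debug
# switch listed above. SYSFS_MODULE is a hypothetical override for
# testing; it defaults to the real /sys/module.
set_iscsi_debug() {
    value="$1"
    root="${SYSFS_MODULE:-/sys/module}"
    for p in \
        libiscsi/parameters/debug_libiscsi_session \
        libiscsi/parameters/debug_libiscsi_eh \
        libiscsi/parameters/debug_libiscsi_conn \
        libiscsi_tcp/parameters/debug_libiscsi_tcp \
        iscsi_tcp/parameters/debug_iscsi_tcp
    do
        # Skip switches that are absent or not writable on this system.
        if [ -w "$root/$p" ]; then
            echo "$value" > "$root/$p"
        fi
    done
}

# Usage (as root), once the problem has been captured in the logs:
# set_iscsi_debug 0
```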