FreeBSD guests running on ESX hang without panic log

freebsd kernel-panic vmware-esx

We have three servers running on the same ESX host; all virtual disks come from a remote SAN storage controller. All three servers hung and restarted several days ago, and it happened to the DB server once more today. The weird thing is that there was no panic log, crash log, or error log when the problem occurred.


Server1. Web Server
FreeBSD Meduna 8.1-RELEASE-p2 FreeBSD 8.1-RELEASE-p2 #2: Mon Feb 14 12:57:36 MYT 2011 hailang@Meduna:/usr/obj/usr/src/sys/Meduna amd64

Meduna# cat /var/log/messages | grep panic

Meduna# bzcat /var/log/messages.?.bz2 | grep panic

Meduna# cat /var/log/messages | grep error

Meduna# bzcat /var/log/messages.?.bz2 | grep error

May 28 16:05:04 Meduna kernel: /var: mount pending error: blocks 4 files 1


Server2. DB Server
FreeBSD Moncalvo 8.1-RELEASE-p2 FreeBSD 8.1-RELEASE-p2 #1: Mon Jan 10 13:02:48 MYT 2011 hailang@Moncalve:/usr/obj/usr/src/sys/Moncalve amd64

Moncalvo# cat /var/log/messages | grep panic

Moncalvo# cat /var/log/messages | grep panic

Moncalvo# bzcat /var/log/messages.?.bz2 | grep panic

Moncalvo# cat /var/log/messages | grep error

Moncalvo# bzcat /var/log/messages.?.bz2 | grep error

May 28 16:17:17 Moncalvo kernel: /var: mount pending error: blocks -32 files 0


Server3. Not_In_Use
FreeBSD Mecure 8.1-RELEASE-p2 FreeBSD 8.1-RELEASE-p2 #0: Fri Feb 11 14:45:55 MYT 2011 hailang@ServerX:/usr/obj/usr/src/sys/Mecure amd64

Mecure# cat /var/log/messages | grep panic

Mecure# bzcat /var/log/messages.?.bz2 | grep panic

Mecure# bzcat /var/log/messages.?.bz2 | grep error

Mecure# cat /var/log/messages | grep error

May 28 15:42:41 Mecure kernel: g_vfs_done():da0s1d[WRITE(offset=3275046912, length=16384)]error = 5

May 28 15:42:41 Mecure kernel: g_vfs_done():da0s1d[READ(offset=4062199808, length=16384)]error = 5

May 28 15:42:41 Mecure kernel: g_vfs_done():da0s1d[WRITE(offset=3281371136, length=10240)]error = 5


This is what /var/log/messages looks like when the problem occurs:


May 28 13:06:26 Meduna kernel: icmp redirect from 10.16.10.250: 113.23.142.94 => 10.16.10.18

May 28 13:07:01 Meduna kernel: icmp redirect from 10.16.10.250: 202.186.13.232 => 10.16.10.18

May 28 13:15:00 Meduna kernel: icmp redirect from 10.16.10.250: 113.23.142.94 => 10.16.10.18

May 28 13:15:35 Meduna kernel: icmp redirect from 10.16.10.250: 202.186.13.232 => 10.16.10.18

May 28 13:41:36 Meduna syslogd: kernel boot file is /boot/kernel/kernel

May 28 13:41:36 Meduna kernel: Copyright (c) 1992-2010 The FreeBSD Project.

May 28 13:41:36 Meduna kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994

[!] It just hung for about half an hour and restarted without any error.

May 28 13:13:14 Moncalvo kernel: icmp redirect from 10.16.10.250: 60.49.152.98 => 10.16.10.18

May 28 13:14:25 Moncalvo kernel: icmp redirect from 10.16.10.250: 210.48.150.200 => 10.16.10.18

May 28 13:16:58 Moncalvo kernel: icmp redirect from 10.16.10.250: 183.78.169.57 => 10.16.10.18

May 28 15:59:06 Moncalvo syslogd: kernel boot file is /boot/kernel/kernel

May 28 15:59:06 Moncalvo kernel: Copyright (c) 1992-2010 The FreeBSD Project.

May 28 15:59:06 Moncalvo kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994

[!] And this server hung for more than 2 hours before it restarted.
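
The silent gap before each reboot can also be pulled out of the log mechanically. A minimal sketch, assuming the boot marker is the "syslogd: kernel boot file is" line seen in the excerpts above; the function name last_line_before_boot is made up for illustration:

```shell
#!/bin/sh
# Print the last line logged before each boot marker, followed by the
# marker itself, so the two timestamps can be compared side by side.
# Reads the files given as arguments, or stdin when there are none.
last_line_before_boot() {
    awk '/syslogd: kernel boot file is/ { if (prev != "") print prev; print }
         { prev = $0 }' "$@"
}

# Possible usage (paths as in the question):
#   last_line_before_boot /var/log/messages
#   bzcat /var/log/messages.?.bz2 | last_line_before_boot
```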


I suspect that this might be a storage problem, but I have no proof of that. Could you please give me some advice on how to dig into or solve the issue? Any help is highly appreciated!

Best Regards,

Hai Lang

Best Answer

The problem is most probably caused by a SAN malfunction. When FreeBSD loses its disk, there is almost no way for it to leave a panic log entry. But in a VM environment (and also on a very few motherboards), the msgbuf (dmesg) can survive the reboot. You may try to examine it.
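
One way to look through whatever is left in the message buffer is to filter it for disk-related lines. This is only a sketch: the pattern list is an assumption modelled on the g_vfs_done lines quoted in the question, and the function name scan_storage_errors is made up for illustration.

```shell
#!/bin/sh
# Keep only the lines on stdin that hint at disk I/O trouble or a panic.
# The patterns are guesses based on the g_vfs_done errors in the question;
# extend them to match whatever your own logs show.
scan_storage_errors() {
    grep -iE 'g_vfs_done|error = [0-9]+|timeout|panic'
}

# Possible usage on the FreeBSD guest right after it comes back up:
#   dmesg -a | scan_storage_errors
#   sysctl -n kern.msgbuf | scan_storage_errors
```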

For debugging, you can try dropping into DDB instead of rebooting after a panic.
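
A rough sketch of how that could be switched on at runtime, assuming the kernel was built with "options KDB" and "options DDB" and that debug.kdb.available reports the backends as a space-separated list. The helper name has_ddb is hypothetical. Note that this only helps if the guest actually panics; a plain hang will not enter the debugger.

```shell
#!/bin/sh
# Return success if "ddb" appears in a space-separated list of
# debugger backends (the expected format of debug.kdb.available).
has_ddb() {
    case " $1 " in
        *" ddb "*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Possible usage as root on the guest; with debugger_on_panic=1 the
# machine stops at the db> prompt on the console instead of rebooting:
#   if has_ddb "$(sysctl -n debug.kdb.available)"; then
#       sysctl debug.debugger_on_panic=1
#   else
#       echo "kernel has no DDB backend; rebuild with options KDB and DDB"
#   fi
```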

PS. If you have a system programmer at hand, you can ask him to write something like Linux's netconsole for FreeBSD.