For the past few months, our server has periodically hung for a minute or two. The logs show these errors:
May 15 20:01:02 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 15 20:01:02 www kernel: ata2.00: failed command: FLUSH CACHE
May 15 20:01:02 www kernel: ata2.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May 15 20:01:02 www kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
May 15 20:01:02 www kernel: ata2.00: status: { DRDY }
May 15 20:01:02 www kernel: ata2: hard resetting link
May 15 20:01:03 www kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May 15 20:01:03 www kernel: ata2.00: configured for UDMA/133
May 15 20:01:03 www kernel: ata2.00: retrying FLUSH 0xe7 Emask 0x4
May 15 20:01:03 www kernel: ata2.00: device reported invalid CHS sector 0
May 15 20:01:03 www kernel: ata2: EH complete
The timing of these errors is peculiar. They always occur a few minutes past the hour:
May 15 00:06:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 15 10:05:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 15 20:01:02 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 00:04:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 04:01:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 07:02:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 07:03:03 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 11:02:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 12:06:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 13:06:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 20:04:02 www kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 17 06:03:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 17 09:06:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 17 14:04:02 www kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 17 17:03:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 02:02:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 10:03:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 11:05:03 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 13:03:02 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 16:06:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 18:02:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 19 00:01:02 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
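To make the pattern easier to see, the exception lines can be tallied by minute of the hour and by ATA port. The following is only a minimal sketch: the function name `tally_exceptions` and the regular expression are my own, and it assumes the syslog line format shown above.

```python
import re
from collections import Counter

# Matches lines such as:
#   "May 15 20:01:02 www kernel: ata2.00: exception Emask 0x0 ..."
# and captures the hour, the minute, and the ATA port (e.g. "ata2").
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):\d{2}\s+\S+\s+kernel:\s+(ata\d+)\.\d+: exception"
)


def tally_exceptions(lines):
    """Count ATA exception lines by minute-of-hour and by port."""
    by_minute = Counter()
    by_port = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not an "exception Emask" line
        _hour, minute, port = match.groups()
        by_minute[int(minute)] += 1
        by_port[port] += 1
    return by_minute, by_port
```

Passing it the lines of the kernel log (for example an open file object) returns two counters. If nearly all the counts fall within the first few minutes of the hour, something scheduled is a more likely trigger than random hardware noise.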
I ran smartctl, but the health test reported PASSED, and the SMART error log contains no errors:
SMART Error Log Version: 1
No Errors Logged
The RAID status shows this:
cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sda1[0] sdb1[1] sdc1[2]
20478912 blocks [3/3] [UUU]
md2 : active raid1 sda2[0] sdb2[1] sdc2[2]
96211904 blocks [3/3] [UUU]
Any ideas what to do? The errors seem hardware-related, but the timing suggests a software cause to me.
Best Answer
Most likely your system runs a cron job every hour, and that job sometimes touches data located on a bad sector, which triggers the error message.
You should run smartctl, which is in the smartmontools package on Debian / Ubuntu distributions, against your hard disks. That lets you check the logged error status of each device. It should contain information about any errors the drive has recorded.
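As a rough sketch of what to run, the helper below builds the smartctl invocations for each disk and prints their output. The device names are an assumption taken from the /proc/mdstat output in the question, and the helper names `smartctl_commands` and `run_checks` are hypothetical. The smartctl options themselves (`-H`, `-A`, `-l error`, `-l selftest`) are standard ones.

```python
import subprocess

# Assumption: device names inferred from the /proc/mdstat output in the question.
DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]


def smartctl_commands(device):
    """Return the smartctl invocations to run for one device."""
    return [
        ["smartctl", "-H", device],              # overall health self-assessment
        ["smartctl", "-A", device],              # vendor attributes (reallocated, pending sectors, ...)
        ["smartctl", "-l", "error", device],     # SMART error log
        ["smartctl", "-l", "selftest", device],  # results of earlier self-tests
    ]


def run_checks(devices=DEVICES):
    """Run every command and print its output; this usually needs root."""
    for device in devices:
        for cmd in smartctl_commands(device):
            print("$", " ".join(cmd))
            # smartctl uses its exit status as a bitmask, so a non-zero
            # status is not treated as a failure of the call itself.
            result = subprocess.run(cmd, capture_output=True, text=True)
            print(result.stdout)
```

Calling `run_checks()` as root prints the reports for all three disks. The attribute table is worth reading even when the health check says PASSED, since reallocated or pending sector counts can be non-zero on a drive that still passes. A long self-test can be started separately with `smartctl -t long /dev/sda`, and its result later appears in the self-test log.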