Linux – watchdog: behavior of file and sync options

linux watchdog

Here's my situation:

I'm having a very occasional problem where a (very) remote embedded PC/104 system running Debian seems to lose the ability to use any communications interface. I can't get to it via ethernet or serial ports (the console). After cycling the power, the system logs show nothing amiss. They just end abruptly and resume minutes or hours later when I cycle the power.

I suspect the system isn't locked up, because I have a Python script which tries to ping google.com and, if that fails, uses an IO pin to toggle the wireless modem's power supply via a relay.

So, I have a completely unresponsive system, and a modem which is being power cycled every ten minutes by that same system. Fortunately, between reboots, I can use the modem to power-cycle the processor and get back to collecting data.

The system has a hardware watchdog and I've had watchdogd set up and running for a while. Last time this happened, I tried adding the line:

file=/var/log/messages

to watchdog.conf, but it didn't help. I then read that

When using file mode watchdog will try to stat(2) the given files. Errors returned by stat will not cause a reboot. For a reboot the stat call has to last at least one minute.

I don't know enough about stat to know how it might respond to losing the ability to write to disk, but I suspect it doesn't just hang.

I also just noticed that watchdogd has a --sync option, but the man pages aren't very verbose as to what happens if sync fails. My interval is 2 seconds. Are there reasons not to sync an SSD every two seconds?

-Thanks

Best Answer

What do you mean by "if sync fails"? The man page for sync(2) says about return codes: "sync() is always successful". So the only way it can "fail" in your case is by not returning control to watchdogd fast enough (because of many blocks to write, slow writes, a broken or corrupt disk, filesystem or kernel I/O layer, ...)

And if it does not return control to watchdogd fast enough, watchdogd won't be able to write to /dev/watchdog soon enough, and your hardware watchdog should trigger a hardware reboot.
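
As a rough illustration of that ordering (a sketch, not watchdogd's actual source), one pass of the loop amounts to "sync, then write to the device". The function name `keepalive_once` and the injectable `sync_fn` parameter are made up for this example; the point is only that a sync which never returns means the write is never reached.

```python
import os


def keepalive_once(dev, sync_fn=os.sync):
    """One pass of a watchdog-style loop: flush disks, then pet the device.

    `dev` is any writable binary file object. On real hardware it would be
    /dev/watchdog opened unbuffered; that is an assumption of this sketch.
    """
    # If sync_fn blocks (stuck disk, wedged I/O layer), control never
    # reaches the write below and the hardware timer runs out.
    sync_fn()
    # A write to the device resets the hardware countdown.
    dev.write(b"\0")
    dev.flush()
```

On a real system this would be called every interval with something like `open("/dev/watchdog", "wb", buffering=0)`. Be careful when experimenting: opening the device normally arms the watchdog, and it will be busy if watchdogd already holds it.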

stat(2) could have a problem with an unwritable disk only if the error is of a type that also prevents reading (kernel bug, corrupted I/O layer). And yes, it could hang if there is a problem there. BTW, you should use "file=/var/log/messages" in combination with "change=" so that the watchdog initiates a reboot if the file has not been changed often enough.
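
To make the idea behind "file=" plus "change=" concrete, here is a minimal Python sketch of the check as I understand it: stat the file and see whether its modification time is older than the allowed interval. The helper name `file_is_stale` and its parameters are hypothetical, and this is not how watchdogd itself is implemented.

```python
import os
import time


def file_is_stale(path, change_seconds, now=None):
    """Return True if `path` was not modified within `change_seconds`."""
    now = time.time() if now is None else now
    # os.stat() wraps stat(2). It only reads metadata, which is why a disk
    # that merely cannot be written to need not make this call fail.
    mtime = os.stat(path).st_mtime
    return (now - mtime) > change_seconds
```

With a log file that normally gets a line every few minutes, a value of a few times that interval would flag the "logs just end abruptly" case while tolerating quiet periods. The right number depends on how chatty your syslog is.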

As for the watchdog, are you absolutely sure that the hardware watchdog is working? Did you modprobe the correct hardware module before starting watchdogd? Does dmesg(8) indicate so? If you "kill -STOP" the watchdogd process, the machine should reboot. If so, you may try adding the "nowayout" option to your hardware module to eliminate the chance of, for example, the OOM killer killing watchdogd and thus stopping the hardware watchdog. You could also add "test-binary" and "test-timeout" to run a custom script which reports whether the system is to be considered alive or not (and initiates a reboot if not).
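
A custom "test-binary" for your case could be as small as the following Python sketch, which tries to write and fsync a scratch file and reports the result through its exit status. The function names and the choice of /var/log as the probe directory are assumptions for illustration; pick whatever directory matters on your system.

```python
import os
import tempfile


def disk_is_writable(directory):
    """Try to create, write, and fsync a small scratch file in `directory`."""
    try:
        # The scratch file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as f:
            f.write(b"watchdog probe")
            f.flush()
            os.fsync(f.fileno())
        return True
    except OSError:
        return False


def main(directory="/var/log"):
    # Exit status 0 means "alive"; non-zero reports a failed check.
    return 0 if disk_is_writable(directory) else 1

# When saved as a script, end the file with: raise SystemExit(main())
```

If the write hangs instead of failing outright, the script never returns, which is the case "test-timeout" is meant to bound.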
