Linux – watchdog: basic configuration options don’t work as expected

linux watchdog

I run Ubuntu 14.04 LTS with watchdog 5.13.
My goal is to achieve the following:

run an external check script every 30 seconds
reboot if the script keeps failing for 300 seconds (i.e. 10 failed attempts in a row)

I am having issues with the most basic watchdog configuration:

$ cat /etc/watchdog.conf
watchdog-device = /dev/watchdog
watchdog-timeout = 300
interval = 30
test-binary = /usr/local/sbin/watchdog_check.sh
realtime = yes
priority = 1

$ cat /etc/default/watchdog
run_watchdog=1
run_wd_keepalive=1
watchdog_module="none"
watchdog_options="-c /etc/watchdog.conf --verbose"

According to syslog:

watchdog-timeout is being set to 254 s (discussed here).
The system reboots after the first failure of test-binary.

Is this expected behaviour, or am I missing something?

P.S. For now I've implemented the 'wait until 10 failures' logic in the script itself.

Best Answer

I can't speak to watchdog-timeout being clamped to 254 seconds, but the discussion you link to certainly explains it.

Watchdog timers don't generally run in an "N failures in a row" mode, though. They reboot the machine at the first indication of an error, so the behaviour you're seeing is what I'd expect. They are usually implemented in hardware that must be "tickled" within the configured period; otherwise it hard power cycles the machine with no warning whatsoever. The point is to recover from kernel panics and similar hangs.
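
If "N failures in a row" semantics are needed, the counting has to happen inside the test script itself, as the asker ended up doing. Below is a minimal sketch of such a wrapper, not the asker's actual script. The names run_check, MAX_FAILURES and STATE_FILE are hypothetical, and the state file location assumes /run is a tmpfs that is cleared on reboot.

```shell
#!/bin/sh
# Hypothetical wrapper for a watchdog test-binary: it reports failure
# only after MAX_FAILURES consecutive failures of the real check.

MAX_FAILURES="${MAX_FAILURES:-10}"                        # assumed: 10 x 30 s = 300 s
STATE_FILE="${STATE_FILE:-/run/watchdog_check.failures}"  # assumed counter location

# Usage: run_check <command> [args...]
# Runs the command, updates the consecutive-failure counter, and returns
# non-zero only once the counter has reached MAX_FAILURES.
run_check() {
    failures=$(cat "$STATE_FILE" 2>/dev/null)
    # Treat a missing, empty, or corrupt counter as zero.
    case "$failures" in
        ''|*[!0-9]*) failures=0 ;;
    esac

    if "$@"; then
        failures=0                   # any success resets the streak
    else
        failures=$((failures + 1))
    fi
    echo "$failures" > "$STATE_FILE"

    [ "$failures" -lt "$MAX_FAILURES" ]
}
```

To use it as test-binary, the script would end with a call such as `run_check /path/to/real_check` (a placeholder for your own check) so that its exit status becomes the script's exit status. Note that this only delays the reaction to a failing check; the hardware timer still fires if the daemon itself stops writing to the device.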

Related Solutions

Linux – DNS configuration doesn’t work

The output of nslookup is to be expected.

When you look up an IP address for which a PTR record is set up, the result shows that same IP address with its octets in reverse order, followed by .in-addr.arpa.

Next to it, you get the corresponding fully qualified domain name set in the PTR record.

If a PTR record is not set, you will get NXDOMAIN as a lookup result for that IP address.

Example of a result of a reverse lookup with a PTR record

First, I look up twitter.com:

kenny@computer ~ $ nslookup twitter.com
Server:     127.0.1.1
Address:    127.0.1.1#53

Non-authoritative answer:
Name:   twitter.com
Address: 199.59.148.10
Name:   twitter.com
Address: 199.59.148.82
Name:   twitter.com
Address: 199.59.150.39

I now do a reverse lookup of the first result of my previous lookup.

kenny@computer ~ $ nslookup 199.59.148.10
Server:     127.0.1.1
Address:    127.0.1.1#53

Non-authoritative answer:
10.148.59.199.in-addr.arpa  name = r-199-59-148-10.twttr.com.

Authoritative answers can be found from:

OK, so it turns out that the folks at Twitter set up a PTR record for that IP address (note that the octets of the IP in the output are reversed, beginning with 10 and ending with 199) that points to r-199-59-148-10.twttr.com.

Example of a result of a reverse lookup without a PTR record

Let's do the same thing for this website, stackoverflow.com.

kenny@computer ~ $ nslookup stackoverflow.com
Server:     127.0.1.1
Address:    127.0.1.1#53

Non-authoritative answer:
Name:   stackoverflow.com
Address: 69.59.197.21

The result is a single IP address. Let's do a reverse lookup on it.

kenny@computer ~ $ nslookup 69.59.197.21
Server:     127.0.1.1
Address:    127.0.1.1#53

** server can't find 21.197.59.69.in-addr.arpa.: NXDOMAIN

Again, the octets in the output have been reversed and suffixed with .in-addr.arpa. The result this time is not a domain name but NXDOMAIN: no PTR record is set.
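
The name that nslookup queries in both examples can be reproduced by hand. The following sketch builds it for an IPv4 address; reverse_name is a hypothetical helper name, and the function does no input validation.

```shell
#!/bin/sh
# Build the in-addr.arpa name that a reverse lookup queries for an IPv4 address.
reverse_name() {
    # Split the argument on dots into four octets ($1..$4).
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    # Print the octets in reverse order, followed by the reverse-lookup zone.
    printf '%s.%s.%s.%s.in-addr.arpa\n' "$4" "$3" "$2" "$1"
}

reverse_name 199.59.148.10   # prints 10.148.59.199.in-addr.arpa
reverse_name 69.59.197.21    # prints 21.197.59.69.in-addr.arpa
```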

Linux – watchdog: behavior of file and sync options

What do you mean by "if sync fails"? The man page for sync(2) says of its return value that "sync() is always successful". So the only way it can "fail" in your case is by not returning control to watchdogd fast enough (because of many blocks to write, slow writes, a broken or corrupt disk, filesystem, or kernel I/O layer, ...).

And if it does not return control to watchdogd fast enough, watchdogd won't be able to write to /dev/watchdog in time, and your hardware watchdog should trigger a hardware reboot.

stat(2) could have a problem with an unwritable disk only if the error is of a type that also prevents reading (kernel bug, corrupted I/O layer). And yes, it could hang if there is a problem there. BTW, you should use "file=/var/log/messages" in combination with "change=" so that the watchdog initiates a reboot if the file has not been changed often enough.
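
As a rough illustration of the file/change pairing, a watchdog.conf fragment might look like the following. The 600-second interval is an arbitrary assumption, not a recommendation. As far as I understand the watchdog.conf man page, a "change" line applies to the most recent "file" line above it.

```
# /etc/watchdog.conf (fragment)
# Check this file, and treat it as a failure if it has not been
# modified within the last 600 seconds (value assumed for illustration).
file = /var/log/messages
change = 600
```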

As for the watchdog, are you absolutely sure that the hardware watchdog is working? Did you modprobe the correct hardware module before starting watchdogd? Does dmesg(8) indicate so? If you "kill -STOP" the watchdogd process, the machine should reboot. If it does, you may try adding the "nowayout" option to your hardware module to eliminate the chance of, for example, the OOM killer killing watchdogd and thus stopping the hardware watchdog. You could also add "test-binary" and "test-timeout" to run a custom script that reports whether the system is to be considered alive (and initiate a reboot if not).
