Linux – apcupsd slave client keeps losing and restoring communications with UPS master

linux ups vmware-esxi

On VMware ESXi 5.0.0 (vSphere Hypervisor – the free version) I have three server images, all running CentOS 6.
All are configured to run the apcupsd ( http://www.apcupsd.org/ ) daemon for controlling APC UPSes.

One of the servers (the master) is connected by USB cable to an APC CS 350 UPS.
On it, apcupsd is configured with the netserver enabled on port 3551.

The two other (also virtualized) servers have apcupsd configured to retrieve the UPS status from master.

It works, but I see lots of warnings coming from apcupsd on the two slaves. In a terminal window I see entries like these:

Broadcast message from root@slavehostname (Thu Nov 1 19:55:10 2012):

Warning communications lost with UPS masterhostname

Broadcast message from root@slavehostname (Thu Nov 1 19:55:47 2012):

Communications restored with UPS masterhostname

On the same day I see about 200 sets of lost/restored messages. They are a lot more frequent during the day than during the night.

I don't get any warnings on the master.

These servers have plenty of memory and CPU available to them, and practically no swapping takes place.
I don't think they are starved, and they generally don't do very much work.

These are the master's configuration settings (leaving out the EEPROM settings):

UPSCABLE usb
UPSTYPE usb
DEVICE
POLLTIME 10
LOCKFILE /var/lock
SCRIPTDIR /etc/apcupsd
PWRFAILDIR /etc/apcupsd
NOLOGINDIR /etc
ONBATTERYDELAY 6
BATTERYLEVEL 5
MINUTES 3
TIMEOUT 0
ANNOY 300
ANNOYDELAY 60
NOLOGON disable
KILLDELAY 0
NETSERVER on
NISIP 0.0.0.0
NISPORT 3551
EVENTSFILE /var/log/apcupsd.events
EVENTSFILEMAX 10
UPSCLASS standalone
UPSMODE disable
STATTIME 0
STATFILE /var/log/apcupsd.status
LOGSTATS off
DATATIME 0

And these are the slave settings:

UPSCABLE ether
UPSTYPE net
DEVICE 192.168.0.59:3551
POLLTIME 10
LOCKFILE /var/lock
SCRIPTDIR /etc/apcupsd
PWRFAILDIR /etc/apcupsd
NOLOGINDIR /etc
ONBATTERYDELAY 12
BATTERYLEVEL 10
MINUTES 7
TIMEOUT 0
ANNOY 300
ANNOYDELAY 60
NOLOGON disable
KILLDELAY 0
NETSERVER on
NISIP 0.0.0.0
NISPORT 3551
EVENTSFILE /var/log/apcupsd.events
EVENTSFILEMAX 10
UPSCLASS standalone
UPSMODE disable
STATTIME 20
STATFILE /var/log/apcupsd.status
LOGSTATS off
DATATIME 0

I would like to ask for help on how to move on from here. How do I debug this? Any suggestions on how I might have configured my servers in a way that could cause this?

Best Answer

This doesn't fix the underlying problem, but it helps clean up the console a bit:

The script that outputs these messages is called apccontrol, and on my Ubuntu 12.04.2 LTS boxes it lives in /etc/apcupsd. It uses wall for all the messages.

It also calls other scripts, if they exist in that directory, for secondary handling, such as emailing root every time there's a comms failure. You can turn that off by moving or editing the script.

Also: if the other script exits with status code 99, then apccontrol will not call the default action, and you won't get spam on your wall.

I've just used it to push all the comms loss alerts into syslog instead of wall, and now it doesn't clutter up all my terminals that I'm trying to use. And I can put the polltime back down to the default of 60 so my slave box will still notice if the UPS kicks in.
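
For illustration, here is a minimal sketch of that kind of handler, not the exact script I run. It assumes apccontrol looks for handlers named after the event (commfailure and commok) and passes the UPS name as the first argument; check your own apccontrol to confirm both. The staging directory name is made up for the example, and the generated files would need to be copied into /etc/apcupsd as root.

```shell
#!/bin/sh
# Stage hypothetical commfailure/commok handlers in a local directory.
# Afterwards, copy them into /etc/apcupsd (the SCRIPTDIR) as root.
stage_dir="./apcupsd-handlers"   # assumed staging location, not an apcupsd path
mkdir -p "$stage_dir"

for event in commfailure commok; do
    # The quoted 'EOF' keeps $0 and $1 literal so they expand when the handler runs.
    cat > "$stage_dir/$event" <<'EOF'
#!/bin/sh
# Assumption: apccontrol runs this handler with the UPS name as $1.
# Send the event to syslog instead of broadcasting it with wall.
logger -t apcupsd -p daemon.warning "$(basename "$0"): UPS ${1:-unknown}"
# Exit status 99 tells apccontrol to skip its default action (the wall message).
exit 99
EOF
    chmod +x "$stage_dir/$event"
done
```

Once the files are in /etc/apcupsd, the lost/restored events should show up in syslog tagged apcupsd rather than on every terminal.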