Ubuntu – How to detect virtual machine freezes

monitoring, Ubuntu, virtual-machines

On cloud platforms you often hear that because of high load on neighboring VMs, disks over oversubscribed ethernet, backups, or live migration to other hardware, the virtual machine can 'freeze' for a moment.

I suspect this is happening to one of our Ubuntu virtual machines at a cloud provider I'd rather not publicly shame.

Every night it's unavailable to external monitoring services. The machine itself looks healthy in terms of load, traffic, etc. The provider suggests the network is fine though.

I would like to be able to (dis)prove that VM freezes are causing these pager alerts.

One idea I had was to write the date to a log every second and, after a short moment of unavailability, see if we skipped a 'beat'.
However, that seems flawed: what if the VM maintains its own clock and allows it to drift from the host's hardware clock?
If our internal clock freezes along with the VM, we'd still have a nice sequence of seconds in that log file, and a clock that's now behind real time.
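For illustration, the heartbeat idea might look like the minimal Python sketch below. The function names `record_beats` and `find_gaps`, the log format, and the 0.5-second tolerance are assumptions made for this example, not an existing tool. It logs the wall clock next to `time.monotonic()` once per interval, and a separate helper looks for consecutive beats that are further apart than expected.

```python
import time


def record_beats(path, count, interval=1.0):
    """Append `count` heartbeat lines to `path`, one per `interval` seconds.

    Each line holds the wall clock and the monotonic clock as seen by
    the guest. Both are the guest's own view of time (assumption: the
    hypervisor may or may not advance them across a pause).
    """
    with open(path, "a") as log:
        for _ in range(count):
            log.write(f"{time.time():.3f} {time.monotonic():.3f}\n")
            log.flush()  # push each beat out of Python's buffer right away
            time.sleep(interval)


def find_gaps(beats, interval=1.0, tolerance=0.5):
    """Return (previous, current, delta) for every pair of consecutive
    timestamps that are more than `interval + tolerance` apart."""
    gaps = []
    for prev, cur in zip(beats, beats[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            gaps.append((prev, cur, delta))
    return gaps
```

Something like `record_beats("/var/tmp/heartbeat.log", 86400)` (a hypothetical path) would cover roughly a day, and the first column of the log could then be fed to `find_gaps`. The limitation described above still applies: a gap only shows up if the guest's clock jumps forward after the pause. If the clock simply stops with the VM, the log stays evenly spaced.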

Is there a better way / tool I can use to determine that there are machine freezes?

I would guess a difference between real time and our time would be a tell; then again, there are other causes for drifting clocks.

Best Answer

I think you're on the right track with writing the time to a log file every second, but for the reasons you pointed out that may not be reliable. In addition to writing the time to a local disk, why not have your cron process reach out to a known stable system over the network and have that system log the request to disk? Something as simple as wget could work, assuming you're making an HTTP request to a system that logs its requests. Ideally the target system would be relatively "close", network-wise, to the system you suspect of being problematic, but either way this should get you some debugging data.
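As a rough sketch of that remote heartbeat, here is a Python version that uses `urllib` from the standard library instead of wget and a simple loop instead of cron. The URL `http://monitor.example/beat`, the query parameter names, and the helper names `beat_url`, `send_beat` and `run` are all made up for this example. It assumes the target is any web server that writes an access log with its own timestamps.

```python
import socket
import time
import urllib.parse
import urllib.request


def beat_url(base_url, host, seq, guest_time):
    """Build the heartbeat URL. The sequence number and the guest's own
    clock reading travel in the query string, so they end up in the
    remote access log next to the remote server's timestamp."""
    query = urllib.parse.urlencode(
        {"host": host, "seq": seq, "guest_time": f"{guest_time:.3f}"}
    )
    return f"{base_url}?{query}"


def send_beat(url, timeout=2.0):
    """Send one request; return True on HTTP 200, False on any
    network-level failure (refused, timed out, HTTP error, ...)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts derive from it
        return False


def run(base_url, count, interval=1.0):
    """Send `count` beats, one per `interval` seconds. The sequence
    number increases even when a request fails."""
    host = socket.gethostname()
    for seq in range(count):
        send_beat(beat_url(base_url, host, seq, time.time()))
        time.sleep(interval)
```

A call such as `run("http://monitor.example/beat", 86400)` would run for about a day. The idea behind the sequence number: if the remote log shows a gap in arrival times but no missing `seq` values, the guest most likely was not running its loop during that window, whereas missing `seq` values point toward requests that were sent but never arrived. Treat that as a hint rather than proof, since timeouts and retries can blur the picture.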
