Linux – Command that will take a ‘system snapshot’ of a Linux system for later diagnosis

debugging linux monitoring

I am running an Ubuntu Linux server with Apache, mod_wsgi, Django and MySQL on it. Recently something happened and the wsgi processes froze up. Restarting Apache solved the problem. As with many live systems, it's better to get the system back up and running than to poke around. However, we are having trouble diagnosing the problem: everything looked fine at the time, and we no longer have the full state the system was in when it happened.

Is there any tool/command (on Linux/Debian/Ubuntu, or any other *nix flavour; I'm fine with compiling it myself) that, when invoked, will write to a file some details about the state of the system as it is right now? If/when this happens again, we can just run this command, then get down to fighting fires (restarting Apache, the server, etc.), and try to diagnose the problem later.

Wishlist of things it would record:

  • CPU status (and the various types of CPU usage)
  • Process list, with various details about each process
  • Details of file system usage
  • List of all open files (and what process has them, etc.)
  • List of all open internet connections
  • (If possible) Details of what our mod_wsgi processes are doing
  • MySQL status: current queries that are being run, etc.
  • (maybe) run strace on apache/mysql/mod_wsgi for a few seconds to collect some data of what they are doing, and save this to a file.
  • Anything else I'm forgetting?

In theory this is a simple set of commands and if no-one else has done this, then we'll just write our own scripts, but it would be better if we can use a proper tool.
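For reference, the "write our own scripts" fallback could look roughly like the sketch below. It is a minimal illustration, not a vetted tool: the function name `take_snapshot`, the `snapshots` directory and the particular choice of commands and flags are all assumptions made for the example. The `mysql` entry assumes credentials are available without a prompt (for example via `~/.my.cnf`), `lsof` and `ss -p` only show every process when run as root, and the strace item from the wishlist is left out.

```python
import datetime
import pathlib
import shutil
import subprocess

# Label -> command line. This selection is an assumption for illustration;
# adjust it to whatever is installed and useful on the server.
COMMANDS = {
    "uptime": ["uptime"],
    "processes": ["ps", "auxww"],
    "top": ["top", "-b", "-n", "1"],
    "memory": ["free", "-m"],
    "disk_usage": ["df", "-h"],
    "open_files": ["lsof", "-n"],
    "connections": ["ss", "-tanp"],
    "mysql_queries": ["mysql", "-e", "SHOW FULL PROCESSLIST"],
}


def take_snapshot(base_dir="snapshots", commands=COMMANDS, timeout=30):
    """Run each command and save its output under a timestamped directory."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    out_dir = pathlib.Path(base_dir) / stamp
    out_dir.mkdir(parents=True, exist_ok=True)

    for label, argv in commands.items():
        target = out_dir / (label + ".txt")
        # Record missing tools instead of aborting the whole snapshot.
        if shutil.which(argv[0]) is None:
            target.write_text("command not found: " + argv[0] + "\n")
            continue
        try:
            result = subprocess.run(
                argv,
                stdin=subprocess.DEVNULL,   # never wait for interactive input
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,   # keep error messages with the output
                timeout=timeout,            # a hung command must not stall the rest
            )
            target.write_bytes(result.stdout)
        except subprocess.TimeoutExpired:
            target.write_text("timed out after %d seconds\n" % timeout)
    return out_dir
```

Calling `take_snapshot()` just before restarting Apache would leave one text file per command in a directory named after the current time, which can be inspected once the fire is out. The per-command timeout is there because some of these tools can themselves hang on a badly wedged system.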

Best Answer

Work is being done in mod_wsgi 4.0 to better recover from the situation where all the WSGI request threads block on something, which is ultimately going to be the cause of this. How this then leads to Apache as a whole blocking, and why you may not get any logging out of Apache about it, is mostly understood.

As part of the new recovery mechanism that has been implemented, mod_wsgi will, prior to restarting the blocked daemon process, attempt to log a minimal stack trace of each WSGI request thread so you can see where the code was blocked.
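To give an idea of what a per-thread stack trace looks like, here is a minimal sketch of the general technique in plain Python. It is not mod_wsgi's implementation; the function name `format_thread_stacks` is hypothetical, and it relies on `sys._current_frames()`, which the Python documentation describes as intended for internal and specialised purposes.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return the current Python stack of every thread as one string."""
    # Map thread identifiers to names so the report is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append("Thread %s (%s)" % (thread_id, names.get(thread_id, "unknown")))
        # format_stack returns one string per stack entry, outermost first.
        for entry in traceback.format_stack(frame):
            lines.append(entry.rstrip("\n"))
    return "\n".join(lines)
```

Something like this could be called from a background monitoring thread inside the application and written to the error log, so that the last line of each stack shows where a request thread is stuck. It only covers Python frames; a thread blocked inside C code shows the Python call that entered it.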

There is also work going on with tracking and reporting thread utilisation so you can know when request threads are starting to block in your code for some reason. This data will be able to be reported into a tool such as New Relic so you can chart it and then analyse it in conjunction with all the other information about web requests that the New Relic Python agent captures about your application.

New Relic also now has server monitoring, so it can track a reasonable amount of information about the system as a whole as well: disk activity, network activity, CPU, processes, etc. So New Relic is one possible option for monitoring your system.

Overall, as time allows, a lot of work is being done on trying to make mod_wsgi easier to monitor and better able to automatically recover when your application starts to hang for one reason or another.

You might consider joining the mod_wsgi mailing list and watching for posts about this, or asking any specific questions you have about it there.