Sql-server – What to check during a periodic system health check

healthcheckoracle-11gsql serverwindows-server-2003windows-server-2008-r2

I have been tasked with preparing a list of checks to do as a part of a weekly system health check routine which my team is supposed to do. The problem is neither I nor any of my colleagues have ever been a professional system administrator and the best we can come up with is pretty laughable.

The system runs Siemens SIMATIC IT and LIMS, but I’m interested in some generic checks/tests for operating systems and database servers. Someone else will take care of tests specific to the applications being run.

The setup is a follows:

All servers are virtual, running in the vSphere5 environment.

  • Web server – MS Windows Server 2003 R2
  • 2 servers running SIMATIC IT components one for Historian and one for the Production Modeler and
    other components – MS Windows Server 2003 R2
  • Database server – MS Windows Server 2003 R2 + MS SQL Server 2005
  • Database + LIMS server – MS Windows Server 2008 R2 + Oracle Database 11g

Most probably we will not get access to the vCenter console, so the idea is to connect a remote desktop to those servers, make some constructive checks/ tests and prepare a report.

As I wrote already, there is no much besides checking for a free disk space, that I can come up with. I can also think of checking the level of fragmentation of a file system and file system errors with ChkDsk, looking into windows event viewer for some important errors and warnings, checking the level of index fragmentation in databases and maybe collecting some statistics of response times and times of execution of some important queries.

I will greatly appreciate any help. Besides information about what should be checked, hints for what not to do on a system that is under load 24/5 will also be very helpful. For example running a defragmenter even just for analysis on a database server under load might be a very bad idea, but I don’t know it yet.

Thank you.

Best Answer

You are being asked to do it wrong.

You should not be logging in to production systems and doing periodic manual checks.
This guarantees that you will (a) miss something that happens between the checks and takes your business down, and (b) eventually screw up while doing the checks and take the business down.

Instead, you should be implementing a monitoring system that does continuous periodic checks (every 5-10 minutes) and reports anomalies to you. See the tag for more information and ideas on what to check.

Disk space, swap utilization, and CPU load (RunQ depth) are typical things to monitor. You may also want to perform (and time/check the output of) standard test queries on database servers (these queries are something you have to create based on your environment).