Sql server 2008 cluster hang when a heavy load is run

clustersql-server-2008

we have a sql server 2008 active/active cluster running on wondows 2008R2 O/S. 14GB RAM, 4xCPU. we have set a ceiling of 12GB for sql server. We're running an agent job which loads 3 million records to a database. during this load the job fails and the cluster seems to attempt to fail over to the other node but unsuccessfully i.e., the cluster address is no longer accessible. we have to manually fail the cluster node back.

during the load on viewing task manager we can see that memory usage hits a max of 12.5GB and CPU at times hits 100% on all 4 CPU, but for the most part fluctuates at an average of about 60%.

I suppose my question is, will a cluster try to fail over if memory or CPU are taking a heavy hit? or am i barking up the wrong tree?
also any ideas why it wouldn't fully fail over? we've crawled through logs, of which there are a lot, and can't find anything useful. we've also tried recreating the issue but it ran successfully at a later time. Also 3 million rows doesn't seem like a lot but in terms of resources should 14GB RAM and 4xCPU not be sufficient?

Further information on this, we ran the load again today and corrupted the database!

We received the error message : LogWriter: Operating system error 170.
It looks like, under the heavy load, the sql cluster attempted to fail over and in doing so migrated a lun (or drive) which meant the disk was no longer reachable. (this is just our theory). The database is now 'suspect' and requiring restoration.

The 170 error above also indicates that on failing over to the other node, the sql service could not start as it was already in use, therefore it couldn't fail over fully?? But I'm wondering why would it need to fail over in the first place?

My assumptions could be completely wrong on this, so any ideas would be appreciated.

Best Answer

If the CPUs are locking up (ie all going to 100%) consider checking the Maxdop setting in SQL Server as a temporary stabilisation measure - it's available from the GUI at the server level, or sp_configure advanced options. As default, it should be set to 0 - which is fine.

You might want to consider changing maxdop to 3 or 2 or 1 - as a stabilisation attempt to make things better. It's dynamic, but please remember that any processes that run across multiple cores (parallel!) may well take longer.

If it makes a positive difference, you've bought yourself some time to find the actual root cause - which will come from looking at the workload in more detail.

Related Topic