Sql-server – What can cause long query durations without high resource usage

performancesql serversql-server-2005

In advance: sorry for the length of the question… unable to get the right balance between detail and brevity.

We are having issues with the DB Server for our web application where queries that should (and usually do) run in a very short time (< 10ms) are, on random occasions, taking anywhere between 1 ~ 30 seconds to execute – with no obvious pattern. According to our profiler traces, some of them are even "do-nothing" queries, such as "exec sp_reset_connection" (usually runs in 0ms; observed peaks of 3~6s), and "SET NO_BROWSETABLE ON", etc. Some examples are:

SELECT * FROM [Localisation].[TimeZoneRule] WHERE [Name] = 'EU'

Where TimeZoneRule has about 500,000 rows in 5 columns. Has a surrogate primary key, and an index on Name. Usually takes 0.97ms, peaks at 11s. Table is NEVER written to (was pre-populated prior to go-live). Profiler records it as taking 0-15 CPU, 18-25 Reads, 0-1 Writes (no idea why the write).

UPDATE [Core].[User] SET [LastUsed] = GETUTCDATE() WHERE Id = '<uid>'

Where User has about 30,000 rows over about 10 columns (one of which is an Xml column). Id is the clustered primary key. Table is regularly written to and read from. Usually takes 10~20ms, peaks at 26s. Profiler records it as taking 0 CPU, 15-36 Reads, 0-1 Writes.

INSERT INTO [Log].[Session] (ASPSessionId, Start, ClientAddress, ClientSoftware, ProxyAddress, ProxySoftware)
   VALUES(<number>, GETUTCDATE(), '<ipv4address>', '<User agent string>', '<ipv4address>', '<proxy software name (if present)>')

Where Session has about 1,000,000 rows in it over about 8 columns. Has a surrogate primary key (identity), and an index on ASPSessionId. Table is regularly written to, but rarely read from (only by us directly from SSMS). Usually takes 15~150ms, peaks at 5s. I don't have the profile record for it at hand, but from memory, CPU is around 0, Reads and writes were between 0 and 100 each.

The setup we're using is a mirrored setup with a Dell 2950 as the principle (2 4-core xeon 2.6, 16Gb RAM), and a Dell 6850 as the mirror (4 HT Xeon 3.2, 8Gb RAM). Both running SQL 2005 SP4 64-bit. The database in question is not particularly large, around 16Gb in size. The primary has 6 SAS disks divided into 3 RAID-1 volumes; one for System + Page + TempDB, one for the database's MDF, and one for the transaction log + hourly log backup + daily DB backup. I know the log situation is far from the best – in terms of disk IO (see below), and data security.

So far, we think we've eliminated:

  • The mirror. We separated the servers, and ran using one of them (and then switched to the other), but the performance issues remained.
  • Blocking due to locks(*). TimeZoneRule is never written to and, by my reckoning, should never have an exclusive lock on it. Additionally, we've checked through the traces and on many occasions the "problem query" is the only one running – the only other activity is other connections disconnecting
  • Poor indexing. With the low figures for Reads and CPU, it would suggest that SQL Server is using indexes effectively.
  • Disk IO. PerfMon indicates some odd figures for the data file drive (but only that drive) – whilst data read/ write rates rarely seem to exceed 32KB/s, the Current Disk Queue Length spikes to around 215 for durations of 2-5 seconds at approximately 45-60minute intervals with no fixed pattern. However, these do not correlate with times of poor query performance. The disk queue lengths for the other two drives [system + page + tempdb] and [log + backups]
    never exceeds 3.

(*) We've tried getting profiler to capture events related to lock acquisition, but the trace bloats to unreadable proportions and, worse, the web application grinds to a halt.

Not being DBAs, we're rapidly running out of ideas. Can anyone think of anything I should consider looking at next or anything I've stupidly missed?

Best Answer

As you are running SQL 2005 you can take your SQL Profiler data and compare it against the Perfmon data to see if you can see a correlation. This is done by saving your trace data and perfmon data to files using the normal techniques. Then open the SQL Profiler trace within profiler, and then one of the options in the file menu will be Import Performance Data. This will let you select a query and see what the counters were doing at that time (or close to it depending on your perfmon collection interval).

Disk queue spikes are never good. Especially that high. What's the IO that you are pushing to the disk when the queue gets that high? Basically you don't want a disk queue any higher than (2*n) where n in the number of disks in the array. Since you are using a 2 disk RAID 1 n=1 in your case (since you only get the speed of a single disk).

In perfmon there's a counter which is the seconds per read and seconds per write. What do these counters look like when the queries start taking a long time to run. What about normally? (Anything over .02 seconds is bad.) What's the estimated page life expectancy? (Anything under 300 seconds is usually bad, but this can vary.) What's the SQL Server cache hit ratio? (Anything below ~97% is usually bad. I like mine above 99.9%.)