Azure – Disk Queue length of 30 in Azure App Service, that can’t be right

Tags: azure, azure-web-apps, hard-drive, performance-monitoring

We are having a battle with the Microsoft Azure support team. I hope the Serverfault community can chime in as the support team has messed us about before.

Here is what is happening.

As part of a larger SaaS service that we host on Azure, we have a front-end App Service that accepts basic HTTP requests, carries out some minor validation and then passes the grunt work on to a back-end server. This process is not CPU, memory or network intensive and we don't touch the disk subsystem at all.

The pricing tier is 'Basic: 2 Medium', which is more than sufficient for the load we put on it. The CPU and Memory charts show that the system is largely sleeping with memory usage being around 36%.

As we paid good attention in server school, we actively monitor the various layers of the overall solution using Azure's standard monitoring facilities. One of the counters we keep track of is the 'Disk Queue Length'; it is one of the very few counters available on Azure App Services, so it must be important.
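For anyone who wants to reproduce our numbers outside the portal, the same counter can be pulled programmatically. Below is a minimal sketch using the azure-monitor-query Python package; it assumes the counter is exposed under the metric name DiskQueueLength on the App Service plan resource, and the resource ID is a placeholder you would fill in yourself.

```python
# Sketch: pull a week of 5-minute DiskQueueLength averages for an App Service plan.
# Assumes: pip install azure-identity azure-monitor-query, and that you are signed in
# (e.g. via `az login`). PLAN_RESOURCE_ID is a placeholder for your own resource.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

PLAN_RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Web/serverfarms/<plan-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())

response = client.query_resource(
    PLAN_RESOURCE_ID,
    metric_names=["DiskQueueLength"],
    timespan=timedelta(days=7),            # a typical week, like the chart below
    granularity=timedelta(minutes=5),      # 5-minute buckets
    aggregations=[MetricAggregationType.AVERAGE],
)

# Print each 5-minute average so the spikes are easy to spot.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.average is not None:
                print(point.timestamp, round(point.average, 2))
```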

Back in server school we were told that the disk queue length should ideally be zero, and that if it is persistently above 1 you need to get your act together (there are some exceptions for certain RAID configurations). Over the last few years all was well: the disk queue length was zero 99% of the time, with the occasional spike to 5 when Microsoft was servicing the system.

A couple of months ago things started to change out of the blue (so not after we rolled out changes). Disk queue alerts started flooding in, and the average queue length is now in the 30s.

We let it run for a few days to see if the problem would go away (performance is not noticeably impacted, at least not under the current load). As the problem did not go away we thought that perhaps the underlying system had a problem so we instantiated a brand new Azure App Service and migrated to that one. Same problem.

[Chart: Disk Queue Length for a typical week]

So we reached out to Azure support. Naturally they asked us to run a number of nonsense tests in the hope we would go away (they asked for network traces… for a disk queue problem!). We don't give up so easily, so we ran their nonsense tests and eventually were told to just set the alert for the queue length to 50 (over 10 minutes).

Although we have no control over the underlying hardware, infrastructure and system configuration, this just does not sound right.

Their full response is as follows:

I reached out to our product team with the information gathered in this case.

They investigated the issues where the alert you have specified for Disk Queue Length is firing more frequently than expected.

This alert is set to notify you if the Disk Queue Length average exceeded 10 over 5 minutes. This metric is the average number of both read and write requests that were queued for the selected disk during the sample interval. For the Azure App Service Infrastructure this metric is discussed in the following documentation link: https://docs.microsoft.com/en-us/azure/app-service-web/web-sites-monitor

The value of 10 is very low for any type of application deployed and so you may be seeing false positives. This means the alert might trigger more frequently than the exact number of connections.

For example on each virtual machine we run an Anti-Malware Service to protect the Azure App Service infrastructure. During these times you will see connections made and if the alert is set to a low number it can be triggered.

We did not identify any instance of this Anti-Malware scanning affecting your site availability. Microsoft recommends that you consider increasing the Disk Queue Length metric be set to an average value of at least 50 over 10 minutes.

We believe this value should allow you to continue to monitor your application for performance purposes. It should also be less affected by the Anti-Malware scanning or other connections we run for maintenance purposes.
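For reference, the change they are proposing boils down to a very simple evaluation: average the DiskQueueLength samples over a 10-minute window and compare the result to 50. A rough sketch of that logic in plain Python (the 1-minute sample spacing and the sample values are just assumptions for illustration):

```python
# Rough sketch of the alert evaluation support suggested: fire when the average
# DiskQueueLength over a 10-minute window exceeds 50. Sample spacing and the
# example values are made up purely for illustration.
from statistics import mean

THRESHOLD = 50          # suggested by support (we previously alerted at 10 over 5 minutes)
WINDOW_MINUTES = 10
SAMPLE_INTERVAL = 1     # assume one DiskQueueLength sample per minute

def alert_fires(samples):
    """Return True if the average of the last 10 minutes of samples exceeds the threshold."""
    window = samples[-(WINDOW_MINUTES // SAMPLE_INTERVAL):]
    return mean(window) > THRESHOLD

# Example: a queue length hovering in the 30s (our current situation) would no
# longer trigger the alert at all.
recent_samples = [31, 29, 34, 30, 33, 28, 35, 32, 30, 31]
print(alert_fires(recent_samples))  # False
```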

Anyone want to chime in?

Best Answer

To me, that sounds high as well; with Azure you're in a shared pool environment. I bet your back-end disk is getting hammered by other clients, and based on other posts it sounds like Azure is known for this. I would see if they can relocate your back-end disk to less heavily used storage, or try the recommendations in these posts or others.

Performance azure disks, high average queue length

Azure IO performance
