Azure – Sudden SQL Azure performance issues

azure

I'm using New Relic to monitor one of my sites, and about every two weeks my Apdex drops through the floor. This appears to be down to SQL Azure.

What I know:

  • Requests Per Minute are the same as they are every working day at that particular time. There isn't a spike at all compared to the same time yesterday or last week.
  • Performance goes from about 100ms on average to 12 seconds on average.
  • No code changes have happened in the week prior.
  • Restarting the Azure Web Site that accesses this database makes no difference.
  • Scaling upwards on the front end Web Site makes no difference.
  • There don't appear to be any unclosed connections or undisposed connection objects.

Interestingly, what does seem to work to resolve it instantly is to change the scale of the database – in ANY direction. Moving it from S0 to S1 fixes it; moving it from S2 to S1 fixes it. Obviously, it's not possible to "restart" an Azure database, but this process seems to do something.

I'm unsure how to investigate this further. Does anybody have any suggestions or thoughts?

Best Answer

We had exactly the same issue multiple times, generally every 3-6 weeks (this was two years ago). Azure support kept telling us we needed to tune our queries. But the problem was like yours: with nothing changed (code or load), performance would simply tank for a couple of hours and then return to normal. After days of frustration, and after adding more and more logging and monitoring, we found what Azure did not want to share with us:

If something goes wrong with the primary instance, they kill it and the secondary instance becomes the primary. That switch was at the root of all this, and Azure support reluctantly confirmed it. Although the secondary instance is a replica, something about the switch slows it down, almost like restarting it.

The instance can be killed for many reasons:

  1. Azure SQL is a shared database. If one of the other databases on the same instance is misbehaving (for example, too much load from a batch job), that creates instance-wide problems.
  2. Hardware failure, which is probably less frequent than the first cause.
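
If you want to pin down when these switches happen, the kind of logging we added can be approximated with something like the sketch below. It is not what we actually ran, just a minimal illustration: it assumes you already export one `(requests_per_minute, avg_latency_ms)` pair per minute from your monitoring, and the function name, window size and thresholds are all made up for the example. It flags the minute where latency jumps sharply while traffic stays flat, which is the pattern described in the question.

```python
from statistics import mean


def find_latency_shifts(samples, window=5, factor=10.0, max_rpm_change=0.2):
    """Return indices where latency steps up sharply while traffic stays flat.

    samples: list of (requests_per_minute, avg_latency_ms), one entry per minute.
    window, factor and max_rpm_change are illustrative defaults, not tuned values.
    """
    shifts = []
    for i in range(window, len(samples) - window + 1):
        before = samples[i - window:i]
        after = samples[i:i + window]

        # Every sample after the candidate point must be at least `factor`
        # times slower than the slowest sample before it.
        slowest_before = max(latency for _, latency in before)
        fastest_after = min(latency for _, latency in after)
        latency_jumped = slowest_before > 0 and fastest_after >= factor * slowest_before

        # Traffic must be roughly unchanged, otherwise load explains the jump.
        rpm_before = mean(rpm for rpm, _ in before)
        rpm_after = mean(rpm for rpm, _ in after)
        traffic_flat = abs(rpm_after - rpm_before) <= max_rpm_change * rpm_before

        if latency_jumped and traffic_flat:
            shifts.append(i)
    return shifts


# Ten normal minutes at ~100 ms, then ten minutes at ~12 s with the same traffic.
history = [(600, 100)] * 10 + [(600, 12000)] * 10
print(find_latency_shifts(history))
```

The timestamps it flags are what you take to Azure support: a step change with flat traffic and no deployment is hard to blame on untuned queries.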
