Azure – Catastrophic Azure App Service outage after an automatic Azure platform upgrade

azure, azure-app-services

A relatively simple Azure App Service (currently .NET 4.6.2, backed by Azure SQL) has been running for over 18 months. It has been rock solid. I rarely think about this site and have not released an update for several months.

I woke this morning to find emails from customers saying that the web site was reporting "The specified CGI application encountered an error and the server terminated the process." As a first guess, I clicked "Restart" in the Azure portal for the App Service. About a minute later it came back to life and has been running fine ever since.

I went to "Diagnose and solve problems" -> "Availability and Performance". The "Requests and Errors" timeline showed the moment the web site went down and when it came back to life. I drilled into the timeline and selected "Full Report".

In a very matter-of-fact way, it reported the following:

Application stop events are detected
We analyzed 3 Platform Events, 1 User Event.

Platform(File Server Upgrade) Your application was recycled due to a file server upgrade. This event occurred multiple times during the day across multiple instances. These events cause a Storage Volume movement which may result in a restart of your application. If this restart event negatively impacts the availability of the application, enabling the Local Cache feature can help reduce dependency on storage file servers to some extent. Learn more: Check Local Cache described in the Troubleshooting and Next Steps.

Platform (Infrastructure Upgrade) Around 11/20/2019 2:09:57 PM (UTC), on Instance xxxxxxxx, your application was recycled as the Azure scale unit was undergoing an upgrade. There are periodic updates made by Microsoft to the underlying Azure platform to improve overall reliability, performance, and security of the platform infrastructure where your application is running on. Most of these updates are performed without any impact upon your web app. To reduce the impact of such events on your application, consider deploying your application to multiple regions and use Azure Traffic Manager to distribute the load across regions.

User(Stop Site) Around 11/20/2019 9:00:00 PM (UTC), your application process was restarted due to a user action like stopping the site from azure portal.

I am at a total loss as to what to do and how to prevent this from happening again.

I suspect the "local cache" suggestion is a red herring. I use the file system to create a few temporary files that the code deletes afterwards.

Googling has returned few results.

I guess I am after suggestions as to what I can do to ensure that this never happens again.

Any ideas?

Thanks in advance.

Best Answer

In my case, setting the app setting WEBSITE_LOCAL_CACHE_OPTION to Always did not help.

Instead, setting the app setting WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG to 1 was what finally fixed it.
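For anyone who prefers to script this rather than use the portal, here is a minimal sketch in Python that builds the Azure CLI command (`az webapp config appsettings set`) for applying the setting. The resource group and app names are hypothetical placeholders, and the helper function is my own illustration, not part of any SDK. It assumes the Azure CLI is installed and you are logged in.

```python
def build_appsettings_command(resource_group, app_name, settings):
    """Build the argv list for `az webapp config appsettings set`."""
    # Each setting is passed to the CLI as a KEY=VALUE pair.
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return [
        "az", "webapp", "config", "appsettings", "set",
        "--resource-group", resource_group,
        "--name", app_name,
        "--settings", *pairs,
    ]


# Hypothetical names: substitute your own resource group and App Service.
command = build_appsettings_command(
    "my-resource-group",
    "my-app-service",
    {"WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG": "1"},
)

# Print the command for review before running it.
print(" ".join(command))

# To apply it for real (requires the Azure CLI and `az login`):
# import subprocess
# subprocess.run(command, check=True)
```

Note that changing an app setting restarts the app, so it is best applied during a quiet period.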
