Azure WebApps Downtime – Why Does a Restart Cause Hours of Downtime?

Tags: azure, azure-web-apps, tomcat

My Azure WebApps instance (running Tomcat on Linux) had been working well for 9 months.
Recently it suffered a couple of hours of downtime, which, according to a Microsoft Support Engineer, was caused by the following:

There was a storage file server reboot on this instance, and the web app
got stuck and was not able to start afterwards until you made a manual
restart. To avoid this kind of issue, you can adhere to best practices:

  1. Use 2 instances all the time
    These instances are in different
    upgrade domains and hence will not be upgraded at the same time. While
    one worker instance is getting upgraded the other is still active to
    serve web requests. The web app is currently configured to run on only
    one instance. Since you have only one instance you can expect downtime
    because when the App Service platform is upgraded, the instance on
    which your web app is running will be upgraded. Therefore, your web
    app process will be restarted and will experience downtime.
  2. Use Health Check
    This feature automatically removes a faulty instance
    from rotation, thus improving availability. This feature will ping the
    specified health check path on all instances of your web app every 2
    minutes. If an instance does not respond within 10 minutes (5 pings),
    the instance is determined to be unhealthy and our service will stop
    routing requests to it. It is highly recommended for production apps
    to utilize this feature and minimize any potential downtime caused due
    to a faulty instance. Note: Health Check feature only works for
    applications that are hosted on more than one instance. For more
    information check the documentation below.
    https://github.com/projectkudu/kudu/wiki/Health-Check-(Preview)
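
The timing rule quoted above (a ping every 2 minutes, unhealthy after 5 missed pings) can be sketched as a small simulation. This is only an illustration of the rule as the support engineer described it, not Azure's implementation; the function name is hypothetical, and it assumes that the failures must be consecutive and that the first ping happens one interval after the start.

```python
from typing import Iterable, Optional

PING_INTERVAL_MINUTES = 2       # interval quoted by the support engineer
FAILURES_BEFORE_UNHEALTHY = 5   # 5 pings = 10 minutes, as quoted


def minutes_until_unhealthy(ping_ok: Iterable[bool]) -> Optional[int]:
    """Return the minute at which the instance would be marked unhealthy.

    Returns None if the failure threshold is never reached.
    Assumption: only consecutive failures count toward the threshold.
    """
    consecutive_failures = 0
    for ping_number, ok in enumerate(ping_ok, start=1):
        # A successful ping resets the counter; a failed one increments it.
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= FAILURES_BEFORE_UNHEALTHY:
            return ping_number * PING_INTERVAL_MINUTES
    return None
```

Under these assumptions, an instance that never responds is taken out of rotation after 10 minutes, which only helps if a second instance is available to take the traffic.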

So I understand that I can avoid this type of rare event by following best practices. However, I wonder if something else is going on: since then, whenever I have restarted the WebApp (via the Azure Portal), it has suffered between 2 and 5 hours of downtime, after which it recovers on its own.

The response of the Microsoft Support Engineer was that this was due to the temp directory being full.

Temp file space usage was almost exhausted. The app may be experiencing
stability and performance issues.

Applications make use of temp files during in-memory processing,
downloading content from API calls etc. If the application code does
not clean up, temp space gets used up.

Recommended Action: For a permanent fix, review and analyze each
application hosted in this App Service Plan and identify the apps that
are not performing proper cleanup routines.

I looked in the /tmp directory and it was basically empty. Also, a WebApp restart is supposed to clear the temp directory, so I do not understand why I would have problems specifically after a restart.
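
One way to put a number on "basically empty" is to sum the file sizes under the temp directory. The following is a minimal sketch, assuming Python is available wherever it is run (for example in an SSH session); the function name is hypothetical. The result only covers files visible to the current process, so it may not match the figure the platform reports for the worker.

```python
import os
import tempfile


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files below root, skipping unreadable ones."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so a link is not counted
                # as the size of its target.
                total += os.lstat(path).st_size
            except OSError:
                # The file vanished or is not accessible; ignore it.
                continue
    return total


if __name__ == "__main__":
    root = tempfile.gettempdir()  # usually /tmp on Linux
    print(f"{root}: {directory_size_bytes(root) / 1024 ** 2:.1f} MiB")
```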

The Support Request with Microsoft is still open. I am hoping to explore other ways of solving the problem, as it has been ongoing for two weeks.

Here are parts of the logs that may be pertinent (while the WebApp is offline) with identifying details obfuscated.

Docker

2020-09-22T16:09:57.514Z ERROR - Container examplewebapp__ for site examplewebapp__a81a did not start within expected time limit. Elapsed time = 600.9031978 sec
2020-09-22T16:09:57.515Z ERROR - Container examplewebapp__ didn't respond to HTTP pings on port: 80, failing site start. See container logs for debugging.
2020-09-22T16:09:57.544Z INFO - Stopping site examplewebapp__a81a because it failed during startup.
2020-09-22T16:14:53.608Z INFO - Pulling image from Docker hub: mcr.microsoft.com/azure-app-service/tomcat:9.0-java11_200319054033
2020-09-22T16:14:53.687Z INFO - 9.0-java11_200319054033 Pulling from azure-app-service/tomcat
2020-09-22T16:14:53.720Z INFO - Digest: sha256:c2c5…….73d96
2020-09-22T16:14:53.722Z INFO - Status: Image is up to date for mcr.microsoft.com/azure-app-service/tomcat:9.0-java11_200319054033
2020-09-22T16:14:53.726Z INFO - Pull Image successful, Time taken: 0 Minutes and 0 Seconds
2020-09-22T16:14:53.825Z INFO - Starting container for site
2020-09-22T16:14:53.825Z INFO - docker run -d -p 6807:80 --name examplewebapp__aetete -e WEBSITE_SITE_NAME=exampleWebApp -e WEBSITE_AUTH_ENABLED=False -e WEBSITE_ROLE_INSTANCE_ID=0 -e WEBSITE_HOSTNAME=examplewebapp.azurewebsites.net -e WEBSITE_INSTANCE_ID=dgsgdhs…sdshsd -e HTTP_LOGGING_ENABLED=1 mcr.microsoft.com/azure-app-service/tomcat:9.0-java11_200319054033

2020-09-22T16:14:56.980Z INFO - Initiating warmup request to container examplewebapp__aetete for site examplewebapp__a81a
2020-09-22T16:15:17.526Z INFO - Waiting for response to warmup request for container examplewebapp__aetete. Elapsed time = 20.5455075 sec
2020-09-22T16:15:33.144Z INFO - Waiting for response to warmup request for container examplewebapp__aetete. Elapsed time = 36.1635991 sec
2020-09-22T16:15:54.629Z INFO - Waiting for response to warmup request for container examplewebapp__aetete. Elapsed time = 57.6488951 sec
2020-09-22T16:16:09.914Z INFO - Waiting for response to warmup request for container examplewebapp__aetete. Elapsed time = 72.9343365 sec
2020-09-22T16:16:25.080Z INFO - Waiting for response to warmup request for container examplewebapp__aetete. Elapsed time = 88.1001723 sec
2020-09-22T16:16:40.281Z INFO - Waiting for response to warmup request for container examplewebapp__aetete. Elapsed time = 103.3011586 sec
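
To see how often the container hits the startup limit, a log like the one above can be scanned for ERROR entries that report an elapsed time. The sketch below is an assumption-laden helper, not an Azure tool: the function names are hypothetical, and the entry format (timestamp, then ERROR or INFO, then the message) is inferred only from the excerpt shown here. It tolerates entries that are wrapped across lines.

```python
import re

# Start of an entry: ISO timestamp followed by the level, as seen in the excerpt.
ENTRY_START = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)\s+(ERROR|INFO)\b"
)
ELAPSED = re.compile(r"Elapsed time\s*=\s*([\d.]+)\s*sec")


def split_entries(log_text):
    """Split (possibly line-wrapped) log text into (timestamp, level, message)."""
    flat = " ".join(log_text.split())  # undo line wrapping
    matches = list(ENTRY_START.finditer(flat))
    entries = []
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(flat)
        # Strip the separator (hyphen or en dash, U+2013) around the message.
        message = flat[current.end():end].strip(" \u2013-")
        entries.append((current.group(1), current.group(2), message))
    return entries


def startup_timeouts(log_text):
    """Return the elapsed seconds of every ERROR entry that reports one."""
    timeouts = []
    for _timestamp, level, message in split_entries(log_text):
        found = ELAPSED.search(message)
        if level == "ERROR" and found:
            timeouts.append(float(found.group(1)))
    return timeouts
```

Applied to the excerpt above, this would pick out the single failed start at roughly 600 seconds and ignore the INFO-level warmup messages.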

Default_Docker

2020-09-22T11:45:17.432527708Z [ASCII-art banner]
2020-09-22T11:45:17.432539208Z A P P   S E R V I C E   O N   L I N U X
2020-09-22T11:45:17.432542708Z
2020-09-22T11:45:17.432562008Z Documentation: http://aka.ms/webapp-linux
2020-09-22T11:45:17.432565208Z
2020-09-22T11:45:17.432568708Z NOTE: No files or system changes outside of /home will persist beyond your application's current session. /home is your application's persistent storage and is shared across all the server instances.
2020-09-22T11:45:17.432573808Z
2020-09-22T11:45:17.432576808Z
2020-09-22T11:45:17.432836008Z Setup openrc …
2020-09-22T11:45:20.011688823Z * Caching service dependencies … [ ok ]
2020-09-22T11:45:20.040479470Z Updating /etc/ssh/sshd_config to use PORT 2222
2020-09-22T11:45:20.056556396Z Starting ssh service…
2020-09-22T11:45:23.318735610Z ssh-keygen: generating new host keys: RSA DSA ECDSA ED25519
2020-09-22T11:45:27.654655866Z * Starting sshd … [ ok ]
2020-09-22T11:45:27.675340497Z ## Printing build info…
2020-09-22T11:45:27.685373113Z PACKAGE | VERSION | COMMIT
2020-09-22T11:45:27.685419013Z Microsoft.AppService.EasyAuthExtensionsJava | 1.0.011720002-alpha-793ad718 | 793ad718
2020-09-22T11:45:27.685426413Z Microsoft.AppService.WebsitesExtensionsJava | 1.0.011730003-alpha-53ae38d3 | 53ae38d3
2020-09-22T11:45:27.685430813Z self | 1.0.011730002-alpha-c6f00046 | c6f00046
2020-09-22T11:45:27.687085515Z ## Done printing build info.

2020-09-22T11:55:23.212406842Z [ASCII-art banner]

Best Answer

The following information is garnered from a Microsoft support call.

This was happening because the temporary file storage had been used up. On a P1V2 Linux plan you get 35 GB of temporary file storage, and on a P2V2 Linux plan you get 69 GB.

You can check how much your application is using by going to "Diagnose and Solve problems" and then selecting "Temp File Usage on Workers".

Note that there is no Microsoft-supported way of actually accessing these temp files, nor of deleting them, other than by scaling your instance up (e.g. from P1V2 to P2V2), waiting 15 minutes, and then scaling it back down.

The 15-minute wait is necessary because otherwise you run the risk of being returned to your previous instance without it having been wiped.
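
The scale up, wait, scale down sequence can be scripted. What follows is a minimal sketch, assuming the Azure CLI (`az`) is installed and logged in and that the app runs on a plan you are allowed to resize; the function names, the resource group, and the plan name are hypothetical placeholders. Running on the larger tier, even briefly, is billed at that tier.

```python
import subprocess
import time

WAIT_SECONDS = 15 * 60  # the 15-minute wait recommended by support


def plan_update_command(resource_group, plan_name, sku):
    """Build the Azure CLI call that changes the pricing tier of a plan."""
    return [
        "az", "appservice", "plan", "update",
        "--resource-group", resource_group,
        "--name", plan_name,
        "--sku", sku,
    ]


def recycle_plan(resource_group, plan_name,
                 original_sku="P1V2", temporary_sku="P2V2",
                 run=subprocess.check_call, sleep=time.sleep):
    """Scale up, wait, then scale back down to the original tier.

    `run` and `sleep` are injectable so the sequence can be dry-run.
    """
    run(plan_update_command(resource_group, plan_name, temporary_sku))
    sleep(WAIT_SECONDS)
    run(plan_update_command(resource_group, plan_name, original_sku))
```

For a dry run, pass `run=print` and `sleep=lambda seconds: None`, for example `recycle_plan("my-resource-group", "my-plan", run=print, sleep=lambda seconds: None)`, which only prints the two commands.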

The Microsoft support engineer also made the following additional changes (I do not know whether these were necessary to solve the problem):

  1. Stopped the Deployment Slot I was using

  2. Added Application Slot Configuration Parameter WEBSITES_CONTAINER_START_TIME_LIMIT = 1800

  3. Changed Java Web Server Version from 9.0 to 9.0.20

  4. Removed the deployments by going to /home/deployments and running

    rm -rf *
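
For item 2 in the list above, the same app setting can also be applied from the command line instead of the portal. This is a sketch only, assuming the Azure CLI is installed; the function name, resource group, and app name are hypothetical placeholders, and the code only builds and prints the command rather than running it.

```python
def start_time_limit_command(resource_group, app_name, seconds=1800, slot=None):
    """Build the Azure CLI call that sets WEBSITES_CONTAINER_START_TIME_LIMIT."""
    command = [
        "az", "webapp", "config", "appsettings", "set",
        "--resource-group", resource_group,
        "--name", app_name,
        "--settings", f"WEBSITES_CONTAINER_START_TIME_LIMIT={seconds}",
    ]
    if slot:
        # Target a deployment slot instead of the production app.
        command += ["--slot", slot]
    return command


if __name__ == "__main__":
    # Placeholder names; replace with your own before running the printed command.
    print(" ".join(start_time_limit_command("my-resource-group", "examplewebapp")))
```

With a limit of 1800 seconds the platform waits up to 30 minutes for the container to answer, compared with the roughly 600 seconds after which the start was abandoned in the Docker log in the question.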

See this question for more info about Azure temp files. Note that the solution given there for viewing them does not appear to work for Azure Web Apps on Linux.
