Fixing "Our Services aren't available right now" error on Azure Front Door


Problem:

In one of the projects I am currently working on, we have deployed Sitecore on Azure PaaS infrastructure. In past projects, I used Azure Traffic Manager to route traffic to the nearest geographical location and to control which web apps received traffic, which in turn helped us deliver on the promise of Blue/Green deployment.

With Traffic Manager, we faced a challenge that haunted us for many months: the famous 503 Service Unavailable error. According to Microsoft, it can be caused by any of the following:

Requests taking a long time
Application using high memory/CPU
Application crashing due to an exception.

We spent hours trying to solve the problem, tackling it from each of the possibilities above. In the end, we were not satisfied with what Traffic Manager offered with respect to its timeout strategy.

(Learn more about 502 and 503 errors here - https://docs.microsoft.com/en-us/azure/app-service/troubleshoot-http-502-http-503)

The issue appears to be related to the Redis session state provider:

https://sitecorepassion.wordpress.com/2018/08/14/hotfix-for-azure-sitecore-9-0-1-web-app-500-502-503-service-unavailable-frequently-unstable/

This prompted us to look for a new solution, and in the hunt we settled on Azure Front Door as the replacement for Azure Traffic Manager. Our main reasons for choosing Azure Front Door were:

Azure Front Door provides TLS protocol termination (SSL offload), and Azure Traffic Manager does not
Azure Front Door provides application-layer processing, and Azure Traffic Manager does not. This means that Azure Front Door can do things like URL rewriting and that it provides a Web Application Firewall (WAF) that protects you against common web attacks. To use WAF with Azure Traffic Manager we have to deploy another managed solution like Application Gateway.

(Above points are from https://microsoft.github.io/AzureTipsAndTricks/blog/tip192.html)

While the reasons for swapping Azure Traffic Manager for Front Door were appreciated, little did we know that our choice would lead us into similar problems of service unavailability.

This time we hit the "Our services aren't available right now" error. Not exactly the same, but a similar one. Nevertheless, the pain remained the same: our clients were not happy to see these messages during their test cycles. And there wasn't a lot we could do about it, due to the following limits in Azure:

Timeout values (https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits#timeout-values)

Client to Front Door

Front Door has an idle TCP connection timeout of 61 seconds.

Front Door to the application back-end

If the response is a chunked response, a 200 is returned if or when the first chunk is received.
After the HTTP request is forwarded to the back end, Front Door waits for 30 seconds for the first packet from the back end. Then it returns a 503 error to the client.
After the first packet is received from the back end, Front Door waits for 30 seconds in an idle timeout. Then it returns a 503 error to the client.
Front Door to the back-end TCP session timeout is 30 minutes.
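
To make these limits concrete, here is a minimal Python sketch that models the two 30-second rules quoted above. The `BackendTiming` class and `would_time_out` function are hypothetical names of my own, and the exact behavior at the 30-second boundary is an assumption; this is a simplified model for reasoning about the limits, not a description of how Front Door is implemented.

```python
from dataclasses import dataclass

# Limits quoted from the Azure documentation above, in seconds.
FIRST_PACKET_TIMEOUT_S = 30
IDLE_TIMEOUT_S = 30


@dataclass
class BackendTiming:
    """Hypothetical measurements for a single request, in seconds."""
    time_to_first_packet: float
    longest_gap_between_packets: float = 0.0


def would_time_out(timing,
                   first_packet_timeout=FIRST_PACKET_TIMEOUT_S,
                   idle_timeout=IDLE_TIMEOUT_S):
    """Return True if, under this simplified model, Front Door would give up.

    Assumption: the timeout fires only when the wait strictly exceeds the limit.
    """
    if timing.time_to_first_packet > first_packet_timeout:
        return True  # back end too slow to send its first packet
    if timing.longest_gap_between_packets > idle_timeout:
        return True  # back end went quiet for too long mid-response
    return False


# A page that needs 45 seconds before sending anything trips the default limit.
slow_page = BackendTiming(time_to_first_packet=45)
print(would_time_out(slow_page))
```

Passing a larger `first_packet_timeout` shows how the same request would fare if the limit could be raised, which is what the rest of this post is about.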

Solution

Thanks to our relentless searching for a resolution, I stumbled across this entry on the Azure feedback site:

Allow configurable timeout period for Front Door

According to the forum, there is currently no way to configure the default timeout of Azure Front Door from the Azure Portal. But that doesn't mean we cannot update it: we can use the Azure REST API to configure this setting. Take a look at the Azure Front Door specification on GitHub:
https://github.com/Azure/azure-rest-api-specs/blob/master/specification/frontdoor/resource-manager/Microsoft.Network/stable/2019-05-01/frontdoor.json

Here is a snippet of that JSON file; the sendRecvTimeoutSeconds property is the part relevant to updating the timeout.


"BackendPoolsSettings":
{
 "description": "Settings that apply to all backend pools.",
 "type": "object",
 "properties":
 {
  "enforceCertificateNameCheck":
  {
   "description": "Whether to enforce certificate name check on HTTPS requests to all backend pools. No effect on non-HTTPS requests.",
   "enum": [ "Enabled", "Disabled" ],
   "type": "string",
   "x-ms-enum":
   {
    "name": "enforceCertificateNameCheckEnabledState",
    "modelAsString": true
   },
   "default": "Enabled"
  },
  "sendRecvTimeoutSeconds":
  {
   "description": "Send and receive timeout on forwarding request to the backend. When timeout is reached, the request fails and returns.",
   "type": "integer",
   "minimum": 16,
   "exclusiveMinimum": false
  }
 }
}

Once we figured this out, I tried setting this value using Postman.

Learning

Although we knew we could update the timeout with a 'PUT' API call, figuring out the right payload was not very intuitive. Below are a few learnings from the exercise.

Payload as simple key-value data:
Request

PUT /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/<resourcegroup>/providers/Microsoft.Network/frontDoors/<azurefrontdoor-name>?api-version=2019-05-01 HTTP/1.1
Host: management.azure.com
Authorization: Bearer XXXX
Content-Type: text/plain
User-Agent: PostmanRuntime/7.11.0
Accept: */*
Cache-Control: no-cache
Postman-Token: bf25c7d0-01e3-4925-ad86-7b7fe5a18a6d,efb8605b-935e-4c70-a7c6-ecd7dffc48af
Host: management.azure.com
accept-encoding: gzip, deflate
content-length: 28
Connection: keep-alive
cache-control: no-cache

"sendRecvTimeoutSeconds": 60

Response

{
  "error": {
    "code": "UnsupportedMediaType",
    "message": "The content media type 'text/plain' is not supported. Only 'application/json' is supported."
  }
}
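
For anyone scripting this instead of using Postman, the following Python sketch builds the same PUT request with the media type the API asks for. The `build_put_request` helper and the sample resource names are placeholders of my own; the URL shape and api-version are taken from the request above. It only constructs the request and does not send it, and it addresses the media type error only, not the payload itself.

```python
import json
import urllib.request

API_VERSION = "2019-05-01"  # same api-version as in the request above


def build_put_request(subscription_id, resource_group, front_door_name, token, body):
    """Build (but do not send) a PUT request with a JSON body.

    Hypothetical helper: all arguments are placeholders supplied by the caller.
    """
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/frontDoors/{front_door_name}"
        f"?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),  # serialize the dict as JSON
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # the only accepted media type
        },
    )


# Placeholder values only; the body here is a stand-in, not a working payload.
request = build_put_request(
    "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "my-resource-group",
    "my-front-door",
    "XXXX",
    {"placeholder": True},
)
print(request.get_method(), request.full_url)
```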

Payload as simple JSON:

Request payload

{
  "properties": {
    "backendPoolsSettings": {
      "sendRecvTimeoutSeconds": 120
    }
  }
}

Response

{
  "error": {
    "code": "LocationRequired",
    "message": "The location property is required for this definition."
  }
}

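Payload with location information:

Adding the location property to the payload above resulted in a Bad Request response.

Full response of GET as payload:

We then fetched the current configuration with a GET on the same URL (the Azure Front Door Management API GET response) and sent that full response back as the PUT payload, with sendRecvTimeoutSeconds changed.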

Once we did that, we were able to successfully update the timeout to 120 seconds.

(Figure: sendRecvTimeoutSeconds set to 120)
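
The step that finally worked, taking the full GET response and changing only the timeout before sending it back, can be sketched in Python as follows. The `with_timeout` function is a hypothetical helper, and the `current` dictionary is a heavily trimmed, illustrative stand-in for a real GET response, which contains many more properties.

```python
import copy


def with_timeout(front_door, seconds):
    """Return a copy of a Front Door GET response with the timeout changed.

    Everything else in the document is left untouched, so the result can be
    used as the body of the PUT call.
    """
    updated = copy.deepcopy(front_door)  # never mutate the original response
    properties = updated.setdefault("properties", {})
    settings = properties.setdefault("backendPoolsSettings", {})
    settings["sendRecvTimeoutSeconds"] = seconds
    return updated


# Illustrative stand-in for the GET response (assumed values, trimmed).
current = {
    "name": "my-front-door",
    "location": "Global",
    "properties": {
        "backendPoolsSettings": {
            "enforceCertificateNameCheck": "Enabled",
            "sendRecvTimeoutSeconds": 30,
        },
    },
}

payload = with_timeout(current, 120)
print(payload["properties"]["backendPoolsSettings"])
```

Working on a copy keeps the original GET response around, which is handy if the PUT is rejected and you want to compare the two documents.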

It is possible that some of your requests may take longer than 120 seconds to return due to various design and technical constraints. This setting will not solve all of those problems, as the maximum value accepted for sendRecvTimeoutSeconds is only 240 seconds. We tried setting the value to 600 seconds and found that it was not accepted.
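
To avoid a round trip that ends in a rejected request, a small check like the following could be run before calling the API. This is only a sketch: the lower bound of 16 comes from the "minimum" in the frontdoor.json snippet above, the upper bound of 240 is what we observed in our own attempts, and `check_timeout` is a hypothetical helper name.

```python
MIN_TIMEOUT_S = 16   # "minimum" in the frontdoor.json schema
MAX_TIMEOUT_S = 240  # upper bound observed in our attempts (assumption)


def check_timeout(seconds):
    """Return seconds unchanged if it is in the accepted range, else raise."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("sendRecvTimeoutSeconds must be an integer")
    if not MIN_TIMEOUT_S <= seconds <= MAX_TIMEOUT_S:
        raise ValueError(
            f"sendRecvTimeoutSeconds must be between {MIN_TIMEOUT_S} "
            f"and {MAX_TIMEOUT_S}, got {seconds}"
        )
    return seconds


print(check_timeout(120))
```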
