Looking for a recommendation on measuring a high availability app that is using a CDN

cdn high-availability metrics reporting

I work for a Fortune 500 company that struggles to accurately measure performance and availability for high-availability applications (i.e., apps that are up 99.5% of the time with 5-second page-to-page navigation). We factor in both scheduled and unscheduled downtime to determine this availability number. However, we recently added a CDN into the mix, which complicates our metrics. The CDN now handles about 75% of our traffic and sends the remainder to our own servers.

We attempt to measure what we call a "true user experience" (i.e., our testing scripts emulate a typical user clicking through the application.) These monitoring scripts sit outside of our network, which means we're hitting the CDN about 75% of the time.

Management has decided that we take the worst-case scenario to measure availability. So if our origin servers are having problems while the CDN is serving content just fine, we still take a hit on availability. The same is true the other way around. My thought is that as long as the "user experience" is successful, we should not unnecessarily punish ourselves. After all, a CDN is there to improve performance and availability!

Does anyone know how other Fortune 500 companies calculate their availability numbers? I look at apple.com, for instance, as an example of a storefront that uses a CDN and never seems to be down (unless there is about to be a major product announcement). It would be great to have some hard, factual data, because I don't believe that we need to unnecessarily hurt ourselves on these metrics. We are making business decisions based on these numbers.

I can say, however, that because these metrics are visible to management, issues get addressed and resolved pretty fast (read: we cut through the red tape pretty quickly). Unfortunately, as a developer, I don't want management to think that the application is up or down because some external factor (e.g., the CDN) is influencing the numbers.

Thoughts?

(I mistakenly posted this question on StackOverflow, sorry in advance for the cross-post)

Best Answer

In the abstract, I would say you should sharply define what constitutes "available" vs. "unavailable" and measure yourself against it. For example, you could have a client-side performance SLA for the site of 1 second to the "fold" and 3 seconds for a completely rendered page. When you don't meet the performance SLA, you should count that as an availability failure for that time period. It shouldn't matter whether you're hitting the CDN or not -- the user experience is what matters.
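As a rough illustration of that definition, here is a minimal Python sketch that treats a monitoring sample as "available" only if the request succeeded and both timing thresholds were met. The `Probe` record and its field names are hypothetical, and the thresholds are just the example numbers from above, not a recommendation.

```python
from dataclasses import dataclass
from typing import Iterable

# Example SLA thresholds from the text above (assumed values, tune to your own SLA).
FOLD_SLA_S = 1.0         # seconds until content above the "fold" is visible
FULL_RENDER_SLA_S = 3.0  # seconds until the page is completely rendered


@dataclass
class Probe:
    """One synthetic-monitoring sample (hypothetical record layout)."""
    ok: bool        # the scripted request completed without an error
    fold_s: float   # measured time to the fold, in seconds
    full_s: float   # measured time to a fully rendered page, in seconds


def is_available(p: Probe) -> bool:
    # A sample counts as available only if it succeeded AND met both timings.
    return p.ok and p.fold_s <= FOLD_SLA_S and p.full_s <= FULL_RENDER_SLA_S


def availability(probes: Iterable[Probe]) -> float:
    """Fraction of samples that met the SLA."""
    samples = list(probes)
    if not samples:
        raise ValueError("no samples to evaluate")
    return sum(is_available(p) for p in samples) / len(samples)


samples = [
    Probe(ok=True, fold_s=0.8, full_s=2.5),   # meets the SLA
    Probe(ok=True, fold_s=1.4, full_s=2.9),   # too slow to the fold
    Probe(ok=False, fold_s=0.0, full_s=0.0),  # request failed outright
    Probe(ok=True, fold_s=0.9, full_s=3.0),   # meets the SLA (exactly at the limit)
]
print(availability(samples))
```

Note that nothing in this check asks whether the sample was served by the CDN or by the origin, which is the point: a slow page counts against you either way.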

However, since you're only taking measurements every 5 minutes, it seems reasonable to measure hits to the CDN vs. the master site separately, and calculate that 75% of availability is coming from the CDN and 25% from the master. The difficulty here is that 75% is just an average. To apportion blame accurately for a given time period, you need to know when one or the other site is not actually customer-facing, e.g., during a planned change or after manual action when a problem is detected. You also need to factor in what happens when either the master site or the CDN is down. Does the customer get an HTTP 500, or do they just transparently fail over to the working site? A lot depends on your load balancing solution. The "worst-case" metric you described seems too simplistic. Ask yourself, "What are our customers experiencing?"
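To make the difference between the two approaches concrete, here is a small Python sketch comparing a traffic-weighted number with the "worst-case" number over a handful of measurement intervals. The `Interval` record and the sample values are invented for illustration, and the sketch assumes there is no transparent failover (a user routed to the broken side simply fails), which may not match your load balancing setup.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Interval:
    """One measurement interval (hypothetical record layout)."""
    cdn_share: float  # fraction of user traffic the CDN served in this interval
    cdn_ok: bool      # CDN probe met the SLA
    origin_ok: bool   # origin (master site) probe met the SLA


def blended_availability(intervals: Iterable[Interval]) -> float:
    """Weight each side by the share of users it was actually serving."""
    items = list(intervals)
    if not items:
        raise ValueError("no intervals to evaluate")
    total = 0.0
    for i in items:
        total += i.cdn_share * float(i.cdn_ok) + (1.0 - i.cdn_share) * float(i.origin_ok)
    return total / len(items)


def worst_case_availability(intervals: Iterable[Interval]) -> float:
    """An interval counts as up only if BOTH sides were healthy."""
    items = list(intervals)
    if not items:
        raise ValueError("no intervals to evaluate")
    return sum(i.cdn_ok and i.origin_ok for i in items) / len(items)


history = [
    Interval(cdn_share=0.75, cdn_ok=True, origin_ok=True),   # everything healthy
    Interval(cdn_share=0.75, cdn_ok=True, origin_ok=False),  # origin trouble, 25% of users affected
    Interval(cdn_share=1.00, cdn_ok=True, origin_ok=False),  # origin out of rotation (planned change)
    Interval(cdn_share=0.75, cdn_ok=False, origin_ok=True),  # CDN trouble, 75% of users affected
]
print(blended_availability(history))     # 0.75
print(worst_case_availability(history))  # 0.25
```

The third interval shows why the per-interval traffic share matters: when the origin is not customer-facing, its failure costs nothing in the blended number, while the worst-case number still counts it as downtime.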

As far as whether you should take "blame" when the CDN is down: absolutely. If 75% of your hits are going to the CDN, then 75% of your customer experience is dependent on them. You're responsible for providing a good experience to your customers, so if the CDN is having issues, you need to use your engineering resources to prove it and follow up with the provider.

One other thing to think about is what happens when the master site is unavailable for an extended period of time. As you've described it, it sounds like the CDN is a static copy of the content on the master site. If the master site is down for a long time, the CDN could start to get stale. So maybe part of your SLA should be freshness: 1 second to the "fold" and 3 seconds for a completely rendered page, with content no more than 15 minutes old.
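A freshness check like that could be sketched in Python as follows. It assumes your monitoring script can obtain a "last updated" timestamp for the page it fetched (for example from a response header or a marker embedded in the page); how you get that timestamp is up to you and is not shown here. The 15-minute limit is just the example figure from above.

```python
from datetime import datetime, timedelta, timezone

# Example freshness limit from the text above (assumed value).
MAX_CONTENT_AGE = timedelta(minutes=15)


def is_fresh(last_updated: datetime, now: datetime,
             max_age: timedelta = MAX_CONTENT_AGE) -> bool:
    """True if the served content is no older than max_age."""
    return now - last_updated <= max_age


# Hypothetical timestamps: content last refreshed at 12:00 UTC.
last_updated = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

print(is_fresh(last_updated, datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)))  # True
print(is_fresh(last_updated, datetime(2024, 1, 1, 12, 20, tzinfo=timezone.utc)))  # False
```

A sample that fails this check could then be counted as an availability failure in the same way as a sample that misses the timing thresholds.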