Looking for a recommendation on measuring a high availability app that is using a CDN

cdn high-availability metrics reporting

I work for a Fortune 500 company that struggles to accurately measure performance and availability for high-availability applications (i.e., apps that are up 99.5% of the time, with 5-second page-to-page navigation). We factor in both scheduled and unscheduled downtime to determine this availability number. However, we recently added a CDN into the mix, which complicates our metrics. The CDN now handles about 75% of our traffic and sends the remainder to our own servers.

We attempt to measure what we call a "true user experience" (i.e., our testing scripts emulate a typical user clicking through the application.) These monitoring scripts sit outside of our network, which means we're hitting the CDN about 75% of the time.

Management has decided that we take the worst-case scenario to measure availability. So if our origin servers are having problems while the CDN is serving content just fine, we still take a hit on availability. The same is true the other way around. My thought is that as long as the "user experience" is successful, we should not unnecessarily punish ourselves. After all, a CDN is there to improve performance and availability!

I'm just wondering if anyone knows how other Fortune 500 companies calculate their availability numbers. I look at apple.com, for instance, as an example of a storefront that uses a CDN and never seems to be down (unless there is about to be a major product announcement). It would be great to have some hard, factual data, because I don't believe that we need to unnecessarily hurt ourselves on these metrics. We are making business decisions based on these numbers.

I can say, however, given that these metrics are visible to management, issues get addressed and resolved pretty fast (read: we cut through the red-tape pretty quick.) Unfortunately, as a developer, I don't want management to think that the application is up or down because some external factor (i.e., CDN) is influencing the numbers.

Thoughts?

(I mistakenly posted this question on StackOverflow, sorry in advance for the cross-post)

Best Answer

In the abstract, I would say you should sharply define what constitutes "available" vs. "unavailable" and measure yourself against it. For example, you could have a client-side performance SLA for the site of 1 second to the "fold" and 3 seconds for a completely rendered page. When you don't meet the performance SLA, you should count that as an availability failure for that time period. It shouldn't matter whether you're hitting the CDN or not -- the user experience is what matters.
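A minimal sketch of that idea in Python, assuming each synthetic probe records a time-to-fold and a time-to-full-render. The `Probe` type, its field names, and the `availability` helper are hypothetical, and the thresholds are just the example SLA numbers above:

```python
from dataclasses import dataclass
from typing import Optional

# Example SLA thresholds from the answer above (seconds).
FOLD_SLA_S = 1.0
FULL_RENDER_SLA_S = 3.0


@dataclass
class Probe:
    """One synthetic measurement (hypothetical record layout)."""
    fold_s: Optional[float]         # None means the page never loaded
    full_render_s: Optional[float]  # None means the page never finished


def meets_sla(p: Probe) -> bool:
    """A probe counts as 'available' only if both timings meet the SLA."""
    if p.fold_s is None or p.full_render_s is None:
        return False
    return p.fold_s <= FOLD_SLA_S and p.full_render_s <= FULL_RENDER_SLA_S


def availability(probes: list[Probe]) -> float:
    """Percentage of probes that met the SLA, regardless of who served them."""
    if not probes:
        raise ValueError("no probes to evaluate")
    good = sum(1 for p in probes if meets_sla(p))
    return 100.0 * good / len(probes)


probes = [
    Probe(0.6, 2.1),    # fast: available
    Probe(0.9, 3.4),    # full render too slow: counts as unavailable
    Probe(None, None),  # hard failure: unavailable
    Probe(0.8, 2.9),    # available
]
print(f"{availability(probes):.1f}%")
```

Note that the slow probe is treated exactly like the hard failure, which is the point of folding the performance SLA into the availability number.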

However, since you're only taking measurements every 5 minutes, it seems reasonable to measure hits to the CDN vs. the master site separately, and calculate that 75% of availability is coming from the CDN and 25% from the master. The difficulty here is that 75% is just an average. To apportion blame accurately for a given time period, you need to know when one or the other site is not actually customer-facing, e.g., during a planned change or after manual action when a problem is detected. You also need to factor in what happens when either the master site or the CDN is down. Does the customer get an HTTP 500, or do they just transparently fail over to the working site? A lot depends on your load balancing solution. The "worst-case" metric you described seems too simplistic. Ask yourself, "What are our customers experiencing?"
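To see how far apart the two approaches can land, here is a rough Python sketch comparing a worst-case rollup with a traffic-weighted one. The interval data and function names are made up for illustration, the 75/25 split is the average from the question, and the sketch assumes no failover between the CDN and the master site (a request sent to the failing side simply fails):

```python
# Each interval records whether the CDN and the origin ("master") passed
# their checks. The data below is invented for illustration.
intervals = [
    # (cdn_ok, origin_ok)
    (True, True),
    (True, False),   # origin trouble, CDN fine
    (True, True),
    (False, True),   # CDN trouble, origin fine
    (True, True),
    (True, True),
    (True, True),
    (True, True),
]

CDN_SHARE = 0.75  # average fraction of traffic served by the CDN


def worst_case(intervals):
    """An interval is 'up' only if both the CDN and the origin were up."""
    up = sum(1 for cdn_ok, origin_ok in intervals if cdn_ok and origin_ok)
    return 100.0 * up / len(intervals)


def traffic_weighted(intervals, cdn_share=CDN_SHARE):
    """Weight each side by its traffic share; assumes no failover."""
    total = 0.0
    for cdn_ok, origin_ok in intervals:
        total += cdn_share * cdn_ok + (1.0 - cdn_share) * origin_ok
    return 100.0 * total / len(intervals)


print(f"worst case:       {worst_case(intervals):.2f}%")
print(f"traffic weighted: {traffic_weighted(intervals):.2f}%")
```

If requests do fail over transparently to the working side, neither formula describes what customers see, which is why the failover question above matters before picking one.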

As far as whether you should take "blame" when the CDN is down: absolutely. If 75% of your hits are going to the CDN, then 75% of your customer experience is dependent on them. You're responsible for providing a good experience to your customers, so if the CDN is having issues, you need to use your engineering resources to prove it and follow up with the provider.

One other thing to think about is what happens when the master site is unavailable for an extended period of time. As you've described it, it sounds like the CDN is a static copy of the content on the master site. If the master site is down for a long time, the CDN could start to get stale. So maybe part of your SLA should be freshness: 1 second to the "fold" and 3 seconds for a completely rendered page, with content no more than 15 minutes old.
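A freshness check could be as small as the following Python sketch. It assumes the monitoring script can read a last-updated timestamp for the content it was served (how that timestamp is exposed is not covered here), and `is_fresh` and `MAX_AGE` are hypothetical names:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=15)  # example freshness limit from the text


def is_fresh(last_updated: datetime, now: datetime,
             max_age: timedelta = MAX_AGE) -> bool:
    """True if the served content was updated within max_age of 'now'."""
    return now - last_updated <= max_age


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(minutes=10), now))  # within the limit
print(is_fresh(now - timedelta(minutes=40), now))  # stale
```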

Related Solutions

How to Achieve High Availability for SMB using virtualization

As others have mentioned above, it depends on your actual requirements for availability.

Option 1) I know what it means and I really need HA and Fault Tolerance.

Assuming you want a decent level of availability, I would budget at least $50,000 to $100,000 as a starting point.

Let's assume for easy maths that you have 20 users who need access to LDAP and file sharing. To do this you run up two physical machines attached to a SAN (required for most mid-range virtualisation technologies). The physical machines run VMware, with 2 virtual machines running Windows as AD replicas and file servers.

Costs so far (REALLY rough maths):

Windows + CALs ~ $5,000
VMware ~ $14,000
SAN ~ $35,000
Physical servers ~ $25,000 (32 GB of RAM)
Network infrastructure ~ $5,000

Let's call the total $80k and include working/run-up costs in there. Let's ignore power and connectivity costs for now.
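As a quick check of the rough figures listed above, a few lines of Python (the dictionary keys are just labels for the line items):

```python
# Rough line items from the list above, in US dollars.
costs = {
    "Windows + CALs": 5_000,
    "VMware": 14_000,
    "SAN": 35_000,
    "Physical servers": 25_000,
    "Network infrastructure": 5_000,
}

total = sum(costs.values())
print(f"${total:,}")
```

The items add up to a little over the $80k round number, which is within the precision of estimates this rough.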

This will buy you a scenario where your VMware instances can fail over very quickly, and the internal mechanics of Windows clustering should allow you to have fault tolerance. It will also allow you to scale out the VMs as required with future growth.

Assuming you now say VMware is too expensive and go with Xen, you can drop the VMware licensing fees. However, your biggest-ticket items, the SAN and physical hardware, still exist, and you won't ever be able to avoid the Windows licensing. You will also have more administrative overhead and enjoy the wonderful learning curve that is Linux HA. I have not worked with Windows Hyper-V, so I won't quote there, but I expect the hardware costs are unlikely to be reduced.

Option 2) I'd like to be able to recover quickly from a hardware failure

If by HA you mean "I'd like to be able to recover quickly from a hardware failure, but I can have decent maintenance time when I need it", a single virtual host with sufficient disk space will likely do. Take reliable backups often enough and you will be able to recover to another machine quickly. Buy a decent enough single VM host and your downtime should be minimal.

This will allow you to dip your toes cheaply into virtualisation and learn quickly what you really want.

Option 3) I'm insane and wish to use older hardware I've hoarded over the last decade

Grab three machines. On one you will run Linux with an iSCSI target (I've used ietd with success); on the remaining two you will run VMware ESXi. Configure the ESXi hosts to connect to the Linux iSCSI target as their storage and you have a very cheap SAN. Between the ESXi hosts you can manually balance whatever machines you need. If you run two Windows instances in clustered mode, you can lose an ESXi host without too much consequence. A bonus with this is that you can add extra ESXi hosts at little cost, and transferring guests around only requires a reboot.

If you want to get tricky, grab a 4th machine and run Linux DRBD to block-replicate the "SAN" for redundancy there.

Ultimately

For any kind of VM migration and most clustering/failover setups, however, you will need a SAN. While 4-8 TB isn't a lot in consumer or single-server storage, it is still significant in terms of reliable, quick SAN storage.

If you are serious about this I would actually just pick up a phone and ring your favourite vendor (HP/Dell/IBM et al), ask them what they can do for your budget and start from there. They have cookie cutter builds and are proficient at dropping clusters that will suit most shops easily enough.

The really short answer

For 50 vanilla users I simply wouldn't. I'd back a server up often and collect the overtime running my patches on a weekend. The complexity simply isn't worth it, and if it is, don't scrimp on costs.

Using Google’s App Engine as CDN for static files

App Engine is a cloud computing platform and is not designed to be a CDN. While your data may be stored on multiple nodes, those nodes are not edge-cache nodes, so they will not offer the same benefits that a CDN would. You can compare GAE vs. various CDNs using the CloudHarmony.com speed test. Here were the results when I tested today:

Order  Service           Type      Size       Time (secs)  Rate (Mb/s)
1      Google AppEngine  download  1.00 MB    3.50         2.29
2      Google AppEngine  upload    512.00 KB  3.57         1.12
3      Google AppEngine  website   102.55 KB  0.75         1.07

Order  Service         Type      Size     Time (secs)  Rate (Mb/s)
05     EdgeCast CDN    download  1.00 MB  1.03         7.77
02     Cotendo CDN     download  1.00 MB  1.08         7.37
12     Amz CloudFront  download  1.00 MB  1.11         7.19
10     CacheFly CDN    download  1.00 MB  1.29         6.19
08     Azure CDN       download  1.00 MB  1.36         5.90
07     Internap CDN    download  1.00 MB  1.47         5.43
09     VoxCAST CDN     download  1.00 MB  1.55         5.17
04     SimpleCDN       download  1.00 MB  1.65         4.84
06     MaxCDN          download  1.00 MB  1.69         4.73
03     Highwinds CDN   download  1.00 MB  1.81         4.43
11     Akamai CDN      download  1.00 MB  2.22         3.60
01     LimeLight CDN   download  1.00 MB  2.34         3.42

You'll see that the CDNs end up being roughly 1.5 to 3.4 times faster than GAE for the 1 MB file download (3.50 seconds on GAE vs. 1.03 to 2.34 seconds on the CDNs).
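The ratio can be worked out directly from the 1 MB download times in the two tables. A small Python sketch (variable names are illustrative; the numbers are copied from the tables):

```python
gae_download_s = 3.50  # Google AppEngine, 1.00 MB download

# 1.00 MB download times in seconds, from the CDN table.
cdn_download_s = {
    "EdgeCast": 1.03,
    "Cotendo": 1.08,
    "CloudFront": 1.11,
    "CacheFly": 1.29,
    "Azure": 1.36,
    "Internap": 1.47,
    "VoxCAST": 1.55,
    "SimpleCDN": 1.65,
    "MaxCDN": 1.69,
    "Highwinds": 1.81,
    "Akamai": 2.22,
    "LimeLight": 2.34,
}

# Speedup = GAE time divided by CDN time for the same file size.
speedups = {name: gae_download_s / t for name, t in cdn_download_s.items()}

for name, s in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12} {s:.2f}x")
```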
