Best way to load balance across multiple static file servers for even bandwidth distribution

Tags: bandwidth, cluster, haproxy, load balancing

First off, I'll explain my situation. I'm running a fairly popular website as a side project, so I can't really invest a ton of money into it. I currently have just one server, with HAProxy in front sending normal requests to Apache and all static file requests to Lighttpd. This works really well because all PHP and POST requests get handled by Apache, while all images are served by the faster Lighttpd (the site is mostly images, so this is really important). I would rather not set up a subdomain for serving the images, because short URLs are also really important, which is why I use HAProxy.

I've found a hosting provider that offers pretty cheap unmetered bandwidth, which I've been using. The problem comes when I start pushing out as much bandwidth as the 100 Mbps network card can handle, at which point I need a second server.

I've put a lot of thought into my options, so I'll explain each one to you. Hopefully you could provide some insight into which one is the best option for me, or maybe there's another option out there that I haven't thought of yet.

Requirements:

  • Even bandwidth distribution is a must. I already have a pretty powerful server, so scaling up won't help. I need to scale out to gain more bandwidth.

  • Short URLs. I really don't want to set up a subdomain, like img.example.com, to serve my images. example.com/image.jpg is how it is now, and how I would really like it to stay. But if there's no other way, then I understand.

  • Having the closest server handle the request would be really nice, but is not a must. Something to keep in mind.

HAProxy to load balance:

  • It would be really easy to do since I'm already using HAProxy anyway. However, I think the problem comes in when distributing bandwidth. I might be wrong on this, but doesn't HAProxy forward the request to a server, which processes it and then sends the response back through HAProxy to the client? If so, all traffic goes back out through the load balancer, so it uses as much bandwidth as all the servers combined.
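
To put rough numbers on that concern, here is a minimal sketch of the arithmetic. It assumes every machine has the same 100 Mbps link mentioned above and that a proxied setup relays every response through the balancer. The function names and the sample figures are made up for illustration.

```python
# Assumption: every machine has the same 100 Mbps NIC as in the question.
LINK_CAPACITY_MBPS = 100


def proxy_egress_mbps(backend_egress_mbps):
    """Traffic the proxy must send to clients when it relays all responses.

    Each backend's response bytes leave the backend once and then leave
    the proxy once more, so the proxy's egress is the sum of all backends.
    """
    return sum(backend_egress_mbps)


def max_total_throughput_mbps(n_backends, proxied):
    """Upper bound on traffic delivered to clients, in Mbps."""
    if proxied:
        # Everything funnels through the proxy's single link.
        return LINK_CAPACITY_MBPS
    # Each server answers clients directly over its own link.
    return n_backends * LINK_CAPACITY_MBPS


# Three backends each trying to push 60 Mbps would need 180 Mbps at the proxy.
needed_at_proxy = proxy_egress_mbps([60, 60, 60])
```

Under these assumptions, adding backends behind a relaying proxy does not raise the ceiling above the proxy's own link, which is why the options below avoid sending responses back through one machine.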

DNS Round Robin:

  • This might be my best option. Just replicate the website across multiple servers and run the same setup I have now on each of them. The downside is that if one server goes down, clients are still sent to it. It also means keeping the whole site in sync across the servers, when I was kind of hoping to have one main server that handles everything except static files, plus a couple of static file servers. I also read that this is sort of the 'poor man's load balancing', and it would be nice to have something a little more sophisticated.
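
A small simulation can illustrate both the even spread and the failure downside. This is only a sketch: it assumes an idealized DNS server that hands out one address per query in strict rotation, with no caching by resolvers, and the addresses (from the 192.0.2.0/24 documentation range) are placeholders.

```python
from collections import Counter
from itertools import cycle


def round_robin_dns(addresses, n_queries):
    """Return the address given to each of n_queries lookups, in rotation."""
    rotation = cycle(addresses)
    return [next(rotation) for _ in range(n_queries)]


# Hypothetical server addresses, not real hosts.
servers = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]

answers = round_robin_dns(servers, 9000)
share = Counter(answers)

# Plain round robin has no health check: if this server is down,
# the lookups that were handed its address still go nowhere.
down = "192.0.2.11"
failed_lookups = share[down]
```

In this idealized model each server gets exactly a third of the lookups, and a third of the clients keep being pointed at the dead server until its record is removed and cached answers expire. Real traffic will be less even, since resolvers cache answers and clients differ in how much they download.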

Direct Server Return:

  • It seems really complicated, but might be a good option. Would I still be able to send certain URLs to certain servers? Right now with HAProxy, every URL that ends in the right file extension is sent to Lighttpd, while other extensions are sent to Apache, so I would need something similar. For example, all PHP requests would be handled by the same server that's running the balancing software, while all JPG requests would be sent out to multiple servers.
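
The routing rule described here can be sketched as a small function. This is not HAProxy configuration, just an illustration of the decision any replacement would have to make; the extension list and the "static" and "dynamic" labels are assumptions.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of extensions treated as static files.
STATIC_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".css", ".js"}


def choose_backend(url):
    """Pick a backend pool from the file extension of the URL path."""
    # Look only at the path, so a query string such as ?f=x.jpg is ignored.
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return "static" if extension in STATIC_EXTENSIONS else "dynamic"


image_pool = choose_backend("http://example.com/image.jpg")
page_pool = choose_backend("http://example.com/index.php?f=x.jpg")
```

Note that this decision needs the HTTP request path, so whatever does the balancing has to see the request itself and not just the connection.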

Ideally, if HAProxy supported Direct Server Return, then my problem would be solved. I also do not want to use a CDN, because they're really expensive, and this is just a side project after all.

Do you understand my problem? Let me know if I didn't explain something right or if you need more info.

Best Answer

Draw a picture of your request/response cycle for the application and isolate the bottleneck. You are correct that a single proxy distributing load to many application servers will require the aggregate bandwidth of all application servers. The classical solution is round-robin (RR) DNS. Google, Yahoo and Amazon all use this technique with a short TTL. I did some investigation a while back and documented my findings.

Another option is a fancy-pants enterprise load balancer that uses virtual IP addressing to balance requests among multiple application servers with real IP addresses. I have worked with Netscaler and Stonesoft products. Both perform well but have terrible idiosyncrasies and are quite complex.