I think you'll have to abandon the POSIX requirement, very few systems implement that - in fact even NFS doesn't really (think locks etc) and that has no redundancy.
Any system which uses synchronous replication is going to be glacially slow; any system which has asynchronous replication (or "eventual consistency") is going to violate POSIX rules and not behave like a "conventional" filesystem.
Robert, you're clearly a smart guy, but respectfully, get a consultant with prior domain knowledge, or begin building something small now and see where it takes you. There is no way to answer your post; it has too many abstract concepts and no hard numbers.
A few thoughts:
will serve several thousand users at
first ... grow to support hundreds of
thousands to millions of users
Prove that you need that level of scale first. Don't build a scale-out architecture in anticipation of users that never show up. Sorry if I sound harsh, but 99% of all websites don't grow to the large end of the scale. See Stack Overflow / Server Fault; they're serving a million users monthly from a handful of fairly conventional servers.
should I get a hardware load balancer
solution from one of the vendors, or
build one myself with open source
solution
Depends on your skills and your situation regarding time vs money. Once built, the open source and commercial offerings work pretty much exactly the same. Commercial solutions tend to have better statistics and nicer management interfaces out of the box.
For the web server hardware, should I
use one-u single socket server or a
blade solution?
Ask your server vendor for prices. Ask your datacenter about power density, i.e. their preferred balance between size and power consumption -- often you'll be power limited, so a dense solution like blades may not win you anything.
For the storage, should I use a SAN or
storage server like Sun unified
storage 7000 will be sufficient.
Get SAN when you have a proven need for SAN; then you will also better understand what needs your SAN should solve for you.
Since this website will likely be more
heavy on read operations, what
consideration should be made for the
mysql cluster and storage setup?
Create a really good caching solution. Either full page caching like Squid (Varnish), or application data caching like Memcached, or a combination of both. Consider cache invalidation, could you need to quickly purge content from your caches to avoid it being served again?
What is the best way to back up up a
mysql cluster?
Opinions vary, but one common approach is to have a dedicated slave MySQL just for backups, and use something like InnoBackup or Maatkit for a self-scripted backup solution.
Edit: If you're really going to build this from scratch now, then please take a good look at cloud computing before committing. Cloud computing isn't just about scalability, even if scalability is a great strenght. Certain services that come as part of the package can really help in making day to day operations easier. Some examples:
- Live snapshots of Amazon EBS volumes make for easy database backups.
- Amazon has load balancing as a set and forget service (of course more feature limited than good self-hosted load balancer, but easy to get started with).
- Rightscale has extensive server monitoring built into their images, which makes for easy capacity planning / application introspection.
Best Answer
Facebook is using cassandra for data storage.
Here is article about scaling youtube and google architecture and prestentation: Designs, Lessons and Advice from Building Large Distributed Systems by Jeff Dean of Google describing how they do their thing.