Highly Scalable Web Site – Best Design Practices

Tags: architecture, design, scalability, websites

For websites that need to be highly scalable, such as social networks like Facebook, what's the best way to design the website?

  1. Should I have a web service that the site queries to get the data it needs?

    or

  2. Should the site query databases directly? (This can be done using built-in language constructs to fill tables automatically, etc.)

I would think the web service is the better design, since it gives centralized data access and makes things like caching much easier to control, but what do others think?

Best Answer

Wow, this is a simple question with a huge array of possible answers. The more explicit part of your question asks whether it is more scalable to interface with your database directly or through a web service. That answer is simple: query the database directly. Going through a web service adds a whole bunch of latency that is completely unnecessary for code operating behind a firewall (by and large). A web service, for example, requires some component to receive a request, deserialize it, query the DB, serialize a response, and return it. So if your code is all operating behind a firewall, save yourself the trouble and just query the DB directly.

Making a web site scalable, however, goes way beyond the question you initially posed. So forgive me if I go off on a tangent here, but I thought it might be useful considering that you mentioned Facebook in particular.

I would recommend you read up on the work and tools built by Brad Fitzpatrick (founder of LiveJournal and now at Google). When I worked with him at Six Apart, here are some of the things I learned from him about LiveJournal's architecture and what made it so scalable.

  1. Use narrow database tables as opposed to wide ones. What was fascinating about this was learning what motivated the architecture: creating a system that could be upgraded easily and quickly. If you use wide tables, i.e. tables in which each field or property is a column, then when it comes time to upgrade the database schema, for example by adding a new column, the system will need to lock the table while the schema change is implemented. When operating at scale, this means a simple schema change can result in a large database outage. Which sucks, obviously. A narrow table, on the other hand, simply stores each individual property associated with an object as a single row in the database. Therefore, when you want to add a new property, all you need to do is INSERT records into a table, which is a non-locking operation. OK, that is a little background; let's see how this model actually translates in a working system like LiveJournal.

    Let's say you want to load the last 10 journal entries on a person's blog, and let's say each journal entry has ten properties. In a classic wide table layout, each property would correlate to a column on a table. You would then query the table once to fetch all of the data you need: the query would return 10 rows, and each row would hold all the data for one entry (e.g. SELECT * FROM entries ORDER BY date LIMIT 10). In a narrow table layout, however, things are a bit different. In this example there are actually two tables: the first table (table A) stores the simple criteria one would want to search by, e.g. the ID of the entry, the ID of the author, the date of the entry, etc. A second table (table B) then stores all of the properties associated with an entry. This second table has three columns: entry_id, key and value. For every row in table A, there would be 10 rows in table B (one row for each property). Therefore, in order to fetch and display the last ten entries, you would need 11 queries: the first query gives you the list of entry IDs, and the next ten queries fetch the properties associated with each of the entries returned by the first query. (A small schema and query sketch appears after this list.)

    "Holy moly!" you say, "how on Earth can that be more scalable?!" Its totally counter-intuitive right? In the first scenario we just had one database query, but in the second "more scalable" solution we have 11 database queries. That makes no sense. The answer to that question relies entirely upon the next bullet.

  2. Use memcache liberally. In case you were not aware, memcache is a distributed, stateless, low-latency, network-based caching system. It is used by Facebook, Google, Yahoo, and just about every other popular, scalable web site on the planet. It was invented by Brad Fitzpatrick in part to help offset the database overhead inherent in a narrow table design. Let's take a look at the same example discussed in #1 above, but this time let's introduce memcache.

    Let's begin when a user first visits a page and nothing is in the cache. You begin by querying table A, which returns the IDs of the 10 entries you want to display on the page. For each of those entries you then query the database to retrieve its properties, and use those properties to constitute an object that your code can interface with. You then stash that object (or a serialized form of it) in memcache.

    The second time someone loads the same page, you begin the same way: by querying table A for the list of entry IDs you will display. For each entry you first go to memcache and ask, "do you have entry #X in the cache?" If yes, then memcache returns the entry object to you. If not, then you query the database again to fetch its properties, constitute the object, and stash it in memcache. Most of the time, the second time someone visits the same page there is only one database query; all other data is pulled straight from memcache. (A minimal read-through sketch of this pattern follows this list.)

    In practice, what ended up happening at LiveJournal is that most of the system's data, especially the less volatile data, was cached in memcache, and the extra database queries needed to support the narrow table schema were all but completely offset.

    This design also made it much, much easier to solve the problem of assembling the posts from all of your friends into a single stream, or "wall".

  3. Next, consider partitioning your database. The model discussed above surfaces yet another problem: your narrow tables will tend to be very large/long, and the more rows those tables have, the harder other administrative tasks become. To offset this, it might make sense to manage the size of your tables by partitioning them in some way, so that one cluster of users is served by one database and another cluster of users is served by a separate database. This distributes load across databases and keeps queries efficient. (A simple shard-routing sketch follows this list.)

  4. Finally, you need awesome indexes. The speed of your queries will depend largely upon how well indexed your database's tables are. I won't spend too much time discussing what an index is, except to say that it is a lot like a giant card catalog that makes finding needles in a haystack more efficient. If you use MySQL, then I recommend turning on the slow query log to monitor for queries that take a long time to fulfill. When a query pops up on your radar (e.g. because it is slow), figure out what index you need to add to the table to speed it up. (A small index example follows this list.)
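
To make the wide-versus-narrow contrast in point 1 concrete, here is a minimal sketch of the two-table layout. It uses Python's sqlite3 purely as a stand-in for a real database server, and the table and column names (entries, entry_props) are illustrative assumptions, not LiveJournal's actual schema.

```python
import sqlite3

# Illustrative schema only -- names are assumptions, and sqlite3 stands in
# for a real database server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table A: narrow "search" table holding only the fields you filter/sort by.
    CREATE TABLE entries (
        entry_id  INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL,
        created   TEXT    NOT NULL
    );
    -- Table B: one row per (entry, property). Adding a new property later
    -- means INSERTing rows here, not ALTERing and locking a wide table.
    CREATE TABLE entry_props (
        entry_id INTEGER NOT NULL,
        key      TEXT    NOT NULL,
        value    TEXT,
        PRIMARY KEY (entry_id, key)
    );
""")

def last_entries(author_id, limit=10):
    """Fetch the latest entries: 1 query against table A, then one
    property query per entry (the 1 + 10 pattern described above)."""
    ids = [row[0] for row in conn.execute(
        "SELECT entry_id FROM entries WHERE author_id = ? "
        "ORDER BY created DESC LIMIT ?", (author_id, limit))]
    entries = []
    for entry_id in ids:
        props = dict(conn.execute(
            "SELECT key, value FROM entry_props WHERE entry_id = ?",
            (entry_id,)))
        entries.append({"entry_id": entry_id, **props})
    return entries
```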
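
Continuing that sketch, here is a minimal read-through cache in the spirit of point 2. A plain dict stands in for a memcache client (a real client such as pymemcache exposes comparable get/set calls over the network), the function names are hypothetical, and the code reuses conn, entries, and entry_props from the previous sketch.

```python
import json

# Stand-in for a memcache client; in production this would be a networked
# memcached cluster shared by all web servers.
cache = {}

def get_entry(entry_id):
    """Read-through cache: try the cache first, fall back to the DB,
    then populate the cache so the next page view skips the DB."""
    key = "entry:%d" % entry_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: no DB query at all

    # Cache miss: rebuild the object from the narrow property table,
    # then stash a serialized form of it.
    props = dict(conn.execute(
        "SELECT key, value FROM entry_props WHERE entry_id = ?", (entry_id,)))
    entry = {"entry_id": entry_id, **props}
    cache[key] = json.dumps(entry)
    return entry

def render_page(author_id):
    """One query for the ID list, then the cache (or the DB, on a miss)
    for each entry's properties."""
    ids = [row[0] for row in conn.execute(
        "SELECT entry_id FROM entries WHERE author_id = ? "
        "ORDER BY created DESC LIMIT 10", (author_id,))]
    return [get_entry(entry_id) for entry_id in ids]
```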
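
For point 3, the core idea is simply that some deterministic rule maps a user to the database cluster holding that user's rows. This is a toy sketch: the cluster hostnames are invented, and a modulo rule is the simplest thing to show; a lookup table mapping each user to a cluster is a common alternative because it makes moving users between clusters easier.

```python
# Hypothetical cluster list -- the hostnames are made up for illustration.
SHARD_DSNS = [
    "db-cluster-0.internal",
    "db-cluster-1.internal",
    "db-cluster-2.internal",
]

def shard_for(user_id):
    """Map a user to one cluster; every query for that user's data then
    goes to the same, smaller database."""
    return SHARD_DSNS[user_id % len(SHARD_DSNS)]
```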
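
And for point 4, here is a tiny illustration of what an index buys you, again using sqlite3 as a stand-in. EXPLAIN QUERY PLAN plays the role that EXPLAIN plays in MySQL once the slow query log has flagged a query; the index name is made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (entry_id INTEGER, author_id INTEGER, created TEXT)")

query = ("SELECT entry_id FROM entries WHERE author_id = ? "
         "ORDER BY created DESC LIMIT 10")

# Without an index the planner has to scan the whole table.
print(db.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())

# A composite index on (author_id, created) turns the scan into an index
# search and lets the ORDER BY read the index instead of sorting.
db.execute("CREATE INDEX idx_entries_author_created ON entries (author_id, created)")
print(db.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
```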

"Thank you for all of this great background, but holy crud, that is a lot of code I will have to write."

Not necessarily. Many libraries have been written that make interfacing with memcache really easy. Still other libraries have codified the entire process described above; Data::ObjectDriver in Perl is just such a library. As for other languages, you will need to do your own research.

I hope you found this answer helpful. What I have found, more often than not, is that the scalability of a system comes down less and less to code and more and more to a sound data storage and management strategy and technical design.
