Scalability – How Amazon Avoids Database Layer Bottlenecks

distributed-computing, scalability, web-applications

If you imagine a company like Amazon (or any other large e-commerce web application) operating an online store at massive scale, with only a limited quantity of each physical item in its warehouses, how can the system be designed so that there is no single bottleneck? They must, of course, have a number of databases with replication, and many servers handling the load independently. But if two users are served by separate servers and both try to add the same item to their cart, and only one unit remains, there must be some "source of truth" for that item's remaining quantity. Wouldn't this mean that, at the very least, all users accessing product info for a single item must query the same database serially?

I would like to understand how you can operate a store that large using distributed computing without creating a huge bottleneck on a single database containing inventory information.

Best Answer

But if two users are served by separate servers and both try to add the same item to their cart, and only one unit remains, there must be some "source of truth" for that item's remaining quantity.

Not really. This is not a problem that requires a 100% perfect technical solution, because both error cases have a business solution that is not very expensive:

  • If you incorrectly tell a user an item is sold out, you lose a sale. If you sell millions of items every day and this happens maybe once or twice a day, it gets lost in the noise.
  • If you accept an order and while processing it find that you've run out of the item, you just tell the customer so and give them the choice of waiting until you can restock, or cancelling the order. You have one slightly annoyed customer. Again not a huge problem when 99.99% of orders work fine.
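The two bullets above can be sketched in code. This is a purely illustrative toy (all names are hypothetical, not Amazon's actual architecture): orders are accepted against a possibly-stale replicated stock count, and the authoritative check happens only once, at fulfillment, where an oversell turns into an apology rather than a failure.

```python
# Hypothetical sketch: accept orders optimistically against a stale
# replica; consult the source of truth only at fulfillment time.

class Warehouse:
    """Source of truth for stock, touched only when orders are fulfilled."""
    def __init__(self, stock):
        self.stock = stock

    def try_reserve(self, qty):
        if self.stock >= qty:
            self.stock -= qty
            return True
        return False

def accept_order(replica_stock, qty):
    # Fast path: check a possibly-stale replica. Reads never hit the
    # primary, so they scale horizontally across many servers.
    return replica_stock >= qty

def fulfill(warehouse, orders):
    results = []
    for qty in orders:
        if warehouse.try_reserve(qty):
            results.append("shipped")
        else:
            # Business fallback: apologize and offer restock-or-cancel.
            results.append("apologize")
    return results

# Two customers both saw "1 in stock" on their replicas and both ordered.
warehouse = Warehouse(stock=1)
orders = [1, 1]
assert all(accept_order(replica_stock=1, qty=q) for q in orders)
print(fulfill(warehouse, orders))  # ['shipped', 'apologize']
```

The point of the sketch is that the conflict is resolved cheaply and late: one customer gets an apology, and no per-item serialization is needed on the read path.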

In fact, I recently experienced the second case myself, so it's not hypothetical: that is what happens and how Amazon handles it.

It's a concept that applies often when you have a problem that is theoretically very hard to solve (whether in terms of performance, optimization, or something else): you can often live with a solution that works really well for most cases and accept that it occasionally fails, as long as you can detect and handle the failures when they occur.
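One common mechanical form of "detect and handle the failures" is optimistic concurrency control: assume no conflict, detect one via a version check, and retry or fall back. The sketch below (a generic illustration, not any specific database's API) shows a compare-and-set loop where the common case succeeds on the first attempt and the rare conflict is handled explicitly.

```python
# Illustrative compare-and-set loop: conflicts are detected after the
# fact and retried, rather than prevented with serialized access.

import threading

class VersionedCounter:
    """A value guarded by a version number, as in optimistic locking."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self.value, self.version

    def compare_and_set(self, expected_version, new_value):
        # Succeeds only if nobody else updated the value in the meantime.
        with self._lock:
            if self.version != expected_version:
                return False
            self.value = new_value
            self.version += 1
            return True

def decrement_with_retry(counter, attempts=5):
    for _ in range(attempts):
        value, version = counter.read()
        if value <= 0:
            return False  # sold out: a detected, handled failure
        if counter.compare_and_set(version, value - 1):
            return True
        # Someone raced us; re-read and try again.
    return False  # conflict persisted: fall back to the business path

stock = VersionedCounter(1)
print(decrement_with_retry(stock))  # True
print(decrement_with_retry(stock))  # False
```

Because conflicts on any single item are rare at Amazon's scale, almost every call takes the no-retry fast path, and the slow path is cheap enough to absorb.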
