I would like to know how these companies deploy new versions of their sites without downtime. I am aware of the BlueGreenDeployment model; what I want to know is what these sites actually do in practice to avoid or minimize downtime.
How do companies like Google and eBay deploy new versions of their sites without downtime?
deployment
Related Solutions
There's no simple answer to this question.
Using an architecture designed around images (commonly referred to as "immutable infrastructure") works fantastically for stateless services, like your application servers.
It's most definitely possible to extend that to your stateful services with the right tools, failover systems and upgrade paths, but those are usually overkill for simple systems (like the one you describe).
One thing to keep in mind when using these tools is that you don't have to go "all in". Packer and Terraform are very much designed to work only where you want them; they deliberately don't enforce a pattern across all of your systems.
Practically speaking, the best way to handle this problem is to either maintain your database servers differently, outside of Packer (building the initial image, yes! But not necessarily upgrading them in the same way as the stateless web servers), or outsource managing the state to someone else. Notable options include Heroku Postgres or AWS RDS.
To round it out – yes, it's possible, but with our current tooling it's probably more trouble than it's worth at smaller scale or with simple architectures.
Packer and Terraform can still be a huge boon in other aspects of the very same infrastructure – Terraform, for example, could provision a Heroku database for use by your DigitalOcean application servers in a very straightforward way. Packer can handle upgrading and releasing your application server images, and likewise for development.
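To make the image-based approach concrete, here is a minimal sketch of the swap step, assuming an AWS setup driven with boto3. The AMI ID, target group ARN, and instance type are placeholders, not values from the question; the drain wait is deliberately crude.

```python
# Hypothetical sketch of an image-based ("immutable") rollout on AWS via boto3.
# All identifiers below (AMI ID, target group ARN, instance type) are placeholders.
import time
import boto3

ec2 = boto3.client("ec2")
elb = boto3.client("elbv2")

NEW_AMI = "ami-0123456789abcdef0"  # image baked by Packer (placeholder)
TARGET_GROUP = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web/abc123"  # placeholder


def launch_from_image(ami_id, count):
    """Start fresh application servers from the pre-baked image."""
    resp = ec2.run_instances(
        ImageId=ami_id,
        InstanceType="t3.small",
        MinCount=count,
        MaxCount=count,
    )
    return [i["InstanceId"] for i in resp["Instances"]]


def swap(old_ids, new_ids):
    """Register the new instances, then drain and retire the old ones."""
    elb.register_targets(
        TargetGroupArn=TARGET_GROUP,
        Targets=[{"Id": i} for i in new_ids],
    )
    elb.deregister_targets(  # deregistration triggers connection draining
        TargetGroupArn=TARGET_GROUP,
        Targets=[{"Id": i} for i in old_ids],
    )
    time.sleep(300)  # crude stand-in for polling until draining completes
    ec2.terminate_instances(InstanceIds=old_ids)
```

Note that the old servers are never upgraded in place: a bad release is rolled back by launching from the previous image and swapping again, which is the core appeal of the pattern.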
I do not think it is possible to go completely zero-downtime with a single server instance. What you are looking for is blue-green deployment.
Basically you need a web server or load balancer in front of your server pool. When you decide to roll out the new version, you pick a subset of your servers and drain them: they stop accepting new connections and finish any pending requests (usually done at the web server/load balancer by disabling request forwarding to those servers). Once drained, you deploy the new version to these idle instances, test it, and if everything is OK you enable them again so the load balancer can send user requests to the new version. Then you take the rest of the servers, which still run the old version of your app, and repeat the same procedure: drain, update, test, enable.
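As a rough illustration of that drain, update, test, enable cycle, here is a self-contained Python sketch. The Server and LoadBalancer classes are toy stand-ins for whatever your real fleet and balancer (nginx, HAProxy, a cloud LB) actually expose.

```python
# Toy simulation of a rolling deployment: drain -> update -> test -> enable,
# one batch at a time, so some servers always stay in rotation.

class Server:
    def __init__(self, name: str):
        self.name = name
        self.version = "v1"

    def wait_for_drain(self) -> None:
        pass  # real code: poll the active-connection count until it reaches zero

    def deploy(self, version: str) -> None:
        self.version = version  # real code: install the new build while idle

    def smoke_test(self) -> bool:
        return True  # real code: hit a health endpoint on the new build


class LoadBalancer:
    def __init__(self, servers):
        self.active = set(servers)

    def disable(self, server):
        self.active.discard(server)  # stop forwarding new requests here

    def enable(self, server):
        self.active.add(server)      # put the server back into rotation


def rolling_deploy(lb, servers, new_version, batch_size=2):
    """Upgrade the pool a batch at a time so capacity never drops to zero."""
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for s in batch:
            lb.disable(s)
        for s in batch:
            s.wait_for_drain()
            s.deploy(new_version)
            if not s.smoke_test():
                raise RuntimeError(f"{s.name} failed its smoke test; aborting")
        for s in batch:
            lb.enable(s)


pool = [Server(f"web{i}") for i in range(4)]
rolling_deploy(LoadBalancer(pool), pool, "v2")
```

The batch size is the knob: larger batches finish the rollout faster, smaller batches keep more capacity in reserve and limit the blast radius of a bad release.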
Best Answer
Google handles it in a few different ways. Because they run 'clusters' of machines in each data center, they can take an entire data center out of request rotation, roll out the changes, and bring it back online, upgrading data center by data center in a rolling fashion. They can also do this with the clusters within a single data center. Recently they upgraded the filesystems on their machines from ext3 to ext4 in place by rolling the change out one data center at a time.
Google also does staged rollouts, where some groups of users get the new experience while everyone else still sees the old one. Facebook does this as well.
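A common way to implement that kind of staged rollout is deterministic user bucketing. The following Python sketch is purely illustrative (the feature name and percentage are made up, and this is not a description of Google's or Facebook's actual systems).

```python
# Hypothetical staged-rollout check: hash the user ID into one of 100 buckets,
# and show the new experience only to users in buckets below the rollout percent.
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically assign user_id to a bucket in [0, 100); the same user
    always lands in the same bucket, so their experience is stable."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


# Start at 5% of users, then ramp the percentage up as confidence grows.
variant = "new" if in_rollout("user-42", "new-homepage", 5) else "current"
print(variant)
```

Because bucketing is a pure function of the user ID, ramping from 5% to 20% only adds users to the new experience; nobody who already saw it gets flipped back.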
eBay takes a large portion of their data center out of service through load balancer changes, upgrades it, migrates traffic over to it, and then decommissions and upgrades the other portion the same way. It has been said that they have enough redundancy to run their site on 1/3 of their available hardware. They may have more sophisticated methods now; this comes from a paper I read about 4 years ago.