I maintain two datacenters, and as more of our important infrastructure starts to get controlled via puppet, it is important that the puppet master work at the second site should our primary site fail.
Even better would be to have a sort of active / active setup so the servers at the second site are not polling over the WAN.
Are there any standard methods of multi-site puppet high availability?
Best Answer
Puppet actually lends itself pretty well to multi-master environments, with caveats. The main one? Lots of parts of Puppet like to be centralized. The certificate authority, the inventory and dashboard/report services, filebucketing and stored configs - all of them are at their best in (or simply require) a setup where there's just one place for them to talk to.
It's quite workable, though, to get a lot of those moving parts working in a multi-master environment, if you're ok with the graceful loss of some of the functionality when you've lost your primary site.
Let's start with the base functionality to get a node reporting to a master:
Modules and Manifests
This part's simple. Version control them. If it's a distributed version control system, then just centralize and sync, and alter your push/pull flow as needed in the failover site. If it's Subversion, then you'll probably want to svnsync the repo to your failover site.
Certificate Authority
One option here is to simply sync the certificate authority files between the masters, so that all share the same root cert and are capable of signing certificates. This has always struck me as "doing it wrong".
I can't honestly say that I've done thorough testing of this option, since it seems horrible. However, it seems that Puppet Labs are not looking to encourage this option, per the note here.
So, what that leaves is to have a central CA master. All trust relationships remain working when the CA is down since all clients and other masters cache the CA certificate and the CRL (though they don't refresh the CRL as often as they should), but you'll be unable to sign new certificates until you get the primary site back up or restore the CA master from backups at the failover site.
You'll pick one master to act as CA, and have all other masters disable it:
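A minimal sketch of that setting, assuming the masters read it from the [master] section of puppet.conf (the section placement is my assumption; [main] should be picked up as well):

```ini
# puppet.conf on every master that is NOT the CA
[master]
    # Stop this master from acting as a certificate authority
    ca = false
```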
Then, you'll want that central system to get all of the certificate-related traffic. There are a few options for this:
- Use the SRV record support in 3.0 to point all agent nodes to the right place for the CA: _x-puppet-ca._tcp.example.com
- Set the ca_server config option in the puppet.conf of all agents
- Proxy all traffic for CA-related requests from agents on to the correct master. For instance, if you're running all your masters in Apache via Passenger, then configure this on the non-CAs:
And, that should do it.
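For the SRV and ca_server options, the agent side might look something like the sketch below. The hostname puppetca.example.com is a placeholder, and use_srv_records / srv_domain are the Puppet 3.0 setting names as I understand them, so check them against the configuration reference for your version. Pick one approach, not both.

```ini
# puppet.conf on an agent

# Option 1 (Puppet 3.0+): find the CA via the _x-puppet-ca._tcp SRV record
[main]
    use_srv_records = true
    srv_domain = example.com

# Option 2: point certificate traffic at the CA master explicitly
# [main]
#     ca_server = puppetca.example.com
```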
Before we move on to the ancillary services, a side note;
DNS Names for Master Certificates
I think this right here is the most compelling reason to move to 3.0. Say you want to point a node at "any ol' working master".
Under 2.7, you'd need a generic DNS name like puppet.example.com, and all of the masters need this in their certificate. That means setting dns_alt_names in their config, re-issuing the cert that they had before they were configured as a master, re-issuing the cert again when you need to add a new DNS name to the list (like if you wanted multiple DNS names to have agents prefer masters in their site).. ugly.
With 3.0, you can use SRV records. Give all your clients the SRV settings (use_srv_records and srv_domain) in their puppet.conf; then, no special certs needed for the masters - just add a new record to your SRV RR at _x-puppet._tcp.example.com and you're set, it's a live master in the group. Better yet, you can easily make the master selection logic more sophisticated; "any ol' working master, but prefer the one in your site" by setting up different sets of SRV records for different sites; no dns_alt_names needed.
Reports / Dashboard
This one works out best centralized, but if you can live without it when your primary site's down, then no problem. Just configure all of your masters with the correct place to put the reports..
..and you're all set. Failure to upload a report is non-fatal for the configuration run; it'll just be lost if the dashboard server's toast.
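A sketch of what that master-side setting might look like, assuming the dashboard accepts reports over HTTP. The hostname dashboard.example.com and port 3000 are placeholders for wherever your dashboard actually lives:

```ini
# puppet.conf on every master
[master]
    # Send each run's report to the central dashboard over HTTP
    reports = http
    reporturl = http://dashboard.example.com:3000/reports/upload
```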
Fact Inventory
Another nice thing to have glued into your dashboard is the inventory service. With the facts_terminus set to rest as recommended in the documentation, this'll actually break configuration runs when the central inventory service is down. The trick here is to use the inventory_service terminus on the non-central masters, which allows for graceful failure..
Have your central inventory server set to store the inventory data through either ActiveRecord or PuppetDB, and it should keep up to date whenever the service is available.
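For the non-central masters, that might look like the sketch below. The hostname inventory.example.com is a placeholder, and 8140 is only the usual master port, so adjust both to your setup:

```ini
# puppet.conf on every master that is NOT the central inventory server
[master]
    # Forward facts to the central inventory service; degrades gracefully if it is down
    facts_terminus = inventory_service
    inventory_server = inventory.example.com
    inventory_port = 8140
```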
So - if you're ok with being down to a pretty barebones config management environment where you can't even use the CA to sign a new node's cert until it's restored, then this can work just fine - though it'd be really nice if some of these components were a bit more friendly to being distributed.