PostgreSQL – repmgr – after a failover, both nodes act as master

database-replication, failover, postgresql, replication

I have a two node PostgreSQL cluster configured through repmgr.
The database topology looks like this:

db1 - 10.10.10.50 ( master )
db2 - 10.10.10.60 ( standby )
wit - 10.10.10.70 ( witness )

The creation of the cluster (as well as replication and automatic failover) works as expected, but I'm running into the following problem.

Let's say the db1 node in my cluster goes down; the expected behaviour is that the db2 node gets promoted to the new master. That works as intended, and the logs confirm it:

[WARNING] connection to upstream has been lost, trying to recover... 60 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 50 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 40 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 30 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 20 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 10 seconds before failover decision
[ERROR] unable to reconnect to upstream after 60 seconds...
[ERROR] connection to database failed: could not connect to server: No route to host
        Is the server running on host "10.10.10.50" and accepting
        TCP/IP connections on port 5432?

[ERROR] connection to database failed: could not connect to server: No route to host
        Is the server running on host "10.10.10.50" and accepting
        TCP/IP connections on port 5432?

[NOTICE] promoting standby
[NOTICE] promoting server using '/usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main promote'
[NOTICE] STANDBY PROMOTE successful.  You should REINDEX any hash indexes you have.

The db2 node is now promoted to the new master, and all is well until the db1 node comes back up.

In that scenario I would expect db1 to become the new standby, but that is not the case: I end up with both nodes acting as master?!

So my question is: after a failover, how can I prevent both nodes from acting as master? The docs say to include a third node as a witness, which I have, but the desired effect is not there.

Here is my repmgr.conf file (the one from db1):

cluster=test_cluster
node=1
node_name=db1
conninfo='host=10.10.10.50 dbname=repmgr user=repmgr'
master_response_timeout=60
reconnect_attempts=6
reconnect_interval=10
failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr/repmgr.conf'
follow_command='repmgr standby follow -f /etc/repmgr/repmgr.conf'
pg_bindir=/usr/lib/postgresql/9.3/bin

And the cluster state after the db1 node comes back up:

repmgr -f /etc/repmgr/repmgr.conf cluster show
Role      | Connection String
* master  | host=10.10.10.50 dbname=repmgr user=repmgr
* master  | host=10.10.10.60 dbname=repmgr user=repmgr
  witness | host=10.10.10.70 dbname=repmgr user=repmgr port=5499

Thanks a bunch,
Best regards

Best Answer

I looked into automatic failover with repmgr a few months back. It seems repmgr is working as expected here.

IIRC, repmgr does not bring an old master back up as a new standby on its own; you would need to re-clone it from the new master with a --force standby clone and register it again. You can also set other standby nodes to follow the new master should failover occur (repmgr standby follow). A rough sketch of the re-clone procedure is below.
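
This is only a minimal sketch of re-attaching db1 as a standby of the new master db2, using the paths and addresses from the question; the exact flag syntax differs between repmgr versions, so check the documentation for your release before running it:

# On db1, once it is reachable again -- make sure PostgreSQL is stopped first
service postgresql stop

# Re-clone db1 from the new master (db2); --force overwrites the stale data directory
repmgr -f /etc/repmgr/repmgr.conf -D /var/lib/postgresql/9.3/main \
       -d repmgr -U repmgr --force standby clone 10.10.10.60

# Start PostgreSQL again -- db1 should now stream from db2
service postgresql start

# Re-register db1 as a standby so "cluster show" reflects the new topology
# (--force replaces the old registration; availability depends on repmgr version)
repmgr -f /etc/repmgr/repmgr.conf --force standby register

# Verify
repmgr -f /etc/repmgr/repmgr.conf cluster show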

  • Would you expect your master to recover unexpectedly?
  • How do you handle failover in your application?
  • Are you redirecting all database traffic to the new master? (A sketch of one way to do this follows this list.)
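
On that last point, one common pattern (just an illustration, not something mandated by repmgr) is to wrap the promotion in a script that also repoints client traffic, for example by rewriting a pgbouncer config. The script name, pgbouncer usage and paths below are assumptions, not taken from the question:

#!/bin/bash
# /usr/local/bin/promote_and_redirect.sh -- hypothetical wrapper for promote_command
set -e

# 1. Promote the local standby (same command the repmgr.conf above uses)
repmgr standby promote -f /etc/repmgr/repmgr.conf

# 2. Repoint clients at the new master, e.g. by rewriting a pgbouncer config
#    (pgbouncer and these paths/addresses are assumptions, not from the question)
sed -i 's/host=10.10.10.50/host=10.10.10.60/' /etc/pgbouncer/pgbouncer.ini

# 3. pgbouncer reloads its configuration on SIGHUP
pkill -HUP -x pgbouncer

With something like this in place, setting promote_command='/usr/local/bin/promote_and_redirect.sh' would both promote db2 and move application traffic away from the failed db1.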