PostgreSQL – repmgr – after a failover, both nodes act as master

database-replication, failover, postgresql, replication

I have a two node PostgreSQL cluster configured through repmgr.
The database topology looks like this:

db1 - 10.10.10.50 ( master )
db2 - 10.10.10.60 ( standby )
wit - 10.10.10.70 ( witness )

The creation of the cluster (as well as replication and automatic failover) works as expected, but I'm running into the following problem.

Let's say the db1 node in my cluster goes down; the expected behaviour is that the db2 node gets promoted to the new master. That works as intended, and the logs confirm it:

[WARNING] connection to upstream has been lost, trying to recover... 60 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 50 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 40 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 30 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 20 seconds before failover decision
[WARNING] connection to upstream has been lost, trying to recover... 10 seconds before failover decision
[ERROR] unable to reconnect to upstream after 60 seconds...
[ERROR] connection to database failed: could not connect to server: No route to host
        Is the server running on host "10.10.10.50" and accepting
        TCP/IP connections on port 5432?

[ERROR] connection to database failed: could not connect to server: No route to host
        Is the server running on host "10.10.10.50" and accepting
        TCP/IP connections on port 5432?

[NOTICE] promoting standby
[NOTICE] promoting server using '/usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main promote'
[NOTICE] STANDBY PROMOTE successful.  You should REINDEX any hash indexes you have.

The db2 node is now promoted to the new master, and all is well until the db1 node comes back up.

In that scenario I would expect db1 to become the new standby, but that is not the case: I end up with both nodes acting as master?!

So my question is: after a failover, how can I prevent both nodes from acting as master? The docs say to include a third node as a witness, which I have, but the desired effect is not there.

Here is my repmgr.conf file (the one from db1):

cluster=test_cluster
node=1
node_name=db1
conninfo='host=10.10.10.50 dbname=repmgr user=repmgr'
master_response_timeout=60
reconnect_attempts=6
reconnect_interval=10
failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr/repmgr.conf'
follow_command='repmgr standby follow -f /etc/repmgr/repmgr.conf'
pg_bindir=/usr/lib/postgresql/9.3/bin

And the cluster state after the db1 node comes back up:

repmgr -f /etc/repmgr/repmgr.conf cluster show
Role      | Connection String
* master  | host=10.10.10.50 dbname=repmgr user=repmgr
* master  | host=10.10.10.60 dbname=repmgr user=repmgr
  witness | host=10.10.10.70 dbname=repmgr user=repmgr port=5499

Thanks a bunch,
Best regards

Best Answer

I looked into automatic failover with repmgr a few months back. It seems repmgr is working as expected here.

IIRC, repmgr does not bring an old master back up as a new standby on its own; you would need to re-clone it from the new master with a --force standby clone and register it again. You can also set other standby nodes to follow the new master should failover occur (repmgr standby follow). A rough sketch of the re-clone procedure is below.
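
This is only a minimal sketch of re-attaching db1 as a standby of the new master db2, using the paths and addresses from the question; the exact flag syntax differs between repmgr versions, so check the documentation for your release before running it:

# On db1, once it is reachable again -- make sure PostgreSQL is stopped first
service postgresql stop

# Re-clone db1 from the new master (db2); --force overwrites the stale data directory
repmgr -f /etc/repmgr/repmgr.conf -D /var/lib/postgresql/9.3/main \
       -d repmgr -U repmgr --force standby clone 10.10.10.60

# Start PostgreSQL again -- db1 should now stream from db2
service postgresql start

# Re-register db1 as a standby so "cluster show" reflects the new topology
# (--force replaces the old registration; availability depends on repmgr version)
repmgr -f /etc/repmgr/repmgr.conf --force standby register

# Verify
repmgr -f /etc/repmgr/repmgr.conf cluster show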

  • Would you expect your master to recover unexpectedly?
  • How do you handle failover in your application?
  • Are you redirecting all database traffic to the new master? (A sketch of one way to do this follows this list.)
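
On that last point, one common pattern (just an illustration, not something mandated by repmgr) is to wrap the promotion in a script that also repoints client traffic, for example by rewriting a pgbouncer config. The script name, pgbouncer usage and paths below are assumptions, not taken from the question:

#!/bin/bash
# /usr/local/bin/promote_and_redirect.sh -- hypothetical wrapper for promote_command
set -e

# 1. Promote the local standby (same command the repmgr.conf above uses)
repmgr standby promote -f /etc/repmgr/repmgr.conf

# 2. Repoint clients at the new master, e.g. by rewriting a pgbouncer config
#    (pgbouncer and these paths/addresses are assumptions, not from the question)
sed -i 's/host=10.10.10.50/host=10.10.10.60/' /etc/pgbouncer/pgbouncer.ini

# 3. pgbouncer reloads its configuration on SIGHUP
pkill -HUP -x pgbouncer

With something like this in place, setting promote_command='/usr/local/bin/promote_and_redirect.sh' would both promote db2 and move application traffic away from the failed db1.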