Postgresql replicas stuck in recovery (archive_command won’t run on replicas)

postgresqlreplication

We're currently running a master -> slave, slave, slave, slave set up using postgresql 9.2.8 with streaming and WAL-E/S3 for handling wal segments.

Should all of the replicas currently be "in recovery"? Running SELECT pg_is_in_recovery(); on all of them returns true, which is concerning. We can run queries on them (assuming they don't take longer than 30 seconds).

I'm trying to spin up another replica off of one of the existing slaves using WAL-E but I'm currently unable to do so given every replica is in recovery mode. I can't run pg_basebackup or use wal-e's backup features on the replicas.

Amy I missing something glaringly obvious? The only thing I can think of is we had an issue roughly 2 months ago where our hard drive filled up on our master, and it shut off. We were able to boot it, clear disk space, and continue streaming/replication from the master.

If I were to simply start up 3 postgresql servers and configure them in a 3 server chain (master -> slave -> slave) using streaming/archiving it would work properly with WAL-E, as I have done this. Just for some reason I'm unable to get our existing production replicas to stream/archive to any other server. Specifically the archive_command is NEVER run on any of the replicas (because it's stuck in recovery mode).

Does anyone have any suggestions on how I can further debug/diagnose this? I'm trying to find a solution without significant downtime to our production database (as I could always just re-import the DB to a new server and start the chain again, but this would take 12+ hours).

Here is the configuration details: https://gist.github.com/Geesu/1a696262e46ba9f0a24c
As well as the local_backup_script.sh: https://gist.github.com/Geesu/3b8b35e108d8e2205da7

Thanks!

Best Answer

Hopefully this still qualifies as an answer to your question, even though I've not solved your problem.

Should all of the replicas currently be "in recovery"? Running SELECT pg_is_in_recovery(); on all of > them returns true, which is concerning. We can run queries on them

This is normal. Your slave is in recovery of a sort, albeit a slow and perpetual one, while it's still chomping WAL segments (or streaming) from another server.

Just for some reason I'm unable to get our existing production replicas to stream/archive to any other server. Specifically the archive_command is NEVER run on any of the replicas (because it's stuck in recovery mode).

Are you receiving errors anywhere? Remember streaming is initiated by the downstream slaves: what state are they in? What data do they have? And is anything interesting logged when the streaming connection is attempted? Remember, PostgreSQL's built-in streaming replication is independent of the archive system (assuming the downstream machine is otherwise up-to-date); can you make a connection on behalf of the replication user?

Does anyone have any suggestions on how I can further debug/diagnose this?

Given the inconsistencies between production and your trial, it does sound like a misconfiguration hidden somewhere, though I don't know anything about WAL-E. A diff of postgresql.conf, pg_hba.conf (and recovery.conf I suppose) would be a boring but good start. Between your production slaves and working, trial slaves, that is.

You could also check the contents of the pg_settings table. If these are long-running production machines, perhaps a setting simply hasn't been applied yet? And I know you will have looked at the docs on cascading replication and its requirements, but I'm linking them just in case.