Perform Manual Failover & Switchover using repmgr in PostgreSQL

Perform Manual Failover & Switchover with repmgr in PostgreSQL

In the recent blog, we discuss How to Setup Streaming Replication with repmgr and barman Click here to read more. In this blog, we will perform Manual Failover & Switchover using repmgr in PostgreSql. We will use the same setup we did in the last blog.

Below are the high-level steps we will follow.
1. Promoting a Standby server to Primary (Failover)
2. Failed primary rejoin
3. Switchover

Below are setup details and the same will be used in this demonstration.

Sr No.	Hostname	IP	Role
1	test-machine02	192.168.114.176	Master / Primary Server
2	test-machine01	192.168.114.177	Standby / Secondary Server

Standby Server

Step 1. Promoting a Standby server to Primary (Failover): Before promoting Standby to Primary, we will cross-check cluster health with repmgr cluster show command. Once we are confirmed our cluster health, will stop Primary PostgreSQL to simulate Primary DB failure. Then use repmgr standby promote to promote Standby DB to become new Primary.

[root@test-machine01 ~]# su - postgres
Last login: Wed Aug  4 09:41:42 +03 2021 on pts/0
[postgres@test-machine01]
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string
----+----------------+---------+-----------+----------------+----------+----------+----------+----------------------------------------------------------
 1  | test-machine02 | primary | * running |                | default  | 100      | 5        | host=192.168.114.176 user=repmgr dbname=repmgr port=5432
 2  | test-machine01 | standby |   running | test-machine02 | default  | 100      | 5        | host=test-machine01 user=repmgr dbname=repmgr port=5432
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster matrix
INFO: connecting to database
 Name           | ID | 1 | 2
----------------+----+---+---
 test-machine02 | 1  | * | *
 test-machine01 | 2  | * | *
[postgres@test-machine01]
[postgres@test-machine01]

[root@test-machine02 ~]# systemctl stop postgresql-13
[root@test-machine02 ~]#

[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
 ID | Name           | Role    | Status        | Upstream         | Location | Priority | Timeline | Connection string
----+----------------+---------+---------------+------------------+----------+----------+----------+----------------------------------------------------------
 1  | test-machine02 | primary | ? unreachable | ?                | default  | 100      |          | host=192.168.114.176 user=repmgr dbname=repmgr port=5432
 2  | test-machine01 | standby |   running     | ? test-machine02 | default  | 100      | 5        | host=test-machine01 user=repmgr dbname=repmgr port=5432

WARNING: following issues were detected
  - unable to connect to node "test-machine02" (ID: 1)
  - node "test-machine02" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "test-machine01" (ID: 2)'s upstream node "test-machine02" (ID: 1)
  - unable to determine if node "test-machine01" (ID: 2) is attached to its upstream node "test-machine02" (ID: 1)

HINT: execute with --verbose option to see connection error messages
[postgres@test-machine01]
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster matrix
INFO: connecting to database
 Name           | ID | 1 | 2
----------------+----+---+---
 test-machine02 | 1  | ? | ?
 test-machine01 | 2  | x | *
[postgres@test-machine01]
[postgres@test-machine01]


[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr standby promote -f /etc/repmgr/13/repmgr.conf --log-to-file
[2021-08-04 09:52:12] [NOTICE] redirecting logging output to "/var/log/repmgr.log"

waiting for server to promote..... done
server promoted
[postgres@test-machine01]


[postgres@test-machine01]cat /var/log/repmgr.log
[2021-08-04 09:52:12] [INFO] user "repmgr" does not have permission to execute "pg_promote()", falling back to "pg_ctl promote"
[2021-08-04 09:52:12] [NOTICE] promoting standby to primary
[2021-08-04 09:52:12] [DETAIL] promoting server "test-machine01" (ID: 2) using "/usr/pgsql-13/bin/pg_ctl  -w -D '/var/lib/pgsql/13/data' promote"
[2021-08-04 09:52:14] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2021-08-04 09:52:14] [NOTICE] STANDBY PROMOTE successful
[2021-08-04 09:52:14] [DETAIL] server "test-machine01" (ID: 2) was successfully promoted to primary
[postgres@test-machine01]

[postgres@test-machine01]/usr/pgsql-13/bin/pg_isready
/var/run/postgresql:5432 - accepting connections
[postgres@test-machine01]

[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+----------------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------
 1  | test-machine02 | primary | - failed  | ?        | default  | 100      |          | host=test-machine02 user=repmgr dbname=repmgr port=5432
 2  | test-machine01 | primary | * running |          | default  | 100      | 2        | host=test-machine01 user=repmgr dbname=repmgr port=5432

WARNING: following issues were detected
  - unable to connect to node "test-machine02" (ID: 1)

HINT: execute with --verbose option to see connection error messages
[postgres@test-machine01]

Step 2. Failed primary rejoin: We can use repmgr node rejoin command to set our old Primary as the new Standby. Use option –dry-run this will perform any necessary checks and inform you about success/failure.

OLD-Primary Server

[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr  node rejoin -f /etc/repmgr/13/repmgr.conf -d 'host=test-machine01 dbname=repmgr user=repmgr' --verbose --dry-run
NOTICE: using provided configuration file "/etc/repmgr/13/repmgr.conf"
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 6992789734735809123
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node's recovery point: 0/9000028; rejoin target node's fork point: 0/90000A0
INFO: prerequisites for executing NODE REJOIN are met
[postgres@test-machine02]


[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr  node rejoin -f /etc/repmgr/13/repmgr.conf -d 'host=test-machine01 dbname=repmgr user=repmgr' --verbose
NOTICE: using provided configuration file "/etc/repmgr/13/repmgr.conf"
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node's recovery point: 0/9000028; rejoin target node's fork point: 0/90000A0
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=test-machine02 user=repmgr dbname=repmgr port=5432"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/pgsql-13/bin/pg_ctl  -w -D '/var/lib/pgsql/13/data' start"
WARNING: unable to ping "host=test-machine02 user=repmgr dbname=repmgr port=5432"
DETAIL: PQping() returned "PQPING_REJECT"
INFO: waiting for node "test-machine02" (ID: 1) to respond to pings; 1 of max 60 attempts (parameter "node_rejoin_timeout")
WARNING: unable to ping "host=test-machine02 user=repmgr dbname=repmgr port=5432"
DETAIL: PQping() returned "PQPING_REJECT"
INFO: node "test-machine02" (ID: 1) is pingable
INFO: node "test-machine02" (ID: 1) has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[postgres@test-machine02]
[postgres@test-machine02]
[postgres@test-machine02]

[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------
 1  | test-machine02 | standby |   running | test-machine01 | default  | 100      | 1        | host=test-machine02 user=repmgr dbname=repmgr port=5432
 2  | test-machine01 | primary | * running |                | default  | 100      | 2        | host=test-machine01 user=repmgr dbname=repmgr port=5432
[postgres@test-machine02]
[postgres@test-machine02]

Step 3. Cluster Switchover: Sometimes we just want to be able to switch servers due to some scheduled maintenance on the Primary server. This can be achieved using command repmgr standby switchover. Use option –dry-run first to check for any potential error using switchover process. We can also use option –archive-ready to check that there is no backlog of WAL files waiting to be archived in the Primary Server. Please note switchover must be executed from the standby node.

After Step 3 we should be back in the same cluster configuration we were at the beginning of this tutorial

NEW-Primary Server

[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr node check --archive-ready
OK (0 pending archive ready files)
[postgres@test-machine01]

OLD-Primary Server

[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr standby switchover -f /etc/repmgr/13/repmgr.conf  --dry-run
NOTICE: checking switchover on node "test-machine02" (ID: 1) in --dry-run mode
WARNING: no superuser connection available
DETAIL: it is recommended to perform switchover operations with a database superuser
HINT: provide the name of a superuser with -S/--superuser
INFO: SSH connection to host "test-machine01" succeeded
INFO: able to execute "repmgr" on remote host "test-machine01"
INFO: 1 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: local node "test-machine02" (ID: 1) would be promoted to primary; current primary "test-machine01" (ID: 2) would be demoted to standby
INFO: following shutdown command would be run on node "test-machine01":
  "/usr/pgsql-13/bin/pg_ctl  -D '/var/lib/pgsql/13/data' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
[postgres@test-machine02]

[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr standby switchover -f /etc/repmgr/13/repmgr.conf
NOTICE: executing switchover on node "test-machine02" (ID: 1)
WARNING: no superuser connection available
DETAIL: it is recommended to perform switchover operations with a database superuser
HINT: provide the name of a superuser with -S/--superuser
NOTICE: local node "test-machine02" (ID: 1) will be promoted to primary; current primary "test-machine01" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "test-machine01" (ID: 2)
WARNING: a superuser connection is required to issue a CHECKPOINT
HINT: provide a superuser with -S/--superuser
DETAIL: executing server command "/usr/pgsql-13/bin/pg_ctl  -D '/var/lib/pgsql/13/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 3 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/B000028
INFO: user "repmgr" does not have permission to execute "pg_promote()", falling back to "pg_ctl promote"
NOTICE: promoting standby to primary
DETAIL: promoting server "test-machine02" (ID: 1) using "/usr/pgsql-13/bin/pg_ctl  -w -D '/var/lib/pgsql/13/data' promote"
waiting for server to promote...... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "test-machine02" (ID: 1) was successfully promoted to primary
INFO: local node 2 can attach to rejoin target node 1
DETAIL: local node's recovery point: 0/B000028; rejoin target node's fork point: 0/B0000A0
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=test-machine01 user=repmgr dbname=repmgr port=5432"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/pgsql-13/bin/pg_ctl  -w -D '/var/lib/pgsql/13/data' start"
WARNING: node "test-machine01" not found in "pg_stat_replication"
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1
WARNING: node "test-machine01" not found in "pg_stat_replication"
INFO: waiting for node "test-machine01" (ID: 2) to connect to new primary; 1 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: checking for record in node "test-machine02"'s "pg_stat_replication" table where "application_name" is "test-machine01"
WARNING: node "test-machine01" not found in "pg_stat_replication"
NOTICE: node  "test-machine02" (ID: 1) promoted to primary, node "test-machine01" (ID: 2) demoted to standby
NOTICE: switchover was successful
DETAIL: node "test-machine02" is now primary and node "test-machine01" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
[postgres@test-machine02]

[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------
 1  | test-machine02 | primary | * running |                | default  | 100      | 3        | host=test-machine02 user=repmgr dbname=repmgr port=5432
 2  | test-machine01 | standby |   running | test-machine02 | default  | 100      | 2        | host=test-machine01 user=repmgr dbname=repmgr port=5432
[postgres@test-machine02]
[postgres@test-machine02]

References:repmgr Documentation
https://repmgr.org/docs/4.0/promoting-standby.html
https://repmgr.org/docs/4.0/repmgr-node-rejoin.html
https://repmgr.org/docs/4.0/performing-switchover.html

This document is just for learning purpose and always validate in the LAB environment first before applying in the LIVE environment.

Hope so you like this article!
Please share your valuable feedback/comments/subscribe and follow us below and don’t forget to click on the bell icon to get the most recent update. Click here to understand more about our pursuit.

POSTGRESQL, REPLICATION, REPMGR

DBsGuru