
HG_REPMGR Automatic Failover (autofailover)

news 2026/3/26 21:56:36

Contents

Purpose
Details

Purpose

A configuration reference for HG_REPMGR automatic failover.

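Details

To configure automatic failover for a cluster, the repmgrd daemon must be running on every node. When the primary node fails, a suitable standby is automatically promoted to be the new primary and the cluster continues to serve clients. An example follows.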

Configure postgresql.replication.conf (all nodes)

Add the following parameter to the existing postgresql.replication.conf:

shared_preload_libraries='repmgr'

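Alternatively, set it with ALTER SYSTEM (this example keeps pg_pathman and timescaledb in the list):

alter system set shared_preload_libraries = pg_pathman, timescaledb, repmgr;

Restart the database: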

pg_ctl restart
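After the restart it can be useful to confirm that repmgr really is in the preload list. The following is a minimal sketch, not part of the original procedure: has_repmgr_preloaded is a hypothetical helper name, and the usage line assumes psql can connect using the current environment.

```shell
# Hypothetical helper: succeeds if "repmgr" is one of the comma-separated
# library names read from standard input.
has_repmgr_preloaded() {
    # Split on commas, strip spaces and quotes, then look for an exact match.
    tr ',' '\n' | tr -d " \"'" | grep -qx 'repmgr'
}

# Usage (assumes psql can connect with the current environment):
#   psql -Atc 'show shared_preload_libraries' | has_repmgr_preloaded \
#       && echo "repmgr is preloaded"
```

The exact-match test avoids a false positive on a library whose name merely contains "repmgr".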

Configure hg_repmgr.conf (all nodes)

Add the following parameters to the existing hg_repmgr.conf file:

failover=automatic
promote_command='repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf standby promote'
follow_command='repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf standby follow --upstream-node-id=%n'

To send the repmgr log to a fixed log file, add the log_file parameter:

log_file='/opt/highgo/5.6.1/conf/data/log/hg_repmgr.log'

To keep this log file from growing without bound, configure the system's logrotate. (Detailed steps are omitted here.)
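Since the original omits the logrotate steps, here is one possible configuration fragment, offered as a sketch rather than the vendor's recommended setup. The file name /etc/logrotate.d/hg_repmgr, the weekly schedule and the retention count are all assumptions; only the log path comes from the log_file setting above.

```conf
# /etc/logrotate.d/hg_repmgr  (hypothetical file name)
/opt/highgo/5.6.1/conf/data/log/hg_repmgr.log {
    # Rotate once a week and keep four old copies (assumed values).
    weekly
    rotate 4
    compress
    # Do not complain if the log is absent, and skip empty logs.
    missingok
    notifempty
    # Copy and truncate in place so repmgrd keeps its open file handle.
    copytruncate
}
```

copytruncate is chosen here because it does not require signalling the daemon; the trade-off is that a few lines written during the copy can be lost.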

Start the repmgrd process (all nodes)

repmgrd -f /opt/highgo/5.6.1/conf/hg_repmgr.conf -d -p /tmp/hg_repmgrd.pid

[highgo@dbrs conf]$ repmgrd -d -p /tmp/hg_repmgrd.pid
[2019-05-06 14:02:42] [NOTICE] repmgrd (repmgrd 4.2) starting up
[2019-05-06 14:02:42] [INFO] connecting to database ""
[2019-05-06 14:02:43] [ERROR] repmgr extension not found on this node
[2019-05-06 14:02:43] [DETAIL] repmgr extension is available but not installed in database "highgo"
[2019-05-06 14:02:43] [HINT] check that this node is part of a repmgr cluster
[highgo@dbrs conf]$

highgo=# \c
You are now connected to database "highgo" as user "highgo".
create extension repmgr;

[highgo@dbrs conf]$ repmgrd -f /opt/highgo/5.6.1/conf/hg_repmgr.conf -d -p /tmp/hg_repmgrd.pid
[2019-05-06 14:21:21] [NOTICE] repmgrd (repmgrd 4.2) starting up
[2019-05-06 14:21:21] [INFO] connecting to database "host=dbrs user=hgrepmgr dbname=hgrepmgr connect_timeout=2"
[highgo@dbrs conf]$ INFO: set_repmgrd_pid(): provided pidfile is /tmp/hg_repmgrd.pid
[2019-05-06 14:21:21] [NOTICE] starting monitoring of node "dbrs" (ID: 1)
[2019-05-06 14:21:21] [NOTICE] monitoring cluster primary "dbrs" (node ID: 1)

[highgo@dbrs2 conf]$ repmgrd -f /opt/highgo/5.6.1/conf/hg_repmgr.conf -d -p /tmp/hg_repmgrd.pid
[2019-05-06 14:21:50] [NOTICE] repmgrd (repmgrd 4.2) starting up
[2019-05-06 14:21:50] [INFO] connecting to database "host=dbrs2 user=hgrepmgr dbname=hgrepmgr connect_timeout=2"
[highgo@dbrs2 conf]$ INFO: set_repmgrd_pid(): provided pidfile is /tmp/hg_repmgrd.pid
[2019-05-06 14:21:50] [NOTICE] starting monitoring of node "dbrs2" (ID: 2)
[2019-05-06 14:21:50] [INFO] monitoring connection to upstream node "dbrs" (node ID: 1)

[highgo@dbrs conf]$ ls -atl /tmp/hg_repmgrd.pid
-rw-rw-r--. 1 highgo highgo 5 May 6 14:21 /tmp/hg_repmgrd.pid
[highgo@dbrs conf]$

[highgo@dbrs2 conf]$ ls -atl /tmp/hg_repmgrd.pid
-rw-rw-r--. 1 highgo highgo 5 May 6 14:21 /tmp/hg_repmgrd.pid
[highgo@dbrs2 conf]$

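Note: does this background process have to be started by hand every time the server is restarted?

Developer's reply: for now, yes; a later release will start it automatically.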
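Until that changes, the manual start can be scripted. The sketch below only starts repmgrd when the PID file does not name a live process. Both function names are hypothetical, the paths in the usage comment are the ones from this article, and the database is assumed to be up before the wrapper runs.

```shell
# Hypothetical helper: succeeds if the PID file names a live process.
repmgrd_running() {
    pidfile=$1
    # An empty or missing PID file means the daemon is not running.
    [ -s "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests that the process exists.
    kill -0 "$pid" 2>/dev/null
}

# Hypothetical wrapper: start repmgrd only when it is not already running.
start_repmgrd_if_needed() {
    conf=$1
    pidfile=$2
    if repmgrd_running "$pidfile"; then
        echo "repmgrd already running"
    else
        repmgrd -f "$conf" -d -p "$pidfile"
    fi
}

# Usage, for example from a boot-time script run as the highgo user:
#   start_repmgrd_if_needed /opt/highgo/5.6.1/conf/hg_repmgr.conf /tmp/hg_repmgrd.pid
```

A stale PID file whose number has been reused by an unrelated process would make the check pass wrongly, so this is a convenience rather than a substitute for a proper service unit.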

Check the cluster status

[highgo@dbrs conf]$ repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string
----+-------+---------+-----------+----------+----------+------------------------------------------------------------
 1  | dbrs  | primary | * running |          | default  | host=dbrs user=hgrepmgr dbname=hgrepmgr connect_timeout=2
 2  | dbrs2 | standby |   running | dbrs     | default  | host=dbrs2 user=hgrepmgr dbname=hgrepmgr connect_timeout=2
[highgo@dbrs conf]$

Simulate a primary node failure

1) Stop the database on node1

pg_ctl stop

2) Check the cluster status on node2

[highgo@dbrs2 conf]$ repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string
----+-------+---------+-----------+----------+----------+------------------------------------------------------------
 1  | dbrs  | primary | - failed  |          | default  | host=dbrs user=hgrepmgr dbname=hgrepmgr connect_timeout=2
 2  | dbrs2 | primary | * running |          | default  | host=dbrs2 user=hgrepmgr dbname=hgrepmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "dbrs" (ID: 1)
[highgo@dbrs2 conf]$

At this point node2 has been promoted to primary.
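The same check can be scripted by filtering the cluster show table. This is a sketch that assumes the column layout shown in this article (ID, Name, Role, Status, separated by "|"); current_primary is a hypothetical helper name.

```shell
# Hypothetical helper: read "cluster show" output on standard input and print
# the name of each node whose Role is primary and whose Status is "* running".
current_primary() {
    awk -F'|' '$3 ~ /primary/ && $4 ~ /\* *running/ {
        gsub(/[ \t]/, "", $2)   # trim the padding around the node name
        print $2
    }'
}

# Usage:
#   repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf cluster show | current_primary
```

Rows marked "- failed" or "! running as primary" are skipped because their Status column carries no "*", so after the failover above the helper reports only dbrs2.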

Log

[highgo@dbrs2 conf]$
[2019-05-06 14:24:14] [WARNING] unable to connect to upstream node "dbrs" (node ID: 1)
[2019-05-06 14:24:14] [INFO] checking state of node 1, 1 of 6 attempts
[2019-05-06 14:24:14] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-05-06 14:24:24] [INFO] checking state of node 1, 2 of 6 attempts
[2019-05-06 14:24:24] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-05-06 14:24:34] [INFO] checking state of node 1, 3 of 6 attempts
[2019-05-06 14:24:34] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-05-06 14:24:44] [INFO] checking state of node 1, 4 of 6 attempts
[2019-05-06 14:24:44] [INFO] sleeping 10 seconds until next reconnection attempt
[2019-05-06 14:24:54] [INFO] checking state of node 1, 5 of 6 attempts
[2019-05-06 14:24:54] [INFO] sleeping 10 seconds until next reconnection attempt
[highgo@dbrs2 conf]$
[2019-05-06 14:25:04] [INFO] checking state of node 1, 6 of 6 attempts
[2019-05-06 14:25:04] [WARNING] unable to reconnect to node 1 after 6 attempts
[2019-05-06 14:25:04] [NOTICE] this node is the only available candidate and will now promote itself
[2019-05-06 14:25:04] [INFO] promote_command is: "repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf standby promote"
NOTICE: promoting standby to primary
DETAIL: promoting server "dbrs2" (ID: 2) using "/opt/highgo/5.6.1/bin/pg_ctl -w -D '/opt/highgo/5.6.1/data' promote"
DETAIL: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "dbrs2" (ID: 2) was successfully promoted to primary
[2019-05-06 14:25:10] [INFO] switching to primary monitoring mode
[2019-05-06 14:25:10] [NOTICE] monitoring cluster primary "dbrs2" (node ID: 2)
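The log shows six checks ten seconds apart before the standby promotes itself, with no sleep after the last check. In upstream repmgr these values are set by reconnect_attempts and reconnect_interval in the configuration file; that hg_repmgr uses the same parameter names is an assumption. The arithmetic can be sketched as follows, with failover_decision_delay as a hypothetical helper name.

```shell
# Hypothetical helper: seconds between the first failed check and the last
# one, given the number of attempts ($1) and the interval in seconds ($2).
failover_decision_delay() {
    echo $(( ($1 - 1) * $2 ))
}

failover_decision_delay 6 10   # prints 50
```

This matches the timestamps above: the first check is at 14:24:14 and the sixth at 14:25:04, 50 seconds later, after which the promotion itself takes a few more seconds.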

Once node1 has recovered from the failure, it can rejoin the cluster.

[highgo@dbrs conf]$ repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf cluster show
 ID | Name  | Role    | Status               | Upstream | Location | Connection string
----+-------+---------+----------------------+----------+----------+------------------------------------------------------------
 1  | dbrs  | primary | * running            |          | default  | host=dbrs user=hgrepmgr dbname=hgrepmgr connect_timeout=2
 2  | dbrs2 | standby | ! running as primary | dbrs     | default  | host=dbrs2 user=hgrepmgr dbname=hgrepmgr connect_timeout=2

1) Rejoin the cluster. Run the command on the failed node, with host pointing at the new primary; the node rejoins as a standby. (Think of pg_rewind: --force-rewind runs it as part of the rejoin.)

repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf node rejoin -d 'host=dbrs2 dbname=hgrepmgr user=hgrepmgr' --force-rewind --verbose

Note: shut down HGDB on node1 before running this command.

[highgo@dbrs conf]$ repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf node rejoin -d 'host=dbrs2 dbname=hgrepmgr user=hgrepmgr' --force-rewind --verbose
NOTICE: using provided configuration file "/opt/highgo/5.6.1/conf/hg_repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 0 files copied to "/tmp/repmgr-config-archive-dbrs"
NOTICE: executing pg_rewind
NOTICE: 0 files copied to /opt/highgo/5.6.1/data
INFO: directory "/tmp/repmgr-config-archive-dbrs" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's primary to node 2
NOTICE: starting server using "/opt/highgo/5.6.1/bin/pg_ctl -w -D '/opt/highgo/5.6.1/data' start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[highgo@dbrs conf]$

2) Check the cluster status with repmgr cluster show

[highgo@dbrs conf]$ repmgr -f /opt/highgo/5.6.1/conf/hg_repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string
----+-------+---------+-----------+----------+----------+------------------------------------------------------------
 1  | dbrs  | standby |   running | dbrs2    | default  | host=dbrs user=hgrepmgr dbname=hgrepmgr connect_timeout=2
 2  | dbrs2 | primary | * running |          | default  | host=dbrs2 user=hgrepmgr dbname=hgrepmgr connect_timeout=2
[highgo@dbrs conf]$