Sanju Rakonde af95d11f59 geo-replication: fix for secondary node fail-over (#3959)
* geo-replication: fix for secondary node fail-over

Problem: When a geo-replication session is set up, all the gsyncd slave
processes come up on the host that was used to create the geo-rep
session. When this primary slave node goes down, all the bricks go
into a faulty state.

Cause: When the monitor process tries to connect to the remote secondary
node, it always uses remote_addr as the hostname. This variable holds
the hostname of the node that was used to create the geo-rep session,
so the gsyncd slave processes always come up on that primary slave
node. When that node goes down, the monitor process cannot bring up the
gsyncd slave processes and the bricks go into a faulty state.
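A minimal sketch of the old behaviour, with hypothetical function and
variable names (not the actual gsyncd code): because the slave endpoint
is always built from the address recorded at session-creation time,
every worker ends up connecting to that single node.

```python
# Illustrative sketch only -- names are hypothetical, not the actual gsyncd code.
def slave_endpoint(slave_user, remote_addr):
    # remote_addr is the hostname given when the geo-rep session was created;
    # it never changes, so every gsyncd slave process lands on the same host.
    return "%s@%s" % (slave_user, remote_addr)
```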

Fix: Instead of remote_addr, use resource_remote, which holds the
hostname of a randomly picked remote node. This way, when a geo-rep
session is created and started, the gsyncd slave processes are
distributed across the secondary cluster. If the node that was used to
create the session goes down, the monitor process brings up the gsyncd
slave process on a randomly picked remote node (from the nodes that are
up at the moment), and the bricks do not go into a faulty state.
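A corresponding sketch of the fixed behaviour, again with hypothetical
names rather than the real gsyncd code: choosing the endpoint from a
randomly picked secondary node that is currently up spreads the workers
across the cluster and survives the loss of any single node.

```python
import random

# Illustrative sketch only -- names are hypothetical, not the actual gsyncd code.
def slave_endpoint(slave_user, up_secondary_nodes):
    # up_secondary_nodes: hostnames of secondary nodes reachable right now.
    # Picking one at random distributes gsyncd slave processes across the
    # secondary cluster instead of pinning them all to one host.
    resource_remote = random.choice(up_secondary_nodes)
    return "%s@%s" % (slave_user, resource_remote)
```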

fixes:#3956

Signed-off-by: Sanju Rakonde <sanju.rakonde@phonepe.com>
