
update arbiter doc

- Highlight the change in syntax
- EROFS behaviour removal
- Remove split-brain corner case reference since AFR now always takes full
  locks during write transactions.
- Update arbiter brick sizing to reflect potential inode saturation issues with
  large XFS volumes.

Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Ravishankar N
2019-08-05 18:23:29 +05:30
parent b4b4c9d2ed
commit c48cd70ba3


@@ -22,14 +22,18 @@ replica 3 volume without consuming 3x space.
The syntax for creating the volume is:
```
-# gluster volume create <VOLNAME> replica 3 arbiter 1 <NEW-BRICK> ...
+# gluster volume create <VOLNAME> replica 2 arbiter 1 <NEW-BRICK> ...
```
-_**Note:** Volumes using the arbiter feature can **only** be ```replica 3 arbiter 1```_
+**Note**: The earlier syntax was ```replica 3 arbiter 1```, but it led to
+confusion among users about the total number of data bricks. For the sake of
+backward compatibility, the old syntax still works. In either case, the implied
+meaning is that there are 2 data bricks and 1 arbiter brick in an nx(2+1)
+arbiter volume.
For example:
```
-# gluster volume create testvol replica 3 arbiter 1 server{1..6}:/bricks/brick
+# gluster volume create testvol replica 2 arbiter 1 server{1..6}:/bricks/brick
volume create: testvol: success: please start the volume to access data
```
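As a rough follow-up sketch (reusing the `testvol` name from the example above;
output is omitted since its exact format varies by glusterfs release), the
volume can then be started and its brick layout inspected:
```
# gluster volume start testvol
# gluster volume info testvol   # the brick list shows which brick is the arbiter
```
Newer releases mark the arbiter brick explicitly in the brick list printed by
`gluster volume info`.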
@@ -74,7 +78,17 @@ _client-quorum to 'auto'. This setting is **not** to be changed._
Since the arbiter brick does not store file data, its disk usage will be considerably
less than the other bricks of the replica. The sizing of the brick will depend on
how many files you plan to store in the volume. A good estimate will be
-4kb times the number of files in the replica.
+4KB times the number of files in the replica. Note that the estimate also
+depends on the inode space allotted by the underlying filesystem for a given
+disk size.
+The default `maxpct` value in XFS for filesystems of size 1TB to 50TB is only 5%.
+If you want to store, say, 300 million files, 4KB x 300M gives us 1.2TB.
+5% of this is around 60GB. Assuming the recommended inode size of 512 bytes,
+that gives us the ability to store only 60GB/512B ~= 120 million files. So it
+is better to choose a higher `maxpct` value (say 25%) while formatting an XFS
+disk of size greater than 1TB. Refer to the man page of `mkfs.xfs` for details.
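As a rough sketch of such a format command (with `/dev/sdb1` standing in as a
placeholder for the arbiter brick device), both the inode size and `maxpct` can
be passed to `mkfs.xfs` through its `-i` option:
```
# mkfs.xfs -i size=512,maxpct=25 /dev/sdb1   # /dev/sdb1 is a placeholder device
```
Here `size=512` sets the recommended inode size and `maxpct=25` raises the
ceiling on the space XFS may use for inodes from the 5% default.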
# Why Arbiter?
## Split-brains in replica volumes
@@ -138,7 +152,7 @@ node (for the participating volumes) making even reads impossible.
Client-quorum is a feature implemented in AFR to prevent split-brains in the I/O
path for replicate/distributed-replicate volumes. By default, if the client-quorum
-is not met for a particular replica subvol, it becomes read-only. The other subvols
+is not met for a particular replica subvol, it becomes unavailable. The other subvols
(in a dist-rep volume) will still have R/W access.
The following volume set options are used to configure it:
@@ -162,13 +176,10 @@ The following volume set options are used to configure it:
to specify the number of bricks to be active to participate in quorum.
If the quorum-type is auto then this option has no significance.
-> Option: cluster.quorum-reads
-Default Value: no
-Value Description: yes|no
-If quorum-reads is set to 'yes' (or 'true' or 'on'), read will be allowed
-only if quorum is met, without which read (and write) will return ENOTCONN.
-If set to 'no' (or 'false' or 'off'), then reads will be served even when quorum is
-not met, but write will fail with EROFS.
+Earlier, when quorum was not met, the replica subvolume turned read-only. But
+from [glusterfs-3.13](https://docs.gluster.org/en/latest/release-notes/3.13.0/#addition-of-checks-for-allowing-lookups-in-afr-and-removal-of-clusterquorum-reads-volume-option) onwards, the subvolume becomes unavailable, i.e. all
+file operations fail with ENOTCONN instead of the subvolume turning read-only (EROFS).
+This means the ```cluster.quorum-reads``` volume option is also no longer supported.
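For reference, a minimal sketch of how these quorum options are set and
inspected on a plain replicate volume, reusing the `testvol` name from the
earlier example (on arbiter volumes the client-quorum is 'auto' and is not to
be changed):
```
# gluster volume set testvol cluster.quorum-type fixed
# gluster volume set testvol cluster.quorum-count 2   # example count
# gluster volume get testvol cluster.quorum-type
```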
## Replica 2 and Replica 3 volumes
@@ -194,13 +205,6 @@ Say B1, B2 and B3 are the bricks:
that B2 and B3's pending xattrs blame each other and therefore the write is not
allowed and is failed with an EIO.
-There is a corner case even with replica 3 volumes where the file can
-end up in a split-brain. AFR usually takes range locks for the {offset, length}
-of the write. If 3 writes happen on the same file at non-overlapping {offset, length}
-and each write fails on (only) one different brick, then we have AFR xattrs of the
-file blaming each other.
# How Arbiter works
There are 2 components to the arbiter volume. One is the arbiter xlator that is
@@ -213,14 +217,10 @@ entry operations to hit POSIX, blocks certain inode operations like
read (unwinds the call with ENOTCONN) and unwinds other inode operations
like write, truncate etc. with success without winding it down to POSIX.
-The latter. i.e. the arbitration logic present in AFR does the following:
-- Takes full file locks when writing to a file as opposed to range locks in a
-normal replicate volume. This prevents the corner-case split-brain described
-earlier for 3 way replicas.
-The behavior of arbiter volumes in allowing/failing write FOPS in conjunction
-with client-quorum can be summarized in the below steps:
+The latter, i.e. the arbitration logic present in AFR, takes full file locks
+when writing to a file, just like in normal replica volumes. The behavior of
+arbiter volumes in allowing/failing write FOPS in conjunction with
+client-quorum can be summarized in the steps below:
- If all 3 bricks are up (happy case), then there is no issue and the FOPs are allowed.