
Arbiter documentation improvement

Joe Julian
2016-08-18 15:07:14 -07:00
parent ab581a927a
commit 8ef2a9e686

# Arbiter volumes and quorum options in gluster
The arbiter volume is a special subset of replica volumes that is aimed at
preventing split-brains and providing the same consistency guarantees as a normal
replica 3 volume without consuming 3x space.
<!-- TOC depthFrom:1 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 -->
- [Arbiter volumes and quorum options in gluster](#arbiter-volumes-and-quorum-options-in-gluster)
- [Arbiter configuration](#arbiter-configuration)
    - [Arbiter brick(s) sizing](#arbiter-bricks-sizing)
- [Why Arbiter?](#why-arbiter)
    - [Split-brains in replica volumes](#split-brains-in-replica-volumes)
    - [Server-quorum and some pitfalls](#server-quorum-and-some-pitfalls)
    - [Client Quorum](#client-quorum)
    - [Replica 2 and Replica 3 volumes](#replica-2-and-replica-3-volumes)
- [How Arbiter works](#how-arbiter-works)
<!-- /TOC -->
# Arbiter configuration
The syntax for creating the volume is:
```
# gluster volume create <VOLNAME> replica 3 arbiter 1 <NEW-BRICK> ...
```
_**Note:** Volumes using the arbiter feature can **only** be `replica 3 arbiter 1`._
For example:
```
# gluster volume create testvol replica 3 arbiter 1 server{1..6}:/bricks/brick
volume create: testvol: success: please start the volume to access data
```
This means that for every 3 bricks listed, 1 of them is an arbiter. We have
created 6 bricks. With a replica count of three, each set of 3 consecutive bricks
becomes a replica subvolume. Since we have two such sets, this creates a
distributed volume made up of two replica subvolumes.
Each replica subvolume is defined to have 1 arbiter out of the 3 bricks. The
arbiter bricks are taken from the end of each replica subvolume.
```
# gluster volume info
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: ae6c4162-38c2-4368-ae5d-6bad141a4119
Status: Created
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/brick
Brick2: server2:/bricks/brick
Brick3: server3:/bricks/brick (arbiter)
Brick4: server4:/bricks/brick
Brick5: server5:/bricks/brick
Brick6: server6:/bricks/brick (arbiter)
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
```
The arbiter brick will store only the file/directory names (i.e. the tree structure)
and extended attributes (metadata) but no data; i.e. the file size
(as shown by `ls -l`) will be zero bytes. It will also store other gluster
metadata like the .glusterfs folder and its contents.
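A quick way to see this for yourself (the brick paths and file name here are only illustrative, following the `server{1..6}:/bricks/brick` example above) is to compare the same file on a data brick and on the arbiter brick:

```
# On a data brick (e.g. server1), the file has its real size:
ls -l /bricks/brick/file.txt

# On the arbiter brick (e.g. server3), the same file exists with the same
# ownership and permissions but 0 bytes of data:
ls -l /bricks/brick/file.txt

# The gluster metadata (gfid, AFR changelog xattrs) is still present on the arbiter:
getfattr -d -m . -e hex /bricks/brick/file.txt
```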
_**Note:** Enabling the arbiter feature **automatically** configures_
_client-quorum to 'auto'. This setting is **not** to be changed._
## Arbiter brick(s) sizing
Since the arbiter brick does not store file data, its disk usage will be considerably
less than that of the other bricks of the replica. The sizing of the brick will depend on
how many files you plan to store in the volume. A good estimate is
4KB times the number of files in the replica.
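As a rough, illustrative calculation only: a volume expected to hold about one million files would need roughly 4GB on each arbiter brick:

```
# 1,000,000 files x 4 KB per file, in MB (integer shell arithmetic)
echo "$((1000000 * 4 / 1024)) MB"    # prints "3906 MB", i.e. ~3.8 GB; provision ~4 GB plus headroom
```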
# Why Arbiter?
## Split-brains in replica volumes
When a file is in split-brain, there is an inconsistency in either data or
metadata (permissions, uid/gid, extended attributes etc.) of the file amongst the
bricks of a replica *and* we do not have enough information to authoritatively
pick a copy as being pristine and heal the bad copies, despite all bricks being
up and running. For directories, there
is also an entry-split brain where a file inside it has different gfids/
file-type (say one is a file and another is a directory of the same name)
across the bricks of a replica.
This [document](https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md)
describes how to resolve files that are in split-brain using gluster cli or the
mount point. Almost always, split-brains occur due to network disconnects (where
a client temporarily loses connection to the bricks) and very rarely due to
the gluster brick processes going down or returning an error.
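For reference, the gluster CLI command that lists the files currently in split-brain looks like this (the volume name is just an example; the document linked above covers the actual resolution steps):

```
# gluster volume heal testvol info split-brain
```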
## Server-quorum and some pitfalls
This [document](http://www.gluster.org/community/documentation/index.php/Features/Server-quorum)
provides a detailed description of this feature.
The volume options for server-quorum are:
> Option: cluster.server-quorum-ratio
Value Description: 0 to 100

> Option: cluster.server-quorum-type
Value Description: none | server
If set to server, this option enables the specified volume to participate in the server-side quorum.
If set to none, that volume alone is not considered for volume checks.
by the client.
3. We now have different contents for the file in B1 and B2 ==> split-brain.
In the author's opinion, server quorum is useful if you want to avoid split-brains
to the volume(s) configuration across the nodes and not in the I/O path.
Unlike in client-quorum where the volume becomes read-only when quorum is lost, loss of
server-quorum in a particular node makes glusterd kill the brick processes on that
node (for the participating volumes) making even reads impossible.
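If you do decide to use server-quorum, the options above are set with the usual volume-set syntax. A minimal sketch, assuming a volume named testvol; `cluster.server-quorum-ratio` is a cluster-wide option and is therefore set on `all`:

```
# gluster volume set testvol cluster.server-quorum-type server
# gluster volume set all cluster.server-quorum-ratio 51%
```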
## Client Quorum
Client-quorum is a feature implemented in AFR to prevent split-brains in the I/O
path for replicate/distributed-replicate volumes. By default, if the client-quorum
is not met for a particular replica subvol, it becomes read-only. The other subvols
(in a dist-rep volume) will still have R/W access.
The following volume set options are used to configure it:
> Option: cluster.quorum-type
Default Value: none
Value Description: none|auto|fixed
If set to "fixed", this option allows writes to a file only if the number of
active bricks in that replica set (to which the file belongs) is greater
than or equal to the count specified in the 'quorum-count' option.
If set to "auto", this option allows writes to the file only if the number of
bricks that are up is >= ceil(total number of bricks that constitute that replica / 2).
If the number of replicas is even, then there is a further check:
if the number of up bricks is exactly equal to n/2, then the first brick must
be one of the bricks that is up. If it is more than n/2, then it is not
necessary that the first brick is one of the up bricks. For example, in a
replica 3 volume writes are allowed as long as at least 2 bricks (ceil(3/2))
are up, whereas in a replica 2 volume with only one brick up, that brick must
be the first one.
> Option: cluster.quorum-count
Value Description:
The number of bricks that must be active in a replica-set to allow writes.
This option is used in conjunction with cluster.quorum-type *=fixed* option
to specify the number of bricks to be active to participate in quorum.
If the quorum-type is auto then this option has no significance.
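For example, to require at least 2 bricks of each replica set to be up before writes are allowed (the volume name is illustrative):

```
# gluster volume set testvol cluster.quorum-type fixed
# gluster volume set testvol cluster.quorum-count 2
```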
> Option: cluster.quorum-reads
Default Value: no
Value Description: yes|no
If quorum-reads is set to 'yes' (or 'true' or 'on'), then reads too are allowed
only if quorum is met; without it, both reads and writes will return ENOTCONN.
If set to 'no' (or 'false' or 'off'), then reads will be served even when quorum is
not met, but writes will fail with EROFS.
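To also make reads subject to quorum (again with an illustrative volume name):

```
# gluster volume set testvol cluster.quorum-reads yes
```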
## Replica 2 and Replica 3 volumes
From the above descriptions, it is clear that client-quorum cannot really be applied
to a replica 2 volume (without costing HA).
If the quorum-type is set to auto, then by the description
of the write. If 3 writes happen on the same file at non-overlapping {offset, length},
and each write fails on (only) one different brick, then we have AFR xattrs of the
file blaming each other.
# How Arbiter works
There are 2 components to the arbiter volume. One is the arbiter xlator that is
loaded in the brick process of every 3rd (i.e. the arbiter) brick. The other is the
arbitration logic itself that is present in AFR (the replicate xlator) loaded
on the clients.

The former acts as a sort of 'filter' translator for the FOPs, i.e. it allows
entry operations to hit posix, blocks certain inode operations like
read (unwinds the call with ENOTCONN), and unwinds other inode operations
like write, truncate etc. with success without winding them down to posix.
The latter, i.e. the arbitration logic present in AFR, does the following:
- Takes full file locks when writing to a file as opposed to range locks in a
normal replicate volume. This prevents the corner-case split-brain described
earlier for 3-way replicas.
The behaviour of arbiter volumes in allowing/failing write FOPS in conjunction
with client-quorum can be summarized in the below steps:
- If 2 bricks are up and if one of them is the arbiter (i.e. the 3rd brick) *and*
it blames the other up brick for a given file, then all write FOPS will fail
with ENOTCONN. This is because in this scenario, the only true copy is on the
brick that is down. Hence we cannot allow writes until that brick is also up.
If the arbiter doesn't blame the other brick, FOPS will be allowed to proceed.
'Blaming' here is w.r.t. the values of AFR changelog extended attributes.
and B3, i.e. B1 is the only source. Now if for some reason, the second write
failed on B1 (before there was a chance for selfheal to complete despite all bricks
being up), the application would receive failure (ENOTCONN) for that write.
The bricks being up or down as described above does not necessarily mean that the
brick process is offline. It can also mean that the mount lost its connection to
the brick due to network disconnects etc.
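The 'blaming' mentioned in the steps above can be observed directly on the bricks by dumping a file's extended attributes (the brick path and file name below are illustrative); non-zero values in the xattrs named like `trusted.afr.<volname>-client-<N>` are the pending/blame counters AFR uses to pick the good copies:

```
# Dump all xattrs of a file as stored on a brick and look at the trusted.afr.* entries
getfattr -d -m . -e hex /bricks/brick/file.txt
```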