diff --git a/Administrator Guide/arbiter-volumes-and-quorum.md b/Administrator Guide/arbiter-volumes-and-quorum.md
index ec1759e..391bbb6 100644
--- a/Administrator Guide/arbiter-volumes-and-quorum.md
+++ b/Administrator Guide/arbiter-volumes-and-quorum.md
@@ -1,14 +1,84 @@
-Arbiter volumes and quorum options in gluster
-=============================================
-The arbiter volume is special subset of replica 3 volumes that is aimed at
-preventing split-brains and providing the same consistency guarantees as a normal
-replica 3 volume without consuming 3x space. Before we look at how to create one
-and how they work etc. it is worthwhile to elaborate on the types of split-brain
-in gluster parlance and how features like server-quorum and client-quorum help in
-reducing the occurrence of split-brains to a certain extent.
+# Arbiter volumes and quorum options in gluster
+
+The arbiter volume is a special subset of replica volumes that is aimed at
+preventing split-brains and providing the same consistency guarantees as a normal
+replica 3 volume without consuming 3x space.
+
+- [Arbiter volumes and quorum options in gluster](#arbiter-volumes-and-quorum-options-in-gluster)
+- [Arbiter configuration](#arbiter-configuration)
+    - [Arbiter brick(s) sizing](#arbiter-bricks-sizing)
+- [Why Arbiter?](#why-arbiter)
+    - [Split-brains in replica volumes](#split-brains-in-replica-volumes)
+    - [Server-quorum and some pitfalls](#server-quorum-and-some-pitfalls)
+    - [Client Quorum](#client-quorum)
+    - [Replica 2 and Replica 3 volumes](#replica-2-and-replica-3-volumes)
+- [How Arbiter works](#how-arbiter-works)
+
+# Arbiter configuration
+
+The syntax for creating the volume is:
+
+```
+# gluster volume create <VOLNAME> replica 3 arbiter 1 host1:brick1 host2:brick2 host3:brick3 ...
+```
+
+_**Note:** Volumes using the arbiter feature can **only** be `replica 3 arbiter 1`._
+
+For example:
+
+```
+# gluster volume create testvol replica 3 arbiter 1 server{1..6}:/bricks/brick
+volume create: testvol: success: please start the volume to access data
+```
+
+This means that for every 3 bricks listed, 1 of them is an arbiter. Here we have
+listed 6 bricks. With a replica count of three, each set of 3 consecutive bricks forms
+a replica subvolume. Since we have two such sets, this creates a distribute
+volume made up of two replica subvolumes.
+
+Each replica subvolume is defined to have 1 arbiter out of its 3 bricks. The
+arbiter bricks are taken from the end of each replica subvolume.
+
+```
+# gluster volume info
+Volume Name: testvol
+Type: Distributed-Replicate
+Volume ID: ae6c4162-38c2-4368-ae5d-6bad141a4119
+Status: Created
+Number of Bricks: 2 x (2 + 1) = 6
+Transport-type: tcp
+Bricks:
+Brick1: server1:/bricks/brick
+Brick2: server2:/bricks/brick
+Brick3: server3:/bricks/brick (arbiter)
+Brick4: server4:/bricks/brick
+Brick5: server5:/bricks/brick
+Brick6: server6:/bricks/brick (arbiter)
+Options Reconfigured:
+transport.address-family: inet
+performance.readdir-ahead: on
+```
+
+The arbiter brick will store only the file/directory names (i.e. the tree structure)
+and extended attributes (metadata) but not any data, i.e. the file size
+(as shown by `ls -l`) will be zero bytes. It will also store other gluster
+metadata like the .glusterfs folder and its contents.
+
+_**Note:** Enabling the arbiter feature **automatically** configures_
+_client-quorum to 'auto'. This setting is **not** to be changed._
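+
+For instance, one way to verify this on the volume created above is with
+`gluster volume get` (output shown is representative):
+
+```
+# gluster volume get testvol cluster.quorum-type
+Option                                  Value
+------                                  -----
+cluster.quorum-type                     auto
+```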
+
+## Arbiter brick(s) sizing
+
+Since the arbiter brick does not store file data, its disk usage will be considerably
+smaller than that of the other bricks of the replica. The sizing of the brick will depend on
+how many files you plan to store in the volume. A good estimate is
+4 KB times the number of files in the replica.
+
+# Why Arbiter?
+## Split-brains in replica volumes
 
-Split-brains in replica volumes
-================================
 When a file is in split-brain, there is an inconsistency in either data or
 metadata (permissions, uid/gid, extended attributes etc.) of the file amongst the
 bricks of a replica *and* we do not have enough information to authoritatively
@@ -18,23 +88,23 @@ is also an entry-split brain where a file inside it has different gfids/
 file-type (say one is a file and another is a directory of the same name) across
 the bricks of a replica.
 
-This [document](https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md)
+This [document](https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md)
 describes how to resolve files that are in split-brain using the gluster CLI or the
 mount point. Almost always, split-brains occur due to network disconnects (where
 a client temporarily loses connection to the bricks) and very rarely due to the
 gluster brick processes going down or returning an error.
 
-Server-quorum and some pitfalls
-================================
+## Server-quorum and some pitfalls
+
 This
 [document](http://www.gluster.org/community/documentation/index.php/Features/Server-quorum)
 provides a detailed description of this feature. The volume options for
 server-quorum are:
 
->  Option:cluster.server-quorum-ratio
-    Value Description: 0 to 100
-
-> Option:cluster.server-quorum-type
-   Value Description: none | server
+> Option: cluster.server-quorum-ratio  
+> Value Description: 0 to 100  
+
+> Option: cluster.server-quorum-type  
+> Value Description: none | server  
 
 If set to server, this option enables the specified volume to participate in
 the server-side quorum. If set to none, that volume alone is not considered for volume checks.
@@ -58,51 +128,51 @@ by the client.
 by the client.
 3. We now have different contents for the file in B1 and B2 ==> split-brain.
 
-In the author’s opinion, server quorum is useful if you want to avoid split-brains
+In the author’s opinion, server quorum is useful if you want to avoid split-brain damage
 to the volume(s) configuration across the nodes and not in the I/O path. Unlike in
 client-quorum, where the volume becomes read-only when quorum is lost, loss of
 server-quorum on a particular node makes glusterd kill the brick processes on
 that node (for the participating volumes), making even reads impossible.
 
-Client-quorum
-==============
+## Client Quorum
+
 Client-quorum is a feature implemented in AFR to prevent split-brains in the I/O
 path for replicate/distributed-replicate volumes. By default, if the
 client-quorum is not met for a particular replica subvol, it becomes read-only.
 The other subvols (in a dist-rep volume) will still have R/W access.
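+
+As a rough illustration (hypothetical fuse mount at `/mnt/testvol`, representative output):
+if a replica subvolume loses client-quorum, writes to files on it fail with EROFS while
+reads are still served.
+
+```
+# echo "new data" >> /mnt/testvol/dir1/file1
+bash: /mnt/testvol/dir1/file1: Read-only file system
+
+# cat /mnt/testvol/dir1/file1
+old data
+```
+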
 The following volume set options are used to configure it:
 
->Option: cluster.quorum-type
- Default Value: none
- Value Description: none|auto|fixed
+> Option: cluster.quorum-type  
+> Default Value: none  
+> Value Description: none|auto|fixed  
 
  If set to "fixed", this option allows writes to a file only if the number of
  active bricks in that replica set (to which the file belongs) is greater
- than or equal to the count specified in the 'quorum-count' option.
+ than or equal to the count specified in the 'quorum-count' option.
 
  If set to "auto", this option allows writes to the file only if the number of
- bricks that are up >= ceil (of the total number of bricks that constitute that replica/2).
+ bricks that are up is >= ceil(total number of bricks that constitute that replica / 2).
  If the number of replicas is even, then there is a further check:
  If the number of up bricks is exactly equal to n/2, then the first brick must
  be one of the bricks that is up. If it is more than n/2 then it is not
  necessary that the first brick is one of the up bricks.
-
->Option: cluster.quorum-count
->Value Description:
+
+> Option: cluster.quorum-count  
+> Value Description:  
 The number of bricks that must be active in a replica-set to allow writes.
 This option is used in conjunction with the cluster.quorum-type *=fixed* option to
 specify the number of bricks that must be active to satisfy quorum. If the
 quorum-type is auto, then this option has no significance.
-
-> Option: cluster.quorum-reads
-Default Value: no
-Value Description: yes|no
+
+> Option: cluster.quorum-reads  
+> Default Value: no  
+> Value Description: yes|no  
 If quorum-reads is set to 'yes' (or 'true' or 'on'), then even reads will be allowed
 only if quorum is met, without which both reads and writes will return ENOTCONN.
 If set to 'no' (or 'false' or 'off'), then reads will be served even when quorum
 is not met, but writes will fail with EROFS.
 
-Replica 2 and Replica 3 volumes
-===============================
+## Replica 2 and Replica 3 volumes
+
 From the above descriptions, it is clear that client-quorum cannot really be
 applied to a replica 2 volume (without costing HA).
 If the quorum-type is set to auto, then by the description
@@ -130,66 +200,23 @@ of the write. If 3 writes happen on the same file at non-overlapping {offset, le
 ngth}, and each write fails on (only) one different brick, then we have AFR xattrs of
 the file blaming each other.
 
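+For reference, the client-quorum options described above are applied with the usual
+`gluster volume set` syntax. A hypothetical example on a plain (non-arbiter) replica 3
+volume named `repvol` is shown below; as noted earlier, arbiter volumes should be left
+at the automatically configured `cluster.quorum-type` of 'auto'.
+
+```
+# gluster volume set repvol cluster.quorum-type fixed
+volume set: success
+# gluster volume set repvol cluster.quorum-count 2
+volume set: success
+# gluster volume set repvol cluster.quorum-reads on
+volume set: success
+```
+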
-Arbiter configuration
-======================
-The arbiter configuration a.k.a. the arbiter volume is the perfect sweet spot
-between a 2-way replica and 3-way replica to avoid files getting into split-brain,
-***without the 3x storage space*** as mentioned earlier.
-The syntax for creating the volume is:
->#gluster volume create  replica 3 arbiter 1 host1:brick1 host2:brick2 host3:brick3
-For example:
->#gluster volume create testvol replica 3 arbiter 1 127.0.0.2:/bricks/brick{1..6} force
-volume create: testvol: success: please start the volume to access data
+# How Arbiter works
 
->#gluster volume info
-Volume Name: testvol
-Type: Distributed-Replicate
-Volume ID: ae6c4162-38c2-4368-ae5d-6bad141a4119
-Status: Created
-Number of Bricks: 2 x (2 + 1) = 6
-Transport-type: tcp
-Bricks:
-Brick1: 127.0.0.2:/bricks/brick1
-Brick2: 127.0.0.2:/bricks/brick2
-Brick3: 127.0.0.2:/bricks/brick3 (arbiter)
-Brick4: 127.0.0.2:/bricks/brick4
-Brick5: 127.0.0.2:/bricks/brick5
-Brick6: 127.0.0.2:/bricks/brick6 (arbiter)
-Options Reconfigured :
-transport.address-family: inet
-performance.readdir-ahead: on `
-
-Notice that the 3rd brick of every replica subvol is designated as the arbiter brick.
-This brick will store only the file/directory names (i.e. the tree structure)
-and extended attributes (metadata) but not any data. i.e. the file size
-(as shown by `ls -l`) will be zero bytes. It will also store other gluster
-metadata like the .glusterfs folder and its contents. Since the arbiter volume
-is also a type of replica 3 volume, client-quourm is enabled by default and
-set to 'auto'. This setting is **not** to be changed.
-
-#Arbiter brick(s) sizing:
-Since the arbiter brick does not store file data, its disk usage will be considerably
-lesser than the other bricks of the replica. The sizing of the brick will depend on
-how many files you plan to store in the volume. A good estimate will be
-4kb times the no.of files in the replica.
-
-How it works:
--------------
 There are 2 components to the arbiter volume. One is the arbiter xlator that is
 loaded in the brick process of every 3rd (i.e. the arbiter) brick. The other is the
-arbitration logic itself that is present in AFR (the replicate xlator) loaded
+arbitration logic itself that is present in AFR (the replicate xlator) loaded
 on the clients.
 
-The former acts as a sort of 'filter' translator for the FOPS- i.e. it allows
-entry operations to hit posix, blocks certain inode operations like
-read (unwinds the call with ENOTCONN) and unwinds other inode operations
+The former acts as a sort of 'filter' translator for the FOPS, i.e. it allows
+entry operations to hit posix, blocks certain inode operations like
+read (unwinds the call with ENOTCONN) and unwinds other inode operations
 like write, truncate etc. with success without winding them down to posix.
 
 The latter, i.e. the arbitration logic present in AFR, does the following:
 
 - Takes full file locks when writing to a file as opposed to range locks in a
-normal replicate volume. This prevents the corner-case split-brain described
+normal replicate volume. This prevents the corner-case split-brain described
 earlier for 3 way replicas.
 
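+Because the arbiter xlator never winds writes down to posix, the arbiter's copy of a
+file stays at zero bytes even as the data bricks grow. A rough illustration, assuming a
+(hypothetical) fuse mount at `/mnt/testvol`, the brick layout from the earlier example,
+and that the file hashes to the first replica subvolume (output is representative):
+
+```
+# echo "hello" > /mnt/testvol/file1
+
+# ls -l /bricks/brick/file1      # on server1 or server2 (data bricks)
+-rw-r--r-- 2 root root 6 Jan  1 12:00 /bricks/brick/file1
+
+# ls -l /bricks/brick/file1      # on server3 (arbiter brick)
+-rw-r--r-- 2 root root 0 Jan  1 12:00 /bricks/brick/file1
+```
+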
 The behaviour of arbiter volumes in allowing/failing write FOPS in conjunction
@@ -199,7 +226,7 @@ with client-quorum can be summarized in the below steps:
 
 - If 2 bricks are up and if one of them is the arbiter (i.e. the 3rd brick) *and*
   it blames the other up brick for a given file, then all write FOPS will fail
-  with ENOTCONN. This is because in this scenario, the only true copy is on the
+  with ENOTCONN. This is because, in this scenario, the only true copy is on the
   brick that is down. Hence we cannot allow writes until that brick is also up.
   If the arbiter doesn't blame the other brick, FOPS will be allowed to proceed.
   'Blaming' here is w.r.t. the values of the AFR changelog extended attributes.
@@ -216,9 +243,8 @@ with client-quorum can be summarized in the below steps:
 and B3, i.e. B1 is the only source. Now if for some reason, the second write
 failed on B1 (before there was a chance for self-heal to complete despite all
 bricks being up), the application would receive a failure (ENOTCONN) for that write.
-
+
 The bricks being up or down described above does not necessarily mean the brick
 process is offline. It can also mean the mount lost the connection to the brick
 due to network disconnects etc.
-
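+
+To see which copies AFR currently 'blames', one can list the pending heal entries and
+inspect the AFR changelog xattrs directly on a brick. A rough sketch using the testvol
+example from above (hypothetical file name, trimmed and representative output):
+
+```
+# gluster volume heal testvol info
+Brick server1:/bricks/brick
+/file1
+Status: Connected
+Number of entries: 1
+...
+
+# getfattr -d -m . -e hex /bricks/brick/file1     # run on server1
+# file: bricks/brick/file1
+trusted.afr.testvol-client-1=0x000000010000000000000000
+...
+```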