
[admin-guide] Fix docs and cleanup syntax (2/2)

Signed-off-by: black-dragon74 <niryadav@redhat.com>
black-dragon74
2022-05-25 13:30:58 +05:30
parent c5b766f198
commit a79c006108
30 changed files with 1871 additions and 1717 deletions

View File

@@ -1,43 +1,52 @@
# Coreutils for GlusterFS volumes
The GlusterFS Coreutils is a suite of utilities that aims to mimic the standard Linux coreutils, with the exception that it utilizes the gluster C API in order to do work. It offers an interface similar to that of the ftp program.
Operations include things like getting files from the server to the local machine, putting files from the local machine to the server, retrieving directory information from the server and so on.
## Installation
#### Install GlusterFS
For information on prerequisites, instructions and configuration of GlusterFS, see Installation Guides from <http://docs.gluster.org/en/latest/>.
#### Install glusterfs-coreutils
For now, glusterfs-coreutils is packaged only as an RPM. Other package formats will be supported soon.
##### For Fedora
Use dnf/yum to install glusterfs-coreutils:
```console
dnf install glusterfs-coreutils
```
OR
```console
yum install glusterfs-coreutils
```
## Usage
glusterfs-coreutils provides a set of basic utilities such as cat, cp, flock, ls, mkdir, rm, stat and tail that are implemented specifically using the GlusterFS API, commonly known as libgfapi. These utilities can be used either inside a gluster remote shell or as standalone commands with 'gf' prepended to their respective base names. For example, the glusterfs cat utility is named gfcat, and so on. The exception is the flock utility, for which a standalone gfflock command is not provided (see the notes section on why flock is designed that way).
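For instance, the standalone forms accept gluster URLs of the form `glfs://<host>/<volume>/<path>`; the host, volume and file below are only placeholders:
```console
gfcat glfs://server1/myvol/docs/readme.txt
gfstat glfs://server1/myvol/docs/readme.txt
```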
#### Using coreutils within a remote gluster-shell
##### Invoke a new shell
In order to enter into a gluster client-shell, type _gfcli_ and press enter. You will now be presented with a similar prompt as shown below:
```console
# gfcli
gfcli>
```
See the man page for _gfcli_ for more options.
##### Connect to a gluster volume
Now we need to connect as a client to some glusterfs volume which has already started. Use connect command to do so as follows:
```console
@@ -57,7 +66,8 @@ gfcli (<SERVER IP or HOSTNAME/<VOLNAME>)
```
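For illustration (assuming a volume named vol served from localhost), the connect step and the resulting prompt would look roughly like this:
```console
gfcli> connect glfs://localhost/vol
gfcli (localhost/vol)
```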
##### Try out your favorite utilities
Please go through the man pages for different utilities and available options for each command. For example, _man gfcp_ will display details on the usage of cp command outside or within a gluster-shell. Run different commands as follows:
```console
gfcli (localhost/vol) ls .
@@ -65,6 +75,7 @@ gfcli (localhost/vol) stat .trashcan
```
##### Terminate the client connection from the volume
Use disconnect command to close the connection:
```console
@@ -73,6 +84,7 @@ gfcli>
```
##### Exit from shell
Run quit from shell:
```console
@@ -80,6 +92,7 @@ gfcli> quit
```
#### Using standalone glusterfs coreutil commands
As mentioned above, glusterfs coreutils also provides standalone commands that perform the basic GNU coreutil functionalities. All those commands are prefixed with 'gf'. Instead of invoking a gluster client-shell, you can directly use these to connect and perform the operation in one shot. For example, see the following sample usage of the gfstat command:
```console
@@ -91,5 +104,6 @@ There is an exemption regarding flock coreutility which is not available as a st
For more information on each command and corresponding options see associated man pages.
## Notes
- Within a particular session of the gluster client-shell, the history of commands is preserved, i.e., you can use up/down arrow keys to search through previously executed commands or use reverse history search with Ctrl+R.
- flock is not available as a standalone 'gfflock' because locks are always associated with file descriptors. Unlike all other commands, flock cannot clean up the file descriptor right after acquiring the lock; for flock we need to maintain an active connection as a glusterfs client.

View File

@@ -1,5 +1,4 @@
# Modifying .vol files with a filter
If you need to make manual changes to a .vol file it is recommended to
make these through the client interface ('gluster foo'). Making changes
@@ -7,22 +6,24 @@ directly to .vol files is discouraged, because it cannot be predicted
when a .vol file will be reset on disk, for example with a 'gluster set
foo' command. The command line interface was never designed to read the
.vol files, but rather to keep state and rebuild them (from
`/var/lib/glusterd/vols/$vol/info`). There is, however, another way to
do this.
You can create a shell script in the directory
`/usr/lib*/glusterfs/$VERSION/filter`. All scripts located there will
be executed every time the .vol files are written back to disk. The
first and only argument passed to each script located there is the name
of the .vol file.
So you could create a script there that looks like this:
```console
#!/bin/sh
sed -i 'some-sed-magic' "$1"
```
This will run the script, which in turn will run the sed command on the .vol file (passed as $1).
Importantly, the script needs to be set as executable (e.g. via chmod),
else it won't be run.
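For example (the script name below is just a placeholder):
```console
chmod +x /usr/lib*/glusterfs/$VERSION/filter/myfilter.sh
```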

View File

@@ -1,30 +1,24 @@
# What is Gluster ?
Gluster is a scalable, distributed file system that aggregates disk storage resources from multiple servers into a single global namespace.
### Advantages
- Scales to several petabytes
- Handles thousands of clients
- POSIX compatible
- Uses commodity hardware
- Can use any on-disk filesystem that supports extended attributes
- Accessible using industry standard protocols like NFS and SMB
- Provides replication, quotas, geo-replication, snapshots and bitrot detection
- Allows optimization for different workloads
- Open Source
![640px-glusterfs_architecture](../images/640px-GlusterFS-Architecture.png)
Enterprises can scale capacity, performance, and availability on demand, with no vendor lock-in, across on-premise, public cloud, and hybrid environments.
Gluster is used in production at thousands of organisations spanning media, healthcare, government, education, web 2.0, and financial services.
### Commercial offerings and support
Several companies offer support or [consulting](https://www.gluster.org/support/).

View File

@@ -12,131 +12,175 @@ These docs are largely derived from:
[`http://fedoraproject.org/wiki/Getting_started_with_OpenStack_on_Fedora_17#Initial_Keystone_setup`](http://fedoraproject.org/wiki/Getting_started_with_OpenStack_on_Fedora_17#Initial_Keystone_setup)
Add the RDO OpenStack Grizzly and EPEL repos:
```console
sudo yum install -y "http://dl.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm"
sudo yum install -y "http://rdo.fedorapeople.org/openstack/openstack-grizzly/rdo-release-grizzly-1.noarch.rpm"
```
Install OpenStack Keystone:
```console
sudo yum install openstack-keystone openstack-utils python-keystoneclient
```
Configure keystone:
```console
$ cat > keystonerc << _EOF
export ADMIN_TOKEN=$(openssl rand -hex 10)
export OS_USERNAME=admin
export OS_PASSWORD=$(openssl rand -hex 10)
export OS_TENANT_NAME=admin
export OS_AUTH_URL=https://127.0.0.1:5000/v2.0/
export SERVICE_ENDPOINT=https://127.0.0.1:35357/v2.0/
export SERVICE_TOKEN=\$ADMIN_TOKEN
_EOF
$ . ./keystonerc
$ sudo openstack-db --service keystone --init
```
Append the keystone configs to /etc/swift/proxy-server.conf
```console
$ sudo -i
# cat >> /etc/swift/proxy-server.conf << _EOM
[filter:keystone]
use = egg:swift#keystoneauth
operator_roles = admin, swiftoperator
[filter:authtoken]
paste.filter_factory = keystoneclient.middleware.auth_token:filter_factory
auth_port = 35357
auth_host = 127.0.0.1
auth_protocol = https
_EOM
# exit
```
Finish configuring both swift and keystone using the command-line tool:
```console
sudo openstack-config --set /etc/swift/proxy-server.conf filter:authtoken admin_token $ADMIN_TOKEN
sudo openstack-config --set /etc/swift/proxy-server.conf filter:authtoken auth_token $ADMIN_TOKEN
sudo openstack-config --set /etc/swift/proxy-server.conf DEFAULT log_name proxy_server
sudo openstack-config --set /etc/swift/proxy-server.conf filter:authtoken signing_dir /etc/swift
sudo openstack-config --set /etc/swift/proxy-server.conf pipeline:main pipeline "healthcheck cache authtoken keystone proxy-server"
sudo openstack-config --set /etc/keystone/keystone.conf DEFAULT admin_token $ADMIN_TOKEN
sudo openstack-config --set /etc/keystone/keystone.conf ssl enable True
sudo openstack-config --set /etc/keystone/keystone.conf ssl keyfile /etc/swift/cert.key
sudo openstack-config --set /etc/keystone/keystone.conf ssl certfile /etc/swift/cert.crt
sudo openstack-config --set /etc/keystone/keystone.conf signing token_format UUID
sudo openstack-config --set /etc/keystone/keystone.conf sql connection mysql://keystone:keystone@127.0.0.1/keystone
```
Configure keystone to start at boot and start it up.
```console
sudo chkconfig openstack-keystone on
sudo service openstack-keystone start # If you script this, you'll want to wait a few seconds to start using it
```
We are using untrusted certs, so tell keystone not to complain. If you replace with trusted certs, or are not using SSL, set this to "".
```console
INSECURE="--insecure"
```
Create the keystone and swift services in keystone:
```console
KS_SERVICEID=$(keystone $INSECURE service-create --name=keystone --type=identity --description="Keystone Identity Service" | grep " id " | cut -d "|" -f 3)
SW_SERVICEID=$(keystone $INSECURE service-create --name=swift --type=object-store --description="Swift Service" | grep " id " | cut -d "|" -f 3)
endpoint="https://127.0.0.1:443"
keystone $INSECURE endpoint-create --service_id $KS_SERVICEID \
  --publicurl $endpoint'/v2.0' --adminurl https://127.0.0.1:35357/v2.0 \
  --internalurl https://127.0.0.1:5000/v2.0
keystone $INSECURE endpoint-create --service_id $SW_SERVICEID \
  --publicurl $endpoint'/v1/AUTH_$(tenant_id)s' \
  --adminurl $endpoint'/v1/AUTH_$(tenant_id)s' \
  --internalurl $endpoint'/v1/AUTH_$(tenant_id)s'
```
Create the admin tenant:
```console
admin_id=$(keystone $INSECURE tenant-create --name admin --description "Internal Admin Tenant" | grep id | awk '{print $4}')
```
Create the admin roles:
```console
admin_role=$(keystone $INSECURE role-create --name admin | grep id | awk '{print $4}')
ksadmin_role=$(keystone $INSECURE role-create --name KeystoneServiceAdmin | grep id | awk '{print $4}')
kadmin_role=$(keystone $INSECURE role-create --name KeystoneAdmin | grep id | awk '{print $4}')
member_role=$(keystone $INSECURE role-create --name member | grep id | awk '{print $4}')
```
Create the admin user:
```console
user_id=$(keystone $INSECURE user-create --name admin --tenant-id $admin_id --pass $OS_PASSWORD | grep id | awk '{print $4}')
keystone $INSECURE user-role-add --user-id $user_id --tenant-id $admin_id \
  --role-id $admin_role
keystone $INSECURE user-role-add --user-id $user_id --tenant-id $admin_id \
  --role-id $kadmin_role
keystone $INSECURE user-role-add --user-id $user_id --tenant-id $admin_id \
  --role-id $ksadmin_role
```
If you do not have multi-volume support (broken in 3.3.1-11), then the volume names will not correlate to the tenants, and all tenants will map to the same volume, so just use a normal name. (This will be fixed in 3.4, and should be fixed in 3.4 Beta. The bug report for this is here: <https://bugzilla.redhat.com/show_bug.cgi?id=924792>)
```console
volname="admin"
# or if you have the multi-volume patch
volname=$admin_id
```
Create and start the admin volume:
```console
sudo gluster volume create $volname $myhostname:$pathtobrick
sudo gluster volume start $volname
sudo service openstack-keystone start
```
Create the ring for the admin tenant. If you have working multi-volume support, then you can specify multiple volume names in the call:
```console
cd /etc/swift
sudo /usr/bin/gluster-swift-gen-builders $volname
sudo swift-init main restart
```
Create a testadmin user associated with the admin tenant with password testadmin and admin role:
```console
user_id=$(keystone $INSECURE user-create --name testadmin --tenant-id $admin_id --pass testadmin | grep id | awk '{print $4}')
keystone $INSECURE user-role-add --user-id $user_id --tenant-id $admin_id \
  --role-id $admin_role
```
Test the user:
```console
curl $INSECURE -d '{"auth":{"tenantName": "admin", "passwordCredentials":{"username": "testadmin", "password": "testadmin"}}}' -H "Content-type: application/json" "https://127.0.0.1:5000/v2.0/tokens"
```
See here for more examples:

View File

@@ -1,11 +1,10 @@
# GlusterFS iSCSI
## Introduction
iSCSI on Gluster can be set up using the Linux target driver. This is a user space daemon that accepts iSCSI (as well as iSER and FCoE) traffic. It interprets iSCSI CDBs and converts them into some other I/O operation, according to user configuration. In our case, we can convert the CDBs into file operations that run against a gluster file. The file represents the LUN, and the offset in the file the LBA.
A plug-in for the Linux target driver has been written to use the libgfapi. It is part of the Linux target driver (bs_glfs.c). Using it, the datapath skips FUSE. This document will be updated to describe how to use it. You can see README.glfs in the Linux target driver's documentation subdirectory.
LIO is a replacement for the Linux Target Driver that is included in RHEL7. A user-space plug-in mechanism for it is under development. Once that piece of code exists a similar mechanism can be built for gluster as was done for the Linux target driver.
@@ -17,18 +16,24 @@ For more information on iSCSI and the Linux target driver, see [1] and [2].
Mount gluster locally on your gluster server. Note you can also run it on the gluster client. There are pros and cons to these configurations, described [below](#Running_the_target_on_the_gluster_client "wikilink").
```console
mount -t glusterfs 127.0.0.1:gserver /mnt
```
Create a large file representing your block device within the gluster fs. In this case, the lun is 2G. (_You could also create a gluster "block device" for this purpose, which would skip the file system_).
```console
dd if=/dev/zero of=disk3 bs=2G count=25
```
Create a target using the file as the backend storage.
If necessary, download the Linux SCSI target. Then start the service.
```console
yum install scsi-target-utils
service tgtd start
```
You must give an iSCSI Qualified Name (IQN), in the format: iqn.yyyy-mm.reversed.domain.name:OptionalIdentifierText
@@ -36,41 +41,57 @@ where:
yyyy-mm represents the 4-digit year and 2-digit month the device was started (for example: 2011-07)
```console
tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2013-10.com.redhat
```
You can look at the target:
```console
# tgtadm --lld iscsi --op show --mode conn --tid 1
Session: 11  Connection: 0     Initiator iqn.1994-05.com.redhat:cf75c8d4274d
```
Next, add a logical unit to the target
```console
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /mnt/disk3
```
Allow any initiator to access the target.
```console
tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL
```
Now it's time to set up your client.
Discover your targets. Note in this example's case, the target IP address is 192.168.1.2
```console
iscsiadm --mode discovery --type sendtargets --portal 192.168.1.2
```
Login to your target session.
```console
iscsiadm --mode node --targetname iqn.2001-04.com.example:storage.disk1.amiens.sys1.xyz --portal 192.168.1.2:3260 --login
```
You should now have a new SCSI disk. You will see it logged in /var/log/messages and listed by lsblk.
You can send I/O to it:
```console
dd if=/dev/zero of=/dev/sda bs=4K count=100
```
To tear down your iSCSI connection:
```console
iscsiadm -m node -T iqn.2001-04.com.redhat -p 172.17.40.21 -u
```
## Running the iSCSI target on the gluster client

View File

@@ -10,7 +10,6 @@ different restrictions on different levels in the stack. The explanations in
this document should clarify which restrictions exist, and how these can be
handled.
## tl;dr
- if users belong to more than 90 groups, the brick processes need to resolve
@@ -25,7 +24,6 @@ For all of the above options counts that the system doing the group resolving
must be configured (`nsswitch`, `sssd`, ..) to be able to get all groups when
only a UID is known.
## Limit in the GlusterFS protocol
When a Gluster client does some action on a Gluster volume, the operation is
@@ -52,7 +50,6 @@ use the POSIX `getgrouplist()` function to fetch them.
Because this is a protocol limitation, all clients, including FUSE mounts,
Gluster/NFS server and libgfapi applications are affected by this.
## Group limit with FUSE
The FUSE client gets the groups of the process that does the I/O by reading the
@@ -64,7 +61,6 @@ For that reason a mount option has been added. With the `resolve-gids` mount
option, the FUSE client calls the POSIX `getgrouplist()` function instead of
reading `/proc/$pid/status`.
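For example (server, volume and mount point are placeholders), such a mount could look like this:
```console
mount -t glusterfs -o resolve-gids server1:/myvol /mnt/myvol
```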
## Group limit for NFS
The NFS protocol (actually the AUTH_SYS/AUTH_UNIX RPC header) allows up to 16
@@ -78,7 +74,6 @@ Other NFS-servers offer options like this too. The Linux kernel nfsd server
uses `rpc.mountd --manage-gids`. NFS-Ganesha has the configuration option
`Manage_Gids`.
## Implications of these solutions
All of the mentioned options are disabled by default. One of the reasons is

View File

@@ -1,63 +1,70 @@
# Managing GlusterFS Volume Life-Cycle Extensions with Hook Scripts
Glusterfs allows automation of operations by user-written scripts. For every operation, you can execute a _pre_ and a _post_ script.
### Pre Scripts
These scripts are run before the occurrence of the event. You can write a script to automate activities like managing system-wide services. For example, you can write a script to stop exporting the SMB share corresponding to the volume before you stop the volume.
### Post Scripts
These scripts are run after execution of the event. For example, you can write a script to export the SMB share corresponding to the volume after you start the volume.
You can run scripts for the following events:
- Creating a volume
- Starting a volume
- Adding a brick
- Removing a brick
- Tuning volume options
- Stopping a volume
- Deleting a volume
### Naming Convention
While creating the file names of your scripts, you must follow the naming convention of your underlying file system, such as XFS.
> Note: To enable the script, the name of the script must start with an S. Scripts run in lexicographic order of their names.
### Location of Scripts
This section provides information on the folders where the scripts must be placed. When you create a trusted storage pool, the following directories are created:
- `/var/lib/glusterd/hooks/1/create/`
- `/var/lib/glusterd/hooks/1/delete/`
- `/var/lib/glusterd/hooks/1/start/`
- `/var/lib/glusterd/hooks/1/stop/`
- `/var/lib/glusterd/hooks/1/set/`
- `/var/lib/glusterd/hooks/1/add-brick/`
- `/var/lib/glusterd/hooks/1/remove-brick/`
After creating a script, you must ensure to save the script in its respective folder on all the nodes of the trusted storage pool. The location of the script dictates whether the script must be executed before or after an event. Scripts are provided with the command line argument `--volname=VOLNAME` to specify the volume. Command-specific additional arguments are provided for the following volume operations:
```{ .text .no-copy }
Start volume
--first=yes, if the volume is the first to be started
--first=no, for otherwise
Stop volume
--last=yes, if the volume is to be stopped last.
--last=no, for otherwise
Set volume
-o key=value
For every key, value is specified in volume set command.
```
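As an illustration, a minimal hypothetical post-start hook could look like the following; the file name, log path and logic are only examples, but the name must start with an S and the script must be executable:
```console
#!/bin/bash
# Hypothetical example: /var/lib/glusterd/hooks/1/start/post/S99-log-start.sh
# glusterd passes arguments such as --volname=VOLNAME and --first=yes/no.
for arg in "$@"; do
    case "$arg" in
        --volname=*) volname="${arg#--volname=}" ;;
    esac
done
echo "$(date): volume ${volname} started (args: $*)" >> /var/log/glusterfs/hook-example.log
```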
### Prepackaged Scripts
Gluster provides scripts to export Samba (SMB) share when you start a volume and to remove the share when you stop the volume. These scripts are available at: `/var/lib/glusterd/hooks/1/start/post` and `/var/lib/glusterd/hooks/1/stop/pre`. By default, the scripts are enabled.
When you start a volume using `gluster volume start VOLNAME`, the S30samba-start.sh script performs the following:
- Adds Samba share configuration details of the volume to the smb.conf file
- Mounts the volume through FUSE and adds an entry in /etc/fstab for the same.
- Restarts Samba to run with updated configuration
When you stop the volume using `gluster volume stop VOLNAME`, the S30samba-stop.sh script performs the following:
- Removes the Samba share details of the volume from the smb.conf file
- Unmounts the FUSE mount point and removes the corresponding entry in
/etc/fstab
- Restarts Samba to run with updated configuration

View File

@@ -1,5 +1,4 @@
## Linux kernel tuning for GlusterFS
Every now and then, questions come up here internally and with many
enthusiasts on what Gluster has to say about kernel tuning, if anything.
@@ -52,18 +51,18 @@ from the user for their own applications. Heavily loaded, streaming apps
should set this value to '0'. By changing this value to '0', the
system's responsiveness improves.
### vm.vfs_cache_pressure
This option controls the tendency of the kernel to reclaim the memory
which is used for caching of directory and inode objects.
At the default value of vfs_cache_pressure=100 the kernel will attempt
to reclaim dentries and inodes at a "fair" rate with respect to
pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes
the kernel to prefer to retain dentry and inode caches. When
vfs_cache_pressure=0, the kernel will never reclaim dentries and
inodes due to memory pressure and this can easily lead to out-of-memory
conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel
to prefer to reclaim dentries and inodes.
With GlusterFS, many users with a lot of storage and many small files
@@ -73,18 +72,18 @@ keeps crawling through data-structures on a 40GB RAM system. Changing
this value higher than 100 has helped many users to achieve fair caching
and more responsiveness from the kernel.
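For example, to raise it above the default at runtime (the value shown is only illustrative):
```console
sysctl -w vm.vfs_cache_pressure=200
```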
### vm.dirty_background_ratio
### vm.dirty_ratio
The first of the two (vm.dirty_background_ratio) defines the
percentage of memory that can become dirty before a background flushing
of the pages to disk starts. Until this percentage is reached no pages
are flushed to disk. However when the flushing starts, then it's done in
the background without disrupting any of the running processes in the
foreground.
Now the second of the two parameters (vm.dirty_ratio) defines the
percentage of memory which can be occupied by dirty pages before a
forced flush starts. If the percentage of dirty pages reaches this
threshold, then all processes become synchronous, and they are not
@@ -124,14 +123,14 @@ performance. You can read more about them in the Linux kernel source
documentation: linux/Documentation/block/*iosched.txt. I have also
seen 'read' throughput increase during mixed-operations (many writes).
### "256" \> /sys/block/sdc/queue/nr\_requests
### "256" \> /sys/block/sdc/queue/nr_requests
This is the size of I/O requests which are buffered before they are
communicated to the disk by the Scheduler. The internal queue size of
some controllers (queue_depth) is larger than the I/O scheduler's
nr_requests so that the I/O scheduler doesn't get much of a chance to
properly order and merge the requests. Deadline or CFQ scheduler likes
to have nr_requests to be set 2 times the value of queue_depth, which
is the default for a given controller. Merging the order and requests
helps the scheduler to be more responsive during huge load.
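For example, the setting from the heading above can be applied at runtime like this (replace sdc with your device):
```console
echo "256" > /sys/block/sdc/queue/nr_requests
```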
@@ -144,7 +143,7 @@ after you have used swappiness=0, but if you defined swappiness=10 or
20, then using this value helps when you have a RAID stripe size of
64k.
### blockdev --setra 4096 /dev/<devname> (e.g. sdb, hdc or dev_mapper)
Default block device settings often result in terrible performance for
many RAID controllers. Adding the above option, which sets read-ahead to
@@ -183,94 +182,94 @@ issues.
More informative and interesting articles/emails/blogs to read
- <http://dom.as/2008/02/05/linux-io-schedulers/>
- <http://www.nextre.it/oracledocs/oraclemyths.html>
- <https://lkml.org/lkml/2006/11/15/40>
- <http://misterd77.blogspot.com/2007/11/3ware-hardware-raid-vs-linux-software.html>
`Last updated by:`[`User:y4m4`](User:y4m4 "wikilink")
### comment:jdarcy
Some additional tuning ideas:
` * The choice of scheduler is *really* hardware- and workload-dependent, and some schedulers have unique features other than performance. For example, last time I looked cgroups support was limited to the cfq scheduler. Different tests regularly do best on any of cfq, deadline, or noop. The best advice here is not to use a particular scheduler but to try them all for a specific need.`
` * It's worth checking to make sure that /sys/.../max_sectors_kb matches max_hw_sectors_kb. I haven't seen this problem for a while, but back when I used to work on Lustre I often saw that these didn't match and performance suffered.`
` * For read-heavy workloads, experimenting with /sys/.../readahead_kb is definitely worthwhile.`
` * Filesystems should be built with -I 512 or similar so that more xattrs can be stored in the inode instead of requiring an extra seek.`
` * Mounting with noatime or relatime is usually good for performance.`
#### reply:y4m4
`Agreed i was about write those parameters you mentioned. I should write another elaborate article on FS changes.`
y4m4
### comment:eco
` 1 year ago`\
` This article is the model on which all articles should be written. Detailed information, solid examples and a great selection of references to let readers go more in depth on topics they choose. Great benchmark for others to strive to attain.`\
` Eco`\
### comment:y4m4
`sysctl -w net.core.{r,w}mem_max = 4096000 - this helped us to Reach 800MB/sec with replicated GlusterFS on 10gige - Thanks to Ben England for these test results.`\
` y4m4`
### comment:bengland
` After testing Gluster 3.2.4 performance with RHEL6.1, I'd suggest some changes to this article's recommendations:`
` vm.swappiness=10 not 0 -- I think 0 is a bit extreme and might lead to out-of-memory conditions, but 10 will avoid just about all paging/swapping. If you still see swapping, you need to probably focus on restricting dirty pages with vm.dirty_ratio.`
` vfs_cache_pressure > 100 -- why? I thought this was a percentage.`
`vm.pagecache=1 -- some distros (e.g. RHEL6) don't have vm.pagecache parameter.`
`vm.dirty_background_ratio=1 not 10 (kernel default?) -- the kernel default is a bit dependent on choice of Linux distro, but for most workloads it's better to set this parameter very low to cause Linux to push dirty pages out to storage sooner. It means that if dirty pages exceed 1% of RAM then it will start to asynchronously write dirty pages to storage. The only workload where this is really bad: apps that write temp files and then quickly delete them (compiles) -- and you should probably be using local storage for such files anyway.`
`Choice of vm.dirty_ratio is more dependent upon the workload, but in other contexts I have observed that response time fairness and stability is much better if you lower dirty ratio so that it doesn't take more than 2-5 seconds to flush all dirty pages to storage.`
` block device parameters:`
` I'm not aware of any case where cfq scheduler actually helps Gluster server. Unless server I/O threads correspond directly to end-users, I don't see how cfq can help you. Deadline scheduler is a good choice. I/O request queue has to be deep enough to allow scheduler to reorder requests to optimize away disk seeks. The parameters max_sectors_kb and nr_requests are relevant for this. For read-ahead, consider increasing it to the point where you prefetch for longer period of time than a disk seek (on order of 10 msec), so that you can avoid unnecessary disk seeks for multi-stream workloads. This comes at the expense of I/O latency so don't overdo it.`
` network:`
` jumbo frames can increase throughput significantly for 10-GbE networks.`
` Raise net.core.{r,w}mem_max to 540000 from default of 131071 (not 4 MB above, my previous recommendation). Gluster 3.2 does setsockopt() call to use 1/2 MB mem for TCP socket buffer space.`\
` bengland`\
### comment:hjmangalam
` Thanks very much for noting this info - the descriptions are VERY good.. I'm in the midst of debugging a misbehaving gluster that can't seem to handle small writes over IPoIB and this contains some useful pointers.`
` Some suggestions that might make this more immediately useful:`
` - I'm assuming that this discussion refers to the gluster server nodes, not to the gluster native client nodes, yes? If that's the case, are there are also kernel parameters or recommended settings for the client nodes?`\
`- While there are some cases where you mention that a value should be changed to a particular # or %, in a number of cases you advise just increasing/decreasing the values, which for something like a kernel parameter is probably not a useful suggestion. Do I raise it by 10? 10% 2x? 10x?`
` I also ran across a complimentary page, which might be of interest - it explains more of the vm variables, especially as it relates to writing.`\
`"Theory of Operation and Tuning for Write-Heavy Loads"`\
` `` and refs therein.`
` hjmangalam`
### comment:bengland
` Here are some additional suggestions based on recent testing:`\
` - scaling out number of clients -- you need to increase the size of the ARP tables on Gluster server if you want to support more than 1K clients mounting a gluster volume. The defaults for RHEL6.3 were too low to support this, we used this:`
` net.ipv4.neigh.default.gc_thresh2 = 2048`\
` net.ipv4.neigh.default.gc_thresh3 = 4096`
` In addition, tunings common to webservers become relevant at this number of clients as well, such as netdev_max_backlog, tcp_fin_timeout, and somaxconn.`
` Bonding mode 6 has been observed to increase replication write performance, I have no experience with bonding mode 4 but it should work if switch is properly configured, other bonding modes are a waste of time.`
` bengland`\
` 3 months ago`

View File

@@ -8,37 +8,41 @@ Below lists the component, services, and functionality based logs in the Gluster
glusterd logs are located at `/var/log/glusterfs/glusterd.log`. One glusterd log file per server. This log file also contains the snapshot and user logs.
## Gluster cli command:
gluster cli logs are located at `/var/log/glusterfs/cli.log`. Gluster commands executed on a node in a GlusterFS Trusted Storage Pool are logged in `/var/log/glusterfs/cmd_history.log`.
## Bricks:
Bricks logs are located at `/var/log/glusterfs/bricks/<path extraction of brick path>.log`. One log file per brick on the server.
## Rebalance:
rebalance logs are located at `/var/log/glusterfs/VOLNAME-rebalance.log`. One log file per volume on the server.
## Self heal daemon:
Self heal daemon logs are located at `/var/log/glusterfs/glustershd.log`. One log file per server.
## Quota:
`/var/log/glusterfs/quotad.log` is the log of the quota daemons running on each node.
`/var/log/glusterfs/quota-crawl.log` Whenever quota is enabled, a file system crawl is performed and the corresponding log is stored in this file.
`/var/log/glusterfs/quota-mount-VOLNAME.log` An auxiliary FUSE client is mounted in <gluster-run-dir>/VOLNAME of glusterFS and the corresponding client logs are found in this file. One log file per server and per volume from quota-mount.
## Gluster NFS:
`/var/log/glusterfs/nfs.log`. One log file per server.
## SAMBA Gluster:
`/var/log/samba/glusterfs-VOLNAME-<ClientIP>.log`. If the client mounts this on a glusterFS server node, the actual log file or the mount point may not be found. In such a case, the mount outputs of all the glusterFS type mount operations need to be considered.
## Ganesha NFS:
`/var/log/nfs-ganesha.log`
## FUSE Mount:
`/var/log/glusterfs/<mountpoint path extraction>.log`
## Geo-replication:
@@ -47,10 +51,13 @@ self heal deamon are logged at `/var/log/glusterfs/glustershd.log`. One log f
`/var/log/glusterfs/geo-replication-secondary `
## Gluster volume heal VOLNAME info command:
`/var/log/glusterfs/glfsheal-VOLNAME.log`. One log file per server on which the command is executed.
## Gluster-swift:
`/var/log/messages`
## SwiftKrbAuth:
`/var/log/httpd/error_log `

View File

@@ -9,15 +9,14 @@ GlusterFS volume snapshot feature is based on thinly provisioned LVM snapshot.
To make use of snapshot feature GlusterFS volume should fulfill following
pre-requisites:
- Each brick should be on an independent thinly provisioned LVM.
- Brick LVM should not contain any data other than the brick.
- None of the bricks should be on a thick LVM.
- Gluster version should be 3.6 or above.
Details of how to create a thin volume can be found at the following link.
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Logical_Volume_Manager_Administration/LV.html#thinly_provisioned_volume_creation
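As a rough sketch only (volume group, pool, sizes and mount point below are placeholders), a thinly provisioned LV for a brick could be prepared like this:
```console
# Create a thin pool inside an existing volume group, then a thin LV for the brick
lvcreate --size 100G --thin myvg/mythinpool
lvcreate --virtualsize 90G --thin myvg/mythinpool --name brick1_lv
mkfs.xfs -i size=512 /dev/myvg/brick1_lv
mount /dev/myvg/brick1_lv /bricks/brick1
```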
## A few features of snapshot are:
**Crash Consistency**
@@ -26,13 +25,11 @@ when a snapshot is taken at a particular point-in-time, it is made sure that
the taken snapshot is crash consistent. When the taken snapshot is restored,
the data is identical to what it was at the time of taking the snapshot.
**Online Snapshot**
When the snapshot is being taken the file system and its associated data
continue to be available for the clients.
**Barrier**
During snapshot creation some of the fops are blocked to guarantee crash
@@ -95,7 +92,7 @@ gluster snapshot delete (all | <snapname> | volume <volname>)
If snapname is specified then the mentioned snapshot is deleted.
If volname is specified then all snapshots belonging to that particular
volume are deleted. If keyword _all_ is used then all snapshots belonging
to the system are deleted.
### Listing of available snaps
@@ -104,7 +101,7 @@ to the system is deleted.
gluster snapshot list [volname]
```
Lists all snapshots taken.
If volname is provided, then only the snapshots belonging to
that particular volume are listed.
@@ -125,14 +122,14 @@ for that particular volume, and the state of the snapshot.
gluster snapshot status [(snapname | volume <volname>)]
```
This command gives status of the snapshot.
The details included are snapshot brick path, volume group (LVM details),
status of the snapshot bricks, PID of the bricks, data percentage filled for
that particular volume group to which the snapshots belong, and total size
of the logical volume.
If snapname is specified then the status of the mentioned snapshot is displayed.
If volname is specified then the status of all snapshots belonging to that volume
is displayed. If both snapname and volname are not specified then the status of all
the snapshots present in the system is displayed.
@@ -146,15 +143,15 @@ snapshot config [volname] ([snap-max-hard-limit <count>] [snap-max-soft-limit <p
Displays and sets the snapshot config values.
snapshot config without any keywords displays the snapshot config values of
all volumes in the system. If volname is provided, then the snapshot config
values of that volume are displayed.
The snapshot config command along with keywords can be used to change the existing
config values. If volname is provided then the config value of that volume is
changed, else it will set/change the system limit.
snap-max-soft-limit and auto-delete are global options, that will be
inherited by all volumes in the system and cannot be set to individual volumes.
The system limit takes precedence over the volume specific limit.
@@ -162,7 +159,7 @@ The system limit takes precedence over the volume specific limit.
When auto-delete feature is enabled, then upon reaching the soft-limit,
with every successful snapshot creation, the oldest snapshot will be deleted.
When auto-delete feature is disabled, then upon reaching the soft-limit,
the user gets a warning with every successful snapshot creation.
Upon reaching the hard-limit, further snapshot creations will not be allowed.
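For example (the volume name and values below are placeholders), the limits and auto-delete can be set like this:
```console
gluster snapshot config vol1 snap-max-hard-limit 100
gluster snapshot config snap-max-soft-limit 80
gluster snapshot config auto-delete enable
```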
@@ -192,7 +189,7 @@ Deactivates the mentioned snapshot.
Snapshots can be accessed in 2 ways.
1. Mounting the snapshot:
The snapshot can be accessed via FUSE mount (only fuse). To do that it has to be
mounted first. A snapshot can be mounted via fuse by below command
@@ -202,10 +199,9 @@ Snapshots can be accessed in 2 ways.
i.e. say "host1" is one of the peers. Let "vol" be the volume name and "my-snap"
be the snapshot name. In this case a snapshot can be mounted via this command
mount -t glusterfs host1:/snaps/my-snap/vol /mnt/snapshot
2. User serviceability:
Apart from the above method of mounting the snapshot, a list of available
snapshots and the contents of each snapshot can be viewed from any of the mount
@@ -226,7 +222,7 @@ Snapshots can be accessed in 2 ways.
directory entries. They represent the state of the directory from which .snaps
was entered, at different points in time.
**NOTE**: Access to the snapshots is read-only. The snapshot needs to be
activated for it to be accessible inside .snaps directory.
Also, the name of the hidden directory (or the access point to the snapshot
@@ -234,7 +230,7 @@ Snapshots can be accessed in 2 ways.
gluster volume set <volname> snapshot-directory <new-name>
3. Accessing from windows:
The glusterfs volumes can be made accessible by windows via samba. (the
glusterfs plugin for samba helps achieve this, without having to re-export
@@ -242,11 +238,12 @@ Snapshots can be accessed in 2 ways.
also be viewed in the windows explorer.
There are 2 ways:
- Give the path of the entry point directory
(`<hostname><samba-share><directory><entry-point path>`) in the run command
window
- Go to the samba share via windows explorer. Make hidden files and folders
visible so that in the root of the samba share a folder icon for the entry point
can be seen.
@@ -256,28 +253,28 @@ the path should be provided in the run command window.
For snapshots to be accessible from windows, below 2 options can be used.
1. The glusterfs plugin for samba should give the option "snapdir-entry-path"
while starting. The option is an indication to glusterfs that samba is loading
it, and the value of the option should be the path that is being used as the
share for windows.
Ex: Say there is a glusterfs volume, and a directory called "export" from the
root of the volume is being used as the samba share; then samba has to load
glusterfs with this option as well.
ret = glfs_set_xlator_option(
    fs,
    "*-snapview-client",
    "snapdir-entry-path", "/export"
);
The xlator option "snapdir-entry-path" is not exposed via volume set options and
cannot be changed from the CLI. It's an option that has to be provided at the time of
mounting glusterfs or when samba loads glusterfs.
2. The accessibility of snapshots via root of the samba share from windows
is configurable. By default it is turned off. It is a volume set option which can
be changed via CLI.
`gluster volume set <volname> features.show-snapshot-directory <on/off>`. By
default it is off.

View File

@@ -15,6 +15,7 @@ operations, including the following:
- [Non Uniform File Allocation (NUFA)](#non-uniform-file-allocation)
<a name="configuring-transport-types-for-a-volume"></a>
## Configuring Transport Types for a Volume
A volume can support one or more transport types for communication between clients and brick processes.
@@ -24,21 +25,22 @@ To change the supported transport types of a volume, follow the procedure:
1. Unmount the volume on all the clients using the following command:
umount mount-point
2. Stop the volume using the following command:
gluster volume stop <VOLNAME>
3. Change the transport type. For example, to enable both tcp and rdma, execute the following command:
gluster volume set test-volume config.transport tcp,rdma OR tcp OR rdma
4. Mount the volume on all the clients. For example, to mount using rdma transport, use the following command:
mount -t glusterfs -o transport=rdma server1:/test-volume /mnt/glusterfs
<a name="expanding-volumes"></a>
## Expanding Volumes
You can expand volumes, as needed, while the cluster is online and
@@ -49,8 +51,7 @@ of the GlusterFS volume.
Similarly, you might want to add a group of bricks to a distributed
replicated volume, increasing the capacity of the GlusterFS volume.
> **Note**
> When expanding distributed replicated and distributed dispersed volumes,
> you need to add a number of bricks that is a multiple of the replica
> or disperse count. For example, to expand a distributed replicated
@@ -62,7 +63,7 @@ replicated volume, increasing the capacity of the GlusterFS volume.
1. If they are not already part of the TSP, probe the servers which contain the bricks you
want to add to the volume using the following command:
gluster peer probe <SERVERNAME>
For example:
@@ -71,7 +72,7 @@ replicated volume, increasing the capacity of the GlusterFS volume.
2. Add the brick using the following command:
gluster volume add-brick <VOLNAME> <NEW-BRICK>
For example:
@@ -80,7 +81,7 @@ replicated volume, increasing the capacity of the GlusterFS volume.
3. Check the volume information using the following command:
gluster volume info <VOLNAME>
The command displays information similar to the following:
@@ -100,14 +101,14 @@ replicated volume, increasing the capacity of the GlusterFS volume.
You can use the rebalance command as described in [Rebalancing Volumes](#rebalancing-volumes)
<a name="shrinking-volumes"></a>
## Shrinking Volumes
You can shrink volumes, as needed, while the cluster is online and
available. For example, you might need to remove a brick that has become
inaccessible in a distributed volume due to hardware or network failure.
> **Note**
> Data residing on the brick that you are removing will no longer be
> accessible at the Gluster mount point. Note however that only the
> configuration information is removed - you can continue to access the
@@ -128,7 +129,7 @@ operation to migrate data from the removed-bricks to the rest of the volume.
1. Remove the brick using the following command:
gluster volume remove-brick <VOLNAME> <BRICKNAME> start
For example, to remove server2:/exp2:
@@ -138,7 +139,7 @@ operation to migrate data from the removed-bricks to the rest of the volume.
2. View the status of the remove brick operation using the
following command:
gluster volume remove-brick <VOLNAME> <BRICKNAME> status
For example, to view the status of remove brick operation on
server2:/exp2 brick:
@@ -150,7 +151,7 @@ operation to migrate data from the removed-bricks to the rest of the volume.
3. Once the status displays "completed", commit the remove-brick operation
gluster volume remove-brick <VOLNAME> <BRICKNAME> commit
In this example:
@@ -162,7 +163,7 @@ operation to migrate data from the removed-bricks to the rest of the volume.
4. Check the volume information using the following command:
gluster volume info
The command displays information similar to the following:
@@ -176,15 +177,15 @@ operation to migrate data from the removed-bricks to the rest of the volume.
Brick3: server3:/exp3
Brick4: server4:/exp4
<a name="replace-brick"></a>
## Replace faulty brick
**Replacing a brick in a _pure_ distribute volume**
To replace a brick on a distribute only volume, add the new brick and then remove the brick you want to replace. This will trigger a rebalance operation which will move data from the removed brick.
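As an illustration only (the volume and brick names below are hypothetical; each command is described in the sections above), the add-then-remove flow looks roughly like this:

```console
# Add the replacement brick first
gluster volume add-brick distvol server2:/bricks/new_brick

# Drain the old brick; its data is migrated to the remaining bricks
gluster volume remove-brick distvol server1:/bricks/old_brick start

# When the status shows "completed", commit the removal
gluster volume remove-brick distvol server1:/bricks/old_brick status
gluster volume remove-brick distvol server1:/bricks/old_brick commit
```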
> NOTE: Replacing a brick using the 'replace-brick' command in gluster is supported only for distributed-replicate or _pure_ replicate volumes.
Steps to remove brick Server1:/home/gfs/r2_1 and add Server1:/home/gfs/r2_2:
@@ -200,10 +201,8 @@ Steps to remove brick Server1:/home/gfs/r2_1 and add Server1:/home/gfs/r2_2:
Brick1: Server1:/home/gfs/r2_0
Brick2: Server1:/home/gfs/r2_1
2. Here are the files that are present on the mount:
# ls
1 10 2 3 4 5 6 7 8 9
@@ -220,13 +219,11 @@ Steps to remove brick Server1:/home/gfs/r2_1 and add Server1:/home/gfs/r2_2:
5. Wait until remove-brick status indicates that it is complete.
# gluster volume remove-brick r2 Server1:/home/gfs/r2_1 status
Node Rebalanced-files size scanned failures skipped status run time in secs
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 5 20Bytes 15 0 0 completed 0.00
6. Now we can safely remove the old brick, so commit the changes:
# gluster volume remove-brick r2 Server1:/home/gfs/r2_1 commit
@@ -266,58 +263,57 @@ This section of the document describes how brick: `Server1:/home/gfs/r2_0` is re
Brick3: Server1:/home/gfs/r2_2
Brick4: Server2:/home/gfs/r2_3
Steps:
1. Make sure there is no data in the new brick Server1:/home/gfs/r2_5
2. Check that all the bricks are running. It is okay if the brick that is going to be replaced is down.
3. Bring the brick that is going to be replaced down if not already.
- Get the pid of the brick by executing 'gluster volume status <volname>'
# gluster volume status
Status of volume: r2
Gluster process Port Online Pid
------------------------------------------------------------------------------
Brick Server1:/home/gfs/r2_0 49152 Y 5342
Brick Server2:/home/gfs/r2_1 49153 Y 5354
Brick Server1:/home/gfs/r2_2 49154 Y 5365
Brick Server2:/home/gfs/r2_3 49155 Y 5376
- Login to the machine where the brick is running and kill the brick.
# kill -15 5342
- Confirm that the brick is not running anymore and the other bricks are running fine.
# gluster volume status
Status of volume: r2
Gluster process Port Online Pid
------------------------------------------------------------------------------
Brick Server1:/home/gfs/r2_0 N/A N 5342 <<---- brick is not running, others are running fine.
Brick Server2:/home/gfs/r2_1 49153 Y 5354
Brick Server1:/home/gfs/r2_2 49154 Y 5365
Brick Server2:/home/gfs/r2_3 49155 Y 5376
4. Using the gluster volume fuse mount (In this example: `/mnt/r2`) set up metadata so that data will be synced to new brick (In this case it is from `Server1:/home/gfs/r2_1` to `Server1:/home/gfs/r2_5`)
- Create a directory on the mount point that doesn't already exist. Then delete that directory, and do the same for the metadata changelog by doing setfattr. This operation marks the pending changelog which will tell the self-heal daemon/mounts to perform self-heal from `/home/gfs/r2_1` to `/home/gfs/r2_5`.
mkdir /mnt/r2/<name-of-nonexistent-dir>
rmdir /mnt/r2/<name-of-nonexistent-dir>
setfattr -n trusted.non-existent-key -v abc /mnt/r2
setfattr -x trusted.non-existent-key /mnt/r2
- Check that there are pending xattrs on the replica of the brick that is being replaced:
getfattr -d -m. -e hex /home/gfs/r2_1
# file: home/gfs/r2_1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.r2-client-0=0x000000000000000300000002 <<---- xattrs are marked from source brick Server2:/home/gfs/r2_1
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
trusted.glusterfs.volume-id=0xde822e25ebd049ea83bfaa3c4be2b440
5. Volume heal info will show that '/' needs healing. (There could be more entries based on the workload, but '/' must exist.)
@@ -337,23 +333,23 @@ Steps:
6. Replace the brick with 'commit force' option. Please note that other variants of replace-brick command are not supported.
- Execute the replace-brick command
# gluster volume replace-brick r2 Server1:/home/gfs/r2_0 Server1:/home/gfs/r2_5 commit force
volume replace-brick: success: replace-brick commit successful
- Check that the new brick is now online
# gluster volume status
Status of volume: r2
Gluster process Port Online Pid
------------------------------------------------------------------------------
Brick Server1:/home/gfs/r2_5 49156 Y 5731 <<<---- new brick is online
Brick Server2:/home/gfs/r2_1 49153 Y 5354
Brick Server1:/home/gfs/r2_2 49154 Y 5365
Brick Server2:/home/gfs/r2_3 49155 Y 5376
- Users can track the progress of self-heal using: `gluster volume heal [volname] info`.
Once self-heal completes, the changelogs will be removed.
# getfattr -d -m. -e hex /home/gfs/r2_1
@@ -366,22 +362,23 @@ Steps:
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
trusted.glusterfs.volume-id=0xde822e25ebd049ea83bfaa3c4be2b440
- `# gluster volume heal <VOLNAME> info` will show that no heal is required.
# gluster volume heal r2 info
Brick Server1:/home/gfs/r2_5
Number of entries: 0

Brick Server2:/home/gfs/r2_1
Number of entries: 0

Brick Server1:/home/gfs/r2_2
Number of entries: 0

Brick Server2:/home/gfs/r2_3
Number of entries: 0
<a name="rebalancing-volumes"></a>
## Rebalancing Volumes
After expanding a volume using the add-brick command, you may need to rebalance the data
@@ -393,11 +390,11 @@ layout and/or data.
This section describes how to rebalance GlusterFS volumes in your
storage environment, using the following common scenarios:
- **Fix Layout** - Fixes the layout to use the new volume topology so that files can
be distributed to newly added nodes.
- **Fix Layout and Migrate Data** - Rebalances volume by fixing the layout
to use the new volume topology and migrating the existing data.
### Rebalancing Volume to Fix Layout Changes
@@ -410,27 +407,27 @@ When this command is issued, all the file stat information which is
already cached will get revalidated.
As of GlusterFS 3.6, the assignment of files to bricks will take into account
the sizes of the bricks. For example, a 20TB brick will be assigned twice as
many files as a 10TB brick. In versions before 3.6, the two bricks were
treated as equal regardless of size, and would have been assigned an equal
share of files.
A fix-layout rebalance will only fix the layout changes and does not
migrate data. If you want to migrate the existing data,
use `gluster volume rebalance <volume> start` command to rebalance data among
the servers.
**To rebalance a volume to fix layout**
- Start the rebalance operation on any Gluster server using the
following command:
`# gluster volume rebalance <VOLNAME> fix-layout start`
For example:
# gluster volume rebalance test-volume fix-layout start
Starting rebalance on volume test-volume has been successful
### Rebalancing Volume to Fix Layout and Migrate Data
@@ -439,29 +436,29 @@ among the servers. A remove-brick command will automatically trigger a rebalance
**To rebalance a volume to fix layout and migrate the existing data**
- Start the rebalance operation on any one of the servers using the
following command:
`# gluster volume rebalance <VOLNAME> start`
For example:
# gluster volume rebalance test-volume start
Starting rebalancing on volume test-volume has been successful
- Start the migration operation forcefully on any one of the servers
using the following command:
`# gluster volume rebalance <VOLNAME> start force`
For example:
# gluster volume rebalance test-volume start force
Starting rebalancing on volume test-volume has been successful
A rebalance operation will attempt to balance the disk usage across nodes; therefore, it will skip
files where the move would result in a less balanced volume. This leads to link files that are still
left behind in the system and hence may cause performance issues. The behaviour can be overridden
with the `force` argument.
### Displaying the Status of Rebalance Operation
@@ -469,56 +466,57 @@ with the `force` argument.
You can display the status information about rebalance volume operation,
as needed.
- Check the status of the rebalance operation, using the following
command:
`# gluster volume rebalance <VOLNAME> status`
For example:
# gluster volume rebalance test-volume status
Node Rebalanced-files size scanned status
--------- ---------------- ---- ------- -----------
617c923e-6450-4065-8e33-865e28d9428f 416 1463 312 in progress
The time to complete the rebalance operation depends on the number
of files on the volume along with the corresponding file sizes.
Continue checking the rebalance status, verifying that the number of
files rebalanced or total files scanned keeps increasing.
For example, running the status command again might display a result
similar to the following:
# gluster volume rebalance test-volume status
Node Rebalanced-files size scanned status
--------- ---------------- ---- ------- -----------
617c923e-6450-4065-8e33-865e28d9428f 498 1783 378 in progress
The rebalance status displays the following when the rebalance is
complete:
# gluster volume rebalance test-volume status
Node Rebalanced-files size scanned status
--------- ---------------- ---- ------- -----------
617c923e-6450-4065-8e33-865e28d9428f 502 1873 334 completed
### Stopping an Ongoing Rebalance Operation
You can stop the rebalance operation, if needed.
- Stop the rebalance operation using the following command:
`# gluster volume rebalance <VOLNAME> stop`
For example:
# gluster volume rebalance test-volume stop
Node Rebalanced-files size scanned status
--------- ---------------- ---- ------- -----------
617c923e-6450-4065-8e33-865e28d9428f 59 590 244 stopped
Stopped rebalance process on volume test-volume
<a name="stopping-volumes"></a>
## Stopping Volumes
1. Stop the volume using the following command:
@@ -536,6 +534,7 @@ You can stop the rebalance operation, if needed.
Stopping volume test-volume has been successful
<a name="deleting-volumes"></a>
## Deleting Volumes
1. Delete the volume using the following command:
@@ -553,6 +552,7 @@ You can stop the rebalance operation, if needed.
Deleting volume test-volume has been successful
<a name="triggering-self-heal-on-replicate"></a>
## Triggering Self-Heal on Replicate
In replicate module, previously you had to manually trigger a self-heal
@@ -561,133 +561,134 @@ replicas in sync. Now the pro-active self-heal daemon runs in the
background, diagnoses issues and automatically initiates self-healing
every 10 minutes on the files which require _healing_.
You can view the list of files that need _healing_, the list of files
which are currently/previously _healed_, list of files which are in
split-brain state, and you can manually trigger self-heal on the entire
volume or only on the files which need _healing_.
- Trigger self-heal only on the files which require _healing_:
`# gluster volume heal <VOLNAME>`
For example, to trigger self-heal on files which require _healing_
of test-volume:
# gluster volume heal test-volume
Heal operation on volume test-volume has been successful
- Trigger self-heal on all the files of a volume:
`# gluster volume heal <VOLNAME> full`
For example, to trigger self-heal on all the files of
test-volume:
# gluster volume heal test-volume full
Heal operation on volume test-volume has been successful
- View the list of files that need _healing_:
`# gluster volume heal <VOLNAME> info`
For example, to view the list of files on test-volume that need
_healing_:
# gluster volume heal test-volume info
Brick server1:/gfs/test-volume_0
Number of entries: 0

Brick server2:/gfs/test-volume_1
Number of entries: 101
/95.txt
/32.txt
/66.txt
/35.txt
/18.txt
/26.txt
/47.txt
/55.txt
/85.txt
...
- View the list of files that are self-healed:
`# gluster volume heal <VOLNAME> info healed`
For example, to view the list of files on test-volume that are
self-healed:
# gluster volume heal test-volume info healed
Brick Server1:/gfs/test-volume_0
Number of entries: 0

Brick Server2:/gfs/test-volume_1
Number of entries: 69
/99.txt
/93.txt
/76.txt
/11.txt
/27.txt
/64.txt
/80.txt
/19.txt
/41.txt
/29.txt
/37.txt
/46.txt
...
- View the list of files of a particular volume on which the self-heal
failed:
`# gluster volume heal <VOLNAME> info failed`
For example, to view the list of files of test-volume that are not
self-healed:
# gluster volume heal test-volume info failed
Brick Server1:/gfs/test-volume_0
Number of entries: 0

Brick Server2:/gfs/test-volume_3
Number of entries: 72
/90.txt
/95.txt
/77.txt
/71.txt
/87.txt
/24.txt
...
- View the list of files of a particular volume which are in
split-brain state:
`# gluster volume heal <VOLNAME> info split-brain`
For example, to view the list of files of test-volume which are in
split-brain state:
# gluster volume heal test-volume info split-brain
Brick Server1:/gfs/test-volume_2
Number of entries: 12
/83.txt
/28.txt
/69.txt
...

Brick Server2:/gfs/test-volume_3
Number of entries: 12
/83.txt
/28.txt
/69.txt
...
<a name="non-uniform-file-allocation"></a>
## Non Uniform File Allocation
NUFA translator or Non Uniform File Access translator is designed for giving higher preference
to a local drive when used in an HPC type of environment. It can be applied to Distribute and Replica translators;
in the latter case it ensures that _one_ copy is local if space permits.
When a client on a server creates files, the files are allocated to a brick in the volume based on the file name.
This allocation may not be ideal, as there is higher latency and unnecessary network traffic for read/write operations
@@ -723,17 +724,17 @@ The NUFA scheduler also exists, for use with the Unify translator; see below.
##### NUFA additional options
- lookup-unhashed
This is an advanced option where files are looked up in all subvolumes if they are missing on the subvolume matching the hash value of the filename. The default is on.
- local-volume-name
The volume name to consider local and prefer file creations on. The default is to search for a volume matching the hostname of the system.
- subvolumes
This option lists the subvolumes that are part of this 'cluster/nufa' volume. This translator requires more than one subvolume.
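As an illustrative sketch only (not taken from the upstream docs; the volume and subvolume names are hypothetical), these options would appear in a hand-written volfile for a 'cluster/nufa' volume roughly like this:

```
volume nufa-dist
    type cluster/nufa
    # prefer file creations on this subvolume
    option local-volume-name brick1
    # look up missing files in all subvolumes
    option lookup-unhashed on
    # more than one subvolume is required
    subvolumes brick1 brick2 brick3
end-volume
```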
## BitRot Detection
@@ -748,41 +749,43 @@ sub-commands.
1. To enable bitrot detection for a given volume <VOLNAME>:
`# gluster volume bitrot <VOLNAME> enable`
and similarly to disable bitrot use:
`# gluster volume bitrot <VOLNAME> disable`
> NOTE: Enabling bitrot spawns the Signer & Scrubber daemon per node. Signer is responsible
for signing (calculating checksum for each file) an object and scrubber verifies the
calculated checksum against the object's data.
2. Scrubber daemon has three (3) throttling modes that adjust the rate at which objects
are verified.
# volume bitrot <VOLNAME> scrub-throttle lazy
# volume bitrot <VOLNAME> scrub-throttle normal
# volume bitrot <VOLNAME> scrub-throttle aggressive
3. By default scrubber scrubs the filesystem biweekly. It's possible to tune it to scrub
based on a predefined frequency such as monthly, etc. This can be done as shown below:
# volume bitrot <VOLNAME> scrub-frequency daily
# volume bitrot <VOLNAME> scrub-frequency weekly
# volume bitrot <VOLNAME> scrub-frequency biweekly
# volume bitrot <VOLNAME> scrub-frequency monthly
> NOTE: Daily scrubbing would not be available with GA release.
4. Scrubber daemon can be paused and later resumed when required. This can be done as
shown below:
`# volume bitrot <VOLNAME> scrub pause`
and to resume scrubbing:
`# volume bitrot <VOLNAME> scrub resume`
> NOTE: Signing cannot be paused (and resumed) and would always be active as long as
bitrot is enabled for that particular volume.
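A compact sketch of a typical enable-and-tune sequence, using a hypothetical volume named demo-vol (the `scrub status` sub-command is available in recent gluster releases; check `gluster volume help` on your installation):

```console
# Enable bitrot detection and slow the scrubber down
gluster volume bitrot demo-vol enable
gluster volume bitrot demo-vol scrub-throttle lazy
gluster volume bitrot demo-vol scrub-frequency monthly

# Inspect signer/scrubber state and any detected corruptions
gluster volume bitrot demo-vol scrub status
```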

View File

@@ -1,8 +1,9 @@
# Mandatory Locks
Support for mandatory locks inside GlusterFS does not converge all by itself to what Linux kernel provides to user space file systems. Here we enforce core mandatory lock semantics with and without the help of file mode bits. Please read through the [design specification](https://github.com/gluster/glusterfs-specs/blob/master/done/GlusterFS%203.8/Mandatory%20Locks.md) which explains the whole concept behind the mandatory locks implementation done for GlusterFS.
## Implications and Usage
By default, mandatory locking will be disabled for a volume, and a volume set option is available to configure the volume to operate under 3 different mandatory locking modes.
## Volume Option
@@ -11,22 +12,24 @@ By default, mandatory locking will be disabled for a volume and a volume set opt
gluster volume set <VOLNAME> locks.mandatory-locking <off / file / forced / optimal>
```
**off** - Disable mandatory locking for specified volume.<br/>
**file** - Enable Linux kernel style mandatory locking semantics with the help of mode bits (not well tested)<br/>
**forced** - Check for conflicting byte range locks for every data modifying operation in a volume<br/>
**optimal** - Combinational mode where POSIX clients can live with their advisory lock semantics which will still honour the mandatory locks acquired by other clients like SMB.
**Note**: Please refer to the design doc for more information on these key values.
#### Points to be remembered
- Valid key values available with the mandatory-locking volume set option take effect only after a subsequent start/restart of the volume (see the example below).
- Due to some outstanding issues, it is recommended to turn off the performance translators in order to have the complete functionality of mandatory-locks when the volume is configured in any one of the above described mandatory-locking modes. Please see the 'Known issues' section below for more details.
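A minimal sketch of the workflow, assuming a hypothetical volume named testvol:

```console
# Pick one of the supported modes for the volume
gluster volume set testvol locks.mandatory-locking forced

# The new value takes effect only after the volume is restarted
gluster volume stop testvol
gluster volume start testvol
```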
#### Known issues
- Since the whole logic of mandatory-locks is implemented within the locks translator loaded at the server side, early success returned to fops like open, read, write to the upper/application layer by performance translators residing at the client side will impact the intended functionality of mandatory-locks. One such issue is being tracked in the following bugzilla report:
  <https://bugzilla.redhat.com/show_bug.cgi?id=1194546>
- There is a possible race window uncovered with respect to mandatory locks and an ongoing read/write operation. For more details refer to the bug report given below:
  <https://bugzilla.redhat.com/show_bug.cgi?id=1287099>

File diff suppressed because it is too large

View File

@@ -2,32 +2,33 @@
NFS-Ganesha is a user-space file server for the NFS protocol with support for NFSv3, v4, v4.1, pNFS. It provides a FUSE-compatible File System Abstraction Layer(FSAL) to allow the file-system developers to plug in their storage mechanism and access it from any NFS client. NFS-Ganesha can access the FUSE filesystems directly through its FSAL without copying any data to or from the kernel, thus potentially improving response times.
## Installing nfs-ganesha
#### Gluster RPMs (>= 3.10)
> glusterfs-server
> glusterfs-api
> glusterfs-ganesha
#### Ganesha RPMs (>= 2.5)
> nfs-ganesha
> nfs-ganesha-gluster
## Start NFS-Ganesha manually
- To start NFS-Ganesha manually, use the command:
    - `service nfs-ganesha start`
```sh
where:
/var/log/ganesha.log is the default log file for the ganesha process.
/etc/ganesha/ganesha.conf is the default configuration file
NIV_EVENT is the default log level.
```
- If the user wants to run ganesha in a preferred mode, execute the following command:
    - `ganesha.nfsd -f <location_of_nfs-ganesha.conf_file> -L <location_of_log_file> -N <log_level>`
```sh
For example:
@@ -37,6 +38,7 @@ nfs-ganesha.log is the log file for the ganesha.nfsd process.
nfs-ganesha.conf is the configuration file
NIV_DEBUG is the log level.
```
- By default, the export list for the server will be Null
```sh
@@ -52,12 +54,14 @@ NFS_Core_Param {
#Enable_RQUOTA = false;
}
```
## Step by step procedures to exporting GlusterFS volume via NFS-Ganesha
#### step 1 :
To export any GlusterFS volume or directory inside a volume, create the EXPORT block for each of those entries in an export configuration file. The following parameters are required to export any entry.
- `cat export.conf`
```sh
EXPORT{
@@ -83,7 +87,8 @@ EXPORT{
#### step 2 :
Now include the export configuration file in the ganesha configuration file (/etc/ganesha/ganesha.conf by default). This can be done by adding the line below at the end of the file
- `%include “<path of export configuration>”`
```sh
Note :
@@ -95,33 +100,40 @@ Also, it will add the above entry to ganesha.conf
```
#### step 3 :
Turn on features.cache-invalidation for that volume
- `gluster volume set <volume name> features.cache-invalidation on`
#### step 4 :
dbus commands are used to export/unexport volume <br />
- export
    - `dbus-send --system --print-reply --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.AddExport string:<ganesha directory>/exports/export.<volume name>.conf string:"EXPORT(Path=/<volume name>)"`
- unexport
    - `dbus-send --system --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.RemoveExport uint16:<export id>`
```sh
Note :
Step 4 can be performed via following script
#/usr/libexec/ganesha/dbus-send.sh <ganesha directory> [on|off] <volume name>
```
The above scripts (mentioned in step 3 and step 4) are available in glusterfs 3.10 rpms.
You can download it from [here](https://github.com/gluster/glusterfs/blob/release-3.10/extras/ganesha/scripts/)
#### step 5 :
- To check if the volume is exported, run
    - `showmount -e localhost`
- Or else use the following dbus command
    - `dbus-send --type=method_call --print-reply --system --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.ShowExports`
- To see clients
    - `dbus-send --type=method_call --print-reply --system --dest=org.ganesha.nfsd /org/ganesha/nfsd/ClientMgr org.ganesha.nfsd.clientmgr.ShowClients`
## Using Highly Available Active-Active NFS-Ganesha And GlusterFS cli
@@ -132,69 +144,72 @@ The cluster is maintained using Pacemaker and Corosync. Pacemaker acts as a reso
Data coherency across the multi-head NFS-Ganesha servers in the cluster is achieved using the UPCALL infrastructure. UPCALL infrastructure is a generic and extensible framework that sends notifications to the respective glusterfs clients (in this case NFS-Ganesha server) in case of any changes detected in the backend filesystem.
The Highly Available cluster is configured in the following three stages:
### Creating the ganesha-ha.conf file
The ganesha-ha.conf.example is created in the following location /etc/ganesha when Gluster Storage is installed. Rename the file to ganesha-ha.conf and make the changes as suggested in the following example:
sample ganesha-ha.conf file:
> \# Name of the HA cluster created. must be unique within the subnet
> HA_NAME="ganesha-ha-360"
> \# The subset of nodes of the Gluster Trusted Pool that form the ganesha HA cluster.
> \# Hostname is specified.
> HA_CLUSTER_NODES="server1,server2,..."
> \#HA_CLUSTER_NODES="server1.lab.redhat.com,server2.lab.redhat.com,..."
> \# Virtual IPs for each of the nodes specified above.
> VIP_server1="10.0.2.1"
> VIP_server2="10.0.2.2"
### Configuring NFS-Ganesha using gluster CLI
The HA cluster can be set up or torn down using gluster CLI. Also, it can export and unexport specific volumes. For more information, see section Configuring NFS-Ganesha using gluster CLI.
### Modifying the HA cluster using the `ganesha-ha.sh` script
Post the cluster creation any further modification can be done using the `ganesha-ha.sh` script. For more information, see the section Modifying the HA cluster using the `ganesha-ha.sh` script.
## Step-by-step guide
### Configuring NFS-Ganesha using Gluster CLI
#### Pre-requisites to run NFS-Ganesha
Ensure that the following pre-requisites are taken into consideration before you run NFS-Ganesha in your environment:
- A Gluster Storage volume must be available for export and NFS-Ganesha rpms are installed on all the nodes.
- IPv6 must be enabled on the host interface which is used by the NFS-Ganesha daemon. To enable IPv6 support, perform the following steps:
    - Comment or remove the line options ipv6 disable=1 in the /etc/modprobe.d/ipv6.conf file.
    - Reboot the system.
- Ensure that all the nodes in the cluster are DNS resolvable. For example, you can populate the /etc/hosts with the details of all the nodes in the cluster.
- Disable and stop NetworkManager service.
- Enable and start network service on all machines.
- Create and mount a gluster shared volume.
    - `gluster volume set all cluster.enable-shared-storage enable`
- Install Pacemaker and Corosync on all machines.
- Set the cluster auth password on all the machines.
- Passwordless ssh needs to be enabled on all the HA nodes. Follow these steps,
    - On one (primary) node in the cluster, run:
        - `ssh-keygen -f /var/lib/glusterd/nfs/secret.pem`
    - Deploy the pubkey ~root/.ssh/authorized keys on _all_ nodes, run:
        - `ssh-copy-id -i /var/lib/glusterd/nfs/secret.pem.pub root@$node`
    - Copy the keys to _all_ nodes in the cluster, run:
        - `scp /var/lib/glusterd/nfs/secret.* $node:/var/lib/glusterd/nfs/`
- Create a directory named "nfs-ganesha" in shared storage path and create ganesha.conf & ganesha-ha.conf in it (from glusterfs 3.9 onwards)
#### Configuring the HA Cluster
To set up the HA cluster, enable NFS-Ganesha by executing the following command:
gluster nfs-ganesha enable
To tear down the HA cluster, execute the following command:
gluster nfs-ganesha disable
```sh
Note :
Enable command performs the following
@@ -209,28 +224,32 @@ Also if gluster nfs-ganesha [enable/disable] fails of please check following log
```
#### Exporting Volumes through NFS-Ganesha using cli
To export a Red Hat Gluster Storage volume, execute the following command:
gluster volume set <volname> ganesha.enable on
To unexport a Red Hat Gluster Storage volume, execute the following command:
gluster volume set <volname> ganesha.enable off
This command unexports the Red Hat Gluster Storage volume without affecting other exports.
To verify the status of the volume set options, follow the guidelines mentioned below:
- Check if NFS-Ganesha is started by executing the following command:
    - `ps aux | grep ganesha.nfsd`
- Check if the volume is exported.
    - `showmount -e localhost`
The logs of the ganesha.nfsd daemon are written to /var/log/ganesha.log. Check the log file if you notice any unexpected behavior.
### Modifying the HA cluster using the ganesha-ha.sh script
To modify the existing HA cluster and to change the default values of the exports use the ganesha-ha.sh script located at /usr/libexec/ganesha/.
#### Adding a node to the cluster
Before adding a node to the cluster, ensure all the prerequisites mentioned in section `Pre-requisites to run NFS-Ganesha` are met. To add a node to the cluster, execute the following command on any of the nodes in the existing NFS-Ganesha cluster:
#./ganesha-ha.sh --add <HA_CONF_DIR> <HOSTNAME> <NODE-VIP>
@@ -238,7 +257,9 @@ Before adding a node to the cluster, ensure all the prerequisites mentioned in s
HA_CONF_DIR: The directory path containing the ganesha-ha.conf file.
HOSTNAME: Hostname of the new node to be added
NODE-VIP: Virtual IP of the new node to be added.
#### Deleting a node in the cluster
To delete a node from the cluster, execute the following command on any of the nodes in the existing NFS-Ganesha cluster:
#./ganesha-ha.sh --delete <HA_CONF_DIR> <HOSTNAME>
@@ -246,22 +267,25 @@ To delete a node from the cluster, execute the following command on any of the n
where,
HA_CONF_DIR: The directory path containing the ganesha-ha.conf file.
HOSTNAME: Hostname of the node to be deleted
#### Modifying the default export configuration
To modify the default export configurations, perform the following steps on any of the nodes in the existing ganesha cluster:
- Edit/add the required fields in the corresponding export file located at `/etc/ganesha/exports`.
- Execute the following command:
#./ganesha-ha.sh --refresh-config <HA_CONFDIR> <volname>
where,
HA_CONF_DIR: The directory path containing the ganesha-ha.conf file.
volname: The name of the volume whose export configuration has to be changed.

Note:
The export ID must not be changed.
### Configure ganesha ha cluster outside of gluster nodes
@@ -269,39 +293,43 @@ Currently, ganesha HA cluster creating tightly integrated with glusterd. So here
Exporting/Unexporting should be performed without using the glusterd cli (follow the manual steps; before performing step 4, replace localhost with the required hostname/ip in "hostname=localhost;" in the export configuration file).
## Configuring Gluster volume for pNFS
The Parallel Network File System (pNFS) is part of the NFS v4.1 protocol that allows computing clients to access storage devices directly and in parallel. The pNFS cluster consists of MDS (Meta-Data-Server) and DS (Data-Server). The client sends all the read/write requests directly to DS and all other operations are handled by the MDS.
### Step by step guide
- Turn on `feature.cache-invalidation` for the volume.
    - `gluster v set <volname> features.cache-invalidation on`
- Select one of the nodes in the cluster as MDS and configure it by adding the following block to the ganesha configuration file
```sh
GLUSTER
{
PNFS_MDS = true;
}
```
- Manually start NFS-Ganesha in every node in the cluster.
- Check whether the volume is exported via nfs-ganesha in all the nodes.
    - `showmount -e localhost`
- Mount the volume using NFS version 4.1 protocol with the ip of MDS
    - `mount -t nfs4 -o minorversion=1 <ip of MDS>:/<volume name> <mount path>`
### Points to be Noted
- The current architecture supports only a single MDS and multiple DS. The server with which the client mounts will act as MDS and all servers including the MDS can act as DS.
- Currently, HA is not supported for pNFS (more specifically MDS). Although it is configurable, consistency is guaranteed across the cluster.
- If any of the DS goes down, then the MDS will handle those I/O's.
- Hereafter, all the subsequent NFS clients need to use the same server for mounting that volume via pNFS, i.e. more than one MDS for a volume is not preferred.
- pNFS support is only tested with distributed, replicated, or distribute-replicate volumes.
- It is tested and verified with RHEL 6.5, fedora 20, fedora 21 nfs clients. It is always better to use the latest nfs-clients.

View File

@@ -1,23 +1,24 @@
# Network Configurations Techniques
#### Bonding best practices
Bonded network interfaces incorporate multiple physical interfaces into a single logical bonded interface, with a single IP addr. An N-way bonded interface can survive loss of N-1 physical interfaces, and performance can be improved in some cases.
###### When to bond?
- Need high availability for network link
- Workload: sequential access to large files (most time spent reading/writing)
- Network throughput limit of client/server \<\< storage throughput limit
    - 1 GbE (almost always)
    - 10-Gbps links or faster -- for writes, replication doubles the load on the network and replicas are usually on different peers to which the client can transmit in parallel.
- LIMITATION: Bonding mode 6 doesn't improve throughput if network peers are not on the same VLAN.
###### How to configure
- [Bonding-howto](http://www.linuxquestions.org/linux/answers/Networking/Linux_bonding_howto_0)
- Best bonding mode for Gluster client is mode 6 (balance-alb), this allows client to transmit writes in parallel on separate NICs much of the time. A peak throughput of 750 MB/s on writes from a single client was observed with bonding mode 6 on 2 10-GbE NICs with jumbo frames. That's 1.5 GB/s of network traffic.
- Another way to balance both transmit and receive traffic is bonding mode 4 (802.3ad) but this requires switch configuration (trunking commands)
- Still another way to load balance is bonding mode 2 (balance-xor) with option "xmit_hash_policy=layer3+4". The bonding modes 6 and 2 will not improve single-connection throughput, but improve aggregate throughput across all connections.
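As a rough sketch only (the interface names, address, and exact ifcfg keys below are assumptions and vary by distribution), a balance-alb bond on RHEL/CentOS network-scripts might look like this:

```sh
# /etc/sysconfig/network-scripts/ifcfg-bond0  (bonded interface used for Gluster traffic)
DEVICE=bond0
TYPE=Bond
BONDING_OPTS="mode=balance-alb miimon=100"
BOOTPROTO=none
IPADDR=192.168.10.11
PREFIX=24
ONBOOT=yes

# Each physical NIC gets its own ifcfg file pointing at the bond, e.g. ifcfg-eth0:
# DEVICE=eth0
# MASTER=bond0
# SLAVE=yes
# ONBOOT=yes
```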
##### Jumbo frames
@@ -25,18 +26,18 @@ Jumbo frames are Ethernet (or Infiniband) frames with size greater than the defa
###### When to configure?
- Any network faster than 1-GbE
- Workload is sequential large-file reads/writes
- LIMITATION: Requires that all network switches in the VLAN be configured to handle jumbo frames; do not configure otherwise.
###### How to configure?
- Edit the network interface file at /etc/sysconfig/network-scripts/ifcfg-your-interface (see the example below)
- Ethernet (on ixgbe driver): add "MTU=9000" (MTU means "maximum transfer unit") record to network interface file
- Infiniband (on mlx4 driver): add "CONNECTED_MODE=yes" and "MTU=65520" records to network interface file
- ifdown your-interface; ifup your-interface
- Test with "ping -s 16384 other-host-on-VLAN"
- Switch requires max frame size larger than MTU because of protocol headers, usually 9216 bytes
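
For example, on an Ethernet interface the change amounts to one extra record in the interface file, followed by bouncing the interface and a fragmentation-free ping (interface and host names are illustrative):

```console
echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-ens1f0
ifdown ens1f0 && ifup ens1f0
# 8972 = 9000 bytes minus 28 bytes of IP/ICMP headers; -M do forbids fragmentation
ping -M do -s 8972 other-host-on-VLAN
```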
##### Configuring a backend network for storage
@@ -44,10 +45,10 @@ This method lets you add network capacity for multi-protocol sites by segregatin
###### When to configure?
- For non-Gluster services such as NFS, Swift (REST), CIFS being provided on Gluster servers. It will not help Gluster clients (external nodes with Gluster mountpoints on them).
- Network port is over-utilized.
###### How to configure?
- Most network cards have multiple ports on them -- make port 1 the non-Gluster port and port 2 the Gluster port.
- Separate Gluster ports onto a separate VLAN from non-Gluster ports, to simplify configuration.

View File

@@ -6,8 +6,7 @@ API to be accessed as files over filesystem interface and vice versa i.e files
created over filesystem interface (NFS/FUSE/native) can be accessed as objects
over Swift's RESTful API.
SwiftOnFile project was formerly known as `gluster-swift` and also as `UFO (Unified File and Object)` before that. More information about SwiftOnFile can
be found [here](https://github.com/swiftonfile/swiftonfile/blob/master/doc/markdown/quick_start_guide.md).
There are differences in working of gluster-swift (now obsolete) and swiftonfile
projects. The older gluster-swift code and relevant documentation can be found
@@ -17,10 +16,9 @@ of swiftonfile repo.
## SwiftOnFile vs gluster-swift
| Gluster-Swift | SwiftOnFile |
| :---: | :---: |
| One GlusterFS volume maps to and stores only one Swift account. Mountpoint Hierarchy: `container/object` | One GlusterFS volume or XFS partition can have multiple accounts. Mountpoint Hierarchy: `acc/container/object` |
| Over-rides account server, container server and object server. We need to keep in sync with upstream Swift and often may need code changes or workarounds to support new Swift features. | Implements only object-server. Much less need to catch up with upstream Swift, as new features at the proxy, container and account level would very likely be compatible with SwiftOnFile since it's just a storage policy. |
| Does not use DBs for accounts and containers. A container listing involves a filesystem crawl. A HEAD on account/container gives inaccurate or stale results without an FS crawl. | Uses Swift's DBs to store account and container information. An account or container listing does not involve an FS crawl. Accurate info on HEAD to account/container; ability to support account quotas. |
| GET on a container and account lists actual files in the filesystem. | GET on a container and account only lists objects PUT over Swift. Files created over the filesystem interface do not appear in container and object listings. |
| Standalone deployment required; does not integrate with an existing Swift cluster. | Integrates with any existing Swift deployment as a Storage Policy. |

View File

@@ -1,5 +1,4 @@
# Gluster performance testing
Once you have created a Gluster volume, you need to verify that it has
adequate performance for your application, and if it does not, you need
@@ -7,18 +6,18 @@ a way to isolate the root cause of the problem.
There are two kinds of workloads:
- synthetic - run a test program such as ones below
- application - run existing application
# Profiling tools
Ideally it's best to use the actual application that you want to run on Gluster, but applications often don't tell the sysadmin much about where the performance problems are, particularly latency (response-time) problems. So there are non-invasive profiling tools built into Gluster that can measure performance as seen by the application, without changing the application. Gluster profiling methods at present are based on the io-stats translator, and include:
- client-side profiling - instrument a Gluster mountpoint or libgfapi process to sample profiling data. In this case, the io-stats translator is at the "top" of the translator stack, so the profile data truly represents what the application (or FUSE mountpoint) is asking Gluster to do. For example, a single application write is counted once as a WRITE FOP (file operation) call, and the latency for that WRITE FOP includes latency of the data replication done by the AFR translator lower in the stack.
- server-side profiling - this is done using the "gluster volume profile" command (and "gluster volume top" can be used to identify particular hot files in use as well). Server-side profiling can measure the throughput of an entire Gluster volume over time, and can measure server-side latencies. However, it does not incorporate network or client-side latencies. It is also hard to infer application behavior because of client-side translators that alter the I/O workload (examples: erasure coding, cache tiering).
In short, use client-side profiling for understanding "why is my application unresponsive"? and use server-side profiling for understanding how busy your Gluster volume is, what kind of workload is being applied to it (i.e. is it mostly-read? is it small-file?), and how well the I/O load is spread across the volume.
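
For instance, the server-side method boils down to a short command sequence (the volume name is a placeholder):

```console
gluster volume profile myvol start      # begin collecting io-stats on the bricks
# ... run the application workload for a while ...
gluster volume profile myvol info       # per-brick FOP counts, latencies and throughput
gluster volume top myvol read-perf      # identify hot files/bricks (optional)
gluster volume profile myvol stop
```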
## client-side profiling
@@ -27,7 +26,7 @@ To run client-side profiling,
- gluster volume profile your-volume start
- setfattr -n trusted.io-stats-dump -v io-stats-pre.txt /your/mountpoint
This will generate the specified file (`/var/run/gluster/io-stats-pre.txt`) on the client. A script like [gvp-client.sh](https://github.com/bengland2/gluster-profile-analysis) can automate collection of this data.
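
A typical collection sequence therefore looks roughly like this (mountpoint and dump file names are illustrative); the before/after dumps can then be compared by hand or with the gvp-client.sh script mentioned above:

```console
gluster volume profile myvol start
setfattr -n trusted.io-stats-dump -v io-stats-pre.txt /mnt/glusterfs    # snapshot before
# ... run the application workload ...
setfattr -n trusted.io-stats-dump -v io-stats-post.txt /mnt/glusterfs   # snapshot after
ls /var/run/gluster/io-stats-*.txt                                      # dumps appear here on the client
```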
TBS: what the different FOPs are and what they mean.
@@ -58,11 +57,11 @@ that can be run from a single system. While single-system results are
important, they are far from a definitive measure of the performance
capabilities of a distributed filesystem.
- [fio](http://freecode.com/projects/fio) - for large file I/O tests.
- [smallfile](https://github.com/bengland2/smallfile) - for
pure-workload small-file tests
- [iozone](http://www.iozone.org) - for pure-workload large-file tests
- [parallel-libgfapi](https://github.com/bengland2/parallel-libgfapi) - for pure-workload libgfapi tests
The "netmist" mixed-workload generator of SPECsfs2014 may be suitable in some cases, but is not technically an open-source tool. This tool was written by Don Capps, who was an author of iozone.
@@ -78,13 +77,13 @@ And make sure your firewall allows port 8765 through for it. You can now run tes
You can also use it for distributed testing, however, by launching fio instances on separate hosts, taking care to start all fio instances as close to the same time as possible, limiting per-thread throughput, and specifying the run duration rather than the amount of data, so that all fio instances end at around the same time. You can then aggregate the fio results from different hosts to get a meaningful aggregate result.
fio also has different I/O engines, in particular Huamin Chen authored the **_libgfapi_** engine for fio so that you can use fio to test Gluster performance without using FUSE.
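
As an illustration, a simple large-file sequential write test might be launched as follows; the mountpoint, job sizes, and the libgfapi engine parameters (`volume`, `brick`) are assumptions to adapt to your setup:

```console
# 4 jobs writing 1 GiB each in 64 KiB requests through a FUSE mountpoint
fio --name=seqwrite --directory=/mnt/glusterfs --rw=write --bs=64k \
    --size=1g --numjobs=4 --group_reporting

# roughly the same test through libgfapi, bypassing FUSE
fio --name=gfapiwrite --ioengine=gfapi --volume=testvol --brick=server1 \
    --rw=write --bs=64k --size=1g --numjobs=4 --group_reporting
```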
Limitations of fio in distributed mode:
- stonewalling - fio calculates throughput based on when the last thread finishes a test run. In contrast, iozone calculates throughput by default based on when the FIRST thread finishes the workload. This can lead to (deceptively?) higher throughput results for iozone, since there are inevitably some "straggler" threads limping to the finish line later than others. It is possible in some cases to overcome this limitation by specifying a time limit for the test. This works well for random I/O tests, where typically you do not want to read/write the entire file/device anyway.
- inaccuracy when response times > 1 sec - at least in some cases fio has reported excessively high IOPS when fio threads encounter response times much greater than 1 second, this can happen for distributed storage when there is unfairness in the implementation.
- io engines are not integrated.
### smallfile Distributed I/O Benchmark
@@ -108,10 +107,10 @@ option (below).
The "-a" option for automated testing of all use cases is discouraged,
because:
- this does not allow you to drop the read cache in server before a
test.
- most of the data points being measured will be irrelevant to the
problem you are solving.
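
Instead of `-a`, pick the specific operation you care about. A single create test with several threads might look roughly like the following sketch (paths, thread count and file counts are illustrative; check the option names against the smallfile README):

```console
# on each server, drop the read cache before the test
sync; echo 3 > /proc/sys/vm/drop_caches

# 8 threads, 10000 files per thread, 64 KB per file, under a Gluster mountpoint
python smallfile_cli.py --operation create --threads 8 --files 10000 \
    --file-size 64 --top /mnt/glusterfs/smf
```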
Single-thread testing is an important use case, but to fully utilize the
available hardware you typically need to do multi-thread and even
@@ -124,16 +123,16 @@ re-read and re-write tests. "-w" option tells iozone not to delete any
files that it accessed, so that subsequent tests can use them. Specify
these options with each test:
- -i -- test type, 0=write, 1=read, 2=random read/write
- -r -- data transfer size -- allows you to simulate I/O size used by
application
- -s -- per-thread file size -- choose this to be large enough for the
system to reach steady state (typically multiple GB needed)
- -t -- number of threads -- how many subprocesses will be
concurrently issuing I/O requests
- -F -- list of files -- what files to write/read. If you do not
specify then the filenames iozone.DUMMY.\* will be used in the
default directory.
Example of an 8-thread sequential write test with 64-KB transfer size
and file size of 1 GB to shared Gluster mountpoint directory
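
Such a test could be expressed along the following lines, using the options described above (the mountpoint path and file names are illustrative):

```console
# -w keeps the files for later re-read tests; -c/-e include close and flush in the timing
iozone -w -c -e -i 0 -r 64k -s 1g -t 8 \
    -F /mnt/glusterfs/f0 /mnt/glusterfs/f1 /mnt/glusterfs/f2 /mnt/glusterfs/f3 \
       /mnt/glusterfs/f4 /mnt/glusterfs/f5 /mnt/glusterfs/f6 /mnt/glusterfs/f7
```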
@@ -213,11 +212,11 @@ This test exercises Gluster performance using the libgfapi API,
bypassing FUSE - no mountpoints are used. Available
[here](https://github.com/bengland2/parallel-libgfapi).
To use it, you edit the script parameters in parallel_gfapi_test.sh
script - all of them are above the comment "NO EDITABLE PARAMETERS BELOW
THIS LINE". These include such things as the Gluster volume name, a host
serving that volume, number of files, etc. You then make sure that the
gfapi_perf_test executable is distributed to the client machines at
the specified directory, and then run the script. The script starts all
libgfapi workload generator processes in parallel in such a way that
they all start the test at the same time. It waits until they all
@@ -240,8 +239,7 @@ S3 workload generation.
part of the OpenStack Swift toolset and is a command-line tool with a workload
definition file format.
## Workload
An application can be as simple as writing some files, or it can be as
complex as running a cloud on top of Gluster. But all applications have
@@ -253,10 +251,10 @@ application spends most of its time doing with Gluster are called the
the filesystem requests being delivered to Gluster by the application.
There are two ways to look at workload:
- top-down - what is the application trying to get the filesystem to
do?
- bottom-up - what requests is the application actually generating to
the filesystem?
### data vs metadata
@@ -277,21 +275,21 @@ Often this is what users will be able to help you with -- for example, a
workload might consist of ingesting a billion .mp3 files. Typical
questions that need to be answered (approximately) are:
- what is file size distribution? Averages are often not enough - file
size distributions can be bi-modal (i.e. consist mostly of the very
large and very small file sizes). TBS: provide pointers to scripts
that can collect this (a rough sketch follows this list).
- what fraction of file accesses are reads vs writes?
- how cache-friendly is the workload? Do the same files get read
repeatedly by different Gluster clients, or by different
processes/threads on these clients?
- for large-file workloads, what fraction of accesses are
sequential/random? Sequential file access means that the application
thread reads/writes the file from start to finish in byte offset
order, and random file access is the exact opposite -- the thread
may read/write from any offset at any time. Virtual machine disk
images are typically accessed randomly, since the VM's filesystem is
embedded in a Gluster file.
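
Until proper pointers are added, a rough file-size histogram can be collected with standard tools, for example (the data path is a placeholder and GNU find is assumed):

```console
# count files per power-of-two size bucket
find /path/to/data -type f -printf '%s\n' | \
    awk '{ b = 1; while (b < $1) b *= 2; bucket[b]++ } END { for (b in bucket) print b, bucket[b] }' | \
    sort -n
```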
Why do these questions matter? For example, if you have a large-file
sequential read workload, network configuration + Gluster and Linux
@@ -311,20 +309,19 @@ and the bottlenecks which are limiting performance of that workload.
TBS: links to documentation for these tools and scripts that reduce the data to usable form.
## Configuration
There are 4 basic hardware dimensions to a Gluster server, listed here
in order of importance:
- network - possibly the most important hardware component of a
Gluster site
- access protocol - what kind of client is used to get to the
files/objects?
- storage - this is absolutely critical to get right up front
- cpu - on client, look for hot threads (see below)
- memory - can impact performance of read-intensive, cacheable
workloads
### network testing
@@ -338,7 +335,7 @@ To measure network performance, consider use of a
[netperf-based](http://www.cs.kent.edu/~farrell/dist/ref/Netperf.html)
script.
The purpose of these two tools is to characterize the capacity of your entire network infrastructure to support the desired level of traffic induced by distributed storage, using multiple network connections in parallel. The latter script is probably the most realistic network workload for distributed storage.
The two most common hardware problems impacting distributed storage are,
not surprisingly, disk drive failures and network failures. Some of
@@ -379,7 +376,7 @@ To simulate a mixed read-write workload, use both sets of pairs:
(c1,s1), (c2, s2), (c3, s1), (c4, s2), (s1, c1), (s2, c2), (s1, c3), (s2, c4)
More complicated flows can model the behavior of non-native protocols, where a cluster node acts as a proxy server -- it is a server (for the non-native protocol) and a client (for the native protocol). Such protocols often induce full-duplex traffic, which can stress the network differently than unidirectional in/out traffic. For example, try adding this set of flows to the preceding flow:
(s1, s2), (s2, s3), (s3, s4), (s4, s1)
@@ -391,8 +388,8 @@ do not need ssh access to each other -- they only have to allow
password-less ssh access from the head node. The script does not rely on
root privileges, so you can run it from a non-root account. Just create
a public key on the head node in the right account (usually in
$HOME/.ssh/id_rsa.pub) and then append this public key to
$HOME/.ssh/authorized_keys on each host participating in the test.
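
A minimal way to set this up from the head node might be (the host list file names are assumptions):

```console
# on the head node, in the account that will run the test
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa              # skip if a key already exists
for h in $(cat senders.list receivers.list | sort -u); do
    ssh-copy-id "$h"    # appends the public key to ~/.ssh/authorized_keys on each host
done
```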
We input senders and receivers using separate text files, 1 host per
line. For pair (sender[j], receiver[j]), you get sender[j] from line j
@@ -401,23 +398,22 @@ You have to use the IP address/name that corresponds to the interface
you want to test, and you have to be able to ssh to each host from the
head node using this interface.
## Results
There are four basic forms of performance results, not in order of
importance:
- throughput -- how much work is done in a unit of time? Best metrics
typically are workload-dependent:
- for large-file random: IOPS
- for large-file sequential: MB/s
- for small-file: files/sec
- response time -- IMPORTANT, how long does it take for filesystem
request to complete?
- utilization -- how busy is the hardware while the workload is
running?
- scalability -- can we linearly scale throughput without sacrificing
response time as we add servers to a Gluster volume?
Typically throughput results get the most attention, but in a
distributed-storage environment, the hardest goal to achieve may well be

View File

@@ -1,80 +1,91 @@
# Performance tuning
## Enable Metadata cache
Metadata caching improves performance in almost all the workloads, except for use cases
with most of the workload accessing a file simultaneously from multiple clients.
1. Execute the following command to enable metadata caching and cache invalidation:
```console
gluster volume set <volname> group metadata-cache
```
This group command enables caching of stat and xattr information of a file or directory.
The caching is refreshed every 10 min, and cache-invalidation is enabled to ensure cache
consistency.
2. To increase the number of files that can be cached, execute the following command:
```console
gluster volume set <volname> network.inode-lru-limit <n>
```
By default, n is set to 50000. It can be increased if the number of active files in the volume
is very high. Increasing this number increases the memory footprint of the brick processes.
3. Execute the following command to enable samba specific metadata caching:
```console
gluster volume set <volname> cache-samba-metadata on
```
4. By default, some xattrs are cached by gluster, like capability xattrs, ima xattrs,
ACLs, etc. If there are any other xattrs that are used by the application using
the Gluster storage, execute the following command to add these xattrs to the metadata
cache list:
```console
gluster volume set <volname> xattr-cache-list "comma separated xattr list"
```
For example:
```console
gluster volume set <volname> xattr-cache-list "user.org.netatalk.*,user.swift.metadata"
```
## Directory operations
Along with enabling the metadata caching, the following options can be set to
increase performance of directory operations:
### Directory listing Performance:
- Enable `parallel-readdir`

```console
gluster volume set <VOLNAME> performance.readdir-ahead on
gluster volume set <VOLNAME> performance.parallel-readdir on
```

### File/Directory Create Performance

- Enable `nl-cache`

```console
gluster volume set <volname> group nl-cache
gluster volume set <volname> nl-cache-positive-entry on
```
The above commands also enable cache invalidation and increase the timeout to 10 minutes.
## Small file Read operations
For use cases with dominant small file reads, enable the following options
gluster volume set <volname> performance.cache-invalidation on
gluster volume set <volname> features.cache-invalidation on
gluster volume set <volname> performance.qr-cache-timeout 600 # 10 min recommended setting
gluster volume set <volname> cache-invalidation-timeout 600 # 10 min recommended setting
These commands enable caching of the content of small files in the client cache.
Enabling cache invalidation ensures cache consistency.
The total cache size can be set using
gluster volume set <volname> cache-size <size>
By default, the files with size `<=64KB` are cached. To change this value:
gluster volume set <volname> performance.cache-max-file-size <size>
Note that the `size` arguments use SI unit suffixes, e.g. `64KB` or `2MB`.

View File

@@ -3,13 +3,13 @@
RDMA is no longer supported in Gluster builds. It has been removed from release 8 onwards.
Currently we don't have:
1. The expertise to support RDMA
2. Infrastructure to test/verify the performances each release
The options are getting discussed here - https://github.com/gluster/glusterfs/issues/2000
RDMA could be enabled again as a compile-time option if proper support and testing infrastructure become available.
# Introduction
GlusterFS supports using RDMA protocol for communication between glusterfs clients and glusterfs bricks.
@@ -17,20 +17,22 @@ GlusterFS clients include FUSE client, libgfapi clients(Samba and NFS-Ganesha in
NOTE: As of now only FUSE client and gNFS server would support RDMA transport.
NOTE:
NFS client to gNFS Server/NFS Ganesha Server communication would still happen over tcp.
CIFS Clients/Windows Clients to Samba Server communication would still happen over tcp.
# Setup
Please refer to this external documentation to set up RDMA on your machines:
http://people.redhat.com/dledford/infiniband_get_started.html
## Creating Trusted Storage Pool
All the servers in the Trusted Storage Pool must have RDMA devices if either RDMA or TCP,RDMA volumes are created in the storage pool.
The peer probe must be performed using IP/hostname assigned to the RDMA device.
## Ports and Firewall
Process glusterd will listen on both tcp and rdma if rdma device is found. Port used for rdma is 24008. Similarly, brick processes will also listen on two ports for a volume created with transport "tcp,rdma".
Make sure you update the firewall to accept packets on these ports.
@@ -46,36 +48,49 @@ Creation of test-volume has been successful
Please start the volume to access data.
# Changing Transport of Volume
To change the supported transport types of an existing volume, follow the procedure:
NOTE: This is possible only if the volume was created with IP/hostname assigned to RDMA device.
1. Unmount the volume on all the clients using the following command:
umount mount-point
2. Stop the volumes using the following command:
gluster volume stop volname
3. Change the transport type.
For example, to enable both tcp and rdma execute the following command:
gluster volume set volname config.transport tcp,rdma
4. Mount the volume on all the clients.
For example, to mount using rdma transport, use the following command:
mount -t glusterfs -o transport=rdma server1:/test-volume /mnt/glusterfs
NOTE:
config.transport option does not have an entry in the help of gluster cli.
```console
gluster vol set help | grep config.transport
```
However, the key is a valid one.
# Mounting a Volume using RDMA
You can use the mount option "transport" to specify the transport type that FUSE client must use to communicate with bricks. If the volume was created with only one transport type, then that becomes the default when no value is specified. In case of tcp,rdma volume, tcp is the default.
For example, to mount using rdma transport, use the following command:
```console
mount -t glusterfs -o transport=rdma server1:/test-volume /mnt/glusterfs
```
# Transport used by auxiliary processes
All the auxiliary processes like self-heal daemon, rebalance process etc. use the default transport. In case you have a tcp,rdma volume it will use tcp.
In case of rdma volume, rdma will be used.
Configuration options to select transport used by these processes when volume is tcp,rdma are not yet available and will be coming in later releases.

View File

@@ -2,67 +2,67 @@
GlusterFS allows its communication to be secured using the [Transport Layer
Security][tls] standard (which supersedes Secure Sockets Layer), using the
[OpenSSL][ossl] library. Setting this up requires a basic working knowledge of
some SSL/TLS concepts, which can only be briefly summarized here.
* "Authentication" is the process of one entity (e.g. a machine, process, or
person) proving its identity to a second entity.
- "Authentication" is the process of one entity (e.g. a machine, process, or
person) proving its identity to a second entity.
* "Authorization" is the process of checking whether an entity has permission
to perform an action.
- "Authorization" is the process of checking whether an entity has permission
to perform an action.
- TLS provides authentication and encryption. It does not provide
authorization, though GlusterFS can use TLS-authenticated identities to
authorize client connections to bricks/volumes.
- An entity X which must authenticate to a second entity Y does so by sharing
with Y a _certificate_, which contains information sufficient to prove X's
identity. X's proof of identity also requires possession of a _private key_
which matches its certificate, but this key is never seen by Y or anyone
else. Because the certificate is already public, anyone who has the key can
claim that identity.
- Each certificate contains the identity of its principal (owner) along with
the identity of a _certifying authority_ or CA who can verify the integrity
of the certificate's contents. The principal and CA can be the same (a
"self-signed certificate"). If they are different, the CA must _sign_ the
certificate by appending information derived from both the certificate
contents and the CA's own private key.
- Certificate-signing relationships can extend through multiple levels. For
example, a company X could sign another company Y's certificate, which could
then be used to sign a third certificate Z for a specific user or purpose.
Anyone who trusts X (and is willing to extend that trust through a
_certificate depth_ of two or more) would therefore be able to authenticate
Y and Z as well.
- Any entity willing to accept other entities' authentication attempts must
have some sort of database seeded with the certificates that it already accepts.
In GlusterFS's case, a client or server X uses the following files to contain
TLS-related information:
- /etc/ssl/glusterfs.pem X's own certificate
- /etc/ssl/glusterfs.key X's private key
- /etc/ssl/glusterfs.ca concatenation of _others'_ certificates
GlusterFS always performs _mutual authentication_, though clients do not
currently do anything with the authenticated server identity. Thus, if client X
wants to communicate with server Y, then X's certificate (or that of a signer)
must be in Y's CA file, and vice versa.
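
For a small test deployment, self-signed certificates are the quickest way to populate these files; a minimal sketch (the common name, key length, validity period and peer certificate names are arbitrary examples) is:

```console
# on each machine: create a private key and a self-signed certificate
openssl genrsa -out /etc/ssl/glusterfs.key 2048
openssl req -new -x509 -key /etc/ssl/glusterfs.key \
    -subj "/CN=Zaphod" -days 365 -out /etc/ssl/glusterfs.pem

# then append every peer's certificate to this machine's CA file
cat server1.pem server2.pem >> /etc/ssl/glusterfs.ca
```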
For all uses of TLS in GlusterFS, if one side of a connection is configured to
use TLS then the other side must use it as well. There is no automatic fallback
to non-TLS communication, or allowance for concurrent TLS and non-TLS access to
the same resource, because either would be insecure. Instead, any such "mixed
mode" connections will be rejected by the TLS-using side, sacrificing
availability to maintain security.
**NOTE**: The TLS certificate verification will fail if the machines' date and
time are not in sync with each other. Certificate verification depends on the
time of the client as well as the server and if that is not found to be in
sync then it is deemed to be an invalid certificate. To get the date and times
in sync, tools such as ntpdate can be used.
@@ -70,50 +70,50 @@ in sync, tools such as ntpdate can be used.
Certmonger can be used to generate keys, request certs from a CA and then
automatically keep the Gluster certificate and the CA bundle updated as
required, simplifying deployment. Either a commercial CA or a local CA can
be used. E.g., FreeIPA (with dogtag CA) is an open-source CA with
user-friendly tooling.
If using FreeIPA, first add the host. This is required for FreeIPA to issue
certificates. This can be done via the web UI, or the CLI with:
ipa host-add <hostname>
If the host has been added the following should show the host:
ipa host-show <hostname>
And it should show a kerberos principal for the host in the form of:
host/<hostname>
Now use certmonger on the gluster server or client to generate the key (if
required), and submit a CSR to the CA. Certmonger will monitor the request,
and create and update the files as required. For FreeIPA we need to specify
the Kerberos principal from above to -K. E.g.:
getcert request -r \
-K host/$(hostname) \
-f /etc/ssl/gluster.pem \
-k /etc/ssl/gluster.key \
-D $(hostname) \
-F /etc/ssl/gluster.ca
Certmonger should print out an ID for the request, e.g.:
New signing request "20210801190305" added.
You can check the status of the request with this ID:
getcert list -i 20210801190147
If the CA approves the CSR and issues the cert, then the previous command
should print a status field with:
status: MONITORING
At this point, the key, the cert and the CA bundle should all be in /etc/ssl
ready for Gluster to use. Certmonger will renew the certificates as
required for you.
You do not need to manually concatenate certs to a trusted cert bundle and
@@ -123,7 +123,7 @@ You may need to set the certificate depth to allow the CA signed certs to be
used, if there are intermediate CAs in the signing path. E.g., on every server
and client:
echo "option transport.socket.ssl-cert-depth 3" > /var/lib/glusterd/secure-access
echo "option transport.socket.ssl-cert-depth 3" > /var/lib/glusterd/secure-access
This should not be necessary where a local CA (e.g., FreeIPA) has directly
signed the cert.
@@ -133,45 +133,44 @@ signed the cart.
To enable authentication and encryption between clients and brick servers, two
options must be set:
gluster volume set MYVOLUME client.ssl on
gluster volume set MYVOLUME server.ssl on
> **Note** that the above options affect only the GlusterFS native protocol.
> For foreign protocols such as NFS, SMB, or Swift the encryption will not be
> affected between:
>
> 1. NFS client and Glusterfs NFS Ganesha Server
> 2. SMB client and Glusterfs SMB server
>
> While it affects the encryption between the following:
>
> 1. NFS Ganesha server and Glusterfs bricks
> 2. Glusterfs SMB server and Glusterfs bricks
## Using TLS Identities for Authorization
Once TLS has been enabled on the I/O path, TLS identities can be used instead of
IP addresses or plain usernames to control access to specific volumes. For
example:
gluster volume set MYVOLUME auth.ssl-allow Zaphod
Here, we're allowing the TLS-authenticated identity "Zaphod" to access MYVOLUME.
This is intentionally identical to the existing "auth.allow" option, except that
the name is taken from a TLS certificate instead of a command-line string. Note
that infelicities in the gluster CLI preclude using names that include spaces,
which would otherwise be allowed.
## Enabling TLS on the Management Path
Management-daemon traffic is not controlled by an option. Instead, it is
controlled by the presence of a file on each machine:
/var/lib/glusterd/secure-access
Creating this file will cause glusterd connections made from that machine to use
TLS. Note that even clients must do this to communicate with a remote glusterd
while mounting, but not thereafter.
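
A sketch of enabling it on one machine (the restart step assumes a systemd-based server; clients only need the file to be present before mounting):

```console
touch /var/lib/glusterd/secure-access
systemctl restart glusterd
```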
## Additional Options
@@ -182,22 +181,22 @@ internals.
The first option allows the user to set the certificate depth, as mentioned
above.
gluster volume set MYVOLUME ssl.certificate-depth 2
Here, we're setting our certificate depth to two, as in the introductory
example. By default this value is zero, meaning that only certificates which
are directly specified in the local CA file will be accepted (i.e. no signed
certificates at all).
The second option allows the user to specify the set of allowed TLS ciphers.
gluster volume set MYVOLUME ssl.cipher-list 'HIGH:!SSLv2'
Cipher lists are negotiated between the two parties to a TLS connection so
that both sides' security needs are satisfied. In this example, we're setting
the initial cipher list to HIGH, representing ciphers that the cryptography
community still believes to be unbroken. We are also explicitly disallowing
ciphers specific to SSL version 2. The default is based on this example but
also excludes CBC-based cipher modes to provide extra mitigation against the
[POODLE][poo] attack.

View File

@@ -31,12 +31,12 @@ the required modules as follows:
1. Add the FUSE loadable kernel module (LKM) to the Linux kernel:
modprobe fuse
2. Verify that the FUSE module is loaded:
# dmesg | grep -i fuse
fuse init (API version 7.13)
### Installing on Red Hat Package Manager (RPM) Distributions
@@ -45,7 +45,7 @@ To install Gluster Native Client on RPM distribution-based systems
1. Install required prerequisites on the client using the following
command:
sudo yum -y install openssh-server wget fuse fuse-libs openib libibverbs
2. Ensure that TCP and UDP ports 24007 and 24008 are open on all
Gluster servers. Apart from these ports, you need to open one port
@@ -64,13 +64,12 @@ To install Gluster Native Client on RPM distribution-based systems
into effect.
You can use the following chains with iptables:
sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -j ACCEPT
sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 49152:49156 -j ACCEPT
> **Note**
> If you already have iptable chains, make sure that the above
> ACCEPT rules precede the DROP rules. This can be achieved by
> providing a lower rule number than the DROP rule.
@@ -84,15 +83,15 @@ To install Gluster Native Client on RPM distribution-based systems
You can download the software at [GlusterFS download page][1].
4. Install Gluster Native Client on the client.
**Note**
The package versions listed in the example below may not be the latest release. Please refer to the download page to ensure that you have the recently released packages.
sudo rpm -i glusterfs-3.8.5-1.x86_64
sudo rpm -i glusterfs-fuse-3.8.5-1.x86_64
sudo rpm -i glusterfs-rdma-3.8.5-1.x86_64
> **Note:**
> The RDMA module is only required when using Infiniband.
@@ -102,7 +101,7 @@ To install Gluster Native Client on Debian-based distributions
1. Install OpenSSH Server on each client using the following command:
sudo apt-get install openssh-server vim wget
2. Download the latest GlusterFS .deb file and checksum to each client.
@@ -112,14 +111,14 @@ To install Gluster Native Client on Debian-based distributions
and compare it against the checksum for that file in the md5sum
file.
md5sum GlusterFS_DEB_file.deb
The md5sum of the packages is available at: [GlusterFS download page][2]
4. Uninstall GlusterFS v3.1 (or an earlier version) from the client
using the following command:
sudo dpkg -r glusterfs
(Optional) Run `sudo dpkg --purge glusterfs` to purge the
configuration files.
@@ -127,11 +126,11 @@ To install Gluster Native Client on Debian-based distributions
5. Install Gluster Native Client on the client using the following
command:
sudo dpkg -i GlusterFS_DEB_file
For example:
sudo dpkg -i glusterfs-3.8.x.deb
6. Ensure that TCP and UDP ports 24007 and 24008 are open on all
Gluster servers. Apart from these ports, you need to open one port
@@ -151,12 +150,11 @@ To install Gluster Native Client on Debian-based distributions
You can use the following chains with iptables:
sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -j ACCEPT
sudo iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 49152:49156 -j ACCEPT
> **Note**
> If you already have iptable chains, make sure that the above
> ACCEPT rules precede the DROP rules. This can be achieved by
> providing a lower rule number than the DROP rule.
@@ -167,10 +165,8 @@ To build and install Gluster Native Client from the source code
1. Create a new directory using the following commands:
mkdir glusterfs
cd glusterfs
2. Download the source code.
@@ -178,11 +174,11 @@ To build and install Gluster Native Client from the source code
3. Extract the source code using the following command:
tar -xvzf SOURCE-FILE
4. Run the configuration utility using the following command:
$ ./configure
GlusterFS configure summary
===========================
@@ -198,26 +194,24 @@ To build and install Gluster Native Client from the source code
5. Build the Gluster Native Client software using the following
commands:
make
make install
6. Verify that the correct version of Gluster Native Client is
installed, using the following command:
glusterfs --version
## Mounting Volumes
After installing the Gluster Native Client, you need to mount Gluster
volumes to access data. There are two methods you can choose:
- [Manually Mounting Volumes](#manual-mount)
- [Automatically Mounting Volumes](#auto-mount)
> **Note**
> Server names selected during creation of Volumes should be resolvable
> in the client machine. You can use appropriate /etc/hosts entries or
> DNS server to resolve server names to IP addresses.
@@ -226,26 +220,25 @@ volumes to access data. There are two methods you can choose:
### Manually Mounting Volumes
- To mount a volume, use the following command:
mount -t glusterfs HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR
For example:
mount -t glusterfs server1:/test-volume /mnt/glusterfs
> **Note**
> The server specified in the mount command is only used to fetch
> the gluster configuration volfile describing the volume name.
> Subsequently, the client will communicate directly with the
> servers mentioned in the volfile (which might not even include the
> one used for mount).
>
> If you see a usage message like "Usage: mount.glusterfs", mount
> usually requires you to create a directory to be used as the mount
> point. Run "mkdir /mnt/glusterfs" before you attempt to run the
> mount command listed above.
**Mounting Options**
@@ -253,7 +246,7 @@ You can specify the following options when using the
`mount -t glusterfs` command. Note that you need to separate all options
with commas.
```text
backupvolfile-server=server-name
volfile-max-fetch-attempts=number of attempts
@@ -268,11 +261,11 @@ direct-io-mode=[enable|disable]
use-readdirp=[yes|no]
```
For example:
`mount -t glusterfs -o backupvolfile-server=volfile_server2,use-readdirp=no,volfile-max-fetch-attempts=2,log-level=WARNING,log-file=/var/log/gluster.log server1:/test-volume /mnt/glusterfs`
If `backupvolfile-server` option is added while mounting fuse client,
when the first volfile server fails, then the server specified in
@@ -288,6 +281,7 @@ If `use-readdirp` is set to ON, it forces the use of readdirp
mode in fuse kernel module
<a name="auto-mount"></a>
### Automatically Mounting Volumes
You can configure your system to automatically mount the Gluster volume
@@ -298,21 +292,21 @@ gluster configuration volfile describing the volume name. Subsequently,
the client will communicate directly with the servers mentioned in the
volfile (which might not even include the one used for mount).
- To mount a volume, edit the /etc/fstab file and add the following
line:
`HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR glusterfs defaults,_netdev 0 0 `
For example:
`server1:/test-volume /mnt/glusterfs glusterfs defaults,_netdev 0 0`
**Mounting Options**
You can specify the following options when updating the /etc/fstab file.
Note that you need to separate all options with commas.
```
log-level=loglevel
log-file=logfile
@@ -322,7 +316,7 @@ transport=transport-type
direct-io-mode=[enable|disable]
use-readdirp=no
```
For example:
@@ -332,40 +326,41 @@ For example:
To test mounted volumes
- Use the following command:
`# mount `
If the gluster volume was successfully mounted, the output of the
mount command on the client will be similar to this example:
`server1:/test-volume on /mnt/glusterfs type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)`
- Use the following command:

`# df`
The output of the df command on the client will display the
aggregated storage space from all the bricks in a volume similar to
this example:

    # df -h /mnt/glusterfs
    Filesystem            Size  Used  Avail  Use%  Mounted on
    server1:/test-volume   28T   22T   5.4T   82%  /mnt/glusterfs
- Change to the directory and list the contents by entering the
following:

```
# cd MOUNTDIR
# ls
```

- For example,

```
# cd /mnt/glusterfs
# ls
```
# NFS
@@ -388,59 +383,59 @@ mounted successfully.
## Using NFS to Mount Volumes
You can use either of the following methods to mount Gluster volumes:
- [Manually Mounting Volumes Using NFS](#manual-nfs)
- [Automatically Mounting Volumes Using NFS](#auto-nfs)
**Prerequisite**: Install the nfs-common package on both servers and clients
(only for Debian-based distributions), using the following command:
sudo aptitude install nfs-common
<a name="manual-nfs"></a>
### Manually Mounting Volumes Using NFS
**To manually mount a Gluster volume using NFS**
- To mount a volume, use the following command:

mount -t nfs -o vers=3 HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR

For example:

mount -t nfs -o vers=3 server1:/test-volume /mnt/glusterfs
> **Note**
> Gluster NFS server does not support UDP. If the NFS client you are
> using defaults to connecting using UDP, the following message
> appears:
>
> `requested NFS version or transport protocol is not supported`.
**To connect using TCP**

- Add the following option to the mount command:

`-o mountproto=tcp`

For example:

mount -o mountproto=tcp -t nfs server1:/test-volume /mnt/glusterfs
**To mount Gluster NFS server from a Solaris client**
- Use the following command:

mount -o proto=tcp,vers=3 nfs://HOSTNAME-OR-IPADDRESS:38467/VOLNAME MOUNTDIR

For example:

mount -o proto=tcp,vers=3 nfs://server1:38467/test-volume /mnt/glusterfs
<a name="auto-nfs"></a>
### Automatically Mounting Volumes Using NFS
You can configure your system to automatically mount Gluster volumes
@@ -448,32 +443,31 @@ using NFS each time the system starts.
**To automatically mount a Gluster volume using NFS**
- To mount a volume, edit the /etc/fstab file and add the following
line:

HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR nfs defaults,_netdev,vers=3 0 0

For example,

`server1:/test-volume /mnt/glusterfs nfs defaults,_netdev,vers=3 0 0`
> **Note**
> Gluster NFS server does not support UDP. If the NFS client you are
> using defaults to connecting using UDP, the following message
> appears:
>
> `requested NFS version or transport protocol is not supported.`
To connect using TCP

- Add the following entry to the /etc/fstab file:

HOSTNAME-OR-IPADDRESS:/VOLNAME MOUNTDIR nfs defaults,_netdev,mountproto=tcp 0 0

For example,

`server1:/test-volume /mnt/glusterfs nfs defaults,_netdev,mountproto=tcp 0 0`
**To automount NFS mounts**
@@ -488,31 +482,31 @@ You can confirm that Gluster directories are mounting successfully.
**To test mounted volumes**
- Use the mount command by entering the following:

`# mount`

For example, the output of the mount command on the client will
display an entry like the following:

`server1:/test-volume on /mnt/glusterfs type nfs (rw,vers=3,addr=server1)`

- Use the df command by entering the following:

`# df`

For example, the output of the df command on the client will display
the aggregated storage space from all the bricks in a volume.

    # df -h /mnt/glusterfs
    Filesystem            Size  Used  Avail  Use%  Mounted on
    server1:/test-volume   28T   22T   5.4T   82%  /mnt/glusterfs

- Change to the directory and list the contents by entering the
following:

`# cd MOUNTDIR`
`# ls`
# CIFS
@@ -535,14 +529,15 @@ verify that the volume has mounted successfully.
You can use either of the following methods to mount Gluster volumes:
- [Exporting Gluster Volumes Through Samba](#export-samba)
- [Manually Mounting Volumes Using CIFS](#cifs-manual)
- [Automatically Mounting Volumes Using CIFS](#cifs-auto)
You can also use Samba for exporting Gluster Volumes through CIFS
protocol.
<a name="export-samba"></a>
### Exporting Gluster Volumes Through Samba
We recommend using Samba to export Gluster volumes through the
@@ -560,7 +555,7 @@ CIFS protocol.
smb.conf file in an editor and add the following lines for a simple
configuration:
```
[glustertest]
comment = For testing a Gluster volume exported through CIFS
@@ -570,14 +565,14 @@ CIFS protocol.
read only = no
guest ok = yes
```
Save the changes and start the smb service using your system's init
scripts (/etc/init.d/smb [re]start). The above steps are needed for
multiple mounts. If you want only a Samba mount, then you need to add
the following to your smb.conf:
```
kernel share modes = no
kernel oplocks = no
map archive = no
@@ -585,8 +580,7 @@ need to add
map read only = no
map system = no
store dos attributes = yes
```
> **Note**
>
@@ -595,6 +589,7 @@ need to add
> configurations, see Samba documentation.
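As an optional check (assuming the Samba service is running on the Gluster server, the smbclient tool is installed, and the share is named `glustertest` as in the sample configuration above), you can list the exported shares:

```console
smbclient -L localhost -N
```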
<a name="cifs-manual"></a>
### Manually Mounting Volumes Using CIFS
You can manually mount Gluster volumes using CIFS on Microsoft
@@ -618,6 +613,7 @@ Alternatively, to manually mount a Gluster volume using CIFS by going to
**Start \> Run** and entering Network path manually.
<a name="cifs-auto"></a>
### Automatically Mounting Volumes Using CIFS
You can configure your system to automatically mount Gluster volumes
@@ -1,59 +1,73 @@
# Split brain and the ways to deal with it
### Split brain:
Split brain is a situation where two or more replicated copies of a file become divergent. When a file is in split brain, there is an inconsistency in either the data or the metadata of the file amongst the bricks of a replica, and there is not enough information to authoritatively pick a copy as being pristine and heal the bad copies, despite all bricks being up and online. For a directory, there is also an entry split brain, where a file inside it can have different gfid/file-type across the bricks of a replica. Split brain can happen mainly because of 2 reasons:
- Due to network disconnect, where a client temporarily loses connection to the bricks.

> 1. There is a replica pair of 2 bricks, brick1 on server1 and brick2 on server2.
> 2. Client1 loses connection to brick2 and client2 loses connection to brick1 due to a network split.
> 3. Writes from client1 go to brick1 and from client2 go to brick2, which is nothing but split-brain.

- Gluster brick processes going down or returning an error:

> 1. Server1 is down and server2 is up: Writes happen on server2.
> 2. Server1 comes up, server2 goes down (heal has not happened / data on server2 is not replicated on server1): Writes happen on server1.
> 3. Server2 comes up: Both server1 and server2 have data independent of each other.
If we use a replica 2 volume, it is not possible to prevent split-brain without losing availability.
### Ways to deal with split brain:
In glusterfs there are ways to resolve split brain. You can see the detailed description of how to resolve a split-brain [here](../Troubleshooting/resolving-splitbrain.md). Moreover, there are ways to reduce the chances of ending up in split-brain situations. They are:
1. Replica 3 volume
2. Arbiter volume
Both of these use the client-quorum option of glusterfs to avoid split-brain situations.
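For reference, whichever configuration you choose, files that are currently in split brain can be listed from the CLI (test-volume below is just a placeholder volume name):

```console
gluster volume heal test-volume info split-brain
```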
### Client quorum:
This is a feature implemented in the Automatic File Replication (AFR from here on) module, to prevent split-brains in the I/O path for replicate/distributed-replicate volumes. By default, if the client-quorum is not met for a particular replica subvol, it becomes read-only. The other subvols (in a dist-rep volume) will still have R/W access. [Here](arbiter-volumes-and-quorum.md#client-quorum) you can see more details about client-quorum.
#### Client quorum in replica 2 volumes:
In a replica 2 volume it is not possible to achieve high availability and consistency at the same time, without sacrificing tolerance to partition. If we set the client-quorum option to auto, then the first brick must always be up, irrespective of the status of the second brick. If only the second brick is up, the subvolume becomes read-only.
If the quorum-type is set to fixed, and the quorum-count is set to 1, then we may end up in split brain.
- Brick1 is up and brick2 is down. Quorum is met and write happens on brick1.
- Brick1 goes down and brick2 comes up (No heal happened). Quorum is met, write happens on brick2.
- Brick1 comes up. Quorum is met, but both the bricks have independent writes - split-brain.
To avoid this we have to set the quorum-count to 2, which costs availability: even if only one replica brick is up and running, the quorum is not met and we end up seeing EROFS.
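As a sketch of the trade-off described above, using the standard quorum options (test-volume is a placeholder name):

```console
# demand both bricks of the replica 2 volume for writes (consistency over availability)
gluster volume set test-volume cluster.quorum-type fixed
gluster volume set test-volume cluster.quorum-count 2
```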
### 1. Replica 3 volume:
When we create a replicated or distributed replicated volume with replica count 3, the cluster.quorum-type option is set to auto by default. That means at least 2 bricks should be up and running to satisfy the quorum and allow the writes. This is the recommended setting for a replica 3 volume and this should not be changed. Here is how it prevents files from ending up in split brain:
B1, B2, and B3 are the 3 bricks of a replica 3 volume.
1. B1 & B2 are up and B3 is down. Quorum is met and write happens on B1 & B2.
2. B3 comes up and B2 is down. Quorum is met and write happens on B1 & B3.
3. B2 comes up and B1 goes down. Quorum is met. But when a write request comes, AFR sees that B2 & B3 are blaming each other (B2 says that some writes are pending on B3 and B3 says that some writes are pending on B2), therefore the write is not allowed and is failed with EIO.
Command to create a replica 3 volume:
```sh
gluster volume create <volname> replica 3 host1:brick1 host2:brick2 host3:brick3
```
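After creating the volume, you can confirm that client quorum defaults to auto for the replica 3 volume (same placeholder volume name as above):

```console
gluster volume get <volname> cluster.quorum-type
```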
### 2. Arbiter volume:
Arbiter offers the sweet spot between replica 2 and replica 3, where the user wants the split-brain protection offered by replica 3 but does not want to invest in 3x storage space. Arbiter is also a replica 3 volume where the third brick of the replica is automatically configured as an arbiter node. This means that the third brick stores only the file name and metadata, but not any data. This will help in avoiding split brain while providing the same level of consistency as a normal replica 3 volume.
Command to create an arbiter volume:
```sh
gluster volume create <volname> replica 3 arbiter 1 host1:brick1 host2:brick2 host3:brick3
```
The only difference in the command is, we need to add one more keyword `arbiter 1` after the replica count. Since it is also a replica 3 volume, the cluster.quorum-type option is set to auto by default and at least 2 bricks should be up to satisfy the quorum and allow writes.
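To double-check which brick acts as the arbiter after creation, you can inspect the volume (placeholder volume name as above); recent releases mark the arbiter brick in the brick listing of the output:

```console
gluster volume info <volname>
```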
Since the arbiter brick has only name and metadata of the files, there are some more checks to guarantee consistency. Arbiter works as follows:
1. Clients take full file locks while writing (replica 3 takes range locks).
@@ -65,6 +79,7 @@ Since the arbiter brick has only name and metadata of the files, there are some
You can find more details on arbiter [here](arbiter-volumes-and-quorum.md).
### Differences between replica 3 and arbiter volumes:
1. In case of a replica 3 volume, we store the entire file in all the bricks and it is recommended to have bricks of the same size. But in case of arbiter, since we do not store data, the size of the arbiter brick is comparatively smaller than that of the other bricks.
2. Arbiter is a state between a replica 2 and a replica 3 volume. If only the arbiter and one of the other bricks are up, and the arbiter brick blames the other brick, then we cannot proceed with the FOPs.
3. Replica 3 gives higher availability compared to arbiter, because unlike in arbiter, replica 3 has a full copy of the data in all 3 bricks.
@@ -19,53 +19,47 @@ following ways:
## Distributions with systemd
<a name="manual"></a>
### Starting and stopping glusterd manually
- To start `glusterd` manually:
```console
systemctl start glusterd
```
- To stop `glusterd` manually:
```console
systemctl stop glusterd
```
<a name="auto"></a>
### Starting glusterd automatically
- To enable the glusterd service and start it if stopped:
```console
systemctl enable --now glusterd
```
- To disable the glusterd service and stop it if started:
```console
systemctl disable --now glusterd
```
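Optionally, you can check the current state of the service at any time with:

```console
systemctl status glusterd
```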
## Distributions without systemd
<a name="manual-legacy"></a>
### Starting and stopping glusterd manually
This section describes how to start and stop glusterd manually.

- To start glusterd manually, enter the following command:

```console
/etc/init.d/glusterd start
```

- To stop glusterd manually, enter the following command:

```console
/etc/init.d/glusterd stop
```
<a name="auto-legacy"></a>
### Starting glusterd Automatically
This section describes how to configure the system to automatically
@@ -78,7 +72,7 @@ service every time the system boots, enter the following from the
command line:
```console
chkconfig glusterd on
```
#### Debian and derivatives like Ubuntu
@@ -88,7 +82,7 @@ service every time the system boots, enter the following from the
command line:
```console
update-rc.d glusterd defaults
```
#### Systems Other than Red Hat and Debian
@@ -98,5 +92,5 @@ the glusterd service every time the system boots, enter the following
entry to the */etc/rc.local* file:
```console
# echo "glusterd" >> /etc/rc.local
echo "glusterd" >> /etc/rc.local
```
@@ -1,6 +1,5 @@
# Managing Trusted Storage Pools
### Overview
A trusted storage pool (TSP) is a trusted network of storage servers. Before you can configure a
@@ -11,19 +10,19 @@ The servers in a TSP are peers of each other.
After installing Gluster on your servers and before creating a trusted storage pool,
each server belongs to a storage pool consisting of only that server.
- [Managing Trusted Storage Pools](#managing-trusted-storage-pools)
  - [Overview](#overview)
  - [Adding Servers](#adding-servers)
  - [Listing Servers](#listing-servers)
  - [Viewing Peer Status](#viewing-peer-status)
  - [Removing Servers](#removing-servers)
**Before you start**:
- The servers used to create the storage pool must be resolvable by hostname.
- The glusterd daemon must be running on all storage servers that you
want to add to the storage pool. See [Managing the glusterd Service](./Start-Stop-Daemon.md) for details.
- The firewall on the servers must be configured to allow access to port 24007.
@@ -31,6 +30,7 @@ The following commands were run on a TSP consisting of 3 servers - server1, serv
and server3.
<a name="adding-servers"></a>
### Adding Servers
To add a server to a TSP, peer probe it from a server already in the pool.
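For example, to probe server2 from server1 (the server names used in the examples below):

```console
gluster peer probe server2
```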
@@ -59,9 +59,8 @@ Verify the peer status from the first server (server1):
Uuid: 3e0cabaa-9df7-4f66-8e5d-cbc348f29ff7
State: Peer in Cluster (Connected)
<a name="listing-servers"></a>
### Listing Servers
To list all nodes in the TSP:
@@ -73,9 +72,8 @@ To list all nodes in the TSP:
1e0ca3aa-9ef7-4f66-8f15-cbc348f29ff7 server3 Connected
3e0cabaa-9df7-4f66-8e5d-cbc348f29ff7 server4 Connected
<a name="peer-status"></a>
### Viewing Peer Status
To view the status of the peers in the TSP:
@@ -95,9 +93,8 @@ To view the status of the peers in the TSP:
Uuid: 3e0cabaa-9df7-4f66-8e5d-cbc348f29ff7
State: Peer in Cluster (Connected)
<a name="removing-servers"></a>
### Removing Servers
To remove a server from the TSP, run the following command from another server in the pool:
@@ -109,7 +106,6 @@ For example, to remove server4 from the trusted storage pool:
server1# gluster peer detach server4
Detach successful
Verify the peer status:
server1# gluster peer status
@@ -9,13 +9,13 @@ all files as a whole. So, even different file, if the write fails
on the other data brick but succeeds on this 'bad' brick, we will return
failure for the write.
- [Thin Arbiter volumes in gluster](#thin-arbiter-volumes-in-gluster)
- [Why Thin Arbiter?](#why-thin-arbiter)
- [Setting Up Thin Arbiter Volume](#setting-up-thin-arbiter-volume)
- [How Thin Arbiter works](#how-thin-arbiter-works)
# Why Thin Arbiter?
This is a solution for handling stretch-cluster kind of workloads,
but it can also be used for regular workloads in case users are
satisfied with this kind of quorum in comparison to arbiter/3-way replication.
@@ -31,28 +31,34 @@ thin-arbiter only in the case of first failure until heal completes.
# Setting Up Thin Arbiter Volume
The command to run the thin-arbiter process on a node:
```console
/usr/local/sbin/glusterfsd -N --volfile-id ta-vol -f /var/lib/glusterd/vols/thin-arbiter.vol --brick-port 24007 --xlator-option ta-vol-server.transport.socket.listen-port=24007
```
Creating a thin arbiter replica 2 volume:
```console
glustercli volume create <volname> --replica 2 <host1>:<brick1> <host2>:<brick2> --thin-arbiter <quorum-host>:<path-to-store-replica-id-file>
```
For example:
```console
glustercli volume create testvol --replica 2 server{1..2}:/bricks/brick-{1..2} --thin-arbiter server-3:/bricks/brick_ta --force
volume create: testvol: success: please start the volume to access data
```
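Following the hint in the output, the volume can then be started and its state checked (testvol as in the example above):

```console
gluster volume start testvol
gluster volume status testvol
```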
# How Thin Arbiter works
There will be only one process running on the thin arbiter node, which will be
used to update the replica id file for all replica pairs across all volumes.
The replica id file contains the information of good and bad data bricks in the
form of xattrs. Each replica pair will use its respective replica-id file, which
is created during mount.
1. Read Transactions:
   Reads are allowed when quorum is met, i.e.
- When all data bricks and thin arbiter are up: Perform lookup on data bricks to figure out good/bad bricks and
serve content from the good brick.
@@ -65,7 +71,7 @@ Reads are allowed when quorum is met. i.e.
done on the data brick to check if the file is really healthy or not. If the file is good, data will be served from
this brick, else an EIO error would be returned to the user.
2. Write transactions:
   Thin arbiter doesn't participate in I/O; the transaction will choose to wind operations on the thin-arbiter brick to
   make sure the necessary metadata is kept up-to-date in case of failures. Operation failure will lead to
   updating the replica-id file on thin-arbiter with source/sink information in the xattrs, just how it happens in AFR.
@@ -1,80 +1,85 @@
# Trash Translator
Trash translator will allow users to access deleted or truncated files. Every brick will maintain a hidden .trashcan directory, which will be used to store the files deleted or truncated from the respective brick. The aggregate of all those .trashcan directories can be accessed from the mount point. To avoid name collisions, a timestamp is appended to the original file name while it is being moved to the trash directory.
## Implications and Usage
Apart from the primary use-case of accessing files deleted or truncated by the user, the trash translator can be helpful for internal operations such as self-heal and rebalance. During self-heal and rebalance it is possible to lose crucial data. In those circumstances, the trash translator can assist in the recovery of the lost data. The trash translator is designed to intercept unlink, truncate and ftruncate fops, store a copy of the current file in the trash directory, and then perform the fop on the original file. For the internal operations, the files are stored under the 'internal_op' folder inside the trash directory.
## Volume Options
- **_`gluster volume set <VOLNAME> features.trash <on/off>`_**

  This command can be used to enable a trash translator in a volume. If set to on, a trash directory will be created in every brick inside the volume during the volume start command. By default, a translator is loaded during volume start but remains non-functional. Disabling trash with the help of this option will not remove the trash directory or even its contents from the volume.

- **_`gluster volume set <VOLNAME> features.trash-dir <name>`_**

  This command is used to reconfigure the trash directory to a user-specified name. The argument is a valid directory name. The directory will be created inside every brick under this name. If not specified by the user, the trash translator will create the trash directory with the default name “.trashcan”. This can be used only when the trash-translator is on.

- **_`gluster volume set <VOLNAME> features.trash-max-filesize <size>`_**

  This command can be used to filter files entering the trash directory based on their size. Files above trash_max_filesize are deleted/truncated directly. Value for size may be followed by multiplicative suffixes as KB(=1024 bytes), MB(=1024\*1024 bytes), and GB(=1024\*1024\*1024 bytes). The default size is set to 5MB.

- **_`gluster volume set <VOLNAME> features.trash-eliminate-path <path1> [ , <path2> , . . . ]`_**

  This command can be used to set the eliminate pattern for the trash translator. Files residing under this pattern will not be moved to the trash directory during deletion/truncation. The path must be a valid one present in the volume.

- **_`gluster volume set <VOLNAME> features.trash-internal-op <on/off>`_**

  This command can be used to enable trash for internal operations like self-heal and re-balance. By default set to off.
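For instance, a hypothetical setup that combines the options above on a volume named test could look like this:

```console
gluster volume set test features.trash on
gluster volume set test features.trash-dir trash
gluster volume set test features.trash-max-filesize 1GB
gluster volume set test features.trash-internal-op on
```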
## Sample usage
The following steps illustrate a simple scenario of deletion of a file from a directory.
1. Create a simple distributed volume and start it.

        gluster volume create test rhs:/home/brick
        gluster volume start test

2. Enable trash translator

        gluster volume set test features.trash on

3. Mount glusterfs volume via native client as follows.

        mount -t glusterfs rhs:test /mnt

4. Create a directory and file in the mount.

        mkdir mnt/dir
        echo abc > mnt/dir/file

5. Delete the file from the mount.

        rm mnt/dir/file -rf

6. Check inside the trash directory.

        ls mnt/.trashcan
We can find the deleted file inside the trash directory with a timestamp appended to its filename.
For example,
```console
mount -t glusterfs rh-host:/test /mnt/test
mkdir /mnt/test/abc
touch /mnt/test/abc/file
rm -f /mnt/test/abc/file
ls /mnt/test/abc
ls /mnt/test/.trashcan/abc/
```
You will see `file2014-08-21_123400` as the output of the last `ls` command.
#### Points to be remembered
- As soon as the volume is started, the trash directory will be created inside the volume and will be visible through the mount. Disabling the trash will not have any impact on its visibility from the mount.
- Even though deletion of the trash directory itself is not permitted, its current contents will be removed when a delete is issued on it, leaving only an empty trash directory.
#### Known issue
Since the trash translator resides on the server side, higher translators like AFR and DHT are unaware of the rename and truncate operations being done by this translator, which eventually moves the files to the trash directory. Unless and until a complete-path-based lookup comes on trashed files, those may not be visible from the mount.
@@ -1,4 +1,3 @@
<a name="tuning-options"></a>
You can tune volume options, as needed, while the cluster is online and
@@ -34,130 +33,130 @@ description and default value:
> The default options given here are subject to modification at any
> given time and may not be the same for all versions.
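Options from the table below are applied with `gluster volume set`; for example (the volume name and value are just placeholders):

```console
gluster volume set test-volume performance.cache-size 256MB
```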
Type | Option | Description | Default Value | Available Options
--- | --- | --- | --- | ---
| auth.allow | IP addresses of the clients which should be allowed to access the volume. | \* (allow all) | Valid IP address which includes wild card patterns including \*, such as 192.168.1.\*
| auth.reject | IP addresses of the clients which should be denied to access the volume. | NONE (reject none) | Valid IP address which includes wild card patterns including \*, such as 192.168.2.\*
Cluster | cluster.self-heal-window-size | Specifies the maximum number of blocks per file on which self-heal would happen simultaneously. | 1 | 0 - 1024 blocks
| cluster.data-self-heal-algorithm | Specifies the type of self-heal. If you set the option as "full", the entire file is copied from source to destinations. If the option is set to "diff" the file blocks that are not in sync are copied to destinations. Reset uses a heuristic model. If the file does not exist on one of the subvolumes, or a zero-byte file exists (created by entry self-heal) the entire content has to be copied anyway, so there is no benefit from using the "diff" algorithm. If the file size is about the same as page size, the entire file can be read and written with a few operations, which will be faster than "diff" which has to read checksums and then read and write. | reset | full/diff/reset
| cluster.min-free-disk | Specifies the percentage of disk space that must be kept free. Might be useful for non-uniform bricks | 10% | Percentage of required minimum free disk space
| cluster.min-free-inodes | Specifies when system has only N% of inodes remaining, warnings starts to appear in log files | 10% | Percentage of required minimum free inodes
| cluster.stripe-block-size | Specifies the size of the stripe unit that will be read from or written to. | 128 KB (for all files) | size in bytes
| cluster.self-heal-daemon | Allows you to turn-off proactive self-heal on replicated | On | On/Off
| cluster.ensure-durability | This option makes sure the data/metadata is durable across abrupt shutdown of the brick. | On | On/Off
| cluster.lookup-unhashed | This option does a lookup through all the sub-volumes, in case a lookup didnt return any result from the hashed subvolume. If set to OFF, it does not do a lookup on the remaining subvolumes. | on | auto, yes/no, enable/disable, 1/0, on/off
| cluster.lookup-optimize | This option enables the optimization of -ve lookups, by not doing a lookup on non-hashed subvolumes for files, in case the hashed subvolume does not return any result. This option disregards the lookup-unhashed setting, when enabled. | on | on/off
| cluster.randomize-hash-range-by-gfid | Allows to use gfid of directory to determine the subvolume from which hash ranges are allocated starting with 0. Note that we still use a directory/files name to determine the subvolume to which it hashes | off | on/off
| cluster.rebal-throttle | Sets the maximum number of parallel file migrations allowed on a node during the rebalance operation. The default value is normal and allows 2 files to be migrated at a time. Lazy will allow only one file to be migrated at a time and aggressive will allow maxof[(((processing units) - 4) / 2), 4] | normal | lazy/normal/aggressive
| cluster.background-self-heal-count | Specifies the number of per client self-heal jobs that can perform parallel heals in the background. | 8 | 0-256
| cluster.heal-timeout | Time interval for checking the need to self-heal in self-heal-daemon | 600 | 5-(signed-int)
| cluster.eager-lock | If eager-lock is off, locks release immediately after file operations complete, improving performance for some operations, but reducing access efficiency | on | on/off
| cluster.quorum-type | If value is “fixed” only allow writes if quorum-count bricks are present. If value is “auto” only allow writes if more than half of bricks, or exactly half including the first brick, are present | none | none/auto/fixed
| cluster.quorum-count | If quorum-type is “fixed” only allow writes if this many bricks are present. Other quorum types will OVERWRITE this value | null | 1-(signed-int)
| cluster.heal-wait-queue-length | Specifies the number of heals that can be queued for the parallel background self heal jobs. | 128 | 0-10000
| cluster.favorite-child-policy | Specifies which policy can be used to automatically resolve split-brains without user intervention. “size” picks the file with the biggest size as the source. “ctime” and “mtime” pick the file with the latest ctime and mtime respectively as the source. “majority” picks a file with identical mtime and size in more than half the number of bricks in the replica. | none | none/size/ctime/mtime/majority
| cluster.use-anonymous-inode | Setting this option heals directory renames efficiently | no | no/yes
Disperse | disperse.eager-lock | If eager-lock is on, the lock remains in place either until lock contention is detected, or for 1 second in order to check if there is another request for that file from the same client. If eager-lock is off, locks release immediately after file operations complete, improving performance for some operations, but reducing access efficiency. | on | on/off
| disperse.other-eager-lock | This option is equivalent to the disperse.eager-lock option but applicable only for non regular files. When multiple clients access a particular directory, disabling the disperse.other-eager-lock option for the volume can improve performance for directory access without compromising performance of I/O's for regular files. | off | on/off
| disperse.shd-max-threads | Specifies the number of entries that can be self healed in parallel on each disperse subvolume by self-heal daemon. | 1 | 1 - 64
| disperse.shd-wait-qlength | Specifies the number of entries that must be kept in the dispersed subvolume's queue for self-heal daemon threads to take up as soon as any of the threads are free to heal. This value should be changed based on how much memory self-heal daemon process can use for keeping the next set of entries that need to be healed. | 1024 | 1 - 655536
| disperse.eager-lock-timeout | Maximum time (in seconds) that a lock on an inode is kept held if no new operations on the inode are received. | 1 | 1-60
| disperse.other-eager-lock-timeout | Its equivalent to eager-lock-timeout option but for non regular files. | 1 | 1-60
| disperse.background-heals | This option can be used to control number of parallel heals running in background. | 8 | 0-256
| disperse.heal-wait-qlength | This option can be used to control number of heals that can wait | 128 | 0-65536
| disperse.read-policy | inode-read fops happen only on k number of bricks in n=k+m disperse subvolume. round-robin selects the read subvolume using round-robin algo. gfid-hash selects read subvolume based on hash of the gfid of that file/directory. | gfid-hash | round-robin/gfid-hash
| disperse.self-heal-window-size | Maximum number blocks(128KB) per file for which self-heal process would be applied simultaneously. | 1 | 1-1024
| disperse.optimistic-change-log | This option Set/Unset dirty flag for every update fop at the start of the fop. If OFF, this option impacts performance of entry or metadata operations as it will set dirty flag at the start and unset it at the end of ALL update fop. If ON and all the bricks are good, dirty flag will be set at the start only for file fops, For metadata and entry fops dirty flag will not be set at the start This does not impact performance for metadata operations and entry operation but has a very small window to miss marking entry as dirty in case it is required to be healed. |on | on/off
| disperse.parallel-writes | This controls if writes can be wound in parallel as long as it doesnt modify same stripes | on | on/off
| disperse.stripe-cache | This option will keep the last stripe of write fop in memory. If next write falls in this stripe, we need not to read it again from backend and we can save READ fop going over the network. This will improve performance, specially for sequential writes. However, this will also lead to extra memory consumption, maximum (cache size * stripe size) Bytes per open file |4 | 0-10
| disperse.quorum-count | This option can be used to define how many successes on the bricks constitute a success to the application. This count should be in the range [disperse-data-count, disperse-count] (inclusive) | 0 | 0-(signedint)
| disperse.use-anonymous-inode | Setting this option heals renames efficiently | off | on/off
Logging | diagnostics.brick-log-level | Changes the log-level of the bricks | INFO | DEBUG/WARNING/ERROR/CRITICAL/NONE/TRACE
| diagnostics.client-log-level | Changes the log-level of the clients. | INFO | DEBUG/WARNING/ERROR/CRITICAL/NONE/TRACE
| diagnostics.brick-sys-log-level | Depending on the value defined for this option, log messages at and above the defined level are generated in the syslog and the brick log files. | CRITICAL | INFO/WARNING/ERROR/CRITICAL
| diagnostics.client-sys-log-level | Depending on the value defined for this option, log messages at and above the defined level are generated in the syslog and the client log files. | CRITICAL | INFO/WARNING/ERROR/CRITICAL
| diagnostics.brick-log-format | Allows you to configure the log format to log either with a message id or without one on the brick. | with-msg-id | no-msg-id/with-msg-id
| diagnostics.client-log-format | Allows you to configure the log format to log either with a message ID or without one on the client. | with-msg-id | no-msg-id/with-msg-id
| diagnostics.brick-log-buf-size | The maximum number of unique log messages that can be suppressed until the timeout or buffer overflow, whichever occurs first on the bricks.| 5 | 0 and 20 (0 and 20 included)
| diagnostics.client-log-buf-size | The maximum number of unique log messages that can be suppressed until the timeout or buffer overflow, whichever occurs first on the clients.| 5 | 0 and 20 (0 and 20 included)
| diagnostics.brick-log-flush-timeout | The length of time for which the log messages are buffered, before being flushed to the logging infrastructure (gluster or syslog files) on the bricks. | 120 | 30 - 300 seconds (30 and 300 included)
| diagnostics.client-log-flush-timeout | The length of time for which the log messages are buffered, before being flushed to the logging infrastructure (gluster or syslog files) on the clients. | 120 | 30 - 300 seconds (30 and 300 included)
Performance | *features.trash | Enable/disable trash translator | off | on/off
| *performance.readdir-ahead | Enable/disable readdir-ahead translator in the volume | off | on/off
| *performance.read-ahead | Enable/disable read-ahead translator in the volume | off | on/off
| *performance.io-cache | Enable/disable io-cache translator in the volume | off | on/off
| performance.quick-read | To enable/disable quick-read translator in the volume. | on | off/on
| performance.md-cache | Enables and disables md-cache translator. | off | off/on
| performance.open-behind | Enables and disables open-behind translator. | on | off/on
| performance.nl-cache | Enables and disables nl-cache translator. | off | off/on
| performance.stat-prefetch | Enables and disables stat-prefetch translator. | on | off/on
| performance.client-io-threads | Enables and disables client-io-thread translator. | on | off/on
| performance.write-behind | Enables and disables write-behind translator. | on | off/on
| performance.write-behind-window-size | Size of the per-file write-behind buffer. | 1MB | Write-behind cache size
| performance.io-thread-count | The number of threads in IO threads translator. | 16 | 1-64
| performance.flush-behind | If this option is set ON, instructs write-behind translator to perform flush in background, by returning success (or any errors, if any of previous writes were failed) to application even before flush is sent to backend filesystem. | On | On/Off
| performance.cache-max-file-size | Sets the maximum file size cached by the io-cache translator. Can use the normal size descriptors of KB, MB, GB,TB or PB (for example, 6GB). Maximum size uint64. | 2 ^ 64 -1 bytes | size in bytes
| performance.cache-min-file-size | Sets the minimum file size cached by the io-cache translator. Values same as "max" above | 0B | size in bytes
| performance.cache-refresh-timeout | The cached data for a file will be retained till 'cache-refresh-timeout' seconds, after which data re-validation is performed. | 1s | 0-61
| performance.cache-size | Size of the read cache. | 32 MB | size in bytes
| performance.lazy-open | This option requires open-behind to be on. Perform an open in the backend only when a necessary FOP arrives (for example, write on the file descriptor, unlink of the file). When this option is disabled, perform backend open immediately after an unwinding open. | Yes | Yes/No
| performance.md-cache-timeout | The time period in seconds which controls when metadata cache has to be refreshed. If the age of cache is greater than this time-period, it is refreshed. Every time cache is refreshed, its age is reset to 0. | 1 | 0-600 seconds
| performance.nfs-strict-write-ordering | Specifies whether to prevent later writes from overtaking earlier writes for NFS, even if the writes do not relate to the same files or locations. | off | on/off
| performance.nfs.flush-behind | Specifies whether the write-behind translator performs flush operations in the background for NFS by returning (false) success to the application before flush file operations are sent to the backend file system. | on | on/off
| performance.nfs.strict-o-direct | Specifies whether to attempt to minimize the cache effects of I/O for a file on NFS. When this option is enabled and a file descriptor is opened using the O_DIRECT flag, write-back caching is disabled for writes that affect that file descriptor. When this option is disabled, O_DIRECT has no effect on caching. This option is ignored if performance.write-behind is disabled. | off | on/off
| performance.nfs.write-behind-trickling-writes | Enables and disables trickling-write strategy for the write-behind translator for NFS clients. | on | off/on
| performance.nfs.write-behind-window-size | Specifies the size of the write-behind buffer for a single file or inode for NFS. | 1 MB | 512 KB - 1 GB
| performance.rda-cache-limit | The value specified for this option is the maximum size of cache consumed by the readdir-ahead translator. This value is global and the total memory consumption by readdir-ahead is capped by this value, irrespective of the number/size of directories cached. | 10MB | 0-1GB
| performance.rda-request-size | The value specified for this option will be the size of buffer holding directory entries in readdirp response. | 128KB | 4KB-128KB
| performance.resync-failed-syncs-after-fsync | If syncing cached writes that were issued before an fsync operation fails, this option configures whether to reattempt the failed sync operations. |off | on/off
| performance.strict-o-direct | Specifies whether to attempt to minimize the cache effects of I/O for a file. When this option is enabled and a file descriptor is opened using the O_DIRECT flag, write-back caching is disabled for writes that affect that file descriptor. When this option is disabled, O_DIRECT has no effect on caching. This option is ignored if performance.write-behind is disabled. | on | on/off
| performance.strict-write-ordering | Specifies whether to prevent later writes from overtaking earlier writes, even if the writes do not relate to the same files or locations. | on | on/off
| performance.use-anonymous-fd | This option requires open-behind to be on. For read operations, use anonymous file descriptor when the original file descriptor is open-behind and not yet opened in the backend.| Yes | No/Yes
| performance.write-behind-trickling-writes | Enables and disables trickling-write strategy for the write-behind translator for FUSE clients. | on | off/on
| performance.write-behind-window-size | Specifies the size of the write-behind buffer for a single file or inode. | 1MB | 512 KB - 1 GB
| features.read-only | Enables you to mount the entire volume as read-only for all the clients (including NFS clients) accessing it. | Off | On/Off
| features.quota-deem-statfs | When this option is set to on, it takes the quota limits into consideration while estimating the filesystem size. The limit will be treated as the total size instead of the actual size of filesystem. | on | on/off
| features.shard | Enables or disables sharding on the volume. Affects files created after volume configuration. | disable | enable/disable
| features.shard-block-size | Specifies the maximum size of file pieces when sharding is enabled. Affects files created after volume configuration. | 64MB | 4MB-4TB
| features.uss | This option enable/disable User Serviceable Snapshots on the volume. | off | on/off
| geo-replication.indexing | Use this option to automatically sync the changes in the filesystem from Primary to Secondary. | Off | On/Off
| network.frame-timeout | The time frame after which the operation has to be declared as dead, if the server does not respond for a particular operation. | 1800 (30 mins) | 1800 secs
| network.ping-timeout | The time duration for which the client waits to check if the server is responsive. When a ping timeout happens, there is a network disconnect between the client and server. All resources held by server on behalf of the client get cleaned up. When a reconnection happens, all resources will need to be re-acquired before the client can resume its operations on the server. Additionally, the locks will be acquired and the lock tables updated. This reconnect is a very expensive operation and should be avoided. | 42 Secs | 42 Secs
nfs | nfs.enable-ino32 | For 32-bit nfs clients or applications that do not support 64-bit inode numbers or large files, use this option from the CLI to make Gluster NFS return 32-bit inode numbers instead of 64-bit inode numbers. | Off | On/Off
| nfs.volume-access | Set the access type for the specified sub-volume. | read-write | read-write/read-only
| nfs.trusted-write | If there is an UNSTABLE write from the client, STABLE flag will be returned to force the client to not send a COMMIT request. In some environments, combined with a replicated GlusterFS setup, this option can improve write performance. This flag allows users to trust Gluster replication logic to sync data to the disks and recover when required. COMMIT requests if received will be handled in a default manner by fsyncing. STABLE writes are still handled in a sync manner. | Off | On/Off
| nfs.trusted-sync | All writes and COMMIT requests are treated as async. This implies that no write requests are guaranteed to be on server disks when the write reply is received at the NFS client. Trusted sync includes trusted-write behavior. | Off | On/Off
| nfs.export-dir | This option can be used to export specified comma separated subdirectories in the volume. The path must be an absolute path. Along with path allowed list of IPs/hostname can be associated with each subdirectory. If provided connection will allowed only from these IPs. Format: \<dir\>[(hostspec[hostspec...])][,...]. Where hostspec can be an IP address, hostname or an IP range in CIDR notation. **Note**: Care must be taken while configuring this option as invalid entries and/or unreachable DNS servers can introduce unwanted delay in all the mount calls. | No sub directory exported. | Absolute path with allowed list of IP/hostname
| nfs.export-volumes | Enable/Disable exporting entire volumes, instead if used in conjunction with nfs3.export-dir, can allow setting up only subdirectories as exports. | On | On/Off
| nfs.rpc-auth-unix | Enable/Disable the AUTH_UNIX authentication type. This option is enabled by default for better interoperability. However, you can disable it if required. | On | On/Off
| nfs.rpc-auth-null | Enable/Disable the AUTH_NULL authentication type. It is not recommended to change the default value for this option. | On | On/Off
| nfs.rpc-auth-allow\<IP- Addresses\> | Allow a comma separated list of addresses and/or hostnames to connect to the server. By default, all clients are disallowed. This allows you to define a general rule for all exported volumes. | Reject All | IP address or Host name
| nfs.rpc-auth-reject\<IP- Addresses\> | Reject a comma separated list of addresses and/or hostnames from connecting to the server. By default, all connections are disallowed. This allows you to define a general rule for all exported volumes. | Reject All | IP address or Host name
| nfs.ports-insecure | Allow client connections from unprivileged ports. By default only privileged ports are allowed. This is a global setting in case insecure ports are to be enabled for all exports using a single option. | Off | On/Off
| nfs.addr-namelookup | Turn-off name lookup for incoming client connections using this option. In some setups, the name server can take too long to reply to DNS queries resulting in timeouts of mount requests. Use this option to turn off name lookups during address authentication. Note, turning this off will prevent you from using hostnames in rpc-auth.addr.* filters. | On | On/Off
| nfs.register-with-portmap |For systems that need to run multiple NFS servers, you need to prevent more than one from registering with portmap service. Use this option to turn off portmap registration for Gluster NFS. | On | On/Off
| nfs.port \<PORT- NUMBER\> | Use this option on systems that need Gluster NFS to be associated with a non-default port number. | NA | 38465-38467
| nfs.disable | Turn-off volume being exported by NFS | Off | On/Off
Server | server.allow-insecure | Allow client connections from unprivileged ports. By default only privileged ports are allowed. This is a global setting in case insecure ports are to be enabled for all exports using a single option.| On | On/Off
| server.statedump-path | Location of the state dump file. | tmp directory of the brick | New directory path
| server.allow-insecure | Allows FUSE-based client connections from unprivileged ports.By default, this is enabled, meaning that ports can accept and reject messages from insecure ports. When disabled, only privileged ports are allowed. | on | on/off
| server.anongid | Value of the GID used for the anonymous user when root-squash is enabled. When root-squash is enabled, all the requests received from the root GID (that is 0) are changed to have the GID of the anonymous user. | 65534 (this UID is also known as nfsnobody) | 0 - 4294967295
| server.anonuid | Value of the UID used for the anonymous user when root-squash is enabled. When root-squash is enabled, all the requests received from the root UID (that is 0) are changed to have the UID of the anonymous user. | 65534 (this UID is also known as nfsnobody) | 0 - 4294967295
| server.event-threads | Specifies the number of event threads to execute in parallel. Larger values would help process responses faster, depending on available processing power. | 2 | 1-1024
| server.gid-timeout | The time period in seconds which controls when cached groups has to expire. This is the cache that contains the groups (GIDs) where a specified user (UID) belongs to. This option is used only when server.manage-gids is enabled.| 2 | 0-4294967295 seconds
| server.manage-gids | Resolve groups on the server-side. By enabling this option, the groups (GIDs) a user (UID) belongs to gets resolved on the server, instead of using the groups that were send in the RPC Call by the client. This option makes it possible to apply permission checks for users that belong to bigger group lists than the protocol supports (approximately 93). | off | on/off
| server.root-squash | Prevents root users from having root privileges, and instead assigns them the privileges of nfsnobody. This squashes the power of the root users, preventing unauthorized modification of files on the Red Hat Gluster Storage servers. This option is used only for glusterFS NFS protocol. | off | on/off
| server.statedump-path | Specifies the directory in which the statedumpfiles must be stored. | path to directory | /var/run/gluster (for a default installation)
Storage | storage.health-check-interval | Number of seconds between health-checks done on the filesystem that is used for the brick(s). Defaults to 30 seconds, set to 0 to disable. | 30 seconds | 0-4294967295 seconds
| storage.linux-io_uring | Enable/Disable io_uring based I/O at the posix xlator on the bricks. | Off | On/Off
| storage.fips-mode-rchecksum | If enabled, posix_rchecksum uses the FIPS compliant SHA256 checksum, else it uses MD5. | on | on/ off
| storage.create-mask | Maximum set (upper limit) of permission for the files that will be created. | 0777 | 0000 - 0777
| storage.create-directory-mask | Maximum set (upper limit) of permission for the directories that will be created. | 0777 | 0000 - 0777
| storage.force-create-mode | Minimum set (lower limit) of permission for the files that will be created. | 0000 | 0000 - 0777
| storage.force-create-directory | Minimum set (lower limit) of permission for the directories that will be created. | 0000 | 0000 - 0777
| storage.health-check-interval | Sets the time interval in seconds for a filesystem health check. You can set it to 0 to disable. | 30 seconds | 0-4294967295 seconds
| storage.reserve | To reserve storage space at the brick. This option accepts size in form of MB and also in form of percentage. If user has configured the storage.reserve option using size in MB earlier, and then wants to give the size in percentage, it can be done using the same option. Also, the newest set value is considered, if it was in MB before and then if it sent in percentage, the percentage value becomes new value and the older one is over-written | 1 (1% of the brick size) | 0-100
| Type | Option | Description | Default Value | Available Options |
| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | --------------------------------------- |
| auth.allow | IP addresses of the clients which should be allowed to access the volume. | \* (allow all) | Valid IP address which includes wild card patterns including \*, such as 192.168.1.\* |
| auth.reject | IP addresses of the clients which should be denied to access the volume. | NONE (reject none) | Valid IP address which includes wild card patterns including \*, such as 192.168.2.\* |
| Cluster | cluster.self-heal-window-size | Specifies the maximum number of blocks per file on which self-heal would happen simultaneously. | 1 | 0 - 1024 blocks |
| cluster.data-self-heal-algorithm | Specifies the type of self-heal. If you set the option as "full", the entire file is copied from source to destinations. If the option is set to "diff" the file blocks that are not in sync are copied to destinations. Reset uses a heuristic model. If the file does not exist on one of the subvolumes, or a zero-byte file exists (created by entry self-heal) the entire content has to be copied anyway, so there is no benefit from using the "diff" algorithm. If the file size is about the same as page size, the entire file can be read and written with a few operations, which will be faster than "diff" which has to read checksums and then read and write. | reset | full/diff/reset |
| cluster.min-free-disk | Specifies the percentage of disk space that must be kept free. Might be useful for non-uniform bricks. | 10% | Percentage of required minimum free disk space |
| cluster.min-free-inodes | Specifies that when the system has only N% of inodes remaining, warnings start to appear in the log files. | 10% | Percentage of required minimum free inodes |
| cluster.stripe-block-size | Specifies the size of the stripe unit that will be read from or written to. | 128 KB (for all files) | size in bytes |
| cluster.self-heal-daemon | Allows you to turn off proactive self-heal on replicated volumes. | On | On/Off |
| cluster.ensure-durability | This option makes sure the data/metadata is durable across abrupt shutdown of the brick. | On | On/Off |
| cluster.lookup-unhashed | This option does a lookup through all the sub-volumes, in case a lookup didn't return any result from the hashed subvolume. If set to OFF, it does not do a lookup on the remaining subvolumes. | on | auto, yes/no, enable/disable, 1/0, on/off |
| cluster.lookup-optimize | This option enables the optimization of negative lookups, by not doing a lookup on non-hashed subvolumes for files, in case the hashed subvolume does not return any result. This option disregards the lookup-unhashed setting, when enabled. | on | on/off |
| cluster.randomize-hash-range-by-gfid | Allows using the gfid of a directory to determine the subvolume from which hash ranges are allocated, starting with 0. Note that the directory/file name is still used to determine the subvolume to which it hashes. | off | on/off |
| cluster.rebal-throttle | Sets the maximum number of parallel file migrations allowed on a node during the rebalance operation. The default value is normal and allows 2 files to be migrated at a time. Lazy allows only one file to be migrated at a time, and aggressive allows max of [((processing units) - 4) / 2, 4]. | normal | lazy/normal/aggressive |
| cluster.background-self-heal-count | Specifies the number of per client self-heal jobs that can perform parallel heals in the background. | 8 | 0-256 |
| cluster.heal-timeout | Time interval for checking the need to self-heal in self-heal-daemon | 600 | 5-(signed-int) |
| cluster.eager-lock | If eager-lock is off, locks release immediately after file operations complete, improving performance for some operations, but reducing access efficiency | on | on/off |
| cluster.quorum-type | If value is “fixed” only allow writes if quorum-count bricks are present. If value is “auto” only allow writes if more than half of bricks, or exactly half including the first brick, are present | none | none/auto/fixed |
| cluster.quorum-count | If quorum-type is “fixed” only allow writes if this many bricks are present. Other quorum types will OVERWRITE this value | null | 1-(signed-int) |
| cluster.heal-wait-queue-length | Specifies the number of heals that can be queued for the parallel background self heal jobs. | 128 | 0-10000 |
| cluster.favorite-child-policy | Specifies which policy can be used to automatically resolve split-brains without user intervention. “size” picks the file with the biggest size as the source. “ctime” and “mtime” pick the file with the latest ctime and mtime respectively as the source. “majority” picks a file with identical mtime and size in more than half the number of bricks in the replica. | none | none/size/ctime/mtime/majority |
| cluster.use-anonymous-inode | Setting this option heals directory renames efficiently | no | no/yes |
| Disperse | disperse.eager-lock | If eager-lock is on, the lock remains in place either until lock contention is detected, or for 1 second in order to check if there is another request for that file from the same client. If eager-lock is off, locks release immediately after file operations complete, improving performance for some operations, but reducing access efficiency. | on | on/off |
| disperse.other-eager-lock | This option is equivalent to the disperse.eager-lock option but applicable only for non-regular files. When multiple clients access a particular directory, disabling the disperse.other-eager-lock option for the volume can improve performance for directory access without compromising performance of I/Os for regular files. | off | on/off |
| disperse.shd-max-threads | Specifies the number of entries that can be self healed in parallel on each disperse subvolume by self-heal daemon. | 1 | 1 - 64 |
| disperse.shd-wait-qlength | Specifies the number of entries that must be kept in the dispersed subvolume's queue for self-heal daemon threads to take up as soon as any of the threads are free to heal. This value should be changed based on how much memory self-heal daemon process can use for keeping the next set of entries that need to be healed. | 1024 | 1 - 655536 |
| disperse.eager-lock-timeout | Maximum time (in seconds) that a lock on an inode is kept held if no new operations on the inode are received. | 1 | 1-60 |
| disperse.other-eager-lock-timeout | It is equivalent to the eager-lock-timeout option but for non-regular files. | 1 | 1-60 |
| disperse.background-heals | This option can be used to control number of parallel heals running in background. | 8 | 0-256 |
| disperse.heal-wait-qlength | This option can be used to control number of heals that can wait | 128 | 0-65536 |
| disperse.read-policy | inode-read fops happen only on k number of bricks in n=k+m disperse subvolume. round-robin selects the read subvolume using round-robin algo. gfid-hash selects read subvolume based on hash of the gfid of that file/directory. | gfid-hash | round-robin/gfid-hash |
| disperse.self-heal-window-size | Maximum number blocks(128KB) per file for which self-heal process would be applied simultaneously. | 1 | 1-1024 |
| disperse.optimistic-change-log | This option sets/unsets the dirty flag for every update fop at the start of the fop. If OFF, this option impacts the performance of entry and metadata operations, as the dirty flag is set at the start and unset at the end of ALL update fops. If ON and all the bricks are good, the dirty flag is set at the start only for file fops; for metadata and entry fops the dirty flag is not set at the start. This does not impact performance for metadata and entry operations, but leaves a very small window in which an entry may not be marked dirty even though it needs to be healed. | on | on/off |
| disperse.parallel-writes | This controls whether writes can be wound in parallel as long as they don't modify the same stripes. | on | on/off |
| disperse.stripe-cache | This option will keep the last stripe of a write fop in memory. If the next write falls in this stripe, it need not be read again from the backend, saving a READ fop over the network. This will improve performance, especially for sequential writes. However, it also leads to extra memory consumption, at most (cache size \* stripe size) bytes per open file. | 4 | 0-10 |
| disperse.quorum-count | This option can be used to define how many successes on the bricks constitute a success to the application. This count should be in the range [disperse-data-count, disperse-count] (inclusive). | 0 | 0-(signed-int) |
| disperse.use-anonymous-inode | Setting this option heals renames efficiently | off | on/off |
| Logging | diagnostics.brick-log-level | Changes the log-level of the bricks | INFO | DEBUG/WARNING/ERROR/CRITICAL/NONE/TRACE |
| diagnostics.client-log-level | Changes the log-level of the clients. | INFO | DEBUG/WARNING/ERROR/CRITICAL/NONE/TRACE |
| diagnostics.brick-sys-log-level | Depending on the value defined for this option, log messages at and above the defined level are generated in the syslog and the brick log files. | CRITICAL | INFO/WARNING/ERROR/CRITICAL |
| diagnostics.client-sys-log-level | Depending on the value defined for this option, log messages at and above the defined level are generated in the syslog and the client log files. | CRITICAL | INFO/WARNING/ERROR/CRITICAL |
| diagnostics.brick-log-format | Allows you to configure the log format to log either with a message id or without one on the brick. | with-msg-id | no-msg-id/with-msg-id |
| diagnostics.client-log-format | Allows you to configure the log format to log either with a message ID or without one on the client. | with-msg-id | no-msg-id/with-msg-id |
| diagnostics.brick-log-buf-size | The maximum number of unique log messages that can be suppressed until the timeout or buffer overflow, whichever occurs first on the bricks. | 5 | 0 and 20 (0 and 20 included) |
| diagnostics.client-log-buf-size | The maximum number of unique log messages that can be suppressed until the timeout or buffer overflow, whichever occurs first on the clients. | 5 | 0 and 20 (0 and 20 included) |
| diagnostics.brick-log-flush-timeout | The length of time for which the log messages are buffered, before being flushed to the logging infrastructure (gluster or syslog files) on the bricks. | 120 | 30 - 300 seconds (30 and 300 included) |
| diagnostics.client-log-flush-timeout | The length of time for which the log messages are buffered, before being flushed to the logging infrastructure (gluster or syslog files) on the clients. | 120 | 30 - 300 seconds (30 and 300 included) |
| Performance | \*features.trash | Enable/disable trash translator | off | on/off |
| \*performance.readdir-ahead | Enable/disable readdir-ahead translator in the volume | off | on/off |
| \*performance.read-ahead | Enable/disable read-ahead translator in the volume | off | on/off |
| \*performance.io-cache | Enable/disable io-cache translator in the volume | off | on/off |
| performance.quick-read | To enable/disable quick-read translator in the volume. | on | off/on |
| performance.md-cache | Enables and disables md-cache translator. | off | off/on |
| performance.open-behind | Enables and disables open-behind translator. | on | off/on |
| performance.nl-cache | Enables and disables nl-cache translator. | off | off/on |
| performance.stat-prefetch | Enables and disables stat-prefetch translator. | on | off/on |
| performance.client-io-threads | Enables and disables client-io-thread translator. | on | off/on |
| performance.write-behind | Enables and disables write-behind translator. | on | off/on |
| performance.write-behind-window-size | Size of the per-file write-behind buffer. | 1MB | Write-behind cache size |
| performance.io-thread-count | The number of threads in IO threads translator. | 16 | 1-64 |
| performance.flush-behind | If this option is set to ON, it instructs the write-behind translator to perform flush in the background, by returning success (or any errors, if any of the previous writes failed) to the application even before the flush is sent to the backend filesystem. | On | On/Off |
| performance.cache-max-file-size | Sets the maximum file size cached by the io-cache translator. Can use the normal size descriptors of KB, MB, GB, TB or PB (for example, 6GB). Maximum size uint64. | 2 ^ 64 -1 bytes | size in bytes |
| performance.cache-min-file-size | Sets the minimum file size cached by the io-cache translator. Values same as "max" above | 0B | size in bytes |
| performance.cache-refresh-timeout | The cached data for a file will be retained till 'cache-refresh-timeout' seconds, after which data re-validation is performed. | 1s | 0-61 |
| performance.cache-size | Size of the read cache. | 32 MB | size in bytes |
| performance.lazy-open | This option requires open-behind to be on. Perform an open in the backend only when a necessary FOP arrives (for example, write on the file descriptor, unlink of the file). When this option is disabled, perform backend open immediately after an unwinding open. | Yes | Yes/No |
| performance.md-cache-timeout | The time period in seconds which controls when metadata cache has to be refreshed. If the age of cache is greater than this time-period, it is refreshed. Every time cache is refreshed, its age is reset to 0. | 1 | 0-600 seconds |
| performance.nfs-strict-write-ordering | Specifies whether to prevent later writes from overtaking earlier writes for NFS, even if the writes do not relate to the same files or locations. | off | on/off |
| performance.nfs.flush-behind | Specifies whether the write-behind translator performs flush operations in the background for NFS by returning (false) success to the application before flush file operations are sent to the backend file system. | on | on/off |
| performance.nfs.strict-o-direct | Specifies whether to attempt to minimize the cache effects of I/O for a file on NFS. When this option is enabled and a file descriptor is opened using the O_DIRECT flag, write-back caching is disabled for writes that affect that file descriptor. When this option is disabled, O_DIRECT has no effect on caching. This option is ignored if performance.write-behind is disabled. | off | on/off |
| performance.nfs.write-behind-trickling-writes | Enables and disables trickling-write strategy for the write-behind translator for NFS clients. | on | off/on |
| performance.nfs.write-behind-window-size | Specifies the size of the write-behind buffer for a single file or inode for NFS. | 1 MB | 512 KB - 1 GB |
| performance.rda-cache-limit | The value specified for this option is the maximum size of cache consumed by the readdir-ahead translator. This value is global and the total memory consumption by readdir-ahead is capped by this value, irrespective of the number/size of directories cached. | 10MB | 0-1GB |
| performance.rda-request-size | The value specified for this option will be the size of buffer holding directory entries in readdirp response. | 128KB | 4KB-128KB |
| performance.resync-failed-syncs-after-fsync | If syncing cached writes that were issued before an fsync operation fails, this option configures whether to reattempt the failed sync operations. | off | on/off |
| performance.strict-o-direct | Specifies whether to attempt to minimize the cache effects of I/O for a file. When this option is enabled and a file descriptor is opened using the O_DIRECT flag, write-back caching is disabled for writes that affect that file descriptor. When this option is disabled, O_DIRECT has no effect on caching. This option is ignored if performance.write-behind is disabled. | on | on/off |
| performance.strict-write-ordering | Specifies whether to prevent later writes from overtaking earlier writes, even if the writes do not relate to the same files or locations. | on | on/off |
| performance.use-anonymous-fd | This option requires open-behind to be on. For read operations, use anonymous file descriptor when the original file descriptor is open-behind and not yet opened in the backend. | Yes | No/Yes |
| performance.write-behind-trickling-writes | Enables and disables trickling-write strategy for the write-behind translator for FUSE clients. | on | off/on |
| performance.write-behind-window-size | Specifies the size of the write-behind buffer for a single file or inode. | 1MB | 512 KB - 1 GB |
| features.read-only | Enables you to mount the entire volume as read-only for all the clients (including NFS clients) accessing it. | Off | On/Off |
| features.quota-deem-statfs | When this option is set to on, it takes the quota limits into consideration while estimating the filesystem size. The limit will be treated as the total size instead of the actual size of filesystem. | on | on/off |
| features.shard | Enables or disables sharding on the volume. Affects files created after volume configuration. | disable | enable/disable |
| features.shard-block-size | Specifies the maximum size of file pieces when sharding is enabled. Affects files created after volume configuration. | 64MB | 4MB-4TB |
| features.uss | This option enable/disable User Serviceable Snapshots on the volume. | off | on/off |
| geo-replication.indexing | Use this option to automatically sync the changes in the filesystem from Primary to Secondary. | Off | On/Off |
| network.frame-timeout | The time frame after which the operation has to be declared as dead, if the server does not respond for a particular operation. | 1800 (30 mins) | 1800 secs |
| network.ping-timeout | The time duration for which the client waits to check if the server is responsive. When a ping timeout happens, there is a network disconnect between the client and server. All resources held by server on behalf of the client get cleaned up. When a reconnection happens, all resources will need to be re-acquired before the client can resume its operations on the server. Additionally, the locks will be acquired and the lock tables updated. This reconnect is a very expensive operation and should be avoided. | 42 Secs | 42 Secs |
| nfs | nfs.enable-ino32 | For 32-bit nfs clients or applications that do not support 64-bit inode numbers or large files, use this option from the CLI to make Gluster NFS return 32-bit inode numbers instead of 64-bit inode numbers. | Off | On/Off |
| nfs.volume-access | Set the access type for the specified sub-volume. | read-write | read-write/read-only |
| nfs.trusted-write | If there is an UNSTABLE write from the client, STABLE flag will be returned to force the client to not send a COMMIT request. In some environments, combined with a replicated GlusterFS setup, this option can improve write performance. This flag allows users to trust Gluster replication logic to sync data to the disks and recover when required. COMMIT requests if received will be handled in a default manner by fsyncing. STABLE writes are still handled in a sync manner. | Off | On/Off |
| nfs.trusted-sync | All writes and COMMIT requests are treated as async. This implies that no write requests are guaranteed to be on server disks when the write reply is received at the NFS client. Trusted sync includes trusted-write behavior. | Off | On/Off |
| nfs.export-dir | This option can be used to export specified comma-separated subdirectories in the volume. The path must be an absolute path. Along with the path, an allowed list of IPs/hostnames can be associated with each subdirectory. If provided, connections will be allowed only from these IPs. Format: \<dir\>[(hostspec[hostspec...])][,...]. Where hostspec can be an IP address, hostname or an IP range in CIDR notation. **Note**: Care must be taken while configuring this option as invalid entries and/or unreachable DNS servers can introduce unwanted delay in all the mount calls. | No sub directory exported. | Absolute path with allowed list of IP/hostname |
| nfs.export-volumes | Enable/Disable exporting entire volumes. If disabled and used in conjunction with nfs3.export-dir, only subdirectories can be set up as exports. | On | On/Off |
| nfs.rpc-auth-unix | Enable/Disable the AUTH_UNIX authentication type. This option is enabled by default for better interoperability. However, you can disable it if required. | On | On/Off |
| nfs.rpc-auth-null | Enable/Disable the AUTH_NULL authentication type. It is not recommended to change the default value for this option. | On | On/Off |
| nfs.rpc-auth-allow\<IP- Addresses\> | Allow a comma separated list of addresses and/or hostnames to connect to the server. By default, all clients are disallowed. This allows you to define a general rule for all exported volumes. | Reject All | IP address or Host name |
| nfs.rpc-auth-reject\<IP- Addresses\> | Reject a comma separated list of addresses and/or hostnames from connecting to the server. By default, all connections are disallowed. This allows you to define a general rule for all exported volumes. | Reject All | IP address or Host name |
| nfs.ports-insecure | Allow client connections from unprivileged ports. By default only privileged ports are allowed. This is a global setting in case insecure ports are to be enabled for all exports using a single option. | Off | On/Off |
| nfs.addr-namelookup | Turn-off name lookup for incoming client connections using this option. In some setups, the name server can take too long to reply to DNS queries resulting in timeouts of mount requests. Use this option to turn off name lookups during address authentication. Note, turning this off will prevent you from using hostnames in rpc-auth.addr.\* filters. | On | On/Off |
| nfs.register-with-portmap | For systems that need to run multiple NFS servers, you need to prevent more than one from registering with portmap service. Use this option to turn off portmap registration for Gluster NFS. | On | On/Off |
| nfs.port \<PORT- NUMBER\> | Use this option on systems that need Gluster NFS to be associated with a non-default port number. | NA | 38465-38467 |
| nfs.disable | Turn-off volume being exported by NFS | Off | On/Off |
| Server | server.allow-insecure | Allow client connections from unprivileged ports. By default only privileged ports are allowed. This is a global setting in case insecure ports are to be enabled for all exports using a single option. | On | On/Off |
| server.statedump-path | Location of the state dump file. | tmp directory of the brick | New directory path |
| server.allow-insecure | Allows FUSE-based client connections from unprivileged ports. By default, this is enabled, meaning that messages from insecure (unprivileged) ports are accepted. When disabled, only privileged ports are allowed. | on | on/off |
| server.anongid | Value of the GID used for the anonymous user when root-squash is enabled. When root-squash is enabled, all the requests received from the root GID (that is 0) are changed to have the GID of the anonymous user. | 65534 (this GID is also known as nfsnobody) | 0 - 4294967295 |
| server.anonuid | Value of the UID used for the anonymous user when root-squash is enabled. When root-squash is enabled, all the requests received from the root UID (that is 0) are changed to have the UID of the anonymous user. | 65534 (this UID is also known as nfsnobody) | 0 - 4294967295 |
| server.event-threads | Specifies the number of event threads to execute in parallel. Larger values would help process responses faster, depending on available processing power. | 2 | 1-1024 |
| server.gid-timeout | The time period in seconds which controls when cached groups expire. This is the cache that contains the groups (GIDs) that a specified user (UID) belongs to. This option is used only when server.manage-gids is enabled. | 2 | 0-4294967295 seconds |
| server.manage-gids | Resolve groups on the server-side. By enabling this option, the groups (GIDs) a user (UID) belongs to get resolved on the server, instead of using the groups that were sent in the RPC call by the client. This option makes it possible to apply permission checks for users that belong to bigger group lists than the protocol supports (approximately 93). | off | on/off |
| server.root-squash | Prevents root users from having root privileges, and instead assigns them the privileges of nfsnobody. This squashes the power of the root users, preventing unauthorized modification of files on the Red Hat Gluster Storage servers. This option is used only for glusterFS NFS protocol. | off | on/off |
| server.statedump-path | Specifies the directory in which the statedumpfiles must be stored. | path to directory | /var/run/gluster (for a default installation) |
| Storage | storage.health-check-interval | Number of seconds between health-checks done on the filesystem that is used for the brick(s). Defaults to 30 seconds, set to 0 to disable. | 30 seconds | 0-4294967295 seconds |
| storage.linux-io_uring | Enable/Disable io_uring based I/O at the posix xlator on the bricks. | Off | On/Off |
| storage.fips-mode-rchecksum | If enabled, posix_rchecksum uses the FIPS compliant SHA256 checksum, else it uses MD5. | on | on/off |
| storage.create-mask | Maximum set (upper limit) of permission for the files that will be created. | 0777 | 0000 - 0777 |
| storage.create-directory-mask | Maximum set (upper limit) of permission for the directories that will be created. | 0777 | 0000 - 0777 |
| storage.force-create-mode | Minimum set (lower limit) of permission for the files that will be created. | 0000 | 0000 - 0777 |
| storage.force-directory-mode | Minimum set (lower limit) of permission for the directories that will be created. | 0000 | 0000 - 0777 |
| storage.health-check-interval | Sets the time interval in seconds for a filesystem health check. You can set it to 0 to disable. | 30 seconds | 0-4294967295 seconds |
| storage.reserve | Reserves storage space on the brick. This option accepts a size in MB or a percentage. If the storage.reserve option was configured with a size in MB earlier and a percentage is given later (or vice versa), the same option can be used; the most recently set value takes effect and overwrites the earlier one. | 1 (1% of the brick size) | 0-100 |
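
Most of the options in the table above are applied with `gluster volume set` and can be inspected with `gluster volume get` or reverted with `gluster volume reset`. A minimal sketch, assuming an existing volume named `testvol` (the volume name and the chosen option/value are only examples):

```console
gluster volume set testvol performance.cache-size 256MB
gluster volume get testvol performance.cache-size
gluster volume reset testvol performance.cache-size
```

Running `gluster volume get testvol all` lists the effective value of every option for the volume.
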
> **Note**
>
> We have found that a few performance xlators (options marked with \* in the above table) cause more performance regression than improvement. These xlators should be turned off for volumes.
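
As a rough sketch of what that advice looks like on the command line (assuming a hypothetical volume named `testvol`), the starred translators could be switched off as follows:

```console
gluster volume set testvol features.trash off
gluster volume set testvol performance.readdir-ahead off
gluster volume set testvol performance.read-ahead off
gluster volume set testvol performance.io-cache off
```
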
View File
@@ -1,17 +1,19 @@
# io_uring support in gluster
io_uring is an asynchronous I/O interface similar to linux-aio, but aims to be more performant.
Refer to [https://kernel.dk/io_uring.pdf](https://kernel.dk/io_uring.pdf) and [https://kernel-recipes.org/en/2019/talks/faster-io-through-io_uring](https://kernel-recipes.org/en/2019/talks/faster-io-through-io_uring) for more details.
Incorporating io_uring in various layers of gluster is an ongoing activity, but beginning with glusterfs-9.0, support has been added to the posix translator via the `storage.linux-io_uring` volume option. When this option is enabled, the posix translator in the glusterfs brick process (on the server side) will use io_uring calls for reads, writes and fsyncs as opposed to the normal pread/pwrite based syscalls.
#### Example:
```{ .console .no-copy }
# gluster volume set testvol storage.linux-io_uring on
volume set: success
# gluster volume set testvol storage.linux-io_uring off
volume set: success
```
This option can be enabled/disabled only when the volume is not running.
That is, you can toggle the option only when the volume is in the `Created` or `Stopped` state, as indicated by `gluster volume status $VOLNAME`.
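
Because the option is only accepted while the volume is not running, one possible sequence (just a sketch, assuming an already started volume named `testvol`) is to stop the volume, toggle the option, and start it again:

```console
gluster volume stop testvol
gluster volume set testvol storage.linux-io_uring on
gluster volume start testvol
```
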
View File
@@ -1,6 +1,5 @@
### Overview
The Administration guide covers day-to-day management tasks as well as advanced configuration methods for your Gluster setup.
You can manage your Gluster cluster using the [Gluster CLI](../CLI-Reference/cli-main.md).
View File
@@ -3,7 +3,6 @@
A volume is a logical collection of bricks where each brick is an export directory on a server in the trusted storage pool.
Before creating a volume, you need to set up the bricks that will form the volume.
- [Brick Naming Conventions](./Brick-Naming-Conventions.md)
- [Formatting and Mounting Bricks](./formatting-and-mounting-bricks.md)
- [Posix ACLS](./Access-Control-Lists.md)
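
Once the bricks have been formatted, mounted and named as described in the pages above, a volume can be created from them. The sketch below is only an illustration; the server names, brick paths, replica count and volume name are placeholders:

```console
gluster volume create testvol replica 3 \
    server1:/data/brick1/testvol \
    server2:/data/brick1/testvol \
    server3:/data/brick1/testvol
gluster volume start testvol
```
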