From 8b21b7e439a53b757cb10422ddbbdf669104816c Mon Sep 17 00:00:00 2001 From: shravantc Date: Wed, 3 Jun 2015 12:57:42 +0530 Subject: [PATCH] adding missing feature planning pages to 3.5 planning Signed-off-by: shravantc --- .../GlusterFS 3.5/Brick Failure Detection.md | 151 ++++++ .../GlusterFS 3.5/Disk Encryption.md | 443 ++++++++++++++++++ .../GlusterFS 3.5/File Snapshot.md | 101 ++++ .../Onwire Compression-Decompression.md | 96 ++++ .../GlusterFS 3.5/Quota Scalability.md | 99 ++++ Feature Planning/GlusterFS 3.5/Zerofill.md | 192 ++++++++ Feature Planning/GlusterFS 3.5/gfid access.md | 89 ++++ Feature Planning/GlusterFS 3.5/index.md | 10 + .../GlusterFS 3.5/readdir ahead.md | 117 +++++ mkdocs.yml | 8 + 10 files changed, 1306 insertions(+) create mode 100644 Feature Planning/GlusterFS 3.5/Brick Failure Detection.md create mode 100644 Feature Planning/GlusterFS 3.5/Disk Encryption.md create mode 100644 Feature Planning/GlusterFS 3.5/File Snapshot.md create mode 100644 Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md create mode 100644 Feature Planning/GlusterFS 3.5/Quota Scalability.md create mode 100644 Feature Planning/GlusterFS 3.5/Zerofill.md create mode 100644 Feature Planning/GlusterFS 3.5/gfid access.md create mode 100644 Feature Planning/GlusterFS 3.5/readdir ahead.md diff --git a/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md b/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md new file mode 100644 index 0000000..9952698 --- /dev/null +++ b/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md @@ -0,0 +1,151 @@ +Feature +------- + +Brick Failure Detection + +Summary +------- + +This feature attempts to identify storage/file system failures and +disable the failed brick without disrupting the remainder of the node's +operation. 

Owners
------

Vijay Bellur with help from Niels de Vos (or the other way around)

Current status
--------------

Currently, when a failure occurs in the underlying storage or file
system, the brick process continues to function. In some cases a brick
can hang due to failures in the underlying system, and such hangs in
brick processes can in turn hang applications running on glusterfs
clients.

Detailed Description
--------------------

Detecting failures on the filesystem that a brick uses makes it possible
to handle errors that are caused from outside of the Gluster
environment.

There have been hanging brick processes when the underlying storage of a
brick became unavailable. A hanging brick process can still use the
network and respond to clients, but actual I/O to the storage is
impossible and can cause noticeable delays on the client side.

Benefit to GlusterFS
--------------------

Provides better detection of storage subsystem failures and prevents
bricks from hanging.

Scope
-----

### Nature of proposed change

Add a health-checker to the posix xlator that periodically checks the
status of the filesystem (which implies checking the underlying
storage hardware).

### Implications on manageability

When a brick process detects that the underlying storage is not
responding anymore, the process will exit. There is no automated way to
restart the brick process; the sysadmin will need to fix the problem
with the storage first.

After correcting the storage (hardware or filesystem) issue, the
following command will start the brick process again:

    # gluster volume start <VOLNAME> force

### Implications on presentation layer

None

### Implications on persistence layer

None

### Implications on 'GlusterFS' backend

None

### Modification to GlusterFS metadata

None

### Implications on 'glusterd'

'glusterd' can detect that the brick process has exited, and
`gluster volume status` will show that the brick process is not running
anymore. System administrators checking the logs should be able to
triage the cause.

How To Test
-----------

The health-checker thread that is part of each brick process will get
started automatically when a volume has been started. Verifying its
functionality can be done in different ways.

On virtual hardware:

- disconnect the disk from the VM that holds the brick

On real hardware:

- simulate a RAID-card failure by unplugging the card or cables

On a system that uses LVM for the bricks:

- use device-mapper to load an error-table for the disk, see [this
  description](http://review.gluster.org/5176).

On any system (writing to random offsets of the block device, more
difficult to trigger):

1. cause corruption on the filesystem that holds the brick
2. read contents from the brick, hoping to hit the corrupted area
3. the filesystem should abort after hitting a bad spot; the
   health-checker should notice that shortly afterwards

User Experience
---------------

No more hanging brick processes when storage-hardware or the filesystem
fails.

Dependencies
------------

Posix translator, not available for the BD-xlator.

Documentation
-------------

The health-checker is enabled by default and runs a check every 30
seconds. This interval can be changed per volume with:

    # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS>

If `SECONDS` is set to 0, the health-checker will be disabled.
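The behaviour described above can be sketched in a few lines of Python.
This is an illustrative model only, not the actual posix xlator code:
the probe path and the looping logic are assumptions, but the idea is
the same -- periodically perform a small write/read on the brick's
filesystem and give up when the I/O fails.

```python
import os
import sys
import time


def health_check_once(brick_dir):
    """One probe: write, fsync and read back a small file under the
    brick directory. Returns True when the filesystem responds,
    False on any I/O error."""
    probe = os.path.join(brick_dir, ".glusterfs", "health_check")
    try:
        with open(probe, "w") as f:
            f.write(time.strftime("%Y-%m-%d %H:%M:%S\n"))
            f.flush()
            os.fsync(f.fileno())  # make sure the write reaches the disk
        with open(probe) as f:
            f.read()
        return True
    except OSError as err:
        sys.stderr.write("health-check failed: %s\n" % err)
        return False


def health_checker(brick_dir, interval=30):
    """Loop like the in-brick thread; an interval of 0 disables it."""
    if interval == 0:
        return
    while health_check_once(brick_dir):
        time.sleep(interval)
    # A real brick process would exit here so that glusterd notices
    # and 'gluster volume status' shows the brick as not running.
```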

For further details refer: 

Status
------

glusterfs-3.4 and newer include a health-checker for the posix xlator,
which was introduced with [bug
971774](https://bugzilla.redhat.com/971774):

- [posix: add a simple
  health-checker](http://review.gluster.org/5176)

Comments and Discussion
-----------------------

diff --git a/Feature Planning/GlusterFS 3.5/Disk Encryption.md b/Feature Planning/GlusterFS 3.5/Disk Encryption.md
new file mode 100644
index 0000000..4c6ab89
--- /dev/null
+++ b/Feature Planning/GlusterFS 3.5/Disk Encryption.md
@@ -0,0 +1,443 @@

Feature
=======

Transparent encryption. Allows a volume to be encrypted "at rest" on the
server using keys only available on the client.

1 Summary
=========

Distributed systems impose tighter requirements on at-rest encryption,
because encrypted data is stored on servers which are de facto
untrusted. In particular, private encrypted data can be subjected to
analysis and tampering, which will eventually lead to its disclosure if
it is not properly protected. Usually it is not enough to just encrypt
data: in distributed systems, serious protection of personal data is
possible only in conjunction with a special process called
authentication. GlusterFS provides such an enhanced service: in
GlusterFS, encryption is combined with authentication. Currently we
provide protection from "silent tampering". This is a kind of tampering
which is hard to detect, because it doesn't break POSIX compliance.
Specifically, we protect encryption-specific file metadata. Such
metadata includes the file's unique object id (GFID), the cipher
algorithm id, the cipher block size and other attributes used by the
encryption process.

1.1 Restrictions
----------------

​1. We encrypt only file content. The feature of transparent encryption
doesn't protect file names: they are neither encrypted, nor verified.
+Protection of file names is not so critical as protection of +encryption-specific file's metadata: any attacks based on tampering file +names will break POSIX compliance and result in massive corruption, +which is easy to detect. + +​2. The feature of transparent encryption doesn't work in NFS-mounts of +GlusterFS volumes: NFS's file handles introduce security issues, which +are hard to resolve. NFS mounts of encrypted GlusterFS volumes will +result in failed file operations (see section "Encryption in different +types of mount sessions" for more details). + +​3. The feature of transparent encryption is incompatible with GlusterFS +performance translators quick-read, write-behind and open-behind. + +2 Owners +======== + +Jeff Darcy +Edward Shishkin + +3 Current status +================ + +Merged to the upstream. + +4 Detailed Description +====================== + +See Summary. + +5 Benefit to GlusterFS +====================== + +Besides the justifications that have applied to on-disk encryption just +about forever, recent events have raised awareness significantly. +Encryption using keys that are physically present at the server leaves +data vulnerable to physical seizure of the server. Encryption using keys +that are kept by the same organization entity leaves data vulnerable to +"insider threat" plus coercion or capture at the organization level. For +many, especially various kinds of service providers, only pure +client-side encryption provides the necessary levels of privacy and +deniability. + +Competitively, other projects - most notably +[Tahoe-LAFS](https://leastauthority.com/) - are already using recently +heightened awareness of these issues to attract users who would be +better served by our performance/scalability, usability, and diversity +of interfaces. Only the lack of proper encryption holds us back in these +cases. + +6 Scope +======= + +6.1. 
Nature of proposed change +------------------------------ + +This is a new client-side translator, using user-provided key +information plus information stored in xattrs to encrypt data +transparently as it's written and decrypt when it's read. + +6.2. Implications on manageability +---------------------------------- + +User needs to manage a per-volume master key (MK). That is: + +​1) Generate an independent MK for every volume which is to be +encrypted. Note, that one MK is created for the whole life of the +volume. + +​2) Provide MK on the client side at every mount in accordance with the +location, which has been specified at volume create time, or overridden +via respective mount option (see section How To Test). + +​3) Keep MK between mount sessions. Note that after successful mount MK +may be removed from the specified location. In this case user should +retain MK safely till next mount session. + +MK is a 256-bit secret string, which is known only to user. Generating +and retention of MK is in user's competence. + +WARNING!!! Losing MK will make content of all regular files of your +volume inaccessible. It is possible to mount a volume with improper MK, +however such mount sessions will allow to access only file names as they +are not encrypted. + +Recommendations on MK generation + +MK has to be a high-entropy key, appropriately generated by a key +derivation algorithm. One of the possible ways is using rand(1) provided +by the OpenSSL package. You need to specify the option "-hex" for proper +output format. For example, the next command prints a generated key to +the standard output: + + $ openssl rand -hex 32 + +6.3. Implications on presentation layer +--------------------------------------- + +N/A + +6.4. Implications on persistence layer +-------------------------------------- + +N/A + +6.5. Implications on 'GlusterFS' backend +---------------------------------------- + +All encrypted files on the servers contains padding at the end of file. 

That is, the size of every encrypted file on the servers is a multiple
of the cipher block size. The real file size is stored as a file xattr
with the key "trusted.glusterfs.crypt.att.size". The translation
padded-file-size -\> real-file-size (and back) is performed by the
crypt translator.

6.6. Modification to GlusterFS metadata
---------------------------------------

Encryption-specific metadata in a specified format is stored as a file
xattr with the key "trusted.glusterfs.crypt.att.cfmt". The current
format of the metadata string is described in slide \#27 of the
following [design
document](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf)

6.7. Options of the crypt translator
------------------------------------

- data-cipher-alg

Specifies the cipher algorithm for file data encryption. Currently only
one option is available: AES\_XTS. This is a hidden option.

- block-size

Specifies the size (in bytes) of the logical chunk which is encrypted
as a whole unit in the file body. If cipher modes with initial vectors
are used for encryption, then the initial vector gets reset for every
such chunk. Available values are: "512", "1024", "2048" and "4096".
The default value is "4096".

- data-key-size

Specifies the size (in bits) of the data cipher key. For AES\_XTS the
available values are "256" and "512". The default value is "256". The
larger key size ("512") is for stronger security.

- master-key

Specifies the pathname of a regular file, or symlink. Defines the
location of the master volume key on the trusted client machine.

7 Getting Started With Crypt Translator
=======================================

​1. Create a volume.

​2. Turn on the crypt xlator:

    # gluster volume set <VOLNAME> encryption on

​3.
Turn off the performance xlators that encryption is currently
incompatible with:

    # gluster volume set <VOLNAME> performance.quick-read off
    # gluster volume set <VOLNAME> performance.write-behind off
    # gluster volume set <VOLNAME> performance.open-behind off

​4. (optional) Set the location of the volume master key:

    # gluster volume set <VOLNAME> encryption.master-key <MASTER-KEY-LOCATION>

where <MASTER-KEY-LOCATION> is an absolute pathname of the file which
will contain the volume master key (see section implications on
manageability).

​5. (optional) Override default options of the crypt xlator:

    # gluster volume set <VOLNAME> encryption.data-key-size <KEY-SIZE>

where <KEY-SIZE> should have one of the following values:
"256"(default), "512".

    # gluster volume set <VOLNAME> encryption.block-size <BLOCK-SIZE>

where <BLOCK-SIZE> should have one of the following values: "512",
"1024", "2048", "4096"(default).

​6. Define the location of the master key on your client machine, if it
wasn't specified at section 4 above, or if you want it to be different
from the <MASTER-KEY-LOCATION> specified at section 4.

​7. On the client side make sure that the file with the name
<MASTER-KEY-LOCATION> (or the location defined at section 6) exists and
contains the respective per-volume master key (see section implications
on manageability). This key has to be in hex form, i.e. it should be
represented by 64 symbols from the set {'0', ..., '9', 'a', ..., 'f'}.
The key should start at the beginning of the file. All symbols at
offsets \>= 64 are ignored.

NOTE: <MASTER-KEY-LOCATION> (or the location defined at step 6) can be
a symlink. In this case make sure that the target file of this symlink
exists and contains the respective per-volume master key.

​8. Mount the volume on the client side as usual. If you specified a
location of the master key at section 6, then use the mount option

    --xlator-option=<SUFFIXED-VOL-NAME>.master-key=<MASTER-KEY-LOCATION>

where <MASTER-KEY-LOCATION> is the location of the master key specified
at section 6, and <SUFFIXED-VOL-NAME> is the volume name suffixed with
"-crypt". For example, if you created a volume "myvol" in step 1, then
the suffixed volume name is "myvol-crypt".

​9.
During mount your client machine receives configuration info from the
untrusted server, so this step is extremely important! Check that your
volume is really encrypted, and that it is encrypted with the proper
master key (see FAQ \#1, \#2).

​10. (optional) After a successful mount the file which contains the
master key may be removed. NOTE: The next mount session will require
the master key again. Keeping the master key between mount sessions is
the user's responsibility (see section implications on manageability).

8 How to test
=============

From a correctness standpoint, it's sufficient to run normal tests with
encryption enabled. From a security standpoint, there's a whole
discipline devoted to analysing the stored data for weaknesses, and
engagement with practitioners of that discipline will be necessary to
develop the right tests.

9 Dependencies
==============

The crypt translator requires OpenSSL version \>= 1.0.1

10 Documentation
================

10.1 Basic design concepts
--------------------------

The basic design concepts are described in the following [pdf
slides](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf)

10.2 Procedure of security open
-------------------------------

In accordance with the basic design concepts above, before every access
to a file's body (by read(2), write(2), truncate(2), etc.) we need to
make sure that the file's metadata is trusted. Otherwise, we risk
dealing with untrusted file data.

To make sure that a file's metadata is trusted, the file is subjected
to a special procedure of security open. The procedure of security open
is performed by the crypt translator at FOP-\>open() (crypt\_open) time
by the function open\_format(). Currently this is a hardcoded
composition of 2 checks:

1. verification of the file's GFID by the file name;
2. verification of the file's metadata by the verified GFID.

If the security open succeeds, then the cache of the trusted client
machine is replenished with the file descriptor and the file's inode,
and the user can access the file's content via the read(2), write(2),
ftruncate(2), etc. system calls, which accept a file descriptor as
argument.

However, the file API also allows access to a file's body without
opening the file; for example, truncate(2) accepts a pathname instead
of a file descriptor. To make sure that the file's metadata is trusted,
we create a temporary file descriptor and unconditionally call
crypt\_open() before truncating the file's body.

10.3 Encryption in different types of mount sessions
----------------------------------------------------

Everything described in the section above is valid only for
FUSE-mounts. Besides, GlusterFS also supports so-called NFS-mounts.
From the standpoint of security, the key difference between the
mentioned types of mount sessions is that in NFS-mount sessions file
operations accept a so-called file handle (which is actually a GFID)
instead of a file name. This creates problems, since the file name is
the basic point for verification. As follows from the section above,
using step 1 we can replenish the cache of the trusted machine with
trusted file handles (GFIDs), and perform a security open only by a
trusted GFID (by step 2). However, in this case we need to make sure
that there are no leaks of non-trusted GFIDs (and, moreover, that such
leaks won't be introduced by the development process in future). This
is possible only with a changed GFID format: everywhere in GlusterFS a
GFID should appear as a pair (uuid, is\_verified), where is\_verified
is a boolean variable which is true if this GFID has passed the
procedure of verification (step 1 in the section above).

The next problem is that the current NFS protocol doesn't encrypt the
channel between the NFS client and the NFS server.
It means that in NFS-mounts of GlusterFS volumes the NFS client and the
GlusterFS client should be the same (trusted) machine.

Taking into account the described problems, encryption in GlusterFS is
not supported in NFS-mount sessions.

10.4 Class of cipher algorithms for file data encryption that can be supported by the crypt translator
------------------------------------------------------------------------------------------------------

We'll assume that any symmetric block cipher algorithm is completely
determined by a pair (alg\_id, mode\_id), where alg\_id is an algorithm
defined on elementary cipher blocks (e.g. AES), and mode\_id is a mode
of operation (e.g. ECB, XTS, etc).

Technically, the crypt translator is able to support any symmetric
block cipher algorithm via additional options of the crypt translator.
However, in practice the set of supported algorithms is narrowed
because of various security and organizational issues. Currently we
support only one algorithm: AES\_XTS.

10.5 Bibliography
-----------------

1. Recommendation for Block Cipher Modes of Operation (NIST Special
   Publication 800-38A).
2. Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode
   for Confidentiality on Storage Devices (NIST Special Publication
   800-38E).
3. Recommendation for Key Derivation Using Pseudorandom Functions
   (NIST Special Publication 800-108).
4. Recommendation for Block Cipher Modes of Operation: The CMAC Mode
   for Authentication (NIST Special Publication 800-38B).
5. Recommendation for Block Cipher Modes of Operation: Methods for Key
   Wrapping (NIST Special Publication 800-38F).
6. FIPS PUB 198-1: The Keyed-Hash Message Authentication Code (HMAC).
7. David A. McGrew, John Viega, "The Galois/Counter Mode of Operation
   (GCM)".

11 FAQ
======

**1. How to make sure that my volume is really encrypted?**

Check the respective graph of translators on your trusted client
machine.
This graph is created at mount time and is stored by default in +the file /usr/local/var/log/glusterfs/mountpoint.log + +Here "mountpoint" is the absolute name of the mountpoint, where "/" are +replaced with "-". For example, if your volume is mounted to +/mnt/testfs, then you'll need to check the file +/usr/local/var/log/glusterfs/mnt-testfs.log + +Make sure that this graph contains the crypt translator, which looks +like the following: + + 13: volume xvol-crypt + 14:     type encryption/crypt + 15:     option master-key /home/edward/mykey + 16:     subvolumes xvol-dht + 17: end-volume + +**2. How to make sure that my volume is encrypted with a proper master +key?** + +Check the graph of translators on your trusted client machine (see the +FAQ\#1). Make sure that the option "master-key" of the crypt translator +specifies correct location of the master key on your trusted client +machine. + +**3. Can I change the encryption status of a volume?** + +You can change encryption status (enable/disable encryption) only for +empty volumes. Otherwise it will be incorrect (you'll end with IO +errors, data corruption and security problems). We strongly recommend to +decide once and forever at volume creation time, whether your volume has +to be encrypted, or not. + +**4. I am able to mount my encrypted volume with improper master keys +and get list of file names for every directory. Is it normal?** + +Yes, it is normal. It doesn't contradict the announced functionality: we +encrypt only file's content. File names are not encrypted, so it doesn't +make sense to hide them on the trusted client machine. + +**5. What is the reason for only supporting AES-XTS? This mode is not +using Intel's AES-NI instruction thus not utilizing hardware feature..** + +Distributed file systems impose tighter requirements to at-rest +encryption. We offer more than "at-rest-encryption". We offer "at-rest +encryption and authentication in distributed systems with non-trusted +servers". 
Data and metadata on the server can easily be subjected to tampering
and analysis with the purpose of revealing secret user data, and we
have to resist this tampering by performing data and metadata
authentication.

Unfortunately, it is technically hard to implement full-fledged data
authentication via a stackable file system (GlusterFS translator), so
we have decided to perform a "light" authentication by using a special
cipher mode which is resistant to tampering. Currently OpenSSL supports
only one such mode: XTS. Tampering with ciphertext created in XTS mode
will lead to unpredictable changes in the plain text; that is, the user
will see "unpredictable gibberish" on the client side. Of course, this
is not an "official" way to detect tampering, but it is much better
than nothing. The "official" way (creating/checking MACs) we use for
metadata authentication.

Other modes like CBC, CFB, OFB, etc. supported by OpenSSL are strongly
discouraged for use in distributed systems with non-trusted servers.
For example, CBC mode doesn't "survive" the overwrite of a logical
block in a file: with every such overwrite (a standard file system
operation) we would need to re-encrypt the whole(!) file with a
different key. CFB and OFB modes are sensitive to tampering: there is a
way to perform \*predictable\* changes in the plaintext, which is
unacceptable.

Yes, XTS is slow (at least its current implementation in OpenSSL), but
we don't promise that CFB or OFB with full-fledged authentication would
be faster.
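The padded-file-size \<-\> real-file-size translation mentioned in
section 6.5 is plain arithmetic and can be sketched as follows. This is
an illustrative sketch only: the helper names are made up and the block
size is the documented default, not crypt translator code.

```python
DEFAULT_BLOCK_SIZE = 4096  # default crypt block-size; 512..4096 allowed


def padded_size(real_size, block_size=DEFAULT_BLOCK_SIZE):
    """Size of the encrypted file as stored on the server: the real
    size rounded up to the next multiple of the cipher block size."""
    return -(-real_size // block_size) * block_size  # ceiling division


def padding_bytes(real_size, block_size=DEFAULT_BLOCK_SIZE):
    """How much padding the server-side file carries at its end; the
    real size itself lives in trusted.glusterfs.crypt.att.size."""
    return padded_size(real_size, block_size) - real_size
```

For example, a 4000-byte file stored with the default 4096-byte blocks
occupies 4096 bytes on the server, and the crypt translator reports
4000 bytes to the application.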
diff --git a/Feature Planning/GlusterFS 3.5/File Snapshot.md b/Feature Planning/GlusterFS 3.5/File Snapshot.md new file mode 100644 index 0000000..b2d6c69 --- /dev/null +++ b/Feature Planning/GlusterFS 3.5/File Snapshot.md @@ -0,0 +1,101 @@ +Feature +------- + +File Snapshots in GlusterFS + +### Summary + +Ability to take snapshots of files in GlusterFS + +### Owners + +Anand Avati + +### Source code + +Patch for this feature - + +### Detailed Description + +The feature adds file snapshotting support to GlusterFS. '' To use this +feature the file format should be QCOW2 (from QEMU)'' . The patch takes +the block layer code from Qemu and converts it into a translator in +gluster. + +### Benefit to GlusterFS + +Better integration with Openstack Cinder, and in general ability to take +snapshots of files (typically VM images) + +### Usage + +*To take snapshot of a file, the file format should be QCOW2. To set +file type as qcow2 check step \#2 below* + +​1. Turning on snapshot feature : + + gluster volume set `` features.file-snapshot on + +​2. To set qcow2 file format: + + setfattr -n trusted.glusterfs.block-format -v qcow2:10GB  + +​3. To create a snapshot: + + setfattr -n trusted.glusterfs.block-snapshot-create -v  + +​4. To apply/revert back to a snapshot: + + setfattr -n trusted.glusterfs.block-snapshot-goto -v   + +### Scope + +#### Nature of proposed change + +The work is going to be a new translator. Very minimal changes to +existing code (minor change in syncops) + +#### Implications on manageability + +Will need ability to load/unload the translator in the stack. + +#### Implications on presentation layer + +Feature must be presentation layer independent. + +#### Implications on persistence layer + +No implications + +#### Implications on 'GlusterFS' backend + +Internal snapshots - No implications. External snapshots - there will be +hidden directories added. 
+ +#### Modification to GlusterFS metadata + +New xattr will be added to identify files which are 'snapshot managed' +vs raw files. + +#### Implications on 'glusterd' + +Yet another turn on/off feature for glusterd. Volgen will have to add a +new translator in the generated graph. + +### How To Test + +Snapshots can be tested by taking snapshots along with checksum of the +state of the file, making further changes and going back to old snapshot +and verify the checksum again. + +### Dependencies + +Dependent QEMU code is imported into the codebase. + +### Documentation + + + +### Status + +Merged in master and available in Gluster3.5 \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md b/Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md new file mode 100644 index 0000000..a26aa7a --- /dev/null +++ b/Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md @@ -0,0 +1,96 @@ +Feature +======= + +On-Wire Compression/Decompression + +1. Summary +========== + +Translator to compress/decompress data in flight between client and +server. + +2. Owners +========= + +- Venky Shankar +- Prashanth Pai + +3. Current Status +================= + +Code has already been merged. Needs more testing. + +The [initial submission](http://review.gluster.org/3251) contained a +`compress` option, which introduced [some +confusion](https://bugzilla.redhat.com/1053670). [A correction has been +sent](http://review.gluster.org/6765) to rename the user visible options +to start with `network.compression`. + +TODO + +- Make xlator pluggable to add support for other compression methods +- Add support for lz4 compression: + +4. Detailed Description +======================= + +- When a writev call occurs, the client compresses the data before + sending it to server. On the server, compressed data is + decompressed. Similarly, when a readv call occurs, the server + compresses the data before sending it to client. 
On the client, the + compressed data is decompressed. Thus the amount of data sent over + the wire is minimized. + +- Compression/Decompression is done using Zlib library. + +- During normal operation, this is the format of data sent over wire: + + trailer(8 bytes). The trailer contains the CRC32 + checksum and length of original uncompressed data. This is used for + validation. + +5. Usage +======== + +Turning on compression xlator: + + # gluster volume set  network.compression on + +Configurable options: + + # gluster volume set  network.compression.compression-level 8 + # gluster volume set  network.compression.min-size 50 + +6. Benefits to GlusterFS +======================== + +Fewer bytes transferred over the network. + +7. Issues +========= + +- Issues with striped volumes. Compression xlator cannot work with + striped volumes + +- Issues with write-behind: Mount point hangs when writing a file with + write-behind xlator turned on. To overcome this, turn off + write-behind entirely OR set "performance.strict-write-ordering" to + on. + +- Issues with AFR: AFR v1 currently does not propagate xdata. + This issue has + been resolved in AFR v2. + +8. Dependencies +=============== + +Zlib library + +9. Documentation +================ + + + +10. Status +========== + +Code merged upstream. \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Quota Scalability.md b/Feature Planning/GlusterFS 3.5/Quota Scalability.md new file mode 100644 index 0000000..f3b0a0d --- /dev/null +++ b/Feature Planning/GlusterFS 3.5/Quota Scalability.md @@ -0,0 +1,99 @@ +Feature +------- + +Quota Scalability + +Summary +------- + +Support upto 65536 quota configurations per volume. + +Owners +------ + +Krishnan Parthasarathi +Vijay Bellur + +Current status +-------------- + +Current implementation of Directory Quota cannot scale beyond a few +hundred configured limits per volume. The aim of this feature is to +support upto 65536 quota configurations per volume. 
+ +Detailed Description +-------------------- + +TBD + +Benefit to GlusterFS +-------------------- + +More quotas can be configured in a single volume thereby leading to +support GlusterFS for use cases like home directory. + +Scope +----- + +### Nature of proposed change + +- Move quota enforcement translator to the server +- Introduce a new quota daemon which helps in aggregating directory + consumption on the server +- Enhance marker's accounting to be modular +- Revamp configuration persistence and CLI listing for better scale +- Allow configuration of soft limits in addition to hard limits. + +### Implications on manageability + +Mostly the CLI will be backward compatible. New CLI to be introduced +needs to be enumerated here. + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +- Addition of a new extended attribute for storing configured hard and +soft limits on directories. + +### Implications on 'glusterd' + +- New file based configuration persistence + +How To Test +----------- + +TBD + +User Experience +--------------- + +TBD + +Dependencies +------------ + +None + +Documentation +------------- + +TBD + +Status +------ + +In development + +Comments and Discussion +----------------------- diff --git a/Feature Planning/GlusterFS 3.5/Zerofill.md b/Feature Planning/GlusterFS 3.5/Zerofill.md new file mode 100644 index 0000000..43b279d --- /dev/null +++ b/Feature Planning/GlusterFS 3.5/Zerofill.md @@ -0,0 +1,192 @@ +Feature +------- + +zerofill API for GlusterFS + +Summary +------- + +zerofill() API would allow creation of pre-allocated and zeroed-out +files on GlusterFS volumes by offloading the zeroing part to server +and/or storage (storage offloads use SCSI WRITESAME). + +Owners +------ + +Bharata B Rao +M. 
Mohankumar

Current status
--------------

Patch on gerrit:

Detailed Description
--------------------

Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in
the specified range. This fop will be useful when a whole file needs to
be initialized with zero (for example, for zero-filled VM disk image
provisioning or during scrubbing of VM disk images).

A client/application can issue this FOP for zeroing out. The Gluster
server will zero out the required range of bytes, i.e., server-offloaded
zeroing. In the absence of this fop, the client/application has to
repetitively issue write (zero) fops to the server, which is a very
inefficient method because of the overheads involved in RPC calls and
acknowledgements.

WRITESAME is a SCSI T10 command that takes a block of data as input and
writes the same data to other blocks; this write is handled completely
within the storage and hence is known as an offload. Linux now has
support for the SCSI WRITESAME command, which is exposed to the user in
the form of the BLKZEROOUT ioctl. The BD xlator can exploit the
BLKZEROOUT ioctl to implement this fop. Thus zeroing out operations can
be completely offloaded to the storage device, making them highly
efficient.

The fop takes two arguments, offset and size. It zeroes out 'size'
bytes in an opened file starting from the 'offset' position.

Benefit to GlusterFS
--------------------

Benefits GlusterFS in virtualization by providing the ability to quickly
create pre-allocated and zeroed-out VM disk images by using
server/storage offloads.

### Scope

Nature of proposed change
-------------------------

An FOP supported in libgfapi and FUSE.

Implications on manageability
-----------------------------

None.
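For comparison, the client-side fallback that zerofill replaces can be
sketched in a few lines of userspace Python (a hypothetical
illustration of the semantics only, not Gluster code): zero out `size`
bytes of an open file starting at `offset` with repeated zero writes.

```python
import os

def zerofill_fallback(fd, offset, size, chunk=1 << 20):
    """Zero out `size` bytes starting at `offset` by issuing repeated
    zero writes -- the inefficient client-side path that the ZEROFILL
    fop replaces with a single server/storage-offloaded request."""
    zeros = b"\0" * chunk
    written = 0
    while written < size:
        n = min(chunk, size - written)
        os.pwrite(fd, zeros[:n], offset + written)  # one RPC per chunk
        written += n
    return written
```

With the ZEROFILL fop, this whole loop collapses into a single request
(glfs\_zerofill in libgfapi), and on a BD backend the zeroing can be
pushed down to the device via the BLKZEROOUT ioctl.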
Implications on presentation layer
----------------------------------

N/A

Implications on persistence layer
---------------------------------

N/A

Implications on 'GlusterFS' backend
-----------------------------------

N/A

Modification to GlusterFS metadata
----------------------------------

N/A

Implications on 'glusterd'
--------------------------

N/A

How To Test
-----------

Test server offload by measuring the time taken for creating a fully
allocated and zeroed file on the Posix backend.

Test storage offload by measuring the time taken for creating a fully
allocated and zeroed file on the BD backend.

User Experience
---------------

Fast provisioning of VM images when GlusterFS is used as a file system
backend for KVM virtualization.

Dependencies
------------

zerofill() support in the BD backend depends on the new BD translator -


Documentation
-------------

This feature adds support for a new ZEROFILL fop. Zerofill writes
zeroes to a file in the specified range. This fop will be useful when a
whole file needs to be initialized with zero (for example, for
zero-filled VM disk image provisioning or during scrubbing of VM disk
images).

A client/application can issue this FOP for zeroing out. The Gluster
server will zero out the required range of bytes, i.e., server-offloaded
zeroing. In the absence of this fop, the client/application has to
repetitively issue write (zero) fops to the server, which is a very
inefficient method because of the overheads involved in RPC calls and
acknowledgements.

WRITESAME is a SCSI T10 command that takes a block of data as input and
writes the same data to other blocks; this write is handled completely
within the storage and hence is known as an offload. Linux now has
support for the SCSI WRITESAME command, which is exposed to the user in
the form of the BLKZEROOUT ioctl. The BD xlator can exploit the
BLKZEROOUT ioctl to implement this fop.
Thus zeroing out operations can be completely
offloaded to the storage device, making them highly efficient.

The fop takes two arguments, offset and size. It zeroes out 'size'
bytes in an opened file starting from the 'offset' position.

This feature adds zerofill support to the following areas:

-   libglusterfs
-   io-stats
-   performance/md-cache,open-behind
-   quota
-   cluster/afr,dht,stripe
-   rpc/xdr
-   protocol/client,server
-   io-threads
-   marker
-   storage/posix
-   libgfapi

Client applications can exploit this fop by using glfs\_zerofill,
introduced in libgfapi. FUSE support for this fop has not been added,
as there is no corresponding system call.

Here is a performance comparison of server-offloaded zerofill vs.
zeroing out using repeated writes.

    [root@llmvm02 remote]# time ./offloaded aakash-test log 20

    real    3m34.155s
    user    0m0.018s
    sys 0m0.040s

    [root@llmvm02 remote]# time ./manually aakash-test log 20

    real    4m23.043s
    user    0m2.197s
    sys 0m14.457s

    [root@llmvm02 remote]# time ./offloaded aakash-test log 25;

    real    4m28.363s
    user    0m0.021s
    sys 0m0.025s

    [root@llmvm02 remote]# time ./manually aakash-test log 25

    real    5m34.278s
    user    0m2.957s
    sys 0m18.808s

The argument 'log' is the file used for logging, and the third argument
is the size in GB.

As we can see, there is a performance improvement of around 20% with
this fop.

Status
------

Patch: Status: Merged
\ No newline at end of file
diff --git a/Feature Planning/GlusterFS 3.5/gfid access.md b/Feature Planning/GlusterFS 3.5/gfid access.md
new file mode 100644
index 0000000..db64076
--- /dev/null
+++ b/Feature Planning/GlusterFS 3.5/gfid access.md
@@ -0,0 +1,89 @@
### Instructions

**Feature**

'gfid-access' translator to provide access to data in glusterfs using a virtual path.
**1 Summary**

This translator is designed to provide direct access to files in glusterfs using their GFID. The 'GFID' is glusterfs's equivalent of an inode number, identifying a file uniquely.

**2 Owners**

Amar Tumballi  
Raghavendra G  
Anand Avati  

**3 Current status**

With glusterfs-3.4.0, glusterfs provides only path-based access. A feature was added in the 'fuse' layer in the current master branch,
but it is desirable to have it as a separate translator for long-term
maintenance.

**4 Detailed Description**

With this method, we can consume the data in the changelog translator
(which logs 'gfid's internally) very efficiently.

**5 Benefit to GlusterFS**

Provides a way to access files quickly and directly by gfid.

**6. Scope**

6.1. Nature of proposed change

* A new translator.
* Fixes in 'glusterfsd.c' to add this translator automatically based
on a mount-time option.
* A change to mount.glusterfs to parse this new option  
(a single-digit number of lines changed)

6.2. Implications on manageability

* No CLI required.
* The mount.glusterfs script gets a new option.

6.3. Implications on presentation layer

* A new virtual access path is made available. But all access protocols work seamlessly, as the complexities are handled internally.

6.4. Implications on persistence layer

* None

6.5. Implications on 'GlusterFS' backend

* None

6.6. Modification to GlusterFS metadata

* None

6.7. Implications on 'glusterd'

* None

7 How To Test

* Mount the glusterfs client with '-o aux-gfid-mount' and access files using '/mount/point/.gfid/<gfid>'.

8 User Experience

* A new virtual path available for users.

9 Dependencies

* None

10 Documentation

This wiki.

11 Status

Patch sent upstream. More review comments required.
(http://review.gluster.org/5497)

12 Comments and Discussion

Please do give comments :-)
\ No newline at end of file
diff --git a/Feature Planning/GlusterFS 3.5/index.md b/Feature Planning/GlusterFS 3.5/index.md
index 592a909..c36fa7d 100644
--- a/Feature Planning/GlusterFS 3.5/index.md
+++ b/Feature Planning/GlusterFS 3.5/index.md
@@ -14,6 +14,16 @@ GlusterFS 3.5

- [Features/AFR CLI enhancements](./AFR CLI enhancements.md)
- [Features/exposing volume capabilities](./Exposing Volume Capabilities.md)
+- [Features/File Snapshot](./File Snapshot.md)
+- [Features/gfid-access](./gfid access.md)
+- [Features/On-Wire Compression Decompression](./Onwire Compression-Decompression.md)
+- [Features/Quota Scalability](./Quota Scalability.md)
+- [Features/readdir ahead](./readdir ahead.md)
+- [Features/zerofill](./Zerofill.md)
+- [Features/Brick Failure Detection](./Brick Failure Detection.md)
+- [Features/disk-encryption](./Disk-Encryption.md)
+- Changelog-based parallel geo-replication
+- Improved block device translator

Proposing New Features
----------------------
diff --git a/Feature Planning/GlusterFS 3.5/readdir ahead.md b/Feature Planning/GlusterFS 3.5/readdir ahead.md
new file mode 100644
index 0000000..fe34a97
--- /dev/null
+++ b/Feature Planning/GlusterFS 3.5/readdir ahead.md
@@ -0,0 +1,117 @@
Feature
-------

readdir-ahead

Summary
-------

Provide read-ahead support for directories to improve sequential
directory read performance.

Owners
------

Brian Foster

Current status
--------------

Gluster currently does not attempt to improve directory read
performance. As a result, simple operations (e.g., ls) on large
directories are slow.

Detailed Description
--------------------

The read-ahead feature for directories is analogous to read-ahead for
files. The objective is to detect sequential directory read operations
and establish a pipeline for directory content.
When a readdir request
is received and fulfilled, subsequent readdir requests are preemptively
issued to the server in anticipation of those requests from the user.
If sequential readdir requests are received, the directory content is
already immediately available on the client. If subsequent requests are
not sequential or not received, the prefetched data is simply dropped
and the optimization is bypassed.

Benefit to GlusterFS
--------------------

Improved read performance of large directories.

### Scope

Nature of proposed change
-------------------------

readdir-ahead support is enabled through a new client-side translator.

Implications on manageability
-----------------------------

None beyond the ability to enable and disable the translator.

Implications on presentation layer
----------------------------------

N/A

Implications on persistence layer
---------------------------------

N/A

Implications on 'GlusterFS' backend
-----------------------------------

N/A

Modification to GlusterFS metadata
----------------------------------

N/A

Implications on 'glusterd'
--------------------------

N/A

How To Test
-----------

Performance testing. Verify that sequential reads of large directories
complete faster (e.g., ls, xfs\_io -c readdir).

User Experience
---------------

Improved performance on sequential read workloads. The translator should
otherwise be invisible and should not detract from performance or
disrupt behavior in any way.

Dependencies
------------

N/A

Documentation
-------------

Set the associated config option to enable or disable directory
read-ahead on a volume:

    gluster volume set <volname> readdir-ahead [enable|disable]

readdir-ahead is disabled by default.

Status
------

Development complete for the initial version. Minor changes and bug
fixes likely.

Future versions might expand to provide generic caching and more
flexible behavior.
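The prefetch pipeline described above can be illustrated with a small
userspace sketch (purely hypothetical, not the translator's code): an
iterator that always keeps the next batch of directory entries in hand,
fetching it before the consumer asks, so sequential readers find
entries already buffered.

```python
import os
from itertools import islice

def readdir_ahead(path, batch=2):
    """Yield directory entry names while keeping one batch pre-fetched,
    analogous to the translator's preemptive readdir requests."""
    it = (entry.name for entry in os.scandir(path))
    ahead = list(islice(it, batch))      # prime the pipeline
    while ahead:
        nxt = list(islice(it, batch))    # prefetch the next batch early
        yield from ahead                 # serve already-buffered entries
        ahead = nxt
```

If the consumer stops early (a non-sequential access pattern), the
prefetched batch is simply garbage-collected, mirroring the "data is
dropped and the optimization is bypassed" behavior above.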
Comments and Discussion
-----------------------
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index 1224ee3..dc56fd7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -184,6 +184,14 @@ pages:
- ['Feature Planning/GlusterFS 3.5/index.md','Feature Planning 3.5','index']
- ['Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md','Feature Planning 3.5','AFR CLI enhancements']
- ['Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md','Feature Planning 3.5','Exposing Volume Capabilities']
+- ['Feature Planning/GlusterFS 3.5/File Snapshot.md','Feature Planning 3.5','File Snapshot']
+- ['Feature Planning/GlusterFS 3.5/gfid access.md','Feature Planning 3.5','gfid access']
+- ['Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md','Feature Planning 3.5','On wire Compression Decompression']
+- ['Feature Planning/GlusterFS 3.5/Quota Scalability.md','Feature Planning 3.5','Quota Scalability']
+- ['Feature Planning/GlusterFS 3.5/readdir ahead.md','Feature Planning 3.5','readdir ahead']
+- ['Feature Planning/GlusterFS 3.5/Zerofill.md','Feature Planning 3.5','Zerofill']
+- ['Feature Planning/GlusterFS 3.5/Brick Failure Detection.md','Feature Planning 3.5','Brick Failure Detection']
+- ['Feature Planning/GlusterFS 3.5/Disk Encryption.md','Feature Planning 3.5','Disk Encryption']
#GlusterFS Tools
- ['GlusterFS Tools/README.md', 'GlusterFS Tools', 'GlusterFS Tools List']