mirror of https://github.com/gluster/glusterdocs.git synced 2026-02-05 15:47:01 +01:00

adding missing feature planning pages to 3.5 planning

Signed-off-by: shravantc <shravantc@ymail.com>
This commit is contained in:
shravantc
2015-06-03 12:57:42 +05:30
parent 205a70c43d
commit 8b21b7e439
10 changed files with 1306 additions and 0 deletions


@@ -0,0 +1,151 @@
Feature
-------
Brick Failure Detection
Summary
-------
This feature attempts to identify storage/file system failures and
disable the failed brick without disrupting the remainder of the node's
operation.
Owners
------
Vijay Bellur with help from Niels de Vos (or the other way around)
Current status
--------------
Currently, if the underlying storage or file system fails, a brick
process will continue to run. In some cases, a brick can hang due to
failures in the underlying system, and such hangs in brick processes can
cause applications running on glusterfs clients to hang as well.
Detailed Description
--------------------
Detecting failures on the filesystem that a brick uses makes it possible
to handle errors that are caused from outside of the Gluster
environment.
Brick processes have been seen to hang when the underlying storage of a
brick became unavailable. A hanging brick process can still use the
network and respond to clients, but actual I/O to the storage is
impossible and can cause noticeable delays on the client side.
Benefit to GlusterFS
--------------------
Provide better detection of storage subsystem failures and prevent
bricks from hanging.
Scope
-----
### Nature of proposed change
Add a health-checker to the posix xlator that periodically checks the
status of the filesystem (implies checking of functional
storage-hardware).
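A minimal sketch of what such a health-checker could look like
(illustrative Python; the actual checker is implemented in C inside the
posix xlator, and the probe-file name used here is an assumption):

```python
import os
import time

def check_brick_health(brick_path):
    """One health-check pass: verify the brick's filesystem still
    accepts a simple write plus stat. Returns True if healthy."""
    probe = os.path.join(brick_path, ".glusterfs-health-check")
    try:
        # A small write exercises both the VFS and the underlying
        # storage; a failed or read-only filesystem raises OSError here.
        with open(probe, "w") as f:
            f.write(str(time.time()))
        os.stat(probe)
        return True
    except OSError:
        return False

def health_check_loop(brick_path, interval=30):
    """Run a check every `interval` seconds; on failure exit the brick
    process so clients fail over instead of hanging."""
    while check_brick_health(brick_path):
        time.sleep(interval)
    raise SystemExit("brick storage failed health check")
```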
### Implications on manageability
When a brick process detects that the underlying storage is no longer
responding, the process will exit. There is no automated way for the
brick process to get restarted; the sysadmin will need to fix the
problem with the storage first.
After correcting the storage (hardware or filesystem) issue, the
following command will start the brick process again:
# gluster volume start <VOLNAME> force
### Implications on presentation layer
None
### Implications on persistence layer
None
### Implications on 'GlusterFS' backend
None
### Modification to GlusterFS metadata
None
### Implications on 'glusterd'
'glusterd' can detect that the brick process has exited,
`gluster volume status` will show that the brick process is not running
anymore. System administrators checking the logs should be able to
triage the cause.
How To Test
-----------
The health-checker thread that is part of each brick process will get
started automatically when a volume has been started. Verifying its
functionality can be done in different ways.
On virtual hardware:
- disconnect the disk from the VM that holds the brick
On real hardware:
- simulate a RAID-card failure by unplugging the card or cables
On a system that uses LVM for the bricks:
- use device-mapper to load an error-table for the disk, see [this
description](http://review.gluster.org/5176).
On any system (writing to random offsets of the block device, more
difficult to trigger):
1. cause corruption on the filesystem that holds the brick
2. read contents from the brick, hoping to hit the corrupted area
3. the filesystem should abort after hitting a bad spot, and the
health-checker should notice that shortly afterwards
User Experience
---------------
No more hanging brick processes when storage-hardware or the filesystem
fails.
Dependencies
------------
Posix translator, not available for the BD-xlator.
Documentation
-------------
The health-checker is enabled by default and runs a check every 30
seconds. This interval can be changed per volume with:
# gluster volume set <VOLNAME> storage.health-check-interval <SECONDS>
If `SECONDS` is set to 0, the health-checker will be disabled.
For further details refer:
<https://forge.gluster.org/glusterfs-core/glusterfs/blobs/release-3.5/doc/features/brick-failure-detection.md>
Status
------
glusterfs-3.4 and newer include a health-checker for the posix xlator,
which was introduced with [bug
971774](https://bugzilla.redhat.com/971774):
- [posix: add a simple
health-checker](http://review.gluster.org/5176)
Comments and Discussion
-----------------------


@@ -0,0 +1,443 @@
Feature
=======
Transparent encryption. Allows a volume to be encrypted "at rest" on the
server using keys only available on the client.
1 Summary
=========
Distributed systems impose tighter requirements on at-rest encryption.
This is because your encrypted data is stored on servers, which are de
facto untrusted. In particular, your private encrypted data can be
subjected to analysis and tampering, which will eventually lead to its
disclosure if it is not properly protected. Usually it is not enough to
just encrypt data: in distributed systems, serious protection of your
personal data is possible only in conjunction with a special process
called authentication. GlusterFS provides such an enhanced service: in
GlusterFS, encryption is enhanced with authentication. Currently we
provide protection from "silent tampering". This is a kind of tampering
which is hard to detect, because it doesn't break POSIX compliance.
Specifically, we protect encryption-specific file metadata. Such
metadata includes the file's unique object id (GFID), cipher algorithm
id, cipher block size and other attributes used by the encryption
process.
1.1 Restrictions
----------------
1. We encrypt only file content. The feature of transparent encryption
doesn't protect file names: they are neither encrypted nor verified.
Protection of file names is not as critical as protection of
encryption-specific file metadata: any attack based on tampering with
file names will break POSIX compliance and result in massive corruption,
which is easy to detect.
2. The feature of transparent encryption doesn't work in NFS-mounts of
GlusterFS volumes: NFS's file handles introduce security issues, which
are hard to resolve. NFS mounts of encrypted GlusterFS volumes will
result in failed file operations (see section "Encryption in different
types of mount sessions" for more details).
3. The feature of transparent encryption is incompatible with GlusterFS
performance translators quick-read, write-behind and open-behind.
2 Owners
========
Jeff Darcy <jdarcy@redhat.com>
Edward Shishkin <eshishki@redhat.com>
3 Current status
================
Merged upstream.
4 Detailed Description
======================
See Summary.
5 Benefit to GlusterFS
======================
Besides the justifications that have applied to on-disk encryption just
about forever, recent events have raised awareness significantly.
Encryption using keys that are physically present at the server leaves
data vulnerable to physical seizure of the server. Encryption using keys
that are kept by the same organization entity leaves data vulnerable to
"insider threat" plus coercion or capture at the organization level. For
many, especially various kinds of service providers, only pure
client-side encryption provides the necessary levels of privacy and
deniability.
Competitively, other projects - most notably
[Tahoe-LAFS](https://leastauthority.com/) - are already using recently
heightened awareness of these issues to attract users who would be
better served by our performance/scalability, usability, and diversity
of interfaces. Only the lack of proper encryption holds us back in these
cases.
6 Scope
=======
6.1. Nature of proposed change
------------------------------
This is a new client-side translator, using user-provided key
information plus information stored in xattrs to encrypt data
transparently as it's written and decrypt when it's read.
6.2. Implications on manageability
----------------------------------
User needs to manage a per-volume master key (MK). That is:
1) Generate an independent MK for every volume which is to be
encrypted. Note that one MK is created for the whole lifetime of the
volume.
2) Provide the MK on the client side at every mount, in accordance with
the location which was specified at volume create time or overridden via
the respective mount option (see section How To Test).
3) Keep the MK between mount sessions. Note that after a successful
mount the MK may be removed from the specified location. In this case
the user should retain the MK safely till the next mount session.
The MK is a 256-bit secret string, which is known only to the user.
Generating and retaining the MK is the user's responsibility.
WARNING!!! Losing the MK will make the content of all regular files of
your volume inaccessible. It is possible to mount a volume with an
improper MK; however, such mount sessions will only allow access to file
names, as they are not encrypted.
Recommendations on MK generation
The MK has to be a high-entropy key, appropriately generated by a key
derivation algorithm. One possible way is using rand(1) provided by the
OpenSSL package. You need to specify the option "-hex" for proper output
format. For example, the following command prints a generated key to the
standard output:
$ openssl rand -hex 32
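An equivalent key can also be produced from Python's standard library (a
sketch; `openssl rand -hex 32` above is the documented method):

```python
import secrets

def generate_master_key():
    """Return a 256-bit master key as 64 hex characters, matching the
    output format of `openssl rand -hex 32`."""
    return secrets.token_hex(32)  # 32 random bytes -> 64 hex digits
```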
6.3. Implications on presentation layer
---------------------------------------
N/A
6.4. Implications on persistence layer
--------------------------------------
N/A
6.5. Implications on 'GlusterFS' backend
----------------------------------------
All encrypted files on the servers contain padding at the end of the
file. That is, the size of all encrypted files on the servers is a
multiple of the cipher block size. The real file size is stored as a
file xattr with the key "trusted.glusterfs.crypt.att.size". The
translation padded-file-size -\> real-file-size (and backward) is
performed by the crypt translator.
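The forward translation can be sketched as follows (illustrative helper;
the backward translation padded-size -\> real-size cannot be computed
arithmetically and relies on the size stored in the xattr):

```python
def padded_size(real_size, block_size=4096):
    """Size of the file as stored on the server: the real size rounded
    up to a multiple of the cipher block size."""
    if real_size % block_size == 0:
        return real_size
    return (real_size // block_size + 1) * block_size
```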
6.6. Modification to GlusterFS metadata
---------------------------------------
Encryption-specific metadata is stored in a specified format as a file
xattr with the key "trusted.glusterfs.crypt.att.cfmt". The current
format of the metadata string is described on slide \#27 of the
following [design
document](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf)
6.7. Options of the crypt translator
------------------------------------
- data-cipher-alg
Specifies the cipher algorithm for file data encryption. Currently only
one option is available: AES\_XTS. This is a hidden option.
- block-size
Specifies size (in bytes) of logical chunk which is encrypted as a whole
unit in the file body. If cipher modes with initial vectors are used for
encryption, then the initial vector gets reset for every such chunk.
Available values are: "512", "1024", "2048" and "4096". Default value is
"4096".
- data-key-size
Specifies size (in bits) of data cipher key. For AES\_XTS available
values are: "256" and "512". Default value is "256". The larger key size
("512") is for stronger security.
- master-key
Specifies the pathname of a regular file or symlink. Defines the
location of the master volume key on the trusted client machine.
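Since each logical chunk (see the block-size option above) is encrypted
as a whole unit, a write that touches any part of a chunk implies
re-encrypting that entire chunk. A sketch of mapping a write's byte
range onto the affected chunk indices (hypothetical helper, not part of
the translator):

```python
def affected_chunks(offset, length, block_size=4096):
    """Indices of the logical chunks a write to [offset, offset+length)
    touches; each of these must be re-encrypted in full."""
    if length <= 0:
        return range(0)  # empty range: nothing to re-encrypt
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)
```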
7 Getting Started With Crypt Translator
=======================================
1. Create a volume <vol_name>.
2. Turn on crypt xlator:
# gluster volume set `<vol_name>` encryption on
3. Turn off the performance xlators with which encryption is currently
incompatible:
# gluster volume set <vol_name> performance.quick-read off
# gluster volume set <vol_name> performance.write-behind off
# gluster volume set <vol_name> performance.open-behind off
4. (optional) Set location of the volume master key:
# gluster volume set <vol_name> encryption.master-key <master_key_location>
where <master_key_location> is an absolute pathname of the file, which
will contain the volume master key (see section implications on
manageability).
5. (optional) Override default options of crypt xlator:
# gluster volume set <vol_name> encryption.data-key-size <data_key_size>
where <data_key_size> should have one of the following values:
"256"(default), "512".
# gluster volume set <vol_name> encryption.block-size <block_size>
where <block_size> should have one of the following values: "512",
"1024", "2048", "4096"(default).
6. Define the location of the master key on your client machine, if it
wasn't specified at step 4 above, or if you want it to be different from
the <master_key_location> specified at step 4.
7. On the client side make sure that the file with the name
<master_key_location> (or <master_key_new_location> defined at step 6)
exists and contains the respective per-volume master key (see section
implications on manageability). This key has to be in hex form, i.e. it
should be represented by 64 symbols from the set {'0', ..., '9', 'a',
..., 'f'}. The key should start at the beginning of the file. All
symbols at offsets \>= 64 are ignored.
NOTE: <master_key_location> (or <master_key_new_location> defined at
step 6) can be a symlink. In this case make sure that the target file of
this symlink exists and contains respective per-volume master key.
8. Mount the volume <vol_name> on the client side as usual. If you
specified a location of the master key at step 6, then use the mount
option
--xlator-option=<suffixed_vol_name>.master-key=<master_key_new_location>
where <master_key_new_location> is the location of the master key
specified at step 6, and <suffixed_vol_name> is <vol_name> suffixed with
"-crypt". For example, if you created a volume "myvol" in step 1, then
suffixed\_vol\_name is "myvol-crypt".
9. During mount your client machine receives configuration info from
the untrusted server, so this step is extremely important! Check that
your volume is really encrypted, and that it is encrypted with the
proper master key (see FAQ \#1, \#2).
10. (optional) After a successful mount, the file which contains the
master key may be removed. NOTE: The next mount session will require the
master key again. Keeping the master key between mount sessions is the
user's responsibility (see section implications on manageability).
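The key-format rules from step 7 (64 hex symbols at the start of the
file, later offsets ignored, symlinks followed) can be sketched as a
small validator (hypothetical helper for illustration):

```python
def read_master_key(path):
    """Read and validate a master key file per the rules above.
    Symlinks are followed, since open() follows them by default."""
    with open(path, "r") as f:
        key = f.read(64)  # symbols at offsets >= 64 are ignored
    if len(key) != 64 or any(c not in "0123456789abcdef" for c in key):
        raise ValueError("master key must be 64 hex symbols")
    return key
```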
8 How to test
=============
From a correctness standpoint, it's sufficient to run normal tests with
encryption enabled. From a security standpoint, there's a whole
discipline devoted to analysing the stored data for weaknesses, and
engagement with practitioners of that discipline will be necessary to
develop the right tests.
9 Dependencies
==============
The crypt translator requires OpenSSL version \>= 1.0.1.
10 Documentation
================
10.1 Basic design concepts
--------------------------
The basic design concepts are described in the following [pdf
slides](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf)
10.2 Procedure of security open
-------------------------------
So, in accordance with the basic design concepts above, before every
access to a file's body (by read(2), write(2), truncate(2), etc.) we
need to make sure that the file's metadata is trusted. Otherwise, we
risk dealing with untrusted file data.
To make sure that a file's metadata is trusted, the file is subjected to
a special procedure of security open. The procedure of security open is
performed by the crypt translator at FOP-\>open() (crypt\_open) time by
the function open\_format(). Currently this is a hardcoded composition
of 2 checks:
1. verification of file's GFID by the file name;
2. verification of file's metadata by the verified GFID;
If the security open succeeds, then the cache of the trusted client
machine is replenished with a file descriptor and the file's inode, and
the user can access the file's content via the read(2), write(2),
ftruncate(2), etc. system calls, which accept a file descriptor as an
argument.
However, the file API also allows access to a file's body without
opening the file. For example, truncate(2) accepts a pathname instead of
a file descriptor. To make sure that the file's metadata is trusted, we
create a temporary file descriptor and unconditionally call
crypt\_open() before truncating the file's body.
10.3 Encryption in different types of mount sessions
----------------------------------------------------
Everything described in the section above is valid only for FUSE mounts.
Besides these, GlusterFS also supports so-called NFS mounts. From the
standpoint of security, the key difference between the mentioned types
of mount sessions is that in NFS mount sessions file operations accept a
so-called file handle (which is actually a GFID) instead of a file name.
This creates problems, since the file name is the basic point for
verification.
As follows from the section above, using step 1 we can replenish the
cache of the trusted machine with trusted file handles (GFIDs), and
perform a security open only by a trusted GFID (by step 2). However, in
this case we need to make sure that there are no leaks of non-trusted
GFIDs (and, moreover, that such leaks won't be introduced by the
development process in the future). This is possible only with a changed
GFID format: everywhere in GlusterFS a GFID should appear as a pair
(uuid, is\_verified), where is\_verified is a boolean variable, which is
true if this GFID has passed the procedure of verification (step 1 in
the section above).
The next problem is that the current NFS protocol doesn't encrypt the
channel between the NFS client and NFS server. This means that in NFS
mounts of GlusterFS volumes the NFS client and GlusterFS client should
be the same (trusted) machine.
Taking into account the described problems, encryption in GlusterFS is
not supported in NFS-mount sessions.
10.4 Class of cipher algorithms for file data encryption that can be supported by the crypt translator
------------------------------------------------------------------------------------------------------
We'll assume that any symmetric block cipher algorithm is completely
determined by a pair (alg\_id, mode\_id), where alg\_id is an algorithm
defined on elementary cipher blocks (e.g. AES), and mode\_id is a mode
of operation (e.g. ECB, XTS, etc).
Technically, the crypt translator is able to support any symmetric block
cipher algorithm via additional options of the crypt translator.
However, in practice the set of supported algorithms is narrowed because
of various security and organizational issues. Currently we support only
one algorithm: AES\_XTS.
10.5 Bibliography
-----------------
1. Recommendation for Block Cipher Modes of Operation (NIST
Special Publication 800-38A).
2. Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode
for Confidentiality on Storage Devices (NIST Special Publication
800-38E).
3. Recommendation for Key Derivation Using Pseudorandom Functions,
(NIST Special Publication 800-108).
4. Recommendation for Block Cipher Modes of Operation: The CMAC Mode
for Authentication, (NIST Special Publication 800-38B).
5. Recommendation for Block Cipher Modes of Operation: Methods for Key
Wrapping, (NIST Special Publication 800-38F).
6. FIPS PUB 198-1 The Keyed-Hash Message Authentication Code (HMAC).
7. David A. McGrew, John Viega "The Galois/Counter Mode of Operation
(GCM)".
11 FAQ
======
**1. How to make sure that my volume is really encrypted?**
Check the respective graph of translators on your trusted client
machine. This graph is created at mount time and is stored by default in
the file /usr/local/var/log/glusterfs/mountpoint.log
Here "mountpoint" is the absolute name of the mountpoint, where "/" are
replaced with "-". For example, if your volume is mounted to
/mnt/testfs, then you'll need to check the file
/usr/local/var/log/glusterfs/mnt-testfs.log
Make sure that this graph contains the crypt translator, which looks
like the following:
13: volume xvol-crypt
14:     type encryption/crypt
15:     option master-key /home/edward/mykey
16:     subvolumes xvol-dht
17: end-volume
**2. How to make sure that my volume is encrypted with a proper master
key?**
Check the graph of translators on your trusted client machine (see
FAQ \#1). Make sure that the option "master-key" of the crypt translator
specifies the correct location of the master key on your trusted client
machine.
**3. Can I change the encryption status of a volume?**
You can change the encryption status (enable/disable encryption) only
for empty volumes. Otherwise it will be incorrect (you'll end up with IO
errors, data corruption and security problems). We strongly recommend
deciding once and forever at volume creation time whether your volume
has to be encrypted or not.
**4. I am able to mount my encrypted volume with improper master keys
and get list of file names for every directory. Is it normal?**
Yes, it is normal. It doesn't contradict the announced functionality: we
encrypt only file content. File names are not encrypted, so it doesn't
make sense to hide them on the trusted client machine.
**5. What is the reason for only supporting AES-XTS? This mode doesn't
use Intel's AES-NI instructions, thus not utilizing the hardware
feature.**
Distributed file systems impose tighter requirements on at-rest
encryption. We offer more than "at-rest encryption": we offer "at-rest
encryption and authentication in distributed systems with non-trusted
servers". Data and metadata on the server can easily be subjected to
tampering and analysis with the purpose of revealing the user's secret
data. And we have to resist this tampering by performing data and
metadata authentication.
Unfortunately, it is technically hard to implement full-fledged data
authentication via a stackable file system (a GlusterFS translator), so
we have decided to perform a "light" authentication by using a special
cipher mode which is resistant to tampering. Currently OpenSSL supports
only one such mode: XTS. Tampering with ciphertext created in XTS mode
will lead to unpredictable changes in the plain text. That is, the user
will see "unpredictable gibberish" on the client side. Of course, this
is not an "official way" to detect tampering, but it is much better than
nothing. The "official way" (creating/checking MACs) we use for metadata
authentication.
Other modes like CBC, CFB, OFB, etc. supported by OpenSSL are strongly
not recommended for use in distributed systems with non-trusted servers.
For example, CBC mode doesn't "survive" the overwrite of a logical block
in a file. It means that with every such overwrite (a standard file
system operation) we would need to re-encrypt the whole(!) file with a
different key. CFB and OFB modes are sensitive to tampering: there is a
way to perform \*predictable\* changes in the plaintext, which is
unacceptable. Yes, XTS is slow (at least its current implementation in
OpenSSL), but we don't promise that CFB or OFB with full-fledged
authentication would be faster.


@@ -0,0 +1,101 @@
Feature
-------
File Snapshots in GlusterFS
### Summary
Ability to take snapshots of files in GlusterFS
### Owners
Anand Avati
### Source code
Patch for this feature - <http://review.gluster.org/5367>
### Detailed Description
The feature adds file snapshotting support to GlusterFS. *To use this
feature the file format should be QCOW2 (from QEMU)*. The patch takes
the block layer code from QEMU and converts it into a translator in
gluster.
### Benefit to GlusterFS
Better integration with Openstack Cinder, and in general ability to take
snapshots of files (typically VM images)
### Usage
*To take a snapshot of a file, the file format should be QCOW2. To set
the file type as qcow2, see step \#2 below.*
1. Turning on the snapshot feature:
gluster volume set `<vol_name>` features.file-snapshot on
2. To set qcow2 file format:
setfattr -n trusted.glusterfs.block-format -v qcow2:10GB <file_name>
3. To create a snapshot:
setfattr -n trusted.glusterfs.block-snapshot-create -v <image_name> <file_name>
4. To apply/revert back to a snapshot:
setfattr -n trusted.glusterfs.block-snapshot-goto -v <image_name> <file_name>
### Scope
#### Nature of proposed change
The work is going to be a new translator. Very minimal changes to
existing code (minor change in syncops)
#### Implications on manageability
Will need ability to load/unload the translator in the stack.
#### Implications on presentation layer
Feature must be presentation layer independent.
#### Implications on persistence layer
No implications
#### Implications on 'GlusterFS' backend
Internal snapshots - No implications. External snapshots - there will be
hidden directories added.
#### Modification to GlusterFS metadata
New xattr will be added to identify files which are 'snapshot managed'
vs raw files.
#### Implications on 'glusterd'
Yet another turn on/off feature for glusterd. Volgen will have to add a
new translator in the generated graph.
### How To Test
Snapshots can be tested by taking a snapshot along with a checksum of
the state of the file, making further changes, reverting to the old
snapshot, and verifying the checksum again.
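A sketch of the checksum half of this test (the snapshot steps
themselves need a gluster mount with file-snapshot enabled, so they
appear only as comments):

```python
import hashlib

def file_checksum(path):
    """SHA-256 of a file's contents, used to verify a snapshot revert."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

# On a gluster mount the test would be (commands from the Usage steps):
#   setfattr -n trusted.glusterfs.block-snapshot-create -v snap1 <file>
#   before = file_checksum("<file>")
#   ... modify <file> ...
#   setfattr -n trusted.glusterfs.block-snapshot-goto -v snap1 <file>
#   assert file_checksum("<file>") == before
```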
### Dependencies
Dependent QEMU code is imported into the codebase.
### Documentation
<http://review.gluster.org/#/c/7488/6/doc/features/file-snapshot.md>
### Status
Merged in master and available in GlusterFS 3.5


@@ -0,0 +1,96 @@
Feature
=======
On-Wire Compression/Decompression
1. Summary
==========
Translator to compress/decompress data in flight between client and
server.
2. Owners
=========
- Venky Shankar <vshankar@redhat.com>
- Prashanth Pai <ppai@redhat.com>
3. Current Status
=================
Code has already been merged. Needs more testing.
The [initial submission](http://review.gluster.org/3251) contained a
`compress` option, which introduced [some
confusion](https://bugzilla.redhat.com/1053670). [A correction has been
sent](http://review.gluster.org/6765) to rename the user visible options
to start with `network.compression`.
TODO
- Make xlator pluggable to add support for other compression methods
- Add support for lz4 compression: <https://code.google.com/p/lz4/>
4. Detailed Description
=======================
- When a writev call occurs, the client compresses the data before
sending it to server. On the server, compressed data is
decompressed. Similarly, when a readv call occurs, the server
compresses the data before sending it to client. On the client, the
compressed data is decompressed. Thus the amount of data sent over
the wire is minimized.
- Compression/Decompression is done using the Zlib library.
- During normal operation, this is the format of data sent over the
wire: <compressed-data> + trailer (8 bytes). The trailer contains the
CRC32 checksum and the length of the original uncompressed data,
which are used for validation.
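A sketch of this wire format using Python's zlib; the field order and
endianness of the 8-byte trailer are assumptions here, as the text only
states that it carries a CRC32 and the uncompressed length:

```python
import struct
import zlib

def pack_on_wire(data, level=8):
    """<compressed-data> + 8-byte trailer (CRC32, uncompressed length)."""
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return zlib.compress(data, level) + trailer

def unpack_on_wire(payload):
    """Decompress and validate against the trailer."""
    body, trailer = payload[:-8], payload[-8:]
    crc, length = struct.unpack("<II", trailer)
    data = zlib.decompress(body)
    if len(data) != length or zlib.crc32(data) & 0xFFFFFFFF != crc:
        raise ValueError("on-wire validation failed")
    return data
```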
5. Usage
========
Turning on compression xlator:
# gluster volume set <vol_name> network.compression on
Configurable options:
# gluster volume set <vol_name> network.compression.compression-level 8
# gluster volume set <vol_name> network.compression.min-size 50
6. Benefits to GlusterFS
========================
Fewer bytes transferred over the network.
7. Issues
=========
- Issues with striped volumes: the compression xlator cannot work with
striped volumes
- Issues with write-behind: Mount point hangs when writing a file with
write-behind xlator turned on. To overcome this, turn off
write-behind entirely OR set "performance.strict-write-ordering" to
on.
- Issues with AFR: AFR v1 currently does not propagate xdata.
<https://bugzilla.redhat.com/show_bug.cgi?id=951800> This issue has
been resolved in AFR v2.
8. Dependencies
===============
Zlib library
9. Documentation
================
<http://review.gluster.org/#/c/7479/3/doc/network_compression.md>
10. Status
==========
Code merged upstream.


@@ -0,0 +1,99 @@
Feature
-------
Quota Scalability
Summary
-------
Support up to 65536 quota configurations per volume.
Owners
------
Krishnan Parthasarathi
Vijay Bellur
Current status
--------------
The current implementation of Directory Quota cannot scale beyond a few
hundred configured limits per volume. The aim of this feature is to
support up to 65536 quota configurations per volume.
Detailed Description
--------------------
TBD
Benefit to GlusterFS
--------------------
More quotas can be configured in a single volume, making GlusterFS
suitable for use cases like home directories.
Scope
-----
### Nature of proposed change
- Move quota enforcement translator to the server
- Introduce a new quota daemon which helps in aggregating directory
consumption on the server
- Enhance marker's accounting to be modular
- Revamp configuration persistence and CLI listing for better scale
- Allow configuration of soft limits in addition to hard limits.
### Implications on manageability
Mostly the CLI will be backward compatible. New CLI to be introduced
needs to be enumerated here.
### Implications on presentation layer
None
### Implications on persistence layer
None
### Implications on 'GlusterFS' backend
None
### Modification to GlusterFS metadata
- Addition of a new extended attribute for storing configured hard and
soft limits on directories.
### Implications on 'glusterd'
- New file based configuration persistence
How To Test
-----------
TBD
User Experience
---------------
TBD
Dependencies
------------
None
Documentation
-------------
TBD
Status
------
In development
Comments and Discussion
-----------------------


@@ -0,0 +1,192 @@
Feature
-------
zerofill API for GlusterFS
Summary
-------
The zerofill() API would allow creation of pre-allocated and zeroed-out
files on GlusterFS volumes by offloading the zeroing part to the server
and/or storage (storage offloads use SCSI WRITESAME).
Owners
------
Bharata B Rao
M. Mohankumar
Current status
--------------
Patch on gerrit: <http://review.gluster.org/5327>
Detailed Description
--------------------
Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in
the specified range. This fop will be useful when a whole file needs to
be initialized with zero (could be useful for zero filled VM disk image
provisioning or during scrubbing of VM disk images).
The client/application can issue this FOP for zeroing out. The Gluster
server will zero out the required range of bytes, i.e. server-offloaded
zeroing. In the absence of this fop, the client/application has to
repetitively issue write (zero) fops to the server, which is a very
inefficient method because of the overheads involved in RPC calls and
acknowledgements.
WRITESAME is a SCSI T10 command that takes a block of data as input and
writes the same data to other blocks; this write is handled completely
within the storage and hence is known as an offload. Linux now has
support for the SCSI WRITESAME command, which is exposed to the user in
the form of the BLKZEROOUT ioctl. The BD xlator can exploit the
BLKZEROOUT ioctl to implement this fop. Thus zeroing-out operations can
be completely offloaded to the storage device, making them highly
efficient.
The fop takes two arguments, offset and size. It zeroes out 'size' bytes
in an opened file starting from the 'offset' position.
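For comparison, the repeated-write fallback that ZEROFILL replaces can
be sketched as follows (illustrative; with the fop, this loop runs on
the server or in storage via BLKZEROOUT instead of issuing one RPC per
chunk):

```python
import os

def zerofill_by_writes(fd, offset, size, chunk=1 << 20):
    """Emulate zerofill with repeated writes: zero `size` bytes of an
    open file starting at `offset`. Each chunk is a separate write,
    which over a network filesystem means a separate RPC round trip."""
    zeros = b"\0" * chunk
    written = 0
    while written < size:
        n = min(chunk, size - written)
        os.pwrite(fd, zeros[:n], offset + written)
        written += n
```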
Benefit to GlusterFS
--------------------
Benefits GlusterFS in virtualization by providing the ability to quickly
create pre-allocated and zeroed-out VM disk image by using
server/storage off-loads.
### Scope
Nature of proposed change
-------------------------
An FOP supported in libgfapi and FUSE.
Implications on manageability
-----------------------------
None.
Implications on presentation layer
----------------------------------
N/A
Implications on persistence layer
---------------------------------
N/A
Implications on 'GlusterFS' backend
-----------------------------------
N/A
Modification to GlusterFS metadata
----------------------------------
N/A
Implications on 'glusterd'
--------------------------
N/A
How To Test
-----------
Test server offload by measuring the time taken for creating a fully
allocated and zeroed file on Posix backend.
Test storage offload by measuring the time taken for creating a fully
allocated and zeroed file on BD backend.
User Experience
---------------
Fast provisioning of VM images when GlusterFS is used as a file system
backend for KVM virtualization.
Dependencies
------------
zerofill() support in BD backend depends on the new BD translator -
<http://review.gluster.org/#/c/4809/>
Documentation
-------------
This feature adds support for a new ZEROFILL fop. Zerofill writes zeroes
to a file in the specified range. This fop is useful when a whole file
needs to be initialized with zero (for example for zero-filled VM disk
image provisioning or during scrubbing of VM disk images).
Client/application can issue this FOP for zeroing out. Gluster server
will zero out required range of bytes ie server offloaded zeroing. In
the absence of this fop, client/application has to repetitively issue
write (zero) fop to the server, which is very inefficient method because
of the overheads involved in RPC calls and acknowledgements.
WRITESAME is a SCSI T10 command that takes a block of data as input and
writes the same data to other blocks and this write is handled
completely within the storage and hence is known as offload . Linux ,now
has support for SCSI WRITESAME command which is exposed to the user in
the form of BLKZEROOUT ioctl. BD Xlator can exploit BLKZEROOUT ioctl to
implement this fop. Thus zeroing out operations can be completely
offloaded to the storage device , making it highly efficient.
The fop takes two arguments offset and size. It zeroes out 'size' number
of bytes in an opened file starting from 'offset' position.
This feature adds zerofill support to the following areas:
- libglusterfs
- io-stats
- performance/md-cache, open-behind
- quota
- cluster/afr, dht, stripe
- rpc/xdr
- protocol/client, server
- io-threads
- marker
- storage/posix
- libgfapi
Client applications can exploit this fop by using glfs\_zerofill
introduced in libgfapi. FUSE support for this fop has not been added as
there is no corresponding system call.
Here is a performance comparison of server-offloaded zerofill vs.
zeroing out using repeated writes:

    [root@llmvm02 remote]# time ./offloaded aakash-test log 20
    real    3m34.155s
    user    0m0.018s
    sys     0m0.040s

    [root@llmvm02 remote]# time ./manually aakash-test log 20
    real    4m23.043s
    user    0m2.197s
    sys     0m14.457s

    [root@llmvm02 remote]# time ./offloaded aakash-test log 25
    real    4m28.363s
    user    0m0.021s
    sys     0m0.025s

    [root@llmvm02 remote]# time ./manually aakash-test log 25
    real    5m34.278s
    user    0m2.957s
    sys     0m18.808s

The argument 'log' is a file used for logging and the third argument is
the size in GB.
As we can see, this fop yields a performance improvement of around 20%.
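The `./offloaded` and `./manually` test programs are not included in this
document. As a rough local analogy to what they measure, the per-call
overhead of many small zero writes versus a single large request can be
illustrated as follows (local writes stand in for RPC round trips; the
4 MiB size is made up and tiny compared to the 20-25 GB runs above):

```python
import os
import tempfile
import time

SIZE = 4 * 1024 * 1024  # 4 MiB test range

def zero_many_small_calls(path, size, chunk=512):
    # One call per small chunk: stands in for repeated write-RPCs.
    with open(path, "wb") as f:
        for off in range(0, size, chunk):
            f.write(b"\0" * min(chunk, size - off))

def zero_single_call(path, size):
    # A single large request: stands in for one offloaded zerofill.
    with open(path, "wb") as f:
        f.write(b"\0" * size)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "many"), os.path.join(d, "single")
        t0 = time.perf_counter(); zero_many_small_calls(a, SIZE)
        t1 = time.perf_counter(); zero_single_call(b, SIZE)
        t2 = time.perf_counter()
        print(f"many small calls: {t1 - t0:.4f}s, single call: {t2 - t1:.4f}s")
```

Both paths produce identical file contents; the difference is purely the
number of calls issued, which is the overhead the zerofill fop eliminates
on the wire.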
Status
------
Patch: <http://review.gluster.org/5327> Status: Merged


@@ -0,0 +1,89 @@
**Feature**
'gfid-access' translator to provide access to data in glusterfs using a virtual path.
**1 Summary**
This translator is designed to provide direct access to files in glusterfs using their gfid. 'GFID' is glusterfs's analogue of an inode number, identifying a file uniquely.
**2 Owners**
Amar Tumballi <atumball@redhat.com>
Raghavendra G <rgowdapp@redhat.com>
Anand Avati <aavati@redhat.com>
**3 Current status**
With glusterfs-3.4.0, glusterfs provides only path-based access. A
feature was added in the 'fuse' layer in the current master branch,
but it is desirable to have it as a separate translator for long-term
maintenance.
**4 Detailed Description**
With this method, we can consume the data in the changelog translator
(which internally logs 'gfid's) very efficiently.
**5 Benefit to GlusterFS**
Provides a way to access files quickly with direct gfid.
**6. Scope**
6.1. Nature of proposed change
* A new translator.
* Fixes in 'glusterfsd.c' to add this translator automatically based
on mount time option.
* change to mount.glusterfs to parse this new option
  (single-digit number of lines changed)
6.2. Implications on manageability
* No CLI required.
* mount.glusterfs script gets a new option.
6.3. Implications on presentation layer
* A new virtual access path is made available. But all access protocols work seamlessly, as the complexities are handled internally.
6.4. Implications on persistence layer
* None
6.5. Implications on 'GlusterFS' backend
* None
6.6. Modification to GlusterFS metadata
* None
6.7. Implications on 'glusterd'
* None
7 How To Test
* Mount the glusterfs client with '-o aux-gfid-mount' and access files using '/mount/point/.gfid/<actual-canonical-gfid-of-the-file>'.
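The canonical gfid in the path above is simply the standard UUID text form
of the 16-byte gfid value. A small sketch of building such a virtual path
from raw gfid bytes (the mount point and gfid value here are made up for
illustration):

```python
import os
import uuid

def gfid_to_virtual_path(mount_point, raw_gfid):
    """Build the .gfid virtual path from a 16-byte gfid value."""
    if len(raw_gfid) != 16:
        raise ValueError("gfid must be 16 bytes")
    canonical = str(uuid.UUID(bytes=raw_gfid))  # 8-4-4-4-12 hex form
    return os.path.join(mount_point, ".gfid", canonical)
```

For example, `gfid_to_virtual_path("/mnt/gluster", bytes(range(16)))`
yields `/mnt/gluster/.gfid/00010203-0405-0607-0809-0a0b0c0d0e0f`.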
8 User Experience
* A new virtual path available for users.
9 Dependencies
* None
10 Documentation
This wiki.
11 Status
Patch sent upstream. More review comments required. (<http://review.gluster.org/5497>)
12 Comments and Discussion
Please do give comments :-)


@@ -14,6 +14,16 @@ GlusterFS 3.5
- [Features/AFR CLI enhancements](./AFR CLI enhancements.md)
- [Features/exposing volume capabilities](./Exposing Volume Capabilities.md)
- [Features/File Snapshot](./File Snapshot.md)
- [Features/gfid-access](./gfid access.md)
- [Features/On-Wire Compression + Decompression](./Onwire Compression-Decompression.md)
- [Features/Quota Scalability](./Quota Scalability.md)
- [Features/readdir ahead](./readdir ahead.md)
- [Features/zerofill](./Zerofill.md)
- [Features/Brick Failure Detection](./Brick Failure Detection.md)
- [Features/disk-encryption](./Disk-Encryption.md)
- Changelog based parallel geo-replication
- Improved block device translator
Proposing New Features
----------------------


@@ -0,0 +1,117 @@
Feature
-------
readdir-ahead
Summary
-------
Provide read-ahead support for directories to improve sequential
directory read performance.
Owners
------
Brian Foster
Current status
--------------
Gluster currently does not attempt to improve directory read
performance. As a result, simple operations (i.e., ls) on large
directories are slow.
Detailed Description
--------------------
The read-ahead feature for directories is analogous to read-ahead for
files. The objective is to detect sequential directory read operations
and establish a pipeline for directory content. When a readdir request
is received and fulfilled, preemptively issue subsequent readdir
requests to the server in anticipation of those requests from the user.
If sequential readdir requests are received, the directory content is
already immediately available in the client. If subsequent requests are
not sequential or not received, said data is simply dropped and the
optimization is bypassed.
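The detect-and-pipeline behaviour described above can be sketched with a
toy model. This is not the translator's code: the fetch callback stands in
for a readdir request to the server, and the prefetch here is synchronous,
whereas a real implementation would issue it asynchronously.

```python
class ReaddirAhead:
    """Toy sequential read-ahead: serve chunk i, prefetch chunk i + 1."""

    def __init__(self, fetch_chunk):
        self.fetch_chunk = fetch_chunk  # stand-in for a server readdir call
        self.cache = {}                 # chunk index -> prefetched entries

    def read(self, index):
        if index in self.cache:
            # Sequential hit: content is already available client-side.
            entries = self.cache.pop(index)
        else:
            # Miss (first read or non-sequential): go to the server and
            # drop any stale prefetched data.
            entries = self.fetch_chunk(index)
            self.cache.clear()
        # Preemptively request the next chunk in anticipation of the reader.
        self.cache[index + 1] = self.fetch_chunk(index + 1)
        return entries
```

Sequential reads after the first are served from the cache; a
non-sequential read simply discards the prefetched data and the
optimization is bypassed, as described above.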
Benefit to GlusterFS
--------------------
Improved read performance of large directories.
### Scope
Nature of proposed change
-------------------------
readdir-ahead support is enabled through a new client-side translator.
Implications on manageability
-----------------------------
None beyond the ability to enable and disable the translator.
Implications on presentation layer
----------------------------------
N/A
Implications on persistence layer
---------------------------------
N/A
Implications on 'GlusterFS' backend
-----------------------------------
N/A
Modification to GlusterFS metadata
----------------------------------
N/A
Implications on 'glusterd'
--------------------------
N/A
How To Test
-----------
Performance testing. Verify that sequential reads of large directories
complete faster (i.e., ls, xfs\_io -c readdir).
User Experience
---------------
Improved performance on sequential read workloads. The translator should
otherwise be invisible and not detract performance or disrupt behavior
in any way.
Dependencies
------------
N/A
Documentation
-------------
Set the associated config option to enable or disable directory
read-ahead on a volume:

    gluster volume set <vol> readdir-ahead [enable|disable]

readdir-ahead is disabled by default.
Status
------
Development complete for the initial version. Minor changes and bug
fixes likely.
Future versions might expand to provide generic caching and more
flexible behavior.
Comments and Discussion
-----------------------


@@ -184,6 +184,14 @@ pages:
- ['Feature Planning/GlusterFS 3.5/index.md','Feature Planning 3.5','index']
- ['Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md','Feature Planning 3.5','AFR CLI enhancements']
- ['Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md','Feature Planning 3.5','Exposing Volume Capabilities']
- ['Feature Planning/GlusterFS 3.5/File Snapshot.md','Feature Planning 3.5','File Snapshot']
- ['Feature Planning/GlusterFS 3.5/gfid access.md','Feature Planning 3.5','gfid access']
- ['Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md','Feature Planning 3.5','On wire Compression + Decompression']
- ['Feature Planning/GlusterFS 3.5/Quota Scalability.md','Feature Planning 3.5','Quota Scalability']
- ['Feature Planning/GlusterFS 3.5/readdir ahead.md','Feature Planning 3.5','readdir ahead']
- ['Feature Planning/GlusterFS 3.5/Zerofill.md','Feature Planning 3.5','Zerofill']
- ['Feature Planning/GlusterFS 3.5/Brick Failure Detection.md','Feature Planning 3.5','Brick Failure Detection']
- ['Feature Planning/GlusterFS 3.5/Disk Encryption.md','Feature Planning 3.5','Disk Encryption']
#GlusterFS Tools
- ['GlusterFS Tools/README.md', 'GlusterFS Tools', 'GlusterFS Tools List']