From 0a3f07bd4a9943a18ef1f4defd5225b6f5b975c5 Mon Sep 17 00:00:00 2001 From: black-dragon74 Date: Fri, 3 Jun 2022 16:17:31 +0530 Subject: [PATCH] [troubleshooting] Fix AFR and Split brain pages and cleanup the syntax Signed-off-by: black-dragon74 --- docs/Troubleshooting/README.md | 8 +- docs/Troubleshooting/gfid-to-path.md | 12 +- docs/Troubleshooting/gluster-crash.md | 14 +- docs/Troubleshooting/resolving-splitbrain.md | 381 ++++++++++-------- docs/Troubleshooting/statedump.md | 90 ++--- docs/Troubleshooting/troubleshooting-afr.md | 142 ++++--- .../troubleshooting-filelocks.md | 18 +- .../Troubleshooting/troubleshooting-georep.md | 92 +++-- .../troubleshooting-glusterd.md | 72 ++-- docs/Troubleshooting/troubleshooting-gnfs.md | 53 ++- .../Troubleshooting/troubleshooting-memory.md | 4 +- 11 files changed, 471 insertions(+), 415 deletions(-) diff --git a/docs/Troubleshooting/README.md b/docs/Troubleshooting/README.md index 0741662..4ec0122 100644 --- a/docs/Troubleshooting/README.md +++ b/docs/Troubleshooting/README.md @@ -1,9 +1,8 @@ -Troubleshooting Guide ---------------------- +## Troubleshooting Guide + This guide describes some commonly seen issues and steps to recover from them. If that doesn’t help, reach out to the [Gluster community](https://www.gluster.org/community/), in which case the guide also describes what information needs to be provided in order to debug the issue. At minimum, we need the version of gluster running and the output of `gluster volume info`. - ### Where Do I Start? Is the issue already listed in the component specific troubleshooting sections? @@ -15,7 +14,6 @@ Is the issue already listed in the component specific troubleshooting sections? - [Gluster NFS Issues](./troubleshooting-gnfs.md) - [File Locks](./troubleshooting-filelocks.md) - If that didn't help, here is how to debug further. Identifying the problem and getting the necessary information to diagnose it is the first step in troubleshooting your Gluster setup. As Gluster operations involve interactions between multiple processes, this can involve multiple steps. @@ -25,5 +23,3 @@ Identifying the problem and getting the necessary information to diagnose it is - An operation failed - [High Memory Usage](./troubleshooting-memory.md) - [A Gluster process crashed](./gluster-crash.md) - - diff --git a/docs/Troubleshooting/gfid-to-path.md b/docs/Troubleshooting/gfid-to-path.md index 275fb71..3a25a1b 100644 --- a/docs/Troubleshooting/gfid-to-path.md +++ b/docs/Troubleshooting/gfid-to-path.md @@ -8,24 +8,26 @@ normal filesystem. The GFID of a file is stored in its xattr named #### Special mount using gfid-access translator: ```console -# mount -t glusterfs -o aux-gfid-mount vm1:test /mnt/testvol +mount -t glusterfs -o aux-gfid-mount vm1:test /mnt/testvol ``` Assuming, you have `GFID` of a file from changelog (or somewhere else). 
For trying this out, you can get `GFID` of a file from mountpoint: ```console -# getfattr -n glusterfs.gfid.string /mnt/testvol/dir/file +getfattr -n glusterfs.gfid.string /mnt/testvol/dir/file ``` --- + ### Get file path from GFID (Method 1): + **(Lists hardlinks delimited by `:`, returns path as seen from mountpoint)** #### Turn on build-pgfid option ```console -# gluster volume set test build-pgfid on +gluster volume set test build-pgfid on ``` Read virtual xattr `glusterfs.ancestry.path` which contains the file path @@ -36,7 +38,7 @@ getfattr -n glusterfs.ancestry.path -e text /mnt/testvol/.gfid/ **Example:** -```console +```{ .console .no-copy } [root@vm1 glusterfs]# ls -il /mnt/testvol/dir/ total 1 10610563327990022372 -rw-r--r--. 2 root root 3 Jul 17 18:05 file @@ -54,6 +56,7 @@ glusterfs.ancestry.path="/dir/file:/dir/file3" ``` ### Get file path from GFID (Method 2): + **(Does not list all hardlinks, returns backend brick path)** ```console @@ -70,4 +73,5 @@ trusted.glusterfs.pathinfo="( info` This lists all the files that require healing (and will be processed by the self-heal daemon). It prints either their path or their GFID. ### Interpreting the output + All the files listed in the output of this command need to be healed. The files listed may also be accompanied by the following tags: a) 'Is in split-brain' -A file in data or metadata split-brain will -be listed with " - Is in split-brain" appended after its path/GFID. E.g. +A file in data or metadata split-brain will +be listed with " - Is in split-brain" appended after its path/GFID. E.g. "/file4" in the output provided below. However, for a file in GFID split-brain, - the parent directory of the file is shown to be in split-brain and the file -itself is shown to be needing healing, e.g. "/dir" in the output provided below +the parent directory of the file is shown to be in split-brain and the file +itself is shown to be needing healing, e.g. "/dir" in the output provided below is in split-brain because of GFID split-brain of file "/dir/a". Files in split-brain cannot be healed without resolving the split-brain. @@ -36,11 +37,13 @@ b) 'Is possibly undergoing heal' When the heal info command is run, it (or to be more specific, the 'glfsheal' binary that is executed when you run the command) takes locks on each file to find if it needs healing. However, if the self-heal daemon had already started healing the file, it would have taken locks which glfsheal wouldn't be able to acquire. In such a case, it could print this message. Another possible case could be multiple glfsheal processes running simultaneously (e.g. multiple users ran a heal info command at the same time) and competing for same lock. The following is an example of heal info command's output. + ### Example + Consider a replica volume "test" with two bricks b1 and b2; self-heal daemon off, mounted at /mnt. 
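
For reference, a toy setup matching this description could be created roughly as follows (the host name and brick paths are illustrative and only mirror the examples below; `force` is needed here solely because both bricks sit on the same host):

```console
gluster volume create test replica 2 test-host:/test/b1 test-host:/test/b2 force
gluster volume start test
# the example assumes the self-heal daemon is switched off
gluster volume set test cluster.self-heal-daemon off
mount -t glusterfs test-host:test /mnt
```
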
-```console +```{ .console .no-copy } # gluster volume heal test info Brick \ - Is in split-brain @@ -63,24 +66,27 @@ Number of entries: 6 ``` ### Analysis of the output -It can be seen that -A) from brick b1, four entries need healing: -      1) file with gfid:6dc78b20-7eb6-49a3-8edb-087b90142246 needs healing -      2) "aaca219f-0e25-4576-8689-3bfd93ca70c2", -"39f301ae-4038-48c2-a889-7dac143e82dd" and "c3c94de2-232d-4083-b534-5da17fc476ac" - are in split-brain -B) from brick b2 six entries need healing- -      1) "a", "file2" and "file3" need healing -      2) "file1", "file4" & "/dir" are in split-brain +It can be seen that + +A) from brick b1, four entries need healing: + +- file with gfid:6dc78b20-7eb6-49a3-8edb-087b90142246 needs healing +- "aaca219f-0e25-4576-8689-3bfd93ca70c2", "39f301ae-4038-48c2-a889-7dac143e82dd" and "c3c94de2-232d-4083-b534-5da17fc476ac" are in split-brain + +B) from brick b2 six entries need healing- + +- "a", "file2" and "file3" need healing +- "file1", "file4" & "/dir" are in split-brain # 2. Volume heal info split-brain + Usage: `gluster volume heal info split-brain` This command only shows the list of files that are in split-brain. The output is therefore a subset of `gluster volume heal info` ### Example -```console +```{ .console .no-copy } # gluster volume heal test info split-brain Brick @@ -95,19 +101,22 @@ Brick Number of entries in split-brain: 3 ``` -Note that similar to the heal info command, for GFID split-brains (same filename but different GFID) +Note that similar to the heal info command, for GFID split-brains (same filename but different GFID) their parent directories are listed to be in split-brain. # 3. Resolution of split-brain using gluster CLI + Once the files in split-brain are identified, their resolution can be done from the gluster command line using various policies. Type-mismatch cannot be healed using this methods. Split-brain resolution commands let the user resolve data, metadata, and GFID split-brains. ## 3.1 Resolution of data/metadata split-brain using gluster CLI + Data and metadata split-brains can be resolved using the following policies: ## i) Select the bigger-file as source + This command is useful for per file healing where it is known/decided that the -file with bigger size is to be considered as source. +file with bigger size is to be considered as source. `gluster volume heal split-brain bigger-file ` Here, `` can be either the full file name as seen from the root of the volume (or) the GFID-string representation of the file, which sometimes gets displayed @@ -115,13 +124,14 @@ in the heal info command's output. Once this command is executed, the replica co size is found and healing is completed with that brick as a source. ### Example : + Consider the earlier output of the heal info split-brain command. 
-Before healing the file, notice file size and md5 checksums : +Before healing the file, notice file size and md5 checksums : On brick b1: -```console +```{ .console .no-copy } [brick1]# stat b1/dir/file1 File: ‘b1/dir/file1’ Size: 17 Blocks: 16 IO Block: 4096 regular file @@ -138,7 +148,7 @@ Change: 2015-03-06 13:55:37.206880347 +0530 On brick b2: -```console +```{ .console .no-copy } [brick2]# stat b2/dir/file1 File: ‘b2/dir/file1’ Size: 13 Blocks: 16 IO Block: 4096 regular file @@ -153,7 +163,7 @@ Change: 2015-03-06 13:52:22.910758923 +0530 cb11635a45d45668a403145059c2a0d5 b2/dir/file1 ``` -**Healing file1 using the above command** :- +**Healing file1 using the above command** :- `gluster volume heal test split-brain bigger-file /dir/file1` Healed /dir/file1. @@ -161,7 +171,7 @@ After healing is complete, the md5sum and file size on both bricks should be the On brick b1: -```console +```{ .console .no-copy } [brick1]# stat b1/dir/file1 File: ‘b1/dir/file1’ Size: 17 Blocks: 16 IO Block: 4096 regular file @@ -178,7 +188,7 @@ Change: 2015-03-06 14:17:12.880343950 +0530 On brick b2: -```console +```{ .console .no-copy } [brick2]# stat b2/dir/file1 File: ‘b2/dir/file1’ Size: 17 Blocks: 16 IO Block: 4096 regular file @@ -195,7 +205,7 @@ Change: 2015-03-06 14:17:12.881343955 +0530 ## ii) Select the file with the latest mtime as source -```console +```{ .console .no-copy } gluster volume heal split-brain latest-mtime ``` @@ -203,20 +213,21 @@ As is perhaps self-explanatory, this command uses the brick having the latest mo ## iii) Select one of the bricks in the replica as the source for a particular file -```console +```{ .console .no-copy } gluster volume heal split-brain source-brick ``` Here, `` is selected as source brick and `` present in the source brick is taken as the source for healing. ### Example : + Notice the md5 checksums and file size before and after healing. Before heal : On brick b1: -```console +```{ .console .no-copy } [brick1]# stat b1/file4 File: ‘b1/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file @@ -233,7 +244,7 @@ b6273b589df2dfdbd8fe35b1011e3183 b1/file4 On brick b2: -```console +```{ .console .no-copy } [brick2]# stat b2/file4 File: ‘b2/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file @@ -251,7 +262,7 @@ Change: 2015-03-06 13:52:35.769833142 +0530 **Healing the file with gfid c3c94de2-232d-4083-b534-5da17fc476ac using the above command** : ```console -# gluster volume heal test split-brain source-brick test-host:/test/b1 gfid:c3c94de2-232d-4083-b534-5da17fc476ac +gluster volume heal test split-brain source-brick test-host:/test/b1 gfid:c3c94de2-232d-4083-b534-5da17fc476ac ``` Healed gfid:c3c94de2-232d-4083-b534-5da17fc476ac. @@ -260,7 +271,7 @@ After healing : On brick b1: -```console +```{ .console .no-copy } # stat b1/file4 File: ‘b1/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file @@ -276,7 +287,7 @@ b6273b589df2dfdbd8fe35b1011e3183 b1/file4 On brick b2: -```console +```{ .console .no-copy } # stat b2/file4 File: ‘b2/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file @@ -292,7 +303,7 @@ b6273b589df2dfdbd8fe35b1011e3183 b2/file4 ## iv) Select one brick of the replica as the source for all files -```console +```{ .console .no-copy } gluster volume heal split-brain source-brick ``` @@ -301,9 +312,10 @@ replica pair is source. As the result of the above command all split-brained files in `` are selected as source and healed to the sink. ### Example: + Consider a volume having three entries "a, b and c" in split-brain. 
-```console +```{ .console .no-copy } # gluster volume heal test split-brain source-brick test-host:/test/b1 Healed gfid:944b4764-c253-4f02-b35f-0d0ae2f86c0f. Healed gfid:3256d814-961c-4e6e-8df2-3a3143269ced. @@ -312,19 +324,24 @@ Number of healed entries: 3 ``` # 3.2 Resolution of GFID split-brain using gluster CLI + GFID split-brains can also be resolved by the gluster command line using the same policies that are used to resolve data and metadata split-brains. ## i) Selecting the bigger-file as source + This method is useful for per file healing and where you can decided that the file with bigger size is to be considered as source. Run the following command to obtain the path of the file that is in split-brain: -```console + +```{ .console .no-copy } # gluster volume heal VOLNAME info split-brain ``` From the output, identify the files for which file operations performed from the client failed with input/output error. + ### Example : -```console + +```{ .console .no-copy } # gluster volume heal testvol info Brick 10.70.47.45:/bricks/brick2/b0 /f5 @@ -340,19 +357,22 @@ Brick 10.70.47.144:/bricks/brick2/b1 Status: Connected Number of entries: 2 ``` + > **Note** > Entries which are in GFID split-brain may not be shown as in split-brain by the heal info or heal info split-brain commands always. For entry split-brains, it is the parent directory which is shown as being in split-brain. So one might need to run info split-brain to get the dir names and then heal info to get the list of files under that dir which might be in split-brain (it could just be needing heal without split-brain). In the above command, testvol is the volume name, b0 and b1 are the bricks. Execute the below getfattr command on the brick to fetch information if a file is in GFID split-brain or not. -```console +```{ .console .no-copy } # getfattr -d -e hex -m. ``` ### Example : + On brick /b0 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b0/f5 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b0/f5 @@ -364,7 +384,8 @@ trusted.gfid2path.9cde09916eabc845=0x30303030303030302d303030302d303030302d30303 ``` On brick /b1 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b1/f5 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b1/f5 @@ -379,7 +400,8 @@ You can notice the difference in GFID for the file f5 in both the bricks. You can find the differences in the file size by executing stat command on the file from the bricks. On brick /b0 -```console + +```{ .console .no-copy } # stat /bricks/brick2/b0/f5 File: ‘/bricks/brick2/b0/f5’ Size: 15 Blocks: 8 IO Block: 4096 regular file @@ -393,7 +415,8 @@ Birth: - ``` On brick /b1 -```console + +```{ .console .no-copy } # stat /bricks/brick2/b1/f5 File: ‘/bricks/brick2/b1/f5’ Size: 2 Blocks: 8 IO Block: 4096 regular file @@ -408,12 +431,13 @@ Birth: - Execute the following command along with the full filename as seen from the root of the volume which is displayed in the heal info command's output: -```console +```{ .console .no-copy } # gluster volume heal VOLNAME split-brain bigger-file FILE ``` ### Example : -```console + +```{ .console .no-copy } # gluster volume heal testvol split-brain bigger-file /f5 GFID split-brain resolved for file /f5 ``` @@ -421,7 +445,8 @@ GFID split-brain resolved for file /f5 After the healing is complete, the GFID of the file on both the bricks must be the same as that of the file which had the bigger size. 
The following is a sample output of the getfattr command after completion of healing the file. On brick /b0 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b0/f5 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b0/f5 @@ -431,7 +456,8 @@ trusted.gfid2path.9cde09916eabc845=0x30303030303030302d303030302d303030302d30303 ``` On brick /b1 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b1/f5 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b1/f5 @@ -441,14 +467,16 @@ trusted.gfid2path.9cde09916eabc845=0x30303030303030302d303030302d303030302d30303 ``` ## ii) Selecting the file with latest mtime as source + This method is useful for per file healing and if you want the file with latest mtime has to be considered as source. ### Example : + Lets take another file which is in GFID split-brain and try to heal that using the latest-mtime option. On brick /b0 -```console +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b0/f4 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b0/f4 @@ -460,7 +488,8 @@ trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d30303 ``` On brick /b1 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b1/f4 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b1/f4 @@ -475,7 +504,8 @@ You can notice the difference in GFID for the file f4 in both the bricks. You can find the difference in the modification time by executing stat command on the file from the bricks. On brick /b0 -```console + +```{ .console .no-copy } # stat /bricks/brick2/b0/f4 File: ‘/bricks/brick2/b0/f4’ Size: 14 Blocks: 8 IO Block: 4096 regular file @@ -489,7 +519,8 @@ Birth: - ``` On brick /b1 -```console + +```{ .console .no-copy } # stat /bricks/brick2/b1/f4 File: ‘/bricks/brick2/b1/f4’ Size: 2 Blocks: 8 IO Block: 4096 regular file @@ -503,12 +534,14 @@ Birth: - ``` Execute the following command: -```console + +```{ .console .no-copy } # gluster volume heal VOLNAME split-brain latest-mtime FILE ``` ### Example : -```console + +```{ .console .no-copy } # gluster volume heal testvol split-brain latest-mtime /f4 GFID split-brain resolved for file /f4 ``` @@ -516,7 +549,9 @@ GFID split-brain resolved for file /f4 After the healing is complete, the GFID of the files on both bricks must be same. The following is a sample output of the getfattr command after completion of healing the file. You can notice that the file has been healed using the brick having the latest mtime as the source. On brick /b0 -```console# getfattr -d -m . -e hex /bricks/brick2/b0/f4 + +```{ .console .no-copy } +# getfattr -d -m . -e hex /bricks/brick2/b0/f4 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b0/f4 security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000 @@ -525,7 +560,8 @@ trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d30303 ``` On brick /b1 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b1/f4 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b1/f4 @@ -535,13 +571,16 @@ trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d30303 ``` ## iii) Select one of the bricks in the replica as source for a particular file + This method is useful for per file healing and if you know which copy of the file is good. 
### Example : + Lets take another file which is in GFID split-brain and try to heal that using the source-brick option. On brick /b0 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b0/f3 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b0/f3 @@ -553,7 +592,8 @@ trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d30303 ``` On brick /b1 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b1/f3 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b0/f3 @@ -567,14 +607,16 @@ trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d30303 You can notice the difference in GFID for the file f3 in both the bricks. Execute the following command: -```console + +```{ .console .no-copy } # gluster volume heal VOLNAME split-brain source-brick HOSTNAME:export-directory-absolute-path FILE ``` In this command, FILE present in HOSTNAME : export-directory-absolute-path is taken as source for healing. ### Example : -```console + +```{ .console .no-copy } # gluster volume heal testvol split-brain source-brick 10.70.47.144:/bricks/brick2/b1 /f3 GFID split-brain resolved for file /f3 ``` @@ -582,7 +624,8 @@ GFID split-brain resolved for file /f3 After the healing is complete, the GFID of the file on both the bricks should be same as that of the brick which was chosen as source for healing. The following is a sample output of the getfattr command after the file is healed. On brick /b0 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b0/f3 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b0/f3 @@ -592,7 +635,8 @@ trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d30303 ``` On brick /b1 -```console + +```{ .console .no-copy } # getfattr -d -m . -e hex /bricks/brick2/b1/f3 getfattr: Removing leading '/' from absolute path names file: bricks/brick2/b1/f3 @@ -602,19 +646,22 @@ trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d30303 ``` > **Note** ->- One cannot use the GFID of the file as an argument with any of the CLI options to resolve GFID split-brain. It should be the absolute path as seen from the mount point to the file considered as source. > ->- With source-brick option there is no way to resolve all the GFID split-brain in one shot by not specifying any file path in the CLI as done while resolving data or metadata split-brain. For each file in GFID split-brain, run the CLI with the policy you want to use. +> - One cannot use the GFID of the file as an argument with any of the CLI options to resolve GFID split-brain. It should be the absolute path as seen from the mount point to the file considered as source. > ->- Resolving directory GFID split-brain using CLI with the "source-brick" option in a "distributed-replicated" volume needs to be done on all the sub-volumes explicitly, which are in this state. Since directories get created on all the sub-volumes, using one particular brick as source for directory GFID split-brain heals the directory for that particular sub-volume. Source brick should be chosen in such a way that after heal all the bricks of all the sub-volumes have the same GFID. +> - With source-brick option there is no way to resolve all the GFID split-brain in one shot by not specifying any file path in the CLI as done while resolving data or metadata split-brain. 
For each file in GFID split-brain, run the CLI with the policy you want to use. +> +> - Resolving directory GFID split-brain using CLI with the "source-brick" option in a "distributed-replicated" volume needs to be done on all the sub-volumes explicitly, which are in this state. Since directories get created on all the sub-volumes, using one particular brick as source for directory GFID split-brain heals the directory for that particular sub-volume. Source brick should be chosen in such a way that after heal all the bricks of all the sub-volumes have the same GFID. ## Note: + As mentioned earlier, type-mismatch can not be resolved using CLI. Type-mismatch means different st_mode values (for example, the entry is a file in one brick while it is a directory on the other). Trying to heal such entry would fail. ### Example + The entry named "entry1" is of different types on the bricks of the replica. Lets try to heal that using the split-brain CLI. -```console +```{ .console .no-copy } # gluster volume heal test split-brain source-brick test-host:/test/b1 /entry1 Healing /entry1 failed:Operation not permitted. Volume heal failed. @@ -623,22 +670,23 @@ Volume heal failed. However, they can be fixed by deleting the file from all but one bricks. See [Fixing Directory entry split-brain](#dir-split-brain) # An overview of working of heal info commands -When these commands are invoked, a "glfsheal" process is spawned which reads -the entries from the various sub-directories under `//.glusterfs/indices/` of all -the bricks that are up (that it can connect to) one after another. These -entries are GFIDs of files that might need healing. Once GFID entries from a -brick are obtained, based on the lookup response of this file on each -participating brick of replica-pair & trusted.afr.* extended attributes it is -found out if the file needs healing, is in split-brain etc based on the + +When these commands are invoked, a "glfsheal" process is spawned which reads +the entries from the various sub-directories under `//.glusterfs/indices/` of all +the bricks that are up (that it can connect to) one after another. These +entries are GFIDs of files that might need healing. Once GFID entries from a +brick are obtained, based on the lookup response of this file on each +participating brick of replica-pair & trusted.afr.\* extended attributes it is +found out if the file needs healing, is in split-brain etc based on the requirement of each command and displayed to the user. - # 4. Resolution of split-brain from the mount point + A set of getfattr and setfattr commands have been provided to detect the data and metadata split-brain status of a file and resolve split-brain, if any, from mount point. Consider a volume "test", having bricks b0, b1, b2 and b3. -```console +```{ .console .no-copy } # gluster volume info test Volume Name: test @@ -656,7 +704,7 @@ Brick4: test-host:/test/b3 Directory structure of the bricks is as follows: -```console +```{ .console .no-copy } # tree -R /test/b? /test/b0 ├── dir @@ -683,7 +731,7 @@ Directory structure of the bricks is as follows: Some files in the volume are in split-brain. 
-```console +```{ .console .no-copy } # gluster v heal test info split-brain Brick test-host:/test/b0/ /file100 @@ -708,7 +756,7 @@ Number of entries in split-brain: 2 ### To know data/metadata split-brain status of a file: -```console +```{ .console .no-copy } getfattr -n replica.split-brain-status ``` @@ -716,50 +764,52 @@ The above command executed from mount provides information if a file is in data/ This command is not applicable to gfid/directory split-brain. ### Example: -1) "file100" is in metadata split-brain. Executing the above mentioned command for file100 gives : -```console +1. "file100" is in metadata split-brain. Executing the above mentioned command for file100 gives : + +```{ .console .no-copy } # getfattr -n replica.split-brain-status file100 file: file100 replica.split-brain-status="data-split-brain:no metadata-split-brain:yes Choices:test-client-0,test-client-1" ``` -2) "file1" is in data split-brain. +2. "file1" is in data split-brain. -```console +```{ .console .no-copy } # getfattr -n replica.split-brain-status file1 file: file1 replica.split-brain-status="data-split-brain:yes metadata-split-brain:no Choices:test-client-2,test-client-3" ``` -3) "file99" is in both data and metadata split-brain. +3. "file99" is in both data and metadata split-brain. -```console +```{ .console .no-copy } # getfattr -n replica.split-brain-status file99 file: file99 replica.split-brain-status="data-split-brain:yes metadata-split-brain:yes Choices:test-client-2,test-client-3" ``` -4) "dir" is in directory split-brain but as mentioned earlier, the above command is not applicable to such split-brain. So it says that the file is not under data or metadata split-brain. +4. "dir" is in directory split-brain but as mentioned earlier, the above command is not applicable to such split-brain. So it says that the file is not under data or metadata split-brain. -```console +```{ .console .no-copy } # getfattr -n replica.split-brain-status dir file: dir replica.split-brain-status="The file is not under data or metadata split-brain" ``` -5) "file2" is not in any kind of split-brain. +5. "file2" is not in any kind of split-brain. -```console +```{ .console .no-copy } # getfattr -n replica.split-brain-status file2 file: file2 replica.split-brain-status="The file is not under data or metadata split-brain" ``` ### To analyze the files in data and metadata split-brain + Trying to do operations (say cat, getfattr etc) from the mount on files in split-brain, gives an input/output error. To enable the users analyze such files, a setfattr command is provided. -```console +```{ .console .no-copy } # setfattr -n replica.split-brain-choice -v "choiceX" ``` @@ -767,9 +817,9 @@ Using this command, a particular brick can be chosen to access the file in split ### Example: -1) "file1" is in data-split-brain. Trying to read from the file gives input/output error. +1. "file1" is in data-split-brain. Trying to read from the file gives input/output error. -```console +```{ .console .no-copy } # cat file1 cat: file1: Input/output error ``` @@ -778,13 +828,13 @@ Split-brain choices provided for file1 were test-client-2 and test-client-3. Setting test-client-2 as split-brain choice for file1 serves reads from b2 for the file. -```console +```{ .console .no-copy } # setfattr -n replica.split-brain-choice -v test-client-2 file1 ``` Now, read operations on the file can be done. 
-```console +```{ .console .no-copy } # cat file1 xyz ``` @@ -793,18 +843,18 @@ Similarly, to inspect the file from other choice, replica.split-brain-choice is Trying to inspect the file from a wrong choice errors out. -To undo the split-brain-choice that has been set, the above mentioned setfattr command can be used +To undo the split-brain-choice that has been set, the above mentioned setfattr command can be used with "none" as the value for extended attribute. ### Example: -```console +```{ .console .no-copy } # setfattr -n replica.split-brain-choice -v none file1 ``` Now performing cat operation on the file will again result in input/output error, as before. -```console +```{ .console .no-copy } # cat file cat: file1: Input/output error ``` @@ -812,13 +862,13 @@ cat: file1: Input/output error Once the choice for resolving split-brain is made, source brick is supposed to be set for the healing to be done. This is done using the following command: -```console +```{ .console .no-copy } # setfattr -n replica.split-brain-heal-finalize -v ``` ## Example -```console +```{ .console .no-copy } # setfattr -n replica.split-brain-heal-finalize -v test-client-2 file1 ``` @@ -826,18 +876,19 @@ The above process can be used to resolve data and/or metadata split-brain on all **NOTE**: -1) If "fopen-keep-cache" fuse mount option is disabled then inode needs to be invalidated each time before selecting a new replica.split-brain-choice to inspect a file. This can be done by using: +1. If "fopen-keep-cache" fuse mount option is disabled then inode needs to be invalidated each time before selecting a new replica.split-brain-choice to inspect a file. This can be done by using: -```console +```{ .console .no-copy } # sefattr -n inode-invalidate -v 0 ``` -2) The above mentioned process for split-brain resolution from mount will not work on nfs mounts as it doesn't provide xattrs support. +2. The above mentioned process for split-brain resolution from mount will not work on nfs mounts as it doesn't provide xattrs support. # 5. Automagic unsplit-brain by [ctime|mtime|size|majority] -The CLI and fuse mount based resolution methods require intervention in the sense that the admin/ user needs to run the commands manually. There is a `cluster.favorite-child-policy` volume option which when set to one of the various policies available, automatically resolve split-brains without user intervention. The default value is 'none', i.e. it is disabled. -```console +The CLI and fuse mount based resolution methods require intervention in the sense that the admin/ user needs to run the commands manually. There is a `cluster.favorite-child-policy` volume option which when set to one of the various policies available, automatically resolve split-brains without user intervention. The default value is 'none', i.e. it is disabled. + +```{ .console .no-copy } # gluster volume set help | grep -A3 cluster.favorite-child-policy Option: cluster.favorite-child-policy Default Value: none @@ -846,40 +897,41 @@ Description: This option can be used to automatically resolve split-brains using `cluster.favorite-child-policy` applies to all files of the volume. It is assumed that if this option is enabled with a particular policy, you don't care to examine the split-brain files on a per file basis but just want the split-brain to be resolved as and when it occurs based on the set policy. - - # Manual Split-Brain Resolution: -Quick Start: -============ -1. 
Get the path of the file that is in split-brain: -> It can be obtained either by -> a) The command `gluster volume heal info split-brain`. -> b) Identify the files for which file operations performed - from the client keep failing with Input/Output error. +# Quick Start: -2. Close the applications that opened this file from the mount point. -In case of VMs, they need to be powered-off. +1. Get the path of the file that is in split-brain: -3. Decide on the correct copy: -> This is done by observing the afr changelog extended attributes of the file on -the bricks using the getfattr command; then identifying the type of split-brain -(data split-brain, metadata split-brain, entry split-brain or split-brain due to -gfid-mismatch); and finally determining which of the bricks contains the 'good copy' -of the file. -> `getfattr -d -m . -e hex `. -It is also possible that one brick might contain the correct data while the -other might contain the correct metadata. + > It can be obtained either by + > a) The command `gluster volume heal info split-brain`. + > b) Identify the files for which file operations performed from the client keep failing with Input/Output error. -4. Reset the relevant extended attribute on the brick(s) that contains the -'bad copy' of the file data/metadata using the setfattr command. -> `setfattr -n -v ` +1. Close the applications that opened this file from the mount point. + In case of VMs, they need to be powered-off. -5. Trigger self-heal on the file by performing lookup from the client: -> `ls -l ` +1. Decide on the correct copy: + + > This is done by observing the afr changelog extended attributes of the file on + > the bricks using the getfattr command; then identifying the type of split-brain + > (data split-brain, metadata split-brain, entry split-brain or split-brain due to + > gfid-mismatch); and finally determining which of the bricks contains the 'good copy' + > of the file. + > `getfattr -d -m . -e hex `. + > It is also possible that one brick might contain the correct data while the + > other might contain the correct metadata. + +1. Reset the relevant extended attribute on the brick(s) that contains the + 'bad copy' of the file data/metadata using the setfattr command. + + > `setfattr -n -v ` + +1. Trigger self-heal on the file by performing lookup from the client: + + > `ls -l ` + +# Detailed Instructions for steps 3 through 5: -Detailed Instructions for steps 3 through 5: -=========================================== To understand how to resolve split-brain we need to know how to interpret the afr changelog extended attributes. @@ -887,7 +939,7 @@ Execute `getfattr -d -m . -e hex ` Example: -```console +```{ .console .no-copy } [root@store3 ~]# getfattr -d -e hex -m. brick-a/file.txt \#file: brick-a/file.txt security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000 @@ -900,7 +952,7 @@ The extended attributes with `trusted.afr.-client-` are used by afr to maintain changelog of the file.The values of the `trusted.afr.-client-` are calculated by the glusterfs client (fuse or nfs-server) processes. When the glusterfs client modifies a file -or directory, the client contacts each brick and updates the changelog extended +or directory, the client contacts each brick and updates the changelog extended attribute according to the response of the brick. 'subvolume-index' is nothing but (brick number - 1) in @@ -908,7 +960,7 @@ attribute according to the response of the brick. 
Example: -```console +```{ .console .no-copy } [root@pranithk-laptop ~]# gluster volume info vol Volume Name: vol Type: Distributed-Replicate @@ -929,7 +981,7 @@ Example: In the example above: -```console +```{ .console .no-copy } Brick | Replica set | Brick subvolume index ---------------------------------------------------------------------------- -/gfs/brick-a | 0 | 0 @@ -945,25 +997,25 @@ Brick | Replica set | Brick subvolume index Each file in a brick maintains the changelog of itself and that of the files present in all the other bricks in its replica set as seen by that brick. -In the example volume given above, all files in brick-a will have 2 entries, +In the example volume given above, all files in brick-a will have 2 entries, one for itself and the other for the file present in its replica pair, i.e.brick-b: trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for itself (brick-a) -trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a +trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for brick-b as seen by brick-a Likewise, all files in brick-b will have: trusted.afr.vol-client-0=0x000000000000000000000000 -->changelog for brick-a as seen by brick-b -trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b) +trusted.afr.vol-client-1=0x000000000000000000000000 -->changelog for itself (brick-b) -The same can be extended for other replica pairs. +The same can be extended for other replica pairs. Interpreting Changelog (roughly pending operation count) Value: Each extended attribute has a value which is 24 hexa decimal digits. First 8 digits represent changelog of data. Second 8 digits represent changelog -of metadata. Last 8 digits represent Changelog of directory entries. +of metadata. Last 8 digits represent Changelog of directory entries. Pictorially representing the same, we have: -```text +```{ .text .no-copy } 0x 000003d7 00000001 00000000 | | | | | \_ changelog of directory entries @@ -971,17 +1023,16 @@ Pictorially representing the same, we have: \ _ changelog of data ``` - For Directories metadata and entry changelogs are valid. For regular files data and metadata changelogs are valid. For special files like device files etc metadata changelog is valid. When a file split-brain happens it could be either data split-brain or meta-data split-brain or both. When a split-brain happens the changelog of the -file would be something like this: +file would be something like this: Example:(Lets consider both data, metadata split-brain on same file). -```console +```{ .console .no-copy } [root@pranithk-laptop vol]# getfattr -d -m . -e hex /gfs/brick-?/a getfattr: Removing leading '/' from absolute path names \#file: gfs/brick-a/a @@ -1007,7 +1058,7 @@ on itself but failed on /gfs/brick-b/a. The second 8 digits of trusted.afr.vol-client-0 are all zeros (0x........00000000........), and the second 8 digits of trusted.afr.vol-client-1 are not all zeros (0x........00000001........). -So the changelog on /gfs/brick-a/a implies that some metadata operations succeeded +So the changelog on /gfs/brick-a/a implies that some metadata operations succeeded on itself but failed on /gfs/brick-b/a. #### According to Changelog extended attributes on file /gfs/brick-b/a: @@ -1029,12 +1080,12 @@ file, it is in both data and metadata split-brain. 
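
To make this inspection less error-prone, the 24-digit value can be split into its three fields with a small shell helper. This is only a convenience sketch (it is not part of gluster); the sample value is the `trusted.afr.vol-client-1` attribute of /gfs/brick-a/a shown above:

```console
# split a trusted.afr changelog value into its data/metadata/entry fields
decode_afr() {
    local v=${1#0x}
    printf 'data     : 0x%s\n' "${v:0:8}"
    printf 'metadata : 0x%s\n' "${v:8:8}"
    printf 'entry    : 0x%s\n' "${v:16:8}"
}

decode_afr 0x000003d70000000100000000
# data     : 0x000003d7   -> pending data operations
# metadata : 0x00000001   -> pending metadata operations
# entry    : 0x00000000   -> no pending entry operations
```
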
#### Deciding on the correct copy: -The user may have to inspect stat,getfattr output of the files to decide which +The user may have to inspect stat,getfattr output of the files to decide which metadata to retain and contents of the file to decide which data to retain. Continuing with the example above, lets say we want to retain the data of /gfs/brick-a/a and metadata of /gfs/brick-b/a. -#### Resetting the relevant changelogs to resolve the split-brain: +#### Resetting the relevant changelogs to resolve the split-brain: For resolving data-split-brain: @@ -1068,27 +1119,31 @@ For trusted.afr.vol-client-1 Hence execute `setfattr -n trusted.afr.vol-client-1 -v 0x000003d70000000000000000 /gfs/brick-a/a` -Thus after the above operations are done, the changelogs look like this: -[root@pranithk-laptop vol]# getfattr -d -m . -e hex /gfs/brick-?/a -getfattr: Removing leading '/' from absolute path names -\#file: gfs/brick-a/a -trusted.afr.vol-client-0=0x000000000000000000000000 -trusted.afr.vol-client-1=0x000003d70000000000000000 -trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57 +Thus after the above operations are done, the changelogs look like this: -\#file: gfs/brick-b/a -trusted.afr.vol-client-0=0x000000000000000100000000 -trusted.afr.vol-client-1=0x000000000000000000000000 -trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57 +```{ .console .no-copy } +[root@pranithk-laptop vol]# getfattr -d -m . -e hex /gfs/brick-?/a +getfattr: Removing leading '/' from absolute path names +\#file: gfs/brick-a/a +trusted.afr.vol-client-0=0x000000000000000000000000 +trusted.afr.vol-client-1=0x000003d70000000000000000 +trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57 +\#file: gfs/brick-b/a +trusted.afr.vol-client-0=0x000000000000000100000000 +trusted.afr.vol-client-1=0x000000000000000000000000 +trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57 +``` + +## Triggering Self-heal: -Triggering Self-heal: ---------------------- Perform `ls -l ` to trigger healing. Fixing Directory entry split-brain: ----------------------------------- + +--- + Afr has the ability to conservatively merge different entries in the directories when there is a split-brain on directory. If on one brick directory 'd' has entries '1', '2' and has entries '3', '4' on @@ -1108,9 +1163,11 @@ needs to be removed.The gfid-link files are present in the .glusterfs folder in the top-level directory of the brick. If the gfid of the file is 0x307a5c9efddd4e7c96e94fd4bcdcbd1b (the trusted.gfid extended attribute got from the getfattr command earlier),the gfid-link file can be found at + > /gfs/brick-a/.glusterfs/30/7a/307a5c9efddd4e7c96e94fd4bcdcbd1b #### Word of caution: + Before deleting the gfid-link, we have to ensure that there are no hard links to the file present on that brick. If hard-links exist,they must be deleted as well. diff --git a/docs/Troubleshooting/statedump.md b/docs/Troubleshooting/statedump.md index 3c33810..b89345d 100644 --- a/docs/Troubleshooting/statedump.md +++ b/docs/Troubleshooting/statedump.md @@ -2,20 +2,18 @@ A statedump is, as the name suggests, a dump of the internal state of a glusterfs process. It captures information about in-memory structures such as frames, call stacks, active inodes, fds, mempools, iobufs, and locks as well as xlator specific data structures. This can be an invaluable tool for debugging memory leaks and hung processes. 
+- [Generate a Statedump](#generate-a-statedump) +- [Read a Statedump](#read-a-statedump) +- [Debug with a Statedump](#debug-with-statedumps) - - - [Generate a Statedump](#generate-a-statedump) - - [Read a Statedump](#read-a-statedump) - - [Debug with a Statedump](#debug-with-statedumps) - -************************ - +--- ## Generate a Statedump + Run the command ```console -# gluster --print-statedumpdir +gluster --print-statedumpdir ``` on a gluster server node to find out which directory the statedumps will be created in. This directory may need to be created if not already present. @@ -38,7 +36,6 @@ kill -USR1 There are specific commands to generate statedumps for all brick processes/nfs server/quotad which can be used instead of the above. Run the following commands on one of the server nodes: - For bricks: ```console @@ -59,16 +56,17 @@ gluster volume statedump quotad The statedumps will be created in `statedump-directory` on each node. The statedumps for brick processes will be created with the filename `hyphenated-brick-path..dump.timestamp` while for all other processes it will be `glusterdump..dump.timestamp`. -*** +--- ## Read a Statedump Statedumps are text files and can be opened in any text editor. The first and last lines of the file contain the start and end time (in UTC)respectively of when the statedump file was written. ### Mallinfo + The mallinfo return status is printed in the following format. Please read _man mallinfo_ for more information about what each field means. -``` +```{.text .no-copy } [mallinfo] mallinfo_arena=100020224 /* Non-mmapped space allocated (bytes) */ mallinfo_ordblks=69467 /* Number of free chunks */ @@ -83,19 +81,19 @@ mallinfo_keepcost=133712 /* Top-most, releasable space (bytes) */ ``` ### Memory accounting stats + Each xlator defines data structures specific to its requirements. The statedump captures information about the memory usage and allocations of these structures for each xlator in the call-stack and prints them in the following format: For the xlator with the name _glusterfs_ -``` +```{.text .no-copy } [global.glusterfs - Memory usage] #[global. - Memory usage] num_types=119 #The number of data types it is using ``` - followed by the memory usage for each data-type for that translator. The following example displays a sample for the gf_common_mt_gf_timer_t type -``` +```{.text .no-copy } [global.glusterfs - usage-type gf_common_mt_gf_timer_t memusage] #[global. - usage-type memusage] size=112 #Total size allocated for data-type when the statedump was taken i.e. num_allocs * sizeof (data-type) @@ -113,7 +111,7 @@ Mempools are an optimization intended to reduce the number of allocations of a d Memory pool allocations by each xlator are displayed in the following format: -``` +```{.text .no-copy } [mempool] #Section name -----=----- pool-name=fuse:fd_t #pool-name=: @@ -129,10 +127,9 @@ max-stdalloc=0 #Maximum number of allocations from heap that were in active This information is also useful while debugging high memory usage issues as large hot_count and cur-stdalloc values may point to an element not being freed after it has been used. - ### Iobufs -``` +```{.text .no-copy } [iobuf.global] iobuf_pool=0x1f0d970 #The memory pool for iobufs iobuf_pool.default_page_size=131072 #The default size of iobuf (if no iobuf size is specified the default size is allocated) @@ -148,7 +145,7 @@ There are 3 lists of arenas 2. Purge list: arenas that can be purged(no active iobufs, active_cnt == 0). 3. Filled list: arenas without free iobufs. 
-``` +```{.text .no-copy } [purge.1] #purge. purge.1.mem_base=0x7fc47b35f000 #The address of the arena structure purge.1.active_cnt=0 #The number of iobufs active in that arena @@ -168,7 +165,7 @@ arena.5.page_size=32768 If the active_cnt of any arena is non zero, then the statedump will also have the iobuf list. -``` +```{.text .no-copy } [arena.6.active_iobuf.1] #arena..active_iobuf. arena.6.active_iobuf.1.ref=1 #refcount of the iobuf arena.6.active_iobuf.1.ptr=0x7fdb921a9000 #address of the iobuf @@ -180,12 +177,11 @@ arena.6.active_iobuf.2.ptr=0x7fdb92189000 A lot of filled arenas at any given point in time could be a sign of iobuf leaks. - ### Call stack The fops received by gluster are handled using call stacks. A call stack contains information about the uid/gid/pid etc of the process that is executing the fop. Each call stack contains different call-frames for each xlator which handles that fop. -``` +```{.text .no-copy } [global.callpool.stack.3] #global.callpool.stack. stack=0x7fc47a44bbe0 #Stack address uid=0 #Uid of the process executing the fop @@ -199,9 +195,10 @@ cnt=9 #Number of frames in this stack. ``` ### Call-frame + Each frame will have information about which xlator the frame belongs to, which function it wound to/from and which it will be unwound to, and whether it has unwound. -``` +```{.text .no-copy } [global.callpool.stack.3.frame.2] #global.callpool.stack..frame. frame=0x7fc47a611dbc #Frame address ref_count=0 #Incremented at the time of wind and decremented at the time of unwind. @@ -215,12 +212,11 @@ unwind_to=afr_lookup_cbk #Parent xlator function to unwind to To debug hangs in the system, see which xlator has not yet unwound its fop by checking the value of the _complete_ tag in the statedump. (_complete=0_ indicates the xlator has not yet unwound). - ### FUSE Operation History Gluster Fuse maintains a history of the operations that it has performed. -``` +```{.text .no-copy } [xlator.mount.fuse.history] TIME=2014-07-09 16:44:57.523364 message=[0] fuse_release: RELEASE(): 4590:, fd: 0x1fef0d8, gfid: 3afb4968-5100-478d-91e9-76264e634c9f @@ -234,7 +230,7 @@ message=[0] fuse_getattr_resume: 4591, STAT, path: (/iozone.tmp), gfid: (3afb496 ### Xlator configuration -``` +```{.text .no-copy } [cluster/replicate.r2-replicate-0] #Xlator type, name information child_count=2 #Number of children for the xlator #Xlator specific configuration below @@ -255,7 +251,7 @@ wait_count=1 ### Graph/inode table -``` +```{.text .no-copy } [active graph - 1] conn.1.bound_xl./data/brick01a/homegfs.hashsize=14057 @@ -268,7 +264,7 @@ conn.1.bound_xl./data/brick01a/homegfs.purge_size=0 #Number of inodes present ### Inode -``` +```{.text .no-copy } [conn.1.bound_xl./data/brick01a/homegfs.active.324] #324th inode in active inode list gfid=e6d337cf-97eb-44b3-9492-379ba3f6ad42 #Gfid of the inode nlookup=13 #Number of times lookups happened from the client or from fuse kernel @@ -285,9 +281,10 @@ ia_type=2 ``` ### Inode context + Each xlator can store information specific to it in the inode context. This context can also be printed in the statedump. 
Here is the inode context of the locks xlator -``` +```{.text .no-copy } [xlator.features.locks.homegfs-locks.inode] path=/homegfs/users/dfrobins/gfstest/r4/SCRATCH/fort.5102 - path of the file mandatory=0 @@ -301,10 +298,11 @@ lock-dump.domain.domain=homegfs-replicate-0:metadata #Domain name where metadata lock-dump.domain.domain=homegfs-replicate-0 #Domain name where entry/data operations take locks to maintain replication consistency inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=11141120, len=131072, pid = 18446744073709551615, owner=080b1ada117f0000, client=0xb7fc30, connection-id=compute-30-029.com-3505-2014/06/29-14:46:12:477358-homegfs-client-0-0-1, granted at Sun Jun 29 11:10:36 2014 #Active lock information ``` - -*** + +--- ## Debug With Statedumps + ### Memory leaks Statedumps can be used to determine whether the high memory usage of a process is caused by a leak. To debug the issue, generate statedumps for that process at regular intervals, or before and after running the steps that cause the memory used to increase. Once you have multiple statedumps, compare the memory allocation stats to see if any of them are increasing steadily as those could indicate a potential memory leak. @@ -315,7 +313,7 @@ The following examples walk through using statedumps to debug two different memo [BZ 1120151](https://bugzilla.redhat.com/show_bug.cgi?id=1120151) reported high memory usage by the self heal daemon whenever one of the bricks was wiped in a replicate volume and a full self-heal was invoked to heal the contents. This issue was debugged using statedumps to determine which data-structure was leaking memory. -A statedump of the self heal daemon process was taken using +A statedump of the self heal daemon process was taken using ```console kill -USR1 `` @@ -323,7 +321,7 @@ kill -USR1 `` On examining the statedump: -``` +```{.text .no-copy } grep -w num_allocs glusterdump.5225.dump.1405493251 num_allocs=77078 num_allocs=87070 @@ -338,6 +336,7 @@ hot-count=4095 ``` On searching for num_allocs with high values in the statedump, a `grep` of the statedump revealed a large number of allocations for the following data-types under the replicate xlator: + 1. gf_common_mt_asprintf 2. gf_common_mt_char 3. gf_common_mt_mem_pool. @@ -345,16 +344,15 @@ On searching for num_allocs with high values in the statedump, a `grep` of the s On checking the afr-code for allocations with tag `gf_common_mt_char`, it was found that the `data-self-heal` code path does not free one such allocated data structure. `gf_common_mt_mem_pool` suggests that there is a leak in pool memory. The `replicate-0:dict_t`, `glusterfs:data_t` and `glusterfs:data_pair_t` pools are using a lot of memory, i.e. cold_count is `0` and there are too many allocations. Checking the source code of dict.c shows that `key` in `dict` is allocated with `gf_common_mt_char` i.e. `2.` tag and value is created using gf_asprintf which in-turn uses `gf_common_mt_asprintf` i.e. `1.`. Checking the code for leaks in self-heal code paths led to a line which over-writes a variable with new dictionary even when it was already holding a reference to another dictionary. After fixing these leaks, we ran the same test to verify that none of the `num_allocs` values increased in the statedump of the self-daemon after healing 10,000 files. Please check [http://review.gluster.org/8316](http://review.gluster.org/8316) for more info about the patch/code. 
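
A quick, if rough, way of making such comparisons is to extract the per-type allocation counts from two statedumps taken some time apart and diff them. A sketch, with illustrative dump file names:

```console
# pair each usage-type header with its num_allocs value, then compare
for f in glusterdump.5225.dump.1405493251 glusterdump.5225.dump.1405496851; do
    grep -E 'usage-type|^num_allocs' "$f" > "$f.allocs"
done
diff -u glusterdump.5225.dump.1405493251.allocs glusterdump.5225.dump.1405496851.allocs
# data types whose num_allocs keep growing across dumps are leak suspects
```
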
- #### Leaks in mempools: -The statedump output of mempools was used to test and verify the fixes for [BZ 1134221](https://bugzilla.redhat.com/show_bug.cgi?id=1134221). On code analysis, dict_t objects were found to be leaking (due to missing unref's) during name self-heal. + +The statedump output of mempools was used to test and verify the fixes for [BZ 1134221](https://bugzilla.redhat.com/show_bug.cgi?id=1134221). On code analysis, dict_t objects were found to be leaking (due to missing unref's) during name self-heal. Glusterfs was compiled with the -DDEBUG flags to have cold count set to 0 by default. The test involved creating 100 files on plain replicate volume, removing them from one of the backend bricks, and then triggering lookups on them from the mount point. A statedump of the mount process was taken before executing the test case and after it was completed. Statedump output of the fuse mount process before the test case was executed: -``` - +```{.text .no-copy } pool-name=glusterfs:dict_t hot-count=0 cold-count=0 @@ -364,12 +362,11 @@ max-alloc=0 pool-misses=33 cur-stdalloc=14 max-stdalloc=18 - ``` + Statedump output of the fuse mount process after the test case was executed: -``` - +```{.text .no-copy } pool-name=glusterfs:dict_t hot-count=0 cold-count=0 @@ -379,15 +376,15 @@ max-alloc=0 pool-misses=2841 cur-stdalloc=214 max-stdalloc=220 - ``` + Here, as cold count was 0 by default, cur-stdalloc indicates the number of dict_t objects that were allocated from the heap using mem_get(), and are yet to be freed using mem_put(). After running the test case (named selfheal of 100 files), there was a rise in the cur-stdalloc value (from 14 to 214) for dict_t. After the leaks were fixed, glusterfs was again compiled with -DDEBUG flags and the steps were repeated. Statedumps of the FUSE mount were taken before and after executing the test case to ascertain the validity of the fix. And the results were as follows: Statedump output of the fuse mount process before executing the test case: -``` +```{.text .no-copy } pool-name=glusterfs:dict_t hot-count=0 cold-count=0 @@ -397,11 +394,11 @@ max-alloc=0 pool-misses=33 cur-stdalloc=14 max-stdalloc=18 - ``` + Statedump output of the fuse mount process after executing the test case: -``` +```{.text .no-copy } pool-name=glusterfs:dict_t hot-count=0 cold-count=0 @@ -411,17 +408,18 @@ max-alloc=0 pool-misses=2837 cur-stdalloc=14 max-stdalloc=119 - ``` + The value of cur-stdalloc remained 14 after the test, indicating that the fix indeed does what it's supposed to do. ### Hangs caused by frame loss + [BZ 994959](https://bugzilla.redhat.com/show_bug.cgi?id=994959) reported that the Fuse mount hangs on a readdirp operation. Here are the steps used to locate the cause of the hang using statedump. Statedumps were taken for all gluster processes after reproducing the issue. The following stack was seen in the FUSE mount's statedump: -``` +```{.text .no-copy } [global.callpool.stack.1.frame.1] ref_count=1 translator=fuse @@ -463,8 +461,8 @@ parent=r2-quick-read wind_from=qr_readdirp wind_to=FIRST_CHILD (this)->fops->readdirp unwind_to=qr_readdirp_cbk - ``` + `unwind_to` shows that call was unwound to `afr_readdirp_cbk` from the r2-client-1 xlator. Inspecting that function revealed that afr is not unwinding the stack when fop failed. Check [http://review.gluster.org/5531](http://review.gluster.org/5531) for more info about patch/code changes. 
diff --git a/docs/Troubleshooting/troubleshooting-afr.md b/docs/Troubleshooting/troubleshooting-afr.md index 42bc2b4..8d85562 100644 --- a/docs/Troubleshooting/troubleshooting-afr.md +++ b/docs/Troubleshooting/troubleshooting-afr.md @@ -8,7 +8,7 @@ The first level of analysis always starts with looking at the log files. Which o Sometimes, you might need more verbose logging to figure out what’s going on: `gluster volume set $volname client-log-level $LEVEL` -where LEVEL can be any one of `DEBUG, WARNING, ERROR, INFO, CRITICAL, NONE, TRACE`. This should ideally make all the log files mentioned above to start logging at `$LEVEL`. The default is `INFO` but you can temporarily toggle it to `DEBUG` or `TRACE` if you want to see under-the-hood messages. Useful when the normal logs don’t give a clue as to what is happening. +where LEVEL can be any one of `DEBUG, WARNING, ERROR, INFO, CRITICAL, NONE, TRACE`. This should ideally make all the log files mentioned above to start logging at `$LEVEL`. The default is `INFO` but you can temporarily toggle it to `DEBUG` or `TRACE` if you want to see under-the-hood messages. Useful when the normal logs don’t give a clue as to what is happening. ## Heal related issues: @@ -20,17 +20,19 @@ Most issues I’ve seen on the mailing list and with customers can broadly fit i If the number of entries are large, then heal info will take longer than usual. While there are performance improvements to heal info being planned, a faster way to get an approx. count of the pending entries is to use the `gluster volume heal $VOLNAME statistics heal-count` command. -**Knowledge Hack:** Since we know that during the write transaction. the xattrop folder will capture the gfid-string of the file if it needs heal, we can also do an `ls /brick/.glusterfs/indices/xattrop|wc -l` on each brick to get the approx. no of entries that need heal. If this number reduces over time, it is a sign that the heal backlog is reducing. You will also see messages whenever a particular type of heal starts/ends for a given gfid, like so: +**Knowledge Hack:** Since we know that during the write transaction. the xattrop folder will capture the gfid-string of the file if it needs heal, we can also do an `ls /brick/.glusterfs/indices/xattrop|wc -l` on each brick to get the approx. no of entries that need heal. If this number reduces over time, it is a sign that the heal backlog is reducing. You will also see messages whenever a particular type of heal starts/ends for a given gfid, like so: -`[2019-05-07 12:05:14.460442] I [MSGID: 108026] [afr-self-heal-entry.c:883:afr_selfheal_entry_do] 0-testvol-replicate-0: performing entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb` +```{.text .no-copy } +[2019-05-07 12:05:14.460442] I [MSGID: 108026] [afr-self-heal-entry.c:883:afr_selfheal_entry_do] 0-testvol-replicate-0: performing entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb -`[2019-05-07 12:05:14.474710] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb. sources=[0] 2 sinks=1` +[2019-05-07 12:05:14.474710] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb. sources=[0] 2 sinks=1 -`[2019-05-07 12:05:14.493506] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed data selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. 
sources=[0] 2 sinks=1` +[2019-05-07 12:05:14.493506] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed data selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. sources=[0] 2 sinks=1 -`[2019-05-07 12:05:14.494577] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-testvol-replicate-0: performing metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5` +[2019-05-07 12:05:14.494577] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-testvol-replicate-0: performing metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5 -`[2019-05-07 12:05:14.498398] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. sources=[0] 2 sinks=1` +[2019-05-07 12:05:14.498398] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. sources=[0] 2 sinks=1 +``` ### ii) Self-heal is stuck/ not getting completed. @@ -38,69 +40,88 @@ If a file seems to be forever appearing in heal info and not healing, check the - Examine the afr xattrs- Do they clearly indicate the good and bad copies? If there isn’t at least one good copy, then the file is in split-brain and you would need to use the split-brain resolution CLI. - Identify which node’s shds would be picking up the file for heal. If a file is listed in the heal info output under brick1 and brick2, then the shds on the nodes which host those bricks would attempt (and one of them would succeed) in doing the heal. - - Once the shd is identified, look at the shd logs to see if it is indeed connected to the bricks. +- Once the shd is identified, look at the shd logs to see if it is indeed connected to the bricks. This is good: -`[2019-05-07 09:53:02.912923] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-testvol-client-2: Connected to testvol-client-2, attached to remote volume '/bricks/brick3'` + +```{.text .no-copy } +[2019-05-07 09:53:02.912923] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-testvol-client-2: Connected to testvol-client-2, attached to remote volume '/bricks/brick3' +``` This indicates a disconnect: -`[2019-05-07 11:44:47.602862] I [MSGID: 114018] [client.c:2334:client_rpc_notify] 0-testvol-client-2: disconnected from testvol-client-2. Client process will keep trying to connect to glusterd until brick's port is available` -`[2019-05-07 11:44:50.953516] E [MSGID: 114058] [client-handshake.c:1456:client_query_portmap_cbk] 0-testvol-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.` +```{.text .no-copy } +[2019-05-07 11:44:47.602862] I [MSGID: 114018] [client.c:2334:client_rpc_notify] 0-testvol-client-2: disconnected from testvol-client-2. Client process will keep trying to connect to glusterd until brick's port is available + +[2019-05-07 11:44:50.953516] E [MSGID: 114058] [client-handshake.c:1456:client_query_portmap_cbk] 0-testvol-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. +``` Alternatively, take a statedump of the self-heal daemon (shd) and check if all client xlators are connected to the respective bricks. The shd must have `connected=1` for all the client xlators, meaning it can talk to all the bricks. 
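One quick way to pull this out of the dump (a sketch only — the `pgrep` pattern and the default statedump directory `/var/run/gluster` are assumptions that may differ on your system):

```console
# Ask the self-heal daemon on this node to dump its state (SIGUSR1 triggers a statedump)
kill -SIGUSR1 $(pgrep -f glustershd)

# Every client xlator section in the dump should report connected=1
grep -E 'xlator\.protocol\.client|^connected=' /var/run/gluster/glusterdump.*.dump.*
```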
-| Shd’s statedump entry of a client xlator that is connected to the 3rd brick | Shd’s statedump entry of the same client xlator if it is diconnected from the 3rd brick | -|:--------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------:| +| Shd’s statedump entry of a client xlator that is connected to the 3rd brick | Shd’s statedump entry of the same client xlator if it is diconnected from the 3rd brick | +| :------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------: | | [xlator.protocol.client.testvol-client-2.priv] connected=1 total_bytes_read=75004 ping_timeout=42 total_bytes_written=50608 ping_msgs_sent=0 msgs_sent=0 | [xlator.protocol.client.testvol-client-2.priv] connected=0 total_bytes_read=75004 ping_timeout=42 total_bytes_written=50608 ping_msgs_sent=0 msgs_sent=0 | If there are connection issues (i.e. `connected=0`), you would need to investigate and fix them. Check if the pid and the TCP/RDMA Port of the brick proceess from gluster volume status $VOLNAME matches that of `ps aux|grep glusterfsd|grep $brick-path` -`[root@tuxpad glusterfs]# gluster volume status` +```{.text .no-copy } +# gluster volume status Status of volume: testvol -Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------- -Brick 127.0.0.2:/bricks/brick1 49152 0 Y 12527 +Gluster process TCP Port RDMA Port Online Pid -`[root@tuxpad glusterfs]# ps aux|grep brick1` +--- -`root 12527 0.0 0.1 1459208 20104 ? Ssl 11:20 0:01 /usr/local/sbin/glusterfsd -s 127.0.0.2 --volfile-id testvol.127.0.0.2.bricks-brick1 -p /var/run/gluster/vols/testvol/127.0.0.2-bricks-brick1.pid -S /var/run/gluster/70529980362a17d6.socket --brick-name /bricks/brick1 -l /var/log/glusterfs/bricks/bricks-brick1.log --xlator-option *-posix.glusterd-uuid=d90b1532-30e5-4f9d-a75b-3ebb1c3682d4 --process-name brick --brick-port 49152 --xlator-option testvol-server.listen-port=49152` +Brick 127.0.0.2:/bricks/brick1 49152 0 Y 12527 +``` + +```{.text .no-copy } +# ps aux|grep brick1 + +root 12527 0.0 0.1 1459208 20104 ? Ssl 11:20 0:01 /usr/local/sbin/glusterfsd -s 127.0.0.2 --volfile-id testvol.127.0.0.2.bricks-brick1 -p /var/run/gluster/vols/testvol/127.0.0.2-bricks-brick1.pid -S /var/run/gluster/70529980362a17d6.socket --brick-name /bricks/brick1 -l /var/log/glusterfs/bricks/bricks-brick1.log --xlator-option *-posix.glusterd-uuid=d90b1532-30e5-4f9d-a75b-3ebb1c3682d4 --process-name brick --brick-port 49152 --xlator-option testvol-server.listen-port=49152 +``` Though this will likely match, sometimes there could be a bug leading to stale port usage. A quick workaround would be to restart glusterd on that node and check if things match. Report the issue to the devs if you see this problem. - I have seen some cases where a file is listed in heal info, and the afr xattrs indicate pending metadata or data heal but the file itself is not present on all bricks. 
Ideally, the parent directory of the file must have pending entry heal xattrs so that the file either gets created on the missing bricks or gets deleted from the ones where it is present. But if the parent dir doesn’t have xattrs, the entry heal can’t proceed. In such cases, you can - -- Either do a lookup directly on the file from the mount so that name heal is triggered and then shd can pickup the data/metadata heal. - -- Or manually set entry xattrs on the parent dir to emulate an entry heal so that the file gets created as a part of it. - -- If a brick’s underlying filesystem/lvm was damaged and fsck’d to recovery, some files/dirs might be missing on it. If there is a lot of missing info on the recovered bricks, it might be better to just to a replace-brick or reset-brick and let the heal fully sync everything rather than fiddling with afr xattrs of individual entries. -**Hack:** How to trigger heal on *any* file/directory + - Either do a lookup directly on the file from the mount so that name heal is triggered and then shd can pickup the data/metadata heal. + - Or manually set entry xattrs on the parent dir to emulate an entry heal so that the file gets created as a part of it. + - If a brick’s underlying filesystem/lvm was damaged and fsck’d to recovery, some files/dirs might be missing on it. If there is a lot of missing info on the recovered bricks, it might be better to just to a replace-brick or reset-brick and let the heal fully sync everything rather than fiddling with afr xattrs of individual entries. + +**Hack:** How to trigger heal on _any_ file/directory Knowing about self-heal logic and index heal from the previous post, we can sort of emulate a heal with the following steps. This is not something that you should be doing on your cluster but it pays to at least know that it is possible when push comes to shove. 1. Picking one brick as good and setting the afr pending xattr on it blaming the bad bricks. 2. Capture the gfid inside .glusterfs/indices/xattrop so that the shd can pick it up during index heal. 3. Finally, trigger index heal: gluster volume heal $VOLNAME . -*Example:* Let us say a FILE-1 exists with `trusted.gfid=0x1ad2144928124da9b7117d27393fea5c` on all bricks of a replica 3 volume called testvol. It has no afr xattrs. But you still need to emulate a heal. Let us say you choose brick-2 as the source. Let us do the steps listed above: +_Example:_ Let us say a FILE-1 exists with `trusted.gfid=0x1ad2144928124da9b7117d27393fea5c` on all bricks of a replica 3 volume called testvol. It has no afr xattrs. But you still need to emulate a heal. Let us say you choose brick-2 as the source. Let us do the steps listed above: -1. Make brick-2 blame the other 2 bricks: -[root@tuxpad fuse_mnt]# setfattr -n trusted.afr.testvol-client-2 -v 0x000000010000000000000000 /bricks/brick2/FILE-1 -[root@tuxpad fuse_mnt]# setfattr -n trusted.afr.testvol-client-1 -v 0x000000010000000000000000 /bricks/brick2/FILE-1 +1. Make brick-2 blame the other 2 bricks: -2. Store the gfid string inside xattrop folder as a hardlink to the base entry: -root@tuxpad ~]# cd /bricks/brick2/.glusterfs/indices/xattrop/ -[root@tuxpad xattrop]# ls -li -total 0 -17829255 ----------. 1 root root 0 May 10 11:20 xattrop-a400ca91-cec9-4463-a183-aca9eaff9fa7` -[root@tuxpad xattrop]# ln xattrop-a400ca91-cec9-4463-a183-aca9eaff9fa7 1ad21449-2812-4da9-b711-7d27393fea5c -[root@tuxpad xattrop]# ll -total 0 -----------. 2 root root 0 May 10 11:20 1ad21449-2812-4da9-b711-7d27393fea5c -----------. 
2 root root 0 May 10 11:20 xattrop-a400ca91-cec9-4463-a183-aca9eaff9fa7

-3. Trigger heal: gluster volume heal testvol
-The glustershd.log of node-2 should log about the heal.
-[2019-05-10 06:10:46.027238] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed data selfheal on 1ad21449-2812-4da9-b711-7d27393fea5c. sources=[1] sinks=0 2
-So the data was healed from the second brick to the first and third brick.

+   setfattr -n trusted.afr.testvol-client-2 -v 0x000000010000000000000000 /bricks/brick2/FILE-1
+   setfattr -n trusted.afr.testvol-client-1 -v 0x000000010000000000000000 /bricks/brick2/FILE-1

+2. Store the gfid string inside xattrop folder as a hardlink to the base entry:
+
+   # cd /bricks/brick2/.glusterfs/indices/xattrop/
+   # ls -li
+   total 0
+   17829255 ----------. 1 root root 0 May 10 11:20 xattrop-a400ca91-cec9-4463-a183-aca9eaff9fa7
+
+   # ln xattrop-a400ca91-cec9-4463-a183-aca9eaff9fa7 1ad21449-2812-4da9-b711-7d27393fea5c
+   # ll
+   total 0
+   ----------. 2 root root 0 May 10 11:20 1ad21449-2812-4da9-b711-7d27393fea5c
+   ----------. 2 root root 0 May 10 11:20 xattrop-a400ca91-cec9-4463-a183-aca9eaff9fa7
+
+3. Trigger heal: `gluster volume heal testvol`
+
+   The glustershd.log of node-2 should log about the heal.
+
+   [2019-05-10 06:10:46.027238] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed data selfheal on 1ad21449-2812-4da9-b711-7d27393fea5c. sources=[1] sinks=0 2
+
+   So the data was healed from the second brick to the first and third brick.

### iii) Self-heal is too slow

@@ -109,7 +130,7 @@ If the heal backlog is decreasing and you see glustershd logging heals but you

 Option: cluster.shd-max-threads
 Default Value: 1
 Description: Maximum number of parallel heals SHD can do per local brick. This can substantially lower heal times, but can also crush your bricks if you don’t have the storage hardware to support this.
-
+
 Option: cluster.shd-wait-qlength
 Default Value: 1024
 Description: This option can be used to control number of heals that can wait in SHD per subvolume
@@ -118,38 +139,45 @@ I’m not covering it here but it is possible to launch multiple shd instances (

### iv) Self-heal is too aggressive and slows down the system.

-If shd-max-threads are at the lowest value (i.e. 1) and you see if CPU usage of the bricks is too high, you can check if the volume’s profile info shows a lot of RCHECKSUM fops. Data self-heal does checksum calculation (i.e the `posix_rchecksum()` FOP) which can be CPU intensive. You can the `cluster.data-self-heal-algorithm` option to full. This does a full file copy instead of computing rolling checksums and syncing only the mismatching blocks. The tradeoff is that the network consumption will be increased.
+If shd-max-threads is at the lowest value (i.e. 1) and you see that the CPU usage of the bricks is too high, you can check if the volume’s profile info shows a lot of RCHECKSUM fops. Data self-heal does checksum calculation (i.e. the `posix_rchecksum()` FOP) which can be CPU intensive. You can set the `cluster.data-self-heal-algorithm` option to full. This does a full file copy instead of computing rolling checksums and syncing only the mismatching blocks. The tradeoff is that the network consumption will be increased.

-You can also disable all client-side heals if they are turned on so that the client bandwidth is consumed entirely by the application FOPs and not the ones by client side background heals. i.e.
turn off `cluster.metadata-self-heal, cluster.data-self-heal and cluster.entry-self-heal`. -Note: In recent versions of gluster, client-side heals are disabled by default. +You can also disable all client-side heals if they are turned on so that the client bandwidth is consumed entirely by the application FOPs and not the ones by client side background heals. i.e. turn off `cluster.metadata-self-heal, cluster.data-self-heal and cluster.entry-self-heal`. +Note: In recent versions of gluster, client-side heals are disabled by default. ## Mount related issues: - ### i) All fops are failing with ENOTCONN + +### i) All fops are failing with ENOTCONN Check mount log/ statedump for loss of quorum, just like for glustershd. If this is a fuse client (as opposed to an nfs/ gfapi client), you can also check the .meta folder to check the connection status to the bricks. -`[root@tuxpad ~]# cat /mnt/fuse_mnt/.meta/graphs/active/testvol-client-*/private |grep connected` -`connected = 0` -`connected = 1` -`connected = 1` +```{.text .no-copy } +# cat /mnt/fuse_mnt/.meta/graphs/active/testvol-client-*/private |grep connected -If `connected=0`, the connection to that brick is lost. Find out why. If the client is not connected to quorum number of bricks, then AFR fails lookups (and therefore any subsequent FOP) with Transport endpoint is not connected +connected = 0 +connected = 1 +connected = 1 +``` + +If `connected=0`, the connection to that brick is lost. Find out why. If the client is not connected to quorum number of bricks, then AFR fails lookups (and therefore any subsequent FOP) with Transport endpoint is not connected ### ii) FOPs on some files are failing with ENOTCONN Check mount log for the file being unreadable: -`[2019-05-10 11:04:01.607046] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 13-testvol-replicate-0: no read subvols for /FILE.txt` -`[2019-05-10 11:04:01.607775] W [fuse-bridge.c:939:fuse_entry_cbk] 0-glusterfs-fuse: 234: LOOKUP() /FILE.txt => -1 (Transport endpoint is not connected)` -This means there was only 1 good copy and the client has lost connection to that brick. You need to ensure that the client is connected to all bricks. +```{.text .no-copy } +[2019-05-10 11:04:01.607046] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 13-testvol-replicate-0: no read subvols for /FILE.txt +[2019-05-10 11:04:01.607775] W [fuse-bridge.c:939:fuse_entry_cbk] 0-glusterfs-fuse: 234: LOOKUP() /FILE.txt => -1 (Transport endpoint is not connected) +``` + +This means there was only 1 good copy and the client has lost connection to that brick. You need to ensure that the client is connected to all bricks. ### iii) Mount is hung It can be difficult to pin-point the issue immediately and might require assistance from the developers but the first steps to debugging could be to - - strace the fuse mount; see where it is hung. - - Take a statedump of the mount to see which xlator has frames that are not wound (i.e. complete=0) and for which FOP. Then check the source code to see if there are any unhanded cases where the xlator doesn’t wind the FOP to its child. - - Take statedump of bricks to see if there are any stale locks. An indication of stale locks is the same lock being present in multiple statedumps or the ‘granted’ date being very old. +- strace the fuse mount; see where it is hung. +- Take a statedump of the mount to see which xlator has frames that are not wound (i.e. complete=0) and for which FOP. 
Then check the source code to see if there are any unhanded cases where the xlator doesn’t wind the FOP to its child. +- Take statedump of bricks to see if there are any stale locks. An indication of stale locks is the same lock being present in multiple statedumps or the ‘granted’ date being very old. Excerpt from a brick statedump: diff --git a/docs/Troubleshooting/troubleshooting-filelocks.md b/docs/Troubleshooting/troubleshooting-filelocks.md index ec5da40..aaf42b5 100644 --- a/docs/Troubleshooting/troubleshooting-filelocks.md +++ b/docs/Troubleshooting/troubleshooting-filelocks.md @@ -1,6 +1,4 @@ -Troubleshooting File Locks -========================== - +# Troubleshooting File Locks Use [statedumps](./statedump.md) to find and list the locks held on files. The statedump output also provides information on each lock @@ -13,11 +11,11 @@ lock using the following `clear lock` commands. 1. **Perform statedump on the volume to view the files that are locked using the following command:** - # gluster volume statedump inode + gluster volume statedump inode For example, to display statedump of test-volume: - # gluster volume statedump test-volume + gluster volume statedump test-volume Volume statedump successful The statedump files are created on the brick servers in the` /tmp` @@ -58,25 +56,23 @@ lock using the following `clear lock` commands. 2. **Clear the lock using the following command:** - # gluster volume clear-locks + gluster volume clear-locks For example, to clear the entry lock on `file1` of test-volume: - # gluster volume clear-locks test-volume / kind granted entry file1 + gluster volume clear-locks test-volume / kind granted entry file1 Volume clear-locks successful vol-locks: entry blocked locks=0 granted locks=1 3. **Clear the inode lock using the following command:** - # gluster volume clear-locks + gluster volume clear-locks For example, to clear the inode lock on `file1` of test-volume: - # gluster volume clear-locks test-volume /file1 kind granted inode 0,0-0 + gluster volume clear-locks test-volume /file1 kind granted inode 0,0-0 Volume clear-locks successful vol-locks: inode blocked locks=0 granted locks=1 Perform statedump on test-volume again to verify that the above inode and entry locks are cleared. - - diff --git a/docs/Troubleshooting/troubleshooting-georep.md b/docs/Troubleshooting/troubleshooting-georep.md index 9ef49fe..cb66538 100644 --- a/docs/Troubleshooting/troubleshooting-georep.md +++ b/docs/Troubleshooting/troubleshooting-georep.md @@ -8,13 +8,13 @@ to GlusterFS Geo-replication. 
For every Geo-replication session, the following three log files are associated to it (four, if the secondary is a gluster volume): -- **Primary-log-file** - log file for the process which monitors the Primary - volume -- **Secondary-log-file** - log file for process which initiates the changes in - secondary -- **Primary-gluster-log-file** - log file for the maintenance mount point - that Geo-replication module uses to monitor the Primary volume -- **Secondary-gluster-log-file** - is the secondary's counterpart of it +- **Primary-log-file** - log file for the process which monitors the Primary + volume +- **Secondary-log-file** - log file for process which initiates the changes in + secondary +- **Primary-gluster-log-file** - log file for the maintenance mount point + that Geo-replication module uses to monitor the Primary volume +- **Secondary-gluster-log-file** - is the secondary's counterpart of it **Primary Log File** @@ -28,7 +28,7 @@ gluster volume geo-replication config log-file For example: ```console -# gluster volume geo-replication Volume1 example.com:/data/remote_dir config log-file +gluster volume geo-replication Volume1 example.com:/data/remote_dir config log-file ``` **Secondary Log File** @@ -38,13 +38,13 @@ running on secondary machine), use the following commands: 1. On primary, run the following command: - # gluster volume geo-replication Volume1 example.com:/data/remote_dir config session-owner 5f6e5200-756f-11e0-a1f0-0800200c9a66 + gluster volume geo-replication Volume1 example.com:/data/remote_dir config session-owner 5f6e5200-756f-11e0-a1f0-0800200c9a66 Displays the session owner details. 2. On secondary, run the following command: - # gluster volume geo-replication /data/remote_dir config log-file /var/log/gluster/${session-owner}:remote-mirror.log + gluster volume geo-replication /data/remote_dir config log-file /var/log/gluster/${session-owner}:remote-mirror.log 3. Replace the session owner details (output of Step 1) to the output of Step 2 to get the location of the log file. @@ -52,7 +52,7 @@ running on secondary machine), use the following commands: /var/log/gluster/5f6e5200-756f-11e0-a1f0-0800200c9a66:remote-mirror.log ### Rotating Geo-replication Logs - + Administrators can rotate the log file of a particular primary-secondary session, as needed. When you run geo-replication's ` log-rotate` command, the log file is backed up with the current timestamp suffixed @@ -61,34 +61,34 @@ log file. 
**To rotate a geo-replication log file** -- Rotate log file for a particular primary-secondary session using the - following command: +- Rotate log file for a particular primary-secondary session using the + following command: - # gluster volume geo-replication log-rotate + gluster volume geo-replication log-rotate - For example, to rotate the log file of primary `Volume1` and secondary - `example.com:/data/remote_dir` : + For example, to rotate the log file of primary `Volume1` and secondary + `example.com:/data/remote_dir` : - # gluster volume geo-replication Volume1 example.com:/data/remote_dir log rotate + gluster volume geo-replication Volume1 example.com:/data/remote_dir log rotate log rotate successful -- Rotate log file for all sessions for a primary volume using the - following command: +- Rotate log file for all sessions for a primary volume using the + following command: - # gluster volume geo-replication log-rotate + gluster volume geo-replication log-rotate - For example, to rotate the log file of primary `Volume1`: + For example, to rotate the log file of primary `Volume1`: - # gluster volume geo-replication Volume1 log rotate + gluster volume geo-replication Volume1 log rotate log rotate successful -- Rotate log file for all sessions using the following command: +- Rotate log file for all sessions using the following command: - # gluster volume geo-replication log-rotate + gluster volume geo-replication log-rotate - For example, to rotate the log file for all sessions: + For example, to rotate the log file for all sessions: - # gluster volume geo-replication log rotate + gluster volume geo-replication log rotate log rotate successful ### Synchronization is not complete @@ -102,16 +102,14 @@ GlusterFS geo-replication begins synchronizing all the data. All files are compared using checksum, which can be a lengthy and high resource utilization operation on large data sets. - ### Issues in Data Synchronization **Description**: Geo-replication display status as OK, but the files do not get synced, only directories and symlink gets synced with the following error message in the log: -```console -[2011-05-02 13:42:13.467644] E [primary:288:regjob] GMaster: failed to -sync ./some\_file\` +```{ .text .no-copy } +[2011-05-02 13:42:13.467644] E [primary:288:regjob] GMaster: failed to sync ./some\_file\` ``` **Solution**: Geo-replication invokes rsync v3.0.0 or higher on the host @@ -123,7 +121,7 @@ required version. **Description**: Geo-replication displays status as faulty very often with a backtrace similar to the following: -```console +```{ .text .no-copy } 2011-04-28 14:06:18.378859] E [syncdutils:131:log\_raise\_exception] \: FAIL: Traceback (most recent call last): File "/usr/local/libexec/glusterfs/python/syncdaemon/syncdutils.py", line @@ -139,28 +137,28 @@ the primary gsyncd module and secondary gsyncd module is broken and this can happen for various reasons. Check if it satisfies all the following pre-requisites: -- Password-less SSH is set up properly between the host and the remote - machine. -- If FUSE is installed in the machine, because geo-replication module - mounts the GlusterFS volume using FUSE to sync data. -- If the **Secondary** is a volume, check if that volume is started. -- If the Secondary is a plain directory, verify if the directory has been - created already with the required permissions. 
-- If GlusterFS 3.2 or higher is not installed in the default location - (in Primary) and has been prefixed to be installed in a custom - location, configure the `gluster-command` for it to point to the - exact location. -- If GlusterFS 3.2 or higher is not installed in the default location - (in secondary) and has been prefixed to be installed in a custom - location, configure the `remote-gsyncd-command` for it to point to - the exact place where gsyncd is located. +- Password-less SSH is set up properly between the host and the remote + machine. +- If FUSE is installed in the machine, because geo-replication module + mounts the GlusterFS volume using FUSE to sync data. +- If the **Secondary** is a volume, check if that volume is started. +- If the Secondary is a plain directory, verify if the directory has been + created already with the required permissions. +- If GlusterFS 3.2 or higher is not installed in the default location + (in Primary) and has been prefixed to be installed in a custom + location, configure the `gluster-command` for it to point to the + exact location. +- If GlusterFS 3.2 or higher is not installed in the default location + (in secondary) and has been prefixed to be installed in a custom + location, configure the `remote-gsyncd-command` for it to point to + the exact place where gsyncd is located. ### Intermediate Primary goes to Faulty State **Description**: In a cascading set-up, the intermediate primary goes to faulty state with the following log: -```console +```{ .text .no-copy } raise RuntimeError ("aborting on uuid change from %s to %s" % \\ RuntimeError: aborting on uuid change from af07e07c-427f-4586-ab9f- 4bf7d299be81 to de6b5040-8f4e-4575-8831-c4f55bd41154 diff --git a/docs/Troubleshooting/troubleshooting-glusterd.md b/docs/Troubleshooting/troubleshooting-glusterd.md index c42936b..dfa2ed7 100644 --- a/docs/Troubleshooting/troubleshooting-glusterd.md +++ b/docs/Troubleshooting/troubleshooting-glusterd.md @@ -4,45 +4,40 @@ The glusterd daemon runs on every trusted server node and is responsible for the The gluster CLI sends commands to the glusterd daemon on the local node, which executes the operation and returns the result to the user. -
- ### Debugging glusterd #### Logs + Start by looking at the log files for clues as to what went wrong when you hit a problem. The default directory for Gluster logs is /var/log/glusterfs. The logs for the CLI and glusterd are: - - glusterd : /var/log/glusterfs/glusterd.log - - gluster CLI : /var/log/glusterfs/cli.log - +- glusterd : /var/log/glusterfs/glusterd.log +- gluster CLI : /var/log/glusterfs/cli.log #### Statedumps + Statedumps are useful in debugging memory leaks and hangs. See [Statedump](./statedump.md) for more details. -
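As a starting point, it usually helps to watch both logs while re-running the failing command, and to capture a statedump if glusterd appears hung. A rough sketch, assuming the default log locations and that glusterd, like other gluster processes, dumps state on SIGUSR1:

```console
# Follow the glusterd and CLI logs while reproducing the problem
tail -f /var/log/glusterfs/glusterd.log /var/log/glusterfs/cli.log

# If glusterd looks hung, capture a statedump for later analysis
kill -SIGUSR1 $(pidof glusterd)
```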
- ### Common Issues and How to Resolve Them - -**"*Another transaction is in progress for volname*" or "*Locking failed on xxx.xxx.xxx.xxx"*** +**"_Another transaction is in progress for volname_" or "_Locking failed on xxx.xxx.xxx.xxx"_** As Gluster is distributed by nature, glusterd takes locks when performing operations to ensure that configuration changes made to a volume are atomic across the cluster. These errors are returned when: -* More than one transaction contends on the same lock. -> *Solution* : These are likely to be transient errors and the operation will succeed if retried once the other transaction is complete. +- More than one transaction contends on the same lock. -* A stale lock exists on one of the nodes. -> *Solution* : Repeating the operation will not help until the stale lock is cleaned up. Restart the glusterd process holding the lock + > _Solution_ : These are likely to be transient errors and the operation will succeed if retried once the other transaction is complete. - * Check the glusterd.log file to find out which node holds the stale lock. Look for the message: - `lock being held by ` - * Run `gluster peer status` to identify the node with the uuid in the log message. - * Restart glusterd on that node. +- A stale lock exists on one of the nodes. + > _Solution_ : Repeating the operation will not help until the stale lock is cleaned up. Restart the glusterd process holding the lock -
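For example, on a node using the default log location and systemd (adjust the paths and service name to your distribution), the sequence could look like this:

```console
# Find which UUID is reported as holding the stale lock
grep 'lock being held' /var/log/glusterfs/glusterd.log | tail -1

# Map that UUID to a peer
gluster peer status

# On the node holding the stale lock, restart glusterd
systemctl restart glusterd
```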
+ - Check the glusterd.log file to find out which node holds the stale lock. Look for the message: + `lock being held by ` + - Run `gluster peer status` to identify the node with the uuid in the log message. + - Restart glusterd on that node. **"_Transport endpoint is not connected_" errors but all bricks are up** @@ -51,51 +46,40 @@ Gluster client processes query glusterd for the ports the bricks processes are l If the port information in glusterd is incorrect, the client will fail to connect to the brick even though it is up. Operations which would need to access that brick may fail with "Transport endpoint is not connected". -*Solution* : Restart the glusterd service. - -
+_Solution_ : Restart the glusterd service. **"Peer Rejected"** `gluster peer status` returns "Peer Rejected" for a node. -```console +```{ .text .no-copy } Hostname: Uuid: State: Peer Rejected (Connected) ``` -This indicates that the volume configuration on the node is not in sync with the rest of the trusted storage pool. +This indicates that the volume configuration on the node is not in sync with the rest of the trusted storage pool. You should see the following message in the glusterd log for the node on which the peer status command was run: -```console +```{ .text .no-copy } Version of Cksums differ. local cksum = xxxxxx, remote cksum = xxxxyx on peer ``` -*Solution*: Update the cluster.op-version +_Solution_: Update the cluster.op-version - * Run `gluster volume get all cluster.max-op-version` to get the latest supported op-version. - * Update the cluster.op-version to the latest supported op-version by executing `gluster volume set all cluster.op-version `. - -
+- Run `gluster volume get all cluster.max-op-version` to get the latest supported op-version. +- Update the cluster.op-version to the latest supported op-version by executing `gluster volume set all cluster.op-version `. **"Accepted Peer Request"** -If the glusterd handshake fails while expanding a cluster, the view of the cluster will be inconsistent. The state of the peer in `gluster peer status` will be “accepted peer request” and subsequent CLI commands will fail with an error. -Eg. `Volume create command will fail with "volume create: testvol: failed: Host is not in 'Peer in Cluster' state` - +If the glusterd handshake fails while expanding a cluster, the view of the cluster will be inconsistent. The state of the peer in `gluster peer status` will be “accepted peer request” and subsequent CLI commands will fail with an error. +Eg. `Volume create command will fail with "volume create: testvol: failed: Host is not in 'Peer in Cluster' state` + In this case the value of the state field in `/var/lib/glusterd/peers/` will be other than 3. -*Solution*: - -* Stop glusterd -* Open `/var/lib/glusterd/peers/` -* Change state to 3 -* Start glusterd - - - - - - +_Solution_: +- Stop glusterd +- Open `/var/lib/glusterd/peers/` +- Change state to 3 +- Start glusterd diff --git a/docs/Troubleshooting/troubleshooting-gnfs.md b/docs/Troubleshooting/troubleshooting-gnfs.md index 7e2c61a..9d7c455 100644 --- a/docs/Troubleshooting/troubleshooting-gnfs.md +++ b/docs/Troubleshooting/troubleshooting-gnfs.md @@ -11,14 +11,14 @@ This error is encountered when the server has not started correctly. On most Linux distributions this is fixed by starting portmap: ```console -# /etc/init.d/portmap start +/etc/init.d/portmap start ``` On some distributions where portmap has been replaced by rpcbind, the following command is required: ```console -# /etc/init.d/rpcbind start +/etc/init.d/rpcbind start ``` After starting portmap or rpcbind, gluster NFS server needs to be @@ -32,13 +32,13 @@ This error can arise in case there is already a Gluster NFS server running on the same machine. 
This situation can be confirmed from the log file, if the following error lines exist: -```text +```{ .text .no-copy } [2010-05-26 23:40:49] E [rpc-socket.c:126:rpcsvc_socket_listen] rpc-socket: binding socket failed:Address already in use -[2010-05-26 23:40:49] E [rpc-socket.c:129:rpcsvc_socket_listen] rpc-socket: Port is already in use -[2010-05-26 23:40:49] E [rpcsvc.c:2636:rpcsvc_stage_program_register] rpc-service: could not create listening connection -[2010-05-26 23:40:49] E [rpcsvc.c:2675:rpcsvc_program_register] rpc-service: stage registration of program failed -[2010-05-26 23:40:49] E [rpcsvc.c:2695:rpcsvc_program_register] rpc-service: Program registration failed: MOUNT3, Num: 100005, Ver: 3, Port: 38465 -[2010-05-26 23:40:49] E [nfs.c:125:nfs_init_versions] nfs: Program init failed +[2010-05-26 23:40:49] E [rpc-socket.c:129:rpcsvc_socket_listen] rpc-socket: Port is already in use +[2010-05-26 23:40:49] E [rpcsvc.c:2636:rpcsvc_stage_program_register] rpc-service: could not create listening connection +[2010-05-26 23:40:49] E [rpcsvc.c:2675:rpcsvc_program_register] rpc-service: stage registration of program failed +[2010-05-26 23:40:49] E [rpcsvc.c:2695:rpcsvc_program_register] rpc-service: Program registration failed: MOUNT3, Num: 100005, Ver: 3, Port: 38465 +[2010-05-26 23:40:49] E [nfs.c:125:nfs_init_versions] nfs: Program init failed [2010-05-26 23:40:49] C [nfs.c:531:notify] nfs: Failed to initialize protocols ``` @@ -50,7 +50,7 @@ multiple NFS servers on the same machine. If the mount command fails with the following error message: -```console +```{ .text .no-copy } mount.nfs: rpc.statd is not running but is required for remote locking. mount.nfs: Either use '-o nolock' to keep locks local, or start statd. ``` @@ -59,7 +59,7 @@ For NFS clients to mount the NFS server, rpc.statd service must be running on the clients. Start rpc.statd service by running the following command: ```console -# rpc.statd +rpc.statd ``` ### mount command takes too long to finish. @@ -71,14 +71,14 @@ NFS client. The resolution for this is to start either of these services by running the following command: ```console -# /etc/init.d/portmap start +/etc/init.d/portmap start ``` On some distributions where portmap has been replaced by rpcbind, the following command is required: ```console -# /etc/init.d/rpcbind start +/etc/init.d/rpcbind start ``` ### NFS server glusterfsd starts but initialization fails with “nfsrpc- service: portmap registration of program failed” error message in the log. @@ -88,8 +88,8 @@ still fail preventing clients from accessing the mount points. 
Such a situation can be confirmed from the following error messages in the log file: -```text -[2010-05-26 23:33:47] E [rpcsvc.c:2598:rpcsvc_program_register_portmap] rpc-service: Could notregister with portmap +```{ .text .no-copy } +[2010-05-26 23:33:47] E [rpcsvc.c:2598:rpcsvc_program_register_portmap] rpc-service: Could notregister with portmap [2010-05-26 23:33:47] E [rpcsvc.c:2682:rpcsvc_program_register] rpc-service: portmap registration of program failed [2010-05-26 23:33:47] E [rpcsvc.c:2695:rpcsvc_program_register] rpc-service: Program registration failed: MOUNT3, Num: 100005, Ver: 3, Port: 38465 [2010-05-26 23:33:47] E [nfs.c:125:nfs_init_versions] nfs: Program init failed @@ -104,12 +104,12 @@ file: On most Linux distributions, portmap can be started using the following command: - # /etc/init.d/portmap start + /etc/init.d/portmap start On some distributions where portmap has been replaced by rpcbind, run the following command: - # /etc/init.d/rpcbind start + /etc/init.d/rpcbind start After starting portmap or rpcbind, gluster NFS server needs to be restarted. @@ -126,8 +126,8 @@ file: On Linux, kernel NFS servers can be stopped by using either of the following commands depending on the distribution in use: - # /etc/init.d/nfs-kernel-server stop - # /etc/init.d/nfs stop + /etc/init.d/nfs-kernel-server stop + /etc/init.d/nfs stop 3. **Restart Gluster NFS server** @@ -135,7 +135,7 @@ file: mount command fails with following error -```console +```{ .text .no-copy } mount: mount to NFS server '10.1.10.11' failed: timed out (retrying). ``` @@ -175,14 +175,13 @@ Perform one of the following to resolve this issue: forcing the NFS client to use version 3. The **vers** option to mount command is used for this purpose: - # mount -o vers=3 + mount -o vers=3 -### showmount fails with clnt\_create: RPC: Unable to receive +### showmount fails with clnt_create: RPC: Unable to receive Check your firewall setting to open ports 111 for portmap requests/replies and Gluster NFS server requests/replies. Gluster NFS -server operates over the following port numbers: 38465, 38466, and -38467. +server operates over the following port numbers: 38465, 38466, and 38467. ### Application fails with "Invalid argument" or "Value too large for defined data type" error. @@ -193,9 +192,9 @@ numbers instead: nfs.enable-ino32 \ Applications that will benefit are those that were either: -- built 32-bit and run on 32-bit machines such that they do not - support large files by default -- built 32-bit on 64-bit systems +- built 32-bit and run on 32-bit machines such that they do not + support large files by default +- built 32-bit on 64-bit systems This option is disabled by default so NFS returns 64-bit inode numbers by default. @@ -203,6 +202,6 @@ by default. Applications which can be rebuilt from source are recommended to rebuild using the following flag with gcc: -``` +```console -D_FILE_OFFSET_BITS=64 ``` diff --git a/docs/Troubleshooting/troubleshooting-memory.md b/docs/Troubleshooting/troubleshooting-memory.md index 12336d5..70d83b2 100644 --- a/docs/Troubleshooting/troubleshooting-memory.md +++ b/docs/Troubleshooting/troubleshooting-memory.md @@ -1,5 +1,4 @@ -Troubleshooting High Memory Utilization -======================================= +# Troubleshooting High Memory Utilization If the memory utilization of a Gluster process increases significantly with time, it could be a leak caused by resources not being freed. 
If you suspect that you may have hit such an issue, try using [statedumps](./statedump.md) to debug the issue. @@ -12,4 +11,3 @@ If you are unable to figure out where the leak is, please [file an issue](https: - Steps to reproduce the issue if available - Statedumps for the process collected at intervals as the memory utilization increases - The Gluster log files for the process (if possible) -
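To collect the interval statedumps mentioned above, a small loop such as the following can be left running while the memory usage grows. This is a sketch only — the PID, interval and output file are placeholders to adapt:

```console
# Sample the suspect process every 10 minutes: one statedump plus its resident memory size
PID=12345   # placeholder: replace with the PID of the leaking gluster process
while true; do
    kill -SIGUSR1 "$PID"                       # statedump lands in /var/run/gluster by default
    ps -o rss= -p "$PID" >> /tmp/gluster-rss.log
    sleep 600
done
```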