// Module included in the following assemblies:
//
// * observability/logging/troubleshooting/troubleshooting-logging-alerts.adoc

:_mod-docs-content-type: PROCEDURE
[id="es-cluster-health-is-red_{context}"]
= Elasticsearch cluster health status is red

At least one primary shard and its replicas are not allocated to a node. Use the following procedure to troubleshoot this alert.

include::snippets/es-pod-var-logging.adoc[]

.Procedure
. Check the Elasticsearch cluster health and verify that the cluster `status` is red by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- health
----
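+
If you only need to see the status field, you can filter the output of the same command, for example:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- health | grep status
----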
. List the nodes that have joined the cluster by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cat/nodes?v
----
. List the Elasticsearch pods and compare them with the nodes in the command output from the previous step by running the following command:
+
[source,terminal]
----
$ oc -n openshift-logging get pods -l component=elasticsearch
----
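+
For a quick numeric comparison, you can count both lists, for example by reusing the queries from the previous steps, with `h=name` limiting the `_cat/nodes` output to the node name column:
+
[source,terminal]
----
$ oc -n openshift-logging get pods -l component=elasticsearch -o name | wc -l

$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cat/nodes?h=name | wc -l
----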
. If some of the Elasticsearch nodes have not joined the cluster, perform the following steps.
.. Confirm that Elasticsearch has an elected master node by running the following command and observing the output:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cat/master?v
----
.. Review the pod logs of the elected master node for issues by running the following command and observing the output:
+
[source,terminal]
----
$ oc logs <elasticsearch_master_pod_name> -c elasticsearch -n openshift-logging
----
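+
To narrow the output to likely problems, you can filter the log for common markers, for example (the filter terms are only a starting point):
+
[source,terminal]
----
$ oc logs <elasticsearch_master_pod_name> -c elasticsearch -n openshift-logging \
| grep -iE 'error|warn|exception'
----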
.. Review the logs of nodes that have not joined the cluster for issues by running the following command and observing the output:
+
[source,terminal]
----
$ oc logs <elasticsearch_node_name> -c elasticsearch -n openshift-logging
----
. If all the nodes have joined the cluster, check if the cluster is in the process of recovering by running the following command and observing the output:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cat/recovery?active_only=true
----
+
If there is no command output, the recovery process might be delayed or stalled by pending tasks.
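+
No output can also mean that no recoveries are currently active. To see recently completed recoveries as well, you can omit the `active_only` flag, which is a standard option of the Elasticsearch `_cat/recovery` API, for example:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cat/recovery?v
----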
. Check if there are pending tasks by running the following command and observing the output:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- health | grep number_of_pending_tasks
----
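+
To inspect the pending tasks themselves rather than only their count, you can query the standard Elasticsearch `_cluster/pending_tasks` API, for example:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cluster/pending_tasks?pretty
----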
. If there are pending tasks, monitor their status. If their status changes and indicates that the cluster is recovering, continue waiting; the recovery time varies with the size of the cluster and other factors. If the status of the pending tasks does not change, the recovery has stalled.
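+
For example, one way to monitor the count periodically, assuming the `watch` utility is available on your workstation:
+
[source,terminal]
----
$ watch -n 30 "oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME -- health | grep number_of_pending_tasks"
----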
. If the recovery appears to have stalled, check whether the `cluster.routing.allocation.enable` value is set to `none` by running the following command and observing the output:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cluster/settings?pretty
----
. If the `cluster.routing.allocation.enable` value is set to `none`, set it to `all` by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cluster/settings?pretty \
-X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
----
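+
To confirm that the change was applied, you can repeat the settings query and check that the value is now `all`, for example:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cluster/settings?pretty | grep enable
----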
. Check if any indices are still red by running the following command and observing the output:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cat/indices?v
----
. If any indices are still red, try to clear them by performing the following steps.
.. Clear the cache by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=<elasticsearch_index_name>/_cache/clear?pretty
----
.. Increase the max allocation retries by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=<elasticsearch_index_name>/_settings?pretty \
-X PUT -d '{"index.allocation.max_retries":10}'
----
.. Delete all the scroll items by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_search/scroll/_all -X DELETE
----
.. Increase the timeout by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=<elasticsearch_index_name>/_settings?pretty \
-X PUT -d '{"index.unassigned.node_left.delayed_timeout":"10m"}'
----
. If the preceding steps do not clear the red indices, delete the indices individually.
.. Identify the red index name by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cat/indices?v
----
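+
Because the output lists every index, you can filter it to show only the red ones, for example:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cat/indices?v | grep red
----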
.. Delete the red index by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=<elasticsearch_red_index_name> -X DELETE
----
. If there are no red indices and the cluster status is red, check for a continuous heavy processing load on a data node.
.. Check if the Elasticsearch JVM Heap usage is high by running the following command:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_nodes/stats?pretty
----
+
In the command output, review the `node_name.jvm.mem.heap_used_percent` field to determine the JVM Heap usage.
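+
If `jq` is available on your workstation, you can extract the heap usage for each node directly; the `/jvm` filter and the `jq` expression shown here are illustrative:
+
[source,terminal]
----
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_nodes/stats/jvm?pretty \
| jq '.nodes[] | {name: .name, heap_used_percent: .jvm.mem.heap_used_percent}'
----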
.. Check for high CPU utilization. For more information about CPU utilization, see the {product-title} "Reviewing monitoring dashboards" documentation.
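+
As a quick command-line check, you can also view the current CPU and memory usage of the Elasticsearch pods, provided that cluster metrics are available, for example:
+
[source,terminal]
----
$ oc adm top pods -n openshift-logging -l component=elasticsearch
----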