From 4ea62842a95ec07bf635fe07f9145a7239caaa49 Mon Sep 17 00:00:00 2001
From: Eliska Romanova
Date: Mon, 25 Mar 2024 15:16:36 +0100
Subject: [PATCH] OBSDOCS-920: Add troubleshooting steps for KubePersistentVolumeFillingUp alert

---
 ...fillingup-alert-firing-for-prometheus.adoc | 93 +++++++++++++++++++
 .../troubleshooting-monitoring-issues.adoc    |  3 +
 .../investigating-monitoring-issues.adoc      |  9 +-
 3 files changed, 104 insertions(+), 1 deletion(-)
 create mode 100644 modules/monitoring-resolving-the-kubepersistentvolumefillingup-alert-firing-for-prometheus.adoc

diff --git a/modules/monitoring-resolving-the-kubepersistentvolumefillingup-alert-firing-for-prometheus.adoc b/modules/monitoring-resolving-the-kubepersistentvolumefillingup-alert-firing-for-prometheus.adoc
new file mode 100644
index 0000000000..d7609ad916
--- /dev/null
+++ b/modules/monitoring-resolving-the-kubepersistentvolumefillingup-alert-firing-for-prometheus.adoc
@@ -0,0 +1,93 @@
+// Module included in the following assemblies:
+//
+// * monitoring/troubleshooting-monitoring-issues.adoc
+// * support/troubleshooting/investigating-monitoring-issues.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="resolving-the-kubepersistentvolumefillingup-alert-firing-for-prometheus_{context}"]
+= Resolving the KubePersistentVolumeFillingUp alert firing for Prometheus
+
+As a cluster administrator, you can resolve the `KubePersistentVolumeFillingUp` alert being triggered for Prometheus.
+
+The critical alert fires when a persistent volume (PV) claimed by a `prometheus-k8s-*` pod in the `openshift-monitoring` project has less than 3% of its total space remaining. This can cause Prometheus to function abnormally.
+
+[NOTE]
+====
+There are two `KubePersistentVolumeFillingUp` alerts:
+
+* *Critical alert*: The alert with the `severity="critical"` label is triggered when the mounted PV has less than 3% of its total space remaining.
+* *Warning alert*: The alert with the `severity="warning"` label is triggered when the mounted PV has less than 15% of its total space remaining and is expected to fill up within four days.
+====
+
+To address this issue, you can remove Prometheus time-series database (TSDB) blocks to create more space for the PV.
+
+.Prerequisites
+
+ifndef::openshift-dedicated,openshift-rosa[]
+* You have access to the cluster as a user with the `cluster-admin` cluster role.
+endif::openshift-dedicated,openshift-rosa[]
+ifdef::openshift-dedicated,openshift-rosa[]
+* You have access to the cluster as a user with the `dedicated-admin` role.
+endif::openshift-dedicated,openshift-rosa[]
+* You have installed the OpenShift CLI (`oc`).
+
+.Procedure
+
+. List the size of all TSDB blocks, sorted from newest to oldest, by running the following command:
++
+[source,terminal]
+----
+$ oc debug <prometheus-k8s-pod-name> -n openshift-monitoring \// <1>
+-c prometheus --image=$(oc get po -n openshift-monitoring <prometheus-k8s-pod-name> \// <1>
+-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
+-- sh -c 'cd /prometheus/;du -hs $(ls -dt */ | grep -Eo "[0-9|A-Z]{26}")'
+----
+<1> Replace `<prometheus-k8s-pod-name>` with the pod mentioned in the `KubePersistentVolumeFillingUp` alert description.
++
+.Example output
+[source,terminal]
+----
+308M    01HVKMPKQWZYWS8WVDAYQHNMW6
+52M     01HVK64DTDA81799TBR9QDECEZ
+102M    01HVK64DS7TRZRWF2756KHST5X
+140M    01HVJS59K11FBVAPVY57K88Z11
+90M     01HVH2A5Z58SKT810EM6B9AT50
+152M    01HV8ZDVQMX41MKCN84S32RRZ1
+354M    01HV6Q2N26BK63G4RYTST71FBF
+156M    01HV664H9J9Z1FTZD73RD1563E
+216M    01HTHXB60A7F239HN7S2TENPNS
+104M    01HTHMGRXGS0WXA3WATRXHR36B
+----
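++
+Optionally, you can estimate how much space removing older blocks frees up by checking the total size of the TSDB directory. The following command is a sketch that reuses the debug pod pattern from the previous command; it assumes the `prometheus-k8s-0` pod, so replace the pod name with the pod mentioned in the alert.
++
+[source,terminal]
+----
+$ oc debug prometheus-k8s-0 -n openshift-monitoring \
+-c prometheus --image=$(oc get po -n openshift-monitoring prometheus-k8s-0 \
+-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
+-- sh -c 'du -hs /prometheus/'
+----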
+
+. Identify which blocks can be removed and how many, and then remove the blocks. The following example command removes the three oldest Prometheus TSDB blocks from the `prometheus-k8s-0` pod:
++
+[source,terminal]
+----
+$ oc debug prometheus-k8s-0 -n openshift-monitoring \
+-c prometheus --image=$(oc get po -n openshift-monitoring prometheus-k8s-0 \
+-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
+-- sh -c 'ls -latr /prometheus/ | egrep -o "[0-9|A-Z]{26}" | head -3 | \
+while read BLOCK; do rm -r /prometheus/$BLOCK; done'
+----
+
+. Verify the usage of the mounted PV and ensure there is enough space available by running the following command:
++
+[source,terminal]
+----
+$ oc debug <prometheus-k8s-pod-name> -n openshift-monitoring \// <1>
+--image=$(oc get po -n openshift-monitoring <prometheus-k8s-pod-name> \// <1>
+-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') -- df -h /prometheus/
+----
+<1> Replace `<prometheus-k8s-pod-name>` with the pod mentioned in the `KubePersistentVolumeFillingUp` alert description.
++
+The following example output shows the mounted PV claimed by the `prometheus-k8s-0` pod, which has 63% of its space remaining:
++
+.Example output
+[source,terminal]
+----
+Starting pod/prometheus-k8s-0-debug-j82w4 ...
+Filesystem      Size  Used Avail Use% Mounted on
+/dev/nvme0n1p4   40G   15G   26G  37% /prometheus
+
+Removing debug pod ...
+----
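++
+Optionally, you can also cross-check the capacity and status of the persistent volume claims (PVCs) in the `openshift-monitoring` project. This is a read-only sketch; the exact claim names depend on how persistent storage is configured for monitoring:
++
+[source,terminal]
+----
+$ oc get pvc -n openshift-monitoring
+----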
diff --git a/observability/monitoring/troubleshooting-monitoring-issues.adoc b/observability/monitoring/troubleshooting-monitoring-issues.adoc
index aa49a6aa3f..d5c21e2ddd 100644
--- a/observability/monitoring/troubleshooting-monitoring-issues.adoc
+++ b/observability/monitoring/troubleshooting-monitoring-issues.adoc
@@ -39,3 +39,6 @@ include::modules/monitoring-determining-why-prometheus-is-consuming-disk-space.a
 * xref:../../observability/monitoring/accessing-third-party-monitoring-apis.adoc#about-accessing-monitoring-web-service-apis_accessing-monitoring-apis-by-using-the-cli[Accessing monitoring APIs by using the CLI]
 * xref:../../observability/monitoring/configuring-the-monitoring-stack.adoc#setting-scrape-sample-and-label-limits-for-user-defined-projects_configuring-the-monitoring-stack[Setting a scrape sample limit for user-defined projects]
 * xref:../../support/getting-support.adoc#support-submitting-a-case_getting-support[Submitting a support case]
+
+// Resolving the KubePersistentVolumeFillingUp alert firing for Prometheus
+include::modules/monitoring-resolving-the-kubepersistentvolumefillingup-alert-firing-for-prometheus.adoc[leveloffset=+1]
\ No newline at end of file
diff --git a/support/troubleshooting/investigating-monitoring-issues.adoc b/support/troubleshooting/investigating-monitoring-issues.adoc
index 8070a5643b..adae4f21a1 100644
--- a/support/troubleshooting/investigating-monitoring-issues.adoc
+++ b/support/troubleshooting/investigating-monitoring-issues.adoc
@@ -9,7 +9,11 @@ toc::[]
 {product-title} includes a preconfigured, preinstalled, and self-updating monitoring stack that provides monitoring for core platform components. In {product-title} {product-version}, cluster administrators can optionally enable monitoring for user-defined projects.

 // Note - please update the following sentence if you add further modules to this assembly.
-You can follow these procedures if your own metrics are unavailable or if Prometheus is consuming a lot of disk space.
+Use these procedures if the following issues occur:
+
+* Your own metrics are unavailable.
+* Prometheus is consuming a lot of disk space.
+* The `KubePersistentVolumeFillingUp` alert is firing for Prometheus.

 // Investigating why user-defined metrics are unavailable
 include::modules/monitoring-investigating-why-user-defined-metrics-are-unavailable.adoc[leveloffset=+1]
@@ -28,3 +32,6 @@ include::modules/monitoring-determining-why-prometheus-is-consuming-disk-space.a
 .Additional resources

 * See xref:../../observability/monitoring/configuring-the-monitoring-stack.adoc#setting-scrape-sample-and-label-limits-for-user-defined-projects_configuring-the-monitoring-stack[Setting a scrape sample limit for user-defined projects] for details on how to set a scrape sample limit and create related alerting rules
+
+// Resolving the KubePersistentVolumeFillingUp alert firing for Prometheus
+include::modules/monitoring-resolving-the-kubepersistentvolumefillingup-alert-firing-for-prometheus.adoc[leveloffset=+1]
\ No newline at end of file