1
0
mirror of https://github.com/coreos/prometheus-operator.git synced 2026-02-05 06:45:27 +01:00
Files
prometheus-operator/Documentation/troubleshooting.md

139 lines
6.0 KiB
Markdown
Raw Normal View History

2021-03-09 01:03:31 +01:00
---
weight: 600
toc: true
title: Troubleshooting
menu:
docs:
parent: operator
lead: ""
images: []
draft: false
description: Guide on troubleshooting the Prometheus Operator.
date: "2021-03-08T08:49:31+00:00"
2021-03-09 01:03:31 +01:00
---
2017-05-09 12:35:53 +03:00
### RBAC on Google Container Engine (GKE)
When you try to create `ClusterRole` (`kube-state-metrics`, `prometheus` `prometheus-operator`, etc.) on GKE Kubernetes cluster running 1.6 version, you will probably run into permission errors:
```
<....>
Error from server (Forbidden): error when creating
"manifests/prometheus-operator/prometheus-operator-cluster-role.yaml":
2017-05-09 12:35:53 +03:00
clusterroles.rbac.authorization.k8s.io "prometheus-operator" is forbidden: attempt to grant extra privileges:
<....>
```
2017-05-09 12:35:53 +03:00
This is due to the way Container Engine checks permissions. From [Google Kubernetes Engine docs](https://cloud.google.com/kubernetes-engine/docs/how-to/role-based-access-control):
2017-05-09 12:35:53 +03:00
> Because of the way Container Engine checks permissions when you create a Role or ClusterRole, you must first create a RoleBinding that grants you all of the permissions included in the role you want to create.
> An example workaround is to create a RoleBinding that gives your Google identity a cluster-admin role before attempting to create additional Role or ClusterRole permissions.
> This is a known issue in the Beta release of Role-Based Access Control in Kubernetes and Container Engine version 1.6.
To overcome this, you must grant your current Google identity `cluster-admin` Role:
```console
# get current google identity
$ gcloud info | grep Account
Account: [myname@example.org]
# grant cluster-admin to your current identity
$ kubectl create clusterrolebinding myname-cluster-admin-binding --clusterrole=cluster-admin --user=myname@example.org
Clusterrolebinding "myname-cluster-admin-binding" created
```
2018-04-08 14:53:30 +02:00
### Troubleshooting ServiceMonitor changes
When creating/deleting/modifying `ServiceMonitor` objects it is sometimes not as obvious what piece is not working properly. This section gives a step by step guide how to troubleshoot such actions on a `ServiceMonitor` object.
#### Overview of `ServiceMonitor` tagging and related elements
A common problem related to `ServiceMonitor` identification by Prometheus is related to an incorrect tagging, that does not match the `Prometheus` custom resource definition scope, or lack of permission for the Prometheus `ServiceAccount` to *get, list, watch* `Services` and `Endpoints` from the target application being monitored. As a general guideline consider the diagram below, giving an example of a `Deployment` and `Service` called `my-app`, being monitored by Prometheus based on a `ServiceMonitor` named `my-service-monitor`:
![flow diagram](custom-metrics-elements.png)
Note: The `ServiceMonitor` references a `Service` (not a `Deployment`, or a `Pod`), by labels *and* by the port name in the `Service`. This *port name* is optional in Kubernetes, but must be specified for the `ServiceMonitor` to work. It is not the same as the port name on the `Pod` or container, although it can be.
2018-04-08 14:53:30 +02:00
#### Has my `ServiceMonitor` been picked up by Prometheus?
`ServiceMonitor` objects and the namespace where they belong are selected by the `serviceMonitorSelector` and `serviceMonitorNamespaceSelector`of a Prometheus object. The name of a `ServiceMonitor` is encoded in the Prometheus configuration, so you can simply grep whether it is present there. The configuration generated by the Prometheus Operator is stored in a Kubernetes `Secret`, named after the Prometheus object name prefixed with `prometheus-` and is located in the same namespace as the Prometheus object. For example for a Prometheus object called `k8s` one can find out if the `ServiceMonitor` named `my-service-monitor` has been picked up with:
2018-04-08 14:53:30 +02:00
```sh
kubectl -n monitoring get secret prometheus-k8s -ojson | jq -r '.data["prometheus.yaml.gz"]' | base64 -d | gunzip | grep "my-service-monitor"
2018-04-08 14:53:30 +02:00
```
### Prometheus kubelet metrics server returned HTTP status 403 Forbidden
Prometheus is installed, all looks good, however the `Targets` are all showing as down. All permissions seem to be good, yet no joy. Prometheus pulling metrics from all namespaces expect kube-system, and Prometheus has access to all namespaces including kube-system.
#### Did you check the webhooks?
Issue has been resolved by amending the webhooks to use `0.0.0.0` instead of `127.0.0.1`. Follow the below commands and it will update the webhooks which allows connections to all `clusterIP's` in all `namespaces` and not just `127.0.0.1`.
**Update the kubelet service to include webhook and restart:**
```sh
KUBEADM_SYSTEMD_CONF=/etc/systemd/system/kubelet.service.d/10-kubeadm.conf
sed -e "/cadvisor-port=0/d" -i "$KUBEADM_SYSTEMD_CONF"
if ! grep -q "authentication-token-webhook=true" "$KUBEADM_SYSTEMD_CONF"; then
sed -e "s/--authorization-mode=Webhook/--authentication-token-webhook=true --authorization-mode=Webhook/" -i "$KUBEADM_SYSTEMD_CONF"
fi
systemctl daemon-reload
systemctl restart kubelet
```
**Modify the kube controller and kube scheduler to allow for reading data:**
```sh
sed -e "s/- --address=127.0.0.1/- --address=0.0.0.0/" -i /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -e "s/- --address=127.0.0.1/- --address=0.0.0.0/" -i /etc/kubernetes/manifests/kube-scheduler.yaml
```
2018-04-08 14:53:30 +02:00
### Using textual port number instead of port name
The ServiceMonitor expects to use the port name as defined on the Service. So, using the Service example from the
diagram above, we have this Service definition:
```yaml
kind: Service
metadata:
labels:
k8s-app: my-app
name: my-app
...
spec:
ports:
- name: metrics
port: 8080
selector:
k8s-app: my-app
```
We would then define the service monitor using `metrics` as the port not `"8080"`. E.g.
**CORRECT**
```yaml
kind: ServiceMonitor
metadata:
name: my-app
spec:
...
endpoints:
- port: metrics
```
**INCORRECT**
```yaml
kind: ServiceMonitor
metadata:
name: my-app
spec:
...
endpoints:
- port: "8080"
```
The incorrect example will give an error along these lines `spec.endpoints.port in body must be of type string: "integer"`