
// Module included in the following assemblies:
//
// * machine_management/deploying-machine-health-checks.adoc
// * post_installation_configuration/node-tasks.adoc

:_mod-docs-content-type: CONCEPT
[id="machine-health-checks-resource_{context}"]
= Sample MachineHealthCheck resource

The `MachineHealthCheck` resource for all cloud-based installation types, other than bare metal, resembles the following YAML file:

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example <1>
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role> <2>
      machine.openshift.io/cluster-api-machine-type: <role> <2>
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> <3>
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s" <4>
    status: "False"
  - type: "Ready"
    timeout: "300s" <4>
    status: "Unknown"
  maxUnhealthy: "40%" <5>
  nodeStartupTimeout: "10m" <6>
----
<1> Specify the name of the machine health check to deploy.
<2> Specify a label for the machine pool that you want to check.
<3> Specify the machine set to track in `<cluster_name>-<label>-<zone>` format. For example, `prod-node-us-east-1a`.
<4> Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine is remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine.
<5> Specify the number of machines allowed to be concurrently remediated in the targeted pool. This can be set as a percentage or an integer. If the number of unhealthy machines exceeds the limit set by `maxUnhealthy`, remediation is not performed.
<6> Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy.

[NOTE]
====
The `matchLabels` are examples only; you must map your machine groups based on your specific needs.
====

[id="machine-health-checks-short-circuiting_{context}"]
== Short-circuiting machine health check remediation

Short-circuiting ensures that machine health checks remediate machines only when the cluster is healthy.
Short-circuiting is configured through the `maxUnhealthy` field in the `MachineHealthCheck` resource.

If you define a value for the `maxUnhealthy` field, the `MachineHealthCheck` resource compares that value with the number of machines in its target pool that it has determined to be unhealthy before it remediates any machines. Remediation is not performed if the number of unhealthy machines exceeds the `maxUnhealthy` limit.

[IMPORTANT]
====
If `maxUnhealthy` is not set, the value defaults to `100%` and the machines are remediated regardless of the state of the cluster.
====

The appropriate `maxUnhealthy` value depends on the scale of the cluster you deploy and how many machines the `MachineHealthCheck` covers. For example, you can use the `maxUnhealthy` value to cover multiple compute machine sets across multiple availability zones so that if you lose an entire zone, your `maxUnhealthy` setting prevents further remediation within the cluster. In global Azure regions that do not have multiple availability zones, you can use availability sets to ensure high availability.
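
The zone-loss scenario can be sketched as follows. This example is an illustration only and rests on several assumptions: the name `example-all-compute` is hypothetical, the `worker` label values assume that your compute machines carry those role and type labels, the cluster is assumed to have three equally sized availability zones, and the `30%` value is chosen only because it is lower than the roughly 33% of machines that become unhealthy when one such zone is lost. Because the selector omits the `machine.openshift.io/cluster-api-machineset` label, it matches machines from every machine set that carries the role and type labels.

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-all-compute # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # Assumed label values. No machine set label is given, so machines
      # from all matching compute machine sets are in the target pool.
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  # Assumed value: with three equally sized zones, losing one zone makes
  # about 33% of the pool unhealthy, which exceeds 30% and stops remediation.
  maxUnhealthy: "30%"
  nodeStartupTimeout: "10m"
----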

[IMPORTANT]
====
If you configure a `MachineHealthCheck` resource for the control plane, set the value of `maxUnhealthy` to `1`.

This configuration ensures that the machine health check takes no action when multiple control plane machines appear to be unhealthy. Multiple unhealthy control plane machines can indicate that the etcd cluster is degraded or that a scaling operation to replace a failed machine is in progress.

If the etcd cluster is degraded, manual intervention might be required. If a scaling operation is in progress, the machine health check should allow it to finish.
====
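
A control plane machine health check that follows this guidance might resemble the following sketch. The name `example-control-plane` is hypothetical, and the `master` label values are an assumption about how the control plane machines in your cluster are labeled; verify the labels on your own machines before you rely on them. The condition and timeout values are copied from the sample resource in this module and are not tuned for control plane machines.

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-control-plane # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # Assumed label values for control plane machines.
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  # Take no action if more than one control plane machine is unhealthy.
  maxUnhealthy: 1
----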

The `maxUnhealthy` field can be set as either an integer or a percentage.
The limit is calculated differently depending on which form the `maxUnhealthy` value takes.

=== Setting maxUnhealthy by using an absolute value

If `maxUnhealthy` is set to `2`:

* Remediation is performed if 2 or fewer machines are unhealthy
* Remediation is not performed if 3 or more machines are unhealthy

These values are independent of how many machines are being checked by the machine health check.
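
As a minimal fragment, an absolute limit can be written as a plain integer in the `spec` stanza. The rest of the resource is omitted here.

[source,yaml]
----
spec:
  # Absolute limit: remediate only while 2 or fewer machines are unhealthy,
  # regardless of how many machines the selector matches.
  maxUnhealthy: 2
----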

=== Setting maxUnhealthy by using percentages

If `maxUnhealthy` is set to `40%` and there are 25 machines being checked:

* Remediation is performed if 10 or fewer machines are unhealthy
* Remediation is not performed if 11 or more machines are unhealthy

If `maxUnhealthy` is set to `40%` and there are 6 machines being checked:

* Remediation is performed if 2 or fewer machines are unhealthy
* Remediation is not performed if 3 or more machines are unhealthy

[NOTE]
====
When the `maxUnhealthy` percentage of the checked machines is not a whole number, the allowed number of machines is rounded down.
====
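
The two percentage cases above work out as follows. The fragment shows only the `maxUnhealthy` field, with the arithmetic given in comments; the rest of the resource is omitted.

[source,yaml]
----
spec:
  # 25 machines checked: 40% of 25 = 10, so the limit is 10 machines.
  # 6 machines checked: 40% of 6 = 2.4, rounded down to 2 machines.
  maxUnhealthy: "40%"
----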