// Module included in the following assemblies:
//
// * support/troubleshooting/troubleshooting-installations.adoc
:_mod-docs-content-type: PROCEDURE
[id="investigating-master-node-installation-issues_{context}"]
= Investigating control plane node installation issues

If you experience control plane node installation issues, determine the status of the {product-title} OVN-Kubernetes network plugin and the Cluster Network Operator. Collect `kubelet.service` and `crio.service` journald unit logs and control plane node container logs for visibility into control plane node agent, CRI-O container runtime, and pod activity.

.Prerequisites
ifndef::openshift-rosa,openshift-dedicated[]
* You have access to the cluster as a user with the `cluster-admin` role.
endif::openshift-rosa,openshift-dedicated[]
ifdef::openshift-rosa,openshift-dedicated[]
* You have access to the cluster as a user with the `dedicated-admin` role.
endif::openshift-rosa,openshift-dedicated[]
* You have installed the OpenShift CLI (`oc`).
* You have SSH access to your hosts.
* You have the fully qualified domain names of the bootstrap and control plane nodes.
* If you are hosting Ignition configuration files by using an HTTP server, you must have the HTTP server's fully qualified domain name and the port number. You must also have SSH access to the HTTP host.
+
[NOTE]
====
The initial `kubeadmin` password can be found in `<install_directory>/auth/kubeadmin-password` on the installation host.
====
.Procedure
. If you have access to the console for the control plane node, monitor the console until the node reaches the login prompt. During the installation, Ignition log messages are output to the console.
. Verify Ignition file configuration.
+
* If you are hosting Ignition configuration files by using an HTTP server.
+
.. Verify the control plane node Ignition file URL. Replace `<http_server_fqdn>` with the HTTP server's fully qualified domain name:
+
[source,terminal]
----
$ curl -I http://<http_server_fqdn>:<port>/master.ign <1>
----
<1> The `-I` option returns the header only. If the Ignition file is available at the specified URL, the command returns a `200 OK` status. If it is not available, the command returns a `404 file not found` status.
+
.. To verify that the Ignition file was received by the control plane node, query the HTTP server logs on the serving host. For example, if you are using an Apache web server to serve Ignition files:
+
[source,terminal]
----
$ grep -is 'master.ign' /var/log/httpd/access_log
----
+
If the `master.ign` Ignition file was received, the associated `HTTP GET` log message includes a `200 OK` status, indicating that the request succeeded.
+
.. If the Ignition file was not received, check that it exists on the serving host directly. Ensure that the appropriate file and web server permissions are in place.
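+
As an example, if the files are served from the default Apache document root, which is an assumption that you must adjust for your web server layout, list the file and its permissions:
+
[source,terminal]
----
$ ls -l /var/www/html/master.ign
----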
+
* If you are using a cloud provider mechanism to inject Ignition configuration files into hosts as part of their initial deployment.
+
.. Review the console for the control plane node to determine if the mechanism is injecting the control plane node Ignition file correctly.
. Check the availability of the storage device assigned to the control plane node.
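+
For example, if the node is reachable over SSH, you can list its block devices to confirm that the installation disk is present. This check is illustrative, and the device names depend on your platform:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> lsblk
----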
. Verify that the control plane node has been assigned an IP address from the DHCP server.
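+
For example, if the node is reachable over SSH, you can review the addresses assigned to its network interfaces. This check is illustrative; you can also confirm the lease on the DHCP server:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> ip addr show
----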
. Determine control plane node status.
.. Query control plane node status:
+
[source,terminal]
----
$ oc get nodes
----
+
.. If one of the control plane nodes does not reach a `Ready` status, retrieve a detailed node description:
+
[source,terminal]
----
$ oc describe node <master_node>
----
+
[NOTE]
====
It is not possible to run `oc` commands if an installation issue prevents the {product-title} API from running or if the kubelet is not yet running on each node.
====
. Determine OVN-Kubernetes status.
+
.. Review the `ovnkube-node` daemon set status in the `openshift-ovn-kubernetes` namespace:
+
[source,terminal]
----
$ oc get daemonsets -n openshift-ovn-kubernetes
----
+
.. If those resources are listed as `Not found`, review pods in the `openshift-ovn-kubernetes` namespace:
+
[source,terminal]
----
$ oc get pods -n openshift-ovn-kubernetes
----
+
.. Review logs relating to failed {product-title} OVN-Kubernetes pods in the `openshift-ovn-kubernetes` namespace:
+
[source,terminal]
----
$ oc logs <ovn-k_pod> -n openshift-ovn-kubernetes
----
. Determine cluster network configuration status.
.. Review whether the cluster's network configuration exists:
+
[source,terminal]
----
$ oc get network.config.openshift.io cluster -o yaml
----
+
.. If the installer failed to create the network configuration, generate the Kubernetes manifests again and review message output:
+
[source,terminal]
----
$ ./openshift-install create manifests
----
+
.. Review the pod status in the `openshift-network-operator` namespace to determine whether the Cluster Network Operator (CNO) is running:
+
[source,terminal]
----
$ oc get pods -n openshift-network-operator
----
+
.. Gather Cluster Network Operator pod logs from the `openshift-network-operator` namespace:
+
[source,terminal]
----
$ oc logs pod/<network_operator_pod_name> -n openshift-network-operator
----
. Monitor `kubelet.service` journald unit logs on control plane nodes after they have booted. This provides visibility into control plane node agent activity.
.. Retrieve the logs using `oc`:
+
[source,terminal]
----
$ oc adm node-logs --role=master -u kubelet
----
+
.. If the API is not functional, review the logs using SSH instead. Replace `<master-node>.<cluster_name>.<base_domain>` with appropriate values:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> journalctl -b -f -u kubelet.service
----
+
[NOTE]
====
{product-title} {product-version} cluster nodes running {op-system-first} are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes by using SSH is not recommended. Before attempting to collect diagnostic data over SSH, review whether the data collected by running `oc adm must-gather` and other `oc` commands is sufficient instead. However, if the {product-title} API is not available, or the kubelet is not properly functioning on the target node, `oc` operations will be impacted. In such situations, it is possible to access nodes by using `ssh core@<node>.<cluster_name>.<base_domain>`.
====
. Retrieve `crio.service` journald unit logs on control plane nodes after they have booted. This provides visibility into control plane node CRI-O container runtime activity.
.. Retrieve the logs using `oc`:
+
[source,terminal]
----
$ oc adm node-logs --role=master -u crio
----
+
.. If the API is not functional, review the logs using SSH instead:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> journalctl -b -f -u crio.service
----
. Collect logs from specific subdirectories under `/var/log/` on control plane nodes.
.. Retrieve a list of logs contained within a `/var/log/` subdirectory. The following example lists files in `/var/log/openshift-apiserver/` on all control plane nodes:
+
[source,terminal]
----
$ oc adm node-logs --role=master --path=openshift-apiserver
----
+
.. Inspect a specific log within a `/var/log/` subdirectory. The following example outputs `/var/log/openshift-apiserver/audit.log` contents from all control plane nodes:
+
[source,terminal]
----
$ oc adm node-logs --role=master --path=openshift-apiserver/audit.log
----
+
.. If the API is not functional, review the logs on each node using SSH instead. The following example tails `/var/log/openshift-apiserver/audit.log`:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo tail -f /var/log/openshift-apiserver/audit.log
----
. Review control plane node container logs using SSH.
.. List the containers:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl ps -a
----
+
.. Retrieve a container's logs using `crictl`:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl logs -f <container_id>
----
. If you experience control plane node configuration issues, verify that the Machine Config Operator (MCO), MCO endpoint, and DNS record are functioning. The MCO manages operating system configuration during the installation procedure. Also verify system clock accuracy and certificate validity.
.. Test whether the MCO endpoint is available. Replace `<cluster_name>` with the appropriate value:
+
[source,terminal]
----
$ curl https://api-int.<cluster_name>:22623/config/master
----
+
.. If the endpoint is unresponsive, verify the load balancer configuration. Ensure that the endpoint is configured to run on port 22623.
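+
For example, if your load balancer runs on a Linux host that you can access, you can confirm that a front end is listening on port 22623. This check is illustrative; adapt it to your load balancer product:
+
[source,terminal]
----
$ sudo ss -tlnp | grep 22623
----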
+
.. Verify that the MCO endpoint's DNS record is configured and resolves to the load balancer.
... Run a DNS lookup for the defined MCO endpoint name:
+
[source,terminal]
----
$ dig api-int.<cluster_name> @<dns_server>
----
+
... Run a reverse lookup against the MCO IP address that is assigned on the load balancer:
+
[source,terminal]
----
$ dig -x <load_balancer_mco_ip_address> @<dns_server>
----
+
.. Verify that the MCO is functioning from the bootstrap node directly. Replace `<bootstrap_fqdn>` with the bootstrap node's fully qualified domain name:
+
[source,terminal]
----
$ ssh core@<bootstrap_fqdn> curl https://api-int.<cluster_name>:22623/config/master
----
+
.. System clock time must be synchronized between bootstrap, control plane, and worker nodes. Check each node's system clock reference time and time synchronization statistics:
+
[source,terminal]
----
$ ssh core@<node>.<cluster_name>.<base_domain> chronyc tracking
----
+
.. Review certificate validity:
+
[source,terminal]
----
$ openssl s_client -connect api-int.<cluster_name>:22623 | openssl x509 -noout -text
----