// Module included in the following assemblies:
//
// * support/troubleshooting/troubleshooting-installations.adoc
[id="investigating-master-node-installation-issues_{context}"]
= Investigating master node installation issues

If you experience master node installation issues, determine the master node, {product-title} software defined network (SDN), and network Operator status. Collect `kubelet.service`, `crio.service` journald unit logs, and master node container logs for visibility into master node agent, CRI-O container runtime, and pod activity.

.Prerequisites
* You have access to the cluster as a user with the `cluster-admin` role.
* You have installed the OpenShift CLI (`oc`).
* You have SSH access to your hosts.
* You have the fully qualified domain names of the bootstrap and master nodes.
* If you are hosting Ignition configuration files by using an HTTP server, you must have the HTTP server's fully qualified domain name and the port number. You must also have SSH access to the HTTP host.
+
[NOTE]
====
The initial `kubeadmin` password can be found in `<install_directory>/auth/kubeadmin-password` on the installation host.
====

.Procedure
. If you have access to the master node's console, monitor the console until the node reaches the login prompt. During the installation, Ignition log messages are output to the console.
. Verify Ignition file configuration.
+
* If you are hosting Ignition configuration files by using an HTTP server:
+
.. Verify the master node Ignition file URL. Replace `<http_server_fqdn>` with the HTTP server's fully qualified domain name and `<port>` with the port that the HTTP server listens on:
+
[source,terminal]
----
$ curl -I http://<http_server_fqdn>:<port>/master.ign <1>
----
<1> The `-I` option returns the header only. If the Ignition file is available on the specified URL, the command returns `200 OK` status. If it is not available, the command returns `404 file not found`.
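+
The exact response varies with your web server. A successful check begins with a status line similar to the following; this is an illustrative example, not literal output:
+
----
HTTP/1.1 200 OK
----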
+
.. To verify that the Ignition file was received by the master node, query the HTTP server logs on the serving host. For example, if you are using an Apache web server to serve Ignition files:
+
[source,terminal]
----
$ grep -is 'master.ign' /var/log/httpd/access_log
----
+
If the master Ignition file was received, the associated `HTTP GET` log message includes a `200 OK` success status, indicating that the request succeeded.
+
.. If the Ignition file was not received, check that it exists on the serving host directly. Ensure that the appropriate file and web server permissions are in place.
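+
For example, if the files are served from the Apache default document root (`/var/www/html` is an assumption; substitute the directory that your web server actually serves), confirm that `master.ign` exists and is readable:
+
[source,terminal]
----
$ ls -l /var/www/html/master.ign
----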
+
* If you are using a cloud provider mechanism to inject Ignition configuration files into hosts as part of their initial deployment:
+
.. Review the master node's console to determine if the mechanism is injecting the master node Ignition file correctly.
. Check the availability of the master node's assigned storage device.
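+
For example, if you have console or SSH access to the node, you can list the block devices that it detects. This check is an optional supplement to the documented procedure:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> lsblk
----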
. Verify that the master node has been assigned an IP address from the DHCP server.
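+
For example, inspect the node's network interfaces from its console or over SSH; `ip addr show` is a standard check and not specific to {product-title}:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> ip addr show
----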
. Determine master node status.
.. Query master node status:
+
[source,terminal]
----
$ oc get nodes
----
+
.. If one of the master nodes does not reach a `Ready` status, retrieve a detailed node description:
+
[source,terminal]
----
$ oc describe node <master_node>
----
+
[NOTE]
====
It is not possible to run `oc` commands if an installation issue prevents the {product-title} API from running, or if the kubelet is not yet running on each node.
====
+
. Determine {product-title} SDN status.
+
.. Review `sdn-controller`, `sdn`, and `ovs` daemon set status in the `openshift-sdn` namespace:
+
[source,terminal]
----
$ oc get daemonsets -n openshift-sdn
----
+
.. If those resources are listed as `Not found`, review pods in the `openshift-sdn` namespace:
+
[source,terminal]
----
$ oc get pods -n openshift-sdn
----
+
.. Review logs relating to failed {product-title} SDN pods in the `openshift-sdn` namespace:
+
[source,terminal]
----
$ oc logs <sdn_pod> -n openshift-sdn
----
. Determine cluster network configuration status.
.. Review whether the cluster's network configuration exists:
+
[source,terminal]
----
$ oc get network.config.openshift.io cluster -o yaml
----
+
.. If the installer failed to create the network configuration, generate the Kubernetes manifests again and review message output:
+
[source,terminal]
----
$ ./openshift-install create manifests
----
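+
If your installation assets are kept in a dedicated directory, pass it explicitly with `--dir`. The `<installation_directory>` placeholder is an assumption; replace it with the directory that holds your installation assets:
+
[source,terminal]
----
$ ./openshift-install create manifests --dir <installation_directory>
----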
+
.. Review the pod status in the `openshift-network-operator` namespace to determine whether the Cluster Network Operator (CNO) is running:
+
[source,terminal]
----
$ oc get pods -n openshift-network-operator
----
+
.. Gather network Operator pod logs from the `openshift-network-operator` namespace:
+
[source,terminal]
----
$ oc logs pod/<network_operator_pod_name> -n openshift-network-operator
----
. Monitor `kubelet.service` journald unit logs on master nodes, after they have booted. This provides visibility into master node agent activity.
.. Retrieve the logs using `oc`:
+
[source,terminal]
----
$ oc adm node-logs --role=master -u kubelet
----
+
.. If the API is not functional, review the logs using SSH instead. Replace `<master-node>.<cluster_name>.<base_domain>` with appropriate values:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> journalctl -b -f -u kubelet.service
----
+
[NOTE]
====
{product-title} {product-version} cluster nodes running {op-system-first} are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes by using SSH is not recommended and nodes will be tainted as _accessed_. Before attempting to collect diagnostic data over SSH, review whether the data collected by running `oc adm must-gather` and other `oc` commands is sufficient instead. However, if the {product-title} API is not available, or the kubelet is not properly functioning on the target node, `oc` operations will be impacted. In such situations, it is possible to access nodes using `ssh core@<node>.<cluster_name>.<base_domain>`.
====
+
. Retrieve `crio.service` journald unit logs on master nodes, after they have booted. This provides visibility into master node CRI-O container runtime activity.
.. Retrieve the logs using `oc`:
+
[source,terminal]
----
$ oc adm node-logs --role=master -u crio
----
+
.. If the API is not functional, review the logs using SSH instead:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> journalctl -b -f -u crio.service
----
. Collect logs from specific subdirectories under `/var/log/` on master nodes.
.. Retrieve a list of logs contained within a `/var/log/` subdirectory. The following example lists files in `/var/log/openshift-apiserver/` on all master nodes:
+
[source,terminal]
----
$ oc adm node-logs --role=master --path=openshift-apiserver
----
+
.. Inspect a specific log within a `/var/log/` subdirectory. The following example outputs `/var/log/openshift-apiserver/audit.log` contents from all master nodes:
+
[source,terminal]
----
$ oc adm node-logs --role=master --path=openshift-apiserver/audit.log
----
+
.. If the API is not functional, review the logs on each node using SSH instead. The following example tails `/var/log/openshift-apiserver/audit.log`:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo tail -f /var/log/openshift-apiserver/audit.log
----
. Review master node container logs using SSH.
.. List the containers:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl ps -a
----
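+
If the container list is long, you can narrow it down with a name filter. The filter value shown here (`etcd`) is only an illustration:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl ps -a --name etcd
----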
+
.. Retrieve a container's logs using `crictl`:
+
[source,terminal]
----
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl logs -f <container_id>
----
. If you experience master node configuration issues, verify that the Machine Config Operator (MCO), MCO endpoint, and DNS record are functioning. The MCO manages operating system configuration during the installation procedure. Also verify system clock accuracy and certificate validity.
.. Test whether the MCO endpoint is available. Replace `<cluster_name>` with your cluster's name:
+
[source,terminal]
----
$ curl https://api-int.<cluster_name>:22623/config/master
----
+
.. If the endpoint is unresponsive, verify load balancer configuration. Ensure that the endpoint is configured to run on port 22623.
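+
For example, you can test raw TCP connectivity to the port with a utility such as `nc`, assuming it is installed on the host that you are testing from:
+
[source,terminal]
----
$ nc -zv api-int.<cluster_name> 22623
----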
+
.. Verify that the MCO endpoint's DNS record is configured and resolves to the load balancer.
... Run a DNS lookup for the defined MCO endpoint name:
+
[source,terminal]
----
$ dig api-int.<cluster_name> @<dns_server>
----
+
... Run a reverse lookup against the MCO IP address that is assigned to the load balancer:
+
[source,terminal]
----
$ dig -x <load_balancer_mco_ip_address> @<dns_server>
----
+
.. Verify that the MCO is functioning from the bootstrap node directly. Replace `<bootstrap_fqdn>` with the bootstrap node's fully qualified domain name:
+
[source,terminal]
----
$ ssh core@<bootstrap_fqdn> curl https://api-int.<cluster_name>:22623/config/master
----
+
.. System clock time must be synchronized between bootstrap, master, and worker nodes. Check each node's system clock reference time and time synchronization statistics:
+
[source,terminal]
----
$ ssh core@<node>.<cluster_name>.<base_domain> chronyc tracking
----
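+
To see which individual time sources a node is using, you can also run `chronyc sources`. This is a standard chrony command and an optional supplement to the check above:
+
[source,terminal]
----
$ ssh core@<node>.<cluster_name>.<base_domain> chronyc sources
----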
+
.. Review certificate validity:
+
[source,terminal]
----
$ openssl s_client -connect api-int.<cluster_name>:22623 | openssl x509 -noout -text
----
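+
To check only the certificate validity dates, you can pipe the same connection into `openssl x509 -noout -dates`. Redirecting stdin from `/dev/null` makes `s_client` exit immediately instead of waiting for input; this variant is an optional supplement to the command above:
+
[source,terminal]
----
$ openssl s_client -connect api-int.<cluster_name>:22623 </dev/null 2>/dev/null | openssl x509 -noout -dates
----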