pkg/agent/logging.go:
QF1006: could lift into loop condition
Skip lint check.
pkg/asset/manifests/azure/cluster.go:
QF1003: could use tagged switch on subnetType
Use a switch statement instead of an if-else chain.
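For context, QF1003 suggests replacing an if-else chain that repeatedly compares the same value with a tagged switch. A minimal sketch of the pattern (the subnet type values here are illustrative, not the actual constants in cluster.go):

```go
package main

import "fmt"

// describeSubnet shows the QF1003 pattern: a tagged switch on a single value
// instead of an if-else chain. The subnet type values are made up for this
// sketch and are not the constants used in the installer.
func describeSubnet(subnetType string) string {
	switch subnetType {
	case "control-plane":
		return "subnet used by control-plane nodes"
	case "compute":
		return "subnet used by compute nodes"
	default:
		return fmt.Sprintf("unknown subnet type %q", subnetType)
	}
}

func main() {
	fmt.Println(describeSubnet("compute"))
}
```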
pkg/infrastructure/azure/storage.go:
QF1007: could merge conditional assignment into variable declaration
pkg/infrastructure/baremetal/image.go:
QF1009: probably want to use time.Time.Equal instead
Use the time.Time.Equal method rather than ==.
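QF1009 flags == comparisons between time.Time values. A short illustration of why Equal is the safer comparison:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	t1 := time.Now()
	t2 := t1.UTC() // same instant, different Location and no monotonic reading

	// == compares the struct fields, including the Location and the
	// monotonic clock reading, so identical instants can compare unequal.
	fmt.Println(t1 == t2) // false

	// Equal compares the time instant only.
	fmt.Println(t1.Equal(t2)) // true
}
```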
Removed custom agent wait-for install-complete code.
Moved installer WaitForInstallComplete function from
cmd/openshift-install/main to cmd/openshift-install/command so
that the function can be made public.
Modified agent.newWaitForInstallCompleted() to use the common
WaitForInstallComplete function.
The benefit of moving the agent over to the common
WaitForInstallComplete function is that the common function waits
for the cluster operators to reach a stable state before declaring
the cluster installation complete.
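The commit text only says that the common function waits for the operators to stabilize. A hedged sketch of what such a check can look like with openshift/client-go follows; it is an illustration of "operators in a stable state", not the installer's actual WaitForInstallComplete implementation.

```go
package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// operatorsStable reports whether every ClusterOperator is Available, not
// Progressing, and not Degraded.
func operatorsStable(ctx context.Context, kubeconfigPath string) (bool, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return false, err
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	operators, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, co := range operators.Items {
		available, progressing, degraded := false, false, false
		for _, cond := range co.Status.Conditions {
			switch cond.Type {
			case configv1.OperatorAvailable:
				available = cond.Status == configv1.ConditionTrue
			case configv1.OperatorProgressing:
				progressing = cond.Status == configv1.ConditionTrue
			case configv1.OperatorDegraded:
				degraded = cond.Status == configv1.ConditionTrue
			}
		}
		if !available || progressing || degraded {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	stable, err := operatorsStable(context.Background(), "/path/to/kubeconfig")
	fmt.Println(stable, err)
}
```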
- Store the three authentication tokens (userAuth, agentAuth, watcherAuth)
as a secret in the cluster when creating the node ISO.
- Automatically regenerate expired tokens and refresh the asset store
to maintain valid authentication credentials in the cluster secret.
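A hedged sketch of how such a secret can be written with client-go. The secret name and namespace (agent-auth-token in openshift-config) and the key names (userAuth, agentAuth, watcherAuth) come from this changelog, but the installer's actual code may differ.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// storeAuthTokens persists the three node-ISO auth tokens in a cluster
// secret, refreshing the values if the secret already exists.
func storeAuthTokens(ctx context.Context, client kubernetes.Interface, userAuth, agentAuth, watcherAuth string) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "agent-auth-token",
			Namespace: "openshift-config",
		},
		StringData: map[string]string{
			"userAuth":    userAuth,
			"agentAuth":   agentAuth,
			"watcherAuth": watcherAuth,
		},
	}
	_, err := client.CoreV1().Secrets(secret.Namespace).Create(ctx, secret, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// Regenerated tokens replace the expired ones stored earlier.
		_, err = client.CoreV1().Secrets(secret.Namespace).Update(ctx, secret, metav1.UpdateOptions{})
	}
	return err
}
```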
IPv6 was not being handled correctly when calling this function.
Use net.JoinHostPort.
From the docs:
JoinHostPort combines host and port into a network address of the form "host:port".
If host contains a colon, as found in literal IPv6 addresses, then JoinHostPort returns "[host]:port".
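A short illustration of the difference (the addresses and port are made up):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Manual concatenation produces an ambiguous address for IPv6 hosts.
	fmt.Println("fd2e:6f44:5dd8::1" + ":" + "8090") // fd2e:6f44:5dd8::1:8090

	// net.JoinHostPort brackets IPv6 literals as required.
	fmt.Println(net.JoinHostPort("fd2e:6f44:5dd8::1", "8090")) // [fd2e:6f44:5dd8::1]:8090
	fmt.Println(net.JoinHostPort("192.168.111.80", "8090"))    // 192.168.111.80:8090
}
```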
- Create 3 separate JWT tokens: AGENT_AUTH_TOKEN, USER_AUTH_TOKEN, WATCHER_AUTH_TOKEN (see the sketch after this list).
- Update the claim to set `auth_scheme` to identify the user persona.
- Assisted service checks `auth_scheme` to determine which user persona is allowed to access an endpoint.
- WATCHER_AUTH_TOKEN is sent in the `Watcher-Authorization` header and is used by the wait-for command (watcher persona).
- USER_AUTH_TOKEN is sent in the `Authorization` header and is used by curl API requests and systemd services (user persona).
- AGENT_AUTH_TOKEN is sent in the `X-Secret-Key` header and is used by the agent service (agent persona).
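A hedged sketch of the token/header relationship described above, assuming the golang-jwt library and an HMAC signing key; the persona strings, signing method, and expiry are illustrative, not the installer's actual token generation.

```go
package sketch

import (
	"net/http"
	"time"

	"github.com/golang-jwt/jwt/v5"
)

// mintToken creates a JWT whose auth_scheme claim identifies the persona.
func mintToken(secret []byte, persona string) (string, error) {
	claims := jwt.MapClaims{
		"auth_scheme": persona,
		"exp":         time.Now().Add(48 * time.Hour).Unix(),
	}
	return jwt.NewWithClaims(jwt.SigningMethodHS256, claims).SignedString(secret)
}

// addAuthHeader attaches the token using the header named for each persona
// in the list above.
func addAuthHeader(req *http.Request, persona, token string) {
	switch persona {
	case "user":
		req.Header.Set("Authorization", token)
	case "agent":
		req.Header.Set("X-Secret-Key", token)
	case "watcher":
		req.Header.Set("Watcher-Authorization", token)
	}
}
```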
Monitoring output is now batched and displayed every 5 seconds
for each node. This makes the logs easier to read because the lines
for each node are more likely to be grouped together.
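A sketch of the batching idea only: collect pending lines per node and flush them together on a 5-second tick. The node names and messages are made up; this is not the monitor command's code.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Pending log lines, keyed by node name (names are illustrative).
	batch := map[string][]string{
		"extraworker-0": {"Node joined cluster", "Node is Ready"},
		"extraworker-1": {"Waiting for first CSR approval"},
	}

	// Flush the batch once per tick so each node's lines appear together.
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	<-ticker.C
	for node, msgs := range batch {
		for _, m := range msgs {
			fmt.Printf("%s: %s\n", node, m)
		}
	}
}
```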
The log prefix was getting set the first time through the host loop and
not getting modified for subsequent hosts. We need to calculate it anew
for each host.
In the agent wait-for, when the APIs are not available, an attempt is
made to check the host by ssh'ing to it. This results in confusing
debug messages if the keys aren't present. This changes the check to
just test connectivity to the host.
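A hedged sketch of the simpler check, assuming a plain TCP dial to the SSH port is enough to confirm reachability; the port and timeout are assumptions, not the installer's actual values.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// hostReachable confirms the host accepts TCP connections on the SSH port
// without attempting a full SSH login, which needs keys and emits confusing
// debug output when they are absent.
func hostReachable(host string) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, "22"), 5*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	fmt.Println(hostReachable("192.168.111.80"))
}
```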
Creating a Nodes ISO:
- The secret does not exist:
  1. Generate a new public key and JWT token with an expiration time of 48 hours.
  2. Save the public key and token into the asset store.
  3. Create a secret named agent-auth-token in the openshift-config namespace with the token and public key from the asset store.
- The secret already exists:
  1. Retrieve the stored token and check whether the JWT token in the secret is older than 24 hours (a sketch of this refresh rule follows the list).
- The secret already exists and the token is older than 24 hours:
  1. Generate a new public key and JWT token with a new expiration time of 48 hours.
  2. Update the secret with the new public key and JWT token.
- The secret already exists and the token is not older than 24 hours:
  1. Retrieve the token and public key from the secret and update the asset store with the values from the secret.
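A hedged sketch of the refresh decision: with tokens issued for 48 hours, "older than 24 hours" is read here as "less than 24 hours of validity remaining", derived from the unverified exp claim. How the installer actually measures the age, and the JWT library used, are assumptions.

```go
package sketch

import (
	"fmt"
	"time"

	"github.com/golang-jwt/jwt/v5"
)

// needsRefresh reports whether the stored token should be regenerated.
func needsRefresh(tokenString string) (bool, error) {
	token, _, err := jwt.NewParser().ParseUnverified(tokenString, jwt.MapClaims{})
	if err != nil {
		return false, err
	}
	exp, err := token.Claims.GetExpirationTime()
	if err != nil {
		return false, err
	}
	if exp == nil {
		return false, fmt.Errorf("token has no exp claim")
	}
	// Issued for 48h; older than 24h means under 24h of validity remains.
	return time.Until(exp.Time) < 24*time.Hour, nil
}
```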
Running the monitor-add-nodes command:
1. Retrieve the token from the secret.
2. Register the agent installer client with the retrieved token.
3. Send the auth token to the assisted service API via HTTP headers (sketched below).
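A hedged sketch of steps 1 and 3: reading the token back from the agent-auth-token secret and attaching it to an assisted service request. The key name and header follow the lists above but are assumptions about the exact code.

```go
package sketch

import (
	"context"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watcherToken reads the watcher persona token out of the cluster secret.
func watcherToken(ctx context.Context, client kubernetes.Interface) (string, error) {
	secret, err := client.CoreV1().Secrets("openshift-config").Get(ctx, "agent-auth-token", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return string(secret.Data["watcherAuth"]), nil
}

// newMonitorRequest attaches the token via the Watcher-Authorization header.
func newMonitorRequest(ctx context.Context, url, token string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Watcher-Authorization", token)
	return req, nil
}
```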
Remove the "Neither --kubeconfig nor --master was specified.
Using the inClusterConfig. This might not work." warning, because
it does work and would confuse the user when the monitor-add-nodes
command is run in-cluster.
The linter was upgraded in https://github.com/openshift/release/pull/52723. It caught
three issues that are addressed here: a name change, a potentially exposed secret, and an unnecessary newline.
The first and second CSRs pending approval have the node name
(hostname) embedded in their specs. monitor-add-nodes should only
show CSRs pending approval for a specific node. Currently it shows
all CSRs pending approval for all nodes.
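A hedged sketch of how a pending CSR can be matched to one node: kubelet client CSRs carry the hostname in the subject (CN `system:node:<hostname>`) and serving CSRs carry it in the DNS SANs. This is an illustration, not the monitor command's actual filter.

```go
package sketch

import (
	"crypto/x509"
	"encoding/pem"
	"strings"

	certificatesv1 "k8s.io/api/certificates/v1"
)

// csrIsForNode reports whether a CSR embeds the given node's hostname.
func csrIsForNode(csr certificatesv1.CertificateSigningRequest, hostname string) bool {
	block, _ := pem.Decode(csr.Spec.Request)
	if block == nil {
		return false
	}
	parsed, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return false
	}
	if parsed.Subject.CommonName == "system:node:"+hostname {
		return true
	}
	for _, dns := range parsed.DNSNames {
		if strings.EqualFold(dns, hostname) {
			return true
		}
	}
	return false
}
```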
If the IP address of the node cannot be resolved to a hostname,
we will not be able to determine if there are any CSRs pending
approval for that node. The monitoring command will skip showing
CSRs pending approval. In this case, users can still approve the
CSRs, and the monitoring command will continue to check if the node
has joined the cluster and has become Ready.
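A hedged sketch of the hostname resolution step; whether the command really uses a reverse DNS lookup is an assumption, and the address is made up.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	names, err := net.LookupAddr("192.168.111.80")
	if err != nil || len(names) == 0 {
		// Without a hostname, pending CSRs cannot be matched to the node,
		// so the monitoring command skips displaying them.
		fmt.Println("cannot resolve hostname; skipping display of pending CSRs")
		return
	}
	hostname := strings.TrimSuffix(names[0], ".")
	fmt.Println("checking pending CSRs for", hostname)
}
```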
NewCluster needs both assetDir for the install workflow and
kubeconfigPath for the addnodes workflow.
Cluster.assetDir should only be initialized for the install
workflow.
Adds the ability to monitor a node being added during day 2.
The command is:
node-joiner monitor-add-nodes --kubeconfig <kubeconfig-file-path>
<IP-address-of-node-to-monitor>
Both the kubeconfig file and the IP address are required.
Multi-node monitoring will be added in a future PR.
The function now requires kubeconfig file path, rendezvousIP, and
sshKey as parameters. Previously it had a single parameter, assetStore,
and it searched the asset store to determine the three parameters
above.
When the host fails to boot due to pending-user-action (this can
occur when the disk boot order is set incorrectly), log the
message at Warning level instead of Debug to make it obvious.
This also removes an additional spurious info message that was
being logged.
When the cluster status is installing-pending-user-action, the install
won't complete; most likely this is due to an invalid boot disk. When
this status is detected, also log status_info for the hosts that
have this status.
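A hedged sketch of the logging change with logrus; the host struct here is illustrative, not the assisted-service model the installer actually reads.

```go
package sketch

import "github.com/sirupsen/logrus"

// host is an illustrative stand-in for the monitored host data.
type host struct {
	Name       string
	Status     string
	StatusInfo string
}

// logHostStatus logs at Warning level, including status_info, when the host
// is stuck in installing-pending-user-action; otherwise it stays at Debug.
func logHostStatus(log *logrus.Logger, h host) {
	if h.Status == "installing-pending-user-action" {
		log.Warnf("host %s requires user action: %s", h.Name, h.StatusInfo)
		return
	}
	log.Debugf("host %s status: %s", h.Name, h.Status)
}
```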
If the REST API and Kube API are not reachable, it may be because
network connectivity checks are preventing the install from progressing
(which will also prevent the Node0 SSH server from starting). Attempt
to SSH to the node and provide instructions for further debugging to the
user upon failure.
When running the 'agent wait-for install-complete' command, we first
check that bootstrapping is complete (by running the equivalent of
'agent wait-for bootstrap-complete'). However, if this failed because the
bootstrapping timed out, we would report it as an install failure along
with the corresponding debug messages (stating that the problem is with
the cluster operators, and inevitably failing to fetch data about them).
If the failure occurs during bootstrapping, report it as a bootstrap
error, the same as you would get from 'agent wait-for
bootstrap-complete'.
err is always nil at this point: we check it further up, and it is
never overwritten because the variable of the same name declared inside
the anonymous function shadows it (overwriting the outer err was
probably what was intended).
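An illustration of the shadowing pitfall described above; the error message is made up.

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	var err error // already checked to be nil further up in the real code

	func() {
		// ":=" declares a new err that shadows the outer one, so the
		// outer variable is never updated.
		err := errors.New("something failed inside the closure")
		_ = err
	}()

	fmt.Println(err == nil) // true: the outer err was never overwritten
}
```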