Give a better error message when the gather logs are not collected
rather than
```
time="2024-06-05T08:34:45-04:00" level=error msg="The bootstrap machine did not execute the release-image.service systemd unit"
```
The release-image service could have been executed; we simply cannot tell, because the installer may be unable to connect to the bootstrap node (e.g., in a private install).
The machine manifests from CAPI have multiple addresses, including IPv4 and IPv6. In vSphere CI specifically, IPv6 is non-routed, and since it is the first address in the list it gets used by default. This causes bootstrap gather failures.
This PR returns all available addresses from the machine manifest and, for vSphere only, prioritizes the IPv4 address.
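A minimal sketch of the idea, assuming the addresses come from the CAPI `Machine` status; `prioritizeIPv4` is an illustrative helper, not the installer's actual function:
```
package main

import (
	"fmt"
	"net"
	"sort"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// prioritizeIPv4 returns every address on the machine, with IPv4 addresses
// ordered before IPv6 ones; relative order within each family is preserved.
func prioritizeIPv4(machine *clusterv1.Machine) []string {
	var addrs []string
	for _, a := range machine.Status.Addresses {
		addrs = append(addrs, a.Address)
	}
	isV4 := func(ip net.IP) bool { return ip != nil && ip.To4() != nil }
	sort.SliceStable(addrs, func(i, j int) bool {
		return isV4(net.ParseIP(addrs[i])) && !isV4(net.ParseIP(addrs[j]))
	})
	return addrs
}

func main() {
	m := &clusterv1.Machine{}
	m.Status.Addresses = clusterv1.MachineAddresses{
		{Type: clusterv1.MachineExternalIP, Address: "fd65:a1a8:60ad::10"}, // non-routed IPv6 listed first
		{Type: clusterv1.MachineExternalIP, Address: "192.168.1.10"},
	}
	fmt.Println(prioritizeIPv4(m)) // [192.168.1.10 fd65:a1a8:60ad::10]
}
```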
Fix returning the wrong variable.
The Installer unconditionally requires the permissions needed to create IAM roles, even when users bring existing roles. Those permissions should be required only when the Installer actually creates the roles.
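A minimal, self-contained sketch of that logic; the permission names are real AWS IAM actions, but `requiredPermissions` and the groupings are illustrative, not the installer's actual permission sets:
```
package main

import "fmt"

// Base actions the Installer always needs (truncated, illustrative).
var basePermissions = []string{"ec2:RunInstances", "ec2:DescribeInstances"}

// Actions only needed when the Installer itself creates the IAM roles.
var createRolePermissions = []string{"iam:CreateRole", "iam:PutRolePolicy", "iam:TagRole"}

// requiredPermissions appends the role-creation actions only when the user
// has not brought existing roles in the install-config.
func requiredPermissions(usesExistingRoles bool) []string {
	perms := append([]string{}, basePermissions...)
	if !usesExistingRoles {
		perms = append(perms, createRolePermissions...)
	}
	return perms
}

func main() {
	fmt.Println(requiredPermissions(true))  // BYO roles: no iam:CreateRole required
	fmt.Println(requiredPermissions(false)) // Installer-created roles
}
```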
This change partially reverts https://github.com/openshift/installer/pull/5286. IAM roles created by the Installer are now consistently tagged with "owned". We should also tag BYO roles so we know which clusters are using them, and so that they are not deleted by the installer during cluster destroy.
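A hedged sketch of how a BYO role could be tagged with the AWS SDK's `iam.TagRole` call; the tag key and the "shared" value are assumptions for illustration (the commit only states that installer-created roles are tagged "owned"), and the surrounding code is not the installer's:
```
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/iam"
)

// tagBYORole tags a user-supplied role so we can tell which cluster uses it.
// The "shared" value is an assumption: it marks the role as in use without
// marking it "owned", so cluster destroy does not delete it.
func tagBYORole(client *iam.IAM, roleName, infraID string) error {
	_, err := client.TagRole(&iam.TagRoleInput{
		RoleName: aws.String(roleName),
		Tags: []*iam.Tag{{
			Key:   aws.String(fmt.Sprintf("kubernetes.io/cluster/%s", infraID)),
			Value: aws.String("shared"),
		}},
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	if err := tagBYORole(iam.New(sess), "my-byo-master-role", "mycluster-abcde"); err != nil {
		log.Fatal(err)
	}
}
```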
Before this patch, we used the Neutron call to add tags to the newly
created security groups. However, that API doesn't accept tags
containing special characters such as slash (`/`), even when URL-encoded.
With this change, the security groups are tagged with an alternative API
call (replace-all-tags) which accepts the tags in a JSON object.
Apparently, Neutron accepts special characters (including slash) when
they come in a JSON object.
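A minimal sketch of that call path using gophercloud's `attributestags.ReplaceAll` helper, which wraps the `PUT /v2.0/security-groups/{id}/tags` request; the client setup, security group ID, and tag values are illustrative:
```
package main

import (
	"log"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/attributestags"
)

func main() {
	authOpts, err := openstack.AuthOptionsFromEnv()
	if err != nil {
		log.Fatal(err)
	}
	provider, err := openstack.AuthenticatedClient(authOpts)
	if err != nil {
		log.Fatal(err)
	}
	networkClient, err := openstack.NewNetworkV2(provider, gophercloud.EndpointOpts{})
	if err != nil {
		log.Fatal(err)
	}

	// Tags travel in the JSON body of the replace-all-tags request, so a
	// value containing a slash is accepted.
	sgID := "11111111-2222-3333-4444-555555555555" // illustrative security group ID
	tags, err := attributestags.ReplaceAll(networkClient, "security-groups", sgID,
		attributestags.ReplaceAllOpts{
			Tags: []string{"openshiftClusterID=mycluster-abcde", "custom/tag-with-slash"},
		}).Extract()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("tags now set: %v", tags)
}
```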
For instance types where `OnHostMaintenance` is set to `Terminate`, GCP requires the `DiscardLocalSsd` value to be defined; otherwise we get the following error when destroying a cluster:
```
WARNING failed to stop instance jiwei-0530b-q9t8w-worker-c-ck6s8 in zone us-central1-c: googleapi: Error 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command., badRequest
```
We are setting the value to `true` because we are about to destroy the
cluster, which means destroying the instances and all cluster-owned
resources.
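A hedged sketch of the stop call, assuming the `compute/v1` Go client exposes the `discardLocalSsd` query parameter as a `DiscardLocalSsd` option on the stop call; the project, zone, and instance names are illustrative:
```
package main

import (
	"context"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// We are destroying the cluster anyway, so the local SSD contents can be
	// discarded; passing an explicit value satisfies GCP's requirement when
	// OnHostMaintenance is Terminate.
	_, err = svc.Instances.
		Stop("my-project", "us-central1-c", "my-worker-instance").
		DiscardLocalSsd(true). // assumed client option mapping to discard-local-ssd
		Context(ctx).
		Do()
	if err != nil {
		log.Fatal(err)
	}
}
```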
`clusterapi.System().Run()` is not atomic: it can fail after the local control plane (kube-apiserver, etcd) or some controllers are already running. To make sure the capi system is properly shut down and the etcd data is cleaned up on errors, we need to `defer` the cleanup before we even attempt to run the capi system.
Instead of shutting down the control plane as part of the controllers' shutdown process, it should be done at Teardown time. This makes sure that the local control plane binaries are stopped even when we fail to create controllers, for example when creating a cloud session during controller setup.
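A minimal sketch of that ordering; the surrounding function and package are illustrative, and the exact `Run`/`Teardown` signatures are assumed:
```
package infrastructure

import (
	"context"
	"fmt"

	"github.com/openshift/installer/pkg/clusterapi"
)

func provision(ctx context.Context) error {
	// Deferred before Run() is even attempted: if Run() fails after the
	// local control plane or some controllers are already up, Teardown still
	// stops them and removes the local etcd data.
	defer clusterapi.System().Teardown()

	if err := clusterapi.System().Run(ctx); err != nil {
		return fmt.Errorf("failed to run cluster api system: %w", err)
	}

	// ... provision infrastructure against the local control plane ...
	return nil
}
```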
Some providers, like Azure, require two controllers to run. If a controller failed to spawn (e.g., cluster-api-provider-azureaso), we were not stopping controllers that were already running (e.g., cluster-api, cluster-api-provider-azure), resulting in leaked processes even though the Installer reported it had stopped the capi system:
```
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "azureaso infrastructure provider": failed to start controller "azureaso infrastructure provider": timeout waiting for process cluster-api-provider-azureaso to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)
INFO Shutting down local Cluster API control plane...
INFO Local Cluster API system has completed operations
```
By simply changing the order of operations to run the controller *after* the WaitGroup is created, we are able to properly shut down all running controllers and the local control plane in case of error:
```
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "aws infrastructure provider": failed to extract provider "aws infrastructure provider": fake error
INFO Shutting down local Cluster API control plane...
INFO Stopped controller: Cluster API
INFO Local Cluster API system has completed operations
```
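An illustrative, self-contained sketch of the ordering fix (not the installer's actual code): the controller's shutdown bookkeeping is registered *before* the controller process is started, so a failure to start a later controller still stops everything already running:
```
package main

import (
	"errors"
	"fmt"
	"sync"
)

type controller struct {
	name string
	stop chan struct{}
}

// run simulates starting the controller process; the last one fails,
// mimicking the azureaso timeout in the log above.
func (c *controller) run() error {
	if c.name == "azureaso infrastructure provider" {
		return errors.New("timeout waiting for process to start")
	}
	fmt.Println("started", c.name)
	return nil
}

func main() {
	var wg sync.WaitGroup
	var running []*controller

	runAll := func() error {
		for _, name := range []string{"cluster-api", "azure infrastructure provider", "azureaso infrastructure provider"} {
			c := &controller{name: name, stop: make(chan struct{})}

			// Bookkeeping first: once the controller is tracked in the
			// WaitGroup and the running list, the shutdown path below will
			// stop it even if a later controller fails to start.
			wg.Add(1)
			running = append(running, c)
			go func() {
				<-c.stop
				fmt.Println("stopped controller:", c.name)
				wg.Done()
			}()

			if err := c.run(); err != nil {
				return fmt.Errorf("failed to run controller %q: %w", c.name, err)
			}
		}
		return nil
	}

	err := runAll()

	// Shutdown: every registered controller is stopped, regardless of where
	// runAll failed, and we wait for all of them before exiting.
	for _, c := range running {
		close(c.stop)
	}
	wg.Wait()
	if err != nil {
		fmt.Println("ERROR:", err)
	}
}
```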
Doing `logrus.Fatal` when a controller fails to be extracted means that
we abort the installer process without giving it a chance to stop the
capi-related processes that are still running.
Let's just return an error instead and let the Installer go through the
normal capi shutdown procedure.
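A small, self-contained sketch of that change; `extractProvider` and `runController` are illustrative stand-ins, not the installer's actual functions:
```
package main

import (
	"errors"
	"fmt"

	"github.com/sirupsen/logrus"
)

// extractProvider stands in for unpacking the embedded controller binary.
func extractProvider(name string) (string, error) {
	return "", errors.New("fake error")
}

// runController returns the error to the caller instead of calling
// logrus.Fatal, so the caller can still run the capi shutdown path.
func runController(name string) error {
	binary, err := extractProvider(name)
	if err != nil {
		// Previously this was logrus.Fatal(err), which exits immediately
		// and leaks the controllers that are already running.
		return fmt.Errorf("failed to extract provider %q: %w", name, err)
	}
	logrus.Infof("starting %s from %s", name, binary)
	return nil
}

func main() {
	if err := runController("aws infrastructure provider"); err != nil {
		logrus.Error(err) // the caller logs and proceeds to the normal shutdown
	}
}
```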
These changes update the RHCOS 4.16 boot image metadata. The notable change in this update is:
OCPBUGS-36147 - Resizing LUKS on a 512e disk causes ignition-ostree-growfs to fail with "Device size is not aligned to requested sector size."
This change was generated using:
```
plume cosa2stream --target data/data/coreos/rhcos.json \
--distro rhcos --no-signatures --name 4.16-9.4 \
--url https://rhcos.mirror.openshift.com/art/storage/prod/streams \
x86_64=416.94.202406251923-0 \
aarch64=416.94.202406251923-0 \
s390x=416.94.202406251923-0 \
ppc64le=416.94.202406251923-0
```
With CAPI being the default, these configs are no longer used and are therefore likely to become unmaintained. Users who still wish to use the configs can access them in the 4.15 branch, where they are still maintained.
The EKS controller feature gate is enabled by default in CAPA, which
causes the following lines to show up in the logs:
```
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613409 349 logger.go:75] \"enabling EKS controllers and webhooks\" logger=\"setup\""
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613416 349 logger.go:81] \"EKS IAM role creation\" logger=\"setup\" enabled=false"
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613420 349 logger.go:81] \"EKS IAM additional roles\" logger=\"setup\" enabled=false"
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613425 349 logger.go:81] \"enabling EKS control plane controller\" logger=\"setup\""
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613449 349 logger.go:81] \"enabling EKS bootstrap controller\" logger=\"setup\""
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613464 349 logger.go:81] \"enabling EKS managed cluster controller\" logger=\"setup\""
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613496 349 logger.go:81] \"enabling EKS managed machine pool controller\" logger=\"setup\""
```
Although harmless, these lines can be confusing for users. This change disables the feature so the lines go away and we are not running controllers unnecessarily.
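A hedged sketch of how the extra argument might be passed when launching the local CAPA controller; the `--feature-gates` flag and the `EKS` gate name come from CAPA, while the command construction here is illustrative, not the installer's actual code:
```
package main

import (
	"fmt"
	"os/exec"
)

// capaCommand builds the command used to launch the local CAPA controller;
// everything here except the feature-gate value is illustrative.
func capaCommand(binary string) *exec.Cmd {
	return exec.Command(binary,
		// Disable the EKS controllers and webhooks that CAPA enables by
		// default; the installer never provisions EKS, so the controllers
		// and their "enabling EKS ..." log lines are just noise.
		"--feature-gates=EKS=false",
	)
}

func main() {
	fmt.Println(capaCommand("cluster-api-provider-aws").String())
}
```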
- Remove a duplicate test.
- Since the CAPV infrastructure can and will create custom folders, the check that makes sure the folder pre-exists is no longer valid. Changed the test to pass when there is no expected error.