This is split out from CAPO starting with CAPO v0.12.0. Start deploying it manually
in preparation for a CAPO bump.
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
The MachinePool feature requires the S3:PutBucketLifecycleConfiguration
permission. The installer does not support machine pools, so
we can disable the feature gate to bypass this permission requirement.
PowerVC is an OpenStack-based cloud provider with some significant
differences. Since we can use the OpenStack provider for most of the
work, we will create a thin provider which will only handle the
differences.
Upgrading from CAPI v1beta1 -> v1beta2 will take a non-trivial amount
of work. I have captured that work in
https://issues.redhat.com/browse/CORS-3563
and set nolint directives so the linters do not fail on this package.
CAPI divided the API into subpackages.
For example:
sigs.k8s.io/cluster-api/exp/ipam/api/v1beta1 ->
sigs.k8s.io/cluster-api/api/ipam/v1beta1
sigs.k8s.io/cluster-api/api/v1beta1 ->
sigs.k8s.io/cluster-api/api/core/v1beta1
See: https://github.com/kubernetes-sigs/cluster-api/pull/12262
This updates the import paths accordingly.
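For illustration, the corresponding Go import changes look like this
(a sketch; the aliases and type names are only illustrative, the paths
are the ones listed above):
```
package main

import (
	// was: sigs.k8s.io/cluster-api/api/v1beta1
	clusterv1 "sigs.k8s.io/cluster-api/api/core/v1beta1"
	// was: sigs.k8s.io/cluster-api/exp/ipam/api/v1beta1
	ipamv1 "sigs.k8s.io/cluster-api/api/ipam/v1beta1"
)

func main() {
	_ = clusterv1.Cluster{}
	_ = ipamv1.IPAddressClaim{}
}
```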
This PR improves cross-platform compatibility.
It solves two main issues:
1. inconsistent line endings
2. inconsistent path separators
Path separators in the installer need to target two different
environments:
1. the OS where the installer runs
2. the OS where the injected files are used
This PR unifies the path separators used in case 2 to be UNIX path
separators, while those in case 1 remain platform-dependent, as
sketched below.
Ref: https://forum.golangbridge.org/t/filepath-join-or-path-join/13479
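A minimal sketch of that distinction (the paths and names here are
illustrative, not the installer's actual code):
```
package main

import (
	"fmt"
	"path"          // always forward slashes: for files injected into the cluster
	"path/filepath" // separator of the OS the installer runs on
)

func main() {
	// Paths written into generated/injected files stay UNIX-style
	// regardless of the host OS.
	injected := path.Join("etc", "kubernetes", "manifests")

	// Paths on the machine running the installer follow the host's
	// conventions (backslashes on Windows).
	local := filepath.Join("clusterdir", "auth", "kubeconfig")

	fmt.Println(injected, local)
}
```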
Known issues:
The spawned processes, including etcd.exe, kube-apiserver.exe,
and openshift-installer.exe, will not exit once the installation
is aborted or completed. Users need to terminate those processes
manually in Task Manager.
Adds support for certificate-based authentication in the CAPI Azure
installation method. Azure Service Operator requires that the
certificate's contents, rather than its path, be passed as an
environment variable. .pfx certificates are in binary format, so
we must convert the certificate to PEM and pass that.
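One way to do the conversion (a sketch using golang.org/x/crypto/pkcs12;
the installer's actual code may differ):
```
package example

import (
	"encoding/pem"
	"os"

	"golang.org/x/crypto/pkcs12"
)

// pfxToPEM converts a binary .pfx bundle into PEM text that can be put
// into an environment variable for Azure Service Operator.
func pfxToPEM(pfxPath, password string) (string, error) {
	data, err := os.ReadFile(pfxPath)
	if err != nil {
		return "", err
	}
	blocks, err := pkcs12.ToPEM(data, password)
	if err != nil {
		return "", err
	}
	var out []byte
	for _, b := range blocks {
		out = append(out, pem.EncodeToMemory(b)...)
	}
	return string(out), nil
}
```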
We're seeing the local control plane fail to start on systems with
resource restrictions (say a laptop in "quiet" mode). This commit
removes the hardcoded 10-second envtest timeouts in favor of the
default 20-second timeouts. Using the defaults also allows the timeouts
to be tuned with the environment variables:
KUBEBUILDER_CONTROLPLANE_START_TIMEOUT
KUBEBUILDER_CONTROLPLANE_STOP_TIMEOUT
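For reference, leaving the timeout fields unset on the envtest
environment is what lets the defaults (and the environment variables
above) take effect (a sketch, not the installer's actual code):
```
package example

import (
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func newTestEnv() *envtest.Environment {
	return &envtest.Environment{
		// Previously something like:
		//   ControlPlaneStartTimeout: 10 * time.Second,
		//   ControlPlaneStopTimeout:  10 * time.Second,
		// Leaving these zero falls back to envtest's 20s defaults,
		// which can still be overridden via the environment
		// variables listed above.
	}
}
```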
Updates the environment variable setting the token audience on the
Azure Service Operator controller in order to authenticate against
Azure Stack successfully.
When the host that runs the OpenShift install is configured with
IPv6 only, the kube-apiserver created with envtest would fail,
because service-cluster-ip-range would default to an IPv4 CIDR
while the public address family, derived from the host address,
would be IPv6. This commit fixes the issue by setting a default
IPv6 CIDR for service-cluster-ip-range when the host has no IPv4
address available.
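A sketch of the kind of change involved (the exact envtest calls and
the CIDR value are illustrative):
```
package example

import (
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// newIPv6TestEnv configures the envtest kube-apiserver with an IPv6
// service CIDR so it can start on an IPv6-only host.
func newIPv6TestEnv() *envtest.Environment {
	env := &envtest.Environment{}
	env.ControlPlane.GetAPIServer().Configure().
		Set("service-cluster-ip-range", "fd02::/112")
	return env
}
```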
Drop the metrics-bind-addr flag for the IBM Cloud CAPI
deployment, as it does not appear to be supported in newer
cluster-api releases.
Related: https://issues.redhat.com/browse/OCPBUGS-49319
Until we can bump CAPV to the latest version, disable session
keep-alive, which causes session timeouts and deadlocks as described
in the links attached to the bug.
Instead of shutting down the control plane as part of the controllers'
shutdown process, do it at Teardown time. This makes sure that the
local control plane binaries are stopped even when we fail to create
controllers, for example when creating a cloud session during
controller setup fails.
Some providers, like Azure, require two controllers to run. If a
controller failed to be spawned (e.g. cluster-api-provider-azureaso),
we were not stopping the controllers that were already running (e.g.
cluster-api, cluster-api-provider-azure), resulting in leaked processes
even though the Installer reported it had stopped the capi system:
```
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "azureaso infrastructure provider": failed to start controller "azureaso infrastructure provider": timeout waiting for process cluster-api-provider-azureaso to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)
INFO Shutting down local Cluster API control plane...
INFO Local Cluster API system has completed operations
```
By simply changing the order of operations so that controllers are run
*after* the WaitGroup is created, we are able to properly shut down all
running controllers and the local control plane in case of error:
```
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "aws infrastructure provider": failed to extract provider "aws infrastructure provider": fake error
INFO Shutting down local Cluster API control plane...
INFO Stopped controller: Cluster API
INFO Local Cluster API system has completed operations
```
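Schematically, the fix amounts to registering the shutdown path before
any controller is started (the types and names below are illustrative,
not the installer's actual code):
```
package example

import (
	"fmt"
	"sync"
)

// Controller is a stand-in for a CAPI provider process managed by the
// installer.
type Controller interface {
	Name() string
	Start(stop <-chan struct{}, wg *sync.WaitGroup) error
}

// runControllers registers the shutdown path *before* starting any
// controller, so a failure to start one controller still stops the
// controllers that are already running.
func runControllers(controllers []Controller) error {
	var wg sync.WaitGroup
	stop := make(chan struct{})
	defer func() {
		close(stop)
		wg.Wait() // wait for every controller that did start to exit
	}()

	for _, c := range controllers {
		if err := c.Start(stop, &wg); err != nil {
			// Returning (rather than exiting) lets the deferred
			// shutdown run.
			return fmt.Errorf("failed to run controller %q: %w", c.Name(), err)
		}
	}
	return nil
}
```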
Doing `logrus.Fatal` when a controller fails to be extracted means that
we abort the installer process without giving it a chance to stop the
capi-related processes that are still running.
Let's just return an error instead and let the Installer go through the
normal capi shutdown procedure.
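The change is essentially the following pattern (schematic;
extractProvider is a hypothetical stand-in):
```
package example

import "fmt"

// extractProvider is a stand-in for unpacking an embedded provider
// binary.
func extractProvider(name string) error { return nil }

// startProvider returns the error to its caller instead of calling
// logrus.Fatal, so the installer's normal CAPI shutdown still runs.
func startProvider(name string) error {
	if err := extractProvider(name); err != nil {
		return fmt.Errorf("failed to extract provider %q: %w", name, err)
	}
	return nil
}
```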
The EKS controller feature gate is enabled by default in CAPA, which
causes the following lines to show up in the logs:
```
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613409 349 logger.go:75] \"enabling EKS controllers and webhooks\" logger=\"setup\""
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613416 349 logger.go:81] \"EKS IAM role creation\" logger=\"setup\" enabled=false"
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613420 349 logger.go:81] \"EKS IAM additional roles\" logger=\"setup\" enabled=false"
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613425 349 logger.go:81] \"enabling EKS control plane controller\" logger=\"setup\""
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613449 349 logger.go:81] \"enabling EKS bootstrap controller\" logger=\"setup\""
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613464 349 logger.go:81] \"enabling EKS managed cluster controller\" logger=\"setup\""
time="2024-06-18T11:43:59Z" level=debug msg="I0618 11:43:59.613496 349 logger.go:81] \"enabling EKS managed machine pool controller\" logger=\"setup\""
```
Although harmless, they can be confusing for users. This change
disables the feature gate so the lines no longer appear and we are not
running controllers unnecessarily.
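The flag involved is along these lines (a sketch; check the CAPA
manager's --help for the exact spelling):
```
package example

// capaArgs sketches the extra argument passed to the CAPA controller to
// turn the EKS controllers off; the surrounding wiring is omitted.
func capaArgs() []string {
	return []string{
		"--feature-gates=EKS=false",
	}
}
```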
Etcd data is preserved to support the standalone
openshift-install destroy bootstrap command. We can only delete this
once bootstrap destroy has been completed. Teardown may be called
in other cases, such as an error or user interrupt, so this commit
introduces a separate function to delete the etcd directory specifically.
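A minimal sketch of such a function (directory names are illustrative):
```
package example

import (
	"os"
	"path/filepath"
)

// removeEtcdData deletes only the preserved etcd data directory. It is
// called once bootstrap destroy has completed, not from the general
// Teardown path.
func removeEtcdData(artifactsDir string) error {
	return os.RemoveAll(filepath.Join(artifactsDir, "etcd"))
}
```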
We save an etcd data dir in order to be able to restart the local
control plane to destroy the bootstrap node. Prior to this commit,
we saved etcd data in the binary directory, but that is cleaned
whenever the control plane shuts down, so it defeats the purpose
of saving the data for a restart.
Instead, save the etcd data in its own dir.
Prior to this commit, envtest.kubeconfig was placed in the auth
dir which contains the cluster kubeconfigs. Leaving the
envtest.kubeconfig in this dir may confuse users. Instead, let's
hide the kubeconfig in the capi artifacts directory.
Managed clusters might rely on the KUBECONFIG environment variable to
reach their kube API server. Instead of using the env var and possibly
causing issues, we can specify a custom kubeconfig via a command-line
argument for the capi controllers. That seems a more appropriate
approach for an ephemeral kube API like the one spawned by envtest.
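For illustration, the controllers receive something along these lines
(a sketch; the flag name should match what the controller binaries
accept):
```
package example

// controllerArgs sketches passing the envtest kubeconfig explicitly to
// a CAPI controller instead of relying on the KUBECONFIG env var.
func controllerArgs(envtestKubeconfig string) []string {
	return []string{
		"--kubeconfig=" + envtestKubeconfig,
	}
}
```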
Adds an option to skip the image upload from the installer when an
environment variable is set, since the upload takes a lot of time and
the marketplace images can be used instead.
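Roughly (the environment variable name here is hypothetical, not
necessarily the one the installer uses):
```
package example

import "os"

// skipImageUpload reports whether the (hypothetical) skip variable is
// set in the environment.
func skipImageUpload() bool {
	return os.Getenv("OPENSHIFT_INSTALL_SKIP_IMAGE_UPLOAD") != ""
}
```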
Changes the temporary directory where clusterapi dependencies
are unpacked from
<clusterdir>/bin/cluster-api
to
<clusterdir>/cluster-api
The clusterapi teardown function only removes the cluster-api dir,
so we are leaking the bin dir. There are other ways we could resolve
this issue, but ultimately we don't need to create a nested temporary
directory.
This commit removes the dependency on the installconfig and uses
metadata instead. This will make it easier to restart the capi system
at any point because we will not need to retrieve the installconfig
asset. At the moment, the standalone bootstrap destroy command is the
only example of when we will need to restart the CAPI control plane.
This is the result of the following steps:
1. Fork cluster-api-provider-openstack and revert its go.mod to Go v1.21
2. Replace the fork in the Installer's go.mod
3. Replace imports from v1alphaX to v1beta1
4. Update manifests to use the v1beta1 spec
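Step 2 corresponds to a go.mod replace directive along these lines (the
fork path and version are placeholders):
```
replace sigs.k8s.io/cluster-api-provider-openstack => github.com/example/cluster-api-provider-openstack v0.0.0-00010101000000-000000000000
```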