Currently, the approach for removing OpenStack load balancers is to look
for the appropriate tag (i.e. openshiftClusterID=<cluster_id>) and then
delete the resource. The issue with this approach is that no such tag is
applied to the load balancer resources, and there is no such tag in the
description field either. Hence, the deleteLoadBalancer function returns
zero results for existing load balancers, as they carry no such tag
either in their tags or in their description.
With this patch, deleteLoadBalancer has been refactored to get all the
load balancers and filter the resources that have the cluster ID in
their description, so that they can be safely deleted.
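A rough sketch of that filter using gophercloud's Octavia bindings; the
package, function and parameter names are illustrative, not the
installer's actual code:

```go
package sketch

import (
    "strings"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/loadbalancers"
)

// deleteClusterLoadBalancers lists every load balancer and cascade-deletes
// the ones whose description mentions the cluster ID.
func deleteClusterLoadBalancers(client *gophercloud.ServiceClient, clusterID string) error {
    allPages, err := loadbalancers.List(client, loadbalancers.ListOpts{}).AllPages()
    if err != nil {
        return err
    }
    allLBs, err := loadbalancers.ExtractLoadBalancers(allPages)
    if err != nil {
        return err
    }
    for _, lb := range allLBs {
        if !strings.Contains(lb.Description, clusterID) {
            continue // not one of ours
        }
        // Cascade delete removes listeners, pools and members with the LB.
        err := loadbalancers.Delete(client, lb.ID, loadbalancers.DeleteOpts{Cascade: true}).ExtractErr()
        if err != nil {
            return err
        }
    }
    return nil
}
```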
In 4.15 Kuryr is no longer a supported NetworkType, following its
deprecation in 4.12. This commit removes mentions of Kuryr from the
documentation and code, and also adds validation to prevent
installations from proceeding when `networkType` is set to `Kuryr`.
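A minimal sketch of the kind of validation added; the helper name is
hypothetical, not the installer's real validation code:

```go
package sketch

import "fmt"

// validateNetworkType rejects an install-config whose networkType is the
// removed Kuryr value.
func validateNetworkType(networkType string) error {
    if networkType == "Kuryr" {
        return fmt.Errorf("networkType %q is no longer supported: Kuryr was deprecated in 4.12 and removed in 4.15", networkType)
    }
    return nil
}
```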
Avoid duplicating the client configuration in multiple locations. This
also gives us a single point from which to start configuring a user
agent for the installer.
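A sketch of what a single construction point could look like with
gophercloud/utils; the function name and user-agent string are
illustrative:

```go
package sketch

import (
    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/utils/openstack/clientconfig"
)

// newServiceClient is the one place every caller goes through, so the user
// agent only has to be configured here.
func newServiceClient(service, cloud string) (*gophercloud.ServiceClient, error) {
    client, err := clientconfig.NewServiceClient(service, &clientconfig.ClientOpts{Cloud: cloud})
    if err != nil {
        return nil, err
    }
    client.ProviderClient.UserAgent.Prepend("openshift-installer")
    return client, nil
}
```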
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
cloud-provider-openstack can be configured to create security groups for
the NodePorts of the load balancers it is creating. These SGs are then
attached to the nodes. On `cluster destroy` we're orphaning them. This
commit makes sure that we're looking for them.
As they are neither tagged nor carry a proper cluster ID in their name,
we will look at each of the ports, list its SGs and evaluate them by
comparing their names with the pattern. If a name matches, `destroy`
will attempt to delete that SG.
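A sketch of that lookup; the name pattern regexp is an assumption made
for illustration, not necessarily the exact pattern used by
cloud-provider-openstack:

```go
package sketch

import (
    "regexp"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/security/groups"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// cpoSGPattern stands in for the real cloud-provider-openstack SG naming scheme.
var cpoSGPattern = regexp.MustCompile(`^lb-sg-`)

// deleteLeftoverNodeSGs walks the cluster-tagged ports, inspects the SGs
// attached to each of them and deletes the ones whose name matches the pattern.
func deleteLeftoverNodeSGs(client *gophercloud.ServiceClient, clusterTag string) error {
    allPages, err := ports.List(client, ports.ListOpts{Tags: clusterTag}).AllPages()
    if err != nil {
        return err
    }
    allPorts, err := ports.ExtractPorts(allPages)
    if err != nil {
        return err
    }
    for _, port := range allPorts {
        for _, sgID := range port.SecurityGroups {
            sg, err := groups.Get(client, sgID).Extract()
            if err != nil {
                continue // the SG may already be gone
            }
            if cpoSGPattern.MatchString(sg.Name) {
                // Best effort: the SG may still be attached to other ports.
                _ = groups.Delete(client, sg.ID).ExtractErr()
            }
        }
    }
    return nil
}
```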
When using dual-stack installations the user needs to pre-create the
API and Ingress Ports, given that OpenStack does not allow direct
assignment of addresses when using SLAAC/stateless; consequently, the
installer can't create those Ports. This commit adds support for
tagging those Ports, assigning security groups to them, attaching the
Floating IP when needed, and allowing clean-up of the resources.
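A sketch of the three steps for one pre-created port, with illustrative
names; the real installer code differs:

```go
package sketch

import (
    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/attributestags"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/layer3/floatingips"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// adoptPreCreatedPort tags a user-provided API or Ingress port, attaches the
// cluster security groups and, when requested, points a floating IP at it.
func adoptPreCreatedPort(client *gophercloud.ServiceClient, portID, clusterTag string, sgIDs []string, fipID string) error {
    // Tag the port so the destroy code can find it later.
    if err := attributestags.Add(client, "ports", portID, clusterTag).ExtractErr(); err != nil {
        return err
    }
    // Attach the cluster security groups.
    if _, err := ports.Update(client, portID, ports.UpdateOpts{SecurityGroups: &sgIDs}).Extract(); err != nil {
        return err
    }
    // Associate the floating IP with the port when one is needed.
    if fipID != "" {
        if _, err := floatingips.Update(client, fipID, floatingips.UpdateOpts{PortID: &portID}).Extract(); err != nil {
            return err
        }
    }
    return nil
}
```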
Some object-storage instances may be set to have a limit on the LIST
operation that is higher than the limit on the BULK DELETE operation. On
those clouds, objects in the BULK DELETE call beyond the limit are
silently ignored. As a consequence, the call to destroy the container
fails and object deletion is re-queued after a growing waiting time,
potentially delaying deletion by hours.
With this change, object bulk deletion is put in a loop. After checking
that no errors were encountered, we reduce the BULK DELETE list by the
number of processed objects and send it back to the server. As a
consequence, the object deletion routines should only complete when the
container is empty, thus avoiding the 409 error that causes a retry.
With this change:
* listing of container objects is no longer limited to 50, but left to
the default (which is 10000 on a standard Swift configuration);
* the object deletion calls are issued in concurrent goroutines rather
than serially, giving Swift a chance to process them in parallel. The
limit is set to 10 concurrent goroutines.
The goal of this change is to tackle waiting times on OCP destroy on
clusters with massive amounts of data stored in OpenStack object
storage.
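A simplified sketch of the bulk-delete loop (leaving out the concurrent
listing goroutines); names are illustrative:

```go
package sketch

import (
    "fmt"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/objectstorage/v1/objects"
)

// emptyContainer keeps issuing bulk deletes until every object name has been
// processed, so a server-side bulk-delete limit lower than the list limit
// cannot silently leave objects behind.
func emptyContainer(client *gophercloud.ServiceClient, container string, names []string) error {
    for len(names) > 0 {
        resp, err := objects.BulkDelete(client, container, names).Extract()
        if err != nil {
            return err
        }
        if len(resp.Errors) > 0 {
            return fmt.Errorf("bulk delete of container %q returned errors: %v", container, resp.Errors)
        }
        // Objects beyond the server's limit are silently ignored, so drop
        // only the ones that were processed and resend the rest.
        processed := resp.NumberDeleted + resp.NumberNotFound
        if processed == 0 {
            return fmt.Errorf("bulk delete of container %q made no progress", container)
        }
        names = names[processed:]
    }
    return nil
}
```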
With the bump to Gophercloud v1.1.1, the library should be able to
handle HTTP 204 responses that lack a `content-type` header without
erroring. The workaround that was in place to force contentful responses
can then be removed.
Some OpenStack object storages respond with `204 No Content` to list
requests when there are no containers or objects to list. In these
cases, when responding to requests with an `Accept: text/plain` or no
`Accept` header, some object storages omit the `content-type` header in
their status-204 responses.
Now, Gophercloud throws an error when the response does not contain a
`content-type` header.
With this change, we work around the issue by forcing Gophercloud to
request a JSON response from the object storage when listing objects.
When passed an `Accept: application/json` header, the server responds
with `200 OK` and a `content-type` header in our tests.
This solution gives us a fix that is easily backportable because it
doesn't require any dependency bump.
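A sketch of the workaround, assuming that gophercloud's 'full' object
listing is what makes it send the `Accept: application/json` header:

```go
package sketch

import (
    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/objectstorage/v1/objects"
)

// listObjectNames asks for the "full" (JSON) listing so that the server
// replies with a content-type header even when the container is empty.
func listObjectNames(client *gophercloud.ServiceClient, container string) ([]string, error) {
    allPages, err := objects.List(client, container, &objects.ListOpts{Full: true}).AllPages()
    if err != nil {
        return nil, err
    }
    return objects.ExtractNames(allPages)
}
```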
Some OpenStack object storages respond with `204 No Content` to list
requests when there are no containers or objects to list. In these
cases, when responding to requests with an `Accept: text/plain` or no
`Accept` header, some object storages omit the `content-type` header in
their status-204 responses.
Now, Gophercloud throws an error when the response does not contain a
`content-type` header.
With this change, we work around the issue by forcing Gophercloud to
request a JSON response from the object storage when listing containers.
When passed an `Accept: application/json` header, the server responds
with `200 OK` and a `content-type` header in our tests.
This solution gives us a fix that is easily backportable because it
doesn't require any dependency bump.
Note that in my local tests, I didn't find the 'full' listing to take
more time than the short, name-only response we were requesting from
Swift prior to this change.
d2630f2995 implemented deleting ports from
networks even if they're untagged. The motivation was to not block
destroy when some untagged ports were orphaned on the network (in
Neutron tagging is a separate operation that can fail).
As this was always done before deleting the network, we also deleted
the LoadBalancer Services' VIP ports (which are untagged). This means
that we couldn't track down the FIPs created for these Services, and
those FIPs were orphaned.
This commit makes sure that we only attempt to delete untagged ports on
a 409 failure to delete the network. We also do that only after the LBs
have been successfully deleted, to make sure all the Service FIPs are
already tracked down and taken care of.
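A sketch of the resulting ordering for one network, with illustrative
names; the conflict is detected through gophercloud's 409 error type:

```go
package sketch

import (
    "errors"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/networks"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// deleteNetworkWithFallback tries to delete the network first and only falls
// back to removing its (possibly untagged) ports when Neutron answers 409.
// It assumes the LBs and their FIPs have already been handled.
func deleteNetworkWithFallback(client *gophercloud.ServiceClient, networkID string) error {
    err := networks.Delete(client, networkID).ExtractErr()
    if err == nil {
        return nil
    }
    var conflict gophercloud.ErrDefault409
    if !errors.As(err, &conflict) {
        return err
    }
    // The network is still in use: remove every remaining port on it,
    // tagged or not, then retry the network deletion.
    allPages, err := ports.List(client, ports.ListOpts{NetworkID: networkID}).AllPages()
    if err != nil {
        return err
    }
    allPorts, err := ports.ExtractPorts(allPages)
    if err != nil {
        return err
    }
    for _, port := range allPorts {
        if err := ports.Delete(client, port.ID).ExtractErr(); err != nil {
            return err
        }
    }
    return networks.Delete(client, networkID).ExtractErr()
}
```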
This reverts commit a272e59b99. As noted
in the corresponding bug [1], the expectation was always to revert this
once cloud-provider-openstack started providing cluster ID information
in snapshot metadata, which has been the case for some time now [2].
Conflicts:
pkg/destroy/openstack/openstack.go
Changes:
pkg/destroy/openstack/openstack.go
NOTE(stephenfin): Conflicts are due to commit 375fe6f389 ("OpenStack:
Optimize cluster deletion") which changed two blocks so that we now
continue in a loop rather than returning early. We introduce this same
logic into a newly restored block inside 'deleteSnapshots' to prevent a
regression here.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1965468
[2] https://github.com/kubernetes/cloud-provider-openstack/pull/1544
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
Currently, we are trying to remove ports that are tagged, but sometimes
it happens that a port is created and the tagging fails, or we don't get
a response from the Neutron API, resulting in an untagged port.
Since the network is already tagged, we can use its ID to select all the
ports within that network, so that all the ports, not only the tagged
ones, are deleted.
OpenStack Neutron is slow, but it's sometimes able to handle quite a
few requests in parallel. This commit splits the operation of Neutron
port deletion into 10 concurrent goroutines to speed it up. This mostly
aids the Kuryr case, where we can have hundreds of ports whose presence
blocks deletion of other resources.
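A sketch of the 10-goroutine fan-out; error handling is reduced to a
counter for brevity:

```go
package sketch

import (
    "sync"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// deletePortsConcurrently deletes the given ports with at most 10 requests
// to Neutron in flight; failed deletions are counted and left for the next
// iteration.
func deletePortsConcurrently(client *gophercloud.ServiceClient, portIDs []string) int {
    const workers = 10
    sem := make(chan struct{}, workers) // caps concurrency at 10 goroutines
    var wg sync.WaitGroup
    var mu sync.Mutex
    failed := 0

    for _, id := range portIDs {
        wg.Add(1)
        sem <- struct{}{}
        go func(portID string) {
            defer wg.Done()
            defer func() { <-sem }()
            if err := ports.Delete(client, portID).ExtractErr(); err != nil {
                mu.Lock()
                failed++
                mu.Unlock()
            }
        }(id)
    }
    wg.Wait()
    return failed
}
```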
Tagging of networking resources is a hard requirement for OpenShift on
OpenStack, and we should refuse to run the installer when the underlying
OpenStack platform does not support it.
Also, the destroy script may delete unmanaged resources when network
tagging is not available. With this patch, the destroy script will
refuse to work when network tagging is not available.
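A sketch of the check; the exact Neutron extension alias to probe
('standard-attr-tag' below) is an assumption:

```go
package sketch

import (
    "fmt"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/extensions"
)

// validateTaggingSupport refuses to proceed when Neutron does not expose a
// resource-tagging extension.
func validateTaggingSupport(client *gophercloud.ServiceClient) error {
    if _, err := extensions.Get(client, "standard-attr-tag").Extract(); err != nil {
        return fmt.Errorf("the target cloud does not support Neutron resource tagging, which OpenShift requires: %w", err)
    }
    return nil
}
```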
Fixes Bug 2013877
Co-authored-by: Martin André <m.andre@redhat.com>
Co-authored-by: Pierre Prinetti <pierreprinetti@redhat.com>
When we destroy an entity in the public cloud relating to a cluster, we
now record the impact that the item had on the quota in the account that
the cluster was provisioned in. This will allow for downstream users of
the installer to reason about the footprint of clusters they run,
allowing for more automated reasoning about how many clusters of a type
can fit into an account.
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
When performing a cluster destroy we should
look for all cluster-tagged Networks that might be
connected to the router, regardless of whether it was created
by CAPO. This commit updates the filtering for the
lookup of Networks and ensures the Subnet is skipped
when no Gateway is set.
When using BYON with Kuryr the Router would be identified based
on the Primary Network used by the Servers, which would be filtered
by the device-owner compute:nova. However, when using AZs the
device-owner name can change. This commit fixes the issue by
identifying the Router by looking for any tagged Network that
has a Subnet connected to the Router. The goroutine for this
approach can't be run before the other clean-ups because the CNO
would always attempt to re-connect the service subnet to the
Router.
Co-Authored-By: Maysa Macedo <maysa.macedo95@gmail.com>
Co-Authored-By: Emilien Macchi <emilien@redhat.com>
To implement LoadBalancer Services, the OpenStack Cloud Provider
creates an LB in Octavia for each of them. Those LBs have floating IPs
associated with their VIPs. As this is handled by the Cloud Provider
itself, Octavia's cascade delete will not remove those FIPs and we need
to take care of that ourselves.
This commit adds deletion of the FIPs associated with the LBs when
destroying the cluster. In order to do that correctly, it was required
to split router deletion into two separate functions and make sure that
the FIP detachment and the actual removal of the routers happen after
all the other delete functions have finished (meaning that the LBs are
gone too). This is because if the FIP detachment happens first, we lose
the information about which FIP was attached to which LB, effectively
preventing us from handling their deletion.
Only the function detaching the subnets from the routers is running as
part of the main deletion step now.
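A sketch of the FIP clean-up for a single LB VIP port, with
illustrative names:

```go
package sketch

import (
    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/layer3/floatingips"
)

// deleteLoadBalancerFIPs removes the floating IPs pointing at an LB's VIP
// port. It has to run before the routers (and the LB itself) are removed,
// otherwise the FIP can no longer be tied back to the cluster.
func deleteLoadBalancerFIPs(networkClient *gophercloud.ServiceClient, vipPortID string) error {
    allPages, err := floatingips.List(networkClient, floatingips.ListOpts{PortID: vipPortID}).AllPages()
    if err != nil {
        return err
    }
    fips, err := floatingips.ExtractFloatingIPs(allPages)
    if err != nil {
        return err
    }
    for _, fip := range fips {
        if err := floatingips.Delete(networkClient, fip.ID).ExtractErr(); err != nil {
            return err
        }
    }
    return nil
}
```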
Previously each destroy function would exit on the first conflict and
was expected to retry on the next iteration, hoping that in the
meantime the conflict that prevented removal of the resource had been
fixed. While this strategy works, it is also slower than necessary. A
better strategy is to try deleting all resources and ignore the ones
that have conflicts. On the next iteration there will be fewer
conflicts.
This patch was tested on an OpenStack cluster with OpenShiftSDN, and
we observed a 37% faster cluster deletion. I expect the performance
boost to be significantly higher for clusters using Kuryr.
There could be cases where the trunk is not properly tagged; for
example, the UPI scripts do not set tags on trunks since the openstack
client doesn't support it.
Failure to delete trunks could result in the destroy command getting
stuck in a loop until it hits the timeout.
In these cases, the cluster destroy should be smart enough to try
deleting trunks whose parent port is tagged.
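A sketch of that fallback, with illustrative names:

```go
package sketch

import (
    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/trunks"
    "github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// deleteTrunksByParentPort removes trunks whose parent port carries the
// cluster tag, covering trunks that were never tagged themselves.
func deleteTrunksByParentPort(client *gophercloud.ServiceClient, clusterTag string) error {
    // Collect the IDs of the cluster-tagged ports.
    portPages, err := ports.List(client, ports.ListOpts{Tags: clusterTag}).AllPages()
    if err != nil {
        return err
    }
    taggedPorts, err := ports.ExtractPorts(portPages)
    if err != nil {
        return err
    }
    owned := make(map[string]struct{}, len(taggedPorts))
    for _, p := range taggedPorts {
        owned[p.ID] = struct{}{}
    }

    // Delete every trunk whose parent port belongs to the cluster.
    trunkPages, err := trunks.List(client, trunks.ListOpts{}).AllPages()
    if err != nil {
        return err
    }
    allTrunks, err := trunks.ExtractTrunks(trunkPages)
    if err != nil {
        return err
    }
    for _, trunk := range allTrunks {
        if _, ok := owned[trunk.PortID]; ok {
            if err := trunks.Delete(client, trunk.ID).ExtractErr(); err != nil {
                return err
            }
        }
    }
    return nil
}
```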
It turned out that the Cinder CSI driver doesn't attach the cluster ID
to volume snapshot metadata. To work around this we delete snapshots
based on their volume IDs.
When using your own Network for the Machines
connected to a tagged Router created by CNO
on an installation with Kuryr, the Subnet of
that Network is not tagged and consequently
not identified by the installer when destroying
the Router. This commit fixes the issue by
ensuring all Subnets connected to a Router
created by the installer or CNO are removed.
Currently the installer deletes only volumes created by the in-tree
provisioner, omitting those created by the CSI driver, which leads to
resource leakage.
This commit starts deleting CSI volumes and snapshots as well.
* github.com/gophercloud/gophercloud
* github.com/gophercloud/utils
* github.com/terraform-provider-openstack/terraform-provider-openstack
Also adjust a call to servergroup.List to comply with the new function
signature[1].
[1]: gophercloud/gophercloud#2070
Signed-off-by: Emilien Macchi <emilien@redhat.com>
During a FIP-less installation, VMs that were not
created by the installer can get a FIP detached when
the cluster is being destroyed. This commit fixes the
issue by detaching only FIPs that were created by the
installer or Kuryr during a FIP-less installation.
When using a BYO network, the cluster destroy was relying on the
primary network tag to identify the router used and remove
the extra interfaces. As the provided network is no longer tagged,
the cluster is not able to get cleaned up. This commit
fixes the issue by relying on the network used by the Servers
to identify the router.
If the user destroyed a cluster without removing all the associated
service LBs, the `destroy` command would fail to remove the network
and loop until it hits the timeout.
The destroy command now checks whether there are any leftover LBs whose
`VipNetworkID` matches the network ID and deprovisions them. We filter
on service LBs created by the OpenStack cloud provider, matching the
`Kubernetes external service` string in the description [1], to ensure
we're not destroying a user-created resource by mistake.
[1] https://github.com/openshift/kubernetes/blob/442a69c/staging/src/k8s.io/legacy-cloud-providers/openstack/openstack_loadbalancer.go#L446
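A sketch of that filter, with illustrative names; `VipNetworkID` and
the description string are the ones mentioned above:

```go
package sketch

import (
    "strings"

    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/loadbalancers"
)

// deleteLeftoverServiceLBs cascade-deletes the cloud-provider-created LBs
// that still sit on the cluster network so the network deletion can succeed.
func deleteLeftoverServiceLBs(lbClient *gophercloud.ServiceClient, networkID string) error {
    allPages, err := loadbalancers.List(lbClient, loadbalancers.ListOpts{}).AllPages()
    if err != nil {
        return err
    }
    allLBs, err := loadbalancers.ExtractLoadBalancers(allPages)
    if err != nil {
        return err
    }
    for _, lb := range allLBs {
        // Only touch LBs on our network that the cloud provider marked as
        // Kubernetes service LBs; everything else may belong to the user.
        if lb.VipNetworkID != networkID || !strings.Contains(lb.Description, "Kubernetes external service") {
            continue
        }
        err := loadbalancers.Delete(lbClient, lb.ID, loadbalancers.DeleteOpts{Cascade: true}).ExtractErr()
        if err != nil {
            return err
        }
    }
    return nil
}
```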
This commit explicitly disables reading auth data from env variables
by setting an invalid EnvPrefix. By doing this, we make sure that the
data from clouds.yaml is enough to authenticate.
After this change we don't have to unset the OS_CLOUD env variable
explicitly anymore.
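A sketch of the idea; the exact prefix value is illustrative:

```go
package sketch

import "github.com/gophercloud/utils/openstack/clientconfig"

// newClientOpts builds ClientOpts that only honour clouds.yaml: pointing
// EnvPrefix at a prefix no real variable uses means OS_* values (including
// OS_CLOUD) can no longer override the cloud chosen in the install-config.
func newClientOpts(cloud string) *clientconfig.ClientOpts {
    return &clientconfig.ClientOpts{
        Cloud:     cloud,
        EnvPrefix: "NO_ENV_VARS_",
    }
}
```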
Ref https://issues.redhat.com/browse/OSASINFRA-2152
We should unset the OS_CLOUD env variable during cloudinfo and session
generation, and during cluster destruction. We have to do this because
the real cloud name is defined by the user in the install-config.
OS_CLOUD has higher priority, so the user-defined value will be ignored
if OS_CLOUD contains something.
In case the Machines' Subnet is not connected to a Router
there is no need to clean up the interfaces from the
custom Router, as no additional interfaces would have
been created on it. Also, when using Kuryr, if a Service
of type LoadBalancer was created, a floating IP would get
created for the load balancer and the removal of the service
subnet from the router would be blocked.
This commit fixes both issues by skipping the custom Router
clean-up when no Router is found and by moving the custom
Router clean-up to after the load balancer removal.
To support a FIP-less installation and bring your
own network when using Kuryr, new interfaces are
added to a custom Router that may exist, enabling
traffic between Pods, Services and VMs. As the
Router is not created by the installer, its interfaces
must be cleaned up upon cluster destroy.
This commit solves the issue by discovering the
Router through the Primary Network and the gateway
interface attached to the Router.
In https://github.com/openshift/installer/pull/3818 we introduced
tag-based server deletion. Unfortunately it turned out we cannot destroy
servers created by previous versions, as no tags were set on them.
To fix this situation we go back to the previous solution: deleting
servers by metadata.
During deletion of containers, we get a list of available containers
first, and then we iterate through them to find the ones we need,
based on their metadata.
Since this is not an atomic operation, it may happen that a container
is removed in the meantime. We should ignore these cases and continue
iterating through the remaining ones.
Now, to delete servers, we first get a list of all available servers
from Nova, and then we iterate through them to find those with the
required metadata. In the case of a large number of servers, this can
take a very long time.
Fortunately, gophercloud introduced filtering by tags, so we can start
using this feature to get only the servers with the required tag.
https://github.com/gophercloud/gophercloud/pull/1759
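A sketch of the tag-based listing; names are illustrative:

```go
package sketch

import (
    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

// listClusterServers lets Nova do the filtering: only servers carrying the
// cluster tag are returned, instead of fetching every server and checking
// its metadata client-side.
func listClusterServers(computeClient *gophercloud.ServiceClient, clusterTag string) ([]servers.Server, error) {
    allPages, err := servers.List(computeClient, servers.ListOpts{Tags: clusterTag}).AllPages()
    if err != nil {
        return nil, err
    }
    return servers.ExtractServers(allPages)
}
```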