At the moment we try to allocate ports in the tests, then close them, and
then start alertmanager on those ports. This is very brittle and often
fails.
Fix the race conditions by directly starting alertmanager on
system-allocated free ports (using :0 in the address) and then detecting
the ports used, and using those in the test.
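As a minimal in-process sketch of the mechanism this relies on: binding to port 0 lets the kernel pick a free port, and the listener reports which one it got. The helper name `listenOnFreePort` is hypothetical; the tests themselves start alertmanager as a separate process, and how they detect its ports is not shown here.

```go
package main

import (
	"fmt"
	"net"
)

// listenOnFreePort binds to port 0 so the kernel picks a free TCP port,
// then reports the port that was actually assigned. Because the listener
// stays open, no other process can grab the port in between.
func listenOnFreePort() (net.Listener, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	return ln, ln.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	ln, port, err := listenOnFreePort()
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	fmt.Println("listening on port", port)
}
```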
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
Co-authored-by: Guido Trotter <guido@hudson-trading.com>
* MockWebhook: track shutdown status more gracefully
When we close the server, messages are sometimes interrupted and
can't be decoded by the webhook. This change accepts any decoding
failure that occurs after server shutdown.
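One way to picture this, as a sketch only: the type `mockWebhook`, its `closing` field and the `decode` method are hypothetical names, not the test package's actual code. A decoding error counts as a failure only while the server is still running.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"sync/atomic"
)

// mockWebhook is a hypothetical stand-in for the test webhook.
type mockWebhook struct {
	closing atomic.Bool // set to true just before the server is shut down
}

// decode parses a JSON request body. A decoding error is reported only
// while the server is running; after shutdown it is treated as a request
// cut short by the shutdown and ignored.
func (w *mockWebhook) decode(body string, v interface{}) error {
	err := json.NewDecoder(strings.NewReader(body)).Decode(v)
	if err != nil && w.closing.Load() {
		return nil
	}
	return err
}

func main() {
	w := &mockWebhook{}
	var payload map[string]string
	truncated := `{"status":` // body interrupted mid-message
	fmt.Println("error while running:", w.decode(truncated, &payload) != nil)
	w.closing.Store(true)
	fmt.Println("error after shutdown:", w.decode(truncated, &payload) != nil)
}
```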
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
* Call NewWebhook passing 't', and protect against nil cmd in Terminate
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
---------
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
Co-authored-by: Guido Trotter <guido@hudson-trading.com>
* Fix silence import: also wait for the error collection goroutine to finish
As noticed by George Robinson, the error collection goroutine in silence
import is also not waited for, so we may get an incorrect count when we
exit. This adds a done channel for that goroutine and checks, with a new
test, that the error count is correct.
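The done-channel pattern described above can be sketched as follows. The function `countErrors` and its workers are illustrative stand-ins, not the amtool code: the point is that the count is read only after the collector goroutine has signalled that it finished.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// countErrors starts n workers that each report one error and returns how
// many errors the collector goroutine saw.
func countErrors(n int) int {
	errc := make(chan error)
	done := make(chan struct{})
	count := 0

	// Collector: owns count until it closes done.
	go func() {
		defer close(done)
		for range errc {
			count++
		}
	}()

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			errc <- errors.New("import failed")
		}()
	}

	wg.Wait()   // every worker has sent its error
	close(errc) // lets the collector's range loop end
	<-done      // without this wait, count could be read too early
	return count
}

func main() {
	fmt.Println("errors collected:", countErrors(50))
}
```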
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
* bulkImport: use sync.Once to close channels
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
---------
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
Co-authored-by: Guido Trotter <guido@hudson-trading.com>
This change adds a new cmd flag `--dispatch.start-delay` which
corresponds to the `--rules.alert.resend-delay` flag in Prometheus.
This flag controls the minimum amount of time that Prometheus waits
before resending an alert to Alertmanager.
By adding this value to Alertmanager's start time, we delay the
aggregation groups' first flush until we are confident that all alerts
have been resent by the Prometheus instances.
This should help avoid race conditions in inhibitions after a (re)start.
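The arithmetic can be sketched roughly as below. `remainingStartDelay` and `processStart` are hypothetical names, and this is only one plausible reading of "adding this value to the start time"; the dispatcher's real logic, including how it interacts with other timers, is not reproduced here.

```go
package main

import (
	"fmt"
	"time"
)

// processStart is a hypothetical record of when the process started.
var processStart = time.Now()

// remainingStartDelay returns how long a first flush should still be held
// back so that it does not happen before start + startDelay.
func remainingStartDelay(start time.Time, startDelay time.Duration, now time.Time) time.Duration {
	if wait := start.Add(startDelay).Sub(now); wait > 0 {
		return wait
	}
	return 0 // the start delay has already elapsed
}

func main() {
	delay := time.Minute // assumed to match the Prometheus resend delay
	fmt.Println(remainingStartDelay(processStart, delay, processStart.Add(20*time.Second)))
	fmt.Println(remainingStartDelay(processStart, delay, processStart.Add(2*time.Minute)))
}
```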
Other improvements:
- remove hasFlushed flag from aggrGroup
- remove mutex locking from aggrGroup
Signed-off-by: Alexander Rickardsson <alxric@aiven.io>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Co-authored-by: Alexander Rickardsson <alxric@aiven.io>
* Fix erroneous channels close
While we did fix the goroutine leak in bulk import in case of errors,
unfortunately we broke the non-error path: the silence channel needs
to be closed so that addSilenceWorker terminates its loop and wg.Wait
returns.
Solve this by having just one cleanup function that is called on
defer, and also manually before returning, ensuring the error count is
correct and all workers have indeed been collected.
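A rough sketch of the single-cleanup idea, using sync.Once so the deferred call and the manual call cannot close a channel twice. `bulkProcess` and its channels are illustrative only and simplify the real bulk import.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// bulkProcess feeds items to workers and returns the number of failures.
// An empty item stands in for a silence that fails to import.
func bulkProcess(items []string) int {
	itemc := make(chan string)
	errc := make(chan error)
	done := make(chan struct{})
	errCount := 0

	go func() { // error collector
		defer close(done)
		for range errc {
			errCount++
		}
	}()

	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for it := range itemc { // ends only once itemc is closed
				if it == "" {
					errc <- errors.New("empty item")
				}
			}
		}()
	}

	var once sync.Once
	cleanup := func() {
		once.Do(func() {
			close(itemc) // lets the workers leave their loops
			wg.Wait()    // all workers collected
			close(errc)  // lets the collector finish
			<-done       // errCount is final from here on
		})
	}
	defer cleanup() // covers any early return

	for _, it := range items {
		itemc <- it
	}

	cleanup() // run before reading errCount; the deferred call is then a no-op
	return errCount
}

func main() {
	fmt.Println("failures:", bulkProcess([]string{"a", "", "b", ""}))
}
```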
Fixes issue introduced in https://github.com/prometheus/alertmanager/pull/4556
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
* Add import silence cli tests
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
---------
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
Co-authored-by: Guido Trotter <guido@hudson-trading.com>
This commit fixes a bug where an invalid silence causes incomplete
updates of existing silences. This is fixed by moving validation
out of the setSilence method and putting it at the start of the
Set method instead.
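To illustrate why the position of the validation matters, here is a toy store whose update takes two steps. All types and fields below are hypothetical and much simpler than the silence package; the sketch only shows that validating first leaves existing state untouched when the input is invalid.

```go
package main

import (
	"errors"
	"fmt"
)

type silence struct {
	ID      string
	Comment string
}

// store is a toy model: an update bumps a version and replaces the entry.
type store struct {
	byID    map[string]silence
	version map[string]int
}

func newStore() *store {
	return &store{byID: map[string]silence{}, version: map[string]int{}}
}

func validate(s silence) error {
	if s.Comment == "" {
		return errors.New("invalid silence: comment is required")
	}
	return nil
}

// Set validates before touching any state, so a rejected silence cannot
// leave an existing one half-updated.
func (st *store) Set(s silence) error {
	if err := validate(s); err != nil {
		return err // nothing has been modified yet
	}
	if _, ok := st.byID[s.ID]; ok {
		st.version[s.ID]++ // step 1 of the update
	}
	st.byID[s.ID] = s // step 2 of the update
	return nil
}

func main() {
	st := newStore()
	fmt.Println(st.Set(silence{ID: "a", Comment: "maintenance"}))
	fmt.Println(st.Set(silence{ID: "a"})) // invalid: rejected up front
	fmt.Println(st.byID["a"].Comment, st.version["a"])
}
```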
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Fix panic in acceptance tests
This commit attempts to address a panic that occurs in acceptance
tests if a server in the cluster fails to start.
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Remove started and check am.cmd.Process != nil
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Fix UTF-8 not supported in group_by
This commit fixes missing UTF-8 support in the group_by for routes.
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Support UTF-8 label matchers: Use compat package in Alertmanager server
This pull request adds use of the compat package in Alertmanager server that will allow users to switch between the new matchers/parse parser and the old pkg/labels parser. The new matchers/parse parser uses a fallback mechanism where if the input cannot be parsed in the new parser it then attempts to use the old parser. If an input is parsed in the old parser but not the new parser then a warning log is emitted.
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Add adapter package for parser feature flag
This commit adds the compat package allowing users to switch
between the new matchers/parse parser and the old pkg/labels parser.
The new matchers/parse parser uses a fallback mechanism where if
the input cannot be parsed in the new parser it then attempts to
use the old parser. If an input is parsed in the old parser but
not the new parser, then a warning log is emitted.
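The fallback mechanism can be sketched like this. The two toy parsers and the names `parseFn` and `fallbackParser` are assumptions made for illustration and do not reflect the compat package's API; which error is returned when both parsers fail is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type parseFn func(input string) ([]string, error)

// strictParse is a toy "new" parser: it rejects inputs containing spaces.
func strictParse(s string) ([]string, error) {
	if s == "" || strings.ContainsRune(s, ' ') {
		return nil, errors.New("strict: unexpected input")
	}
	return strings.Split(s, ","), nil
}

// lenientParse is a toy "old" parser: it tolerates spaces.
func lenientParse(s string) ([]string, error) {
	if s == "" {
		return nil, errors.New("lenient: empty input")
	}
	return strings.Split(strings.ReplaceAll(s, " ", ""), ","), nil
}

// fallbackParser tries primary first, then fallback. A warning is emitted
// only when the input is accepted by fallback but not by primary.
func fallbackParser(primary, fallback parseFn, warn func(msg string)) parseFn {
	return func(input string) ([]string, error) {
		out, err := primary(input)
		if err == nil {
			return out, nil
		}
		out, fbErr := fallback(input)
		if fbErr != nil {
			return nil, fbErr // assumption: report the fallback's error
		}
		warn(fmt.Sprintf("input %q only parses with the old parser: %v", input, err))
		return out, nil
	}
}

func main() {
	parse := fallbackParser(strictParse, lenientParse, func(msg string) {
		fmt.Println("warning:", msg)
	})
	fmt.Println(parse("a=b,c=d"))
	fmt.Println(parse("a = b"))
}
```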
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Fix scheme required for webhook url in amtool
This commit fixes issue #3505 where amtool would fail with
"error: scheme required for webhook url" when using amtool
with --alertmanager.url.
The issue here is that UnmarshalYAML for WebhookConfig checks
if the scheme is present when c.URL is non-nil. However,
UnmarshalYAML for SecretURL returns a non-nil, default value
url.URL{} if the response from api/v2/status contains <secret>
as the webhook URL.
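The failure mode can be reproduced in isolation. `checkWebhookURL` and `redactedURL` below are hypothetical helpers that only mimic the check described above; they show that a non-nil zero-value url.URL has an empty scheme and therefore trips the check. The sketch does not show how the commit resolves it.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// checkWebhookURL mimics a "scheme required" check that runs only when
// the URL pointer is non-nil.
func checkWebhookURL(u *url.URL) error {
	if u != nil && u.Scheme == "" {
		return errors.New("scheme required for webhook url")
	}
	return nil
}

// redactedURL models the case described above: a redacted secret decoded
// into a non-nil, zero-value URL.
func redactedURL() *url.URL {
	return &url.URL{}
}

// mustParse parses a URL known to be valid, for use in the example.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	fmt.Println("nil URL:", checkWebhookURL(nil))
	fmt.Println("real URL:", checkWebhookURL(mustParse("http://example.com/hook")))
	fmt.Println("redacted URL:", checkWebhookURL(redactedURL()))
}
```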
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Add test for config routes test
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
The CI environment isn't as performant as local machines: the time
needed to fully initialize the test environment can be significant and
skew the verification. Rather than setting the "virtual" clock used to
measure alert timings at the beginning of the acceptance test, it is
better to wait for the test bed to be ready.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Update Go to 1.19
* Update Go.
* Update some Go modules.
* Update Swagger to the latest for Go 1.19 compatibility.
* api/v2: regenerate
* Adapt to the changes in the client package
* asset/assets_vfsdata.go: regenerate
Signed-off-by: SuperQ <superq@gmail.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
The CI keeps reporting flakes for our acceptance test around the starting and stopping of the Alertmanagers. While I have an idea of where these failures are coming from, it would be nice to get a confirmation by structuring our error messages a bit better.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
While merging #2944, I noticed the CI failed: https://app.circleci.com/pipelines/github/prometheus/alertmanager/2686/workflows/b6f87b0a-20c3-455b-b706-432c38a77511/jobs/12028.
It seemed like a deadlock between uncoordinated routines, but I couldn't pinpoint (or reproduce, I tried with -race and -count) the exact problem. However, from the logs, I could point out where the problem originated, and I have a hunch it has to do with the way net listeners are handled in the code marked by the TODO that this change removes.
The more worrying bit of the CI failure is that it took 10m to time out. With this change we force-close the connection with a 5s deadline, so at the very least we'll get the feedback faster.
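A generic sketch of bounding a shutdown with a deadline, assuming an http.Server; the helpers `startServer` and `stopWithDeadline` are hypothetical and this is not necessarily how the acceptance test closes its connection. It tries a graceful shutdown first and falls back to a hard close if the deadline passes.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// startServer starts a throwaway HTTP server on a system-allocated port.
func startServer() (*http.Server, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	srv := &http.Server{Handler: http.NotFoundHandler()}
	go srv.Serve(ln) // returns http.ErrServerClosed once the server is stopped
	return srv, nil
}

// stopWithDeadline waits at most deadline for a graceful shutdown, then
// force-closes whatever connections are still open.
func stopWithDeadline(srv *http.Server, deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), deadline)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return srv.Close()
	}
	return nil
}

func main() {
	srv, err := startServer()
	if err != nil {
		panic(err)
	}
	fmt.Println("stop error:", stopWithDeadline(srv, 5*time.Second))
}
```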
Signed-off-by: gotjosh <josue.abreu@gmail.com>
This commit moves the code formerly in /client into /test/with_api_v1
so that we can discourage use of the v1 client without breaking things.
Signed-off-by: sinkingpoint <colin@quirl.co.nz>
Instead of keeping all notifiers in the notify package, this change splits them
into individual sub-packages. This improves readability and
maintainability of the code.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
- Move the generated api/v2 client code out of the test directory
and into the api/v2 directory with models and restapi.
- Remove duplicate models directory
- Update tests to use api/v2 package for models and client
Signed-off-by: Paul Gier <pgier@redhat.com>
- make clean shouldn't print errors when files/directories have already
been removed
- add copyright header to generated api files to pass license check
Signed-off-by: Paul Gier <pgier@redhat.com>