Add tracing support using otel to the the following components:
- api: extract trace and span IDs from request context
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc. drop metrics
- silence: query, expire, mutes
- notify: add distributed tracing support to stages and all http requests
Note: inhibitor metrics are dropped since we have tracing now and they
are not needed. We have not released any version with these metrics so
we can drop them safely, this is not a breaking change.
This change borrows part of the implementation from #3673
Fixes #3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Co-authored-by: Dave Henderson <dhenderson@gmail.com>
* Add new behavior to avoid races on config reload
* Add context to Groups to allow timeouts
---------
Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>
This change adds a new cmd flag `--dispatch.start-delay` which
corresponds to the `--rules.alert.resend-delay` flag in Prometheus.
This flag controls the minimum amount of time that Prometheus waits
before resending an alert to Alertmanager.
By adding this value to the start time of Alertmanager, we delay
the aggregation groups' first flush, until we are confident all alerts
are resent by Prometheus instances.
This should help avoid race conditions in inhibitions after a (re)start.
Other improvements:
- remove hasFlushed flag from aggrGroup
- remove mutex locking from aggrGroup
Signed-off-by: Alexander Rickardsson <alxric@aiven.io>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Co-authored-by: Alexander Rickardsson <alxric@aiven.io>
* fix: set context timeout for resolvePeers
Given resolvePeers is context aware, yet, uses background that may lead to blocking call. This is due to the fact that `LookupIPAddr` is called under the hood that does dns look up which may be blocked in some platforms.
In case of dns lookups are blocked may lead to hanging goroutine due to the fact that main goroutine is being used for the call that leads to application hanging & being unresponsive as well as readiness check to fail.
---------
Signed-off-by: emreya <e.yazici1990@gmail.com>
Currently the default behaviour is to generate a new, random peer name
on every startup.
It might be desired to have a persistent peer name in consistent setups
like running the cluster on bare-metal nodes.
This change adds an optional cli arg that allows specifying a custom
peer name.
If no peer name is specified a random name is generated.
Duplicated peer names are detected, logged and increment the new metric
`alertmanager_cluster_peer_name_conflicts_total`.
Signed-off-by: sinkingpoint <colin@quirl.co.nz>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Co-authored-by: sinkingpoint <colin@quirl.co.nz>
Use the new version collector to expose the `alertmanager_build_info`
metric. This fixes an incorrect deprecation migration done in #3806.
Signed-off-by: SuperQ <superq@gmail.com>
* chore!: adopt log/slog, drop go-kit/log
The bulk of this change set was automated by the following script which
is being used to aid in converting the various exporters/projects to use
slog:
https://gist.github.com/tjhop/49f96fb7ebbe55b12deee0b0312d8434
This commit includes several changes:
- bump exporter-tookit to v0.13.1 for log/slog support
- updates golangci-lint deprecated configs
- enables sloglint linter
- removes old go-kit/log linter configs
- introduce some `if logger == nil { $newLogger }` additions to prevent
nil references
- converts cluster membership config to use a stdlib compatible slog
adapter, rather than creating a custom io.Writer for use as the
membership `logOutput` config
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
* chore: address PR feedback
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
---------
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
This commit updates /api/v2/alerts/groups to show if an alert is
suppressed from one or more active or mute time intervals. While
the muted by field can be found in /api/v2/alerts, it is not
used here because /api/v2/alerts does not take aggregation
or routing into consideration.
It also updates the UI to support filtering muted alerts via the
Muted checkbox.
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Rename matchers package to matcher singular
I realized that we had named the package plural "matchers" when
its idiomatic in Go to use singular package names.
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Rename silence limit to max-silence-size-bytes
This commit renames an existing (unreleased) limit from
max-per-silence-bytes to max-silence-size-bytes.
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Update help
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Silence limits as functions
This commit changes silence limits from a struct of ints to a struct
of functions that return individual limits. This allows limits
to be lazy-loaded and updated without having to call silences.New().
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Add explicit test for no limits
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Fix run()
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Add limits for silences
This commit adds limits for silences including the maximum number
of active and pending silences, and the maximum size per silence
(in bytes).
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Remove default limits
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Allow expiration of silences that exceed max size
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Mark muted groups
This commit updates TimeMuteStage and TimeActiveStage to mark groups
as muted when its alerts are muted by an active or mute time interval,
and remove any existing markers when outside all active and mute
time intervals.
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Remove unlock to defer
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Bump prometheus/common to v0.52.3
This commit bumps prometheus/common to v0.52.3. It has a breaking
change where the metric alertmanager_build_info has been renamed
to go_build_info as the metric has been moved from prometheus/common
to prometheus/client_golang and the namspace argument has been
removed.
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
Note that this does not stop showing classic metrics, for now
it is up to the scrape config to decide whether to keep those instead or
both.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
This commit removes the metrics from the compat package
in favour of the existing logging and the additional tools
at hand, such as amtool, to validate Alertmanager configurations.
Due to the global nature of the compat package, a consequence
of config.Load, these metrics have proven to be less useful
in practice than expected, both in Alertmanager and other projects
such as Mimir.
There are a number of reasons for this:
1. Because the compat package is global, these metrics cannot be
reset each time config.Load is called, as in multi-tenant
projects like Mimir loading a config for one tenant would reset
the metrics for all tenants. This is also the reason the metrics
are counters and not gauges.
2. Since the metrics are counters, it is difficult to create
meaningful dashboards for Alertmanager as, unlike in Mimir,
configurations are not reloaded at fixed intervals, and as such,
operators cannot use rate to track configuration changes
over time.
In Alertmanager, there are much better tools available to validate
that an Alertmanager configuration is compatible with the UTF-8
parser, including both the existing logging from Alertmanager
server and amtool check-config.
In other projects like Mimir, we can track configurations for
individual tenants using log aggregation and storage systems
such as Loki. This gives operators far more information than
what is possible with the metrics, including the timestamp,
input and ID of tenant configurations that are incompatible
or have disagreement.
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Add metric for inhibit rules
This commit adds a new metric called alertmanager_inhibit_rules.
It is identical to the alertmanager_integrations and
alertmanager_receivers metrics that are present in the current
and previous versions.
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Rename metric and variable
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
This commit removes some code that should have been removed in #3668.
The FeatureFlags in silence.Options are no longer used but were
still initialized. These had a no-op effect.
Signed-off-by: George Robinson <george.robinson@grafana.com>
This commit fixes inconsistent UTF-8 behavior if the compat package is
not initialized and feature flags are not passed to the API. This can
happen when Alertmanager is used as a package in software such
as Cortex or Mimir.
The inconsistent behavior is that Alertmanager will accept UTF-8 alerts
but reject UTF-8 configurations.
Since feature flags are optional via api.Options, we cannot force them
to be passed to api.New at compile time. Instead, it's better to defer
back to the compat package which is consistent even when not initialized.
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Add metrics to matchers compat package
This commit adds the following metrics to the compat package:
alertmanager_matchers_parse
alertmanager_matchers_disagree
alertmanager_matchers_incompatible
alertmanager_matchers_invalid
With a label called origin to differentiate the different sources
of inputs: the configuration file, the API, and amtool.
The disagree_total metric is incremented when an input is invalid
in both parsers, but results in different parsed representations,
then there is disagreement. This should not happen, and suggests
their is either a bug in one of the parsers or a mistake in the
backwards compatible guarantees of the matchers/parse parser.
The incompatible_total metric is incremented when an input is valid
in pkg/labels, but not the UTF-8 parser in matchers/parse. In such
case, the matcher should be updated to be compatible. This often
means adding double quotes around the right hand side of the matcher.
For example, foo="bar".
The invalid_total metric is incremented when an input is invalid
in both parsers. This was never a valid input.
The tests have been updated to check the metrics are incremented
as expected.
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Support UTF-8 label matchers: Use compat package in Alertmanager server
This pull request adds use of the compat package in Alertmanager server that will allow users to switch between the new matchers/parse parser and the old pkg/labels parser. The new matchers/parse parser uses a fallback mechanism where if the input cannot be parsed in the new parser it then attempts to use the old parser. If an input is parsed in the old parser but not the new parser then a warning log is emitted.
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Refactor: Move `inTimeIntervals` from `notify` to `timeinterval`
There's absolutely no change of functionality here and I've expanded coverage for similar logic in both places.
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add receiver name as a label to notify metrics
This commit adds in a second label to the notify family of metrics
(e.g. numTotalFailedNotifications) - the receiver name. This allows
disambiguating which receiver is failing when one has many receivers
with the same integration type
Signed-off-by: sinkingpoint <colin@quirl.co.nz>
* Gate receiver names behind a feature flag
Signed-off-by: sinkingpoint <colin@quirl.co.nz>
---------
Signed-off-by: sinkingpoint <colin@quirl.co.nz>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Log a warning when repeat_interval is less than group_interval
This commit updates Alertmanager to log a warning when
repeat_interval is less than group_interval for an individual route.
When repeat_interval is less than group_interval, the earliest
a notification can be sent again is the next time the aggregation
group is flushed, and this happens at each group_interval.
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
* Refactor nflog configuration options to make it similar to Silences.
The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval.
To simplify the code and make it a bit more understandable (among other benefits such as improved testability) - I've refactor the notification log configuration and `run` to be similar to the silences.