Add tracing support using otel to the following components:
- api: extract trace and span IDs from request context (see the sketch below)
- provider: mem put
- dispatch: split logic and use better naming
- inhibit: source and target traces, mutes, etc.; drop metrics
- silence: query, expire, mutes
- notify: add distributed tracing support to stages and all http requests
Note: inhibitor metrics are dropped since we have tracing now and they
are not needed. We have not released any version with these metrics, so
we can drop them safely; this is not a breaking change.
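A minimal sketch of the api-side extraction, assuming the handler sits
behind otelhttp (or any middleware that puts the span into the request
context); the handler and the header use are illustrative:

```go
package api

import (
	"net/http"

	"go.opentelemetry.io/otel/trace"
)

// handleAlerts shows how trace and span IDs can be pulled from the
// request context once tracing middleware has populated it.
func handleAlerts(w http.ResponseWriter, r *http.Request) {
	sc := trace.SpanFromContext(r.Context()).SpanContext()
	if sc.IsValid() {
		traceID, spanID := sc.TraceID().String(), sc.SpanID().String()
		// Attach the IDs to logs or response headers for correlation.
		w.Header().Set("X-Trace-Id", traceID) // illustrative use
		_ = spanID
	}
}
```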
This change borrows part of the implementation from #3673
Fixes #3670
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Co-authored-by: Dave Henderson <dhenderson@gmail.com>
* Add new behavior to avoid races on config reload
* Add context to Groups to allow timeouts
---------
Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>
This change adds a new cmd flag `--dispatch.start-delay` which
corresponds to the `--rules.alert.resend-delay` flag in Prometheus.
This flag controls the minimum amount of time that Prometheus waits
before resending an alert to Alertmanager.
By adding this value to the start time of Alertmanager, we delay
the aggregation groups' first flush until we are confident all alerts
have been resent by the Prometheus instances.
This should help avoid race conditions in inhibitions after a (re)start.
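A minimal sketch of the delayed first flush, with simplified stand-in
types rather than Alertmanager's actual dispatcher code:

```go
package dispatch

import (
	"context"
	"time"
)

// aggrGroup is a simplified stand-in for the real aggregation group;
// only the fields needed to illustrate the start delay are kept.
type aggrGroup struct {
	startTime  time.Time     // when this Alertmanager process started
	startDelay time.Duration // value of --dispatch.start-delay
	groupWait  time.Duration // the usual group_wait
	flush      func()
}

// run holds back the first flush until the start delay has elapsed
// (but never less than group_wait), so all Prometheus instances have
// had time to resend their firing alerts after a (re)start.
func (ag *aggrGroup) run(ctx context.Context) {
	wait := time.Until(ag.startTime.Add(ag.startDelay))
	if wait < ag.groupWait {
		wait = ag.groupWait
	}
	select {
	case <-ctx.Done():
		return
	case <-time.After(wait):
		ag.flush()
	}
}
```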
Other improvements:
- remove hasFlushed flag from aggrGroup
- remove mutex locking from aggrGroup
Signed-off-by: Alexander Rickardsson <alxric@aiven.io>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Co-authored-by: Alexander Rickardsson <alxric@aiven.io>
Add `alertmanager_alerts_subscriber_channel_writes_total` metric to
track the number of alerts written to subscriber channels.
A drop in the rate of this metric may indicate a problem with the
ingestion of alerts by subscribers (inhibitor and dispatcher).
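A minimal sketch of the counter using prometheus/client_golang; the
`subscriber` label is an illustrative assumption:

```go
package mem

import "github.com/prometheus/client_golang/prometheus"

// subscriberChannelWrites counts alerts written to subscriber channels.
// The "subscriber" label is an assumption for illustration; the real
// metric may be labeled differently.
var subscriberChannelWrites = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "alertmanager_alerts_subscriber_channel_writes_total",
		Help: "Number of alerts written to subscriber channels.",
	},
	[]string{"subscriber"},
)

func init() {
	prometheus.MustRegister(subscriberChannelWrites)
}

// recordWrite is called after an alert is written to a subscriber's
// channel, e.g. the inhibitor's or the dispatcher's.
func recordWrite(subscriber string) {
	subscriberChannelWrites.WithLabelValues(subscriber).Inc()
}
```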
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
This change makes aggregation groups delete resolved alerts from the
marker, avoiding the leakage of ghost states mentioned in #4402.
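A minimal sketch of the cleanup; the marker interface is narrowed to
the one method this sketch needs, loosely following the types package:

```go
package dispatch

import (
	"github.com/prometheus/alertmanager/types"
	"github.com/prometheus/common/model"
)

// marker is the slice of the marker API this sketch assumes.
type marker interface {
	Delete(model.Fingerprint)
}

// deleteResolved drops resolved alerts from the marker alongside their
// removal from the aggregation group, so no ghost state lingers after
// the alert itself is gone.
func deleteResolved(m marker, alerts []*types.Alert) {
	for _, a := range alerts {
		if a.Resolved() {
			m.Delete(a.Fingerprint())
		}
	}
}
```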
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
* chore!: adopt log/slog, drop go-kit/log
The bulk of this change set was automated by the following script which
is being used to aid in converting the various exporters/projects to use
slog:
https://gist.github.com/tjhop/49f96fb7ebbe55b12deee0b0312d8434
This commit includes several changes:
- bump exporter-toolkit to v0.13.1 for log/slog support
- update deprecated golangci-lint configs
- enable the sloglint linter
- remove old go-kit/log linter configs
- introduce some `if logger == nil { $newLogger }` guards to prevent
  nil references
- convert the cluster membership config to use a stdlib-compatible slog
  adapter, rather than creating a custom io.Writer for use as the
  membership `logOutput` config (see the sketch below)
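A minimal sketch of the last two items, assuming memberlist's config
with its *log.Logger field; stdlib slog provides the bridge directly:

```go
package cluster

import (
	"log/slog"
	"os"

	"github.com/hashicorp/memberlist"
)

// ensureLogger guards against nil loggers to prevent nil references.
func ensureLogger(l *slog.Logger) *slog.Logger {
	if l == nil {
		l = slog.New(slog.NewTextHandler(os.Stderr, nil))
	}
	return l
}

// configureMembership hands memberlist a stdlib *log.Logger built from
// the slog handler, instead of a custom io.Writer for LogOutput.
func configureMembership(cfg *memberlist.Config, l *slog.Logger) {
	l = ensureLogger(l)
	cfg.Logger = slog.NewLogLogger(l.Handler(), slog.LevelDebug)
}
```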
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
* chore: address PR feedback
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
---------
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
This commit updates /api/v2/alerts/groups to show if an alert is
suppressed by one or more active or mute time intervals. While the
muted-by field can be found in /api/v2/alerts, it is not used here
because /api/v2/alerts does not take aggregation or routing into
consideration.
It also updates the UI to support filtering muted alerts via the
Muted checkbox.
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Mark muted groups
This commit updates TimeMuteStage and TimeActiveStage to mark groups
as muted when their alerts are muted by an active or mute time interval,
and to remove any existing markers when outside all active and mute
time intervals.
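A minimal sketch of the marking logic; the group-marker shape, method
names, and the mutedBy helper are assumptions, not the exact
Alertmanager API:

```go
package notify

import (
	"time"

	"github.com/prometheus/alertmanager/types"
)

// groupMarker is the subset of marker behavior this sketch assumes.
// An empty timeIntervalNames clears any existing muted marker.
type groupMarker interface {
	SetMuted(routeID, groupKey string, timeIntervalNames []string)
}

// markMuted records which time intervals currently mute the group and
// suppresses its alerts, or clears the marker and passes the alerts
// through when the group is outside all active and mute intervals.
func markMuted(m groupMarker, routeID, groupKey string, now time.Time,
	mutedBy func(time.Time) []string, alerts []*types.Alert) []*types.Alert {
	names := mutedBy(now)
	m.SetMuted(routeID, groupKey, names)
	if len(names) > 0 {
		return nil
	}
	return alerts
}
```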
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Move unlock to defer
Signed-off-by: George Robinson <george.robinson@grafana.com>
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Fix race condition in dispatch.go
This commit fixes a race condition in dispatch.go that would cause
a firing alert to be deleted from the aggregation group when instead
it should have been flushed.
The root cause is a race condition that can occur when dispatch.go
deletes resolved alerts from the aggregation group following a
successful notification. If a firing alert with the same
fingerprint is added back to the aggregation group at the same time,
the firing alert can be deleted.
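A minimal sketch of the guarded delete: an alert is only removed if
the stored copy is still the resolved snapshot that was notified for.
The store shape is simplified for illustration:

```go
package dispatch

import (
	"github.com/prometheus/alertmanager/types"
	"github.com/prometheus/common/model"
)

// alertStore is the subset of the store API this sketch assumes.
type alertStore interface {
	Get(model.Fingerprint) (*types.Alert, error)
	Delete(model.Fingerprint) error
}

// deleteNotified removes resolved alerts after a successful notify,
// but only if the stored alert is the exact snapshot that was sent.
// If a firing alert with the same fingerprint arrived in the meantime,
// its UpdatedAt differs and it is kept.
func deleteNotified(store alertStore, notified []*types.Alert) {
	for _, a := range notified {
		if !a.Resolved() {
			continue
		}
		fp := a.Fingerprint()
		got, err := store.Get(fp)
		if err != nil {
			continue // already gone
		}
		if got.UpdatedAt.Equal(a.UpdatedAt) {
			_ = store.Delete(fp)
		}
	}
}
```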
---------
Signed-off-by: George Robinson <george.robinson@grafana.com>
* Alert metric reports different results to what the user sees via API
Fixes #1439 and #2619.
The previous metric is not _technically_ reporting incorrect results, as the alerts _are_ still around and will be re-used if that same alert (equal fingerprint) is received before it is GCed. Therefore, I have kept the old metric under a new name `alertmanager_marked_alerts` and repurposed the current metric to match what the user sees in the UI.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* fix dispatcher race condition
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
* add test to check for race condition in dispatcher
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
* return when dispatcher Stop has nil receiver
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
* remove unneeded check
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
If the original EndsAt is left in place, then as time moves forward
past the EndsAt, firing alerts will be rendered and treated as
resolved alerts, which can cause confusion and races. This is most
likely to happen on retries for a notification.
Mitigate race and fix data races in TestAggrGroup.
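A minimal sketch of the mitigation: copy the alerts at flush time and
zero EndsAt on still-firing alerts, so a delayed retry cannot render
them as resolved. Illustrative, not the exact diff:

```go
package dispatch

import (
	"time"

	"github.com/prometheus/alertmanager/types"
)

// snapshotForNotify copies alerts before notification. Firing alerts
// get a zero EndsAt (meaning "still open"), so time passing during
// notification retries cannot flip them to resolved in templates.
func snapshotForNotify(alerts []*types.Alert, now time.Time) []*types.Alert {
	out := make([]*types.Alert, 0, len(alerts))
	for _, a := range alerts {
		c := *a // copy; the stored alert keeps its real EndsAt
		if c.EndsAt.After(now) {
			c.EndsAt = time.Time{}
		}
		out = append(out, &c)
	}
	return out
}
```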
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
To aggregate by all possible labels, use '...' as the sole label name.
This effectively disables aggregation entirely, passing through all
alerts as-is. This is unlikely to be what you want, unless you have
a very low alert volume or your upstream notification system performs
its own grouping. Example: `group_by: ['...']`
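A minimal sketch of how the special '...' value can be handled when
computing an alert's group labels, loosely following the dispatcher;
names are illustrative:

```go
package dispatch

import (
	"github.com/prometheus/alertmanager/types"
	"github.com/prometheus/common/model"
)

// groupLabels returns the labels an alert is grouped by. With
// group_by: ['...'] every label participates, so each distinct label
// set forms its own aggregation group.
func groupLabels(a *types.Alert, groupByAll bool, groupBy map[model.LabelName]struct{}) model.LabelSet {
	ls := model.LabelSet{}
	for name, value := range a.Labels {
		if groupByAll {
			ls[name] = value
			continue
		}
		if _, ok := groupBy[name]; ok {
			ls[name] = value
		}
	}
	return ls
}
```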
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
Turn the GroupKey into a string that is composed of the matchers of the
path in the routing tree and the grouping labels.
Only hash it at the very end to ensure we don't exceed size limits of
integration APIs.
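A minimal sketch of composing the key and hashing only at the last
moment; the 1024-byte threshold and the helper name are assumptions:

```go
package dispatch

import (
	"crypto/sha256"
	"fmt"

	"github.com/prometheus/common/model"
)

// groupKey builds a human-readable key from the route's matcher path
// and the grouping labels, e.g. {region="eu"}:{alertname="HighLoad"}.
// It is only hashed when it would exceed an integration API's size
// limit, keeping the readable form everywhere else.
func groupKey(routeMatchers string, labels model.LabelSet) string {
	key := fmt.Sprintf("%s:%s", routeMatchers, labels)
	if len(key) > 1024 {
		return fmt.Sprintf("%x", sha256.Sum256([]byte(key)))
	}
	return key
}
```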
* Vendor dependencies.
This updates several old dependencies, removes
some that are no longer needed, and adds
`pkg/labels` from prometheus `dev-2.0` branch.
* Add metrics selector parsing code
This is a temporary simplified re-implementation
of promQL's metric selector parsing.
* Add alerts filtering
Filter alerts through the `?filter=` query string (see the sketch after this list).
* Add silences filtering
Filter silences through `?filter=` query string.
* Move `parse` to `pkg/parse`
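A minimal sketch of the `?filter=` handling; the matcher shape and the
parse callback stand in for pkg/labels and pkg/parse, whose exact
signatures are not shown here:

```go
package api

import "net/http"

// matcher is the minimal matcher shape this sketch assumes; in the
// real code it comes from pkg/labels, built by pkg/parse.
type matcher interface {
	Name() string
	Matches(value string) bool
}

// parseFunc stands in for the pkg/parse entry point that turns one
// filter expression (e.g. severity="page") into a matcher.
type parseFunc func(expr string) (matcher, error)

// matchersFromRequest collects matchers from every ?filter= parameter.
func matchersFromRequest(r *http.Request, parse parseFunc) ([]matcher, error) {
	var ms []matcher
	for _, expr := range r.URL.Query()["filter"] {
		m, err := parse(expr)
		if err != nil {
			return nil, err
		}
		ms = append(ms, m)
	}
	return ms, nil
}

// matches reports whether a label set satisfies all matchers; the same
// check filters both alerts and silences.
func matches(labels map[string]string, ms []matcher) bool {
	for _, m := range ms {
		if !m.Matches(labels[m.Name()]) {
			return false
		}
	}
	return true
}
```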
This string value is initially used to store a receiver name. It is
later overloaded with a unique string identifier of <name, integration,
index>.
This renaming is in preparation for separating the two and using the
Receiver object of the nflogpb package.