---
title: High Availability
sort_rank: 4
nav_icon: network
---

# High Availability

Alertmanager can be configured to run as a cluster for high availability. This document describes how the HA mechanism works, its design goals, and operational considerations.

## Design Goals

The Alertmanager HA implementation is designed around three core principles:

1. **Single pane view and management** - Silences and alerts can be viewed and managed from any cluster member, providing a unified operational experience
2. **Survive cluster split-brain with "fail open"** - During network partitions, Alertmanager prefers to send duplicate notifications rather than miss critical alerts
3. **At-least-once delivery** - The system guarantees that notifications are delivered at least once, in line with the fail-open philosophy

These goals prioritize operational reliability and alert delivery over strict exactly-once semantics.

## Architecture Overview

An Alertmanager cluster consists of multiple Alertmanager instances that communicate using a gossip protocol. Each instance:

- Receives alerts independently from Prometheus servers
- Participates in a peer-to-peer gossip mesh
- Replicates state (silences and notification log) to other cluster members
- Processes and sends notifications independently

```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Prometheus 1 │   │ Prometheus 2 │   │ Prometheus N │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       │ alerts           │ alerts           │ alerts
       │                  │                  │
       ▼                  ▼                  ▼
┌────────────────────────────────────────────────────┐
│   ┌──────────┐     ┌──────────┐     ┌──────────┐   │
│   │   AM-1   │     │   AM-2   │     │   AM-3   │   │
│   │ (pos: 0) ├─────┤ (pos: 1) ├─────┤ (pos: 2) │   │
│   └──────────┘     └──────────┘     └──────────┘   │
│            Gossip Protocol (Memberlist)            │
└────────────────────────────────────────────────────┘
       │                  │                  │
       ▼                  ▼                  ▼
   Receivers          Receivers          Receivers
```

## Gossip Protocol

Alertmanager uses [HashiCorp's Memberlist](https://github.com/hashicorp/memberlist) library to implement gossip-based communication. The gossip protocol handles:

### Membership Management

- **Automatic peer discovery** - Instances can be configured with a list of known peers and will automatically discover other cluster members
- **Health checking** - Regular probes detect failed members (default: every 1 second)
- **Failure detection** - Failed members are marked and can attempt to rejoin

### State Replication

The gossip layer replicates three types of state:

1. **Silences** - Create, update, and delete operations are broadcast to all peers
2. **Notification log** - Records of which notifications were sent, to prevent duplicates
3. **Membership changes** - Join, leave, and failure events

State is eventually consistent: all cluster members converge to the same state given sufficient time and network connectivity.

### Gossip Settling

When an Alertmanager starts or rejoins the cluster, it waits for gossip to "settle" before processing notifications. This prevents sending notifications based on incomplete state.

The settling algorithm waits until:

- The number of peers has remained stable for 3 consecutive checks (performed at the push-pull interval by default)
- Or a timeout occurs (configurable via context)

During this time, the instance still receives and stores alerts, but defers notification processing.

## Notification Pipeline in HA Mode

The notification pipeline operates differently in a clustered environment to ensure deduplication while maintaining at-least-once delivery:

```
┌────────────────────────────────────────────────┐
│                DISPATCHER STAGE                │
├────────────────────────────────────────────────┤
│ 1. Find matching route(s)                      │
│ 2. Find/create aggregation group within route  │
│ 3. Throttle by group wait or group interval    │
└───────────────────┬────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────┐
│                 NOTIFIER STAGE                 │
├────────────────────────────────────────────────┤
│ 1. Wait for HA gossip to settle                │◄─── Ensures complete state
│ 2. Filter inhibited alerts                     │
│ 3. Filter non-time-active alerts               │
│ 4. Filter time-muted alerts                    │
│ 5. Filter silenced alerts                      │◄─── Uses replicated silences
│ 6. Wait according to HA cluster peer index     │◄─── Staggered notifications
│ 7. Dedupe by repeat interval/HA state          │◄─── Uses notification log
│ 8. Notify & retry intermittent failures        │
│ 9. Update notification log                     │◄─── Replicated to peers
└────────────────────────────────────────────────┘
```

### HA-Specific Stages

#### 1. Gossip Settling Wait

Before the first notification from a group, the instance waits for gossip to settle. This ensures:

- Silences are fully replicated
- The notification log contains recent send records from other instances
- The cluster membership is stable

**Implementation**: `peer.WaitReady(ctx)`

#### 2. Peer Position-Based Wait

To prevent all cluster members from sending notifications simultaneously, each instance waits based on its position in the sorted peer list:

```
wait_time = peer_position × peer_timeout
```

For example, with 3 instances and a 15-second peer timeout:

- Instance `am-1` (position 0): waits 0 seconds
- Instance `am-2` (position 1): waits 15 seconds
- Instance `am-3` (position 2): waits 30 seconds

This staggered timing allows:

- The first instance to send the notification
- Subsequent instances to see the notification log entry
- Deduplication to prevent duplicate sends

**Implementation**: `clusterWait()` in `cmd/alertmanager/main.go:594`

Position is determined by sorting all peer names alphabetically:

```go
func (p *Peer) Position() int {
	all := p.mlist.Members()
	sort.Slice(all, func(i, j int) bool {
		return all[i].Name < all[j].Name
	})
	// Find the position of this instance in the sorted list.
	for i, n := range all {
		if n.Name == p.mlist.LocalNode().Name {
			return i
		}
	}
	return 0
}
```

#### 3. Deduplication via Notification Log

The `DedupStage` queries the notification log to determine if a notification should be sent:

```go
// Check the notification log for a recent send to this receiver and group.
entry, found := nflog.Query(receiver, groupKey)
if found && !shouldNotify(entry, alerts, repeatInterval) {
	// Skip: already notified recently.
	return nil
}
```

Deduplication checks:

- **Firing alerts changed?** If yes, notify
- **Resolved alerts changed?** If yes and `send_resolved: true`, notify
- **Repeat interval elapsed?** If yes, notify
- **Otherwise**: Skip notification (deduplicated)

The notification log is replicated via gossip, so all cluster members share the same send history.

## Split-Brain Handling (Fail Open)

During a network partition, the cluster may split into multiple groups that cannot communicate. Alertmanager's "fail open" design ensures alerts are still delivered:

### Scenario: Network Partition

```
Before partition:
┌────────┬────────┬────────┐
│  AM-1  │  AM-2  │  AM-3  │
└────────┴────────┴────────┘
      Unified cluster

After partition:
┌────────┐    │   ┌────────┬────────┐
│  AM-1  │    │   │  AM-2  │  AM-3  │
└────────┘    │   └────────┴────────┘
Partition A   │      Partition B
```

### Behavior During Partition

**In Partition A** (AM-1 alone):

- AM-1 sees itself as position 0
- Waits 0 × timeout = 0 seconds
- Sends notifications (no dedup from AM-2/AM-3)

**In Partition B** (AM-2, AM-3):

- AM-2 is position 0, AM-3 is position 1
- AM-2 waits 0 seconds, sends notification
- AM-3 sees AM-2's notification log entry, deduplicates

**Result**: Duplicate notifications sent (one from Partition A, one from Partition B)

This is **intentional** - Alertmanager prefers duplicate notifications over missed alerts.

### After Partition Heals

When the network partition heals:

1. Gossip protocol detects all peers again
2. Notification logs are merged (a CRDT-like merge where the newer timestamp wins; see the sketch below)
3. Future notifications are deduplicated correctly across all instances
4. Silences created in either partition are replicated to all peers

## Silence Management in HA

Silences are first-class replicated state in the cluster.

### Silence Creation and Updates

When a silence is created or updated on any instance:

1. **Local storage** - The silence is stored in the local state map
2. **Broadcast** - The silence is serialized (protobuf) and broadcast via gossip
3. **Merge on receive** - Other instances receive and merge the silence:

   ```go
   // Merge logic: last-write-wins based on the UpdatedAt timestamp.
   if !exists || incoming.UpdatedAt.After(existing.UpdatedAt) {
       acceptUpdate(incoming)
   }
   ```

4. **Indexing** - The silence matcher cache is updated for fast alert matching

### Silence Expiry

Silences have:

- `StartsAt`, `EndsAt` - The active time range
- `ExpiresAt` - When to garbage collect (`EndsAt` + retention period)
- `UpdatedAt` - For conflict resolution during merge

Each instance independently:

- Evaluates silence state (pending/active/expired) based on the current time
- Garbage collects expired silences past their retention period

The GC is local only (no gossip), since all instances converge to the same decision.

### Single Pane of Glass

Users can interact with any Alertmanager instance in the cluster:

- **View silences** - All instances have the same silence state (eventually consistent)
- **Create/update silences** - Changes made on any instance propagate to all peers
- **Delete silences** - Implemented as "expire immediately" + gossip

This provides a unified operational experience regardless of which instance you access.

## Operational Considerations

### Configuration

To configure a cluster, each Alertmanager instance needs the cluster flags set on the command line; clustering is not configured in `alertmanager.yml`:

```yaml
# alertmanager.yml
global:
  # ... other config ...

# No cluster configuration here - clustering is set up via CLI flags.
```

Command-line flags:

```bash
alertmanager \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am-1.example.com:9094 \
  --cluster.peer=am-2.example.com:9094 \
  --cluster.peer=am-3.example.com:9094 \
  --cluster.advertise-address=$(hostname):9094 \
  --cluster.peer-timeout=15s \
  --cluster.gossip-interval=200ms \
  --cluster.pushpull-interval=60s
```

Key flags:

- `--cluster.listen-address` - Bind address for cluster communication (default: `0.0.0.0:9094`)
- `--cluster.peer` - List of peer addresses (can be repeated)
- `--cluster.advertise-address` - Address advertised to peers (auto-detected if omitted)
- `--cluster.peer-timeout` - Wait time per peer position for deduplication (default: `15s`)
- `--cluster.gossip-interval` - How often to gossip (default: `200ms`)
- `--cluster.pushpull-interval` - Full state sync interval (default: `60s`)
- `--cluster.probe-interval` - Peer health check interval (default: `1s`)
- `--cluster.settle-timeout` - Max time to wait for gossip settling (default: context timeout)

### Prometheus Configuration

**Important**: Configure Prometheus to send alerts to **all** Alertmanager instances, not via a load balancer.

```yaml
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - am-1.example.com:9093
            - am-2.example.com:9093
            - am-3.example.com:9093
```

This ensures:

- **Redundancy** - If one Alertmanager is down, others still receive alerts
- **Independent processing** - Each instance independently evaluates routing, grouping, and deduplication
- **No single point of failure** - A load balancer in front of the cluster would itself be a single point of failure

### Cluster Size Considerations

Since Alertmanager uses gossip without quorum or voting, **any N instances tolerate up to N-1 failures**: as long as one instance is alive, notifications will be sent.

However, cluster size involves tradeoffs:

**Benefits of more instances:**

- Greater resilience to simultaneous failures (hardware, network, datacenter outages)
- Continued operation even during maintenance windows

**Costs of more instances:**

- More duplicate notifications during network partitions
- More gossip traffic

**Typical deployments:**

- **2-3 instances** - Common for single-datacenter production deployments
- **4-5 instances** - Multi-datacenter or highly critical environments

**Note**: Unlike consensus-based systems (etcd, Raft), odd vs. even cluster sizes make no difference - there is no voting or quorum.

### Monitoring Cluster Health

Key metrics to monitor:

```
# Cluster size
alertmanager_cluster_members

# Peer health
alertmanager_cluster_peer_info

# Peer position (affects notification timing)
alertmanager_peer_position

# Failed peers
alertmanager_cluster_failed_peers

# State replication
alertmanager_nflog_gossip_messages_propagated_total
alertmanager_silences_gossip_messages_propagated_total
```

### Security

By default, cluster communication is unencrypted. For production deployments, especially across WANs, use mutual TLS:

```bash
alertmanager \
  --cluster.tls-config=/etc/alertmanager/cluster-tls.yml
```

See [Secure Cluster Traffic](../doc/design/secure-cluster-traffic.md) for details.

### Persistence

Each Alertmanager instance persists:

- **Silences** - Stored in a snapshot file (default: `data/silences`)
- **Notification log** - Stored in a snapshot file (default: `data/nflog`)

On restart:

1. Instance loads silences and notification log from disk
2. Joins the cluster and gossips with peers
3. Merges state received from peers (newer timestamps win)
4. Begins processing notifications after gossip settling

**Note**: Alerts themselves are **not** persisted - Prometheus re-sends firing alerts regularly.

### Common Pitfalls

1. **Load balancing Prometheus → Alertmanager**
   - ❌ Don't use a load balancer
   - ✅ Configure all instances in Prometheus

2. **Not waiting for gossip to settle**
   - Can lead to missed silences or duplicate notifications on startup
   - The `--cluster.settle-timeout` flag controls this

3. **Network ACLs blocking cluster port**
   - Ensure port 9094 (or your `--cluster.listen-address` port) is open between all instances
   - Both TCP and UDP are used by default (TCP only if using TLS transport)

4. **Unroutable advertise addresses**
   - If `--cluster.advertise-address` is not set, Alertmanager tries to auto-detect
   - For cloud/NAT environments, explicitly set a routable address

5. **Mismatched cluster configurations**
   - All instances should have the same `--cluster.peer-timeout` and gossip settings
   - Mismatches can cause unnecessary duplicates or missed notifications

## How It Works: End-to-End Example

### Scenario: 3-instance cluster, new alert group

1. **Alert arrives** at all 3 instances from Prometheus
2. **Dispatcher** creates aggregation group, waits `group_wait` (e.g., 30s)
3. **After group_wait**:
   - Each instance prepares to notify
4. **Notifier stage**:
   - All instances wait for gossip to settle (if just started)
   - **AM-1** (position 0): waits 0s, checks notification log (empty), sends notification, logs to nflog
   - **AM-2** (position 1): waits 15s, checks notification log (sees AM-1's entry), **skips** notification
   - **AM-3** (position 2): waits 30s, checks notification log (sees AM-1's entry), **skips** notification
5. **Result**: Exactly one notification sent (by AM-1)

### Scenario: AM-1 fails

1. **Alert arrives** at AM-2 and AM-3 only
2. **Dispatcher** creates group, waits `group_wait`
3. **Notifier stage**:
   - AM-1 is not in cluster (failed probe)
   - **AM-2** is now position 0: waits 0s, sends notification
   - **AM-3** is now position 1: waits 15s, sees AM-2's entry, skips
4. **Result**: Notification still sent (fail-open)

### Scenario: Network partition during notification

1. **Alert arrives** at all instances
2. **Network partition** splits AM-1 from AM-2/AM-3
3. **In partition A** (AM-1):
   - Position 0, waits 0s, sends notification
4. **In partition B** (AM-2, AM-3):
   - AM-2 is position 0, waits 0s, sends notification
   - AM-3 is position 1, waits 15s, deduplicates
5. **Result**: Two notifications sent (one per partition) - fail-open behavior

## Troubleshooting

### Check cluster status

```bash
# View cluster members via API
curl http://am-1:9093/api/v2/status

# Check metrics
curl http://am-1:9093/metrics | grep cluster
```

### Diagnose split-brain

If you suspect split-brain:

1. Check `alertmanager_cluster_members` on each instance
   - Should match total cluster size
2. Check `alertmanager_cluster_peer_info{state="alive"}`
   - Should show all peers as alive
3. Review network connectivity between instances

### Debug duplicate notifications

Duplicate notifications can occur due to:

1. **Network partitions** (expected, fail-open)
2. **Gossip not settled** - Check `--cluster.settle-timeout`
3. **Clock skew** - Ensure NTP is configured on all instances
4. **Notification log not replicating** - Check gossip metrics

Enable debug logging:

```bash
alertmanager --log.level=debug
```

Look for:

- `"Waiting for gossip to settle..."`
- `"gossip settled; proceeding"`
- Deduplication decisions in notification pipeline

## Further Reading

- [Alertmanager Configuration](configuration.md)
- [Secure Cluster Traffic Design](../doc/design/secure-cluster-traffic.md)
- [HashiCorp Memberlist Documentation](https://github.com/hashicorp/memberlist)