---
title: High Availability
sort_rank: 4
nav_icon: network
---

# High Availability

Alertmanager supports configuration to create a cluster for high availability. This document describes how the HA mechanism works, its design goals, and operational considerations.

## Design Goals

The Alertmanager HA implementation is designed around three core principles:

1. **Single pane view and management** - Silences and alerts can be viewed and managed from any cluster member, providing a unified operational experience
2. **Survive cluster split-brain with "fail open"** - During network partitions, Alertmanager prefers to send duplicate notifications rather than miss critical alerts
3. **At-least-once delivery** - The system guarantees that notifications are delivered at least once, in line with the fail-open philosophy

These goals prioritize operational reliability and alert delivery over strict exactly-once semantics.

## Architecture Overview

An Alertmanager cluster consists of multiple Alertmanager instances that communicate using a gossip protocol. Each instance:

- Receives alerts independently from Prometheus servers
- Participates in a peer-to-peer gossip mesh
- Replicates state (silences and notification log) to other cluster members
- Processes and sends notifications independently

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Prometheus 1 │     │ Prometheus 2 │     │ Prometheus N │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       │ alerts             │ alerts             │ alerts
       │                    │                    │
       ▼                    ▼                    ▼
┌────────────────────────────────────────────────────────┐
│   ┌──────────┐      ┌──────────┐      ┌──────────┐     │
│   │   AM-1   │      │   AM-2   │      │   AM-3   │     │
│   │ (pos: 0) ├──────┤ (pos: 1) ├──────┤ (pos: 2) │     │
│   └──────────┘      └──────────┘      └──────────┘     │
│              Gossip Protocol (Memberlist)              │
└────────────────────────────────────────────────────────┘
       │                    │                    │
       ▼                    ▼                    ▼
   Receivers            Receivers            Receivers
```

## Gossip Protocol

Alertmanager uses [HashiCorp's memberlist](https://github.com/hashicorp/memberlist) library to implement gossip-based communication. The gossip protocol handles:

### Membership Management

- **Automatic peer discovery** - Instances can be configured with a list of known peers and will automatically discover other cluster members
- **Health checking** - Regular probes detect failed members (default: every 1 second)
- **Failure detection** - Failed members are marked and can attempt to rejoin

### State Replication

The gossip layer replicates three types of state:

1. **Silences** - Create, update, and delete operations are broadcast to all peers
2. **Notification log** - Records of which notifications were sent, used to prevent duplicates
3. **Membership changes** - Join, leave, and failure events

State is eventually consistent: all cluster members converge to the same state given sufficient time and network connectivity.

### Gossip Settling

When an Alertmanager instance starts or rejoins the cluster, it waits for gossip to "settle" before processing notifications. This prevents sending notifications based on incomplete state.

The settling algorithm waits until:

- The number of peers remains stable for 3 consecutive checks (default interval: push-pull interval)
- Or a timeout occurs (configurable via context)

During this time, the instance still receives and stores alerts but defers notification processing.
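To make the settling behavior concrete, here is a minimal sketch of the wait loop described above. It is not the actual implementation in Alertmanager's `cluster` package; `numPeers` is a hypothetical callback standing in for the current member count.

```go
package example

import (
	"context"
	"time"
)

// waitForSettle is a simplified sketch of gossip settling: proceed once the
// peer count has been stable for three consecutive checks, or give up when
// the context times out (failing open rather than blocking notifications).
func waitForSettle(ctx context.Context, interval time.Duration, numPeers func() int) error {
	const requiredStableChecks = 3

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	prev := -1
	stable := 0
	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // timeout: the caller decides to proceed anyway
		case <-ticker.C:
			n := numPeers()
			if n == prev {
				stable++
			} else {
				stable = 0
			}
			prev = n
			if stable >= requiredStableChecks {
				return nil // peer count unchanged for 3 consecutive checks
			}
		}
	}
}
```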
## Notification Pipeline in HA Mode

The notification pipeline operates differently in a clustered environment to ensure deduplication while maintaining at-least-once delivery:

```
┌────────────────────────────────────────────────┐
│                DISPATCHER STAGE                │
├────────────────────────────────────────────────┤
│ 1. Find matching route(s)                      │
│ 2. Find/create aggregation group within route  │
│ 3. Throttle by group wait or group interval    │
└───────────────────┬────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────┐
│                 NOTIFIER STAGE                 │
├────────────────────────────────────────────────┤
│ 1. Wait for HA gossip to settle                │◄─── Ensures complete state
│ 2. Filter inhibited alerts                     │
│ 3. Filter non-time-active alerts               │
│ 4. Filter time-muted alerts                    │
│ 5. Filter silenced alerts                      │◄─── Uses replicated silences
│ 6. Wait according to HA cluster peer index     │◄─── Staggered notifications
│ 7. Dedupe by repeat interval/HA state          │◄─── Uses notification log
│ 8. Notify & retry intermittent failures        │
│ 9. Update notification log                     │◄─── Replicated to peers
└────────────────────────────────────────────────┘
```

### HA-Specific Stages

#### 1. Gossip Settling Wait

Before the first notification from a group, the instance waits for gossip to settle. This ensures:

- Silences are fully replicated
- The notification log contains recent send records from other instances
- The cluster membership is stable

**Implementation**: `peer.WaitReady(ctx)`

#### 2. Peer Position-Based Wait

To prevent all cluster members from sending notifications simultaneously, each instance waits based on its position in the sorted peer list:

```
wait_time = peer_position × peer_timeout
```

For example, with 3 instances and a 15-second peer timeout:

- Instance `am-1` (position 0): waits 0 seconds
- Instance `am-2` (position 1): waits 15 seconds
- Instance `am-3` (position 2): waits 30 seconds

This staggered timing allows:

- The first instance to send the notification
- Subsequent instances to see the notification log entry
- Deduplication to prevent duplicate sends

**Implementation**: `clusterWait()` in `cmd/alertmanager/main.go:594`

Position is determined by sorting all peer names alphabetically:

```go
func (p *Peer) Position() int {
	all := p.mlist.Members()
	sort.Slice(all, func(i, j int) bool { return all[i].Name < all[j].Name })
	// Find position of self in sorted list.
	for i, n := range all {
		if n.Name == p.mlist.LocalNode().Name {
			return i
		}
	}
	return 0
}
```

#### 3. Deduplication via Notification Log

The `DedupStage` queries the notification log to determine if a notification should be sent:

```go
// Check the notification log for recent sends (simplified).
entry := nflog.Query(receiver, groupKey)
if entry.exists && !shouldNotify(entry, alerts, repeatInterval) {
	// Skip: already notified recently.
	return nil
}
```

Deduplication checks:

- **Firing alerts changed?** If yes, notify
- **Resolved alerts changed?** If yes and `send_resolved: true`, notify
- **Repeat interval elapsed?** If yes, notify
- **Otherwise**: Skip the notification (deduplicated)

The notification log is replicated via gossip, so all cluster members share the same send history.
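To make the deduplication decision explicit, the following sketch implements the three checks listed above. The `logEntry` struct and the hash parameters are simplified stand-ins invented for this illustration; they are not Alertmanager's actual notification-log types.

```go
package example

import "time"

// logEntry is a simplified stand-in for a notification log entry; the real
// entry stores hashes of the firing and resolved alert sets plus a timestamp.
type logEntry struct {
	FiringHash   uint64
	ResolvedHash uint64
	Timestamp    time.Time
}

// shouldNotify sketches the deduplication decision described above.
func shouldNotify(entry *logEntry, firingHash, resolvedHash uint64,
	sendResolved bool, repeatInterval time.Duration, now time.Time) bool {

	if entry == nil {
		return true // nothing has been sent yet for this receiver/group
	}
	if entry.FiringHash != firingHash {
		return true // the set of firing alerts changed
	}
	if sendResolved && entry.ResolvedHash != resolvedHash {
		return true // resolved alerts changed and resolved notifications are enabled
	}
	// Otherwise only notify again once the repeat interval has elapsed.
	return now.Sub(entry.Timestamp) >= repeatInterval
}
```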
## Split-Brain Handling (Fail Open)

During a network partition, the cluster may split into multiple groups that cannot communicate. Alertmanager's "fail open" design ensures alerts are still delivered:

### Scenario: Network Partition

```
Before partition:
┌────────┬────────┬────────┐
│  AM-1  │  AM-2  │  AM-3  │
└────────┴────────┴────────┘
      Unified cluster

After partition:
┌────────┐      │      ┌────────┬────────┐
│  AM-1  │      │      │  AM-2  │  AM-3  │
└────────┘      │      └────────┴────────┘
 Partition A    │          Partition B
```

### Behavior During Partition

**In Partition A** (AM-1 alone):

- AM-1 sees itself as position 0
- Waits 0 × timeout = 0 seconds
- Sends notifications (no dedup from AM-2/AM-3)

**In Partition B** (AM-2, AM-3):

- AM-2 is position 0, AM-3 is position 1
- AM-2 waits 0 seconds, sends the notification
- AM-3 sees AM-2's notification log entry, deduplicates

**Result**: Duplicate notifications are sent (one from Partition A, one from Partition B).

This is **intentional** - Alertmanager prefers duplicate notifications over missed alerts.

### After Partition Heals

When the network partition heals:

1. The gossip protocol detects all peers again
2. Notification logs are merged (via a CRDT-like, timestamp-based merge)
3. Future notifications are deduplicated correctly across all instances
4. Silences created in either partition are replicated to all peers

## Silence Management in HA

Silences are first-class replicated state in the cluster.

### Silence Creation and Updates

When a silence is created or updated on any instance:

1. **Local storage** - The silence is stored in the local state map
2. **Broadcast** - The silence is serialized (protobuf) and broadcast via gossip
3. **Merge on receive** - Other instances receive and merge the silence:

```go
// Merge logic: last-write-wins based on the UpdatedAt timestamp (simplified).
if !exists || incoming.UpdatedAt > existing.UpdatedAt {
	accept_update()
}
```

4. **Indexing** - The silence matcher cache is updated for fast alert matching

### Silence Expiry

Silences have:

- `StartsAt`, `EndsAt` - The active time range
- `ExpiresAt` - When to garbage collect (`EndsAt` + retention period)
- `UpdatedAt` - For conflict resolution during merge

Each instance independently:

- Evaluates silence state (pending/active/expired) based on the current time
- Garbage collects expired silences past their retention period

Garbage collection is local only (no gossip), since all instances converge to the same decision.
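As a small illustration of the per-instance evaluation just described, the sketch below derives a silence's state from its time range and the local clock; because only timestamps are involved, every instance reaches the same answer without gossip. The type and function names are invented for the example and are not the silence package's API.

```go
package example

import "time"

// State mirrors the pending/active/expired lifecycle described above.
type State string

const (
	StatePending State = "pending"
	StateActive  State = "active"
	StateExpired State = "expired"
)

// silenceState derives a silence's state purely from its time range
// and the current time on the evaluating instance.
func silenceState(startsAt, endsAt, now time.Time) State {
	switch {
	case now.Before(startsAt):
		return StatePending
	case now.Before(endsAt):
		return StateActive
	default:
		return StateExpired
	}
}
```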
### Single Pane of Glass

Users can interact with any Alertmanager instance in the cluster:

- **View silences** - All instances have the same silence state (eventually consistent)
- **Create/update silences** - Changes made on any instance propagate to all peers
- **Delete silences** - Implemented as "expire immediately" + gossip

This provides a unified operational experience regardless of which instance you access.

## Operational Considerations

### Configuration

Clustering is configured via command-line flags on each Alertmanager instance, not in `alertmanager.yml`:

```yaml
# alertmanager.yml
global:
  # ... other config ...
# No cluster config in YAML - use CLI flags
```

Command-line flags:

```bash
alertmanager \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am-1.example.com:9094 \
  --cluster.peer=am-2.example.com:9094 \
  --cluster.peer=am-3.example.com:9094 \
  --cluster.advertise-address=$(hostname):9094 \
  --cluster.peer-timeout=15s \
  --cluster.gossip-interval=200ms \
  --cluster.pushpull-interval=60s
```

Key flags:

- `--cluster.listen-address` - Bind address for cluster communication (default: `0.0.0.0:9094`)
- `--cluster.peer` - List of peer addresses (can be repeated)
- `--cluster.advertise-address` - Address advertised to peers (auto-detected if omitted)
- `--cluster.peer-timeout` - Wait time per peer position for deduplication (default: `15s`)
- `--cluster.gossip-interval` - How often to gossip (default: `200ms`)
- `--cluster.pushpull-interval` - Full state sync interval (default: `60s`)
- `--cluster.probe-interval` - Peer health check interval (default: `1s`)
- `--cluster.settle-timeout` - Max time to wait for gossip settling (default: context timeout)

### Prometheus Configuration

**Important**: Configure Prometheus to send alerts to **all** Alertmanager instances, not via a load balancer.

```yaml
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - am-1.example.com:9093
            - am-2.example.com:9093
            - am-3.example.com:9093
```

This ensures:

- **Redundancy** - If one Alertmanager is down, the others still receive alerts
- **Independent processing** - Each instance independently evaluates routing, grouping, and deduplication
- **No single point of failure** - A load balancer would itself be a single point of failure

### Cluster Size Considerations

Since Alertmanager uses gossip without quorum or voting, **any N instances tolerate up to N-1 failures** - as long as one instance is alive, notifications will be sent.

However, cluster size involves tradeoffs:

**Benefits of more instances:**

- Greater resilience to simultaneous failures (hardware, network, datacenter outages)
- Continued operation even during maintenance windows

**Costs of more instances:**

- More duplicate notifications during network partitions
- More gossip traffic

**Typical deployments:**

- **2-3 instances** - Common for single-datacenter production deployments
- **4-5 instances** - Multi-datacenter or highly critical environments

**Note**: Unlike consensus-based systems (etcd, Raft), odd vs. even cluster sizes make no difference - there is no voting or quorum.

### Monitoring Cluster Health

Key metrics to monitor:

```
# Cluster size
alertmanager_cluster_members

# Peer health
alertmanager_cluster_peer_info

# Peer position (affects notification timing)
alertmanager_peer_position

# Failed peers
alertmanager_cluster_failed_peers

# State replication
alertmanager_nflog_gossip_messages_propagated_total
alertmanager_silences_gossip_messages_propagated_total
```
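One simple external health check is to compare the reported peer count against the expected cluster size. The sketch below queries the `/api/v2/status` endpoint (the same one used with `curl` in the Troubleshooting section below); the response fields it decodes (`cluster.status`, `cluster.peers`) are an assumption about the v2 API shape, so verify them against your Alertmanager version.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// clusterStatus decodes only the fields needed for a health check.
// The field names are assumed from the v2 status API; adjust as needed.
type clusterStatus struct {
	Cluster struct {
		Status string `json:"status"`
		Peers  []struct {
			Name    string `json:"name"`
			Address string `json:"address"`
		} `json:"peers"`
	} `json:"cluster"`
}

func main() {
	const expectedPeers = 3 // adjust to your cluster size

	resp, err := http.Get("http://am-1.example.com:9093/api/v2/status")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var s clusterStatus
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}

	if s.Cluster.Status != "ready" || len(s.Cluster.Peers) != expectedPeers {
		fmt.Printf("cluster degraded: status=%q peers=%d (want %d)\n",
			s.Cluster.Status, len(s.Cluster.Peers), expectedPeers)
		return
	}
	fmt.Println("cluster healthy")
}
```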
### Security

By default, cluster communication is unencrypted. For production deployments, especially across WANs, use mutual TLS:

```bash
alertmanager \
  --cluster.tls-config=/etc/alertmanager/cluster-tls.yml
```

See [Secure Cluster Traffic](../doc/design/secure-cluster-traffic.md) for details.

### Persistence

Each Alertmanager instance persists:

- **Silences** - Stored in a snapshot file (default: `data/silences`)
- **Notification log** - Stored in a snapshot file (default: `data/nflog`)

On restart:

1. The instance loads silences and the notification log from disk
2. Joins the cluster and gossips with peers
3. Merges state received from peers (newer timestamps win)
4. Begins processing notifications after gossip settling

**Note**: Alerts themselves are **not** persisted - Prometheus re-sends firing alerts regularly.

### Common Pitfalls

1. **Load balancing Prometheus → Alertmanager**
   - ❌ Don't use a load balancer
   - ✅ Configure all instances in Prometheus
2. **Not waiting for gossip to settle**
   - Can lead to missed silences or duplicate notifications on startup
   - The `--cluster.settle-timeout` flag controls this
3. **Network ACLs blocking the cluster port**
   - Ensure port 9094 (or your `--cluster.listen-address` port) is open between all instances
   - Both TCP and UDP are used by default (TCP only if using the TLS transport)
4. **Unroutable advertise addresses**
   - If `--cluster.advertise-address` is not set, Alertmanager tries to auto-detect it
   - For cloud/NAT environments, explicitly set a routable address
5. **Mismatched cluster configurations**
   - All instances should have the same `--cluster.peer-timeout` and gossip settings
   - Mismatches can cause unnecessary duplicates or missed notifications

## How It Works: End-to-End Example

### Scenario: 3-instance cluster, new alert group

1. **Alert arrives** at all 3 instances from Prometheus
2. **Dispatcher** creates an aggregation group and waits `group_wait` (e.g., 30s)
3. **After group_wait**:
   - Each instance prepares to notify
4. **Notifier stage**:
   - All instances wait for gossip to settle (if just started)
   - **AM-1** (position 0): waits 0s, checks the notification log (empty), sends the notification, logs it to the nflog
   - **AM-2** (position 1): waits 15s, checks the notification log (sees AM-1's entry), **skips** the notification
   - **AM-3** (position 2): waits 30s, checks the notification log (sees AM-1's entry), **skips** the notification
5. **Result**: Exactly one notification sent (by AM-1)

### Scenario: AM-1 fails

1. **Alert arrives** at AM-2 and AM-3 only
2. **Dispatcher** creates the group and waits `group_wait`
3. **Notifier stage**:
   - AM-1 is no longer in the cluster (failed probe)
   - **AM-2** is now position 0: waits 0s, sends the notification
   - **AM-3** is now position 1: waits 15s, sees AM-2's entry, skips
4. **Result**: The notification is still sent (fail-open)

### Scenario: Network partition during notification

1. **Alert arrives** at all instances
2. **Network partition** splits AM-1 from AM-2/AM-3
3. **In partition A** (AM-1):
   - Position 0, waits 0s, sends the notification
4. **In partition B** (AM-2, AM-3):
   - AM-2 is position 0, waits 0s, sends the notification
   - AM-3 is position 1, waits 15s, deduplicates
5. **Result**: Two notifications sent (one per partition) - fail-open behavior
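The position arithmetic behind these scenarios can be reproduced with a few lines of Go. This is a toy illustration of the `wait_time = peer_position × peer_timeout` rule, not Alertmanager's `clusterWait()` implementation; the `waitFor` helper is invented for the example and sorts peer names just as in the `Position()` snippet earlier.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// waitFor computes the stagger delay for one instance given the set of
// currently live peers: wait = position in the sorted peer list × peer timeout.
func waitFor(self string, livePeers []string, peerTimeout time.Duration) time.Duration {
	sort.Strings(livePeers)
	for i, name := range livePeers {
		if name == self {
			return time.Duration(i) * peerTimeout
		}
	}
	return 0
}

func main() {
	const timeout = 15 * time.Second

	// Healthy cluster: am-1 sends first (0s); am-2 is position 1 and waits 15s.
	fmt.Println(waitFor("am-2", []string{"am-1", "am-2", "am-3"}, timeout)) // 15s

	// After am-1 fails, am-2 shifts to position 0 and sends immediately.
	fmt.Println(waitFor("am-2", []string{"am-2", "am-3"}, timeout)) // 0s
}
```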
## Troubleshooting

### Check cluster status

```bash
# View cluster members via API
curl http://am-1:9093/api/v2/status

# Check metrics
curl http://am-1:9093/metrics | grep cluster
```

### Diagnose split-brain

If you suspect split-brain:

1. Check `alertmanager_cluster_members` on each instance
   - Should match total cluster size
2. Check `alertmanager_cluster_peer_info{state="alive"}`
   - Should show all peers as alive
3. Review network connectivity between instances

### Debug duplicate notifications

Duplicate notifications can occur due to:

1. **Network partitions** (expected, fail-open)
2. **Gossip not settled** - Check `--cluster.settle-timeout`
3. **Clock skew** - Ensure NTP is configured on all instances
4. **Notification log not replicating** - Check gossip metrics

Enable debug logging:

```bash
alertmanager --log.level=debug
```

Look for:

- `"Waiting for gossip to settle..."`
- `"gossip settled; proceeding"`
- Deduplication decisions in the notification pipeline

## Further Reading

- [Alertmanager Configuration](configuration.md)
- [Secure Cluster Traffic Design](../doc/design/secure-cluster-traffic.md)
- [HashiCorp Memberlist Documentation](https://github.com/hashicorp/memberlist)