diff --git a/docs/high_availability.md b/docs/high_availability.md new file mode 100644 index 000000000..55b8d0cdb --- /dev/null +++ b/docs/high_availability.md @@ -0,0 +1,486 @@ +--- +title: High Availability +sort_rank: 4 +nav_icon: network +--- + +# High Availability + +Alertmanager supports configuration to create a cluster for high availability. This document describes how the HA mechanism works, its design goals, and operational considerations. + +## Design Goals + +The Alertmanager HA implementation is designed around three core principles: + +1. **Single pane view and management** - Silences and alerts can be viewed and managed from any cluster member, providing a unified operational experience +2. **Survive cluster split-brain with "fail open"** - During network partitions, Alertmanager prefers to send duplicate notifications rather than miss critical alerts +3. **At-least-once delivery** - The system guarantees that notifications are delivered at least once, in line with the fail-open philosophy + +These goals prioritize operational reliability and alert delivery over strict exactly-once semantics. + +## Architecture Overview + +An Alertmanager cluster consists of multiple Alertmanager instances that communicate using a gossip protocol. 
Each instance: + +- Receives alerts independently from Prometheus servers +- Participates in a peer-to-peer gossip mesh +- Replicates state (silences and notification log) to other cluster members +- Processes and sends notifications independently + +``` +┌──────────────┐ ┌──────────────┐ ┌──────────────┐ +│ Prometheus 1 │ │ Prometheus 2 │ │ Prometheus N │ +└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ + │ │ │ + │ alerts │ alerts │ alerts + │ │ │ + ▼ ▼ ▼ + ┌────────────────────────────────────────────┐ + │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ + │ │ AM-1 │ │ AM-2 │ │ AM-3 │ │ + │ │ (pos: 0) ├──┤ (pos: 1) ├──┤ (pos: 2) │ │ + │ └──────────┘ └──────────┘ └──────────┘ │ + │ Gossip Protocol (Memberlist) │ + └────────────────────────────────────────────┘ + │ │ │ + ▼ ▼ ▼ + Receivers Receivers Receivers +``` + +## Gossip Protocol + +Alertmanager uses [Hashicorp's Memberlist](https://github.com/hashicorp/memberlist) library to implement gossip-based communication. The gossip protocol handles: + +### Membership Management + +- **Automatic peer discovery** - Instances can be configured with a list of known peers and will automatically discover other cluster members +- **Health checking** - Regular probes detect failed members (default: every 1 second) +- **Failure detection** - Failed members are marked and can attempt to rejoin + +### State Replication + +The gossip layer replicates three types of state: + +1. **Silences** - Create, update, and delete operations are broadcast to all peers +2. **Notification log** - Records of which notifications were sent to prevent duplicates +3. **Membership changes** - Join, leave, and failure events + +State is eventually consistent - all cluster members will converge to the same state given sufficient time and network connectivity. + +### Gossip Settling + +When an Alertmanager starts or rejoins the cluster, it waits for gossip to "settle" before processing notifications. 
This prevents sending notifications based on incomplete state.
+
+The settling algorithm waits until:
+- The number of peers remains stable for 3 consecutive checks (default interval: push-pull interval)
+- Or a timeout occurs (configurable via context)
+
+During this time, the instance still receives and stores alerts but defers notification processing.
+
+## Notification Pipeline in HA Mode
+
+The notification pipeline operates differently in a clustered environment to ensure deduplication while maintaining at-least-once delivery:
+
+```
+┌────────────────────────────────────────────────┐
+│                DISPATCHER STAGE                │
+├────────────────────────────────────────────────┤
+│ 1. Find matching route(s)                      │
+│ 2. Find/create aggregation group within route  │
+│ 3. Throttle by group wait or group interval    │
+└───────────────────┬────────────────────────────┘
+                    │
+                    ▼
+┌────────────────────────────────────────────────┐
+│                 NOTIFIER STAGE                 │
+├────────────────────────────────────────────────┤
+│ 1. Wait for HA gossip to settle                │◄─── Ensures complete state
+│ 2. Filter inhibited alerts                     │
+│ 3. Filter non-time-active alerts               │
+│ 4. Filter time-muted alerts                    │
+│ 5. Filter silenced alerts                      │◄─── Uses replicated silences
+│ 6. Wait according to HA cluster peer index     │◄─── Staggered notifications
+│ 7. Dedupe by repeat interval/HA state          │◄─── Uses notification log
+│ 8. Notify & retry intermittent failures        │
+│ 9. Update notification log                     │◄─── Replicated to peers
+└────────────────────────────────────────────────┘
+```
+
+### HA-Specific Stages
+
+#### 1. Gossip Settling Wait
+
+Before the first notification from a group, the instance waits for gossip to settle. This ensures:
+- Silences are fully replicated
+- The notification log contains recent send records from other instances
+- The cluster membership is stable
+
+**Implementation**: `peer.WaitReady(ctx)`
+
+#### 2. 
Peer Position-Based Wait
+
+To prevent all cluster members from sending notifications simultaneously, each instance waits based on its position in the sorted peer list:
+
+```
+wait_time = peer_position × peer_timeout
+```
+
+For example, with 3 instances and a 15-second peer timeout:
+- Instance `am-1` (position 0): waits 0 seconds
+- Instance `am-2` (position 1): waits 15 seconds
+- Instance `am-3` (position 2): waits 30 seconds
+
+This staggered timing allows:
+- The first instance to send the notification
+- Subsequent instances to see the notification log entry
+- Deduplication to prevent duplicate sends
+
+**Implementation**: `clusterWait()` in `cmd/alertmanager/main.go:594`
+
+Position is determined by sorting all peer names alphabetically and finding the index of the local node:
+
+```go
+func (p *Peer) Position() int {
+	all := p.mlist.Members()
+	sort.Slice(all, func(i, j int) bool {
+		return all[i].Name < all[j].Name
+	})
+
+	// Find the position of self in the sorted list.
+	k := 0
+	for _, n := range all {
+		if n.Name == p.mlist.LocalNode().Name {
+			break
+		}
+		k++
+	}
+	return k
+}
+```
+
+#### 3. Deduplication via Notification Log
+
+The `DedupStage` queries the notification log to determine if a notification should be sent (simplified sketch):
+
+```go
+// Check the notification log for recent sends.
+entry, ok := nflog.Query(receiver, groupKey)
+if ok && !shouldNotify(entry, alerts, repeatInterval) {
+	// Skip: an identical notification was sent recently.
+	return nil
+}
+```
+
+Deduplication checks:
+- **Firing alerts changed?** If yes, notify
+- **Resolved alerts changed?** If yes and `send_resolved: true`, notify
+- **Repeat interval elapsed?** If yes, notify
+- **Otherwise**: Skip notification (deduplicated)
+
+The notification log is replicated via gossip, so all cluster members share the same send history.
+
+## Split-Brain Handling (Fail Open)
+
+During a network partition, the cluster may split into multiple groups that cannot communicate. 
Alertmanager's "fail open" design ensures alerts are still delivered:
+
+### Scenario: Network Partition
+
+```
+Before partition:
+┌────────┬────────┬────────┐
+│  AM-1  │  AM-2  │  AM-3  │
+└────────┴────────┴────────┘
+       Unified cluster
+
+After partition:
+┌────────┐   │   ┌────────┬────────┐
+│  AM-1  │   │   │  AM-2  │  AM-3  │
+└────────┘   │   └────────┴────────┘
+ Partition A │     Partition B
+```
+
+### Behavior During Partition
+
+**In Partition A** (AM-1 alone):
+- AM-1 sees itself as position 0
+- Waits 0 × timeout = 0 seconds
+- Sends notifications (no dedup from AM-2/AM-3)
+
+**In Partition B** (AM-2, AM-3):
+- AM-2 is position 0, AM-3 is position 1
+- AM-2 waits 0 seconds, sends notification
+- AM-3 sees AM-2's notification log entry, deduplicates
+
+**Result**: Duplicate notifications sent (one from Partition A, one from Partition B)
+
+This is **intentional** - Alertmanager prefers duplicate notifications over missed alerts.
+
+### After Partition Heals
+
+When the network partition heals:
+1. Gossip protocol detects all peers again
+2. Notification logs are merged (via CRDT-like merge with timestamp)
+3. Future notifications are deduplicated correctly across all instances
+4. Silences created in either partition are replicated to all peers
+
+## Silence Management in HA
+
+Silences are first-class replicated state in the cluster.
+
+### Silence Creation and Updates
+
+When a silence is created or updated on any instance:
+
+1. **Local storage** - Silence is stored in the local state map
+2. **Broadcast** - Silence is serialized (protobuf) and broadcast via gossip
+3. **Merge on receive** - Other instances receive and merge the silence:
+   ```go
+   // Merge sketch: last-write-wins on the UpdatedAt timestamp.
+   if !exists || incoming.UpdatedAt.After(existing.UpdatedAt) {
+       acceptUpdate(incoming)
+   }
+   ```
+4. 
**Indexing** - The silence matcher cache is updated for fast alert matching
+
+### Silence Expiry
+
+Silences have:
+- `StartsAt`, `EndsAt` - The active time range
+- `ExpiresAt` - When to garbage collect (EndsAt + retention period)
+- `UpdatedAt` - For conflict resolution during merge
+
+Each instance independently:
+- Evaluates silence state (pending/active/expired) based on current time
+- Garbage collects expired silences past their retention period; GC is local only (no gossip) since all instances converge to the same decision
+
+### Single Pane of Glass
+
+Users can interact with any Alertmanager instance in the cluster:
+- **View silences** - All instances have the same silence state (eventually consistent)
+- **Create/update silences** - Changes made on any instance propagate to all peers
+- **Delete silences** - Implemented as "expire immediately" + gossip
+
+This provides a unified operational experience regardless of which instance you access.
+
+## Operational Considerations
+
+### Configuration
+
+To configure a cluster, each Alertmanager instance needs:
+
+```yaml
+# alertmanager.yml
+global:
+  # ... other config ... 
+ +# No cluster config in YAML - use CLI flags +``` + +Command-line flags: + +```bash +alertmanager \ + --cluster.listen-address=0.0.0.0:9094 \ + --cluster.peer=am-1.example.com:9094 \ + --cluster.peer=am-2.example.com:9094 \ + --cluster.peer=am-3.example.com:9094 \ + --cluster.advertise-address=$(hostname):9094 \ + --cluster.peer-timeout=15s \ + --cluster.gossip-interval=200ms \ + --cluster.pushpull-interval=60s +``` + +Key flags: +- `--cluster.listen-address` - Bind address for cluster communication (default: `0.0.0.0:9094`) +- `--cluster.peer` - List of peer addresses (can be repeated) +- `--cluster.advertise-address` - Address advertised to peers (auto-detected if omitted) +- `--cluster.peer-timeout` - Wait time per peer position for deduplication (default: `15s`) +- `--cluster.gossip-interval` - How often to gossip (default: `200ms`) +- `--cluster.pushpull-interval` - Full state sync interval (default: `60s`) +- `--cluster.probe-interval` - Peer health check interval (default: `1s`) +- `--cluster.settle-timeout` - Max time to wait for gossip settling (default: context timeout) + +### Prometheus Configuration + +**Important**: Configure Prometheus to send alerts to **all** Alertmanager instances, not via a load balancer. + +```yaml +# prometheus.yml +alerting: + alertmanagers: + - static_configs: + - targets: + - am-1.example.com:9093 + - am-2.example.com:9093 + - am-3.example.com:9093 +``` + +This ensures: +- **Redundancy** - If one Alertmanager is down, others still receive alerts +- **Independent processing** - Each instance independently evaluates routing, grouping, and deduplication +- **No single point of failure** - Load balancers introduce a single point of failure + +### Cluster Size Considerations + +Since Alertmanager uses gossip without quorum or voting, **any N instances tolerate up to N-1 failures** - as long as one instance is alive, notifications will be sent. 
+
+However, cluster size involves tradeoffs:
+
+**Benefits of more instances:**
+- Greater resilience to simultaneous failures (hardware, network, datacenter outages)
+- Continued operation even during maintenance windows
+
+**Costs of more instances:**
+- More duplicate notifications during network partitions
+- More gossip traffic
+
+**Typical deployments:**
+- **2-3 instances** - Common for single-datacenter production deployments
+- **4-5 instances** - Multi-datacenter or highly critical environments
+
+**Note**: Unlike consensus-based systems (etcd, Raft), odd vs. even cluster sizes make no difference - there is no voting or quorum.
+
+### Monitoring Cluster Health
+
+Key metrics to monitor:
+
+```
+# Cluster size
+alertmanager_cluster_members
+
+# Peer health
+alertmanager_cluster_peer_info
+
+# Peer position (affects notification timing)
+alertmanager_peer_position
+
+# Failed peers
+alertmanager_cluster_failed_peers
+
+# State replication
+alertmanager_nflog_gossip_messages_propagated_total
+alertmanager_silences_gossip_messages_propagated_total
+```
+
+### Security
+
+By default, cluster communication is unencrypted. For production deployments, especially across WANs, use mutual TLS:
+
+```bash
+alertmanager \
+  --cluster.tls-config=/etc/alertmanager/cluster-tls.yml
+```
+
+See [Secure Cluster Traffic](../doc/design/secure-cluster-traffic.md) for details.
+
+### Persistence
+
+Each Alertmanager instance persists:
+- **Silences** - Stored in a snapshot file (default: `data/silences`)
+- **Notification log** - Stored in a snapshot file (default: `data/nflog`)
+
+On restart:
+1. Instance loads silences and notification log from disk
+2. Joins the cluster and gossips with peers
+3. Merges state received from peers (newer timestamps win)
+4. Begins processing notifications after gossip settling
+
+**Note**: Alerts themselves are **not** persisted - Prometheus re-sends firing alerts regularly.
+
+### Common Pitfalls
+
+1. 
**Load balancing Prometheus → Alertmanager** + - ❌ Don't use a load balancer + - ✅ Configure all instances in Prometheus + +2. **Not waiting for gossip to settle** + - Can lead to missed silences or duplicate notifications on startup + - The `--cluster.settle-timeout` flag controls this + +3. **Network ACLs blocking cluster port** + - Ensure port 9094 (or your `--cluster.listen-address` port) is open between all instances + - Both TCP and UDP are used by default (TCP only if using TLS transport) + +4. **Unroutable advertise addresses** + - If `--cluster.advertise-address` is not set, Alertmanager tries to auto-detect + - For cloud/NAT environments, explicitly set a routable address + +5. **Mismatched cluster configurations** + - All instances should have the same `--cluster.peer-timeout` and gossip settings + - Mismatches can cause unnecessary duplicates or missed notifications + +## How It Works: End-to-End Example + +### Scenario: 3-instance cluster, new alert group + +1. **Alert arrives** at all 3 instances from Prometheus +2. **Dispatcher** creates aggregation group, waits `group_wait` (e.g., 30s) +3. **After group_wait**: + - Each instance prepares to notify +4. **Notifier stage**: + - All instances wait for gossip to settle (if just started) + - **AM-1** (position 0): waits 0s, checks notification log (empty), sends notification, logs to nflog + - **AM-2** (position 1): waits 15s, checks notification log (sees AM-1's entry), **skips** notification + - **AM-3** (position 2): waits 30s, checks notification log (sees AM-1's entry), **skips** notification +5. **Result**: Exactly one notification sent (by AM-1) + +### Scenario: AM-1 fails + +1. **Alert arrives** at AM-2 and AM-3 only +2. **Dispatcher** creates group, waits `group_wait` +3. **Notifier stage**: + - AM-1 is not in cluster (failed probe) + - **AM-2** is now position 0: waits 0s, sends notification + - **AM-3** is now position 1: waits 15s, sees AM-2's entry, skips +4. 
**Result**: Notification still sent (fail-open) + +### Scenario: Network partition during notification + +1. **Alert arrives** at all instances +2. **Network partition** splits AM-1 from AM-2/AM-3 +3. **In partition A** (AM-1): + - Position 0, waits 0s, sends notification +4. **In partition B** (AM-2, AM-3): + - AM-2 is position 0, waits 0s, sends notification + - AM-3 is position 1, waits 15s, deduplicates +5. **Result**: Two notifications sent (one per partition) - fail-open behavior + +## Troubleshooting + +### Check cluster status + +```bash +# View cluster members via API +curl http://am-1:9093/api/v2/status + +# Check metrics +curl http://am-1:9093/metrics | grep cluster +``` + +### Diagnose split-brain + +If you suspect split-brain: + +1. Check `alertmanager_cluster_members` on each instance + - Should match total cluster size +2. Check `alertmanager_cluster_peer_info{state="alive"}` + - Should show all peers as alive +3. Review network connectivity between instances + +### Debug duplicate notifications + +Duplicate notifications can occur due to: + +1. **Network partitions** (expected, fail-open) +2. **Gossip not settled** - Check `--cluster.settle-timeout` +3. **Clock skew** - Ensure NTP is configured on all instances +4. **Notification log not replicating** - Check gossip metrics + +Enable debug logging: + +```bash +alertmanager --log.level=debug +``` + +Look for: +- `"Waiting for gossip to settle..."` +- `"gossip settled; proceeding"` +- Deduplication decisions in notification pipeline + +## Further Reading + +- [Alertmanager Configuration](configuration.md) +- [Secure Cluster Traffic Design](../doc/design/secure-cluster-traffic.md) +- [Hashicorp Memberlist Documentation](https://github.com/hashicorp/memberlist)