---
title: High Availability
sort_rank: 4
nav_icon: network
---

# High Availability

Alertmanager can be configured to run as a cluster for high availability. This document describes how the HA mechanism works, its design goals, and operational considerations.

## Design Goals

The Alertmanager HA implementation is designed around three core principles:

1. **Single pane view and management** - Silences and alerts can be viewed and managed from any cluster member, providing a unified operational experience
2. **Survive cluster split-brain with "fail open"** - During network partitions, Alertmanager prefers to send duplicate notifications rather than miss critical alerts
3. **At-least-once delivery** - The system guarantees that notifications are delivered at least once, in line with the fail-open philosophy

These goals prioritize operational reliability and alert delivery over strict exactly-once semantics.

## Architecture Overview

An Alertmanager cluster consists of multiple Alertmanager instances that communicate using a gossip protocol. Each instance:

- Receives alerts independently from Prometheus servers
- Participates in a peer-to-peer gossip mesh
- Replicates state (silences and notification log) to other cluster members
- Processes and sends notifications independently

```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Prometheus 1 │   │ Prometheus 2 │   │ Prometheus N │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       │ alerts           │ alerts           │ alerts
       │                  │                  │
       ▼                  ▼                  ▼
┌────────────────────────────────────────────────────┐
│   ┌──────────┐     ┌──────────┐     ┌──────────┐   │
│   │   AM-1   │     │   AM-2   │     │   AM-3   │   │
│   │ (pos: 0) ├─────┤ (pos: 1) ├─────┤ (pos: 2) │   │
│   └──────────┘     └──────────┘     └──────────┘   │
│            Gossip Protocol (Memberlist)            │
└────────────────────────────────────────────────────┘
       │                  │                  │
       ▼                  ▼                  ▼
   Receivers          Receivers          Receivers
```

## Gossip Protocol

Alertmanager uses [HashiCorp's Memberlist](https://github.com/hashicorp/memberlist) library to implement gossip-based communication. The gossip protocol handles:

### Membership Management

- **Automatic peer discovery** - Instances can be configured with a list of known peers and will automatically discover other cluster members
- **Health checking** - Regular probes detect failed members (default: every 1 second)
- **Failure detection** - Failed members are marked and can attempt to rejoin

### State Replication

The gossip layer replicates three types of state:

1. **Silences** - Create, update, and delete operations are broadcast to all peers
2. **Notification log** - Records of which notifications were sent, to prevent duplicates
3. **Membership changes** - Join, leave, and failure events

State is eventually consistent: all cluster members converge to the same state given sufficient time and network connectivity.

### Gossip Settling

When an Alertmanager starts or rejoins the cluster, it waits for gossip to "settle" before processing notifications. This prevents sending notifications based on incomplete state.

The settling algorithm waits until:

- The number of peers has remained stable for 3 consecutive checks (performed at the push-pull interval by default)
- Or a timeout occurs (configurable via context)

During this time, the instance still receives and stores alerts, but defers notification processing.

## Notification Pipeline in HA Mode

The notification pipeline operates differently in a clustered environment to ensure deduplication while maintaining at-least-once delivery:

```
┌────────────────────────────────────────────────┐
│                DISPATCHER STAGE                │
├────────────────────────────────────────────────┤
│ 1. Find matching route(s)                      │
│ 2. Find/create aggregation group within route  │
│ 3. Throttle by group wait or group interval    │
└───────────────────┬────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────┐
│                 NOTIFIER STAGE                 │
├────────────────────────────────────────────────┤
│ 1. Wait for HA gossip to settle                │◄─── Ensures complete state
│ 2. Filter inhibited alerts                     │
│ 3. Filter non-time-active alerts               │
│ 4. Filter time-muted alerts                    │
│ 5. Filter silenced alerts                      │◄─── Uses replicated silences
│ 6. Wait according to HA cluster peer index     │◄─── Staggered notifications
│ 7. Dedupe by repeat interval/HA state          │◄─── Uses notification log
│ 8. Notify & retry intermittent failures        │
│ 9. Update notification log                     │◄─── Replicated to peers
└────────────────────────────────────────────────┘
```

### HA-Specific Stages

#### 1. Gossip Settling Wait

Before the first notification from a group, the instance waits for gossip to settle. This ensures:

- Silences are fully replicated
- The notification log contains recent send records from other instances
- The cluster membership is stable

**Implementation**: `peer.WaitReady(ctx)`

#### 2. Peer Position-Based Wait

To prevent all cluster members from sending notifications simultaneously, each instance waits based on its position in the sorted peer list:

```
wait_time = peer_position × peer_timeout
```

For example, with 3 instances and a 15-second peer timeout:

- Instance `am-1` (position 0): waits 0 seconds
- Instance `am-2` (position 1): waits 15 seconds
- Instance `am-3` (position 2): waits 30 seconds

This staggered timing allows:

- The first instance to send the notification
- Subsequent instances to see the notification log entry
- Deduplication to prevent duplicate sends

**Implementation**: `clusterWait()` in `cmd/alertmanager/main.go:594`

Position is determined by sorting all peer names alphabetically:

```go
func (p *Peer) Position() int {
	all := p.mlist.Members()
	sort.Slice(all, func(i, j int) bool {
		return all[i].Name < all[j].Name
	})
	// Find the position of this instance in the sorted list.
	for i, n := range all {
		if n.Name == p.mlist.LocalNode().Name {
			return i
		}
	}
	return 0
}
```

#### 3. Deduplication via Notification Log

The `DedupStage` queries the notification log to determine if a notification should be sent:

```go
// Check the notification log for a recent send to this receiver and group.
entry, found := nflog.Query(receiver, groupKey)
if found && !shouldNotify(entry, alerts, repeatInterval) {
	// Skip: already notified recently.
	return nil
}
```

Deduplication checks:

- **Firing alerts changed?** If yes, notify
- **Resolved alerts changed?** If yes and `send_resolved: true`, notify
- **Repeat interval elapsed?** If yes, notify
- **Otherwise**: Skip notification (deduplicated)

The notification log is replicated via gossip, so all cluster members share the same send history.

## Split-Brain Handling (Fail Open)

During a network partition, the cluster may split into multiple groups that cannot communicate. Alertmanager's "fail open" design ensures alerts are still delivered:

### Scenario: Network Partition

```
Before partition:
┌────────┬────────┬────────┐
│  AM-1  │  AM-2  │  AM-3  │
└────────┴────────┴────────┘
      Unified cluster

After partition:
┌────────┐    │   ┌────────┬────────┐
│  AM-1  │    │   │  AM-2  │  AM-3  │
└────────┘    │   └────────┴────────┘
Partition A   │      Partition B
```

### Behavior During Partition

**In Partition A** (AM-1 alone):

- AM-1 sees itself as position 0
- Waits 0 × timeout = 0 seconds
- Sends notifications (no dedup from AM-2/AM-3)

**In Partition B** (AM-2, AM-3):

- AM-2 is position 0, AM-3 is position 1
- AM-2 waits 0 seconds, sends notification
- AM-3 sees AM-2's notification log entry, deduplicates

**Result**: Duplicate notifications sent (one from Partition A, one from Partition B)

This is **intentional** - Alertmanager prefers duplicate notifications over missed alerts.

### After Partition Heals

When the network partition heals:

1. Gossip protocol detects all peers again
2. Notification logs are merged (a CRDT-like merge where the newer timestamp wins; see the sketch below)
3. Future notifications are deduplicated correctly across all instances
4. Silences created in either partition are replicated to all peers

## Silence Management in HA

Silences are first-class replicated state in the cluster.

### Silence Creation and Updates

When a silence is created or updated on any instance:

1. **Local storage** - The silence is stored in the local state map
2. **Broadcast** - The silence is serialized (protobuf) and broadcast via gossip
3. **Merge on receive** - Other instances receive and merge the silence:

   ```go
   // Merge logic: last-write-wins based on the UpdatedAt timestamp.
   if !exists || incoming.UpdatedAt.After(existing.UpdatedAt) {
       acceptUpdate(incoming)
   }
   ```

4. **Indexing** - The silence matcher cache is updated for fast alert matching

### Silence Expiry

Silences have:

- `StartsAt`, `EndsAt` - The active time range
- `ExpiresAt` - When to garbage collect (`EndsAt` + retention period)
- `UpdatedAt` - For conflict resolution during merge

Each instance independently:

- Evaluates silence state (pending/active/expired) based on the current time
- Garbage collects expired silences past their retention period

The GC is local only (no gossip), since all instances converge to the same decision.

### Single Pane of Glass

Users can interact with any Alertmanager instance in the cluster:

- **View silences** - All instances have the same silence state (eventually consistent)
- **Create/update silences** - Changes made on any instance propagate to all peers
- **Delete silences** - Implemented as "expire immediately" + gossip

This provides a unified operational experience regardless of which instance you access.

## Operational Considerations

### Configuration

To configure a cluster, each Alertmanager instance needs the cluster flags set on the command line; clustering is not configured in `alertmanager.yml`:

```yaml
# alertmanager.yml
global:
  # ... other config ...

# No cluster configuration here - clustering is set up via CLI flags.
```

Command-line flags:

```bash
alertmanager \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am-1.example.com:9094 \
  --cluster.peer=am-2.example.com:9094 \
  --cluster.peer=am-3.example.com:9094 \
  --cluster.advertise-address=$(hostname):9094 \
  --cluster.peer-timeout=15s \
  --cluster.gossip-interval=200ms \
  --cluster.pushpull-interval=60s
```

Key flags:

- `--cluster.listen-address` - Bind address for cluster communication (default: `0.0.0.0:9094`)
- `--cluster.peer` - List of peer addresses (can be repeated)
- `--cluster.advertise-address` - Address advertised to peers (auto-detected if omitted)
- `--cluster.peer-timeout` - Wait time per peer position for deduplication (default: `15s`)
- `--cluster.gossip-interval` - How often to gossip (default: `200ms`)
- `--cluster.pushpull-interval` - Full state sync interval (default: `60s`)
- `--cluster.probe-interval` - Peer health check interval (default: `1s`)
- `--cluster.settle-timeout` - Max time to wait for gossip settling (default: context timeout)

### Prometheus Configuration

**Important**: Configure Prometheus to send alerts to **all** Alertmanager instances, not via a load balancer.

```yaml
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - am-1.example.com:9093
            - am-2.example.com:9093
            - am-3.example.com:9093
```

This ensures:

- **Redundancy** - If one Alertmanager is down, others still receive alerts
- **Independent processing** - Each instance independently evaluates routing, grouping, and deduplication
- **No single point of failure** - A load balancer in front of the cluster would itself be a single point of failure

### Cluster Size Considerations

Since Alertmanager uses gossip without quorum or voting, **any N instances tolerate up to N-1 failures**: as long as one instance is alive, notifications will be sent.

However, cluster size involves tradeoffs:

**Benefits of more instances:**

- Greater resilience to simultaneous failures (hardware, network, datacenter outages)
- Continued operation even during maintenance windows

**Costs of more instances:**

- More duplicate notifications during network partitions
- More gossip traffic

**Typical deployments:**

- **2-3 instances** - Common for single-datacenter production deployments
- **4-5 instances** - Multi-datacenter or highly critical environments

**Note**: Unlike consensus-based systems (etcd, Raft), odd vs. even cluster sizes make no difference - there is no voting or quorum.

### Monitoring Cluster Health

Key metrics to monitor:

```
# Cluster size
alertmanager_cluster_members

# Peer health
alertmanager_cluster_peer_info

# Peer position (affects notification timing)
alertmanager_peer_position

# Failed peers
alertmanager_cluster_failed_peers

# State replication
alertmanager_nflog_gossip_messages_propagated_total
alertmanager_silences_gossip_messages_propagated_total
```

### Security

By default, cluster communication is unencrypted. For production deployments, especially across WANs, use mutual TLS:

```bash
alertmanager \
  --cluster.tls-config=/etc/alertmanager/cluster-tls.yml
```

See [Secure Cluster Traffic](../doc/design/secure-cluster-traffic.md) for details.

### Persistence

Each Alertmanager instance persists:

- **Silences** - Stored in a snapshot file (default: `data/silences`)
- **Notification log** - Stored in a snapshot file (default: `data/nflog`)

On restart:

1. Instance loads silences and notification log from disk
2. Joins the cluster and gossips with peers
3. Merges state received from peers (newer timestamps win)
4. Begins processing notifications after gossip settling

**Note**: Alerts themselves are **not** persisted - Prometheus re-sends firing alerts regularly.

### Common Pitfalls

1. **Load balancing Prometheus → Alertmanager**
   - ❌ Don't use a load balancer
   - ✅ Configure all instances in Prometheus

2. **Not waiting for gossip to settle**
   - Can lead to missed silences or duplicate notifications on startup
   - The `--cluster.settle-timeout` flag controls this

3. **Network ACLs blocking cluster port**
   - Ensure port 9094 (or your `--cluster.listen-address` port) is open between all instances
   - Both TCP and UDP are used by default (TCP only if using TLS transport)

4. **Unroutable advertise addresses**
   - If `--cluster.advertise-address` is not set, Alertmanager tries to auto-detect
   - For cloud/NAT environments, explicitly set a routable address

5. **Mismatched cluster configurations**
   - All instances should have the same `--cluster.peer-timeout` and gossip settings
   - Mismatches can cause unnecessary duplicates or missed notifications

## How It Works: End-to-End Example

### Scenario: 3-instance cluster, new alert group

1. **Alert arrives** at all 3 instances from Prometheus
2. **Dispatcher** creates aggregation group, waits `group_wait` (e.g., 30s)
3. **After group_wait**:
   - Each instance prepares to notify
4. **Notifier stage**:
   - All instances wait for gossip to settle (if just started)
   - **AM-1** (position 0): waits 0s, checks notification log (empty), sends notification, logs to nflog
   - **AM-2** (position 1): waits 15s, checks notification log (sees AM-1's entry), **skips** notification
   - **AM-3** (position 2): waits 30s, checks notification log (sees AM-1's entry), **skips** notification
5. **Result**: Exactly one notification sent (by AM-1)

### Scenario: AM-1 fails

1. **Alert arrives** at AM-2 and AM-3 only
2. **Dispatcher** creates group, waits `group_wait`
3. **Notifier stage**:
   - AM-1 is not in cluster (failed probe)
   - **AM-2** is now position 0: waits 0s, sends notification
   - **AM-3** is now position 1: waits 15s, sees AM-2's entry, skips
4. **Result**: Notification still sent (fail-open)

### Scenario: Network partition during notification

1. **Alert arrives** at all instances
2. **Network partition** splits AM-1 from AM-2/AM-3
3. **In partition A** (AM-1):
   - Position 0, waits 0s, sends notification
4. **In partition B** (AM-2, AM-3):
   - AM-2 is position 0, waits 0s, sends notification
   - AM-3 is position 1, waits 15s, deduplicates
5. **Result**: Two notifications sent (one per partition) - fail-open behavior

## Troubleshooting

### Check cluster status

```bash
# View cluster members via API
curl http://am-1:9093/api/v2/status

# Check metrics
curl http://am-1:9093/metrics | grep cluster
```

### Diagnose split-brain

If you suspect split-brain:

1. Check `alertmanager_cluster_members` on each instance
   - Should match total cluster size
2. Check `alertmanager_cluster_peer_info{state="alive"}`
   - Should show all peers as alive
3. Review network connectivity between instances

### Debug duplicate notifications

Duplicate notifications can occur due to:

1. **Network partitions** (expected, fail-open)
2. **Gossip not settled** - Check `--cluster.settle-timeout`
3. **Clock skew** - Ensure NTP is configured on all instances
4. **Notification log not replicating** - Check gossip metrics

Enable debug logging:

```bash
alertmanager --log.level=debug
```

Look for:

- `"Waiting for gossip to settle..."`
- `"gossip settled; proceeding"`
- Deduplication decisions in notification pipeline

## Further Reading

- [Alertmanager Configuration](configuration.md)
- [Secure Cluster Traffic Design](../doc/design/secure-cluster-traffic.md)
- [HashiCorp Memberlist Documentation](https://github.com/hashicorp/memberlist)