Prometheus Alertmanager: From Beginner to Expert
Table of Contents
- Introduction to Alertmanager
- Architecture and Core Concepts
- Installation and Setup
- Configuration Fundamentals
- Routing and Grouping
- Notification Channels
- Silencing and Inhibition
- Integration with Prometheus
- Advanced Features
- Monitoring and Troubleshooting
- Best Practices
- Real-world Examples
1. Introduction to Alertmanager
What is Alertmanager?
Alertmanager is a crucial component of the Prometheus monitoring ecosystem that handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing alerts to the correct receiver integrations such as email, PagerDuty, Slack, or webhooks.
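To make the input concrete: clients push alerts to Alertmanager as JSON over HTTP. The sketch below is illustrative only (the alert labels and a local Alertmanager on localhost:9093 are assumptions); it posts a single hand-crafted alert to the v2 API with curl, which is the same payload shape Prometheus sends automatically when a rule fires.
# Push one illustrative alert to a local Alertmanager (v2 API)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[
    {
      "labels": {"alertname": "DemoAlert", "severity": "warning", "instance": "demo:9100"},
      "annotations": {"summary": "Hand-crafted alert for illustration"},
      "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
    }
  ]'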
Why Do We Need Alertmanager?
graph TD
A[Prometheus Server] -->|Firing Alerts| B[Alertmanager]
B --> C[Grouping & Deduplication]
C --> D[Routing Engine]
D --> E[Email]
D --> F[Slack]
D --> G[PagerDuty]
D --> H[Webhook]
style B fill:#ff9999
style C fill:#99ccff
style D fill:#99ff99
Key Problems Alertmanager Solves:
- Alert Fatigue: Groups similar alerts together
- Duplicate Notifications: Deduplicates identical alerts
- Routing Complexity: Routes alerts to appropriate teams/channels
- Notification Management: Handles various notification channels
- Silencing: Temporarily suppress alerts during maintenance
Core Features
- Grouping: Combines related alerts into single notifications
- Inhibition: Suppresses certain alerts when others are firing
- Silencing: Temporarily mute alerts based on matchers
- High Availability: Supports clustering for redundancy
- Web UI: Provides interface for managing alerts and silences
2. Architecture and Core Concepts
Alertmanager Architecture
graph TB
subgraph "Prometheus Ecosystem"
P[Prometheus Server]
AM[Alertmanager]
P -->|HTTP POST /api/v1/alerts| AM
end
subgraph "Alertmanager Internal"
API[API Layer]
ROUTER[Router]
GROUPER[Grouper]
NOTIFIER[Notifier]
SILENCE[Silence Manager]
INHIB[Inhibitor]
API --> ROUTER
ROUTER --> GROUPER
GROUPER --> INHIB
INHIB --> SILENCE
SILENCE --> NOTIFIER
end
subgraph "External Integrations"
EMAIL[Email]
SLACK[Slack]
PD[PagerDuty]
WH[Webhook]
NOTIFIER --> EMAIL
NOTIFIER --> SLACK
NOTIFIER --> PD
NOTIFIER --> WH
end
style AM fill:#ff9999
style ROUTER fill:#99ccff
style NOTIFIER fill:#99ff99
Key Concepts
Alert Lifecycle
stateDiagram-v2
[*] --> Inactive
Inactive --> Pending: Condition Met
Pending --> Firing: Duration Exceeded
Pending --> Inactive: Condition False
Firing --> Inactive: Condition Resolved
note right of Pending: Alert exists but hasn't\nexceeded 'for' duration
note right of Firing: Alert is actively firing\nand sent to Alertmanager
Alert States in Alertmanager
stateDiagram-v2
[*] --> Active
Active --> Suppressed: Silenced/Inhibited
Suppressed --> Active: Silence Expired/Inhibition Removed
Active --> [*]: Alert Resolved
Suppressed --> [*]: Alert Resolved
note right of Active: Alert is processed\nand notifications sent
note right of Suppressed: Alert exists but\nno notifications sent
Data Flow
sequenceDiagram
participant P as Prometheus
participant AM as Alertmanager
participant R as Receiver
P->>AM: POST /api/v1/alerts
AM->>AM: Group Similar Alerts
AM->>AM: Apply Inhibition Rules
AM->>AM: Check Silences
AM->>AM: Route to Receivers
AM->>R: Send Notification
R-->>AM: Acknowledge
3. Installation and Setup
Installation Methods
Method 1: Binary Installation
# Download Alertmanager binary
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
# Extract
tar xvfz alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64
# Run Alertmanager
./alertmanager --config.file=alertmanager.yml
Method 2: Docker Installation
# Run Alertmanager with Docker
docker run -d \
--name alertmanager \
-p 9093:9093 \
-v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
prom/alertmanager:latest
Method 3: Docker Compose
version: '3.8'
services:
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager-data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=http://localhost:9093'
- '--cluster.listen-address=0.0.0.0:9094'
volumes:
alertmanager-data:
Method 4: Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:latest
ports:
- containerPort: 9093
volumeMounts:
- name: config
mountPath: /etc/alertmanager/
volumes:
- name: config
configMap:
name: alertmanager-config
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
spec:
selector:
app: alertmanager
ports:
- port: 9093
targetPort: 9093
type: LoadBalancer
System Service Setup
Systemd Service File
# Create user for Alertmanager
sudo useradd --no-create-home --shell /bin/false alertmanager
# Create directories
sudo mkdir /etc/alertmanager
sudo mkdir /var/lib/alertmanager
# Set ownership
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
# Copy binary
sudo cp alertmanager /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
# Create systemd service
sudo tee /etc/systemd/system/alertmanager.service << EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file /etc/alertmanager/alertmanager.yml \
--storage.path /var/lib/alertmanager/ \
--web.external-url=http://localhost:9093
[Install]
WantedBy=multi-user.target
EOF
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
4. Configuration Fundamentals
Basic Configuration Structure
global:
# Global configuration options
route:
# Root routing configuration
receivers:
# List of notification receivers
inhibit_rules:
# List of inhibition rules
templates:
# Custom notification templates
Configuration Flow
graph TD
A[Alert Received] --> B{Match Route?}
B -->|Yes| C[Apply Grouping]
B -->|No| D[Default Route]
C --> E{Inhibited?}
E -->|No| F{Silenced?}
E -->|Yes| G[Suppress Alert]
F -->|No| H[Send to Receiver]
F -->|Yes| G
D --> C
style A fill:#ffcccc
style H fill:#ccffcc
style G fill:#ffffcc
Basic Configuration Example
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alertmanager@example.com'
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://127.0.0.1:5001/'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
Global Configuration Options
global:
# Time after which an alert is declared resolved if it is no longer reported
resolve_timeout: 5m
# SMTP configuration
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@company.com'
smtp_auth_username: 'alerts@company.com'
smtp_auth_password: 'app_password'
smtp_require_tls: true
# Slack configuration
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
# PagerDuty configuration
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# HTTP configuration
http_config:
proxy_url: 'http://proxy.company.com:8080'
tls_config:
insecure_skip_verify: true
5. Routing and Grouping
Understanding Routes
Routes define how alerts are organized and where they should be sent. The routing tree starts with a root route and can have multiple child routes.
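A quick way to understand a routing tree is to let amtool print it and test which receiver a given label set would reach. A minimal sketch, assuming amtool is installed and the alertmanager.yml under discussion is in the working directory (the severity/team labels are hypothetical):
# Print the routing tree defined in the configuration
amtool config routes show --config.file=alertmanager.yml
# Test which receiver a hypothetical alert would be routed to
amtool config routes test --config.file=alertmanager.yml severity=critical team=frontend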
graph TD
ROOT[Root Route] --> TEAM_A[Team A Route]
ROOT --> TEAM_B[Team B Route]
ROOT --> CRITICAL[Critical Route]
ROOT --> DEFAULT[Default Route]
TEAM_A --> EMAIL_A[Email Team A]
TEAM_B --> SLACK_B[Slack Team B]
CRITICAL --> PAGER[PagerDuty]
DEFAULT --> WEBHOOK[Webhook]
style ROOT fill:#ff9999
style CRITICAL fill:#ffcccc
Route Matching Logic
flowchart TD
A[Alert Received] --> B{Root Route Matches?}
B -->|Yes| C{Child Route 1 Matches?}
B -->|No| Z[Drop Alert]
C -->|Yes| D[Use Child Route 1]
C -->|No| E{Child Route 2 Matches?}
E -->|Yes| F[Use Child Route 2]
E -->|No| G{Continue Flag?}
G -->|Yes| H[Check Next Route]
G -->|No| I[Use Parent Route]
style D fill:#ccffcc
style F fill:#ccffcc
style I fill:#ccffcc
Advanced Routing Configuration
route:
# Default grouping, timing, and receiver
group_by: ['alertname', 'cluster']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'default-receiver'
# Nested routes
routes:
# Critical alerts go to PagerDuty immediately
- match:
severity: critical
receiver: 'pagerduty-critical'
group_wait: 0s
repeat_interval: 5m
# Database alerts go to DB team
- match_re:
service: ^(mysql|postgres|mongodb).*
receiver: 'database-team'
group_by: ['alertname', 'instance']
# Team-specific routing
- match:
team: frontend
receiver: 'frontend-team'
routes:
# Frontend critical alerts
- match:
severity: critical
receiver: 'frontend-oncall'
continue: true # Also send to team channel
# Infrastructure alerts
- match:
component: infrastructure
receiver: 'infra-team'
group_by: ['alertname', 'datacenter']
# Development environment (lower priority)
- match:
environment: development
receiver: 'dev-team'
group_interval: 1h
repeat_interval: 24h
Grouping Strategies
Time-based Grouping
route:
group_by: ['alertname']
group_wait: 10s # Wait for more alerts before sending
group_interval: 10s # Wait before sending additional alerts for group
repeat_interval: 1h # Resend interval for unresolved alerts
Label-based Grouping
route:
# Group by alert name and instance
group_by: ['alertname', 'instance']
routes:
- match:
team: database
group_by: ['alertname', 'database_cluster']
- match:
service: web
group_by: ['alertname', 'datacenter', 'environment']
Grouping Flow Diagram
sequenceDiagram
participant P as Prometheus
participant AM as Alertmanager
participant G as Grouper
participant N as Notifier
P->>AM: Alert 1 (web-server-down, instance=web1)
AM->>G: Create Group [web-server-down, web1]
Note over G: Wait group_wait (10s)
P->>AM: Alert 2 (web-server-down, instance=web2)
AM->>G: Add to Group [web-server-down]
P->>AM: Alert 3 (web-server-down, instance=web3)
AM->>G: Add to Group [web-server-down]
Note over G: group_wait expires
G->>N: Send grouped notification (3 alerts)
Note over G: Wait group_interval (10s)
P->>AM: Alert 4 (web-server-down, instance=web4)
AM->>G: Add to existing Group
Note over G: group_interval expires
G->>N: Send update (4 alerts total)
6. Notification Channels
Supported Receivers
Alertmanager supports various notification channels:
graph LR
AM[Alertmanager] --> EMAIL[Email]
AM --> SLACK[Slack]
AM --> PD[PagerDuty]
AM --> TEAMS[Microsoft Teams]
AM --> WEBHOOK[Webhook]
AM --> PUSHOVER[Pushover]
AM --> OPSGENIE[OpsGenie]
AM --> VICTOROPS[VictorOps]
AM --> WECHAT[WeChat]
AM --> TELEGRAM[Telegram]
style AM fill:#ff9999
Email Configuration
Basic Email Setup
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@company.com'
smtp_auth_username: 'alerts@company.com'
smtp_auth_password: 'app_specific_password'
smtp_require_tls: true
receivers:
- name: 'email-team'
email_configs:
- to: 'team@company.com'
subject: 'Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
{{ end }}
Advanced Email Configuration
receivers:
- name: 'advanced-email'
email_configs:
- to: 'oncall@company.com'
cc: 'team-lead@company.com'
subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} ({{ .Alerts | len }} alerts)'
html: |
<!DOCTYPE html>
<html>
<head>
<style>
.critical { background-color: #ff4444; color: white; }
.warning { background-color: #ffaa00; color: white; }
.info { background-color: #4444ff; color: white; }
</style>
</head>
<body>
<h2>Alert Summary</h2>
<table border="1">
<tr><th>Alert</th><th>Severity</th><th>Instance</th><th>Description</th></tr>
{{ range .Alerts }}
<tr class="{{ .Labels.severity }}">
<td>{{ .Labels.alertname }}</td>
<td>{{ .Labels.severity }}</td>
<td>{{ .Labels.instance }}</td>
<td>{{ .Annotations.description }}</td>
</tr>
{{ end }}
</table>
</body>
</html>
headers:
X-Priority: 'High'
X-MC-Important: 'true'
Slack Configuration
Basic Slack Setup
global:
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
receivers:
- name: 'slack-general'
slack_configs:
- channel: '#alerts'
username: 'Alertmanager'
icon_emoji: ':exclamation:'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
{{ end }}
Advanced Slack Configuration
receivers:
- name: 'slack-advanced'
slack_configs:
- api_url: 'https://hooks.slack.com/services/TEAM/CHANNEL/TOKEN'
channel: '#production-alerts'
username: 'AlertBot'
icon_url: 'https://example.com/alertmanager-icon.png'
title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
title_link: 'http://alertmanager.company.com/#/alerts'
text: |
{{ if eq .Status "firing" }}
:fire: *FIRING ALERTS* :fire:
{{ else }}
:white_check_mark: *RESOLVED ALERTS* :white_check_mark:
{{ end }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Instance:* {{ .Labels.instance }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ if ne .EndsAt .StartsAt }}*Ended:* {{ .EndsAt.Format "2006-01-02 15:04:05" }}{{ end }}
---
{{ end }}
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
fields:
- title: 'Environment'
value: '{{ .GroupLabels.environment }}'
short: true
- title: 'Severity'
value: '{{ .GroupLabels.severity }}'
short: true
actions:
- type: 'button'
text: 'View in Grafana'
url: 'http://grafana.company.com/dashboard'
- type: 'button'
text: 'Silence Alert'
url: 'http://alertmanager.company.com/#/silences/new'
PagerDuty Configuration
global:
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
receivers:
- name: 'pagerduty-critical'
pagerduty_configs:
- routing_key: 'YOUR_INTEGRATION_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .Annotations.summary }}'
severity: '{{ .GroupLabels.severity }}'
source: '{{ .GroupLabels.instance }}'
component: '{{ .GroupLabels.service }}'
group: '{{ .GroupLabels.cluster }}'
class: '{{ .GroupLabels.alertname }}'
details:
firing_alerts: '{{ .Alerts.Firing | len }}'
resolved_alerts: '{{ .Alerts.Resolved | len }}'
alert_details: |
{{ range .Alerts }}
- {{ .Labels.alertname }} on {{ .Labels.instance }}
{{ end }}
Webhook Configuration
receivers:
- name: 'webhook-receiver'
webhook_configs:
- url: 'http://webhook-server.company.com/alerts'
http_config:
basic_auth:
username: 'webhook_user'
password: 'webhook_password'
send_resolved: true
max_alerts: 10
Microsoft Teams Configuration
receivers:
- name: 'teams-alerts'
# The native Teams receiver (msteams_configs) requires Alertmanager v0.26+;
# older versions need a proxy such as prometheus-msteams behind a webhook_config.
msteams_configs:
- webhook_url: 'https://outlook.office.com/webhook/YOUR_TEAMS_WEBHOOK'
send_resolved: true
http_config:
tls_config:
insecure_skip_verify: false
title: 'Alert: {{ .GroupLabels.alertname }}'
text: |
**Status:** {{ .Status | toUpper }}
{{ range .Alerts }}
**Alert:** {{ .Labels.alertname }}
**Instance:** {{ .Labels.instance }}
**Severity:** {{ .Labels.severity }}
**Summary:** {{ .Annotations.summary }}
**Description:** {{ .Annotations.description }}
{{ end }}
7. Silencing and Inhibition
Silencing Alerts
Silencing temporarily mutes alerts based on label matchers. This is useful during maintenance windows or when investigating issues.
graph TD
A[Alert Received] --> B{Matches Silence?}
B -->|Yes| C[Suppress Notification]
B -->|No| D[Process Normally]
C --> E[Log Silenced Alert]
D --> F[Send Notification]
style C fill:#ffffcc
style F fill:#ccffcc
Creating Silences via API
# Create a silence for maintenance
curl -X POST http://localhost:9093/api/v1/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "instance",
"value": "web-server-01",
"isRegex": false
},
{
"name": "alertname",
"value": "InstanceDown",
"isRegex": false
}
],
"startsAt": "2024-01-01T12:00:00Z",
"endsAt": "2024-01-01T14:00:00Z",
"createdBy": "john.doe@company.com",
"comment": "Scheduled maintenance for web-server-01"
}'
Silence Configuration Examples
Silences are not part of alertmanager.yml; the examples below show the fields used when creating silences through the API, the web UI, or amtool.
# Silence all alerts from development environment
- matchers:
- name: environment
value: development
isRegex: false
comment: "Development environment maintenance"
createdBy: "devops-team"
# Silence critical disk alerts during backup window
- matchers:
- name: alertname
value: DiskSpaceHigh
isRegex: false
- name: severity
value: critical
isRegex: false
comment: "Daily backup window"
createdBy: "backup-system"
# Silence all alerts matching regex pattern
- matchers:
- name: instance
value: "web-.*"
isRegex: true
comment: "Web server maintenance"
createdBy: "sre-team"
Inhibition Rules
Inhibition suppresses notifications for certain alerts when other alerts are firing. This prevents alert spam when a root cause alert is already active.
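To observe inhibition end to end, fire a matching source and target alert by hand and check which alerts report a non-empty inhibitedBy field. A rough sketch, assuming amtool points at a local Alertmanager and an inhibit rule like the NodeDown example further below is loaded (the alert names and the web1 instance are made up):
# Fire a critical source alert and a warning target alert on the same instance
amtool alert add alertname=NodeDown severity=critical instance=web1
amtool alert add alertname=HighCPU severity=warning instance=web1
# The warning alert should now list the source alert in its inhibitedBy field
curl -s http://localhost:9093/api/v2/alerts | grep -o '"inhibitedBy":\[[^]]*\]'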
graph TD
A[Source Alert Firing] --> B{Target Alerts Match?}
B -->|Yes| C[Inhibit Target Alerts]
B -->|No| D[Allow Target Alerts]
E[Source Alert Resolved] --> F[Remove Inhibition]
F --> G[Target Alerts Active Again]
style C fill:#ffffcc
style D fill:#ccffcc
Basic Inhibition Rules
inhibit_rules:
# Inhibit warning alerts when critical alerts are firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
# Inhibit individual service alerts when entire node is down
- source_match:
alertname: 'NodeDown'
target_match_re:
alertname: '(ServiceDown|HighCPU|HighMemory)'
equal: ['instance']
# Inhibit database connection alerts when database is down
- source_match:
alertname: 'DatabaseDown'
target_match:
alertname: 'DatabaseConnectionFailed'
equal: ['database_cluster']
Advanced Inhibition Examples
inhibit_rules:
# Complex multi-label matching
- source_match:
alertname: 'DatacenterPowerOutage'
target_match_re:
alertname: '(InstanceDown|ServiceUnavailable|NetworkUnreachable)'
equal: ['datacenter', 'region']
# Inhibit application alerts during deployment
- source_match:
alertname: 'DeploymentInProgress'
environment: 'production'
target_match_re:
alertname: '(HighErrorRate|SlowResponse|ServiceDown)'
equal: ['service', 'environment']
# Inhibit monitoring alerts when monitoring system is down
- source_match:
alertname: 'PrometheusDown'
target_match_re:
alertname: '(.*)'
equal: ['monitoring_cluster']
Inhibition Flow
sequenceDiagram
participant P as Prometheus
participant AM as Alertmanager
participant I as Inhibitor
participant N as Notifier
P->>AM: NodeDown Alert (Critical)
AM->>I: Check inhibition rules
I->>I: Store active inhibition
P->>AM: HighCPU Alert (Warning)
AM->>I: Check if inhibited
I-->>AM: Inhibited by NodeDown
AM->>AM: Suppress HighCPU notification
P->>AM: NodeDown Resolved
AM->>I: Remove inhibition
I->>N: Allow suppressed alerts
N->>N: Process HighCPU if still active
Managing Silences
List Active Silences
# Get all silences
curl http://localhost:9093/api/v1/silences
# Get specific silence
curl http://localhost:9093/api/v1/silence/SILENCE_ID
Update Silence
# Expire a silence early
curl -X DELETE http://localhost:9093/api/v1/silence/SILENCE_ID
Silence Best Practices
# Template for emergency silence
emergency_silence_template: |
matchers:
- name: severity
value: critical
isRegex: false
- name: team
value: "{{ .team }}"
isRegex: false
comment: "Emergency silence - {{ .reason }}"
createdBy: "{{ .operator }}"
endsAt: "{{ .end_time }}"
# Scheduled maintenance silence
maintenance_silence_template: |
matchers:
- name: instance
value: "{{ .instance_pattern }}"
isRegex: true
comment: "Scheduled maintenance: {{ .maintenance_ticket }}"
createdBy: "maintenance-system"
startsAt: "{{ .maintenance_start }}"
endsAt: "{{ .maintenance_end }}"
8. Integration with Prometheus
Prometheus Configuration
To send alerts to Alertmanager, configure Prometheus with alert rules and Alertmanager endpoints.
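Before reloading Prometheus, both the main configuration and the rule files can be validated with promtool. A quick sketch, assuming promtool is on the PATH and the file names match the layout used later in this section:
# Validate the Prometheus configuration, including the alerting section
promtool check config prometheus.yml
# Validate alert rule files before loading them
promtool check rules alert_rules/basic_alerts.yml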
graph LR
P[Prometheus] -->|Scrape Metrics| T[Targets]
P -->|Evaluate Rules| R[Alert Rules]
R -->|Fire Alerts| AM[Alertmanager]
AM -->|Notifications| N[Receivers]
style P fill:#ff9999
style AM fill:#99ccff
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager-1:9093
- alertmanager-2:9093
- alertmanager-3:9093
timeout: 10s
api_version: v2
# Load alert rules
rule_files:
- "alert_rules/*.yml"
- "recording_rules/*.yml"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
Alert Rules
Basic Alert Rules
# alert_rules/basic_alerts.yml
groups:
- name: basic_alerts
rules:
# Instance down alert
- alert: InstanceDown
expr: up == 0
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.instance }} has been down for more than 5 minutes"
runbook_url: "https://wiki.company.com/runbooks/instance-down"
# High CPU usage
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 10m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% for more than 10 minutes"
current_value: "{{ $value | humanizePercentage }}"
# High memory usage
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 15m
labels:
severity: critical
team: infrastructure
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90% for more than 15 minutes"
current_value: "{{ $value | humanizePercentage }}"
Advanced Alert Rules
# alert_rules/application_alerts.yml
groups:
- name: application_alerts
rules:
# HTTP error rate too high
- alert: HighHTTPErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, instance)
/
sum(rate(http_requests_total[5m])) by (service, instance)
) * 100 > 5
for: 5m
labels:
severity: critical
team: "{{ $labels.service }}"
annotations:
summary: "High HTTP error rate for {{ $labels.service }}"
description: "HTTP 5xx error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
grafana_url: "http://grafana.company.com/d/http-dashboard"
# Response time too high
- alert: HighResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
) > 0.5
for: 10m
labels:
severity: warning
team: "{{ $labels.service }}"
annotations:
summary: "High response time for {{ $labels.service }}"
description: "95th percentile response time is {{ $value | humanizeDuration }}"
# Database connection pool exhausted
- alert: DatabaseConnectionPoolExhausted
expr: |
(
sum(database_connections_active) by (database, instance)
/
sum(database_connections_max) by (database, instance)
) * 100 > 90
for: 5m
labels:
severity: critical
team: database
annotations:
summary: "Database connection pool almost exhausted"
description: "{{ $labels.database }} connection pool is {{ $value | humanizePercentage }} full"
# Disk space running low
- alert: DiskSpaceLow
expr: |
(
node_filesystem_avail_bytes{fstype!="tmpfs"}
/
node_filesystem_size_bytes{fstype!="tmpfs"}
) * 100 < 10
for: 30m
labels:
severity: warning
team: infrastructure
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has only {{ $value | humanizePercentage }} space left"
# Service down
- alert: ServiceDown
expr: probe_success == 0
for: 5m
labels:
severity: critical
team: "{{ $labels.team }}"
annotations:
summary: "Service {{ $labels.job }} is down"
description: "{{ $labels.instance }} has been down for more than 5 minutes"
Multi-Datacenter Alert Rules
# alert_rules/datacenter_alerts.yml
groups:
- name: datacenter_alerts
rules:
# Datacenter connectivity issues
- alert: DatacenterConnectivityIssue
expr: |
up{job="datacenter-health"} == 0
or
increase(network_packets_dropped_total[5m]) > 1000
for: 2m
labels:
severity: critical
team: network
escalate: "true"
annotations:
summary: "Connectivity issues in {{ $labels.datacenter }}"
description: "Network connectivity problems detected in {{ $labels.datacenter }}"
# Cross-datacenter replication lag
- alert: HighReplicationLag
expr: |
database_replication_lag_seconds > 300
for: 10m
labels:
severity: warning
team: database
annotations:
summary: "High replication lag between datacenters"
description: "Replication lag is {{ $value | humanizeDuration }} between {{ $labels.source_dc }} and {{ $labels.target_dc }}"
# Load balancer backend down
- alert: LoadBalancerBackendDown
expr: |
haproxy_server_up == 0
for: 1m
labels:
severity: critical
team: network
annotations:
summary: "Load balancer backend {{ $labels.server }} is down"
description: "Backend server {{ $labels.server }} in {{ $labels.backend }} is not responding"
Alert Rule Best Practices
Rule Organization
graph TD
A[Alert Rules] --> B[Infrastructure]
A --> C[Applications]
A --> D[Security]
A --> E[Business]
B --> B1[Node Exporter]
B --> B2[Network]
B --> B3[Storage]
C --> C1[Web Services]
C --> C2[Databases]
C --> C3[Message Queues]
D --> D1[Authentication]
D --> D2[Compliance]
E --> E1[SLA Violations]
E --> E2[Revenue Impact]
Template for Alert Rules
# Template for standardized alerts
- alert: AlertName
expr: |
# Multi-line PromQL query
metric_expression
for: duration
labels:
severity: critical|warning|info
team: responsible_team
service: service_name
environment: prod|staging|dev
escalate: "true|false"
annotations:
summary: "Brief description of the issue"
description: "Detailed description with context and impact"
runbook_url: "https://runbooks.company.com/alert-name"
dashboard_url: "https://grafana.company.com/dashboard"
current_value: "{{ $value | humanize }}"
threshold: "threshold_value"
9. Advanced Features
High Availability Setup
Setting up Alertmanager in HA mode ensures no single point of failure.
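Once the replicas are running, each one exposes a readiness endpoint and a status API that lists its cluster peers, which makes it easy to confirm the gossip mesh has formed. A rough health-check sketch, assuming three replicas reachable under the hostnames used in this section:
# Confirm every replica is ready and sees its peers
for host in alertmanager-1 alertmanager-2 alertmanager-3; do
  curl -s -o /dev/null -w "$host ready: %{http_code}\n" "http://$host:9093/-/ready"
  # Rough peer count: each peer entry in /api/v2/status carries an "address" field
  echo "peers seen by $host: $(curl -s "http://$host:9093/api/v2/status" | grep -o '"address"' | wc -l)"
done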
graph TD
subgraph "Prometheus Instances"
P1[Prometheus 1]
P2[Prometheus 2]
P3[Prometheus 3]
end
subgraph "Alertmanager Cluster"
AM1[Alertmanager 1:9093]
AM2[Alertmanager 2:9094]
AM3[Alertmanager 3:9095]
AM1 -.->|Gossip Protocol| AM2
AM2 -.->|Gossip Protocol| AM3
AM3 -.->|Gossip Protocol| AM1
end
P1 --> AM1
P1 --> AM2
P1 --> AM3
P2 --> AM1
P2 --> AM2
P2 --> AM3
P3 --> AM1
P3 --> AM2
P3 --> AM3
AM1 --> RECEIVER[Notification Receivers]
AM2 --> RECEIVER
AM3 --> RECEIVER
style AM1 fill:#ff9999
style AM2 fill:#ff9999
style AM3 fill:#ff9999
HA Configuration
# alertmanager-1.yml
global:
smtp_smarthost: 'smtp.company.com:587'
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://webhook.company.com/alerts'
# Clustering is configured via command-line flags rather than in alertmanager.yml:
#   --cluster.listen-address=0.0.0.0:9094
#   --cluster.peer=alertmanager-2.company.com:9094
#   --cluster.peer=alertmanager-3.company.com:9094
#   --cluster.gossip-interval=200ms
#   --cluster.pushpull-interval=1m
Docker Compose HA Setup
version: '3.8'
services:
alertmanager-1:
image: prom/alertmanager:latest
ports:
- "9093:9093"
- "9094:9094"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=alertmanager-2:9094'
- '--cluster.peer=alertmanager-3:9094'
- '--web.external-url=http://localhost:9093'
networks:
- alerting
alertmanager-2:
image: prom/alertmanager:latest
ports:
- "9095:9093"
- "9096:9094"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=alertmanager-1:9094'
- '--cluster.peer=alertmanager-3:9094'
- '--web.external-url=http://localhost:9095'
networks:
- alerting
alertmanager-3:
image: prom/alertmanager:latest
ports:
- "9097:9093"
- "9098:9094"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=alertmanager-1:9094'
- '--cluster.peer=alertmanager-2:9094'
- '--web.external-url=http://localhost:9097'
networks:
- alerting
networks:
alerting:
driver: bridge
Custom Templates
Create custom notification templates for better formatting.
Template Structure
graph TD
A[Template Files] --> B[Email Templates]
A --> C[Slack Templates]
A --> D[Webhook Templates]
B --> B1[HTML Templates]
B --> B2[Text Templates]
C --> C1[Message Format]
C --> C2[Attachment Format]
D --> D1[JSON Format]
D --> D2[Custom Format]
Email Templates
<!-- templates/email.html -->
<!DOCTYPE html>
<html>
<head>
<style>
body { font-family: Arial, sans-serif; }
.alert-critical { background-color: #d32f2f; color: white; }
.alert-warning { background-color: #f57c00; color: white; }
.alert-info { background-color: #1976d2; color: white; }
.resolved { background-color: #388e3c; color: white; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #f2f2f2; }
</style>
</head>
<body>
<h1>{{ if eq .Status "firing" }}🔥 ALERTS FIRING{{ else }}✅ ALERTS RESOLVED{{ end }}</h1>
<h2>Summary</h2>
<ul>
<li><strong>Status:</strong> {{ .Status | toUpper }}</li>
<li><strong>Group:</strong> {{ .GroupLabels.alertname }}</li>
<li><strong>Total Alerts:</strong> {{ .Alerts | len }}</li>
<li><strong>Firing:</strong> {{ .Alerts.Firing | len }}</li>
<li><strong>Resolved:</strong> {{ .Alerts.Resolved | len }}</li>
</ul>
<h2>Alert Details</h2>
<table>
<tr>
<th>Alert</th>
<th>Severity</th>
<th>Instance</th>
<th>Status</th>
<th>Started</th>
<th>Summary</th>
</tr>
{{ range .Alerts }}
<tr class="alert-{{ .Labels.severity }}{{ if eq .Status "resolved" }} resolved{{ end }}">
<td>{{ .Labels.alertname }}</td>
<td>{{ .Labels.severity | toUpper }}</td>
<td>{{ .Labels.instance }}</td>
<td>{{ .Status | toUpper }}</td>
<td>{{ .StartsAt.Format "2006-01-02 15:04:05" }}</td>
<td>{{ .Annotations.summary }}</td>
</tr>
{{ end }}
</table>
<h2>Actions</h2>
<ul>
<li><a href="http://alertmanager.company.com">View in Alertmanager</a></li>
<li><a href="http://grafana.company.com">View in Grafana</a></li>
<li><a href="http://alertmanager.company.com/#/silences/new">Create Silence</a></li>
</ul>
</body>
</html>
Slack Templates
# templates/slack.tmpl
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{ end }}
{{ define "slack.text" }}
{{ if eq .Status "firing" }}
:fire: **FIRING ALERTS** :fire:
{{ else }}
:white_check_mark: **RESOLVED ALERTS** :white_check_mark:
{{ end }}
{{ range .Alerts }}
{{ if eq .Status "firing" }}:red_circle:{{ else }}:green_circle:{{ end }} **{{ .Labels.alertname }}**
• **Instance:** {{ .Labels.instance }}
• **Severity:** {{ .Labels.severity | toUpper }}
• **Summary:** {{ .Annotations.summary }}
• **Started:** {{ .StartsAt.Format "Jan 02, 2006 15:04:05 MST" }}
{{ if .Annotations.runbook_url }}• **Runbook:** {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ if gt (len .GroupLabels) 0 }}
**Labels:** {{ range .GroupLabels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
{{ end }}
{{ end }}
{{ define "slack.color" }}
{{ if eq .Status "firing" }}
{{ if eq .GroupLabels.severity "critical" }}danger{{ else }}warning{{ end }}
{{ else }}
good
{{ end }}
{{ end }}
Using Templates in Configuration
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alertmanager@company.com'
templates:
- '/etc/alertmanager/templates/*.tmpl'
receivers:
- name: 'email-templates'
email_configs:
- to: 'team@company.com'
subject: '{{ template "email.subject" . }}'
html: '{{ template "email.html" . }}'
- name: 'slack-templates'
slack_configs:
- api_url: 'https://hooks.slack.com/services/...'
channel: '#alerts'
title: '{{ template "slack.title" . }}'
text: '{{ template "slack.text" . }}'
color: '{{ template "slack.color" . }}'
API Usage and Automation
Alertmanager provides a REST API for automation and integration.
API Endpoints Overview
graph LR
API[Alertmanager API] --> ALERTS["/api/v1/alerts"]
API --> SILENCES["/api/v1/silences"]
API --> RECEIVERS["/api/v1/receivers"]
API --> STATUS["/api/v1/status (includes the loaded config)"]
ALERTS --> GET_ALERTS[GET: List alerts]
ALERTS --> POST_ALERTS[POST: Send alerts]
SILENCES --> GET_SILENCES[GET: List silences]
SILENCES --> POST_SILENCES[POST: Create silence]
SILENCES --> DELETE_SILENCE[DELETE: Expire silence]
Common API Operations
# Get all active alerts
curl -X GET http://localhost:9093/api/v1/alerts
# Get alerts with specific labels
curl -X GET "http://localhost:9093/api/v1/alerts?filter=alertname%3DHighCPU"
# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
-H "Content-Type: application/json" \
-d '[
{
"labels": {
"alertname": "TestAlert",
"instance": "localhost:9090",
"severity": "warning"
},
"annotations": {
"summary": "This is a test alert",
"description": "Testing alertmanager configuration"
},
"startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)'",
"endsAt": "'$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%S.%3NZ)'"
}
]'
# Create silence
curl -X POST http://localhost:9093/api/v1/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "alertname",
"value": "HighCPU",
"isRegex": false
}
],
"startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)'",
"endsAt": "'$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%S.%3NZ)'",
"createdBy": "automation-script",
"comment": "Automated silence during maintenance"
}'
# Get status (the response includes the loaded configuration)
curl -X GET http://localhost:9093/api/v1/status
Python API Client Example
import requests
import json
from datetime import datetime, timedelta
class AlertmanagerClient:
def __init__(self, base_url):
self.base_url = base_url.rstrip('/')
def get_alerts(self, filters=None):
"""Get all alerts or filtered alerts"""
url = f"{self.base_url}/api/v1/alerts"
params = {}
if filters:
params['filter'] = filters
response = requests.get(url, params=params)
response.raise_for_status()
return response.json()['data']
def send_alert(self, alertname, labels, annotations, starts_at=None, ends_at=None):
"""Send a test alert"""
url = f"{self.base_url}/api/v1/alerts"
if not starts_at:
starts_at = datetime.utcnow()
if not ends_at:
ends_at = starts_at + timedelta(hours=1)
alert = {
"labels": {"alertname": alertname, **labels},
"annotations": annotations,
"startsAt": starts_at.isoformat() + 'Z',
"endsAt": ends_at.isoformat() + 'Z'
}
response = requests.post(url, json=[alert])
response.raise_for_status()
return response.json()
def create_silence(self, matchers, comment, created_by, duration_hours=1):
"""Create a silence"""
url = f"{self.base_url}/api/v1/silences"
starts_at = datetime.utcnow()
ends_at = starts_at + timedelta(hours=duration_hours)
silence = {
"matchers": matchers,
"startsAt": starts_at.isoformat() + 'Z',
"endsAt": ends_at.isoformat() + 'Z',
"createdBy": created_by,
"comment": comment
}
response = requests.post(url, json=silence)
response.raise_for_status()
return response.json()
def get_silences(self):
"""Get all silences"""
url = f"{self.base_url}/api/v1/silences"
response = requests.get(url)
response.raise_for_status()
return response.json()['data']
def expire_silence(self, silence_id):
"""Expire a silence"""
url = f"{self.base_url}/api/v1/silence/{silence_id}"
response = requests.delete(url)
response.raise_for_status()
return response.status_code == 200
# Usage example
if __name__ == "__main__":
client = AlertmanagerClient("http://localhost:9093")
# Send test alert
client.send_alert(
alertname="APITestAlert",
labels={"instance": "test-server", "severity": "warning"},
annotations={
"summary": "Test alert from API",
"description": "This is a test alert sent via API"
}
)
# Create silence
matchers = [
{"name": "alertname", "value": "APITestAlert", "isRegex": False}
]
silence_response = client.create_silence(
matchers=matchers,
comment="Testing API silence creation",
created_by="api-script",
duration_hours=2
)
print(f"Created silence with ID: {silence_response['silenceID']}")
10. Monitoring and Troubleshooting
Monitoring Alertmanager Itself
It’s crucial to monitor Alertmanager to ensure it’s functioning correctly.
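A quick manual check is to pull the /metrics endpoint and look for the self-monitoring series used by the rules below. A sketch, assuming a local instance on port 9093:
# Spot-check key Alertmanager self-metrics
curl -s http://localhost:9093/metrics | grep -E \
  'alertmanager_alerts\{|alertmanager_notifications_failed_total|alertmanager_cluster_members|alertmanager_config_last_reload_successful'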
graph TD
AM[Alertmanager] --> METRICS["/metrics endpoint"]
METRICS --> PROM[Prometheus]
PROM --> GRAFANA[Grafana Dashboard]
PROM --> ALERTS[Alertmanager Alerts]
ALERTS --> EMAIL[Email Notifications]
ALERTS --> SLACK[Slack Notifications]
style AM fill:#ff9999
style ALERTS fill:#ffcccc
Key Metrics to Monitor
# alertmanager_monitoring_rules.yml
groups:
- name: alertmanager_monitoring
rules:
# Alertmanager is down
- alert: AlertmanagerDown
expr: up{job="alertmanager"} == 0
for: 5m
labels:
severity: critical
service: alertmanager
annotations:
summary: "Alertmanager instance is down"
description: "Alertmanager instance {{ $labels.instance }} is down"
# Configuration reload failed
- alert: AlertmanagerConfigReloadFailed
expr: alertmanager_config_last_reload_successful == 0
for: 10m
labels:
severity: critical
service: alertmanager
annotations:
summary: "Alertmanager configuration reload failed"
description: "Alertmanager {{ $labels.instance }} configuration reload failed"
# High number of alerts
- alert: AlertmanagerHighAlertVolume
expr: sum(alertmanager_alerts) by (instance) > 1000
for: 10m
labels:
severity: warning
service: alertmanager
annotations:
summary: "High volume of alerts in Alertmanager"
description: "Alertmanager {{ $labels.instance }} is processing {{ $value }} alerts"
# Notification failures
- alert: AlertmanagerNotificationFailed
expr: rate(alertmanager_notifications_failed_total[5m]) > 0.1
for: 10m
labels:
severity: warning
service: alertmanager
annotations:
summary: "Alertmanager notifications failing"
description: "Alertmanager {{ $labels.instance }} notification failure rate is {{ $value | humanizePercentage }}"
# Cluster member down
- alert: AlertmanagerClusterMemberDown
expr: alertmanager_cluster_members != on (job) group_left count by (job) (up{job="alertmanager"})
for: 15m
labels:
severity: warning
service: alertmanager
annotations:
summary: "Alertmanager cluster member missing"
description: "Alertmanager cluster has {{ $value }} members but should have more"
Prometheus Scrape Configuration
# prometheus.yml
scrape_configs:
- job_name: 'alertmanager'
static_configs:
- targets:
- 'alertmanager-1:9093'
- 'alertmanager-2:9093'
- 'alertmanager-3:9093'
scrape_interval: 30s
metrics_path: /metrics
Common Issues and Solutions
Troubleshooting Flow
flowchart TD
A[Alert Issue] --> B{Alert Received?}
B -->|No| C[Check Prometheus Config]
B -->|Yes| D{Notification Sent?}
C --> C1[Verify alertmanager URL]
C --> C2[Check alert rules]
C --> C3[Verify connectivity]
D -->|No| E[Check Alertmanager]
D -->|Yes| F[Issue Resolved]
E --> E1[Check routing rules]
E --> E2[Verify receiver config]
E --> E3[Check silences]
E --> E4[Check inhibition rules]
style A fill:#ff9999
style F fill:#ccffcc
Common Problems and Solutions
- Alerts Not Firing
# Check if Prometheus can reach Alertmanager
curl http://prometheus:9090/api/v1/alertmanagers
# Check alert rule evaluation
curl http://prometheus:9090/api/v1/rules
# Verify alert is active in Prometheus
curl http://prometheus:9090/api/v1/alerts
- Notifications Not Sent
# Check Alertmanager logs
docker logs alertmanager
# Verify the loaded configuration (returned as part of the status endpoint)
curl http://alertmanager:9093/api/v1/status
# Check for silences
curl http://alertmanager:9093/api/v1/silences
# Test notification manually
amtool alert add alertname=TestAlert severity=warning instance=test
- Configuration Issues
# Validate configuration with amtool (shipped alongside Alertmanager)
amtool check-config alertmanager.yml
# Check configuration reload status
curl http://alertmanager:9093/api/v1/status
Debug Tools
# Install amtool (Alertmanager CLI tool)
go install github.com/prometheus/alertmanager/cmd/amtool@latest
# Point amtool at your Alertmanager instance
mkdir -p ~/.config/amtool
echo "alertmanager.url: http://localhost:9093" > ~/.config/amtool/config.yml
# List alerts
amtool alert query
# List silences
amtool silence query
# Create test alert
amtool alert add alertname=TestAlert severity=critical instance=localhost
# Create silence
amtool silence add alertname=TestAlert --duration=1h --comment="Testing silence"
# Import silences from file
amtool silence import < silences.json
# Export silences to file (JSON output can be re-imported later)
amtool silence query -o json > silences.json
Log Analysis
Log Patterns to Monitor
# Error patterns to watch for
grep -E "(error|Error|ERROR)" /var/log/alertmanager/alertmanager.log
# Configuration reload events
grep "Completed loading of configuration file" /var/log/alertmanager/alertmanager.log
# Notification failures
grep "notify.*failed" /var/log/alertmanager/alertmanager.log
# Cluster communication issues
grep "cluster.*error" /var/log/alertmanager/alertmanager.log
Structured Logging Configuration
# Add to Alertmanager startup flags
--log.format=json
--log.level=info
Log Aggregation with Fluentd/Fluentbit
# fluent-bit.conf
[INPUT]
Name tail
Path /var/log/alertmanager/alertmanager.log
Tag alertmanager
Parser json
[OUTPUT]
Name elasticsearch
Match alertmanager
Host elasticsearch.company.com
Port 9200
Index alertmanager-logs
11. Best Practices
Configuration Best Practices
Organization and Structure
graph TD
A[Configuration Best Practices] --> B[File Organization]
A --> C[Naming Conventions]
A --> D[Environment Separation]
A --> E[Security Practices]
B --> B1[config/]
B --> B2[templates/]
B --> B3[rules/]
C --> C1[Descriptive Names]
C --> C2[Consistent Patterns]
D --> D1[Dev/Stage/Prod]
D --> D2[Feature Flags]
E --> E1[Secrets Management]
E --> E2[Access Control]
File Structure Best Practices
# Recommended directory structure
alertmanager/
├── config/
│ ├── alertmanager-dev.yml
│ ├── alertmanager-staging.yml
│ └── alertmanager-prod.yml
├── templates/
│ ├── email/
│ │ ├── html.tmpl
│ │ └── text.tmpl
│ ├── slack/
│ │ └── message.tmpl
│ └── common/
│ └── functions.tmpl
├── rules/
│ ├── infrastructure.yml
│ ├── applications.yml
│ └── business.yml
└── scripts/
├── deploy.sh
├── validate.sh
└── test.sh
Configuration Validation
#!/bin/bash
# Validation script template
set -e
CONFIG_FILE="$1"
AMTOOL_BINARY="./amtool"
echo "Validating Alertmanager configuration: $CONFIG_FILE"
# Syntax check (amtool ships with the Alertmanager release)
$AMTOOL_BINARY check-config "$CONFIG_FILE"
# Template validation
if [ -d "templates/" ]; then
echo "Validating templates..."
for template in templates/*.tmpl; do
echo " Checking $template"
# Add template-specific validation here
done
fi
echo "Configuration validation passed!"
Environment-Specific Configurations
# alertmanager-prod.yml
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alerts-prod@company.com'
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-prod'
receivers:
- name: 'default-prod'
email_configs:
- to: 'oncall-prod@company.com'
slack_configs:
- api_url: '{{ .SlackProdURL }}'
channel: '#production-alerts'
---
# alertmanager-dev.yml
global:
smtp_smarthost: 'localhost:1025' # MailHog for testing
smtp_from: 'alerts-dev@company.com'
resolve_timeout: 1m
route:
group_by: ['alertname']
group_wait: 5s
group_interval: 10s
repeat_interval: 1h
receiver: 'default-dev'
receivers:
- name: 'default-dev'
webhook_configs:
- url: 'http://webhook-test:8080/alerts'
Alert Design Best Practices
Alert Quality Guidelines
flowchart TD
A[Alert Quality] --> B[Actionable]
A --> C[Meaningful]
A --> D[Proportional]
A --> E[Contextual]
B --> B1[Clear Action Required]
B --> B2[Owner Identified]
C --> C1[Business Impact]
C --> C2[User Impact]
D --> D1[Severity Matches Impact]
D --> D2[Frequency Appropriate]
E --> E1[Sufficient Information]
E --> E2[Links to Resources]
Alert Rule Standards
# Standard alert template
- alert: StandardAlertName
expr: |
# Clear, readable PromQL expression
metric_name{label="value"} > threshold
for: 5m # Appropriate duration to avoid flapping
labels:
severity: critical|warning|info
team: responsible_team
service: affected_service
environment: prod|staging|dev
runbook: "runbook-identifier"
annotations:
summary: "Brief, actionable description (< 80 chars)"
description: |
Detailed description with:
- What is happening
- Why it matters
- Current value: {{ $value }}
- Expected threshold: {{ .threshold }}
runbook_url: "https://runbooks.company.com/{{ .Labels.runbook }}"
dashboard_url: "https://grafana.company.com/d/dashboard-id"
grafana_panel_url: "https://grafana.company.com/d/dashboard-id?panelId=1"
Severity Guidelines
# Severity classification
severity_guidelines:
critical:
description: "Service is completely down or severely degraded"
response_time: "Immediate (5 minutes)"
examples:
- "Complete service outage"
- "Data loss imminent"
- "Security breach"
warning:
description: "Service degraded but still functional"
response_time: "Within business hours (4 hours)"
examples:
- "High error rate"
- "Performance degradation"
- "Capacity concerns"
info:
description: "Informational, no immediate action needed"
response_time: "Best effort"
examples:
- "Deployment notifications"
- "Capacity planning info"
- "Maintenance reminders"
Operational Best Practices
On-Call Procedures
sequenceDiagram
participant A as Alert Fires
participant AM as Alertmanager
participant OC as On-Call Engineer
participant T as Team
participant M as Management
A->>AM: Critical Alert
AM->>OC: Immediate Notification
alt Response within 5 minutes
OC->>OC: Acknowledge Alert
OC->>AM: Update Status
else No response
AM->>T: Escalate to Team Lead
alt No response from team
AM->>M: Escalate to Management
end
end
OC->>OC: Investigate & Resolve
OC->>AM: Mark Resolved
Escalation Policies
# Escalation configuration
escalation_policies:
production_critical:
level_1:
- "primary-oncall@company.com"
- timeout: 5m
level_2:
- "team-lead@company.com"
- "secondary-oncall@company.com"
- timeout: 10m
level_3:
- "engineering-manager@company.com"
- "director@company.com"
- timeout: 15m
production_warning:
level_1:
- "team-channel@slack"
- timeout: 30m
level_2:
- "team-lead@company.com"
- timeout: 2h
Silence Management
# Silence management best practices
silence_policies:
maintenance_windows:
- prefix: "MAINT-"
- max_duration: "4h"
- required_fields: ["ticket_number", "approval"]
- auto_expire: true
emergency_silences:
- prefix: "EMERG-"
- max_duration: "2h"
- required_fields: ["incident_id", "responder"]
- approval_required: false
scheduled_silences:
- prefix: "SCHED-"
- max_duration: "24h"
- required_fields: ["change_request", "owner"]
- advance_notice: "24h"
Security Best Practices
Authentication and Authorization
graph TD
A[Security Layers] --> B[Network Security]
A --> C[Authentication]
A --> D[Authorization]
A --> E[Encryption]
B --> B1[Firewall Rules]
B --> B2[VPN Access]
C --> C1[OAuth/OIDC]
C --> C2[API Keys]
D --> D1[RBAC]
D --> D2[Team-based Access]
E --> E1[TLS Everywhere]
E --> E2[Secrets Management]
Secure Configuration
# Secure Alertmanager configuration
global:
# Use TLS for SMTP
smtp_require_tls: true
smtp_auth_username: '{{ env "SMTP_USERNAME" }}'
smtp_auth_password: '{{ env "SMTP_PASSWORD" }}'
# HTTP client configuration
http_config:
tls_config:
# Verify certificates
insecure_skip_verify: false
# Use specific CA if needed
ca_file: /etc/ssl/certs/ca-bundle.pem
# Use environment variables for secrets
receivers:
- name: 'secure-webhook'
webhook_configs:
- url: 'https://webhook.company.com/alerts'
http_config:
bearer_token: '{{ env "WEBHOOK_TOKEN" }}'
tls_config:
cert_file: /etc/alertmanager/client.crt
key_file: /etc/alertmanager/client.key
Container Security
# Secure Dockerfile for Alertmanager
FROM alpine:3.18
# Create non-root user
RUN addgroup -g 1001 alertmanager && \
adduser -D -s /bin/sh -u 1001 -G alertmanager alertmanager
# Install certificates
RUN apk add --no-cache ca-certificates
# Copy binary and set permissions
COPY --from=builder /app/alertmanager /bin/alertmanager
RUN chmod +x /bin/alertmanager
# Create directories with proper ownership
RUN mkdir -p /etc/alertmanager /var/lib/alertmanager && \
chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
USER alertmanager
EXPOSE 9093
ENTRYPOINT ["/bin/alertmanager"]
CMD ["--config.file=/etc/alertmanager/alertmanager.yml", \
"--storage.path=/var/lib/alertmanager", \
"--web.external-url=http://localhost:9093"]
Performance Optimization
Resource Management
# Resource optimization guidelines
resource_management:
memory:
- "Size based on alert volume and retention"
- "~1GB RAM per 100k active alerts"
- "Monitor alertmanager_alerts metric"
cpu:
- "Generally not CPU intensive"
- "Scale with notification volume"
- "2-4 cores sufficient for most workloads"
storage:
- "Minimal storage requirements"
- "~10MB per million alerts"
- "Use SSD for better performance"
network:
- "Outbound bandwidth for notifications"
- "Inbound for receiving alerts"
- "Consider notification channel limits"
High Availability Configuration
# HA deployment best practices
ha_configuration:
cluster_size:
- minimum: 3
- recommended: 3-5
- maximum: 7
deployment:
- "Spread across availability zones"
- "Use anti-affinity rules"
- "Monitor cluster health"
load_balancing:
- "Use load balancer for Prometheus"
- "Health check: GET /-/ready"
- "Sticky sessions not required"
12. Real-world Examples
Example 1: E-commerce Platform
Scenario
Large e-commerce platform with microservices architecture, multiple data centers, and 24/7 operations.
graph TD
A[E-commerce Platform] --> B[Frontend Services]
A --> C[Backend APIs]
A --> D[Databases]
A --> E[Payment Systems]
A --> F[Inventory Management]
B --> B1[Web App]
B --> B2[Mobile API]
B --> B3[CDN]
C --> C1[User Service]
C --> C2[Product Service]
C --> C3[Order Service]
D --> D1[PostgreSQL]
D --> D2[Redis Cache]
D --> D3[Elasticsearch]
E --> E1[Payment Gateway]
E --> E2[Fraud Detection]
F --> F1[Warehouse System]
F --> F2[Stock Management]
Alertmanager Configuration
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alerts@ecommerce.com'
resolve_timeout: 5m
route:
group_by: ['alertname', 'environment', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default'
routes:
# Critical business impact alerts
- match:
severity: critical
business_impact: high
receiver: 'critical-business'
group_wait: 0s
repeat_interval: 5m
# Payment system alerts
- match:
service: payment
receiver: 'payment-team'
group_by: ['alertname', 'payment_provider']
# Database alerts
- match_re:
service: (postgres|redis|elasticsearch)
receiver: 'database-team'
group_by: ['alertname', 'database_cluster']
# Frontend alerts
- match_re:
service: (web-app|mobile-api|cdn)
receiver: 'frontend-team'
# Infrastructure alerts
- match:
team: infrastructure
receiver: 'infrastructure-team'
group_by: ['alertname', 'datacenter']
receivers:
- name: 'default'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#alerts-general'
- name: 'critical-business'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_CRITICAL_KEY" }}'
description: 'CRITICAL: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#critical-alerts'
color: 'danger'
title: '🚨 CRITICAL BUSINESS IMPACT'
email_configs:
- to: 'executives@ecommerce.com'
subject: 'CRITICAL: Business Impact Alert'
- name: 'payment-team'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_PAYMENT_KEY" }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#payment-alerts'
- name: 'database-team'
email_configs:
- to: 'dba-team@ecommerce.com'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#database-alerts'
- name: 'frontend-team'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#frontend-alerts'
- name: 'infrastructure-team'
email_configs:
- to: 'infrastructure@ecommerce.com'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#infrastructure-alerts'
inhibit_rules:
# Inhibit service alerts when entire datacenter is down
- source_match:
alertname: 'DatacenterDown'
target_match_re:
alertname: '(ServiceDown|HighLatency|DatabaseDown)'
equal: ['datacenter']
# Inhibit warning alerts when critical alerts are firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['service', 'instance']
# Inhibit payment alerts during maintenance
- source_match:
alertname: 'PaymentMaintenanceMode'
target_match_re:
service: 'payment'
equal: ['environment']
Alert Rules
# Business critical alerts
groups:
- name: business_critical
rules:
- alert: OrderProcessingDown
expr: |
(
rate(http_requests_total{service="order-service",status=~"5.."}[5m])
/
rate(http_requests_total{service="order-service"}[5m])
) > 0.1
for: 2m
labels:
severity: critical
business_impact: high
service: order
team: backend
annotations:
summary: "Order processing service experiencing high error rate"
description: "{{ $value | humanizePercentage }} of order requests failing"
- alert: PaymentGatewayDown
expr: probe_success{job="payment-gateway"} == 0
for: 1m
labels:
severity: critical
business_impact: high
service: payment
team: payment
annotations:
summary: "Payment gateway is unreachable"
description: "Primary payment gateway has been down for 1 minute"
- alert: InventoryServiceDown
expr: up{job="inventory-service"} == 0
for: 3m
labels:
severity: critical
business_impact: high
service: inventory
team: backend
annotations:
summary: "Inventory service is down"
description: "Inventory service unavailable - affecting product availability"
# Performance alerts
- name: performance
rules:
- alert: HighCheckoutLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
) > 2
for: 5m
labels:
severity: warning
service: checkout
team: frontend
annotations:
summary: "High checkout latency detected"
description: "95th percentile checkout time is {{ $value | humanizeDuration }}"
- alert: DatabaseConnectionPoolHigh
expr: |
(
postgres_connections_active
/
postgres_connections_max
) > 0.8
for: 10m
labels:
severity: warning
service: postgres
team: database
annotations:
summary: "Database connection pool utilization high"
description: "{{ $labels.database }} connection pool at {{ $value | humanizePercentage }}"
Example 2: SaaS Application
Scenario
Multi-tenant SaaS application with global customer base, requiring tenant-specific alerting.
graph TD
A[SaaS Platform] --> B[API Gateway]
A --> C[Tenant Services]
A --> D[Shared Services]
A --> E[Data Layer]
B --> B1[Authentication]
B --> B2[Rate Limiting]
B --> B3[Load Balancing]
C --> C1[Tenant A Services]
C --> C2[Tenant B Services]
C --> C3[Tenant C Services]
D --> D1[Notification Service]
D --> D2[Billing Service]
D --> D3[Analytics Service]
E --> E1[Tenant Databases]
E --> E2[Shared Cache]
E --> E3[Message Queue]
Multi-Tenant Alerting Configuration
global:
smtp_smarthost: 'smtp.saas-company.com:587'
smtp_from: 'platform-alerts@saas-company.com'
route:
group_by: ['alertname', 'tenant', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 2h
receiver: 'default'
routes:
# Enterprise customer alerts (immediate escalation)
- match:
customer_tier: enterprise
severity: critical
receiver: 'enterprise-critical'
group_wait: 0s
repeat_interval: 15m
# Tenant-specific routing
- match:
tenant: tenant-a
receiver: 'tenant-a-alerts'
- match:
tenant: tenant-b
receiver: 'tenant-b-alerts'
# Platform-wide issues
- match:
alert_type: platform
receiver: 'platform-team'
group_by: ['alertname', 'region']
# Customer-facing service alerts
- match_re:
service: (api-gateway|auth-service|billing)
receiver: 'customer-facing-team'
receivers:
- name: 'default'
webhook_configs:
- url: 'http://alert-router:8080/webhook'
- name: 'enterprise-critical'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_ENTERPRISE_KEY" }}'
description: 'ENTERPRISE CRITICAL: {{ .GroupLabels.alertname }}'
details:
tenant: '{{ .GroupLabels.tenant }}'
customer_tier: '{{ .GroupLabels.customer_tier }}'
email_configs:
- to: 'enterprise-support@saas-company.com'
cc: 'customer-success@saas-company.com'
subject: 'CRITICAL: Enterprise Customer Impact - {{ .GroupLabels.tenant }}'
- name: 'tenant-a-alerts'
webhook_configs:
- url: 'http://tenant-notification-service:8080/notify'
http_config:
basic_auth:
username: 'tenant-a'
password: '{{ env "TENANT_A_PASSWORD" }}'
send_resolved: true
- name: 'platform-team'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#platform-alerts'
title: 'Platform Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Region:* {{ .Labels.region }}
*Affected Tenants:* {{ .Labels.affected_tenants }}
*Impact:* {{ .Annotations.impact }}
{{ end }}
inhibit_rules:
# Inhibit tenant-specific alerts during platform outage
- source_match:
alert_type: platform
severity: critical
target_match:
alert_type: tenant
equal: ['region']
# Inhibit individual service alerts during API gateway issues
- source_match:
alertname: 'APIGatewayDown'
target_match_re:
service: '(auth-service|billing-service|notification-service)'
equal: ['region']
Example 3: Financial Services
Scenario
Financial services company with strict compliance requirements, multiple environments, and complex approval workflows.
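Note that the inhibit_rules at the end of this configuration reference synthetic alerts named MarketClosed and BusinessHoursActive, which only work if something keeps those alerts firing during the relevant windows. One common approach is a pair of clock-based Prometheus rules along these lines (the UTC hours, weekday boundaries, and the trading_venue value are placeholders, not real market hours):
groups:
  - name: time-windows
    rules:
      - alert: MarketClosed
        # fires outside 13:00-20:00 UTC and on weekends (placeholder schedule);
        # one copy per venue so the equal: ['trading_venue'] inhibition matches
        expr: (hour() < 13 or hour() >= 20) or (day_of_week() == 0 or day_of_week() == 6)
        labels:
          trading_venue: nyse
        annotations:
          summary: "Market closed - inhibiting non-critical trading alerts"
      - alert: BusinessHoursActive
        # fires 08:00-18:00 UTC, Monday-Friday (placeholder business hours)
        expr: hour() >= 8 and hour() < 18 and day_of_week() > 0 and day_of_week() < 6
        annotations:
          summary: "Business hours window used to inhibit development info alerts"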
graph TD
A[Financial Services] --> B[Trading Platform]
A --> C[Risk Management]
A --> D[Compliance Systems]
A --> E[Customer Portal]
B --> B1[Order Management]
B --> B2[Market Data]
B --> B3[Settlement]
C --> C1[Real-time Risk]
C --> C2[Credit Monitoring]
C --> C3[Fraud Detection]
D --> D1[Audit Logging]
D --> D2[Regulatory Reporting]
D --> D3[Data Retention]
E --> E1[Account Management]
E --> E2[Portfolio View]
E --> E3[Transaction History]
Compliance-Focused Configuration
global:
smtp_smarthost: 'mail.financial-company.com:587'
smtp_from: 'compliance-alerts@financial-company.com'
resolve_timeout: 10m
route:
group_by: ['alertname', 'compliance_level', 'environment']
group_wait: 60s # Longer wait for compliance review
group_interval: 10m
repeat_interval: 6h
receiver: 'default-compliance'
routes:
# Regulatory compliance alerts (highest priority)
- match:
compliance_level: regulatory
receiver: 'regulatory-compliance'
group_wait: 0s
repeat_interval: 30m
# Trading system alerts
- match:
system: trading
receiver: 'trading-team'
group_by: ['alertname', 'trading_venue']
# Risk management alerts
- match:
system: risk
receiver: 'risk-management'
group_by: ['alertname', 'risk_type']
# Production environment (requires immediate attention)
- match:
environment: production
severity: critical
receiver: 'production-critical'
group_wait: 30s
# Development/staging (business hours only)
- match_re:
environment: (development|staging)
receiver: 'development-team'
group_interval: 1h
repeat_interval: 24h
receivers:
- name: 'default-compliance'
email_configs:
- to: 'compliance-team@financial-company.com'
headers:
  Subject: '[COMPLIANCE] {{ .GroupLabels.alertname }}'
  X-Priority: 'High'
  X-Compliance-Level: '{{ .GroupLabels.compliance_level }}'
- name: 'regulatory-compliance'
email_configs:
- to: 'compliance-officer@financial-company.com'
  headers:
    CC: 'legal-team@financial-company.com'
    Subject: '[REGULATORY] IMMEDIATE ATTENTION REQUIRED'
  text: |
REGULATORY COMPLIANCE ALERT
This alert requires immediate attention and may need to be reported to regulators.
{{ range .Alerts }}
Alert: {{ .Labels.alertname }}
System: {{ .Labels.system }}
Compliance Type: {{ .Labels.compliance_type }}
Regulatory Impact: {{ .Annotations.regulatory_impact }}
Required Actions: {{ .Annotations.required_actions }}
{{ end }}
webhook_configs:
- url: 'https://compliance-system.financial-company.com/api/alerts'
http_config:
bearer_token: '{{ env "COMPLIANCE_SYSTEM_TOKEN" }}'
send_resolved: true
- name: 'trading-team'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_TRADING_KEY" }}'
description: 'Trading System Alert: {{ .GroupLabels.alertname }}'
details:
trading_venue: '{{ .GroupLabels.trading_venue }}'
market_impact: '{{ .GroupLabels.market_impact }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#trading-alerts'
- name: 'risk-management'
email_configs:
- to: 'risk-team@financial-company.com'
webhook_configs:
- url: 'https://risk-system.financial-company.com/api/notifications'
inhibit_rules:
# During market close, inhibit non-critical trading alerts
- source_match:
alertname: 'MarketClosed'
target_match:
system: trading
severity: warning
equal: ['trading_venue']
# Inhibit development alerts during business hours
- source_match:
alertname: 'BusinessHoursActive'
target_match:
environment: development
severity: info
Example 4: Gaming Platform
Scenario
Online gaming platform with real-time multiplayer games, user-generated content, and global infrastructure.
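The gaming configuration below groups and routes on game_title, region, game_mode, and an impact label. In a multi-region deployment, game_title and region are often injected once per Prometheus shard via external_labels rather than repeated on every rule; a rough sketch of both pieces (the file layout, metric name, and label values are illustrative assumptions):
# prometheus.yml on the na-east shard (illustrative)
global:
  external_labels:
    region: na-east
    game_title: space-arena

# rules/matchmaking.yml (matchmaking_queue_seconds_bucket is a hypothetical metric)
groups:
  - name: player-impact
    rules:
      - alert: MatchmakingQueueTimeHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(matchmaking_queue_seconds_bucket[5m])) by (le, game_mode)
          ) > 120
        for: 2m
        labels:
          severity: critical
          impact: player_facing
          service: matchmaking
        annotations:
          summary: "95th percentile matchmaking queue time above 2 minutes"
          player_impact: "Players are waiting too long to find a match"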
graph TD
A[Gaming Platform] --> B[Game Servers]
A --> C[User Services]
A --> D[Content Systems]
A --> E[Analytics]
B --> B1[Matchmaking]
B --> B2[Game Logic]
B --> B3[Real-time Communication]
C --> C1[Authentication]
C --> C2[Player Profiles]
C --> C3[Friends & Social]
D --> D1[Asset Storage]
D --> D2[Content Delivery]
D --> D3[User Generated Content]
E --> E1[Player Analytics]
E --> E2[Game Metrics]
E --> E3[Business Intelligence]
Gaming-Specific Alerting
global:
smtp_smarthost: 'smtp.gaming-company.com:587'
smtp_from: 'game-ops@gaming-company.com'
route:
group_by: ['alertname', 'game_title', 'region']
group_wait: 15s # Fast response for gaming
group_interval: 2m
repeat_interval: 1h
receiver: 'default-gaming'
routes:
# Player-affecting issues (highest priority)
- match:
impact: player_facing
severity: critical
receiver: 'player-impact-critical'
group_wait: 0s
repeat_interval: 10m
# Live events (tournaments, etc.)
- match:
event_type: live_event
receiver: 'live-events-team'
group_wait: 5s
# Matchmaking issues
- match:
service: matchmaking
receiver: 'matchmaking-team'
group_by: ['alertname', 'game_mode', 'region']
# Content delivery issues
- match_re:
service: (cdn|asset-storage|content-delivery)
receiver: 'content-team'
# Regional routing
- match:
region: na-east
receiver: 'na-ops-team'
- match:
region: eu-west
receiver: 'eu-ops-team'
- match:
region: asia-pacific
receiver: 'apac-ops-team'
receivers:
- name: 'default-gaming'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#game-ops'
- name: 'player-impact-critical'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_PLAYER_IMPACT_KEY" }}'
description: 'PLAYER IMPACT: {{ .GroupLabels.alertname }}'
details:
game_title: '{{ .GroupLabels.game_title }}'
affected_players: '{{ .GroupLabels.affected_players }}'
revenue_impact: '{{ .GroupLabels.revenue_impact }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#critical-player-issues'
color: 'danger'
title: '🎮 CRITICAL PLAYER IMPACT'
text: |
**Game:** {{ .GroupLabels.game_title }}
**Region:** {{ .GroupLabels.region }}
**Affected Players:** {{ .GroupLabels.affected_players }}
{{ range .Alerts }}
**Issue:** {{ .Annotations.summary }}
**Player Impact:** {{ .Annotations.player_impact }}
{{ end }}
- name: 'live-events-team'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_LIVE_EVENTS_KEY" }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#live-events'
email_configs:
- to: 'esports-team@gaming-company.com'
- name: 'matchmaking-team'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#matchmaking-alerts'
text: |
**Matchmaking Issue Detected**
{{ range .Alerts }}
**Game:** {{ .Labels.game_title }}
**Mode:** {{ .Labels.game_mode }}
**Region:** {{ .Labels.region }}
**Queue Time:** {{ .Labels.avg_queue_time }}
**Issue:** {{ .Annotations.summary }}
{{ end }}
inhibit_rules:
# During scheduled maintenance, inhibit game server alerts
- source_match:
alertname: 'ScheduledMaintenance'
target_match_re:
service: (game-server|matchmaking|player-data)
equal: ['game_title', 'region']
# Inhibit individual server alerts during region-wide issues
- source_match:
alertname: 'RegionNetworkIssue'
target_match_re:
alertname: '(ServerDown|HighLatency|ConnectionIssues)'
equal: ['region']
This comprehensive book covers Alertmanager from basic concepts to expert-level configurations with real-world examples. Each section builds on the previous ones, providing both theoretical understanding and practical implementation guidance, with Mermaid diagrams used throughout to visualize the more complex concepts.