Prometheus Alertmanager: From Beginner to Expert
Table of Contents
- Introduction to Alertmanager
- Architecture and Core Concepts
- Installation and Setup
- Configuration Fundamentals
- Routing and Grouping
- Notification Channels
- Silencing and Inhibition
- Integration with Prometheus
- Advanced Features
- Monitoring and Troubleshooting
- Best Practices
- Real-world Examples
1. Introduction to Alertmanager
What is Alertmanager?
Alertmanager is a crucial component of the Prometheus monitoring ecosystem that handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing alerts to the correct receiver integrations such as email, PagerDuty, Slack, or webhooks.
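To make the input concrete: clients push alerts to Alertmanager as JSON over HTTP. The sketch below is illustrative only (the alert labels and a local Alertmanager on localhost:9093 are assumptions); it posts a single hand-crafted alert to the v2 API with curl, which is the same payload shape Prometheus sends automatically when a rule fires.
# Push one illustrative alert to a local Alertmanager (v2 API)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[
    {
      "labels": {"alertname": "DemoAlert", "severity": "warning", "instance": "demo:9100"},
      "annotations": {"summary": "Hand-crafted alert for illustration"},
      "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
    }
  ]'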
Why Do We Need Alertmanager?
graph TD
A[Prometheus Server] -->|Firing Alerts| B[Alertmanager]
B --> C[Grouping & Deduplication]
C --> D[Routing Engine]
D --> E[Email]
D --> F[Slack]
D --> G[PagerDuty]
D --> H[Webhook]
style B fill:#ff9999
style C fill:#99ccff
style D fill:#99ff99
Key Problems Alertmanager Solves:
- Alert Fatigue: Groups similar alerts together
- Duplicate Notifications: Deduplicates identical alerts
- Routing Complexity: Routes alerts to appropriate teams/channels
- Notification Management: Handles various notification channels
- Silencing: Temporarily suppress alerts during maintenance
Core Features
- Grouping: Combines related alerts into single notifications
- Inhibition: Suppresses certain alerts when others are firing
- Silencing: Temporarily mute alerts based on matchers
- High Availability: Supports clustering for redundancy
- Web UI: Provides interface for managing alerts and silences
2. Architecture and Core Concepts
Alertmanager Architecture
graph TB
subgraph "Prometheus Ecosystem"
P[Prometheus Server]
AM[Alertmanager]
P -->|HTTP POST /api/v1/alerts| AM
end
subgraph "Alertmanager Internal"
API[API Layer]
ROUTER[Router]
GROUPER[Grouper]
NOTIFIER[Notifier]
SILENCE[Silence Manager]
INHIB[Inhibitor]
API --> ROUTER
ROUTER --> GROUPER
GROUPER --> INHIB
INHIB --> SILENCE
SILENCE --> NOTIFIER
end
subgraph "External Integrations"
EMAIL[Email]
SLACK[Slack]
PD[PagerDuty]
WH[Webhook]
NOTIFIER --> EMAIL
NOTIFIER --> SLACK
NOTIFIER --> PD
NOTIFIER --> WH
end
style AM fill:#ff9999
style ROUTER fill:#99ccff
style NOTIFIER fill:#99ff99
Key Concepts
Alert Lifecycle
stateDiagram-v2
[*] --> Inactive
Inactive --> Pending: Condition Met
Pending --> Firing: Duration Exceeded
Pending --> Inactive: Condition False
Firing --> Inactive: Condition Resolved
note right of Pending: Alert exists but hasn't\nexceeded 'for' duration
note right of Firing: Alert is actively firing\nand sent to Alertmanager
Alert States in Alertmanager
stateDiagram-v2
[*] --> Active
Active --> Suppressed: Silenced/Inhibited
Suppressed --> Active: Silence Expired/Inhibition Removed
Active --> [*]: Alert Resolved
Suppressed --> [*]: Alert Resolved
note right of Active: Alert is processed\nand notifications sent
note right of Suppressed: Alert exists but\nno notifications sent
Data Flow
sequenceDiagram
participant P as Prometheus
participant AM as Alertmanager
participant R as Receiver
P->>AM: POST /api/v1/alerts
AM->>AM: Group Similar Alerts
AM->>AM: Apply Inhibition Rules
AM->>AM: Check Silences
AM->>AM: Route to Receivers
AM->>R: Send Notification
R-->>AM: Acknowledge
3. Installation and Setup
Installation Methods
Method 1: Binary Installation
# Download Alertmanager binary
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
# Extract
tar xvfz alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64
# Run Alertmanager
./alertmanager --config.file=alertmanager.yml
Method 2: Docker Installation
# Run Alertmanager with Docker
docker run -d \
--name alertmanager \
-p 9093:9093 \
-v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
prom/alertmanager:latest
Method 3: Docker Compose
version: '3.8'
services:
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager-data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=http://localhost:9093'
- '--cluster.listen-address=0.0.0.0:9094'
volumes:
alertmanager-data:
Method 4: Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:latest
ports:
- containerPort: 9093
volumeMounts:
- name: config
mountPath: /etc/alertmanager/
volumes:
- name: config
configMap:
name: alertmanager-config
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
spec:
selector:
app: alertmanager
ports:
- port: 9093
targetPort: 9093
type: LoadBalancer
System Service Setup
Systemd Service File
# Create user for Alertmanager
sudo useradd --no-create-home --shell /bin/false alertmanager
# Create directories
sudo mkdir /etc/alertmanager
sudo mkdir /var/lib/alertmanager
# Set ownership
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
# Copy binary
sudo cp alertmanager /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
# Create systemd service
sudo tee /etc/systemd/system/alertmanager.service << EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file /etc/alertmanager/alertmanager.yml \
--storage.path /var/lib/alertmanager/ \
--web.external-url=http://localhost:9093
[Install]
WantedBy=multi-user.target
EOF
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
4. Configuration Fundamentals
Basic Configuration Structure
global:
# Global configuration options
route:
# Root routing configuration
receivers:
# List of notification receivers
inhibit_rules:
# List of inhibition rules
templates:
# Custom notification templates
Configuration Flow
graph TD
A[Alert Received] --> B{Match Route?}
B -->|Yes| C[Apply Grouping]
B -->|No| D[Default Route]
C --> E{Inhibited?}
E -->|No| F{Silenced?}
E -->|Yes| G[Suppress Alert]
F -->|No| H[Send to Receiver]
F -->|Yes| G
D --> C
style A fill:#ffcccc
style H fill:#ccffcc
style G fill:#ffffcc
Basic Configuration Example
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alertmanager@example.com'
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://127.0.0.1:5001/'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
Global Configuration Options
global:
# Time after which an alert is declared resolved if it is no longer reported
resolve_timeout: 5m
# SMTP configuration
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@company.com'
smtp_auth_username: 'alerts@company.com'
smtp_auth_password: 'app_password'
smtp_require_tls: true
# Slack configuration
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
# PagerDuty configuration
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# HTTP configuration
http_config:
proxy_url: 'http://proxy.company.com:8080'
tls_config:
insecure_skip_verify: true
5. Routing and Grouping
Understanding Routes
Routes define how alerts are organized and where they should be sent. The routing tree starts with a root route and can have multiple child routes.
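A quick way to understand a routing tree is to let amtool print it and test which receiver a given label set would reach. A minimal sketch, assuming amtool is installed and the alertmanager.yml under discussion is in the working directory (the severity/team labels are hypothetical):
# Print the routing tree defined in the configuration
amtool config routes show --config.file=alertmanager.yml
# Test which receiver a hypothetical alert would be routed to
amtool config routes test --config.file=alertmanager.yml severity=critical team=frontend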
graph TD
ROOT[Root Route] --> TEAM_A[Team A Route]
ROOT --> TEAM_B[Team B Route]
ROOT --> CRITICAL[Critical Route]
ROOT --> DEFAULT[Default Route]
TEAM_A --> EMAIL_A[Email Team A]
TEAM_B --> SLACK_B[Slack Team B]
CRITICAL --> PAGER[PagerDuty]
DEFAULT --> WEBHOOK[Webhook]
style ROOT fill:#ff9999
style CRITICAL fill:#ffcccc
Route Matching Logic
flowchart TD
A[Alert Received] --> B{Root Route Matches?}
B -->|Yes| C{Child Route 1 Matches?}
B -->|No| Z[Drop Alert]
C -->|Yes| D[Use Child Route 1]
C -->|No| E{Child Route 2 Matches?}
E -->|Yes| F[Use Child Route 2]
E -->|No| G{Continue Flag?}
G -->|Yes| H[Check Next Route]
G -->|No| I[Use Parent Route]
style D fill:#ccffcc
style F fill:#ccffcc
style I fill:#ccffcc
Advanced Routing Configuration
route:
# Default grouping, timing, and receiver
group_by: ['alertname', 'cluster']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'default-receiver'
# Nested routes
routes:
# Critical alerts go to PagerDuty immediately
- match:
severity: critical
receiver: 'pagerduty-critical'
group_wait: 0s
repeat_interval: 5m
# Database alerts go to DB team
- match_re:
service: ^(mysql|postgres|mongodb).*
receiver: 'database-team'
group_by: ['alertname', 'instance']
# Team-specific routing
- match:
team: frontend
receiver: 'frontend-team'
routes:
# Frontend critical alerts
- match:
severity: critical
receiver: 'frontend-oncall'
continue: true # Also send to team channel
# Infrastructure alerts
- match:
component: infrastructure
receiver: 'infra-team'
group_by: ['alertname', 'datacenter']
# Development environment (lower priority)
- match:
environment: development
receiver: 'dev-team'
group_interval: 1h
repeat_interval: 24h
Grouping Strategies
Time-based Grouping
route:
group_by: ['alertname']
group_wait: 10s # Wait for more alerts before sending
group_interval: 10s # Wait before sending additional alerts for group
repeat_interval: 1h # Resend interval for unresolved alerts
Label-based Grouping
route:
# Group by alert name and instance
group_by: ['alertname', 'instance']
routes:
- match:
team: database
group_by: ['alertname', 'database_cluster']
- match:
service: web
group_by: ['alertname', 'datacenter', 'environment']
Grouping Flow Diagram
sequenceDiagram
participant P as Prometheus
participant AM as Alertmanager
participant G as Grouper
participant N as Notifier
P->>AM: Alert 1 (web-server-down, instance=web1)
AM->>G: Create Group [web-server-down, web1]
Note over G: Wait group_wait (10s)
P->>AM: Alert 2 (web-server-down, instance=web2)
AM->>G: Add to Group [web-server-down]
P->>AM: Alert 3 (web-server-down, instance=web3)
AM->>G: Add to Group [web-server-down]
Note over G: group_wait expires
G->>N: Send grouped notification (3 alerts)
Note over G: Wait group_interval (10s)
P->>AM: Alert 4 (web-server-down, instance=web4)
AM->>G: Add to existing Group
Note over G: group_interval expires
G->>N: Send update (4 alerts total)
6. Notification Channels
Supported Receivers
Alertmanager supports various notification channels:
graph LR
AM[Alertmanager] --> EMAIL[Email]
AM --> SLACK[Slack]
AM --> PD[PagerDuty]
AM --> TEAMS[Microsoft Teams]
AM --> WEBHOOK[Webhook]
AM --> PUSHOVER[Pushover]
AM --> OPSGENIE[OpsGenie]
AM --> VICTOROPS[VictorOps]
AM --> WECHAT[WeChat]
AM --> TELEGRAM[Telegram]
style AM fill:#ff9999
Email Configuration
Basic Email Setup
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@company.com'
smtp_auth_username: 'alerts@company.com'
smtp_auth_password: 'app_specific_password'
smtp_require_tls: true
receivers:
- name: 'email-team'
email_configs:
- to: 'team@company.com'
subject: 'Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
{{ end }}
Advanced Email Configuration
receivers:
- name: 'advanced-email'
email_configs:
- to: 'oncall@company.com'
cc: 'team-lead@company.com'
subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} ({{ .Alerts | len }} alerts)'
html: |
<!DOCTYPE html>
<html>
<head>
<style>
.critical { background-color: #ff4444; color: white; }
.warning { background-color: #ffaa00; color: white; }
.info { background-color: #4444ff; color: white; }
</style>
</head>
<body>
<h2>Alert Summary</h2>
<table border="1">
<tr><th>Alert</th><th>Severity</th><th>Instance</th><th>Description</th></tr>
{{ range .Alerts }}
<tr class="{{ .Labels.severity }}">
<td>{{ .Labels.alertname }}</td>
<td>{{ .Labels.severity }}</td>
<td>{{ .Labels.instance }}</td>
<td>{{ .Annotations.description }}</td>
</tr>
{{ end }}
</table>
</body>
</html>
headers:
X-Priority: 'High'
X-MC-Important: 'true'
Slack Configuration
Basic Slack Setup
global:
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
receivers:
- name: 'slack-general'
slack_configs:
- channel: '#alerts'
username: 'Alertmanager'
icon_emoji: ':exclamation:'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
{{ end }}
Advanced Slack Configuration
receivers:
- name: 'slack-advanced'
slack_configs:
- api_url: 'https://hooks.slack.com/services/TEAM/CHANNEL/TOKEN'
channel: '#production-alerts'
username: 'AlertBot'
icon_url: 'https://example.com/alertmanager-icon.png'
title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
title_link: 'http://alertmanager.company.com/#/alerts'
text: |
{{ if eq .Status "firing" }}
:fire: *FIRING ALERTS* :fire:
{{ else }}
:white_check_mark: *RESOLVED ALERTS* :white_check_mark:
{{ end }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Instance:* {{ .Labels.instance }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ if ne .EndsAt .StartsAt }}*Ended:* {{ .EndsAt.Format "2006-01-02 15:04:05" }}{{ end }}
---
{{ end }}
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
fields:
- title: 'Environment'
value: '{{ .GroupLabels.environment }}'
short: true
- title: 'Severity'
value: '{{ .GroupLabels.severity }}'
short: true
actions:
- type: 'button'
text: 'View in Grafana'
url: 'http://grafana.company.com/dashboard'
- type: 'button'
text: 'Silence Alert'
url: 'http://alertmanager.company.com/#/silences/new'
PagerDuty Configuration
global:
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
receivers:
- name: 'pagerduty-critical'
pagerduty_configs:
- routing_key: 'YOUR_INTEGRATION_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .Annotations.summary }}'
severity: '{{ .GroupLabels.severity }}'
source: '{{ .GroupLabels.instance }}'
component: '{{ .GroupLabels.service }}'
group: '{{ .GroupLabels.cluster }}'
class: '{{ .GroupLabels.alertname }}'
details:
firing_alerts: '{{ .Alerts.Firing | len }}'
resolved_alerts: '{{ .Alerts.Resolved | len }}'
alert_details: |
{{ range .Alerts }}
- {{ .Labels.alertname }} on {{ .Labels.instance }}
{{ end }}
Webhook Configuration
receivers:
- name: 'webhook-receiver'
webhook_configs:
- url: 'http://webhook-server.company.com/alerts'
http_config:
basic_auth:
username: 'webhook_user'
password: 'webhook_password'
send_resolved: true
max_alerts: 10
Microsoft Teams Configuration
receivers:
- name: 'teams-alerts'
# The native Teams receiver (msteams_configs) requires Alertmanager v0.26+;
# older versions need a proxy such as prometheus-msteams behind a webhook_config.
msteams_configs:
- webhook_url: 'https://outlook.office.com/webhook/YOUR_TEAMS_WEBHOOK'
send_resolved: true
http_config:
tls_config:
insecure_skip_verify: false
title: 'Alert: {{ .GroupLabels.alertname }}'
text: |
**Status:** {{ .Status | toUpper }}
{{ range .Alerts }}
**Alert:** {{ .Labels.alertname }}
**Instance:** {{ .Labels.instance }}
**Severity:** {{ .Labels.severity }}
**Summary:** {{ .Annotations.summary }}
**Description:** {{ .Annotations.description }}
{{ end }}
7. Silencing and Inhibition
Silencing Alerts
Silencing temporarily mutes alerts based on label matchers. This is useful during maintenance windows or when investigating issues.
graph TD
A[Alert Received] --> B{Matches Silence?}
B -->|Yes| C[Suppress Notification]
B -->|No| D[Process Normally]
C --> E[Log Silenced Alert]
D --> F[Send Notification]
style C fill:#ffffcc
style F fill:#ccffcc
Creating Silences via API
# Create a silence for maintenance
curl -X POST http://localhost:9093/api/v1/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "instance",
"value": "web-server-01",
"isRegex": false
},
{
"name": "alertname",
"value": "InstanceDown",
"isRegex": false
}
],
"startsAt": "2024-01-01T12:00:00Z",
"endsAt": "2024-01-01T14:00:00Z",
"createdBy": "john.doe@company.com",
"comment": "Scheduled maintenance for web-server-01"
}'
Silence Configuration Examples
Silences are not part of alertmanager.yml; the examples below show the fields used when creating silences through the API, the web UI, or amtool.
# Silence all alerts from development environment
- matchers:
- name: environment
value: development
isRegex: false
comment: "Development environment maintenance"
createdBy: "devops-team"
# Silence critical disk alerts during backup window
- matchers:
- name: alertname
value: DiskSpaceHigh
isRegex: false
- name: severity
value: critical
isRegex: false
comment: "Daily backup window"
createdBy: "backup-system"
# Silence all alerts matching regex pattern
- matchers:
- name: instance
value: "web-.*"
isRegex: true
comment: "Web server maintenance"
createdBy: "sre-team"
Inhibition Rules
Inhibition suppresses notifications for certain alerts when other alerts are firing. This prevents alert spam when a root cause alert is already active.
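To observe inhibition end to end, fire a matching source and target alert by hand and check which alerts report a non-empty inhibitedBy field. A rough sketch, assuming amtool points at a local Alertmanager and an inhibit rule like the NodeDown example further below is loaded (the alert names and the web1 instance are made up):
# Fire a critical source alert and a warning target alert on the same instance
amtool alert add alertname=NodeDown severity=critical instance=web1
amtool alert add alertname=HighCPU severity=warning instance=web1
# The warning alert should now list the source alert in its inhibitedBy field
curl -s http://localhost:9093/api/v2/alerts | grep -o '"inhibitedBy":\[[^]]*\]'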
graph TD
A[Source Alert Firing] --> B{Target Alerts Match?}
B -->|Yes| C[Inhibit Target Alerts]
B -->|No| D[Allow Target Alerts]
E[Source Alert Resolved] --> F[Remove Inhibition]
F --> G[Target Alerts Active Again]
style C fill:#ffffcc
style D fill:#ccffcc
Basic Inhibition Rules
inhibit_rules:
# Inhibit warning alerts when critical alerts are firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
# Inhibit individual service alerts when entire node is down
- source_match:
alertname: 'NodeDown'
target_match_re:
alertname: '(ServiceDown|HighCPU|HighMemory)'
equal: ['instance']
# Inhibit database connection alerts when database is down
- source_match:
alertname: 'DatabaseDown'
target_match:
alertname: 'DatabaseConnectionFailed'
equal: ['database_cluster']
Advanced Inhibition Examples
inhibit_rules:
# Complex multi-label matching
- source_match:
alertname: 'DatacenterPowerOutage'
target_match_re:
alertname: '(InstanceDown|ServiceUnavailable|NetworkUnreachable)'
equal: ['datacenter', 'region']
# Inhibit application alerts during deployment
- source_match:
alertname: 'DeploymentInProgress'
environment: 'production'
target_match_re:
alertname: '(HighErrorRate|SlowResponse|ServiceDown)'
equal: ['service', 'environment']
# Inhibit monitoring alerts when monitoring system is down
- source_match:
alertname: 'PrometheusDown'
target_match_re:
alertname: '(.*)'
equal: ['monitoring_cluster']
Inhibition Flow
sequenceDiagram
participant P as Prometheus
participant AM as Alertmanager
participant I as Inhibitor
participant N as Notifier
P->>AM: NodeDown Alert (Critical)
AM->>I: Check inhibition rules
I->>I: Store active inhibition
P->>AM: HighCPU Alert (Warning)
AM->>I: Check if inhibited
I-->>AM: Inhibited by NodeDown
AM->>AM: Suppress HighCPU notification
P->>AM: NodeDown Resolved
AM->>I: Remove inhibition
I->>N: Allow suppressed alerts
N->>N: Process HighCPU if still active
Managing Silences
List Active Silences
# Get all silences
curl http://localhost:9093/api/v1/silences
# Get specific silence
curl http://localhost:9093/api/v1/silence/SILENCE_ID
Update Silence
# Expire a silence early
curl -X DELETE http://localhost:9093/api/v1/silence/SILENCE_ID
Silence Best Practices
# Template for emergency silence
emergency_silence_template: |
matchers:
- name: severity
value: critical
isRegex: false
- name: team
value: "{{ .team }}"
isRegex: false
comment: "Emergency silence - {{ .reason }}"
createdBy: "{{ .operator }}"
endsAt: "{{ .end_time }}"
# Scheduled maintenance silence
maintenance_silence_template: |
matchers:
- name: instance
value: "{{ .instance_pattern }}"
isRegex: true
comment: "Scheduled maintenance: {{ .maintenance_ticket }}"
createdBy: "maintenance-system"
startsAt: "{{ .maintenance_start }}"
endsAt: "{{ .maintenance_end }}"
8. Integration with Prometheus
Prometheus Configuration
To send alerts to Alertmanager, configure Prometheus with alert rules and Alertmanager endpoints.
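Before reloading Prometheus, both the main configuration and the rule files can be validated with promtool. A quick sketch, assuming promtool is on the PATH and the file names match the layout used later in this section:
# Validate the Prometheus configuration, including the alerting section
promtool check config prometheus.yml
# Validate alert rule files before loading them
promtool check rules alert_rules/basic_alerts.yml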
graph LR
P[Prometheus] -->|Scrape Metrics| T[Targets]
P -->|Evaluate Rules| R[Alert Rules]
R -->|Fire Alerts| AM[Alertmanager]
AM -->|Notifications| N[Receivers]
style P fill:#ff9999
style AM fill:#99ccff
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager-1:9093
- alertmanager-2:9093
- alertmanager-3:9093
timeout: 10s
api_version: v2
# Load alert rules
rule_files:
- "alert_rules/*.yml"
- "recording_rules/*.yml"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
Alert Rules
Basic Alert Rules
# alert_rules/basic_alerts.yml
groups:
- name: basic_alerts
rules:
# Instance down alert
- alert: InstanceDown
expr: up == 0
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.instance }} has been down for more than 5 minutes"
runbook_url: "https://wiki.company.com/runbooks/instance-down"
# High CPU usage
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 10m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% for more than 10 minutes"
current_value: "{{ $value | humanizePercentage }}"
# High memory usage
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 15m
labels:
severity: critical
team: infrastructure
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90% for more than 15 minutes"
current_value: "{{ $value | humanizePercentage }}"
Advanced Alert Rules
# alert_rules/application_alerts.yml
groups:
- name: application_alerts
rules:
# HTTP error rate too high
- alert: HighHTTPErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, instance)
/
sum(rate(http_requests_total[5m])) by (service, instance)
) * 100 > 5
for: 5m
labels:
severity: critical
team: "{{ $labels.service }}"
annotations:
summary: "High HTTP error rate for {{ $labels.service }}"
description: "HTTP 5xx error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
grafana_url: "http://grafana.company.com/d/http-dashboard"
# Response time too high
- alert: HighResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
) > 0.5
for: 10m
labels:
severity: warning
team: "{{ $labels.service }}"
annotations:
summary: "High response time for {{ $labels.service }}"
description: "95th percentile response time is {{ $value | humanizeDuration }}"
# Database connection pool exhausted
- alert: DatabaseConnectionPoolExhausted
expr: |
(
sum(database_connections_active) by (database, instance)
/
sum(database_connections_max) by (database, instance)
) * 100 > 90
for: 5m
labels:
severity: critical
team: database
annotations:
summary: "Database connection pool almost exhausted"
description: "{{ $labels.database }} connection pool is {{ $value | humanizePercentage }} full"
# Disk space running low
- alert: DiskSpaceLow
expr: |
(
node_filesystem_avail_bytes{fstype!="tmpfs"}
/
node_filesystem_size_bytes{fstype!="tmpfs"}
) * 100 < 10
for: 30m
labels:
severity: warning
team: infrastructure
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} has only {{ $value | humanizePercentage }} space left"
# Service down
- alert: ServiceDown
expr: probe_success == 0
for: 5m
labels:
severity: critical
team: "{{ $labels.team }}"
annotations:
summary: "Service {{ $labels.job }} is down"
description: "{{ $labels.instance }} has been down for more than 5 minutes"
Multi-Datacenter Alert Rules
# alert_rules/datacenter_alerts.yml
groups:
- name: datacenter_alerts
rules:
# Datacenter connectivity issues
- alert: DatacenterConnectivityIssue
expr: |
up{job="datacenter-health"} == 0
or
increase(network_packets_dropped_total[5m]) > 1000
for: 2m
labels:
severity: critical
team: network
escalate: "true"
annotations:
summary: "Connectivity issues in {{ $labels.datacenter }}"
description: "Network connectivity problems detected in {{ $labels.datacenter }}"
# Cross-datacenter replication lag
- alert: HighReplicationLag
expr: |
database_replication_lag_seconds > 300
for: 10m
labels:
severity: warning
team: database
annotations:
summary: "High replication lag between datacenters"
description: "Replication lag is {{ $value | humanizeDuration }} between {{ $labels.source_dc }} and {{ $labels.target_dc }}"
# Load balancer backend down
- alert: LoadBalancerBackendDown
expr: |
haproxy_server_up == 0
for: 1m
labels:
severity: critical
team: network
annotations:
summary: "Load balancer backend {{ $labels.server }} is down"
description: "Backend server {{ $labels.server }} in {{ $labels.backend }} is not responding"
Alert Rule Best Practices
Rule Organization
graph TD
A[Alert Rules] --> B[Infrastructure]
A --> C[Applications]
A --> D[Security]
A --> E[Business]
B --> B1[Node Exporter]
B --> B2[Network]
B --> B3[Storage]
C --> C1[Web Services]
C --> C2[Databases]
C --> C3[Message Queues]
D --> D1[Authentication]
D --> D2[Compliance]
E --> E1[SLA Violations]
E --> E2[Revenue Impact]
Template for Alert Rules
# Template for standardized alerts
- alert: AlertName
expr: |
# Multi-line PromQL query
metric_expression
for: duration
labels:
severity: critical|warning|info
team: responsible_team
service: service_name
environment: prod|staging|dev
escalate: "true|false"
annotations:
summary: "Brief description of the issue"
description: "Detailed description with context and impact"
runbook_url: "https://runbooks.company.com/alert-name"
dashboard_url: "https://grafana.company.com/dashboard"
current_value: "{{ $value | humanize }}"
threshold: "threshold_value"
9. Advanced Features
High Availability Setup
Setting up Alertmanager in HA mode ensures no single point of failure.
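Once the replicas are running, each one exposes a readiness endpoint and a status API that lists its cluster peers, which makes it easy to confirm the gossip mesh has formed. A rough health-check sketch, assuming three replicas reachable under the hostnames used in this section:
# Confirm every replica is ready and sees its peers
for host in alertmanager-1 alertmanager-2 alertmanager-3; do
  curl -s -o /dev/null -w "$host ready: %{http_code}\n" "http://$host:9093/-/ready"
  # Rough peer count: each peer entry in /api/v2/status carries an "address" field
  echo "peers seen by $host: $(curl -s "http://$host:9093/api/v2/status" | grep -o '"address"' | wc -l)"
done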
graph TD
subgraph "Prometheus Instances"
P1[Prometheus 1]
P2[Prometheus 2]
P3[Prometheus 3]
end
subgraph "Alertmanager Cluster"
AM1[Alertmanager 1:9093]
AM2[Alertmanager 2:9094]
AM3[Alertmanager 3:9095]
AM1 -.->|Gossip Protocol| AM2
AM2 -.->|Gossip Protocol| AM3
AM3 -.->|Gossip Protocol| AM1
end
P1 --> AM1
P1 --> AM2
P1 --> AM3
P2 --> AM1
P2 --> AM2
P2 --> AM3
P3 --> AM1
P3 --> AM2
P3 --> AM3
AM1 --> RECEIVER[Notification Receivers]
AM2 --> RECEIVER
AM3 --> RECEIVER
style AM1 fill:#ff9999
style AM2 fill:#ff9999
style AM3 fill:#ff9999
HA Configuration
# alertmanager-1.yml
global:
smtp_smarthost: 'smtp.company.com:587'
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://webhook.company.com/alerts'
# Clustering is configured via command-line flags rather than in alertmanager.yml:
#   --cluster.listen-address=0.0.0.0:9094
#   --cluster.peer=alertmanager-2.company.com:9094
#   --cluster.peer=alertmanager-3.company.com:9094
#   --cluster.gossip-interval=200ms
#   --cluster.pushpull-interval=1m
Docker Compose HA Setup
version: '3.8'
services:
alertmanager-1:
image: prom/alertmanager:latest
ports:
- "9093:9093"
- "9094:9094"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=alertmanager-2:9094'
- '--cluster.peer=alertmanager-3:9094'
- '--web.external-url=http://localhost:9093'
networks:
- alerting
alertmanager-2:
image: prom/alertmanager:latest
ports:
- "9095:9093"
- "9096:9094"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=alertmanager-1:9094'
- '--cluster.peer=alertmanager-3:9094'
- '--web.external-url=http://localhost:9095'
networks:
- alerting
alertmanager-3:
image: prom/alertmanager:latest
ports:
- "9097:9093"
- "9098:9094"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--cluster.listen-address=0.0.0.0:9094'
- '--cluster.peer=alertmanager-1:9094'
- '--cluster.peer=alertmanager-2:9094'
- '--web.external-url=http://localhost:9097'
networks:
- alerting
networks:
alerting:
driver: bridge
Custom Templates
Create custom notification templates for better formatting.
Template Structure
graph TD
A[Template Files] --> B[Email Templates]
A --> C[Slack Templates]
A --> D[Webhook Templates]
B --> B1[HTML Templates]
B --> B2[Text Templates]
C --> C1[Message Format]
C --> C2[Attachment Format]
D --> D1[JSON Format]
D --> D2[Custom Format]
Email Templates
<!-- templates/email.html -->
<!DOCTYPE html>
<html>
<head>
<style>
body { font-family: Arial, sans-serif; }
.alert-critical { background-color: #d32f2f; color: white; }
.alert-warning { background-color: #f57c00; color: white; }
.alert-info { background-color: #1976d2; color: white; }
.resolved { background-color: #388e3c; color: white; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #f2f2f2; }
</style>
</head>
<body>
<h1>{{ if eq .Status "firing" }}🔥 ALERTS FIRING{{ else }}✅ ALERTS RESOLVED{{ end }}</h1>
<h2>Summary</h2>
<ul>
<li><strong>Status:</strong> {{ .Status | toUpper }}</li>
<li><strong>Group:</strong> {{ .GroupLabels.alertname }}</li>
<li><strong>Total Alerts:</strong> {{ .Alerts | len }}</li>
<li><strong>Firing:</strong> {{ .Alerts.Firing | len }}</li>
<li><strong>Resolved:</strong> {{ .Alerts.Resolved | len }}</li>
</ul>
<h2>Alert Details</h2>
<table>
<tr>
<th>Alert</th>
<th>Severity</th>
<th>Instance</th>
<th>Status</th>
<th>Started</th>
<th>Summary</th>
</tr>
{{ range .Alerts }}
<tr class="alert-{{ .Labels.severity }}{{ if eq .Status "resolved" }} resolved{{ end }}">
<td>{{ .Labels.alertname }}</td>
<td>{{ .Labels.severity | toUpper }}</td>
<td>{{ .Labels.instance }}</td>
<td>{{ .Status | toUpper }}</td>
<td>{{ .StartsAt.Format "2006-01-02 15:04:05" }}</td>
<td>{{ .Annotations.summary }}</td>
</tr>
{{ end }}
</table>
<h2>Actions</h2>
<ul>
<li><a href="http://alertmanager.company.com">View in Alertmanager</a></li>
<li><a href="http://grafana.company.com">View in Grafana</a></li>
<li><a href="http://alertmanager.company.com/#/silences/new">Create Silence</a></li>
</ul>
</body>
</html>
Slack Templates
# templates/slack.tmpl
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{ end }}
{{ define "slack.text" }}
{{ if eq .Status "firing" }}
:fire: **FIRING ALERTS** :fire:
{{ else }}
:white_check_mark: **RESOLVED ALERTS** :white_check_mark:
{{ end }}
{{ range .Alerts }}
{{ if eq .Status "firing" }}:red_circle:{{ else }}:green_circle:{{ end }} **{{ .Labels.alertname }}**
• **Instance:** {{ .Labels.instance }}
• **Severity:** {{ .Labels.severity | toUpper }}
• **Summary:** {{ .Annotations.summary }}
• **Started:** {{ .StartsAt.Format "Jan 02, 2006 15:04:05 MST" }}
{{ if .Annotations.runbook_url }}• **Runbook:** {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ if gt (len .GroupLabels) 0 }}
**Labels:** {{ range .GroupLabels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
{{ end }}
{{ end }}
{{ define "slack.color" }}
{{ if eq .Status "firing" }}
{{ if eq .GroupLabels.severity "critical" }}danger{{ else }}warning{{ end }}
{{ else }}
good
{{ end }}
{{ end }}
Using Templates in Configuration
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alertmanager@company.com'
templates:
- '/etc/alertmanager/templates/*.tmpl'
receivers:
- name: 'email-templates'
email_configs:
- to: 'team@company.com'
subject: '{{ template "email.subject" . }}'
html: '{{ template "email.html" . }}'
- name: 'slack-templates'
slack_configs:
- api_url: 'https://hooks.slack.com/services/...'
channel: '#alerts'
title: '{{ template "slack.title" . }}'
text: '{{ template "slack.text" . }}'
color: '{{ template "slack.color" . }}'
API Usage and Automation
Alertmanager provides a REST API for automation and integration.
API Endpoints Overview
graph LR
API[Alertmanager API] --> ALERTS["/api/v1/alerts"]
API --> SILENCES["/api/v1/silences"]
API --> RECEIVERS["/api/v1/receivers"]
API --> STATUS["/api/v1/status (includes the loaded config)"]
ALERTS --> GET_ALERTS[GET: List alerts]
ALERTS --> POST_ALERTS[POST: Send alerts]
SILENCES --> GET_SILENCES[GET: List silences]
SILENCES --> POST_SILENCES[POST: Create silence]
SILENCES --> DELETE_SILENCE[DELETE: Expire silence]
Common API Operations
# Get all active alerts
curl -X GET http://localhost:9093/api/v1/alerts
# Get alerts with specific labels
curl -X GET "http://localhost:9093/api/v1/alerts?filter=alertname%3DHighCPU"
# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
-H "Content-Type: application/json" \
-d '[
{
"labels": {
"alertname": "TestAlert",
"instance": "localhost:9090",
"severity": "warning"
},
"annotations": {
"summary": "This is a test alert",
"description": "Testing alertmanager configuration"
},
"startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)'",
"endsAt": "'$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%S.%3NZ)'"
}
]'
# Create silence
curl -X POST http://localhost:9093/api/v1/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "alertname",
"value": "HighCPU",
"isRegex": false
}
],
"startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)'",
"endsAt": "'$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%S.%3NZ)'",
"createdBy": "automation-script",
"comment": "Automated silence during maintenance"
}'
# Get status (the response includes the loaded configuration)
curl -X GET http://localhost:9093/api/v1/status
Python API Client Example
import requests
import json
from datetime import datetime, timedelta
class AlertmanagerClient:
def __init__(self, base_url):
self.base_url = base_url.rstrip('/')
def get_alerts(self, filters=None):
"""Get all alerts or filtered alerts"""
url = f"{self.base_url}/api/v1/alerts"
params = {}
if filters:
params['filter'] = filters
response = requests.get(url, params=params)
response.raise_for_status()
return response.json()['data']
def send_alert(self, alertname, labels, annotations, starts_at=None, ends_at=None):
"""Send a test alert"""
url = f"{self.base_url}/api/v1/alerts"
if not starts_at:
starts_at = datetime.utcnow()
if not ends_at:
ends_at = starts_at + timedelta(hours=1)
alert = {
"labels": {"alertname": alertname, **labels},
"annotations": annotations,
"startsAt": starts_at.isoformat() + 'Z',
"endsAt": ends_at.isoformat() + 'Z'
}
response = requests.post(url, json=[alert])
response.raise_for_status()
return response.json()
def create_silence(self, matchers, comment, created_by, duration_hours=1):
"""Create a silence"""
url = f"{self.base_url}/api/v1/silences"
starts_at = datetime.utcnow()
ends_at = starts_at + timedelta(hours=duration_hours)
silence = {
"matchers": matchers,
"startsAt": starts_at.isoformat() + 'Z',
"endsAt": ends_at.isoformat() + 'Z',
"createdBy": created_by,
"comment": comment
}
response = requests.post(url, json=silence)
response.raise_for_status()
return response.json()
def get_silences(self):
"""Get all silences"""
url = f"{self.base_url}/api/v1/silences"
response = requests.get(url)
response.raise_for_status()
return response.json()['data']
def expire_silence(self, silence_id):
"""Expire a silence"""
url = f"{self.base_url}/api/v1/silence/{silence_id}"
response = requests.delete(url)
response.raise_for_status()
return response.status_code == 200
# Usage example
if __name__ == "__main__":
client = AlertmanagerClient("http://localhost:9093")
# Send test alert
client.send_alert(
alertname="APITestAlert",
labels={"instance": "test-server", "severity": "warning"},
annotations={
"summary": "Test alert from API",
"description": "This is a test alert sent via API"
}
)
# Create silence
matchers = [
{"name": "alertname", "value": "APITestAlert", "isRegex": False}
]
silence_response = client.create_silence(
matchers=matchers,
comment="Testing API silence creation",
created_by="api-script",
duration_hours=2
)
print(f"Created silence with ID: {silence_response['silenceID']}")
10. Monitoring and Troubleshooting
Monitoring Alertmanager Itself
It’s crucial to monitor Alertmanager to ensure it’s functioning correctly.
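A quick manual check is to pull the /metrics endpoint and look for the self-monitoring series used by the rules below. A sketch, assuming a local instance on port 9093:
# Spot-check key Alertmanager self-metrics
curl -s http://localhost:9093/metrics | grep -E \
  'alertmanager_alerts\{|alertmanager_notifications_failed_total|alertmanager_cluster_members|alertmanager_config_last_reload_successful'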
graph TD
AM[Alertmanager] --> METRICS["/metrics endpoint"]
METRICS --> PROM[Prometheus]
PROM --> GRAFANA[Grafana Dashboard]
PROM --> ALERTS[Alertmanager Alerts]
ALERTS --> EMAIL[Email Notifications]
ALERTS --> SLACK[Slack Notifications]
style AM fill:#ff9999
style ALERTS fill:#ffcccc
Key Metrics to Monitor
# alertmanager_monitoring_rules.yml
groups:
- name: alertmanager_monitoring
rules:
# Alertmanager is down
- alert: AlertmanagerDown
expr: up{job="alertmanager"} == 0
for: 5m
labels:
severity: critical
service: alertmanager
annotations:
summary: "Alertmanager instance is down"
description: "Alertmanager instance {{ $labels.instance }} is down"
# Configuration reload failed
- alert: AlertmanagerConfigReloadFailed
expr: alertmanager_config_last_reload_successful == 0
for: 10m
labels:
severity: critical
service: alertmanager
annotations:
summary: "Alertmanager configuration reload failed"
description: "Alertmanager {{ $labels.instance }} configuration reload failed"
# High number of alerts
- alert: AlertmanagerHighAlertVolume
expr: sum(alertmanager_alerts) by (instance) > 1000
for: 10m
labels:
severity: warning
service: alertmanager
annotations:
summary: "High volume of alerts in Alertmanager"
description: "Alertmanager {{ $labels.instance }} is processing {{ $value }} alerts"
# Notification failures
- alert: AlertmanagerNotificationFailed
expr: rate(alertmanager_notifications_failed_total[5m]) > 0.1
for: 10m
labels:
severity: warning
service: alertmanager
annotations:
summary: "Alertmanager notifications failing"
description: "Alertmanager {{ $labels.instance }} notification failure rate is {{ $value | humanizePercentage }}"
# Cluster member down
- alert: AlertmanagerClusterMemberDown
expr: alertmanager_cluster_members != on (job) group_left count by (job) (up{job="alertmanager"})
for: 15m
labels:
severity: warning
service: alertmanager
annotations:
summary: "Alertmanager cluster member missing"
description: "Alertmanager cluster has {{ $value }} members but should have more"
Prometheus Scrape Configuration
# prometheus.yml
scrape_configs:
- job_name: 'alertmanager'
static_configs:
- targets:
- 'alertmanager-1:9093'
- 'alertmanager-2:9093'
- 'alertmanager-3:9093'
scrape_interval: 30s
metrics_path: /metrics
Common Issues and Solutions
Troubleshooting Flow
flowchart TD
A[Alert Issue] --> B{Alert Received?}
B -->|No| C[Check Prometheus Config]
B -->|Yes| D{Notification Sent?}
C --> C1[Verify alertmanager URL]
C --> C2[Check alert rules]
C --> C3[Verify connectivity]
D -->|No| E[Check Alertmanager]
D -->|Yes| F[Issue Resolved]
E --> E1[Check routing rules]
E --> E2[Verify receiver config]
E --> E3[Check silences]
E --> E4[Check inhibition rules]
style A fill:#ff9999
style F fill:#ccffcc
Common Problems and Solutions
- Alerts Not Firing
# Check if Prometheus can reach Alertmanager
curl http://prometheus:9090/api/v1/alertmanagers
# Check alert rule evaluation
curl http://prometheus:9090/api/v1/rules
# Verify alert is active in Prometheus
curl http://prometheus:9090/api/v1/alerts
- Notifications Not Sent
# Check Alertmanager logs
docker logs alertmanager
# Verify the loaded configuration (returned as part of the status endpoint)
curl http://alertmanager:9093/api/v1/status
# Check for silences
curl http://alertmanager:9093/api/v1/silences
# Test notification manually
amtool alert add alertname=TestAlert severity=warning instance=test
- Configuration Issues
# Validate configuration with amtool (shipped alongside Alertmanager)
amtool check-config alertmanager.yml
# Check configuration reload status
curl http://alertmanager:9093/api/v1/status
Debug Tools
# Install amtool (Alertmanager CLI tool)
go install github.com/prometheus/alertmanager/cmd/amtool@latest
# Point amtool at your Alertmanager instance
mkdir -p ~/.config/amtool
echo "alertmanager.url: http://localhost:9093" > ~/.config/amtool/config.yml
# List alerts
amtool alert query
# List silences
amtool silence query
# Create test alert
amtool alert add alertname=TestAlert severity=critical instance=localhost
# Create silence
amtool silence add alertname=TestAlert --duration=1h --comment="Testing silence"
# Import silences from file
amtool silence import < silences.json
# Export silences to file (JSON output can be re-imported later)
amtool silence query -o json > silences.json
Log Analysis
Log Patterns to Monitor
# Error patterns to watch for
grep -E "(error|Error|ERROR)" /var/log/alertmanager/alertmanager.log
# Configuration reload events
grep "Completed loading of configuration file" /var/log/alertmanager/alertmanager.log
# Notification failures
grep "notify.*failed" /var/log/alertmanager/alertmanager.log
# Cluster communication issues
grep "cluster.*error" /var/log/alertmanager/alertmanager.log
Structured Logging Configuration
# Add to Alertmanager startup flags
--log.format=json
--log.level=info
Log Aggregation with Fluentd/Fluentbit
# fluent-bit.conf
[INPUT]
Name tail
Path /var/log/alertmanager/alertmanager.log
Tag alertmanager
Parser json
[OUTPUT]
Name elasticsearch
Match alertmanager
Host elasticsearch.company.com
Port 9200
Index alertmanager-logs
11. Best Practices
Configuration Best Practices
Organization and Structure
graph TD
A[Configuration Best Practices] --> B[File Organization]
A --> C[Naming Conventions]
A --> D[Environment Separation]
A --> E[Security Practices]
B --> B1[config/]
B --> B2[templates/]
B --> B3[rules/]
C --> C1[Descriptive Names]
C --> C2[Consistent Patterns]
D --> D1[Dev/Stage/Prod]
D --> D2[Feature Flags]
E --> E1[Secrets Management]
E --> E2[Access Control]
File Structure Best Practices
# Recommended directory structure
alertmanager/
├── config/
│ ├── alertmanager-dev.yml
│ ├── alertmanager-staging.yml
│ └── alertmanager-prod.yml
├── templates/
│ ├── email/
│ │ ├── html.tmpl
│ │ └── text.tmpl
│ ├── slack/
│ │ └── message.tmpl
│ └── common/
│ └── functions.tmpl
├── rules/
│ ├── infrastructure.yml
│ ├── applications.yml
│ └── business.yml
└── scripts/
├── deploy.sh
├── validate.sh
└── test.sh
Configuration Validation
#!/bin/bash
# Validation script template
set -e
CONFIG_FILE="$1"
AMTOOL_BINARY="./amtool"
echo "Validating Alertmanager configuration: $CONFIG_FILE"
# Syntax check (amtool ships with the Alertmanager release)
$AMTOOL_BINARY check-config "$CONFIG_FILE"
# Template validation
if [ -d "templates/" ]; then
echo "Validating templates..."
for template in templates/*.tmpl; do
echo " Checking $template"
# Add template-specific validation here
done
fi
echo "Configuration validation passed!"
Environment-Specific Configurations
# alertmanager-prod.yml
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alerts-prod@company.com'
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default-prod'
receivers:
- name: 'default-prod'
email_configs:
- to: 'oncall-prod@company.com'
slack_configs:
- api_url: '{{ .SlackProdURL }}'
channel: '#production-alerts'
---
# alertmanager-dev.yml
global:
smtp_smarthost: 'localhost:1025' # MailHog for testing
smtp_from: 'alerts-dev@company.com'
resolve_timeout: 1m
route:
group_by: ['alertname']
group_wait: 5s
group_interval: 10s
repeat_interval: 1h
receiver: 'default-dev'
receivers:
- name: 'default-dev'
webhook_configs:
- url: 'http://webhook-test:8080/alerts'
Alert Design Best Practices
Alert Quality Guidelines
flowchart TD
A[Alert Quality] --> B[Actionable]
A --> C[Meaningful]
A --> D[Proportional]
A --> E[Contextual]
B --> B1[Clear Action Required]
B --> B2[Owner Identified]
C --> C1[Business Impact]
C --> C2[User Impact]
D --> D1[Severity Matches Impact]
D --> D2[Frequency Appropriate]
E --> E1[Sufficient Information]
E --> E2[Links to Resources]
Alert Rule Standards
# Standard alert template
- alert: StandardAlertName
expr: |
# Clear, readable PromQL expression
metric_name{label="value"} > threshold
for: 5m # Appropriate duration to avoid flapping
labels:
severity: critical|warning|info
team: responsible_team
service: affected_service
environment: prod|staging|dev
runbook: "runbook-identifier"
annotations:
summary: "Brief, actionable description (< 80 chars)"
description: |
Detailed description with:
- What is happening
- Why it matters
- Current value: {{ $value }}
- Expected threshold: {{ .threshold }}
runbook_url: "https://runbooks.company.com/{{ .Labels.runbook }}"
dashboard_url: "https://grafana.company.com/d/dashboard-id"
grafana_panel_url: "https://grafana.company.com/d/dashboard-id?panelId=1"
Severity Guidelines
# Severity classification
severity_guidelines:
critical:
description: "Service is completely down or severely degraded"
response_time: "Immediate (5 minutes)"
examples:
- "Complete service outage"
- "Data loss imminent"
- "Security breach"
warning:
description: "Service degraded but still functional"
response_time: "Within business hours (4 hours)"
examples:
- "High error rate"
- "Performance degradation"
- "Capacity concerns"
info:
description: "Informational, no immediate action needed"
response_time: "Best effort"
examples:
- "Deployment notifications"
- "Capacity planning info"
- "Maintenance reminders"
Operational Best Practices
On-Call Procedures
sequenceDiagram
participant A as Alert Fires
participant AM as Alertmanager
participant OC as On-Call Engineer
participant T as Team
participant M as Management
A->>AM: Critical Alert
AM->>OC: Immediate Notification
alt Response within 5 minutes
OC->>OC: Acknowledge Alert
OC->>AM: Update Status
else No response
AM->>T: Escalate to Team Lead
alt No response from team
AM->>M: Escalate to Management
end
end
OC->>OC: Investigate & Resolve
OC->>AM: Mark Resolved
Escalation Policies
# Escalation configuration
escalation_policies:
production_critical:
level_1:
- "primary-oncall@company.com"
- timeout: 5m
level_2:
- "team-lead@company.com"
- "secondary-oncall@company.com"
- timeout: 10m
level_3:
- "engineering-manager@company.com"
- "director@company.com"
- timeout: 15m
production_warning:
level_1:
- "team-channel@slack"
- timeout: 30m
level_2:
- "team-lead@company.com"
- timeout: 2h
Silence Management
# Silence management best practices
silence_policies:
maintenance_windows:
- prefix: "MAINT-"
- max_duration: "4h"
- required_fields: ["ticket_number", "approval"]
- auto_expire: true
emergency_silences:
- prefix: "EMERG-"
- max_duration: "2h"
- required_fields: ["incident_id", "responder"]
- approval_required: false
scheduled_silences:
- prefix: "SCHED-"
- max_duration: "24h"
- required_fields: ["change_request", "owner"]
- advance_notice: "24h"
Security Best Practices
Authentication and Authorization
graph TD
A[Security Layers] --> B[Network Security]
A --> C[Authentication]
A --> D[Authorization]
A --> E[Encryption]
B --> B1[Firewall Rules]
B --> B2[VPN Access]
C --> C1[OAuth/OIDC]
C --> C2[API Keys]
D --> D1[RBAC]
D --> D2[Team-based Access]
E --> E1[TLS Everywhere]
E --> E2[Secrets Management]
Secure Configuration
# Secure Alertmanager configuration
global:
# Use TLS for SMTP
smtp_require_tls: true
smtp_auth_username: '{{ env "SMTP_USERNAME" }}'
smtp_auth_password: '{{ env "SMTP_PASSWORD" }}'
# HTTP client configuration
http_config:
tls_config:
# Verify certificates
insecure_skip_verify: false
# Use specific CA if needed
ca_file: /etc/ssl/certs/ca-bundle.pem
# Use environment variables for secrets
receivers:
- name: 'secure-webhook'
webhook_configs:
- url: 'https://webhook.company.com/alerts'
http_config:
bearer_token: '{{ env "WEBHOOK_TOKEN" }}'
tls_config:
cert_file: /etc/alertmanager/client.crt
key_file: /etc/alertmanager/client.key
Container Security
# Secure Dockerfile for Alertmanager
FROM alpine:3.18
# Create non-root user
RUN addgroup -g 1001 alertmanager && \
adduser -D -s /bin/sh -u 1001 -G alertmanager alertmanager
# Install certificates
RUN apk add --no-cache ca-certificates
# Copy binary and set permissions
COPY --from=builder /app/alertmanager /bin/alertmanager
RUN chmod +x /bin/alertmanager
# Create directories with proper ownership
RUN mkdir -p /etc/alertmanager /var/lib/alertmanager && \
chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
USER alertmanager
EXPOSE 9093
ENTRYPOINT ["/bin/alertmanager"]
CMD ["--config.file=/etc/alertmanager/alertmanager.yml", \
"--storage.path=/var/lib/alertmanager", \
"--web.external-url=http://localhost:9093"]
Performance Optimization
Resource Management
# Resource optimization guidelines
resource_management:
memory:
- "Size based on alert volume and retention"
- "~1GB RAM per 100k active alerts"
- "Monitor alertmanager_alerts metric"
cpu:
- "Generally not CPU intensive"
- "Scale with notification volume"
- "2-4 cores sufficient for most workloads"
storage:
- "Minimal storage requirements"
- "~10MB per million alerts"
- "Use SSD for better performance"
network:
- "Outbound bandwidth for notifications"
- "Inbound for receiving alerts"
- "Consider notification channel limits"
High Availability Configuration
# HA deployment best practices
ha_configuration:
cluster_size:
- minimum: 3
- recommended: 3-5
- maximum: 7
deployment:
- "Spread across availability zones"
- "Use anti-affinity rules"
- "Monitor cluster health"
load_balancing:
- "Use load balancer for Prometheus"
- "Health check: GET /-/ready"
- "Sticky sessions not required"
12. Real-world Examples
Example 1: E-commerce Platform
Scenario
Large e-commerce platform with microservices architecture, multiple data centers, and 24/7 operations.
graph TD
A[E-commerce Platform] --> B[Frontend Services]
A --> C[Backend APIs]
A --> D[Databases]
A --> E[Payment Systems]
A --> F[Inventory Management]
B --> B1[Web App]
B --> B2[Mobile API]
B --> B3[CDN]
C --> C1[User Service]
C --> C2[Product Service]
C --> C3[Order Service]
D --> D1[PostgreSQL]
D --> D2[Redis Cache]
D --> D3[Elasticsearch]
E --> E1[Payment Gateway]
E --> E2[Fraud Detection]
F --> F1[Warehouse System]
F --> F2[Stock Management]
Alertmanager Configuration
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alerts@ecommerce.com'
resolve_timeout: 5m
route:
group_by: ['alertname', 'environment', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default'
routes:
# Critical business impact alerts
- match:
severity: critical
business_impact: high
receiver: 'critical-business'
group_wait: 0s
repeat_interval: 5m
# Payment system alerts
- match:
service: payment
receiver: 'payment-team'
group_by: ['alertname', 'payment_provider']
# Database alerts
- match_re:
service: (postgres|redis|elasticsearch)
receiver: 'database-team'
group_by: ['alertname', 'database_cluster']
# Frontend alerts
- match_re:
service: (web-app|mobile-api|cdn)
receiver: 'frontend-team'
# Infrastructure alerts
- match:
team: infrastructure
receiver: 'infrastructure-team'
group_by: ['alertname', 'datacenter']
receivers:
- name: 'default'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#alerts-general'
- name: 'critical-business'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_CRITICAL_KEY" }}'
description: 'CRITICAL: {{ .GroupLabels.alertname }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#critical-alerts'
color: 'danger'
title: '🚨 CRITICAL BUSINESS IMPACT'
email_configs:
- to: 'executives@ecommerce.com'
subject: 'CRITICAL: Business Impact Alert'
- name: 'payment-team'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_PAYMENT_KEY" }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#payment-alerts'
- name: 'database-team'
email_configs:
- to: 'dba-team@ecommerce.com'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#database-alerts'
- name: 'frontend-team'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#frontend-alerts'
- name: 'infrastructure-team'
email_configs:
- to: 'infrastructure@ecommerce.com'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#infrastructure-alerts'
inhibit_rules:
# Inhibit service alerts when entire datacenter is down
- source_match:
alertname: 'DatacenterDown'
target_match_re:
alertname: '(ServiceDown|HighLatency|DatabaseDown)'
equal: ['datacenter']
# Inhibit warning alerts when critical alerts are firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['service', 'instance']
# Inhibit payment alerts during maintenance
- source_match:
alertname: 'PaymentMaintenanceMode'
target_match_re:
service: 'payment'
equal: ['environment']
Alert Rules
# Business critical alerts
groups:
- name: business_critical
rules:
- alert: OrderProcessingDown
expr: |
(
rate(http_requests_total{service="order-service",status=~"5.."}[5m])
/
rate(http_requests_total{service="order-service"}[5m])
) > 0.1
for: 2m
labels:
severity: critical
business_impact: high
service: order
team: backend
annotations:
summary: "Order processing service experiencing high error rate"
description: "{{ $value | humanizePercentage }} of order requests failing"
- alert: PaymentGatewayDown
expr: probe_success{job="payment-gateway"} == 0
for: 1m
labels:
severity: critical
business_impact: high
service: payment
team: payment
annotations:
summary: "Payment gateway is unreachable"
description: "Primary payment gateway has been down for 1 minute"
- alert: InventoryServiceDown
expr: up{job="inventory-service"} == 0
for: 3m
labels:
severity: critical
business_impact: high
service: inventory
team: backend
annotations:
summary: "Inventory service is down"
description: "Inventory service unavailable - affecting product availability"
# Performance alerts
- name: performance
rules:
- alert: HighCheckoutLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
) > 2
for: 5m
labels:
severity: warning
service: checkout
team: frontend
annotations:
summary: "High checkout latency detected"
description: "95th percentile checkout time is {{ $value | humanizeDuration }}"
- alert: DatabaseConnectionPoolHigh
expr: |
(
postgres_connections_active
/
postgres_connections_max
) > 0.8
for: 10m
labels:
severity: warning
service: postgres
team: database
annotations:
summary: "Database connection pool utilization high"
description: "{{ $labels.database }} connection pool at {{ $value | humanizePercentage }}"
Example 2: SaaS Application
Scenario
Multi-tenant SaaS application with global customer base, requiring tenant-specific alerting.
graph TD
A[SaaS Platform] --> B[API Gateway]
A --> C[Tenant Services]
A --> D[Shared Services]
A --> E[Data Layer]
B --> B1[Authentication]
B --> B2[Rate Limiting]
B --> B3[Load Balancing]
C --> C1[Tenant A Services]
C --> C2[Tenant B Services]
C --> C3[Tenant C Services]
D --> D1[Notification Service]
D --> D2[Billing Service]
D --> D3[Analytics Service]
E --> E1[Tenant Databases]
E --> E2[Shared Cache]
E --> E3[Message Queue]
Multi-Tenant Alerting Configuration
global:
smtp_smarthost: 'smtp.saas-company.com:587'
smtp_from: 'platform-alerts@saas-company.com'
route:
group_by: ['alertname', 'tenant', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 2h
receiver: 'default'
routes:
# Enterprise customer alerts (immediate escalation)
- match:
customer_tier: enterprise
severity: critical
receiver: 'enterprise-critical'
group_wait: 0s
repeat_interval: 15m
# Tenant-specific routing
- match:
tenant: tenant-a
receiver: 'tenant-a-alerts'
- match:
tenant: tenant-b
receiver: 'tenant-b-alerts'
# Platform-wide issues
- match:
alert_type: platform
receiver: 'platform-team'
group_by: ['alertname', 'region']
# Customer-facing service alerts
- match_re:
service: (api-gateway|auth-service|billing)
receiver: 'customer-facing-team'
receivers:
- name: 'default'
webhook_configs:
- url: 'http://alert-router:8080/webhook'
- name: 'enterprise-critical'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_ENTERPRISE_KEY" }}'
description: 'ENTERPRISE CRITICAL: {{ .GroupLabels.alertname }}'
details:
tenant: '{{ .GroupLabels.tenant }}'
customer_tier: '{{ .GroupLabels.customer_tier }}'
email_configs:
- to: 'enterprise-support@saas-company.com'
cc: 'customer-success@saas-company.com'
subject: 'CRITICAL: Enterprise Customer Impact - {{ .GroupLabels.tenant }}'
- name: 'tenant-a-alerts'
webhook_configs:
- url: 'http://tenant-notification-service:8080/notify'
http_config:
basic_auth:
username: 'tenant-a'
password: '{{ env "TENANT_A_PASSWORD" }}'
send_resolved: true
- name: 'platform-team'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#platform-alerts'
title: 'Platform Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Region:* {{ .Labels.region }}
*Affected Tenants:* {{ .Labels.affected_tenants }}
*Impact:* {{ .Annotations.impact }}
{{ end }}
inhibit_rules:
# Inhibit tenant-specific alerts during platform outage
- source_match:
alert_type: platform
severity: critical
target_match:
alert_type: tenant
equal: ['region']
# Inhibit individual service alerts during API gateway issues
- source_match:
alertname: 'APIGatewayDown'
target_match_re:
service: '(auth-service|billing-service|notification-service)'
equal: ['region']
Example 3: Financial Services
Scenario
Financial services company with strict compliance requirements, multiple environments, and complex approval workflows.
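Note that the inhibit_rules at the end of this configuration reference synthetic alerts named MarketClosed and BusinessHoursActive, which only work if something keeps those alerts firing during the relevant windows. One common approach is a pair of clock-based Prometheus rules along these lines (the UTC hours, weekday boundaries, and the trading_venue value are placeholders, not real market hours):
groups:
  - name: time-windows
    rules:
      - alert: MarketClosed
        # fires outside 13:00-20:00 UTC and on weekends (placeholder schedule);
        # one copy per venue so the equal: ['trading_venue'] inhibition matches
        expr: (hour() < 13 or hour() >= 20) or (day_of_week() == 0 or day_of_week() == 6)
        labels:
          trading_venue: nyse
        annotations:
          summary: "Market closed - inhibiting non-critical trading alerts"
      - alert: BusinessHoursActive
        # fires 08:00-18:00 UTC, Monday-Friday (placeholder business hours)
        expr: hour() >= 8 and hour() < 18 and day_of_week() > 0 and day_of_week() < 6
        annotations:
          summary: "Business hours window used to inhibit development info alerts"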
graph TD
A[Financial Services] --> B[Trading Platform]
A --> C[Risk Management]
A --> D[Compliance Systems]
A --> E[Customer Portal]
B --> B1[Order Management]
B --> B2[Market Data]
B --> B3[Settlement]
C --> C1[Real-time Risk]
C --> C2[Credit Monitoring]
C --> C3[Fraud Detection]
D --> D1[Audit Logging]
D --> D2[Regulatory Reporting]
D --> D3[Data Retention]
E --> E1[Account Management]
E --> E2[Portfolio View]
E --> E3[Transaction History]
Compliance-Focused Configuration
global:
smtp_smarthost: 'mail.financial-company.com:587'
smtp_from: 'compliance-alerts@financial-company.com'
resolve_timeout: 10m
route:
group_by: ['alertname', 'compliance_level', 'environment']
group_wait: 60s # Longer wait for compliance review
group_interval: 10m
repeat_interval: 6h
receiver: 'default-compliance'
routes:
# Regulatory compliance alerts (highest priority)
- match:
compliance_level: regulatory
receiver: 'regulatory-compliance'
group_wait: 0s
repeat_interval: 30m
# Trading system alerts
- match:
system: trading
receiver: 'trading-team'
group_by: ['alertname', 'trading_venue']
# Risk management alerts
- match:
system: risk
receiver: 'risk-management'
group_by: ['alertname', 'risk_type']
# Production environment (requires immediate attention)
- match:
environment: production
severity: critical
receiver: 'production-critical'
group_wait: 30s
# Development/staging (business hours only)
- match_re:
environment: (development|staging)
receiver: 'development-team'
group_interval: 1h
repeat_interval: 24h
receivers:
- name: 'default-compliance'
email_configs:
- to: 'compliance-team@financial-company.com'
headers:
  Subject: '[COMPLIANCE] {{ .GroupLabels.alertname }}'
  X-Priority: 'High'
  X-Compliance-Level: '{{ .GroupLabels.compliance_level }}'
- name: 'regulatory-compliance'
email_configs:
- to: 'compliance-officer@financial-company.com'
  headers:
    CC: 'legal-team@financial-company.com'
    Subject: '[REGULATORY] IMMEDIATE ATTENTION REQUIRED'
  text: |
REGULATORY COMPLIANCE ALERT
This alert requires immediate attention and may need to be reported to regulators.
{{ range .Alerts }}
Alert: {{ .Labels.alertname }}
System: {{ .Labels.system }}
Compliance Type: {{ .Labels.compliance_type }}
Regulatory Impact: {{ .Annotations.regulatory_impact }}
Required Actions: {{ .Annotations.required_actions }}
{{ end }}
webhook_configs:
- url: 'https://compliance-system.financial-company.com/api/alerts'
http_config:
bearer_token: '{{ env "COMPLIANCE_SYSTEM_TOKEN" }}'
send_resolved: true
- name: 'trading-team'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_TRADING_KEY" }}'
description: 'Trading System Alert: {{ .GroupLabels.alertname }}'
details:
trading_venue: '{{ .GroupLabels.trading_venue }}'
market_impact: '{{ .GroupLabels.market_impact }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#trading-alerts'
- name: 'risk-management'
email_configs:
- to: 'risk-team@financial-company.com'
webhook_configs:
- url: 'https://risk-system.financial-company.com/api/notifications'
inhibit_rules:
# During market close, inhibit non-critical trading alerts
- source_match:
alertname: 'MarketClosed'
target_match:
system: trading
severity: warning
equal: ['trading_venue']
# Inhibit development alerts during business hours
- source_match:
alertname: 'BusinessHoursActive'
target_match:
environment: development
severity: info
Example 4: Gaming Platform
Scenario
Online gaming platform with real-time multiplayer games, user-generated content, and global infrastructure.
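The gaming configuration below groups and routes on game_title, region, game_mode, and an impact label. In a multi-region deployment, game_title and region are often injected once per Prometheus shard via external_labels rather than repeated on every rule; a rough sketch of both pieces (the file layout, metric name, and label values are illustrative assumptions):
# prometheus.yml on the na-east shard (illustrative)
global:
  external_labels:
    region: na-east
    game_title: space-arena

# rules/matchmaking.yml (matchmaking_queue_seconds_bucket is a hypothetical metric)
groups:
  - name: player-impact
    rules:
      - alert: MatchmakingQueueTimeHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(matchmaking_queue_seconds_bucket[5m])) by (le, game_mode)
          ) > 120
        for: 2m
        labels:
          severity: critical
          impact: player_facing
          service: matchmaking
        annotations:
          summary: "95th percentile matchmaking queue time above 2 minutes"
          player_impact: "Players are waiting too long to find a match"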
graph TD
A[Gaming Platform] --> B[Game Servers]
A --> C[User Services]
A --> D[Content Systems]
A --> E[Analytics]
B --> B1[Matchmaking]
B --> B2[Game Logic]
B --> B3[Real-time Communication]
C --> C1[Authentication]
C --> C2[Player Profiles]
C --> C3[Friends & Social]
D --> D1[Asset Storage]
D --> D2[Content Delivery]
D --> D3[User Generated Content]
E --> E1[Player Analytics]
E --> E2[Game Metrics]
E --> E3[Business Intelligence]
Gaming-Specific Alerting
global:
smtp_smarthost: 'smtp.gaming-company.com:587'
smtp_from: 'game-ops@gaming-company.com'
route:
group_by: ['alertname', 'game_title', 'region']
group_wait: 15s # Fast response for gaming
group_interval: 2m
repeat_interval: 1h
receiver: 'default-gaming'
routes:
# Player-affecting issues (highest priority)
- match:
impact: player_facing
severity: critical
receiver: 'player-impact-critical'
group_wait: 0s
repeat_interval: 10m
# Live events (tournaments, etc.)
- match:
event_type: live_event
receiver: 'live-events-team'
group_wait: 5s
# Matchmaking issues
- match:
service: matchmaking
receiver: 'matchmaking-team'
group_by: ['alertname', 'game_mode', 'region']
# Content delivery issues
- match_re:
service: (cdn|asset-storage|content-delivery)
receiver: 'content-team'
# Regional routing
- match:
region: na-east
receiver: 'na-ops-team'
- match:
region: eu-west
receiver: 'eu-ops-team'
- match:
region: asia-pacific
receiver: 'apac-ops-team'
receivers:
- name: 'default-gaming'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#game-ops'
- name: 'player-impact-critical'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_PLAYER_IMPACT_KEY" }}'
description: 'PLAYER IMPACT: {{ .GroupLabels.alertname }}'
details:
game_title: '{{ .GroupLabels.game_title }}'
affected_players: '{{ .GroupLabels.affected_players }}'
revenue_impact: '{{ .GroupLabels.revenue_impact }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#critical-player-issues'
color: 'danger'
title: '🎮 CRITICAL PLAYER IMPACT'
text: |
**Game:** {{ .GroupLabels.game_title }}
**Region:** {{ .GroupLabels.region }}
**Affected Players:** {{ .GroupLabels.affected_players }}
{{ range .Alerts }}
**Issue:** {{ .Annotations.summary }}
**Player Impact:** {{ .Annotations.player_impact }}
{{ end }}
- name: 'live-events-team'
pagerduty_configs:
- routing_key: '{{ env "PAGERDUTY_LIVE_EVENTS_KEY" }}'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#live-events'
email_configs:
- to: 'esports-team@gaming-company.com'
- name: 'matchmaking-team'
slack_configs:
- api_url: '{{ env "SLACK_WEBHOOK_URL" }}'
channel: '#matchmaking-alerts'
text: |
**Matchmaking Issue Detected**
{{ range .Alerts }}
**Game:** {{ .Labels.game_title }}
**Mode:** {{ .Labels.game_mode }}
**Region:** {{ .Labels.region }}
**Queue Time:** {{ .Labels.avg_queue_time }}
**Issue:** {{ .Annotations.summary }}
{{ end }}
inhibit_rules:
# During scheduled maintenance, inhibit game server alerts
- source_match:
alertname: 'ScheduledMaintenance'
target_match_re:
service: (game-server|matchmaking|player-data)
equal: ['game_title', 'region']
# Inhibit individual server alerts during region-wide issues
- source_match:
alertname: 'RegionNetworkIssue'
target_match_re:
alertname: '(ServerDown|HighLatency|ConnectionIssues)'
equal: ['region']
This comprehensive book covers Alertmanager from basic concepts to expert-level configurations with real-world examples. Each section builds on the previous ones, providing both theoretical understanding and practical implementation guidance, with Mermaid diagrams used throughout to visualize the more complex concepts.