Table of Contents
- Introduction to Observability
- Getting Started with Prometheus
- Metrics and Data Collection
- PromQL: Querying and Analyzing Data
- Alerting and Notifications
- Visualization
- Prometheus in Kubernetes
- Scaling and Performance
- Best Practices and Pitfalls
- Advanced Topics
- Capstone Project
- Appendices
1. Introduction to Observability
What is Observability?
Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike traditional monitoring, which tells you when something breaks, observability helps you understand why it broke and how to fix it.
Monitoring vs. Observability
| Monitoring | Observability |
|---|---|
| Known unknowns | Unknown unknowns |
| Predefined dashboards | Ad-hoc queries |
| Health checks | Deep insights |
| Reactive | Proactive |
Monitoring answers: “Is the system up?” Observability answers: “Why is the system behaving this way?”
The Three Pillars of Observability
graph TB
A[Observability] --> B[Metrics]
A --> C[Logs]
A --> D[Traces]
B --> B1[Numerical data over time]
B --> B2[System performance indicators]
C --> C1[Discrete events with context]
C --> C2[Application and system logs]
D --> D1[Request flows across services]
D --> D2[Performance bottleneck identification]
1. Metrics
- Definition: Numerical measurements captured over time
- Examples: CPU usage, memory consumption, request rate, error rate
- Best for: Dashboards, alerting, trend analysis
2. Logs
- Definition: Discrete events with timestamps and context
- Examples: Application errors, access logs, audit trails
- Best for: Debugging, forensic analysis, compliance
3. Traces
- Definition: Records of requests as they flow through distributed systems
- Examples: Microservice call chains, database queries, external API calls
- Best for: Performance optimization, dependency mapping
Where Prometheus Fits
Prometheus is primarily a metrics-based monitoring system that excels at:
- Time-series data collection and storage
- Powerful querying language (PromQL)
- Built-in alerting capabilities
- Service discovery integration
- Scalable architecture
Chapter 1 Summary
graph LR
A[Applications] --> B[Prometheus]
C[Infrastructure] --> B
D[Exporters] --> B
B --> E[Alertmanager]
B --> F[Grafana]
B --> G[Remote Storage]
Observability goes beyond traditional monitoring by providing deep insights into system behavior. The three pillars—metrics, logs, and traces—work together to provide comprehensive visibility. Prometheus serves as the foundation for metrics collection and analysis in modern observability stacks.
Hands-on Exercise
- Reflection Exercise: Think about a recent production issue in your environment
- What metrics could have helped detect it earlier?
- What logs would have aided in debugging?
- How would distributed tracing have helped?
- Research Task: Investigate the observability stack used in your organization
- Identify which tools handle metrics, logs, and traces
- Note any gaps in observability coverage
2. Getting Started with Prometheus
History and Background
Prometheus was created at SoundCloud in 2012 by Matt T. Proud and Julius Volz. Inspired by Google’s Borgmon, it became a Cloud Native Computing Foundation (CNCF) project in 2016 and graduated in 2018.
Key Timeline:
- 2012: Created at SoundCloud
- 2015: Open-sourced
- 2016: Joined CNCF
- 2018: CNCF Graduated Project
Prometheus Architecture
graph TB
subgraph "Prometheus Server"
A[Retrieval] --> B[TSDB]
C[PromQL Engine] --> B
D[Web UI] --> C
E[HTTP API] --> C
end
F[Targets] --> A
G[Exporters] --> A
H[Pushgateway] --> A
B --> I[Alertmanager]
D --> J[Grafana]
C --> J
K[Service Discovery] --> A
Core Components
- Prometheus Server
- Scrapes and stores time-series data
- Executes PromQL queries
- Evaluates alerting rules
- Client Libraries
- Instrument applications
- Expose metrics endpoints
- Exporters
- Bridge between Prometheus and third-party systems
- Translate metrics to Prometheus format
- Alertmanager
- Handles alerts from Prometheus
- Manages routing, grouping, and silencing
- Pushgateway
- Allows ephemeral jobs to push metrics
- Used for batch jobs and short-lived processes (see the push sketch below)
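To make the Pushgateway's role concrete, here is a minimal sketch of a short-lived batch job pushing a completion timestamp. The Pushgateway address, job name, and metric name are illustrative assumptions; the official prometheus_client Python package is required.

```python
# push_batch_metric.py - sketch of a short-lived job pushing to a Pushgateway.
# Assumes a Pushgateway is reachable at localhost:9091 and that
# prometheus_client is installed (pip install prometheus-client).
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def run_batch_job():
    # ... the actual batch work would happen here ...
    pass

if __name__ == '__main__':
    run_batch_job()

    # Use a dedicated registry so only this job's metrics are pushed.
    registry = CollectorRegistry()
    last_success = Gauge(
        'batch_job_last_success_timestamp_seconds',
        'Unix timestamp of the last successful batch run',
        registry=registry,
    )
    last_success.set_to_current_time()

    # Prometheus then scrapes the Pushgateway on its normal schedule.
    push_to_gateway('localhost:9091', job='nightly-batch', registry=registry)
```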
Installation Methods
Method 1: Binary Installation (Windows)
- Download Prometheus:
# Create directory
New-Item -ItemType Directory -Path C:\prometheus
# Download latest release
$url = "https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.windows-amd64.zip"
Invoke-WebRequest -Uri $url -OutFile C:\prometheus\prometheus.zip
# Extract
Expand-Archive -Path C:\prometheus\prometheus.zip -DestinationPath C:\prometheus
- Create basic configuration:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
- Run Prometheus:
cd C:\prometheus\prometheus-2.47.0.windows-amd64
.\prometheus.exe --config.file=prometheus.yml --storage.tsdb.path=data\
Method 2: Docker Installation
- Create configuration directory:
mkdir prometheus-data
- Create prometheus.yml:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['host.docker.internal:9100']
- Run with Docker:
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v ${PWD}/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.console.templates=/etc/prometheus/consoles \
  --web.enable-lifecycle
Method 3: Docker Compose
# docker-compose.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
prometheus_data:
Configuration Basics
Understanding prometheus.yml
# Global configuration
global:
scrape_interval: 15s # How often to scrape targets
evaluation_interval: 15s # How often to evaluate rules
external_labels: # Labels attached when data leaves this server (federation, remote write, alerting)
cluster: 'production'
region: 'us-west-2'
# Rule files for recording and alerting rules
rule_files:
- "alert_rules.yml"
- "recording_rules.yml"
# Scrape configuration
scrape_configs:
# Self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
scrape_interval: 5s # Override global interval
metrics_path: /metrics # Default metrics endpoint
# Application monitoring
- job_name: 'my-app'
static_configs:
- targets: ['app1:8080', 'app2:8080']
scrape_timeout: 10s
honor_labels: true
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Remote write configuration (optional)
remote_write:
- url: "https://remote-storage-endpoint/write"
headers:
Authorization: "Bearer token"
Key Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| scrape_interval | How often to collect metrics | 1m |
| scrape_timeout | Maximum time for a scrape request | 10s |
| evaluation_interval | Rule evaluation frequency | 1m |
| metrics_path | HTTP path for metrics | /metrics |
| scheme | Protocol (http/https) | http |
Verifying Installation
- Access the Prometheus Web UI:
  - Open a browser to http://localhost:9090
  - Check Status → Targets to see configured endpoints
- Test a basic query: run up in the expression browser; it should return 1 for every healthy target.
- Check the metrics endpoint:
curl http://localhost:9090/metrics
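Building on these manual checks, the sketch below scripts the same verification against Prometheus's HTTP API. It assumes the default localhost:9090 address used throughout this chapter and that the requests package is installed.

```python
# check_prometheus.py - sketch of a scripted installation check via the HTTP API.
# Assumes Prometheus listens on localhost:9090 and `requests` is installed.
import sys
import requests

PROMETHEUS_URL = "http://localhost:9090"

def check_up_query() -> bool:
    """Run the `up` query; every healthy target should report the value 1."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": "up"}, timeout=5)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    for series in results:
        instance = series["metric"].get("instance", "unknown")
        print(f"{instance}: up={series['value'][1]}")
    return all(series["value"][1] == "1" for series in results)

def check_targets() -> bool:
    """List scrape targets and their health, mirroring Status -> Targets in the UI."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/targets", timeout=5)
    resp.raise_for_status()
    targets = resp.json()["data"]["activeTargets"]
    for target in targets:
        print(f"{target['scrapeUrl']}: {target['health']}")
    return all(target["health"] == "up" for target in targets)

if __name__ == "__main__":
    ok = check_up_query() and check_targets()
    sys.exit(0 if ok else 1)
```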
Chapter 2 Summary
Prometheus follows a pull-based architecture where the server scrapes metrics from configured targets. The system consists of the main server, client libraries, exporters, and supporting components like Alertmanager. Installation can be done via binaries, Docker, or Kubernetes, with configuration managed through the prometheus.yml file.
Hands-on Exercise
- Basic Setup:
- Install Prometheus using your preferred method
- Configure it to monitor itself
- Access the web UI and explore the interface
- Configuration Practice:
- Modify the scrape interval to 30 seconds
- Add a new job that targets a non-existent endpoint
- Observe the target status and understand failure states
- Metrics Exploration:
- Use the web UI to explore available metrics
- Try simple queries like prometheus_tsdb_samples_total
- Understand the differences between the metric types you see
3. Metrics and Data Collection
Types of Metrics
Prometheus supports four fundamental metric types, each serving different purposes:
1. Counter
A cumulative metric that only increases (or resets to zero on restart).
Use cases: Request counts, error counts, tasks completed
Examples: http_requests_total, errors_total
// Go example
var requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
2. Gauge
A metric that can go up and down.
Use cases: Memory usage, CPU usage, queue size, temperature
Examples: memory_usage_bytes, cpu_usage_percent
// Go example
var memoryUsage = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "memory_usage_bytes",
Help: "Current memory usage in bytes",
},
)
3. Histogram
Samples observations and counts them in configurable buckets.
Use cases: Request durations, response sizes, latency distribution
Features: Provides _count, _sum, and _bucket metrics
// Go example
var requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: prometheus.DefBuckets, // or custom: []float64{.1, .25, .5, 1, 2.5, 5, 10}
},
[]string{"method", "endpoint"},
)
4. Summary
Similar to histogram but calculates configurable quantiles.
Use cases: Request durations when you need specific percentiles
Features: Provides _count, _sum, and quantile metrics
// Go example
var requestDuration = prometheus.NewSummaryVec(
prometheus.SummaryOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
},
[]string{"method", "endpoint"},
)
Exposing Metrics with Client Libraries
Go Application Example
// main.go
package main
import (
"fmt"
"log"
"math/rand"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"method", "endpoint"},
)
activeConnections = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "Number of active connections",
},
)
)
func init() {
// Register metrics with Prometheus
prometheus.MustRegister(requestsTotal)
prometheus.MustRegister(requestDuration)
prometheus.MustRegister(activeConnections)
}
func metricsMiddleware(next http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Increment active connections
activeConnections.Inc()
defer activeConnections.Dec()
// Call the next handler
next(w, r)
// Record metrics
duration := time.Since(start).Seconds()
requestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
}
}
func helloHandler(w http.ResponseWriter, r *http.Request) {
// Simulate some work
time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
fmt.Fprintf(w, "Hello, World!")
}
func main() {
// Application routes
http.HandleFunc("/hello", metricsMiddleware(helloHandler))
// Metrics endpoint
http.Handle("/metrics", promhttp.Handler())
log.Println("Server starting on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}
Python Application Example
# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time
import random
app = Flask(__name__)
# Define metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'endpoint'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
ACTIVE_CONNECTIONS = Gauge(
'active_connections',
'Number of active connections'
)
def track_metrics(f):
def wrapper(*args, **kwargs):
start_time = time.time()
ACTIVE_CONNECTIONS.inc()
try:
result = f(*args, **kwargs)
status = '200'
return result
except Exception as e:
status = '500'
raise
finally:
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.endpoint or 'unknown',
status=status
).inc()
REQUEST_DURATION.labels(
method=request.method,
endpoint=request.endpoint or 'unknown'
).observe(time.time() - start_time)
ACTIVE_CONNECTIONS.dec()
wrapper.__name__ = f.__name__
return wrapper
@app.route('/hello')
@track_metrics
def hello():
# Simulate work
time.sleep(random.uniform(0.01, 0.1))
return "Hello, World!"
@app.route('/metrics')
def metrics():
return generate_latest()
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
Node.js Application Example
// app.js
const express = require('express');
const promClient = require('prom-client');
const app = express();
const port = 8080;
// Create metrics
const requestCounter = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'endpoint', 'status']
});
const requestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'endpoint'],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
const activeConnections = new promClient.Gauge({
name: 'active_connections',
help: 'Number of active connections'
});
// Middleware to track metrics
function metricsMiddleware(req, res, next) {
const start = Date.now();
activeConnections.inc();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
requestCounter.labels(req.method, req.path, res.statusCode).inc();
requestDuration.labels(req.method, req.path).observe(duration);
activeConnections.dec();
});
next();
}
app.use(metricsMiddleware);
app.get('/hello', (req, res) => {
// Simulate work
setTimeout(() => {
res.send('Hello, World!');
}, Math.random() * 100);
});
app.get('/metrics', async (req, res) => {
res.set('Content-Type', promClient.register.contentType);
// register.metrics() returns a Promise in current prom-client versions
res.end(await promClient.register.metrics());
});
app.listen(port, () => {
console.log(`Server running on port ${port}`);
});
Exporters
Exporters are components that fetch statistics from third-party systems and export them as Prometheus metrics.
Node Exporter (System Metrics)
# docker-compose.yml addition
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
restart: unless-stopped
Key metrics from node-exporter:
- node_cpu_seconds_total: CPU usage
- node_memory_MemTotal_bytes: Total memory
- node_filesystem_size_bytes: Filesystem size
- node_network_receive_bytes_total: Network received bytes
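As a quick illustration of how these metrics are consumed, the following sketch derives a memory-usage percentage from node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes through the Prometheus HTTP API. The Prometheus address, the 90% threshold, and the requests dependency are assumptions.

```python
# memory_usage_check.py - sketch: derive memory usage from node-exporter metrics.
# Assumes Prometheus at localhost:9090 is already scraping node-exporter.
import requests

PROMETHEUS_URL = "http://localhost:9090"

# The same expression is used later in the PromQL chapter for memory usage percentage.
QUERY = "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"

def memory_usage_by_instance() -> dict:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": QUERY},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Each series carries an instance label and a [timestamp, value] pair.
    return {s["metric"]["instance"]: float(s["value"][1]) for s in result}

if __name__ == "__main__":
    for instance, usage in memory_usage_by_instance().items():
        flag = "HIGH" if usage > 90 else "ok"
        print(f"{instance}: {usage:.1f}% memory used [{flag}]")
```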
Blackbox Exporter (External Monitoring)
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: []
method: GET
follow_redirects: true
preferred_ip_protocol: "ip4"
http_post_2xx:
prober: http
timeout: 5s
http:
method: POST
headers:
Content-Type: application/json
body: '{"test": "data"}'
tcp_connect:
prober: tcp
timeout: 5s
dns:
prober: dns
timeout: 5s
dns:
query_name: "example.com"
query_type: "A"
# prometheus.yml addition
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://google.com
- https://github.com
- https://stackoverflow.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
Custom Exporter Example
# custom_exporter.py
from prometheus_client import start_http_server, Gauge, Counter
import time
import psutil
import requests
# Define custom metrics
CUSTOM_CPU_USAGE = Gauge('custom_cpu_usage_percent', 'Custom CPU usage percentage')
CUSTOM_DISK_USAGE = Gauge('custom_disk_usage_percent', 'Custom disk usage percentage', ['device'])
API_CALLS_TOTAL = Counter('api_calls_total', 'Total API calls made', ['endpoint'])
def collect_system_metrics():
"""Collect custom system metrics"""
# CPU usage
cpu_percent = psutil.cpu_percent(interval=1)
CUSTOM_CPU_USAGE.set(cpu_percent)
# Disk usage
for partition in psutil.disk_partitions():
try:
partition_usage = psutil.disk_usage(partition.mountpoint)
usage_percent = (partition_usage.used / partition_usage.total) * 100
CUSTOM_DISK_USAGE.labels(device=partition.device).set(usage_percent)
except PermissionError:
continue
def call_external_api():
"""Simulate calling external APIs and track calls"""
endpoints = ['/users', '/orders', '/products']
for endpoint in endpoints:
try:
# Simulate API call
response = requests.get(f'https://jsonplaceholder.typicode.com{endpoint}', timeout=5)
API_CALLS_TOTAL.labels(endpoint=endpoint).inc()
except requests.RequestException:
pass
if __name__ == '__main__':
# Start metrics server
start_http_server(8000)
print("Custom exporter started on port 8000")
while True:
collect_system_metrics()
call_external_api()
time.sleep(30)
Service Discovery and Relabeling
File-based Service Discovery
# prometheus.yml
scrape_configs:
- job_name: 'file-discovery'
file_sd_configs:
- files:
- 'targets/*.json'
refresh_interval: 30s
# targets/web-servers.json
[
{
"targets": ["web1:8080", "web2:8080", "web3:8080"],
"labels": {
"job": "web-servers",
"environment": "production",
"region": "us-west-2"
}
},
{
"targets": ["api1:8080", "api2:8080"],
"labels": {
"job": "api-servers",
"environment": "production",
"region": "us-east-1"
}
}
]
Relabeling Configuration
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use custom metrics path if specified
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Use custom port if specified
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Add pod metadata as labels
- source_labels: [__meta_kubernetes_pod_name]
target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_node_name]
target_label: kubernetes_node
Chapter 3 Summary
Prometheus supports four metric types: counters for cumulative values, gauges for current values, histograms for distribution analysis, and summaries for quantile calculations. Client libraries in various languages make it easy to instrument applications, while exporters bridge third-party systems. Service discovery and relabeling provide flexible configuration for dynamic environments.
Hands-on Exercise
- Instrument an Application:
- Choose a simple web application in your preferred language
- Add Prometheus metrics for request count, duration, and active connections
- Test the metrics endpoint
- Deploy Exporters:
- Set up node-exporter to monitor system metrics
- Configure blackbox-exporter to monitor external websites
- Add both to your Prometheus configuration
- Service Discovery:
- Create a file-based service discovery configuration
- Add and remove targets dynamically
- Observe how Prometheus handles target changes
4. PromQL: Querying and Analyzing Data
Introduction to PromQL
Prometheus Query Language (PromQL) is a functional query language that allows you to select and aggregate time-series data. It’s designed to be both powerful and intuitive for operational use cases.
Basic PromQL Concepts
Instant Vectors vs Range Vectors
# Instant vector - single value per time series at query time
up
# Range vector - range of values over time
up[5m]
Selectors and Matchers
# Exact match
http_requests_total{job="prometheus"}
# Regex match
http_requests_total{job=~".*server.*"}
# Negative match
http_requests_total{job!="prometheus"}
# Negative regex match
http_requests_total{job!~".*test.*"}
# Multiple labels
http_requests_total{job="api-server",method="GET",status="200"}
Common Queries for System Metrics
CPU Metrics
# Current CPU usage per core
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU usage by mode
rate(node_cpu_seconds_total[5m]) * 100
# Top 5 instances by CPU usage
topk(5, 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
# CPU usage over 80%
(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
Memory Metrics
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Available memory in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
# Memory usage by instance
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Instances with memory usage > 90%
((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100) > 90
Disk Metrics
# Disk usage percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# Disk usage excluding system filesystems
(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} / node_filesystem_size_bytes)) * 100
# Free disk space in GB
node_filesystem_avail_bytes / 1024 / 1024 / 1024
# Disk I/O rate
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])
Network Metrics
# Network receive rate in MB/s
rate(node_network_receive_bytes_total[5m]) / 1024 / 1024
# Network transmit rate in MB/s
rate(node_network_transmit_bytes_total[5m]) / 1024 / 1024
# Total network traffic
rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])
# Network errors
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])
Advanced PromQL Functions
Rate and Increase
# Rate: per-second average rate over time window
rate(http_requests_total[5m])
# Increase: total increase over time window
increase(http_requests_total[5m])
# irate: instantaneous rate (using last two data points)
irate(http_requests_total[5m])
Histogram Functions
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# 50th percentile (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Request rate
rate(http_request_duration_seconds_count[5m])
Aggregation Functions
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Average across instances
avg(rate(http_requests_total[5m]))
# Maximum value
max(node_memory_MemTotal_bytes)
# Count number of instances
count(up == 1)
# Sum by job
sum by (job) (rate(http_requests_total[5m]))
# Average without specific labels
avg without (instance) (rate(http_requests_total[5m]))
Mathematical Functions
# Absolute value
abs(delta(cpu_temp_celsius[5m]))
# Round to nearest integer
round(rate(http_requests_total[5m]))
# Ceiling and floor
ceil(rate(http_requests_total[5m]))
floor(rate(http_requests_total[5m]))
# Square root
sqrt(rate(http_requests_total[5m]))
# Logarithm
ln(rate(http_requests_total[5m]))
log10(rate(http_requests_total[5m]))
Time Functions
# Current timestamp
time()
# Time since epoch for each sample
timestamp(up)
# Day of week (0=Sunday, 6=Saturday)
day_of_week()
# Hour of day (0-23)
hour()
# Predict linear trend
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
Recording Rules
Recording rules allow you to precompute frequently used expressions and save them as new time series.
# recording_rules.yml
groups:
- name: instance_rules
interval: 30s
rules:
- record: instance:cpu_usage:rate5m
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
labels:
job: node-exporter
- record: instance:memory_usage:percentage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
labels:
job: node-exporter
- record: instance:disk_usage:percentage
expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} / node_filesystem_size_bytes)) * 100
labels:
job: node-exporter
- name: application_rules
interval: 15s
rules:
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
- record: job:http_request_duration:p95
expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
- record: job:http_errors:rate5m
expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
Complex Query Examples
SLI/SLO Calculations
# Error rate (percentage of 5xx responses)
(
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
) * 100
# Availability (percentage of successful requests)
(
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
) * 100
# Latency SLI (percentage of requests under threshold)
(
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) /
sum(rate(http_request_duration_seconds_count[5m]))
) * 100
Resource Utilization Patterns
# Predict when disk will be full (4 hours from now)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
# Instance running out of memory (< 10% available)
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
# High load average (> number of CPUs)
node_load1 > count by (instance) (node_cpu_seconds_total{mode="idle"})
# Network saturation (approaching interface limit)
rate(node_network_transmit_bytes_total[5m]) >
node_network_speed_bytes * 0.8
Application Performance Analysis
# Request rate by endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
# Error rate by endpoint
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m])) /
sum by (endpoint) (rate(http_requests_total[5m]))
# 95th percentile latency by endpoint
histogram_quantile(0.95,
sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Slow endpoints (95th percentile > 1 second)
histogram_quantile(0.95,
sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
) > 1
Alerting Rules
# alert_rules.yml
groups:
- name: infrastructure_alerts
rules:
- alert: HighCPUUsage
expr: instance:cpu_usage:rate5m > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: instance:memory_usage:percentage > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: instance:disk_usage:percentage > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value }}% on {{ $labels.instance }}"
- name: application_alerts
rules:
- alert: HighErrorRate
expr: job:http_errors:rate5m / job:http_requests:rate5m > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
- alert: HighLatency
expr: job:http_request_duration:p95 > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency for {{ $labels.job }}"
description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "{{ $labels.instance }} has been down for more than 1 minute"
Chapter 4 Summary
PromQL is a powerful query language that enables complex analysis of time-series data. Key concepts include instant vs range vectors, label selectors, aggregation functions, and mathematical operations. Recording rules help optimize performance by precomputing common queries, while alerting rules define when notifications should be sent.
Hands-on Exercise
- Basic Queries:
- Write queries to find CPU usage for all instances
- Calculate memory usage percentage
- Find instances with high disk usage
- Advanced Analysis:
- Create queries for error rates and latency percentiles
- Write a query to predict disk space exhaustion
- Build SLI queries for your application
- Rules Configuration:
- Create recording rules for common calculations
- Write alerting rules for infrastructure monitoring
- Test rules using the Prometheus web UI
5. Alerting and Notifications
Alertmanager Architecture
Alertmanager handles alerts sent by Prometheus and other client applications. It provides grouping, inhibition, silencing, and routing to various notification channels.
graph TB
A[Prometheus] --> B[Alertmanager]
C[Other Sources] --> B
subgraph "Alertmanager"
D[Receiver] --> E[Grouping]
E --> F[Throttling]
F --> G[Inhibition]
G --> H[Silencing]
H --> I[Routing]
end
I --> J[Email]
I --> K[Slack]
I --> L[PagerDuty]
I --> M[Webhook]
Installing and Configuring Alertmanager
Docker Installation
# docker-compose.yml addition
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=http://localhost:9093'
restart: unless-stopped
volumes:
alertmanager_data:
Basic Alertmanager Configuration
# alertmanager.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@yourcompany.com'
smtp_auth_username: 'alerts@yourcompany.com'
smtp_auth_password: 'your-app-password'
route:
group_by: ['alertname', 'job']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
routes:
- matchers:
- severity=critical
receiver: 'critical-alerts'
continue: true
- matchers:
- severity=warning
receiver: 'warning-alerts'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://webhook-server:8080/webhook'
- name: 'critical-alerts'
email_configs:
- to: 'oncall@yourcompany.com'
subject: 'CRITICAL: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#critical-alerts'
title: 'Critical Alert'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
{{ end }}
- name: 'warning-alerts'
email_configs:
- to: 'team@yourcompany.com'
subject: 'WARNING: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#warnings'
title: 'Warning Alert'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
{{ end }}
inhibit_rules:
- source_matchers:
- severity=critical
target_matchers:
- severity=warning
equal: ['instance']
Writing Effective Alerts
Alert Quality Guidelines
- Actionable: Every alert should require human action
- Relevant: Alerts should indicate real problems
- Clear: Alert messages should be immediately understandable
- Timely: Alerts should fire before customers notice
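One way to enforce these guidelines mechanically is to lint rule files before deploying them. The sketch below is a minimal example, assuming PyYAML is installed and that your team requires summary, description, and runbook_url annotations plus a for: duration and severity label on every alert; adjust the required set to your own standards.

```python
# lint_alert_rules.py - sketch: check alert rules against the quality guidelines above.
# Assumes rule files follow the standard Prometheus rule format and PyYAML is installed.
import sys
import yaml

REQUIRED_ANNOTATIONS = {"summary", "description", "runbook_url"}

def lint_rule_file(path: str) -> list:
    problems = []
    with open(path) as f:
        doc = yaml.safe_load(f)
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules are not checked here
            missing = REQUIRED_ANNOTATIONS - set(rule.get("annotations", {}))
            if missing:
                problems.append(f"{name}: missing annotations {sorted(missing)}")
            if "for" not in rule:
                problems.append(f"{name}: no 'for' duration; may fire on transient spikes")
            if "severity" not in rule.get("labels", {}):
                problems.append(f"{name}: no severity label for routing")
    return problems

if __name__ == "__main__":
    issues = lint_rule_file(sys.argv[1] if len(sys.argv) > 1 else "infrastructure_alerts.yml")
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```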
Infrastructure Alerting Rules
# infrastructure_alerts.yml
groups:
- name: node_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute"
runbook_url: "https://runbooks.company.com/node-down"
- alert: HighCPUUsage
expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
runbook_url: "https://runbooks.company.com/high-cpu"
- alert: CriticalCPUUsage
expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 95
for: 2m
labels:
severity: critical
annotations:
summary: "Critical CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) * 100 > 90
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
- alert: DiskSpaceCritical
expr: ((node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes) * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Critical disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value | humanizePercentage }} on {{ $labels.instance }} {{ $labels.mountpoint }}"
- alert: DiskWillFillIn4Hours
expr: predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
for: 5m
labels:
severity: warning
annotations:
summary: "Disk will fill in 4 hours on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} will fill in approximately 4 hours"
Application Alerting Rules
# application_alerts.yml
groups:
- name: application_alerts
rules:
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) /
sum(rate(http_requests_total[5m])) by (job)
) * 100 > 5
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency for {{ $labels.job }}"
description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
- alert: LowThroughput
expr: sum(rate(http_requests_total[5m])) by (job) < 10
for: 10m
labels:
severity: warning
annotations:
summary: "Low throughput for {{ $labels.job }}"
description: "Request rate is {{ $value }} req/s for {{ $labels.job }}"
- alert: DatabaseConnectionFailure
expr: increase(db_connections_failed_total[1m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Database connection failures for {{ $labels.job }}"
description: "{{ $value }} database connection failures in the last minute"
Grouping, Inhibition, and Silences
Grouping Configuration
# Group alerts by cluster and alertname
route:
group_by: ['cluster', 'alertname']
group_wait: 30s # Wait for more alerts before sending
group_interval: 5m # How often to send updates for a group
repeat_interval: 12h # How often to resend the same alert
Inhibition Rules
inhibit_rules:
# Don't send warning alerts if critical alerts are firing for the same instance
- source_matchers:
- severity=critical
target_matchers:
- severity=warning
equal: ['instance']
# Don't send individual service alerts if the whole node is down
- source_matchers:
- alertname=NodeDown
target_matchers:
- alertname=~"High.*|.*ServiceDown"
equal: ['instance']
# Don't send disk space warnings if disk is critically full
- source_matchers:
- alertname=DiskSpaceCritical
target_matchers:
- alertname=DiskWillFillIn4Hours
equal: ['instance', 'device']
Managing Silences
# Create a silence via API
curl -X POST http://localhost:9093/api/v1/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "alertname",
"value": "HighCPUUsage"
},
{
"name": "instance",
"value": "server-01:9100"
}
],
"startsAt": "2023-08-21T12:00:00.000Z",
"endsAt": "2023-08-21T14:00:00.000Z",
"createdBy": "maintenance-team",
"comment": "Planned maintenance window"
}'
Integration Examples
Slack Integration
# Slack configuration with rich formatting
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .GroupLabels.alertname }}'
text: |
{{ if eq .Status "firing" }}
*Status:* Firing
*Alerts:* {{ len .Alerts }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ else }}
*Status:* Resolved
All alerts have been resolved.
{{ end }}
actions:
- type: button
text: 'View in Alertmanager'
url: '{{ template "__alertmanagerURL" . }}'
- type: button
text: 'Silence'
url: '{{ template "__alertmanagerURL" . }}/#/silences/new'
PagerDuty Integration
pagerduty_configs:
- routing_key: 'YOUR_INTEGRATION_KEY'
description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
details:
severity: '{{ range .Alerts }}{{ .Labels.severity }}{{ end }}'
instance: '{{ range .Alerts }}{{ .Labels.instance }}{{ end }}'
alertname: '{{ range .Alerts }}{{ .Labels.alertname }}{{ end }}'
links:
- href: '{{ range .Alerts }}{{ .Annotations.runbook_url }}{{ end }}'
text: 'Runbook'
Email Integration
email_configs:
- to: 'team@company.com'
from: 'alertmanager@company.com'
subject: '{{ .Status | toUpper }}: {{ .GroupLabels.alertname }} ({{ len .Alerts }} alerts)'
html: |
<!DOCTYPE html>
<html>
<head>
<style>
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #f2f2f2; }
.critical { background-color: #ffebee; }
.warning { background-color: #fff3e0; }
</style>
</head>
<body>
<h2>Alert {{ .Status | toUpper }}</h2>
<table>
<tr>
<th>Alert</th>
<th>Severity</th>
<th>Instance</th>
<th>Description</th>
</tr>
{{ range .Alerts }}
<tr class="{{ .Labels.severity }}">
<td>{{ .Labels.alertname }}</td>
<td>{{ .Labels.severity }}</td>
<td>{{ .Labels.instance }}</td>
<td>{{ .Annotations.description }}</td>
</tr>
{{ end }}
</table>
</body>
</html>
Custom Webhook Integration
# webhook_server.py
from flask import Flask, request, jsonify
import json
import requests
app = Flask(__name__)
@app.route('/webhook', methods=['POST'])
def webhook():
data = request.get_json()
# Process the alert
status = data.get('status')
alerts = data.get('alerts', [])
for alert in alerts:
labels = alert.get('labels', {})
annotations = alert.get('annotations', {})
# Custom logic based on alert
if labels.get('severity') == 'critical':
send_to_ops_team(alert)
elif 'database' in labels.get('alertname', '').lower():
send_to_dba_team(alert)
# Log to external system
log_alert_to_system(alert)
return jsonify({'status': 'received'})
def send_to_ops_team(alert):
# Send to ticketing system, chat platform, etc.
pass
def send_to_dba_team(alert):
# Send to database team's channel
pass
def log_alert_to_system(alert):
# Log to centralized logging system
pass
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
Testing Alerts
Manual Alert Testing
# Send test alert to Alertmanager
curl -X POST http://localhost:9093/api/v1/alerts \
-H "Content-Type: application/json" \
-d '[
{
"labels": {
"alertname": "TestAlert",
"instance": "test-instance",
"severity": "warning"
},
"annotations": {
"summary": "This is a test alert",
"description": "Testing alert routing and notifications"
},
"startsAt": "2023-08-21T12:00:00.000Z"
}
]'
Alert Testing Framework
# alert_tester.py
import requests
import time
from datetime import datetime, timezone
class AlertTester:
def __init__(self, alertmanager_url, prometheus_url):
self.alertmanager_url = alertmanager_url
self.prometheus_url = prometheus_url
def send_test_alert(self, alertname, labels, annotations):
"""Send a test alert to Alertmanager"""
alert = {
"labels": {
"alertname": alertname,
**labels
},
"annotations": annotations,
"startsAt": datetime.now(timezone.utc).isoformat()
}
response = requests.post(
f"{self.alertmanager_url}/api/v1/alerts",
json=[alert]
)
return response.status_code == 200
def check_alert_rule(self, rule_name):
"""Check if an alert rule is defined in Prometheus"""
response = requests.get(f"{self.prometheus_url}/api/v1/rules")
rules = response.json()
for group in rules['data']['groups']:
for rule in group['rules']:
if rule.get('name') == rule_name:
return True
return False
def test_critical_alert_routing(self):
"""Test that critical alerts go to the right channels"""
return self.send_test_alert(
"TestCriticalAlert",
{"severity": "critical", "instance": "test-server"},
{
"summary": "Test critical alert",
"description": "This should route to critical alerts channel"
}
)
# Usage
tester = AlertTester("http://localhost:9093", "http://localhost:9090")
tester.test_critical_alert_routing()
Chapter 5 Summary
Alertmanager provides sophisticated alert routing, grouping, and notification capabilities. Effective alerting requires clear rules, proper grouping, inhibition to reduce noise, and integration with appropriate notification channels. Testing alerts ensures they work as expected and reach the right people.
Hands-on Exercise
- Alertmanager Setup:
- Install and configure Alertmanager
- Set up basic routing to email or Slack
- Test with manual alerts
- Alert Rules:
- Create alerting rules for your infrastructure
- Set appropriate thresholds and timing
- Add helpful annotations and runbook links
- Advanced Features:
- Configure inhibition rules to reduce noise
- Set up silences for maintenance windows
- Test different notification channels
6. Visualization
Introduction to Grafana
Grafana is the de facto standard for visualizing Prometheus metrics. It provides powerful dashboarding capabilities, alerting integration, and supports multiple data sources beyond Prometheus.
graph TB
A[Prometheus] --> B[Grafana]
C[Users] --> B
B --> D[Dashboards]
B --> E[Alerts]
B --> F[Data Sources]
D --> G[Panels]
D --> H[Variables]
D --> I[Annotations]
G --> J[Time Series]
G --> K[Stats]
G --> L[Tables]
G --> M[Heatmaps]
Installing and Configuring Grafana
Docker Installation
# docker-compose.yml
version: '3.8'
services:
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
- GF_USERS_DEFAULT_THEME=dark
- GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/etc/grafana/provisioning/dashboards/overview.json
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
- ./grafana/dashboards:/var/lib/grafana/dashboards
restart: unless-stopped
volumes:
grafana_data:
Configuration as Code
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
jsonData:
httpMethod: POST
prometheusType: Prometheus
prometheusVersion: 2.40.0
cacheLevel: 'High'
disableMetricsLookup: false
customQueryParameters: ''
incrementalQuerying: false
disableRecordingRules: false
# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards
Dashboard Design Principles
Information Hierarchy
- Overview Level: High-level health and performance indicators
- Service Level: Detailed metrics for specific services
- Component Level: Deep-dive into individual components
- Debug Level: Raw metrics for troubleshooting
Dashboard Layout Best Practices
{
"dashboard": {
"title": "Service Overview",
"panels": [
{
"id": 1,
"title": "Key Metrics (Top Row)",
"type": "stat",
"gridPos": {"h": 6, "w": 24, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Trends (Middle Section)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 6}
},
{
"id": 3,
"title": "Distribution (Right Side)",
"type": "heatmap",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 6}
},
{
"id": 4,
"title": "Details (Bottom)",
"type": "table",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 14}
}
]
}
}
Essential Panel Types
Time Series Panels
{
"id": 1,
"title": "Request Rate",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{service}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "linear",
"barAlignment": 0,
"lineWidth": 2,
"fillOpacity": 10,
"gradientMode": "none",
"spanNulls": false,
"insertNulls": false,
"showPoints": "never",
"pointSize": 5,
"stacking": {
"mode": "none",
"group": "A"
},
"axisPlacement": "auto",
"axisLabel": "",
"scaleDistribution": {
"type": "linear"
},
"hideFrom": {
"legend": false,
"tooltip": false,
"vis": false
},
"thresholdsStyle": {
"mode": "off"
}
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "frontend"
},
"properties": [
{
"id": "color",
"value": {
"mode": "fixed",
"fixedColor": "green"
}
}
]
}
]
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "bottom",
"calcs": ["lastNotNull", "max", "mean"],
"values": true
}
}
}
Stat Panels for Key Metrics
{
"id": 2,
"title": "Service Availability",
"type": "stat",
"targets": [
{
"expr": "avg(up{job=~\".*-service\"})",
"refId": "A",
"format": "time_series",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"min": 0,
"max": 1,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "red",
"value": 0
},
{
"color": "yellow",
"value": 0.95
},
{
"color": "green",
"value": 0.99
}
]
},
"mappings": [],
"custom": {
"hideFrom": {
"legend": false,
"tooltip": false,
"vis": false
}
}
}
},
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"],
"fields": ""
},
"orientation": "auto",
"textMode": "auto",
"colorMode": "background",
"graphMode": "area",
"justifyMode": "auto"
},
"gridPos": {"h": 6, "w": 6, "x": 0, "y": 0}
}
Heatmap for Latency Distribution
{
"id": 3,
"title": "Response Time Distribution",
"type": "heatmap",
"targets": [
{
"expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
"format": "heatmap",
"legendFormat": "{{le}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"custom": {
"hideFrom": {
"legend": false,
"tooltip": false,
"vis": false
},
"scaleDistribution": {
"type": "linear"
}
}
}
},
"options": {
"calculate": false,
"cellGap": 2,
"cellValues": {
"unit": "short"
},
"color": {
"exponent": 0.5,
"fill": "dark-orange",
"mode": "spectrum",
"reverse": false,
"scale": "exponential",
"scheme": "Oranges",
"steps": 64
},
"exemplars": {
"color": "rgba(255,0,255,0.7)"
},
"filterValues": {
"le": 1e-9
},
"legend": {
"show": true
},
"rowsFrame": {
"layout": "auto"
},
"tooltip": {
"show": true,
"yHistogram": false
},
"yAxis": {
"axisPlacement": "left",
"reverse": false,
"unit": "s"
}
}
}
Table for Detailed Breakdown
{
"id": 4,
"title": "Service Status Details",
"type": "table",
"targets": [
{
"expr": "up{job=~\".*-service\"}",
"format": "table",
"instant": true,
"refId": "A"
},
{
"expr": "rate(http_requests_total[5m])",
"format": "table",
"instant": true,
"refId": "B"
},
{
"expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])",
"format": "table",
"instant": true,
"refId": "C"
}
],
"transformations": [
{
"id": "merge",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"__name__": true
},
"indexByName": {
"instance": 0,
"job": 1,
"Value #A": 2,
"Value #B": 3,
"Value #C": 4
},
"renameByName": {
"Value #A": "Status",
"Value #B": "Request Rate",
"Value #C": "Error Rate",
"instance": "Instance",
"job": "Service"
}
}
}
],
"fieldConfig": {
"defaults": {
"custom": {
"align": "auto",
"displayMode": "auto",
"inspect": false
},
"mappings": [
{
"options": {
"0": {
"color": "red",
"index": 0,
"text": "DOWN"
},
"1": {
"color": "green",
"index": 1,
"text": "UP"
}
},
"type": "value"
}
],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Error Rate"
},
"properties": [
{
"id": "unit",
"value": "percentunit"
},
{
"id": "custom.displayMode",
"value": "color-background"
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.01
},
{
"color": "red",
"value": 0.05
}
]
}
}
]
}
]
}
}
Dashboard Templates and Variables
Template Variables
{
"templating": {
"list": [
{
"name": "environment",
"type": "query",
"query": "label_values(up, environment)",
"current": {
"selected": true,
"text": "production",
"value": "production"
},
"options": [],
"refresh": 1,
"regex": "",
"sort": 1,
"multi": false,
"includeAll": false,
"allValue": null
},
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total{environment=\"$environment\"}, service)",
"current": {
"selected": false,
"text": "All",
"value": "$__all"
},
"options": [],
"refresh": 1,
"regex": "",
"sort": 1,
"multi": true,
"includeAll": true,
"allValue": ".*"
},
{
"name": "instance",
"type": "query",
"query": "label_values(up{job=\"$service\"}, instance)",
"current": {
"selected": false,
"text": "All",
"value": "$__all"
},
"options": [],
"refresh": 2,
"regex": "",
"sort": 1,
"multi": true,
"includeAll": true,
"allValue": ".*"
},
{
"name": "interval",
"type": "interval",
"current": {
"selected": false,
"text": "5m",
"value": "5m"
},
"options": [
{
"selected": true,
"text": "1m",
"value": "1m"
},
{
"selected": false,
"text": "5m",
"value": "5m"
},
{
"selected": false,
"text": "15m",
"value": "15m"
},
{
"selected": false,
"text": "1h",
"value": "1h"
}
],
"query": "1m,5m,15m,1h,6h,12h,1d,7d,14d,30d",
"refresh": 2,
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
}
}
Using Variables in Queries
# Using service variable
sum(rate(http_requests_total{service=~"$service"}[$interval])) by (service)
# Using environment and instance variables
up{environment="$environment",instance=~"$instance"}
# Advanced variable usage: regex matchers combined with the interval variable
rate(http_requests_total{service=~"$service",instance=~"$instance"}[$interval])
Complete Dashboard Examples
Infrastructure Overview Dashboard
{
"dashboard": {
"id": null,
"title": "Infrastructure Overview",
"description": "High-level infrastructure health and performance metrics",
"tags": ["infrastructure", "overview"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(up{job=\"node-exporter\"}, instance)",
"refresh": 1,
"multi": true,
"includeAll": true,
"current": {
"value": "$__all",
"text": "All"
}
}
]
},
"panels": [
{
"id": 1,
"title": "Node Status",
"type": "stat",
"targets": [
{
"expr": "up{job=\"node-exporter\",instance=~\"$instance\"}",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"mappings": [
{
"options": {
"0": {"color": "red", "text": "DOWN"},
"1": {"color": "green", "text": "UP"}
},
"type": "value"
}
],
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "green", "value": 1}
]
}
}
},
"gridPos": {"h": 4, "w": 24, "x": 0, "y": 0}
},
{
"id": 2,
"title": "CPU Usage",
"type": "timeseries",
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"max": 100,
"min": 0,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
},
{
"id": 3,
"title": "Memory Usage",
"type": "timeseries",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"max": 100,
"min": 0,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
},
{
"id": 4,
"title": "Disk Usage",
"type": "timeseries",
"targets": [
{
"expr": "(1 - (node_filesystem_avail_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"})) * 100",
"legendFormat": "{{instance}}:{{mountpoint}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"max": 100,
"min": 0,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 12}
},
{
"id": 5,
"title": "Network I/O",
"type": "timeseries",
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*\"}[5m])",
"legendFormat": "{{instance}}:{{device}} - Receive"
},
{
"expr": "rate(node_network_transmit_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*\"}[5m])",
"legendFormat": "{{instance}}:{{device}} - Transmit"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps"
}
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 12}
}
]
}
}
Application Performance Dashboard
{
"dashboard": {
"id": null,
"title": "Application Performance",
"description": "Application performance metrics and SLIs",
"tags": ["application", "performance", "sli"],
"timezone": "browser",
"refresh": "30s",
"templating": {
"list": [
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "environment",
"type": "query",
"query": "label_values(http_requests_total, environment)",
"refresh": 1
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service)",
"legendFormat": "{{service}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps"
}
},
"gridPos": {"h": 8, "w": 8, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Error Rate",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"[45]..\",service=~\"$service\",environment=\"$environment\"}[5m])) by (service) / sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service) * 100",
"legendFormat": "{{service}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"max": 100,
"min": 0,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
}
}
},
"gridPos": {"h": 8, "w": 8, "x": 8, "y": 0}
},
{
"id": 3,
"title": "Response Time (95th percentile)",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (service, le))",
"legendFormat": "{{service}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 0.5},
{"color": "red", "value": 1}
]
}
}
},
"gridPos": {"h": 8, "w": 8, "x": 16, "y": 0}
},
{
"id": 4,
"title": "Response Time Heatmap",
"type": "heatmap",
"targets": [
{
"expr": "sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le)",
"format": "heatmap",
"legendFormat": "{{le}}"
}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
},
{
"id": 5,
"title": "Top Endpoints by Request Count",
"type": "table",
"targets": [
{
"expr": "topk(10, sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (endpoint))",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"__name__": true
},
"renameByName": {
"Value": "Requests/sec",
"endpoint": "Endpoint"
}
}
}
],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
}
]
}
}
Advanced Visualization Techniques
Custom Annotations
{
"annotations": {
"list": [
{
"name": "Deployments",
"datasource": "Prometheus",
"enable": true,
"expr": "increase(prometheus_config_last_reload_success_timestamp_seconds[1m]) > 0",
"iconColor": "green",
"titleFormat": "Config Reload",
"textFormat": "Prometheus configuration reloaded"
},
{
"name": "Alerts",
"datasource": "Prometheus",
"enable": true,
"expr": "ALERTS{alertstate=\"firing\"}",
"iconColor": "red",
"titleFormat": "{{alertname}}",
"textFormat": "{{summary}}"
}
]
}
}
Value Mappings and Overrides
{
"fieldConfig": {
"defaults": {
"mappings": [
{
"options": {
"0": {"text": "Healthy", "color": "green"},
"1": {"text": "Warning", "color": "yellow"},
"2": {"text": "Critical", "color": "red"}
},
"type": "value"
},
{
"options": {
"from": 0,
"to": 50,
"result": {"text": "Low", "color": "green"}
},
"type": "range"
}
]
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Critical Services"
},
"properties": [
{
"id": "color",
"value": {"mode": "fixed", "fixedColor": "red"}
},
{
"id": "custom.displayMode",
"value": "color-background"
}
]
}
]
}
}
Dynamic Thresholds
{
"targets": [
{
"expr": "avg(response_time_seconds)",
"refId": "A"
},
{
"expr": "avg(response_time_seconds) + 2 * stddev(response_time_seconds)",
"refId": "B",
"hide": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": "${B}"}
]
}
}
}
}
Dashboard Organization and Management
Folder Structure
Dashboards/
├── Overview/
│ ├── System Overview
│ ├── Application Overview
│ └── Business Metrics
├── Infrastructure/
│ ├── Node Metrics
│ ├── Network Performance
│ └── Storage Performance
├── Applications/
│ ├── Frontend Service
│ ├── Backend Services
│ └── Database Performance
├── Troubleshooting/
│ ├── Error Analysis
│ ├── Performance Deep Dive
│ └── Debug Dashboard
└── Business/
├── User Metrics
├── Revenue Tracking
    └── KPI Dashboard
Dashboard Tags and Search
{
"dashboard": {
"tags": [
"infrastructure",
"monitoring",
"production",
"team:platform",
"level:l1"
],
"title": "Production Infrastructure Overview",
"description": "L1 monitoring dashboard for production infrastructure"
}
}
Dashboard Links and Navigation
{
"links": [
{
"title": "System Overview",
"url": "/d/system-overview/system-overview",
"type": "dashboards",
"icon": "dashboard"
},
{
"title": "Runbook",
"url": "https://runbooks.company.com/infrastructure",
"type": "link",
"targetBlank": true,
"icon": "doc"
},
{
"title": "Alert Manager",
"url": "http://alertmanager:9093",
"type": "link",
"targetBlank": true,
"icon": "bell"
}
]
}
Performance Optimization for Dashboards
Query Optimization
# Inefficient - multiple queries
sum(rate(http_requests_total[5m])) by (service)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
sum(rate(http_requests_total{status=~"4.."}[5m])) by (service)
# Better - single query with grouping
sum(rate(http_requests_total[5m])) by (service, status)
Using Recording Rules for Heavy Queries
# recording_rules.yml
groups:
- name: dashboard_optimization
interval: 30s
rules:
- record: dashboard:request_rate:5m
expr: sum(rate(http_requests_total[5m])) by (service)
- record: dashboard:error_rate:5m
expr: |
sum(rate(http_requests_total{status=~"[45].."}[5m])) by (service) /
sum(rate(http_requests_total[5m])) by (service)
Dashboard Caching Configuration
# grafana.ini
[caching]
enabled = true
[database]
query_cache_enabled = true
query_cache_size = 100MB
query_cache_ttl = 300s
Alerting Integration
Alert Panel Configuration
{
"id": 6,
"title": "Active Alerts",
"type": "alertlist",
"options": {
"showOptions": "current",
"maxItems": 20,
"sortOrder": 1,
"dashboardAlerts": false,
"alertInstanceLabelFilter": "",
"dashboardTitle": "",
"folderId": null,
"tags": []
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16}
}
Conditional Formatting Based on Alerts
{
"fieldConfig": {
"overrides": [
{
"matcher": {
"id": "byFrameRefID",
"options": "Alerts"
},
"properties": [
{
"id": "custom.displayMode",
"value": "color-background"
},
{
"id": "mappings",
"value": [
{
"options": {
"0": {"text": "OK", "color": "green"},
"1": {"text": "ALERT", "color": "red"}
},
"type": "value"
}
]
}
]
}
]
}
}
Export and Import Strategies
Dashboard Export Script
#!/bin/bash
# scripts/export-dashboards.sh
GRAFANA_URL="http://localhost:3000"
GRAFANA_USER="admin"
GRAFANA_PASS="admin123"
# Get all dashboards
curl -u $GRAFANA_USER:$GRAFANA_PASS \
"$GRAFANA_URL/api/search?type=dash-db" | \
jq -r '.[] | .uid' | \
while read uid; do
echo "Exporting dashboard: $uid"
curl -u $GRAFANA_USER:$GRAFANA_PASS \
"$GRAFANA_URL/api/dashboards/uid/$uid" | \
jq '.dashboard' > "dashboards/${uid}.json"
done
Dashboard Import with Provisioning
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: 'infrastructure'
orgId: 1
folder: 'Infrastructure'
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 10
options:
path: /etc/grafana/provisioning/dashboards/infrastructure
- name: 'applications'
orgId: 1
folder: 'Applications'
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 10
options:
path: /etc/grafana/provisioning/dashboards/applications
Chapter 6 Summary
Grafana provides powerful visualization capabilities for Prometheus metrics through various panel types, template variables, and advanced features. Effective dashboard design follows information hierarchy principles, uses appropriate panel types for different data, and optimizes queries for performance. Dashboard organization, alerting integration, and automation through provisioning enable scalable monitoring visualization.
Hands-on Exercise
- Dashboard Creation:
- Create an infrastructure overview dashboard
- Add template variables for dynamic filtering
- Implement different panel types (stat, timeseries, table, heatmap)
- Advanced Features:
- Set up annotations for deployments and alerts
- Configure custom thresholds and value mappings
- Create dashboard links and navigation
- Optimization and Management:
- Optimize queries using recording rules
- Organize dashboards with folders and tags
- Set up dashboard provisioning to automate import and export
7. Prometheus in Kubernetes
Service Discovery in Kubernetes
Kubernetes provides rich metadata that Prometheus can use for automatic service discovery, eliminating the need for manual target configuration.
graph TB
A[Kubernetes API] --> B[Prometheus]
B --> C[Pods]
B --> D[Services]
B --> E[Endpoints]
B --> F[Nodes]
C --> G[App Metrics]
D --> H[Service Metrics]
E --> I[Endpoint Metrics]
F --> J[Node Metrics]
Kubernetes SD Configuration
# prometheus.yml for Kubernetes
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# Scrape Kubernetes API server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Scrape Kubernetes nodes
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape pods with prometheus.io annotations
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use custom path if specified
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Use custom port if specified
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Add Kubernetes metadata as labels
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Scrape services with prometheus.io annotations
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
Using kube-state-metrics
kube-state-metrics generates metrics about Kubernetes object states, providing cluster-level visibility.
Installing kube-state-metrics
# kube-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app: kube-state-metrics
template:
metadata:
labels:
app: kube-state-metrics
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.6.0
ports:
- containerPort: 8080
name: http-metrics
- containerPort: 8081
name: telemetry
readinessProbe:
httpGet:
path: /
port: 8081
initialDelaySeconds: 5
timeoutSeconds: 5
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-state-metrics
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups: [""]
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs: ["list", "watch"]
- apiGroups: ["apps"]
resources:
- statefulsets
- daemonsets
- deployments
- replicasets
verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: kube-system
Key kube-state-metrics Metrics
# Pod status metrics
kube_pod_status_phase{phase="Running"}
kube_pod_status_ready{condition="true"}
kube_pod_container_status_restarts_total
# Deployment metrics
kube_deployment_status_replicas_available
kube_deployment_status_replicas_unavailable
# Node metrics
kube_node_status_condition{condition="Ready", status="true"}
kube_node_spec_unschedulable
# Resource requests and limits
kube_pod_container_resource_requests
kube_pod_container_resource_limits
# Namespace resource quotas
kube_resourcequota
Prometheus Operator and CRDs
The Prometheus Operator simplifies Prometheus deployment and management in Kubernetes through Custom Resource Definitions (CRDs).
Installing Prometheus Operator
# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-operator prometheus-community/kube-prometheus-stack
Custom Resource Examples
Prometheus CR
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: monitoring
spec:
serviceAccountName: prometheus
serviceMonitorSelector:
matchLabels:
team: frontend
ruleSelector:
matchLabels:
prometheus: kube-prometheus
role: alert-rules
resources:
requests:
memory: 400Mi
storage:
volumeClaimTemplate:
spec:
storageClassName: fast-ssd
resources:
requests:
storage: 50Gi
retention: 30d
retentionSize: 45GB
ServiceMonitor CR
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: my-app-monitor
namespace: monitoring
labels:
team: frontend
spec:
selector:
matchLabels:
app: my-app
endpoints:
- port: metrics
interval: 30s
path: /metrics
honorLabels: true
namespaceSelector:
matchNames:
- production
- staging
PrometheusRule CR
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: my-app-rules
namespace: monitoring
labels:
prometheus: kube-prometheus
role: alert-rules
spec:
groups:
- name: my-app.rules
rules:
- alert: MyAppHighErrorRate
expr: |
(
sum(rate(http_requests_total{job="my-app", status=~"5.."}[5m])) /
sum(rate(http_requests_total{job="my-app"}[5m]))
) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate in my-app"
description: "Error rate is {{ $value | humanizePercentage }}"YAMLBest Practices for Monitoring Kubernetes Workloads
Pod Annotations for Scraping
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: my-app
image: my-app:latest
ports:
- containerPort: 8080
name: metrics
Resource Monitoring Queries
# CPU usage by pod
sum by (pod) (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))
# Memory usage by pod
sum by (pod) (container_memory_working_set_bytes{container!="POD",container!=""})
# Pod restart rate
increase(kube_pod_container_status_restarts_total[1h])
# Pods not ready
kube_pod_status_ready{condition="false"}
# Node CPU usage
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100
# Node memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Persistent Volume usage
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100
Kubernetes Alerting Rules
# k8s-alerts.yml
groups:
- name: kubernetes-alerts
rules:
- alert: KubePodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Pod is crash looping"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting {{ $value | humanize }} times per 15 minutes"
- alert: KubePodNotReady
expr: kube_pod_status_ready{condition="false"} == 1
for: 15m
labels:
severity: warning
annotations:
summary: "Pod has been in not ready state for more than 15 minutes"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes"
- alert: KubeDeploymentGenerationMismatch
expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
for: 15m
labels:
severity: warning
annotations:
summary: "Deployment generation mismatch"
description: "Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match"
- alert: KubeNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 15m
labels:
severity: critical
annotations:
summary: "Node is not ready"
description: "Node {{ $labels.node }} has been unready for more than 15 minutes"
- alert: KubeDaemonSetRolloutStuck
expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100
for: 15m
labels:
severity: warning
annotations:
summary: "DaemonSet rollout is stuck"
description: "Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready"YAMLNetwork Policy Monitoring
# Example application with network policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: my-app-netpol
namespace: production
spec:
podSelector:
matchLabels:
app: my-app
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: production
ports:
- protocol: TCP
port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: production
ports:
- protocol: TCP
port: 5432
Chapter 7 Summary
Prometheus integrates seamlessly with Kubernetes through service discovery, automatically finding and monitoring pods, services, and nodes. kube-state-metrics provides cluster-level visibility, while the Prometheus Operator simplifies deployment through CRDs. Proper annotation strategies and resource monitoring ensure comprehensive Kubernetes observability.
Hands-on Exercise
- Service Discovery Setup:
- Configure Prometheus for Kubernetes service discovery
- Deploy applications with proper annotations
- Verify automatic target discovery
- kube-state-metrics:
- Install and configure kube-state-metrics
- Create queries for cluster health monitoring
- Build dashboards for Kubernetes resources
- Prometheus Operator:
- Deploy Prometheus using the operator
- Create ServiceMonitor and PrometheusRule resources
- Test the operator’s automated configuration management
8. Scaling and Performance
Federation and Hierarchical Prometheus Setups
Federation allows Prometheus servers to scrape selected time series from other Prometheus servers, enabling hierarchical monitoring architectures.
graph TB
A[Global Prometheus] --> B[Regional Prometheus US]
A --> C[Regional Prometheus EU]
A --> D[Regional Prometheus APAC]
B --> E[Cluster Prometheus US-1]
B --> F[Cluster Prometheus US-2]
C --> G[Cluster Prometheus EU-1]
C --> H[Cluster Prometheus EU-2]
D --> I[Cluster Prometheus APAC-1]
Federation Configuration
# Global Prometheus configuration
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~"prometheus|node-exporter"}'
- '{__name__=~"job:.*"}'
- '{__name__=~"instance:.*"}'
static_configs:
- targets:
- 'us-prometheus:9090'
- 'eu-prometheus:9090'
- 'apac-prometheus:9090'
# Aggregate high-level metrics
- job_name: 'federate-aggregates'
scrape_interval: 30s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{__name__=~"cluster:.*"}'
- '{__name__=~"region:.*"}'
static_configs:
- targets:
- 'us-prometheus:9090'
- 'eu-prometheus:9090'
- 'apac-prometheus:9090'
Recording Rules for Federation
# Regional Prometheus recording rules
groups:
- name: cluster_aggregates
interval: 30s
rules:
- record: cluster:cpu_usage:avg
expr: avg by (cluster) (instance:cpu_usage:rate5m)
- record: cluster:memory_usage:avg
expr: avg by (cluster) (instance:memory_usage:percentage)
- record: cluster:disk_usage:avg
expr: avg by (cluster) (instance:disk_usage:percentage)
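  # The cluster-level aggregates above reference instance-level recording rules
  # (instance:cpu_usage:rate5m and friends) that are not defined elsewhere in
  # this chapter. A minimal sketch of what they might look like, assuming the
  # cluster label is attached via external_labels on each regional Prometheus:
  - name: instance_aggregates
    interval: 30s
    rules:
      - record: instance:cpu_usage:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: instance:memory_usage:percentage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
      - record: instance:disk_usage:percentage
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"})) * 100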
- name: region_aggregates
interval: 60s
rules:
- record: region:request_rate:sum
expr: sum by (region) (cluster:request_rate:sum)
- record: region:error_rate:avg
expr: avg by (region) (cluster:error_rate:avg)
Remote Storage Integrations
Remote storage solutions provide long-term storage and horizontal scalability for Prometheus metrics.
Thanos Integration
Thanos provides unlimited retention and horizontal scaling for Prometheus.
# Prometheus with Thanos sidecar
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
spec:
serviceName: prometheus
replicas: 1
template:
spec:
containers:
- name: prometheus
image: prom/prometheus:latest
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=2h'
- '--storage.tsdb.min-block-duration=2h'
- '--storage.tsdb.max-block-duration=2h'
- '--web.enable-lifecycle'
ports:
- containerPort: 9090
volumeMounts:
- name: prometheus-storage
mountPath: /prometheus
- name: thanos-sidecar
image: thanosio/thanos:latest
args:
- sidecar
- --tsdb.path=/prometheus
- --prometheus.url=http://localhost:9090
- --objstore.config-file=/etc/thanos/objstore.yml
ports:
- containerPort: 10901
- containerPort: 10902
volumeMounts:
- name: prometheus-storage
mountPath: /prometheus
- name: thanos-objstore-config
mountPath: /etc/thanos
volumes:
- name: thanos-objstore-config
secret:
secretName: thanos-objstore-config
# Thanos objstore configuration
# objstore.yml
type: S3
config:
bucket: "thanos-metrics"
endpoint: "s3.amazonaws.com"
access_key: "ACCESS_KEY"
secret_key: "SECRET_KEY"
insecure: false
# Thanos query deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: thanos-query
spec:
replicas: 2
template:
spec:
containers:
- name: thanos-query
image: thanosio/thanos:latest
args:
- query
- --store=prometheus-0.prometheus:10901
- --store=prometheus-1.prometheus:10901
- --store=thanos-store:10901
ports:
- containerPort: 10902
VictoriaMetrics Integration
VictoriaMetrics provides high-performance storage and querying.
# VictoriaMetrics deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: victoriametrics
spec:
replicas: 1
template:
spec:
containers:
- name: victoriametrics
image: victoriametrics/victoria-metrics:latest
args:
- '--storageDataPath=/victoria-metrics-data'
- '--retentionPeriod=12'
- '--httpListenAddr=:8428'
ports:
- containerPort: 8428
volumeMounts:
- name: storage
mountPath: /victoria-metrics-data
# Prometheus remote write configuration
remote_write:
- url: "http://victoriametrics:8428/api/v1/write"
queue_config:
max_samples_per_send: 10000
batch_send_deadline: 5s
max_shards: 20
Cortex Configuration
# Cortex configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cortex-config
data:
cortex.yml: |
server:
http_listen_port: 9009
grpc_listen_port: 9095
distributor:
ring:
kvstore:
store: consul
consul:
host: consul:8500
ingester:
lifecycler:
ring:
kvstore:
store: consul
consul:
host: consul:8500
replication_factor: 3
storage:
engine: blocks
blocks_storage:
backend: s3
s3:
endpoint: s3.amazonaws.com
bucket_name: cortex-blocks
access_key_id: ACCESS_KEY
secret_access_key: SECRET_KEY
Retention Policies and Storage Tuning
Prometheus Storage Configuration
# Prometheus with optimized storage settings
args:
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=15d'
- '--storage.tsdb.retention.size=50GB'
- '--storage.tsdb.wal-compression'
- '--storage.tsdb.min-block-duration=2h'
- '--storage.tsdb.max-block-duration=2h'
- '--web.enable-admin-api'
Storage Optimization Strategies
# Monitor Prometheus storage metrics
prometheus_tsdb_symbol_table_size_bytes
prometheus_tsdb_head_series
prometheus_tsdb_compaction_duration_seconds
prometheus_config_last_reload_successful
# Storage utilization
prometheus_tsdb_size_bytes{type="wal"}
prometheus_tsdb_size_bytes{type="head"}
prometheus_tsdb_size_bytes{type="blocks"}
# Query performance
prometheus_engine_query_duration_seconds
prometheus_engine_queries_concurrent_max
Handling High Cardinality Metrics
Cardinality Analysis
# Find high cardinality metrics
topk(10, count by (__name__)({__name__!=""}))
# Series count by job
count by (job) ({__name__!=""})
# Label cardinality analysis
count by (__name__) (group by (__name__, instance) ({__name__!=""}))
Cardinality Management Strategies
# Metric relabeling to reduce cardinality
metric_relabel_configs:
# Drop unnecessary labels
- source_labels: [__name__]
regex: 'http_request_duration_seconds_bucket'
target_label: __tmp_bucket_drop
replacement: 'true'
- source_labels: [__tmp_bucket_drop, le]
regex: 'true;(0.005|0.01|0.025|0.05|0.1|0.25|0.5|1|2.5|5|10|\+Inf)'
action: keep
- regex: '__tmp_bucket_drop'
action: labeldrop
# Limit user agent variations
- source_labels: [user_agent]
regex: '(.*Chrome.*|.*Firefox.*|.*Safari.*)'
target_label: user_agent_family
replacement: '${1}'
- source_labels: [user_agent]
regex: '.*'
target_label: user_agent_family
replacement: 'other'
- regex: 'user_agent'
action: labeldrop
Recording Rules for High Cardinality
# Aggregate high cardinality metrics
groups:
- name: cardinality_reduction
interval: 30s
rules:
# Aggregate by service instead of instance
- record: service:request_rate:sum
expr: sum by (service) (rate(http_requests_total[5m]))
# Aggregate errors by service and status class
- record: service:error_rate:sum
expr: |
sum by (service, status_class) (
rate(http_requests_total{status=~"[45].."}[5m])
)
labels:
status_class: "4xx_5xx"
# Remove detailed path information
- record: service:request_duration:p95
expr: |
histogram_quantile(0.95,
sum by (service, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
Performance Optimization
Query Optimization
# Inefficient query - scans all time series
{__name__=~"http_.*"}
# Better - specific metric with labels
http_requests_total{job="my-service"}
# Inefficient - regex on high cardinality label
http_requests_total{instance=~".*prod.*"}
# Better - exact match or limited regex
http_requests_total{environment="production"}
# Use recording rules for complex calculations
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Replace with:
http_request_duration:p95
Memory and CPU Tuning
# Prometheus resource optimization
resources:
requests:
memory: "4Gi"
cpu: "1000m"
limits:
memory: "8Gi"
cpu: "2000m"
# JVM tuning for Java exporters
env:
- name: JAVA_OPTS
value: "-Xmx1g -Xms1g -XX:+UseG1GC"YAMLMonitoring Prometheus Performance
# Prometheus performance dashboard queries
panels:
- title: "Ingestion Rate"
expr: "rate(prometheus_tsdb_samples_total[5m])"
- title: "Active Series"
expr: "prometheus_tsdb_head_series"
- title: "Query Duration"
expr: "histogram_quantile(0.99, rate(prometheus_engine_query_duration_seconds_bucket[5m]))"
- title: "Memory Usage"
expr: "process_resident_memory_bytes"
- title: "WAL Truncations"
expr: "rate(prometheus_tsdb_wal_truncations_total[5m])"
- title: "Compaction Duration"
expr: "rate(prometheus_tsdb_compaction_duration_seconds_sum[5m])"YAMLChapter 8 Summary
Scaling Prometheus involves federation for hierarchical setups, remote storage for long-term retention, and careful cardinality management. Performance optimization requires query tuning, resource allocation, and monitoring of Prometheus itself. Remote storage solutions like Thanos, VictoriaMetrics, and Cortex provide different approaches to horizontal scaling.
Hands-on Exercise
- Federation Setup:
- Create a hierarchical Prometheus setup with federation
- Configure recording rules for aggregation
- Test cross-instance querying
- Remote Storage:
- Implement remote write to VictoriaMetrics or Thanos
- Configure retention policies
- Compare query performance
- Performance Optimization:
- Analyze cardinality in your metrics
- Implement relabeling to reduce cardinality
- Create recording rules for expensive queries
9. Best Practices and Pitfalls
Designing Effective Metrics
The Four Golden Signals
Focus on these key metrics for any system:
- Latency: Time to process requests
- Traffic: Amount of demand on the system
- Errors: Rate of failed requests
- Saturation: Resource utilization
# Latency - 95th percentile response time
histogram_quantile(0.95, sum by (service) (rate(http_request_duration_seconds_bucket[5m])))
# Traffic - Request rate
sum by (service) (rate(http_requests_total[5m]))
# Errors - Error rate
sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) /
sum by (service) (rate(http_requests_total[5m]))
# Saturation - CPU utilization
avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))
USE Method for Resources
For every resource, monitor:
- Utilization: How busy the resource is
- Saturation: Extra work queued
- Errors: Error events
# CPU Utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU Saturation
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})
# Memory Utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Memory Saturation
rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])
# Disk Utilization
rate(node_disk_io_time_seconds_total[5m]) * 100
# Disk Saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
# Network Utilization
rate(node_network_transmit_bytes_total[5m]) + rate(node_network_receive_bytes_total[5m])
# Network Errors
rate(node_network_transmit_errs_total[5m]) + rate(node_network_receive_errs_total[5m])
RED Method for Services
For every service, monitor:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Response time distribution
# Rate
sum by (service) (rate(http_requests_total[5m]))
# Errors
sum by (service) (rate(http_requests_total{status=~"[45].."}[5m]))
# Duration
histogram_quantile(0.50, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
Avoiding Cardinality Explosions
Common Cardinality Pitfalls
// BAD: User ID as label (unbounded cardinality)
requestsTotal := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
},
[]string{"method", "endpoint", "user_id"}, // user_id is unbounded!
)
// GOOD: Remove user_id or aggregate it
requestsTotal := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
},
[]string{"method", "endpoint", "user_type"}, // bounded categories
)
// BAD: Full URL path as label
errorCounter := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "errors_total",
},
[]string{"full_path"}, // /user/123/profile, /user/456/profile, etc.
)
// GOOD: Parameterized path
errorCounter := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "errors_total",
},
[]string{"path_template"}, // /user/:id/profile
)
Label Guidelines
# Good label practices
labels:
# Use bounded categorical values
environment: ["production", "staging", "development"]
region: ["us-east-1", "us-west-2", "eu-west-1"]
service: ["frontend", "backend", "database"]
# Avoid unbounded values
# ❌ user_id: "12345"
# ❌ session_id: "abc-def-123"
# ❌ full_url: "/api/users/12345/posts/67890"
# Use bounded alternatives
# ✅ user_type: "premium"
# ✅ endpoint: "/api/users/:id/posts/:id"
# ✅ status_class: "2xx"
Cardinality Monitoring
# Monitor series count by job
count by (job) ({__name__!=""})
# Find metrics with highest cardinality
topk(10, count by (__name__) ({__name__!=""}))
# Monitor label value counts
count by (__name__, status) (http_requests_total)
# Alert on high cardinality
count by (__name__) ({__name__!=""}) > 10000
Setting SLOs and SLIs with Prometheus
Defining SLIs (Service Level Indicators)
# Example SLI definitions
slis:
availability:
description: "Percentage of successful requests"
query: |
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
target: "> 99.9%"
latency:
description: "95th percentile response time"
query: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
target: "< 200ms"
error_rate:
description: "Rate of 5xx errors"
query: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
target: "< 0.1%"YAMLSLO Implementation
# SLO recording rules
groups:
- name: slo_rules
interval: 30s
rules:
# Error rate SLI
- record: sli:error_rate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# Availability SLI
- record: sli:availability
expr: |
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# Latency SLI
- record: sli:latency:p95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Fraction of the 30-day error budget remaining (99.9% SLO)
- record: slo:error_budget:30d
  expr: 1 - (avg_over_time(sli:error_rate[30d]) / (1 - 0.999))
SLO Alerting
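The burn-rate factors in the alerts below follow the common multi-window approach: with a 99.9% SLO the error budget is 0.1% of requests over 30 days, a burn rate of 14.4 would exhaust that budget in roughly two days (30 / 14.4 ≈ 2.1), and a burn rate of 6 would exhaust it in five days, which is why the fast-burn alert pages as critical while the slow-burn alert only warns.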
# SLO alerting rules
groups:
- name: slo_alerts
rules:
# Fast burn rate (1 hour)
- alert: SLOErrorBudgetBurnRateFast
expr: |
  avg_over_time(sli:error_rate[5m]) > (14.4 * (1 - 0.999))
  and
  avg_over_time(sli:error_rate[1h]) > (14.4 * (1 - 0.999))
for: 2m
labels:
severity: critical
annotations:
summary: "Fast SLO burn rate detected"
description: "Error rate is consuming error budget 14.4x faster than sustainable"
# Slow burn rate (6 hours)
- alert: SLOErrorBudgetBurnRateSlow
expr: |
  avg_over_time(sli:error_rate[30m]) > (6 * (1 - 0.999))
  and
  avg_over_time(sli:error_rate[6h]) > (6 * (1 - 0.999))
for: 15m
labels:
severity: warning
annotations:
summary: "Slow SLO burn rate detected"
description: "Error rate is consuming error budget 6x faster than sustainable"YAMLCase Studies from Real-World Systems
Case Study 1: E-commerce Platform
Challenge: Monitor checkout flow reliability.
Solution: Multi-step funnel monitoring.
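The queries below read from a counter named checkout_funnel_step_total with a bounded step label; that is not a standard library metric, so the checkout service has to emit it itself. A minimal Go sketch of what that instrumentation might look like (handler paths and step names are illustrative):
// checkout_metrics.go - illustrative funnel instrumentation for the checkout flow
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// checkoutFunnelSteps counts how many requests reach each step of the checkout
// flow; "step" is a small, bounded label set, so cardinality stays low.
var checkoutFunnelSteps = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "checkout_funnel_step_total",
		Help: "Checkout funnel events by step",
	},
	[]string{"step"},
)

// step wraps a handler and records one funnel event per request.
func step(name string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		checkoutFunnelSteps.WithLabelValues(name).Inc()
		h(w, r)
	}
}

func ok(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) }

func main() {
	http.HandleFunc("/cart", step("cart_view", ok))
	http.HandleFunc("/checkout", step("checkout_start", ok))
	http.HandleFunc("/payment", step("payment_submit", ok))
	http.HandleFunc("/confirm", step("order_complete", ok))
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
With the counter exposed, the funnel and conversion-rate queries read directly off it: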
# Checkout funnel metrics
checkout_funnel_step_total{step="cart_view"}
checkout_funnel_step_total{step="checkout_start"}
checkout_funnel_step_total{step="payment_submit"}
checkout_funnel_step_total{step="order_complete"}
# Conversion rates
rate(checkout_funnel_step_total{step="checkout_start"}[5m]) /
rate(checkout_funnel_step_total{step="cart_view"}[5m])
# Payment failure rate
rate(checkout_funnel_step_total{step="payment_failed"}[5m]) /
rate(checkout_funnel_step_total{step="payment_submit"}[5m])
# Revenue impact
sum(rate(order_value_total[5m])) * 3600
Case Study 2: Microservices Architecture
Challenge: Correlate distributed tracing with metrics.
Solution: Service dependency monitoring.
# Service dependency health
up{job=~".*service.*"}
# Cross-service error propagation
sum by (source_service, target_service) (
rate(http_requests_total{status=~"5.."}[5m])
)
# Service response time correlation
histogram_quantile(0.95,
sum by (service, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
Case Study 3: Infrastructure Cost Optimization
Challenge: Monitor resource efficiency.
Solution: Cost-aware metrics.
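Note that node_cpu_cost_per_hour and node_memory_cost_per_gb are not node_exporter metrics; they have to be published separately. One option, sketched below, is a tiny exporter that exposes per-instance-type prices as gauges (values and type names are purely illustrative and would normally come from a billing API or price sheet rather than being hard-coded):
// cost_exporter.go - illustrative exporter for per-instance-type cost gauges
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	cpuCostPerHour = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "node_cpu_cost_per_hour",
		Help: "Hourly CPU cost by instance type (illustrative values)",
	}, []string{"instance_type"})

	memCostPerGB = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "node_memory_cost_per_gb",
		Help: "Memory cost per GB by instance type (illustrative values)",
	}, []string{"instance_type"})
)

func main() {
	// Hard-coded example prices; a real exporter would refresh these periodically.
	cpuCostPerHour.WithLabelValues("m5.large").Set(0.096)
	memCostPerGB.WithLabelValues("m5.large").Set(0.012)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9105", nil)
}
With those gauges scraped alongside node_exporter, the queries below can relate utilization to cost: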
# CPU cost efficiency
sum by (instance_type) (node_cpu_seconds_total) /
sum by (instance_type) (node_cpu_cost_per_hour)
# Memory utilization by cost
avg by (instance_type) (
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes
) * sum by (instance_type) (node_memory_cost_per_gb)
# Idle resource identification
avg_over_time(
(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))[7d:]
) < 0.1
Metrics Naming Conventions
Prometheus Naming Best Practices
# Good metric names
http_requests_total # Counter with _total suffix
http_request_duration_seconds # Histogram with base unit
memory_usage_bytes # Gauge with base unit
process_cpu_usage_ratio # Ratio as _ratio suffix
# Bad metric names
HttpRequestsCount # Should be snake_case
request_time_ms # Should use base unit (seconds)
cpu_percentage # Should be cpu_usage_ratio
errors                            # Not descriptive enough
Label Naming Conventions
# Good labels
method: ["GET", "POST", "PUT", "DELETE"]
status: ["200", "404", "500"]
environment: ["production", "staging"]
region: ["us-east-1", "eu-west-1"]
# Bad labels
Method: "GET" # Should be lowercase
http_status_code: "200" # Redundant prefix
env: "prod" # Use full names
datacenter: "dc1" # Be specific about locationINITesting and Validation
Metrics Testing Framework
# metrics_test.py
import requests
import time
import pytest
class MetricsTestFramework:
def __init__(self, prometheus_url, app_url):
self.prometheus_url = prometheus_url
self.app_url = app_url
def query_metric(self, query):
"""Query Prometheus and return result"""
response = requests.get(
f"{self.prometheus_url}/api/v1/query",
params={"query": query}
)
return response.json()
def generate_load(self, endpoint, count=10):
"""Generate load on application endpoint"""
for _ in range(count):
requests.get(f"{self.app_url}{endpoint}")
time.sleep(0.1)
    def metric_value(self, query):
        """Query Prometheus and return the first sample value as a float."""
        result = self.query_metric(query)['data']['result']
        return float(result[0]['value'][1]) if result else 0.0

    def test_counter_increment(self):
        """Test that counters increment properly"""
        # Get initial value (query_metric returns a JSON document, so extract the sample value)
        initial = self.metric_value("sum(http_requests_total)")
        # Generate load
        self.generate_load("/test", 5)
        # Wait for scrape
        time.sleep(20)
        # Check increment
        final = self.metric_value("sum(http_requests_total)")
        assert final > initial
def test_histogram_buckets(self):
"""Test histogram bucket distribution"""
self.generate_load("/slow", 10)
time.sleep(20)
# Check bucket distribution
result = self.query_metric(
'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1m]))'
)
assert float(result['data']['result'][0]['value'][1]) > 0
# Usage
framework = MetricsTestFramework(
"http://localhost:9090",
"http://localhost:8080"
)
framework.test_counter_increment()
Documentation and Runbooks
Metrics Documentation Template
# Metric: http_requests_total
## Description
Counter of HTTP requests processed by the application.
## Type
Counter
## Labels
- `method`: HTTP method (GET, POST, PUT, DELETE)
- `endpoint`: API endpoint template (e.g., /api/users/:id)
- `status`: HTTP status code
- `service`: Service name
## Usage Examples
```promql
# Request rate
rate(http_requests_total[5m])
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
```
Alerts
- HighErrorRate: Fires when error rate > 5%
- LowRequestRate: Fires when request rate < 1 req/s
Dashboard Panels
- Request Rate Over Time
- Error Rate by Endpoint
- Request Distribution by Method
Chapter 9 Summary
Effective Prometheus monitoring requires following established patterns like the Four Golden Signals, USE, and RED methods. Avoid cardinality explosions through careful label design, implement meaningful SLOs with proper error budget tracking, and establish clear naming conventions. Testing, documentation, and real-world case studies help ensure monitoring provides actionable insights.
Hands-on Exercise
- Metrics Review:
- Audit your existing metrics for cardinality issues
- Apply the Four Golden Signals to your services
- Implement USE method for infrastructure resources
- SLO Implementation:
- Define SLIs for a critical service
- Create SLO recording and alerting rules
- Set up error budget tracking
- Best Practices Assessment:
- Review metric naming conventions
- Create documentation for key metrics
- Implement automated metrics testing
10. Advanced Topics
Exemplars and Tracing Correlation
Exemplars link metrics to traces, providing context for high-level aggregations by pointing to specific trace samples.
graph LR
A[HTTP Request] --> B[Metrics]
A --> C[Traces]
B --> D[Exemplar]
D --> C
C --> E[Span Details]
Enabling Exemplars in Prometheus
# prometheus.yml
# Exemplar storage must also be enabled with the --enable-feature=exemplar-storage flag.
global:
  scrape_interval: 15s
storage:
  exemplars:
    max_exemplars: 100000
scrape_configs:
- job_name: 'my-app'
scrape_interval: 10s
static_configs:
- targets: ['app:8080']
Instrumenting Applications with Exemplars
// Go application with exemplars
package main
import (
	"fmt"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)
var (
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
)
func init() {
prometheus.MustRegister(requestDuration)
}
func instrumentedHandler(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Start OpenTelemetry span
_, span := otel.Tracer("my-app").Start(r.Context(), "http_request")
defer span.End()
// Simulate work
time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
// Record metric with exemplar
duration := time.Since(start).Seconds()
exemplar := prometheus.Labels{
"trace_id": span.SpanContext().TraceID().String(),
"span_id": span.SpanContext().SpanID().String(),
}
// ObserveWithExemplar is exposed through the prometheus.ExemplarObserver interface.
requestDuration.WithLabelValues(r.Method, r.URL.Path).(prometheus.ExemplarObserver).
	ObserveWithExemplar(duration, exemplar)
span.SetAttributes(
attribute.String("http.method", r.Method),
attribute.String("http.url", r.URL.Path),
attribute.Float64("http.duration", duration),
)
w.WriteHeader(http.StatusOK)
fmt.Fprintf(w, "Request processed in %.2f seconds", duration)
}
func main() {
http.HandleFunc("/api", instrumentedHandler)
// Exemplars are only exposed via the OpenMetrics exposition format.
http.Handle("/metrics", promhttp.HandlerFor(prometheus.DefaultGatherer, promhttp.HandlerOpts{EnableOpenMetrics: true}))
http.ListenAndServe(":8080", nil)
}
Querying Exemplars
# Query histogram with exemplars
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# API endpoint for exemplars
GET /api/v1/query_exemplars?query=http_request_duration_seconds_bucket&start=<timestamp>&end=<timestamp>
Multi-cluster Monitoring
Centralized Multi-cluster Architecture
graph TB
A[Global Prometheus] --> B[Cluster A Prometheus]
A --> C[Cluster B Prometheus]
A --> D[Cluster C Prometheus]
B --> E[Workloads A]
C --> F[Workloads B]
D --> G[Workloads C]
A --> H[Global Grafana]
A --> I[Global Alertmanager]
Cross-cluster Service Discovery
# Global Prometheus configuration
global:
external_labels:
cluster: 'management'
region: 'global'
scrape_configs:
# Federate from regional clusters
- job_name: 'federate-clusters'
scrape_interval: 30s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{__name__=~"cluster:.*"}'
- '{__name__=~"node_.*"}'
- '{__name__=~"container_.*"}'
static_configs:
- targets:
- 'cluster-a-prometheus:9090'
labels:
cluster: 'cluster-a'
region: 'us-east-1'
- targets:
- 'cluster-b-prometheus:9090'
labels:
cluster: 'cluster-b'
region: 'us-west-2'
- targets:
- 'cluster-c-prometheus:9090'
labels:
cluster: 'cluster-c'
region: 'eu-west-1'
# Cross-cluster service monitoring
- job_name: 'cross-cluster-services'
kubernetes_sd_configs:
- role: endpoints
api_server: 'https://cluster-a.k8s.local'
tls_config:
ca_file: /etc/ssl/cluster-a-ca.crt
cert_file: /etc/ssl/cluster-a.crt
key_file: /etc/ssl/cluster-a.key
- role: endpoints
api_server: 'https://cluster-b.k8s.local'
tls_config:
ca_file: /etc/ssl/cluster-b-ca.crt
cert_file: /etc/ssl/cluster-b.crt
key_file: /etc/ssl/cluster-b.key
Multi-cluster Recording Rules
# Global recording rules
groups:
- name: cross_cluster_aggregates
interval: 60s
rules:
- record: global:request_rate:sum
expr: sum by (service) (cluster:request_rate:sum)
- record: global:error_rate:avg
expr: avg by (service) (cluster:error_rate:avg)
- record: global:latency:p95
expr: |
histogram_quantile(0.95,
sum by (service, le) (cluster:latency:histogram)
)
- record: region:capacity:available
expr: |
sum by (region) (
cluster:node_capacity:cpu - cluster:node_usage:cpu
)
Integrating with Logging and Tracing
Correlation with ELK Stack
# Logstash configuration for metrics correlation
input {
beats {
port => 5044
}
}
filter {
if [fields][service] {
# Add Prometheus job label
mutate {
add_field => { "prometheus_job" => "%{[fields][service]}" }
}
# Extract trace ID if present
if [message] =~ /trace_id=/ {
grok {
match => { "message" => "trace_id=(?<trace_id>[a-f0-9]+)" }
}
}
# Add links to metrics
mutate {
add_field => {
"metrics_link" => "http://grafana.local/d/app-dashboard?var-service=%{[fields][service]}&from=now-5m&to=now"
}
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{+YYYY.MM.dd}"
}
}
Jaeger Integration
# Jaeger query service with Prometheus metrics
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-query
spec:
template:
spec:
containers:
- name: jaeger-query
image: jaegertracing/jaeger-query:latest
env:
- name: SPAN_STORAGE_TYPE
value: elasticsearch
- name: ES_SERVER_URLS
value: http://elasticsearch:9200
- name: METRICS_BACKEND
value: prometheus
- name: PROMETHEUS_SERVER_URL
value: http://prometheus:9090
ports:
- containerPort: 16686
- containerPort: 16687
OpenTelemetry Collector Configuration
# otelcol-config.yml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
prometheus:
config:
scrape_configs:
- job_name: 'otel-collector'
static_configs:
- targets: ['localhost:8888']
processors:
batch:
timeout: 1s
send_batch_size: 1024
attributes:
actions:
- key: cluster
value: production
action: insert
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: "0.0.0.0:8889"
namespace: "otel"
prometheusremotewrite:
endpoint: "http://prometheus:9090/api/v1/write"
service:
pipelines:
traces:
receivers: [otlp]
processors: [attributes, batch]
exporters: [jaeger]
metrics:
receivers: [otlp, prometheus]
processors: [attributes, batch]
exporters: [prometheus, prometheusremotewrite]
Security and RBAC in Prometheus Setups
Prometheus Security Configuration
# Prometheus with TLS and authentication
apiVersion: v1
kind: Secret
metadata:
name: prometheus-certs
type: Opaque
data:
tls.crt: <base64-encoded-cert>
tls.key: <base64-encoded-key>
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
spec:
template:
spec:
containers:
- name: prometheus
image: prom/prometheus:latest
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--web.config.file=/etc/prometheus/web.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.listen-address=0.0.0.0:9090'
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: certs
mountPath: /etc/ssl/prometheus
readOnly: true
# web.yml - Prometheus web configuration
tls_server_config:
cert_file: /etc/ssl/prometheus/tls.crt
key_file: /etc/ssl/prometheus/tls.key
basic_auth_users:
admin: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
readonly: $2b$12$6tgWf5DZ9z7LZtD.ZrAb/.VjBfI3WnJg3ULf.TgLBtO4vKAzp7KuG
RBAC Configuration for Kubernetes
# ServiceAccount for Prometheus
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
# ClusterRole with minimal permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "apps"]
resources:
- ingresses
- deployments
- daemonsets
- statefulsets
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
# ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
OAuth2 Proxy Integration
# OAuth2 Proxy for Prometheus
apiVersion: apps/v1
kind: Deployment
metadata:
name: oauth2-proxy
spec:
template:
spec:
containers:
- name: oauth2-proxy
image: quay.io/oauth2-proxy/oauth2-proxy:latest
args:
- --provider=github
- --email-domain=yourcompany.com
- --upstream=http://prometheus:9090
- --http-address=0.0.0.0:4180
- --client-id=$(OAUTH2_PROXY_CLIENT_ID)
- --client-secret=$(OAUTH2_PROXY_CLIENT_SECRET)
- --cookie-secret=$(OAUTH2_PROXY_COOKIE_SECRET)
env:
- name: OAUTH2_PROXY_CLIENT_ID
valueFrom:
secretKeyRef:
name: oauth2-proxy-secrets
key: client-id
- name: OAUTH2_PROXY_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: oauth2-proxy-secrets
key: client-secret
- name: OAUTH2_PROXY_COOKIE_SECRET
valueFrom:
secretKeyRef:
name: oauth2-proxy-secrets
key: cookie-secret
Chapter 10 Summary
Advanced Prometheus topics include exemplars for linking metrics to traces, multi-cluster monitoring architectures, integration with logging and tracing systems, and comprehensive security configurations. These features enable enterprise-scale observability with proper access controls and correlation across different observability signals.
Hands-on Exercise
- Exemplars Implementation:
- Enable exemplars in Prometheus
- Instrument an application with trace correlation
- View exemplars in Grafana dashboards
- Multi-cluster Setup:
- Configure federation between Prometheus instances
- Implement cross-cluster monitoring
- Test global query capabilities
- Security Hardening:
- Implement TLS and authentication
- Configure RBAC for Kubernetes
- Set up OAuth2 proxy for access control
11. Capstone Project
Project Overview
Build a complete observability stack for a sample e-commerce application with microservices architecture, including metrics collection, alerting, visualization, and incident response workflows.
Architecture Overview
graph TB
subgraph "Application Layer"
A[Frontend Service] --> B[User Service]
A --> C[Product Service]
A --> D[Order Service]
D --> E[Payment Service]
D --> F[Inventory Service]
B --> G[User Database]
C --> H[Product Database]
D --> I[Order Database]
end
subgraph "Observability Layer"
J[Prometheus] --> K[Alertmanager]
J --> L[Grafana]
M[Node Exporter] --> J
N[Application Metrics] --> J
O[Blackbox Exporter] --> J
K --> P[Slack/Email]
L --> Q[Dashboards]
end
A --> N
B --> N
C --> N
D --> N
E --> N
F --> N
Step 1: Infrastructure Setup
Docker Compose Environment
# docker-compose.yml
version: '3.8'
networks:
monitoring:
driver: bridge
app:
driver: bridge
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
services:
# Prometheus
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus:/etc/prometheus
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
networks:
- monitoring
- app
restart: unless-stopped
# Alertmanager
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager:/etc/alertmanager
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
networks:
- monitoring
restart: unless-stopped
# Grafana
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
networks:
- monitoring
restart: unless-stopped
# Node Exporter
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
networks:
- monitoring
restart: unless-stopped
# Blackbox Exporter
blackbox-exporter:
image: prom/blackbox-exporter:latest
container_name: blackbox-exporter
ports:
- "9115:9115"
volumes:
- ./blackbox:/etc/blackbox_exporter
networks:
- monitoring
restart: unless-stopped
# Application Services
frontend:
build: ./apps/frontend
container_name: frontend
ports:
- "8080:8080"
environment:
- USER_SERVICE_URL=http://user-service:8081
- PRODUCT_SERVICE_URL=http://product-service:8082
- ORDER_SERVICE_URL=http://order-service:8083
networks:
- app
restart: unless-stopped
user-service:
build: ./apps/user-service
container_name: user-service
ports:
- "8081:8081"
environment:
- DATABASE_URL=postgresql://user:password@user-db:5432/users
networks:
- app
restart: unless-stopped
product-service:
build: ./apps/product-service
container_name: product-service
ports:
- "8082:8082"
environment:
- DATABASE_URL=postgresql://product:password@product-db:5432/products
networks:
- app
restart: unless-stopped
order-service:
build: ./apps/order-service
container_name: order-service
ports:
- "8083:8083"
environment:
- DATABASE_URL=postgresql://order:password@order-db:5432/orders
- PAYMENT_SERVICE_URL=http://payment-service:8084
- INVENTORY_SERVICE_URL=http://inventory-service:8085
networks:
- app
restart: unless-stopped
payment-service:
build: ./apps/payment-service
container_name: payment-service
ports:
- "8084:8084"
networks:
- app
restart: unless-stopped
inventory-service:
build: ./apps/inventory-service
container_name: inventory-service
ports:
- "8085:8085"
networks:
- app
restart: unless-stopped
# Databases
user-db:
image: postgres:13
container_name: user-db
environment:
- POSTGRES_DB=users
- POSTGRES_USER=user
- POSTGRES_PASSWORD=password
volumes:
- ./data/user-db:/var/lib/postgresql/data
networks:
- app
product-db:
image: postgres:13
container_name: product-db
environment:
- POSTGRES_DB=products
- POSTGRES_USER=product
- POSTGRES_PASSWORD=password
volumes:
- ./data/product-db:/var/lib/postgresql/data
networks:
- app
order-db:
image: postgres:13
container_name: order-db
environment:
- POSTGRES_DB=orders
- POSTGRES_USER=order
- POSTGRES_PASSWORD=password
volumes:
- ./data/order-db:/var/lib/postgresql/data
networks:
- app
Step 2: Application Instrumentation
Frontend Service (Go)
// apps/frontend/main.go
package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"service", "method", "endpoint", "status"},
)
httpRequestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"service", "method", "endpoint"},
)
upstreamRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "upstream_requests_total",
Help: "Total upstream requests",
},
[]string{"service", "target_service", "status"},
)
businessMetrics = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "business_events_total",
Help: "Business events counter",
},
[]string{"service", "event_type"},
)
)
func init() {
prometheus.MustRegister(httpRequestsTotal)
prometheus.MustRegister(httpRequestDuration)
prometheus.MustRegister(upstreamRequestsTotal)
prometheus.MustRegister(businessMetrics)
}
func instrumentHandler(service, endpoint string, handler http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Wrap ResponseWriter to capture status code
ww := &responseWriter{ResponseWriter: w, statusCode: 200}
handler(ww, r)
duration := time.Since(start).Seconds()
status := fmt.Sprintf("%d", ww.statusCode)
httpRequestsTotal.WithLabelValues(service, r.Method, endpoint, status).Inc()
httpRequestDuration.WithLabelValues(service, r.Method, endpoint).Observe(duration)
}
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
func homeHandler(w http.ResponseWriter, r *http.Request) {
businessMetrics.WithLabelValues("frontend", "page_view").Inc()
response := map[string]string{
"service": "frontend",
"status": "healthy",
"version": "1.0.0",
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(response)
}
func usersHandler(w http.ResponseWriter, r *http.Request) {
userServiceURL := os.Getenv("USER_SERVICE_URL")
if userServiceURL == "" {
userServiceURL = "http://localhost:8081"
}
	resp, err := http.Get(userServiceURL + "/users")
status := "500"
if err == nil {
status = fmt.Sprintf("%d", resp.StatusCode)
defer resp.Body.Close()
}
upstreamRequestsTotal.WithLabelValues("frontend", "user-service", status).Inc()
if err != nil {
http.Error(w, "User service unavailable", http.StatusServiceUnavailable)
return
}
businessMetrics.WithLabelValues("frontend", "user_list_view").Inc()
w.Header().Set("Content-Type", "application/json")
w.Write([]byte(`{"users": []}`))
}
func main() {
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/", instrumentHandler("frontend", "/", homeHandler))
http.HandleFunc("/users", instrumentHandler("frontend", "/users", usersHandler))
http.HandleFunc("/health", instrumentHandler("frontend", "/health", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
}))
log.Println("Frontend service starting on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}
User Service (Python)
# apps/user-service/app.py
from flask import Flask, jsonify, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time
import psycopg2
import os
app = Flask(__name__)
# Prometheus metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['service', 'method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['service', 'method', 'endpoint'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
DATABASE_CONNECTIONS = Gauge(
'database_connections_active',
'Active database connections',
['service', 'database']
)
BUSINESS_EVENTS = Counter(
'business_events_total',
'Business events',
['service', 'event_type']
)
def instrument_request(f):
def wrapper(*args, **kwargs):
start_time = time.time()
status = '200'
try:
result = f(*args, **kwargs)
return result
except Exception as e:
status = '500'
raise
finally:
REQUEST_COUNT.labels(
service='user-service',
method=request.method,
endpoint=request.endpoint or 'unknown',
status=status
).inc()
REQUEST_DURATION.labels(
service='user-service',
method=request.method,
endpoint=request.endpoint or 'unknown'
).observe(time.time() - start_time)
wrapper.__name__ = f.__name__
return wrapper
@app.route('/')
@instrument_request
def home():
return jsonify({
'service': 'user-service',
'status': 'healthy',
'version': '1.0.0'
})
@app.route('/users')
@instrument_request
def get_users():
BUSINESS_EVENTS.labels(service='user-service', event_type='user_list_request').inc()
# Simulate database query
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
time.sleep(0.01) # Simulate query time
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
return jsonify({
'users': [
{'id': 1, 'name': 'John Doe', 'email': 'john@example.com'},
{'id': 2, 'name': 'Jane Smith', 'email': 'jane@example.com'}
]
})
@app.route('/users/<int:user_id>')
@instrument_request
def get_user(user_id):
BUSINESS_EVENTS.labels(service='user-service', event_type='user_detail_request').inc()
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
time.sleep(0.005)
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
return jsonify({
'id': user_id,
'name': f'User {user_id}',
'email': f'user{user_id}@example.com'
})
@app.route('/health')
@instrument_request
def health():
return jsonify({'status': 'healthy'})
@app.route('/metrics')
def metrics():
return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8081)
```

Step 3: Prometheus Configuration

```yaml
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'ecommerce'
environment: 'production'
rule_files:
- "alert_rules.yml"
- "recording_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
scrape_interval: 30s
# Application services
- job_name: 'frontend'
static_configs:
- targets: ['frontend:8080']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'user-service'
static_configs:
- targets: ['user-service:8081']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'product-service'
static_configs:
- targets: ['product-service:8082']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'order-service'
static_configs:
- targets: ['order-service:8083']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'payment-service'
static_configs:
- targets: ['payment-service:8084']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'inventory-service'
static_configs:
- targets: ['inventory-service:8085']
metrics_path: '/metrics'
scrape_interval: 15s
# Blackbox monitoring
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- http://frontend:8080/health
- http://user-service:8081/health
- http://product-service:8082/health
- http://order-service:8083/health
- http://payment-service:8084/health
- http://inventory-service:8085/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
        replacement: blackbox-exporter:9115
```

Step 4: Recording Rules

```yaml
# prometheus/recording_rules.yml
groups:
- name: application_rules
interval: 30s
rules:
# Request rates
- record: service:request_rate:rate5m
expr: sum by (service) (rate(http_requests_total[5m]))
- record: service:request_rate:rate1h
expr: sum by (service) (rate(http_requests_total[1h]))
# Error rates
- record: service:error_rate:rate5m
expr: |
sum by (service) (rate(http_requests_total{status=~"[45].."}[5m])) /
sum by (service) (rate(http_requests_total[5m]))
# Latency percentiles
- record: service:request_duration:p50
expr: |
histogram_quantile(0.50,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: service:request_duration:p95
expr: |
histogram_quantile(0.95,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: service:request_duration:p99
expr: |
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- name: infrastructure_rules
interval: 30s
rules:
# Node metrics
- record: node:cpu_usage:rate5m
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: node:memory_usage:percentage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- record: node:disk_usage:percentage
expr: |
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
- name: business_rules
interval: 60s
rules:
# Business metrics
- record: business:page_views:rate1h
expr: rate(business_events_total{event_type="page_view"}[1h]) * 3600
- record: business:user_requests:rate1h
expr: rate(business_events_total{event_type=~"user_.*"}[1h]) * 3600
# Service dependency health
- record: service:dependency_success_rate:rate5m
expr: |
sum by (service, target_service) (rate(upstream_requests_total{status=~"2.."}[5m])) /
          sum by (service, target_service) (rate(upstream_requests_total[5m]))
```

Step 5: Alerting Rules

```yaml
# prometheus/alert_rules.yml
groups:
- name: infrastructure_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Node is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute"
runbook_url: "https://runbooks.company.com/node-down"
- alert: HighCPUUsage
expr: node:cpu_usage:rate5m > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage"
description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: node:memory_usage:percentage > 85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High memory usage"
description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
- name: application_alerts
rules:
- alert: ServiceDown
expr: up{job=~"frontend|.*-service"} == 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "Service is down"
description: "Service {{ $labels.job }} is down"
- alert: HighErrorRate
expr: service:error_rate:rate5m > 0.05
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate for {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
- alert: HighLatency
expr: service:request_duration:p95 > 1
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High latency for {{ $labels.service }}"
description: "95th percentile latency is {{ $value }}s for {{ $labels.service }}"
- alert: LowRequestRate
expr: service:request_rate:rate5m < 0.1
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "Low request rate for {{ $labels.service }}"
description: "Request rate is {{ $value }} req/s for {{ $labels.service }}"
- name: business_alerts
rules:
- alert: LowPageViews
expr: business:page_views:rate1h < 10
for: 15m
labels:
severity: warning
team: product
annotations:
summary: "Low page view rate"
description: "Page view rate is {{ $value }} views/hour"
- alert: ServiceDependencyFailure
expr: service:dependency_success_rate:rate5m < 0.95
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Service dependency failure"
description: "{{ $labels.service }} -> {{ $labels.target_service }} success rate is {{ $value | humanizePercentage }}"YAMLStep 6: Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@ecommerce.local'
smtp_auth_username: 'alerts@ecommerce.local'
smtp_auth_password: 'your-app-password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
# Critical alerts to on-call
- matchers:
- severity=critical
receiver: 'critical-alerts'
continue: true
# Infrastructure team alerts
- matchers:
- team=infrastructure
receiver: 'infrastructure-team'
# Platform team alerts
- matchers:
- team=platform
receiver: 'platform-team'
# Product team alerts
- matchers:
- team=product
receiver: 'product-team'
receivers:
- name: 'default'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Service:* {{ .Labels.service }}
{{ end }}
- name: 'critical-alerts'
email_configs:
- to: 'oncall@ecommerce.local'
subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Service: {{ .Labels.service }}
Runbook: {{ .Annotations.runbook_url }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#critical-alerts'
title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
- name: 'infrastructure-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#infrastructure'
title: '⚠️ Infrastructure Alert: {{ .GroupLabels.alertname }}'
- name: 'platform-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#platform'
title: '🔧 Platform Alert: {{ .GroupLabels.alertname }}'
- name: 'product-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#product'
title: '📊 Business Alert: {{ .GroupLabels.alertname }}'
inhibit_rules:
# Don't send warning alerts if critical alerts are firing
- source_matchers:
- severity=critical
target_matchers:
- severity=warning
equal: ['service']
# Don't send service alerts if node is down
- source_matchers:
- alertname=NodeDown
target_matchers:
- alertname=ServiceDown
    equal: ['instance']
```

Step 7: Grafana Dashboards

Infrastructure Dashboard

```json
# grafana/provisioning/dashboards/infrastructure.json
{
"dashboard": {
"id": null,
"title": "Infrastructure Overview",
"tags": ["infrastructure", "monitoring"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "CPU Usage",
"type": "stat",
"targets": [
{
"expr": "node:cpu_usage:rate5m",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Memory Usage",
"type": "stat",
"targets": [
{
"expr": "node:memory_usage:percentage",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
},
{
"id": 3,
"title": "CPU Usage Over Time",
"type": "graph",
"targets": [
{
"expr": "node:cpu_usage:rate5m",
"legendFormat": "{{ instance }}"
}
],
"yAxes": [
{
"unit": "percent",
"max": 100,
"min": 0
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
}
]
}
}
```

Application Dashboard

```json
# grafana/provisioning/dashboards/application.json
{
"dashboard": {
"id": null,
"title": "Application Performance",
"tags": ["application", "performance"],
"timezone": "browser",
"refresh": "30s",
"templating": {
"list": [
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total, service)",
"refresh": 1,
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "service:request_rate:rate5m{service=~\"$service\"}",
"legendFormat": "{{ service }}"
}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "service:error_rate:rate5m{service=~\"$service\"} * 100",
"legendFormat": "{{ service }}"
}
],
"yAxes": [
{
"unit": "percent",
"max": 100,
"min": 0
}
],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
},
{
"id": 3,
"title": "Response Time Percentiles",
"type": "graph",
"targets": [
{
"expr": "service:request_duration:p50{service=~\"$service\"}",
"legendFormat": "{{ service }} - 50th"
},
{
"expr": "service:request_duration:p95{service=~\"$service\"}",
"legendFormat": "{{ service }} - 95th"
},
{
"expr": "service:request_duration:p99{service=~\"$service\"}",
"legendFormat": "{{ service }} - 99th"
}
],
"yAxes": [
{
"unit": "s"
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
}
]
}
}
```

Step 8: Testing and Validation

Load Testing Script

```python
# scripts/load_test.py
import requests
import time
import random
import threading
from concurrent.futures import ThreadPoolExecutor
BASE_URL = "http://localhost:8080"
def make_request(endpoint):
"""Make a request to the specified endpoint"""
try:
response = requests.get(f"{BASE_URL}{endpoint}", timeout=5)
return response.status_code
except Exception as e:
print(f"Error calling {endpoint}: {e}")
return 500
def generate_load(stop_time):
    """Generate load on the application until stop_time is reached"""
    endpoints = ["/", "/users", "/health"]
    while time.time() < stop_time:
        endpoint = random.choice(endpoints)
        make_request(endpoint)
        # Add some randomness to the load
        time.sleep(random.uniform(0.1, 1.0))
def run_load_test(duration_minutes=10, concurrent_users=5):
    """Run load test for the specified duration with N concurrent workers"""
    print(f"Starting load test with {concurrent_users} concurrent users for {duration_minutes} minutes")
    stop_time = time.time() + duration_minutes * 60
    with ThreadPoolExecutor(max_workers=concurrent_users) as executor:
        # Each worker loops until the deadline, so the executor can shut down cleanly
        futures = [executor.submit(generate_load, stop_time) for _ in range(concurrent_users)]
        for future in futures:
            future.result()
if __name__ == "__main__":
    run_load_test(duration_minutes=5, concurrent_users=10)
```

Chaos Testing

```python
# scripts/chaos_test.py
import docker
import time
import random
client = docker.from_env()
def stop_random_service():
"""Stop a random service for chaos testing"""
services = ['user-service', 'product-service', 'order-service']
service_name = random.choice(services)
try:
container = client.containers.get(service_name)
print(f"Stopping {service_name}")
container.stop()
# Wait for some time
time.sleep(30)
print(f"Starting {service_name}")
container.start()
except Exception as e:
print(f"Error with {service_name}: {e}")
def simulate_high_load():
"""Simulate high CPU load on a container"""
try:
container = client.containers.get('frontend')
print("Simulating high CPU load")
# Run stress test inside container
container.exec_run("stress --cpu 2 --timeout 60s", detach=True)
except Exception as e:
print(f"Error simulating load: {e}")
if __name__ == "__main__":
print("Starting chaos testing...")
# Run different chaos scenarios
stop_random_service()
time.sleep(120)
simulate_high_load()
    time.sleep(120)
```

Step 9: Deployment Script

```bash
#!/bin/bash
# scripts/deploy.sh
set -e
echo "Starting E-commerce Observability Stack deployment..."
# Create necessary directories
mkdir -p data/{user-db,product-db,order-db}
mkdir -p prometheus grafana/provisioning/{datasources,dashboards}
mkdir -p alertmanager blackbox
# Set permissions
chmod 777 data/{user-db,product-db,order-db}
# Build application images
echo "Building application images..."
for service in frontend user-service product-service order-service payment-service inventory-service; do
echo "Building $service..."
docker build -t ecommerce/$service:latest apps/$service/
done
# Start the stack
echo "Starting services..."
docker-compose up -d
# Wait for services to be ready
echo "Waiting for services to start..."
sleep 30
# Check service health
echo "Checking service health..."
services=("prometheus:9090" "grafana:3000" "alertmanager:9093" "frontend:8080")
for service in "${services[@]}"; do
IFS=':' read -r name port <<< "$service"
echo "Checking $name on port $port..."
for i in {1..30}; do
if curl -f "http://localhost:$port/health" 2>/dev/null || curl -f "http://localhost:$port" 2>/dev/null; then
echo "$name is healthy"
break
fi
if [ $i -eq 30 ]; then
echo "Warning: $name may not be ready"
fi
sleep 2
done
done
echo "Deployment complete!"
echo "Access URLs:"
echo " Prometheus: http://localhost:9090"
echo " Grafana: http://localhost:3000 (admin/admin123)"
echo " Alertmanager: http://localhost:9093"
echo " Application: http://localhost:8080"
echo "Run load tests with: python scripts/load_test.py"
echo "Run chaos tests with: python scripts/chaos_test.py"BashStep 10: Documentation and Runbooks
README.md
# E-commerce Observability Stack
This project demonstrates a complete observability setup for a microservices-based e-commerce application using Prometheus, Grafana, and Alertmanager.
## Architecture
- **Frontend Service** (Go): Main web interface
- **User Service** (Python): User management
- **Product Service** (Python): Product catalog
- **Order Service** (Python): Order processing
- **Payment Service** (Python): Payment processing
- **Inventory Service** (Python): Inventory management
## Deployment
```bash
# Clone the repository
git clone <repository-url>
cd ecommerce-observability
# Deploy the stack
./scripts/deploy.sh
```

Access Points
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin123)
- Alertmanager: http://localhost:9093
- Application: http://localhost:8080
Testing
Load Testing
```bash
python scripts/load_test.py
```

Chaos Testing
```bash
python scripts/chaos_test.py
```

Monitoring
Key Metrics
- Request rate per service
- Error rate per service
- Response time percentiles
- Infrastructure utilization
Alerts
- Service down
- High error rate (>5%)
- High latency (>1s p95)
- Infrastructure issues
Troubleshooting
Service Discovery Issues
Check Prometheus targets: http://localhost:9090/targets
Missing Metrics
Verify service /metrics endpoints are accessible
Alert Not Firing
Check Prometheus rules: http://localhost:9090/rules
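These checks can also be scripted against the Prometheus HTTP API. The sketch below is a minimal troubleshooting helper, not part of the project code; it assumes Prometheus is reachable at localhost:9090 as in the deployment script, and the script name is hypothetical. It lists scrape targets whose last scrape failed and the alerting rules Prometheus has actually loaded.

```python
# scripts/check_targets.py (hypothetical helper) -- assumes Prometheus at localhost:9090
import requests

PROM = "http://localhost:9090"

def unhealthy_targets():
    """Return scrape targets whose last scrape did not succeed."""
    data = requests.get(f"{PROM}/api/v1/targets", timeout=5).json()["data"]
    return [
        (t["labels"].get("job", "unknown"), t["scrapeUrl"], t.get("lastError", ""))
        for t in data["activeTargets"]
        if t["health"] != "up"
    ]

def loaded_alert_rules():
    """List alerting rules Prometheus has loaded, with their current state."""
    data = requests.get(f"{PROM}/api/v1/rules", timeout=5).json()["data"]
    return [
        (g["name"], r["name"], r.get("state", "n/a"))
        for g in data["groups"]
        for r in g["rules"]
        if r["type"] == "alerting"
    ]

if __name__ == "__main__":
    for job, url, err in unhealthy_targets():
        print(f"DOWN  {job}  {url}  {err}")
    for group, rule, state in loaded_alert_rules():
        print(f"RULE  {group}/{rule}: {state}")
```

Run it after the stack is up; an empty "DOWN" list and a full rule list usually narrows the problem to instrumentation rather than scraping.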
### Project Validation
#### Verification Checklist
1. **✅ Infrastructure Monitoring**
- [ ] Node exporter collecting system metrics
- [ ] CPU, memory, disk usage visible in Grafana
- [ ] Infrastructure alerts firing correctly
2. **✅ Application Monitoring**
- [ ] All services exposing metrics
- [ ] Request rate, error rate, latency tracked
- [ ] Business metrics instrumented
3. **✅ Alerting**
- [ ] Critical alerts configured
- [ ] Alert routing working
   - [ ] Notification channels tested (see the synthetic-alert sketch below the checklist)
4. **✅ Visualization**
- [ ] Infrastructure dashboard functional
- [ ] Application dashboard with filters
- [ ] Business metrics dashboard
5. **✅ Testing**
- [ ] Load testing generating metrics
- [ ] Chaos testing triggering alerts
- [ ] Recovery scenarios validated
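One way to exercise the notification-channel item without waiting for a real incident is to push a synthetic alert straight into Alertmanager's v2 API. This is a rough sketch under assumptions: Alertmanager listens on localhost:9093, the routing tree from Step 6 is loaded, and the alert name, labels, and script path are made up purely for this test.

```python
# scripts/test_alert_routing.py (hypothetical helper) -- assumes Alertmanager at localhost:9093
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER = "http://localhost:9093"

def send_test_alert():
    now = datetime.now(timezone.utc)
    alert = {
        "labels": {
            "alertname": "SyntheticTestAlert",   # hypothetical, not backed by a real rule
            "severity": "warning",
            "team": "platform",
            "service": "frontend",
        },
        "annotations": {
            "summary": "Synthetic alert to validate routing",
            "description": "Safe to ignore; sent by the routing test script",
        },
        # Auto-resolve after 5 minutes so the test alert does not linger.
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }
    resp = requests.post(f"{ALERTMANAGER}/api/v2/alerts", json=[alert], timeout=5)
    resp.raise_for_status()
    print("Test alert accepted; check the #platform Slack channel for delivery.")

if __name__ == "__main__":
    send_test_alert()
```

If the alert never reaches Slack, compare the labels against the route matchers in alertmanager.yml before suspecting the webhook itself.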
### Chapter 11 Summary
The capstone project demonstrates a production-ready observability stack with comprehensive monitoring, alerting, and visualization. It covers infrastructure monitoring, application performance tracking, business metrics, and incident response workflows. The project serves as a practical template for implementing Prometheus-based observability in real-world microservices environments.
### Final Exercise
1. **Deploy the Complete Stack**:
- Follow the deployment guide
- Verify all components are working
- Access all web interfaces
2. **Run Tests and Observe**:
- Execute load tests and watch metrics
- Trigger chaos tests and verify alerts
- Practice incident response workflows
3. **Customize and Extend**:
   - Add new metrics to services (a sketch follows this exercise)
- Create custom dashboards
- Implement additional alert rules
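As a starting point for the first item, here is a minimal sketch of what an additional business metric could look like in one of the Python services. The metric name, label set, and buckets are illustrative assumptions, not part of the existing code.

```python
# apps/order-service/metrics_extension.py (hypothetical module)
from prometheus_client import Histogram

# Distribution of order values; buckets chosen arbitrarily for illustration.
ORDER_VALUE = Histogram(
    'order_value_dollars',
    'Distribution of order values',
    ['service'],
    buckets=[5, 10, 25, 50, 100, 250, 500, 1000]
)

def record_order(total_dollars: float) -> None:
    """Call this wherever the order service finalizes an order."""
    ORDER_VALUE.labels(service='order-service').observe(total_dollars)
```

A matching recording rule and Grafana panel would follow the same `service:*` naming pattern used in Step 4.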
---
## 12. Appendices
### Appendix A: PromQL Cheat Sheet
#### Basic Selectors
```promql
# Simple metric selection
http_requests_total
# Label matching
http_requests_total{method="GET"}
http_requests_total{method!="GET"}
http_requests_total{method=~"GET|POST"}
http_requests_total{method!~"GET|POST"}
# Multiple labels
http_requests_total{method="GET", status="200"}MarkdownTime Series Types
# Instant vector (single value per series)
up
# Range vector (range of values over time)
up[5m]
# Scalar (single numeric value)
42
```

#### Rate and Counter Functions

```promql
# Rate: per-second average rate
rate(http_requests_total[5m])
# Increase: total increase over time window
increase(http_requests_total[5m])
# irate: instantaneous rate
irate(http_requests_total[5m])
# Delta: difference between first and last value
delta(cpu_temp_celsius[2h])
```

#### Aggregation Operators

```promql
# Sum
sum(http_requests_total)
sum by (job) (http_requests_total)
sum without (instance) (http_requests_total)
# Average
avg(node_cpu_seconds_total)
avg by (mode) (node_cpu_seconds_total)
# Count
count(up)
count by (job) (up)
# Min/Max
min(node_filesystem_free_bytes)
max(node_filesystem_free_bytes)
# Quantile
quantile(0.95, http_request_duration_seconds)
# Top/Bottom K
topk(5, http_requests_total)
bottomk(3, node_filesystem_free_bytes)
```

#### Mathematical Functions

```promql
# Arithmetic operators
node_memory_MemTotal_bytes - node_memory_MemFree_bytes
rate(http_requests_total[5m]) * 60
# Mathematical functions
abs(delta(cpu_temp_celsius[5m]))
ceil(rate(http_requests_total[5m]))
floor(rate(http_requests_total[5m]))
round(rate(http_requests_total[5m]), 0.1)
sqrt(rate(http_requests_total[5m]))
ln(rate(http_requests_total[5m]))
log10(rate(http_requests_total[5m]))
```

#### Histogram Functions

```promql
# Quantiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Average from histogram
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Request rate from histogram
rate(http_request_duration_seconds_count[5m])
```

#### Time Functions

```promql
# Current time
time()
# Timestamp of samples
timestamp(up)
# Time-based filtering
hour() > 9 and hour() < 17 # Business hours
day_of_week() > 0 and day_of_week() < 6 # Weekdays
# Prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
```

#### String Functions

```promql
# Label manipulation
label_replace(up, "instance_short", "$1", "instance", "([^:]+):.*")
label_join(up, "instance_job", ":", "instance", "job")
```

#### Comparison Operators

```promql
# Comparison
node_filesystem_free_bytes < 1000000000 # Less than 1GB
rate(http_requests_total[5m]) > 10 # More than 10 req/s
# Boolean operators
up == 1 and on(instance) node_load1 > 2
up == 0 or on(instance) node_filesystem_free_bytes < 1000000000
```

#### Advanced Patterns

```promql
# SLI/SLO calculations
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Error budget burn rate
(1 - sli_availability) / (1 - slo_target) > burn_rate_threshold
# Multi-service aggregation
sum by (environment) (rate(http_requests_total[5m]))
# Cross-metric calculations
rate(http_requests_total[5m]) / on(instance) group_left sum by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

### Appendix B: Exporter Catalog
Official Exporters
| Exporter | Purpose | Port | Key Metrics |
|---|---|---|---|
| Node Exporter | System metrics | 9100 | CPU, memory, disk, network |
| Blackbox Exporter | External monitoring | 9115 | HTTP, DNS, TCP, ICMP |
| MySQL Exporter | MySQL database | 9104 | Connections, queries, performance |
| Redis Exporter | Redis database | 9121 | Memory, commands, keys |
| HAProxy Exporter | HAProxy load balancer | 8404 | Requests, responses, health |
| NGINX Exporter | NGINX web server | 9113 | Requests, connections, status |
Third-party Exporters
| Exporter | Purpose | Port | Key Metrics |
|---|---|---|---|
| Postgres Exporter | PostgreSQL database | 9187 | Connections, queries, locks |
| MongoDB Exporter | MongoDB database | 9216 | Operations, connections, memory |
| Elasticsearch Exporter | Elasticsearch | 9114 | Cluster health, indices, queries |
| RabbitMQ Exporter | RabbitMQ message broker | 9419 | Queues, messages, connections |
| Kafka Exporter | Apache Kafka | 9308 | Topics, partitions, lag |
| JMX Exporter | Java applications | 8080 | JVM metrics, garbage collection |
Cloud Provider Exporters
| Exporter | Purpose | Key Metrics |
|---|---|---|
| AWS CloudWatch Exporter | AWS services | EC2, RDS, ELB metrics |
| Azure Monitor Exporter | Azure services | VM, storage, network metrics |
| GCP Monitoring Exporter | Google Cloud | Compute, storage, network metrics |
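Before wiring any of these exporters into prometheus.yml, it helps to confirm the exporter actually serves metrics on its default port. A rough sketch, assuming a couple of the exporters above run locally on the ports listed in the tables:

```python
# Quick spot check of exporter /metrics endpoints; hosts and ports are assumptions
# based on the default ports listed above.
import requests

EXPORTERS = {
    "node-exporter": "http://localhost:9100/metrics",
    "blackbox-exporter": "http://localhost:9115/metrics",
}

for name, url in EXPORTERS.items():
    try:
        body = requests.get(url, timeout=5).text
        # Count sample lines, ignoring comments (# HELP / # TYPE)
        samples = [line for line in body.splitlines() if line and not line.startswith("#")]
        print(f"{name}: {len(samples)} samples exposed")
    except requests.RequestException as exc:
        print(f"{name}: not reachable ({exc})")
```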
Configuration Examples
Node Exporter

```yaml
# docker-compose.yml
node-exporter:
image: prom/node-exporter:latest
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
ports:
- "9100:9100"YAMLBlackbox Exporter
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_status_codes: []
method: GET
      follow_redirects: true
```

MySQL Exporter

```ini
# Environment variables
DATA_SOURCE_NAME: "user:password@(mysql:3306)/"
# Or configuration file
[client]
user = exporter
password = password
host = mysql
port = 3306
```

### Appendix C: Alert Rule Templates
Infrastructure Alerts

```yaml
groups:
- name: node_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} is down"
- alert: HighCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
- alert: HighMemory
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
- alert: DiskSpaceCritical
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Critical disk space on {{ $labels.instance }}"YAMLApplication Alerts
groups:
- name: application_alerts
rules:
- alert: ServiceDown
expr: up{job=~".*-service"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is down"
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.job }}"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency for {{ $labels.job }}"
- alert: LowThroughput
expr: rate(http_requests_total[5m]) < 1
for: 10m
labels:
severity: warning
annotations:
summary: "Low throughput for {{ $labels.job }}"YAMLDatabase Alerts
groups:
- name: database_alerts
rules:
- alert: DatabaseDown
expr: mysql_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Database {{ $labels.instance }} is down"
- alert: HighConnections
expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "High database connections on {{ $labels.instance }}"
- alert: SlowQueries
expr: rate(mysql_global_status_slow_queries[5m]) > 0
for: 5m
labels:
          severity: warning
        annotations:
          summary: "MySQL slow queries detected on {{ $labels.instance }}"
```
11. Capstone Project
Project Overview
Build a complete observability stack for a sample e-commerce application with microservices architecture, including metrics collection, alerting, visualization, and incident response workflows.
Architecture Overview
graph TB
subgraph "Application Layer"
A[Frontend Service] --> B[User Service]
A --> C[Product Service]
A --> D[Order Service]
D --> E[Payment Service]
D --> F[Inventory Service]
B --> G[User Database]
C --> H[Product Database]
D --> I[Order Database]
end
subgraph "Observability Layer"
J[Prometheus] --> K[Alertmanager]
J --> L[Grafana]
M[Node Exporter] --> J
N[Application Metrics] --> J
O[Blackbox Exporter] --> J
K --> P[Slack/Email]
L --> Q[Dashboards]
end
A --> N
B --> N
C --> N
D --> N
E --> N
F --> NStep 1: Infrastructure Setup
Docker Compose Environment
# docker-compose.yml
version: '3.8'
networks:
monitoring:
driver: bridge
app:
driver: bridge
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
services:
# Prometheus
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus:/etc/prometheus
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
networks:
- monitoring
- app
restart: unless-stopped
# Alertmanager
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager:/etc/alertmanager
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
networks:
- monitoring
restart: unless-stopped
# Grafana
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
networks:
- monitoring
restart: unless-stopped
# Node Exporter
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
networks:
- monitoring
restart: unless-stopped
# Blackbox Exporter
blackbox-exporter:
image: prom/blackbox-exporter:latest
container_name: blackbox-exporter
ports:
- "9115:9115"
volumes:
- ./blackbox:/etc/blackbox_exporter
networks:
- monitoring
restart: unless-stopped
# Application Services
frontend:
build: ./apps/frontend
container_name: frontend
ports:
- "8080:8080"
environment:
- USER_SERVICE_URL=http://user-service:8081
- PRODUCT_SERVICE_URL=http://product-service:8082
- ORDER_SERVICE_URL=http://order-service:8083
networks:
- app
restart: unless-stopped
user-service:
build: ./apps/user-service
container_name: user-service
ports:
- "8081:8081"
environment:
- DATABASE_URL=postgresql://user:password@user-db:5432/users
networks:
- app
restart: unless-stopped
product-service:
build: ./apps/product-service
container_name: product-service
ports:
- "8082:8082"
environment:
- DATABASE_URL=postgresql://product:password@product-db:5432/products
networks:
- app
restart: unless-stopped
order-service:
build: ./apps/order-service
container_name: order-service
ports:
- "8083:8083"
environment:
- DATABASE_URL=postgresql://order:password@order-db:5432/orders
- PAYMENT_SERVICE_URL=http://payment-service:8084
- INVENTORY_SERVICE_URL=http://inventory-service:8085
networks:
- app
restart: unless-stopped
payment-service:
build: ./apps/payment-service
container_name: payment-service
ports:
- "8084:8084"
networks:
- app
restart: unless-stopped
inventory-service:
build: ./apps/inventory-service
container_name: inventory-service
ports:
- "8085:8085"
networks:
- app
restart: unless-stopped
# Databases
user-db:
image: postgres:13
container_name: user-db
environment:
- POSTGRES_DB=users
- POSTGRES_USER=user
- POSTGRES_PASSWORD=password
volumes:
- ./data/user-db:/var/lib/postgresql/data
networks:
- app
product-db:
image: postgres:13
container_name: product-db
environment:
- POSTGRES_DB=products
- POSTGRES_USER=product
- POSTGRES_PASSWORD=password
volumes:
- ./data/product-db:/var/lib/postgresql/data
networks:
- app
order-db:
image: postgres:13
container_name: order-db
environment:
- POSTGRES_DB=orders
- POSTGRES_USER=order
- POSTGRES_PASSWORD=password
volumes:
- ./data/order-db:/var/lib/postgresql/data
networks:
- appYAMLStep 2: Application Instrumentation
Frontend Service (Go)
// apps/frontend/main.go
package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"service", "method", "endpoint", "status"},
)
httpRequestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"service", "method", "endpoint"},
)
upstreamRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "upstream_requests_total",
Help: "Total upstream requests",
},
[]string{"service", "target_service", "status"},
)
businessMetrics = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "business_events_total",
Help: "Business events counter",
},
[]string{"service", "event_type"},
)
)
func init() {
prometheus.MustRegister(httpRequestsTotal)
prometheus.MustRegister(httpRequestDuration)
prometheus.MustRegister(upstreamRequestsTotal)
prometheus.MustRegister(businessMetrics)
}
func instrumentHandler(service, endpoint string, handler http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Wrap ResponseWriter to capture status code
ww := &responseWriter{ResponseWriter: w, statusCode: 200}
handler(ww, r)
duration := time.Since(start).Seconds()
status := fmt.Sprintf("%d", ww.statusCode)
httpRequestsTotal.WithLabelValues(service, r.Method, endpoint, status).Inc()
httpRequestDuration.WithLabelValues(service, r.Method, endpoint).Observe(duration)
}
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
func homeHandler(w http.ResponseWriter, r *http.Request) {
businessMetrics.WithLabelValues("frontend", "page_view").Inc()
response := map[string]string{
"service": "frontend",
"status": "healthy",
"version": "1.0.0",
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(response)
}
func usersHandler(w http.ResponseWriter, r *http.Request) {
userServiceURL := os.Getenv("USER_SERVICE_URL")
if userServiceURL == "" {
userServiceURL = "http://localhost:8081"
}
start := time.Now()
resp, err := http.Get(userServiceURL + "/users")
duration := time.Since(start).Seconds()
status := "500"
if err == nil {
status = fmt.Sprintf("%d", resp.StatusCode)
defer resp.Body.Close()
}
upstreamRequestsTotal.WithLabelValues("frontend", "user-service", status).Inc()
if err != nil {
http.Error(w, "User service unavailable", http.StatusServiceUnavailable)
return
}
businessMetrics.WithLabelValues("frontend", "user_list_view").Inc()
w.Header().Set("Content-Type", "application/json")
w.Write([]byte(`{"users": []}`))
}
func main() {
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/", instrumentHandler("frontend", "/", homeHandler))
http.HandleFunc("/users", instrumentHandler("frontend", "/users", usersHandler))
http.HandleFunc("/health", instrumentHandler("frontend", "/health", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
}))
log.Println("Frontend service starting on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}GoUser Service (Python)
# apps/user-service/app.py
from flask import Flask, jsonify, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time
import psycopg2
import os
app = Flask(__name__)
# Prometheus metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['service', 'method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['service', 'method', 'endpoint'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
DATABASE_CONNECTIONS = Gauge(
'database_connections_active',
'Active database connections',
['service', 'database']
)
BUSINESS_EVENTS = Counter(
'business_events_total',
'Business events',
['service', 'event_type']
)
def instrument_request(f):
def wrapper(*args, **kwargs):
start_time = time.time()
status = '200'
try:
result = f(*args, **kwargs)
return result
except Exception as e:
status = '500'
raise
finally:
REQUEST_COUNT.labels(
service='user-service',
method=request.method,
endpoint=request.endpoint or 'unknown',
status=status
).inc()
REQUEST_DURATION.labels(
service='user-service',
method=request.method,
endpoint=request.endpoint or 'unknown'
).observe(time.time() - start_time)
wrapper.__name__ = f.__name__
return wrapper
@app.route('/')
@instrument_request
def home():
return jsonify({
'service': 'user-service',
'status': 'healthy',
'version': '1.0.0'
})
@app.route('/users')
@instrument_request
def get_users():
BUSINESS_EVENTS.labels(service='user-service', event_type='user_list_request').inc()
# Simulate database query
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
time.sleep(0.01) # Simulate query time
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
return jsonify({
'users': [
{'id': 1, 'name': 'John Doe', 'email': 'john@example.com'},
{'id': 2, 'name': 'Jane Smith', 'email': 'jane@example.com'}
]
})
@app.route('/users/<int:user_id>')
@instrument_request
def get_user(user_id):
BUSINESS_EVENTS.labels(service='user-service', event_type='user_detail_request').inc()
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
time.sleep(0.005)
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
return jsonify({
'id': user_id,
'name': f'User {user_id}',
'email': f'user{user_id}@example.com'
})
@app.route('/health')
@instrument_request
def health():
return jsonify({'status': 'healthy'})
@app.route('/metrics')
def metrics():
return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8081)PythonStep 3: Prometheus Configuration
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'ecommerce'
environment: 'production'
rule_files:
- "alert_rules.yml"
- "recording_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
scrape_interval: 30s
# Application services
- job_name: 'frontend'
static_configs:
- targets: ['frontend:8080']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'user-service'
static_configs:
- targets: ['user-service:8081']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'product-service'
static_configs:
- targets: ['product-service:8082']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'order-service'
static_configs:
- targets: ['order-service:8083']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'payment-service'
static_configs:
- targets: ['payment-service:8084']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'inventory-service'
static_configs:
- targets: ['inventory-service:8085']
metrics_path: '/metrics'
scrape_interval: 15s
# Blackbox monitoring
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- http://frontend:8080/health
- http://user-service:8081/health
- http://product-service:8082/health
- http://order-service:8083/health
- http://payment-service:8084/health
- http://inventory-service:8085/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115YAMLStep 4: Recording Rules
# prometheus/recording_rules.yml
groups:
- name: application_rules
interval: 30s
rules:
# Request rates
- record: service:request_rate:rate5m
expr: sum by (service) (rate(http_requests_total[5m]))
- record: service:request_rate:rate1h
expr: sum by (service) (rate(http_requests_total[1h]))
# Error rates
- record: service:error_rate:rate5m
expr: |
sum by (service) (rate(http_requests_total{status=~"[45].."}[5m])) /
sum by (service) (rate(http_requests_total[5m]))
# Latency percentiles
- record: service:request_duration:p50
expr: |
histogram_quantile(0.50,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: service:request_duration:p95
expr: |
histogram_quantile(0.95,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: service:request_duration:p99
expr: |
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- name: infrastructure_rules
interval: 30s
rules:
# Node metrics
- record: node:cpu_usage:rate5m
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: node:memory_usage:percentage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- record: node:disk_usage:percentage
expr: |
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
- name: business_rules
interval: 60s
rules:
# Business metrics
- record: business:page_views:rate1h
expr: rate(business_events_total{event_type="page_view"}[1h]) * 3600
- record: business:user_requests:rate1h
expr: rate(business_events_total{event_type=~"user_.*"}[1h]) * 3600
# Service dependency health
- record: service:dependency_success_rate:rate5m
expr: |
sum by (service, target_service) (rate(upstream_requests_total{status=~"2.."}[5m])) /
sum by (service, target_service) (rate(upstream_requests_total[5m]))YAMLStep 5: Alerting Rules
# prometheus/alert_rules.yml
groups:
- name: infrastructure_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Node is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute"
runbook_url: "https://runbooks.company.com/node-down"
- alert: HighCPUUsage
expr: node:cpu_usage:rate5m > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage"
description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: node:memory_usage:percentage > 85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High memory usage"
description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
- name: application_alerts
rules:
- alert: ServiceDown
expr: up{job=~"frontend|.*-service"} == 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "Service is down"
description: "Service {{ $labels.job }} is down"
- alert: HighErrorRate
expr: service:error_rate:rate5m > 0.05
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate for {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
- alert: HighLatency
expr: service:request_duration:p95 > 1
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High latency for {{ $labels.service }}"
description: "95th percentile latency is {{ $value }}s for {{ $labels.service }}"
- alert: LowRequestRate
expr: service:request_rate:rate5m < 0.1
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "Low request rate for {{ $labels.service }}"
description: "Request rate is {{ $value }} req/s for {{ $labels.service }}"
- name: business_alerts
rules:
- alert: LowPageViews
expr: business:page_views:rate1h < 10
for: 15m
labels:
severity: warning
team: product
annotations:
summary: "Low page view rate"
description: "Page view rate is {{ $value }} views/hour"
- alert: ServiceDependencyFailure
expr: service:dependency_success_rate:rate5m < 0.95
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Service dependency failure"
description: "{{ $labels.service }} -> {{ $labels.target_service }} success rate is {{ $value | humanizePercentage }}"YAMLStep 6: Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@ecommerce.local'
smtp_auth_username: 'alerts@ecommerce.local'
smtp_auth_password: 'your-app-password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
# Critical alerts to on-call
- matchers:
- severity=critical
receiver: 'critical-alerts'
continue: true
# Infrastructure team alerts
- matchers:
- team=infrastructure
receiver: 'infrastructure-team'
# Platform team alerts
- matchers:
- team=platform
receiver: 'platform-team'
# Product team alerts
- matchers:
- team=product
receiver: 'product-team'
receivers:
- name: 'default'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Service:* {{ .Labels.service }}
{{ end }}
- name: 'critical-alerts'
email_configs:
- to: 'oncall@ecommerce.local'
subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Service: {{ .Labels.service }}
Runbook: {{ .Annotations.runbook_url }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#critical-alerts'
title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
- name: 'infrastructure-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#infrastructure'
title: '⚠️ Infrastructure Alert: {{ .GroupLabels.alertname }}'
- name: 'platform-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#platform'
title: '🔧 Platform Alert: {{ .GroupLabels.alertname }}'
- name: 'product-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#product'
title: '📊 Business Alert: {{ .GroupLabels.alertname }}'
inhibit_rules:
# Don't send warning alerts if critical alerts are firing
- source_matchers:
- severity=critical
target_matchers:
- severity=warning
equal: ['service']
# Don't send service alerts if node is down
- source_matchers:
- alertname=NodeDown
target_matchers:
- alertname=ServiceDown
equal: ['instance']YAMLStep 7: Grafana Dashboards
Infrastructure Dashboard
# grafana/provisioning/dashboards/infrastructure.json
{
"dashboard": {
"id": null,
"title": "Infrastructure Overview",
"tags": ["infrastructure", "monitoring"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "CPU Usage",
"type": "stat",
"targets": [
{
"expr": "node:cpu_usage:rate5m",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Memory Usage",
"type": "stat",
"targets": [
{
"expr": "node:memory_usage:percentage",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
},
{
"id": 3,
"title": "CPU Usage Over Time",
"type": "graph",
"targets": [
{
"expr": "node:cpu_usage:rate5m",
"legendFormat": "{{ instance }}"
}
],
"yAxes": [
{
"unit": "percent",
"max": 100,
"min": 0
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
}
]
}
}JSONApplication Dashboard
# grafana/provisioning/dashboards/application.json
{
"dashboard": {
"id": null,
"title": "Application Performance",
"tags": ["application", "performance"],
"timezone": "browser",
"refresh": "30s",
"templating": {
"list": [
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total, service)",
"refresh": 1,
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "service:request_rate:rate5m{service=~\"$service\"}",
"legendFormat": "{{ service }}"
}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "service:error_rate:rate5m{service=~\"$service\"} * 100",
"legendFormat": "{{ service }}"
}
],
"yAxes": [
{
"unit": "percent",
"max": 100,
"min": 0
}
],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
},
{
"id": 3,
"title": "Response Time Percentiles",
"type": "graph",
"targets": [
{
"expr": "service:request_duration:p50{service=~\"$service\"}",
"legendFormat": "{{ service }} - 50th"
},
{
"expr": "service:request_duration:p95{service=~\"$service\"}",
"legendFormat": "{{ service }} - 95th"
},
{
"expr": "service:request_duration:p99{service=~\"$service\"}",
"legendFormat": "{{ service }} - 99th"
}
],
"yAxes": [
{
"unit": "s"
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
}
]
}
}

### Step 8: Testing and Validation

#### Load Testing Script
# scripts/load_test.py
import requests
import time
import random
import threading
from concurrent.futures import ThreadPoolExecutor
BASE_URL = "http://localhost:8080"
def make_request(endpoint):
"""Make a request to the specified endpoint"""
try:
response = requests.get(f"{BASE_URL}{endpoint}", timeout=5)
return response.status_code
except Exception as e:
print(f"Error calling {endpoint}: {e}")
return 500
def generate_load(stop_event):
    """Generate load on the application until stop_event is set"""
    endpoints = ["/", "/users", "/health"]
    while not stop_event.is_set():
        endpoint = random.choice(endpoints)
        make_request(endpoint)
        # Add some randomness to the load
        time.sleep(random.uniform(0.1, 1.0))

def run_load_test(duration_minutes=10, concurrent_users=5):
    """Run load test for the specified duration"""
    print(f"Starting load test with {concurrent_users} concurrent users for {duration_minutes} minutes")
    stop_event = threading.Event()
    with ThreadPoolExecutor(max_workers=concurrent_users) as executor:
        # Submit one load-generation worker per simulated user
        futures = [executor.submit(generate_load, stop_event) for _ in range(concurrent_users)]
        # Let it run for the specified duration, then signal the workers to stop.
        # (Future.cancel() cannot stop a worker that is already running.)
        time.sleep(duration_minutes * 60)
        stop_event.set()
        for future in futures:
            future.result()
if __name__ == "__main__":
    run_load_test(duration_minutes=5, concurrent_users=10)

#### Chaos Testing
# scripts/chaos_test.py
import docker
import time
import random
client = docker.from_env()
def stop_random_service():
"""Stop a random service for chaos testing"""
services = ['user-service', 'product-service', 'order-service']
service_name = random.choice(services)
try:
container = client.containers.get(service_name)
print(f"Stopping {service_name}")
container.stop()
# Wait for some time
time.sleep(30)
print(f"Starting {service_name}")
container.start()
except Exception as e:
print(f"Error with {service_name}: {e}")
def simulate_high_load():
"""Simulate high CPU load on a container"""
try:
container = client.containers.get('frontend')
print("Simulating high CPU load")
# Run stress test inside container
container.exec_run("stress --cpu 2 --timeout 60s", detach=True)
except Exception as e:
print(f"Error simulating load: {e}")
if __name__ == "__main__":
print("Starting chaos testing...")
# Run different chaos scenarios
stop_random_service()
time.sleep(120)
simulate_high_load()
    time.sleep(120)

### Step 9: Deployment Script
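The deployment script below polls each service's health endpoint after `docker-compose up`. The same check can be expressed as a Compose healthcheck so the stack reports readiness on its own; a hedged fragment (the service name and the availability of `curl` inside the image are assumptions):

```yaml
# docker-compose.yml fragment (illustrative)
services:
  frontend:
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:8080/health || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 5
      start_period: 15s
```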
#!/bin/bash
# scripts/deploy.sh
set -e
echo "Starting E-commerce Observability Stack deployment..."
# Create necessary directories
mkdir -p data/{user-db,product-db,order-db}
mkdir -p prometheus grafana/provisioning/{datasources,dashboards}
mkdir -p alertmanager blackbox
# Set permissions
chmod 777 data/{user-db,product-db,order-db}
# Build application images
echo "Building application images..."
for service in frontend user-service product-service order-service payment-service inventory-service; do
echo "Building $service..."
docker build -t ecommerce/$service:latest apps/$service/
done
# Start the stack
echo "Starting services..."
docker-compose up -d
# Wait for services to be ready
echo "Waiting for services to start..."
sleep 30
# Check service health
echo "Checking service health..."
services=("prometheus:9090" "grafana:3000" "alertmanager:9093" "frontend:8080")
for service in "${services[@]}"; do
IFS=':' read -r name port <<< "$service"
echo "Checking $name on port $port..."
for i in {1..30}; do
if curl -f "http://localhost:$port/health" 2>/dev/null || curl -f "http://localhost:$port" 2>/dev/null; then
echo "$name is healthy"
break
fi
if [ $i -eq 30 ]; then
echo "Warning: $name may not be ready"
fi
sleep 2
done
done
echo "Deployment complete!"
echo "Access URLs:"
echo " Prometheus: http://localhost:9090"
echo " Grafana: http://localhost:3000 (admin/admin123)"
echo " Alertmanager: http://localhost:9093"
echo " Application: http://localhost:8080"
echo "Run load tests with: python scripts/load_test.py"
echo "Run chaos tests with: python scripts/chaos_test.py"BashStep 10: Documentation and Runbooks
README.md
# E-commerce Observability Stack
This project demonstrates a complete observability setup for a microservices-based e-commerce application using Prometheus, Grafana, and Alertmanager.
## Architecture
- **Frontend Service** (Go): Main web interface
- **User Service** (Python): User management
- **Product Service** (Python): Product catalog
- **Order Service** (Python): Order processing
- **Payment Service** (Python): Payment processing
- **Inventory Service** (Python): Inventory management
## Deployment
```bash
# Clone the repository
git clone <repository-url>
cd ecommerce-observability
# Deploy the stack
./scripts/deploy.sh
```

## Access Points

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin123)
- Alertmanager: http://localhost:9093
- Application: http://localhost:8080

## Testing

### Load Testing

`python scripts/load_test.py`

### Chaos Testing

`python scripts/chaos_test.py`

## Monitoring

### Key Metrics

- Request rate per service
- Error rate per service
- Response time percentiles
- Infrastructure utilization

### Alerts

- Service down
- High error rate (>5%)
- High latency (>1s p95)
- Infrastructure issues

## Troubleshooting

### Service Discovery Issues

Check Prometheus targets: http://localhost:9090/targets

### Missing Metrics

Verify that service /metrics endpoints are accessible.

### Alert Not Firing

Check Prometheus rules: http://localhost:9090/rules
### Project Validation
#### Verification Checklist
1. **✅ Infrastructure Monitoring**
- [ ] Node exporter collecting system metrics
- [ ] CPU, memory, disk usage visible in Grafana
- [ ] Infrastructure alerts firing correctly
2. **✅ Application Monitoring**
- [ ] All services exposing metrics
- [ ] Request rate, error rate, latency tracked
- [ ] Business metrics instrumented
3. **✅ Alerting**
- [ ] Critical alerts configured
- [ ] Alert routing working
- [ ] Notification channels tested
4. **✅ Visualization**
- [ ] Infrastructure dashboard functional
- [ ] Application dashboard with filters
- [ ] Business metrics dashboard
5. **✅ Testing**
- [ ] Load testing generating metrics
- [ ] Chaos testing triggering alerts
- [ ] Recovery scenarios validated
### Chapter 11 Summary
The capstone project demonstrates a production-ready observability stack with comprehensive monitoring, alerting, and visualization. It covers infrastructure monitoring, application performance tracking, business metrics, and incident response workflows. The project serves as a practical template for implementing Prometheus-based observability in real-world microservices environments.
### Final Exercise
1. **Deploy the Complete Stack**:
- Follow the deployment guide
- Verify all components are working
- Access all web interfaces
2. **Run Tests and Observe**:
- Execute load tests and watch metrics
- Trigger chaos tests and verify alerts
- Practice incident response workflows
3. **Customize and Extend**:
- Add new metrics to services
- Create custom dashboards
- Implement additional alert rules
---
## 12. Appendices
### Appendix A: PromQL Cheat Sheet
#### Basic Selectors
```promql
# Simple metric selection
http_requests_total
# Label matching
http_requests_total{method="GET"}
http_requests_total{method!="GET"}
http_requests_total{method=~"GET|POST"}
http_requests_total{method!~"GET|POST"}
# Multiple labels
http_requests_total{method="GET", status="200"}MarkdownTime Series Types
# Instant vector (single value per series)
up
# Range vector (range of values over time)
up[5m]
# Scalar (single numeric value)
42

#### Rate and Counter Functions
# Rate: per-second average rate
rate(http_requests_total[5m])
# Increase: total increase over time window
increase(http_requests_total[5m])
# irate: instantaneous rate
irate(http_requests_total[5m])
# Delta: difference between first and last value
delta(cpu_temp_celsius[2h])

#### Aggregation Operators
# Sum
sum(http_requests_total)
sum by (job) (http_requests_total)
sum without (instance) (http_requests_total)
# Average
avg(node_cpu_seconds_total)
avg by (mode) (node_cpu_seconds_total)
# Count
count(up)
count by (job) (up)
# Min/Max
min(node_filesystem_free_bytes)
max(node_filesystem_free_bytes)
# Quantile
quantile(0.95, http_request_duration_seconds)
# Top/Bottom K
topk(5, http_requests_total)
bottomk(3, node_filesystem_free_bytes)

#### Mathematical Functions
# Arithmetic operators
node_memory_MemTotal_bytes - node_memory_MemFree_bytes
rate(http_requests_total[5m]) * 60
# Mathematical functions
abs(delta(cpu_temp_celsius[5m]))
ceil(rate(http_requests_total[5m]))
floor(rate(http_requests_total[5m]))
round(rate(http_requests_total[5m]), 0.1)
sqrt(rate(http_requests_total[5m]))
ln(rate(http_requests_total[5m]))
log10(rate(http_requests_total[5m]))

#### Histogram Functions
# Quantiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Average from histogram
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Request rate from histogram
rate(http_request_duration_seconds_count[5m])

#### Time Functions
# Current time
time()
# Timestamp of samples
timestamp(up)
# Time-based filtering
hour() > 9 and hour() < 17 # Business hours
day_of_week() > 0 and day_of_week() < 6 # Weekdays
# Prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)

#### String Functions
# Label manipulation
label_replace(up, "instance_short", "$1", "instance", "([^:]+):.*")
label_join(up, "instance_job", ":", "instance", "job")

#### Comparison Operators
# Comparison
node_filesystem_free_bytes < 1000000000 # Less than 1GB
rate(http_requests_total[5m]) > 10 # More than 10 req/s
# Boolean operators
up == 1 and on(instance) node_load1 > 2
up == 0 or on(instance) node_filesystem_free_bytes < 1000000000

#### Advanced Patterns
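The burn-rate expression below refers to `sli_availability`, `slo_target`, and `burn_rate_threshold`, which are not built-in metrics. One way to make the expression work as written is to materialize them with recording rules; a hedged sketch (the names and the 99% target are illustrative):

```yaml
groups:
  - name: slo_rules
    rules:
      - record: sli_availability
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      # Constants recorded as series so the burn-rate query stays self-contained
      - record: slo_target
        expr: vector(0.99)
      - record: burn_rate_threshold
        expr: vector(14.4)
```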
# SLI/SLO calculations
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Error budget burn rate
(1 - sli_availability) / (1 - slo_target) > burn_rate_threshold
# Multi-service aggregation
sum by (environment) (rate(http_requests_total[5m]))
# Cross-metric calculations
rate(http_requests_total[5m]) / on(instance) group_left rate(node_cpu_seconds_total{mode="idle"}[5m])

### Appendix B: Exporter Catalog

#### Official Exporters
| Exporter | Purpose | Port | Key Metrics |
|---|---|---|---|
| Node Exporter | System metrics | 9100 | CPU, memory, disk, network |
| Blackbox Exporter | External monitoring | 9115 | HTTP, DNS, TCP, ICMP |
| MySQL Exporter | MySQL database | 9104 | Connections, queries, performance |
| Redis Exporter | Redis database | 9121 | Memory, commands, keys |
| HAProxy Exporter | HAProxy load balancer | 8404 | Requests, responses, health |
| NGINX Exporter | NGINX web server | 9113 | Requests, connections, status |
| **RabbitMQ Exporter** | RabbitMQ message broker | 9419 | Queues, messages, connections |
| **Kafka Exporter** | Apache Kafka | 9308 | Topics, partitions, lag |
| **JMX Exporter** | Java applications | 8080 | JVM metrics, garbage collection |
| **Consul Exporter** | HashiCorp Consul | 9107 | Service health, cluster status |
| **Memcached Exporter** | Memcached | 9150 | Cache hits/misses, memory usage |
| **StatsD Exporter** | StatsD metrics | 9102 | Custom application metrics |
#### Cloud Provider Exporters
| Exporter | Purpose | Key Metrics |
|----------|---------|-------------|
| **AWS CloudWatch Exporter** | AWS services | EC2, RDS, ELB metrics |
| **Azure Monitor Exporter** | Azure services | VM, storage, network metrics |
| **GCP Monitoring Exporter** | Google Cloud | Compute, storage, network metrics |
| **DigitalOcean Exporter** | DigitalOcean | Droplet metrics, load balancers |
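As an example from this category, the CloudWatch exporter is driven by a list of namespace/metric/dimension combinations; a minimal sketch (the region and metric selection are illustrative, not a recommendation):

```yaml
# cloudwatch_exporter config.yml (illustrative)
region: us-east-1
metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]
  - aws_namespace: AWS/RDS
    aws_metric_name: DatabaseConnections
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]
```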
#### Configuration Examples
##### Node Exporter
```yaml
# docker-compose.yml
node-exporter:
image: prom/node-exporter:latest
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
- '--collector.textfile.directory=/host/textfile_collector'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
- /var/log:/host/var/log:ro
ports:
- "9100:9100"
    network_mode: host
```

##### Blackbox Exporter
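The modules below only describe how a probe is performed; Prometheus selects a module and hands over the target through relabeling. A typical scrape job that pairs with the `http_2xx` module (the target URLs are placeholders):

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```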
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_status_codes: []
method: GET
follow_redirects: true
preferred_ip_protocol: "ip4"
headers:
User-Agent: "Prometheus Blackbox Exporter"
http_post_2xx:
prober: http
timeout: 5s
http:
method: POST
headers:
Content-Type: application/json
body: '{"health": "check"}'
tcp_connect:
prober: tcp
timeout: 5s
ping:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
dns:
prober: dns
timeout: 5s
dns:
query_name: "example.com"
query_type: "A"
valid_rcodes:
      - NOERROR

##### MySQL Exporter
# Environment variables
DATA_SOURCE_NAME: "user:password@(mysql:3306)/"
# Or configuration file
[client]
user = exporter
password = password
host = mysql
port = 3306
# Prometheus scrape config
scrape_configs:
- job_name: 'mysql'
static_configs:
      - targets: ['mysql-exporter:9104']

##### PostgreSQL Exporter
# docker-compose.yml
postgres-exporter:
image: prometheuscommunity/postgres-exporter
environment:
DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/database?sslmode=disable"
ports:
- "9187:9187"YAMLRedis Exporter
redis-exporter:
image: oliver006/redis_exporter
environment:
REDIS_ADDR: "redis://redis:6379"
REDIS_PASSWORD: "your-redis-password"
ports:
- "9121:9121"YAMLAppendix C: Alert Rule Templates
Infrastructure Alerts
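These rule groups can be unit-tested with `promtool test rules` before they are deployed. A minimal test for the `NodeDown` rule defined below (the file names and the relative `rule_files` path are illustrative):

```yaml
# tests/infrastructure_alerts_test.yml (illustrative; run with: promtool test rules <file>)
rule_files:
  - ../infrastructure_alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node-exporter", instance="node1:9100"}'
        values: '0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 5m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infrastructure
              job: node-exporter
              instance: node1:9100
            exp_annotations:
              summary: "Node node1:9100 is down"
              description: "Node node1:9100 has been down for more than 1 minute"
              runbook_url: "https://runbooks.company.com/alerts/node-down"
```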
groups:
- name: node_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Node {{ $labels.instance }} is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute"
runbook_url: "https://runbooks.company.com/alerts/node-down"
- alert: HighCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }}"
- alert: CriticalCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
for: 2m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Critical CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }}"
- alert: HighMemory
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
for: 10m
labels:
severity: warning
team: infrastructure
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }} {{ $labels.mountpoint }}"
- alert: DiskSpaceCritical
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 95
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Critical disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }} {{ $labels.mountpoint }}"
- alert: HighLoadAverage
expr: node_load1 / count by (instance) (count by (instance, cpu) (node_cpu_seconds_total{mode="idle"})) > 1.5
for: 10m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High load average on {{ $labels.instance }}"
description: "Load average is {{ $value | printf \"%.2f\" }} on {{ $labels.instance }}"YAMLApplication Alerts
groups:
- name: application_alerts
rules:
- alert: ServiceDown
expr: up{job=~".*-service"} == 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "Service {{ $labels.job }} is down"
description: "Service {{ $labels.job }} on {{ $labels.instance }} is down"
runbook_url: "https://runbooks.company.com/alerts/service-down"
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate for {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High latency for {{ $labels.job }}"
description: "95th percentile latency is {{ $value | printf \"%.3f\" }}s for {{ $labels.job }}"
- alert: LowThroughput
expr: rate(http_requests_total[5m]) < 1
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "Low throughput for {{ $labels.job }}"
description: "Request rate is {{ $value | printf \"%.2f\" }} req/s for {{ $labels.job }}"
- alert: HighMemoryUsage
expr: (container_memory_usage_bytes{container!="POD",container!=""} / container_spec_memory_limit_bytes) * 100 > 90
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High memory usage for container {{ $labels.container }}"
description: "Memory usage is {{ $value | printf \"%.2f\" }}% for container {{ $labels.container }} in pod {{ $labels.pod }}"YAMLDatabase Alerts
groups:
- name: database_alerts
rules:
- alert: DatabaseDown
expr: mysql_up == 0
for: 1m
labels:
severity: critical
team: database
annotations:
summary: "Database {{ $labels.instance }} is down"
description: "MySQL database on {{ $labels.instance }} is not responding"
runbook_url: "https://runbooks.company.com/alerts/database-down"
- alert: HighConnections
expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "High database connections on {{ $labels.instance }}"
description: "Database connection usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
- alert: SlowQueries
expr: rate(mysql_global_status_slow_queries[5m]) > 0.1
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "High slow query rate on {{ $labels.instance }}"
description: "Slow query rate is {{ $value | printf \"%.2f\" }} queries/s on {{ $labels.instance }}"
- alert: DatabaseReplicationLag
expr: mysql_slave_lag_seconds > 30
for: 2m
labels:
severity: warning
team: database
annotations:
summary: "Database replication lag on {{ $labels.instance }}"
description: "Replication lag is {{ $value }}s on {{ $labels.instance }}"
- alert: PostgreSQLDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
team: database
annotations:
summary: "PostgreSQL {{ $labels.instance }} is down"
description: "PostgreSQL database on {{ $labels.instance }} is not responding"
- alert: PostgreSQLHighConnections
expr: sum by (instance) (pg_stat_activity_count) / pg_settings_max_connections > 0.8
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "High PostgreSQL connections on {{ $labels.instance }}"
description: "Connection usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"YAMLNetwork and External Service Alerts
groups:
- name: network_alerts
rules:
- alert: HighNetworkReceive
expr: rate(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024 # 100MB/s
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High network receive on {{ $labels.instance }}"
description: "Network receive is {{ $value | humanize1024 }}B/s on {{ $labels.instance }} interface {{ $labels.device }}"
- alert: HighNetworkTransmit
expr: rate(node_network_transmit_bytes_total[5m]) > 100 * 1024 * 1024 # 100MB/s
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High network transmit on {{ $labels.instance }}"
description: "Network transmit is {{ $value | humanize1024 }}B/s on {{ $labels.instance }} interface {{ $labels.device }}"
- alert: ExternalServiceDown
expr: probe_success{job="blackbox"} == 0
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "External service {{ $labels.instance }} is down"
description: "External service check for {{ $labels.instance }} is failing"
- alert: ExternalServiceSlowResponse
expr: probe_duration_seconds{job="blackbox"} > 5
for: 3m
labels:
severity: warning
team: platform
annotations:
summary: "External service {{ $labels.instance }} is slow"
description: "External service {{ $labels.instance }} is responding in {{ $value | printf \"%.2f\" }}s"YAMLBusiness Logic Alerts
groups:
- name: business_alerts
rules:
- alert: LowOrderRate
expr: rate(orders_total[1h]) * 3600 < 10
for: 15m
labels:
severity: warning
team: business
annotations:
summary: "Low order rate"
description: "Order rate is {{ $value | printf \"%.2f\" }} orders/hour"
- alert: HighCartAbandonmentRate
expr: |
(
rate(cart_abandoned_total[1h]) /
(rate(cart_created_total[1h]) + rate(cart_abandoned_total[1h]))
) > 0.7
for: 30m
labels:
severity: warning
team: business
annotations:
summary: "High cart abandonment rate"
description: "Cart abandonment rate is {{ $value | humanizePercentage }}"
- alert: PaymentProcessingFailures
expr: rate(payment_failed_total[5m]) / rate(payment_attempted_total[5m]) > 0.05
for: 10m
labels:
severity: critical
team: payments
annotations:
summary: "High payment failure rate"
description: "Payment failure rate is {{ $value | humanizePercentage }}"YAMLAppendix D: Grafana Dashboard Templates
Infrastructure Overview Dashboard
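For these JSON definitions to be loaded automatically, Grafana also needs a dashboard provider pointing at the provisioning directory; a minimal sketch (the container path is Grafana's conventional provisioning location and is an assumption about how the volume is mounted):

```yaml
# grafana/provisioning/dashboards/dashboards.yml (minimal sketch)
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
```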
{
"dashboard": {
"id": null,
"title": "Infrastructure Overview",
"tags": ["infrastructure", "overview"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(up{job=\"node-exporter\"}, instance)",
"refresh": 1,
"multi": true,
"includeAll": true,
"current": {
"value": "$__all",
"text": "All"
}
}
]
},
"panels": [
{
"id": 1,
"title": "System Load",
"type": "stat",
"targets": [
{
"expr": "node_load1{instance=~\"$instance\"}",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 2},
{"color": "red", "value": 4}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
},
{
"id": 2,
"title": "CPU Usage",
"type": "stat",
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\"}[5m])) * 100)",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
},
{
"id": 3,
"title": "Memory Usage",
"type": "stat",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}
},
{
"id": 4,
"title": "Disk Usage",
"type": "stat",
"targets": [
{
"expr": "(1 - (node_filesystem_avail_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"})) * 100",
"legendFormat": "{{ instance }}:{{ mountpoint }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}
}
]
}
}

#### Application Performance Dashboard
{
"dashboard": {
"id": null,
"title": "Application Performance",
"tags": ["application", "performance"],
"timezone": "browser",
"refresh": "30s",
"templating": {
"list": [
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "environment",
"type": "query",
"query": "label_values(http_requests_total, environment)",
"refresh": 1
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service)",
"legendFormat": "{{ service }}"
}
],
"yAxes": [
{
"label": "requests/sec",
"min": 0
}
],
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"[45]..\",service=~\"$service\",environment=\"$environment\"}[5m])) by (service) / sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service) * 100",
"legendFormat": "{{ service }}"
}
],
"yAxes": [
{
"label": "error %",
"min": 0,
"max": 100
}
],
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 0}
}
]
}
}

### Appendix E: Configuration Management

#### Environment-specific Configurations

##### Development Environment
# prometheus-dev.yml
global:
scrape_interval: 30s
evaluation_interval: 30s
external_labels:
environment: 'development'
cluster: 'dev'
rule_files:
- "dev_rules.yml"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['localhost:9100']
scrape_interval: 60s # Less frequent in dev
- job_name: 'application'
static_configs:
- targets: ['localhost:8080']
    scrape_interval: 30s

##### Production Environment
# prometheus-prod.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
environment: 'production'
cluster: 'prod'
datacenter: 'us-east-1'
rule_files:
- "prod_rules.yml"
- "slo_rules.yml"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets:
- 'node1:9100'
- 'node2:9100'
- 'node3:9100'
scrape_interval: 15s
- job_name: 'application'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- production
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
remote_write:
- url: "https://remote-storage.company.com/api/v1/write"
headers:
Authorization: "Bearer ${REMOTE_WRITE_TOKEN}"YAMLConfiguration Validation
#!/bin/bash
# scripts/validate-config.sh
set -e
echo "Validating Prometheus configuration..."
# Check Prometheus config syntax
promtool check config prometheus/prometheus.yml
# Check recording rules
if [ -f "prometheus/recording_rules.yml" ]; then
promtool check rules prometheus/recording_rules.yml
fi
# Check alerting rules
if [ -f "prometheus/alert_rules.yml" ]; then
promtool check rules prometheus/alert_rules.yml
fi
# Check Alertmanager config
if [ -f "alertmanager/alertmanager.yml" ]; then
amtool check-config alertmanager/alertmanager.yml
fi
echo "Configuration validation completed successfully!"BashAppendix F: Troubleshooting Guide
Common Issues and Solutions
Prometheus Issues
Issue: Targets showing as “DOWN”
# Check target accessibility
curl -v http://target-host:9100/metrics
# Check network connectivity
telnet target-host 9100
# Check Prometheus logs
docker logs prometheus
# Check scrape configuration
curl http://localhost:9090/api/v1/targets

Issue: High memory usage
# Check active series count
prometheus_tsdb_head_series
# Check samples ingested per second
rate(prometheus_tsdb_samples_total[5m])
# Find high cardinality metrics
topk(10, count by (__name__)({__name__!=""}))

Issue: Slow queries
# Check query duration
histogram_quantile(0.99, rate(prometheus_engine_query_duration_seconds_bucket[5m]))
# Check concurrent queries
prometheus_engine_queries_concurrent_max

##### Alertmanager Issues
Issue: Alerts not firing
# Check Prometheus rules evaluation
curl http://localhost:9090/api/v1/rules
# Check alert status
curl http://localhost:9090/api/v1/alerts
# Check Alertmanager configuration
amtool config show --alertmanager.url=http://localhost:9093

Issue: Notifications not being sent
# Check Alertmanager logs
docker logs alertmanager
# Test notification channels
amtool alert add --alertmanager.url=http://localhost:9093 \
alertname="test" service="test" severity="warning"
# Check silences
amtool silence query --alertmanager.url=http://localhost:9093

##### Grafana Issues
Issue: Dashboard not loading data
# Check data source connectivity
curl -X GET "http://admin:admin123@localhost:3000/api/datasources/1/health"
# Check Prometheus connectivity from Grafana
curl -X GET "http://admin:admin123@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"BashIssue: Variables not working
- Check variable query syntax
- Verify data source selection
- Check refresh settings
#### Performance Optimization

##### Reduce Cardinality
# Metric relabeling to drop high cardinality labels
metric_relabel_configs:
- source_labels: [__name__]
regex: 'high_cardinality_metric.*'
action: drop
- source_labels: [user_id]
target_label: user_type
regex: 'premium_.*'
replacement: 'premium'
- regex: 'user_id'
    action: labeldrop

##### Optimize Recording Rules
# Pre-compute expensive queries
groups:
- name: optimization_rules
interval: 30s
rules:
- record: expensive_calculation:rate5m
expr: |
sum(rate(complex_metric[5m])) by (service) /
        sum(rate(other_complex_metric[5m])) by (service)

### Appendix G: Further Reading and References

#### Official Documentation

#### Books and Guides
- “Prometheus: Up & Running” by Brian Brazil
- “Monitoring with Prometheus” by James Turnbull
- “Site Reliability Engineering” by Google (SRE practices)
- “The Art of Monitoring” by James Turnbull
#### Online Resources
- PromLabs – Prometheus conference talks
- Robust Perception Blog
- Grafana Labs Blog
- CNCF Prometheus Slack
#### Training and Certification

#### Community Resources

#### Best Practices Repositories
## Conclusion
This guide has covered Prometheus observability from basic concepts through advanced enterprise deployments. By applying the patterns, best practices, and examples provided, you should be well equipped to build robust monitoring that yields actionable insights into your systems and applications.
Remember that observability is not just about collecting metrics—it’s about building systems that help you understand and improve your applications and infrastructure. Start with the basics, iterate based on your needs, and continuously refine your monitoring strategy as your systems evolve.
The capstone project provides a practical foundation that you can adapt and extend for your specific use cases. Use the appendices as reference materials for ongoing implementation and troubleshooting.
Happy monitoring! 🚀📊