Prometheus Observability

    Table of Contents

    1. Introduction to Observability
    2. Getting Started with Prometheus
    3. Metrics and Data Collection
    4. PromQL: Querying and Analyzing Data
    5. Alerting and Notifications
    6. Visualization
    7. Prometheus in Kubernetes
    8. Scaling and Performance
    9. Best Practices and Pitfalls
    10. Advanced Topics
    11. Capstone Project
    12. Appendices

    1. Introduction to Observability

    What is Observability?

    Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike traditional monitoring, which tells you when something breaks, observability helps you understand why it broke and how to fix it.

    Monitoring vs. Observability

    Monitoring               Observability
    -----------------------  -----------------------
    Known unknowns           Unknown unknowns
    Predefined dashboards    Ad-hoc queries
    Health checks            Deep insights
    Reactive                 Proactive

    Monitoring answers: “Is the system up?” Observability answers: “Why is the system behaving this way?”

    The Three Pillars of Observability

    graph TB
        A[Observability] --> B[Metrics]
        A --> C[Logs]
        A --> D[Traces]
    
        B --> B1[Numerical data over time]
        B --> B2[System performance indicators]
    
        C --> C1[Discrete events with context]
        C --> C2[Application and system logs]
    
        D --> D1[Request flows across services]
        D --> D2[Performance bottleneck identification]

    1. Metrics

    • Definition: Numerical measurements captured over time
    • Examples: CPU usage, memory consumption, request rate, error rate
    • Best for: Dashboards, alerting, trend analysis

    2. Logs

    • Definition: Discrete events with timestamps and context
    • Examples: Application errors, access logs, audit trails
    • Best for: Debugging, forensic analysis, compliance

    3. Traces

    • Definition: Records of requests as they flow through distributed systems
    • Examples: Microservice call chains, database queries, external API calls
    • Best for: Performance optimization, dependency mapping

    Where Prometheus Fits

    Prometheus is primarily a metrics-based monitoring system that excels at:

    • Time-series data collection and storage
    • Powerful querying language (PromQL)
    • Built-in alerting capabilities
    • Service discovery integration
    • Scalable architecture

    Chapter 1 Summary

    graph LR
        A[Applications] --> B[Prometheus]
        C[Infrastructure] --> B
        D[Exporters] --> B
        B --> E[Alertmanager]
        B --> F[Grafana]
        B --> G[Remote Storage]

    Observability goes beyond traditional monitoring by providing deep insights into system behavior. The three pillars—metrics, logs, and traces—work together to provide comprehensive visibility. Prometheus serves as the foundation for metrics collection and analysis in modern observability stacks.

    Hands-on Exercise

    1. Reflection Exercise: Think about a recent production issue in your environment
      • What metrics could have helped detect it earlier?
      • What logs would have aided in debugging?
      • How would distributed tracing have helped?
    2. Research Task: Investigate the observability stack used in your organization
      • Identify which tools handle metrics, logs, and traces
      • Note any gaps in observability coverage

    2. Getting Started with Prometheus

    History and Background

    Prometheus was created at SoundCloud in 2012 by Matt T. Proud and Julius Volz. Inspired by Google’s Borgmon, it became a Cloud Native Computing Foundation (CNCF) project in 2016 and graduated in 2018.

    Key Timeline:

    • 2012: Created at SoundCloud
    • 2015: Open-sourced
    • 2016: Joined CNCF
    • 2018: CNCF Graduated Project

    Prometheus Architecture

    graph TB
        subgraph "Prometheus Server"
            A[Retrieval] --> B[TSDB]
            C[PromQL Engine] --> B
            D[Web UI] --> C
            E[HTTP API] --> C
        end
    
        F[Targets] --> A
        G[Exporters] --> A
        H[Pushgateway] --> A
    
        B --> I[Alertmanager]
        D --> J[Grafana]
        C --> J
    
        K[Service Discovery] --> A

    Core Components

    1. Prometheus Server
      • Scrapes and stores time-series data
      • Executes PromQL queries
      • Evaluates alerting rules
    2. Client Libraries
      • Instrument applications
      • Expose metrics endpoints
    3. Exporters
      • Bridge between Prometheus and third-party systems
      • Translate metrics to Prometheus format
    4. Alertmanager
      • Handles alerts from Prometheus
      • Manages routing, grouping, and silencing
    5. Pushgateway
      • Allows ephemeral jobs to push metrics
      • Used for batch jobs and short-lived processes
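    The push flow can be sketched with nothing but the standard library: build the metric in the text exposition format, then PUT it to the Pushgateway's /metrics/job/<job> endpoint. The gateway address and job name below are placeholders:

```python
import time

def exposition(name, help_text, value):
    # Prometheus text exposition format: HELP line, TYPE line, then the sample
    return (f'# HELP {name} {help_text}\n'
            f'# TYPE {name} gauge\n'
            f'{name} {value}\n')

body = exposition('batch_job_last_success_unixtime',
                  'Unixtime of the last successful batch run',
                  time.time())
print(body)

# To push (uncomment when a Pushgateway is running at this address):
# import urllib.request
# req = urllib.request.Request(
#     'http://pushgateway:9091/metrics/job/nightly-batch',
#     data=body.encode(), method='PUT')
# urllib.request.urlopen(req)
```

Prometheus then scrapes the Pushgateway like any other target.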

    Installation Methods

    Method 1: Binary Installation (Windows)

    1. Download Prometheus:

       # Create directory
       New-Item -ItemType Directory -Path C:\prometheus

       # Download latest release
       $url = "https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.windows-amd64.zip"
       Invoke-WebRequest -Uri $url -OutFile C:\prometheus\prometheus.zip

       # Extract
       Expand-Archive -Path C:\prometheus\prometheus.zip -DestinationPath C:\prometheus

    2. Create basic configuration:

       # prometheus.yml
       global:
         scrape_interval: 15s
         evaluation_interval: 15s

       rule_files:
         # - "first_rules.yml"
         # - "second_rules.yml"

       scrape_configs:
         - job_name: 'prometheus'
           static_configs:
             - targets: ['localhost:9090']

    3. Run Prometheus:

       cd C:\prometheus\prometheus-2.47.0.windows-amd64
       .\prometheus.exe --config.file=prometheus.yml --storage.tsdb.path=data\

    Method 2: Docker Installation

    1. Create configuration directory:

       mkdir prometheus-data

    2. Create prometheus.yml:

       global:
         scrape_interval: 15s

       scrape_configs:
         - job_name: 'prometheus'
           static_configs:
             - targets: ['localhost:9090']
         - job_name: 'node-exporter'
           static_configs:
             - targets: ['host.docker.internal:9100']

    3. Run with Docker:

       docker run -d \
         --name prometheus \
         -p 9090:9090 \
         -v ${PWD}/prometheus.yml:/etc/prometheus/prometheus.yml \
         -v prometheus-data:/prometheus \
         prom/prometheus:latest \
         --config.file=/etc/prometheus/prometheus.yml \
         --storage.tsdb.path=/prometheus \
         --web.console.libraries=/etc/prometheus/console_libraries \
         --web.console.templates=/etc/prometheus/consoles \
         --web.enable-lifecycle

    Method 3: Docker Compose

    # docker-compose.yml
    version: '3.8'
    
    services:
      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus_data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--web.enable-lifecycle'
          - '--web.enable-admin-api'
    
      node-exporter:
        image: prom/node-exporter:latest
        container_name: node-exporter
        ports:
          - "9100:9100"
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
        command:
          - '--path.procfs=/host/proc'
          - '--path.rootfs=/rootfs'
          - '--path.sysfs=/host/sys'
          - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    
    volumes:
      prometheus_data:

    Configuration Basics

    Understanding prometheus.yml

    # Global configuration
    global:
      scrape_interval: 15s        # How often to scrape targets
      evaluation_interval: 15s    # How often to evaluate rules
      external_labels:            # Labels added to metrics when federating
        cluster: 'production'
        region: 'us-west-2'
    
    # Rule files for recording and alerting rules
    rule_files:
      - "alert_rules.yml"
      - "recording_rules.yml"
    
    # Scrape configuration
    scrape_configs:
      # Self-monitoring
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
        scrape_interval: 5s         # Override global interval
        metrics_path: /metrics      # Default metrics endpoint
    
      # Application monitoring
      - job_name: 'my-app'
        static_configs:
          - targets: ['app1:8080', 'app2:8080']
        scrape_timeout: 10s
        honor_labels: true
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093
    
    # Remote write configuration (optional)
    remote_write:
      - url: "https://remote-storage-endpoint/write"
        headers:
          Authorization: "Bearer token"

    Key Configuration Parameters

    Parameter             Description                         Default
    --------------------  ----------------------------------  --------
    scrape_interval       How often to collect metrics        1m
    scrape_timeout        Maximum time for a scrape request   10s
    evaluation_interval   Rule evaluation frequency           1m
    metrics_path          HTTP path for metrics               /metrics
    scheme                Protocol (http/https)               http

    Verifying Installation

    1. Access Prometheus Web UI:
      • Open browser to http://localhost:9090
      • Check Status → Targets to see configured endpoints
    2. Test a basic query:

       up

       This should return 1 for all healthy targets.
    3. Check the metrics endpoint:

       curl http://localhost:9090/metrics
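    The same checks can be scripted against the HTTP API at /api/v1/query. A stdlib-only sketch of parsing its JSON response shape — the sample payload below is illustrative, not captured from a live server:

```python
import json

# Shape of a typical /api/v1/query?query=up response
sample = '''
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"job": "prometheus", "instance": "localhost:9090"},
       "value": [1700000000.0, "1"]}
    ]
  }
}
'''

payload = json.loads(sample)
for series in payload['data']['result']:
    labels = series['metric']
    timestamp, value = series['value']  # sample values come back as strings
    print(labels.get('instance'), '->', value)

# To query a live server (assuming it runs on localhost:9090):
# import urllib.request, urllib.parse
# url = 'http://localhost:9090/api/v1/query?' + urllib.parse.urlencode({'query': 'up'})
# payload = json.loads(urllib.request.urlopen(url).read())
```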

    Chapter 2 Summary

    Prometheus follows a pull-based architecture where the server scrapes metrics from configured targets. The system consists of the main server, client libraries, exporters, and supporting components like Alertmanager. Installation can be done via binaries, Docker, or Kubernetes, with configuration managed through the prometheus.yml file.

    Hands-on Exercise

    1. Basic Setup:
      • Install Prometheus using your preferred method
      • Configure it to monitor itself
      • Access the web UI and explore the interface
    2. Configuration Practice:
      • Modify the scrape interval to 30 seconds
      • Add a new job that targets a non-existent endpoint
      • Observe the target status and understand failure states
    3. Metrics Exploration:
      • Use the web UI to explore available metrics
      • Try simple queries like prometheus_tsdb_samples_total
      • Understand the difference between different metric types you see

    3. Metrics and Data Collection

    Types of Metrics

    Prometheus supports four fundamental metric types, each serving different purposes:

    1. Counter

    A cumulative metric that only increases (or resets to zero on restart).

    Use cases: Request counts, error counts, tasks completed
    Examples: http_requests_total, errors_total

    // Go example
    var requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    2. Gauge

    A metric that can go up and down.

    Use cases: Memory usage, CPU usage, queue size, temperature
    Examples: memory_usage_bytes, cpu_usage_percent

    // Go example
    var memoryUsage = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "memory_usage_bytes",
            Help: "Current memory usage in bytes",
        },
    )

    3. Histogram

    Samples observations and counts them in configurable buckets.

    Use cases: Request durations, response sizes, latency distribution
    Features: Provides _count, _sum, and _bucket metrics

    // Go example
    var requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets, // or custom: []float64{.1, .25, .5, 1, 2.5, 5, 10}
        },
        []string{"method", "endpoint"},
    )

    4. Summary

    Similar to histogram but calculates configurable quantiles.

    Use cases: Request durations when you need specific percentiles
    Features: Provides _count, _sum, and quantile metrics

    // Go example
    var requestDuration = prometheus.NewSummaryVec(
        prometheus.SummaryOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"method", "endpoint"},
    )
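    The _count, _sum, and _bucket series that a histogram exposes can be made concrete with a small stdlib-only sketch that renders observations in the text exposition format (bucket bounds chosen arbitrarily):

```python
def histogram_exposition(name, observations, buckets):
    """Render histogram observations as Prometheus text-format samples."""
    lines = [f'# TYPE {name} histogram']
    for le in buckets:
        # _bucket series are cumulative: each counts observations <= le
        cumulative = sum(1 for o in observations if o <= le)
        lines.append(f'{name}_bucket{{le="{le}"}} {cumulative}')
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(observations)}')
    lines.append(f'{name}_sum {sum(observations)}')
    lines.append(f'{name}_count {len(observations)}')
    return '\n'.join(lines)

text = histogram_exposition('http_request_duration_seconds',
                            [0.3, 0.7, 2.0], [0.1, 0.5, 1.0])
print(text)
```

Note how the buckets are cumulative; this is what makes histogram_quantile() (Chapter 4) possible.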

    Exposing Metrics with Client Libraries

    Go Application Example

    // main.go
    package main
    
    import (
        "fmt"
        "log"
        "math/rand"
        "net/http"
        "time"
    
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )
    
    var (
        requestsTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "Total number of HTTP requests",
            },
            []string{"method", "endpoint", "status"},
        )
    
        requestDuration = prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name: "http_request_duration_seconds",
                Help: "HTTP request duration in seconds",
                Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
            },
            []string{"method", "endpoint"},
        )
    
        activeConnections = prometheus.NewGauge(
            prometheus.GaugeOpts{
                Name: "active_connections",
                Help: "Number of active connections",
            },
        )
    )
    
    func init() {
        // Register metrics with Prometheus
        prometheus.MustRegister(requestsTotal)
        prometheus.MustRegister(requestDuration)
        prometheus.MustRegister(activeConnections)
    }
    
    func metricsMiddleware(next http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()

            // Wrap the ResponseWriter so the real status code is recorded,
            // instead of hard-coding "200"
            rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

            // Increment active connections
            activeConnections.Inc()
            defer activeConnections.Dec()

            // Call the next handler
            next(rec, r)

            // Record metrics
            duration := time.Since(start).Seconds()
            requestsTotal.WithLabelValues(r.Method, r.URL.Path, fmt.Sprintf("%d", rec.status)).Inc()
            requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        }
    }

    // statusRecorder captures the status code written by the handler
    type statusRecorder struct {
        http.ResponseWriter
        status int
    }

    func (r *statusRecorder) WriteHeader(code int) {
        r.status = code
        r.ResponseWriter.WriteHeader(code)
    }
    
    func helloHandler(w http.ResponseWriter, r *http.Request) {
        // Simulate some work
        time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
        fmt.Fprintf(w, "Hello, World!")
    }
    
    func main() {
        // Application routes
        http.HandleFunc("/hello", metricsMiddleware(helloHandler))
    
        // Metrics endpoint
        http.Handle("/metrics", promhttp.Handler())
    
        log.Println("Server starting on :8080")
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

    Python Application Example

    # app.py
    from flask import Flask, request
    from prometheus_client import Counter, Histogram, Gauge, generate_latest
    import time
    import random
    
    app = Flask(__name__)
    
    # Define metrics
    REQUEST_COUNT = Counter(
        'http_requests_total',
        'Total HTTP requests',
        ['method', 'endpoint', 'status']
    )
    
    REQUEST_DURATION = Histogram(
        'http_request_duration_seconds',
        'HTTP request duration in seconds',
        ['method', 'endpoint'],
        buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
    )
    
    ACTIVE_CONNECTIONS = Gauge(
        'active_connections',
        'Number of active connections'
    )
    
    def track_metrics(f):
        def wrapper(*args, **kwargs):
            start_time = time.time()
            ACTIVE_CONNECTIONS.inc()
    
            try:
                result = f(*args, **kwargs)
                status = '200'
                return result
            except Exception as e:
                status = '500'
                raise
            finally:
                REQUEST_COUNT.labels(
                    method=request.method,
                    endpoint=request.endpoint or 'unknown',
                    status=status
                ).inc()
    
                REQUEST_DURATION.labels(
                    method=request.method,
                    endpoint=request.endpoint or 'unknown'
                ).observe(time.time() - start_time)
    
                ACTIVE_CONNECTIONS.dec()
    
        wrapper.__name__ = f.__name__
        return wrapper
    
    @app.route('/hello')
    @track_metrics
    def hello():
        # Simulate work
        time.sleep(random.uniform(0.01, 0.1))
        return "Hello, World!"
    
    @app.route('/metrics')
    def metrics():
        return generate_latest()
    
    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)

    Node.js Application Example

    // app.js
    const express = require('express');
    const promClient = require('prom-client');
    
    const app = express();
    const port = 8080;
    
    // Create metrics
    const requestCounter = new promClient.Counter({
        name: 'http_requests_total',
        help: 'Total number of HTTP requests',
        labelNames: ['method', 'endpoint', 'status']
    });
    
    const requestDuration = new promClient.Histogram({
        name: 'http_request_duration_seconds',
        help: 'HTTP request duration in seconds',
        labelNames: ['method', 'endpoint'],
        buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
    });
    
    const activeConnections = new promClient.Gauge({
        name: 'active_connections',
        help: 'Number of active connections'
    });
    
    // Middleware to track metrics
    function metricsMiddleware(req, res, next) {
        const start = Date.now();
        activeConnections.inc();
    
        res.on('finish', () => {
            const duration = (Date.now() - start) / 1000;
    
            requestCounter.labels(req.method, req.path, res.statusCode).inc();
            requestDuration.labels(req.method, req.path).observe(duration);
            activeConnections.dec();
        });
    
        next();
    }
    
    app.use(metricsMiddleware);
    
    app.get('/hello', (req, res) => {
        // Simulate work
        setTimeout(() => {
            res.send('Hello, World!');
        }, Math.random() * 100);
    });
    
    app.get('/metrics', async (req, res) => {
        res.set('Content-Type', promClient.register.contentType);
        // register.metrics() returns a Promise in recent prom-client versions
        res.end(await promClient.register.metrics());
    });
    
    app.listen(port, () => {
        console.log(`Server running on port ${port}`);
    });

    Exporters

    Exporters are components that fetch statistics from third-party systems and export them as Prometheus metrics.

    Node Exporter (System Metrics)

    # docker-compose.yml addition
      node-exporter:
        image: prom/node-exporter:latest
        container_name: node-exporter
        ports:
          - "9100:9100"
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
        command:
          - '--path.procfs=/host/proc'
          - '--path.rootfs=/rootfs'
          - '--path.sysfs=/host/sys'
          - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
        restart: unless-stopped

    Key metrics from node-exporter:

    • node_cpu_seconds_total: CPU usage
    • node_memory_MemTotal_bytes: Total memory
    • node_filesystem_size_bytes: Filesystem size
    • node_network_receive_bytes_total: Network received bytes

    Blackbox Exporter (External Monitoring)

    # blackbox.yml
    modules:
      http_2xx:
        prober: http
        timeout: 5s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
          valid_status_codes: []
          method: GET
          follow_redirects: true
          preferred_ip_protocol: "ip4"
    
      http_post_2xx:
        prober: http
        timeout: 5s
        http:
          method: POST
          headers:
            Content-Type: application/json
          body: '{"test": "data"}'
    
      tcp_connect:
        prober: tcp
        timeout: 5s
    
      dns:
        prober: dns
        timeout: 5s
        dns:
          query_name: "example.com"
          query_type: "A"
    # prometheus.yml addition
    scrape_configs:
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
            - https://google.com
            - https://github.com
            - https://stackoverflow.com
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter:9115

    Custom Exporter Example

    # custom_exporter.py
    from prometheus_client import start_http_server, Gauge, Counter
    import time
    import psutil
    import requests
    
    # Define custom metrics
    CUSTOM_CPU_USAGE = Gauge('custom_cpu_usage_percent', 'Custom CPU usage percentage')
    CUSTOM_DISK_USAGE = Gauge('custom_disk_usage_percent', 'Custom disk usage percentage', ['device'])
    API_CALLS_TOTAL = Counter('api_calls_total', 'Total API calls made', ['endpoint'])
    
    def collect_system_metrics():
        """Collect custom system metrics"""
        # CPU usage
        cpu_percent = psutil.cpu_percent(interval=1)
        CUSTOM_CPU_USAGE.set(cpu_percent)
    
        # Disk usage
        for partition in psutil.disk_partitions():
            try:
                partition_usage = psutil.disk_usage(partition.mountpoint)
                usage_percent = (partition_usage.used / partition_usage.total) * 100
                CUSTOM_DISK_USAGE.labels(device=partition.device).set(usage_percent)
            except PermissionError:
                continue
    
    def call_external_api():
        """Simulate calling external APIs and track calls"""
        endpoints = ['/users', '/orders', '/products']
    
        for endpoint in endpoints:
            try:
                # Simulate API call
                response = requests.get(f'https://jsonplaceholder.typicode.com{endpoint}', timeout=5)
                API_CALLS_TOTAL.labels(endpoint=endpoint).inc()
            except requests.RequestException:
                pass
    
    if __name__ == '__main__':
        # Start metrics server
        start_http_server(8000)
        print("Custom exporter started on port 8000")
    
        while True:
            collect_system_metrics()
            call_external_api()
            time.sleep(30)

    Service Discovery and Relabeling

    File-based Service Discovery

    # prometheus.yml
    scrape_configs:
      - job_name: 'file-discovery'
        file_sd_configs:
          - files:
            - 'targets/*.json'
            refresh_interval: 30s
    # targets/web-servers.json
    [
      {
        "targets": ["web1:8080", "web2:8080", "web3:8080"],
        "labels": {
          "job": "web-servers",
          "environment": "production",
          "region": "us-west-2"
        }
      },
      {
        "targets": ["api1:8080", "api2:8080"],
        "labels": {
          "job": "api-servers",
          "environment": "production",
          "region": "us-east-1"
        }
      }
    ]
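    Since these files are plain JSON, they are easy to generate from tooling. A stdlib-only sketch (target names and output directory are illustrative; in production, write to a temporary file and rename it so Prometheus never reads a half-written file):

```python
import json
import pathlib
import tempfile

targets = [
    {
        "targets": ["web1:8080", "web2:8080"],
        "labels": {"job": "web-servers", "environment": "staging"},
    }
]

# Write into the directory Prometheus watches (a temp dir here for the sketch)
out_dir = pathlib.Path(tempfile.mkdtemp()) / 'targets'
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / 'web-servers.json'
out_file.write_text(json.dumps(targets, indent=2))

print(out_file.read_text())
```

Prometheus re-reads the files on the configured refresh_interval, so no restart is needed when targets change.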

    Relabeling Configuration

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Only scrape pods with prometheus.io/scrape annotation
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
    
          # Use custom metrics path if specified
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
    
          # Use custom port if specified
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
    
          # Add pod metadata as labels
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: kubernetes_pod_name
          - source_labels: [__meta_kubernetes_namespace]
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_node_name]
            target_label: kubernetes_node
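    The port-rewriting rule above is easier to trust once the regex is seen in action. Prometheus joins the source_labels with ';' and applies an anchored RE2 pattern; Python's re is close enough to demonstrate (the replacement syntax is \1 rather than $1):

```python
import re

# Same pattern as the __address__ rewrite above; Prometheus anchors it
# against the full joined string
pattern = re.compile(r'([^:]+)(?::\d+)?;(\d+)')

# source_labels joined with ';': __address__ ; annotation port
joined = '10.0.0.5:8080;9100'
rewritten = pattern.sub(r'\1:\2', joined)
print(rewritten)

# The optional (?::\d+)? group means it also works when the
# original address had no port at all
print(pattern.sub(r'\1:\2', '10.0.0.5;9100'))
```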

    Chapter 3 Summary

    Prometheus supports four metric types: counters for cumulative values, gauges for current values, histograms for distribution analysis, and summaries for quantile calculations. Client libraries in various languages make it easy to instrument applications, while exporters bridge third-party systems. Service discovery and relabeling provide flexible configuration for dynamic environments.

    Hands-on Exercise

    1. Instrument an Application:
      • Choose a simple web application in your preferred language
      • Add Prometheus metrics for request count, duration, and active connections
      • Test the metrics endpoint
    2. Deploy Exporters:
      • Set up node-exporter to monitor system metrics
      • Configure blackbox-exporter to monitor external websites
      • Add both to your Prometheus configuration
    3. Service Discovery:
      • Create a file-based service discovery configuration
      • Add and remove targets dynamically
      • Observe how Prometheus handles target changes

    4. PromQL: Querying and Analyzing Data

    Introduction to PromQL

    Prometheus Query Language (PromQL) is a functional query language that allows you to select and aggregate time-series data. It’s designed to be both powerful and intuitive for operational use cases.

    Basic PromQL Concepts

    Instant Vectors vs Range Vectors

    # Instant vector - single value per time series at query time
    up
    
    # Range vector - range of values over time
    up[5m]

    Selectors and Matchers

    # Exact match
    http_requests_total{job="prometheus"}
    
    # Regex match (label matchers are fully anchored)
    http_requests_total{job=~".*server.*"}
    
    # Negative match
    http_requests_total{job!="prometheus"}
    
    # Negative regex match
    http_requests_total{job!~".*test.*"}
    
    # Multiple labels
    http_requests_total{job="api-server",method="GET",status="200"}

    Common Queries for System Metrics

    CPU Metrics

    # Average CPU usage per instance
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    
    # CPU usage by mode
    rate(node_cpu_seconds_total[5m]) * 100
    
    # Top 5 instances by CPU usage
    topk(5, 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
    
    # CPU usage over 80%
    (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80

    Memory Metrics

    # Memory usage percentage
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    
    # Available memory in GB
    node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
    
    # Memory usage by instance
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
    
    # Instances with memory usage > 90%
    ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100) > 90

    Disk Metrics

    # Disk usage percentage
    (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
    
    # Disk usage excluding system filesystems
    (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} / node_filesystem_size_bytes)) * 100
    
    # Free disk space in GB
    node_filesystem_avail_bytes / 1024 / 1024 / 1024
    
    # Disk I/O rate
    rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])

    Network Metrics

    # Network receive rate in MB/s
    rate(node_network_receive_bytes_total[5m]) / 1024 / 1024
    
    # Network transmit rate in MB/s
    rate(node_network_transmit_bytes_total[5m]) / 1024 / 1024
    
    # Total network traffic
    rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])
    
    # Network errors
    rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])

    Advanced PromQL Functions

    Rate and Increase

    # Rate: per-second average rate over time window
    rate(http_requests_total[5m])
    
    # Increase: total increase over time window
    increase(http_requests_total[5m])
    
    # irate: instantaneous rate (using last two data points)
    irate(http_requests_total[5m])
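    Both rate() and increase() must cope with counter resets — a restarted process drops its counter back to zero. A simplified Python model of that reset handling, ignoring the window-edge extrapolation the real functions also perform:

```python
def counter_increase(samples):
    """Total increase across raw counter samples, handling resets.

    On a reset the counter restarts from zero, so the first post-reset
    value is itself the increase since the reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

# Counter climbs to 25, the process restarts, and it climbs again
print(counter_increase([0, 10, 25, 5, 20]))
```

Dividing such an increase by the window length in seconds gives the per-second figure that rate() reports.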

    Histogram Functions

    # 95th percentile response time
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
    
    # 50th percentile (median)
    histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
    
    # Average response time
    rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
    
    # Request rate
    rate(http_request_duration_seconds_count[5m])
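    histogram_quantile() finds the bucket that the requested rank falls into and interpolates linearly inside it. A simplified Python model over cumulative (le, count) pairs:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: sorted (upper_bound, cumulative_count) pairs, ending with
    (inf, total). Linear interpolation inside the target bucket, as
    Prometheus does, assuming uniform distribution within a bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float('inf'):
                # Quantile falls in the +Inf bucket: return last finite bound
                return prev_le
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float('inf'), 100)]
print(histogram_quantile(0.95, buckets))  # falls inside the 0.5-1.0 bucket
```

This is also why bucket boundaries matter: the estimate can be off by up to a bucket's width, so choose buckets around the latencies you care about.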

    Aggregation Functions

    # Sum across all instances
    sum(rate(http_requests_total[5m]))
    
    # Average across instances
    avg(rate(http_requests_total[5m]))
    
    # Maximum value
    max(node_memory_MemTotal_bytes)
    
    # Count number of instances
    count(up == 1)
    
    # Sum by job
    sum by (job) (rate(http_requests_total[5m]))
    
    # Average without specific labels
    avg without (instance) (rate(http_requests_total[5m]))

    Mathematical Functions

    # Absolute value
    abs(delta(cpu_temp_celsius[5m]))
    
    # Round to nearest integer
    round(rate(http_requests_total[5m]))
    
    # Ceiling and floor
    ceil(rate(http_requests_total[5m]))
    floor(rate(http_requests_total[5m]))
    
    # Square root
    sqrt(rate(http_requests_total[5m]))
    
    # Logarithm
    ln(rate(http_requests_total[5m]))
    log10(rate(http_requests_total[5m]))

    Time Functions

    # Current timestamp
    time()
    
    # Time since epoch for each sample
    timestamp(up)
    
    # Day of week (0=Sunday, 6=Saturday)
    day_of_week()
    
    # Hour of day (0-23)
    hour()
    
    # Predict linear trend
    predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
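    predict_linear() fits a least-squares line through the samples in the range and extrapolates. A stdlib-only model of the idea (extrapolating from the last sample's timestamp, whereas Prometheus extrapolates from the query's evaluation time):

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares fit over (timestamp, value) pairs, then extrapolate."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(v for _, v in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * v for t, v in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    t_last = samples[-1][0]
    return slope * (t_last + seconds_ahead) + intercept

# Free space shrinking steadily: 100, 90, 80 (GB) at one-minute spacing;
# predict one minute further ahead
samples = [(0, 100), (60, 90), (120, 80)]
print(predict_linear(samples, 60))
```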

    Recording Rules

    Recording rules allow you to precompute frequently used expressions and save them as new time series.

    # recording_rules.yml
    groups:
      - name: instance_rules
        interval: 30s
        rules:
          - record: instance:cpu_usage:rate5m
            expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
            labels:
              job: node-exporter
    
          - record: instance:memory_usage:percentage
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
            labels:
              job: node-exporter
    
          - record: instance:disk_usage:percentage
            expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} / node_filesystem_size_bytes)) * 100
            labels:
              job: node-exporter
    
      - name: application_rules
        interval: 15s
        rules:
          - record: job:http_requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))
    
          - record: job:http_request_duration:p95
            expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
    
          - record: job:http_errors:rate5m
            expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
    YAML

    Complex Query Examples

    SLI/SLO Calculations

    # Error rate (percentage of 5xx responses)
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
    ) * 100
    
    # Availability (percentage of successful requests)
    (
      sum(rate(http_requests_total{status!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
    ) * 100
    
    # Latency SLI (percentage of requests under threshold)
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) /
      sum(rate(http_request_duration_seconds_count[5m]))
    ) * 100
    PromQL
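    The three ratios above are plain arithmetic over already-rated counters. A tiny Python sketch with made-up request rates, just to make the math concrete:

```python
def sli_percentages(total_rate, error_rate_5xx, under_threshold_rate):
    """Compute error rate, availability, and latency SLI (all in percent)
    from per-second request rates, as in the PromQL expressions above."""
    error_pct = error_rate_5xx / total_rate * 100
    availability_pct = (total_rate - error_rate_5xx) / total_rate * 100
    latency_sli_pct = under_threshold_rate / total_rate * 100
    return error_pct, availability_pct, latency_sli_pct

# 200 req/s overall, 3 req/s returning 5xx, 190 req/s finishing under 0.5s
print(sli_percentages(200.0, 3.0, 190.0))  # ≈ (1.5, 98.5, 95.0)
```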

    Resource Utilization Patterns

    # Predict when disk will be full (4 hours from now)
    predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
    
    # Instance running out of memory (< 10% available)
    (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
    
    # High load average (> number of CPUs)
    node_load1 > on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})
    
    # Network saturation (approaching interface limit)
    rate(node_network_transmit_bytes_total[5m]) > 
      node_network_speed_bytes * 0.8
    PromQL

    Application Performance Analysis

    # Request rate by endpoint
    sum by (endpoint) (rate(http_requests_total[5m]))
    
    # Error rate by endpoint
    sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m])) /
    sum by (endpoint) (rate(http_requests_total[5m]))
    
    # 95th percentile latency by endpoint
    histogram_quantile(0.95, 
      sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
    )
    
    # Slow endpoints (95th percentile > 1 second)
    histogram_quantile(0.95, 
      sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
    PromQL
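    `histogram_quantile()` estimates the quantile by finding the cumulative bucket that contains the target rank and interpolating linearly inside it. A simplified Python sketch of that estimation (single series only, equality matchers ignored, and the lowest bucket is assumed to start at 0):

```python
import math

def histogram_quantile(q, buckets):
    """Approximate PromQL's histogram_quantile(). `buckets` maps each `le`
    upper bound (float, math.inf for +Inf) to its cumulative count."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]          # the +Inf bucket holds the overall count
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == math.inf:
                return prev_bound        # quantile falls in the +Inf bucket
            # Linear interpolation inside [prev_bound, le]
            return prev_bound + (le - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = le, count

# 100 requests: 50 under 100ms, 90 under 500ms, 99 under 1s
buckets = {0.1: 50, 0.5: 90, 1.0: 99, math.inf: 100}
print(histogram_quantile(0.95, buckets))  # ≈ 0.778 seconds
```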

    Alerting Rules

    # alert_rules.yml
    groups:
      - name: infrastructure_alerts
        rules:
          - alert: HighCPUUsage
            expr: instance:cpu_usage:rate5m > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
    
          - alert: HighMemoryUsage
            expr: instance:memory_usage:percentage > 90
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
    
          - alert: DiskSpaceLow
            expr: instance:disk_usage:percentage > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Disk usage is {{ $value }}% on {{ $labels.instance }}"
    
      - name: application_alerts
        rules:
          - alert: HighErrorRate
            expr: job:http_errors:rate5m / job:http_requests:rate5m > 0.05
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "High error rate for {{ $labels.job }}"
              description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
    
          - alert: HighLatency
            expr: job:http_request_duration:p95 > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High latency for {{ $labels.job }}"
              description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
    
          - alert: ServiceDown
            expr: up == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Service {{ $labels.instance }} is down"
              description: "{{ $labels.instance }} has been down for more than 1 minute"
    YAML

    Chapter 4 Summary

    PromQL is a powerful query language that enables complex analysis of time-series data. Key concepts include instant vs range vectors, label selectors, aggregation functions, and mathematical operations. Recording rules help optimize performance by precomputing common queries, while alerting rules define when notifications should be sent.

    Hands-on Exercise

    1. Basic Queries:
      • Write queries to find CPU usage for all instances
      • Calculate memory usage percentage
      • Find instances with high disk usage
    2. Advanced Analysis:
      • Create queries for error rates and latency percentiles
      • Write a query to predict disk space exhaustion
      • Build SLI queries for your application
    3. Rules Configuration:
      • Create recording rules for common calculations
      • Write alerting rules for infrastructure monitoring
      • Test rules using the Prometheus web UI

    5. Alerting and Notifications

    Alertmanager Architecture

    Alertmanager handles alerts sent by Prometheus and other client applications. It provides grouping, inhibition, silencing, and routing to various notification channels.

    graph TB
        A[Prometheus] --> B[Alertmanager]
        C[Other Sources] --> B
    
        subgraph "Alertmanager"
            D[Receiver] --> E[Grouping]
            E --> F[Throttling]
            F --> G[Inhibition]
            G --> H[Silencing]
            H --> I[Routing]
        end
    
        I --> J[Email]
        I --> K[Slack]
        I --> L[PagerDuty]
        I --> M[Webhook]

    Installing and Configuring Alertmanager

    Docker Installation

    # docker-compose.yml addition
      alertmanager:
        image: prom/alertmanager:latest
        container_name: alertmanager
        ports:
          - "9093:9093"
        volumes:
          - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
          - alertmanager_data:/alertmanager
        command:
          - '--config.file=/etc/alertmanager/alertmanager.yml'
          - '--storage.path=/alertmanager'
          - '--web.external-url=http://localhost:9093'
        restart: unless-stopped
    
    volumes:
      alertmanager_data:
    YAML

    Basic Alertmanager Configuration

    # alertmanager.yml
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alerts@yourcompany.com'
      smtp_auth_username: 'alerts@yourcompany.com'
      smtp_auth_password: 'your-app-password'
    
    route:
      group_by: ['alertname', 'job']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
        - matchers:
            - severity=critical
          receiver: 'critical-alerts'
          continue: true
        - matchers:
            - severity=warning
          receiver: 'warning-alerts'
    
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://webhook-server:8080/webhook'
    
      - name: 'critical-alerts'
        email_configs:
          - to: 'oncall@yourcompany.com'
            subject: 'CRITICAL: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
            body: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              Severity: {{ .Labels.severity }}
              Instance: {{ .Labels.instance }}
              {{ end }}
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#critical-alerts'
            title: 'Critical Alert'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Severity:* {{ .Labels.severity }}
              *Instance:* {{ .Labels.instance }}
              {{ end }}
    
      - name: 'warning-alerts'
        email_configs:
          - to: 'team@yourcompany.com'
            subject: 'WARNING: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#warnings'
            title: 'Warning Alert'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              {{ end }}
    
    inhibit_rules:
      - source_matchers:
          - severity=critical
        target_matchers:
          - severity=warning
        equal: ['instance']
    YAML

    Writing Effective Alerts

    Alert Quality Guidelines

    1. Actionable: Every alert should require human action
    2. Relevant: Alerts should indicate real problems
    3. Clear: Alert messages should be immediately understandable
    4. Timely: Alerts should fire before customers notice

    Infrastructure Alerting Rules

    # infrastructure_alerts.yml
    groups:
      - name: node_alerts
        rules:
          - alert: NodeDown
            expr: up{job="node-exporter"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "Node {{ $labels.instance }} has been down for more than 1 minute"
              runbook_url: "https://runbooks.company.com/node-down"
    
          - alert: HighCPUUsage
            expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
              runbook_url: "https://runbooks.company.com/high-cpu"
    
          - alert: CriticalCPUUsage
            expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 95
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Critical CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
    
          - alert: HighMemoryUsage
            expr: ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) * 100 > 90
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"
    
          - alert: DiskSpaceCritical
            expr: ((node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes) * 100 > 95
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Critical disk space on {{ $labels.instance }}"
              description: "Disk usage is {{ $value | humanize }}% on {{ $labels.instance }} {{ $labels.mountpoint }}"
    
          - alert: DiskWillFillIn4Hours
            expr: predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Disk will fill in 4 hours on {{ $labels.instance }}"
              description: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} will fill in approximately 4 hours"
    YAML

    Application Alerting Rules

    # application_alerts.yml
    groups:
      - name: application_alerts
        rules:
          - alert: HighErrorRate
            expr: |
              (
                sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) /
                sum(rate(http_requests_total[5m])) by (job)
              ) * 100 > 5
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "High error rate for {{ $labels.job }}"
              description: "Error rate is {{ $value | humanize }}% for {{ $labels.job }}"
    
          - alert: HighLatency
            expr: |
              histogram_quantile(0.95, 
                sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
              ) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High latency for {{ $labels.job }}"
              description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
    
          - alert: LowThroughput
            expr: sum(rate(http_requests_total[5m])) by (job) < 10
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low throughput for {{ $labels.job }}"
              description: "Request rate is {{ $value }} req/s for {{ $labels.job }}"
    
          - alert: DatabaseConnectionFailure
            expr: db_connections_failed_total > 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Database connection failures for {{ $labels.job }}"
              description: "{{ $value }} database connection failures in the last minute"
    YAML

    Grouping, Inhibition, and Silences

    Grouping Configuration

    # Group alerts by cluster and alertname
    route:
      group_by: ['cluster', 'alertname']
      group_wait: 30s      # Wait for more alerts before sending
      group_interval: 5m   # How often to send updates for a group
      repeat_interval: 12h # How often to resend the same alert
    YAML
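    The `group_by` semantics can be sketched in a few lines of Python: every distinct combination of the listed label values becomes one notification group (timing — `group_wait`, `group_interval` — is not modeled, and the alert data is illustrative):

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts the way Alertmanager's route.group_by does: one
    notification group per distinct combination of the listed labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(lbl, "") for lbl in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"cluster": "eu-1", "alertname": "HighCPUUsage", "instance": "a"}},
    {"labels": {"cluster": "eu-1", "alertname": "HighCPUUsage", "instance": "b"}},
    {"labels": {"cluster": "us-1", "alertname": "HighCPUUsage", "instance": "c"}},
]
groups = group_alerts(alerts, ["cluster", "alertname"])
print(len(groups))  # 2 — the two eu-1 alerts share one notification
```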

    Inhibition Rules

    inhibit_rules:
      # Don't send warning alerts if critical alerts are firing for the same instance
      - source_matchers:
          - severity=critical
        target_matchers:
          - severity=warning
        equal: ['instance']
    
      # Don't send individual service alerts if the whole node is down
      - source_matchers:
          - alertname=NodeDown
        target_matchers:
          - alertname=~"High.*|.*ServiceDown"
        equal: ['instance']
    
      # Don't send disk space warnings if disk is critically full
      - source_matchers:
          - alertname=DiskSpaceCritical
        target_matchers:
          - alertname=DiskWillFillIn4Hours
        equal: ['instance', 'device']
    YAML
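    The behavior of the first rule above can be sketched in Python. This toy version supports only equality matchers (real Alertmanager matchers also allow regexes) and hypothetical alert data:

```python
def is_inhibited(alert, firing_alerts, rule):
    """Check one inhibit_rule: the alert is muted if some firing alert
    matches source_matchers, the alert matches target_matchers, and the
    `equal` labels agree on both alerts."""
    def matches(a, matchers):
        return all(a["labels"].get(k) == v for k, v in matchers.items())

    if not matches(alert, rule["target_matchers"]):
        return False
    return any(
        matches(src, rule["source_matchers"])
        and all(src["labels"].get(l) == alert["labels"].get(l) for l in rule["equal"])
        for src in firing_alerts
    )

rule = {"source_matchers": {"severity": "critical"},
        "target_matchers": {"severity": "warning"},
        "equal": ["instance"]}
critical = {"labels": {"severity": "critical", "instance": "server-01"}}
warning_same = {"labels": {"severity": "warning", "instance": "server-01"}}
warning_other = {"labels": {"severity": "warning", "instance": "server-02"}}
print(is_inhibited(warning_same, [critical], rule))   # True — same instance
print(is_inhibited(warning_other, [critical], rule))  # False — different instance
```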

    Managing Silences

    # Create a silence via the v2 API
    curl -X POST http://localhost:9093/api/v2/silences \
      -H "Content-Type: application/json" \
      -d '{
        "matchers": [
          {
            "name": "alertname",
            "value": "HighCPUUsage",
            "isRegex": false
          },
          {
            "name": "instance",
            "value": "server-01:9100",
            "isRegex": false
          }
        ],
        "startsAt": "2023-08-21T12:00:00.000Z",
        "endsAt": "2023-08-21T14:00:00.000Z",
        "createdBy": "maintenance-team",
        "comment": "Planned maintenance window"
      }'
    Bash
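    For scripted maintenance windows it is convenient to build the silence body programmatically. A small sketch using the v2 silences API's matcher shape (`isRegex` is required there); the actual POST, e.g. via `requests`, is left out so the snippet stays self-contained:

```python
from datetime import datetime, timedelta, timezone

def maintenance_silence(matchers, hours, created_by, comment):
    """Build the JSON body for a maintenance-window silence suitable for
    POSTing to Alertmanager's /api/v2/silences endpoint."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

body = maintenance_silence(
    {"alertname": "HighCPUUsage", "instance": "server-01:9100"},
    hours=2, created_by="maintenance-team", comment="Planned maintenance window",
)
print(len(body["matchers"]))  # 2
```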

    Integration Examples

    Slack Integration

    # Slack configuration with rich formatting
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .GroupLabels.alertname }}'
        text: |
          {{ if eq .Status "firing" }}
          *Status:* Firing
          *Alerts:* {{ len .Alerts }}
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Instance:* {{ .Labels.instance }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}
          {{ else }}
          *Status:* Resolved
          All alerts have been resolved.
          {{ end }}
        actions:
          - type: button
            text: 'View in Alertmanager'
            url: '{{ template "__alertmanagerURL" . }}'
          - type: button
            text: 'Silence'
            url: '{{ template "__alertmanagerURL" . }}/#/silences/new'
    YAML

    PagerDuty Integration

    pagerduty_configs:
      - routing_key: 'YOUR_INTEGRATION_KEY'
        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        details:
          severity: '{{ range .Alerts }}{{ .Labels.severity }}{{ end }}'
          instance: '{{ range .Alerts }}{{ .Labels.instance }}{{ end }}'
          alertname: '{{ range .Alerts }}{{ .Labels.alertname }}{{ end }}'
        links:
          - href: '{{ range .Alerts }}{{ .Annotations.runbook_url }}{{ end }}'
            text: 'Runbook'
    YAML

    Email Integration

    email_configs:
      - to: 'team@company.com'
        from: 'alertmanager@company.com'
        subject: '{{ .Status | toUpper }}: {{ .GroupLabels.alertname }} ({{ len .Alerts }} alerts)'
        html: |
          <!DOCTYPE html>
          <html>
          <head>
              <style>
                  table { border-collapse: collapse; width: 100%; }
                  th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
                  th { background-color: #f2f2f2; }
                  .critical { background-color: #ffebee; }
                  .warning { background-color: #fff3e0; }
              </style>
          </head>
          <body>
              <h2>Alert {{ .Status | toUpper }}</h2>
              <table>
                  <tr>
                      <th>Alert</th>
                      <th>Severity</th>
                      <th>Instance</th>
                      <th>Description</th>
                  </tr>
                  {{ range .Alerts }}
                  <tr class="{{ .Labels.severity }}">
                      <td>{{ .Labels.alertname }}</td>
                      <td>{{ .Labels.severity }}</td>
                      <td>{{ .Labels.instance }}</td>
                      <td>{{ .Annotations.description }}</td>
                  </tr>
                  {{ end }}
              </table>
          </body>
          </html>
    YAML

    Custom Webhook Integration

    # webhook_server.py
    from flask import Flask, request, jsonify
    import requests  # for forwarding alerts to downstream systems
    
    app = Flask(__name__)
    
    @app.route('/webhook', methods=['POST'])
    def webhook():
        data = request.get_json()
    
        # Process the alert
        status = data.get('status')
        alerts = data.get('alerts', [])
    
        for alert in alerts:
            labels = alert.get('labels', {})
            annotations = alert.get('annotations', {})
    
            # Custom logic based on alert
            if labels.get('severity') == 'critical':
                send_to_ops_team(alert)
            elif 'database' in labels.get('alertname', '').lower():
                send_to_dba_team(alert)
    
            # Log to external system
            log_alert_to_system(alert)
    
        return jsonify({'status': 'received'})
    
    def send_to_ops_team(alert):
        # Send to ticketing system, chat platform, etc.
        pass
    
    def send_to_dba_team(alert):
        # Send to database team's channel
        pass
    
    def log_alert_to_system(alert):
        # Log to centralized logging system
        pass
    
    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)
    Python

    Testing Alerts

    Manual Alert Testing

    # Send a test alert to Alertmanager (v2 API)
    curl -X POST http://localhost:9093/api/v2/alerts \
      -H "Content-Type: application/json" \
      -d '[
        {
          "labels": {
            "alertname": "TestAlert",
            "instance": "test-instance",
            "severity": "warning"
          },
          "annotations": {
            "summary": "This is a test alert",
            "description": "Testing alert routing and notifications"
          },
          "startsAt": "2023-08-21T12:00:00.000Z"
        }
      ]'
    Bash

    Alert Testing Framework

    # alert_tester.py
    import requests
    import time
    from datetime import datetime, timezone
    
    class AlertTester:
        def __init__(self, alertmanager_url, prometheus_url):
            self.alertmanager_url = alertmanager_url
            self.prometheus_url = prometheus_url
    
        def send_test_alert(self, alertname, labels, annotations):
            """Send a test alert to Alertmanager"""
            alert = {
                "labels": {
                    "alertname": alertname,
                    **labels
                },
                "annotations": annotations,
                "startsAt": datetime.now(timezone.utc).isoformat()
            }
    
            response = requests.post(
                f"{self.alertmanager_url}/api/v2/alerts",
                json=[alert]
            )
            return response.status_code == 200
    
        def check_alert_rule(self, rule_name):
            """Check if an alert rule is defined in Prometheus"""
            response = requests.get(f"{self.prometheus_url}/api/v1/rules")
            rules = response.json()
    
            for group in rules['data']['groups']:
                for rule in group['rules']:
                    if rule.get('name') == rule_name:
                        return True
            return False
    
        def test_critical_alert_routing(self):
            """Test that critical alerts go to the right channels"""
            return self.send_test_alert(
                "TestCriticalAlert",
                {"severity": "critical", "instance": "test-server"},
                {
                    "summary": "Test critical alert",
                    "description": "This should route to critical alerts channel"
                }
            )
    
    # Usage
    tester = AlertTester("http://localhost:9093", "http://localhost:9090")
    tester.test_critical_alert_routing()
    Python

    Chapter 5 Summary

    Alertmanager provides sophisticated alert routing, grouping, and notification capabilities. Effective alerting requires clear rules, proper grouping, inhibition to reduce noise, and integration with appropriate notification channels. Testing alerts ensures they work as expected and reach the right people.

    Hands-on Exercise

    1. Alertmanager Setup:
      • Install and configure Alertmanager
      • Set up basic routing to email or Slack
      • Test with manual alerts
    2. Alert Rules:
      • Create alerting rules for your infrastructure
      • Set appropriate thresholds and timing
      • Add helpful annotations and runbook links
    3. Advanced Features:
      • Configure inhibition rules to reduce noise
      • Set up silences for maintenance windows
      • Test different notification channels

    6. Visualization

    Introduction to Grafana

    Grafana is the de facto standard for visualizing Prometheus metrics. It provides powerful dashboarding capabilities, alerting integration, and supports multiple data sources beyond Prometheus.

    graph TB
        A[Prometheus] --> B[Grafana]
        C[Users] --> B
        B --> D[Dashboards]
        B --> E[Alerts]
        B --> F[Data Sources]
    
        D --> G[Panels]
        D --> H[Variables]
        D --> I[Annotations]
    
        G --> J[Time Series]
        G --> K[Stats]
        G --> L[Tables]
        G --> M[Heatmaps]

    Installing and Configuring Grafana

    Docker Installation

    # docker-compose.yml
    version: '3.8'
    
    services:
      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        ports:
          - "3000:3000"
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=admin123
          - GF_USERS_ALLOW_SIGN_UP=false
          - GF_USERS_DEFAULT_THEME=dark
          - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/etc/grafana/provisioning/dashboards/overview.json
        volumes:
          - grafana_data:/var/lib/grafana
          - ./grafana/provisioning:/etc/grafana/provisioning
          - ./grafana/dashboards:/var/lib/grafana/dashboards
        restart: unless-stopped
    
    volumes:
      grafana_data:
    YAML

    Configuration as Code

    # grafana/provisioning/datasources/prometheus.yml
    apiVersion: 1
    
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: true
        jsonData:
          httpMethod: POST
          prometheusType: Prometheus
          prometheusVersion: 2.40.0
          cacheLevel: 'High'
          disableMetricsLookup: false
          customQueryParameters: ''
          incrementalQuerying: false
          disableRecordingRules: false
    YAML
    # grafana/provisioning/dashboards/dashboard.yml
    apiVersion: 1
    
    providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        updateIntervalSeconds: 10
        allowUiUpdates: true
        options:
          path: /var/lib/grafana/dashboards
    YAML

    Dashboard Design Principles

    Information Hierarchy

    1. Overview Level: High-level health and performance indicators
    2. Service Level: Detailed metrics for specific services
    3. Component Level: Deep-dive into individual components
    4. Debug Level: Raw metrics for troubleshooting

    Dashboard Layout Best Practices

    {
      "dashboard": {
        "title": "Service Overview",
        "panels": [
          {
            "id": 1,
            "title": "Key Metrics (Top Row)",
            "type": "stat",
            "gridPos": {"h": 6, "w": 24, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "Trends (Middle Section)",
            "type": "timeseries", 
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 6}
          },
          {
            "id": 3,
            "title": "Distribution (Right Side)",
            "type": "heatmap",
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 6}
          },
          {
            "id": 4,
            "title": "Details (Bottom)",
            "type": "table",
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": 14}
          }
        ]
      }
    }
    JSON
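    Grafana places panels on a 24-column grid, and overlapping `gridPos` rectangles are a common source of broken layouts. A small sketch that checks a layout like the one above for collisions (coordinates copied from the example):

```python
def panels_overlap(a, b):
    """True if two gridPos rectangles collide on Grafana's 24-column grid
    (x/w are columns, y/h are rows)."""
    return (a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
            and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"])

layout = [
    {"h": 6, "w": 24, "x": 0, "y": 0},    # key metrics row
    {"h": 8, "w": 12, "x": 0, "y": 6},    # trends
    {"h": 8, "w": 12, "x": 12, "y": 6},   # distribution
    {"h": 8, "w": 24, "x": 0, "y": 14},   # details table
]
clashes = [(i, j) for i in range(len(layout)) for j in range(i + 1, len(layout))
           if panels_overlap(layout[i], layout[j])]
print(clashes)  # [] — the example layout tiles cleanly
```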

    Essential Panel Types

    Time Series Panels

    {
      "id": 1,
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{service}}",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "custom": {
            "drawStyle": "line",
            "lineInterpolation": "linear",
            "barAlignment": 0,
            "lineWidth": 2,
            "fillOpacity": 10,
            "gradientMode": "none",
            "spanNulls": false,
            "insertNulls": false,
            "showPoints": "never",
            "pointSize": 5,
            "stacking": {
              "mode": "none",
              "group": "A"
            },
            "axisPlacement": "auto",
            "axisLabel": "",
            "scaleDistribution": {
              "type": "linear"
            },
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "vis": false
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          }
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "frontend"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "mode": "fixed",
                  "fixedColor": "green"
                }
              }
            ]
          }
        ]
      },
      "options": {
        "tooltip": {
          "mode": "multi",
          "sort": "desc"
        },
        "legend": {
          "displayMode": "table",
          "placement": "bottom",
          "calcs": ["lastNotNull", "max", "mean"],
          "values": true
        }
      }
    }
    JSON

    Stat Panels for Key Metrics

    {
      "id": 2,
      "title": "Service Availability",
      "type": "stat",
      "targets": [
        {
          "expr": "avg(up{job=~\".*-service\"})",
          "refId": "A",
          "format": "time_series",
          "instant": true
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "min": 0,
          "max": 1,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "red",
                "value": 0
              },
              {
                "color": "yellow", 
                "value": 0.95
              },
              {
                "color": "green",
                "value": 0.99
              }
            ]
          },
          "mappings": [],
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "vis": false
            }
          }
        }
      },
      "options": {
        "reduceOptions": {
          "values": false,
          "calcs": ["lastNotNull"],
          "fields": ""
        },
        "orientation": "auto",
        "textMode": "auto",
        "colorMode": "background",
        "graphMode": "area",
        "justifyMode": "auto"
      },
      "gridPos": {"h": 6, "w": 6, "x": 0, "y": 0}
    }
    JSON

    Heatmap for Latency Distribution

    {
      "id": 3,
      "title": "Response Time Distribution",
      "type": "heatmap",
      "targets": [
        {
          "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
          "format": "heatmap",
          "legendFormat": "{{le}}",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "vis": false
            },
            "scaleDistribution": {
              "type": "linear"
            }
          }
        }
      },
      "options": {
        "calculate": false,
        "cellGap": 2,
        "cellValues": {
          "unit": "short"
        },
        "color": {
          "exponent": 0.5,
          "fill": "dark-orange",
          "mode": "spectrum",
          "reverse": false,
          "scale": "exponential",
          "scheme": "Oranges",
          "steps": 64
        },
        "exemplars": {
          "color": "rgba(255,0,255,0.7)"
        },
        "filterValues": {
          "le": 1e-9
        },
        "legend": {
          "show": true
        },
        "rowsFrame": {
          "layout": "auto"
        },
        "tooltip": {
          "show": true,
          "yHistogram": false
        },
        "yAxis": {
          "axisPlacement": "left",
          "reverse": false,
          "unit": "s"
        }
      }
    }
    JSON

    Table for Detailed Breakdown

    {
      "id": 4,
      "title": "Service Status Details",
      "type": "table",
      "targets": [
        {
          "expr": "up{job=~\".*-service\"}",
          "format": "table",
          "instant": true,
          "refId": "A"
        },
        {
          "expr": "rate(http_requests_total[5m])",
          "format": "table", 
          "instant": true,
          "refId": "B"
        },
        {
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])",
          "format": "table",
          "instant": true, 
          "refId": "C"
        }
      ],
      "transformations": [
        {
          "id": "merge",
          "options": {}
        },
        {
          "id": "organize",
          "options": {
            "excludeByName": {
              "Time": true,
              "__name__": true
            },
            "indexByName": {
              "instance": 0,
              "job": 1,
              "Value #A": 2,
              "Value #B": 3,
              "Value #C": 4
            },
            "renameByName": {
              "Value #A": "Status",
              "Value #B": "Request Rate",
              "Value #C": "Error Rate",
              "instance": "Instance",
              "job": "Service"
            }
          }
        }
      ],
      "fieldConfig": {
        "defaults": {
          "custom": {
            "align": "auto",
            "displayMode": "auto",
            "inspect": false
          },
          "mappings": [
            {
              "options": {
                "0": {
                  "color": "red",
                  "index": 0,
                  "text": "DOWN"
                },
                "1": {
                  "color": "green", 
                  "index": 1,
                  "text": "UP"
                }
              },
              "type": "value"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "Error Rate"
            },
            "properties": [
              {
                "id": "unit",
                "value": "percentunit"
              },
              {
                "id": "custom.displayMode",
                "value": "color-background"
              },
              {
                "id": "thresholds",
                "value": {
                  "mode": "absolute",
                  "steps": [
                    {
                      "color": "green",
                      "value": null
                    },
                    {
                      "color": "yellow",
                      "value": 0.01
                    },
                    {
                      "color": "red",
                      "value": 0.05
                    }
                  ]
                }
              }
            ]
          }
        ]
      }
    }
    JSON

    Dashboard Templates and Variables

    Template Variables

    {
      "templating": {
        "list": [
          {
            "name": "environment",
            "type": "query",
            "query": "label_values(up, environment)",
            "current": {
              "selected": true,
              "text": "production",
              "value": "production"
            },
            "options": [],
            "refresh": 1,
            "regex": "",
            "sort": 1,
            "multi": false,
            "includeAll": false,
            "allValue": null
          },
          {
            "name": "service",
            "type": "query", 
            "query": "label_values(http_requests_total{environment=\"$environment\"}, service)",
            "current": {
              "selected": false,
              "text": "All",
              "value": "$__all"
            },
            "options": [],
            "refresh": 1,
            "regex": "",
            "sort": 1,
            "multi": true,
            "includeAll": true,
            "allValue": ".*"
          },
          {
            "name": "instance",
            "type": "query",
            "query": "label_values(up{job=\"$service\"}, instance)",
            "current": {
              "selected": false,
              "text": "All", 
              "value": "$__all"
            },
            "options": [],
            "refresh": 2,
            "regex": "",
            "sort": 1,
            "multi": true,
            "includeAll": true,
            "allValue": ".*"
          },
          {
            "name": "interval",
            "type": "interval",
            "current": {
              "selected": false,
              "text": "5m",
              "value": "5m"
            },
            "options": [
              {
                "selected": true,
                "text": "1m",
                "value": "1m"
              },
              {
                "selected": false,
                "text": "5m", 
                "value": "5m"
              },
              {
                "selected": false,
                "text": "15m",
                "value": "15m"
              },
              {
                "selected": false,
                "text": "1h",
                "value": "1h"
              }
            ],
            "query": "1m,5m,15m,1h,6h,12h,1d,7d,14d,30d",
            "refresh": 2,
            "auto": true,
            "auto_count": 30,
            "auto_min": "10s"
          }
        ]
      }
    }
    JSON

    Using Variables in Queries

    # Using service variable
    sum(rate(http_requests_total{service=~"$service"}[$interval])) by (service)
    
    # Using environment and instance variables  
    up{environment="$environment",instance=~"$instance"}
    
    # Using the interval variable as the range selector
    rate(http_requests_total{service=~"$service"}[$interval])
    INI
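    Grafana also ships built-in variables. In recent versions, `$__rate_interval` picks a range that is safe for `rate()` given the scrape interval and panel resolution, shown here as an illustrative alternative to a hand-picked `$interval`:

    ```promql
    # $__rate_interval avoids ranges shorter than ~4x the scrape interval
    sum(rate(http_requests_total{service=~"$service"}[$__rate_interval])) by (service)
    ```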

    Complete Dashboard Examples

    Infrastructure Overview Dashboard

    {
      "dashboard": {
        "id": null,
        "title": "Infrastructure Overview",
        "description": "High-level infrastructure health and performance metrics",
        "tags": ["infrastructure", "overview"],
        "timezone": "browser",
        "refresh": "30s",
        "time": {
          "from": "now-1h",
          "to": "now"
        },
        "templating": {
          "list": [
            {
              "name": "instance",
              "type": "query",
              "query": "label_values(up{job=\"node-exporter\"}, instance)",
              "refresh": 1,
              "multi": true,
              "includeAll": true,
              "current": {
                "value": "$__all",
                "text": "All"
              }
            }
          ]
        },
        "panels": [
          {
            "id": 1,
            "title": "Node Status",
            "type": "stat",
            "targets": [
              {
                "expr": "up{job=\"node-exporter\",instance=~\"$instance\"}",
                "legendFormat": "{{instance}}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "mappings": [
                  {
                    "options": {
                      "0": {"color": "red", "text": "DOWN"},
                      "1": {"color": "green", "text": "UP"}
                    },
                    "type": "value"
                  }
                ],
                "thresholds": {
                  "steps": [
                    {"color": "red", "value": 0},
                    {"color": "green", "value": 1}
                  ]
                }
              }
            },
            "gridPos": {"h": 4, "w": 24, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "CPU Usage",
            "type": "timeseries",
            "targets": [
              {
                "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\"}[5m])) * 100)",
                "legendFormat": "{{instance}}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "max": 100,
                "min": 0,
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 70},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
          },
          {
            "id": 3,
            "title": "Memory Usage",
            "type": "timeseries",
            "targets": [
              {
                "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
                "legendFormat": "{{instance}}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "max": 100,
                "min": 0,
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 80},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
          },
          {
            "id": 4,
            "title": "Disk Usage",
            "type": "timeseries",
            "targets": [
              {
                "expr": "(1 - (node_filesystem_avail_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"})) * 100",
                "legendFormat": "{{instance}}:{{mountpoint}}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "max": 100,
                "min": 0,
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 80},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 12}
          },
          {
            "id": 5,
            "title": "Network I/O",
            "type": "timeseries",
            "targets": [
              {
                "expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*\"}[5m])",
                "legendFormat": "{{instance}}:{{device}} - Receive"
              },
              {
                "expr": "rate(node_network_transmit_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*\"}[5m])",
                "legendFormat": "{{instance}}:{{device}} - Transmit"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "Bps"
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 12}
          }
        ]
      }
    }
    JSON

    Application Performance Dashboard

    {
      "dashboard": {
        "id": null,
        "title": "Application Performance",
        "description": "Application performance metrics and SLIs",
        "tags": ["application", "performance", "sli"],
        "timezone": "browser", 
        "refresh": "30s",
        "templating": {
          "list": [
            {
              "name": "service",
              "type": "query",
              "query": "label_values(http_requests_total, service)",
              "refresh": 1,
              "multi": true,
              "includeAll": true
            },
            {
              "name": "environment",
              "type": "query", 
              "query": "label_values(http_requests_total, environment)",
              "refresh": 1
            }
          ]
        },
        "panels": [
          {
            "id": 1,
            "title": "Request Rate",
            "type": "timeseries",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service)",
                "legendFormat": "{{service}}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "reqps"
              }
            },
            "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "Error Rate",
            "type": "timeseries",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{status=~\"[45]..\",service=~\"$service\",environment=\"$environment\"}[5m])) by (service) / sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service) * 100",
                "legendFormat": "{{service}}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "max": 100,
                "min": 0,
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 1},
                    {"color": "red", "value": 5}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0}
          },
          {
            "id": 3,
            "title": "Response Time (95th percentile)",
            "type": "timeseries",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (service, le))",
                "legendFormat": "{{service}}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "s",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 0.5},
                    {"color": "red", "value": 1}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0}
          },
          {
            "id": 4,
            "title": "Response Time Heatmap",
            "type": "heatmap",
            "targets": [
              {
                "expr": "sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le)",
                "format": "heatmap",
                "legendFormat": "{{le}}"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
          },
          {
            "id": 5,
            "title": "Top Endpoints by Request Count",
            "type": "table",
            "targets": [
              {
                "expr": "topk(10, sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (endpoint))",
                "format": "table",
                "instant": true
              }
            ],
            "transformations": [
              {
                "id": "organize",
                "options": {
                  "excludeByName": {
                    "Time": true,
                    "__name__": true
                  },
                  "renameByName": {
                    "Value": "Requests/sec",
                    "endpoint": "Endpoint"
                  }
                }
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
          }
        ]
      }
    }
    JSON

    Advanced Visualization Techniques

    Custom Annotations

    {
      "annotations": {
        "list": [
          {
            "name": "Deployments",
            "datasource": "Prometheus",
            "enable": true,
            "expr": "increase(prometheus_config_last_reload_success_timestamp_seconds[1m]) > 0",
            "iconColor": "green",
            "titleFormat": "Config Reload",
            "textFormat": "Prometheus configuration reloaded"
          },
          {
            "name": "Alerts",
            "datasource": "Prometheus", 
            "enable": true,
            "expr": "ALERTS{alertstate=\"firing\"}",
            "iconColor": "red",
            "titleFormat": "{{alertname}}",
            "textFormat": "{{summary}}"
          }
        ]
      }
    }
    JSON

    Value Mappings and Overrides

    {
      "fieldConfig": {
        "defaults": {
          "mappings": [
            {
              "options": {
                "0": {"text": "Healthy", "color": "green"},
                "1": {"text": "Warning", "color": "yellow"},
                "2": {"text": "Critical", "color": "red"}
              },
              "type": "value"
            },
            {
              "options": {
                "from": 0,
                "to": 50,
                "result": {"text": "Low", "color": "green"}
              },
              "type": "range"
            }
          ]
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "Critical Services"
            },
            "properties": [
              {
                "id": "color",
                "value": {"mode": "fixed", "fixedColor": "red"}
              },
              {
                "id": "custom.displayMode",
                "value": "color-background"
              }
            ]
          }
        ]
      }
    }
    JSON

    Dynamic Thresholds

    {
      "targets": [
        {
          "expr": "avg(response_time_seconds)",
          "refId": "A"
        },
        {
          "expr": "avg(response_time_seconds) + 2 * stddev(response_time_seconds)",
          "refId": "B",
          "hide": true
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "red", "value": "${B}"}
            ]
          }
        }
      }
    }
    JSON

    Dashboard Organization and Management

    Folder Structure

    Dashboards/
    ├── Overview/
    │   ├── System Overview
    │   ├── Application Overview
    │   └── Business Metrics
    ├── Infrastructure/
    │   ├── Node Metrics
    │   ├── Network Performance
    │   └── Storage Performance
    ├── Applications/
    │   ├── Frontend Service
    │   ├── Backend Services
    │   └── Database Performance
    ├── Troubleshooting/
    │   ├── Error Analysis
    │   ├── Performance Deep Dive
    │   └── Debug Dashboard
    └── Business/
        ├── User Metrics
        ├── Revenue Tracking
        └── KPI Dashboard
    Text

    Tagging Strategy

    {
      "dashboard": {
        "tags": [
          "infrastructure", 
          "monitoring", 
          "production",
          "team:platform",
          "level:l1"
        ],
        "title": "Production Infrastructure Overview",
        "description": "L1 monitoring dashboard for production infrastructure"
      }
    }
    JSON

    Dashboard Links

    {
      "links": [
        {
          "title": "System Overview",
          "url": "/d/system-overview/system-overview",
          "type": "dashboards",
          "icon": "dashboard"
        },
        {
          "title": "Runbook",
          "url": "https://runbooks.company.com/infrastructure",
          "type": "link",
          "targetBlank": true,
          "icon": "doc"
        },
        {
          "title": "Alert Manager",
          "url": "http://alertmanager:9093",
          "type": "link",
          "targetBlank": true,
          "icon": "bell"
        }
      ]
    }
    JSON

    Performance Optimization for Dashboards

    Query Optimization

    # Inefficient - multiple queries
    sum(rate(http_requests_total[5m])) by (service)
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
    sum(rate(http_requests_total{status=~"4.."}[5m])) by (service)
    
    # Better - single query with grouping
    sum(rate(http_requests_total[5m])) by (service, status)
    INI

    Using Recording Rules for Heavy Queries

    # recording_rules.yml
    groups:
      - name: dashboard_optimization
        interval: 30s
        rules:
          - record: dashboard:request_rate:5m
            expr: sum(rate(http_requests_total[5m])) by (service)
    
          - record: dashboard:error_rate:5m
            expr: |
              sum(rate(http_requests_total{status=~"[45].."}[5m])) by (service) /
              sum(rate(http_requests_total[5m])) by (service)
    YAML
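    Dashboard panels can then reference the precomputed series instead of re-evaluating the heavy expressions on every refresh:

    ```promql
    # Same data as before, but a cheap lookup at dashboard render time
    dashboard:request_rate:5m{service=~"$service"}

    # Error ratio, already computed by the recording rule
    dashboard:error_rate:5m{service=~"$service"}
    ```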

    Dashboard Caching Configuration

    # grafana.ini
    [caching]
    enabled = true
    
    [database]
    query_cache_enabled = true
    query_cache_size = 100MB
    query_cache_ttl = 300s
    INI

    Alerting Integration

    Alert Panel Configuration

    {
      "id": 6,
      "title": "Active Alerts",
      "type": "alertlist",
      "options": {
        "showOptions": "current",
        "maxItems": 20,
        "sortOrder": 1,
        "dashboardAlerts": false,
        "alertInstanceLabelFilter": "",
        "dashboardTitle": "",
        "folderId": null,
        "tags": []
      },
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16}
    }
    JSON

    Conditional Formatting Based on Alerts

    {
      "fieldConfig": {
        "overrides": [
          {
            "matcher": {
              "id": "byFrameRefID",
              "options": "Alerts"
            },
            "properties": [
              {
                "id": "custom.displayMode",
                "value": "color-background"
              },
              {
                "id": "mappings",
                "value": [
                  {
                    "options": {
                      "0": {"text": "OK", "color": "green"},
                      "1": {"text": "ALERT", "color": "red"}
                    },
                    "type": "value"
                  }
                ]
              }
            ]
          }
        ]
      }
    }
    JSON

    Export and Import Strategies

    Dashboard Export Script

    #!/bin/bash
    # scripts/export-dashboards.sh

    GRAFANA_URL="http://localhost:3000"
    GRAFANA_USER="admin"
    GRAFANA_PASS="admin123"

    mkdir -p dashboards

    # List all dashboard UIDs, then fetch and save each one
    curl -s -u "$GRAFANA_USER:$GRAFANA_PASS" \
      "$GRAFANA_URL/api/search?type=dash-db" | \
      jq -r '.[] | .uid' | \
      while read -r uid; do
        echo "Exporting dashboard: $uid"
        curl -s -u "$GRAFANA_USER:$GRAFANA_PASS" \
          "$GRAFANA_URL/api/dashboards/uid/$uid" | \
          jq '.dashboard' > "dashboards/${uid}.json"
      done
    Bash
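    Before committing exported JSON to version control, it can help to lint it for common mistakes. A small sketch follows; the script name and the specific checks are illustrative, not part of Grafana's tooling:

    ```python
    #!/usr/bin/env python3
    # scripts/check-dashboards.py (hypothetical companion to the export script)
    # Sanity-check exported dashboard JSON before committing it.
    import json
    from pathlib import Path


    def check_dashboard(dash: dict) -> list[str]:
        """Return a list of problems found in one dashboard dict."""
        problems = []
        if not dash.get("title"):
            problems.append("missing title")
        panels = dash.get("panels", [])
        # Panel ids must be unique within a dashboard
        ids = [p.get("id") for p in panels]
        if len(ids) != len(set(ids)):
            problems.append("duplicate panel ids")
        # Every non-row panel should have at least one query target
        for p in panels:
            if p.get("type") != "row" and not p.get("targets"):
                problems.append(f"panel {p.get('id')} has no queries")
        return problems


    if __name__ == "__main__":
        for path in Path("dashboards").glob("*.json"):
            for msg in check_dashboard(json.loads(path.read_text())):
                print(f"{path}: {msg}")
    ```

    Run it from the repository root after the export script; it prints one line per problem found.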

    Dashboard Import with Provisioning

    # grafana/provisioning/dashboards/dashboards.yml
    apiVersion: 1
    
    providers:
      - name: 'infrastructure'
        orgId: 1
        folder: 'Infrastructure'
        type: file
        disableDeletion: false
        editable: true
        updateIntervalSeconds: 10
        options:
          path: /etc/grafana/provisioning/dashboards/infrastructure
    
      - name: 'applications'
        orgId: 1
        folder: 'Applications'
        type: file
        disableDeletion: false
        editable: true
        updateIntervalSeconds: 10
        options:
          path: /etc/grafana/provisioning/dashboards/applications
    YAML

    Chapter 6 Summary

    Grafana provides powerful visualization capabilities for Prometheus metrics through various panel types, template variables, and advanced features. Effective dashboard design follows information hierarchy principles, uses appropriate panel types for different data, and optimizes queries for performance. Dashboard organization, alerting integration, and automation through provisioning enable scalable monitoring visualization.

    Hands-on Exercise

    1. Dashboard Creation:
      • Create an infrastructure overview dashboard
      • Add template variables for dynamic filtering
      • Implement different panel types (stat, timeseries, table, heatmap)
    2. Advanced Features:
      • Set up annotations for deployments and alerts
      • Configure custom thresholds and value mappings
      • Create dashboard links and navigation
    3. Optimization and Management:
      • Optimize queries using recording rules
      • Organize dashboards with folders and tags
      • Set up automated dashboard provisioning

    7. Prometheus in Kubernetes

    Service Discovery in Kubernetes

    Kubernetes provides rich metadata that Prometheus can use for automatic service discovery, eliminating the need for manual target configuration.

    graph TB
        A[Kubernetes API] --> B[Prometheus]
        B --> C[Pods]
        B --> D[Services]
        B --> E[Endpoints]
        B --> F[Nodes]
    
        C --> G[App Metrics]
        D --> H[Service Metrics]
        E --> I[Endpoint Metrics]
        F --> J[Node Metrics]

    Kubernetes SD Configuration

    # prometheus.yml for Kubernetes
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    scrape_configs:
      # Scrape Kubernetes API server
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
    
      # Scrape Kubernetes nodes
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics
    
      # Scrape pods with prometheus.io annotations
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Only scrape pods with prometheus.io/scrape annotation
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
    
          # Use custom path if specified
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
    
          # Use custom port if specified
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
    
          # Add Kubernetes metadata as labels
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
    
      # Scrape services with prometheus.io annotations
      - job_name: 'kubernetes-services'
        kubernetes_sd_configs:
          - role: service
        metrics_path: /probe
        params:
          module: [http_2xx]
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
            action: keep
            regex: true
          - source_labels: [__address__]
            target_label: __param_target
          - target_label: __address__
            replacement: blackbox-exporter:9115
          - source_labels: [__param_target]
            target_label: instance
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            target_label: kubernetes_name
    YAML

    Using kube-state-metrics

    kube-state-metrics generates metrics about Kubernetes object states, providing cluster-level visibility.

    Installing kube-state-metrics

    # kube-state-metrics.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: kube-state-metrics
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: kube-state-metrics
      template:
        metadata:
          labels:
            app: kube-state-metrics
        spec:
          serviceAccountName: kube-state-metrics
          containers:
          - name: kube-state-metrics
            image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.6.0
            ports:
            - containerPort: 8080
              name: http-metrics
            - containerPort: 8081
              name: telemetry
            readinessProbe:
              httpGet:
                path: /
                port: 8081
              initialDelaySeconds: 5
              timeoutSeconds: 5
    
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: kube-state-metrics
      namespace: kube-system
    
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: kube-state-metrics
    rules:
    - apiGroups: [""]
      resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
      verbs: ["list", "watch"]
    - apiGroups: ["apps"]
      resources:
      - statefulsets
      - daemonsets
      - deployments
      - replicasets
      verbs: ["list", "watch"]
    
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: kube-state-metrics
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: kube-state-metrics
    subjects:
    - kind: ServiceAccount
      name: kube-state-metrics
      namespace: kube-system
    YAML

    Key kube-state-metrics Metrics

    # Pod status metrics
    kube_pod_status_phase{phase="Running"}
    kube_pod_status_ready{condition="true"}
    kube_pod_container_status_restarts_total
    
    # Deployment metrics
    kube_deployment_status_replicas_available
    kube_deployment_status_replicas_unavailable
    
    # Node metrics
    kube_node_status_condition{condition="Ready", status="true"}
    kube_node_spec_unschedulable
    
    # Resource requests and limits
    kube_pod_container_resource_requests
    kube_pod_container_resource_limits
    
    # Namespace resource quotas
    kube_resourcequota
    INI
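    These state metrics combine naturally in PromQL. A few illustrative queries, assuming the default kube-state-metrics label names:

    ```promql
    # Pods that restarted more than 3 times in the last hour
    increase(kube_pod_container_status_restarts_total[1h]) > 3

    # Deployments with fewer available replicas than desired
    kube_deployment_status_replicas_available < kube_deployment_spec_replicas

    # Pods stuck in a non-running phase, grouped by namespace
    sum by (namespace, phase) (kube_pod_status_phase{phase=~"Pending|Failed|Unknown"}) > 0
    ```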

    Prometheus Operator and CRDs

    The Prometheus Operator simplifies Prometheus deployment and management in Kubernetes through Custom Resource Definitions (CRDs).

    Installing Prometheus Operator

    # Using Helm
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install prometheus-operator prometheus-community/kube-prometheus-stack
    Bash

    Custom Resource Examples

    Prometheus CR
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      serviceAccountName: prometheus
      serviceMonitorSelector:
        matchLabels:
          team: frontend
      ruleSelector:
        matchLabels:
          prometheus: kube-prometheus
          role: alert-rules
      resources:
        requests:
          memory: 400Mi
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: fast-ssd
            resources:
              requests:
                storage: 50Gi
      retention: 30d
      retentionSize: 45GB
    YAML
    ServiceMonitor CR
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: my-app-monitor
      namespace: monitoring
      labels:
        team: frontend
    spec:
      selector:
        matchLabels:
          app: my-app
      endpoints:
      - port: metrics
        interval: 30s
        path: /metrics
        honorLabels: true
      namespaceSelector:
        matchNames:
        - production
        - staging
    YAML
    PrometheusRule CR
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: my-app-rules
      namespace: monitoring
      labels:
        prometheus: kube-prometheus
        role: alert-rules
    spec:
      groups:
      - name: my-app.rules
        rules:
        - alert: MyAppHighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{job="my-app", status=~"5.."}[5m])) /
              sum(rate(http_requests_total{job="my-app"}[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate in my-app"
            description: "Error rate is {{ $value | humanizePercentage }}"
    YAML

    Best Practices for Monitoring Kubernetes Workloads

    Pod Annotations for Scraping

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "8080"
            prometheus.io/path: "/metrics"
        spec:
          containers:
          - name: my-app
            image: my-app:latest
            ports:
            - containerPort: 8080
              name: metrics
    YAML
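    These annotations are only a convention: Prometheus does not honor them by itself, so the scrape job must translate them via relabeling. A minimal sketch, based on the common community configuration (the job name is an assumption):

    ```yaml
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Only scrape pods that opted in via the annotation
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
          # Honor a custom metrics path, if set
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          # Honor a custom port, if set
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
    ```

    With this in place, any pod carrying `prometheus.io/scrape: "true"` is discovered automatically; pods without it are ignored.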

    Resource Monitoring Queries

    # CPU usage by pod
    sum by (pod) (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))
    
    # Memory usage by pod
    sum by (pod) (container_memory_working_set_bytes{container!="POD",container!=""})
    
    # Pod restart rate
    increase(kube_pod_container_status_restarts_total[1h])
    
    # Pods not ready
    kube_pod_status_ready{condition="false"}
    
    # Node CPU usage
    (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100
    
    # Node memory usage
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    
    # Persistent Volume usage
    (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100
    INI

    Kubernetes Alerting Rules

    # k8s-alerts.yml
    groups:
      - name: kubernetes-alerts
        rules:
          - alert: KubePodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod is crash looping"
              description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting {{ $value | humanize }} times per 15 minutes"
    
          - alert: KubePodNotReady
            expr: kube_pod_status_ready{condition="false"} == 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod has been in not ready state for more than 15 minutes"
              description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes"
    
          - alert: KubeDeploymentGenerationMismatch
            expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Deployment generation mismatch"
              description: "Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match"
    
          - alert: KubeNodeNotReady
            expr: kube_node_status_condition{condition="Ready",status="true"} == 0
            for: 15m
            labels:
              severity: critical
            annotations:
              summary: "Node is not ready"
              description: "Node {{ $labels.node }} has been unready for more than 15 minutes"
    
          - alert: KubeDaemonSetRolloutStuck
            expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled < 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "DaemonSet rollout is stuck"
              description: "Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready"
    YAML

    Network Policy Monitoring

    # Example application with network policies
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: my-app-netpol
      namespace: production
    spec:
      podSelector:
        matchLabels:
          app: my-app
      policyTypes:
      - Ingress
      - Egress
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              name: production
        ports:
        - protocol: TCP
          port: 8080
      egress:
      - to:
        - namespaceSelector:
            matchLabels:
              name: production
        ports:
        - protocol: TCP
          port: 5432
    YAML

    Chapter 7 Summary

    Prometheus integrates seamlessly with Kubernetes through service discovery, automatically finding and monitoring pods, services, and nodes. kube-state-metrics provides cluster-level visibility, while the Prometheus Operator simplifies deployment through CRDs. Proper annotation strategies and resource monitoring ensure comprehensive Kubernetes observability.

    Hands-on Exercise

    1. Service Discovery Setup:
      • Configure Prometheus for Kubernetes service discovery
      • Deploy applications with proper annotations
      • Verify automatic target discovery
    2. kube-state-metrics:
      • Install and configure kube-state-metrics
      • Create queries for cluster health monitoring
      • Build dashboards for Kubernetes resources
    3. Prometheus Operator:
      • Deploy Prometheus using the operator
      • Create ServiceMonitor and PrometheusRule resources
      • Test the operator’s automated configuration management

    8. Scaling and Performance

    Federation and Hierarchical Prometheus Setups

    Federation allows Prometheus servers to scrape selected time series from other Prometheus servers, enabling hierarchical monitoring architectures.

    graph TB
        A[Global Prometheus] --> B[Regional Prometheus US]
        A --> C[Regional Prometheus EU]
        A --> D[Regional Prometheus APAC]
    
        B --> E[Cluster Prometheus US-1]
        B --> F[Cluster Prometheus US-2]
    
        C --> G[Cluster Prometheus EU-1]
        C --> H[Cluster Prometheus EU-2]
    
        D --> I[Cluster Prometheus APAC-1]

    Federation Configuration

    # Global Prometheus configuration
    scrape_configs:
      - job_name: 'federate'
        scrape_interval: 15s
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job=~"prometheus|node-exporter"}'
            - '{__name__=~"job:.*"}'
            - '{__name__=~"instance:.*"}'
        static_configs:
          - targets:
            - 'us-prometheus:9090'
            - 'eu-prometheus:9090'
            - 'apac-prometheus:9090'
    
      # Aggregate high-level metrics
      - job_name: 'federate-aggregates'
        scrape_interval: 30s
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{__name__=~"cluster:.*"}'
            - '{__name__=~"region:.*"}'
        static_configs:
          - targets:
            - 'us-prometheus:9090'
            - 'eu-prometheus:9090'
            - 'apac-prometheus:9090'
    YAML

    Recording Rules for Federation

    # Regional Prometheus recording rules
    groups:
      - name: cluster_aggregates
        interval: 30s
        rules:
          - record: cluster:cpu_usage:avg
            expr: avg by (cluster) (instance:cpu_usage:rate5m)
    
          - record: cluster:memory_usage:avg
            expr: avg by (cluster) (instance:memory_usage:percentage)
    
          - record: cluster:disk_usage:avg
            expr: avg by (cluster) (instance:disk_usage:percentage)
    
      - name: region_aggregates
        interval: 60s
        rules:
          - record: region:request_rate:sum
            expr: sum by (region) (cluster:request_rate:sum)
    
          - record: region:error_rate:avg
            expr: avg by (region) (cluster:error_rate:avg)
    YAML

    Remote Storage Integrations

    Remote storage solutions provide long-term storage and horizontal scalability for Prometheus metrics.

    Thanos Integration

    Thanos adds long-term retention in object storage, a global query view across multiple Prometheus instances, and downsampling for historical data.

    # Prometheus with Thanos sidecar
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: prometheus
    spec:
      serviceName: prometheus
      replicas: 1
      template:
        spec:
          containers:
          - name: prometheus
            image: prom/prometheus:latest
            args:
              - '--config.file=/etc/prometheus/prometheus.yml'
              - '--storage.tsdb.path=/prometheus'
              - '--storage.tsdb.retention.time=2h'
              - '--storage.tsdb.min-block-duration=2h'
              - '--storage.tsdb.max-block-duration=2h'
              - '--web.enable-lifecycle'
            ports:
            - containerPort: 9090
            volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus
    
          - name: thanos-sidecar
            image: thanosio/thanos:latest
            args:
              - sidecar
              - --tsdb.path=/prometheus
              - --prometheus.url=http://localhost:9090
              - --objstore.config-file=/etc/thanos/objstore.yml
            ports:
            - containerPort: 10901
            - containerPort: 10902
            volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus
            - name: thanos-objstore-config
              mountPath: /etc/thanos
    
          volumes:
          - name: thanos-objstore-config
            secret:
              secretName: thanos-objstore-config
    YAML
    # Thanos objstore configuration
    # objstore.yml
    type: S3
    config:
      bucket: "thanos-metrics"
      endpoint: "s3.amazonaws.com"
      access_key: "ACCESS_KEY"
      secret_key: "SECRET_KEY"
      insecure: false
    YAML
    # Thanos query deployment
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: thanos-query
    spec:
      replicas: 2
      template:
        spec:
          containers:
          - name: thanos-query
            image: thanosio/thanos:latest
            args:
              - query
              - --store=prometheus-0.prometheus:10901
              - --store=prometheus-1.prometheus:10901
              - --store=thanos-store:10901
            ports:
            - containerPort: 10902
    YAML

    VictoriaMetrics Integration

    VictoriaMetrics provides high-performance storage and querying.

    # VictoriaMetrics deployment
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: victoriametrics
    spec:
      replicas: 1
      template:
        spec:
          containers:
          - name: victoriametrics
            image: victoriametrics/victoria-metrics:latest
            args:
              - '--storageDataPath=/victoria-metrics-data'
              - '--retentionPeriod=12'  # months (VictoriaMetrics default unit)
              - '--httpListenAddr=:8428'
            ports:
            - containerPort: 8428
            volumeMounts:
            - name: storage
              mountPath: /victoria-metrics-data
    YAML
    # Prometheus remote write configuration
    remote_write:
      - url: "http://victoriametrics:8428/api/v1/write"
        queue_config:
          max_samples_per_send: 10000
          batch_send_deadline: 5s
          max_shards: 20
    YAML
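    Once remote write is enabled, the sending pipeline itself is worth watching. Prometheus exposes its own remote-storage metrics; the queries below are a starting point:

    ```promql
    # Samples queued but not yet sent (sustained growth means the remote end can't keep up)
    prometheus_remote_storage_samples_pending

    # Failed and dropped samples
    rate(prometheus_remote_storage_samples_failed_total[5m])
    rate(prometheus_remote_storage_samples_dropped_total[5m])

    # Current shard count vs. the configured maximum (saturation of the send queue)
    prometheus_remote_storage_shards
    prometheus_remote_storage_shards_max
    ```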

    Cortex Configuration

    # Cortex configuration
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cortex-config
    data:
      cortex.yml: |
        server:
          http_listen_port: 9009
          grpc_listen_port: 9095
    
        distributor:
          ring:
            kvstore:
              store: consul
              consul:
                host: consul:8500
    
        ingester:
          lifecycler:
            ring:
              kvstore:
                store: consul
                consul:
                  host: consul:8500
              replication_factor: 3
    
        storage:
          engine: blocks
    
        blocks_storage:
          backend: s3
          s3:
            endpoint: s3.amazonaws.com
            bucket_name: cortex-blocks
            access_key_id: ACCESS_KEY
            secret_access_key: SECRET_KEY
    YAML

    Retention Policies and Storage Tuning

    Prometheus Storage Configuration

    # Prometheus with optimized storage settings
    args:
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.retention.size=50GB'
      - '--storage.tsdb.wal-compression'
      - '--storage.tsdb.min-block-duration=2h'
      - '--storage.tsdb.max-block-duration=2h'
      - '--web.enable-admin-api'
    YAML

    Storage Optimization Strategies

    # Monitor Prometheus storage metrics
    prometheus_tsdb_symbol_table_size_bytes
    prometheus_tsdb_head_series
    prometheus_tsdb_compaction_duration_seconds
    prometheus_config_last_reload_successful
    
    # Storage utilization
    prometheus_tsdb_wal_storage_size_bytes
    prometheus_tsdb_storage_blocks_bytes
    
    # Query performance
    prometheus_engine_query_duration_seconds
    prometheus_engine_queries_concurrent_max
    INI

    Handling High Cardinality Metrics

    Cardinality Analysis

    # Find high cardinality metrics
    topk(10, count by (__name__)({__name__!=""}))
    
    # Series count by job
    count by (job) ({__name__!=""})
    
    # Label cardinality analysis
    count by (__name__) (group by (__name__, instance) ({__name__!=""}))
    INI
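    The same information is available programmatically from the TSDB status endpoint (`/api/v1/status/tsdb`), which is handy for automated cardinality reports. A sketch, assuming a reachable Prometheus instance:

    ```python
    import json
    import urllib.request

    def fetch_tsdb_status(base_url: str) -> dict:
        """Fetch head-block cardinality stats from Prometheus's TSDB status API."""
        with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
            return json.load(resp)["data"]

    def top_metrics_by_series(status: dict, limit: int = 10) -> list:
        """Return (metric_name, series_count) pairs, highest cardinality first."""
        entries = status.get("seriesCountByMetricName", [])
        return [(e["name"], e["value"]) for e in entries[:limit]]
    ```

    Usage: `top_metrics_by_series(fetch_tsdb_status("http://localhost:9090"))` returns the same top-10 list the `topk` query above produces, without loading the query engine.
    
    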

    Cardinality Management Strategies

    # Metric relabeling to reduce cardinality
    metric_relabel_configs:
      # Mark the histogram whose buckets we want to thin out
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
        target_label: __tmp_bucket_drop
        replacement: 'true'
      # Keep only the selected buckets of that histogram; all other series pass through
      - source_labels: [__tmp_bucket_drop, le]
        regex: 'true;(0.005|0.01|0.025|0.05|0.1|0.25|0.5|1|2.5|5|10|\+Inf)|;.*'
        action: keep
      - regex: '__tmp_bucket_drop'
        action: labeldrop
    
      # Limit user agent variations
      - source_labels: [user_agent]
        regex: '(.*Chrome.*|.*Firefox.*|.*Safari.*)'
        target_label: user_agent_family
        replacement: '${1}'
      - source_labels: [user_agent]
        regex: '.*'
        target_label: user_agent_family
        replacement: 'other'
      - regex: 'user_agent'
        action: labeldrop
    YAML

    Recording Rules for High Cardinality

    # Aggregate high cardinality metrics
    groups:
      - name: cardinality_reduction
        interval: 30s
        rules:
          # Aggregate by service instead of instance
          - record: service:request_rate:sum
            expr: sum by (service) (rate(http_requests_total[5m]))
    
          # Aggregate errors by service (4xx and 5xx combined)
          - record: service:error_rate:sum
            expr: |
              sum by (service) (
                rate(http_requests_total{status=~"[45].."}[5m])
              )
            labels:
              status_class: "4xx_5xx"
    
          # Remove detailed path information
          - record: service:request_duration:p95
            expr: |
              histogram_quantile(0.95,
                sum by (service, le) (
                  rate(http_request_duration_seconds_bucket[5m])
                )
              )
    YAML

    Performance Optimization

    Query Optimization

    # Inefficient query - scans all time series
    {__name__=~"http_.*"}
    
    # Better - specific metric with labels
    http_requests_total{job="my-service"}
    
    # Inefficient - regex on high cardinality label
    http_requests_total{instance=~".*prod.*"}
    
    # Better - exact match or limited regex
    http_requests_total{environment="production"}
    
    # Use recording rules for complex calculations
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
    # Replace with:
    http_request_duration:p95
    INI

    Memory and CPU Tuning

    # Prometheus resource optimization
    resources:
      requests:
        memory: "4Gi"
        cpu: "1000m"
      limits:
        memory: "8Gi"
        cpu: "2000m"
    
    # JVM tuning for Java exporters
    env:
      - name: JAVA_OPTS
        value: "-Xmx1g -Xms1g -XX:+UseG1GC"
    YAML

    Monitoring Prometheus Performance

    # Prometheus performance dashboard queries
    panels:
      - title: "Ingestion Rate"
        expr: "rate(prometheus_tsdb_head_samples_appended_total[5m])"
    
      - title: "Active Series"
        expr: "prometheus_tsdb_head_series"
    
      - title: "Query Duration"
        expr: "histogram_quantile(0.99, rate(prometheus_engine_query_duration_seconds_bucket[5m]))"
    
      - title: "Memory Usage"
        expr: "process_resident_memory_bytes"
    
      - title: "WAL Truncations"
        expr: "rate(prometheus_tsdb_wal_truncations_total[5m])"
    
      - title: "Compaction Duration"
        expr: "rate(prometheus_tsdb_compaction_duration_seconds_sum[5m])"
    YAML

    Chapter 8 Summary

    Scaling Prometheus involves federation for hierarchical setups, remote storage for long-term retention, and careful cardinality management. Performance optimization requires query tuning, resource allocation, and monitoring of Prometheus itself. Remote storage solutions like Thanos, VictoriaMetrics, and Cortex provide different approaches to horizontal scaling.

    Hands-on Exercise

    1. Federation Setup:
      • Create a hierarchical Prometheus setup with federation
      • Configure recording rules for aggregation
      • Test cross-instance querying
    2. Remote Storage:
      • Implement remote write to VictoriaMetrics or Thanos
      • Configure retention policies
      • Compare query performance
    3. Performance Optimization:
      • Analyze cardinality in your metrics
      • Implement relabeling to reduce cardinality
      • Create recording rules for expensive queries

    9. Best Practices and Pitfalls

    Designing Effective Metrics

    The Four Golden Signals

    Focus on these key metrics for any system:

    1. Latency: Time to process requests
    2. Traffic: Amount of demand on the system
    3. Errors: Rate of failed requests
    4. Saturation: Resource utilization
    # Latency - 95th percentile response time
    histogram_quantile(0.95, sum by (service) (rate(http_request_duration_seconds_bucket[5m])))
    
    # Traffic - Request rate
    sum by (service) (rate(http_requests_total[5m]))
    
    # Errors - Error rate
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / 
    sum by (service) (rate(http_requests_total[5m]))
    
    # Saturation - CPU utilization
    avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))

    USE Method for Resources

    For every resource, monitor:

    • Utilization: How busy the resource is
    • Saturation: Extra work queued
    • Errors: Error events
    # CPU Utilization
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    
    # CPU Saturation
    node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})
    
    # Memory Utilization
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    
    # Memory Saturation
    rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])
    
    # Disk Utilization
    rate(node_disk_io_time_seconds_total[5m]) * 100
    
    # Disk Saturation
    rate(node_disk_io_time_weighted_seconds_total[5m])
    
    # Network Utilization
    rate(node_network_transmit_bytes_total[5m]) + rate(node_network_receive_bytes_total[5m])
    
    # Network Errors
    rate(node_network_transmit_errs_total[5m]) + rate(node_network_receive_errs_total[5m])
    INI

    RED Method for Services

    For every service, monitor:

    • Rate: Requests per second
    • Errors: Failed requests per second
    • Duration: Response time distribution
    # Rate
    sum by (service) (rate(http_requests_total[5m]))
    
    # Errors
    sum by (service) (rate(http_requests_total{status=~"[45].."}[5m]))
    
    # Duration
    histogram_quantile(0.50, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
    histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
    histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
    INI

    Avoiding Cardinality Explosions

    Common Cardinality Pitfalls

    // BAD: User ID as label (unbounded cardinality)
    requestsTotal := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
        },
        []string{"method", "endpoint", "user_id"}, // user_id is unbounded!
    )
    
    // GOOD: Remove user_id or aggregate it
    requestsTotal := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
        },
        []string{"method", "endpoint", "user_type"}, // bounded categories
    )
    
    // BAD: Full URL path as label
    errorCounter := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "errors_total",
        },
        []string{"full_path"}, // /user/123/profile, /user/456/profile, etc.
    )
    
    // GOOD: Parameterized path
    errorCounter := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "errors_total",
        },
        []string{"path_template"}, // /user/:id/profile
    )
    Go

    Label Guidelines

    # Good label practices
    labels:
      # Use bounded categorical values
      environment: ["production", "staging", "development"]
      region: ["us-east-1", "us-west-2", "eu-west-1"]
      service: ["frontend", "backend", "database"]
    
      # Avoid unbounded values
      # ❌ user_id: "12345"
      # ❌ session_id: "abc-def-123"
      # ❌ full_url: "/api/users/12345/posts/67890"
    
      # Use bounded alternatives
      # ✅ user_type: "premium"
      # ✅ endpoint: "/api/users/:id/posts/:id"
      # ✅ status_class: "2xx"
    YAML

    Cardinality Monitoring

    # Monitor series count by job
    count by (job) ({__name__!=""})
    
    # Find metrics with highest cardinality
    topk(10, count by (__name__) ({__name__!=""}))
    
    # Monitor label value counts
    count by (__name__, status) (http_requests_total)
    
    # Alert on high cardinality
    count by (__name__) ({__name__!=""}) > 10000
    INI

    Setting SLOs and SLIs with Prometheus

    Defining SLIs (Service Level Indicators)

    # Example SLI definitions
    slis:
      availability:
        description: "Percentage of successful requests"
        query: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) * 100
        target: "> 99.9%"
    
      latency:
        description: "95th percentile response time"
        query: |
          histogram_quantile(0.95, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )
        target: "< 200ms"
    
      error_rate:
        description: "Rate of 5xx errors"
        query: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) * 100
        target: "< 0.1%"
    YAML

    SLO Implementation

    # SLO recording rules
    groups:
      - name: slo_rules
        interval: 30s
        rules:
          # Error rate SLI
          - record: sli:error_rate
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m])) /
              sum(rate(http_requests_total[5m]))
    
          # Availability SLI
          - record: sli:availability
            expr: |
              sum(rate(http_requests_total{status!~"5.."}[5m])) /
              sum(rate(http_requests_total[5m]))
    
          # Latency SLI
          - record: sli:latency:p95
            expr: |
              histogram_quantile(0.95,
                sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
              )
    
          # Error budget remaining (30-day window, 99.9% target)
          - record: slo:error_budget_remaining:30d
            expr: |
              1 - ((1 - avg_over_time(sli:availability[30d])) / (1 - 0.999))
    YAML

    SLO Alerting

    # SLO alerting rules
    groups:
      - name: slo_alerts
        rules:
          # Fast burn rate (1 hour)
          - alert: SLOErrorBudgetBurnRateFast
            expr: |
              sli:error_rate > (14.4 * (1 - 0.999)) and
              avg_over_time(sli:error_rate[1h]) > (14.4 * (1 - 0.999))
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Fast SLO burn rate detected"
              description: "Error rate is consuming error budget 14.4x faster than sustainable"
    
          # Slow burn rate (6 hours)
          - alert: SLOErrorBudgetBurnRateSlow
            expr: |
              sli:error_rate > (6 * (1 - 0.999)) and
              avg_over_time(sli:error_rate[6h]) > (6 * (1 - 0.999))
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Slow SLO burn rate detected"
              description: "Error rate is consuming error budget 6x faster than sustainable"
    YAML
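    The 14.4x and 6x factors follow from simple arithmetic over the 30-day budget; a quick sketch of where the thresholds come from:

    ```python
    SLO_TARGET = 0.999
    ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail over 30 days

    def alert_threshold(burn_rate: float) -> float:
        """Error rate at which budget burns `burn_rate` times faster than sustainable."""
        return burn_rate * ERROR_BUDGET

    def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
        """Days until the whole budget is consumed at a constant burn rate."""
        return window_days / burn_rate

    # A 14.4x burn rate exhausts a 30-day budget in about 2 days, which is why
    # it pages immediately; 6x leaves roughly 5 days of slack.
    print(round(alert_threshold(14.4), 4))    # 0.0144 -> alert above 1.44% errors
    print(round(days_to_exhaustion(14.4), 1)) # 2.1
    print(round(alert_threshold(6.0), 4))     # 0.006 -> alert above 0.6% errors
    print(days_to_exhaustion(6.0))            # 5.0
    ```
    
    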

    Case Studies from Real-World Systems

    Case Study 1: E-commerce Platform

    Challenge: Monitor checkout flow reliability
    Solution: Multi-step funnel monitoring

    # Checkout funnel metrics
    checkout_funnel_step_total{step="cart_view"}
    checkout_funnel_step_total{step="checkout_start"}
    checkout_funnel_step_total{step="payment_submit"}
    checkout_funnel_step_total{step="order_complete"}
    
    # Conversion rates
    rate(checkout_funnel_step_total{step="checkout_start"}[5m]) /
    rate(checkout_funnel_step_total{step="cart_view"}[5m])
    
    # Payment failure rate
    rate(checkout_funnel_step_total{step="payment_failed"}[5m]) /
    rate(checkout_funnel_step_total{step="payment_submit"}[5m])
    
    # Revenue impact
    sum(rate(order_value_total[5m])) * 3600
    INI

    Case Study 2: Microservices Architecture

    Challenge: Distributed tracing with metrics correlation
    Solution: Service dependency monitoring

    # Service dependency health
    up{job=~".*service.*"}
    
    # Cross-service error propagation
    sum by (source_service, target_service) (
      rate(http_requests_total{status=~"5.."}[5m])
    )
    
    # Service response time correlation
    histogram_quantile(0.95,
      sum by (service, le) (
        rate(http_request_duration_seconds_bucket[5m])
      )
    )
    INI

    Case Study 3: Infrastructure Cost Optimization

    Challenge: Monitor resource efficiency
    Solution: Cost-aware metrics

    # CPU cost efficiency
    sum by (instance_type) (node_cpu_seconds_total) /
    sum by (instance_type) (node_cpu_cost_per_hour)
    
    # Memory utilization by cost
    avg by (instance_type) (
      (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
      node_memory_MemTotal_bytes
    ) * sum by (instance_type) (node_memory_cost_per_gb)
    
    # Idle resource identification
    avg_over_time(
      (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))[7d:]
    ) < 0.1
    INI

    Metrics Naming Conventions

    Prometheus Naming Best Practices

    # Good metric names
    http_requests_total          # Counter with _total suffix
    http_request_duration_seconds # Histogram with base unit
    memory_usage_bytes          # Gauge with base unit
    process_cpu_usage_ratio     # Ratio as _ratio suffix
    
    # Bad metric names
    HttpRequestsCount           # Should be snake_case
    request_time_ms             # Should use base unit (seconds)
    cpu_percentage              # Should be cpu_usage_ratio
    errors                      # Not descriptive enough
    INI
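    Beyond conventions, names must satisfy Prometheus's identifier rules: `[a-zA-Z_:][a-zA-Z0-9_:]*` for metric names (colons are reserved for recording rules) and `[a-zA-Z_][a-zA-Z0-9_]*` for label names, with the `__` prefix reserved for internal use. A small validation sketch:

    ```python
    import re

    # Metric names: letters, digits, underscores, colons; cannot start with a digit.
    METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
    # Label names: same but without colons.
    LABEL_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

    def is_valid_metric_name(name: str) -> bool:
        return bool(METRIC_NAME_RE.match(name))

    def is_valid_label_name(name: str) -> bool:
        # The double-underscore prefix is reserved for internal labels
        return bool(LABEL_NAME_RE.match(name)) and not name.startswith("__")

    print(is_valid_metric_name("http_requests_total"))  # True
    print(is_valid_metric_name("2xx_responses"))        # False: leading digit
    print(is_valid_label_name("environment"))           # True
    print(is_valid_label_name("__address__"))           # False: reserved prefix
    ```
    
    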

    Label Naming Conventions

    # Good labels
    method: ["GET", "POST", "PUT", "DELETE"]
    status: ["200", "404", "500"]
    environment: ["production", "staging"]
    region: ["us-east-1", "eu-west-1"]
    
    # Bad labels
    Method: "GET"               # Should be lowercase
    http_status_code: "200"     # Redundant prefix
    env: "prod"                 # Use full names
    datacenter: "dc1"           # Be specific about location
    INI

    Testing and Validation

    Metrics Testing Framework

    # metrics_test.py
    import requests
    import time
    
    class MetricsTestFramework:
        def __init__(self, prometheus_url, app_url):
            self.prometheus_url = prometheus_url
            self.app_url = app_url
    
        def query_metric(self, query):
            """Query Prometheus and return result"""
            response = requests.get(
                f"{self.prometheus_url}/api/v1/query",
                params={"query": query}
            )
            return response.json()
    
        def generate_load(self, endpoint, count=10):
            """Generate load on application endpoint"""
            for _ in range(count):
                requests.get(f"{self.app_url}{endpoint}")
                time.sleep(0.1)
    
        def metric_value(self, query):
            """Return the first sample value of a query as a float"""
            result = self.query_metric(query)["data"]["result"]
            return float(result[0]["value"][1]) if result else 0.0

        def test_counter_increment(self):
            """Test that counters increment properly"""
            # Get initial value
            initial = self.metric_value("sum(http_requests_total)")

            # Generate load
            self.generate_load("/test", 5)

            # Wait for scrape
            time.sleep(20)

            # Check increment
            final = self.metric_value("sum(http_requests_total)")
            assert final > initial
    
        def test_histogram_buckets(self):
            """Test histogram bucket distribution"""
            self.generate_load("/slow", 10)
            time.sleep(20)
    
            # Check bucket distribution
            result = self.query_metric(
                'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1m]))'
            )
            assert float(result['data']['result'][0]['value'][1]) > 0
    
    # Usage
    framework = MetricsTestFramework(
        "http://localhost:9090",
        "http://localhost:8080"
    )
    framework.test_counter_increment()
    Python

    Documentation and Runbooks

    Metrics Documentation Template

    # Metric: http_requests_total
    
    ## Description
    Counter of HTTP requests processed by the application.
    
    ## Type
    Counter
    
    ## Labels
    - `method`: HTTP method (GET, POST, PUT, DELETE)
    - `endpoint`: API endpoint template (e.g., /api/users/:id)
    - `status`: HTTP status code
    - `service`: Service name
    
    ## Usage Examples
    ```promql
    # Request rate
    rate(http_requests_total[5m])
    
    # Error rate
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
    ```
    Markdown

    Alerts

    • HighErrorRate: Fires when error rate > 5%
    • LowRequestRate: Fires when request rate < 1 req/s

    Dashboard Panels

    • Request Rate Over Time
    • Error Rate by Endpoint
    • Request Distribution by Method
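    The error-rate expression in the template above divides the 5xx request rate by the total rate, and the HighErrorRate alert fires when that ratio exceeds 5%. A minimal stdlib sketch of the same arithmetic, using hypothetical per-status rates:

```python
def error_rate(rates_by_status: dict) -> float:
    """Fraction of requests that returned a 5xx status.

    rates_by_status maps a status code to its per-second request rate,
    mirroring sum(rate(http_requests_total{status=~"5.."}[5m])) /
              sum(rate(http_requests_total[5m])).
    """
    total = sum(rates_by_status.values())
    errors = sum(v for k, v in rates_by_status.items() if k.startswith("5"))
    return errors / total if total else 0.0

# Hypothetical per-status rates over a 5m window
rates = {"200": 94.0, "404": 2.0, "500": 3.0, "503": 1.0}
rate = error_rate(rates)
print(f"error rate: {rate:.1%}")            # error rate: 4.0%
print("HighErrorRate fires:", rate > 0.05)  # HighErrorRate fires: False
```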

    Chapter 9 Summary

    Effective Prometheus monitoring requires following established patterns like the Four Golden Signals, USE, and RED methods. Avoid cardinality explosions through careful label design, implement meaningful SLOs with proper error budget tracking, and establish clear naming conventions. Testing, documentation, and real-world case studies help ensure monitoring provides actionable insights.
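    The cardinality explosions mentioned above can be estimated before a metric ships: the worst-case series count for a metric is the product of the distinct values of each of its labels. A quick stdlib sketch with hypothetical label sets:

```python
from math import prod

def series_count(label_values: dict) -> int:
    """Worst-case series count for one metric: the product of the
    number of distinct values of every label."""
    return prod(label_values.values()) if label_values else 1

# A well-bounded metric: 4 methods x 20 endpoint templates x 8 statuses
print(series_count({"method": 4, "endpoint": 20, "status": 8}))  # 640

# The same metric with a user_id label explodes immediately
print(series_count({"method": 4, "endpoint": 20,
                    "status": 8, "user_id": 50_000}))            # 32000000
```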

    Hands-on Exercise

    1. Metrics Review:
      • Audit your existing metrics for cardinality issues
      • Apply the Four Golden Signals to your services
      • Implement USE method for infrastructure resources
    2. SLO Implementation:
      • Define SLIs for a critical service
      • Create SLO recording and alerting rules
      • Set up error budget tracking
    3. Best Practices Assessment:
      • Review metric naming conventions
      • Create documentation for key metrics
      • Implement automated metrics testing

    10. Advanced Topics

    Exemplars and Tracing Correlation

    Exemplars link metrics to traces, providing context for high-level aggregations by pointing to specific trace samples.

    graph LR
        A[HTTP Request] --> B[Metrics]
        A --> C[Traces]
        B --> D[Exemplar]
        D --> C
        C --> E[Span Details]
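    On the wire, an exemplar rides along with a sample in the OpenMetrics exposition format: after the sample value, a `#` introduces the exemplar's labels (typically a trace ID), its value, and an optional timestamp. A stdlib regex sketch of that line shape (sample values are made up):

```python
import re

# One sample line from an OpenMetrics scrape, exemplar after the '#'
line = ('http_request_duration_seconds_bucket{le="0.25"} 205 '
        '# {trace_id="0af7651916cd43dd8448eb211c80319c"} 0.153 1700000000.0')

m = re.match(
    r'(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
    r'\s+#\s+\{(?P<ex_labels>[^}]*)\}\s+(?P<ex_value>\S+)(?:\s+(?P<ex_ts>\S+))?',
    line,
)
assert m is not None
print(m["name"])       # http_request_duration_seconds_bucket
print(m["ex_labels"])  # trace_id="0af7651916cd43dd8448eb211c80319c"
print(m["ex_value"])   # 0.153
```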

    Enabling Exemplars in Prometheus

    # prometheus.yml
    # Exemplar storage must also be enabled at startup:
    #   prometheus --enable-feature=exemplar-storage
    global:
      scrape_interval: 15s
    
    storage:
      exemplars:
        max_exemplars: 100000
    
    scrape_configs:
      - job_name: 'my-app'
        scrape_interval: 10s
        static_configs:
          - targets: ['app:8080']
    YAML

    Instrumenting Applications with Exemplars

    // Go application with exemplars
    package main
    
    import (
        "fmt"
        "math/rand"
        "net/http"
        "time"
    
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
    )
    
    var (
        requestDuration = prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name: "http_request_duration_seconds",
                Help: "HTTP request duration in seconds",
                Buckets: prometheus.DefBuckets,
            },
            []string{"method", "endpoint"},
        )
    )
    
    func init() {
        prometheus.MustRegister(requestDuration)
    }
    
    func instrumentedHandler(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
    
        // Start an OpenTelemetry span; the derived context is not needed here
        _, span := otel.Tracer("my-app").Start(r.Context(), "http_request")
        defer span.End()
    
        // Simulate work
        time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
    
        // Record metric with exemplar
        duration := time.Since(start).Seconds()
        exemplar := prometheus.Labels{
            "trace_id": span.SpanContext().TraceID().String(),
            "span_id":  span.SpanContext().SpanID().String(),
        }
    
        // ObserveWithExemplar lives on the ExemplarObserver interface,
        // so the Observer returned by WithLabelValues must be asserted to it
        requestDuration.WithLabelValues(r.Method, r.URL.Path).(prometheus.ExemplarObserver).
            ObserveWithExemplar(duration, exemplar)
    
        span.SetAttributes(
            attribute.String("http.method", r.Method),
            attribute.String("http.url", r.URL.Path),
            attribute.Float64("http.duration", duration),
        )
    
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, "Request processed in %.2f seconds", duration)
    }
    
    func main() {
        http.HandleFunc("/api", instrumentedHandler)
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }
    Go

    Querying Exemplars

    # Query histogram with exemplars
    histogram_quantile(0.95, 
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    )
    
    # API endpoint for exemplars
    GET /api/v1/query_exemplars?query=http_request_duration_seconds_bucket&start=<timestamp>&end=<timestamp>
    PromQL
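    The `query_exemplars` endpoint takes the same `query`, `start`, and `end` parameters as a range query. Building the request URL with the stdlib shows how the selector is encoded (the server address is a placeholder):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "http://localhost:9090/api/v1/query_exemplars"  # hypothetical server
params = {
    "query": 'http_request_duration_seconds_bucket{le="0.5"}',
    "start": "2024-01-01T00:00:00Z",
    "end": "2024-01-01T01:00:00Z",
}
url = f"{base}?{urlencode(params)}"
print(url)

# Round-trip check: the selector survives URL encoding
assert parse_qs(urlsplit(url).query)["query"] == [params["query"]]
```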

    Multi-cluster Monitoring

    Centralized Multi-cluster Architecture

    graph TB
        A[Global Prometheus] --> B[Cluster A Prometheus]
        A --> C[Cluster B Prometheus]
        A --> D[Cluster C Prometheus]
    
        B --> E[Workloads A]
        C --> F[Workloads B]
        D --> G[Workloads C]
    
        A --> H[Global Grafana]
        A --> I[Global Alertmanager]

    Cross-cluster Service Discovery

    # Global Prometheus configuration
    global:
      external_labels:
        cluster: 'management'
        region: 'global'
    
    scrape_configs:
      # Federate from regional clusters
      - job_name: 'federate-clusters'
        scrape_interval: 30s
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{__name__=~"cluster:.*"}'
            - '{__name__=~"node_.*"}'
            - '{__name__=~"container_.*"}'
        static_configs:
          - targets:
            - 'cluster-a-prometheus:9090'
            labels:
              cluster: 'cluster-a'
              region: 'us-east-1'
          - targets:
            - 'cluster-b-prometheus:9090'
            labels:
              cluster: 'cluster-b'
              region: 'us-west-2'
          - targets:
            - 'cluster-c-prometheus:9090'
            labels:
              cluster: 'cluster-c'
              region: 'eu-west-1'
    
      # Cross-cluster service monitoring
      - job_name: 'cross-cluster-services'
        kubernetes_sd_configs:
          - role: endpoints
            api_server: 'https://cluster-a.k8s.local'
            tls_config:
              ca_file: /etc/ssl/cluster-a-ca.crt
              cert_file: /etc/ssl/cluster-a.crt
              key_file: /etc/ssl/cluster-a.key
          - role: endpoints
            api_server: 'https://cluster-b.k8s.local'
            tls_config:
              ca_file: /etc/ssl/cluster-b-ca.crt
              cert_file: /etc/ssl/cluster-b.crt
              key_file: /etc/ssl/cluster-b.key
    YAML
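    `honor_labels: true` in the federation job above decides what happens when a federated sample already carries a label (such as `cluster`) that the scrape target also defines: with it, the scraped label wins; without it, the scraped label is renamed to `exported_<label>` and the target label is applied. A simplified model of that merge rule:

```python
def merge_labels(scraped: dict, target: dict, honor_labels: bool) -> dict:
    """Simplified model of Prometheus label conflict handling on scrape."""
    if honor_labels:
        # Scraped labels win; target labels fill only the gaps
        return {**target, **scraped}
    out = dict(scraped)
    for k, v in target.items():
        if k in out:
            out[f"exported_{k}"] = out.pop(k)  # conflict: rename scraped label
        out[k] = v
    return out

scraped = {"cluster": "cluster-a", "job": "node"}
target = {"cluster": "management", "region": "global"}

# honor_labels: true  -> cluster stays "cluster-a"
print(merge_labels(scraped, target, honor_labels=True))
# honor_labels: false -> cluster becomes "management",
# the scraped value survives as exported_cluster
print(merge_labels(scraped, target, honor_labels=False))
```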

    Multi-cluster Recording Rules

    # Global recording rules
    groups:
      - name: cross_cluster_aggregates
        interval: 60s
        rules:
          - record: global:request_rate:sum
            expr: sum by (service) (cluster:request_rate:sum)
    
          - record: global:error_rate:avg
            expr: avg by (service) (cluster:error_rate:avg)
    
          - record: global:latency:p95
            expr: |
              histogram_quantile(0.95,
                sum by (service, le) (cluster:latency:histogram)
              )
    
          - record: region:capacity:available
            expr: |
              sum by (region) (
                cluster:node_capacity:cpu - cluster:node_usage:cpu
              )
    YAML
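    The `global:request_rate:sum` rule above aggregates cluster-level recorded series with `sum by (service)`. A small sketch of that aggregation over hypothetical per-cluster samples shows what the global Prometheus computes:

```python
from collections import defaultdict

# Hypothetical cluster-level recorded series: (labels, value) pairs,
# i.e. the inputs to global:request_rate:sum
cluster_series = [
    ({"cluster": "cluster-a", "service": "checkout"}, 120.0),
    ({"cluster": "cluster-b", "service": "checkout"}, 80.0),
    ({"cluster": "cluster-a", "service": "search"},   40.0),
]

def sum_by(series, key):
    """Model of PromQL's sum by (<key>): drop all other labels, add values."""
    out = defaultdict(float)
    for labels, value in series:
        out[labels[key]] += value
    return dict(out)

print(sum_by(cluster_series, "service"))  # {'checkout': 200.0, 'search': 40.0}
```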

    Integrating with Logging and Tracing

    Correlation with ELK Stack

    # Logstash configuration for metrics correlation
    input {
      beats {
        port => 5044
      }
    }
    
    filter {
      if [fields][service] {
        # Add Prometheus job label
        mutate {
          add_field => { "prometheus_job" => "%{[fields][service]}" }
        }
    
        # Extract trace ID if present
        if [message] =~ /trace_id=/ {
          grok {
            match => { "message" => "trace_id=(?<trace_id>[a-f0-9]+)" }
          }
        }
    
        # Add links to metrics
        mutate {
          add_field => { 
            "metrics_link" => "http://grafana.local/d/app-dashboard?var-service=%{[fields][service]}&from=now-5m&to=now"
          }
        }
      }
    }
    
    output {
      elasticsearch {
        hosts => ["elasticsearch:9200"]
        index => "logs-%{+YYYY.MM.dd}"
      }
    }
    Logstash

    Jaeger Integration

    # Jaeger query service with Prometheus metrics
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: jaeger-query
    spec:
      template:
        spec:
          containers:
          - name: jaeger-query
            image: jaegertracing/jaeger-query:latest
            env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
            - name: METRICS_BACKEND
              value: prometheus
            - name: PROMETHEUS_SERVER_URL
              value: http://prometheus:9090
            ports:
            - containerPort: 16686
            - containerPort: 16687
    YAML

    OpenTelemetry Collector Configuration

    # otelcol-config.yml
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              static_configs:
                - targets: ['localhost:8888']
    
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
    
      attributes:
        actions:
          - key: cluster
            value: production
            action: insert
    
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
    
      prometheus:
        endpoint: "0.0.0.0:8889"
        namespace: "otel"
    
      prometheusremotewrite:
        endpoint: "http://prometheus:9090/api/v1/write"
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [attributes, batch]
          exporters: [jaeger]
    
        metrics:
          receivers: [otlp, prometheus]
          processors: [attributes, batch]
          exporters: [prometheus, prometheusremotewrite]
    YAML

    Security and RBAC in Prometheus Setups

    Prometheus Security Configuration

    # Prometheus with TLS and authentication
    apiVersion: v1
    kind: Secret
    metadata:
      name: prometheus-certs
    type: Opaque
    data:
      tls.crt: <base64-encoded-cert>
      tls.key: <base64-encoded-key>
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus
    spec:
      template:
        spec:
          containers:
          - name: prometheus
            image: prom/prometheus:latest
            args:
              - '--config.file=/etc/prometheus/prometheus.yml'
              - '--web.config.file=/etc/prometheus/web.yml'
              - '--storage.tsdb.path=/prometheus'
              - '--web.listen-address=0.0.0.0:9090'
            volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: certs
              mountPath: /etc/ssl/prometheus
              readOnly: true
    YAML
    # web.yml - Prometheus web configuration
    tls_server_config:
      cert_file: /etc/ssl/prometheus/tls.crt
      key_file: /etc/ssl/prometheus/tls.key
    
    basic_auth_users:
      admin: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
      readonly: $2b$12$6tgWf5DZ9z7LZtD.ZrAb/.VjBfI3WnJg3ULf.TgLBtO4vKAzp7KuG
    YAML

    RBAC Configuration for Kubernetes

    # ServiceAccount for Prometheus
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: monitoring
    
    ---
    # ClusterRole with minimal permissions
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
    rules:
    - apiGroups: [""]
      resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
      verbs: ["get", "list", "watch"]
    - apiGroups: ["extensions", "apps"]
      resources:
      - ingresses
      - deployments
      - daemonsets
      - statefulsets
      verbs: ["get", "list", "watch"]
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
    
    ---
    # ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: monitoring
    YAML

    OAuth2 Proxy Integration

    # OAuth2 Proxy for Prometheus
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: oauth2-proxy
    spec:
      template:
        spec:
          containers:
          - name: oauth2-proxy
            image: quay.io/oauth2-proxy/oauth2-proxy:latest
            args:
              - --provider=github
              - --email-domain=yourcompany.com
              - --upstream=http://prometheus:9090
              - --http-address=0.0.0.0:4180
              - --client-id=$(OAUTH2_PROXY_CLIENT_ID)
              - --client-secret=$(OAUTH2_PROXY_CLIENT_SECRET)
              - --cookie-secret=$(OAUTH2_PROXY_COOKIE_SECRET)
            env:
            - name: OAUTH2_PROXY_CLIENT_ID
              valueFrom:
                secretKeyRef:
                  name: oauth2-proxy-secrets
                  key: client-id
            - name: OAUTH2_PROXY_CLIENT_SECRET
              valueFrom:
                secretKeyRef:
                  name: oauth2-proxy-secrets
                  key: client-secret
            - name: OAUTH2_PROXY_COOKIE_SECRET
              valueFrom:
                secretKeyRef:
                  name: oauth2-proxy-secrets
                  key: cookie-secret
    YAML

    Chapter 10 Summary

    Advanced Prometheus topics include exemplars for linking metrics to traces, multi-cluster monitoring architectures, integration with logging and tracing systems, and comprehensive security configurations. These features enable enterprise-scale observability with proper access controls and correlation across different observability signals.

    Hands-on Exercise

    1. Exemplars Implementation:
      • Enable exemplars in Prometheus
      • Instrument an application with trace correlation
      • View exemplars in Grafana dashboards
    2. Multi-cluster Setup:
      • Configure federation between Prometheus instances
      • Implement cross-cluster monitoring
      • Test global query capabilities
    3. Security Hardening:
      • Implement TLS and authentication
      • Configure RBAC for Kubernetes
      • Set up OAuth2 proxy for access control

    11. Capstone Project

    Project Overview

    Build a complete observability stack for a sample e-commerce application with microservices architecture, including metrics collection, alerting, visualization, and incident response workflows.

    Architecture Overview

    graph TB
        subgraph "Application Layer"
            A[Frontend Service] --> B[User Service]
            A --> C[Product Service]
            A --> D[Order Service]
            D --> E[Payment Service]
            D --> F[Inventory Service]
            B --> G[User Database]
            C --> H[Product Database]
            D --> I[Order Database]
        end
    
        subgraph "Observability Layer"
            J[Prometheus] --> K[Alertmanager]
            J --> L[Grafana]
            M[Node Exporter] --> J
            N[Application Metrics] --> J
            O[Blackbox Exporter] --> J
            K --> P[Slack/Email]
            L --> Q[Dashboards]
        end
    
        A --> N
        B --> N
        C --> N
        D --> N
        E --> N
        F --> N

    Step 1: Infrastructure Setup

    Docker Compose Environment

    # docker-compose.yml
    version: '3.8'
    
    networks:
      monitoring:
        driver: bridge
      app:
        driver: bridge
    
    volumes:
      prometheus_data:
      grafana_data:
      alertmanager_data:
    
    services:
      # Prometheus
      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus:/etc/prometheus
          - prometheus_data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--web.enable-lifecycle'
          - '--web.enable-admin-api'
        networks:
          - monitoring
          - app
        restart: unless-stopped
    
      # Alertmanager
      alertmanager:
        image: prom/alertmanager:latest
        container_name: alertmanager
        ports:
          - "9093:9093"
        volumes:
          - ./alertmanager:/etc/alertmanager
          - alertmanager_data:/alertmanager
        command:
          - '--config.file=/etc/alertmanager/alertmanager.yml'
          - '--storage.path=/alertmanager'
        networks:
          - monitoring
        restart: unless-stopped
    
      # Grafana
      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        ports:
          - "3000:3000"
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=admin123
          - GF_USERS_ALLOW_SIGN_UP=false
        volumes:
          - grafana_data:/var/lib/grafana
          - ./grafana/provisioning:/etc/grafana/provisioning
        networks:
          - monitoring
        restart: unless-stopped
    
      # Node Exporter
      node-exporter:
        image: prom/node-exporter:latest
        container_name: node-exporter
        ports:
          - "9100:9100"
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
        command:
          - '--path.procfs=/host/proc'
          - '--path.rootfs=/rootfs'
          - '--path.sysfs=/host/sys'
          - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
        networks:
          - monitoring
        restart: unless-stopped
    
      # Blackbox Exporter
      blackbox-exporter:
        image: prom/blackbox-exporter:latest
        container_name: blackbox-exporter
        ports:
          - "9115:9115"
        volumes:
          - ./blackbox:/etc/blackbox_exporter
        networks:
          - monitoring
        restart: unless-stopped
    
      # Application Services
      frontend:
        build: ./apps/frontend
        container_name: frontend
        ports:
          - "8080:8080"
        environment:
          - USER_SERVICE_URL=http://user-service:8081
          - PRODUCT_SERVICE_URL=http://product-service:8082
          - ORDER_SERVICE_URL=http://order-service:8083
        networks:
          - app
        restart: unless-stopped
    
      user-service:
        build: ./apps/user-service
        container_name: user-service
        ports:
          - "8081:8081"
        environment:
          - DATABASE_URL=postgresql://user:password@user-db:5432/users
        networks:
          - app
        restart: unless-stopped
    
      product-service:
        build: ./apps/product-service
        container_name: product-service
        ports:
          - "8082:8082"
        environment:
          - DATABASE_URL=postgresql://product:password@product-db:5432/products
        networks:
          - app
        restart: unless-stopped
    
      order-service:
        build: ./apps/order-service
        container_name: order-service
        ports:
          - "8083:8083"
        environment:
          - DATABASE_URL=postgresql://order:password@order-db:5432/orders
          - PAYMENT_SERVICE_URL=http://payment-service:8084
          - INVENTORY_SERVICE_URL=http://inventory-service:8085
        networks:
          - app
        restart: unless-stopped
    
      payment-service:
        build: ./apps/payment-service
        container_name: payment-service
        ports:
          - "8084:8084"
        networks:
          - app
        restart: unless-stopped
    
      inventory-service:
        build: ./apps/inventory-service
        container_name: inventory-service
        ports:
          - "8085:8085"
        networks:
          - app
        restart: unless-stopped
    
      # Databases
      user-db:
        image: postgres:13
        container_name: user-db
        environment:
          - POSTGRES_DB=users
          - POSTGRES_USER=user
          - POSTGRES_PASSWORD=password
        volumes:
          - ./data/user-db:/var/lib/postgresql/data
        networks:
          - app
    
      product-db:
        image: postgres:13
        container_name: product-db
        environment:
          - POSTGRES_DB=products
          - POSTGRES_USER=product
          - POSTGRES_PASSWORD=password
        volumes:
          - ./data/product-db:/var/lib/postgresql/data
        networks:
          - app
    
      order-db:
        image: postgres:13
        container_name: order-db
        environment:
          - POSTGRES_DB=orders
          - POSTGRES_USER=order
          - POSTGRES_PASSWORD=password
        volumes:
          - ./data/order-db:/var/lib/postgresql/data
        networks:
          - app
    YAML

    Step 2: Application Instrumentation

    Frontend Service (Go)

    // apps/frontend/main.go
    package main
    
    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
        "os"
        "time"
    
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )
    
    var (
        httpRequestsTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "Total number of HTTP requests",
            },
            []string{"service", "method", "endpoint", "status"},
        )
    
        httpRequestDuration = prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name: "http_request_duration_seconds",
                Help: "HTTP request duration in seconds",
                Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
            },
            []string{"service", "method", "endpoint"},
        )
    
        upstreamRequestsTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "upstream_requests_total",
                Help: "Total upstream requests",
            },
            []string{"service", "target_service", "status"},
        )
    
        businessMetrics = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "business_events_total",
                Help: "Business events counter",
            },
            []string{"service", "event_type"},
        )
    )
    
    func init() {
        prometheus.MustRegister(httpRequestsTotal)
        prometheus.MustRegister(httpRequestDuration)
        prometheus.MustRegister(upstreamRequestsTotal)
        prometheus.MustRegister(businessMetrics)
    }
    
    func instrumentHandler(service, endpoint string, handler http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()
    
            // Wrap ResponseWriter to capture status code
            ww := &responseWriter{ResponseWriter: w, statusCode: 200}
    
            handler(ww, r)
    
            duration := time.Since(start).Seconds()
            status := fmt.Sprintf("%d", ww.statusCode)
    
            httpRequestsTotal.WithLabelValues(service, r.Method, endpoint, status).Inc()
            httpRequestDuration.WithLabelValues(service, r.Method, endpoint).Observe(duration)
        }
    }
    
    type responseWriter struct {
        http.ResponseWriter
        statusCode int
    }
    
    func (rw *responseWriter) WriteHeader(code int) {
        rw.statusCode = code
        rw.ResponseWriter.WriteHeader(code)
    }
    
    func homeHandler(w http.ResponseWriter, r *http.Request) {
        businessMetrics.WithLabelValues("frontend", "page_view").Inc()
    
        response := map[string]string{
            "service": "frontend",
            "status":  "healthy",
            "version": "1.0.0",
        }
    
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(response)
    }
    
    func usersHandler(w http.ResponseWriter, r *http.Request) {
        userServiceURL := os.Getenv("USER_SERVICE_URL")
        if userServiceURL == "" {
            userServiceURL = "http://localhost:8081"
        }
    
        start := time.Now()
        resp, err := http.Get(userServiceURL + "/users")
        duration := time.Since(start).Seconds()
    
        status := "500"
        if err == nil {
            status = fmt.Sprintf("%d", resp.StatusCode)
            defer resp.Body.Close()
        }
    
        upstreamRequestsTotal.WithLabelValues("frontend", "user-service", status).Inc()
    
        if err != nil {
            http.Error(w, "User service unavailable", http.StatusServiceUnavailable)
            return
        }
    
        businessMetrics.WithLabelValues("frontend", "user_list_view").Inc()
        w.Header().Set("Content-Type", "application/json")
        w.Write([]byte(`{"users": []}`))
    }
    
    func main() {
        http.Handle("/metrics", promhttp.Handler())
        http.HandleFunc("/", instrumentHandler("frontend", "/", homeHandler))
        http.HandleFunc("/users", instrumentHandler("frontend", "/users", usersHandler))
        http.HandleFunc("/health", instrumentHandler("frontend", "/health", func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
            w.Write([]byte("OK"))
        }))
    
        log.Println("Frontend service starting on :8080")
        log.Fatal(http.ListenAndServe(":8080", nil))
    }
    Go

    User Service (Python)

    # apps/user-service/app.py
    from flask import Flask, jsonify, request
    from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
    import time
    import psycopg2
    import os
    
    app = Flask(__name__)
    
    # Prometheus metrics
    REQUEST_COUNT = Counter(
        'http_requests_total',
        'Total HTTP requests',
        ['service', 'method', 'endpoint', 'status']
    )
    
    REQUEST_DURATION = Histogram(
        'http_request_duration_seconds',
        'HTTP request duration',
        ['service', 'method', 'endpoint'],
        buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
    )
    
    DATABASE_CONNECTIONS = Gauge(
        'database_connections_active',
        'Active database connections',
        ['service', 'database']
    )
    
    BUSINESS_EVENTS = Counter(
        'business_events_total',
        'Business events',
        ['service', 'event_type']
    )
    
    def instrument_request(f):
        def wrapper(*args, **kwargs):
            start_time = time.time()
            status = '200'
    
            try:
                result = f(*args, **kwargs)
                return result
            except Exception as e:
                status = '500'
                raise
            finally:
                REQUEST_COUNT.labels(
                    service='user-service',
                    method=request.method,
                    endpoint=request.endpoint or 'unknown',
                    status=status
                ).inc()
    
                REQUEST_DURATION.labels(
                    service='user-service',
                    method=request.method,
                    endpoint=request.endpoint or 'unknown'
                ).observe(time.time() - start_time)
    
        wrapper.__name__ = f.__name__
        return wrapper
    
    @app.route('/')
    @instrument_request
    def home():
        return jsonify({
            'service': 'user-service',
            'status': 'healthy',
            'version': '1.0.0'
        })
    
    @app.route('/users')
    @instrument_request
    def get_users():
        BUSINESS_EVENTS.labels(service='user-service', event_type='user_list_request').inc()
    
        # Simulate database query
        DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
        time.sleep(0.01)  # Simulate query time
        DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
    
        return jsonify({
            'users': [
                {'id': 1, 'name': 'John Doe', 'email': 'john@example.com'},
                {'id': 2, 'name': 'Jane Smith', 'email': 'jane@example.com'}
            ]
        })
    
    @app.route('/users/<int:user_id>')
    @instrument_request
    def get_user(user_id):
        BUSINESS_EVENTS.labels(service='user-service', event_type='user_detail_request').inc()
    
        DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
        time.sleep(0.005)
        DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
    
        return jsonify({
            'id': user_id,
            'name': f'User {user_id}',
            'email': f'user{user_id}@example.com'
        })
    
    @app.route('/health')
    @instrument_request
    def health():
        return jsonify({'status': 'healthy'})
    
    @app.route('/metrics')
    def metrics():
        return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
    
    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8081)
    Python

    Step 3: Prometheus Configuration

    # prometheus/prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'ecommerce'
        environment: 'production'
    
    rule_files:
      - "alert_rules.yml"
      - "recording_rules.yml"
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093
    
    scrape_configs:
      # Prometheus itself
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      # Node Exporter
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']
        scrape_interval: 30s
    
      # Application services
      - job_name: 'frontend'
        static_configs:
          - targets: ['frontend:8080']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'user-service'
        static_configs:
          - targets: ['user-service:8081']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'product-service'
        static_configs:
          - targets: ['product-service:8082']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'order-service'
        static_configs:
          - targets: ['order-service:8083']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'payment-service'
        static_configs:
          - targets: ['payment-service:8084']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'inventory-service'
        static_configs:
          - targets: ['inventory-service:8085']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      # Blackbox monitoring
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
            - http://frontend:8080/health
            - http://user-service:8081/health
            - http://product-service:8082/health
            - http://order-service:8083/health
            - http://payment-service:8084/health
            - http://inventory-service:8085/health
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter:9115
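The three `relabel_configs` entries are easy to misread. As a rough illustration (a hypothetical `apply_blackbox_relabeling` helper, not Prometheus's actual relabeling engine), they copy the probed URL into the `?target=` query parameter, surface it as the `instance` label, and redirect the scrape itself to the exporter:

```python
# Illustrative sketch of the three relabel rules above (assumed simplified
# model, not Prometheus's real relabeling implementation).
def apply_blackbox_relabeling(labels):
    labels = dict(labels)
    # 1. __address__ -> __param_target (becomes the ?target= query parameter)
    labels["__param_target"] = labels["__address__"]
    # 2. __param_target -> instance (the probed URL shows up as the instance label)
    labels["instance"] = labels["__param_target"]
    # 3. Replace __address__ with the exporter, which performs the actual probe
    labels["__address__"] = "blackbox-exporter:9115"
    return labels

result = apply_blackbox_relabeling({"__address__": "http://frontend:8080/health"})
print(result["instance"])     # the original URL
print(result["__address__"])  # blackbox-exporter:9115, the host actually scraped
```

Without the third rule, Prometheus would try to scrape the `/health` URLs directly instead of asking the blackbox exporter to probe them.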

    Step 4: Recording Rules

    # prometheus/recording_rules.yml
    groups:
      - name: application_rules
        interval: 30s
        rules:
          # Request rates
          - record: service:request_rate:rate5m
            expr: sum by (service) (rate(http_requests_total[5m]))
    
          - record: service:request_rate:rate1h
            expr: sum by (service) (rate(http_requests_total[1h]))
    
          # Error rates
          - record: service:error_rate:rate5m
            expr: |
              sum by (service) (rate(http_requests_total{status=~"[45].."}[5m])) /
              sum by (service) (rate(http_requests_total[5m]))
    
          # Latency percentiles
          - record: service:request_duration:p50
            expr: |
              histogram_quantile(0.50,
                sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
              )
    
          - record: service:request_duration:p95
            expr: |
              histogram_quantile(0.95,
                sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
              )
    
          - record: service:request_duration:p99
            expr: |
              histogram_quantile(0.99,
                sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
              )
    
      - name: infrastructure_rules
        interval: 30s
        rules:
          # Node metrics
          - record: node:cpu_usage:rate5m
            expr: |
              100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    
          - record: node:memory_usage:percentage
            expr: |
              (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    
          - record: node:disk_usage:percentage
            expr: |
              (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
    
      - name: business_rules
        interval: 60s
        rules:
          # Business metrics
          - record: business:page_views:rate1h
            expr: rate(business_events_total{event_type="page_view"}[1h]) * 3600
    
          - record: business:user_requests:rate1h
            expr: rate(business_events_total{event_type=~"user_.*"}[1h]) * 3600
    
          # Service dependency health
          - record: service:dependency_success_rate:rate5m
            expr: |
              sum by (service, target_service) (rate(upstream_requests_total{status=~"2.."}[5m])) /
              sum by (service, target_service) (rate(upstream_requests_total[5m]))
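To make `service:error_rate:rate5m` concrete, here is a back-of-the-envelope version of the same math (hypothetical `per_second_rate` helper and sample counter values; the real `rate()` also handles counter resets and extrapolates to the window edges):

```python
# Sketch of the math behind service:error_rate:rate5m: a per-second rate from
# two counter samples at the edges of a 5-minute window, then a ratio.
def per_second_rate(first_sample, last_sample, window_seconds):
    return (last_sample - first_sample) / window_seconds

# Counter values observed 5 minutes (300 s) apart -- hypothetical numbers
total_rate = per_second_rate(10_000, 13_000, 300)  # all requests
error_rate = per_second_rate(200, 230, 300)        # 4xx/5xx requests

print(total_rate)               # 10.0 req/s
print(error_rate / total_rate)  # ~0.01 -> a 1% error ratio
```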

    Step 5: Alerting Rules

    # prometheus/alert_rules.yml
    groups:
      - name: infrastructure_alerts
        rules:
          - alert: NodeDown
            expr: up{job="node-exporter"} == 0
            for: 1m
            labels:
              severity: critical
              team: infrastructure
            annotations:
              summary: "Node is down"
              description: "Node {{ $labels.instance }} has been down for more than 1 minute"
              runbook_url: "https://runbooks.company.com/node-down"
    
          - alert: HighCPUUsage
            expr: node:cpu_usage:rate5m > 80
            for: 5m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "High CPU usage"
              description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
    
          - alert: HighMemoryUsage
            expr: node:memory_usage:percentage > 85
            for: 5m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "High memory usage"
              description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
    
      - name: application_alerts
        rules:
          - alert: ServiceDown
            expr: up{job=~"frontend|.*-service"} == 0
            for: 1m
            labels:
              severity: critical
              team: platform
            annotations:
              summary: "Service is down"
              description: "Service {{ $labels.job }} is down"
    
          - alert: HighErrorRate
            expr: service:error_rate:rate5m > 0.05
            for: 2m
            labels:
              severity: critical
              team: platform
            annotations:
              summary: "High error rate for {{ $labels.service }}"
              description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
    
          - alert: HighLatency
            expr: service:request_duration:p95 > 1
            for: 5m
            labels:
              severity: warning
              team: platform
            annotations:
              summary: "High latency for {{ $labels.service }}"
              description: "95th percentile latency is {{ $value }}s for {{ $labels.service }}"
    
          - alert: LowRequestRate
            expr: service:request_rate:rate5m < 0.1
            for: 10m
            labels:
              severity: warning
              team: platform
            annotations:
              summary: "Low request rate for {{ $labels.service }}"
              description: "Request rate is {{ $value }} req/s for {{ $labels.service }}"
    
      - name: business_alerts
        rules:
          - alert: LowPageViews
            expr: business:page_views:rate1h < 10
            for: 15m
            labels:
              severity: warning
              team: product
            annotations:
              summary: "Low page view rate"
              description: "Page view rate is {{ $value }} views/hour"
    
          - alert: ServiceDependencyFailure
            expr: service:dependency_success_rate:rate5m < 0.95
            for: 5m
            labels:
              severity: critical
              team: platform
            annotations:
              summary: "Service dependency failure"
              description: "{{ $labels.service }} -> {{ $labels.target_service }} success rate is {{ $value | humanizePercentage }}"
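Every rule above pairs its `expr` with a `for:` duration: the alert fires only once the expression has stayed true for that long, which filters out short spikes. A toy model of that behavior (hypothetical `alert_fires` function; real Prometheus tracks wall-clock time across rule evaluations, not a fixed number of cycles):

```python
# Toy model of the `for:` clause: the condition must hold for N consecutive
# evaluations before the alert transitions from "pending" to "firing".
def alert_fires(condition_history, for_cycles):
    """True only if the condition held for the last `for_cycles` evaluations."""
    if len(condition_history) < for_cycles:
        return False
    return all(condition_history[-for_cycles:])

# HighCPUUsage (for: 5m) with one evaluation per minute: a 4-minute spike
# that recovers never fires, a sustained breach does.
print(alert_fires([True, True, True, True, False], 5))        # False
print(alert_fires([False, True, True, True, True, True], 5))  # True
```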

    Step 6: Alertmanager Configuration

    # alertmanager/alertmanager.yml
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alerts@ecommerce.local'
      smtp_auth_username: 'alerts@ecommerce.local'
      smtp_auth_password: 'your-app-password'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
        # Critical alerts to on-call
        - matchers:
            - severity=critical
          receiver: 'critical-alerts'
          continue: true
    
        # Infrastructure team alerts
        - matchers:
            - team=infrastructure
          receiver: 'infrastructure-team'
    
        # Platform team alerts
        - matchers:
            - team=platform
          receiver: 'platform-team'
    
        # Product team alerts
        - matchers:
            - team=product
          receiver: 'product-team'
    
    receivers:
      - name: 'default'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts'
            title: 'Alert: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Severity:* {{ .Labels.severity }}
              *Service:* {{ .Labels.service }}
              {{ end }}
    
      - name: 'critical-alerts'
        email_configs:
          - to: 'oncall@ecommerce.local'
            subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
            body: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              Severity: {{ .Labels.severity }}
              Service: {{ .Labels.service }}
              Runbook: {{ .Annotations.runbook_url }}
              {{ end }}
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#critical-alerts'
            title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Runbook:* {{ .Annotations.runbook_url }}
              {{ end }}
    
      - name: 'infrastructure-team'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#infrastructure'
            title: '⚠️ Infrastructure Alert: {{ .GroupLabels.alertname }}'
    
      - name: 'platform-team'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#platform'
            title: '🔧 Platform Alert: {{ .GroupLabels.alertname }}'
    
      - name: 'product-team'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#product'
            title: '📊 Business Alert: {{ .GroupLabels.alertname }}'
    
    inhibit_rules:
      # Don't send warning alerts if critical alerts are firing
      - source_matchers:
          - severity=critical
        target_matchers:
          - severity=warning
        equal: ['service']
    
      # Don't send service alerts if node is down
      - source_matchers:
          - alertname=NodeDown
        target_matchers:
          - alertname=ServiceDown
        equal: ['instance']
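The first `inhibit_rules` entry is worth unpacking: while a critical alert is firing, warnings that share the same `service` label are silenced, so responders get one page instead of two for the same incident. A simplified sketch (hypothetical `is_inhibited` helper; Alertmanager's real matcher logic is richer):

```python
# Sketch of the severity-based inhibit rule above: a firing critical alert
# suppresses warning alerts that carry the same `service` label.
def is_inhibited(alert, firing_alerts):
    if alert["labels"].get("severity") != "warning":
        return False
    return any(
        other["labels"].get("severity") == "critical"
        and other["labels"].get("service") == alert["labels"].get("service")
        for other in firing_alerts
    )

firing = [{"labels": {"alertname": "HighErrorRate", "severity": "critical", "service": "orders"}}]
warn = {"labels": {"alertname": "HighLatency", "severity": "warning", "service": "orders"}}
print(is_inhibited(warn, firing))  # True: suppressed while the critical fires
```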

    Step 7: Grafana Dashboards

    Infrastructure Dashboard

    # grafana/provisioning/dashboards/infrastructure.json
    {
      "dashboard": {
        "id": null,
        "title": "Infrastructure Overview",
        "tags": ["infrastructure", "monitoring"],
        "timezone": "browser",
        "refresh": "30s",
        "time": {
          "from": "now-1h",
          "to": "now"
        },
        "panels": [
          {
            "id": 1,
            "title": "CPU Usage",
            "type": "stat",
            "targets": [
              {
                "expr": "node:cpu_usage:rate5m",
                "legendFormat": "{{ instance }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 70},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "Memory Usage",
            "type": "stat",
            "targets": [
              {
                "expr": "node:memory_usage:percentage",
                "legendFormat": "{{ instance }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 80},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
          },
          {
            "id": 3,
            "title": "CPU Usage Over Time",
            "type": "graph",
            "targets": [
              {
                "expr": "node:cpu_usage:rate5m",
                "legendFormat": "{{ instance }}"
              }
            ],
            "yAxes": [
              {
                "unit": "percent",
                "max": 100,
                "min": 0
              }
            ],
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
          }
        ]
      }
    }

    Application Dashboard

    # grafana/provisioning/dashboards/application.json
    {
      "dashboard": {
        "id": null,
        "title": "Application Performance",
        "tags": ["application", "performance"],
        "timezone": "browser",
        "refresh": "30s",
        "templating": {
          "list": [
            {
              "name": "service",
              "type": "query",
              "query": "label_values(http_requests_total, service)",
              "refresh": 1,
              "multi": true,
              "includeAll": true
            }
          ]
        },
        "panels": [
          {
            "id": 1,
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "service:request_rate:rate5m{service=~\"$service\"}",
                "legendFormat": "{{ service }}"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "Error Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "service:error_rate:rate5m{service=~\"$service\"} * 100",
                "legendFormat": "{{ service }}"
              }
            ],
            "yAxes": [
              {
                "unit": "percent",
                "max": 100,
                "min": 0
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
          },
          {
            "id": 3,
            "title": "Response Time Percentiles",
            "type": "graph",
            "targets": [
              {
                "expr": "service:request_duration:p50{service=~\"$service\"}",
                "legendFormat": "{{ service }} - 50th"
              },
              {
                "expr": "service:request_duration:p95{service=~\"$service\"}",
                "legendFormat": "{{ service }} - 95th"
              },
              {
                "expr": "service:request_duration:p99{service=~\"$service\"}",
                "legendFormat": "{{ service }} - 99th"
              }
            ],
            "yAxes": [
              {
                "unit": "s"
              }
            ],
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
          }
        ]
      }
    }

    Step 8: Testing and Validation

    Load Testing Script

    # scripts/load_test.py
    import requests
    import time
    import random
    import threading
    from concurrent.futures import ThreadPoolExecutor
    
    BASE_URL = "http://localhost:8080"
    
    def make_request(endpoint):
        """Make a request to the specified endpoint"""
        try:
            response = requests.get(f"{BASE_URL}{endpoint}", timeout=5)
            return response.status_code
        except Exception as e:
            print(f"Error calling {endpoint}: {e}")
            return 500
    
    def generate_load(stop_event):
        """Generate load on the application until stop_event is set"""
        endpoints = ["/", "/users", "/health"]
    
        while not stop_event.is_set():
            endpoint = random.choice(endpoints)
            make_request(endpoint)
    
            # Add some randomness to the load
            time.sleep(random.uniform(0.1, 1.0))
    
    def run_load_test(duration_minutes=10, concurrent_users=5):
        """Run load test for specified duration"""
        print(f"Starting load test with {concurrent_users} concurrent users for {duration_minutes} minutes")
    
        stop_event = threading.Event()
        with ThreadPoolExecutor(max_workers=concurrent_users) as executor:
            # Submit load generation tasks
            futures = [executor.submit(generate_load, stop_event) for _ in range(concurrent_users)]
    
            # Let it run for the specified duration, then signal workers to stop
            # (Future.cancel() cannot stop an already-running task)
            time.sleep(duration_minutes * 60)
            stop_event.set()
    
            # Wait for each worker to finish its current request
            for future in futures:
                future.result()
    
    if __name__ == "__main__":
        run_load_test(duration_minutes=5, concurrent_users=10)

    Chaos Testing

    # scripts/chaos_test.py
    import docker
    import time
    import random
    
    client = docker.from_env()
    
    def stop_random_service():
        """Stop a random service for chaos testing"""
        services = ['user-service', 'product-service', 'order-service']
        service_name = random.choice(services)
    
        try:
            container = client.containers.get(service_name)
            print(f"Stopping {service_name}")
            container.stop()
    
            # Wait for some time
            time.sleep(30)
    
            print(f"Starting {service_name}")
            container.start()
    
        except Exception as e:
            print(f"Error with {service_name}: {e}")
    
    def simulate_high_load():
        """Simulate high CPU load on a container"""
        try:
            container = client.containers.get('frontend')
            print("Simulating high CPU load")
    
            # Run stress test inside container (assumes the `stress` tool is installed in the image)
            container.exec_run("stress --cpu 2 --timeout 60s", detach=True)
    
        except Exception as e:
            print(f"Error simulating load: {e}")
    
    if __name__ == "__main__":
        print("Starting chaos testing...")
    
        # Run different chaos scenarios
        stop_random_service()
        time.sleep(120)
    
        simulate_high_load()
        time.sleep(120)

    Step 9: Deployment Script

    #!/bin/bash
    # scripts/deploy.sh
    
    set -e
    
    echo "Starting E-commerce Observability Stack deployment..."
    
    # Create necessary directories
    mkdir -p data/{user-db,product-db,order-db}
    mkdir -p prometheus grafana/provisioning/{datasources,dashboards}
    mkdir -p alertmanager blackbox
    
    # Set permissions
    chmod 777 data/{user-db,product-db,order-db}
    
    # Build application images
    echo "Building application images..."
    for service in frontend user-service product-service order-service payment-service inventory-service; do
        echo "Building $service..."
        docker build -t ecommerce/$service:latest apps/$service/
    done
    
    # Start the stack
    echo "Starting services..."
    docker-compose up -d
    
    # Wait for services to be ready
    echo "Waiting for services to start..."
    sleep 30
    
    # Check service health
    echo "Checking service health..."
    services=("prometheus:9090" "grafana:3000" "alertmanager:9093" "frontend:8080")
    
    for service in "${services[@]}"; do
        IFS=':' read -r name port <<< "$service"
        echo "Checking $name on port $port..."
    
        for i in {1..30}; do
            if curl -f "http://localhost:$port/health" 2>/dev/null || curl -f "http://localhost:$port" 2>/dev/null; then
                echo "$name is healthy"
                break
            fi
    
            if [ $i -eq 30 ]; then
                echo "Warning: $name may not be ready"
            fi
    
            sleep 2
        done
    done
    
    echo "Deployment complete!"
    echo "Access URLs:"
    echo "  Prometheus: http://localhost:9090"
    echo "  Grafana: http://localhost:3000 (admin/admin123)"
    echo "  Alertmanager: http://localhost:9093"
    echo "  Application: http://localhost:8080"
    
    echo "Run load tests with: python scripts/load_test.py"
    echo "Run chaos tests with: python scripts/chaos_test.py"

    Step 10: Documentation and Runbooks

    README.md

    # E-commerce Observability Stack
    
    This project demonstrates a complete observability setup for a microservices-based e-commerce application using Prometheus, Grafana, and Alertmanager.
    
    ## Architecture
    
    - **Frontend Service** (Go): Main web interface
    - **User Service** (Python): User management
    - **Product Service** (Python): Product catalog
    - **Order Service** (Python): Order processing
    - **Payment Service** (Python): Payment processing
    - **Inventory Service** (Python): Inventory management
    
    ## Deployment
    
    ```bash
    # Clone the repository
    git clone <repository-url>
    cd ecommerce-observability
    
    # Deploy the stack
    ./scripts/deploy.sh
    ```

    Access Points

    Testing

    Load Testing

    python scripts/load_test.py

    Chaos Testing

    python scripts/chaos_test.py

    Monitoring

    Key Metrics

    • Request rate per service
    • Error rate per service
    • Response time percentiles
    • Infrastructure utilization

    Alerts

    • Service down
    • High error rate (>5%)
    • High latency (>1s p95)
    • Infrastructure issues

    Troubleshooting

    Service Discovery Issues

    Check Prometheus targets: http://localhost:9090/targets

    Missing Metrics

    Verify service /metrics endpoints are accessible

    Alert Not Firing

    Check Prometheus rules: http://localhost:9090/rules

    ### Project Validation
    
    #### Verification Checklist
    
    1. **✅ Infrastructure Monitoring**
       - [ ] Node exporter collecting system metrics
       - [ ] CPU, memory, disk usage visible in Grafana
       - [ ] Infrastructure alerts firing correctly
    
    2. **✅ Application Monitoring**
       - [ ] All services exposing metrics
       - [ ] Request rate, error rate, latency tracked
       - [ ] Business metrics instrumented
    
    3. **✅ Alerting**
       - [ ] Critical alerts configured
       - [ ] Alert routing working
       - [ ] Notification channels tested
    
    4. **✅ Visualization**
       - [ ] Infrastructure dashboard functional
       - [ ] Application dashboard with filters
       - [ ] Business metrics dashboard
    
    5. **✅ Testing**
       - [ ] Load testing generating metrics
       - [ ] Chaos testing triggering alerts
       - [ ] Recovery scenarios validated
    
    ### Chapter 11 Summary
    
    The capstone project demonstrates a production-ready observability stack with comprehensive monitoring, alerting, and visualization. It covers infrastructure monitoring, application performance tracking, business metrics, and incident response workflows. The project serves as a practical template for implementing Prometheus-based observability in real-world microservices environments.
    
    ### Final Exercise
    
    1. **Deploy the Complete Stack**:
       - Follow the deployment guide
       - Verify all components are working
       - Access all web interfaces
    
    2. **Run Tests and Observe**:
       - Execute load tests and watch metrics
       - Trigger chaos tests and verify alerts
       - Practice incident response workflows
    
    3. **Customize and Extend**:
       - Add new metrics to services
       - Create custom dashboards
       - Implement additional alert rules
    
    ---
    
    ## 12. Appendices
    
    ### Appendix A: PromQL Cheat Sheet
    
    #### Basic Selectors
    ```promql
    # Simple metric selection
    http_requests_total
    
    # Label matching
    http_requests_total{method="GET"}
    http_requests_total{method!="GET"}
    http_requests_total{method=~"GET|POST"}
    http_requests_total{method!~"GET|POST"}
    
    # Multiple labels
    http_requests_total{method="GET", status="200"}
    ```

    Time Series Types

    # Instant vector (single value per series)
    up
    
    # Range vector (range of values over time)
    up[5m]
    
    # Scalar (single numeric value)
    42

    Rate and Counter Functions

    # Rate: per-second average rate
    rate(http_requests_total[5m])
    
    # Increase: total increase over time window
    increase(http_requests_total[5m])
    
    # irate: instantaneous rate from the last two samples (volatile; best for graphs, not alerts)
    irate(http_requests_total[5m])
    
    # Delta: difference between first and last value
    delta(cpu_temp_celsius[2h])

    Aggregation Operators

    # Sum
    sum(http_requests_total)
    sum by (job) (http_requests_total)
    sum without (instance) (http_requests_total)
    
    # Average
    avg(node_cpu_seconds_total)
    avg by (mode) (node_cpu_seconds_total)
    
    # Count
    count(up)
    count by (job) (up)
    
    # Min/Max
    min(node_filesystem_free_bytes)
    max(node_filesystem_free_bytes)
    
    # Quantile
    quantile(0.95, http_request_duration_seconds)
    
    # Top/Bottom K
    topk(5, http_requests_total)
    bottomk(3, node_filesystem_free_bytes)

    Mathematical Functions

    # Arithmetic operators
    node_memory_MemTotal_bytes - node_memory_MemFree_bytes
    rate(http_requests_total[5m]) * 60
    
    # Mathematical functions
    abs(delta(cpu_temp_celsius[5m]))
    ceil(rate(http_requests_total[5m]))
    floor(rate(http_requests_total[5m]))
    round(rate(http_requests_total[5m]), 0.1)
    sqrt(rate(http_requests_total[5m]))
    ln(rate(http_requests_total[5m]))
    log10(rate(http_requests_total[5m]))

    Histogram Functions

    # Quantiles
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
    histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
    
    # Average from histogram
    rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
    
    # Request rate from histogram
    rate(http_request_duration_seconds_count[5m])
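`histogram_quantile` works on cumulative `le` buckets and linearly interpolates within the bucket that contains the requested rank. A hedged sketch of that calculation (a hypothetical re-implementation over plain `(upper_bound, cumulative_count)` pairs; the real function also handles the `+Inf` bucket and NaN cases):

```python
# Simplified model of histogram_quantile's bucket interpolation.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 50 under 0.1s, 90 under 0.5s, 100 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.95, buckets))  # ~0.75: interpolated within the 0.5-1.0s bucket
```

This is also why quantile accuracy depends entirely on bucket layout: the estimate can never be more precise than the bucket boundaries.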

    Time Functions

    # Current time
    time()
    
    # Timestamp of samples
    timestamp(up)
    
    # Time-based filtering
    hour() > 9 and hour() < 17  # Business hours
    day_of_week() > 0 and day_of_week() < 6  # Weekdays
    
    # Prediction
    predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
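`predict_linear` fits a least-squares line through the samples in the range and extrapolates it forward, which is how the disk-full prediction above works. A simplified sketch (a hypothetical re-implementation over `(timestamp, value)` pairs with made-up sample data; the real function extrapolates from the evaluation time):

```python
# Least-squares fit over (t, value) samples, extrapolated t_ahead seconds
# past the last sample -- a simplified model of predict_linear.
def predict_linear(samples, t_ahead):
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / sum(
        (t - mean_t) ** 2 for t, _ in samples
    )
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + t_ahead) + intercept

# Free disk bytes dropping ~1 GB/hour; predict 4 hours ahead
samples = [(0, 50e9), (3600, 49e9), (7200, 48e9)]
print(predict_linear(samples, 4 * 3600))  # ~44e9 bytes free in 4 hours
```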

    String Functions

    # Label manipulation
    label_replace(up, "instance_short", "$1", "instance", "([^:]+):.*")
    label_join(up, "instance_job", ":", "instance", "job")

    Comparison Operators

    # Comparison
    node_filesystem_free_bytes < 1000000000  # Less than 1GB
    rate(http_requests_total[5m]) > 10       # More than 10 req/s
    
    # Boolean operators
    up == 1 and on(instance) node_load1 > 2
    up == 0 or on(instance) node_filesystem_free_bytes < 1000000000

    Advanced Patterns

    # SLI/SLO calculations
    sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
    
    # Error budget burn rate
    (1 - sli_availability) / (1 - slo_target) > burn_rate_threshold
    
    # Multi-service aggregation
    sum by (environment) (rate(http_requests_total[5m]))
    
    # Cross-metric calculations
    rate(http_requests_total[5m]) / on(instance) group_left sum by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
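The error-budget burn-rate expression above reduces to simple arithmetic: divide the fraction of requests currently failing by the fraction the SLO allows. A quick sketch with hypothetical numbers (`burn_rate` is an illustrative helper, not a Prometheus function):

```python
# Burn rate = observed error fraction / allowed error fraction.
def burn_rate(sli_availability, slo_target):
    return (1 - sli_availability) / (1 - slo_target)

# 99.9% availability SLO, currently serving 99.5% of requests successfully
print(burn_rate(0.995, 0.999))  # ~5: consuming the error budget 5x too fast
```

At a sustained burn rate of 5, a 30-day error budget is exhausted in about 6 days, which is why multi-window burn-rate alerts page on high values quickly.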

    Appendix B: Exporter Catalog

    Official Exporters

    | Exporter | Purpose | Port | Key Metrics |
    |---|---|---|---|
    | Node Exporter | System metrics | 9100 | CPU, memory, disk, network |
    | Blackbox Exporter | External monitoring | 9115 | HTTP, DNS, TCP, ICMP |
    | MySQL Exporter | MySQL database | 9104 | Connections, queries, performance |
    | Redis Exporter | Redis database | 9121 | Memory, commands, keys |
    | HAProxy Exporter | HAProxy load balancer | 8404 | Requests, responses, health |
    | NGINX Exporter | NGINX web server | 9113 | Requests, connections, status |

    Third-party Exporters

    | Exporter | Purpose | Port | Key Metrics |
    |---|---|---|---|
    | Postgres Exporter | PostgreSQL database | 9187 | Connections, queries, locks |
    | MongoDB Exporter | MongoDB database | 9216 | Operations, connections, memory |
    | Elasticsearch Exporter | Elasticsearch | 9114 | Cluster health, indices, queries |
    | RabbitMQ Exporter | RabbitMQ message broker | 9419 | Queues, messages, connections |
    | Kafka Exporter | Apache Kafka | 9308 | Topics, partitions, lag |
    | JMX Exporter | Java applications | 8080 | JVM metrics, garbage collection |

    Cloud Provider Exporters

    | Exporter | Purpose | Key Metrics |
    |---|---|---|
    | AWS CloudWatch Exporter | AWS services | EC2, RDS, ELB metrics |
    | Azure Monitor Exporter | Azure services | VM, storage, network metrics |
    | GCP Monitoring Exporter | Google Cloud | Compute, storage, network metrics |

    Configuration Examples

    Node Exporter
    # docker-compose.yml
    node-exporter:
      image: prom/node-exporter:latest
      command:
        - '--path.procfs=/host/proc'
        - '--path.rootfs=/rootfs'
        - '--path.sysfs=/host/sys'
        - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
      volumes:
        - /proc:/host/proc:ro
        - /sys:/host/sys:ro
        - /:/rootfs:ro
      ports:
        - "9100:9100"
    Blackbox Exporter
    # blackbox.yml
    modules:
      http_2xx:
        prober: http
        timeout: 5s
        http:
          valid_status_codes: []
          method: GET
          follow_redirects: true
    MySQL Exporter
    # Environment variables
    DATA_SOURCE_NAME: "user:password@(mysql:3306)/"
    
    # Or configuration file
    [client]
    user = exporter
    password = password
    host = mysql
    port = 3306

    Appendix C: Alert Rule Templates

    Infrastructure Alerts

    groups:
      - name: node_alerts
        rules:
          - alert: NodeDown
            expr: up{job="node-exporter"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
    
          - alert: HighCPU
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
    
          - alert: HighMemory
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
    
          - alert: DiskSpaceLow
            expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
    
          - alert: DiskSpaceCritical
            expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 95
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Critical disk space on {{ $labels.instance }}"

    Application Alerts

    groups:
      - name: application_alerts
        rules:
          - alert: ServiceDown
            expr: up{job=~".*-service"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Service {{ $labels.job }} is down"
    
          - alert: HighErrorRate
            expr: |
              sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
              / sum by (job) (rate(http_requests_total[5m])) > 0.05
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "High error rate for {{ $labels.job }}"
    
          - alert: HighLatency
            expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High latency for {{ $labels.job }}"
    
          - alert: LowThroughput
            expr: rate(http_requests_total[5m]) < 1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low throughput for {{ $labels.job }}"

    Database Alerts

    groups:
      - name: database_alerts
        rules:
          - alert: DatabaseDown
            expr: mysql_up == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Database {{ $labels.instance }} is down"
    
          - alert: HighConnections
            expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High database connections on {{ $labels.instance }}"
    
          - alert: SlowQueries
            expr: rate(mysql_global_status_slow_queries[5m]) > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Slow queries detected on {{ $labels.instance }}"
    YAML

    Multi-cluster Recording Rules

    # Global recording rules
    groups:
      - name: cross_cluster_aggregates
        interval: 60s
        rules:
          - record: global:request_rate:sum
            expr: sum by (service) (cluster:request_rate:sum)
    
          - record: global:error_rate:avg
            expr: avg by (service) (cluster:error_rate:avg)
    
          - record: global:latency:p95
            expr: |
              histogram_quantile(0.95,
                sum by (service, le) (cluster:latency:histogram)
              )
    
          - record: region:capacity:available
            expr: |
              sum by (region) (
                cluster:node_capacity:cpu - cluster:node_usage:cpu
              )
    YAML
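    Conceptually, each global rule simply re-aggregates the per-cluster recorded series along one label. The shape of that roll-up can be shown with a toy Python sketch (labels modeled as tuple keys; the cluster names and values here are made up for illustration):

    ```python
    from collections import defaultdict

    # Per-cluster recorded series: (cluster, service) -> cluster:request_rate:sum
    cluster_request_rate = {
        ("us-east", "frontend"): 120.0,
        ("us-west", "frontend"): 80.0,
        ("us-east", "user-service"): 40.0,
    }

    def sum_by_service(series):
        """Equivalent of: sum by (service) (cluster:request_rate:sum)."""
        out = defaultdict(float)
        for (cluster, service), value in series.items():
            out[service] += value
        return dict(out)

    print(sum_by_service(cluster_request_rate))
    # {'frontend': 200.0, 'user-service': 40.0}
    ```

    The `cluster` label is dropped by the aggregation, exactly as `sum by (service)` drops every label not listed.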

    Integrating with Logging and Tracing

    Correlation with ELK Stack

    # Logstash configuration for metrics correlation
    input {
      beats {
        port => 5044
      }
    }
    
    filter {
      if [fields][service] {
        # Add Prometheus job label
        mutate {
          add_field => { "prometheus_job" => "%{[fields][service]}" }
        }
    
        # Extract trace ID if present
        if [message] =~ /trace_id=/ {
          grok {
            match => { "message" => "trace_id=(?<trace_id>[a-f0-9]+)" }
          }
        }
    
        # Add links to metrics
        mutate {
          add_field => { 
            "metrics_link" => "http://grafana.local/d/app-dashboard?var-service=%{[fields][service]}&from=now-5m&to=now"
          }
        }
      }
    }
    
    output {
      elasticsearch {
        hosts => ["elasticsearch:9200"]
        index => "logs-%{+YYYY.MM.dd}"
      }
    }
    Ruby
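    The grok pattern above can be sanity-checked outside Logstash. Here is a small Python sketch that mirrors the trace-ID extraction and the Grafana deep link the mutate filter builds (illustrative only; the dashboard URL is the example value from the config):

    ```python
    import re

    # Mirrors the grok pattern: trace_id=(?<trace_id>[a-f0-9]+)
    TRACE_ID_RE = re.compile(r"trace_id=(?P<trace_id>[a-f0-9]+)")

    def extract_trace_id(line):
        """Return the trace ID embedded in a log line, or None."""
        m = TRACE_ID_RE.search(line)
        return m.group("trace_id") if m else None

    def metrics_link(service):
        """Build the same Grafana deep link the mutate filter adds."""
        return (f"http://grafana.local/d/app-dashboard"
                f"?var-service={service}&from=now-5m&to=now")

    print(extract_trace_id("GET /users 200 trace_id=deadbeef42"))  # deadbeef42
    print(extract_trace_id("no trace here"))                       # None
    ```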

    Jaeger Integration

    # Jaeger query service with Prometheus metrics
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: jaeger-query
    spec:
      template:
        spec:
          containers:
          - name: jaeger-query
            image: jaegertracing/jaeger-query:latest
            env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
            - name: METRICS_BACKEND
              value: prometheus
            - name: PROMETHEUS_SERVER_URL
              value: http://prometheus:9090
            ports:
            - containerPort: 16686
            - containerPort: 16687
    YAML

    OpenTelemetry Collector Configuration

    # otelcol-config.yml
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              static_configs:
                - targets: ['localhost:8888']
    
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
    
      attributes:
        actions:
          - key: cluster
            value: production
            action: insert
    
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
    
      prometheus:
        endpoint: "0.0.0.0:8889"
        namespace: "otel"
    
      prometheusremotewrite:
        # Requires Prometheus to be started with --web.enable-remote-write-receiver
        endpoint: "http://prometheus:9090/api/v1/write"
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [attributes, batch]
          exporters: [jaeger]
    
        metrics:
          receivers: [otlp, prometheus]
          processors: [attributes, batch]
          exporters: [prometheus, prometheusremotewrite]
    YAML

    Security and RBAC in Prometheus Setups

    Prometheus Security Configuration

    # Prometheus with TLS and authentication
    apiVersion: v1
    kind: Secret
    metadata:
      name: prometheus-certs
    type: Opaque
    data:
      tls.crt: <base64-encoded-cert>
      tls.key: <base64-encoded-key>
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus
    spec:
      template:
        spec:
          containers:
          - name: prometheus
            image: prom/prometheus:latest
            args:
              - '--config.file=/etc/prometheus/prometheus.yml'
              - '--web.config.file=/etc/prometheus/web.yml'
              - '--storage.tsdb.path=/prometheus'
              - '--web.listen-address=0.0.0.0:9090'
            volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: certs
              mountPath: /etc/ssl/prometheus
              readOnly: true
    YAML

    # web.yml - Prometheus web configuration
    tls_server_config:
      cert_file: /etc/ssl/prometheus/tls.crt
      key_file: /etc/ssl/prometheus/tls.key
    
    basic_auth_users:
      admin: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
      readonly: $2b$12$6tgWf5DZ9z7LZtD.ZrAb/.VjBfI3WnJg3ULf.TgLBtO4vKAzp7KuG
    YAML

    RBAC Configuration for Kubernetes

    # ServiceAccount for Prometheus
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: monitoring
    
    ---
    # ClusterRole with minimal permissions
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
    rules:
    - apiGroups: [""]
      resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
      verbs: ["get", "list", "watch"]
    - apiGroups: ["apps"]
      resources:
      - deployments
      - daemonsets
      - statefulsets
      verbs: ["get", "list", "watch"]
    - apiGroups: ["networking.k8s.io"]
      resources:
      - ingresses
      verbs: ["get", "list", "watch"]
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
    
    ---
    # ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: monitoring
    YAML

    OAuth2 Proxy Integration

    # OAuth2 Proxy for Prometheus
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: oauth2-proxy
    spec:
      template:
        spec:
          containers:
          - name: oauth2-proxy
            image: quay.io/oauth2-proxy/oauth2-proxy:latest
            args:
              - --provider=github
              - --email-domain=yourcompany.com
              - --upstream=http://prometheus:9090
              - --http-address=0.0.0.0:4180
              - --client-id=$(OAUTH2_PROXY_CLIENT_ID)
              - --client-secret=$(OAUTH2_PROXY_CLIENT_SECRET)
              - --cookie-secret=$(OAUTH2_PROXY_COOKIE_SECRET)
            env:
            - name: OAUTH2_PROXY_CLIENT_ID
              valueFrom:
                secretKeyRef:
                  name: oauth2-proxy-secrets
                  key: client-id
            - name: OAUTH2_PROXY_CLIENT_SECRET
              valueFrom:
                secretKeyRef:
                  name: oauth2-proxy-secrets
                  key: client-secret
            - name: OAUTH2_PROXY_COOKIE_SECRET
              valueFrom:
                secretKeyRef:
                  name: oauth2-proxy-secrets
                  key: cookie-secret
    YAML

    Chapter 10 Summary

    Advanced Prometheus topics include exemplars for linking metrics to traces, multi-cluster monitoring architectures, integration with logging and tracing systems, and comprehensive security configurations. These features enable enterprise-scale observability with proper access controls and correlation across different observability signals.

    Hands-on Exercise

    1. Exemplars Implementation:
      • Enable exemplars in Prometheus
      • Instrument an application with trace correlation
      • View exemplars in Grafana dashboards
    2. Multi-cluster Setup:
      • Configure federation between Prometheus instances
      • Implement cross-cluster monitoring
      • Test global query capabilities
    3. Security Hardening:
      • Implement TLS and authentication
      • Configure RBAC for Kubernetes
      • Set up OAuth2 proxy for access control

    11. Capstone Project

    Project Overview

    Build a complete observability stack for a sample e-commerce application with microservices architecture, including metrics collection, alerting, visualization, and incident response workflows.

    Architecture Overview

    graph TB
        subgraph "Application Layer"
            A[Frontend Service] --> B[User Service]
            A --> C[Product Service]
            A --> D[Order Service]
            D --> E[Payment Service]
            D --> F[Inventory Service]
            B --> G[User Database]
            C --> H[Product Database]
            D --> I[Order Database]
        end
    
        subgraph "Observability Layer"
            J[Prometheus] --> K[Alertmanager]
            J --> L[Grafana]
            M[Node Exporter] --> J
            N[Application Metrics] --> J
            O[Blackbox Exporter] --> J
            K --> P[Slack/Email]
            L --> Q[Dashboards]
        end
    
        A --> N
        B --> N
        C --> N
        D --> N
        E --> N
        F --> N

    Step 1: Infrastructure Setup

    Docker Compose Environment

    # docker-compose.yml
    version: '3.8'
    
    networks:
      monitoring:
        driver: bridge
      app:
        driver: bridge
    
    volumes:
      prometheus_data:
      grafana_data:
      alertmanager_data:
    
    services:
      # Prometheus
      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus:/etc/prometheus
          - prometheus_data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--web.enable-lifecycle'
          - '--web.enable-admin-api'
        networks:
          - monitoring
          - app
        restart: unless-stopped
    
      # Alertmanager
      alertmanager:
        image: prom/alertmanager:latest
        container_name: alertmanager
        ports:
          - "9093:9093"
        volumes:
          - ./alertmanager:/etc/alertmanager
          - alertmanager_data:/alertmanager
        command:
          - '--config.file=/etc/alertmanager/alertmanager.yml'
          - '--storage.path=/alertmanager'
        networks:
          - monitoring
        restart: unless-stopped
    
      # Grafana
      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        ports:
          - "3000:3000"
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=admin123
          - GF_USERS_ALLOW_SIGN_UP=false
        volumes:
          - grafana_data:/var/lib/grafana
          - ./grafana/provisioning:/etc/grafana/provisioning
        networks:
          - monitoring
        restart: unless-stopped
    
      # Node Exporter
      node-exporter:
        image: prom/node-exporter:latest
        container_name: node-exporter
        ports:
          - "9100:9100"
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
        command:
          - '--path.procfs=/host/proc'
          - '--path.rootfs=/rootfs'
          - '--path.sysfs=/host/sys'
          - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
        networks:
          - monitoring
        restart: unless-stopped
    
      # Blackbox Exporter
      blackbox-exporter:
        image: prom/blackbox-exporter:latest
        container_name: blackbox-exporter
        ports:
          - "9115:9115"
        volumes:
          - ./blackbox:/etc/blackbox_exporter
        networks:
          - monitoring
        restart: unless-stopped
    
      # Application Services
      frontend:
        build: ./apps/frontend
        container_name: frontend
        ports:
          - "8080:8080"
        environment:
          - USER_SERVICE_URL=http://user-service:8081
          - PRODUCT_SERVICE_URL=http://product-service:8082
          - ORDER_SERVICE_URL=http://order-service:8083
        networks:
          - app
        restart: unless-stopped
    
      user-service:
        build: ./apps/user-service
        container_name: user-service
        ports:
          - "8081:8081"
        environment:
          - DATABASE_URL=postgresql://user:password@user-db:5432/users
        networks:
          - app
        restart: unless-stopped
    
      product-service:
        build: ./apps/product-service
        container_name: product-service
        ports:
          - "8082:8082"
        environment:
          - DATABASE_URL=postgresql://product:password@product-db:5432/products
        networks:
          - app
        restart: unless-stopped
    
      order-service:
        build: ./apps/order-service
        container_name: order-service
        ports:
          - "8083:8083"
        environment:
          - DATABASE_URL=postgresql://order:password@order-db:5432/orders
          - PAYMENT_SERVICE_URL=http://payment-service:8084
          - INVENTORY_SERVICE_URL=http://inventory-service:8085
        networks:
          - app
        restart: unless-stopped
    
      payment-service:
        build: ./apps/payment-service
        container_name: payment-service
        ports:
          - "8084:8084"
        networks:
          - app
        restart: unless-stopped
    
      inventory-service:
        build: ./apps/inventory-service
        container_name: inventory-service
        ports:
          - "8085:8085"
        networks:
          - app
        restart: unless-stopped
    
      # Databases
      user-db:
        image: postgres:13
        container_name: user-db
        environment:
          - POSTGRES_DB=users
          - POSTGRES_USER=user
          - POSTGRES_PASSWORD=password
        volumes:
          - ./data/user-db:/var/lib/postgresql/data
        networks:
          - app
    
      product-db:
        image: postgres:13
        container_name: product-db
        environment:
          - POSTGRES_DB=products
          - POSTGRES_USER=product
          - POSTGRES_PASSWORD=password
        volumes:
          - ./data/product-db:/var/lib/postgresql/data
        networks:
          - app
    
      order-db:
        image: postgres:13
        container_name: order-db
        environment:
          - POSTGRES_DB=orders
          - POSTGRES_USER=order
          - POSTGRES_PASSWORD=password
        volumes:
          - ./data/order-db:/var/lib/postgresql/data
        networks:
          - app
    YAML
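    Once the stack is up (`docker compose up -d`), it helps to wait for every service's /health endpoint before generating traffic. A small polling helper, assuming the host ports published in the compose file above:

    ```python
    import time
    import urllib.request
    import urllib.error

    def is_healthy(url, timeout=2.0):
        """Return True if the endpoint answers with an HTTP 2xx status."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except (urllib.error.URLError, OSError):
            return False

    def wait_for_stack(urls, attempts=30, delay=2.0):
        """Poll every URL until all answer healthy, or give up."""
        for _ in range(attempts):
            if all(is_healthy(u) for u in urls):
                return True
            time.sleep(delay)
        return False

    # Host ports published by the compose file above
    services = [f"http://localhost:{port}/health"
                for port in (8080, 8081, 8082, 8083, 8084, 8085)]
    ```

    Run `wait_for_stack(services)` before load tests so the first scrapes are not dominated by startup errors.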

    Step 2: Application Instrumentation

    Frontend Service (Go)

    // apps/frontend/main.go
    package main
    
    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
        "os"
        "time"
    
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )
    
    var (
        httpRequestsTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "Total number of HTTP requests",
            },
            []string{"service", "method", "endpoint", "status"},
        )
    
        httpRequestDuration = prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name: "http_request_duration_seconds",
                Help: "HTTP request duration in seconds",
                Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
            },
            []string{"service", "method", "endpoint"},
        )
    
        upstreamRequestsTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "upstream_requests_total",
                Help: "Total upstream requests",
            },
            []string{"service", "target_service", "status"},
        )
    
        businessMetrics = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "business_events_total",
                Help: "Business events counter",
            },
            []string{"service", "event_type"},
        )
    )
    
    func init() {
        prometheus.MustRegister(httpRequestsTotal)
        prometheus.MustRegister(httpRequestDuration)
        prometheus.MustRegister(upstreamRequestsTotal)
        prometheus.MustRegister(businessMetrics)
    }
    
    func instrumentHandler(service, endpoint string, handler http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()
    
            // Wrap ResponseWriter to capture status code
            ww := &responseWriter{ResponseWriter: w, statusCode: 200}
    
            handler(ww, r)
    
            duration := time.Since(start).Seconds()
            status := fmt.Sprintf("%d", ww.statusCode)
    
            httpRequestsTotal.WithLabelValues(service, r.Method, endpoint, status).Inc()
            httpRequestDuration.WithLabelValues(service, r.Method, endpoint).Observe(duration)
        }
    }
    
    type responseWriter struct {
        http.ResponseWriter
        statusCode int
    }
    
    func (rw *responseWriter) WriteHeader(code int) {
        rw.statusCode = code
        rw.ResponseWriter.WriteHeader(code)
    }
    
    func homeHandler(w http.ResponseWriter, r *http.Request) {
        businessMetrics.WithLabelValues("frontend", "page_view").Inc()
    
        response := map[string]string{
            "service": "frontend",
            "status":  "healthy",
            "version": "1.0.0",
        }
    
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(response)
    }
    
    func usersHandler(w http.ResponseWriter, r *http.Request) {
        userServiceURL := os.Getenv("USER_SERVICE_URL")
        if userServiceURL == "" {
            userServiceURL = "http://localhost:8081"
        }
    
        start := time.Now()
        resp, err := http.Get(userServiceURL + "/users")
        duration := time.Since(start).Seconds()
    
        status := "500"
        if err == nil {
            status = fmt.Sprintf("%d", resp.StatusCode)
            defer resp.Body.Close()
        }
    
        upstreamRequestsTotal.WithLabelValues("frontend", "user-service", status).Inc()
    
        if err != nil {
            http.Error(w, "User service unavailable", http.StatusServiceUnavailable)
            return
        }
    
        businessMetrics.WithLabelValues("frontend", "user_list_view").Inc()
        w.Header().Set("Content-Type", "application/json")
        w.Write([]byte(`{"users": []}`))
    }
    
    func main() {
        http.Handle("/metrics", promhttp.Handler())
        http.HandleFunc("/", instrumentHandler("frontend", "/", homeHandler))
        http.HandleFunc("/users", instrumentHandler("frontend", "/users", usersHandler))
        http.HandleFunc("/health", instrumentHandler("frontend", "/health", func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
            w.Write([]byte("OK"))
        }))
    
        log.Println("Frontend service starting on :8080")
        log.Fatal(http.ListenAndServe(":8080", nil))
    }
    Go

    User Service (Python)

    # apps/user-service/app.py
    from flask import Flask, jsonify, request
    from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
    import time
    
    app = Flask(__name__)
    
    # Prometheus metrics
    REQUEST_COUNT = Counter(
        'http_requests_total',
        'Total HTTP requests',
        ['service', 'method', 'endpoint', 'status']
    )
    
    REQUEST_DURATION = Histogram(
        'http_request_duration_seconds',
        'HTTP request duration',
        ['service', 'method', 'endpoint'],
        buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
    )
    
    DATABASE_CONNECTIONS = Gauge(
        'database_connections_active',
        'Active database connections',
        ['service', 'database']
    )
    
    BUSINESS_EVENTS = Counter(
        'business_events_total',
        'Business events',
        ['service', 'event_type']
    )
    
    def instrument_request(f):
        def wrapper(*args, **kwargs):
            start_time = time.time()
            status = '200'
    
            try:
                result = f(*args, **kwargs)
                return result
            except Exception as e:
                status = '500'
                raise
            finally:
                REQUEST_COUNT.labels(
                    service='user-service',
                    method=request.method,
                    endpoint=request.endpoint or 'unknown',
                    status=status
                ).inc()
    
                REQUEST_DURATION.labels(
                    service='user-service',
                    method=request.method,
                    endpoint=request.endpoint or 'unknown'
                ).observe(time.time() - start_time)
    
        wrapper.__name__ = f.__name__
        return wrapper
    
    @app.route('/')
    @instrument_request
    def home():
        return jsonify({
            'service': 'user-service',
            'status': 'healthy',
            'version': '1.0.0'
        })
    
    @app.route('/users')
    @instrument_request
    def get_users():
        BUSINESS_EVENTS.labels(service='user-service', event_type='user_list_request').inc()
    
        # Simulate database query
        DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
        time.sleep(0.01)  # Simulate query time
        DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
    
        return jsonify({
            'users': [
                {'id': 1, 'name': 'John Doe', 'email': 'john@example.com'},
                {'id': 2, 'name': 'Jane Smith', 'email': 'jane@example.com'}
            ]
        })
    
    @app.route('/users/<int:user_id>')
    @instrument_request
    def get_user(user_id):
        BUSINESS_EVENTS.labels(service='user-service', event_type='user_detail_request').inc()
    
        DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
        time.sleep(0.005)
        DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
    
        return jsonify({
            'id': user_id,
            'name': f'User {user_id}',
            'email': f'user{user_id}@example.com'
        })
    
    @app.route('/health')
    @instrument_request
    def health():
        return jsonify({'status': 'healthy'})
    
    @app.route('/metrics')
    def metrics():
        return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
    
    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8081)
    Python

    Step 3: Prometheus Configuration

    # prometheus/prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'ecommerce'
        environment: 'production'
    
    rule_files:
      - "alert_rules.yml"
      - "recording_rules.yml"
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093
    
    scrape_configs:
      # Prometheus itself
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      # Node Exporter
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']
        scrape_interval: 30s
    
      # Application services
      - job_name: 'frontend'
        static_configs:
          - targets: ['frontend:8080']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'user-service'
        static_configs:
          - targets: ['user-service:8081']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'product-service'
        static_configs:
          - targets: ['product-service:8082']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'order-service'
        static_configs:
          - targets: ['order-service:8083']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'payment-service'
        static_configs:
          - targets: ['payment-service:8084']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      - job_name: 'inventory-service'
        static_configs:
          - targets: ['inventory-service:8085']
        metrics_path: '/metrics'
        scrape_interval: 15s
    
      # Blackbox monitoring
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
            - http://frontend:8080/health
            - http://user-service:8081/health
            - http://product-service:8082/health
            - http://order-service:8083/health
            - http://payment-service:8084/health
            - http://inventory-service:8085/health
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter:9115
    YAML
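    The three relabel steps in the blackbox job can be traced by hand: the probe URL moves from `__address__` into the probe's `target` parameter and the `instance` label, and the scrape address is rewritten to point at the exporter. A minimal Python sketch of that transformation (illustrative only; Prometheus applies these rules internally during service discovery):

    ```python
    def apply_blackbox_relabeling(labels):
        """Simulate the three relabel_configs from the blackbox job."""
        out = dict(labels)
        # 1. source_labels: [__address__] -> target_label: __param_target
        out["__param_target"] = out["__address__"]
        # 2. source_labels: [__param_target] -> target_label: instance
        out["instance"] = out["__param_target"]
        # 3. fixed replacement -> target_label: __address__
        out["__address__"] = "blackbox-exporter:9115"
        return out

    result = apply_blackbox_relabeling({"__address__": "http://frontend:8080/health"})
    print(result["instance"])     # http://frontend:8080/health
    print(result["__address__"])  # blackbox-exporter:9115
    ```

    The net effect: Prometheus scrapes `blackbox-exporter:9115/probe?target=<url>`, while each series keeps the probed URL as its `instance` label.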

    Step 4: Recording Rules

    # prometheus/recording_rules.yml
    groups:
      - name: application_rules
        interval: 30s
        rules:
          # Request rates
          - record: service:request_rate:rate5m
            expr: sum by (service) (rate(http_requests_total[5m]))
    
          - record: service:request_rate:rate1h
            expr: sum by (service) (rate(http_requests_total[1h]))
    
          # Error rates
          - record: service:error_rate:rate5m
            expr: |
              sum by (service) (rate(http_requests_total{status=~"[45].."}[5m])) /
              sum by (service) (rate(http_requests_total[5m]))
    
          # Latency percentiles
          - record: service:request_duration:p50
            expr: |
              histogram_quantile(0.50,
                sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
              )
    
          - record: service:request_duration:p95
            expr: |
              histogram_quantile(0.95,
                sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
              )
    
          - record: service:request_duration:p99
            expr: |
              histogram_quantile(0.99,
                sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
              )
    
      - name: infrastructure_rules
        interval: 30s
        rules:
          # Node metrics
          - record: node:cpu_usage:rate5m
            expr: |
              100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    
          - record: node:memory_usage:percentage
            expr: |
              (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    
          - record: node:disk_usage:percentage
            expr: |
              (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
    
      - name: business_rules
        interval: 60s
        rules:
          # Business metrics
          - record: business:page_views:rate1h
            expr: rate(business_events_total{event_type="page_view"}[1h]) * 3600
    
          - record: business:user_requests:rate1h
            expr: rate(business_events_total{event_type=~"user_.*"}[1h]) * 3600
    
          # Service dependency health
          - record: service:dependency_success_rate:rate5m
            expr: |
              sum by (service, target_service) (rate(upstream_requests_total{status=~"2.."}[5m])) /
              sum by (service, target_service) (rate(upstream_requests_total[5m]))
    YAML
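    The latency rules above lean on `histogram_quantile`, which linearly interpolates within the bucket containing the requested rank of cumulative observations. A simplified stand-alone sketch of that interpolation (ignoring edge cases Prometheus handles, such as malformed bucket data):

    ```python
    def histogram_quantile(q, buckets):
        """buckets: sorted list of (upper_bound, cumulative_count),
        ending with the +Inf bucket. Linear interpolation within the
        target bucket, as PromQL does."""
        total = buckets[-1][1]          # count in the +Inf bucket
        rank = q * total
        prev_bound, prev_count = 0.0, 0.0
        for bound, count in buckets:
            if count >= rank:
                if bound == float("inf"):
                    return prev_bound   # fell into +Inf: return last finite bound
                # Interpolate between the previous and current bucket bounds
                return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
            prev_bound, prev_count = bound, count
        return prev_bound

    buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
    print(histogram_quantile(0.95, buckets))  # ~0.778
    ```

    This is also why bucket boundaries matter: the reported p95 is an estimate whose precision depends on how finely the buckets straddle the true quantile.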

    Step 5: Alerting Rules

    # prometheus/alert_rules.yml
    groups:
      - name: infrastructure_alerts
        rules:
          - alert: NodeDown
            expr: up{job="node-exporter"} == 0
            for: 1m
            labels:
              severity: critical
              team: infrastructure
            annotations:
              summary: "Node is down"
              description: "Node {{ $labels.instance }} has been down for more than 1 minute"
              runbook_url: "https://runbooks.company.com/node-down"
    
          - alert: HighCPUUsage
            expr: node:cpu_usage:rate5m > 80
            for: 5m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "High CPU usage"
              description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
    
          - alert: HighMemoryUsage
            expr: node:memory_usage:percentage > 85
            for: 5m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "High memory usage"
              description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
    
      - name: application_alerts
        rules:
          - alert: ServiceDown
            expr: up{job=~"frontend|.*-service"} == 0
            for: 1m
            labels:
              severity: critical
              team: platform
            annotations:
              summary: "Service is down"
              description: "Service {{ $labels.job }} is down"
    
          - alert: HighErrorRate
            expr: service:error_rate:rate5m > 0.05
            for: 2m
            labels:
              severity: critical
              team: platform
            annotations:
              summary: "High error rate for {{ $labels.service }}"
              description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
    
          - alert: HighLatency
            expr: service:request_duration:p95 > 1
            for: 5m
            labels:
              severity: warning
              team: platform
            annotations:
              summary: "High latency for {{ $labels.service }}"
              description: "95th percentile latency is {{ $value }}s for {{ $labels.service }}"
    
          - alert: LowRequestRate
            expr: service:request_rate:rate5m < 0.1
            for: 10m
            labels:
              severity: warning
              team: platform
            annotations:
              summary: "Low request rate for {{ $labels.service }}"
              description: "Request rate is {{ $value }} req/s for {{ $labels.service }}"
    
      - name: business_alerts
        rules:
          - alert: LowPageViews
            expr: business:page_views:rate1h < 10
            for: 15m
            labels:
              severity: warning
              team: product
            annotations:
              summary: "Low page view rate"
              description: "Page view rate is {{ $value }} views/hour"
    
          - alert: ServiceDependencyFailure
            expr: service:dependency_success_rate:rate5m < 0.95
            for: 5m
            labels:
              severity: critical
              team: platform
            annotations:
              summary: "Service dependency failure"
              description: "{{ $labels.service }} -> {{ $labels.target_service }} success rate is {{ $value | humanizePercentage }}"
    YAML

    Step 6: Alertmanager Configuration

    # alertmanager/alertmanager.yml
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alerts@ecommerce.local'
      smtp_auth_username: 'alerts@ecommerce.local'
      smtp_auth_password: 'your-app-password'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
        # Critical alerts to on-call
        - matchers:
            - severity=critical
          receiver: 'critical-alerts'
          continue: true
    
        # Infrastructure team alerts
        - matchers:
            - team=infrastructure
          receiver: 'infrastructure-team'
    
        # Platform team alerts
        - matchers:
            - team=platform
          receiver: 'platform-team'
    
        # Product team alerts
        - matchers:
            - team=product
          receiver: 'product-team'
    
    receivers:
      - name: 'default'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts'
            title: 'Alert: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Severity:* {{ .Labels.severity }}
              *Service:* {{ .Labels.service }}
              {{ end }}
    
      - name: 'critical-alerts'
        email_configs:
          - to: 'oncall@ecommerce.local'
            subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
            body: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              Severity: {{ .Labels.severity }}
              Service: {{ .Labels.service }}
              Runbook: {{ .Annotations.runbook_url }}
              {{ end }}
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#critical-alerts'
            title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Runbook:* {{ .Annotations.runbook_url }}
              {{ end }}
    
      - name: 'infrastructure-team'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#infrastructure'
            title: '⚠️ Infrastructure Alert: {{ .GroupLabels.alertname }}'
    
      - name: 'platform-team'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#platform'
            title: '🔧 Platform Alert: {{ .GroupLabels.alertname }}'
    
      - name: 'product-team'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#product'
            title: '📊 Business Alert: {{ .GroupLabels.alertname }}'
    
    inhibit_rules:
      # Don't send warning alerts if critical alerts are firing
      - source_matchers:
          - severity=critical
        target_matchers:
          - severity=warning
        equal: ['service']
    
      # Don't send service alerts if node is down
      - source_matchers:
          - alertname=NodeDown
        target_matchers:
          - alertname=ServiceDown
        equal: ['instance']
    YAML

    Step 7: Grafana Dashboards

    Infrastructure Dashboard

    # grafana/provisioning/dashboards/infrastructure.json
    {
      "dashboard": {
        "id": null,
        "title": "Infrastructure Overview",
        "tags": ["infrastructure", "monitoring"],
        "timezone": "browser",
        "refresh": "30s",
        "time": {
          "from": "now-1h",
          "to": "now"
        },
        "panels": [
          {
            "id": 1,
            "title": "CPU Usage",
            "type": "stat",
            "targets": [
              {
                "expr": "node:cpu_usage:rate5m",
                "legendFormat": "{{ instance }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 70},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "Memory Usage",
            "type": "stat",
            "targets": [
              {
                "expr": "node:memory_usage:percentage",
                "legendFormat": "{{ instance }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 80},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
          },
          {
            "id": 3,
            "title": "CPU Usage Over Time",
            "type": "graph",
            "targets": [
              {
                "expr": "node:cpu_usage:rate5m",
                "legendFormat": "{{ instance }}"
              }
            ],
            "yAxes": [
              {
                "unit": "percent",
                "max": 100,
                "min": 0
              }
            ],
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
          }
        ]
      }
    }
    JSON

    Application Dashboard

    # grafana/provisioning/dashboards/application.json
    {
      "dashboard": {
        "id": null,
        "title": "Application Performance",
        "tags": ["application", "performance"],
        "timezone": "browser",
        "refresh": "30s",
        "templating": {
          "list": [
            {
              "name": "service",
              "type": "query",
              "query": "label_values(http_requests_total, service)",
              "refresh": 1,
              "multi": true,
              "includeAll": true
            }
          ]
        },
        "panels": [
          {
            "id": 1,
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "service:request_rate:rate5m{service=~\"$service\"}",
                "legendFormat": "{{ service }}"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "Error Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "service:error_rate:rate5m{service=~\"$service\"} * 100",
                "legendFormat": "{{ service }}"
              }
            ],
            "yAxes": [
              {
                "unit": "percent",
                "max": 100,
                "min": 0
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
          },
          {
            "id": 3,
            "title": "Response Time Percentiles",
            "type": "graph",
            "targets": [
              {
                "expr": "service:request_duration:p50{service=~\"$service\"}",
                "legendFormat": "{{ service }} - 50th"
              },
              {
                "expr": "service:request_duration:p95{service=~\"$service\"}",
                "legendFormat": "{{ service }} - 95th"
              },
              {
                "expr": "service:request_duration:p99{service=~\"$service\"}",
                "legendFormat": "{{ service }} - 99th"
              }
            ],
            "yAxes": [
              {
                "unit": "s"
              }
            ],
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
          }
        ]
      }
    }
    JSON

    Step 8: Testing and Validation

    Load Testing Script

    # scripts/load_test.py
    import requests
    import time
    import random
    import threading
    from concurrent.futures import ThreadPoolExecutor
    
    BASE_URL = "http://localhost:8080"
    
    stop_event = threading.Event()
    
    def make_request(endpoint):
        """Make a request to the specified endpoint"""
        try:
            response = requests.get(f"{BASE_URL}{endpoint}", timeout=5)
            return response.status_code
        except requests.RequestException as e:
            print(f"Error calling {endpoint}: {e}")
            return 500
    
    def generate_load():
        """Generate load on the application until the stop event is set"""
        endpoints = ["/", "/users", "/health"]
    
        while not stop_event.is_set():
            endpoint = random.choice(endpoints)
            make_request(endpoint)
    
            # Add some randomness to the load
            time.sleep(random.uniform(0.1, 1.0))
    
    def run_load_test(duration_minutes=10, concurrent_users=5):
        """Run load test for specified duration"""
        print(f"Starting load test with {concurrent_users} concurrent users for {duration_minutes} minutes")
    
        with ThreadPoolExecutor(max_workers=concurrent_users) as executor:
            # Submit one load generation task per simulated user
            for _ in range(concurrent_users):
                executor.submit(generate_load)
    
            # Let it run for the specified duration, then signal the workers
            # to stop; Future.cancel() cannot interrupt a task that is
            # already running.
            time.sleep(duration_minutes * 60)
            stop_event.set()
    
    if __name__ == "__main__":
        run_load_test(duration_minutes=5, concurrent_users=10)
    Python

    Chaos Testing

    # scripts/chaos_test.py
    import docker
    import time
    import random
    
    client = docker.from_env()
    
    def stop_random_service():
        """Stop a random service for chaos testing"""
        services = ['user-service', 'product-service', 'order-service']
        service_name = random.choice(services)
    
        try:
            container = client.containers.get(service_name)
            print(f"Stopping {service_name}")
            container.stop()
    
            # Wait for some time
            time.sleep(30)
    
            print(f"Starting {service_name}")
            container.start()
    
        except Exception as e:
            print(f"Error with {service_name}: {e}")
    
    def simulate_high_load():
        """Simulate high CPU load on a container"""
        try:
            container = client.containers.get('frontend')
            print("Simulating high CPU load")
    
            # Run stress test inside the container (assumes the `stress`
            # tool is installed in the image)
            container.exec_run("stress --cpu 2 --timeout 60s", detach=True)
    
        except Exception as e:
            print(f"Error simulating load: {e}")
    
    if __name__ == "__main__":
        print("Starting chaos testing...")
    
        # Run different chaos scenarios
        stop_random_service()
        time.sleep(120)
    
        simulate_high_load()
        time.sleep(120)
    Python

    Step 9: Deployment Script

    #!/bin/bash
    # scripts/deploy.sh
    
    set -e
    
    echo "Starting E-commerce Observability Stack deployment..."
    
    # Create necessary directories
    mkdir -p data/{user-db,product-db,order-db}
    mkdir -p prometheus grafana/provisioning/{datasources,dashboards}
    mkdir -p alertmanager blackbox
    
    # Set permissions
    chmod 777 data/{user-db,product-db,order-db}
    
    # Build application images
    echo "Building application images..."
    for service in frontend user-service product-service order-service payment-service inventory-service; do
        echo "Building $service..."
        docker build -t ecommerce/$service:latest apps/$service/
    done
    
    # Start the stack
    echo "Starting services..."
    docker-compose up -d
    
    # Wait for services to be ready
    echo "Waiting for services to start..."
    sleep 30
    
    # Check service health
    echo "Checking service health..."
    services=("prometheus:9090" "grafana:3000" "alertmanager:9093" "frontend:8080")
    
    for service in "${services[@]}"; do
        IFS=':' read -r name port <<< "$service"
        echo "Checking $name on port $port..."
    
        for i in {1..30}; do
        if curl -sf "http://localhost:$port/health" -o /dev/null 2>/dev/null || curl -sf "http://localhost:$port" -o /dev/null 2>/dev/null; then
                echo "$name is healthy"
                break
            fi
    
            if [ $i -eq 30 ]; then
                echo "Warning: $name may not be ready"
            fi
    
            sleep 2
        done
    done
    
    echo "Deployment complete!"
    echo "Access URLs:"
    echo "  Prometheus: http://localhost:9090"
    echo "  Grafana: http://localhost:3000 (admin/admin123)"
    echo "  Alertmanager: http://localhost:9093"
    echo "  Application: http://localhost:8080"
    
    echo "Run load tests with: python scripts/load_test.py"
    echo "Run chaos tests with: python scripts/chaos_test.py"
    Bash

    Step 10: Documentation and Runbooks

    README.md

    # E-commerce Observability Stack
    
    This project demonstrates a complete observability setup for a microservices-based e-commerce application using Prometheus, Grafana, and Alertmanager.
    
    ## Architecture
    
    - **Frontend Service** (Go): Main web interface
    - **User Service** (Python): User management
    - **Product Service** (Python): Product catalog
    - **Order Service** (Python): Order processing
    - **Payment Service** (Python): Payment processing
    - **Inventory Service** (Python): Inventory management
    
    ## Deployment
    
    ```bash
    # Clone the repository
    git clone <repository-url>
    cd ecommerce-observability
    
    # Deploy the stack
    ./scripts/deploy.sh
    ```
    Markdown

    Access Points

    Testing

    Load Testing

    python scripts/load_test.py
    Bash

    Chaos Testing

    python scripts/chaos_test.py
    Bash

    Monitoring

    Key Metrics

    • Request rate per service
    • Error rate per service
    • Response time percentiles
    • Infrastructure utilization

    Alerts

    • Service down
    • High error rate (>5%)
    • High latency (>1s p95)
    • Infrastructure issues

    Troubleshooting

    Service Discovery Issues

    Check Prometheus targets: http://localhost:9090/targets

    Missing Metrics

    Verify service /metrics endpoints are accessible

    Alert Not Firing

    Check Prometheus rules: http://localhost:9090/rules
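
    These checks can also be scripted against Prometheus's HTTP API: the `/api/v1/targets` endpoint returns the same data as the /targets page. A minimal sketch (the script name `scripts/check_targets.py` is a suggestion, not part of the project above):

    ```python
    # scripts/check_targets.py (hypothetical helper for the checks above)
    import json
    from urllib.request import urlopen
    
    def unhealthy_targets(payload):
        """Extract (job, instance, lastError) for every target that is not
        healthy from an /api/v1/targets response body."""
        return [
            (t["labels"].get("job"), t["labels"].get("instance"), t.get("lastError", ""))
            for t in payload["data"]["activeTargets"]
            if t.get("health") != "up"
        ]
    
    if __name__ == "__main__":
        try:
            with urlopen("http://localhost:9090/api/v1/targets", timeout=5) as resp:
                payload = json.load(resp)
        except OSError as e:
            print(f"Prometheus not reachable: {e}")
        else:
            for job, instance, err in unhealthy_targets(payload):
                print(f"DOWN: {job} on {instance}: {err}")
    ```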

    ### Project Validation
    
    #### Verification Checklist
    
    1. **✅ Infrastructure Monitoring**
       - [ ] Node exporter collecting system metrics
       - [ ] CPU, memory, disk usage visible in Grafana
       - [ ] Infrastructure alerts firing correctly
    
    2. **✅ Application Monitoring**
       - [ ] All services exposing metrics
       - [ ] Request rate, error rate, latency tracked
       - [ ] Business metrics instrumented
    
    3. **✅ Alerting**
       - [ ] Critical alerts configured
       - [ ] Alert routing working
       - [ ] Notification channels tested
    
    4. **✅ Visualization**
       - [ ] Infrastructure dashboard functional
       - [ ] Application dashboard with filters
       - [ ] Business metrics dashboard
    
    5. **✅ Testing**
       - [ ] Load testing generating metrics
       - [ ] Chaos testing triggering alerts
       - [ ] Recovery scenarios validated
    
    ### Chapter 11 Summary
    
    The capstone project demonstrates a production-ready observability stack with comprehensive monitoring, alerting, and visualization. It covers infrastructure monitoring, application performance tracking, business metrics, and incident response workflows. The project serves as a practical template for implementing Prometheus-based observability in real-world microservices environments.
    
    ### Final Exercise
    
    1. **Deploy the Complete Stack**:
       - Follow the deployment guide
       - Verify all components are working
       - Access all web interfaces
    
    2. **Run Tests and Observe**:
       - Execute load tests and watch metrics
       - Trigger chaos tests and verify alerts
       - Practice incident response workflows
    
    3. **Customize and Extend**:
       - Add new metrics to services
       - Create custom dashboards
       - Implement additional alert rules
    
    ---
    
    ## 12. Appendices
    
    ### Appendix A: PromQL Cheat Sheet
    
    #### Basic Selectors
    ```promql
    # Simple metric selection
    http_requests_total
    
    # Label matching
    http_requests_total{method="GET"}
    http_requests_total{method!="GET"}
    http_requests_total{method=~"GET|POST"}
    http_requests_total{method!~"GET|POST"}
    
    # Multiple labels
    http_requests_total{method="GET", status="200"}
    ```

    #### Time Series Types

    # Instant vector (single value per series)
    up
    
    # Range vector (range of values over time)
    up[5m]
    
    # Scalar (single numeric value)
    42
    PromQL

    #### Rate and Counter Functions

    # Rate: per-second average rate
    rate(http_requests_total[5m])
    
    # Increase: total increase over time window
    increase(http_requests_total[5m])
    
    # irate: instantaneous rate
    irate(http_requests_total[5m])
    
    # Delta: difference between first and last value
    delta(cpu_temp_celsius[2h])
    PromQL
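
    Counter semantics are the subtle part here: rate(), irate(), and increase() all compensate for counter resets, where a restarted process begins counting from zero again. The core of that reset handling can be sketched in Python (a simplification: the real functions also extrapolate to the edges of the range window):

    ```python
    def counter_increase(samples):
        """samples: time-ordered (timestamp, value) pairs for one counter
        series. A drop in value means the counter reset, so the post-reset
        value is counted as the increase since that reset."""
        total = 0.0
        for (_, v0), (_, v1) in zip(samples, samples[1:]):
            total += v1 - v0 if v1 >= v0 else v1
        return total
    
    def counter_rate(samples):
        """Per-second average rate over the sampled span, like rate()
        without the boundary extrapolation."""
        span = samples[-1][0] - samples[0][0]
        return counter_increase(samples) / span if span > 0 else 0.0
    ```

    A counter that resets mid-window (values 100 → 110 → 5) still yields an increase of 15, not -95.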

    #### Aggregation Operators

    # Sum
    sum(http_requests_total)
    sum by (job) (http_requests_total)
    sum without (instance) (http_requests_total)
    
    # Average
    avg(node_cpu_seconds_total)
    avg by (mode) (node_cpu_seconds_total)
    
    # Count
    count(up)
    count by (job) (up)
    
    # Min/Max
    min(node_filesystem_free_bytes)
    max(node_filesystem_free_bytes)
    
    # Quantile
    quantile(0.95, http_request_duration_seconds)
    
    # Top/Bottom K
    topk(5, http_requests_total)
    bottomk(3, node_filesystem_free_bytes)
    PromQL
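
    The `by` and `without` modifiers control which labels survive aggregation: `by` keeps only the listed labels, `without` drops the listed ones. The grouping behind `sum by (...)` can be sketched in Python, with each series modelled as a (labels, value) pair:

    ```python
    from collections import defaultdict
    
    def sum_by(samples, *keys):
        """Mimic `sum by (<keys>)`: group series on the listed labels and
        sum the values within each group."""
        grouped = defaultdict(float)
        for labels, value in samples:
            grouped[tuple(labels.get(k) for k in keys)] += value
        return dict(grouped)
    ```

    Summing by `job`, for example, collapses the `instance` dimension into one value per job.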

    #### Mathematical Functions

    # Arithmetic operators
    node_memory_MemTotal_bytes - node_memory_MemFree_bytes
    rate(http_requests_total[5m]) * 60
    
    # Mathematical functions
    abs(delta(cpu_temp_celsius[5m]))
    ceil(rate(http_requests_total[5m]))
    floor(rate(http_requests_total[5m]))
    round(rate(http_requests_total[5m]), 0.1)
    sqrt(rate(http_requests_total[5m]))
    ln(rate(http_requests_total[5m]))
    log10(rate(http_requests_total[5m]))
    PromQL

    #### Histogram Functions

    # Quantiles
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
    histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
    
    # Average from histogram
    rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
    
    # Request rate from histogram
    rate(http_request_duration_seconds_count[5m])
    PromQL
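
    histogram_quantile() estimates a quantile from cumulative bucket counts by locating the bucket the requested rank falls into and interpolating linearly inside it. A sketch of that estimation (simplified to one series with pre-aggregated cumulative counts):

    ```python
    import math
    
    def histogram_quantile(q, buckets):
        """buckets: [(upper_bound, cumulative_count), ...] sorted by bound
        and ending with (math.inf, total), like a <metric>_bucket family
        keyed by `le`."""
        total = buckets[-1][1]
        rank = q * total
        prev_bound, prev_count = 0.0, 0.0
        for bound, count in buckets:
            if count >= rank:
                if math.isinf(bound):
                    # Quantile falls in the +Inf bucket: clamp to the last
                    # finite upper bound.
                    return prev_bound
                # Linear interpolation assumes values are spread uniformly
                # within the bucket.
                return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
            prev_bound, prev_count = bound, count
        return prev_bound
    ```

    Because of the interpolation, wider buckets mean coarser estimates; p95/p99 accuracy depends on having bucket boundaries near the latencies you care about.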

    #### Time Functions

    # Current time
    time()
    
    # Timestamp of samples
    timestamp(up)
    
    # Time-based filtering
    hour() > 9 and hour() < 17  # Business hours
    day_of_week() > 0 and day_of_week() < 6  # Weekdays
    
    # Prediction
    predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
    PromQL

    #### String Functions

    # Label manipulation
    label_replace(up, "instance_short", "$1", "instance", "([^:]+):.*")
    label_join(up, "instance_job", ":", "instance", "job")
    PromQL

    #### Comparison Operators

    # Comparison
    node_filesystem_free_bytes < 1000000000  # Less than 1GB
    rate(http_requests_total[5m]) > 10       # More than 10 req/s
    
    # Boolean operators
    up == 1 and on(instance) node_load1 > 2
    up == 0 or on(instance) node_filesystem_free_bytes < 1000000000
    PromQL

    #### Advanced Patterns

    # SLI/SLO calculations
    sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
    
    # Error budget burn rate
    (1 - sli_availability) / (1 - slo_target) > burn_rate_threshold
    
    # Multi-service aggregation
    sum by (environment) (rate(http_requests_total[5m]))
    
    # Cross-metric calculations
    rate(http_requests_total[5m]) / on(instance) group_left rate(node_cpu_seconds_total{mode="idle"}[5m])
    PromQL
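
    The burn-rate expression above is just a ratio of error rates, worked through here in Python (names mirror the pseudo-metrics in the query):

    ```python
    def error_budget_burn_rate(sli_availability, slo_target):
        """Observed error rate divided by the error rate the SLO allows.
        A value of 1.0 consumes the budget exactly over the SLO window."""
        return (1.0 - sli_availability) / (1.0 - slo_target)
    ```

    For a 99.9% SLO, a measured availability of 99% gives a burn rate of 10: the error budget is being spent ten times faster than the SLO window allows.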

    ### Appendix B: Exporter Catalog

    #### Official Exporters

    | Exporter | Purpose | Port | Key Metrics |
    |----------|---------|------|-------------|
    | **Node Exporter** | System metrics | 9100 | CPU, memory, disk, network |
    | **Blackbox Exporter** | External monitoring | 9115 | HTTP, DNS, TCP, ICMP |
    | **MySQL Exporter** | MySQL database | 9104 | Connections, queries, performance |
    | **Redis Exporter** | Redis database | 9121 | Memory, commands, keys |
    | **HAProxy Exporter** | HAProxy load balancer | 8404 | Requests, responses, health |
    | **NGINX Exporter** | NGINX web server | 9113 | Requests, connections, status |
    | **RabbitMQ Exporter** | RabbitMQ message broker | 9419 | Queues, messages, connections |
    | **Kafka Exporter** | Apache Kafka | 9308 | Topics, partitions, lag |
    | **JMX Exporter** | Java applications | 8080 | JVM metrics, garbage collection |
    | **Consul Exporter** | HashiCorp Consul | 9107 | Service health, cluster status |
    | **Memcached Exporter** | Memcached | 9150 | Cache hits/misses, memory usage |
    | **StatsD Exporter** | StatsD metrics | 9102 | Custom application metrics |
    
    #### Cloud Provider Exporters
    
    | Exporter | Purpose | Key Metrics |
    |----------|---------|-------------|
    | **AWS CloudWatch Exporter** | AWS services | EC2, RDS, ELB metrics |
    | **Azure Monitor Exporter** | Azure services | VM, storage, network metrics |
    | **GCP Monitoring Exporter** | Google Cloud | Compute, storage, network metrics |
    | **DigitalOcean Exporter** | DigitalOcean | Droplet metrics, load balancers |
    
    #### Configuration Examples
    
    ##### Node Exporter
    ```yaml
    # docker-compose.yml
    node-exporter:
      image: prom/node-exporter:latest
      command:
        - '--path.procfs=/host/proc'
        - '--path.rootfs=/rootfs'
        - '--path.sysfs=/host/sys'
        - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
        - '--collector.textfile.directory=/host/textfile_collector'
      volumes:
        - /proc:/host/proc:ro
        - /sys:/host/sys:ro
        - /:/rootfs:ro
        - /var/log:/host/var/log:ro
      ports:
        - "9100:9100"
      network_mode: host
    ```
    
    ##### Blackbox Exporter
    
    # blackbox.yml
    modules:
      http_2xx:
        prober: http
        timeout: 5s
        http:
          valid_status_codes: []
          method: GET
          follow_redirects: true
          preferred_ip_protocol: "ip4"
          headers:
            User-Agent: "Prometheus Blackbox Exporter"
    
      http_post_2xx:
        prober: http
        timeout: 5s
        http:
          method: POST
          headers:
            Content-Type: application/json
          body: '{"health": "check"}'
    
      tcp_connect:
        prober: tcp
        timeout: 5s
    
      ping:
        prober: icmp
        timeout: 5s
        icmp:
          preferred_ip_protocol: "ip4"
    
      dns:
        prober: dns
        timeout: 5s
        dns:
          query_name: "example.com"
          query_type: "A"
          valid_rcodes:
            - NOERROR
    YAML
    
    ##### MySQL Exporter
    # Environment variables
    DATA_SOURCE_NAME: "user:password@(mysql:3306)/"
    
    # Or configuration file
    [client]
    user = exporter
    password = password
    host = mysql
    port = 3306
    
    # Prometheus scrape config
    scrape_configs:
      - job_name: 'mysql'
        static_configs:
          - targets: ['mysql-exporter:9104']
    YAML
    
    ##### PostgreSQL Exporter
    # docker-compose.yml
    postgres-exporter:
      image: prometheuscommunity/postgres-exporter
      environment:
        DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/database?sslmode=disable"
      ports:
        - "9187:9187"
    YAML
    
    ##### Redis Exporter
    redis-exporter:
      image: oliver006/redis_exporter
      environment:
        REDIS_ADDR: "redis://redis:6379"
        REDIS_PASSWORD: "your-redis-password"
      ports:
        - "9121:9121"
    YAML

    ### Appendix C: Alert Rule Templates

    #### Infrastructure Alerts

    groups:
      - name: node_alerts
        rules:
          - alert: NodeDown
            expr: up{job="node-exporter"} == 0
            for: 1m
            labels:
              severity: critical
              team: infrastructure
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "Node {{ $labels.instance }} has been down for more than 1 minute"
              runbook_url: "https://runbooks.company.com/alerts/node-down"
    
          - alert: HighCPU
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }}"
    
          - alert: CriticalCPU
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
            for: 2m
            labels:
              severity: critical
              team: infrastructure
            annotations:
              summary: "Critical CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }}"
    
          - alert: HighMemory
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
            for: 5m
            labels:
              severity: critical
              team: infrastructure
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }}"
    
          - alert: DiskSpaceLow
            expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
            for: 10m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Disk usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }} {{ $labels.mountpoint }}"
    
          - alert: DiskSpaceCritical
            expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 95
            for: 5m
            labels:
              severity: critical
              team: infrastructure
            annotations:
              summary: "Critical disk space on {{ $labels.instance }}"
              description: "Disk usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }} {{ $labels.mountpoint }}"
    
          - alert: HighLoadAverage
            expr: node_load1 / count by (instance) (count by (instance, cpu) (node_cpu_seconds_total{mode="idle"})) > 1.5
            for: 10m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "High load average on {{ $labels.instance }}"
              description: "Load average is {{ $value | printf \"%.2f\" }} on {{ $labels.instance }}"
    YAML

    #### Application Alerts

    groups:
      - name: application_alerts
        rules:
          - alert: ServiceDown
            expr: up{job=~".*-service"} == 0
            for: 1m
            labels:
              severity: critical
              team: platform
            annotations:
              summary: "Service {{ $labels.job }} is down"
              description: "Service {{ $labels.job }} on {{ $labels.instance }} is down"
              runbook_url: "https://runbooks.company.com/alerts/service-down"
    
          - alert: HighErrorRate
            expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
            for: 2m
            labels:
              severity: critical
              team: platform
            annotations:
              summary: "High error rate for {{ $labels.job }}"
              description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
    
          - alert: HighLatency
            expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
            for: 5m
            labels:
              severity: warning
              team: platform
            annotations:
              summary: "High latency for {{ $labels.job }}"
              description: "95th percentile latency is {{ $value | printf \"%.3f\" }}s for {{ $labels.job }}"
    
          - alert: LowThroughput
            expr: rate(http_requests_total[5m]) < 1
            for: 10m
            labels:
              severity: warning
              team: platform
            annotations:
              summary: "Low throughput for {{ $labels.job }}"
              description: "Request rate is {{ $value | printf \"%.2f\" }} req/s for {{ $labels.job }}"
    
          - alert: HighMemoryUsage
            expr: (container_memory_usage_bytes{container!="POD",container!=""} / container_spec_memory_limit_bytes) * 100 > 90
            for: 5m
            labels:
              severity: warning
              team: platform
            annotations:
              summary: "High memory usage for container {{ $labels.container }}"
              description: "Memory usage is {{ $value | printf \"%.2f\" }}% for container {{ $labels.container }} in pod {{ $labels.pod }}"
    YAML

    #### Database Alerts

    groups:
      - name: database_alerts
        rules:
          - alert: DatabaseDown
            expr: mysql_up == 0
            for: 1m
            labels:
              severity: critical
              team: database
            annotations:
              summary: "Database {{ $labels.instance }} is down"
              description: "MySQL database on {{ $labels.instance }} is not responding"
              runbook_url: "https://runbooks.company.com/alerts/database-down"
    
          - alert: HighConnections
            expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
            for: 5m
            labels:
              severity: warning
              team: database
            annotations:
              summary: "High database connections on {{ $labels.instance }}"
              description: "Database connection usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
    
          - alert: SlowQueries
            expr: rate(mysql_global_status_slow_queries[5m]) > 0.1
            for: 5m
            labels:
              severity: warning
              team: database
            annotations:
              summary: "High slow query rate on {{ $labels.instance }}"
              description: "Slow query rate is {{ $value | printf \"%.2f\" }} queries/s on {{ $labels.instance }}"
    
          - alert: DatabaseReplicationLag
            expr: mysql_slave_lag_seconds > 30
            for: 2m
            labels:
              severity: warning
              team: database
            annotations:
              summary: "Database replication lag on {{ $labels.instance }}"
              description: "Replication lag is {{ $value }}s on {{ $labels.instance }}"
    
          - alert: PostgreSQLDown
            expr: pg_up == 0
            for: 1m
            labels:
              severity: critical
              team: database
            annotations:
              summary: "PostgreSQL {{ $labels.instance }} is down"
              description: "PostgreSQL database on {{ $labels.instance }} is not responding"
    
          - alert: PostgreSQLHighConnections
            expr: sum by (instance) (pg_stat_activity_count) / pg_settings_max_connections > 0.8
            for: 5m
            labels:
              severity: warning
              team: database
            annotations:
              summary: "High PostgreSQL connections on {{ $labels.instance }}"
              description: "Connection usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
    YAML

    Network and External Service Alerts

    groups:
      - name: network_alerts
        rules:
          - alert: HighNetworkReceive
            expr: rate(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024  # 100MB/s
            for: 5m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "High network receive on {{ $labels.instance }}"
              description: "Network receive is {{ $value | humanize1024 }}B/s on {{ $labels.instance }} interface {{ $labels.device }}"
    
          - alert: HighNetworkTransmit
            expr: rate(node_network_transmit_bytes_total[5m]) > 100 * 1024 * 1024  # 100MB/s
            for: 5m
            labels:
              severity: warning
              team: infrastructure
            annotations:
              summary: "High network transmit on {{ $labels.instance }}"
              description: "Network transmit is {{ $value | humanize1024 }}B/s on {{ $labels.instance }} interface {{ $labels.device }}"
    
          - alert: ExternalServiceDown
            expr: probe_success{job="blackbox"} == 0
            for: 2m
            labels:
              severity: critical
              team: platform
            annotations:
              summary: "External service {{ $labels.instance }} is down"
              description: "External service check for {{ $labels.instance }} is failing"
    
          - alert: ExternalServiceSlowResponse
            expr: probe_duration_seconds{job="blackbox"} > 5
            for: 3m
            labels:
              severity: warning
              team: platform
            annotations:
              summary: "External service {{ $labels.instance }} is slow"
              description: "External service {{ $labels.instance }} is responding in {{ $value | printf \"%.2f\" }}s"
    YAML
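    The probe_success and probe_duration_seconds series used above come from the Blackbox Exporter, which must be scraped with the usual target rewrite so that instance becomes the probed URL. A minimal scrape sketch (exporter address and probed URL are placeholders to adapt):

    scrape_configs:
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]  # probe module defined in blackbox.yml
        static_configs:
          - targets:
            - 'https://example.com'
        relabel_configs:
          # The probed URL becomes the ?target= parameter...
          - source_labels: [__address__]
            target_label: __param_target
          # ...and is kept as the instance label, which the alerts above report on
          - source_labels: [__param_target]
            target_label: instance
          # Scrape the exporter itself, not the probed URL
          - target_label: __address__
            replacement: 'blackbox-exporter:9115'
    YAML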

    Business Logic Alerts

    groups:
      - name: business_alerts
        rules:
          - alert: LowOrderRate
            expr: rate(orders_total[1h]) * 3600 < 10
            for: 15m
            labels:
              severity: warning
              team: business
            annotations:
              summary: "Low order rate"
              description: "Order rate is {{ $value | printf \"%.2f\" }} orders/hour"
    
          - alert: HighCartAbandonmentRate
            expr: |
              (
                rate(cart_abandoned_total[1h]) /
                (rate(cart_created_total[1h]) + rate(cart_abandoned_total[1h]))
              ) > 0.7
            for: 30m
            labels:
              severity: warning
              team: business
            annotations:
              summary: "High cart abandonment rate"
              description: "Cart abandonment rate is {{ $value | humanizePercentage }}"
    
          - alert: PaymentProcessingFailures
            expr: rate(payment_failed_total[5m]) / rate(payment_attempted_total[5m]) > 0.05
            for: 10m
            labels:
              severity: critical
              team: payments
            annotations:
              summary: "High payment failure rate"
              description: "Payment failure rate is {{ $value | humanizePercentage }}"
    YAML
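    Every rule in these groups carries a team label, which is what Alertmanager can route on. A minimal routing sketch (receiver names are placeholders; matchers syntax requires Alertmanager v0.22+):

    route:
      receiver: 'default'
      group_by: ['alertname', 'team']
      routes:
        - matchers: ['team="database"']
          receiver: 'database-oncall'
        - matchers: ['team="payments"', 'severity="critical"']
          receiver: 'payments-pager'
    
    receivers:
      - name: 'default'
      - name: 'database-oncall'
      - name: 'payments-pager'
    YAML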

    Appendix D: Grafana Dashboard Templates

    Infrastructure Overview Dashboard

    {
      "dashboard": {
        "id": null,
        "title": "Infrastructure Overview",
        "tags": ["infrastructure", "overview"],
        "timezone": "browser",
        "refresh": "30s",
        "time": {
          "from": "now-1h",
          "to": "now"
        },
        "templating": {
          "list": [
            {
              "name": "instance",
              "type": "query",
              "query": "label_values(up{job=\"node-exporter\"}, instance)",
              "refresh": 1,
              "multi": true,
              "includeAll": true,
              "current": {
                "value": "$__all",
                "text": "All"
              }
            }
          ]
        },
        "panels": [
          {
            "id": 1,
            "title": "System Load",
            "type": "stat",
            "targets": [
              {
                "expr": "node_load1{instance=~\"$instance\"}",
                "legendFormat": "{{ instance }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "short",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 2},
                    {"color": "red", "value": 4}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "CPU Usage",
            "type": "stat",
            "targets": [
              {
                "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\"}[5m])) * 100)",
                "legendFormat": "{{ instance }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 70},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
          },
          {
            "id": 3,
            "title": "Memory Usage",
            "type": "stat",
            "targets": [
              {
                "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
                "legendFormat": "{{ instance }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 80},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}
          },
          {
            "id": 4,
            "title": "Disk Usage",
            "type": "stat",
            "targets": [
              {
                "expr": "(1 - (node_filesystem_avail_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"})) * 100",
                "legendFormat": "{{ instance }}:{{ mountpoint }}"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "percent",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": 0},
                    {"color": "yellow", "value": 80},
                    {"color": "red", "value": 90}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}
          }
        ]
      }
    }
    JSON
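    Dashboards like this can be provisioned through the Grafana HTTP API instead of pasted into the UI. A sketch, reusing the credentials from this guide's examples and assuming the JSON above is saved as infrastructure-overview.json; jq adds the overwrite flag so re-imports update the dashboard in place:

    # Add "overwrite": true next to the "dashboard" key, then POST to the API
    jq '. + {overwrite: true}' infrastructure-overview.json \
      | curl -s -X POST -H 'Content-Type: application/json' \
          -d @- "http://admin:admin123@localhost:3000/api/dashboards/db"
    Bash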

    Application Performance Dashboard

    {
      "dashboard": {
        "id": null,
        "title": "Application Performance",
        "tags": ["application", "performance"],
        "timezone": "browser",
        "refresh": "30s",
        "templating": {
          "list": [
            {
              "name": "service",
              "type": "query",
              "query": "label_values(http_requests_total, service)",
              "refresh": 1,
              "multi": true,
              "includeAll": true
            },
            {
              "name": "environment",
              "type": "query",
              "query": "label_values(http_requests_total, environment)",
              "refresh": 1
            }
          ]
        },
        "panels": [
          {
            "id": 1,
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service)",
                "legendFormat": "{{ service }}"
              }
            ],
            "yAxes": [
              {
                "label": "requests/sec",
                "min": 0
              }
            ],
            "gridPos": {"h": 9, "w": 12, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "Error Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{status=~\"[45]..\",service=~\"$service\",environment=\"$environment\"}[5m])) by (service) / sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service) * 100",
                "legendFormat": "{{ service }}"
              }
            ],
            "yAxes": [
              {
                "label": "error %",
                "min": 0,
                "max": 100
              }
            ],
            "gridPos": {"h": 9, "w": 12, "x": 12, "y": 0}
          }
        ]
      }
    }
    JSON

    Appendix E: Configuration Management

    Environment-specific Configurations

    Development Environment
    # prometheus-dev.yml
    global:
      scrape_interval: 30s
      evaluation_interval: 30s
      external_labels:
        environment: 'development'
        cluster: 'dev'
    
    rule_files:
      - "dev_rules.yml"
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['localhost:9100']
        scrape_interval: 60s  # Less frequent in dev
    
      - job_name: 'application'
        static_configs:
          - targets: ['localhost:8080']
        scrape_interval: 30s
    YAML
    Production Environment
    # prometheus-prod.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        environment: 'production'
        cluster: 'prod'
        datacenter: 'us-east-1'
    
    rule_files:
      - "prod_rules.yml"
      - "slo_rules.yml"
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
      - job_name: 'node-exporter'
        static_configs:
          - targets: 
            - 'node1:9100'
            - 'node2:9100'
            - 'node3:9100'
        scrape_interval: 15s
    
      - job_name: 'application'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - production
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
    
    remote_write:
      - url: "https://remote-storage.company.com/api/v1/write"
        authorization:
          type: Bearer
          # Prometheus does not expand environment variables in its config file;
          # read the secret from a file (or render the config with a templating step)
          credentials_file: /etc/prometheus/remote-write-token
    YAML


    Configuration Validation

    #!/bin/bash
    # scripts/validate-config.sh
    
    set -e
    
    echo "Validating Prometheus configuration..."
    
    # Check Prometheus config syntax
    promtool check config prometheus/prometheus.yml
    
    # Check recording rules
    if [ -f "prometheus/recording_rules.yml" ]; then
        promtool check rules prometheus/recording_rules.yml
    fi
    
    # Check alerting rules
    if [ -f "prometheus/alert_rules.yml" ]; then
        promtool check rules prometheus/alert_rules.yml
    fi
    
    # Check Alertmanager config
    if [ -f "alertmanager/alertmanager.yml" ]; then
        amtool check-config alertmanager/alertmanager.yml
    fi
    
    echo "Configuration validation completed successfully!"
    Bash
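    Beyond syntax checks, promtool can unit-test alerting rules against synthetic series. A sketch for the DatabaseDown rule from Appendix C (file paths are placeholders; run with promtool test rules tests.yml):

    # tests.yml
    rule_files:
      - prometheus/alert_rules.yml
    
    evaluation_interval: 1m
    
    tests:
      - interval: 1m
        input_series:
          - series: 'mysql_up{instance="db1:9104", job="mysql"}'
            values: '0 0 0'
        alert_rule_test:
          # DatabaseDown has for: 1m, so it fires by the 2m evaluation
          - eval_time: 2m
            alertname: DatabaseDown
            exp_alerts:
              - exp_labels:
                  severity: critical
                  team: database
                  instance: 'db1:9104'
                  job: mysql
                exp_annotations:
                  summary: "Database db1:9104 is down"
                  description: "MySQL database on db1:9104 is not responding"
                  runbook_url: "https://runbooks.company.com/alerts/database-down"
    YAML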

    Appendix F: Troubleshooting Guide

    Common Issues and Solutions

    Prometheus Issues

    Issue: Targets showing as “DOWN”

    # Check target accessibility
    curl -v http://target-host:9100/metrics
    
    # Check network connectivity
    telnet target-host 9100
    
    # Check Prometheus logs
    docker logs prometheus
    
    # Check scrape configuration
    curl http://localhost:9090/api/v1/targets
    Bash

    Issue: High memory usage

    # Check active series count
    prometheus_tsdb_head_series
    
    # Check samples appended per second
    rate(prometheus_tsdb_head_samples_appended_total[5m])
    
    # Find high-cardinality metrics (expensive on large servers; run sparingly)
    topk(10, count by (__name__)({__name__!=""}))
    PromQL
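    The same cardinality breakdown is available without running a heavy query, via the TSDB status endpoint (Prometheus v2.15+; assumes jq and the default port):

    # Series count per metric name, straight from the TSDB status API
    curl -s http://localhost:9090/api/v1/status/tsdb \
      | jq -r '.data.seriesCountByMetricName[] | "\(.value)\t\(.name)"'
    Bash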

    Issue: Slow queries

    # Check query duration
    histogram_quantile(0.99, rate(prometheus_engine_query_duration_seconds_bucket[5m]))
    
    # Check concurrent queries
    prometheus_engine_queries_concurrent_max
    PromQL
    Alertmanager Issues

    Issue: Alerts not firing

    # Check Prometheus rules evaluation
    curl http://localhost:9090/api/v1/rules
    
    # Check alert status
    curl http://localhost:9090/api/v1/alerts
    
    # Check Alertmanager configuration
    amtool config show --alertmanager.url=http://localhost:9093
    Bash

    Issue: Notifications not being sent

    # Check Alertmanager logs
    docker logs alertmanager
    
    # Test notification channels
    amtool alert add --alertmanager.url=http://localhost:9093 \
      alertname="test" service="test" severity="warning"
    
    # Check silences
    amtool silence query --alertmanager.url=http://localhost:9093
    Bash
    Grafana Issues

    Issue: Dashboard not loading data

    # Check data source connectivity
    curl -X GET "http://admin:admin123@localhost:3000/api/datasources/1/health"
    
    # Check Prometheus connectivity from Grafana
    curl -X GET "http://admin:admin123@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"
    Bash

    Issue: Variables not working

    • Check variable query syntax
    • Verify data source selection
    • Check refresh settings
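
    A variable query such as label_values(up{job="node-exporter"}, instance) can be checked directly against Prometheus; if this returns nothing, the Grafana variable will be empty too (assumes jq and the default port):

    # List the instance label values the Grafana variable query would see
    curl -s -G 'http://localhost:9090/api/v1/series' \
      --data-urlencode 'match[]=up{job="node-exporter"}' \
      | jq -r '.data[].instance' | sort -u
    Bash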

    Performance Optimization

    Reduce Cardinality
    # Metric relabeling to drop high cardinality labels
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'high_cardinality_metric.*'
        action: drop
    
      - source_labels: [user_id]
        target_label: user_type
        regex: 'premium_.*'
        replacement: 'premium'
    
      - regex: 'user_id'
        action: labeldrop
    YAML
    Optimize Recording Rules
    # Pre-compute expensive queries
    groups:
      - name: optimization_rules
        interval: 30s
        rules:
          - record: expensive_calculation:rate5m
            expr: |
              sum(rate(complex_metric[5m])) by (service) /
              sum(rate(other_complex_metric[5m])) by (service)
    YAML
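    The recorded series can then be used wherever the raw expression would have appeared, which is what keeps dashboards and alerts cheap. A sketch of an alert built on the rule above (threshold is illustrative):

    - alert: ExpensiveRatioHigh
      expr: expensive_calculation:rate5m > 2  # evaluates a single pre-computed series
      for: 10m
      labels:
        severity: warning
    YAML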

    Appendix G: Further Reading and References

    Books and Guides

    • “Prometheus: Up & Running” by Brian Brazil
    • “Monitoring with Prometheus” by James Turnbull
    • “Site Reliability Engineering” by Google (SRE practices)
    • “The Art of Monitoring” by James Turnbull


    Conclusion

    This comprehensive guide has covered all aspects of Prometheus observability, from basic concepts to advanced enterprise deployments. By following the patterns, best practices, and examples provided, you should be well-equipped to implement robust monitoring solutions that provide actionable insights into your systems and applications.

    Remember that observability is not just about collecting metrics—it’s about building systems that help you understand and improve your applications and infrastructure. Start with the basics, iterate based on your needs, and continuously refine your monitoring strategy as your systems evolve.

    The capstone project provides a practical foundation that you can adapt and extend for your specific use cases. Use the appendices as reference materials for ongoing implementation and troubleshooting.

    Happy monitoring! 🚀📊

