Table of Contents
- Introduction to Observability
- Getting Started with Prometheus
- Metrics and Data Collection
- PromQL: Querying and Analyzing Data
- Alerting and Notifications
- Visualization
- Prometheus in Kubernetes
- Scaling and Performance
- Best Practices and Pitfalls
- Advanced Topics
- Capstone Project
- Appendices
1. Introduction to Observability
What is Observability?
Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike traditional monitoring, which tells you when something breaks, observability helps you understand why it broke and how to fix it.
Monitoring vs. Observability
| Monitoring | Observability |
|---|---|
| Known unknowns | Unknown unknowns |
| Predefined dashboards | Ad-hoc queries |
| Health checks | Deep insights |
| Reactive | Proactive |
Monitoring answers: “Is the system up?” Observability answers: “Why is the system behaving this way?”
The Three Pillars of Observability
graph TB
A[Observability] --> B[Metrics]
A --> C[Logs]
A --> D[Traces]
B --> B1[Numerical data over time]
B --> B2[System performance indicators]
C --> C1[Discrete events with context]
C --> C2[Application and system logs]
D --> D1[Request flows across services]
D --> D2[Performance bottleneck identification]
1. Metrics
- Definition: Numerical measurements captured over time
- Examples: CPU usage, memory consumption, request rate, error rate
- Best for: Dashboards, alerting, trend analysis
2. Logs
- Definition: Discrete events with timestamps and context
- Examples: Application errors, access logs, audit trails
- Best for: Debugging, forensic analysis, compliance
3. Traces
- Definition: Records of requests as they flow through distributed systems
- Examples: Microservice call chains, database queries, external API calls
- Best for: Performance optimization, dependency mapping
Where Prometheus Fits
Prometheus is primarily a metrics-based monitoring system that excels at:
- Time-series data collection and storage
- Powerful querying language (PromQL)
- Built-in alerting capabilities
- Service discovery integration
- Scalable architecture
Chapter 1 Summary
graph LR
A[Applications] --> B[Prometheus]
C[Infrastructure] --> B
D[Exporters] --> B
B --> E[Alertmanager]
B --> F[Grafana]
B --> G[Remote Storage]
Observability goes beyond traditional monitoring by providing deep insights into system behavior. The three pillars—metrics, logs, and traces—work together to provide comprehensive visibility. Prometheus serves as the foundation for metrics collection and analysis in modern observability stacks.
Hands-on Exercise
- Reflection Exercise: Think about a recent production issue in your environment
- What metrics could have helped detect it earlier?
- What logs would have aided in debugging?
- How would distributed tracing have helped?
- Research Task: Investigate the observability stack used in your organization
- Identify which tools handle metrics, logs, and traces
- Note any gaps in observability coverage
2. Getting Started with Prometheus
History and Background
Prometheus was created at SoundCloud in 2012 by Matt T. Proud and Julius Volz. Inspired by Google’s Borgmon, it became a Cloud Native Computing Foundation (CNCF) project in 2016 and graduated in 2018.
Key Timeline:
- 2012: Created at SoundCloud
- 2015: Open-sourced
- 2016: Joined CNCF
- 2018: CNCF Graduated Project
Prometheus Architecture
graph TB
subgraph "Prometheus Server"
A[Retrieval] --> B[TSDB]
C[PromQL Engine] --> B
D[Web UI] --> C
E[HTTP API] --> C
end
F[Targets] --> A
G[Exporters] --> A
H[Pushgateway] --> A
B --> I[Alertmanager]
D --> J[Grafana]
C --> J
K[Service Discovery] --> A
Core Components
- Prometheus Server
- Scrapes and stores time-series data
- Executes PromQL queries
- Evaluates alerting rules
- Client Libraries
- Instrument applications
- Expose metrics endpoints
- Exporters
- Bridge between Prometheus and third-party systems
- Translate metrics to Prometheus format
- Alertmanager
- Handles alerts from Prometheus
- Manages routing, grouping, and silencing
- Pushgateway
- Allows ephemeral jobs to push metrics
- Used for batch jobs and short-lived processes (see the push sketch below)
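To make the Pushgateway's role concrete, here is a minimal sketch of a short-lived batch job pushing a completion timestamp. The Pushgateway address, job name, and metric name are illustrative assumptions; the official prometheus_client Python package is required.

```python
# push_batch_metric.py - sketch of a short-lived job pushing to a Pushgateway.
# Assumes a Pushgateway is reachable at localhost:9091 and that
# prometheus_client is installed (pip install prometheus-client).
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def run_batch_job():
    # ... the actual batch work would happen here ...
    pass

if __name__ == '__main__':
    run_batch_job()

    # Use a dedicated registry so only this job's metrics are pushed.
    registry = CollectorRegistry()
    last_success = Gauge(
        'batch_job_last_success_timestamp_seconds',
        'Unix timestamp of the last successful batch run',
        registry=registry,
    )
    last_success.set_to_current_time()

    # Prometheus then scrapes the Pushgateway on its normal schedule.
    push_to_gateway('localhost:9091', job='nightly-batch', registry=registry)
```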
Installation Methods
Method 1: Binary Installation (Windows)
- Download Prometheus:
# Create directory
New-Item -ItemType Directory -Path C:\prometheus
# Download latest release
$url = "https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.windows-amd64.zip"
Invoke-WebRequest -Uri $url -OutFile C:\prometheus\prometheus.zip
# Extract
Expand-Archive -Path C:\prometheus\prometheus.zip -DestinationPath C:\prometheus
- Create basic configuration:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
- Run Prometheus:
cd C:\prometheus\prometheus-2.47.0.windows-amd64
.\prometheus.exe --config.file=prometheus.yml --storage.tsdb.path=data\
Method 2: Docker Installation
- Create configuration directory:
mkdir prometheus-data
- Create prometheus.yml:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['host.docker.internal:9100']
- Run with Docker:
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v ${PWD}/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.console.templates=/etc/prometheus/consoles \
  --web.enable-lifecycle
Method 3: Docker Compose
# docker-compose.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
prometheus_data:
Configuration Basics
Understanding prometheus.yml
# Global configuration
global:
scrape_interval: 15s # How often to scrape targets
evaluation_interval: 15s # How often to evaluate rules
external_labels: # Labels attached when data leaves this server (federation, remote write, alerting)
cluster: 'production'
region: 'us-west-2'
# Rule files for recording and alerting rules
rule_files:
- "alert_rules.yml"
- "recording_rules.yml"
# Scrape configuration
scrape_configs:
# Self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
scrape_interval: 5s # Override global interval
metrics_path: /metrics # Default metrics endpoint
# Application monitoring
- job_name: 'my-app'
static_configs:
- targets: ['app1:8080', 'app2:8080']
scrape_timeout: 10s
honor_labels: true
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Remote write configuration (optional)
remote_write:
- url: "https://remote-storage-endpoint/write"
headers:
Authorization: "Bearer token"
Key Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| scrape_interval | How often to collect metrics | 1m |
| scrape_timeout | Maximum time for a scrape request | 10s |
| evaluation_interval | Rule evaluation frequency | 1m |
| metrics_path | HTTP path for metrics | /metrics |
| scheme | Protocol (http/https) | http |
Verifying Installation
- Access the Prometheus Web UI:
  - Open a browser to http://localhost:9090
  - Check Status → Targets to see configured endpoints
- Test a basic query: run up in the expression browser; it should return 1 for every healthy target.
- Check the metrics endpoint:
curl http://localhost:9090/metrics
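Building on these manual checks, the sketch below scripts the same verification against Prometheus's HTTP API. It assumes the default localhost:9090 address used throughout this chapter and that the requests package is installed.

```python
# check_prometheus.py - sketch of a scripted installation check via the HTTP API.
# Assumes Prometheus listens on localhost:9090 and `requests` is installed.
import sys
import requests

PROMETHEUS_URL = "http://localhost:9090"

def check_up_query() -> bool:
    """Run the `up` query; every healthy target should report the value 1."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": "up"}, timeout=5)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    for series in results:
        instance = series["metric"].get("instance", "unknown")
        print(f"{instance}: up={series['value'][1]}")
    return all(series["value"][1] == "1" for series in results)

def check_targets() -> bool:
    """List scrape targets and their health, mirroring Status -> Targets in the UI."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/targets", timeout=5)
    resp.raise_for_status()
    targets = resp.json()["data"]["activeTargets"]
    for target in targets:
        print(f"{target['scrapeUrl']}: {target['health']}")
    return all(target["health"] == "up" for target in targets)

if __name__ == "__main__":
    ok = check_up_query() and check_targets()
    sys.exit(0 if ok else 1)
```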
Chapter 2 Summary
Prometheus follows a pull-based architecture where the server scrapes metrics from configured targets. The system consists of the main server, client libraries, exporters, and supporting components like Alertmanager. Installation can be done via binaries, Docker, or Kubernetes, with configuration managed through the prometheus.yml file.
Hands-on Exercise
- Basic Setup:
- Install Prometheus using your preferred method
- Configure it to monitor itself
- Access the web UI and explore the interface
- Configuration Practice:
- Modify the scrape interval to 30 seconds
- Add a new job that targets a non-existent endpoint
- Observe the target status and understand failure states
- Metrics Exploration:
- Use the web UI to explore available metrics
- Try simple queries like prometheus_tsdb_samples_total
- Understand the differences between the metric types you see
3. Metrics and Data Collection
Types of Metrics
Prometheus supports four fundamental metric types, each serving different purposes:
1. Counter
A cumulative metric that only increases (or resets to zero on restart).
Use cases: Request counts, error counts, tasks completed
Examples: http_requests_total, errors_total
// Go example
var requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
2. Gauge
A metric that can go up and down.
Use cases: Memory usage, CPU usage, queue size, temperature
Examples: memory_usage_bytes, cpu_usage_percent
// Go example
var memoryUsage = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "memory_usage_bytes",
Help: "Current memory usage in bytes",
},
)
3. Histogram
Samples observations and counts them in configurable buckets.
Use cases: Request durations, response sizes, latency distribution
Features: Provides _count, _sum, and _bucket metrics
// Go example
var requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: prometheus.DefBuckets, // or custom: []float64{.1, .25, .5, 1, 2.5, 5, 10}
},
[]string{"method", "endpoint"},
)
4. Summary
Similar to histogram but calculates configurable quantiles.
Use cases: Request durations when you need specific percentiles
Features: Provides _count, _sum, and quantile metrics
// Go example
var requestDuration = prometheus.NewSummaryVec(
prometheus.SummaryOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
},
[]string{"method", "endpoint"},
)
Exposing Metrics with Client Libraries
Go Application Example
// main.go
package main
import (
"fmt"
"log"
"math/rand"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"method", "endpoint"},
)
activeConnections = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "Number of active connections",
},
)
)
func init() {
// Register metrics with Prometheus
prometheus.MustRegister(requestsTotal)
prometheus.MustRegister(requestDuration)
prometheus.MustRegister(activeConnections)
}
func metricsMiddleware(next http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Increment active connections
activeConnections.Inc()
defer activeConnections.Dec()
// Call the next handler
next(w, r)
// Record metrics
duration := time.Since(start).Seconds()
requestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
}
}
func helloHandler(w http.ResponseWriter, r *http.Request) {
// Simulate some work
time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
fmt.Fprintf(w, "Hello, World!")
}
func main() {
// Application routes
http.HandleFunc("/hello", metricsMiddleware(helloHandler))
// Metrics endpoint
http.Handle("/metrics", promhttp.Handler())
log.Println("Server starting on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}
Python Application Example
# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time
import random
app = Flask(__name__)
# Define metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'endpoint'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
ACTIVE_CONNECTIONS = Gauge(
'active_connections',
'Number of active connections'
)
def track_metrics(f):
def wrapper(*args, **kwargs):
start_time = time.time()
ACTIVE_CONNECTIONS.inc()
try:
result = f(*args, **kwargs)
status = '200'
return result
except Exception as e:
status = '500'
raise
finally:
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.endpoint or 'unknown',
status=status
).inc()
REQUEST_DURATION.labels(
method=request.method,
endpoint=request.endpoint or 'unknown'
).observe(time.time() - start_time)
ACTIVE_CONNECTIONS.dec()
wrapper.__name__ = f.__name__
return wrapper
@app.route('/hello')
@track_metrics
def hello():
# Simulate work
time.sleep(random.uniform(0.01, 0.1))
return "Hello, World!"
@app.route('/metrics')
def metrics():
return generate_latest()
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
Node.js Application Example
// app.js
const express = require('express');
const promClient = require('prom-client');
const app = express();
const port = 8080;
// Create metrics
const requestCounter = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'endpoint', 'status']
});
const requestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'endpoint'],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
const activeConnections = new promClient.Gauge({
name: 'active_connections',
help: 'Number of active connections'
});
// Middleware to track metrics
function metricsMiddleware(req, res, next) {
const start = Date.now();
activeConnections.inc();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
requestCounter.labels(req.method, req.path, res.statusCode).inc();
requestDuration.labels(req.method, req.path).observe(duration);
activeConnections.dec();
});
next();
}
app.use(metricsMiddleware);
app.get('/hello', (req, res) => {
// Simulate work
setTimeout(() => {
res.send('Hello, World!');
}, Math.random() * 100);
});
app.get('/metrics', async (req, res) => {
res.set('Content-Type', promClient.register.contentType);
// register.metrics() returns a Promise in current prom-client versions
res.end(await promClient.register.metrics());
});
app.listen(port, () => {
console.log(`Server running on port ${port}`);
});
Exporters
Exporters are components that fetch statistics from third-party systems and export them as Prometheus metrics.
Node Exporter (System Metrics)
# docker-compose.yml addition
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
restart: unless-stopped
Key metrics from node-exporter:
- node_cpu_seconds_total: CPU usage
- node_memory_MemTotal_bytes: Total memory
- node_filesystem_size_bytes: Filesystem size
- node_network_receive_bytes_total: Network received bytes
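As a quick illustration of how these metrics are consumed, the following sketch derives a memory-usage percentage from node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes through the Prometheus HTTP API. The Prometheus address, the 90% threshold, and the requests dependency are assumptions.

```python
# memory_usage_check.py - sketch: derive memory usage from node-exporter metrics.
# Assumes Prometheus at localhost:9090 is already scraping node-exporter.
import requests

PROMETHEUS_URL = "http://localhost:9090"

# The same expression is used later in the PromQL chapter for memory usage percentage.
QUERY = "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"

def memory_usage_by_instance() -> dict:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": QUERY},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Each series carries an instance label and a [timestamp, value] pair.
    return {s["metric"]["instance"]: float(s["value"][1]) for s in result}

if __name__ == "__main__":
    for instance, usage in memory_usage_by_instance().items():
        flag = "HIGH" if usage > 90 else "ok"
        print(f"{instance}: {usage:.1f}% memory used [{flag}]")
```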
Blackbox Exporter (External Monitoring)
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: []
method: GET
follow_redirects: true
preferred_ip_protocol: "ip4"
http_post_2xx:
prober: http
timeout: 5s
http:
method: POST
headers:
Content-Type: application/json
body: '{"test": "data"}'
tcp_connect:
prober: tcp
timeout: 5s
dns:
prober: dns
timeout: 5s
dns:
query_name: "example.com"
query_type: "A"
# prometheus.yml addition
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://google.com
- https://github.com
- https://stackoverflow.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
Custom Exporter Example
# custom_exporter.py
from prometheus_client import start_http_server, Gauge, Counter
import time
import psutil
import requests
# Define custom metrics
CUSTOM_CPU_USAGE = Gauge('custom_cpu_usage_percent', 'Custom CPU usage percentage')
CUSTOM_DISK_USAGE = Gauge('custom_disk_usage_percent', 'Custom disk usage percentage', ['device'])
API_CALLS_TOTAL = Counter('api_calls_total', 'Total API calls made', ['endpoint'])
def collect_system_metrics():
"""Collect custom system metrics"""
# CPU usage
cpu_percent = psutil.cpu_percent(interval=1)
CUSTOM_CPU_USAGE.set(cpu_percent)
# Disk usage
for partition in psutil.disk_partitions():
try:
partition_usage = psutil.disk_usage(partition.mountpoint)
usage_percent = (partition_usage.used / partition_usage.total) * 100
CUSTOM_DISK_USAGE.labels(device=partition.device).set(usage_percent)
except PermissionError:
continue
def call_external_api():
"""Simulate calling external APIs and track calls"""
endpoints = ['/users', '/orders', '/products']
for endpoint in endpoints:
try:
# Simulate API call
response = requests.get(f'https://jsonplaceholder.typicode.com{endpoint}', timeout=5)
API_CALLS_TOTAL.labels(endpoint=endpoint).inc()
except requests.RequestException:
pass
if __name__ == '__main__':
# Start metrics server
start_http_server(8000)
print("Custom exporter started on port 8000")
while True:
collect_system_metrics()
call_external_api()
time.sleep(30)
Service Discovery and Relabeling
File-based Service Discovery
# prometheus.yml
scrape_configs:
- job_name: 'file-discovery'
file_sd_configs:
- files:
- 'targets/*.json'
refresh_interval: 30s
# targets/web-servers.json
[
{
"targets": ["web1:8080", "web2:8080", "web3:8080"],
"labels": {
"job": "web-servers",
"environment": "production",
"region": "us-west-2"
}
},
{
"targets": ["api1:8080", "api2:8080"],
"labels": {
"job": "api-servers",
"environment": "production",
"region": "us-east-1"
}
}
]
Relabeling Configuration
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use custom metrics path if specified
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Use custom port if specified
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Add pod metadata as labels
- source_labels: [__meta_kubernetes_pod_name]
target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_node_name]
target_label: kubernetes_node
Chapter 3 Summary
Prometheus supports four metric types: counters for cumulative values, gauges for current values, histograms for distribution analysis, and summaries for quantile calculations. Client libraries in various languages make it easy to instrument applications, while exporters bridge third-party systems. Service discovery and relabeling provide flexible configuration for dynamic environments.
Hands-on Exercise
- Instrument an Application:
- Choose a simple web application in your preferred language
- Add Prometheus metrics for request count, duration, and active connections
- Test the metrics endpoint
- Deploy Exporters:
- Set up node-exporter to monitor system metrics
- Configure blackbox-exporter to monitor external websites
- Add both to your Prometheus configuration
- Service Discovery:
- Create a file-based service discovery configuration
- Add and remove targets dynamically
- Observe how Prometheus handles target changes
4. PromQL: Querying and Analyzing Data
Introduction to PromQL
Prometheus Query Language (PromQL) is a functional query language that allows you to select and aggregate time-series data. It’s designed to be both powerful and intuitive for operational use cases.
Basic PromQL Concepts
Instant Vectors vs Range Vectors
# Instant vector - single value per time series at query time
up
# Range vector - range of values over time
up[5m]
Selectors and Matchers
# Exact match
http_requests_total{job="prometheus"}
# Regex match
http_requests_total{job=~".*server.*"}
# Negative match
http_requests_total{job!="prometheus"}
# Negative regex match
http_requests_total{job!~".*test.*"}
# Multiple labels
http_requests_total{job="api-server",method="GET",status="200"}
Common Queries for System Metrics
CPU Metrics
# Current CPU usage per core
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU usage by mode
rate(node_cpu_seconds_total[5m]) * 100
# Top 5 instances by CPU usage
topk(5, 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
# CPU usage over 80%
(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
Memory Metrics
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Available memory in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
# Memory usage by instance
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Instances with memory usage > 90%
((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100) > 90
Disk Metrics
# Disk usage percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# Disk usage excluding system filesystems
(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} / node_filesystem_size_bytes)) * 100
# Free disk space in GB
node_filesystem_avail_bytes / 1024 / 1024 / 1024
# Disk I/O rate
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])
Network Metrics
# Network receive rate in MB/s
rate(node_network_receive_bytes_total[5m]) / 1024 / 1024
# Network transmit rate in MB/s
rate(node_network_transmit_bytes_total[5m]) / 1024 / 1024
# Total network traffic
rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])
# Network errors
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])
Advanced PromQL Functions
Rate and Increase
# Rate: per-second average rate over time window
rate(http_requests_total[5m])
# Increase: total increase over time window
increase(http_requests_total[5m])
# irate: instantaneous rate (using last two data points)
irate(http_requests_total[5m])
Histogram Functions
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# 50th percentile (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Request rate
rate(http_request_duration_seconds_count[5m])
Aggregation Functions
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Average across instances
avg(rate(http_requests_total[5m]))
# Maximum value
max(node_memory_MemTotal_bytes)
# Count number of instances
count(up == 1)
# Sum by job
sum by (job) (rate(http_requests_total[5m]))
# Average without specific labels
avg without (instance) (rate(http_requests_total[5m]))
Mathematical Functions
# Absolute value
abs(delta(cpu_temp_celsius[5m]))
# Round to nearest integer
round(rate(http_requests_total[5m]))
# Ceiling and floor
ceil(rate(http_requests_total[5m]))
floor(rate(http_requests_total[5m]))
# Square root
sqrt(rate(http_requests_total[5m]))
# Logarithm
ln(rate(http_requests_total[5m]))
log10(rate(http_requests_total[5m]))
Time Functions
# Current timestamp
time()
# Time since epoch for each sample
timestamp(up)
# Day of week (0=Sunday, 6=Saturday)
day_of_week()
# Hour of day (0-23)
hour()
# Predict linear trend
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
Recording Rules
Recording rules allow you to precompute frequently used expressions and save them as new time series.
# recording_rules.yml
groups:
- name: instance_rules
interval: 30s
rules:
- record: instance:cpu_usage:rate5m
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
labels:
job: node-exporter
- record: instance:memory_usage:percentage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
labels:
job: node-exporter
- record: instance:disk_usage:percentage
expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} / node_filesystem_size_bytes)) * 100
labels:
job: node-exporter
- name: application_rules
interval: 15s
rules:
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
- record: job:http_request_duration:p95
expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
- record: job:http_errors:rate5m
expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
Complex Query Examples
SLI/SLO Calculations
# Error rate (percentage of 5xx responses)
(
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
) * 100
# Availability (percentage of successful requests)
(
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
) * 100
# Latency SLI (percentage of requests under threshold)
(
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) /
sum(rate(http_request_duration_seconds_count[5m]))
) * 100
Resource Utilization Patterns
# Predict when disk will be full (4 hours from now)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
# Instance running out of memory (< 10% available)
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
# High load average (> number of CPUs)
node_load1 > count by (instance) (node_cpu_seconds_total{mode="idle"})
# Network saturation (approaching interface limit)
rate(node_network_transmit_bytes_total[5m]) >
node_network_speed_bytes * 0.8
Application Performance Analysis
# Request rate by endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
# Error rate by endpoint
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m])) /
sum by (endpoint) (rate(http_requests_total[5m]))
# 95th percentile latency by endpoint
histogram_quantile(0.95,
sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Slow endpoints (95th percentile > 1 second)
histogram_quantile(0.95,
sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
) > 1
Alerting Rules
# alert_rules.yml
groups:
- name: infrastructure_alerts
rules:
- alert: HighCPUUsage
expr: instance:cpu_usage:rate5m > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: instance:memory_usage:percentage > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: instance:disk_usage:percentage > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value }}% on {{ $labels.instance }}"
- name: application_alerts
rules:
- alert: HighErrorRate
expr: job:http_errors:rate5m / job:http_requests:rate5m > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
- alert: HighLatency
expr: job:http_request_duration:p95 > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency for {{ $labels.job }}"
description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "{{ $labels.instance }} has been down for more than 1 minute"
Chapter 4 Summary
PromQL is a powerful query language that enables complex analysis of time-series data. Key concepts include instant vs range vectors, label selectors, aggregation functions, and mathematical operations. Recording rules help optimize performance by precomputing common queries, while alerting rules define when notifications should be sent.
Hands-on Exercise
- Basic Queries:
- Write queries to find CPU usage for all instances
- Calculate memory usage percentage
- Find instances with high disk usage
- Advanced Analysis:
- Create queries for error rates and latency percentiles
- Write a query to predict disk space exhaustion
- Build SLI queries for your application
- Rules Configuration:
- Create recording rules for common calculations
- Write alerting rules for infrastructure monitoring
- Test rules using the Prometheus web UI
5. Alerting and Notifications
Alertmanager Architecture
Alertmanager handles alerts sent by Prometheus and other client applications. It provides grouping, inhibition, silencing, and routing to various notification channels.
graph TB
A[Prometheus] --> B[Alertmanager]
C[Other Sources] --> B
subgraph "Alertmanager"
D[Receiver] --> E[Grouping]
E --> F[Throttling]
F --> G[Inhibition]
G --> H[Silencing]
H --> I[Routing]
end
I --> J[Email]
I --> K[Slack]
I --> L[PagerDuty]
I --> M[Webhook]
Installing and Configuring Alertmanager
Docker Installation
# docker-compose.yml addition
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=http://localhost:9093'
restart: unless-stopped
volumes:
alertmanager_data:
Basic Alertmanager Configuration
# alertmanager.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@yourcompany.com'
smtp_auth_username: 'alerts@yourcompany.com'
smtp_auth_password: 'your-app-password'
route:
group_by: ['alertname', 'job']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
routes:
- matchers:
- severity=critical
receiver: 'critical-alerts'
continue: true
- matchers:
- severity=warning
receiver: 'warning-alerts'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://webhook-server:8080/webhook'
- name: 'critical-alerts'
email_configs:
- to: 'oncall@yourcompany.com'
subject: 'CRITICAL: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#critical-alerts'
title: 'Critical Alert'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
{{ end }}
- name: 'warning-alerts'
email_configs:
- to: 'team@yourcompany.com'
subject: 'WARNING: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#warnings'
title: 'Warning Alert'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
{{ end }}
inhibit_rules:
- source_matchers:
- severity=critical
target_matchers:
- severity=warning
equal: ['instance']
Writing Effective Alerts
Alert Quality Guidelines
- Actionable: Every alert should require human action
- Relevant: Alerts should indicate real problems
- Clear: Alert messages should be immediately understandable
- Timely: Alerts should fire before customers notice
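One way to enforce these guidelines mechanically is to lint rule files before deploying them. The sketch below is a minimal example, assuming PyYAML is installed and that your team requires summary, description, and runbook_url annotations plus a for: duration and severity label on every alert; adjust the required set to your own standards.

```python
# lint_alert_rules.py - sketch: check alert rules against the quality guidelines above.
# Assumes rule files follow the standard Prometheus rule format and PyYAML is installed.
import sys
import yaml

REQUIRED_ANNOTATIONS = {"summary", "description", "runbook_url"}

def lint_rule_file(path: str) -> list:
    problems = []
    with open(path) as f:
        doc = yaml.safe_load(f)
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules are not checked here
            missing = REQUIRED_ANNOTATIONS - set(rule.get("annotations", {}))
            if missing:
                problems.append(f"{name}: missing annotations {sorted(missing)}")
            if "for" not in rule:
                problems.append(f"{name}: no 'for' duration; may fire on transient spikes")
            if "severity" not in rule.get("labels", {}):
                problems.append(f"{name}: no severity label for routing")
    return problems

if __name__ == "__main__":
    issues = lint_rule_file(sys.argv[1] if len(sys.argv) > 1 else "infrastructure_alerts.yml")
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```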
Infrastructure Alerting Rules
# infrastructure_alerts.yml
groups:
- name: node_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute"
runbook_url: "https://runbooks.company.com/node-down"
- alert: HighCPUUsage
expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
runbook_url: "https://runbooks.company.com/high-cpu"
- alert: CriticalCPUUsage
expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 95
for: 2m
labels:
severity: critical
annotations:
summary: "Critical CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) * 100 > 90
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
- alert: DiskSpaceCritical
expr: ((node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes) * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Critical disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value | humanizePercentage }} on {{ $labels.instance }} {{ $labels.mountpoint }}"
- alert: DiskWillFillIn4Hours
expr: predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
for: 5m
labels:
severity: warning
annotations:
summary: "Disk will fill in 4 hours on {{ $labels.instance }}"
description: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} will fill in approximately 4 hours"
Application Alerting Rules
# application_alerts.yml
groups:
- name: application_alerts
rules:
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) /
sum(rate(http_requests_total[5m])) by (job)
) * 100 > 5
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency for {{ $labels.job }}"
description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
- alert: LowThroughput
expr: sum(rate(http_requests_total[5m])) by (job) < 10
for: 10m
labels:
severity: warning
annotations:
summary: "Low throughput for {{ $labels.job }}"
description: "Request rate is {{ $value }} req/s for {{ $labels.job }}"
- alert: DatabaseConnectionFailure
expr: increase(db_connections_failed_total[1m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Database connection failures for {{ $labels.job }}"
description: "{{ $value }} database connection failures in the last minute"
Grouping, Inhibition, and Silences
Grouping Configuration
# Group alerts by cluster and alertname
route:
group_by: ['cluster', 'alertname']
group_wait: 30s # Wait for more alerts before sending
group_interval: 5m # How often to send updates for a group
repeat_interval: 12h # How often to resend the same alert
Inhibition Rules
inhibit_rules:
# Don't send warning alerts if critical alerts are firing for the same instance
- source_matchers:
- severity=critical
target_matchers:
- severity=warning
equal: ['instance']
# Don't send individual service alerts if the whole node is down
- source_matchers:
- alertname=NodeDown
target_matchers:
- alertname=~"High.*|.*ServiceDown"
equal: ['instance']
# Don't send disk space warnings if disk is critically full
- source_matchers:
- alertname=DiskSpaceCritical
target_matchers:
- alertname=DiskWillFillIn4Hours
equal: ['instance', 'device']
Managing Silences
# Create a silence via API
curl -X POST http://localhost:9093/api/v1/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{
"name": "alertname",
"value": "HighCPUUsage"
},
{
"name": "instance",
"value": "server-01:9100"
}
],
"startsAt": "2023-08-21T12:00:00.000Z",
"endsAt": "2023-08-21T14:00:00.000Z",
"createdBy": "maintenance-team",
"comment": "Planned maintenance window"
}'
Integration Examples
Slack Integration
# Slack configuration with rich formatting
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .GroupLabels.alertname }}'
text: |
{{ if eq .Status "firing" }}
*Status:* Firing
*Alerts:* {{ len .Alerts }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ else }}
*Status:* Resolved
All alerts have been resolved.
{{ end }}
actions:
- type: button
text: 'View in Alertmanager'
url: '{{ template "__alertmanagerURL" . }}'
- type: button
text: 'Silence'
url: '{{ template "__alertmanagerURL" . }}/#/silences/new'
PagerDuty Integration
pagerduty_configs:
- routing_key: 'YOUR_INTEGRATION_KEY'
description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
details:
severity: '{{ range .Alerts }}{{ .Labels.severity }}{{ end }}'
instance: '{{ range .Alerts }}{{ .Labels.instance }}{{ end }}'
alertname: '{{ range .Alerts }}{{ .Labels.alertname }}{{ end }}'
links:
- href: '{{ range .Alerts }}{{ .Annotations.runbook_url }}{{ end }}'
text: 'Runbook'
Email Integration
email_configs:
- to: 'team@company.com'
from: 'alertmanager@company.com'
subject: '{{ .Status | toUpper }}: {{ .GroupLabels.alertname }} ({{ len .Alerts }} alerts)'
html: |
<!DOCTYPE html>
<html>
<head>
<style>
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #f2f2f2; }
.critical { background-color: #ffebee; }
.warning { background-color: #fff3e0; }
</style>
</head>
<body>
<h2>Alert {{ .Status | toUpper }}</h2>
<table>
<tr>
<th>Alert</th>
<th>Severity</th>
<th>Instance</th>
<th>Description</th>
</tr>
{{ range .Alerts }}
<tr class="{{ .Labels.severity }}">
<td>{{ .Labels.alertname }}</td>
<td>{{ .Labels.severity }}</td>
<td>{{ .Labels.instance }}</td>
<td>{{ .Annotations.description }}</td>
</tr>
{{ end }}
</table>
</body>
</html>
Custom Webhook Integration
# webhook_server.py
from flask import Flask, request, jsonify
import json
import requests
app = Flask(__name__)
@app.route('/webhook', methods=['POST'])
def webhook():
data = request.get_json()
# Process the alert
status = data.get('status')
alerts = data.get('alerts', [])
for alert in alerts:
labels = alert.get('labels', {})
annotations = alert.get('annotations', {})
# Custom logic based on alert
if labels.get('severity') == 'critical':
send_to_ops_team(alert)
elif 'database' in labels.get('alertname', '').lower():
send_to_dba_team(alert)
# Log to external system
log_alert_to_system(alert)
return jsonify({'status': 'received'})
def send_to_ops_team(alert):
# Send to ticketing system, chat platform, etc.
pass
def send_to_dba_team(alert):
# Send to database team's channel
pass
def log_alert_to_system(alert):
# Log to centralized logging system
pass
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
Testing Alerts
Manual Alert Testing
# Send test alert to Alertmanager
curl -X POST http://localhost:9093/api/v1/alerts \
-H "Content-Type: application/json" \
-d '[
{
"labels": {
"alertname": "TestAlert",
"instance": "test-instance",
"severity": "warning"
},
"annotations": {
"summary": "This is a test alert",
"description": "Testing alert routing and notifications"
},
"startsAt": "2023-08-21T12:00:00.000Z"
}
]'
Alert Testing Framework
# alert_tester.py
import requests
import time
from datetime import datetime, timezone
class AlertTester:
def __init__(self, alertmanager_url, prometheus_url):
self.alertmanager_url = alertmanager_url
self.prometheus_url = prometheus_url
def send_test_alert(self, alertname, labels, annotations):
"""Send a test alert to Alertmanager"""
alert = {
"labels": {
"alertname": alertname,
**labels
},
"annotations": annotations,
"startsAt": datetime.now(timezone.utc).isoformat()
}
response = requests.post(
f"{self.alertmanager_url}/api/v1/alerts",
json=[alert]
)
return response.status_code == 200
def check_alert_rule(self, rule_name):
"""Check if an alert rule is defined in Prometheus"""
response = requests.get(f"{self.prometheus_url}/api/v1/rules")
rules = response.json()
for group in rules['data']['groups']:
for rule in group['rules']:
if rule.get('name') == rule_name:
return True
return False
def test_critical_alert_routing(self):
"""Test that critical alerts go to the right channels"""
return self.send_test_alert(
"TestCriticalAlert",
{"severity": "critical", "instance": "test-server"},
{
"summary": "Test critical alert",
"description": "This should route to critical alerts channel"
}
)
# Usage
tester = AlertTester("http://localhost:9093", "http://localhost:9090")
tester.test_critical_alert_routing()
Chapter 5 Summary
Alertmanager provides sophisticated alert routing, grouping, and notification capabilities. Effective alerting requires clear rules, proper grouping, inhibition to reduce noise, and integration with appropriate notification channels. Testing alerts ensures they work as expected and reach the right people.
Hands-on Exercise
- Alertmanager Setup:
- Install and configure Alertmanager
- Set up basic routing to email or Slack
- Test with manual alerts
- Alert Rules:
- Create alerting rules for your infrastructure
- Set appropriate thresholds and timing
- Add helpful annotations and runbook links
- Advanced Features:
- Configure inhibition rules to reduce noise
- Set up silences for maintenance windows
- Test different notification channels
6. Visualization
Introduction to Grafana
Grafana is the de facto standard for visualizing Prometheus metrics. It provides powerful dashboarding capabilities, alerting integration, and supports multiple data sources beyond Prometheus.
graph TB
A[Prometheus] --> B[Grafana]
C[Users] --> B
B --> D[Dashboards]
B --> E[Alerts]
B --> F[Data Sources]
D --> G[Panels]
D --> H[Variables]
D --> I[Annotations]
G --> J[Time Series]
G --> K[Stats]
G --> L[Tables]
G --> M[Heatmaps]
Installing and Configuring Grafana
Docker Installation
# docker-compose.yml
version: '3.8'
services:
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
- GF_USERS_DEFAULT_THEME=dark
- GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/etc/grafana/provisioning/dashboards/overview.json
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
- ./grafana/dashboards:/var/lib/grafana/dashboards
restart: unless-stopped
volumes:
grafana_data:
Configuration as Code
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
jsonData:
httpMethod: POST
prometheusType: Prometheus
prometheusVersion: 2.40.0
cacheLevel: 'High'
disableMetricsLookup: false
customQueryParameters: ''
incrementalQuerying: false
disableRecordingRules: false
# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards
Dashboard Design Principles
Information Hierarchy
- Overview Level: High-level health and performance indicators
- Service Level: Detailed metrics for specific services
- Component Level: Deep-dive into individual components
- Debug Level: Raw metrics for troubleshooting
Dashboard Layout Best Practices
{
"dashboard": {
"title": "Service Overview",
"panels": [
{
"id": 1,
"title": "Key Metrics (Top Row)",
"type": "stat",
"gridPos": {"h": 6, "w": 24, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Trends (Middle Section)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 6}
},
{
"id": 3,
"title": "Distribution (Right Side)",
"type": "heatmap",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 6}
},
{
"id": 4,
"title": "Details (Bottom)",
"type": "table",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 14}
}
]
}
}
Essential Panel Types
Time Series Panels
{
"id": 1,
"title": "Request Rate",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{service}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "linear",
"barAlignment": 0,
"lineWidth": 2,
"fillOpacity": 10,
"gradientMode": "none",
"spanNulls": false,
"insertNulls": false,
"showPoints": "never",
"pointSize": 5,
"stacking": {
"mode": "none",
"group": "A"
},
"axisPlacement": "auto",
"axisLabel": "",
"scaleDistribution": {
"type": "linear"
},
"hideFrom": {
"legend": false,
"tooltip": false,
"vis": false
},
"thresholdsStyle": {
"mode": "off"
}
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "frontend"
},
"properties": [
{
"id": "color",
"value": {
"mode": "fixed",
"fixedColor": "green"
}
}
]
}
]
},
"options": {
"tooltip": {
"mode": "multi",
"sort": "desc"
},
"legend": {
"displayMode": "table",
"placement": "bottom",
"calcs": ["lastNotNull", "max", "mean"],
"values": true
}
}
}
Stat Panels for Key Metrics
{
"id": 2,
"title": "Service Availability",
"type": "stat",
"targets": [
{
"expr": "avg(up{job=~\".*-service\"})",
"refId": "A",
"format": "time_series",
"instant": true
}
],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"min": 0,
"max": 1,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "red",
"value": 0
},
{
"color": "yellow",
"value": 0.95
},
{
"color": "green",
"value": 0.99
}
]
},
"mappings": [],
"custom": {
"hideFrom": {
"legend": false,
"tooltip": false,
"vis": false
}
}
}
},
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"],
"fields": ""
},
"orientation": "auto",
"textMode": "auto",
"colorMode": "background",
"graphMode": "area",
"justifyMode": "auto"
},
"gridPos": {"h": 6, "w": 6, "x": 0, "y": 0}
}
Heatmap for Latency Distribution
{
"id": 3,
"title": "Response Time Distribution",
"type": "heatmap",
"targets": [
{
"expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
"format": "heatmap",
"legendFormat": "{{le}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"custom": {
"hideFrom": {
"legend": false,
"tooltip": false,
"vis": false
},
"scaleDistribution": {
"type": "linear"
}
}
}
},
"options": {
"calculate": false,
"cellGap": 2,
"cellValues": {
"unit": "short"
},
"color": {
"exponent": 0.5,
"fill": "dark-orange",
"mode": "spectrum",
"reverse": false,
"scale": "exponential",
"scheme": "Oranges",
"steps": 64
},
"exemplars": {
"color": "rgba(255,0,255,0.7)"
},
"filterValues": {
"le": 1e-9
},
"legend": {
"show": true
},
"rowsFrame": {
"layout": "auto"
},
"tooltip": {
"show": true,
"yHistogram": false
},
"yAxis": {
"axisPlacement": "left",
"reverse": false,
"unit": "s"
}
}
}
Table for Detailed Breakdown
{
"id": 4,
"title": "Service Status Details",
"type": "table",
"targets": [
{
"expr": "up{job=~\".*-service\"}",
"format": "table",
"instant": true,
"refId": "A"
},
{
"expr": "rate(http_requests_total[5m])",
"format": "table",
"instant": true,
"refId": "B"
},
{
"expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])",
"format": "table",
"instant": true,
"refId": "C"
}
],
"transformations": [
{
"id": "merge",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"__name__": true
},
"indexByName": {
"instance": 0,
"job": 1,
"Value #A": 2,
"Value #B": 3,
"Value #C": 4
},
"renameByName": {
"Value #A": "Status",
"Value #B": "Request Rate",
"Value #C": "Error Rate",
"instance": "Instance",
"job": "Service"
}
}
}
],
"fieldConfig": {
"defaults": {
"custom": {
"align": "auto",
"displayMode": "auto",
"inspect": false
},
"mappings": [
{
"options": {
"0": {
"color": "red",
"index": 0,
"text": "DOWN"
},
"1": {
"color": "green",
"index": 1,
"text": "UP"
}
},
"type": "value"
}
],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Error Rate"
},
"properties": [
{
"id": "unit",
"value": "percentunit"
},
{
"id": "custom.displayMode",
"value": "color-background"
},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.01
},
{
"color": "red",
"value": 0.05
}
]
}
}
]
}
]
}
}
Dashboard Templates and Variables
Template Variables
{
"templating": {
"list": [
{
"name": "environment",
"type": "query",
"query": "label_values(up, environment)",
"current": {
"selected": true,
"text": "production",
"value": "production"
},
"options": [],
"refresh": 1,
"regex": "",
"sort": 1,
"multi": false,
"includeAll": false,
"allValue": null
},
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total{environment=\"$environment\"}, service)",
"current": {
"selected": false,
"text": "All",
"value": "$__all"
},
"options": [],
"refresh": 1,
"regex": "",
"sort": 1,
"multi": true,
"includeAll": true,
"allValue": ".*"
},
{
"name": "instance",
"type": "query",
"query": "label_values(up{job=\"$service\"}, instance)",
"current": {
"selected": false,
"text": "All",
"value": "$__all"
},
"options": [],
"refresh": 2,
"regex": "",
"sort": 1,
"multi": true,
"includeAll": true,
"allValue": ".*"
},
{
"name": "interval",
"type": "interval",
"current": {
"selected": false,
"text": "5m",
"value": "5m"
},
"options": [
{
"selected": true,
"text": "1m",
"value": "1m"
},
{
"selected": false,
"text": "5m",
"value": "5m"
},
{
"selected": false,
"text": "15m",
"value": "15m"
},
{
"selected": false,
"text": "1h",
"value": "1h"
}
],
"query": "1m,5m,15m,1h,6h,12h,1d,7d,14d,30d",
"refresh": 2,
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
}
}
Using Variables in Queries
# Using service variable
sum(rate(http_requests_total{service=~"$service"}[$interval])) by (service)
# Using environment and instance variables
up{environment="$environment",instance=~"$instance"}
# Advanced variable usage: regex matchers combined with the interval variable
rate(http_requests_total{service=~"$service",instance=~"$instance"}[$interval])
Complete Dashboard Examples
Infrastructure Overview Dashboard
{
"dashboard": {
"id": null,
"title": "Infrastructure Overview",
"description": "High-level infrastructure health and performance metrics",
"tags": ["infrastructure", "overview"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(up{job=\"node-exporter\"}, instance)",
"refresh": 1,
"multi": true,
"includeAll": true,
"current": {
"value": "$__all",
"text": "All"
}
}
]
},
"panels": [
{
"id": 1,
"title": "Node Status",
"type": "stat",
"targets": [
{
"expr": "up{job=\"node-exporter\",instance=~\"$instance\"}",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"mappings": [
{
"options": {
"0": {"color": "red", "text": "DOWN"},
"1": {"color": "green", "text": "UP"}
},
"type": "value"
}
],
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "green", "value": 1}
]
}
}
},
"gridPos": {"h": 4, "w": 24, "x": 0, "y": 0}
},
{
"id": 2,
"title": "CPU Usage",
"type": "timeseries",
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"max": 100,
"min": 0,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4}
},
{
"id": 3,
"title": "Memory Usage",
"type": "timeseries",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"max": 100,
"min": 0,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4}
},
{
"id": 4,
"title": "Disk Usage",
"type": "timeseries",
"targets": [
{
"expr": "(1 - (node_filesystem_avail_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"})) * 100",
"legendFormat": "{{instance}}:{{mountpoint}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"max": 100,
"min": 0,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 12}
},
{
"id": 5,
"title": "Network I/O",
"type": "timeseries",
"targets": [
{
"expr": "rate(node_network_receive_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*\"}[5m])",
"legendFormat": "{{instance}}:{{device}} - Receive"
},
{
"expr": "rate(node_network_transmit_bytes_total{instance=~\"$instance\",device!~\"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*\"}[5m])",
"legendFormat": "{{instance}}:{{device}} - Transmit"
}
],
"fieldConfig": {
"defaults": {
"unit": "Bps"
}
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 12}
}
]
}
}
Application Performance Dashboard
{
"dashboard": {
"id": null,
"title": "Application Performance",
"description": "Application performance metrics and SLIs",
"tags": ["application", "performance", "sli"],
"timezone": "browser",
"refresh": "30s",
"templating": {
"list": [
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "environment",
"type": "query",
"query": "label_values(http_requests_total, environment)",
"refresh": 1
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service)",
"legendFormat": "{{service}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps"
}
},
"gridPos": {"h": 8, "w": 8, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Error Rate",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"[45]..\",service=~\"$service\",environment=\"$environment\"}[5m])) by (service) / sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service) * 100",
"legendFormat": "{{service}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"max": 100,
"min": 0,
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
}
}
},
"gridPos": {"h": 8, "w": 8, "x": 8, "y": 0}
},
{
"id": 3,
"title": "Response Time (95th percentile)",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (service, le))",
"legendFormat": "{{service}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 0.5},
{"color": "red", "value": 1}
]
}
}
},
"gridPos": {"h": 8, "w": 8, "x": 16, "y": 0}
},
{
"id": 4,
"title": "Response Time Heatmap",
"type": "heatmap",
"targets": [
{
"expr": "sum(rate(http_request_duration_seconds_bucket{service=~\"$service\",environment=\"$environment\"}[5m])) by (le)",
"format": "heatmap",
"legendFormat": "{{le}}"
}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
},
{
"id": 5,
"title": "Top Endpoints by Request Count",
"type": "table",
"targets": [
{
"expr": "topk(10, sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (endpoint))",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"__name__": true
},
"renameByName": {
"Value": "Requests/sec",
"endpoint": "Endpoint"
}
}
}
],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
}
]
}
}
Advanced Visualization Techniques
Custom Annotations
{
"annotations": {
"list": [
{
"name": "Deployments",
"datasource": "Prometheus",
"enable": true,
"expr": "increase(prometheus_config_last_reload_success_timestamp_seconds[1m]) > 0",
"iconColor": "green",
"titleFormat": "Config Reload",
"textFormat": "Prometheus configuration reloaded"
},
{
"name": "Alerts",
"datasource": "Prometheus",
"enable": true,
"expr": "ALERTS{alertstate=\"firing\"}",
"iconColor": "red",
"titleFormat": "{{alertname}}",
"textFormat": "{{summary}}"
}
]
}
}
Value Mappings and Overrides
{
"fieldConfig": {
"defaults": {
"mappings": [
{
"options": {
"0": {"text": "Healthy", "color": "green"},
"1": {"text": "Warning", "color": "yellow"},
"2": {"text": "Critical", "color": "red"}
},
"type": "value"
},
{
"options": {
"from": 0,
"to": 50,
"result": {"text": "Low", "color": "green"}
},
"type": "range"
}
]
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Critical Services"
},
"properties": [
{
"id": "color",
"value": {"mode": "fixed", "fixedColor": "red"}
},
{
"id": "custom.displayMode",
"value": "color-background"
}
]
}
]
}
}
Dynamic Thresholds
{
"targets": [
{
"expr": "avg(response_time_seconds)",
"refId": "A"
},
{
"expr": "avg(response_time_seconds) + 2 * stddev(response_time_seconds)",
"refId": "B",
"hide": true
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": "${B}"}
]
}
}
}
}
Dashboard Organization and Management
Folder Structure
Dashboards/
├── Overview/
│ ├── System Overview
│ ├── Application Overview
│ └── Business Metrics
├── Infrastructure/
│ ├── Node Metrics
│ ├── Network Performance
│ └── Storage Performance
├── Applications/
│ ├── Frontend Service
│ ├── Backend Services
│ └── Database Performance
├── Troubleshooting/
│ ├── Error Analysis
│ ├── Performance Deep Dive
│ └── Debug Dashboard
└── Business/
├── User Metrics
├── Revenue Tracking
    └── KPI Dashboard
Dashboard Tags and Search
{
"dashboard": {
"tags": [
"infrastructure",
"monitoring",
"production",
"team:platform",
"level:l1"
],
"title": "Production Infrastructure Overview",
"description": "L1 monitoring dashboard for production infrastructure"
}
}
Dashboard Links and Navigation
{
"links": [
{
"title": "System Overview",
"url": "/d/system-overview/system-overview",
"type": "dashboards",
"icon": "dashboard"
},
{
"title": "Runbook",
"url": "https://runbooks.company.com/infrastructure",
"type": "link",
"targetBlank": true,
"icon": "doc"
},
{
"title": "Alert Manager",
"url": "http://alertmanager:9093",
"type": "link",
"targetBlank": true,
"icon": "bell"
}
]
}
Performance Optimization for Dashboards
Query Optimization
# Inefficient - multiple queries
sum(rate(http_requests_total[5m])) by (service)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
sum(rate(http_requests_total{status=~"4.."}[5m])) by (service)
# Better - single query with grouping
sum(rate(http_requests_total[5m])) by (service, status)
Using Recording Rules for Heavy Queries
# recording_rules.yml
groups:
- name: dashboard_optimization
interval: 30s
rules:
- record: dashboard:request_rate:5m
expr: sum(rate(http_requests_total[5m])) by (service)
- record: dashboard:error_rate:5m
expr: |
sum(rate(http_requests_total{status=~"[45].."}[5m])) by (service) /
sum(rate(http_requests_total[5m])) by (service)
Dashboard Caching Configuration
# grafana.ini
[caching]
enabled = true
[database]
query_cache_enabled = true
query_cache_size = 100MB
query_cache_ttl = 300s
Alerting Integration
Alert Panel Configuration
{
"id": 6,
"title": "Active Alerts",
"type": "alertlist",
"options": {
"showOptions": "current",
"maxItems": 20,
"sortOrder": 1,
"dashboardAlerts": false,
"alertInstanceLabelFilter": "",
"dashboardTitle": "",
"folderId": null,
"tags": []
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 16}
}
Conditional Formatting Based on Alerts
{
"fieldConfig": {
"overrides": [
{
"matcher": {
"id": "byFrameRefID",
"options": "Alerts"
},
"properties": [
{
"id": "custom.displayMode",
"value": "color-background"
},
{
"id": "mappings",
"value": [
{
"options": {
"0": {"text": "OK", "color": "green"},
"1": {"text": "ALERT", "color": "red"}
},
"type": "value"
}
]
}
]
}
]
}
}
Export and Import Strategies
Dashboard Export Script
#!/bin/bash
# scripts/export-dashboards.sh
GRAFANA_URL="http://localhost:3000"
GRAFANA_USER="admin"
GRAFANA_PASS="admin123"
# Get all dashboards
curl -u $GRAFANA_USER:$GRAFANA_PASS \
"$GRAFANA_URL/api/search?type=dash-db" | \
jq -r '.[] | .uid' | \
while read uid; do
echo "Exporting dashboard: $uid"
curl -u $GRAFANA_USER:$GRAFANA_PASS \
"$GRAFANA_URL/api/dashboards/uid/$uid" | \
jq '.dashboard' > "dashboards/${uid}.json"
done
Dashboard Import with Provisioning
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: 'infrastructure'
orgId: 1
folder: 'Infrastructure'
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 10
options:
path: /etc/grafana/provisioning/dashboards/infrastructure
- name: 'applications'
orgId: 1
folder: 'Applications'
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 10
options:
path: /etc/grafana/provisioning/dashboards/applications
Chapter 6 Summary
Grafana provides powerful visualization capabilities for Prometheus metrics through various panel types, template variables, and advanced features. Effective dashboard design follows information hierarchy principles, uses appropriate panel types for different data, and optimizes queries for performance. Dashboard organization, alerting integration, and automation through provisioning enable scalable monitoring visualization.
Hands-on Exercise
- Dashboard Creation:
- Create an infrastructure overview dashboard
- Add template variables for dynamic filtering
- Implement different panel types (stat, timeseries, table, heatmap)
- Advanced Features:
- Set up annotations for deployments and alerts
- Configure custom thresholds and value mappings
- Create dashboard links and navigation
- Optimization and Management:
- Optimize queries using recording rules
- Organize dashboards with folders and tags
- Set up dashboard provisioning to automate import and export
7. Prometheus in Kubernetes
Service Discovery in Kubernetes
Kubernetes provides rich metadata that Prometheus can use for automatic service discovery, eliminating the need for manual target configuration.
graph TB
A[Kubernetes API] --> B[Prometheus]
B --> C[Pods]
B --> D[Services]
B --> E[Endpoints]
B --> F[Nodes]
C --> G[App Metrics]
D --> H[Service Metrics]
E --> I[Endpoint Metrics]
F --> J[Node Metrics]
Kubernetes SD Configuration
# prometheus.yml for Kubernetes
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# Scrape Kubernetes API server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Scrape Kubernetes nodes
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape pods with prometheus.io annotations
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use custom path if specified
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Use custom port if specified
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Add Kubernetes metadata as labels
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Scrape services with prometheus.io annotations
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
Using kube-state-metrics
kube-state-metrics generates metrics about Kubernetes object states, providing cluster-level visibility.
Installing kube-state-metrics
# kube-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app: kube-state-metrics
template:
metadata:
labels:
app: kube-state-metrics
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.6.0
ports:
- containerPort: 8080
name: http-metrics
- containerPort: 8081
name: telemetry
readinessProbe:
httpGet:
path: /
port: 8081
initialDelaySeconds: 5
timeoutSeconds: 5
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-state-metrics
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups: [""]
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs: ["list", "watch"]
- apiGroups: ["apps"]
resources:
- statefulsets
- daemonsets
- deployments
- replicasets
verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: kube-system
Key kube-state-metrics Metrics
# Pod status metrics
kube_pod_status_phase{phase="Running"}
kube_pod_status_ready{condition="true"}
kube_pod_container_status_restarts_total
# Deployment metrics
kube_deployment_status_replicas_available
kube_deployment_status_replicas_unavailable
# Node metrics
kube_node_status_condition{condition="Ready", status="true"}
kube_node_spec_unschedulable
# Resource requests and limits
kube_pod_container_resource_requests
kube_pod_container_resource_limits
# Namespace resource quotas
kube_resourcequota
Prometheus Operator and CRDs
The Prometheus Operator simplifies Prometheus deployment and management in Kubernetes through Custom Resource Definitions (CRDs).
Installing Prometheus Operator
# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-operator prometheus-community/kube-prometheus-stack
Custom Resource Examples
Prometheus CR
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: monitoring
spec:
serviceAccountName: prometheus
serviceMonitorSelector:
matchLabels:
team: frontend
ruleSelector:
matchLabels:
prometheus: kube-prometheus
role: alert-rules
resources:
requests:
memory: 400Mi
storage:
volumeClaimTemplate:
spec:
storageClassName: fast-ssd
resources:
requests:
storage: 50Gi
retention: 30d
retentionSize: 45GB
ServiceMonitor CR
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: my-app-monitor
namespace: monitoring
labels:
team: frontend
spec:
selector:
matchLabels:
app: my-app
endpoints:
- port: metrics
interval: 30s
path: /metrics
honorLabels: true
namespaceSelector:
matchNames:
- production
- staging
PrometheusRule CR
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: my-app-rules
namespace: monitoring
labels:
prometheus: kube-prometheus
role: alert-rules
spec:
groups:
- name: my-app.rules
rules:
- alert: MyAppHighErrorRate
expr: |
(
sum(rate(http_requests_total{job="my-app", status=~"5.."}[5m])) /
sum(rate(http_requests_total{job="my-app"}[5m]))
) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate in my-app"
description: "Error rate is {{ $value | humanizePercentage }}"YAMLBest Practices for Monitoring Kubernetes Workloads
Pod Annotations for Scraping
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: my-app
image: my-app:latest
ports:
- containerPort: 8080
name: metrics
Resource Monitoring Queries
# CPU usage by pod
sum by (pod) (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))
# Memory usage by pod
sum by (pod) (container_memory_working_set_bytes{container!="POD",container!=""})
# Pod restart rate
increase(kube_pod_container_status_restarts_total[1h])
# Pods not ready
kube_pod_status_ready{condition="false"}
# Node CPU usage
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100
# Node memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Persistent Volume usage
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100
Kubernetes Alerting Rules
# k8s-alerts.yml
groups:
- name: kubernetes-alerts
rules:
- alert: KubePodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Pod is crash looping"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting {{ $value | humanize }} times per 15 minutes"
- alert: KubePodNotReady
expr: kube_pod_status_ready{condition="false"} == 1
for: 15m
labels:
severity: warning
annotations:
summary: "Pod has been in not ready state for more than 15 minutes"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes"
- alert: KubeDeploymentGenerationMismatch
expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
for: 15m
labels:
severity: warning
annotations:
summary: "Deployment generation mismatch"
description: "Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match"
- alert: KubeNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 15m
labels:
severity: critical
annotations:
summary: "Node is not ready"
description: "Node {{ $labels.node }} has been unready for more than 15 minutes"
- alert: KubeDaemonSetRolloutStuck
expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100
for: 15m
labels:
severity: warning
annotations:
summary: "DaemonSet rollout is stuck"
description: "Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready"YAMLNetwork Policy Monitoring
# Example application with network policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: my-app-netpol
namespace: production
spec:
podSelector:
matchLabels:
app: my-app
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: production
ports:
- protocol: TCP
port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: production
ports:
- protocol: TCP
port: 5432
Chapter 7 Summary
Prometheus integrates seamlessly with Kubernetes through service discovery, automatically finding and monitoring pods, services, and nodes. kube-state-metrics provides cluster-level visibility, while the Prometheus Operator simplifies deployment through CRDs. Proper annotation strategies and resource monitoring ensure comprehensive Kubernetes observability.
Hands-on Exercise
- Service Discovery Setup:
- Configure Prometheus for Kubernetes service discovery
- Deploy applications with proper annotations
- Verify automatic target discovery
- kube-state-metrics:
- Install and configure kube-state-metrics
- Create queries for cluster health monitoring
- Build dashboards for Kubernetes resources
- Prometheus Operator:
- Deploy Prometheus using the operator
- Create ServiceMonitor and PrometheusRule resources
- Test the operator’s automated configuration management
8. Scaling and Performance
Federation and Hierarchical Prometheus Setups
Federation allows Prometheus servers to scrape selected time series from other Prometheus servers, enabling hierarchical monitoring architectures.
graph TB
A[Global Prometheus] --> B[Regional Prometheus US]
A --> C[Regional Prometheus EU]
A --> D[Regional Prometheus APAC]
B --> E[Cluster Prometheus US-1]
B --> F[Cluster Prometheus US-2]
C --> G[Cluster Prometheus EU-1]
C --> H[Cluster Prometheus EU-2]
D --> I[Cluster Prometheus APAC-1]
Federation Configuration
# Global Prometheus configuration
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~"prometheus|node-exporter"}'
- '{__name__=~"job:.*"}'
- '{__name__=~"instance:.*"}'
static_configs:
- targets:
- 'us-prometheus:9090'
- 'eu-prometheus:9090'
- 'apac-prometheus:9090'
# Aggregate high-level metrics
- job_name: 'federate-aggregates'
scrape_interval: 30s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{__name__=~"cluster:.*"}'
- '{__name__=~"region:.*"}'
static_configs:
- targets:
- 'us-prometheus:9090'
- 'eu-prometheus:9090'
- 'apac-prometheus:9090'
Recording Rules for Federation
# Regional Prometheus recording rules
groups:
- name: cluster_aggregates
interval: 30s
rules:
- record: cluster:cpu_usage:avg
expr: avg by (cluster) (instance:cpu_usage:rate5m)
- record: cluster:memory_usage:avg
expr: avg by (cluster) (instance:memory_usage:percentage)
- record: cluster:disk_usage:avg
expr: avg by (cluster) (instance:disk_usage:percentage)
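  # The cluster-level aggregates above reference instance-level recording rules
  # (instance:cpu_usage:rate5m and friends) that are not defined elsewhere in
  # this chapter. A minimal sketch of what they might look like, assuming the
  # cluster label is attached via external_labels on each regional Prometheus:
  - name: instance_aggregates
    interval: 30s
    rules:
      - record: instance:cpu_usage:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: instance:memory_usage:percentage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
      - record: instance:disk_usage:percentage
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"})) * 100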
- name: region_aggregates
interval: 60s
rules:
- record: region:request_rate:sum
expr: sum by (region) (cluster:request_rate:sum)
- record: region:error_rate:avg
expr: avg by (region) (cluster:error_rate:avg)
Remote Storage Integrations
Remote storage solutions provide long-term storage and horizontal scalability for Prometheus metrics.
Thanos Integration
Thanos provides unlimited retention and horizontal scaling for Prometheus.
# Prometheus with Thanos sidecar
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: prometheus
spec:
serviceName: prometheus
replicas: 1
template:
spec:
containers:
- name: prometheus
image: prom/prometheus:latest
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=2h'
- '--storage.tsdb.min-block-duration=2h'
- '--storage.tsdb.max-block-duration=2h'
- '--web.enable-lifecycle'
ports:
- containerPort: 9090
volumeMounts:
- name: prometheus-storage
mountPath: /prometheus
- name: thanos-sidecar
image: thanosio/thanos:latest
args:
- sidecar
- --tsdb.path=/prometheus
- --prometheus.url=http://localhost:9090
- --objstore.config-file=/etc/thanos/objstore.yml
ports:
- containerPort: 10901
- containerPort: 10902
volumeMounts:
- name: prometheus-storage
mountPath: /prometheus
- name: thanos-objstore-config
mountPath: /etc/thanos
volumes:
- name: thanos-objstore-config
secret:
secretName: thanos-objstore-config
# Thanos objstore configuration
# objstore.yml
type: S3
config:
bucket: "thanos-metrics"
endpoint: "s3.amazonaws.com"
access_key: "ACCESS_KEY"
secret_key: "SECRET_KEY"
insecure: false
# Thanos query deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: thanos-query
spec:
replicas: 2
template:
spec:
containers:
- name: thanos-query
image: thanosio/thanos:latest
args:
- query
- --store=prometheus-0.prometheus:10901
- --store=prometheus-1.prometheus:10901
- --store=thanos-store:10901
ports:
- containerPort: 10902
VictoriaMetrics Integration
VictoriaMetrics provides high-performance storage and querying.
# VictoriaMetrics deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: victoriametrics
spec:
replicas: 1
template:
spec:
containers:
- name: victoriametrics
image: victoriametrics/victoria-metrics:latest
args:
- '--storageDataPath=/victoria-metrics-data'
- '--retentionPeriod=12'
- '--httpListenAddr=:8428'
ports:
- containerPort: 8428
volumeMounts:
- name: storage
mountPath: /victoria-metrics-data
# Prometheus remote write configuration
remote_write:
- url: "http://victoriametrics:8428/api/v1/write"
queue_config:
max_samples_per_send: 10000
batch_send_deadline: 5s
max_shards: 20
Cortex Configuration
# Cortex configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cortex-config
data:
cortex.yml: |
server:
http_listen_port: 9009
grpc_listen_port: 9095
distributor:
ring:
kvstore:
store: consul
consul:
host: consul:8500
ingester:
lifecycler:
ring:
kvstore:
store: consul
consul:
host: consul:8500
replication_factor: 3
storage:
engine: blocks
blocks_storage:
backend: s3
s3:
endpoint: s3.amazonaws.com
bucket_name: cortex-blocks
access_key_id: ACCESS_KEY
secret_access_key: SECRET_KEY
Retention Policies and Storage Tuning
Prometheus Storage Configuration
# Prometheus with optimized storage settings
args:
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=15d'
- '--storage.tsdb.retention.size=50GB'
- '--storage.tsdb.wal-compression'
- '--storage.tsdb.min-block-duration=2h'
- '--storage.tsdb.max-block-duration=2h'
- '--web.enable-admin-api'
Storage Optimization Strategies
# Monitor Prometheus storage metrics
prometheus_tsdb_symbol_table_size_bytes
prometheus_tsdb_head_series
prometheus_tsdb_compaction_duration_seconds
prometheus_config_last_reload_successful
# Storage utilization
prometheus_tsdb_size_bytes{type="wal"}
prometheus_tsdb_size_bytes{type="head"}
prometheus_tsdb_size_bytes{type="blocks"}
# Query performance
prometheus_engine_query_duration_seconds
prometheus_engine_queries_concurrent_max
Handling High Cardinality Metrics
Cardinality Analysis
# Find high cardinality metrics
topk(10, count by (__name__)({__name__!=""}))
# Series count by job
count by (job) ({__name__!=""})
# Label cardinality analysis
count by (__name__) (group by (__name__, instance) ({__name__!=""}))
Cardinality Management Strategies
# Metric relabeling to reduce cardinality
metric_relabel_configs:
# Drop unnecessary labels
- source_labels: [__name__]
regex: 'http_request_duration_seconds_bucket'
target_label: __tmp_bucket_drop
replacement: 'true'
- source_labels: [__tmp_bucket_drop, le]
regex: 'true;(0.005|0.01|0.025|0.05|0.1|0.25|0.5|1|2.5|5|10|\+Inf)'
action: keep
- regex: '__tmp_bucket_drop'
action: labeldrop
# Limit user agent variations
- source_labels: [user_agent]
regex: '(.*Chrome.*|.*Firefox.*|.*Safari.*)'
target_label: user_agent_family
replacement: '${1}'
- source_labels: [user_agent]
regex: '.*'
target_label: user_agent_family
replacement: 'other'
- regex: 'user_agent'
action: labeldrop
Recording Rules for High Cardinality
# Aggregate high cardinality metrics
groups:
- name: cardinality_reduction
interval: 30s
rules:
# Aggregate by service instead of instance
- record: service:request_rate:sum
expr: sum by (service) (rate(http_requests_total[5m]))
# Aggregate errors by service and status class
- record: service:error_rate:sum
expr: |
sum by (service, status_class) (
rate(http_requests_total{status=~"[45].."}[5m])
)
labels:
status_class: "4xx_5xx"
# Remove detailed path information
- record: service:request_duration:p95
expr: |
histogram_quantile(0.95,
sum by (service, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
Performance Optimization
Query Optimization
# Inefficient query - scans all time series
{__name__=~"http_.*"}
# Better - specific metric with labels
http_requests_total{job="my-service"}
# Inefficient - regex on high cardinality label
http_requests_total{instance=~".*prod.*"}
# Better - exact match or limited regex
http_requests_total{environment="production"}
# Use recording rules for complex calculations
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Replace with:
http_request_duration:p95
Memory and CPU Tuning
# Prometheus resource optimization
resources:
requests:
memory: "4Gi"
cpu: "1000m"
limits:
memory: "8Gi"
cpu: "2000m"
# JVM tuning for Java exporters
env:
- name: JAVA_OPTS
value: "-Xmx1g -Xms1g -XX:+UseG1GC"YAMLMonitoring Prometheus Performance
# Prometheus performance dashboard queries
panels:
- title: "Ingestion Rate"
expr: "rate(prometheus_tsdb_samples_total[5m])"
- title: "Active Series"
expr: "prometheus_tsdb_head_series"
- title: "Query Duration"
expr: "histogram_quantile(0.99, rate(prometheus_engine_query_duration_seconds_bucket[5m]))"
- title: "Memory Usage"
expr: "process_resident_memory_bytes"
- title: "WAL Truncations"
expr: "rate(prometheus_tsdb_wal_truncations_total[5m])"
- title: "Compaction Duration"
expr: "rate(prometheus_tsdb_compaction_duration_seconds_sum[5m])"YAMLChapter 8 Summary
Scaling Prometheus involves federation for hierarchical setups, remote storage for long-term retention, and careful cardinality management. Performance optimization requires query tuning, resource allocation, and monitoring of Prometheus itself. Remote storage solutions like Thanos, VictoriaMetrics, and Cortex provide different approaches to horizontal scaling.
Hands-on Exercise
- Federation Setup:
- Create a hierarchical Prometheus setup with federation
- Configure recording rules for aggregation
- Test cross-instance querying
- Remote Storage:
- Implement remote write to VictoriaMetrics or Thanos
- Configure retention policies
- Compare query performance
- Performance Optimization:
- Analyze cardinality in your metrics
- Implement relabeling to reduce cardinality
- Create recording rules for expensive queries
9. Best Practices and Pitfalls
Designing Effective Metrics
The Four Golden Signals
Focus on these key metrics for any system:
- Latency: Time to process requests
- Traffic: Amount of demand on the system
- Errors: Rate of failed requests
- Saturation: Resource utilization
# Latency - 95th percentile response time
histogram_quantile(0.95, sum by (service) (rate(http_request_duration_seconds_bucket[5m])))
# Traffic - Request rate
sum by (service) (rate(http_requests_total[5m]))
# Errors - Error rate
sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) /
sum by (service) (rate(http_requests_total[5m]))
# Saturation - CPU utilization
avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))
USE Method for Resources
For every resource, monitor:
- Utilization: How busy the resource is
- Saturation: Extra work queued
- Errors: Error events
# CPU Utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU Saturation
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})
# Memory Utilization
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Memory Saturation
rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])
# Disk Utilization
rate(node_disk_io_time_seconds_total[5m]) * 100
# Disk Saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
# Network Utilization
rate(node_network_transmit_bytes_total[5m]) + rate(node_network_receive_bytes_total[5m])
# Network Errors
rate(node_network_transmit_errs_total[5m]) + rate(node_network_receive_errs_total[5m])
RED Method for Services
For every service, monitor:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Response time distribution
# Rate
sum by (service) (rate(http_requests_total[5m]))
# Errors
sum by (service) (rate(http_requests_total{status=~"[45].."}[5m]))
# Duration
histogram_quantile(0.50, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
Avoiding Cardinality Explosions
Common Cardinality Pitfalls
// BAD: User ID as label (unbounded cardinality)
requestsTotal := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
},
[]string{"method", "endpoint", "user_id"}, // user_id is unbounded!
)
// GOOD: Remove user_id or aggregate it
requestsTotal := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
},
[]string{"method", "endpoint", "user_type"}, // bounded categories
)
// BAD: Full URL path as label
errorCounter := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "errors_total",
},
[]string{"full_path"}, // /user/123/profile, /user/456/profile, etc.
)
// GOOD: Parameterized path
errorCounter := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "errors_total",
},
[]string{"path_template"}, // /user/:id/profile
)
Label Guidelines
# Good label practices
labels:
# Use bounded categorical values
environment: ["production", "staging", "development"]
region: ["us-east-1", "us-west-2", "eu-west-1"]
service: ["frontend", "backend", "database"]
# Avoid unbounded values
# ❌ user_id: "12345"
# ❌ session_id: "abc-def-123"
# ❌ full_url: "/api/users/12345/posts/67890"
# Use bounded alternatives
# ✅ user_type: "premium"
# ✅ endpoint: "/api/users/:id/posts/:id"
# ✅ status_class: "2xx"
Cardinality Monitoring
# Monitor series count by job
count by (job) ({__name__!=""})
# Find metrics with highest cardinality
topk(10, count by (__name__) ({__name__!=""}))
# Monitor label value counts
count by (__name__, status) (http_requests_total)
# Alert on high cardinality
count by (__name__) ({__name__!=""}) > 10000
Setting SLOs and SLIs with Prometheus
Defining SLIs (Service Level Indicators)
# Example SLI definitions
slis:
availability:
description: "Percentage of successful requests"
query: |
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
target: "> 99.9%"
latency:
description: "95th percentile response time"
query: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
target: "< 200ms"
error_rate:
description: "Rate of 5xx errors"
query: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
target: "< 0.1%"YAMLSLO Implementation
# SLO recording rules
groups:
- name: slo_rules
interval: 30s
rules:
# Error rate SLI
- record: sli:error_rate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# Availability SLI
- record: sli:availability
expr: |
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# Latency SLI
- record: sli:latency:p95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Fraction of the 30-day error budget remaining (99.9% SLO)
- record: slo:error_budget:30d
  expr: 1 - (avg_over_time(sli:error_rate[30d]) / (1 - 0.999))
SLO Alerting
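The burn-rate factors in the alerts below follow the common multi-window approach: with a 99.9% SLO the error budget is 0.1% of requests over 30 days, a burn rate of 14.4 would exhaust that budget in roughly two days (30 / 14.4 ≈ 2.1), and a burn rate of 6 would exhaust it in five days, which is why the fast-burn alert pages as critical while the slow-burn alert only warns.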
# SLO alerting rules
groups:
- name: slo_alerts
rules:
# Fast burn rate (1 hour)
- alert: SLOErrorBudgetBurnRateFast
expr: |
  avg_over_time(sli:error_rate[5m]) > (14.4 * (1 - 0.999))
  and
  avg_over_time(sli:error_rate[1h]) > (14.4 * (1 - 0.999))
for: 2m
labels:
severity: critical
annotations:
summary: "Fast SLO burn rate detected"
description: "Error rate is consuming error budget 14.4x faster than sustainable"
# Slow burn rate (6 hours)
- alert: SLOErrorBudgetBurnRateSlow
expr: |
  avg_over_time(sli:error_rate[30m]) > (6 * (1 - 0.999))
  and
  avg_over_time(sli:error_rate[6h]) > (6 * (1 - 0.999))
for: 15m
labels:
severity: warning
annotations:
summary: "Slow SLO burn rate detected"
description: "Error rate is consuming error budget 6x faster than sustainable"YAMLCase Studies from Real-World Systems
Case Study 1: E-commerce Platform
Challenge: Monitor checkout flow reliability.
Solution: Multi-step funnel monitoring.
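The queries below read from a counter named checkout_funnel_step_total with a bounded step label; that is not a standard library metric, so the checkout service has to emit it itself. A minimal Go sketch of what that instrumentation might look like (handler paths and step names are illustrative):
// checkout_metrics.go - illustrative funnel instrumentation for the checkout flow
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// checkoutFunnelSteps counts how many requests reach each step of the checkout
// flow; "step" is a small, bounded label set, so cardinality stays low.
var checkoutFunnelSteps = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "checkout_funnel_step_total",
		Help: "Checkout funnel events by step",
	},
	[]string{"step"},
)

// step wraps a handler and records one funnel event per request.
func step(name string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		checkoutFunnelSteps.WithLabelValues(name).Inc()
		h(w, r)
	}
}

func ok(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) }

func main() {
	http.HandleFunc("/cart", step("cart_view", ok))
	http.HandleFunc("/checkout", step("checkout_start", ok))
	http.HandleFunc("/payment", step("payment_submit", ok))
	http.HandleFunc("/confirm", step("order_complete", ok))
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
With the counter exposed, the funnel and conversion-rate queries read directly off it: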
# Checkout funnel metrics
checkout_funnel_step_total{step="cart_view"}
checkout_funnel_step_total{step="checkout_start"}
checkout_funnel_step_total{step="payment_submit"}
checkout_funnel_step_total{step="order_complete"}
# Conversion rates
rate(checkout_funnel_step_total{step="checkout_start"}[5m]) /
rate(checkout_funnel_step_total{step="cart_view"}[5m])
# Payment failure rate
rate(checkout_funnel_step_total{step="payment_failed"}[5m]) /
rate(checkout_funnel_step_total{step="payment_submit"}[5m])
# Revenue impact
sum(rate(order_value_total[5m])) * 3600
Case Study 2: Microservices Architecture
Challenge: Correlate distributed tracing with metrics.
Solution: Service dependency monitoring.
# Service dependency health
up{job=~".*service.*"}
# Cross-service error propagation
sum by (source_service, target_service) (
rate(http_requests_total{status=~"5.."}[5m])
)
# Service response time correlation
histogram_quantile(0.95,
sum by (service, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
Case Study 3: Infrastructure Cost Optimization
Challenge: Monitor resource efficiency.
Solution: Cost-aware metrics.
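Note that node_cpu_cost_per_hour and node_memory_cost_per_gb are not node_exporter metrics; they have to be published separately. One option, sketched below, is a tiny exporter that exposes per-instance-type prices as gauges (values and type names are purely illustrative and would normally come from a billing API or price sheet rather than being hard-coded):
// cost_exporter.go - illustrative exporter for per-instance-type cost gauges
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	cpuCostPerHour = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "node_cpu_cost_per_hour",
		Help: "Hourly CPU cost by instance type (illustrative values)",
	}, []string{"instance_type"})

	memCostPerGB = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "node_memory_cost_per_gb",
		Help: "Memory cost per GB by instance type (illustrative values)",
	}, []string{"instance_type"})
)

func main() {
	// Hard-coded example prices; a real exporter would refresh these periodically.
	cpuCostPerHour.WithLabelValues("m5.large").Set(0.096)
	memCostPerGB.WithLabelValues("m5.large").Set(0.012)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9105", nil)
}
With those gauges scraped alongside node_exporter, the queries below can relate utilization to cost: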
# CPU cost efficiency
sum by (instance_type) (node_cpu_seconds_total) /
sum by (instance_type) (node_cpu_cost_per_hour)
# Memory utilization by cost
avg by (instance_type) (
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes
) * sum by (instance_type) (node_memory_cost_per_gb)
# Idle resource identification
avg_over_time(
(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))[7d:]
) < 0.1
Metrics Naming Conventions
Prometheus Naming Best Practices
# Good metric names
http_requests_total # Counter with _total suffix
http_request_duration_seconds # Histogram with base unit
memory_usage_bytes # Gauge with base unit
process_cpu_usage_ratio # Ratio as _ratio suffix
# Bad metric names
HttpRequestsCount # Should be snake_case
request_time_ms # Should use base unit (seconds)
cpu_percentage # Should be cpu_usage_ratio
errors                            # Not descriptive enough
Label Naming Conventions
# Good labels
method: ["GET", "POST", "PUT", "DELETE"]
status: ["200", "404", "500"]
environment: ["production", "staging"]
region: ["us-east-1", "eu-west-1"]
# Bad labels
Method: "GET" # Should be lowercase
http_status_code: "200" # Redundant prefix
env: "prod" # Use full names
datacenter: "dc1" # Be specific about locationINITesting and Validation
Metrics Testing Framework
# metrics_test.py
import requests
import time
import pytest
class MetricsTestFramework:
def __init__(self, prometheus_url, app_url):
self.prometheus_url = prometheus_url
self.app_url = app_url
def query_metric(self, query):
"""Query Prometheus and return result"""
response = requests.get(
f"{self.prometheus_url}/api/v1/query",
params={"query": query}
)
return response.json()
def generate_load(self, endpoint, count=10):
"""Generate load on application endpoint"""
for _ in range(count):
requests.get(f"{self.app_url}{endpoint}")
time.sleep(0.1)
    def metric_value(self, query):
        """Query Prometheus and return the first sample value as a float."""
        result = self.query_metric(query)['data']['result']
        return float(result[0]['value'][1]) if result else 0.0

    def test_counter_increment(self):
        """Test that counters increment properly"""
        # Get initial value (query_metric returns a JSON document, so extract the sample value)
        initial = self.metric_value("sum(http_requests_total)")
        # Generate load
        self.generate_load("/test", 5)
        # Wait for scrape
        time.sleep(20)
        # Check increment
        final = self.metric_value("sum(http_requests_total)")
        assert final > initial
def test_histogram_buckets(self):
"""Test histogram bucket distribution"""
self.generate_load("/slow", 10)
time.sleep(20)
# Check bucket distribution
result = self.query_metric(
'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[1m]))'
)
assert float(result['data']['result'][0]['value'][1]) > 0
# Usage
framework = MetricsTestFramework(
"http://localhost:9090",
"http://localhost:8080"
)
framework.test_counter_increment()
Documentation and Runbooks
Metrics Documentation Template
# Metric: http_requests_total
## Description
Counter of HTTP requests processed by the application.
## Type
Counter
## Labels
- `method`: HTTP method (GET, POST, PUT, DELETE)
- `endpoint`: API endpoint template (e.g., /api/users/:id)
- `status`: HTTP status code
- `service`: Service name
## Usage Examples
```promql
# Request rate
rate(http_requests_total[5m])
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
```
Alerts
- HighErrorRate: Fires when error rate > 5%
- LowRequestRate: Fires when request rate < 1 req/s
Dashboard Panels
- Request Rate Over Time
- Error Rate by Endpoint
- Request Distribution by Method
Chapter 9 Summary
Effective Prometheus monitoring requires following established patterns like the Four Golden Signals, USE, and RED methods. Avoid cardinality explosions through careful label design, implement meaningful SLOs with proper error budget tracking, and establish clear naming conventions. Testing, documentation, and real-world case studies help ensure monitoring provides actionable insights.
Hands-on Exercise
- Metrics Review:
- Audit your existing metrics for cardinality issues
- Apply the Four Golden Signals to your services
- Implement USE method for infrastructure resources
- SLO Implementation:
- Define SLIs for a critical service
- Create SLO recording and alerting rules
- Set up error budget tracking
- Best Practices Assessment:
- Review metric naming conventions
- Create documentation for key metrics
- Implement automated metrics testing
10. Advanced Topics
Exemplars and Tracing Correlation
Exemplars link metrics to traces, providing context for high-level aggregations by pointing to specific trace samples.
graph LR
A[HTTP Request] --> B[Metrics]
A --> C[Traces]
B --> D[Exemplar]
D --> C
C --> E[Span Details]
Enabling Exemplars in Prometheus
# prometheus.yml
# Exemplar storage must also be enabled with the --enable-feature=exemplar-storage flag.
global:
  scrape_interval: 15s
storage:
  exemplars:
    max_exemplars: 100000
scrape_configs:
- job_name: 'my-app'
scrape_interval: 10s
static_configs:
- targets: ['app:8080']
Instrumenting Applications with Exemplars
// Go application with exemplars
package main
import (
	"fmt"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)
var (
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
)
func init() {
prometheus.MustRegister(requestDuration)
}
func instrumentedHandler(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Start OpenTelemetry span
_, span := otel.Tracer("my-app").Start(r.Context(), "http_request")
defer span.End()
// Simulate work
time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
// Record metric with exemplar
duration := time.Since(start).Seconds()
exemplar := prometheus.Labels{
"trace_id": span.SpanContext().TraceID().String(),
"span_id": span.SpanContext().SpanID().String(),
}
// ObserveWithExemplar is exposed through the prometheus.ExemplarObserver interface.
requestDuration.WithLabelValues(r.Method, r.URL.Path).(prometheus.ExemplarObserver).
	ObserveWithExemplar(duration, exemplar)
span.SetAttributes(
attribute.String("http.method", r.Method),
attribute.String("http.url", r.URL.Path),
attribute.Float64("http.duration", duration),
)
w.WriteHeader(http.StatusOK)
fmt.Fprintf(w, "Request processed in %.2f seconds", duration)
}
func main() {
http.HandleFunc("/api", instrumentedHandler)
// Exemplars are only exposed via the OpenMetrics exposition format.
http.Handle("/metrics", promhttp.HandlerFor(prometheus.DefaultGatherer, promhttp.HandlerOpts{EnableOpenMetrics: true}))
http.ListenAndServe(":8080", nil)
}
Querying Exemplars
# Query histogram with exemplars
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# API endpoint for exemplars
GET /api/v1/query_exemplars?query=http_request_duration_seconds_bucket&start=<timestamp>&end=<timestamp>
Multi-cluster Monitoring
Centralized Multi-cluster Architecture
graph TB
A[Global Prometheus] --> B[Cluster A Prometheus]
A --> C[Cluster B Prometheus]
A --> D[Cluster C Prometheus]
B --> E[Workloads A]
C --> F[Workloads B]
D --> G[Workloads C]
A --> H[Global Grafana]
A --> I[Global Alertmanager]
Cross-cluster Service Discovery
# Global Prometheus configuration
global:
external_labels:
cluster: 'management'
region: 'global'
scrape_configs:
# Federate from regional clusters
- job_name: 'federate-clusters'
scrape_interval: 30s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{__name__=~"cluster:.*"}'
- '{__name__=~"node_.*"}'
- '{__name__=~"container_.*"}'
static_configs:
- targets:
- 'cluster-a-prometheus:9090'
labels:
cluster: 'cluster-a'
region: 'us-east-1'
- targets:
- 'cluster-b-prometheus:9090'
labels:
cluster: 'cluster-b'
region: 'us-west-2'
- targets:
- 'cluster-c-prometheus:9090'
labels:
cluster: 'cluster-c'
region: 'eu-west-1'
# Cross-cluster service monitoring
- job_name: 'cross-cluster-services'
kubernetes_sd_configs:
- role: endpoints
api_server: 'https://cluster-a.k8s.local'
tls_config:
ca_file: /etc/ssl/cluster-a-ca.crt
cert_file: /etc/ssl/cluster-a.crt
key_file: /etc/ssl/cluster-a.key
- role: endpoints
api_server: 'https://cluster-b.k8s.local'
tls_config:
ca_file: /etc/ssl/cluster-b-ca.crt
cert_file: /etc/ssl/cluster-b.crt
key_file: /etc/ssl/cluster-b.key
Multi-cluster Recording Rules
# Global recording rules
groups:
- name: cross_cluster_aggregates
interval: 60s
rules:
- record: global:request_rate:sum
expr: sum by (service) (cluster:request_rate:sum)
- record: global:error_rate:avg
expr: avg by (service) (cluster:error_rate:avg)
- record: global:latency:p95
expr: |
histogram_quantile(0.95,
sum by (service, le) (cluster:latency:histogram)
)
- record: region:capacity:available
expr: |
sum by (region) (
cluster:node_capacity:cpu - cluster:node_usage:cpu
)
Integrating with Logging and Tracing
Correlation with ELK Stack
# Logstash configuration for metrics correlation
input {
beats {
port => 5044
}
}
filter {
if [fields][service] {
# Add Prometheus job label
mutate {
add_field => { "prometheus_job" => "%{[fields][service]}" }
}
# Extract trace ID if present
if [message] =~ /trace_id=/ {
grok {
match => { "message" => "trace_id=(?<trace_id>[a-f0-9]+)" }
}
}
# Add links to metrics
mutate {
add_field => {
"metrics_link" => "http://grafana.local/d/app-dashboard?var-service=%{[fields][service]}&from=now-5m&to=now"
}
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{+YYYY.MM.dd}"
}
}
Jaeger Integration
# Jaeger query service with Prometheus metrics
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-query
spec:
template:
spec:
containers:
- name: jaeger-query
image: jaegertracing/jaeger-query:latest
env:
- name: SPAN_STORAGE_TYPE
value: elasticsearch
- name: ES_SERVER_URLS
value: http://elasticsearch:9200
- name: METRICS_BACKEND
value: prometheus
- name: PROMETHEUS_SERVER_URL
value: http://prometheus:9090
ports:
- containerPort: 16686
- containerPort: 16687
OpenTelemetry Collector Configuration
# otelcol-config.yml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
prometheus:
config:
scrape_configs:
- job_name: 'otel-collector'
static_configs:
- targets: ['localhost:8888']
processors:
batch:
timeout: 1s
send_batch_size: 1024
attributes:
actions:
- key: cluster
value: production
action: insert
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: "0.0.0.0:8889"
namespace: "otel"
prometheusremotewrite:
endpoint: "http://prometheus:9090/api/v1/write"
service:
pipelines:
traces:
receivers: [otlp]
processors: [attributes, batch]
exporters: [jaeger]
metrics:
receivers: [otlp, prometheus]
processors: [attributes, batch]
exporters: [prometheus, prometheusremotewrite]
Security and RBAC in Prometheus Setups
Prometheus Security Configuration
# Prometheus with TLS and authentication
apiVersion: v1
kind: Secret
metadata:
name: prometheus-certs
type: Opaque
data:
tls.crt: <base64-encoded-cert>
tls.key: <base64-encoded-key>
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
spec:
template:
spec:
containers:
- name: prometheus
image: prom/prometheus:latest
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--web.config.file=/etc/prometheus/web.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.listen-address=0.0.0.0:9090'
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: certs
mountPath: /etc/ssl/prometheus
readOnly: true
# web.yml - Prometheus web configuration
tls_server_config:
cert_file: /etc/ssl/prometheus/tls.crt
key_file: /etc/ssl/prometheus/tls.key
basic_auth_users:
admin: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
readonly: $2b$12$6tgWf5DZ9z7LZtD.ZrAb/.VjBfI3WnJg3ULf.TgLBtO4vKAzp7KuG
RBAC Configuration for Kubernetes
# ServiceAccount for Prometheus
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
# ClusterRole with minimal permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "apps"]
resources:
- ingresses
- deployments
- daemonsets
- statefulsets
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
# ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
OAuth2 Proxy Integration
# OAuth2 Proxy for Prometheus
apiVersion: apps/v1
kind: Deployment
metadata:
name: oauth2-proxy
spec:
template:
spec:
containers:
- name: oauth2-proxy
image: quay.io/oauth2-proxy/oauth2-proxy:latest
args:
- --provider=github
- --email-domain=yourcompany.com
- --upstream=http://prometheus:9090
- --http-address=0.0.0.0:4180
- --client-id=$(OAUTH2_PROXY_CLIENT_ID)
- --client-secret=$(OAUTH2_PROXY_CLIENT_SECRET)
- --cookie-secret=$(OAUTH2_PROXY_COOKIE_SECRET)
env:
- name: OAUTH2_PROXY_CLIENT_ID
valueFrom:
secretKeyRef:
name: oauth2-proxy-secrets
key: client-id
- name: OAUTH2_PROXY_CLIENT_SECRET
valueFrom:
secretKeyRef:
name: oauth2-proxy-secrets
key: client-secret
- name: OAUTH2_PROXY_COOKIE_SECRET
valueFrom:
secretKeyRef:
name: oauth2-proxy-secrets
key: cookie-secret
Chapter 10 Summary
Advanced Prometheus topics include exemplars for linking metrics to traces, multi-cluster monitoring architectures, integration with logging and tracing systems, and comprehensive security configurations. These features enable enterprise-scale observability with proper access controls and correlation across different observability signals.
Hands-on Exercise
- Exemplars Implementation:
- Enable exemplars in Prometheus
- Instrument an application with trace correlation
- View exemplars in Grafana dashboards
- Multi-cluster Setup:
- Configure federation between Prometheus instances
- Implement cross-cluster monitoring
- Test global query capabilities
- Security Hardening:
- Implement TLS and authentication
- Configure RBAC for Kubernetes
- Set up OAuth2 proxy for access control
11. Capstone Project
Project Overview
Build a complete observability stack for a sample e-commerce application with microservices architecture, including metrics collection, alerting, visualization, and incident response workflows.
Architecture Overview
graph TB
subgraph "Application Layer"
A[Frontend Service] --> B[User Service]
A --> C[Product Service]
A --> D[Order Service]
D --> E[Payment Service]
D --> F[Inventory Service]
B --> G[User Database]
C --> H[Product Database]
D --> I[Order Database]
end
subgraph "Observability Layer"
J[Prometheus] --> K[Alertmanager]
J --> L[Grafana]
M[Node Exporter] --> J
N[Application Metrics] --> J
O[Blackbox Exporter] --> J
K --> P[Slack/Email]
L --> Q[Dashboards]
end
A --> N
B --> N
C --> N
D --> N
E --> N
F --> N
Step 1: Infrastructure Setup
Docker Compose Environment
# docker-compose.yml
version: '3.8'
networks:
monitoring:
driver: bridge
app:
driver: bridge
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
services:
# Prometheus
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus:/etc/prometheus
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
networks:
- monitoring
- app
restart: unless-stopped
# Alertmanager
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager:/etc/alertmanager
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
networks:
- monitoring
restart: unless-stopped
# Grafana
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
networks:
- monitoring
restart: unless-stopped
# Node Exporter
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
networks:
- monitoring
restart: unless-stopped
# Blackbox Exporter
blackbox-exporter:
image: prom/blackbox-exporter:latest
container_name: blackbox-exporter
ports:
- "9115:9115"
volumes:
- ./blackbox:/etc/blackbox_exporter
networks:
- monitoring
restart: unless-stopped
# Application Services
frontend:
build: ./apps/frontend
container_name: frontend
ports:
- "8080:8080"
environment:
- USER_SERVICE_URL=http://user-service:8081
- PRODUCT_SERVICE_URL=http://product-service:8082
- ORDER_SERVICE_URL=http://order-service:8083
networks:
- app
restart: unless-stopped
user-service:
build: ./apps/user-service
container_name: user-service
ports:
- "8081:8081"
environment:
- DATABASE_URL=postgresql://user:password@user-db:5432/users
networks:
- app
restart: unless-stopped
product-service:
build: ./apps/product-service
container_name: product-service
ports:
- "8082:8082"
environment:
- DATABASE_URL=postgresql://product:password@product-db:5432/products
networks:
- app
restart: unless-stopped
order-service:
build: ./apps/order-service
container_name: order-service
ports:
- "8083:8083"
environment:
- DATABASE_URL=postgresql://order:password@order-db:5432/orders
- PAYMENT_SERVICE_URL=http://payment-service:8084
- INVENTORY_SERVICE_URL=http://inventory-service:8085
networks:
- app
restart: unless-stopped
payment-service:
build: ./apps/payment-service
container_name: payment-service
ports:
- "8084:8084"
networks:
- app
restart: unless-stopped
inventory-service:
build: ./apps/inventory-service
container_name: inventory-service
ports:
- "8085:8085"
networks:
- app
restart: unless-stopped
# Databases
user-db:
image: postgres:13
container_name: user-db
environment:
- POSTGRES_DB=users
- POSTGRES_USER=user
- POSTGRES_PASSWORD=password
volumes:
- ./data/user-db:/var/lib/postgresql/data
networks:
- app
product-db:
image: postgres:13
container_name: product-db
environment:
- POSTGRES_DB=products
- POSTGRES_USER=product
- POSTGRES_PASSWORD=password
volumes:
- ./data/product-db:/var/lib/postgresql/data
networks:
- app
order-db:
image: postgres:13
container_name: order-db
environment:
- POSTGRES_DB=orders
- POSTGRES_USER=order
- POSTGRES_PASSWORD=password
volumes:
- ./data/order-db:/var/lib/postgresql/data
networks:
- app
Step 2: Application Instrumentation
Frontend Service (Go)
// apps/frontend/main.go
package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"service", "method", "endpoint", "status"},
)
httpRequestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"service", "method", "endpoint"},
)
upstreamRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "upstream_requests_total",
Help: "Total upstream requests",
},
[]string{"service", "target_service", "status"},
)
businessMetrics = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "business_events_total",
Help: "Business events counter",
},
[]string{"service", "event_type"},
)
)
func init() {
prometheus.MustRegister(httpRequestsTotal)
prometheus.MustRegister(httpRequestDuration)
prometheus.MustRegister(upstreamRequestsTotal)
prometheus.MustRegister(businessMetrics)
}
func instrumentHandler(service, endpoint string, handler http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Wrap ResponseWriter to capture status code
ww := &responseWriter{ResponseWriter: w, statusCode: 200}
handler(ww, r)
duration := time.Since(start).Seconds()
status := fmt.Sprintf("%d", ww.statusCode)
httpRequestsTotal.WithLabelValues(service, r.Method, endpoint, status).Inc()
httpRequestDuration.WithLabelValues(service, r.Method, endpoint).Observe(duration)
}
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
func homeHandler(w http.ResponseWriter, r *http.Request) {
businessMetrics.WithLabelValues("frontend", "page_view").Inc()
response := map[string]string{
"service": "frontend",
"status": "healthy",
"version": "1.0.0",
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(response)
}
func usersHandler(w http.ResponseWriter, r *http.Request) {
userServiceURL := os.Getenv("USER_SERVICE_URL")
if userServiceURL == "" {
userServiceURL = "http://localhost:8081"
}
	resp, err := http.Get(userServiceURL + "/users")
status := "500"
if err == nil {
status = fmt.Sprintf("%d", resp.StatusCode)
defer resp.Body.Close()
}
upstreamRequestsTotal.WithLabelValues("frontend", "user-service", status).Inc()
if err != nil {
http.Error(w, "User service unavailable", http.StatusServiceUnavailable)
return
}
businessMetrics.WithLabelValues("frontend", "user_list_view").Inc()
w.Header().Set("Content-Type", "application/json")
w.Write([]byte(`{"users": []}`))
}
func main() {
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/", instrumentHandler("frontend", "/", homeHandler))
http.HandleFunc("/users", instrumentHandler("frontend", "/users", usersHandler))
http.HandleFunc("/health", instrumentHandler("frontend", "/health", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
}))
log.Println("Frontend service starting on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}
User Service (Python)
# apps/user-service/app.py
from flask import Flask, jsonify, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time
import psycopg2
import os
app = Flask(__name__)
# Prometheus metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['service', 'method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['service', 'method', 'endpoint'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
DATABASE_CONNECTIONS = Gauge(
'database_connections_active',
'Active database connections',
['service', 'database']
)
BUSINESS_EVENTS = Counter(
'business_events_total',
'Business events',
['service', 'event_type']
)
def instrument_request(f):
def wrapper(*args, **kwargs):
start_time = time.time()
status = '200'
try:
result = f(*args, **kwargs)
return result
except Exception as e:
status = '500'
raise
finally:
REQUEST_COUNT.labels(
service='user-service',
method=request.method,
endpoint=request.endpoint or 'unknown',
status=status
).inc()
REQUEST_DURATION.labels(
service='user-service',
method=request.method,
endpoint=request.endpoint or 'unknown'
).observe(time.time() - start_time)
wrapper.__name__ = f.__name__
return wrapper
@app.route('/')
@instrument_request
def home():
return jsonify({
'service': 'user-service',
'status': 'healthy',
'version': '1.0.0'
})
@app.route('/users')
@instrument_request
def get_users():
BUSINESS_EVENTS.labels(service='user-service', event_type='user_list_request').inc()
# Simulate database query
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
time.sleep(0.01) # Simulate query time
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
return jsonify({
'users': [
{'id': 1, 'name': 'John Doe', 'email': 'john@example.com'},
{'id': 2, 'name': 'Jane Smith', 'email': 'jane@example.com'}
]
})
@app.route('/users/<int:user_id>')
@instrument_request
def get_user(user_id):
BUSINESS_EVENTS.labels(service='user-service', event_type='user_detail_request').inc()
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
time.sleep(0.005)
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
return jsonify({
'id': user_id,
'name': f'User {user_id}',
'email': f'user{user_id}@example.com'
})
@app.route('/health')
@instrument_request
def health():
return jsonify({'status': 'healthy'})
@app.route('/metrics')
def metrics():
return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8081)
```

Step 3: Prometheus Configuration

```yaml
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'ecommerce'
environment: 'production'
rule_files:
- "alert_rules.yml"
- "recording_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
scrape_interval: 30s
# Application services
- job_name: 'frontend'
static_configs:
- targets: ['frontend:8080']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'user-service'
static_configs:
- targets: ['user-service:8081']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'product-service'
static_configs:
- targets: ['product-service:8082']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'order-service'
static_configs:
- targets: ['order-service:8083']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'payment-service'
static_configs:
- targets: ['payment-service:8084']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'inventory-service'
static_configs:
- targets: ['inventory-service:8085']
metrics_path: '/metrics'
scrape_interval: 15s
# Blackbox monitoring
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- http://frontend:8080/health
- http://user-service:8081/health
- http://product-service:8082/health
- http://order-service:8083/health
- http://payment-service:8084/health
- http://inventory-service:8085/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
        replacement: blackbox-exporter:9115
```

Step 4: Recording Rules

```yaml
# prometheus/recording_rules.yml
groups:
- name: application_rules
interval: 30s
rules:
# Request rates
- record: service:request_rate:rate5m
expr: sum by (service) (rate(http_requests_total[5m]))
- record: service:request_rate:rate1h
expr: sum by (service) (rate(http_requests_total[1h]))
# Error rates
- record: service:error_rate:rate5m
expr: |
sum by (service) (rate(http_requests_total{status=~"[45].."}[5m])) /
sum by (service) (rate(http_requests_total[5m]))
# Latency percentiles
- record: service:request_duration:p50
expr: |
histogram_quantile(0.50,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: service:request_duration:p95
expr: |
histogram_quantile(0.95,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: service:request_duration:p99
expr: |
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- name: infrastructure_rules
interval: 30s
rules:
# Node metrics
- record: node:cpu_usage:rate5m
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: node:memory_usage:percentage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- record: node:disk_usage:percentage
expr: |
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
- name: business_rules
interval: 60s
rules:
# Business metrics
- record: business:page_views:rate1h
expr: rate(business_events_total{event_type="page_view"}[1h]) * 3600
- record: business:user_requests:rate1h
expr: rate(business_events_total{event_type=~"user_.*"}[1h]) * 3600
# Service dependency health
- record: service:dependency_success_rate:rate5m
expr: |
sum by (service, target_service) (rate(upstream_requests_total{status=~"2.."}[5m])) /
          sum by (service, target_service) (rate(upstream_requests_total[5m]))
```

Step 5: Alerting Rules

```yaml
# prometheus/alert_rules.yml
groups:
- name: infrastructure_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Node is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute"
runbook_url: "https://runbooks.company.com/node-down"
- alert: HighCPUUsage
expr: node:cpu_usage:rate5m > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage"
description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: node:memory_usage:percentage > 85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High memory usage"
description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
- name: application_alerts
rules:
- alert: ServiceDown
expr: up{job=~"frontend|.*-service"} == 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "Service is down"
description: "Service {{ $labels.job }} is down"
- alert: HighErrorRate
expr: service:error_rate:rate5m > 0.05
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate for {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
- alert: HighLatency
expr: service:request_duration:p95 > 1
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High latency for {{ $labels.service }}"
description: "95th percentile latency is {{ $value }}s for {{ $labels.service }}"
- alert: LowRequestRate
expr: service:request_rate:rate5m < 0.1
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "Low request rate for {{ $labels.service }}"
description: "Request rate is {{ $value }} req/s for {{ $labels.service }}"
- name: business_alerts
rules:
- alert: LowPageViews
expr: business:page_views:rate1h < 10
for: 15m
labels:
severity: warning
team: product
annotations:
summary: "Low page view rate"
description: "Page view rate is {{ $value }} views/hour"
- alert: ServiceDependencyFailure
expr: service:dependency_success_rate:rate5m < 0.95
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Service dependency failure"
description: "{{ $labels.service }} -> {{ $labels.target_service }} success rate is {{ $value | humanizePercentage }}"YAMLStep 6: Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@ecommerce.local'
smtp_auth_username: 'alerts@ecommerce.local'
smtp_auth_password: 'your-app-password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
# Critical alerts to on-call
- matchers:
- severity=critical
receiver: 'critical-alerts'
continue: true
# Infrastructure team alerts
- matchers:
- team=infrastructure
receiver: 'infrastructure-team'
# Platform team alerts
- matchers:
- team=platform
receiver: 'platform-team'
# Product team alerts
- matchers:
- team=product
receiver: 'product-team'
receivers:
- name: 'default'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Service:* {{ .Labels.service }}
{{ end }}
- name: 'critical-alerts'
email_configs:
- to: 'oncall@ecommerce.local'
subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Service: {{ .Labels.service }}
Runbook: {{ .Annotations.runbook_url }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#critical-alerts'
title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
- name: 'infrastructure-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#infrastructure'
title: '⚠️ Infrastructure Alert: {{ .GroupLabels.alertname }}'
- name: 'platform-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#platform'
title: '🔧 Platform Alert: {{ .GroupLabels.alertname }}'
- name: 'product-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#product'
title: '📊 Business Alert: {{ .GroupLabels.alertname }}'
inhibit_rules:
# Don't send warning alerts if critical alerts are firing
- source_matchers:
- severity=critical
target_matchers:
- severity=warning
equal: ['service']
# Don't send service alerts if node is down
- source_matchers:
- alertname=NodeDown
target_matchers:
- alertname=ServiceDown
    equal: ['instance']
```

Step 7: Grafana Dashboards

Infrastructure Dashboard

```json
# grafana/provisioning/dashboards/infrastructure.json
{
"dashboard": {
"id": null,
"title": "Infrastructure Overview",
"tags": ["infrastructure", "monitoring"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "CPU Usage",
"type": "stat",
"targets": [
{
"expr": "node:cpu_usage:rate5m",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Memory Usage",
"type": "stat",
"targets": [
{
"expr": "node:memory_usage:percentage",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
},
{
"id": 3,
"title": "CPU Usage Over Time",
"type": "graph",
"targets": [
{
"expr": "node:cpu_usage:rate5m",
"legendFormat": "{{ instance }}"
}
],
"yAxes": [
{
"unit": "percent",
"max": 100,
"min": 0
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
}
]
}
}
```

Application Dashboard

```json
# grafana/provisioning/dashboards/application.json
{
"dashboard": {
"id": null,
"title": "Application Performance",
"tags": ["application", "performance"],
"timezone": "browser",
"refresh": "30s",
"templating": {
"list": [
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total, service)",
"refresh": 1,
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "service:request_rate:rate5m{service=~\"$service\"}",
"legendFormat": "{{ service }}"
}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "service:error_rate:rate5m{service=~\"$service\"} * 100",
"legendFormat": "{{ service }}"
}
],
"yAxes": [
{
"unit": "percent",
"max": 100,
"min": 0
}
],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
},
{
"id": 3,
"title": "Response Time Percentiles",
"type": "graph",
"targets": [
{
"expr": "service:request_duration:p50{service=~\"$service\"}",
"legendFormat": "{{ service }} - 50th"
},
{
"expr": "service:request_duration:p95{service=~\"$service\"}",
"legendFormat": "{{ service }} - 95th"
},
{
"expr": "service:request_duration:p99{service=~\"$service\"}",
"legendFormat": "{{ service }} - 99th"
}
],
"yAxes": [
{
"unit": "s"
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
}
]
}
}
```

Step 8: Testing and Validation

Load Testing Script

```python
# scripts/load_test.py
import requests
import time
import random
import threading
from concurrent.futures import ThreadPoolExecutor
BASE_URL = "http://localhost:8080"
def make_request(endpoint):
"""Make a request to the specified endpoint"""
try:
response = requests.get(f"{BASE_URL}{endpoint}", timeout=5)
return response.status_code
except Exception as e:
print(f"Error calling {endpoint}: {e}")
return 500
def generate_load(stop_time):
    """Generate load on the application until stop_time is reached"""
    endpoints = ["/", "/users", "/health"]
    while time.time() < stop_time:
        endpoint = random.choice(endpoints)
        make_request(endpoint)
        # Add some randomness to the load
        time.sleep(random.uniform(0.1, 1.0))
def run_load_test(duration_minutes=10, concurrent_users=5):
    """Run load test for the specified duration with N concurrent workers"""
    print(f"Starting load test with {concurrent_users} concurrent users for {duration_minutes} minutes")
    stop_time = time.time() + duration_minutes * 60
    with ThreadPoolExecutor(max_workers=concurrent_users) as executor:
        # Each worker loops until the deadline, so the executor can shut down cleanly
        futures = [executor.submit(generate_load, stop_time) for _ in range(concurrent_users)]
        for future in futures:
            future.result()
if __name__ == "__main__":
    run_load_test(duration_minutes=5, concurrent_users=10)
```

Chaos Testing

```python
# scripts/chaos_test.py
import docker
import time
import random
client = docker.from_env()
def stop_random_service():
"""Stop a random service for chaos testing"""
services = ['user-service', 'product-service', 'order-service']
service_name = random.choice(services)
try:
container = client.containers.get(service_name)
print(f"Stopping {service_name}")
container.stop()
# Wait for some time
time.sleep(30)
print(f"Starting {service_name}")
container.start()
except Exception as e:
print(f"Error with {service_name}: {e}")
def simulate_high_load():
"""Simulate high CPU load on a container"""
try:
container = client.containers.get('frontend')
print("Simulating high CPU load")
# Run stress test inside container
container.exec_run("stress --cpu 2 --timeout 60s", detach=True)
except Exception as e:
print(f"Error simulating load: {e}")
if __name__ == "__main__":
print("Starting chaos testing...")
# Run different chaos scenarios
stop_random_service()
time.sleep(120)
simulate_high_load()
    time.sleep(120)
```

Step 9: Deployment Script

```bash
#!/bin/bash
# scripts/deploy.sh
set -e
echo "Starting E-commerce Observability Stack deployment..."
# Create necessary directories
mkdir -p data/{user-db,product-db,order-db}
mkdir -p prometheus grafana/provisioning/{datasources,dashboards}
mkdir -p alertmanager blackbox
# Set permissions
chmod 777 data/{user-db,product-db,order-db}
# Build application images
echo "Building application images..."
for service in frontend user-service product-service order-service payment-service inventory-service; do
echo "Building $service..."
docker build -t ecommerce/$service:latest apps/$service/
done
# Start the stack
echo "Starting services..."
docker-compose up -d
# Wait for services to be ready
echo "Waiting for services to start..."
sleep 30
# Check service health
echo "Checking service health..."
services=("prometheus:9090" "grafana:3000" "alertmanager:9093" "frontend:8080")
for service in "${services[@]}"; do
IFS=':' read -r name port <<< "$service"
echo "Checking $name on port $port..."
for i in {1..30}; do
if curl -f "http://localhost:$port/health" 2>/dev/null || curl -f "http://localhost:$port" 2>/dev/null; then
echo "$name is healthy"
break
fi
if [ $i -eq 30 ]; then
echo "Warning: $name may not be ready"
fi
sleep 2
done
done
echo "Deployment complete!"
echo "Access URLs:"
echo " Prometheus: http://localhost:9090"
echo " Grafana: http://localhost:3000 (admin/admin123)"
echo " Alertmanager: http://localhost:9093"
echo " Application: http://localhost:8080"
echo "Run load tests with: python scripts/load_test.py"
echo "Run chaos tests with: python scripts/chaos_test.py"BashStep 10: Documentation and Runbooks
README.md
# E-commerce Observability Stack
This project demonstrates a complete observability setup for a microservices-based e-commerce application using Prometheus, Grafana, and Alertmanager.
## Architecture
- **Frontend Service** (Go): Main web interface
- **User Service** (Python): User management
- **Product Service** (Python): Product catalog
- **Order Service** (Python): Order processing
- **Payment Service** (Python): Payment processing
- **Inventory Service** (Python): Inventory management
## Deployment
```bash
# Clone the repository
git clone <repository-url>
cd ecommerce-observability
# Deploy the stack
./scripts/deploy.sh
```

Access Points
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin123)
- Alertmanager: http://localhost:9093
- Application: http://localhost:8080
Testing
Load Testing
```bash
python scripts/load_test.py
```

Chaos Testing
```bash
python scripts/chaos_test.py
```

Monitoring
Key Metrics
- Request rate per service
- Error rate per service
- Response time percentiles
- Infrastructure utilization
Alerts
- Service down
- High error rate (>5%)
- High latency (>1s p95)
- Infrastructure issues
Troubleshooting
Service Discovery Issues
Check Prometheus targets: http://localhost:9090/targets
Missing Metrics
Verify service /metrics endpoints are accessible
Alert Not Firing
Check Prometheus rules: http://localhost:9090/rules
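These checks can also be scripted against the Prometheus HTTP API. The sketch below is a minimal troubleshooting helper, not part of the project code; it assumes Prometheus is reachable at localhost:9090 as in the deployment script, and the script name is hypothetical. It lists scrape targets whose last scrape failed and the alerting rules Prometheus has actually loaded.

```python
# scripts/check_targets.py (hypothetical helper) -- assumes Prometheus at localhost:9090
import requests

PROM = "http://localhost:9090"

def unhealthy_targets():
    """Return scrape targets whose last scrape did not succeed."""
    data = requests.get(f"{PROM}/api/v1/targets", timeout=5).json()["data"]
    return [
        (t["labels"].get("job", "unknown"), t["scrapeUrl"], t.get("lastError", ""))
        for t in data["activeTargets"]
        if t["health"] != "up"
    ]

def loaded_alert_rules():
    """List alerting rules Prometheus has loaded, with their current state."""
    data = requests.get(f"{PROM}/api/v1/rules", timeout=5).json()["data"]
    return [
        (g["name"], r["name"], r.get("state", "n/a"))
        for g in data["groups"]
        for r in g["rules"]
        if r["type"] == "alerting"
    ]

if __name__ == "__main__":
    for job, url, err in unhealthy_targets():
        print(f"DOWN  {job}  {url}  {err}")
    for group, rule, state in loaded_alert_rules():
        print(f"RULE  {group}/{rule}: {state}")
```

Run it after the stack is up; an empty "DOWN" list and a full rule list usually narrows the problem to instrumentation rather than scraping.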
### Project Validation
#### Verification Checklist
1. **✅ Infrastructure Monitoring**
- [ ] Node exporter collecting system metrics
- [ ] CPU, memory, disk usage visible in Grafana
- [ ] Infrastructure alerts firing correctly
2. **✅ Application Monitoring**
- [ ] All services exposing metrics
- [ ] Request rate, error rate, latency tracked
- [ ] Business metrics instrumented
3. **✅ Alerting**
- [ ] Critical alerts configured
- [ ] Alert routing working
   - [ ] Notification channels tested (see the synthetic-alert sketch below the checklist)
4. **✅ Visualization**
- [ ] Infrastructure dashboard functional
- [ ] Application dashboard with filters
- [ ] Business metrics dashboard
5. **✅ Testing**
- [ ] Load testing generating metrics
- [ ] Chaos testing triggering alerts
- [ ] Recovery scenarios validated
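One way to exercise the notification-channel item without waiting for a real incident is to push a synthetic alert straight into Alertmanager's v2 API. This is a rough sketch under assumptions: Alertmanager listens on localhost:9093, the routing tree from Step 6 is loaded, and the alert name, labels, and script path are made up purely for this test.

```python
# scripts/test_alert_routing.py (hypothetical helper) -- assumes Alertmanager at localhost:9093
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER = "http://localhost:9093"

def send_test_alert():
    now = datetime.now(timezone.utc)
    alert = {
        "labels": {
            "alertname": "SyntheticTestAlert",   # hypothetical, not backed by a real rule
            "severity": "warning",
            "team": "platform",
            "service": "frontend",
        },
        "annotations": {
            "summary": "Synthetic alert to validate routing",
            "description": "Safe to ignore; sent by the routing test script",
        },
        # Auto-resolve after 5 minutes so the test alert does not linger.
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }
    resp = requests.post(f"{ALERTMANAGER}/api/v2/alerts", json=[alert], timeout=5)
    resp.raise_for_status()
    print("Test alert accepted; check the #platform Slack channel for delivery.")

if __name__ == "__main__":
    send_test_alert()
```

If the alert never reaches Slack, compare the labels against the route matchers in alertmanager.yml before suspecting the webhook itself.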
### Chapter 11 Summary
The capstone project demonstrates a production-ready observability stack with comprehensive monitoring, alerting, and visualization. It covers infrastructure monitoring, application performance tracking, business metrics, and incident response workflows. The project serves as a practical template for implementing Prometheus-based observability in real-world microservices environments.
### Final Exercise
1. **Deploy the Complete Stack**:
- Follow the deployment guide
- Verify all components are working
- Access all web interfaces
2. **Run Tests and Observe**:
- Execute load tests and watch metrics
- Trigger chaos tests and verify alerts
- Practice incident response workflows
3. **Customize and Extend**:
   - Add new metrics to services (a sketch follows this exercise)
- Create custom dashboards
- Implement additional alert rules
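As a starting point for the first item, here is a minimal sketch of what an additional business metric could look like in one of the Python services. The metric name, label set, and buckets are illustrative assumptions, not part of the existing code.

```python
# apps/order-service/metrics_extension.py (hypothetical module)
from prometheus_client import Histogram

# Distribution of order values; buckets chosen arbitrarily for illustration.
ORDER_VALUE = Histogram(
    'order_value_dollars',
    'Distribution of order values',
    ['service'],
    buckets=[5, 10, 25, 50, 100, 250, 500, 1000]
)

def record_order(total_dollars: float) -> None:
    """Call this wherever the order service finalizes an order."""
    ORDER_VALUE.labels(service='order-service').observe(total_dollars)
```

A matching recording rule and Grafana panel would follow the same `service:*` naming pattern used in Step 4.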
---
## 12. Appendices
### Appendix A: PromQL Cheat Sheet
#### Basic Selectors
```promql
# Simple metric selection
http_requests_total
# Label matching
http_requests_total{method="GET"}
http_requests_total{method!="GET"}
http_requests_total{method=~"GET|POST"}
http_requests_total{method!~"GET|POST"}
# Multiple labels
http_requests_total{method="GET", status="200"}MarkdownTime Series Types
# Instant vector (single value per series)
up
# Range vector (range of values over time)
up[5m]
# Scalar (single numeric value)
42
```

#### Rate and Counter Functions

```promql
# Rate: per-second average rate
rate(http_requests_total[5m])
# Increase: total increase over time window
increase(http_requests_total[5m])
# irate: instantaneous rate
irate(http_requests_total[5m])
# Delta: difference between first and last value
delta(cpu_temp_celsius[2h])
```

#### Aggregation Operators

```promql
# Sum
sum(http_requests_total)
sum by (job) (http_requests_total)
sum without (instance) (http_requests_total)
# Average
avg(node_cpu_seconds_total)
avg by (mode) (node_cpu_seconds_total)
# Count
count(up)
count by (job) (up)
# Min/Max
min(node_filesystem_free_bytes)
max(node_filesystem_free_bytes)
# Quantile
quantile(0.95, http_request_duration_seconds)
# Top/Bottom K
topk(5, http_requests_total)
bottomk(3, node_filesystem_free_bytes)
```

#### Mathematical Functions

```promql
# Arithmetic operators
node_memory_MemTotal_bytes - node_memory_MemFree_bytes
rate(http_requests_total[5m]) * 60
# Mathematical functions
abs(delta(cpu_temp_celsius[5m]))
ceil(rate(http_requests_total[5m]))
floor(rate(http_requests_total[5m]))
round(rate(http_requests_total[5m]), 0.1)
sqrt(rate(http_requests_total[5m]))
ln(rate(http_requests_total[5m]))
log10(rate(http_requests_total[5m]))
```

#### Histogram Functions

```promql
# Quantiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Average from histogram
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Request rate from histogram
rate(http_request_duration_seconds_count[5m])
```

#### Time Functions

```promql
# Current time
time()
# Timestamp of samples
timestamp(up)
# Time-based filtering
hour() > 9 and hour() < 17 # Business hours
day_of_week() > 0 and day_of_week() < 6 # Weekdays
# Prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
```

#### String Functions

```promql
# Label manipulation
label_replace(up, "instance_short", "$1", "instance", "([^:]+):.*")
label_join(up, "instance_job", ":", "instance", "job")
```

#### Comparison Operators

```promql
# Comparison
node_filesystem_free_bytes < 1000000000 # Less than 1GB
rate(http_requests_total[5m]) > 10 # More than 10 req/s
# Boolean operators
up == 1 and on(instance) node_load1 > 2
up == 0 or on(instance) node_filesystem_free_bytes < 1000000000
```

#### Advanced Patterns

```promql
# SLI/SLO calculations
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Error budget burn rate
(1 - sli_availability) / (1 - slo_target) > burn_rate_threshold
# Multi-service aggregation
sum by (environment) (rate(http_requests_total[5m]))
# Cross-metric calculations
rate(http_requests_total[5m]) / on(instance) group_left sum by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

### Appendix B: Exporter Catalog
Official Exporters
| Exporter | Purpose | Port | Key Metrics |
|---|---|---|---|
| Node Exporter | System metrics | 9100 | CPU, memory, disk, network |
| Blackbox Exporter | External monitoring | 9115 | HTTP, DNS, TCP, ICMP |
| MySQL Exporter | MySQL database | 9104 | Connections, queries, performance |
| Redis Exporter | Redis database | 9121 | Memory, commands, keys |
| HAProxy Exporter | HAProxy load balancer | 8404 | Requests, responses, health |
| NGINX Exporter | NGINX web server | 9113 | Requests, connections, status |
Third-party Exporters
| Exporter | Purpose | Port | Key Metrics |
|---|---|---|---|
| Postgres Exporter | PostgreSQL database | 9187 | Connections, queries, locks |
| MongoDB Exporter | MongoDB database | 9216 | Operations, connections, memory |
| Elasticsearch Exporter | Elasticsearch | 9114 | Cluster health, indices, queries |
| RabbitMQ Exporter | RabbitMQ message broker | 9419 | Queues, messages, connections |
| Kafka Exporter | Apache Kafka | 9308 | Topics, partitions, lag |
| JMX Exporter | Java applications | 8080 | JVM metrics, garbage collection |
Cloud Provider Exporters
| Exporter | Purpose | Key Metrics |
|---|---|---|
| AWS CloudWatch Exporter | AWS services | EC2, RDS, ELB metrics |
| Azure Monitor Exporter | Azure services | VM, storage, network metrics |
| GCP Monitoring Exporter | Google Cloud | Compute, storage, network metrics |
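Before wiring any of these exporters into prometheus.yml, it helps to confirm the exporter actually serves metrics on its default port. A rough sketch, assuming a couple of the exporters above run locally on the ports listed in the tables:

```python
# Quick spot check of exporter /metrics endpoints; hosts and ports are assumptions
# based on the default ports listed above.
import requests

EXPORTERS = {
    "node-exporter": "http://localhost:9100/metrics",
    "blackbox-exporter": "http://localhost:9115/metrics",
}

for name, url in EXPORTERS.items():
    try:
        body = requests.get(url, timeout=5).text
        # Count sample lines, ignoring comments (# HELP / # TYPE)
        samples = [line for line in body.splitlines() if line and not line.startswith("#")]
        print(f"{name}: {len(samples)} samples exposed")
    except requests.RequestException as exc:
        print(f"{name}: not reachable ({exc})")
```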
Configuration Examples
Node Exporter

```yaml
# docker-compose.yml
node-exporter:
image: prom/node-exporter:latest
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
ports:
- "9100:9100"YAMLBlackbox Exporter
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_status_codes: []
method: GET
      follow_redirects: true
```

MySQL Exporter

```ini
# Environment variables
DATA_SOURCE_NAME: "user:password@(mysql:3306)/"
# Or configuration file
[client]
user = exporter
password = password
host = mysql
port = 3306
```

### Appendix C: Alert Rule Templates
Infrastructure Alerts

```yaml
groups:
- name: node_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} is down"
- alert: HighCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
- alert: HighMemory
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
- alert: DiskSpaceCritical
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Critical disk space on {{ $labels.instance }}"YAMLApplication Alerts
groups:
- name: application_alerts
rules:
- alert: ServiceDown
expr: up{job=~".*-service"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is down"
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.job }}"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency for {{ $labels.job }}"
- alert: LowThroughput
expr: rate(http_requests_total[5m]) < 1
for: 10m
labels:
severity: warning
annotations:
summary: "Low throughput for {{ $labels.job }}"YAMLDatabase Alerts
groups:
- name: database_alerts
rules:
- alert: DatabaseDown
expr: mysql_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Database {{ $labels.instance }} is down"
- alert: HighConnections
expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "High database connections on {{ $labels.instance }}"
- alert: SlowQueries
expr: rate(mysql_global_status_slow_queries[5m]) > 0
for: 5m
labels:
          severity: warning
        annotations:
          summary: "MySQL slow queries detected on {{ $labels.instance }}"
```
11. Capstone Project
Project Overview
Build a complete observability stack for a sample e-commerce application with microservices architecture, including metrics collection, alerting, visualization, and incident response workflows.
Architecture Overview
graph TB
subgraph "Application Layer"
A[Frontend Service] --> B[User Service]
A --> C[Product Service]
A --> D[Order Service]
D --> E[Payment Service]
D --> F[Inventory Service]
B --> G[User Database]
C --> H[Product Database]
D --> I[Order Database]
end
subgraph "Observability Layer"
J[Prometheus] --> K[Alertmanager]
J --> L[Grafana]
M[Node Exporter] --> J
N[Application Metrics] --> J
O[Blackbox Exporter] --> J
K --> P[Slack/Email]
L --> Q[Dashboards]
end
A --> N
B --> N
C --> N
D --> N
E --> N
F --> NStep 1: Infrastructure Setup
Docker Compose Environment
# docker-compose.yml
version: '3.8'
networks:
monitoring:
driver: bridge
app:
driver: bridge
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
services:
# Prometheus
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus:/etc/prometheus
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
networks:
- monitoring
- app
restart: unless-stopped
# Alertmanager
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager:/etc/alertmanager
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
networks:
- monitoring
restart: unless-stopped
# Grafana
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
networks:
- monitoring
restart: unless-stopped
# Node Exporter
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
networks:
- monitoring
restart: unless-stopped
# Blackbox Exporter
blackbox-exporter:
image: prom/blackbox-exporter:latest
container_name: blackbox-exporter
ports:
- "9115:9115"
volumes:
- ./blackbox:/etc/blackbox_exporter
networks:
- monitoring
restart: unless-stopped
# Application Services
frontend:
build: ./apps/frontend
container_name: frontend
ports:
- "8080:8080"
environment:
- USER_SERVICE_URL=http://user-service:8081
- PRODUCT_SERVICE_URL=http://product-service:8082
- ORDER_SERVICE_URL=http://order-service:8083
networks:
- app
restart: unless-stopped
user-service:
build: ./apps/user-service
container_name: user-service
ports:
- "8081:8081"
environment:
- DATABASE_URL=postgresql://user:password@user-db:5432/users
networks:
- app
restart: unless-stopped
product-service:
build: ./apps/product-service
container_name: product-service
ports:
- "8082:8082"
environment:
- DATABASE_URL=postgresql://product:password@product-db:5432/products
networks:
- app
restart: unless-stopped
order-service:
build: ./apps/order-service
container_name: order-service
ports:
- "8083:8083"
environment:
- DATABASE_URL=postgresql://order:password@order-db:5432/orders
- PAYMENT_SERVICE_URL=http://payment-service:8084
- INVENTORY_SERVICE_URL=http://inventory-service:8085
networks:
- app
restart: unless-stopped
payment-service:
build: ./apps/payment-service
container_name: payment-service
ports:
- "8084:8084"
networks:
- app
restart: unless-stopped
inventory-service:
build: ./apps/inventory-service
container_name: inventory-service
ports:
- "8085:8085"
networks:
- app
restart: unless-stopped
# Databases
user-db:
image: postgres:13
container_name: user-db
environment:
- POSTGRES_DB=users
- POSTGRES_USER=user
- POSTGRES_PASSWORD=password
volumes:
- ./data/user-db:/var/lib/postgresql/data
networks:
- app
product-db:
image: postgres:13
container_name: product-db
environment:
- POSTGRES_DB=products
- POSTGRES_USER=product
- POSTGRES_PASSWORD=password
volumes:
- ./data/product-db:/var/lib/postgresql/data
networks:
- app
order-db:
image: postgres:13
container_name: order-db
environment:
- POSTGRES_DB=orders
- POSTGRES_USER=order
- POSTGRES_PASSWORD=password
volumes:
- ./data/order-db:/var/lib/postgresql/data
networks:
- appYAMLStep 2: Application Instrumentation
Frontend Service (Go)
// apps/frontend/main.go
package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"service", "method", "endpoint", "status"},
)
httpRequestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"service", "method", "endpoint"},
)
upstreamRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "upstream_requests_total",
Help: "Total upstream requests",
},
[]string{"service", "target_service", "status"},
)
businessMetrics = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "business_events_total",
Help: "Business events counter",
},
[]string{"service", "event_type"},
)
)
func init() {
prometheus.MustRegister(httpRequestsTotal)
prometheus.MustRegister(httpRequestDuration)
prometheus.MustRegister(upstreamRequestsTotal)
prometheus.MustRegister(businessMetrics)
}
func instrumentHandler(service, endpoint string, handler http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Wrap ResponseWriter to capture status code
ww := &responseWriter{ResponseWriter: w, statusCode: 200}
handler(ww, r)
duration := time.Since(start).Seconds()
status := fmt.Sprintf("%d", ww.statusCode)
httpRequestsTotal.WithLabelValues(service, r.Method, endpoint, status).Inc()
httpRequestDuration.WithLabelValues(service, r.Method, endpoint).Observe(duration)
}
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
func homeHandler(w http.ResponseWriter, r *http.Request) {
businessMetrics.WithLabelValues("frontend", "page_view").Inc()
response := map[string]string{
"service": "frontend",
"status": "healthy",
"version": "1.0.0",
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(response)
}
func usersHandler(w http.ResponseWriter, r *http.Request) {
userServiceURL := os.Getenv("USER_SERVICE_URL")
if userServiceURL == "" {
userServiceURL = "http://localhost:8081"
}
start := time.Now()
resp, err := http.Get(userServiceURL + "/users")
duration := time.Since(start).Seconds()
status := "500"
if err == nil {
status = fmt.Sprintf("%d", resp.StatusCode)
defer resp.Body.Close()
}
upstreamRequestsTotal.WithLabelValues("frontend", "user-service", status).Inc()
if err != nil {
http.Error(w, "User service unavailable", http.StatusServiceUnavailable)
return
}
businessMetrics.WithLabelValues("frontend", "user_list_view").Inc()
w.Header().Set("Content-Type", "application/json")
w.Write([]byte(`{"users": []}`))
}
func main() {
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/", instrumentHandler("frontend", "/", homeHandler))
http.HandleFunc("/users", instrumentHandler("frontend", "/users", usersHandler))
http.HandleFunc("/health", instrumentHandler("frontend", "/health", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
}))
log.Println("Frontend service starting on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}GoUser Service (Python)
# apps/user-service/app.py
from flask import Flask, jsonify, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time
import psycopg2
import os
app = Flask(__name__)
# Prometheus metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['service', 'method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['service', 'method', 'endpoint'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
DATABASE_CONNECTIONS = Gauge(
'database_connections_active',
'Active database connections',
['service', 'database']
)
BUSINESS_EVENTS = Counter(
'business_events_total',
'Business events',
['service', 'event_type']
)
def instrument_request(f):
def wrapper(*args, **kwargs):
start_time = time.time()
status = '200'
try:
result = f(*args, **kwargs)
return result
except Exception as e:
status = '500'
raise
finally:
REQUEST_COUNT.labels(
service='user-service',
method=request.method,
endpoint=request.endpoint or 'unknown',
status=status
).inc()
REQUEST_DURATION.labels(
service='user-service',
method=request.method,
endpoint=request.endpoint or 'unknown'
).observe(time.time() - start_time)
wrapper.__name__ = f.__name__
return wrapper
@app.route('/')
@instrument_request
def home():
return jsonify({
'service': 'user-service',
'status': 'healthy',
'version': '1.0.0'
})
@app.route('/users')
@instrument_request
def get_users():
BUSINESS_EVENTS.labels(service='user-service', event_type='user_list_request').inc()
# Simulate database query
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
time.sleep(0.01) # Simulate query time
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
return jsonify({
'users': [
{'id': 1, 'name': 'John Doe', 'email': 'john@example.com'},
{'id': 2, 'name': 'Jane Smith', 'email': 'jane@example.com'}
]
})
@app.route('/users/<int:user_id>')
@instrument_request
def get_user(user_id):
BUSINESS_EVENTS.labels(service='user-service', event_type='user_detail_request').inc()
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').inc()
time.sleep(0.005)
DATABASE_CONNECTIONS.labels(service='user-service', database='postgres').dec()
return jsonify({
'id': user_id,
'name': f'User {user_id}',
'email': f'user{user_id}@example.com'
})
@app.route('/health')
@instrument_request
def health():
return jsonify({'status': 'healthy'})
@app.route('/metrics')
def metrics():
return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8081)PythonStep 3: Prometheus Configuration
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'ecommerce'
environment: 'production'
rule_files:
- "alert_rules.yml"
- "recording_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
scrape_interval: 30s
# Application services
- job_name: 'frontend'
static_configs:
- targets: ['frontend:8080']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'user-service'
static_configs:
- targets: ['user-service:8081']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'product-service'
static_configs:
- targets: ['product-service:8082']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'order-service'
static_configs:
- targets: ['order-service:8083']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'payment-service'
static_configs:
- targets: ['payment-service:8084']
metrics_path: '/metrics'
scrape_interval: 15s
- job_name: 'inventory-service'
static_configs:
- targets: ['inventory-service:8085']
metrics_path: '/metrics'
scrape_interval: 15s
# Blackbox monitoring
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- http://frontend:8080/health
- http://user-service:8081/health
- http://product-service:8082/health
- http://order-service:8083/health
- http://payment-service:8084/health
- http://inventory-service:8085/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115YAMLStep 4: Recording Rules
# prometheus/recording_rules.yml
groups:
- name: application_rules
interval: 30s
rules:
# Request rates
- record: service:request_rate:rate5m
expr: sum by (service) (rate(http_requests_total[5m]))
- record: service:request_rate:rate1h
expr: sum by (service) (rate(http_requests_total[1h]))
# Error rates
- record: service:error_rate:rate5m
expr: |
sum by (service) (rate(http_requests_total{status=~"[45].."}[5m])) /
sum by (service) (rate(http_requests_total[5m]))
# Latency percentiles
- record: service:request_duration:p50
expr: |
histogram_quantile(0.50,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: service:request_duration:p95
expr: |
histogram_quantile(0.95,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- record: service:request_duration:p99
expr: |
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- name: infrastructure_rules
interval: 30s
rules:
# Node metrics
- record: node:cpu_usage:rate5m
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: node:memory_usage:percentage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- record: node:disk_usage:percentage
expr: |
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
- name: business_rules
interval: 60s
rules:
# Business metrics
- record: business:page_views:rate1h
expr: rate(business_events_total{event_type="page_view"}[1h]) * 3600
- record: business:user_requests:rate1h
expr: rate(business_events_total{event_type=~"user_.*"}[1h]) * 3600
# Service dependency health
- record: service:dependency_success_rate:rate5m
expr: |
sum by (service, target_service) (rate(upstream_requests_total{status=~"2.."}[5m])) /
sum by (service, target_service) (rate(upstream_requests_total[5m]))YAMLStep 5: Alerting Rules
# prometheus/alert_rules.yml
groups:
- name: infrastructure_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Node is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute"
runbook_url: "https://runbooks.company.com/node-down"
- alert: HighCPUUsage
expr: node:cpu_usage:rate5m > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage"
description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
- alert: HighMemoryUsage
expr: node:memory_usage:percentage > 85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High memory usage"
description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
- name: application_alerts
rules:
- alert: ServiceDown
expr: up{job=~"frontend|.*-service"} == 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "Service is down"
description: "Service {{ $labels.job }} is down"
- alert: HighErrorRate
expr: service:error_rate:rate5m > 0.05
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate for {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
- alert: HighLatency
expr: service:request_duration:p95 > 1
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High latency for {{ $labels.service }}"
description: "95th percentile latency is {{ $value }}s for {{ $labels.service }}"
- alert: LowRequestRate
expr: service:request_rate:rate5m < 0.1
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "Low request rate for {{ $labels.service }}"
description: "Request rate is {{ $value }} req/s for {{ $labels.service }}"
- name: business_alerts
rules:
- alert: LowPageViews
expr: business:page_views:rate1h < 10
for: 15m
labels:
severity: warning
team: product
annotations:
summary: "Low page view rate"
description: "Page view rate is {{ $value }} views/hour"
- alert: ServiceDependencyFailure
expr: service:dependency_success_rate:rate5m < 0.95
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Service dependency failure"
description: "{{ $labels.service }} -> {{ $labels.target_service }} success rate is {{ $value | humanizePercentage }}"YAMLStep 6: Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@ecommerce.local'
smtp_auth_username: 'alerts@ecommerce.local'
smtp_auth_password: 'your-app-password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
# Critical alerts to on-call
- matchers:
- severity=critical
receiver: 'critical-alerts'
continue: true
# Infrastructure team alerts
- matchers:
- team=infrastructure
receiver: 'infrastructure-team'
# Platform team alerts
- matchers:
- team=platform
receiver: 'platform-team'
# Product team alerts
- matchers:
- team=product
receiver: 'product-team'
receivers:
- name: 'default'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Service:* {{ .Labels.service }}
{{ end }}
- name: 'critical-alerts'
email_configs:
- to: 'oncall@ecommerce.local'
subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Service: {{ .Labels.service }}
Runbook: {{ .Annotations.runbook_url }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#critical-alerts'
title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
- name: 'infrastructure-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#infrastructure'
title: '⚠️ Infrastructure Alert: {{ .GroupLabels.alertname }}'
- name: 'platform-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#platform'
title: '🔧 Platform Alert: {{ .GroupLabels.alertname }}'
- name: 'product-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#product'
title: '📊 Business Alert: {{ .GroupLabels.alertname }}'
inhibit_rules:
# Don't send warning alerts if critical alerts are firing
- source_matchers:
- severity=critical
target_matchers:
- severity=warning
equal: ['service']
# Don't send service alerts if node is down
- source_matchers:
- alertname=NodeDown
target_matchers:
- alertname=ServiceDown
equal: ['instance']YAMLStep 7: Grafana Dashboards
Infrastructure Dashboard
# grafana/provisioning/dashboards/infrastructure.json
{
"dashboard": {
"id": null,
"title": "Infrastructure Overview",
"tags": ["infrastructure", "monitoring"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "CPU Usage",
"type": "stat",
"targets": [
{
"expr": "node:cpu_usage:rate5m",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Memory Usage",
"type": "stat",
"targets": [
{
"expr": "node:memory_usage:percentage",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
},
{
"id": 3,
"title": "CPU Usage Over Time",
"type": "graph",
"targets": [
{
"expr": "node:cpu_usage:rate5m",
"legendFormat": "{{ instance }}"
}
],
"yAxes": [
{
"unit": "percent",
"max": 100,
"min": 0
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
}
]
}
}JSONApplication Dashboard
# grafana/provisioning/dashboards/application.json
{
"dashboard": {
"id": null,
"title": "Application Performance",
"tags": ["application", "performance"],
"timezone": "browser",
"refresh": "30s",
"templating": {
"list": [
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total, service)",
"refresh": 1,
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "service:request_rate:rate5m{service=~\"$service\"}",
"legendFormat": "{{ service }}"
}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "service:error_rate:rate5m{service=~\"$service\"} * 100",
"legendFormat": "{{ service }}"
}
],
"yAxes": [
{
"unit": "percent",
"max": 100,
"min": 0
}
],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
},
{
"id": 3,
"title": "Response Time Percentiles",
"type": "graph",
"targets": [
{
"expr": "service:request_duration:p50{service=~\"$service\"}",
"legendFormat": "{{ service }} - 50th"
},
{
"expr": "service:request_duration:p95{service=~\"$service\"}",
"legendFormat": "{{ service }} - 95th"
},
{
"expr": "service:request_duration:p99{service=~\"$service\"}",
"legendFormat": "{{ service }} - 99th"
}
],
"yAxes": [
{
"unit": "s"
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
}
]
}
}

### Step 8: Testing and Validation

#### Load Testing Script
# scripts/load_test.py
import requests
import time
import random
import threading
from concurrent.futures import ThreadPoolExecutor
BASE_URL = "http://localhost:8080"
def make_request(endpoint):
"""Make a request to the specified endpoint"""
try:
response = requests.get(f"{BASE_URL}{endpoint}", timeout=5)
return response.status_code
except Exception as e:
print(f"Error calling {endpoint}: {e}")
return 500
def generate_load(stop_event):
    """Generate load on the application until stop_event is set"""
    endpoints = ["/", "/users", "/health"]
    while not stop_event.is_set():
        endpoint = random.choice(endpoints)
        make_request(endpoint)
        # Add some randomness to the load
        time.sleep(random.uniform(0.1, 1.0))

def run_load_test(duration_minutes=10, concurrent_users=5):
    """Run load test for the specified duration"""
    print(f"Starting load test with {concurrent_users} concurrent users for {duration_minutes} minutes")
    stop_event = threading.Event()
    with ThreadPoolExecutor(max_workers=concurrent_users) as executor:
        # Submit one load-generation worker per simulated user
        futures = [executor.submit(generate_load, stop_event) for _ in range(concurrent_users)]
        # Let it run for the specified duration, then signal the workers to stop.
        # (Future.cancel() cannot stop a worker that is already running.)
        time.sleep(duration_minutes * 60)
        stop_event.set()
        for future in futures:
            future.result()
if __name__ == "__main__":
    run_load_test(duration_minutes=5, concurrent_users=10)

#### Chaos Testing
# scripts/chaos_test.py
import docker
import time
import random
client = docker.from_env()
def stop_random_service():
"""Stop a random service for chaos testing"""
services = ['user-service', 'product-service', 'order-service']
service_name = random.choice(services)
try:
container = client.containers.get(service_name)
print(f"Stopping {service_name}")
container.stop()
# Wait for some time
time.sleep(30)
print(f"Starting {service_name}")
container.start()
except Exception as e:
print(f"Error with {service_name}: {e}")
def simulate_high_load():
"""Simulate high CPU load on a container"""
try:
container = client.containers.get('frontend')
print("Simulating high CPU load")
# Run stress test inside container
container.exec_run("stress --cpu 2 --timeout 60s", detach=True)
except Exception as e:
print(f"Error simulating load: {e}")
if __name__ == "__main__":
print("Starting chaos testing...")
# Run different chaos scenarios
stop_random_service()
time.sleep(120)
simulate_high_load()
    time.sleep(120)

### Step 9: Deployment Script
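The deployment script below polls each service's health endpoint after `docker-compose up`. The same check can be expressed as a Compose healthcheck so the stack reports readiness on its own; a hedged fragment (the service name and the availability of `curl` inside the image are assumptions):

```yaml
# docker-compose.yml fragment (illustrative)
services:
  frontend:
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:8080/health || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 5
      start_period: 15s
```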
#!/bin/bash
# scripts/deploy.sh
set -e
echo "Starting E-commerce Observability Stack deployment..."
# Create necessary directories
mkdir -p data/{user-db,product-db,order-db}
mkdir -p prometheus grafana/provisioning/{datasources,dashboards}
mkdir -p alertmanager blackbox
# Set permissions
chmod 777 data/{user-db,product-db,order-db}
# Build application images
echo "Building application images..."
for service in frontend user-service product-service order-service payment-service inventory-service; do
echo "Building $service..."
docker build -t ecommerce/$service:latest apps/$service/
done
# Start the stack
echo "Starting services..."
docker-compose up -d
# Wait for services to be ready
echo "Waiting for services to start..."
sleep 30
# Check service health
echo "Checking service health..."
services=("prometheus:9090" "grafana:3000" "alertmanager:9093" "frontend:8080")
for service in "${services[@]}"; do
IFS=':' read -r name port <<< "$service"
echo "Checking $name on port $port..."
for i in {1..30}; do
if curl -f "http://localhost:$port/health" 2>/dev/null || curl -f "http://localhost:$port" 2>/dev/null; then
echo "$name is healthy"
break
fi
if [ $i -eq 30 ]; then
echo "Warning: $name may not be ready"
fi
sleep 2
done
done
echo "Deployment complete!"
echo "Access URLs:"
echo " Prometheus: http://localhost:9090"
echo " Grafana: http://localhost:3000 (admin/admin123)"
echo " Alertmanager: http://localhost:9093"
echo " Application: http://localhost:8080"
echo "Run load tests with: python scripts/load_test.py"
echo "Run chaos tests with: python scripts/chaos_test.py"BashStep 10: Documentation and Runbooks
README.md
# E-commerce Observability Stack
This project demonstrates a complete observability setup for a microservices-based e-commerce application using Prometheus, Grafana, and Alertmanager.
## Architecture
- **Frontend Service** (Go): Main web interface
- **User Service** (Python): User management
- **Product Service** (Python): Product catalog
- **Order Service** (Python): Order processing
- **Payment Service** (Python): Payment processing
- **Inventory Service** (Python): Inventory management
## Deployment
```bash
# Clone the repository
git clone <repository-url>
cd ecommerce-observability
# Deploy the stack
./scripts/deploy.sh
```

## Access Points

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin123)
- Alertmanager: http://localhost:9093
- Application: http://localhost:8080

## Testing

### Load Testing

`python scripts/load_test.py`

### Chaos Testing

`python scripts/chaos_test.py`

## Monitoring

### Key Metrics

- Request rate per service
- Error rate per service
- Response time percentiles
- Infrastructure utilization

### Alerts

- Service down
- High error rate (>5%)
- High latency (>1s p95)
- Infrastructure issues

## Troubleshooting

### Service Discovery Issues

Check Prometheus targets: http://localhost:9090/targets

### Missing Metrics

Verify that service /metrics endpoints are accessible.

### Alert Not Firing

Check Prometheus rules: http://localhost:9090/rules
### Project Validation
#### Verification Checklist
1. **✅ Infrastructure Monitoring**
- [ ] Node exporter collecting system metrics
- [ ] CPU, memory, disk usage visible in Grafana
- [ ] Infrastructure alerts firing correctly
2. **✅ Application Monitoring**
- [ ] All services exposing metrics
- [ ] Request rate, error rate, latency tracked
- [ ] Business metrics instrumented
3. **✅ Alerting**
- [ ] Critical alerts configured
- [ ] Alert routing working
- [ ] Notification channels tested
4. **✅ Visualization**
- [ ] Infrastructure dashboard functional
- [ ] Application dashboard with filters
- [ ] Business metrics dashboard
5. **✅ Testing**
- [ ] Load testing generating metrics
- [ ] Chaos testing triggering alerts
- [ ] Recovery scenarios validated
### Chapter 11 Summary
The capstone project demonstrates a production-ready observability stack with comprehensive monitoring, alerting, and visualization. It covers infrastructure monitoring, application performance tracking, business metrics, and incident response workflows. The project serves as a practical template for implementing Prometheus-based observability in real-world microservices environments.
### Final Exercise
1. **Deploy the Complete Stack**:
- Follow the deployment guide
- Verify all components are working
- Access all web interfaces
2. **Run Tests and Observe**:
- Execute load tests and watch metrics
- Trigger chaos tests and verify alerts
- Practice incident response workflows
3. **Customize and Extend**:
- Add new metrics to services
- Create custom dashboards
- Implement additional alert rules
---
## 12. Appendices
### Appendix A: PromQL Cheat Sheet
#### Basic Selectors
```promql
# Simple metric selection
http_requests_total
# Label matching
http_requests_total{method="GET"}
http_requests_total{method!="GET"}
http_requests_total{method=~"GET|POST"}
http_requests_total{method!~"GET|POST"}
# Multiple labels
http_requests_total{method="GET", status="200"}MarkdownTime Series Types
# Instant vector (single value per series)
up
# Range vector (range of values over time)
up[5m]
# Scalar (single numeric value)
42

#### Rate and Counter Functions
# Rate: per-second average rate
rate(http_requests_total[5m])
# Increase: total increase over time window
increase(http_requests_total[5m])
# irate: instantaneous rate
irate(http_requests_total[5m])
# Delta: difference between first and last value
delta(cpu_temp_celsius[2h])

#### Aggregation Operators
# Sum
sum(http_requests_total)
sum by (job) (http_requests_total)
sum without (instance) (http_requests_total)
# Average
avg(node_cpu_seconds_total)
avg by (mode) (node_cpu_seconds_total)
# Count
count(up)
count by (job) (up)
# Min/Max
min(node_filesystem_free_bytes)
max(node_filesystem_free_bytes)
# Quantile
quantile(0.95, http_request_duration_seconds)
# Top/Bottom K
topk(5, http_requests_total)
bottomk(3, node_filesystem_free_bytes)

#### Mathematical Functions
# Arithmetic operators
node_memory_MemTotal_bytes - node_memory_MemFree_bytes
rate(http_requests_total[5m]) * 60
# Mathematical functions
abs(delta(cpu_temp_celsius[5m]))
ceil(rate(http_requests_total[5m]))
floor(rate(http_requests_total[5m]))
round(rate(http_requests_total[5m]), 0.1)
sqrt(rate(http_requests_total[5m]))
ln(rate(http_requests_total[5m]))
log10(rate(http_requests_total[5m]))

#### Histogram Functions
# Quantiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Average from histogram
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Request rate from histogram
rate(http_request_duration_seconds_count[5m])

#### Time Functions
# Current time
time()
# Timestamp of samples
timestamp(up)
# Time-based filtering
hour() > 9 and hour() < 17 # Business hours
day_of_week() > 0 and day_of_week() < 6 # Weekdays
# Prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)

#### String Functions
# Label manipulation
label_replace(up, "instance_short", "$1", "instance", "([^:]+):.*")
label_join(up, "instance_job", ":", "instance", "job")

#### Comparison Operators
# Comparison
node_filesystem_free_bytes < 1000000000 # Less than 1GB
rate(http_requests_total[5m]) > 10 # More than 10 req/s
# Boolean operators
up == 1 and on(instance) node_load1 > 2
up == 0 or on(instance) node_filesystem_free_bytes < 1000000000

#### Advanced Patterns
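The burn-rate expression below refers to `sli_availability`, `slo_target`, and `burn_rate_threshold`, which are not built-in metrics. One way to make the expression work as written is to materialize them with recording rules; a hedged sketch (the names and the 99% target are illustrative):

```yaml
groups:
  - name: slo_rules
    rules:
      - record: sli_availability
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      # Constants recorded as series so the burn-rate query stays self-contained
      - record: slo_target
        expr: vector(0.99)
      - record: burn_rate_threshold
        expr: vector(14.4)
```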
# SLI/SLO calculations
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Error budget burn rate
(1 - sli_availability) / (1 - slo_target) > burn_rate_threshold
# Multi-service aggregation
sum by (environment) (rate(http_requests_total[5m]))
# Cross-metric calculations
rate(http_requests_total[5m]) / on(instance) group_left rate(node_cpu_seconds_total{mode="idle"}[5m])

### Appendix B: Exporter Catalog

#### Official Exporters
| Exporter | Purpose | Port | Key Metrics |
|---|---|---|---|
| Node Exporter | System metrics | 9100 | CPU, memory, disk, network |
| Blackbox Exporter | External monitoring | 9115 | HTTP, DNS, TCP, ICMP |
| MySQL Exporter | MySQL database | 9104 | Connections, queries, performance |
| Redis Exporter | Redis database | 9121 | Memory, commands, keys |
| HAProxy Exporter | HAProxy load balancer | 8404 | Requests, responses, health |
| NGINX Exporter | NGINX web server | 9113 | Requests, connections, status |
| **RabbitMQ Exporter** | RabbitMQ message broker | 9419 | Queues, messages, connections |
| **Kafka Exporter** | Apache Kafka | 9308 | Topics, partitions, lag |
| **JMX Exporter** | Java applications | 8080 | JVM metrics, garbage collection |
| **Consul Exporter** | HashiCorp Consul | 9107 | Service health, cluster status |
| **Memcached Exporter** | Memcached | 9150 | Cache hits/misses, memory usage |
| **StatsD Exporter** | StatsD metrics | 9102 | Custom application metrics |
#### Cloud Provider Exporters
| Exporter | Purpose | Key Metrics |
|----------|---------|-------------|
| **AWS CloudWatch Exporter** | AWS services | EC2, RDS, ELB metrics |
| **Azure Monitor Exporter** | Azure services | VM, storage, network metrics |
| **GCP Monitoring Exporter** | Google Cloud | Compute, storage, network metrics |
| **DigitalOcean Exporter** | DigitalOcean | Droplet metrics, load balancers |
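As an example from this category, the CloudWatch exporter is driven by a list of namespace/metric/dimension combinations; a minimal sketch (the region and metric selection are illustrative, not a recommendation):

```yaml
# cloudwatch_exporter config.yml (illustrative)
region: us-east-1
metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]
  - aws_namespace: AWS/RDS
    aws_metric_name: DatabaseConnections
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]
```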
#### Configuration Examples
##### Node Exporter
```yaml
# docker-compose.yml
node-exporter:
image: prom/node-exporter:latest
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
- '--collector.textfile.directory=/host/textfile_collector'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
- /var/log:/host/var/log:ro
ports:
- "9100:9100"
    network_mode: host
```

##### Blackbox Exporter
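The modules below only describe how a probe is performed; Prometheus selects a module and hands over the target through relabeling. A typical scrape job that pairs with the `http_2xx` module (the target URLs are placeholders):

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```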
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_status_codes: []
method: GET
follow_redirects: true
preferred_ip_protocol: "ip4"
headers:
User-Agent: "Prometheus Blackbox Exporter"
http_post_2xx:
prober: http
timeout: 5s
http:
method: POST
headers:
Content-Type: application/json
body: '{"health": "check"}'
tcp_connect:
prober: tcp
timeout: 5s
ping:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
dns:
prober: dns
timeout: 5s
dns:
query_name: "example.com"
query_type: "A"
valid_rcodes:
      - NOERROR

##### MySQL Exporter
# Environment variables
DATA_SOURCE_NAME: "user:password@(mysql:3306)/"
# Or configuration file
[client]
user = exporter
password = password
host = mysql
port = 3306
# Prometheus scrape config
scrape_configs:
- job_name: 'mysql'
static_configs:
      - targets: ['mysql-exporter:9104']

##### PostgreSQL Exporter
# docker-compose.yml
postgres-exporter:
image: prometheuscommunity/postgres-exporter
environment:
DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/database?sslmode=disable"
ports:
- "9187:9187"YAMLRedis Exporter
redis-exporter:
image: oliver006/redis_exporter
environment:
REDIS_ADDR: "redis://redis:6379"
REDIS_PASSWORD: "your-redis-password"
ports:
- "9121:9121"YAMLAppendix C: Alert Rule Templates
Infrastructure Alerts
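These rule groups can be unit-tested with `promtool test rules` before they are deployed. A minimal test for the `NodeDown` rule defined below (the file names and the relative `rule_files` path are illustrative):

```yaml
# tests/infrastructure_alerts_test.yml (illustrative; run with: promtool test rules <file>)
rule_files:
  - ../infrastructure_alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node-exporter", instance="node1:9100"}'
        values: '0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 5m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infrastructure
              job: node-exporter
              instance: node1:9100
            exp_annotations:
              summary: "Node node1:9100 is down"
              description: "Node node1:9100 has been down for more than 1 minute"
              runbook_url: "https://runbooks.company.com/alerts/node-down"
```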
groups:
- name: node_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Node {{ $labels.instance }} is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute"
runbook_url: "https://runbooks.company.com/alerts/node-down"
- alert: HighCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }}"
- alert: CriticalCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
for: 2m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Critical CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }}"
- alert: HighMemory
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }}"
- alert: DiskSpaceLow
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
for: 10m
labels:
severity: warning
team: infrastructure
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }} {{ $labels.mountpoint }}"
- alert: DiskSpaceCritical
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 95
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Critical disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value | printf \"%.2f\" }}% on {{ $labels.instance }} {{ $labels.mountpoint }}"
- alert: HighLoadAverage
expr: node_load1 / count by (instance) (count by (instance, cpu) (node_cpu_seconds_total{mode="idle"})) > 1.5
for: 10m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High load average on {{ $labels.instance }}"
description: "Load average is {{ $value | printf \"%.2f\" }} on {{ $labels.instance }}"YAMLApplication Alerts
groups:
- name: application_alerts
rules:
- alert: ServiceDown
expr: up{job=~".*-service"} == 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "Service {{ $labels.job }} is down"
description: "Service {{ $labels.job }} on {{ $labels.instance }} is down"
runbook_url: "https://runbooks.company.com/alerts/service-down"
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate for {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High latency for {{ $labels.job }}"
description: "95th percentile latency is {{ $value | printf \"%.3f\" }}s for {{ $labels.job }}"
- alert: LowThroughput
expr: rate(http_requests_total[5m]) < 1
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "Low throughput for {{ $labels.job }}"
description: "Request rate is {{ $value | printf \"%.2f\" }} req/s for {{ $labels.job }}"
- alert: HighMemoryUsage
expr: (container_memory_usage_bytes{container!="POD",container!=""} / container_spec_memory_limit_bytes) * 100 > 90
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High memory usage for container {{ $labels.container }}"
description: "Memory usage is {{ $value | printf \"%.2f\" }}% for container {{ $labels.container }} in pod {{ $labels.pod }}"YAMLDatabase Alerts
groups:
- name: database_alerts
rules:
- alert: DatabaseDown
expr: mysql_up == 0
for: 1m
labels:
severity: critical
team: database
annotations:
summary: "Database {{ $labels.instance }} is down"
description: "MySQL database on {{ $labels.instance }} is not responding"
runbook_url: "https://runbooks.company.com/alerts/database-down"
- alert: HighConnections
expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "High database connections on {{ $labels.instance }}"
description: "Database connection usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
- alert: SlowQueries
expr: rate(mysql_global_status_slow_queries[5m]) > 0.1
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "High slow query rate on {{ $labels.instance }}"
description: "Slow query rate is {{ $value | printf \"%.2f\" }} queries/s on {{ $labels.instance }}"
- alert: DatabaseReplicationLag
expr: mysql_slave_lag_seconds > 30
for: 2m
labels:
severity: warning
team: database
annotations:
summary: "Database replication lag on {{ $labels.instance }}"
description: "Replication lag is {{ $value }}s on {{ $labels.instance }}"
- alert: PostgreSQLDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
team: database
annotations:
summary: "PostgreSQL {{ $labels.instance }} is down"
description: "PostgreSQL database on {{ $labels.instance }} is not responding"
- alert: PostgreSQLHighConnections
expr: sum by (instance) (pg_stat_activity_count) / pg_settings_max_connections > 0.8
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "High PostgreSQL connections on {{ $labels.instance }}"
description: "Connection usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"YAMLNetwork and External Service Alerts
groups:
- name: network_alerts
rules:
- alert: HighNetworkReceive
expr: rate(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024 # 100MB/s
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High network receive on {{ $labels.instance }}"
description: "Network receive is {{ $value | humanize1024 }}B/s on {{ $labels.instance }} interface {{ $labels.device }}"
- alert: HighNetworkTransmit
expr: rate(node_network_transmit_bytes_total[5m]) > 100 * 1024 * 1024 # 100MB/s
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High network transmit on {{ $labels.instance }}"
description: "Network transmit is {{ $value | humanize1024 }}B/s on {{ $labels.instance }} interface {{ $labels.device }}"
- alert: ExternalServiceDown
expr: probe_success{job="blackbox"} == 0
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "External service {{ $labels.instance }} is down"
description: "External service check for {{ $labels.instance }} is failing"
- alert: ExternalServiceSlowResponse
expr: probe_duration_seconds{job="blackbox"} > 5
for: 3m
labels:
severity: warning
team: platform
annotations:
summary: "External service {{ $labels.instance }} is slow"
description: "External service {{ $labels.instance }} is responding in {{ $value | printf \"%.2f\" }}s"YAMLBusiness Logic Alerts
groups:
- name: business_alerts
rules:
- alert: LowOrderRate
expr: rate(orders_total[1h]) * 3600 < 10
for: 15m
labels:
severity: warning
team: business
annotations:
summary: "Low order rate"
description: "Order rate is {{ $value | printf \"%.2f\" }} orders/hour"
- alert: HighCartAbandonmentRate
expr: |
(
rate(cart_abandoned_total[1h]) /
(rate(cart_created_total[1h]) + rate(cart_abandoned_total[1h]))
) > 0.7
for: 30m
labels:
severity: warning
team: business
annotations:
summary: "High cart abandonment rate"
description: "Cart abandonment rate is {{ $value | humanizePercentage }}"
- alert: PaymentProcessingFailures
expr: rate(payment_failed_total[5m]) / rate(payment_attempted_total[5m]) > 0.05
for: 10m
labels:
severity: critical
team: payments
annotations:
summary: "High payment failure rate"
description: "Payment failure rate is {{ $value | humanizePercentage }}"YAMLAppendix D: Grafana Dashboard Templates
Infrastructure Overview Dashboard
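For these JSON definitions to be loaded automatically, Grafana also needs a dashboard provider pointing at the provisioning directory; a minimal sketch (the container path is Grafana's conventional provisioning location and is an assumption about how the volume is mounted):

```yaml
# grafana/provisioning/dashboards/dashboards.yml (minimal sketch)
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
```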
{
"dashboard": {
"id": null,
"title": "Infrastructure Overview",
"tags": ["infrastructure", "overview"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(up{job=\"node-exporter\"}, instance)",
"refresh": 1,
"multi": true,
"includeAll": true,
"current": {
"value": "$__all",
"text": "All"
}
}
]
},
"panels": [
{
"id": 1,
"title": "System Load",
"type": "stat",
"targets": [
{
"expr": "node_load1{instance=~\"$instance\"}",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 2},
{"color": "red", "value": 4}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
},
{
"id": 2,
"title": "CPU Usage",
"type": "stat",
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\"}[5m])) * 100)",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
},
{
"id": 3,
"title": "Memory Usage",
"type": "stat",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\"} / node_memory_MemTotal_bytes{instance=~\"$instance\"})) * 100",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}
},
{
"id": 4,
"title": "Disk Usage",
"type": "stat",
"targets": [
{
"expr": "(1 - (node_filesystem_avail_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"} / node_filesystem_size_bytes{instance=~\"$instance\",fstype!~\"tmpfs|fuse.lxcfs|squashfs\"})) * 100",
"legendFormat": "{{ instance }}:{{ mountpoint }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 80},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}
}
]
}
}

#### Application Performance Dashboard
{
"dashboard": {
"id": null,
"title": "Application Performance",
"tags": ["application", "performance"],
"timezone": "browser",
"refresh": "30s",
"templating": {
"list": [
{
"name": "service",
"type": "query",
"query": "label_values(http_requests_total, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "environment",
"type": "query",
"query": "label_values(http_requests_total, environment)",
"refresh": 1
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service)",
"legendFormat": "{{ service }}"
}
],
"yAxes": [
{
"label": "requests/sec",
"min": 0
}
],
"gridPos": {"h": 9, "w": 12, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"[45]..\",service=~\"$service\",environment=\"$environment\"}[5m])) by (service) / sum(rate(http_requests_total{service=~\"$service\",environment=\"$environment\"}[5m])) by (service) * 100",
"legendFormat": "{{ service }}"
}
],
"yAxes": [
{
"label": "error %",
"min": 0,
"max": 100
}
],
"gridPos": {"h": 9, "w": 12, "x": 12, "y": 0}
}
]
}
}

### Appendix E: Configuration Management

#### Environment-specific Configurations

##### Development Environment
# prometheus-dev.yml
global:
scrape_interval: 30s
evaluation_interval: 30s
external_labels:
environment: 'development'
cluster: 'dev'
rule_files:
- "dev_rules.yml"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['localhost:9100']
scrape_interval: 60s # Less frequent in dev
- job_name: 'application'
static_configs:
- targets: ['localhost:8080']
    scrape_interval: 30s

##### Production Environment
# prometheus-prod.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
environment: 'production'
cluster: 'prod'
datacenter: 'us-east-1'
rule_files:
- "prod_rules.yml"
- "slo_rules.yml"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets:
- 'node1:9100'
- 'node2:9100'
- 'node3:9100'
scrape_interval: 15s
- job_name: 'application'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- production
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
remote_write:
- url: "https://remote-storage.company.com/api/v1/write"
headers:
Authorization: "Bearer ${REMOTE_WRITE_TOKEN}"YAMLConfiguration Validation
#!/bin/bash
# scripts/validate-config.sh
set -e
echo "Validating Prometheus configuration..."
# Check Prometheus config syntax
promtool check config prometheus/prometheus.yml
# Check recording rules
if [ -f "prometheus/recording_rules.yml" ]; then
promtool check rules prometheus/recording_rules.yml
fi
# Check alerting rules
if [ -f "prometheus/alert_rules.yml" ]; then
promtool check rules prometheus/alert_rules.yml
fi
# Check Alertmanager config
if [ -f "alertmanager/alertmanager.yml" ]; then
amtool check-config alertmanager/alertmanager.yml
fi
echo "Configuration validation completed successfully!"BashAppendix F: Troubleshooting Guide
Common Issues and Solutions
Prometheus Issues
Issue: Targets showing as “DOWN”
# Check target accessibility
curl -v http://target-host:9100/metrics
# Check network connectivity
telnet target-host 9100
# Check Prometheus logs
docker logs prometheus
# Check scrape configuration
curl http://localhost:9090/api/v1/targets

Issue: High memory usage
# Check active series count
prometheus_tsdb_head_series
# Check samples ingested per second
rate(prometheus_tsdb_samples_total[5m])
# Find high cardinality metrics
topk(10, count by (__name__)({__name__!=""}))

Issue: Slow queries
# Check query duration
histogram_quantile(0.99, rate(prometheus_engine_query_duration_seconds_bucket[5m]))
# Check concurrent queries
prometheus_engine_queries_concurrent_max

##### Alertmanager Issues
Issue: Alerts not firing
# Check Prometheus rules evaluation
curl http://localhost:9090/api/v1/rules
# Check alert status
curl http://localhost:9090/api/v1/alerts
# Check Alertmanager configuration
amtool config show --alertmanager.url=http://localhost:9093

Issue: Notifications not being sent
# Check Alertmanager logs
docker logs alertmanager
# Test notification channels
amtool alert add --alertmanager.url=http://localhost:9093 \
alertname="test" service="test" severity="warning"
# Check silences
amtool silence query --alertmanager.url=http://localhost:9093

##### Grafana Issues
Issue: Dashboard not loading data
# Check data source connectivity
curl -X GET "http://admin:admin123@localhost:3000/api/datasources/1/health"
# Check Prometheus connectivity from Grafana
curl -X GET "http://admin:admin123@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"BashIssue: Variables not working
- Check variable query syntax
- Verify data source selection
- Check refresh settings
#### Performance Optimization

##### Reduce Cardinality
# Metric relabeling to drop high cardinality labels
metric_relabel_configs:
- source_labels: [__name__]
regex: 'high_cardinality_metric.*'
action: drop
- source_labels: [user_id]
target_label: user_type
regex: 'premium_.*'
replacement: 'premium'
- regex: 'user_id'
    action: labeldrop

##### Optimize Recording Rules
# Pre-compute expensive queries
groups:
- name: optimization_rules
interval: 30s
rules:
- record: expensive_calculation:rate5m
expr: |
sum(rate(complex_metric[5m])) by (service) /
        sum(rate(other_complex_metric[5m])) by (service)

### Appendix G: Further Reading and References

#### Official Documentation

#### Books and Guides
- “Prometheus: Up & Running” by Brian Brazil
- “Monitoring with Prometheus” by James Turnbull
- “Site Reliability Engineering” by Google (SRE practices)
- “The Art of Monitoring” by James Turnbull
#### Online Resources
- PromLabs – Prometheus conference talks
- Robust Perception Blog
- Grafana Labs Blog
- CNCF Prometheus Slack
#### Training and Certification

#### Community Resources

#### Best Practices Repositories
## Conclusion
This guide has covered Prometheus observability from basic concepts through advanced enterprise deployments. By applying the patterns, best practices, and examples provided, you should be well equipped to build robust monitoring that yields actionable insights into your systems and applications.
Remember that observability is not just about collecting metrics—it’s about building systems that help you understand and improve your applications and infrastructure. Start with the basics, iterate based on your needs, and continuously refine your monitoring strategy as your systems evolve.
The capstone project provides a practical foundation that you can adapt and extend for your specific use cases. Use the appendices as reference materials for ongoing implementation and troubleshooting.
Happy monitoring! 🚀📊