A Senior DevOps Engineer’s Complete Guide
Introduction: The Bug Discovery Race
As DevOps engineers, we face a critical question every day: Who discovers bugs first – us or our customers?
If customers find issues first, they might:
- Rate our application as unreliable
- Start exploring alternative solutions
- Lose trust in our platform
However, if we can identify and fix bugs before they impact users, we can:
- Provide exceptional service
- Build customer trust
- Drive business growth
This is why proactive issue detection is crucial for any successful application.
```mermaid
graph TD
    A[Bug Occurs] --> B{Who Finds It First?}
    B -->|Customer| C[Bad Reviews, Lost Trust, Business Impact]
    B -->|DevOps Team| D[Quick Fix, Happy Customers, Business Growth]
    style C fill:#ffcccc
    style D fill:#ccffcc
```

Understanding Monitoring
What is Monitoring?
Monitoring is the practice of collecting, processing, and alerting on predefined metrics to watch for known failure modes. It answers the question: “What is happening right now?”
Real-World Example: E-Commerce Application
Let’s consider an e-commerce microservices application:
```mermaid
graph TB
    subgraph "E-Commerce Application"
        User[User] --> Search[Search Service]
        User --> Cart[Cart Service]
        User --> Checkout[Checkout Service]
        Search --> SearchDB[(Search DB)]
        Cart --> CartDB[(Cart DB)]
        Checkout --> Payment[Payment Service]
        Payment --> Gateway[Payment Gateway]
        Search -.->|"Slow Query"| Alert1[Alert: High Latency]
        Cart -.->|"High CPU"| Alert2[Alert: Resource Usage]
        Payment -.->|"Errors"| Alert3[Alert: Error Rate]
    end
```

Scenario: Users experience latency when searching for products. The root cause might be a poorly performing database query in the Search Service.
With monitoring, we can:
- Track the average response time of the Search Service
- Set up alerts when response time exceeds 500ms (see the instrumentation sketch after this list)
- Get notified before users complain
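Such an alert threshold only works if the service exports a latency metric in the first place. Here is a minimal sketch using the Python prometheus_client library; the metric name `search_request_duration_seconds` and the simulated handler are illustrative, chosen to line up with the alert rule shown later in this guide.

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Assumed metric name, matching the alert rule used later in this guide.
SEARCH_LATENCY = Histogram(
    "search_request_duration_seconds",
    "Time spent serving a product search",
)

def handle_search(query: str):
    with SEARCH_LATENCY.time():                # records the duration into the histogram
        time.sleep(random.uniform(0.05, 0.7))  # stand-in for the real search logic
        return []

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for Prometheus to scrape
    while True:
        handle_search("electronics")
```

Prometheus scrapes the `/metrics` endpoint and can then evaluate the 500ms threshold over the resulting histogram buckets.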
The Four Golden Signals
Every application should monitor these fundamental metrics:
```mermaid
mindmap
  root((Four Golden Signals))
    Latency
      Response Time
      Time to First Byte
      Database Query Time
    Traffic
      Requests per Second
      Concurrent Users
      Bandwidth Usage
    Errors
      HTTP 4xx/5xx
      Failed Transactions
      Exception Rate
    Saturation
      CPU Usage
      Memory Usage
      Disk I/O
      Network Capacity
```

1. Latency
- Definition: Time taken to service a request
- Example: Search results loading time
- Impact: Slow responses = frustrated users switching to competitors
2. Traffic
- Definition: Number of requests your system receives
- Example: During flash sales, traffic spikes dramatically
- Impact: Helps ensure system can scale to meet demand
3. Errors
- Definition: Rate of failed requests
- Examples: 404 (Page Not Found), 500 (Internal Server Error)
- Impact: High error rates indicate system instability
4. Saturation
- Definition: How “full” your service is
- Example: Server consistently at 90% CPU usage
- Impact: Saturated resources lead to performance degradation
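As a rough sketch of how these four signals can be pulled out of a monitoring system, the snippet below queries Prometheus's HTTP API (`/api/v1/query`) with one PromQL expression per signal. The metric names and the `requests` dependency are assumptions; adjust them to whatever your services actually export.

```python
import requests  # assumes the `requests` package and a Prometheus server on localhost:9090

# One PromQL expression per golden signal; metric names are illustrative.
GOLDEN_SIGNALS = {
    "latency_p95": 'histogram_quantile(0.95, rate(search_request_duration_seconds_bucket[5m]))',
    "traffic_rps": 'sum(rate(http_requests_total[5m]))',
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "cpu_saturation": 'avg(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))',
}

def query(expr, prometheus="http://localhost:9090"):
    # Prometheus instant query endpoint: /api/v1/query?query=<PromQL>
    resp = requests.get(f"{prometheus}/api/v1/query", params={"query": expr}, timeout=5)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

for name, expr in GOLDEN_SIGNALS.items():
    print(name, query(expr))
```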
Practical Monitoring Setup
Here’s a Prometheus alerting rule example:
```yaml
# Prometheus Alert Rules
groups:
  - name: ecommerce-alerts
    rules:
      - alert: HighSearchLatency
        expr: histogram_quantile(0.95, rate(search_request_duration_seconds_bucket[5m])) > 0.5
        for: 2m
        labels:
          severity: warning
          service: search
        annotations:
          summary: "Search service latency is high"
          description: "95th percentile latency is {{ $value }}s for 2+ minutes"
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests/sec"
```

Monitoring Best Practices
```mermaid
graph LR
    A[Start Early] --> B[Define Key Metrics]
    B --> C[Set Meaningful Thresholds]
    C --> D[Create Clear Dashboards]
    D --> E[Limit Alert Fatigue]
    E --> F[Regular Review & Update]
```

- Start monitoring as early as possible – don’t wait for production issues
- Focus on the Four Golden Signals first, then expand
- Make dashboards easy to understand – clear visualizations
- Avoid alert fatigue – only alert on actionable issues
- Set priority-based alerts – critical vs warning vs info (see the routing sketch after this list)
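One way to implement priority-based alerts, referenced in the last item above, is to route Alertmanager webhook notifications by their `severity` label. The sketch below is a hypothetical Flask receiver; the destination functions are placeholders for real Slack/PagerDuty/email integrations, and it assumes Alertmanager is configured with a webhook receiver pointing at `/alerts`.

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical destinations; in practice these would call the Slack/PagerDuty/email APIs.
def page_on_call(alert):
    print("PAGE:", alert["labels"].get("alertname"))

def post_to_slack(alert):
    print("SLACK:", alert["labels"].get("alertname"))

def log_only(alert):
    print("INFO:", alert["labels"].get("alertname"))

ROUTES = {"critical": page_on_call, "warning": post_to_slack, "info": log_only}

@app.route("/alerts", methods=["POST"])
def alerts():
    # Alertmanager's webhook payload carries an "alerts" list, each with "labels"/"annotations".
    for alert in request.get_json().get("alerts", []):
        severity = alert.get("labels", {}).get("severity", "info")
        ROUTES.get(severity, log_only)(alert)
    return "", 204

if __name__ == "__main__":
    app.run(port=5001)
```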
Understanding Observability
What is Observability?
Observability is a measure of how well you can understand a system’s internal state from its external outputs. It answers: “Why is this happening?”
Note: Observability is often abbreviated as O11y (11 characters between ‘O’ and ‘y’), similar to how Kubernetes is abbreviated as K8s.
The Three Pillars of Observability
```mermaid
graph TD
    subgraph "The Three Pillars"
        A[Logs: What happened?]
        B[Metrics: How much/many?]
        C[Traces: Where did it go?]
    end
    A <--> B
    B <--> C
    C <--> A
    A --> D[Chronological Events, Error Messages, Debug Information]
    B --> E[Response Times, Error Rates, Resource Usage]
    C --> F[Request Journey, Service Dependencies, Performance Bottlenecks]
```

1. Logs
What they are: Chronological, timestamped records of discrete events
Example: When search is slow, examine search service logs:
```json
{
  "timestamp": "2024-08-22T10:30:15Z",
  "level": "WARN",
  "service": "search-service",
  "message": "Database query took 2.3s",
  "query": "SELECT * FROM products WHERE category='electronics' ORDER BY popularity",
  "execution_time_ms": 2300,
  "user_id": "user123"
}
```

Tools: Elasticsearch, Logstash, Loki, Fluentd
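A structured log like the one above does not have to be hand-built; a small JSON formatter on top of Python's standard logging module is enough. This is an illustrative sketch: the `extra_fields` convention and the hard-coded service name are assumptions, not part of any particular logging library.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object per line, including any extra fields."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "search-service",   # assumed service name
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "extra_fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("search-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning(
    "Database query took 2.3s",
    extra={"extra_fields": {
        "query": "SELECT * FROM products WHERE category='electronics'",
        "execution_time_ms": 2300,
        "user_id": "user123",
    }},
)
```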
2. Metrics
What they are: Quantitative measurements over time
Example: During a flash sale, response time metrics show:
```text
search_request_duration_seconds{quantile="0.95"} 0.85    # 95th percentile: 850ms
search_request_duration_seconds{quantile="0.50"} 0.12    # 50th percentile: 120ms
http_requests_total{service="payment",status="200"} 1500 # Successful payments
```

Tools: Prometheus, InfluxDB, DataDog
3. Traces
What they are: Journey of a single request through all services
Example: Trace showing slow “Add to Cart” operation:
```mermaid
sequenceDiagram
    participant U as User
    participant API as API Gateway
    participant Cart as Cart Service
    participant Inv as Inventory Service
    participant DB as Database
    U->>API: Add item to cart
    API->>Cart: POST /cart/add
    Cart->>Inv: Check inventory
    Note over Inv,DB: Slow query: 2.5s
    Inv-->>DB: SELECT stock FROM inventory
    DB-->>Inv: stock: 10
    Inv->>Cart: Available: 10
    Cart->>DB: INSERT into cart
    Cart->>API: Success
    API->>U: Item added
    Note over U,DB: Total time: 3.2s (2.5s in inventory check)
```

Tools: Jaeger, Zipkin, Tempo, AWS X-Ray
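To give a feel for how such a trace is produced, here is a minimal OpenTelemetry sketch with nested spans mirroring the add-to-cart flow above. It prints spans to the console instead of shipping them to Jaeger, the sleeps stand in for real work, and the span names and attributes are illustrative.

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export finished spans to stdout so the nesting is visible without any backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("cart-service")

def add_to_cart(item_id: str, quantity: int):
    with tracer.start_as_current_span("add_to_cart") as span:
        span.set_attribute("cart.item_id", item_id)
        span.set_attribute("cart.quantity", quantity)
        with tracer.start_as_current_span("check_inventory") as inv_span:
            inv_span.set_attribute("db.statement", "SELECT stock FROM inventory WHERE item_id = ?")
            time.sleep(0.1)   # stands in for the slow inventory query
        with tracer.start_as_current_span("insert_cart_row"):
            time.sleep(0.01)  # stands in for the cart insert

add_to_cart("sku-123", 1)
```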
Observability Best Practices
- Instrument everything – Add observability from the start
- Correlate data – Link logs, metrics, and traces with correlation IDs (see the sketch after this list)
- Control log volume – Implement log sampling and retention policies
- Use structured logging – JSON format for better parsing
- Add context – Include user IDs, request IDs, and business context
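For the correlation-ID practice above, a common approach in Python is a `contextvars` variable plus a logging filter, so every log line emitted while handling a request carries the same ID. This is a minimal sketch; the field names and the generated UUID are illustrative.

```python
import logging
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter(
    '{"ts": "%(asctime)s", "level": "%(levelname)s", '
    '"correlation_id": "%(correlation_id)s", "message": "%(message)s"}'
))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # In a web service you would reuse an incoming X-Request-ID header; here we generate one.
    correlation_id.set(str(uuid.uuid4()))
    logger.info("checkout started")
    logger.info("payment authorized")

handle_request()
```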
Monitoring vs. Observability: Working Together
The Workflow
```mermaid
graph LR
    A[Monitoring Alert: 'Search is slow'] --> B[DevOps Engineer Investigates]
    B --> C[Observability Data: Logs + Metrics + Traces]
    C --> D[Root Cause Found: 'Slow DB query']
    D --> E[Permanent Fix: Query optimization]
    style A fill:#ffcccc
    style E fill:#ccffcc
```

Key Differences
| Aspect | Monitoring | Observability |
|---|---|---|
| Purpose | Alerts when something is wrong | Helps understand why it’s wrong |
| Questions | “What is happening?” | “Why is it happening?” |
| Data | Predefined metrics | Rich, contextual data |
| Scope | Known failure modes | Unknown unknowns |
| Action | Reactive alerts | Proactive investigation |
Real-World Example
Scenario: E-commerce checkout process is failing
Monitoring says:
- Payment service error rate: 15%
- Checkout completion rate: 85% (down from 98%)
Observability reveals:
- Logs show payment gateway timeouts
- Metrics indicate increased latency to payment provider
- Traces reveal the payment service is retrying failed requests, causing cascading delays
Root Cause: Payment provider is experiencing issues.
Solution: Implement circuit breaker pattern and failover to backup payment provider.
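A circuit breaker can be sketched in a few lines: stop calling the primary payment provider after repeated failures, fall back to a backup, and probe the primary again after a cooldown. The class below is a simplified illustration, not a production implementation; `primary` and `backup` are hypothetical callables wrapping the real provider APIs.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, retry after a cooldown."""
    def __init__(self, max_failures=3, reset_seconds=30):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_seconds:
            # Half-open: let one request through to probe the provider again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()

breaker = CircuitBreaker()

def charge(amount, primary, backup):
    """Try the primary provider unless its breaker is open, then fail over to the backup."""
    if breaker.allow():
        try:
            result = primary(amount)
            breaker.record(True)
            return result
        except Exception:
            breaker.record(False)
    return backup(amount)
```

The key design point is that an open breaker fails over immediately instead of letting retries pile up and cascade delays through the checkout path.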
The Hospital Analogy
Imagine a patient in a hospital after surgery:
```mermaid
graph TD
    subgraph "Hospital Room"
        Patient[Patient]
        Monitor[Heart Rate Monitor]
        Chart[Patient Chart]
    end
    Monitor --> Alert[Heart Rate Spike Alert]
    Alert --> Doctor[Doctor Arrives]
    Doctor --> Chart
    Chart --> Data["Patient Data: recent medications, activity logs, allergy information, lab results"]
    Data --> Diagnosis[Diagnosis: allergic reaction to new medication]
    Diagnosis --> Treatment[Treatment: stop medication, administer antihistamine]
    style Alert fill:#ffcccc
    style Treatment fill:#ccffcc
```

- Monitoring (Heart Rate Monitor): Alerts when heart rate spikes – “Something is wrong!”
- Observability (Patient Chart): Provides context to understand why – “Patient is having an allergic reaction to the new pain medication”
Practical Implementation
Setting Up Monitoring Stack
```yaml
# docker-compose.yml for monitoring stack
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  grafana-storage:
```

Setting Up Observability Stack
```yaml
# docker-compose.yml for observability stack
version: '3.8'

services:
  # Logs
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.9.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.9.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

  # Traces
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "14268:14268" # HTTP collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  # Metrics (reuse Prometheus from the monitoring stack)
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
```

Application Instrumentation Example
```python
# Python Flask application with observability
from flask import Flask, request
import time
import logging
import json
from prometheus_client import Counter, Histogram, generate_latest
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = Flask(__name__)

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)
logger = logging.getLogger(__name__)

# Prometheus metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')

# Configure tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Send spans to Jaeger's HTTP collector (port 14268 in the docker-compose above).
# The agent_port option expects Jaeger's UDP agent (6831), so use collector_endpoint here.
jaeger_exporter = JaegerExporter(
    collector_endpoint="http://localhost:14268/api/traces",
)
span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    # Metrics
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.observe(time.time() - request.start_time)

    # Structured logging
    log_data = {
        "timestamp": time.time(),
        "method": request.method,
        "path": request.path,
        "status": response.status_code,
        "latency": time.time() - request.start_time,
        "user_agent": request.headers.get('User-Agent'),
        "ip": request.remote_addr
    }
    logger.info(json.dumps(log_data))
    return response

@app.route('/search')
def search():
    with tracer.start_as_current_span("search_products") as span:
        query = request.args.get('q', '')
        span.set_attribute("search.query", query)

        # Simulate database query
        time.sleep(0.1)  # Simulate work

        # Log business context
        logger.info(json.dumps({
            "event": "product_search",
            "query": query,
            "results_count": 42,
            "user_id": request.headers.get('X-User-ID')
        }))
        return {"results": ["product1", "product2"], "query": query}

@app.route('/metrics')
def metrics():
    return generate_latest()

if __name__ == '__main__':
    app.run(debug=True)
```

Complete System Architecture
Here’s how monitoring and observability work together in a complete system:
```mermaid
graph TB
    subgraph "Application Layer"
        App1[E-commerce App]
        App2[Mobile API]
        App3[Search Service]
    end
    subgraph "Monitoring Stack"
        Prometheus[Prometheus]
        Grafana[Grafana]
        AlertManager[AlertManager]
    end
    subgraph "Observability Stack"
        Logs[ELK Stack]
        Traces[Jaeger]
        Metrics[Prometheus]
    end
    subgraph "Notification"
        Slack[Slack]
        PagerDuty[PagerDuty]
        Email[Email]
    end
    App1 --> Prometheus
    App2 --> Prometheus
    App3 --> Prometheus
    App1 --> Logs
    App2 --> Logs
    App3 --> Logs
    App1 --> Traces
    App2 --> Traces
    App3 --> Traces
    Prometheus --> Grafana
    Prometheus --> AlertManager
    AlertManager --> Slack
    AlertManager --> PagerDuty
    AlertManager --> Email
    Logs --> Investigation[Investigation]
    Traces --> Investigation
    Metrics --> Investigation
```

Key Takeaways
The Golden Rules
- Monitoring tells you WHAT is wrong
- “API response time is 2 seconds”
- “Error rate is 5%”
- “CPU usage is 90%”
- Observability tells you WHY it’s wrong
- “Database query is slow because of missing index”
- “Errors are happening due to payment gateway timeout”
- “CPU is high because of memory leak in user service”
- Together, they enable fast resolution
- Get alerted quickly (monitoring)
- Debug efficiently (observability)
- Fix permanently (root cause analysis)
Success Metrics
Your monitoring and observability setup is successful when:
- MTTR (Mean Time To Resolution) decreases (see the calculation sketch after this list)
- MTBF (Mean Time Between Failures) increases
- Customer-reported issues decrease
- Team confidence in system health increases
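MTTR and MTBF fall straight out of your incident records. The sketch below shows the arithmetic on a made-up incident list: MTTR averages the outage durations, MTBF averages the healthy time between incidents.

```python
from datetime import datetime, timedelta

# Each incident: (start of outage, time service was restored). Timestamps are made up.
incidents = [
    (datetime(2024, 8, 1, 10, 0), datetime(2024, 8, 1, 10, 45)),
    (datetime(2024, 8, 9, 2, 30), datetime(2024, 8, 9, 2, 50)),
    (datetime(2024, 8, 20, 16, 0), datetime(2024, 8, 20, 17, 10)),
]

# MTTR: average time from outage start to restoration.
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

# MTBF: average uptime between the end of one incident and the start of the next.
gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mtbf = sum(gaps, timedelta()) / len(gaps)

print(f"MTTR: {mttr}, MTBF: {mtbf}")
```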
Next Steps
- Start with the Four Golden Signals for monitoring
- Implement structured logging across all services
- Add distributed tracing to understand request flows
- Create meaningful dashboards that tell a story
- Practice incident response using your observability data
- Continuously improve your monitoring and observability based on what you learn from incidents
Remember: Monitoring gets you to the problem, observability gets you through the problem!
This guide provides a foundation for understanding and implementing both monitoring and observability. Start small, iterate often, and always focus on reducing the time between when something breaks and when it’s fixed.