Observability vs. Monitoring

    A Senior DevOps Engineer’s Complete Guide

    Introduction: The Bug Discovery Race

    As DevOps engineers, we face a critical question every day: Who discovers bugs first – us or our customers?

    If customers find issues first, they might:

    • Rate our application as unreliable
    • Start exploring alternative solutions
    • Lose trust in our platform

    However, if we can identify and fix bugs before they impact users, we can:

    • Provide exceptional service
    • Build customer trust
    • Drive business growth

    This is why proactive issue detection is crucial for any successful application.

    graph TD
        A[Bug Occurs] --> B{Who Finds It First?}
        B -->|Customer| C[Bad Reviews<br/>Lost Trust<br/>Business Impact]
        B -->|DevOps Team| D[Quick Fix<br/>Happy Customers<br/>Business Growth]
    
        style C fill:#ffcccc
        style D fill:#ccffcc

    Understanding Monitoring

    What is Monitoring?

    Monitoring is the practice of collecting, processing, and alerting on predefined metrics to watch for known failure modes. It answers the question: “What is happening right now?”

    Real-World Example: E-Commerce Application

    Let’s consider an e-commerce microservices application:

    graph TB
        subgraph "E-Commerce Application"
            User[User] --> Search[Search Service]
            User --> Cart[Cart Service]
            User --> Checkout[Checkout Service]

            Search --> SearchDB[(Search DB)]
            Cart --> CartDB[(Cart DB)]
            Checkout --> Payment[Payment Service]
            Payment --> Gateway[Payment Gateway]

            Search -.->|"Slow Query"| Alert1[Alert: High Latency]
            Cart -.->|"High CPU"| Alert2[Alert: Resource Usage]
            Payment -.->|"Errors"| Alert3[Alert: Error Rate]
        end

    Scenario: Users experience latency when searching for products. The root cause might be a poorly performing database query in the Search Service.

    With monitoring, we can:

    • Track the average response time of the Search Service
    • Set up alerts when response time exceeds 500ms
    • Get notified before users complain

    The Four Golden Signals

    Every application should monitor these fundamental metrics (a short instrumentation sketch follows the four signals below):

    mindmap
      root((Four Golden Signals))
        Latency
          Response Time
          Time to First Byte
          Database Query Time
        Traffic
          Requests per Second
          Concurrent Users
          Bandwidth Usage
        Errors
          HTTP 4xx/5xx
          Failed Transactions
          Exception Rate
        Saturation
          CPU Usage
          Memory Usage
          Disk I/O
          Network Capacity

    1. Latency

    • Definition: Time taken to service a request
    • Example: Search results loading time
    • Impact: Slow responses = frustrated users switching to competitors

    2. Traffic

    • Definition: Number of requests your system receives
    • Example: During flash sales, traffic spikes dramatically
    • Impact: Helps ensure system can scale to meet demand

    3. Errors

    • Definition: Rate of failed requests
    • Examples: 404 (Page Not Found), 500 (Internal Server Error)
    • Impact: High error rates indicate system instability

    4. Saturation

    • Definition: How “full” your service is
    • Example: Server consistently at 90% CPU usage
    • Impact: Saturated resources lead to performance degradation
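
    The four signals above can be emitted directly from application code. Below is a minimal sketch using Python's prometheus_client; the metric names mirror those used in the alert rules in the next section, while the simulated request loop and the in-flight gauge are illustrative assumptions rather than part of this guide's stack.

    # Sketch: exposing the Four Golden Signals from a Python service.
    # Metric names match the alert rules below; the request simulation is illustrative.
    import random
    import time
    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    # Latency: a histogram produces the *_bucket series used by histogram_quantile()
    REQUEST_DURATION = Histogram(
        'search_request_duration_seconds', 'Search request duration in seconds'
    )
    # Traffic + Errors: one counter labelled by status code
    REQUESTS = Counter(
        'http_requests_total', 'Total HTTP requests', ['service', 'status']
    )
    # Saturation: in-flight work (host CPU/memory usually come from node_exporter)
    IN_FLIGHT = Gauge('search_in_flight_requests', 'Requests currently being processed')

    def handle_request():
        """Serve one (simulated) search request while recording all four signals."""
        IN_FLIGHT.inc()
        with REQUEST_DURATION.time():              # latency
            time.sleep(random.uniform(0.05, 0.3))
        status = '200' if random.random() > 0.02 else '500'
        REQUESTS.labels(service='search', status=status).inc()  # traffic + errors
        IN_FLIGHT.dec()

    if __name__ == '__main__':
        start_http_server(8000)                    # scrape metrics from :8000/metrics
        while True:
            handle_request()
    Python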

    Practical Monitoring Setup

    Here’s a Prometheus alerting rule example:

    # Prometheus Alert Rules
    groups:
      - name: ecommerce-alerts
        rules:
          - alert: HighSearchLatency
            expr: histogram_quantile(0.95, rate(search_request_duration_seconds_bucket[5m])) > 0.5
            for: 2m
            labels:
              severity: warning
              service: search
            annotations:
              summary: "Search service latency is high"
              description: "95th percentile latency is {{ $value }}s for 2+ minutes"
    
          - alert: HighErrorRate
            expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"
              description: "Error rate is {{ $value }} requests/sec"
    YAML

    Monitoring Best Practices

    graph LR
        A[Start Early] --> B[Define Key Metrics]
        B --> C[Set Meaningful Thresholds]
        C --> D[Create Clear Dashboards]
        D --> E[Limit Alert Fatigue]
        E --> F[Regular Review & Update]
    1. Start monitoring as early as possible – Don’t wait for production issues
    2. Focus on the Four Golden Signals first, then expand
    3. Make dashboards easy to understand – clear visualizations
    4. Avoid alert fatigue – only alert on actionable issues
    5. Set priority-based alerts – critical vs warning vs info

    Understanding Observability

    What is Observability?

    Observability is a measure of how well you can understand a system’s internal state from its external outputs. It answers: “Why is this happening?”

    Note: Observability is often abbreviated as O11y (11 characters between ‘O’ and ‘y’), similar to how Kubernetes is abbreviated as K8s.

    The Three Pillars of Observability

    graph TD
        subgraph "The Three Pillars"
            A[Logs<br/>What happened?]
            B[Metrics<br/>How much/many?]
            C[Traces<br/>Where did it go?]
        end
    
        A <--> B
        B <--> C
        C <--> A
    
        A --> D[Chronological Events<br/>Error Messages<br/>Debug Information]
        B --> E[Response Times<br/>Error Rates<br/>Resource Usage]
        C --> F[Request Journey<br/>Service Dependencies<br/>Performance Bottlenecks]

    1. Logs

    What they are: Chronological, timestamped records of discrete events

    Example: When search is slow, examine search service logs:

    {
      "timestamp": "2024-08-22T10:30:15Z",
      "level": "WARN",
      "service": "search-service",
      "message": "Database query took 2.3s",
      "query": "SELECT * FROM products WHERE category='electronics' ORDER BY popularity",
      "execution_time_ms": 2300,
      "user_id": "user123"
    }
    JSON

    Tools: Elasticsearch, Logstash, Loki, Fluentd
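
    A log entry like the one above can be produced with nothing more than the standard library. The sketch below is one minimal way to do it; the field names simply mirror the example entry and are not a required schema.

    # Sketch: emitting structured (JSON) logs with Python's standard logging module.
    # Field names follow the example log entry above.
    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO, format='%(message)s')
    logger = logging.getLogger('search-service')

    def log_slow_query(query, execution_time_ms, user_id):
        logger.warning(json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "level": "WARN",
            "service": "search-service",
            "message": f"Database query took {execution_time_ms / 1000:.1f}s",
            "query": query,
            "execution_time_ms": execution_time_ms,
            "user_id": user_id,
        }))

    log_slow_query(
        "SELECT * FROM products WHERE category='electronics' ORDER BY popularity",
        2300,
        "user123",
    )
    Python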

    2. Metrics

    What they are: Quantitative measurements over time

    Example: During a flash sale, response time metrics show:

    search_request_duration_seconds{quantile="0.95"} 0.85  # 95th percentile: 850ms
    search_request_duration_seconds{quantile="0.50"} 0.12  # 50th percentile: 120ms
    http_requests_total{service="payment",status="200"} 1500  # Successful payments
    INI

    Tools: Prometheus, InfluxDB, DataDog
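
    Metrics become most useful when you can query them. As a quick sketch, the snippet below asks the Prometheus HTTP API for the 95th-percentile search latency; it assumes Prometheus is reachable on localhost:9090, as in the compose file later in this guide.

    # Sketch: pulling p95 search latency from Prometheus via its HTTP API.
    # Assumes Prometheus runs on localhost:9090 (see the compose file below).
    import requests

    PROMQL = 'histogram_quantile(0.95, rate(search_request_duration_seconds_bucket[5m]))'

    resp = requests.get(
        'http://localhost:9090/api/v1/query',
        params={'query': PROMQL},
        timeout=5,
    )
    resp.raise_for_status()

    for series in resp.json()['data']['result']:
        # Each result carries its labels and a [timestamp, value] pair.
        print(series['metric'], 'p95 =', series['value'][1], 's')
    Python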

    3. Traces

    What they are: Journey of a single request through all services

    Example: Trace showing slow “Add to Cart” operation:

    sequenceDiagram
        participant U as User
        participant API as API Gateway
        participant Cart as Cart Service
        participant Inv as Inventory Service
        participant DB as Database
    
        U->>API: Add item to cart
        API->>Cart: POST /cart/add
        Cart->>Inv: Check inventory
        Note over Inv,DB: Slow query: 2.5s
        Inv->>DB: SELECT stock FROM inventory
        DB-->>Inv: stock: 10
        Inv-->>Cart: Available: 10
        Cart->>DB: INSERT into cart
        Cart-->>API: Success
        API-->>U: Item added
    
        Note over U,DB: Total time: 3.2s (2.5s in inventory check)

    Tools: Jaeger, Zipkin, Tempo, AWS X-Ray
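
    To show how spans like the ones in the diagram are produced, here is a minimal OpenTelemetry sketch with nested spans for the cart and inventory steps. It exports to the console so it runs without a Jaeger backend; the span names, attributes, and sleep times are illustrative.

    # Sketch: nested spans for an "add to cart" request with OpenTelemetry.
    # Console exporter keeps it standalone; swap in a Jaeger/OTLP exporter in practice.
    import time
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer('cart-service')

    def add_to_cart(item_id):
        with tracer.start_as_current_span('POST /cart/add') as span:
            span.set_attribute('cart.item_id', item_id)
            with tracer.start_as_current_span('inventory.check_stock') as inv_span:
                inv_span.set_attribute('db.statement', 'SELECT stock FROM inventory')
                time.sleep(0.25)          # stands in for the slow stock query
            with tracer.start_as_current_span('cart.insert'):
                time.sleep(0.02)          # stands in for the cart INSERT

    add_to_cart('sku-42')
    Python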

    Observability Best Practices

    1. Instrument everything – Add observability from the start
    2. Correlate data – Link logs, metrics, and traces with correlation IDs (see the sketch after this list)
    3. Control log volume – Implement log sampling and retention policies
    4. Use structured logging – JSON format for better parsing
    5. Add context – Include user IDs, request IDs, and business context
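
    One minimal way to implement practice 2 is a logging filter that stamps every log line with a per-request correlation ID; the header name, ID format, and JSON layout here are assumptions for illustration.

    # Sketch: attaching a correlation ID to every log line via a logging filter.
    # The ID would typically come from an incoming X-Request-ID header.
    import logging
    import uuid
    from contextvars import ContextVar

    correlation_id = ContextVar('correlation_id', default='-')

    class CorrelationIdFilter(logging.Filter):
        def filter(self, record):
            record.correlation_id = correlation_id.get()
            return True

    handler = logging.StreamHandler()
    handler.addFilter(CorrelationIdFilter())
    handler.setFormatter(logging.Formatter(
        '{"time": "%(asctime)s", "level": "%(levelname)s", '
        '"correlation_id": "%(correlation_id)s", "message": "%(message)s"}'
    ))
    logger = logging.getLogger('checkout-service')
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    def handle_checkout(request_id=None):
        # Reuse the caller-supplied ID (e.g. from X-Request-ID) or mint a new one.
        correlation_id.set(request_id or str(uuid.uuid4()))
        logger.info('checkout started')
        logger.info('payment authorized')

    handle_checkout()
    Python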

    Monitoring vs. Observability: Working Together

    The Workflow

    graph LR
        A[Monitoring Alert<br/>'Search is slow'] --> B[DevOps Engineer<br/>Investigates]
        B --> C[Observability Data<br/>Logs + Metrics + Traces]
        C --> D[Root Cause Found<br/>'Slow DB query']
        D --> E[Permanent Fix<br/>Query optimization]
    
        style A fill:#ffcccc
        style E fill:#ccffcc

    Key Differences

    Aspect    | Monitoring                     | Observability
    Purpose   | Alerts when something is wrong | Helps understand why it's wrong
    Questions | "What is happening?"           | "Why is it happening?"
    Data      | Predefined metrics             | Rich, contextual data
    Scope     | Known failure modes            | Unknown unknowns
    Action    | Reactive alerts                | Proactive investigation

    Real-World Example

    Scenario: E-commerce checkout process is failing

    Monitoring says:

    • Payment service error rate: 15%
    • Checkout completion rate: 85% (down from 98%)

    Observability reveals:

    1. Logs show payment gateway timeouts
    2. Metrics indicate increased latency to payment provider
    3. Traces reveal the payment service is retrying failed requests, causing cascading delays

    Root Cause: Payment provider is experiencing issues.

    Solution: Implement circuit breaker pattern and failover to backup payment provider.
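
    Below is a minimal sketch of that circuit-breaker-with-failover idea. The thresholds, timeout, and the two provider functions are placeholders for illustration, not a production implementation or a specific library.

    # Sketch: a tiny circuit breaker that fails over to a backup payment provider.
    # charge_primary/charge_backup stand in for real payment-gateway calls.
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = 0.0

        def is_open(self):
            if self.failures < self.failure_threshold:
                return False
            if time.time() - self.opened_at >= self.reset_timeout:
                self.failures = self.failure_threshold - 1   # half-open: allow one trial call
                return False
            return True

        def record_success(self):
            self.failures = 0

        def record_failure(self):
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()

    def charge_primary(amount_cents):
        """Placeholder for the real (currently flaky) payment-gateway call."""
        raise TimeoutError('primary payment gateway timed out')

    def charge_backup(amount_cents):
        """Placeholder for the backup provider call."""
        return f'charged {amount_cents} cents via backup provider'

    primary_breaker = CircuitBreaker()

    def charge(amount_cents):
        if not primary_breaker.is_open():
            try:
                result = charge_primary(amount_cents)
                primary_breaker.record_success()
                return result
            except TimeoutError:
                primary_breaker.record_failure()
        # Circuit open (or the call just failed): fall back to the backup provider.
        return charge_backup(amount_cents)

    print(charge(4999))
    Python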

    ๐Ÿฅ The Hospital Analogy

    Imagine a patient in a hospital after surgery:

    graph TD
        subgraph "Hospital Room"
            Patient[Patient]
            Monitor[Heart Rate Monitor]
            Chart[Patient Chart]
        end
    
        Monitor --> Alert[Heart Rate Spike Alert]
        Alert --> Doctor[Doctor Arrives]
        Doctor --> Chart
        Chart --> Data["Patient Data:<br/>• Recent medications<br/>• Activity logs<br/>• Allergy information<br/>• Lab results"]
        Data --> Diagnosis[Diagnosis:<br/>Allergic reaction to<br/>new medication]
        Diagnosis --> Treatment[Treatment:<br/>Stop medication<br/>Administer antihistamine]
    
        style Alert fill:#ffcccc
        style Treatment fill:#ccffcc
    • Monitoring (Heart Rate Monitor): Alerts when heart rate spikes – “Something is wrong!”
    • Observability (Patient Chart): Provides context to understand why – “Patient is having an allergic reaction to the new pain medication”

    Practical Implementation

    Setting Up Monitoring Stack

    # docker-compose.yml for monitoring stack
    version: '3.8'
    services:
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - ./alerts.yml:/etc/prometheus/alerts.yml
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--web.enable-lifecycle'
    
      grafana:
        image: grafana/grafana:latest
        ports:
          - "3000:3000"
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=admin
        volumes:
          - grafana-storage:/var/lib/grafana
    
      alertmanager:
        image: prom/alertmanager:latest
        ports:
          - "9093:9093"
        volumes:
          - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    
    volumes:
      grafana-storage:
    YAML

    Setting Up Observability Stack

    # docker-compose.yml for observability stack
    version: '3.8'
    services:
      # Logs
      elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:8.9.0
        environment:
          - discovery.type=single-node
          - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
        ports:
          - "9200:9200"
    
      kibana:
        image: docker.elastic.co/kibana/kibana:8.9.0
        ports:
          - "5601:5601"
        depends_on:
          - elasticsearch
    
      # Traces
      jaeger:
        image: jaegertracing/all-in-one:latest
        ports:
          - "16686:16686"  # UI
          - "14268:14268"  # HTTP collector
        environment:
          - COLLECTOR_OTLP_ENABLED=true
    
      # Metrics (reuse Prometheus from monitoring stack)
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
    YAML

    Application Instrumentation Example

    # Python Flask application with observability
    from flask import Flask, request
    import time
    import logging
    import json
    from prometheus_client import Counter, Histogram, generate_latest
    from opentelemetry import trace
    from opentelemetry.exporter.jaeger.thrift import JaegerExporter
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    
    app = Flask(__name__)
    
    # Configure structured logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(message)s'
    )
    logger = logging.getLogger(__name__)
    
    # Prometheus metrics
    REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
    REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')
    
    # Configure tracing
    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer(__name__)
    
    # Send spans to the Jaeger collector exposed on port 14268 in the compose file above
    # (14268 is the collector's HTTP endpoint, not the UDP agent port).
    jaeger_exporter = JaegerExporter(
        collector_endpoint="http://localhost:14268/api/traces",
    )
    
    span_processor = BatchSpanProcessor(jaeger_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)
    
    @app.before_request
    def before_request():
        request.start_time = time.time()
    
    @app.after_request
    def after_request(response):
        # Metrics
        REQUEST_COUNT.labels(
            method=request.method,
            endpoint=request.endpoint or 'unknown',
            status=response.status_code
        ).inc()
    
        REQUEST_LATENCY.observe(time.time() - request.start_time)
    
        # Structured logging
        log_data = {
            "timestamp": time.time(),
            "method": request.method,
            "path": request.path,
            "status": response.status_code,
            "latency": time.time() - request.start_time,
            "user_agent": request.headers.get('User-Agent'),
            "ip": request.remote_addr
        }
        logger.info(json.dumps(log_data))
    
        return response
    
    @app.route('/search')
    def search():
        with tracer.start_as_current_span("search_products") as span:
            query = request.args.get('q', '')
            span.set_attribute("search.query", query)
    
            # Simulate database query
            time.sleep(0.1)  # Simulate work
    
            # Log business context
            logger.info(json.dumps({
                "event": "product_search",
                "query": query,
                "results_count": 42,
                "user_id": request.headers.get('X-User-ID')
            }))
    
            return {"results": ["product1", "product2"], "query": query}
    
    @app.route('/metrics')
    def metrics():
        return generate_latest()
    
    if __name__ == '__main__':
        app.run(debug=True)
    Python
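
    To see all three signals flow out of the example app, a small driver like the one below could generate some traffic against it and then read back a slice of its metrics; the port matches Flask's default and the header matches the code above, while the search terms are arbitrary.

    # Sketch: exercising the instrumented Flask app above (assumed on localhost:5000),
    # then printing its request counters from the /metrics endpoint.
    import requests

    BASE = 'http://localhost:5000'

    for term in ['laptop', 'phone', 'headphones']:
        resp = requests.get(
            f'{BASE}/search',
            params={'q': term},
            headers={'X-User-ID': 'user123'},
            timeout=5,
        )
        print(term, '->', resp.status_code, resp.json()['results'])

    # The same requests now appear as metrics, structured logs, and traces.
    for line in requests.get(f'{BASE}/metrics', timeout=5).text.splitlines():
        if line.startswith('http_requests_total'):
            print(line)
    Python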

    Complete System Architecture

    Here’s how monitoring and observability work together in a complete system:

    graph TB
        subgraph "Application Layer"
            App1[E-commerce App]
            App2[Mobile API]
            App3[Search Service]
        end
    
        subgraph "Monitoring Stack"
            Prometheus[Prometheus]
            Grafana[Grafana]
            AlertManager[AlertManager]
        end
    
        subgraph "Observability Stack"
            Logs[ELK Stack]
            Traces[Jaeger]
            Metrics[Prometheus]
        end
    
        subgraph "Notification"
            Slack[Slack]
            PagerDuty[PagerDuty]
            Email[Email]
        end
    
        App1 --> Prometheus
        App2 --> Prometheus
        App3 --> Prometheus
    
        App1 --> Logs
        App2 --> Logs
        App3 --> Logs
    
        App1 --> Traces
        App2 --> Traces
        App3 --> Traces
    
        Prometheus --> Grafana
        Prometheus --> AlertManager
        AlertManager --> Slack
        AlertManager --> PagerDuty
        AlertManager --> Email
    
        Logs --> Investigation[Investigation]
        Traces --> Investigation
        Metrics --> Investigation

    Key Takeaways

    The Golden Rules

    1. Monitoring tells you WHAT is wrong
      • “API response time is 2 seconds”
      • “Error rate is 5%”
      • “CPU usage is 90%”
    2. Observability tells you WHY it’s wrong
      • “Database query is slow because of missing index”
      • “Errors are happening due to payment gateway timeout”
      • “CPU is high because of memory leak in user service”
    3. Together, they enable fast resolution
      • Get alerted quickly (monitoring)
      • Debug efficiently (observability)
      • Fix permanently (root cause analysis)

    Success Metrics

    Your monitoring and observability setup is successful when:

    • MTTR (Mean Time To Resolution) decreases (see the calculation sketch after this list)
    • MTBF (Mean Time Between Failures) increases
    • Customer-reported issues decrease
    • Team confidence in system health increases
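
    As a rough illustration of the first two numbers, the sketch below computes MTTR and MTBF from a small, made-up incident log; the timestamps and the simple definitions used (average repair time per incident, average time between incident starts) are assumptions for the example.

    # Sketch: computing MTTR and MTBF from a hypothetical incident log.
    # MTTR = average (resolved - started); MTBF = average gap between incident starts.
    from datetime import datetime

    incidents = [  # (started, resolved) pairs -- made-up data
        (datetime(2024, 8, 1, 10, 0), datetime(2024, 8, 1, 10, 45)),
        (datetime(2024, 8, 9, 14, 30), datetime(2024, 8, 9, 15, 0)),
        (datetime(2024, 8, 20, 3, 15), datetime(2024, 8, 20, 4, 40)),
    ]

    repair_minutes = [(end - start).total_seconds() / 60 for start, end in incidents]
    mttr = sum(repair_minutes) / len(repair_minutes)

    gaps_hours = [
        (incidents[i + 1][0] - incidents[i][0]).total_seconds() / 3600
        for i in range(len(incidents) - 1)
    ]
    mtbf = sum(gaps_hours) / len(gaps_hours)

    print(f'MTTR: {mttr:.0f} minutes')   # lower is better
    print(f'MTBF: {mtbf:.0f} hours')     # higher is better
    Python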

    Next Steps

    1. Start with the Four Golden Signals for monitoring
    2. Implement structured logging across all services
    3. Add distributed tracing to understand request flows
    4. Create meaningful dashboards that tell a story
    5. Practice incident response using your observability data
    6. Continuously improve your monitoring and observability based on what you learn from incidents

    Remember: Monitoring gets you to the problem, observability gets you through the problem!


    This guide provides a foundation for understanding and implementing both monitoring and observability. Start small, iterate often, and always focus on reducing the time between when something breaks and when it’s fixed.

