A Senior DevOps Engineer’s Complete Guide
Introduction: The Bug Discovery Race
As DevOps engineers, we face a critical question every day: Who discovers bugs first – us or our customers?
If customers find issues first, they might:
- Rate our application as unreliable
- Start exploring alternative solutions
- Lose trust in our platform
However, if we can identify and fix bugs before they impact users, we can:
- Provide exceptional service
- Build customer trust
- Drive business growth
This is why proactive issue detection is crucial for any successful application.
```mermaid
graph TD
    A[Bug Occurs] --> B{Who Finds It First?}
    B -->|Customer| C[Bad Reviews, Lost Trust, Business Impact]
    B -->|DevOps Team| D[Quick Fix, Happy Customers, Business Growth]
    style C fill:#ffcccc
    style D fill:#ccffcc
```

Understanding Monitoring
What is Monitoring?
Monitoring is the practice of collecting, processing, and alerting on predefined metrics to watch for known failure modes. It answers the question: “What is happening right now?”
Real-World Example: E-Commerce Application
Let’s consider an e-commerce microservices application:
```mermaid
graph TB
    subgraph "E-Commerce Application"
        User[User] --> Search[Search Service]
        User --> Cart[Cart Service]
        User --> Checkout[Checkout Service]
        Search --> SearchDB[(Search DB)]
        Cart --> CartDB[(Cart DB)]
        Checkout --> Payment[Payment Service]
        Payment --> Gateway[Payment Gateway]
        Search -.->|"Slow Query"| Alert1[Alert: High Latency]
        Cart -.->|"High CPU"| Alert2[Alert: Resource Usage]
        Payment -.->|"Errors"| Alert3[Alert: Error Rate]
    end
```

Scenario: Users experience latency when searching for products. The root cause might be a poorly performing database query in the Search Service.
With monitoring, we can:
- Track the average response time of the Search Service
- Set up alerts when response time exceeds 500ms (see the instrumentation sketch after this list)
- Get notified before users complain
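Such an alert threshold only works if the service exports a latency metric in the first place. Here is a minimal sketch using the Python prometheus_client library; the metric name `search_request_duration_seconds` and the simulated handler are illustrative, chosen to line up with the alert rule shown later in this guide.

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Assumed metric name, matching the alert rule used later in this guide.
SEARCH_LATENCY = Histogram(
    "search_request_duration_seconds",
    "Time spent serving a product search",
)

def handle_search(query: str):
    with SEARCH_LATENCY.time():                # records the duration into the histogram
        time.sleep(random.uniform(0.05, 0.7))  # stand-in for the real search logic
        return []

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for Prometheus to scrape
    while True:
        handle_search("electronics")
```

Prometheus scrapes the `/metrics` endpoint and can then evaluate the 500ms threshold over the resulting histogram buckets.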
The Four Golden Signals
Every application should monitor these fundamental metrics:
```mermaid
mindmap
  root((Four Golden Signals))
    Latency
      Response Time
      Time to First Byte
      Database Query Time
    Traffic
      Requests per Second
      Concurrent Users
      Bandwidth Usage
    Errors
      HTTP 4xx/5xx
      Failed Transactions
      Exception Rate
    Saturation
      CPU Usage
      Memory Usage
      Disk I/O
      Network Capacity
```

1. Latency
- Definition: Time taken to service a request
- Example: Search results loading time
- Impact: Slow responses = frustrated users switching to competitors
2. Traffic
- Definition: Number of requests your system receives
- Example: During flash sales, traffic spikes dramatically
- Impact: Helps ensure system can scale to meet demand
3. Errors
- Definition: Rate of failed requests
- Examples: 404 (Page Not Found), 500 (Internal Server Error)
- Impact: High error rates indicate system instability
4. Saturation
- Definition: How “full” your service is
- Example: Server consistently at 90% CPU usage
- Impact: Saturated resources lead to performance degradation
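As a rough sketch of how these four signals can be pulled out of a monitoring system, the snippet below queries Prometheus's HTTP API (`/api/v1/query`) with one PromQL expression per signal. The metric names and the `requests` dependency are assumptions; adjust them to whatever your services actually export.

```python
import requests  # assumes the `requests` package and a Prometheus server on localhost:9090

# One PromQL expression per golden signal; metric names are illustrative.
GOLDEN_SIGNALS = {
    "latency_p95": 'histogram_quantile(0.95, rate(search_request_duration_seconds_bucket[5m]))',
    "traffic_rps": 'sum(rate(http_requests_total[5m]))',
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "cpu_saturation": 'avg(1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))',
}

def query(expr, prometheus="http://localhost:9090"):
    # Prometheus instant query endpoint: /api/v1/query?query=<PromQL>
    resp = requests.get(f"{prometheus}/api/v1/query", params={"query": expr}, timeout=5)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

for name, expr in GOLDEN_SIGNALS.items():
    print(name, query(expr))
```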
Practical Monitoring Setup
Here’s a Prometheus alerting rule example:
```yaml
# Prometheus Alert Rules
groups:
  - name: ecommerce-alerts
    rules:
      - alert: HighSearchLatency
        expr: histogram_quantile(0.95, rate(search_request_duration_seconds_bucket[5m])) > 0.5
        for: 2m
        labels:
          severity: warning
          service: search
        annotations:
          summary: "Search service latency is high"
          description: "95th percentile latency is {{ $value }}s for 2+ minutes"
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests/sec"
```

Monitoring Best Practices
```mermaid
graph LR
    A[Start Early] --> B[Define Key Metrics]
    B --> C[Set Meaningful Thresholds]
    C --> D[Create Clear Dashboards]
    D --> E[Limit Alert Fatigue]
    E --> F[Regular Review & Update]
```

- Start monitoring as early as possible – don’t wait for production issues
- Focus on the Four Golden Signals first, then expand
- Make dashboards easy to understand – clear visualizations
- Avoid alert fatigue – only alert on actionable issues
- Set priority-based alerts – critical vs warning vs info (see the routing sketch after this list)
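One way to implement priority-based alerts, referenced in the last item above, is to route Alertmanager webhook notifications by their `severity` label. The sketch below is a hypothetical Flask receiver; the destination functions are placeholders for real Slack/PagerDuty/email integrations, and it assumes Alertmanager is configured with a webhook receiver pointing at `/alerts`.

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical destinations; in practice these would call the Slack/PagerDuty/email APIs.
def page_on_call(alert):
    print("PAGE:", alert["labels"].get("alertname"))

def post_to_slack(alert):
    print("SLACK:", alert["labels"].get("alertname"))

def log_only(alert):
    print("INFO:", alert["labels"].get("alertname"))

ROUTES = {"critical": page_on_call, "warning": post_to_slack, "info": log_only}

@app.route("/alerts", methods=["POST"])
def alerts():
    # Alertmanager's webhook payload carries an "alerts" list, each with "labels"/"annotations".
    for alert in request.get_json().get("alerts", []):
        severity = alert.get("labels", {}).get("severity", "info")
        ROUTES.get(severity, log_only)(alert)
    return "", 204

if __name__ == "__main__":
    app.run(port=5001)
```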
Understanding Observability
What is Observability?
Observability is a measure of how well you can understand a system’s internal state from its external outputs. It answers: “Why is this happening?”
Note: Observability is often abbreviated as O11y (11 characters between ‘O’ and ‘y’), similar to how Kubernetes is abbreviated as K8s.
The Three Pillars of Observability
```mermaid
graph TD
    subgraph "The Three Pillars"
        A[Logs: What happened?]
        B[Metrics: How much/many?]
        C[Traces: Where did it go?]
    end
    A <--> B
    B <--> C
    C <--> A
    A --> D[Chronological Events, Error Messages, Debug Information]
    B --> E[Response Times, Error Rates, Resource Usage]
    C --> F[Request Journey, Service Dependencies, Performance Bottlenecks]
```

1. Logs
What they are: Chronological, timestamped records of discrete events
Example: When search is slow, examine search service logs:
```json
{
  "timestamp": "2024-08-22T10:30:15Z",
  "level": "WARN",
  "service": "search-service",
  "message": "Database query took 2.3s",
  "query": "SELECT * FROM products WHERE category='electronics' ORDER BY popularity",
  "execution_time_ms": 2300,
  "user_id": "user123"
}
```

Tools: Elasticsearch, Logstash, Loki, Fluentd
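A structured log like the one above does not have to be hand-built; a small JSON formatter on top of Python's standard logging module is enough. This is an illustrative sketch: the `extra_fields` convention and the hard-coded service name are assumptions, not part of any particular logging library.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object per line, including any extra fields."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "search-service",   # assumed service name
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "extra_fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("search-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning(
    "Database query took 2.3s",
    extra={"extra_fields": {
        "query": "SELECT * FROM products WHERE category='electronics'",
        "execution_time_ms": 2300,
        "user_id": "user123",
    }},
)
```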
2. Metrics
What they are: Quantitative measurements over time
Example: During a flash sale, response time metrics show:
```text
search_request_duration_seconds{quantile="0.95"} 0.85    # 95th percentile: 850ms
search_request_duration_seconds{quantile="0.50"} 0.12    # 50th percentile: 120ms
http_requests_total{service="payment",status="200"} 1500 # Successful payments
```

Tools: Prometheus, InfluxDB, DataDog
3. Traces
What they are: Journey of a single request through all services
Example: Trace showing slow “Add to Cart” operation:
```mermaid
sequenceDiagram
    participant U as User
    participant API as API Gateway
    participant Cart as Cart Service
    participant Inv as Inventory Service
    participant DB as Database
    U->>API: Add item to cart
    API->>Cart: POST /cart/add
    Cart->>Inv: Check inventory
    Note over Inv,DB: Slow query: 2.5s
    Inv-->>DB: SELECT stock FROM inventory
    DB-->>Inv: stock: 10
    Inv->>Cart: Available: 10
    Cart->>DB: INSERT into cart
    Cart->>API: Success
    API->>U: Item added
    Note over U,DB: Total time: 3.2s (2.5s in inventory check)
```

Tools: Jaeger, Zipkin, Tempo, AWS X-Ray
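To give a feel for how such a trace is produced, here is a minimal OpenTelemetry sketch with nested spans mirroring the add-to-cart flow above. It prints spans to the console instead of shipping them to Jaeger, the sleeps stand in for real work, and the span names and attributes are illustrative.

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export finished spans to stdout so the nesting is visible without any backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("cart-service")

def add_to_cart(item_id: str, quantity: int):
    with tracer.start_as_current_span("add_to_cart") as span:
        span.set_attribute("cart.item_id", item_id)
        span.set_attribute("cart.quantity", quantity)
        with tracer.start_as_current_span("check_inventory") as inv_span:
            inv_span.set_attribute("db.statement", "SELECT stock FROM inventory WHERE item_id = ?")
            time.sleep(0.1)   # stands in for the slow inventory query
        with tracer.start_as_current_span("insert_cart_row"):
            time.sleep(0.01)  # stands in for the cart insert

add_to_cart("sku-123", 1)
```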
Observability Best Practices
- Instrument everything – Add observability from the start
- Correlate data – Link logs, metrics, and traces with correlation IDs (see the sketch after this list)
- Control log volume – Implement log sampling and retention policies
- Use structured logging – JSON format for better parsing
- Add context – Include user IDs, request IDs, and business context
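For the correlation-ID practice above, a common approach in Python is a `contextvars` variable plus a logging filter, so every log line emitted while handling a request carries the same ID. This is a minimal sketch; the field names and the generated UUID are illustrative.

```python
import logging
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter(
    '{"ts": "%(asctime)s", "level": "%(levelname)s", '
    '"correlation_id": "%(correlation_id)s", "message": "%(message)s"}'
))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # In a web service you would reuse an incoming X-Request-ID header; here we generate one.
    correlation_id.set(str(uuid.uuid4()))
    logger.info("checkout started")
    logger.info("payment authorized")

handle_request()
```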
Monitoring vs. Observability: Working Together
The Workflow
```mermaid
graph LR
    A[Monitoring Alert: 'Search is slow'] --> B[DevOps Engineer Investigates]
    B --> C[Observability Data: Logs + Metrics + Traces]
    C --> D[Root Cause Found: 'Slow DB query']
    D --> E[Permanent Fix: Query optimization]
    style A fill:#ffcccc
    style E fill:#ccffcc
```

Key Differences
| Aspect | Monitoring | Observability |
|---|---|---|
| Purpose | Alerts when something is wrong | Helps understand why it’s wrong |
| Questions | “What is happening?” | “Why is it happening?” |
| Data | Predefined metrics | Rich, contextual data |
| Scope | Known failure modes | Unknown unknowns |
| Action | Reactive alerts | Proactive investigation |
Real-World Example
Scenario: E-commerce checkout process is failing
Monitoring says:
- Payment service error rate: 15%
- Checkout completion rate: 85% (down from 98%)
Observability reveals:
- Logs show payment gateway timeouts
- Metrics indicate increased latency to payment provider
- Traces reveal the payment service is retrying failed requests, causing cascading delays
Root Cause: Payment provider is experiencing issues.
Solution: Implement circuit breaker pattern and failover to backup payment provider.
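A circuit breaker can be sketched in a few lines: stop calling the primary payment provider after repeated failures, fall back to a backup, and probe the primary again after a cooldown. The class below is a simplified illustration, not a production implementation; `primary` and `backup` are hypothetical callables wrapping the real provider APIs.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, retry after a cooldown."""
    def __init__(self, max_failures=3, reset_seconds=30):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_seconds:
            # Half-open: let one request through to probe the provider again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()

breaker = CircuitBreaker()

def charge(amount, primary, backup):
    """Try the primary provider unless its breaker is open, then fail over to the backup."""
    if breaker.allow():
        try:
            result = primary(amount)
            breaker.record(True)
            return result
        except Exception:
            breaker.record(False)
    return backup(amount)
```

The key design point is that an open breaker fails over immediately instead of letting retries pile up and cascade delays through the checkout path.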
The Hospital Analogy
Imagine a patient in a hospital after surgery:
```mermaid
graph TD
    subgraph "Hospital Room"
        Patient[Patient]
        Monitor[Heart Rate Monitor]
        Chart[Patient Chart]
    end
    Monitor --> Alert[Heart Rate Spike Alert]
    Alert --> Doctor[Doctor Arrives]
    Doctor --> Chart
    Chart --> Data["Patient Data: recent medications, activity logs, allergy information, lab results"]
    Data --> Diagnosis[Diagnosis: allergic reaction to new medication]
    Diagnosis --> Treatment[Treatment: stop medication, administer antihistamine]
    style Alert fill:#ffcccc
    style Treatment fill:#ccffcc
```

- Monitoring (Heart Rate Monitor): Alerts when heart rate spikes – “Something is wrong!”
- Observability (Patient Chart): Provides context to understand why – “Patient is having an allergic reaction to the new pain medication”
Practical Implementation
Setting Up Monitoring Stack
```yaml
# docker-compose.yml for monitoring stack
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  grafana-storage:
```

Setting Up Observability Stack
```yaml
# docker-compose.yml for observability stack
version: '3.8'

services:
  # Logs
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.9.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.9.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

  # Traces
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "14268:14268" # HTTP collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  # Metrics (reuse Prometheus from the monitoring stack)
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
```

Application Instrumentation Example
```python
# Python Flask application with observability
from flask import Flask, request
import time
import logging
import json
from prometheus_client import Counter, Histogram, generate_latest
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = Flask(__name__)

# Configure structured logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)
logger = logging.getLogger(__name__)

# Prometheus metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency')

# Configure tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Send spans to Jaeger's HTTP collector (port 14268 in the docker-compose above).
# The agent_port option expects Jaeger's UDP agent (6831), so use collector_endpoint here.
jaeger_exporter = JaegerExporter(
    collector_endpoint="http://localhost:14268/api/traces",
)
span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    # Metrics
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.observe(time.time() - request.start_time)

    # Structured logging
    log_data = {
        "timestamp": time.time(),
        "method": request.method,
        "path": request.path,
        "status": response.status_code,
        "latency": time.time() - request.start_time,
        "user_agent": request.headers.get('User-Agent'),
        "ip": request.remote_addr
    }
    logger.info(json.dumps(log_data))
    return response

@app.route('/search')
def search():
    with tracer.start_as_current_span("search_products") as span:
        query = request.args.get('q', '')
        span.set_attribute("search.query", query)

        # Simulate database query
        time.sleep(0.1)  # Simulate work

        # Log business context
        logger.info(json.dumps({
            "event": "product_search",
            "query": query,
            "results_count": 42,
            "user_id": request.headers.get('X-User-ID')
        }))
        return {"results": ["product1", "product2"], "query": query}

@app.route('/metrics')
def metrics():
    return generate_latest()

if __name__ == '__main__':
    app.run(debug=True)
```

Complete System Architecture
Here’s how monitoring and observability work together in a complete system:
```mermaid
graph TB
    subgraph "Application Layer"
        App1[E-commerce App]
        App2[Mobile API]
        App3[Search Service]
    end
    subgraph "Monitoring Stack"
        Prometheus[Prometheus]
        Grafana[Grafana]
        AlertManager[AlertManager]
    end
    subgraph "Observability Stack"
        Logs[ELK Stack]
        Traces[Jaeger]
        Metrics[Prometheus]
    end
    subgraph "Notification"
        Slack[Slack]
        PagerDuty[PagerDuty]
        Email[Email]
    end
    App1 --> Prometheus
    App2 --> Prometheus
    App3 --> Prometheus
    App1 --> Logs
    App2 --> Logs
    App3 --> Logs
    App1 --> Traces
    App2 --> Traces
    App3 --> Traces
    Prometheus --> Grafana
    Prometheus --> AlertManager
    AlertManager --> Slack
    AlertManager --> PagerDuty
    AlertManager --> Email
    Logs --> Investigation[Investigation]
    Traces --> Investigation
    Metrics --> Investigation
```

Key Takeaways
The Golden Rules
- Monitoring tells you WHAT is wrong
- “API response time is 2 seconds”
- “Error rate is 5%”
- “CPU usage is 90%”
- Observability tells you WHY it’s wrong
- “Database query is slow because of missing index”
- “Errors are happening due to payment gateway timeout”
- “CPU is high because of memory leak in user service”
- Together, they enable fast resolution
- Get alerted quickly (monitoring)
- Debug efficiently (observability)
- Fix permanently (root cause analysis)
Success Metrics
Your monitoring and observability setup is successful when:
- MTTR (Mean Time To Resolution) decreases (see the calculation sketch after this list)
- MTBF (Mean Time Between Failures) increases
- Customer-reported issues decrease
- Team confidence in system health increases
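MTTR and MTBF fall straight out of your incident records. The sketch below shows the arithmetic on a made-up incident list: MTTR averages the outage durations, MTBF averages the healthy time between incidents.

```python
from datetime import datetime, timedelta

# Each incident: (start of outage, time service was restored). Timestamps are made up.
incidents = [
    (datetime(2024, 8, 1, 10, 0), datetime(2024, 8, 1, 10, 45)),
    (datetime(2024, 8, 9, 2, 30), datetime(2024, 8, 9, 2, 50)),
    (datetime(2024, 8, 20, 16, 0), datetime(2024, 8, 20, 17, 10)),
]

# MTTR: average time from outage start to restoration.
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

# MTBF: average uptime between the end of one incident and the start of the next.
gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mtbf = sum(gaps, timedelta()) / len(gaps)

print(f"MTTR: {mttr}, MTBF: {mtbf}")
```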
Next Steps
- Start with the Four Golden Signals for monitoring
- Implement structured logging across all services
- Add distributed tracing to understand request flows
- Create meaningful dashboards that tell a story
- Practice incident response using your observability data
- Continuously improve your monitoring and observability based on what you learn from incidents
Remember: Monitoring gets you to the problem, observability gets you through the problem!
This guide provides a foundation for understanding and implementing both monitoring and observability. Start small, iterate often, and always focus on reducing the time between when something breaks and when it’s fixed.