Grafana: From Beginner to Expert
Table of Contents
- Introduction to Grafana
- Getting Started
- Understanding Data Sources
- Creating Your First Dashboard
- Visualization Types and Best Practices
- Advanced Querying
- Alerting and Notifications
- User Management and Security
- Plugins and Extensions
- Performance Optimization
- Advanced Administration
- Enterprise Features
- Grafana in Production
- Troubleshooting and Best Practices
1. Introduction to Grafana
What is Grafana?
Grafana is an open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts for the web when connected to supported data sources. Grafana is commonly used for monitoring and observability of infrastructure, applications, and business metrics.
Key Features
- Multi-platform dashboards: Create rich, interactive dashboards
- Multiple data sources: Connect to various databases and services
- Alerting: Set up intelligent alerts with multiple notification channels
- Annotations: Add context to your graphs with rich events
- Ad hoc filters: Create dynamic dashboards with template variables
- Mixed data sources: Combine data from multiple sources in a single graph
Grafana Architecture
graph TB
A[Users/Browsers] --> B[Grafana Frontend]
B --> C[Grafana Backend/API]
C --> D[Authentication Provider]
C --> E[Database - SQLite/MySQL/PostgreSQL]
C --> F[Data Source Plugins]
F --> G[Prometheus]
F --> H[InfluxDB]
F --> I[Elasticsearch]
F --> J[CloudWatch]
F --> K[MySQL/PostgreSQL]
F --> L[Other Data Sources]
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#fff3e0
style E fill:#e8f5e8
style F fill:#fff8e1
Use Cases
- Infrastructure Monitoring: Server metrics, network performance, system health
- Application Performance Monitoring (APM): Response times, error rates, throughput
- Business Intelligence: KPIs, sales metrics, user analytics
- IoT Monitoring: Sensor data, environmental monitoring
- Log Analysis: Error tracking, security monitoring
2. Getting Started
Installation Methods
Docker Installation (Recommended for Beginners)
# Run Grafana in Docker
docker run -d -p 3000:3000 --name grafana grafana/grafana-enterprise
# With persistent storage
docker run -d -p 3000:3000 --name grafana \
-v grafana-storage:/var/lib/grafana \
grafana/grafana-enterprise
Windows Installation
# Using Chocolatey
choco install grafana
# Or download MSI from grafana.com
# Run the installer and follow the wizard
Configuration Files
graph LR
A[grafana.ini] --> B[Server Configuration]
A --> C[Database Settings]
A --> D[Security Settings]
A --> E[Auth Configuration]
A --> F[SMTP Settings]
style A fill:#ffeb3b
style B fill:#4caf50
style C fill:#2196f3
style D fill:#f44336
style E fill:#9c27b0
style F fill:#ff9800
First Time Setup
- Access Grafana: Navigate to http://localhost:3000
- Default Login: Username: admin, Password: admin
- Change Password: You’ll be prompted to change the default password
- Initial Configuration: Set up your first data source
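Before going further, it is worth confirming the server is healthy; a quick check against Grafana's health endpoint, assuming the default port and no TLS:
# Verify that Grafana is up and its database connection is healthy
curl -s http://localhost:3000/api/health
# A healthy instance returns a small JSON payload containing "database": "ok"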
Basic Configuration
# grafana.ini basic configuration
[server]
http_port = 3000
domain = localhost
[database]
type = sqlite3
path = grafana.db
[security]
admin_user = admin
admin_password = admin
secret_key = your_secret_key
[smtp]
enabled = false
host = localhost:587
user =
password =
3. Understanding Data Sources
What are Data Sources?
Data sources are the backbone of Grafana. They define where your data comes from and how Grafana should query it.
Data Source Hierarchy
graph TD
A[Grafana Instance] --> B[Data Source 1]
A --> C[Data Source 2]
A --> D[Data Source N]
B --> E[Prometheus]
C --> F[InfluxDB]
D --> G[MySQL]
E --> H[Metrics Data]
F --> I[Time Series Data]
G --> J[Relational Data]
H --> K[Dashboard]
I --> K
J --> K
style A fill:#e3f2fd
style K fill:#f1f8e9
Popular Data Sources
Time Series Databases
- Prometheus: Metrics collection and alerting
- InfluxDB: High-performance time series database
- Graphite: Scalable real-time graphing
Relational Databases
- MySQL: Popular open-source database
- PostgreSQL: Advanced open-source database
- Microsoft SQL Server: Enterprise database
Cloud Services
- Amazon CloudWatch: AWS monitoring service
- Azure Monitor: Microsoft Azure monitoring
- Google Cloud Monitoring: GCP monitoring service
Log Management
- Elasticsearch: Search and analytics engine
- Loki: Log aggregation system by Grafana
Adding Your First Data Source
Example: Adding Prometheus
# prometheus.yml configuration
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'grafana'
static_configs:
- targets: ['localhost:3000']
Steps to add Prometheus data source:
- Go to Configuration → Data Sources
- Click “Add data source”
- Select Prometheus
- Configure URL: http://localhost:9090
- Click “Save & Test”
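The same data source can also be added programmatically through Grafana's HTTP API; a minimal sketch, assuming Grafana on localhost:3000 and an admin API key in $API_KEY (a placeholder):
# Create a Prometheus data source via the HTTP API
curl -X POST http://localhost:3000/api/datasources \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'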
Data Source Configuration Flow
sequenceDiagram
participant U as User
participant G as Grafana
participant DS as Data Source
U->>G: Add Data Source
G->>U: Show Configuration Form
U->>G: Enter Connection Details
G->>DS: Test Connection
DS->>G: Connection Response
G->>U: Show Test Results
U->>G: Save Configuration
G->>G: Store Data Source Config
4. Creating Your First Dashboard
Dashboard Concepts
A dashboard is a collection of panels arranged in a grid. Each panel contains a visualization of data from one or more data sources.
Dashboard Structure
graph TB
A[Dashboard] --> B[Row 1]
A --> C[Row 2]
A --> D[Row N]
B --> E[Panel 1]
B --> F[Panel 2]
C --> G[Panel 3]
C --> H[Panel 4]
D --> I[Panel N]
style A fill:#e8f5e8
style E fill:#fff3e0
style F fill:#fff3e0
style G fill:#fff3e0
style H fill:#fff3e0
style I fill:#fff3e0
Creating a Dashboard
- Navigate to Dashboards: Click the “+” icon and select “Dashboard”
- Add First Panel: Click “Add new panel”
- Select Data Source: Choose your configured data source
- Write Query: Enter your query (syntax depends on data source)
- Choose Visualization: Select appropriate chart type
- Configure Panel: Set title, description, and options
- Save Dashboard: Give it a meaningful name
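Dashboards can also be created or updated programmatically, which becomes useful once you start templating them; a minimal sketch against the dashboard API, assuming an API key with Editor rights in $API_KEY:
# Create a new, empty dashboard from a JSON model
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dashboard": {
      "id": null,
      "uid": null,
      "title": "API Created Dashboard",
      "panels": []
    },
    "overwrite": false
  }'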
Basic Panel Types
Graph Panel
graph LR
A[Time Series Data] --> B[Line Chart]
A --> C[Bar Chart]
A --> D[Area Chart]
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#fff3e0
style D fill:#e8f5e8
Stat Panel
graph TB
A[Single Value] --> B[Current Value]
A --> C[Min/Max/Avg]
A --> D[Threshold Colors]
style A fill:#fff8e1
style B fill:#f1f8e9
style C fill:#fce4ec
style D fill:#e0f2f1
Example: System Monitoring Dashboard
Let’s create a basic system monitoring dashboard:
Panel 1: CPU Usage
# Prometheus query for CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Panel 2: Memory Usage
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Panel 3: Disk Usage
# Disk usage percentage
100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes)
Dashboard JSON Model
{
"dashboard": {
"id": null,
"title": "System Overview",
"tags": ["monitoring", "system"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"refId": "A"
}
]
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "10s"
}
}
5. Visualization Types and Best Practices
Panel Types Overview
mindmap
root((Panel Types))
Time Series
Graph
State Timeline
Status History
Stats
Stat
Gauge
Bar Gauge
Tables
Table
Logs
Text
Text
News
Misc
Heatmap
Pie Chart
Node Graph
Time Series Visualizations
Graph Panel
- Use Case: Showing trends over time
- Best For: Metrics that change continuously
- Examples: CPU usage, memory consumption, network traffic
graph LR
A[Raw Time Series] --> B[Line Graph]
A --> C[Area Graph]
A --> D[Bar Graph]
A --> E[Points Graph]
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#fff3e0
style D fill:#e8f5e8
style E fill:#fce4ec
State Timeline
- Use Case: Showing state changes over time
- Best For: Boolean or categorical data
- Examples: Service status, deployment events
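A convenient source for a state timeline is a metric that only takes a few discrete values; a small sketch using the standard Prometheus up metric (the job label value is illustrative):
# 1 = target reachable, 0 = target down; maps naturally onto a state timeline
up{job="node"}
# Smooth out brief flaps by taking the minimum over each evaluation window
min_over_time(up{job="node"}[5m])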
Single Value Visualizations
Stat Panel
graph TB
A[Stat Panel] --> B[Value Display]
A --> C[Sparkline]
A --> D[Thresholds]
B --> E[Current Value]
B --> F[Change from Previous]
B --> G[Percentage Change]
style A fill:#fff8e1
style B fill:#f1f8e9
style C fill:#e8f5e8
style D fill:#fce4ec
Gauge Panel
- Use Case: Showing values within a range
- Best For: Percentages, utilization metrics
- Examples: CPU usage, disk space, memory utilization
Table Visualizations
Table Panel
graph TB
A[Table Panel] --> B[Columns]
A --> C[Rows]
A --> D[Sorting]
A --> E[Filtering]
B --> F[Value Columns]
B --> G[Time Columns]
B --> H[String Columns]
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#fff3e0
style D fill:#e8f5e8
style E fill:#fce4ec
Choosing the Right Visualization
Decision Tree
graph TD
A[What type of data?] --> B[Time Series]
A --> C[Single Value]
A --> D[Multiple Values]
A --> E[Text/Logs]
B --> F[Trending?]
F --> G[Yes - Graph Panel]
F --> H[No - State Timeline]
C --> I[Range/Percentage?]
I --> J[Yes - Gauge]
I --> K[No - Stat Panel]
D --> L[Tabular Data?]
L --> M[Yes - Table Panel]
L --> N[No - Multiple Stats]
E --> O[Logs Panel]
style A fill:#ffeb3b
style G fill:#4caf50
style H fill:#4caf50
style J fill:#4caf50
style K fill:#4caf50
style M fill:#4caf50
style N fill:#4caf50
style O fill:#4caf50
Best Practices for Visualizations
Color Usage
- Consistent Color Scheme: Use a consistent palette across dashboards
- Meaningful Colors: Red for errors, green for success, yellow for warnings
- Accessibility: Consider colorblind-friendly palettes
Chart Design
- Clear Titles: Use descriptive panel titles
- Appropriate Y-Axis: Set meaningful min/max values
- Legend: Include legends when multiple series are shown
- Units: Always specify units for metrics
Performance Considerations
- Query Optimization: Use efficient queries
- Time Ranges: Don’t query more data than necessary
- Refresh Rates: Balance between freshness and performance
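One practical way to balance freshness against query cost is to let Grafana size the query window for you; a small sketch using the built-in $__rate_interval and $__interval variables with a Prometheus data source (metric names are illustrative):
# Rate window sized automatically from the panel resolution and scrape interval
sum(rate(http_requests_total[$__rate_interval])) by (status)
# Aggregate a gauge at the panel's effective step instead of raw resolution
avg_over_time(cpu_usage_percent[$__interval])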
Advanced Visualization Features
Value Mappings
{
"valueMaps": [
{
"value": "0",
"text": "Down"
},
{
"value": "1",
"text": "Up"
}
]
}
Thresholds
{
"thresholds": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 80
},
{
"color": "red",
"value": 90
}
]
}
6. Advanced Querying
Query Language Fundamentals
Different data sources use different query languages. Understanding these is crucial for creating effective dashboards.
Prometheus Queries (PromQL)
Basic PromQL Concepts
graph TB
A[PromQL Query] --> B[Metric Name]
A --> C[Label Selectors]
A --> D[Functions]
A --> E[Operators]
B --> F[up]
B --> G[cpu_usage_percent]
B --> H[http_requests_total]
C --> I[job='prometheus']
C --> J[instance='localhost:9090']
D --> K["rate()"]
D --> L["avg()"]
D --> M["sum()"]
E --> N[+, -, *, /]
E --> O[and, or, unless]
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#fff3e0
style D fill:#e8f5e8
style E fill:#fce4ec
Common PromQL Patterns
# Basic metric selection
up
# With label filtering
up{job="prometheus"}
# Rate calculation for counters
rate(http_requests_total[5m])
# Aggregation
sum(rate(http_requests_total[5m])) by (status)
# Mathematical operations
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
InfluxDB Queries (InfluxQL)
Basic InfluxQL Structure
graph LR
A[SELECT] --> B[field/function]
C[FROM] --> D[measurement]
E[WHERE] --> F[conditions]
G[GROUP BY] --> H[time/tags]
style A fill:#4caf50
style C fill:#2196f3
style E fill:#ff9800
style G fill:#9c27b0
InfluxQL Examples
-- Basic query
SELECT value FROM cpu_usage WHERE time > now() - 1h
-- With aggregation
SELECT mean(value) FROM cpu_usage WHERE time > now() - 1h GROUP BY time(5m)
-- Multiple fields
SELECT mean(cpu), mean(memory) FROM system_stats
WHERE time > now() - 1h GROUP BY time(1m)
-- With conditions
SELECT * FROM http_requests
WHERE status_code = 200 AND time > now() - 1h
SQL Queries for Relational Databases
Query Structure for Time Series Data
-- Basic time series query
SELECT
timestamp,
value,
metric_name
FROM metrics
WHERE timestamp > NOW() - INTERVAL 1 HOUR
ORDER BY timestamp;
-- Aggregated data
SELECT
DATE_TRUNC('minute', timestamp) as time,
AVG(value) as avg_value,
MAX(value) as max_value
FROM metrics
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY DATE_TRUNC('minute', timestamp)
ORDER BY time;
Query Optimization Techniques
Performance Best Practices
graph TD
A[Query Optimization] --> B[Time Range Limitation]
A --> C[Index Usage]
A --> D[Aggregation Strategy]
A --> E[Query Caching]
B --> F[Use appropriate time windows]
C --> G[Ensure proper indexing]
D --> H[Pre-aggregate when possible]
E --> I[Enable query caching]
style A fill:#ffeb3b
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
Template Variables
Template variables make dashboards dynamic and reusable.
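Once defined, a variable is referenced in queries as $name; a small sketch, assuming a single-value variable called server and a multi-value variable called servers (both defined later in this section):
# Single-value variable: exact label match
up{instance="$server"}
# Multi-value variable: use a regex matcher so every selected value is included
up{instance=~"$servers"}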
Variable Types
graph TB
A[Template Variables] --> B[Query Variables]
A --> C[Custom Variables]
A --> D[Constant Variables]
A --> E[Data Source Variables]
A --> F[Interval Variables]
B --> G[Based on data source queries]
C --> H[Manually defined values]
D --> I[Fixed values]
E --> J[Available data sources]
F --> K[Time intervals]
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#fff3e0
style D fill:#e8f5e8
style E fill:#fce4ec
style F fill:#f1f8e9
Example: Server Selection Variable
# Query variable for server selection
label_values(up, instance)
Usage in panel query:
up{instance="$server"}
Multi-Value Variables
{
"name": "servers",
"type": "query",
"query": "label_values(up, instance)",
"multi": true,
"includeAll": true,
"allValue": ".*"
}
Advanced Query Techniques
Subqueries and Complex Aggregations
# Average of maximums
avg(
max_over_time(
cpu_usage_percent[1h:5m]
)
) by (instance)
# Rate of rate (acceleration)
rate(rate(http_requests_total[5m])[5m:])
Query Functions Reference
| Function | Purpose | Example |
|---|---|---|
| rate() | Calculate per-second rate | rate(counter[5m]) |
| increase() | Calculate increase over time | increase(counter[1h]) |
| avg_over_time() | Average over time range | avg_over_time(gauge[1h]) |
| predict_linear() | Linear prediction | predict_linear(metric[1h], 3600) |
| histogram_quantile() | Calculate quantiles | histogram_quantile(0.95, rate(bucket[5m])) |
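These functions are frequently combined; a short sketch computing a 95th percentile latency and a 5xx error ratio, assuming conventional http_request_duration_seconds_bucket and http_requests_total metrics:
# 95th percentile request latency per job
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
)
# Share of requests returning 5xx over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))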
7. Alerting and Notifications
Alerting Architecture
graph TB
A[Alert Rules] --> B[Alert Manager]
B --> C[Notification Channels]
A --> D[Query Evaluation]
D --> E[Condition Check]
E --> F[State Transition]
F --> G[Alert Firing]
G --> H[Slack]
G --> I[Email]
G --> J[PagerDuty]
G --> K[Webhook]
G --> L[Teams]
style A fill:#ffeb3b
style B fill:#ff9800
style C fill:#4caf50
style G fill:#f44336
Alert States
stateDiagram-v2
[*] --> No_Data : Initial state
No_Data --> Alerting : Condition met
No_Data --> OK : Data received, condition not met
OK --> Alerting : Condition met
Alerting --> OK : Condition not met
Alerting --> No_Data : No data received
OK --> No_Data : No data received
No_Data : No Data
OK : OK
Alerting : Alerting
Creating Alert Rules
Basic Alert Configuration
- Query: Define the metric query
- Condition: Set the alert condition
- Evaluation: Configure how often to check
- Notifications: Choose notification channels
Example: High CPU Alert
{
"alert": {
"name": "High CPU Usage",
"message": "CPU usage is above 80%",
"frequency": "10s",
"conditions": [
{
"query": {
"queryType": "",
"refId": "A",
"model": {
"expr": "avg(cpu_usage_percent) by (instance)",
"intervalMs": 1000,
"maxDataPoints": 43200
}
},
"reducer": {
"type": "last",
"params": []
},
"evaluator": {
"params": [80],
"type": "gt"
}
}
]
}
}
Notification Channels
Email Configuration
{
"name": "email-alerts",
"type": "email",
"settings": {
"addresses": "admin@company.com;ops@company.com",
"subject": "Grafana Alert: {{ .GroupLabels.alertname }}",
"body": "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
}
}
Slack Configuration
{
"name": "slack-alerts",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
"channel": "#alerts",
"username": "Grafana",
"title": "{{ .GroupLabels.alertname }}",
"text": "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
}
}
Alert Templates
Custom Message Templates
{{ define "alert.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}
({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}
{{ end }}
{{ define "alert.message" }}
{{ range .Alerts }}
**Alert:** {{ .Annotations.summary }}
**Description:** {{ .Annotations.description }}
**Graph:** {{ .GeneratorURL }}
**Details:**
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
{{ end }}
Alert Management
Alert Rule Groups
graph TB
A[Alert Rule Groups] --> B[Infrastructure]
A --> C[Application]
A --> D[Business]
B --> E[CPU Alerts]
B --> F[Memory Alerts]
B --> G[Disk Alerts]
C --> H[Response Time]
C --> I[Error Rate]
C --> J[Throughput]
D --> K[Revenue]
D --> L[User Count]
D --> M[Conversion Rate]
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#f3e5f5
style D fill:#e8f5e8
Silencing and Inhibition
# Example silence configuration
silences:
- matchers:
- name: "alertname"
value: "HighCPUUsage"
- name: "instance"
value: "server-01"
startsAt: "2023-01-01T00:00:00Z"
endsAt: "2023-01-01T06:00:00Z"
comment: "Planned maintenance"
Advanced Alerting Features
Multi-Dimensional Alerts
# Alert on multiple metrics
(
(cpu_usage_percent > 80) and
(memory_usage_percent > 90)
) or (
disk_usage_percent > 95
)
Time-Based Conditions
# Alert if condition persists for 5 minutes
avg_over_time(cpu_usage_percent[5m]) > 80
Alerting Best Practices
Alert Fatigue Prevention
graph TD
A[Alert Fatigue Prevention] --> B[Meaningful Thresholds]
A --> C[Alert Grouping]
A --> D[Escalation Policies]
A --> E[Maintenance Windows]
B --> F[Based on historical data]
C --> G[Group related alerts]
D --> H[Progressive notification]
E --> I[Planned downtime handling]
style A fill:#ffeb3b
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
Alert Hierarchy
- Critical: Service down, data loss
- High: Performance degradation
- Medium: Warning conditions
- Low: Information only
8. User Management and Security
Authentication Methods
graph TB
A[Authentication] --> B[Built-in]
A --> C[LDAP/Active Directory]
A --> D[OAuth]
A --> E[SAML]
A --> F[Proxy Authentication]
B --> G[Local Users]
C --> H[Enterprise Directory]
D --> I[Google/GitHub/Azure]
E --> J[Enterprise SSO]
F --> K[Reverse Proxy]
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#fff3e0
style D fill:#e8f5e8
style E fill:#fce4ec
style F fill:#f1f8e9
User Roles and Permissions
Built-in Roles
graph TB
A[Grafana Roles] --> B[Super Admin]
A --> C[Admin]
A --> D[Editor]
A --> E[Viewer]
B --> F[All permissions + server admin]
C --> G[Org admin permissions]
D --> H[Create/edit dashboards]
E --> I[View dashboards only]
style A fill:#ffeb3b
style B fill:#f44336
style C fill:#ff9800
style D fill:#4caf50
style E fill:#2196f3
Permission Matrix
| Action | Viewer | Editor | Admin | Super Admin |
|---|---|---|---|---|
| View dashboards | ✓ | ✓ | ✓ | ✓ |
| Create dashboards | ✗ | ✓ | ✓ | ✓ |
| Manage data sources | ✗ | ✗ | ✓ | ✓ |
| Manage users | ✗ | ✗ | ✓ | ✓ |
| Server administration | ✗ | ✗ | ✗ | ✓ |
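Roles are normally assigned in the UI, but the same operations are available through the HTTP API; a minimal sketch, assuming default admin credentials and a hypothetical user jdoe@company.com:
# Add an existing user to organization 1 with the Editor role
curl -X POST http://admin:admin@localhost:3000/api/orgs/1/users \
  -H "Content-Type: application/json" \
  -d '{"loginOrEmail": "jdoe@company.com", "role": "Editor"}'
# Later, downgrade that user to Viewer (the user id 7 is hypothetical)
curl -X PATCH http://admin:admin@localhost:3000/api/orgs/1/users/7 \
  -H "Content-Type: application/json" \
  -d '{"role": "Viewer"}'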
Organizations and Teams
Multi-Tenancy Structure
graph TB
A[Grafana Instance] --> B[Organization 1]
A --> C[Organization 2]
A --> D[Organization N]
B --> E[Team A]
B --> F[Team B]
C --> G[Team C]
C --> H[Team D]
E --> I[Users 1-5]
F --> J[Users 6-10]
G --> K[Users 11-15]
H --> L[Users 16-20]
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#f3e5f5
style D fill:#f3e5f5
Team-Based Permissions
{
"teams": [
{
"name": "Infrastructure Team",
"email": "infra@company.com",
"members": ["alice", "bob", "charlie"],
"permissions": {
"dashboards": ["infrastructure-*"],
"folders": ["Infrastructure"],
"role": "Editor"
}
}
]
}
Security Configuration
HTTPS Configuration
# grafana.ini - HTTPS settings
[server]
protocol = https
cert_file = /path/to/cert.pem
cert_key = /path/to/cert.key
Security Headers
[security]
# Security headers
cookie_secure = true
cookie_samesite = strict
content_type_nosniff = true
x_content_type_options = nosniff
x_xss_protection = true
LDAP Integration
LDAP Configuration
# LDAP configuration
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = true
LDAP Configuration File
# ldap.toml
[[servers]]
host = "ldap.company.com"
port = 389
use_ssl = false
start_tls = false
ssl_skip_verify = false
bind_dn = "cn=admin,dc=company,dc=com"
bind_password = "password"
search_filter = "(uid=%s)"
search_base_dns = ["ou=users,dc=company,dc=com"]
[servers.attributes]
name = "givenName"
surname = "sn"
username = "uid"
member_of = "memberOf"
email = "mail"
[[servers.group_mappings]]
group_dn = "cn=grafana-admins,ou=groups,dc=company,dc=com"
org_role = "Admin"
[[servers.group_mappings]]
group_dn = "cn=grafana-users,ou=groups,dc=company,dc=com"
org_role = "Viewer"
OAuth Configuration
Google OAuth Setup
[auth.google]
enabled = true
client_id = YOUR_GOOGLE_CLIENT_ID
client_secret = YOUR_GOOGLE_CLIENT_SECRET
scopes = https://www.googleapis.com/auth/userinfo.profile https://www.googleapis.com/auth/userinfo.email
auth_url = https://accounts.google.com/o/oauth2/auth
token_url = https://accounts.google.com/o/oauth2/token
allowed_domains = company.com
allow_sign_up = true
API Security
API Key Management
graph TB
A[API Keys] --> B[Admin Keys]
A --> C[Editor Keys]
A --> D[Viewer Keys]
B --> E[Full API Access]
C --> F[Limited Write Access]
D --> G[Read-Only Access]
E --> H[Create/Delete Resources]
F --> I[Modify Dashboards]
G --> J[Query Data Only]
style A fill:#ffeb3b
style B fill:#f44336
style C fill:#ff9800
style D fill:#4caf50
Creating API Keys
# Create API key via CLI
curl -X POST \
http://admin:admin@localhost:3000/api/auth/keys \
-H 'Content-Type: application/json' \
-d '{
"name": "test-key",
"role": "Editor",
"secondsToLive": 86400
}'
Security Best Practices
Access Control
graph TD
A[Security Best Practices] --> B[Principle of Least Privilege]
A --> C[Regular Access Reviews]
A --> D[Strong Authentication]
A --> E[Network Security]
B --> F[Minimal required permissions]
C --> G[Quarterly user audits]
D --> H[MFA when possible]
E --> I[VPN/firewall restrictions]
style A fill:#ffeb3b
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
Audit Logging
# Enable audit logging
[log]
level = info
mode = file
file = /var/log/grafana/grafana.log
[auditing]
enabled = true
log_dashboard_content = true
9. Plugins and Extensions
Plugin Architecture
graph TB
A[Grafana Core] --> B[Plugin System]
B --> C[Data Source Plugins]
B --> D[Panel Plugins]
B --> E[App Plugins]
C --> F[Custom Databases]
C --> G[External APIs]
C --> H[File Systems]
D --> I[Custom Visualizations]
D --> J[Interactive Panels]
D --> K[Third-party Charts]
E --> L[Complete Applications]
E --> M[Configuration Pages]
E --> N[Custom Workflows]
style A fill:#e3f2fd
style B fill:#ffeb3b
style C fill:#f3e5f5
style D fill:#fff3e0
style E fill:#e8f5e8
Plugin Types
Data Source Plugins
- Connect to custom databases
- Integrate with external APIs
- Support new query languages
Panel Plugins
- Custom visualization types
- Interactive components
- Specialized chart types
App Plugins
- Complete applications within Grafana
- Custom configuration interfaces
- Workflow automation tools
Popular Plugins
Community Plugins
| Plugin | Type | Purpose |
|---|---|---|
| Pie Chart | Panel | Pie and donut charts |
| Worldmap | Panel | Geographic visualizations |
| Diagram | Panel | Network diagrams |
| Polystat | Panel | Multi-stat panels |
| Discrete | Panel | Discrete value display |
Data Source Plugins
graph LR
A[Data Sources] --> B[Databases]
A --> C[APIs]
A --> D[Files]
A --> E[Cloud Services]
B --> F[MongoDB]
B --> G[Redis]
B --> H[Cassandra]
C --> I[REST APIs]
C --> J[GraphQL]
C --> K[SOAP]
D --> L[CSV]
D --> M[JSON]
D --> N[XML]
E --> O[Datadog]
E --> P[New Relic]
E --> Q[Splunk]
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#fff3e0
style D fill:#e8f5e8
style E fill:#fce4ec
Installing Plugins
Using Grafana CLI
# Install a plugin
grafana-cli plugins install grafana-piechart-panel
# List installed plugins
grafana-cli plugins ls
# Update a plugin
grafana-cli plugins update grafana-piechart-panel
# Remove a plugin
grafana-cli plugins remove grafana-piechart-panel
Manual Installation
# Download and extract plugin
cd /var/lib/grafana/plugins
wget https://github.com/grafana/piechart-panel/archive/master.zip
unzip master.zip
mv piechart-panel-master piechart-panel
# Restart Grafana
systemctl restart grafana-server
Developing Custom Plugins
Plugin Development Workflow
sequenceDiagram
participant D as Developer
participant CLI as Grafana CLI
participant G as Grafana
participant B as Browser
D->>CLI: Create plugin scaffold
CLI->>D: Generate boilerplate
D->>D: Implement functionality
D->>CLI: Build plugin
CLI->>D: Compiled plugin
D->>G: Install plugin
G->>B: Load plugin
B->>D: Test & iterate
Creating a Data Source Plugin
# Create new data source plugin
npx @grafana/create-plugin@latest
# Select options:
# - Plugin type: datasource
# - Plugin name: my-datasource
# - Organization: myorg
Basic Plugin Structure
my-datasource/
├── src/
│ ├── datasource.ts
│ ├── query_ctrl.ts
│ ├── config_ctrl.ts
│ └── module.ts
├── package.json
├── plugin.json
└── README.md
Plugin Configuration (plugin.json)
{
"type": "datasource",
"name": "My Custom DataSource",
"id": "myorg-mydatasource-datasource",
"metrics": true,
"annotations": true,
"alerting": true,
"info": {
"description": "Custom data source plugin",
"author": {
"name": "Your Name",
"url": "https://github.com/yourname"
},
"version": "1.0.0"
}
}
Panel Plugin Development
React Panel Plugin Example
// SimplePanel.tsx
import React from 'react';
import { PanelProps } from '@grafana/data';
import { SimpleOptions } from 'types';
interface Props extends PanelProps<SimpleOptions> {}
export const SimplePanel: React.FC<Props> = ({
options,
data,
width,
height
}) => {
return (
<div
style={{
width,
height,
display: 'flex',
alignItems: 'center',
justifyContent: 'center',
}}
>
<span style={{ fontSize: options.fontSize }}>
{options.text}
</span>
</div>
);
};
Plugin Configuration
Plugin Settings UI
// PanelEditor.tsx
import React from 'react';
import { PanelOptionsEditorProps } from '@grafana/data';
import { Input, Field } from '@grafana/ui';
import { SimpleOptions } from '../types';
export const PanelEditor: React.FC<
PanelOptionsEditorProps<SimpleOptions>
> = ({ options, onOptionsChange }) => {
return (
<div>
<Field label="Text">
<Input
value={options.text}
onChange={(e) =>
onOptionsChange({
...options,
text: e.currentTarget.value
})
}
/>
</Field>
<Field label="Font Size">
<Input
type="number"
value={options.fontSize}
onChange={(e) =>
onOptionsChange({
...options,
fontSize: parseInt(e.currentTarget.value, 10)
})
}
/>
</Field>
</div>
);
};
Plugin Distribution
Publishing to Grafana Plugin Registry
graph TB
A[Plugin Development] --> B[Testing]
B --> C[Documentation]
C --> D[Signing]
D --> E[Submission]
E --> F[Review Process]
F --> G[Publication]
style A fill:#e3f2fd
style G fill:#4caf50
Plugin Signing
# Sign plugin for distribution
npx @grafana/sign-plugin@latest --rootUrls http://localhost:3000
Plugin Management in Production
Plugin Security Considerations
- Source Verification: Only install plugins from trusted sources
- Code Review: Review plugin code before installation
- Update Management: Keep plugins updated
- Access Control: Limit plugin installation permissions
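To keep an inventory of what is actually deployed, the installed plugin list can be pulled from the HTTP API; a minimal sketch, assuming an admin API key in $API_KEY and jq installed (exact response fields can vary between Grafana versions):
# List installed plugins with version and signature status
curl -s -H "Authorization: Bearer $API_KEY" \
  http://localhost:3000/api/plugins | \
  jq -r '.[] | "\(.id) \(.info.version) signature=\(.signature)"'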
Plugin Monitoring
graph TB
A[Plugin Monitoring] --> B[Performance Impact]
A --> C[Error Tracking]
A --> D[Usage Analytics]
A --> E[Security Audits]
B --> F[Memory usage]
B --> G[CPU utilization]
B --> H[Query performance]
C --> I[Plugin errors]
C --> J[Compatibility issues]
C --> K[Failed installations]
D --> L[Usage frequency]
D --> M[User adoption]
D --> N[Feature utilization]
E --> O[Vulnerability scans]
E --> P[Permission audits]
E --> Q[Code reviews]
style A fill:#ffeb3b
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#ff9800
style J fill:#ff9800
style K fill:#ff9800
10. Performance Optimization
Performance Monitoring
Key Metrics to Track
graph TB
A[Performance Metrics] --> B[Query Performance]
A --> C[Dashboard Load Times]
A --> D[Memory Usage]
A --> E[CPU Utilization]
A --> F[Network I/O]
B --> G[Query execution time]
B --> H[Data source response time]
B --> I[Query complexity]
C --> J[Initial load time]
C --> K[Panel render time]
C --> L[Refresh performance]
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#f3e5f5
style D fill:#e8f5e8
style E fill:#fce4ec
style F fill:#f1f8e9
Query Optimization
Best Practices for Query Performance
graph TD
A[Query Optimization] --> B[Time Range Management]
A --> C[Aggregation Strategy]
A --> D[Index Utilization]
A --> E[Caching Implementation]
B --> F[Limit query time ranges]
B --> G[Use appropriate intervals]
B --> H[Avoid unnecessary historical data]
C --> I[Pre-aggregate data when possible]
C --> J[Use appropriate grouping]
C --> K[Reduce cardinality]
D --> L[Ensure proper indexing]
D --> M[Use selective filters]
D --> N[Optimize label queries]
E --> O[Enable query caching]
E --> P[Use data source caching]
E --> Q[Implement result caching]
style A fill:#ffeb3b
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
style J fill:#4caf50
style K fill:#4caf50
style L fill:#4caf50
style M fill:#4caf50
style N fill:#4caf50
style O fill:#4caf50
style P fill:#4caf50
style Q fill:#4caf50
Prometheus Query Optimization
# Inefficient - high cardinality
sum(rate(http_requests_total[5m])) by (instance, job, method, status)
# More efficient - reduced cardinality
sum(rate(http_requests_total[5m])) by (job, status)
# Use recording rules for complex queries
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
)
Dashboard Optimization
Panel Optimization Strategies
graph TB
A[Dashboard Optimization] --> B[Panel Count Management]
A --> C[Query Efficiency]
A --> D[Refresh Rate Optimization]
A --> E[Data Visualization Choice]
B --> F[Limit panels per dashboard]
B --> G[Use folders for organization]
B --> H[Split complex dashboards]
C --> I[Optimize queries per panel]
C --> J[Use template variables]
C --> K[Avoid redundant queries]
D --> L[Set appropriate refresh rates]
D --> M[Use auto-refresh wisely]
D --> N[Consider user workflow]
E --> O[Choose efficient visualizations]
E --> P[Limit data points displayed]
E --> Q[Use appropriate aggregation]
style A fill:#e3f2fd
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
style J fill:#4caf50
style K fill:#4caf50
style L fill:#4caf50
style M fill:#4caf50
style N fill:#4caf50
style O fill:#4caf50
style P fill:#4caf50
style Q fill:#4caf50
Efficient Dashboard Design
{
"dashboard": {
"title": "Optimized System Dashboard",
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
},
"panels": [
{
"title": "CPU Usage (Efficient)",
"type": "stat",
"maxDataPoints": 100,
"interval": "30s"
}
]
}
}
Infrastructure Optimization
Grafana Server Configuration
# grafana.ini - Performance settings
[server]
http_port = 3000
enable_gzip = true
[database]
# Use PostgreSQL for better performance
type = postgres
host = localhost:5432
name = grafana
user = grafana
password = password
max_open_conn = 300
max_idle_conn = 300
[session]
provider = redis
provider_config = addr=127.0.0.1:6379
[caching]
enabled = true
ttl = 3600
[query_history]
enabled = true
max_queries_per_user = 1000
Database Optimization
graph TB
A[Database Optimization] --> B[Connection Pooling]
A --> C[Indexing Strategy]
A --> D[Query Optimization]
A --> E[Data Retention]
B --> F[Configure max connections]
B --> G[Set connection timeouts]
B --> H[Use connection pooling]
C --> I[Index frequently queried columns]
C --> J[Composite indexes for complex queries]
C --> K[Regular index maintenance]
D --> L[Analyze slow queries]
D --> M[Optimize data source queries]
D --> N[Use query result caching]
E --> O[Set appropriate retention periods]
E --> P[Archive old data]
E --> Q[Implement data compression]
style A fill:#e3f2fd
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
style J fill:#4caf50
style K fill:#4caf50
style L fill:#4caf50
style M fill:#4caf50
style N fill:#4caf50
style O fill:#4caf50
style P fill:#4caf50
style Q fill:#4caf50
Scaling Strategies
Horizontal Scaling
graph TB
A[Load Balancer] --> B[Grafana Instance 1]
A --> C[Grafana Instance 2]
A --> D[Grafana Instance N]
B --> E[Shared Database]
C --> E
D --> E
E --> F[PostgreSQL/MySQL]
B --> G[Shared Storage]
C --> G
D --> G
G --> H[File System/NFS]
style A fill:#ffeb3b
style E fill:#4caf50
style G fill:#4caf50
Load Balancing Configuration
# nginx.conf for Grafana load balancing
upstream grafana {
server grafana1:3000;
server grafana2:3000;
server grafana3:3000;
}
server {
listen 80;
server_name grafana.company.com;
location / {
proxy_pass http://grafana;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
Monitoring Grafana Performance
Key Performance Indicators
graph LR
A[Grafana KPIs] --> B[Response Time]
A --> C[Throughput]
A --> D[Error Rate]
A --> E[Resource Usage]
B --> F[Dashboard load time < 3s]
C --> G[Concurrent users]
D --> H[Error percentage < 1%]
E --> I[CPU < 80%, Memory < 85%]
style A fill:#e3f2fd
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
Performance Monitoring Dashboard
# Dashboard load time
histogram_quantile(0.95,
sum(rate(grafana_http_request_duration_seconds_bucket[5m]))
by (le, handler)
)
# Memory usage
process_resident_memory_bytes{job="grafana"}
# Active users
grafana_stat_active_users
# Query performance
grafana_datasource_request_duration_seconds
11. Advanced Administration
Backup and Recovery
Backup Strategy
graph TB
A[Backup Strategy] --> B[Database Backup]
A --> C[Configuration Backup]
A --> D[Plugin Backup]
A --> E[Dashboard Export]
B --> F[Automated DB dumps]
B --> G[Point-in-time recovery]
B --> H[Cross-region replication]
C --> I[grafana.ini backup]
C --> J[Custom config files]
C --> K[Environment variables]
D --> L[Plugin binaries]
D --> M[Plugin configurations]
D --> N[Custom plugin data]
E --> O[JSON exports]
E --> P[API-based backups]
E --> Q[Version control integration]
style A fill:#ffeb3b
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
style J fill:#4caf50
style K fill:#4caf50
style L fill:#4caf50
style M fill:#4caf50
style N fill:#4caf50
style O fill:#4caf50
style P fill:#4caf50
style Q fill:#4caf50
Automated Backup Script
#!/bin/bash
# Grafana backup script
BACKUP_DIR="/backup/grafana/$(date +%Y%m%d)"
GRAFANA_URL="http://localhost:3000"
API_KEY="your-api-key"
# Create backup directory
mkdir -p $BACKUP_DIR
# Backup database
pg_dump grafana > $BACKUP_DIR/grafana_db.sql
# Backup configuration
cp /etc/grafana/grafana.ini $BACKUP_DIR/
# Export all dashboards
curl -H "Authorization: Bearer $API_KEY" \
$GRAFANA_URL/api/search?type=dash-db | \
jq -r '.[] | .uid' | \
while read uid; do
curl -H "Authorization: Bearer $API_KEY" \
$GRAFANA_URL/api/dashboards/uid/$uid > \
$BACKUP_DIR/dashboard_$uid.json
done
# Backup plugins
cp -r /var/lib/grafana/plugins $BACKUP_DIR/
# Compress backup
tar -czf $BACKUP_DIR.tar.gz $BACKUP_DIR
rm -rf $BACKUP_DIR
echo "Backup completed: $BACKUP_DIR.tar.gz"
Configuration Management
Infrastructure as Code
# docker-compose.yml for Grafana deployment
version: '3.8'
services:
grafana:
image: grafana/grafana-enterprise:latest
container_name: grafana
restart: unless-stopped
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=secure-password
- GF_DATABASE_TYPE=postgres
- GF_DATABASE_HOST=postgres:5432
- GF_DATABASE_NAME=grafana
- GF_DATABASE_USER=grafana
- GF_DATABASE_PASSWORD=grafana-password
volumes:
- grafana-data:/var/lib/grafana
- ./grafana.ini:/etc/grafana/grafana.ini
- ./provisioning:/etc/grafana/provisioning
depends_on:
- postgres
postgres:
image: postgres:13
container_name: grafana-postgres
restart: unless-stopped
environment:
- POSTGRES_DB=grafana
- POSTGRES_USER=grafana
- POSTGRES_PASSWORD=grafana-password
volumes:
- postgres-data:/var/lib/postgresql/data
volumes:
grafana-data:
postgres-data:
Provisioning Configuration
# provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
jsonData:
timeInterval: "5s"
queryTimeout: "60s"
httpMethod: "POST"
# provisioning/dashboards/default.yml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /etc/grafana/provisioning/dashboards
Migration Strategies
Version Upgrade Process
graph TB
A[Pre-Migration] --> B[Backup Current State]
B --> C[Test Environment Setup]
C --> D[Migration Execution]
D --> E[Validation Testing]
E --> F[Production Deployment]
F --> G[Post-Migration Monitoring]
A --> H[Review Release Notes]
A --> I[Identify Breaking Changes]
A --> J[Plan Rollback Strategy]
style A fill:#fff3e0
style D fill:#ffeb3b
style F fill:#4caf50
style G fill:#4caf50
Migration Checklist
- Pre-Migration
- Review Grafana release notes
- Backup database and configuration
- Test migration in staging environment
- Identify plugin compatibility issues
- Plan maintenance window
- Migration Execution
- Stop Grafana service
- Update Grafana binaries
- Run database migrations
- Update configuration if needed
- Restart services
- Post-Migration
- Verify all dashboards load correctly
- Test alerting functionality
- Validate data source connections
- Monitor performance metrics
- Update documentation
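For Docker Compose deployments like the one shown earlier, the execution steps above can be scripted; a minimal sketch, assuming a backup script already exists at /opt/grafana/backup.sh (a hypothetical path):
#!/bin/bash
# Minimal upgrade sketch: back up, pull the new image, restart, then verify
set -euo pipefail
/opt/grafana/backup.sh                      # hypothetical pre-upgrade backup
docker compose pull grafana                 # fetch the new Grafana image
docker compose up -d grafana                # recreate the container; migrations run on startup
sleep 30
curl -fs http://localhost:3000/api/health   # abort with an error if Grafana is not healthy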
High Availability Setup
Active-Passive Configuration
graph TB
A[Load Balancer] --> B[Active Grafana]
A --> C[Passive Grafana]
B --> D[Shared Database]
C --> D
B --> E[Shared Storage]
C --> E
D --> F[Primary DB]
D --> G[Replica DB]
style A fill:#ffeb3b
style B fill:#4caf50
style C fill:#ff9800
style D fill:#2196f3
style E fill:#9c27b0
Health Check Configuration
# docker-compose.yml health checks
services:
grafana:
image: grafana/grafana-enterprise:latest
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
Log Management
Centralized Logging
graph TB
A[Grafana Instances] --> B[Log Aggregator]
B --> C[Log Storage]
C --> D[Log Analysis]
A --> E[Application Logs]
A --> F[Access Logs]
A --> G[Error Logs]
A --> H[Audit Logs]
B --> I[Fluentd/Logstash]
C --> J[Elasticsearch]
D --> K[Kibana/Grafana]
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#f3e5f5
style D fill:#e8f5e8
Log Configuration
# grafana.ini - Logging configuration
[log]
mode = console file
level = info
filters = rendering:debug
[log.console]
level = info
format = console
[log.file]
level = info
format = text
log_rotate = true
max_lines = 1000000
max_size_shift = 28
daily_rotate = true
max_days = 7
12. Enterprise Features
Grafana Enterprise Overview
graph TB
A[Grafana Enterprise] --> B[Advanced Security]
A --> C[Enhanced RBAC]
A --> D[Reporting]
A --> E[White Labeling]
A --> F[Enterprise Plugins]
A --> G[Priority Support]
B --> H[SAML Authentication]
B --> I[Enhanced LDAP]
B --> J[Audit Logging]
C --> K[Fine-grained Permissions]
C --> L[Team Sync]
C --> M[Data Source Permissions]
D --> N[PDF Reports]
D --> O[Scheduled Reports]
D --> P[Report Sharing]
style A fill:#ffeb3b
style B fill:#f44336
style C fill:#ff9800
style D fill:#4caf50
style E fill:#2196f3
style F fill:#9c27b0
style G fill:#607d8b
Advanced Role-Based Access Control (RBAC)
Custom Roles and Permissions
graph TB
A[Enterprise RBAC] --> B[Custom Roles]
A --> C[Fine-grained Permissions]
A --> D[Resource-level Access]
A --> E[Team Synchronization]
B --> F[Read-only Analyst]
B --> G[Dashboard Creator]
B --> H[Data Source Manager]
B --> I[Alert Manager]
C --> J[Dashboard Permissions]
C --> K[Folder Permissions]
C --> L[Data Source Permissions]
C --> M[API Permissions]
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#fff3e0
style D fill:#e8f5e8
style E fill:#fce4ec
Permission Configuration
{
"roles": [
{
"name": "Custom Dashboard Editor",
"description": "Can edit specific dashboards",
"permissions": [
{
"action": "dashboards:read",
"scope": "dashboards:uid:dashboard-123"
},
{
"action": "dashboards:write",
"scope": "dashboards:uid:dashboard-123"
}
]
}
]
}
Enterprise Data Sources
Advanced Data Source Features
graph LR
A[Enterprise Data Sources] --> B[Oracle]
A --> C[SAP HANA]
A --> D[Snowflake]
A --> E[Databricks]
A --> F[Splunk]
A --> G[Dynatrace]
A --> H[AppDynamics]
A --> I[Honeycomb]
style A fill:#e3f2fd
style B fill:#ff9800
style C fill:#4caf50
style D fill:#2196f3
style E fill:#9c27b0
style F fill:#607d8b
style G fill:#795548
style H fill:#f44336
style I fill:#ffeb3b
Reporting and Sharing
PDF Reports
graph TB
A[Report Generation] --> B[Dashboard Rendering]
B --> C[PDF Creation]
C --> D[Report Distribution]
A --> E[Scheduled Reports]
A --> F[On-demand Reports]
A --> G[Email Reports]
D --> H[Email]
D --> I[Slack]
D --> J[File Storage]
D --> K[API Endpoints]
style A fill:#e3f2fd
style C fill:#ffeb3b
style D fill:#4caf50
Report Configuration
{
"report": {
"name": "Weekly System Report",
"dashboardId": 123,
"schedule": "0 9 * * MON",
"format": "pdf",
"orientation": "landscape",
"layout": "simple",
"recipients": [
"manager@company.com",
"team-lead@company.com"
],
"message": "Weekly system performance report"
}
}
White Labeling
Custom Branding Configuration
# grafana.ini - White labeling
[white_labeling]
app_title = "Company Monitoring"
login_title = "Company Analytics Platform"
footer_links = "Support|https://support.company.com"
login_logo = "/public/img/custom_logo.png"
menu_logo = "/public/img/custom_menu_logo.png"
Enterprise Security Features
SAML Configuration
# grafana.ini - SAML settings
[auth.saml]
enabled = true
certificate_path = /etc/grafana/saml.crt
private_key_path = /etc/grafana/saml.key
idp_metadata_url = https://company.okta.com/app/metadata
assertion_attribute_name = displayName
assertion_attribute_login = email
assertion_attribute_email = email
Enhanced Audit Logging
{
"timestamp": "2023-09-03T10:30:00Z",
"userId": 123,
"orgId": 1,
"action": "dashboard.create",
"resource": "dashboard",
"resourceId": "new-dashboard-uid",
"requestUri": "/api/dashboards/db",
"ipAddress": "192.168.1.100",
"userAgent": "Mozilla/5.0...",
"success": true,
"details": {
"dashboardTitle": "New Monitoring Dashboard"
}
}
13. Grafana in Production
Production Architecture
Multi-Tier Architecture
graph TB
A[Load Balancer] --> B[Web Tier]
B --> C[Application Tier]
C --> D[Data Tier]
B --> E[Reverse Proxy]
B --> F[SSL Termination]
B --> G[Rate Limiting]
C --> H[Grafana Instances]
C --> I[Session Storage]
C --> J[Cache Layer]
D --> K[Primary Database]
D --> L[Read Replicas]
D --> M[Backup Storage]
style A fill:#ffeb3b
style B fill:#4caf50
style C fill:#2196f3
style D fill:#ff9800
Production Deployment Checklist
graph TD
A[Production Deployment] --> B[Infrastructure Setup]
A --> C[Security Configuration]
A --> D[Monitoring Setup]
A --> E[Backup Strategy]
A --> F[Documentation]
B --> G[Load balancers configured]
B --> H[Database cluster ready]
B --> I[Storage provisioned]
C --> J[HTTPS enabled]
C --> K[Authentication configured]
C --> L[Firewall rules applied]
D --> M[Health checks implemented]
D --> N[Alerting configured]
D --> O[Log aggregation setup]
style A fill:#e3f2fd
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
style J fill:#4caf50
style K fill:#4caf50
style L fill:#4caf50
style M fill:#4caf50
style N fill:#4caf50
style O fill:#4caf50
Container Orchestration
Kubernetes Deployment
# grafana-deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
labels:
app: grafana
spec:
replicas: 3
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana-enterprise:latest
ports:
- containerPort: 3000
env:
- name: GF_DATABASE_TYPE
value: "postgres"
- name: GF_DATABASE_HOST
value: "postgres-service:5432"
- name: GF_DATABASE_NAME
valueFrom:
secretKeyRef:
name: grafana-secrets
key: database-name
volumeMounts:
- name: grafana-storage
mountPath: /var/lib/grafana
- name: grafana-config
mountPath: /etc/grafana
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: grafana-storage
persistentVolumeClaim:
claimName: grafana-pvc
- name: grafana-config
configMap:
name: grafana-config
Service Configuration
# grafana-service.yml
apiVersion: v1
kind: Service
metadata:
name: grafana-service
spec:
selector:
app: grafana
ports:
- port: 80
targetPort: 3000
type: LoadBalancer
Monitoring Grafana Itself
Self-Monitoring Dashboard
# Grafana performance queries
# Request rate
rate(grafana_http_request_total[5m])
# Request duration
grafana_http_request_duration_seconds
# Active sessions
grafana_stat_active_users
# Database query duration
grafana_database_query_duration_seconds
# Memory usage
process_resident_memory_bytes{job="grafana"}
# CPU usage
rate(process_cpu_seconds_total{job="grafana"}[5m])
# Go garbage collection
go_gc_duration_seconds{job="grafana"}
Disaster Recovery
Recovery Planning
graph TB
A[Disaster Recovery Plan] --> B[Recovery Time Objective]
A --> C[Recovery Point Objective]
A --> D[Backup Strategy]
A --> E[Failover Procedures]
B --> F[RTO: 4 hours]
C --> G[RPO: 1 hour]
D --> H[Automated backups]
D --> I[Cross-region replication]
D --> J[Configuration versioning]
E --> K[Automated failover]
E --> L[Manual procedures]
E --> M[Communication plan]
style A fill:#ffeb3b
style F fill:#f44336
style G fill:#f44336
style H fill:#4caf50
style I fill:#4caf50
style J fill:#4caf50
style K fill:#4caf50
style L fill:#ff9800
style M fill:#2196f3
Disaster Recovery Script
#!/bin/bash
# Grafana disaster recovery script
# Configuration
BACKUP_LOCATION="s3://company-backups/grafana"
TARGET_ENVIRONMENT="production"
GRAFANA_URL="https://grafana.company.com"
# Recovery steps
echo "Starting Grafana disaster recovery..."
# 1. Restore database
echo "Restoring database from backup..."
aws s3 cp $BACKUP_LOCATION/latest/database.sql /tmp/
psql -h $DB_HOST -U $DB_USER -d grafana < /tmp/database.sql
# 2. Restore configuration
echo "Restoring configuration..."
aws s3 cp $BACKUP_LOCATION/latest/grafana.ini /etc/grafana/
# 3. Restore plugins
echo "Restoring plugins..."
aws s3 sync $BACKUP_LOCATION/latest/plugins/ /var/lib/grafana/plugins/
# 4. Start services
echo "Starting Grafana services..."
systemctl start grafana-server
# 5. Verify recovery
echo "Verifying recovery..."
curl -f $GRAFANA_URL/api/health || {
echo "Health check failed!"
exit 1
}
echo "Disaster recovery completed successfully!"
14. Troubleshooting and Best Practices
Common Issues and Solutions
Performance Issues
graph TB
A[Performance Issues] --> B[Slow Dashboard Loading]
A --> C[High Memory Usage]
A --> D[Query Timeouts]
A --> E[Database Bottlenecks]
B --> F[Optimize queries]
B --> G[Reduce panel count]
B --> H[Increase refresh intervals]
C --> I[Optimize data retention]
C --> J[Increase memory allocation]
C --> K[Enable garbage collection tuning]
D --> L[Optimize data source queries]
D --> M[Increase timeout settings]
D --> N[Use query caching]
E --> O[Add database indexes]
E --> P[Optimize connection pooling]
E --> Q[Scale database resources]
style A fill:#ffeb3b
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
style J fill:#4caf50
style K fill:#4caf50
style L fill:#4caf50
style M fill:#4caf50
style N fill:#4caf50
style O fill:#4caf50
style P fill:#4caf50
style Q fill:#4caf50
Debugging Workflow
sequenceDiagram
participant U as User
participant G as Grafana
participant D as Data Source
participant L as Logs
U->>G: Report Issue
G->>L: Check Grafana Logs
L-->>G: Log Information
G->>D: Test Data Source
D-->>G: Connection Status
G->>G: Check Configuration
G->>U: Provide Solution
Best Practices Summary
Dashboard Design
- Clarity and Purpose
- Define clear objectives for each dashboard
- Use consistent naming conventions
- Group related metrics logically
- Performance Optimization
- Limit the number of panels per dashboard
- Use appropriate time ranges
- Optimize queries for efficiency
- User Experience
- Design for your audience
- Use meaningful colors and labels
- Provide context through annotations
Operational Excellence
graph TB
A[Operational Excellence] --> B[Monitoring]
A --> C[Automation]
A --> D[Documentation]
A --> E[Training]
B --> F[System Health Monitoring]
B --> G[Performance Tracking]
B --> H[Error Monitoring]
C --> I[Automated Backups]
C --> J[Deployment Automation]
C --> K[Alert Management]
D --> L[Architecture Documentation]
D --> M[Runbooks]
D --> N[User Guides]
E --> O[User Training Programs]
E --> P[Administrator Training]
E --> Q[Best Practices Sharing]
style A fill:#e3f2fd
style F fill:#4caf50
style G fill:#4caf50
style H fill:#4caf50
style I fill:#4caf50
style J fill:#4caf50
style K fill:#4caf50
style L fill:#4caf50
style M fill:#4caf50
style N fill:#4caf50
style O fill:#4caf50
style P fill:#4caf50
style Q fill:#4caf50
Troubleshooting Tools
Command Line Tools
# Check Grafana status
systemctl status grafana-server
# View Grafana logs
journalctl -u grafana-server -f
# Test data source connectivity
curl -H "Authorization: Bearer $API_KEY" \
http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up
# Backup dashboard
grafana-cli admin export-dashboard-json dashboard-uid
# Reset admin password
grafana-cli admin reset-admin-password newpassword
API Debugging
# Health check
curl http://localhost:3000/api/health
# Data source test
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{"query":"up"}' \
http://localhost:3000/api/datasources/proxy/1/api/v1/query
# Dashboard export
curl -H "Authorization: Bearer $API_KEY" \
http://localhost:3000/api/dashboards/uid/dashboard-uid
Final Recommendations
Security Checklist
- Enable HTTPS with valid certificates
- Configure strong authentication
- Implement proper access controls
- Regular security updates
- Audit logging enabled
- Network security configured
Performance Checklist
- Database optimized and indexed
- Caching configured appropriately
- Resource limits set correctly
- Monitoring in place
- Backup strategy implemented
- Load testing completed
Operational Checklist
- Documentation up to date
- Runbooks created
- Team training completed
- Incident response plan ready
- Regular maintenance scheduled
- Success metrics defined
Conclusion
This comprehensive guide has covered Grafana from basic concepts to enterprise-level implementations. Key takeaways include:
- Foundation: Understanding Grafana’s architecture and core concepts
- Implementation: Proper setup, configuration, and data source integration
- Optimization: Performance tuning and best practices
- Security: Robust authentication and authorization
- Operations: Production deployment and maintenance
Continue exploring Grafana’s capabilities and stay updated with the latest features and best practices. The monitoring and observability landscape is constantly evolving, and Grafana remains at the forefront of these innovations.
For the latest information and community support, visit the official Grafana documentation and community forums at grafana.com.