What is grafana-dashboards?

A skill to create and manage production Grafana dashboards for real-time observability.

What design principles does it use?

It structures dashboards around an information hierarchy, the RED method for services, and the USE method for resources.

How do I get started?

Identify key metrics, choose panel types, connect to Prometheus, and save the dashboard for production use.

grafana-dashboards

npx machina-cli add skill wshobson/agents/grafana-dashboards --openclaw

Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

When to Use

Visualize Prometheus metrics
Create custom dashboards
Implement SLO dashboards
Monitor infrastructure
Track business KPIs

Dashboard Design Principles

1. Hierarchy of Information

┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘

2. RED Method (Services)

Rate - Requests per second
Errors - Error rate
Duration - Latency/response time

3. USE Method (Resources)

Utilization - % time resource is busy
Saturation - Queue length/wait time
Errors - Error count
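
As a minimal sketch of the RED method in practice, the helper below builds one PromQL expression per signal for a single service. It reuses the metric names from the examples in this document (`http_requests_total`, `http_request_duration_seconds_bucket`); the `service` and `status` labels and the `red_queries` helper itself are assumptions for illustration, so adjust them to your own instrumentation.

```python
def red_queries(service: str, window: str = "5m") -> dict[str, str]:
    """Build one PromQL expression per RED signal for a single service.

    Assumes metrics carry a `service` label and HTTP status in a `status` label.
    """
    selector = f'service="{service}"'
    # Rate: requests per second over the window
    rate = f"sum(rate(http_requests_total{{{selector}}}[{window}]))"
    # Errors: share of requests that returned a 5xx status
    errors = (
        f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[{window}]))'
        f" / {rate}"
    )
    # Duration: 95th percentile latency from histogram buckets
    duration = (
        "histogram_quantile(0.95, "
        f"sum(rate(http_request_duration_seconds_bucket{{{selector}}}[{window}])) by (le))"
    )
    return {"rate": rate, "errors": errors, "duration": duration}


for signal, expr in red_queries("checkout").items():
    print(f"{signal}: {expr}")
```

Each returned string can be pasted into the `expr` field of a panel target.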

Dashboard Structure

API Monitoring Dashboard

{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" },
              "type": "query"
            }
          ]
        },
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      }
    ]
  }
}

Reference: See assets/api-dashboard.json
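
The `gridPos` values above follow Grafana's 24-column grid: two panels of width 12 share a row, and a panel of width 24 spans it. When generating dashboards from code, the positions can be computed instead of written by hand. The `layout` helper below is a hypothetical sketch of that idea, not part of Grafana or this skill.

```python
GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def layout(panels, width=12, height=8):
    """Assign gridPos left to right, wrapping to a new row when the grid is full."""
    x = y = 0
    placed = []
    for panel in panels:
        if x + width > GRID_COLUMNS:
            # No room left on this row: start the next one
            x, y = 0, y + height
        placed.append({**panel, "gridPos": {"x": x, "y": y, "w": width, "h": height}})
        x += width
    return placed


panels = layout([
    {"title": "Request Rate"},
    {"title": "Error Rate %"},
    {"title": "P95 Latency"},
])
for p in panels:
    print(p["title"], p["gridPos"])
```

This sketch uses one fixed size for every panel; a full-width panel such as the P95 Latency example would need its own `width=24` call.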

Panel Types

1. Stat Panel (Single Value)

{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}
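
To make the `thresholds` block concrete: in `absolute` mode, a value takes the color of the highest step whose `value` it has reached. The function below is a simplified model of that rule for the three steps shown above, written for illustration; it is not Grafana's own code.

```python
def threshold_color(value, steps):
    """Return the color of the highest step whose value is <= the reading."""
    color = steps[0]["color"]  # fall back to the base step
    for step in sorted(steps, key=lambda s: s["value"]):
        if value >= step["value"]:
            color = step["color"]
    return color


STEPS = [
    {"value": 0, "color": "green"},
    {"value": 80, "color": "yellow"},
    {"value": 90, "color": "red"},
]

for reading in (42, 80, 95):
    print(reading, threshold_color(reading, STEPS))
```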

2. Time Series Graph

{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "yaxes": [
    { "format": "percent", "max": 100, "min": 0 },
    { "format": "short" }
  ]
}

3. Table Panel

{
  "type": "table",
  "title": "Service Status",
  "targets": [
    {
      "expr": "up",
      "format": "table",
      "instant": true
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}
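
The `organize` transformation above hides the `Time` column and renames the remaining ones. As a rough sketch of its effect on query results, assuming each row arrives as a plain dictionary (the `organize` function here is illustrative, not a Grafana API):

```python
def organize(rows, exclude_by_name, rename_by_name):
    """Drop excluded columns and rename the rest, row by row."""
    return [
        {
            rename_by_name.get(name, name): value
            for name, value in row.items()
            if not exclude_by_name.get(name, False)
        }
        for row in rows
    ]


rows = [{"Time": 1700000000, "instance": "node-1:9100", "job": "node", "Value": 1}]
print(organize(
    rows,
    exclude_by_name={"Time": True},
    rename_by_name={"instance": "Instance", "job": "Service", "Value": "Status"},
))
```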

4. Heatmap

{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
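
Prometheus histogram buckets are cumulative: the series with `le="0.5"` counts every observation at or below 0.5s, including those already counted in smaller buckets. A heatmap needs the count that falls inside each bucket alone. The sketch below shows that subtraction on made-up numbers; the `per_bucket_counts` helper and the sample data are assumptions for illustration.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative (le, count) pairs, sorted by le, to per-bucket counts."""
    result = []
    previous = 0.0
    for le, count in cumulative:
        # Observations in this bucket only = this total minus the smaller bucket's total
        result.append((le, count - previous))
        previous = count
    return result


cumulative = [(0.1, 5.0), (0.5, 8.0), (float("inf"), 10.0)]
print(per_bucket_counts(cumulative))
```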

Variables

Query Variables

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}

Use Variables in Queries

sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
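
The query uses `=` for the single-value `namespace` variable and `=~` for the multi-value `service` variable, because several selected values are expanded into a regex alternation. The function below is a simplified model of that expansion, assuming plain `$name` references; Grafana's real interpolation handles more syntaxes and escaping cases than this sketch.

```python
import re


def interpolate(query, variables):
    """Replace $name references; lists become a regex alternation like (a|b)."""
    def expand(match):
        value = variables[match.group(1)]
        if isinstance(value, list):
            return "(" + "|".join(re.escape(v) for v in value) + ")"
        return value

    return re.sub(r"\$(\w+)", expand, query)


query = 'sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))'
print(interpolate(query, {"namespace": "prod", "service": ["api", "web"]}))
```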

Alerts in Dashboards

{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": { "type": "and" },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-channel" }]
  }
}
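
As a rough model of the condition above: the `avg` reducer collapses the last 5 minutes of query `A` into one number, and the `gt` evaluator compares it with the threshold of 5. The sketch below assumes samples arrive as a plain list and treats an empty list as the no-data case; it ignores the `for` and `frequency` settings, which control how long the condition must hold and how often it is checked.

```python
from statistics import mean


def should_alert(samples, threshold=5.0):
    """Apply an avg reducer, then a gt evaluator. Returns None when there is no data."""
    if not samples:
        return None  # corresponds to the noDataState branch
    return mean(samples) > threshold


print(should_alert([4.0, 7.0, 6.0]))  # average is above 5
print(should_alert([1.0, 2.0, 3.0]))  # average is below 5
print(should_alert([]))               # no data
```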

Dashboard Provisioning

dashboards.yml:

apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "General"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
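
One thing to check before dropping files into that path: the API Monitoring example above wraps the model in a top-level `"dashboard"` key, which is the shape used when posting to the HTTP API. This sketch assumes the file provider expects the bare dashboard model instead, so verify that against your Grafana version. The `to_provisioning_model` helper is hypothetical.

```python
import json


def to_provisioning_model(document: dict) -> dict:
    """Return the bare dashboard model, unwrapping an API-style {"dashboard": ...} envelope."""
    model = document.get("dashboard", document)
    if not isinstance(model, dict) or not model.get("title"):
        raise ValueError("dashboard model needs a non-empty 'title'")
    return model


api_style = {"dashboard": {"title": "API Monitoring", "panels": []}}
print(json.dumps(to_provisioning_model(api_style), indent=2))
```

The printed JSON is what would be written to a file under the provider's `path`.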

Common Dashboard Patterns

Infrastructure Dashboard

Key Panels:

CPU utilization per node
Memory usage per node
Disk I/O
Network traffic
Pod count by namespace
Node status

Reference: See assets/infrastructure-dashboard.json

Database Dashboard

Key Panels:

Queries per second
Connection pool usage
Query latency (P50, P95, P99)
Active connections
Database size
Replication lag
Slow queries

Reference: See assets/database-dashboard.json
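
The latency percentile panels listed above are usually computed with `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The function below is a simplified re-implementation for intuition only, using made-up bucket counts; it assumes the first bucket starts at 0 and that the last bucket is `+Inf`.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs sorted ascending."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Interpolate linearly between the bucket's bounds
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")


buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (float("inf"), 100.0)]
for q in (0.5, 0.95, 0.99):
    print(q, histogram_quantile(q, buckets))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.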

Application Dashboard

Key Panels:

Request rate
Error rate
Response time (percentiles)
Active users/sessions
Cache hit rate
Queue length

Best Practices

Start with templates (Grafana community dashboards)
Use consistent naming for panels and variables
Group related metrics in rows
Set appropriate time ranges (default: Last 6 hours)
Use variables for flexibility
Add panel descriptions for context
Configure units correctly
Set meaningful thresholds for colors
Use consistent colors across dashboards
Test with different time ranges
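
Several of these practices can be checked mechanically before a dashboard is committed. The sketch below flags panels that lack a title, a description, or a unit. It assumes the bare dashboard model with a top-level `panels` list and units stored under `fieldConfig.defaults.unit`; the `lint_panels` helper is hypothetical.

```python
def lint_panels(dashboard):
    """Return human-readable problems for panels missing a title, description, or unit."""
    problems = []
    for panel in dashboard.get("panels", []):
        name = panel.get("title") or f"<panel id {panel.get('id', '?')}>"
        if not panel.get("title"):
            problems.append(f"{name}: missing title")
        if not panel.get("description"):
            problems.append(f"{name}: missing description")
        if not panel.get("fieldConfig", {}).get("defaults", {}).get("unit"):
            problems.append(f"{name}: missing unit")
    return problems


dashboard = {
    "panels": [
        {
            "title": "CPU Usage",
            "description": "Non-idle CPU per instance",
            "fieldConfig": {"defaults": {"unit": "percent"}},
        },
        {"id": 2, "title": "Request Rate"},
    ]
}
for problem in lint_panels(dashboard):
    print(problem)
```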

Dashboard as Code

Terraform Provisioning

resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}

Ansible Provisioning

- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana

Reference Files

assets/api-dashboard.json - API monitoring dashboard
assets/infrastructure-dashboard.json - Infrastructure dashboard
assets/database-dashboard.json - Database monitoring dashboard
references/dashboard-design.md - Dashboard design guide

Related Skills

prometheus-configuration - For metric collection
slo-implementation - For SLO dashboards

Source

git clone https://github.com/wshobson/agents.git

The skill file is at plugins/observability-monitoring/skills/grafana-dashboards/SKILL.md in that repository.


Overview

Grafana dashboards enable production-ready, real-time visualization of system, application, and business metrics. This skill guides building observability interfaces from Prometheus data to monitor infrastructure and services. It covers design principles, panel patterns, and a production-ready API monitoring example.

How This Skill Works

Design dashboards around an information hierarchy and the RED and USE methods, then implement panels (Stat, Time Series, Table, Heatmap) backed by a data source such as Prometheus. The included API Monitoring Dashboard example shows a typical production layout with Request Rate, Error Rate, and P95 Latency panels.

Quick Start

Step 1: Define metrics and layout using the hierarchy and RED/USE principles
Step 2: Build panels (Stat, Time Series, Table, Heatmap) and connect them to Prometheus or another data source
Step 3: Save and tag the dashboard, set a 30s refresh interval, and add alerts on the key panels

Best Practices

Start with a clear information hierarchy: critical metrics first, then key trends, then detailed metrics
Apply RED for services: Rate, Errors, Duration
Apply USE for resources: Utilization, Saturation, Errors
Use production-ready panel types (Stat, Time Series, Table) with consistent naming and legends
Configure sensible refresh intervals and alerts aligned with SLOs

Example Use Cases

API Monitoring Dashboard with Request Rate, Error Rate, and P95 Latency
Prometheus metrics visualization dashboards for services
SLO dashboards to visualize uptime and error budgets
Infrastructure health dashboard tracking CPU, memory, and disk metrics
Business KPI dashboard highlighting key revenue and usage metrics
