npx machina-cli add skill chaterm/terminal-skills/monitoring --openclaw

Monitoring & Alerting

Overview

Skills for Prometheus, Grafana, alert rule configuration, and related monitoring tasks.

Prometheus

Basic Queries (PromQL)

# Instant vector
http_requests_total
http_requests_total{job="api", status="200"}

# Range vector
http_requests_total[5m]

# Offset
http_requests_total offset 1h

# Aggregation
sum(http_requests_total)
sum by (job) (http_requests_total)
sum without (instance) (http_requests_total)

# Rate
rate(http_requests_total[5m])
irate(http_requests_total[5m])

# Increase over a window
increase(http_requests_total[1h])

# Histogram quantile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Common Queries

# CPU usage (%)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage (%)
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# HTTP request rate
sum(rate(http_requests_total[5m])) by (status)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Configuration File

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Alert Rules

# rules/alerts.yml
groups:
  - name: node
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency"

Alertmanager

Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

Grafana

Data Source Configuration

# provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Dashboard JSON Example

{
  "dashboard": {
    "title": "Node Metrics",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
          }
        ]
      }
    ]
  }
}

Common Panel Queries

# CPU usage (time series)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (gauge)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Request rate (bar chart)
sum(rate(http_requests_total[5m])) by (status)

# Latency heatmap
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)

Common Scenarios

Scenario 1: Kubernetes Monitoring

# ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Scenario 2: Custom Metrics

# Python application exposing request count and latency metrics
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency', ['method', 'endpoint'])

@REQUEST_LATENCY.labels(method='GET', endpoint='/api').time()
def handle_request():
    REQUEST_COUNT.labels(method='GET', endpoint='/api', status='200').inc()
    # ...

start_http_server(8000)

Scenario 3: SLO Monitoring

# Availability SLO (99.9%)
1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d])))

# Error budget burn rate over 7d (observed error ratio vs. the allowed 0.1%)
(sum(rate(http_requests_total{status=~"5.."}[7d])) / sum(rate(http_requests_total[7d]))) / (1 - 0.999)

# Latency SLO (P99 < 500ms)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[30d])) by (le)) < 0.5
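
The arithmetic behind these queries is easy to verify by hand. A minimal sketch in plain Python; the observed error ratio is an illustrative assumption standing in for the output of the 7-day error-rate query above:

# Error budget math for a 99.9% availability SLO
slo = 0.999
budget = 1 - slo                 # allowed error ratio: 0.001
observed_error_ratio = 0.0004    # assumed result of the 7d error-rate query
burn_rate = observed_error_ratio / budget
print(f"error budget consumed: {burn_rate:.0%}")  # -> error budget consumed: 40%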

Scenario 4: Alert Silencing

# Create a silence
amtool silence add alertname=HighCPUUsage instance=node1 --duration=2h --comment="Maintenance"

# List active silences
amtool silence query

# Expire a silence
amtool silence expire <silence-id>

Troubleshooting

Common problems and where to look:
• Missing metrics: check the scrape config and target status
• Alerts not firing: check rule syntax and the Alertmanager configuration
• Slow queries: optimize the PromQL or increase the scrape interval
• Storage full: adjust retention and clean up old data

# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets

# Check alert rules
curl http://prometheus:9090/api/v1/rules

# Check Alertmanager status
curl http://alertmanager:9093/api/v1/status

# Test a PromQL query
curl 'http://prometheus:9090/api/v1/query?query=up'

Source

git clone https://github.com/chaterm/terminal-skills
# Skill file: devops/monitoring/SKILL.md

Overview

The monitoring skill focuses on Prometheus data collection, PromQL queries, alert rules, Alertmanager routing, and Grafana visualization and dashboard building. Use it to quickly generate metric queries, write alert rules, and configure notification channels, establishing continuous observability of system health.

How This Skill Works

This skill centers on configuring Prometheus data sources, writing and optimizing PromQL queries, defining alert rules, and routing notifications through Alertmanager. You define scrape_configs to collect from targets, create rule files that trigger alerts, configure Alertmanager to dispatch alerts to channels such as Slack and PagerDuty, and use Grafana dashboards to visualize and analyze key metrics.
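
Everything in that loop can also be driven programmatically through the Prometheus HTTP API, the same /api/v1 endpoints the curl examples in the Troubleshooting section above use. A minimal sketch, assuming Prometheus is reachable at http://localhost:9090 and the third-party requests package is installed:

import requests  # pip install requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus address

def instant_query(expr: str) -> list:
    """Run an instant PromQL query against the /api/v1/query endpoint."""
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": expr}, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# Per-job request rate, reusing a query from the Common Queries section
for series in instant_query('sum by (job) (rate(http_requests_total[5m]))'):
    print(series["metric"].get("job", "<none>"), series["value"][1])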

When to Use It

  • Monitoring the health of Kubernetes clusters and applications
  • Observing and alerting on custom business and SRE metrics
  • Alert severity levels, routing, and silencing to reduce alert noise
  • SLA/SLO monitoring, error budgets, and latency analysis
  • Continuous monitoring and trend analysis for troubleshooting and capacity planning

Quick Start

  1. Install Prometheus, Alertmanager, and Grafana, and make sure they can reach one another on the network.
  2. Configure Prometheus scrape_configs to collect node, application, and Kubernetes Pod metrics.
  3. Write basic PromQL queries and create simple alert rules such as HighCPUUsage and HighMemoryUsage in rules/alerts.yml.
  4. Configure Alertmanager routes and receivers (e.g. Slack, PagerDuty), add the Prometheus data source in Grafana, and build your first dashboard. To verify all three components are up, see the health-check sketch after this list.
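
Step 1 can be sanity-checked by polling each component's health endpoint. A minimal standard-library sketch; the localhost ports are assumptions matching a default install, and /-/healthy (Prometheus, Alertmanager) and /api/health (Grafana) are the components' built-in health endpoints:

import urllib.request

# Assumed default ports; adjust to your deployment.
ENDPOINTS = {
    "prometheus":   "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana":      "http://localhost:3000/api/health",
}

for name, url in ENDPOINTS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")  # 200 means the component is up
    except Exception as exc:
        print(f"{name}: FAILED ({exc})")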

Best Practices

  • Prefer rate()/irate() when computing per-second rates over monotonically increasing counters, rather than graphing raw counter values, to avoid misleading results.
  • For latency metrics, use quantiles such as histogram_quantile(0.95/0.99) to reflect the shape of the distribution (see the bucket sketch after this list).
  • Label alerts and set a reasonable for duration to reduce alert flapping and duplicate notifications.
  • Use clear routing rules per severity branch, sending critical and warning alerts to separate notification channels.
  • Regularly verify alert rules and receiver channels, and test silences with amtool to make sure alerting behaves as expected.
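
On the histogram point: histogram_quantile() can only resolve quantiles as finely as the bucket boundaries allow, so buckets should cluster around the latency targets you care about. A minimal prometheus_client sketch; the bucket edges and the process() handler are illustrative assumptions:

from prometheus_client import Histogram

# Buckets concentrated around a hypothetical 500 ms P99 target; the
# client's default buckets may be too coarse near your SLO threshold.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def process():
    ...  # hypothetical request handler

# time() works as a context manager (and as a decorator)
with REQUEST_LATENCY.labels(method="GET", endpoint="/api").time():
    process()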

Example Use Cases

  • Scenario 1: Node resource monitoring. Watch CPU, memory, disk, and network usage and alert when thresholds are exceeded.
  • Scenario 2: Application 5xx error-rate and high-latency alerts, combining rate(http_requests_total[5m]) with histogram_quantile(0.99, rate(...[5m])).
  • Scenario 3: A Kubernetes ServiceMonitor configuration for auto-discovering and scraping Pod metrics.
  • Scenario 4: Custom metric exposure, e.g. exposing request counts and latency from a Prometheus client library inside the application, observable in Prometheus.
  • Scenario 5: SLO monitoring and error budgets, using availability and latency-distribution metrics to compute SLO attainment and trigger alerts.
