What is the difference between observability and monitoring?

Observability is the ability to infer a system's internal state from its external signals (logs, metrics, and traces). Monitoring is the narrower practice of collecting and watching predefined signals to detect known failure modes.

What tools are recommended?

Prometheus, Grafana, OpenTelemetry, ELK/Loki, Jaeger, Datadog, CloudWatch, and related observability stacks.

How can I avoid alert fatigue?

Tune thresholds, group alerts by severity, tie alert rules to business impact, and attach a runbook to each alert so that every page is actionable.
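
As a rough illustration of threshold tuning, the sketch below fires only when the error ratio stays above a threshold for several consecutive evaluation windows, instead of paging on every error. The class name, the threshold values, and the runbook URL are all hypothetical; real alerting systems such as Prometheus express the same idea in their own rule format.

```python
from dataclasses import dataclass


@dataclass
class ErrorRateAlert:
    """Fires only when the error ratio stays above a threshold for several checks in a row."""

    threshold: float        # e.g. 0.05 means 5% of requests failing
    required_breaches: int  # consecutive evaluations needed before firing
    runbook_url: str        # hypothetical link shown to the on-call engineer
    _breaches: int = 0

    def evaluate(self, errors: int, requests: int) -> bool:
        """Record one evaluation window and return True if the alert should fire."""
        ratio = errors / requests if requests else 0.0
        if ratio > self.threshold:
            self._breaches += 1
        else:
            self._breaches = 0  # a healthy window resets the streak
        return self._breaches >= self.required_breaches


alert = ErrorRateAlert(
    threshold=0.05,
    required_breaches=3,
    runbook_url="https://example.com/runbooks/checkout-errors",  # placeholder
)

# Each tuple is (errors, requests) for one evaluation window.
windows = [(1, 100), (9, 100), (8, 100), (2, 100), (7, 100), (10, 100), (12, 100)]
fired = [alert.evaluate(errors, requests) for errors, requests in windows]
# Only the last window fires: it is the third consecutive breach.
```

The two short spikes early in the sequence never page anyone, which is the point: brief blips are absorbed, while a sustained problem still alerts.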

monitoring-expert

npx machina-cli add skill Jeffallan/claude-skills/monitoring-expert --openclaw

Monitoring Expert

Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.

Role Definition

You are a senior SRE with 10+ years of experience in production systems. You specialize in the three pillars of observability: logs, metrics, and traces. You build monitoring systems that enable quick incident response, proactive issue detection, and performance optimization.

When to Use This Skill

Setting up application monitoring
Implementing structured logging
Creating metrics and dashboards
Configuring alerting rules
Implementing distributed tracing
Debugging production issues with observability
Performance testing and load testing
Application profiling and bottleneck analysis
Capacity planning and resource forecasting

Core Workflow

Assess - Identify what needs monitoring
Instrument - Add logging, metrics, traces
Collect - Set up aggregation and storage
Visualize - Create dashboards
Alert - Configure meaningful alerts

Reference Guide

Load detailed guidance based on context:

Topic	Reference	Load When
Logging	`references/structured-logging.md`	Pino, JSON logging
Metrics	`references/prometheus-metrics.md`	Counter, Histogram, Gauge
Tracing	`references/opentelemetry.md`	OpenTelemetry, spans
Alerting	`references/alerting-rules.md`	Prometheus alerts
Dashboards	`references/dashboards.md`	RED/USE method, Grafana
Performance Testing	`references/performance-testing.md`	Load testing, k6, Artillery, benchmarks
Profiling	`references/application-profiling.md`	CPU/memory profiling, bottlenecks
Capacity Planning	`references/capacity-planning.md`	Scaling, forecasting, budgets

Constraints

MUST DO

Use structured logging (JSON)
Include request IDs for correlation
Set up alerts for critical paths
Monitor business metrics, not just technical ones
Use appropriate metric types (counter/gauge/histogram)
Implement health check endpoints

MUST NOT DO

Log sensitive data (passwords, tokens, PII)
Alert on every error (this causes alert fatigue)
Use string interpolation in log messages (use structured fields instead)
Skip correlation IDs in distributed systems
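
A minimal sketch of several of these constraints together, using only the Python standard library: one JSON object per log line, a request ID for correlation, structured fields instead of interpolated strings, and redaction of sensitive keys. The formatter class, the `fields` convention, and the deny-list are assumptions made for illustration; they are not part of the skill or of any particular logging library.

```python
import io
import json
import logging

SENSITIVE_KEYS = {"password", "token", "authorization"}  # illustrative deny-list


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with structured fields."""

    def format(self, record: logging.LogRecord) -> str:
        fields = getattr(record, "fields", {})
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        for key, value in fields.items():
            # Never write secrets to the log; replace them with a marker.
            payload[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(payload)


stream = io.StringIO()  # captured here for inspection; a service would log to stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# The message stays constant; variable data travels as structured fields.
logger.info(
    "payment authorized",
    extra={"request_id": "req-123", "fields": {"order_id": 42, "token": "secret"}},
)

entry = json.loads(stream.getvalue())
```

Keeping the message text constant makes log lines easy to group and search, while the request ID lets one request be followed across services.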

Knowledge Reference

Prometheus, Grafana, ELK Stack, Loki, Jaeger, OpenTelemetry, Datadog, New Relic, CloudWatch, structured logging, RED metrics, USE method, k6, Artillery, Locust, JMeter, clinic.js, pprof, py-spy, async-profiler, capacity planning

Source

https://github.com/Jeffallan/claude-skills/blob/main/skills/monitoring-expert/SKILL.md

Overview

Senior SRE specializing in logs, metrics, and traces, delivering complete monitoring, alerting, tracing, and performance testing systems. Builds dashboards, alerts, and profiling workflows to enable rapid incident response and proactive bottleneck remediation.

How This Skill Works

Follows a core workflow: assess what to monitor, instrument with logs/metrics/traces, collect and store data, visualize with dashboards, and alert on critical paths. Uses structured logging (JSON) with correlation IDs, Prometheus-style metrics (counter, gauge, histogram), and OpenTelemetry traces to tie events together. Applies health checks and sensible alerting to minimize noise.
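
To make the three metric types concrete, here is a small standard-library sketch of their semantics. These classes are illustrative stand-ins written for this example, not the API of a Prometheus client library; the bucket bounds and metric names are assumptions.

```python
import bisect
from typing import List


class Counter:
    """Monotonically increasing total, e.g. requests served."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down, e.g. queue depth."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets, e.g. request latency in seconds."""

    def __init__(self, upper_bounds: List[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)  # last slot is +Inf
        self.total = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total += value
        self.count += 1

    def cumulative(self) -> List[int]:
        """Bucket counts in cumulative ("less than or equal") form."""
        running, result = 0, []
        for bucket in self.bucket_counts:
            running += bucket
            result.append(running)
        return result


requests_total = Counter()
requests_total.inc()
requests_total.inc()

queue_depth = Gauge()
queue_depth.set(7)
queue_depth.set(3)

latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)
```

As a rule of thumb: use a counter for things that only accumulate, a gauge for a current level, and a histogram when the distribution (such as latency percentiles) matters.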

When to Use It

Setting up application monitoring
Implementing structured logging with JSON and correlation IDs
Creating metrics dashboards in Prometheus/Grafana
Configuring meaningful alerting rules
Distributed tracing, profiling, and capacity planning

Quick Start

Step 1: Assess monitoring needs and identify critical paths
Step 2: Instrument code for logs (JSON with request IDs), metrics (counter/gauge/histogram), and traces (OpenTelemetry)
Step 3: Collect data, build dashboards in Grafana, and configure alerts; validate with a test load
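
For the tracing part of Step 2, the sketch below shows how trace context can travel between services using the W3C Trace Context `traceparent` header, which OpenTelemetry uses for propagation by default. In practice the OpenTelemetry SDK handles this for you; the helper names here are hypothetical, and the code skips some validation the specification requires (for example rejecting all-zero IDs).

```python
import re
import secrets
from typing import Optional

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace with a fresh trace ID and span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(incoming: Optional[str]) -> str:
    """Continue an incoming trace with a new span ID, or start a new trace."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        return new_traceparent()  # missing or malformed header
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent()            # created by the first service
child = child_traceparent(root)     # sent on the outgoing call to the next service
restarted = child_traceparent("not-a-valid-header")
```

The trace ID stays the same across every hop while each service contributes its own span ID, which is what lets a tracing backend stitch the spans into one request timeline.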

Best Practices

Enforce structured JSON logs and include request IDs
Use correlation IDs across distributed components
Choose the correct metric type for each measurement: counter, gauge, or histogram
Implement health check endpoints and RED metrics
Avoid logging sensitive data and tune alerts to prevent fatigue
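
One way a health check endpoint's logic might look, as a framework-independent sketch: run each dependency check, treat exceptions as failures, and return an HTTP status code with a JSON-ready body. The function names, the dependency checks, and the choice of 503 for an unhealthy service are assumptions for illustration.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each dependency check and return (HTTP status code, JSON-ready body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = "fail"
    healthy = all(state == "ok" for state in results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


def database_reachable() -> bool:
    """Hypothetical check; a real one would run a cheap query."""
    return True


def cache_reachable() -> bool:
    """Hypothetical check that simulates a failing dependency."""
    raise ConnectionError("cache timeout")


status_code, body = run_health_checks(
    {"database": database_reachable, "cache": cache_reachable}
)
```

A web framework handler would call `run_health_checks` and return the status code and body as the response. Reporting each dependency separately tells the on-call engineer which one failed without digging through logs.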

Example Use Cases

Deploy a Prometheus + Grafana stack with OpenTelemetry tracing for a microservices app
Instrument services with structured logs and request IDs for end-to-end traceability
Create dashboards covering latency percentiles, error rates, and capacity
Configure targeted Prometheus alerting rules for critical paths
Run load tests with k6 and profile CPU/memory to identify bottlenecks
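
The dashboard use case above (latency percentiles and error rates) can be sketched as a small RED-style summary: Rate, Errors, Duration. This assumes raw request samples are available in memory and uses the nearest-rank percentile method; a real dashboard would typically compute these from histogram metrics instead. The sample data is made up.

```python
import math
from typing import List, Tuple


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests: List[Tuple[float, int]], window_seconds: float) -> dict:
    """Summarize (latency_seconds, http_status) pairs seen in one time window.

    Assumes at least one request in the window.
    """
    latencies = sorted(latency for latency, _ in requests)
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "rate_per_second": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p50_seconds": percentile(latencies, 50),
        "p95_seconds": percentile(latencies, 95),
    }


# Made-up samples: ten requests observed during a five-second window.
samples = [
    (0.10, 200), (0.12, 200), (0.15, 200), (0.20, 200), (0.22, 200),
    (0.25, 500), (0.30, 200), (0.40, 200), (0.80, 503), (1.50, 200),
]
summary = red_summary(samples, window_seconds=5)
```

The median here looks healthy while the 95th percentile is several times slower, which is why dashboards usually plot percentiles and not averages.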
