observability-specialist

npx machina-cli add skill k1lgor/virtual-company/19-observability-specialist --openclaw
Files (1)
SKILL.md
3.1 KB

Observability Specialist

You ensure systems are observable, debuggable, and reliable through metrics, logs, and traces.

When to use

  • "Set up monitoring for this app."
  • "Create an alert for high latency."
  • "Debug this production issue using logs."
  • "Implement distributed tracing."

Instructions

  1. Structured Logging:
    • Use JSON format for logs.
    • Include essential fields: timestamp, level, service, trace_id, message.
    • Log at appropriate levels (ERROR for faults, INFO for state changes, DEBUG for details).
  2. Metrics:
    • Track the "Golden Signals": Latency, Traffic, Errors, and Saturation.
    • Use Prometheus-style metrics (Counters, Gauges, Histograms).
  3. Tracing:
    • Implement OpenTelemetry or similar for distributed tracing.
    • Ensure trace context propagates across service boundaries.
  4. Dashboards & Alerts:
    • Create dashboards to visualize system health.
    • Define alerts on meaningful symptoms (e.g., user-facing error rate) rather than only internal causes (e.g., high CPU).
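
For point 4, a symptom-based alert could look like the following Prometheus rule sketch. The metric names match Example 2 below; the 5% threshold, 10-minute duration, and label values are illustrative choices, not prescriptions:

```yaml
groups:
  - name: symptom-alerts
    rules:
      - alert: HighUserErrorRate
        # Alert on the symptom (user-visible 5xx rate), not the cause (CPU, memory).
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "User error rate above 5% for 10 minutes"
```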

Examples

1. Structured Logging with JSON

import logging
import json
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # utcnow() is deprecated
            "level": record.levelname,
            "service": "my-service",
            "message": record.getMessage(),
            "trace_id": getattr(record, 'trace_id', None)
        }
        return json.dumps(log_data)

logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("User logged in", extra={"trace_id": "abc123"})

2. Prometheus Metrics

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_latency = Histogram('http_request_duration_seconds', 'HTTP request latency')
active_users = Gauge('active_users', 'Number of active users')

# Track metrics
@request_latency.time()
def handle_request(method, endpoint):
    # Your logic here
    time.sleep(0.1)
    request_count.labels(method=method, endpoint=endpoint, status='200').inc()

# Start metrics server
start_http_server(8000)

3. OpenTelemetry Distributed Tracing

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Setup: create a provider, attach an exporter, then register the provider globally
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Usage
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")
    # Your business logic
    with tracer.start_as_current_span("validate_payment"):
        # Payment validation logic
        pass
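
The instructions also call for trace context to propagate across service boundaries. With OpenTelemetry installed this is handled by `opentelemetry.propagate.inject`/`extract`; as a dependency-free sketch, the W3C `traceparent` header those helpers read and write can be built and parsed by hand (the ID lengths follow the Trace Context spec; the header values here are illustrative):

```python
import re
import secrets

def make_traceparent(trace_id=None, span_id=None):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def parse_traceparent(header):
    """Extract trace context from an incoming traceparent header, or None."""
    m = re.fullmatch(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    return {"version": m.group(1), "trace_id": m.group(2),
            "span_id": m.group(3), "flags": m.group(4)}

# Service A attaches the header to an outgoing request...
outgoing = {"traceparent": make_traceparent()}
# ...and service B recovers the same trace_id, so spans on both sides join one trace.
ctx = parse_traceparent(outgoing["traceparent"])
assert ctx is not None and ctx["trace_id"] == outgoing["traceparent"].split("-")[1]
```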

Source

git clone https://github.com/k1lgor/virtual-company

The skill file lives at skills/19-observability-specialist/SKILL.md (https://github.com/k1lgor/virtual-company/blob/main/skills/19-observability-specialist/SKILL.md).

Overview

An observability specialist ensures systems are observable, debuggable, and reliable through metrics, logs, and traces. They implement structured logging, Prometheus-style metrics, and distributed tracing to improve visibility and uptime.

How This Skill Works

Implementation centers on three pillars: structured JSON logging with fields like timestamp, level, service, and trace_id; Prometheus-style metrics (counters, gauges, histograms) to track latency, traffic, errors, and saturation; and OpenTelemetry-based tracing with trace context propagated across services. Dashboards distill health data, and alerts trigger on meaningful, user-visible symptoms.

When to Use It

  • Set up monitoring for this app to gain end-to-end visibility.
  • Create an alert for high latency affecting users.
  • Debug a production issue by correlating logs, metrics, and traces.
  • Implement distributed tracing across service boundaries to diagnose bottlenecks.
  • Design dashboards that visualize health and alerting signals for on-call readiness.

Quick Start

  1. Enable structured JSON logging with fields: timestamp, level, service, trace_id, message.
  2. Define Prometheus metrics for latency, traffic, errors, and saturation; expose a /metrics endpoint.
  3. Instrument services with OpenTelemetry and configure dashboards and alerts focused on user-visible symptoms.
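
Steps 1 and 3 meet in log/trace correlation: every log line should carry the trace_id of the request it belongs to. The sketch below shows the pattern with a stdlib-only `logging.Filter`; a real service would read the ID from OpenTelemetry's current span rather than the hand-set context variable used here for illustration:

```python
import contextvars
import logging

# Hypothetical stand-in for the active trace context (real code: the current span).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every record with the trace_id that was active when it was logged."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("abc123")
logger.info("order processed")  # emitted with trace=abc123
```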

Best Practices

  • Use JSON logs with essential fields: timestamp, level, service, trace_id, and message; log at appropriate levels (ERROR, INFO, DEBUG).
  • Track the Golden Signals—Latency, Traffic, Errors, and Saturation—using Prometheus-style metrics (counters, gauges, histograms).
  • Implement OpenTelemetry or an equivalent tracing system and ensure trace context propagates across service boundaries.
  • Create dashboards that visualize system health and enable alerts on meaningful symptoms rather than solely internal metrics.
  • Prefer user-focused alerts (e.g., rising error rate or latency) to reduce alert fatigue and improve on-call response.
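
The last two practices can be made concrete: fire when the user-visible error rate over a recent window crosses a threshold, rather than on raw CPU. A stdlib-only sketch (the 100-request window and 5% threshold are illustrative choices):

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the failure fraction in a sliding window exceeds a threshold."""
    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = request failed
        self.threshold = threshold

    def record(self, failed):
        self.outcomes.append(failed)

    def firing(self):
        if not self.outcomes:
            return False
        return sum(self.outcomes) / len(self.outcomes) > self.threshold

alert = ErrorRateAlert(window=100, threshold=0.05)
for _ in range(95):
    alert.record(False)   # 95 successes
for _ in range(5):
    alert.record(True)    # 5 failures -> exactly 5%, not strictly above threshold
assert not alert.firing()
alert.record(True)        # window slides: 94 successes, 6 failures -> 6% > 5%
assert alert.firing()
```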

Example Use Cases

  • Structured logging in JSON from a Python service including timestamp, level, service, and trace_id for traceability.
  • Prometheus metrics for HTTP requests: http_requests_total (counter) and http_request_duration_seconds (histogram).
  • OpenTelemetry tracing across microservices to model a user request from gateway to downstream services.
  • A dashboard showing latency, request rate, error rate, and saturation to monitor system health.
  • Alerts that trigger on elevated user latency or error rate instead of CPU spikes.
