observability-specialist

npx machina-cli add skill k1lgor/virtual-company/19-observability-specialist --openclaw
Files (1)
SKILL.md
3.1 KB

Observability Specialist

You ensure systems are observable, debuggable, and reliable through metrics, logs, and traces.

When to use

  • "Set up monitoring for this app."
  • "Create an alert for high latency."
  • "Debug this production issue using logs."
  • "Implement distributed tracing."

Instructions

  1. Structured Logging:
    • Use JSON format for logs.
    • Include essential fields: timestamp, level, service, trace_id, message.
    • Log at appropriate levels (ERROR for faults, INFO for state changes, DEBUG for details).
  2. Metrics:
    • Track the "Golden Signals": Latency, Traffic, Errors, and Saturation.
    • Use Prometheus-style metrics (Counters, Gauges, Histograms).
  3. Tracing:
    • Implement OpenTelemetry or similar for distributed tracing.
    • Ensure trace context propagates across service boundaries.
  4. Dashboards & Alerts:
    • Create dashboards to visualize system health.
    • Define alerts on meaningful symptoms (e.g., user-facing error rate) rather than only internal causes (e.g., high CPU).
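
For point 4, a symptom-based alert could look like the following Prometheus rule sketch. The metric names match Example 2 below; the 5% threshold, 10-minute duration, and label values are illustrative choices, not prescriptions:

```yaml
groups:
  - name: symptom-alerts
    rules:
      - alert: HighUserErrorRate
        # Alert on the symptom (user-visible 5xx rate), not the cause (CPU, memory).
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "User error rate above 5% for 10 minutes"
```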

Examples

1. Structured Logging with JSON

import logging
import json
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # utcnow() is deprecated
            "level": record.levelname,
            "service": "my-service",
            "message": record.getMessage(),
            "trace_id": getattr(record, 'trace_id', None)
        }
        return json.dumps(log_data)

logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("User logged in", extra={"trace_id": "abc123"})

2. Prometheus Metrics

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_latency = Histogram('http_request_duration_seconds', 'HTTP request latency')
active_users = Gauge('active_users', 'Number of active users')

# Track metrics
@request_latency.time()
def handle_request(method, endpoint):
    # Your logic here
    time.sleep(0.1)
    request_count.labels(method=method, endpoint=endpoint, status='200').inc()

# Start metrics server
start_http_server(8000)

3. OpenTelemetry Distributed Tracing

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Setup: create a provider, attach an exporter, then register the provider globally
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Usage
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")
    # Your business logic
    with tracer.start_as_current_span("validate_payment"):
        # Payment validation logic
        pass
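
The instructions also call for trace context to propagate across service boundaries. With OpenTelemetry installed this is handled by `opentelemetry.propagate.inject`/`extract`; as a dependency-free sketch, the W3C `traceparent` header those helpers read and write can be built and parsed by hand (the ID lengths follow the Trace Context spec; the header values here are illustrative):

```python
import re
import secrets

def make_traceparent(trace_id=None, span_id=None):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def parse_traceparent(header):
    """Extract trace context from an incoming traceparent header, or None."""
    m = re.fullmatch(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    return {"version": m.group(1), "trace_id": m.group(2),
            "span_id": m.group(3), "flags": m.group(4)}

# Service A attaches the header to an outgoing request...
outgoing = {"traceparent": make_traceparent()}
# ...and service B recovers the same trace_id, so spans on both sides join one trace.
ctx = parse_traceparent(outgoing["traceparent"])
assert ctx is not None and ctx["trace_id"] == outgoing["traceparent"].split("-")[1]
```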

Source

git clone https://github.com/k1lgor/virtual-company

The skill file lives at skills/19-observability-specialist/SKILL.md (https://github.com/k1lgor/virtual-company/blob/main/skills/19-observability-specialist/SKILL.md).

Overview

An observability specialist ensures systems are observable, debuggable, and reliable through metrics, logs, and traces. They implement structured logging, Prometheus-style metrics, and distributed tracing to improve visibility and uptime.

How This Skill Works

Implementation centers on three pillars: structured JSON logging with fields like timestamp, level, service, and trace_id; Prometheus-style metrics (counters, gauges, histograms) to track latency, traffic, errors, and saturation; and OpenTelemetry-based tracing with trace context propagated across services. Dashboards distill health data, and alerts trigger on meaningful, user-visible symptoms.

When to Use It

  • Set up monitoring for this app to gain end-to-end visibility.
  • Create an alert for high latency affecting users.
  • Debug a production issue by correlating logs, metrics, and traces.
  • Implement distributed tracing across service boundaries to diagnose bottlenecks.
  • Design dashboards that visualize health and alerting signals for on-call readiness.

Quick Start

  1. Enable structured JSON logging with fields: timestamp, level, service, trace_id, message.
  2. Define Prometheus metrics for latency, traffic, errors, and saturation; expose a /metrics endpoint.
  3. Instrument services with OpenTelemetry and configure dashboards and alerts focused on user-visible symptoms.
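
Steps 1 and 3 meet in log/trace correlation: every log line should carry the trace_id of the request it belongs to. The sketch below shows the pattern with a stdlib-only `logging.Filter`; a real service would read the ID from OpenTelemetry's current span rather than the hand-set context variable used here for illustration:

```python
import contextvars
import logging

# Hypothetical stand-in for the active trace context (real code: the current span).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every record with the trace_id that was active when it was logged."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("abc123")
logger.info("order processed")  # emitted with trace=abc123
```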

Best Practices

  • Use JSON logs with essential fields: timestamp, level, service, trace_id, and message; log at appropriate levels (ERROR, INFO, DEBUG).
  • Track the Golden Signals—Latency, Traffic, Errors, and Saturation—using Prometheus-style metrics (counters, gauges, histograms).
  • Implement OpenTelemetry or an equivalent tracing system and ensure trace context propagates across service boundaries.
  • Create dashboards that visualize system health and enable alerts on meaningful symptoms rather than solely internal metrics.
  • Prefer user-focused alerts (e.g., rising error rate or latency) to reduce alert fatigue and improve on-call response.
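
The last two practices can be made concrete: fire when the user-visible error rate over a recent window crosses a threshold, rather than on raw CPU. A stdlib-only sketch (the 100-request window and 5% threshold are illustrative choices):

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the failure fraction in a sliding window exceeds a threshold."""
    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = request failed
        self.threshold = threshold

    def record(self, failed):
        self.outcomes.append(failed)

    def firing(self):
        if not self.outcomes:
            return False
        return sum(self.outcomes) / len(self.outcomes) > self.threshold

alert = ErrorRateAlert(window=100, threshold=0.05)
for _ in range(95):
    alert.record(False)   # 95 successes
for _ in range(5):
    alert.record(True)    # 5 failures -> exactly 5%, not strictly above threshold
assert not alert.firing()
alert.record(True)        # window slides: 94 successes, 6 failures -> 6% > 5%
assert alert.firing()
```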

Example Use Cases

  • Structured logging in JSON from a Python service including timestamp, level, service, and trace_id for traceability.
  • Prometheus metrics for HTTP requests: http_requests_total (counter) and http_request_duration_seconds (histogram).
  • OpenTelemetry tracing across microservices to model a user request from gateway to downstream services.
  • A dashboard showing latency, request rate, error rate, and saturation to monitor system health.
  • Alerts that trigger on elevated user latency or error rate instead of CPU spikes.
