service-mesh-observability
npx machina-cli add skill wshobson/agents/service-mesh-observability --openclawService Mesh Observability
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
When to Use This Skill
- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity
Core Concepts
1. Three Pillars of Observability
┌─────────────────────────────────────────────────────┐
│ Observability │
├─────────────────┬─────────────────┬─────────────────┤
│ Metrics │ Traces │ Logs │
│ │ │ │
│ • Request rate │ • Span context │ • Access logs │
│ • Error rate │ • Latency │ • Error details │
│ • Latency P50 │ • Dependencies │ • Debug info │
│ • Saturation │ • Bottlenecks │ • Audit trail │
└─────────────────┴─────────────────┴─────────────────┘
2. Golden Signals for Mesh
| Signal | Description | Alert Threshold |
|---|---|---|
| Latency | Request duration P50, P99 | P99 > 500ms |
| Traffic | Requests per second | Anomaly detection |
| Errors | 5xx error rate | > 1% |
| Saturation | Resource utilization | > 80% |
Templates
Template 1: Istio with Prometheus & Grafana
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus
namespace: istio-system
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'istio-mesh'
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- istio-system
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: istio-mesh
namespace: istio-system
spec:
selector:
matchLabels:
app: istiod
endpoints:
- port: http-monitoring
interval: 15s
Template 2: Key Istio Metrics Queries
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
# P99 latency
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
# Request size
histogram_quantile(0.99,
sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
Template 3: Jaeger Distributed Tracing
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
enableTracing: true
defaultConfig:
tracing:
sampling: 100.0 # 100% in dev, lower in prod
zipkin:
address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: istio-system
spec:
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.50
ports:
- containerPort: 5775 # UDP
- containerPort: 6831 # Thrift
- containerPort: 6832 # Thrift
- containerPort: 5778 # Config
- containerPort: 16686 # UI
- containerPort: 14268 # HTTP
- containerPort: 14250 # gRPC
- containerPort: 9411 # Zipkin
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
Template 4: Linkerd Viz Dashboard
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -
# Access dashboard
linkerd viz dashboard
# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app
# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend
# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend
# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
Template 5: Grafana Dashboard JSON
{
"dashboard": {
"title": "Service Mesh Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{ "value": 0, "color": "green" },
{ "value": 1, "color": "yellow" },
{ "value": 5, "color": "red" }
]
}
}
}
},
{
"title": "P99 Latency",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Service Topology",
"type": "nodeGraph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
}
]
}
]
}
}
Template 6: Kiali Service Mesh Visualization
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
name: kiali
namespace: istio-system
spec:
auth:
strategy: anonymous # or openid, token
deployment:
accessible_namespaces:
- "**"
external_services:
prometheus:
url: http://prometheus.istio-system:9090
tracing:
url: http://jaeger-query.istio-system:16686
grafana:
url: http://grafana.istio-system:3000
Template 7: OpenTelemetry Integration
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
zipkin:
endpoint: 0.0.0.0:9411
processors:
batch:
timeout: 10s
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp, zipkin]
processors: [batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: otel
randomSamplingPercentage: 10
Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: mesh-alerts
namespace: istio-system
spec:
groups:
- name: mesh.rules
rules:
- alert: HighErrorRate
expr: |
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
/ sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.destination_service_name }}"
- alert: HighLatency
expr: |
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
by (le, destination_service_name)) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency for {{ $labels.destination_service_name }}"
- alert: MeshCertExpiring
expr: |
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
labels:
severity: warning
annotations:
summary: "Mesh certificate expiring in less than 7 days"
Best Practices
Do's
- Sample appropriately - 100% in dev, 1-10% in prod
- Use trace context - Propagate headers consistently
- Set up alerts - For golden signals
- Correlate metrics/traces - Use exemplars
- Retain strategically - Hot/cold storage tiers
Don'ts
- Don't over-sample - Storage costs add up
- Don't ignore cardinality - Limit label values
- Don't skip dashboards - Visualize dependencies
- Don't forget costs - Monitor observability costs
Resources
Source
git clone https://github.com/wshobson/agents/blob/main/plugins/cloud-infrastructure/skills/service-mesh-observability/SKILL.mdView on GitHub Overview
Provides end-to-end observability for Istio, Linkerd, and other mesh deployments by combining metrics, distributed tracing, and visual dashboards. It helps you set up mesh monitoring, diagnose latency issues, and define SLOs for inter-service communication.
How This Skill Works
Observability rests on the three pillars: metrics, traces, and logs. It uses golden signals (latency, traffic, errors, saturation) and common queries like istio_requests_total and istio_request_duration_milliseconds_bucket. Prometheus collects metrics via ServiceMonitor, Grafana renders dashboards, and Jaeger (or compatible tracers) captures distributed traces; Istio telemetry is configured to emit the data.
When to Use It
- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
Quick Start
- Step 1: Install Prometheus and configure a ServiceMonitor for istio-system to scrape mesh metrics.
- Step 2: Install Grafana and import Istio dashboards or create custom panels for requests, latency, and errors.
- Step 3: Enable tracing by installing Jaeger (or chosen tracer) and configure Istio tracing sampling (e.g., 100% in dev, adjust in prod).
Best Practices
- Standardize trace context and correlation across services for end-to-end visibility.
- Use the Golden Signals to drive alerts: latency (P99), error rate, and saturation thresholds.
- Deploy Prometheus + Grafana and configure a ServiceMonitor for istio-system to collect metrics.
- Enable Jaeger tracing with sensible sampling (e.g., 100% in dev, tuned in prod) and connect it to Istio telemetry.
- Start with templated dashboards and PromQL queries (e.g., istio_requests_total, istio_request_duration_milliseconds_bucket) and iterate.
Example Use Cases
- Istio with Prometheus & Grafana: deploy example ServiceMonitor and Prometheus config to scrape mesh metrics and build dashboards.
- Jaeger tracing for Istio: IstioOperator config enabling tracing and a Jaeger deployment for end-to-end traces.
- PromQL for traffic and errors: queries showing request rate by destination and 5xx error percentage.
- Grafana dashboards: visualizing service dependencies and latencies across the mesh.
- SLO-based alerting: define P99 latency targets and alert when thresholds are breached.