gcloud-usage
npx machina-cli add skill fcakyon/claude-codex-settings/gcloud-usage --openclaw
GCP Observability Best Practices
Structured Logging
JSON Log Format
Use structured JSON logging for better queryability:
{
"severity": "ERROR",
"message": "Payment failed",
"httpRequest": { "requestMethod": "POST", "requestUrl": "/api/payment" },
"labels": { "user_id": "123", "transaction_id": "abc" },
"timestamp": "2025-01-15T10:30:00Z"
}
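To check how a structured payload lands in Cloud Logging, you can write a test entry from the CLI. A minimal sketch using gcloud logging write; the log name and field values are placeholders:
# Stored as jsonPayload with the given severity
gcloud logging write payment-service-test \
  '{"message": "Payment failed", "user_id": "123", "transaction_id": "abc"}' \
  --payload-type=json --severity=ERROR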
Severity Levels
Use appropriate severity for filtering:
- DEBUG: Detailed diagnostic info
- INFO: Normal operations, milestones
- NOTICE: Normal but significant events
- WARNING: Potential issues, degraded performance
- ERROR: Failures that don't stop the service
- CRITICAL: Failures requiring immediate action
- ALERT: Person must take action immediately
- EMERGENCY: System is unusable
Log Filtering Queries
Common Filters
# By severity
severity >= WARNING
# By resource
resource.type="cloud_run_revision"
resource.labels.service_name="my-service"
# By time
timestamp >= "2025-01-15T00:00:00Z"
# By text content
textPayload =~ "error.*timeout"
# By JSON field
jsonPayload.user_id = "123"
# Combined
severity >= ERROR AND resource.labels.service_name="api"
Advanced Queries
# Regex matching
textPayload =~ "status=[45][0-9]{2}"
# Substring search
textPayload : "connection refused"
# Multiple values
severity = (ERROR OR CRITICAL)
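The same filters work from the CLI as in the Logs Explorer. A sketch using gcloud logging read; the limit and freshness window are arbitrary choices:
# Recent ERROR+ entries for the "api" service, returned as JSON
gcloud logging read \
  'severity >= ERROR AND resource.labels.service_name="api"' \
  --limit=20 --freshness=1d --format=json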
Metrics vs Logs vs Traces
When to Use Each
Metrics: Aggregated numeric data over time
- Request counts, latency percentiles
- Resource utilization (CPU, memory)
- Business KPIs (orders/minute)
Logs: Detailed event records
- Error details and stack traces
- Audit trails
- Debugging specific requests
Traces: Request flow across services
- Latency breakdown by service
- Identifying bottlenecks
- Distributed system debugging
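When an alert has to fire on something that is only visible in log entries, a log-based metric can bridge logs and metrics. A minimal sketch, assuming a counter metric named payment_errors; verify the filter against your own payload fields:
# Count matching log entries as a metric that alert policies can reference
gcloud logging metrics create payment_errors \
  --description="Count of payment failure log entries" \
  --log-filter='severity >= ERROR AND jsonPayload.message:"Payment failed"'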
Alert Policy Design
Alert Best Practices
- Avoid alert fatigue: Only alert on actionable issues
- Use multi-condition alerts: Reduce noise from transient spikes
- Set appropriate windows: 5-15 min for most metrics
- Include runbook links: Help responders act quickly
Common Alert Patterns
Error rate:
- Condition: Error rate > 1% for 5 minutes
- Good for: Service health monitoring
Latency:
- Condition: P99 latency > 2s for 10 minutes
- Good for: Performance degradation detection
Resource exhaustion:
- Condition: Memory > 90% for 5 minutes
- Good for: Capacity planning triggers
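As a concrete sketch, a Cloud Run 5xx-rate policy can be written as a Cloud Monitoring alert policy in JSON and created with gcloud. The metric, threshold, and runbook URL below are placeholder assumptions, and a true percentage error rate would need a ratio condition (denominatorFilter) rather than this single-stream threshold:
{
  "displayName": "Cloud Run 5xx rate too high",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "5xx rate above 0.5 req/s for 5 minutes",
      "conditionThreshold": {
        "filter": "metric.type=\"run.googleapis.com/request_count\" AND resource.type=\"cloud_run_revision\" AND metric.labels.response_code_class=\"5xx\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.5,
        "duration": "300s",
        "aggregations": [
          { "alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_RATE" }
        ]
      }
    }
  ],
  "documentation": {
    "content": "Runbook: https://example.com/runbooks/cloud-run-5xx",
    "mimeType": "text/markdown"
  }
}
# Create the policy from the file above (command currently sits under alpha; verify for your gcloud version)
gcloud alpha monitoring policies create --policy-from-file=error-rate-policy.json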
Cost Optimization
Reducing Log Costs
- Exclusion filters: Drop verbose logs at ingestion
- Sampling: Log only a percentage of high-volume events (see the sample() sketch after the exclusion examples below)
- Shorter retention: Reduce default 30-day retention
- Downgrade logs: Route to cheaper storage buckets
Exclusion Filter Examples
# Exclude health checks
resource.type="cloud_run_revision" AND httpRequest.requestUrl="/health"
# Exclude debug logs in production
severity = DEBUG
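Sampling can also be expressed in an exclusion filter with the sample() function, and exclusions are attached to a log sink. A sketch that keeps only 10% of DEBUG entries and drops health checks from the _Default sink; the exclusion name and --add-exclusion syntax are assumptions to verify against your gcloud version:
# Exclusion filter matching ~90% of DEBUG entries (a 10% sample survives)
severity = DEBUG AND sample(insertId, 0.9)
# Attach an exclusion to the _Default sink
gcloud logging sinks update _Default \
  --add-exclusion='name=exclude-health-checks,filter=resource.type="cloud_run_revision" AND httpRequest.requestUrl="/health"'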
Debugging Workflow
- Start with metrics: Identify when issues started
- Correlate with logs: Filter logs around problem time
- Use traces: Follow specific requests across services
- Check resource logs: Look for infrastructure issues
- Compare baselines: Check against known-good periods
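When you move from traces back to logs, entries written with a trace context can be pulled up directly by trace ID, tying the log view to a single traced request across services. A sketch with placeholder project and trace IDs:
# All log entries for one traced request in the project
gcloud logging read \
  'trace="projects/my-project/traces/4bf92f3577b34da6a3ce929d0e0e4736"' \
  --format=json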
Source
https://github.com/fcakyon/claude-codex-settings/blob/main/plugins/gcloud-tools/skills/gcloud-usage/SKILL.md
Overview
This skill provides practical guidance for GCP observability, including how to structure JSON logs, craft filtering queries, and distinguish metrics, logs, and traces. It also covers alert design, cost optimization, and a debugging workflow to diagnose production issues on Google Cloud.
How This Skill Works
By enforcing structured JSON logs with severity and request metadata, using Cloud Logging filters and queries, and aligning data across metrics, logs, and traces, you can diagnose issues efficiently. The skill also prescribes alert patterns and cost-saving practices to maintain scalable observability.
When to Use It
- When querying Cloud Logging for severity, time, resources, and text payloads
- When structuring logs in JSON with fields like severity, httpRequest, labels
- When choosing between metrics, logs, and traces for observability
- When designing alert policies that minimize noise and include runbooks
- When reducing log costs with exclusion filters, sampling, and shorter retention
Quick Start
- Step 1: Start with metrics to identify when issues began
- Step 2: Correlate with logs around the problem time
- Step 3: Use traces to follow specific requests across services
Best Practices
- Adopt structured JSON logs with severity, httpRequest, labels, and timestamp
- Use the full severity scale (DEBUG to EMERGENCY) to support filtering
- Craft common and advanced log filters (severity, resource, time, text, JSON fields)
- Balance metrics, logs, and traces for end-to-end visibility
- Include runbook links in alerts and apply cost-savings like sampling and retention control
Example Use Cases
- Filter by severity: severity >= WARNING
- Filter by resource: resource.type="cloud_run_revision" AND resource.labels.service_name="my-service"
- Filter by time: timestamp >= "2025-01-15T00:00:00Z"
- Filter by JSON field: jsonPayload.user_id = "123"
- Combined filter: severity >= ERROR AND resource.labels.service_name="api"