What is AWS CloudWatch?

A monitoring and observability service for AWS resources that collects metrics, logs, and events to help you monitor and troubleshoot your environment.

How do I create an alarm?

Use put-metric-alarm (CLI) or put_metric_alarm (boto3) to specify the metric, namespace, threshold, evaluation periods, and actions like SNS notifications.

How can I query logs with Insights?

Use Logs Insights to run queries against log groups, enabling fast search for patterns, errors, and events to aid troubleshooting.

cloudwatch

npx machina-cli add skill itsmostafa/aws-agent-skills/cloudwatch --openclaw

AWS CloudWatch

Amazon CloudWatch provides monitoring and observability for AWS resources and applications. It collects metrics, logs, and events, enabling you to monitor, troubleshoot, and optimize your AWS environment.

Core Concepts
Common Patterns
CLI Reference
Best Practices
Troubleshooting
References

Core Concepts

Metrics

Time-ordered data points published to CloudWatch. Key components:

Namespace: Container for metrics (e.g., AWS/Lambda)
Metric name: Name of the measurement (e.g., Invocations)
Dimensions: Name-value pairs for filtering (e.g., FunctionName=MyFunc)
Statistics: Aggregations over a period (Sum, Average, Minimum, Maximum, SampleCount, and percentiles such as p99)
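
To make these components concrete, here is a minimal Python sketch of how a namespace, metric name, and dimension set together identify one metric. Both helpers are hypothetical and only for illustration (they are not part of boto3); the second one shows the list-of-dicts shape that boto3 calls accept for Dimensions.

```python
def metric_identity(namespace, metric_name, dimensions):
    """Return a hashable key for a metric (hypothetical helper).

    dimensions is a plain dict such as {"FunctionName": "MyFunc"}.
    """
    # Sort the dimensions so that their order does not affect identity.
    return (namespace, metric_name, tuple(sorted(dimensions.items())))


def to_boto3_dimensions(dimensions):
    """Convert a dict into the Name/Value list used by boto3 Dimensions."""
    return [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())]


a = metric_identity("AWS/Lambda", "Invocations", {"FunctionName": "MyFunc"})
b = metric_identity("AWS/Lambda", "Invocations", {"FunctionName": "OtherFunc"})

# Same namespace and name, different dimension value: two distinct metrics.
print(a == b)
print(to_boto3_dimensions({"FunctionName": "MyFunc"}))
```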

Logs

Log data from AWS services and applications:

Log groups: Collections of log streams
Log streams: Sequences of log events from same source
Log events: Individual log entries with timestamp and message
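
As a rough illustration of a log event's shape, the sketch below builds the list that boto3's put_log_events accepts: each event carries a timestamp in milliseconds since the epoch and a message, sorted chronologically. The helper name and the group and stream names in the trailing comment are placeholders, not values from this skill.

```python
def build_log_events(entries):
    """Build a logEvents list from (epoch_seconds, message) pairs.

    Hypothetical helper: converts seconds to milliseconds and sorts
    the events chronologically.
    """
    events = [
        {"timestamp": int(ts * 1000), "message": message}
        for ts, message in entries
    ]
    events.sort(key=lambda e: e["timestamp"])
    return events


events = build_log_events([
    (1700000001.5, "second"),
    (1700000000.0, "first"),
])
print(events)

# Sending the events requires boto3 and AWS credentials, for example:
#   import boto3
#   boto3.client("logs").put_log_events(
#       logGroupName="my-group",      # placeholder
#       logStreamName="my-stream",    # placeholder
#       logEvents=events,
#   )
```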

Alarms

Automated actions based on metric thresholds:

States: OK, ALARM, INSUFFICIENT_DATA
Actions: SNS notifications, Auto Scaling, EC2 actions
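
The three states can be pictured with a deliberately simplified model of a GreaterThanThreshold alarm. This is a sketch for intuition only, not CloudWatch's actual evaluation logic: it assumes every evaluation period must breach, and it treats a partly missing window as OK, whereas the real service also takes the datapoints-to-alarm and missing-data settings into account.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Simplified alarm state for a GreaterThanThreshold comparison.

    datapoints: one value per period, oldest first; None marks a period
    with no data. This is an illustrative model, not the service's logic.
    """
    window = datapoints[-evaluation_periods:]
    present = [v for v in window if v is not None]
    if not present:
        # No data at all in the evaluation window.
        return "INSUFFICIENT_DATA"
    if len(present) == evaluation_periods and all(v > threshold for v in present):
        # Every period in the window breached the threshold.
        return "ALARM"
    return "OK"


print(evaluate_alarm([50, 85, 90], threshold=80, evaluation_periods=2))
print(evaluate_alarm([85, 70], threshold=80, evaluation_periods=2))
print(evaluate_alarm([None, None], threshold=80, evaluation_periods=2))
```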

Common Patterns

Create a Metric Alarm

AWS CLI:

# CPU utilization alarm for EC2
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-i-1234567890abcdef0" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:alerts

boto3:

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='HighCPU-i-1234567890abcdef0',
    MetricName='CPUUtilization',
    Namespace='AWS/EC2',
    Statistic='Average',
    Period=300,
    Threshold=80.0,
    ComparisonOperator='GreaterThanThreshold',
    EvaluationPeriods=2,
    Dimensions=[
        {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}
    ],
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts'],
    OKActions=['arn:aws:sns:us-east-1:123456789012:alerts']
)

Lambda Error Rate Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "LambdaErrorRate-MyFunction" \
  --metrics '[
    {
      "Id": "errors",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Errors",
          "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "invocations",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Invocations",
          "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "errorRate",
      "Expression": "errors/invocations*100",
      "Label": "Error Rate",
      "ReturnData": true
    }
  ]' \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
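
The same metric-math alarm can be expressed for boto3. The sketch below only builds the keyword arguments for put_metric_alarm, mirroring the CLI call above; the two helper functions are hypothetical names introduced for illustration, and the actual API call is left in a comment because it needs boto3 and AWS credentials.

```python
def lambda_metric_query(query_id, metric_name, function_name, period=60):
    """One input metric for the expression; ReturnData is False because
    only the final expression should drive the alarm."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
            },
            "Period": period,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }


def error_rate_alarm_params(function_name, topic_arn, threshold=5.0):
    """Keyword arguments for put_metric_alarm (error rate in percent)."""
    return {
        "AlarmName": f"LambdaErrorRate-{function_name}",
        "Metrics": [
            lambda_metric_query("errors", "Errors", function_name),
            lambda_metric_query("invocations", "Invocations", function_name),
            {
                "Id": "errorRate",
                "Expression": "errors/invocations*100",
                "Label": "Error Rate",
                "ReturnData": True,
            },
        ],
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 3,
        "AlarmActions": [topic_arn],
    }


params = error_rate_alarm_params(
    "MyFunction", "arn:aws:sns:us-east-1:123456789012:alerts"
)
print(params["AlarmName"])

# To create the alarm:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Exactly one entry in Metrics has ReturnData set to true, which is the expression the alarm evaluates.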

Query Logs with Insights

# Find errors in Lambda logs
# (date -d is GNU date syntax; on macOS/BSD use: date -v-1H +%s)
aws logs start-query \
  --log-group-name /aws/lambda/MyFunction \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | limit 50
  '

# Get query results
aws logs get-query-results --query-id <query-id>

boto3:

import boto3
import time

logs = boto3.client('logs')

# Start query
response = logs.start_query(
    logGroupName='/aws/lambda/MyFunction',
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString='''
        fields @timestamp, @message
        | filter @message like /ERROR/
        | sort @timestamp desc
        | limit 50
    '''
)

query_id = response['queryId']

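# Poll until the query reaches a terminal state
# (stopping only on 'Complete' would loop forever if the query fails)
while True:
    result = logs.get_query_results(queryId=query_id)
    if result['status'] in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
        break
    time.sleep(1)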

for row in result['results']:
    print(row)
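
Each row returned by get_query_results is a list of field/value pairs, which is awkward to work with directly. A small helper can flatten rows into dictionaries; rows_to_dicts is a hypothetical name, and the sample row below is made-up data in the shape the API returns.

```python
def rows_to_dicts(results):
    """Flatten Logs Insights rows ([{'field': ..., 'value': ...}, ...])
    into one dict per row."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


# Made-up sample in the shape of result['results'].
sample_results = [
    [
        {"field": "@timestamp", "value": "2024-01-01 00:00:00.000"},
        {"field": "@message", "value": "ERROR something failed"},
    ]
]

for row in rows_to_dicts(sample_results):
    print(row["@timestamp"], row["@message"])
```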

Create Metric Filter

Extract metrics from log patterns:

# Create metric filter for error count
aws logs put-metric-filter \
  --log-group-name /aws/lambda/MyFunction \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=ErrorCount,metricNamespace=MyApp,metricValue=1,defaultValue=0
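
For boto3, the same filter could be sketched as follows. The helper name is hypothetical; it only assembles the arguments for put_metric_filter, and the call itself stays in a comment because it needs boto3 and AWS credentials. Note that metricValue is passed as a string while defaultValue is a number.

```python
def error_count_filter_params(log_group, namespace="MyApp"):
    """Keyword arguments for logs.put_metric_filter, mirroring the CLI call."""
    return {
        "logGroupName": log_group,
        "filterName": "ErrorCount",
        "filterPattern": "ERROR",
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": namespace,
                "metricValue": "1",   # emitted for each matching log event
                "defaultValue": 0.0,  # emitted when nothing matches in a period
            }
        ],
    }


params = error_count_filter_params("/aws/lambda/MyFunction")
print(params["metricTransformations"][0])

# To create the filter:
#   import boto3
#   boto3.client("logs").put_metric_filter(**params)
```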

Publish Custom Metrics

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[
        {
            'MetricName': 'OrdersProcessed',
            'Value': 1,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'Production'},
                {'Name': 'OrderType', 'Value': 'Standard'}
            ]
        }
    ]
)

Create Dashboard

cat > dashboard.json << 'EOF'
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Lambda Invocations",
        "metrics": [
          ["AWS/Lambda", "Invocations", "FunctionName", "MyFunction"]
        ],
        "period": 60,
        "stat": "Sum",
        "region": "us-east-1"
      }
    },
    {
      "type": "log",
      "x": 12, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Recent Errors",
        "query": "SOURCE '/aws/lambda/MyFunction' | filter @message like /ERROR/ | limit 20",
        "region": "us-east-1"
      }
    }
  ]
}
EOF

aws cloudwatch put-dashboard \
  --dashboard-name MyAppDashboard \
  --dashboard-body file://dashboard.json
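
When a dashboard has many similar widgets, generating the body in code may be easier than maintaining JSON by hand. The sketch below builds an equivalent body in Python; metric_widget and dashboard_body are hypothetical helpers, and the put_dashboard call is shown only as a comment.

```python
import json


def metric_widget(title, metric, x, y, region="us-east-1", width=12, height=6):
    """One metric widget; metric is a list such as
    ["AWS/Lambda", "Invocations", "FunctionName", "MyFunction"]."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "metrics": [metric],
            "period": 60,
            "stat": "Sum",
            "region": region,
        },
    }


def dashboard_body(widgets):
    """Serialize widgets into the JSON string expected as DashboardBody."""
    return json.dumps({"widgets": widgets})


body = dashboard_body([
    metric_widget(
        "Lambda Invocations",
        ["AWS/Lambda", "Invocations", "FunctionName", "MyFunction"],
        x=0, y=0,
    ),
    metric_widget(
        "Lambda Errors",
        ["AWS/Lambda", "Errors", "FunctionName", "MyFunction"],
        x=12, y=0,
    ),
])
print(body)

# To publish:
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="MyAppDashboard", DashboardBody=body
#   )
```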

CLI Reference

Metrics Commands

Command	Description
`aws cloudwatch put-metric-data`	Publish custom metrics
`aws cloudwatch get-metric-data`	Retrieve metric values
`aws cloudwatch get-metric-statistics`	Get aggregated statistics
`aws cloudwatch list-metrics`	List available metrics

Alarms Commands

Command	Description
`aws cloudwatch put-metric-alarm`	Create or update alarm
`aws cloudwatch describe-alarms`	List alarms
`aws cloudwatch set-alarm-state`	Manually set alarm state
`aws cloudwatch delete-alarms`	Delete alarms

Logs Commands

Command	Description
`aws logs create-log-group`	Create log group
`aws logs put-log-events`	Write log events
`aws logs filter-log-events`	Search log events
`aws logs start-query`	Start Insights query
`aws logs put-metric-filter`	Create metric filter
`aws logs put-retention-policy`	Set log retention

Best Practices

Metrics

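Use dimensions sparingly: every unique combination of dimension values is a separate metric, so high-cardinality dimensions multiply metric count and cost
Aggregate before publishing: batch custom metrics into fewer put-metric-data calls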
Use high-resolution metrics (1-second) only when needed
Set meaningful units for custom metrics
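
One way to aggregate before publishing is to send a statistic set instead of individual values. The sketch below collapses locally collected values into a single MetricData entry using StatisticValues; aggregate_datum is a hypothetical helper and the sample values are made up.

```python
def aggregate_datum(metric_name, values, unit="Count", dimensions=None):
    """Collapse many raw values into one MetricData entry (statistic set)."""
    if not values:
        raise ValueError("values must not be empty")
    datum = {
        "MetricName": metric_name,
        "StatisticValues": {
            "SampleCount": float(len(values)),
            "Sum": float(sum(values)),
            "Minimum": float(min(values)),
            "Maximum": float(max(values)),
        },
        "Unit": unit,
    }
    if dimensions:
        datum["Dimensions"] = [
            {"Name": k, "Value": v} for k, v in sorted(dimensions.items())
        ]
    return datum


# Made-up latencies collected in-process over the last minute.
datum = aggregate_datum(
    "OrderLatency", [120, 95, 310], unit="Milliseconds",
    dimensions={"Environment": "Production"},
)
print(datum)

# To publish (one call instead of three):
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="MyApp", MetricData=[datum]
#   )
```

A statistic set keeps Sum, Average, Minimum, Maximum, and SampleCount available, but percentiles generally cannot be computed from it, so raw values may still be preferable where percentiles matter.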

Alarms

Use composite alarms for complex conditions
Set appropriate evaluation periods to avoid flapping
Include OK actions to track recovery
Use anomaly detection for dynamic thresholds
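
As an illustration of the composite-alarm practice, the sketch below builds an AlarmRule string that combines existing alarms. composite_rule is a hypothetical helper, and the composite alarm name in the trailing comment is a placeholder; the two child alarm names are the ones used in the examples earlier in this document.

```python
def composite_rule(alarm_names, operator="AND"):
    """Build an AlarmRule that requires the ALARM state of the given alarms."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


rule = composite_rule(
    ["HighCPU-i-1234567890abcdef0", "LambdaErrorRate-MyFunction"]
)
print(rule)

# To create the composite alarm:
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="ServiceDegraded",   # placeholder name
#       AlarmRule=rule,
#       AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
#   )
```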

Logs

Set retention policies: log groups never expire by default
Use structured logging (JSON) for better querying
Create metric filters for key events
Use Contributor Insights for top-N analysis
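
A minimal sketch of structured logging with Python's standard logging module follows. The formatter class, the "context" key, and the field names are assumptions chosen for illustration; the point is that each log line is one JSON object, so Logs Insights can treat keys such as level or order_id as queryable fields.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object (illustrative formatter)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured fields passed via extra={"context": {...}}.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload.update(context)
        return json.dumps(payload)


# Log to an in-memory stream here; in Lambda or on a host this would be
# stdout or a file picked up by the log shipper.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error("payment failed", extra={"context": {"order_id": "o-123"}})
line = stream.getvalue().strip()
print(line)
```

With lines like this, a query along the lines of `filter level = "ERROR"` can replace a regex match on the raw message.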

Cost Optimization

Delete unused dashboards
Reduce log retention for non-critical logs
Avoid high-resolution metrics unless necessary
Use log subscription filters instead of polling

Troubleshooting

Missing Metrics

Causes:

Service not publishing yet (wait 1-5 minutes)
Wrong namespace/dimensions
Detailed monitoring not enabled (EC2): basic monitoring publishes at 5-minute intervals, so 1-minute data points are absent

Debug:

# List metrics for a namespace
aws cloudwatch list-metrics \
  --namespace AWS/Lambda \
  --dimensions Name=FunctionName,Value=MyFunction
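
Dimension mismatches are easy to miss by eye. The sketch below compares the dimensions you expect against the Metrics list returned by list-metrics (or boto3's list_metrics); find_dimension_mismatch is a hypothetical helper and the sample response is made up.

```python
def find_dimension_mismatch(expected, listed_metrics):
    """Return None if some listed metric has exactly the expected dimensions,
    otherwise the dimension sets that were actually found."""
    wanted = set(expected.items())
    seen = []
    for metric in listed_metrics:
        dims = {(d["Name"], d["Value"]) for d in metric.get("Dimensions", [])}
        if dims == wanted:
            return None
        seen.append(sorted(dims))
    return seen


# Made-up response: the metric exists only with an extra Resource dimension.
listed = [
    {
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "FunctionName", "Value": "MyFunction"},
            {"Name": "Resource", "Value": "MyFunction:prod"},
        ],
    }
]

print(find_dimension_mismatch({"FunctionName": "MyFunction"}, listed))
```

A non-empty result means the metric is published, but not under the exact dimension combination being queried.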

Alarm Stuck in INSUFFICIENT_DATA

Causes:

Metric not being published
Dimensions mismatch
Alarm period shorter than the interval at which the metric is published

Debug:

# Check if metric has data
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=MyFunction \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum
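
For metrics that are only published when something happens (a Lambda function that is rarely invoked, for example), the alarm's missing-data setting may be what keeps it in INSUFFICIENT_DATA. The sketch below adds TreatMissingData to a set of put_metric_alarm arguments; the helper and the sample alarm are hypothetical.

```python
VALID_MISSING_DATA = ("breaching", "notBreaching", "ignore", "missing")


def with_missing_data_policy(alarm_params, policy):
    """Return a copy of the alarm arguments with TreatMissingData set."""
    if policy not in VALID_MISSING_DATA:
        raise ValueError(f"policy must be one of {VALID_MISSING_DATA}")
    updated = dict(alarm_params)
    updated["TreatMissingData"] = policy
    return updated


base = {
    "AlarmName": "LambdaErrors-MyFunction",  # hypothetical alarm
    "Namespace": "AWS/Lambda",
    "MetricName": "Errors",
    "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}],
    "Statistic": "Sum",
    "Period": 60,
    "EvaluationPeriods": 3,
    "Threshold": 0.0,
    "ComparisonOperator": "GreaterThanThreshold",
}

# Treat periods without data as healthy instead of unknown.
params = with_missing_data_policy(base, "notBreaching")
print(params["TreatMissingData"])

# To apply:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```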

Log Events Not Appearing

Causes:

IAM permissions missing
CloudWatch agent (or other log shipper) not running
Log group doesn't exist

Debug:

# Check log streams
aws logs describe-log-streams \
  --log-group-name /aws/lambda/MyFunction \
  --order-by LastEventTime \
  --descending \
  --limit 5

High CloudWatch Costs

Check usage:

# Get ingested log volume (IncomingBytes, summed per day) for one log group
aws cloudwatch get-metric-statistics \
  --namespace AWS/Logs \
  --metric-name IncomingBytes \
  --dimensions Name=LogGroupName,Value=/aws/lambda/MyFunction \
  --start-time $(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Sum
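
The command returns one Sum per day in bytes. A small sketch for turning those datapoints into a per-day figure in GiB follows; daily_ingestion_gib is a hypothetical helper, and the datapoints are made-up values in the shape of the Datapoints list from get-metric-statistics.

```python
def daily_ingestion_gib(datapoints):
    """Sort IncomingBytes datapoints by time and convert each Sum to GiB."""
    ordered = sorted(datapoints, key=lambda d: d["Timestamp"])
    return [(d["Timestamp"], d["Sum"] / 1024 ** 3) for d in ordered]


# Made-up sample; real responses are not guaranteed to be time-ordered.
sample_datapoints = [
    {"Timestamp": "2024-01-02T00:00:00Z", "Sum": 2147483648.0, "Unit": "Bytes"},
    {"Timestamp": "2024-01-01T00:00:00Z", "Sum": 1073741824.0, "Unit": "Bytes"},
]

for day, gib in daily_ingestion_gib(sample_datapoints):
    print(f"{day}  {gib:.2f} GiB")
```

Running this per log group makes it easier to see which groups dominate ingestion before adjusting retention or log verbosity.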

References

Source

https://github.com/itsmostafa/aws-agent-skills/blob/main/skills/cloudwatch/SKILL.md

Overview

CloudWatch provides monitoring and observability for AWS resources and applications. It collects metrics, logs, and events to help you monitor, troubleshoot, and optimize your AWS environment.

How This Skill Works

CloudWatch organizes data into metrics (namespaces, metric names, dimensions, statistics) and logs (log groups, log streams, log events). You create alarms that trigger actions, and you can query logs with Insights and build dashboards to visualize the data for operators and developers.

When to Use It

Setting up monitoring for a new AWS service or application
Creating and tuning alarms to detect issues early
Querying logs with Insights to troubleshoot failures
Configuring metric filters to extract signals from logs
Building dashboards to visualize key metrics and logs for operations

Quick Start

Step 1: Define the metrics and logs you want to monitor (e.g., AWS/EC2 CPUUtilization, Lambda Errors, log groups)
Step 2: Create alarms and/or metric filters via the AWS CLI or Console, specifying thresholds and actions
Step 3: Build dashboards and use Logs Insights to visualize and troubleshoot in real time

Best Practices

Define consistent namespaces and dimensions to enable cross-service correlation
Use specific thresholds and evaluation periods for reliable alarms
Test alarms by simulating conditions and verify the triggered actions
Create dashboards that mix metrics, logs, and alarms for quick situational awareness
Pair Logs Insights queries and metric filters with appropriate log retention and tagging

Example Use Cases

Create an EC2 CPUUtilization alarm using put-metric-alarm to alert on high usage
Set up a Lambda Error Rate alarm with a metric math expression combining Errors and Invocations
Query Lambda logs with Insights to locate error messages and stack traces
Build a dashboard showing EC2 CPU, NetworkIn, and DiskReadBytes across instances
Configure metric filters to count error messages from application logs and feed into alarms
