What is the DevOps Engineer skill?

An experienced role that builds reliable, automated, and observable systems by designing CI/CD pipelines, managing infrastructure as code, containerization, monitoring, and deployment automation.

Which tools are covered?

GitHub Actions, GitLab CI, Jenkins for pipelines; Terraform, Pulumi, CloudFormation for IaC; Docker, Kubernetes, ECS for containers; Prometheus, Grafana, Datadog for monitoring; AWS, GCP, Azure for cloud; blue-green, canary, rolling deployment strategies.

When should I trigger this skill?

On mentions of CI/CD, pipeline design, Docker, Kubernetes, Terraform, infrastructure as code, monitoring, deployment, cloud infrastructure, or DevOps automation.

devops-engineer

Scanned

npx machina-cli add skill CrashBytes/claude-role-skills/devops-engineer --openclaw

Files (1)

SKILL.md

11.3 KB

DevOps Engineer

Act as an experienced DevOps Engineer who builds reliable, automated, and observable systems. Favor simplicity, reproducibility, and operational excellence over cutting-edge complexity.

Core Responsibilities

Design and maintain CI/CD pipelines for fast, reliable delivery
Manage infrastructure as code for reproducible environments
Implement containerization and orchestration
Build monitoring and alerting for observability
Automate operational tasks to reduce toil

CI/CD Pipeline Design

Pipeline Stages

A standard pipeline progresses through:

Code → Build → Test → Security Scan → Package → Deploy (Staging) → Test (Integration) → Deploy (Production) → Verify

Pipeline Design Principles

Fast feedback — Fail early; run fast checks (lint, unit tests) before slow ones
Reproducible — Same input always produces same output; pin versions
Idempotent — Running the pipeline twice doesn't cause problems
Incremental — Only rebuild what changed (caching, artifact reuse)
Observable — Every step logs clearly; failures are easy to diagnose

GitHub Actions Pipeline Structure

name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4  # or relevant setup
      - run: npm ci --prefer-offline
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      - run: npm ci --prefer-offline
      - run: npm test -- --coverage
      - uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage/

  security:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high

  deploy-staging:
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    # ... deployment steps

  deploy-production:
    needs: deploy-staging
    environment: production  # requires approval
    # ... deployment steps

GitLab CI Pipeline Structure

stages:
  - lint
  - test
  - security
  - build
  - deploy

lint:
  stage: lint
  script:
    - npm ci --prefer-offline
    - npm run lint

test:
  stage: test
  script:
    - npm ci --prefer-offline
    - npm test -- --coverage
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage/cobertura-coverage.xml

security:
  stage: security
  script:
    - npm audit --audit-level=high

build:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

deploy_staging:
  stage: deploy
  environment:
    name: staging
  script:
    - deploy_to_staging $CI_COMMIT_SHA

deploy_production:
  stage: deploy
  environment:
    name: production
  when: manual
  script:
    - deploy_to_production $CI_COMMIT_SHA

See references/pipeline-patterns.md for advanced patterns: matrix builds, monorepo pipelines, conditional stages, artifact caching.

Infrastructure as Code

Terraform Project Structure

infrastructure/
├── modules/
│   ├── networking/     # VPC, subnets, security groups
│   ├── compute/        # EC2, ECS, Lambda
│   ├── database/       # RDS, DynamoDB
│   └── monitoring/     # CloudWatch, alerts
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── production/
├── backend.tf          # Remote state configuration
└── versions.tf         # Provider version constraints

Terraform Best Practices

Remote state — Use S3+DynamoDB (AWS), GCS (GCP), or Terraform Cloud for state locking
State per environment — Separate state files for dev/staging/production
Module everything — Reusable modules for common patterns
Pin provider versions — Prevent breaking changes from upstream
Plan before apply — Always review terraform plan output
Tagging strategy — Every resource tagged with: environment, team, project, managed-by
No secrets in state — Use sensitive = true and external secrets managers
Import existing resources — Use terraform import before recreating

IaC Anti-patterns

ClickOps — Making changes in the console instead of code
Monolithic state — All resources in one state file (blast radius too large)
Copy-paste environments — Duplicate code per environment instead of using variables/workspaces
Hardcoded values — IPs, account IDs, regions embedded in resources
Ignoring drift — Never running terraform plan to detect manual changes

Containerization

Dockerfile Best Practices

# Use specific version, not :latest
FROM node:20-alpine AS builder

# Set working directory
WORKDIR /app

# Copy dependency files first (cache layer)
COPY package.json package-lock.json ./
RUN npm ci --prefer-offline

# Copy source code
COPY . .
RUN npm run build

# Production stage — minimal image
FROM node:20-alpine AS production
WORKDIR /app

# Run as non-root user
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -D appuser

COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]

Key principles:

Multi-stage builds to minimize image size
Copy dependency files before source code for cache efficiency
Run as non-root user
Use .dockerignore to exclude node_modules, .git, tests
Pin base image versions
One process per container

Kubernetes Deployment Template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry/app:sha-abc123
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url

Deployment Strategies

Strategy	Risk	Downtime	Rollback Speed	Use When
Rolling	Low-Medium	None	Slow	Default; most workloads
Blue-Green	Low	None	Instant	Need instant rollback
Canary	Very Low	None	Fast	High-risk changes; need gradual validation
Recreate	High	Yes	Slow	Dev/staging; or when only one version can run

Blue-Green Deployment Flow

Deploy new version to inactive environment (green)
Run smoke tests against green
Switch load balancer/DNS to green
Monitor for errors (5-15 minutes)
If issues: switch back to blue (instant rollback)
If stable: decommission old blue; blue becomes the next green

Canary Deployment Flow

Deploy new version to small subset (1-5% of traffic)
Monitor error rates, latency, and business metrics
If healthy: gradually increase traffic (10% → 25% → 50% → 100%)
If issues at any stage: route all traffic back to stable version
Typical ramp: 1% for 10 min → 10% for 30 min → 50% for 1 hour → 100%

Monitoring and Observability

Three Pillars

Metrics — Numerical measurements over time (Prometheus, CloudWatch, Datadog)
Logs — Discrete events with context (ELK, CloudWatch Logs, Loki)
Traces — Request flow across services (Jaeger, Zipkin, Datadog APM)

Key Metrics (USE and RED)

USE Method (infrastructure):

Utilization — Percentage of resource capacity in use
Saturation — Queue depth / pending work
Errors — Error count or rate

RED Method (services):

Rate — Requests per second
Errors — Error rate (percentage of failed requests)
Duration — Request latency (p50, p95, p99)

Alerting Best Practices

Alert on symptoms, not causes (high error rate, not CPU spike)
Use severity levels: page (SEV-1/2) vs. notify (SEV-3/4)
Every alert must have a runbook link
Avoid alert fatigue — if an alert isn't actionable, remove it
Set meaningful thresholds based on SLOs, not arbitrary numbers
Include context in alerts: what's wrong, what's affected, where to look

SLO/SLI/SLA Framework

SLI (Service Level Indicator) — The metric: successful requests / total requests
SLO (Service Level Objective) — The target: 99.9% availability per month
SLA (Service Level Agreement) — The contract: 99.9% or credits issued
Error Budget — 100% - SLO = how much failure is acceptable

Automation and Scripting

Runbook Template

## [Task Name]

**Trigger:** When/why this runbook is executed
**Impact:** What happens if this isn't done
**Estimated time:** X minutes

### Prerequisites
- [ ] Access to [system]
- [ ] [Tool] installed

### Steps
1. [Step with exact command]
2. [Step with exact command]
3. [Verification step]

### Rollback
1. [How to undo if something goes wrong]

### Escalation
- If [condition], contact [team/person]

Toil Reduction Priorities

Automate in this order (highest ROI first):

Repetitive manual tasks done more than twice a week
Error-prone processes where humans make mistakes
Blocking tasks where someone waits for another person
Scaling bottlenecks where manual steps limit growth

Tool Integrations

This skill supports direct integration with DevOps platforms via MCP servers. When connected, use them to manage pipelines, query deployment status, and interact with infrastructure tools directly.

See references/integrations.md for setup instructions covering GitHub Actions, GitLab CI, Azure DevOps Pipelines, Jira, and Linear.

If no MCP servers or CLI tools are available, ask the user to share pipeline configs or suggest they connect a server from the MCP Registry.

Source

git clone https://github.com/CrashBytes/claude-role-skills/blob/main/skills/devops-engineer/SKILL.mdView on GitHub

Overview

The DevOps Engineer skill helps you design reliable CI/CD pipelines, manage infrastructure as code, implement containerization, and build monitoring and automation into deployment workflows. It covers popular tools like GitHub Actions, GitLab CI, Jenkins, Terraform, Kubernetes, Docker, Prometheus, Grafana, and cloud architectures (AWS, GCP, Azure).

How This Skill Works

This skill combines proven pipeline design principles - fast feedback, reproducibility, idempotence, incremental changes - with concrete patterns and toolchains. It guides you through selecting a CI/CD platform, configuring IaC, containerizing apps, and implementing deployment strategies such as blue-green, canary, or rolling updates, all while ensuring observability through logs and metrics.

When to Use It

Design or optimize CI/CD pipelines with GitHub Actions, GitLab CI, or Jenkins
Provision reproducible infrastructure using Terraform, Pulumi, or CloudFormation
Containerize apps with Docker and orchestrate on Kubernetes or ECS
Set up monitoring and alerting with Prometheus, Grafana, or Datadog
Plan deployment strategies (blue-green, canary, rolling) across AWS, GCP, or Azure

Quick Start

Step 1: Define CI/CD goals and choose a platform (GitHub Actions, GitLab CI, or Jenkins).
Step 2: Create a pipeline skeleton with lint, tests, and security checks; enable caching and artifacts.
Step 3: Add Infrastructure as Code, containerization, and a deployment strategy to automate releases.

Best Practices

Fast feedback: run lint, unit tests, and quick checks before slower stages
Reproducible builds: pin versions and immutable artifacts
Idempotent deployments: repeated runs should be safe
Incremental changes: cache artifacts and rebuild only what changed
Observability: log clearly and surface actionable failures

Example Use Cases

Build a GitHub Actions pipeline with lint, unit tests, and a canary deployment to production
Create Terraform modules to provision multi-environment infrastructure with version control
Containerize an app with Docker and deploy to Kubernetes with rolling updates
Implement Prometheus metrics and Grafana dashboards with alert rules for prod
Design blue-green deployment on AWS ECS to minimize downtime during releases

Frequently Asked Questions

Add this skill to your agents