When should I load this skill?

Use it when working with Docker, CI/CD pipelines, deployment configuration, infrastructure as code, or observability setup. Skip if you only have application logic with no infra concerns.

Does this cover Terraform or IaC?

Yes. It includes a Terraform example with S3 remote state, a tagged EC2 instance, and typed variables, plus a list of rules covering remote state, state locking, modules, tagging, and per-environment state.

What Docker practices are included?

Key practices include multi-stage builds, layer caching, a .dockerignore file, pinning base images, running as a non-root user, and security scanning.

devops-infrastructure

npx machina-cli add skill CloudAI-X/claude-workflow-v2/devops-infrastructure --openclaw

DevOps & Infrastructure

When to Load

Trigger: Docker, CI/CD pipelines, deployment configuration, monitoring, infrastructure as code
Skip: Application logic only with no infrastructure or deployment concerns

DevOps Workflow

Copy this checklist and track progress:

DevOps Setup Progress:
- [ ] Step 1: Containerize application (Dockerfile)
- [ ] Step 2: Set up CI/CD pipeline
- [ ] Step 3: Define deployment strategy
- [ ] Step 4: Configure monitoring & alerting
- [ ] Step 5: Set up environment management
- [ ] Step 6: Document runbooks
- [ ] Step 7: Validate against anti-patterns checklist

Docker Best Practices

Multi-Stage Build

# WRONG: Single stage, bloated image
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
CMD ["node", "dist/index.js"]
# Result: 1.2GB image with devDependencies and source code

# CORRECT: Multi-stage build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
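RUN npm run build
# Drop devDependencies so the node_modules copied into the runner is production-only
RUN npm prune --omit=dev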

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]
# Result: ~150MB image, no devDependencies, non-root user

Python Multi-Stage

FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock ./
# Install only locked dependencies here; the project source is not copied yet
RUN uv sync --frozen --no-dev --no-install-project
COPY . .

FROM python:3.12-slim AS runner
WORKDIR /app
RUN useradd -r -s /bin/false appuser
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/src ./src
ENV PATH="/app/.venv/bin:$PATH"
USER appuser
CMD ["python", "-m", "src.main"]

Layer Caching

# WRONG: Cache busted on every code change
COPY . .
RUN npm ci

# CORRECT: Dependencies cached separately
COPY package*.json ./
# Cached until package.json or package-lock.json changes
RUN npm ci
# Source changes invalidate only this layer and the ones after it
COPY . .

.dockerignore

node_modules
.git
.env
*.md
.vscode
coverage
dist
__pycache__
.pytest_cache
*.pyc

Security

# Always pin versions (for example node:20.11.0-alpine, never node:latest)
FROM node:20.11.0-alpine

# Don't run as root
USER appuser

# Read-only filesystem where possible
# docker run --read-only --tmpfs /tmp myapp

# Scan images
# docker scout cves myimage:latest
# trivy image myimage:latest

CI/CD Pipeline Design

GitHub Actions Structure

name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_PASSWORD: postgres  # required by the official image; test-only value
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm test

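  build:
    runs-on: ubuntu-latest
    needs: test
    permissions:
      contents: read
      packages: write  # needed to push to ghcr.io with GITHUB_TOKEN
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # Log in only when the image will actually be pushed
      - uses: docker/login-action@v3
        if: github.event_name == 'push'
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: ${{ github.event_name == 'push' }}
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max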

  deploy:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - run: echo "Deploy to production"

Caching Strategies

# Node modules
- uses: actions/setup-node@v4
  with:
    cache: "npm"

# Python with uv
- name: Cache uv
  uses: actions/cache@v4
  with:
    path: ~/.cache/uv
    key: uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}

# Docker layer caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max

Deployment Strategies

Blue-Green Deployment

1. Run two identical environments: Blue (live) and Green (idle)
2. Deploy new version to Green
3. Run smoke tests on Green
4. Switch load balancer to Green
5. Green is now live, Blue is idle
6. Rollback: switch back to Blue

Pros: Instant rollback, zero downtime
Cons: 2x infrastructure cost during deploy
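
The switch-and-rollback logic in the steps above can be sketched as a small in-memory model. This is only an illustration: `Router`, `deployBlueGreen`, `rollback`, and the `smokeTest` callback are hypothetical names, and a real setup would call a load balancer or DNS API instead of mutating an object.

```typescript
// Hypothetical model of blue-green switching; not tied to any real load balancer API.
type Env = "blue" | "green";

interface Router {
  live: Env; // environment currently receiving traffic
}

// The idle environment is always the deploy target.
function idleOf(router: Router): Env {
  return router.live === "blue" ? "green" : "blue";
}

// Deploy to the idle environment, smoke-test it, then switch traffic.
// If the smoke test fails, traffic never moves and the old version stays live.
function deployBlueGreen(
  router: Router,
  versions: Record<Env, string>,
  newVersion: string,
  smokeTest: (env: Env, version: string) => boolean,
): { switched: boolean; live: Env } {
  const target = idleOf(router);
  versions[target] = newVersion; // step 2: deploy to the idle environment
  if (!smokeTest(target, newVersion)) {
    // step 3 failed: leave the router untouched
    return { switched: false, live: router.live };
  }
  router.live = target; // step 4: switch the load balancer
  return { switched: true, live: router.live };
}

// Rollback is the same switch in reverse (step 6).
function rollback(router: Router): Env {
  router.live = idleOf(router);
  return router.live;
}
```

Because the previous environment keeps running its old version after the switch, `rollback` only has to flip the pointer back, which is where the "instant rollback" benefit comes from.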

Canary Deployment

1. Deploy new version to small subset (5% of traffic)
2. Monitor error rates and latency
3. Gradually increase: 5% -> 25% -> 50% -> 100%
4. Rollback: route all traffic back to old version

Pros: Limited blast radius, real-world testing
Cons: More complex routing, longer rollout
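
A minimal sketch of the canary progression above, assuming a hypothetical `observe` callback that returns metrics for the new version at a given traffic share. The default thresholds (1% error rate, 2 s P99 latency) are illustrative assumptions, not values prescribed for canaries by this document.

```typescript
// Hypothetical sketch: names, thresholds, and the metrics callback are assumptions.
interface CanaryMetrics {
  errorRate: number; // fraction of failed requests, 0..1
  p99LatencyMs: number; // 99th percentile latency in milliseconds
}

interface CanaryResult {
  finalPercent: number; // traffic share on the new version when the rollout ended
  rolledBack: boolean;
  history: number[]; // traffic percentages that were actually tried
}

const STEPS = [5, 25, 50, 100]; // progression from the list above

function runCanary(
  observe: (percent: number) => CanaryMetrics,
  maxErrorRate = 0.01,
  maxP99Ms = 2000,
): CanaryResult {
  const history: number[] = [];
  for (const percent of STEPS) {
    history.push(percent);
    const metrics = observe(percent); // step 2: monitor at this traffic level
    if (metrics.errorRate > maxErrorRate || metrics.p99LatencyMs > maxP99Ms) {
      // step 4: route all traffic back to the old version
      return { finalPercent: 0, rolledBack: true, history };
    }
  }
  return { finalPercent: 100, rolledBack: false, history };
}
```

In practice each step would also wait for a soak period before reading metrics; that delay is left out here to keep the sketch synchronous.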

Rolling Deployment

1. Replace instances one at a time
2. Each new instance passes health checks before next starts
3. Continue until all instances updated

Pros: No extra infrastructure, gradual rollout
Cons: Mixed versions during deploy, slower rollback
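
The rolling procedure above can be illustrated with a short simulation. It assumes a hypothetical `healthCheck` callback and models instances as plain version strings; an orchestrator such as Kubernetes would do this for you, so treat it as a way to see the mixed-version state rather than as deployment code.

```typescript
// Hypothetical simulation of a rolling update over a list of instance versions.
interface RollingResult {
  instances: string[]; // version running on each instance after the rollout
  completed: boolean;
  failedAt: number | null; // index of the instance whose replacement failed, if any
}

function rollingUpdate(
  current: string[],
  next: string,
  healthCheck: (index: number, version: string) => boolean,
): RollingResult {
  let updated = 0; // instances [0, updated) run the new version
  while (updated < current.length) {
    // steps 1-2: replace one instance and wait for its health check
    if (!healthCheck(updated, next)) {
      break; // stop the rollout; the failing instance keeps the old version
    }
    updated++;
  }
  const instances = current.map((version, i) => (i < updated ? next : version));
  const completed = updated === current.length;
  return { instances, completed, failedAt: completed ? null : updated };
}
```

A halted rollout leaves some instances on the new version and some on the old one, which is the mixed-version drawback listed above.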

Feature Flags

// Simple feature flag implementation
const features = {
  NEW_CHECKOUT: process.env.FF_NEW_CHECKOUT === "true",
  DARK_MODE: process.env.FF_DARK_MODE === "true",
};

function getCheckoutFlow(user: User) {
  if (features.NEW_CHECKOUT && user.betaGroup) {
    return newCheckoutFlow(user);
  }
  return legacyCheckoutFlow(user);
}

// Use a proper service for production: LaunchDarkly, Unleash, Flagsmith

Infrastructure as Code

Terraform Basics

# main.tf
terraform {
  required_version = ">= 1.5"
  backend "s3" {
    bucket = "myapp-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type
  tags = {
    Name        = "web-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

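# variables.tf
variable "ami_id" {
  type        = string
  description = "AMI for the web instance; no default, supply one per environment"
}
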
variable "environment" {
  type    = string
  default = "dev"
}

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

Terraform Rules

1. Always use remote state (S3, GCS, Terraform Cloud)
2. Lock state files to prevent concurrent modifications
3. Use variables and modules for reusability
4. Tag all resources with environment and ManagedBy
5. Run `terraform plan` before `terraform apply`
6. Never edit infrastructure manually (all changes via code)
7. Use workspaces or separate state files per environment

Monitoring & Observability

The Three Pillars

METRICS: Numeric measurements over time
  - Request rate, error rate, latency (RED method)
  - CPU, memory, disk, network (USE method)
  - Business metrics (signups, purchases)
  Tools: Prometheus, Datadog, CloudWatch

LOGS: Discrete events with context
  - Structured JSON format
  - Correlation IDs across services
  - Log levels: DEBUG, INFO, WARN, ERROR
  Tools: ELK Stack, Loki, CloudWatch Logs

TRACES: Request flow across services
  - Distributed tracing with span context
  - Latency breakdown per service
  - Dependency mapping
  Tools: Jaeger, Zipkin, Datadog APM
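
As an illustration of the LOGS pillar, here is a minimal sketch of a structured JSON logger that stamps every entry with a correlation ID. `createLogger`, the field names, and the `sink` callback are hypothetical; production services would normally use an established logging library.

```typescript
// Hypothetical structured logger: one JSON object per line, with a correlation ID.
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR";

const LEVEL_ORDER: Record<LogLevel, number> = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };

function createLogger(
  service: string,
  correlationId: string, // same ID is passed along to downstream services
  minLevel: LogLevel,
  sink: (line: string) => void, // where serialized entries go
) {
  return function log(
    level: LogLevel,
    message: string,
    fields: Record<string, unknown> = {},
  ): void {
    if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) {
      return; // below the configured level: drop the entry
    }
    const entry = {
      ...fields, // spread first so core fields below cannot be overwritten
      timestamp: new Date().toISOString(),
      level,
      service,
      correlationId,
      message,
    };
    sink(JSON.stringify(entry));
  };
}
```

Passing `console.log` as the sink writes entries to stdout, where a collector such as Loki or the ELK Stack can pick them up and join them across services by `correlationId`.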

Health Check Endpoint

// Express health check
app.get("/health", async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    database: "unknown",
    redis: "unknown",
  };

  try {
    await db.query("SELECT 1");
    checks.database = "healthy";
  } catch (e) {
    checks.database = "unhealthy";
  }

  try {
    await redis.ping();
    checks.redis = "healthy";
  } catch (e) {
    checks.redis = "unhealthy";
  }

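  // Redis status is reported in the body but does not affect the status code;
  // only the database check decides between 200 and 503.
  const isHealthy = checks.database === "healthy";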
  res.status(isHealthy ? 200 : 503).json(checks);
});

Alerting Rules

Good alerts:
- Error rate > 1% for 5 minutes (actionable)
- P99 latency > 2s for 10 minutes (meaningful)
- Disk usage > 80% (preventive)

Bad alerts:
- CPU spike for 30 seconds (too noisy)
- Any single 500 error (too sensitive)
- "Something might be wrong" (not actionable)

Alert fatigue is real. Every alert should require human action.
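
The difference between "error rate > 1% for 5 minutes" and "any single 500 error" can be made concrete with a small evaluator. This is a sketch under the assumption of one sample per minute; `MinuteSample` and `shouldAlert` are hypothetical names, and real systems would express this as a rule in the monitoring tool instead.

```typescript
// Hypothetical evaluator for a sustained error-rate alert.
interface MinuteSample {
  requests: number; // total requests in that minute
  errors: number; // failed requests in that minute
}

// Fires only if the error rate exceeded the threshold in every one of the
// last `windowMinutes` samples (assumes one sample per minute, oldest first).
function shouldAlert(
  samples: MinuteSample[],
  threshold = 0.01,
  windowMinutes = 5,
): boolean {
  if (samples.length < windowMinutes) {
    return false; // not enough data to call the condition sustained
  }
  const recent = samples.slice(-windowMinutes);
  return recent.every((s) => s.requests > 0 && s.errors / s.requests > threshold);
}
```

A one-minute spike surrounded by healthy minutes does not fire, while five consecutive minutes above the threshold does, which keeps the alert actionable.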

Environment Management

Dev/Staging/Prod Parity

# docker-compose.yml for local development
services:
  app:
    build: .
    env_file: .env
    ports: ["3000:3000"]
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: myapp
      POSTGRES_PASSWORD: postgres  # required by the official image; local development only
    healthcheck:
      test: ["CMD-SHELL", "pg_isready"]
      interval: 5s
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

volumes:
  pgdata:

Environment Variables

# .env.example (committed to git, no real values)
DATABASE_URL=postgresql://user:placeholder@localhost:5432/myapp
REDIS_URL=redis://localhost:6379
LOG_LEVEL=debug
API_KEY=your-key-here

# .env (never committed, listed in .gitignore)
# Contains real values for local development
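
One way to make missing variables fail fast is to validate them once at startup. The sketch below assumes the variable names from the `.env.example` above; `loadConfig`, `AppConfig`, and the `"info"` fallback for `LOG_LEVEL` are hypothetical choices.

```typescript
// Hypothetical startup validation for the variables listed in .env.example.
interface AppConfig {
  databaseUrl: string;
  redisUrl: string;
  apiKey: string;
  logLevel: string;
}

const REQUIRED = ["DATABASE_URL", "REDIS_URL", "API_KEY"] as const;

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Collect every missing name so one error reports them all.
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  const get = (name: string, fallback = ""): string => env[name] || fallback;
  return {
    databaseUrl: get("DATABASE_URL"),
    redisUrl: get("REDIS_URL"),
    apiKey: get("API_KEY"),
    logLevel: get("LOG_LEVEL", "info"), // assumed default when LOG_LEVEL is unset
  };
}
```

In a Node.js service this would typically be called once as `loadConfig(process.env)` before the server starts listening.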

Common Anti-Patterns Summary

AVOID                              DO INSTEAD
-------------------------------------------------------------------
FROM node:latest                   Pin exact versions (node:20.11.0-alpine)
Running as root in container       Create and use non-root user
No .dockerignore                   Exclude .git, node_modules, .env
Single CI job does everything      Separate lint, test, build, deploy stages
Manual deployment                  Automated pipeline with approvals
No health checks                   Liveness + readiness probes
Alerts on every error              Alert on error RATE thresholds
Same config in all environments    Per-environment configuration
No rollback plan                   Test rollback before every deploy
Logs as unstructured strings       Structured JSON logs with correlation IDs

Source

https://github.com/CloudAI-X/claude-workflow-v2/blob/main/skills/devops-infrastructure/SKILL.md

Overview

This skill guides containerization, CI/CD pipelines, deployment strategies, infrastructure as code, and observability setup. It emphasizes practical patterns like multi-stage Docker builds, layer caching, and secure image practices, plus GitHub Actions workflows to automate deployments.

How This Skill Works

The skill walks through building small Docker images with multi-stage builds, configuring GitHub Actions jobs that lint, test, build, and push images, and describing infrastructure in Terraform. It also covers deployment strategies, observability, image security, environment management, and runbooks, so that production deployments are reproducible.

When to Use It

When containerizing an application with a Dockerfile to prepare it for production
When configuring a GitHub Actions workflow to automate linting, testing, and deployment
When planning deployment strategies (rolling, blue/green, canary) and environment management
When setting up monitoring, logs, metrics, and observability across services
When asked about containers, pipelines, Terraform, or production infrastructure

Quick Start

Step 1: Create a multi-stage Dockerfile that builds and runs the app with a non-root user
Step 2: Add a GitHub Actions workflow with lint, test, and build jobs
Step 3: Define a deployment strategy and basic observability to monitor production

Best Practices

Use multi-stage builds to minimize image size and remove build-time dependencies
Copy dependency manifests and install dependencies before copying source code, so Docker can reuse the cached dependency layer when only the source changes
Pin base images and avoid running as root; create and run as a non-root user
Keep security tight with .dockerignore, image scanning, and pinned versions
Design CI/CD with reusable workflows and IaC templates; include runbooks and anti-pattern checks

Example Use Cases

A multi-stage Dockerfile for a Node.js app producing a ~150MB image that ships only the built dist output, node_modules, and package.json
A Python 3.12-slim multi-stage build that installs dependencies and runs a slim runtime
A GitHub Actions YAML with lint, test (including a Postgres service), and build/push steps
A .dockerignore example excluding node_modules, Git history, env files, and caches
A security-focused Dockerfile pattern: pin versions, non-root user, read-only FS, and scanners
