Get the FREE Ultimate OpenClaw Setup Guide →

devops

npx machina-cli add skill Joncik91/ucai/devops --openclaw
Files (1)
SKILL.md
17.5 KB

DevOps Engineer

CI/CD, containers, Kubernetes, IaC, and observability patterns for production systems.

Table of Contents


GitHub Actions

Secretless Auth via OIDC (no long-lived credentials)

# .github/workflows/deploy.yml
permissions:
  id-token: write   # required for OIDC
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production   # requires manual approval in repo settings
    steps:
      - uses: actions/checkout@v4

      # AWS
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: us-east-1

      # GCP
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/123/locations/global/workloadIdentityPools/github/providers/github
          service_account: deploy@my-project.iam.gserviceaccount.com

Pin actions to full commit SHA (not tags)

# WRONG — tag can be moved
- uses: actions/checkout@v4

# RIGHT — immutable, auditable
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2

Reusable workflows

# .github/workflows/_build.yml (reusable)
on:
  workflow_call:
    inputs:
      image-tag:
        required: true
        type: string
    secrets:
      REGISTRY_TOKEN:
        required: true

# Caller
jobs:
  build:
    uses: ./.github/workflows/_build.yml
    with:
      image-tag: ${{ github.sha }}
    secrets:
      REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}

Cache dependencies

- uses: actions/setup-node@v4
  with:
    node-version: '22'
    cache: 'npm'

- uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}

Docker

Multi-stage build with distroless

# Build stage
FROM node:22-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --ignore-scripts
COPY . .
RUN npm run build

# Production stage — distroless has no shell, minimal attack surface
FROM gcr.io/distroless/nodejs22-debian12
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["dist/server.js"]

BuildKit caching and secrets (never bake secrets into layers)

# syntax=docker/dockerfile:1

# Mount pip cache — not stored in image
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

# Mount secret at build time — never appears in layer history
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci
# Build with BuildKit
DOCKER_BUILDKIT=1 docker build \
  --secret id=npmrc,src=.npmrc \
  --cache-from type=registry,ref=myrepo/app:cache \
  --cache-to   type=registry,ref=myrepo/app:cache,mode=max \
  -t myrepo/app:$GIT_SHA .

Layer ordering (slow → fast)

# 1. OS deps (rarely change)
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*

# 2. Package manager files (change on dep updates)
COPY package*.json ./
RUN npm ci --ignore-scripts

# 3. Source code (changes every commit)
COPY . .
RUN npm run build

Kubernetes

Resource requests/limits (always set both — determines QoS class)

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    # Never set CPU limits — causes throttling under burst load
    # Set only memory limits to get Burstable QoS
    memory: "512Mi"

QoS classes:

  • Guaranteed: requests == limits for all containers → last evicted
  • Burstable: requests < limits → evicted after BestEffort
  • BestEffort: no requests/limits → first evicted

All three probe types (on separate endpoints)

livenessProbe:
  httpGet:
    path: /healthz/live      # returns 200 if process is alive; restart if fails
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /healthz/ready     # returns 200 if ready to receive traffic; remove from LB if fails
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /healthz/startup   # gives slow-start apps time before liveness kicks in
    port: 8080
  failureThreshold: 30       # 30 × 10s = 5 minutes max startup
  periodSeconds: 10

Rule: Never reuse the same endpoint for all three probes. Liveness must not depend on downstream services — it will cause cascading restarts.

HPA + KEDA for event-driven scaling

# Standard HPA (CPU/memory)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min before scaling down
# KEDA — scale on queue depth, Kafka lag, cron, etc.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
spec:
  scaleTargetRef:
    name: worker
  minReplicaCount: 0    # can scale to zero
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123/my-queue
        queueLength: "10"   # 1 pod per 10 messages

PodDisruptionBudget (required for zero-downtime deploys)

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # or use maxUnavailable: 1
  selector:
    matchLabels:
      app: api

Network policies (default deny, explicit allow)

# Default deny all ingress/egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]

# Allow only: api → postgres on 5432
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-postgres
spec:
  podSelector:
    matchLabels:
      app: postgres
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - port: 5432

Spread across zones

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api

Terraform / OpenTofu

Layered module structure

infra/
├── modules/              # reusable building blocks (no state)
│   ├── vpc/
│   ├── eks/
│   └── rds/
├── environments/
│   ├── staging/          # thin wrappers that call modules
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── backend.tf    # remote state config
│   └── production/
└── shared/               # cross-env resources (DNS, ECR)

Remote state + locking

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/eks/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"   # prevents concurrent applies
    encrypt        = true
  }
}

Pin providers and modules

terraform {
  required_version = ">= 1.9, < 2.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # allows 5.x but not 6.x
    }
  }
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.13.0"   # exact pin for production
}

Drift detection

# Check for infrastructure drift without changing anything
terraform plan -refresh-only

# In CI — fail pipeline if drift detected
terraform plan -refresh-only -detailed-exitcode
# exit 0 = no changes, exit 2 = drift detected

Variables and secrets

# Use sensitive = true to prevent values appearing in plan output
variable "db_password" {
  type      = string
  sensitive = true
}

# Pass secrets via environment — never in tfvars committed to git
# TF_VAR_db_password=... terraform apply

Observability (OpenTelemetry)

2025 rule: OTel Collector is the non-negotiable central routing hub. Never send telemetry directly to vendors from application code.

App → OTel Collector → [Tempo, Prometheus, Loki, Datadog, etc.]

OTel Collector config

# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  tail_sampling:              # sample after seeing full trace
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }   # always keep errors
      - name: slow-policy
        type: latency
        latency: { threshold_ms: 500 }            # keep slow traces
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 } # sample 10% of rest

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]

Instrument Node.js with OTel SDK

// instrumentation.ts — load before all other imports
import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc'
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'

const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME ?? 'my-service',
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4317' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://otel-collector:4317' }),
    exportIntervalMillis: 15_000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
})

sdk.start()

Trace-log correlation

import { trace, context } from '@opentelemetry/api'

function getTraceContext() {
  const span = trace.getActiveSpan()
  if (!span) return {}
  const { traceId, spanId } = span.spanContext()
  return { traceId, spanId }
}

// Add to every log line
logger.info('Payment processed', {
  ...getTraceContext(),
  amount: 99.99,
  userId: 'u_123',
})

Key metrics to instrument

import { metrics } from '@opentelemetry/api'

const meter = metrics.getMeter('my-service')
const httpDuration = meter.createHistogram('http.server.duration', {
  description: 'HTTP request duration',
  unit: 'ms',
  advice: { explicitBucketBoundaries: [5, 10, 25, 50, 100, 250, 500, 1000] },
})
const activeRequests = meter.createUpDownCounter('http.server.active_requests')

// Use OTel semantic conventions for attribute names
// http.request.method, http.response.status_code, url.path, etc.

GitOps

ArgoCD app-of-apps pattern

# apps/root-app.yaml — bootstrap app that manages all other apps
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/myorg/infra
    targetRevision: main
    path: apps/
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true        # remove resources deleted from git
      selfHeal: true     # revert manual kubectl changes

Flux (alternative to ArgoCD)

# flux-system/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
spec:
  interval: 5m
  path: ./apps/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: infra
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: api
      namespace: default

Security Scanning

Trivy — image and IaC scanning

# Scan container image (fail on HIGH/CRITICAL)
trivy image --severity HIGH,CRITICAL --exit-code 1 myrepo/app:$GIT_SHA

# Scan Terraform IaC
trivy config --severity HIGH,CRITICAL ./infra/

# Generate SBOM (software bill of materials)
trivy image --format cyclonedx --output sbom.json myrepo/app:$GIT_SHA

GitHub Actions security pipeline

jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683

      - name: Build image
        run: docker build -t app:${{ github.sha }} .

      - name: Scan image
        uses: aquasecurity/trivy-action@915b19bbe73b92a6cf82a1bc12b087c9a19a5fe  # v0.28.0
        with:
          image-ref: app:${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          severity: HIGH,CRITICAL
          exit-code: '1'

      - name: Upload SARIF to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: trivy-results.sarif

      - name: Scan IaC
        run: trivy config --severity HIGH,CRITICAL --exit-code 1 ./infra/

Secret scanning

# Gitleaks — prevent secrets from reaching git
# Install pre-commit hook
gitleaks protect --staged

# Scan full repo history
gitleaks detect --source . --report-path gitleaks-report.json

Cost Optimization

Karpenter (EKS) — replaces Cluster Autoscaler

Karpenter + spot diversification + consolidation routinely delivers 50-70% compute cost reduction.

# NodePool — provision right-sized nodes just in time
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # spot first
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "m6i.large", "m6i.xlarge"]  # diversify
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s   # bin-pack idle nodes aggressively
  limits:
    cpu: "1000"
# EC2NodeClass — use latest EKS-optimized AMI automatically
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest   # auto-updates to latest patched AMI
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

Spot instance best practices

# Use multiple instance families and sizes — maximizes spot pool availability
requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
      - m5.xlarge
      - m5a.xlarge
      - m6i.xlarge
      - m6a.xlarge
      - m7i.xlarge

# Handle spot interruption gracefully
# Karpenter automatically drains nodes 2 minutes before interruption
# Ensure pods have preStop hooks + terminationGracePeriodSeconds
spec:
  terminationGracePeriodSeconds: 90
  containers:
    - lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]  # allow LB to drain

Resource optimization

# Find over-provisioned pods with VPA recommendation
kubectl get vpa -A

# Goldilocks — namespace-level VPA dashboard
kubectl label ns default goldilocks.fairwinds.com/enabled=true
kubectl port-forward svc/goldilocks-dashboard 8080:80 -n goldilocks

Quick Reference

CI/CD Checklist

  • OIDC auth — no long-lived secrets in GitHub Actions secrets
  • Actions pinned to full commit SHA
  • Environment protection rules on production jobs
  • Docker layer cache configured
  • Trivy scan in pipeline (fail on HIGH/CRITICAL)
  • SBOM generated and attached to release

Kubernetes Checklist

  • resources.requests and resources.limits.memory set on all containers
  • All three health probes defined (startup, readiness, liveness)
  • PodDisruptionBudget defined for all critical workloads
  • Network policies: default-deny + explicit allow rules
  • topologySpreadConstraints across zones
  • terminationGracePeriodSeconds ≥ 30 + preStop sleep

Terraform Checklist

  • Remote state with locking (S3 + DynamoDB or Terraform Cloud)
  • Provider versions pinned (~> minor floor, exact for prod)
  • sensitive = true on secret variables
  • Drift detection in scheduled CI job (plan -refresh-only)
  • Modules in /modules, environments in /environments

Observability Checklist

  • OTel Collector as routing hub (never direct-to-vendor)
  • Tail sampling configured (always keep errors + slow traces)
  • Trace IDs injected into log lines
  • OTel semantic conventions used for attribute names
  • RED metrics: Rate, Errors, Duration per service

Source

git clone https://github.com/Joncik91/ucai/blob/main/skills/devops/SKILL.mdView on GitHub

Overview

DevOps Engineer streamlines modern software delivery by building CI/CD pipelines, containerizing apps, and managing cloud infrastructure (AWS, GCP, Azure). It also covers observability, GitOps, and cost optimization to keep production reliable and scalable.

How This Skill Works

The skill orchestrates tooling and patterns across GitHub Actions, Docker, Kubernetes, and Terraform/OpenTofu. Teams implement OIDC-based secret management, multi-stage Docker builds with distroless images, and IaC to provision cloud resources, while OpenTelemetry-based observability and GitOps automate deployment and rollback.

When to Use It

  • Setting up CI/CD pipelines for applications and services.
  • Deploying applications to cloud providers (AWS, GCP, Azure) with secure, short lived credentials.
  • Containerizing services and implementing multi stage builds with distroless images.
  • Provisioning and managing cloud infrastructure using Terraform or OpenTofu (IaC).
  • Implementing observability, metrics, tracing, and GitOps for reliable deployments.

Quick Start

  1. Step 1: Create a GitHub Actions workflow and enable OIDC-based credentials for your cloud provider.
  2. Step 2: Build a multi-stage Dockerfile using distroless production images and enable BuildKit secrets.
  3. Step 3: Define Terraform/OpenTofu infrastructure and Kubernetes manifests, and wire them into a GitOps workflow for automated deployments.

Best Practices

  • Pin actions to full commit SHAs to ensure immutability and auditability.
  • Use OIDC-based secret management to avoid long lived credentials.
  • Adopt multi-stage Docker builds with distroless production images.
  • Enable BuildKit caching and secrets to avoid secret leakage in layers.
  • Set Kubernetes resource requests and limits to ensure QoS and stability.

Example Use Cases

  • Deploy a Node.js app to AWS with GitHub Actions using OIDC.
  • Use reusable workflows to standardize build and deployment steps.
  • Publish container images with multi-stage Dockerfiles and distroless images.
  • Manage infrastructure with Terraform/OpenTofu and apply changes via GitOps.
  • Instrument services with OpenTelemetry for end-to-end observability.

Frequently Asked Questions

Add this skill to your agents
Sponsor this space

Reach thousands of developers