
k8s-debug

npx machina-cli add skill agenticdevops/devops-execution-engine/k8s-debug --openclaw

Kubernetes Debugging

Expert guidance for diagnosing and resolving Kubernetes issues.

When to Use This Skill

Use this skill when:

  • Pods are not running or are restarting repeatedly
  • Services are not responding
  • Deployments are stuck
  • Resource issues are suspected
  • Network connectivity problems occur

Quick Diagnosis Commands

Check Pod Status

# All pods not in Running state
kubectl get pods -A | grep -v Running | grep -v Completed

# Pods in specific namespace
kubectl get pods -n <namespace> -o wide

# Detailed pod info
kubectl describe pod <pod-name> -n <namespace>

Check Events

# Recent cluster events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Events for specific pod
kubectl get events --field-selector involvedObject.name=<pod-name>

Check Logs

# Current logs
kubectl logs <pod-name> -n <namespace>

# Previous container logs (after crash)
kubectl logs <pod-name> -n <namespace> --previous

# Logs with timestamps
kubectl logs <pod-name> --timestamps

# Follow logs
kubectl logs <pod-name> -f

# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name>

Common Pod States and Solutions

CrashLoopBackOff

Symptoms: Pod repeatedly crashes and restarts.

Diagnosis:

# Check previous logs
kubectl logs <pod-name> --previous

# Check events
kubectl describe pod <pod-name> | grep -A 10 Events

# Check exit code
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

Common Causes:

  1. Application error - Check logs for stack traces
  2. Missing config/secrets - Verify ConfigMaps and Secrets exist
  3. Resource limits too low - Check if OOMKilled
  4. Liveness probe failing - Review probe configuration
  5. Missing dependencies - Database, external service not reachable
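The exit code retrieved above helps narrow these causes down. As a rough sketch, a hypothetical helper (not part of the skill) mapping common exit codes to likely causes:

```shell
# Hypothetical helper: map a container exit code to a likely CrashLoopBackOff cause.
# Codes above 128 are 128 + signal number.
exit_code_hint() {
  case "$1" in
    1)   echo "application error: check logs for stack traces" ;;
    137) echo "SIGKILL (128+9): often OOMKilled, check memory limits" ;;
    139) echo "SIGSEGV (128+11): segmentation fault in the application" ;;
    143) echo "SIGTERM (128+15): terminated, often a failing liveness probe" ;;
    *)   echo "exit code $1: check 'kubectl logs --previous' and events" ;;
  esac
}

# Example: feed it the exit code from the jsonpath query above.
# exit_code_hint "$(kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```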

ImagePullBackOff

Symptoms: Container image cannot be pulled.

Diagnosis:

# Check image name and events
kubectl describe pod <pod-name> | grep -E "(Image|Events)" -A 5

Common Causes:

  1. Wrong image name/tag - Verify image exists in registry
  2. Private registry auth - Check imagePullSecrets
  3. Network issues - Registry not reachable
  4. Rate limiting - Docker Hub rate limits

Fix:

# Create/update pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password>
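Creating the secret alone is not enough: the pod spec must reference it. A hedged sketch of the two usual ways to wire it up (deployment and namespace names are placeholders):

```shell
# Reference the pull secret from the workload (merge-patches the pod template)
kubectl patch deployment <name> -n <namespace> --type merge \
  -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'

# Alternatively, attach it to the namespace's default service account
# so new pods pick it up automatically
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets":[{"name":"regcred"}]}'
```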

Pending

Symptoms: Pod stays in Pending state.

Diagnosis:

# Check why pod is pending
kubectl describe pod <pod-name> | grep -A 5 "Events"

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check for taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Common Causes:

  1. Insufficient resources - Not enough CPU/memory on nodes
  2. Node selector mismatch - No nodes match selector
  3. Taints/tolerations - Pod doesn't tolerate node taints
  4. PVC not bound - Persistent volume not available
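Depending on which cause applies, fixes range from freeing capacity to adjusting scheduling constraints. A hedged sketch (node, key, and pod names are placeholders):

```shell
# Remove a blocking taint from a node (the trailing '-' deletes it)
kubectl taint nodes <node-name> <taint-key>-

# Check whether the pod's PVC is bound
kubectl get pvc -n <namespace>

# Read the scheduler's own explanation of why the pod is unschedulable
kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'
```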

OOMKilled

Symptoms: Container killed due to memory limit.

Diagnosis:

# Check termination reason
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Check memory limits vs requests
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'

Fix:

  • Increase memory limits in deployment
  • Optimize application memory usage
  • Add memory monitoring
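The first bullet can be applied directly with kubectl set resources (the values below are examples, not recommendations):

```shell
# Raise memory for the deployment's containers; this triggers a rollout
kubectl set resources deployment/<name> -n <namespace> \
  --requests=memory=256Mi --limits=memory=512Mi

# Watch the rollout and confirm the pod stays up
kubectl rollout status deployment/<name> -n <namespace>
```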

Resource Debugging

Check Resource Usage

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods -A --sort-by=memory

# Pod resource usage in namespace
kubectl top pods -n <namespace>

Check Resource Requests/Limits

# All pods with resources
kubectl get pods -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory'

Network Debugging

Check Service Connectivity

# Get service endpoints
kubectl get endpoints <service-name>

# Check if service has pods
kubectl get pods -l <service-selector>

# Test from inside cluster
kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service>:<port>

DNS Issues

# Test DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup <service-name>

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
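If the short name fails to resolve, try the fully qualified form and check CoreDNS itself (kube-system is the default namespace for cluster DNS):

```shell
# Fully qualified service DNS name
kubectl run debug --rm -it --image=busybox -- \
  nslookup <service-name>.<namespace>.svc.cluster.local

# CoreDNS logs often show upstream or config errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```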

Quick Fixes

Restart Deployment

# Rolling restart (zero downtime)
kubectl rollout restart deployment/<name> -n <namespace>

# Check rollout status
kubectl rollout status deployment/<name> -n <namespace>

Scale Deployment

# Scale up/down
kubectl scale deployment/<name> --replicas=3 -n <namespace>

Delete Stuck Pod

# Force delete (use with caution)
kubectl delete pod <pod-name> --grace-period=0 --force

Debugging Checklist

  1. Check pod status: kubectl get pods
  2. Check events: kubectl get events --sort-by='.lastTimestamp'
  3. Check logs: kubectl logs <pod>
  4. Check describe: kubectl describe pod <pod>
  5. Check resources: kubectl top pods
  6. Check network: Test service connectivity
  7. Check config: Verify ConfigMaps/Secrets
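The checklist above can be scripted as a single triage pass over one pod. A minimal sketch, assuming kubectl is configured; the `triage` function name is ours, not part of the skill:

```shell
# Hypothetical triage helper: run the debugging checklist against one pod.
triage() {
  pod="$1"; ns="${2:-default}"
  if [ -z "$pod" ]; then
    echo "usage: triage <pod-name> [namespace]" >&2
    return 1
  fi
  kubectl get pod "$pod" -n "$ns" -o wide                    # 1. status
  kubectl get events -n "$ns" \
    --field-selector involvedObject.name="$pod" \
    --sort-by='.lastTimestamp'                               # 2. events
  kubectl logs "$pod" -n "$ns" --tail=50                     # 3. logs
  kubectl describe pod "$pod" -n "$ns" | grep -A 10 Events   # 4. describe
  kubectl top pod "$pod" -n "$ns" 2>/dev/null || true        # 5. resources
}
```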

Related Skills

  • k8s-deploy: For deployment issues
  • log-analysis: For log pattern analysis
  • incident-response: For structured incident handling

Source

git clone https://github.com/agenticdevops/devops-execution-engine.git

Overview

This skill provides expert guidance for diagnosing Kubernetes issues across pods, services, and deployments. It emphasizes practical, kubectl-based checks of status, logs, events, and resources to quickly identify root causes and verify fixes.

How This Skill Works

The skill offers structured diagnostic recipes: start with pod status checks; review events, logs, and pod descriptions; correlate findings with common states such as CrashLoopBackOff, ImagePullBackOff, Pending, and OOMKilled; then apply targeted remedies (updating configuration, credentials, image names, or resource requests) and re-check the pod state to verify the fix.

When to Use It

  • Pods are not running or are restarting repeatedly.
  • Services are not responding.
  • Deployments are stuck.
  • Resource issues are suspected.
  • Network connectivity problems occur.

Quick Start

  1. Identify the issue with kubectl get pods -A and kubectl describe pod <pod-name> -n <namespace>.
  2. Inspect events and logs with kubectl get events -A and kubectl logs <pod-name> -n <namespace> [--previous] [--timestamps].
  3. Apply fixes (update the image, adjust ConfigMaps/Secrets, modify resource requests/limits) and re-check with kubectl get pods and kubectl top nodes.

Best Practices

  • Start with pod status and events using kubectl get pods -A and kubectl describe pod to surface root causes.
  • Check logs with kubectl logs, including --previous and --timestamps when relevant to capture failures and transitions.
  • Review related resources (ConfigMaps, Secrets, deployments, services) and surface events for clues.
  • Verify image names, registries, and imagePullSecrets to fix ImagePullBackOff scenarios.
  • Monitor resources and limits and check node capacity when pods stay Pending or experience OOMKilled.

Example Use Cases

  • Diagnose CrashLoopBackOff in a web API pod and identify faulty startup logic.
  • Resolve ImagePullBackOff when a private registry blocks image pulls.
  • Unstick a Deployment that is stuck during rollout or scaling.
  • Troubleshoot Pending due to insufficient node resources or taints.
  • Investigate OOMKilled by memory spikes and adjust deployment memory requests/limits.
