# k8s-debug: Kubernetes Debugging

Install:

```bash
npx machina-cli add skill agenticdevops/devops-execution-engine/k8s-debug --openclaw
```

Expert guidance for diagnosing and resolving Kubernetes issues.
## When to Use This Skill

Use this skill when:
- Pods are not running or restarting
- Services are not responding
- Deployments are stuck
- Resource issues are suspected
- Network connectivity problems occur
## Quick Diagnosis Commands

### Check Pod Status

```bash
# All pods not in Running state
kubectl get pods -A | grep -v Running | grep -v Completed

# Pods in specific namespace
kubectl get pods -n <namespace> -o wide

# Detailed pod info
kubectl describe pod <pod-name> -n <namespace>
```
### Check Events

```bash
# Recent cluster events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Events for specific pod
kubectl get events --field-selector involvedObject.name=<pod-name>
```
### Check Logs

```bash
# Current logs
kubectl logs <pod-name> -n <namespace>

# Previous container logs (after crash)
kubectl logs <pod-name> -n <namespace> --previous

# Logs with timestamps
kubectl logs <pod-name> --timestamps

# Follow logs
kubectl logs <pod-name> -f

# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name>
```
## Common Pod States and Solutions

### CrashLoopBackOff

**Symptoms:** Pod repeatedly crashes and restarts.

**Diagnosis:**

```bash
# Check previous logs
kubectl logs <pod-name> --previous

# Check events
kubectl describe pod <pod-name> | grep -A 10 Events

# Check exit code
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```
**Common Causes:**

- **Application error** - check logs for stack traces
- **Missing config/secrets** - verify ConfigMaps and Secrets exist
- **Resource limits too low** - check whether the container was OOMKilled
- **Liveness probe failing** - review probe configuration
- **Missing dependencies** - database or external service not reachable
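The exit code from the diagnosis step above often points directly at one of these causes. A minimal sketch of the mapping (the `explain_exit_code` helper is hypothetical, not part of kubectl; it is pure shell and runs without a cluster):

```bash
# Hypothetical helper: map a container exit code to a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "Completed normally" ;;
    1)   echo "Application error - check logs for stack traces" ;;
    137) echo "SIGKILL, often OOMKilled - check memory limits" ;;
    139) echo "SIGSEGV - application segfault" ;;
    143) echo "SIGTERM - container was asked to shut down" ;;
    *)   echo "Exit code $1 - check application documentation" ;;
  esac
}

# Feed it the exit code from the jsonpath query above, e.g.:
# explain_exit_code "$(kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
explain_exit_code 137   # SIGKILL, often OOMKilled - check memory limits
```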
### ImagePullBackOff

**Symptoms:** Container image cannot be pulled.

**Diagnosis:**

```bash
# Check image name and events
kubectl describe pod <pod-name> | grep -E "(Image|Events)" -A 5
```
**Common Causes:**

- **Wrong image name/tag** - verify the image exists in the registry
- **Private registry auth** - check imagePullSecrets
- **Network issues** - registry not reachable
- **Rate limiting** - Docker Hub rate limits
**Fix:**

```bash
# Create/update pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password>
```
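If every pod in the namespace should use this secret, one common pattern is attaching it to the namespace's default service account instead of adding `imagePullSecrets` to each pod spec (a sketch; `regcred` is the secret created above):

```bash
# Attach the pull secret to the default service account so new pods pick it up
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```

Existing pods are not updated retroactively; recreate them (e.g. via a rollout restart) to pick up the secret.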
### Pending

**Symptoms:** Pod stays in Pending state.

**Diagnosis:**

```bash
# Check why pod is pending
kubectl describe pod <pod-name> | grep -A 5 "Events"

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check for taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```
**Common Causes:**

- **Insufficient resources** - not enough CPU/memory on nodes
- **Node selector mismatch** - no nodes match the selector
- **Taints/tolerations** - pod doesn't tolerate node taints
- **PVC not bound** - persistent volume not available
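For the PVC case, a quick check (sketch):

```bash
# List PVCs and their binding status; anything not "Bound" can block scheduling
kubectl get pvc -n <namespace>

# Inspect why a particular claim is unbound (storage class, capacity, events)
kubectl describe pvc <pvc-name> -n <namespace>
```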
### OOMKilled

**Symptoms:** Container killed because it exceeded its memory limit.

**Diagnosis:**

```bash
# Check termination reason
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Check memory limits vs requests
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'
```
**Fix:**

- Increase memory limits in the deployment
- Optimize application memory usage
- Add memory monitoring
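One way to raise the limits without hand-editing YAML is `kubectl set resources` (the values here are illustrative; size them from observed usage via `kubectl top pods`):

```bash
# Raise memory request/limit on the deployment's containers; triggers a rollout
kubectl set resources deployment/<name> -n <namespace> \
  --requests=memory=256Mi --limits=memory=512Mi
```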
## Resource Debugging

### Check Resource Usage

```bash
# Node resource usage
kubectl top nodes

# Pod resource usage, sorted by memory
kubectl top pods -A --sort-by=memory

# Pod resource usage in a namespace
kubectl top pods -n <namespace>
```
### Check Resource Requests/Limits

```bash
# All pods with resource requests and limits
kubectl get pods -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory'
```
## Network Debugging

### Check Service Connectivity

```bash
# Get service endpoints
kubectl get endpoints <service-name>

# Check if the service has matching pods
kubectl get pods -l <service-selector>

# Test from inside the cluster
kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service>:<port>
```
### DNS Issues

```bash
# Test DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup <service-name>

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
```
## Quick Fixes

### Restart Deployment

```bash
# Rolling restart (zero downtime)
kubectl rollout restart deployment/<name> -n <namespace>

# Check rollout status
kubectl rollout status deployment/<name> -n <namespace>
```

### Scale Deployment

```bash
# Scale up/down
kubectl scale deployment/<name> --replicas=3 -n <namespace>
```

### Delete Stuck Pod

```bash
# Force delete (use with caution)
kubectl delete pod <pod-name> --grace-period=0 --force
```
## Debugging Checklist

- Check pod status: `kubectl get pods`
- Check events: `kubectl get events --sort-by='.lastTimestamp'`
- Check logs: `kubectl logs <pod>`
- Check describe: `kubectl describe pod <pod>`
- Check resources: `kubectl top pods`
- Check network: test service connectivity
- Check config: verify ConfigMaps/Secrets
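The checklist can also be scripted as a dry run that prints the commands to execute for a given pod, so nothing is run against the cluster by accident (the `debug_checklist` helper is hypothetical, not part of this skill's tooling):

```bash
# Hypothetical helper: print the checklist commands for one pod (dry run, no cluster needed)
debug_checklist() {
  pod="$1"
  ns="${2:-default}"
  cat <<EOF
kubectl get pod $pod -n $ns -o wide
kubectl get events -n $ns --field-selector involvedObject.name=$pod --sort-by='.lastTimestamp'
kubectl logs $pod -n $ns
kubectl describe pod $pod -n $ns
kubectl top pod $pod -n $ns
EOF
}

# Print the checklist for pod "my-api" in namespace "prod"
debug_checklist my-api prod
```

Piping the output through `sh` would execute the checks; reviewing it first keeps the force-delete style commands out of muscle memory.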
## Related Skills

- `k8s-deploy`: for deployment issues
- `log-analysis`: for log pattern analysis
- `incident-response`: for structured incident handling
## Source

```bash
git clone https://github.com/agenticdevops/devops-execution-engine.git
# skill file: skills/k8s-debug/SKILL.md
```

## Overview

Kubernetes debugging and troubleshooting workflows provide expert guidance for diagnosing issues across pods, services, and deployments. This skill emphasizes practical, kubectl-based checks of logs, events, and resources to quickly identify root causes and verify fixes.
## How This Skill Works

The skill offers structured diagnostic recipes: start with pod status checks; review events, logs, and pod descriptions; correlate findings with common states such as CrashLoopBackOff, ImagePullBackOff, Pending, and OOMKilled; then apply targeted remedies (updating configuration, credentials, image names, or resource requests) before re-checking pod state.
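That recipe, condensed to a single-pod triage pass (a command-sequence sketch; the placeholders need real values):

```bash
# 1. Identify: what state is the pod in?
kubectl get pod <pod-name> -n <namespace> -o wide

# 2. Correlate: events and logs around the failure
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# 3. Remedy, then re-check until the rollout settles
kubectl rollout restart deployment/<name> -n <namespace>
kubectl rollout status deployment/<name> -n <namespace>
```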
## When to Use It

- Pods are not running or are repeatedly restarting.
- Services are not responding.
- Deployments are stuck.
- Resource issues are suspected.
- Network connectivity problems occur.
## Quick Start

1. Identify the issue with `kubectl get pods -A` and `kubectl describe pod <pod-name> -n <namespace>`.
2. Inspect events and logs with `kubectl get events -A` and `kubectl logs <pod-name> -n <namespace> [--previous] [--timestamps]`.
3. Apply fixes (update the image, adjust ConfigMaps/Secrets, modify resource requests/limits) and re-check with `kubectl get pods` and `kubectl top nodes`.
## Best Practices

- Start with pod status and events (`kubectl get pods -A`, `kubectl describe pod`) to surface root causes.
- Check logs with `kubectl logs`, including `--previous` and `--timestamps` when relevant, to capture failures and transitions.
- Review related resources (ConfigMaps, Secrets, Deployments, Services) and their events for clues.
- Verify image names, registries, and `imagePullSecrets` to fix ImagePullBackOff scenarios.
- Monitor resource requests/limits and check node capacity when pods stay Pending or are OOMKilled.
## Example Use Cases
- Diagnose CrashLoopBackOff in a web API pod and identify faulty startup logic.
- Resolve ImagePullBackOff when a private registry blocks image pulls.
- Unstick a Deployment that is stuck during rollout or scaling.
- Troubleshoot Pending due to insufficient node resources or taints.
- Investigate OOMKilled by memory spikes and adjust deployment memory requests/limits.