# k8s-debug: Kubernetes Debugging

Install:

```bash
npx machina-cli add skill agenticdevops/devops-execution-engine/k8s-debug --openclaw
```

Expert guidance for diagnosing and resolving Kubernetes issues.
## When to Use This Skill

Use this skill when:
- Pods are not running or restarting
- Services are not responding
- Deployments are stuck
- Resource issues are suspected
- Network connectivity problems occur
## Quick Diagnosis Commands

### Check Pod Status

```bash
# All pods not in Running state
kubectl get pods -A | grep -v Running | grep -v Completed

# Pods in specific namespace
kubectl get pods -n <namespace> -o wide

# Detailed pod info
kubectl describe pod <pod-name> -n <namespace>
```
### Check Events

```bash
# Recent cluster events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Events for specific pod
kubectl get events --field-selector involvedObject.name=<pod-name>
```
### Check Logs

```bash
# Current logs
kubectl logs <pod-name> -n <namespace>

# Previous container logs (after crash)
kubectl logs <pod-name> -n <namespace> --previous

# Logs with timestamps
kubectl logs <pod-name> --timestamps

# Follow logs
kubectl logs <pod-name> -f

# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name>
```
## Common Pod States and Solutions

### CrashLoopBackOff

**Symptoms:** Pod repeatedly crashes and restarts.

**Diagnosis:**

```bash
# Check previous logs
kubectl logs <pod-name> --previous

# Check events
kubectl describe pod <pod-name> | grep -A 10 Events

# Check exit code
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```
**Common Causes:**

- **Application error** - check logs for stack traces
- **Missing config/secrets** - verify ConfigMaps and Secrets exist
- **Resource limits too low** - check whether the container was OOMKilled
- **Liveness probe failing** - review probe configuration
- **Missing dependencies** - database or external service not reachable
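The exit code from the diagnosis step above often points directly at one of these causes. A minimal sketch of the mapping (the `explain_exit_code` helper is hypothetical, not part of kubectl; it is pure shell and runs without a cluster):

```bash
# Hypothetical helper: map a container exit code to a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "Completed normally" ;;
    1)   echo "Application error - check logs for stack traces" ;;
    137) echo "SIGKILL, often OOMKilled - check memory limits" ;;
    139) echo "SIGSEGV - application segfault" ;;
    143) echo "SIGTERM - container was asked to shut down" ;;
    *)   echo "Exit code $1 - check application documentation" ;;
  esac
}

# Feed it the exit code from the jsonpath query above, e.g.:
# explain_exit_code "$(kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
explain_exit_code 137   # SIGKILL, often OOMKilled - check memory limits
```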
### ImagePullBackOff

**Symptoms:** Container image cannot be pulled.

**Diagnosis:**

```bash
# Check image name and events
kubectl describe pod <pod-name> | grep -E "(Image|Events)" -A 5
```
**Common Causes:**

- **Wrong image name/tag** - verify the image exists in the registry
- **Private registry auth** - check imagePullSecrets
- **Network issues** - registry not reachable
- **Rate limiting** - Docker Hub rate limits
**Fix:**

```bash
# Create/update pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password>
```
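If every pod in the namespace should use this secret, one common pattern is attaching it to the namespace's default service account instead of adding `imagePullSecrets` to each pod spec (a sketch; `regcred` is the secret created above):

```bash
# Attach the pull secret to the default service account so new pods pick it up
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```

Existing pods are not updated retroactively; recreate them (e.g. via a rollout restart) to pick up the secret.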
### Pending

**Symptoms:** Pod stays in Pending state.

**Diagnosis:**

```bash
# Check why pod is pending
kubectl describe pod <pod-name> | grep -A 5 "Events"

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check for taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```
**Common Causes:**

- **Insufficient resources** - not enough CPU/memory on nodes
- **Node selector mismatch** - no nodes match the selector
- **Taints/tolerations** - pod doesn't tolerate node taints
- **PVC not bound** - persistent volume not available
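For the PVC case, a quick check (sketch):

```bash
# List PVCs and their binding status; anything not "Bound" can block scheduling
kubectl get pvc -n <namespace>

# Inspect why a particular claim is unbound (storage class, capacity, events)
kubectl describe pvc <pvc-name> -n <namespace>
```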
### OOMKilled

**Symptoms:** Container killed because it exceeded its memory limit.

**Diagnosis:**

```bash
# Check termination reason
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Check memory limits vs requests
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'
```
**Fix:**

- Increase memory limits in the deployment
- Optimize application memory usage
- Add memory monitoring
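One way to raise the limits without hand-editing YAML is `kubectl set resources` (the values here are illustrative; size them from observed usage via `kubectl top pods`):

```bash
# Raise memory request/limit on the deployment's containers; triggers a rollout
kubectl set resources deployment/<name> -n <namespace> \
  --requests=memory=256Mi --limits=memory=512Mi
```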
## Resource Debugging

### Check Resource Usage

```bash
# Node resource usage
kubectl top nodes

# Pod resource usage, sorted by memory
kubectl top pods -A --sort-by=memory

# Pod resource usage in a namespace
kubectl top pods -n <namespace>
```
### Check Resource Requests/Limits

```bash
# All pods with resource requests and limits
kubectl get pods -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory'
```
## Network Debugging

### Check Service Connectivity

```bash
# Get service endpoints
kubectl get endpoints <service-name>

# Check if the service has matching pods
kubectl get pods -l <service-selector>

# Test from inside the cluster
kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service>:<port>
```
### DNS Issues

```bash
# Test DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup <service-name>

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
```
## Quick Fixes

### Restart Deployment

```bash
# Rolling restart (zero downtime)
kubectl rollout restart deployment/<name> -n <namespace>

# Check rollout status
kubectl rollout status deployment/<name> -n <namespace>
```

### Scale Deployment

```bash
# Scale up/down
kubectl scale deployment/<name> --replicas=3 -n <namespace>
```

### Delete Stuck Pod

```bash
# Force delete (use with caution)
kubectl delete pod <pod-name> --grace-period=0 --force
```
## Debugging Checklist

- Check pod status: `kubectl get pods`
- Check events: `kubectl get events --sort-by='.lastTimestamp'`
- Check logs: `kubectl logs <pod>`
- Check describe: `kubectl describe pod <pod>`
- Check resources: `kubectl top pods`
- Check network: test service connectivity
- Check config: verify ConfigMaps/Secrets
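The checklist can also be scripted as a dry run that prints the commands to execute for a given pod, so nothing is run against the cluster by accident (the `debug_checklist` helper is hypothetical, not part of this skill's tooling):

```bash
# Hypothetical helper: print the checklist commands for one pod (dry run, no cluster needed)
debug_checklist() {
  pod="$1"
  ns="${2:-default}"
  cat <<EOF
kubectl get pod $pod -n $ns -o wide
kubectl get events -n $ns --field-selector involvedObject.name=$pod --sort-by='.lastTimestamp'
kubectl logs $pod -n $ns
kubectl describe pod $pod -n $ns
kubectl top pod $pod -n $ns
EOF
}

# Print the checklist for pod "my-api" in namespace "prod"
debug_checklist my-api prod
```

Piping the output through `sh` would execute the checks; reviewing it first keeps the force-delete style commands out of muscle memory.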
## Related Skills

- `k8s-deploy`: for deployment issues
- `log-analysis`: for log pattern analysis
- `incident-response`: for structured incident handling
## Source

```bash
git clone https://github.com/agenticdevops/devops-execution-engine.git
# skill file: skills/k8s-debug/SKILL.md
```

## Overview

Kubernetes debugging and troubleshooting workflows provide expert guidance for diagnosing issues across pods, services, and deployments. This skill emphasizes practical, kubectl-based checks of logs, events, and resources to quickly identify root causes and verify fixes.
## How This Skill Works

The skill offers structured diagnostic recipes: start with pod status checks; review events, logs, and pod descriptions; correlate findings with common states such as CrashLoopBackOff, ImagePullBackOff, Pending, and OOMKilled; then apply targeted remedies (updating configuration, credentials, image names, or resource requests) before re-checking pod state.
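That recipe, condensed to a single-pod triage pass (a command-sequence sketch; the placeholders need real values):

```bash
# 1. Identify: what state is the pod in?
kubectl get pod <pod-name> -n <namespace> -o wide

# 2. Correlate: events and logs around the failure
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# 3. Remedy, then re-check until the rollout settles
kubectl rollout restart deployment/<name> -n <namespace>
kubectl rollout status deployment/<name> -n <namespace>
```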
## When to Use It

- Pods are not running or are repeatedly restarting.
- Services are not responding.
- Deployments are stuck.
- Resource issues are suspected.
- Network connectivity problems occur.
## Quick Start

1. Identify the issue with `kubectl get pods -A` and `kubectl describe pod <pod-name> -n <namespace>`.
2. Inspect events and logs with `kubectl get events -A` and `kubectl logs <pod-name> -n <namespace> [--previous] [--timestamps]`.
3. Apply fixes (update the image, adjust ConfigMaps/Secrets, modify resource requests/limits) and re-check with `kubectl get pods` and `kubectl top nodes`.
## Best Practices

- Start with pod status and events (`kubectl get pods -A`, `kubectl describe pod`) to surface root causes.
- Check logs with `kubectl logs`, including `--previous` and `--timestamps` when relevant, to capture failures and transitions.
- Review related resources (ConfigMaps, Secrets, Deployments, Services) and their events for clues.
- Verify image names, registries, and `imagePullSecrets` to fix ImagePullBackOff scenarios.
- Monitor resource requests/limits and check node capacity when pods stay Pending or are OOMKilled.
## Example Use Cases
- Diagnose CrashLoopBackOff in a web API pod and identify faulty startup logic.
- Resolve ImagePullBackOff when a private registry blocks image pulls.
- Unstick a Deployment that is stuck during rollout or scaling.
- Troubleshoot Pending due to insufficient node resources or taints.
- Investigate OOMKilled by memory spikes and adjust deployment memory requests/limits.