# Server Management
Server management principles for production operations. Learn to THINK, not memorize commands.
## 1. Process Management Principles
### Tool Selection
| Scenario | Tool |
|---|---|
| Node.js app | PM2 (clustering, reload) |
| Any app | systemd (Linux native) |
| Containers | Docker/Podman |
| Orchestration | Kubernetes, Docker Swarm |
### Process Management Goals
| Goal | What It Means |
|---|---|
| Restart on crash | Auto-recovery |
| Zero-downtime reload | No service interruption |
| Clustering | Use all CPU cores |
| Persistence | Survive server reboot |
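For a Node.js app under PM2, the goals in this table map directly onto fields in an ecosystem file. A minimal sketch (the app name, script path, and memory limit are placeholders to adjust):

```javascript
// ecosystem.config.js — PM2 process file (illustrative; adjust names/paths)
const config = {
  apps: [
    {
      name: "api",                // placeholder app name
      script: "./server.js",      // placeholder entry point
      instances: "max",           // clustering: one worker per CPU core
      exec_mode: "cluster",       // required for zero-downtime `pm2 reload`
      autorestart: true,          // restart on crash
      max_memory_restart: "512M", // recycle a worker that leaks memory
      env: { NODE_ENV: "production" },
    },
  ],
};

module.exports = config;
```

`pm2 start ecosystem.config.js` launches it, `pm2 reload api` performs a zero-downtime reload, and `pm2 startup` followed by `pm2 save` covers persistence across reboots.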
## 2. Monitoring Principles
### What to Monitor
| Category | Key Metrics |
|---|---|
| Availability | Uptime, health checks |
| Performance | Response time, throughput |
| Errors | Error rate, types |
| Resources | CPU, memory, disk |
### Alert Severity Strategy
| Level | Response |
|---|---|
| Critical | Immediate action |
| Warning | Investigate soon |
| Info | Review daily |
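The severity tiers above can be encoded as a small routing function. The thresholds here are illustrative examples, not recommendations — tune them to your own baseline:

```javascript
// Map metric readings to an alert level. Thresholds are examples only.
function classify(errorRate, p95LatencyMs) {
  if (errorRate > 0.05 || p95LatencyMs > 2000) return "critical"; // page someone now
  if (errorRate > 0.01 || p95LatencyMs > 1000) return "warning";  // investigate soon
  return "info";                                                  // review daily
}
```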
### Monitoring Tool Selection
| Need | Options |
|---|---|
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |
## 3. Log Management Principles
### Log Strategy
| Log Type | Purpose |
|---|---|
| Application logs | Debug, audit |
| Access logs | Traffic analysis |
| Error logs | Issue detection |
### Log Principles
- Rotate logs to prevent disk fill
- Structured logging (JSON) for parsing
- Appropriate levels (error/warn/info/debug)
- No sensitive data in logs
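Structured levels and redaction can be combined in one small helper. This is a sketch, not a replacement for a real logging library (pino, winston); the redacted field names are assumptions to extend for your own schema:

```javascript
// Emit one JSON log line, filtered by level, with sensitive fields redacted.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };
const REDACT = new Set(["password", "token", "authorization"]); // assumed field names

function logLine(minLevel, level, msg, fields = {}) {
  if (LEVELS[level] > LEVELS[minLevel]) return null; // below threshold: drop
  const safe = {};
  for (const [k, v] of Object.entries(fields)) {
    safe[k] = REDACT.has(k.toLowerCase()) ? "[REDACTED]" : v;
  }
  return JSON.stringify({ ts: new Date().toISOString(), level, msg, ...safe });
}
```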
## 4. Scaling Decisions
### When to Scale
| Symptom | Solution |
|---|---|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |
### Scaling Strategy
| Type | When to Use |
|---|---|
| Vertical | Quick fix, single instance |
| Horizontal | Sustainable, distributed |
| Auto | Variable traffic |
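A horizontal auto-scaling rule can be reduced to one formula: desired instances = ceil(current × observed utilization / target utilization), clamped to a min/max guardrail. This is the same shape Kubernetes' Horizontal Pod Autoscaler uses; the numbers below are illustrative:

```javascript
// Desired instance count from average CPU utilization, with guardrails.
// target/min/max defaults are illustrative, not recommendations.
function desiredInstances(current, avgCpuPct, { target = 60, min = 2, max = 10 } = {}) {
  const raw = Math.ceil(current * (avgCpuPct / target));
  return Math.min(max, Math.max(min, raw));
}
```

The min guardrail prevents scaling to zero on idle traffic; the max guardrail caps runaway cost during a spike while you investigate.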
## 5. Health Check Principles
### What Constitutes Healthy
| Check | Meaning |
|---|---|
| HTTP 200 | Service responding |
| Database connected | Data accessible |
| Dependencies OK | External services reachable |
| Resources OK | CPU/memory not exhausted |
### Health Check Implementation
- Simple: Just return 200
- Deep: Check all dependencies
- Choose based on load balancer needs
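A deep check is essentially an aggregator over dependency probes. A sketch, assuming each probe is an async function that resolves truthy when healthy, and the usual 200/503 mapping that load balancers expect:

```javascript
// Run dependency probes and reduce to an HTTP status code.
// Probe names ("db", "cache", ...) are placeholders.
async function healthStatus(probes) {
  const results = {};
  for (const [name, probe] of Object.entries(probes)) {
    try {
      results[name] = Boolean(await probe());
    } catch {
      results[name] = false; // a throwing probe counts as unhealthy
    }
  }
  const healthy = Object.values(results).every(Boolean);
  return { code: healthy ? 200 : 503, results };
}
```

Wire this to a `/health` route for a deep check, or skip the probes entirely and return 200 for the simple variant.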
## 6. Security Principles
| Area | Principle |
|---|---|
| Access | SSH keys only, no passwords |
| Firewall | Only needed ports open |
| Updates | Regular security patches |
| Secrets | Environment vars, not files |
| Audit | Log access and changes |
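The secrets principle implies failing fast when a required variable is absent, rather than limping along with `undefined`. A minimal sketch (the variable name in the comment is a placeholder):

```javascript
// Read a required secret from the environment; crash loudly if missing.
// e.g. requireEnv("DATABASE_URL") at startup, before serving traffic.
function requireEnv(name) {
  const value = process.env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```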
## 7. Troubleshooting Priority

When something's wrong, check in order:
1. Check if running (process status)
2. Check logs (error messages)
3. Check resources (disk, memory, CPU)
4. Check network (ports, DNS)
5. Check dependencies (database, APIs)
## 8. Anti-Patterns
| ❌ Don't | ✅ Do |
|---|---|
| Run as root | Use non-root user |
| Ignore logs | Set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |
Remember: A well-managed server is boring. That's the goal.
## Source

[SKILL.md on GitHub](https://github.com/vudovn/antigravity-kit/blob/main/.agent/skills/server-management/SKILL.md)

## Overview
This skill teaches production-ready server-management principles, focusing on decision-making for process control, monitoring, and scaling. It emphasizes thinking over memorizing commands and provides practical guidelines for tool selection, health checks, logging, and security.

## How This Skill Works

The approach guides tool selection by workload: PM2 for Node.js, systemd for Linux-native services, Docker/Podman for containers, and Kubernetes or Docker Swarm for orchestration. It defines clear goals such as auto-recovery, zero-downtime reload, clustering, and persistence, then ties them to a structured monitoring, logging, health-check, and scaling framework.
## When to Use It

- Deploying a Node.js app that benefits from clustering and reload capabilities via PM2
- Managing any app with Linux-native service requirements using systemd
- Running containerized workloads with Docker or Podman that need log rotation
- Orchestrating multi-service deployments with Kubernetes or Docker Swarm
- Implementing auto-scaling and health-based scaling to handle variable traffic
## Quick Start

- Step 1: Assess the workload and select the appropriate tool set (PM2 for Node.js, systemd for Linux, Docker/Podman for containers, Kubernetes for orchestration)
- Step 2: Define health checks, a logging strategy with rotation, and an alerting baseline
- Step 3: Implement auto-recovery and scaling rules, then validate with light load tests
## Best Practices

- Align tool choice with workload and goals (PM2 for Node.js, systemd for Linux, Docker/Podman for containers, Kubernetes for orchestration)
- Design for auto-recovery, zero-downtime reloads, and efficient clustering
- Implement structured logging with rotation and avoid exposing sensitive data
- Define layered health checks and a coherent monitoring and alerting strategy
- Automate scaling decisions with guardrails and observable metrics
## Example Use Cases
- Node.js app using PM2 clustering and auto restart
- Linux service managed by systemd with health checks and resource limits
- Containerized service using Docker/Podman with log rotation and metrics
- Kubernetes deployment with readiness and liveness probes and horizontal pod autoscaler
- Basic monitoring setup using PM2 metrics, uptime checks, and alerting