# Server Management
Server management principles for production operations. Learn to THINK, not memorize commands.
## 1. Process Management Principles
### Tool Selection
| Scenario | Tool |
|---|---|
| Node.js app | PM2 (clustering, reload) |
| Any app | systemd (Linux native) |
| Containers | Docker/Podman |
| Orchestration | Kubernetes, Docker Swarm |
### Process Management Goals
| Goal | What It Means |
|---|---|
| Restart on crash | Auto-recovery |
| Zero-downtime reload | No service interruption |
| Clustering | Use all CPU cores |
| Persistence | Survive server reboot |
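For a Node.js app under PM2, the goals in this table map directly onto fields in an ecosystem file. A minimal sketch (the app name, script path, and memory limit are placeholders to adjust):

```javascript
// ecosystem.config.js — PM2 process file (illustrative; adjust names/paths)
const config = {
  apps: [
    {
      name: "api",                // placeholder app name
      script: "./server.js",      // placeholder entry point
      instances: "max",           // clustering: one worker per CPU core
      exec_mode: "cluster",       // required for zero-downtime `pm2 reload`
      autorestart: true,          // restart on crash
      max_memory_restart: "512M", // recycle a worker that leaks memory
      env: { NODE_ENV: "production" },
    },
  ],
};

module.exports = config;
```

`pm2 start ecosystem.config.js` launches it, `pm2 reload api` performs a zero-downtime reload, and `pm2 startup` followed by `pm2 save` covers persistence across reboots.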
## 2. Monitoring Principles
### What to Monitor
| Category | Key Metrics |
|---|---|
| Availability | Uptime, health checks |
| Performance | Response time, throughput |
| Errors | Error rate, types |
| Resources | CPU, memory, disk |
### Alert Severity Strategy
| Level | Response |
|---|---|
| Critical | Immediate action |
| Warning | Investigate soon |
| Info | Review daily |
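The severity tiers above can be encoded as a small routing function. The thresholds here are illustrative examples, not recommendations — tune them to your own baseline:

```javascript
// Map metric readings to an alert level. Thresholds are examples only.
function classify(errorRate, p95LatencyMs) {
  if (errorRate > 0.05 || p95LatencyMs > 2000) return "critical"; // page someone now
  if (errorRate > 0.01 || p95LatencyMs > 1000) return "warning";  // investigate soon
  return "info";                                                  // review daily
}
```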
### Monitoring Tool Selection
| Need | Options |
|---|---|
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |
## 3. Log Management Principles
### Log Strategy
| Log Type | Purpose |
|---|---|
| Application logs | Debug, audit |
| Access logs | Traffic analysis |
| Error logs | Issue detection |
### Log Principles
- Rotate logs to prevent disk fill
- Structured logging (JSON) for parsing
- Appropriate levels (error/warn/info/debug)
- No sensitive data in logs
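Structured levels and redaction can be combined in one small helper. This is a sketch, not a replacement for a real logging library (pino, winston); the redacted field names are assumptions to extend for your own schema:

```javascript
// Emit one JSON log line, filtered by level, with sensitive fields redacted.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };
const REDACT = new Set(["password", "token", "authorization"]); // assumed field names

function logLine(minLevel, level, msg, fields = {}) {
  if (LEVELS[level] > LEVELS[minLevel]) return null; // below threshold: drop
  const safe = {};
  for (const [k, v] of Object.entries(fields)) {
    safe[k] = REDACT.has(k.toLowerCase()) ? "[REDACTED]" : v;
  }
  return JSON.stringify({ ts: new Date().toISOString(), level, msg, ...safe });
}
```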
## 4. Scaling Decisions
### When to Scale
| Symptom | Solution |
|---|---|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |
### Scaling Strategy
| Type | When to Use |
|---|---|
| Vertical | Quick fix, single instance |
| Horizontal | Sustainable, distributed |
| Auto | Variable traffic |
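A horizontal auto-scaling rule can be reduced to one formula: desired instances = ceil(current × observed utilization / target utilization), clamped to a min/max guardrail. This is the same shape Kubernetes' Horizontal Pod Autoscaler uses; the numbers below are illustrative:

```javascript
// Desired instance count from average CPU utilization, with guardrails.
// target/min/max defaults are illustrative, not recommendations.
function desiredInstances(current, avgCpuPct, { target = 60, min = 2, max = 10 } = {}) {
  const raw = Math.ceil(current * (avgCpuPct / target));
  return Math.min(max, Math.max(min, raw));
}
```

The min guardrail prevents scaling to zero on idle traffic; the max guardrail caps runaway cost during a spike while you investigate.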
## 5. Health Check Principles
### What Constitutes Healthy
| Check | Meaning |
|---|---|
| HTTP 200 | Service responding |
| Database connected | Data accessible |
| Dependencies OK | External services reachable |
| Resources OK | CPU/memory not exhausted |
### Health Check Implementation
- Simple: Just return 200
- Deep: Check all dependencies
- Choose based on load balancer needs
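A deep check is essentially an aggregator over dependency probes. A sketch, assuming each probe is an async function that resolves truthy when healthy, and the usual 200/503 mapping that load balancers expect:

```javascript
// Run dependency probes and reduce to an HTTP status code.
// Probe names ("db", "cache", ...) are placeholders.
async function healthStatus(probes) {
  const results = {};
  for (const [name, probe] of Object.entries(probes)) {
    try {
      results[name] = Boolean(await probe());
    } catch {
      results[name] = false; // a throwing probe counts as unhealthy
    }
  }
  const healthy = Object.values(results).every(Boolean);
  return { code: healthy ? 200 : 503, results };
}
```

Wire this to a `/health` route for a deep check, or skip the probes entirely and return 200 for the simple variant.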
## 6. Security Principles
| Area | Principle |
|---|---|
| Access | SSH keys only, no passwords |
| Firewall | Only needed ports open |
| Updates | Regular security patches |
| Secrets | Environment vars, not files |
| Audit | Log access and changes |
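The secrets principle implies failing fast when a required variable is absent, rather than limping along with `undefined`. A minimal sketch (the variable name in the comment is a placeholder):

```javascript
// Read a required secret from the environment; crash loudly if missing.
// e.g. requireEnv("DATABASE_URL") at startup, before serving traffic.
function requireEnv(name) {
  const value = process.env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```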
## 7. Troubleshooting Priority

When something's wrong, check in order:
1. Check if running (process status)
2. Check logs (error messages)
3. Check resources (disk, memory, CPU)
4. Check network (ports, DNS)
5. Check dependencies (database, APIs)
## 8. Anti-Patterns
| ❌ Don't | ✅ Do |
|---|---|
| Run as root | Use non-root user |
| Ignore logs | Set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |
Remember: A well-managed server is boring. That's the goal.
## Source

[SKILL.md on GitHub](https://github.com/vudovn/antigravity-kit/blob/main/.agent/skills/server-management/SKILL.md)

## Overview
This skill teaches production-ready server-management principles, focusing on decision-making for process control, monitoring, and scaling. It emphasizes thinking over memorizing commands and provides practical guidelines for tool selection, health checks, logging, and security.

## How This Skill Works

The approach guides tool selection by workload: PM2 for Node.js, systemd for Linux-native services, Docker/Podman for containers, and Kubernetes or Docker Swarm for orchestration. It defines clear goals such as auto-recovery, zero-downtime reload, clustering, and persistence, then ties them to a structured monitoring, logging, health-check, and scaling framework.
## When to Use It

- Deploying a Node.js app that benefits from clustering and reload capabilities via PM2
- Managing any app with Linux-native service requirements using systemd
- Running containerized workloads with Docker or Podman that need log rotation
- Orchestrating multi-service deployments with Kubernetes or Docker Swarm
- Implementing auto-scaling and health-based scaling to handle variable traffic
## Quick Start

- Step 1: Assess the workload and select the appropriate tool set (PM2 for Node.js, systemd for Linux, Docker/Podman for containers, Kubernetes for orchestration)
- Step 2: Define health checks, a logging strategy with rotation, and an alerting baseline
- Step 3: Implement auto-recovery and scaling rules, then validate with light load tests
## Best Practices

- Align tool choice with workload and goals (PM2 for Node.js, systemd for Linux, Docker/Podman for containers, Kubernetes for orchestration)
- Design for auto-recovery, zero-downtime reloads, and efficient clustering
- Implement structured logging with rotation and avoid exposing sensitive data
- Define layered health checks and a coherent monitoring and alerting strategy
- Automate scaling decisions with guardrails and observable metrics
## Example Use Cases
- Node.js app using PM2 clustering and auto restart
- Linux service managed by systemd with health checks and resource limits
- Containerized service using Docker/Podman with log rotation and metrics
- Kubernetes deployment with readiness and liveness probes and horizontal pod autoscaler
- Basic monitoring setup using PM2 metrics, uptime checks, and alerting