npx machina-cli add skill vudovn/antigravity-kit/server-management --openclaw

Server Management

Server management principles for production operations. Learn to THINK, not memorize commands.


1. Process Management Principles

Tool Selection

Scenario | Tool
Node.js app | PM2 (clustering, reload)
Any app | systemd (Linux native)
Containers | Docker/Podman
Orchestration | Kubernetes, Docker Swarm

Process Management Goals

Goal | What It Means
Restart on crash | Auto-recovery
Zero-downtime reload | No service interruption
Clustering | Use all CPU cores
Persistence | Survive server reboot
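
To make the "restart on crash" goal concrete, here is a minimal sketch of what a process manager does internally. It is an illustration only, not how PM2 or systemd are implemented; the function names `backoff_delay` and `supervise`, the restart limit, and the backoff values are all assumptions chosen for the example.

```python
import subprocess
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def supervise(cmd: list[str], max_restarts: int = 3, base: float = 0.5) -> int:
    """Run `cmd` and restart it after every non-zero exit.

    Returns the number of restarts performed. Stops when the process exits
    cleanly (code 0) or once `max_restarts` restarts have been used up.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        # Wait a little longer after each crash so a broken app
        # does not spin in a tight restart loop.
        time.sleep(backoff_delay(restarts, base=base))
        restarts += 1
```

In production, prefer the tools from the selection table above: they add the remaining goals (zero-downtime reload, clustering, surviving a reboot) that this loop does not cover.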

2. Monitoring Principles

What to Monitor

Category | Key Metrics
Availability | Uptime, health checks
Performance | Response time, throughput
Errors | Error rate, types
Resources | CPU, memory, disk

Alert Severity Strategy

Level | Response
Critical | Immediate action
Warning | Investigate soon
Info | Review daily
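
The severity levels can be expressed as a small threshold rule. The sketch below assumes a metric where higher is worse (for example, error rate in percent); the names `Severity` and `classify` and the thresholds passed in are hypothetical and should be tuned per service.

```python
from enum import Enum


class Severity(Enum):
    INFO = "info"          # review daily
    WARNING = "warning"    # investigate soon
    CRITICAL = "critical"  # immediate action


def classify(value: float, warn_at: float, crit_at: float) -> Severity:
    """Map a higher-is-worse metric to an alert severity."""
    if crit_at < warn_at:
        raise ValueError("crit_at must be >= warn_at")
    if value >= crit_at:
        return Severity.CRITICAL
    if value >= warn_at:
        return Severity.WARNING
    return Severity.INFO
```

For instance, `classify(7.5, warn_at=1.0, crit_at=5.0)` yields `Severity.CRITICAL`, which would page someone, while a value of 2.0 would only open a ticket.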

Monitoring Tool Selection

Need | Options
Simple/Free | PM2 metrics, htop
Full observability | Grafana, Datadog
Error tracking | Sentry
Uptime | UptimeRobot, Pingdom

3. Log Management Principles

Log Strategy

Log Type | Purpose
Application logs | Debug, audit
Access logs | Traffic analysis
Error logs | Issue detection

Log Principles

  1. Rotate logs to prevent disk fill
  2. Structured logging (JSON) for parsing
  3. Appropriate levels (error/warn/info/debug)
  4. No sensitive data in logs
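
The four principles above can be combined in one logger setup. This is a sketch using only the Python standard library; the `context` field, the `SENSITIVE_KEYS` set, the logger name `"app"`, and the size limits are assumptions made for the example, not part of the original skill.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical list of keys whose values must never reach the log file.
SENSITIVE_KEYS = {"password", "token", "authorization"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (easy to parse)."""

    def format(self, record: logging.LogRecord) -> str:
        # `context` is an assumed extra field, passed via `extra={"context": {...}}`.
        context = getattr(record, "context", {})
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
            "context": {
                key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
                for key, value in context.items()
            },
        }
        return json.dumps(payload, default=str)


def build_logger(path: str, max_bytes: int = 1_000_000, backups: int = 5) -> logging.Logger:
    """Logger that writes JSON lines and rotates the file by size."""
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)  # debug messages are dropped at this level
    logger.addHandler(handler)
    return logger
```

A call such as `logger.warning("login failed", extra={"context": {"user": "alice", "password": "..."}})` would then write the user name but replace the password with `[REDACTED]`.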

4. Scaling Decisions

When to Scale

Symptom | Solution
High CPU | Add instances (horizontal)
High memory | Increase RAM or fix leak
Slow response | Profile first, then scale
Traffic spikes | Auto-scaling

Scaling Strategy

Type | When to Use
Vertical | Quick fix, single instance
Horizontal | Sustainable, distributed
Auto | Variable traffic
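
For auto-scaling, one common rule is proportional: scale the instance count by the ratio of the observed metric to its target, then clamp the result. The sketch below follows the formula documented for the Kubernetes Horizontal Pod Autoscaler; the function name `desired_instances` and the default bounds are assumptions for illustration.

```python
import math


def desired_instances(current: int, observed: float, target: float,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Proportional scaling: ceil(current * observed / target), clamped.

    `observed` and `target` are the same metric (e.g. average CPU %).
    The min/max bounds act as guardrails against runaway scaling.
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    raw = math.ceil(current * observed / target)
    return max(min_instances, min(max_instances, raw))
```

With 4 instances at 90% CPU and a 60% target, this asks for 6 instances. It does not replace the "profile first" advice: scaling out hides a slow query or a memory leak rather than fixing it.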

5. Health Check Principles

What Constitutes Healthy

Check | Meaning
HTTP 200 | Service responding
Database connected | Data accessible
Dependencies OK | External services reachable
Resources OK | CPU/memory not exhausted

Health Check Implementation

  • Simple: Just return 200
  • Deep: Check all dependencies
  • Choose based on load balancer needs
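
The difference between the simple and deep variants can be sketched as two plain functions that a web framework's health route would call. The names `simple_health` and `deep_health`, the response shape, and the choice of 503 for a failed check are assumptions for this example.

```python
from typing import Callable, Dict, Tuple


def simple_health() -> Tuple[int, dict]:
    """Simple check: the process is up and able to answer."""
    return 200, {"status": "ok"}


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Deep check: run every dependency probe.

    Returns 200 only if all probes pass, otherwise 503 with per-check
    results. A probe that raises is treated as failed.
    """
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    status = "ok" if healthy else "degraded"
    return (200 if healthy else 503), {"status": status, "checks": results}
```

A deep check is more informative but also slower and can take an instance out of rotation when a shared dependency blips, so keep probes cheap and give each one a timeout.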

6. Security Principles

Area | Principle
Access | SSH keys only, no passwords
Firewall | Only needed ports open
Updates | Regular security patches
Secrets | Environment vars, not files
Audit | Log access and changes
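
As a small illustration of the secrets principle, the sketch below reads a required secret from the environment and fails fast if it is missing. The helper name `require_secret` and the variable name `DATABASE_URL` in the usage note are hypothetical.

```python
import os


def require_secret(name: str) -> str:
    """Return the environment variable `name`, or fail at startup.

    The error names the missing variable but never echoes a value,
    so nothing sensitive ends up in logs.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

Calling `require_secret("DATABASE_URL")` once at startup surfaces a misconfiguration immediately instead of on the first request that needs the database.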

7. Troubleshooting Priority

When something's wrong:

  1. Check if running (process status)
  2. Check logs (error messages)
  3. Check resources (disk, memory, CPU)
  4. Check network (ports, DNS)
  5. Check dependencies (database, APIs)
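
Steps 3 to 5 of this checklist lend themselves to small scripted probes. The following sketch uses only the Python standard library; the function names are assumptions, and steps 1 and 2 (process status and logs) are left to the process manager in use.

```python
import shutil
import socket


def disk_used_percent(path: str = "/") -> float:
    """Step 3: percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Steps 4 and 5: can a TCP connection be opened to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def resolves(hostname: str) -> bool:
    """Step 4: does DNS (or the hosts file) resolve `hostname`?"""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

Running the probes in the listed order keeps diagnosis cheap: a full disk or a closed port is found in seconds, before anyone starts reading application code.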

8. Anti-Patterns

❌ Don't | ✅ Do
Run as root | Use non-root user
Ignore logs | Set up log rotation
Skip monitoring | Monitor from day one
Manual restarts | Auto-restart config
No backups | Regular backup schedule

Remember: A well-managed server is boring. That's the goal.

Source

git clone https://github.com/vudovn/antigravity-kit/blob/main/.agent/skills/server-management/SKILL.md

Overview

This skill teaches production-ready server management principles, focusing on decision-making for process control, monitoring, and scaling. It emphasizes thinking over memorizing commands and provides practical guidelines for tool selection, health checks, logging, and security.

How This Skill Works

The approach guides tool selection by workload: PM2 for Node.js, systemd for Linux-native services, Docker/Podman for containers, and Kubernetes or Docker Swarm for orchestration. It defines clear goals (auto-recovery, zero-downtime reload, clustering, and persistence) and ties them to a structured framework for monitoring, logging, health checks, and scaling.

When to Use It

  • Deploying a Node.js app that benefits from PM2's clustering and reload capabilities
  • Managing any app as a Linux-native service using systemd
  • Running containerized workloads with Docker or Podman that need log rotation
  • Orchestrating multi-service deployments with Kubernetes or Docker Swarm
  • Implementing auto-scaling and health-based scaling to handle variable traffic

Quick Start

  1. Assess the workload and select the appropriate tool set (PM2 for Node.js, systemd for Linux, Docker/Podman for containers, Kubernetes for orchestration)
  2. Define health checks, a logging strategy with rotation, and an alerting baseline
  3. Implement auto-recovery and scaling rules, then validate them with light load tests

Best Practices

  • Align tool choice with workload and goals (PM2 for Node.js, systemd for Linux, Docker/Podman for containers, Kubernetes for orchestration)
  • Design for auto-recovery, zero-downtime reloads, and efficient clustering
  • Implement structured logging with rotation and avoid exposing sensitive data
  • Define layered health checks and a coherent monitoring and alerting strategy
  • Automate scaling decisions with guardrails and observable metrics

Example Use Cases

  • Node.js app using PM2 clustering and auto-restart
  • Linux service managed by systemd with health checks and resource limits
  • Containerized service using Docker/Podman with log rotation and metrics
  • Kubernetes deployment with readiness and liveness probes and a Horizontal Pod Autoscaler
  • Basic monitoring setup using PM2 metrics, uptime checks, and alerting
