What is HighTriad used for?

Design and review systems for high concurrency, high performance, and high availability.

What should you define first?

Workload shape, failure tolerance, and explicit SLO targets.

What tests are recommended?

Load, stress, soak, and chaos tests aligned to SLOs; plan rollback and mitigation.

npx machina-cli add skill JWCodeWrote/Agent_Skills_Plugin/HighTriad --openclaw

HighTriad

Build professional, production-grade system designs that balance high concurrency, high performance, and high availability.

Core Workflow

1. Clarify requirements: Collect workload shape, critical paths, and failure tolerance. Ask for absolute targets: RPS/QPS, p95/p99 latency, peak traffic, growth rate, error budget, RTO/RPO, data consistency needs.
2. Define SLIs/SLOs: Choose 3 to 5 primary SLIs and map them to explicit SLOs. Prefer latency percentiles, availability, throughput, and freshness over averages.
3. Model the system: Sketch request flow and identify bottlenecks across compute, network, storage, and dependency call chains. Enumerate concurrency boundaries: queues, pools, locks, partitions, and external rate limits.
4. Design for scale: Select scaling axis: horizontal, vertical, data partitioning, or event-driven async. Define partitioning keys, load balancing strategy, and caching boundaries.
5. Design for performance: Minimize critical path length, reduce tail latency, and cut remote calls. Choose data access patterns, indexing, caching tiers, and compression tradeoffs.
6. Design for availability: Add redundancy, fault isolation, and graceful degradation. Define failover paths, health checks, circuit breakers, and data durability strategy.
7. Validate with tests: Create load, stress, soak, and chaos test plans aligned to SLOs. Plan rollback and mitigation steps for regression risk.
8. Operationalize: Define observability, alerting, runbooks, and capacity review cadence. Prepare incident response playbooks and on-call readiness.
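
As a rough illustration of how the availability SLO and error budget from the first two steps turn into numbers, here is a minimal Python sketch. The function names and the 30-day default window are assumptions made for this example; they are not part of the skill itself.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    A negative value means the budget is overspent.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For instance, a 99.9% availability SLO over 30 days leaves roughly 43.2 minutes of allowed downtime, and 250 failures out of 1,000,000 requests would consume about a quarter of the request-based budget.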

Reference Map

Read references/tech-kubernetes.md when the system runs on Kubernetes or needs autoscaling, multi-zone placement, or service mesh guidance.
Read references/tech-redis.md when using Redis for caching, rate limiting, queues, or session storage.
Read references/tech-postgresql.md when PostgreSQL is the primary datastore or when designing replicas, partitioning, and indexing.
Read references/tech-kafka.md when using Kafka for event streaming, async pipelines, or decoupling services.
Read references/tech-nginx.md when edge routing, TLS termination, or L7 load balancing is required.
Read references/industry-finance.md for trading, payments, or regulated workloads.
Read references/industry-ecommerce.md for flash sales, promotions, and cart/checkout workloads.
Read references/industry-iot.md for device fleets, bursty telemetry, or edge connectivity constraints.
Read references/industry-realtime.md for chat, gaming, or real-time collaboration systems.
Read references/templates.md when the user needs architecture, SLO, or capacity plan templates.
Read references/testing-drills.md when load testing, chaos testing, or DR drills are requested.

Concurrency Design Checklist

Define the concurrency target in terms of peak RPS and concurrent users.
Bound resource usage with worker pools, queues, and backpressure.
Partition workload by tenant, shard key, or request type.
Use async I/O for network and storage operations.
Limit shared-state contention with sharding or lock-free structures.
Apply rate limiting at edge and internal dependencies.
Protect downstream services with bulkheads and timeouts.
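
One possible way to combine several items from this checklist (a fixed worker pool, a bounded queue that applies backpressure, and per-call timeouts) is sketched below with Python's asyncio. The names `run_pool` and `_worker` and the default sizes are hypothetical choices for illustration, not something the skill prescribes.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List, Tuple


async def _worker(queue: asyncio.Queue, handler: Callable[[Any], Awaitable[Any]],
                  timeout_s: float, results: List[Any],
                  failed: List[Tuple[Any, Exception]]) -> None:
    while True:
        item = await queue.get()
        try:
            # The timeout keeps one slow dependency call from pinning a worker.
            results.append(await asyncio.wait_for(handler(item), timeout_s))
        except Exception as exc:  # includes asyncio.TimeoutError
            failed.append((item, exc))
        finally:
            queue.task_done()


async def run_pool(items: Iterable[Any], handler: Callable[[Any], Awaitable[Any]],
                   workers: int = 4, queue_size: int = 8,
                   timeout_s: float = 1.0) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Process items with at most `workers` concurrent handler calls."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
    results: List[Any] = []
    failed: List[Tuple[Any, Exception]] = []
    tasks = [asyncio.create_task(_worker(queue, handler, timeout_s, results, failed))
             for _ in range(workers)]
    for item in items:
        # put() waits while the queue is full, which slows the producer
        # down instead of letting work pile up without limit.
        await queue.put(item)
    await queue.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results, failed
```

Results arrive in completion order rather than submission order. In a real service the producer would usually shed or reject load when the queue stays full, rather than wait indefinitely.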

Performance Design Checklist

Reduce critical path by collapsing or parallelizing remote calls.
Minimize p99 latency contributors: cold starts, GC pauses, locks, slow queries.
Add caching with explicit invalidation rules.
Use read replicas or materialized views for read-heavy workloads.
Choose data formats and compression based on CPU vs bandwidth tradeoff.
Optimize queries with indexes and selective projections.
Warm pools and caches for predictable latency.
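
To make "caching with explicit invalidation rules" concrete, the following is a minimal in-process sketch with two rules: entries expire after a fixed TTL, and writers can invalidate a key directly. `TTLCache` and its injectable `clock` parameter are assumptions for this example; a production system would more likely use a shared cache such as Redis.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Read-through cache with time-based expiry and explicit invalidation."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested deterministically
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value, calling loader(key) on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop a key immediately, for example right after the source row changes."""
        self._entries.pop(key, None)
```

This sketch is not thread-safe and does not bound memory; both would need attention before using the idea under real concurrency.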

Availability Design Checklist

Eliminate single points of failure with redundancy across zones.
Use health checks and automated failover.
Separate control plane and data plane failure domains.
Support graceful degradation for non-critical features.
Define RTO/RPO per subsystem and validate with DR drills.
Ensure idempotency for retries and at-least-once delivery.
Protect data with backups, versioning, and restore verification.
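
The idempotency item above can be sketched as follows: retries reuse the same idempotency key, so an operation whose response was lost is not applied twice. `IdempotencyStore` and `retry` are hypothetical helpers written for this example. The store is a plain in-memory dict, whereas a real system would need a durable store with an atomic insert.

```python
import time
from typing import Any, Callable, Dict


class IdempotencyStore:
    """Remembers the result of each operation by its idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run_once(self, key: str, operation: Callable[[], Any]) -> Any:
        # A repeated key returns the stored result instead of re-running
        # the side effect.
        if key in self._results:
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result


def retry(operation: Callable[[], Any], attempts: int = 3,
          base_delay_s: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call operation, retrying with exponential backoff; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
```

The same pattern applies to at-least-once consumers: using the message ID as the key makes a redelivered message a no-op.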

Validation Plan

Run load tests at expected peak and verify the p95/p99 latency targets.
Run stress tests beyond peak to validate backpressure behavior.
Run soak tests to surface memory leaks and queue buildup.
Run chaos tests on dependencies and network partitions.
Validate auto-scaling and failover timing against RTO.
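
As a sketch of how load-test output might be checked against percentile targets, the code below computes nearest-rank percentiles from raw latency samples and compares them to per-percentile limits. The function names and the nearest-rank method are assumptions for this example; load-testing tools often report percentiles with their own interpolation rules.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency_slo(samples_ms: Sequence[float],
                      targets_ms: Dict[float, float]) -> Dict[float, bool]:
    """Map each percentile to True if the observed latency meets its target."""
    return {p: percentile(samples_ms, p) <= target
            for p, target in targets_ms.items()}
```

A call such as `check_latency_slo(samples, {95: 200, 99: 500})` would then give a pass/fail result per percentile that can serve as the success criterion for a load test run.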

Deliverables

Architecture diagram with data flow and failure domains.
SLI/SLO document with error budgets and alert thresholds.
Capacity plan with scaling triggers and cost projections.
Risk register with mitigations and rollback plans.
Test plan covering load, stress, soak, and chaos.
Operational runbook with on-call actions and dashboards.
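
For the capacity plan, a back-of-the-envelope starting point is Little's law (average requests in flight = arrival rate × average latency) plus a headroom factor on instance count. The sketch below assumes a measured per-instance throughput; the function names, the 25% headroom, and the two-instance floor are illustrative defaults rather than recommendations from the skill.

```python
import math


def in_flight_requests(peak_rps: float, mean_latency_s: float) -> float:
    """Little's law: average number of concurrent requests at peak."""
    return peak_rps * mean_latency_s


def required_instances(peak_rps: float, per_instance_rps: float,
                       headroom: float = 0.25, min_instances: int = 2) -> int:
    """Instances needed to serve peak traffic with spare headroom.

    min_instances keeps some redundancy even when traffic is low.
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    needed = math.ceil(peak_rps * (1 + headroom) / per_instance_rps)
    return max(needed, min_instances)
```

With a peak of 10,000 RPS, 50 ms mean latency, and 800 RPS per instance, this estimates about 500 requests in flight and 16 instances at 25% headroom. The per-instance figure should come from load tests rather than guesses.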

Red Flags

SLOs not defined or only averages tracked.
Unbounded queues or unlimited fiber/thread spawning.
Single shared database without partitioning plan at scale.
No clear rollback or mitigation plan for deploys.
No chaos testing or failover verification.

Output Template

Provide a concise plan with headings in this order:

Targets (SLIs/SLOs, RTO/RPO)
Workload model (traffic shape, hotspots, dependencies)
Architecture (flow, scaling axis, partitions)
Performance (critical path, caching, data access)
Availability (redundancy, failover, degradation)
Validation (tests and success criteria)
Ops (observability, runbooks, incident response)

Source

View on GitHub: https://github.com/JWCodeWrote/Agent_Skills_Plugin/blob/main/HighTriad/SKILL.md

Overview

HighTriad guides building production-grade system designs that balance three pillars: concurrency, performance, and availability. It emphasizes defining SLIs/SLOs, modeling bottlenecks, and validating with tests to ensure resilience at scale. The framework covers workload clarification, architecture decisions, and incident-readiness for robust production systems.

How This Skill Works

The workflow starts by clarifying requirements and targets (RPS, latency, error budgets), then defines SLIs/SLOs and maps them to concrete targets. It models the system to identify bottlenecks across compute, network, storage, and dependencies, and designs for scale, performance, and availability with redundancy and failover strategies. Finally, it validates plans with tests and operationalizes observability, runbooks, and incident-readiness.

When to Use It

Planning architecture for high-concurrency workloads and low tail latency
Choosing scaling strategies, partitioning keys, and caching boundaries
Defining explicit SLOs/SLIs and target error budgets
Conducting load/stress/soak tests aligned to SLOs
Preparing incident-readiness, runbooks, and on-call readiness

Quick Start

Step 1: Clarify requirements and collect workload shape, failure tolerance, and explicit SLO targets.
Step 2: Define 3–5 SLIs/SLOs and model system bottlenecks across compute, network, and storage.
Step 3: Design for scale, performance, and availability; validate with tests and operationalize observability and runbooks.

Best Practices

Clarify workload shape, critical paths, and failure tolerance with absolute targets (RPS, latency, growth rate, error budget, RTO/RPO).
Define 3–5 primary SLIs and map them to explicit SLOs, prioritizing latency percentiles, availability, throughput, and freshness.
Model bottlenecks across compute, network, storage, and dependency call chains; enumerate concurrency boundaries (queues, pools, partitions).
Design for scale and performance by choosing partitioning keys, load balancing, caching boundaries, and reducing critical path length.
Validate plans with load, stress, soak, and chaos tests aligned to SLOs; include rollback and mitigation steps.

Example Use Cases

Architect a high-traffic API with horizontal scaling, partitioning, and caching to meet peak QPS and p95 latency targets.
Define SLIs/SLOs for a microservices platform and validate resilience with chaos testing and runbooks.
Plan capacity and caching strategies for an e-commerce site during flash sales with data access optimization.
Use read replicas and partitioning to improve data access performance for a social-graph service.
Develop incident-response playbooks and on-call readiness for production systems to ensure quick recovery.
