HighTriad
npx machina-cli add skill JWCodeWrote/Agent_Skills_Plugin/HighTriad --openclaw
Build professional, production-grade system designs that balance high concurrency, high performance, and high availability.
Core Workflow
- Clarify requirements: Collect workload shape, critical paths, and failure tolerance. Ask for absolute targets: RPS/QPS, p95/p99 latency, peak traffic, growth rate, error budget, RTO/RPO, data consistency needs.
- Define SLIs/SLOs: Choose 3 to 5 primary SLIs and map them to explicit SLOs. Prefer latency percentiles, availability, throughput, and freshness over averages.
- Model the system: Sketch request flow and identify bottlenecks across compute, network, storage, and dependency call chains. Enumerate concurrency boundaries: queues, pools, locks, partitions, and external rate limits.
- Design for scale: Select scaling axis: horizontal, vertical, data partitioning, or event-driven async. Define partitioning keys, load balancing strategy, and caching boundaries.
- Design for performance: Minimize critical path length, reduce tail latency, and cut remote calls. Choose data access patterns, indexing, caching tiers, and compression tradeoffs.
- Design for availability: Add redundancy, fault isolation, and graceful degradation. Define failover paths, health checks, circuit breakers, and data durability strategy.
- Validate with tests: Create load, stress, soak, and chaos test plans aligned to SLOs. Plan rollback and mitigation steps for regression risk.
- Operationalize: Define observability, alerting, runbooks, and capacity review cadence. Prepare incident response playbooks and on-call readiness.
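The error budget named in the workflow follows directly from the SLO. The sketch below is an illustration only: the function names, the 30-day window, and the request-based budget definition are assumptions, not part of the skill.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability an availability SLO allows over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43 minutes of downtime budget.
    print(round(error_budget_minutes(0.999, 30), 1))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A budget that is mostly spent early in the window is a reasonable signal to slow down risky deploys.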
Reference Map
- Read references/tech-kubernetes.md when the system runs on Kubernetes or needs autoscaling, multi-zone placement, or service mesh guidance.
- Read references/tech-redis.md when using Redis for caching, rate limiting, queues, or session storage.
- Read references/tech-postgresql.md when PostgreSQL is the primary datastore or when designing replicas, partitioning, and indexing.
- Read references/tech-kafka.md when using Kafka for event streaming, async pipelines, or decoupling services.
- Read references/tech-nginx.md when edge routing, TLS termination, or L7 load balancing is required.
- Read references/industry-finance.md for trading, payments, or regulated workloads.
- Read references/industry-ecommerce.md for flash sales, promotions, and cart/checkout workloads.
- Read references/industry-iot.md for device fleets, bursty telemetry, or edge connectivity constraints.
- Read references/industry-realtime.md for chat, gaming, or real-time collaboration systems.
- Read references/templates.md when the user needs architecture, SLO, or capacity plan templates.
- Read references/testing-drills.md when load testing, chaos testing, or DR drills are requested.
Concurrency Design Checklist
- Define concurrency target by peak RPS and concurrent users.
- Bound resource usage with worker pools, queues, and backpressure.
- Partition workload by tenant, shard key, or request type.
- Use async I/O for network and storage operations.
- Limit shared-state contention with sharding or lock-free structures.
- Apply rate limiting at edge and internal dependencies.
- Protect downstream services with bulkheads and timeouts.
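Several checklist items above (bounded worker pools, bounded queues, backpressure, timeouts) can be combined in one small pattern. This is a minimal sketch using Python's asyncio; `run_bounded` and its parameters are hypothetical names chosen for illustration.

```python
import asyncio


async def run_bounded(jobs, handler, workers=4, queue_size=8, timeout=1.0):
    """Process jobs with a fixed worker pool, a bounded queue, and a per-job timeout."""
    queue = asyncio.Queue(maxsize=queue_size)
    results = []

    async def worker():
        while True:
            job = await queue.get()
            try:
                # The timeout stops one slow call from pinning a worker forever.
                results.append(await asyncio.wait_for(handler(job), timeout))
            except Exception as exc:
                # Record the failure instead of killing the worker.
                results.append(exc)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    for job in jobs:
        # put() suspends the producer while the queue is full: this is the backpressure.
        await queue.put(job)
    await queue.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

Results arrive in completion order, not submission order. Concurrency never exceeds `workers`, and memory held by waiting jobs never exceeds `queue_size`, which is the property an unbounded queue cannot give.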
Performance Design Checklist
- Reduce critical path by collapsing or parallelizing remote calls.
- Minimize p99 latency contributors: cold starts, GC pauses, locks, slow queries.
- Add caching with explicit invalidation rules.
- Use read replicas or materialized views for read-heavy workloads.
- Choose data formats and compression based on CPU vs bandwidth tradeoff.
- Optimize queries with indexes and selective projections.
- Warm pools and caches for predictable latency.
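"Caching with explicit invalidation rules" can be made concrete with two rules: entries expire after a TTL, and writers can evict a key immediately. The class below is a single-process sketch under those assumptions; `TTLCache` and its methods are illustrative names, not an API from the skill or from any library.

```python
import time


class TTLCache:
    """Read-through cache with two invalidation rules: TTL expiry and explicit eviction."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    def get(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit
        value = loader(key)  # miss or expired: go to the source of truth
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Call after a write so readers do not see the stale value until TTL expiry."""
        self._entries.pop(key, None)
```

The sketch is not thread-safe and does not bound its size; a shared cache tier would need both.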
Availability Design Checklist
- Eliminate single points of failure with redundancy across zones.
- Use health checks and automated failover.
- Separate control plane and data plane failure domains.
- Support graceful degradation for non-critical features.
- Define RTO/RPO per subsystem and validate with DR drills.
- Ensure idempotency for retries and at-least-once delivery.
- Protect data with backups, versioning, and restore verification.
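Idempotency for retries and at-least-once delivery usually means keying each operation by a request ID and replaying the stored result on a duplicate. The following is a minimal in-memory sketch; `IdempotentHandler` is a hypothetical name, and a real system would need a durable store with an atomic check-and-set rather than a dict.

```python
class IdempotentHandler:
    """Apply each request ID at most once; duplicates get the stored result back."""

    def __init__(self, apply):
        self._apply = apply
        self._done = {}  # request_id -> result (assumption: in-memory, single process)

    def handle(self, request_id, payload):
        if request_id in self._done:
            # Duplicate delivery or client retry: do not repeat the side effect.
            return self._done[request_id]
        result = self._apply(payload)
        self._done[request_id] = result
        return result


if __name__ == "__main__":
    balance = [0]

    def credit(amount):
        balance[0] += amount
        return balance[0]

    handler = IdempotentHandler(credit)
    handler.handle("req-1", 50)
    handler.handle("req-1", 50)  # retried delivery of the same request
    print(balance[0])
```

With this in place a retry policy can resend freely, because a second delivery of `req-1` does not credit the account twice.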
Validation Plan
- Run load tests at expected peak traffic and check p95/p99 latency against targets.
- Run stress tests beyond peak to validate backpressure behavior.
- Run soak tests to surface memory leaks and queue buildup.
- Run chaos tests on dependencies and network partitions.
- Validate auto-scaling and failover timing against RTO.
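Checking a load-test run against p95/p99 targets needs a percentile function, not a mean. The sketch below uses the nearest-rank definition; the function names and the sample numbers are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_slo(samples, targets_ms):
    """Map each percentile in targets_ms to True if the measured value is within target."""
    return {pct: percentile(samples, pct) <= limit for pct, limit in targets_ms.items()}


if __name__ == "__main__":
    # Hypothetical run: 95 fast requests and 5 slow ones, in milliseconds.
    latencies = [10] * 95 + [1000] * 5
    print(sum(latencies) / len(latencies))          # the mean hides the tail
    print(check_slo(latencies, {95: 50, 99: 200}))  # p95 passes, p99 does not
```

In this made-up run the mean looks acceptable while one request in twenty takes a full second, which is why the checklist tracks percentiles.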
Deliverables
- Architecture diagram with data flow and failure domains.
- SLI/SLO document with error budgets and alert thresholds.
- Capacity plan with scaling triggers and cost projections.
- Risk register with mitigations and rollback plans.
- Test plan covering load, stress, soak, and chaos.
- Operational runbook with on-call actions and dashboards.
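For the capacity plan, a first-pass instance count can be derived from peak RPS, measured per-instance throughput, and a target utilization that leaves headroom. The formulas below are a common rule of thumb stated as assumptions, not the skill's prescribed method; Little's Law is used for the in-flight estimate.

```python
import math


def instances_needed(peak_rps, per_instance_rps, target_utilization=0.6, spare=1):
    """Instances required to serve peak_rps at the target utilization, plus spare capacity."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    base = math.ceil(peak_rps / (per_instance_rps * target_utilization))
    return base + spare  # spare covers the loss of an instance or zone (assumed N+1)


def in_flight_requests(rps, mean_latency_seconds):
    """Little's Law: average concurrent requests = arrival rate * average time in system."""
    return rps * mean_latency_seconds


if __name__ == "__main__":
    # Hypothetical inputs: 12,000 RPS peak, 500 RPS per instance, 50% target utilization.
    print(instances_needed(12_000, 500, target_utilization=0.5))
    print(in_flight_requests(12_000, 0.05))
```

The in-flight estimate is useful for sizing connection pools and worker counts, and the instance count gives a starting point for scaling triggers.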
Red Flags
- SLOs not defined or only averages tracked.
- Unbounded queues or unlimited fiber/thread spawning.
- Single shared database without partitioning plan at scale.
- No clear rollback or mitigation plan for deploys.
- No chaos testing or failover verification.
Output Template
Provide a concise plan with headings in this order:
- Targets (SLIs/SLOs, RTO/RPO)
- Workload model (traffic shape, hotspots, dependencies)
- Architecture (flow, scaling axis, partitions)
- Performance (critical path, caching, data access)
- Availability (redundancy, failover, degradation)
- Validation (tests and success criteria)
- Ops (observability, runbooks, incident response)
Source
https://github.com/JWCodeWrote/Agent_Skills_Plugin/blob/main/HighTriad/SKILL.md
Overview
HighTriad guides building production-grade system designs that balance three pillars: concurrency, performance, and availability. It emphasizes defining SLIs/SLOs, modeling bottlenecks, and validating with tests to ensure resilience at scale. The framework covers workload clarification, architecture decisions, and incident-readiness for robust production systems.
How This Skill Works
The workflow starts by clarifying requirements and targets (RPS, latency, error budgets), then defines SLIs/SLOs and maps them to concrete targets. It models the system to identify bottlenecks across compute, network, storage, and dependencies, and designs for scale, performance, and availability with redundancy and failover strategies. Finally, it validates plans with tests and operationalizes observability, runbooks, and incident-readiness.
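One of the failover mechanisms the workflow names is the circuit breaker: after repeated failures, calls to a dependency are short-circuited to a fallback for a cooling-off period. This is a minimal single-threaded sketch; the class name, thresholds, and fallback style are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures, then retry after a cool-off."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock      # injectable so timing can be tested without sleeping
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()  # open: skip the dependency entirely
            # Cool-off elapsed: half-open, let this one call through as a trial.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold or self._opened_at is not None:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The fallback is where graceful degradation lives: a cached value, a default, or a reduced feature set instead of an error.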
When to Use It
- Planning architecture for high-concurrency workloads and low tail latency
- Choosing scaling strategies, partitioning keys, and caching boundaries
- Defining explicit SLOs/SLIs and target error budgets
- Conducting load/stress/soak tests aligned to SLOs
- Preparing incident-readiness, runbooks, and on-call readiness
Quick Start
- Step 1: Clarify requirements and collect workload shape, failure tolerance, and explicit SLO targets.
- Step 2: Define 3–5 SLIs/SLOs and model system bottlenecks across compute, network, and storage.
- Step 3: Design for scale, performance, and availability; validate with tests and operationalize observability and runbooks.
Best Practices
- Clarify workload shape, critical paths, and failure tolerance with absolute targets (RPS, latency, growth rate, error budget, RTO/RPO).
- Define 3–5 primary SLIs and map them to explicit SLOs, prioritizing latency percentiles, availability, throughput, and freshness.
- Model bottlenecks across compute, network, storage, and dependency call chains; enumerate concurrency boundaries (queues, pools, partitions).
- Design for scale and performance by choosing partitioning keys, load balancing, caching boundaries, and reducing critical path length.
- Validate plans with load, stress, soak, and chaos tests aligned to SLOs; include rollback and mitigation steps.
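Choosing a partitioning key only helps if every node maps the same key to the same partition. A minimal sketch of stable hash partitioning follows; `shard_for` and the tenant key format are hypothetical. It uses SHA-256 rather than Python's built-in `hash()`, which is randomized per process for strings.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a partition key (for example a tenant ID) to a shard index, stably across processes."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # The first 8 bytes of the digest are plenty for an even spread.
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for tenant in ("tenant-1", "tenant-2", "tenant-3"):
        print(tenant, shard_for(tenant, 4))
```

Plain modulo placement moves most keys when `shard_count` changes; if resharding is expected, consistent hashing or a lookup table is the usual alternative.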
Example Use Cases
- Architect a high-traffic API with horizontal scaling, partitioning, and caching to meet peak QPS and p95 latency targets.
- Define SLIs/SLOs for a microservices platform and validate resilience with chaos testing and runbooks.
- Plan capacity and caching strategies for an e-commerce site during flash sales with data access optimization.
- Use read replicas and partitioning to improve data access performance for a social-graph service.
- Develop incident-response playbooks and on-call readiness for production systems to ensure quick recovery.