What is the core idea of this skill?

Never optimize on guesses. Define a measurable goal, measure a baseline, profile to locate bottlenecks, and validate changes with repeatable tests.

Which tools should I use?

Use at least two tool types (e.g., wall-clock timer, CPU profiler, memory profiler, flame graph, tracing, or load testing) to form a complete view. A CPU profiler alone may miss bottlenecks like databases or I/O.

How do I know if the optimization helped?

Compare pre- and post-change measurements for the target metric, assess any side effects, and ensure only one change was made at a time to avoid confounding results.

performance

npx machina-cli add skill tslateman/duet/performance --openclaw

Performance as Measurement

Overview

Optimize what you've measured, not what you suspect. Performance work without profiling is superstition. Measure first, hypothesize second, optimize third, measure again.

The Performance Loop

1. Define the goal: What metric matters? Latency, throughput, memory, startup time?
2. Measure the baseline: Quantify current performance with reproducible benchmarks.
3. Profile: Identify where time and resources actually go.
4. Hypothesize: What change would improve the bottleneck?
5. Optimize: Make one change.
6. Measure again: Did it help? By how much? Any regressions elsewhere?

Never skip from step 1 to step 5.
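A minimal sketch of the measure, change, measure-again part of the loop, using only the Python standard library. The two `build_report_*` functions are hypothetical stand-ins for "the code before the change" and "the code after one change"; the run count is an assumption to tune for your workload. Whether the candidate actually wins is for the numbers to say, not the sketch.

```python
import statistics
import time


def measure(fn, runs=200):
    """Return p50 and p99 wall-clock latency in milliseconds over `runs` calls."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples, n=100)
    return {"p50_ms": cuts[49], "p99_ms": cuts[98]}


def build_report_slow(n=2000):
    # Hypothetical baseline: repeated string concatenation.
    out = ""
    for i in range(n):
        out += str(i) + ","
    return out


def build_report_fast(n=2000):
    # Hypothetical single change: build the string with one join.
    return "".join(f"{i}," for i in range(n))


# The change must preserve behavior before its speed is worth comparing.
assert build_report_slow() == build_report_fast()

baseline = measure(build_report_slow)
candidate = measure(build_report_fast)
print("baseline ", baseline)
print("candidate", candidate)
```

Reporting both p50 and p99 keeps the comparison honest when the goal is tail latency rather than the typical case.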

Trade-off Framework

Every optimization trades one resource for another. Make the trade explicit.

| Trade-off | Example |
| --- | --- |
| Latency vs. throughput | Batching increases throughput, raises individual latency |
| Memory vs. CPU | Caching trades memory for fewer computations |
| Simplicity vs. speed | Hand-rolled loops beat abstractions but obscure intent |
| Startup vs. runtime | Lazy loading shortens startup but moves the cost to first use |
| Bandwidth vs. latency | Compression saves bandwidth, costs CPU time |
| Consistency vs. speed | Eventual consistency is faster than strong consistency but allows stale reads |

Ask: "Which resource is scarce in this context?" Optimize for the scarce one.
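To make one of these trades concrete, here is a toy model of the latency vs. throughput row. Every number in it (arrival interval, per-batch overhead, per-item cost) is an assumption chosen for illustration, not a measurement; substitute figures from your own profile.

```python
def batch_stats(batch_size, arrival_interval_ms=1.0, overhead_ms=5.0, per_item_ms=0.2):
    """Toy model: items arrive at a fixed interval and a batch is sent once full.

    Returns (worst_case_latency_ms, capacity_items_per_second).
    """
    # The first item in a batch waits for the remaining items to arrive.
    wait_ms = (batch_size - 1) * arrival_interval_ms
    # Each batch pays one fixed overhead plus a per-item cost.
    service_ms = overhead_ms + batch_size * per_item_ms
    latency_ms = wait_ms + service_ms
    capacity = batch_size / service_ms * 1000.0
    return latency_ms, capacity


for size in (1, 10, 50):
    latency, capacity = batch_stats(size)
    print(f"batch={size:>2}  worst-case latency={latency:6.1f} ms  capacity={capacity:8.1f} items/s")
```

Under these assumed numbers, moving from a batch size of 1 to 10 raises worst-case latency from 5.2 ms to 16.0 ms while capacity grows from roughly 192 to roughly 1429 items per second. Which side of that trade is acceptable depends on which resource is scarce.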

Profiling Strategy

Where to Look

Start with the outermost measurement, narrow inward:

1. End-to-end timing: Total wall-clock time for the operation
2. Component breakdown: Which phase takes the most time?
3. Hot path analysis: Which functions dominate the profile?
4. Allocation analysis: Where is memory allocated and freed?

What Tools Reveal

| Tool type | Reveals | Misses |
| --- | --- | --- |
| Wall-clock timer | Total duration | Where time is spent |
| CPU profiler | Hot functions | I/O waits, lock contention |
| Memory profiler | Allocations, leaks | Cache effects |
| Flame graph | Call hierarchy costs | Inlined functions |
| Tracing | Request flow, latency | Aggregate behavior |
| Load testing | Throughput limits | Root cause of limits |

Use at least two tool types. A CPU profiler won't find a database bottleneck.
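One way to see why a single tool misleads is to compare wall-clock time with CPU time for the same call. The sketch below assumes a hypothetical `handle_request` whose `fetch_rows` step waits on something external (simulated here with `time.sleep`); the function names and the 50 ms wait are illustrative only.

```python
import time


def fetch_rows():
    # Stand-in for a database or network wait: the process is idle here.
    time.sleep(0.05)
    return list(range(10_000))


def transform(rows):
    # Stand-in for CPU work on the fetched data.
    return [r * r for r in rows]


def handle_request():
    return sum(transform(fetch_rows()))


def measure_wall_and_cpu(fn):
    """Run fn once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


result, wall, cpu = measure_wall_and_cpu(handle_request)
print(f"wall={wall * 1000:.1f} ms  cpu={cpu * 1000:.1f} ms  waiting~{(wall - cpu) * 1000:.1f} ms")
```

A large gap between wall and CPU time suggests the operation is waiting rather than computing, which is the signal to reach for tracing or query logs instead of a CPU profiler.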

Common Bottleneck Patterns

I/O Bound

Symptoms: Low CPU usage, high wait times, slow under load.

Causes: Synchronous I/O, N+1 queries, unbatched requests, no connection pooling.

Remedies: Batch operations, add caching, use async I/O, pool connections.
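The N+1 query pattern and its batched remedy can be sketched with an in-memory SQLite database. The schema, the sample rows, and the `CountingConnection` wrapper are all hypothetical, introduced only to count round trips.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts how many queries are issued."""

    def __init__(self, conn):
        self._conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self._conn.execute(sql, params).fetchall()


raw = sqlite3.connect(":memory:")
raw.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    """
)
raw.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "Ada"), (2, "Grace"), (3, "Edsger")])
raw.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(1, 1, "Notes"), (2, 2, "Compilers"), (3, 2, "COBOL"), (4, 3, "Discipline")],
)
db = CountingConnection(raw)


def titles_n_plus_one():
    # One query for the authors, then one more query per author.
    result = {}
    for author_id, name in db.query("SELECT id, name FROM authors ORDER BY id"):
        rows = db.query("SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,))
        result[name] = [title for (title,) in rows]
    return result


def titles_batched():
    # A single JOIN fetches everything in one round trip.
    result = {}
    rows = db.query(
        "SELECT a.name, b.title FROM authors a "
        "JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


db.count = 0
slow = titles_n_plus_one()
print("N+1 queries:", db.count)      # 4 queries for 3 authors

db.count = 0
fast = titles_batched()
print("batched queries:", db.count)  # 1 query
```

With an in-memory database the difference is only a count; against a networked database each extra query adds a round trip, which is why the pattern shows up as low CPU usage and high wait times.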

CPU Bound

Symptoms: High CPU usage, scales with input size, unaffected by I/O improvements.

Causes: Inefficient algorithms, unnecessary computation, poor data structures.

Remedies: Better algorithms first (for large inputs, O(n) beats even a well-tuned O(n^2)), then micro-optimize the hot path.
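As an illustration of the "better algorithm first" remedy, the sketch below checks a list for duplicates two ways: a nested loop that compares every pair, and a single pass with a set. The function names and input size are assumptions; the set version expects hashable items and is linear on average.

```python
import timeit


def has_duplicates_quadratic(items):
    # Compares every pair: O(n^2) comparisons in the worst case.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    # One pass with a set: O(n) on average, at the cost of O(n) extra memory.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


# No duplicates is the worst case for both versions.
data = list(range(1000))
assert has_duplicates_quadratic(data) == has_duplicates_linear(data)

t_quadratic = timeit.timeit(lambda: has_duplicates_quadratic(data), number=3)
t_linear = timeit.timeit(lambda: has_duplicates_linear(data), number=3)
print(f"quadratic: {t_quadratic:.4f} s   linear: {t_linear:.4f} s")
```

The linear version is also a memory vs. CPU trade: it spends memory on the set to avoid repeated comparisons.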

Memory Bound

Symptoms: Growing memory usage, GC pauses, OOM errors, cache thrashing.

Causes: Unbounded caches, leaked references, large intermediate allocations, fragmentation.

Remedies: Bound caches (LRU), stream instead of buffer, pool allocations, reduce object size.
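The "stream instead of buffer" remedy can be checked with `tracemalloc` from the standard library. In this sketch the two functions compute the same sum; one materializes a full list first, the other feeds a generator to `sum`. The function names and the input size are illustrative assumptions.

```python
import tracemalloc


def sum_squares_buffered(n):
    # Builds the whole list in memory before summing it.
    return sum([i * i for i in range(n)])


def sum_squares_streamed(n):
    # Yields one value at a time; no intermediate list is kept.
    return sum(i * i for i in range(n))


def peak_bytes(fn, *args):
    """Run fn(*args) and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    result = fn(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


buffered_result, buffered_peak = peak_bytes(sum_squares_buffered, 100_000)
streamed_result, streamed_peak = peak_bytes(sum_squares_streamed, 100_000)
assert buffered_result == streamed_result
print(f"buffered peak: {buffered_peak} bytes   streamed peak: {streamed_peak} bytes")
```

Measuring the peak rather than the final usage matters here, because the large intermediate allocation is already freed by the time the function returns.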

Contention Bound

Symptoms: Low individual resource usage but poor throughput under concurrency.

Causes: Lock contention, shared mutable state, thread pool exhaustion, connection limits.

Remedies: Reduce critical section scope, use lock-free structures, partition state, increase pool size.
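A sketch of the "partition state" remedy: instead of one counter behind one lock, the counter is split into shards, each with its own lock, so threads working on different keys do not queue behind each other. `ShardedCounter` is a hypothetical class written for this example. In CPython the global interpreter lock limits how much parallel speedup this particular code can show, so treat it as an illustration of the structure rather than a benchmark.

```python
import threading


class ShardedCounter:
    """Counter partitioned into independently locked shards."""

    def __init__(self, shards=8):
        self._locks = [threading.Lock() for _ in range(shards)]
        self._counts = [0] * shards

    def increment(self, key):
        # Keys map to shards by hash, so unrelated keys rarely share a lock.
        shard = hash(key) % len(self._locks)
        with self._locks[shard]:
            # The critical section is a single addition, kept as small as possible.
            self._counts[shard] += 1

    def total(self):
        # Reads each shard under its lock; not an atomic snapshot across shards.
        total = 0
        for index, lock in enumerate(self._locks):
            with lock:
                total += self._counts[index]
        return total


def worker(counter, key, n):
    for _ in range(n):
        counter.increment(key)


counter = ShardedCounter(shards=4)
threads = [threading.Thread(target=worker, args=(counter, key, 1000)) for key in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.total())  # 4000
```

The trade is simplicity for throughput: reads that need a consistent total across shards become more involved than with a single lock.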

Anti-Patterns

Premature optimization — Optimizing before measuring. The bottleneck is never where you think.

Micro-benchmarking in isolation — Benchmarking a function outside its real context misses cache effects, GC pressure, and contention.

Optimizing the wrong metric — Reducing P50 latency when users complain about P99. Improving throughput when the problem is startup time.

Death by a thousand cuts — No single bottleneck, just accumulated inefficiency. Profile holistically, not function-by-function.

Caching without invalidation strategy — Cache speeds reads but stale data causes correctness bugs. Define TTL and invalidation before adding a cache.
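For the last anti-pattern, a minimal sketch of a cache whose TTL and invalidation path are defined up front. `TTLCache`, the `prices` dictionary, and the 30-second TTL are hypothetical; the injectable `clock` parameter is an assumption made so that expiry can be demonstrated without sleeping. The sketch is unbounded in size, so a real cache would also need a bound such as LRU eviction.

```python
import time


class TTLCache:
    """Read-through cache with a time-to-live and explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        value = loader(key)  # missing or expired: reload from the source
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call this on every write to the underlying data.
        self._entries.pop(key, None)


# Hypothetical backing store and a fake clock for the demonstration.
prices = {"sku-1": 100}
fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])

first = cache.get("sku-1", prices.__getitem__)            # loads 100
prices["sku-1"] = 120
stale = cache.get("sku-1", prices.__getitem__)            # still 100: stale within the TTL
cache.invalidate("sku-1")
after_invalidate = cache.get("sku-1", prices.__getitem__) # reloads 120

prices["sku-1"] = 150
fake_now[0] += 31                                         # move past the TTL
after_expiry = cache.get("sku-1", prices.__getitem__)     # reloads 150
print(first, stale, after_invalidate, after_expiry)
```

The second read shows the correctness risk directly: without the `invalidate` call, readers see the old price until the TTL runs out.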

Output Format

When analyzing performance:

## Performance Analysis

### Goal

[Specific metric and target: "Reduce API P99 latency from 800ms to 200ms"]

### Baseline

[Current measurements with methodology]

### Profile Summary

[Where time/memory/resources go, ranked by impact]

### Recommendations

1. [Change] — [Expected improvement] — [Trade-off]
2. [Change] — [Expected improvement] — [Trade-off]

### Not Optimizing

[What was considered but rejected, and why]

Knuth's Reminder

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered."

Optimize the critical 3%, not the other 97%.

Overview

Performance as Measurement advocates measuring first, then hypothesizing and optimizing. It emphasizes a repeatable performance loop to avoid superstition and focus on measurable bottlenecks. The framework helps teams trade resources intentionally and verify gains with repeatable tests.

How This Skill Works

Follow the Performance Loop: define the goal, measure a baseline, profile to locate bottlenecks, hypothesize a change, implement one change, and re-measure. Use multiple profiling tools (wall-clock timing, CPU/memory profilers, flame graphs, tracing) to get a complete view. Always consider trade-offs and validate improvements with repeatable measurements.

When to Use It

make this faster
optimize
profile
reduce latency
investigate performance regressions

Quick Start

Step 1: Define the goal and establish a reproducible baseline.
Step 2: Profile to locate hot paths and allocations using at least two tool types.
Step 3: Implement one measured change and re-measure to verify impact.

Best Practices

Define the goal first (which metric matters: latency, throughput, memory, startup).
Measure baseline with reproducible benchmarks before changing anything.
Profile using at least two tool types to locate hot paths and allocations.
Optimize one change at a time and weigh explicit trade-offs (latency vs throughput, memory vs CPU).
Re-measure after each change to confirm gains and catch regressions elsewhere.

Example Use Cases

Reduce API latency by profiling end-to-end and adding targeted caching.
Improve startup time by lazy-loading heavy initialization and deferring non-critical work.
Replace an O(n^2) processing loop with an O(n) approach after hotspot analysis.
Batch I/O operations and adopt async I/O to boost throughput.
Investigate a regression via a performance audit and fix the bottleneck with a focused change.
