optimize
Orchestrate a performance investigation using the perf-optimizer agent. This skill handles the measurement bookends (baseline → change → verify) while the agent handles the actual analysis and implementation.
Inputs:
- $ARGUMENTS: file, module, or directory to optimize.
Step 1: Establish baseline
Before touching any code, measure current performance:
```shell
# Python script / module
python -m cProfile -s cumtime "$ARGUMENTS" 2>&1 | head -30

# Quick wall-clock timing
time python "$ARGUMENTS"

# Memory snapshot — use memray (safer and more accurate than exec-based approaches):
python -m memray run --output /tmp/memray.bin "$ARGUMENTS" && python -m memray stats /tmp/memray.bin
```
Record the baseline numbers — they are the benchmark for all improvements.
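Single runs are noisy, so it helps to repeat the wall-clock measurement and record the median as the baseline. A minimal sketch of such a harness (the helper name, run count, and fallback command are illustrative, not part of the skill):

```python
import statistics
import subprocess
import sys
import time

def measure(cmd, runs=5):
    """Run a command several times; return (median, population stdev) in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), statistics.pstdev(samples)

if __name__ == "__main__":
    # Usage: python baseline.py target_script.py [args...]
    target = sys.argv[1:] or ["-c", "pass"]  # no-op fallback for demonstration
    median, spread = measure([sys.executable, *target], runs=3)
    print(f"median={median:.3f}s  spread={spread:.3f}s")
```

Recording the spread alongside the median gives you the noise figure needed for the acceptance decision in Step 3.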
Step 2: Spawn perf-optimizer agent
Task the perf-optimizer agent with:
- Read all relevant code files in and around $ARGUMENTS
- Apply the optimization hierarchy (algorithm → data structure → I/O → memory → concurrency → vectorization → compute → caching)
- Identify the single biggest bottleneck — not a laundry list
- Implement a targeted fix for that bottleneck
- Identify 2 additional bottlenecks to address next
- End your response with a `## Confidence` block per CLAUDE.md output standards.
Step 3: Verify improvement
After each change from the perf-optimizer:
```shell
# Re-run the same baseline measurement
python -m cProfile -s cumtime "$ARGUMENTS" 2>&1 | head -30
time python "$ARGUMENTS"
```
Accept if improvement > 10% (adjust the threshold for your workload — GPU benchmarks may need 20%+ to clear noise; hot-path latency may justify 2%). Revert if the change is not measurable or falls below the noise floor.
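The accept/revert rule above can be made mechanical. A sketch of the threshold check, assuming lower wall-clock time is better (the function name and defaults are illustrative):

```python
def should_accept(baseline, after, threshold=0.10, noise_floor=0.02):
    """Accept a change only if the relative improvement beats both the
    workload threshold and the measurement noise floor.

    baseline, after: wall-clock seconds (lower is better).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    improvement = (baseline - after) / baseline
    return improvement > max(threshold, noise_floor)

# 12.5s -> 10.0s is a 20% improvement: accept at the default 10% threshold.
print(should_accept(12.5, 10.0))  # True
# 10.0s -> 9.5s is only 5%: revert under the default threshold.
print(should_accept(10.0, 9.5))   # False
```

Raising `threshold` to 0.20 for GPU benchmarks, or lowering it to 0.02 for hot-path latency, matches the guidance above.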
Step 4: Report
## Performance Optimization: [target]
### Baseline
- [metric]: [value]
### Changes Applied
1. **[bottleneck]**: [what changed] → [measured improvement]
2. **[bottleneck]**: [what changed] → [measured improvement]
### After
- [metric]: [new value] ([X]x improvement)
### Remaining Opportunities
- [next bottleneck to address]
## Confidence
**Score**: [0.N]
**Gaps**: [e.g., benchmark environment noisy, no profiler available, GPU not accessible]
Step 5: Delegate documentation follow-up (optional)
After confirming improvements, inspect the applied changes (`git diff HEAD --stat`) and identify documentation tasks where Codex can add meaningful content.
Delegate to Codex when:
- Optimized code uses non-obvious techniques (pre-allocation, vectorized ops, batched I/O) that need inline explanation — read the code first, then describe the technique and why it is faster
- A function signature changed due to optimization (e.g., added a `batch_size` or `device` parameter) and the docstring no longer matches the actual contract
Do not delegate:
- Generic "add comments" requests — only delegate when you can describe the specific technique and its rationale
- Any task where you cannot write a precise description without guessing
For each task, read the optimized code, form an accurate brief, then spawn:
```
Task(
    subagent_type="general-purpose",
    prompt="Read .claude/skills/codex/SKILL.md and follow its workflow exactly.
    Task: use the <agent> to <documentation task with accurate description of what the optimization does>.
    Target: <file>."
)
```
Example prompt: "use the doc-scribe to add an inline comment to the inner loop in src/batch_processor.py:87 explaining that the result tensor is pre-allocated before the loop to avoid repeated GPU memory allocation — the old version called torch.zeros() inside the loop"
The subagent handles pre-flight, dispatch, validation, and patch capture. If Codex is unavailable it reports gracefully.
Print a `### Codex Delegation` section after the Step 4 terminal output only if this step ran.
- The perf-optimizer agent has the full optimization knowledge base — this skill only orchestrates the measure-change-measure loop
- Never skip the baseline measurement — unmeasured optimization is guessing
- For ML-specific optimization (DataLoader, mixed precision, torch.compile), the perf-optimizer agent has dedicated sections
- Follow-up chains:
  - Bottleneck is architectural (not just a hot loop) → /refactor for structural changes with test safety net
  - Optimization changes non-trivial code paths → /review for quality validation
  - Optimized code needs documentation updates → Step 5 auto-delegates to Codex
Source
https://github.com/Borda/.home/blob/main/.claude/skills/optimize/SKILL.md
Overview
This skill orchestrates a performance investigation by establishing a baseline, then using a perf-optimizer agent to identify the single most impactful bottleneck and apply a targeted fix. It spans CPU, memory, I/O, concurrency, and ML/GPU workloads and outputs a before/after report.
How This Skill Works
The process starts by measuring a baseline with Python profiling, wall-clock timing, and optional memray memory snapshots. It then delegates bottleneck analysis to the perf-optimizer agent, following a defined optimization hierarchy, to pinpoint the single biggest bottleneck and implement a targeted change. After changes, it re-measures to verify improvements against the baseline.
When to Use It
- Profiling a Python project to reduce CPU-bound latency or memory pressure.
- When an application with concurrency or multiprocessing shows underutilization or contention.
- Optimizing I/O-heavy data pipelines or ETL steps.
- Tuning ML/GPU workloads where compute or memory usage is suboptimal.
- Establishing a repeatable baseline → bottleneck → verify workflow for performance work.
Quick Start
- Step 1: Measure the baseline with `python -m cProfile -s cumtime <ARGUMENTS>`, optional wall-clock `time`, and memray as needed.
- Step 2: Spawn the perf-optimizer agent to identify the single biggest bottleneck and apply a targeted fix in that area.
- Step 3: Re-run the same baseline measurements and verify improvement; accept if >10% (GPU may require higher) and revert if not.
Best Practices
- Define a realistic, representative workload for baseline measurements.
- Use the specified tools (cProfile, time, memray) to collect consistent metrics.
- Target the single biggest bottleneck identified by perf-optimizer, not a long list.
- Run measurements under controlled conditions and account for noise; document thresholds.
- Capture a clear before/after report and keep changes reversible if needed.
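One way to account for noise and document a threshold, as the practices above suggest, is to estimate the run-to-run variation of the baseline itself; an "improvement" smaller than a few multiples of this figure is likely noise. A sketch (the sample timings are illustrative):

```python
import statistics

def noise_floor(samples):
    """Estimate relative run-to-run noise from repeated baseline timings.

    Returns the coefficient of variation (stdev / mean); an improvement
    smaller than roughly 2-3x this value is indistinguishable from noise.
    """
    mean = statistics.mean(samples)
    return statistics.stdev(samples) / mean

# Five baseline wall-clock timings in seconds (illustrative numbers):
timings = [10.2, 10.5, 9.9, 10.3, 10.1]
print(f"noise floor ~ {noise_floor(timings):.1%}")  # → noise floor ~ 2.2%
```

With a ~2% noise floor, the skill's default 10% acceptance threshold leaves a comfortable margin; a noisier environment would justify raising it.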
Example Use Cases
- Data ingestion script: profile CPU and I/O, identify and optimize the dominant bottleneck, achieving measurable throughput gains.
- Web API latency: profile request path, fix the top bottleneck, and verify latency reduction under load.
- ETL batch job: reduce memory churn with targeted fixes and validate improved memory usage.
- ML inference service: optimize the compute path on GPU/CPU and improve inference throughput.
- CSV transformation pipeline: address a concurrency bottleneck to speed up processing.