optimize
Orchestrate a performance investigation using the perf-optimizer agent. This skill handles the measurement bookends (baseline → change → verify) while the agent handles the actual analysis and implementation.
Inputs:
- $ARGUMENTS: file, module, or directory to optimize.
Step 1: Establish baseline
Before touching any code, measure current performance:
```shell
# Python script / module
python -m cProfile -s cumtime "$ARGUMENTS" 2>&1 | head -30

# Quick wall-clock timing
time python "$ARGUMENTS"

# Memory snapshot — use memray (safer and more accurate than exec-based approaches):
python -m memray run --output /tmp/memray.bin "$ARGUMENTS" && python -m memray stats /tmp/memray.bin
```
Record the baseline numbers — they are the benchmark for all improvements.
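Single runs are noisy, so it helps to repeat the wall-clock measurement and record the median as the baseline. A minimal sketch of such a harness (the helper name, run count, and fallback command are illustrative, not part of the skill):

```python
import statistics
import subprocess
import sys
import time

def measure(cmd, runs=5):
    """Run a command several times; return (median, population stdev) in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), statistics.pstdev(samples)

if __name__ == "__main__":
    # Usage: python baseline.py target_script.py [args...]
    target = sys.argv[1:] or ["-c", "pass"]  # no-op fallback for demonstration
    median, spread = measure([sys.executable, *target], runs=3)
    print(f"median={median:.3f}s  spread={spread:.3f}s")
```

Recording the spread alongside the median gives you the noise figure needed for the acceptance decision in Step 3.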
Step 2: Spawn perf-optimizer agent
Task the perf-optimizer agent with:
- Read all relevant code files in and around $ARGUMENTS
- Apply the optimization hierarchy (algorithm → data structure → I/O → memory → concurrency → vectorization → compute → caching)
- Identify the single biggest bottleneck — not a laundry list
- Implement a targeted fix for that bottleneck
- Identify 2 additional bottlenecks to address next
- End your response with a `## Confidence` block per CLAUDE.md output standards.
Step 3: Verify improvement
After each change from the perf-optimizer:
```shell
# Re-run the same baseline measurement
python -m cProfile -s cumtime "$ARGUMENTS" 2>&1 | head -30
time python "$ARGUMENTS"
```
Accept if improvement > 10% (adjust the threshold for your workload — GPU benchmarks may need 20%+ to clear noise; hot-path latency may justify 2%). Revert if the change is not measurable or falls below the noise floor.
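The accept/revert rule above can be made mechanical. A sketch of the threshold check, assuming lower wall-clock time is better (the function name and defaults are illustrative):

```python
def should_accept(baseline, after, threshold=0.10, noise_floor=0.02):
    """Accept a change only if the relative improvement beats both the
    workload threshold and the measurement noise floor.

    baseline, after: wall-clock seconds (lower is better).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    improvement = (baseline - after) / baseline
    return improvement > max(threshold, noise_floor)

# 12.5s -> 10.0s is a 20% improvement: accept at the default 10% threshold.
print(should_accept(12.5, 10.0))  # True
# 10.0s -> 9.5s is only 5%: revert under the default threshold.
print(should_accept(10.0, 9.5))   # False
```

Raising `threshold` to 0.20 for GPU benchmarks, or lowering it to 0.02 for hot-path latency, matches the guidance above.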
Step 4: Report
## Performance Optimization: [target]
### Baseline
- [metric]: [value]
### Changes Applied
1. **[bottleneck]**: [what changed] → [measured improvement]
2. **[bottleneck]**: [what changed] → [measured improvement]
### After
- [metric]: [new value] ([X]x improvement)
### Remaining Opportunities
- [next bottleneck to address]
## Confidence
**Score**: [0.N]
**Gaps**: [e.g., benchmark environment noisy, no profiler available, GPU not accessible]
Step 5: Delegate documentation follow-up (optional)
After confirming improvements, inspect the applied changes (`git diff HEAD --stat`) and identify documentation tasks where Codex can add meaningful content.
Delegate to Codex when:
- Optimized code uses non-obvious techniques (pre-allocation, vectorized ops, batched I/O) that need inline explanation — read the code first, then describe the technique and why it is faster
- A function signature changed due to optimization (e.g., added a `batch_size` or `device` parameter) and the docstring no longer matches the actual contract
Do not delegate:
- Generic "add comments" requests — only delegate when you can describe the specific technique and its rationale
- Any task where you cannot write a precise description without guessing
For each task, read the optimized code, form an accurate brief, then spawn:
```
Task(
    subagent_type="general-purpose",
    prompt="Read .claude/skills/codex/SKILL.md and follow its workflow exactly.
    Task: use the <agent> to <documentation task with accurate description of what the optimization does>.
    Target: <file>."
)
```
Example prompt: "use the doc-scribe to add an inline comment to the inner loop in src/batch_processor.py:87 explaining that the result tensor is pre-allocated before the loop to avoid repeated GPU memory allocation — the old version called torch.zeros() inside the loop"
The subagent handles pre-flight, dispatch, validation, and patch capture. If Codex is unavailable it reports gracefully.
Print a `### Codex Delegation` section after the Step 4 terminal output only if this step ran.
- The perf-optimizer agent has the full optimization knowledge base — this skill only orchestrates the measure-change-measure loop
- Never skip the baseline measurement — unmeasured optimization is guessing
- For ML-specific optimization (DataLoader, mixed precision, torch.compile), the perf-optimizer agent has dedicated sections
- Follow-up chains:
  - Bottleneck is architectural (not just a hot loop) → /refactor for structural changes with test safety net
  - Optimization changes non-trivial code paths → /review for quality validation
  - Optimized code needs documentation updates → Step 5 auto-delegates to Codex
Source
https://github.com/Borda/.home/blob/main/.claude/skills/optimize/SKILL.md
Overview
This skill orchestrates a performance investigation by establishing a baseline, then using a perf-optimizer agent to identify the single most impactful bottleneck and apply a targeted fix. It spans CPU, memory, I/O, concurrency, and ML/GPU workloads and outputs a before/after report.
How This Skill Works
The process starts by measuring a baseline with Python profiling, wall-clock timing, and optional memray memory snapshots. It then delegates bottleneck analysis to the perf-optimizer agent, following a defined optimization hierarchy, to pinpoint the single biggest bottleneck and implement a targeted change. After changes, it re-measures to verify improvements against the baseline.
When to Use It
- Profiling a Python project to reduce CPU-bound latency or memory pressure.
- When an application with concurrency or multiprocessing shows underutilization or contention.
- Optimizing I/O-heavy data pipelines or ETL steps.
- Tuning ML/GPU workloads where compute or memory usage is suboptimal.
- Establishing a repeatable baseline → bottleneck → verify workflow for performance work.
Quick Start
- Step 1: Measure the baseline with `python -m cProfile -s cumtime <ARGUMENTS>`, optional wall-clock `time`, and memray as needed.
- Step 2: Spawn the perf-optimizer agent to identify the single biggest bottleneck and apply a targeted fix in that area.
- Step 3: Re-run the same baseline measurements and verify improvement; accept if >10% (GPU may require higher) and revert if not.
Best Practices
- Define a realistic, representative workload for baseline measurements.
- Use the specified tools (cProfile, time, memray) to collect consistent metrics.
- Target the single biggest bottleneck identified by perf-optimizer, not a long list.
- Run measurements under controlled conditions and account for noise; document thresholds.
- Capture a clear before/after report and keep changes reversible if needed.
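One way to account for noise and document a threshold, as the practices above suggest, is to estimate the run-to-run variation of the baseline itself; an "improvement" smaller than a few multiples of this figure is likely noise. A sketch (the sample timings are illustrative):

```python
import statistics

def noise_floor(samples):
    """Estimate relative run-to-run noise from repeated baseline timings.

    Returns the coefficient of variation (stdev / mean); an improvement
    smaller than roughly 2-3x this value is indistinguishable from noise.
    """
    mean = statistics.mean(samples)
    return statistics.stdev(samples) / mean

# Five baseline wall-clock timings in seconds (illustrative numbers):
timings = [10.2, 10.5, 9.9, 10.3, 10.1]
print(f"noise floor ~ {noise_floor(timings):.1%}")  # → noise floor ~ 2.2%
```

With a ~2% noise floor, the skill's default 10% acceptance threshold leaves a comfortable margin; a noisier environment would justify raising it.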
Example Use Cases
- Data ingestion script: profile CPU and I/O, identify and optimize the dominant bottleneck, achieving measurable throughput gains.
- Web API latency: profile request path, fix the top bottleneck, and verify latency reduction under load.
- ETL batch job: reduce memory churn with targeted fixes and validate improved memory usage.
- ML inference service: optimize the compute path on GPU/CPU and improve inference throughput.
- CSV transformation pipeline: address a concurrency bottleneck to speed up processing.