building-with-llms

npx machina-cli add skill liqiongyu/lenny_skills_plus/building-with-llms --openclaw

Building with LLMs

Scope

Covers

  • Building and shipping LLM-powered features/apps (assistant, copilot, light agent workflows)
  • Prompt + tool contract design (instructions, schemas, examples, guardrails)
  • Data quality + evaluation (test sets, rubrics, red teaming, iteration loop)
  • Production readiness (latency/cost budgets, logging, fallbacks, safety/security checks)
  • Using coding agents (Codex/Claude Code) to accelerate engineering safely

When to use

  • “Turn this LLM feature idea into a build plan with prompts, evals, and launch checks.”
  • “We need a system prompt + tool definitions + output schema for our LLM workflow.”
  • “Our LLM is flaky—design an eval plan and iteration loop to stabilize quality.”
  • “Design a RAG/tool-using agent approach with safety and monitoring.”
  • “We want to use an AI coding agent to implement this—set constraints and review gates.”

When NOT to use

  • You need product/portfolio strategy and positioning (use ai-product-strategy).
  • You need a full PRD/spec set for cross-functional alignment (use writing-prds / writing-specs-designs).
  • You need primary user research (use conducting-user-interviews / usability-testing).
  • You are doing model training/research, infra architecture, or bespoke model tuning (delegate to ML/eng; this skill assumes API models).
  • You only want “which model/provider should we pick?” (treat as an input; if it dominates, do a separate evaluation doc).

Inputs

Minimum required

  • Use case + target user + what “good” looks like (success metrics + failure modes)
  • The LLM’s job: generate text, transform data, classify, extract, plan, or take actions via tools
  • Constraints: privacy/compliance, data sensitivity, latency, cost, reliability, supported regions
  • Integration surface: UI/workflow, downstream systems/APIs/tools, and any required output schema

Missing-info strategy

  • Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
  • If details remain missing, proceed with explicit assumptions and provide 2–3 options (prompting vs RAG vs tool use; autonomy level).
  • If asked to write code or run commands, request confirmation and use least privilege (no secrets; avoid destructive changes).

Outputs (deliverables)

Produce an LLM Build Pack (in chat; or as files if requested), in this order:

  1. Feature brief (goal, users, non-goals, constraints, success + guardrails)
  2. System design sketch (pattern + architecture, context strategy, budgets, failure handling)
  3. Prompt + tool contract (system prompt, tool schemas, output schema, examples, refusal/guardrails)
  4. Data + evaluation plan (test set, rubrics, automated checks, red-team suite, acceptance thresholds)
  5. Build + iteration plan (prototype slice, instrumentation, debugging loop, how to use coding agents safely)
  6. Launch + monitoring plan (logging, dashboards/alerts, fallback/rollback, incident playbook hooks)
  7. Risks / Open questions / Next steps (always included)

Templates: references/TEMPLATES.md

Workflow (8 steps)

1) Frame the job, boundary, and “good”

  • Inputs: Use case, target user, constraints.
  • Actions: Write a crisp job statement (“The LLM must…”) + 3–5 non-goals. Define success metrics and guardrails (quality, safety, cost, latency).
  • Outputs: Draft Feature brief.
  • Checks: A stakeholder can restate what the LLM does and does not do, and how success is measured.

2) Choose the minimum viable autonomy pattern

  • Inputs: Workflow + risk tolerance.
  • Actions: Decide assistant vs copilot vs agent-like tool use. Identify “human control points” (review/approve moments) and what the model is never allowed to do.
  • Outputs: Autonomy decisions captured in Feature brief.
  • Checks: Any action-taking behavior has explicit permissions, confirmations, and an undo/rollback story.
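A human control point can be as simple as an approval gate in front of every action-taking tool call. The sketch below is illustrative (the action names and gate policy are assumptions, not part of this skill):

```python
# Minimal sketch of a "human control point": every action-taking tool call
# must pass an approval gate before execution. Action names are illustrative.

ALLOWED_ACTIONS = {"draft_reply", "create_ticket"}      # what the model MAY do
FORBIDDEN_ACTIONS = {"delete_record", "send_payment"}   # what it is NEVER allowed to do

def gate_action(action: str, approved_by_human: bool) -> str:
    """Return 'execute', 'needs_approval', or 'refuse' for a proposed action."""
    if action in FORBIDDEN_ACTIONS or action not in ALLOWED_ACTIONS:
        return "refuse"                # unknown or forbidden: hard stop
    if not approved_by_human:
        return "needs_approval"        # the human control point
    return "execute"
```

The key design choice is that the forbidden set and the allow-list are both explicit, so "what the model is never allowed to do" is enforced in code rather than only stated in the prompt.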

3) Design the context strategy (prompting → RAG → tools)

  • Inputs: Data sources, integration points, constraints.
  • Actions: Decide how the model gets reliable context: instruction hierarchy, retrieval strategy, tool calls, structured inputs. Define the “source of truth” and how conflicts are handled.
  • Outputs: Draft System design sketch.
  • Checks: You can explain (a) what data is used, (b) where it comes from, (c) how freshness/authority is enforced.
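One way to make the "source of truth" concrete is to rank retrieved passages by authority before assembling the prompt. This is a hedged sketch; the source names, priority order, and field names are assumptions for illustration:

```python
# Illustrative context assembly: an instruction hierarchy where retrieved
# passages are injected as data, and a source-of-truth precedence resolves
# conflicts (policy docs beat KB articles beat FAQ entries).

SOURCE_PRIORITY = {"policy": 0, "kb": 1, "faq": 2}  # lower = more authoritative

def assemble_context(system_rules: str, passages: list, question: str) -> str:
    # Sort by authority first, then freshness (newest first) within a tier.
    ranked = sorted(passages,
                    key=lambda p: (SOURCE_PRIORITY[p["source"]], -p["updated"]))
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in ranked)
    return (f"{system_rules}\n\n"
            f"# Context (most authoritative first)\n{context}\n\n"
            f"# Question\n{question}")
```

With this shape, checks (a)–(c) above become inspectable: the data used, its origin tag, and the freshness/authority rule are all visible in the assembled prompt.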

4) Draft the prompt + tool contract (make the system legible)

  • Inputs: Job statement + context strategy + output schema needs.
  • Actions: Write the system prompt, tool descriptions, and output schema. Add examples and explicit DO/DO NOT rules. Include safe failure behavior (ask clarifying questions, abstain, cite sources).
  • Outputs: Prompt + tool contract.
  • Checks: A reviewer can predict behavior for 5–10 representative inputs; contract includes at least 3 hard constraints and examples.
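One legible pattern is to define the tool and its constraints as data that both the prompt and a validator share. The tool name, fields, and enum values below are examples, not part of this skill:

```python
# A tool contract as shared data: the same schema is rendered into the prompt
# and used to validate proposed tool calls. All field names are illustrative.

CREATE_TICKET_TOOL = {
    "name": "create_ticket",
    "description": "Create one ticket. Never call for items already ticketed.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "maxLength": 120},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["title", "priority"],
    },
}

def violates_contract(call: dict) -> list:
    """Return the hard-constraint violations for a proposed tool call."""
    problems = []
    props = CREATE_TICKET_TOOL["parameters"]
    for field in props["required"]:
        if field not in call:
            problems.append(f"missing required field: {field}")
    if "priority" in call and call["priority"] not in props["properties"]["priority"]["enum"]:
        problems.append("priority not in allowed enum")
    return problems
```

Because the contract is data, a reviewer can predict behavior by reading one structure, and the same structure powers automated checks in the eval step.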

5) Build the eval set + rubric (debug like software)

  • Inputs: Expected behaviors + failure modes + edge cases.
  • Actions: Create a test set covering normal cases, tricky cases, and red-team cases. Define a scoring rubric and acceptance thresholds. Add automated checks where possible (schema validity, citation presence, forbidden content).
  • Outputs: Data + evaluation plan.
  • Checks: You can run the same prompts repeatedly and measure improvement/regression; evals cover the top failure modes.
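"Debug like software" can be sketched as a tiny harness: fixed cases, automated checks, and a pass-rate threshold, so reruns measure improvement or regression. The check functions and threshold below are assumptions for illustration:

```python
# A minimal eval harness sketch: automated checks (schema validity, citation
# presence) run over a fixed test set against an acceptance threshold.

def check_schema(output: dict) -> bool:
    return isinstance(output.get("answer"), str) and isinstance(output.get("citations"), list)

def check_citations(output: dict) -> bool:
    return len(output.get("citations", [])) > 0

CHECKS = [check_schema, check_citations]

def run_evals(cases, model_fn, threshold=0.9):
    """model_fn maps a prompt string to an output dict; returns (pass_rate, passed)."""
    scores = []
    for case in cases:
        out = model_fn(case["prompt"])
        scores.append(all(check(out) for check in CHECKS))
    rate = sum(scores) / len(scores)
    return rate, rate >= threshold
```

Rubric-scored and red-team cases would plug in as additional checks; the point is that "flaky" becomes a measurable pass rate rather than an impression.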

6) Prototype a thin slice, using coding agents safely

  • Inputs: System sketch + prompt contract + eval plan.
  • Actions: Implement the smallest end-to-end slice. Use coding agents for low-hanging-fruit tasks, but keep tight constraints: small diffs, tests, code review, no secret handling.
  • Outputs: Build + iteration plan (and optionally a prototype plan/checklist).
  • Checks: You can explain what the agent changed, why, and how it was validated (tests, evals, manual review).
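The "small diffs, no secret handling" constraints can be enforced mechanically before review. This is a sketch under assumptions: the limits, blocked paths, and input shape are placeholders, not a real CI integration:

```python
# Illustrative pre-review gate for coding-agent changes: reject a proposed
# patch if it is too large, touches too many files, or edits secret-bearing
# paths. Limits and path prefixes are placeholders.

MAX_LINES, MAX_FILES = 200, 10
BLOCKED_PREFIXES = ("secrets/", ".env")

def review_gate(changed: dict) -> tuple:
    """changed maps file path -> lines changed; returns (ok, reason)."""
    if any(path.startswith(BLOCKED_PREFIXES) for path in changed):
        return False, "touches secret-bearing files"
    if len(changed) > MAX_FILES:
        return False, "too many files changed"
    if sum(changed.values()) > MAX_LINES:
        return False, "diff too large for one review pass"
    return True, "ok"
```

A gate like this keeps every agent change small enough that a human can actually explain what changed and why, which is the check this step requires.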

7) Production readiness: budgets, monitoring, and failure handling

  • Inputs: Prototype learnings + constraints.
  • Actions: Define cost/latency budgets, fallbacks, rate limits, logging fields, and alert thresholds. Address prompt injection/tool misuse risks; add safeguards and review processes.
  • Outputs: Launch + monitoring plan.
  • Checks: There is a clear path to detect regressions, cap cost, and safely degrade when the model misbehaves.
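Budgets and safe degradation can be sketched as a thin wrapper around the model call that logs the fields you would ship to monitoring. The budget numbers and the `model_fn` return shape are assumptions for illustration:

```python
# Illustrative budget enforcement: log latency and cost per call and degrade
# to a fallback response when either budget is exceeded.
import time

LATENCY_BUDGET_S = 3.0    # e.g. a p95 target from the feature brief
COST_BUDGET_USD = 0.10    # per-request cost cap

def call_with_budget(model_fn, prompt, fallback="Sorry, please try again later."):
    start = time.monotonic()
    text, cost = model_fn(prompt)          # assumed: model_fn returns (text, cost_usd)
    latency = time.monotonic() - start
    log = {"latency_s": round(latency, 3), "cost_usd": cost, "over_budget": False}
    if latency > LATENCY_BUDGET_S or cost > COST_BUDGET_USD:
        log["over_budget"] = True
        return fallback, log               # safe degradation path
    return text, log
```

The log dict is the seed of the logging-fields list this step asks for; alert thresholds would key off `over_budget` rates rather than individual calls.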

8) Quality gate + finalize

  • Inputs: Full draft pack.
  • Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Tighten unclear contracts, add missing tests, and always include Risks / Open questions / Next steps.
  • Outputs: Final LLM Build Pack.
  • Checks: A team can execute the plan without a meeting; unknowns are explicit and owned.

Quality gate (required)

Before delivering, run references/CHECKLISTS.md and score the pack with references/RUBRIC.md. Do not ship without an explicit Risks / Open questions / Next steps section.

Examples

Example 1 (RAG copilot): “Use building-with-llms to plan a support-response copilot that drafts replies using our internal KB. Constraints: no PII leakage; must cite sources; p95 latency < 3s; cost < $0.10/ticket.”
Expected: LLM Build Pack with prompt/tool contract, eval set (including privacy red-team cases), and monitoring/rollback plan.

Example 2 (tool-using workflow): “Use building-with-llms to design an LLM workflow that turns meeting notes into action items and Jira tickets (human review required). Output must be valid JSON.”
Expected: output schema + tool contract + eval plan for structured extraction + guardrails against over-creation.
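The "guardrail against over-creation" in Example 2 can be sketched as a post-processing step: validate the extracted items and cap how many tickets one set of notes may create. The cap and field names are illustrative assumptions:

```python
# Sketch of an over-creation guardrail for the meeting-notes workflow:
# drop malformed items, cap ticket count, and always route to human review.

MAX_TICKETS_PER_MEETING = 10   # placeholder cap

def validate_action_items(items: list) -> dict:
    valid = [i for i in items
             if isinstance(i.get("title"), str) and i.get("title").strip()]
    return {
        "tickets": valid[:MAX_TICKETS_PER_MEETING],
        "truncated": len(valid) > MAX_TICKETS_PER_MEETING,
        "needs_human_review": True,   # review is always required in this workflow
    }
```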

Boundary example: “Fine-tune/train a new LLM from scratch.”
Response: out of scope; propose an API-model approach and highlight what ML/infra work is required if training is truly needed.

Source

git clone https://github.com/liqiongyu/lenny_skills_plus
View on GitHub: https://github.com/liqiongyu/lenny_skills_plus/blob/main/skills/building-with-llms/SKILL.md

Overview

Produces an end-to-end LLM Build Pack (prompt + tool contract, data/eval plan, architecture + safety, launch checklist) to design, build, and ship LLM-powered features and apps. It covers GPT/Claude apps, prompt engineering, RAG workflows, and tool-using agents, guiding teams from framing to launch with concrete deliverables.

How This Skill Works

Follow a structured flow from framing the job to post-launch monitoring. The output is a seven-deliverable Build Pack: feature brief, system design sketch, prompt + tool contract, data/evaluation plan, build/iteration plan, launch plan, and risks/next steps. The process also prescribes intake questions, autonomy patterns, and guardrails to ensure safe, production-ready deployments.

Quick Start

  1. Step 1: Frame the job, define constraints, success metrics, and guardrails.
  2. Step 2: Decide the autonomy pattern (assistant, copilot, or agent) and required human review points.
  3. Step 3: Produce the seven deliverables (Feature brief, System design, Prompt + tool contract, Data/eval plan, Build/iteration plan, Launch plan, Risks/Next steps) and prepare for launch.

Best Practices

  • Frame the job with a crisp feature brief, clear success metrics, and guardrails for quality, safety, cost, and latency.
  • Choose the minimum viable autonomy pattern early (assistant vs copilot vs agent) and identify human review points.
  • Define system prompts, tool contracts, output schemas, and refusal guardrails up front.
  • Build a rigorous data and evaluation plan with test sets, rubrics, red-teaming, and an iteration loop.
  • Plan for production readiness with latency/cost budgets, logging, fallbacks, and safety checks plus monitoring.

Example Use Cases

  • An LLM-powered customer support assistant using RAG and tool integration to fetch order data and escalate cases.
  • An internal coding assistant leveraging Codex/Claude Code with constraints and review gates to generate safe code.
  • A compliance-aware data extractor that uses red-teaming and guardrails to prevent leakage of sensitive information.
  • A light agent workflow that orchestrates multiple tools (scheduling, alerts, data retrieval) with human review points.
  • An AI coding agent that implements a feature end-to-end while adhering to safety constraints and review gates.
