building-with-llms

npx machina-cli add skill liqiongyu/lenny_skills_plus/building-with-llms --openclaw

Building with LLMs

Scope

Covers

  • Building and shipping LLM-powered features/apps (assistant, copilot, light agent workflows)
  • Prompt + tool contract design (instructions, schemas, examples, guardrails)
  • Data quality + evaluation (test sets, rubrics, red teaming, iteration loop)
  • Production readiness (latency/cost budgets, logging, fallbacks, safety/security checks)
  • Using coding agents (Codex/Claude Code) to accelerate engineering safely

When to use

  • “Turn this LLM feature idea into a build plan with prompts, evals, and launch checks.”
  • “We need a system prompt + tool definitions + output schema for our LLM workflow.”
  • “Our LLM is flaky—design an eval plan and iteration loop to stabilize quality.”
  • “Design a RAG/tool-using agent approach with safety and monitoring.”
  • “We want to use an AI coding agent to implement this—set constraints and review gates.”

When NOT to use

  • You need product/portfolio strategy and positioning (use ai-product-strategy).
  • You need a full PRD/spec set for cross-functional alignment (use writing-prds / writing-specs-designs).
  • You need primary user research (use conducting-user-interviews / usability-testing).
  • You are doing model training/research, infra architecture, or bespoke model tuning (delegate to ML/eng; this skill assumes API models).
  • You only want “which model/provider should we pick?” (treat as an input; if it dominates, do a separate evaluation doc).

Inputs

Minimum required

  • Use case + target user + what “good” looks like (success metrics + failure modes)
  • The LLM’s job: generate text, transform data, classify, extract, plan, or take actions via tools
  • Constraints: privacy/compliance, data sensitivity, latency, cost, reliability, supported regions
  • Integration surface: UI/workflow, downstream systems/APIs/tools, and any required output schema

Missing-info strategy

  • Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
  • If details remain missing, proceed with explicit assumptions and provide 2–3 options (prompting vs RAG vs tool use; autonomy level).
  • If asked to write code or run commands, request confirmation and use least privilege (no secrets; avoid destructive changes).

Outputs (deliverables)

Produce an LLM Build Pack (in chat; or as files if requested), in this order:

  1. Feature brief (goal, users, non-goals, constraints, success + guardrails)
  2. System design sketch (pattern + architecture, context strategy, budgets, failure handling)
  3. Prompt + tool contract (system prompt, tool schemas, output schema, examples, refusal/guardrails)
  4. Data + evaluation plan (test set, rubrics, automated checks, red-team suite, acceptance thresholds)
  5. Build + iteration plan (prototype slice, instrumentation, debugging loop, how to use coding agents safely)
  6. Launch + monitoring plan (logging, dashboards/alerts, fallback/rollback, incident playbook hooks)
  7. Risks / Open questions / Next steps (always included)

Templates: references/TEMPLATES.md

Workflow (8 steps)

1) Frame the job, boundary, and “good”

  • Inputs: Use case, target user, constraints.
  • Actions: Write a crisp job statement (“The LLM must…”) + 3–5 non-goals. Define success metrics and guardrails (quality, safety, cost, latency).
  • Outputs: Draft Feature brief.
  • Checks: A stakeholder can restate what the LLM does and does not do, and how success is measured.

2) Choose the minimum viable autonomy pattern

  • Inputs: Workflow + risk tolerance.
  • Actions: Decide assistant vs copilot vs agent-like tool use. Identify “human control points” (review/approve moments) and what the model is never allowed to do.
  • Outputs: Autonomy decisions captured in Feature brief.
  • Checks: Any action-taking behavior has explicit permissions, confirmations, and an undo/rollback story.
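A human control point can be as simple as an approval gate in front of every action-taking tool call. The sketch below is illustrative (the action names and gate policy are assumptions, not part of this skill):

```python
# Minimal sketch of a "human control point": every action-taking tool call
# must pass an approval gate before execution. Action names are illustrative.

ALLOWED_ACTIONS = {"draft_reply", "create_ticket"}      # what the model MAY do
FORBIDDEN_ACTIONS = {"delete_record", "send_payment"}   # what it is NEVER allowed to do

def gate_action(action: str, approved_by_human: bool) -> str:
    """Return 'execute', 'needs_approval', or 'refuse' for a proposed action."""
    if action in FORBIDDEN_ACTIONS or action not in ALLOWED_ACTIONS:
        return "refuse"                # unknown or forbidden: hard stop
    if not approved_by_human:
        return "needs_approval"        # the human control point
    return "execute"
```

The key design choice is that the forbidden set and the allow-list are both explicit, so "what the model is never allowed to do" is enforced in code rather than only stated in the prompt.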

3) Design the context strategy (prompting → RAG → tools)

  • Inputs: Data sources, integration points, constraints.
  • Actions: Decide how the model gets reliable context: instruction hierarchy, retrieval strategy, tool calls, structured inputs. Define the “source of truth” and how conflicts are handled.
  • Outputs: Draft System design sketch.
  • Checks: You can explain (a) what data is used, (b) where it comes from, (c) how freshness/authority is enforced.
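One way to make the "source of truth" concrete is to rank retrieved passages by authority before assembling the prompt. This is a hedged sketch; the source names, priority order, and field names are assumptions for illustration:

```python
# Illustrative context assembly: an instruction hierarchy where retrieved
# passages are injected as data, and a source-of-truth precedence resolves
# conflicts (policy docs beat KB articles beat FAQ entries).

SOURCE_PRIORITY = {"policy": 0, "kb": 1, "faq": 2}  # lower = more authoritative

def assemble_context(system_rules: str, passages: list, question: str) -> str:
    # Sort by authority first, then freshness (newest first) within a tier.
    ranked = sorted(passages,
                    key=lambda p: (SOURCE_PRIORITY[p["source"]], -p["updated"]))
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in ranked)
    return (f"{system_rules}\n\n"
            f"# Context (most authoritative first)\n{context}\n\n"
            f"# Question\n{question}")
```

With this shape, checks (a)–(c) above become inspectable: the data used, its origin tag, and the freshness/authority rule are all visible in the assembled prompt.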

4) Draft the prompt + tool contract (make the system legible)

  • Inputs: Job statement + context strategy + output schema needs.
  • Actions: Write the system prompt, tool descriptions, and output schema. Add examples and explicit DO/DO NOT rules. Include safe failure behavior (ask clarifying questions, abstain, cite sources).
  • Outputs: Prompt + tool contract.
  • Checks: A reviewer can predict behavior for 5–10 representative inputs; contract includes at least 3 hard constraints and examples.
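One legible pattern is to define the tool and its constraints as data that both the prompt and a validator share. The tool name, fields, and enum values below are examples, not part of this skill:

```python
# A tool contract as shared data: the same schema is rendered into the prompt
# and used to validate proposed tool calls. All field names are illustrative.

CREATE_TICKET_TOOL = {
    "name": "create_ticket",
    "description": "Create one ticket. Never call for items already ticketed.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "maxLength": 120},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["title", "priority"],
    },
}

def violates_contract(call: dict) -> list:
    """Return the hard-constraint violations for a proposed tool call."""
    problems = []
    props = CREATE_TICKET_TOOL["parameters"]
    for field in props["required"]:
        if field not in call:
            problems.append(f"missing required field: {field}")
    if "priority" in call and call["priority"] not in props["properties"]["priority"]["enum"]:
        problems.append("priority not in allowed enum")
    return problems
```

Because the contract is data, a reviewer can predict behavior by reading one structure, and the same structure powers automated checks in the eval step.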

5) Build the eval set + rubric (debug like software)

  • Inputs: Expected behaviors + failure modes + edge cases.
  • Actions: Create a test set covering normal cases, tricky cases, and red-team cases. Define a scoring rubric and acceptance thresholds. Add automated checks where possible (schema validity, citation presence, forbidden content).
  • Outputs: Data + evaluation plan.
  • Checks: You can run the same prompts repeatedly and measure improvement/regression; evals cover the top failure modes.
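"Debug like software" can be sketched as a tiny harness: fixed cases, automated checks, and a pass-rate threshold, so reruns measure improvement or regression. The check functions and threshold below are assumptions for illustration:

```python
# A minimal eval harness sketch: automated checks (schema validity, citation
# presence) run over a fixed test set against an acceptance threshold.

def check_schema(output: dict) -> bool:
    return isinstance(output.get("answer"), str) and isinstance(output.get("citations"), list)

def check_citations(output: dict) -> bool:
    return len(output.get("citations", [])) > 0

CHECKS = [check_schema, check_citations]

def run_evals(cases, model_fn, threshold=0.9):
    """model_fn maps a prompt string to an output dict; returns (pass_rate, passed)."""
    scores = []
    for case in cases:
        out = model_fn(case["prompt"])
        scores.append(all(check(out) for check in CHECKS))
    rate = sum(scores) / len(scores)
    return rate, rate >= threshold
```

Rubric-scored and red-team cases would plug in as additional checks; the point is that "flaky" becomes a measurable pass rate rather than an impression.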

6) Prototype a thin slice, using coding agents safely

  • Inputs: System sketch + prompt contract + eval plan.
  • Actions: Implement the smallest end-to-end slice. Use coding agents for low-hanging-fruit tasks, but keep tight constraints: small diffs, tests, code review, no secret handling.
  • Outputs: Build + iteration plan (and optionally a prototype plan/checklist).
  • Checks: You can explain what the agent changed, why, and how it was validated (tests, evals, manual review).
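The "small diffs, no secret handling" constraints can be enforced mechanically before review. This is a sketch under assumptions: the limits, blocked paths, and input shape are placeholders, not a real CI integration:

```python
# Illustrative pre-review gate for coding-agent changes: reject a proposed
# patch if it is too large, touches too many files, or edits secret-bearing
# paths. Limits and path prefixes are placeholders.

MAX_LINES, MAX_FILES = 200, 10
BLOCKED_PREFIXES = ("secrets/", ".env")

def review_gate(changed: dict) -> tuple:
    """changed maps file path -> lines changed; returns (ok, reason)."""
    if any(path.startswith(BLOCKED_PREFIXES) for path in changed):
        return False, "touches secret-bearing files"
    if len(changed) > MAX_FILES:
        return False, "too many files changed"
    if sum(changed.values()) > MAX_LINES:
        return False, "diff too large for one review pass"
    return True, "ok"
```

A gate like this keeps every agent change small enough that a human can actually explain what changed and why, which is the check this step requires.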

7) Production readiness: budgets, monitoring, and failure handling

  • Inputs: Prototype learnings + constraints.
  • Actions: Define cost/latency budgets, fallbacks, rate limits, logging fields, and alert thresholds. Address prompt injection/tool misuse risks; add safeguards and review processes.
  • Outputs: Launch + monitoring plan.
  • Checks: There is a clear path to detect regressions, cap cost, and safely degrade when the model misbehaves.
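Budgets and safe degradation can be sketched as a thin wrapper around the model call that logs the fields you would ship to monitoring. The budget numbers and the `model_fn` return shape are assumptions for illustration:

```python
# Illustrative budget enforcement: log latency and cost per call and degrade
# to a fallback response when either budget is exceeded.
import time

LATENCY_BUDGET_S = 3.0    # e.g. a p95 target from the feature brief
COST_BUDGET_USD = 0.10    # per-request cost cap

def call_with_budget(model_fn, prompt, fallback="Sorry, please try again later."):
    start = time.monotonic()
    text, cost = model_fn(prompt)          # assumed: model_fn returns (text, cost_usd)
    latency = time.monotonic() - start
    log = {"latency_s": round(latency, 3), "cost_usd": cost, "over_budget": False}
    if latency > LATENCY_BUDGET_S or cost > COST_BUDGET_USD:
        log["over_budget"] = True
        return fallback, log               # safe degradation path
    return text, log
```

The log dict is the seed of the logging-fields list this step asks for; alert thresholds would key off `over_budget` rates rather than individual calls.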

8) Quality gate + finalize

  • Inputs: Full draft pack.
  • Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Tighten unclear contracts, add missing tests, and always include Risks / Open questions / Next steps.
  • Outputs: Final LLM Build Pack.
  • Checks: A team can execute the plan without a meeting; unknowns are explicit and owned.

Quality gate (required)

Before delivering, run references/CHECKLISTS.md and score the pack with references/RUBRIC.md. Do not ship without an explicit Risks / Open questions / Next steps section.

Examples

Example 1 (RAG copilot): “Use building-with-llms to plan a support-response copilot that drafts replies using our internal KB. Constraints: no PII leakage; must cite sources; p95 latency < 3s; cost < $0.10/ticket.”
Expected: LLM Build Pack with prompt/tool contract, eval set (including privacy red-team cases), and monitoring/rollback plan.

Example 2 (tool-using workflow): “Use building-with-llms to design an LLM workflow that turns meeting notes into action items and Jira tickets (human review required). Output must be valid JSON.”
Expected: output schema + tool contract + eval plan for structured extraction + guardrails against over-creation.
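The "guardrail against over-creation" in Example 2 can be sketched as a post-processing step: validate the extracted items and cap how many tickets one set of notes may create. The cap and field names are illustrative assumptions:

```python
# Sketch of an over-creation guardrail for the meeting-notes workflow:
# drop malformed items, cap ticket count, and always route to human review.

MAX_TICKETS_PER_MEETING = 10   # placeholder cap

def validate_action_items(items: list) -> dict:
    valid = [i for i in items
             if isinstance(i.get("title"), str) and i.get("title").strip()]
    return {
        "tickets": valid[:MAX_TICKETS_PER_MEETING],
        "truncated": len(valid) > MAX_TICKETS_PER_MEETING,
        "needs_human_review": True,   # review is always required in this workflow
    }
```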

Boundary example: “Fine-tune/train a new LLM from scratch.”
Response: out of scope; propose an API-model approach and highlight what ML/infra work is required if training is truly needed.

Source

git clone https://github.com/liqiongyu/lenny_skills_plus
View on GitHub: https://github.com/liqiongyu/lenny_skills_plus/blob/main/skills/building-with-llms/SKILL.md

Overview

Produces an end-to-end LLM Build Pack (prompt + tool contract, data/eval plan, architecture + safety, launch checklist) to design, build, and ship LLM-powered features and apps. It covers GPT/Claude apps, prompt engineering, RAG workflows, and tool-using agents, guiding teams from framing to launch with concrete deliverables.

How This Skill Works

Follow a structured flow from framing the job to post-launch monitoring. The output is a seven-deliverable Build Pack: feature brief, system design sketch, prompt + tool contract, data/evaluation plan, build/iteration plan, launch plan, and risks/next steps. The process also prescribes intake questions, autonomy patterns, and guardrails to ensure safe, production-ready deployments.

Quick Start

  1. Step 1: Frame the job, define constraints, success metrics, and guardrails.
  2. Step 2: Decide the autonomy pattern (assistant, copilot, or agent) and required human review points.
  3. Step 3: Produce the seven deliverables (Feature brief, System design, Prompt + tool contract, Data/eval plan, Build/iteration plan, Launch plan, Risks/Next steps) and prepare for launch.

Best Practices

  • Frame the job with a crisp feature brief, clear success metrics, and guardrails for quality, safety, cost, and latency.
  • Choose the minimum viable autonomy pattern early (assistant vs copilot vs agent) and identify human review points.
  • Define system prompts, tool contracts, output schemas, and refusal guardrails up front.
  • Build a rigorous data and evaluation plan with test sets, rubrics, red-teaming, and an iteration loop.
  • Plan for production readiness with latency/cost budgets, logging, fallbacks, and safety checks plus monitoring.

Example Use Cases

  • An LLM-powered customer support assistant using RAG and tool integration to fetch order data and escalate cases.
  • An internal coding assistant leveraging Codex/Claude Code with constraints and review gates to generate safe code.
  • A compliance-aware data extractor that uses red-teaming and guardrails to prevent leakage of sensitive information.
  • A light agent workflow that orchestrates multiple tools (scheduling, alerts, data retrieval) with human review points.
  • An AI coding agent that implements a feature end-to-end while adhering to safety constraints and review gates.
