What formats are supported?

CSV, JSON, SQL INSERT statements, or a Python script generator.

Can I enforce business rules in the generated data?

Yes. You can encode constraints such as rating distributions, category rules (for example, Bug only with ratings 1-3), and realistic email domains.

dummy-dataset

npx machina-cli add skill phuryn/pm-skills/dummy-dataset --openclaw

Dummy Dataset Generation

Generate realistic dummy datasets for testing with customizable columns, constraints, and output formats (CSV, JSON, SQL, Python script). Creates executable scripts or direct data files for immediate use.

Use when: Creating test data, generating sample datasets, building realistic mock data for development, or populating test environments.

Arguments:

$PRODUCT: The product or system name
$DATASET_TYPE: Type of data (e.g., customer feedback, transactions, user profiles)
$ROWS: Number of rows to generate (default: 100)
$COLUMNS: Specific columns or fields to include
$FORMAT: Output format (CSV, JSON, SQL, Python script)
$CONSTRAINTS: Additional constraints or business rules
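
The arguments above can be pictured as a small specification object. The following is only an illustrative sketch: the `DatasetSpec` class, its field names, and the `ShopApp` product name are hypothetical and not part of the skill itself. The one detail taken from the argument list is the default of 100 rows.

```python
from dataclasses import dataclass, field
from typing import List

# The four output formats named by the skill, lowercased for comparison.
SUPPORTED_FORMATS = {"csv", "json", "sql", "python script"}


@dataclass
class DatasetSpec:
    """Hypothetical container mirroring the skill's arguments."""
    product: str                      # $PRODUCT
    dataset_type: str                 # $DATASET_TYPE
    rows: int = 100                   # $ROWS (default: 100)
    columns: List[str] = field(default_factory=list)      # $COLUMNS
    output_format: str = "csv"        # $FORMAT
    constraints: List[str] = field(default_factory=list)  # $CONSTRAINTS

    def __post_init__(self):
        # Normalize the format name and reject anything unsupported.
        self.output_format = self.output_format.strip().lower()
        if self.output_format not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {self.output_format!r}")
        if self.rows < 1:
            raise ValueError("rows must be at least 1")


spec = DatasetSpec(
    product="ShopApp",  # hypothetical product name
    dataset_type="customer feedback",
    columns=["feedback_id", "rating", "category"],
    constraints=["Bug category only with ratings 1-3"],
)
```

Validating the format and row count up front keeps a bad request from failing halfway through generation.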

Step-by-Step Process

Identify dataset type - Understand the data domain
Define column specifications - Names, data types, and value ranges
Determine row count - How many sample records needed
Select output format - CSV, JSON, SQL INSERT, or Python script
Apply realistic patterns - Ensure data looks authentic and valid
Add business constraints - Respect business logic and relationships
Generate or script data - Create executable output
Validate output - Ensure data quality and completeness

Template: Python Script Output

import csv
import json
from datetime import datetime, timedelta
import random

# Configuration
ROWS = $ROWS
FILENAME = "$DATASET_TYPE.csv"

# Column definitions with realistic value generators
columns = {
    "id": "auto-increment",
    "name": "first_last_name",
    "email": "email",
    "created_at": "timestamp",
    # Add more columns...
}

def generate_dataset():
    """Generate realistic dummy dataset"""
    data = []
    for i in range(1, ROWS + 1):
        record = {
            "id": f"U{i:06d}",
            # Generate values based on column definitions
        }
        data.append(record)
    return data

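def save_as_csv(data, filename):
    """Save dataset as CSV"""
    # Guard against an empty dataset: data[0] would raise IndexError.
    if not data:
        raise ValueError("No records to write")
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)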

if __name__ == "__main__":
    dataset = generate_dataset()
    save_as_csv(dataset, FILENAME)
    print(f"Generated {len(dataset)} records in {FILENAME}")
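
The template leaves the per-column generators as a comment. One way they might be filled in is sketched below, under stated assumptions: the name lists, the email domains, the fixed reference date, and the `to_csv_text` helper are all hypothetical sample choices, and `ROWS = 5` stands in for the `$ROWS` placeholder. A seeded `random.Random` keeps the output reproducible between runs.

```python
import csv
import io
import random
from datetime import datetime, timedelta

ROWS = 5  # stands in for $ROWS

# Hypothetical sample values; replace with lists suited to your domain.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev"]
LAST_NAMES = ["Ng", "Lopez", "Smith", "Khan"]
DOMAINS = ["gmail.com", "yahoo.com", "company.com"]


def generate_dataset(rows=ROWS, seed=42):
    """Generate records with id, name, email, and created_at columns."""
    rng = random.Random(seed)
    reference = datetime(2024, 1, 1)  # fixed date so runs are repeatable
    data = []
    for i in range(1, rows + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        # Timestamp somewhere in the 90 days before the reference date.
        created = reference - timedelta(
            days=rng.randint(0, 89), seconds=rng.randint(0, 86399)
        )
        data.append({
            "id": f"U{i:06d}",
            "name": f"{first} {last}",
            # The row number keeps addresses unique even if names repeat.
            "email": f"{first}.{last}{i}@{rng.choice(DOMAINS)}".lower(),
            "created_at": created.isoformat(timespec="seconds"),
        })
    return data


def to_csv_text(data):
    """Render records as CSV text (in memory, for easy inspection)."""
    if not data:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(data[0].keys()))
    writer.writeheader()
    writer.writerows(data)
    return buf.getvalue()


dataset = generate_dataset()
csv_text = to_csv_text(dataset)
```

Writing to an in-memory buffer first makes it easy to check the header and a few rows before saving the text to a file.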

Example Dataset Specification

Dataset Type: Customer Feedback

Columns:

feedback_id (auto-increment, U001, U002...)
customer_name (realistic names)
email (valid email format)
feedback_date (dates within the last 90 days)
rating (1-5 stars)
category (Bug, Feature Request, Complaint, Praise)
text (realistic feedback)
product (electronics, clothing, home)

Constraints:

Ratings skewed: 40% 5-star, 30% 4-star, 20% 3-star, 10% 1-2 star
Bug category only with ratings 1-3
Feature requests only with ratings 3-5
Email domains realistic (gmail, yahoo, company.com)
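
The rating and category constraints above could be implemented roughly as follows. This is a sketch, not the skill's own logic: the even 5%/5% split of the "10% 1-2 star" share is an assumption, as is leaving Complaint and Praise unrestricted (the specification only restricts Bug and Feature Request). The function names are hypothetical.

```python
import random

# Target rating distribution. The source gives 10% for 1-2 stars combined;
# splitting it evenly between 1 and 2 is an assumption.
RATING_WEIGHTS = {5: 0.40, 4: 0.30, 3: 0.20, 2: 0.05, 1: 0.05}

# Ratings allowed per category. Complaint and Praise are left unrestricted
# because the specification states no rule for them.
ALLOWED_RATINGS = {
    "Bug": {1, 2, 3},
    "Feature Request": {3, 4, 5},
    "Complaint": {1, 2, 3, 4, 5},
    "Praise": {1, 2, 3, 4, 5},
}


def pick_rating(rng):
    """Draw one rating according to RATING_WEIGHTS."""
    ratings = list(RATING_WEIGHTS)
    weights = list(RATING_WEIGHTS.values())
    return rng.choices(ratings, weights=weights, k=1)[0]


def pick_category(rating, rng):
    """Choose uniformly among the categories that permit this rating."""
    candidates = [c for c, ok in ALLOWED_RATINGS.items() if rating in ok]
    return rng.choice(candidates)


def generate_feedback(rows, seed=0):
    """Generate rating/category pairs that satisfy both constraints."""
    rng = random.Random(seed)
    records = []
    for _ in range(rows):
        rating = pick_rating(rng)
        records.append({"rating": rating,
                        "category": pick_category(rating, rng)})
    return records
```

Drawing the rating first and then picking a compatible category preserves the target rating distribution exactly; drawing the category first would distort it, because Bug and Feature Request each exclude some ratings.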

Output Deliverables

Ready-to-execute Python script OR direct data file
CSV file with proper headers and formatting
JSON file with valid structure and types
SQL INSERT statements for database population
Data validation and constraint compliance
Realistic, business-appropriate values
Documentation of data generation logic
Quick-start instructions for using the dataset

Output Formats

CSV: Flat tabular format, easy to import into spreadsheets and databases

JSON: Nested structure, ideal for APIs and NoSQL databases

SQL: INSERT statements, directly executable on relational databases

Python Script: Executable generator for custom or large datasets
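
For the JSON and SQL formats, a minimal serialization sketch might look like this. The helper names and the `customer_feedback` table name are hypothetical. String quoting doubles single quotes as in standard SQL, which is adequate for generated test data but is not a substitute for parameterized queries on untrusted input; boolean literals also vary between database engines.

```python
import json


def to_json(records):
    """Serialize records as a JSON array."""
    return json.dumps(records, indent=2)


def sql_literal(value):
    """Render a Python value as a SQL literal (test data only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is an int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape embedded single quotes by doubling them.
    return "'" + str(value).replace("'", "''") + "'"


def to_sql_inserts(records, table):
    """Build one INSERT statement per record."""
    statements = []
    for rec in records:
        cols = ", ".join(rec.keys())
        vals = ", ".join(sql_literal(v) for v in rec.values())
        statements.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    return "\n".join(statements)


records = [{"feedback_id": 1, "customer_name": "Dana O'Neil", "rating": 5}]
json_text = to_json(records)
sql_text = to_sql_inserts(records, "customer_feedback")
```

For the sample record, `sql_text` is `INSERT INTO customer_feedback (feedback_id, customer_name, rating) VALUES (1, 'Dana O''Neil', 5);`.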

Source

https://github.com/phuryn/pm-skills/blob/main/pm-execution/skills/dummy-dataset/SKILL.md

Overview

Generate realistic dummy datasets for testing with customizable columns, constraints, and output formats: CSV, JSON, SQL, and Python script. Creates executable scripts or ready-to-use data files for development, demos, and test environments.

How This Skill Works

Specify inputs like PRODUCT, DATASET_TYPE, ROWS, COLUMNS, FORMAT, and CONSTRAINTS. The process follows the defined steps: identify dataset type, define column specifications, determine row count, select output format, apply realistic patterns, add business constraints, generate or script data, and validate output.

When to Use It

When you need realistic test data for a new product or module
When building mock datasets to test ETL pipelines and analytics
When creating sample data for development demos and QA environments
When seeding environments with CSV, JSON, SQL INSERT statements, or Python scripts
When validating business rules and data relationships (e.g., rating vs. category constraints)

Quick Start

Step 1: Identify dataset type (e.g., Customer Feedback) and target output format
Step 2: Define column specifications including data types, value ranges, and constraints
Step 3: Determine ROWS and generate or export as CSV, JSON, SQL INSERTs, or Python script

Best Practices

Define realistic distributions and relationships (e.g., 40% 5-star ratings, 30% 4-star, 20% 3-star, 10% 1-2 star)
Clearly specify column data types and value ranges to guide generation
Incorporate realistic domains for fields like emails (gmail, yahoo, company.com)
Validate generated data against the specified schema and constraints
Document the generation logic for reproducibility and future adjustments
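
The validation practice above could be sketched as a small checker run on the generated records. Everything here is illustrative: the function names are hypothetical, the email pattern is a deliberately loose format check rather than a full address validator, and the rules shown are the ones from the Customer Feedback example.

```python
import re
from collections import Counter

# Loose format check: something@something.something, no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Rating rules from the Customer Feedback example.
CATEGORY_RULES = {"Bug": {1, 2, 3}, "Feature Request": {3, 4, 5}}


def validate(records):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for i, rec in enumerate(records):
        if not EMAIL_RE.match(str(rec.get("email", ""))):
            errors.append(f"row {i}: invalid email")
        rating = rec.get("rating")
        if rating not in {1, 2, 3, 4, 5}:
            errors.append(f"row {i}: rating out of range")
        allowed = CATEGORY_RULES.get(rec.get("category"))
        if allowed is not None and rating not in allowed:
            errors.append(f"row {i}: rating not allowed for category")
    return errors


def rating_shares(records):
    """Observed share of each rating, for comparison with the target."""
    counts = Counter(rec["rating"] for rec in records)
    total = len(records)
    return {rating: counts[rating] / total for rating in sorted(counts)}


sample = [
    {"email": "ana.ng@gmail.com", "rating": 5, "category": "Praise"},
    {"email": "not-an-email", "rating": 5, "category": "Bug"},
]
problems = validate(sample)
```

Comparing `rating_shares` against the intended distribution catches skew that per-row checks cannot see.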

Example Use Cases

Customer Feedback dataset with fields: feedback_id, customer_name, email, feedback_date, rating, category, text, product; with constraints on rating distribution and category rules
E-commerce Transactions dataset including order_id, user_id, amount, date, status, and product_category
User Profiles dataset containing user_id, name, email, signup_date, country, and plan
Support Tickets dataset capturing ticket_id, user_id, issue_type, priority, created_at, and status
Product Reviews dataset across product lines (electronics, clothing, home) with rating and review_text

Frequently Asked Questions
