bauplan-data-pipelines

npx machina-cli add skill BauplanLabs/bauplan-mcp-server/data-pipeline --openclaw

Creating a New Bauplan Data Pipeline

This skill guides you through creating a new bauplan data pipeline project from scratch, including the project configuration and transformation models.

CRITICAL: Branch Safety

NEVER run pipelines on main branch. ALWAYS use a separate data branch for development.

Branch naming convention: <username>.<branch_name> (e.g., john.feature-pipeline). Get your username by running bauplan info.
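
For example, a safe branch setup, mirroring the Workflow Checklist below (the username john is illustrative):

bauplan info
bauplan branch checkout main
bauplan branch create john.feature-pipeline
bauplan branch checkout john.feature-pipeline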

Table References

  • Source tables must already exist in the bauplan lakehouse before building a pipeline.
  • Verify source tables exist and understand their schema before writing any code.
  • Always use fully-qualified names: <namespace>.<table_name> (e.g., bauplan.taxi_fhvhv).
  • Default namespace: bauplan.
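
For example, to confirm a source table exists and inspect its schema before writing any code (table name from the example above):

bauplan table get bauplan.taxi_fhvhv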

Required User Input

Before writing a pipeline, you MUST gather the following from the user:

  1. Pipeline purpose (required): What transformations should the DAG perform? What is the business logic or goal?
  2. Source tables (required): Which tables from the lakehouse should be used as inputs? Verify they exist with bauplan table get <namespace>.<table_name>.
  3. Output tables (required): Which tables should be materialized as final outputs?
  4. Materialization strategy (optional, default: REPLACE): Should output tables use REPLACE or APPEND?
  5. Strict mode (optional): Should the pipeline run in strict mode?

If any required item is missing, ask the user before writing any code.

Pipelines

A bauplan pipeline is a DAG of functions (models). Key concepts:

  • Models: Python functions that transform data.
  • Source Tables: Existing lakehouse tables: the entry points to your DAG.
  • Inputs: Each Model can take multiple tables via bauplan.Model() references, either from the outputs of previous Models or as Source Tables.
  • Outputs: Each Model produces exactly one table. The output table name is the function name (def clean_trips() → clean_trips).
  • Topology: Implicitly defined by input references — Bauplan determines the execution order.
  • Expectations: Data quality functions that take tables as input and return a boolean. Generated by the data-quality-checks skill — do not write expectations directly in this skill.

Example DAG

[lakehouse: taxi_fhvhv] ──→ [trips] ──→ [clean_trips] ──→ [daily_summary]
                                              ↑
[lakehouse: taxi_zones] ──────────────────────┘
  • taxi_fhvhv and taxi_zones are Source Tables (already in lakehouse)
  • trips is a Python Model reading from taxi_fhvhv (single input)
  • clean_trips is a Python Model taking trips and taxi_zones as inputs (multiple inputs)
  • daily_summary is a Python Model taking clean_trips as input (single input)

Python Models

Python models are the preferred way to write all transformations. They are Python functions registered with decorators.

Base Python Model

# import bauplan globally, but DO NOT import other packages at the top level
import bauplan

@bauplan.model(
    # declare expected output columns for validation
    columns=['pickup_datetime', 'PULocationID', 'trip_miles'],
    # persist output as an Iceberg table; omit for intermediate models
    materialization_strategy='REPLACE'
)
# specify Python version and dependencies; prefer Polars or DuckDB over Pandas
@bauplan.python('3.11', pip={'polars': '1.15.0'})
def clean_trips(
    # use columns and filter for efficient I/O pushdown
    data=bauplan.Model(
        'trips',
        columns=['pickup_datetime', 'PULocationID', 'trip_miles'],
        filter="trip_miles > 0"
    )
):
    """
    Filters trips to include only those with positive mileage.

    | pickup_datetime     | PULocationID | trip_miles |
    |---------------------|--------------|------------|
    | 2022-12-01 08:00:00 | 123          | 5.2        |
    """
    # import dependencies inside the function — each model runs in its own environment
    import polars as pl

    df = pl.from_arrow(data)
    df = df.filter(pl.col('trip_miles') > 0.0)
    return df.to_arrow()

Python Model with Multiple Inputs

Models can take multiple tables as input by adding more bauplan.Model() parameters, as in the sketch below. See examples.md for complete examples.
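
A minimal sketch of a two-input model joining the trips and taxi_zones tables from the Example DAG (the model name, zone columns, and join keys are illustrative assumptions):

import bauplan

@bauplan.model()
@bauplan.python('3.11', pip={'polars': '1.15.0'})
def trips_with_zones(
    # first input: the output of an upstream model
    trips=bauplan.Model('trips', columns=['pickup_datetime', 'PULocationID']),
    # second input: a source table from the lakehouse
    zones=bauplan.Model('bauplan.taxi_zones')
):
    """
    Joins trips with zone metadata on the pickup location.

    | pickup_datetime     | PULocationID | zone_name |
    |---------------------|--------------|-----------|
    | 2022-12-01 08:00:00 | 123          | Midtown   |
    """
    import polars as pl

    trips_df = pl.from_arrow(trips)
    zones_df = pl.from_arrow(zones)
    # 'LocationID' and 'zone_name' are assumed column names for illustration
    joined = trips_df.join(zones_df, left_on='PULocationID', right_on='LocationID')
    return joined.to_arrow()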

Best Practices

Output Columns Validation

Whenever possible, specify columns in @bauplan.model() to define the expected output schema. This enables automatic validation of your model's output. Check the schema of your source tables first, then declare output columns based on what your transformation actually produces.

Docstrings with Output Schema

Every Python model should have a docstring describing the transformation and showing the output table structure as an ASCII table. If the table is too wide, show only key columns; if values are too large, truncate them.

I/O Pushdown with columns and filter

Use columns and filter in bauplan.Model() to restrict the data read at the storage level. Do not read columns you don't need. This enables I/O pushdown, meaning Bauplan filters data at the Iceberg layer before it reaches your function. On large tables, this can reduce data transfer by orders of magnitude.

  • columns: list only the columns your model actually needs.
  • filter: SQL-like expression to restrict rows (e.g., filter="price > 0").

See examples.md for a complete guide.

Use Polars or DuckDB, Not Pandas

Use Polars or DuckDB for data processing inside Python models. Avoid Pandas. Bauplan models receive and return Apache Arrow tables. Polars and DuckDB operate natively on Arrow with zero-copy reads and multi-threaded execution. Pandas requires a full data copy into its own format — slower, single-threaded, and uses significantly more memory.

  • Polars: best for DataFrame-style transformations (filter, join, group_by, with_columns).
  • DuckDB: best when the logic is naturally expressed as SQL (complex joins, aggregations).

All four practices (output column validation, schema docstrings, I/O pushdown, and Arrow-native processing with Polars) are demonstrated in the Base Python Model above.
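
For SQL-shaped logic, here is a hedged DuckDB variant of the same filter (the model name and pinned duckdb version are assumptions):

import bauplan

@bauplan.model(materialization_strategy='REPLACE')
@bauplan.python('3.11', pip={'duckdb': '1.1.3'})
def clean_trips_sql(
    data=bauplan.Model(
        'trips',
        columns=['pickup_datetime', 'PULocationID', 'trip_miles']
    )
):
    """
    Same positive-mileage filter as clean_trips, expressed as SQL.
    """
    import duckdb

    # DuckDB scans the in-memory Arrow table bound to the local name 'data'
    return duckdb.sql("SELECT * FROM data WHERE trip_miles > 0").arrow()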

Project Structure

A bauplan project is a folder containing:

my-project/
  bauplan_project.yml    # Required: project configuration
  models.py              # Python models (one file can contain the entire pipeline)
  expectations.py        # Optional: generated by the data-quality-checks skill

bauplan_project.yml

Every project requires this configuration file:

project:
  id: <unique-uuid>       # Generate with: python3 -c "import uuid; print(uuid.uuid4())"
  name: <project_name>    # Descriptive name for the project
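
A filled-in example (the id and name values are illustrative):

project:
  id: 3f2b9a7e-8c41-4d2a-9f0e-5b6c7d8e9f01
  name: taxi_daily_summary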

Running Pipelines

Always run from inside the project directory (the folder containing bauplan_project.yml).

Dry Run

A dry run validates the pipeline without materializing any tables. It checks that the DAG is valid, source tables exist, SQL parses correctly, and declared output columns match. Always dry-run before a full run.

bauplan run --dry-run

If the dry run fails, read the error output, fix the issue, and dry-run again. Don't proceed to a full run until the dry run passes.

Full Run

bauplan run

This executes the pipeline and materializes all models that have a materialization_strategy set. After a successful run, verify the output:

bauplan table get <namespace>.<output_table>
bauplan query "SELECT * FROM <namespace>.<output_table> LIMIT 5"
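
For the Example DAG above, with the default namespace, this might look like:

bauplan table get bauplan.daily_summary
bauplan query "SELECT * FROM bauplan.daily_summary LIMIT 5"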

Strict Mode

Append --strict to catch declaration errors early. In strict mode, the run fails immediately on output column mismatches or expectation failures instead of logging a warning and continuing. Recommended during development.

bauplan run --dry-run --strict
bauplan run --strict

Materialization Checklist

After writing models, verify each model has the correct materialization_strategy:

  • Intermediate models (not final outputs): omit materialization_strategy from @bauplan.model()
  • Final output tables: @bauplan.model(materialization_strategy='REPLACE') or 'APPEND' (see the sketch below)
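
A minimal sketch of the two cases (model names and pass-through bodies are illustrative):

import bauplan

# intermediate model: no materialization_strategy, so its output is not persisted
@bauplan.model()
@bauplan.python('3.11')
def staged_trips(data=bauplan.Model('bauplan.taxi_fhvhv')):
    return data  # pass-through for illustration

# final output: persisted as an Iceberg table
@bauplan.model(materialization_strategy='REPLACE')
@bauplan.python('3.11')
def trip_report(data=bauplan.Model('staged_trips')):
    return data  # pass-through for illustration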

Workflow Checklist

  • Step 1: Get username → bauplan info
  • Step 2: Checkout main → bauplan branch checkout main
  • Step 3: Create dev branch → bauplan branch create <username>.<branch_name>
  • Step 4: Checkout dev branch → bauplan branch checkout <username>.<branch_name>
  • Step 5: Verify source tables → bauplan table get <namespace>.<table_name>
  • Step 6: Create project folder with bauplan_project.yml
  • Step 7: Write Python models respecting the best practices above
  • Step 8: Verify materialization (see Materialization Checklist)
  • Step 9: Dry run → bauplan run --dry-run
  • Step 10: Full run → bauplan run
  • Step 11: Verify output → bauplan table get + bauplan query
  • Step 12: Offer to add data quality checks → invoke the data-quality-checks skill with the project path. The skill reads models.py, derives checks from how the pipeline uses each table, and generates expectations.py.

SQL Models (Optional)

Bauplan also supports SQL models for simple reads from source tables. SQL models are .sql files where the filename becomes the output table name and the FROM clause defines the inputs.

-- trips.sql
SELECT
    pickup_datetime,
    PULocationID,
    trip_miles
FROM taxi_fhvhv
WHERE pickup_datetime >= '2022-12-01'

Output table: trips (from filename). Add -- bauplan: materialization_strategy=REPLACE as a comment to materialize.
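
For example (placing the pragma at the top of the file is an assumption):

-- bauplan: materialization_strategy=REPLACE
SELECT
    pickup_datetime,
    PULocationID,
    trip_miles
FROM taxi_fhvhv
WHERE pickup_datetime >= '2022-12-01'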

Limitations:

  • SQL models should only be used as the first node in a pipeline, reading directly from source tables.
  • They lack output column validation, docstring documentation, and the I/O pushdown controls available in Python models.
  • For all downstream transformations, use Python models.

Advanced Examples

See examples.md for:

  • APPEND materialization strategy
  • DuckDB queries in Python models
  • Data quality expectations
  • Multi-stage pipelines

Source

https://github.com/BauplanLabs/bauplan-mcp-server/blob/main/skills/data-pipeline/SKILL.md

Overview

This skill guides you through creating a new bauplan data pipeline project from scratch, including project configuration and transformation models. It emphasizes branch safety, lakehouse readiness, and collecting required inputs before coding.

How This Skill Works

A Bauplan data pipeline is a DAG of Python models where source tables feed into models and each model emits a single output table. Models are written as Python functions registered with decorators, and inputs are wired via bauplan.Model references, with the DAG order inferred from dependencies. Projects are created on feature branches (not main), and sources must exist in the lakehouse with fully-qualified names.

When to Use It

  • Starting a brand-new bauplan data pipeline project from scratch
  • Defining the DAG of transformations for new data workflows
  • Writing and wiring transformation models with Python using bauplan.Model inputs
  • Setting up the project structure and configuration before implementing models
  • Ensuring branch safety and source table validation before runtime

Quick Start

  1. Work on a feature branch, never main: get your username with bauplan info, then create and check out <username>.<branch_name>
  2. Gather the required inputs: pipeline purpose, source tables, output tables, materialization strategy (default REPLACE), and strict mode if needed
  3. Implement at least one Python model with the @bauplan.model decorator, wire its inputs and outputs via bauplan.Model, and run it on the feature branch

Best Practices

  • Never run pipelines on the main branch; use a separate data/development branch
  • Name branches as <username>.<branch_name> (e.g., john.feature-pipeline) and obtain username with bauplan info
  • Verify source tables exist in the lakehouse before coding (use bauplan table get <namespace>.<table_name>)
  • Always use fully-qualified names for tables: <namespace>.<table_name>, with default namespace bauplan
  • Gather all required inputs before writing code: pipeline purpose, source tables, outputs, materialization, and strict mode

Example Use Cases

  • A taxi data pipeline where [taxi_fhvhv] and [taxi_zones] are sources, producing [trips], [clean_trips], and [daily_summary] in sequence
  • A Python model that reads from a source table, applies business logic, and materializes the result with REPLACE by default
  • Developing on a feature branch like john.feature-pipeline and wiring models without touching main
  • Setting up a new bauplan project skeleton including configuration and transformation models from scratch
  • Using bauplan.Model to pass multiple inputs into a model and define a single output table per model
