fabric-lakehouse-perf-remediate
npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-lakehouse-perf-remediate --openclaw
Fabric Lakehouse Performance Remediate
Systematic toolkit for diagnosing and resolving performance issues in Microsoft Fabric Lakehouse environments. Covers Delta table health, Spark compute tuning, query optimization, and automated maintenance workflows.
When to Use This Skill
- Lakehouse queries are running slowly or timing out
- Delta tables have accumulated many small files (small file problem)
- Spark notebooks or jobs are underperforming
- Direct Lake semantic models have cold-start or transcoding delays
- SQL analytics endpoint queries are slow
- Table maintenance (OPTIMIZE, VACUUM) needs to be scheduled or automated
- V-Order, Z-Order, or resource profile configuration is needed
- Capacity throttling or concurrency issues are suspected
- Streaming ingestion is creating fragmented Delta tables
Prerequisites
- Microsoft Fabric workspace with Lakehouse items
- Contributor or higher workspace role
- Fabric capacity (F2 or above) or Trial capacity
- For REST API automation: Microsoft Entra token for Fabric service
- For Spark commands: Access to Fabric notebooks or Spark Job Definitions
Quick Diagnosis Checklist
When a user reports Lakehouse performance issues, work through these areas in order:
- Identify the symptom — Slow reads, slow writes, capacity throttling, or query timeouts
- Check Delta table health — File count, file sizes, V-Order status, partition layout (a notebook sketch for this check follows this list)
- Review Spark configuration — Resource profile, autotune, shuffle partitions
- Inspect capacity utilization — Concurrency limits, burst capacity, throttling
- Evaluate maintenance history — When was OPTIMIZE/VACUUM last run?
- Assess data patterns — Streaming vs batch, read-heavy vs write-heavy
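The Delta-health and maintenance-history checks above can be scripted directly in a Fabric notebook. The following is a minimal PySpark sketch; it assumes the notebook's built-in spark session, and the three-part table name is a placeholder to substitute with your own.
# Minimal Delta table health check (Fabric notebook, PySpark)
from pyspark.sql import functions as F

table_name = "lakehouse_name.schema_name.table_name"  # placeholder

# File count, total size, and partition layout
detail = spark.sql(f"DESCRIBE DETAIL {table_name}")
row = detail.select("numFiles", "sizeInBytes", "partitionColumns").collect()[0]
avg_mb = (row["sizeInBytes"] / row["numFiles"] / (1024 * 1024)) if row["numFiles"] else 0.0
print(f"Files: {row['numFiles']}, avg file size: {avg_mb:.1f} MB, partition columns: {row['partitionColumns']}")

# Maintenance history: when did OPTIMIZE / VACUUM last run?
(spark.sql(f"DESCRIBE HISTORY {table_name}")
    .where(F.col("operation").isin("OPTIMIZE", "VACUUM START", "VACUUM END"))
    .select("timestamp", "operation", "operationMetrics")
    .show(5, truncate=False))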
Symptom-to-Action Map
| Symptom | Root Cause | Action |
|---|---|---|
| Slow reads across all engines | Small files, no V-Order | Run OPTIMIZE VORDER, switch to readHeavy profile |
| Slow Spark queries only | Wrong shuffle partitions | Enable autotune or tune manually |
| Slow Power BI Direct Lake | Too many Parquet files/row groups | Run OPTIMIZE, check guardrail limits |
| Slow SQL analytics endpoint | Files under 400 MB, too many small files | OPTIMIZE with maxRecordsPerFile=2M |
| Write performance degraded | V-Order enabled on write-heavy workload | Switch to writeHeavy resource profile |
| Capacity throttled | Too many concurrent Spark jobs | Review concurrency limits, enable optimistic admission |
| Storage growing unexpectedly | VACUUM not running | Schedule VACUUM with 7-day retention |
| Streaming creates tiny files | No batching or trigger interval | Add processingTime trigger, run periodic OPTIMIZE |
Core Optimization Operations
1. Table Maintenance Commands
Run in a Fabric notebook (Spark SQL):
-- Basic OPTIMIZE (bin-compaction)
OPTIMIZE lakehouse_name.schema_name.table_name;
-- OPTIMIZE with V-Order
OPTIMIZE lakehouse_name.schema_name.table_name VORDER;
-- OPTIMIZE with Z-Order on frequently filtered columns
OPTIMIZE lakehouse_name.schema_name.table_name ZORDER BY (column_name);
-- OPTIMIZE with both Z-Order and V-Order
OPTIMIZE lakehouse_name.schema_name.table_name ZORDER BY (column_name) VORDER;
-- OPTIMIZE specific partitions only
OPTIMIZE lakehouse_name.schema_name.table_name WHERE date_key >= '2025-01-01' VORDER;
-- VACUUM with default 7-day retention
VACUUM lakehouse_name.schema_name.table_name;
-- VACUUM with explicit retention (168 hours = 7 days; only retention below 7 days requires disabling the safety check)
VACUUM lakehouse_name.schema_name.table_name RETAIN 168 HOURS;
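If retention below the 7-day default is genuinely required, disable the Delta retention safety check first and re-enable it afterwards; a short window can remove files that long-running readers or time travel still need. The 72-hour value below is only illustrative.
-- Retention below 7 days: disable the safety check, vacuum, then re-enable it
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM lakehouse_name.schema_name.table_name RETAIN 72 HOURS;
SET spark.databricks.delta.retentionDurationCheck.enabled = true;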
2. Resource Profile Configuration
Set at environment level or runtime. See resource-profiles.md for details.
# Check current profile
spark.conf.get('spark.fabric.resourceProfile')
# Switch to read-heavy for Spark queries
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")
# Switch to read-heavy for Power BI Direct Lake
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForPBI")
# Switch to write-heavy for ETL/ingestion
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")
3. V-Order Control
# Check current V-Order setting
spark.conf.get('spark.sql.parquet.vorder.default')
# Enable V-Order for read-heavy workloads
spark.conf.set('spark.sql.parquet.vorder.default', 'true')
# Disable V-Order for write-heavy ingestion
spark.conf.set('spark.sql.parquet.vorder.default', 'false')
4. Autotune Configuration
-- Enable autotune for automatic Spark SQL tuning
SET spark.ms.autotune.enabled=TRUE;
-- Disable autotune
SET spark.ms.autotune.enabled=FALSE;
Autotune adjusts three key settings per query: spark.sql.shuffle.partitions, spark.sql.autoBroadcastJoinThreshold, and spark.sql.files.maxPartitionBytes. It needs roughly 20-25 query iterations to learn optimal settings, works only on Runtime 1.1 and 1.2, and is not compatible with high concurrency mode or private endpoints.
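When autotune is not an option (for example, in high concurrency mode), the same three settings can be tuned manually per session. A minimal sketch follows; the values are illustrative starting points, not universal recommendations.
# Manual equivalents of the settings autotune adjusts (illustrative values)
spark.conf.set("spark.sql.shuffle.partitions", "200")                           # scale with data volume and core count
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))   # broadcast joins for tables up to 64 MB
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))     # 256 MB per input partition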
5. SQL Analytics Endpoint Optimization
For best SQL analytics endpoint performance, target ~2 million rows and ~400 MB per Parquet file:
# Before data changes
spark.conf.set("spark.sql.files.maxRecordsPerFile", 2000000)
# Perform data operations (inserts, updates, deletes)
# ...
# After data changes, set max file size and optimize
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 4294967296)
Then run OPTIMIZE on the affected tables.
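Putting the sequence together for a single table, a minimal sketch (the three-part table name is a placeholder):
# File-sizing workflow for the SQL analytics endpoint (sketch)
spark.conf.set("spark.sql.files.maxRecordsPerFile", 2000000)
# ... perform inserts / updates / deletes on the table ...
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 4294967296)
spark.sql("OPTIMIZE lakehouse_name.schema_name.table_name VORDER")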
REST API Automation
Automate table maintenance via the Fabric REST API. See rest-api-maintenance.md for full details.
Endpoint:
POST https://api.fabric.microsoft.com/v1/workspaces/{workspaceId}/items/{lakehouseId}/jobs/instances?jobType=TableMaintenance
Request body:
{
"executionData": {
"tableName": "my_table",
"schemaName": "dbo",
"optimizeSettings": {
"vOrder": "true",
"zOrderBy": ["frequently_filtered_column"]
},
"vacuumSettings": {
"retentionPeriod": "7.01:00:00"
}
}
}
Monitor status:
GET https://api.fabric.microsoft.com/v1/workspaces/{workspaceId}/items/{lakehouseId}/jobs/instances/{operationId}
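A minimal Python sketch of the call sequence above, assuming you already hold a Microsoft Entra token with Fabric API permissions. The workspace ID, lakehouse ID, token, and polling interval are placeholders, and reading the job-instance URL from the Location header is an assumption based on the job scheduler pattern.
# Sketch: submit a TableMaintenance job and poll until it finishes
import time
import requests

base = "https://api.fabric.microsoft.com/v1"
url = f"{base}/workspaces/<workspaceId>/items/<lakehouseId>/jobs/instances?jobType=TableMaintenance"
headers = {"Authorization": "Bearer <entra-token>", "Content-Type": "application/json"}

body = {
    "executionData": {
        "tableName": "my_table",
        "schemaName": "dbo",
        "optimizeSettings": {"vOrder": True, "zOrderBy": ["frequently_filtered_column"]},
        "vacuumSettings": {"retentionPeriod": "7.01:00:00"},
    }
}

resp = requests.post(url, headers=headers, json=body)
resp.raise_for_status()
status_url = resp.headers.get("Location")  # assumed: job instance URL is returned in the Location header

while status_url:
    job = requests.get(status_url, headers=headers).json()
    if job.get("status") in ("Completed", "Failed", "Cancelled"):
        print(job)
        break
    time.sleep(30)  # polling interval is arbitrary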
Streaming Ingestion Best Practices
When streaming data into a Lakehouse, prevent small file proliferation:
- Set processing time triggers to batch events into larger writes
- Use Optimized Write (spark.microsoft.delta.optimizeWrite.enabled = true)
- Partition wisely — low-cardinality columns only (< 100-200 distinct values)
- Schedule periodic OPTIMIZE — daily or more often for high-frequency streams
- Combine repartition() with partitionBy() for optimal in-memory and on-disk layout
# Example: Streaming with batching and partitioning
streaming_query = df \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "Files/checkpoint") \
    .outputMode("append") \
    .partitionBy("date_key") \
    .trigger(processingTime="1 minute") \
    .toTable("my_streaming_table")
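Session-level settings that pair with the stream above, as a sketch; the table name matches the example, and the OPTIMIZE call would typically run from a separate scheduled notebook rather than inside the streaming job.
# Reduce small files at write time (Optimized Write, from the list above)
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")

# Periodic compaction of the streaming target, e.g. from a daily scheduled notebook
spark.sql("OPTIMIZE my_streaming_table VORDER")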
Available Scripts
- Invoke-FabricTableMaintenance.ps1 — PowerShell script to automate table maintenance via REST API across multiple tables
- Get-DeltaTableHealth.ps1 — PowerShell script to assess Delta table health metrics via Fabric REST API
Available Templates
- maintenance-notebook.py — PySpark notebook template for comprehensive table maintenance
- perf-diagnostic-notebook.py — PySpark notebook template for diagnosing performance issues
Detailed References
- Resource Profiles — Complete guide to Fabric Spark resource profile selection and configuration
- REST API Maintenance — Automating table maintenance with Fabric REST API and PowerShell
- Delta Table Health — Assessing and monitoring Delta table file layout, V-Order status, and partition health
- Concurrency and Capacity — Understanding Spark job admission, throttling, burst capacity, and autoscale billing
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| OPTIMIZE command not recognized | Running in SQL analytics endpoint | Use Fabric notebook with Spark runtime |
| VACUUM fails with retention error | Retention < 7 days without safety override | Set spark.databricks.delta.retentionDurationCheck.enabled=false |
| Autotune not activating | Query too short (< 15 seconds) or wrong runtime | Use Runtime 1.1/1.2, ensure queries exceed 15s |
| Table maintenance API returns 409 | Another maintenance job running on same table | Wait for completion, jobs on different tables run in parallel |
| V-Order not applied after OPTIMIZE | Session/table property mismatch | Use OPTIMIZE table VORDER explicitly |
| Capacity throttled during maintenance | Maintenance consuming too many cores | Schedule during off-peak, reduce concurrent jobs |
Key Decision Framework
Is the workload primarily reads or writes?
├── Read-heavy (dashboards, queries, analytics)
│ ├── Power BI Direct Lake → readHeavyForPBI profile + OPTIMIZE VORDER
│ └── Spark analytics → readHeavyForSpark profile + enable autotune
├── Write-heavy (ETL, ingestion, streaming)
│ └── writeHeavy profile + periodic OPTIMIZE + VACUUM on schedule
└── Mixed workload
├── Separate environments for read vs write paths
└── Use runtime spark.conf.set() to switch profiles per notebook
Source
https://github.com/PatrickGallucci/fabric-skills/blob/main/skills/fabric-lakehouse-perf-remediate/SKILL.md
Overview
Systematic toolkit for diagnosing and resolving Microsoft Fabric Lakehouse performance issues, including slow Spark queries, small-file problems, Delta table fragmentation, and Direct Lake tuning. It covers maintenance commands (OPTIMIZE, VACUUM, Z-Order, V-Order), partitioning strategies, resource profiles, and autotune to speed up notebooks and pipelines.
How This Skill Works
Begin with a Quick Diagnosis Checklist to identify symptoms, Delta health, Spark configuration, capacity, and maintenance history. Then apply core optimization operations such as OPTIMIZE (with VORDER or ZORDER), adjust resource profiles, enable autotune, and schedule VACUUM and OPTIMIZE as needed. The approach follows a symptom-to-action map to target root causes like small files, poor partitioning, or excessive concurrency.
When to Use It
- Lakehouse queries are slow or timing out
- Delta tables have accumulated many small files
- Spark notebooks or jobs are underperforming
- Direct Lake semantic models have cold-start or transcoding delays
- Table maintenance (OPTIMIZE, VACUUM) needs to be automated, or capacity throttling is suspected
Quick Start
- Step 1: Open a Fabric notebook and run the Quick Diagnosis Checklist (symptoms, Delta health, Spark config, capacity, maintenance history, data patterns)
- Step 2: Apply core optimizations (OPTIMIZE with VORDER/ZORDER, adjust resource profile, enable autotune) and address small-file issues
- Step 3: Schedule maintenance (OPTIMIZE, VACUUM) and validate improvements with representative queries
Best Practices
- Run OPTIMIZE with V-Order and Z-Order to improve data layout and query performance
- For small-file problems, set maxRecordsPerFile=2M and run OPTIMIZE to reduce metadata overhead
- Schedule VACUUM with a 7-day retention to control storage growth
- Enable Spark autotune and tune shuffle partitions for balanced resource usage
- Monitor capacity and concurrency, and use optimistic job admission to prevent throttling
Example Use Cases
- Slow reads across all engines caused by many small files; remediate with OPTIMIZE VORDER and switch to a readHeavy profile
- Slow Spark queries in notebooks; enable autotune and adjust shuffle partitions for better parallelism
- Direct Lake performance degraded by too many Parquet files; run OPTIMIZE and verify guardrail limits
- SQL analytics endpoint slow due to fragmentation; OPTIMIZE with maxRecordsPerFile=2M
- Streaming ingestion creates tiny files; add a processingTime trigger and perform periodic OPTIMIZE