What is fabric-delta-spark-perf?

A guided workflow to diagnose and optimize Delta Lake and Spark performance in Microsoft Fabric Lakehouse environments, including AQE, tuning, and maintenance steps.

When should I run the quick diagnostic workflow?

Use it when Spark jobs are slow, you suspect small files or data skew, encounter OOM errors, or see shuffle bottlenecks in Fabric Lakehouse workloads.

Which optimizations are typically recommended?

OPTIMIZE and VACUUM on Delta tables, rebalance partitions, enable Adaptive Query Execution, adjust resource profiles, and consider the Native Execution Engine and broadcast joins where appropriate.

fabric-delta-spark-perf

Scanned

npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-delta-spark-perf --openclaw

Files (1)

SKILL.md

8.2 KB

Microsoft Fabric Delta Lake Spark Performance remediate

Systematic workflows for diagnosing and resolving Apache Spark and Delta Lake performance issues in Microsoft Fabric Lakehouse environments.

When to Use This Skill

Activate when the user mentions any of the following:

Spark job is slow, taking too long, or timing out
Small file problem, too many small files, file fragmentation
Data skew, straggler tasks, unbalanced partitions
Out of memory (OOM) errors on driver or executor
Shuffle spill, excessive shuffle read/write
OPTIMIZE, VACUUM, bin-compaction, or table maintenance
V-Order, Z-Order, or Parquet optimization
Resource profiles: writeHeavy, readHeavyForSpark, readHeavyForPBI
Autotune, Adaptive Query Execution (AQE), broadcast join thresholds
Native Execution Engine configuration
Streaming performance, microbatch tuning, checkpoint issues
Spark pool sizing, autoscale, dynamic executor allocation
Direct Lake performance tied to Delta table structure
Capacity throttling, TooManyRequestsForCapacity errors

Prerequisites

Microsoft Fabric workspace with Data Engineering or Data Science experience
Apache Spark notebooks or Spark Job Definitions
Lakehouse with Delta tables
Appropriate Fabric capacity SKU (F2 through F2048)

Quick Diagnostic Workflow

When a user reports slow Spark performance, follow this triage sequence:

Step 1: Identify the Symptom Category

Symptom	Likely Root Cause	Jump To
Job runs much longer than expected	Data skew or small files	Step 2
OOM error on driver	`collect()`, `toPandas()`, or large broadcast	diagnostic-checklist.md
OOM error on executor	Wide joins, large shuffles, insufficient memory	diagnostic-checklist.md
Many tasks, most finish fast, few stragglers	Data skew	diagnostic-checklist.md
High shuffle read/write in Spark UI	Missing broadcast join or too many partitions	diagnostic-checklist.md
Query reads thousands of small files	Small file problem, needs OPTIMIZE	Step 3
Capacity throttled (HTTP 430)	Too many concurrent jobs for SKU	spark-configurations.md
Streaming lag increasing	Microbatch interval or partition mismatch	diagnostic-checklist.md

Step 2: Check for Data Skew

Run the diagnostic script to detect skew in a target table:

# Quick skew detection
df = spark.read.format("delta").table("your_table")
df.groupBy("partition_column") \
  .count() \
  .orderBy(F.desc("count")) \
  .show(20)

If the top partition has 10x+ more rows than the median, apply skew mitigation. See diagnostic-checklist.md for remediation steps.

Step 3: Check File Health

# Check file count and sizes for a Delta table
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, "your_table")
detail = spark.sql("DESCRIBE DETAIL your_table")
detail.select("numFiles", "sizeInBytes").show()

# Check for small files (< 32MB)
files_df = spark.sql("DESCRIBE DETAIL your_table")

If file count is high relative to data volume (e.g., >1000 files for <10GB), run OPTIMIZE. See table-maintenance-guide.md.

Step 4: Verify Spark Configuration

Check that the environment has an appropriate resource profile:

# Check current resource profile
print(spark.conf.get("spark.fabric.resourceProfile", "not set"))

# Check key write/read settings
configs = [
    "spark.sql.parquet.vorder.default",
    "spark.databricks.delta.optimizeWrite.enabled",
    "spark.databricks.delta.optimizeWrite.binSize",
    "spark.sql.shuffle.partitions",
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.files.maxPartitionBytes",
    "spark.databricks.optimizer.adaptive.enabled"
]
for c in configs:
    try:
        print(f"{c} = {spark.conf.get(c)}")
    except:
        print(f"{c} = (default)")

See spark-configurations.md for recommended values per workload type.

Step 5: Apply Fix and Validate

After identifying the root cause, apply the appropriate fix from the references:

Root Cause	Fix	Reference
Small files	Run OPTIMIZE with V-Order	table-maintenance-guide.md
Stale files bloating storage	Run VACUUM with 7+ day retention	table-maintenance-guide.md
Data skew on joins	Enable AQE skew join + key salting	diagnostic-checklist.md
Wrong resource profile	Switch to appropriate profile	spark-configurations.md
Driver OOM	Avoid `collect()`, increase driver memory	diagnostic-checklist.md
Executor OOM	Increase executor memory, reduce partition size	diagnostic-checklist.md
Excessive shuffling	Use broadcast joins for small tables	diagnostic-checklist.md
V-Order not applied	Enable at table or session level	table-maintenance-guide.md

Available Scripts

Script	Purpose
diagnose-delta-performance.py	Automated diagnostic scan: file health, skew detection, config audit
table-maintenance.py	Run OPTIMIZE + VACUUM across all tables in a Lakehouse schema

Available Templates

Template	Purpose
notebook-performance-template.py	Starter notebook with performance best practices baked in

Key Principles

Profile before optimizing — Use Spark UI metrics and the diagnostic script before changing configurations. Most performance issues trace to data layout, not Spark settings.
Match resource profile to workload — Write-heavy ETL should use writeHeavy (V-Order off, stats collection off). Read-heavy analytics should use readHeavyForSpark or readHeavyForPBI (V-Order on, Optimized Write on).
Right-size files first — Target ~128MB–1GB per Parquet file. Use OPTIMIZE regularly for tables with frequent writes. Streaming tables need more frequent compaction.
Partition wisely — Only partition tables over 1TB. Use low-cardinality columns (< 200 distinct values). Over-partitioning creates the small file problem.
Enable AQE — Adaptive Query Execution handles skew joins, partition coalescing, and dynamic broadcast automatically. Keep it enabled.
VACUUM safely — Always retain at least 7 days. Shorter retention can break time travel and concurrent readers.

References

Diagnostic Checklist — Step-by-step remediate for OOM, skew, shuffle, and streaming issues
Spark Configurations Guide — Resource profiles, autotune, AQE, pool sizing, and capacity limits
Table Maintenance Guide — OPTIMIZE, VACUUM, V-Order, Z-Order, and Optimized Write deep dive

Source

git clone https://github.com/PatrickGallucci/fabric-skills/blob/main/skills/fabric-delta-spark-perf/SKILL.mdView on GitHub

Overview

This skill provides systematic workflows for diagnosing and resolving Apache Spark and Delta Lake performance issues in Microsoft Fabric Lakehouse environments. It helps teams quickly identify root causes such as data skew, small files, or memory pressure, and guides targeted optimizations to improve query throughput and reliability.

How This Skill Works

Use a triage-driven diagnostic workflow starting with symptom identification, then skew checks, then file health checks, followed by targeted fixes like OPTIMIZE, VACUUM, partition tuning, and enabling AQE or the Native Execution Engine. Validate changes with end-to-end performance checks to ensure improvements hold under real workloads.

When to Use It

Spark job is slow or times out
Small file problem and file fragmentation
Data skew causing stragglers or unbalanced partitions
OOM errors on driver or executor
Shuffle bottlenecks and high shuffle read/write

Quick Start

Step 1: Kick off the Quick Diagnostic Workflow and identify the symptom category
Step 2: Run the skew detection script to quantify skew
Step 3: Check Delta table health and small-file counts, then apply targeted fixes such as OPTIMIZE, VACUUM, and partition tuning

Best Practices

Follow the Quick Diagnostic Workflow Step 1-3
Run skew detection and remediate with partitioning or repartitioning
Assess Delta table health and small-file counts
Apply OPTIMIZE, VACUUM, and bin-packing as appropriate
Tune AQE, broadcast join thresholds, and Native Execution Engine settings

Example Use Cases

Delta table with many small files caused slow scans; after enabling OPTIMIZE and managing file sizes, performance improved
Data skew across partitions led to stragglers; skew detected and repartitioning by key reduced runtime
High shuffle spill and read/write bottlenecks; added broadcast joins where appropriate and increased AQE thresholds
OOM errors on executor during wide joins; adjusted memory limits and optimized join strategies
Streaming microbatches lag due to partition mismatch; tuned partitioning and checkpointing in Fabric notebooks

Frequently Asked Questions

Add this skill to your agents