
fabric-delta-spark-perf

npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-delta-spark-perf --openclaw

Microsoft Fabric Delta Lake Spark Performance Remediation

Systematic workflows for diagnosing and resolving Apache Spark and Delta Lake performance issues in Microsoft Fabric Lakehouse environments.

When to Use This Skill

Activate when the user mentions any of the following:

  • Spark job is slow, taking too long, or timing out
  • Small file problem, too many small files, file fragmentation
  • Data skew, straggler tasks, unbalanced partitions
  • Out of memory (OOM) errors on driver or executor
  • Shuffle spill, excessive shuffle read/write
  • OPTIMIZE, VACUUM, bin-compaction, or table maintenance
  • V-Order, Z-Order, or Parquet optimization
  • Resource profiles: writeHeavy, readHeavyForSpark, readHeavyForPBI
  • Autotune, Adaptive Query Execution (AQE), broadcast join thresholds
  • Native Execution Engine configuration
  • Streaming performance, microbatch tuning, checkpoint issues
  • Spark pool sizing, autoscale, dynamic executor allocation
  • Direct Lake performance tied to Delta table structure
  • Capacity throttling, TooManyRequestsForCapacity errors

Prerequisites

  • Microsoft Fabric workspace with Data Engineering or Data Science experience
  • Apache Spark notebooks or Spark Job Definitions
  • Lakehouse with Delta tables
  • Appropriate Fabric capacity SKU (F2 through F2048)

Quick Diagnostic Workflow

When a user reports slow Spark performance, follow this triage sequence:

Step 1: Identify the Symptom Category

| Symptom | Likely Root Cause | Jump To |
| --- | --- | --- |
| Job runs much longer than expected | Data skew or small files | Step 2 |
| OOM error on driver | `collect()`, `toPandas()`, or large broadcast | diagnostic-checklist.md |
| OOM error on executor | Wide joins, large shuffles, insufficient memory | diagnostic-checklist.md |
| Many tasks, most finish fast, few stragglers | Data skew | diagnostic-checklist.md |
| High shuffle read/write in Spark UI | Missing broadcast join or too many partitions | diagnostic-checklist.md |
| Query reads thousands of small files | Small file problem, needs OPTIMIZE | Step 3 |
| Capacity throttled (HTTP 430) | Too many concurrent jobs for SKU | spark-configurations.md |
| Streaming lag increasing | Microbatch interval or partition mismatch | diagnostic-checklist.md |

Step 2: Check for Data Skew

Run the diagnostic script to detect skew in a target table:

# Quick skew detection
from pyspark.sql import functions as F

df = spark.table("your_table")
df.groupBy("partition_column") \
  .count() \
  .orderBy(F.desc("count")) \
  .show(20)

If the top partition has 10x+ more rows than the median, apply skew mitigation. See diagnostic-checklist.md for remediation steps.
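One common mitigation is key salting: append a random suffix to the hot key so its rows spread across many tasks instead of landing in one straggler. The mechanics can be sketched in plain Python (the key name and salt count here are illustrative; in PySpark the same suffix is usually built with `F.concat` and `F.rand`):

```python
import random
from collections import Counter

def salt(key: str, num_salts: int, rng: random.Random) -> str:
    # Spread one hot key across num_salts distinct sub-keys
    return f"{key}#{rng.randrange(num_salts)}"

# Simulate 100,000 rows that all share a single hot join key
rng = random.Random(42)
salted = Counter(salt("hot_customer", 8, rng) for _ in range(100_000))
# The 8 salted buckets each hold roughly 1/8 of the rows,
# turning one straggler task into 8 evenly sized ones.
```

On the other side of a join, each key in the smaller table must be duplicated once per salt value so every salted key still finds its match.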

Step 3: Check File Health

# Check file count and sizes for a Delta table
detail = spark.sql("DESCRIBE DETAIL your_table")
detail.select("numFiles", "sizeInBytes").show()

# Check for small files: average file size under ~32MB signals fragmentation
row = detail.select("numFiles", "sizeInBytes").first()
print(f"avg file MB: {row['sizeInBytes'] / row['numFiles'] / (1024 * 1024):.1f}")

If file count is high relative to data volume (e.g., >1000 files for <10GB), run OPTIMIZE. See table-maintenance-guide.md.
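The ">1000 files for <10GB" rule is just an average-file-size check. As a sketch, the numbers returned by DESCRIBE DETAIL can be fed into a helper like this (the function name and 32MB threshold are this guide's heuristic, not a Fabric API):

```python
def needs_optimize(num_files: int, size_bytes: int,
                   small_file_mb: int = 32) -> bool:
    # True when the average Parquet file is below the small-file
    # threshold (~32 MB), e.g. >1000 files holding <10 GB.
    if num_files == 0:
        return False
    avg_mb = size_bytes / num_files / (1024 * 1024)
    return avg_mb < small_file_mb

# 2,000 files totalling 8 GB averages ~4 MB per file: compact it
print(needs_optimize(2000, 8 * 1024**3))
```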

Step 4: Verify Spark Configuration

Check that the environment has an appropriate resource profile:

# Check current resource profile
print(spark.conf.get("spark.fabric.resourceProfile", "not set"))

# Check key write/read settings
configs = [
    "spark.sql.parquet.vorder.default",
    "spark.databricks.delta.optimizeWrite.enabled",
    "spark.databricks.delta.optimizeWrite.binSize",
    "spark.sql.shuffle.partitions",
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.files.maxPartitionBytes",
    "spark.sql.adaptive.enabled"
]
for c in configs:
    try:
        print(f"{c} = {spark.conf.get(c)}")
    except Exception:
        print(f"{c} = (default)")

See spark-configurations.md for recommended values per workload type.
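As an example, a session preparing a write-heavy ETL run might override the profile and V-Order settings checked above. The property names are the same ones probed in Step 4, and the values follow the writeHeavy guidance (V-Order off); treat this as a sketch and confirm against spark-configurations.md for your runtime:

```python
# Assumes a live Fabric Spark session; property names as probed in Step 4
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")
# writeHeavy guidance: V-Order off to favor ingest speed over read speed
spark.conf.set("spark.sql.parquet.vorder.default", "false")
```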

Step 5: Apply Fix and Validate

After identifying the root cause, apply the appropriate fix from the references:

| Root Cause | Fix | Reference |
| --- | --- | --- |
| Small files | Run OPTIMIZE with V-Order | table-maintenance-guide.md |
| Stale files bloating storage | Run VACUUM with 7+ day retention | table-maintenance-guide.md |
| Data skew on joins | Enable AQE skew join + key salting | diagnostic-checklist.md |
| Wrong resource profile | Switch to appropriate profile | spark-configurations.md |
| Driver OOM | Avoid `collect()`, increase driver memory | diagnostic-checklist.md |
| Executor OOM | Increase executor memory, reduce partition size | diagnostic-checklist.md |
| Excessive shuffling | Use broadcast joins for small tables | diagnostic-checklist.md |
| V-Order not applied | Enable at table or session level | table-maintenance-guide.md |
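The first two fixes run directly from a notebook cell. A minimal sketch with a hypothetical table name; `VORDER` is the Fabric-specific OPTIMIZE clause, and the 168-hour retention matches the 7-day floor from Key Principles:

```python
# Compact small files and apply V-Order in one pass
spark.sql("OPTIMIZE your_table VORDER")
# Remove unreferenced files older than 7 days (168 hours)
spark.sql("VACUUM your_table RETAIN 168 HOURS")
```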

Available Scripts

| Script | Purpose |
| --- | --- |
| diagnose-delta-performance.py | Automated diagnostic scan: file health, skew detection, config audit |
| table-maintenance.py | Run OPTIMIZE + VACUUM across all tables in a Lakehouse schema |

Available Templates

| Template | Purpose |
| --- | --- |
| notebook-performance-template.py | Starter notebook with performance best practices baked in |

Key Principles

  1. Profile before optimizing — Use Spark UI metrics and the diagnostic script before changing configurations. Most performance issues trace to data layout, not Spark settings.

  2. Match resource profile to workload — Write-heavy ETL should use writeHeavy (V-Order off, stats collection off). Read-heavy analytics should use readHeavyForSpark or readHeavyForPBI (V-Order on, Optimized Write on).

  3. Right-size files first — Target ~128MB–1GB per Parquet file. Use OPTIMIZE regularly for tables with frequent writes. Streaming tables need more frequent compaction.

  4. Partition wisely — Only partition tables over 1TB. Use low-cardinality columns (< 200 distinct values). Over-partitioning creates the small file problem.

  5. Enable AQE — Adaptive Query Execution handles skew joins, partition coalescing, and dynamic broadcast automatically. Keep it enabled.

  6. VACUUM safely — Always retain at least 7 days. Shorter retention can break time travel and concurrent readers.
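Principles 5 and 6 translate to a few session settings. These are standard Spark and Delta property names, shown as a sketch rather than Fabric-verified defaults:

```python
# Keep AQE and its skew-join handling on (usually the default)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Safety rail: reject VACUUM calls with retention below the minimum
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
```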


Source

git clone https://github.com/PatrickGallucci/fabric-skills.git

View on GitHub: https://github.com/PatrickGallucci/fabric-skills/blob/main/skills/fabric-delta-spark-perf/SKILL.md

Overview

This skill provides systematic workflows for diagnosing and resolving Apache Spark and Delta Lake performance issues in Microsoft Fabric Lakehouse environments. It helps teams quickly identify root causes such as data skew, small files, or memory pressure, and guides targeted optimizations to improve query throughput and reliability.

How This Skill Works

Use a triage-driven diagnostic workflow starting with symptom identification, then skew checks, then file health checks, followed by targeted fixes like OPTIMIZE, VACUUM, partition tuning, and enabling AQE or the Native Execution Engine. Validate changes with end-to-end performance checks to ensure improvements hold under real workloads.

When to Use It

  • Spark job is slow or times out
  • Small file problem and file fragmentation
  • Data skew causing stragglers or unbalanced partitions
  • OOM errors on driver or executor
  • Shuffle bottlenecks and high shuffle read/write

Quick Start

  1. Kick off the Quick Diagnostic Workflow and identify the symptom category
  2. Run the skew detection script to quantify skew
  3. Check Delta table health and small-file counts, then apply targeted fixes such as OPTIMIZE, VACUUM, and partition tuning

Best Practices

  • Follow the Quick Diagnostic Workflow, Steps 1-5, in order
  • Run skew detection and remediate with partitioning or repartitioning
  • Assess Delta table health and small-file counts
  • Apply OPTIMIZE, VACUUM, and bin-packing as appropriate
  • Tune AQE, broadcast join thresholds, and Native Execution Engine settings

Example Use Cases

  • Delta table with many small files caused slow scans; after enabling OPTIMIZE and managing file sizes, performance improved
  • Data skew across partitions led to stragglers; skew detected and repartitioning by key reduced runtime
  • High shuffle spill and read/write bottlenecks; added broadcast joins where appropriate and increased AQE thresholds
  • OOM errors on executor during wide joins; adjusted memory limits and optimized join strategies
  • Streaming microbatches lag due to partition mismatch; tuned partitioning and checkpointing in Fabric notebooks
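For the streaming case, microbatch cadence and checkpointing are set on the writer. A hypothetical sketch; the table name, checkpoint path, and interval are illustrative, and `stream_df` stands in for an existing streaming DataFrame:

```python
# Assumes stream_df is a streaming DataFrame in a Fabric notebook
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "Files/checkpoints/sales")  # hypothetical path
    .trigger(processingTime="1 minute")  # widen to cut microbatch overhead
    .toTable("sales_stream"))  # hypothetical target table
```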

