
fabric-pyspark-perf-remediate

npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-pyspark-perf-remediate --openclaw

Microsoft Fabric PySpark Performance Remediation

Systematic guide for diagnosing and resolving Apache Spark performance problems in Microsoft Fabric Data Engineering workloads, including notebooks, Spark Job Definitions, and pipeline activities.

When to Use This Skill

Activate when encountering any of these scenarios:

  • PySpark notebook cells take unexpectedly long to execute
  • Spark Job Definitions exceed expected duration or fail with timeouts
  • Out-of-memory (OOM) errors on driver or executors
  • Excessive shuffle read/write in Spark UI stage details
  • Data skew causing individual tasks to run much longer than peers
  • Delta Lake table writes are slow or produce many small files
  • Fabric capacity utilization is high or jobs are queued/throttled
  • Need to choose between resource profiles (readHeavy vs writeHeavy)
  • Deciding whether to enable autotune, native execution engine, or Optimized Write
  • Interpreting Spark UI metrics (stages, tasks, storage, SQL plan)

Prerequisites

  • Access to a Microsoft Fabric workspace with Data Engineering/Science experience
  • Fabric capacity (F2 or higher) with Spark compute enabled
  • Familiarity with PySpark DataFrames and Spark SQL
  • Access to Spark UI via the Monitoring Hub or notebook session details

Quick Diagnostic Workflow

Follow this triage sequence to identify the root cause:

  1. Check capacity status - Is the Fabric capacity throttled or overloaded? See Monitoring Hub for queued jobs and CU utilization.
  2. Identify the slow stage - Open Spark UI, find the stage with the longest duration, and check task-level metrics.
  3. Classify the bottleneck - Use the decision matrix below to categorize the issue.
  4. Apply targeted fix - Follow the relevant reference guide for your bottleneck type.
  5. Validate improvement - Re-run the job and compare Spark UI metrics before and after.
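Step 5's before/after comparison is easy to script. A minimal sketch in plain Python, where the durations are hypothetical values you would read off the Spark UI stage details:

```python
# Compare stage/job durations (in seconds) read from the Spark UI
# before and after a tuning change. Hypothetical helper, not a Fabric API.

def summarize_improvement(before_s: float, after_s: float) -> str:
    """Return a human-readable speedup summary for one stage or job."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("durations must be positive")
    speedup = before_s / after_s
    pct = (1 - after_s / before_s) * 100
    return f"{speedup:.2f}x speedup ({pct:.1f}% faster)"

# A stage that went from 10 minutes to 2.5 minutes:
print(summarize_improvement(600, 150))  # → 4.00x speedup (75.0% faster)
```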

Bottleneck Decision Matrix

Use these indicators to classify your performance issue:

Shuffle Bottleneck: Shuffle read/write bytes are large (>1 GB per stage), many tasks in the stage, high GC time. Fix with broadcast joins, reduced shuffle partitions, or pre-partitioned data. See shuffle-and-join-optimization.md.

Data Skew: One or few tasks take 10x longer than median, input size per task is highly uneven, some tasks process GBs while others process MBs. Fix with salting, repartitioning, or adaptive query execution. See data-skew-resolution.md.

Memory Pressure / OOM: Executor lost errors, "Container killed by YARN for exceeding memory limits", excessive GC time (>10% of task time), disk spill. Fix with memory configuration, caching strategy, or partition sizing. See memory-and-spill-tuning.md.

Small Files / Delta Compaction: Write operations produce thousands of small Parquet files, reads are slow due to file-listing overhead, table maintenance has never been run. Fix with OPTIMIZE, Optimized Write, or streaming trigger intervals. See delta-table-optimization.md.

Capacity / Compute Sizing: Jobs queued for long periods, capacity utilization consistently above 80%, throttling messages in monitoring. Fix with pool sizing, resource profiles, or autoscale billing. See fabric-compute-tuning.md.
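The matrix above can be encoded as a first-pass triage function. A sketch in plain Python; the metric names are hypothetical (you would collect them from the Spark UI and Monitoring Hub), and the thresholds mirror the indicators listed above:

```python
# First-pass triage over metrics gathered from the Spark UI / Monitoring Hub.
# Field names and thresholds are illustrative, mirroring the matrix above.

def classify_bottleneck(m: dict) -> str:
    if m.get("capacity_utilization_pct", 0) > 80 or m.get("queued", False):
        return "capacity"      # see fabric-compute-tuning.md
    if m.get("max_task_s", 0) > 10 * max(m.get("median_task_s", 1), 1):
        return "data_skew"     # see data-skew-resolution.md
    if m.get("gc_time_pct", 0) > 10 or m.get("disk_spill_bytes", 0) > 0:
        return "memory"        # see memory-and-spill-tuning.md
    if m.get("shuffle_bytes", 0) > 1024**3:
        return "shuffle"       # see shuffle-and-join-optimization.md
    if m.get("files_written", 0) > 1000:
        return "small_files"   # see delta-table-optimization.md
    return "unclassified"

# One task at 30 min vs a 30 s median is classic skew:
print(classify_bottleneck({"max_task_s": 1800, "median_task_s": 30}))  # → data_skew
```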

Key Spark Configuration Quick Reference

Session-Level Overrides (Notebook Cell 1)

# Shuffle partition count - default is 200; "auto" lets the engine decide,
# or set an explicit count tuned to your data volume
spark.conf.set("spark.sql.shuffle.partitions", "auto")

# Broadcast join threshold - default 10MB, increase for medium tables
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50m")

# Max partition bytes - default 128MB, increase for fewer larger partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", "256m")

# Enable Optimized Write for Delta
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")

# Enable autotune (preview, Runtime 1.1/1.2 only)
spark.conf.set("spark.ms.autotune.enabled", "true")

# Enable native execution engine
spark.conf.set("spark.native.enabled", "true")

# VOrder for read-optimized scenarios (Power BI, SQL endpoint)
spark.conf.set("spark.sql.parquet.vorder.default", "true")

Fabric Resource Profiles

| Profile | Best For | VOrder | Key Behavior |
|---------|----------|--------|--------------|
| writeHeavy | ETL, ingestion, streaming | Off | Optimized throughput for writes (new workspace default) |
| readHeavyForSpark | Interactive queries, EDA | On | Optimized for Spark read operations |
| readHeavyForPBI | Power BI, SQL endpoint | On | Optimized for downstream BI consumption |

Resource profiles are set at the environment level in Workspace Settings > Data Engineering/Science > Spark Settings.

Autotune Details

Autotune automatically adjusts three key configurations per query using ML:

| Setting | Default | What Autotune Optimizes |
|---------|---------|-------------------------|
| spark.sql.shuffle.partitions | 200 | Partition count for shuffles |
| spark.sql.autoBroadcastJoinThreshold | 10 MB | Broadcast join size threshold |
| spark.sql.files.maxPartitionBytes | 128 MB | Max bytes per read partition |

Requirements: Runtime 1.1 or 1.2; not compatible with high concurrency mode or private endpoints. Autotune needs 20-25 iterations to learn an optimal configuration, and it targets repetitive queries running longer than 15 seconds.
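Threshold values such as "50m" above are Spark-style byte-size strings. To compare a threshold against an actual table size, you need them in bytes; a minimal parser sketch (Spark's own parser accepts more forms than this covers):

```python
# Parse Spark-style size strings ("10m", "50m", "1g") into bytes, so a
# threshold like spark.sql.autoBroadcastJoinThreshold can be compared to a
# real table size. Minimal sketch; not Spark's full grammar (e.g. bare "b").

UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def parse_size(s: str) -> int:
    s = s.strip().lower().rstrip("b")  # accept "50m" and "50mb" alike
    if s and s[-1] in UNITS:
        return int(s[:-1]) * UNITS[s[-1]]
    return int(s)  # plain number = bytes

print(parse_size("50m"))  # → 52428800
```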

Diagnostic Scripts

Run these in a Fabric notebook to gather performance data:
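The original scripts are not reproduced in this extract. As one hedged example of the kind of check they perform, the helper below flags a small-file problem from the numFiles and sizeInBytes columns that Delta's DESCRIBE DETAIL really returns (in a notebook you would obtain them via spark.sql("DESCRIBE DETAIL my_table").first()); the function itself and its threshold are illustrative:

```python
# Flag a small-file problem from DESCRIBE DETAIL output.
# numFiles and sizeInBytes are real DESCRIBE DETAIL columns; the
# threshold (average file smaller than 32 MB) is an illustrative heuristic.

def small_file_report(num_files: int, size_in_bytes: int,
                      min_avg_mb: float = 32.0) -> str:
    if num_files == 0:
        return "empty table"
    avg_mb = size_in_bytes / num_files / (1024 * 1024)
    verdict = "RUN OPTIMIZE" if avg_mb < min_avg_mb else "OK"
    return f"{num_files} files, avg {avg_mb:.1f} MB/file -> {verdict}"

# e.g. 10,000 files totalling 5 GB: average ~0.5 MB per file
print(small_file_report(10_000, 5 * 1024**3))  # → 10000 files, avg 0.5 MB/file -> RUN OPTIMIZE
```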

Notebook Templates

Common Anti-Patterns

Collecting large datasets to the driver: df.collect() or df.toPandas() on large DataFrames causes OOM. Use .limit(), .take(), or write to Delta instead.

Using repartition() before write when Optimized Write is available: Optimized Write handles partition optimization automatically, making explicit repartition() calls redundant and expensive.

Never running OPTIMIZE on Delta tables: Over time, append-heavy tables accumulate small files. Schedule OPTIMIZE (and optionally VACUUM) as maintenance.

Caching DataFrames that are used only once: .cache() and .persist() consume memory and executor storage. Only cache DataFrames referenced multiple times in the same session.

Using Python UDFs instead of native Spark functions: Python UDFs serialize data between JVM and Python, causing major overhead. Prefer pyspark.sql.functions or Pandas UDFs for vectorized operations.

Ignoring partition pruning: Reading Delta tables without filter predicates on partition columns forces full table scans. Always filter on partition columns when possible.

Remediation Quick Fixes

Problem: Notebook hangs on first cell. Fix: Check if the Spark session is waiting for cluster allocation. Verify capacity is not fully utilized in Monitoring Hub. Consider starter pools for faster spin-up.

Problem: java.lang.OutOfMemoryError: Java heap space on driver. Fix: Reduce collect() size, increase driver memory in environment settings, or switch to writing results to a table instead of returning to the driver.

Problem: Stage with 200 tasks where 1 task takes 30 minutes and rest take 30 seconds. Fix: Classic data skew. See data-skew-resolution.md for salting and repartitioning techniques.
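The salting fix can be illustrated without a cluster: append a deterministic salt to the hot key so its rows spread across many shuffle partitions. A plain-Python sketch with made-up data; in PySpark you would build the salted key as a DataFrame column (e.g. concatenating the key with an integer column modulo N):

```python
# Spread a skewed key across N buckets by appending a deterministic salt.
# Plain-Python illustration of the idea with invented data; real salting
# is done on a DataFrame column before the join/aggregation.

from collections import Counter

SALT_BUCKETS = 8
# Skewed input: one hot key with 1000 rows, one cold key with 10.
rows = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]

salted = [(f"{key}_{value % SALT_BUCKETS}", value) for key, value in rows]
per_key = Counter(k for k, _ in salted)

# The single "hot" key now maps to 8 salted keys of 125 rows each,
# so no single task has to process all 1000 rows.
print({k: c for k, c in per_key.items() if k.startswith("hot_")})
```

After the salted aggregation/join, a second pass strips the salt and combines the partial results per original key.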

Problem: Write operation produces 10,000+ small files. Fix: Enable Optimized Write: spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true"). Run OPTIMIZE on the table afterward. See delta-table-optimization.md.

Problem: Job ran fine last week but is slow now. Fix: Check Delta table history for data volume growth. Run DESCRIBE HISTORY table_name and check OPTIMIZE history. Also check if workspace resource profile changed.

References

Source

git clone https://github.com/PatrickGallucci/fabric-skills

The skill file lives at skills/fabric-pyspark-perf-remediate/SKILL.md in that repository.

Overview

A structured guide to diagnose and resolve Spark performance problems in Microsoft Fabric notebooks and Spark Job Definitions. It covers common bottlenecks like data skew, shuffle storms, OOM errors, Delta Lake write slowness, and capacity throttling, plus practical tuning paths.

How This Skill Works

Follow a concise diagnostic workflow: identify bottlenecks with Spark UI and Monitoring Hub, classify the issue using a practical bottleneck matrix (shuffle, skew, memory, small files), apply targeted fixes such as broadcast joins, repartitioning, OPTIMIZE/Optimized Write, and appropriate resource profiles, then validate by re-running the job and comparing metrics.

When to Use It

  • PySpark notebook cells take unexpectedly long to execute
  • Spark Job Definitions exceed expected duration or fail with timeouts
  • Out-of-memory (OOM) errors on driver or executors
  • Excessive shuffle read/write in Spark UI stage details
  • Data skew causing some tasks to run much longer than peers

Quick Start

  1. Check capacity status in Monitoring Hub to see if the Fabric capacity is throttled or overloaded
  2. Open Spark UI, identify the slowest stage, and review task-level metrics
  3. Classify the bottleneck and apply the relevant fix (e.g., broadcast join, repartitioning, OPTIMIZE), then re-run to validate

Best Practices

  • Prefer broadcast joins when one side is small to reduce shuffle and improve join performance
  • Tune spark.sql.shuffle.partitions and target partition sizing to balance task load
  • Address data skew with salting or adaptive query execution techniques
  • Optimize Delta Lake writes with OPTIMIZE, Optimized Write, and small-file compaction
  • Choose appropriate resource profiles (readHeavy vs writeHeavy) and enable autotune/native execution where suitable
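The shuffle-partition advice above follows a common rule of thumb: size partitions near the 128 MB default of spark.sql.files.maxPartitionBytes. A hedged sketch of the arithmetic (heuristic only, not an official formula):

```python
# Back-of-envelope shuffle partition count: aim for ~128 MB per partition,
# bounded below so tiny datasets don't collapse into too few tasks.
# Heuristic sketch only; validate against the Spark UI after changing it.

def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_mb: int = 128,
                               minimum: int = 16) -> int:
    partitions = shuffle_bytes // (target_mb * 1024 * 1024)
    return max(int(partitions), minimum)

# 100 GB of shuffle data -> 800 partitions of ~128 MB each
print(suggest_shuffle_partitions(100 * 1024**3))  # → 800
```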

Example Use Cases

  • A PySpark notebook run time improved after reducing shuffle partitions and raising spark.sql.autoBroadcastJoinThreshold
  • Delta Lake writes sped up once OPTIMIZE and Optimized Write were applied to consolidate small files
  • Stage skew addressed by repartitioning/salting, balancing task durations across executors
  • OOM errors mitigated by increasing executor memory and adjusting cache/persist strategy
  • Fabric capacity throttling reduced by switching to a writeHeavy resource profile, cutting queue wait times
