fabric-delta-spark-perf
Scannednpx machina-cli add skill PatrickGallucci/fabric-skills/fabric-delta-spark-perf --openclawMicrosoft Fabric Delta Lake Spark Performance remediate
Systematic workflows for diagnosing and resolving Apache Spark and Delta Lake performance issues in Microsoft Fabric Lakehouse environments.
When to Use This Skill
Activate when the user mentions any of the following:
- Spark job is slow, taking too long, or timing out
- Small file problem, too many small files, file fragmentation
- Data skew, straggler tasks, unbalanced partitions
- Out of memory (OOM) errors on driver or executor
- Shuffle spill, excessive shuffle read/write
- OPTIMIZE, VACUUM, bin-compaction, or table maintenance
- V-Order, Z-Order, or Parquet optimization
- Resource profiles: writeHeavy, readHeavyForSpark, readHeavyForPBI
- Autotune, Adaptive Query Execution (AQE), broadcast join thresholds
- Native Execution Engine configuration
- Streaming performance, microbatch tuning, checkpoint issues
- Spark pool sizing, autoscale, dynamic executor allocation
- Direct Lake performance tied to Delta table structure
- Capacity throttling, TooManyRequestsForCapacity errors
Prerequisites
- Microsoft Fabric workspace with Data Engineering or Data Science experience
- Apache Spark notebooks or Spark Job Definitions
- Lakehouse with Delta tables
- Appropriate Fabric capacity SKU (F2 through F2048)
Quick Diagnostic Workflow
When a user reports slow Spark performance, follow this triage sequence:
Step 1: Identify the Symptom Category
| Symptom | Likely Root Cause | Jump To |
|---|---|---|
| Job runs much longer than expected | Data skew or small files | Step 2 |
| OOM error on driver | collect(), toPandas(), or large broadcast | diagnostic-checklist.md |
| OOM error on executor | Wide joins, large shuffles, insufficient memory | diagnostic-checklist.md |
| Many tasks, most finish fast, few stragglers | Data skew | diagnostic-checklist.md |
| High shuffle read/write in Spark UI | Missing broadcast join or too many partitions | diagnostic-checklist.md |
| Query reads thousands of small files | Small file problem, needs OPTIMIZE | Step 3 |
| Capacity throttled (HTTP 430) | Too many concurrent jobs for SKU | spark-configurations.md |
| Streaming lag increasing | Microbatch interval or partition mismatch | diagnostic-checklist.md |
Step 2: Check for Data Skew
Run the diagnostic script to detect skew in a target table:
# Quick skew detection
df = spark.read.format("delta").table("your_table")
df.groupBy("partition_column") \
.count() \
.orderBy(F.desc("count")) \
.show(20)
If the top partition has 10x+ more rows than the median, apply skew mitigation. See diagnostic-checklist.md for remediation steps.
Step 3: Check File Health
# Check file count and sizes for a Delta table
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, "your_table")
detail = spark.sql("DESCRIBE DETAIL your_table")
detail.select("numFiles", "sizeInBytes").show()
# Check for small files (< 32MB)
files_df = spark.sql("DESCRIBE DETAIL your_table")
If file count is high relative to data volume (e.g., >1000 files for <10GB), run OPTIMIZE. See table-maintenance-guide.md.
Step 4: Verify Spark Configuration
Check that the environment has an appropriate resource profile:
# Check current resource profile
print(spark.conf.get("spark.fabric.resourceProfile", "not set"))
# Check key write/read settings
configs = [
"spark.sql.parquet.vorder.default",
"spark.databricks.delta.optimizeWrite.enabled",
"spark.databricks.delta.optimizeWrite.binSize",
"spark.sql.shuffle.partitions",
"spark.sql.autoBroadcastJoinThreshold",
"spark.sql.files.maxPartitionBytes",
"spark.databricks.optimizer.adaptive.enabled"
]
for c in configs:
try:
print(f"{c} = {spark.conf.get(c)}")
except:
print(f"{c} = (default)")
See spark-configurations.md for recommended values per workload type.
Step 5: Apply Fix and Validate
After identifying the root cause, apply the appropriate fix from the references:
| Root Cause | Fix | Reference |
|---|---|---|
| Small files | Run OPTIMIZE with V-Order | table-maintenance-guide.md |
| Stale files bloating storage | Run VACUUM with 7+ day retention | table-maintenance-guide.md |
| Data skew on joins | Enable AQE skew join + key salting | diagnostic-checklist.md |
| Wrong resource profile | Switch to appropriate profile | spark-configurations.md |
| Driver OOM | Avoid collect(), increase driver memory | diagnostic-checklist.md |
| Executor OOM | Increase executor memory, reduce partition size | diagnostic-checklist.md |
| Excessive shuffling | Use broadcast joins for small tables | diagnostic-checklist.md |
| V-Order not applied | Enable at table or session level | table-maintenance-guide.md |
Available Scripts
| Script | Purpose |
|---|---|
| diagnose-delta-performance.py | Automated diagnostic scan: file health, skew detection, config audit |
| table-maintenance.py | Run OPTIMIZE + VACUUM across all tables in a Lakehouse schema |
Available Templates
| Template | Purpose |
|---|---|
| notebook-performance-template.py | Starter notebook with performance best practices baked in |
Key Principles
-
Profile before optimizing — Use Spark UI metrics and the diagnostic script before changing configurations. Most performance issues trace to data layout, not Spark settings.
-
Match resource profile to workload — Write-heavy ETL should use
writeHeavy(V-Order off, stats collection off). Read-heavy analytics should usereadHeavyForSparkorreadHeavyForPBI(V-Order on, Optimized Write on). -
Right-size files first — Target ~128MB–1GB per Parquet file. Use OPTIMIZE regularly for tables with frequent writes. Streaming tables need more frequent compaction.
-
Partition wisely — Only partition tables over 1TB. Use low-cardinality columns (< 200 distinct values). Over-partitioning creates the small file problem.
-
Enable AQE — Adaptive Query Execution handles skew joins, partition coalescing, and dynamic broadcast automatically. Keep it enabled.
-
VACUUM safely — Always retain at least 7 days. Shorter retention can break time travel and concurrent readers.
References
- Diagnostic Checklist — Step-by-step remediate for OOM, skew, shuffle, and streaming issues
- Spark Configurations Guide — Resource profiles, autotune, AQE, pool sizing, and capacity limits
- Table Maintenance Guide — OPTIMIZE, VACUUM, V-Order, Z-Order, and Optimized Write deep dive
Source
git clone https://github.com/PatrickGallucci/fabric-skills/blob/main/skills/fabric-delta-spark-perf/SKILL.mdView on GitHub Overview
This skill provides systematic workflows for diagnosing and resolving Apache Spark and Delta Lake performance issues in Microsoft Fabric Lakehouse environments. It helps teams quickly identify root causes such as data skew, small files, or memory pressure, and guides targeted optimizations to improve query throughput and reliability.
How This Skill Works
Use a triage-driven diagnostic workflow starting with symptom identification, then skew checks, then file health checks, followed by targeted fixes like OPTIMIZE, VACUUM, partition tuning, and enabling AQE or the Native Execution Engine. Validate changes with end-to-end performance checks to ensure improvements hold under real workloads.
When to Use It
- Spark job is slow or times out
- Small file problem and file fragmentation
- Data skew causing stragglers or unbalanced partitions
- OOM errors on driver or executor
- Shuffle bottlenecks and high shuffle read/write
Quick Start
- Step 1: Kick off the Quick Diagnostic Workflow and identify the symptom category
- Step 2: Run the skew detection script to quantify skew
- Step 3: Check Delta table health and small-file counts, then apply targeted fixes such as OPTIMIZE, VACUUM, and partition tuning
Best Practices
- Follow the Quick Diagnostic Workflow Step 1-3
- Run skew detection and remediate with partitioning or repartitioning
- Assess Delta table health and small-file counts
- Apply OPTIMIZE, VACUUM, and bin-packing as appropriate
- Tune AQE, broadcast join thresholds, and Native Execution Engine settings
Example Use Cases
- Delta table with many small files caused slow scans; after enabling OPTIMIZE and managing file sizes, performance improved
- Data skew across partitions led to stragglers; skew detected and repartitioning by key reduced runtime
- High shuffle spill and read/write bottlenecks; added broadcast joins where appropriate and increased AQE thresholds
- OOM errors on executor during wide joins; adjusted memory limits and optimized join strategies
- Streaming microbatches lag due to partition mismatch; tuned partitioning and checkpointing in Fabric notebooks