fabric-notebook-perf-remediate
npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-notebook-perf-remediate --openclaw
Microsoft Fabric Notebook Performance Remediation
Systematic toolkit for diagnosing, analyzing, and resolving performance bottlenecks in Microsoft Fabric notebooks powered by Apache Spark.
When to Use This Skill
- Fabric notebook cells are running slowly or timing out
- Spark jobs are being throttled with HTTP 430 errors
- Capacity Metrics app shows high CU consumption
- Data skew is causing unbalanced task execution
- Shuffle operations are consuming excessive resources
- Delta Lake tables have degraded read/write performance
- OOM (Out of Memory) errors during notebook execution
- Spark Advisor shows warnings or errors in cell output
- Session startup is slow or sessions expire unexpectedly
- Pipeline-triggered notebooks are queued for extended periods
Prerequisites
- Workspace Admin or Contributor role in the target Fabric workspace
- Access to the Fabric Monitoring Hub for your capacity
- Fabric Capacity Metrics app installed (for capacity-level analysis)
- Familiarity with PySpark or Spark SQL syntax
Remediation Decision Tree
Identify your symptom and follow the corresponding workflow.
| Symptom | Root Cause Category | Action |
|---|---|---|
| Notebook cell runs for minutes on small data | Spark session config or query plan | See Spark Session Tuning |
| HTTP 430 error on job submission | Capacity exhausted, concurrency limit | See Capacity and Throttling |
| One task takes 10x longer than others | Data skew | See Data Skew Diagnosis |
| Write operations are slow | V-Order overhead or small file problem | See Delta Table Optimization |
| OOM / executor lost errors | Memory pressure, partition sizing | See Memory and Partition Tuning |
| Session expired / timed out | Idle timeout settings | See Common Errors |
| Notebook queued, never starts | Queue limits for SKU | See Capacity and Throttling |
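For the "one task takes 10x longer than others" row above, a quick way to confirm skew before diving into the full diagnosis guide is to count rows per Spark partition. This is a sketch for a Fabric notebook where `spark` and a DataFrame `df` already exist; the variable names are placeholders.

```python
# Count rows per Spark partition: a handful of partitions that are far
# larger than the rest is the classic signature of data skew.
from pyspark.sql.functions import spark_partition_id

partition_counts = (
    df.groupBy(spark_partition_id().alias("partition_id"))
      .count()
      .orderBy("count", ascending=False)
)
partition_counts.show(10)
```

If the top partitions dwarf the median, proceed to the Data Skew Diagnosis workflow (AQE skew-join handling, salting, or repartitioning on a better key).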
Quick Wins Checklist
Apply these optimizations first — they resolve the majority of performance issues.
- Enable Native Execution Engine (NEE) — delivers 2x–5x improvement:
spark.conf.set("spark.native.enabled", "true")
- Enable Autotune for adaptive configuration:
spark.conf.set("spark.ms.autotune.enabled", "true")
- Right-size shuffle partitions — default 200 is often wrong:
# For datasets under 1 GB
spark.conf.set("spark.sql.shuffle.partitions", "20")
# For datasets 1-10 GB
spark.conf.set("spark.sql.shuffle.partitions", "100")
# For datasets over 10 GB, leave default or increase
- Enable Adaptive Query Execution (AQE):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
- Use DataFrame APIs instead of RDDs — enables the Catalyst optimizer and Tungsten engine.
- Break complex query chains into staged intermediate writes to reduce Catalyst plan complexity. See Common Errors.
Run the notebook health check script to audit your current session configuration.
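The quick wins above can be combined into a single configuration cell at the top of the notebook. This is a sketch assuming the `spark` session provided by the Fabric runtime; the shuffle-partition value is illustrative and should be tuned to your data volume.

```python
# Quick-wins configuration cell for a Fabric notebook session.
spark.conf.set("spark.native.enabled", "true")        # Native Execution Engine
spark.conf.set("spark.ms.autotune.enabled", "true")   # Autotune
spark.conf.set("spark.sql.adaptive.enabled", "true")  # Adaptive Query Execution
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Illustrative choice for a roughly 1-10 GB working set; see the
# shuffle-partition guidance in the checklist above.
spark.conf.set("spark.sql.shuffle.partitions", "100")
```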
Capacity and Throttling
Each Fabric SKU maps to a fixed number of Spark VCores (1 CU = 2 Spark VCores). When all VCores are consumed, new jobs receive HTTP 430 errors.
| SKU | Spark VCores | Queue Limit |
|---|---|---|
| F2 | 4 | 4 |
| F8 | 16 | 8 |
| F64 / P1 | 128 | 64 |
| F128 / P2 | 256 | 128 |
| F256 / P3 | 512 | 256 |
Resolution steps:
- Open the Monitoring Hub and cancel idle or unnecessary Spark sessions.
- Stop sessions you are not actively using — default idle timeout is 20 minutes.
- Reduce executor count in custom Spark pools to free VCores for parallel jobs.
- Enable Autoscale Billing for Spark for bursty workloads — jobs use dedicated serverless resources instead of consuming capacity CUs.
- For pipeline-triggered notebooks, leverage job queueing (FIFO). Queue expiry is 24 hours.
Queueing is not supported for interactive notebook jobs or Fabric trial capacities.
Memory and Partition Tuning
OOM errors typically stem from oversized partitions or insufficient executor memory.
Diagnose with Spark UI:
- Open the cell's Spark job progress indicator.
- Click Resources tab to view executor usage graph.
- Check the Spark Advisor light-bulb icon for memory warnings.
Tune partitions:
# Check current partition count
df.rdd.getNumPartitions()
# Repartition for parallelism (increases partitions)
df = df.repartition(200)
# Coalesce to reduce partitions (avoids full shuffle)
df = df.coalesce(50)
# Adjust max partition bytes for reads
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")
Tune task memory:
# For memory-intensive tasks causing OOM, give each task more of an
# executor's resources (fewer concurrent tasks, more memory per task).
# Note: spark.task.cpus must be a positive integer.
spark.conf.set("spark.task.cpus", "2")
# For CPU-bound tasks needing more parallelism, leave spark.task.cpus at
# its default of 1 (fractional values such as 0.5 are not valid) and
# increase executor core counts in the pool configuration instead.
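A common rule of thumb is to size partitions at roughly 128-256 MB each. The partition-count choice above can be sketched as a small pure-Python helper (hypothetical, not part of any Spark API):

```python
def suggested_partitions(dataset_bytes: int,
                         target_partition_bytes: int = 128 * 1024 * 1024,
                         min_partitions: int = 1) -> int:
    """Estimate a repartition count so each partition holds
    roughly target_partition_bytes (default 128 MB)."""
    if dataset_bytes <= 0:
        return min_partitions
    # Ceiling division so a trailing partial partition is still counted
    return max(min_partitions, -(-dataset_bytes // target_partition_bytes))

# A 5 GB dataset at 128 MB per partition -> 40 partitions
print(suggested_partitions(5 * 1024**3))  # 40
```

The result can then feed `df.repartition(n)` or guide `spark.sql.shuffle.partitions`.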
Monitoring and Diagnostics
In-Notebook Monitoring
- Spark job progress bar — real-time per-cell execution status
- Resources tab — executor allocation and resource usage line chart (Spark 3.4+)
- Spark Advisor — Info/Warning/Error recommendations per cell (expand via light-bulb icon)
Monitoring Hub
Navigate to Monitoring Hub to view all active Spark applications across your workspace. Key actions: cancel sessions, view executor count, check job duration, identify queued jobs.
Capacity Metrics App
Filter by item type (Notebook, Lakehouse, Spark Job Definition) to see CU consumption per job. Use the Multi metric ribbon chart to identify capacity spikes over time.
Formula: CU-seconds consumed = (Total Spark VCores ÷ 2) × duration in seconds (since 1 CU = 2 Spark VCores)
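To illustrate the formula (1 CU = 2 Spark VCores), here is a hypothetical helper that converts a job's VCore allocation and runtime into CU-seconds:

```python
def cu_seconds(total_spark_vcores: float, duration_seconds: float) -> float:
    """CU-seconds consumed by a Spark job, given 1 CU = 2 Spark VCores."""
    return (total_spark_vcores / 2) * duration_seconds

# A job holding 16 VCores (i.e. 8 CU) for 10 minutes
print(cu_seconds(16, 600))  # 4800.0 CU-seconds
```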
Spark Pool Configuration
Starter Pools (Default)
Session initialization in 5–10 seconds, pre-configured, no manual setup. Good for development and small workloads.
Custom Spark Pools
Configure via Workspace Settings → Data Engineering/Science → Spark Settings:
| Scenario | Node Size | Guidance |
|---|---|---|
| Transform-heavy with shuffles and joins | Large (16–64 cores) | Maximize per-node memory |
| Bursty or unpredictable jobs | Medium + Autoscale | Let cluster grow/shrink dynamically |
| Many small parallel jobs | Small/Medium | Use mssparkutils.notebook.runMultiple() |
| Development / exploration | Small, single node | Driver and executor share 1 VM |
| ML / distributed training | Many medium/large nodes | Maximize parallelism |
Enable Customize compute configurations for items in Workspace Settings → Pool tab to allow per-notebook pool overrides.
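For the "many small parallel jobs" scenario above, `mssparkutils.notebook.runMultiple()` fans child notebooks out inside one Spark session instead of paying session startup per notebook. A sketch, assuming a Fabric notebook where `mssparkutils` is available; the notebook names are placeholders:

```python
# Run several small notebooks concurrently in the current session.
# "ingest_orders" etc. are hypothetical notebook names in this workspace.
results = mssparkutils.notebook.runMultiple(
    ["ingest_orders", "ingest_customers", "ingest_products"]
)
print(results)  # per-notebook run results
```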
Resource Profiles
Use predefined Spark resource profiles to auto-configure for your workload type:
| Profile | Best For |
|---|---|
| Default | General-purpose workloads |
| readHeavyforSpark / ReadHeavy | Interactive queries, dashboards (enables V-Order) |
| Write-heavy | Data ingestion pipelines (V-Order disabled by default) |
Library Management Impact
Library installation in Fabric environments takes 5–15 minutes during publishing. For interactive development, use inline installation (%pip install) to avoid environment republish delays. However, inline commands are turned off by default in pipeline runs due to dependency tree instability.
| Method | Session Impact | Pipeline Safe |
|---|---|---|
| Environment libraries | None (pre-installed) | Yes |
| Inline %pip install | Current session only | No |
| Inline install.packages() (R) | Current session only | No |
References
- Spark Session Tuning — NEE, autotune, session configs, resource profiles
- Data Skew Diagnosis — Identifying skew, AQE, salting, repartitioning
- Delta Table Optimization — V-Order, OPTIMIZE, VACUUM, ZORDER
- Common Errors — Error messages, timeouts, connectivity, excessive query complexity
- Notebook Health Check Script — PySpark diagnostic script
- Spark Config Template — Starter configuration cell
Source
https://github.com/PatrickGallucci/fabric-skills/blob/main/skills/fabric-notebook-perf-remediate/SKILL.md
Overview
This skill provides a systematic toolkit to diagnose and resolve Spark-based performance bottlenecks in Microsoft Fabric notebooks. It covers common symptoms like slow cells, HTTP 430 throttling, data skew, inefficient shuffles, Delta Lake issues, and OOM errors, and guides practical tuning from Spark session tweaks to the Native Execution Engine, autotune, custom Spark pools, partition optimization, and Monitoring Hub diagnostics.
How This Skill Works
Use the symptom-driven Remediation Decision Tree to identify the root cause and apply targeted actions. Start with quick wins such as enabling the Native Execution Engine, Autotune, and AQE; then right-size shuffle partitions and partition sizes, and consult the referenced tuning guides for deeper issues such as data skew, Delta Lake optimization, and memory/partition tuning.
When to Use It
- Notebook cells are running slowly or timing out.
- Spark jobs are throttled with HTTP 430 errors.
- Data skew causes unbalanced task execution.
- OOM errors occur during notebook execution.
- Session startup is slow or sessions expire unexpectedly.
Quick Start
- Step 1: Run the notebook health check script and review results in the Monitoring Hub and Fabric Capacity Metrics app.
- Step 2: Enable Native Execution Engine and Autotune:
  spark.conf.set("spark.native.enabled", "true")
  spark.conf.set("spark.ms.autotune.enabled", "true")
- Step 3: Right-size shuffle partitions and enable AQE:
  spark.conf.set("spark.sql.shuffle.partitions", "100")
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
  spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
Best Practices
- Enable Native Execution Engine (NEE) to gain 2x–5x speedups (spark.native.enabled = true).
- Enable Autotune (spark.ms.autotune.enabled = true) for adaptive configuration.
- Right-size shuffle partitions based on dataset size to avoid over/under-partitioning.
- Enable Adaptive Query Execution (AQE) with coalescing and skew join handling.
- Prefer DataFrame APIs and break complex queries into staged intermediate writes to reduce Catalyst plan complexity.
Example Use Cases
- Notebook with slow cells: enable NEE, Autotune, and AQE to accelerate execution.
- Spark job hitting HTTP 430: diagnose capacity throttling via Monitoring Hub and adjust concurrency.
- Data skew causing long-tail tasks: diagnose with data-skew tools and repartition accordingly.
- Delta Lake reads/writes degraded: perform Delta Table Optimization.
- OOM errors on large notebooks: resize partitions and adjust memory usage.