fabric-notebook-perf-remediate
npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-notebook-perf-remediate --openclaw
Microsoft Fabric Notebook Performance Remediation
Systematic toolkit for diagnosing, analyzing, and resolving performance bottlenecks in Microsoft Fabric notebooks powered by Apache Spark.
When to Use This Skill
- Fabric notebook cells are running slowly or timing out
- Spark jobs are being throttled with HTTP 430 errors
- Capacity Metrics app shows high CU consumption
- Data skew is causing unbalanced task execution
- Shuffle operations are consuming excessive resources
- Delta Lake tables have degraded read/write performance
- OOM (Out of Memory) errors during notebook execution
- Spark Advisor shows warnings or errors in cell output
- Session startup is slow or sessions expire unexpectedly
- Pipeline-triggered notebooks are queued for extended periods
Prerequisites
- Workspace Admin or Contributor role in the target Fabric workspace
- Access to the Fabric Monitoring Hub for your capacity
- Fabric Capacity Metrics app installed (for capacity-level analysis)
- Familiarity with PySpark or Spark SQL syntax
Remediation Decision Tree
Identify your symptom and follow the corresponding workflow.
| Symptom | Root Cause Category | Action |
|---|---|---|
| Notebook cell runs for minutes on small data | Spark session config or query plan | See Spark Session Tuning |
| HTTP 430 error on job submission | Capacity exhausted, concurrency limit | See Capacity and Throttling |
| One task takes 10x longer than others | Data skew | See Data Skew Diagnosis |
| Write operations are slow | V-Order overhead or small file problem | See Delta Table Optimization |
| OOM / executor lost errors | Memory pressure, partition sizing | See Memory and Partition Tuning |
| Session expired / timed out | Idle timeout settings | See Common Errors |
| Notebook queued, never starts | Queue limits for SKU | See Capacity and Throttling |
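For the "one task takes 10x longer than others" row above, a quick way to confirm skew before diving into the full diagnosis guide is to count rows per Spark partition. This is a sketch for a Fabric notebook where `spark` and a DataFrame `df` already exist; the variable names are placeholders.

```python
# Count rows per Spark partition: a handful of partitions that are far
# larger than the rest is the classic signature of data skew.
from pyspark.sql.functions import spark_partition_id

partition_counts = (
    df.groupBy(spark_partition_id().alias("partition_id"))
      .count()
      .orderBy("count", ascending=False)
)
partition_counts.show(10)
```

If the top partitions dwarf the median, proceed to the Data Skew Diagnosis workflow (AQE skew-join handling, salting, or repartitioning on a better key).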
Quick Wins Checklist
Apply these optimizations first — they resolve the majority of performance issues.
- Enable Native Execution Engine (NEE) — delivers 2x–5x improvement:
spark.conf.set("spark.native.enabled", "true")
- Enable Autotune for adaptive configuration:
spark.conf.set("spark.ms.autotune.enabled", "true")
- Right-size shuffle partitions — default 200 is often wrong:
# For datasets under 1 GB
spark.conf.set("spark.sql.shuffle.partitions", "20")
# For datasets 1-10 GB
spark.conf.set("spark.sql.shuffle.partitions", "100")
# For datasets over 10 GB, leave default or increase
- Enable Adaptive Query Execution (AQE):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
- Use DataFrame APIs instead of RDDs — enables the Catalyst optimizer and Tungsten engine.
- Break complex query chains into staged intermediate writes to reduce Catalyst plan complexity. See Common Errors.
Run the notebook health check script to audit your current session configuration.
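The quick wins above can be combined into a single configuration cell at the top of the notebook. This is a sketch assuming the `spark` session provided by the Fabric runtime; the shuffle-partition value is illustrative and should be tuned to your data volume.

```python
# Quick-wins configuration cell for a Fabric notebook session.
spark.conf.set("spark.native.enabled", "true")        # Native Execution Engine
spark.conf.set("spark.ms.autotune.enabled", "true")   # Autotune
spark.conf.set("spark.sql.adaptive.enabled", "true")  # Adaptive Query Execution
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Illustrative choice for a roughly 1-10 GB working set; see the
# shuffle-partition guidance in the checklist above.
spark.conf.set("spark.sql.shuffle.partitions", "100")
```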
Capacity and Throttling
Each Fabric SKU maps to a fixed number of Spark VCores (1 CU = 2 Spark VCores). When all VCores are consumed, new jobs receive HTTP 430 errors.
| SKU | Spark VCores | Queue Limit |
|---|---|---|
| F2 | 4 | 4 |
| F8 | 16 | 8 |
| F64 / P1 | 128 | 64 |
| F128 / P2 | 256 | 128 |
| F256 / P3 | 512 | 256 |
Resolution steps:
- Open the Monitoring Hub and cancel idle or unnecessary Spark sessions.
- Stop sessions you are not actively using — default idle timeout is 20 minutes.
- Reduce executor count in custom Spark pools to free VCores for parallel jobs.
- Enable Autoscale Billing for Spark for bursty workloads — jobs use dedicated serverless resources instead of consuming capacity CUs.
- For pipeline-triggered notebooks, leverage job queueing (FIFO). Queue expiry is 24 hours.
Queueing is not supported for interactive notebook jobs or Fabric trial capacities.
Memory and Partition Tuning
OOM errors typically stem from oversized partitions or insufficient executor memory.
Diagnose with Spark UI:
- Open the cell's Spark job progress indicator.
- Click Resources tab to view executor usage graph.
- Check the Spark Advisor light-bulb icon for memory warnings.
Tune partitions:
# Check current partition count
df.rdd.getNumPartitions()
# Repartition for parallelism (increases partitions)
df = df.repartition(200)
# Coalesce to reduce partitions (avoids full shuffle)
df = df.coalesce(50)
# Adjust max partition bytes for reads
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")
Tune task memory:
# For memory-intensive tasks causing OOM, give each task more of an
# executor's resources (fewer concurrent tasks, more memory per task).
# Note: spark.task.cpus must be a positive integer.
spark.conf.set("spark.task.cpus", "2")
# For CPU-bound tasks needing more parallelism, leave spark.task.cpus at
# its default of 1 (fractional values such as 0.5 are not valid) and
# increase executor core counts in the pool configuration instead.
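A common rule of thumb is to size partitions at roughly 128-256 MB each. The partition-count choice above can be sketched as a small pure-Python helper (hypothetical, not part of any Spark API):

```python
def suggested_partitions(dataset_bytes: int,
                         target_partition_bytes: int = 128 * 1024 * 1024,
                         min_partitions: int = 1) -> int:
    """Estimate a repartition count so each partition holds
    roughly target_partition_bytes (default 128 MB)."""
    if dataset_bytes <= 0:
        return min_partitions
    # Ceiling division so a trailing partial partition is still counted
    return max(min_partitions, -(-dataset_bytes // target_partition_bytes))

# A 5 GB dataset at 128 MB per partition -> 40 partitions
print(suggested_partitions(5 * 1024**3))  # 40
```

The result can then feed `df.repartition(n)` or guide `spark.sql.shuffle.partitions`.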
Monitoring and Diagnostics
In-Notebook Monitoring
- Spark job progress bar — real-time per-cell execution status
- Resources tab — executor allocation and resource usage line chart (Spark 3.4+)
- Spark Advisor — Info/Warning/Error recommendations per cell (expand via light-bulb icon)
Monitoring Hub
Navigate to Monitoring Hub to view all active Spark applications across your workspace. Key actions: cancel sessions, view executor count, check job duration, identify queued jobs.
Capacity Metrics App
Filter by item type (Notebook, Lakehouse, Spark Job Definition) to see CU consumption per job. Use the Multi metric ribbon chart to identify capacity spikes over time.
Formula: CU-seconds consumed = (Total Spark VCores ÷ 2) × duration in seconds (since 1 CU = 2 Spark VCores)
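To illustrate the formula (1 CU = 2 Spark VCores), here is a hypothetical helper that converts a job's VCore allocation and runtime into CU-seconds:

```python
def cu_seconds(total_spark_vcores: float, duration_seconds: float) -> float:
    """CU-seconds consumed by a Spark job, given 1 CU = 2 Spark VCores."""
    return (total_spark_vcores / 2) * duration_seconds

# A job holding 16 VCores (i.e. 8 CU) for 10 minutes
print(cu_seconds(16, 600))  # 4800.0 CU-seconds
```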
Spark Pool Configuration
Starter Pools (Default)
Session initialization in 5–10 seconds, pre-configured, no manual setup. Good for development and small workloads.
Custom Spark Pools
Configure via Workspace Settings → Data Engineering/Science → Spark Settings:
| Scenario | Node Size | Guidance |
|---|---|---|
| Transform-heavy with shuffles and joins | Large (16–64 cores) | Maximize per-node memory |
| Bursty or unpredictable jobs | Medium + Autoscale | Let cluster grow/shrink dynamically |
| Many small parallel jobs | Small/Medium | Use mssparkutils.notebook.runMultiple() |
| Development / exploration | Small, single node | Driver and executor share 1 VM |
| ML / distributed training | Many medium/large nodes | Maximize parallelism |
Enable Customize compute configurations for items in Workspace Settings → Pool tab to allow per-notebook pool overrides.
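For the "many small parallel jobs" scenario above, `mssparkutils.notebook.runMultiple()` fans child notebooks out inside one Spark session instead of paying session startup per notebook. A sketch, assuming a Fabric notebook where `mssparkutils` is available; the notebook names are placeholders:

```python
# Run several small notebooks concurrently in the current session.
# "ingest_orders" etc. are hypothetical notebook names in this workspace.
results = mssparkutils.notebook.runMultiple(
    ["ingest_orders", "ingest_customers", "ingest_products"]
)
print(results)  # per-notebook run results
```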
Resource Profiles
Use predefined Spark resource profiles to auto-configure for your workload type:
| Profile | Best For |
|---|---|
| Default | General-purpose workloads |
| readHeavyforSpark / ReadHeavy | Interactive queries, dashboards (enables V-Order) |
| Write-heavy | Data ingestion pipelines (V-Order disabled by default) |
Library Management Impact
Library installation in Fabric environments takes 5–15 minutes during publishing. For interactive development, use inline installation (%pip install) to avoid environment republish delays. However, inline commands are turned off by default in pipeline runs due to dependency tree instability.
| Method | Session Impact | Pipeline Safe |
|---|---|---|
| Environment libraries | None (pre-installed) | Yes |
| Inline %pip install | Current session only | No |
| Inline install.packages() (R) | Current session only | No |
References
- Spark Session Tuning — NEE, autotune, session configs, resource profiles
- Data Skew Diagnosis — Identifying skew, AQE, salting, repartitioning
- Delta Table Optimization — V-Order, OPTIMIZE, VACUUM, ZORDER
- Common Errors — Error messages, timeouts, connectivity, excessive query complexity
- Notebook Health Check Script — PySpark diagnostic script
- Spark Config Template — Starter configuration cell
Source
https://github.com/PatrickGallucci/fabric-skills/blob/main/skills/fabric-notebook-perf-remediate/SKILL.md
Overview
This skill provides a systematic toolkit to diagnose and resolve Spark-based performance bottlenecks in Microsoft Fabric notebooks. It covers common symptoms like slow cells, HTTP 430 throttling, data skew, inefficient shuffles, Delta Lake issues, and OOM errors, and guides practical tuning from Spark session tweaks to the Native Execution Engine, autotune, custom Spark pools, partition optimization, and Monitoring Hub diagnostics.
How This Skill Works
Use the symptom-driven Remediation Decision Tree to identify the root cause and apply targeted actions. Start with quick wins such as enabling the Native Execution Engine, Autotune, and AQE; then right-size shuffle partitions and partition sizes, and consult the referenced tuning guides for deeper issues such as data skew, Delta Lake optimization, and memory/partition tuning.
When to Use It
- Notebook cells are running slowly or timing out.
- Spark jobs are throttled with HTTP 430 errors.
- Data skew causes unbalanced task execution.
- OOM errors occur during notebook execution.
- Session startup is slow or sessions expire unexpectedly.
Quick Start
- Step 1: Run the notebook health check script and review results in the Monitoring Hub and Fabric Capacity Metrics app.
- Step 2: Enable Native Execution Engine and Autotune:
  spark.conf.set("spark.native.enabled", "true")
  spark.conf.set("spark.ms.autotune.enabled", "true")
- Step 3: Right-size shuffle partitions and enable AQE:
  spark.conf.set("spark.sql.shuffle.partitions", "100")
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
  spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
Best Practices
- Enable Native Execution Engine (NEE) to gain 2x–5x speedups (spark.native.enabled = true).
- Enable Autotune (spark.ms.autotune.enabled = true) for adaptive configuration.
- Right-size shuffle partitions based on dataset size to avoid over/under-partitioning.
- Enable Adaptive Query Execution (AQE) with coalescing and skew join handling.
- Prefer DataFrame APIs and break complex queries into staged intermediate writes to reduce Catalyst plan complexity.
Example Use Cases
- Notebook with slow cells: enable NEE, Autotune, and AQE to accelerate execution.
- Spark job hitting HTTP 430: diagnose capacity throttling via Monitoring Hub and adjust concurrency.
- Data skew causing long-tail tasks: diagnose with data-skew tools and repartition accordingly.
- Delta Lake reads/writes degraded: perform Delta Table Optimization.
- OOM errors on large notebooks: resize partitions and adjust memory usage.