fabric-notebook-perf-remediate

npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-notebook-perf-remediate --openclaw

Microsoft Fabric Notebook Performance Remediation

Systematic toolkit for diagnosing, analyzing, and resolving performance bottlenecks in Microsoft Fabric notebooks powered by Apache Spark.

When to Use This Skill

  • Fabric notebook cells are running slowly or timing out
  • Spark jobs are being throttled with HTTP 430 errors
  • Capacity Metrics app shows high CU consumption
  • Data skew is causing unbalanced task execution
  • Shuffle operations are consuming excessive resources
  • Delta Lake tables have degraded read/write performance
  • OOM (Out of Memory) errors during notebook execution
  • Spark Advisor shows warnings or errors in cell output
  • Session startup is slow or sessions expire unexpectedly
  • Pipeline-triggered notebooks are queued for extended periods

Prerequisites

  • Workspace Admin or Contributor role in the target Fabric workspace
  • Access to the Fabric Monitoring Hub for your capacity
  • Fabric Capacity Metrics app installed (for capacity-level analysis)
  • Familiarity with PySpark or Spark SQL syntax

Remediation Decision Tree

Identify your symptom and follow the corresponding workflow.

| Symptom | Root Cause Category | Action |
| --- | --- | --- |
| Notebook cell runs for minutes on small data | Spark session config or query plan | See Spark Session Tuning |
| HTTP 430 error on job submission | Capacity exhausted, concurrency limit | See Capacity and Throttling |
| One task takes 10x longer than others | Data skew | See Data Skew Diagnosis |
| Write operations are slow | V-Order overhead or small file problem | See Delta Table Optimization |
| OOM / executor lost errors | Memory pressure, partition sizing | See Memory and Partition Tuning |
| Session expired / timed out | Idle timeout settings | See Common Errors |
| Notebook queued, never starts | Queue limits for SKU | See Capacity and Throttling |

Quick Wins Checklist

Apply these optimizations first — they resolve the majority of performance issues.

  1. Enable the Native Execution Engine (NEE) for a typical 2x–5x improvement:
spark.conf.set("spark.native.enabled", "true")
  2. Enable Autotune for adaptive configuration:
spark.conf.set("spark.ms.autotune.enabled", "true")
  3. Right-size shuffle partitions; the default of 200 is often wrong:
# For datasets under 1 GB
spark.conf.set("spark.sql.shuffle.partitions", "20")
# For datasets of 1–10 GB
spark.conf.set("spark.sql.shuffle.partitions", "100")
# For datasets over 10 GB, keep the default or increase it
  4. Enable Adaptive Query Execution (AQE):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
  5. Use DataFrame APIs instead of RDDs to enable the Catalyst optimizer and Tungsten execution engine.
  6. Break complex query chains into staged intermediate writes to reduce Catalyst plan complexity (see the sketch after this list, and Common Errors).
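
A minimal sketch of the staged-write pattern, assuming a Lakehouse-attached notebook; the table and column names (sales_raw, sales_stage1, customer_id, amount, order_date) are hypothetical:

from pyspark.sql import functions as F

# Stage 1: materialize the expensive aggregation so the next query starts
# from a short plan instead of one long Catalyst lineage.
stage1 = (
    spark.read.table("sales_raw")  # hypothetical source table
    .filter(F.col("order_date") >= "2024-01-01")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)
stage1.write.mode("overwrite").saveAsTable("sales_stage1")

# Stage 2: continue the pipeline from the staged table.
stage2 = spark.read.table("sales_stage1").filter(F.col("total_amount") > 1000)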

Run the notebook health check script to audit your current session configuration.
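
The health check script itself is not reproduced here; a minimal stand-in that audits the settings from the checklist above, assuming the implicit spark session Fabric notebooks provide, could look like this:

# Print the session settings covered in the Quick Wins checklist; a
# default is supplied because conf.get raises for keys that were never set.
for key in [
    "spark.native.enabled",
    "spark.ms.autotune.enabled",
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.sql.adaptive.skewJoin.enabled",
]:
    print(f"{key} = {spark.conf.get(key, '<not set>')}")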

Capacity and Throttling

Each Fabric SKU maps to a fixed number of Spark VCores (1 CU = 2 Spark VCores). When all VCores are consumed, new jobs receive HTTP 430 errors.

| SKU | Spark VCores | Queue Limit |
| --- | --- | --- |
| F2 | 4 | 4 |
| F8 | 16 | 8 |
| F64 / P1 | 128 | 64 |
| F128 / P2 | 256 | 128 |
| F256 / P3 | 512 | 256 |

Resolution steps:

  1. Open the Monitoring Hub and cancel idle or unnecessary Spark sessions.
  2. Stop sessions you are not actively using — default idle timeout is 20 minutes.
  3. Reduce executor count in custom Spark pools to free VCores for parallel jobs.
  4. Enable Autoscale Billing for Spark for bursty workloads — jobs use dedicated serverless resources instead of consuming capacity CUs.
  5. For pipeline-triggered notebooks, leverage job queueing (FIFO). Queue expiry is 24 hours.

Queueing is not supported for interactive notebook jobs or Fabric trial capacities.

Memory and Partition Tuning

OOM errors typically stem from oversized partitions or insufficient executor memory.

Diagnose with Spark UI:

  1. Open the cell's Spark job progress indicator.
  2. Click the Resources tab to view the executor usage graph.
  3. Check the Spark Advisor light-bulb icon for memory warnings.

Tune partitions:

# Check current partition count
df.rdd.getNumPartitions()

# Repartition for parallelism (increases partitions)
df = df.repartition(200)

# Coalesce to reduce partitions (avoids full shuffle)
df = df.coalesce(50)

# Adjust max partition bytes for reads
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")

Tune task memory (spark.task.cpus takes positive integers only; fractional values are invalid, and the setting is read at session start rather than mid-session):

# Memory-intensive tasks causing OOM: reserve 2 cores per task so each
# task gets a larger share of executor memory, at the cost of concurrency.
spark.conf.set("spark.task.cpus", "2")

# CPU-bound tasks: the default of 1 core per task maximizes the number
# of concurrent tasks per executor.
spark.conf.set("spark.task.cpus", "1")

Monitoring and Diagnostics

In-Notebook Monitoring

  • Spark job progress bar — real-time per-cell execution status
  • Resources tab — executor allocation and resource usage line chart (Spark 3.4+)
  • Spark Advisor — Info/Warning/Error recommendations per cell (expand via light-bulb icon)

Monitoring Hub

Navigate to Monitoring Hub to view all active Spark applications across your workspace. Key actions: cancel sessions, view executor count, check job duration, identify queued jobs.

Capacity Metrics App

Filter by item type (Notebook, Lakehouse, Spark Job Definition) to see CU consumption per job. Use the Multi metric ribbon chart to identify capacity spikes over time.

Formula: CU consumption (CU-seconds) = (total Spark VCores ÷ 2) × duration in seconds
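
For example, a job that holds 8 Spark VCores for 10 minutes consumes (8 ÷ 2) × 600 s = 2,400 CU-seconds.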

Spark Pool Configuration

Starter Pools (Default)

Session initialization in 5–10 seconds, pre-configured, no manual setup. Good for development and small workloads.

Custom Spark Pools

Configure via Workspace Settings → Data Engineering/Science → Spark Settings:

| Scenario | Node Size | Guidance |
| --- | --- | --- |
| Transform-heavy with shuffles and joins | Large (16–64 cores) | Maximize per-node memory |
| Bursty or unpredictable jobs | Medium + Autoscale | Let the cluster grow/shrink dynamically |
| Many small parallel jobs | Small/Medium | Use mssparkutils.notebook.runMultiple() (sketch below) |
| Development / exploration | Small, single node | Driver and executor share 1 VM |
| ML / distributed training | Many medium/large nodes | Maximize parallelism |
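
A minimal fan-out sketch with runMultiple, which runs the listed notebooks concurrently within the session's available resources; the notebook names are hypothetical:

# mssparkutils is preloaded in Fabric notebooks. Running many small
# notebooks inside one session avoids per-job session startup and
# VCore reservation; the names below are placeholders.
mssparkutils.notebook.runMultiple(["ingest_orders", "ingest_customers", "ingest_returns"])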

Enable Customize compute configurations for items in Workspace Settings → Pool tab to allow per-notebook pool overrides.

Resource Profiles

Use predefined Spark resource profiles to auto-configure for your workload type:

| Profile | Best For |
| --- | --- |
| Default | General-purpose workloads |
| readHeavyForSpark / ReadHeavy | Interactive queries, dashboards (enables V-Order) |
| writeHeavy | Data ingestion pipelines (V-Order disabled by default) |
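
Profiles can reportedly also be switched at runtime; a minimal sketch, where the property name spark.fabric.resourceProfile and the profile identifier are assumptions based on current Fabric documentation:

# Assumed property and profile names; verify against your runtime's docs.
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")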

Library Management Impact

Library installation in Fabric environments takes 5–15 minutes during publishing. For interactive development, use inline installation (%pip install) to avoid environment republish delays. Note, however, that inline installation commands are disabled by default in pipeline runs because they can destabilize the dependency tree.

| Method | Session Impact | Pipeline Safe |
| --- | --- | --- |
| Environment libraries | None (pre-installed) | Yes |
| Inline %pip install | Current session only | No |
| Inline install.packages() (R) | Current session only | No |
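
A minimal inline install, scoped to the current session only (the package is an arbitrary example):

%pip install xgboost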


Source

git clone https://github.com/PatrickGallucci/fabric-skills
View on GitHub: https://github.com/PatrickGallucci/fabric-skills/blob/main/skills/fabric-notebook-perf-remediate/SKILL.md

Overview

This skill provides a systematic toolkit to diagnose and resolve Spark-based performance bottlenecks in Microsoft Fabric notebooks. It covers common symptoms like slow cells, HTTP 430 throttling, data skew, inefficient shuffles, Delta Lake issues, and OOM errors, and guides practical tuning from Spark session tweaks to the Native Execution Engine, autotune, custom Spark pools, partition optimization, and Monitoring Hub diagnostics.

How This Skill Works

Use the symptom-driven remediation workflow (the Remediation Decision Tree above) to identify root causes and apply targeted actions. Start with quick wins such as enabling the Native Execution Engine, Autotune, and AQE, then adjust shuffle partitions and partition sizing, and consult the referenced tuning guides for deeper issues like data skew, Delta Lake optimization, and memory/partition tuning.

When to Use It

  • Notebook cells are running slowly or timing out.
  • Spark jobs are throttled with HTTP 430 errors.
  • Data skew causes unbalanced task execution.
  • OOM errors occur during notebook execution.
  • Session startup is slow or sessions expire unexpectedly.

Quick Start

  1. Run the notebook health check script and review the results in the Monitoring Hub and Fabric Capacity Metrics app.
  2. Enable the Native Execution Engine and Autotune: spark.conf.set("spark.native.enabled", "true"); spark.conf.set("spark.ms.autotune.enabled", "true").
  3. Right-size shuffle partitions and enable AQE: spark.conf.set("spark.sql.shuffle.partitions", "100"); spark.conf.set("spark.sql.adaptive.enabled", "true"); spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true"); spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true").

Best Practices

  • Enable Native Execution Engine (NEE) to gain 2x–5x speedups (spark.native.enabled = true).
  • Enable Autotune (spark.ms.autotune.enabled = true) for adaptive configuration.
  • Right-size shuffle partitions based on dataset size to avoid over/under-partitioning.
  • Enable Adaptive Query Execution (AQE) with coalescing and skew join handling.
  • Prefer DataFrame APIs and break complex queries into staged intermediate writes to reduce Catalyst plan complexity.

Example Use Cases

  • Notebook with slow cells: enable NEE, Autotune, and AQE to accelerate execution.
  • Spark job hitting HTTP 430: diagnose capacity throttling via Monitoring Hub and adjust concurrency.
  • Data skew causing long-tail tasks: diagnose with data-skew tools and repartition accordingly.
  • Delta Lake reads/writes degraded: perform Delta Table Optimization.
  • OOM errors on large notebooks: resize partitions and adjust memory usage.

