
fabric-onelake-perf-remediate

npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-onelake-perf-remediate --openclaw

OneLake Performance Remediation

Systematic diagnostic and remediation toolkit for Microsoft Fabric OneLake performance issues. Covers the full stack from capacity-level throttling down to individual Delta table file layout problems.

When to Use This Skill

  • OneLake read or write operations are slow or timing out
  • Lakehouse or warehouse queries have unexpectedly high latency
  • Spark jobs are being throttled with HTTP 430 errors
  • Delta tables have accumulated many small files (small file problem)
  • Direct Lake semantic models are falling back to DirectQuery
  • Cold cache performance is significantly slower than warm cache
  • Cross-region data access is adding network latency
  • V-Order is not applied or needs to be enabled/disabled
  • Table maintenance (OPTIMIZE, VACUUM) is failing or not improving performance
  • Capacity utilization is high and jobs are queuing

Prerequisites

  • Microsoft Fabric workspace with Contributor or higher role
  • Access to the Monitoring Hub in the Fabric portal
  • PowerShell 7+ with Az.Fabric module (for automation scripts)
  • Familiarity with Spark SQL or T-SQL for diagnostic queries

Diagnostic Decision Tree

Follow this sequence to isolate the root cause:

1. Is the issue capacity-level? → Check Spark VCore utilization and queue depth
2. Is the issue cold cache? → Check data_scanned_remote_storage_mb
3. Is the issue file layout? → Check small file count and V-Order status
4. Is the issue cross-region? → Verify data and capacity are co-located
5. Is the issue query design? → Check string column widths, partition pruning
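The five-step sequence above can be sketched as a small triage helper. The metric field names and thresholds here are illustrative assumptions for the sketch, not official Fabric telemetry fields (only `data_scanned_remote_storage_mb` comes from the workflows below).

```python
# Illustrative triage of the five-step decision tree above.
# Field names and thresholds are assumptions, not official Fabric metrics.

def triage(metrics: dict) -> str:
    """Return the most likely root-cause bucket for a slow OneLake workload."""
    # 1. Capacity-level: VCores exhausted or jobs queuing
    if metrics.get("vcore_utilization_pct", 0) >= 90 or metrics.get("queue_depth", 0) > 0:
        return "capacity"
    # 2. Cold cache: the run had to pull data from remote storage
    if metrics.get("data_scanned_remote_storage_mb", 0) > 0:
        return "cold-cache"
    # 3. File layout: many small files, or V-Order missing on a read-heavy table
    if metrics.get("avg_file_size_mb", 1024) < 128 or not metrics.get("v_order", True):
        return "file-layout"
    # 4. Cross-region: data and capacity in different regions
    if metrics.get("data_region") != metrics.get("capacity_region"):
        return "cross-region"
    # 5. Otherwise look at the query itself
    return "query-design"

print(triage({"vcore_utilization_pct": 95, "queue_depth": 3}))  # capacity
```

Checking the buckets in this order matters: a throttled capacity makes every downstream symptom look worse, so rule it out first.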

Step-by-Step Workflows

Workflow 1: Diagnose Capacity Throttling

When Spark jobs fail with HTTP 430 (TooManyRequestsForCapacity):

  1. Open the Monitoring Hub in the Fabric portal
  2. Check active Spark sessions against your SKU's VCore limit (1 CU = 2 Spark VCores)
  3. Review the queue depth against your SKU's queue limit (see capacity-sku-reference.md)
  4. Cancel unnecessary jobs or scale up the capacity SKU
  5. For burst workloads, use the spark-capacity-check.ps1 script to monitor utilization
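The headroom math in step 2 follows directly from the 1 CU = 2 Spark VCores rule. A minimal sketch, assuming the standard F-SKU naming (F64 = 64 CU) and ignoring per-SKU burst multipliers:

```python
# VCore headroom check based on the 1 CU = 2 Spark VCores rule from step 2.
# F-SKU name encodes capacity units (F64 = 64 CU); burst multipliers vary
# by SKU and are deliberately omitted here.

def spark_vcore_limit(sku: str) -> int:
    """Base Spark VCores for an F-SKU, e.g. 'F64' -> 128."""
    capacity_units = int(sku.lstrip("F"))
    return capacity_units * 2

def has_headroom(sku: str, active_vcores: int) -> bool:
    return active_vcores < spark_vcore_limit(sku)

print(spark_vcore_limit("F64"))   # 128
print(has_headroom("F64", 130))   # False -> expect queuing or HTTP 430
```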

Workflow 2: Resolve Cold Cache Latency

When first query execution is significantly slower than subsequent runs:

  1. Query the queryinsights.exec_requests_history view
  2. Check the data_scanned_remote_storage_mb column — non-zero indicates cold start
  3. Do NOT judge performance on first execution; measure subsequent runs
  4. For pre-warming strategies and diagnostic queries, see cold-cache-diagnostics.md
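The cold/warm split in steps 2–3 can be expressed as a filter over the query history rows. The row shape below is a simplified dict stand-in for the `queryinsights.exec_requests_history` view, which has many more columns:

```python
# Classify executions from queryinsights.exec_requests_history rows (step 2).
# Row shape is a simplified assumption; the real view has more columns.

def is_cold_run(row: dict) -> bool:
    # Non-zero remote-storage scan means the cache had to be populated.
    return row.get("data_scanned_remote_storage_mb", 0) > 0

def warm_runs(rows: list) -> list:
    """Keep only warm executions - the ones to use for benchmarking (step 3)."""
    return [r for r in rows if not is_cold_run(r)]

history = [
    {"query_id": 1, "data_scanned_remote_storage_mb": 820, "duration_s": 41.0},
    {"query_id": 2, "data_scanned_remote_storage_mb": 0, "duration_s": 3.2},
]
print([r["query_id"] for r in warm_runs(history)])  # [2]
```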

Workflow 3: Fix Small File Problem

When Delta tables have hundreds or thousands of small Parquet files:

  1. Run the table-health-check.ps1 script to assess file counts and sizes
  2. Apply OPTIMIZE to consolidate files (target: 128 MB–1 GB per file)
  3. Apply V-Order for read-optimized workloads
  4. Schedule recurring maintenance — see table-maintenance-workflow.md
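The sizing target in step 2 is easy to sanity-check with arithmetic. A sketch with illustrative numbers, using a 512 MB midpoint inside the 128 MB–1 GB window:

```python
import math

# Rough OPTIMIZE sizing for step 2: how many files should a table end up
# with at the 128 MB - 1 GB target? The 512 MB midpoint is an assumption.

TARGET_MIN_MB = 128

def needs_optimize(total_size_mb: float, file_count: int) -> bool:
    """True when average file size falls below the 128 MB floor."""
    return (total_size_mb / file_count) < TARGET_MIN_MB

def target_file_count(total_size_mb: float, target_mb: int = 512) -> int:
    return max(1, math.ceil(total_size_mb / target_mb))

# 100 GB spread over 8,000 files -> ~12.8 MB per file: classic small-file problem.
print(needs_optimize(102_400, 8_000))   # True
print(target_file_count(102_400))       # 200 files at ~512 MB each
```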

Workflow 4: Optimize V-Order Configuration

When choosing between read-heavy and write-heavy resource profiles:

  1. Identify your dominant workload pattern (ingestion vs. analytics)
  2. New Fabric workspaces default to writeHeavy profile (V-Order disabled)
  3. For Power BI / interactive queries, switch to readHeavyForSpark or readHeavyForPBI
  4. Apply V-Order at session, table, or OPTIMIZE command level
  5. See v-order-decision-guide.md for detailed configuration
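Steps 1–3 amount to a lookup from dominant workload to resource profile. The profile names come from the workflow above; treat the mapping logic itself as a sketch, not an official API:

```python
# Profile chooser for steps 1-3. Profile names are taken from the workflow
# above; the workload labels on the left are illustrative.

def pick_profile(dominant_workload: str) -> str:
    profiles = {
        "ingestion": "writeHeavy",             # default for new workspaces, V-Order off
        "spark-analytics": "readHeavyForSpark",
        "power-bi": "readHeavyForPBI",
    }
    if dominant_workload not in profiles:
        raise ValueError(f"unknown workload: {dominant_workload}")
    return profiles[dominant_workload]

print(pick_profile("power-bi"))  # readHeavyForPBI
```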

Workflow 5: Diagnose Cross-Region Latency

When data in OneLake or external storage is in a different region than Fabric capacity:

  1. Verify the Fabric capacity region in the Admin portal
  2. Check shortcut destinations — are they in the same region?
  3. For ADLS Gen2 or S3 shortcuts, confirm storage account region
  4. Keep large fact tables co-located; small dimension tables tolerate cross-region
  5. Use the region-latency-test.ps1 script to measure impact
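The co-location audit in steps 2–3 reduces to a region comparison across shortcut targets. A minimal sketch with illustrative shortcut records:

```python
# Co-location check for steps 2-3: flag shortcuts whose storage lives in a
# different region than the Fabric capacity. Shortcut records are illustrative.

def cross_region_shortcuts(capacity_region: str, shortcuts: list) -> list:
    return [
        s["name"] for s in shortcuts
        if s["region"].lower() != capacity_region.lower()
    ]

shortcuts = [
    {"name": "sales_fact", "region": "westus2"},   # large fact table - must co-locate
    {"name": "dim_country", "region": "eastus"},   # small dimension - tolerable
]
print(cross_region_shortcuts("westus2", shortcuts))  # ['dim_country']
```

Per step 4, a flagged shortcut is only a problem worth fixing if it backs a large fact table; small dimensions can usually stay put.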

Workflow 6: Direct Lake Fallback Investigation

When Direct Lake models fall back to DirectQuery instead of reading from OneLake:

  1. Check if the semantic model has been framed (refreshed) recently
  2. Verify Delta tables are V-Ordered for optimal transcoding
  3. Check table row counts against the SKU guardrails
  4. Review column data types — large string columns degrade performance
  5. See direct-lake-remediate.md
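The guardrail comparison in step 3 can be sketched as a lookup against per-SKU row limits. The limits below are placeholders — verify the current values for your SKU in the Fabric documentation before relying on them:

```python
# Step-3 guardrail check. Direct Lake falls back to DirectQuery when a table
# exceeds the SKU's row-count guardrail. The limits below are placeholder
# values - confirm current numbers in the Fabric documentation.

PLACEHOLDER_ROW_GUARDRAILS = {
    "F64": 1_500_000_000,
    "F128": 3_000_000_000,
}

def fallback_risk(sku: str, table_rows: int,
                  guardrails: dict = PLACEHOLDER_ROW_GUARDRAILS) -> bool:
    """True when the table is likely to trigger DirectQuery fallback."""
    limit = guardrails.get(sku)
    if limit is None:
        raise KeyError(f"no guardrail recorded for SKU {sku}")
    return table_rows > limit

print(fallback_risk("F64", 2_000_000_000))  # True -> expect fallback
```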

Remediation Quick Reference

| Symptom | Likely Cause | First Action |
|---|---|---|
| HTTP 430 errors | Capacity VCores exhausted | Check Monitoring Hub, cancel idle sessions |
| First query very slow | Cold cache / node resume | Check data_scanned_remote_storage_mb |
| All queries slow | Small files / no V-Order | Run table health check script |
| Queries slow after migration | Wrong resource profile | Switch to appropriate read/write profile |
| Shortcuts slow | Cross-region data access | Verify region co-location |
| Direct Lake fallback | Table not framed / too large | Check framing status and SKU guardrails |
| VACUUM fails | Retention period too short | Set retention >= 7 days |
| Streaming ingestion slow | Schema enforcement overhead | Consider Eventhouse with OneLake availability |

Source

git clone https://github.com/PatrickGallucci/fabric-skills.git

Overview

OneLake Performance Remediation is a systematic diagnostic and remediation toolkit for Microsoft Fabric OneLake performance issues. It guides analysts through capacity throttling, cold cache latency, small file problems, Delta table fragmentation, V-Order configuration, and cross-region data access to restore fast lakehouse queries and reliable Direct Lake workflows.

How This Skill Works

The skill uses a diagnostic decision tree and scripted workflows to pinpoint root causes—from capacity constraints to data layout—and then applies proven remediations (OPTIMIZE, VACUUM, V-Order tuning, and capacity scaling). It relies on PowerShell, T-SQL, and Spark SQL workflows to automate collection, analysis, and remediation.

Quick Start

  1. Open the Monitoring Hub in the Fabric portal and collect baseline metrics (VCore utilization, queue depth, data_scanned_remote_storage_mb)
  2. Run a diagnostic script (e.g., spark-capacity-check.ps1 or table-health-check.ps1) to identify root causes
  3. Apply remediation (scale SKU, OPTIMIZE/VACUUM, enable V-Order) and validate improvements with subsequent runs

Best Practices

  • Start by checking capacity metrics: VCore utilization and queue depth
  • Review cold-cache indicators via data_scanned_remote_storage_mb
  • Run small-file remediation with OPTIMIZE and apply V-Order for read-optimized workloads
  • Enable or tune V-Order based on workload pattern and disable when not needed
  • Schedule recurring maintenance and use diagnostic references (maintenance workflow, cold-cache diagnostics)

Example Use Cases

  • Diagnose capacity throttling for burst Spark workloads and scale SKU to reduce HTTP 430 errors
  • Resolve cold-cache latency for a critical lakehouse query by pre-warming and reviewing data_scanned_remote_storage_mb
  • Fix small-file problem on a Delta table by running table-health-check.ps1 and applying OPTIMIZE and VACUUM
  • Enable V-Order for a read-heavy analytics workload to improve scan efficiency
  • Address cross-region data latency by co-locating data and capacity and validating with diagnostics
