fabric-onelake-perf-remediate
npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-onelake-perf-remediate --openclaw

OneLake Performance remediate
Systematic diagnostic and remediation toolkit for Microsoft Fabric OneLake performance issues. Covers the full stack from capacity-level throttling down to individual Delta table file layout problems.
When to Use This Skill
- OneLake read or write operations are slow or timing out
- Lakehouse or warehouse queries have unexpectedly high latency
- Spark jobs are being throttled with HTTP 430 errors
- Delta tables have accumulated many small files (small file problem)
- Direct Lake semantic models are falling back to DirectQuery
- Cold cache performance is significantly slower than warm cache
- Cross-region data access is adding network latency
- V-Order is not applied or needs to be enabled/disabled
- Table maintenance (OPTIMIZE, VACUUM) is failing or not improving performance
- Capacity utilization is high and jobs are queuing
Prerequisites
- Microsoft Fabric workspace with Contributor or higher role
- Access to the Monitoring Hub in the Fabric portal
- PowerShell 7+ with Az.Fabric module (for automation scripts)
- Familiarity with Spark SQL or T-SQL for diagnostic queries
Diagnostic Decision Tree
Follow this sequence to isolate the root cause:
1. Is the issue capacity-level? → Check Spark VCore utilization and queue depth
2. Is the issue cold cache? → Check data_scanned_remote_storage_mb
3. Is the issue file layout? → Check small file count and V-Order status
4. Is the issue cross-region? → Verify data and capacity are co-located
5. Is the issue query design? → Check string column widths, partition pruning
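The five-step triage order above can be sketched as a first-match check, stopping at the first signal that fires. This is a minimal illustration; the metric names in the dictionary are hypothetical, not an actual Fabric API.

```python
def triage(metrics: dict) -> str:
    """Return the first matching root-cause bucket for a slow workload."""
    if metrics.get("vcore_utilization_pct", 0) >= 100 or metrics.get("queue_depth", 0) > 0:
        return "capacity"       # step 1: capacity-level throttling
    if metrics.get("data_scanned_remote_storage_mb", 0) > 0:
        return "cold-cache"     # step 2: the run hit remote storage
    if metrics.get("small_file_count", 0) > 100 or not metrics.get("v_order", True):
        return "file-layout"    # step 3: small files / V-Order not applied
    if metrics.get("capacity_region") != metrics.get("data_region"):
        return "cross-region"   # step 4: data not co-located with capacity
    return "query-design"       # step 5: fall through to query tuning

print(triage({"queue_depth": 3}))                       # capacity
print(triage({"data_scanned_remote_storage_mb": 512}))  # cold-cache
```

Checking in this order matters: a throttled capacity makes every downstream symptom look worse, so later checks are only meaningful once the earlier ones are clear.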
Step-by-Step Workflows
Workflow 1: Diagnose Capacity Throttling
When Spark jobs fail with HTTP 430 (TooManyRequestsForCapacity):
- Open the Monitoring Hub in the Fabric portal
- Check active Spark sessions against your SKU's VCore limit (1 CU = 2 Spark VCores)
- Review the queue depth against your SKU's queue limit (see capacity-sku-reference.md)
- Cancel unnecessary jobs or scale up the capacity SKU
- For burst workloads, use the spark-capacity-check.ps1 script to monitor utilization
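The capacity math above (1 CU = 2 Spark VCores) can be worked through in a short sketch. The F-SKU CU counts are the standard Fabric sizes; burst/autoscale behavior is deliberately left out of scope here.

```python
F_SKU_CUS = {"F2": 2, "F4": 4, "F8": 8, "F16": 16, "F32": 32, "F64": 64, "F128": 128}

def base_spark_vcores(sku: str) -> int:
    """Base Spark VCore budget for an F SKU: 1 CU = 2 Spark VCores."""
    return F_SKU_CUS[sku] * 2

def will_throttle(sku: str, requested_vcores: int) -> bool:
    """True if a new job request would exceed the base VCore limit (HTTP 430 risk)."""
    return requested_vcores > base_spark_vcores(sku)

print(base_spark_vcores("F64"))  # 128
print(will_throttle("F8", 24))   # True: 24 > 16
```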
Workflow 2: Resolve Cold Cache Latency
When first query execution is significantly slower than subsequent runs:
- Query the queryinsights.exec_requests_history view
- Check the data_scanned_remote_storage_mb column: a non-zero value indicates a cold start
- Do NOT judge performance on first execution; measure subsequent runs
- For pre-warming strategies and diagnostic queries, see cold-cache-diagnostics.md
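The "measure subsequent runs" rule above can be applied mechanically: filter out any run that scanned remote storage before averaging latency. The row shape below is illustrative, not the actual queryinsights schema.

```python
def warm_runs(history: list) -> list:
    """Keep only runs served from cache (no remote storage scan)."""
    return [r for r in history if r.get("data_scanned_remote_storage_mb", 0) == 0]

history = [
    {"query_id": 1, "duration_ms": 9500, "data_scanned_remote_storage_mb": 812},  # cold
    {"query_id": 2, "duration_ms": 1100, "data_scanned_remote_storage_mb": 0},    # warm
    {"query_id": 3, "duration_ms": 1050, "data_scanned_remote_storage_mb": 0},    # warm
]
warm = warm_runs(history)
avg_warm_ms = sum(r["duration_ms"] for r in warm) / len(warm)
print(avg_warm_ms)  # 1075.0
```

Note how including the cold run would nearly triple the apparent average latency, which is exactly the mistake the workflow warns against.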
Workflow 3: Fix Small File Problem
When Delta tables have hundreds or thousands of small Parquet files:
- Run the table-health-check.ps1 script to assess file counts and sizes
- Apply OPTIMIZE to consolidate files (target: 128 MB–1 GB per file)
- Apply V-Order for read-optimized workloads
- Schedule recurring maintenance — see table-maintenance-workflow.md
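A rough file-layout health check mirroring the workflow above: count files below the 128 MB lower bound of the target range and flag the table for OPTIMIZE when small files dominate. The size bounds come from the text; the 50% trigger ratio is an illustrative assumption.

```python
TARGET_MIN_MB, TARGET_MAX_MB = 128, 1024  # target: 128 MB to 1 GB per file

def needs_optimize(file_sizes_mb: list, small_ratio_trigger: float = 0.5) -> bool:
    """Flag a table for OPTIMIZE when too many of its files are undersized."""
    small = sum(1 for s in file_sizes_mb if s < TARGET_MIN_MB)
    return small / len(file_sizes_mb) >= small_ratio_trigger

# 8 tiny ingestion files and 2 healthy ones -> consolidate
print(needs_optimize([4, 6, 5, 3, 7, 2, 8, 5, 256, 512]))  # True
```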
Workflow 4: Optimize V-Order Configuration
When choosing between read-heavy and write-heavy resource profiles:
- Identify your dominant workload pattern (ingestion vs. analytics)
- New Fabric workspaces default to the writeHeavy profile (V-Order disabled)
- For Power BI / interactive queries, switch to readHeavyForSpark or readHeavyForPBI
- Apply V-Order at the session, table, or OPTIMIZE command level
- See v-order-decision-guide.md for detailed configuration
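The profile decision above reduces to a small lookup. The profile names (writeHeavy, readHeavyForSpark, readHeavyForPBI) come from the workflow; the selection rule itself is a simplified assumption, not the official decision logic.

```python
def pick_resource_profile(workload: str, consumer: str = "pbi") -> str:
    """Choose a workspace resource profile from the dominant workload pattern."""
    if workload == "ingestion":
        return "writeHeavy"  # default for new workspaces, V-Order off
    if workload == "analytics":
        return "readHeavyForPBI" if consumer == "pbi" else "readHeavyForSpark"
    raise ValueError(f"unknown workload: {workload}")

print(pick_resource_profile("analytics"))           # readHeavyForPBI
print(pick_resource_profile("analytics", "spark"))  # readHeavyForSpark
```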
Workflow 5: Diagnose Cross-Region Latency
When data in OneLake or external storage is in a different region than Fabric capacity:
- Verify the Fabric capacity region in the Admin portal
- Check shortcut destinations — are they in the same region?
- For ADLS Gen2 or S3 shortcuts, confirm storage account region
- Keep large fact tables co-located; small dimension tables tolerate cross-region
- Use the region-latency-test.ps1 script to measure impact
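A co-location audit for the rule above: large fact tables must sit in the capacity's region, while small dimension tables can tolerate being remote. The 1 GB size cutoff separating "large" from "small" is an illustrative assumption.

```python
def cross_region_offenders(capacity_region: str, shortcuts: list,
                           large_gb: float = 1.0) -> list:
    """Names of large shortcut targets that live outside the capacity region."""
    return [s["name"] for s in shortcuts
            if s["region"] != capacity_region and s["size_gb"] >= large_gb]

shortcuts = [
    {"name": "fact_sales", "region": "eastus", "size_gb": 250.0},  # must move
    {"name": "dim_date", "region": "eastus", "size_gb": 0.01},     # tolerable
]
print(cross_region_offenders("westus", shortcuts))  # ['fact_sales']
```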
Workflow 6: Direct Lake Fallback Investigation
When Direct Lake models fall back to DirectQuery instead of reading from OneLake:
- Check if the semantic model has been framed (refreshed) recently
- Verify Delta tables are V-Ordered for optimal transcoding
- Check table row counts against the SKU guardrails
- Review column data types — large string columns degrade performance
- See direct-lake-remediate.md
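The guardrail check above can be sketched as a simple comparison: a Direct Lake model falls back to DirectQuery when a table exceeds the SKU's row guardrail. The guardrail values below are illustrative placeholders; confirm the current limits for your SKU in the official documentation.

```python
ROW_GUARDRAILS = {"F32": 300_000_000, "F64": 1_500_000_000}  # illustrative values

def fallback_risk(sku: str, table_rows: int) -> bool:
    """True when the table is over the guardrail and will likely fall back."""
    return table_rows > ROW_GUARDRAILS[sku]

print(fallback_risk("F32", 450_000_000))  # True: over the F32 guardrail
print(fallback_risk("F64", 450_000_000))  # False
```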
Quick Reference
| Symptom | Likely Cause | First Action |
|---|---|---|
| HTTP 430 errors | Capacity VCores exhausted | Check Monitoring Hub, cancel idle sessions |
| First query very slow | Cold cache / node resume | Check data_scanned_remote_storage_mb |
| All queries slow | Small files / no V-Order | Run table health check script |
| Queries slow after migration | Wrong resource profile | Switch to appropriate read/write profile |
| Shortcuts slow | Cross-region data access | Verify region co-location |
| Direct Lake fallback | Table not framed / too large | Check framing status and SKU guardrails |
| VACUUM fails | Retention period too short | Set retention >= 7 days |
| Streaming ingestion slow | Schema enforcement overhead | Consider Eventhouse with OneLake availability |
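The VACUUM row in the table above has a precise floor worth encoding: Delta Lake's default retention safety check rejects retention below 7 days (168 hours) unless that check is deliberately disabled.

```python
MIN_RETENTION_HOURS = 7 * 24  # 168-hour default safety floor for VACUUM

def vacuum_retention_ok(retention_hours: int) -> bool:
    """True when the requested retention clears Delta's default safety floor."""
    return retention_hours >= MIN_RETENTION_HOURS

print(vacuum_retention_ok(24))   # False: below the 7-day floor, VACUUM fails
print(vacuum_retention_ok(168))  # True
```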
References
- Capacity SKU Reference — VCore limits, queue limits, node configurations
- Cold Cache Diagnostics — T-SQL diagnostic queries and pre-warming
- Table Maintenance Workflow — OPTIMIZE, VACUUM, and scheduling
- V-Order Decision Guide — When to enable/disable, resource profiles
- Direct Lake Remediation — Fallback investigation, framing, transcoding
Available Scripts
- spark-capacity-check.ps1 — Monitor Spark VCore utilization and queue depth
- table-health-check.ps1 — Assess Delta table file counts, sizes, and V-Order status
- region-latency-test.ps1 — Measure cross-region OneLake access latency
- run-table-maintenance.ps1 — Execute table maintenance via Fabric REST API
Templates
- diagnostic-report.md — Template for documenting performance investigation findings
Source
git clone https://github.com/PatrickGallucci/fabric-skills
The skill file is at skills/fabric-onelake-perf-remediate/SKILL.md in the repository.
Overview
OneLake Performance remediate is a systematic diagnostic and remediation toolkit for Microsoft Fabric OneLake performance issues. It guides analysts through capacity throttling, cold cache latency, small file problems, Delta table fragmentation, V-Order configuration, and cross-region data access to restore fast lakehouse queries and reliable Direct Lake workflows.
How This Skill Works
The skill uses a diagnostic decision tree and scripted workflows to pinpoint root causes—from capacity constraints to data layout—and then applies proven remediations (OPTIMIZE, VACUUM, V-Order tuning, and capacity scaling). It relies on PowerShell, T-SQL, and Spark SQL workflows to automate collection, analysis, and remediation.
Quick Start
- Step 1: Open the Monitoring Hub in the Fabric portal and collect baseline metrics (VCore utilization, queue depth, data_scanned_remote_storage_mb)
- Step 2: Run a remediation workflow script (e.g., spark-capacity-check.ps1 or table-health-check.ps1) to identify root causes
- Step 3: Apply remediation (scale SKU, OPTIMIZE/VACUUM, enable V-Order) and validate improvements with subsequent runs
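Step 3's validation can be quantified rather than eyeballed: compare warm-run latency before and after remediation. A minimal sketch, with illustrative numbers:

```python
def improvement_pct(before_ms: float, after_ms: float) -> float:
    """Percentage latency reduction achieved by a remediation."""
    return round((before_ms - after_ms) / before_ms * 100, 1)

print(improvement_pct(4200, 1050))  # 75.0
```

Compare warm runs to warm runs; mixing a cold baseline with a warm post-remediation run overstates the improvement.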
Best Practices
- Start by checking capacity metrics: VCore utilization and queue depth
- Review cold-cache indicators via data_scanned_remote_storage_mb
- Run small-file remediation with OPTIMIZE and apply V-Order for read-optimized workloads
- Enable or tune V-Order based on workload pattern and disable when not needed
- Schedule recurring maintenance and use diagnostic references (maintenance workflow, cold-cache diagnostics)
Example Use Cases
- Diagnose capacity throttling for burst Spark workloads and scale SKU to reduce HTTP 430 errors
- Resolve cold-cache latency for a critical lakehouse query by pre-warming and reviewing data_scanned_remote_storage_mb
- Fix small-file problem on a Delta table by running table-health-check.ps1 and applying OPTIMIZE and VACUUM
- Enable V-Order for a read-heavy analytics workload to improve scan efficiency
- Address cross-region data latency by co-locating data and capacity and validating with diagnostics