fabric-onelake-perf-remediate
npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-onelake-perf-remediate --openclaw

OneLake Performance remediate
Systematic diagnostic and remediation toolkit for Microsoft Fabric OneLake performance issues. Covers the full stack from capacity-level throttling down to individual Delta table file layout problems.
When to Use This Skill
- OneLake read or write operations are slow or timing out
- Lakehouse or warehouse queries have unexpectedly high latency
- Spark jobs are being throttled with HTTP 430 errors
- Delta tables have accumulated many small files (small file problem)
- Direct Lake semantic models are falling back to DirectQuery
- Cold cache performance is significantly slower than warm cache
- Cross-region data access is adding network latency
- V-Order is not applied or needs to be enabled/disabled
- Table maintenance (OPTIMIZE, VACUUM) is failing or not improving performance
- Capacity utilization is high and jobs are queuing
Prerequisites
- Microsoft Fabric workspace with Contributor or higher role
- Access to the Monitoring Hub in the Fabric portal
- PowerShell 7+ with Az.Fabric module (for automation scripts)
- Familiarity with Spark SQL or T-SQL for diagnostic queries
Diagnostic Decision Tree
Follow this sequence to isolate the root cause:
1. Is the issue capacity-level? → Check Spark VCore utilization and queue depth
2. Is the issue cold cache? → Check data_scanned_remote_storage_mb
3. Is the issue file layout? → Check small file count and V-Order status
4. Is the issue cross-region? → Verify data and capacity are co-located
5. Is the issue query design? → Check string column widths, partition pruning
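The five-step triage order above can be sketched as a first-match check, stopping at the first signal that fires. This is a minimal illustration; the metric names in the dictionary are hypothetical, not an actual Fabric API.

```python
def triage(metrics: dict) -> str:
    """Return the first matching root-cause bucket for a slow workload."""
    if metrics.get("vcore_utilization_pct", 0) >= 100 or metrics.get("queue_depth", 0) > 0:
        return "capacity"       # step 1: capacity-level throttling
    if metrics.get("data_scanned_remote_storage_mb", 0) > 0:
        return "cold-cache"     # step 2: the run hit remote storage
    if metrics.get("small_file_count", 0) > 100 or not metrics.get("v_order", True):
        return "file-layout"    # step 3: small files / V-Order not applied
    if metrics.get("capacity_region") != metrics.get("data_region"):
        return "cross-region"   # step 4: data not co-located with capacity
    return "query-design"       # step 5: fall through to query tuning

print(triage({"queue_depth": 3}))                       # capacity
print(triage({"data_scanned_remote_storage_mb": 512}))  # cold-cache
```

Checking in this order matters: a throttled capacity makes every downstream symptom look worse, so later checks are only meaningful once the earlier ones are clear.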
Step-by-Step Workflows
Workflow 1: Diagnose Capacity Throttling
When Spark jobs fail with HTTP 430 (TooManyRequestsForCapacity):
- Open the Monitoring Hub in the Fabric portal
- Check active Spark sessions against your SKU's VCore limit (1 CU = 2 Spark VCores)
- Review the queue depth against your SKU's queue limit (see capacity-sku-reference.md)
- Cancel unnecessary jobs or scale up the capacity SKU
- For burst workloads, use the spark-capacity-check.ps1 script to monitor utilization
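The capacity math above (1 CU = 2 Spark VCores) can be worked through in a short sketch. The F-SKU CU counts are the standard Fabric sizes; burst/autoscale behavior is deliberately left out of scope here.

```python
F_SKU_CUS = {"F2": 2, "F4": 4, "F8": 8, "F16": 16, "F32": 32, "F64": 64, "F128": 128}

def base_spark_vcores(sku: str) -> int:
    """Base Spark VCore budget for an F SKU: 1 CU = 2 Spark VCores."""
    return F_SKU_CUS[sku] * 2

def will_throttle(sku: str, requested_vcores: int) -> bool:
    """True if a new job request would exceed the base VCore limit (HTTP 430 risk)."""
    return requested_vcores > base_spark_vcores(sku)

print(base_spark_vcores("F64"))  # 128
print(will_throttle("F8", 24))   # True: 24 > 16
```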
Workflow 2: Resolve Cold Cache Latency
When first query execution is significantly slower than subsequent runs:
- Query the queryinsights.exec_requests_history view
- Check the data_scanned_remote_storage_mb column: a non-zero value indicates a cold start
- Do NOT judge performance on first execution; measure subsequent runs
- For pre-warming strategies and diagnostic queries, see cold-cache-diagnostics.md
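The "measure subsequent runs" rule above can be applied mechanically: filter out any run that scanned remote storage before averaging latency. The row shape below is illustrative, not the actual queryinsights schema.

```python
def warm_runs(history: list) -> list:
    """Keep only runs served from cache (no remote storage scan)."""
    return [r for r in history if r.get("data_scanned_remote_storage_mb", 0) == 0]

history = [
    {"query_id": 1, "duration_ms": 9500, "data_scanned_remote_storage_mb": 812},  # cold
    {"query_id": 2, "duration_ms": 1100, "data_scanned_remote_storage_mb": 0},    # warm
    {"query_id": 3, "duration_ms": 1050, "data_scanned_remote_storage_mb": 0},    # warm
]
warm = warm_runs(history)
avg_warm_ms = sum(r["duration_ms"] for r in warm) / len(warm)
print(avg_warm_ms)  # 1075.0
```

Note how including the cold run would nearly triple the apparent average latency, which is exactly the mistake the workflow warns against.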
Workflow 3: Fix Small File Problem
When Delta tables have hundreds or thousands of small Parquet files:
- Run the table-health-check.ps1 script to assess file counts and sizes
- Apply OPTIMIZE to consolidate files (target: 128 MB–1 GB per file)
- Apply V-Order for read-optimized workloads
- Schedule recurring maintenance — see table-maintenance-workflow.md
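A rough file-layout health check mirroring the workflow above: count files below the 128 MB lower bound of the target range and flag the table for OPTIMIZE when small files dominate. The size bounds come from the text; the 50% trigger ratio is an illustrative assumption.

```python
TARGET_MIN_MB, TARGET_MAX_MB = 128, 1024  # target: 128 MB to 1 GB per file

def needs_optimize(file_sizes_mb: list, small_ratio_trigger: float = 0.5) -> bool:
    """Flag a table for OPTIMIZE when too many of its files are undersized."""
    small = sum(1 for s in file_sizes_mb if s < TARGET_MIN_MB)
    return small / len(file_sizes_mb) >= small_ratio_trigger

# 8 tiny ingestion files and 2 healthy ones -> consolidate
print(needs_optimize([4, 6, 5, 3, 7, 2, 8, 5, 256, 512]))  # True
```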
Workflow 4: Optimize V-Order Configuration
When choosing between read-heavy and write-heavy resource profiles:
- Identify your dominant workload pattern (ingestion vs. analytics)
- New Fabric workspaces default to the writeHeavy profile (V-Order disabled)
- For Power BI / interactive queries, switch to readHeavyForSpark or readHeavyForPBI
- Apply V-Order at the session, table, or OPTIMIZE command level
- See v-order-decision-guide.md for detailed configuration
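The profile decision above reduces to a small lookup. The profile names (writeHeavy, readHeavyForSpark, readHeavyForPBI) come from the workflow; the selection rule itself is a simplified assumption, not the official decision logic.

```python
def pick_resource_profile(workload: str, consumer: str = "pbi") -> str:
    """Choose a workspace resource profile from the dominant workload pattern."""
    if workload == "ingestion":
        return "writeHeavy"  # default for new workspaces, V-Order off
    if workload == "analytics":
        return "readHeavyForPBI" if consumer == "pbi" else "readHeavyForSpark"
    raise ValueError(f"unknown workload: {workload}")

print(pick_resource_profile("analytics"))           # readHeavyForPBI
print(pick_resource_profile("analytics", "spark"))  # readHeavyForSpark
```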
Workflow 5: Diagnose Cross-Region Latency
When data in OneLake or external storage is in a different region than Fabric capacity:
- Verify the Fabric capacity region in the Admin portal
- Check shortcut destinations — are they in the same region?
- For ADLS Gen2 or S3 shortcuts, confirm storage account region
- Keep large fact tables co-located; small dimension tables tolerate cross-region
- Use the region-latency-test.ps1 script to measure impact
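A co-location audit for the rule above: large fact tables must sit in the capacity's region, while small dimension tables can tolerate being remote. The 1 GB size cutoff separating "large" from "small" is an illustrative assumption.

```python
def cross_region_offenders(capacity_region: str, shortcuts: list,
                           large_gb: float = 1.0) -> list:
    """Names of large shortcut targets that live outside the capacity region."""
    return [s["name"] for s in shortcuts
            if s["region"] != capacity_region and s["size_gb"] >= large_gb]

shortcuts = [
    {"name": "fact_sales", "region": "eastus", "size_gb": 250.0},  # must move
    {"name": "dim_date", "region": "eastus", "size_gb": 0.01},     # tolerable
]
print(cross_region_offenders("westus", shortcuts))  # ['fact_sales']
```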
Workflow 6: Direct Lake Fallback Investigation
When Direct Lake models fall back to DirectQuery instead of reading from OneLake:
- Check if the semantic model has been framed (refreshed) recently
- Verify Delta tables are V-Ordered for optimal transcoding
- Check table row counts against the SKU guardrails
- Review column data types — large string columns degrade performance
- See direct-lake-remediate.md
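The guardrail check above can be sketched as a simple comparison: a Direct Lake model falls back to DirectQuery when a table exceeds the SKU's row guardrail. The guardrail values below are illustrative placeholders; confirm the current limits for your SKU in the official documentation.

```python
ROW_GUARDRAILS = {"F32": 300_000_000, "F64": 1_500_000_000}  # illustrative values

def fallback_risk(sku: str, table_rows: int) -> bool:
    """True when the table is over the guardrail and will likely fall back."""
    return table_rows > ROW_GUARDRAILS[sku]

print(fallback_risk("F32", 450_000_000))  # True: over the F32 guardrail
print(fallback_risk("F64", 450_000_000))  # False
```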
Quick Reference
| Symptom | Likely Cause | First Action |
|---|---|---|
| HTTP 430 errors | Capacity VCores exhausted | Check Monitoring Hub, cancel idle sessions |
| First query very slow | Cold cache / node resume | Check data_scanned_remote_storage_mb |
| All queries slow | Small files / no V-Order | Run table health check script |
| Queries slow after migration | Wrong resource profile | Switch to appropriate read/write profile |
| Shortcuts slow | Cross-region data access | Verify region co-location |
| Direct Lake fallback | Table not framed / too large | Check framing status and SKU guardrails |
| VACUUM fails | Retention period too short | Set retention >= 7 days |
| Streaming ingestion slow | Schema enforcement overhead | Consider Eventhouse with OneLake availability |
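The VACUUM row in the table above has a precise floor worth encoding: Delta Lake's default retention safety check rejects retention below 7 days (168 hours) unless that check is deliberately disabled.

```python
MIN_RETENTION_HOURS = 7 * 24  # 168-hour default safety floor for VACUUM

def vacuum_retention_ok(retention_hours: int) -> bool:
    """True when the requested retention clears Delta's default safety floor."""
    return retention_hours >= MIN_RETENTION_HOURS

print(vacuum_retention_ok(24))   # False: below the 7-day floor, VACUUM fails
print(vacuum_retention_ok(168))  # True
```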
References
- Capacity SKU Reference — VCore limits, queue limits, node configurations
- Cold Cache Diagnostics — T-SQL diagnostic queries and pre-warming
- Table Maintenance Workflow — OPTIMIZE, VACUUM, and scheduling
- V-Order Decision Guide — When to enable/disable, resource profiles
- Direct Lake Remediation — Fallback investigation, framing, transcoding
Available Scripts
- spark-capacity-check.ps1 — Monitor Spark VCore utilization and queue depth
- table-health-check.ps1 — Assess Delta table file counts, sizes, and V-Order status
- region-latency-test.ps1 — Measure cross-region OneLake access latency
- run-table-maintenance.ps1 — Execute table maintenance via Fabric REST API
Templates
- diagnostic-report.md — Template for documenting performance investigation findings
Source
git clone https://github.com/PatrickGallucci/fabric-skills
The skill file is at skills/fabric-onelake-perf-remediate/SKILL.md in the repository.
Overview
OneLake Performance remediate is a systematic diagnostic and remediation toolkit for Microsoft Fabric OneLake performance issues. It guides analysts through capacity throttling, cold cache latency, small file problems, Delta table fragmentation, V-Order configuration, and cross-region data access to restore fast lakehouse queries and reliable Direct Lake workflows.
How This Skill Works
The skill uses a diagnostic decision tree and scripted workflows to pinpoint root causes—from capacity constraints to data layout—and then applies proven remediations (OPTIMIZE, VACUUM, V-Order tuning, and capacity scaling). It relies on PowerShell, T-SQL, and Spark SQL workflows to automate collection, analysis, and remediation.
Quick Start
- Step 1: Open the Monitoring Hub in the Fabric portal and collect baseline metrics (VCore utilization, queue depth, data_scanned_remote_storage_mb)
- Step 2: Run a remediation workflow script (e.g., spark-capacity-check.ps1 or table-health-check.ps1) to identify root causes
- Step 3: Apply remediation (scale SKU, OPTIMIZE/VACUUM, enable V-Order) and validate improvements with subsequent runs
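Step 3's validation can be quantified rather than eyeballed: compare warm-run latency before and after remediation. A minimal sketch, with illustrative numbers:

```python
def improvement_pct(before_ms: float, after_ms: float) -> float:
    """Percentage latency reduction achieved by a remediation."""
    return round((before_ms - after_ms) / before_ms * 100, 1)

print(improvement_pct(4200, 1050))  # 75.0
```

Compare warm runs to warm runs; mixing a cold baseline with a warm post-remediation run overstates the improvement.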
Best Practices
- Start by checking capacity metrics: VCore utilization and queue depth
- Review cold-cache indicators via data_scanned_remote_storage_mb
- Run small-file remediation with OPTIMIZE and apply V-Order for read-optimized workloads
- Enable or tune V-Order based on workload pattern and disable when not needed
- Schedule recurring maintenance and use diagnostic references (maintenance workflow, cold-cache diagnostics)
Example Use Cases
- Diagnose capacity throttling for burst Spark workloads and scale SKU to reduce HTTP 430 errors
- Resolve cold-cache latency for a critical lakehouse query by pre-warming and reviewing data_scanned_remote_storage_mb
- Fix small-file problem on a Delta table by running table-health-check.ps1 and applying OPTIMIZE and VACUUM
- Enable V-Order for a read-heavy analytics workload to improve scan efficiency
- Address cross-region data latency by co-locating data and capacity and validating with diagnostics