# fabric-performance-monitoring

`npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-performance-monitoring --openclaw`

## Microsoft Fabric Performance Monitoring
Toolkit for monitoring, diagnosing, and optimizing Microsoft Fabric capacity and workload performance across Spark, Data Warehouse, Lakehouse, and Pipeline workloads.
## When to Use This Skill
- Checking Fabric capacity utilization or CU consumption
- Diagnosing throttling errors (HTTP 430 / TooManyRequestsForCapacity)
- Monitoring Spark VCore usage and concurrency limits
- Querying Fabric REST APIs for capacity and workspace health
- Generating capacity performance reports
- Tuning Spark resource profiles (readHeavy, writeHeavy, balanced)
- Investigating job failures in the Monitoring Hub
- Analyzing autoscale billing vs capacity-based billing
- Reviewing background vs interactive operation patterns
- Planning capacity SKU sizing or rightsizing
## Prerequisites
- PowerShell 7+ with Az.Fabric module installed
- Microsoft Entra ID app registration with Fabric API permissions
- Fabric Capacity Admin or Workspace Admin role
- Fabric Capacity Metrics app installed (for visual monitoring)
## Core Concepts

### Capacity Units and Spark VCores
One Capacity Unit (CU) equals two Apache Spark VCores. Fabric capacity is shared across all workspaces assigned to it, and Spark VCores are shared among notebooks, Spark job definitions, and lakehouses within those workspaces.
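The CU-to-VCore relationship can be expressed directly. A minimal sketch (the SKU example follows the capacity table later in this document):

```python
def spark_vcores(capacity_units: int) -> int:
    """One Fabric Capacity Unit (CU) equals two Apache Spark VCores."""
    return capacity_units * 2

# An F64 SKU provides 64 CUs, i.e. 128 Spark VCores shared by every
# workspace assigned to that capacity.
print(spark_vcores(64))
```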
### Operation Types
Fabric classifies operations as interactive (on-demand, like DAX queries) or background (scheduled, like refreshes and Spark jobs). Background operations are smoothed over a 24-hour period. All Spark operations are background operations.
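Smoothing means a background operation's total CU cost is spread evenly across the smoothing window rather than hitting the capacity all at once. A minimal sketch of that arithmetic:

```python
def smoothed_cu_rate(total_cu_seconds: float, window_hours: float = 24.0) -> float:
    """Background operations are smoothed: their total CU cost is spread
    evenly over the smoothing window (24 hours for background work)."""
    return total_cu_seconds / (window_hours * 3600)

# A Spark job that consumes 172,800 CU-seconds adds only 2 CU/s of
# smoothed load over the 24-hour window.
print(smoothed_cu_rate(172_800))
```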
### Throttling Behavior
When capacity is fully utilized, new Spark jobs receive HTTP 430 with TooManyRequestsForCapacity. With queueing enabled, pipeline-triggered and scheduled jobs enter a FIFO queue and retry automatically when capacity becomes available.
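The admit/queue/reject decision can be illustrated with a toy FIFO model. This is a sketch of the behaviour described above, not Fabric's actual scheduler; the class and method names are invented for illustration:

```python
from collections import deque

HTTP_TOO_MANY_REQUESTS_FOR_CAPACITY = 430  # Fabric's capacity-throttling status

class SparkJobQueue:
    """Toy model: run a job if VCores are free, queue it (FIFO) while the
    queue has room, otherwise reject with the throttling status code."""

    def __init__(self, vcore_limit: int, queue_limit: int):
        self.vcore_limit = vcore_limit
        self.queue_limit = queue_limit
        self.running: list[tuple[str, int]] = []
        self.queued: deque[tuple[str, int]] = deque()

    def submit(self, job: str, vcores: int) -> str:
        used = sum(v for _, v in self.running)
        if used + vcores <= self.vcore_limit:
            self.running.append((job, vcores))
            return "running"
        if len(self.queued) < self.queue_limit:
            self.queued.append((job, vcores))
            return "queued"
        return f"rejected (HTTP {HTTP_TOO_MANY_REQUESTS_FOR_CAPACITY})"
```

With F2-sized limits (4 VCores, queue depth 4), the first 4-VCore job runs, the next four queue, and the sixth is rejected.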
### Capacity SKU Limits
| SKU | Spark VCores | Queue Limit |
|---|---|---|
| F2 | 4 | 4 |
| F4 | 8 | 4 |
| F8 | 16 | 8 |
| F16 | 32 | 16 |
| F32 | 64 | 32 |
| F64 | 128 | 64 |
| F128 | 256 | 128 |
| F256 | 512 | 256 |
| F512 | 1024 | 512 |
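For SKU sizing or rightsizing, the table above can be turned into a small lookup helper. A sketch (the function name is invented; the numbers mirror the table):

```python
# (SKU, Spark VCores) pairs from the capacity SKU table above.
SKU_VCORES = [("F2", 4), ("F4", 8), ("F8", 16), ("F16", 32), ("F32", 64),
              ("F64", 128), ("F128", 256), ("F256", 512), ("F512", 1024)]

def recommend_sku(peak_vcores: int) -> str:
    """Return the smallest SKU whose Spark VCore allocation covers the
    observed peak concurrent VCore demand."""
    for sku, vcores in SKU_VCORES:
        if vcores >= peak_vcores:
            return sku
    raise ValueError("peak exceeds the largest listed SKU")
```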
### Spark Resource Profiles
Fabric supports predefined Spark resource profiles for workload optimization. New workspaces default to writeHeavy. Available profiles: readHeavy, writeHeavy, balanced. When writeHeavy is used, VOrder is disabled by default and must be manually enabled.
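As a sketch of applying a profile programmatically, the helper below builds a Spark session config for one of the profiles named above. The property names `spark.fabric.resourceProfile` and `spark.sql.parquet.vorder.enabled` are assumptions; verify them against the Fabric documentation for your runtime version.

```python
# Profile names as used in this skill; confirm the exact identifiers
# against the Fabric docs for your runtime version.
RESOURCE_PROFILES = {"readHeavy", "writeHeavy", "balanced"}

def spark_profile_conf(profile: str) -> dict:
    """Build session config for a resource profile. Property names here
    are assumptions -- verify before use."""
    if profile not in RESOURCE_PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    conf = {"spark.fabric.resourceProfile": profile}
    if profile == "writeHeavy":
        # VOrder is disabled by default under writeHeavy; enable it
        # explicitly if read performance on the written tables matters.
        conf["spark.sql.parquet.vorder.enabled"] = "true"
    return conf
```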
## Step-by-Step Workflows

### Workflow 1: Capacity Health Check
Run the capacity health check script to retrieve current capacity status, SKU details, and state.
`./scripts/Get-FabricCapacityHealth.ps1 -SubscriptionId "<sub-id>" -ResourceGroupName "<rg>" -CapacityName "<name>"`
See capacity-health-reference.md for detailed API response schemas and interpretation guidance.
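The same capacity resource can be read over the ARM REST API from other languages. A hedged Python sketch, assuming the `Microsoft.Fabric/capacities` ARM resource type; the `API_VERSION` value and the response fields shown are assumptions to verify against capacity-health-reference.md:

```python
import json
import urllib.request

API_VERSION = "2023-11-01"  # assumption -- verify the current ARM API version

def capacity_url(sub_id: str, rg: str, name: str) -> str:
    """ARM endpoint for a Fabric capacity resource (the same resource
    the PowerShell health-check script reads via Az.Fabric)."""
    return ("https://management.azure.com"
            f"/subscriptions/{sub_id}/resourceGroups/{rg}"
            f"/providers/Microsoft.Fabric/capacities/{name}"
            f"?api-version={API_VERSION}")

def get_capacity_health(sub_id: str, rg: str, name: str, token: str) -> dict:
    """Fetch SKU and state; needs an Entra ID bearer token for the
    https://management.azure.com scope."""
    req = urllib.request.Request(
        capacity_url(sub_id, rg, name),
        headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    # 'state' is typically Active, Paused, or Provisioning.
    return {"sku": body["sku"]["name"], "state": body["properties"]["state"]}
```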
### Workflow 2: Spark Concurrency Analysis
Run the Spark concurrency analyzer to check active sessions, queued jobs, and throttling status.
`./scripts/Get-FabricSparkConcurrency.ps1 -WorkspaceId "<workspace-id>"`
### Workflow 3: Monitoring Hub Job Audit
Run the job audit script to retrieve recent job executions, durations, and failure details.
`./scripts/Get-FabricJobHistory.ps1 -WorkspaceId "<workspace-id>" -HoursBack 24`
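Once job records are retrieved, a small aggregation surfaces the recurring failures. A sketch; the record shape (`itemName`, `status`) is an assumption about the script's output, not a documented schema:

```python
from collections import Counter
from typing import Iterable

def failure_summary(jobs: Iterable[dict]) -> Counter:
    """Count failed runs per item from Monitoring Hub job records.
    Assumed record shape: {'itemName': ..., 'status': ...}."""
    return Counter(j["itemName"] for j in jobs if j["status"] == "Failed")

jobs = [
    {"itemName": "nb_ingest", "status": "Failed"},
    {"itemName": "nb_ingest", "status": "Failed"},
    {"itemName": "pipeline_load", "status": "Completed"},
]
print(failure_summary(jobs).most_common())
```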
### Workflow 4: Generate Performance Report
Use the performance report template to query the SQL analytics endpoint for Lakehouse operation metrics, then generate a summary with the report generator.
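The summarisation step can be sketched as follows, assuming the SQL analytics endpoint query returns a list of operation durations (the function name and report fields are illustrative, not the report generator's actual output):

```python
import statistics

def operation_report(durations_s: list[float]) -> dict:
    """Summarise Lakehouse operation durations into report-ready stats."""
    return {
        "count": len(durations_s),
        "p50_s": statistics.median(durations_s),
        "max_s": max(durations_s),
        "mean_s": round(statistics.fmean(durations_s), 2),
    }

print(operation_report([1, 2, 3, 4, 5]))
```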
### Workflow 5: Autoscale vs Capacity Cost Analysis
See cost-analysis-reference.md for guidance on comparing autoscale billing vs capacity-based models using Azure Cost Management.
## Troubleshooting and Remediation
| Symptom | Likely Cause | Resolution |
|---|---|---|
| HTTP 430 errors | Capacity fully utilized | Scale SKU, cancel idle sessions, enable queueing |
| Jobs stuck in queue | All VCores consumed | Check Monitoring Hub, stop idle notebooks |
| Slow Spark startup | Using custom pool with cold start | Switch to starter pool for quick sessions |
| High CU consumption | Inefficient queries or unoptimized code | Review Capacity Metrics app, optimize DAX/Spark |
| Unexpected autoscale charges | Spark jobs billed independently | Check Azure Cost Analysis with the Autoscale meter |
| VOrder disabled | writeHeavy profile active | Manually enable VOrder if read performance needed |
## References
- Capacity Health Reference - REST API schemas and interpretation
- Cost Analysis Reference - Autoscale vs capacity billing comparison
- Fabric Capacity Metrics App
- Monitor Spark Capacity Consumption
- Fabric REST API Documentation
- Concurrency Limits and Queueing
## Source

https://github.com/PatrickGallucci/fabric-skills/blob/main/skills/fabric-performance-monitoring/SKILL.md

## Overview
Microsoft Fabric Performance Monitoring provides tooling to monitor, diagnose, and optimize capacity and workload performance across Spark, Data Warehouse, Lakehouse, and Pipeline workloads. It helps you detect throttling, track VCore usage, review Monitoring Hub jobs, and plan capacity SKU sizing.
## How This Skill Works
The skill uses PowerShell (Az.Fabric), REST APIs, and T-SQL workflows to collect capacity metrics, retrieve Spark VCore consumption, and analyze CU usage. It provides ready-to-run scripts for capacity health checks, concurrency analysis, and job audits, plus guidance to tune Spark resource profiles (readHeavy, writeHeavy, balanced).
## When to Use It
- Check Fabric capacity utilization or CU consumption
- Diagnose throttling errors (HTTP 430 / TooManyRequestsForCapacity)
- Monitor Spark VCore usage and concurrency limits
- Query Fabric REST APIs for capacity and workspace health
- Generate capacity performance reports and SKU sizing plans
## Quick Start
- Step 1: Install prerequisites (PowerShell 7+, Az.Fabric module, Entra ID app with Fabric API permissions, and appropriate Fabric roles)
- Step 2: Run a capacity health check script to fetch current capacity status
- Step 3: Analyze results, adjust Spark resource profiles or SKU sizing, then re-run health checks
## Best Practices
- Run the capacity health check script `Get-FabricCapacityHealth.ps1` to verify the current state
- Tune Spark resource profiles (readHeavy, writeHeavy, balanced) based on workload; start with writeHeavy as the default
- Enable queueing for throttling scenarios; understand HTTP 430 and retry behavior
- Regularly compare autoscale billing vs capacity-based billing to optimize cost
- Review Monitoring Hub job results to identify failures and optimization opportunities
## Example Use Cases
- Diagnose an HTTP 430 throttling event during a data refresh and reallocate capacity
- Size a new Fabric SKU (e.g., from F8 to F16) after observing peak Spark VCore usage
- Tune a workspace from readHeavy to balanced to reduce VCore contention
- Audit Monitoring Hub jobs to identify recurring failures and adjust scheduling
- Generate a quarterly capacity performance report for governance reviews