fabric-data-factory-perf-remediate
npx machina-cli add skill PatrickGallucci/fabric-skills/fabric-data-factory-perf-remediate --openclaw

Microsoft Fabric Data Factory Performance Remediation
Systematic approach to diagnosing and resolving performance issues in Microsoft Fabric Data Factory pipelines, copy activities, and dataflows.
When to Use This Skill
- Pipeline execution takes longer than expected
- Copy activities are slow or appear stuck
- Activities show "Not Started" status for extended periods
- Capacity throttling errors (HTTP 430, TooManyRequestsForCapacity)
- Throughput is lower than expected for copy operations
- Dataflow Gen2 refresh is slow or timing out
- Pipeline monitoring shows performance degradation over time
- Need to optimize parallelism, DIU, or partitioning settings
Prerequisites
- Access to Microsoft Fabric workspace with Contributor or higher role
- Familiarity with the Fabric Monitoring Hub
- Understanding of Fabric capacity SKUs and their limits
- PowerShell 7+ for running diagnostic scripts
Diagnostic Workflow
Step 1: Identify the Bottleneck Category
Determine which category your issue falls into:
| Category | Symptoms | Start Here |
|---|---|---|
| Copy Activity Slow | Low throughput, long transfer duration | copy-activity-tuning.md |
| Pipeline Stuck | Activity shows In Progress with no movement | pipeline-stuck-resolution.md |
| Capacity Throttling | HTTP 430 errors, jobs queued | capacity-throttling-guide.md |
| Dataflow Slow | Dataflow Gen2 refresh takes too long | dataflow-optimization.md |
| Spark Job Queue | Jobs stuck in "Not Started" status | capacity-throttling-guide.md |
Step 2: Collect Diagnostics
Run the diagnostic script to gather baseline metrics:
```powershell
./scripts/Get-FabricPipelineDiagnostics.ps1 -WorkspaceId "<guid>" -PipelineName "MyPipeline"
```
Or manually collect from the Monitoring Hub:
- Open Fabric portal and navigate to Monitoring Hub
- Filter by pipeline name and time range
- Select the run details (glasses icon) for the slow run
- Capture the Duration Breakdown for copy activities
- Note the queue time, transfer time, and pre/post-copy script duration
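Once you have the Duration Breakdown figures, the first question is which stage dominates: a queue-heavy run points at capacity throttling, while a transfer-heavy run points at copy tuning. A minimal sketch of that triage step (the field names below are illustrative placeholders, not the portal's exact schema):

```python
# Hypothetical duration-breakdown figures captured from the Monitoring Hub;
# the key names are illustrative, not the portal's exact JSON schema.
breakdown = {
    "queue_s": 310,
    "pre_copy_script_s": 12,
    "transfer_s": 95,
    "post_copy_script_s": 4,
}

def dominant_stage(stages: dict) -> tuple:
    """Return the stage consuming the largest share of total duration."""
    total = sum(stages.values())
    stage = max(stages, key=stages.get)
    return stage, stages[stage] / total

stage, share = dominant_stage(breakdown)
print(f"{stage} accounts for {share:.0%} of the run")
```

With these sample numbers the queue dominates, so the capacity-throttling guide, not copy tuning, is the right next stop.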
Step 3: Apply Targeted Fixes
Based on the bottleneck category, apply the appropriate optimization from the reference guides.
Quick Fixes for Common Issues
Copy Activity Running Slowly
- Set Intelligent Throughput Optimization to Maximum (or a custom value from 4-256)
- Configure Degree of Copy Parallelism based on source type
- Enable Partition Option for SQL sources (Dynamic Range or Physical)
- Pre-calculate partition upper/lower bounds to avoid overhead
- Enable Staging when sink is Fabric Warehouse
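Taken together, these knobs land in the copy activity's JSON definition. A sketch in the ADF-style copy activity schema that Fabric pipelines follow (the source type, column name `OrderId`, and the bound values are placeholders for your own table):

```json
{
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "SqlServerSource",
      "partitionOption": "DynamicRange",
      "partitionSettings": {
        "partitionColumnName": "OrderId",
        "partitionLowerBound": "1",
        "partitionUpperBound": "50000000"
      }
    },
    "enableStaging": true,
    "parallelCopies": 32,
    "dataIntegrationUnits": 256
  }
}
```

Note `parallelCopies` is capped at 32 here in line with the Fabric Warehouse sink recommendation in the settings table below.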
Pipeline Activity Stuck
- Cancel the stuck activity and retry
- Check source/sink connectivity and credentials
- Verify Fabric capacity is not in throttled state
- Review if payload exceeds 896 KB limit
- Check for connection timeout or network interruption
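The 896 KB payload limit is easy to trip with large inline parameters, and cheap to check locally before deployment. A minimal sketch (the activity structure here is hypothetical; serialize whatever your real definition contains):

```python
import json

PAYLOAD_LIMIT_BYTES = 896 * 1024  # activity payload limit noted above

def payload_size_ok(activity_json: dict) -> bool:
    """Serialize the activity definition and compare against the limit.
    Oversized inline parameter values are the usual culprit."""
    size = len(json.dumps(activity_json).encode("utf-8"))
    return size <= PAYLOAD_LIMIT_BYTES

# A deliberately oversized parameter to show the check firing:
big_activity = {"name": "CopyOrders", "parameters": {"rows": "x" * (900 * 1024)}}
small_activity = {"name": "CopyOrders", "parameters": {"rows": "x" * 100}}

print(payload_size_ok(small_activity))  # True
print(payload_size_ok(big_activity))    # False
```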
Capacity Throttling (HTTP 430)
- Check current Spark concurrency against SKU limits
- Cancel unnecessary active Spark jobs via Monitoring Hub
- Consider upgrading to a larger capacity SKU
- Distribute pipeline trigger times to avoid burst load
- Use job queueing for non-interactive Spark workloads
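Distributing trigger times is mostly a scheduling exercise: instead of every nightly pipeline firing at the same minute, offset each start so concurrent load on the capacity stays flat. A small sketch of computing staggered start times (gap and count are whatever your SKU headroom allows):

```python
from datetime import datetime, timedelta

def staggered_starts(base: datetime, pipeline_count: int, gap_minutes: int = 10):
    """Spread pipeline trigger times so they don't hit the capacity at once."""
    return [base + timedelta(minutes=gap_minutes * i) for i in range(pipeline_count)]

starts = staggered_starts(datetime(2025, 1, 1, 2, 0), 4, gap_minutes=15)
print([t.strftime("%H:%M") for t in starts])  # ['02:00', '02:15', '02:30', '02:45']
```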
Dataflow Gen2 Performance
- Reduce data volume with query folding and filters
- Avoid unnecessary data type conversions
- Minimize the number of transformation steps
- Use staging for large datasets
- Check for connector-specific throttling
Capacity SKU Quick Reference
| SKU | Max Spark Cores | Queue Limit | Equivalent Power BI |
|---|---|---|---|
| F2 | Limited | 4 | - |
| F4 | Limited | 4 | - |
| F8 | Limited | 8 | - |
| F16 | Limited | 16 | - |
| F32 | Limited | 32 | - |
| F64 | Standard | 64 | P1 |
| F128 | Standard | 128 | P2 |
| F256 | Standard | 256 | P3 |
| F512 | Standard | 512 | P4 |
| F1024 | Large | 1024 | - |
| F2048 | Large | 2048 | - |
| Trial | P1 equiv | N/A (no queue) | P1 |
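When planning capacity, the queue-limit column above can be turned into a quick sizing helper. The numbers below are transcribed from the table as a planning aid, not an authoritative source of current service limits:

```python
# Queue limits per SKU, transcribed from the table above.
QUEUE_LIMITS = {
    "F2": 4, "F4": 4, "F8": 8, "F16": 16, "F32": 32,
    "F64": 64, "F128": 128, "F256": 256, "F512": 512,
    "F1024": 1024, "F2048": 2048,
}

def smallest_sku_for_queue(depth: int) -> str:
    """Pick the smallest SKU whose queue limit covers an expected job backlog."""
    for sku, limit in QUEUE_LIMITS.items():
        if limit >= depth:
            return sku
    raise ValueError(f"No listed SKU supports a queue depth of {depth}")

print(smallest_sku_for_queue(40))  # F64
```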
Copy Activity Performance Settings Reference
| Setting | Property | Range | Recommendation |
|---|---|---|---|
| Intelligent Throughput Optimization | dataIntegrationUnits | Auto, Standard (64), Balanced (128), Maximum (256), Custom (4-256) | Start with Auto, increase for large datasets |
| Degree of Copy Parallelism | parallelCopies | 1-256 | Auto for most; limit to 32 for Fabric Warehouse sink |
| Partition Option | Source settings | None, Physical, Dynamic Range | Use Dynamic Range for large SQL tables |
| Enable Staging | enableStaging | true/false | Required for Fabric Warehouse sink |
| Source Retry Count | sourceRetryCount | Integer | Set 2-3 for transient failures |
| Fault Tolerance | enableSkipIncompatibleRow | true/false | Enable for non-critical loads |
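For Dynamic Range partitioning, pre-computing the bucket bounds (for example from a one-off `MIN`/`MAX` query on the partition column) avoids the extra range-discovery query the service otherwise issues at run time. A sketch of splitting a known key range into contiguous buckets:

```python
def partition_ranges(lower: int, upper: int, partitions: int):
    """Split [lower, upper] into contiguous buckets for DynamicRange
    partitioning, distributing any remainder across the first buckets."""
    span = upper - lower + 1
    base, extra = divmod(span, partitions)
    ranges, start = [], lower
    for i in range(partitions):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(partition_ranges(1, 10_000_000, 4))
# [(1, 2500000), (2500001, 5000000), (5000001, 7500000), (7500001, 10000000)]
```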
Error Code Quick Reference
| Error | Meaning | Action |
|---|---|---|
| HTTP 430 | Capacity compute limit reached | Reduce concurrent jobs or upgrade SKU |
| Payload too large | Activity config exceeds 896 KB | Reduce parameter sizes |
| TooManyRequestsForCapacity | Spark compute or API rate limit | Cancel active jobs or wait |
| Connection timeout | Source/sink unreachable | Check network, credentials, firewall |
| Deflate64 unsupported | Compression format not supported | Re-compress with deflate algorithm |
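Since HTTP 430 signals a transient capacity condition rather than a permanent failure, job submission logic can retry with exponential backoff instead of failing outright. A minimal sketch, with a simulated capacity standing in for the real submission call:

```python
import time

def run_with_backoff(submit, max_attempts: int = 4, base_delay_s: float = 1.0):
    """Retry a job submission while the capacity answers with HTTP 430,
    doubling the wait between attempts. `submit` is any callable that
    returns an HTTP-like status code."""
    for attempt in range(max_attempts):
        status = submit()
        if status != 430:
            return status
        time.sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError("Capacity still throttled after retries")

# Simulated capacity that throttles twice, then accepts the job:
responses = iter([430, 430, 200])
print(run_with_backoff(lambda: next(responses), base_delay_s=0.01))  # 200
```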
Monitoring Setup
Enable workspace monitoring for ongoing performance analysis:
- Go to Workspace Settings > Monitoring
- Add a Monitoring Eventhouse and enable Log workspace activity
- Query the ItemJobEventLogs table with KQL for pipeline-level insights
Example KQL query for failure trends:
```kusto
ItemJobEventLogs
| where ItemKind == "Pipeline"
| summarize count() by JobStatus
```
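Building on the same table, a sketch of a daily failure-trend query. The `Timestamp` column name is an assumption about the log schema; adjust it to whatever your Monitoring Eventhouse actually exposes:

```kusto
ItemJobEventLogs
| where ItemKind == "Pipeline" and JobStatus == "Failed"
| summarize Failures = count() by bin(Timestamp, 1d)
| order by Timestamp asc
```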
See workspace-monitoring-setup.md for detailed configuration.
References
- Copy Activity Tuning Guide
- Pipeline Stuck Resolution
- Capacity Throttling Guide
- Dataflow Optimization
- Workspace Monitoring Setup
- Remediation Runbook Template
External Resources
Source
https://github.com/PatrickGallucci/fabric-skills/blob/main/skills/fabric-data-factory-perf-remediate/SKILL.md
Overview
Systematic approach to diagnosing and resolving performance issues in Microsoft Fabric Data Factory pipelines, copy activities, and dataflows. It covers bottleneck classification, tuning knobs such as parallelCopies, DIU, ITO, and partitioning, plus monitoring and dataflow optimization to prevent timeouts, stalls, and throttling.
How This Skill Works
Identify the bottleneck category (Copy Activity Slow, Pipeline Stuck, Capacity Throttling, Dataflow Slow, Spark Job Queue). Collect diagnostics with the Get-FabricPipelineDiagnostics.ps1 script or by inspecting the Monitoring Hub. Apply targeted fixes from the reference guides (copy activity tuning, capacity management, dataflow optimization) and validate improvements with fresh runs.
When to Use It
- Pipeline execution is slower than expected
- Copy activities are slow or appear stuck
- Activities show In Progress or Not Started for extended periods
- HTTP 430 / TooManyRequestsForCapacity throttling occurs
- Dataflow Gen2 refresh is slow or timing out
Quick Start
- Step 1: Identify the bottleneck category from symptoms using the diagnostic workflow (Copy Activity Slow, Pipeline Stuck, Capacity Throttling, Dataflow Slow, Spark Job Queue)
- Step 2: Collect diagnostics with Get-FabricPipelineDiagnostics.ps1 or via Monitoring Hub (note queue time, transfer time, and breakdowns)
- Step 3: Apply targeted fixes from the reference guides and validate by re-running the pipeline and monitoring performance
Best Practices
- Use Monitoring Hub to establish a performance baseline and capture duration breakdowns
- Run the diagnostic script Get-FabricPipelineDiagnostics.ps1 to collect baseline metrics
- Set Intelligent Throughput Optimization to Maximum (or a custom 4-256) and tune parallelism
- Configure Degree of Copy Parallelism and enable Partition Option for SQL sources
- Review capacity SKUs and quotas; adjust workspace capacity to match workload and reduce throttling
Example Use Cases
- A slow pipeline with Not Started tasks is resolved by adjusting capacity throttling settings and increasing the capacity SKU
- Copy throughput improves after enabling Intelligent Throughput Optimization and calibrating parallelism to source type
- Spark job queueing is alleviated by aligning SKUs and monitoring queue times via the Monitoring Hub
- Dataflow Gen2 refresh speeds up after enabling Partition Option for SQL sources
- Long queue times disappear after applying targeted fixes from the capacity-throttling guide and re-running the diagnostic