nextflow-development
npx machina-cli add skill anthropics/knowledge-work-plugins/nextflow-development --openclaw
nf-core Pipeline Deployment
Run nf-core bioinformatics pipelines on local or public sequencing data.
Target users: Bench scientists and researchers without specialized bioinformatics training who need to run large-scale omics analyses—differential expression, variant calling, or chromatin accessibility analysis.
Workflow Checklist
- [ ] Step 0: Acquire data (if from GEO/SRA)
- [ ] Step 1: Environment check (MUST pass)
- [ ] Step 2: Select pipeline (confirm with user)
- [ ] Step 3: Run test profile (MUST pass)
- [ ] Step 4: Create samplesheet
- [ ] Step 5: Configure & run (confirm genome with user)
- [ ] Step 6: Verify outputs
Step 0: Acquire Data (GEO/SRA Only)
Skip this step if user has local FASTQ files.
For public datasets, fetch from GEO/SRA first. See references/geo-sra-acquisition.md for the full workflow.
Quick start:
# 1. Get study info
python scripts/sra_geo_fetch.py info GSE110004
# 2. Download (interactive mode)
python scripts/sra_geo_fetch.py download GSE110004 -o ./fastq -i
# 3. Generate samplesheet
python scripts/sra_geo_fetch.py samplesheet GSE110004 --fastq-dir ./fastq -o samplesheet.csv
DECISION POINT: After fetching study info, confirm with user:
- Which sample subset to download (if multiple data types)
- Suggested genome and pipeline
Then continue to Step 1.
Step 1: Environment Check
Run this first. Pipelines will fail if the environment check does not pass.
python scripts/check_environment.py
All critical checks must pass. If any fail, provide fix instructions:
Docker issues
| Problem | Fix |
|---|---|
| Not installed | Install from https://docs.docker.com/get-docker/ |
| Permission denied | sudo usermod -aG docker $USER then re-login |
| Daemon not running | sudo systemctl start docker |
Nextflow issues
| Problem | Fix |
|---|---|
| Not installed | curl -s https://get.nextflow.io \| bash && mv nextflow ~/bin/ |
| Version < 23.04 | nextflow self-update |
Java issues
| Problem | Fix |
|---|---|
| Not installed / < 11 | sudo apt install openjdk-11-jdk |
Do not proceed until all checks pass. For HPC/Singularity, see references/troubleshooting.md.
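The checks above can be sketched in Python. This is a minimal illustration under stated assumptions, not the contents of scripts/check_environment.py, which is not shown here. The helper names parse_version, meets_minimum, and missing_tools are hypothetical; the minimum versions (Nextflow 23.04, Java 11) come from the tables above.

```python
import re
import shutil

# Minimum versions taken from the fix tables above.
MIN_NEXTFLOW = (23, 4)
MIN_JAVA = (11,)


def parse_version(text):
    """Extract the first dotted version number from a tool's version output."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum):
    """Compare version tuples, e.g. (23, 10, 1) >= (23, 4)."""
    return version >= minimum


def missing_tools(tools=("docker", "nextflow", "java")):
    """Return the tools that are not found on PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Comparing integer tuples avoids the string-comparison trap where "23.10" sorts before "23.4".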
Step 2: Select Pipeline
DECISION POINT: Confirm with user before proceeding.
| Data Type | Pipeline | Version | Goal |
|---|---|---|---|
| RNA-seq | rnaseq | 3.22.2 | Gene expression |
| WGS/WES | sarek | 3.7.1 | Variant calling |
| ATAC-seq | atacseq | 2.1.2 | Chromatin accessibility |
Auto-detect from data:
python scripts/detect_data_type.py /path/to/data
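For pipeline-specific details, see references/pipelines/rnaseq.md, references/pipelines/sarek.md, and references/pipelines/atacseq.md.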
Step 3: Run Test Profile
Validates environment with small data. MUST pass before real data.
nextflow run nf-core/<pipeline> -r <version> -profile test,docker --outdir test_output
| Pipeline | Command |
|---|---|
| rnaseq | nextflow run nf-core/rnaseq -r 3.22.2 -profile test,docker --outdir test_rnaseq |
| sarek | nextflow run nf-core/sarek -r 3.7.1 -profile test,docker --outdir test_sarek |
| atacseq | nextflow run nf-core/atacseq -r 2.1.2 -profile test,docker --outdir test_atacseq |
Verify:
ls test_output/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log
If test fails, see references/troubleshooting.md.
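The two verification commands can also be combined into one check. The following is a sketch only; run_succeeded is a hypothetical helper, and it assumes the report path and log message shown in the Verify commands above.

```python
from pathlib import Path

# Success line that the grep command above searches for.
SUCCESS_MARKER = "Pipeline completed successfully"


def run_succeeded(outdir, log_path=".nextflow.log"):
    """Return True if the MultiQC report exists and the log has the success line."""
    report = Path(outdir) / "multiqc" / "multiqc_report.html"
    log = Path(log_path)
    if not report.is_file() or not log.is_file():
        return False
    return SUCCESS_MARKER in log.read_text(errors="replace")
```

Pass the same directory that was given to --outdir, for example run_succeeded("test_rnaseq").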
Step 4: Create Samplesheet
Generate automatically
python scripts/generate_samplesheet.py /path/to/data <pipeline> -o samplesheet.csv
The script:
- Discovers FASTQ/BAM/CRAM files
- Pairs R1/R2 reads
- Infers sample metadata
- Validates before writing
For sarek: Script prompts for tumor/normal status if not auto-detected.
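To illustrate the R1/R2 pairing step, here is a minimal sketch. It is not the logic of scripts/generate_samplesheet.py, which is not shown here; the pair_reads helper and the _R1/_R2 file naming convention are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed naming convention: <sample>_R1.fq.gz, <sample>_R2.fastq.gz, and so on.
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.(?:fq|fastq)(?:\.gz)?$")


def pair_reads(filenames):
    """Group FASTQ file names by sample, returning {sample: (r1, r2_or_None)}."""
    grouped = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            grouped[match["sample"]][match["read"]] = name
    # Keep only samples that have an R1 file; single-end samples get None for R2.
    return {
        sample: (reads.get("1"), reads.get("2"))
        for sample, reads in sorted(grouped.items())
        if "1" in reads
    }
```

Files that do not match the assumed pattern are ignored rather than guessed at, so unusual naming schemes show up as missing samples instead of wrong pairs.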
Validate existing samplesheet
python scripts/generate_samplesheet.py --validate samplesheet.csv <pipeline>
Samplesheet formats
rnaseq:
sample,fastq_1,fastq_2,strandedness
SAMPLE1,/abs/path/R1.fq.gz,/abs/path/R2.fq.gz,auto
sarek:
patient,sample,lane,fastq_1,fastq_2,status
patient1,tumor,L001,/abs/path/tumor_R1.fq.gz,/abs/path/tumor_R2.fq.gz,1
patient1,normal,L001,/abs/path/normal_R1.fq.gz,/abs/path/normal_R2.fq.gz,0
atacseq:
sample,fastq_1,fastq_2,replicate
CONTROL,/abs/path/ctrl_R1.fq.gz,/abs/path/ctrl_R2.fq.gz,1
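The formats above can be checked with a small header validator. This is a sketch under assumptions: samplesheet_problems is a hypothetical helper, and the absolute-path check mirrors the /abs/path examples above rather than any rule stated by nf-core.

```python
import csv
import io

# Column headers copied from the samplesheet formats above.
REQUIRED_COLUMNS = {
    "rnaseq": ["sample", "fastq_1", "fastq_2", "strandedness"],
    "sarek": ["patient", "sample", "lane", "fastq_1", "fastq_2", "status"],
    "atacseq": ["sample", "fastq_1", "fastq_2", "replicate"],
}


def samplesheet_problems(text, pipeline):
    """Return a list of problems; an empty list means the sheet passed."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS[pipeline] if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if not (row["fastq_1"] or "").startswith("/"):
            problems.append(f"line {line_no}: fastq_1 is not an absolute path")
    return problems
```

Returning a list of messages instead of raising on the first error lets every problem in a sheet be reported in one pass.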
Step 5: Configure & Run
5a. Check genome availability
python scripts/manage_genomes.py check <genome>
# If not installed:
python scripts/manage_genomes.py download <genome>
Common genomes: GRCh38 (human), GRCh37 (legacy), GRCm39 (mouse), R64-1-1 (yeast), BDGP6 (fly)
5b. Decision points
DECISION POINT: Confirm with user:
- Genome: Which reference to use
- Pipeline-specific options:
- rnaseq: aligner (star_salmon recommended, hisat2 for low memory)
- sarek: tools (haplotypecaller for germline, mutect2 for somatic)
- atacseq: read_length (50, 75, 100, or 150)
5c. Run pipeline
nextflow run nf-core/<pipeline> \
-r <version> \
-profile docker \
--input samplesheet.csv \
--outdir results \
--genome <genome> \
-resume
Key flags:
- -r: Pin version
- -profile docker: Use Docker (or singularity for HPC)
- --genome: iGenomes key
- -resume: Continue from checkpoint
Resource limits (if needed):
--max_cpus 8 --max_memory '32.GB' --max_time '24.h'
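When the run is launched from a script, the command in 5c can be assembled as an argument list. This is a sketch; build_run_command is a hypothetical helper, and its defaults simply repeat the values used in 5c.

```python
import shlex


def build_run_command(pipeline, version, genome,
                      samplesheet="samplesheet.csv", outdir="results",
                      profile="docker", resume=True):
    """Assemble the nextflow invocation from Step 5c as an argument list."""
    command = [
        "nextflow", "run", f"nf-core/{pipeline}",
        "-r", version,            # pin the pipeline version
        "-profile", profile,      # docker, or singularity on HPC
        "--input", samplesheet,
        "--outdir", outdir,
        "--genome", genome,       # iGenomes key
    ]
    if resume:
        command.append("-resume")  # continue from checkpoint
    return command
```

shlex.join(build_run_command("rnaseq", "3.22.2", "GRCh38")) gives a copy-pasteable command line; passing the list itself to subprocess.run avoids shell quoting issues.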
Step 6: Verify Outputs
Check completion
ls results/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log
Key outputs by pipeline
rnaseq:
- results/star_salmon/salmon.merged.gene_counts.tsv - Gene counts
- results/star_salmon/salmon.merged.gene_tpm.tsv - TPM values
sarek:
- results/variant_calling/*/ - VCF files
- results/preprocessing/recalibrated/ - BAM files
atacseq:
- results/macs2/narrowPeak/ - Peak calls
- results/bwa/mergedLibrary/bigwig/ - Coverage tracks
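The per-pipeline outputs listed above can be checked programmatically. A minimal sketch, assuming the paths are relative to the --outdir directory; missing_outputs is a hypothetical helper, and the rnaseq entries apply to runs that used the star_salmon aligner.

```python
from pathlib import Path

# Expected outputs copied from the lists above; the sarek entry uses a glob.
KEY_OUTPUTS = {
    "rnaseq": [
        "star_salmon/salmon.merged.gene_counts.tsv",
        "star_salmon/salmon.merged.gene_tpm.tsv",
    ],
    "sarek": ["variant_calling/*", "preprocessing/recalibrated"],
    "atacseq": ["macs2/narrowPeak", "bwa/mergedLibrary/bigwig"],
}


def missing_outputs(results_dir, pipeline):
    """Return the expected output patterns that match nothing under results_dir."""
    root = Path(results_dir)
    return [
        pattern for pattern in KEY_OUTPUTS[pipeline]
        if not any(root.glob(pattern))
    ]
```

An empty list from missing_outputs("results", "rnaseq") means every listed output is present; it does not say anything about whether the contents are correct.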
Quick Reference
For common exit codes and fixes, see references/troubleshooting.md.
Resume failed run
nextflow run nf-core/<pipeline> -resume
References
- references/geo-sra-acquisition.md - Downloading public GEO/SRA data
- references/troubleshooting.md - Common issues and fixes
- references/installation.md - Environment setup
- references/pipelines/rnaseq.md - RNA-seq pipeline details
- references/pipelines/sarek.md - Variant calling details
- references/pipelines/atacseq.md - ATAC-seq details
Disclaimer
This skill is provided as a prototype example demonstrating how to integrate nf-core bioinformatics pipelines into Claude Code for automated analysis workflows. The current implementation supports three pipelines (rnaseq, sarek, and atacseq), serving as a foundation that enables the community to expand support to the full set of nf-core pipelines.
It is intended for educational and research purposes and should not be considered production-ready without appropriate validation for your specific use case. Users are responsible for ensuring their computing environment meets pipeline requirements and for verifying analysis results.
Anthropic does not guarantee the accuracy of bioinformatics outputs, and users should follow standard practices for validating computational analyses. This integration is not officially endorsed by or affiliated with the nf-core community.
Attribution
When publishing results, cite the appropriate pipeline. Citations are available in each nf-core repository's CITATIONS.md file (e.g., https://github.com/nf-core/rnaseq/blob/3.22.2/CITATIONS.md).
Licenses
- nf-core pipelines: MIT License (https://nf-co.re/about)
- Nextflow: Apache License, Version 2.0 (https://www.nextflow.io/about-us.html)
- NCBI SRA Toolkit: Public Domain (https://github.com/ncbi/sra-tools/blob/master/LICENSE)
Source
https://github.com/anthropics/knowledge-work-plugins/blob/main/bio-research/skills/nextflow-development/SKILL.md
Overview
nf-core Pipeline Deployment guides bench scientists to run nf-core RNA-seq (rnaseq), WGS/WES (sarek), and ATAC-seq (atacseq) workflows on local FASTQs or GEO/SRA datasets. The process covers environment checks, pipeline selection, samplesheet creation, and configuring runs to deliver reproducible omics analyses.
How This Skill Works
Start with an environment check (Python script), then select the nf-core pipeline and version, run a test profile, create a samplesheet, configure the genome, and execute the Nextflow run. The workflow supports containerized execution (docker) and can auto-detect data type to streamline setup.
When to Use It
- Analyzing RNA-seq data for gene expression using the rnaseq pipeline.
- Performing variant calling on WGS/WES data with the sarek pipeline.
- Assessing chromatin accessibility from ATAC-seq using atacseq.
- Reanalyzing public GEO/SRA datasets (GSE/GSM/SRR accessions) to reuse existing data.
- Creating and configuring a samplesheet for multi-sample projects.
Quick Start
- Step 1: python scripts/check_environment.py
- Step 2: Select the pipeline (rnaseq, sarek, or atacseq) and confirm the genome with the user
- Step 3: nextflow run nf-core/<pipeline> -r <version> -profile test,docker --outdir test_<pipeline>
Best Practices
- Run the environment check (Step 1) and the test profile (Step 3) before processing real data.
- Confirm the pipeline choice with the user in Step 2 and the genome in Step 5.
- Create a clean, well-structured samplesheet with sample IDs and FASTQ paths.
- Use containerized execution (docker or singularity) to improve reproducibility.
- Verify outputs with the MultiQC report and look for 'Pipeline completed successfully'.
Example Use Cases
- Run rnaseq on local RNA-seq data to generate gene expression counts and downstream differential expression.
- Reanalyze GEO dataset GSE110004 to compare with published results using rnaseq.
- Run sarek on whole-genome sequencing data to call variants.
- Analyze ATAC-seq data to identify differential chromatin accessibility with atacseq.
- Create and run a samplesheet for a multi-sample study sourced from GEO/SRA.