bio-protein-clustering-pangenome
npx machina-cli add skill fmschulz/omics-skills/bio-protein-clustering-pangenome --openclaw
Bio Protein Clustering Pangenome
Cluster proteins into orthogroups and derive pangenome matrices.
Instructions
- Cluster proteins with MMseqs2 or ProteinOrtho.
- Build presence/absence matrix.
- Compute core/accessory/cloud/singleton partitions.
- Identify single-copy orthologs for phylogenetic analysis.
- Discriminate paralogs from orthologs in multi-copy gene families.
- Calculate pangenome statistics (completeness, orthogroup occupancy).
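The matrix and partition steps above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: the input is assumed to be a list of (orthogroup, genome) pairs, and the `core_frac` and `cloud_frac` thresholds are hypothetical defaults that a project would replace with its own cutoffs.

```python
from collections import defaultdict


def presence_absence(memberships):
    """Map orthogroup -> genome -> copy count from (orthogroup, genome) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for orthogroup, genome in memberships:
        counts[orthogroup][genome] += 1
    return {og: dict(per_genome) for og, per_genome in counts.items()}


def partition(counts, n_genomes, core_frac=1.0, cloud_frac=0.15):
    """Label each orthogroup by the fraction of genomes that carry it.

    The thresholds are illustrative assumptions, not values from the skill.
    """
    labels = {}
    for og, per_genome in counts.items():
        n_present = len(per_genome)
        frac = n_present / n_genomes
        if n_present == 1:
            labels[og] = "singleton"   # found in exactly one genome
        elif frac >= core_frac:
            labels[og] = "core"        # found in (nearly) all genomes
        elif frac <= cloud_frac:
            labels[og] = "cloud"       # rare, but in more than one genome
        else:
            labels[og] = "accessory"   # everything in between
    return labels


# Toy data: four genomes (gA..gD), four orthogroups.
memberships = [
    ("OG1", "gA"), ("OG1", "gB"), ("OG1", "gC"), ("OG1", "gD"),
    ("OG2", "gA"), ("OG2", "gB"),
    ("OG3", "gC"),
    ("OG4", "gA"), ("OG4", "gA"), ("OG4", "gB"), ("OG4", "gC"),
]
counts = presence_absence(memberships)
labels = partition(counts, n_genomes=4, cloud_frac=0.5)
print(labels)
```

Keeping copy counts rather than 0/1 values means the same structure can later be used to find single-copy orthologs; a binary presence/absence matrix is obtained by testing whether a genome key exists.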
Quick Reference
| Task | Action |
|---|---|
| Run workflow | Follow the steps in this skill and capture outputs. |
| Validate inputs | Confirm required inputs and reference data exist. |
| Review outputs | Inspect reports and QC gates before proceeding. |
| Tool docs | See docs/README.md. |
| References | See references.md and ../bio-skills-references.md. |
Input Requirements
Prerequisites:
- Tools available in the active environment (Pixi/conda/system). See docs/README.md for expected tools.
- Protein FASTA inputs are available.
Inputs:
- proteins.faa (FASTA protein sequences)
Output
- results/bio-protein-clustering-pangenome/orthogroups.tsv
- results/bio-protein-clustering-pangenome/presence_absence.parquet
- results/bio-protein-clustering-pangenome/pangenome_report.md
- results/bio-protein-clustering-pangenome/logs/
Quality Gates
- Cluster size distributions meet project thresholds.
- Matrix completeness meets project thresholds.
- On failure: retry with alternative parameters; if still failing, record in report and exit non-zero.
- Verify proteins.faa is non-empty and amino acid encoded.
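One possible way to implement the input check is shown below. It is a sketch under stated assumptions: the accepted alphabet (IUPAC amino acid codes plus `*` and `-`) and the `max_nucleotide_frac` heuristic for spotting nucleotide files are choices made for this example, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed alphabet: 20 standard residues plus ambiguity codes, U/O, stop and gap.
AMINO = set("ACDEFGHIKLMNPQRSTVWYBXZJUO*-")
# Letters that also occur in nucleotide FASTA files.
NUCLEOTIDE = set("ACGTUN")


def check_protein_fasta(path, max_nucleotide_frac=0.9):
    """Return (n_records, problems) for a FASTA file expected to hold proteins."""
    n_records = 0
    residues = 0
    nucleotide_like = 0
    problems = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1
            continue
        seq = line.upper()
        bad = set(seq) - AMINO
        if bad:
            problems.append(f"unexpected characters: {sorted(bad)}")
            break
        residues += len(seq)
        nucleotide_like += sum(ch in NUCLEOTIDE for ch in seq)
    if n_records == 0 or residues == 0:
        problems.append("no sequences found")
    elif nucleotide_like / residues > max_nucleotide_frac:
        # Heuristic only: a file made almost entirely of A/C/G/T/U/N is suspicious.
        problems.append("sequences look like nucleotides, not amino acids")
    return n_records, problems


def gate(path):
    """Return a process exit code: 0 if the input passes, 1 otherwise."""
    _, problems = check_protein_fasta(path)
    for problem in problems:
        print(f"QC failure: {problem}")
    return 1 if problems else 0
```

A wrapper script could call `sys.exit(gate("proteins.faa"))` so that a failed check ends the run with a non-zero status, as the gate above requires.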
Examples
Example 1: Expected input layout
proteins.faa (FASTA protein sequences)
Troubleshooting
Issue: Missing inputs or reference databases
Solution: Verify paths and permissions before running the workflow.
Issue: Low-quality results or failed QC gates
Solution: Review reports, adjust parameters, and re-run the affected step.
Source
https://github.com/fmschulz/omics-skills/blob/main/skills/bio-protein-clustering-pangenome/SKILL.md
Overview
Clusters proteins into orthogroups using MMseqs2 or ProteinOrtho and derives a presence/absence matrix across genomes. It then computes core, accessory, cloud, and singleton partitions, identifies single-copy orthologs for phylogenetic analyses, and discriminates paralogs within multi-copy gene families to support robust pangenome statistics.
How This Skill Works
The workflow clusters proteins into orthogroups with MMseqs2 or ProteinOrtho, then builds a presence/absence matrix. It computes core/accessory/cloud/singleton partitions, identifies single-copy orthologs for phylogeny, and flags paralogs in multi-copy families before calculating pangenome statistics and producing reports.
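As a rough illustration of the single-copy and paralog steps, the sketch below works from per-genome copy counts. It assumes a dictionary of the form orthogroup -> genome -> copy count; the function names and the `min_occupancy` parameter are hypothetical. Copy number alone only flags candidate paralogs; separating paralogs from orthologs with confidence usually needs further evidence such as gene trees or synteny.

```python
def single_copy_orthogroups(counts, genomes, min_occupancy=1.0):
    """Orthogroups with exactly one member in every genome that carries them.

    min_occupancy is the minimum fraction of genomes that must be represented
    (1.0 = strict single-copy core); the default is an assumption.
    """
    selected = []
    for og, per_genome in sorted(counts.items()):
        occupancy = len(per_genome) / len(genomes)
        if occupancy >= min_occupancy and all(c == 1 for c in per_genome.values()):
            selected.append(og)
    return selected


def multi_copy_families(counts):
    """Map orthogroup -> genomes holding more than one member (candidate paralogs)."""
    flagged = {}
    for og, per_genome in counts.items():
        duplicated = sorted(g for g, c in per_genome.items() if c > 1)
        if duplicated:
            flagged[og] = duplicated
    return flagged


genomes = ["gA", "gB", "gC"]
counts = {
    "OG1": {"gA": 1, "gB": 1, "gC": 1},  # single copy everywhere
    "OG2": {"gA": 2, "gB": 1, "gC": 1},  # duplicated in gA
    "OG3": {"gA": 1, "gB": 1},           # single copy, missing from gC
}
print(single_copy_orthogroups(counts, genomes))
print(multi_copy_families(counts))
```

Lowering `min_occupancy` trades marker count against missing data in the downstream alignment, which is a project-level decision.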
When to Use It
- Compare gene content across multiple strains or species to map the core and accessory genome.
- Prepare data for phylogenetic analysis by extracting single-copy orthologs.
- Assess pangenome completeness and orthogroup occupancy to quality-control assemblies.
- Discriminate paralogs from orthologs in multi-copy gene families before functional annotation.
- Generate a presence/absence matrix to support downstream functional and evolutionary analyses.
Quick Start
- Step 1: Ensure input proteins.faa is present in your workspace.
- Step 2: Run clustering with MMseqs2 or ProteinOrtho to generate orthogroups and the presence/absence matrix.
- Step 3: Review outputs in results/bio-protein-clustering-pangenome (orthogroups.tsv, presence_absence.parquet, pangenome_report.md) and inspect logs.
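Step 2 could be scripted roughly as follows, assuming the MMseqs2 route. The sketch uses the `mmseqs easy-cluster` workflow, which writes a two-column representative/member table to `<prefix>_cluster.tsv`. The parameter values (`--min-seq-id 0.5`, `-c 0.8`), the `OG000001`-style identifiers, and the two-column layout written to orthogroups.tsv are assumptions for illustration; the skill's actual settings and file layout may differ.

```python
import shutil
import subprocess
from collections import defaultdict


def mmseqs_command(faa, prefix, tmp_dir, min_seq_id=0.5, coverage=0.8):
    """Build the easy-cluster command line (threshold values are assumptions)."""
    return ["mmseqs", "easy-cluster", faa, prefix, tmp_dir,
            "--min-seq-id", str(min_seq_id), "-c", str(coverage), "--cov-mode", "0"]


def run_clustering(faa, prefix, tmp_dir):
    """Run MMseqs2 and return the path of the cluster table it produces."""
    if shutil.which("mmseqs") is None:
        raise RuntimeError("mmseqs not found on PATH")
    subprocess.run(mmseqs_command(faa, prefix, tmp_dir), check=True)
    return f"{prefix}_cluster.tsv"


def read_clusters(cluster_tsv):
    """Group members by representative from a representative<TAB>member table."""
    clusters = defaultdict(list)
    with open(cluster_tsv) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            representative, member = fields
            clusters[representative].append(member)
    return dict(clusters)


def write_orthogroups(clusters, out_tsv):
    """Write one row per protein with a generated orthogroup ID (hypothetical layout)."""
    with open(out_tsv, "w") as handle:
        handle.write("orthogroup\tprotein\n")
        for index, representative in enumerate(sorted(clusters), start=1):
            for member in clusters[representative]:
                handle.write(f"OG{index:06d}\t{member}\n")
```

A run would then chain `run_clustering("proteins.faa", "clusters", "tmp")`, `read_clusters(...)`, and `write_orthogroups(..., "results/bio-protein-clustering-pangenome/orthogroups.tsv")`. Building the matrix additionally requires a way to map each protein to its genome, which the skill text does not specify.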
Best Practices
- Verify proteins.faa is non-empty and amino acid encoded.
- Ensure MMseqs2 or ProteinOrtho is installed and accessible in PATH.
- Confirm input data layout (proteins.faa) and references exist before running.
- Review outputs (orthogroups.tsv, presence_absence.parquet, pangenome_report.md) and QC logs prior to proceeding.
- If clustering or QC gates fail, retry with alternative parameters and document changes in the report.
Example Use Cases
- Example 1: Build an orthogroup set and presence/absence matrix for a microbial pangenome across multiple isolates.
- Example 2: Identify core single-copy orthologs for downstream phylogenetic analysis.
- Example 3: Distinguish paralogs from orthologs in multi-copy families to improve functional annotation.
- Example 4: Assess pangenome completeness and orthogroup occupancy to gauge assembly and annotation quality.
- Example 5: Generate a pangenome report (pangenome_report.md) and QC outputs to guide project decisions.
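For Example 4, one way to compute the two statistics is sketched below. The definitions are assumptions made for this illustration, since the skill does not spell them out: occupancy is taken as the fraction of genomes carrying an orthogroup, and a genome's completeness as the share of "soft-core" orthogroups (occupancy at or above a hypothetical `soft_core_frac`) that it contains.

```python
def orthogroup_occupancy(counts, n_genomes):
    """Fraction of genomes in which each orthogroup is present."""
    return {og: len(per_genome) / n_genomes for og, per_genome in counts.items()}


def genome_completeness(counts, genomes, soft_core_frac=0.75):
    """Share of soft-core orthogroups found in each genome.

    soft_core_frac is an illustrative threshold, not a value from the skill.
    """
    occupancy = orthogroup_occupancy(counts, len(genomes))
    soft_core = [og for og, frac in occupancy.items() if frac >= soft_core_frac]
    if not soft_core:
        return {genome: 0.0 for genome in genomes}
    return {
        genome: sum(genome in counts[og] for og in soft_core) / len(soft_core)
        for genome in genomes
    }


genomes = ["gA", "gB", "gC", "gD"]
counts = {
    "OG1": {"gA": 1, "gB": 1, "gC": 1, "gD": 1},
    "OG2": {"gA": 1, "gB": 1, "gC": 1},
    "OG3": {"gA": 1},
}
print(orthogroup_occupancy(counts, len(genomes)))
print(genome_completeness(counts, genomes))
```

A genome that scores well below the others on this measure is a candidate for a fragmented assembly or incomplete annotation, which is the kind of signal the report and QC gates are meant to surface.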