bio-protein-clustering-pangenome

npx machina-cli add skill fmschulz/omics-skills/bio-protein-clustering-pangenome --openclaw

Bio Protein Clustering Pangenome

Cluster proteins into orthogroups and derive pangenome matrices.

Instructions

  1. Cluster proteins with MMseqs2 or ProteinOrtho.
  2. Build presence/absence matrix.
  3. Compute core/accessory/cloud/singleton partitions.
  4. Identify single-copy orthologs for phylogenetic analysis.
  5. Discriminate paralogs from orthologs in multi-copy gene families.
  6. Calculate pangenome statistics (completeness, orthogroup occupancy).
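
Steps 2 and 3 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the skill's own implementation: it assumes orthogroup membership is available as (orthogroup, genome) pairs, and the function names and partition thresholds (core = every genome, cloud = at most 15% of genomes) are hypothetical defaults that a project would tune.

```python
from collections import defaultdict


def build_presence_absence(memberships):
    """Map each orthogroup to the set of genomes that contain it.

    `memberships` is an iterable of (orthogroup_id, genome_id) pairs;
    this input shape is an assumption made for the sketch.
    """
    presence = defaultdict(set)
    for orthogroup, genome in memberships:
        presence[orthogroup].add(genome)
    return dict(presence)


def partition_orthogroups(presence, n_genomes, cloud_max_fraction=0.15):
    """Assign each orthogroup to core, accessory, cloud, or singleton.

    Assumed, project-tunable definitions:
      core      - present in every genome
      singleton - present in exactly one genome
      cloud     - present in more than one genome, but in at most
                  cloud_max_fraction of all genomes
      accessory - everything else
    """
    partitions = {}
    for orthogroup, genomes in presence.items():
        count = len(genomes)
        if count == n_genomes:
            partitions[orthogroup] = "core"
        elif count == 1:
            partitions[orthogroup] = "singleton"
        elif count / n_genomes <= cloud_max_fraction:
            partitions[orthogroup] = "cloud"
        else:
            partitions[orthogroup] = "accessory"
    return partitions
```

Keeping the thresholds as parameters makes it easy to record the exact definitions used in pangenome_report.md.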

Quick Reference

Task             Action
Run workflow     Follow the steps in this skill and capture outputs.
Validate inputs  Confirm required inputs and reference data exist.
Review outputs   Inspect reports and QC gates before proceeding.
Tool docs        See docs/README.md.
References       See references.md and ../bio-skills-references.md.

Input Requirements

Prerequisites:

  • Tools available in the active environment (Pixi/conda/system). See docs/README.md for expected tools.
  • Protein FASTA inputs are available.

Inputs:

  • proteins.faa (FASTA protein sequences)

Output

  • results/bio-protein-clustering-pangenome/orthogroups.tsv
  • results/bio-protein-clustering-pangenome/presence_absence.parquet
  • results/bio-protein-clustering-pangenome/pangenome_report.md
  • results/bio-protein-clustering-pangenome/logs/
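
A quick way to confirm that a run produced everything listed above is to check the output directory before moving on. The sketch below is only an illustration: the helper name `missing_outputs` is hypothetical, and treating an empty file as missing is an assumption rather than a rule stated by the skill.

```python
from pathlib import Path

# Output names taken from the list above; "logs" is a directory.
EXPECTED_OUTPUTS = (
    "orthogroups.tsv",
    "presence_absence.parquet",
    "pangenome_report.md",
    "logs",
)


def missing_outputs(results_dir="results/bio-protein-clustering-pangenome"):
    """Return the expected output names that are absent or empty."""
    root = Path(results_dir)
    missing = []
    for name in EXPECTED_OUTPUTS:
        path = root / name
        if name == "logs":
            ok = path.is_dir()
        else:
            # Assumption: a zero-byte file counts as a missing output.
            ok = path.is_file() and path.stat().st_size > 0
        if not ok:
            missing.append(name)
    return missing
```

An empty return value means all four outputs are in place; anything else can be written to the report before exiting non-zero.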

Quality Gates

  • Verify proteins.faa is non-empty and amino acid encoded.
  • Cluster size distributions meet project thresholds.
  • Matrix completeness meets project thresholds.
  • On failure: retry with alternative parameters; if still failing, record in report and exit non-zero.
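
The input check on proteins.faa could look like the following. This is a heuristic sketch under stated assumptions: the accepted alphabet (the 20 standard residues plus B, X, Z, J, U, O and the stop symbol `*`) and the rule that a file using only A/C/G/T/U/N letters is probably nucleotide data are illustrative choices, and `validate_protein_fasta` is a hypothetical name.

```python
# Assumed alphabets for the heuristic; adjust to project conventions.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBXZJUO*")
NUCLEOTIDES = set("ACGTUN")


def validate_protein_fasta(lines):
    """Return (n_records, problems) for FASTA text given as an iterable of lines."""
    n_records = 0
    residues_seen = set()
    problems = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1
            continue
        if n_records == 0:
            problems.append(f"line {line_no}: sequence data before first header")
        chars = set(line.upper())
        invalid = chars - AMINO_ACIDS
        if invalid:
            problems.append(f"line {line_no}: unexpected characters {sorted(invalid)}")
        residues_seen |= chars
    if n_records == 0:
        problems.append("no FASTA records found")
    elif residues_seen and residues_seen <= NUCLEOTIDES:
        # Every residue is also a nucleotide code, so this is likely DNA/RNA.
        problems.append("sequences use only nucleotide letters; input may be DNA/RNA")
    return n_records, problems
```

Typical use is `with open("proteins.faa") as fh: n, problems = validate_protein_fasta(fh)`, failing the gate when `problems` is non-empty.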

Examples

Example 1: Expected input layout

proteins.faa (FASTA protein sequences)

Troubleshooting

Issue: Missing inputs or reference databases
Solution: Verify paths and permissions before running the workflow.

Issue: Low-quality results or failed QC gates
Solution: Review reports, adjust parameters, and re-run the affected step.

Source

git clone https://github.com/fmschulz/omics-skills

View on GitHub: https://github.com/fmschulz/omics-skills/blob/main/skills/bio-protein-clustering-pangenome/SKILL.md

Overview

Clusters proteins into orthogroups using MMseqs2 or ProteinOrtho and derives a presence/absence matrix across genomes. It then computes core, accessory, cloud, and singleton partitions, identifies single-copy orthologs for phylogenetic analyses, and discriminates paralogs within multi-copy gene families to support robust pangenome statistics.

How This Skill Works

The workflow clusters proteins into orthogroups with MMseqs2 or ProteinOrtho, then builds a presence/absence matrix. It computes core/accessory/cloud/singleton partitions, identifies single-copy orthologs for phylogeny, and flags paralogs in multi-copy families before calculating pangenome statistics and producing reports.
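
The single-copy and multi-copy distinction described here comes down to counting proteins per orthogroup and genome. The sketch below assumes membership is available as (orthogroup, genome, protein) triples, which is an assumed input shape, and uses hypothetical function names. It only flags multi-copy families as candidates for paralog review; separating paralogs from orthologs within them needs further evidence such as gene trees or synteny.

```python
from collections import Counter, defaultdict


def copy_number_table(memberships):
    """Count proteins per orthogroup and genome from (orthogroup, genome, protein) triples."""
    counts = defaultdict(Counter)
    for orthogroup, genome, _protein in memberships:
        counts[orthogroup][genome] += 1
    return counts


def classify_by_copy_number(counts, genomes):
    """Return (single_copy_core, multi_copy) lists of orthogroup IDs.

    single_copy_core: exactly one copy in every genome (usable for phylogeny).
    multi_copy: more than one copy in at least one genome (paralog candidates).
    """
    genomes = set(genomes)
    single_copy_core, multi_copy = [], []
    for orthogroup, per_genome in sorted(counts.items()):
        if any(n > 1 for n in per_genome.values()):
            multi_copy.append(orthogroup)
        elif set(per_genome) == genomes:
            single_copy_core.append(orthogroup)
    return single_copy_core, multi_copy
```

Orthogroups that are single-copy but missing from some genomes fall into neither list, which keeps the phylogeny set strictly complete.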

When to Use It

  • Compare gene content across multiple strains or species to map the core and accessory genome.
  • Prepare data for phylogenetic analysis by extracting single-copy orthologs.
  • Assess pan-genome completeness and orthogroup occupancy to quality-control assemblies.
  • Discriminate paralogs from orthologs in multi-copy gene families before functional annotation.
  • Generate a presence/absence matrix to support downstream functional and evolutionary analyses.
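
For the completeness and occupancy use case, the skill does not define either metric, so the sketch below adopts working definitions that are purely assumptions: occupancy is the fraction of genomes carrying an orthogroup, and a genome's completeness is the fraction of high-occupancy orthogroups (at or above a hypothetical `min_occupancy` cutoff) that it contains. The `presence` argument is assumed to map orthogroup IDs to sets of genome IDs.

```python
def orthogroup_occupancy(presence, n_genomes):
    """Fraction of genomes that carry each orthogroup."""
    return {og: len(genomes) / n_genomes for og, genomes in presence.items()}


def genome_completeness(presence, genomes, min_occupancy=0.9):
    """Per-genome fraction of widely shared orthogroups that are present.

    Orthogroups with occupancy >= min_occupancy form the reference set;
    both the cutoff and this definition are assumptions for illustration.
    """
    occupancy = orthogroup_occupancy(presence, len(genomes))
    reference = [og for og, frac in occupancy.items() if frac >= min_occupancy]
    if not reference:
        return {genome: 0.0 for genome in genomes}
    return {
        genome: sum(genome in presence[og] for og in reference) / len(reference)
        for genome in genomes
    }
```

A genome whose completeness falls well below the others is a natural candidate for an assembly or annotation review.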

Quick Start

  1. Ensure the input file proteins.faa is present in your workspace.
  2. Run clustering with MMseqs2 or ProteinOrtho to generate orthogroups and the presence/absence matrix.
  3. Review the outputs in results/bio-protein-clustering-pangenome (orthogroups.tsv, presence_absence.parquet, pangenome_report.md) and inspect the logs.
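
If MMseqs2 is the chosen tool, the clustering step might be driven from Python roughly as below. This is a sketch under several assumptions: the `--min-seq-id` and `-c` values are illustrative rather than the skill's settings, the output prefix and temporary directory names are hypothetical, and the genome of each protein is assumed to be encoded in its ID as `genome|protein`, which the skill does not specify. The parser reads the two-column representative/member TSV that `mmseqs easy-cluster` writes as `<prefix>_cluster.tsv`.

```python
import csv
from collections import defaultdict


def mmseqs_easy_cluster_cmd(fasta="proteins.faa", prefix="clusters", tmp_dir="tmp",
                            min_seq_id=0.5, coverage=0.8):
    """Build an argument list for `mmseqs easy-cluster` (pass it to subprocess.run)."""
    return [
        "mmseqs", "easy-cluster", fasta, prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # illustrative identity threshold
        "-c", str(coverage),              # illustrative coverage threshold
    ]


def read_cluster_tsv(handle, genome_of=lambda pid: pid.split("|", 1)[0]):
    """Parse representative<TAB>member rows into (orthogroup, genome, protein) triples.

    `genome_of` extracts a genome ID from a protein ID; the default assumes
    IDs of the form "genome|protein".
    """
    members = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        representative, member = row[0], row[1]
        members[representative].append(member)
    triples = []
    for index, proteins in enumerate(members.values(), start=1):
        orthogroup = f"OG{index:06d}"  # hypothetical ID scheme
        for protein in proteins:
            triples.append((orthogroup, genome_of(protein), protein))
    return triples
```

The resulting triples can be written to orthogroups.tsv or fed into a presence/absence matrix builder.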

Best Practices

  • Verify proteins.faa is non-empty and amino acid encoded.
  • Ensure MMseqs2 or ProteinOrtho is installed and accessible in PATH.
  • Confirm input data layout (proteins.faa) and references exist before running.
  • Review outputs (orthogroups.tsv, presence_absence.parquet, pangenome_report.md) and QC logs prior to proceeding.
  • If clustering or QC gates fail, retry with alternative parameters and document changes in the report.
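
The retry-and-document practice can be expressed as a small driver loop. The sketch below is an assumption about how one might structure it, not part of the skill: `run_step` and `passes_gate` are hypothetical callables supplied by the caller, and the returned exit code follows the "exit non-zero on failure" rule from the quality gates.

```python
def run_with_retries(run_step, passes_gate, parameter_sets):
    """Try each parameter set in order until the quality gate passes.

    Returns (exit_code, attempts). `attempts` records every parameter set
    tried and whether it passed, so the changes can be written to the report.
    """
    attempts = []
    for params in parameter_sets:
        result = run_step(**params)
        passed = bool(passes_gate(result))
        attempts.append({"params": params, "passed": passed})
        if passed:
            return 0, attempts
    # Every parameter set failed the gate (or none were supplied).
    return 1, attempts
```

A caller would pass the exit code to `sys.exit` after appending `attempts` to pangenome_report.md.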

Example Use Cases

  • Example 1: Build an orthogroup set and presence/absence matrix for a microbial pan-genome across multiple isolates.
  • Example 2: Identify core single-copy orthologs for downstream phylogenetic analysis.
  • Example 3: Distinguish paralogs from orthologs in multi-copy families to improve functional annotation.
  • Example 4: Assess pan-genome completeness and orthogroup occupancy to gauge assembly and annotation quality.
  • Example 5: Generate a pangenome report (pangenome_report.md) and QC outputs to guide project decisions.
