What inputs are required?

A protein FASTA file named proteins.faa, plus MMseqs2 or ProteinOrtho available in the active environment.

What outputs are produced?

orthogroups.tsv, presence_absence.parquet, pangenome_report.md, and a logs/ directory, all under results/bio-protein-clustering-pangenome/.

How are paralogs handled?

The workflow includes steps to discriminate paralogs from orthologs in multi-copy gene families and identifies single-copy orthologs for phylogenetic analysis.
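
As a rough illustration of a first pass at this, the sketch below counts gene copies per genome in each orthogroup. The function name and the data layout (orthogroup ID mapped to `(genome, protein)` pairs) are hypothetical and not taken from the skill. Copy counts only flag candidate paralogs; separating them from orthologs reliably usually needs further evidence such as similarity scores or gene trees.

```python
from collections import Counter


def classify_orthogroups(orthogroups, n_genomes):
    """Split orthogroups into single-copy and multi-copy sets.

    orthogroups: dict mapping orthogroup ID -> list of (genome, protein) pairs.
    n_genomes: total number of genomes in the comparison.
    """
    single_copy, multi_copy = [], {}
    for og_id, members in orthogroups.items():
        copies = Counter(genome for genome, _ in members)
        if len(copies) == n_genomes and all(c == 1 for c in copies.values()):
            # Present exactly once in every genome: usable for phylogeny.
            single_copy.append(og_id)
        elif any(c > 1 for c in copies.values()):
            # Genomes holding more than one copy contain candidate paralogs.
            multi_copy[og_id] = sorted(g for g, c in copies.items() if c > 1)
    return single_copy, multi_copy


example = {
    "OG1": [("A", "a1"), ("B", "b1"), ("C", "c1")],
    "OG2": [("A", "a2"), ("A", "a3"), ("B", "b2")],
    "OG3": [("A", "a4")],
}
single_copy, multi_copy = classify_orthogroups(example, n_genomes=3)
print(single_copy)  # ['OG1']
print(multi_copy)   # {'OG2': ['A']}
```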

bio-protein-clustering-pangenome

npx machina-cli add skill fmschulz/omics-skills/bio-protein-clustering-pangenome --openclaw

Bio Protein Clustering Pangenome

Cluster proteins into orthogroups and derive pangenome matrices.

Instructions

Cluster proteins with MMseqs2 or ProteinOrtho.
Build presence/absence matrix.
Compute core/accessory/cloud/singleton partitions.
Identify single-copy orthologs for phylogenetic analysis.
Discriminate paralogs from orthologs in multi-copy gene families.
Calculate pangenome statistics (completeness, orthogroup occupancy).
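
The matrix and partition steps above can be sketched as follows. This is a minimal illustration, not the skill's implementation: the function names, the `(genome, protein)` layout, and the 15% cloud cutoff are assumptions, and partition definitions differ between pangenome tools, so the thresholds should come from the project.

```python
def presence_absence(orthogroups, genomes):
    """Return {orthogroup: {genome: 0 or 1}} from (genome, protein) members."""
    matrix = {}
    for og_id, members in orthogroups.items():
        present = {genome for genome, _ in members}
        matrix[og_id] = {g: int(g in present) for g in genomes}
    return matrix


def partition(matrix, cloud_max=0.15):
    """Label each orthogroup as core, accessory, cloud, or singleton.

    cloud_max is an assumed cutoff: orthogroups found in at most this
    fraction of genomes (but more than one genome) are called cloud.
    """
    labels = {}
    for og_id, row in matrix.items():
        n_genomes = len(row)
        n_present = sum(row.values())
        if n_present == n_genomes:
            labels[og_id] = "core"
        elif n_present == 1:
            labels[og_id] = "singleton"
        elif n_present / n_genomes <= cloud_max:
            labels[og_id] = "cloud"
        else:
            labels[og_id] = "accessory"
    return labels


genomes = [f"g{i}" for i in range(20)]
orthogroups = {
    "OG_core": [(g, f"{g}_p1") for g in genomes],
    "OG_acc": [(g, f"{g}_p2") for g in genomes[:10]],
    "OG_cloud": [(g, f"{g}_p3") for g in genomes[:2]],
    "OG_single": [("g0", "g0_p4")],
}
labels = partition(presence_absence(orthogroups, genomes))
print(labels)
```

The sketch keeps the matrix as nested dictionaries so it runs with the standard library alone; writing presence_absence.parquet would additionally need a Parquet-capable library.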

Quick Reference

Task	Action
Run workflow	Follow the steps in this skill and capture outputs.
Validate inputs	Confirm required inputs and reference data exist.
Review outputs	Inspect reports and QC gates before proceeding.
Tool docs	See `docs/README.md`.
References	See `references.md` and `../bio-skills-references.md`.

Input Requirements

Prerequisites:

Tools available in the active environment (Pixi/conda/system). See docs/README.md for expected tools.
Protein FASTA inputs are available.

Inputs:

proteins.faa (FASTA protein sequences)

Output

results/bio-protein-clustering-pangenome/orthogroups.tsv
results/bio-protein-clustering-pangenome/presence_absence.parquet
results/bio-protein-clustering-pangenome/pangenome_report.md
results/bio-protein-clustering-pangenome/logs/

Quality Gates

Verify proteins.faa is non-empty and amino acid encoded.
Cluster size distributions meet project thresholds.
Matrix completeness meets project thresholds.
On failure: retry with alternative parameters; if still failing, record in report and exit non-zero.
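
The input gate (proteins.faa non-empty and amino acid encoded) could be checked along these lines. The function name, the accepted alphabet, and the 90% nucleotide heuristic are assumptions made for this sketch; the skill does not say how it performs the check.

```python
from pathlib import Path

# IUPAC amino-acid letters plus ambiguity codes, stop (*) and gap (-).
PROTEIN_CHARS = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ*-")
NUCLEOTIDE_CHARS = set("ACGTUN")


def check_protein_fasta(path, max_nucleotide_fraction=0.9):
    """Return (n_sequences, problems) for a protein FASTA file."""
    n_sequences = 0
    residues = 0
    nucleotide_like = 0
    bad_chars = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_sequences += 1
            continue
        seq = line.upper()
        bad_chars |= set(seq) - PROTEIN_CHARS
        residues += len(seq)
        nucleotide_like += sum(ch in NUCLEOTIDE_CHARS for ch in seq)

    problems = []
    if bad_chars:
        problems.append(f"unexpected characters: {''.join(sorted(bad_chars))}")
    if n_sequences == 0 or residues == 0:
        problems.append("no sequences found")
    elif nucleotide_like / residues > max_nucleotide_fraction:
        # Heuristic only: a file made almost entirely of A/C/G/T/U/N
        # is more likely nucleotide than protein.
        problems.append("sequences look like nucleotides, not amino acids")
    return n_sequences, problems
```

A caller would treat a non-empty `problems` list as a failed gate, for example by recording it in the report and exiting non-zero.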

Examples

Example 1: Expected input layout

proteins.faa (FASTA protein sequences)

Troubleshooting

Issue: Missing inputs or reference databases
Solution: Verify paths and permissions before running the workflow.

Issue: Low-quality results or failed QC gates
Solution: Review reports, adjust parameters, and re-run the affected step.

Source

https://github.com/fmschulz/omics-skills/blob/main/skills/bio-protein-clustering-pangenome/SKILL.md

Overview

Clusters proteins into orthogroups using MMseqs2 or ProteinOrtho and derives a presence/absence matrix across genomes. It then computes core, accessory, cloud, and singleton partitions, identifies single-copy orthologs for phylogenetic analyses, and discriminates paralogs within multi-copy gene families to support robust pangenome statistics.

How This Skill Works

The workflow clusters proteins into orthogroups with MMseqs2 or ProteinOrtho, then builds a presence/absence matrix. It computes core/accessory/cloud/singleton partitions, identifies single-copy orthologs for phylogeny, and flags paralogs in multi-copy families before calculating pangenome statistics and producing reports.
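
For the final statistics step, one possible reading of "orthogroup occupancy" and "completeness" is sketched below. Both definitions are assumptions: occupancy is taken as the fraction of genomes in which an orthogroup is present, and completeness as the fraction of filled cells in the presence/absence matrix. The skill may define them differently.

```python
def occupancy(matrix):
    """Fraction of genomes in which each orthogroup is present."""
    return {og: sum(row.values()) / len(row) for og, row in matrix.items()}


def matrix_completeness(matrix):
    """Fraction of non-zero cells across the whole matrix."""
    cells = [value for row in matrix.values() for value in row.values()]
    return sum(cells) / len(cells) if cells else 0.0


matrix = {
    "OG1": {"A": 1, "B": 1, "C": 1, "D": 1},
    "OG2": {"A": 1, "B": 0, "C": 1, "D": 0},
}
print(occupancy(matrix))            # {'OG1': 1.0, 'OG2': 0.5}
print(matrix_completeness(matrix))  # 0.75
```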

When to Use It

Compare gene content across multiple strains or species to map the core and accessory genome.
Prepare data for phylogenetic analysis by extracting single-copy orthologs.
Assess pangenome completeness and orthogroup occupancy to quality-control assemblies.
Discriminate paralogs from orthologs in multi-copy gene families before functional annotation.
Generate a presence/absence matrix to support downstream functional and evolutionary analyses.

Quick Start

Step 1: Ensure input proteins.faa is present in your workspace.
Step 2: Run clustering with MMseqs2 or ProteinOrtho to generate orthogroups and the presence/absence matrix.
Step 3: Review outputs in results/bio-protein-clustering-pangenome (orthogroups.tsv, presence_absence.parquet, pangenome_report.md) and inspect logs.
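
Step 2 might look like the sketch below when MMseqs2 is the chosen tool. It builds an `mmseqs easy-cluster` call and reads the two-column (representative, member) TSV that command writes as `<prefix>_cluster.tsv`. The helper names, the `mmseqs` output prefix, the `tmp` scratch directory, and the identity and coverage thresholds are placeholders; the skill does not state which MMseqs2 mode or parameters it uses.

```python
import csv
import subprocess
from collections import defaultdict
from pathlib import Path

OUT_DIR = Path("results/bio-protein-clustering-pangenome")


def mmseqs_command(fasta="proteins.faa", min_seq_id=0.5, coverage=0.8):
    """Build an `mmseqs easy-cluster` call (thresholds are placeholders)."""
    return [
        "mmseqs", "easy-cluster", fasta,
        str(OUT_DIR / "mmseqs"),  # output prefix (assumed name)
        str(OUT_DIR / "tmp"),     # scratch directory (assumed name)
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
    ]


def read_clusters(tsv_path):
    """Group members by representative from a two-column cluster TSV."""
    clusters = defaultdict(list)
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            representative, member = row
            clusters[representative].append(member)
    return dict(clusters)


def run_clustering():
    """Run MMseqs2 and return the clusters (requires mmseqs on PATH)."""
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(mmseqs_command(), check=True)
    return read_clusters(OUT_DIR / "mmseqs_cluster.tsv")
```

Note that sequence clustering alone groups similar proteins; turning those clusters into orthogroups.tsv still depends on mapping each protein back to its genome.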

Best Practices

Verify proteins.faa is non-empty and amino acid encoded.
Ensure MMseqs2 or ProteinOrtho is installed and accessible in PATH.
Confirm input data layout (proteins.faa) and references exist before running.
Review outputs (orthogroups.tsv, presence_absence.parquet, pangenome_report.md) and QC logs prior to proceeding.
If clustering or QC gates fail, retry with alternative parameters and document changes in the report.
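
The PATH check from the list above can be done with the standard library. In this sketch the executable names are assumptions: `mmseqs` is the usual MMseqs2 binary, while the ProteinOrtho executable name can vary by version and install method (`proteinortho` is assumed here).

```python
import shutil

# Tool label -> executable name (the ProteinOrtho name is an assumption).
CANDIDATES = {"MMseqs2": "mmseqs", "ProteinOrtho": "proteinortho"}


def find_clustering_tool(which=shutil.which):
    """Return (tool label, executable path) for the first tool on PATH."""
    for label, executable in CANDIDATES.items():
        path = which(executable)
        if path:
            return label, path
    raise RuntimeError("Neither mmseqs nor proteinortho was found on PATH")
```

The `which` parameter exists only so the lookup can be replaced in tests; normal callers use `find_clustering_tool()` with no arguments.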

Example Use Cases

Example 1: Build an orthogroup set and presence/absence matrix for a microbial pangenome across multiple isolates.
Example 2: Identify core single-copy orthologs for downstream phylogenetic analysis.
Example 3: Distinguish paralogs from orthologs in multi-copy families to improve functional annotation.
Example 4: Assess pangenome completeness and orthogroup occupancy to gauge assembly and annotation quality.
Example 5: Generate a pangenome report (pangenome_report.md) and QC outputs to guide project decisions.

Frequently Asked Questions
