What inputs are required?

Inputs are proteins.faa (FASTA protein sequences) and reference_db/, which contains the eggNOG, InterPro, and DIAMOND databases plus the taxdump. Set BIO_DB_ROOT and make sure the inputs are readable.

Where are outputs stored?

Outputs are written to results/bio-annotation, including annotations.parquet, taxonomy.parquet, annotation_report.md and logs/.

What if QC gates fail?

Review annotation_report.md and retry the affected steps with adjusted parameters. If the gates still fail, record the failure in the report and exit non-zero to stop the pipeline.

bio-annotation

npx machina-cli add skill fmschulz/omics-skills/bio-annotation --openclaw

Bio Annotation

Functional annotation and taxonomy inference from sequence homology.

Instructions

Run InterProScan for domain/family annotation.
Run eggnog-mapper for orthology-based annotation.
Run DIAMOND and resolve taxonomy with TaxonKit.

Quick Reference

Task	Action
Run workflow	Follow the steps in this skill and capture outputs.
Validate inputs	Confirm required inputs and reference data exist.
Review outputs	Inspect reports and QC gates before proceeding.
Tool docs	See `docs/README.md`.
References	See `../bio-skills-references.md`.

Input Requirements

Prerequisites:

Tools available in the active environment (Pixi/conda/system). See docs/README.md for expected tools.
Reference DB root: set BIO_DB_ROOT (default /media/shared-expansion/db/ on WSU).
Input FASTA and reference DBs are readable.

Inputs:

proteins.faa (FASTA protein sequences).
reference_db/ (eggNOG, InterPro, DIAMOND databases + taxdump).
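
The prerequisites above can be checked before any tool runs. The following is a minimal sketch, not part of the skill itself: the function names `resolve_db_root` and `unreadable_inputs` are hypothetical, and the fallback path is the default quoted in the prerequisites.

```python
import os
from pathlib import Path

# Default taken from the prerequisites above; BIO_DB_ROOT overrides it.
DEFAULT_DB_ROOT = "/media/shared-expansion/db/"


def resolve_db_root(env=None):
    """Return the reference DB root from BIO_DB_ROOT, or the documented default."""
    env = os.environ if env is None else env
    return Path(env.get("BIO_DB_ROOT") or DEFAULT_DB_ROOT)


def unreadable_inputs(paths):
    """Return the paths that are missing or not readable by the current user."""
    return [str(p) for p in paths if not os.access(p, os.R_OK)]
```

Calling `unreadable_inputs(["proteins.faa", resolve_db_root()])` returns an empty list when both the FASTA file and the DB root are readable; a non-empty result is a reason to stop before the workflow starts.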

Output

results/bio-annotation/annotations.parquet
results/bio-annotation/taxonomy.parquet
results/bio-annotation/annotation_report.md
results/bio-annotation/logs/
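
A quick way to confirm a run produced everything listed above is to compare the results directory against that list. This is a sketch under assumptions: `missing_outputs` is a hypothetical helper, and treating an empty file as missing is a choice made here, not something the skill specifies.

```python
from pathlib import Path

# Output names as listed above, relative to the results directory.
EXPECTED_OUTPUTS = (
    "annotations.parquet",
    "taxonomy.parquet",
    "annotation_report.md",
    "logs",
)


def missing_outputs(results_dir="results/bio-annotation"):
    """Return expected outputs that are absent or, for files, empty."""
    root = Path(results_dir)
    missing = []
    for name in EXPECTED_OUTPUTS:
        path = root / name
        if name == "logs":
            ok = path.is_dir()  # logs/ is a directory, not a file
        else:
            ok = path.is_file() and path.stat().st_size > 0
        if not ok:
            missing.append(name)
    return missing
```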

Quality Gates

Annotation hit rate and taxonomy rank coverage meet project thresholds.
On failure: retry with alternative parameters; if still failing, record in report and exit non-zero.
Verify proteins.faa is non-empty and amino acid encoded.
Verify required reference DBs exist under the reference root.
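
The hit-rate and rank-coverage gates, together with the retry-then-fail rule, can be sketched as below. The skill leaves the thresholds to the project, so the `0.5` defaults are placeholders. The definitions used here are also assumptions: hit rate is the fraction of proteins with at least one annotation, and rank coverage is the fraction with taxonomy resolved at the rank the project cares about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Placeholder values only; the skill defers to project thresholds.
    min_hit_rate: float = 0.5
    min_rank_coverage: float = 0.5


def failed_gates(n_proteins, n_annotated, n_ranked, thresholds=GateThresholds()):
    """Return the names of the gates that fall below their thresholds."""
    if n_proteins <= 0:
        return ["non_empty_input"]
    failures = []
    if n_annotated / n_proteins < thresholds.min_hit_rate:
        failures.append("annotation_hit_rate")
    if n_ranked / n_proteins < thresholds.min_rank_coverage:
        failures.append("taxonomy_rank_coverage")
    return failures


def exit_code(attempts):
    """Return 0 if any attempt passes every gate, otherwise 1.

    `attempts` holds one (n_proteins, n_annotated, n_ranked) tuple per run:
    the first run, then each retry with alternative parameters.
    """
    for counts in attempts:
        if not failed_gates(*counts):
            return 0
    return 1
```

A caller would pass the result of `exit_code(...)` to `sys.exit` after writing the failure into annotation_report.md, so a run that still fails after its retries stops the pipeline with a non-zero status.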

Examples

Example 1: Expected input layout

proteins.faa (FASTA protein sequences).
reference_db/ (eggNOG, InterPro, DIAMOND databases + taxdump).

Troubleshooting

Issue: Missing inputs or reference databases
Solution: Verify paths and permissions before running the workflow.

Issue: Low-quality results or failed QC gates
Solution: Review reports, adjust parameters, and re-run the affected step.

Source

git clone https://github.com/fmschulz/omics-skills.git

SKILL.md: https://github.com/fmschulz/omics-skills/blob/main/skills/bio-annotation/SKILL.md

Overview

Bio-annotation delivers functional labels and taxonomy for proteins by integrating domain/family annotations from InterProScan, orthology-based annotations from eggnog-mapper, and taxonomic context from DIAMOND hits resolved with TaxonKit. Outputs include parquet summaries and a readable annotation report, with QC gates to ensure data quality.

How This Skill Works

Input proteins.faa and reference_db are processed through a three-step pipeline: InterProScan annotates domains and families, eggnog-mapper provides orthology-based functional annotations, and DIAMOND identifies close homologs while TaxonKit resolves taxonomy. The results are stored as parquet files and a comprehensive annotation_report.md with logs for QC review.
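
To make the order of the steps concrete, the sketch below builds (but does not run) one command line per tool. It is an illustration, not the skill's actual invocation. The sub-directory names under the DB root (`eggnog`, `diamond/ref.dmnd`, `taxdump`), the intermediate file names, and the chosen DIAMOND output columns are all hypothetical. The flags shown should be checked against each tool's own help output for the installed version.

```python
from pathlib import Path


def build_commands(proteins, db_root, out_dir):
    """Return argument lists for each step, in run order, without executing them."""
    proteins, db_root, out_dir = Path(proteins), Path(db_root), Path(out_dir)
    diamond_hits = str(out_dir / "diamond.tsv")  # hypothetical intermediate file
    return {
        # Domain/family annotation.
        "interproscan": [
            "interproscan.sh", "-i", str(proteins),
            "-f", "tsv", "-o", str(out_dir / "interpro.tsv"),
        ],
        # Orthology-based annotation.
        "eggnog": [
            "emapper.py", "-i", str(proteins),
            "--data_dir", str(db_root / "eggnog"),
            "--output_dir", str(out_dir), "-o", "eggnog",
        ],
        # Homology search; staxids needs a DIAMOND DB built with taxonomy data.
        "diamond": [
            "diamond", "blastp", "--query", str(proteins),
            "--db", str(db_root / "diamond" / "ref.dmnd"),
            "--out", diamond_hits,
            "--outfmt", "6", "qseqid", "sseqid", "evalue", "bitscore", "staxids",
        ],
        # Resolve lineages from the taxid column (5th field of the hits table).
        "taxonkit": [
            "taxonkit", "lineage", "--data-dir", str(db_root / "taxdump"),
            "-i", "5", diamond_hits,
        ],
    }
```

Each list can be passed to `subprocess.run(cmd, check=True)` in dictionary order; building the lists separately from running them keeps the commands easy to log under results/bio-annotation/logs/.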

When to Use It

You need functional domain and family annotations for a protein set to understand potential roles.
You require orthology-based annotations and inferred functional terms from evolutionary relationships.
You must assign taxonomy to proteins based on sequence similarity and taxonomic databases.
You want structured outputs (parquet) suitable for downstream analysis and a summary report for stakeholders.
You have validated inputs (proteins.faa and reference_db) and want a reproducible QC-driven workflow.

Quick Start

Step 1: Prepare inputs proteins.faa and reference_db, and set BIO_DB_ROOT to the DB directory.
Step 2: Run the bio-annotation workflow to execute InterProScan, eggnog-mapper, DIAMOND and TaxonKit.
Step 3: Inspect outputs in results/bio-annotation (annotations.parquet, taxonomy.parquet, annotation_report.md) and review QC gates.

Best Practices

Set BIO_DB_ROOT correctly and verify reference databases exist before running.
Confirm input FASTA is non-empty and amino acid encoded to avoid misreads.
Use InterProScan and eggnog-mapper with recommended parameters for your organism group.
Review annotation_report.md and taxonomy.parquet for consistency before proceeding.
Retain logs and run QC gates; retry with adjusted parameters if hits or coverage fall below thresholds.
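
The check that the FASTA is non-empty and amino acid encoded can be sketched with the standard library alone. Two assumptions are made here. The accepted alphabet is the 20 standard residues plus the codes B, J, Z, X, U, O and `*`. Because every nucleotide letter is also a valid amino acid letter, a file is flagged as nucleotide-like only by a heuristic: more than 90% of its characters are in `ACGTUN`. Both the function name and the cutoff are hypothetical.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBJZXUO*")
NUCLEOTIDES = set("ACGTUN")


def check_protein_fasta(lines, max_nucleotide_fraction=0.9):
    """Return (n_records, problems) for an iterable of FASTA lines."""
    n_records = n_residues = n_nucleotide_like = 0
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1
            continue
        if n_records == 0:
            problems.append(f"line {lineno}: sequence data before first header")
        seq = line.upper()
        bad = set(seq) - AMINO_ACIDS
        if bad:
            problems.append(f"line {lineno}: unexpected characters {sorted(bad)}")
        n_residues += len(seq)
        n_nucleotide_like += sum(ch in NUCLEOTIDES for ch in seq)
    if n_records == 0 or n_residues == 0:
        problems.append("no sequences found")
    elif n_nucleotide_like / n_residues > max_nucleotide_fraction:
        problems.append("sequences look like nucleotides, not amino acids")
    return n_records, problems
```

An open file handle works as the `lines` argument, for example `check_protein_fasta(open("proteins.faa"))`; an empty `problems` list means the file passed both checks.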

Example Use Cases

Annotating a bacterial proteome to link domains with potential functions and taxonomic placement.
Functional annotation of novel plant proteins with orthology-based GO/EC term predictions.
Cross-species comparison of protein families using orthology annotations to infer conserved functions.
Assigning taxonomy to a metagenomic-like protein set via DIAMOND hits and TaxonKit resolution.
Preparing parquet-based summaries for integration into a larger omics analytics pipeline.

Frequently Asked Questions
