bio-annotation
npx machina-cli add skill fmschulz/omics-skills/bio-annotation --openclaw
Bio Annotation
Functional annotation and taxonomy inference from sequence homology.
Instructions
- Run InterProScan for domain/family annotation.
- Run eggnog-mapper for orthology-based annotation.
- Run DIAMOND and resolve taxonomy with TaxonKit.
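The three instructions above can be sketched as command lines. This is a minimal sketch, not the skill's actual invocation. The subdirectory names under the DB root (eggnog, diamond/reference.dmnd, taxdump), the output file names, and the build_commands helper are all assumptions made for illustration. The function only builds argument lists and executes nothing.

```python
from __future__ import annotations

from pathlib import Path


def build_commands(proteins: Path, db_root: Path, out_dir: Path,
                   threads: int = 8) -> list[list[str]]:
    """Build argv lists for the four tools; nothing is executed here."""
    cpus = str(threads)
    # Domain/family annotation, written as a single TSV file.
    interproscan = [
        "interproscan.sh", "-i", str(proteins), "-f", "tsv",
        "-o", str(out_dir / "interpro.tsv"), "-cpu", cpus,
    ]
    # Orthology-based annotation; "eggnog" is the output file prefix.
    emapper = [
        "emapper.py", "-i", str(proteins), "-o", "eggnog",
        "--output_dir", str(out_dir),
        "--data_dir", str(db_root / "eggnog"),  # hypothetical subdirectory
        "--cpu", cpus,
    ]
    # Homology search; staxids is only filled if the DIAMOND database
    # was built with taxonomy information.
    diamond = [
        "diamond", "blastp", "-q", str(proteins),
        "-d", str(db_root / "diamond" / "reference.dmnd"),  # hypothetical DB file
        "-o", str(out_dir / "diamond.tsv"),
        "--outfmt", "6", "qseqid", "sseqid", "pident", "evalue",
        "bitscore", "staxids",
        "--threads", cpus,
    ]
    # Lineage lookup for a file of taxids (one per line) derived from the
    # DIAMOND hits; taxids.txt is a hypothetical intermediate file.
    taxonkit = [
        "taxonkit", "lineage",
        "--data-dir", str(db_root / "taxdump"),  # hypothetical subdirectory
        "-o", str(out_dir / "lineage.tsv"),
        str(out_dir / "taxids.txt"),
    ]
    return [interproscan, emapper, diamond, taxonkit]
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths have been adapted to the local database layout.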
Quick Reference
| Task | Action |
|---|---|
| Run workflow | Follow the steps in this skill and capture outputs. |
| Validate inputs | Confirm required inputs and reference data exist. |
| Review outputs | Inspect reports and QC gates before proceeding. |
| Tool docs | See docs/README.md. |
| References | See ../bio-skills-references.md |
Input Requirements
Prerequisites:
- Tools available in the active environment (Pixi/conda/system). See docs/README.md for expected tools.
- Reference DB root: set BIO_DB_ROOT (default /media/shared-expansion/db/ on WSU).
- Input FASTA and reference DBs are readable.
Inputs:
- proteins.faa (FASTA protein sequences).
- reference_db/ (eggNOG, InterPro, DIAMOND databases + taxdump).
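The prerequisites above could be checked before any tool runs. The sketch below assumes the fallback behaviour described in the prerequisites (use BIO_DB_ROOT if set, otherwise the stated default). The helper names resolve_db_root and unreadable_paths are hypothetical and not part of the skill.

```python
import os
from pathlib import Path

# Default taken from the prerequisites above.
DEFAULT_DB_ROOT = "/media/shared-expansion/db/"


def resolve_db_root(env=None):
    """Return BIO_DB_ROOT if set and non-empty, else the documented default."""
    env = os.environ if env is None else env
    return Path(env.get("BIO_DB_ROOT") or DEFAULT_DB_ROOT)


def unreadable_paths(paths):
    """Return the paths that are missing or not readable by this process."""
    return [Path(p) for p in paths if not os.access(p, os.R_OK)]
```

A caller might run unreadable_paths(["proteins.faa", resolve_db_root()]) and stop early if the returned list is non-empty.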
Output
- results/bio-annotation/annotations.parquet
- results/bio-annotation/taxonomy.parquet
- results/bio-annotation/annotation_report.md
- results/bio-annotation/logs/
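A quick way to confirm that a run produced the files listed above is to compare the results directory against that list. This is a small sketch. The missing_outputs helper is hypothetical, and it only checks that each entry exists, not that its content is valid.

```python
from pathlib import Path

# Names taken from the Output list above; "logs" is a directory.
EXPECTED_OUTPUTS = (
    "annotations.parquet",
    "taxonomy.parquet",
    "annotation_report.md",
    "logs",
)


def missing_outputs(results_dir="results/bio-annotation"):
    """Return the expected output names that do not exist under results_dir."""
    root = Path(results_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).exists()]
```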
Quality Gates
- Annotation hit rate and taxonomy rank coverage meet project thresholds.
- On failure: retry with alternative parameters; if still failing, record in report and exit non-zero.
- Verify proteins.faa is non-empty and amino acid encoded.
- Verify required reference DBs exist under the reference root.
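The FASTA check and the hit-rate gate above might look like the sketch below. The source does not say how "amino acid encoded" is tested or what the project thresholds are, so the nucleotide heuristic, the 0.9 cut-off, and the min_hit_rate parameter are assumptions. Because IUPAC protein codes (including ambiguity codes and U/O) cover every letter, the alphabet test only catches non-letter characters. The composition test is what flags nucleotide input.

```python
import string

# Every uppercase letter is a valid IUPAC protein code once ambiguity
# codes (B, J, X, Z) and U/O are included; '*' and '-' mark stops and gaps.
PROTEIN_ALPHABET = set(string.ascii_uppercase) | {"*", "-"}


def check_protein_fasta(text, nt_threshold=0.9):
    """Raise ValueError if the FASTA text is empty or looks like nucleotides."""
    records = 0
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records += 1
        else:
            chunks.append(line.upper())
    seq = "".join(chunks)
    if records == 0 or not seq:
        raise ValueError("FASTA has no records or no sequence data")
    unexpected = set(seq) - PROTEIN_ALPHABET
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    # Assumed heuristic: input made almost entirely of A/C/G/T/U/N is
    # treated as nucleotide sequence rather than protein.
    nt_fraction = sum(seq.count(c) for c in "ACGTUN") / len(seq)
    if nt_fraction >= nt_threshold:
        raise ValueError("sequence composition looks like nucleotides")


def hit_rate_ok(n_queries, n_annotated, min_hit_rate):
    """True if the fraction of annotated queries meets the given threshold."""
    return n_queries > 0 and n_annotated / n_queries >= min_hit_rate
```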
Examples
Example 1: Expected input layout
proteins.faa (FASTA protein sequences).
reference_db/ (eggNOG, InterPro, DIAMOND databases + taxdump).
Troubleshooting
Issue: Missing inputs or reference databases
Solution: Verify paths and permissions before running the workflow.
Issue: Low-quality results or failed QC gates
Solution: Review reports, adjust parameters, and re-run the affected step.
Source
https://github.com/fmschulz/omics-skills/blob/main/skills/bio-annotation/SKILL.md
Overview
Bio-annotation delivers functional labels and taxonomy for proteins by integrating domain/family annotations from InterProScan, orthology-based annotations from eggnog-mapper, and taxonomic context from DIAMOND hits resolved with TaxonKit. Outputs include parquet summaries and a readable annotation report, with QC gates to ensure data quality.
How This Skill Works
Input proteins.faa and reference_db are processed through a three-step pipeline: InterProScan annotates domains and families, eggnog-mapper provides orthology-based functional annotations, and DIAMOND identifies close homologs while TaxonKit resolves taxonomy. The results are stored as parquet files and a comprehensive annotation_report.md with logs for QC review.
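To illustrate the hand-off from DIAMOND to TaxonKit described above, the sketch below picks the best hit per query from DIAMOND tabular output and extracts a taxid that could then be passed to taxonkit lineage. The column order is an assumption (it must match whatever --outfmt 6 fields were requested), and taking the highest bitscore and the first taxid of a multi-taxid hit is a simplification, not necessarily how the skill resolves taxonomy.

```python
import csv
import io

# Assumed column order of the DIAMOND tabular output.
COLUMNS = ("qseqid", "sseqid", "pident", "evalue", "bitscore", "staxids")


def best_hit_taxids(tsv_text):
    """Map each query ID to the taxid of its highest-bitscore hit."""
    best = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(COLUMNS, row))
        query = hit["qseqid"]
        score = float(hit["bitscore"])
        if query not in best or score > best[query][0]:
            # staxids may hold several IDs separated by ';'; keep the first.
            taxid = hit["staxids"].split(";")[0]
            best[query] = (score, taxid)
    # Drop queries whose best hit carries no taxid.
    return {query: taxid for query, (_, taxid) in best.items() if taxid}
```

Writing the returned taxids to a file, one per line, gives an input that taxonkit lineage can read.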
When to Use It
- You need functional domain and family annotations for a protein set to understand potential roles.
- You require orthology-based annotations and inferred functional terms from evolutionary relationships.
- You must assign taxonomy to proteins based on sequence similarity and taxonomic databases.
- You want structured outputs (parquet) suitable for downstream analysis and a summary report for stakeholders.
- You have validated inputs (proteins.faa and reference_db) and want a reproducible QC-driven workflow.
Quick Start
- Step 1: Prepare inputs proteins.faa and reference_db, and set BIO_DB_ROOT to the DB directory.
- Step 2: Run the bio-annotation workflow to execute InterProScan, eggnog-mapper, DIAMOND and TaxonKit.
- Step 3: Inspect outputs in results/bio-annotation (annotations.parquet, taxonomy.parquet, annotation_report.md) and review QC gates.
Best Practices
- Set BIO_DB_ROOT correctly and verify reference databases exist before running.
- Confirm input FASTA is non-empty and amino acid encoded to avoid misreads.
- Use InterProScan and eggnog-mapper with recommended parameters for your organism group.
- Review annotation_report.md and taxonomy.parquet for consistency before proceeding.
- Retain logs and run QC gates; retry with adjusted parameters if hits or coverage fall below thresholds.
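The retry advice above could be organised as a loop over parameter sets. This sketch assumes a caller-supplied run_step function that runs one annotation attempt and returns its hit rate. The function names, the parameter dictionaries, and the threshold are hypothetical.

```python
def run_with_retries(run_step, parameter_sets, min_hit_rate):
    """Try each parameter set in order until the hit rate meets the threshold.

    run_step(params) must return a hit rate between 0 and 1.
    Returns (passed, attempts), where attempts lists (params, rate) pairs
    that can be recorded in the annotation report.
    """
    attempts = []
    for params in parameter_sets:
        rate = run_step(params)
        attempts.append((params, rate))
        if rate >= min_hit_rate:
            return True, attempts
    return False, attempts


def exit_code(passed):
    """Non-zero exit status when every attempt failed the gate."""
    return 0 if passed else 1
```

A wrapper script might record the attempts in annotation_report.md and then call sys.exit(exit_code(passed)).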
Example Use Cases
- Annotating a bacterial proteome to link domains with potential functions and taxonomic placement.
- Functional annotation of novel plant proteins with orthology-based GO/EC term predictions.
- Cross-species comparison of protein families using orthology annotations to infer conserved functions.
- Assigning taxonomy to a metagenomic-like protein set via DIAMOND hits and TaxonKit resolution.
- Preparing parquet-based summaries for integration into a larger omics analytics pipeline.