jgi-lakehouse

npx machina-cli add skill fmschulz/omics-skills/jgi-lakehouse --openclaw

JGI Lakehouse

Use JGI Lakehouse (Dremio) for metadata queries and the JGI filesystem for sequence downloads.

Instructions

  1. Authenticate to Dremio using a personal access token (PAT), supplied as DREMIO_PAT.
  2. Explore schemas and tables to find the required metadata.
  3. Run SQL queries for project/sample/taxon discovery.
  4. Use IMG taxon OIDs to fetch genome packages from the filesystem.
  5. Validate outputs and record provenance.
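
As a rough illustration of steps 1 and 3, the sketch below builds (but does not send) an authenticated SQL request. It assumes the Lakehouse exposes Dremio's standard REST endpoint `/api/v3/sql` with bearer-token authentication; the base URL `https://dremio.example.org` and the helper name `build_sql_request` are hypothetical placeholders, and the real endpoint should come from docs/authentication.md.

```python
import json
import os
import urllib.request

# Hypothetical base URL; replace with the endpoint given in docs/authentication.md.
DREMIO_URL = "https://dremio.example.org"


def build_sql_request(sql, token=None, base_url=DREMIO_URL):
    """Build (but do not send) a POST request that submits SQL to Dremio."""
    # Fall back to the DREMIO_PAT environment variable when no token is passed.
    token = token or os.environ.get("DREMIO_PAT")
    if not token:
        raise RuntimeError("DREMIO_PAT is not set; export it before querying.")
    return urllib.request.Request(
        url=f"{base_url}/api/v3/sql",  # assumed: Dremio REST v3 SQL endpoint
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the query. Under the same assumption, the response identifies a job whose results are fetched in a separate call, so treat this as a starting point rather than a complete client.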

Quick Reference

Task              Action
Auth setup        See docs/authentication.md
SQL cheatsheet    See docs/sql-quick-reference.md
Table catalog     See docs/data-catalog.md
GOLD exploration  See docs/explore_gold.md

Input Requirements

  • DREMIO_PAT token (for Lakehouse access)
  • Query intent (taxonomy, ecosystem, project IDs, etc.)
  • JGI filesystem access for downloads

Output

  • Query results (tables or CSVs)
  • Lists of taxon OIDs or accessions
  • Downloaded genome packages (FNA/FAA/GFF)
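
One way to produce the CSV output is sketched below with the standard library. The helper name `rows_to_csv` is hypothetical, and the two rows are made-up placeholders rather than real GOLD records; actual rows would come from a Lakehouse query.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows for illustration only; real rows come from a Lakehouse query.
rows = [
    {"gold_id": "Gp0000001", "project_name": "Example project A"},
    {"gold_id": "Gp0000002", "project_name": "Example project B"},
]
csv_text = rows_to_csv(rows, ["gold_id", "project_name"])
```

Writing `csv_text` to a file next to a provenance note keeps the result and the query that produced it together.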

Quality Gates

  • SQL queries return expected row counts
  • Taxon OIDs map to existing filesystem packages
  • Downloaded files pass basic integrity checks
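
The three gates above could be automated along the following lines. This is a sketch under assumptions: it supposes one directory per taxon OID under a package root, which is a hypothetical layout and not a documented JGI path scheme, and the integrity check only covers FASTA-style files (FNA/FAA), not GFF.

```python
from pathlib import Path


def check_row_count(rows, minimum=1, maximum=None):
    """Return True if the number of rows falls within the expected range."""
    count = len(rows)
    return count >= minimum and (maximum is None or count <= maximum)


def missing_packages(taxon_oids, package_root):
    """Return the taxon OIDs that have no directory under package_root.

    Assumes a hypothetical layout of <package_root>/<taxon_oid>/.
    """
    root = Path(package_root)
    return [oid for oid in taxon_oids if not (root / str(oid)).is_dir()]


def looks_like_fasta(path):
    """Basic integrity check: file is non-empty and starts with a '>' header."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size == 0:
        return False
    with path.open("rb") as handle:
        return handle.read(1) == b">"
```

A run would typically stop and report as soon as `missing_packages` returns a non-empty list, since later steps depend on those files.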

Examples

Example 1: Basic GOLD query

SELECT gold_id, project_name
FROM "gold-db-2 postgresql".gold.project
WHERE is_public = 'Yes'
LIMIT 5;

Troubleshooting

Issue: Authentication failures
Solution: Re-create the PAT and confirm it is exported before querying.

Issue: Missing genome files
Solution: Verify IMG taxon OIDs and filesystem path permissions.
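
Both issues can be caught before any query runs. The following is a minimal preflight sketch; the function name `preflight` is hypothetical, and `package_root` stands for whatever filesystem path your site uses for genome packages.

```python
import os


def preflight(package_root, env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # An unset or empty token is the usual cause of authentication failures.
    if not env.get("DREMIO_PAT"):
        problems.append("DREMIO_PAT is not set or is empty")
    # A missing or unreadable directory explains "missing genome files".
    if not os.path.isdir(package_root):
        problems.append(f"{package_root} does not exist or is not a directory")
    elif not os.access(package_root, os.R_OK | os.X_OK):
        problems.append(f"{package_root} is not readable by the current user")
    return problems
```

This only confirms that a token is present, not that it is still valid; an expired PAT will pass the check and fail at query time.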

Source

https://github.com/fmschulz/omics-skills/blob/main/skills/jgi-lakehouse/SKILL.md

Overview

JGI Lakehouse (Dremio) enables metadata queries across GOLD, IMG, Mycocosm, and Phytozome, with direct access to the JGI filesystem for genome packages. This skill helps researchers quickly discover projects, taxa, and samples, then download genome files by IMG taxon OIDs while recording provenance.

How This Skill Works

Authenticate to Dremio using a PAT, then explore schemas and tables to locate the required metadata. Run SQL queries for project, sample, and taxon discovery, and use IMG taxon OIDs to fetch genome packages from the filesystem. Validate outputs and capture provenance for reproducibility.

When to Use It

  • You need metadata for GOLD/IMG/Mycocosm/Phytozome projects and taxa.
  • You want to discover samples or taxon metadata for a study across multiple repositories.
  • You need to download genome packages by IMG taxon OIDs from the JGI filesystem.
  • You require validation and provenance for Lakehouse-derived outputs (auditable results).
  • You are preparing outputs (CSV/tables) for downstream genomics analyses.

Quick Start

  1. Authenticate to Dremio with your DREMIO_PAT token.
  2. Explore schemas and tables to locate GOLD/IMG/Phytozome metadata.
  3. Run discovery SQL for project/sample/taxon and download genome packages via IMG taxon OIDs; validate and record provenance.
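
For the exploration step, a small helper can generate the listing query. The sketch assumes the Lakehouse exposes Dremio's `INFORMATION_SCHEMA."TABLES"` view with `TABLE_SCHEMA` and `TABLE_NAME` columns; the function name `list_tables_sql` and the pattern passed to it are hypothetical, so check docs/data-catalog.md for the actual schema names.

```python
def list_tables_sql(schema_pattern, limit=50):
    """Return SQL that lists tables whose schema matches a LIKE pattern."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    # Double any single quotes so the pattern stays inside the string literal.
    escaped = schema_pattern.replace("'", "''")
    return (
        "SELECT TABLE_SCHEMA, TABLE_NAME "
        'FROM INFORMATION_SCHEMA."TABLES" '
        f"WHERE TABLE_SCHEMA LIKE '{escaped}' "
        "ORDER BY TABLE_SCHEMA, TABLE_NAME "
        f"LIMIT {limit}"
    )
```

Keeping the limit small on the first pass follows the advice to test with small result sets before widening a query.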

Best Practices

  • Ensure a valid DREMIO_PAT and filesystem access before queries.
  • Start by listing schemas/tables to identify reliable metadata sources.
  • Write precise, scoped SQL for discovery and test with small result sets.
  • Map IMG taxon OIDs to existing filesystem packages and run integrity checks.
  • Record provenance (query, timestamp, user) for reproducibility.
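
A provenance record covering the fields named in the last bullet might look like the sketch below. The field names and the added query hash are illustrative choices, not a format defined by the skill.

```python
import getpass
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(sql, row_count, user=None, now=None):
    """Describe one query run as a JSON-serializable dict."""
    now = now or datetime.now(timezone.utc)
    return {
        "query": sql,
        # The hash makes it easy to spot when two runs used the same SQL text.
        "query_sha256": hashlib.sha256(sql.encode("utf-8")).hexdigest(),
        "row_count": row_count,
        "user": user or getpass.getuser(),
        "timestamp": now.isoformat(),
    }


record = provenance_record("SELECT 1", row_count=1, user="example-user")
record_json = json.dumps(record, indent=2)
```

Storing `record_json` alongside each exported table gives later readers the query, the time it ran, and who ran it.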

Example Use Cases

  • Example 1: Basic GOLD query – retrieve public project IDs and names.
  • Example 2: Discover IMG taxa across GOLD/IMG for a given search term.
  • Example 3: Resolve IMG taxon OIDs to fetch genome packages (FNA/FAA/GFF).
  • Example 4: Validate output row counts against expectations as a quality gate.
  • Example 5: Troubleshoot authentication by re-creating and exporting a PAT.
