jgi-lakehouse
npx machina-cli add skill fmschulz/omics-skills/jgi-lakehouse --openclaw
JGI Lakehouse
Use JGI Lakehouse (Dremio) for metadata queries and the JGI filesystem for sequence downloads.
Instructions
- Authenticate to Dremio using a personal access token (PAT).
- Explore schemas and tables to find the required metadata.
- Run SQL queries for project/sample/taxon discovery.
- Use IMG taxon OIDs to fetch genome packages from the filesystem.
- Validate outputs and record provenance.
Quick Reference
| Task | Action |
|---|---|
| Auth setup | See docs/authentication.md |
| SQL cheatsheet | See docs/sql-quick-reference.md |
| Table catalog | See docs/data-catalog.md |
| GOLD exploration | See docs/explore_gold.md |
Input Requirements
- DREMIO_PAT token (for Lakehouse access)
- Query intent (taxonomy, ecosystem, project IDs, etc.)
- JGI filesystem access for downloads
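The input requirements above can be checked before any query is run. The following is a minimal sketch, not part of the skill itself: `check_inputs` and its `download_root` parameter are hypothetical names, and the only detail taken from the source is that the token is expected in `DREMIO_PAT`.

```python
import os


def check_inputs(env=None, download_root=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    # The skill expects the PAT to be available as DREMIO_PAT.
    if not env.get("DREMIO_PAT", "").strip():
        problems.append("DREMIO_PAT is not set or is empty")

    # download_root is a hypothetical mount point for the JGI filesystem;
    # pass None to skip the filesystem check.
    if download_root is not None and not os.access(download_root, os.R_OK | os.X_OK):
        problems.append(f"cannot read {download_root}")

    return problems
```

Calling `check_inputs()` with no arguments inspects the real environment; passing a dictionary as `env` makes the check easy to exercise without touching the shell.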
Output
- Query results (tables or CSVs)
- Lists of taxon OIDs or accessions
- Downloaded genome packages (FNA/FAA/GFF)
Quality Gates
- SQL queries return expected row counts
- Taxon OIDs map to existing filesystem packages
- Downloaded files pass basic integrity checks
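The first and third gates lend themselves to small automated checks. This is a sketch under stated assumptions: the source does not define what "basic integrity" means, so `fasta_looks_valid` only tests that a FASTA-style file (FNA or FAA) is non-empty and begins with a `>` header line. Both function names are hypothetical, and GFF files would need a different check.

```python
import gzip
from pathlib import Path


def row_count_ok(n_rows, minimum=1, maximum=None):
    """True if a query returned between minimum and maximum rows (inclusive)."""
    if n_rows < minimum:
        return False
    return maximum is None or n_rows <= maximum


def fasta_looks_valid(path):
    """Shallow check: file exists, is non-empty, and its first non-blank
    line is a FASTA header starting with '>'. Handles .gz transparently."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size == 0:
        return False
    opener = gzip.open if path.suffix == ".gz" else open
    try:
        with opener(path, "rt") as handle:
            for line in handle:
                if line.strip():
                    return line.startswith(">")
    except (OSError, EOFError, UnicodeDecodeError):
        # Corrupt gzip stream, truncated archive, or binary garbage.
        return False
    return False
```

A check this shallow catches empty or truncated downloads but says nothing about sequence content; checksum comparison would be stronger wherever checksums are published alongside the packages.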
Examples
Example 1: Basic GOLD query
SELECT gold_id, project_name
FROM "gold-db-2 postgresql".gold.project
WHERE is_public = 'Yes'
LIMIT 5;
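One way to submit the query above programmatically is through Dremio's REST API, which accepts SQL at `/api/v3/sql` with the PAT as a bearer token. The sketch below only builds the request and does not send it. The host `https://lakehouse.example.org` is a placeholder rather than the real JGI endpoint, and `build_sql_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_sql_request(base_url, pat, sql):
    """Build (but do not send) a POST request for Dremio's /api/v3/sql."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v3/sql",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {pat}",
            "Content-Type": "application/json",
        },
    )


# Same query as the example above, with the trailing semicolon left off.
QUERY = """SELECT gold_id, project_name
FROM "gold-db-2 postgresql".gold.project
WHERE is_public = 'Yes'
LIMIT 5"""

# Placeholder host and token: substitute the real endpoint and DREMIO_PAT.
req = build_sql_request("https://lakehouse.example.org", "dummy-token", QUERY)
```

Passing `req` to `urllib.request.urlopen` would submit the query. Dremio answers this call with a job ID rather than rows, so results are fetched in a separate step once the job has finished.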
Troubleshooting
Issue: Authentication failures
Solution: Re-create the PAT and confirm it is exported before querying.
Issue: Missing genome files
Solution: Verify IMG taxon OIDs and filesystem path permissions.
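For the missing-files case, a small diagnostic can separate a malformed OID from an absent or unreadable package. This is a sketch built on two assumptions that the source does not confirm: that IMG taxon OIDs are purely numeric, and that each package lives in a directory named after its OID under some root (`<root>/<taxon_oid>/`). Adjust both to the real layout. `diagnose_package` is a hypothetical name.

```python
import os


def diagnose_package(taxon_oid, root):
    """Return 'ok', 'invalid OID', 'missing', or 'permission denied'."""
    oid = str(taxon_oid).strip()

    # Assumption: IMG taxon OIDs consist of digits only.
    if not oid.isdigit():
        return "invalid OID"

    # Assumption: hypothetical layout <root>/<taxon_oid>/.
    package_dir = os.path.join(root, oid)
    if not os.path.isdir(package_dir):
        return "missing"

    # Reading a directory requires both read and execute permission.
    if not os.access(package_dir, os.R_OK | os.X_OK):
        return "permission denied"
    return "ok"
```

A package hidden behind an unreadable parent directory is also reported as `missing`, so it is worth checking permissions on the root itself when every OID fails.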
Source
https://github.com/fmschulz/omics-skills/blob/main/skills/jgi-lakehouse/SKILL.md
Overview
JGI Lakehouse (Dremio) enables metadata queries across GOLD, IMG, Mycocosm, and Phytozome, with direct access to the JGI filesystem for genome packages. This skill helps researchers quickly discover projects, taxa, and samples, then download genome files by IMG taxon OIDs while recording provenance.
How This Skill Works
Authenticate to Dremio using a PAT, then explore schemas and tables to locate the required metadata. Run SQL queries for project, sample, and taxon discovery, and use IMG taxon OIDs to fetch genome packages from the filesystem. Validate outputs and capture provenance for reproducibility.
When to Use It
- You need metadata for GOLD/IMG/Mycocosm/Phytozome projects and taxa.
- You want to discover samples or taxon metadata for a study across multiple repositories.
- You need to download genome packages by IMG taxon OIDs from the JGI filesystem.
- You require validation and provenance for Lakehouse-derived outputs (auditable results).
- You are preparing outputs (CSV/tables) for downstream genomics analyses.
Quick Start
- Step 1: Authenticate to Dremio with your DREMIO_PAT token.
- Step 2: Explore schemas and tables to locate GOLD/IMG/Phytozome metadata.
- Step 3: Run discovery SQL for project/sample/taxon and download genome packages via IMG taxon OIDs; validate and record provenance.
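Between running a query and using its rows, a client of Dremio's REST API typically has to wait for the submitted job to finish. The sketch below shows one way to poll, with the HTTP call injected as `get_status` so the logic can be read and tested without a server. It assumes the status payload carries a `jobState` field and that `COMPLETED`, `FAILED`, and `CANCELED` are the terminal states; verify both against the Dremio version in use. `wait_for_job` is a hypothetical helper.

```python
import time

# Assumed terminal job states; confirm against the Dremio job API docs.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_job(get_status, job_id, poll_seconds=1.0, max_polls=60, sleep=time.sleep):
    """Poll get_status(job_id) until the job reaches a terminal state.

    get_status is any callable returning a dict with a 'jobState' key,
    for example a thin wrapper around GET /api/v3/job/<job_id>.
    """
    for _ in range(max_polls):
        status = get_status(job_id)
        if status.get("jobState") in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The caller still has to inspect the returned state: a `FAILED` or `CANCELED` job ends the wait just as a `COMPLETED` one does.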
Best Practices
- Ensure a valid DREMIO_PAT and filesystem access before queries.
- Start by listing schemas/tables to identify reliable metadata sources.
- Write precise, scoped SQL for discovery and test with small result sets.
- Map IMG taxon OIDs to existing filesystem packages and run integrity checks.
- Record provenance (query, timestamp, user) for reproducibility.
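The provenance fields named in the last item (query, timestamp, user) fit naturally into a small JSON record stored next to each result file. This is a minimal sketch: the field names, the added SHA-256 of the query text, and the `provenance_record` function are illustrative choices, not a format the skill prescribes.

```python
import getpass
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(sql, row_count, user=None, now=None):
    """Build a JSON-serializable record describing one query run."""
    now = now or datetime.now(timezone.utc)
    return {
        "query": sql,
        # A hash makes it easy to tell whether two runs used identical SQL.
        "query_sha256": hashlib.sha256(sql.encode("utf-8")).hexdigest(),
        "timestamp_utc": now.isoformat(),
        "user": user or getpass.getuser(),
        "row_count": row_count,
    }


def write_provenance(record, path):
    """Write the record as indented JSON, e.g. to results.csv.provenance.json."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
```

Storing the row count alongside the query ties the record back to the row-count quality gate, so a later rerun can be compared against the original result size.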
Example Use Cases
- Example 1: Basic GOLD query – retrieve public project IDs and names.
- Example 2: Discover IMG taxa across GOLD/IMG for a given search term.
- Example 3: Resolve IMG taxon OIDs to fetch genome packages (FNA/FAA/GFF).
- Example 4: Validate output row counts against expectations as a quality gate.
- Example 5: Troubleshoot authentication by re-creating and exporting a PAT.