# kbase-query
Query the KBase/BERDL Datalake MCP Server via REST API.
## Setup

```bash
export KBASE_TOKEN="your_token_here"
export KBASE_MCP_URL="https://hub.berdl.kbase.us/apis/mcp"  # optional; this is the default
```
### Getting your token

- Log in to the KBase JupyterHub
- In any notebook, run `BERDLSettings().KBASE_AUTH_TOKEN`
- Copy the token value

Note: Tokens expire after ~1 week. If you get auth errors, refresh your token.
## Scripts

All scripts require the `KBASE_TOKEN` env var and `jq` installed.

| Script | Usage |
|---|---|
| `kbase_health.sh` | Check API health |
| `kbase_list_databases.sh` | List all databases |
| `kbase_list_tables.sh <db>` | List tables in a database |
| `kbase_table_schema.sh <db> <table>` | Get table columns |
| `kbase_db_structure.sh [with_schema]` | Full DB structure |
| `kbase_table_count.sh <db> <table>` | Row count |
| `kbase_table_sample.sh <db> <table> [limit]` | Sample rows (max 100) |
| `kbase_query.sh <sql> [limit]` | Execute SQL (max 1000 rows) |
| `kbase_select.sh <db> <table> [limit]` | Structured select |
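Since every script shares the same two prerequisites, it can help to verify them once up front. This is a minimal sketch of a hypothetical `preflight` helper (not shipped with the skill) that fails fast when `KBASE_TOKEN` or `jq` is missing:

```bash
#!/usr/bin/env sh
# Hypothetical preflight helper: check the prerequisites the kbase_*.sh
# scripts assume before making any API calls.
preflight() {
  # Fail if the auth token env var is unset or empty
  [ -n "${KBASE_TOKEN:-}" ] || { echo "error: KBASE_TOKEN is not set" >&2; return 1; }
  # Fail if jq (used to parse all JSON responses) is not on PATH
  command -v jq >/dev/null 2>&1 || { echo "error: jq is not installed" >&2; return 1; }
  echo "preflight ok"
}
```

Running `preflight && kbase_health.sh` surfaces a clear local error before any network round trip.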
## Example Workflow

```bash
# List databases
kbase_list_databases.sh
# → {"databases": ["kbase_ke_pangenome", "nmdc_core", ...]}

# List tables in the pangenome database
kbase_list_tables.sh kbase_ke_pangenome
# → {"tables": ["genome", "gene", "gene_cluster", ...]}

# Get columns for a table (returns names only, not types)
kbase_table_schema.sh kbase_ke_pangenome genome
# → {"columns": ["genome_id", "gtdb_species_clade_id", ...]}

# Sample rows
kbase_table_sample.sh kbase_ke_pangenome genome 5

# SQL query
kbase_query.sh "SELECT * FROM kbase_ke_pangenome.genome LIMIT 10"
```
## Useful jq Patterns

```bash
# Extract just database names
kbase_list_databases.sh | jq -r '.databases[]'

# Get columns as a comma-separated list
kbase_table_schema.sh kbase_ke_pangenome genome | jq -r '.columns | join(", ")'

# Loop through all tables to get schemas
for t in $(kbase_list_tables.sh kbase_ke_pangenome | jq -r '.tables[]'); do
  echo "=== $t ==="
  kbase_table_schema.sh kbase_ke_pangenome "$t" | jq -r '.columns | join(", ")'
done
```
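The same `join` pattern can feed a schema back into a query. This sketch builds a `SELECT` column list from a schema payload; the inline JSON stands in for `kbase_table_schema.sh` output (same `{"columns": [...]}` shape), so in practice you would pipe the script's output instead:

```bash
# Sample of the {"columns": [...]} shape returned by kbase_table_schema.sh;
# in real use, replace this with: kbase_table_schema.sh <db> <table>
schema='{"columns": ["genome_id", "gtdb_species_clade_id"]}'

# Join the column names into a SELECT list
cols=$(printf '%s' "$schema" | jq -r '.columns | join(", ")')

# Emit a query that selects exactly those columns
echo "SELECT $cols FROM kbase_ke_pangenome.genome LIMIT 5"
```

This avoids `SELECT *` on wide tables while still covering every column the schema reports.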
## Available Databases

Key databases include:

- `kbase_ke_pangenome` - Pangenomic data with GTDB taxonomy
- `nmdc_core` - NMDC microbiome data
- `kbase_genomes` - KBase genome collection
- `kbase_uniprot_*` - UniProt reference data
## API Reference

See `references/api_reference.md` for complete endpoint documentation.
## Source

https://github.com/cmungall/lakehouse-skills/blob/main/kbase-query/SKILL.md

## Overview
KBase Query lets you explore the KBase/BERDL Datalake through the MCP REST API. It supports listing databases and tables, retrieving schemas, sampling data, and running SQL queries against the data lake. It authenticates with a KBASE_TOKEN and uses handy kbase_*.sh scripts to perform common tasks.
## How This Skill Works
Set the KBASE_TOKEN and optional KBASE_MCP_URL environment variables, then use the provided kbase_*.sh scripts to interact with the MCP REST endpoints. The scripts perform health checks, list databases and tables, fetch schemas, sample data, and execute SQL queries, returning results in JSON which can be parsed with jq.
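Roughly, each wrapper resolves the base URL and sends an authenticated request. The sketch below builds the request URL the way the scripts presumably do; the `/health` path and the exact `Authorization` header scheme are assumptions here, not documented behavior (see `references/api_reference.md` for the real endpoints):

```bash
# Hypothetical helper: join an endpoint path onto the MCP base URL,
# falling back to the documented default when KBASE_MCP_URL is unset.
mcp_url() {
  printf '%s%s' "${KBASE_MCP_URL:-https://hub.berdl.kbase.us/apis/mcp}" "$1"
}

# A wrapper would then make an authenticated call, e.g. (assumed header scheme):
#   curl -sf -H "Authorization: $KBASE_TOKEN" "$(mcp_url /health)" | jq .
echo "$(mcp_url /health)"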
## When to Use It
- Exploring what databases and tables exist in the KBase BERDL datalake.
- Listing all tables within a specific database.
- Retrieving the column names (schema) for a table.
- Sampling a few rows to quickly inspect data quality and shape.
- Running SQL queries against a database/table to extract insights.
## Quick Start

- Step 1: `export KBASE_TOKEN="your_token_here"`; optionally `export KBASE_MCP_URL="https://hub.berdl.kbase.us/apis/mcp"` (the default)
- Step 2: Run a quick health check: `kbase_health.sh`
- Step 3: Try a basic operation, e.g. list databases with `kbase_list_databases.sh` or run a small query: `kbase_query.sh "SELECT * FROM kbase_ke_pangenome.genome LIMIT 5"`
## Best Practices

- Keep `KBASE_TOKEN` secure and refresh it before expiry; tokens expire after ~1 week.
- Override the MCP URL with `KBASE_MCP_URL` when connecting to non-default endpoints.
- Check API health with `kbase_health.sh` before running heavy queries.
- Inspect table schemas with `kbase_table_schema.sh` before joins or heavy analytics.
- Respect result limits: sample data is capped at 100 rows; SQL queries have a max of 1000 rows.
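When a result set exceeds the 1000-row cap, one workaround is to page through it. This is a hypothetical helper (it assumes the backing SQL engine supports `LIMIT`/`OFFSET`, which the skill docs do not state) that emits the SQL for page N so each `kbase_query.sh` call stays under the cap:

```bash
# Hypothetical pagination helper: generate the SQL for page N of a full
# table scan, with each page sized at or below the 1000-row query cap.
page_sql() {
  table=$1
  page=$2
  size=${3:-1000}                      # default page size = the cap
  printf 'SELECT * FROM %s LIMIT %d OFFSET %d' "$table" "$size" $(( page * size ))
}

# Page 2 with 500-row pages skips the first 1000 rows
page_sql kbase_ke_pangenome.genome 2 500
```

A loop would then call `kbase_query.sh "$(page_sql <table> $n)"` and stop once a page comes back with fewer rows than the page size.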
## Example Use Cases

- List databases: `kbase_list_databases.sh` → `{"databases": ["kbase_ke_pangenome", "nmdc_core", ...]}`
- List tables in a database: `kbase_list_tables.sh kbase_ke_pangenome` → `{"tables": ["genome", "gene", "gene_cluster", ...]}`
- Get columns for a table: `kbase_table_schema.sh kbase_ke_pangenome genome` → `{"columns": ["genome_id", "gtdb_species_clade_id", ...]}`
- Sample rows: `kbase_table_sample.sh kbase_ke_pangenome genome 5` → `{"rows": [...]}`
- SQL query: `kbase_query.sh "SELECT * FROM kbase_ke_pangenome.genome LIMIT 10"` → `{"results": [...]}`