What is AnnData used for?

AnnData is a Python data structure for annotated matrices designed for single-cell genomics workflows and general annotated data analysis.

How do I read and write AnnData objects?

Use ad.read_h5ad to read and ad.write_h5ad to write; other formats like CSV, Loom, and MTX are also supported.

Backed mode allows reading data from disk without loading the full matrix into memory, enabling scalable analysis of large datasets.

anndata

Scanned

npx machina-cli add skill Microck/ordinary-claude-skills/anndata --openclaw

Files (1)

SKILL.md

9.9 KB

AnnData

Overview

AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

When to Use This Skill

Use this skill when:

Creating, reading, or writing AnnData objects
Working with h5ad, zarr, or other genomics data formats
Performing single-cell RNA-seq analysis
Managing large datasets with sparse matrices or backed mode
Concatenating multiple datasets or experimental batches
Subsetting, filtering, or transforming annotated data
Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

Installation

uv pip install anndata

# With optional dependencies
uv pip install anndata[dev,test,doc]

Quick Start

Creating an AnnData object

import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)

Reading data

# Read h5ad file
adata = ad.read_h5ad('data.h5ad')

# Read with backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')

# Read other formats
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')
adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5')

Writing data

# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')

Basic operations

# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")

Core Capabilities

1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

See: references/data_structure.md for comprehensive information on:

Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
Creating AnnData objects from various sources
Accessing and manipulating data components
Memory-efficient practices

2. Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

See: references/io_operations.md for details on:

Native formats (h5ad, zarr)
Alternative formats (CSV, MTX, Loom, 10X, Excel)
Backed mode for large datasets
Remote data access
Format conversion
Performance optimization

Common commands:

# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# Read 10X data
adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format
adata = ad.read_mtx('matrix.mtx').T

3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

See: references/concatenation.md for comprehensive coverage of:

Basic concatenation (axis=0 for observations, axis=1 for variables)
Join types (inner, outer)
Merge strategies (same, unique, first, only)
Tracking data sources with labels
Lazy concatenation (AnnCollection)
On-disk concatenation for large datasets

Common commands:

# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation
from anndata.experimental import AnnCollection
collection = AnnCollection(
    ['data1.h5ad', 'data2.h5ad'],
    join_obs='outer',
    label='dataset'
)

4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

See: references/manipulation.md for detailed guidance on:

Subsetting (by indices, names, boolean masks, metadata conditions)
Transposition
Copying (full copies vs views)
Renaming (observations, variables, categories)
Type conversions (strings to categoricals, sparse/dense)
Adding/removing data components
Reordering
Quality control filtering

Common commands:

# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]  # View (lightweight reference)
copy = adata[0:100, :].copy()  # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()

5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

See: references/best_practices.md for guidelines on:

Memory management (sparse matrices, categoricals, backed mode)
Views vs copies
Data storage optimization
Performance optimization
Working with raw data
Metadata management
Reproducibility
Error handling
Integration with other tools
Common pitfalls and solutions

Key recommendations:

# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]

Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

Scanpy (Single-cell analysis)

import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])

Muon (Multimodal data)

import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})

PyTorch integration

from anndata.experimental import AnnLoader

# Create DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X = batch.X
    # Train model

Common Workflows

Single-cell RNA-seq analysis

import anndata as ad
import scanpy as sc

# 1. Load data
adata = ad.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['n_counts'] = adata.X.sum(axis=1)
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000]

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad')

Batch integration

# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

Working with large datasets

# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load filtered subset
adata_subset = high_quality.to_memory()

# Process subset
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)

Troubleshooting

Out of memory errors

Use backed mode or convert to sparse matrices:

# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

Slow file reading

Use compression and appropriate formats:

# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))

Index alignment issues

Always align external data on index:

# Wrong
adata.obs['new_col'] = external_data['values']

# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']

Additional Resources

Official documentation: https://anndata.readthedocs.io/
Scanpy tutorials: https://scanpy.readthedocs.io/
Scverse ecosystem: https://scverse.org/
GitHub repository: https://github.com/scverse/anndata

Source

git clone https://github.com/Microck/ordinary-claude-skills/blob/main/skills_all/claude-scientific-skills/scientific-skills/anndata/SKILL.md

View on GitHub

Overview

AnnData is a Python package for handling annotated data matrices, storing measurements in X alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Born from Scanpy for single-cell genomics, it now serves as a general, memory-efficient framework for large, complex datasets and seamless integration with the scverse ecosystem, including scanpy and scvi-tools.

How This Skill Works

An AnnData object bundles a data matrix X with metadata (obs, var) and optional layers, embeddings (obsm, varm, obsp, varp), and unstructured data (uns). You create, read, and write AnnData objects in formats like h5ad or zarr, enable backed mode for large datasets, and perform subsetting, filtering, and concatenation, often in conjunction with Scanpy/scvi-tools.

When to Use It

Creating, reading, or writing AnnData objects
Working with h5ad, zarr, or other genomics data formats
Performing single-cell RNA-seq data analysis
Managing large datasets with sparse matrices or backed mode
Integrating with Scanpy, scvi-tools, or the broader scverse ecosystem

Quick Start

Step 1: Install and import the package; prepare X and optional obs/var
Step 2: Create an AnnData object with ad.AnnData(X, obs=obs, var=var)
Step 3: Read/write data (ad.read_h5ad, ad.write_h5ad) and use backed mode for large files

Best Practices

Prefer backing mode for large datasets to avoid loading into memory
Use h5ad or zarr for efficient storage and IO
Keep obs and var metadata aligned with X to ensure consistent analysis
Use concat to merge datasets carefully, handling batch annotations
Leverage integration with scanpy/scvi-tools and other scverse tools for downstream analysis

Example Use Cases

Create an AnnData from a dense matrix and attach cell metadata via obs and gene annotations via var
Read a large single-cell dataset from an h5ad file using backed mode to save memory
Write a compressed h5ad file after filtering cells and genes
Load 10X Genomics data with read_10x_h5 and perform QC
Subset, filter, and concatenate multiple AnnData objects representing different batches

Frequently Asked Questions

Add this skill to your agents

anndata

AnnData

Overview

When to Use This Skill

Installation

Quick Start

Creating an AnnData object

Reading data

Writing data

Basic operations

Core Capabilities

1. Data Structure

2. Input/Output Operations

3. Concatenation

4. Data Manipulation

5. Best Practices

Integration with Scverse Ecosystem

Scanpy (Single-cell analysis)

Muon (Multimodal data)

PyTorch integration

Common Workflows

Single-cell RNA-seq analysis

Batch integration

Working with large datasets

Troubleshooting

Out of memory errors

Slow file reading

Index alignment issues

Additional Resources

Source

Overview

How This Skill Works

When to Use It

Quick Start

Best Practices

Example Use Cases

Frequently Asked Questions

What is AnnData used for?

How do I read and write AnnData objects?

What is backed mode?