What is stable-diffusion-image-generation?

A guide to generating images from text prompts, performing image-to-image translation, inpainting, and outpainting using HuggingFace Diffusers with Stable Diffusion models.

Which models and pipelines are supported?

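Supported models are SD 1.5, SDXL, SD 3.0, and Flux. Pipelines include StableDiffusionPipeline, StableDiffusionImg2ImgPipeline, and StableDiffusionInpaintPipeline; outpainting is typically done with the inpainting pipeline on a padded image and mask.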

How can I optimize speed and memory usage?

Load weights in FP16, run on a CUDA GPU, enable CPU offloading when VRAM is tight, enable memory-efficient attention (e.g., xformers), and lower num_inference_steps (optionally with a faster scheduler) to trade quality for speed.

stable-diffusion-image-generation

Tags: Image Generation, Stable Diffusion, Diffusers, Text-to-Image, Multimodal, Computer Vision

npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/stable-diffusion --openclaw

Stable Diffusion Image Generation

Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.

When to use Stable Diffusion

Use Stable Diffusion when:

Generating images from text descriptions
Performing image-to-image translation (style transfer, enhancement)
Inpainting (filling in masked regions)
Outpainting (extending images beyond boundaries)
Creating variations of existing images
Building custom image generation workflows

Key features:

Text-to-Image: Generate images from natural language prompts
Image-to-Image: Transform existing images with text guidance
Inpainting: Fill masked regions with context-aware content
ControlNet: Add spatial conditioning (edges, poses, depth)
LoRA Support: Efficient fine-tuning and style adaptation
Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support

Use alternatives instead:

DALL-E 3: For API-based generation without GPU
Midjourney: For artistic, stylized outputs
Imagen: For Google Cloud integration
Leonardo.ai: For web-based creative workflows

Quick start

Installation

pip install diffusers transformers accelerate torch
pip install xformers  # Optional: memory-efficient attention

Basic text-to-image

from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

image.save("output.png")

Using SDXL (higher quality)

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
# Enable memory optimization: offloading moves each sub-model to the GPU
# only while it runs, so do not also call pipe.to("cuda")
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30
).images[0]

Architecture overview

Three-pillar design

Diffusers is built around three core components:

Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)

Pipeline inference flow

Text Prompt → Text Encoder → Text Embeddings
                                    ↓
Random Noise → [Denoising Loop] ← Scheduler
                      ↓
               Predicted Noise
                      ↓
              VAE Decoder → Final Image
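
The flow above can be mimicked with a toy loop. This is only a sketch of the control flow: `encode_prompt`, `predict_noise`, and `scheduler_step` are hypothetical stand-ins (not Diffusers APIs), and the "latent" is a list of four numbers rather than an image tensor.

```python
import random


def encode_prompt(prompt):
    # Toy "text encoder": map the prompt to a small target vector.
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(4)]


def predict_noise(latent, embedding):
    # Toy "UNet": treat the gap between latent and target as the noise.
    return [x - e for x, e in zip(latent, embedding)]


def scheduler_step(latent, noise, steps_left):
    # Toy "scheduler": remove an equal share of the remaining noise.
    return [x - n / steps_left for x, n in zip(latent, noise)]


def generate(prompt, num_inference_steps=10, seed=0):
    embedding = encode_prompt(prompt)
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in embedding]  # start from random noise
    for step in range(num_inference_steps):
        noise = predict_noise(latent, embedding)
        latent = scheduler_step(latent, noise, num_inference_steps - step)
    # A real pipeline would now pass the latent through the VAE decoder.
    return latent, embedding
```

In a real pipeline the noise predictor is a trained network and the scheduler implements a specific sampling algorithm, but the loop has the same shape: encode once, then alternate prediction and scheduler updates.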

Core concepts

Pipelines

Pipelines orchestrate complete workflows:

Pipeline	Purpose
`StableDiffusionPipeline`	Text-to-image (SD 1.x/2.x)
`StableDiffusionXLPipeline`	Text-to-image (SDXL)
`StableDiffusion3Pipeline`	Text-to-image (SD 3.0)
`FluxPipeline`	Text-to-image (Flux models)
`StableDiffusionImg2ImgPipeline`	Image-to-image
`StableDiffusionInpaintPipeline`	Inpainting

Schedulers

Schedulers control the denoising process:

Scheduler	Steps	Quality	Use Case
`EulerDiscreteScheduler`	20-50	Good	Default choice
`EulerAncestralDiscreteScheduler`	20-50	Good	More variation
`DPMSolverMultistepScheduler`	15-25	Excellent	Fast, high quality
`DDIMScheduler`	50-100	Good	Deterministic
`LCMScheduler`	4-8	Good	Very fast
`UniPCMultistepScheduler`	15-25	Excellent	Fast convergence

Swapping schedulers

from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# Now generate with fewer steps
image = pipe(prompt, num_inference_steps=20).images[0]

Generation parameters

Key parameters

Parameter	Default	Description
`prompt`	Required	Text description of desired image
`negative_prompt`	None	What to avoid in the image
`num_inference_steps`	50	Denoising steps (more steps usually improve quality, with diminishing returns and longer runtime)
`guidance_scale`	7.5	Prompt adherence (7-12 typical)
`height`, `width`	512/1024	Output dimensions (multiples of 8)
`generator`	None	Torch generator for reproducibility
`num_images_per_prompt`	1	Batch size
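
Since `height` and `width` should be multiples of 8, it can help to snap arbitrary sizes before calling the pipeline. The helper names below are hypothetical, and rounding to the nearest multiple is one possible policy (rounding down would work equally well).

```python
def snap_to_multiple(value, multiple=8):
    """Round a positive integer to the nearest multiple (at least one multiple)."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def snap_size(width, height, multiple=8):
    """Snap both dimensions so they are valid pipeline sizes."""
    return snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)
```

The result can be passed on as `pipe(prompt, width=w, height=h)`.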

Reproducible generation

import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50
).images[0]

Negative prompts

image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5
).images[0]
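
To see why `guidance_scale` and `negative_prompt` interact, here is a minimal sketch of classifier-free guidance on plain Python lists. It assumes the usual formulation, in which the negative prompt supplies the "unconditional" noise prediction; `apply_guidance` is a hypothetical name, not a Diffusers function.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Push the prediction away from the unconditional (negative) branch
    and toward the prompt-conditioned branch."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]


# With scale 1.0 the result equals the conditional prediction;
# larger scales exaggerate the difference between the two branches.
guided = apply_guidance([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5)
```

This is why a scale near 1.0 effectively disables guidance, while higher values follow the prompt more strictly at the cost of variety.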

Image-to-image

Transform existing images with text guidance:

import torch
from diffusers import AutoPipelineForImage2Image
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    num_inference_steps=50
).images[0]
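
`strength` also changes how many denoising steps actually run. The sketch below mirrors the step calculation as I understand it from the SD 1.x image-to-image pipeline; other pipelines or library versions may differ, and `effective_steps` is a hypothetical helper name.

```python
def effective_steps(num_inference_steps, strength):
    """Number of denoising steps that run for a given img2img strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


steps = effective_steps(50, 0.75)  # 37 of the 50 scheduled steps
```

Lower strength therefore both preserves more of the input image and finishes faster.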

Inpainting

Fill masked regions:

import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

pipe = AutoPipelineForInpainting.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg")
mask = Image.open("mask.png")  # White = inpaint region

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]
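
Outpainting is commonly handled with the same inpainting pipeline: place the original on a larger canvas and mask everything outside it. The sketch below only computes the geometry; `outpaint_layout` is a hypothetical helper, and rounding the canvas up to a multiple of 8 is an assumption based on the size constraint for `height` and `width`.

```python
def outpaint_layout(width, height, left=0, top=0, right=0, bottom=0, multiple=8):
    """Return (canvas_size, paste_box) for extending an image.

    canvas_size is (width, height) rounded up to `multiple`;
    paste_box is (x0, y0, x1, y1), where the original image goes.
    """
    def round_up(value):
        return -(-value // multiple) * multiple

    canvas_size = (round_up(left + width + right), round_up(top + height + bottom))
    paste_box = (left, top, left + width, top + height)
    return canvas_size, paste_box


canvas_size, paste_box = outpaint_layout(512, 512, right=128)
```

With these values, the mask would be white across the canvas and black inside `paste_box`, so only the new border region is generated.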

ControlNet

Add spatial conditioning for precise control:

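import cv2  # pip install opencv-python
import numpy as np
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Build a Canny edge map to use as the control image
# (the input path and the 100/200 thresholds are illustrative)
input_image = Image.open("input.jpg").convert("RGB")
edges = cv2.Canny(np.array(input_image), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))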

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30
).images[0]

Available ControlNets

ControlNet	Input Type	Use Case
`canny`	Edge maps	Preserve structure
`openpose`	Pose skeletons	Human poses
`depth`	Depth maps	3D-aware generation
`normal`	Normal maps	Surface details
`mlsd`	Line segments	Architectural lines
`scribble`	Rough sketches	Sketch-to-image

LoRA adapters

Load fine-tuned style adapters:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Generate with LoRA style; "scale" sets the LoRA strength for this call
image = pipe(
    "A portrait in the trained style",
    cross_attention_kwargs={"scale": 0.8}
).images[0]

# Optionally bake the LoRA into the base weights at a fixed strength
pipe.fuse_lora(lora_scale=0.8)

# Undo the fusion, then unload the LoRA
pipe.unfuse_lora()
pipe.unload_lora_weights()

Multiple LoRAs

# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")

# Set weights for each
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

image = pipe("A portrait").images[0]

Memory optimization

Enable CPU offloading

# Model CPU offload - moves models to CPU when not in use
pipe.enable_model_cpu_offload()

# Sequential CPU offload - more aggressive, slower
pipe.enable_sequential_cpu_offload()

Attention slicing

# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()

# Or slice as aggressively as possible for the largest memory saving (slower)
pipe.enable_attention_slicing("max")
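
A rough back-of-the-envelope estimate shows why slicing helps. The sketch assumes a naive attention implementation that materializes the full score matrix, FP16 values, 8 attention heads, a batch of 2 (prompt plus negative prompt), and only the highest-resolution self-attention layer; all of these are illustrative assumptions, and fused attention kernels avoid this allocation altogether.

```python
def attention_scores_bytes(height, width, rows_at_once,
                           bytes_per_value=2, vae_factor=8):
    """Approximate size of the attention score matrix held in memory.

    rows_at_once is how many (batch x head) slices are computed together.
    """
    tokens = (height // vae_factor) * (width // vae_factor)
    return rows_at_once * tokens * tokens * bytes_per_value


full = attention_scores_bytes(512, 512, rows_at_once=2 * 8)  # no slicing
sliced = attention_scores_bytes(512, 512, rows_at_once=1)    # one slice at a time

print(full // 2**20, "MiB vs", sliced // 2**20, "MiB")  # 512 MiB vs 32 MiB
```

The trade-off is speed: slices run one after another instead of in a single batched operation.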

xFormers memory-efficient attention

# Requires xformers package
pipe.enable_xformers_memory_efficient_attention()

VAE slicing and tiling

# Decode a batch one image at a time to lower peak memory
pipe.enable_vae_slicing()

# Encode/decode in overlapping tiles so large images fit in memory
pipe.enable_vae_tiling()

Model variants

Loading different precisions

# FP16 (recommended for GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16"
)

# BF16 (wider dynamic range than FP16 but fewer mantissa bits; needs an Ampere or newer GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16
)
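
The memory effect of the dtype choice can be estimated from the parameter count alone. This is a rough sketch: it counts weights only (no activations or framework overhead), and the 860 million figure is an approximate size for the SD 1.5 UNet used purely as an illustration.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weights_gib(num_params, dtype):
    """Approximate weight storage in GiB for a given parameter count."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


unet_params = 860_000_000  # approximate, illustrative
for dtype in ("float32", "float16", "bfloat16"):
    print(f"{dtype}: {weights_gib(unet_params, dtype):.2f} GiB")
```

Both 16-bit formats halve the weight footprint relative to FP32; they differ in numeric behavior, not in size.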

Loading specific components

import torch
from diffusers import AutoencoderKL, DiffusionPipeline

# Load custom VAE in the same dtype as the rest of the pipeline
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse",
    torch_dtype=torch.float16
)

# Use with pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    vae=vae,
    torch_dtype=torch.float16
)

Batch generation

Generate multiple images efficiently:

# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture"
]

images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30
).images
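
When the prompt list is long, passing everything in one call can exhaust GPU memory. A simple workaround is to split the list into fixed-size batches; `chunked` is a hypothetical helper, and the batch size of 2 is arbitrary.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture",
]
batches = chunked(prompts, 2)

# Each batch would then go through the pipeline in turn, e.g.:
# images = []
# for batch in batches:
#     images.extend(pipe(batch, num_inference_steps=30).images)
```

Smaller batches lower peak memory at the cost of throughput, so the best size depends on the GPU and image resolution.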

Common workflows

Workflow 1: High-quality generation

from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Offloading manages device placement itself, so pipe.to("cuda") is not needed
pipe.enable_model_cpu_offload()

# 2. Generate with quality settings
image = pipe(
    prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
    negative_prompt="blurry, low quality, cartoon, anime, sketch",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024
).images[0]

Workflow 2: Fast prototyping

from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in a few steps (actual latency depends on your GPU)
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0
).images[0]

Common issues

CUDA out of memory:

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

Black/noise images:

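# The safety checker replaces flagged outputs with a black image;
# disable it only if that is acceptable for your use case
pipe.safety_checker = None

# FP16 overflow (often in the VAE) can also yield black or NaN images;
# rerun in full precision to check whether dtype is the cause
pipe = pipe.to(dtype=torch.float32)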

Slow generation:

# Use faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]

References

Advanced Usage - Custom pipelines, fine-tuning, deployment
Troubleshooting - Common issues and solutions

Resources

Documentation: https://huggingface.co/docs/diffusers
Repository: https://github.com/huggingface/diffusers
Model Hub: https://huggingface.co/models?library=diffusers
Discord: https://discord.gg/diffusers

Source

git clone https://github.com/Orchestra-Research/AI-Research-SKILLs
File: 18-multimodal/stable-diffusion/SKILL.md

Overview

This skill enables generating images from text prompts, performing image-to-image translation, inpainting, outpainting, and creating variations using HuggingFace Diffusers. It supports multiple Stable Diffusion models (SD 1.5, SDXL, SD 3.0) and features like ControlNet and LoRA for flexible workflows.

How This Skill Works

Diffusers pipelines orchestrate generation by combining a text encoder, a noise-prediction model (UNet or transformer), a scheduler, and a VAE. The text prompt is encoded into embeddings; starting from random noise, a denoising loop repeatedly predicts and removes noise to produce a latent image, which the VAE decodes into the final image.

When to Use It

Generating images from text descriptions
Image-to-image translation or style transfer
Inpainting to fill masked regions
Outpainting to extend images beyond boundaries
Building custom diffusion pipelines and workflows

Quick Start

Step 1: Install dependencies: pip install diffusers transformers accelerate torch; optional: pip install xformers
Step 2: Load a diffusion pipeline, move it to CUDA, and generate an image from a prompt
Step 3: Save or display the resulting image (e.g., image.save('output.png'))

Best Practices

Start with a descriptive prompt and iteratively adjust guidance_scale and inference steps for quality and speed balance
Choose the appropriate pipeline (e.g., StableDiffusionPipeline for text-to-image, StableDiffusionImg2ImgPipeline for image-to-image, StableDiffusionInpaintPipeline for inpainting)
Leverage ControlNet for spatial conditioning (edges, poses, depth) when needed
Use LoRA for efficient fine-tuning and style adaptation, and test across models (SD 1.5, SDXL, SD 3.0)
For large outputs or memory constraints, enable memory optimization (e.g., pipe.enable_model_cpu_offload()) and consider FP16 and xformers

Example Use Cases

Generate a cinematic landscape from a text prompt such as 'a serene mountain landscape at sunset, highly detailed'
Transform an existing portrait’s style with image-to-image translation and text guidance
Restore or fill missing regions in a photo via inpainting
Outpaint a city skyline beyond the original image boundaries
Create multiple concept variations of a character for a game or storyboard