How does CLIP measure similarity?

CLIP encodes images and texts into embeddings, normalizes them, and computes cosine similarity (or logits) to determine how closely a given image matches a text prompt or vice versa.
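
As a minimal sketch of the normalize-then-compare step, the snippet below computes cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real CLIP embeddings (which have hundreds of dimensions), and the helper names are illustrative rather than part of the CLIP API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))

# Toy vectors standing in for CLIP embeddings (assumed values, not model output)
image_embedding = [0.2, 0.9, 0.1]
text_dog = [0.25, 0.85, 0.05]
text_car = [0.9, 0.1, 0.4]

print(f"dog: {cosine_similarity(image_embedding, text_dog):.3f}")
print(f"car: {cosine_similarity(image_embedding, text_car):.3f}")
```

With these toy values the "dog" vector scores far higher than the "car" vector, which is the signal CLIP uses to rank prompts against an image.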

What are the limitations of CLIP?

CLIP can reflect biases present in its training data and may struggle with highly specialized domains. For critical decisions, its results should be validated with human review.

npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/clip --openclaw

CLIP - Contrastive Language-Image Pre-Training

OpenAI's model that learns visual concepts from natural language and embeds images and text in a shared space.

When to use CLIP

Use when:

Zero-shot image classification (no training data needed)
Image-text similarity/matching
Semantic image search
Content moderation (detect NSFW, violence)
Basic visual question answering (ranking candidate answers as text prompts)
Cross-modal retrieval (image→text, text→image)

Metrics:

25,300+ GitHub stars
Trained on 400M image-text pairs
Zero-shot ImageNet accuracy matches the original supervised ResNet-50
MIT License

Use alternatives instead:

BLIP-2: Better captioning
LLaVA: Vision-language chat
Segment Anything: Image segmentation

Quick start

Installation

pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm

Zero-shot classification

import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)

# Compute similarity
with torch.no_grad():
    # model(image, text) encodes both inputs, normalizes the embeddings, and
    # returns cosine similarities multiplied by the learned logit scale
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")

Available models

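# Commonly used models, roughly ordered from smaller/faster to larger/slower
# (clip.available_models() returns the full list of checkpoints)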
models = [
    "RN50",           # ResNet-50
    "RN101",          # ResNet-101
    "ViT-B/32",       # Vision Transformer (recommended)
    "ViT-B/16",       # Better quality, slower
    "ViT-L/14",       # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")

Model	Parameters	Speed	Quality
RN50	102M	Fast	Good
ViT-B/32	151M	Medium	Better
ViT-L/14	428M	Slow	Best

Image-text similarity

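# Compare the image with a single example caption
caption = clip.tokenize(["a dog playing in the park"]).to(device)

with torch.no_grad():
    # Compute embeddings
    image_features = model.encode_image(image)
    text_features = model.encode_text(caption)

# Normalize to unit length
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity; .item() requires exactly one image and one text
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")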

Semantic image search

# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")

Content moderation

# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content"
]

text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()

print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
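
Taking the top category alone can be brittle, since softmax always picks a winner even when no label fits well. One possible refinement, sketched here under assumptions, is to allow only confident "safe" predictions and send everything else to human review. The `route` helper and the `review_threshold` value are hypothetical and would need tuning on labeled data.

```python
def route(probabilities, categories, safe_label="safe for work", review_threshold=0.8):
    """Return (decision, label, prob) for one image's category probabilities.

    decision is "allow" only when the safe label wins with high confidence;
    anything else is "review" so a person makes the final call.
    """
    if len(probabilities) != len(categories):
        raise ValueError("probabilities and categories must have equal length")
    best = max(range(len(categories)), key=lambda i: probabilities[i])
    label, prob = categories[best], probabilities[best]
    if label == safe_label and prob >= review_threshold:
        return "allow", label, prob
    return "review", label, prob

categories = ["safe for work", "not safe for work", "violent content", "graphic content"]

# Made-up probabilities standing in for probs[0].tolist() from the model
print(route([0.93, 0.03, 0.02, 0.02], categories))
print(route([0.55, 0.30, 0.10, 0.05], categories))
print(route([0.10, 0.15, 0.70, 0.05], categories))
```

Only the first example is allowed automatically; the second is "safe" but below the threshold, and the third is flagged as violent, so both go to review.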

Batch processing

# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # torch.Size([10, 3])
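
Stacking every image into one tensor works for ten files but may exhaust memory for large collections. A minimal sketch of splitting the work into fixed-size chunks follows; the `batched` helper and the batch size of 4 are illustrative assumptions, not part of the CLIP API.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical file names; in practice each batch would be preprocessed,
# stacked with torch.stack, and passed to model.encode_image.
paths = [f"img{i}.jpg" for i in range(10)]
for batch in batched(paths, batch_size=4):
    print(len(batch), batch[0], "...", batch[-1])
```

The per-batch embeddings can then be concatenated with torch.cat, as in the semantic search example.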

Integration with vector databases

# Store CLIP embeddings in Chroma (other vector stores such as FAISS follow the same pattern)
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )

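# Query with text (embed on the same device, without gradients, and normalize
# so the query matches the stored image embeddings)
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

results = collection.query(
    query_embeddings=text_embedding.cpu().numpy().tolist(),  # shape (1, D): one query
    n_results=min(5, len(image_paths))
)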

Best practices

Use ViT-B/32 for most cases - Good balance
Normalize embeddings - Required for cosine similarity
Batch processing - More efficient
Cache embeddings - Expensive to recompute
Use descriptive labels - Better zero-shot performance
GPU recommended - 10-50× faster
Preprocess images - Use provided preprocess function
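
To illustrate the "descriptive labels" advice, one common approach is to wrap bare class names in a sentence-like template such as "a photo of a {label}", as suggested in the CLIP paper. The `build_prompts` helper below is a hypothetical convenience function, and the templates are examples rather than tuned choices.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn bare class names into sentence-like prompts for clip.tokenize."""
    return [template.format(name) for name in class_names]

class_names = ["dog", "cat", "bird", "car"]
prompts = build_prompts(class_names)
print(prompts)

# A domain hint can be added to the template when the image type is known
print(build_prompts(["golden retriever"], template="a photo of a {}, a type of pet"))
```

The resulting list can be passed directly to clip.tokenize in place of the raw labels.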

Performance

Operation	CPU	GPU (V100)
Image encoding	~200ms	~20ms
Text encoding	~50ms	~5ms
Similarity compute	<1ms	<1ms
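
For a rough sense of what these per-item latencies mean at scale, the arithmetic below estimates how long it would take to encode a collection one image at a time. The collection size of one million images is an assumption, and batching would likely lower the real figures.

```python
def indexing_time_hours(num_images, ms_per_image):
    """Estimated hours to encode `num_images` one at a time."""
    return num_images * ms_per_image / 1000 / 3600

num_images = 1_000_000  # assumed collection size
cpu_hours = indexing_time_hours(num_images, ms_per_image=200)
gpu_hours = indexing_time_hours(num_images, ms_per_image=20)

print(f"CPU: {cpu_hours:.1f} h, GPU: {gpu_hours:.1f} h, speedup: {cpu_hours / gpu_hours:.0f}x")
```

Under these assumptions the one-time indexing cost is about 56 hours on CPU versus under 6 hours on GPU, while the similarity computation itself stays negligible.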

Limitations

Not for fine-grained tasks - Best for broad categories
Requires descriptive text - Vague labels perform poorly
Trained on web data - May reproduce biases present in that data
No bounding boxes - Whole image only
Limited spatial understanding - Position/counting weak

Resources

GitHub: https://github.com/openai/CLIP ⭐ 25,300+
Paper: https://arxiv.org/abs/2103.00020
Colab: https://colab.research.google.com/github/openai/clip/
License: MIT

Source

git clone https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/18-multimodal/clip/SKILL.md

Overview

CLIP links vision and language to perform zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs, it enables image search, content moderation, and vision-language tasks without fine-tuning.

How This Skill Works

CLIP encodes images and text into a shared embedding space using contrastive pretraining. At inference, you compute image_features and text_features, normalize them, and measure similarity (e.g., cosine similarity or logits) to perform classification or retrieval.
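
The step from similarities to class probabilities can be sketched as follows: cosine similarities are multiplied by a temperature (the logit scale) and passed through a softmax. The similarity values are made up, and the scale of 100 is an assumption; in the OpenAI implementation the actual value is read from the loaded model via model.logit_scale.exp().

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(similarities, labels, logit_scale=100.0):
    """Convert cosine similarities into label probabilities, best first.

    logit_scale is an assumed temperature; the real value comes from the model.
    """
    probs = softmax([logit_scale * s for s in similarities])
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Made-up cosine similarities between one image and four prompts
labels = ["a dog", "a cat", "a bird", "a car"]
similarities = [0.28, 0.24, 0.19, 0.15]

for label, prob in classify(similarities, labels):
    print(f"{label}: {prob:.2%}")
```

The large scale explains why small gaps in raw cosine similarity turn into confident probabilities: a difference of 0.04 becomes a logit gap of 4.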

When to Use It

Zero-shot image classification (no labeled data needed)
Image-text similarity or matching
Semantic image search
Content moderation (detect NSFW or violence)
Cross-modal retrieval (image↔text)

Quick Start

Step 1: Install CLIP and dependencies
Step 2: Load the model and its image preprocessing transform (clip.load returns both as model and preprocess)
Step 3: Compute image and text embeddings, then measure similarity to retrieve or classify

Best Practices

Use clear, descriptive text prompts for candidate labels to improve alignment
Pick model variant (e.g., ViT-B/32 vs ViT-L/14) based on speed and quality needs
Normalize embeddings and use cosine similarity for robust comparisons
Index and cache image embeddings to enable fast large-scale retrieval
Be mindful of biases and limitations; incorporate human review for critical decisions
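
The caching advice can be sketched with a small in-memory store that encodes each image at most once. `EmbeddingCache` and `fake_embed` are hypothetical names; `fake_embed` is a placeholder for the real preprocess and model.encode_image step, and a production setup would more likely persist embeddings to disk or a vector database.

```python
class EmbeddingCache:
    """In-memory cache that computes each image embedding at most once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # callable: path -> embedding
        self._store = {}
        self.misses = 0  # number of times the encoder was actually called

    def get(self, path):
        if path not in self._store:
            self.misses += 1
            self._store[path] = self._embed_fn(path)
        return self._store[path]

def fake_embed(path):
    """Placeholder for the expensive preprocess + model.encode_image step."""
    return [float(len(path)), 1.0]

cache = EmbeddingCache(fake_embed)
for path in ["img1.jpg", "img2.jpg", "img1.jpg"]:
    cache.get(path)

print(f"encoder calls: {cache.misses}")
```

Three lookups trigger only two encoder calls here, because the repeated path is served from the cache.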

Example Use Cases

E-commerce: search product images using natural language queries like 'red leather wallet'
Content moderation: flag images that may violate policies by matching against risk phrases
Media library: semantically browse a large collection by text descriptions
Zero-shot labeling: assign custom categories to images without labeled datasets
Vision-language tasks: perform semantic image search and support basic VQA-style queries