llava
LLaVA - Large Language and Vision Assistant
Open-source vision-language model for conversational image understanding.
When to use LLaVA
Use when:
- Building vision-language chatbots
- Visual question answering (VQA)
- Image description and captioning
- Multi-turn image conversations
- Visual instruction following
- Document understanding with images
Metrics:
- 23,000+ GitHub stars
- Targets GPT-4V-level capabilities
- Apache 2.0 License
- Multiple model sizes (7B-34B params)
Use alternatives instead:
- GPT-4V: Highest quality, API-based
- CLIP: Simple zero-shot classification
- BLIP-2: Better for captioning only
- Flamingo: Research, not open-source
Quick start
Installation
# Clone repository
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA
# Install
pip install -e .
Basic usage
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import torch
# Load model
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)
# Load image
image = Image.open("image.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)
# Create conversation
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# Generate response
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=512
    )
response = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(response)
Available models
| Model | Parameters | VRAM (FP16) | Quality |
|---|---|---|---|
| LLaVA-v1.5-7B | 7B | ~14 GB | Good |
| LLaVA-v1.5-13B | 13B | ~28 GB | Better |
| LLaVA-v1.6-34B | 34B | ~70 GB | Best |
# Load different models
model_7b = "liuhaotian/llava-v1.5-7b"
model_13b = "liuhaotian/llava-v1.5-13b"
model_34b = "liuhaotian/llava-v1.6-34b"
# 4-bit quantization for lower VRAM (pass as load_4bit=True to load_pretrained_model)
load_4bit = True # Reduces VRAM by ~4× relative to FP16
CLI usage
# Single image query (one-shot, non-interactive)
python -m llava.eval.run_llava \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg \
    --query "What is in this image?"
# Multi-turn conversation
python -m llava.serve.cli \
--model-path liuhaotian/llava-v1.5-7b \
--image-file image.jpg
# Then type questions interactively
Web UI (Gradio)
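# Launch the Gradio demo (three processes, each in its own terminal)
# 1. Controller
python -m llava.serve.controller --host 0.0.0.0 --port 10000
# 2. Gradio web server
python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
# 3. Model worker (--load-4bit is optional and reduces VRAM)
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-7b --load-4bit
# Open the URL printed by the Gradio web server (http://localhost:7860 by default)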
Multi-turn conversations
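# Initialize conversation
# generate() is a user-defined helper that wraps the tokenize-and-generate steps from "Basic usage"
conv = conv_templates["llava_v1"].copy()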
# Turn 1
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
response1 = generate(conv, model, image) # "A dog playing in a park"
# Turn 2
conv.messages[-1][1] = response1 # Add previous response
conv.append_message(conv.roles[0], "What breed is the dog?")
conv.append_message(conv.roles[1], None)
response2 = generate(conv, model, image) # "Golden Retriever"
# Turn 3
conv.messages[-1][1] = response2
conv.append_message(conv.roles[0], "What time of day is it?")
conv.append_message(conv.roles[1], None)
response3 = generate(conv, model, image)
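Each turn above repeats the same pattern: append the user message, append an empty assistant slot, generate, then write the answer back into that slot. A minimal sketch of that pattern as one helper follows. It assumes only that the conversation object exposes `roles`, `messages`, and `append_message` as used above. `SimpleConv`, `chat_turn`, and `fake_generate` are hypothetical names, and `fake_generate` stands in for real model inference.

```python
from typing import Callable, List, Optional


class SimpleConv:
    """Hypothetical stand-in exposing the attributes used in the snippet above."""

    def __init__(self) -> None:
        self.roles = ("USER", "ASSISTANT")
        self.messages: List[List[Optional[str]]] = []

    def append_message(self, role: str, message: Optional[str]) -> None:
        self.messages.append([role, message])


def chat_turn(conv, user_text: str, generate_fn: Callable[[object], str]) -> str:
    """Run one turn: add the question, generate, store the answer in the history."""
    conv.append_message(conv.roles[0], user_text)
    conv.append_message(conv.roles[1], None)   # empty slot for the model to complete
    answer = generate_fn(conv)
    conv.messages[-1][1] = answer              # fill the slot before the next turn
    return answer


def fake_generate(conv) -> str:
    """Placeholder for real inference; reports how many user turns it can see."""
    user_turns = sum(1 for role, _ in conv.messages if role == conv.roles[0])
    return f"answer {user_turns}"


conv = SimpleConv()
first = chat_turn(conv, "<image>\nWhat is in this image?", fake_generate)
second = chat_turn(conv, "What breed is the dog?", fake_generate)
print(first, "|", second)
print(len(conv.messages))
```

Passing the generation step in as a function keeps the history bookkeeping separate from the model call, so the same helper should work with a real LLaVA conversation template and a real generate wrapper.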
Common tasks
Image captioning
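# ask() is a user-defined helper (not part of LLaVA) that runs the "Basic usage" steps for a single question
question = "Describe this image in detail."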
response = ask(model, image, question)
Visual question answering
question = "How many people are in the image?"
response = ask(model, image, question)
Object detection (textual)
question = "List all the objects you can see in this image."
response = ask(model, image, question)
Scene understanding
question = "What is happening in this scene?"
response = ask(model, image, question)
Document understanding
question = "What is the main topic of this document?"
response = ask(model, document_image, question)
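The task snippets above call `ask` without defining it. One possible definition is sketched below, reusing only the calls shown in "Basic usage". `make_ask` and `with_image_token` are hypothetical names, and `"<image>"` is assumed to match `DEFAULT_IMAGE_TOKEN`. The LLaVA imports are deferred into the function body, so the helper can be defined on a machine without the model installed. Treat it as a starting point rather than the project's own helper.

```python
IMAGE_PLACEHOLDER = "<image>"  # assumed to equal llava.constants.DEFAULT_IMAGE_TOKEN


def with_image_token(question: str) -> str:
    """Prefix a question with the image placeholder, as in the basic-usage example."""
    return IMAGE_PLACEHOLDER + "\n" + question


def make_ask(tokenizer, image_processor, conv_mode: str = "llava_v1"):
    """Return ask(model, image, question) bound to a tokenizer and image processor."""

    def ask(model, image, question: str, max_new_tokens: int = 512) -> str:
        # Deferred imports: these need LLaVA and PyTorch to be installed.
        import torch
        from llava.constants import IMAGE_TOKEN_INDEX
        from llava.conversation import conv_templates
        from llava.mm_utils import process_images, tokenizer_image_token

        image_tensor = process_images([image], image_processor, model.config)
        image_tensor = image_tensor.to(model.device, dtype=torch.float16)

        # Fresh single-turn conversation for every question
        conv = conv_templates[conv_mode].copy()
        conv.append_message(conv.roles[0], with_image_token(question))
        conv.append_message(conv.roles[1], None)

        input_ids = tokenizer_image_token(
            conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
        ).unsqueeze(0).to(model.device)
        with torch.inference_mode():
            output_ids = model.generate(
                input_ids,
                images=image_tensor,
                do_sample=True,
                temperature=0.2,
                max_new_tokens=max_new_tokens,
            )
        return tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()

    return ask


print(repr(with_image_token("Describe this image in detail.")))
```

With the objects returned by `load_pretrained_model`, usage would look like `ask = make_ask(tokenizer, image_processor)` followed by `ask(model, image, question)`.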
Training custom model
# Stage 1: Feature alignment (558K image-caption pairs)
bash scripts/v1_5/pretrain.sh
# Stage 2: Visual instruction tuning (665K mixture: ~150K GPT-generated instruction data plus ~515K academic VQA data)
bash scripts/v1_5/finetune.sh
Quantization (reduce VRAM)
# 4-bit quantization
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name=get_model_name_from_path("liuhaotian/llava-v1.5-13b"),
    load_4bit=True # Reduces VRAM ~4× relative to FP16
)
# 8-bit quantization: pass load_8bit=True instead of load_4bit=True
load_8bit=True # Reduces VRAM ~2× relative to FP16
Best practices
- Start with 7B model - Good quality, manageable VRAM
- Use 4-bit quantization - Reduces VRAM significantly
- GPU required - CPU inference extremely slow
- Clear prompts - Specific questions get better answers
- Multi-turn conversations - Maintain conversation context
- Temperature 0.2-0.7 - Balance creativity/consistency
- max_new_tokens 512-1024 - For detailed responses
- Multiple images - Process them sequentially, one call per image
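To see why the temperature range above trades consistency against variety, the sketch below applies a temperature-scaled softmax to made-up logits. It illustrates the general sampling mechanism, not LLaVA-specific code. `softmax_with_temperature` is a hypothetical helper and the logit values are invented.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]                     # made-up scores for three candidate tokens
for t in (0.2, 0.7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature almost all probability mass lands on the top-scoring token, which is why low values give more repeatable answers.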
Performance
| Model | VRAM (FP16) | VRAM (4-bit) | Speed (tokens/s) |
|---|---|---|---|
| 7B | ~14 GB | ~4 GB | ~20 |
| 13B | ~28 GB | ~8 GB | ~12 |
| 34B | ~70 GB | ~18 GB | ~5 |
Figures are approximate, for an A100 GPU.
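The VRAM columns roughly follow from parameter count times bytes per weight. The sketch below estimates weight memory only. Activations, the KV cache, and the vision encoder are left out, which is presumably why the table's figures run slightly higher. `weight_memory_gb` and the `BYTES_PER_PARAM` mapping are hypothetical names.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for size in (7, 13, 34):
    fp16 = weight_memory_gb(size, "fp16")
    four_bit = weight_memory_gb(size, "4bit")
    print(f"{size}B: fp16 ~{fp16:.1f} GB, 4-bit ~{four_bit:.1f} GB")
```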
Benchmarks
LLaVA achieves competitive scores on:
- VQAv2: 78.5%
- GQA: 62.0%
- MM-Vet: 35.4%
- MMBench: 64.3%
Limitations
- Hallucinations - May describe things not in image
- Spatial reasoning - Struggles with precise locations
- Small text - Difficulty reading fine print
- Object counting - Imprecise for many objects
- VRAM requirements - Need powerful GPU
- Inference speed - Slower than CLIP
Integration with frameworks
LangChain
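from langchain.llms.base import LLM

class LLaVALLM(LLM):
    @property
    def _llm_type(self):
        return "llava"

    def _call(self, prompt, stop=None, run_manager=None, **kwargs):
        # Custom LLaVA inference; ask_llava, model, and image are assumed to be defined elsewhere
        return ask_llava(model, image, prompt)

llm = LLaVALLM()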
Gradio App
import gradio as gr

def chat(message, history, image):
    # ChatInterface passes (message, history) first, then each additional input
    response = ask_llava(model, image, message)
    return response

demo = gr.ChatInterface(
    chat,
    additional_inputs=[gr.Image(type="pil")],
    title="LLaVA Chat"
)
demo.launch()
Resources
- GitHub: https://github.com/haotian-liu/LLaVA ⭐ 23,000+
- Paper: https://arxiv.org/abs/2304.08485
- Demo: https://llava.hliu.cc
- Models: https://huggingface.co/liuhaotian
- License: Apache 2.0
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/18-multimodal/llava/SKILL.md
Overview
LLaVA is an open-source vision-language model for conversational image understanding. It fuses a CLIP-based vision encoder with Vicuna/LLaMA language models to support multi-turn image chats, visual question answering, and instruction following. Use it to build vision-language chatbots or perform image understanding tasks.
How This Skill Works
It combines a CLIP vision encoder with LLaMA/Vicuna language models to process images and generate natural language responses. During inference, images are preprocessed and converted into image tokens that accompany text prompts, enabling the model to reason across dialogue turns.
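The paragraph above says images become image tokens that accompany the text prompt. A simplified sketch of that idea: the prompt is split at the `<image>` placeholder, each text chunk is tokenized, and a sentinel id marks where the image features go. The whitespace tokenizer is a toy, `encode_with_image` is a hypothetical name, and the sentinel value `-200` is assumed to match `IMAGE_TOKEN_INDEX`. This shows the concept and is not the library's implementation.

```python
from typing import Dict, List

IMAGE_PLACEHOLDER = "<image>"
IMAGE_SENTINEL = -200   # assumed value; a negative id no real vocabulary entry can use


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Whitespace tokenizer that assigns ids on first sight (illustration only)."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def encode_with_image(prompt: str, vocab: Dict[str, int]) -> List[int]:
    """Tokenize the text around each placeholder and put a sentinel between chunks."""
    chunks = prompt.split(IMAGE_PLACEHOLDER)
    ids: List[int] = []
    for index, chunk in enumerate(chunks):
        if index > 0:
            ids.append(IMAGE_SENTINEL)   # the model later swaps this for image features
        ids.extend(toy_tokenize(chunk, vocab))
    return ids


vocab: Dict[str, int] = {}
ids = encode_with_image("USER: <image>\nWhat is in this image? ASSISTANT:", vocab)
print(ids)  # [0, -200, 1, 2, 3, 4, 5, 6]
```

Using an id outside the vocabulary range lets the model locate the image position unambiguously before it replaces that position with the projected vision features.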
When to Use It
- Building vision-language chatbots that handle real-time image inputs.
- Visual question answering on product photos, scenes, or documents.
- Multi-turn image conversations that require context tracking.
- Image description and captioning for accessibility and summaries.
- Visual instruction following and image-based task execution.
Quick Start
- Step 1: Install and load a pre-trained LLaVA model from the repository (clone, pip install -e .).
- Step 2: Prepare an image and start a multi-turn conversation using a conv_template and image tokens.
- Step 3: Run inference to generate a response and read the model's answer.
Best Practices
- Choose a model size that fits your hardware (7B-34B; ~14–70 GB VRAM depending on model).
- Use the provided image processor and image-token workflow to feed images into prompts.
- Leverage conv_templates for consistent multi-turn dialogue flows.
- Consider visual instruction tuning to improve instruction-following capabilities.
- Benchmark against API-based options (e.g., GPT-4V) to calibrate expectations.
Example Use Cases
- A customer-support chatbot that answers questions about product images.
- An e-commerce image QA assistant that clarifies features from photos.
- An accessibility tool that generates descriptive captions for images.
- A document-image understanding workflow for extracting info from forms and screenshots.
- A research prototype comparing multimodal models for vision-language tasks.