vllm
MCP server from micytao/vllm-mcp-server
claude mcp add --transport stdio micytao-vllm-mcp-server uvx vllm-mcp-server \
  --env VLLM_MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --env VLLM_BASE_URL="http://localhost:8000" \
  --env VLLM_HF_TOKEN="hf_your_token_here"
How to use
This MCP server exposes vLLM capabilities to MCP-compatible assistants. It enables chat and text completions against a vLLM backend, provides model discovery and info endpoints, and includes server health monitoring and platform-aware container management. With the included tools, you can start and stop the vLLM container automatically, query server status, list available models, and perform chat or completion requests. The server works with common MCP clients (such as Claude or Cursor) by presenting a consistent set of commands for inference, model management, and server control.

To begin, configure your MCP client to point at the vLLM MCP server (using uvx in the recommended setup) and provide the necessary environment variables, such as VLLM_BASE_URL and the model selection. After configuration, launch the backend via start_vllm and interact with it through vllm_chat and vllm_complete for inference, or use list_models and get_model_info to inspect available models.

For deployment, the MCP server can manage a vLLM Docker container with automatic platform detection, simplifying setup on Linux, macOS, or Windows.
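Under the hood, vLLM serves an OpenAI-compatible HTTP API, so a tool like vllm_chat presumably boils down to a POST against /v1/chat/completions on the configured base URL. A minimal sketch of that request flow (the helper names and the exact tool-to-endpoint mapping are assumptions for illustration, not the package's actual code):

```python
import json
import urllib.request
import urllib.error

# Base URL of the vLLM backend (the documented default).
BASE_URL = "http://localhost:8000"


def build_chat_request(model, messages):
    """Build the JSON payload for vLLM's OpenAI-compatible
    /v1/chat/completions endpoint."""
    return {"model": model, "messages": messages}


def vllm_chat(prompt, model="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """Roughly what an MCP vllm_chat tool would do: send one user
    message and return the assistant's reply text."""
    payload = build_chat_request(model, [{"role": "user", "content": prompt}])
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]
    except (urllib.error.URLError, OSError) as exc:
        return f"vLLM backend not reachable: {exc}"
```

Because the API is OpenAI-compatible, the same request shape works with any OpenAI-style client library pointed at VLLM_BASE_URL.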
How to install
Prerequisites:
- Python 3.10+ installed on your system
- Git available to clone repositories
- Basic container runtime (Podman or Docker) if you plan to run via container
Install from PyPI (recommended):
pip install vllm-mcp-server
Install from source:
git clone https://github.com/micytao/vllm-mcp-server.git
cd vllm-mcp-server
pip install -e .
Optional: the recommended MCP client setup launches the server through uvx, which is provided by uv. If uvx is not already on your PATH, install uv:
pip install uv
Configure your MCP client (example) and start using the MCP server as described in the README. You can also run the server via container tooling (see Quick Start in the README) if you prefer Docker/Podman-based deployment.
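After installation, it is useful to confirm the vLLM backend is actually up before sending inference requests. vLLM's OpenAI-compatible server exposes a /health endpoint; a readiness check along these lines (written here for illustration, not taken from the package) is roughly what the server's health monitoring would do:

```python
import time
import urllib.request
import urllib.error


def wait_for_vllm(base_url="http://localhost:8000", timeout_s=60, interval_s=2.0):
    """Poll {base_url}/health until it answers HTTP 200.

    Returns True once the backend is ready, False if the deadline
    passes without a healthy response.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # backend not up yet; retry after a short pause
        time.sleep(interval_s)
    return False
```

On a fresh container start the first healthy response can take a while, since vLLM loads model weights before serving; size timeout_s accordingly for large models.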
Additional notes
- The MCP server uses environment variables to configure the vLLM connection and model selection. Typical vars include VLLM_BASE_URL, VLLM_MODEL, and VLLM_HF_TOKEN for gated models.
- If you see authentication or token errors for gated models, obtain a Hugging Face access token and set VLLM_HF_TOKEN accordingly.
- The server supports both Podman and Docker for container-based operation; the provided tooling will attempt platform detection to pick GPU-enabled or CPU images as appropriate.
- max_model_len defaults differ by CPU vs GPU mode; you can override it by passing max_model_len to the start_vllm tool as shown in the documentation.
- When using start_vllm, make sure the vLLM backend container is accessible at the VLLM_BASE_URL you configure (default http://localhost:8000).
- If you experience port conflicts, adjust the port in both the MCP client config and the vLLM container run command accordingly.
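The environment variables described in the notes above can be consolidated into a small config-loading helper. This is an illustrative sketch: apart from the documented http://localhost:8000 default for VLLM_BASE_URL, the defaults and the helper itself are assumptions, not the package's internals:

```python
import os


def load_vllm_config(env=os.environ):
    """Collect the vLLM connection settings from environment variables.

    VLLM_BASE_URL defaults to the documented http://localhost:8000;
    the model default mirrors the example in the setup command above.
    VLLM_HF_TOKEN is only needed for gated Hugging Face models.
    """
    return {
        "base_url": env.get("VLLM_BASE_URL", "http://localhost:8000"),
        "model": env.get("VLLM_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
        "hf_token": env.get("VLLM_HF_TOKEN"),  # None when unset
    }
```

Passing env explicitly (rather than always reading os.environ) keeps the helper easy to test and makes it obvious which variables the server depends on.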
Related MCP Servers
mcp-vegalite
MCP server from isaacwasserman/mcp-vegalite-server
github-chat
A Model Context Protocol (MCP) for analyzing and querying GitHub repositories using the GitHub Chat API.
nautex
MCP server for guiding coding agents through an end-to-end requirements-to-implementation-plan pipeline
pagerduty
PagerDuty's official local MCP (Model Context Protocol) server which provides tools to interact with your PagerDuty account directly from your MCP-enabled client.
futu-stock
MCP server for Futu NiuNiu stock
mcp-boilerplate
Boilerplate using one of the better ways to build MCP servers, written with FastMCP