vllm
MCP server from micytao/vllm-mcp-server
claude mcp add --transport stdio micytao-vllm-mcp-server uvx vllm-mcp-server \
  --env VLLM_MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --env VLLM_BASE_URL="http://localhost:8000" \
  --env VLLM_HF_TOKEN="hf_your_token_here"
How to use
This MCP server exposes vLLM capabilities to MCP-compatible assistants. It enables chat and text completions against a vLLM backend, provides model discovery and info endpoints, and includes server health monitoring and platform-aware container management. With the included tools, you can start and stop the vLLM container automatically, query server status, list available models, and perform chat or completion requests. The server works with common MCP clients (such as Claude or Cursor) by presenting a consistent set of commands for inference, model management, and server control.

To begin, configure your MCP client to point at the vLLM MCP server (using uvx in the recommended setup) and provide the necessary environment variables, such as VLLM_BASE_URL and the model selection. After configuration, launch the backend via start_vllm and interact with it through vllm_chat and vllm_complete for inference, or use list_models and get_model_info to inspect available models.

For deployment, the MCP server can manage a vLLM Docker container with automatic platform detection, simplifying setup on Linux, macOS, or Windows.
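Under the hood, vLLM serves an OpenAI-compatible HTTP API, so a tool like vllm_chat presumably boils down to a POST against /v1/chat/completions on the configured base URL. A minimal sketch of that request flow (the helper names and the exact tool-to-endpoint mapping are assumptions for illustration, not the package's actual code):

```python
import json
import urllib.request
import urllib.error

# Base URL of the vLLM backend (the documented default).
BASE_URL = "http://localhost:8000"


def build_chat_request(model, messages):
    """Build the JSON payload for vLLM's OpenAI-compatible
    /v1/chat/completions endpoint."""
    return {"model": model, "messages": messages}


def vllm_chat(prompt, model="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """Roughly what an MCP vllm_chat tool would do: send one user
    message and return the assistant's reply text."""
    payload = build_chat_request(model, [{"role": "user", "content": prompt}])
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]
    except (urllib.error.URLError, OSError) as exc:
        return f"vLLM backend not reachable: {exc}"
```

Because the API is OpenAI-compatible, the same request shape works with any OpenAI-style client library pointed at VLLM_BASE_URL.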
How to install
Prerequisites:
- Python 3.10+ installed on your system
- Git available to clone repositories
- Basic container runtime (Podman or Docker) if you plan to run via container
Install from PyPI (recommended):
pip install vllm-mcp-server
Install from source:
git clone https://github.com/micytao/vllm-mcp-server.git
cd vllm-mcp-server
pip install -e .
Optional: the recommended MCP client setup launches the server through uvx, which is provided by uv. If uvx is not already on your PATH, install uv:
pip install uv
Configure your MCP client (example) and start using the MCP server as described in the README. You can also run the server via container tooling (see Quick Start in the README) if you prefer Docker/Podman-based deployment.
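After installation, it is useful to confirm the vLLM backend is actually up before sending inference requests. vLLM's OpenAI-compatible server exposes a /health endpoint; a readiness check along these lines (written here for illustration, not taken from the package) is roughly what the server's health monitoring would do:

```python
import time
import urllib.request
import urllib.error


def wait_for_vllm(base_url="http://localhost:8000", timeout_s=60, interval_s=2.0):
    """Poll {base_url}/health until it answers HTTP 200.

    Returns True once the backend is ready, False if the deadline
    passes without a healthy response.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # backend not up yet; retry after a short pause
        time.sleep(interval_s)
    return False
```

On a fresh container start the first healthy response can take a while, since vLLM loads model weights before serving; size timeout_s accordingly for large models.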
Additional notes
- The MCP server uses environment variables to configure the vLLM connection and model selection. Typical vars include VLLM_BASE_URL, VLLM_MODEL, and VLLM_HF_TOKEN for gated models.
- If you see authentication or token errors for gated models, obtain a Hugging Face access token and set VLLM_HF_TOKEN accordingly.
- The server supports both Podman and Docker for container-based operation; the provided tooling will attempt platform detection to pick GPU-enabled or CPU images as appropriate.
- max_model_len defaults differ by CPU vs GPU mode; you can override it by passing max_model_len to the start_vllm tool as shown in the documentation.
- When using start_vllm, make sure the vLLM backend container is accessible at the VLLM_BASE_URL you configure (default http://localhost:8000).
- If you experience port conflicts, adjust the port in both the MCP client config and the vLLM container run command accordingly.
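The environment variables described in the notes above can be consolidated into a small config-loading helper. This is an illustrative sketch: apart from the documented http://localhost:8000 default for VLLM_BASE_URL, the defaults and the helper itself are assumptions, not the package's internals:

```python
import os


def load_vllm_config(env=os.environ):
    """Collect the vLLM connection settings from environment variables.

    VLLM_BASE_URL defaults to the documented http://localhost:8000;
    the model default mirrors the example in the setup command above.
    VLLM_HF_TOKEN is only needed for gated Hugging Face models.
    """
    return {
        "base_url": env.get("VLLM_BASE_URL", "http://localhost:8000"),
        "model": env.get("VLLM_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
        "hf_token": env.get("VLLM_HF_TOKEN"),  # None when unset
    }
```

Passing env explicitly (rather than always reading os.environ) keeps the helper easy to test and makes it obvious which variables the server depends on.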
Related MCP Servers
mcp-vegalite
MCP server from isaacwasserman/mcp-vegalite-server
github-chat
A Model Context Protocol (MCP) for analyzing and querying GitHub repositories using the GitHub Chat API.
nautex
MCP server for guiding coding agents through an end-to-end requirements-to-implementation-plan pipeline
pagerduty
PagerDuty's official local MCP (Model Context Protocol) server which provides tools to interact with your PagerDuty account directly from your MCP-enabled client.
futu-stock
MCP server for Futu NiuNiu stock
mcp-boilerplate
Boilerplate using one of the better ways to build MCP servers, written with FastMCP