
semantic-router

System-Level Intelligent Router for Mixture-of-Models across Cloud, Data Center, and Edge

Installation
Run this command in your terminal to add the MCP server to Claude Code.
claude mcp add --transport stdio vllm-project-semantic-router python -m vllm_sr serve

How to use

vLLM Semantic Router sits between clients and model endpoints, intelligently routing each request to an appropriate LLM endpoint, managing caching, and optimizing routing decisions in Mixture-of-Models (MoM) systems. The server exposes a CLI with commands such as config, init, dashboard, logs, serve, status, and stop. To start using it, install the Python package and run the serve command, which boots the router and begins serving routing decisions for your vLLM endpoints. If you need credentials for Hugging Face models, set the HF_ENDPOINT, HF_TOKEN, and HF_HOME environment variables; the router picks these up and forwards them to models as needed. The router also provides a configuration generator, monitoring dashboards, and logs so you can observe routing behavior in real time.
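A minimal session sketch using the CLI commands named above (command names come from the CLI summary; flags and output vary by version, and the guard below is only so the snippet degrades gracefully when the package is not installed):

```shell
# Check whether the vllm-sr CLI is available before using it.
if command -v vllm-sr >/dev/null 2>&1; then
  cli_available=yes
  vllm-sr status      # is the router running?
  vllm-sr logs        # inspect routing decisions
  # vllm-sr stop      # stop the service when done
else
  cli_available=no
  echo "vllm-sr not found; install it with: pip install vllm-sr"
fi
```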

How to install

Prerequisites:

  • Python 3.8 or newer and a virtual environment tool (optional but recommended)
  • Network access to download packages

  1. Create and activate a Python virtual environment (optional but recommended):
python -m venv vllm-sr-venv
source vllm-sr-venv/bin/activate
  2. Install the vLLM Semantic Router package from PyPI:
pip install vllm-sr
  3. Verify the installation by running the CLI help to confirm the available commands:
vllm-sr --help
  4. Start the router (as a server) using the Python module entry point:
python -m vllm_sr serve

Note: You can customize runtime behavior by exporting environment variables such as HF_ENDPOINT, HF_TOKEN, and HF_HOME prior to starting the service.
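For example, the three Hugging Face variables named above can be exported before launching the service (the values below are placeholders, not defaults the router requires):

```shell
# Hugging Face settings picked up by the router (placeholder values).
export HF_ENDPOINT="https://huggingface.co"   # or a mirror endpoint
export HF_TOKEN="<your-hf-access-token>"      # replace with your real token
export HF_HOME="$HOME/.cache/huggingface"     # model download/cache directory

# Then start the router:
# python -m vllm_sr serve
```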

Additional notes

Tips and common issues:

  • If you run behind a proxy or firewall, ensure ports used by the dashboard and API are open and reachable.
  • For model access, set Hugging Face credentials via HF_ENDPOINT, HF_TOKEN, and HF_HOME. These are automatically propagated to the router and model download/cache logic.
  • If you need to adjust resource limits (file descriptors) for Envoy or other proxies, set VLLM_SR_NOFILE_LIMIT as described in the docs.
  • Use the CLI commands to inspect status, view logs, and manage the running service (config, init, dashboard, logs, status, stop).
  • Ensure your Python environment uses compatible versions of dependencies listed by vllm-sr and the models you intend to route.
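As a sketch of the resource-limit tip above, the variable can be exported before starting the service (the value 65536 is illustrative, not a documented requirement; see the project docs for guidance):

```shell
# Raise the file-descriptor limit used when launching Envoy or other proxies.
export VLLM_SR_NOFILE_LIMIT=65536

# Confirm it is set before starting the router:
echo "$VLLM_SR_NOFILE_LIMIT"
```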
