
semantic-router

System-Level Intelligent Router for Mixture-of-Models across Cloud, Data Center, and Edge

Installation
Run this command in your terminal to add the MCP server to Claude Code.
claude mcp add --transport stdio vllm-project-semantic-router python -m vllm_sr serve

How to use

vLLM Semantic Router sits between clients and model endpoints, intelligently routing each request to an appropriate LLM endpoint, managing caching, and optimizing routing decisions in Mixture-of-Models (MoM) systems. The server exposes a CLI with commands such as config, init, dashboard, logs, serve, status, and stop. To start using it, install the Python package and run the serve command, which boots the router and begins serving routing decisions for your vLLM endpoints. If you need credentials for Hugging Face models, set the HF_ENDPOINT, HF_TOKEN, and HF_HOME environment variables; the router picks these up and forwards them to models as needed. The router also provides a configuration generator, monitoring dashboards, and logs so you can observe routing behavior in real time.
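A minimal session sketch using the CLI commands named above (command names come from the CLI summary; flags and output vary by version, and the guard below is only so the snippet degrades gracefully when the package is not installed):

```shell
# Check whether the vllm-sr CLI is available before using it.
if command -v vllm-sr >/dev/null 2>&1; then
  cli_available=yes
  vllm-sr status      # is the router running?
  vllm-sr logs        # inspect routing decisions
  # vllm-sr stop      # stop the service when done
else
  cli_available=no
  echo "vllm-sr not found; install it with: pip install vllm-sr"
fi
```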

How to install

Prerequisites:

  • Python 3.8 or newer and a virtual environment tool (optional but recommended)
  • Network access to download packages

  1. Create and activate a Python virtual environment (optional but recommended):
python -m venv vllm-sr-venv
source vllm-sr-venv/bin/activate
  2. Install the vLLM Semantic Router package from PyPI:
pip install vllm-sr
  3. Verify the installation by running the CLI help to confirm the available commands:
vllm-sr --help
  4. Start the router (as a server) using the Python module entry point:
python -m vllm_sr serve

Note: You can customize runtime behavior by exporting environment variables such as HF_ENDPOINT, HF_TOKEN, and HF_HOME prior to starting the service.
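For example, the three Hugging Face variables named above can be exported before launching the service (the values below are placeholders, not defaults the router requires):

```shell
# Hugging Face settings picked up by the router (placeholder values).
export HF_ENDPOINT="https://huggingface.co"   # or a mirror endpoint
export HF_TOKEN="<your-hf-access-token>"      # replace with your real token
export HF_HOME="$HOME/.cache/huggingface"     # model download/cache directory

# Then start the router:
# python -m vllm_sr serve
```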

Additional notes

Tips and common issues:

  • If you run behind a proxy or firewall, ensure ports used by the dashboard and API are open and reachable.
  • For model access, set Hugging Face credentials via HF_ENDPOINT, HF_TOKEN, and HF_HOME. These are automatically propagated to the router and model download/cache logic.
  • If you need to adjust resource limits (file descriptors) for Envoy or other proxies, set VLLM_SR_NOFILE_LIMIT as described in the docs.
  • Use the CLI commands to inspect status, view logs, and manage the running service (config, init, dashboard, logs, status, stop).
  • Ensure your Python environment uses compatible versions of dependencies listed by vllm-sr and the models you intend to route.
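As a sketch of the resource-limit tip above, the variable can be exported before starting the service (the value 65536 is illustrative, not a documented requirement; see the project docs for guidance):

```shell
# Raise the file-descriptor limit used when launching Envoy or other proxies.
export VLLM_SR_NOFILE_LIMIT=65536

# Confirm it is set before starting the router:
echo "$VLLM_SR_NOFILE_LIMIT"
```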
