MCPToolBenchPP
MCPToolBench++: a Model Context Protocol (MCP) tool-use benchmark for evaluating the tool-use abilities of AI agents and models
claude mcp add --transport stdio mcp-tool-bench-mcptoolbenchpp python -m mcp_tool_bench_pp
How to use
MCPToolBenchPP is a large-scale AI agent tool-use benchmark that aggregates thousands of MCP servers across multiple domains to evaluate the tool-use capabilities of AI agents. It provides a structured collection of MCP servers and evaluation workflows, letting researchers test how well agents select, call, and chain tools to complete tasks. To use it, set up the MCPToolBenchPP server bundle in your environment, then use the included evaluation scripts to run benchmark tasks across Browser, File System, Search, Map, Pay, and other domains. The benchmark focuses on real-world tool use, covering both single-step and multi-step tool-invocation scenarios, and can be extended with additional domain datasets. Start the server(s) and run the provided dataset evaluation pipeline to measure metrics such as success rate and tool-use efficiency across categories.
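To illustrate the kind of per-category metric the evaluation pipeline reports, the sketch below aggregates success rates from a list of result records. The record fields (`category`, `success`) are hypothetical stand-ins and will differ from the repository's actual output format.

```python
from collections import defaultdict

def success_rates(results):
    """Aggregate per-category success rates from benchmark result records.

    Each record is assumed (hypothetically) to look like:
        {"category": "Browser", "success": True}
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in results:
        totals[record["category"]] += 1
        passes[record["category"]] += int(record["success"])
    return {cat: passes[cat] / totals[cat] for cat in totals}

# Example with made-up records:
demo = [
    {"category": "Browser", "success": True},
    {"category": "Browser", "success": False},
    {"category": "Map", "success": True},
]
rates = success_rates(demo)
```

The real pipeline may also track efficiency metrics (e.g. number of tool calls per task); the same aggregation pattern applies.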
How to install
Prerequisites:
- Python 3.8+ (3.9+ recommended)
- Git
- A suitable environment (virtualenv or conda)

1. Clone the repository or obtain the MCPToolBenchPP package:

   git clone https://github.com/mcp-tool-bench/MCPToolBenchPP.git
   cd MCPToolBenchPP

2. Create and activate a virtual environment:

   python -m venv venv
   source venv/bin/activate     # Linux/macOS
   venv\Scripts\activate.bat    # Windows

3. Install dependencies (adjust if the repository uses a different requirements file):

   pip install -r requirements.txt

4. Verify the installation by printing the module help:

   python -m mcp_tool_bench_pp --help

5. Run the benchmark server (example):

   python -m mcp_tool_bench_pp
Note: If the project provides its own setup or packaging scripts, prefer the commands documented in the repository. MCPToolBenchPP is a benchmark collection rather than a single application, so you may also need to configure dataset paths and environment variables as described in the repository documentation.
Additional notes
Tips and caveats:
- The project is described as a work in progress, with ongoing updates and new domain datasets, so expect changes and additions over time.
- If you encounter network or resource limitations, ensure your environment has sufficient CPU/GPU, memory, and disk space for large-scale tool-use benchmarks.
- Environment variables or config files may be used to tailor the evaluation (e.g., specifying dataset paths, model names, or evaluation settings). Check the docs for any required vars like DATA_PATH, MODEL_NAME, or EVAL_CONFIG.
- If you run into compatibility issues, consider using a clean virtual environment and pinning dependency versions.
- For reproducibility, document the exact data version and server configuration used for each run, as benchmark results can vary with dataset version.
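If you script your own evaluation runs, a small loader that reads settings from environment variables with sensible fallbacks keeps configuration in one place. The variable names below (DATA_PATH, MODEL_NAME, EVAL_CONFIG) are taken from the tips above but are only examples, not confirmed names from the MCPToolBenchPP repository:

```python
import os

def load_eval_config():
    """Read example evaluation settings from environment variables.

    The variable names are illustrative (echoing the tips above),
    not confirmed names from the MCPToolBenchPP repository.
    """
    return {
        "data_path": os.environ.get("DATA_PATH", "./data"),
        "model_name": os.environ.get("MODEL_NAME", "default-model"),
        "eval_config": os.environ.get("EVAL_CONFIG", "configs/default.json"),
    }

config = load_eval_config()
```

Recording the resolved config dict alongside each run's results also helps with the reproducibility point above.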
Related MCP Servers
python-utcp
Official python implementation of UTCP. UTCP is an open standard that lets AI agents call any API directly, without extra middleware.
skillport
Bring Agent Skills to Any AI Agent and Coding Agent — via CLI or MCP. Manage once, serve anywhere.
MCPSecBench
MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols
AI-SOC-Agent
Blackhat 2025 presentation and codebase: AI SOC agent & MCP server for automated security investigation, alert triage, and incident response. Integrates with ELK, IRIS, and other platforms.
jmcomic-ai
AI-native JMComic assistant: inject JMComic into your AI agent via MCP and Skills. / AI-powered JMComic assistant for seamless integration with AI Agents via MCP & Skills.
alris
Alris is an AI automation tool that transforms natural language commands into task execution.