mcpmark
MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.
claude mcp add --transport stdio eval-sys-mcpmark python -m pipeline --mcp filesystem \
  --env OPENAI_API_KEY="your-openai-api-key" \
  --env OPENAI_BASE_URL="https://api.openai.com/v1" \
  --env PLAYWRIGHT_BROWSER="chromium" \
  --env PLAYWRIGHT_HEADLESS="True" \
  --env SOURCE_NOTION_API_KEY="your-source-notion-api-key" \
  --env EVAL_NOTION_API_KEY="your-eval-notion-api-key" \
  --env EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"

Note: PLAYWRIGHT_BROWSER and PLAYWRIGHT_HEADLESS are optional and only needed for Playwright tasks; the two Notion API keys and EVAL_PARENT_PAGE_TITLE are optional and only needed for Notion tasks.
How to use
MCPMark is an evaluation suite designed to benchmark how agentic models interact with real MCP tool environments such as Notion, GitHub, Filesystem, Postgres, and Playwright. It provides ready-to-run task suites with strict automated verification, isolated sandboxes, auto-resume on failures, unified metrics, and aggregated reports. You can run tasks locally or in Docker, and you can select different task suites (standard or easy) to balance coverage and speed. The included orchestration uses a built-in MCPMarkAgent, with an option to try a ReAct-style agent by passing --agent react to the pipeline. To get started, configure your environment, clone the repository, install dependencies, and invoke the pipeline module to run a chosen MCP (e.g., filesystem).
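As a minimal sketch of the invocation described above, the optional ReAct-style agent can be selected with the `--agent react` flag mentioned in the text (MODEL is a placeholder for a model name you have credentials for; the task path is illustrative):

```shell
# Run the filesystem suite with the built-in MCPMarkAgent (default)
python -m pipeline --mcp filesystem --models MODEL

# Same run, but with the optional ReAct-style agent
python -m pipeline --mcp filesystem --models MODEL --agent react
```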
How to install
Prerequisites:
- Python 3.8+ and pip
- Git
- (Optional) Docker if you plan to run Docker-based experiments
Steps:
- Clone the repository:
  git clone https://github.com/eval-sys/mcpmark.git
  cd mcpmark
- (Optional but recommended) Create and activate a virtual environment:
  python -m venv venv
  source venv/bin/activate   # on Unix/macOS
  .\venv\Scripts\activate    # on Windows
- Install the package in editable mode (local development):
  pip install -e .
- (Optional) Install Playwright browsers if you plan to run browser-based tasks:
  playwright install
- Run an evaluation (example using filesystem tasks):
  python -m pipeline --mcp filesystem --k 1 --models MODEL --tasks file_property/size_classification
- If using Docker, build the Docker image and start containers as described in the docs:
  ./build-docker.sh
Notes:
- You can switch to a different MCP by changing the --mcp flag and corresponding --tasks path.
- Ensure required service credentials are available in environment variables or a .mcp_env file at the repo root.
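As a sketch, a .mcp_env file at the repository root might look like the following. The variable names are taken from the `claude mcp add` command earlier in this document; all values are placeholders:

```shell
# .mcp_env — loaded from the repo root; every value below is a placeholder.
OPENAI_API_KEY="your-openai-api-key"
OPENAI_BASE_URL="https://api.openai.com/v1"

# Optional: only needed for Playwright (browser) tasks.
PLAYWRIGHT_BROWSER="chromium"
PLAYWRIGHT_HEADLESS="True"

# Optional: only needed for Notion tasks.
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
```

Alternatively, export the same variables in your shell before invoking the pipeline.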
Additional notes
- Auto-resume: MCPMark will retry and resume unfinished tasks after failures that match retriable patterns (see RETRYABLE_PATTERNS in the source).
- Auto-compaction: Use --compaction-token N to summarize the context once the prompt grows past N tokens; set N very high (e.g., 999999999) to effectively disable compaction.
- Task suites: Standard suite runs the full benchmark; easy suite provides a smaller subset for smoke tests or CI.
- Environment variables: Set credentials for services you intend to use (Notion, GitHub, Postgres, etc.). You can place them in a .mcp_env file or export them in your shell before running the pipeline.
- Docker and local execution: The project supports both local Python execution and Docker-based orchestration. Use Docker for isolated, reproducible runs across different hosts.
- If you encounter a new resumable error, consider opening a PR or issue with the error string and a short repro scenario.
Related MCP Servers
AnyTool
AnyTool: Universal Tool-Use Layer for AI Agents
MCPToolBenchPP
MCPToolBench++: a Model Context Protocol (MCP) benchmark of AI agent and model tool-use ability
pfsense
pfSense MCP Server enables security administrators to manage their pfSense firewalls using natural language through AI assistants like Claude Desktop. Simply ask "Show me blocked IPs" or "Run a PCI compliance check" instead of navigating complex interfaces. Supports REST/XML-RPC/SSH connections and includes built-in compliance checks.
MCPSecBench
MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols
arch
Arch Linux MCP (Model Context Protocol)
fegis
Define AI tools in YAML with natural language schemas. All tool usage is automatically stored in Qdrant vector database, enabling semantic search, filtering, and memory retrieval across sessions.