mcpmark
MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.
claude mcp add --transport stdio eval-sys-mcpmark python -m pipeline --mcp filesystem \
  --env OPENAI_API_KEY="your-openai-api-key" \
  --env OPENAI_BASE_URL="https://api.openai.com/v1" \
  --env PLAYWRIGHT_BROWSER="chromium" \
  --env PLAYWRIGHT_HEADLESS="True" \
  --env SOURCE_NOTION_API_KEY="your-source-notion-api-key" \
  --env EVAL_NOTION_API_KEY="your-eval-notion-api-key" \
  --env EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"

Note: PLAYWRIGHT_BROWSER and PLAYWRIGHT_HEADLESS are optional and only needed for Playwright tasks; the two Notion API keys and EVAL_PARENT_PAGE_TITLE are optional and only needed for Notion tasks.
How to use
MCPMark is an evaluation suite designed to benchmark how agentic models interact with real MCP tool environments such as Notion, GitHub, Filesystem, Postgres, and Playwright. It provides ready-to-run task suites with strict automated verification, isolated sandboxes, auto-resume on failures, unified metrics, and aggregated reports. You can run tasks locally or in Docker, and you can select different task suites (standard or easy) to balance coverage and speed. The included orchestration uses a built-in MCPMarkAgent, with an option to try a ReAct-style agent by passing --agent react to the pipeline. To get started, configure your environment, clone the repository, install dependencies, and invoke the pipeline module to run a chosen MCP (e.g., filesystem).
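As a minimal sketch of the invocation described above, the optional ReAct-style agent can be selected with the `--agent react` flag mentioned in the text (MODEL is a placeholder for a model name you have credentials for; the task path is illustrative):

```shell
# Run the filesystem suite with the built-in MCPMarkAgent (default)
python -m pipeline --mcp filesystem --models MODEL

# Same run, but with the optional ReAct-style agent
python -m pipeline --mcp filesystem --models MODEL --agent react
```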
How to install
Prerequisites:
- Python 3.8+ and pip
- Git
- (Optional) Docker if you plan to run Docker-based experiments
Steps:
- Clone the repository:
  git clone https://github.com/eval-sys/mcpmark.git
  cd mcpmark
- (Optional but recommended) Create and activate a virtual environment:
  python -m venv venv
  source venv/bin/activate   # on Unix/macOS
  .\venv\Scripts\activate    # on Windows
- Install the package in editable mode (local development):
  pip install -e .
- (Optional) Install Playwright browsers if you plan to run browser-based tasks:
  playwright install
- Run an evaluation (example using filesystem tasks):
  python -m pipeline --mcp filesystem --k 1 --models MODEL --tasks file_property/size_classification
- If using Docker, build the Docker image and start containers as described in the docs:
  ./build-docker.sh
Notes:
- You can switch to a different MCP by changing the --mcp flag and corresponding --tasks path.
- Ensure required service credentials are available in environment variables or a .mcp_env file at the repo root.
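As a sketch, a .mcp_env file at the repository root might look like the following. The variable names are taken from the `claude mcp add` command earlier in this document; all values are placeholders:

```shell
# .mcp_env — loaded from the repo root; every value below is a placeholder.
OPENAI_API_KEY="your-openai-api-key"
OPENAI_BASE_URL="https://api.openai.com/v1"

# Optional: only needed for Playwright (browser) tasks.
PLAYWRIGHT_BROWSER="chromium"
PLAYWRIGHT_HEADLESS="True"

# Optional: only needed for Notion tasks.
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
```

Alternatively, export the same variables in your shell before invoking the pipeline.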
Additional notes
- Auto-resume: MCPMark will retry and resume unfinished tasks after failures that match retriable patterns (see RETRYABLE_PATTERNS in the source).
- Auto-compaction: Use --compaction-token N to summarize the context once the prompt grows past N tokens; set N very high (e.g., 999999999) to effectively disable compaction.
- Task suites: Standard suite runs the full benchmark; easy suite provides a smaller subset for smoke tests or CI.
- Environment variables: Set credentials for services you intend to use (Notion, GitHub, Postgres, etc.). You can place them in a .mcp_env file or export them in your shell before running the pipeline.
- Docker and local execution: The project supports both local Python execution and Docker-based orchestration. Use Docker for isolated, reproducible runs across different hosts.
- If you encounter a new resumable error, consider opening a PR or issue with the error string and a short repro scenario.
Related MCP Servers
AnyTool
AnyTool: Universal Tool-Use Layer for AI Agents
MCPToolBenchPP
MCPToolBench++: a Model Context Protocol (MCP) benchmark of AI agent and model tool-use ability
pfsense
pfSense MCP Server enables security administrators to manage their pfSense firewalls using natural language through AI assistants like Claude Desktop. Simply ask "Show me blocked IPs" or "Run a PCI compliance check" instead of navigating complex interfaces. Supports REST/XML-RPC/SSH connections and includes built-in compliance checks.
MCPSecBench
MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols
arch
Arch Linux MCP (Model Context Protocol)
fegis
Define AI tools in YAML with natural language schemas. All tool usage is automatically stored in Qdrant vector database, enabling semantic search, filtering, and memory retrieval across sessions.