
mcpmark

MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.

Installation
Run this command in your terminal to add the MCP server to Claude Code:
claude mcp add --transport stdio eval-sys-mcpmark python -m pipeline --mcp filesystem \
  --env OPENAI_API_KEY="your-openai-api-key" \
  --env OPENAI_BASE_URL="https://api.openai.com/v1" \
  --env PLAYWRIGHT_BROWSER="chromium" \
  --env PLAYWRIGHT_HEADLESS="True" \
  --env EVAL_NOTION_API_KEY="your-eval-notion-api-key" \
  --env SOURCE_NOTION_API_KEY="your-source-notion-api-key" \
  --env EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"

PLAYWRIGHT_BROWSER and PLAYWRIGHT_HEADLESS are optional and only needed for Playwright tasks; EVAL_NOTION_API_KEY and SOURCE_NOTION_API_KEY are only needed for Notion tasks; EVAL_PARENT_PAGE_TITLE is optional.

How to use

MCPMark is an evaluation suite for benchmarking how agentic models interact with real MCP tool environments such as Notion, GitHub, Filesystem, Postgres, and Playwright. It provides ready-to-run task suites with strict automated verification, isolated sandboxes, auto-resume on failures, unified metrics, and aggregated reports. Tasks can be run locally or in Docker, and you can choose between task suites (standard or easy) to balance coverage and speed. Orchestration uses the built-in MCPMarkAgent; a ReAct-style agent is available by passing --agent react to the pipeline. To get started, configure your environment, clone the repository, install dependencies, and invoke the pipeline module with a chosen MCP (e.g., filesystem).
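As a quick illustration, the invocation described above can be assembled programmatically. This is a minimal sketch assuming only the flags shown on this page (--mcp, --k, --models, --tasks, --agent); the model name is a placeholder, and in practice you would simply run the command from the install steps directly.

```python
from typing import Optional

# Sketch: assembling an MCPMark pipeline invocation.
# Only flags shown on this page are used; the model name is a placeholder.
def pipeline_cmd(mcp: str, model: str, tasks: str,
                 agent: Optional[str] = None, k: int = 1) -> list:
    cmd = ["python", "-m", "pipeline",
           "--mcp", mcp,
           "--k", str(k),
           "--models", model,
           "--tasks", tasks]
    if agent is not None:
        cmd += ["--agent", agent]  # e.g. "react" for the ReAct-style agent
    return cmd

# Example: filesystem suite with the ReAct-style agent
print(" ".join(pipeline_cmd("filesystem", "MODEL",
                            "file_property/size_classification", agent="react")))
```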

How to install

Prerequisites:

  • Python 3.8+ and pip
  • Git
  • (Optional) Docker if you plan to run Docker-based experiments

Steps:

  1. Clone the repository:
     git clone https://github.com/eval-sys/mcpmark.git
     cd mcpmark

  2. Create a virtual environment and activate it (optional but recommended):
     python -m venv venv
     source venv/bin/activate    # on Unix/macOS
     .\venv\Scripts\activate     # on Windows

  3. Install the package in editable mode (local development):
     pip install -e .

  4. (Optional) Install Playwright browsers if you plan to use browser-based tasks:
     playwright install

  5. Run the benchmark pipeline (example using filesystem tasks):
     python -m pipeline --mcp filesystem --k 1 --models MODEL --tasks file_property/size_classification

  6. If using Docker, build the Docker image and start containers as described in the docs:
     ./build-docker.sh

Notes:

  • You can switch to a different MCP by changing the --mcp flag and corresponding --tasks path.
  • Ensure required service credentials are available in environment variables or a .mcp_env file at the repo root.
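For reference, a .mcp_env file at the repo root might look like the following. The variable names are taken from the install command above; the values are placeholders, not real credentials.

```shell
# Example .mcp_env (placeholder values — substitute your real credentials)
OPENAI_API_KEY="your-openai-api-key"
OPENAI_BASE_URL="https://api.openai.com/v1"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"      # only for Notion tasks
SOURCE_NOTION_API_KEY="your-source-notion-api-key"  # only for Notion tasks
PLAYWRIGHT_BROWSER="chromium"                       # only for Playwright tasks
PLAYWRIGHT_HEADLESS="True"                          # only for Playwright tasks
```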

Additional notes

  • Auto-resume: MCPMark will retry and resume unfinished tasks after failures that match retriable patterns (see RETRYABLE_PATTERNS in the source).
  • Auto-compaction: Use --compaction-token N to summarize the context once prompt tokens exceed N; set N to a very high value (e.g., 999999999) to effectively disable compaction.
  • Task suites: Standard suite runs the full benchmark; easy suite provides a smaller subset for smoke tests or CI.
  • Environment variables: Set credentials for services you intend to use (Notion, GitHub, Postgres, etc.). You can place them in a .mcp_env file or export them in your shell before running the pipeline.
  • Docker and local execution: The project supports both local Python execution and Docker-based orchestration. Use Docker for isolated, reproducible runs across different hosts.
  • If you encounter a new resumable error, consider opening a PR or issue with the error string and a short repro scenario.
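To make the auto-resume behavior concrete, here is a minimal sketch of pattern-based retry classification. The pattern list below is illustrative only; MCPMark's real list lives in RETRYABLE_PATTERNS in its source, and the actual patterns there may differ.

```python
import re

# Illustrative placeholders — NOT MCPMark's actual RETRYABLE_PATTERNS.
RETRYABLE_PATTERNS = [
    r"rate.?limit",
    r"connection (reset|refused|timed out)",
    r"HTTP 5\d\d",
]

def is_retryable(error_message: str) -> bool:
    """Return True if the error text matches any retriable pattern."""
    return any(re.search(p, error_message, re.IGNORECASE)
               for p in RETRYABLE_PATTERNS)
```

An error like "Rate limit exceeded" would be retried and resumed, while a genuine task failure such as "assertion failed: wrong answer" would not.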
