
k8s-gpu

NVIDIA GPU hardware introspection for Kubernetes clusters via MCP

Installation
Run this command in your terminal to add the MCP server to Claude Code:
claude mcp add --transport stdio arangogutierrez-k8s-gpu-mcp-server npx -y k8s-gpu-mcp-server@latest

How to use

The k8s-gpu-mcp-server is an ephemeral diagnostic agent that provides real-time NVIDIA GPU hardware introspection for Kubernetes clusters over the Model Context Protocol (MCP). It exposes a low-footprint HTTP server that serves MCP tools and workflows, enabling AI-assisted troubleshooting of complex GPU issues. The server ships with a suite of NVML-based tools for inventory, health, error analysis, topology, timing data, and Kubernetes-aware diagnostics, all accessible through a consistent MCP interface. Built-in prompts guide operators through common GPU triage workflows, making it easier to gather context, identify root causes, and surface remediation steps.

To use it, install the MCP server via npx or npm and point your MCP client (Claude Code, Cursor, Claude Desktop, or another MCP host) at the server's endpoint. The server is designed to operate in read-only mode by default, with operator-mode options for deeper access where supported. Once running, you can invoke the available tools such as get_gpu_inventory, get_gpu_health, analyze_xid_errors, get_nvlink_topology, get_gpu_timeline, describe_gpu_node, get_pod_gpu_allocation, explain_failure, get_incident_report, and other MCP prompts to orchestrate comprehensive GPU diagnostics.
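As a quick sanity check before wiring up a client, you can speak raw JSON-RPC to the stdio transport yourself. This is a sketch, not an official test harness: the payloads follow the standard MCP handshake (initialize, initialized notification, then tools/list), and it assumes npx can fetch the package.

```shell
# Sketch: exercise the stdio transport with raw MCP JSON-RPC messages.
# The handshake shape follows the MCP spec; the protocol version shown
# is one published revision -- the server may negotiate a different one.
{
  printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"smoke-test","version":"0.0.1"}}}'
  printf '%s\n' '{"jsonrpc":"2.0","method":"notifications/initialized"}'
  printf '%s\n' '{"jsonrpc":"2.0","id":2,"method":"tools/list"}'
} | npx -y k8s-gpu-mcp-server@latest
```

The tools/list response should enumerate the tools named above (get_gpu_inventory, get_gpu_health, and so on) if the server started cleanly.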

How to install

Prerequisites:

  • Node.js (version compatible with npx/npm) and npm installed on your host or CI environment
  • Internet access to fetch the MCP server package
  • Optional: Docker if you prefer containerized runs
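For the Docker option, a containerized run looks roughly like the following. The image reference here is an assumption for illustration; check the repository for a published image before using it.

```shell
# Sketch: containerized run. The image name is hypothetical -- substitute
# whatever image the repository actually publishes. --gpus all exposes the
# host GPUs so NVML can see them; -i keeps stdin open for the stdio transport.
docker run --rm -i --gpus all \
  ghcr.io/arangogutierrez/k8s-gpu-mcp-server:latest
```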

Install methods:

  1. One-line (recommended):
  • Ensure Node.js and npm are installed
  • Run: npx k8s-gpu-mcp-server@latest
  2. Global installation (alternative):
  • Install the package globally and run from anywhere: npm install -g k8s-gpu-mcp-server
  • Then start the server (the exact start command may be exposed by the package, e.g., k8s-gpu-mcp-server or a node script).
  3. Install from source (advanced):
  • Clone the repository: git clone https://github.com/ArangoGutierrez/k8s-gpu-mcp-server.git
  • Build/install as needed per repository instructions (example shown in README): cd k8s-gpu-mcp-server && make agent
  • Run the locally built agent binary (as shown in the quickstart): cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=mock

    or

    cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=real

Notes:

  • If you are deploying to Kubernetes, you can use the provided Helm charts or OCI deployments as shown in the README to run the MCP server in a cluster.
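An in-cluster deployment via Helm might look roughly like this. The OCI chart reference and namespace below are assumptions for illustration; the gpu.runtimeClass value is mentioned in the README notes, but consult the repository for the canonical chart location and values.

```shell
# Sketch: Helm-based deployment. The chart URL is hypothetical -- replace it
# with the OCI reference published in the repository's README.
helm install k8s-gpu-mcp \
  oci://ghcr.io/arangogutierrez/k8s-gpu-mcp-server \
  --namespace gpu-diagnostics --create-namespace \
  --set gpu.runtimeClass=nvidia
```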

Additional notes

Tips and caveats:

  • The MCP server is designed to be low-footprint; it keeps a persistent HTTP front-end and performs GPU work on-demand when tools are invoked.
  • GPU access in Kubernetes typically requires a RuntimeClass (e.g., nvidia) or equivalent GPU operator setup. If a RuntimeClass is not available, you may need to enable fallbacks as described in the README (disable gpu.runtimeClass and enable gpu.resourceRequest).
  • There are multiple deployment paths: one-line npx install, global npm install, or deploying via Helm charts to Kubernetes.
  • The MCP prompts (gpu-health-check, diagnose-xid-errors, gpu-triage) guide orchestration of multiple tools to produce actionable insights.
  • For Claude Desktop integration, you can configure a mcpServers entry that executes kubectl to run the agent inside a pod, enabling remote querying via Claude.
  • If you encounter issues with tool availability, ensure NVML access is functioning and that you are running a compatible GPU driver stack.
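For the Claude Desktop integration mentioned above, an mcpServers entry along these lines is one way to run the agent inside a pod via kubectl. The namespace, deployment name, and binary path are placeholders, so adjust them to match your actual deployment:

```json
{
  "mcpServers": {
    "k8s-gpu": {
      "command": "kubectl",
      "args": [
        "exec", "-i", "-n", "gpu-diagnostics",
        "deploy/k8s-gpu-mcp-server", "--",
        "/bin/agent", "--nvml-mode=real"
      ]
    }
  }
}
```

The -i flag keeps stdin open so the MCP stdio transport can flow through kubectl exec to the agent in the pod.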
