
Guidance for Scalable Model Inference and Agentic AI on Amazon EKS

A comprehensive, scalable ML inference architecture on Amazon EKS that leverages AWS Graviton processors for cost-effective CPU-based inference and GPU instances for accelerated inference. The guidance provides a complete end-to-end platform for deploying LLMs with agentic AI capabilities, including Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP).

Installation
Run this command in your terminal to add the MCP server to Claude Code.
claude mcp add --transport stdio aws-solutions-library-samples-guidance-for-scalable-model-inference-and-agentic-ai-on-amazon-eks -- docker run -i \
  --env AWS_REGION="us-east-1" \
  --env AWS_ROLE_ARN="arn:aws:iam::123456789012:role/YourEKSRole" \
  --env LOGGING_SERVICE="Langfuse" \
  --env EKS_CLUSTER_NAME="your-eks-cluster" \
  --env LANGUAGE_MODEL_PROVIDER="Bedrock" \
  guidance-for-scalable-model-inference-and-agentic-ai-on-amazon-eks

Replace the example values (region, role ARN, logging service, cluster name, and model provider) with those from your deployment. Note that the `--env` flags must come before the image name so Docker passes them into the container, and `--` separates Claude Code's own flags from the server command.

How to use

This MCP server provides a guidance-driven, scalable model inference and agentic AI platform designed to run on Amazon EKS. It orchestrates CPU-based inference on Graviton instances and GPU-accelerated inference for high-throughput workloads, using a combination of Ray Serve, LiteLLM, vLLM, and Karpenter for elastic resource provisioning. The architecture enables Retrieval-Augmented Generation (RAG), Intelligent Document Processing (IDP), and multi-agent workflows, with observability through Langfuse and Prometheus/Grafana. You can deploy end-to-end inference pipelines, expose a unified API gateway for multiple models, and route requests to the most suitable compute tier based on workload characteristics. The platform also supports embedded reasoning, document retrieval, and web search fallbacks to maintain up-to-date knowledge while delivering coherent responses.
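Because LiteLLM fronts the models, the unified API gateway speaks the OpenAI-compatible chat-completions format. A minimal sketch of such a request, where the gateway URL, model alias, and API-key variable are all placeholders for your deployment's actual values:

```shell
# Hypothetical gateway address and model alias -- substitute your deployment's values.
GATEWAY_URL="${GATEWAY_URL:-http://localhost:4000}"
MODEL_ALIAS="${MODEL_ALIAS:-llama-3-8b}"

# Build an OpenAI-style chat-completion payload.
PAYLOAD=$(cat <<EOF
{
  "model": "$MODEL_ALIAS",
  "messages": [{"role": "user", "content": "Summarize this cluster's GPU capacity."}]
}
EOF
)
echo "$PAYLOAD"

# To send it against a live gateway:
#   curl -s "$GATEWAY_URL/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer $LITELLM_API_KEY" \
#     -d "$PAYLOAD"
```

The gateway (not the client) decides whether the request lands on a Graviton or GPU tier, so the same payload works regardless of which compute serves the model alias.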

How to install

Prerequisites:

  • An AWS account with an EKS-ready environment
  • Docker installed on the deployment host
  • Access to the repository containing the MCP server configuration
  • Optional: Langfuse, Prometheus, and Grafana for observability

Installation steps:

  1. Install Docker on your machine (if not already installed):
# macOS / Windows: download Docker Desktop from https://www.docker.com/products/docker-desktop
# Linux (example for Debian-based systems)
sudo apt-get update
sudo apt-get install -y docker.io
sudo systemctl enable --now docker
  2. Pull and run the MCP server container (as defined in mcp_config):
docker run -i guidance-for-scalable-model-inference-and-agentic-ai-on-amazon-eks
  3. Set required environment variables to match your AWS/EKS setup. Example:
export AWS_REGION=us-east-1
export AWS_ROLE_ARN=arn:aws:iam::123456789012:role/YourEKSRole
export EKS_CLUSTER_NAME=your-eks-cluster
export LANGUAGE_MODEL_PROVIDER=Bedrock
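Before moving on, it can help to sanity-check those variables. A small sketch using plain shell pattern checks; the default values mirror the example above, and the regexes are illustrative rather than exhaustive:

```shell
# Example values -- replace with your own before running against a real cluster.
AWS_REGION="${AWS_REGION:-us-east-1}"
AWS_ROLE_ARN="${AWS_ROLE_ARN:-arn:aws:iam::123456789012:role/YourEKSRole}"
EKS_CLUSTER_NAME="${EKS_CLUSTER_NAME:-your-eks-cluster}"

fail=0
# Region should look like "us-east-1", "eu-west-2", etc.
echo "$AWS_REGION" | grep -Eq '^[a-z]{2}(-[a-z]+)+-[0-9]$' || { echo "bad AWS_REGION: $AWS_REGION"; fail=1; }
# Role ARN should be an IAM role ARN with a 12-digit account ID.
echo "$AWS_ROLE_ARN" | grep -Eq '^arn:aws:iam::[0-9]{12}:role/' || { echo "bad AWS_ROLE_ARN"; fail=1; }
# Cluster name must be non-empty.
[ -n "$EKS_CLUSTER_NAME" ] || { echo "EKS_CLUSTER_NAME is empty"; fail=1; }

[ "$fail" -eq 0 ] && echo "environment looks sane"
```

Catching a malformed region or ARN here is cheaper than debugging an opaque authentication failure after the container starts.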
  4. Verify the server is running and reachable at the configured endpoint. Use your API gateway (LiteLLM proxy) URL or the container's exposed port to send test requests.
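A small helper for that verification step, assuming the gateway exposes an HTTP health endpoint (the URL and `/health` path in the usage comment are placeholders; adjust them to your routing):

```shell
# Poll an endpoint until it answers or a retry budget is exhausted.
wait_for_endpoint() {
  url="$1"; tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "endpoint $url is up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "endpoint $url did not respond after $tries attempts" >&2
  return 1
}

# Example, against a hypothetical LiteLLM proxy address:
#   wait_for_endpoint "http://localhost:4000/health"
```

Model servers can take minutes to pull weights and warm up, so a polling loop like this is more reliable than a single request fired immediately after deployment.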

  5. Optional: Configure observability.

# Start Prometheus/Grafana and Langfuse as per your cluster setup
  6. Deploy any additional components (RAG, OpenSearch, vLLM workers) according to your deployment manifest or Helm charts if provided in the repository.

Additional notes

  • Ensure your AWS IAM roles and permissions allow the EKS, OpenSearch, and Ray/vLLM components to access the necessary resources.
  • If using GPU nodes, confirm NVIDIA drivers and CUDA toolkit compatibility for your container images.
  • Monitor costs across EKS, EC2 (Graviton/GPU), and OpenSearch; consider adjusting Karpenter settings for tighter autoscaling.
  • The guidance relies on a combination of multiple services (RAG, IDP, multi-agent orchestration); verify network policies and security groups permit proper communication among components.
  • If you encounter deployment issues, check the container logs for model loading errors, OpenSearch connectivity, and gateway routing configuration.
  • For environment variables, avoid committing secrets; use secure secret management in your deployment pipeline.
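For the last point, one lightweight pattern is keeping secrets in a git-ignored env file with owner-only permissions and loading it at deploy time. A sketch, where the file path, variable name, and Secret name are all illustrative:

```shell
# Keep secrets out of the repo: store them in a git-ignored env file
# and source it at deploy time rather than hard-coding values.
SECRETS_FILE="$(mktemp)"   # in practice: a path like ./deploy/.env, listed in .gitignore

cat > "$SECRETS_FILE" <<'EOF'
LITELLM_API_KEY=replace-me
EOF
chmod 600 "$SECRETS_FILE"   # owner-only read/write

# Export everything the file defines into the current shell.
set -a
. "$SECRETS_FILE"
set +a
echo "loaded key of length ${#LITELLM_API_KEY}"

# In Kubernetes, prefer a Secret object over raw env vars, e.g.:
#   kubectl create secret generic litellm-keys --from-env-file="$SECRETS_FILE"
rm -f "$SECRETS_FILE"
```

The `KEY=VALUE` file format (no `export` prefix) keeps the same file usable both for shell sourcing via `set -a` and for `kubectl create secret --from-env-file`.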
