What is an InferenceService and why use it?

An InferenceService is a KServe resource that hosts ML models on Kubernetes, enabling scalable serving and API access.

How do I test the OpenAI-compatible API after deployment?

Obtain the service URL with kubectl get inferenceservice <name> -n <namespace> -o jsonpath='{.status.url}', then send a POST request with curl to /v1/chat/completions, setting the model field to the served model name.

What are the common issues to check during troubleshooting?

Check GPU availability (torch.cuda.device_count()), confirm that the model path exists at /mnt/models/${MODEL_NAME} or /mnt/models, and, if using GGUF models, make sure the model consists of a single .gguf file.

manage-inference-service-cli

npx machina-cli add skill typhoonzero/awesome-acp-skills/manage-inference-service-cli --openclaw


Manage Inference Service CLI

Overview

This skill enables you to create, manage, test, and troubleshoot KServe InferenceService resources using the command line. It is based on the standard procedures for deploying models (such as Qwen2.5) using vLLM on a Kubernetes cluster with KServe installed.

Workflow: Creating an Inference Service

To create an InferenceService, follow these steps:

Prepare the YAML: Use the provided template in assets/qwen-2-vllm.yaml as a base. You can modify the name, namespace, storageUri, and resources as needed based on the user's request.
Apply the YAML: Run the following command to apply the configuration to the cluster:
```
kubectl apply -f <path-to-yaml> -n <namespace>
```
Verify Creation: Check if the resource was created successfully:
```
kubectl get inferenceservice <name> -n <namespace>
```
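
As a rough illustration of where those fields live, the function below prints a minimal InferenceService manifest. It is a sketch based on the KServe `v1beta1` schema, not the contents of assets/qwen-2-vllm.yaml; the function name `render_isvc`, its argument order, and the `nvidia.com/gpu` limit are assumptions made for illustration.

```shell
# Hypothetical helper: print a minimal InferenceService manifest to stdout.
# Arguments: <name> <storageUri> <model-format> [gpu-count, default 1]
# The real template in assets/qwen-2-vllm.yaml may set more fields (runtime,
# env, args); this only shows where name, storageUri, and resources go.
render_isvc() {
  local name="$1" storage_uri="$2" model_format="$3" gpus="${4:-1}"
  cat <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ${name}
spec:
  predictor:
    model:
      modelFormat:
        name: ${model_format}
      storageUri: ${storage_uri}
      resources:
        limits:
          nvidia.com/gpu: "${gpus}"
EOF
}
```

The output can be piped straight into kubectl, for example `render_isvc <name> <storageUri> <model-format> | kubectl apply -n <namespace> -f -`. The namespace is left to the `-n` flag so the same manifest can be applied to different namespaces.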

Workflow: Checking Status and Troubleshooting

To check the status of an InferenceService or troubleshoot issues:

Check the READY status:
```
kubectl get inferenceservice <name> -n <namespace>
```
Wait until the READY column shows True.
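
Rather than re-running the command by hand, one option is to block until the resource reports Ready. The sketch below assumes the InferenceService exposes a `Ready` status condition (the one behind the READY column) and relies on `kubectl wait`; the function name `wait_for_isvc` and the 300-second default timeout are illustrative choices.

```shell
# Hypothetical helper: block until the InferenceService is Ready or time out.
# Arguments: <name> <namespace> [timeout, default 300s]
wait_for_isvc() {
  local name="$1" namespace="$2" timeout="${3:-300s}"
  if kubectl wait --for=condition=Ready "inferenceservice/${name}" \
       -n "$namespace" --timeout="$timeout"; then
    echo "${name} is ready"
  else
    echo "${name} did not become ready within ${timeout}" >&2
    return 1
  fi
}
```

A non-zero return value is a signal to move on to the troubleshooting steps rather than to keep waiting.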

Troubleshoot Pending/Failing Services: If the service is not ready, check the pod logs and events:

```
# Get the pods for the InferenceService
kubectl get pods -n <namespace> -l serving.kserve.io/inferenceservice=<name>

# Check the logs of the predictor container
kubectl logs -n <namespace> -l serving.kserve.io/inferenceservice=<name> -c kserve-container

# Describe the InferenceService for events
kubectl describe inferenceservice <name> -n <namespace>
```

Common Issues to Look For in Logs:
- GPU Count: The startup script checks torch.cuda.device_count(). If it outputs "No GPUs found", ensure the container has acquired GPU devices (check resource limits/requests).
- Model Path: The script looks for the model in /mnt/models/${MODEL_NAME} or /mnt/models. If neither exists or the model failed to download, the storage initializer might have failed.
- GGUF Models: If using GGUF models, vLLM only supports single-file GGUF models. The script will exit with an error if multiple .gguf files are found.
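
The GPU and model-path checks above can be gathered in one pass. This is a minimal sketch, not part of the skill itself: `diagnose_isvc` is a hypothetical name, the init container is assumed to carry KServe's default name `storage-initializer`, and only the "No GPUs found" message quoted above is matched, because the exact wording of the other errors is not given here.

```shell
# Hypothetical helper: collect evidence for the GPU and model-path checks.
# Arguments: <name> <namespace>
diagnose_isvc() {
  local name="$1" namespace="$2"
  local selector="serving.kserve.io/inferenceservice=${name}"

  echo "== Resource limits per pod =="
  kubectl get pods -n "$namespace" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits}{"\n"}{end}'

  echo "== Storage initializer logs (model download) =="
  # Assumes KServe's default init container name, storage-initializer.
  kubectl logs -n "$namespace" -l "$selector" -c storage-initializer --tail=50

  echo "== GPU check in predictor logs =="
  if kubectl logs -n "$namespace" -l "$selector" -c kserve-container --tail=200 \
       | grep -q "No GPUs found"; then
    echo "Startup script reported 'No GPUs found': check the GPU limits/requests."
  else
    echo "No 'No GPUs found' message in the last 200 log lines."
  fi
}
```

If a pod has no `storage-initializer` container, the second command prints an error and the function carries on with the remaining checks.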

Workflow: Testing the Inference Service

Once the InferenceService is READY (True), you can test it using the OpenAI-compatible API.

Get the Service URL:

```
SERVICE_URL=$(kubectl get inferenceservice <name> -n <namespace> -o jsonpath='{.status.url}')
echo "$SERVICE_URL"
```

Send a Test Request: Use curl to send a request to the /v1/chat/completions endpoint. Ensure the model parameter matches the --served-model-name configured in the InferenceService (usually <name> or <namespace>/<name>).

```
curl -X POST "${SERVICE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<name>",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Kubernetes?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
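
If the request is rejected because the model name is not recognized, it can help to ask the server which names it serves. The sketch below assumes the vLLM OpenAI-compatible server also exposes `GET /v1/models` next to the chat endpoint; `list_served_models` is a hypothetical helper name.

```shell
# Hypothetical helper: ask the endpoint which model names it serves.
# Argument: <service-url>, for example the value of SERVICE_URL.
list_served_models() {
  local base="${1%/}"   # drop a trailing slash, if any
  curl -sS --fail "${base}/v1/models"
}
```

Calling `list_served_models "$SERVICE_URL"` should return a JSON list whose `id` fields are the values accepted in the `model` parameter.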

Resources

assets/

qwen-2-vllm.yaml: A complete, working example of an InferenceService YAML for deploying Qwen2.5-0.5B-Instruct using vLLM. Use this as a template when generating configurations for users.

Source

https://github.com/typhoonzero/awesome-acp-skills/blob/master/manage-inference-service-cli/SKILL.md

Overview

This skill enables you to create, manage, test, and troubleshoot KServe InferenceService resources using the command line. It centers on deploying models (such as Qwen2.5) via vLLM on a Kubernetes cluster with KServe installed, including status checks and OpenAI-compatible API testing.

How This Skill Works

You prepare a deployment YAML (based on assets/qwen-2-vllm.yaml), apply it with kubectl, and monitor the InferenceService status until READY is True. Once ready, you retrieve the service URL and test the OpenAI-compatible API with curl against /v1/chat/completions, while using logs and describe commands to troubleshoot issues like GPU availability or model-path problems.

When to Use It

Create a new InferenceService from a template (e.g., Qwen2.5 via vLLM).
Update an existing InferenceService's storageUri or resources (changing the name or namespace creates a new resource rather than updating the existing one).
Check READY status or troubleshoot non-ready services using pod logs and events.
Test the OpenAI-compatible API once the service is READY.
Validate deployments across namespaces and diagnose common deployment issues.
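
For the cross-namespace case, a single listing filtered on the Ready condition is often enough to spot problem services. The following is a sketch under the assumption that each InferenceService reports a `Ready` entry in `.status.conditions`; the name `list_not_ready_isvcs` is made up for this example.

```shell
# Hypothetical helper: print namespace, name, and Ready status for every
# InferenceService whose Ready condition is not "True".
list_not_ready_isvcs() {
  kubectl get inferenceservice -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | awk -F '\t' 'NF && $3 != "True" { print $1 "\t" $2 "\t" ($3 == "" ? "Unknown" : $3) }'
}
```

Services that have not reported a Ready condition yet are shown as `Unknown`, so freshly created resources are not mistaken for healthy ones.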

Quick Start

Step 1: Prepare the YAML from assets/qwen-2-vllm.yaml, updating name, namespace, storageUri, and resources as needed.
Step 2: kubectl apply -f <path-to-yaml> -n <namespace> to create the InferenceService.
Step 3: kubectl get inferenceservice <name> -n <namespace> to verify READY is True.

Best Practices

Start with the provided assets/qwen-2-vllm.yaml and customize name, namespace, storageUri, and resources as needed.
Apply configuration with kubectl apply -f <path-to-yaml> -n <namespace> and verify with kubectl get inferenceservice.
Wait for READY to be True before testing; use kubectl describe and kubectl logs to diagnose issues.
Review logs for GPU availability, correct model path (/mnt/models/${MODEL_NAME} or /mnt/models), and GGUF model constraints.
Test the deployed service using the OpenAI-compatible endpoint to confirm end-to-end functionality.

Example Use Cases

Deploy Qwen2.5-0.5B-Instruct via vLLM in a dedicated namespace and monitor readiness.
Update the InferenceService to point to a new storageUri or model version without recreating the service.
Troubleshoot a non-READY service by inspecting pod logs, listing pods with the InferenceService label, and describing the service for events.
Test the endpoint with a curl POST to /v1/chat/completions and verify the returned response uses the correct served model name.
Diagnose GPU or model-path issues by confirming container has access to GPUs and that the model exists at /mnt/models.
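
For the storageUri update case, one possible approach is a merge patch instead of editing and re-applying the YAML. This sketch assumes the manifest keeps the model under `spec.predictor.model`, which may not match assets/qwen-2-vllm.yaml; `set_isvc_storage_uri` is a hypothetical helper name.

```shell
# Hypothetical helper: point an existing InferenceService at a new storageUri.
# Arguments: <name> <namespace> <new-storageUri>
# Assumes the model is defined under spec.predictor.model; adjust the patch
# path if your template uses a different layout.
set_isvc_storage_uri() {
  local name="$1" namespace="$2" uri="$3"
  kubectl patch inferenceservice "$name" -n "$namespace" --type merge \
    -p "{\"spec\":{\"predictor\":{\"model\":{\"storageUri\":\"${uri}\"}}}}"
}
```

After patching, the predictor pods are expected to restart with the new model, so wait for READY to return to True before testing again.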
