manage-inference-service-cli
npx machina-cli add skill typhoonzero/awesome-acp-skills/manage-inference-service-cli --openclaw
Manage Inference Service CLI
Overview
This skill enables you to create, manage, test, and troubleshoot KServe InferenceService resources using the command line. It is based on the standard procedures for deploying models (such as Qwen2.5) using vLLM on a Kubernetes cluster with KServe installed.
Workflow: Creating an Inference Service
To create an InferenceService, follow these steps:
- Prepare the YAML: Use the provided template in assets/qwen-2-vllm.yaml as a base. You can modify the name, namespace, storageUri, and resources as needed based on the user's request.
- Apply the YAML: Run the following command to apply the configuration to the cluster:
    kubectl apply -f <path-to-yaml> -n <namespace>
- Verify Creation: Check if the resource was created successfully:
    kubectl get inferenceservice <name> -n <namespace>
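The apply and verify steps above can be wrapped in a small shell function. This is a minimal sketch, not part of the skill itself: the function name `deploy_isvc` and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper (not part of the skill): apply an InferenceService
# manifest, then confirm that the resource exists.
deploy_isvc() {
  if [ "$#" -ne 3 ]; then
    echo "usage: deploy_isvc <path-to-yaml> <name> <namespace>" >&2
    return 2
  fi
  yaml_path=$1
  name=$2
  namespace=$3

  # Apply the manifest to the target namespace; stop if the apply fails.
  kubectl apply -f "$yaml_path" -n "$namespace" || return 1

  # Verify that the resource was created.
  kubectl get inferenceservice "$name" -n "$namespace"
}
```

After sourcing the function, `deploy_isvc assets/qwen-2-vllm.yaml <name> <namespace>` runs both commands in order and stops early if the apply fails.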
Workflow: Checking Status and Troubleshooting
To check the status of an InferenceService or troubleshoot issues:
- Check the READY status:
    kubectl get inferenceservice <name> -n <namespace>
  Wait until the READY column shows True.
- Troubleshoot Pending/Failing Services: If the service is not ready, check the pod logs and events:
    # Get the pods for the InferenceService
    kubectl get pods -n <namespace> -l serving.kserve.io/inferenceservice=<name>
    # Check the logs of the predictor container
    kubectl logs -n <namespace> -l serving.kserve.io/inferenceservice=<name> -c kserve-container
    # Describe the InferenceService for events
    kubectl describe inferenceservice <name> -n <namespace>
- Common Issues to Look For in Logs:
  - GPU Count: The startup script checks torch.cuda.device_count(). If it outputs "No GPUs found", ensure the container has acquired GPU devices (check resource limits/requests).
  - Model Path: The script looks for the model in /mnt/models/${MODEL_NAME} or /mnt/models. If neither exists or the model failed to download, the storage initializer might have failed.
  - GGUF Models: If using GGUF models, vLLM only supports single-file GGUF models. The script will exit with an error if multiple .gguf files are found.
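The model path and GGUF checks described above can be approximated locally. The sketch below is a hypothetical re-creation based only on that description; the real startup script may differ, and the function name `resolve_model_dir` and its messages are assumptions.

```shell
# Hypothetical re-creation of the model path and GGUF checks described
# above. This is not the skill's startup script.
resolve_model_dir() {
  root=$1        # model root, for example /mnt/models
  model_name=$2  # value of MODEL_NAME; may be empty

  # Prefer <root>/<model_name>, then fall back to <root>.
  if [ -n "$model_name" ] && [ -d "$root/$model_name" ]; then
    model_dir=$root/$model_name
  elif [ -d "$root" ]; then
    model_dir=$root
  else
    echo "model path not found under $root" >&2
    return 1
  fi

  # Count .gguf files directly inside the chosen directory.
  gguf_count=$(find "$model_dir" -maxdepth 1 -type f -name '*.gguf' | wc -l | tr -d ' ')
  if [ "$gguf_count" -gt 1 ]; then
    echo "found $gguf_count .gguf files in $model_dir; expected at most one" >&2
    return 1
  fi

  echo "$model_dir"
}
```

The function works on any directory, so it can be run against a local copy of the model or inside the predictor pod (for example through `kubectl exec`) to see which of the two failure modes applies.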
Workflow: Testing the Inference Service
Once the InferenceService is READY (True), you can test it using the OpenAI-compatible API.
- Get the Service URL:
    SERVICE_URL=$(kubectl get inferenceservice <name> -n <namespace> -o jsonpath='{.status.url}')
    echo "$SERVICE_URL"
- Send a Test Request: Use curl to send a request to the /v1/chat/completions endpoint. Ensure the model parameter matches the --served-model-name configured in the InferenceService (usually <name> or <namespace>/<name>).
    curl -X POST "${SERVICE_URL}/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "<name>",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is Kubernetes?"}
        ],
        "max_tokens": 50,
        "temperature": 0.7
      }'
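The two testing steps can be combined into one function. This is a rough sketch under stated assumptions: the name `chat_test` is hypothetical, and it assumes the served model name equals the InferenceService name, which may not hold for every deployment.

```shell
# Hypothetical helper: send one chat completion request to a ready service.
# Assumes the served model name equals the InferenceService name.
chat_test() {
  name=$1
  namespace=$2
  prompt=${3:-What is Kubernetes?}

  # Read the service URL from the InferenceService status.
  service_url=$(kubectl get inferenceservice "$name" -n "$namespace" \
    -o jsonpath='{.status.url}')
  if [ -z "$service_url" ]; then
    echo "no URL in status; is $name ready?" >&2
    return 1
  fi

  # Build the request body. The prompt is inserted verbatim, so in this
  # simple sketch it must not contain double quotes or backslashes.
  payload=$(printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 50}' \
    "$name" "$prompt")

  curl -sS -X POST "$service_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$payload"
}
```

Calling `chat_test <name> <namespace>` sends the default prompt. If the service was configured with a different --served-model-name, the "model" field in the payload needs to be changed to match.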
Resources
assets/
qwen-2-vllm.yaml: A complete, working example of an InferenceService YAML for deploying Qwen2.5-0.5B-Instruct using vLLM. Use this as a template when generating configurations for users.
Source
https://github.com/typhoonzero/awesome-acp-skills/blob/master/manage-inference-service-cli/SKILL.md
Overview
This skill enables you to create, manage, test, and troubleshoot KServe InferenceService resources using the command line. It centers on deploying models (such as Qwen2.5) via vLLM on a Kubernetes cluster with KServe installed, including status checks and OpenAI-compatible API testing.
How This Skill Works
You prepare a deployment YAML (based on assets/qwen-2-vllm.yaml), apply it with kubectl, and monitor the InferenceService status until READY is True. Once ready, you retrieve the service URL and test the OpenAI-compatible API with curl against /v1/chat/completions, while using logs and describe commands to troubleshoot issues like GPU availability or model-path problems.
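Monitoring until READY is True can be scripted as a polling loop. The sketch below is one possible approach, not the skill's own tooling: the function name `wait_for_isvc_ready` and the default timeout and interval are assumptions.

```shell
# Hypothetical helper: poll the Ready condition of an InferenceService
# until it is True or the timeout (in seconds) is reached.
wait_for_isvc_ready() {
  name=$1
  namespace=$2
  timeout=${3:-300}
  interval=${4:-5}
  elapsed=0

  while :; do
    # Read the status of the Ready condition; kubectl errors are hidden,
    # so a missing resource is treated the same as "not ready yet".
    status=$(kubectl get inferenceservice "$name" -n "$namespace" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)

    if [ "$status" = "True" ]; then
      echo "InferenceService $name is ready"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s waiting for $name (last status: ${status:-unknown})" >&2
      return 1
    fi

    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

A non-zero return is the cue to move on to the logs and describe commands. The built-in `kubectl wait --for=condition=Ready inferenceservice/<name> -n <namespace> --timeout=300s` should serve the same purpose when no custom message is needed.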
When to Use It
- Create a new InferenceService from a template (e.g., Qwen2.5 via vLLM).
- Update an existing InferenceService's storageUri or resources (changing the name or namespace creates a new resource rather than updating the existing one).
- Check READY status or troubleshoot non-ready services using pod logs and events.
- Test the OpenAI-compatible API once the service is READY.
- Validate deployments across namespaces and diagnose common deployment issues.
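For the cross-namespace case, one way to spot services that need attention is to list every InferenceService whose Ready condition is not True. This is a sketch only: the function name `list_not_ready_isvc` and the output format are assumptions.

```shell
# Hypothetical helper: print every InferenceService, in any namespace,
# whose Ready condition is not "True".
list_not_ready_isvc() {
  # custom-columns prints exactly three columns per resource and shows
  # <none> when the Ready condition has not been set yet.
  kubectl get inferenceservice --all-namespaces --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status' |
    awk '$3 != "True" { print $1 "/" $2 " (ready=" $3 ")" }'
}
```

Each output line has the form `<namespace>/<name> (ready=<status>)`, which can be fed into the per-service troubleshooting commands.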
Quick Start
- Step 1: Prepare the YAML from assets/qwen-2-vllm.yaml, updating name, namespace, storageUri, and resources as needed.
- Step 2: Run kubectl apply -f <path-to-yaml> -n <namespace> to create the InferenceService.
- Step 3: Run kubectl get inferenceservice <name> -n <namespace> and confirm that READY is True.
Best Practices
- Start with the provided assets/qwen-2-vllm.yaml and customize name, namespace, storageUri, and resources as needed.
- Apply configuration with kubectl apply -f <path-to-yaml> -n <namespace> and verify with kubectl get inferenceservice.
- Wait for READY to be True before testing; use kubectl describe and kubectl logs to diagnose issues.
- Review logs for GPU availability, correct model path (/mnt/models/${MODEL_NAME} or /mnt/models), and GGUF model constraints.
- Test the deployed service using the OpenAI-compatible endpoint to confirm end-to-end functionality.
Example Use Cases
- Deploy Qwen2.5-0.5B-Instruct via vLLM in a dedicated namespace and monitor readiness.
- Update the InferenceService to point to a new storageUri or model version without recreating the service.
- Troubleshoot a non-READY service by inspecting pod logs, listing pods with the InferenceService label, and describing the service for events.
- Test the endpoint with a curl POST to /v1/chat/completions and verify the returned response uses the correct served model name.
- Diagnose GPU or model-path issues by confirming that the container has access to GPUs and that the model exists under /mnt/models.
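The GPU diagnosis can be partly automated by scanning the predictor logs for the "No GPUs found" message mentioned in the troubleshooting notes. This is a minimal sketch: the function name `check_gpu_log` and its output text are assumptions, and it matches only that one message.

```shell
# Hypothetical helper: read predictor logs on stdin and flag the
# "No GPUs found" message printed when no GPU is visible.
check_gpu_log() {
  if grep -q 'No GPUs found'; then
    echo "GPU check failed: review the GPU resource requests and limits"
    return 1
  fi
  echo "no GPU error found in the logs"
}
```

A typical call pipes the logs in: `kubectl logs -n <namespace> -l serving.kserve.io/inferenceservice=<name> -c kserve-container --tail=-1 | check_gpu_log`. The `--tail=-1` flag matters here because `kubectl logs` shows only the last 10 lines by default when a label selector is used.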