
manage-inference-service-cli

npx machina-cli add skill typhoonzero/awesome-acp-skills/manage-inference-service-cli --openclaw

Manage Inference Service CLI

Overview

This skill enables you to create, manage, test, and troubleshoot KServe InferenceService resources using the command line. It is based on the standard procedures for deploying models (such as Qwen2.5) using vLLM on a Kubernetes cluster with KServe installed.

Workflow: Creating an Inference Service

To create an InferenceService, follow these steps:

  1. Prepare the YAML: Use the provided template in assets/qwen-2-vllm.yaml as a base. You can modify the name, namespace, storageUri, and resources as needed based on the user's request.
  2. Apply the YAML: Run the following command to apply the configuration to the cluster:
    kubectl apply -f <path-to-yaml> -n <namespace>
    
  3. Verify Creation: Check if the resource was created successfully:
    kubectl get inferenceservice <name> -n <namespace>
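Put together, the two commands can be sketched as a small script. The service name, namespace, and YAML path below are hypothetical placeholders, and the script is guarded so it prints the commands instead of failing when no cluster is reachable:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values -- substitute your own service name, namespace, and YAML path
NAME=qwen-demo
NAMESPACE=kserve-test
YAML=assets/qwen-2-vllm.yaml

if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  # Apply the InferenceService manifest and confirm the resource exists
  kubectl apply -f "$YAML" -n "$NAMESPACE"
  kubectl get inferenceservice "$NAME" -n "$NAMESPACE"
else
  # No cluster access here: show the commands that would run
  echo "kubectl apply -f $YAML -n $NAMESPACE"
  echo "kubectl get inferenceservice $NAME -n $NAMESPACE"
fi
```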
    

Workflow: Checking Status and Troubleshooting

To check the status of an InferenceService or troubleshoot issues:

  1. Check the READY status:

    kubectl get inferenceservice <name> -n <namespace>
    

    Wait until the READY column shows True.

  2. Troubleshoot Pending/Failing Services: If the service is not ready, check the pod logs and events:

    # Get the pods for the InferenceService
    kubectl get pods -n <namespace> -l serving.kserve.io/inferenceservice=<name>
    
    # Check the logs of the predictor container
    kubectl logs -n <namespace> -l serving.kserve.io/inferenceservice=<name> -c kserve-container
    
    # Describe the InferenceService for events
    kubectl describe inferenceservice <name> -n <namespace>
    
  3. Common Issues to Look For in Logs:

    • GPU Count: The startup script checks torch.cuda.device_count(). If it outputs "No GPUs found", ensure the container has acquired GPU devices (check resource limits/requests).
    • Model Path: The script looks for the model in /mnt/models/${MODEL_NAME} or /mnt/models. If neither exists or the model failed to download, the storage initializer might have failed.
    • GGUF Models: If using GGUF models, vLLM only supports single-file GGUF models. The script will exit with an error if multiple .gguf files are found.
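For convenience, the three checks above can be wrapped in one helper script. The label selector is KServe's standard serving.kserve.io/inferenceservice label from the commands above; the name and namespace are hypothetical, and the script is a no-op without cluster access:

```shell
#!/usr/bin/env bash
set -euo pipefail

NAME=qwen-demo          # hypothetical InferenceService name
NAMESPACE=kserve-test   # hypothetical namespace
SELECTOR="serving.kserve.io/inferenceservice=$NAME"

if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  echo "--- Pods ---"
  kubectl get pods -n "$NAMESPACE" -l "$SELECTOR"
  echo "--- Predictor logs (last 50 lines) ---"
  kubectl logs -n "$NAMESPACE" -l "$SELECTOR" -c kserve-container --tail=50
  echo "--- Events ---"
  kubectl describe inferenceservice "$NAME" -n "$NAMESPACE"
else
  echo "kubectl not found or no cluster reachable; run this with cluster access"
fi
```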

Workflow: Testing the Inference Service

Once the InferenceService is READY (True), you can test it using the OpenAI-compatible API.

  1. Get the Service URL:

    SERVICE_URL=$(kubectl get inferenceservice <name> -n <namespace> -o jsonpath='{.status.url}')
    echo $SERVICE_URL
    
  2. Send a Test Request: Use curl to send a request to the /v1/chat/completions endpoint. Ensure the model parameter matches the --served-model-name configured in the InferenceService (usually <name> or <namespace>/<name>).

    curl -X POST ${SERVICE_URL}/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "<name>",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is Kubernetes?"}
        ],
        "max_tokens": 50,
        "temperature": 0.7
      }'
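A successful response carries the reply in choices[0].message.content. Below is a sketch of extracting it from a captured response; the response body shown is hypothetical and abridged, and python3 stands in for jq in case jq is not installed:

```shell
# Hypothetical, abridged response body from /v1/chat/completions
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Kubernetes is a container orchestration platform."}}]}'

# Pull out the assistant's reply; in practice, pipe the curl output into this
CONTENT=$(printf '%s' "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')
echo "$CONTENT"
```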
    

Resources

assets/

  • qwen-2-vllm.yaml: A complete, working example of an InferenceService YAML for deploying Qwen2.5-0.5B-Instruct using vLLM. Use this as a template when generating configurations for users.

Source

git clone https://github.com/typhoonzero/awesome-acp-skills.git (the skill lives at manage-inference-service-cli/SKILL.md in that repository).


How This Skill Works

You prepare a deployment YAML (based on assets/qwen-2-vllm.yaml), apply it with kubectl, and monitor the InferenceService status until READY is True. Once ready, you retrieve the service URL and test the OpenAI-compatible API with curl against /v1/chat/completions, while using logs and describe commands to troubleshoot issues like GPU availability or model-path problems.
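Rather than polling kubectl get manually, readiness can be awaited in one call; kubectl wait should work here because the InferenceService exposes a standard Ready condition. The names below are hypothetical, and the sketch is guarded so it is a no-op without cluster access:

```shell
NAME=qwen-demo          # hypothetical service name
NAMESPACE=kserve-test   # hypothetical namespace

# Guard so the sketch does nothing when no cluster is reachable
if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  # Block until READY is True, or give up after 10 minutes
  kubectl wait --for=condition=Ready "inferenceservice/$NAME" \
    -n "$NAMESPACE" --timeout=600s
fi
```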

When to Use It

  • Create a new InferenceService from a template (e.g., Qwen2.5 via vLLM).
  • Update an existing InferenceService's name, namespace, storageUri, or resources.
  • Check READY status or troubleshoot non-ready services using pod logs and events.
  • Test the OpenAI-compatible API once the service is READY.
  • Validate deployments across namespaces and diagnose common deployment issues.

Quick Start

  1. Prepare the YAML from assets/qwen-2-vllm.yaml, updating name, namespace, storageUri, and resources as needed.
  2. Run kubectl apply -f <path-to-yaml> -n <namespace> to create the InferenceService.
  3. Run kubectl get inferenceservice <name> -n <namespace> to verify READY is True.
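The quick start can be condensed into a single sketch. Names and paths are hypothetical placeholders, the kubectl wait step replaces manual polling of the READY column, and the guard keeps the script harmless without cluster access:

```shell
#!/usr/bin/env bash
set -euo pipefail

NAME=qwen-demo            # hypothetical
NAMESPACE=kserve-test     # hypothetical
YAML=assets/qwen-2-vllm.yaml

if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  kubectl apply -f "$YAML" -n "$NAMESPACE"
  kubectl wait --for=condition=Ready "inferenceservice/$NAME" \
    -n "$NAMESPACE" --timeout=600s
  SERVICE_URL=$(kubectl get inferenceservice "$NAME" -n "$NAMESPACE" \
    -o jsonpath='{.status.url}')
  echo "Ready at: $SERVICE_URL"
else
  echo "kubectl not found or no cluster reachable; run this with cluster access"
fi
```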

Best Practices

  • Start with the provided assets/qwen-2-vllm.yaml and customize name, namespace, storageUri, and resources as needed.
  • Apply configuration with kubectl apply -f <path-to-yaml> -n <namespace> and verify with kubectl get inferenceservice.
  • Wait for READY to be True before testing; use kubectl describe and kubectl logs to diagnose issues.
  • Review logs for GPU availability, correct model path (/mnt/models/${MODEL_NAME} or /mnt/models), and GGUF model constraints.
  • Test the deployed service using the OpenAI-compatible endpoint to confirm end-to-end functionality.

Example Use Cases

  • Deploy Qwen2.5-0.5B-Instruct via vLLM in a dedicated namespace and monitor readiness.
  • Update the InferenceService to point to a new storageUri or model version without recreating the service.
  • Troubleshoot a non-READY service by inspecting pod logs, listing pods with the InferenceService label, and describing the service for events.
  • Test the endpoint with a curl POST to /v1/chat/completions and verify the returned response uses the correct served model name.
  • Diagnose GPU or model-path issues by confirming the container has access to GPUs and that the model exists at /mnt/models.
