# LLM Fine-Tuning Workbench

Install this skill:

    npx machina-cli add skill typhoonzero/awesome-acp-skills/llm-fine-tuning-workbench --openclaw

## Overview

This skill enables fine-tuning of Large Language Models (LLMs) using Alauda AI Workbench. It provides workflows for preparing models and datasets, building the training runtime image, submitting and managing VolcanoJob tasks, and troubleshooting common issues during the fine-tuning process.
## Workflow Decision Tree

To fine-tune an LLM using Alauda AI Workbench, follow these steps:

- Prepare the Environment: Create a Notebook/VSCode instance.
- Prepare the Model: Download the base model and upload it to the model repository. Create an empty model for the output.
- Prepare the Dataset: Download the dataset and push it to the dataset repository.
- Prepare the Runtime Image: Build the training runtime image using the provided Containerfile.
- Create and Submit the Task: Configure and submit a VolcanoJob YAML file.
- Manage the Task: Monitor the task status, view logs, and troubleshoot issues.
- Experiment Tracking: Use MLFlow to track and compare experiments.
- Launch Inference Service: Publish the fine-tuned model as an inference service.
## Step 1: Prepare the Environment

Create a Notebook/VSCode instance in Alauda AI Workbench. It is recommended to request only CPU resources for the Notebook; the actual fine-tuning task is submitted to the cluster and requests GPU resources there.
## Step 2: Prepare the Model

- Download the base model (e.g., `Qwen/Qwen3-0.6B`) from Huggingface or another source.
- Upload the model to the Alauda AI model repository.
- Create an empty model in the model repository to store the fine-tuned output model. Note its Git repository URL.
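The download-and-upload flow above can be sketched as follows. The model repository URL is a hypothetical placeholder (use the Git URL shown in your Alauda AI model repository), and the commands are printed rather than executed:

```shell
# Assemble the model preparation commands. MODEL_REPO_URL is a hypothetical
# placeholder; substitute the Git URL from your model repository.
MODEL_ID="Qwen/Qwen3-0.6B"
MODEL_DIR=$(basename "$MODEL_ID")
MODEL_REPO_URL="https://example.com/models/qwen3-0.6b.git"

# Dry run: print the commands instead of executing them here,
# since they need network access and repository credentials.
cat <<EOF
huggingface-cli download $MODEL_ID --local-dir ./$MODEL_DIR
git clone $MODEL_REPO_URL base-model
cp -r ./$MODEL_DIR/. base-model/
cd base-model && git lfs install && git lfs track "*.safetensors"
git add . && git commit -m "add $MODEL_DIR base model" && git push
EOF
```

Large weight files should be tracked by git lfs before committing, as sketched above; pushing multi-GB files as plain Git objects will usually be rejected by the server.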
## Step 3: Prepare the Dataset

- Create an empty dataset repository in Alauda AI.
- Upload your dataset (e.g., `identity-alauda`) to the Notebook, unzip it, and push it to the dataset repository using `git lfs`.
- Ensure the dataset format is compatible with the fine-tuning framework (e.g., Huggingface `datasets` or LLaMA-Factory format).
## Step 4: Prepare the Runtime Image

Use the provided assets/Containerfile to build the training runtime image. This image includes necessary dependencies like git-lfs, LLaMA-Factory, transformers, and mlflow.

To build and push the image:

    docker build -t <your-registry>/fine_tune_with_llamafactory:v0.1.0 -f assets/Containerfile .
    docker push <your-registry>/fine_tune_with_llamafactory:v0.1.0
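For orientation, a minimal Containerfile for such an image might look like the sketch below. The base image tag and package names are assumptions; the provided assets/Containerfile is authoritative.

```dockerfile
# Hypothetical runtime image sketch; see assets/Containerfile for the real one.
FROM pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime

# git and git-lfs are required to pull the base model/dataset repos and to
# push the fine-tuned weights back.
RUN apt-get update && apt-get install -y git git-lfs && git lfs install

# Fine-tuning framework plus experiment tracking.
RUN pip install --no-cache-dir llamafactory transformers mlflow
```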
## Step 5: Create and Submit the Task

Use the provided assets/vcjob-sft.yaml template to create a VolcanoJob task.

Before submitting, modify the following settings in the YAML file:

- `BASE_MODEL_URL`: Git URL of the base model.
- `DATASET_URL`: Git URL of the dataset.
- `OUTPUT_MODEL_URL`: Git URL of the empty output model.
- `MLFLOW_TRACKING_URI`: URL of your MLFlow tracking server.
- `MLFLOW_EXPERIMENT_NAME`: Name of your MLFlow experiment.
- Resource requests/limits (CPU, Memory, GPU).
- Storage configurations (e.g., the `models-cache` PVC).
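As an abridged sketch of how these settings fit into a VolcanoJob (field values below are placeholders, and the layout of the actual assets/vcjob-sft.yaml may differ):

```yaml
# Abridged VolcanoJob sketch; consult assets/vcjob-sft.yaml for the real layout.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sft-qwen3
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - name: trainer
      replicas: 1
      template:
        spec:
          containers:
            - name: trainer
              image: <your-registry>/fine_tune_with_llamafactory:v0.1.0
              env:
                - name: BASE_MODEL_URL
                  value: "<git-url-of-base-model>"
                - name: DATASET_URL
                  value: "<git-url-of-dataset>"
                - name: OUTPUT_MODEL_URL
                  value: "<git-url-of-empty-output-model>"
                - name: MLFLOW_TRACKING_URI
                  value: "<mlflow-tracking-url>"
                - name: MLFLOW_EXPERIMENT_NAME
                  value: "qwen3-sft"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - name: models-cache
                  mountPath: /models
          volumes:
            - name: models-cache
              persistentVolumeClaim:
                claimName: models-cache
          restartPolicy: Never
```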
Submit the task:

    kubectl create -f assets/vcjob-sft.yaml
## Step 6: Manage the Task and Troubleshoot

Use the following commands to manage the task:

- View task list: `kubectl get vcjob`
- View task status: `kubectl get vcjob <task-name>`
- View pod status: `kubectl get pod`
- View task logs: `kubectl logs <pod-name>`
- Delete task: `kubectl delete vcjob <task-name>`
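A simple way to monitor a long-running job is to poll its phase. The task name below is a hypothetical placeholder, and the command is only printed here since it requires cluster access:

```shell
# Poll-loop sketch for a VolcanoJob (dry run; requires a cluster to execute).
TASK_NAME="sft-qwen3"   # hypothetical task name
PHASE_CMD="kubectl get vcjob $TASK_NAME -o jsonpath={.status.state.phase}"

# Re-check the job phase every 30 seconds.
echo "watch -n 30 \"$PHASE_CMD\""
```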
### Troubleshooting Common Issues

- Pod is not created:
  - Run `kubectl describe vcjob <task-name>` or `kubectl get podgroups`.
  - Check the Volcano scheduler logs for resource insufficiency or PVC mounting issues.
- NFS PVC mounting issues:
  - Ensure `nfs-utils` is installed on all K8s nodes.
  - Ensure the NFS StorageClass has `mountPermissions: "0757"`.
- Non-Nvidia GPU scheduling:
  - Ensure the vendor GPU driver and Kubernetes device plugin are deployed.
  - Modify the resource requests in the YAML file to request the specific vendor GPU (e.g., `huawei.com/Ascend910: 1`).
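For example, a container resources stanza for an Ascend device might look like the sketch below; the exact resource name depends on the vendor's device plugin:

```yaml
# Vendor GPU resource sketch; resource names vary by device plugin.
resources:
  requests:
    huawei.com/Ascend910: 1
  limits:
    huawei.com/Ascend910: 1
```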
## Step 7: Experiment Tracking

If `report_to: mlflow` is set in the LLaMA-Factory configuration, training metrics will be automatically sent to the MLFlow server. View and compare experiments in the Alauda AI MLFlow interface.
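As a sketch of where `report_to` sits in a LLaMA-Factory SFT config, assuming paths, dataset name, and hyperparameters that are illustrative placeholders rather than values from this skill:

```yaml
# Minimal LLaMA-Factory SFT config sketch; values are illustrative only.
model_name_or_path: /models/Qwen3-0.6B
stage: sft
do_train: true
finetuning_type: lora
dataset: identity_alauda   # must be registered in LLaMA-Factory's dataset_info.json
template: qwen
output_dir: /models/output
num_train_epochs: 3.0
per_device_train_batch_size: 1
learning_rate: 1.0e-4
report_to: mlflow          # enables automatic metric logging to MLFlow
```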
## Step 8: Launch Inference Service

Once the fine-tuning task completes and the model is pushed to the repository:

- Go to the model repository and edit the metadata (Task Type: Text Classification, Framework: Transformers).
- Click Publish Inference API -> Custom Publishing.
- Select the vLLM inference runtime and configure resources.
- Click Publish and wait for the service to start.
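Since vLLM exposes an OpenAI-compatible API, a first smoke test of the published service can be a chat-completions request. The service URL and model name below are hypothetical placeholders, and the block only prints the request rather than sending it:

```shell
# Build a smoke-test request for the published service (placeholders only).
SERVICE_URL="http://my-finetuned-model.example.com"
PAYLOAD='{"model": "qwen3-0.6b-sft", "messages": [{"role": "user", "content": "Who are you?"}]}'

# Dry run: print the curl invocation instead of calling the service.
echo "curl -s $SERVICE_URL/v1/chat/completions -H 'Content-Type: application/json' -d '$PAYLOAD'"
```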
## Resources

`assets/` contains:

- `Containerfile`: Dockerfile for building the training runtime image.
- `vcjob-sft.yaml`: Template for submitting the fine-tuning VolcanoJob task.
## Source

    git clone https://github.com/typhoonzero/awesome-acp-skills.git
## How This Skill Works

Follow a step-by-step workflow: prepare the environment, model, and dataset; build the training runtime image with the provided Containerfile; create and submit a VolcanoJob YAML with BASE_MODEL_URL, DATASET_URL, OUTPUT_MODEL_URL, and MLFlow settings; then monitor the task, troubleshoot as needed, track experiments with MLFlow, and finally publish the fine-tuned model as an inference service.
## When to Use It
- When you want to fine-tune a base LLM (e.g., Qwen) using Alauda AI Workbench
- When you need to prepare and push a dataset into Alauda AI's dataset repository
- When you must build a training runtime image using the provided Containerfile
- When you want to submit, monitor, and troubleshoot a VolcanoJob for fine-tuning
- When you plan to track experiments with MLFlow and publish an inference service for the fine-tuned model
## Quick Start
- Step 1: Prepare the Environment — Create a Notebook/VSCode instance in Alauda AI Workbench; request CPU resources for the Notebook
- Step 2: Prepare the Model — Download the base model (e.g., Qwen/Qwen3-0.6B), upload to the model repository, and create an empty output model; note its Git URL
- Step 3: Prepare the Dataset — Create an empty dataset repository, upload your data, and push to the dataset repo using git lfs; ensure compatibility with the fine-tuning framework
## Best Practices
- Start with CPU resources for the Notebook; the training task runs on the cluster and will use GPU resources as configured in the VolcanoJob
- Use exact, accessible URLs for BASE_MODEL_URL, DATASET_URL, and OUTPUT_MODEL_URL in the vcjob-sft.yaml
- Leverage the provided assets/vcjob-sft.yaml template and customize MLFlow settings and resource requests/limits
- Ensure dataset format is compatible (Huggingface datasets or LLaMA-Factory) before training
- Configure storage via a PVC (e.g., models-cache) to persist models and artifacts across tasks
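A PVC like the `models-cache` volume mentioned above can be sketched as follows; the StorageClass name and size are assumptions to adjust for your cluster:

```yaml
# Hypothetical models-cache PVC; adjust storageClassName and size as needed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-cache
spec:
  accessModes:
    - ReadWriteMany        # NFS-backed classes typically support RWX
  storageClassName: nfs
  resources:
    requests:
      storage: 100Gi
```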
## Example Use Cases
- Fine-tuning Qwen3-0.6B with a HuggingFace dataset in Alauda AI Workbench
- Building and pushing a training runtime image using assets/Containerfile with git-lfs, LLaMA-Factory, transformers, and mlflow
- Submitting a VolcanoJob with assets/vcjob-sft.yaml and monitoring progress via kubectl
- Tracking experiments and comparing results in MLFlow during iterative fine-tuning
- Publishing the fine-tuned model as an inference service for online predictions