# LLM Fine-Tuning Workbench

Install this skill:

    npx machina-cli add skill typhoonzero/awesome-acp-skills/llm-fine-tuning-workbench --openclaw

## Overview

This skill enables fine-tuning of Large Language Models (LLMs) using Alauda AI Workbench. It provides workflows for preparing models and datasets, building the training runtime image, submitting and managing VolcanoJob tasks, and troubleshooting common issues during the fine-tuning process.
## Workflow Decision Tree

To fine-tune an LLM using Alauda AI Workbench, follow these steps:

- Prepare the Environment: Create a Notebook/VSCode instance.
- Prepare the Model: Download the base model and upload it to the model repository. Create an empty model for the output.
- Prepare the Dataset: Download the dataset and push it to the dataset repository.
- Prepare the Runtime Image: Build the training runtime image using the provided Containerfile.
- Create and Submit the Task: Configure and submit a VolcanoJob YAML file.
- Manage the Task: Monitor the task status, view logs, and troubleshoot issues.
- Experiment Tracking: Use MLFlow to track and compare experiments.
- Launch Inference Service: Publish the fine-tuned model as an inference service.
## Step 1: Prepare the Environment

Create a Notebook/VSCode instance in Alauda AI Workbench. It is recommended to request only CPU resources for the Notebook; the actual fine-tuning task is submitted to the cluster and requests GPU resources there.
## Step 2: Prepare the Model

- Download the base model (e.g., `Qwen/Qwen3-0.6B`) from Huggingface or another source.
- Upload the model to the Alauda AI model repository.
- Create an empty model in the model repository to store the fine-tuned output model. Note its Git repository URL.
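The download-and-upload flow above can be sketched as follows. The model repository URL is a hypothetical placeholder (use the Git URL shown in your Alauda AI model repository), and the commands are printed rather than executed:

```shell
# Assemble the model preparation commands. MODEL_REPO_URL is a hypothetical
# placeholder; substitute the Git URL from your model repository.
MODEL_ID="Qwen/Qwen3-0.6B"
MODEL_DIR=$(basename "$MODEL_ID")
MODEL_REPO_URL="https://example.com/models/qwen3-0.6b.git"

# Dry run: print the commands instead of executing them here,
# since they need network access and repository credentials.
cat <<EOF
huggingface-cli download $MODEL_ID --local-dir ./$MODEL_DIR
git clone $MODEL_REPO_URL base-model
cp -r ./$MODEL_DIR/. base-model/
cd base-model && git lfs install && git lfs track "*.safetensors"
git add . && git commit -m "add $MODEL_DIR base model" && git push
EOF
```

Large weight files should be tracked by git lfs before committing, as sketched above; pushing multi-GB files as plain Git objects will usually be rejected by the server.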
## Step 3: Prepare the Dataset

- Create an empty dataset repository in Alauda AI.
- Upload your dataset (e.g., `identity-alauda`) to the Notebook, unzip it, and push it to the dataset repository using `git lfs`.
- Ensure the dataset format is compatible with the fine-tuning framework (e.g., Huggingface `datasets` or LLaMA-Factory format).
## Step 4: Prepare the Runtime Image

Use the provided assets/Containerfile to build the training runtime image. This image includes necessary dependencies like git-lfs, LLaMA-Factory, transformers, and mlflow.

To build and push the image:

    docker build -t <your-registry>/fine_tune_with_llamafactory:v0.1.0 -f assets/Containerfile .
    docker push <your-registry>/fine_tune_with_llamafactory:v0.1.0
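For orientation, a minimal Containerfile for such an image might look like the sketch below. The base image tag and package names are assumptions; the provided assets/Containerfile is authoritative.

```dockerfile
# Hypothetical runtime image sketch; see assets/Containerfile for the real one.
FROM pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime

# git and git-lfs are required to pull the base model/dataset repos and to
# push the fine-tuned weights back.
RUN apt-get update && apt-get install -y git git-lfs && git lfs install

# Fine-tuning framework plus experiment tracking.
RUN pip install --no-cache-dir llamafactory transformers mlflow
```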
## Step 5: Create and Submit the Task

Use the provided assets/vcjob-sft.yaml template to create a VolcanoJob task.

Before submitting, modify the following settings in the YAML file:

- `BASE_MODEL_URL`: Git URL of the base model.
- `DATASET_URL`: Git URL of the dataset.
- `OUTPUT_MODEL_URL`: Git URL of the empty output model.
- `MLFLOW_TRACKING_URI`: URL of your MLFlow tracking server.
- `MLFLOW_EXPERIMENT_NAME`: Name of your MLFlow experiment.
- Resource requests/limits (CPU, Memory, GPU).
- Storage configurations (e.g., the `models-cache` PVC).
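As an abridged sketch of how these settings fit into a VolcanoJob (field values below are placeholders, and the layout of the actual assets/vcjob-sft.yaml may differ):

```yaml
# Abridged VolcanoJob sketch; consult assets/vcjob-sft.yaml for the real layout.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: sft-qwen3
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - name: trainer
      replicas: 1
      template:
        spec:
          containers:
            - name: trainer
              image: <your-registry>/fine_tune_with_llamafactory:v0.1.0
              env:
                - name: BASE_MODEL_URL
                  value: "<git-url-of-base-model>"
                - name: DATASET_URL
                  value: "<git-url-of-dataset>"
                - name: OUTPUT_MODEL_URL
                  value: "<git-url-of-empty-output-model>"
                - name: MLFLOW_TRACKING_URI
                  value: "<mlflow-tracking-url>"
                - name: MLFLOW_EXPERIMENT_NAME
                  value: "qwen3-sft"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - name: models-cache
                  mountPath: /models
          volumes:
            - name: models-cache
              persistentVolumeClaim:
                claimName: models-cache
          restartPolicy: Never
```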
Submit the task:

    kubectl create -f assets/vcjob-sft.yaml
## Step 6: Manage the Task and Troubleshoot

Use the following commands to manage the task:

- View task list: `kubectl get vcjob`
- View task status: `kubectl get vcjob <task-name>`
- View pod status: `kubectl get pod`
- View task logs: `kubectl logs <pod-name>`
- Delete task: `kubectl delete vcjob <task-name>`
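A simple way to monitor a long-running job is to poll its phase. The task name below is a hypothetical placeholder, and the command is only printed here since it requires cluster access:

```shell
# Poll-loop sketch for a VolcanoJob (dry run; requires a cluster to execute).
TASK_NAME="sft-qwen3"   # hypothetical task name
PHASE_CMD="kubectl get vcjob $TASK_NAME -o jsonpath={.status.state.phase}"

# Re-check the job phase every 30 seconds.
echo "watch -n 30 \"$PHASE_CMD\""
```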
### Troubleshooting Common Issues

- Pod is not created:
  - Run `kubectl describe vcjob <task-name>` or `kubectl get podgroups`.
  - Check the Volcano scheduler logs for resource insufficiency or PVC mounting issues.
- NFS PVC mounting issues:
  - Ensure `nfs-utils` is installed on all K8s nodes.
  - Ensure the NFS StorageClass has `mountPermissions: "0757"`.
- Non-Nvidia GPU scheduling:
  - Ensure the vendor GPU driver and Kubernetes device plugin are deployed.
  - Modify the resource requests in the YAML file to request the specific vendor GPU (e.g., `huawei.com/Ascend910: 1`).
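For example, a container resources stanza for an Ascend device might look like the sketch below; the exact resource name depends on the vendor's device plugin:

```yaml
# Vendor GPU resource sketch; resource names vary by device plugin.
resources:
  requests:
    huawei.com/Ascend910: 1
  limits:
    huawei.com/Ascend910: 1
```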
## Step 7: Experiment Tracking

If `report_to: mlflow` is set in the LLaMA-Factory configuration, training metrics will be automatically sent to the MLFlow server. View and compare experiments in the Alauda AI MLFlow interface.
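As a sketch of where `report_to` sits in a LLaMA-Factory SFT config, assuming paths, dataset name, and hyperparameters that are illustrative placeholders rather than values from this skill:

```yaml
# Minimal LLaMA-Factory SFT config sketch; values are illustrative only.
model_name_or_path: /models/Qwen3-0.6B
stage: sft
do_train: true
finetuning_type: lora
dataset: identity_alauda   # must be registered in LLaMA-Factory's dataset_info.json
template: qwen
output_dir: /models/output
num_train_epochs: 3.0
per_device_train_batch_size: 1
learning_rate: 1.0e-4
report_to: mlflow          # enables automatic metric logging to MLFlow
```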
## Step 8: Launch Inference Service

Once the fine-tuning task completes and the model is pushed to the repository:

- Go to the model repository and edit the metadata (Task Type: Text Classification, Framework: Transformers).
- Click Publish Inference API -> Custom Publishing.
- Select the vLLM inference runtime and configure resources.
- Click Publish and wait for the service to start.
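Since vLLM exposes an OpenAI-compatible API, a first smoke test of the published service can be a chat-completions request. The service URL and model name below are hypothetical placeholders, and the block only prints the request rather than sending it:

```shell
# Build a smoke-test request for the published service (placeholders only).
SERVICE_URL="http://my-finetuned-model.example.com"
PAYLOAD='{"model": "qwen3-0.6b-sft", "messages": [{"role": "user", "content": "Who are you?"}]}'

# Dry run: print the curl invocation instead of calling the service.
echo "curl -s $SERVICE_URL/v1/chat/completions -H 'Content-Type: application/json' -d '$PAYLOAD'"
```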
## Resources

`assets/` contains:

- `Containerfile`: Dockerfile for building the training runtime image.
- `vcjob-sft.yaml`: Template for submitting the fine-tuning VolcanoJob task.
## Source

    git clone https://github.com/typhoonzero/awesome-acp-skills.git
## How This Skill Works

Follow a step-by-step workflow: prepare the environment, model, and dataset; build the training runtime image with the provided Containerfile; create and submit a VolcanoJob YAML with BASE_MODEL_URL, DATASET_URL, OUTPUT_MODEL_URL, and MLFlow settings; then monitor the task, troubleshoot as needed, track experiments with MLFlow, and finally publish the fine-tuned model as an inference service.
## When to Use It
- When you want to fine-tune a base LLM (e.g., Qwen) using Alauda AI Workbench
- When you need to prepare and push a dataset into Alauda AI's dataset repository
- When you must build a training runtime image using the provided Containerfile
- When you want to submit, monitor, and troubleshoot a VolcanoJob for fine-tuning
- When you plan to track experiments with MLFlow and publish an inference service for the fine-tuned model
## Quick Start
- Step 1: Prepare the Environment — Create a Notebook/VSCode instance in Alauda AI Workbench; request CPU resources for the Notebook
- Step 2: Prepare the Model — Download the base model (e.g., Qwen/Qwen3-0.6B), upload to the model repository, and create an empty output model; note its Git URL
- Step 3: Prepare the Dataset — Create an empty dataset repository, upload your data, and push to the dataset repo using git lfs; ensure compatibility with the fine-tuning framework
## Best Practices
- Start with CPU resources for the Notebook; the training task runs on the cluster and will use GPU resources as configured in the VolcanoJob
- Use exact, accessible URLs for BASE_MODEL_URL, DATASET_URL, and OUTPUT_MODEL_URL in the vcjob-sft.yaml
- Leverage the provided assets/vcjob-sft.yaml template and customize MLFlow settings and resource requests/limits
- Ensure dataset format is compatible (Huggingface datasets or LLaMA-Factory) before training
- Configure storage via a PVC (e.g., models-cache) to persist models and artifacts across tasks
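A PVC like the `models-cache` volume mentioned above can be sketched as follows; the StorageClass name and size are assumptions to adjust for your cluster:

```yaml
# Hypothetical models-cache PVC; adjust storageClassName and size as needed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-cache
spec:
  accessModes:
    - ReadWriteMany        # NFS-backed classes typically support RWX
  storageClassName: nfs
  resources:
    requests:
      storage: 100Gi
```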
## Example Use Cases
- Fine-tuning Qwen3-0.6B with a HuggingFace dataset in Alauda AI Workbench
- Building and pushing a training runtime image using assets/Containerfile with git-lfs, LLaMA-Factory, transformers, and mlflow
- Submitting a VolcanoJob with assets/vcjob-sft.yaml and monitoring progress via kubectl
- Tracking experiments and comparing results in MLFlow during iterative fine-tuning
- Publishing the fine-tuned model as an inference service for online predictions