LLM Fine-tuning Documentation

Environment Preparation

GPU devices: A10, 3090, V100, and A100 are all suitable.

# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

# If you want to use deepspeed.
pip install deepspeed -U

# If you want to use QLoRA training based on auto_gptq (recommended over bnb).
# Models supporting auto_gptq: `https://github.com/modelscope/swift/blob/main/docs/source/Instruction/supported-models-and-datasets.md#models`
# The auto_gptq build depends on your CUDA version; choose a version according to `https://github.com/PanQiWei/AutoGPTQ#quick-installation`
pip install auto_gptq -U

# If you want to use bnb-based qlora training.
pip install bitsandbytes -U

# Align the environment (usually not necessary: the repository is tested against the latest package versions, so run the following only if you encounter errors)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
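
Since deepspeed, auto_gptq, and bitsandbytes are all optional, it can help to check which of them are present before choosing a training configuration. The following is a minimal sketch using only the Python standard library; the helper name `check_installed` is hypothetical and not part of swift, and the import names are assumed to match the pip package names listed above.

```python
from importlib.util import find_spec

# Import names of the optional packages mentioned above (assumed to match
# the names used with pip).
OPTIONAL_PACKAGES = ("deepspeed", "auto_gptq", "bitsandbytes")


def check_installed(names=OPTIONAL_PACKAGES):
    """Map each top-level package name to True if it can be located."""
    # find_spec only locates a top-level package; it does not import it.
    return {name: find_spec(name) is not None for name in names}


for name, ok in check_installed().items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

A missing package only matters if you intend to use the corresponding feature.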

Fine-Tuning

If you want to fine-tune and run inference through the web interface, see the Web-ui Documentation.

Using Python

# Experimental environment: A10, 3090, V100, ...
# 20GB GPU memory
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch

from swift.llm import (
    DatasetName, InferArguments, ModelType, SftArguments,
    infer_main, sft_main, app_ui_main
)

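model_type = ModelType.qwen_7b_chat
# Fine-tune on a 2000-sample subset of the dataset (the `#2000` suffix).
sft_args = SftArguments(
    model_type=model_type,
    dataset=[f'{DatasetName.blossom_math_zh}#2000'],
    output_dir='output')
result = sft_main(sft_args)
last_model_checkpoint = result['last_model_checkpoint']
print(f'last_model_checkpoint: {last_model_checkpoint}')
torch.cuda.empty_cache()

# Run inference with the checkpoint produced above, reusing its dataset config.
infer_args = InferArguments(
    ckpt_dir=last_model_checkpoint,
    load_dataset_config=True)
# Optional: merge the LoRA weights first (merge_lora is not imported above).
# merge_lora(infer_args, device_map='cpu')
result = infer_main(infer_args)
torch.cuda.empty_cache()

# Launch the web UI for the fine-tuned model.
app_ui_main(infer_args)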
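
The dataset entry in the example above has the form `<name>#<count>`, where the count limits how many samples are used. As an illustration of the notation only, here is a small sketch of how such a string could be split; `parse_dataset_spec` is a hypothetical helper, not swift's own parser, and `my-dataset` is a made-up name.

```python
def parse_dataset_spec(spec):
    """Split 'name#count' into (name, count); count is None when absent."""
    name, sep, count = spec.partition("#")
    return name, (int(count) if sep else None)


print(parse_dataset_spec("my-dataset#2000"))
print(parse_dataset_spec("my-dataset"))
```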

Using CLI

# Experimental environment: A10, 3090, V100, ...
# 20GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset AI-ModelScope/blossom-math-v2 \
    --output_dir output

# Using your own dataset
# custom dataset format: https://github.com/modelscope/swift/blob/main/docs/source_en/Instruction/Customization.md#custom-datasets
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset chatml.jsonl \
    --output_dir output

# Using DDP
# Experimental environment: 2 * 3090
# 2 * 23GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=2 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset AI-ModelScope/blossom-math-v2 \
    --output_dir output

# Multi-node, multi-GPU
# If the disk is not shared, please additionally specify `--save_on_each_node true` in the shell scripts on each machine.
# node0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=2 \
NODE_RANK=0 \
MASTER_ADDR=127.0.0.1 \
NPROC_PER_NODE=4 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset AI-ModelScope/blossom-math-v2 \
    --output_dir output

# node1
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=2 \
NODE_RANK=1 \
MASTER_ADDR=xxx.xxx.xxx.xxx \
NPROC_PER_NODE=4 \
swift sft \
    --model_id_or_path qwen/Qwen-7B-Chat \
    --dataset AI-ModelScope/blossom-math-v2 \
    --output_dir output
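
The two multi-node commands differ only in `NODE_RANK` and `MASTER_ADDR`. The sketch below generates the environment prefix for each node from a few parameters, which may be less error-prone than editing the scripts by hand. The helpers `node_env` and `launch_line` are hypothetical, and `192.0.2.10` is a placeholder documentation address standing in for node0's real IP.

```python
def node_env(node_rank, nnodes, nproc_per_node, master_addr):
    """Environment variables for one node, mirroring the commands above."""
    return {
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in range(nproc_per_node)),
        "NNODES": str(nnodes),
        "NODE_RANK": str(node_rank),
        # As in the commands above, node0 points at itself and every other
        # node points at node0's address.
        "MASTER_ADDR": "127.0.0.1" if node_rank == 0 else master_addr,
        "NPROC_PER_NODE": str(nproc_per_node),
    }


def launch_line(env, args):
    """Render a one-line shell command: VAR=value ... swift sft <args>."""
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    return f"{prefix} swift sft {' '.join(args)}"


sft_args = [
    "--model_id_or_path", "qwen/Qwen-7B-Chat",
    "--dataset", "AI-ModelScope/blossom-math-v2",
    "--output_dir", "output",
]
for rank in range(2):
    print(launch_line(node_env(rank, 2, 4, "192.0.2.10"), sft_args))
```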

More sh Scripts

More sh scripts can be viewed here

# Scripts need to be executed in this directory
cd examples/pytorch/llm

Tips:

  • Training sets --gradient_checkpointing true by default to save memory, which may slightly reduce training speed.

  • To use the quantization parameter --quantization_bit 4, first install bnb: pip install bitsandbytes -U. This reduces memory usage but usually slows down training.

  • To use quantization based on auto_gptq, install the auto_gptq build that matches your CUDA version: pip install auto_gptq -U.

    Models that can use auto_gptq can be viewed in LLM Supported Models. It is recommended to use auto_gptq instead of bnb.

  • If you want to use deepspeed, you need pip install deepspeed -U. Using deepspeed can save memory, but may slightly reduce training speed.

  • If your training involves knowledge editing, such as Self-aware Fine-tuning, you also need to add LoRA to the MLP layers; otherwise the results may be poor. You can pass --lora_target_modules ALL to add LoRA to all linear layers (qkvo and mlp), which usually gives the best results.

  • If you are using older GPUs like V100, you need to set --dtype AUTO or --dtype fp16, as they do not support bf16.

  • If your machine has high-performance GPUs such as the A100 and the model supports flash-attn, it is recommended to install flash-attn, which speeds up training and inference and reduces memory usage. (A10, 3090, V100, and similar GPUs do not support training with flash-attn.) Models that support flash-attn are listed in LLM Supported Models.

  • If you are doing continued pre-training or multi-turn dialogue, refer to Customization and Extension.

  • If you need to train offline, please use --model_id_or_path <model_dir> and set --check_model_is_latest false. For specific parameter meanings, please check Command-line Parameters.

  • If you want to push weights to the ModelScope Hub during training, you need to set --push_to_hub true.

  • If you want to merge the LoRA weights and save them during inference, set --merge_lora true. Merging is not recommended for models trained with QLoRA, because it results in precision loss. QLoRA fine-tuning is therefore not recommended either, as its deployment ecosystem is weak.
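
The dtype tip above comes down to GPU generation: bf16 needs compute capability 8.0 or higher (for example A100), while the V100 is 7.0. A minimal sketch of that decision follows; `pick_dtype` is a hypothetical helper, not a swift function. With PyTorch installed, the major version could be read from `torch.cuda.get_device_capability()[0]`, but the sketch takes it as an argument so that it runs without a GPU.

```python
def pick_dtype(compute_capability_major):
    """Return 'bf16' on GPUs with compute capability >= 8, else 'fp16'."""
    return "bf16" if compute_capability_major >= 8 else "fp16"


# V100 has compute capability 7.0; A100 has 8.0.
for gpu, major in (("V100", 7), ("A100", 8)):
    print(f"{gpu}: --dtype {pick_dtype(major)}")
```

Passing --dtype AUTO, as mentioned in the tip, leaves the choice to swift.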

Note:

  • For legacy naming reasons, scripts whose names end in _ds (e.g. full_ddp_ds) train with DeepSpeed ZeRO-2.

  • Scripts other than those listed below may not be maintained.

If you want to customize scripts, you can refer to the following scripts for modification: (The following scripts will be regularly maintained)

DPO

If you want to use DPO for human-aligned fine-tuning, you can check the DPO Fine-Tuning Documentation.

ORPO

If you want to use ORPO for human-aligned fine-tuning, you can check the ORPO Fine-Tuning Documentation.

Merge LoRA

Tip: Currently, merging LoRA is not supported for bnb and auto_gptq quantized models, as this would result in significant accuracy loss.

# If you need quantization, you can specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
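
To make the merge step concrete, here is a minimal sketch of what merging amounts to under the standard LoRA formulation, W' = W + (alpha / r) * B A. The function names and the toy matrices are hypothetical and are not swift's implementation; `swift export --merge_lora true` performs the real merge on the checkpoint.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora_weights(weight, lora_a, lora_b, alpha, rank):
    """Fold the low-rank update into the base weight: W + (alpha / rank) * B A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy shapes: W is 2x3, A is 1x3 (rank 1), B is 2x1, x is a 3x1 column.
W = [[1, 0, 2], [0, 1, 1]]
A = [[1, 2, 3]]
B = [[1], [2]]
x = [[1], [1], [1]]
alpha, rank = 2, 1

# Unmerged forward pass: base output plus the scaled adapter output.
base = matmul(W, x)
adapter = matmul(B, matmul(A, x))
unmerged = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base, adapter)]

# Merged forward pass: a single matrix multiplication.
merged = matmul(merge_lora_weights(W, A, B, alpha, rank), x)
print(unmerged == merged)
```

After merging, inference no longer needs the adapter weights, which is why the merged checkpoint can be loaded like an ordinary model.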

Quantization

For quantizing the fine-tuned model, see the LLM Quantization Documentation.

Inference

If you want to use VLLM for accelerated inference, see VLLM Inference Acceleration and Deployment.

Original Model

For single-sample inference, see the LLM Inference Documentation.

Using Dataset for evaluation:

CUDA_VISIBLE_DEVICES=0 swift infer --model_id_or_path qwen/Qwen-7B-Chat --dataset AI-ModelScope/blossom-math-v2

Fine-tuned Model

Single sample inference:

Inference using LoRA incremental weights:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)
from swift.tuners import Swift

ckpt_dir = 'vx-xxx/checkpoint-100'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
model_id_or_path = None
model, tokenizer = get_model_tokenizer(model_type, model_id_or_path=model_id_or_path, model_kwargs={'device_map': 'auto'})

model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)
template = get_template(template_type, tokenizer)
query = 'xxxxxx'
response, history = inference(model, template, query)
print(f'response: {response}')
print(f'history: {history}')

Inference using LoRA merged weights:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)

ckpt_dir = 'vx-xxx/checkpoint-100-merged'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'},
                                       model_id_or_path=ckpt_dir)

template = get_template(template_type, tokenizer)
query = 'xxxxxx'
response, history = inference(model, template, query)
print(f'response: {response}')
print(f'history: {history}')

Using Dataset for evaluation:

# Direct inference
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' \
    --load_dataset_config true

# If you need to replace the val_dataset
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --val_dataset <your-val-dataset>

# Merge LoRA incremental weights and infer
# If you need quantization, you can specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' --load_dataset_config true
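
The merge-then-infer sequence above can also be driven from Python. The sketch below only builds the two command lines, using the flags shown in this section and the `-merged` suffix that the export step appends to the checkpoint directory in the examples. The helper names are hypothetical, and the placeholder path is kept exactly as it appears above.

```python
import shlex


def merged_ckpt_dir(ckpt_dir):
    """Directory name used for the merged checkpoint in the commands above."""
    return ckpt_dir.rstrip("/") + "-merged"


def export_cmd(ckpt_dir):
    """Argument list for merging the LoRA weights of a checkpoint."""
    return ["swift", "export", "--ckpt_dir", ckpt_dir, "--merge_lora", "true"]


def infer_cmd(ckpt_dir, load_dataset_config=False):
    """Argument list for running inference on a checkpoint."""
    cmd = ["swift", "infer", "--ckpt_dir", ckpt_dir]
    if load_dataset_config:
        cmd += ["--load_dataset_config", "true"]
    return cmd


ckpt = "xxx/vx-xxx/checkpoint-xxx"
steps = [export_cmd(ckpt), infer_cmd(merged_ckpt_dir(ckpt), load_dataset_config=True)]
for step in steps:
    print(shlex.join(step))
```

Each list could then be passed to `subprocess.run(step, check=True)` with `CUDA_VISIBLE_DEVICES` set in the environment.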

Manual evaluation:

# Direct inference
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx'

# Merge LoRA incremental weights and infer
# If you need quantization, you can specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged'

Web-UI

If you want to deploy with VLLM and provide an API interface, see VLLM Inference Acceleration and Deployment.

Original Model

For the web-ui of the original model, see the LLM Inference Documentation.

Fine-tuned Model

# Directly use app-ui
CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx'

# Merge LoRA incremental weights and use app-ui
# If you need quantization, you can specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged'