# Qwen2-VL Best Practice

The best practices for qwen2-vl-72b-instruct can be found [here](https://github.com/modelscope/ms-swift/issues/2064).

## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)

## Environment Setup
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e .[llm]

pip install pyav qwen_vl_utils
```

Models (base, instruct, gptq-int4, gptq-int8, and awq variants are supported for fine-tuning):
- qwen2-vl-2b-instruct: [https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct)
- qwen2-vl-7b-instruct: [https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct)
- qwen2-vl-72b-instruct: [https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct)

## Inference

Run inference with qwen2-vl-7b-instruct:
```shell
# Experimental environment: A100
# 30GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-vl-7b-instruct
```

Output (images can be passed as local paths or URLs):
```python
"""
<<< who are you?
I am a large language model created by Alibaba Cloud. I am called Qwen.
--------------------------------------------------
<<< There are several sheep in the picture.
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
[INFO:swift] Setting size_factor: 28. You can adjust this hyperparameter through the environment variable: `SIZE_FACTOR`.
[INFO:swift] Setting resized_height: None. You can adjust this hyperparameter through the environment variable: `RESIZED_HEIGHT`.
[INFO:swift] Setting resized_width: None. You can adjust this hyperparameter through the environment variable: `RESIZED_WIDTH`.
[INFO:swift] Setting min_pixels: 3136. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting max_pixels: 12845056. You can adjust this hyperparameter through the environment variable: `MAX_PIXELS`.
There are four sheep in the picture.
--------------------------------------------------
<<< clear
<<< What is the result of the calculation?
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The result of the calculation 1452 + 45304 is 46756.
--------------------------------------------------
<<< Perform OCR on the image.
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png
Introduction
SWIFT supports training, inference, evaluation and deployment of 250+ LLMs and 35+ MLLMs (multimodal large models). Developers can directly apply our framework to their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition to supporting the lightweight training solutions provided by PEFT, we also provide a complete Adapters library to support the latest training techniques such as NEFTune, LoRA+, LLaMA-PRO, etc. This adapter library can be used directly in your own custom workflow without our training scripts.
To facilitate use by users unfamiliar with deep learning, we provide a Gradio web-ui for controlling training and inference, as well as accompanying deep learning courses and best practices for beginners.
Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.
SWIFT has rich documentations for users, please check here.
SWIFT web-ui is available both on Huggingface space and ModelScope studio, please feel free to try!
--------------------------------------------------
<<< clear
<<<