# InternVL Best Practice

This document covers the following models:

- [internvl-chat-v1_5](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary)
- [internvl-chat-v1_5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)
- [mini-internvl-chat-2b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)
- [mini-internvl-chat-4b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)
- [internvl2-1b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-1B)
- [internvl2-2b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-2B)
- [internvl2-4b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-4B)
- [internvl2-8b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-8B)
- [internvl2-26b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-26B)
- [internvl2-40b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-40B)
- [internvl2-llama3-76b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B)

The following practice takes `internvl-chat-v1_5` as an example; you can switch to other models by specifying `--model_type`.

**FAQ**

1. **Model shows `The request model does not exist!`**

   This issue often arises when attempting to use the mini-internvl or InternVL2 models, because the corresponding models on modelscope are subject to an application process. To resolve this, log in to modelscope and go to the respective model page to apply for download. After approval, you can obtain the model through either of the following methods:
   - Use `snapshot_download` to download the model locally (the relevant code is available in the model download section of the model file), and then specify the local model file path using `--model_id_or_path`.
   - Obtain the SDK token for your account from the [modelscope account homepage](https://www.modelscope.cn/my/myaccesstoken), and specify it using the `--hub_token` parameter or the `MODELSCOPE_API_TOKEN` environment variable.

2. **Why is memory distributed unevenly across multiple GPU cards when running models, leading to OOM?**

   The auto device map algorithm in transformers is not friendly to multi-modal models, which may result in uneven memory allocation across different GPU cards.
   - You can set the memory usage for each card using the `--device_max_memory` parameter. For example, in a four-card environment, you can set `--device_max_memory 15GB 15GB 15GB 15GB`.
   - Alternatively, you can explicitly specify the device map using `--device_map_config`.

3. **Differences between the InternVL2 model and its predecessors (InternVL-V1.5 and Mini-InternVL)**

   - The InternVL2 model supports multi-turn multi-image inference and training, meaning multi-turn conversations with images, and supports text and images interleaved within a single turn. For details, refer to [Custom Dataset](#custom-dataset) and the InternVL2 part of the [Inference](#inference) section. The predecessor models supported multi-turn conversations but could only have images in a single turn.
   - The InternVL2 model supports video input. For specific formats, refer to [Custom Dataset](#custom-dataset).

## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Custom Dataset](#custom-dataset)
- [Inference after Fine-tuning](#inference-after-fine-tuning)

## Environment Setup
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
pip install Pillow
```

## Inference

**Note**
- If you want to use a local model file, add the argument `--model_id_or_path /path/to/model`.
- If your GPU does not support flash attention, use the argument `--use_flash_attn false`.
  For int8 models, you must also specify `--dtype bf16` during inference; otherwise the output may be garbled.
- The model's configuration specifies a relatively small `max_length` of 2048, which can be changed by setting `--max_length`.
- Memory consumption can be reduced by using the parameter `--gradient_checkpointing true`.

```shell
# Experimental environment: A100
# 55GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type internvl-chat-v1_5 --dtype bf16 --max_length 4096

# 2*30GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type internvl-chat-v1_5 --dtype bf16 --max_length 4096
```

Output: (supports passing in a local path or URL)
```python
"""
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
--------------------------------------------------
<<< clear
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< clear
<<< What is the calculation result?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The calculation result is 59,856.
--------------------------------------------------
<<< clear
<<< Write a poem based on the content of the picture.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
Token indices sequence length is longer than the specified maximum sequence length for this model (5142 > 4096).
Running this sequence through the model will result in indexing errors
In the still of the night,
A lone boat sails on the light.
The stars above, a twinkling sight,
Reflecting in the water's might.
The trees stand tall, a silent guard,
Their leaves rustling in the yard.
The boatman's lantern, a beacon bright,
Guiding him through the night.
The river flows, a gentle stream,
Carrying the boatman's dream.
His journey long, his heart serene,
In the beauty of the scene.
The stars above, a guiding light,
Leading him through the night.
The boatman's journey, a tale to tell,
Of courage, hope, and love as well.
"""
```

The **InternVL2** series models support multi-turn multi-image inference, and images and text can be interleaved within a single turn. The position of each image is determined by the `<image>` tag in the input.

Example of multi-turn multi-image input:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type internvl2-2b
```

```
<<< describe the image
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
The image depicts a kitten with striking blue eyes and a mix of white and black fur. The kitten's fur appears soft and fluffy, and its ears are large and pointed. The eyes of the kitten are particularly expressive, giving it a curious and endearing appearance. The background is blurred, drawing attention to the kitten's detailed features. The overall composition of the image highlights the kitten's innocence and charm.
--------------------------------------------------
<<< What is the difference from the last picture
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
The difference between the last picture and the one in the image is primarily in the facial expressions and the overall demeanor of the sheep.
In the first picture, the sheep have a calm and curious expression, with large, expressive eyes and a gentle demeanor.
The background is blurred, emphasizing the sheep's features.
In the second picture, the sheep have a more serious and focused expression. Their eyes are slightly narrowed, and their ears are perked up, giving them a more alert and attentive appearance. The background remains the same, maintaining the focus on the sheep.
In the third picture, the sheep have a more neutral expression. Their eyes are slightly narrowed, and their ears are perked up, but their overall demeanor is still calm and attentive. The background remains the same, ensuring the focus remains on the sheep.
In the fourth picture, the sheep have a more neutral and calm expression. Their eyes are slightly narrowed, and their ears are perked up, but their overall demeanor is still calm and attentive. The background remains the same, maintaining the focus on the sheep.
Overall, the main difference is the facial expressions and the overall demeanor of the sheep, with the second picture showing a more alert and focused look compared to the other three pictures.
--------------------------------------------------
<<< What the two pictures have in common
The two pictures have several similarities:
1. **Foreground Sheep**: All four sheep in the images are in the foreground, standing on a grassy field. This creates a sense of unity and focus on the sheep.
2. **Expression and Demeanor**: Each sheep has a distinct expression and demeanor. The first picture shows a curious and calm sheep, the second shows a more alert and focused sheep, the third shows a neutral and calm sheep, and the fourth shows a neutral and calm sheep. Despite the differences in expressions, all four sheep maintain a calm demeanor.
3. **Background**: The background remains consistent in all four pictures. The background features a blurred green field with mountains in the distance, which helps to keep the focus on the sheep.
4. **Art Style**: The art style is consistent across all four pictures.
The sheep are depicted in a cartoonish and friendly manner, with large eyes and expressive faces.
5. **Overall Composition**: The composition of the images is similar, with the sheep standing in the foreground and the background featuring a blurred natural landscape.
These similarities create a cohesive and engaging visual experience, despite the differences in expressions and demeanor.
--------------------------------------------------
<<< clear
<<<