🔥SCEdit
SCEdit, proposed by Alibaba TongYi Vision Intelligence Lab, is an efficient generative fine-tuning framework. For text-to-image downstream tasks, it uses 30%-50% less training memory than LoRA and transfers quickly to specific generation scenarios. It also extends directly to controllable image generation. Compared with ControlNet, it needs only 7.9% of the parameters and about 30% less memory, and it supports conditions such as edge maps, depth maps, segmentation maps, poses, color maps, and image inpainting.
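This README does not explain where the savings come from. According to the SCEdit paper, the framework does not adapt attention weights as LoRA does. Instead, it attaches a lightweight tuning module (SC-Tuner) to the U-Net's skip connections. The pure-Python sketch below illustrates one such module. It assumes a residual bottleneck (down-projection, GELU, up-projection) applied to the channel vector of a skip-connection feature. The class name `BottleneckTuner`, the dimensions, the zero initialization, and the absence of biases are all hypothetical illustrative choices, not SCEdit's actual implementation.

```python
import math
import random


def gelu(v: float) -> float:
    """Exact GELU activation: v * Phi(v), where Phi is the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(w, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class BottleneckTuner:
    """Hypothetical residual bottleneck tuner: y = x + W_up(GELU(W_down x)).

    Assumption: applied to the channel vector of a skip-connection feature
    at one spatial position; the frozen U-Net itself is not modeled here.
    """

    def __init__(self, dim: int, hidden: int, seed: int = 0):
        rng = random.Random(seed)
        # Down-projection (dim -> hidden) with small random weights.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
        # Up-projection (hidden -> dim) initialized to zero, so the tuner
        # starts as the identity and leaves the pretrained model unchanged.
        self.w_up = [[0.0] * hidden for _ in range(dim)]

    def num_params(self) -> int:
        """Trainable parameters: 2 * dim * hidden (no biases in this sketch)."""
        return sum(len(r) for r in self.w_down) + sum(len(r) for r in self.w_up)

    def __call__(self, skip_feature):
        h = [gelu(v) for v in matvec(self.w_down, skip_feature)]
        delta = matvec(self.w_up, h)
        # Residual connection: edit the skip feature rather than replace it.
        return [x + d for x, d in zip(skip_feature, delta)]


if __name__ == "__main__":
    tuner = BottleneckTuner(dim=320, hidden=32)
    skip = [0.01 * i for i in range(320)]
    print(tuner(skip) == skip)  # identity at initialization
    print(tuner.num_params())   # 2 * 320 * 32 trainable parameters
```

The trainable parameter count depends only on `dim` and `hidden`, not on the size of the frozen backbone. This is what makes adapter-style tuning of this kind lightweight.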
We trained on the 3D-style data from the Style Transfer Dataset and evaluated every model with the same prompt: "A boy in a camouflage jacket with a scarf". The qualitative and quantitative results are shown below:
| Method | Batch size | Epochs | Target modules | Trainable params (M) | Training memory (MiB) | 3D style |
|---|---|---|---|---|---|---|
| LoRA (r=64) | 1 | 50 | q/k/v/out/mlp | 23.94 (2.20%) | 8440 | ![]() |
| SCEdit | 1 | 50 | up_blocks | 19.68 (1.81%) | 7556 | ![]() |
| LoRA (r=64) | 10 | 100 | q/k/v/out/mlp | 23.94 (2.20%) | 26300 | ![]() |
| SCEdit | 10 | 100 | up_blocks | 19.68 (1.81%) | 18634 | ![]() |
| LoRA (r=64) | 30 | 200 | q/k/v/out/mlp | 23.94 (2.20%) | 69554 | ![]() |
| SCEdit | 30 | 200 | up_blocks | 19.68 (1.81%) | 43350 | ![]() |
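The following short script works through the arithmetic behind the memory comparison. It copies the memory and parameter figures from the table and computes the relative saving of SCEdit over LoRA (r=64) for each configuration. `MEMORY_MIB` and `relative_saving` are names chosen for this sketch only.

```python
# Training memory (MiB) from the table: (batch size, epochs) -> (LoRA r=64, SCEdit)
MEMORY_MIB = {
    (1, 50): (8440, 7556),
    (10, 100): (26300, 18634),
    (30, 200): (69554, 43350),
}

# Trainable parameters (millions) from the table.
LORA_PARAMS_M = 23.94
SCEDIT_PARAMS_M = 19.68


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction of the baseline that the candidate saves."""
    return (baseline - candidate) / baseline


for (bs, ep), (lora, scedit) in MEMORY_MIB.items():
    saving = relative_saving(lora, scedit)
    print(f"bs={bs}, ep={ep}: SCEdit saves {lora - scedit} MiB ({saving:.1%})")

print(f"trainable params: {relative_saving(LORA_PARAMS_M, SCEDIT_PARAMS_M):.1%} fewer")
```

In this setup, the relative memory saving grows with batch size. It is about 10.5% at batch size 1, 29.1% at batch size 10, and 37.7% at batch size 30. SCEdit also trains roughly 17.8% fewer parameters.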
To train with SCEdit and reproduce the results above, run:
```shell
# First, follow the installation steps in the section below
cd examples/pytorch/multi_modal/notebook
python text_to_image_synthesis.py
```