🔥SCEdit

SCEdit, proposed by Alibaba TongYi Vision Intelligence Lab, is an efficient generative fine-tuning framework. The framework not only supports fine-tuning capabilities for text-to-image downstream tasks, saving 30%-50% of training memory overhead compared to LoRA, achieving rapid transfer to specific generation scenarios; but it can also directly extend to controllable image generation tasks, requiring only 7.9% of the parameter amount of ControlNet conditional generation and saving 30% of memory overhead, supporting conditional generation tasks such as edge images, depth images, segmentation images, poses, color images, image inpainting, etc.

We used the 3D style data from the Style Transfer Dataset for training, and tested using the same Prompt: A boy in a camouflage jacket with a scarf. The specific qualitative and quantitative results are as follows:

Method	bs	ep	Target Module	Param. (M)	Mem. (MiB)
LoRA/r=64	1	50	q/k/v/out/mlp	23.94 (2.20%)	8440MiB
SCEdit	1	50	up_blocks	19.68 (1.81%)	7556MiB
LoRA/r=64	10	100	q/k/v/out/mlp	23.94 (2.20%)	26300MiB
SCEdit	10	100	up_blocks	19.68 (1.81%)	18634MiB
LoRA/r=64	30	200	q/k/v/out/mlp	23.94 (2.20%)	69554MiB
SCEdit	30	200	up_blocks	19.68 (1.81%)	43350MiB

To perform the training task using SCEdit and reproduce the above results:

# First, follow the installation steps in the section below
cd examples/pytorch/multi_modal/notebook
python text_to_image_synthesis.py