This paper presents a comprehensive study on the role of Classifier-Free Guidance (CFG) in text-conditioned diffusion models from the perspective of inference efficiency. In particular, we relax the default choice of applying CFG in all diffusion steps and instead search for efficient guidance policies. We formulate the discovery of such policies in the differentiable Neural Architecture Search framework.
Our findings suggest that the denoising steps proposed by CFG become increasingly aligned with simple conditional steps, which renders the extra neural network evaluation of CFG redundant, especially in the second half of the denoising process. Building upon this insight, we propose Adaptive Guidance (AG), an efficient variant of CFG, that adaptively omits network evaluations when the denoising process displays convergence. Our experiments demonstrate that AG preserves CFG's image quality while reducing computation by 25%. Thus, AG constitutes a plug-and-play alternative to Guidance Distillation, achieving 50% of the speed-ups of the latter while being training-free and retaining the capacity to handle negative prompts. Finally, we uncover further redundancies of CFG in the first half of the diffusion process, showing that entire neural function evaluations can be replaced by simple affine transformations of past score estimates. This method, termed LinearAG, offers even cheaper inference at the cost of deviating from the baseline model. Our findings provide insights into the efficiency of the conditional denoising process that contribute to more practical and swift deployment of text-conditioned diffusion models.
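The core idea can be sketched in a few lines. Below, a toy illustration of how AG differs from CFG: before a threshold timestep, both the conditional and unconditional networks are evaluated and combined as in standard CFG; afterwards, only the cheaper conditional estimate is used. The function names, the threshold value, and the stand-in "networks" are all hypothetical, chosen for illustration only (diffusion timesteps are assumed to count down toward 0).

```python
import numpy as np

def cfg_step(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance combination of two score estimates."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def adaptive_guidance_eps(model_cond, model_uncond, x_t, t, t_threshold, w):
    """Sketch of an AG step: run both networks only in the early (high-noise)
    part of the process; after the threshold, skip the unconditional NFE."""
    eps_c = model_cond(x_t, t)
    if t > t_threshold:            # early steps: full CFG, 2 NFEs
        eps_u = model_uncond(x_t, t)
        return cfg_step(eps_c, eps_u, w)
    return eps_c                   # late steps: conditional only, 1 NFE

# toy usage with stand-in "networks" (not real diffusion models)
model_cond = lambda x, t: x * 0.9
model_uncond = lambda x, t: x * 0.8
x = np.ones(4)
early = adaptive_guidance_eps(model_cond, model_uncond, x, t=800, t_threshold=500, w=7.5)
late = adaptive_guidance_eps(model_cond, model_uncond, x, t=200, t_threshold=500, w=7.5)
```

Because the skipped unconditional evaluation is a full forward pass, each skipped step halves that step's cost, which is where the overall NFE savings come from.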
Adaptive Guidance is an efficient variant of Classifier-Free Guidance that saves 25% of total NFEs without compromising image quality.
For Adaptive Guidance, we keep the number of denoising iterations constant but reduce the number of steps that use CFG by raising the threshold (top). The CFG baseline instead reduces the total number of diffusion steps (bottom).
We employ techniques from differentiable Neural Architecture Search (NAS) to search for policies offering more desirable trade-offs between quality and Number of Function Evaluations (NFEs).
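The differentiable search can be pictured as learning a relaxed, per-step gate that trades off a quality signal against the cost of the extra NFE. The following is a minimal toy surrogate, not the paper's actual NAS objective: the per-step "benefit" values and the cost penalty are invented stand-ins, and the sigmoid relaxation with gradient ascent merely illustrates how continuous gates can be optimized and then discretized into a guidance policy.

```python
import numpy as np

T = 20                               # toy number of diffusion steps
alpha = np.zeros(T)                  # logits of per-step gates g_t = sigmoid(alpha_t)
# hypothetical per-step "benefit" of applying CFG (stand-in for a quality gradient):
# early steps help a lot, late steps barely at all
benefit = np.linspace(1.0, 0.0, T)
cost = 0.5                           # penalty for the extra NFE of a CFG step

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(200):
    g = sigmoid(alpha)
    # gradient of the relaxed objective (benefit minus cost, gated by g),
    # passed through the sigmoid via the chain rule
    grad = (benefit - cost) * g * (1 - g)
    alpha += lr * grad               # gradient ascent on the relaxation

policy = sigmoid(alpha) > 0.5        # discretize gates into a guidance policy
```

Under this toy benefit profile, the discretized policy keeps CFG for the early steps and drops it for the late ones, mirroring the pattern the actual search discovers.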
The search uncovers an interesting pattern: CFG steps are important in the first half of the diffusion process but become redundant later.

We find that conditional and unconditional updates become increasingly correlated over time.
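This correlation can be probed directly by measuring the cosine similarity between the two score estimates at each step. The snippet below uses synthetic vectors that drift together over time, an assumed stand-in for real network outputs, purely to show the measurement one would run on actual conditional/unconditional predictions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two flattened score estimates."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# stand-in estimates: as denoising progresses, the unconditional prediction
# is modeled as drifting toward the conditional one (illustrative only)
rng = np.random.default_rng(1)
eps_cond = rng.standard_normal(512)
sims = []
for frac in np.linspace(0.0, 1.0, 5):      # 0 = start, 1 = end of denoising
    noise = rng.standard_normal(512)
    eps_uncond = frac * eps_cond + (1 - frac) * noise
    sims.append(cosine_sim(eps_cond, eps_uncond))
# similarity approaches 1, which is what makes the unconditional NFE
# redundant in the late steps
```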
We also find that AG produces results similar to CFG when using non-empty negative prompts, again highlighting that only the first T/2 diffusion steps matter for semantic structure.
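With a negative prompt, the "unconditional" branch of CFG is conditioned on that prompt instead of the empty string, so guidance steers away from its concepts; skipping guidance past the threshold works identically. A minimal sketch (the function name and flag are hypothetical):

```python
import numpy as np

def guided_eps(eps_cond, eps_neg, w, apply_guidance=True):
    """Negative-prompt CFG: guide away from the negative-prompt estimate.
    When apply_guidance is False (past the AG threshold), the second
    network evaluation is skipped entirely: 1 NFE instead of 2."""
    if not apply_guidance:
        return eps_cond
    return eps_neg + w * (eps_cond - eps_neg)

eps_c = np.array([1.0, 0.0])   # toy conditional estimate
eps_n = np.array([0.0, 1.0])   # toy negative-prompt estimate
full = guided_eps(eps_c, eps_n, w=2.0)                        # early step
late = guided_eps(eps_c, eps_n, w=2.0, apply_guidance=False)  # past threshold
```

Note that this is exactly why the negative-prompt capability survives under AG: the negative branch is still evaluated during the early steps that set the semantic structure.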
We apply our method to instruction-based editing with EMU Edit, which builds upon InstructPix2Pix. Classic CFG and our method produce edits of equal quality while reducing NFEs by 33.3%. Importantly, Guidance Distillation is not directly applicable to this task, as the update steps are conditioned on the input image.
@article{castillo2023adaptive,
author = {Castillo, Angela and Kohler, Jonas and Perez, Juan C. and Pérez, Juan Pablo and Pumarola, Albert and Ghanem, Bernard and Arbeláez, Pablo and Thabet, Ali},
title = {Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models},
journal = {ArXiv},
year = {2023},
}