Architecture

Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM).[1] Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise from training images, a process that can be thought of as a sequence of denoising autoencoders.
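A minimal sketch of this training objective, assuming a standard DDPM-style linear noise schedule and a stand-in denoiser `model` (the schedule values and names are illustrative, not Stable Diffusion's exact configuration):

```python
import torch

T = 1000                                        # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # cumulative product of (1 - beta_t)

def diffusion_loss(model, x0):
    """One training step: noise x0 at a random timestep, predict that noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    # The denoiser is trained to recover the injected noise (MSE objective).
    return torch.nn.functional.mse_loss(model(x_t, t), noise)
```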

Stable Diffusion consists of three parts: the variational autoencoder (VAE), the U-Net, and an optional text encoder.[11] The VAE encoder compresses the image from pixel space to a lower-dimensional latent space, capturing a more fundamental semantic meaning of the image.[12] Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion.[11] The U-Net block, composed of a ResNet backbone, denoises the output of forward diffusion in reverse to obtain a latent representation.
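A hedged sketch of the encode-and-noise path, with `vae_encoder` standing in for the VAE encoder; the 3×512×512 to 4×64×64 shape change shown in the comment reflects the latent compression commonly cited for Stable Diffusion but is illustrative here:

```python
import torch

def noisy_latent(vae_encoder, image, alphas_cumprod, t):
    """Compress an image to latent space, then apply forward diffusion at step t."""
    latent = vae_encoder(image)        # e.g. (1, 3, 512, 512) -> (1, 4, 64, 64)
    noise = torch.randn_like(latent)
    a_bar = alphas_cumprod[t]
    # Same closed-form forward process as in pixel space, applied to the latents.
    return a_bar.sqrt() * latent + (1.0 - a_bar).sqrt() * noise, noise
```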

Finally, the VAE decoder generates the final image by converting the representation back into pixel space.[11] The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism.[11] For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space.[1] Researchers point to increased computational efficiency for training and generation as an advantage of LDMs.[13][14]
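An illustrative sketch of the cross-attention step that injects text conditioning into the U-Net: queries come from the U-Net's latent features, while keys and values come from the text embeddings. The dimensions and projection layers are assumptions for the example, not the model's exact configuration:

```python
import torch

class CrossAttention(torch.nn.Module):
    def __init__(self, latent_dim=320, text_dim=768):
        super().__init__()
        # Project latent features to queries, text embeddings to keys/values.
        self.to_q = torch.nn.Linear(latent_dim, latent_dim, bias=False)
        self.to_k = torch.nn.Linear(text_dim, latent_dim, bias=False)
        self.to_v = torch.nn.Linear(text_dim, latent_dim, bias=False)

    def forward(self, latent_tokens, text_tokens):
        # latent_tokens: (batch, h*w, latent_dim); text_tokens: (batch, seq, text_dim)
        q = self.to_q(latent_tokens)
        k = self.to_k(text_tokens)
        v = self.to_v(text_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        # Each latent position attends over the text tokens, mixing in conditioning.
        return attn @ v
```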
