# Training procedures

The model was initially trained on the laion2B-en and laion-high-resolution subsets. The last few rounds of training used LAION-Aesthetics v2 5+, a subset of 600 million captioned images for which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked how much they liked them.\[18]\[15]\[19] The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images that LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability.\[15] The final rounds of training additionally dropped the text conditioning 10% of the time to improve classifier-free diffusion guidance.\[20]
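
The filtering and conditioning-dropout rules described above can be sketched as follows. This is a minimal illustration, not the actual training pipeline: the dictionary field names (`aesthetic_score`, `watermark_prob`) and both helper functions are hypothetical, and using an empty string as the "unconditional" caption is an assumption. The low-resolution filter is omitted because the text does not state its threshold.

```python
import random

# Thresholds taken from the paragraph above; everything else is illustrative.
MIN_AESTHETIC_SCORE = 5.0   # predicted human rating out of 10
MAX_WATERMARK_PROB = 0.8    # samples above this probability are excluded
CAPTION_DROP_RATE = 0.10    # fraction of samples trained without text


def keep_sample(sample):
    """Return True if a sample passes the aesthetic and watermark filters.

    `sample` is a hypothetical dict with `aesthetic_score` and
    `watermark_prob` keys.
    """
    return (
        sample["aesthetic_score"] >= MIN_AESTHETIC_SCORE
        and sample["watermark_prob"] <= MAX_WATERMARK_PROB
    )


def maybe_drop_caption(caption, rng, drop_rate=CAPTION_DROP_RATE):
    """Replace the caption with an empty string with probability `drop_rate`.

    Training on a mix of captioned and uncaptioned samples is what lets the
    model later produce both conditional and unconditional predictions.
    """
    return "" if rng.random() < drop_rate else caption


if __name__ == "__main__":
    rng = random.Random(0)
    sample = {"aesthetic_score": 6.2, "watermark_prob": 0.05}
    if keep_sample(sample):
        print(repr(maybe_drop_caption("a castle in Japan", rng)))
```

Passing an explicit `random.Random` instance keeps the dropout reproducible, which makes the 10% rate easy to check empirically.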

<figure><img src="https://66508924-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJaRV9HsOrKmAQxWrp4OQ%2Fuploads%2FDib6LifseqzaeJGZKO5b%2FX-Y_plot_of_algorithmically-generated_AI_art_of_European-style_castle_in_Japan_demonstrating_DDIM_diffusion_steps.png?alt=media&#x26;token=3b0dd185-7fa2-4487-a97c-349a58475213" alt=""><figcaption><p>The denoising process used by Stable Diffusion. The model generates an image by iteratively denoising random noise until a configured number of steps has been reached. The process is guided by the pretrained CLIP text encoder together with the attention mechanism, so that the final image depicts a representation of the trained concept.</p></figcaption></figure>
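
The iterative, text-guided denoising shown in the figure can be illustrated with a toy loop. This is only a sketch under stated assumptions: `predict_noise` is a hypothetical stand-in for the trained noise-prediction network, latents are plain Python lists, and the update rule is a deliberately simplified subtraction, not DDIM or any other real sampler. The only part taken from the standard formulation is the classifier-free guidance combination, which extrapolates from the unconditional prediction toward the text-conditioned one.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def denoise(latent, predict_noise, prompt, steps, guidance_scale, step_size=0.1):
    """Toy denoising loop that runs for a configured number of steps.

    `predict_noise(latent, t, prompt)` is a hypothetical callable returning a
    noise estimate with the same length as `latent`. An empty prompt is
    assumed to request the unconditional prediction.
    """
    for t in reversed(range(steps)):
        eps_uncond = predict_noise(latent, t, "")
        eps_cond = predict_noise(latent, t, prompt)
        eps = guided_noise(eps_uncond, eps_cond, guidance_scale)
        # Simplified update: remove a fixed fraction of the estimated noise.
        latent = [x - step_size * e for x, e in zip(latent, eps)]
    return latent
```

With `guidance_scale` set to 1 the combination reduces to the conditional prediction alone, and larger values push the result further toward the text condition.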

The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.\[21]\[22]\[23]
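
As a rough sanity check on these figures, the arithmetic below derives the implied wall-clock time and hourly price. It assumes, purely for illustration, that all 256 GPUs ran concurrently for the whole run and that the cost scales linearly with GPU-hours; neither assumption is stated in the source.

```python
GPUS = 256
GPU_HOURS = 150_000
TOTAL_COST_USD = 600_000

# Wall-clock duration if every GPU was busy for the entire run (assumption).
wall_clock_hours = GPU_HOURS / GPUS
wall_clock_days = wall_clock_hours / 24

# Implied average price per GPU-hour.
cost_per_gpu_hour = TOTAL_COST_USD / GPU_HOURS

print(f"{wall_clock_hours:.1f} hours")           # 585.9 hours
print(f"{wall_clock_days:.1f} days")             # 24.4 days
print(f"${cost_per_gpu_hour:.2f} per GPU-hour")  # $4.00 per GPU-hour
```

Under those assumptions, the run corresponds to a little over three weeks of continuous training at about $4 per GPU-hour.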
