Training procedures

The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on LAION-Aesthetics v2 5+, a subset of 600 million captioned images that the LAION-Aesthetics Predictor V2 predicted humans would, on average, rate at least 5 out of 10 when asked how much they liked them.[18][15][19] The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images that LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability.[15] The final rounds of training additionally dropped the text conditioning for 10% of training examples, a step that improves classifier-free diffusion guidance at sampling time.[20]
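
The 10% conditioning dropout can be illustrated with a short sketch: during training, each caption embedding is replaced by a learned unconditional ("null") embedding with probability 0.1, which is what later allows the sampler to blend conditional and unconditional predictions. This is a minimal illustration rather than the actual training code; `maybe_drop_conditioning`, `null_embedding`, and `drop_prob` are hypothetical names.

```python
import torch

def maybe_drop_conditioning(text_embeddings: torch.Tensor,
                            null_embedding: torch.Tensor,
                            drop_prob: float = 0.1) -> torch.Tensor:
    """Replace each caption embedding in the batch with the unconditional
    ("null") embedding with probability `drop_prob`.

    Illustrative sketch of the 10% text-conditioning dropout used to
    enable classifier-free guidance; all names here are hypothetical.
    """
    batch = text_embeddings.shape[0]
    # Bernoulli mask: True means "drop the caption for this example"
    drop = torch.rand(batch, device=text_embeddings.device) < drop_prob
    # Broadcast the null embedding over the dropped examples
    return torch.where(drop.view(batch, 1, 1),
                       null_embedding.expand_as(text_embeddings),
                       text_embeddings)
```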

The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.[21][22][23]
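
As a back-of-the-envelope check on those figures (assuming all 256 GPUs ran concurrently for the whole run, which the sources do not state):

```python
gpu_count = 256           # Nvidia A100 GPUs
gpu_hours = 150_000       # total GPU-hours reported
total_cost_usd = 600_000  # reported training cost

wall_clock_hours = gpu_hours / gpu_count        # ~586 hours
wall_clock_days = wall_clock_hours / 24         # ~24.4 days
cost_per_gpu_hour = total_cost_usd / gpu_hours  # $4.00 per GPU-hour

print(f"~{wall_clock_days:.1f} days wall-clock at "
      f"${cost_per_gpu_hour:.2f} per GPU-hour")
```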

Figure: the denoising process used by Stable Diffusion. The model generates an image by iteratively denoising random noise until a configured number of steps has been reached, guided by the pretrained CLIP text encoder through the attention mechanism, so that the final image depicts a representation of the trained concept.
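
The loop that caption describes can be sketched as follows. This is a simplified picture of classifier-free-guided sampling, not the actual Stable Diffusion sampler; `unet`, `scheduler`, `text_emb`, and `null_emb` are stand-ins for the real components (the denoising U-Net, the noise schedule, and the CLIP text embeddings).

```python
import torch

@torch.no_grad()
def sample(unet, scheduler, text_emb, null_emb,
           steps: int = 50, guidance_scale: float = 7.5,
           shape=(1, 4, 64, 64)):
    """Simplified classifier-free-guided denoising loop.

    `unet` is assumed to predict the noise in a latent given a timestep
    and a text embedding; `scheduler` is assumed to hold the timestep
    schedule and to step the latent toward less noise. Both are
    hypothetical stand-ins for illustration only.
    """
    latents = torch.randn(shape)  # start from pure random noise
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        # Predict the noise twice: with and without the text conditioning
        noise_cond = unet(latents, t, text_emb)
        noise_uncond = unet(latents, t, null_emb)
        # Classifier-free guidance: push the prediction toward the text
        noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        # One denoising step removes a fraction of the predicted noise
        # (real libraries may wrap the result in an output object)
        latents = scheduler.step(noise, t, latents)
    return latents  # decoded by the VAE to obtain the final image
```

The guidance step is where the 10% conditioning dropout from training pays off: because the model has also learned an unconditional prediction, the sampler can extrapolate away from it toward the text-conditioned one, strengthening how closely the image follows the prompt.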