Limitations

Stable Diffusion has issues with degradation and inaccuracies in certain scenarios. Initial releases of the model were trained on a dataset of 512×512 resolution images, meaning that the quality of generated images noticeably degrades when user specifications deviate from its "expected" 512×512 resolution;[24] the version 2.0 update of the Stable Diffusion model later introduced the ability to natively generate images at 768×768 resolution.[25] Another challenge is generating human limbs, due to the poor quality of limb data in the LAION database.[26] The model is insufficiently trained to understand human limbs and faces because these features are poorly represented in the database, so prompts asking for such images can confound the model.[27]
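The resolution constraint is visible in how the model is typically invoked. The sketch below is a minimal illustration, assuming the Hugging Face diffusers library and a publicly hosted v1.5 checkpoint; the model identifier, prompt, and output filenames are illustrative, not taken from the original text.

```python
# Illustrative sketch: requesting different output resolutions from a
# Stable Diffusion v1 checkpoint via the Hugging Face diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative model id
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a watercolor painting of a lighthouse at dusk"

# Generating at the native training resolution (512x512) generally behaves well.
native = pipe(prompt, height=512, width=512).images[0]

# Deviating far from the training resolution (e.g. 1024x1024 with a v1 model)
# often yields duplicated subjects and other artifacts, per the limitation above.
oversized = pipe(prompt, height=1024, width=1024).images[0]

native.save("native_512.png")
oversized.save("oversized_1024.png")
```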

Accessibility for individual developers can also be a problem. To customize the model for new use cases that are not covered by the dataset, such as generating anime characters ("waifu diffusion"),[28] new data and further training are required. Fine-tuned adaptations of Stable Diffusion created through additional retraining have been used for a variety of use cases, from medical imaging[29] to algorithmically generated music.[30] However, this fine-tuning process is sensitive to the quality of the new data: training on low-resolution images, or on images whose resolution differs from the original data, can not only fail to teach the model the new task but also degrade its overall performance. Even when the model is additionally trained on high-quality images, it is difficult for individuals to run models on consumer hardware. For example, the training process for waifu-diffusion requires a minimum of 30 GB of VRAM,[31] which exceeds the memory available in typical consumer GPUs, such as Nvidia’s GeForce 30 series, which offers around 12 GB.[32]
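As a rough illustration of that hardware gap, the short sketch below checks whether the local GPU meets a given VRAM requirement before attempting a fine-tuning run. It assumes PyTorch with CUDA available; the helper name and the 30 GB threshold (taken from the waifu-diffusion figure cited above) are illustrative.

```python
# Illustrative sketch: verify available GPU memory against a fine-tuning requirement.
import torch

def has_enough_vram(required_gib: float = 30.0) -> bool:
    """Return True if the first CUDA device has at least `required_gib` GiB of memory."""
    if not torch.cuda.is_available():
        return False
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    total_gib = total_bytes / (1024 ** 3)
    print(f"Detected {total_gib:.1f} GiB of VRAM (need {required_gib:.1f} GiB)")
    return total_gib >= required_gib

if __name__ == "__main__":
    # On a GeForce 30-series card with ~12 GB this typically prints False,
    # illustrating why full fine-tuning is out of reach on consumer hardware.
    print(has_enough_vram())
```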

The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was primarily trained on images with English descriptions.[22] As a result, generated images reinforce social biases and reflect a Western perspective; the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts written in English than for those written in other languages, with Western or white cultures often being the default representation.[22]
