Text-to-image generation

The text-to-image sampling script within Stable Diffusion, known as "txt2img", takes a text prompt along with assorted option parameters covering sampling types, output image dimensions, and seed values. The script outputs an image file based on the model's interpretation of the prompt.[1] Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion,[1] although this watermark loses its efficacy if the image is resized or rotated.[39]
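The same prompt-in, image-out interface is exposed by third-party implementations. The following is a minimal sketch using the Hugging Face diffusers library rather than the original reference script; the checkpoint identifier is illustrative, and any Stable Diffusion model hosted in the same format would work.

```python
# Minimal text-to-image sketch using the third-party "diffusers" library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint name
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The prompt and output dimensions mirror the kinds of options
# accepted by the original txt2img script.
image = pipe(
    "a photograph of an astronaut riding a horse",
    height=512,
    width=512,
).images[0]
image.save("astronaut.png")
```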

Each txt2img generation involves a specific seed value that affects the output image. Users may randomize the seed to explore different generated outputs, or reuse the same seed to reproduce a previously generated image.[24] Users can also adjust the number of inference steps for the sampler; a higher value takes longer to run, while a lower value may result in visual defects.[24] Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt.[20] More experimental use cases may opt for a lower scale value, while use cases aiming for more specific outputs may use a higher value.[24]
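As a sketch of how these three settings are typically exposed, the diffusers library (again a third-party implementation, with an illustrative checkpoint name) accepts a seeded random generator, a step count, and a guidance scale as call parameters:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A fixed seed makes the run reproducible: the same seed, prompt, and
# settings yield the same image; a fresh random seed explores new outputs.
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    num_inference_steps=50,  # more steps: slower, but fewer visual defects
    guidance_scale=7.5,      # higher values adhere more closely to the prompt
    generator=generator,
).images[0]
image.save("lighthouse.png")
```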

Additional txt2img features are provided by front-end implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt. Emphasis markers allow users to add or reduce emphasis on keywords by enclosing them in brackets.[40] An alternative method of adjusting the weight of parts of the prompt is "negative prompts". Negative prompts, a feature included in some front-end implementations such as Stability AI's own DreamStudio cloud service, allow the user to specify prompts which the model should avoid during image generation. The specified prompts may be undesirable image features that would otherwise appear in outputs due to the positive prompts provided by the user, or due to how the model was originally trained, with mangled human hands being a common example.[38][2]
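Negative prompting is also available programmatically in some implementations. In the diffusers library, for instance, a negative prompt can be passed alongside the positive prompt; the sketch below assumes that library's API and an illustrative checkpoint name. Bracket-based emphasis markers, by contrast, are parsed by specific front ends and are not shown here.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The negative prompt lists features the model should steer away from
# during guidance, such as common artifacts like mangled hands.
image = pipe(
    prompt="portrait photo of a violinist, studio lighting",
    negative_prompt="mangled hands, extra fingers, blurry, low quality",
    guidance_scale=7.5,
).images[0]
image.save("violinist.png")
```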
