Stable Diffusion v1.5

Property	Value
License	CreativeML OpenRAIL-M
Authors	Robin Rombach, Patrick Esser
Training Data	LAION-2B Dataset
Paper	High-Resolution Image Synthesis With Latent Diffusion Models (CVPR 2022)

What is stable-diffusion-v1-5?

Stable Diffusion v1.5 is a latent diffusion model for text-to-image generation. It was initialized from the v1.2 checkpoint and fine-tuned for 595,000 steps at 512x512 resolution on the "laion-aesthetics v2 5+" dataset. During fine-tuning the text conditioning was dropped for 10% of the samples, so the model also learns the unconditional prediction that classifier-free guidance sampling relies on.
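The link between the 10% dropout and classifier-free guidance can be illustrated with a minimal sketch. It is not the training or sampling code of the model: the function names are hypothetical, the empty string stands in for the "no prompt" condition as an assumption, and noise predictions are plain lists of floats rather than tensors.

```python
import random
from typing import List, Optional


def maybe_drop_conditioning(prompt: str,
                            drop_prob: float = 0.10,
                            rng: Optional[random.Random] = None) -> str:
    """Training side: with probability drop_prob, replace the prompt by an
    empty prompt so the model also learns an unconditional prediction.
    (Using "" as the empty prompt is an assumption of this sketch.)"""
    source = rng if rng is not None else random
    return "" if source.random() < drop_prob else prompt


def classifier_free_guidance(eps_uncond: List[float],
                             eps_cond: List[float],
                             guidance_scale: float) -> List[float]:
    """Sampling side: move from the unconditional noise prediction toward
    (and, for scales above 1, past) the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With a guidance scale of 1 the result equals the conditional prediction, with 0 it equals the unconditional one, and larger values push the sample harder toward the prompt.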

Implementation Details

The model combines an autoencoder with a diffusion model trained in the autoencoder's latent space. Text prompts are encoded with a CLIP ViT-L/14 text encoder, and a UNet backbone performs the denoising in latent space. Training ran on 32 x 8 A100 GPUs (256 in total) with a batch size of 2048 and the AdamW optimizer.

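The autoencoder uses a relative downsampling factor of 8, so a 512x512 image maps to a 64x64 latent
The non-pooled output of the text encoder is fed into the UNet via cross-attention
The learning rate is warmed up to 0.0001 over the first 10,000 steps and then kept constant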
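The numbers above can be checked with a small sketch. Two points are assumptions rather than statements from this page: the latent is taken to have 4 channels, and the warmup is taken to be linear (only the target rate and the warmup length are given here). The function names are hypothetical.

```python
from typing import Tuple


def latent_shape(height: int, width: int,
                 factor: int = 8,
                 latent_channels: int = 4) -> Tuple[int, int, int]:
    """Shape (channels, height, width) of the latent for an image of the
    given size. latent_channels=4 is an assumption of this sketch."""
    if height % factor or width % factor:
        raise ValueError(f"height and width must be multiples of {factor}")
    return (latent_channels, height // factor, width // factor)


def learning_rate(step: int,
                  peak_lr: float = 1e-4,
                  warmup_steps: int = 10_000) -> float:
    """Learning rate at a given optimizer step: a linear ramp (assumed)
    up to peak_lr during warmup, then constant."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


# 2048 samples per step spread over 32 x 8 GPUs gives 8 samples per GPU
# per optimizer step (possibly split further by gradient accumulation).
samples_per_gpu = 2048 // (32 * 8)
```

For a 512x512 input, `latent_shape(512, 512)` gives a 64x64 latent, which is why diffusion in latent space is far cheaper than diffusion over the 512x512 pixel grid.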

Core Capabilities

High-quality text-to-image generation at 512x512 resolution
Fine-tuned on an aesthetics-filtered dataset ("laion-aesthetics v2 5+") for more visually appealing output
Classifier-free guidance sampling, supported by the 10% text-conditioning dropout used in training
Usable from common frameworks, including Diffusers, ComfyUI, and Automatic1111
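As an illustration of the Diffusers route, the following is a minimal sketch rather than an official recipe. The repository ID in `MODEL_ID`, the default guidance scale of 7.5, and the 50 inference steps are assumptions; replace them with the checkpoint location and settings you actually use. The helper names are hypothetical, and `generate` needs `torch`, `diffusers`, and a CUDA GPU.

```python
# Assumed Hugging Face repository ID; substitute the one you have access to.
MODEL_ID = "stable-diffusion-v1-5/stable-diffusion-v1-5"


def build_call_kwargs(prompt, height=512, width=512,
                      guidance_scale=7.5, num_inference_steps=50):
    """Validate and collect the arguments passed to the pipeline call.
    Height and width must be multiples of 8 (the latent downsampling factor)."""
    if height % 8 or width % 8:
        raise ValueError("height and width must be multiples of 8")
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "guidance_scale": guidance_scale,
        "num_inference_steps": num_inference_steps,
    }


def generate(prompt, model_id=MODEL_ID, device="cuda", **overrides):
    """Load the pipeline and return one PIL image for the prompt."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to(device)
    return pipe(**build_call_kwargs(prompt, **overrides)).images[0]
```

A call such as `generate("a photo of an astronaut riding a horse").save("astronaut.png")` would download the weights on first use and write the result to disk.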

Frequently Asked Questions

Q: What makes this model unique?

Version 1.5 continues from the v1.2 checkpoint with a further 595k fine-tuning steps on a dataset filtered for aesthetic quality. The 10% text-conditioning dropout used during that fine-tuning improves classifier-free guidance sampling, and the checkpoint works with common frameworks such as Diffusers, ComfyUI, and Automatic1111.

Q: What are the recommended use cases?

The model is intended primarily for research: studying the safe deployment of generative models, artistic applications, educational tools, and research on generative models themselves. Generating harmful content is explicitly out of scope, and use is governed by the CreativeML OpenRAIL-M license.