The thumbnail of this article was generated using Stable Diffusion, with the prompt “A dream of a distant galaxy, by Caspar David Friedrich, matte painting trending on artstation HQ”. This article aims to simply explain how Stable Diffusion does this.
Stable Diffusion is a text-to-image model. This means you input text, and it outputs an image. And as the name suggests, it is also a diffusion-based model.
What is diffusion?
Diffusion is the process of repeatedly adding small amounts of noise to an image. The noise is added step by step, and after enough steps the final output bears no visual resemblance to the original image.
Diffusion models are trained to reverse this process. This means they turn noise into images!
Diffusion models work by learning a function which reverses the noise-adding process. The more steps the model takes (the more times the image is put through the function), the clearer the image will become.
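The two halves of this process can be sketched in a few lines of NumPy. This is a toy illustration, not the real algorithm: the `denoise_step` below cheats by nudging the image back towards the known original, whereas a real diffusion model is a trained neural network that predicts the noise without ever seeing the clean image.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, num_steps=10, noise_scale=0.1):
    """Forward diffusion: repeatedly mix small amounts of Gaussian noise into the image."""
    noisy = image.copy()
    for _ in range(num_steps):
        noisy = noisy + noise_scale * rng.standard_normal(image.shape)
    return noisy

# A toy 4x4 "image": a simple gradient.
image = np.linspace(0.0, 1.0, 16).reshape(4, 4)
noisy = add_noise(image)

def denoise_step(x, target, strength=0.3):
    """Stand-in for the learned reverse function: moves the noisy image
    a small step back towards the clean one. (A real model predicts this
    step from the noisy input alone.)"""
    return x + strength * (target - x)

# The more steps we take, the closer we get to a clean image.
x = noisy
for _ in range(20):
    x = denoise_step(x, image)

print(np.abs(x - image).mean() < np.abs(noisy - image).mean())  # True: the error shrinks
```

The point of the loop at the end is exactly the claim above: each pass through the reverse function makes the image a little clearer.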
What are embeddings?
Stable Diffusion was trained on 512×512 images. However, training directly on images of this size is a very slow and expensive process: a single 512×512 image contains 262,144 pixels!
To make this whole process more efficient, Stable Diffusion is trained on the embeddings of the images rather than on the pixels themselves.
Embeddings are smart compressions of data (images, text, audio, etc). They are numerical representations which try to preserve all of the important detail of the original data whilst discarding everything else. These embeddings are created by sending data through encoders.
Take for example the prompt used to create the thumbnail. To create an embedding of this, the prompt is first split up into pieces like this:

Then, each of these pieces is assigned a token:
This string of numbers you see above is the embedding of our text prompt. Also note that these tokens are consistent across the whole neural network. For example, the word “A” will always be token 32, and the word “dream” will always be token 4320.
You can tokenise your own sentences on the OpenAI website.
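Here is a toy tokeniser that shows the idea. The vocabulary is made up for illustration, apart from the two IDs mentioned above (32 for “A”, 4320 for “dream”); a real tokeniser uses the model's full learned vocabulary and splits on sub-words, not just whole words.

```python
# Made-up vocabulary for illustration only (real models have tens of
# thousands of entries, and IDs come from the trained tokeniser).
vocab = {"a": 32, "dream": 4320, "of": 17, "distant": 861, "galaxy": 2940}

def tokenise(prompt):
    """Split the prompt into words and look up each word's token ID."""
    return [vocab[word] for word in prompt.lower().split()]

print(tokenise("A dream of a distant galaxy"))
# [32, 4320, 17, 32, 861, 2940]
```

Note that “A” appears twice in the prompt and maps to 32 both times, which is exactly the consistency described above.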
These embeddings are then stored in a place called the latent space. This can be thought of as a huge multidimensional room in which information can be stored. Just like in the brain, bits of information are not stored in single locations but are spread throughout the latent space.
We can do exciting things in the latent space, such as calculating the distance between two embeddings. This distance tells us how related the two embeddings are. If we calculated the distance between a text embedding and an image embedding, we would be able to tell if the text describes the image.
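A common way to measure this is cosine similarity: the closer two embeddings point in the same direction, the more related they are. The embeddings below are tiny hypothetical 4-dimensional vectors with made-up numbers (real embeddings have hundreds of dimensions), just to show the calculation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Ranges from -1 to 1; higher means more related."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings (made-up values for illustration).
text_emb  = np.array([0.9, 0.1, 0.0, 0.4])   # embedding of some text
image_emb = np.array([0.8, 0.2, 0.1, 0.5])   # an image that matches the text
other_emb = np.array([-0.7, 0.9, 0.3, -0.2]) # an unrelated image

# The matching image scores higher than the unrelated one.
print(cosine_similarity(text_emb, image_emb) > cosine_similarity(text_emb, other_emb))  # True
```

This is the mechanism behind “does this text describe this image?”: compare the text embedding against each image embedding and pick the closest.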
This is an important idea. E.g., Google Search uses embeddings to match text to text and text to images; Snapchat uses them to “serve the right ad to the right user at the right time”; and Meta (Facebook) uses them for their social search.
Back to Stable Diffusion
But diffusion models have been around for ages. What makes models like DALLE 2 and Stable Diffusion so special?
The first major difference, as we have already learnt, is that Stable Diffusion is trained on these image embeddings rather than on the original images.
Another major difference is that whilst earlier diffusion models could create coherent-looking images, there was no way to guide or describe the output. If a diffusion model was trained on pictures of dogs or faces, it would output random pictures of dogs and faces, with no way to add context to steer it towards the sort of output you want.
Check out This Person Does Not Exist for an example of an unguided diffusion model.
Stable Diffusion is not only trained on removing noise from images, but also on context. The first version of Stable Diffusion was trained on text context, but further releases have added image context too. This allows images to be described through words or other images, so the output can be guided.
This context is injected into the latent space as well. The context helps the diffusion model know where in the latent space it needs to retrieve information from.
With all this, we can provide Stable Diffusion with context (e.g. the words “a cat wearing a yellow raincoat”) which will guide the diffusion process. This process starts with an image of random noise, and with each step, denoises it into a coherent image.
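The guided loop can be sketched like this. Everything here is simplified for illustration: the context is a made-up vector standing in for the prompt's embedding, and the `denoise` function fakes the trained network by pulling the latent towards the context, just to show the shape of the process (start from noise, denoise step by step, guided by the context).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up context embedding standing in for "a cat wearing a yellow raincoat".
context = np.array([0.3, -0.8, 0.5, 0.1])

def denoise(latent, context, strength=0.2):
    """Stand-in for the trained model: a real network would predict and
    remove a little noise, conditioned on the context embedding. Here we
    just pull the latent a small step towards the context."""
    return latent + strength * (context - latent)

# Start from pure random noise and denoise step by step.
latent = rng.standard_normal(4)
for step in range(50):
    latent = denoise(latent, context)

print(np.allclose(latent, context, atol=0.01))  # True: noise has become the guided target
```

The real model ends not with the context vector itself but with an image latent that matches the context; the decoder then turns that latent back into pixels.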
Why is Stable Diffusion so important?
As YouTuber Eden Myer pointed out, there are 4 main reasons for Stable Diffusion’s great success.
- It gives high quality generations
- It’s free to use
- It’s open source
- It has low computational requirements
These qualities are not only great for users, hence its widespread popularity, but are also important for the AI field as a whole. Only a few months passed between the release of DALLE 2 and the release of Stable Diffusion, and whilst DALLE 2 also produces impressive results, it lacks the other 3 qualities mentioned above. Being open source allows anyone to modify the model and improve upon it. It also allows the knowledge gained during the model’s development to be shared with the wider community. And the importance of low computational requirements cannot be stressed enough. For AI to be consumer-friendly, it must be able to run on consumer hardware. Stable Diffusion can run on computers with as little as 6GB of VRAM, meaning it can run on most consumer GPUs. Previously, multiple high-end GPUs would have been required.
You can try Stable Diffusion yourself using the Dream Studio website. New accounts currently get $2 in credit, which is equal to around 200 image generations. You can also lower the number of denoising steps to get slightly lower-quality outputs that are computationally cheaper, with the benefit of your free balance lasting longer.
I have also put together an Android app which introduces you to Stable Diffusion and lets you start experimenting with it for free. It’s called “Make AI Art”, and can be found here on the Play Store.