How to train an AI on your face to create silly portraits – Ars Technica
By now, you’ve read a lot about generative AI technologies such as Midjourney and Stable Diffusion, which translate text input into images in seconds. If you’re anything like me, you immediately wondered how you could use that technology to slap your face onto the Mona Lisa or Captain America. After all, who doesn’t want to be America’s ass?
I have a long history of putting my face on things. Previously, doing so was a painstaking process of finding or taking a picture with the right angle and expression and then using Photoshop to graft my face onto the original. While I considered the results demented yet worthwhile, the process required a lot of time. But with Stable Diffusion and Dreambooth, I’m now able to train a model on my face and then paste it onto anything my strange heart desires.
In this walkthrough, I’ll show you how to install Stable Diffusion locally on your computer, train Dreambooth on your face, and generate so many pictures of yourself that your friends and family will eventually block you to stop the deluge of silly photos. The entire process will take about two hours from start to finish, with the bulk of the time spent babysitting a Google Colab notebook while it trains on your images.
Before we begin, a couple of notes:
For this walkthrough, I’m working on a Windows computer with an Nvidia 3080Ti that has 12GB VRAM. To run Stable Diffusion, you should have an Nvidia graphics card with a minimum of 4GB of video RAM. Stable Diffusion can run on Linux systems, Macs that have an M1 or M2 chip, and AMD GPUs, and you can generate images using only the CPU. Those methods require some tinkering, though, so for the purposes of this walkthrough, a Windows machine with an Nvidia GPU is preferred.
When it comes to generative image programs like Stable Diffusion, there are ethical concerns I feel I should acknowledge. There are valid questions surrounding how the data used to train Stable Diffusion was gathered and whether it’s ethical to have trained the program on an artist’s work without their consent. It’s a big topic that’s outside the scope of this walkthrough. Personally, I use Stable Diffusion as an author to help me create quick character sketches, and it’s become an invaluable part of my process. I don’t, however, think work created by Stable Diffusion should be commercialized, at least until we settle the ethical dilemmas and determine how to compensate artists who might have been exploited. And for the time being, I feel that Stable Diffusion should remain for personal use only.
Lastly, tech like Stable Diffusion is simultaneously exciting and terrifying. It’s exciting because it gives people like me, who peaked artistically with fingerpaints in kindergarten, the ability to create the images I imagine. But it’s terrifying because it can be used to create frighteningly realistic propaganda and deepfakes with the potential to ruin people’s lives. So you should only train Stable Diffusion on photos of yourself or someone who has given you consent. Period.
Now, who’s ready to do this?
Installing and using Stable Diffusion
There are a number of programs you can use to run Stable Diffusion locally. For this walkthrough, I’ve chosen Easy Diffusion (formerly known as Stable Diffusion UI) for its ease of installation and because it has an auto-update function to ensure you always have the latest version. There are other installations that offer different and more customizable experiences, such as InvokeAI or Automatic1111, but Easy Diffusion is user-friendly and a breeze to install, which makes it a perfect place to begin.
To grab the installer, go here and click on “Download,” which will take you to the download page. From there, click the “Download for Windows” link. Doing so will download a zip file called “stable-diffusion-ui-windows.”
Right-click that file and extract the files. You should now have a folder called “stable-diffusion-ui-windows.” Navigate to that folder and find the sub-folder called “stable-diffusion-ui.” Move that folder to the root level of your hard drive.
Ideally, you should place this at the root level of your C: drive. Be aware, however, that generating images can take up a lot of space, so if space is limited, you can install this on a secondary drive as long as it’s located at the root level.
Once you’ve moved the folder, go into it and find the Command Script file called “Start Stable Diffusion UI.” Double-click it. You’ll see a security warning pop-up notifying you that the program is from an unknown publisher. Obviously, you should be careful what you download and install on your computer. I’ve been running Easy Diffusion for a while now without issue. When you’re ready, uncheck the box and click “Run.”
At that point, a Windows command line window will open and the install process will begin. It’s a good time to make a sandwich or run to the bathroom, as the installation can take anywhere from 10 to 30 minutes, depending on the speed of your Internet connection. The beauty of Easy Diffusion—and why the installation takes a little while—is that it’s downloading everything you need, including the model you’ll use to generate images. Easy Diffusion comes loaded with the model version 1.4 from Stability AI. There are numerous models—a couple of official models from Stability AI and a whole slew of custom models made by the Stable Diffusion community—and I’d encourage you to check them out when you’ve gotten more comfortable with the software. For the purposes of this walkthrough, the 1.4 model is sufficient.
If you decide to explore and download different models, be aware that there are two types of files: checkpoints (.ckpt, also referred to as pickle format) and safetensors. Most of the Stable Diffusion community has moved to using safetensors because it’s a more secure file format. When given the choice, always prefer safetensors over pickles, and be sure to scan anything you download.
I hope your sandwich was good because it’s time to move on.
When the installation is complete, the last line in the command line window should read "loaded stable-diffusion model from C:\stable-diffusion-ui\models\stable-diffusion\sd-v1-4.ckpt to device: cuda:0," and a browser window will open to the Easy Diffusion start page.
Don’t close the command line window. Just minimize it and forget it. It will open every time you use Stable Diffusion. You won’t need to interact with it except to close it when you’re finished. This is also a good time to mention that you should go back into your C:\stable-diffusion-ui folder and make a shortcut of the “Start Stable Diffusion UI.cmd” file and drop it on your desktop so you don’t have to navigate to the folder each time you want to start the program.
Using Easy Diffusion
You’ve installed Stable Diffusion! Good job. So what does all this stuff do? Most people will want to get to the part where you put your face on stuff, so I won’t spend much time going over the UI, but here are the basics.
Prompt is the text that will turn into an image. For example, "an oil painting of a cat holding a balloon."
Negative Prompt is where you include the things you’d like to exclude.
Below, in the Image Settings box, we have Seed. Leave this at random for now.
Number of Images (total) is exactly what it sounds like. Number of Images in Parallel is the number of images in each batch. So if you want to make five images, you could make five total with one in each batch or five total with five in each batch, which is significantly faster. How many you’ll be able to create will be affected by how much VRAM your GPU has, as well as the image size, which I’ll go over later. I’d suggest experimenting to find the optimum combination that works for your system.
Model is the checkpoint or safetensors file I discussed earlier. If you have multiple models (which we will once we complete the Dreambooth training), this dropdown is where you’ll choose which model to use.
Custom VAE. VAE stands for variational autoencoder. In the context of Stable Diffusion, it can help improve some of a model’s deficiencies. The custom VAE that Easy Diffusion comes with, vae-ft-mse-840000-ema-pruned, smooths out some of the model’s problems with human eyes and hands. Different models may have custom VAEs, but I rarely use anything other than the one included.
Sampler. These are basically the mathematical formulas that turn noise into pictures, and since I barely passed high school algebra, I can’t get much deeper than that. What I do know is that different samplers will give you different images, even using the same seed. The three I use most often are Euler Ancestral, DDIM, and DPM++ SDE. They each have their pros and cons, and I’d recommend experimenting to decide which you like. I prefer DDIM because it provides consistent results quickly in a small number of steps.
The Image Size dropdowns control the size of the image produced. The larger the image, the more VRAM you'll need. For now, you'll probably want to stick to 512×512, not only because it won't tax your GPU as much but also because the Stable Diffusion model you'll most likely be using was trained on a data set of images sized 512 px by 512 px, so you'll get the best results by sticking with that standard.
Inference Steps are the number of passes the program makes in attempting to convert the noise into an image. Conventional wisdom generally tells us that more is better, but that’s not always the case here. At a certain point, higher steps become a waste of time. This is another area where experimenting is important. I mentioned earlier that I prefer DDIM as my sampler because I can usually produce good images in 15 to 20 steps.
Below that is the Guidance Scale, which tells Stable Diffusion how much to pay attention to your prompt. Lower values give it more leeway for interpretation; higher values make it more strict. Again, higher values aren’t always better. I find that with a good prompt, values in the 6 to 9 range work best.
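If you ever outgrow the UI, the same settings map almost one-to-one onto Hugging Face's diffusers Python library. This is a minimal sketch, not Easy Diffusion's own code; the model ID, prompt, seed, and values mirror the defaults discussed above and are only illustrative. It requires an Nvidia GPU and will download the model the first time it runs.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load a Stable Diffusion v1.x model (downloads a few GB on first run)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Sampler: swap in DDIM, the sampler I recommended above
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Seed: fixing it makes a run reproducible (the Seed field in the UI)
generator = torch.Generator("cuda").manual_seed(1234)

image = pipe(
    prompt="an oil painting of a cat holding a balloon",
    negative_prompt="blurry, low quality",
    height=512, width=512,      # Image Size
    num_inference_steps=20,     # Inference Steps
    guidance_scale=7.5,         # Guidance Scale
    generator=generator,
).images[0]
image.save("cat.png")
```

Every knob in the Easy Diffusion sidebar is just one of these parameters under the hood.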
Hypernetworks are a type of specific training that is beyond the scope of this walkthrough.
Output Format is self-explanatory. It defaults to JPG; PNGs are lossless and look better, but they can take up a lot of space.
The render settings are generally self-explanatory. I tend to leave “Fix incorrect faces and eyes” off because the face correction often tends to make people look airbrushed in an unpleasant way.
I also leave “Scale up” off because it adds time to your render to scale up every single image, and Easy Diffusion’s UI conveniently places an Upscale button on each image so that if you get one you like, you can upscale it then.
The image modifiers are buttons that can help you craft a prompt. I won’t go over those, but feel free to experiment. I will say, however, that I don’t agree with copying the style of living artists. There are plenty of dead artists to copy and a ton of ways to achieve a unique style without including the names of living artists.
Now we’ve come to the fun part. It’s time to train the model that will let us stick our faces on things. There are a couple of methods to go about this. I’ll walk you through the Dreambooth method because, while it takes a little more work, I think it creates the best results. For this, we’ll be using a Google Colab notebook.
But first, we need to choose the photos to use. You can train the model on your face with as few as six images, and I've done a training run that used more than 100, but I've found the best results with between 20 and 30 images. You'll want to choose close-up headshots that don't have anyone else in the frame and don't have a busy background. Ideally, you should have a couple taken straight on, a couple from the side, and a couple of three-quarter shots. Try to avoid selfies because the perspective throws off the proportions of your face.
You should also avoid using pictures taken in the same place or those in which you’re wearing the same clothing. If there’s a clock behind you in every picture, Dreambooth will associate the clock with your face and Stable Diffusion will attempt to produce it in any pictures of you that it generates. The same goes with glasses or facial hair. For my training, I chose photos with a variety of hairstyles, facial expressions, and ages to give the model more flexibility. It’s not totally necessary to include photos from the shoulders down, but you can if you want.
Once you've selected your photos, you'll need to crop and resize them so that they're all 512 px by 512 px. Try to use the PNG file format to avoid JPG artifacts. If you don't crop them yourself, the program will do it for you, and it's more than likely that your face won't be properly centered. Before you do this, you'll need to think of a trigger word (nonsense or gibberish) that you'll use when you want the model to produce your face. For example, I use shauniedarko because it's easy to remember, it's my Twitter handle, and it won't be a word in the model's training data. If you use a word that appears in the training data, you won't get good results.
So you have your trigger word and your photos. Rename each file so that it follows the pattern “triggerword (x).png,” where x is a number.
Hugging Face token
You’ll also need a token from Hugging Face, the repository for Stable Diffusion models. To do that, go to huggingface.co and create an account. Next, go into Settings and then into Access Tokens and create a new token. Name it whatever you want and change the Role to “Write.” Copy the token and paste it into a plain text file called “token.txt.” Then open this link, accept the terms and conditions, and you’re all set.
Now we’re ready to begin training. You’ll need a Google account for this. Every Google account comes with 15GB of free storage. You’ll want to have 6 to 8GB free before starting.
The Google Colab we’ll be using is here. Click that link, and let’s get started. Make sure you’re signed into Google.
Run the first section by clicking the play button. You’ll get a popup that says the notebook wasn’t authored by Google but by Github. Click “Run Anyway.” It will ask you to allow it to connect your Google Drive. This is a requirement; it allows the notebook to save and access the necessary files.
Next, click the play button beside “Dependencies.” This will install all the necessary files for running the notebook. You’ll see a green checkmark beside it when it’s finished.
Before completing the next section, open your Google Drive by going to drive.google.com. Under My Drive, you should see a folder called "Fast-Dreambooth." Open that folder and drop the token.txt file into it. Then return to the Colab notebook and go to the section called "Model Download." The only thing you need to enter here is "runwayml/stable-diffusion-v1-5" in "Path_to_Huggingface." Once done, click the play button and wait. It will say "Done!" when complete.
Under the “Create/Load Session,” give the session a name and press the play button.
Moving on to the Instance Images section, click the play button. At the bottom, a "Choose Files" button will appear. Click that and select the 30 images you prepared earlier. It will take a few moments to upload them all. This is a good time to let you know that Google Colab has an annoying time limit for inactivity, so it's best to do this all in one go. If you get kicked off, you'll need to start over. The best way to prevent that is to scroll around occasionally; you can collapse and expand the previous sections to let Google know you're still breathing.
Once your images are done uploading, we’re ready to move on. You can skip the Captions and Concept Images sections. Those are helpful if you’re training on a specific concept or style. Since you’re training Dreambooth on your face, they aren’t necessary.
Under the Training tab, in the Start Dreambooth section, you’ll want to go to the “UNet_Training Steps” section. The standard number of steps is 100 times the number of images you’re using. Since we’re using 30 images, we want to enter 3,000 steps. Too few steps and the training won’t reproduce your face. Too many steps and it won’t be flexible enough to reproduce your face in a variety of styles.
After entering the number of steps, go to "Text_Encoder_Training_Steps" and change it to 1,050. This number should be about 35 percent of the number of UNet steps you chose, so if you had chosen 2,000 steps, this number would be 700. You can leave most of the rest of the settings alone. However, check the "Save Checkpoint Every n Steps" box and choose how often you'd like it to save the model. We're training for 3,000 steps, so I'd suggest saving every 500 steps starting at 500 steps. That way, if something goes wrong, you can load an earlier version and continue training from there rather than starting from scratch.
Be warned, however, that each save requires approximately 2GB, so you’ll want to make sure you have enough space. If you run out of space while the program is running, it will go off the rails and cause you a headache. For those who don’t have much space, saving every 1,500 steps starting at 1,500 steps gives you a save file at the halfway point.
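If you're training on a different number of photos, the two step counts above follow a simple rule of thumb, which this tiny Python helper captures. The 100-steps-per-photo and 35 percent figures are the ones from this walkthrough, not hard limits; treat them as starting points.

```python
def dreambooth_steps(num_images: int) -> dict:
    """Walkthrough rule of thumb: 100 UNet training steps per photo,
    with text-encoder steps at about 35 percent of the UNet total."""
    unet = num_images * 100
    return {
        "unet_training_steps": unet,
        "text_encoder_training_steps": unet * 35 // 100,  # 35 percent
    }


print(dreambooth_steps(30))  # 30 photos -> 3,000 UNet steps, 1,050 text-encoder steps
```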
Once you’ve done that, click the play button and wait.
Training on 30 images should take about 30 to 50 minutes depending on how Google is feeling that day. Again, make sure you don’t walk away and that you scroll around to prevent Google from kicking you out due to inactivity.
When the training is complete, open your Google Drive and go to Fast-DreamBooth > Sessions > YourSessionName and locate the .ckpt file with the name you gave the session earlier. Move that file into the C:\stable-diffusion-ui\models\stable-diffusion folder on your computer. Then open Easy Diffusion. Let's start making some faces.
We’ve already been over how to create images in Easy Diffusion. So we’ll start by loading your new model so that you can create images of yourself. In the Model dropdown in the Image Settings, choose the file you just created. Let’s start with something easy. In the Prompt box, write something like “a superhero, oil on canvas, realistic, dark colors.” After you generate the images, you should see a bunch of random superhero portraits.
But wait. None of these look like me. Right. Because creating pictures of you requires a trigger word, which will be the name you used for your image files. In my case, it was “shauniedarko.”
Crafting a prompt requires some trial and error. The words closer to the beginning of the prompt are often weighted more than later words. You’ll have to experiment to see what order of words works best. In this case, I’ll write, “shauniedarko as a superhero, oil on canvas, realistic, dark colors.”
And there’s my face. Stable Diffusion can be tricky. Sometimes you might have to generate 20 or 30 images to find one you like.
Let’s try some other prompts.
And that's all there is to it. If you're not getting the results you want, take a look at the pictures you used for training. You can also go back to your Google Drive, grab one of the earlier checkpoints it saved, and drop it into the same model folder where you put the first one. Additionally, learning how to craft a prompt that gets you the results you want can take some time, so I suggest experimenting to see what works and what doesn't.