Fun With Diffusion Models!

Catherine Chu

Part A: The Power of Diffusion Models

Setup

The diffusion model used in Part A is DeepFloyd IF, a two-stage model trained by Stability AI. To start, the following outputs are generated for the displayed captions. The outputs reflect the text prompts quite accurately. Moreover, as num_inference_steps increases, the outputs become less noisy and more detailed.

snowy_mountain_village
An Oil Painting of a Snowy Mountain Village
man_wearing_hat
A Man Wearing a Hat
rocket_ship
A Rocket Ship
snowy_mountain_village_5
num_inference_steps=5
snowy_mountain_village
num_inference_steps=20
snowy_mountain_village_30
num_inference_steps=30

1.1 Implementing the Forward Process

The forward process involves taking a clean image and adding noise. The alpha_cumprod list is used to compute a noisy image x_t at timestep t, where larger t corresponds to more noise. The forward(im, t) function is implemented, and the following images of the Campanile (64 x 64) are shown at different noise levels.
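As a sketch of this step: assuming a precomputed cumulative-product schedule (a toy linear-beta schedule here, standing in for the model's actual one), the forward process is a single closed-form equation, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.

```python
import numpy as np

def forward(im, t, alphas_cumprod, rng=np.random.default_rng(0)):
    """Noise a clean image im to timestep t:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(im.shape)
    return np.sqrt(abar_t) * im + np.sqrt(1.0 - abar_t) * eps

# Toy linear schedule (stand-in for the model's real schedule).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = np.zeros((64, 64))                  # stand-in for the 64x64 Campanile
x750 = forward(x0, 750, alphas_cumprod)  # heavily noised
```

By t=750 in this schedule, abar_t is near zero, so the output is almost pure noise.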

campanile
Berkeley Campanile
(Original Image)
campanile_250
Noisy Campanile at t=250
campanile_500
Noisy Campanile at t=500
campanile_750
Noisy Campanile at t=750

1.2 Classical Denoising

As a classical baseline, Gaussian blur filtering produces the results below (noisy images and Gaussian-denoised versions shown side by side). It is difficult, if not impossible, to adequately remove the noise this way.
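A minimal separable Gaussian blur in numpy (an illustrative stand-in for a library blur such as torchvision's) shows the trade-off: noise variance drops, but only by smearing away detail.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0, radius=2):
    """Naive classical denoiser: separable Gaussian blur with reflect padding."""
    xs = np.arange(-radius, radius + 1)
    k = np.exp(-xs**2 / (2 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

noisy = np.random.default_rng(1).standard_normal((64, 64))
smoothed = gaussian_blur(noisy)   # lower noise, but image detail is lost too
```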

campanile_250
Noisy Campanile at t=250
campanile_500
Noisy Campanile at t=500
campanile_750
Noisy Campanile at t=750
campanile_250
Gaussian Blur Denoising at t=250
campanile_500
Gaussian Blur Denoising at t=500
campanile_750
Gaussian Blur Denoising at t=750

1.3 One-Step Denoising

Next, the pretrained denoiser stage_1.unet is used to estimate the Gaussian noise in each image. This estimated noise can then be removed to recover something close to the original image.
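One-step denoising just inverts the forward equation. A sketch, with the caveat that eps_hat comes from the UNet in the real pipeline; here the true noise is used to verify that the algebra recovers x_0 exactly:

```python
import numpy as np

def one_step_denoise(x_t, eps_hat, t, alphas_cumprod):
    """Invert the forward equation given a noise estimate eps_hat
    (predicted by stage_1.unet in the actual pipeline):
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    abar_t = alphas_cumprod[t]
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

# Sanity check with the *true* noise: recovery is exact.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)
rng = np.random.default_rng(2)
x0, eps = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
abar = alphas_cumprod[500]
x_t = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps
x0_hat = one_step_denoise(x_t, eps, 500, alphas_cumprod)
```

With a predicted (imperfect) eps_hat, the recovered image is blurrier the larger t is, which is what the figures below show.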

campanile_250
Noisy Campanile at t=250
campanile_500
Noisy Campanile at t=500
campanile_750
Noisy Campanile at t=750
campanile_250
One-Step Denoised Campanile at t=250
campanile_500
One-Step Denoised Campanile at t=500
campanile_750
One-Step Denoised Campanile at t=750

1.4 Iterative Denoising

To denoise iteratively, a new list of timesteps strided_timesteps is created, running from the noisiest image (largest t) down to a clean image with a regular stride (e.g. 30). Then, the denoising step is implemented at every timestep in the iterative_denoise function.

The following example starts at timestep i_start=10 and shows the noisy image at every fifth loop of denoising. Finally, the predicted clean image is displayed and compared against the previous denoising methods.
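One stride of this loop can be sketched as the DDPM-style posterior mean, an interpolation between the current x_t and the clean-image estimate x0_hat (the added noise term of the full update is omitted here for brevity; t_prime < t is the next strided timestep):

```python
import numpy as np

def iterative_denoise_step(x_t, eps_hat, t, t_prime, alphas_cumprod):
    """One stride of iterative denoising, from timestep t down to t_prime < t.
    Returns the posterior mean, interpolating between the clean-image
    estimate x0_hat and the current x_t (noise term omitted)."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha = abar_t / abar_tp   # effective alpha over the stride
    beta = 1.0 - alpha
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
    return (np.sqrt(abar_tp) * beta / (1.0 - abar_t)) * x0_hat \
         + (np.sqrt(alpha) * (1.0 - abar_tp) / (1.0 - abar_t)) * x_t

betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)
eps = np.random.default_rng(3).standard_normal((8, 8))
x_t = np.sqrt(1 - alphas_cumprod[690]) * eps   # noised all-zero image
x_next = iterative_denoise_step(x_t, eps, 690, 660, alphas_cumprod)
```

Each stride contracts the image toward the clean estimate, which is why the intermediate frames below get progressively sharper.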

campanile_90
Noisy Campanile at t=90
campanile_240
Noisy Campanile at t=240
campanile_390
Noisy Campanile at t=390
campanile_540
Noisy Campanile at t=540
campanile_690
Noisy Campanile at t=690
original
Original
iterative_denoised
Iteratively Denoised Campanile
one_step_denoised
One-Step Denoised Campanile
one_step_denoised
Gaussian Blurred Campanile

1.5 Diffusion Model Sampling

The iterative_denoise function can also be used to generate images from scratch by setting i_start=0 and passing in random noise. 5 sampled images:

sample1
Sample 1
sample2
Sample 2
sample3
Sample 3
sample4
Sample 4
sample5
Sample 5

1.6 Classifier-Free Guidance (CFG)

To improve image quality, CFG computes both a conditional and an unconditional noise estimate in iterative_denoise_cfg. When the CFG scale is greater than 1, much higher quality images are generated. 5 sampled images:
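The guidance step itself is one line: extrapolate from the unconditional estimate toward (and, for scale > 1, past) the conditional one. A minimal sketch:

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    estimate toward (and, for scale > 1, past) the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(4)
eu, ec = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
guided = cfg_noise(eu, ec, scale=7.0)
```

At scale=1 this reduces to the conditional estimate and at scale=0 to the unconditional one; values above 1 amplify the direction the prompt pushes in.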

sample1
Sample 1 with CFG
sample2
Sample 2 with CFG
sample3
Sample 3 with CFG
sample4
Sample 4 with CFG
sample5
Sample 5 with CFG

1.7 Image-to-Image Translation

Following the SDEdit algorithm, original images can also be noised and forced back onto the image manifold without conditioning. The outputs for different starting indices are shown below, matching the original images more closely as i_start increases.

sdedit1
SDEdit with i_start=1
sdedit3
SDEdit with i_start=3
sdedit5
SDEdit with i_start=5
sdedit7
SDEdit with i_start=7
sdedit10
SDEdit with i_start=10
sdedit20
SDEdit with i_start=20
campanile
Original Campanile
sdedit1
SDEdit with i_start=1
sdedit3
SDEdit with i_start=3
sdedit5
SDEdit with i_start=5
sdedit7
SDEdit with i_start=7
sdedit10
SDEdit with i_start=10
sdedit20
SDEdit with i_start=20
waterfall
Original Waterfall
sdedit1
SDEdit with i_start=1
sdedit3
SDEdit with i_start=3
sdedit5
SDEdit with i_start=5
sdedit7
SDEdit with i_start=7
sdedit10
SDEdit with i_start=10
sdedit20
SDEdit with i_start=20
canyon
Original Canyon

1.7.1 Editing Hand-Drawn and Web Images

This procedure works particularly well when starting with nonrealistic images. It is applied to one image from the web and two hand-drawn images.

sdedit1
Espresso at i_start=1
sdedit3
Espresso at i_start=3
sdedit5
Espresso at i_start=5
sdedit7
Espresso at i_start=7
sdedit10
Espresso at i_start=10
sdedit20
Espresso at i_start=20
espresso
Original Espresso
sdedit1
Cherries at i_start=1
sdedit3
Cherries at i_start=3
sdedit5
Cherries at i_start=5
sdedit7
Cherries at i_start=7
sdedit10
Cherries at i_start=10
sdedit20
Cherries at i_start=20
cherries
Original Cherries
sdedit1
Pikachu at i_start=1
sdedit3
Pikachu at i_start=3
sdedit5
Pikachu at i_start=5
sdedit7
Pikachu at i_start=7
sdedit10
Pikachu at i_start=10
sdedit20
Pikachu at i_start=20
pikachu
Original Pikachu

1.7.2 Inpainting

Following the RePaint paper, inpainting involves a binary mask m, where the new image has the same content as the original image where m is 0, and new content where m is 1. At every denoising step, x_t can be forced to have the same pixels as the original image (plus the amount of noise at timestep t) where m is 0.
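The per-step projection can be sketched as follows, reusing the toy noise schedule from earlier (a stand-in for the model's real one):

```python
import numpy as np

def inpaint_project(x_t, x_orig, mask, t, alphas_cumprod, rng):
    """After each denoising step, force the known region (mask == 0) back to
    the original image noised to level t; new content survives where mask == 1."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x_orig.shape)
    x_orig_t = np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps
    return mask * x_t + (1 - mask) * x_orig_t

betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)
rng = np.random.default_rng(5)
x_t = rng.standard_normal((8, 8))
x_orig = np.ones((8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1   # inpaint only the center square
out = inpaint_project(x_t, x_orig, mask, 100, alphas_cumprod, rng)
```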

Note: Most likely because the diffusion model was not trained for the inpainting task, and because the training data may contain more humans than animals, the Dogs Inpainted output replaces the original dog with a human.

og
Campanile
mask
Mask
hole
Hole to Fill
inpainted
Campanile Inpainted
og
Dogs
mask
Mask
hole
Hole to Fill
inpainted
Dogs Inpainted
og
Wedding
mask
Mask
hole
Hole to Fill
inpainted
Wedding Inpainted

1.7.3 Text-Conditional Image-to-Image Translation

SDEdit can also be guided with a text prompt. The examples below use the prompt: "a rocket ship." As before, the outputs gradually look more like the originals as the starting amount of noise decreases from level 1 to 20.

sdedit1
Rocket Ship at noise level 1
sdedit3
Rocket Ship at noise level 3
sdedit5
Rocket Ship at noise level 5
sdedit7
Rocket Ship at noise level 7
sdedit10
Rocket Ship at noise level 10
sdedit20
Rocket Ship at noise level 20
og
Campanile
sdedit1
Rocket Ship at noise level 1
sdedit3
Rocket Ship at noise level 3
sdedit5
Rocket Ship at noise level 5
sdedit7
Rocket Ship at noise level 7
sdedit10
Rocket Ship at noise level 10
sdedit20
Rocket Ship at noise level 20
og
Soda
sdedit1
Rocket Ship at noise level 1
sdedit3
Rocket Ship at noise level 3
sdedit5
Rocket Ship at noise level 5
sdedit7
Rocket Ship at noise level 7
sdedit10
Rocket Ship at noise level 10
sdedit20
Rocket Ship at noise level 20
og
Giraffe

1.8 Visual Anagrams

Visual anagrams appear as one image right side up, but another when flipped upside down. They are implemented by denoising the image with the first prompt to obtain one noise estimate, denoising the flipped image with the second prompt (then flipping that estimate back), and averaging the two. The denoising step is then performed with this averaged noise estimate.
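A minimal sketch of the averaging step, where eps_fn(x, prompt) is a hypothetical stand-in for the UNet's prompt-conditioned noise prediction:

```python
import numpy as np

def anagram_noise(eps_fn, x_t, prompt_a, prompt_b):
    """Visual-anagram noise estimate: average the estimate for x_t under
    prompt_a with the flipped-back estimate for the flipped image under
    prompt_b. eps_fn(x, prompt) stands in for the UNet's noise prediction."""
    eps1 = eps_fn(x_t, prompt_a)
    eps2 = np.flipud(eps_fn(np.flipud(x_t), prompt_b))
    return 0.5 * (eps1 + eps2)

# With an identity "model", the average collapses back to the input.
x = np.random.default_rng(6).standard_normal((8, 8))
avg = anagram_noise(lambda img, prompt: img, x, "an old man", "a campfire")
```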

vis_ag
An Oil Painting of an Old Man
vis_ag_flip
An Oil Painting of People around a Campfire
vis_ag
A Photo of a Dog
vis_ag_flip
A Photo of a Hipster Barista
vis_ag
A Rocket Ship
vis_ag_flip
A Man Wearing a Hat

1.9 Hybrid Images

Hybrid images look like one image up close and another from a distance. To create hybrid images, a composite noise estimate is created by combining low frequencies from one noise estimate (for the image to be perceived from far away) with high frequencies of the other (for the image to be perceived from up close).
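The frequency split can be sketched with an FFT low-pass filter (the cutoff value here is an illustrative choice, not the one used in the report):

```python
import numpy as np

def lowpass(img, cutoff=8):
    """Keep only frequencies inside a circular mask in the FFT domain."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    keep = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= cutoff ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * keep)))

def hybrid_noise(eps_far, eps_near, cutoff=8):
    """Low frequencies from the far-away image's noise estimate, high
    frequencies from the up-close image's estimate."""
    return lowpass(eps_far, cutoff) + (eps_near - lowpass(eps_near, cutoff))

x = np.random.default_rng(7).standard_normal((32, 32))
same = hybrid_noise(x, x)   # mixing an estimate with itself changes nothing
```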

hybrid
Hybrid Image of a Skull and a Waterfall
hybrid
Hybrid Image of a Skull and a Snowy Mountain Village
hybrid
Hybrid Image of an Old Man and a Snowy Mountain Village

Part B: Diffusion Models from Scratch

1.1 Implementing the UNet

To begin, the following one-step denoiser is implemented as a UNet:

b_unconditional_arch
Unconditional UNet

1.2 Using the UNet to Train a Denoiser

The UNet is trained on MNIST digits. Data pairs (z, x) are generated, where z is a noisy version of the clean image x controlled by a varying noise level sigma.
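The pair generation is a single line of additive Gaussian noise; a minimal sketch:

```python
import numpy as np

def noisy_pair(x, sigma, rng):
    """Training pair for the one-step denoiser: z = x + sigma * eps."""
    return x + sigma * rng.standard_normal(x.shape), x

x = np.zeros((64, 64))   # stand-in for a clean MNIST digit
z, x_clean = noisy_pair(x, 0.5, np.random.default_rng(8))
```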

b_diff_noises
Varying levels of noise on MNIST digits

1.2.1 Training

The denoiser is trained to map a noisy image z with sigma=0.5 back to its clean image x. Training uses a batch size of 256, a hidden dimension D=128, and the Adam optimizer with a learning rate of 1e-4, over 5 epochs.

b_train_loss
Training Loss Curve

Below are sample results after the first and fifth epochs.

b_1epoch
Results on digits from the test set after 1 epoch of training
b_5epochs
Results on digits from the test set after 5 epochs of training

1.2.2 Out-of-Distribution Testing

The denoiser results are also visualized on test set digits with varying levels of noise beyond sigma=0.5. The model performs well on smaller noise levels, but has trouble with bigger noise levels.

b_ood
Results on digits from the test set with varying noise levels

2.1 Adding Time Conditioning to UNet

Now, a diffusion model is trained following the DDPM paper. For the UNet, the major architectural change is injecting a scalar timestep t, conditioning on it via an FCBlock (fully-connected block).
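An FCBlock is just Linear → GELU → Linear applied to the (normalized) scalar timestep; its D-dimensional output then modulates a UNet feature map. A numpy sketch with hypothetical weight names:

```python
import numpy as np

def gelu(h):
    # tanh approximation of GELU
    return 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))

def fc_block(t, w1, b1, w2, b2):
    """FCBlock sketch: push the normalized scalar timestep through
    Linear -> GELU -> Linear, producing a D-dim conditioning vector."""
    h = gelu(w1 @ np.atleast_1d(t) + b1)
    return w2 @ h + b2

D = 64
rng = np.random.default_rng(9)
w1, b1 = rng.standard_normal((D, 1)), np.zeros(D)
w2, b2 = rng.standard_normal((D, D)), np.zeros(D)
emb = fc_block(0.5, w1, b1, w2, b2)   # t normalized to [0, 1]
```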

b_conditional_arch
Conditioned UNet

2.2 Training the UNet

The time-conditioned UNet is trained to predict the noise in a noisy image x_t generated at timestep t. The batch size is 128, hidden dimension D=64 and the Adam optimizer is used with learning rate 1e-3 and an exponential learning rate decay scheduler. The model is trained for 20 epochs.

b_time_loss
Time-Conditioned UNet Training Loss Curve

2.3 Sampling from the UNet

Below are 40 sample results generated from scratch over T=300 timesteps after the 5th and 20th epochs.

b_time5
After 5 Epochs
b_time20
After 20 Epochs

2.4 Adding Class Conditioning to UNet

Next, the UNet is also conditioned on the digit class, 0-9, by adding two more FCBlocks. The class c is one-hot encoded and dropped to the zero vector with probability p_uncond=0.1 so that the UNet still works without class conditioning.
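The conditioning vector construction can be sketched as:

```python
import numpy as np

def class_vector(c, num_classes=10, p_uncond=0.1, rng=np.random.default_rng(0)):
    """One-hot encode digit class c, dropped to the all-zero vector with
    probability p_uncond so the UNet also learns unconditional denoising."""
    if rng.random() < p_uncond:
        return np.zeros(num_classes)
    return np.eye(num_classes)[c]

v = class_vector(3)
```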

b_class_loss
Class-Conditioned UNet Training Loss Curve

2.5 Sampling from the Class-Conditioned UNet

Similarly, below are 40 sample results generated after the 5th and 20th epochs. Classifier-free guidance with gamma=5.0 is used to improve the conditional results.

b_class5
After 5 Epochs
b_class20
After 20 Epochs