The diffusion model used in part A is DeepFloyd IF, a two-stage model trained by Stability AI. To start, the following outputs are generated for the displayed captions. The outputs reflect the text prompts fairly accurately. Moreover, as num_inference_steps increases, the outputs become less noisy and more detailed.
The forward process involves taking a clean image and adding noise. The alpha_cumprod list is used to compute a noisy image x_t at timestep t, where larger t corresponds to more noise. The forward(im, t) function is implemented, and the following images of the Campanile (64x64) are outputted at different noise levels.
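The forward process follows the standard DDPM noising equation, x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps. A minimal sketch of forward(im, t), assuming alpha_cumprod is available as a tensor of cumulative alpha products indexed by timestep:

    import torch

    def forward(im, t):
        """Noise a clean image x_0 to x_t."""
        abar_t = alpha_cumprod[t]           # cumulative product of alphas at timestep t
        eps = torch.randn_like(im)          # fresh Gaussian noise
        return torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps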
To try denoising the images using classical methods, Gaussian blur filtering produces the results below (noisy images and Gaussian-denoised versions shown side by side). It is difficult, if not impossible, to adequately remove noise.
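The classical baseline is a simple low-pass filter; a sketch (kernel size and sigma are illustrative):

    import torchvision.transforms.functional as TF

    # Blurring suppresses high-frequency noise but destroys detail along with it.
    blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)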
Next, the pretrained denoiser stage_1.unet is used to estimate the Gaussian noise in the images. This noise estimate can then be removed to recover something close to the original image.
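A sketch of this one-step denoising, assuming prompt_embeds holds the text embedding of a generic prompt (DeepFloyd's UNet also predicts a variance, which is split off here):

    with torch.no_grad():
        out = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps = out.split(x_t.shape[1], dim=1)[0]     # keep the noise half of the output
    abar_t = alpha_cumprod[t]
    # Invert the forward equation to estimate the clean image
    x0_hat = (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)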
To denoise iteratively, a new list of timesteps strided_timesteps is created, iterating from the noisiest image (largest t) to a clean image with a regular stride (e.g. 30). The denoising step is then implemented at every timestep in the iterative_denoise function.
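A sketch of the schedule and of a single update, following the DDPM posterior-mean formula (helper names hypothetical; the added-noise term is omitted):

    strided_timesteps = list(range(990, -1, -30))   # noisiest (t=990) down to clean (t=0)

    def denoise_step(x_t, t, t_prev, x0_hat):
        """Move from x_t to the less-noisy x_{t_prev} using the clean estimate x0_hat."""
        abar_t, abar_prev = alpha_cumprod[t], alpha_cumprod[t_prev]
        alpha_t = abar_t / abar_prev
        beta_t = 1 - alpha_t
        # Interpolate between the clean estimate and the current noisy image
        return (torch.sqrt(abar_prev) * beta_t / (1 - abar_t)) * x0_hat \
             + (torch.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t)) * x_t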
The following example starts at timestep i_start=10 and shows the noisy image at every fifth loop of denoising. Finally, the predicted clean image is displayed and compared against the previous denoising methods.
The iterative_denoise function can also be used to generate images from scratch by setting i_start=0 and passing in random noise. 5 sampled images:
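Generation from scratch is then a single call on pure noise (shapes illustrative):

    x_T = torch.randn(1, 3, 64, 64)             # pure Gaussian noise
    sample = iterative_denoise(x_T, i_start=0)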
To improve image quality, classifier-free guidance (CFG) computes a conditional and an unconditional noise estimate in iterative_denoise_cfg. When the CFG strength is > 1, much higher quality images are generated. 5 sampled images:
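The core of CFG is one line: extrapolate from the unconditional estimate toward the conditional one (a sketch; noise_est stands in for the UNet call above):

    eps_cond = noise_est(x_t, t, cond_embeds)     # estimate with the text prompt
    eps_uncond = noise_est(x_t, t, null_embeds)   # estimate with the empty prompt
    eps = eps_uncond + gamma * (eps_cond - eps_uncond)   # gamma > 1 strengthens guidance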
Following the SDEdit algorithm, original images can also be noised and forced back onto the image manifold without conditioning. The outputs for different starting indices are shown below, matching the original images more and more closely as i_start increases.
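SDEdit here amounts to noising a real image to an intermediate timestep and denoising from there; a sketch using the functions above:

    t = strided_timesteps[i_start]
    x_t = forward(original_im, t)                   # partially noise the real image
    edited = iterative_denoise_cfg(x_t, i_start=i_start)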
This procedure works particularly well when starting with nonrealistic images. It is applied to one image from the web and two hand-drawn images.
Following the RePaint paper, inpainting involves a binary mask m, where the new image has the same content
as the original image where m is 0, and new content where m is 1. At every denoising step, x_t can be forced
to have the same pixels as the original image (plus the amount of noise at timestep t) where m is 0.
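A sketch of the per-step projection described above (denoise_step_cfg is a hypothetical per-step wrapper; m is 1 where new content is generated):

    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_next = strided_timesteps[i], strided_timesteps[i + 1]
        x_t = denoise_step_cfg(x_t, t, t_next)      # ordinary denoising update
        # Force the known region to match the original image, noised to the new level
        x_t = m * x_t + (1 - m) * forward(original_im, t_next)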
Note: Most likely because the diffusion model was not trained for the inpainting task, and because the training data potentially contains more humans than animals, the Dogs Inpainted output replaces the original dog with a human.
SDEdit can also be guided with a text prompt. The examples below use the prompt: "a rocket ship." As before, the outputs gradually look more like the originals as the starting index increases from 1 to 20, i.e. as the starting amount of noise decreases.
Visual anagrams appear as one image right side up, but as another when flipped upside down. They are implemented by denoising an image with the first prompt to obtain a noise estimate, repeating this process on the upside-down image with the second prompt, flipping that estimate back, and averaging the two noise estimates. The denoising step is then performed with this averaged noise estimate.
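One step of this, as a sketch (noise_est again stands in for the CFG noise estimate under a given prompt):

    eps1 = noise_est(x_t, t, prompt_1)                     # right-side-up prompt
    eps2 = torch.flip(noise_est(torch.flip(x_t, dims=[2]), t, prompt_2), dims=[2])
    eps = (eps1 + eps2) / 2                                # denoise with the average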
Hybrid images look like one image up close and another from a distance. To create hybrid images, a composite noise estimate is created by combining low frequencies from one noise estimate (for the image to be perceived from far away) with high frequencies of the other (for the image to be perceived from up close).
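A sketch of the composite estimate, using a Gaussian blur as the low-pass filter (kernel parameters illustrative):

    import torchvision.transforms.functional as TF

    low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)               # seen from afar
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=33, sigma=2.0)  # seen up close
    eps = low + high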
To begin, a simple one-step denoiser is implemented as a UNet.
The UNet will be trained on MNIST digits. Data pairs (z, x) are generated, where z is the clean digit x corrupted by Gaussian noise whose strength is controlled by a varying noise level sigma. The denoiser is trained to map a noisy image z with sigma=0.5 back to the clean image x. Model parameters include a batch size of 256 and hidden dimension D=128, using the Adam optimizer with a learning rate of 1e-4. The model is trained over 5 epochs.
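A sketch of the training loop under these settings (the UNet constructor and loader names are hypothetical):

    import torch
    import torch.nn.functional as F

    model = UNet(hidden_dim=128)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    for epoch in range(5):
        for x, _ in train_loader:               # MNIST batches of size 256
            z = x + 0.5 * torch.randn_like(x)   # noisy input, sigma = 0.5
            loss = F.mse_loss(model(z), x)      # L2 denoising objective
            opt.zero_grad()
            loss.backward()
            opt.step()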
Below are sample results after the first and fifth epochs.
The denoiser results are also visualized on test-set digits with varying levels of noise beyond sigma=0.5. The model performs well at smaller noise levels, but struggles at larger ones.
Now, a diffusion model is trained following the DDPM paper. The major architectural change to the UNet is injecting a scalar time t, conditioning on it via an FCBlock (fully-connected block).
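A sketch of such a block and of how it might modulate a feature map (layer sizes and the modulation point are assumptions, not the exact course architecture):

    import torch.nn as nn

    class FCBlock(nn.Module):
        """Embed a scalar (e.g. the normalized timestep t/T) into a feature vector."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                     nn.Linear(out_dim, out_dim))
        def forward(self, s):
            return self.net(s)

    # Inside the UNet forward pass (sketch):
    # t_emb = self.t_block(t.view(-1, 1) / T)   # embed the normalized timestep
    # feat = feat * t_emb[:, :, None, None]     # scale a decoder feature map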
The time-conditioned UNet is trained to predict the noise in a noisy image x_t generated at timestep t. The batch size is 128, the hidden dimension is D=64, and the Adam optimizer is used with learning rate 1e-3 and an exponential learning-rate decay scheduler. The model is trained for 20 epochs.
Below are 40 sample results generated from scratch in T=300 timesteps after the fifth and 20th epochs.
Next, the UNet is also conditioned on the class of the digit, 0-9, by adding two more FCBlocks. The class c is one-hot-encoded and set to 0 with probability p_uncond=0.1 so that the UNet still works without class-conditioning.
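A sketch of preparing the class vector with unconditional dropout:

    import torch
    import torch.nn.functional as F

    c = F.one_hot(labels, num_classes=10).float()      # one-hot class vector
    drop = (torch.rand(c.shape[0], 1) < 0.1).float()   # drop conditioning 10% of the time
    c = c * (1 - drop)                                 # all-zero vector = unconditional
    # c then feeds the two added FCBlocks that modulate the UNet, analogous to t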
Similarly, below are 40 sample results generated after the fifth and 20th epochs. Classifier-free guidance with gamma=5.0 is used to improve conditional results.