**Applied Machine Learning with Generative Adversarial Network**

This thesis attempts to give machine learning practitioners experienced with deep convolutional neural networks an overview of generative models, with a focus on Generative Adversarial Networks (GANs) in the context of image generation. This includes the standard GAN architecture, improvements to that formulation, and more specialized variations for tasks such as image-to-image translation or attribute-conditioned sampling and editing of generated images.

Additionally, an overview of current-day, real-world uses of GANs is provided. Note that this thesis does not attempt to introduce the basic concepts of machine learning or deep learning to unfamiliar readers. For that, the section *Further Reading* should be consulted.

For millennia, humans dreamed of creating artificial life, a desire that has long been firmly lodged in the mystical, with stories such as the ancient Greek Galatea, or the golems of Jewish folklore. But with the advent of the first programmable computers, that changed. Alan Turing already discussed many aspects of what is today called machine learning [40], including the ability of these machines to learn from experience. While artificial intelligence has progressed since Turing’s days, for most of the concept’s existence, progress has been rather sluggish. Relatively recently, increased computing power has made deep learning techniques possible. We are still far from achieving the general artificial intelligence dreamed of in dystopian and utopian fantasies alike, but specialized solutions for many specific tasks have achieved enormously impressive results. Generative Adversarial Networks have, for example, recently made incredible progress in the field of image generation, and can now generate images with a fidelity that can fool human onlookers.

[10][40][16][29]

*Figure 1: Images of entirely fictional faces. Generated by StyleGAN2, an architecture you will learn about in the section StyleGAN. [41]*

#### Generative Models

Machine learning models can generally be modeled as mathematical functions that map some input to some output. A standard regression or classification problem usually entails mapping a high-dimensional input, like a vector of features, an image, or time-series data, to a lower-dimensional output, often as simple as a single scalar value. As such, a simple convolutional classifier might have an architecture such as the one in **Figure 2**.

*Figure 2: A simplified classifier architecture.*

In contrast, a generative model takes a low-dimensional input, often just random noise, and maps that to a high-dimensional output, like an RGB image. Thus, a generator has an architecture that can be simplified to **Figure 3**.

*Figure 3: A simplified generator architecture.*

There are several additional issues faced when designing a generative model as opposed to a classifier or regression model, the greatest of which is the lack of a clear objective, as there is a practically infinite number of possible outputs.

A generative model must represent the distribution of the data it wants to generate. As such, the process of generation is equivalent to drawing samples from that distribution. That goal is significantly more complex than that of a classifier.

Multiple early architectures have been proposed to model that distribution, with mostly limited success. Today, the two most prominent solutions are variational autoencoders and generative adversarial networks.

[10][45][11][1]

#### Variational Autoencoders (VAEs)

One relatively successful approach to image generation is the variational autoencoder. VAEs consist of an encoder and a decoder. The encoder maps the input into the much lower-dimensional latent space, and the decoder maps values in the latent space back to data space. Both elements are trained together in order to minimize reconstruction error, or the difference between the input and the output. For image generation, typically the pixel distance is used to model reconstruction error. During inference, a random value in the latent space is selected and mapped to data space using just the decoder.

[45][34][3]

*Figure 4: The architecture of a VAE. Adapted from [45]*

#### Advantages over GANs

Compared to GANs, which will be explored in the following section, VAEs are typically easier and computationally cheaper to train. In addition, they are fully invertible, meaning that data can be transformed into its latent space representation, which is not straightforward for a GAN. They also suffer less from mode collapse, an issue that is further explored in the following section.

[3][45]

#### Disadvantages over GANs

The main disadvantage of VAEs in comparison to GANs is the fidelity of results. This is due to the reconstruction loss of VAEs, where the pixels of the original image and the output are directly compared, a formulation that produces inherently blurry results. [24][3]

##### Generative Adversarial Networks (GANs)

The issue with pixel-distance loss functions described in *Disadvantages over GANs* is avoided by generative adversarial networks, which are now the most popular generative model architecture, both in industry and research.

[3][45]

#### Intuition

A GAN consists of two neural networks: the *generator* and the *discriminator*. Intuitively, the generator and the discriminator can be seen as an art forger and an art inspector. In this analogy, the art forger (generator) starts randomly drawing art, without any knowledge of drawing. In the beginning, the generated images will be nothing but random noise. Similarly, the inspector (discriminator), which starts out without any relevant knowledge, gets shown both real and fake images, and has to guess which are real and which are fake.

The generator never actually sees the real images it is trying to recreate: the only indicator of how well it is doing is the prediction of the discriminator. The discriminator, in comparison, does see both real and generated images, and has a standard loss function indicating how well it is doing. The goal of the discriminator is to maximize the accuracy of its own predictions, or to categorize the highest possible number of images correctly. The goal of the generator is to “fool” the discriminator as often as possible, or to minimize the fraction of correct guesses by the discriminator.

The generator and the discriminator are trained simultaneously. Intuitively, this is done so that both models have a similar ability at the same time. If the generator tries to fool a discriminator that guesses every fake, it cannot learn, since no matter what it generates, the discriminator will predict that it is fake.

[45]

#### Formal Definition

More formally, a GAN is composed of a generator G(z; θ_{g}), where G is a function mapping from a noise vector z with the parameters θ_{g} to the space of the training data. Typically, G is a neural network. The discriminator D(x) is a function mapping from the space of the training data to a scalar representing the probability that x is a real image (from the training set), rather than being generated by G. In this framework, the objective of D is that of a standard classifier, to maximize the probability of assigning the correct label D(x) to x, while the objective of G is to minimize that probability. Summing up, a GAN has the objective

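$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$

where V(D,G) is the value function of the GAN, p_{data} is the distribution of the training data, and p_{z} is the distribution of the noise vector z. [11]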
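As a rough illustration of how the value function behaves, the following sketch estimates V(D, G) from batches of discriminator outputs. It assumes the standard form from Goodfellow et al. [11], the sum of the averages of log D(x) over real samples and log(1 − D(G(z))) over generated samples. The function name and the example probabilities are made up for illustration.

```python
import math

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for a batch of real samples, each strictly between 0 and 1
    d_fake: D(G(z)) for a batch of generated samples, each strictly between 0 and 1
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical outputs of a discriminator that separates real from fake well.
confident = value_function([0.99, 0.98], [0.01, 0.02])
# Hypothetical outputs of a discriminator that cannot tell the two apart.
confused = value_function([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to push this value up towards 0, while the generator tries to push it down; a discriminator that outputs 0.5 everywhere yields 2·log(0.5).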

##### Wasserstein GANs

This basic formulation, as per Goodfellow et al. [11], has already led to promising results when compared to contemporary alternatives. However, it still faces a number of issues, many of which have been overcome by newer architectures.

[45]

#### Mode Collapse

Mode collapse occurs when the distribution of data learned by a model collapses around a single mode. This causes every generated sample to be very similar, or even indistinguishable. It occurs when, during training, the discriminator is better at classifying some modes than others, leading the generator to only output values around the modes on which the discriminator performs worst.

*Figure 5: MNIST, an example of a multimodal dataset. Here, each digit represents one mode. Around each mode, there are many training examples, while there are close to none between the modes. Adapted from [45], [23]*

##### Vanishing Gradients

Traditionally, GANs use the binary cross-entropy (BCE) loss function, as seen in equation 1. The problem with this loss is that as the discriminator improves, its predictions will move closer to the values of 0 and 1 (fake and real), where the gradient approaches zero, which drastically reduces the amount of useful feedback passed to the generator.

*Figure 6: An illustration of the BCE loss function. As the function approaches 0 or 1, the gradient approaches 0.*
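The saturation effect can be checked numerically. This minimal sketch assumes the discriminator ends in a sigmoid, so that D(G(z)) = sigmoid(logit), and looks at the generator term log(1 − D(G(z))) of the objective. Its derivative with respect to the logit works out to −sigmoid(logit), which shrinks towards zero as the discriminator becomes more certain that a sample is fake. The logit values are arbitrary.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generator_term_grad(logit):
    """Derivative of log(1 - sigmoid(logit)) with respect to the logit."""
    return -sigmoid(logit)

# Increasingly negative logits: the discriminator is more and more sure
# that the generated sample is fake.
logits = [0.0, -2.0, -5.0, -10.0]
grad_magnitudes = [abs(generator_term_grad(a)) for a in logits]

for a, g in zip(logits, grad_magnitudes):
    print(f"logit={a:6.1f}  D(G(z))={sigmoid(a):.5f}  |gradient|={g:.5f}")
```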

##### Earth Mover’s Distance

Both mode collapse and the vanishing gradient problem can be remedied by replacing the traditional BCE loss with earth mover’s distance (EMD). It describes the minimal cost required to transform one distribution into another and is named such because intuitively, it represents the amount of effort required to move one pile of earth to match the distribution of another. Unlike BCE loss, EMD is not bounded between 0 and 1, and can instead take on any real value.

[33][45]
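For intuition, the earth mover's distance has a simple form in one dimension. The sketch below assumes two histograms over the same equally spaced bins (unit spacing) with equal total mass; in that special case the minimal cost equals the sum of the absolute differences of the cumulative sums. The function name is hypothetical, and real GAN data is of course far higher-dimensional.

```python
def emd_1d(p, q):
    """Earth mover's distance between two 1-D histograms on the same bins."""
    assert len(p) == len(q)
    carried = 0.0  # earth that still has to be moved on to the next bin
    cost = 0.0
    for p_i, q_i in zip(p, q):
        carried += p_i - q_i
        cost += abs(carried)  # moving `carried` units one bin further
    return cost

# All mass moves two bins to the right: cost 1 * 2.
far = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
# Half of the mass moves two bins (or all of it moves one bin): cost 1.
near = emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
```

Note that the distance keeps growing as the piles move further apart, which is what gives the generator a useful signal even when the two distributions do not overlap.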

##### Wasserstein Loss

Wasserstein loss is a loss function that approximates earth mover’s distance. It is defined as

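$$\min_G \max_C \; \mathbb{E}_{x \sim p_{data}(x)}\left[C(x)\right] - \mathbb{E}_{z \sim p_z(z)}\left[C(G(z))\right] \tag{2}$$

where C is the critic, the equivalent of a discriminator in the Wasserstein GAN, p_{data} is the distribution of the training data, and p_{z} is the distribution of the noise vector z.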

Note that this loss is very similar to the standard BCE loss seen in equation 1, but that it doesn’t include a logarithm. This loss function approximates the EMD under a certain condition. [45][2]

##### 1-Lipschitz Continuity

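Wasserstein loss approximates EMD under the condition that the critic is 1-Lipschitz continuous. A differentiable function f that is 1-Lipschitz continuous has a gradient whose norm is ≤ 1 at every point, so that

$$\|\nabla f(x)\|_2 \le 1 \quad \text{for all } x \tag{3}$$

For Wasserstein GANs, this is typically not explicitly enforced, but heavily encouraged by using a gradient penalty, a regularization term. For every optimization step, an intermediary value x̂ is computed by interpolating between a real example x and a generated example G(z) with

$$\hat{x} = \epsilon x + (1 - \epsilon)\, G(z), \quad \epsilon \sim U[0, 1] \tag{4}$$

Using this value, the gradient penalty is calculated:

$$\lambda \, \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} C(\hat{x})\|_2 - 1\right)^2\right] \tag{5}$$

where λ is a weighting coefficient. This term penalizes gradients whose L2 norm deviates from 1.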

[13][45]
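The gradient penalty can be sketched without an autodiff library by using a toy critic whose gradient is known in closed form. This example assumes the two-sided penalty of Gulrajani et al. [13], λ(‖∇C(x̂)‖₂ − 1)², with λ = 10 as a default. The quadratic critic C(x) = ½‖x‖² is purely hypothetical and chosen because its gradient is simply x; the interpolation factor is passed in explicitly instead of being sampled so that the result is reproducible.

```python
import math

def critic_gradient(x):
    """Gradient of the toy critic C(x) = 0.5 * ||x||^2, which is x itself."""
    return list(x)

def gradient_penalty(real, fake, eps, lam=10.0):
    """Penalty for one real/generated pair at interpolation factor eps."""
    # Interpolate between the real and the generated example.
    x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
    grad = critic_gradient(x_hat)
    grad_norm = math.sqrt(sum(g * g for g in grad))
    # Penalize any deviation of the gradient norm from 1.
    return lam * (grad_norm - 1.0) ** 2

no_penalty = gradient_penalty([1.0, 0.0], [0.0, 0.0], eps=1.0)    # norm is exactly 1
some_penalty = gradient_penalty([1.0, 0.0], [0.0, 0.0], eps=0.5)  # norm is 0.5
```

In actual training, eps would be drawn uniformly from [0, 1] for every sample and the gradient would come from backpropagation through the critic network.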

##### Complete Architecture

A Wasserstein GAN is, then, a GAN which uses the Wasserstein loss function and encourages 1-Lipschitz continuity, usually through a gradient penalty. In this architecture, the discriminator is called the critic, as it does not discriminate between real and fake, but instead outputs an arbitrary real number.

With just these simple changes, a GAN is significantly less susceptible to both mode collapse and vanishing gradients, leading to significantly more stable and predictable training.

[45][2][13]

#### GAN Evaluation

Evaluating GANs is not straightforward, as there is no ground truth to compare generated values to. Instead, a good GAN is expected to have high fidelity (quality of generated images) and diversity, both attributes that are nontrivial to quantify. To solve that issue, most GAN evaluation metrics rely on analyzing embeddings from large-scale, pretrained CNNs, an approach that has been shown to correspond to human perception of results reasonably well.

[44]

#### Fréchet Inception Distance

Fréchet Inception Distance (FID) is by far the most widely used GAN evaluation metric. It works by calculating the activations of the final pooling layer of an Inception v3 model trained on ImageNet (these activations are called embeddings) for a large number of both real and generated images, and fitting a multivariate normal distribution to each set of embeddings. The two distributions are then compared using the Fréchet distance, a metric originally defined for comparing curves that has a closed form for two normal distributions. If the two distributions are very similar, meaning that the generated images closely match the characteristics of real examples, the FID is low, and vice versa. FID is defined as:

$$\text{FID} = \|\mu_r - \mu_g\|_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right) \tag{6}$$

where μ_{r} and Σ_{r} are the mean and covariance of the embeddings of real images, and μ_{g} and Σ_{g} are those of generated images.

It is important to note that FID is not a suitable metric for anything that isn’t represented in ImageNet, which consists of natural photographs.

[14][45][19]
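The structure of the metric can be illustrated with a simplified version. The sketch below assumes that the covariance matrices are diagonal, which is not how FID is computed in practice but avoids the matrix square root: in that case the trace term reduces to the sum of squared differences of the per-dimension standard deviations. The tiny two-dimensional "embeddings" are invented; real Inception embeddings are much larger.

```python
import math

def mean_and_std(embeddings):
    """Per-dimension mean and standard deviation of a list of vectors."""
    n, dim = len(embeddings), len(embeddings[0])
    means = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    stds = [math.sqrt(sum((e[d] - means[d]) ** 2 for e in embeddings) / n)
            for d in range(dim)]
    return means, stds

def frechet_distance_diagonal(real, generated):
    """Fréchet distance between two Gaussians with diagonal covariance."""
    mu_r, sd_r = mean_and_std(real)
    mu_g, sd_g = mean_and_std(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # For diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    # equals the sum of (sd_r - sd_g)^2 over all dimensions.
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_g))
    return mean_term + cov_term

real_emb = [[0.0, 0.0], [2.0, 2.0]]
shifted_emb = [[1.0, 0.0], [3.0, 2.0]]  # same spread, mean shifted by 1 in one dimension
distance = frechet_distance_diagonal(real_emb, shifted_emb)
```

A full implementation would keep the off-diagonal covariance entries and compute the matrix square root with a numerical library.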

#### Conditional and Controllable GANs

##### Conditional GANs

Conditional GANs, or cGANs, expand the GAN architecture by appending a condition vector to the noise vector z. In a conditional GAN model, the GAN is trained with a dataset labeled for the features that you want to control, and during inference, a label is chosen. This turns the previously unsupervised GAN objective into a supervised objective.

In a cGAN, the generator is expressed as G(z|y), since it generates based on both z and the condition y. The discriminator is D(x|y), as it predicts realness on the example x conditioned on y. Thus, the discriminator not only judges whether an example is real, but also whether it belongs to class y.

[25][45]
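How the condition enters the generator can be sketched in a few lines. This example assumes that the condition y is a one-hot class vector that is concatenated to the noise vector z; the vector sizes, the class count, and the function names are arbitrary choices for illustration.

```python
import random

def one_hot(label, num_classes):
    """Return a one-hot condition vector y for an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def conditional_generator_input(z, label, num_classes):
    """Append the condition vector y to the noise vector z."""
    return z + one_hot(label, num_classes)

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(4)]  # noise vector of arbitrary length
g_input = conditional_generator_input(z, label=3, num_classes=10)
```

The discriminator would receive the same condition alongside the image, so that it can reject realistic images that do not match the requested class.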

#### Controllable GANs

Controllable GANs, like cGANs, allow for controllable output, but unlike them, they do not require the GAN to be trained on a labeled dataset. Instead, the input noise z is manipulated to change certain features of the output. [45]

#### Disentanglement

Perhaps the largest issue faced by controllable GANs is entanglement. In the context of GANs, a highly entangled z-space is one in which no feature can be changed in isolation, but can only be modified by also changing a large number of unrelated features.

In contrast, a disentangled z-space is one where every value you want to control corresponds exactly to one dimension of z. In such a latent space, individual features can be modified without affecting any other dimensions. Both are illustrated in **Figure 7**. Note that a model can be disentangled with regard to just a certain subset of features that should be controllable, and not others.

*Figure 7: An example of an entangled (left) and a disentangled (right) latent space. Adapted from [45]*

#### Image-to-Image Translation

One specialization of the cGAN architecture is a so-called image-to-image translation model. Image-to-image translation is the task of generating an image with an input image as the condition. Generally, image translation can be separated into the supervised paired image-to-image translation, in which you have a training set of paired input and output images, and the unsupervised unpaired image-to-image translation, in which the training set consists of two sets of unpaired images. [45]

#### Paired Image-to-Image Translation

The most successful paired image-to-image translation models have been based on the Pix2Pix architecture introduced by Isola et al. [17] or variations thereof, which is also the architecture that will be further discussed here.

Like any conditional GAN, a Pix2Pix network has a condition as input for both the generator and discriminator. Unlike most other cGANs, this condition is not appended to a noise vector, but used as the only input. This is done as the noise vector has been found to make no significant difference in the image-to-image translation setting, as the GAN simply learns to ignore the noise. Pix2Pix uses an improved generator, discriminator, and loss function. [17][45]

#### Generator

Pix2Pix’s generator is a U-Net, which is an encoder-decoder architecture with skip connections. Such a network consists of a number of encoding layers that downsample their input to a lower resolution, followed by the same number of decoding layers that upsample their input to a higher resolution. There are skip connections between every layer i and n − i, with n being the total number of layers. This allows information that would be lost by the downsampling to be preserved. In addition, the generator has multiple dropout layers.

*Figure 8: A comparison between paired and unpaired image data sets. In paired datasets, every input image x_{i} corresponds to exactly one output image y_{i}, which is not the case for unpaired datasets. [46]*

*Figure 9: The U-Net generator architecture used by Pix2Pix. [17]*

[32][17][45]

#### PatchGAN

PatchGAN is the improved discriminator used by Pix2Pix. Instead of outputting a single real/fake prediction (or a score, in the case of a Wasserstein GAN), it outputs a matrix of predictions on N × N patches.

*Figure 10: The PatchGAN discriminator. [6]*

[17][45]

#### Additional Loss Term

In addition to the standard GAN loss (Pix2Pix uses the standard BCE loss, not Wasserstein loss), the Pix2Pix generator (but not the discriminator) uses the L^{1} distance between the real and generated image as an additional loss term, which improves quality. [17]

*Figure 11: A comparison of losses for Pix2Pix. The right-most column shows the results when both the classic GAN loss as well as the L^{1} loss are used. [17]*
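The combination of the patch-wise adversarial loss and the L^{1} term can be sketched as follows. This is an illustration under several assumptions: the adversarial part is the BCE averaged over the PatchGAN's matrix of patch predictions, the generator is trained to make those predictions "real", and the L^{1} term is weighted by a factor lam (100 here, an assumed default). Images are flattened to plain lists of pixel values, and all numbers are invented.

```python
import math

def patch_bce(patch_preds, target_is_real):
    """Mean BCE over a matrix of per-patch real/fake predictions."""
    flat = [p for row in patch_preds for p in row]
    if target_is_real:
        losses = [-math.log(p) for p in flat]
    else:
        losses = [-math.log(1.0 - p) for p in flat]
    return sum(losses) / len(flat)

def l1_distance(image_a, image_b):
    """Mean absolute pixel difference between two flattened images."""
    return sum(abs(a - b) for a, b in zip(image_a, image_b)) / len(image_a)

def generator_loss(patch_preds_on_fake, fake, real, lam=100.0):
    """Adversarial term (generator wants 'real' on every patch) plus weighted L1 term."""
    return patch_bce(patch_preds_on_fake, True) + lam * l1_distance(fake, real)

patch_preds = [[0.5, 0.5], [0.5, 0.5]]  # hypothetical 2 x 2 PatchGAN output
real = [0.2, 0.4, 0.6, 0.8]
fake = [0.3, 0.5, 0.7, 0.9]
loss = generator_loss(patch_preds, fake, real)
```

The discriminator's own loss would use only `patch_bce`, with `target_is_real` set according to whether it was shown a real or a generated image.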

#### Unpaired Image-to-Image Translation

This thesis will focus on CycleGAN as the predominant unpaired image-to-image translation architecture. A CycleGAN works by enforcing cycle consistency between two GANs, one of which maps from domain A to B, and another which maps from B to A. The general architecture of the individual generators and discriminators is very similar to Pix2Pix.

#### Cycle Consistency

Cycle consistency is based on the assumption that when data is translated from one domain to another and back (from A to B to A), the output should be identical to the input. Enforcing cycle consistency means minimizing the distance between a data point in domain A and that same data point after it has been “cycled”. In CycleGAN, this distance is measured by transforming from A to B using one GAN, and from B to A using a second GAN. Both are trained simultaneously.

*Figure 12: The architecture of CycleGAN. An image is first mapped from A to B using the generator G, and then back to A using F. Then, cycle consistency loss is computed between the original image x and F(G(x)). Adapted from [46]*

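The cycle consistency loss term, which is used by both generators (but not the discriminators) during training, is defined as

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim A}\left[\|F(G(x)) - x\|_1\right] + \mathbb{E}_{y \sim B}\left[\|G(F(y)) - y\|_1\right] \tag{7}$$

where G maps from domain A to B and F maps from B to A.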

Note that the L^{1} loss is used as opposed to the L^{2} loss to discourage blurring. The loss is a simple pixel distance term as opposed to an adversarial formulation, as the CycleGAN authors [46] found the more complex loss to not improve results. Note that cycle-consistency loss is calculated for both A → B → A and B → A → B.

[46][45]
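The cycle consistency loss can be tried out with toy mappings in place of the two generators. In this sketch, G and F are hypothetical plain functions on short lists of numbers, and the loss is the mean L^{1} distance between each sample and its cycled version, summed over both directions.

```python
def l1(a, b):
    """Mean absolute difference between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(G, F, samples_a, samples_b):
    """L1 cycle loss for A -> B -> A plus B -> A -> B."""
    forward = sum(l1(F(G(x)), x) for x in samples_a) / len(samples_a)
    backward = sum(l1(G(F(y)), y) for y in samples_b) / len(samples_b)
    return forward + backward

# Toy "generators": G doubles every value, F halves it again.
G = lambda x: [2.0 * v for v in x]
F = lambda y: [v / 2.0 for v in y]
# An imperfect inverse that adds a constant offset.
F_bad = lambda y: [v / 2.0 + 0.5 for v in y]

samples_a = [[1.0, 2.0], [3.0, 4.0]]
samples_b = [[2.0, 4.0], [6.0, 8.0]]
perfect = cycle_consistency_loss(G, F, samples_a, samples_b)
imperfect = cycle_consistency_loss(G, F_bad, samples_a, samples_b)
```

With exact inverses the loss is zero; the offset in `F_bad` shows up in both cycle directions.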

#### StyleGAN

Currently, the state of the art in unsupervised image GANs primarily consists of StyleGAN-based architectures. StyleGAN, introduced by Karras, Laine, and Aila [20], is a model based on existing style transfer methods. It not only generates significantly more sophisticated images, but is also significantly more disentangled than preceding unsupervised architectures, and as such allows for much greater control over the output.

The discriminator of StyleGAN remains unchanged from the unsupervised formulations discussed earlier, while the generator is improved in several ways. In this section, both StyleGAN and the improved StyleGAN2 [21] are discussed.

[20][21][45]

*Figure 13: The StyleGAN generator architecture. Adapted from [20], [21]*

#### Mapping Network

One significant change in the StyleGAN architecture is the use of the mapping network, an 8-layer MLP with the standard (Gaussian) noise z ∈ R^{512} as an input, which is mapped nonlinearly to the noise vector w ∈ R^{512}. The use of such a mapping network frees the model from the constraint that z needs to follow the distribution of the training data, which forces the generator to contort in such a way as to map from a Gaussian distribution to the image features. This is illustrated in **Figure 14**. [20][45]

*Figure 14: An example of entanglement caused by z. (a) represents two features, where one combination is missing in the training set (e.g. bearded women). (b) shows the mapping that the generator has to learn to avoid the possibility of the forbidden combination being sampled from z. The learned mapping w (c) can undo some of that warping, resulting in a more disentangled representation in w space. [20]*

#### Noise Input

Many attributes for generated images are stochastic (for faces this includes attributes like the placement of individual hairs or skin pores), and can thus be randomly modeled. This means that an image generator must find a way to create stochasticity, which consumes model capacity. StyleGAN circumvents that issue by passing random noise explicitly into every layer. [20]

##### Adaptive Instance Normalization

StyleGAN uses adaptive instance normalization, or AdaIN, to inject information from the noise vector w into the generator. Unlike a standard GAN, where this vector is passed into the input layer, which is then sequentially altered by each layer of the network, StyleGAN actually inputs a constant, learned input to its first layer. Information from w is passed via AdaIN to every layer.

In instance normalization, the feature map of every channel is normalized separately for each instance (an instance being each generated image), as opposed to normalization over an entire batch in batch normalization. This is the first part of AdaIN, wherein at every layer, each feature map x_{i} is normalized using its mean µ(x_{i}) and standard deviation σ(x_{i}), so that it has a mean of 0 and a standard deviation of 1. The adaptive part of AdaIN is what actually injects information from w into the network. After normalization, the mean and standard deviation of the feature map are changed using a scalar scale factor y_{s,i} and a scalar bias factor y_{b,i}. These are computed from w using learned parameters (this transformation can be expressed as a single MLP layer). The AdaIN operation is, thus

$$\text{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i} \tag{8}$$

The normalization is done at each layer to effectively cancel out the effects of previous styles, so that every style can be applied independently of styles from previous layers. [20][45]
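The two steps of AdaIN, normalization followed by re-scaling and shifting, can be sketched for a single feature map. The function below is a minimal illustration rather than StyleGAN's implementation: the feature map is a small nested list, the scale and bias are passed in directly instead of being computed from w, and the small `eps` added to the standard deviation is an assumption to avoid division by zero.

```python
import math

def adain(feature_map, y_s, y_b, eps=1e-8):
    """Normalize one feature map to zero mean and unit std, then apply scale and bias."""
    flat = [v for row in feature_map for v in row]
    mu = sum(flat) / len(flat)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in flat) / len(flat))
    return [[y_s * (v - mu) / (sigma + eps) + y_b for v in row]
            for row in feature_map]

styled = adain([[1.0, 2.0], [3.0, 4.0]], y_s=2.0, y_b=5.0)

# After AdaIN, the statistics of the feature map are set by the style alone.
styled_flat = [v for row in styled for v in row]
styled_mean = sum(styled_flat) / len(styled_flat)
styled_std = math.sqrt(sum((v - styled_mean) ** 2 for v in styled_flat) / len(styled_flat))
```

Whatever statistics the incoming feature map had, the output has a mean of (approximately) `y_b` and a standard deviation of `y_s`, which is why each style overrides the previous one.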

#### Weight Demodulation

While AdaIN is effective, it produces undesirable “water droplet” artifacts on generated images that are often very obvious. To combat this, StyleGAN2 uses a modified technique called weight demodulation. This is a “softer” normalization technique than AdaIN, since it does not force the normalized distribution to match the expected statistics, but only encourages it by normalizing the weights according to expected statistics of the feature map. In addition, the authors found that with the new technique, applying normalization to the mean is not necessary, so weight demodulation only normalizes the standard deviation. The first step of weight demodulation is modulation, in which the convolution weights are scaled using a scale value s_{i} that is derived from the style for each input feature map:

$$w'_{ijk} = s_i \cdot w_{ijk} \tag{9}$$

where w_{ijk} are the original convolution weights (not to be confused with the latent vector w), i enumerates the input feature maps, j the output feature maps, and k the spatial footprint of the convolution.

This applies the style to the layer. In the next step, demodulation, the standard deviation is normalized to 1. To achieve this, it is assumed that the input activations (before eq. 9) have unit standard deviation. The standard deviation of the output activations after eq. 9 will then be

$$\sigma_j = \sqrt{\sum_{i,k} {w'_{ijk}}^2} \tag{10}$$

To normalize these values to σ = 1, we can simply divide the weights w’ by σ:

$$w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} {w'_{ijk}}^2 + \epsilon}} \tag{11}$$

where ϵ is a small constant that avoids division by zero. [21][39][15]
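Modulation and demodulation can be sketched on a tiny weight tensor. The layout `weights[i][j][k]` (input feature map i, output feature map j, spatial position k), the style values, and the small constant `eps` are assumptions made for this example; the weights themselves are arbitrary.

```python
import math

def modulate_demodulate(weights, styles, eps=1e-8):
    """Scale weights per input feature map, then normalize per output feature map."""
    n_in, n_out = len(weights), len(weights[0])
    # Modulation: scale all weights that read from input feature map i by styles[i].
    modulated = [[[styles[i] * w for w in weights[i][j]] for j in range(n_out)]
                 for i in range(n_in)]
    result = [[None] * n_out for _ in range(n_in)]
    for j in range(n_out):
        # Expected standard deviation of output feature map j.
        sigma_j = math.sqrt(
            sum(w * w for i in range(n_in) for w in modulated[i][j]) + eps)
        # Demodulation: divide so that output feature map j has unit std again.
        for i in range(n_in):
            result[i][j] = [w / sigma_j for w in modulated[i][j]]
    return result

weights = [[[0.5, -1.0, 2.0], [1.0, 1.0, 1.0]],
           [[0.1, 0.2, 0.3], [-2.0, 0.0, 2.0]]]
demodulated = modulate_demodulate(weights, styles=[2.0, 0.5])

# Squared weights feeding each output feature map now sum to (almost exactly) 1.
sums_per_output = [sum(w * w for i in range(2) for w in demodulated[i][j])
                   for j in range(2)]
```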

#### Perceptual Path Length Regularization

Perceptual path length, or PPL, is a metric introduced by Karras, Laine, and Aila [20]. It is a way to measure the disentanglement of w.

If w is disentangled, and you interpolate between the two randomly sampled values w_{1} and w_{2}, the perceived distance between any two equally spaced interpolation points should be identical. PPL uses that fact by measuring the distance between the VGG16 embeddings (similarly to *Fréchet Inception Distance*) of the images generated from the interpolation between w_{1} and w_{2} at t, and at t + ϵ, where t is a random interpolation point, and ϵ is a very small number, usually 10^{−4}. To get a value that does not depend on ϵ, this is then divided by ϵ^{2} (as opposed to ϵ, since the distance function used is quadratic [44]). This entire operation is done for many different random w_{1} and w_{2} values to get a cost for the entire network. The entire operation can be described as

$$\text{PPL}_w = \mathbb{E}\left[\frac{1}{\epsilon^2} \, d\big(G(\text{lerp}(w_1, w_2; t)),\; G(\text{lerp}(w_1, w_2; t + \epsilon))\big)\right] \tag{12}$$

where d is the perceptual distance between two images, G is the generator operating on w, and lerp denotes linear interpolation.
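The procedure can be sketched with stand-ins for the expensive parts. In this example, the generator is a hypothetical identity function and the perceptual distance is replaced by a plain squared Euclidean distance, which is also quadratic; a real measurement would use a trained generator and a distance on VGG16 embeddings. The function names and the fixed random seed are choices made for this illustration.

```python
import random

def lerp(a, b, t):
    """Linear interpolation between vectors a and b at position t."""
    return [x + t * (y - x) for x, y in zip(a, b)]

def squared_distance(a, b):
    """Stand-in for the perceptual distance d (quadratic, like the real one)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def perceptual_path_length(generator, latent_pairs, eps=1e-4, seed=0):
    """Average distance between outputs at t and t + eps, divided by eps^2."""
    rng = random.Random(seed)
    total = 0.0
    for w1, w2 in latent_pairs:
        t = rng.random()  # random interpolation point in [0, 1)
        img_a = generator(lerp(w1, w2, t))
        img_b = generator(lerp(w1, w2, t + eps))
        total += squared_distance(img_a, img_b) / eps ** 2
    return total / len(latent_pairs)

def identity_generator(w):
    return list(w)

ppl = perceptual_path_length(identity_generator, [([0.0, 0.0], [3.0, 4.0])])
```

For the identity generator the result is the squared distance between w_{1} and w_{2} regardless of t, which is the "perfectly even" behavior a disentangled latent space is expected to approach.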

PPL regularization is a technique introduced by StyleGAN2. It not only uses PPL to measure the network’s performance, but also as an additional loss term, as disentanglement has been shown to not only improve controllability, but also to greatly improve the quality of the generated image. [21] Effectively, it further encourages disentanglement, and has the added effect of making the model dramatically easier to invert.

[44][20][21][45]

*Figure 15: Various images generated by StyleGAN2. Each image is generated by a model trained on a different dataset. [30]*

#### Normalizing Flows and GANs

##### Normalizing Flows

As explored in *Generative Models*, all generative models approximate the distribution of training examples. If that distribution is well represented, a number of tasks can be achieved. In addition to generation (or the sampling of points from the modeled distribution), the probability of arbitrary data points can be measured (density estimation), or incomplete data points can be completed, among others. Most popular generative frameworks, including GANs and VAEs, do not explicitly represent this distribution. Instead, they allow for a subset of the tasks above to be performed: generation for GANs, and density estimation and generation for VAEs. In contrast, normalizing flows directly model the underlying distribution.

Normalizing flows model a distribution via a sequence of invertible, differentiable mappings. Using this framework, any distribution (in practice, normal distributions are usually used) can be transformed to an arbitrarily complex one. [31][22][18][42][3]

*Figure 16: An illustration of how a normalizing flow can transform a simple distribution (left) to an arbitrarily complex one (far right). [42]*
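The idea of chaining invertible mappings can be shown with the simplest possible case. The sketch below assumes one-dimensional affine mappings x = scale · z + shift applied to a standard normal base distribution, and evaluates the exact log-density of a data point through the change-of-variables formula: invert every mapping in reverse order and subtract the log of each scale. The class and function names are hypothetical, and practical flows use far more expressive mappings.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

class AffineFlow:
    """Invertible, differentiable mapping x = scale * z + shift (scale != 0)."""
    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift
    def forward(self, z):
        return self.scale * z + self.shift
    def inverse(self, x):
        return (x - self.shift) / self.scale
    def log_abs_det(self):
        return math.log(abs(self.scale))

def sample_to_data(flows, z):
    """Generation: push a base sample through all mappings in order."""
    for flow in flows:
        z = flow.forward(z)
    return z

def log_density(flows, x):
    """Density estimation: invert the mappings and correct for the volume change."""
    log_det = 0.0
    for flow in reversed(flows):
        x = flow.inverse(x)
        log_det += flow.log_abs_det()
    return standard_normal_logpdf(x) - log_det

flows = [AffineFlow(2.0, 1.0), AffineFlow(0.5, -3.0)]  # composed: x = z - 2.5
x = sample_to_data(flows, 0.0)
log_p = log_density(flows, x)
```

The same object supports both generation and density estimation, which is the property that distinguishes flows from GANs in the discussion above.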

#### StyleFlow

A notable recent controllable formulation based on StyleGAN is StyleFlow. Recall that, as opposed to a conditional GAN, a controllable GAN does not need to be trained with the desired attributes, but instead allows control over desired features in the input latent space. Theoretically, this should be straightforward in a perfectly disentangled latent space, but even in highly disentangled models like StyleGAN2, some entanglement is persistent. It has been postulated [1] that in a purely unsupervised setting, complete disentanglement is impossible. This presents a challenge for tasks such as attribute-controlled generation or image editing, where it is important to both accurately generate desired conditions and to preserve the identity of unrelated features.

StyleFlow approaches that problem via a normalizing flow Φ(z, a) that maps from z space to w space based on desired attributes a, such that w = Φ(z, a). The flow consists of multiple blocks, each of which improves the mapping from the distribution of z (which is known) to the unknown distribution of w. This flow model is trained separately from the GAN, and is unrelated to the StyleGAN mapping network. It is trained using a large number of randomly sampled z values, their w values, and the corresponding attributes, which can be easily produced by applying a pretrained CNN to the images generated from z.

This architecture allows for attribute-based generation without a tradeoff in image quality, and allows for editing of arbitrary images based on desired attributes. Unlike previous methods, it also allows multiple edits of the same image to be performed sequentially without a penalty in fidelity or identity preservation. Assuming access to a pretrained image classification network, data preparation is extremely straightforward, and does not require the computationally expensive step of retraining a state-of-the-art GAN. [1][31]

*Figure 17: Sequential edits of real faces using StyleFlow. The first column shows a real image, the second shows the projection into the StyleGAN latent space, and the following columns show sequential edits of the attributes specified.*

##### Practical Uses for GANs

Finally, here are a few examples of real-world applications of GANs.

##### Augmented Reality

As a part of their AR functionality, Apple uses GANs to render reflections. [9]

*Figure 18: GAN-generated environment maps are a part of Apple’s AR framework. Reflections need to include information that is not in view of the camera (left), so a GAN is used to generate the missing information. This leads to a more accurate reflection (right). [9]*

##### Data Augmentation

GANs are used at companies including Apple [9] and IBM [45] for data augmentation. They are most useful here to oversample data from classes that are underrepresented in the dataset. [45]

#### GAN Art

A number of artists have used AI to create art. In the last few years, GANs have been used by various artists.

*Figure 19: Art by Derrick Schultz created using StyleGAN2. (left) [38] (right) [37]*

*Figure 20: An artwork by Scott Eaton. It was generated from a drawing using a custom Pix2Pix model. [7]*

#### Deepfakes

A widely discussed use of GANs with wide-reaching societal implications is the creation of deepfakes, a technique in which videos of entirely fictional events can be created. A discussion of the implications of deepfakes is beyond the scope of this thesis; they are further discussed in [35].

*Figure 21: A screen capture of a deepfake of former US president Obama. Video via [4]*

##### Further Reading

If any concepts discussed here are unfamiliar, the comprehensive deep learning textbook by Goodfellow, Bengio, and Courville [10] and Andrew Ng’s courses on machine learning [27][26] are great resources. For more on GANs, Sharon Zhou’s course on the topic [45] is recommended. Many exciting GAN architectures that could not be explored here exist. These include

- Several promising semi-supervised GANs (Chakraborty et al. [5], Nie et al. [28]),
- A conditional GAN in which images can be generated using subjective attributes that are hard to quantify (Saquil, Kim, and Hall [36]),
- A GAN for 3-dimensional shapes (Wu et al. [43]),
- Text generation GANs (e.g. Fedus, Goodfellow, and Dai [8]),
- and many more

##### Conclusion

GANs are powerful tools. Different architectures accomplish tasks as varied as image generation with a fidelity that can fool even careful human observers (StyleGAN), image-to-image translation, both with paired (Pix2Pix) and unpaired (CycleGAN) data, or precise editing of specific, learned attributes (StyleFlow). As many of the early challenges with GANs, including mode collapse and training instability, have been drastically reduced, they are now used widely by companies, researchers and even artists alike.

Few concepts in the field of machine learning have seen greater improvements in a shorter amount of time, or captured more minds than GANs have in the past seven years. They are already an essential part of any machine learning practitioner’s toolkit, and only time will tell what they will be used for next.

#### Written by **Bernhard Böck** & Teo Maayan

#### Artificial Intelligence & Machine Learning – Rebellion Research

[1] Rameen Abdal et al. StyleFlow: Attribute-Conditioned Exploration of StyleGAN-Generated Images Using Conditional Continuous Normalizing Flows. Sept. 20, 2020. arXiv: 2008.02401 [cs]. url: http://arxiv.org/abs/2008.02401 (visited on 02/17/2021).

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. Dec. 6, 2017. arXiv: 1701.07875 [cs, stat]. url: http://arxiv.org/abs/1701.07875 (visited on 03/21/2021).

[3] Sam Bond-Taylor et al. Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models. Mar. 8, 2021. arXiv: 2103.04922 [cs, stat]. url: http://arxiv.org/abs/2103.04922 (visited on 04/03/2021).

[4] BuzzFeedVideo, director. You Won’t Believe What Obama Says In This Video! url: https://www.youtube.com/watch?v=cQ54GDm1eL0.

[5] Arunava Chakraborty et al. S2cGAN: Semi-Supervised Training of Conditional GANs with Fewer Labels. Oct. 23, 2020. arXiv: 2010.12622 [cs, stat]. url: http://arxiv.org/abs/2010.12622 (visited on 04/03/2021).

[6] Ugur Demir and Gozde Unal. Patch-Based Image Inpainting with Generative Adversarial Networks. Mar. 20, 2018. arXiv: 1803.07422 [cs]. url: http://arxiv.org/abs/1803.07422 (visited on 04/04/2021).

[7] Scott Eaton. Humanity (Fall of the Damned). url: http://www.scotteaton.com/2019/humanity-fall-of-the-damned (visited on 04/03/2021).

[8] William Fedus, Ian Goodfellow, and Andrew Dai. “MaskGAN: Better Text Generation via Filling in the ____”. In: 2018. url: https://openreview.net/pdf?id=ByOExmWAb.

[9] Ian Goodfellow. “Generative Adversarial Networks” (GANs for Good- A Virtual Expert Panel by DeepLearning.AI). url: https://www.youtube.com/watch?v=9d4jmPmTWmc&t=249s.

[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[11] Ian J. Goodfellow et al. Generative Adversarial Networks. Version 1. June 10, 2014. arXiv: 1406.2661 [cs, stat]. url: http://arxiv.org/abs/1406.2661 (visited on 07/07/2020).

[12] Will Grathwohl et al. FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models. Oct. 22, 2018. arXiv: 1810.01367 [cs, stat]. url: http://arxiv.org/abs/1810.01367 (visited on 04/05/2021).

[13] Ishaan Gulrajani et al. Improved Training of Wasserstein GANs. Dec. 25, 2017. arXiv: 1704.00028 [cs, stat]. url: http://arxiv.org/abs/1704.00028 (visited on 03/21/2021).

[14] Martin Heusel et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Jan. 12, 2018. arXiv: 1706.08500 [cs, stat]. url: http://arxiv.org/abs/1706.08500 (visited on 04/02/2021).

[15] Jonathan Hui. GAN — StyleGAN & StyleGAN2. Medium. Mar. 10, 2020. url: https://jonathan-hui.medium.com/gan-stylegan-stylegan2-479bdf256299 (visited on 04/05/2021).

[16] Moshe Idel. Golem: Jewish Magical and Mystical Traditions on the Artificial Anthropoid. SUNY Series in Judaica. Albany, N.Y: State University of New York Press, 1990. 323 pp. isbn: 978-0-7914-0160-6 978-0-7914-0161-3.

[17] Phillip Isola et al. Image-to-Image Translation with Conditional Adversarial Networks. Nov. 26, 2018. arXiv: 1611.07004 [cs]. url: http://arxiv.org/abs/1611.07004 (visited on 04/04/2021).

[18] Eric Jang. Normalizing Flows Tutorial, Part 1: Distributions and Determinants. url: https://blog.evjang.com/2018/01/nf1.html.

[19] Neal Jean. Fréchet Inception Distance. Neal Jean. July 15, 2018. url: nealjean.com/ml/frechet-inception-distance/ (visited on 04/02/2021).

[20] Tero Karras, Samuli Laine, and Timo Aila. “A Style-Based Generator Architecture for Generative Adversarial Networks”. In: (), p. 10.

[21] Tero Karras et al. Analyzing and Improving the Image Quality of StyleGAN. Mar. 23, 2020. arXiv: 1912.04958 [cs, eess, stat]. url: http://arxiv.org/abs/1912.04958 (visited on 03/29/2021).

[22] Ivan Kobyzev, Simon J. D. Prince, and Marcus A. Brubaker. “Normalizing Flows: An Introduction and Review of Current Methods”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2020), pp. 1–1. issn: 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2020.2992934. arXiv: 1908.09257. url: http://arxiv.org/abs/1908.09257 (visited on 03/30/2021).

[23] Y. Lecun et al. “Gradient-Based Learning Applied to Document Recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. doi: 10.1109/5.726791.

[24] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep Multi-Scale Video Prediction beyond Mean Square Error. Feb. 26, 2016. arXiv: 1511.05440 [cs, stat]. url: http://arxiv.org/abs/1511.05440 (visited on 04/03/2021).

[25] Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. Nov. 6, 2014. arXiv: 1411.1784 [cs, stat]. url: http://arxiv.org/abs/1411.1784 (visited on 03/16/2021).

[26] Andrew Ng. Deep Learning. Coursera. url: https://www.coursera.org/specializations/deep-learning (visited on 04/01/2021).

[27] Andrew Ng. Machine Learning. Coursera. url: https://www.coursera.org/learn/machine-learning (visited on 04/01/2021).

[28] Weili Nie et al. Semi-Supervised StyleGAN for Disentanglement Learning. Nov. 25, 2020. arXiv: 2003.03461 [cs]. url: http://arxiv.org/abs/2003.03461 (visited on 04/03/2021).

[29] Ovid, D. A. Raeburn, and D. C. Feeney. Metamorphoses: A New Verse Translation. New edition. Penguin Classics. London: Penguin Classics, an imprint of Penguin Books, 2014. 725 pp. isbn: 978-0-14-139461-9.

[30] Justin Pinkney. Awesome Pretrained StyleGAN2. Apr. 4, 2021. url: https://github.com/justinpinkney/awesome-pretrained-stylegan2 (visited on 04/05/2021).

[31] Danilo Jimenez Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. June 14, 2016. arXiv: 1505.05770 [cs, stat]. url: http://arxiv.org/abs/1505.05770 (visited on 03/13/2021).

[32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. May 18, 2015. arXiv: 1505.04597 [cs]. url: http://arxiv.org/abs/1505.04597 (visited on 03/23/2021).

[33] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. “The Earth Mover’s Distance as a Metric for Image Retrieval”. In: International Journal of Computer Vision 40.2 (2000), pp. 99–121. issn: 09205691. doi: 10.1023/A:1026543900054. url: http://link.springer.com/10.1023/A:1026543900054 (visited on 03/21/2021).

[34] M. Sami and I. Mobin. “A Comparative Study on Variational Autoencoders and Generative Adversarial Networks”. In: 2019 International Conference of Artificial Intelligence and Information Technology (ICAIIT). 2019, pp. 1–5. doi: 10.1109/ICAIIT.2019.8834544.

[35] Ian Sample. “What Are Deepfakes – and How Can You Spot Them?” In: The Guardian (). url: https://www.theguardian.com/technology/2020/jan/13/what-are-deepfakes-and-how-can-you-spot-them.

[36] Yassir Saquil, Kwang In Kim, and Peter Hall. Ranking CGANs: Subjective Control over Semantic Image Attributes. July 24, 2018. arXiv: 1804.04082 [cs]. url: http://arxiv.org/abs/1804.04082 (visited on 02/14/2021).

[37] Derrick Schultz. #bbvday2020. url: https://artificial-images.com/project/bbvday-stylegan-prints (visited on 04/03/2021).

[38] Derrick Schultz. #bbvday2021. url: https://artificial-images.com/project/bbvday2021-stylegan-prints (visited on 04/03/2021).

[39] Connor Shorten. StyleGAN2. url: https://towardsdatascience.com/stylegan2-ace6d3da405d.

[40] A. M. Turing. “I.—COMPUTING MACHINERY AND INTELLIGENCE”. In: Mind LIX.236 (Oct. 1950), pp. 433–460. issn: 0026-4423. doi: 10.1093/mind/LIX.236.433. eprint: https://academic.oup.com/mind/article-pdf/LIX/236/433/30123314/lix-236-433.pdf. url: https://doi.org/10.1093/mind/LIX.236.433.

[41] Phil Wang. This Person Does Not Exist. This Person Does Not Exist. url: https://thispersondoesnotexist.com/ (visited on 12/31/2020).

[42] Lilian Weng. “Flow-Based Deep Generative Models”. In: lilianweng.github.io/lil-log (2018). url: http://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html.

[43] Jiajun Wu et al. “Learning a Probabilistic Latent Space of Object Shapes via 3d Generative-Adversarial Modeling”. In: Advances in Neural Information Processing Systems. 2016, pp. 82–90.

[44] Richard Zhang et al. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. Apr. 10, 2018. arXiv: 1801.03924 [cs]. url: http://arxiv.org/abs/1801.03924 (visited on 03/30/2021).

[45] Sharon Zhou. Generative Adversarial Networks (GANs) | Coursera. Coursera. url: https://www.coursera.org/specializations/generative-adversarial-networks-gans?= (visited on 12/31/2020).

[46] Jun-Yan Zhu et al. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. Aug. 24, 2020. arXiv: 1703.10593 [cs]. url: http://arxiv.org/abs/1703.10593 (visited on 09/13/2020).