1 Introduction

As in many other medical fields, machine learning driven skin image analysis suffers from data scarcity and class imbalance. In the ISIC2018 challenge, the provided dataset comprises only 10,000 labeled training samples, and the class distribution is heavily skewed among the seven categories of skin lesions due to the rare nature of some pathologies. To tackle the problem of limited training data, state-of-the-art approaches for skin lesion classification and segmentation rely on heavy data augmentation [9, 18] or webly supervised learning [11]. As an alternative, synthetic images could open up new ways to deal with these problems, and Generative Adversarial Networks (GANs) [5] have shown outstanding results for such synthesis tasks. In the computer vision community, GANs have been successfully used to generate realistic-looking images of indoor and outdoor scenery [3, 13], faces [13] and handwritten digits [5]. Conditional variants [10] have also set the new state of the art in super-resolution [8] and image-to-image translation [6]. A few of these successes have been translated to the medical domain, with applications in cross-modality image synthesis [16], CT image denoising [17] and the pure synthesis of biological images [12], PET images [2] and OCT patches [14]. First successful attempts at medical data augmentation using GANs have been made in [1, 4], however only at the level of small patches.

Fig. 1. Samples generated with the different models.

In contrast to many other medical classification problems, skin lesion segmentation and classification models often utilize ImageNet-pretrained models, which rely on input data with resolutions of \(224\times 224\,\)px or higher. For image synthesis, this implies that higher-resolution images need to be generated without trading off realism. Thoroughly engineered, unconditional architectures such as the DCGAN [13] or the LAPGAN [3] have proven to work well for high-quality image synthesis from noise, however only at fairly low resolutions. Conditional approaches [15] have shown that both high-quality and high-resolution image synthesis up to \(2048\times 1024\,\)px is possible when mapping from semantic label maps to synthetic images with a hierarchy of conditional GANs; however, this setting requires well-structured input to the generator. Recently, progressive growing of GANs (PGAN) [7] has shown outstanding results for realistic image synthesis of faces at resolutions up to \(1024\times 1024\,\)px, without the need for any conditioning.

Contribution. In this work, we synthesize skin lesion images at sufficiently high resolution while ensuring high quality and realism. For our experiments, we utilize dermoscopic images of benign and malignant skin lesions provided by the ISIC2018 challenge. For data synthesis, we employ the PGAN and compare it to the DCGAN and the LAPGAN. As PGANs can natively only synthesize images whose size is a power of 2, we aim for a target resolution of \(256\times 256\,\)px, such that state-of-the-art classifiers could potentially leverage the samples. A quantitative comparison of the image statistics of the synthetic and real images shows that the PGAN matches the training dataset distribution very well, and visual exploration further corroborates its superiority over the other approaches in terms of sample diversity, sharpness and artifacts. Ultimately, we evaluate the quality of the PGAN samples in a user study involving 3 expert dermatologists as well as 5 deep learning experts, showing that the experts have a hard time distinguishing between real and fake images.

The remainder of this manuscript is organized as follows: We first briefly recapitulate the GAN framework and the different GAN concepts before describing the experimental setup. Afterwards, we introduce the dataset and evaluation metrics, provide a quantitative comparison of the aforementioned concepts for skin lesion synthesis, and present the results of our user study. We conclude with a discussion and an outlook on future work.

2 Skin Lesion Synthesis

2.1 Generative Adversarial Networks

The original GAN framework consists of a pair of adversarial networks: A generator network G tries to transform random noise \(z \sim p_z\) from a prior distribution \(p_z\) (usually a standard normal distribution) into realistic-looking images \(G(z) \sim p_{fake}\). At the same time, a discriminator network D aims to distinguish between samples coming from the real training data distribution \(x \sim p_{real}\) and fake samples G(z) produced by the generator. By utilizing the feedback of the discriminator, the generator G can be adjusted such that its samples are more likely to fool the discriminator in its classification task, ultimately teaching the generator to approximate the training dataset distribution. Mathematically speaking, the networks play a two-player minimax game against each other:

$$\begin{aligned} \min _{G} \max _{D} V(D,G) = \mathbb {E}_{x \sim p_{real}(x)}[\log (D(x))] + \mathbb {E}_{z \sim p_z(z)}[\log (1-D(G(z)))] \end{aligned}$$
(1)

In consequence, as D and G are updated in an alternating fashion, the discriminator D becomes better at distinguishing between real and fake samples, while the generator G learns to produce ever more realistic samples.
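To make the alternating update scheme concrete, the following minimal sketch implements one training step; `G` and `D` are assumed to be hypothetical PyTorch modules with D outputting raw logits, and the generator update uses the common non-saturating variant of Eq. (1) rather than the saturating inner term, as already suggested in [5]:

```python
import torch
import torch.nn.functional as F

def gan_train_step(G, D, opt_G, opt_D, x_real, z_dim=512):
    """One alternating update: first D, then G (non-saturating loss)."""
    z = torch.randn(x_real.size(0), z_dim, device=x_real.device)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    opt_D.zero_grad()
    d_real, d_fake = D(x_real), D(G(z).detach())
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    loss_D.backward()
    opt_D.step()

    # Generator step: maximize log D(G(z)) instead of minimizing
    # log(1 - D(G(z))), which yields stronger gradients early in training.
    opt_G.zero_grad()
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```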

In this work, we employ three different GAN concepts for the task of high resolution skin lesion synthesis, namely the DCGAN, the LAPGAN and finally the very recent PGAN. An overview of the setup is given in Fig. 2.

Fig. 2. An overview of the PGAN employed for skin lesion synthesis.

The DCGAN architecture is a popular and well engineered convolutional GAN that is fairly stable to train and has proven to yield high-quality results at a resolution of \(64\times 64\,\)px. The architecture is carefully designed, with concepts such as leaky ReLU activations to avoid sparse gradients and a specific weight initialization to allow for robust training.
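For reference, a minimal sketch of a DCGAN-style generator for \(64\times 64\,\)px output, following the design rules of [13] (fractionally-strided convolutions, batch normalization, no fully connected hidden layers; the leaky ReLUs live in the discriminator, which is not shown). The channel widths are our assumption:

```python
import torch.nn as nn

def dcgan_generator(z_dim=100, ch=64):
    """DCGAN-style generator: latent vector (as 1x1 map) -> 64x64 RGB image."""
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0, bias=False),  # 1x1   -> 4x4
        nn.BatchNorm2d(ch * 8), nn.ReLU(True),
        nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1, bias=False), # 4x4   -> 8x8
        nn.BatchNorm2d(ch * 4), nn.ReLU(True),
        nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1, bias=False), # 8x8   -> 16x16
        nn.BatchNorm2d(ch * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1, bias=False),     # 16x16 -> 32x32
        nn.BatchNorm2d(ch), nn.ReLU(True),
        nn.ConvTranspose2d(ch, 3, 4, 2, 1, bias=False),          # 32x32 -> 64x64
        nn.Tanh(),  # map outputs to [-1, 1]
    )
```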

The LAPGAN is a generative image synthesis framework inspired by the concept of Laplacian pyramids. In essence, it consists of a hierarchy of GANs, where the first generator \(G_0\) is trained to synthesize low-resolution images from noise. Each successive generator \(G_i\) maps from the upsampled output of the previous generator \(G_{i-1}\) to a residual image, which is added to the upsampled input in order to obtain a compelling higher-resolution image, as sketched below.
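The residual composition can be summarized in a few lines. In this sketch, `G0` and `generators` are hypothetical trained stage networks, each later stage taking an upsampled image and noise and returning a residual:

```python
import torch
import torch.nn.functional as F

def lapgan_sample(G0, generators, z_dims):
    """LAPGAN sampling: G0 draws a coarse image from noise; every later
    stage adds a generated residual to the 2x-upsampled previous output."""
    img = G0(torch.randn(1, z_dims[0]))             # coarse sample, e.g. 8x8
    for G_i, z_dim in zip(generators, z_dims[1:]):
        up = F.interpolate(img, scale_factor=2,
                           mode='bilinear', align_corners=False)
        residual = G_i(up, torch.randn(1, z_dim))   # conditioned on upsampled image
        img = up + residual                         # Laplacian reconstruction
    return img
```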

The PGAN utilizes the idea of progressive growing [7] to facilitate high-resolution image synthesis from noise at unprecedented levels of quality and realism. As opposed to the LAPGAN, the PGAN consists of only a single generator and a single discriminator, which both start as small networks and grow in depth and model complexity during training (see Fig. 2). Gradually, the output resolution of the generator and the input resolution of the discriminator are ramped up simultaneously, leading to very stable training behavior and very realistic synthetic images at resolutions up to \(1024\times 1024\,\)px.
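The key mechanic is the smooth fade-in of each newly added resolution block. The following is a simplified sketch of the generator-side blending in [7]; `blocks` and `to_rgb` are hypothetical lists of convolutional stages and 1x1 output projections:

```python
import torch.nn.functional as F

def pgan_generator_forward(x, blocks, to_rgb, depth, alpha):
    """Progressive-growing forward pass.
    depth: index of the newest (highest-resolution) block,
    alpha: its fade-in weight, linearly ramped from 0 to 1."""
    for block in blocks[:depth]:
        x = block(x)
    if depth > 0 and alpha < 1.0:
        # Blend the upsampled RGB output of the previous stage
        # with the output of the newly added block.
        skip = F.interpolate(to_rgb[depth - 1](x), scale_factor=2)
        x = blocks[depth](x)
        return alpha * to_rgb[depth](x) + (1.0 - alpha) * skip
    x = blocks[depth](x)
    return to_rgb[depth](x)
```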

3 Experiments and Results

In the first part of our experiments, we train a PGAN and, to demonstrate its superiority over other concepts, also a DCGAN and a LAPGAN for skin lesion synthesis at a resolution of \(256\times 256\,\)px. Subsequently, we investigate the properties of the synthetic samples both quantitatively and qualitatively. In the second part of our experiments, we conduct a user study to verify the realism of the generated images.

3.1 Dataset

For our experiments, we utilize the ISIC2018 dataset consisting of 10,000 dermoscopic images of both benign and malignant skin lesions (see Fig. 1a). The megapixel dermoscopic images are center-cropped to square size and downsampled to \(256\times 256\,\)px. No data augmentation or further pre-processing was applied.
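For reference, this cropping and downsampling step can be reproduced as follows (a sketch assuming the images can be read with PIL; the file path is a placeholder):

```python
from PIL import Image

def preprocess(path, target=256):
    """Center-crop a dermoscopic image to a square, then resize it."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    s = min(w, h)
    left, top = (w - s) // 2, (h - s) // 2
    img = img.crop((left, top, left + s, top + s))
    return img.resize((target, target), Image.LANCZOS)
```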

3.2 Evaluation Metrics

A variety of methods have been proposed for evaluating how well GANs capture data distributions and for judging the quality of synthesized images. In order to evaluate visual fidelity, numerous works have utilized either crowdsourcing or expert user studies. We also conduct such a user study to rate the realism of our synthetic images. In addition, we discuss the visual fidelity of the generated images with a focus on diversity, realism, sharpness and artifacts. For quantitatively judging sample realism, the Sliced Wasserstein Distance (SWD) has recently been shown to be a reasonably good metric for approximately comparing image distributions [7], thus we also make use of it.
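The core of the SWD can be written compactly: high-dimensional samples are projected onto random unit directions, and the resulting 1-D distributions are compared via sorting. Note that the full protocol of [7] additionally extracts Laplacian pyramid patch descriptors, which this sketch omits:

```python
import torch

def sliced_wasserstein(A, B, n_proj=512):
    """Approximate Wasserstein distance between two point clouds
    A and B of identical shape (N, d) via random 1-D projections."""
    proj = torch.randn(A.size(1), n_proj)
    proj /= proj.norm(dim=0, keepdim=True)   # unit-norm directions
    # Project, sort each 1-D marginal, and average the mismatch.
    pA, _ = torch.sort(A @ proj, dim=0)
    pB, _ = torch.sort(B @ proj, dim=0)
    return (pA - pB).abs().mean()
```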

3.3 Image Synthesis

We trained a PGAN as described in [7] on all 10,000 images, as well as a DCGAN and a LAPGAN. The PGAN was trained for 3M iterations, until the SWD between the synthetic samples and the training dataset no longer decreased noticeably. For a valid comparison, the LAPGAN and the DCGAN were also trained for the same number of iterations.

Fig. 3. Artifacts produced by the different models. DCGAN samples show characteristic checkerboard patterns (left), the LAPGAN produces high-frequency artifacts (middle), whereas the PGAN only has problems synthesizing hair (right).

Fig. 4. Walking along the visual manifold of synthetic PGAN samples.

Per model, we then generate 10,000 synthetic images and compare their distribution to the real data by means of the SWD (see Table 1). Since the SWD constitutes an approximation, we also compute the SWD between the real data and itself to obtain a lower bound. The lowest SWD is clearly obtained with the PGAN samples, whereas the DCGAN and the LAPGAN perform considerably worse, at a similar level to each other. This is also reflected by a visual exploration of the samples (see Fig. 1 for a comparison of samples generated with the different models). The DCGAN samples are prone to checkerboard artifacts (Fig. 3, left) and can thus easily be identified as fake. The LAPGAN samples (Fig. 3, middle) seem more realistic and diverse, but close inspection reveals a vast amount of high-frequency artifacts, which again negatively impact the realism of these samples. The PGAN samples (Fig. 3, right) seem highly realistic; only filamentary structures such as hair raise suspicion.

Table 1. Sliced Wasserstein Distances (SWDs) between the real and generated samples from different models. Closest to the lower bound (i.e. SWD between real images and themselves) is the PGAN, whereas the distribution of DCGAN and LAPGAN samples differs considerably from the real one.
Table 2. Confusion matrix coefficients, Accuracy, TPR & TNR per voter.

Exploring the Visual Manifold. Since the PGAN samples look so compelling, the model might simply have memorized the training dataset. Therefore, we explore the manifold of synthetic samples. The smooth transitions among samples provide clear evidence that memorization did not occur (see Fig. 4).
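Such manifold walks can be produced by interpolating between random latent codes; a minimal sketch, assuming a trained generator `G` that maps a batch of latent vectors to images:

```python
import torch

def latent_walk(G, z_dim=512, steps=8):
    """Render a row of images by linearly interpolating between two
    random latent codes; smooth transitions argue against memorization."""
    z0, z1 = torch.randn(1, z_dim), torch.randn(1, z_dim)
    ts = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    z = (1 - ts) * z0 + ts * z1        # (steps, z_dim)
    with torch.no_grad():
        return G(z)                    # (steps, 3, H, W)
```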

3.4 Visual Turing Test

In order to judge the realism of the generated images, we conduct a so-called Visual Turing Test (VTT) involving 3 expert dermatologists (ED) and 5 deep learning experts (DLE). Each participant is asked to classify the same random mix of generated and real images as being either real (class 1) or fake (class 0). The DLEs are familiar with common GAN artifacts and are thus expected to be skilled at identifying implausible generated images, even though they do not have experience in judging actual skin lesion images. On the other hand, the EDs are not aware of these deep-learning-induced image artifacts, but instead know the gamut of possible skin lesion phenotypes.

Fig. 5. Visual Turing Test results.

Using the PGAN, we first generate 30 synthetic images, which are then mixed with 50 randomly chosen images from the real training dataset. In the VTT, we present these 80 images to each participant in random order and ask for a classification. The performances of all participants in terms of the TPR (how many real images have been identified as real), the FPR (how many fake images have been classified as real) and the accuracy are reported in Fig. 5a. Performance statistics among EDs and DLEs are provided in Fig. 5b, and the complete user study details can be found in Table 2. Interestingly, the classification accuracy is slightly lower for the EDs than for the DLEs. Overall, the accuracy is just slightly above 50%, implying that the experts can distinguish between real and fake only slightly better than chance. Not all fakes were mistaken for real (on average 56% were), but on average 42% of the real images were also mistakenly identified as fake. All in all, none of the participants was able to reliably distinguish the fake samples from real ones, leading to the conclusion that these synthetic samples are in fact highly realistic.
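The reported per-voter scores follow directly from the confusion matrix. For reference, a small helper (labels: real = 1, fake = 0; the function name is ours):

```python
def vtt_scores(y_true, y_pred):
    """TPR (real rated real), FPR (fake rated real) and accuracy
    from binary VTT answers (1 = real, 0 = fake)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn), (tp + tn) / len(y_true)
```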

4 Discussion and Conclusion

We have shown that with the help of PGANs, we are able to generate extremely realistic dermoscopic images, which opens up new opportunities to tackle the problems of data scarcity and class imbalance. Yet, it is unclear to which extent such synthetic data provide additional information to supervised deep learning models. In fact, a variety of questions need to be answered, such as (i) whether there is an information gain in the synthetic samples over the actual training dataset, (ii) whether this gain is higher than that of standard data augmentation, and (iii) how many training images are in fact required to obtain reliable generative models. Notably, we trained the PGAN ignoring the presence of different classes. For generating images along with class information, one would need to leverage labeled data and effectively train a single model per class. Further, the synthetic images are not always perfect. In particular, the methodology has to be enhanced to account for filamentary structures. In future work, we aim to perform large-scale experiments and strive to answer these questions.

Overall, we have shown that we can synthesize images of skin lesions at yet unprecedented levels of realism. In fact, the level of realism is so high that experts from both the medical and the deep learning fields were not able to reliably distinguish real images from generated ones. This leaves us confident that such synthetic data can be leveraged for new data augmentation approaches.