Project Title: Comics Illustration Synthesizer using Generative Adversarial Networks

Responsible Professor: Lydia Y. Chen


PhD Comics is a newspaper and webcomic strip written and drawn by Jorge Cham that follows the lives of several grad students. First published in 1997 when Cham was a grad student himself at Stanford University, the strip deals with issues of life in graduate school, including the difficulties of scientific research, the perils of procrastination, and the complex student-supervisor relationship.

Figure 1: An example from PhD Comics

Specific problem scenario: Drawing illustrations is time-consuming and expensive, so we ask whether machine learning can learn to produce illustrations from the dialogue in comics or from descriptions of the illustrations, using this text to condition a generative adversarial network (GAN). We will first build a text-image pair dataset by extracting dialogue and descriptions of illustrations from existing comics. Based on this dataset, we will adopt transfer learning and text-to-image generation approaches to build a text visualization model. Lastly, we will conduct both an automatic and a human evaluation.

Objective: In this project, we aim to develop a text-to-image generative adversarial network for generating comic illustrations. To be concrete about the comic style, we focus specifically on PhD Comics.

The whole project involves two main aspects: (1) data preparation and (2) model construction. In the data preparation step, to collect data automatically, students must align the text with the images and classify the characters and comic formats (e.g., 2, 3, or 4 grids). This work is not trivial and requires the help of machine learning. Furthermore, since raw text cannot be fed directly into a machine learning model, a proper text representation is needed. Bidirectional Encoder Representations from Transformers (BERT) is recommended for this purpose. As our case study is comics, an interesting side branch on humour detection using BERT is also proposed. In the model construction step, two types of model are proposed. First, we propose to construct a comics-to-comics conditional GAN, which explores using comic illustrations to generate comic illustrations. Second, we propose to implement a text-to-image GAN, which uses text to generate comic illustrations.

Novelty: Previous work such as [5, 6] uses text to generate images, and the results on standard datasets are already promising (as shown in Fig. 2). In our proposal, we instead build our own dataset. Since the characters in the comics are human, the model has to capture the characters' emotions and body actions, which makes the task more difficult than generating still objects. This difficulty can also be seen in the fifth column of generated images in Fig. 2.

Figure 2: Examples of images generated by (a) AttnGAN, (b) MirrorGAN Baseline, and (c) MirrorGAN conditioned on text descriptions from CUB and COCO test sets and (d) the corresponding ground truth.

Testbed and baseline

The code for MirrorGAN and AttnGAN is provided together with the datasets they were applied to. Reproducing their results is therefore a good way to become familiar with the text-to-image GAN structure. Once we have collected our aligned PhD Comics text-image dataset, it should be straightforward to adapt its data format to these algorithms.

1 Text-image dataset construction - extracting text from image

  • Research Question: In this step, text-image pairs should be aligned. Students first need to decide which information the dataset uses: either pure dialogue (inside the illustrations) paired with illustrations, or pure descriptions (of the illustrations) paired with illustrations. They should then develop a pipeline that automatically downloads all the PhD Comics, segments the sub-figures from each comic's illustration, and extracts the text from each comic.

  • Method: Web crawling with Python 3 can be used to download the comics, and an Optical Character Recognition (OCR) tool (e.g., Google's Tesseract OCR engine) can be used to extract the text from each comic.
  • Outcome: a data pipeline that automatically builds the text-to-image dataset from PhD Comics illustrations.
  • Related work: the paper [4], which automatically builds a large celebrity face dataset (i.e., Facescrub), is a good reference for learning the process.
  • Timeline: 2 weeks to explore web crawling, 3 weeks to explore figure segmentation, 3 weeks to explore extracting text from images, 2 weeks to analyze the results and write the report.
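
As an illustration of the auto-segmentation step, the following minimal sketch splits a horizontal strip into panels by looking for blank vertical gutters. It is only one possible approach and rests on assumptions not stated in this proposal: the strip has already been binarised (1 for ink, 0 for blank paper), panels are laid out side by side, and gutters contain no ink at all. The function name `find_panels` and the parameter `min_gutter` are hypothetical.

```python
from typing import List, Tuple


def find_panels(image: List[List[int]], min_gutter: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges of panels, with `end` exclusive.

    Assumption: image[r][c] is 1 for ink and 0 for blank paper, and a gutter
    is a run of at least `min_gutter` columns that contain no ink.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: does column c contain any ink at all?
    has_ink = [any(row[c] for row in image) for c in range(width)]

    panels = []
    start = None    # first column of the panel currently being scanned
    blank_run = 0   # consecutive blank columns seen inside that panel
    for c, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = c
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gutter:
                # The panel ended where this blank run began.
                panels.append((start, c - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        # Close the last panel, dropping any short trailing blank run.
        panels.append((start, width - blank_run))
    return panels


# Toy strip: two panels separated by a three-column gutter.
strip = [
    [0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
]
print(find_panels(strip))
```

On real scans the gutters are rarely perfectly clean, so a tolerance on the amount of ink per column, or a learned detector, would likely be needed.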

2 Text-image dataset construction - auto labeling

  • Research Question: In PhD Comics, several specific characters (e.g. Cecilia, Mike, etc. Details here) appear across the illustrations. Students need to annotate the characters in each illustration. The encoded characters can then be used as a conditional vector for the generative adversarial network.
  • Method: It is recommended to use the OpenCV library in Python to extract all faces from the illustrations. After that, first manually clean up a small part of the data, then train a convolutional neural network (CNN) to classify the remaining images.
  • Outcome: a data pipeline that automatically labels the images in PhD Comics.
  • Related work: the paper [4] that built the Facescrub dataset includes the task of extracting faces from images.
  • Timeline: 2 weeks to reproduce the results in [4], 4 weeks to build a small clean annotated dataset and use it to train the CNN, 2 weeks to use the CNN to auto-label the remaining data and test, 2 weeks to analyze the results and write the report.
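
To make the idea of a conditional vector concrete, the sketch below encodes the annotated characters as a multi-hot vector and the comic format as a one-hot vector, then concatenates the two. This is an assumed encoding, not one fixed by this proposal: the character list contains only the two names mentioned above and would need to be extended, the formats are the 2, 3, and 4 grid layouts mentioned earlier, and the function name `condition_vector` is hypothetical.

```python
from typing import Iterable, List

# Assumption: only the two characters named in the text; extend as needed.
CHARACTERS = ["Cecilia", "Mike"]
# Assumption: the comic formats (number of grids) mentioned in the proposal.
FORMATS = [2, 3, 4]


def condition_vector(characters: Iterable[str], n_panels: int) -> List[float]:
    """Multi-hot character encoding followed by a one-hot format encoding."""
    present = set(characters)
    unknown = present - set(CHARACTERS)
    if unknown:
        raise ValueError(f"unknown characters: {sorted(unknown)}")
    if n_panels not in FORMATS:
        raise ValueError(f"unsupported format: {n_panels}")
    char_part = [1.0 if name in present else 0.0 for name in CHARACTERS]
    format_part = [1.0 if n == n_panels else 0.0 for n in FORMATS]
    return char_part + format_part


# A three-grid comic in which only Mike appears.
print(condition_vector(["Mike"], 3))  # [0.0, 1.0, 0.0, 1.0, 0.0]
```

A multi-hot encoding is used for characters because several of them can appear in the same illustration, whereas each comic has exactly one format.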

3 Context representation and humour detection

  • Research Question: The goal is to learn good representations of context that help image generation. Students will use a pre-trained BERT model, whose output can serve as the encoded vector of the text. Furthermore, since the dialogue or descriptions are extracted from the comics in Research Question 1, students can in this step use BERT to encode that text and train a humour detector.
  • Method: Use pre-trained BERT to encode the text extracted from the comics. Then use the structure of ColBERT [1] (Using BERT Sentence Embedding for Humor Detection) to perform classification on the PhD Comics text. The ColBERT code provides a dataset of 200k formal short texts (100k positive, 100k negative).
  • Outcome: a dialogue representation encoder based on pre-trained BERT, and a text humour detector built with ColBERT.

Figure 3: An illustration of a generative adversarial network (GAN) learning to generate handwritten digits from the MNIST dataset. A GAN consists of one generator and one discriminator.

  • Related work: The Bert-as-service implementation already integrates a pre-trained English BERT. A ColBERT implementation is provided in this Git repository, but we will need to add our own text to train a new humour detector.
  • Timeline: 2 weeks to integrate Bert-as-service into Python code and test it on PhD Comics text, 2 weeks to annotate PhD Comics text, 4 weeks to build our own ColBERT, 2 weeks to analyze the results and write the report.

4 Conditional-GAN for comics-to-comics generation

  • Research Question: The goal is to implement a comics-to-comics conditional GAN (CGAN) [3], which uses comic illustrations to generate comic illustrations. The conditional vector can be used to indicate the characters and the comic's format. Additionally, many GAN models suffer from problems such as (1) non-convergence, (2) mode collapse, and (3) diminished gradients, so several stabilisation methods are worth exploring.
  • Method: The state-of-the-art method that integrates Wasserstein distance and gradient penalty (WGAN + GP) [2] should be implemented.
  • Outcome: a comics-to-comics generative adversarial network.
  • Related work: A demo of a conditional Wasserstein GAN with gradient penalty on the MNIST handwritten digit dataset is provided here. Students need to adapt these techniques to the PhD Comics dataset and fine-tune the parameters.
  • Timeline: 2 weeks to understand the structure of GANs, 2 weeks to run a GAN demo on a standard image dataset such as MNIST or CIFAR10, 4 weeks to adapt the PhD Comics illustration format to the WGAN + GP structure, 2 weeks to analyze the results and write the report.
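
For reference, the objective that combines the two ingredients above can be written as follows. The critic loss with gradient penalty follows [2]; conditioning the critic D and the generator G on a vector y follows the idea of [3], and combining the two in exactly this form is an assumption of this sketch rather than a formulation taken from either paper. Here P_r is the real data distribution, P_g the generator distribution, z the noise input, and λ the penalty weight ([2] uses λ = 10).

```latex
\begin{aligned}
L_D &= \mathbb{E}_{\tilde{x} \sim P_g}\!\left[ D(\tilde{x} \mid y) \right]
     - \mathbb{E}_{x \sim P_r}\!\left[ D(x \mid y) \right]
     + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\!\left[
         \left( \left\lVert \nabla_{\hat{x}} D(\hat{x} \mid y) \right\rVert_2 - 1 \right)^2
       \right], \\
\hat{x} &= \epsilon x + (1 - \epsilon)\,\tilde{x}, \qquad \epsilon \sim U[0, 1], \\
L_G &= - \mathbb{E}_{z}\!\left[ D\big( G(z \mid y) \mid y \big) \right].
\end{aligned}
```

The penalty term pushes the norm of the critic's gradient towards 1 at points interpolated between real and generated samples, which is how [2] enforces the Lipschitz constraint without weight clipping.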

5 Text-to-image generative adversarial network

  • Research Question: The goal is first to reproduce the state-of-the-art models for text-to-image generation, and then to adapt the PhD Comics illustrations to train these models and generate comic illustrations.
  • Method: State-of-the-art algorithms such as MirrorGAN [5] and AttnGAN [6] need to be reproduced. The first step is to use the standard datasets provided with the code to reproduce the results in the original papers. The current results on these standard datasets are shown in Fig. 2. Then adapt the data format of the PhD Comics illustrations to train the algorithms and obtain our own models.
  • Outcome: one or two text-to-image generative adversarial networks that can use comic dialogue or descriptions to generate comic illustrations.
  • Related work: Code for MirrorGAN and AttnGAN is provided on GitHub along with the datasets they use.
  • Timeline: 2 weeks to understand the structure of GANs, 4 weeks to reproduce the results of MirrorGAN and AttnGAN, 2 weeks to adapt the PhD Comics illustration format to these two algorithms, 2 weeks to analyze the results and write the report.

Relation Between Research Questions

Successfully answering each of these research questions contributes to the ultimate objective: efficiently collecting and building an annotated image dataset, and gaining deep insight into generative adversarial networks. Toward the end of the project, each research topic can benefit from the findings of the others and move away from the baseline configuration. A thorough investigation of each topic can be a stand-alone workshop paper, and their combination can form a conference paper for a machine learning application track. This project will be carried out in collaboration with a PhD student and a master's student, who will provide the baseline system.


[1] Issa Annamoradnejad. ColBERT: Using BERT Sentence Embedding for Humor Detection. arXiv e-prints, page arXiv:2004.12765, April 2020.

[2] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5767–5777, 2017.

[3] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.

[4] Hongwei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing, ICIP 2014, Paris, France, October 27-30, 2014, pages 343–347. IEEE, 2014.

[5] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. MirrorGAN: Learning text-to-image generation by redescription. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 1505–1514. Computer Vision Foundation / IEEE, 2019.

[6] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1316–1324. IEEE Computer Society, 2018.