Abstract
Deep learning methods for RNA sequencing data have proliferated in recent years owing to the advent of single-cell RNA sequencing (scRNA-seq), which enables the study of multiple cells per patient simultaneously. However, data remain scarce for rare cell types, which poses several challenges and prevents deep learning models from reaching their full predictive power. Generating realistic synthetic cells to augment the data could make downstream analyses more informative. Herein, we introduce Mask-cscGAN, a conditional generative adversarial network (GAN) that generates realistic synthetic cells with desired characteristics and also models gene sparsity by learning a mask of zeros. Applied to augment a dataset of glioblastoma multiforme (GBM) malignant cells, Mask-cscGAN generates realistic synthetic cells of desired cancer subtypes. When used to generate cells of a rare cancer subtype, Mask-cscGAN improves the classification performance for that subtype by 12.29%. Mask-cscGAN is the first method to generate realistic synthetic cells belonging to specified cancer subtypes, and augmentation with Mask-cscGAN outperforms state-of-the-art methods in rare cancer subtype classification.