Abstract
To investigate how plants respond to various types of stress, it is common practice to measure gene expression profiles at several time points. Analyzing such time series transcriptome data can help explain the biological mechanisms that respond to stress. One important question is which genes are related to which stress types. This question can hardly be answered by analyzing the transcriptome from a single experiment, because stress affects so many genes that the analysis yields too many false positives. Integrated analysis of transcriptome data from multiple experiments can improve the situation. However, only a small number of samples are available, while the dimension, i.e., the number of genes, is very high. Thus, a new method is needed to perform integrated analysis of stress time series data. In this study, we designed and implemented a novel machine learning method for predicting stress-related genes from heterogeneous time series transcriptome data. Our method performs feature embedding of time series data so as to minimize data loss, and uses a logical correlation layer that learns a stress-gene correlation weight matrix with cross-entropy and a group effect constraint. The weight matrix learned in the training stage is used to predict stress-related genes, with CMCL (Confident Multiple Choice Learning) loss to prevent parameter overfitting. In experiments on Arabidopsis transcriptome data covering four stress types (heat, cold, salt, and drought), our analysis ranked stress-related genes higher than Fisher’s method applied to DEG p-values. In addition, our prediction model outperformed Random Forest and SVM in stress type prediction. Our system for identifying stress-related genes and predicting stress types using a logical correlation layer and CMCL loss will be useful for analyzing stress time series transcriptome data, and it can be applied to many phenotype-related research problems.
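For readers unfamiliar with the baseline named above, the following is a minimal sketch of Fisher's method for combining per-experiment DEG p-values into one score per gene. It illustrates the standard statistical technique only and is not the authors' pipeline. The function names, the gene identifiers, and the p-values are hypothetical, and the sketch assumes the p-values for a gene are independent.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the survival function has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!, so no external library is needed.
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # this is X / 2

    # Accumulate the series sum_{i=0}^{k-1} half^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return min(1.0, math.exp(-half) * total)


def rank_genes(deg_pvalues):
    """Return gene names sorted from most to least significant.

    deg_pvalues maps a gene name to its list of DEG p-values,
    one per experiment (hypothetical input layout).
    """
    return sorted(deg_pvalues, key=lambda g: fisher_combined_pvalue(deg_pvalues[g]))


# Hypothetical p-values for three genes across four stress experiments.
example = {
    "geneA": [0.01, 0.02, 0.03, 0.04],
    "geneB": [0.5, 0.6, 0.4, 0.7],
    "geneC": [0.001, 0.2, 0.3, 0.05],
}
ranking = rank_genes(example)
```

A ranking of this kind scores each gene by pooled significance alone, without learning which stress type a gene is associated with.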