Abstract
Applying machine learning methods to new drug development can greatly shorten the experimental discovery process and reduce the risk of clinical failure. However, extracting features from protein sequences is difficult because of their high dimensionality. To this end, we propose a Protein Embedding Model (PEM) for drug molecule screening that predicts the interaction between proteins and small molecules. Specifically, PEM first groups the 20 amino acids into 6 categories to reduce the dimension and learns a protein representation by borrowing the idea of word embedding. The model then uses multiple imputation to fill in missing physical and chemical properties of small-molecule compounds. Finally, it uses LightGBM to predict the affinity value Ki between proteins and small molecules. Experiments show that the model effectively extracts the features of proteins and small molecules and outperforms traditional methods on data provided by a drug discovery and development company.
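
As an illustration of the first step, the following minimal sketch reduces a protein sequence to a 6-letter alphabet and splits it into overlapping k-mer "words" that a word-embedding model could be trained on. The grouping in `GROUPS`, the function names `reduce_sequence` and `to_words`, and the choice k = 3 are assumptions made for this example; the abstract does not specify the six categories or the word length actually used by PEM.

```python
# Hypothetical 6-class grouping of the 20 standard amino acids by
# side-chain property. The abstract does not list the actual classes,
# so this partition is an assumption for illustration only.
GROUPS = [
    "AVLIMC",  # class 0: aliphatic or sulfur-containing
    "FWYH",    # class 1: aromatic
    "STNQ",    # class 2: polar, uncharged
    "KR",      # class 3: positively charged
    "DE",      # class 4: negatively charged
    "GP",      # class 5: glycine and proline
]

# Map each one-letter amino acid code to its class label ("0".."5").
AA_TO_CLASS = {aa: str(i) for i, members in enumerate(GROUPS) for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced 6-letter alphabet.

    Raises KeyError for any residue outside the 20 standard amino acids.
    """
    return "".join(AA_TO_CLASS[aa] for aa in seq.upper())


def to_words(reduced, k=3):
    """Split a reduced sequence into overlapping k-mers ("words")."""
    return [reduced[i:i + k] for i in range(len(reduced) - k + 1)]


if __name__ == "__main__":
    reduced = reduce_sequence("MKVLAG")
    print(reduced)
    print(to_words(reduced))
    # Vocabulary size with k = 3: reduced alphabet vs. full alphabet.
    print(len(GROUPS) ** 3, 20 ** 3)
```

With a 6-letter alphabet and k = 3, the vocabulary has at most 216 distinct words instead of 8000, which is the kind of dimension reduction the grouping step aims for. The resulting word lists could then be passed to any word-embedding trainer to obtain a fixed-length protein representation.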