ABLE

Classifying proteins into their respective enzyme classes is of interest to researchers for a variety of reasons. The open-access Protein Data Bank (PDB) contains more than 160,000 structures, with more being added every day. In this project we developed an attention-based bidirectional LSTM model (ABLE), trained on data oversampled with SMOTE, that classifies a protein into one of the six enzyme classes or a negative class. The only input is the protein's primary structure, given as a FASTA sequence string. Our proposed model achieves the highest F1-score, 0.834, on a dataset of proteins from the Protein Data Bank. We benchmark our model against seventeen other machine learning and deep learning models, including a Convolutional Neural Network, a Long Short-Term Memory Network, a Bidirectional Long Short-Term Memory Network, and a Gated Recurrent Unit Network. We perform extensive experimentation and statistical testing to corroborate our results.
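
Because the model takes only the sequence string as input, that string has to be turned into numbers before a recurrent network can read it. The sketch below shows one common way to do this: mapping each residue letter to an integer id and padding to a fixed length. The vocabulary, the id assignments, and the names `encode_sequence`, `PAD`, and `UNK` are illustrative assumptions, not the project's actual preprocessing.

```python
# Assumed vocabulary: the 20 standard amino acids (one-letter codes).
# Ids 0 and 1 are reserved here for padding and unknown residues;
# this layout is hypothetical, chosen only for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, UNK = 0, 1
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(seq, max_len):
    """Map a protein sequence string to a fixed-length list of integer ids."""
    # Truncate to max_len, then look up each residue (unknown letters -> UNK).
    ids = [VOCAB.get(ch, UNK) for ch in seq.upper()[:max_len]]
    # Right-pad with PAD so every encoded sequence has the same length.
    return ids + [PAD] * (max_len - len(ids))


encoded = encode_sequence("MKTAYX", max_len=8)
print(encoded)
```

A fixed-length integer sequence like this could then be passed to an embedding layer ahead of a bidirectional LSTM; the choice of `max_len` and of how to treat non-standard residues would depend on the dataset.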
Read more about this project: Paper Pre-Print