Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins

Jamal, Salma; Ali, Waseem; Nagpal, Priya; Grover, Abhinav; Grover, Sonam

doi:10.1186/s12967-021-02851-0

Research
Open access
Published: 24 May 2021

Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins

Salma Jamal¹,
Waseem Ali¹,
Priya Nagpal²,
Abhinav Grover² &
…
Sonam Grover¹

Journal of Translational Medicine volume 19, Article number: 218 (2021) Cite this article

6984 Accesses
19 Citations
2 Altmetric
Metrics details

Abstract

Background

Post-translational modification (PTM) is a biological process that alters proteins and is therefore involved in the regulation of various cellular activities and pathogenesis. Protein phosphorylation is an essential process and one of the most-studied PTMs: it occurs when a phosphate group is added to serine (Ser, S), threonine (Thr, T), or tyrosine (Tyr, Y) residue. Dysregulation of protein phosphorylation can lead to various diseases—most commonly neurological disorders, Alzheimer’s disease, and Parkinson’s disease—thus necessitating the prediction of S/T/Y residues that can be phosphorylated in an uncharacterized amino acid sequence. Despite a surplus of sequencing data, current experimental methods of PTM prediction are time-consuming, costly, and error-prone, so a number of computational methods have been proposed to replace them. However, phosphorylation prediction remains limited, owing to substrate specificity, performance, and the diversity of its features.

Methods

In the present study we propose machine-learning-based predictors that use the physicochemical, sequence, structural, and functional information of proteins to classify S/T/Y phosphorylation sites. Rigorous feature selection, the minimum redundancy/maximum relevance approach, and the symmetrical uncertainty method were employed to extract the most informative features to train the models.

Results

The RF and SVM models generated using diverse feature types in the present study were highly accurate as is evident from good values for different statistical measures. Moreover, independent test sets and benchmark validations indicated that the proposed method clearly outperformed the existing methods, demonstrating its ability to accurately predict protein phosphorylation.

Conclusions

The results obtained in the present work indicate that the proposed computational methodology can be effectively used for predicting putative phosphorylation sites further facilitating discovery of various biological processes mechanisms.

Background

The post-translational modification (PTM) of proteins plays an extremely important role in numerous cellular functions and biological processes [1], including altering proteins’ physiochemical properties, conformation, localization, and enzymatic activity; it also plays an important role in several other processes, such as cell signaling, regulation of gene expression, and cellular metabolism, to name a few [2]. Over 200 diverse PTMs have been recognized [3], of which phosphorylation is the most abundant and well-established PTM in eukaryotes and is crucial to almost all aspects of cell life.

Protein phosphorylation is a rapid process involved in signal transduction pathways, cell proliferation and differentiation, metabolic activities, regulating protein functions, DNA replication, apoptosis, etc. [4,5,6]. Although PTMs are essential to homeostasis in biological systems, an individual PTM can also disrupt the regulation of complex protein networks, further affecting protein function and leading to many diseases and disorders (most of which are related to aging and dementia) [7]. The most common example of such disruption is the extensive phosphorylation of tau proteins in neurofibrillary tangles, which leads to neurodegenerative disorders such as Alzheimer’s disease and Parkinson’s disease. Over-phosphorylation of tau proteins promotes their aggregation and reduces the stability of microtubules, kinase, and phosphatase activity, thus exacerbating neurotoxicity [8]. The identification and elucidation of the role of PTMs is therefore required to better understand the molecular mechanisms of modified proteins, which could lead to the development of potential disease interventions and treatments.

During phosphorylation, a phosphate group is added to the side chain of an amino acid (AA)—mainly serine (Ser), threonine (Thr), or tyrosine (Tyr), but to a lesser extent to arginine, lysine and histidine residues [9]. This reaction is catalyzed by kinase enzymes and is reversible, during which phosphate groups are removed by specific protein phosphatases [10]. Phosphorylation of an AA residue by protein kinase is also known to depend on the neighboring AAs [11]. Over the years, PTMs have been identified experimentally using biological methods, including mass spectrometry and site-directed mutagenesis [12]. Although these techniques provide a vast amount of data when operated in a high-throughput manner, they are laborious, costly, time-consuming, and often produce false positives and false negatives. A large number of PTMs thus remain unidentified or misclassified, and the associated mechanisms in context of cellular and biological processes are overlooked [13]. The computational prediction of protein phosphorylation sites appears to be a promising alternative strategy for reducing the associated costs and time. The preliminary prediction of phosphorylation sites together with experimentally identified PTMs would expand our knowledge of the molecular mechanisms behind phosphorylation events and aid in protein functional characterization.

Around 40 computational methods have been developed to predict protein phosphorylation sites [14]. For example, Maiti et al. proposed an approach that uses sequence-environment-specific, geometric, and evolutionary information-based features to identify phosphorylation sites using the LightGBM algorithm [13]. Other machine-learning (ML)-based approaches, such as PhosPred-RF [11] and PhosphoSVM [15], use only sequence-based features for predictions based on random forest (RF) and support vector machine (SVM), respectively. NetPhos [16] uses a combination of sequence and structural features for independent and kinase-specific predictions, whereas PPRED uses only evolutionary information to classify phosphorylation sites [17]. PhosphoPredict [10] also uses a combination of sequence and functional features to decipher kinase-specific substrates and their related phosphorylation sites. Other methods that use general and kinase-specific sequences for prediction are MusiteDeep [18], DeepPhos [19], Scansite [20], KinasePhos2.0 [21], and GPS [22].

In the present study, RF and SVM algorithm-based learning were used to predict pS, pT, and pY residues from the protein sequences. A multitude of features, including sequence-based features, physicochemical-property-based (PP) features, structural features (SF), functional features (FF), and functional annotation (FA) represented sequence fragments around phosphorylation sites. A two-step feature selection approach and a minimum redundancy maximum relevance (mRMR) approach, followed by symmetrical uncertainty (SU), produced the most informative features for training the classifiers.

Methodology

Figure 1 depicts the overall workflow of the proposed approach for prediction of phosphorylation sites.

Data collection and pre-processing

The data comprising of experimentally validated phosphorylation sites, pS, pT and pY was extracted from publicly available dbPTM [23] database. This database provides information integrated from several databases that include Phospho.ELM [12], HPRD [24], PhosphoNET [25] amongst others. A total of 256,481 experimental phosphorylation sites belonging to human category were obtained from dbPTM following which after the removal of homologous and redundant sequences, 230,416 phosphorylation sites remained encoded in 21-residues sequence fragments. Finally, a total of 384,591, 61,004 and 125,142 protein sequence fragments were obtained for pS, pT and pY, respectively. The peptides having the central S, T or Y residue phosphorylated were considered as positive dataset and the other non-phosphorylated sites were labelled as negative dataset. The final datasets were divided into 80% training data for learning and 20% testing set for validation using an in-house Perl script. Feature selection and ML model generation was done using training dataset followed by five-fold internal cross validation for performance’s evaluation of the trained classifiers. Further assessment of ML models was carried out by validation with an independent testing dataset using various statistical measures.

Feature encoding approaches

Each peptide sequence was converted into numeric feature vectors which were then used to generate ML models. For each positive and negative sequence fragment, the features were extracted for central residue as well as ten neighboring residues upstream and downstream of central site to capture the local information of the phosphorylation site. A total of 1404 features categorized into groups, PP features, SF, SSF, FF and FA were used to represent each AA in the present study. Additional file 1 provides a comprehensive list of all the extracted features.

Physicochemical property-based features

In order to capture the environment around the central phosphorylated residue, AAindex database [26] was used to obtain numeric vectors representing physicochemical and biochemical properties of each AA. Seven properties used in the present study covered AA composition (AAC), average flexibility indices, hydrophobicity indices, net charge, partition coefficient, residue volume and molecular weight. Each sequence fragment with a length of 21 residues was represented by 7 properties resulting in a 147 (21 × 7 = 147) dimensional vector.

Sequence-based features

Previous studies have shown that neighboring AAs of phosphorylated residue are the key sequence-based features for the prediction of phosphorylation sites [27]. Binary-encoding (BE) method was used in which each AA corresponded to a 20-dimensional binary vector comprised of elements, ‘0’ and ‘1’. For example, Alanine is represented as a vector ‘10000000000000000000’, Serine as ‘00000000000000010000’ and so on and thus we obtained a 420-dimensional vector (21 × 20 = 420).

Structural level features

Three SSF used in the present study are accessible surface area (ASA), secondary structure (coil, helix and strand) and disordered regions. ASA gives an estimate of accessibility of an AA to solvent in a protein thus giving crucial information on the protein structure [28]. Thus, ASA of individual AA residues for each protein sequence was obtained from AAindex. Another factor that gives insights about protein structure is the secondary structural configuration of AAs. A neural network-based prediction tool, PSSpred [29] was used for secondary structure prediction of all the AAs in each protein sequence fragment. It has been observed that phosphorylation sites are usually located in disordered regions, which makes protein disorder an important feature for predicting phosphorylation sites. The native disorder information was predicted using IUPRED2A which is an energy estimation method taking into account differences between ordered and disordered regions [30]. The scores between, 0 and 1 were obtained amongst which residues above 0.5 score were considered as disordered and the others below 0.5 were labelled to be in ordered region. In all, 5 feature vectors denoted structural features resulting in a 105 (21 × 5 = 105) dimensional vector representing each sequence fragment.

Functional features

The functional information incorporated for prediction of phosphorylation sites include gene ontology terms (1) biological process (BP), (2) molecular function (MF) and (3) cellular component (CC); (4) protein functional domain data from InterPro [31]; and (5) KEGG pathway [32] information through DAVID tool [33] which is a gene functional classification tool. A total of 846 FF including 555 GO terms, 177 functional domain types and 114 terms denoting KEGG pathways were acquired and each AA was encoded into ‘0’ and ‘1’ according to the absence and presence of FFs, respectively.

Functional annotations

Using functional annotation tool available from DAVID, functional properties belonging to two categories, UP_SEQ_FEATURE and UP_KEYWORDS, were retrieved. A total of 526 types of protein functional annotations were obtained where an AA residue was denoted by ‘1’ if it had annotation for a particular function and ‘0’ if it was not linked with a specific function.

Combined features models

In order to enhance the prediction performance, all the groups, PP-based, SF, SSF, FF and FA features were pooled resulting in a total of 1404 features to generate learning models.

Feature selection

Feature selection methods were used to choose the most significant and informative features while minimizing the redundancy in the data thereby reducing its dimensionality and computational time and further improve model performances [34]. In the present study, feature selection was performed at two levels: mRMR approach followed by SU attribute selection method.

Minimum redundancy maximum relevance

mRMR is a widely used feature selection method based on mutual information. This approach ranks the features taking into consideration their importance to the classification variable along with the redundancy amongst the features themselves. A higher ranked attribute indicates its high correlation with the classification variable and least redundancy [35]. Top 50 ranked features of each category were selected as the most contributing features.

Symmetrical uncertainty

SU attribute evaluation method weighs the merit of an attribute by determining its uncertainty with reference to other sets of attributes [36]. SU can be calculated by the following equation:

$$SU \left(X,Y\right)= \frac{2*IG (X|Y)}{H \left(X\right)+H (Y)},$$

where IG stands for information gain, H denotes entropy and X and Y represent attributes [37].

Considering the ability of this method to balance the biasness of information gain towards certain attributes [38], SU is a method of choice for a plethora of feature selection tasks [39, 40]. Weka software [41] was used to implement SU in combination with Ranker search which returned a list of top ranked attributes followed by the less significant ones and lastly the least important.

Training machine learning models

Random forest

RF is broadly used ML algorithm used for solving classification problems and making predictions [42,43,44]. This algorithm is based on the ensemble of decision-making trees which yield individual outputs and the most common output of the model is considered as final RF prediction. The node of the tree and the subset of features used for generating trees is chosen randomly [45]. RF has many advantages which made it suitable for use in the present study. It is considered as a highly accurate learning classifier which can efficiently handle large dimensional datasets, deals with overfitting and does not consume a lot of time for training and prediction amongst many others [42, 46]. In this study, the RF models were trained using RandomForest package available from Weka [41].

Support vector machine

SVM is one of the most extensively applied ML algorithm in various computational studies involving classification and regression tasks [44, 47,48,49,50]. This algorithm finds an optimal hyperplane in a high-dimensional feature space using a kernel and then categorizes the input vectors into two classes [51]. The aim is to maximize the gap between input vectors of both the classes. In the present study, SVM was used with radial basis function (RBF) kernel implemented using Weka [41].

Model performance evaluation

To assess the prediction performance of the ML models generated in this work, several statistical measures, accuracy (ACC) that is proportion of correct positive and negative predictions, sensitivity (SN) or true positive rate (TPR), specificity (SP) which is percentage of correctly predicted non-phosphorylated sites, precision (PRE), F-score and the Matthew’s correlation coefficient (MCC) were used [10]. These are defined in the following lines:

$$ACC= \frac{(TP+TN)}{(TP+TN+FP+FN)},$$

$$SN= \frac{TP}{TP+FN},$$

$$SP= \frac{TN}{TN+FP},$$

$$PRE= \frac{TP}{TP+FP},$$

$$F{\text -}score=2 \times \frac{TP}{2TP+FP+FN},$$

$$MCC= \frac{(TP\,\,X\,\,TN)-(FP\,\,X\,\,FN)}{\sqrt{\left(TP+FN\right)X\,\, \left(TP+FP\right)X\,\, \left(TN+FN\right)X\,\,(TN+FP)}},$$

where TP, TN, FP and FN correspond to the numbers of true positives, true negatives, false positives and false negatives. Furthermore, the area under the curve (AUC) calculated from receiver operating characteristic (ROC), which is a plot of SN vs 1 minus SP, was used for evaluating model performances [10].

Results

Input data transformation

In the case of each protein sequence, centered residue was flanked by 10 AAs in forward and backward directions (± 10); the problem we addressed was whether the central residue acts as a phosphorylation site and belongs to class 1 or 0 (a non-phosphorylation site). In the present study, 61.2% instances were phosphate-binding and 38.7% belonged to class 1 (and were thus non-phosphate-binding). Totals of 134,584, 57,440, and 36,347 pS, pT, and pY, respectively, belonged to the positive (class 1) dataset, while pS 108,975, pT 27,415, and pY 316 sequence fragments belonged to the negative dataset (class 0). Table 1 provides the number of positive and negative phosphorylation sites in training and testing datasets.

Table 1 The number of phosphorylation sites

Full size table

Evaluating contribution of different feature encoding schemes

With regard to post-mRMR and SU attribute selection, of the PP-based features, AAC, molecular weight, residue volume, flexibility, and partition coefficient of predominantly AA11 turned out to be highly significant for classification, followed by the hydrophobicity of the AAs around the central residue. These features have already been shown to be relevant for discriminating between phosphorylated and non-phosphorylated sites [13, 18]. Table 2 lists the initial number of different features types used to encode sequence fragments.

Table 2 Initial number and types of different features used to encode sequence fragments

Full size table

In SS features, the secondary structural conformation of AAs demonstrated maximum involvement in S, T, and Y phosphorylated sites prediction, followed by the disordered region, in accordance with several previous studies [10, 27]. Of all the SS features used to train the ML models, ASA contributed least.

The highly informative features representing functional information, included only GO and KEGG pathway terms, whereas the data for the domains related to phosphorylated sequence fragments did not contribute at all to the prediction of PTM sites. In GO terms, the BP class majorly influenced the prediction of phosphorylation sites followed by CC terms. The commonly influencing ‘KEGG pathway terms’ were B cell receptor signaling pathway, endometrial cancer, prostate cancer, small-cell lung cancer, non-small-cell lung cancer, melanoma, renal cell carcinoma and platelet activation, which has been shown to play a role in cancer and neurodegenerative disorders [52, 53].

In the case of FA, the terms crucial for pS and pT prediction were “compositionally biased region: Ser/Thr-rich,” “endoplasmic reticulum,” “short sequence motif: nuclear export signal,” “domain: TSPtype-13,” and “metal ion-binding site: Divalent metal cation1” and “cation 2”. In order to determine the most significant group, PP, sequence-based features, SSF, FF and FA feature groups were combined and compared. Additional file 2 presents all of the features obtained using the two-step, mRMR, and SU feature selection approach for predicting pS/pT/pY sites.

Performance evaluation of ML models using independent testing data

Using individual and combined-feature encoding schemes, two extensively applied algorithms, RF and SVM, were used to generate learned models. The highest AUC values were obtained for the RF model, based on combined feature groups for pS (Fig. 2), pT (Fig. 3), and pY (Fig. 4) (0.95, 0.97, and 0.99, respectively). SVM models also had comparative AUC values, 0.89 for S, 0.59 for T, and 0.87 for Y phosphorylated sites. Furthermore, the RF and SVM ML models generated using secondary structural information produced the second-highest AUC values, in accordance with our feature selection results, which indicated the maximum contribution of SSF to the prediction of pS/pT/pY sites. In addition to the highest AUC values, the combination and the SSF ML models also produced good values for other statistical measures, including ACC, SN, SP, PRE, F-score, and MCC (Tables 3, 4, 5). Moreover, the confusion matrix for all the RF and SVM models generated in the present study have been provided in Additional file 3. All the RF and SVM models generated for pS, pT, and pY prediction have been provided as Additional files 4–39.

Table 3 Performance comparison with individual and combined feature encoding schemes for pS site prediction on the independent dataset

Full size table

Table 4 Performance comparison with individual and combined feature encoding schemes for pT site prediction on the independent dataset

Full size table

Table 5 Performance comparison with individual and combined feature encoding schemes for pY site prediction on the independent dataset

Full size table

Comparison with existing methods

To evaluate prediction performance, four existing kinase-independent tools, PhosPred-RF, PhosphoSVM, PPRED, and iPhos-PseEn, were compared to the proposed method of predicting pS/pT/pY sites. The proposed RF-based method clearly outperformed other existing methods with regard to SN, SP, MCC, and AUC values, which corresponded to 0.89, 0.88, 0.78, and 0.95 for predicting pS; SN, SP, MCC, and AUC corresponded to 0.97, 0.74, 0.77, and 0.97, respectively, for predicting pT; and SN, SP, MCC, and AUC corresponded to 0.10, 0.63, 0.57, and 0.99, respectively, for predicting pY (Table 6).

Table 6 Performance comparison of different existing tools for pS/pT/pY site prediction

Full size table

Evaluation of the proposed models’ performance on experimental phosphorylation sites

Owing to the large number of computational methods proposed to identify probable PTM sites, the dbPTM database offers an experimental dataset as a standard to explore the PTM prediction ability of proposed tools. In the present study, the best-performing RF model was applied to a total of 5787 experimentally validated protein phosphorylation sites acquired from the dbPTM repository. Of these 5787 phosphorylation sites, 4312 sequence fragments had Ser as central phosphorylated residue, 1442 had Thr phosphorylated residue, and 33 fragments had phosphorylated Tyr residue positioned in the center of the AA sequence. The results were in accordance with the performance measures obtained from an independent test set validation with 2824 sites predicted to be phosphorylated by Ser, 1410 by Thr, and all 33 sites to be phosphorylated by Tyr, as the pY RF model had the highest ACC and AUC, followed by pT and pS.

Discussion

Feature selection results in this study showed secondary structural information to be the top-ranked feature in the case of the pS and pT sites. For pY prediction, most of the PP-based features, AAC, flexibility, hydrophobicity, molecular weight, partition coefficient and residue volume, of the 11th AA were revealed to be of utmost importance, followed by KEGG-pathway-associated terms. The other features responsible for pS and pT site predictions involved GO, KEGG pathway, and FA terms. Amongst the GO terms, for pS and pT prediction, the favored CC terms were integrin complex and ciliary base; BP terms included protein import into nucleus, androgen receptor signaling pathway, intracellular receptor signaling pathway, response to cytokine, cell redox homeostasis, and cellular response to ionizing radiation. For pS prediction, MF terms included fatty-acyl-CoA binding, receptor-signaling protein activity, Rho GTPase binding, SH2 domain binding, and Rac GTPase binding. Histone deacetylase activity was the only MF term contributing to pT prediction. The common KEGG pathway terms contributing towards prediction of pS/pT/pY sites were platelet activation, non-small cell lung cancer, melanoma and prostate cancer. Most of the KEGG pathway terms denoted different types of cancer which makes sense as altered phosphorylation has been strongly linked with cancer [54]. Protein domain information of the FF group appeared to be the least informative and contributed minimum to the prediction of S/T/Y phosphorylated sites. Amongst the FA features, the favorably associated terms for pS/pT/pY sites prediction included “alternative promoter usage,” “DNA recombination,” “repeat: TPR8,” and “transit peptide: mitochondrion.” These events have been associated with phosphorylation which is responsible for various diseases and disorders [55,56,57].

Further during ML model generation, both of the ML algorithms performed well overall in predicting pS/pT/pY sites; however, RF clearly outperformed SVM in most of the feature group models. The best performance for all three of phosphorylated sites for both RF and SVM was achieved using combined feature groups, thereby demonstrating the necessity and significance of exploiting a variety of feature types for prediction. The results of the evaluation of the model performances on the experimental phosphorylation sites confirm that the proposed method can be employed to distinguish unidentified putative phosphorylation and non-phosphorylation sites. On the whole, these results indicate the importance of using different types of feature encoding schemes and feature selection to acquire a diverse set of extremely informative and relevant features for generating high-performance ML models to predict phosphorylation sites.

Conclusion

Protein phosphorylation is essential to the regulation of biological processes and disease pathogenesis. Experimental identification of phosphorylation sites is time-consuming and costly, so in this paper we proposed an ML-based computation method for cheaper, swift, and efficient S/T/Y phosphorylation prediction. The proposed RF- and SVM-algorithms-based method considers diverse features, physiochemical properties, sequence environment, secondary structure, functional features (pathway, GO, and protein domain), and functional annotation of protein sequence fragments to predict phosphorylation sites. Through two-level mRMR and SU feature ranking we observed that secondary structural information followed by pathway, GO, and FA terms were the most informative features, whereas protein domain features were the least useful. The proposed method also demonstrated significant improvement in performance metrics in terms of SN, SP, MCC, and AUC prediction compared to other existing kinase-independent computational tools. Furthermore, the proposed method exhibited outstanding performance on experimental phosphorylation sites, thereby indicating that it is a promising method for identifying potential pS, pT, and pY sites and would thus facilitate the prediction of functional PTMs and further biological analyses.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

Walsh CT, Garneau-Tsodikova S, Gatto GJ Jr. Protein posttranslational modifications: the chemistry of proteome diversifications. Angew Chem Int Ed Engl. 2005;44(45):7342–72.
Article CAS PubMed Google Scholar
Audagnotto M, Dal Peraro M. Protein post-translational modifications: in silico prediction tools and molecular modeling. Comput Struct Biotechnol J. 2017;15:307–19.
Article CAS PubMed PubMed Central Google Scholar
Deribe YL, Pawson T, Dikic I. Post-translational modifications in signal integration. Nat Struct Mol Biol. 2010;17(6):666–72.
Article CAS PubMed Google Scholar
Cohen P. The role of protein phosphorylation in neural and hormonal control of cellular activity. Nature. 1982;296(5858):613–20.
Article CAS PubMed Google Scholar
Johnson LN. The regulation of protein phosphorylation. Biochem Soc Trans. 2009;37(Pt 4):627–41.
Article CAS PubMed Google Scholar
Cohen P. The origins of protein phosphorylation. Nat Cell Biol. 2002;4(5):E127–30.
Article CAS PubMed Google Scholar
Kelley AR, Bach SBH, Perry G. Analysis of post-translational modifications in Alzheimer’s disease by mass spectrometry. Biochim Biophys Acta Mol Basis Dis. 2019;1865(8):2040–7.
Article CAS PubMed Google Scholar
Martin L, Latypova X, Terro F. Post-translational modifications of tau protein: implications for Alzheimer’s disease. Neurochem Int. 2011;58(4):458–71.
Article CAS PubMed Google Scholar
Pearson RB, Kemp BE. Protein kinase phosphorylation site sequences and consensus specificity motifs: tabulations. Methods Enzymol. 1991;200:62–81.
Article CAS PubMed Google Scholar
Song J, Wang H, Wang J, Leier A, Marquez-Lago T, Yang B, et al. PhosphoPredict: a bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection. Sci Rep. 2017;7(1):6862.
Article PubMed PubMed Central CAS Google Scholar
Wei L, Xing P, Tang J, Zou Q. PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Trans Nanobiosci. 2017;16(4):240–7.
Article Google Scholar
Diella F, Cameron S, Gemund C, Linding R, Via A, Kuster B, et al. Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinform. 2004;5:79.
Article Google Scholar
Maiti S, Hassan A, Mitra P. Boosting phosphorylation site prediction with sequence feature-based machine learning. Proteins. 2020;88(2):284–91.
Article CAS PubMed Google Scholar
Trost B, Kusalik A. Computational prediction of eukaryotic phosphorylation sites. Bioinformatics. 2011;27(21):2927–35.
Article CAS PubMed Google Scholar
Dou Y, Yao B, Zhang C. PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. Amino Acids. 2014;46(6):1459–69.
Article CAS PubMed Google Scholar
Blom N, Gammeltoft S, Brunak S. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol. 1999;294(5):1351–62.
Article CAS PubMed Google Scholar
Biswas AK, Noman N, Sikder AR. Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinformat. 2010;11:273.
Article CAS Google Scholar
Wang D, Zeng S, Xu C, Qiu W, Liang Y, Joshi T, et al. MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction. Bioinformatics. 2017;33(24):3909–16.
Article PubMed PubMed Central CAS Google Scholar
Luo F, Wang M, Liu Y, Zhao XM, Li A. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics. 2019;35(16):2766–73.
Article CAS PubMed PubMed Central Google Scholar
Obenauer JC, Cantley LC, Yaffe MB. Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 2003;31(13):3635–41.
Article CAS PubMed PubMed Central Google Scholar
Wong YH, Lee TY, Liang HK, Huang CM, Wang TY, Yang YH, et al. KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res. 2007;35(Web Server issue):W588-594.
Article PubMed PubMed Central Google Scholar
Xue Y, Ren J, Gao X, Jin C, Wen L, Yao X. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Mol Cell Proteomics. 2008;7(9):1598–608.
Article CAS PubMed PubMed Central Google Scholar
Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH. dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res. 2006;34(Database issue):622–7.
Article CAS Google Scholar
Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, et al. Human protein reference database—2009 update. Nucleic Acids Res. 2009;37(Database issue):D767–72.
Article CAS PubMed Google Scholar
Safaei J, Manuch J, Gupta A, Stacho L, Pelech S. Prediction of 492 human protein kinase substrate specificities. Proteome Sci. 2011;9(Suppl 1):S6.
Article PubMed PubMed Central Google Scholar
Kawashima S, Ogata H, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 1999;27(1):368–9.
Article CAS PubMed PubMed Central Google Scholar
Li T, Du P, Xu N. Identifying human kinase-specific protein phosphorylation sites by integrating heterogeneous information from various sources. PLoS ONE. 2010;5(11):e15411.
Article PubMed PubMed Central CAS Google Scholar
Lins L, Thomas A, Brasseur R. Analysis of accessible surface of residues in proteins. Protein Sci. 2003;12(7):1406–17.
Article CAS PubMed PubMed Central Google Scholar
Yan R, Xu D, Yang J, Walker S, Zhang Y. A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Sci Rep. 2013;3:2619.
Article PubMed PubMed Central Google Scholar
Erdos G, Dosztanyi Z. Analyzing protein disorder with IUPred2A. Curr Protoc Bioinformat. 2020;70(1):e99.
Article CAS Google Scholar
Mitchell A, Chang HY, Daugherty L, Fraser M, Hunter S, Lopez R, et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 2015;43(Database issue):D213–21.
Article PubMed Google Scholar
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012;40(Database issue):D109–14.
Article CAS PubMed Google Scholar
Huang DW, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, et al. The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8(9):R183.
Article PubMed PubMed Central CAS Google Scholar
Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.
Article CAS PubMed Google Scholar
Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005;27(8):1226–38.
Article PubMed Google Scholar
Hall MA. Correlation based feature selection for machine learning: University of Waikato; 1999.
Senthamarai Kannan S, Ramaraj N. A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm. Knowl-Based Syst. 2010;23(6):580–5.
Article Google Scholar
Sree CSKRJR. Application of ranking based attribute selection filters to perform automated evaluation of descriptive answers through sequential minimal optimization models. ICTACT J Soft Comput. 2014;5(1):860–8.
Article Google Scholar
Bakhshandeh S, Azmi R, Teshnehlab M. Symmetric uncertainty class-feature association map for feature selection in microarray dataset. Int J Mach Learn Cybern. 2019;11(1):15–32.
Article Google Scholar
Ali SI, Shahzad W, editors. A feature subset selection method based on symmetric uncertainty and Ant Colony Optimization. 2012 International Conference on Emerging Technologies. 2012;8–9.
Frank E, Hall M, Trigg L, Holmes G, Witten IH. Data mining in bioinformatics using Weka. Bioinformatics. 2004;20(15):2479–81.
Article CAS PubMed Google Scholar
Li F, Li C, Wang M, Webb GI, Zhang Y, Whisstock JC, et al. GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome. Bioinformatics. 2015;31(9):1411–9.
Article PubMed CAS Google Scholar
Hasan MM, Khatun MS, Kurata H. Computational modeling of lysine post-translational modification: an overview. Curr Syn Syst Biol. 2018;06(01):137.
Google Scholar
Wang J, Yang B, An Y, Marquez-Lago T, Leier A, Wilksch J, et al. Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches. Brief Bioinform. 2019;20(3):931–51.
Article CAS PubMed Google Scholar
Leo B. Random forests. Mach Learn. 2001;45:5–32.
Article Google Scholar
Adetiloye T, Awasthi A. Predicting short-term congested traffic flow on urban motorway networks. In: Sekhar S, Balas VE, editors. Samui P. Handbook of neural computation: Academic Press, USA; 2017. p. 145–65.
Google Scholar
Hasan MM, Zhou Y, Lu X, Li J, Song J, Zhang Z. Computational identification of protein pupylation sites by using profile-based composition of k-spaced amino acid pairs. PLoS ONE. 2015;10(6):e0129635.
Article PubMed PubMed Central CAS Google Scholar
Wang LN, Shi SP, Xu HD, Wen PP, Qiu JD. Computational prediction of species-specific malonylation sites via enhanced characteristic strategy. Bioinformatics. 2017;33(10):1457–63.
Article CAS PubMed Google Scholar
Kumar M, Gromiha MM, Raghava GP. Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins. 2008;71(1):189–94.
Article CAS PubMed Google Scholar
Kurniawan I, Haryanto T, Hasibuan LS, Agmalaro MA. Combining PSSM and physicochemical feature for protein structure prediction with support vector machine. J Phys Conf Ser. 2017;835:012006.
Article CAS Google Scholar
Ws N. What is a support vectormachine? Nat Biotechnol. 2006;24:1565–7.
Article CAS Google Scholar
Espinosa-Parrilla Y, Gonzalez-Billault C, Fuentes E, Palomo I, Alarcon M. Decoding the role of platelets and related MicroRNAs in aging and neurodegenerative disorders. Front Aging Neurosci. 2019;11:151.
Article CAS PubMed PubMed Central Google Scholar
Idriss HT. Three steps to cancer: how phosphorylation of tubulin, tubulin tyrosine ligase and P-glycoprotein may generate and sustain cancer. Cancer Chemother Pharmacol. 2004;54(2):101–4.
Article CAS PubMed Google Scholar
Singh V, Ram M, Kumar R, Prasad R, Roy BK, Singh KK. Phosphorylation: implications in cancer. Protein J. 2017;36(1):1–6.
Article CAS PubMed Google Scholar
Huin V, Buee L, Behal H, Labreuche J, Sablonniere B, Dhaenens CM. Alternative promoter usage generates novel shorter MAPT mRNA transcripts in Alzheimer’s disease and progressive supranuclear palsy brains. Sci Rep. 2017;7(1):12589.
Article PubMed PubMed Central CAS Google Scholar
Restle A, Farber M, Baumann C, Bohringer M, Scheidtmann KH, Muller-Tidow C, et al. Dissecting the role of p53 phosphorylation in homologous recombination provides new clues for gain-of-function mutants. Nucleic Acids Res. 2008;36(16):5362–75.
Article CAS PubMed PubMed Central Google Scholar
Lim S, Smith KR, Lim ST, Tian R, Lu J, Tan M. Regulation of mitochondrial functions by protein phosphorylation and dephosphorylation. Cell Biosci. 2016;6:25.
Article PubMed PubMed Central CAS Google Scholar

Download references

Acknowledgements

Salma Jamal acknowledges a Young Scientist Fellowship from the Department of Health Research (DHR), India. Abhinav Grover and Sonam Grover are grateful to University Grants Commission, India for the Faculty Recharge Position. Sonam Grover is grateful to Jamia Hamdard for DST Purse Grant.

Funding

Not applicable.

Author information

Authors and Affiliations

JH-Institute of Molecular Medicine, Jamia Hamdard, New Delhi, India
Salma Jamal, Waseem Ali & Sonam Grover
School of Biotechnology, Jawaharlal Nehru University, New Delhi, India
Priya Nagpal & Abhinav Grover

Authors

Salma Jamal
View author publications
You can also search for this author in PubMed Google Scholar
Waseem Ali
View author publications
You can also search for this author in PubMed Google Scholar
Priya Nagpal
View author publications
You can also search for this author in PubMed Google Scholar
Abhinav Grover
View author publications
You can also search for this author in PubMed Google Scholar
Sonam Grover
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

SJ conceived and designed the experiments. SJ, WA and PN performed. SJ, SG, and AG analyzed the data. All authors contributed to the writing of the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Abhinav Grover or Sonam Grover.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

: Provides a comprehensive list of all the extracted features.

Additional file 2

: The different types of features obtained after two-step, mRMR and SU, feature selection approach for pS/pT/pY sites prediction.

Additional file 3

: Confusion matrix for all the RF and SVM models generated in present study for prediction of Ser, Thr and Tyr phosphorylation sites.

Additional file 4–39

: All the RF and SVM models generated in present study for prediction of Ser, Thr and Tyr phosphorylation sites.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Jamal, S., Ali, W., Nagpal, P. et al. Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins. J Transl Med 19, 218 (2021). https://doi.org/10.1186/s12967-021-02851-0

Download citation

Received: 01 December 2020
Accepted: 18 April 2021
Published: 24 May 2021
DOI: https://doi.org/10.1186/s12967-021-02851-0

Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins

Abstract

Background

Methods

Results

Conclusions

Background

Methodology

Data collection and pre-processing

Feature encoding approaches

Physicochemical property-based features

Sequence-based features

Structural level features

Functional features

Functional annotations

Combined features models

Feature selection

Minimum redundancy maximum relevance

Symmetrical uncertainty

Training machine learning models

Random forest

Support vector machine

Model performance evaluation

Results

Input data transformation

Evaluating contribution of different feature encoding schemes

Performance evaluation of ML models using independent testing data

Comparison with existing methods

Evaluation of the proposed models’ performance on experimental phosphorylation sites

Discussion

Conclusion

Availability of data and materials

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Supplementary Information

Additional file 1

Additional file 2

Additional file 3

Additional file 4–39

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Journal of Translational Medicine

Contact us