Computational Biology

Along with colleagues in the Monash Faculty of Medicine, Nursing and Health Sciences, I am investigating applications of data science in biology. The majority of this work uses machine learning to predict protein structural and functional features.

Publications

Large language models for scientific discovery in molecular property prediction.
Zheng, Y., Koh, H. Y., Ju, J., Nguyen, A. T. N., May, L. T., Webb, G. I., & Pan, S.
Nature Machine Intelligence, in press.

@Article{Zheng2025,
author = {Zheng, Yizhen and Koh, Huan Yee and Ju, Jiaxin and Nguyen, Anh T. N. and May, Lauren T. and Webb, Geoffrey I. and Pan, Shirui},
journal = {Nature Machine Intelligence},
title = {Large language models for scientific discovery in molecular property prediction},
year = {in press},
issn = {2522-5839},
month = feb,
doi = {10.1038/s42256-025-00994-z},
keywords = {Bioinformatics},
publisher = {Springer Science and Business Media LLC},
related = {computational-biology},
}

SCREEN: A Graph-based Contrastive Learning Tool to Infer Catalytic Residues and Assess Enzyme Mutations.
Pan, T., Bi, Y., Wang, X., Zhang, Y., Webb, G. I., Gasser, R. B., Kurgan, L., & Song, J.
Genomics, Proteomics & Bioinformatics, in press.

@Article{Pan2024,
author = {Pan, Tong and Bi, Yue and Wang, Xiaoyu and Zhang, Ying and Webb, Geoffrey I and Gasser, Robin B and Kurgan, Lukasz and Song, Jiangning},
journal = {Genomics, Proteomics \& Bioinformatics},
title = {SCREEN: A Graph-based Contrastive Learning Tool to Infer Catalytic Residues and Assess Enzyme Mutations},
year = {in press},
issn = {2210-3244},
month = dec,
doi = {10.1093/gpbjnl/qzae094},
keywords = {Bioinformatics and DP140100087},
publisher = {Oxford University Press (OUP)},
related = {computational-biology},
}

Physicochemical graph neural network for learning protein-ligand interaction fingerprints from sequence data.
Koh, H. Y., Nguyen, A. T. N., Pan, S., May, L. T., & Webb, G. I.
Nature Machine Intelligence, 6, 673–687, 2024.

@Article{Koh2024,
author = {Koh, Huan Yee and Nguyen, Anh T. N. and Pan, Shirui and May, Lauren T. and Webb, Geoffrey I.},
journal = {Nature Machine Intelligence},
title = {Physicochemical graph neural network for learning protein-ligand interaction fingerprints from sequence data},
year = {2024},
issn = {2522-5839},
pages = {673--687},
volume = {6},
abstract = {In drug discovery, determining the binding affinity and functional effects of small-molecule ligands on proteins is critical. Current computational methods can predict these protein-ligand interaction properties but often lose accuracy without high-resolution protein structures and falter in predicting functional effects. Here we introduce PSICHIC (PhySIcoCHemICal graph neural network), a framework incorporating physicochemical constraints to decode interaction fingerprints directly from sequence data alone. This enables PSICHIC to attain capabilities in decoding mechanisms underlying protein-ligand interactions, achieving state-of-the-art accuracy and interpretability. Trained on identical protein-ligand pairs without structural data, PSICHIC matched and even surpassed leading structure-based methods in binding-affinity prediction. In an experimental library screening for adenosine A1 receptor agonists, PSICHIC discerned functional effects effectively, ranking the sole novel agonist within the top three. PSICHIC’s interpretable fingerprints identified protein residues and ligand atoms involved in interactions, and helped in unveiling selectivity determinants of protein-ligand interaction. We foresee PSICHIC reshaping virtual screening and deepening our understanding of protein-ligand interactions.},
doi = {10.1038/s42256-024-00847-1},
keywords = {Bioinformatics, Pharmacoinformatics},
refid = {Koh2024},
related = {computational-biology},
}

PFresGO: an attention mechanism-based deep-learning approach for protein annotation by integrating gene ontology inter-relationships.
Pan, T., Li, C., Bi, Y., Wang, Z., Gasser, R. B., Purcell, A. W., Akutsu, T., Webb, G. I., Imoto, S., & Song, J.
Bioinformatics, Art. no. btad094, 2023.

@Article{10.1093/bioinformatics/btad094,
author = {Pan, Tong and Li, Chen and Bi, Yue and Wang, Zhikang and Gasser, Robin B and Purcell, Anthony W and Akutsu, Tatsuya and Webb, Geoffrey I and Imoto, Seiya and Song, Jiangning},
journal = {Bioinformatics},
title = {{PFresGO}: an attention mechanism-based deep-learning approach for protein annotation by integrating gene ontology inter-relationships},
year = {2023},
issn = {1367-4811},
abstract = {{The rapid accumulation of high-throughput sequence data demands the development of effective and efficient data-driven computational methods to functionally annotate proteins. However, most current approaches used for functional annotation simply focus on the use of protein-level information but ignore inter-relationships among annotations. Here, we established PFresGO, an attention-based deep-learning approach that incorporates hierarchical structures in Gene Ontology (GO) graphs and advances in natural language processing algorithms for the functional annotation of proteins. PFresGO employs a self-attention operation to capture the inter-relationships of GO terms, updates its embedding accordingly, and uses a cross-attention operation to project protein representations and GO embedding into a common latent space to identify global protein sequence patterns and local functional residues. We demonstrate that PFresGO consistently achieves superior performance across GO categories when compared with state-of-the-art methods. Importantly, we show that PFresGO can identify functionally important residues in protein sequences by assessing the distribution of attention weightings. PFresGO should serve as an effective tool for the accurate functional annotation of proteins and functional domains within proteins. PFresGO is available for academic purposes at https://github.com/BioColLab/PFresGO. Supplementary data are available at Bioinformatics online.}},
articlenumber = {btad094},
doi = {10.1093/bioinformatics/btad094},
keywords = {Bioinformatics},
related = {computational-biology},
}

Rapid Identification of Protein Formulations with Bayesian Optimisation.
Huynh, V., Say, B., Vogel, P., Cao, L., Webb, G. I., & Aleti, A.
2023 International Conference on Machine Learning and Applications (ICMLA), pp. 776-781, 2023.

@InProceedings{Huynh2023,
author = {Huynh, Viet and Say, Buser and Vogel, Peter and Cao, Lucy and Webb, Geoffrey I and Aleti, Aldeida},
booktitle = {2023 International Conference on Machine Learning and Applications (ICMLA)},
title = {Rapid Identification of Protein Formulations with Bayesian Optimisation},
year = {2023},
pages = {776-781},
creationdate = {2024-03-21T10:48:29},
doi = {10.1109/ICMLA58977.2023.00113},
keywords = {health,Bioinformatics,Drugs;Proteins;Industries;Metalearning;Transportation;Stability analysis;Bayes methods;Protein buffer optimisation;Bayesian optimisation},
}

TIMER is a Siamese neural network-based framework for identifying both general and species-specific bacterial promoters.
Zhu, Y., Li, F., Guo, X., Wang, X., Coin, L. J. M., Webb, G. I., Song, J., & Jia, C.
Briefings in Bioinformatics, 24(4), Art. no. bbad209, 2023.

@Article{Zhu2023,
author = {Zhu, Yan and Li, Fuyi and Guo, Xudong and Wang, Xiaoyu and Coin, Lachlan J M and Webb, Geoffrey I and Song, Jiangning and Jia, Cangzhi},
journal = {Briefings in Bioinformatics},
title = {{TIMER is a Siamese neural network-based framework for identifying both general and species-specific bacterial promoters}},
year = {2023},
issn = {1477-4054},
number = {4},
volume = {24},
abstract = {{Promoters are DNA regions that initiate the transcription of specific genes near the transcription start sites. In bacteria, promoters are recognized by RNA polymerases and associated sigma factors. Effective promoter recognition is essential for synthesizing the gene-encoded products by bacteria to grow and adapt to different environmental conditions. A variety of machine learning-based predictors for bacterial promoters have been developed; however, most of them were designed specifically for a particular species. To date, only a few predictors are available for identifying general bacterial promoters with limited predictive performance. In this study, we developed TIMER, a Siamese neural network-based approach for identifying both general and species-specific bacterial promoters. Specifically, TIMER uses DNA sequences as the input and employs three Siamese neural networks with the attention layers to train and optimize the models for a total of 13 species-specific and general bacterial promoters. Extensive 10-fold cross-validation and independent tests demonstrated that TIMER achieves a competitive performance and outperforms several existing methods on both general and species-specific promoter prediction. As an implementation of the proposed method, the web server of TIMER is publicly accessible at http://web.unimelb-bioinfortools.cloud.edu.au/TIMER/.}},
articlenumber = {bbad209},
creationdate = {2023-06-21T12:22:54},
doi = {10.1093/bib/bbad209},
keywords = {Bioinformatics},
related = {computational-biology},
}

Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations.
Bi, Y., Li, F., Guo, X., Wang, Z., Pan, T., Guo, Y., Webb, G. I., Yao, J., Jia, C., & Song, J.
Briefings in Bioinformatics, 23(6), Art. no. bbac467, 2022.

@Article{Bi2022,
author = {Bi, Yue and Li, Fuyi and Guo, Xudong and Wang, Zhikang and Pan, Tong and Guo, Yuming and Webb, Geoffrey I and Yao, Jianhua and Jia, Cangzhi and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {{Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations}},
year = {2022},
issn = {1477-4054},
month = {11},
number = {6},
volume = {23},
abstract = {{Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding of the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Despite several computational methods that have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for the multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated Clarion achieved competitive predictive performance and the weighted series method plays a crucial role in securing superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracy of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion is freely available at http://monash.bioweb.cloud.edu.au/Clarion/.}},
articlenumber = {bbac467},
doi = {10.1093/bib/bbac467},
keywords = {Bioinformatics},
related = {computational-biology},
}

DEMoS: A Deep Learning-based Ensemble Approach for Predicting the Molecular Subtypes of Gastric Adenocarcinomas from Histopathological Images.
Wang, Y., Hu, C., Kwok, T., Bain, C. A., Xue, X., Gasser, R. B., Webb, G. I., Boussioutas, A., Shen, X., Daly, R. J., & Song, J.
Bioinformatics, 38(17), 4206-4213, 2022.

@Article{Wang_2022,
author = {Yanan Wang and Changyuan Hu and Terry Kwok and Christopher A Bain and Xiangyang Xue and Robin B Gasser and Geoffrey I Webb and Alex Boussioutas and Xian Shen and Roger J Daly and Jiangning Song},
journal = {Bioinformatics},
title = {{DEMoS}: A Deep Learning-based Ensemble Approach for Predicting the Molecular Subtypes of Gastric Adenocarcinomas from Histopathological Images},
year = {2022},
number = {17},
pages = {4206-4213},
volume = {38},
doi = {10.1093/bioinformatics/btac456},
editor = {Hanchuan Peng},
keywords = {Bioinformatics},
publisher = {Oxford University Press ({OUP})},
related = {computational-biology},
}

PROST: AlphaFold2-aware Sequence-Based Predictor to Estimate Protein Stability Changes upon Missense Mutations.
Iqbal, S., Ge, F., Li, F., Akutsu, T., Zheng, Y., Gasser, R. B., Yu, D., Webb, G. I., & Song, J.
Journal of Chemical Information and Modeling, 62(17), 4270-4282, 2022.

@Article{Iqbal2022,
author = {Shahid Iqbal and Fang Ge and Fuyi Li and Tatsuya Akutsu and Yuanting Zheng and Robin B. Gasser and Dong-Jun Yu and Geoffrey I. Webb and Jiangning Song},
journal = {Journal of Chemical Information and Modeling},
title = {{PROST}: {AlphaFold}2-aware Sequence-Based Predictor to Estimate Protein Stability Changes upon Missense Mutations},
year = {2022},
number = {17},
pages = {4270-4282},
volume = {62},
abstract = {An essential step in engineering proteins and understanding disease-causing missense mutations is to accurately model protein stability changes when such mutations occur. Here, we developed a new sequence-based predictor for protein stability (PROST) change upon single-point missense mutation. PROST extracts multiple descriptors from the most promising sequence-based predictors, such as BoostDDG, SAAFEC-SEQ, and DDGun. PROST also extracts descriptors from iFeature and AlphaFold2. The extracted descriptors include sequence-based features, physicochemical properties, evolutionary information, evolutionary-based physicochemical properties, and predicted structural features. The PROST predictor is a weighted average ensemble model based on extreme gradient boosting (XGBoost) decision trees and extra-trees regressor. PROST is trained on both direct and hypothetical reverse mutations using the S5294 dataset (S2647 direct mutations + S2647 inverse mutations). The parameters for the PROST model are optimized using grid searching with 5-fold cross-validation, and feature importance analysis unveils the most relevant features. The performance of PROST is evaluated in a blinded manner, employing nine distinct datasets and existing state-of-the-art sequence-based and structure-based predictors. This method consistently performs well on Frataxin, S217, S349, Ssym, Myoglobin, and CAGI5 datasets in blind tests, and similarly to the state-of-the-art predictors for p53 and S276 datasets. When the performance of PROST is compared with the latest predictors such as BoostDDG, SAAFEC-SEQ, ACDC-NN-seq, and DDGun, PROST dominates these predictors. A case study of mutation scanning of the Frataxin protein for nine wild-type residues demonstrates the utility of PROST. Taken together, these findings indicate that PROST is a well-suited predictor when no protein structural information is available.
The source code of PROST, datasets, examples, pre-trained models, along with how to use PROST are available at https://github.com/ShahidIqb/PROST and https://prost.erc.monash.edu/seq.},
doi = {10.1021/acs.jcim.2c00799},
keywords = {Bioinformatics},
publisher = {American Chemical Society ({ACS})},
related = {computational-biology},
}

Systematic Characterization of Lysine Post-translational Modification Sites Using MUscADEL.
Chen, Z., Liu, X., Li, F., Li, C., Marquez-Lago, T., Leier, A., Webb, G. I., Xu, D., Akutsu, T., & Song, J.
In KC, D. B. (Ed.), Computational Methods for Predicting Post-Translational Modification Sites (pp. 205-219). New York, NY: Springer US, 2022.

@InBook{Chen2022,
author = {Chen, Zhen and Liu, Xuhan and Li, Fuyi and Li, Chen and Marquez-Lago, Tatiana and Leier, Andr{\'e} and Webb, Geoffrey I. and Xu, Dakang and Akutsu, Tatsuya and Song, Jiangning},
editor = {KC, Dukka B.},
pages = {205-219},
publisher = {Springer US},
title = {Systematic Characterization of Lysine Post-translational Modification Sites Using MUscADEL},
year = {2022},
address = {New York, NY},
isbn = {978-1-0716-2317-6},
abstract = {Among various types of protein post-translational modifications (PTMs), lysine PTMs play an important role in regulating a wide range of functions and biological processes. Due to the generation and accumulation of an enormous amount of protein sequence data by ongoing whole-genome sequencing projects, systematic identification of different types of lysine PTM substrates and their specific PTM sites in the entire proteome is increasingly important and has therefore received much attention. Accordingly, a variety of computational methods for lysine PTM identification have been developed based on the combination of various handcrafted sequence features and machine-learning techniques. In this chapter, we first briefly review existing computational methods for lysine PTM identification and then introduce a recently developed deep learning-based method, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs). Specifically, MUscADEL employs bidirectional long short-term memory (BiLSTM) recurrent neural networks and is capable of predicting eight major types of lysine PTMs in both the human and mouse proteomes. The web server of MUscADEL is publicly available at http://muscadel.erc.monash.edu/ for the research community to use.},
booktitle = {Computational Methods for Predicting Post-Translational Modification Sites},
doi = {10.1007/978-1-0716-2317-6_11},
keywords = {Bioinformatics},
related = {computational-biology},
}

ASPIRER: a new computational approach for identifying non-classical secreted proteins based on deep learning.
Wang, X., Li, F., Xu, J., Rong, J., Webb, G. I., Ge, Z., Li, J., & Song, J.
Briefings in Bioinformatics, 23(2), Art. no. bbac031, 2022.

@Article{10.1093/bib/bbac031,
author = {Wang, Xiaoyu and Li, Fuyi and Xu, Jing and Rong, Jia and Webb, Geoffrey I and Ge, Zongyuan and Li, Jian and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {{ASPIRER}: a new computational approach for identifying non-classical secreted proteins based on deep learning},
year = {2022},
issn = {1477-4054},
number = {2},
volume = {23},
abstract = {{Protein secretion has a pivotal role in many biological processes and is particularly important for intercellular communication, from the cytoplasm to the host or external environment. Gram-positive bacteria can secrete proteins through multiple secretion pathways. The non-classical secretion pathway has recently received increasing attention among these secretion pathways, but its exact mechanism remains unclear. Non-classical secreted proteins (NCSPs) are a class of secreted proteins lacking signal peptides and motifs. Several NCSP predictors have been proposed to identify NCSPs and most of them employed the whole amino acid sequence of NCSPs to construct the model. However, the sequence length of different proteins varies greatly. In addition, not all regions of the protein are equally important and some local regions are not relevant to the secretion. The functional regions of the protein, particularly in the N- and C-terminal regions, contain important determinants for secretion. In this study, we propose a new hybrid deep learning-based framework, referred to as ASPIRER, which improves the prediction of NCSPs from amino acid sequences. More specifically, it combines a whole sequence-based XGBoost model and an N-terminal sequence-based convolutional neural network model; 5-fold cross-validation and independent tests demonstrate that ASPIRER achieves superior performance than existing state-of-the-art approaches. The source code and curated datasets of ASPIRER are publicly available at https://github.com/yanwu20/ASPIRER/. ASPIRER is anticipated to be a useful tool for improved prediction of novel putative NCSPs from sequences information and prioritization of candidate proteins for follow-up experimental validation.}},
articlenumber = {bbac031},
doi = {10.1093/bib/bbac031},
keywords = {Bioinformatics},
related = {computational-biology},
}

Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction.
Zhang, M., Jia, C., Li, F., Li, C., Zhu, Y., Akutsu, T., Webb, G. I., Zou, Q., Coin, L. J. M., & Song, J.
Briefings in Bioinformatics, 23(2), Art. no. bbab551, 2022.
[Bibtex] [Abstract] → Access on publisher site

@Article{10.1093/bib/bbab551,
author = {Zhang, Meng and Jia, Cangzhi and Li, Fuyi and Li, Chen and Zhu, Yan and Akutsu, Tatsuya and Webb, Geoffrey I and Zou, Quan and Coin, Lachlan J M and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction},
year = {2022},
issn = {1477-4054},
volume = {23},
abstract = {{Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.}},
articlenumber = {bbab551},
doi = {10.1093/bib/bbab551},
number = {2},
keywords = {Bioinformatics},
related = {computational-biology},
}

Positive-unlabeled learning in bioinformatics and computational biology: a brief review.
Li, F., Dong, S., Leier, A., Han, M., Guo, X., Xu, J., Wang, X., Pan, S., Jia, C., Zhang, Y., Webb, G. I., Coin, L. J. M., Li, C., & Song, J.
Briefings in Bioinformatics, 23(1), Art. no. bbab461, 2022.
[Bibtex] [Abstract] → Access on publisher site

@Article{10.1093/bib/bbab461,
author = {Li, Fuyi and Dong, Shuangyu and Leier, Andre and Han, Meiya and Guo, Xudong and Xu, Jing and Wang, Xiaoyu and Pan, Shirui and Jia, Cangzhi and Zhang, Yang and Webb, Geoffrey I and Coin, Lachlan J M and Li, Chen and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {Positive-unlabeled learning in bioinformatics and computational biology: a brief review},
year = {2022},
issn = {1477-4054},
number = {1},
volume = {23},
abstract = {{Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.}},
articlenumber = {bbab461},
doi = {10.1093/bib/bbab461},
keywords = {Bioinformatics},
related = {computational-biology},
}

A Deep Learning-Based Method for Identification of Bacteriophage-Host Interaction.
Li, M., Wang, Y., Li, F., Zhao, Y., Liu, M., Zhang, S., Bin, Y., Smith, A. I., Webb, G., Li, J., Song, J., & Xia, J.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 18(5), 1801-1810, 2021.
[Bibtex] [Abstract] → Access on publisher site

@Article{RN3447,
author = {Li, M. and Wang, Y. and Li, F. and Zhao, Y. and Liu, M. and Zhang, S. and Bin, Y. and Smith, A. I. and Webb, G. and Li, J. and Song, J. and Xia, J.},
journal = {IEEE/ACM Transactions on Computational Biology and Bioinformatics},
title = {A Deep Learning-Based Method for Identification of Bacteriophage-Host Interaction},
year = {2021},
issn = {1545-5963},
pages = {1801-1810},
volume = {18},
abstract = {Multi-drug resistance (MDR) has become one of the greatest threats to human health worldwide, and novel treatment methods of infections caused by MDR bacteria are urgently needed. Phage therapy is a promising alternative to solve this problem, to which the key is correctly matching target pathogenic bacteria with the corresponding therapeutic phage. Deep learning is powerful for mining complex patterns to generate accurate predictions. In this study, we develop PredPHI (Predicting Phage-Host Interactions), a deep learning-based tool capable of predicting the host of phages from sequence data. We collect >3000 phage-host pairs along with their protein sequences from PhagesDB and GenBank databases and extract a set of features. Then we select high-quality negative samples based on the K-Means clustering method and construct a balanced training set. Finally, we employ a deep convolutional neural network to build the predictive model. The results indicate that PredPHI can achieve a predictive performance of 81% in terms of the area under the receiver operating characteristic curve on the test set, and the clustering-based method is significantly more robust than that based on randomly selecting negative samples. These results highlight that PredPHI is a useful and accurate tool for identifying phage-host interactions from sequence data.},
doi = {10.1109/tcbb.2020.3017386},
number = {5},
keywords = {Bioinformatics},
related = {computational-biology},
}

DeepBL: a deep learning-based approach for in silico discovery of beta-lactamases.
Wang, Y., Li, F., Bharathwaj, M., Rosas, N. C., Leier, A., Akutsu, T., Webb, G. I., Marquez-Lago, T. T., Li, J., Lithgow, T., & Song, J.
Briefings in Bioinformatics, 22(4), Art. no. bbaa301, 2021.
[Bibtex] [Abstract] → Access on publisher site

@Article{Wang2020,
author = {Yanan Wang and Fuyi Li and Manasa Bharathwaj and Natalia C Rosas and Andr{\'{e}} Leier and Tatsuya Akutsu and Geoffrey I Webb and Tatiana T Marquez-Lago and Jian Li and Trevor Lithgow and Jiangning Song},
journal = {Briefings in Bioinformatics},
title = {{DeepBL}: a deep learning-based approach for in silico discovery of beta-lactamases},
year = {2021},
number = {4},
volume = {22},
abstract = {Beta-lactamases (BLs) are enzymes localized in the periplasmic space of bacterial pathogens, where they confer resistance to beta-lactam antibiotics. Experimental identification of BLs is costly yet crucial to understand beta-lactam resistance mechanisms. To address this issue, we present DeepBL, a deep learning-based approach by incorporating sequence-derived features to enable high-throughput prediction of BLs. Specifically, DeepBL is implemented based on the Small VGGNet architecture and the TensorFlow deep learning library. Furthermore, the performance of DeepBL models is investigated in relation to the sequence redundancy level and negative sample selection in the benchmark dataset. The models are trained on datasets of varying sequence redundancy thresholds, and the model performance is evaluated by extensive benchmarking tests. Using the optimized DeepBL model, we perform proteome-wide screening for all reviewed bacterium protein sequences available from the UniProt database. These results are freely accessible at the DeepBL webserver at http://deepbl.erc.monash.edu.au/.},
articlenumber = {bbaa301},
doi = {10.1093/bib/bbaa301},
keywords = {Bioinformatics},
publisher = {Oxford University Press ({OUP})},
related = {computational-biology},
}

iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization.
Chen, Z., Zhao, P., Li, C., Li, F., Xiang, D., Chen, Y., Akutsu, T., Daly, R. J., Webb, G. I., Zhao, Q., Kurgan, L., & Song, J.
Nucleic Acids Research, 49(10), Art. no. e60, 2021.
[Bibtex] [Abstract] → Access on publisher site

@Article{ChenEtAl21,
author = {Chen, Zhen and Zhao, Pei and Li, Chen and Li, Fuyi and Xiang, Dongxu and Chen, Yong-Zi and Akutsu, Tatsuya and Daly, Roger J and Webb, Geoffrey I and Zhao, Quanzhi and Kurgan, Lukasz and Song, Jiangning},
journal = {Nucleic Acids Research},
title = {{iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization}},
year = {2021},
issn = {0305-1048},
number = {10},
volume = {49},
abstract = {{Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.}},
articlenumber = {e60},
comment = {Clarivate Web of Science Highly Cited Paper 2022 - 2024},
doi = {10.1093/nar/gkab122},
keywords = {Bioinformatics},
related = {computational-biology},
}

Anthem: a user customised tool for fast and accurate prediction of binding between peptides and HLA class I molecules.
Mei, S., Li, F., Xiang, D., Ayala, R., Faridi, P., Webb, G. I., Illing, P. T., Rossjohn, J., Akutsu, T., Croft, N. P., Purcell, A. W., & Song, J.
Briefings in Bioinformatics, 22(5), 2021.
[Bibtex] [Abstract] → Access on publisher site

@Article{Mei2021,
author = {Mei, Shutao and Li, Fuyi and Xiang, Dongxu and Ayala, Rochelle and Faridi, Pouya and Webb, Geoffrey I and Illing, Patricia T and Rossjohn, Jamie and Akutsu, Tatsuya and Croft, Nathan P and Purcell, Anthony W and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {{Anthem: a user customised tool for fast and accurate prediction of binding between peptides and HLA class I molecules}},
year = {2021},
issn = {1477-4054},
volume = {22},
abstract = {{Neopeptide-based immunotherapy has been recognised as a promising approach for the treatment of cancers. For neopeptides to be recognised by CD8+ T cells and induce an immune response, their binding to human leukocyte antigen class I (HLA-I) molecules is a necessary first step. Most epitope prediction tools thus rely on the prediction of such binding. With the use of mass spectrometry, the scale of naturally presented HLA ligands that could be used to develop such predictors has been expanded. However, there are rarely efforts that focus on the integration of these experimental data with computational algorithms to efficiently develop up-to-date predictors. Here, we present Anthem for accurate HLA-I binding prediction. In particular, we have developed a user-friendly framework to support the development of customisable HLA-I binding prediction models to meet challenges associated with the rapidly increasing availability of large amounts of immunopeptidomic data. Our extensive evaluation, using both independent and experimental datasets shows that Anthem achieves an overall similar or higher area under curve value compared with other contemporary tools. It is anticipated that Anthem will provide a unique opportunity for the non-expert user to analyse and interpret their own in-house or publicly deposited datasets.}},
doi = {10.1093/bib/bbaa415},
eprint = {https://academic.oup.com/bib/advance-article-pdf/doi/10.1093/bib/bbaa415/35904983/bbaa415.pdf},
number = {5},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}

Assessing the performance of computational predictors for estimating protein stability changes upon missense mutations.
Iqbal, S., Li, F., Akutsu, T., Ascher, D. B., Webb, G. I., & Song, J.
Briefings in Bioinformatics, 22(6), Art. no. bbab184, 2021.
[Bibtex] [Abstract] → Access on publisher site

@Article{10.1093/bib/bbab184,
author = {Iqbal, Shahid and Li, Fuyi and Akutsu, Tatsuya and Ascher, David B and Webb, Geoffrey I and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {Assessing the performance of computational predictors for estimating protein stability changes upon missense mutations},
year = {2021},
issn = {1477-4054},
number = {6},
volume = {22},
abstract = {Understanding how a mutation might affect protein stability is of significant importance to protein engineering and for understanding protein evolution and genetic diseases. While a number of computational tools have been developed to predict the effect of missense mutations on protein stability, they are known to exhibit large biases imparted in part by the data used to train and evaluate them. Here, we provide a comprehensive overview of predictive tools, which has provided an evolving insight into the importance and relevance of features that can discern the effects of mutations on protein stability. A diverse selection of these freely available tools was benchmarked using a large mutation-level blind dataset of 1342 experimentally characterised mutations across 130 proteins from ThermoMutDB, a second test dataset encompassing 630 experimentally characterised mutations across 39 proteins from iStable2.0 and a third blind test dataset consisting of 268 mutations in 27 proteins from the newly published ProThermDB. The performance of the methods was further evaluated with respect to the site of mutation, type of mutant residue and by ranging the pH and temperature. Additionally, the classification performance was also evaluated by classifying the mutations as stabilizing (delta delta G>=0) or destabilizing (delta delta G<0). The results reveal that the performance of the predictors is affected by the site of mutation and the type of mutant residue. Further, the results show very low performance for pH values 6-8 and temperature higher than 65 for all predictors except iStable2.0 on the S630 dataset. To illustrate how stability and structure change upon single point mutation, we considered four stabilizing, two destabilizing and two stabilizing mutations from two proteins, namely the toxin protein and bovine liver cytochrome. Overall, the results on S268, S630 and S1342 datasets show that the performance of the integrated predictors is better than the mechanistic or individual machine learning predictors. We expect that this paper will provide useful guidance for the design and development of next-generation bioinformatic tools for predicting protein stability changes upon mutations.},
articlenumber = {bbab184},
doi = {10.1093/bib/bbab184},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}

PROSPECT: A web server for predicting protein histidine phosphorylation sites.
Chen, Z., Zhao, P., Li, F., Leier, A., Marquez-Lago, T. T., Webb, G. I., Baggag, A., Bensmail, H., & Song, J.
Journal of Bioinformatics and Computational Biology, 18(4), Art. no. 2050018, 2020.
[Bibtex] [Abstract] → Access on publisher site

@Article{Chen2020,
author = {Zhen Chen and Pei Zhao and Fuyi Li and Andr{\'{e}} Leier and Tatiana T. Marquez-Lago and Geoffrey I. Webb and Abdelkader Baggag and Halima Bensmail and Jiangning Song},
journal = {Journal of Bioinformatics and Computational Biology},
title = {{PROSPECT}: A web server for predicting protein histidine phosphorylation sites},
year = {2020},
month = {jun},
number = {4},
volume = {18},
abstract = {Background: Phosphorylation of histidine residues plays crucial roles in signaling pathways and cell metabolism in prokaryotes such as bacteria. While evidence has emerged that protein histidine phosphorylation also occurs in more complex organisms, its role in mammalian cells has remained largely uncharted. Thus, it is highly desirable to develop computational tools that are able to identify histidine phosphorylation sites. Result: Here, we introduce PROSPECT that enables fast and accurate prediction of proteome-wide histidine phosphorylation substrates and sites. Our tool is based on a hybrid method that integrates the outputs of two convolutional neural network (CNN)-based classifiers and a random forest-based classifier. Three features, including the one-of-K coding, enhanced grouped amino acids content (EGAAC) and composition of k-spaced amino acid group pairs (CKSAAGP) encoding, were taken as the input to three classifiers, respectively. Our results show that it is able to accurately predict histidine phosphorylation sites from sequence information. Our PROSPECT web server is user-friendly and publicly available at http://PROSPECT.erc.monash.edu/. Conclusions: PROSPECT is superior to other pHis predictors in both the running speed and prediction accuracy and we anticipate that the PROSPECT webserver will become a popular tool for identifying the pHis sites in bacteria.},
articlenumber = {2050018},
doi = {10.1142/s0219720020500183},
keywords = {Bioinformatics},
publisher = {World Scientific},
related = {computational-biology},
}

Procleave: Predicting Protease-specific Substrate Cleavage Sites by Combining Sequence and Structural Information.
Li, F., Leier, A., Liu, Q., Wang, Y., Xiang, D., Akutsu, T., Webb, G. I., Smith, I. A., Marquez-Lago, T., Li, J., & Song, J.
Genomics, Proteomics & Bioinformatics, 18(1), 52-64, 2020.
[Bibtex] [Abstract] → Access on publisher site

@Article{LI2020,
author = {Fuyi Li and Andre Leier and Quanzhong Liu and Yanan Wang and Dongxu Xiang and Tatsuya Akutsu and Geoffrey I. Webb and A. Ian Smith and Tatiana Marquez-Lago and Jian Li and Jiangning Song},
journal = {Genomics, Proteomics & Bioinformatics},
title = {Procleave: Predicting Protease-specific Substrate Cleavage Sites by Combining Sequence and Structural Information},
year = {2020},
issn = {1672-0229},
number = {1},
pages = {52-64},
volume = {18},
abstract = {Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites of different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.},
doi = {10.1016/j.gpb.2019.08.002},
keywords = {Bioinformatics},
related = {computational-biology},
}

ABSTRACT Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites of different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.

PRISMOID: a comprehensive 3D structure database for post-translational modifications and mutations with functional impact.
Li, F., Fan, C., Marquez-Lago, T. T., Leier, A., Revote, J., Jia, C., Zhu, Y., Smith, I. A., Webb, G. I., Liu, Q., Wei, L., Li, J., & Song, J.
Briefings in Bioinformatics, 21(3), 1069-1079, 2020.
[Bibtex] [Abstract] → Access on publisher site

@Article{10.1093/bib/bbz050,
author = {Li, Fuyi and Fan, Cunshuo and Marquez-Lago, Tatiana T and Leier, Andre and Revote, Jerico and Jia, Cangzhi and Zhu, Yan and Smith, A Ian and Webb, Geoffrey I and Liu, Quanzhong and Wei, Leyi and Li, Jian and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {PRISMOID: a comprehensive 3D structure database for post-translational modifications and mutations with functional impact},
year = {2020},
issn = {1477-4054},
number = {3},
pages = {1069-1079},
volume = {21},
abstract = {{Post-translational modifications (PTMs) play very important roles in various cell signaling pathways and biological process. Due to PTMs' extremely important roles, many major PTMs have been studied, while the functional and mechanical characterization of major PTMs is well documented in several databases. However, most currently available databases mainly focus on protein sequences, while the real 3D structures of PTMs have been largely ignored. Therefore, studies of PTMs 3D structural signatures have been severely limited by the deficiency of the data. Here, we develop PRISMOID, a novel publicly available and free 3D structure database for a wide range of PTMs. PRISMOID represents an up-to-date and interactive online knowledge base with specific focus on 3D structural contexts of PTMs sites and mutations that occur on PTMs and in the close proximity of PTM sites with functional impact. The first version of PRISMOID encompasses 17145 non-redundant modification sites on 3919 related protein 3D structure entries pertaining to 37 different types of PTMs. Our entry web page is organized in a comprehensive manner, including detailed PTM annotation on the 3D structure and biological information in terms of mutations affecting PTMs, secondary structure features and per-residue solvent accessibility features of PTM sites, domain context, predicted natively disordered regions and sequence alignments. In addition, high-definition JavaScript packages are employed to enhance information visualization in PRISMOID. PRISMOID equips a variety of interactive and customizable search options and data browsing functions; these capabilities allow users to access data via keyword, ID and advanced options combination search in an efficient and user-friendly way. A download page is also provided to enable users to download the SQL file, computational structural features and PTM sites' data. 
We anticipate PRISMOID will swiftly become an invaluable online resource, assisting both biologists and bioinformaticians to conduct experiments and develop applications supporting discovery efforts in the sequence-structural-functional relationship of PTMs and providing important insight into mutations and PTM sites interaction mechanisms. The PRISMOID database is freely accessible at http://prismoid.erc.monash.edu/. The database and web interface are implemented in MySQL, JSP, JavaScript and HTML with all major browsers supported.}},
doi = {10.1093/bib/bbz050},
keywords = {Bioinformatics},
related = {computational-biology},
}

ABSTRACT Post-translational modifications (PTMs) play very important roles in various cell signaling pathways and biological process. Due to PTMs' extremely important roles, many major PTMs have been studied, while the functional and mechanical characterization of major PTMs is well documented in several databases. However, most currently available databases mainly focus on protein sequences, while the real 3D structures of PTMs have been largely ignored. Therefore, studies of PTMs 3D structural signatures have been severely limited by the deficiency of the data. Here, we develop PRISMOID, a novel publicly available and free 3D structure database for a wide range of PTMs. PRISMOID represents an up-to-date and interactive online knowledge base with specific focus on 3D structural contexts of PTMs sites and mutations that occur on PTMs and in the close proximity of PTM sites with functional impact. The first version of PRISMOID encompasses 17145 non-redundant modification sites on 3919 related protein 3D structure entries pertaining to 37 different types of PTMs. Our entry web page is organized in a comprehensive manner, including detailed PTM annotation on the 3D structure and biological information in terms of mutations affecting PTMs, secondary structure features and per-residue solvent accessibility features of PTM sites, domain context, predicted natively disordered regions and sequence alignments. In addition, high-definition JavaScript packages are employed to enhance information visualization in PRISMOID. PRISMOID equips a variety of interactive and customizable search options and data browsing functions; these capabilities allow users to access data via keyword, ID and advanced options combination search in an efficient and user-friendly way. A download page is also provided to enable users to download the SQL file, computational structural features and PTM sites' data.
We anticipate PRISMOID will swiftly become an invaluable online resource, assisting both biologists and bioinformaticians to conduct experiments and develop applications supporting discovery efforts in the sequence-structural-functional relationship of PTMs and providing important insight into mutations and PTM sites interaction mechanisms. The PRISMOID database is freely accessible at http://prismoid.erc.monash.edu/. The database and web interface are implemented in MySQL, JSP, JavaScript and HTML with all major browsers supported.

DeepCleave: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites.
Li, F., Chen, J., Leier, A., Marquez-Lago, T., Liu, Q., Wang, Y., Revote, J., Smith, I. A., Akutsu, T., Webb, G. I., Kurgan, L., & Song, J.
Bioinformatics, 36(4), 1057-1065, 2020.
[Bibtex] [Abstract] → Access on publisher site

@Article{Li2020a,
author = {Li, Fuyi and Chen, Jinxiang and Leier, Andre and Marquez-Lago, Tatiana and Liu, Quanzhong and Wang, Yanze and Revote, Jerico and Smith, A Ian and Akutsu, Tatsuya and Webb, Geoffrey I and Kurgan, Lukasz and Song, Jiangning},
journal = {Bioinformatics},
title = {DeepCleave: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites},
year = {2020},
issn = {1367-4803},
number = {4},
pages = {1057-1065},
volume = {36},
abstract = {{Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in the "life and death" process of proteins, many of the corresponding substrates and their cleavage sites were not found yet. Availability of accurate predictors of the substrates and cleavage sites would facilitate understanding of proteases' functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events. We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. High predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and the application of transfer learning, multiple kernels and attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites. The DeepCleave webserver and source code are freely available at http://deepcleave.erc.monash.edu/. Supplementary data are available at Bioinformatics online.}},
comment = {Clarivate Web of Science Highly Cited Paper 2020},
doi = {10.1093/bioinformatics/btz721},
keywords = {Bioinformatics},
related = {computational-biology},
}

ABSTRACT Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in the "life and death" process of proteins, many of the corresponding substrates and their cleavage sites were not found yet. Availability of accurate predictors of the substrates and cleavage sites would facilitate understanding of proteases' functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events. We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. High predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and the application of transfer learning, multiple kernels and attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites. The DeepCleave webserver and source code are freely available at http://deepcleave.erc.monash.edu/. Supplementary data are available at Bioinformatics online.

iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data.
Chen, Z., Zhao, P., Li, F., Marquez-Lago, T. T., Leier, A., Revote, J., Zhu, Y., Powell, D. R., Akutsu, T., Webb, G. I., Chou, K., Smith, I. A., Daly, R. J., Li, J., & Song, J.
Briefings in Bioinformatics, 21(3), 1047-1057, 2020.
[Bibtex] [Abstract] → Access on publisher site

@Article{10.1093/bib/bbz041,
author = {Chen, Zhen and Zhao, Pei and Li, Fuyi and Marquez-Lago, Tatiana T and Leier, Andre and Revote, Jerico and Zhu, Yan and Powell, David R and Akutsu, Tatsuya and Webb, Geoffrey I and Chou, Kuo-Chen and Smith, A Ian and Daly, Roger J and Li, Jian and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data},
year = {2020},
issn = {1477-4054},
number = {3},
pages = {1047-1057},
volume = {21},
abstract = {With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.},
comment = {Clarivate Web of Science Highly Cited Paper 2020 - 2024},
doi = {10.1093/bib/bbz041},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}

ABSTRACT With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.

Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences.
Chen, Z., Zhao, P., Li, F., Wang, Y., Smith, I. A., Webb, G. I., Akutsu, T., Baggag, A., Bensmail, H., & Song, J.
Briefings in Bioinformatics, 21(5), 1676-1696, 2020.
[Bibtex] [Abstract] → Access on publisher site

@Article{10.1093/bib/bbz112,
author = {Chen, Zhen and Zhao, Pei and Li, Fuyi and Wang, Yanan and Smith, A Ian and Webb, Geoffrey I and Akutsu, Tatsuya and Baggag, Abdelkader and Bensmail, Halima and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences},
year = {2020},
issn = {1477-4054},
number = {5},
pages = {1676-1696},
volume = {21},
abstract = {RNA post-transcriptional modifications play a crucial role in a myriad of biological processes and cellular functions. To date, more than 160 RNA modifications have been discovered; therefore, accurate identification of RNA-modification sites is fundamental for a better understanding of RNA-mediated biological functions and mechanisms. However, due to limitations in experimental methods, systematic identification of different types of RNA-modification sites remains a major challenge. Recently, more than 20 computational methods have been developed to identify RNA-modification sites in tandem with high-throughput experimental methods, with most of these capable of predicting only single types of RNA-modification sites. These methods show high diversity in their dataset size, data quality, core algorithms, features extracted and feature selection techniques and evaluation strategies. Therefore, there is an urgent need to revisit these methods and summarize their methodologies, in order to improve and further develop computational techniques to identify and characterize RNA-modification sites from the large amounts of sequence data. With this goal in mind, first, we provide a comprehensive survey on a large collection of 27 state-of-the-art approaches for predicting N1-methyladenosine and N6-methyladenosine sites. We cover a variety of important aspects that are crucial for the development of successful predictors, including the dataset quality, operating algorithms, sequence and genomic features, feature selection, model performance evaluation and software utility. In addition, we also provide our thoughts on potential strategies to improve the model performance. Second, we propose a computational approach called DeepPromise based on deep learning techniques for simultaneous prediction of N1-methyladenosine and N6-methyladenosine. 
To extract the sequence context surrounding the modification sites, three feature encodings, including enhanced nucleic acid composition, one-hot encoding, and RNA embedding, were used as the input to seven consecutive layers of convolutional neural networks (CNNs), respectively. Moreover, DeepPromise further combined the prediction score of the CNN-based models and achieved around 43\% higher area under receiver-operating curve (AUROC) for m1A site prediction and 6\% higher AUROC for m6A site prediction, respectively, when compared with several existing state-of-the-art approaches on the independent test. In-depth analyses of characteristic sequence motifs identified from the convolution-layer filters indicated that nucleotide presentation at proximal positions surrounding the modification sites contributed most to the classification, whereas those at distal positions also affected classification but to different extents. To maximize user convenience, a web server was developed as an implementation of DeepPromise and made publicly available at http://DeepPromise.erc.monash.edu/, with the server accepting both RNA sequences and genomic sequences to allow prediction of two types of putative RNA-modification sites.},
doi = {10.1093/bib/bbz112},
eprint = {http://oup.prod.sis.lan/bib/advance-article-pdf/doi/10.1093/bib/bbz112/30663813/bbz112.pdf},
keywords = {Bioinformatics},
related = {computational-biology},
}

ABSTRACT RNA post-transcriptional modifications play a crucial role in a myriad of biological processes and cellular functions. To date, more than 160 RNA modifications have been discovered; therefore, accurate identification of RNA-modification sites is fundamental for a better understanding of RNA-mediated biological functions and mechanisms. However, due to limitations in experimental methods, systematic identification of different types of RNA-modification sites remains a major challenge. Recently, more than 20 computational methods have been developed to identify RNA-modification sites in tandem with high-throughput experimental methods, with most of these capable of predicting only single types of RNA-modification sites. These methods show high diversity in their dataset size, data quality, core algorithms, features extracted and feature selection techniques and evaluation strategies. Therefore, there is an urgent need to revisit these methods and summarize their methodologies, in order to improve and further develop computational techniques to identify and characterize RNA-modification sites from the large amounts of sequence data. With this goal in mind, first, we provide a comprehensive survey on a large collection of 27 state-of-the-art approaches for predicting N1-methyladenosine and N6-methyladenosine sites. We cover a variety of important aspects that are crucial for the development of successful predictors, including the dataset quality, operating algorithms, sequence and genomic features, feature selection, model performance evaluation and software utility. In addition, we also provide our thoughts on potential strategies to improve the model performance. Second, we propose a computational approach called DeepPromise based on deep learning techniques for simultaneous prediction of N1-methyladenosine and N6-methyladenosine. 
To extract the sequence context surrounding the modification sites, three feature encodings, including enhanced nucleic acid composition, one-hot encoding, and RNA embedding, were used as the input to seven consecutive layers of convolutional neural networks (CNNs), respectively. Moreover, DeepPromise further combined the prediction score of the CNN-based models and achieved around 43% higher area under receiver-operating curve (AUROC) for m1A site prediction and 6% higher AUROC for m6A site prediction, respectively, when compared with several existing state-of-the-art approaches on the independent test. In-depth analyses of characteristic sequence motifs identified from the convolution-layer filters indicated that nucleotide presentation at proximal positions surrounding the modification sites contributed most to the classification, whereas those at distal positions also affected classification but to different extents. To maximize user convenience, a web server was developed as an implementation of DeepPromise and made publicly available at http://DeepPromise.erc.monash.edu/, with the server accepting both RNA sequences and genomic sequences to allow prediction of two types of putative RNA-modification sites.
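For readers unfamiliar with the AUROC metric reported in the abstract above, the following Python sketch computes it as the fraction of positive/negative pairs in which the positive example receives the higher score (ties count as half). The labels and scores are made-up toy values, not DeepPromise results, and the function name is hypothetical.

```python
# Illustrative AUROC computation via pairwise comparison (equivalent to the
# normalised Mann-Whitney U statistic). Toy data only; nothing here is taken
# from the DeepPromise paper or its code.


def auroc(labels, scores):
    """Return the area under the ROC curve for binary labels (1 = positive)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    toy_labels = [1, 1, 0, 0]           # hypothetical site labels
    toy_scores = [0.9, 0.4, 0.5, 0.1]   # hypothetical predictor outputs
    print(auroc(toy_labels, toy_scores))
```

The quadratic pairwise loop is chosen for clarity; for large test sets a rank-based implementation would be preferable.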

SIMLIN: a bioinformatics tool for prediction of S-sulphenylation in the human proteome based on multi-stage ensemble-learning models.
Wang, X., Li, C., Li, F., Sharma, V. S., Song, J., & Webb, G. I.
BMC Bioinformatics, 20(1), Art. no. 602, 2019.
[Bibtex] [Abstract] → Access on publisher site

@Article{Wang2019,
author = {Wang, Xiaochuan and Li, Chen and Li, Fuyi and Sharma, Varun S. and Song, Jiangning and Webb, Geoffrey I.},
journal = {BMC Bioinformatics},
title = {SIMLIN: a bioinformatics tool for prediction of S-sulphenylation in the human proteome based on multi-stage ensemble-learning models},
year = {2019},
month = {Nov},
number = {1},
volume = {20},
abstract = {S-sulphenylation is a ubiquitous protein post-translational modification (PTM) where an S-hydroxyl bond is formed via the reversible oxidation on the Sulfhydryl group of cysteine (C). Recent experimental studies have revealed that S-sulphenylation plays critical roles in many biological functions, such as protein regulation and cell signaling. State-of-the-art bioinformatic advances have facilitated high-throughput in silico screening of protein S-sulphenylation sites, thereby significantly reducing the time and labour costs traditionally required for the experimental investigation of S-sulphenylation.},
articlenumber = {602},
doi = {10.1186/s12859-019-3178-6},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}

ABSTRACT S-sulphenylation is a ubiquitous protein post-translational modification (PTM) where an S-hydroxyl bond is formed via the reversible oxidation on the Sulfhydryl group of cysteine (C). Recent experimental studies have revealed that S-sulphenylation plays critical roles in many biological functions, such as protein regulation and cell signaling. State-of-the-art bioinformatic advances have facilitated high-throughput in silico screening of protein S-sulphenylation sites, thereby significantly reducing the time and labour costs traditionally required for the experimental investigation of S-sulphenylation.

Positive-unlabelled learning of glycosylation sites in the human proteome.
Li, F., Zhang, Y., Purcell, A. W., Webb, G. I., Chou, K., Lithgow, T., Li, C., & Song, J.
BMC Bioinformatics, 20(1), 112, 2019.
[Bibtex] [Abstract] → Access on publisher site

@Article{Li2019,
author = {Li, Fuyi and Zhang, Yang and Purcell, Anthony W. and Webb, Geoffrey I. and Chou, Kuo-Chen and Lithgow, Trevor and Li, Chen and Song, Jiangning},
journal = {BMC Bioinformatics},
title = {Positive-unlabelled learning of glycosylation sites in the human proteome},
year = {2019},
issn = {1471-2105},
month = {Mar},
number = {1},
pages = {112},
volume = {20},
abstract = {As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. The abundance and ubiquity of protein glycosylation across three domains of life involving Eukarya, Bacteria and Archaea demonstrate its roles in regulating a variety of signalling and metabolic pathways. Mutations on and in the proximity of glycosylation sites are highly associated with human diseases. Accordingly, accurate prediction of glycosylation can complement laboratory-based methods and greatly benefit experimental efforts for characterization and understanding of functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, demonstrating a promising predictive performance. To train a conventional supervised-learning model, both reliable positive and negative samples are required. However, in practice, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled due to the limitation of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid in model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites).},
doi = {10.1186/s12859-019-2700-1},
keywords = {Bioinformatics},
related = {computational-biology},
url = {https://rdcu.be/bpQBV},
}

ABSTRACT As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. The abundance and ubiquity of protein glycosylation across three domains of life involving Eukarya, Bacteria and Archaea demonstrate its roles in regulating a variety of signalling and metabolic pathways. Mutations on and in the proximity of glycosylation sites are highly associated with human diseases. Accordingly, accurate prediction of glycosylation can complement laboratory-based methods and greatly benefit experimental efforts for characterization and understanding of functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, demonstrating a promising predictive performance. To train a conventional supervised-learning model, both reliable positive and negative samples are required. However, in practice, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled due to the limitation of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid in model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites).

Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches.
Wang, J., Yang, B., An, Y., Marquez-Lago, T., Leier, A., Wilksch, J., Hong, Q., Zhang, Y., Hayashida, M., Akutsu, T., Webb, G. I., Strugnell, R. A., Song, J., & Lithgow, T.
Briefings in Bioinformatics, 20(3), 931-951, 2019.

@Article{doi:10.1093/bib/bbx164,
author = {Wang, Jiawei and Yang, Bingjiao and An, Yi and Marquez-Lago, Tatiana and Leier, Andre and Wilksch, Jonathan and Hong, Qingyang and Zhang, Yang and Hayashida, Morihiro and Akutsu, Tatsuya and Webb, Geoffrey I and Strugnell, Richard A and Song, Jiangning and Lithgow, Trevor},
journal = {Briefings in Bioinformatics},
title = {Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches},
year = {2019},
number = {3},
pages = {931-951},
volume = {20},
abstract = {In the course of infecting their hosts, pathogenic bacteria secrete numerous effectors, namely, bacterial proteins that pervert host cell biology. Many Gram-negative bacteria, including context-dependent human pathogens, use a type IV secretion system (T4SS) to translocate effectors directly into the cytosol of host cells. Various type IV secreted effectors (T4SEs) have been experimentally validated to play crucial roles in virulence by manipulating host cell gene expression and other processes. Consequently, the identification of novel effector proteins is an important step in increasing our understanding of host-pathogen interactions and bacterial pathogenesis. Here, we train and compare six machine learning models, namely, Naive Bayes (NB), K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), support vector machines (SVMs) and multilayer perceptron (MLP), for the identification of T4SEs using 10 types of selected features and 5-fold cross-validation. Our study shows that: (1) including different but complementary features generally enhance the predictive performance of T4SEs; (2) ensemble models, obtained by integrating individual single-feature models, exhibit a significantly improved predictive performance and (3) the 'majority voting strategy' led to a more stable and accurate classification performance when applied to predicting an ensemble learning model with distinct single features. We further developed a new method to effectively predict T4SEs, Bastion4 (Bacterial secretion effector predictor for T4SS), and we show our ensemble classifier clearly outperforms two recent prediction tools. In summary, we developed a state-of-the-art T4SE predictor by conducting a comprehensive performance evaluation of different machine learning algorithms along with a detailed analysis of single- and multi-feature selections.},
doi = {10.1093/bib/bbx164},
keywords = {Bioinformatics},
related = {computational-biology},
}
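
The 'majority voting strategy' named in the abstract above can be illustrated with a minimal sketch. This is not the Bastion4 implementation: the function name `majority_vote` and the toy predictions are assumptions made for illustration. It only shows how per-sample labels from several single-feature models might be combined.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    `predictions` is a list with one entry per model; each entry is a
    list of predicted labels, one per sample. All entries must have the
    same length. Using an odd number of binary models avoids ties.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    voted = []
    # zip(*predictions) groups the labels that each model gave to one sample.
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Toy example: three hypothetical single-feature models, three proteins.
# 1 = predicted effector, 0 = predicted non-effector.
model_outputs = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
ensemble_labels = majority_vote(model_outputs)
```

Each sample receives the label that most models agree on, so a single model's error is outvoted when the other models are correct.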

Large-scale comparative assessment of computational predictors for lysine post-translational modification sites.
Chen, Z., Li, L., Xu, D., Chou, K., Liu, X., Smith, A. I., Li, F., Song, J., Li, C., Leier, A., Marquez-Lago, T., Akutsu, T., & Webb, G. I.
Briefings in Bioinformatics, 20(6), 2267-2290, 2019.

@Article{ChenEtAl118b,
author = {Chen, Zhen and Li, Lei and Xu, Dakang and Chou, Kuo-Chen and Liu, Xuhan and Smith, Alexander Ian and Li, Fuyi and Song, Jiangning and Li, Chen and Leier, Andre and Marquez-Lago, Tatiana and Akutsu, Tatsuya and Webb, Geoffrey I},
journal = {Briefings in Bioinformatics},
title = {Large-scale comparative assessment of computational predictors for lysine post-translational modification sites},
year = {2019},
number = {6},
pages = {2267-2290},
volume = {20},
abstract = {Lysine post-translational modifications (PTMs) play a crucial role in regulating diverse functions and biological processes of proteins. However, because of the large volumes of sequencing data generated from genome-sequencing projects, systematic identification of different types of lysine PTM substrates and PTM sites in the entire proteome remains a major challenge. In recent years, a number of computational methods for lysine PTM identification have been developed. These methods show high diversity in their core algorithms, features extracted and feature selection techniques and evaluation strategies. There is therefore an urgent need to revisit these methods and summarize their methodologies, to improve and further develop computational techniques to identify and characterize lysine PTMs from the large amounts of sequence data. With this goal in mind, we first provide a comprehensive survey on a large collection of 49 state-of-the-art approaches for lysine PTM prediction. We cover a variety of important aspects that are crucial for the development of successful predictors, including operating algorithms, sequence and structural features, feature selection, model performance evaluation and software utility. We further provide our thoughts on potential strategies to improve the model performance. Second, in order to examine the feasibility of using deep learning for lysine PTM prediction, we propose a novel computational framework, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs), using deep, bidirectional, long short-term memory recurrent neural networks for accurate and systematic mapping of eight major types of lysine PTMs in the human and mouse proteomes. Extensive benchmarking tests show that MUscADEL outperforms current methods for lysine PTM characterization, demonstrating the potential and power of deep learning techniques in protein PTM prediction. 
The web server of MUscADEL, together with all the data sets assembled in this study, is freely available at http://muscadel.erc.monash.edu/. We anticipate this comprehensive review and the application of deep learning will provide practical guide and useful insights into PTM prediction and inspire future bioinformatics studies in the related fields.},
doi = {10.1093/bib/bby089},
keywords = {Bioinformatics},
related = {computational-biology},
}

iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites.
Song, J., Wang, Y., Li, F., Akutsu, T., Rawlings, N. D., Webb, G. I., & Chou, K.
Briefings in Bioinformatics, 20(2), 638-658, 2019.

@Article{doi:10.1093/bib/bby028,
author = {Song, Jiangning and Wang, Yanan and Li, Fuyi and Akutsu, Tatsuya and Rawlings, Neil D and Webb, Geoffrey I and Chou, Kuo-Chen},
journal = {Briefings in Bioinformatics},
title = {iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites},
year = {2019},
number = {2},
pages = {638-658},
volume = {20},
abstract = {Regulation of proteolysis plays a critical role in a myriad of important cellular processes. The key to better understanding the mechanisms that control this process is to identify the specific substrates that each protease targets. To address this, we have developed iProt-Sub, a powerful bioinformatics tool for the accurate prediction of protease-specific substrates and their cleavage sites. Importantly, iProt-Sub represents a significantly advanced version of its successful predecessor, PROSPER. It provides optimized cleavage site prediction models with better prediction performance and coverage for more species-specific proteases (4 major protease families and 38 different proteases). iProt-Sub integrates heterogeneous sequence and structural features and uses a two-step feature selection procedure to further remove redundant and irrelevant features in an effort to improve the cleavage site prediction accuracy. Features used by iProt-Sub are encoded by 11 different sequence encoding schemes, including local amino acid sequence profile, secondary structure, solvent accessibility and native disorder, which will allow a more accurate representation of the protease specificity of approximately 38 proteases and training of the prediction models. Benchmarking experiments using cross-validation and independent tests showed that iProt-Sub is able to achieve a better performance than several existing generic tools. We anticipate that iProt-Sub will be a powerful tool for proteome-wide prediction of protease-specific substrates and their cleavage sites, and will facilitate hypothesis-driven functional interrogation of protease-specific substrate cleavage and proteolytic events.},
comment = {Clarivate Web of Science Hot Paper 2019},
comment2 = {Clarivate Web of Science Highly Cited Paper 2019 - 2023},
doi = {10.1093/bib/bby028},
keywords = {Bioinformatics},
related = {computational-biology},
}

Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods.
Li, F., Wang, Y., Li, C., Marquez-Lago, T. T., Leier, A., Rawlings, N. D., Haffari, G., Revote, J., Akutsu, T., Chou, K., Purcell, A. W., Pike, R. N., Webb, G. I., Smith, I. A., Lithgow, T., Daly, R. J., Whisstock, J. C., & Song, J.
Briefings in Bioinformatics, 20(6), 2150-2166, 2019.

@Article{Li18b,
author = {Li, Fuyi and Wang, Yanan and Li, Chen and Marquez-Lago, Tatiana T and Leier, Andre and Rawlings, Neil D and Haffari, Gholamreza and Revote, Jerico and Akutsu, Tatsuya and Chou, Kuo-Chen and Purcell, Anthony W and Pike, Robert N and Webb, Geoffrey I and Smith, Ian A and Lithgow, Trevor and Daly, Roger J and Whisstock, James C and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods},
year = {2019},
number = {6},
pages = {2150-2166},
volume = {20},
abstract = {The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. 
We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.},
doi = {10.1093/bib/bby077},
keywords = {Bioinformatics},
related = {computational-biology},
}

Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework.
Zhang, Y., Xie, R., Wang, J., Leier, A., Marquez-Lago, T. T., Akutsu, T., Webb, G. I., Chou, K., & Song, J.
Briefings in Bioinformatics, 20(6), 2185-2199, 2019.

@Article{ZhangEtAl18,
author = {Zhang, Yanju and Xie, Ruopeng and Wang, Jiawei and Leier, Andre and Marquez-Lago, Tatiana T. and Akutsu, Tatsuya and Webb, Geoffrey I. and Chou, Kuo-Chen and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework},
year = {2019},
number = {6},
pages = {2185-2199},
volume = {20},
abstract = {As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.},
doi = {10.1093/bib/bby079},
keywords = {Bioinformatics},
related = {computational-biology},
}

Critical evaluation of bioinformatics tools for the prediction of protein crystallization propensity.
Wang, H., Feng, L., Webb, G. I., Kurgan, L., Song, J., & Lin, D.
Briefings in Bioinformatics, 19(5), 838-852, 2018.

@Article{WangEtAl18,
author = {Wang, Huilin and Feng, Liubin and Webb, Geoffrey I and Kurgan, Lukasz and Song, Jiangning and Lin, Donghai},
journal = {Briefings in Bioinformatics},
title = {Critical evaluation of bioinformatics tools for the prediction of protein crystallization propensity},
year = {2018},
number = {5},
pages = {838-852},
volume = {19},
abstract = {X-ray crystallography is the main tool for structural determination of proteins. Yet, the underlying crystallization process is costly, has a high attrition rate and involves a series of trial-and-error attempts to obtain diffraction-quality crystals. The Structural Genomics Consortium aims to systematically solve representative structures of major protein-fold classes using primarily high-throughput X-ray crystallography. The attrition rate of these efforts can be improved by selection of proteins that are potentially easier to be crystallized. In this context, bioinformatics approaches have been developed to predict crystallization propensities based on protein sequences. These approaches are used to facilitate prioritization of the most promising target proteins, search for alternative structural orthologues of the target proteins and suggest designs of constructs capable of potentially enhancing the likelihood of successful crystallization. We reviewed and compared nine predictors of protein crystallization propensity. Moreover, we demonstrated that integrating selected outputs from multiple predictors as candidate input features to build the predictive model results in a significantly higher predictive performance when compared to using these predictors individually. Furthermore, we also introduced a new and accurate predictor of protein crystallization propensity, Crysf, which uses functional features extracted from UniProt as inputs. This comprehensive review will assist structural biologists in selecting the most appropriate predictor, and is also beneficial for bioinformaticians to develop a new generation of predictive algorithms.},
doi = {10.1093/bib/bbx018},
keywords = {Bioinformatics},
related = {computational-biology},
}

Comprehensive assessment and performance improvement of effector protein predictors for bacterial secretion systems III, IV and VI.
An, Y., Wang, J., Li, C., Leier, A., Marquez-Lago, T., Wilksch, J., Zhang, Y., Webb, G. I., Song, J., & Lithgow, T.
Briefings in Bioinformatics, 19(1), 148-161, 2018.

@Article{AnEtAl2016,
author = {An, Yi and Wang, Jiawei and Li, Chen and Leier, Andre and Marquez-Lago, Tatiana and Wilksch, Jonathan and Zhang, Yang and Webb, Geoffrey I. and Song, Jiangning and Lithgow, Trevor},
journal = {Briefings in Bioinformatics},
title = {Comprehensive assessment and performance improvement of effector protein predictors for bacterial secretion systems III, IV and VI},
year = {2018},
number = {1},
pages = {148-161},
volume = {19},
abstract = {Bacterial effector proteins secreted by various protein secretion systems play crucial roles in host-pathogen interactions. In this context, computational tools capable of accurately predicting effector proteins of the various types of bacterial secretion systems are highly desirable. Existing computational approaches use different machine learning (ML) techniques and heterogeneous features derived from protein sequences and/or structural information. These predictors differ not only in terms of the used ML methods but also with respect to the used curated data sets, the features selection and their prediction performance. Here, we provide a comprehensive survey and benchmarking of currently available tools for the prediction of effector proteins of bacterial types III, IV and VI secretion systems (T3SS, T4SS and T6SS, respectively). We review core algorithms, feature selection techniques, tool availability and applicability and evaluate the prediction performance based on carefully curated independent test data sets. In an effort to improve predictive performance, we constructed three ensemble models based on ML algorithms by integrating the output of all individual predictors reviewed. Our benchmarks demonstrate that these ensemble models outperform all the reviewed tools for the prediction of effector proteins of T3SS and T4SS. The webserver of the proposed ensemble methods for T3SS and T4SS effector protein prediction is freely available at http://tbooster.erc.monash.edu/index.jsp. We anticipate that this survey will serve as a useful guide for interested users and that the new ensemble predictors will stimulate research into host-pathogen relationships and inspiration for the development of new bioinformatics tools for predicting effector proteins of T3SS, T4SS and T6SS.},
doi = {10.1093/bib/bbw100},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}

iFeature: A python package and web server for features extraction and selection from protein and peptide sequences.
Chen, Z., Zhao, P., Li, F., Leier, A., Marquez-Lago, T. T., Wang, Y., Webb, G. I., Smith, A. I., Daly, R. J., Chou, K., & Song, J.
Bioinformatics, 2499-2502, 2018.

@Article{ChenEtAl18,
author = {Chen, Zhen and Zhao, Pei and Li, Fuyi and Leier, Andre and Marquez-Lago, Tatiana T and Wang, Yanan and Webb, Geoffrey I and Smith, A Ian and Daly, Roger J and Chou, Kuo-Chen and Song, Jiangning},
journal = {Bioinformatics},
title = {iFeature: A python package and web server for features extraction and selection from protein and peptide sequences},
year = {2018},
pages = {2499-2502},
abstract = {Structural and physiochemical descriptors extracted from sequence data have been widely used to represent sequences and predict structural, functional, expression and interaction profiles of proteins and peptides as well as DNAs/RNAs. Here, we present iFeature, a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. iFeature is capable of calculating and extracting a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors. It also allows users to extract specific amino acid properties from the AAindex database. Furthermore, iFeature integrates 12 different types of commonly used feature clustering, selection and dimensionality reduction algorithms, greatly facilitating training, analysis and benchmarking of machine-learning models. The functionality of iFeature is made freely available via an online web server and a stand-alone toolkit.},
comment = {Clarivate Web of Science Highly Cited Paper 2019 - 2024},
doi = {10.1093/bioinformatics/bty140},
keywords = {Bioinformatics},
related = {computational-biology},
}

Structural Capacitance in Protein Evolution and Human Diseases.
Li, C., Clark, L. V. T., Zhang, R., Porebski, B. T., McCoey, J. M., Borg, N. A., Webb, G. I., Kass, I., Buckle, M., Song, J., Woolfson, A., & Buckle, A. M.
Journal of Molecular Biology, 430(18), 3200-3217, 2018.

@Article{Li2018,
Title = {Structural Capacitance in Protein Evolution and Human Diseases},
Author = {Li, Chen and Clark, Liah V.T. and Zhang, Rory and Porebski, Benjamin T. and McCoey, Julia M. and Borg, Natalie A. and Webb, Geoffrey I. and Kass, Itamar and Buckle, Malcolm and Song, Jiangning and Woolfson, Adrian and Buckle, Ashley M.},
Journal = {Journal of Molecular Biology},
Year = {2018},
Number = {18},
Pages = {3200-3217},
Volume = {430},
Doi = {10.1016/j.jmb.2018.06.051},
ISSN = {0022-2836},
Keywords = {Bioinformatics},
Related = {computational-biology}
}

PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework.
Song, J., Li, F., Takemoto, K., Haffari, G., Akutsu, T., Chou, K. C., & Webb, G. I.
Journal of Theoretical Biology, 443, 125-137, 2018.

@Article{SongEtAl18,
author = {Song, J. and Li, F. and Takemoto, K. and Haffari, G. and Akutsu, T. and Chou, K. C. and Webb, G. I.},
journal = {Journal of Theoretical Biology},
title = {PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework},
year = {2018},
pages = {125-137},
volume = {443},
comment = {Clarivate Web of Science Hot Paper},
comment2 = {Clarivate Web of Science Highly Cited Paper 2019, 2020},
doi = {10.1016/j.jtbi.2018.01.023},
keywords = {Bioinformatics},
related = {computational-biology},
url = {https://authors.elsevier.com/c/1WWQY57ilzyRc},
}

MetalExplorer, a Bioinformatics Tool for the Improved Prediction of Eight Types of Metal-binding Sites Using a Random Forest Algorithm with Two-step Feature Selection.
Song, J., Li, C., Zheng, C., Revote, J., Zhang, Z., & Webb, G. I.
Current Bioinformatics, 12(6), 480-489, 2017.

@Article{SongEtAl16,
author = {Song, Jiangning and Li, Chen and Zheng, Cheng and Revote, Jerico and Zhang, Ziding and Webb, Geoffrey I.},
journal = {Current Bioinformatics},
title = {MetalExplorer, a Bioinformatics Tool for the Improved Prediction of Eight Types of Metal-binding Sites Using a Random Forest Algorithm with Two-step Feature Selection},
year = {2017},
issn = {1574-8936/2212-392X},
number = {6},
pages = {480-489},
volume = {12},
abstract = {Metalloproteins are highly involved in many biological processes,
including catalysis, recognition, transport, transcription, and signal
transduction. The metal ions they bind usually play enzymatic or structural
roles in mediating these diverse functional roles. Thus, the systematic
analysis and prediction of metal-binding sites using sequence and/or
structural information are crucial for understanding their
sequence-structure-function relationships. In this study, we propose
MetalExplorer (http://metalexplorer.erc.monash.edu.au/), a new machine
learning-based method for predicting eight different types of metal-binding
sites (Ca, Co, Cu, Fe, Ni, Mg, Mn, and Zn) in proteins. Our approach
combines heterogeneous sequence-, structure-, and residue contact
network-based features. The predictive performance of MetalExplorer was
tested by cross-validation and independent tests using non-redundant
datasets of known structures. This method applies a two-step feature
selection approach based on the maximum relevance minimum redundancy and
forward feature selection to identify the most informative features that
contribute to the prediction performance. With a precision of 60%,
MetalExplorer achieved high recall values, which ranged from 59% to 88% for
the eight metal ion types in fivefold cross-validation tests. Moreover, the
common and type-specific features in the optimal subsets of all metal ions
were characterized in terms of their contributions to the overall
performance. In terms of both benchmark and independent datasets at the 60%
precision control level, MetalExplorer compared favorably with an existing
metalloprotein prediction tool, SitePredict. Thus, MetalExplorer is expected
to be a powerful tool for the accurate prediction of potential metal-binding
sites and it should facilitate the functional analysis and rational design
of novel metalloproteins.},
doi = {10.2174/2468422806666160618091522},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}

PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy.
Song, J., Li, F., Leier, A., Marquez-Lago, T. T., Akutsu, T., Haffari, G., Chou, K., Webb, G. I., & Pike, R. N.
Bioinformatics, 34(4), 684-687, 2017.

@Article{Song2017a,
author = {Song, Jiangning and Li, Fuyi and Leier, Andre and Marquez-Lago, Tatiana T and Akutsu, Tatsuya and Haffari, Gholamreza and Chou, Kuo-Chen and Webb, Geoffrey I and Pike, Robert N},
journal = {Bioinformatics},
title = {PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy},
year = {2017},
number = {4},
pages = {684-687},
volume = {34},
comment = {Clarivate Web of Science Highly Cited Paper 2019, 2020, 2021},
doi = {10.1093/bioinformatics/btx670},
keywords = {Bioinformatics},
related = {computational-biology},
}

SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems.
An, Y., Wang, J., Li, C., Revote, J., Zhang, Y., Naderer, T., Hayashida, M., Akutsu, T., Webb, G. I., Lithgow, T., & Song, J.
Scientific Reports, 7, Art. no. 41031, 2017.

@Article{AnEtAl17,
Title = {SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems},
Author = {An, Yi and Wang, Jiawei and Li, Chen and Revote, Jerico and Zhang, Yang and Naderer, Thomas and Hayashida, Morihiro and Akutsu, Tatsuya and Webb, Geoffrey I. and Lithgow, Trevor and Song, Jiangning},
Journal = {Scientific Reports},
Year = {2017},
Volume = {7},
Articlenumber = {41031},
Doi = {10.1038/srep41031},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://rdcu.be/oJ9I}
}

PhosphoPredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection.
Song, J., Wang, H., Wang, J., Leier, A., Marquez-Lago, T., Yang, B., Zhang, Z., Akutsu, T., Webb, G. I., & Daly, R. J.
Scientific Reports, 7(1), Art. no. 6862, 2017.

@Article{Song2017,
Title = {PhosphoPredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection},
Author = {Song, Jiangning and Wang, Huilin and Wang, Jiawei and Leier, Andre and Marquez-Lago, Tatiana and Yang, Bingjiao and Zhang, Ziding and Akutsu, Tatsuya and Webb, Geoffrey I. and Daly, Roger J.},
Journal = {Scientific Reports},
Year = {2017},
Number = {1},
Volume = {7},
Abstract = {Protein phosphorylation is a major form of post-translational modification (PTM) that regulates diverse cellular processes. In silico methods for phosphorylation site prediction can provide a useful and complementary strategy for complete phosphoproteome annotation. Here, we present a novel bioinformatics tool, PhosphoPredict, that combines protein sequence and functional features to predict kinase-specific substrates and their associated phosphorylation sites for 12 human kinases and kinase families, including ATM, CDKs, GSK-3, MAPKs, PKA, PKB, PKC, and SRC. To elucidate critical determinants, we identified feature subsets that were most informative and relevant for predicting substrate specificity for each individual kinase family. Extensive benchmarking experiments based on both five-fold cross-validation and independent tests indicated that the performance of PhosphoPredict is competitive with that of several other popular prediction tools, including KinasePhos, PPSP, GPS, and Musite. We found that combining protein functional and sequence features significantly improves phosphorylation site prediction performance across all kinases. Application of PhosphoPredict to the entire human proteome identified 150 to 800 potential phosphorylation substrates for each of the 12 kinases or kinase families. PhosphoPredict significantly extends the bioinformatics portfolio for kinase function analysis and will facilitate high-throughput identification of kinase-specific phosphorylation sites, thereby contributing to both basic and translational research programs.},
Articlenumber = {6862},
Doi = {10.1038/s41598-017-07199-4},
Keywords = {Bioinformatics},
Related = {computational-biology}
}

Knowledge-transfer learning for prediction of matrix metalloprotease substrate-cleavage sites.
Wang, Y., Song, J., Marquez-Lago, T. T., Leier, A., Li, C., Lithgow, T., Webb, G. I., & Shen, H.
Scientific Reports, 7, Art. no. 5755, 2017.

@Article{WangYEtAl17,
Title = {Knowledge-transfer learning for prediction of matrix metalloprotease substrate-cleavage sites},
Author = {Wang, Yanan and Song, Jiangning and Marquez-Lago, Tatiana T. and Leier, Andre and Li, Chen and Lithgow, Trevor and Webb, Geoffrey I. and Shen, Hong-Bin},
Journal = {Scientific Reports},
Year = {2017},
Volume = {7},
Articlenumber = {5755},
Doi = {10.1038/s41598-017-06219-7},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles.
Wang, J., Yang, B., Revote, J., Leier, A., Marquez-Lago, T. T., Webb, G. I., Song, J., Chou, K., & Lithgow, T.
Bioinformatics, 33(17), 2756-2758, 2017.

@Article{WangJEtAl17,
Title = {POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles},
Author = {Wang, Jiawei and Yang, Bingjiao and Revote, Jerico and Leier, Andre and Marquez-Lago, Tatiana T. and Webb, Geoffrey I. and Song, Jiangning and Chou, Kuo-Chen and Lithgow, Trevor},
Journal = {Bioinformatics},
Year = {2017},
Number = {17},
Pages = {2756-2758},
Volume = {33},
Doi = {10.1093/bioinformatics/btx302},
Keywords = {Bioinformatics},
Related = {computational-biology}
}

GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features.
Li, F., Li, C., Revote, J., Zhang, Y., Webb, G. I., Li, J., Song, J., & Lithgow, T.
Scientific Reports, 6, Art. no. 34595, 2016.

@Article{LiEtAl16,
Title = {GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features},
Author = {Li, Fuyi and Li, Chen and Revote, Jerico and Zhang, Yang and Webb, Geoffrey I. and Li, Jian and Song, Jiangning and Lithgow, Trevor},
Journal = {Scientific Reports},
Year = {2016},
Month = oct,
Volume = {6},
Articlenumber = {34595},
Doi = {10.1038/srep34595},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli.
Chang, C. C. H., Li, C., Webb, G. I., Tey, B., & Song, J.
Scientific Reports, 6, Art. no. 21844, 2016.

@Article{ChangEtAl2016,
Title = {Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli},
Author = {Chang, C.C.H. and Li, C. and Webb, G. I. and Tey, B. and Song, J.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement with current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture, to predict the real-valued expression level of target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) classifier to be used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson's correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.},
Articlenumber = {21844},
Doi = {10.1038/srep21844},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

Smoothing a rugged protein folding landscape by sequence-based redesign.
Porebski, B. T., Keleher, S., Hollins, J. J., Nickson, A. A., Marijanovic, E. M., Borg, N. A., Costa, M. G. S., Pearce, M. A., Dai, W., Zhu, L., Irving, J. A., Hoke, D. E., Kass, I., Whisstock, J. C., Bottomley, S. P., Webb, G. I., McGowan, S., & Buckle, A. M.
Scientific Reports, 6, Art. no. 33958, 2016.

@Article{Porebski2016,
Title = {Smoothing a rugged protein folding landscape by sequence-based redesign},
Author = {Porebski, Benjamin T. and Keleher, Shani and Hollins, Jeffrey J. and Nickson, Adrian A. and Marijanovic, Emilia M. and Borg, Natalie A. and Costa, Mauricio G. S. and Pearce, Mary A. and Dai, Weiwen and Zhu, Liguang and Irving, James A. and Hoke, David E. and Kass, Itamar and Whisstock, James C. and Bottomley, Stephen P. and Webb, Geoffrey I. and McGowan, Sheena and Buckle, Ashley M.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The rugged folding landscapes of functional proteins puts them at risk of misfolding and aggregation. Serine protease inhibitors, or serpins, are paradigms for this delicate balance between function and misfolding. Serpins exist in a metastable state that undergoes a major conformational change in order to inhibit proteases. However, conformational labiality of the native serpin fold renders them susceptible to misfolding, which underlies misfolding diseases such as alpha1-antitrypsin deficiency. To investigate how serpins balance function and folding, we used consensus design to create conserpin, a synthetic serpin that folds reversibly, is functional, thermostable, and polymerization resistant. Characterization of its structure, folding and dynamics suggest that consensus design has remodeled the folding landscape to reconcile competing requirements for stability and function. This approach may offer general benefits for engineering functional proteins that have risky folding landscapes, including the removal of aggregation-prone intermediates, and modifying scaffolds for use as protein therapeutics.},
Articlenumber = {33958},
Doi = {10.1038/srep33958},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1038/srep33958}
}

Crysalis: an integrated server for computational analysis and design of protein crystallization.
Wang, H., Feng, L., Zhang, Z., Webb, G. I., Lin, D., & Song, J.
Scientific Reports, 6, Art. no. 21383, 2016.

@Article{WangEtAl16,
Title = {Crysalis: an integrated server for computational analysis and design of protein crystallization},
Author = {Wang, H. and Feng, L. and Zhang, Z. and Webb, G. I. and Lin, D. and Song, J.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The failure of multi-step experimental procedures to yield diffraction-quality crystals is a major bottleneck in protein structure determination. Accordingly, several bioinformatics methods have been successfully developed and employed to select crystallizable proteins. Unfortunately, the majority of existing in silico methods only allow the prediction of crystallization propensity, seldom enabling computational design of protein mutants that can be targeted for enhancing protein crystallizability. Here, we present Crysalis, an integrated crystallization analysis tool that builds on support-vector regression (SVR) models to facilitate computational protein crystallization prediction, analysis, and design. More specifically, the functionality of this new tool includes: (1) rapid selection of target crystallizable proteins at the proteome level, (2) identification of site non-optimality for protein crystallization and systematic analysis of all potential single-point mutations that might enhance protein crystallization propensity, and (3) annotation of target protein based on predicted structural properties. We applied the design mode of Crysalis to identify site non-optimality for protein crystallization on a proteome-scale, focusing on proteins currently classified as non-crystallizable. Our results revealed that site non-optimality is based on biases related to residues, predicted structures, physicochemical properties, and sequence loci, which provides in-depth understanding of the features influencing protein crystallization. Crysalis is freely available at http://nmrcen.xmu.edu.cn/crysalis/.},
Articlenumber = {21383},
Doi = {10.1038/srep21383},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome.
Li, F., Li, C., Wang, M., Webb, G. I., Zhang, Y., Whisstock, J. C., & Song, J.
Bioinformatics, 31(9), 1411-1419, 2015.

@Article{LiEtAl15,
Title = {GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome},
Author = {Li, F. and Li, C. and Wang, M. and Webb, G. I. and Zhang, Y. and Whisstock, J. C. and Song, J.},
Journal = {Bioinformatics},
Year = {2015},
Number = {9},
Pages = {1411-1419},
Volume = {31},
Abstract = {Motivation: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50% of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM.
Results: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.},
Doi = {10.1093/bioinformatics/btu852},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

Structural and dynamic properties that govern the stability of an engineered fibronectin type III domain.
Porebski, B. T., Nickson, A. A., Hoke, D. E., Hunter, M. R., Zhu, L., McGowan, S., Webb, G. I., & Buckle, A. M.
Protein Engineering, Design and Selection, 28(3), 67-78, 2015.
[Bibtex] [Abstract] → Access on publisher site

@Article{PorebskiEtAl15,
Title = {Structural and dynamic properties that govern the stability of an engineered fibronectin type III domain},
Author = {Porebski, B. T. and Nickson, A. A. and Hoke, D. E. and Hunter, M. R. and Zhu, L. and McGowan, S. and Webb, G. I. and Buckle, A. M.},
Journal = {Protein Engineering, Design and Selection},
Year = {2015},
Number = {3},
Pages = {67-78},
Volume = {28},
Abstract = {Consensus protein design is a rapid and reliable technique for the improvement of protein stability, which relies on the use of homologous protein sequences. To enhance the stability of a fibronectin type III (FN3) domain, consensus design was employed using an alignment of 2123 sequences. The resulting FN3 domain, FN3con, has unprecedented stability, with a melting temperature $>$100~$^{\circ}$C, a $\Delta G_{D-N}$ of 15.5 kcal mol$^{-1}$ and a greatly reduced unfolding rate compared with wild-type. To determine the underlying molecular basis for stability, an X-ray crystal structure of FN3con was determined to 2.0~{\AA} and compared with other FN3 domains of varying stabilities. The structure of FN3con reveals significantly increased salt bridge interactions that are cooperatively networked, and a highly optimized hydrophobic core. Molecular dynamics simulations of FN3con and comparison structures show the cooperative power of electrostatic and hydrophobic networks in improving FN3con stability. Taken together, our data reveal that FN3con stability does not result from a single mechanism, but rather the combination of several features and the removal of non-conserved, unfavorable interactions. The large number of sequences employed in this study has most likely enhanced the robustness of the consensus design, which is now possible due to the increased sequence availability in the post-genomic era. These studies increase our knowledge of the molecular mechanisms that govern stability and demonstrate the rising potential for enhancing stability via the consensus method.},
Doi = {10.1093/protein/gzv002},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://peds.oxfordjournals.org/content/28/3/67.full.pdf+html}
}

ABSTRACT Consensus protein design is a rapid and reliable technique for the improvement of protein stability, which relies on the use of homologous protein sequences. To enhance the stability of a fibronectin type III (FN3) domain, consensus design was employed using an alignment of 2123 sequences. The resulting FN3 domain, FN3con, has unprecedented stability, with a melting temperature >100 °C, a ΔG_D-N of 15.5 kcal mol⁻¹ and a greatly reduced unfolding rate compared with wild-type. To determine the underlying molecular basis for stability, an X-ray crystal structure of FN3con was determined to 2.0 Å and compared with other FN3 domains of varying stabilities. The structure of FN3con reveals significantly increased salt bridge interactions that are cooperatively networked, and a highly optimized hydrophobic core. Molecular dynamics simulations of FN3con and comparison structures show the cooperative power of electrostatic and hydrophobic networks in improving FN3con stability. Taken together, our data reveal that FN3con stability does not result from a single mechanism, but rather the combination of several features and the removal of non-conserved, unfavorable interactions. The large number of sequences employed in this study has most likely enhanced the robustness of the consensus design, which is now possible due to the increased sequence availability in the post-genomic era. These studies increase our knowledge of the molecular mechanisms that govern stability and demonstrate the rising potential for enhancing stability via the consensus method.

Accurate in Silico Identification of Species-Specific Acetylation Sites by Integrating Protein Sequence-Derived and Functional Features.
Li, Y., Wang, M., Wang, H., Tan, H., Zhang, Z., Webb, G. I., & Song, J.
Scientific Reports, 4, Art. no. 5765, 2014.
[Bibtex] [Abstract] → Access on publisher site

@Article{LiEtAl2014,
author = {Li, Y. and Wang, M. and Wang, H. and Tan, H. and Zhang, Z. and Webb, G. I. and Song, J.},
journal = {Scientific Reports},
title = {Accurate in Silico Identification of Species-Specific Acetylation Sites by Integrating Protein Sequence-Derived and Functional Features},
year = {2014},
volume = {4},
abstract = {Lysine acetylation is a reversible post-translational modification, playing an important role in cytokine signaling, transcriptional regulation, and apoptosis. To fully understand acetylation mechanisms, identification of substrates and specific acetylation sites is crucial. Experimental identification is often time-consuming and expensive. Alternative bioinformatics methods are cost-effective and can be used in a high-throughput manner to generate relatively precise predictions. Here we develop a method termed as SSPKA for species-specific lysine acetylation prediction, using random forest classifiers that combine sequence-derived and functional features with two-step feature selection. Feature importance analysis indicates functional features, applied for lysine acetylation site prediction for the first time, significantly improve the predictive performance. We apply the SSPKA model to screen the entire human proteome and identify many high-confidence putative substrates that are not previously identified. The results along with the implemented Java tool, serve as useful resources to elucidate the mechanism of lysine acetylation and facilitate hypothesis-driven experimental design and validation.},
articlenumber = {5765},
doi = {10.1038/srep05765},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}

ABSTRACT Lysine acetylation is a reversible post-translational modification, playing an important role in cytokine signaling, transcriptional regulation, and apoptosis. To fully understand acetylation mechanisms, identification of substrates and specific acetylation sites is crucial. Experimental identification is often time-consuming and expensive. Alternative bioinformatics methods are cost-effective and can be used in a high-throughput manner to generate relatively precise predictions. Here we develop a method termed as SSPKA for species-specific lysine acetylation prediction, using random forest classifiers that combine sequence-derived and functional features with two-step feature selection. Feature importance analysis indicates functional features, applied for lysine acetylation site prediction for the first time, significantly improve the predictive performance. We apply the SSPKA model to screen the entire human proteome and identify many high-confidence putative substrates that are not previously identified. The results along with the implemented Java tool, serve as useful resources to elucidate the mechanism of lysine acetylation and facilitate hypothesis-driven experimental design and validation.

TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences.
Song, J., Tan, H., Wang, M., Webb, G. I., & Akutsu, T.
PLoS ONE, 7(2), Art. no. e30361, 2012.
[Bibtex] [Abstract] → Access on publisher site

@Article{SongEtAl12,
author = {Song, Jiangning and Tan, Hao and Wang, Mingjun and Webb, Geoffrey I. and Akutsu, Tatsuya},
journal = {PLoS ONE},
title = {TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences},
year = {2012},
month = {02},
number = {2},
volume = {7},
abstract = {Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the C$_{\alpha}$-N bond (Phi)
and the C$_{\alpha}$-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine
the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information
can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to
predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to
perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary
profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered
region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins,
the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8$^{\circ}$ and 44.6$^{\circ}$, respectively, which are 1% and 3% respectively
lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a
random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the
Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting
protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely
accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.},
articlenumber = {e30361},
doi = {10.1371/journal.pone.0030361},
keywords = {Bioinformatics},
publisher = {Public Library of Science},
related = {computational-biology},
url = {http://dx.doi.org/10.1371/journal.pone.0030361},
}

ABSTRACT Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi) and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins, the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8° and 44.6°, respectively, which are 1% and 3% respectively lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.
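
The TANGLE abstract reports mean absolute errors for torsion angles. As a minimal sketch of how such an error can be computed for periodic quantities, the Python function below takes the shorter way around the circle for each residue. The function name `angular_mae` and the wrap-around convention are illustrative assumptions; the abstract does not state the exact formula used by the authors.

```python
def angular_mae(predicted, observed):
    """Mean absolute error between two lists of angles in degrees.

    Assumption for illustration: errors are measured on the circle,
    so 170 and -170 degrees are 20 degrees apart, not 340.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, o in zip(predicted, observed):
        diff = abs(p - o) % 360.0          # raw difference folded into [0, 360)
        total += min(diff, 360.0 - diff)   # shorter arc between the two angles
    return total / len(predicted)


# Hypothetical values, not data from the paper.
example_error = angular_mae([170.0, -60.0], [-170.0, -50.0])
```

With the hypothetical inputs above, the per-residue errors are 20 and 10 degrees.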

PROSPER: An Integrated Feature-Based Tool for Predicting Protease Substrate Cleavage Sites.
Song, J., Tan, H., Perry, A. J., Akutsu, T., Webb, G. I., Whisstock, J. C., & Pike, R. N.
PLoS ONE, 7(11), Art. no. e50300, 2012.
[Bibtex] [Abstract] → Access on publisher site

@Article{SongEtAl12b,
author = {Song, J. and Tan, H. and Perry, A. J. and Akutsu, T. and Webb, G. I. and Whisstock, J. C. and Pike, R. N.},
journal = {PLoS ONE},
title = {PROSPER: An Integrated Feature-Based Tool for Predicting Protease Substrate Cleavage Sites},
year = {2012},
number = {11},
volume = {7},
abstract = {The ability to catalytically cleave protein substrates after synthesis is fundamental for all forms of life. Accordingly, site-specific proteolysis is one of the most important post-translational modifications. The key to understanding the physiological role of a protease is to identify its natural substrate(s). Knowledge of the substrate specificity of a protease can dramatically improve our ability to predict its target protein substrates, but this information must be utilized in an effective manner in order to efficiently identify protein substrates by in silico approaches. To address this problem, we present PROSPER, an integrated feature-based server for in silico identification of protease substrates and their cleavage sites for twenty-four different proteases. PROSPER utilizes established specificity information for these proteases (derived from the MEROPS database) with a machine learning approach to predict protease cleavage sites by using different, but complementary sequence and structure characteristics. Features used by PROSPER include local amino acid sequence profile, predicted secondary structure, solvent accessibility and predicted native disorder. Thus, for proteases with known amino acid specificity, PROSPER provides a convenient, pre-prepared tool for use in identifying protein substrates for the enzymes. Systematic prediction analysis for the twenty-four proteases thus far included in the database revealed that the features we have included in the tool strongly improve performance in terms of cleavage site prediction, as evidenced by their contribution to performance improvement in terms of identifying known cleavage sites in substrates for these enzymes. In comparison with two state-of-the-art prediction tools, PoPS and SitePrediction, PROSPER achieves greater accuracy and coverage. 
To our knowledge, PROSPER is the first comprehensive server capable of predicting cleavage sites of multiple proteases within a single substrate sequence using machine learning techniques. It is freely available at http://lightning.med.monash.edu.au/PROSPER/.},
articlenumber = {e50300},
doi = {10.1371/journal.pone.0050300},
keywords = {Bioinformatics},
publisher = {Public Library of Science},
related = {computational-biology},
}

ABSTRACT The ability to catalytically cleave protein substrates after synthesis is fundamental for all forms of life. Accordingly, site-specific proteolysis is one of the most important post-translational modifications. The key to understanding the physiological role of a protease is to identify its natural substrate(s). Knowledge of the substrate specificity of a protease can dramatically improve our ability to predict its target protein substrates, but this information must be utilized in an effective manner in order to efficiently identify protein substrates by in silico approaches. To address this problem, we present PROSPER, an integrated feature-based server for in silico identification of protease substrates and their cleavage sites for twenty-four different proteases. PROSPER utilizes established specificity information for these proteases (derived from the MEROPS database) with a machine learning approach to predict protease cleavage sites by using different, but complementary sequence and structure characteristics. Features used by PROSPER include local amino acid sequence profile, predicted secondary structure, solvent accessibility and predicted native disorder. Thus, for proteases with known amino acid specificity, PROSPER provides a convenient, pre-prepared tool for use in identifying protein substrates for the enzymes. Systematic prediction analysis for the twenty-four proteases thus far included in the database revealed that the features we have included in the tool strongly improve performance in terms of cleavage site prediction, as evidenced by their contribution to performance improvement in terms of identifying known cleavage sites in substrates for these enzymes. In comparison with two state-of-the-art prediction tools, PoPS and SitePrediction, PROSPER achieves greater accuracy and coverage. 
To our knowledge, PROSPER is the first comprehensive server capable of predicting cleavage sites of multiple proteases within a single substrate sequence using machine learning techniques. It is freely available at http://lightning.med.monash.edu.au/PROSPER/.

Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs.
Mahmood, K., Webb, G. I., Song, J., Whisstock, J. C., & Konagurthu, A. S.
Nucleic Acids Research, 40(6), Art. no. e44, 2012.
[Bibtex] [Abstract] → Access on publisher site

@Article{MahmoodEtAl2012,
author = {Mahmood, K. and Webb, G. I. and Song, J. and Whisstock, J. C. and Konagurthu, A. S.},
journal = {Nucleic Acids Research},
title = {Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs},
year = {2012},
number = {6},
volume = {40},
abstract = {Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, availability of complex genomes containing large gene families with prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy where at each iteration the best gene assignments are identified resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/kmahmood/EGM2.},
articlenumber = {e44},
doi = {10.1093/nar/gkr1261},
eprint = {http://nar.oxfordjournals.org/content/early/2011/12/29/nar.gkr1261.full.pdf+html},
keywords = {Bioinformatics},
publisher = {Oxford Journals},
related = {computational-biology},
url = {http://nar.oxfordjournals.org/content/early/2011/12/29/nar.gkr1261.abstract},
}

ABSTRACT Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, availability of complex genomes containing large gene families with prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy where at each iteration the best gene assignments are identified resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/kmahmood/EGM2.
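
The abstract above describes comparing all pairs of sequences through k-mer counts. The Python sketch below illustrates the general idea with a plain inverted index from k-mers to sequence names. It is not the afree sort-join algorithm itself; the function names, the default `k = 3`, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(seqs, k=3):
    """Count distinct k-mers shared by each pair of named sequences.

    Only pairs that share at least one k-mer appear in the result,
    so unrelated pairs are never compared directly.
    """
    index = defaultdict(set)            # k-mer -> names of sequences containing it
    for name, seq in seqs.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    counts = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            counts[(a, b)] += 1         # one more k-mer shared by this pair
    return dict(counts)


# Hypothetical toy sequences, not data from the paper.
toy = {"a": "MKVLA", "b": "MKVLG", "c": "GGGGG"}
pair_counts = shared_kmer_counts(toy, k=3)
```

Pairs with a high shared count would then be candidates for a more careful alignment; the threshold for "high" is left open here because the abstract does not give one.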

Bioinformatic Approaches for Predicting Substrates of Proteases.
Song, J., Tan, H., Boyd, S. E., Shen, H., Mahmood, K., Webb, G. I., Akutsu, T., Whisstock, J. C., & Pike, R. N.
Journal of Bioinformatics and Computational Biology, 9(1), 149-178, 2011.
[Bibtex] [Abstract] → Access on publisher site

@Article{SongEtAl11,
author = {Song, J. and Tan, H. and Boyd, S. E. and Shen, H. and Mahmood, K. and Webb, G. I. and Akutsu, T. and Whisstock, J. C. and Pike, R. N.},
journal = {Journal of Bioinformatics and Computational Biology},
title = {Bioinformatic Approaches for Predicting Substrates of Proteases},
year = {2011},
number = {1},
pages = {149-178},
volume = {9},
abstract = {Proteases have central roles in "life and death" processes due to their important ability to catalytically hydrolyse protein substrates, usually altering the function and/or activity of the target in the process. Knowledge of the substrate specificity of a protease should, in theory, dramatically improve the ability to predict target protein substrates. However, experimental identification and characterization of protease substrates is often difficult and time-consuming. Thus solving the "substrate identification" problem is fundamental to both understanding protease biology and the development of therapeutics that target specific protease-regulated pathways. In this context, bioinformatic prediction of protease substrates may provide useful and experimentally testable information about novel potential cleavage sites in candidate substrates. In this article, we provide an overview of recent advances in developing bioinformatic approaches for predicting protease substrate cleavage sites and identifying novel putative substrates. We discuss the advantages and drawbacks of the current methods and detail how more accurate models can be built by deriving multiple sequence and structural features of substrates. We also provide some suggestions about how future studies might further improve the accuracy of protease substrate specificity prediction.},
doi = {10.1142/S0219720011005288},
keywords = {Bioinformatics},
publisher = {World Scientific},
related = {computational-biology},
}

ABSTRACT Proteases have central roles in "life and death" processes due to their important ability to catalytically hydrolyse protein substrates, usually altering the function and/or activity of the target in the process. Knowledge of the substrate specificity of a protease should, in theory, dramatically improve the ability to predict target protein substrates. However, experimental identification and characterization of protease substrates is often difficult and time-consuming. Thus solving the "substrate identification" problem is fundamental to both understanding protease biology and the development of therapeutics that target specific protease-regulated pathways. In this context, bioinformatic prediction of protease substrates may provide useful and experimentally testable information about novel potential cleavage sites in candidate substrates. In this article, we provide an overview of recent advances in developing bioinformatic approaches for predicting protease substrate cleavage sites and identifying novel putative substrates. We discuss the advantages and drawbacks of the current methods and detail how more accurate models can be built by deriving multiple sequence and structural features of substrates. We also provide some suggestions about how future studies might further improve the accuracy of protease substrate specificity prediction.

Discovery of Amino Acid Motifs for Thrombin Cleavage and Validation Using a Model Substrate.
Ng, N. M., Pierce, J. D., Webb, G. I., Ratnikov, B. I., Wijeyewickrema, L. C., Duncan, R. C., Robertson, A. L., Bottomley, S. P., Boyd, S. E., & Pike, R. N.
Biochemistry, 50(48), 10499-10507, 2011.
[Bibtex] [Abstract] → Access on publisher site

@Article{NgEtAl11,
author = {Ng, N. M. and Pierce, J. D. and Webb, G. I. and Ratnikov, B. I. and Wijeyewickrema, L. C. and Duncan, R. C. and Robertson, A. L. and Bottomley, S. P. and Boyd, S. E. and Pike, R. N.},
journal = {Biochemistry},
title = {Discovery of Amino Acid Motifs for Thrombin Cleavage and Validation Using a Model Substrate},
year = {2011},
number = {48},
pages = {10499-10507},
volume = {50},
abstract = {Understanding the active site preferences of an enzyme is critical to the design of effective inhibitors and to gaining insights into its mechanisms of action on substrates. While the subsite specificity of thrombin is understood, it is not clear whether the enzyme prefers individual amino acids at each subsite in isolation or prefers to cleave combinations of amino acids as a motif. To investigate whether preferred peptide motifs for cleavage could be identified for thrombin, we exposed a phage-displayed peptide library to thrombin. The resulting preferentially cleaved substrates were analyzed using the technique of association rule discovery. The results revealed that thrombin selected for amino acid motifs in cleavage sites. The contribution of these hypothetical motifs to substrate cleavage efficiency was further investigated using the B1 IgG-binding domain of streptococcal protein G as a model substrate. Introduction of a P2--P1$'$ LRS thrombin cleavage sequence within a major loop of the protein led to cleavage of the protein by thrombin, with the cleavage efficiency increasing with the length of the loop. Introduction of further P3--P1 and P1--P1$'$--P3$'$ amino acid motifs into the loop region yielded greater cleavage efficiencies, suggesting that the susceptibility of a protein substrate to cleavage by thrombin is influenced by these motifs, perhaps because of cooperative effects between subsites closest to the scissile peptide bond.},
doi = {10.1021/bi201333g},
eprint = {http://pubs.acs.org/doi/pdf/10.1021/bi201333g},
keywords = {Bioinformatics},
related = {computational-biology},
url = {http://pubs.acs.org/doi/abs/10.1021/bi201333g},
}

ABSTRACT Understanding the active site preferences of an enzyme is critical to the design of effective inhibitors and to gaining insights into its mechanisms of action on substrates. While the subsite specificity of thrombin is understood, it is not clear whether the enzyme prefers individual amino acids at each subsite in isolation or prefers to cleave combinations of amino acids as a motif. To investigate whether preferred peptide motifs for cleavage could be identified for thrombin, we exposed a phage-displayed peptide library to thrombin. The resulting preferentially cleaved substrates were analyzed using the technique of association rule discovery. The results revealed that thrombin selected for amino acid motifs in cleavage sites. The contribution of these hypothetical motifs to substrate cleavage efficiency was further investigated using the B1 IgG-binding domain of streptococcal protein G as a model substrate. Introduction of a P2-P1′ LRS thrombin cleavage sequence within a major loop of the protein led to cleavage of the protein by thrombin, with the cleavage efficiency increasing with the length of the loop. Introduction of further P3-P1 and P1-P1′-P3′ amino acid motifs into the loop region yielded greater cleavage efficiencies, suggesting that the susceptibility of a protein substrate to cleavage by thrombin is influenced by these motifs, perhaps because of cooperative effects between subsites closest to the scissile peptide bond.

Cascleave: Towards More Accurate Prediction of Caspase Substrate Cleavage Sites.
Song, J., Tan, H., Shen, H., Mahmood, K., Boyd, S. E., Webb, G. I., Akutsu, T., & Whisstock, J. C.
Bioinformatics, 26(6), 752-760, 2010.
[Bibtex] [Abstract] → Access on publisher site

@Article{SongEtAl10,
author = {Song, J. and Tan, H. and Shen, H. and Mahmood, K. and Boyd, S. E. and Webb, G. I. and Akutsu, T. and Whisstock, J. C.},
journal = {Bioinformatics},
title = {Cascleave: Towards More Accurate Prediction of Caspase Substrate Cleavage Sites},
year = {2010},
number = {6},
pages = {752-760},
volume = {26},
abstract = {Motivation: The caspase family of cysteine proteases play essential roles in key biological processes such as programmed cell death, differentiation, proliferation, necrosis and inflammation. The complete repertoire of caspase substrates remains to be fully characterized. Accordingly, systematic computational screening studies of caspase substrate cleavage sites may provide insight into the substrate specificity of caspases and further facilitate the discovery of putative novel substrates. Results: In this article we develop an approach (termed Cascleave) to predict both classical (i.e. following a P1 Asp) and non-typical caspase cleavage sites. When using local sequence-derived profiles, Cascleave successfully predicted 82.2% of the known substrate cleavage sites, with a Matthews correlation coefficient (MCC) of 0.667. We found that prediction performance could be further improved by incorporating information such as predicted solvent accessibility and whether a cleavage sequence lies in a region that is most likely natively unstructured. Novel bi-profile Bayesian signatures were found to significantly improve the prediction performance and yielded the best performance with an overall accuracy of 87.6% and a MCC of 0.747, which is higher accuracy than published methods that essentially rely on amino acid sequence alone. It is anticipated that Cascleave will be a powerful tool for predicting novel substrate cleavage sites of caspases and shedding new insights on the unknown caspase-substrate interactivity relationship.},
doi = {10.1093/bioinformatics/btq043},
keywords = {Bioinformatics},
publisher = {Oxford Univ Press},
related = {computational-biology},
}

ABSTRACT Motivation: The caspase family of cysteine proteases play essential roles in key biological processes such as programmed cell death, differentiation, proliferation, necrosis and inflammation. The complete repertoire of caspase substrates remains to be fully characterized. Accordingly, systematic computational screening studies of caspase substrate cleavage sites may provide insight into the substrate specificity of caspases and further facilitate the discovery of putative novel substrates. Results: In this article we develop an approach (termed Cascleave) to predict both classical (i.e. following a P1 Asp) and non-typical caspase cleavage sites. When using local sequence-derived profiles, Cascleave successfully predicted 82.2% of the known substrate cleavage sites, with a Matthews correlation coefficient (MCC) of 0.667. We found that prediction performance could be further improved by incorporating information such as predicted solvent accessibility and whether a cleavage sequence lies in a region that is most likely natively unstructured. Novel bi-profile Bayesian signatures were found to significantly improve the prediction performance and yielded the best performance with an overall accuracy of 87.6% and a MCC of 0.747, which is higher accuracy than published methods that essentially rely on amino acid sequence alone. It is anticipated that Cascleave will be a powerful tool for predicting novel substrate cleavage sites of caspases and shedding new insights on the unknown caspase-substrate interactivity relationship.
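
The Cascleave abstract reports both accuracy and the Matthews correlation coefficient (MCC). For readers unfamiliar with the MCC, the Python sketch below computes both measures from the four confusion-matrix counts using the standard definitions. The function names and the choice to return 0.0 when the MCC denominator is zero are assumptions for illustration, and the example counts are hypothetical rather than taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect prediction). Returns 0.0 when any marginal
    total is zero, a common convention assumed here.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a cleavage-site classifier.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
mcc = matthews_corrcoef(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, the MCC uses all four counts symmetrically, which makes it more informative when cleavage sites are much rarer than non-cleavage sites.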

EGM: Encapsulated Gene-by-Gene Matching to Identify Gene Orthologs and Homologous Segments in Genomes.
Mahmood, K., Konagurthu, A. S., Song, J., Buckle, A. M., Webb, G. I., & Whisstock, J. C.
Bioinformatics, 26(17), 2076-2084, 2010.
[Bibtex] [Abstract] → Access on publisher site

@Article{MahmoodEtAl10,
author = {Mahmood, K. and Konagurthu, A. S. and Song, J. and Buckle, A. M. and Webb, G. I. and Whisstock, J. C.},
journal = {Bioinformatics},
title = {EGM: Encapsulated Gene-by-Gene Matching to Identify Gene Orthologs and Homologous Segments in Genomes},
year = {2010},
number = {17},
pages = {2076-2084},
volume = {26},
abstract = {Motivation: Identification of functionally equivalent genes in different species is essential to understand the evolution of biological pathways and processes. At the same time, identification of strings of conserved orthologous genes helps identify complex genomic rearrangements across different organisms. Such an insight is particularly useful, for example, in the transfer of experimental results between different experimental systems such as Drosophila and mammals.
Results: Here we describe the Encapsulated Gene-by-gene Matching (EGM) approach, a method that employs a graph matching strategy to identify gene orthologs and conserved gene segments. Given a pair of genomes, EGM constructs a global gene match for all genes taking into account gene context and family information. The Hungarian method for identifying the maximum weight matching in bipartite graphs is employed, where the resulting matching reveals one-to-one correspondences between nodes (genes) in a manner that maximizes the gene similarity and context.
Conclusion: We tested our approach by performing several comparisons including a detailed Human v Mouse genome mapping. We find that the algorithm is robust and sensitive in detecting orthologs and conserved gene segments. EGM can sensitively detect rearrangements within large and small chromosomal segments. The EGM tool is fully automated and easy to use compared to other more complex methods that also require extensive manual intervention and input.},
audit-trail = {http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/6/752},
doi = {10.1093/bioinformatics/btq339},
keywords = {Bioinformatics},
publisher = {Oxford Univ Press},
related = {computational-biology},
}

ABSTRACT Motivation: Identification of functionally equivalent genes in different species is essential to understand the evolution of biological pathways and processes. At the same time, identification of strings of conserved orthologous genes helps identify complex genomic rearrangements across different organisms. Such an insight is particularly useful, for example, in the transfer of experimental results between different experimental systems such as Drosophila and mammals. Results: Here we describe the Encapsulated Gene-by-gene Matching (EGM) approach, a method that employs a graph matching strategy to identify gene orthologs and conserved gene segments. Given a pair of genomes, EGM constructs a global gene match for all genes taking into account gene context and family information. The Hungarian method for identifying the maximum weight matching in bipartite graphs is employed, where the resulting matching reveals one-to-one correspondences between nodes (genes) in a manner that maximizes the gene similarity and context. Conclusion: We tested our approach by performing several comparisons including a detailed Human v Mouse genome mapping. We find that the algorithm is robust and sensitive in detecting orthologs and conserved gene segments. EGM can sensitively detect rearrangements within large and small chromosomal segments. The EGM tool is fully automated and easy to use compared to other more complex methods that also require extensive manual intervention and input.
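
The abstract above mentions the Hungarian method for maximum weight matching in bipartite graphs. As a rough illustration of that general technique, here is a minimal pure-Python sketch on a small square weight matrix. The gene labels and similarity scores are hypothetical, and the sketch does not reproduce EGM's actual scoring, which also takes gene context and family information into account.

```python
def max_weight_matching(weights):
    """Hungarian method on a square matrix; returns (pairs, total weight)."""
    n = len(weights)
    inf = float("inf")
    # Negate weights so a minimum-cost assignment is a maximum-weight one.
    cost = [[-w for w in row] for row in weights]
    u = [0] * (n + 1)      # row potentials (1-indexed)
    v = [0] * (n + 1)      # column potentials (1-indexed)
    match = [0] * (n + 1)  # match[j] = row assigned to column j (0 = free)
    way = [0] * (n + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so at least one more column becomes reachable.
            for j in range(n + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        # Flip assignments along the augmenting path back to the root.
        while j0:
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    pairs = sorted((match[j] - 1, j - 1) for j in range(1, n + 1))
    total = sum(weights[r][c] for r, c in pairs)
    return pairs, total


# Hypothetical similarity scores between three genes in each of two genomes.
genome_a = ["a1", "a2", "a3"]
genome_b = ["b1", "b2", "b3"]
similarity = [
    [90, 80, 10],
    [85, 30, 20],
    [20, 60, 70],
]
pairs, total = max_weight_matching(similarity)
for r, c in pairs:
    print(genome_a[r], "<->", genome_b[c], similarity[r][c])
print("total similarity:", total)
```

In this toy matrix a greedy choice of the single highest score (a1 with b1) would force a weaker pairing elsewhere, whereas the global matching pairs a1 with b2 and a2 with b1 for a higher total, which illustrates why a one-to-one global matching can differ from picking best hits independently.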

Prodepth: Predict Residue Depth by Support Vector Regression Approach from Protein Sequences Only.
Song, J., Tan, H., Mahmood, K., Law, R. H. P., Buckle, A. M., Webb, G. I., Akutsu, T., & Whisstock, J. C.
PLoS ONE, 4(9), Art. no. e7072, 2009.
[Bibtex] [Abstract] → Access on publisher site

@Article{SongEtAl09,
author = {Song, J. and Tan, H. and Mahmood, K. and Law, R. H. P. and Buckle, A. M. and Webb, G. I. and Akutsu, T. and Whisstock, J. C.},
journal = {PLoS ONE},
title = {Prodepth: Predict Residue Depth by Support Vector Regression Approach from Protein Sequences Only},
year = {2009},
number = {9},
volume = {4},
abstract = {Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.},
articlenumber = {e7072},
doi = {10.1371/journal.pone.0007072},
keywords = {Bioinformatics},
publisher = {PLOS},
related = {computational-biology},
}

ABSTRACT Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.
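
The abstract above reports performance as a correlation coefficient (CC) and a root mean square error (RMSE) between observed and predicted residue depths. The following minimal Python sketch shows how these two measures can be computed, assuming CC denotes the Pearson correlation. The depth values are hypothetical and serve only to illustrate the calculation; they are not data from the Prodepth study.

```python
import math


def pearson_cc(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical per-residue depths (NOT from the paper).
observed_rd = [3.2, 4.1, 6.8, 5.0, 3.9, 7.5]
predicted_rd = [3.8, 4.0, 5.9, 5.6, 4.4, 6.1]
print(f"CC   = {pearson_cc(observed_rd, predicted_rd):.2f}")
print(f"RMSE = {rmse(observed_rd, predicted_rd):.2f}")
```

The two measures are complementary: CC captures whether predictions rise and fall with the observed values, while RMSE captures how far off they are in absolute terms.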

RCPdb: An evolutionary classification and codon usage database for repeat-containing proteins.
Faux, N. G., Huttley, G. A., Mahmood, K., Webb, G. I., Garcia de la Banda, M., & Whisstock, J. C.
Genome Research, 17(7), 1118-1127, 2007.
[Bibtex] [Abstract] → Access on publisher site

@Article{FauxHuttleyMahmoodWebbGarciaWhisstock07,
author = {Faux, N. G. and Huttley, G. A. and Mahmood, K. and Webb, G. I. and Garcia de la Banda, M. and Whisstock, J. C.},
journal = {Genome Research},
title = {RCPdb: An evolutionary classification and codon usage database for repeat-containing proteins},
year = {2007},
number = {7},
pages = {1118-1127},
volume = {17},
abstract = {Over 3% of human proteins contain single amino acid repeats (repeat-containing proteins, RCPs). Many repeats (homopeptides) localize to important proteins involved in transcription, and the expansion of certain repeats, in particular poly-Q and poly-A tracts, can also lead to the development of neurological diseases. Previous studies have suggested that the homopeptide makeup is a result of the presence of G+C-rich tracts in the encoding genes and that expansion occurs via replication slippage. Here, we have performed a large-scale genomic analysis of the variation of the genes encoding RCPs in 13 species and present these data in an online database (http://repeats.med.monash.edu.au/genetic_analysis/). This resource allows rapid comparison and analysis of RCPs, homopeptides, and their underlying genetic tracts across the eukaryotic species considered. We report three major findings. First, there is a bias for a small subset of codons being reiterated within homopeptides, and there is no G+C or A+T bias relative to the organism's transcriptome. Second, single base pair transversions from the homocodon are unusually common and may represent a mechanism of reducing the rate of homopeptide mutations. Third, homopeptides that are conserved across different species lie within regions that are under stronger purifying selection in contrast to nonconserved homopeptides.},
address = {Woodbury, New York},
doi = {10.1101/gr.6255407},
keywords = {Bioinformatics},
issn = {1088-9051},
publisher = {Cold Spring Harbor Laboratory Press},
related = {computational-biology},
}

ABSTRACT Over 3% of human proteins contain single amino acid repeats (repeat-containing proteins, RCPs). Many repeats (homopeptides) localize to important proteins involved in transcription, and the expansion of certain repeats, in particular poly-Q and poly-A tracts, can also lead to the development of neurological diseases. Previous studies have suggested that the homopeptide makeup is a result of the presence of G+C-rich tracts in the encoding genes and that expansion occurs via replication slippage. Here, we have performed a large-scale genomic analysis of the variation of the genes encoding RCPs in 13 species and present these data in an online database (http://repeats.med.monash.edu.au/genetic_analysis/). This resource allows rapid comparison and analysis of RCPs, homopeptides, and their underlying genetic tracts across the eukaryotic species considered. We report three major findings. First, there is a bias for a small subset of codons being reiterated within homopeptides, and there is no G+C or A+T bias relative to the organism's transcriptome. Second, single base pair transversions from the homocodon are unusually common and may represent a mechanism of reducing the rate of homopeptide mutations. Third, homopeptides that are conserved across different species lie within regions that are under stronger purifying selection in contrast to nonconserved homopeptides.