Computational Biology

Along with colleagues in the Monash Faculty of Medicine, Nursing and Health Sciences, I am investigating applications of data science in biology.  The majority of this work uses machine learning to predict protein structural and functional features.

Publications

A Deep Learning-Based Method for Identification of Bacteriophage-Host Interaction.
Li, M., Wang, Y., Li, F., Zhao, Y., Liu, M., Zhang, S., Bin, Y., Smith, A. I., Webb, G., Li, J., Song, J., & Xia, J.
IEEE/ACM Trans Comput Biol Bioinform, in press.

@Article{RN3447,
author = {Li, M. and Wang, Y. and Li, F. and Zhao, Y. and Liu, M. and Zhang, S. and Bin, Y. and Smith, A. I. and Webb, G. and Li, J. and Song, J. and Xia, J.},
journal = {IEEE/ACM Trans Comput Biol Bioinform},
title = {A Deep Learning-Based Method for Identification of Bacteriophage-Host Interaction},
year = {in press},
issn = {1545-5963},
abstract = {Multi-drug resistance (MDR) has become one of the greatest threats to human health worldwide, and novel treatment methods of infections caused by MDR bacteria are urgently needed. Phage therapy is a promising alternative to solve this problem, to which the key is correctly matching target pathogenic bacteria with the corresponding therapeutic phage. Deep learning is powerful for mining complex patterns to generate accurate predictions. In this study, we develop PredPHI (Predicting Phage-Host Interactions), a deep learning-based tool capable of predicting the host of phages from sequence data. We collect >3000 phage-host pairs along with their protein sequences from PhagesDB and GenBank databases and extract a set of features. Then we select high-quality negative samples based on the K-Means clustering method and construct a balanced training set. Finally, we employ a deep convolutional neural network to build the predictive model. The results indicate that PredPHI can achieve a predictive performance of 81% in terms of the area under the receiver operating characteristic curve on the test set, and the clustering-based method is significantly more robust than that based on randomly selecting negative samples. These results highlight that PredPHI is a useful and accurate tool for identifying phage-host interactions from sequence data.},
doi = {10.1109/tcbb.2020.3017386},
keywords = {Bioinformatics},
related = {computational-biology},
}

PROSPECT: A web server for predicting protein histidine phosphorylation sites.
Chen, Z., Zhao, P., Li, F., Leier, A., Marquez-Lago, T. T., Webb, G. I., Baggag, A., Bensmail, H., & Song, J.
Journal of Bioinformatics and Computational Biology, Art. no. 2050018, 2020.

@Article{Chen2020,
author = {Zhen Chen and Pei Zhao and Fuyi Li and Andr{\'{e}} Leier and Tatiana T. Marquez-Lago and Geoffrey I. Webb and Abdelkader Baggag and Halima Bensmail and Jiangning Song},
journal = {Journal of Bioinformatics and Computational Biology},
title = {{PROSPECT}: A web server for predicting protein histidine phosphorylation sites},
year = {2020},
month = {jun},
abstract = {Background: Phosphorylation of histidine residues plays crucial roles in signaling pathways and cell metabolism in prokaryotes such as bacteria. While evidence has emerged that protein histidine phosphorylation also occurs in more complex organisms, its role in mammalian cells has remained largely uncharted. Thus, it is highly desirable to develop computational tools that are able to identify histidine phosphorylation sites. Result: Here, we introduce PROSPECT that enables fast and accurate prediction of proteome-wide histidine phosphorylation substrates and sites. Our tool is based on a hybrid method that integrates the outputs of two convolutional neural network (CNN)-based classifiers and a random forest-based classifier. Three features, including the one-of-K coding, enhanced grouped amino acids content (EGAAC) and composition of k-spaced amino acid group pairs (CKSAAGP) encoding, were taken as the input to three classifiers, respectively. Our results show that it is able to accurately predict histidine phosphorylation sites from sequence information. Our PROSPECT web server is user-friendly and publicly available at http://PROSPECT.erc.monash.edu/. Conclusions: PROSPECT is superior than other pHis predictors in both the running speed and prediction accuracy and we anticipate that the PROSPECT webserver will become a popular tool for identifying the pHis sites in bacteria.},
articlenumber = {2050018},
doi = {10.1142/s0219720020500183},
keywords = {Bioinformatics},
publisher = {World Scientific},
related = {computational-biology},
}

DeepCleave: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites.
Li, F., Chen, J., Leier, A., Marquez-Lago, T., Liu, Q., Wang, Y., Revote, J., Smith, A. I., Akutsu, T., Webb, G. I., Kurgan, L., & Song, J.
Bioinformatics, 36(4), 1057-1065, 2020.

@Article{Li2020a,
Title = {DeepCleave: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites},
Author = {Li, Fuyi and Chen, Jinxiang and Leier, Andre and Marquez-Lago, Tatiana and Liu, Quanzhong and Wang, Yanze and Revote, Jerico and Smith, A Ian and Akutsu, Tatsuya and Webb, Geoffrey I and Kurgan, Lukasz and Song, Jiangning},
Journal = {Bioinformatics},
Year = {2020},
Number = {4},
Pages = {1057-1065},
Volume = {36},
Abstract = {{Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in the "life and death" process of proteins, many of the corresponding substrates and their cleavage sites were not found yet. Availability of accurate predictors of the substrates and cleavage sites would facilitate understanding of proteases’ functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events. We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. High predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and the application of transfer learning, multiple kernels and attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites. The DeepCleave webserver and source code are freely available at http://deepcleave.erc.monash.edu/. Supplementary data are available at Bioinformatics online.}},
Doi = {10.1093/bioinformatics/btz721},
ISSN = {1367-4803},
Keywords = {Bioinformatics},
Related = {computational-biology}
}

Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences.
Chen, Z., Zhao, P., Li, F., Wang, Y., Smith, A. I., Webb, G. I., Akutsu, T., Baggag, A., Bensmail, H., & Song, J.
Briefings in Bioinformatics, 21(5), 1676-1696, 2020.

@Article{10.1093/bib/bbz112,
author = {Chen, Zhen and Zhao, Pei and Li, Fuyi and Wang, Yanan and Smith, A Ian and Webb, Geoffrey I and Akutsu, Tatsuya and Baggag, Abdelkader and Bensmail, Halima and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences},
year = {2020},
issn = {1477-4054},
number = {5},
volume = {21},
abstract = {RNA post-transcriptional modifications play a crucial role in a myriad of biological processes and cellular functions. To date, more than 160 RNA modifications have been discovered; therefore, accurate identification of RNA-modification sites is fundamental for a better understanding of RNA-mediated biological functions and mechanisms. However, due to limitations in experimental methods, systematic identification of different types of RNA-modification sites remains a major challenge. Recently, more than 20 computational methods have been developed to identify RNA-modification sites in tandem with high-throughput experimental methods, with most of these capable of predicting only single types of RNA-modification sites. These methods show high diversity in their dataset size, data quality, core algorithms, features extracted and feature selection techniques and evaluation strategies. Therefore, there is an urgent need to revisit these methods and summarize their methodologies, in order to improve and further develop computational techniques to identify and characterize RNA-modification sites from the large amounts of sequence data. With this goal in mind, first, we provide a comprehensive survey on a large collection of 27 state-of-the-art approaches for predicting N1-methyladenosine and N6-methyladenosine sites. We cover a variety of important aspects that are crucial for the development of successful predictors, including the dataset quality, operating algorithms, sequence and genomic features, feature selection, model performance evaluation and software utility. In addition, we also provide our thoughts on potential strategies to improve the model performance. Second, we propose a computational approach called DeepPromise based on deep learning techniques for simultaneous prediction of N1-methyladenosine and N6-methyladenosine. 
To extract the sequence context surrounding the modification sites, three feature encodings, including enhanced nucleic acid composition, one-hot encoding, and RNA embedding, were used as the input to seven consecutive layers of convolutional neural networks (CNNs), respectively. Moreover, DeepPromise further combined the prediction score of the CNN-based models and achieved around 43\% higher area under receiver-operating curve (AUROC) for m1A site prediction and 6\% higher AUROC for m6A site prediction, respectively, when compared with several existing state-of-the-art approaches on the independent test. In-depth analyses of characteristic sequence motifs identified from the convolution-layer filters indicated that nucleotide presentation at proximal positions surrounding the modification sites contributed most to the classification, whereas those at distal positions also affected classification but to different extents. To maximize user convenience, a web server was developed as an implementation of DeepPromise and made publicly available at http://DeepPromise.erc.monash.edu/, with the server accepting both RNA sequences and genomic sequences to allow prediction of two types of putative RNA-modification sites.},
doi = {10.1093/bib/bbz112},
eprint = {http://oup.prod.sis.lan/bib/advance-article-pdf/doi/10.1093/bib/bbz112/30663813/bbz112.pdf},
keywords = {Bioinformatics},
pages = {1676-1696},
related = {computational-biology},
}

PRISMOID: a comprehensive 3D structure database for post-translational modifications and mutations with functional impact.
Li, F., Fan, C., Marquez-Lago, T. T., Leier, A., Revote, J., Jia, C., Zhu, Y., Smith, A. I., Webb, G. I., Liu, Q., Wei, L., Li, J., & Song, J.
Briefings in Bioinformatics, 21(3), 1069-1079, 2020.

@Article{10.1093/bib/bbz050,
author = {Li, Fuyi and Fan, Cunshuo and Marquez-Lago, Tatiana T and Leier, Andre and Revote, Jerico and Jia, Cangzhi and Zhu, Yan and Smith, A Ian and Webb, Geoffrey I and Liu, Quanzhong and Wei, Leyi and Li, Jian and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {PRISMOID: a comprehensive 3D structure database for post-translational modifications and mutations with functional impact},
year = {2020},
issn = {1477-4054},
number = {3},
pages = {1069-1079},
volume = {21},
abstract = {{Post-translational modifications (PTMs) play very important roles in various cell signaling pathways and biological process. Due to PTMs' extremely important roles, many major PTMs have been studied, while the functional and mechanical characterization of major PTMs is well documented in several databases. However, most currently available databases mainly focus on protein sequences, while the real 3D structures of PTMs have been largely ignored. Therefore, studies of PTMs 3D structural signatures have been severely limited by the deficiency of the data. Here, we develop PRISMOID, a novel publicly available and free 3D structure database for a wide range of PTMs. PRISMOID represents an up-to-date and interactive online knowledge base with specific focus on 3D structural contexts of PTMs sites and mutations that occur on PTMs and in the close proximity of PTM sites with functional impact. The first version of PRISMOID encompasses 17 145 non-redundant modification sites on 3919 related protein 3D structure entries pertaining to 37 different types of PTMs. Our entry web page is organized in a comprehensive manner, including detailed PTM annotation on the 3D structure and biological information in terms of mutations affecting PTMs, secondary structure features and per-residue solvent accessibility features of PTM sites, domain context, predicted natively disordered regions and sequence alignments. In addition, high-definition JavaScript packages are employed to enhance information visualization in PRISMOID. PRISMOID equips a variety of interactive and customizable search options and data browsing functions; these capabilities allow users to access data via keyword, ID and advanced options combination search in an efficient and user-friendly way. A download page is also provided to enable users to download the SQL file, computational structural features and PTM sites’ data. 
We anticipate PRISMOID will swiftly become an invaluable online resource, assisting both biologists and bioinformaticians to conduct experiments and develop applications supporting discovery efforts in the sequence–structural–functional relationship of PTMs and providing important insight into mutations and PTM sites interaction mechanisms. The PRISMOID database is freely accessible at http://prismoid.erc.monash.edu/. The database and web interface are implemented in MySQL, JSP, JavaScript and HTML with all major browsers supported.}},
doi = {10.1093/bib/bbz050},
keywords = {Bioinformatics},
related = {computational-biology},
}

Procleave: Predicting Protease-specific Substrate Cleavage Sites by Combining Sequence and Structural Information.
Li, F., Leier, A., Liu, Q., Wang, Y., Xiang, D., Akutsu, T., Webb, G. I., Smith, A. I., Marquez-Lago, T., Li, J., & Song, J.
Genomics, Proteomics & Bioinformatics, 18(1), 52-64, 2020.

@Article{LI2020,
author = {Fuyi Li and Andre Leier and Quanzhong Liu and Yanan Wang and Dongxu Xiang and Tatsuya Akutsu and Geoffrey I. Webb and A. Ian Smith and Tatiana Marquez-Lago and Jian Li and Jiangning Song},
journal = {Genomics, Proteomics \& Bioinformatics},
title = {Procleave: Predicting Protease-specific Substrate Cleavage Sites by Combining Sequence and Structural Information},
year = {2020},
issn = {1672-0229},
number = {1},
pages = {52-64},
volume = {18},
abstract = {Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites of different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.},
doi = {10.1016/j.gpb.2019.08.002},
keywords = {Bioinformatics},
related = {computational-biology},
}

SIMLIN: a bioinformatics tool for prediction of S-sulphenylation in the human proteome based on multi-stage ensemble-learning models.
Wang, X., Li, C., Li, F., Sharma, V. S., Song, J., & Webb, G. I.
BMC Bioinformatics, 20(1), Art. no. 602, 2019.

@Article{Wang2019,
author = {Wang, Xiaochuan and Li, Chen and Li, Fuyi and Sharma, Varun S. and Song, Jiangning and Webb, Geoffrey I.},
journal = {BMC Bioinformatics},
title = {SIMLIN: a bioinformatics tool for prediction of S-sulphenylation in the human proteome based on multi-stage ensemble-learning models},
year = {2019},
month = {Nov},
number = {1},
volume = {20},
abstract = {S-sulphenylation is a ubiquitous protein post-translational modification (PTM) where an S-hydroxyl (−SOH) bond is formed via the reversible oxidation on the Sulfhydryl group of cysteine (C). Recent experimental studies have revealed that S-sulphenylation plays critical roles in many biological functions, such as protein regulation and cell signaling. State-of-the-art bioinformatic advances have facilitated high-throughput in silico screening of protein S-sulphenylation sites, thereby significantly reducing the time and labour costs traditionally required for the experimental investigation of S-sulphenylation.},
articlenumber = {602},
doi = {10.1186/s12859-019-3178-6},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}

Positive-unlabelled learning of glycosylation sites in the human proteome.
Li, F., Zhang, Y., Purcell, A. W., Webb, G. I., Chou, K., Lithgow, T., Li, C., & Song, J.
BMC Bioinformatics, 20(1), 112, 2019.

@Article{Li2019,
Title = {Positive-unlabelled learning of glycosylation sites in the human proteome},
Author = {Li, Fuyi and Zhang, Yang and Purcell, Anthony W. and Webb, Geoffrey I. and Chou, Kuo-Chen and Lithgow, Trevor and Li, Chen and Song, Jiangning},
Journal = {BMC Bioinformatics},
Year = {2019},
Month = {Mar},
Number = {1},
Pages = {112},
Volume = {20},
Abstract = {As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. The abundance and ubiquity of protein glycosylation across three domains of life involving Eukarya, Bacteria and Archaea demonstrate its roles in regulating a variety of signalling and metabolic pathways. Mutations on and in the proximity of glycosylation sites are highly associated with human diseases. Accordingly, accurate prediction of glycosylation can complement laboratory-based methods and greatly benefit experimental efforts for characterization and understanding of functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, demonstrating a promising predictive performance. To train a conventional supervised-learning model, both reliable positive and negative samples are required. However, in practice, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled due to the limitation of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid in model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites).},
Day = {06},
Doi = {10.1186/s12859-019-2700-1},
ISSN = {1471-2105},
Keywords = {Bioinformatics},
Related = {computational-biology}
}
ABSTRACT As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. The abundance and ubiquity of protein glycosylation across three domains of life involving Eukarya, Bacteria and Archaea demonstrate its roles in regulating a variety of signalling and metabolic pathways. Mutations on and in the proximity of glycosylation sites are highly associated with human diseases. Accordingly, accurate prediction of glycosylation can complement laboratory-based methods and greatly benefit experimental efforts for characterization and understanding of functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, demonstrating a promising predictive performance. To train a conventional supervised-learning model, both reliable positive and negative samples are required. However, in practice, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled due to the limitation of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid in model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites).

Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods.
Li, F., Wang, Y., Li, C., Marquez-Lago, T. T., Leier, A., Rawlings, N. D., Haffari, G., Revote, J., Akutsu, T., Chou, K., Purcell, A. W., Pike, R. N., Webb, G. I., Smith, I. A., Lithgow, T., Daly, R. J., Whisstock, J. C., & Song, J.
Briefings in Bioinformatics, 20(6), 2150-2166, 2019.
[DOI] [Bibtex] [Abstract]

@Article{Li18b,
author = {Li, Fuyi and Wang, Yanan and Li, Chen and Marquez-Lago, Tatiana T and Leier, Andre and Rawlings, Neil D and Haffari, Gholamreza and Revote, Jerico and Akutsu, Tatsuya and Chou, Kuo-Chen and Purcell, Anthony W and Pike, Robert N and Webb, Geoffrey I and Smith, Ian A and Lithgow, Trevor and Daly, Roger J and Whisstock, James C and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods},
year = {2019},
number = {6},
pages = {2150-2166},
volume = {20},
abstract = {The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. 
We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.},
doi = {10.1093/bib/bby077},
keywords = {Bioinformatics},
related = {computational-biology},
}
ABSTRACT The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. 
We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.

Large-scale comparative assessment of computational predictors for lysine post-translational modification sites.
Chen, Z., Li, L., Xu, D., Chou, K., Liu, X., Smith, A. I., Li, F., Song, J., Li, C., Leier, A., Marquez-Lago, T., Akutsu, T., & Webb, G. I.
Briefings in Bioinformatics, 20(6), 2267-2290, 2019.
[DOI] [Bibtex] [Abstract]

@Article{ChenEtAl118b,
author = {Chen, Zhen and Li, Lei and Xu, Dakang and Chou, Kuo-Chen and Liu, Xuhan and Smith, Alexander Ian and Li, Fuyi and Song, Jiangning and Li, Chen and Leier, Andre and Marquez-Lago, Tatiana and Akutsu, Tatsuya and Webb, Geoffrey I},
journal = {Briefings in Bioinformatics},
title = {Large-scale comparative assessment of computational predictors for lysine post-translational modification sites},
year = {2019},
number = {6},
pages = {2267-2290},
volume = {20},
abstract = {Lysine post-translational modifications (PTMs) play a crucial role in regulating diverse functions and biological processes of proteins. However, because of the large volumes of sequencing data generated from genome-sequencing projects, systematic identification of different types of lysine PTM substrates and PTM sites in the entire proteome remains a major challenge. In recent years, a number of computational methods for lysine PTM identification have been developed. These methods show high diversity in their core algorithms, features extracted and feature selection techniques and evaluation strategies. There is therefore an urgent need to revisit these methods and summarize their methodologies, to improve and further develop computational techniques to identify and characterize lysine PTMs from the large amounts of sequence data. With this goal in mind, we first provide a comprehensive survey on a large collection of 49 state-of-the-art approaches for lysine PTM prediction. We cover a variety of important aspects that are crucial for the development of successful predictors, including operating algorithms, sequence and structural features, feature selection, model performance evaluation and software utility. We further provide our thoughts on potential strategies to improve the model performance. Second, in order to examine the feasibility of using deep learning for lysine PTM prediction, we propose a novel computational framework, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs), using deep, bidirectional, long short-term memory recurrent neural networks for accurate and systematic mapping of eight major types of lysine PTMs in the human and mouse proteomes. Extensive benchmarking tests show that MUscADEL outperforms current methods for lysine PTM characterization, demonstrating the potential and power of deep learning techniques in protein PTM prediction. 
The web server of MUscADEL, together with all the data sets assembled in this study, is freely available at http://muscadel.erc.monash.edu/. We anticipate this comprehensive review and the application of deep learning will provide practical guide and useful insights into PTM prediction and inspire future bioinformatics studies in the related fields.},
doi = {10.1093/bib/bby089},
keywords = {Bioinformatics},
related = {computational-biology},
}
ABSTRACT Lysine post-translational modifications (PTMs) play a crucial role in regulating diverse functions and biological processes of proteins. However, because of the large volumes of sequencing data generated from genome-sequencing projects, systematic identification of different types of lysine PTM substrates and PTM sites in the entire proteome remains a major challenge. In recent years, a number of computational methods for lysine PTM identification have been developed. These methods show high diversity in their core algorithms, features extracted and feature selection techniques and evaluation strategies. There is therefore an urgent need to revisit these methods and summarize their methodologies, to improve and further develop computational techniques to identify and characterize lysine PTMs from the large amounts of sequence data. With this goal in mind, we first provide a comprehensive survey on a large collection of 49 state-of-the-art approaches for lysine PTM prediction. We cover a variety of important aspects that are crucial for the development of successful predictors, including operating algorithms, sequence and structural features, feature selection, model performance evaluation and software utility. We further provide our thoughts on potential strategies to improve the model performance. Second, in order to examine the feasibility of using deep learning for lysine PTM prediction, we propose a novel computational framework, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs), using deep, bidirectional, long short-term memory recurrent neural networks for accurate and systematic mapping of eight major types of lysine PTMs in the human and mouse proteomes. Extensive benchmarking tests show that MUscADEL outperforms current methods for lysine PTM characterization, demonstrating the potential and power of deep learning techniques in protein PTM prediction. 
The web server of MUscADEL, together with all the data sets assembled in this study, is freely available at http://muscadel.erc.monash.edu/. We anticipate this comprehensive review and the application of deep learning will provide practical guide and useful insights into PTM prediction and inspire future bioinformatics studies in the related fields.

Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework.
Zhang, Y., Xie, R., Wang, J., Leier, A., Marquez-Lago, T. T., Akutsu, T., Webb, G. I., Chou, K., & Song, J.
Briefings in Bioinformatics, 20(6), 2185-2199, 2019.
[DOI] [Bibtex] [Abstract]

@Article{ZhangEtAl18,
author = {Zhang, Yanju and Xie, Ruopeng and Wang, Jiawei and Leier, Andre and Marquez-Lago, Tatiana T. and Akutsu, Tatsuya and Webb, Geoffrey I. and Chou, Kuo-Chen and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework},
year = {2019},
number = {6},
pages = {2185-2199},
volume = {20},
abstract = {As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.},
doi = {10.1093/bib/bby079},
keywords = {Bioinformatics},
related = {computational-biology},
}
ABSTRACT As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.

iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites.
Song, J., Wang, Y., Li, F., Akutsu, T., Rawlings, N. D., Webb, G. I., & Chou, K.
Briefings in Bioinformatics, 20(2), 638-658, 2019.
[DOI] [Bibtex] [Abstract]

@Article{doi:10.1093/bib/bby028,
author = {Song, Jiangning and Wang, Yanan and Li, Fuyi and Akutsu, Tatsuya and Rawlings, Neil D and Webb, Geoffrey I and Chou, Kuo-Chen},
journal = {Briefings in Bioinformatics},
title = {iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites},
year = {2019},
number = {2},
pages = {638-658},
volume = {20},
abstract = {Regulation of proteolysis plays a critical role in a myriad of important cellular processes. The key to better understanding the mechanisms that control this process is to identify the specific substrates that each protease targets. To address this, we have developed iProt-Sub, a powerful bioinformatics tool for the accurate prediction of protease-specific substrates and their cleavage sites. Importantly, iProt-Sub represents a significantly advanced version of its successful predecessor, PROSPER. It provides optimized cleavage site prediction models with better prediction performance and coverage for more species-specific proteases (4 major protease families and 38 different proteases). iProt-Sub integrates heterogeneous sequence and structural features and uses a two-step feature selection procedure to further remove redundant and irrelevant features in an effort to improve the cleavage site prediction accuracy. Features used by iProt-Sub are encoded by 11 different sequence encoding schemes, including local amino acid sequence profile, secondary structure, solvent accessibility and native disorder, which will allow a more accurate representation of the protease specificity of approximately 38 proteases and training of the prediction models. Benchmarking experiments using cross-validation and independent tests showed that iProt-Sub is able to achieve a better performance than several existing generic tools. We anticipate that iProt-Sub will be a powerful tool for proteome-wide prediction of protease-specific substrates and their cleavage sites, and will facilitate hypothesis-driven functional interrogation of protease-specific substrate cleavage and proteolytic events.},
comment = {Clarivate Analytics Web of Science Hot Paper and Highly Cited Paper, 2019},
doi = {10.1093/bib/bby028},
keywords = {Bioinformatics},
related = {computational-biology},
}
ABSTRACT Regulation of proteolysis plays a critical role in a myriad of important cellular processes. The key to better understanding the mechanisms that control this process is to identify the specific substrates that each protease targets. To address this, we have developed iProt-Sub, a powerful bioinformatics tool for the accurate prediction of protease-specific substrates and their cleavage sites. Importantly, iProt-Sub represents a significantly advanced version of its successful predecessor, PROSPER. It provides optimized cleavage site prediction models with better prediction performance and coverage for more species-specific proteases (4 major protease families and 38 different proteases). iProt-Sub integrates heterogeneous sequence and structural features and uses a two-step feature selection procedure to further remove redundant and irrelevant features in an effort to improve the cleavage site prediction accuracy. Features used by iProt-Sub are encoded by 11 different sequence encoding schemes, including local amino acid sequence profile, secondary structure, solvent accessibility and native disorder, which will allow a more accurate representation of the protease specificity of approximately 38 proteases and training of the prediction models. Benchmarking experiments using cross-validation and independent tests showed that iProt-Sub is able to achieve a better performance than several existing generic tools. We anticipate that iProt-Sub will be a powerful tool for proteome-wide prediction of protease-specific substrates and their cleavage sites, and will facilitate hypothesis-driven functional interrogation of protease-specific substrate cleavage and proteolytic events.

Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches.
Wang, J., Yang, B., An, Y., Marquez-Lago, T., Leier, A., Wilksch, J., Hong, Q., Zhang, Y., Hayashida, M., Akutsu, T., Webb, G. I., Strugnell, R. A., Song, J., & Lithgow, T.
Briefings in Bioinformatics, 20(3), 931-951, 2019.
[DOI] [Bibtex] [Abstract]

@Article{doi:10.1093/bib/bbx164,
author = {Wang, Jiawei and Yang, Bingjiao and An, Yi and Marquez-Lago, Tatiana and Leier, Andre and Wilksch, Jonathan and Hong, Qingyang and Zhang, Yang and Hayashida, Morihiro and Akutsu, Tatsuya and Webb, Geoffrey I and Strugnell, Richard A and Song, Jiangning and Lithgow, Trevor},
journal = {Briefings in Bioinformatics},
title = {Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches},
year = {2019},
number = {3},
pages = {931-951},
volume = {20},
abstract = {In the course of infecting their hosts, pathogenic bacteria secrete numerous effectors, namely, bacterial proteins that pervert host cell biology. Many Gram-negative bacteria, including context-dependent human pathogens, use a type IV secretion system (T4SS) to translocate effectors directly into the cytosol of host cells. Various type IV secreted effectors (T4SEs) have been experimentally validated to play crucial roles in virulence by manipulating host cell gene expression and other processes. Consequently, the identification of novel effector proteins is an important step in increasing our understanding of host-pathogen interactions and bacterial pathogenesis. Here, we train and compare six machine learning models, namely, Naïve Bayes (NB), K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), support vector machines (SVMs) and multilayer perceptron (MLP), for the identification of T4SEs using 10 types of selected features and 5-fold cross-validation. Our study shows that: (1) including different but complementary features generally enhance the predictive performance of T4SEs; (2) ensemble models, obtained by integrating individual single-feature models, exhibit a significantly improved predictive performance and (3) the majority voting strategy led to a more stable and accurate classification performance when applied to predicting an ensemble learning model with distinct single features. We further developed a new method to effectively predict T4SEs, Bastion4 (Bacterial secretion effector predictor for T4SS), and we show our ensemble classifier clearly outperforms two recent prediction tools. In summary, we developed a state-of-the-art T4SE predictor by conducting a comprehensive performance evaluation of different machine learning algorithms along with a detailed analysis of single- and multi-feature selections.},
doi = {10.1093/bib/bbx164},
keywords = {Bioinformatics},
related = {computational-biology},
}
ABSTRACT In the course of infecting their hosts, pathogenic bacteria secrete numerous effectors, namely, bacterial proteins that pervert host cell biology. Many Gram-negative bacteria, including context-dependent human pathogens, use a type IV secretion system (T4SS) to translocate effectors directly into the cytosol of host cells. Various type IV secreted effectors (T4SEs) have been experimentally validated to play crucial roles in virulence by manipulating host cell gene expression and other processes. Consequently, the identification of novel effector proteins is an important step in increasing our understanding of host-pathogen interactions and bacterial pathogenesis. Here, we train and compare six machine learning models, namely, Naïve Bayes (NB), K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), support vector machines (SVMs) and multilayer perceptron (MLP), for the identification of T4SEs using 10 types of selected features and 5-fold cross-validation. Our study shows that: (1) including different but complementary features generally enhance the predictive performance of T4SEs; (2) ensemble models, obtained by integrating individual single-feature models, exhibit a significantly improved predictive performance and (3) the majority voting strategy led to a more stable and accurate classification performance when applied to predicting an ensemble learning model with distinct single features. We further developed a new method to effectively predict T4SEs, Bastion4 (Bacterial secretion effector predictor for T4SS), and we show our ensemble classifier clearly outperforms two recent prediction tools. In summary, we developed a state-of-the-art T4SE predictor by conducting a comprehensive performance evaluation of different machine learning algorithms along with a detailed analysis of single- and multi-feature selections.

Structural Capacitance in Protein Evolution and Human Diseases.
Li, C., Clark, L. V. T., Zhang, R., Porebski, B. T., McCoey, J. M., Borg, N. A., Webb, G. I., Kass, I., Buckle, M., Song, J., Woolfson, A., & Buckle, A. M.
Journal of Molecular Biology, 430(18), 3200-3217, 2018.
[DOI] [Bibtex]

@Article{Li2018,
Title = {Structural Capacitance in Protein Evolution and Human Diseases},
Author = {Li, Chen and Clark, Liah V.T. and Zhang, Rory and Porebski, Benjamin T. and McCoey, Julia M. and Borg, Natalie A. and Webb, Geoffrey I. and Kass, Itamar and Buckle, Malcolm and Song, Jiangning and Woolfson, Adrian and Buckle, Ashley M.},
Journal = {Journal of Molecular Biology},
Year = {2018},
Number = {18},
Pages = {3200-3217},
Volume = {430},
Doi = {10.1016/j.jmb.2018.06.051},
ISSN = {0022-2836},
Keywords = {Bioinformatics},
Related = {computational-biology}
}

Critical evaluation of bioinformatics tools for the prediction of protein crystallization propensity.
Wang, H., Feng, L., Webb, G. I., Kurgan, L., Song, J., & Lin, D.
Briefings in Bioinformatics, 19(5), 838-852, 2018.
[DOI] [Bibtex] [Abstract]

@Article{WangEtAl18,
Title = {Critical evaluation of bioinformatics tools for the prediction of protein crystallization propensity},
Author = {Wang, Huilin and Feng, Liubin and Webb, Geoffrey I and Kurgan, Lukasz and Song, Jiangning and Lin, Donghai},
Journal = {Briefings in Bioinformatics},
Year = {2018},
Number = {5},
Pages = {838-852},
Volume = {19},
Abstract = {X-ray crystallography is the main tool for structural determination of proteins. Yet, the underlying crystallization process is costly, has a high attrition rate and involves a series of trial-and-error attempts to obtain diffraction-quality crystals. The Structural Genomics Consortium aims to systematically solve representative structures of major protein-fold classes using primarily high-throughput X-ray crystallography. The attrition rate of these efforts can be improved by selection of proteins that are potentially easier to be crystallized. In this context, bioinformatics approaches have been developed to predict crystallization propensities based on protein sequences. These approaches are used to facilitate prioritization of the most promising target proteins, search for alternative structural orthologues of the target proteins and suggest designs of constructs capable of potentially enhancing the likelihood of successful crystallization. We reviewed and compared nine predictors of protein crystallization propensity. Moreover, we demonstrated that integrating selected outputs from multiple predictors as candidate input features to build the predictive model results in a significantly higher predictive performance when compared to using these predictors individually. Furthermore, we also introduced a new and accurate predictor of protein crystallization propensity, Crysf, which uses functional features extracted from UniProt as inputs. This comprehensive review will assist structural biologists in selecting the most appropriate predictor, and is also beneficial for bioinformaticians to develop a new generation of predictive algorithms.},
Doi = {10.1093/bib/bbx018},
Keywords = {Bioinformatics},
Related = {computational-biology}
}
ABSTRACT X-ray crystallography is the main tool for structural determination of proteins. Yet, the underlying crystallization process is costly, has a high attrition rate and involves a series of trial-and-error attempts to obtain diffraction-quality crystals. The Structural Genomics Consortium aims to systematically solve representative structures of major protein-fold classes using primarily high-throughput X-ray crystallography. The attrition rate of these efforts can be improved by selection of proteins that are potentially easier to be crystallized. In this context, bioinformatics approaches have been developed to predict crystallization propensities based on protein sequences. These approaches are used to facilitate prioritization of the most promising target proteins, search for alternative structural orthologues of the target proteins and suggest designs of constructs capable of potentially enhancing the likelihood of successful crystallization. We reviewed and compared nine predictors of protein crystallization propensity. Moreover, we demonstrated that integrating selected outputs from multiple predictors as candidate input features to build the predictive model results in a significantly higher predictive performance when compared to using these predictors individually. Furthermore, we also introduced a new and accurate predictor of protein crystallization propensity, Crysf, which uses functional features extracted from UniProt as inputs. This comprehensive review will assist structural biologists in selecting the most appropriate predictor, and is also beneficial for bioinformaticians to develop a new generation of predictive algorithms.

Comprehensive assessment and performance improvement of effector protein predictors for bacterial secretion systems III, IV and VI.
An, Y., Wang, J., Li, C., Leier, A., Marquez-Lago, T., Wilksch, J., Zhang, Y., Webb, G. I., Song, J., & Lithgow, T.
Briefings in Bioinformatics, 19(1), 148-161, 2018.
[DOI] [Bibtex] [Abstract]

@Article{AnEtAl2016,
author = {An, Yi and Wang, Jiawei and Li, Chen and Leier, Andre and Marquez-Lago, Tatiana and Wilksch, Jonathan and Zhang, Yang and Webb, Geoffrey I. and Song, Jiangning and Lithgow, Trevor},
journal = {Briefings in Bioinformatics},
title = {Comprehensive assessment and performance improvement of effector protein predictors for bacterial secretion systems III, IV and VI},
year = {2018},
number = {1},
pages = {148-161},
volume = {19},
abstract = {Bacterial effector proteins secreted by various protein secretion systems play crucial roles in host-pathogen interactions. In this context, computational tools capable of accurately predicting effector proteins of the various types of bacterial secretion systems are highly desirable. Existing computational approaches use different machine learning (ML) techniques and heterogeneous features derived from protein sequences and/or structural information. These predictors differ not only in terms of the used ML methods but also with respect to the used curated data sets, the features selection and their prediction performance. Here, we provide a comprehensive survey and benchmarking of currently available tools for the prediction of effector proteins of bacterial types III, IV and VI secretion systems (T3SS, T4SS and T6SS, respectively). We review core algorithms, feature selection techniques, tool availability and applicability and evaluate the prediction performance based on carefully curated independent test data sets. In an effort to improve predictive performance, we constructed three ensemble models based on ML algorithms by integrating the output of all individual predictors reviewed. Our benchmarks demonstrate that these ensemble models outperform all the reviewed tools for the prediction of effector proteins of T3SS and T4SS. The webserver of the proposed ensemble methods for T3SS and T4SS effector protein prediction is freely available at http://tbooster.erc.monash.edu/index.jsp. We anticipate that this survey will serve as a useful guide for interested users and that the new ensemble predictors will stimulate research into host-pathogen relationships and inspiration for the development of new bioinformatics tools for predicting effector proteins of T3SS, T4SS and T6SS.},
doi = {10.1093/bib/bbw100},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}

iFeature: a python package and web server for features extraction and selection from protein and peptide sequences.
Chen, Z., Zhao, P., Li, F., Leier, A., Marquez-Lago, T. T., Wang, Y., Webb, G. I., Smith, I. A., Daly, R. J., Chou, K., & Song, J.
Bioinformatics, Art. no. bty140, 2018.
[DOI] [Bibtex]

@Article{ChenEtAl18,
Title = {iFeature: a python package and web server for features extraction and selection from protein and peptide sequences},
Author = {Chen, Zhen and Zhao, Pei and Li, Fuyi and Leier, Andre and Marquez-Lago, Tatiana T and Wang, Yanan and Webb, Geoffrey I and Smith, A Ian and Daly, Roger J and Chou, Kuo-Chen and Song, Jiangning},
Journal = {Bioinformatics},
Year = {2018},
Articlenumber = {bty140},
Comment = {Clarivate Analytics Web of Science Highly Cited Paper, 2019},
Doi = {10.1093/bioinformatics/bty140},
Keywords = {Bioinformatics},
Related = {computational-biology}
}

PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework.
Song, J., Li, F., Takemoto, K., Haffari, G., Akutsu, T., Chou, K. C., & Webb, G. I.
Journal of Theoretical Biology, 443, 125-137, 2018.
[DOI] [Bibtex]

@Article{SongEtAl18,
Title = {PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework},
Author = {Song, J. and Li, F. and Takemoto, K. and Haffari, G. and Akutsu, T. and Chou, K. C. and Webb, G. I.},
Journal = {Journal of Theoretical Biology},
Year = {2018},
Pages = {125-137},
Volume = {443},
Comment = {Clarivate Analytics Web of Science Hot Paper and Highly Cited Paper, 2019},
Doi = {10.1016/j.jtbi.2018.01.023},
Keywords = {Bioinformatics},
Related = {computational-biology},
Url = {https://authors.elsevier.com/c/1WWQY57ilzyRc}
}

PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy.
Song, J., Li, F., Leier, A., Marquez-Lago, T. T., Akutsu, T., Haffari, G., Chou, K., Webb, G. I., & Pike, R. N.
Bioinformatics, Art. no. btx670, 2017.
[DOI] [Bibtex]

@Article{Song2017a,
Title = {PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy},
Author = {Song, Jiangning and Li, Fuyi and Leier, Andre and Marquez-Lago, Tatiana T and Akutsu, Tatsuya and Haffari, Gholamreza and Chou, Kuo-Chen and Webb, Geoffrey I and Pike, Robert N},
Journal = {Bioinformatics},
Year = {2017},
Articlenumber = {btx670},
Comment = {Clarivate Analytics Web of Science Highly Cited Paper, 2019},
Doi = {10.1093/bioinformatics/btx670},
Keywords = {Bioinformatics},
Related = {computational-biology},
Url = {https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btx670/4562332/PROSPERous-highthroughput-prediction-of-substrate?guestAccessKey=668859da-9d97-47cf-b655-31cc5aa931aa}
}

MetalExplorer, a Bioinformatics Tool for the Improved Prediction of Eight Types of Metal-binding Sites Using a Random Forest Algorithm with Two-step Feature Selection.
Song, J., Li, C., Zheng, C., Revote, J., Zhang, Z., & Webb, G. I.
Current Bioinformatics, 12(6), 480-489, 2017.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl16,
Title = {MetalExplorer, a Bioinformatics Tool for the Improved Prediction of Eight Types of Metal-binding Sites Using a Random Forest Algorithm with Two-step Feature Selection},
Author = {Song, Jiangning and Li, Chen and Zheng, Cheng and Revote, Jerico and Zhang, Ziding and Webb, Geoffrey I.},
Journal = {Current Bioinformatics},
Year = {2017},
Number = {6},
Pages = {480-489},
Volume = {12},
Abstract = {Metalloproteins are highly involved in many biological processes,
including catalysis, recognition, transport, transcription, and signal
transduction. The metal ions they bind usually play enzymatic or structural
roles in mediating these diverse functional roles. Thus, the systematic
analysis and prediction of metal-binding sites using sequence and/or
structural information are crucial for understanding their
sequence-structure-function relationships. In this study, we propose
MetalExplorer (http://metalexplorer.erc.monash.edu.au/), a new machine
learning-based method for predicting eight different types of metal-binding
sites (Ca, Co, Cu, Fe, Ni, Mg, Mn, and Zn) in proteins. Our approach
combines heterogeneous sequence-, structure-, and residue contact
network-based features. The predictive performance of MetalExplorer was
tested by cross-validation and independent tests using non-redundant
datasets of known structures. This method applies a two-step feature
selection approach based on the maximum relevance minimum redundancy and
forward feature selection to identify the most informative features that
contribute to the prediction performance. With a precision of 60%,
MetalExplorer achieved high recall values, which ranged from 59% to 88% for
the eight metal ion types in fivefold cross-validation tests. Moreover, the
common and type-specific features in the optimal subsets of all metal ions
were characterized in terms of their contributions to the overall
performance. In terms of both benchmark and independent datasets at the 60%
precision control level, MetalExplorer compared favorably with an existing
metalloprotein prediction tool, SitePredict. Thus, MetalExplorer is expected
to be a powerful tool for the accurate prediction of potential metal-binding
sites and it should facilitate the functional analysis and rational design
of novel metalloproteins.},
Doi = {10.2174/2468422806666160618091522},
ISSN = {1574-8936/2212-392X},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

PhosphoPredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection.
Song, J., Wang, H., Wang, J., Leier, A., Marquez-Lago, T., Yang, B., Zhang, Z., Akutsu, T., Webb, G. I., & Daly, R. J.
Scientific Reports, 7(1), Art. no. 6862, 2017.
[DOI] [Bibtex] [Abstract]

@Article{Song2017,
Title = {PhosphoPredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection},
Author = {Song, Jiangning and Wang, Huilin and Wang, Jiawei and Leier, Andre and Marquez-Lago, Tatiana and Yang, Bingjiao and Zhang, Ziding and Akutsu, Tatsuya and Webb, Geoffrey I. and Daly, Roger J.},
Journal = {Scientific Reports},
Year = {2017},
Number = {1},
Volume = {7},
Abstract = {Protein phosphorylation is a major form of post-translational modification (PTM) that regulates diverse cellular processes. In silico methods for phosphorylation site prediction can provide a useful and complementary strategy for complete phosphoproteome annotation. Here, we present a novel bioinformatics tool, PhosphoPredict, that combines protein sequence and functional features to predict kinase-specific substrates and their associated phosphorylation sites for 12 human kinases and kinase families, including ATM, CDKs, GSK-3, MAPKs, PKA, PKB, PKC, and SRC. To elucidate critical determinants, we identified feature subsets that were most informative and relevant for predicting substrate specificity for each individual kinase family. Extensive benchmarking experiments based on both five-fold cross-validation and independent tests indicated that the performance of PhosphoPredict is competitive with that of several other popular prediction tools, including KinasePhos, PPSP, GPS, and Musite. We found that combining protein functional and sequence features significantly improves phosphorylation site prediction performance across all kinases. Application of PhosphoPredict to the entire human proteome identified 150 to 800 potential phosphorylation substrates for each of the 12 kinases or kinase families. PhosphoPredict significantly extends the bioinformatics portfolio for kinase function analysis and will facilitate high-throughput identification of kinase-specific phosphorylation sites, thereby contributing to both basic and translational research programs.},
Articlenumber = {6862},
Doi = {10.1038/s41598-017-07199-4},
Keywords = {Bioinformatics},
Related = {computational-biology}
}

SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems.
An, Y., Wang, J., Li, C., Revote, J., Zhang, Y., Naderer, T., Hayashida, M., Akutsu, T., Webb, G. I., Lithgow, T., & Song, J.
Scientific Reports, 7, Art. no. 41031, 2017.
[DOI] [Bibtex]

@Article{AnEtAl17,
Title = {SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems},
Author = {An, Yi and Wang, Jiawei and Li, Chen and Revote, Jerico and Zhang, Yang and Naderer, Thomas and Hayashida, Morihiro and Akutsu, Tatsuya and Webb, Geoffrey I. and Lithgow, Trevor and Song, Jiangning},
Journal = {Scientific Reports},
Year = {2017},
Volume = {7},
Articlenumber = {41031},
Doi = {10.1038/srep41031},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://rdcu.be/oJ9I}
}

POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles.
Wang, J., Yang, B., Revote, J., Leier, A., Marquez-Lago, T. T., Webb, G. I., Song, J., Chou, K., & Lithgow, T.
Bioinformatics, 33(17), 2756-2758, 2017.
[DOI] [Bibtex]

@Article{WangJEtAl17,
Title = {POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles},
Author = {Wang, Jiawei and Yang, Bingjiao and Revote, Jerico and Leier, Andre and Marquez-Lago, Tatiana T. and Webb, Geoffrey I. and Song, Jiangning and Chou, Kuo-Chen and Lithgow, Trevor},
Journal = {Bioinformatics},
Year = {2017},
Number = {17},
Pages = {2756-2758},
Volume = {33},
Doi = {10.1093/bioinformatics/btx302},
Keywords = {Bioinformatics},
Related = {computational-biology}
}

Knowledge-transfer learning for prediction of matrix metalloprotease substrate-cleavage sites.
Wang, Y., Song, J., Marquez-Lago, T. T., Leier, A., Li, C., Lithgow, T., Webb, G. I., & Shen, H.
Scientific Reports, 7, Art. no. 5755, 2017.
[DOI] [Bibtex]

@Article{WangYEtAl17,
Title = {Knowledge-transfer learning for prediction of matrix metalloprotease substrate-cleavage sites},
Author = {Wang, Yanan and Song, Jiangning and Marquez-Lago, Tatiana T. and Leier, Andre and Li, Chen and Lithgow, Trevor and Webb, Geoffrey I. and Shen, Hong-Bin},
Journal = {Scientific Reports},
Year = {2017},
Volume = {7},
Articlenumber = {5755},
Doi = {10.1038/s41598-017-06219-7},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

Smoothing a rugged protein folding landscape by sequence-based redesign.
Porebski, B. T., Keleher, S., Hollins, J. J., Nickson, A. A., Marijanovic, E. M., Borg, N. A., Costa, M. G. S., Pearce, M. A., Dai, W., Zhu, L., Irving, J. A., Hoke, D. E., Kass, I., Whisstock, J. C., Bottomley, S. P., Webb, G. I., McGowan, S., & Buckle, A. M.
Scientific Reports, 6, Art. no. 33958, 2016.
[DOI] [Bibtex] [Abstract]

@Article{Porebski2016,
Title = {Smoothing a rugged protein folding landscape by sequence-based redesign},
Author = {Porebski, Benjamin T. and Keleher, Shani and Hollins, Jeffrey J. and Nickson, Adrian A. and Marijanovic, Emilia M. and Borg, Natalie A. and Costa, Mauricio G. S. and Pearce, Mary A. and Dai, Weiwen and Zhu, Liguang and Irving, James A. and Hoke, David E. and Kass, Itamar and Whisstock, James C. and Bottomley, Stephen P. and Webb, Geoffrey I. and McGowan, Sheena and Buckle, Ashley M.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The rugged folding landscapes of functional proteins puts them at risk of misfolding and aggregation. Serine protease inhibitors, or serpins, are paradigms for this delicate balance between function and misfolding. Serpins exist in a metastable state that undergoes a major conformational change in order to inhibit proteases. However, conformational lability of the native serpin fold renders them susceptible to misfolding, which underlies misfolding diseases such as alpha1-antitrypsin deficiency. To investigate how serpins balance function and folding, we used consensus design to create conserpin, a synthetic serpin that folds reversibly, is functional, thermostable, and polymerization resistant. Characterization of its structure, folding and dynamics suggest that consensus design has remodeled the folding landscape to reconcile competing requirements for stability and function. This approach may offer general benefits for engineering functional proteins that have risky folding landscapes, including the removal of aggregation-prone intermediates, and modifying scaffolds for use as protein therapeutics.},
Articlenumber = {33958},
Doi = {10.1038/srep33958},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1038/srep33958}
}

Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli.
Chang, C. C. H., Li, C., Webb, G. I., Tey, B., & Song, J.
Scientific Reports, 6, Art. no. 21844, 2016.
[DOI] [Bibtex] [Abstract]

@Article{ChangEtAl2016,
Title = {Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli},
Author = {Chang, C.C.H. and Li, C. and Webb, G. I. and Tey, B. and Song, J.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement with current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture, to predict the real-valued expression level of target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) classifier to be used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson’s correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.},
Articlenumber = {21844},
Doi = {10.1038/srep21844},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features.
Li, F., Li, C., Revote, J., Zhang, Y., Webb, G. I., Li, J., Song, J., & Lithgow, T.
Scientific Reports, 6, Art. no. 34595, 2016.
[DOI] [Bibtex]

@Article{LiEtAl16,
Title = {GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features},
Author = {Li, Fuyi and Li, Chen and Revote, Jerico and Zhang, Yang and Webb, Geoffrey I. and Li, Jian and Song, Jiangning and Lithgow, Trevor},
Journal = {Scientific Reports},
Year = {2016},
Month = oct,
Volume = {6},
Articlenumber = {34595},
Doi = {10.1038/srep34595},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

Crysalis: an integrated server for computational analysis and design of protein crystallization.
Wang, H., Feng, L., Zhang, Z., Webb, G. I., Lin, D., & Song, J.
Scientific Reports, 6, Art. no. 21383, 2016.
[DOI] [Bibtex] [Abstract]

@Article{WangEtAl16,
Title = {Crysalis: an integrated server for computational analysis and design of protein crystallization},
Author = {Wang, H. and Feng, L. and Zhang, Z. and Webb, G. I. and Lin, D. and Song, J.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The failure of multi-step experimental procedures to yield diffraction-quality crystals is a major bottleneck in protein structure determination. Accordingly, several bioinformatics methods have been successfully developed and employed to select crystallizable proteins. Unfortunately, the majority of existing in silico methods only allow the prediction of crystallization propensity, seldom enabling computational design of protein mutants that can be targeted for enhancing protein crystallizability. Here, we present Crysalis, an integrated crystallization analysis tool that builds on support-vector regression (SVR) models to facilitate computational protein crystallization prediction, analysis, and design. More specifically, the functionality of this new tool includes: (1) rapid selection of target crystallizable proteins at the proteome level, (2) identification of site non-optimality for protein crystallization and systematic analysis of all potential single-point mutations that might enhance protein crystallization propensity, and (3) annotation of target protein based on predicted structural properties. We applied the design mode of Crysalis to identify site non-optimality for protein crystallization on a proteome-scale, focusing on proteins currently classified as non-crystallizable. Our results revealed that site non-optimality is based on biases related to residues, predicted structures, physicochemical properties, and sequence loci, which provides in-depth understanding of the features influencing protein crystallization. Crysalis is freely available at http://nmrcen.xmu.edu.cn/crysalis/.},
Articlenumber = {21383},
Doi = {10.1038/srep21383},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome.
Li, F., Li, C., Wang, M., Webb, G. I., Zhang, Y., Whisstock, J. C., & Song, J.
Bioinformatics, 31(9), 1411-1419, 2015.
[DOI] [Bibtex] [Abstract]

@Article{LiEtAl15,
Title = {GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome},
Author = {Li, F. and Li, C. and Wang, M. and Webb, G. I. and Zhang, Y. and Whisstock, J. C. and Song, J.},
Journal = {Bioinformatics},
Year = {2015},
Number = {9},
Pages = {1411-1419},
Volume = {31},
Abstract = {Motivation: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50% of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM.
Results: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.},
Doi = {10.1093/bioinformatics/btu852},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}
ABSTRACT Motivation: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50% of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM. Results: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.

Structural and dynamic properties that govern the stability of an engineered fibronectin type III domain.
Porebski, B. T., Nickson, A. A., Hoke, D. E., Hunter, M. R., Zhu, L., McGowan, S., Webb, G. I., & Buckle, A. M.
Protein Engineering, Design and Selection, 28(3), 67-78, 2015.
[DOI] [Bibtex] [Abstract]

@Article{PorebskiEtAl15,
Title = {Structural and dynamic properties that govern the stability of an engineered fibronectin type III domain},
Author = {Porebski, B. T. and Nickson, A. A. and Hoke, D. E. and Hunter, M. R. and Zhu, L. and McGowan, S. and Webb, G. I. and Buckle, A. M.},
Journal = {Protein Engineering, Design and Selection},
Year = {2015},
Number = {3},
Pages = {67-78},
Volume = {28},
Abstract = {Consensus protein design is a rapid and reliable technique for the improvement of protein stability, which relies on the use of homologous protein sequences. To enhance the stability of a fibronectin type III (FN3) domain, consensus design was employed using an alignment of 2123 sequences. The resulting FN3 domain, FN3con, has unprecedented stability, with a melting temperature >100°C, a ΔG(D-N) of 15.5 kcal mol⁻¹ and a greatly reduced unfolding rate compared with wild-type. To determine the underlying molecular basis for stability, an X-ray crystal structure of FN3con was determined to 2.0 Å and compared with other FN3 domains of varying stabilities. The structure of FN3con reveals significantly increased salt bridge interactions that are cooperatively networked, and a highly optimized hydrophobic core. Molecular dynamics simulations of FN3con and comparison structures show the cooperative power of electrostatic and hydrophobic networks in improving FN3con stability. Taken together, our data reveal that FN3con stability does not result from a single mechanism, but rather the combination of several features and the removal of non-conserved, unfavorable interactions. The large number of sequences employed in this study has most likely enhanced the robustness of the consensus design, which is now possible due to the increased sequence availability in the post-genomic era. These studies increase our knowledge of the molecular mechanisms that govern stability and demonstrate the rising potential for enhancing stability via the consensus method.},
Doi = {10.1093/protein/gzv002},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://peds.oxfordjournals.org/content/28/3/67.full.pdf+html}
}
ABSTRACT Consensus protein design is a rapid and reliable technique for the improvement of protein stability, which relies on the use of homologous protein sequences. To enhance the stability of a fibronectin type III (FN3) domain, consensus design was employed using an alignment of 2123 sequences. The resulting FN3 domain, FN3con, has unprecedented stability, with a melting temperature >100°C, a ΔG(D-N) of 15.5 kcal mol⁻¹ and a greatly reduced unfolding rate compared with wild-type. To determine the underlying molecular basis for stability, an X-ray crystal structure of FN3con was determined to 2.0 Å and compared with other FN3 domains of varying stabilities. The structure of FN3con reveals significantly increased salt bridge interactions that are cooperatively networked, and a highly optimized hydrophobic core. Molecular dynamics simulations of FN3con and comparison structures show the cooperative power of electrostatic and hydrophobic networks in improving FN3con stability. Taken together, our data reveal that FN3con stability does not result from a single mechanism, but rather the combination of several features and the removal of non-conserved, unfavorable interactions. The large number of sequences employed in this study has most likely enhanced the robustness of the consensus design, which is now possible due to the increased sequence availability in the post-genomic era. These studies increase our knowledge of the molecular mechanisms that govern stability and demonstrate the rising potential for enhancing stability via the consensus method.
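
As a rough illustration of the consensus idea described in the abstract above, the following sketch takes the most frequent residue in each column of a toy alignment. The function name `consensus_sequence`, the three example sequences and the gap handling are all assumptions made for illustration; this is not the procedure the paper applied to its 2123-sequence alignment.

```python
from collections import Counter


def consensus_sequence(alignment, gap="-"):
    """Return the most common non-gap residue in each alignment column."""
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        # Columns made up only of gaps keep the gap character.
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)


# Hypothetical toy alignment, not data from the paper.
toy_alignment = ["MKV-LT", "MRVALT", "MKIALS"]
print(consensus_sequence(toy_alignment))
```

A real consensus design would also have to decide how to break ties and how to weight closely related sequences; this sketch ignores both.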

Accurate in Silico Identification of Species-Specific Acetylation Sites by Integrating Protein Sequence-Derived and Functional Features.
Li, Y., Wang, M., Wang, H., Tan, H., Zhang, Z., Webb, G. I., & Song, J.
Scientific Reports, 4, Art. no. 5765, 2014.
[DOI] [Bibtex] [Abstract]

@Article{LiEtAl2014,
author = {Li, Y. and Wang, M. and Wang, H. and Tan, H. and Zhang, Z. and Webb, G. I. and Song, J.},
journal = {Scientific Reports},
title = {Accurate in Silico Identification of Species-Specific Acetylation Sites by Integrating Protein Sequence-Derived and Functional Features},
year = {2014},
volume = {4},
abstract = {Lysine acetylation is a reversible post-translational modification, playing an important role in cytokine signaling, transcriptional regulation, and apoptosis. To fully understand acetylation mechanisms, identification of substrates and specific acetylation sites is crucial. Experimental identification is often time-consuming and expensive. Alternative bioinformatics methods are cost-effective and can be used in a high-throughput manner to generate relatively precise predictions. Here we develop a method termed as SSPKA for species-specific lysine acetylation prediction, using random forest classifiers that combine sequence-derived and functional features with two-step feature selection. Feature importance analysis indicates functional features, applied for lysine acetylation site prediction for the first time, significantly improve the predictive performance. We apply the SSPKA model to screen the entire human proteome and identify many high-confidence putative substrates that are not previously identified. The results along with the implemented Java tool, serve as useful resources to elucidate the mechanism of lysine acetylation and facilitate hypothesis-driven experimental design and validation.},
articlenumber = {5765},
doi = {10.1038/srep05765},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}
ABSTRACT Lysine acetylation is a reversible post-translational modification, playing an important role in cytokine signaling, transcriptional regulation, and apoptosis. To fully understand acetylation mechanisms, identification of substrates and specific acetylation sites is crucial. Experimental identification is often time-consuming and expensive. Alternative bioinformatics methods are cost-effective and can be used in a high-throughput manner to generate relatively precise predictions. Here we develop a method termed as SSPKA for species-specific lysine acetylation prediction, using random forest classifiers that combine sequence-derived and functional features with two-step feature selection. Feature importance analysis indicates functional features, applied for lysine acetylation site prediction for the first time, significantly improve the predictive performance. We apply the SSPKA model to screen the entire human proteome and identify many high-confidence putative substrates that are not previously identified. The results along with the implemented Java tool, serve as useful resources to elucidate the mechanism of lysine acetylation and facilitate hypothesis-driven experimental design and validation.

PROSPER: An Integrated Feature-Based Tool for Predicting Protease Substrate Cleavage Sites.
Song, J., Tan, H., Perry, A. J., Akutsu, T., Webb, G. I., Whisstock, J. C., & Pike, R. N.
PLoS ONE, 7(11), Art. no. e50300, 2012.
[URL] [Bibtex] [Abstract]

@Article{SongEtAl12b,
author = {Song, J. and Tan, H. and Perry, A. J. and Akutsu, T. and Webb, G. I. and Whisstock, J. C. and Pike, R. N.},
journal = {PLoS ONE},
title = {PROSPER: An Integrated Feature-Based Tool for Predicting Protease Substrate Cleavage Sites},
year = {2012},
month = {11},
number = {11},
volume = {7},
abstract = {The ability to catalytically cleave protein substrates after synthesis is fundamental for all forms of life. Accordingly, site-specific proteolysis is one of the most important post-translational modifications. The key to understanding the physiological role of a protease is to identify its natural substrate(s). Knowledge of the substrate specificity of a protease can dramatically improve our ability to predict its target protein substrates, but this information must be utilized in an effective manner in order to efficiently identify protein substrates by in silico approaches. To address this problem, we present PROSPER, an integrated feature-based server for in silico identification of protease substrates and their cleavage sites for twenty-four different proteases. PROSPER utilizes established specificity information for these proteases (derived from the MEROPS database) with a machine learning approach to predict protease cleavage sites by using different, but complementary sequence and structure characteristics. Features used by PROSPER include local amino acid sequence profile, predicted secondary structure, solvent accessibility and predicted native disorder. Thus, for proteases with known amino acid specificity, PROSPER provides a convenient, pre-prepared tool for use in identifying protein substrates for the enzymes. Systematic prediction analysis for the twenty-four proteases thus far included in the database revealed that the features we have included in the tool strongly improve performance in terms of cleavage site prediction, as evidenced by their contribution to performance improvement in terms of identifying known cleavage sites in substrates for these enzymes. In comparison with two state-of-the-art prediction tools, PoPS and SitePrediction, PROSPER achieves greater accuracy and coverage. 
To our knowledge, PROSPER is the first comprehensive server capable of predicting cleavage sites of multiple proteases within a single substrate sequence using machine learning techniques. It is freely available at http://lightning.med.monash.edu.au/PROSPER/.},
articlenumber = {e50300},
keywords = {Bioinformatics},
publisher = {Public Library of Science},
related = {computational-biology},
url = {http://dx.doi.org/10.1371%2Fjournal.pone.0050300},
}
ABSTRACT The ability to catalytically cleave protein substrates after synthesis is fundamental for all forms of life. Accordingly, site-specific proteolysis is one of the most important post-translational modifications. The key to understanding the physiological role of a protease is to identify its natural substrate(s). Knowledge of the substrate specificity of a protease can dramatically improve our ability to predict its target protein substrates, but this information must be utilized in an effective manner in order to efficiently identify protein substrates by in silico approaches. To address this problem, we present PROSPER, an integrated feature-based server for in silico identification of protease substrates and their cleavage sites for twenty-four different proteases. PROSPER utilizes established specificity information for these proteases (derived from the MEROPS database) with a machine learning approach to predict protease cleavage sites by using different, but complementary sequence and structure characteristics. Features used by PROSPER include local amino acid sequence profile, predicted secondary structure, solvent accessibility and predicted native disorder. Thus, for proteases with known amino acid specificity, PROSPER provides a convenient, pre-prepared tool for use in identifying protein substrates for the enzymes. Systematic prediction analysis for the twenty-four proteases thus far included in the database revealed that the features we have included in the tool strongly improve performance in terms of cleavage site prediction, as evidenced by their contribution to performance improvement in terms of identifying known cleavage sites in substrates for these enzymes. In comparison with two state-of-the-art prediction tools, PoPS and SitePrediction, PROSPER achieves greater accuracy and coverage. 
To our knowledge, PROSPER is the first comprehensive server capable of predicting cleavage sites of multiple proteases within a single substrate sequence using machine learning techniques. It is freely available at http://lightning.med.monash.edu.au/PROSPER/.

TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences.
Song, J., Tan, H., Wang, M., Webb, G. I., & Akutsu, T.
PLoS ONE, 7(2), e30361, 2012.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl12,
Title = {TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences},
Author = {Song, Jiangning and Tan, Hao and Wang, Mingjun and Webb, Geoffrey I. and Akutsu, Tatsuya},
Journal = {PLoS ONE},
Year = {2012},
Month = {02},
Number = {2},
Pages = {e30361},
Volume = {7},
Abstract = {Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi)
and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine
the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information
can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to
predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to
perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary
profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered
region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins,
the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8° and 44.6°, respectively, which are 1% and 3% respectively
lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a
random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the
Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting
protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely
accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.},
Doi = {10.1371/journal.pone.0030361},
Keywords = {Bioinformatics},
Publisher = {Public Library of Science},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1371%2Fjournal.pone.0030361}
}
ABSTRACT Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi) and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins, the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8° and 44.6°, respectively, which are 1% and 3% respectively lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.
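
The mean absolute error reported above is computed over angles, which are periodic. A minimal sketch of one way to compute such an error is given below, assuming that differences are wrapped to the range 0 to 180 degrees. Whether TANGLE's evaluation wraps differences in this way is not stated in the abstract, and the function name `angular_mae` and the sample values are hypothetical.

```python
def angular_mae(predicted, observed):
    """Mean absolute error in degrees, treating angles as periodic (360 degrees)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, o in zip(predicted, observed):
        diff = abs(p - o) % 360.0
        # Take the shorter way around the circle (assumed convention).
        total += min(diff, 360.0 - diff)
    return total / len(predicted)


# Hypothetical Phi angles in degrees, not values from the paper.
phi_pred = [-60.0, 175.0, -120.0]
phi_true = [-65.0, -178.0, -100.0]
print(angular_mae(phi_pred, phi_true))
```

With wrapping, a prediction of 175° against an observed -178° counts as a 7° error rather than 353°.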

Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs.
Mahmood, K., Webb, G. I., Song, J., Whisstock, J. C., & Konagurthu, A. S.
Nucleic Acids Research, 40(6), e44, 2012.
[DOI] [Bibtex] [Abstract]

@Article{MahmoodEtAl2012,
author = {Mahmood, K. and Webb, G. I. and Song, J. and Whisstock, J. C. and Konagurthu, A. S.},
journal = {Nucleic Acids Research},
title = {Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs},
year = {2012},
number = {6},
pages = {e44},
volume = {40},
abstract = {Broadly, computational approaches for ortholog assignment is a three steps process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, availability of complex genomes containing large gene families with prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy where at each iteration the best gene assignments are identified resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, complete ortholog assignment pipeline (including afree and the iterative graph matching method) available from http://vbc.med.monash.edu.au/~kmahmood/EGM2.},
doi = {10.1093/nar/gkr1261},
eprint = {http://nar.oxfordjournals.org/content/early/2011/12/29/nar.gkr1261.full.pdf+html},
keywords = {Bioinformatics},
publisher = {Oxford Journals},
related = {computational-biology},
url = {http://nar.oxfordjournals.org/content/early/2011/12/29/nar.gkr1261.abstract},
}
ABSTRACT Broadly, computational approaches for ortholog assignment is a three steps process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, availability of complex genomes containing large gene families with prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy where at each iteration the best gene assignments are identified resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, complete ortholog assignment pipeline (including afree and the iterative graph matching method) available from http://vbc.med.monash.edu.au/~kmahmood/EGM2.
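
To give a feel for alignment-free comparison based on k-mers, the sketch below counts the distinct k-mers shared by each pair of sequences using a simple inverted index. It is not the afree sort-join method itself, whose details are not given in the abstract; the function name `shared_kmer_counts`, the choice of k = 3 and the toy sequences are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def shared_kmer_counts(sequences, k=3):
    """Count distinct k-mers shared by each pair of sequences via an inverted index."""
    # Map each k-mer to the set of sequence names that contain it.
    index = defaultdict(set)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    # Every pair of names listed under the same k-mer shares that k-mer.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Hypothetical toy sequences, not data from the paper.
toy = {"p1": "MKVLAT", "p2": "KVLATG", "p3": "GGSGGS"}
print(shared_kmer_counts(toy))
```

Only pairs that share at least one k-mer appear in the result, so unrelated sequences are never compared directly. Candidate pairs with high counts could then be passed to a more careful comparison.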

Discovery of Amino Acid Motifs for Thrombin Cleavage and Validation Using a Model Substrate.
Ng, N. M., Pierce, J. D., Webb, G. I., Ratnikov, B. I., Wijeyewickrema, L. C., Duncan, R. C., Robertson, A. L., Bottomley, S. P., Boyd, S. E., & Pike, R. N.
Biochemistry, 50(48), 10499-10507, 2011.
[DOI] [Bibtex] [Abstract]

@Article{NgEtAl11,
author = {Ng, N. M. and Pierce, J. D. and Webb, G. I. and Ratnikov, B. I. and Wijeyewickrema, L. C. and Duncan, R. C. and Robertson, A. L. and Bottomley, S. P. and Boyd, S. E. and Pike, R. N.},
journal = {Biochemistry},
title = {Discovery of Amino Acid Motifs for Thrombin Cleavage and Validation Using a Model Substrate},
year = {2011},
number = {48},
pages = {10499-10507},
volume = {50},
abstract = {Understanding the active site preferences of an enzyme is critical to the design of effective inhibitors and to gaining insights into its mechanisms of action on substrates. While the subsite specificity of thrombin is understood, it is not clear whether the enzyme prefers individual amino acids at each subsite in isolation or prefers to cleave combinations of amino acids as a motif. To investigate whether preferred peptide motifs for cleavage could be identified for thrombin, we exposed a phage-displayed peptide library to thrombin. The resulting preferentially cleaved substrates were analyzed using the technique of association rule discovery. The results revealed that thrombin selected for amino acid motifs in cleavage sites. The contribution of these hypothetical motifs to substrate cleavage efficiency was further investigated using the B1 IgG-binding domain of streptococcal protein G as a model substrate. Introduction of a P2-P1′ LRS thrombin cleavage sequence within a major loop of the protein led to cleavage of the protein by thrombin, with the cleavage efficiency increasing with the length of the loop. Introduction of further P3-P1 and P1-P1′-P3′ amino acid motifs into the loop region yielded greater cleavage efficiencies, suggesting that the susceptibility of a protein substrate to cleavage by thrombin is influenced by these motifs, perhaps because of cooperative effects between subsites closest to the scissile peptide bond.},
doi = {10.1021/bi201333g},
eprint = {http://pubs.acs.org/doi/pdf/10.1021/bi201333g},
keywords = {Bioinformatics},
related = {computational-biology},
url = {http://pubs.acs.org/doi/abs/10.1021/bi201333g},
}
ABSTRACT Understanding the active site preferences of an enzyme is critical to the design of effective inhibitors and to gaining insights into its mechanisms of action on substrates. While the subsite specificity of thrombin is understood, it is not clear whether the enzyme prefers individual amino acids at each subsite in isolation or prefers to cleave combinations of amino acids as a motif. To investigate whether preferred peptide motifs for cleavage could be identified for thrombin, we exposed a phage-displayed peptide library to thrombin. The resulting preferentially cleaved substrates were analyzed using the technique of association rule discovery. The results revealed that thrombin selected for amino acid motifs in cleavage sites. The contribution of these hypothetical motifs to substrate cleavage efficiency was further investigated using the B1 IgG-binding domain of streptococcal protein G as a model substrate. Introduction of a P2-P1′ LRS thrombin cleavage sequence within a major loop of the protein led to cleavage of the protein by thrombin, with the cleavage efficiency increasing with the length of the loop. Introduction of further P3-P1 and P1-P1′-P3′ amino acid motifs into the loop region yielded greater cleavage efficiencies, suggesting that the susceptibility of a protein substrate to cleavage by thrombin is influenced by these motifs, perhaps because of cooperative effects between subsites closest to the scissile peptide bond.

Bioinformatic Approaches for Predicting Substrates of Proteases.
Song, J., Tan, H., Boyd, S. E., Shen, H., Mahmood, K., Webb, G. I., Akutsu, T., Whisstock, J. C., & Pike, R. N.
Journal of Bioinformatics and Computational Biology, 9(1), 149-178, 2011.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl11,
author = {Song, J. and Tan, H. and Boyd, S. E. and Shen, H. and Mahmood, K. and Webb, G. I. and Akutsu, T. and Whisstock, J. C. and Pike, R. N.},
journal = {Journal of Bioinformatics and Computational Biology},
title = {Bioinformatic Approaches for Predicting Substrates of Proteases},
year = {2011},
number = {1},
pages = {149-178},
volume = {9},
abstract = {Proteases have central roles in "life and death" processes due to their important ability to catalytically hydrolyse protein substrates, usually altering the function and/or activity of the target in the process. Knowledge of the substrate specificity of a protease should, in theory, dramatically improve the ability to predict target protein substrates. However, experimental identification and characterization of protease substrates is often difficult and time-consuming. Thus solving the "substrate identification" problem is fundamental to both understanding protease biology and the development of therapeutics that target specific protease-regulated pathways. In this context, bioinformatic prediction of protease substrates may provide useful and experimentally testable information about novel potential cleavage sites in candidate substrates. In this article, we provide an overview of recent advances in developing bioinformatic approaches for predicting protease substrate cleavage sites and identifying novel putative substrates. We discuss the advantages and drawbacks of the current methods and detail how more accurate models can be built by deriving multiple sequence and structural features of substrates. We also provide some suggestions about how future studies might further improve the accuracy of protease substrate specificity prediction.},
audit-trail = {http://www.worldscinet.com/jbcb/00/0001/S0219720011005288.html},
doi = {10.1142/S0219720011005288},
keywords = {Bioinformatics},
publisher = {World Scientific},
related = {computational-biology},
}
ABSTRACT Proteases have central roles in "life and death" processes due to their important ability to catalytically hydrolyse protein substrates, usually altering the function and/or activity of the target in the process. Knowledge of the substrate specificity of a protease should, in theory, dramatically improve the ability to predict target protein substrates. However, experimental identification and characterization of protease substrates is often difficult and time-consuming. Thus solving the "substrate identification" problem is fundamental to both understanding protease biology and the development of therapeutics that target specific protease-regulated pathways. In this context, bioinformatic prediction of protease substrates may provide useful and experimentally testable information about novel potential cleavage sites in candidate substrates. In this article, we provide an overview of recent advances in developing bioinformatic approaches for predicting protease substrate cleavage sites and identifying novel putative substrates. We discuss the advantages and drawbacks of the current methods and detail how more accurate models can be built by deriving multiple sequence and structural features of substrates. We also provide some suggestions about how future studies might further improve the accuracy of protease substrate specificity prediction.

EGM: Encapsulated Gene-by-Gene Matching to Identify Gene Orthologs and Homologous Segments in Genomes.
Mahmood, K., Konagurthu, A. S., Song, J., Buckle, A. M., Webb, G. I., & Whisstock, J. C.
Bioinformatics, 26(17), 2076-2084, 2010.
[DOI] [Bibtex] [Abstract]

@Article{MahmoodEtAl10,
author = {Mahmood, K. and Konagurthu, A. S. and Song, J. and Buckle, A. M. and Webb, G. I. and Whisstock, J. C.},
journal = {Bioinformatics},
title = {EGM: Encapsulated Gene-by-Gene Matching to Identify Gene Orthologs and Homologous Segments in Genomes},
year = {2010},
number = {17},
pages = {2076-2084},
volume = {26},
abstract = {Motivation: Identification of functionally equivalent genes in different species is essential to understand the evolution of biological pathways and processes. At the same time, identification of strings of conserved orthologous genes helps identify complex genomic rearrangements across different organisms. Such an insight is particularly useful, for example, in the transfer of experimental results between different experimental systems such as Drosophila and mammals.
Results: Here we describe the Encapsulated Gene-by-gene Matching (EGM) approach, a method that employs a graph matching strategy to identify gene orthologs and conserved gene segments. Given a pair of genomes, EGM constructs a global gene match for all genes taking into account gene context and family information. The Hungarian method for identifying the maximum weight matching in bipartite graphs is employed, where the resulting matching reveals one-to-one correspondences between nodes (genes) in a manner that maximizes the gene similarity and context.
Conclusion: We tested our approach by performing several comparisons including a detailed Human v Mouse genome mapping. We find that the algorithm is robust and sensitive in detecting orthologs and conserved gene segments. EGM can sensitively detect rearrangements within large and small chromosomal segments. The EGM tool is fully automated and easy to use compared to other more complex methods that also require extensive manual intervention and input.},
audit-trail = {http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq339v1},
doi = {10.1093/bioinformatics/btq339},
keywords = {Bioinformatics},
publisher = {Oxford Univ Press},
related = {computational-biology},
}

Cascleave: Towards More Accurate Prediction of Caspase Substrate Cleavage Sites.
Song, J., Tan, H., Shen, H., Mahmood, K., Boyd, S. E., Webb, G. I., Akutsu, T., & Whisstock, J. C.
Bioinformatics, 26(6), 752-760, 2010.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl10,
author = {Song, J. and Tan, H. and Shen, H. and Mahmood, K. and Boyd, S. E. and Webb, G. I. and Akutsu, T. and Whisstock, J. C.},
journal = {Bioinformatics},
title = {Cascleave: Towards More Accurate Prediction of Caspase Substrate Cleavage Sites},
year = {2010},
number = {6},
pages = {752-760},
volume = {26},
abstract = {Motivation: The caspase family of cysteine proteases play essential roles in key biological processes such as programmed cell death, differentiation, proliferation, necrosis and inflammation. The complete repertoire of caspase substrates remains to be fully characterized. Accordingly, systematic computational screening studies of caspase substrate cleavage sites may provide insight into the substrate specificity of caspases and further facilitating the discovery of putative novel substrates. Results: In this article we develop an approach (termed Cascleave) to predict both classical (i.e. following a P1 Asp) and non-typical caspase cleavage sites. When using local sequence-derived profiles, Cascleave successfully predicted 82.2% of the known substrate cleavage sites, with a Matthews correlation coefficient (MCC) of 0.667. We found that prediction performance could be further improved by incorporating information such as predicted solvent accessibility and whether a cleavage sequence lies in a region that is most likely natively unstructured. Novel bi-profile Bayesian signatures were found to significantly improve the prediction performance and yielded the best performance with an overall accuracy of 87.6% and a MCC of 0.747, which is higher accuracy than published methods that essentially rely on amino acid sequence alone. It is anticipated that Cascleave will be a powerful tool for predicting novel substrate cleavage sites of caspases and shedding new insights on the unknown caspase-substrate interactivity relationship.},
audit-trail = {http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/6/752},
doi = {10.1093/bioinformatics/btq043},
keywords = {Bioinformatics},
publisher = {Oxford Univ Press},
related = {computational-biology},
}

Prodepth: Predict Residue Depth by Support Vector Regression Approach from Protein Sequences Only.
Song, J., Tan, H., Mahmood, K., Law, R. H. P., Buckle, A. M., Webb, G. I., Akutsu, T., & Whisstock, J. C.
PLoS ONE, 4(9), e7072, 2009.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl09,
author = {Song, J. and Tan, H. and Mahmood, K. and Law, R. H. P. and Buckle, A. M. and Webb, G. I. and Akutsu, T. and Whisstock, J. C.},
journal = {PLoS ONE},
title = {Prodepth: Predict Residue Depth by Support Vector Regression Approach from Protein Sequences Only},
year = {2009},
number = {9},
pages = {e7072},
volume = {4},
abstract = {Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.},
audit-trail = {http://www.plosone.org/article/info:doi/10.1371/journal.pone.0007072},
doi = {10.1371/journal.pone.0007072},
keywords = {Bioinformatics},
publisher = {PLOS},
related = {computational-biology},
}

RCPdb: An evolutionary classification and codon usage database for repeat-containing proteins.
Faux, N. G., Huttley, G. A., Mahmood, K., Webb, G. I., Garcia de la Banda, M., & Whisstock, J. C.
Genome Research, 17(7), 1118-1127, 2007.
[DOI] [Bibtex] [Abstract]

@Article{FauxHuttleyMahmoodWebbGarciaWhisstock07,
author = {Faux, N. G. and Huttley, G. A. and Mahmood, K. and Webb, G. I. and Garcia de la Banda, M. and Whisstock, J. C.},
journal = {Genome Research},
title = {RCPdb: An evolutionary classification and codon usage database for repeat-containing proteins},
year = {2007},
number = {7},
pages = {1118-1127},
volume = {17},
abstract = {Over 3% of human proteins contain single amino acid repeats (repeat-containing proteins, RCPs). Many repeats (homopeptides) localize to important proteins involved in transcription, and the expansion of certain repeats, in particular poly-Q and poly-A tracts, can also lead to the development of neurological diseases. Previous studies have suggested that the homopeptide makeup is a result of the presence of G+C-rich tracts in the encoding genes and that expansion occurs via replication slippage. Here, we have performed a large-scale genomic analysis of the variation of the genes encoding RCPs in 13 species and present these data in an online database (http://repeats.med.monash.edu.au/genetic_analysis/). This resource allows rapid comparison and analysis of RCPs, homopeptides, and their underlying genetic tracts across the eukaryotic species considered. We report three major findings. First, there is a bias for a small subset of codons being reiterated within homopeptides, and there is no G+C or A+T bias relative to the organism's transcriptome. Second, single base pair transversions from the homocodon are unusually common and may represent a mechanism of reducing the rate of homopeptide mutations. Third, homopeptides that are conserved across different species lie within regions that are under stronger purifying selection in contrast to nonconserved homopeptides.},
address = {Woodbury, New York},
doi = {10.1101/gr.6255407},
keywords = {Bioinformatics},
publisher = {Cold Spring Harbor Laboratory Press, ISSN 1088-9051/07},
related = {computational-biology},
}

Identifying markers of pathology in SAXS data of malignant tissues of the brain.
Siu, K. K. W., Butler, S. M., Beveridge, T., Gillam, J. E., Hall, C. J., Kaye, A. H., Lewis, R. A., Mannan, K., McLoughlin, G., Pearson, S., Round, A. R., Schultke, E., Webb, G. I., & Wilkinson, S. J.
Nuclear Instruments and Methods in Physics Research A, 548, 140-146, 2005.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{SiuEtAl05,
author = {Siu, K. K. W. and Butler, S. M. and Beveridge, T. and Gillam, J. E. and Hall, C. J. and Kaye, A. H. and Lewis, R. A. and Mannan, K. and McLoughlin, G. and Pearson, S. and Round, A. R. and Schultke, E. and Webb, G. I. and Wilkinson, S. J.},
journal = {Nuclear Instruments and Methods in Physics Research A},
title = {Identifying markers of pathology in SAXS data of malignant tissues of the brain},
year = {2005},
pages = {140-146},
volume = {548},
abstract = {Conventional neuropathological analysis for brain malignancies is heavily reliant on the observation of morphological abnormalities, observed in thin, stained sections of tissue. Small Angle X-ray Scattering (SAXS) data provide an alternative means of distinguishing pathology by examining the ultra-structural (nanometer length scales) characteristics of tissue. To evaluate the diagnostic potential of SAXS for brain tumors, data was collected from normal, malignant and benign tissues of the human brain at station 2.1 of the Daresbury Laboratory Synchrotron Radiation Source and subjected to data mining and multivariate statistical analysis. The results suggest SAXS data may be an effective classifier of malignancy.},
doi = {10.1016/j.nima.2005.03.081},
keywords = {Bioinformatics},
publisher = {Elsevier},
related = {computational-biology},
}

A Case Study in Feature Invention for Breast Cancer Diagnosis Using X-Ray Scatter Images.
Butler, S. M., Webb, G. I., & Lewis, R. A.
Lecture Notes in Artificial Intelligence Vol. 2903: Proceedings of the 16th Australian Conference on Artificial Intelligence (AI 03), Berlin/Heidelberg, pp. 677-685, 2003.
[PDF] [DOI] [Bibtex] [Abstract]

@InProceedings{ButlerWebbLewis03,
author = {Butler, S. M. and Webb, G. I. and Lewis, R. A.},
booktitle = {Lecture Notes in Artificial Intelligence Vol. 2903: Proceedings of the 16th Australian Conference on Artificial Intelligence (AI 03)},
title = {A Case Study in Feature Invention for Breast Cancer Diagnosis Using X-Ray Scatter Images},
year = {2003},
address = {Berlin/Heidelberg},
editor = {Gedeon, T.D. and Fung, L.C.C.},
pages = {677-685},
publisher = {Springer},
abstract = {X-ray mammography is the current method for screening for breast cancer, and like any technique, has its limitations. Several groups have reported differences in the X-ray scattering patterns of normal and tumour tissue from the breast. This gives rise to the hope that X-ray scatter analysis techniques may lead to a more accurate and cost effective method of diagnosing breast cancer which lends itself to automation. This is a particularly challenging exercise due to the inherent complexity of the information content in X-ray scatter patterns from complex heterogeneous tissue samples. We use a simple naive Bayes classifier, coupled with Equal Frequency Discretization (EFD) as our classification system. High-level features are extracted from the low-level pixel data. This paper reports some preliminary results in the ongoing development of this classification method that can distinguish between the diffraction patterns of normal and cancerous tissue, with particular emphasis on the invention of features for classification.},
doi = {10.1007/978-3-540-24581-0_58},
keywords = {Bioinformatics},
location = {Perth, Australia},
related = {computational-biology},
}