Data Scientist


Computational Biology

Along with colleagues in the Monash Faculty of Medicine, Nursing and Health Sciences, I am investigating applications of data science in biology. Most of this work uses machine learning to predict structural and functional features of proteins.

Publications

MetalExplorer, a Bioinformatics Tool for the Improved Prediction of Eight Types of Metal-binding Sites Using a Random Forest Algorithm with Two-step Feature Selection.
Song, J., Li, C., Zheng, C., Revote, J., Zhang, Z., & Webb, G. I.
Current Bioinformatics, 11, in press.

@Article{SongEtAl16,
Title = {MetalExplorer, a Bioinformatics Tool for the Improved Prediction of Eight Types of Metal-binding Sites Using a Random Forest Algorithm with Two-step Feature Selection},
Author = {Jiangning Song and Chen Li and Cheng Zheng and Jerico Revote and Ziding Zhang and Geoffrey I. Webb},
Journal = {Current Bioinformatics},
Year = {in press},
Volume = {11},
Abstract = {Metalloproteins are highly involved in many biological processes,
including catalysis, recognition, transport, transcription, and signal
transduction. The metal ions they bind usually play enzymatic or structural
roles in mediating these diverse functional roles. Thus, the systematic
analysis and prediction of metal-binding sites using sequence and/or
structural information are crucial for understanding their
sequence-structure-function relationships. In this study, we propose
MetalExplorer (http://metalexplorer.erc.monash.edu.au/), a new machine
learning-based method for predicting eight different types of metal-binding
sites (Ca, Co, Cu, Fe, Ni, Mg, Mn, and Zn) in proteins. Our approach
combines heterogeneous sequence-, structure-, and residue contact
network-based features. The predictive performance of MetalExplorer was
tested by cross-validation and independent tests using non-redundant
datasets of known structures. This method applies a two-step feature
selection approach based on the maximum relevance minimum redundancy and
forward feature selection to identify the most informative features that
contribute to the prediction performance. With a precision of 60%,
MetalExplorer achieved high recall values, which ranged from 59% to 88% for
the eight metal ion types in fivefold cross-validation tests. Moreover, the
common and type-specific features in the optimal subsets of all metal ions
were characterized in terms of their contributions to the overall
performance. In terms of both benchmark and independent datasets at the 60%
precision control level, MetalExplorer compared favorably with an existing
metalloprotein prediction tool, SitePredict. Thus, MetalExplorer is expected
to be a powerful tool for the accurate prediction of potential metal-binding
sites and it should facilitate the functional analysis and rational design
of novel metalloproteins.},
Doi = {10.2174/2468422806666160618091522},
ISSN = {1574-8936/2212-392X},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}
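The two-step feature selection named in the abstract above (maximum relevance minimum redundancy, then forward selection) can be sketched roughly as follows. This is a minimal illustration and not the MetalExplorer implementation: absolute Pearson correlation stands in for the mutual-information terms usually used in mRMR, the "keep a feature only if the score improves" rule is one common variant of forward selection, and `evaluate` is a hypothetical callback (for example, a cross-validated random forest score).

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def mrmr_rank(columns, target, k):
    """Step 1: greedily rank up to k features by relevance minus mean redundancy.

    columns maps a feature name to its list of values. Correlation is an
    assumed stand-in for mutual information, chosen to keep the sketch short.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in columns.items()}
    selected = []
    remaining = sorted(columns)  # fixed order so ties resolve deterministically

    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = mean(
                abs(pearson(columns[name], columns[s])) for s in selected
            )
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def forward_select(ranked, evaluate):
    """Step 2: walk the ranked list and keep a feature only if it raises the score.

    evaluate(feature_names) is a hypothetical callback returning a number
    where higher is better.
    """
    kept, best = [], float("-inf")
    for name in ranked:
        candidate_score = evaluate(kept + [name])
        if candidate_score > best:
            kept.append(name)
            best = candidate_score
    return kept
```

The first step cheaply discards features that duplicate information already selected; the second step spends the more expensive model evaluations only on the shortlisted features.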

Comprehensive assessment and performance improvement of effector protein predictors for bacterial secretion systems III, IV and VI.
An, Y., Wang, J., Li, C., Leier, A., Marquez-Lago, T., Wilksch, J., Zhang, Y., Webb, G. I., Song, J., & Lithgow, T.
Briefings in Bioinformatics, Art. no. bbw100, 2017.

SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems.
An, Y., Wang, J., Li, C., Revote, J., Zhang, Y., Naderer, T., Hayashida, M., Akutsu, T., Webb, G. I., Lithgow, T., & Song, J.
Scientific Reports, 7, Art. no. 41031, 2017.

@Article{AnEtAl17,
Title = {SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems},
Author = {An, Yi and Wang, Jiawei and Li, Chen and Revote, Jerico and Zhang, Yang and Naderer, Thomas and Hayashida, Morihiro and Akutsu, Tatsuya and Webb, Geoffrey I. and Lithgow, Trevor and Song, Jiangning},
Journal = {Scientific Reports},
Year = {2017},
Volume = {7},
Articlenumber = {41031},
Doi = {10.1038/srep41031},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://rdcu.be/oJ9I}
}

POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles.
Wang, J., Yang, B., Revote, J., Leier, A., Marquez-Lago, T. T., Webb, G. I., Song, J., Chou, K., & Lithgow, T.
Bioinformatics, 33(17), 2017.

PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy.
Song, J., Li, F., Leier, A., Marquez-Lago, T. T., Akutsu, T., Haffari, G., Chou, K., Webb, G. I., & Pike, R. N.
Bioinformatics, Art. no. btx670, 2017.

@Article{Song2017a,
Title = {PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy},
Author = {Song, Jiangning and Li, Fuyi and Leier, André and Marquez-Lago, Tatiana T and Akutsu, Tatsuya and Haffari, Gholamreza and Chou, Kuo-Chen and Webb, Geoffrey I and Pike, Robert N},
Journal = {Bioinformatics},
Year = {2017},
Articlenumber = {btx670},
Doi = {10.1093/bioinformatics/btx670},
Eprint = {/oup/backfile/content_public/journal/bioinformatics/pap/10.1093_bioinformatics_btx670/1/btx670.pdf},
Keywords = {Bioinformatics},
Related = {computational-biology},
Url = {https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btx670/4562332/PROSPERous-highthroughput-prediction-of-substrate?guestAccessKey=668859da-9d97-47cf-b655-31cc5aa931aa}
}

Knowledge-transfer learning for prediction of matrix metalloprotease substrate-cleavage sites.
Wang, Y., Song, J., Marquez-Lago, T. T., Leier, A., Li, C., Lithgow, T., Webb, G. I., & Shen, H.
Scientific Reports, 7, Art. no. 5755, 2017.

Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli.
Chang, C. C. H., Li, C., Webb, G. I., Tey, B., & Song, J.
Scientific Reports, 6, Art. no. 21844, 2016.

@Article{ChangEtAl2016,
Title = {Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli},
Author = {C.C.H. Chang and C. Li and G. I. Webb and B. Tey and J. Song},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement with current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture, to predict the real-valued expression level of target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) classifier to be used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson’s correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.},
Articlenumber = {21844},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1038/srep21844}
}
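The two-stage architecture described in the Periscope abstract, where a first-stage classifier decides which second-stage regressor produces the final value, can be sketched as below. This is only an illustration of the dispatch pattern under stated assumptions: the class labels "low" and "high", the toy threshold classifier and the toy linear regressors are all hypothetical stand-ins, not the SVM and SVR models used in the paper.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]


def make_two_stage_predictor(
    classify: Callable[[Features], str],
    regressors: Dict[str, Callable[[Features], float]],
) -> Callable[[Features], float]:
    """Build a predictor that routes each input through a class-specific regressor.

    classify   -- first stage: maps features to a class label.
    regressors -- second stage: one regression function per class label.
    """
    def predict(x: Features) -> float:
        label = classify(x)           # stage 1: choose the expression class
        return regressors[label](x)   # stage 2: real-valued estimate for that class

    return predict


# Hypothetical stand-ins for trained models, used only to make the sketch runnable.
def toy_classifier(x: Features) -> str:
    return "high" if x[0] > 0.5 else "low"


toy_regressors = {
    "low": lambda x: 0.1 * x[1],
    "high": lambda x: 1.0 + 2.0 * x[1],
}

predict = make_two_stage_predictor(toy_classifier, toy_regressors)
```

Splitting the problem this way lets each regressor be fitted only on examples from its own expression range, which can help when the target spans several orders of magnitude.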

Smoothing a rugged protein folding landscape by sequence-based redesign.
Porebski, B. T., Keleher, S., Hollins, J. J., Nickson, A. A., Marijanovic, E. M., Borg, N. A., Costa, M. G. S., Pearce, M. A., Dai, W., Zhu, L., Irving, J. A., Hoke, D. E., Kass, I., Whisstock, J. C., Bottomley, S. P., Webb, G. I., McGowan, S., & Buckle, A. M.
Scientific Reports, 6, Art. no. 33958, 2016.

@Article{Porebski2016,
Title = {Smoothing a rugged protein folding landscape by sequence-based redesign},
Author = {Porebski, Benjamin T. and Keleher, Shani and Hollins, Jeffrey J. and Nickson, Adrian A. and Marijanovic, Emilia M. and Borg, Natalie A. and Costa, Mauricio G. S. and Pearce, Mary A. and Dai, Weiwen and Zhu, Liguang and Irving, James A. and Hoke, David E. and Kass, Itamar and Whisstock, James C. and Bottomley, Stephen P. and Webb, Geoffrey I. and McGowan, Sheena and Buckle, Ashley M.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The rugged folding landscapes of functional proteins puts them at risk of misfolding and aggregation. Serine protease inhibitors, or serpins, are paradigms for this delicate balance between function and misfolding. Serpins exist in a metastable state that undergoes a major conformational change in order to inhibit proteases. However, conformational labiality of the native serpin fold renders them susceptible to misfolding, which underlies misfolding diseases such as alpha1-antitrypsin deficiency. To investigate how serpins balance function and folding, we used consensus design to create conserpin, a synthetic serpin that folds reversibly, is functional, thermostable, and polymerization resistant. Characterization of its structure, folding and dynamics suggest that consensus design has remodeled the folding landscape to reconcile competing requirements for stability and function. This approach may offer general benefits for engineering functional proteins that have risky folding landscapes, including the removal of aggregation-prone intermediates, and modifying scaffolds for use as protein therapeutics.},
Articlenumber = {33958},
Doi = {10.1038/srep33958},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1038/srep33958}
}

Crysalis: an integrated server for computational analysis and design of protein crystallization.
Wang, H., Feng, L., Zhang, Z., Webb, G. I., Lin, D., & Song, J.
Scientific Reports, 6, Art. no. 21383, 2016.

@Article{WangEtAl16,
Title = {Crysalis: an integrated server for computational analysis and design of protein crystallization},
Author = {Wang, H. and Feng, L. and Zhang, Z. and Webb, G. I. and Lin, D. and Song, J.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The failure of multi-step experimental procedures to yield diffraction-quality crystals is a major bottleneck in protein structure determination. Accordingly, several bioinformatics methods have been successfully developed and employed to select crystallizable proteins. Unfortunately, the majority of existing in silico methods only allow the prediction of crystallization propensity, seldom enabling computational design of protein mutants that can be targeted for enhancing protein crystallizability. Here, we present Crysalis, an integrated crystallization analysis tool that builds on support-vector regression (SVR) models to facilitate computational protein crystallization prediction, analysis, and design. More specifically, the functionality of this new tool includes: (1) rapid selection of target crystallizable proteins at the proteome level, (2) identification of site non-optimality for protein crystallization and systematic analysis of all potential single-point mutations that might enhance protein crystallization propensity, and (3) annotation of target protein based on predicted structural properties. We applied the design mode of Crysalis to identify site non-optimality for protein crystallization on a proteome-scale, focusing on proteins currently classified as non-crystallizable. Our results revealed that site non-optimality is based on biases related to residues, predicted structures, physicochemical properties, and sequence loci, which provides in-depth understanding of the features influencing protein crystallization. Crysalis is freely available at http://nmrcen.xmu.edu.cn/crysalis/.},
Articlenumber = {21383},
Doi = {10.1038/srep21383},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}
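The "systematic analysis of all potential single-point mutations" mentioned in the Crysalis abstract amounts to enumerating every one-residue substitution and ranking the mutants by predicted gain. A minimal sketch follows, assuming a hypothetical `score` function that maps a sequence to a crystallization propensity; it is not the Crysalis SVR model, and the function names are illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_point_mutants(seq):
    """Yield (position, wild_type, mutant_residue, mutant_sequence) for every substitution."""
    for pos, wild in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wild:
                yield pos, wild, aa, seq[:pos] + aa + seq[pos + 1:]


def rank_mutations(seq, score, top=5):
    """Return the top mutations as (gain, label), where label looks like 'C2A'.

    score is a hypothetical callable: sequence -> predicted propensity.
    Positions in labels are 1-based, following the usual mutation notation.
    """
    base = score(seq)
    gains = [
        (score(mutant) - base, f"{wild}{pos + 1}{aa}")
        for pos, wild, aa, mutant in single_point_mutants(seq)
    ]
    # Largest gain first; ties broken alphabetically by label for stable output.
    gains.sort(key=lambda g: (-g[0], g[1]))
    return gains[:top]
```

A sequence of length n yields 19n mutants, so an exhaustive scan stays cheap for single substitutions even at proteome scale, provided the scoring function itself is fast.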

GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features.
Li, F., Li, C., Revote, J., Zhang, Y., Webb, G. I., Li, J., Song, J., & Lithgow, T.
Scientific Reports, 6, Art. no. 34595, 2016.

@Article{LiEtAl16,
Title = {GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features},
Author = {Li, Fuyi and Li, Chen and Revote, Jerico and Zhang, Yang and Webb, Geoffrey I. and Li, Jian and Song, Jiangning and Lithgow, Trevor},
Journal = {Scientific Reports},
Year = {2016},
Month = oct,
Volume = {6},
Articlenumber = {34595},
Doi = {10.1038/srep34595},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome.
Li, F., Li, C., Wang, M., Webb, G. I., Zhang, Y., Whisstock, J. C., & Song, J.
Bioinformatics, 31(9), 1411-1419, 2015.

@Article{LiEtAl15,
Title = {GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome},
Author = {F. Li and C. Li and M. Wang and G. I. Webb and Y. Zhang and J. C. Whisstock and J. Song},
Journal = {Bioinformatics},
Year = {2015},
Number = {9},
Pages = {1411-1419},
Volume = {31},
Abstract = {Motivation: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50% of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM.
Results: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1093/bioinformatics/btu852}
}

Structural and dynamic properties that govern the stability of an engineered fibronectin type III domain.
Porebski, B. T., Nickson, A. A., Hoke, D. E., Hunter, M. R., Zhu, L., McGowan, S., Webb, G. I., & Buckle, A. M.
Protein Engineering, Design and Selection, 28(3), 67-78, 2015.

Accurate in Silico Identification of Species-Specific Acetylation Sites by Integrating Protein Sequence-Derived and Functional Features.
Li, Y., Wang, M., Wang, H., Tan, H., Zhang, Z., Webb, G. I., & Song, J.
Scientific Reports, 4, Art. no. 5765, 2014.

@Article{LiEtAl2014,
Title = {Accurate in Silico Identification of Species-Specific Acetylation Sites by Integrating Protein Sequence-Derived and Functional Features},
Author = {Y. Li and M. Wang and H. Wang and H. Tan and Z. Zhang and G. I. Webb and J. Song},
Journal = {Scientific Reports},
Year = {2014},
Volume = {4},
Abstract = {Lysine acetylation is a reversible post-translational modification, playing an important role in cytokine signaling, transcriptional regulation, and apoptosis. To fully understand acetylation mechanisms, identification of substrates and specific acetylation sites is crucial. Experimental identification is often time-consuming and expensive. Alternative bioinformatics methods are cost-effective and can be used in a high-throughput manner to generate relatively precise predictions. Here we develop a method termed as SSPKA for species-specific lysine acetylation prediction, using random forest classifiers that combine sequence-derived and functional features with two-step feature selection. Feature importance analysis indicates functional features, applied for lysine acetylation site prediction for the first time, significantly improve the predictive performance. We apply the SSPKA model to screen the entire human proteome and identify many high-confidence putative substrates that are not previously identified. The results along with the implemented Java tool, serve as useful resources to elucidate the mechanism of lysine acetylation and facilitate hypothesis-driven experimental design and validation.},
Articlenumber = {5765},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1038/srep05765}
}

PROSPER: An Integrated Feature-Based Tool for Predicting Protease Substrate Cleavage Sites.
Song, J., Tan, H., Perry, A. J., Akutsu, T., Webb, G. I., Whisstock, J. C., & Pike, R. N.
PLoS ONE, 7(11), e50300, 2012.

@Article{SongEtAl12b,
Title = {PROSPER: An Integrated Feature-Based Tool for Predicting Protease Substrate Cleavage Sites},
Author = {J. Song and H. Tan and A.J. Perry and T. Akutsu and G.I. Webb and J.C. Whisstock and R.N. Pike},
Journal = {PLoS ONE},
Year = {2012},
Month = {11},
Number = {11},
Pages = {e50300},
Volume = {7},
Abstract = {The ability to catalytically cleave protein substrates after synthesis is fundamental for all forms of life. Accordingly, site-specific proteolysis is one of the most important post-translational modifications. The key to understanding the physiological role of a protease is to identify its natural substrate(s). Knowledge of the substrate specificity of a protease can dramatically improve our ability to predict its target protein substrates, but this information must be utilized in an effective manner in order to efficiently identify protein substrates by in silico approaches. To address this problem, we present PROSPER, an integrated feature-based server for in silico identification of protease substrates and their cleavage sites for twenty-four different proteases. PROSPER utilizes established specificity information for these proteases (derived from the MEROPS database) with a machine learning approach to predict protease cleavage sites by using different, but complementary sequence and structure characteristics. Features used by PROSPER include local amino acid sequence profile, predicted secondary structure, solvent accessibility and predicted native disorder. Thus, for proteases with known amino acid specificity, PROSPER provides a convenient, pre-prepared tool for use in identifying protein substrates for the enzymes. Systematic prediction analysis for the twenty-four proteases thus far included in the database revealed that the features we have included in the tool strongly improve performance in terms of cleavage site prediction, as evidenced by their contribution to performance improvement in terms of identifying known cleavage sites in substrates for these enzymes. In comparison with two state-of-the-art prediction tools, PoPS and SitePrediction, PROSPER achieves greater accuracy and coverage. To our knowledge, PROSPER is the first comprehensive server capable of predicting cleavage sites of multiple proteases within a single substrate sequence using machine learning techniques. It is freely available at http://lightning.med.monash.edu.au/PROSPER/.},
Keywords = {Bioinformatics},
Publisher = {Public Library of Science},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1371%2Fjournal.pone.0050300}
}

TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences.
Song, J., Tan, H., Wang, M., Webb, G. I., & Akutsu, T.
PLoS ONE, 7(2), e30361, 2012.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl12,
Title = {TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences},
Author = {Song, Jiangning and Tan, Hao and Wang, Mingjun and Webb, Geoffrey I. and Akutsu, Tatsuya},
Journal = {PLoS ONE},
Year = {2012},
Month = {02},
Number = {2},
Pages = {e30361},
Volume = {7},
Abstract = {Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi)
and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine
the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information
can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to
predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to
perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary
profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered
region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins,
the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8° and 44.6°, respectively, which are 1% and 3% respectively
lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a
random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the
Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting
protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely
accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.},
Doi = {10.1371/journal.pone.0030361},
Keywords = {Bioinformatics},
Publisher = {Public Library of Science},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1371%2Fjournal.pone.0030361}
}

Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs.
Mahmood, K., Webb, G. I., Song, J., Whisstock, J. C., & Konagurthu, A. S.
Nucleic Acids Research, 40(6), e44, 2012.
[DOI] [Bibtex]

Discovery of Amino Acid Motifs for Thrombin Cleavage and Validation Using a Model Substrate.
Ng, N. M., Pierce, J. D., Webb, G. I., Ratnikov, B. I., Wijeyewickrema, L. C., Duncan, R. C., Robertson, A. L., Bottomley, S. P., Boyd, S. E., & Pike, R. N.
Biochemistry, 50(48), 10499-10507, 2011.
[DOI] [Bibtex] [Abstract]

@Article{NgEtAl11,
Title = {Discovery of Amino Acid Motifs for Thrombin Cleavage and Validation Using a Model Substrate},
Author = {Ng, N.M. and Pierce, J.D. and Webb, G.I. and Ratnikov, B.I. and Wijeyewickrema, L.C. and Duncan, R.C. and Robertson, A.L. and Bottomley, S.P. and Boyd, S.E. and Pike, R.N.},
Journal = {Biochemistry},
Year = {2011},
Number = {48},
Pages = {10499-10507},
Volume = {50},
Abstract = {Understanding the active site preferences of an enzyme is critical to the design of effective inhibitors and to gaining insights into its mechanisms of action on substrates. While the subsite specificity of thrombin is understood, it is not clear whether the enzyme prefers individual amino acids at each subsite in isolation or prefers to cleave combinations of amino acids as a motif. To investigate whether preferred peptide motifs for cleavage could be identified for thrombin, we exposed a phage-displayed peptide library to thrombin. The resulting preferentially cleaved substrates were analyzed using the technique of association rule discovery. The results revealed that thrombin selected for amino acid motifs in cleavage sites. The contribution of these hypothetical motifs to substrate cleavage efficiency was further investigated using the B1 IgG-binding domain of streptococcal protein G as a model substrate. Introduction of a P2-P1' LRS thrombin cleavage sequence within a major loop of the protein led to cleavage of the protein by thrombin, with the cleavage efficiency increasing with the length of the loop. Introduction of further P3-P1 and P1-P1'-P3' amino acid motifs into the loop region yielded greater cleavage efficiencies, suggesting that the susceptibility of a protein substrate to cleavage by thrombin is influenced by these motifs, perhaps because of cooperative effects between subsites closest to the scissile peptide bond.},
Doi = {10.1021/bi201333g},
Eprint = {http://pubs.acs.org/doi/pdf/10.1021/bi201333g},
Keywords = {Bioinformatics},
Related = {computational-biology},
Url = {http://pubs.acs.org/doi/abs/10.1021/bi201333g}
}

Bioinformatic Approaches for Predicting Substrates of Proteases.
Song, J., Tan, H., Boyd, S. E., Shen, H., Mahmood, K., Webb, G. I., Akutsu, T., Whisstock, J. C., & Pike, R. N.
Journal of Bioinformatics and Computational Biology, 9(1), 149-178, 2011.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl11,
Title = {Bioinformatic Approaches for Predicting Substrates of Proteases},
Author = {J. Song and H. Tan and S.E. Boyd and H. Shen and K. Mahmood and G.I. Webb and T. Akutsu and J.C. Whisstock and R.N. Pike},
Journal = {Journal of Bioinformatics and Computational Biology},
Year = {2011},
Number = {1},
Pages = {149-178},
Volume = {9},
Abstract = {Proteases have central roles in "life and death" processes due to their important ability to catalytically hydrolyse protein substrates, usually altering the function and/or activity of the target in the process. Knowledge of the substrate specificity of a protease should, in theory, dramatically improve the ability to predict target protein substrates. However, experimental identification and characterization of protease substrates is often difficult and time-consuming. Thus solving the "substrate identification" problem is fundamental to both understanding protease biology and the development of therapeutics that target specific protease-regulated pathways. In this context, bioinformatic prediction of protease substrates may provide useful and experimentally testable information about novel potential cleavage sites in candidate substrates. In this article, we provide an overview of recent advances in developing bioinformatic approaches for predicting protease substrate cleavage sites and identifying novel putative substrates. We discuss the advantages and drawbacks of the current methods and detail how more accurate models can be built by deriving multiple sequence and structural features of substrates. We also provide some suggestions about how future studies might further improve the accuracy of protease substrate specificity prediction.},
Audit-trail = {http://www.worldscinet.com/jbcb/00/0001/S0219720011005288.html},
Doi = {10.1142/S0219720011005288},
Keywords = {Bioinformatics},
Publisher = {World Scientific},
Related = {computational-biology}
}

Cascleave: Towards More Accurate Prediction of Caspase Substrate Cleavage Sites.
Song, J., Tan, H., Shen, H., Mahmood, K., Boyd, S. E., Webb, G. I., Akutsu, T., & Whisstock, J. C.
Bioinformatics, 26(6), 752-760, 2010.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl10,
Title = {Cascleave: Towards More Accurate Prediction of Caspase Substrate Cleavage Sites},
Author = {J. Song and H. Tan and H. Shen and K. Mahmood and S.E. Boyd and G.I. Webb and T. Akutsu and J.C. Whisstock},
Journal = {Bioinformatics},
Year = {2010},
Number = {6},
Pages = {752-760},
Volume = {26},
Abstract = {Motivation: The caspase family of cysteine proteases play essential roles in key biological processes such as programmed cell death, differentiation, proliferation, necrosis and inflammation. The complete repertoire of caspase substrates remains to be fully characterized. Accordingly, systematic computational screening studies of caspase substrate cleavage sites may provide insight into the substrate specificity of caspases and further facilitating the discovery of putative novel substrates. Results: In this article we develop an approach (termed Cascleave) to predict both classical (i.e. following a P1 Asp) and non-typical caspase cleavage sites. When using local sequence-derived profiles, Cascleave successfully predicted 82.2% of the known substrate cleavage sites, with a Matthews correlation coefficient (MCC) of 0.667. We found that prediction performance could be further improved by incorporating information such as predicted solvent accessibility and whether a cleavage sequence lies in a region that is most likely natively unstructured. Novel bi-profile Bayesian signatures were found to significantly improve the prediction performance and yielded the best performance with an overall accuracy of 87.6% and a MCC of 0.747, which is higher accuracy than published methods that essentially rely on amino acid sequence alone. It is anticipated that Cascleave will be a powerful tool for predicting novel substrate cleavage sites of caspases and shedding new insights on the unknown caspase-substrate interactivity relationship.},
Audit-trail = {http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/6/752},
Doi = {10.1093/bioinformatics/btq043},
Keywords = {Bioinformatics},
Publisher = {Oxford Univ Press},
Related = {computational-biology}
}

EGM: Encapsulated Gene-by-Gene Matching to Identify Gene Orthologs and Homologous Segments in Genomes.
Mahmood, K., Konagurthu, A. S., Song, J., Buckle, A. M., Webb, G. I., & Whisstock, J. C.
Bioinformatics, 26(17), 2076-2084, 2010.
[DOI] [Bibtex] [Abstract]

@Article{MahmoodEtAl10,
Title = {EGM: Encapsulated Gene-by-Gene Matching to Identify Gene Orthologs and Homologous Segments in Genomes},
Author = {K. Mahmood and A.S. Konagurthu and J. Song and A.M. Buckle and G.I. Webb and J.C. Whisstock},
Journal = {Bioinformatics},
Year = {2010},
Number = {17},
Pages = {2076-2084},
Volume = {26},
Abstract = {Motivation: Identification of functionally equivalent genes in different species is essential to understand the evolution of biological pathways and processes. At the same time, identification of strings of conserved orthologous genes helps identify complex genomic rearrangements across different organisms. Such an insight is particularly useful, for example, in the transfer of experimental results between different experimental systems such as Drosophila and mammals.
Results: Here we describe the Encapsulated Gene-by-gene Matching (EGM) approach, a method that employs a graph matching strategy to identify gene orthologs and conserved gene segments. Given a pair of genomes, EGM constructs a global gene match for all genes taking into account gene context and family information. The Hungarian method for identifying the maximum weight matching in bipartite graphs is employed, where the resulting matching reveals one-to-one correspondences between nodes (genes) in a manner that maximizes the gene similarity and context.
Conclusion: We tested our approach by performing several comparisons including a detailed Human v Mouse genome mapping. We find that the algorithm is robust and sensitive in detecting orthologs and conserved gene segments. EGM can sensitively detect rearrangements within large and small chromosomal segments. The EGM tool is fully automated and easy to use compared to other more complex methods that also require extensive manual intervention and input.
},
Audit-trail = {http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq339v1},
Doi = {10.1093/bioinformatics/btq339},
Keywords = {Bioinformatics},
Publisher = {Oxford Univ Press},
Related = {computational-biology}
}

Prodepth: Predict Residue Depth by Support Vector Regression Approach from Protein Sequences Only.
Song, J., Tan, H., Mahmood, K., Law, R. H. P., Buckle, A. M., Webb, G. I., Akutsu, T., & Whisstock, J. C.
PLoS ONE, 4(9), e7072, 2009.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl09,
Title = {Prodepth: Predict Residue Depth by Support Vector Regression Approach from Protein Sequences Only},
Author = {J. Song and H. Tan and K. Mahmood and R.H.P. Law and A.M. Buckle and G.I. Webb and T. Akutsu and J.C. Whisstock},
Journal = {PLoS ONE},
Year = {2009},
Number = {9},
Pages = {e7072},
Volume = {4},
Abstract = {Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.},
Audit-trail = {http://www.plosone.org/article/info:doi/10.1371/journal.pone.0007072},
Doi = {10.1371/journal.pone.0007072},
Keywords = {Bioinformatics},
Publisher = {PLOS},
Related = {computational-biology}
}

RCPdb: An evolutionary classification and codon usage database for repeat-containing proteins.
Faux, N. G., Huttley, G. A., Mahmood, K., Webb, G. I., de la Banda, G. M., & Whisstock, J. C.
Genome Research, 17(7), 1118-1127, 2007.
[DOI] [Bibtex]

Identifying markers of pathology in SAXS data of malignant tissues of the brain.
Siu, K. K. W., Butler, S. M., Beveridge, T., Gillam, J. E., Hall, C. J., Kaye, A. H., Lewis, R. A., Mannan, K., McLoughlin, G., Pearson, S., Round, A. R., Schultke, E., Webb, G. I., & Wilkinson, S. J.
Nuclear Instruments and Methods in Physics Research A, 548, 140-146, 2005.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{SiuEtAl05,
Title = {Identifying markers of pathology in SAXS data of malignant tissues of the brain},
Author = {K.K.W. Siu and S.M. Butler and T. Beveridge and J.E. Gillam and C.J. Hall and A.H. Kaye and R.A. Lewis and K. Mannan and G. McLoughlin and S. Pearson and A.R. Round and E. Schultke and G.I. Webb and S.J. Wilkinson},
Journal = {Nuclear Instruments and Methods in Physics Research A},
Year = {2005},
Pages = {140-146},
Volume = {548},
Abstract = {Conventional neuropathological analysis for brain malignancies is heavily reliant on the observation of morphological abnormalities, observed in thin, stained sections of tissue. Small Angle X-ray Scattering (SAXS) data provide an alternative means of distinguishing pathology by examining the ultra-structural (nanometer length scales) characteristics of tissue. To evaluate the diagnostic potential of SAXS for brain tumors, data was collected from normal, malignant and benign tissues of the human brain at station 2.1 of the Daresbury Laboratory Synchrotron Radiation Source and subjected to data mining and multivariate statistical analysis. The results suggest SAXS data may be an effective classifier of malignancy.},
Doi = {10.1016/j.nima.2005.03.081},
Keywords = {Bioinformatics},
Publisher = {Elsevier},
Related = {computational-biology}
}

A Case Study in Feature Invention for Breast Cancer Diagnosis Using X-Ray Scatter Images.
Butler, S. M., Webb, G. I., & Lewis, R. A.
Lecture Notes in Artificial Intelligence Vol. 2903: Proceedings of the 16th Australian Conference on Artificial Intelligence (AI 03), Berlin/Heidelberg, pp. 677-685, 2003.
[PDF] [DOI] [Bibtex] [Abstract]

@InProceedings{ButlerWebbLewis03,
Title = {A Case Study in Feature Invention for Breast Cancer Diagnosis Using X-Ray Scatter Images},
Author = {S. M. Butler and G.I. Webb and R.A. Lewis},
Booktitle = {Lecture Notes in Artificial Intelligence Vol. 2903: Proceedings of the 16th Australian Conference on Artificial Intelligence (AI 03)},
Year = {2003},
Address = {Berlin/Heidelberg},
Editor = {T.D. Gedeon and L.C.C. Fung},
Pages = {677-685},
Publisher = {Springer},
Abstract = {X-ray mammography is the current method for screening for breast cancer, and like any technique, has its limitations. Several groups have reported differences in the X-ray scattering patterns of normal and tumour tissue from the breast. This gives rise to the hope that X-ray scatter analysis techniques may lead to a more accurate and cost effective method of diagnosing breast cancer which lends itself to automation. This is a particularly challenging exercise due to the inherent complexity of the information content in X-ray scatter patterns from complex heterogenous tissue samples. We use a simple naive Bayes classifier, coupled with Equal Frequency Discretization (EFD) as our classification system. High-level features are extracted from the low-level pixel data. This paper reports some preliminary results in the ongoing development of this classification method that can distinguish between the diffraction patterns of normal and cancerous tissue, with particular emphasis on the invention of features for classification.},
Doi = {10.1007/978-3-540-24581-0_58},
Keywords = {Bioinformatics},
Location = {Perth, Australia},
Related = {computational-biology}
}