Uncategorized Archives - Page 3 of 4 - Geoff Webb

Finding real associations with R

Geoff WebbJuly 9, 2017October 5, 2021

Our OPUS Miner package is now available in R. It finds statistically significant complex interactions in data. We would value you feedback. It can be downloaded from https://cran.r-project.org/package=opusminer.

Elected to ACM SIGKDD Board of Directors

Geoff WebbJune 17, 2017October 5, 2021

I am honoured to have been elected to the Board of the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). I am looking forward to working with the new Board, led by incoming Chair Jian Pei, to further strengthen our community's leading conference and to support the community around it.

Encyclopedia of Machine Learning and Data Mining

Geoff WebbApril 13, 2017October 5, 2021

We are delighted to have the second edition of the highly successful Encyclopedia of Machine Learning go live. The revised and expanded second edition has been re-titled the Encyclopedia of Machine Learning and Data Mining.

Usama Fayyad and my keynote addresses at Practical Big Data 2017

Geoff WebbFebruary 3, 2017October 5, 2021

Encyclopedia of Machine Learning still most downloaded Springer Reference

Geoff WebbJanuary 31, 2017October 5, 2021

It is good to see that the first edition of our Encyclopedia of Machine Learning is still serving the community.

#SpringerRefCountdown! #1 most downloaded entry last month: https://t.co/hZ2j7FJQmN from our Encyclopedia of #MachineLearning! pic.twitter.com/GdGuTDkfnA

— SpringerReference (@SpringerRef) January 31, 2017

Springer is now taking preorders for the next edition, The Encyclopedia of Machine Learning and Data Mining.

Two awards in one week!

Geoff WebbDecember 2, 2016October 5, 2021

I am honoured to have received the Australian Computer Society's ICT Researcher of the Year Award and the Australasian Artificial Intelligence Distinguished Research Contributions Award.

Symposium on Data Science for Good Governing, Dec 5

Geoff WebbNovember 3, 2016October 5, 2021

Fun KDD video

Geoff WebbJune 30, 2016October 5, 2021

Statistical testing of hypothesis streams and cascades

Geoff WebbJune 1, 2016October 5, 2021

Statistical hypothesis testing was developed in an age when calculations were performed by hand and individual hypotheses were considered in isolation.

In the modern information era, it is often desirable to assess large numbers of related hypotheses. Unless explicit steps are taken to control the risk, it becomes a near certainty that if large numbers of the hypotheses that one seeks to reject are in fact true then many will be rejected in error. Multiple testing corrections control this risk, but previous approaches have required that the hypotheses to be assessed (or at least an upper bound on their number) are known in advance.

Sometimes, however, there is a need to test some hypotheses before other related hypotheses are known. We call this a hypothesis stream. In some hypothesis streams, the decision of which hypotheses to reject at one time affect which hypotheses will be considered subsequently. We call this a hypothesis cascade.

One context in which hypothesis cascades arise is model selection, such as forward feature selection. This process starts with an empty model. Then, at each step all models are considered that add a single feature to the current model. A hypothesis test is conducted on every such new model. The model with the best test result (lowest p) is then accepted, and the process is repeated considering all single feature additions to this new model. This is repeated until no improvement is statistically significant. The model selected at each round determines the models to be tested subsequently, and it is not possible to determine in advance how many rounds it will take until the process terminates and hence how many statistical tests will be performed.

We have developed a statistical testing procedure that strictly controls the risk of familywise error - the risk of incorrectly rejecting any null hypothesis out of the entire stream. It does so without requiring any information other than the history of the analysis so far and the current pool of hypotheses under consideration. Studies of its performance in the context of selecting the edges to include in a graphical model indicate that it achieves this strict control while maintaining high statistical power.

Statistical hypothesis testing provides a powerful and mature toolkit for data analytics. These techniques help bring these powerful tools to bear in the age of big data.

Watch the promotional video for our paper:

[embedyt] http://www.youtube.com/watch?v=FBlhhebFhTI&width=640&height=390&rel=0&showinfo=0[/embedyt]

Publication

A multiple test correction for streams and cascades of statistical hypothesis tests.
Webb, G. I., & Petitjean, F.
Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 1255-1264, 2016.
Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue
[Bibtex] [Abstract] → Download PDF → Access on publisher site → Related papers and software

@InProceedings{WebbPetitjean16,
Title = {A multiple test correction for streams and cascades of statistical hypothesis tests},
Author = {Webb, Geoffrey I. and Petitjean, Francois},
Booktitle = {Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16},
Year = {2016},
Pages = {1255-1264},
Publisher = {ACM Press},
Abstract = {Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.
To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.},
Comment = {Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue},
Doi = {10.1145/2939672.2939775},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {statistically-sound-association-discovery},
Url = {http://dl.acm.org/authorize?N19100}
}

ABSTRACT Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance. This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed. To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models. We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.

Did you realise that it is dangerous to assume data are interval scale?

Geoff WebbMay 22, 2016October 5, 2021

Many machine learning algorithms make an implicit assumption that numeric data are interval scale, specifically, that a unit difference between values has the same meaning irrespective of the magnitude of those values. For example, most distance measures would rate values of 19 and 21 to both be the same distance from a value of 20. However, suppose the three values are measures of miles per gallon. It may be arbitrary whether the variable in question was recorded as mile per gallon or gallons per mile. If they had been expressed as the latter, they would be 0.0526, 0.5000 and 0.0476. The value corresponding to 20 (0.5000) would be closer to the value corresponding to 21 (0.0476) than to the value corresponding to 19 (0.0526).

Examples of learning algorithms that make this assumption include any that use conventional distance measures and most linear classifiers.

Our studies [1] have shown that for tasks as diverse as information retrieval and clustering, applying transformations to the data, such as replacing values by their square root or natural logarithm, often improves performance, indicating that the interval scale assumption is not justified.

A weaker assumption than the interval scale assumption is that numeric data are ordinal. Under this assumption, order matters, but the magnitudes of differences in values are not specified. Hence, for ordinal data, we can assert that 21 is more similar to 20 than 19, but not that 21 is more similar to 20 than is 18. Our studies have shown that converting data to ranks often improves performance across a range of machine learning algorithms [1].

However, conversion to ranks entails a significant computational overhead if a learned model is to be applied to unseen data. Mapping a new value onto a rank in a training set is an operation of order log training set size. In consequence, it can be advantageous to use algorithms that assume only that data are ordinal scale, as do decision trees and algorithms built thereon, such as random forests.

Reference

[1]

Fernando, T. L., & Webb, G. I. (2017). SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data Mining and Knowledge Discovery, 31(1), 264-286.
[Bibtex]

@Article{FernandoWebb16,
Title = {SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption},
Author = {Fernando, Thilak L. and Webb, Geoffrey I.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2017},
Number = {1},
Pages = {264-286},
Volume = {31},
Abstract = {Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering for different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are interval scale. In the interval scale, it is assumed that a unit difference has the same meaning irrespective of the magnitudes of the values separated. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure which does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications: DBScan clustering and content based multimedia information retrieval with real world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.},
Doi = {10.1007/s10618-016-0463-0}
}

Variety in recent publications

Geoff WebbMay 21, 2016October 5, 2021

I have had a nice variety of recent papers!

Concept drift:

Characterizing Concept Drift.
Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L., & Petitjean, F.
Data Mining and Knowledge Discovery, 30(4), 964-994, 2016.

[Bibtex] [Abstract] → Related papers and software

@Article{WebbEtAl16,
author = {Webb, G. I. and Hyde, R. and Cao, H. and Nguyen, H. L. and Petitjean, F.},
journal = {Data Mining and Knowledge Discovery},
title = {Characterizing Concept Drift},
year = {2016},
number = {4},
pages = {964-994},
volume = {30},
abstract = {Most machine learning models are static, but the world is dynamic, and increasing online deployment of learned models gives increasing urgency to the development of efficient and effective mechanisms to address learning in the context of non-stationary distributions, or as it is commonly called concept drift. However, the key issue of characterizing the different types of drift that can occur has not previously been subjected to rigorous definition and analysis. In particular, while some qualitative drift categorizations have been proposed, few have been formally defined, and the quantitative descriptions required for precise and objective understanding of learner performance have not existed. We present the first comprehensive framework for quantitative analysis of drift. This supports the development of the first comprehensive set of formal definitions of types of concept drift. The formal definitions clarify ambiguities and identify gaps in previous definitions, giving rise to a new comprehensive taxonomy of concept drift types and a solid foundation for research into mechanisms to detect and address concept drift.},
doi = {10.1007/s10618-015-0448-4},
keywords = {Concept Drift},
related = {learning-from-non-stationary-distributions},
url = {https://rdcu.be/7vMN},
}

ABSTRACT Most machine learning models are static, but the world is dynamic, and increasing online deployment of learned models gives increasing urgency to the development of efficient and effective mechanisms to address learning in the context of non-stationary distributions, or as it is commonly called concept drift. However, the key issue of characterizing the different types of drift that can occur has not previously been subjected to rigorous definition and analysis. In particular, while some qualitative drift categorizations have been proposed, few have been formally defined, and the quantitative descriptions required for precise and objective understanding of learner performance have not existed. We present the first comprehensive framework for quantitative analysis of drift. This supports the development of the first comprehensive set of formal definitions of types of concept drift. The formal definitions clarify ambiguities and identify gaps in previous definitions, giving rise to a new comprehensive taxonomy of concept drift types and a solid foundation for research into mechanisms to detect and address concept drift.

Pitfalls of assuming data are interval scale:

SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption.
Fernando, T. L., & Webb, G. I.
Data Mining and Knowledge Discovery, 31(1), 264-286, 2017.

[Bibtex] [Abstract]

@Article{FernandoWebb16,
Title = {SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption},
Author = {Fernando, Thilak L. and Webb, Geoffrey I.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2017},
Number = {1},
Pages = {264-286},
Volume = {31},
Abstract = {Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering for different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are interval scale. In the interval scale, it is assumed that a unit difference has the same meaning irrespective of the magnitudes of the values separated. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure which does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications: DBScan clustering and content based multimedia information retrieval with real world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.},
Doi = {10.1007/s10618-016-0463-0}
}

ABSTRACT Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering for different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are interval scale. In the interval scale, it is assumed that a unit difference has the same meaning irrespective of the magnitudes of the values separated. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure which does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications: DBScan clustering and content based multimedia information retrieval with real world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.

Statistical testing of streams and cascades of hypotheses:

[Bibtex] [Abstract] → Related papers and software

@InProceedings{WebbPetitjean16,
Title = {A multiple test correction for streams and cascades of statistical hypothesis tests},
Author = {Webb, Geoffrey I. and Petitjean, Francois},
Booktitle = {Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16},
Year = {2016},
Pages = {1255-1264},
Publisher = {ACM Press},
Abstract = {Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.
To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.},
Comment = {Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue},
Doi = {10.1145/2939672.2939775},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {statistically-sound-association-discovery},
Url = {http://dl.acm.org/authorize?N19100}
}

ABSTRACT Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance. This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed. To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models. We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.

Scalable learning of Bayesian network classifiers:

Sample-based Attribute Selective AnDE for Large Data.
Chen, S., Martinez, A., Webb, G., & Wang, L.
IEEE Transactions on Knowledge and Data Engineering, 29(1), 172-185, 2017.

[Bibtex] [Abstract] → Related papers and software

@Article{ChenEtAl16b,
author = {Chen, S. and Martinez, A. and Webb, G. and Wang, L.},
journal = {{IEEE} Transactions on Knowledge and Data Engineering},
title = {Sample-based Attribute Selective AnDE for Large Data},
year = {2017},
issn = {1041-4347},
number = {1},
pages = {172-185},
volume = {29},
abstract = {More and more applications come with large data sets in the past decade. However, existing algorithms cannot guarantee to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence AnDE is especially appropriate for large data learning. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.},
doi = {10.1109/TKDE.2016.2608881},
keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
related = {learning-complex-conditional-probabilities-from-data},
}

ABSTRACT More and more applications come with large data sets in the past decade. However, existing algorithms cannot guarantee to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence AnDE is especially appropriate for large data learning. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.

Selective AnDE for large data learning: a low-bias memory constrained approach.
Chen, S., Martinez, A. M., Webb, G. I., & Wang, L.
Knowledge and Information Systems, 50(2), 475-503, 2017.

[Bibtex] [Abstract] → Related papers and software

@Article{ChenEtAl16,
Title = {Selective AnDE for large data learning: a low-bias memory constrained approach},
Author = {Chen, Shenglei and Martinez, Ana M. and Webb, Geoffrey I. and Wang, Limin},
Journal = {Knowledge and Information Systems},
Year = {2017},
Number = {2},
Pages = {475-503},
Volume = {50},
Abstract = {Learning from data that are too big to fit into memory poses great challenges to currently available learning approaches. Averaged n-Dependence Estimators (AnDE) allows for a flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence, AnDE is especially appropriate for learning from large quantities of data. Memory requirement in AnDE, however, increases combinatorially with the number of attributes and the parameter n. In large data learning, number of attributes is often large and we also expect high n to achieve low-bias classification. In order to achieve the lower bias of AnDE with higher n but with less memory requirement, we propose a memory constrained selective AnDE algorithm, in which two passes of learning through training examples are involved. The first pass performs attribute selection on super parents according to available memory, whereas the second one learns an AnDE model with parents only on the selected attributes. Extensive experiments show that the new selective AnDE has considerably lower bias and prediction error relative to A \$\$n'\$\$ n {\textasciiacutex} DE, where \$\$n' = n-1\$\$ n {\textasciiacutex} = n - 1 , while maintaining the same space complexity and similar time complexity. The proposed algorithm works well on categorical data. Numerical data sets need to be discretized first.},
Doi = {10.1007/s10115-016-0937-9},
ISSN = {0219-3116},
Keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
Related = {learning-complex-conditional-probabilities-from-data}
}

ABSTRACT Learning from data that are too big to fit into memory poses great challenges to currently available learning approaches. Averaged n-Dependence Estimators (AnDE) allows for a flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence, AnDE is especially appropriate for learning from large quantities of data. Memory requirement in AnDE, however, increases combinatorially with the number of attributes and the parameter n. In large data learning, number of attributes is often large and we also expect high n to achieve low-bias classification. In order to achieve the lower bias of AnDE with higher n but with less memory requirement, we propose a memory constrained selective AnDE algorithm, in which two passes of learning through training examples are involved. The first pass performs attribute selection on super parents according to available memory, whereas the second one learns an AnDE model with parents only on the selected attributes. Extensive experiments show that the new selective AnDE has considerably lower bias and prediction error relative to A \$\$n'\$\$ n {\textasciiacutex} DE, where \$\$n' = n-1\$\$ n {\textasciiacutex} = n - 1 , while maintaining the same space complexity and similar time complexity. The proposed algorithm works well on categorical data. Numerical data sets need to be discretized first.

Scalable Learning of Bayesian Network Classifiers.
Martinez, A. M., Webb, G. I., Chen, S., & Zaidi, N. A.
Journal of Machine Learning Research, 17(44), 1-35, 2016.

[Bibtex] [Abstract] → Related papers and software

@Article{MartinezEtAl16,
author = {Martinez, Ana M. and Webb, Geoffrey I. and Chen, Shenglei and Zaidi, Nayyar A.},
journal = {Journal of Machine Learning Research},
title = {Scalable Learning of {Bayesian} Network Classifiers},
year = {2016},
number = {44},
pages = {1-35},
volume = {17},
abstract = {Ever increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, capacity to learn very complex multivariate probability distributions) is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub- model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.},
keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087, efficient ml},
related = {learning-complex-conditional-probabilities-from-data},
url = {http://jmlr.org/papers/v17/martinez16a.html},
}

ABSTRACT Ever increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, capacity to learn very complex multivariate probability distributions) is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub- model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.

Time series classification:

Faster and more accurate classification of time series by exploiting a novel dynamic time warping averaging algorithm.
Petitjean, F., Forestier, G., Webb, G. I., Nicholson, A. E., Chen, Y., & Keogh, E.
Knowledge and Information Systems, 47(1), 1-26, 2016.

[Bibtex] [Abstract] → Related papers and software

@Article{PetitjeanEtAl16a,
author = {Petitjean, F. and Forestier, G. and Webb, G. I. and Nicholson, A. E. and Chen, Y. and Keogh, E.},
journal = {Knowledge and Information Systems},
title = {Faster and more accurate classification of time series by exploiting a novel dynamic time warping averaging algorithm},
year = {2016},
number = {1},
pages = {1-26},
volume = {47},
abstract = {A concerted research effort over the past two decades has heralded significant improvements in both the efficiency and effectiveness of time series classification. The consensus that has emerged in the community is that the best solution is a surprisingly simple one. In virtually all domains, the most accurate classifier is the nearest neighbor algorithm with dynamic time warping as the distance measure. The time complexity of dynamic time warping means that successful deployments on resource-constrained devices remain elusive. Moreover, the recent explosion of interest in wearable computing devices, which typically have limited computational resources, has greatly increased the need for very efficient classification algorithms. A classic technique to obtain the benefits of the nearest neighbor algorithm, without inheriting its undesirable time and space complexity, is to use the nearest centroid algorithm. Unfortunately, the unique properties of (most) time series data mean that the centroid typically does not resemble any of the instances, an unintuitive and underappreciated fact. In this paper we demonstrate that we can exploit a recent result by Petitjean et al. to allow meaningful averaging of 'warped' time series, which then allows us to create super-efficient nearest 'centroid' classifiers that are at least as accurate as their more computationally challenged nearest neighbor relatives. We demonstrate empirically the utility of our approach by comparing it to all the appropriate strawmen algorithms on the ubiquitous UCR Benchmarks and with a case study in supporting insect classification on resource-constrained sensors.},
doi = {10.1007/s10115-015-0878-8},
keywords = {time series, efficient ml},
related = {scalable-time-series-classifiers},
}

ABSTRACT A concerted research effort over the past two decades has heralded significant improvements in both the efficiency and effectiveness of time series classification. The consensus that has emerged in the community is that the best solution is a surprisingly simple one. In virtually all domains, the most accurate classifier is the nearest neighbor algorithm with dynamic time warping as the distance measure. The time complexity of dynamic time warping means that successful deployments on resource-constrained devices remain elusive. Moreover, the recent explosion of interest in wearable computing devices, which typically have limited computational resources, has greatly increased the need for very efficient classification algorithms. A classic technique to obtain the benefits of the nearest neighbor algorithm, without inheriting its undesirable time and space complexity, is to use the nearest centroid algorithm. Unfortunately, the unique properties of (most) time series data mean that the centroid typically does not resemble any of the instances, an unintuitive and underappreciated fact. In this paper we demonstrate that we can exploit a recent result by Petitjean et al. to allow meaningful averaging of 'warped' time series, which then allows us to create super-efficient nearest 'centroid' classifiers that are at least as accurate as their more computationally challenged nearest neighbor relatives. We demonstrate empirically the utility of our approach by comparing it to all the appropriate strawmen algorithms on the ubiquitous UCR Benchmarks and with a case study in supporting insect classification on resource-constrained sensors.

Sequential pattern discovery:

Skopus: Mining top-k sequential patterns under leverage.
Petitjean, F., Li, T., Tatti, N., & Webb, G. I.
Data Mining and Knowledge Discovery, 30(5), 1086-1111, 2016.

[Bibtex] [Abstract] → Related papers and software

@Article{PetitjeanEtAl16b,
Title = {Skopus: Mining top-k sequential patterns under leverage},
Author = {Petitjean, Francois and Li, Tao and Tatti, Nikolaj and Webb, Geoffrey I.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2016},
Number = {5},
Pages = {1086-1111},
Volume = {30},
Abstract = {This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern---a concept on which most interestingness measures directly rely---with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.},
Doi = {10.1007/s10618-016-0467-9},
ISSN = {1573-756X},
Keywords = {OPUS and Association Rule Discovery and statistically sound discovery},
Related = {statistically-sound-association-discovery},
Url = {http://rdcu.be/tsDo}
}

ABSTRACT This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern–-a concept on which most interestingness measures directly rely–-with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.

Using generative parameterisations to scale up discriminative learning:

ALRn: Accelerated higher-order logistic regression.
Zaidi, N. A., Webb, G. I., Carman, M. J., Petitjean, F., & Cerquides, J.
Machine Learning, 104(2), 151-194, 2016.

[Bibtex] [Abstract] → Related papers and software

@Article{ZaidiEtAl16b,
author = {Zaidi, Nayyar A. and Webb, Geoffrey I. and Carman, Mark J. and Petitjean, Francois and Cerquides, Jesus},
journal = {Machine Learning},
title = {{ALRn}: Accelerated higher-order logistic regression},
year = {2016},
issn = {1573-0565},
number = {2},
pages = {151-194},
volume = {104},
abstract = {This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.},
doi = {10.1007/s10994-016-5574-8},
keywords = {Conditional Probability Estimation and WANBIA and DP140100087, efficent ML},
related = {combining-generative-and-discriminative-learning},
url = {http://rdcu.be/unVb},
}

ABSTRACT This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.

My first neural network paper:

Preconditioning an Artificial Neural Network Using Naive Bayes.
Zaidi, N. A., Petitjean, F., & Webb, G. I.
Proceedings of the 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2016, pp. 341-353, 2016.

[Bibtex] [Abstract] → Related papers and software

@InProceedings{ZaidiEtAl16,
author = {Zaidi, Nayyar A. and Petitjean, Francois and Webb, Geoffrey I.},
booktitle = {Proceedings of the 20th {Pacific-Asia} Conference on Advances in Knowledge Discovery and Data Mining, {PAKDD} 2016},
title = {Preconditioning an Artificial Neural Network Using Naive {Bayes}},
year = {2016},
editor = {Bailey, James and Khan, Latifur and Washio, Takashi and Dobbie, Gill and Huang, Zhexue Joshua and Wang, Ruili},
pages = {341-353},
publisher = {Springer International Publishing},
abstract = {Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.},
doi = {10.1007/978-3-319-31753-3_28},
isbn = {978-3-319-31753-3},
keywords = {Conditional Probability Estimation and WANBIA and DP140100087, efficient ml},
related = {combining-generative-and-discriminative-learning},
url = {http://dx.doi.org/10.1007/978-3-319-31753-3_28},
}

ABSTRACT Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.

My first fuzzy rules paper:

Mining significant association rules from uncertain data.
Zhang, A., Shi, W., & Webb, G. I.
Data Mining and Knowledge Discovery, 30(4), 928-963, 2016.

[Bibtex] [Abstract] → Related papers and software

@Article{ZhangEtAl16,
Title = {Mining significant association rules from uncertain data},
Author = {Zhang, Anshu and Shi, Wenzhong and Webb, Geoffrey I.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2016},
Number = {4},
Pages = {928-963},
Volume = {30},
Abstract = {In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.},
Doi = {10.1007/s10618-015-0446-6},
Keywords = {Association Rule Discovery and statistically sound discovery},
Publisher = {Springer},
Related = {statistically-sound-association-discovery}
}

ABSTRACT In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.

Computational biology:

Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli.
Chang, C. C. H., Li, C., Webb, G. I., Tey, B., & Song, J.
Scientific Reports, 6, Art. no. 21844, 2016.

[Bibtex] [Abstract] → Related papers and software

@Article{ChangEtAl2016,
Title = {Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli},
Author = {Chang, C.C.H. and Li, C. and Webb, G. I. and Tey, B. and Song, J.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement with current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture, to predict the real-valued expression level of target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) classifier to be used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson's correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.},
Articlenumber = {21844},
Doi = {10.1038/srep21844},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

ABSTRACT Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement with current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture, to predict the real-valued expression level of target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) classifier to be used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson's correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.

GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features.
Li, F., Li, C., Revote, J., Zhang, Y., Webb, G. I., Li, J., Song, J., & Lithgow, T.
Scientific Reports, 6, Art. no. 34595, 2016.

[Bibtex] → Related papers and software

@Article{LiEtAl16,
Title = {GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features},
Author = {Li, Fuyi and Li, Chen and Revote, Jerico and Zhang, Yang and Webb, Geoffrey I. and Li, Jian and Song, Jiangning and Lithgow, Trevor},
Journal = {Scientific Reports},
Year = {2016},
Month = oct,
Volume = {6},
Articlenumber = {34595},
Doi = {10.1038/srep34595},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

ABSTRACT

Smoothing a rugged protein folding landscape by sequence-based redesign.
Porebski, B. T., Keleher, S., Hollins, J. J., Nickson, A. A., Marijanovic, E. M., Borg, N. A., Costa, M. G. S., Pearce, M. A., Dai, W., Zhu, L., Irving, J. A., Hoke, D. E., Kass, I., Whisstock, J. C., Bottomley, S. P., Webb, G. I., McGowan, S., & Buckle, A. M.
Scientific Reports, 6, Art. no. 33958, 2016.

[Bibtex] [Abstract] → Related papers and software

@Article{Porebski2016,
Title = {Smoothing a rugged protein folding landscape by sequence-based redesign},
Author = {Porebski, Benjamin T. and Keleher, Shani and Hollins, Jeffrey J. and Nickson, Adrian A. and Marijanovic, Emilia M. and Borg, Natalie A. and Costa, Mauricio G. S. and Pearce, Mary A. and Dai, Weiwen and Zhu, Liguang and Irving, James A. and Hoke, David E. and Kass, Itamar and Whisstock, James C. and Bottomley, Stephen P. and Webb, Geoffrey I. and McGowan, Sheena and Buckle, Ashley M.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The rugged folding landscapes of functional proteins puts them at risk of misfolding and aggregation. Serine protease inhibitors, or serpins, are paradigms for this delicate balance between function and misfolding. Serpins exist in a metastable state that undergoes a major conformational change in order to inhibit proteases. However, conformational labiality of the native serpin fold renders them susceptible to misfolding, which underlies misfolding diseases such as alpha1-antitrypsin deficiency. To investigate how serpins balance function and folding, we used consensus design to create conserpin, a synthetic serpin that folds reversibly, is functional, thermostable, and polymerization resistant. Characterization of its structure, folding and dynamics suggest that consensus design has remodeled the folding landscape to reconcile competing requirements for stability and function. This approach may offer general benefits for engineering functional proteins that have risky folding landscapes, including the removal of aggregation-prone intermediates, and modifying scaffolds for use as protein therapeutics.},
Articlenumber = {33958},
Doi = {10.1038/srep33958},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1038/srep33958}
}

ABSTRACT The rugged folding landscapes of functional proteins puts them at risk of misfolding and aggregation. Serine protease inhibitors, or serpins, are paradigms for this delicate balance between function and misfolding. Serpins exist in a metastable state that undergoes a major conformational change in order to inhibit proteases. However, conformational labiality of the native serpin fold renders them susceptible to misfolding, which underlies misfolding diseases such as alpha1-antitrypsin deficiency. To investigate how serpins balance function and folding, we used consensus design to create conserpin, a synthetic serpin that folds reversibly, is functional, thermostable, and polymerization resistant. Characterization of its structure, folding and dynamics suggest that consensus design has remodeled the folding landscape to reconcile competing requirements for stability and function. This approach may offer general benefits for engineering functional proteins that have risky folding landscapes, including the removal of aggregation-prone intermediates, and modifying scaffolds for use as protein therapeutics.

Crysalis: an integrated server for computational analysis and design of protein crystallization.
Wang, H., Feng, L., Zhang, Z., Webb, G. I., Lin, D., & Song, J.
Scientific Reports, 6, Art. no. 21383, 2016.

[Bibtex] [Abstract] → Related papers and software

@Article{WangEtAl16,
Title = {Crysalis: an integrated server for computational analysis and design of protein crystallization},
Author = {Wang, H. and Feng, L. and Zhang, Z. and Webb, G. I. and Lin, D. and Song, J.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The failure of multi-step experimental procedures to yield diffraction-quality crystals is a major bottleneck in protein structure determination. Accordingly, several bioinformatics methods have been successfully developed and employed to select crystallizable proteins. Unfortunately, the majority of existing in silico methods only allow the prediction of crystallization propensity, seldom enabling computational design of protein mutants that can be targeted for enhancing protein crystallizability. Here, we present Crysalis, an integrated crystallization analysis tool that builds on support-vector regression (SVR) models to facilitate computational protein crystallization prediction, analysis, and design. More specifically, the functionality of this new tool includes: (1) rapid selection of target crystallizable proteins at the proteome level, (2) identification of site non-optimality for protein crystallization and systematic analysis of all potential single-point mutations that might enhance protein crystallization propensity, and (3) annotation of target protein based on predicted structural properties. We applied the design mode of Crysalis to identify site non-optimality for protein crystallization on a proteome-scale, focusing on proteins currently classified as non-crystallizable. Our results revealed that site non-optimality is based on biases related to residues, predicted structures, physicochemical properties, and sequence loci, which provides in-depth understanding of the features influencing protein crystallization. Crysalis is freely available at http://nmrcen.xmu.edu.cn/crysalis/.},
Articlenumber = {21383},
Doi = {10.1038/srep21383},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

ABSTRACT The failure of multi-step experimental procedures to yield diffraction-quality crystals is a major bottleneck in protein structure determination. Accordingly, several bioinformatics methods have been successfully developed and employed to select crystallizable proteins. Unfortunately, the majority of existing in silico methods only allow the prediction of crystallization propensity, seldom enabling computational design of protein mutants that can be targeted for enhancing protein crystallizability. Here, we present Crysalis, an integrated crystallization analysis tool that builds on support-vector regression (SVR) models to facilitate computational protein crystallization prediction, analysis, and design. More specifically, the functionality of this new tool includes: (1) rapid selection of target crystallizable proteins at the proteome level, (2) identification of site non-optimality for protein crystallization and systematic analysis of all potential single-point mutations that might enhance protein crystallization propensity, and (3) annotation of target protein based on predicted structural properties. We applied the design mode of Crysalis to identify site non-optimality for protein crystallization on a proteome-scale, focusing on proteins currently classified as non-crystallizable. Our results revealed that site non-optimality is based on biases related to residues, predicted structures, physicochemical properties, and sequence loci, which provides in-depth understanding of the features influencing protein crystallization. Crysalis is freely available at http://nmrcen.xmu.edu.cn/crysalis/.

Francois Petitjean wins Research Impact Award

Geoff WebbMay 11, 2016October 5, 2021

Congratulations to my colleague Francois Petitjean for winning the Victorian Young Achiever Awards Research Impact Award.