Data Scientist


Combining generative and discriminative strategies to facilitate efficient and effective learning from big data

Funder: Australian Research Council

Funding: 2014: $354,000; 2015: $292,000; 2016: $312,000

Summary:

Effective extraction of information from massive data stores is increasingly problematic as data quantities continue to grow rapidly. Quite simply, effective techniques for learning from small data do not scale. The problem runs deeper than scale alone: big data contain more information than the small data in whose context most state-of-the-art learning algorithms were developed. For small data, overly detailed classifiers will overfit and so should be avoided. In contrast, big data provide fine detail, and hence will benefit new types of learner that can capture it. This project will deliver learners that not only capture this detail, but do so with the efficiency required to process terabytes of data.

Publications:

MetalExplorer, a Bioinformatics Tool for the Improved Prediction of Eight Types of Metal-binding Sites Using a Random Forest Algorithm with Two-step Feature Selection.
Song, J., Li, C., Zheng, C., Revote, J., Zhang, Z., & Webb, G. I.
Current Bioinformatics, 11, in press.
[DOI] [Bibtex] [Abstract]

@Article{SongEtAl16,
Title = {MetalExplorer, a Bioinformatics Tool for the Improved Prediction of Eight Types of Metal-binding Sites Using a Random Forest Algorithm with Two-step Feature Selection},
Author = {Jiangning Song and Chen Li and Cheng Zheng and Jerico Revote and Ziding Zhang and Geoffrey I. Webb},
Journal = {Current Bioinformatics},
Year = {in press},
Volume = {11},
Abstract = {Metalloproteins are highly involved in many biological processes,
including catalysis, recognition, transport, transcription, and signal
transduction. The metal ions they bind usually play enzymatic or structural
roles in mediating these diverse functional roles. Thus, the systematic
analysis and prediction of metal-binding sites using sequence and/or
structural information are crucial for understanding their
sequence-structure-function relationships. In this study, we propose
MetalExplorer (http://metalexplorer.erc.monash.edu.au/), a new machine
learning-based method for predicting eight different types of metal-binding
sites (Ca, Co, Cu, Fe, Ni, Mg, Mn, and Zn) in proteins. Our approach
combines heterogeneous sequence-, structure-, and residue contact
network-based features. The predictive performance of MetalExplorer was
tested by cross-validation and independent tests using non-redundant
datasets of known structures. This method applies a two-step feature
selection approach based on the maximum relevance minimum redundancy and
forward feature selection to identify the most informative features that
contribute to the prediction performance. With a precision of 60%,
MetalExplorer achieved high recall values, which ranged from 59% to 88% for
the eight metal ion types in fivefold cross-validation tests. Moreover, the
common and type-specific features in the optimal subsets of all metal ions
were characterized in terms of their contributions to the overall
performance. In terms of both benchmark and independent datasets at the 60%
precision control level, MetalExplorer compared favorably with an existing
metalloprotein prediction tool, SitePredict. Thus, MetalExplorer is expected
to be a powerful tool for the accurate prediction of potential metal-binding
sites and it should facilitate the functional analysis and rational design
of novel metalloproteins.},
Doi = {10.2174/2468422806666160618091522},
ISSN = {1574-8936/2212-392X},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}
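The two-step selection the abstract describes (maximum relevance minimum redundancy ranking followed by forward selection) can be illustrated with a toy sketch. This is not MetalExplorer's implementation; the mutual-information scorer, data, and function names below are invented for the example:

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information between two discrete sequences (nats)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_rank(features, label, k):
    """Step 1: greedy maximum-relevance minimum-redundancy ranking.
    features maps a feature name to a list of discrete values."""
    selected, candidates = [], set(features)
    while candidates and len(selected) < k:
        def score(f):
            relevance = mutual_information(features[f], label)
            redundancy = (sum(mutual_information(features[f], features[s])
                              for s in selected) / len(selected)
                          if selected else 0.0)
            return relevance - redundancy
        best = max(sorted(candidates), key=score)  # sorted => deterministic ties
        selected.append(best)
        candidates.remove(best)
    return selected

def forward_select(ranked, evaluate):
    """Step 2: keep adding the next-ranked feature while the score improves."""
    chosen, best_score = [], float("-inf")
    for f in ranked:
        s = evaluate(chosen + [f])
        if s <= best_score:
            break
        chosen, best_score = chosen + [f], s
    return chosen
```

On a toy dataset with one informative feature, one redundant copy, and one noise feature, the informative feature is ranked first and the redundant copy is penalized.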

Comprehensive assessment and performance improvement of effector protein predictors for bacterial secretion systems III, IV and VI.
An, Y., Wang, J., Li, C., Leier, A., Marquez-Lago, T., Wilksch, J., Zhang, Y., Webb, G. I., Song, J., & Lithgow, T.
Briefings in Bioinformatics, Art. no. bbw100, 2017.
[DOI] [Bibtex]


SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems.
An, Y., Wang, J., Li, C., Revote, J., Zhang, Y., Naderer, T., Hayashida, M., Akutsu, T., Webb, G. I., Lithgow, T., & Song, J.
Scientific Reports, 7, Art. no. 41031, 2017.
[DOI] [Bibtex]

@Article{AnEtAl17,
Title = {SecretEPDB: a comprehensive web-based resource for secreted effector proteins of the bacterial types III, IV and VI secretion systems},
Author = {An, Yi and Wang, Jiawei and Li, Chen and Revote, Jerico and Zhang, Yang and Naderer, Thomas and Hayashida, Morihiro and Akutsu, Tatsuya and Webb, Geoffrey I. and Lithgow, Trevor and Song, Jiangning},
Journal = {Scientific Reports},
Year = {2017},
Volume = {7},
Articlenumber = {41031},
Doi = {10.1038/srep41031},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://rdcu.be/oJ9I}
}

Knowledge-transfer learning for prediction of matrix metalloprotease substrate-cleavage sites.
Wang, Y., Song, J., Marquez-Lago, T. T., Leier, A., Li, C., Lithgow, T., Webb, G. I., & Shen, H.
Scientific Reports, 7, Art. no. 5755, 2017.
[DOI] [Bibtex]


Efficient Parameter Learning of Bayesian Network Classifiers.
Zaidi, N., Webb, G. I., Carman, M., Petitjean, F., Buntine, W., Hynes, H., & De Sterck, H.
Machine Learning, 106(9-10), 2017.
[PDF] [DOI] [Bibtex]


Sample-based Attribute Selective AnDE for Large Data.
Chen, S., Martinez, A., Webb, G., & Wang, L.
IEEE Transactions on Knowledge and Data Engineering, 29(1), 172-185, 2017.
[PDF] [DOI] [Bibtex]


Selective AnDE for large data learning: a low-bias memory constrained approach.
Chen, S., Martínez, A. M., Webb, G. I., & Wang, L.
Knowledge and Information Systems, 50(2), 475-503, 2017.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{ChenEtAl16,
Title = {Selective AnDE for large data learning: a low-bias memory constrained approach},
Author = {Chen, Shenglei
and Mart{\'i}nez, Ana M.
and Webb, Geoffrey I.
and Wang, Limin},
Journal = {Knowledge and Information Systems},
Year = {2017},
Number = {2},
Pages = {475-503},
Volume = {50},
Abstract = {Learning from data that are too big to fit into memory poses great challenges to currently available learning approaches. Averaged n-Dependence Estimators (AnDE) allows for a flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence, AnDE is especially appropriate for learning from large quantities of data. Memory requirement in AnDE, however, increases combinatorially with the number of attributes and the parameter n. In large data learning, number of attributes is often large and we also expect high n to achieve low-bias classification. In order to achieve the lower bias of AnDE with higher n but with less memory requirement, we propose a memory constrained selective AnDE algorithm, in which two passes of learning through training examples are involved. The first pass performs attribute selection on super parents according to available memory, whereas the second one learns an AnDE model with parents only on the selected attributes. Extensive experiments show that the new selective AnDE has considerably lower bias and prediction error relative to A$n'$DE, where $n' = n-1$, while maintaining the same space complexity and similar time complexity. The proposed algorithm works well on categorical data. Numerical data sets need to be discretized first.},
Doi = {10.1007/s10115-016-0937-9},
ISSN = {0219-3116},
Keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
Related = {learning-complex-conditional-probabilities-from-data}
}
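The two-pass scheme the abstract outlines can be illustrated with a deliberately simplified sketch: pass one ranks attributes by mutual information with the class under a memory budget, pass two collects counts only for the kept attributes. A naive-Bayes-style model stands in for AnDE here, and all names are invented for the example:

```python
from collections import Counter, defaultdict
from math import log

def two_pass_selective_nb(rows, labels, budget):
    """Toy two-pass learner: select attributes first, then count only those."""
    n_attrs = len(rows[0])

    # --- pass 1: score each attribute against the class ---
    def mi(col):
        n = len(labels)
        xs = [r[col] for r in rows]
        px, py, pxy = Counter(xs), Counter(labels), Counter(zip(xs, labels))
        return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
                   for (x, y), c in pxy.items())
    kept = sorted(range(n_attrs), key=mi, reverse=True)[:budget]

    # --- pass 2: learn counts on the selected attributes only ---
    class_counts = Counter(labels)
    cond = defaultdict(Counter)          # (attribute, value) -> class counts
    for r, y in zip(rows, labels):
        for a in kept:
            cond[(a, r[a])][y] += 1

    def predict(r):
        def score(y):
            s = log(class_counts[y])
            for a in kept:               # Laplace-smoothed conditionals
                s += log((cond[(a, r[a])][y] + 1) / (class_counts[y] + 2))
            return s
        return max(class_counts, key=score)

    return kept, predict
```

With a budget of one attribute, the informative attribute survives selection and the noise attribute is dropped.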

GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features.
Li, F., Li, C., Revote, J., Zhang, Y., Webb, G. I., Li, J., Song, J., & Lithgow, T.
Scientific Reports, 6, Art. no. 34595, 2016.
[DOI] [Bibtex]

@Article{LiEtAl16,
Title = {GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features},
Author = {Li, Fuyi and Li, Chen and Revote, Jerico and Zhang, Yang and Webb, Geoffrey I. and Li, Jian and Song, Jiangning and Lithgow, Trevor},
Journal = {Scientific Reports},
Year = {2016},
Month = oct,
Volume = {6},
Articlenumber = {34595},
Doi = {10.1038/srep34595},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

ALRn: Accelerated higher-order logistic regression.
Zaidi, N. A., Webb, G. I., Carman, M. J., Petitjean, F., & Cerquides, J.
Machine Learning, 104(2), 151-194, 2016.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{ZaidiEtAl16b,
Title = {{ALRn}: Accelerated higher-order logistic regression},
Author = {Zaidi, Nayyar A.
and Webb, Geoffrey I.
and Carman, Mark J.
and Petitjean, Fran{\c{c}}ois
and Cerquides, Jes{\'u}s},
Journal = {Machine Learning},
Year = {2016},
Number = {2},
Pages = {151-194},
Volume = {104},
Abstract = {This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.},
Doi = {10.1007/s10994-016-5574-8},
ISSN = {1573-0565},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://dx.doi.org/10.1007/s10994-016-5574-8}
}
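The higher-order features the abstract refers to (Cartesian products of the original features) can be generated mechanically. A minimal sketch, with names invented for the example:

```python
from itertools import combinations

def higher_order_features(record, n=2):
    """Expand a categorical record with order-n features: each new feature
    is the joint value of an n-subset of the original attributes, i.e. the
    Cartesian-product construction the abstract describes."""
    items = sorted(record.items())
    expanded = dict(record)
    for combo in combinations(items, n):
        name = "*".join(k for k, _ in combo)
        expanded[name] = tuple(v for _, v in combo)
    return expanded
```

For a record with attributes a, b, c, order-2 expansion adds the joint features a*b, a*c, and b*c alongside the originals.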

Preconditioning an Artificial Neural Network Using Naive Bayes.
Zaidi, N. A., Petitjean, F., & Webb, G. I.
Proceedings of the 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2016, pp. 341-353, 2016.
[PDF] [DOI] [Bibtex] [Abstract]

@InProceedings{ZaidiEtAl16,
Title = {Preconditioning an Artificial Neural Network Using Naive {Bayes}},
Author = {Zaidi, Nayyar A.
and Petitjean, Fran{\c{c}}ois
and Webb, Geoffrey I.},
Booktitle = {Proceedings of the 20th {Pacific-Asia} Conference on Advances in Knowledge Discovery and Data Mining, {PAKDD} 2016},
Year = {2016},
Editor = {Bailey, James
and Khan, Latifur
and Washio, Takashi
and Dobbie, Gill
and Huang, Zhexue Joshua
and Wang, Ruili},
Pages = {341-353},
Publisher = {Springer International Publishing},
Abstract = {Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.},
Doi = {10.1007/978-3-319-31753-3_28},
ISBN = {978-3-319-31753-3},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://dx.doi.org/10.1007/978-3-319-31753-3_28}
}
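The preconditioning idea (start discriminative optimization from naive Bayes estimates rather than from zero weights) can be sketched for the simplest case of binary features and a binary class. This is an illustrative reduction, not the paper's implementation:

```python
from math import log

def nb_linear_init(rows, labels):
    """Turn naive Bayes estimates into the bias and weights of a linear
    (logistic) model; these serve as the preconditioned starting point
    from which optimization of CLL or MSE would then proceed."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    bias = log(n1 / n0)
    weights = []
    for j in range(len(rows[0])):
        # Laplace-smoothed P(x_j = 1 | y)
        p1 = (sum(r[j] for r, y in zip(rows, labels) if y == 1) + 1) / (n1 + 2)
        p0 = (sum(r[j] for r, y in zip(rows, labels) if y == 0) + 1) / (n0 + 2)
        weights.append(log(p1 / p0) - log((1 - p1) / (1 - p0)))
        bias += log((1 - p1) / (1 - p0))
    return bias, weights

def predict(bias, weights, x):
    """Threshold the linear score at zero (logistic decision rule)."""
    return 1 if bias + sum(w * v for w, v in zip(weights, x)) > 0 else 0
```

Before any gradient step, the NB-initialized linear model already separates a toy dataset, which is why such initialization can speed up subsequent convergence.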

A multiple test correction for streams and cascades of statistical hypothesis tests.
Webb, G. I., & Petitjean, F.
Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 1255-1264, 2016.
[PDF] [DOI] [Bibtex] [Abstract]

@InProceedings{WebbPetitjean16,
Title = {A multiple test correction for streams and cascades of statistical hypothesis tests},
Author = {Webb, Geoffrey I and Petitjean, Francois},
Booktitle = {Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16},
Year = {2016},
Pages = {1255-1264},
Publisher = {ACM Press},
Abstract = {Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.
To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.},
Comment = {Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue},
Doi = {10.1145/2939672.2939775},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {statistically-sound-association-discovery},
Url = {http://dl.acm.org/authorize?N19100}
}
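For contrast with the paper's contribution, the classical step-down FWER control it improves on can be sketched in a few lines. Holm's procedure requires all m hypotheses and their p-values up front, which is precisely the restriction Subfamilywise Multiple Testing lifts; the code below is that classical baseline, not the paper's method:

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Holm step-down correction: test p-values in ascending order against
    increasingly lenient thresholds alpha/m, alpha/(m-1), ..., stopping at
    the first failure. Controls FWER at alpha for a fixed family of m tests."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = set()
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            rejected.add(i)
        else:
            break
    return rejected
```

Note that the family size m appears in every threshold, so no rejection decision can be made until the full set of hypotheses is known.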

Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli.
Chang, C. C. H., Li, C., Webb, G. I., Tey, B., & Song, J.
Scientific Reports, 6, Art. no. 21844, 2016.
[URL] [Bibtex] [Abstract]

@Article{ChangEtAl2016,
Title = {Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli},
Author = {C.C.H. Chang and C. Li and G. I. Webb and B. Tey and J. Song},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement with current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture, to predict the real-valued expression level of target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) classifier to be used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson’s correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.},
Articlenumber = {21844},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1038/srep21844}
}
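The two-stage architecture the abstract describes (a first-stage classifier deciding which second-stage regressor produces the real-valued output) reduces to a simple gating pattern. A sketch with trivial stand-in models; the actual predictor trains an SVM gate and SVR regressors on sequence features:

```python
def two_stage_predictor(classify, regress_low, regress_high):
    """Route each input through a first-stage classifier to one of two
    second-stage regressors, mirroring the SVM -> SVR architecture."""
    def predict(x):
        return regress_high(x) if classify(x) else regress_low(x)
    return predict
```

The gate and regressors here are arbitrary callables, so the same pattern applies whatever models fill the two stages.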

Crysalis: an integrated server for computational analysis and design of protein crystallization.
Wang, H., Feng, L., Zhang, Z., Webb, G. I., Lin, D., & Song, J.
Scientific Reports, 6, Art. no. 21383, 2016.
[DOI] [Bibtex] [Abstract]

@Article{WangEtAl16,
Title = {Crysalis: an integrated server for computational analysis and design of protein crystallization},
Author = {Wang, H. and Feng, L. and Zhang, Z. and Webb, G. I. and Lin, D. and Song, J.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The failure of multi-step experimental procedures to yield diffraction-quality crystals is a major bottleneck in protein structure determination. Accordingly, several bioinformatics methods have been successfully developed and employed to select crystallizable proteins. Unfortunately, the majority of existing in silico methods only allow the prediction of crystallization propensity, seldom enabling computational design of protein mutants that can be targeted for enhancing protein crystallizability. Here, we present Crysalis, an integrated crystallization analysis tool that builds on support-vector regression (SVR) models to facilitate computational protein crystallization prediction, analysis, and design. More specifically, the functionality of this new tool includes: (1) rapid selection of target crystallizable proteins at the proteome level, (2) identification of site non-optimality for protein crystallization and systematic analysis of all potential single-point mutations that might enhance protein crystallization propensity, and (3) annotation of target protein based on predicted structural properties. We applied the design mode of Crysalis to identify site non-optimality for protein crystallization on a proteome-scale, focusing on proteins currently classified as non-crystallizable. Our results revealed that site non-optimality is based on biases related to residues, predicted structures, physicochemical properties, and sequence loci, which provides in-depth understanding of the features influencing protein crystallization. Crysalis is freely available at http://nmrcen.xmu.edu.cn/crysalis/.},
Articlenumber = {21383},
Doi = {10.1038/srep21383},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology}
}

Smoothing a rugged protein folding landscape by sequence-based redesign.
Porebski, B. T., Keleher, S., Hollins, J. J., Nickson, A. A., Marijanovic, E. M., Borg, N. A., Costa, M. G. S., Pearce, M. A., Dai, W., Zhu, L., Irving, J. A., Hoke, D. E., Kass, I., Whisstock, J. C., Bottomley, S. P., Webb, G. I., McGowan, S., & Buckle, A. M.
Scientific Reports, 6, Art. no. 33958, 2016.
[DOI] [Bibtex] [Abstract]

@Article{Porebski2016,
Title = {Smoothing a rugged protein folding landscape by sequence-based redesign},
Author = {Porebski, Benjamin T. and Keleher, Shani and Hollins, Jeffrey J. and Nickson, Adrian A. and Marijanovic, Emilia M. and Borg, Natalie A. and Costa, Mauricio G. S. and Pearce, Mary A. and Dai, Weiwen and Zhu, Liguang and Irving, James A. and Hoke, David E. and Kass, Itamar and Whisstock, James C. and Bottomley, Stephen P. and Webb, Geoffrey I. and McGowan, Sheena and Buckle, Ashley M.},
Journal = {Scientific Reports},
Year = {2016},
Volume = {6},
Abstract = {The rugged folding landscapes of functional proteins puts them at risk of misfolding and aggregation. Serine protease inhibitors, or serpins, are paradigms for this delicate balance between function and misfolding. Serpins exist in a metastable state that undergoes a major conformational change in order to inhibit proteases. However, conformational labiality of the native serpin fold renders them susceptible to misfolding, which underlies misfolding diseases such as alpha1-antitrypsin deficiency. To investigate how serpins balance function and folding, we used consensus design to create conserpin, a synthetic serpin that folds reversibly, is functional, thermostable, and polymerization resistant. Characterization of its structure, folding and dynamics suggest that consensus design has remodeled the folding landscape to reconcile competing requirements for stability and function. This approach may offer general benefits for engineering functional proteins that have risky folding landscapes, including the removal of aggregation-prone intermediates, and modifying scaffolds for use as protein therapeutics.},
Articlenumber = {33958},
Doi = {10.1038/srep33958},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1038/srep33958}
}

Scalable Learning of Graphical Models.
Petitjean, F., & Webb, G. I.
Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 2131-2132, 2016.
[URL] [Bibtex]

@InProceedings{PetitjeanWebbTut16,
Title = {Scalable Learning of Graphical Models},
Author = {F. Petitjean and G.I. Webb},
Booktitle = {Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16},
Year = {2016},
Pages = {2131-2132},
Publisher = {ACM Press},
Keywords = {scalable graphical models and Learning from large datasets and DP140100087},
Related = {scalable-graphical-modeling},
Url = {http://dl.acm.org/authorize?N19101}
}

Scalable Learning of Bayesian Network Classifiers.
Martinez, A. M., Webb, G. I., Chen, S., & Zaidi, N. A.
Journal of Machine Learning Research, 17(44), 1-35, 2016.
[URL] [Bibtex] [Abstract]

@Article{MartinezEtAl16,
Title = {Scalable Learning of {Bayesian} Network Classifiers},
Author = {Ana M. Martinez and Geoffrey I. Webb and Shenglei Chen and Nayyar A. Zaidi},
Journal = {Journal of Machine Learning Research},
Year = {2016},
Number = {44},
Pages = {1-35},
Volume = {17},
Abstract = {Ever increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, capacity to learn very complex multivariate probability distributions) is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub- model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.},
Keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
Related = {learning-complex-conditional-probabilities-from-data},
Url = {http://jmlr.org/papers/v17/martinez16a.html}
}
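
The selective KDB learner above extends the classic k-dependence Bayesian classifier, in which attributes are ranked by mutual information with the class and each attribute receives up to k higher-ranked parents chosen by conditional mutual information. A minimal Python sketch of that base structure-learning step (illustrative only; the function names are mine, and the paper's discriminative sub-model selection and out-of-core passes are not shown):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) for two parallel lists of discrete values (MLE estimates, in nats)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def conditional_mutual_information(xs, zs, ys):
    """I(X;Z|Y) for three parallel lists of discrete values."""
    n = len(xs)
    pxzy = Counter(zip(xs, zs, ys))
    pxy, pzy, py = Counter(zip(xs, ys)), Counter(zip(zs, ys)), Counter(ys)
    return sum((c / n) * math.log(c * py[y] / (pxy[(x, y)] * pzy[(z, y)]))
               for (x, z, y), c in pxzy.items())

def kdb_structure(X, y, k=1):
    """Rank attributes by I(X_a;Y); give each attribute up to k parents chosen
    from the higher-ranked attributes by I(X_a; X_p | Y)."""
    cols = list(zip(*X))
    order = sorted(range(len(cols)),
                   key=lambda a: mutual_information(cols[a], y), reverse=True)
    parents = {}
    for i, a in enumerate(order):
        cands = order[:i]
        cands.sort(key=lambda p: conditional_mutual_information(cols[a], cols[p], y),
                   reverse=True)
        parents[a] = cands[:k]
    return order, parents
```

With k=0 this degenerates to naive Bayes; the paper's contribution is choosing, in one extra data pass, which sub-model of the full KDB structure to retain.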

Structural and dynamic properties that govern the stability of an engineered fibronectin type III domain.
Porebski, B. T., Nickson, A. A., Hoke, D. E., Hunter, M. R., Zhu, L., McGowan, S., Webb, G. I., & Buckle, A. M.
Protein Engineering, Design and Selection, 28(3), 67-78, 2015.
[URL] [Bibtex]

GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome.
Li, F., Li, C., Wang, M., Webb, G. I., Zhang, Y., Whisstock, J. C., & Song, J.
Bioinformatics, 31(9), 1411-1419, 2015.
[URL] [Bibtex] [Abstract]

@Article{LiEtAl15,
Title = {GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome},
Author = {F. Li and C. Li and M. Wang and G. I. Webb and Y. Zhang and J. C. Whisstock and J. Song},
Journal = {Bioinformatics},
Year = {2015},
Number = {9},
Pages = {1411-1419},
Volume = {31},
Abstract = {Motivation: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50% of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM.
Results: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1093/bioinformatics/btu852}
}
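
The two-step feature selection used in GlycoMine can be illustrated generically: a fast filter stage followed by a slower redundancy-aware refinement stage. The sketch below is my own simplification, not GlycoMine's pipeline; it filters by mutual information with the label and then refines with an mRMR-style greedy pass that penalises redundancy among the survivors:

```python
import math
from collections import Counter

def mi(xs, ys):
    """I(X;Y) for two parallel lists of discrete values (MLE estimates, in nats)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def two_step_select(X, y, n_filter, n_final):
    """Step 1: keep the n_filter features most informative about the label.
    Step 2: greedily grow a set of n_final features, trading relevance to
    the label against mean redundancy with already-selected features."""
    cols = list(zip(*X))
    ranked = sorted(range(len(cols)), key=lambda a: mi(cols[a], y),
                    reverse=True)[:n_filter]
    selected = [ranked[0]]
    while len(selected) < n_final:
        best = max((a for a in ranked if a not in selected),
                   key=lambda a: mi(cols[a], y)
                   - sum(mi(cols[a], cols[s]) for s in selected) / len(selected))
        selected.append(best)
    return selected
```

In the papers the second step is driven by the random forest's own performance rather than a mutual-information criterion, but the filter-then-refine shape is the same.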

Scaling log-linear analysis to datasets with thousands of variables.
Petitjean, F., & Webb, G. I.
Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 469-477, 2015.
[URL] [Bibtex] [Abstract]

@InProceedings{PetitjeanWebb15,
Title = {Scaling log-linear analysis to datasets with thousands of variables},
Author = {F. Petitjean and G.I. Webb},
Booktitle = {Proceedings of the 2015 {SIAM} International Conference on Data Mining},
Year = {2015},
Pages = {469-477},
Abstract = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.},
Comment = {Best Research Paper Honorable Mention Award},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {scalable-graphical-modeling},
Url = {http://epubs.siam.org/doi/pdf/10.1137/1.9781611974010.53}
}
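
A key ingredient of this line of work is keeping the graph supporting the log-linear model chordal. The standard linear-time chordality test (maximum cardinality search followed by a perfect-elimination-ordering check, due to Tarjan and Yannakakis) can be sketched as follows; this is textbook machinery, not code from the paper:

```python
def is_chordal(adj):
    """Maximum cardinality search (MCS) + perfect-elimination-ordering check.
    adj: dict mapping each vertex to a set of its neighbours."""
    weight = {v: 0 for v in adj}
    order, numbered = [], set()
    for _ in range(len(adj)):
        # MCS: repeatedly pick the vertex with the most already-numbered neighbours.
        v = max((u for u in adj if u not in numbered), key=lambda u: weight[u])
        order.append(v)
        numbered.add(v)
        for w in adj[v]:
            if w not in numbered:
                weight[w] += 1
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        earlier = [w for w in adj[v] if pos[w] < pos[v]]
        if earlier:
            u = max(earlier, key=lambda w: pos[w])
            # every other earlier neighbour of v must also neighbour u
            if any(w != u and w not in adj[u] for w in earlier):
                return False
    return True
```

A triangle passes the test while a four-cycle fails; the search described in the paper only ever adds edges that preserve this property, which is what keeps inference tractable.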

Highly Scalable Attribute Selection for AODE.
Chen, S., Martinez, A., & Webb, G. I.
Proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 86-97, 2014.
[PDF] [URL] [Bibtex] [Abstract]

@InProceedings{ChenEtAl14,
Title = {Highly Scalable Attribute Selection for AODE},
Author = {S. Chen and A. Martinez and G.I. Webb},
Booktitle = {Proceedings of the 18th {Pacific}-{Asia} Conference on Knowledge Discovery and Data Mining},
Year = {2014},
Pages = {86-97},
Abstract = {Averaged One-Dependence Estimators (AODE) is a popular and effective approach to Bayesian learning. In this paper, a new attribute selection approach is proposed for AODE. It can search in a large model space, while it requires only a single extra pass through the training data, resulting in a computationally efficient two-pass learning algorithm. The experimental results indicate that the new technique significantly reduces AODE's bias at the cost of a modest increase in training time. Its low bias and computational efficiency make it an attractive algorithm for learning from big data.},
Keywords = {Conditional Probability Estimation and AODE and DP140100087},
Related = {learning-complex-conditional-probabilities-from-data},
Url = {http://dx.doi.org/10.1007/978-3-319-06605-9_8}
}
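
AODE itself averages the joint-probability estimates of all one-dependence estimators, each of which conditions every attribute on the class and one shared "superparent" attribute. A compact, deliberately unoptimised Python sketch of prediction (my own illustration; real implementations precompute count tables rather than rescanning the data):

```python
from collections import Counter

def aode_predict(x, X, y, classes):
    """P(c, x) ~ mean over superparents sp of P(c, x_sp) * prod_a P(x_a | c, x_sp),
    with Laplace smoothing throughout."""
    n, n_attrs = len(X), len(x)
    pair = Counter((yi, a, xi[a]) for xi, yi in zip(X, y) for a in range(n_attrs))
    trip = Counter((yi, sp, xi[sp], a, xi[a])
                   for xi, yi in zip(X, y)
                   for sp in range(n_attrs) for a in range(n_attrs))
    vals = [len(set(xi[a] for xi in X) | {x[a]}) for a in range(n_attrs)]
    scores = {}
    for c in classes:
        total = 0.0
        for sp in range(n_attrs):
            # joint estimate P(c, x_sp)
            s = (pair[(c, sp, x[sp])] + 1) / (n + len(classes) * vals[sp])
            for a in range(n_attrs):
                if a == sp:
                    continue
                # conditional estimate P(x_a | c, x_sp)
                s *= (trip[(c, sp, x[sp], a, x[a])] + 1) / (pair[(c, sp, x[sp])] + vals[a])
            total += s
        scores[c] = total / n_attrs
    return max(scores, key=scores.get)
```

The attribute selection proposed in the paper then prunes which attributes participate as superparents and children, using one additional pass over the training data.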

A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data.
Petitjean, F., Allison, L., & Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 480-489, 2014.
[PDF] [URL] [Bibtex] [Abstract]

@InProceedings{PetitjeanEtAl14a,
Title = {A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data},
Author = {F. Petitjean and L. Allison and G.I. Webb},
Booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
Year = {2014},
Pages = {480-489},
Abstract = {Log-linear analysis is the primary statistical approach to discovering conditional dependencies between the variables of a dataset. A good log-linear analysis method requires both high precision and statistical efficiency. High precision means that the risk of false discoveries should be kept very low. Statistical efficiency means that the method should discover actual associations with as few samples as possible. Classical approaches to log-linear analysis make use of χ2 tests to control this balance between quality and complexity. We present an information-theoretic approach to log-linear analysis. We show that our approach 1) requires significantly fewer samples to discover the true associations than statistical approaches -- statistical efficiency -- 2) controls for the risk of false discoveries as well as statistical approaches -- high precision -- and 3) can perform the discovery on datasets with hundreds of variables on a standard desktop computer -- computational efficiency.},
Comment = {One of nine papers invited to Knowledge and Information Systems journal ICDM-14 special issue},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and DP140100087},
Related = {scalable-graphical-modeling},
Url = {http://dx.doi.org/10.1109/ICDM.2014.23}
}

Accurate in Silico Identification of Species-Specific Acetylation Sites by Integrating Protein Sequence-Derived and Functional Features.
Li, Y., Wang, M., Wang, H., Tan, H., Zhang, Z., Webb, G. I., & Song, J.
Scientific Reports, 4, Art. no. 5765, 2014.
[URL] [Bibtex] [Abstract]

@Article{LiEtAl2014,
Title = {Accurate in Silico Identification of Species-Specific Acetylation Sites by Integrating Protein Sequence-Derived and Functional Features},
Author = {Y. Li and M. Wang and H. Wang and H. Tan and Z. Zhang and G. I. Webb and J. Song},
Journal = {Scientific Reports},
Year = {2014},
Volume = {4},
Abstract = {Lysine acetylation is a reversible post-translational modification, playing an important role in cytokine signaling, transcriptional regulation, and apoptosis. To fully understand acetylation mechanisms, identification of substrates and specific acetylation sites is crucial. Experimental identification is often time-consuming and expensive. Alternative bioinformatics methods are cost-effective and can be used in a high-throughput manner to generate relatively precise predictions. Here we develop a method termed as SSPKA for species-specific lysine acetylation prediction, using random forest classifiers that combine sequence-derived and functional features with two-step feature selection. Feature importance analysis indicates functional features, applied for lysine acetylation site prediction for the first time, significantly improve the predictive performance. We apply the SSPKA model to screen the entire human proteome and identify many high-confidence putative substrates that are not previously identified. The results along with the implemented Java tool, serve as useful resources to elucidate the mechanism of lysine acetylation and facilitate hypothesis-driven experimental design and validation.},
Articlenumber = {5765},
Keywords = {Bioinformatics and DP140100087},
Related = {computational-biology},
Url = {http://dx.doi.org/10.1038/srep05765}
}

Naive-Bayes Inspired Effective Pre-Conditioner for Speeding-up Logistic Regression.
Zaidi, N., Carman, M., Cerquides, J., & Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 1097-1102, 2014.
[PDF] [URL] [Bibtex] [Abstract]

@InProceedings{ZaidiEtAl14,
Title = {Naive-{Bayes} Inspired Effective Pre-Conditioner for Speeding-up Logistic Regression},
Author = {N. Zaidi and M. Carman and J. Cerquides and G.I. Webb},
Booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
Year = {2014},
Pages = {1097-1102},
Abstract = {We propose an alternative parameterization of Logistic Regression (LR) for the categorical data, multi-class setting. LR optimizes the conditional log-likelihood over the training data and is based on an iterative optimization procedure to tune this objective function. The optimization procedure employed may be sensitive to scale and hence an effective pre-conditioning method is recommended. Many problems in machine learning involve arbitrary scales or categorical data (where simple standardization of features is not applicable). The problem can be alleviated by using optimization routines that are invariant to scale such as (second-order) Newton methods. However, computing and inverting the Hessian is a costly procedure and not feasible for big data. Thus one must often rely on first-order methods such as gradient descent (GD), stochastic gradient descent (SGD) or approximate second-order methods such as quasi-Newton (QN) routines, which are not invariant to scale. This paper proposes a simple yet effective pre-conditioner for speeding-up LR based on naive Bayes conditional probability estimates. The idea is to scale each attribute by the log of the conditional probability of that attribute given the class. This formulation substantially speeds up LR's convergence. It also provides a weighted naive Bayes formulation which yields an effective framework for hybrid generative-discriminative classification.},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://dx.doi.org/10.1109/ICDM.2014.53}
}
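
The core trick can be made concrete: represent each categorical attribute value by its smoothed naive Bayes log conditional probability, so that unit weights already reproduce naive Bayes and gradient descent merely reweights well-scaled features. A small illustrative sketch (function name and smoothing choices are mine, not the paper's code):

```python
import math
from collections import Counter

def nb_log_features(X, y, classes, x, c):
    """Per-class feature vector for instance x: [log P(c), log P(x_a | c), ...],
    with Laplace smoothing. With all weights fixed at 1, summing the vector
    gives the naive Bayes joint log-probability, so LR parameterized over these
    features starts from the naive Bayes solution and only has to fine-tune."""
    n = len(X)
    class_counts = Counter(y)
    feats = [math.log((class_counts[c] + 1) / (n + len(classes)))]
    for a, v in enumerate(x):
        vals = len(set(xi[a] for xi in X) | {v})
        cnt = sum(1 for xi, yi in zip(X, y) if yi == c and xi[a] == v)
        feats.append(math.log((cnt + 1) / (class_counts[c] + vals)))
    return feats
```

Because every feature is already on a log-probability scale, the objective is far better conditioned than one over raw categorical indicators, which is the source of the speed-up reported above.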

Scaling log-linear analysis to high-dimensional data.
Petitjean, F., Webb, G. I., & Nicholson, A. E.
Proceedings of the 13th IEEE International Conference on Data Mining, pp. 597-606, 2013.
[PDF] [DOI] [Bibtex] [Abstract]

@InProceedings{PetitjeanEtAl13,
Title = {Scaling log-linear analysis to high-dimensional data},
Author = {F. Petitjean and G. I. Webb and A. E. Nicholson},
Booktitle = {Proceedings of the 13th {IEEE} International Conference on Data Mining},
Year = {2013},
Pages = {597-606},
Abstract = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We develop an efficient approach to log-linear analysis that scales to hundreds of variables by melding the classical statistical machinery of log-linear analysis with advanced data mining techniques from association discovery and graphical modeling.},
Doi = {10.1109/ICDM.2013.17},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {scalable-graphical-modeling}
}