Data Scientist


Scalable Graphical Modeling

Graphical models are powerful descriptions of joint probability distributions. We have developed techniques for efficiently scaling exact methods to thousands of variables.

Tutorial at ICDM-15, with Francois Petitjean.

Presentation of our award-winning paper at SDM-15.

Download the software.

Check out the promotional video for our KDD-2016 paper:

Check out our interactive visualisation of the network Chordalysis learns about associated stock movements within the S&P500:

Publications

Scalable Learning of Graphical Models.
Petitjean, F., & Webb, G. I.
Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 2131-2132, 2016.

@InProceedings{PetitjeanWebbTut16,
Title = {Scalable Learning of Graphical Models},
Author = {F. Petitjean and G.I. Webb},
Booktitle = {Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16},
Year = {2016},
Pages = {2131--2132},
Publisher = {ACM Press},
Keywords = {scalable graphical models and Learning from large datasets and DP140100087},
Related = {scalable-graphical-modeling},
Url = {http://dl.acm.org/authorize?N19101}
}

Scaling log-linear analysis to datasets with thousands of variables.
Petitjean, F., & Webb, G. I.
Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 469-477, 2015.

@InProceedings{PetitjeanWebb15,
Title = {Scaling log-linear analysis to datasets with thousands of variables},
Author = {F. Petitjean and G.I. Webb},
Booktitle = {Proceedings of the 2015 {SIAM} International Conference on Data Mining},
Year = {2015},
Pages = {469--477},
Abstract = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.},
Comment = {Best Research Paper Honorable Mention Award},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {scalable-graphical-modeling},
Url = {http://epubs.siam.org/doi/pdf/10.1137/1.9781611974010.53}
}
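The abstract above notes that the graph supporting the log-linear model is kept chordal and that earlier techniques examined every edge at every step. As a rough illustration only, the following Python sketch shows that naive baseline: it tests each absent edge and keeps those whose addition leaves the graph chordal. The function names `is_chordal` and `chordal_candidate_edges` and the adjacency-dictionary representation are assumptions made for this example; this is not the paper's algorithm, which is designed to avoid exactly this exhaustive scan.

```python
from itertools import combinations


def is_chordal(adj):
    """Return True if the undirected graph is chordal.

    adj maps each vertex to the set of its neighbours. A graph is chordal
    exactly when its vertices can be removed one at a time, each being
    simplicial (its remaining neighbours form a clique) when it is removed.
    """
    remaining = {v: set(ns) for v, ns in adj.items()}
    while remaining:
        for v, ns in remaining.items():
            # v is simplicial if every pair of its neighbours is adjacent.
            if all(b in remaining[a] for a, b in combinations(ns, 2)):
                break
        else:
            return False  # no simplicial vertex left, so the graph is not chordal
        for n in remaining[v]:
            remaining[n].discard(v)
        del remaining[v]
    return True


def chordal_candidate_edges(adj):
    """List absent edges whose addition keeps the graph chordal (naive scan)."""
    candidates = []
    for a, b in combinations(sorted(adj), 2):
        if b in adj[a]:
            continue  # edge already present
        trial = {v: set(ns) for v, ns in adj.items()}
        trial[a].add(b)
        trial[b].add(a)
        if is_chordal(trial):
            candidates.append((a, b))
    return candidates


# Path graph 0-1-2-3: adding (0, 3) would close a chordless 4-cycle.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(chordal_candidate_edges(path))  # [(0, 2), (1, 3)]
```

Each call re-checks chordality from scratch for every absent edge, so the cost grows quickly with the number of variables; the sketch is meant only to make the chordality constraint concrete.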

A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data.
Petitjean, F., Allison, L., & Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 480-489, 2014.

@InProceedings{PetitjeanEtAl14a,
Title = {A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data},
Author = {F. Petitjean and L. Allison and G.I. Webb},
Booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
Year = {2014},
Pages = {480--489},
Abstract = {Log-linear analysis is the primary statistical approach to discovering conditional dependencies between the variables of a dataset. A good log-linear analysis method requires both high precision and statistical efficiency. High precision means that the risk of false discoveries should be kept very low. Statistical efficiency means that the method should discover actual associations with as few samples as possible. Classical approaches to log-linear analysis make use of $\chi^2$ tests to control this balance between quality and complexity. We present an information-theoretic approach to log-linear analysis. We show that our approach 1) requires significantly fewer samples to discover the true associations than statistical approaches -- statistical efficiency -- 2) controls for the risk of false discoveries as well as statistical approaches -- high precision -- and 3) can perform the discovery on datasets with hundreds of variables on a standard desktop computer -- computational efficiency.},
Comment = {One of nine papers invited to Knowledge and Information Systems journal ICDM-14 special issue},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and DP140100087},
Related = {scalable-graphical-modeling},
Url = {http://dx.doi.org/10.1109/ICDM.2014.23}
}

Scaling log-linear analysis to high-dimensional data.
Petitjean, F., Webb, G. I., & Nicholson, A. E.
Proceedings of the 13th IEEE International Conference on Data Mining, pp. 597-606, 2013.

@InProceedings{PetitjeanEtAl13,
Title = {Scaling log-linear analysis to high-dimensional data},
Author = {F. Petitjean and G. I. Webb and A. E. Nicholson},
Booktitle = {Proceedings of the 13th {IEEE} International Conference on Data Mining},
Year = {2013},
Pages = {597--606},
Abstract = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We develop an efficient approach to log-linear analysis that scales to hundreds of variables by melding the classical statistical machinery of log-linear analysis with advanced data mining techniques from association discovery and graphical modeling.},
Doi = {10.1109/ICDM.2013.17},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {scalable-graphical-modeling}
}