About Geoff Webb

Data Scientist

Statistical hypothesis testing was developed in an age when calculations were performed by hand and individual hypotheses were considered in isolation.

In the modern information era, it is often desirable to assess large numbers of related hypotheses. Unless explicit steps are taken to control the risk, it becomes a near certainty that if large numbers of the hypotheses that one seeks to reject are in fact true then many will be rejected in error. Multiple testing corrections control this risk, but previous approaches have required that the hypotheses to be assessed (or at least an upper bound on their number) are known in advance.
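The scale of the problem can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that all m null hypotheses are true and that the tests are independent; under those assumptions the chance of at least one false rejection is 1 - (1 - alpha)^m. It also shows the classical Bonferroni adjustment, which needs m to be known in advance. The function names are hypothetical.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false rejection among m tests.

    Assumes all m null hypotheses are true and the tests are
    independent (an illustrative simplification).
    """
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_level(alpha: float, m: int) -> float:
    """Per-test significance level under the Bonferroni correction.

    Requires the number of tests m to be fixed before testing starts.
    """
    return alpha / m


if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        uncorrected = familywise_error_rate(0.05, m)
        corrected = familywise_error_rate(bonferroni_level(0.05, m), m)
        print(f"m={m:5d}  uncorrected={uncorrected:.4f}  bonferroni={corrected:.4f}")
```

With a per-test level of 0.05 the uncorrected risk approaches 1 as m grows, while the Bonferroni-adjusted risk stays below 0.05, but only because m was supplied up front.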

Sometimes, however, there is a need to test some hypotheses before other related hypotheses are known. We call this a hypothesis stream. In some hypothesis streams, the decision of which hypotheses to reject at one time affects which hypotheses will be considered subsequently. We call this a hypothesis cascade.

One context in which hypothesis cascades arise is model selection, such as forward feature selection.  This process starts with an empty model.  Then, at each step all models are considered that add a single feature to the current model. A hypothesis test is conducted on every such new model.  The model with the best test result (lowest p) is then accepted, and the process is repeated considering all single feature additions to this new model. This is repeated until no improvement is statistically significant.  The model selected at each round determines the models to be tested subsequently, and it is not possible to determine in advance how many rounds it will take until the process terminates and hence how many statistical tests will be performed.
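The structure of such a cascade can be sketched in a few lines. This is a minimal illustration of plain forward selection with no multiple testing correction applied; it is not the procedure from our paper. The callable `p_value` and the names `forward_select` and `toy_p` are hypothetical stand-ins for whatever test a real application would use.

```python
from typing import Callable, FrozenSet, Sequence, Tuple


def forward_select(
    features: Sequence[str],
    p_value: Callable[[FrozenSet[str], str], float],
    alpha: float = 0.05,
) -> Tuple[FrozenSet[str], int]:
    """Uncorrected forward selection; returns (model, number of tests run)."""
    model: FrozenSet[str] = frozenset()  # start with an empty model
    tests_run = 0
    while True:
        # The pool of hypotheses depends on what was accepted so far.
        candidates = [f for f in features if f not in model]
        if not candidates:
            break
        scored = [(p_value(model, f), f) for f in candidates]
        tests_run += len(scored)
        best_p, best_f = min(scored, key=lambda pair: pair[0])
        if best_p > alpha:  # no addition is significant: stop
            break
        model = model | {best_f}
    return model, tests_run


def toy_p(model: FrozenSet[str], feature: str) -> float:
    """Made-up p-values for demonstration only."""
    return {"a": 0.001, "b": 0.01, "c": 0.6}[feature]


if __name__ == "__main__":
    chosen, n_tests = forward_select(["a", "b", "c"], toy_p)
    print(sorted(chosen), n_tests)
```

The total number of tests is only known once the loop exits, which is why a correction that needs that number in advance cannot be applied directly.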

We have developed a statistical testing procedure that strictly controls the risk of familywise error - the risk of incorrectly rejecting any null hypothesis out of the entire stream. It does so without requiring any information other than the history of the analysis so far and the current pool of hypotheses under consideration. Studies of its performance in the context of selecting the edges to include in a graphical model indicate that it achieves this strict control while maintaining high statistical power.

Statistical hypothesis testing provides a powerful and mature toolkit for data analytics. The techniques described here help bring that toolkit to bear in the age of big data.

Watch the promotional video for our paper:

http://www.youtube.com/watch?v=FBlhhebFhTI

Publication

A multiple test correction for streams and cascades of statistical hypothesis tests.
Webb, G. I., & Petitjean, F.
Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 1255-1264, 2016.
Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue

@InProceedings{WebbPetitjean16,
Title = {A multiple test correction for streams and cascades of statistical hypothesis tests},
Author = {Webb, Geoffrey I and Petitjean, Francois},
Booktitle = {Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16},
Year = {2016},
Pages = {1255-1264},
Publisher = {ACM Press},
Abstract = {Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.
To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.},
Comment = {Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue},
Doi = {10.1145/2939672.2939775},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {statistically-sound-association-discovery},
Url = {http://dl.acm.org/authorize?N19100}
}

Many machine learning algorithms implicitly assume that numeric data are interval scale, that is, that a unit difference between values has the same meaning irrespective of the magnitude of those values. For example, most distance measures rate the values 19 and 21 as equally distant from 20. However, suppose the three values are measurements of miles per gallon. It may be arbitrary whether the variable was recorded as miles per gallon or as gallons per mile. Expressed as the latter, the values would be 0.0526, 0.0500 and 0.0476. The value corresponding to 20 (0.0500) would then be closer to the value corresponding to 21 (0.0476) than to the value corresponding to 19 (0.0526).
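The arithmetic behind this example is easy to check. The following short sketch uses absolute difference as the distance, which is an illustrative choice, and compares the gaps in both units; the function name `gap` is hypothetical.

```python
def gap(a: float, b: float) -> float:
    """Absolute difference, standing in for a conventional distance."""
    return abs(a - b)


# Distances from 20 in miles per gallon.
mpg_gap_to_19 = gap(20.0, 19.0)
mpg_gap_to_21 = gap(20.0, 21.0)

# The same three measurements expressed as gallons per mile.
gpm_gap_to_19 = gap(1 / 20.0, 1 / 19.0)
gpm_gap_to_21 = gap(1 / 20.0, 1 / 21.0)

if __name__ == "__main__":
    print(f"mpg: |20-19|={mpg_gap_to_19:.4f}  |20-21|={mpg_gap_to_21:.4f}")
    print(f"gpm: to 19 -> {gpm_gap_to_19:.4f}  to 21 -> {gpm_gap_to_21:.4f}")
```

In miles per gallon the two gaps are equal; in gallons per mile the gap to the value for 21 is the smaller of the two.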

Examples of learning algorithms that make this assumption include any that use conventional distance measures and most linear classifiers.

Our studies [1] have shown that for tasks as diverse as information retrieval and clustering, applying transformations to the data, such as replacing values by their square root or natural logarithm, often improves performance, indicating that the interval scale assumption is not justified.

A weaker assumption than the interval scale assumption is that numeric data are ordinal. Under this assumption, order matters, but the magnitudes of differences between values are not specified. Hence, for ordinal data, we can assert that 21 is more similar to 20 than it is to 19, but not that 21 is more similar to 20 than 18 is. Our studies have shown that converting data to ranks often improves performance across a range of machine learning algorithms [1].
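A rank transform makes the ordinal view concrete. The sketch below uses dense ranking (ties share a rank), which is an assumption made for illustration and not necessarily the scheme used in [1]; `to_ranks` is a hypothetical helper. It shows that ranks are unchanged by a monotone increasing transform such as the logarithm, and simply reversed by the reciprocal.

```python
import math
from typing import List, Sequence


def to_ranks(values: Sequence[float]) -> List[int]:
    """Replace each value by its 1-based dense rank among the values."""
    ordered = sorted(set(values))
    rank_of = {v: r for r, v in enumerate(ordered, start=1)}
    return [rank_of[v] for v in values]


mpg = [19.0, 20.0, 21.0, 35.0]

ranks_raw = to_ranks(mpg)
ranks_log = to_ranks([math.log(v) for v in mpg])   # monotone increasing
ranks_inverse = to_ranks([1.0 / v for v in mpg])   # monotone decreasing

if __name__ == "__main__":
    print(ranks_raw, ranks_log, ranks_inverse)
```

Because only order survives the transform, the large gap between 21 and 35 carries no more weight than the gap between 19 and 20.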

However, conversion to ranks entails a significant computational overhead if a learned model is to be applied to unseen data. Mapping a new value onto a rank in the training set is an operation of order log n, where n is the training set size. In consequence, it can be advantageous to use algorithms that assume only that data are ordinal scale, as do decision trees and algorithms built thereon, such as random forests.
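The logarithmic cost comes from a binary search over the sorted training values. A minimal sketch using Python's standard `bisect` module is given below; the class name `RankMapper` is hypothetical, and defining the rank of an unseen value as the number of training values strictly below it is one possible convention, assumed here for illustration.

```python
import bisect
from typing import List, Sequence


class RankMapper:
    """Maps unseen values onto ranks within a stored training sample."""

    def __init__(self, training_values: Sequence[float]) -> None:
        # Sorting costs O(n log n) once; the sorted copy must be kept,
        # which is the storage overhead of the rank transform.
        self._sorted: List[float] = sorted(training_values)

    def rank(self, x: float) -> int:
        """Number of training values strictly less than x, in O(log n)."""
        return bisect.bisect_left(self._sorted, x)


if __name__ == "__main__":
    mapper = RankMapper([5.0, 1.0, 3.0, 9.0])
    print([mapper.rank(v) for v in (0.0, 3.0, 4.0, 10.0)])
```

Every prediction on new data pays this lookup for each numeric attribute, whereas a decision tree only compares the raw value against stored thresholds.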

Reference

[1] Fernando, T. L., & Webb, G. I. (2017). SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data Mining and Knowledge Discovery, 31(1), 264-286.
@Article{FernandoWebb16,
Title = {SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption},
Author = {Fernando, Thilak L. and Webb, Geoffrey I.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2017},
Number = {1},
Pages = {264-286},
Volume = {31},
Abstract = {Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering for different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are interval scale. In the interval scale, it is assumed that a unit difference has the same meaning irrespective of the magnitudes of the values separated. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure which does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications: DBScan clustering and content based multimedia information retrieval with real world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.},
Doi = {10.1007/s10618-016-0463-0}
}

I have had a nice variety of recent papers!

Concept drift:

  • Characterizing Concept Drift.
    Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L., & Petitjean, F.
    Data Mining and Knowledge Discovery, 30(4), 964-994, 2016.
    @Article{WebbEtAl16,
    Title = {Characterizing Concept Drift},
    Author = {G.I. Webb and R. Hyde and H. Cao and H.L. Nguyen and F. Petitjean},
    Journal = {Data Mining and Knowledge Discovery},
    Year = {2016},
    Number = {4},
    Pages = {964-994},
    Volume = {30},
    Abstract = {Most machine learning models are static, but the world is dynamic, and increasing online deployment of learned models gives increasing urgency to the development of efficient and effective mechanisms to address learning in the context of non-stationary distributions, or as it is commonly called concept drift. However, the key issue of characterizing the different types of drift that can occur has not previously been subjected to rigorous definition and analysis. In particular, while some qualitative drift categorizations have been proposed, few have been formally defined, and the quantitative descriptions required for precise and objective understanding of learner performance have not existed. We present the first comprehensive framework for quantitative analysis of drift. This supports the development of the first comprehensive set of formal definitions of types of concept drift. The formal definitions clarify ambiguities and identify gaps in previous definitions, giving rise to a new comprehensive taxonomy of concept drift types and a solid foundation for research into mechanisms to detect and address concept drift.},
    Doi = {10.1007/s10618-015-0448-4},
    Keywords = {Concept Drift},
    Related = {learning-from-non-stationary-distributions},
    Url = {http://arxiv.org/abs/1511.03816},
    Urltext = {Link to prepublication draft}
    }

Pitfalls of assuming data are interval scale:

  • SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption.
    Fernando, T. L., & Webb, G. I.
    Data Mining and Knowledge Discovery, 31(1), 264-286, 2017.

Statistical testing of streams and cascades of hypotheses:

  • A multiple test correction for streams and cascades of statistical hypothesis tests.
    Webb, G. I., & Petitjean, F.
    Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 1255-1264, 2016.

Scalable learning of Bayesian network classifiers:

  • Selective AnDE for large data learning: a low-bias memory constrained approach.
    Chen, S., Martínez, A. M., Webb, G. I., & Wang, L.
    Knowledge and Information Systems, 50(2), 475-503, 2017.
    @Article{ChenEtAl16,
    Title = {Selective AnDE for large data learning: a low-bias memory constrained approach},
    Author = {Chen, Shenglei
    and Mart{\'i}nez, Ana M.
    and Webb, Geoffrey I.
    and Wang, Limin},
    Journal = {Knowledge and Information Systems},
    Year = {2017},
    Number = {2},
    Pages = {475-503},
    Volume = {50},
    Abstract = {Learning from data that are too big to fit into memory poses great challenges to currently available learning approaches. Averaged n-Dependence Estimators (AnDE) allows for a flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence, AnDE is especially appropriate for learning from large quantities of data. Memory requirement in AnDE, however, increases combinatorially with the number of attributes and the parameter n. In large data learning, number of attributes is often large and we also expect high n to achieve low-bias classification. In order to achieve the lower bias of AnDE with higher n but with less memory requirement, we propose a memory constrained selective AnDE algorithm, in which two passes of learning through training examples are involved. The first pass performs attribute selection on super parents according to available memory, whereas the second one learns an AnDE model with parents only on the selected attributes. Extensive experiments show that the new selective AnDE has considerably lower bias and prediction error relative to A$n'$DE, where $n' = n-1$, while maintaining the same space complexity and similar time complexity. The proposed algorithm works well on categorical data. Numerical data sets need to be discretized first.},
    Doi = {10.1007/s10115-016-0937-9},
    ISSN = {0219-3116},
    Keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
    Related = {learning-complex-conditional-probabilities-from-data}
    }
  • Sample-based Attribute Selective AnDE for Large Data.
    Chen, S., Martinez, A., Webb, G., & Wang, L.
    IEEE Transactions on Knowledge and Data Engineering, 29(1), 172-185, 2017.
  • Scalable Learning of Bayesian Network Classifiers.
    Martinez, A. M., Webb, G. I., Chen, S., & Zaidi, N. A.
    Journal of Machine Learning Research, 17(44), 1-35, 2016.
    @Article{MartinezEtAl16,
    Title = {Scalable Learning of {Bayesian} Network Classifiers},
    Author = {Ana M. Martinez and Geoffrey I. Webb and Shenglei Chen and Nayyar A. Zaidi},
    Journal = {Journal of Machine Learning Research},
    Year = {2016},
    Number = {44},
    Pages = {1-35},
    Volume = {17},
    Abstract = {Ever increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, capacity to learn very complex multivariate probability distributions) is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub-model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.},
    Keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
    Related = {learning-complex-conditional-probabilities-from-data},
    Url = {http://jmlr.org/papers/v17/martinez16a.html}
    }

Time series classification:

Sequential pattern discovery:

  • Skopus: Mining top-k sequential patterns under leverage.
    Petitjean, F., Li, T., Tatti, N., & Webb, G. I.
    Data Mining and Knowledge Discovery, 30(5), 1086-1111, 2016.
    @Article{PetitjeanEtAl16b,
    Title = {Skopus: Mining top-k sequential patterns under leverage},
    Author = {Petitjean, Francois
    and Li, Tao
    and Tatti, Nikolaj
    and Webb, Geoffrey I.},
    Journal = {Data Mining and Knowledge Discovery},
    Year = {2016},
    Number = {5},
    Pages = {1086-1111},
    Volume = {30},
    Abstract = {This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern---a concept on which most interestingness measures directly rely---with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.},
    Doi = {10.1007/s10618-016-0467-9},
    ISSN = {1573-756X},
    Keywords = {OPUS and Association Rule Discovery and statistically sound discovery},
    Related = {statistically-sound-association-discovery},
    Url = {http://rdcu.be/tsDo}
    }

Using generative parameterisations to scale up discriminative learning:

  • ALRn: Accelerated higher-order logistic regression.
    Zaidi, N. A., Webb, G. I., Carman, M. J., Petitjean, F., & Cerquides, J.
    Machine Learning, 104(2), 151-194, 2016.
    @Article{ZaidiEtAl16b,
    Title = {{ALRn}: Accelerated higher-order logistic regression},
    Author = {Zaidi, Nayyar A.
    and Webb, Geoffrey I.
    and Carman, Mark J.
    and Petitjean, Fran{\c{c}}ois
    and Cerquides, Jes{\'u}s},
    Journal = {Machine Learning},
    Year = {2016},
    Number = {2},
    Pages = {151-194},
    Volume = {104},
    Abstract = {This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.},
    Doi = {10.1007/s10994-016-5574-8},
    ISSN = {1573-0565},
    Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
    Related = {combining-generative-and-discriminative-learning},
    Url = {http://dx.doi.org/10.1007/s10994-016-5574-8}
    }
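The abstract above describes higher-order features as Cartesian products of the original features. The following is a minimal illustrative sketch of that idea only, not the ALRn implementation: the function name `higher_order_features` and the choice to key each combined feature by its attribute indices are assumptions made for this example.

```python
from itertools import combinations


def higher_order_features(row, order=2):
    """Combine every `order`-sized subset of attributes into one joint feature.

    Each joint feature takes its value in the Cartesian product of the
    domains of the attributes it combines. The result maps a tuple of
    attribute indices to the tuple of values observed in `row`.
    """
    features = {}
    for indices in combinations(range(len(row)), order):
        features[indices] = tuple(row[i] for i in indices)
    return features


# One categorical instance with three attributes (hypothetical values).
instance = ["red", "small", "round"]
for indices, value in higher_order_features(instance, order=2).items():
    print(indices, value)
```

With three attributes and `order=2` this yields three joint features. The number of joint features grows combinatorially with the order, which is one reason the efficiency of training matters for models of this kind.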

My first neural network paper:

  • Preconditioning an Artificial Neural Network Using Naive Bayes.
    Zaidi, N. A., Petitjean, F., & Webb, G. I.
    Proceedings of the 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2016, pp. 341-353, 2016.
    @InProceedings{ZaidiEtAl16,
    Title = {Preconditioning an Artificial Neural Network Using Naive {Bayes}},
    Author = {Zaidi, Nayyar A.
    and Petitjean, Fran{\c{c}}ois
    and Webb, Geoffrey I.},
    Booktitle = {Proceedings of the 20th {Pacific-Asia} Conference on Advances in Knowledge Discovery and Data Mining, {PAKDD} 2016},
    Year = {2016},
    Editor = {Bailey, James
    and Khan, Latifur
    and Washio, Takashi
    and Dobbie, Gill
    and Huang, Zhexue Joshua
    and Wang, Ruili},
    Pages = {341-353},
    Publisher = {Springer International Publishing},
    Abstract = {Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.},
    Doi = {10.1007/978-3-319-31753-3_28},
    ISBN = {978-3-319-31753-3},
    Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
    Related = {combining-generative-and-discriminative-learning},
    Url = {http://dx.doi.org/10.1007/978-3-319-31753-3_28}
    }
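The abstract above contrasts two objectives for the same linear model: the Conditional Log-Likelihood (CLL) and the mean-square-error (MSE). The sketch below shows how the two losses differ for a single example. It is an illustration under assumptions, not the paper's code: the softmax output layer, the one-hot target encoding, the unscaled sum of squared errors, and all function names are choices made for this example.

```python
import math


def linear_scores(weights, bias, x):
    """One score per class: the dot product of that class's weights with x, plus bias."""
    return [sum(w * v for w, v in zip(class_weights, x)) + b
            for class_weights, b in zip(weights, bias)]


def softmax(scores):
    """Turn class scores into probabilities (shifted by the max for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cll_loss(probs, label):
    """Negative conditional log-likelihood of the true class."""
    return -math.log(probs[label])


def mse_loss(probs, label):
    """Sum of squared differences between the probabilities and a one-hot target."""
    return sum((p - (1.0 if c == label else 0.0)) ** 2
               for c, p in enumerate(probs))


# Two classes, two inputs, all-zero parameters: both classes get probability 0.5.
probs = softmax(linear_scores([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [1.0, 2.0]))
print(cll_loss(probs, 0), mse_loss(probs, 0))
```

Both losses are computed from the same linear scores, so the model class is identical and only the training objective changes. The MSE loss is bounded for each example, whereas the CLL loss grows without bound as the probability of the true class approaches zero.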

My first fuzzy rules paper:

  • Mining significant association rules from uncertain data.
    Zhang, A., Shi, W., & Webb, G. I.
    Data Mining and Knowledge Discovery, 30(4), 928-963, 2016.
    @Article{ZhangEtAl16,
    Title = {Mining significant association rules from uncertain data},
    Author = {Zhang, Anshu and Shi, Wenzhong and Webb, Geoffrey I.},
    Journal = {Data Mining and Knowledge Discovery},
    Year = {2016},
    Number = {4},
    Pages = {928-963},
    Volume = {30},
    Abstract = {In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.},
    Doi = {10.1007/s10618-015-0446-6},
    Keywords = {Association Rule Discovery and statistically sound discovery},
    Publisher = {Springer},
    Related = {statistically-sound-association-discovery}
    }

Computational biology:

  • Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli.
    Chang, C. C. H., Li, C., Webb, G. I., Tey, B., & Song, J.
    Scientific Reports, 6, Art. no. 21844, 2016.
    @Article{ChangEtAl2016,
    Title = {Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli},
    Author = {C.C.H. Chang and C. Li and G. I. Webb and B. Tey and J. Song},
    Journal = {Scientific Reports},
    Year = {2016},
    Volume = {6},
    Abstract = {Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement with current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture, to predict the real-valued expression level of target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) classifier to be used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78\% and a Pearson's correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.},
    Articlenumber = {21844},
    Keywords = {Bioinformatics and DP140100087},
    Related = {computational-biology},
    Url = {http://dx.doi.org/10.1038/srep21844}
    }
  • GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features.
    Li, F., Li, C., Revote, J., Zhang, Y., Webb, G. I., Li, J., Song, J., & Lithgow, T.
    Scientific Reports, 6, Art. no. 34595, 2016.
    @Article{LiEtAl16,
    Title = {GlycoMinestruct: a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features},
    Author = {Li, Fuyi and Li, Chen and Revote, Jerico and Zhang, Yang and Webb, Geoffrey I. and Li, Jian and Song, Jiangning and Lithgow, Trevor},
    Journal = {Scientific Reports},
    Year = {2016},
    Month = oct,
    Volume = {6},
    Articlenumber = {34595},
    Doi = {10.1038/srep34595},
    Keywords = {Bioinformatics and DP140100087},
    Related = {computational-biology}
    }
  • Smoothing a rugged protein folding landscape by sequence-based redesign.
    Porebski, B. T., Keleher, S., Hollins, J. J., Nickson, A. A., Marijanovic, E. M., Borg, N. A., Costa, M. G. S., Pearce, M. A., Dai, W., Zhu, L., Irving, J. A., Hoke, D. E., Kass, I., Whisstock, J. C., Bottomley, S. P., Webb, G. I., McGowan, S., & Buckle, A. M.
    Scientific Reports, 6, Art. no. 33958, 2016.
    @Article{Porebski2016,
    Title = {Smoothing a rugged protein folding landscape by sequence-based redesign},
    Author = {Porebski, Benjamin T. and Keleher, Shani and Hollins, Jeffrey J. and Nickson, Adrian A. and Marijanovic, Emilia M. and Borg, Natalie A. and Costa, Mauricio G. S. and Pearce, Mary A. and Dai, Weiwen and Zhu, Liguang and Irving, James A. and Hoke, David E. and Kass, Itamar and Whisstock, James C. and Bottomley, Stephen P. and Webb, Geoffrey I. and McGowan, Sheena and Buckle, Ashley M.},
    Journal = {Scientific Reports},
    Year = {2016},
    Volume = {6},
    Abstract = {The rugged folding landscapes of functional proteins puts them at risk of misfolding and aggregation. Serine protease inhibitors, or serpins, are paradigms for this delicate balance between function and misfolding. Serpins exist in a metastable state that undergoes a major conformational change in order to inhibit proteases. However, conformational lability of the native serpin fold renders them susceptible to misfolding, which underlies misfolding diseases such as alpha1-antitrypsin deficiency. To investigate how serpins balance function and folding, we used consensus design to create conserpin, a synthetic serpin that folds reversibly, is functional, thermostable, and polymerization resistant. Characterization of its structure, folding and dynamics suggest that consensus design has remodeled the folding landscape to reconcile competing requirements for stability and function. This approach may offer general benefits for engineering functional proteins that have risky folding landscapes, including the removal of aggregation-prone intermediates, and modifying scaffolds for use as protein therapeutics.},
    Articlenumber = {33958},
    Doi = {10.1038/srep33958},
    Keywords = {Bioinformatics and DP140100087},
    Related = {computational-biology},
    Url = {http://dx.doi.org/10.1038/srep33958}
    }
  • Crysalis: an integrated server for computational analysis and design of protein crystallization.
    Wang, H., Feng, L., Zhang, Z., Webb, G. I., Lin, D., & Song, J.
    Scientific Reports, 6, Art. no. 21383, 2016.
    @Article{WangEtAl16,
    Title = {Crysalis: an integrated server for computational analysis and design of protein crystallization},
    Author = {Wang, H. and Feng, L. and Zhang, Z. and Webb, G. I. and Lin, D. and Song, J.},
    Journal = {Scientific Reports},
    Year = {2016},
    Volume = {6},
    Abstract = {The failure of multi-step experimental procedures to yield diffraction-quality crystals is a major bottleneck in protein structure determination. Accordingly, several bioinformatics methods have been successfully developed and employed to select crystallizable proteins. Unfortunately, the majority of existing in silico methods only allow the prediction of crystallization propensity, seldom enabling computational design of protein mutants that can be targeted for enhancing protein crystallizability. Here, we present Crysalis, an integrated crystallization analysis tool that builds on support-vector regression (SVR) models to facilitate computational protein crystallization prediction, analysis, and design. More specifically, the functionality of this new tool includes: (1) rapid selection of target crystallizable proteins at the proteome level, (2) identification of site non-optimality for protein crystallization and systematic analysis of all potential single-point mutations that might enhance protein crystallization propensity, and (3) annotation of target protein based on predicted structural properties. We applied the design mode of Crysalis to identify site non-optimality for protein crystallization on a proteome-scale, focusing on proteins currently classified as non-crystallizable. Our results revealed that site non-optimality is based on biases related to residues, predicted structures, physicochemical properties, and sequence loci, which provides in-depth understanding of the features influencing protein crystallization. Crysalis is freely available at http://nmrcen.xmu.edu.cn/crysalis/.},
    Articlenumber = {21383},
    Doi = {10.1038/srep21383},
    Keywords = {Bioinformatics and DP140100087},
    Related = {computational-biology}
    }

The Australian Research Council has funded the following two projects:

Electronic skin nanopatches for continuous blood pressure monitoring
Investigators:
Prof Wenlong Cheng (Chief Investigator)
Prof Andrew Tonkin (Chief Investigator)
A/Prof Bing Wang (Chief Investigator)
Prof Geoffrey Webb (Chief Investigator)
Dr Stephen Wang (Chief Investigator)
Prof David Kaye (Partner Investigator)
Dr Yijia Li (Partner Investigator)
Mr Paul Carboon (Partner Investigator)
Summary: This project aims to develop soft, thin, wearable and non-invasive heart health monitors that continuously monitor blood pressure anytime, anywhere, using an electronic skin technology platform with the world’s thinnest gold nanowires. Nanotechnologists, electrical engineers, clinicians, information technologists and industrial designers will collaborate to develop blood pressure correlation algorithms and evaluate sensing performance. New knowledge and commercial technologies will make Australia’s medical technology industry a competitive global leader in wearable technology.
Funding: $380,000

Legal and social dynamics of ebook lending in Australia’s public libraries
Investigators:
Dr Rebecca Giblin (Chief Investigator)
A/Prof Kimberlee Weatherall (Chief Investigator)
Prof Julian Thomas (Chief Investigator)
Prof Geoffrey Webb (Chief Investigator)
Summary: This project aims to develop an evidence base of quantitative and qualitative data about how eBooks are used in libraries. EBooks have tremendous beneficial potential, particularly for Australians in remote areas and those with impaired mobility or vision. However, libraries’ rights to acquire and lend them are more restricted than for physical books. Libraries and legal, social and data science researchers will investigate eBook lending practices and understand their social impacts. The project will identify ways of reforming policy, law, and practice to help libraries fulfil their public interest missions. This project is expected to enable libraries to extract more value from existing public investments.
Funding: $252,000

Since 1999, Magnum Opus has been a leading data mining tool making association discovery better and faster for everyone.

BigML is the best platform for Machine Learning on the internet.

G.I. Webb & Associates are excited to be partnering with BigML to provide the best tools and environment for association discovery.

As a result, G.I. Webb & Associates are no longer offering new Magnum Opus licenses or downloads. We will continue supporting our licensees as usual.

BigML.com

We are delighted to receive the SDM15 Best Research Paper Honorable Mention award.

The Society for Industrial and Applied Mathematics (SIAM) International Conference on Data Mining (SDM15) Awards Committee selected four papers for awards from nearly 400 submissions.

View the presentation here.

And here is a link to the paper and its bibliographic details:

    Petitjean, F., & Webb, G. I. (2015). Scaling log-linear analysis to datasets with thousands of variables. Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 469-477.

    @InProceedings{PetitjeanWebb15,
    Title = {Scaling log-linear analysis to datasets with thousands of variables},
    Author = {F. Petitjean and G.I. Webb},
    Booktitle = {Proceedings of the 2015 {SIAM} International Conference on Data Mining},
    Year = {2015},
    Pages = {469-477},
    Abstract = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.},
    Comment = {Best Research Paper Honorable Mention Award},
    Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
    Related = {scalable-graphical-modeling},
    Url = {http://epubs.siam.org/doi/pdf/10.1137/1.9781611974010.53}
    }

View our panel on Video Lectures:


A Data Scientist’s Guide to Making Money from Start-ups
Geoff Webb, Foster Provost, Ron Bekkerman, Oren Etzioni, Usama Fayyad, Claudia Perlich

We also wrote a paper based on the panel discussion:

    Provost, F., Webb, G. I., Bekkerman, R., Etzioni, O., Fayyad, U., & Perlich, C. (2014). A Data Scientist's Guide to Start-Ups. Big Data, 2(3), 117-128.

    @Article{ProvostEtAl14,
    Title = {A Data Scientist's Guide to Start-Ups},
    Author = {F. Provost and G. I. Webb and R. Bekkerman and O. Etzioni and U. Fayyad and C. Perlich},
    Journal = {Big Data},
    Year = {2014},
    Number = {3},
    Pages = {117-128},
    Volume = {2},
    Abstract = {In August 2013, we held a panel discussion at the KDD 2013 conference in Chicago on the subject of data science, data scientists, and start-ups. KDD is the premier conference on data science research and practice. The panel discussed the pros and cons for top-notch data scientists of the hot data science start-up scene. In this article, we first present background on our panelists. Our four panelists have unquestionable pedigrees in data science and substantial experience with start-ups from multiple perspectives (founders, employees, chief scientists, venture capitalists). For the casual reader, we next present a brief summary of the experts' opinions on eight of the issues the panel discussed. The rest of the article presents a lightly edited transcription of the entire panel discussion.},
    Keywords = {Big Data},
    Url = {http://dx.doi.org/10.1089/big.2014.0031}
    }