Association discovery

Association discovery encompasses a broad family of related tasks that identify key interactions between entities or elements within data. These include association mining, pattern mining, association rule discovery, subgroup discovery, emerging pattern discovery and contrast discovery.

My pioneering association discovery techniques seek the most useful associations, rather than applying the minimum-support constraint more commonly used in the field. Many of these techniques are included in my Magnum Opus software, which is now incorporated in BigML. Magnum Opus has been widely used in scientific research.

Due to the large numbers of patterns that are considered in association discovery, most techniques suffer large risk of type-1 error. That is, they are likely to find patterns that appear interesting only due to chance artifacts of the process by which the sample data were generated. Most attempts to control this risk do so at the cost of high risk of type-2 error. That is, they are likely to falsely reject non-spurious patterns. I have pioneered strategies for statistically sound pattern discovery, strictly controlling type-1 error during association discovery without the level of risk of type-2 error suffered by previous approaches.

The OPUSMiner statistically sound itemset discovery software can be downloaded here. An R package is available here.

The Skopus statistically sound sequential pattern discovery software can be downloaded here.

An implementation of impact rules can be downloaded here: https://github.com/501856869/Significant_impact_rule.

ACM SIGKDD 2014 Tutorial on Statistically Sound Pattern Discovery, with Wilhelmiina Hämäläinen.

Publications

Better Short than Greedy: Interpretable Models through Optimal Rule Boosting.
Boley, M., Teshuva, S., Bodic, P. L., & Webb, G. I.
Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pp. 351-359, 2021.
[Bibtex] [Abstract] → Access on publisher site

@InProceedings{boley2021better,
author = {Boley, Mario and Teshuva, Simon and Bodic, Pierre Le and Webb, Geoffrey I},
booktitle = {Proceedings of the 2021 SIAM International Conference on Data Mining (SDM)},
title = {Better Short than Greedy: Interpretable Models through Optimal Rule Boosting},
year = {2021},
organization = {SIAM},
pages = {351-359},
abstract = {Rule ensembles are designed to provide a useful trade-off between predictive accuracy and model interpretability. However, the myopic and random search components of current rule ensemble methods can compromise this goal: they often need more rules than necessary to reach a certain accuracy level or can even outright fail to accurately model a distribution that can actually be described well with a few rules. Here, we present a novel approach aiming to fit rule ensembles of maximal predictive power for a given ensemble size (and thus model comprehensibility). In particular, we present an efficient branch-and-bound algorithm that optimally solves the per-rule objective function of the popular second-order gradient boosting framework. Our main insight is that the boosting objective can be tightly bounded in linear time of the number of covered data points. Along with an additional novel pruning technique related to rule redundancy, this leads to a computationally feasible approach for boosting optimal rules that, as we demonstrate on a wide range of common benchmark problems, consistently outperforms the predictive performance of boosting greedy rules.},
keywords = {Rule Learning and OPUS and Association Rule Discovery},
related = {opus-search},
url = {https://arxiv.org/abs/2101.08380},
}

ABSTRACT Rule ensembles are designed to provide a useful trade-off between predictive accuracy and model interpretability. However, the myopic and random search components of current rule ensemble methods can compromise this goal: they often need more rules than necessary to reach a certain accuracy level or can even outright fail to accurately model a distribution that can actually be described well with a few rules. Here, we present a novel approach aiming to fit rule ensembles of maximal predictive power for a given ensemble size (and thus model comprehensibility). In particular, we present an efficient branch-and-bound algorithm that optimally solves the per-rule objective function of the popular second-order gradient boosting framework. Our main insight is that the boosting objective can be tightly bounded in linear time of the number of covered data points. Along with an additional novel pruning technique related to rule redundancy, this leads to a computationally feasible approach for boosting optimal rules that, as we demonstrate on a wide range of common benchmark problems, consistently outperforms the predictive performance of boosting greedy rules.

A tutorial on statistically sound pattern discovery.
Hamalainen, W., & Webb, G. I.
Data Mining and Knowledge Discovery, 33(2), 325-377, 2019.
[Bibtex] [Abstract] → Access on publisher site

@Article{HamalainenWebb2019,
author = {Hamalainen, Wilhelmiina and Webb, Geoffrey I.},
journal = {Data Mining and Knowledge Discovery},
title = {A tutorial on statistically sound pattern discovery},
year = {2019},
issn = {1573-756X},
month = {March},
number = {2},
pages = {325-377},
volume = {33},
abstract = {Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery. Most importantly, application of appropriate statistical tests allows precise control over the risk of false discoveries---patterns that are found in the sample data but do not hold in the wider population from which the sample was drawn. Statistical tests can also be applied to filter out patterns that are unlikely to be useful, removing uninformative variations of the key patterns in the data. This tutorial introduces the key statistical and data mining theory and techniques that underpin this fast developing field. We concentrate on two general classes of patterns: dependency rules that express statistical dependencies between condition and consequent parts and dependency sets that express mutual dependence between set elements. We clarify alternative interpretations of statistical dependence and introduce appropriate tests for evaluating statistical significance of patterns in different situations. We also introduce special techniques for controlling the likelihood of spurious discoveries when multitudes of patterns are evaluated. The paper is aimed at a wide variety of audiences. It provides the necessary statistical background and summary of the state-of-the-art for any data mining researcher or practitioner wishing to enter or understand statistically sound pattern discovery research or practice. It can serve as a general introduction to the field of statistically sound pattern discovery for any reader with a general background in data sciences.},
doi = {10.1007/s10618-018-0590-x},
keywords = {Association Rule Discovery and statistically sound discovery},
related = {statistically-sound-association-discovery},
url = {https://rdcu.be/bd2MI},
}

ABSTRACT Statistically sound pattern discovery harnesses the rigour of statistical hypothesis testing to overcome many of the issues that have hampered standard data mining approaches to pattern discovery. Most importantly, application of appropriate statistical tests allows precise control over the risk of false discoveries–-patterns that are found in the sample data but do not hold in the wider population from which the sample was drawn. Statistical tests can also be applied to filter out patterns that are unlikely to be useful, removing uninformative variations of the key patterns in the data. This tutorial introduces the key statistical and data mining theory and techniques that underpin this fast developing field. We concentrate on two general classes of patterns: dependency rules that express statistical dependencies between condition and consequent parts and dependency sets that express mutual dependence between set elements. We clarify alternative interpretations of statistical dependence and introduce appropriate tests for evaluating statistical significance of patterns in different situations. We also introduce special techniques for controlling the likelihood of spurious discoveries when multitudes of patterns are evaluated. The paper is aimed at a wide variety of audiences. It provides the necessary statistical background and summary of the state-of-the-art for any data mining researcher or practitioner wishing to enter or understand statistically sound pattern discovery research or practice. It can serve as a general introduction to the field of statistically sound pattern discovery for any reader with a general background in data sciences.

Mining significant crisp-fuzzy spatial association rules.
Shi, W., Zhang, A., & Webb, G. I.
International Journal of Geographical Information Science, 32(6), 1247-1270, 2018.
[Bibtex] [Abstract] → Access on publisher site

@Article{doi:10.1080/13658816.2018.1434525,
Title = {Mining significant crisp-fuzzy spatial association rules},
Author = {Shi, Wenzhong and Zhang, Anshu and Webb, Geoffrey I.},
Journal = {International Journal of Geographical Information Science},
Year = {2018},
Number = {6},
Pages = {1247-1270},
Volume = {32},
Abstract = {Spatial association rule mining (SARM) is an important data mining task for understanding implicit and sophisticated interactions in spatial data. The usefulness of SARM results, represented as sets of rules, depends on their reliability: the abundance of rules, control over the risk of spurious rules, and accuracy of rule interestingness measure (RIM) values. This study presents crisp-fuzzy SARM, a novel SARM method that can enhance the reliability of resultant rules. The method firstly prunes dubious rules using statistically sound tests and crisp supports for the patterns involved, and then evaluates RIMs of accepted rules using fuzzy supports. For the RIM evaluation stage, the study also proposes a Gaussian-curve-based fuzzy data discretization model for SARM with improved design for spatial semantics. The proposed techniques were evaluated by both synthetic and real-world data. The synthetic data was generated with predesigned rules and RIM values, thus the reliability of SARM results could be confidently and quantitatively evaluated. The proposed techniques showed high efficacy in enhancing the reliability of SARM results in all three aspects. The abundance of resultant rules was improved by 50% or more compared with using conventional fuzzy SARM. Minimal risk of spurious rules was guaranteed by statistically sound tests. The probability that the entire result contained any spurious rules was below 1%. The RIM values also avoided large positive errors committed by crisp SARM, which typically exceeded 50% for representative RIMs. The real-world case study on New York City points of interest reconfirms the improved reliability of crisp-fuzzy SARM results, and demonstrates that such improvement is critical for practical spatial data analytics and decision support.},
Doi = {10.1080/13658816.2018.1434525},
Keywords = {Association Rule Discovery and statistically sound discovery},
Publisher = {Taylor \& Francis},
Related = {filtered-top-k-association-discovery},
Url = {http://www.tandfonline.com/eprint/aMdSMrAGuEHsHWSzIuqm/full}
}

ABSTRACT Spatial association rule mining (SARM) is an important data mining task for understanding implicit and sophisticated interactions in spatial data. The usefulness of SARM results, represented as sets of rules, depends on their reliability: the abundance of rules, control over the risk of spurious rules, and accuracy of rule interestingness measure (RIM) values. This study presents crisp-fuzzy SARM, a novel SARM method that can enhance the reliability of resultant rules. The method firstly prunes dubious rules using statistically sound tests and crisp supports for the patterns involved, and then evaluates RIMs of accepted rules using fuzzy supports. For the RIM evaluation stage, the study also proposes a Gaussian-curve-based fuzzy data discretization model for SARM with improved design for spatial semantics. The proposed techniques were evaluated by both synthetic and real-world data. The synthetic data was generated with predesigned rules and RIM values, thus the reliability of SARM results could be confidently and quantitatively evaluated. The proposed techniques showed high efficacy in enhancing the reliability of SARM results in all three aspects. The abundance of resultant rules was improved by 50% or more compared with using conventional fuzzy SARM. Minimal risk of spurious rules was guaranteed by statistically sound tests. The probability that the entire result contained any spurious rules was below 1%. The RIM values also avoided large positive errors committed by crisp SARM, which typically exceeded 50% for representative RIMs. The real-world case study on New York City points of interest reconfirms the improved reliability of crisp-fuzzy SARM results, and demonstrates that such improvement is critical for practical spatial data analytics and decision support.

Specious rules: an efficient and effective unifying method for removing misleading and uninformative patterns in association rule mining.
Hamalainen, W., & Webb, G. I.
Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 309-317, 2017.
[Bibtex] → Download PDF

@InProceedings{HamalainenWebb17,
Title = {Specious rules: an efficient and effective unifying method for removing misleading and uninformative patterns in association rule mining},
Author = {Hamalainen, Wilhelmiina and Webb, Geoffrey I},
Booktitle = {Proceedings of the 2017 SIAM International Conference on Data Mining},
Year = {2017},
Organization = {SIAM},
Pages = {309-317},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Related = {statistically-sound-association-discovery}
}

ABSTRACT

A multiple test correction for streams and cascades of statistical hypothesis tests.
Webb, G. I., & Petitjean, F.
Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 1255-1264, 2016.
[Bibtex] [Abstract] → Download PDF → Access on publisher site

@InProceedings{WebbPetitjean16,
Title = {A multiple test correction for streams and cascades of statistical hypothesis tests},
Author = {Webb, Geoffrey I. and Petitjean, Francois},
Booktitle = {Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16},
Year = {2016},
Pages = {1255-1264},
Publisher = {ACM Press},
Abstract = {Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.
To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.},
Comment = {Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue},
Doi = {10.1145/2939672.2939775},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {statistically-sound-association-discovery},
Url = {http://dl.acm.org/authorize?N19100}
}

ABSTRACT Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance. This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed. To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models. We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.

Skopus: Mining top-k sequential patterns under leverage.
Petitjean, F., Li, T., Tatti, N., & Webb, G. I.
Data Mining and Knowledge Discovery, 30(5), 1086-1111, 2016.
[Bibtex] [Abstract] → Download PDF → Access on publisher site

@Article{PetitjeanEtAl16b,
Title = {Skopus: Mining top-k sequential patterns under leverage},
Author = {Petitjean, Francois and Li, Tao and Tatti, Nikolaj and Webb, Geoffrey I.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2016},
Number = {5},
Pages = {1086-1111},
Volume = {30},
Abstract = {This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern---a concept on which most interestingness measures directly rely---with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.},
Doi = {10.1007/s10618-016-0467-9},
ISSN = {1573-756X},
Keywords = {OPUS and Association Rule Discovery and statistically sound discovery},
Related = {statistically-sound-association-discovery},
Url = {http://rdcu.be/tsDo}
}

ABSTRACT This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern–-a concept on which most interestingness measures directly rely–-with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.

Mining significant association rules from uncertain data.
Zhang, A., Shi, W., & Webb, G. I.
Data Mining and Knowledge Discovery, 30(4), 928-963, 2016.
[Bibtex] [Abstract] → Download PDF → Access on publisher site

@Article{ZhangEtAl16,
Title = {Mining significant association rules from uncertain data},
Author = {Zhang, Anshu and Shi, Wenzhong and Webb, Geoffrey I.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2016},
Number = {4},
Pages = {928-963},
Volume = {30},
Abstract = {In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.},
Doi = {10.1007/s10618-015-0446-6},
Keywords = {Association Rule Discovery and statistically sound discovery},
Publisher = {Springer},
Related = {statistically-sound-association-discovery}
}

ABSTRACT In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.

Scaling log-linear analysis to datasets with thousands of variables.
Petitjean, F., & Webb, G. I.
Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 469-477, 2015.
[Bibtex] [Abstract] → Access on publisher site

@InProceedings{PetitjeanWebb15,
author = {Petitjean, F. and Webb, G. I.},
booktitle = {Proceedings of the 2015 {SIAM} International Conference on Data Mining},
title = {Scaling log-linear analysis to datasets with thousands of variables},
year = {2015},
pages = {469-477},
abstract = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.},
comment = {Best Research Paper Honorable Mention Award},
keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087, efficient ml},
related = {scalable-graphical-modeling},
url = {http://epubs.siam.org/doi/pdf/10.1137/1.9781611974010.53},
}

ABSTRACT Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.

Statistically sound pattern discovery.
Hamalainen, W., & Webb, G. I.
KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1976, 2014.
[Bibtex] [Abstract] → Access on publisher site

@InProceedings{HamalainenKDD2014,
author = {Hamalainen, Wilhelmiina and Webb, Geoffrey I.},
booktitle = {KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining},
title = {Statistically sound pattern discovery},
year = {2014},
pages = {1976},
abstract = {Pattern discovery is a core data mining activity. Initial approaches were dominated by the frequent pattern discovery paradigm -- only patterns that occur frequently in the data were explored. Having been thoroughly researched and its limitations now well understood, this paradigm is giving way to a new one, which can be called statistically sound pattern discovery. In this paradigm, the main impetus is to discover statistically significant patterns, which are unlikely to have occurred by chance and are likely to hold in future data. Thus, the new paradigm provides a strict control over false discoveries and overfitting.
This tutorial covers both classic and cutting-edge research topics on pattern discovery combined to statistical significance testing. We start with an advanced introduction to the relevant forms of statistical significance testing, including different schools and alternative models, their underlying assumptions, practical issues, and limitations. We then discuss their application to data mining specific problems, including evaluation of nested patterns, the multiple testing problem, algorithmic strategies and real-world considerations. We present the current state-of-the art solutions and explore in detail how this approach to pattern discovery can deliver efficient and effective discovery of small sets of interesting patterns.},
doi = {10.1145/2623330.263081},
keywords = {Association Rule Discovery and statistically sound discovery},
related = {statistically-sound-association-discovery},
}

ABSTRACT Pattern discovery is a core data mining activity. Initial approaches were dominated by the frequent pattern discovery paradigm – only patterns that occur frequently in the data were explored. Having been thoroughly researched and its limitations now well understood, this paradigm is giving way to a new one, which can be called statistically sound pattern discovery. In this paradigm, the main impetus is to discover statistically significant patterns, which are unlikely to have occurred by chance and are likely to hold in future data. Thus, the new paradigm provides a strict control over false discoveries and overfitting. This tutorial covers both classic and cutting-edge research topics on pattern discovery combined to statistical significance testing. We start with an advanced introduction to the relevant forms of statistical significance testing, including different schools and alternative models, their underlying assumptions, practical issues, and limitations. We then discuss their application to data mining specific problems, including evaluation of nested patterns, the multiple testing problem, algorithmic strategies and real-world considerations. We present the current state-of-the art solutions and explore in detail how this approach to pattern discovery can deliver efficient and effective discovery of small sets of interesting patterns.

Efficient Discovery of the Most Interesting Associations.
Webb, G. I., & Vreeken, J.
ACM Transactions on Knowledge Discovery from Data, 8(3), Art. no. 15, 2014.
[Bibtex] [Abstract] → Access on publisher site

@Article{WebbVreeken13,
author = {Webb, G. I. and Vreeken, J.},
journal = {{ACM} Transactions on Knowledge Discovery from Data},
title = {Efficient Discovery of the Most Interesting Associations},
year = {2014},
number = {3},
volume = {8},
abstract = {Self-sufficient itemsets have been proposed as an effective approach to summarizing the key associations
in data. However, their computation appears highly demanding, as assessing whether an itemset is selfsufficient
requires consideration of all pairwise partitions of the itemset into pairs of subsets as well as
consideration of all supersets. This paper presents the first published algorithm for efficiently discovering
self-sufficient itemsets. This branch-and-bound algorithm deploys two powerful pruning mechanisms
based on upper-bounds on itemset value and statistical significance level. It demonstrates that finding top-k
productive and non-redundant itemsets, with post processing to identify those that are not independently
productive, can efficiently identify small sets of key associations. We present extensive evaluation of the
strengths and limitations of the technique including comparisons with alternative approaches to finding the
most interesting associations.},
articlenumber = {15},
doi = {10.1145/2601433},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
publisher = {ACM},
related = {filtered-top-k-association-discovery},
url = {http://dl.acm.org/authorize?N80829},
}

ABSTRACT Self-sufficient itemsets have been proposed as an effective approach to summarizing the key associations in data. However, their computation appears highly demanding, as assessing whether an itemset is selfsufficient requires consideration of all pairwise partitions of the itemset into pairs of subsets as well as consideration of all supersets. This paper presents the first published algorithm for efficiently discovering self-sufficient itemsets. This branch-and-bound algorithm deploys two powerful pruning mechanisms based on upper-bounds on itemset value and statistical significance level. It demonstrates that finding top-k productive and non-redundant itemsets, with post processing to identify those that are not independently productive, can efficiently identify small sets of key associations. We present extensive evaluation of the strengths and limitations of the technique including comparisons with alternative approaches to finding the most interesting associations.

A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data.
Petitjean, F., Allison, L., & Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 480-489, 2014.
[Bibtex] [Abstract] → Download PDF → Access on publisher site

@InProceedings{PetitjeanEtAl14a,
author = {Petitjean, F. and Allison, L. and Webb, G. I.},
booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
title = {A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data},
year = {2014},
pages = {480-489},
abstract = {Log-linear analysis is the primary statistical approach to discovering conditional dependencies between the variables of a dataset. A good log-linear analysis method requires both high precision and statistical efficiency. High precision means that the risk of false discoveries should be kept very low. Statistical efficiency means that the method should discover actual associations with as few samples as possible. Classical approaches to log-linear analysis make use of Chi-square tests to control this balance between quality and complexity. We present an information-theoretic approach to log-linear analysis. We show that our approach 1) requires significantly fewer samples to discover the true associations than statistical approaches -- statistical efficiency -- 2) controls for the risk of false discoveries as well as statistical approaches -- high precision - and 3) can perform the discovery on datasets with hundreds of variables on a standard desktop computer -- computational efficiency.},
comment = {One of nine papers invited to Knowledge and Information Systems journal ICDM-14 special issue},
keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and DP140100087, efficent ML},
related = {scalable-graphical-modeling},
url = {http://dx.doi.org/10.1109/ICDM.2014.23},
}

ABSTRACT Log-linear analysis is the primary statistical approach to discovering conditional dependencies between the variables of a dataset. A good log-linear analysis method requires both high precision and statistical efficiency. High precision means that the risk of false discoveries should be kept very low. Statistical efficiency means that the method should discover actual associations with as few samples as possible. Classical approaches to log-linear analysis make use of Chi-square tests to control this balance between quality and complexity. We present an information-theoretic approach to log-linear analysis. We show that our approach 1) requires significantly fewer samples to discover the true associations than statistical approaches – statistical efficiency – 2) controls for the risk of false discoveries as well as statistical approaches – high precision - and 3) can perform the discovery on datasets with hundreds of variables on a standard desktop computer – computational efficiency.

Scaling log-linear analysis to high-dimensional data.
Petitjean, F., Webb, G. I., & Nicholson, A. E.
Proceedings of the 13th IEEE International Conference on Data Mining, pp. 597-606, 2013.
[Bibtex] [Abstract] → Download PDF → Access on publisher site

@InProceedings{PetitjeanEtAl13,
author = {Petitjean, F. and Webb, G. I. and Nicholson, A. E.},
booktitle = {Proceedings of the 13th {IEEE} International Conference on Data Mining},
title = {Scaling log-linear analysis to high-dimensional data},
year = {2013},
pages = {597-606},
abstract = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We develop an efficient approach to log-linear analysis that scales to hundreds of variables by melding the classical statistical machinery of log-linear analysis with advanced data mining techniques from association discovery and graphical modeling.},
doi = {10.1109/ICDM.2013.17},
keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087, efficient ml},
related = {scalable-graphical-modeling},
}

ABSTRACT Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We develop an efficient approach to log-linear analysis that scales to hundreds of variables by melding the classical statistical machinery of log-linear analysis with advanced data mining techniques from association discovery and graphical modeling.

Filtered-top-k Association Discovery.
Webb, G. I.
WIREs Data Mining and Knowledge Discovery, 1(3), 183-192, 2011.
[Bibtex] [Abstract] → Download PDF → Access on publisher site

@Article{Webb11,
author = {Webb, G. I.},
journal = {WIREs Data Mining and Knowledge Discovery},
title = {Filtered-top-k Association Discovery},
year = {2011},
number = {3},
pages = {183-192},
volume = {1},
abstract = {Association mining has been one of the most intensively researched areas of data mining. However, direct uptake of the resulting technologies has been relatively low. This paper examines some of the reasons why the dominant paradigms in association mining have not lived up to their promise, and argues that a powerful alternative is provided by top-k techniques coupled with appropriate statistical and other filtering.},
doi = {10.1002/widm.28},
keywords = {Association Rule Discovery and statistically sound discovery},
publisher = {Wiley},
related = {filtered-top-k-association-discovery},
}

ABSTRACT Association mining has been one of the most intensively researched areas of data mining. However, direct uptake of the resulting technologies has been relatively low. This paper examines some of the reasons why the dominant paradigms in association mining have not lived up to their promise, and argues that a powerful alternative is provided by top-k techniques coupled with appropriate statistical and other filtering.

Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items.
Webb, G. I.
ACM Transactions on Knowledge Discovery from Data, 4, Art. no. 3, 2010.
[Bibtex] [Abstract] → Access on publisher site

@Article{Webb10,
author = {Webb, G. I.},
journal = {{ACM} Transactions on Knowledge Discovery from Data},
title = {Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items},
year = {2010},
volume = {4},
abstract = {Self-sufficient itemsets are those whose frequency cannot explained solely by the frequency of either their subsets or of their
supersets. We argue that itemsets that are not
self-sufficient will often be of little interest to the data
analyst, as their frequency should be expected once that of the
itemsets on which their frequency depends is known. We present
statistical tests for statistically sound discovery of
self-sufficient itemsets, and computational techniques that allow
those tests to be applied as a post-processing step for any itemset
discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.},
articlenumber = {3},
doi = {10.1145/1644873.1644876},
issue = {1},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
publisher = {ACM},
related = {filtered-top-k-association-discovery},
url = {http://dl.acm.org/authorize?270473},
}

ABSTRACT Self-sufficient itemsets are those whose frequency cannot explained solely by the frequency of either their subsets or of their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency should be expected once that of the itemsets on which their frequency depends is known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a post-processing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.

Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining.
Novak, P., Lavrac, N., & Webb, G. I.
Journal of Machine Learning Research, 10, 377-403, 2009.
[Bibtex] [Abstract] → Access on publisher site

@Article{NovakLavracWebb09,
author = {Novak, P. and Lavrac, N. and Webb, G. I.},
journal = {Journal of Machine Learning Research},
title = {Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining},
year = {2009},
pages = {377-403},
volume = {10},
abstract = {This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.},
keywords = {Association Rule Discovery and OPUS},
related = {filtered-top-k-association-discovery},
url = {http://www.jmlr.org/papers/volume10/kralj-novak09a/kralj-novak09a.pdf},
}

ABSTRACT This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.

Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns.
Webb, G. I.
Machine Learning, 71(2-3), 307-323, 2008.
[Bibtex] [Abstract] → Access on publisher site

@Article{Webb08,
author = {Webb, G. I.},
journal = {Machine Learning},
title = {Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns},
year = {2008},
number = {2-3},
pages = {307-323},
volume = {71},
abstract = {Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.},
address = {Netherlands},
doi = {10.1007/s10994-008-5046-x},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
notes = {Technical Note},
publisher = {Springer},
related = {statistically-sound-association-discovery},
}

ABSTRACT Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.

Discovering Significant Patterns.
Webb, G. I.
Machine Learning, 68(1), 1-33, 2007.
[Bibtex] [Abstract] → Access on publisher site

@Article{Webb07,
author = {Webb, G. I.},
journal = {Machine Learning},
title = {Discovering Significant Patterns},
year = {2007},
number = {1},
pages = {1-33},
volume = {68},
abstract = {Exploratory pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard exploratory pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to statistical evaluation on holdout data. They also reveal that modification of the pattern discovery process to anticipate subsequent statistical evaluation can increase the number of patterns that are accepted by statistical evaluation on holdout data.},
address = {Netherlands},
doi = {10.1007/s10994-007-5006-x},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
publisher = {Springer},
related = {statistically-sound-association-discovery},
}

ABSTRACT Exploratory pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard exploratory pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to statistical evaluation on holdout data. They also reveal that modification of the pattern discovery process to anticipate subsequent statistical evaluation can increase the number of patterns that are accepted by statistical evaluation on holdout data.

Discovering Significant Rules.
Webb, G. I.
Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2006), New York, pp. 434-443, 2006.
[Bibtex] [Abstract] → Download PDF → Access on publisher site

@InProceedings{Webb06a,
author = {Webb, G. I.},
booktitle = {Proceedings of the Twelfth {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2006)},
title = {Discovering Significant Rules},
year = {2006},
address = {New York},
editor = {Ungar, L. and Craven, M. and Gunopulos, D. and Eliassi-Rad, T.},
pages = {434-443},
publisher = {The Association for Computing Machinery},
abstract = {In many applications, association rules will only be interesting if they represent non-trivial correlations between all constituent items. Numerous techniques have been developed that seek to avoid false discoveries. However, while all provide useful solutions to aspects of this problem, none provides a generic solution that is both flexible enough to accommodate varying definitions of true and false discoveries and powerful enough to provide strict control over the risk of false discoveries. This paper presents generic techniques that allow definitions of true and false discoveries to be specified in terms of arbitrary statistical hypothesis tests and which provide strict control over the experimentwise risk of false discoveries.},
keywords = {OPUS and Association Rule Discovery and statistically sound discovery},
location = {Philadelphia, PA},
related = {statistically-sound-association-discovery},
url = {http://dl.acm.org/authorize?N00546},
}

ABSTRACT In many applications, association rules will only be interesting if they represent non-trivial correlations between all constituent items. Numerous techniques have been developed that seek to avoid false discoveries. However, while all provide useful solutions to aspects of this problem, none provides a generic solution that is both flexible enough to accommodate varying definitions of true and false discoveries and powerful enough to provide strict control over the risk of false discoveries. This paper presents generic techniques that allow definitions of true and false discoveries to be specified in terms of arbitrary statistical hypothesis tests and which provide strict control over the experimentwise risk of false discoveries.

Efficiently Identifying Exploratory Rules' Significance.
Huang, S., & Webb, G. I.
LNAI State-of-the-Art Survey series, 'Data Mining: Theory, Methodology, Techniques, and Applications', Berlin/Heidelberg, pp. 64-77, 2006.
[Bibtex] [Abstract] → Access on publisher site

@InProceedings{HuangWebb05b,
author = {Huang, S. and Webb, G. I.},
booktitle = {LNAI State-of-the-Art Survey series, 'Data Mining: Theory, Methodology, Techniques, and Applications'},
title = {Efficiently Identifying Exploratory Rules' Significance},
year = {2006},
address = {Berlin/Heidelberg},
note = {An earlier version of this paper was published in S.J. Simoff and G.J. Williams (Eds.), Proceedings of the Third Australasian Data Mining Conference (AusDM04) Cairns, Australia. Sydney: University of Technology, pages 169-182.},
pages = {64-77},
publisher = {Springer},
abstract = {How to efficiently discard potentially uninteresting rules in exploratory rule discovery is one of the important research foci in data mining. Many researchers have presented algorithms to automatically remove potentially uninteresting rules utilizing background knowledge and user-specified constraints. Identifying the significance of exploratory rules using a significance test is desirable for removing rules that may appear interesting by chance, hence providing the users with a more compact set of resulting rules. However, applying statistical tests to identify significant rules requires considerable computation and data access in order to obtain the necessary statistics. The situation gets worse as the size of the database increases. In this paper, we propose two approaches for improving the efficiency of significant exploratory rule discovery. We also evaluate the experimental effect in impact rule discovery which is suitable for discovering exploratory rules in very large, dense databases.},
doi = {10.1007/11677437_6},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
related = {statistically-sound-association-discovery},
}

ABSTRACT How to efficiently discard potentially uninteresting rules in exploratory rule discovery is one of the important research foci in data mining. Many researchers have presented algorithms to automatically remove potentially uninteresting rules utilizing background knowledge and user-specified constraints. Identifying the significance of exploratory rules using a significance test is desirable for removing rules that may appear interesting by chance, hence providing the users with a more compact set of resulting rules. However, applying statistical tests to identify significant rules requires considerable computation and data access in order to obtain the necessary statistics. The situation gets worse as the size of the database increases. In this paper, we propose two approaches for improving the efficiency of significant exploratory rule discovery. We also evaluate the experimental effect in impact rule discovery which is suitable for discovering exploratory rules in very large, dense databases.

Mining Group Differences.
Butler, S., & Webb, G. I.
In Wang, J. (Ed.), In The Encyclopedia of Data Warehousing and Mining (, pp. 795-799). Hershey, PA: Idea Group Inc., 2006.
[Bibtex] → Access on publisher site

@InCollection{ButlerWebb05,
Title = {Mining Group Differences},
Author = {Butler, S. and Webb, G. I.},
Booktitle = {The Encyclopedia of Data Warehousing and Mining},
Publisher = {Idea Group Inc.},
Year = {2006},
Address = {Hershey, PA},
Editor = {Wang, John},
Pages = {795-799},
Audit-trail = {August 04 Copyright signed. Shane handling submission. PDF not posted},
Doi = {10.4018/978-1-60566-010-3.ch199},
Keywords = {Association Rule Discovery}
}

ABSTRACT

k-Optimal-Rule-Discovery.
Webb, G. I., & Zhang, S.
Data Mining and Knowledge Discovery, 10(1), 39-79, 2005.
[Bibtex] [Abstract] → Download PDF → Access on publisher site

@Article{WebbZhang05,
Title = {k-Optimal-Rule-Discovery},
Author = {Webb, G. I. and Zhang, S.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2005},
Number = {1},
Pages = {39-79},
Volume = {10},
Abstract = {K-most-interesting rule discovery finds the k rules that optimize a user-specified measure of interestingness with respect to a set of sample data and user-specified constraints. This approach avoids many limitations of the frequent itemset approach of association rule discovery. This paper presents a scalable algorithm applicable to a wide range of k-most-interesting rule discovery tasks and demonstrates its efficiency.},
Address = {Netherlands},
Doi = {10.1007/s10618-005-0255-4},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {Springer},
Related = {filtered-top-k-association-discovery}
}

ABSTRACT K-most-interesting rule discovery finds the k rules that optimize a user-specified measure of interestingness with respect to a set of sample data and user-specified constraints. This approach avoids many limitations of the frequent itemset approach of association rule discovery. This paper presents a scalable algorithm applicable to a wide range of k-most-interesting rule discovery tasks and demonstrates its efficiency.

Discarding Insignificant Rules During Impact Rule Discovery in Large, Dense Databases.
Huang, S., & Webb, G. I.
Proceedings of the Fifth SIAM International Conference on Data Mining (SDM'05) [short paper], Philadelphia, PA, pp. 541-545, 2005.
[Bibtex] [Abstract] → Download PDF

@InProceedings{HuangWebb05,
author = {Huang, S. and Webb, G. I.},
booktitle = {Proceedings of the Fifth {SIAM} International Conference on Data Mining ({SDM}'05) [short paper]},
title = {Discarding Insignificant Rules During Impact Rule Discovery in Large, Dense Databases},
year = {2005},
address = {Philadelphia, PA},
editor = {Kargupta, H. and Kamath, C. and Srivastava, J. and Goodman, A.},
pages = {541-545},
publisher = {Society for Industrial and Applied Mathematics},
abstract = {Considerable progress has been made on how to reduce the number of spurious exploratory rules with quantitative attributes. However, little has been done for rules with undiscretized quantitative attributes. It is argued that propositional rules can not effectively describe the interactions between quantitative and qualitative attributes. Aumann and Lindell proposed quantitative association rules to provide a better description of such relationship, together with a rule pruning techniques . Since their technique is based on the frequent itemset framework, it is not suitable for rule discovery in large, dense databases. In this paper, an efficient technique for automatically discarding insignificant rules during rule discovery is proposed, based on the OPUS search algorithm. Experiments demonstrate that the algorithm we propose can efficiently remove potentially uninteresting rules even in very large, dense databases.},
audit-trail = {Shiying travelling to present paper. Requested permission to post pdf 10/2},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
location = {Newport Beach, CA},
related = {impact-rules},
}

ABSTRACT Considerable progress has been made on how to reduce the number of spurious exploratory rules with quantitative attributes. However, little has been done for rules with undiscretized quantitative attributes. It is argued that propositional rules can not effectively describe the interactions between quantitative and qualitative attributes. Aumann and Lindell proposed quantitative association rules to provide a better description of such relationship, together with a rule pruning techniques . Since their technique is based on the frequent itemset framework, it is not suitable for rule discovery in large, dense databases. In this paper, an efficient technique for automatically discarding insignificant rules during rule discovery is proposed, based on the OPUS search algorithm. Experiments demonstrate that the algorithm we propose can efficiently remove potentially uninteresting rules even in very large, dense databases.

Pruning Derivative Partial Rules During Impact Rule Discovery.
Huang, S., & Webb, G. I.
Lecture Notes in Computer Science Vol. 3518: Proceedings of the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2005), Berlin/Heidelberg, pp. 71-80, 2005.
[Bibtex] [Abstract] → Download PDF

@InProceedings{HuangWebb05a,
author = {Huang, S. and Webb, G. I.},
booktitle = {Lecture Notes in Computer Science Vol. 3518: Proceedings of the 9th {Pacific}-{Asia} Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2005)},
title = {Pruning Derivative Partial Rules During Impact Rule Discovery},
year = {2005},
address = {Berlin/Heidelberg},
editor = {Ho, T.B. and Cheung, D. and Liu, H.},
pages = {71-80},
publisher = {Springer},
abstract = {Because exploratory rule discovery works with data that is only a sample of the phenomena to be investigated, some resulting rules may appear interesting only by chance. Techniques are developed for automatically discarding statistically insignificant exploratory rules that cannot survive a hypothesis with regard to its ancestors. We call such insignificant rules derivative extended rules. In this paper, we argue that there is another type of derivative exploratory rules, which is derivative with regard to their children. We also argue that considerable amount of such derivative partial rules can not be successfully removed using existing rule pruning techniques. We propose a new technique to address this problem. Experiments are done in impact rule discovery to evaluate the effect of this derivative partial rule filter. Results show that the inherent problem of too many resulting rules in exploratory rule discovery is alleviated.},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
location = {Hanoi, Vietnam},
related = {impact-rules},
}

ABSTRACT Because exploratory rule discovery works with data that is only a sample of the phenomena to be investigated, some resulting rules may appear interesting only by chance. Techniques are developed for automatically discarding statistically insignificant exploratory rules that cannot survive a hypothesis with regard to its ancestors. We call such insignificant rules derivative extended rules. In this paper, we argue that there is another type of derivative exploratory rules, which is derivative with regard to their children. We also argue that considerable amount of such derivative partial rules can not be successfully removed using existing rule pruning techniques. We propose a new technique to address this problem. Experiments are done in impact rule discovery to evaluate the effect of this derivative partial rule filter. Results show that the inherent problem of too many resulting rules in exploratory rule discovery is alleviated.

K-Optimal Pattern Discovery: An Efficient and Effective Approach to Exploratory Data Mining.
Webb, G. I.
Lecture Notes in Computer Science 3809: Advances in Artificial Intelligence, Proceedings of the 18th Australian Joint Conference on Artificial Intelligence (AI 2005)[Extended Abstract], Berlin/Heidelberg, pp. 1-2, 2005.
[Bibtex] → Download PDF → Access on publisher site

@InProceedings{Webb05a,
author = {Webb, G. I.},
booktitle = {Lecture Notes in Computer Science 3809: Advances in Artificial Intelligence, Proceedings of the 18th Australian Joint Conference on Artificial Intelligence (AI 2005)[Extended Abstract]},
title = {K-Optimal Pattern Discovery: An Efficient and Effective Approach to Exploratory Data Mining},
year = {2005},
address = {Berlin/Heidelberg},
editor = {Zhang, S. and Jarvis, R.},
pages = {1-2},
publisher = {Springer},
doi = {10.1007/11589990_1},
keywords = {Association Rule Discovery},
location = {Sydney, Australia},
}

ABSTRACT

Identifying Approximate Itemsets of Interest In Large Databases.
Zhang, C., Zhang, S., & Webb, G. I.
Applied Intelligence, 18, 91-104, 2003.
[Bibtex] [Abstract] → Access on publisher site

@Article{ZhangZhangWebb03,
author = {Zhang, C. and Zhang, S. and Webb, G. I.},
journal = {Applied Intelligence},
title = {Identifying Approximate Itemsets of Interest In Large Databases},
year = {2003},
pages = {91-104},
volume = {18},
abstract = {This paper presents a method for discovering approximate frequent itemsets of interest in large scale databases. This method uses the central limit theorem to increase efficiency, enabling us to reduce the sample size by about half compared to previous approximations. Further efficiency is gained by pruning from the search space uninteresting frequent itemsets. In addition to improving efficiency, this measure also reduces the number of itemsets that the user need consider. The model and algorithm have been implemented and evaluated using both synthetic and real-world databases. Our experimental results demonstrate the efficiency of the approach},
address = {Netherlands},
audit-trail = {Link to paper via Kluwer site. No PDF posted},
keywords = {Association Rule Discovery},
publisher = {Springer},
url = {http://link.springer.com/content/pdf/10.1023%2FA%3A1020995206763.pdf},
}

ABSTRACT This paper presents a method for discovering approximate frequent itemsets of interest in large scale databases. This method uses the central limit theorem to increase efficiency, enabling us to reduce the sample size by about half compared to previous approximations. Further efficiency is gained by pruning from the search space uninteresting frequent itemsets. In addition to improving efficiency, this measure also reduces the number of itemsets that the user need consider. The model and algorithm have been implemented and evaluated using both synthetic and real-world databases. Our experimental results demonstrate the efficiency of the approach

On Detecting Differences Between Groups.
Webb, G. I., Butler, S., & Newlands, D.
Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), New York, pp. 256-265, 2003.
[Bibtex] [Abstract] → Download PDF

@InProceedings{WebbButlerNewlands03,
Title = {On Detecting Differences Between Groups},
Author = {Webb, G. I. and Butler, S. and Newlands, D.},
Booktitle = {Proceedings of The Ninth {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2003)},
Year = {2003},
Address = {New York},
Editor = {Domingos, P. and Faloutsos, C. and Senator, T. and Kargupta, H. and Getoor, L.},
Pages = {256-265},
Publisher = {The Association for Computing Machinery},
Abstract = {Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special purpose data mining technique, {\em contrast-set mining}. We undertook a study with a retail collaborator to compare contrast-set mining with existing rule-discovery techniques. To our surprise we observed that straightforward application of an existing commercial rule-discovery system, Magnum Opus, could successfully perform the contrast-set-mining task. This led to the realization that contrast-set mining is a special case of the more general rule-discovery task. We present the results of our study together with a proof of this conclusion},
Audit-trail = {PDF with ACM copyright posted in accordance with conditions of copyright},
Keywords = {OPUS and Association Rule Discovery},
Location = {Washington, DC},
Related = {filtered-top-k-association-discovery}
}

ABSTRACT Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special purpose data mining technique, {\em contrast-set mining}. We undertook a study with a retail collaborator to compare contrast-set mining with existing rule-discovery techniques. To our surprise we observed that straightforward application of an existing commercial rule-discovery system, Magnum Opus, could successfully perform the contrast-set-mining task. This led to the realization that contrast-set mining is a special case of the more general rule-discovery task. We present the results of our study together with a proof of this conclusion

Association Rules.
Webb, G. I.
In Ye, N. (Ed.), In The Handbook of Data Mining, Chapter 2 (pp. 25-39). Lawrence Erlbaum Associates, 2003.
[Bibtex]

@InCollection{Webb03,
author = {Webb, G. I.},
booktitle = {The Handbook of Data Mining, Chapter 2},
publisher = {Lawrence Erlbaum Associates},
title = {Association Rules},
year = {2003},
editor = {Ye, Nong},
pages = {25-39},
keywords = {Association Rule Discovery},
}

ABSTRACT

Preliminary Investigations into Statistically Valid Exploratory Rule Discovery.
Webb, G. I.
Proceedings of the Second Australasian Data Mining Conference (AusDM03), Sydney, pp. 1-9, 2003.
[Bibtex] [Abstract] → Download PDF

@InProceedings{Webb03a,
author = {Webb, G. I.},
booktitle = {Proceedings of the Second Australasian Data Mining Conference (AusDM03)},
title = {Preliminary Investigations into Statistically Valid Exploratory Rule Discovery},
year = {2003},
address = {Sydney},
editor = {Simoff, S.J. and Williams, G.J. and Hegland, M.},
pages = {1-9},
publisher = {University of Technology},
abstract = {Exploratory rule discovery, as exemplified by association rule discovery, has proven very popular. In this paper I investigate issues surrounding the statistical validity of rules found using this approach and methods that might be employed to deliver statistically sound exploratory rule discovery.},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
location = {Canberra, Australia},
related = {statistically-sound-association-discovery},
}

ABSTRACT Exploratory rule discovery, as exemplified by association rule discovery, has proven very popular. In this paper I investigate issues surrounding the statistical validity of rules found using this approach and methods that might be employed to deliver statistically sound exploratory rule discovery.

Removing Trivial Associations in Association Rule Discovery.
Webb, G. I., & Zhang, S.
Proceedings of the First International NAISO Congress on Autonomous Intelligent Systems (ICAIS 2002), Canada/The Netherlands, 2002.
[Bibtex] [Abstract] → Download PDF

@InProceedings{WebbZhang02,
Title = {Removing Trivial Associations in Association Rule Discovery},
Author = {Webb, G. I. and Zhang, S.},
Booktitle = {Proceedings of the First International NAISO Congress on Autonomous Intelligent Systems (ICAIS 2002)},
Year = {2002},
Address = {Canada/The Netherlands},
Publisher = {NAISO Academic Press},
Abstract = {Association rule discovery has become one of the most widely applied data mining strategies. Techniques for association rule discovery have been dominated by the frequent itemset strategy as exemplified by the Apriori algorithm. One limitation of this approach is that it provides little opportunity to detect and remove association rules on the basis of relationships between rules. As a result, the association rules discovered are frequently swamped with large numbers of spurious rules that are of little interest to the user. This paper presents association rule discovery techniques that can detect and discard one form of spurious association rule: trivial associations.},
Audit-trail = {Pre-publication PDF posted},
Keywords = {OPUS and Association Rule Discovery},
Location = {Geelong, Australia},
Related = {filtered-top-k-association-discovery}
}

ABSTRACT Association rule discovery has become one of the most widely applied data mining strategies. Techniques for association rule discovery have been dominated by the frequent itemset strategy as exemplified by the Apriori algorithm. One limitation of this approach is that it provides little opportunity to detect and remove association rules on the basis of relationships between rules. As a result, the association rules discovered are frequently swamped with large numbers of spurious rules that are of little interest to the user. This paper presents association rule discovery techniques that can detect and discard one form of spurious association rule: trivial associations.

Discovering Associations with Numeric Variables.
Webb, G. I.
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001)[short paper], New York, pp. 383-388, 2001.
[Bibtex] [Abstract] → Download PDF → Access on publisher site

@InProceedings{Webb01a,
author = {Webb, G. I.},
booktitle = {Proceedings of the Seventh {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2001)[short paper]},
title = {Discovering Associations with Numeric Variables},
year = {2001},
address = {New York},
editor = {Provost, F. and Srikant, R.},
pages = {383-388},
publisher = {The Association for Computing Machinery},
abstract = {This paper further develops Aumann and Lindell's [3] proposal for a variant of association rules for which the consequent is a numeric variable. It is argued that these rules can discover useful interactions with numeric data that cannot be discovered directly using traditional association rules with discretization. Alternative measures for identifying interesting rules are proposed. Efficient algorithms are presented that enable these rules to be discovered for dense data sets for which application of Auman and Lindell's algorithm is infeasible.},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
location = {San Francisco, CA},
related = {impact-rules},
url = {http://dl.acm.org/authorize?19861},
}

ABSTRACT This paper further develops Aumann and Lindell's [3] proposal for a variant of association rules for which the consequent is a numeric variable. It is argued that these rules can discover useful interactions with numeric data that cannot be discovered directly using traditional association rules with discretization. Alternative measures for identifying interesting rules are proposed. Efficient algorithms are presented that enable these rules to be discovered for dense data sets for which application of Auman and Lindell's algorithm is infeasible.

Further Pruning for Efficient Association Rule Discovery.
Webb, G. I., & Zhang, S.
Lecture Notes in Computer Science Vol. 2256: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI'01), Berlin, pp. 605-618, 2001.
[Bibtex] [Abstract] → Download PDF

@InProceedings{WebbZhang01,
author = {Webb, G. I. and Zhang, S.},
booktitle = {Lecture Notes in Computer Science Vol. 2256: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI'01)},
title = {Further Pruning for Efficient Association Rule Discovery},
year = {2001},
address = {Berlin},
editor = {Stumptner, M. and Corbett, D. and Brooks, M.J.},
pages = {605-618},
publisher = {Springer},
abstract = {The Apriori algorithm's frequent itemset approach has become the standard approach to discovering association rules. However, the computation requirements of the frequent itemset approach are infeasible for dense data and the approach is unable to discover infrequent associations. OPUS\_AR is an efficient algorithm for rule discovery that does not utilize frequent itemsets and hence avoids these problems. It can reduce search time by using additional constraints on the search space as well as constraints on itemset frequency. However, the effectiveness of the pruning rules used during search will determine the efficiency of its search. This paper presents and analyzes pruning rules for use with OPUS\_AR. We demonstrate that application of OPUS\_AR is feasible for a number of datasets for which application of the frequent itemset approach is infeasible and that the new pruning rules can reduce compute time by more than 40%.},
keywords = {OPUS and Association Rule Discovery},
location = {Adelaide, Australia},
related = {filtered-top-k-association-discovery},
}

ABSTRACT The Apriori algorithm's frequent itemset approach has become the standard approach to discovering association rules. However, the computation requirements of the frequent itemset approach are infeasible for dense data and the approach is unable to discover infrequent associations. OPUS_AR is an efficient algorithm for rule discovery that does not utilize frequent itemsets and hence avoids these problems. It can reduce search time by using additional constraints on the search space as well as constraints on itemset frequency. However, the effectiveness of the pruning rules used during search will determine the efficiency of its search. This paper presents and analyzes pruning rules for use with OPUS_AR. We demonstrate that application of OPUS_AR is feasible for a number of datasets for which application of the frequent itemset approach is infeasible and that the new pruning rules can reduce compute time by more than 40%.

Efficient Search for Association Rules.
Webb, G. I.
Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), New York, pp. 99-107, 2000.
[Bibtex] [Abstract] → Download PDF

@InProceedings{Webb00b,
Title = {Efficient Search for Association Rules},
Author = {Webb, G. I.},
Booktitle = {Proceedings of the Sixth {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2000)},
Year = {2000},
Address = {New York},
Editor = {Ramakrishnan, R. and Stolfo, S.},
Pages = {99-107},
Publisher = {The Association for Computing Machinery},
Abstract = {This paper argues that for some applications direct search for association rules can be more efficient than the two stage process of the Apriori algorithm which first finds large item sets which are then used to identify associations. In particular, it is argued, Apriori can impose large computational overheads when the number of frequent itemsets is very large. This will often be the case when association rule analysis is performed on domains other than basket analysis or when it is performed for basket analysis with basket information augmented by other customer information. An algorithm is presented that is computationally efficient for association rule analysis during which the number of rules to be found can be constrained and all data can be maintained in memory.},
Keywords = {Search and OPUS and Association Rule Discovery},
Location = {Boston, MA},
Related = {filtered-top-k-association-discovery}
}

ABSTRACT This paper argues that for some applications direct search for association rules can be more efficient than the two stage process of the Apriori algorithm which first finds large item sets which are then used to identify associations. In particular, it is argued, Apriori can impose large computational overheads when the number of frequent itemsets is very large. This will often be the case when association rule analysis is performed on domains other than basket analysis or when it is performed for basket analysis with basket information augmented by other customer information. An algorithm is presented that is computationally efficient for association rule analysis during which the number of rules to be found can be constrained and all data can be maintained in memory.

Inclusive Pruning: A New Class of Pruning Rule for Unordered Search and its Application to Classification Learning.
Webb, G. I.
Australian Computer Science Communications Vol. 18 (1): Proceedings of the Nineteenth Australasian Computer Science Conference (ACSC'96), Melbourne, pp. 1-10, 1996.
[Bibtex] [Abstract] → Download PDF

@InProceedings{Webb96e,
Title = {Inclusive Pruning: A New Class of Pruning Rule for Unordered Search and its Application to Classification Learning},
Author = {Webb, G. I.},
Booktitle = {Australian Computer Science Communications Vol. 18 (1): Proceedings of the Nineteenth Australasian Computer Science Conference (ACSC'96)},
Year = {1996},
Address = {Melbourne},
Editor = {Ramamohanarao, K.},
Pages = {1-10},
Publisher = {ACS},
Abstract = {This paper presents a new class of pruning rule for unordered search. Previous pruning rules for unordered search identify operators that should not be applied in order to prune nodes reached via those operators. In contrast, the new pruning rules identify operators that should be applied and prune nodes that are not reached via those operators. Specific pruning rules employing both these approaches are identified for classification learning. Experimental results demonstrate that application of the new pruning rules can reduce by more than 60% the number of states from the search space that are considered during classification learning.},
Keywords = {Search and Rule Learning and OPUS and Association Rule Discovery},
Location = {Royal Melbourne Insitute of Technology, Australia},
Related = {opus-search}
}

ABSTRACT This paper presents a new class of pruning rule for unordered search. Previous pruning rules for unordered search identify operators that should not be applied in order to prune nodes reached via those operators. In contrast, the new pruning rules identify operators that should be applied and prune nodes that are not reached via those operators. Specific pruning rules employing both these approaches are identified for classification learning. Experimental results demonstrate that application of the new pruning rules can reduce by more than 60% the number of states from the search space that are considered during classification learning.

OPUS: An Efficient Admissible Algorithm For Unordered Search.
Webb, G. I.
Journal of Artificial Intelligence Research, 3, 431-465, 1995.
[Bibtex] [Abstract] → Access on publisher site

@Article{Webb95,
Title = {OPUS: An Efficient Admissible Algorithm For Unordered Search},
Author = {Webb, G. I.},
Journal = {Journal of Artificial Intelligence Research},
Year = {1995},
Pages = {431-465},
Volume = {3},
Abstract = {OPUS is a branch and bound search algorithm that enables efficient admissible search through spaces for which the order of search operator application is not significant. The algorithm's search efficiency is demonstrated with respect to very large machine learning search spaces. The use of admissible search is of potential value to the machine learning community as it means that the exact learning biases to be employed for complex learning tasks can be precisely specified and manipulated. OPUS also has potential for application in other areas of artificial intelligence, notably, truth maintenance.},
Address = {Menlo Park, CA},
Keywords = {Search and Rule Learning and OPUS and Association Rule Discovery},
Publisher = {AAAI Press},
Related = {opus-search},
doi = {10.1613/jair.227}
}

ABSTRACT OPUS is a branch and bound search algorithm that enables efficient admissible search through spaces for which the order of search operator application is not significant. The algorithm's search efficiency is demonstrated with respect to very large machine learning search spaces. The use of admissible search is of potential value to the machine learning community as it means that the exact learning biases to be employed for complex learning tasks can be precisely specified and manipulated. OPUS also has potential for application in other areas of artificial intelligence, notably, truth maintenance.