Data Scientist


Association discovery

Association discovery includes association mining, pattern mining, association rule discovery, subgroup discovery, emerging pattern discovery and contrast discovery.

I have pioneered association discovery techniques that seek the most useful associations, rather than relying on the minimum-support constraint more commonly used in the field. Many of these techniques are included in my Magnum Opus software, which is now incorporated into BigML. Magnum Opus has been widely used in scientific research.

Due to the large numbers of patterns that are considered in association discovery, most techniques suffer a high risk of type-1 error, that is, of finding patterns that appear interesting only due to chance artifacts of the process by which the sample data were generated. Most attempts to control this risk have done so at the cost of a high risk of type-2 error, that is, of falsely rejecting genuine patterns. I have pioneered strategies for statistically sound pattern discovery that strictly control type-1 error during association discovery without the level of type-2 risk suffered by previous approaches.
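The direct-adjustment idea behind these strategies can be sketched as follows: evaluate each candidate pattern with an exact test, and divide the significance level by the size of the search space so that the familywise type-1 risk stays controlled. A minimal Python illustration (a sketch of the general approach, not the actual Magnum Opus or OPUS Miner machinery):

```python
from math import comb

def fisher_exact_p(n_xy: int, n_x: int, n_y: int, n: int) -> float:
    """One-sided Fisher exact test p-value for the rule X -> Y.

    n_xy: transactions containing both X and Y
    n_x : transactions containing X
    n_y : transactions containing Y
    n   : total number of transactions
    """
    # Probability of observing n_xy or more co-occurrences, given that
    # the margins n_x and n_y are fixed (hypergeometric tail).
    total = comb(n, n_x)
    return sum(
        comb(n_y, k) * comb(n - n_y, n_x - k)
        for k in range(n_xy, min(n_x, n_y) + 1)
    ) / total

def is_significant(n_xy, n_x, n_y, n, search_space_size, alpha=0.05):
    """Direct adjustment: divide alpha by the number of patterns the
    search could consider, keeping the familywise type-1 risk at or
    below alpha."""
    return fisher_exact_p(n_xy, n_x, n_y, n) <= alpha / search_space_size
```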

The OPUS Miner statistically sound itemset discovery software can be downloaded here. An R package is available here.

The Skopus statistically sound sequential pattern discovery software can be downloaded here.

ACM SIGKDD 2014 Tutorial on Statistically Sound Pattern Discovery, with Wilhelmiina Hämäläinen.

Publications

Mining significant crisp-fuzzy spatial association rules.
Shi, W., Zhang, A., & Webb, G. I.
International Journal of Geographical Information Science, 2018.
[DOI] [Bibtex] [Abstract]

@Article{doi:10.1080/13658816.2018.1434525,
Title = {Mining significant crisp-fuzzy spatial association rules},
Author = {Wenzhong Shi and Anshu Zhang and Geoffrey I. Webb},
Journal = {International Journal of Geographical Information Science},
Year = {2018},
Abstract = {Spatial association rule mining (SARM) is an important data mining task for understanding implicit and sophisticated interactions in spatial data. The usefulness of SARM results, represented as sets of rules, depends on their reliability: the abundance of rules, control over the risk of spurious rules, and accuracy of rule interestingness measure (RIM) values. This study presents crisp-fuzzy SARM, a novel SARM method that can enhance the reliability of resultant rules. The method firstly prunes dubious rules using statistically sound tests and crisp supports for the patterns involved, and then evaluates RIMs of accepted rules using fuzzy supports. For the RIM evaluation stage, the study also proposes a Gaussian-curve-based fuzzy data discretization model for SARM with improved design for spatial semantics. The proposed techniques were evaluated by both synthetic and real-world data. The synthetic data was generated with predesigned rules and RIM values, thus the reliability of SARM results could be confidently and quantitatively evaluated. The proposed techniques showed high efficacy in enhancing the reliability of SARM results in all three aspects. The abundance of resultant rules was improved by 50% or more compared with using conventional fuzzy SARM. Minimal risk of spurious rules was guaranteed by statistically sound tests. The probability that the entire result contained any spurious rules was below 1%. The RIM values also avoided large positive errors committed by crisp SARM, which typically exceeded 50% for representative RIMs. The real-world case study on New York City points of interest reconfirms the improved reliability of crisp-fuzzy SARM results, and demonstrates that such improvement is critical for practical spatial data analytics and decision support.},
Doi = {10.1080/13658816.2018.1434525},
Keywords = {Association Rule Discovery and statistically sound discovery},
Publisher = {Taylor \& Francis},
Related = {filtered-top-k-association-discovery},
Url = {http://www.tandfonline.com/eprint/aMdSMrAGuEHsHWSzIuqm/full}
}
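The Gaussian-curve-based fuzzy discretization and fuzzy-support evaluation described in the abstract can be illustrated in a few lines. This is a sketch only: the parameter choices and the minimum-based aggregation of membership degrees are assumptions for illustration, not the paper's calibrated model.

```python
from math import exp

def gaussian_membership(x: float, center: float, width: float) -> float:
    """Gaussian membership degree of value x in a fuzzy interval
    centred at `center` with spread `width`."""
    return exp(-0.5 * ((x - center) / width) ** 2)

def fuzzy_support(records, memberships):
    """Fuzzy support of a pattern: each record contributes the minimum
    of its membership degrees in the pattern's fuzzy items, instead of
    a crisp 0/1 count."""
    return sum(min(m(r) for m in memberships) for r in records) / len(records)
```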

Specious rules: an efficient and effective unifying method for removing misleading and uninformative patterns in association rule mining.
Hämäläinen, W., & Webb, G. I.
Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 309-317, 2017.
[PDF] [Bibtex]

@InProceedings{HamalainenWebb17,
Title = {Specious rules: an efficient and effective unifying method for removing misleading and uninformative patterns in association rule mining},
Author = {Hamalainen, Wilhelmiina and Webb, Geoffrey I},
Booktitle = {Proceedings of the 2017 SIAM International Conference on Data Mining},
Year = {2017},
Organization = {SIAM},
Pages = {309-317},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Related = {statistically-sound-association-discovery}
}

A multiple test correction for streams and cascades of statistical hypothesis tests.
Webb, G. I., & Petitjean, F.
Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 1255-1264, 2016.
[PDF] [DOI] [Bibtex] [Abstract]

@InProceedings{WebbPetitjean16,
Title = {A multiple test correction for streams and cascades of statistical hypothesis tests},
Author = {Webb, Geoffrey I and Petitjean, Francois},
Booktitle = {Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16},
Year = {2016},
Pages = {1255-1264},
Publisher = {ACM Press},
Abstract = {Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.
To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.},
Comment = {Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue},
Doi = {10.1145/2939672.2939775},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {statistically-sound-association-discovery},
Url = {http://dl.acm.org/authorize?N19100}
}
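The growth of the familywise error rate that motivates this paper is easy to quantify: under independence, m tests at level alpha yield FWER = 1 - (1 - alpha)^m, which rapidly approaches 1, while a Bonferroni-style correction holds it below alpha. A small illustration (standard textbook formulas, not the paper's Subfamilywise correction):

```python
def fwer_uncorrected(alpha: float, m: int) -> float:
    """Familywise error rate of m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

def fwer_bonferroni(alpha: float, m: int) -> float:
    """FWER when each of the m tests instead uses level alpha / m."""
    return 1 - (1 - alpha / m) ** m

# With 1,000 tests at alpha = 0.05, an uncorrected analysis is almost
# certain to make at least one false discovery; the corrected rate
# stays below alpha.
```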

Mining significant association rules from uncertain data.
Zhang, A., Shi, W., & Webb, G. I.
Data Mining and Knowledge Discovery, 30(4), 928-963, 2016.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{ZhangEtAl16,
Title = {Mining significant association rules from uncertain data},
Author = {Zhang, Anshu and Shi, Wenzhong and Webb, Geoffrey I},
Journal = {Data Mining and Knowledge Discovery},
Year = {2016},
Number = {4},
Pages = {928-963},
Volume = {30},
Abstract = {In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.},
Doi = {10.1007/s10618-015-0446-6},
Keywords = {Association Rule Discovery and statistically sound discovery},
Publisher = {Springer},
Related = {statistically-sound-association-discovery}
}

Scaling log-linear analysis to datasets with thousands of variables.
Petitjean, F., & Webb, G. I.
Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 469-477, 2015.
[URL] [Bibtex] [Abstract]

@InProceedings{PetitjeanWebb15,
Title = {Scaling log-linear analysis to datasets with thousands of variables},
Author = {F. Petitjean and G.I. Webb},
Booktitle = {Proceedings of the 2015 {SIAM} International Conference on Data Mining},
Year = {2015},
Pages = {469-477},
Abstract = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.},
Comment = {Best Research Paper Honorable Mention Award},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {scalable-graphical-modeling},
Url = {http://epubs.siam.org/doi/pdf/10.1137/1.9781611974010.53}
}

Efficient Discovery of the Most Interesting Associations.
Webb, G. I., & Vreeken, J.
ACM Transactions on Knowledge Discovery from Data, 8(3), Art. no. 15, 2014.
[URL] [Bibtex] [Abstract]

@Article{WebbVreeken13,
Title = {Efficient Discovery of the Most Interesting Associations},
Author = {G.I. Webb and J. Vreeken},
Journal = {{ACM} Transactions on Knowledge Discovery from Data},
Year = {2014},
Number = {3},
Volume = {8},
Abstract = {Self-sufficient itemsets have been proposed as an effective approach to summarizing the key associations in data. However, their computation appears highly demanding, as assessing whether an itemset is self-sufficient requires consideration of all pairwise partitions of the itemset into pairs of subsets as well as consideration of all supersets. This paper presents the first published algorithm for efficiently discovering self-sufficient itemsets. This branch-and-bound algorithm deploys two powerful pruning mechanisms based on upper-bounds on itemset value and statistical significance level. It demonstrates that finding top-k productive and non-redundant itemsets, with post processing to identify those that are not independently productive, can efficiently identify small sets of key associations. We present extensive evaluation of the strengths and limitations of the technique including comparisons with alternative approaches to finding the most interesting associations.},
Articlenumber = {15},
Direct-url = {http://dx.doi.org/10.1145/2601433},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {ACM},
Related = {filtered-top-k-association-discovery},
Url = {http://dl.acm.org/authorize?N80829}
}
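The productivity condition at the heart of self-sufficiency, that an itemset's frequency must exceed what every binary partition into two subsets would predict under independence, can be sketched as follows. This is a hypothetical helper, not the paper's branch-and-bound algorithm, which additionally prunes with upper bounds and applies significance tests and superset checks:

```python
from itertools import combinations

def productive(support, itemset, min_lift=1.0):
    """Return True if `itemset` is more frequent than every binary
    partition into two subsets predicts under independence.

    support: dict mapping frozensets of items to relative frequency.
    itemset: frozenset of two or more items.
    """
    items = sorted(itemset)
    for r in range(1, len(items)):
        for left in combinations(items, r):
            a = frozenset(left)
            b = itemset - a
            # Each partition is visited twice (as (a, b) and (b, a));
            # harmless for a sketch, wasteful in a real implementation.
            if support[itemset] <= min_lift * support[a] * support[b]:
                return False
    return True
```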

A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data.
Petitjean, F., Allison, L., & Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 480-489, 2014.
[PDF] [URL] [Bibtex] [Abstract]

@InProceedings{PetitjeanEtAl14a,
Title = {A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data},
Author = {F. Petitjean and L. Allison and G.I. Webb},
Booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
Year = {2014},
Pages = {480-489},
Abstract = {Log-linear analysis is the primary statistical approach to discovering conditional dependencies between the variables of a dataset. A good log-linear analysis method requires both high precision and statistical efficiency. High precision means that the risk of false discoveries should be kept very low. Statistical efficiency means that the method should discover actual associations with as few samples as possible. Classical approaches to log-linear analysis make use of χ2 tests to control this balance between quality and complexity. We present an information-theoretic approach to log-linear analysis. We show that our approach 1) requires significantly fewer samples to discover the true associations than statistical approaches -- statistical efficiency -- 2) controls for the risk of false discoveries as well as statistical approaches -- high precision - and 3) can perform the discovery on datasets with hundreds of variables on a standard desktop computer -- computational efficiency.},
Comment = {One of nine papers invited to Knowledge and Information Systems journal ICDM-14 special issue},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and DP140100087},
Related = {scalable-graphical-modeling},
Url = {http://dx.doi.org/10.1109/ICDM.2014.23}
}

Scaling log-linear analysis to high-dimensional data.
Petitjean, F., Webb, G. I., & Nicholson, A. E.
Proceedings of the 13th IEEE International Conference on Data Mining, pp. 597-606, 2013.
[PDF] [DOI] [Bibtex] [Abstract]

@InProceedings{PetitjeanEtAl13,
Title = {Scaling log-linear analysis to high-dimensional data},
Author = {F. Petitjean and G. I. Webb and A. E. Nicholson},
Booktitle = {Proceedings of the 13th {IEEE} International Conference on Data Mining},
Year = {2013},
Pages = {597-606},
Abstract = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We develop an efficient approach to log-linear analysis that scales to hundreds of variables by melding the classical statistical machinery of log-linear analysis with advanced data mining techniques from association discovery and graphical modeling.},
Doi = {10.1109/ICDM.2013.17},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {scalable-graphical-modeling}
}

Filtered-top-k Association Discovery.
Webb, G. I.
WIREs Data Mining and Knowledge Discovery, 1(3), 183-192, 2011.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{Webb11,
Title = {Filtered-top-k Association Discovery},
Author = {G.I. Webb},
Journal = {WIREs Data Mining and Knowledge Discovery},
Year = {2011},
Number = {3},
Pages = {183-192},
Volume = {1},
Abstract = {Association mining has been one of the most intensively researched areas of data mining. However, direct uptake of the resulting technologies has been relatively low. This paper examines some of the reasons why the dominant paradigms in association mining have not lived up to their promise, and argues that a powerful alternative is provided by top-k techniques coupled with appropriate statistical and other filtering.},
Doi = {10.1002/widm.28},
Keywords = {Association Rule Discovery and statistically sound discovery},
Publisher = {Wiley},
Related = {filtered-top-k-association-discovery}
}
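The filtered-top-k alternative the paper argues for can be sketched in a few lines: keep the k highest-scoring patterns among those that survive statistical and other filters. The interface here is hypothetical; `score` and `passes_filter` stand in for measures such as leverage and statistical significance tests.

```python
import heapq

def filtered_top_k(candidates, score, passes_filter, k):
    """Return the k highest-scoring patterns that pass the filter,
    best first, as (score, pattern) pairs."""
    heap = []  # min-heap of (score, pattern), size at most k
    for pattern in candidates:
        if not passes_filter(pattern):
            continue
        s = score(pattern)
        if len(heap) < k:
            heapq.heappush(heap, (s, pattern))
        elif s > heap[0][0]:
            # New pattern beats the current worst of the top k.
            heapq.heapreplace(heap, (s, pattern))
    return sorted(heap, reverse=True)
```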

Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items.
Webb, G. I.
ACM Transactions on Knowledge Discovery from Data, 4, Art. no. 3, 2010.
[URL] [Bibtex] [Abstract]

@Article{Webb10,
Title = {Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items},
Author = {G.I. Webb},
Journal = {{ACM} Transactions on Knowledge Discovery from Data},
Year = {2010},
Volume = {4},
Abstract = {Self-sufficient itemsets are those whose frequency cannot be explained solely by the frequency of either their subsets or of their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency should be expected once that of the itemsets on which their frequency depends is known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a post-processing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.},
Articlenumber = {3},
Issue = {1},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {ACM},
Related = {filtered-top-k-association-discovery},
Url = {http://dl.acm.org/authorize?270473}
}

Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining.
Novak, P., Lavrac, N., & Webb, G. I.
Journal of Machine Learning Research, 10, 377-403, 2009.
[URL] [Bibtex] [Abstract]

@Article{NovakLavracWebb09,
Title = {Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining},
Author = {P. Novak and N. Lavrac and G.I. Webb},
Journal = {Journal of Machine Learning Research},
Year = {2009},
Pages = {377-403},
Volume = {10},
Abstract = {This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.},
Keywords = {Association Rule Discovery and OPUS},
Related = {filtered-top-k-association-discovery},
Url = {http://www.jmlr.org/papers/volume10/kralj-novak09a/kralj-novak09a.pdf}
}

Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns.
Webb, G. I.
Machine Learning, 71(2-3), 307-323 [Technical Note], 2008.
[DOI] [Bibtex] [Abstract]

@Article{Webb08,
Title = {Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns},
Author = {G.I. Webb},
Journal = {Machine Learning},
Year = {2008},
Number = {2-3},
Pages = {307-323 [Technical Note]},
Volume = {71},
Abstract = {Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.},
Address = {Netherlands},
Audit-trail = {DOI 10.1007/s10994-008-5046-x},
Doi = {10.1007/s10994-008-5046-x},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Notes = {Technical Note},
Publisher = {Springer},
Related = {statistically-sound-association-discovery}
}
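The layered idea in this abstract can be illustrated with a small sketch (hypothetical code, not from the paper): a flat Bonferroni adjustment divides the significance level by the size of the entire search space, whereas a layered scheme first allocates the significance level across layers of the space and then corrects within each layer. The allocation below, which splits alpha equally over layers defined by itemset size, is one simple variant of the idea; `layered_critical_values` is an illustrative name.

```python
from math import comb

def layered_critical_values(alpha, n_items, max_size):
    # Allocate alpha equally across layers (itemsets grouped by size),
    # then Bonferroni-correct within each layer. This is one possible
    # allocation; the paper develops and evaluates variants of the idea.
    per_layer = alpha / max_size
    return {size: per_layer / comb(n_items, size)
            for size in range(1, max_size + 1)}

cvs = layered_critical_values(alpha=0.05, n_items=20, max_size=3)
flat = 0.05 / sum(comb(20, s) for s in (1, 2, 3))  # one value for the whole space
# The small layers receive looser critical values than the flat adjustment,
# so significance tests there retain more power, while the total risk of any
# false discovery still sums to at most alpha.
```

Because the per-layer budgets sum to alpha, experimentwise error remains strictly controlled, yet patterns in sparse layers face a far less severe correction than under the flat scheme.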

Discovering Significant Patterns.
Webb, G. I.
Machine Learning, 68(1), 1-33, 2007.
[DOI] [Bibtex] [Abstract]

@Article{Webb07,
Title = {Discovering Significant Patterns},
Author = {G.I. Webb},
Journal = {Machine Learning},
Year = {2007},
Number = {1},
Pages = {1-33},
Volume = {68},
Abstract = {Exploratory pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard exploratory pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to statistical evaluation on holdout data. They also reveal that modification of the pattern discovery process to anticipate subsequent statistical evaluation can increase the number of patterns that are accepted by statistical evaluation on holdout data.},
Address = {Netherlands},
Audit-trail = {subject to revisions},
Doi = {10.1007/s10994-007-5006-x},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {Springer},
Related = {statistically-sound-association-discovery}
}
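The holdout-evaluation idea in this abstract can be sketched as follows (hypothetical code: the paper applies well-established statistical practices, while this sketch uses just a one-sided binomial test against a user-stated baseline, Bonferroni-corrected over the candidate set so that experimentwise error stays below alpha; `holdout_evaluate` and the candidate format are illustrative choices).

```python
from math import comb

def binom_sf(k, n, p):
    # P(X >= k) for X ~ Binomial(n, p): the one-sided p-value of seeing
    # at least k successes in n trials under the baseline rate p.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def holdout_evaluate(candidates, holdout, alpha=0.05):
    # candidates: (antecedent_itemset, consequent_item, baseline_rate) triples
    # discovered on the exploratory partition of the data. A rule is accepted
    # only if, on the holdout transactions covered by its antecedent, the
    # consequent occurs significantly more often than the baseline, at a
    # critical value adjusted for the number of candidates evaluated.
    adjusted = alpha / len(candidates)
    accepted = []
    for ante, cons, baseline in candidates:
        covered = [t for t in holdout if ante <= t]
        hits = sum(1 for t in covered if cons in t)
        if covered and binom_sf(hits, len(covered), baseline) <= adjusted:
            accepted.append((ante, cons))
    return accepted
```

Since the correction depends only on the number of candidate rules, not on the size of the original search space, holdout evaluation can accept patterns that a direct adjustment over the full space would reject.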

Mining Group Differences.
Butler, S., & Webb, G. I.
In Wang, J. (Ed.), The Encyclopedia of Data Warehousing and Mining (pp. 795-799). Hershey, PA: Idea Group Inc., 2006.
[DOI] [Bibtex]

@InCollection{ButlerWebb05,
Title = {Mining Group Differences},
Author = {S. Butler and G. I. Webb},
Booktitle = {The Encyclopedia of Data Warehousing and Mining},
Publisher = {Idea Group Inc.},
Year = {2006},
Address = {Hershey, PA},
Editor = {John Wang },
Pages = {795-799},
Audit-trail = {August 04 Copyright signed. Shane handling submission. PDF not posted},
Doi = {10.4018/978-1-60566-010-3.ch199},
Keywords = {Association Rule Discovery}
}

Efficiently Identifying Exploratory Rules’ Significance.
Huang, S., & Webb, G. I.
LNAI State-of-the-Art Survey series, ‘Data Mining: Theory, Methodology, Techniques, and Applications’, Berlin/Heidelberg, pp. 64-77, 2006.
[DOI] [Bibtex] [Abstract]

@InProceedings{HuangWebb05b,
Title = {Efficiently Identifying Exploratory Rules' Significance},
Author = {S. Huang and G.I. Webb},
Booktitle = {LNAI State-of-the-Art Survey series, 'Data Mining: Theory, Methodology, Techniques, and Applications'},
Year = {2006},
Address = {Berlin/Heidelberg},
Note = {An earlier version of this paper was published in S.J. Simoff and G.J. Williams (Eds.), Proceedings of the Third Australasian Data Mining Conference (AusDM04) Cairns, Australia. Sydney: University of Technology, pages 169-182.},
Pages = {64-77},
Publisher = {Springer},
Abstract = {How to efficiently discard potentially uninteresting rules in exploratory rule discovery is one of the important research foci in data mining. Many researchers have presented algorithms to automatically remove potentially uninteresting rules utilizing background knowledge and user-specified constraints. Identifying the significance of exploratory rules using a significance test is desirable for removing rules that may appear interesting by chance, hence providing the users with a more compact set of resulting rules. However, applying statistical tests to identify significant rules requires considerable computation and data access in order to obtain the necessary statistics. The situation gets worse as the size of the database increases. In this paper, we propose two approaches for improving the efficiency of significant exploratory rule discovery. We also evaluate the experimental effect in impact rule discovery which is suitable for discovering exploratory rules in very large, dense databases.},
Doi = {10.1007/11677437_6},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
Related = {statistically-sound-association-discovery}
}

K-Optimal Pattern Discovery: An Efficient and Effective Approach to Exploratory Data Mining.
Webb, G. I.
Lecture Notes in Computer Science 3809: Advances in Artificial Intelligence, Proceedings of the 18th Australian Joint Conference on Artificial Intelligence (AI 2005) [Extended Abstract], Berlin/Heidelberg, pp. 1-2, 2005.
[PDF] [Bibtex]

@InProceedings{Webb05a,
Title = {K-Optimal Pattern Discovery: An Efficient and Effective Approach to Exploratory Data Mining},
Author = {G.I. Webb},
Booktitle = {Lecture Notes in Computer Science 3809: Advances in Artificial Intelligence, Proceedings of the 18th Australian Joint Conference on Artificial Intelligence (AI 2005)[Extended Abstract]},
Year = {2005},
Address = {Berlin/Heidelberg},
Editor = {S. Zhang and R. Jarvis},
Pages = {1-2},
Publisher = {Springer},
Audit-trail = {http://dx.doi.org/10.1007/11589990_1},
Keywords = {Association Rule Discovery},
Location = {Sydney, Australia},
Related = {filtered-top-k-association-discovery}
}

K-Optimal Rule Discovery.
Webb, G. I., & Zhang, S.
Data Mining and Knowledge Discovery, 10(1), 39-79, 2005.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{WebbZhang05,
Title = {K-Optimal Rule Discovery},
Author = {G. I. Webb and S. Zhang},
Journal = {Data Mining and Knowledge Discovery},
Year = {2005},
Number = {1},
Pages = {39-79},
Volume = {10},
Abstract = {K-most-interesting rule discovery finds the k rules that optimize a user-specified measure of interestingness with respect to a set of sample data and user-specified constraints. This approach avoids many limitations of the frequent itemset approach of association rule discovery. This paper presents a scalable algorithm applicable to a wide range of k-most-interesting rule discovery tasks and demonstrates its efficiency.},
Address = {Netherlands},
Doi = {10.1007/s10618-005-0255-4},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {Springer},
Related = {filtered-top-k-association-discovery}
}
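The task defined in this abstract, finding the k rules that optimize a user-chosen interestingness measure with no minimum-support constraint, can be sketched by brute force (the paper's OPUS-based algorithm searches the space far more efficiently; `top_k_rules` and the use of leverage as the measure are illustrative choices, not the paper's code).

```python
from itertools import combinations
import heapq

def top_k_rules(transactions, k):
    # Brute force over single-item antecedent/consequent pairs, ranking by
    # leverage (observed joint frequency minus the frequency expected under
    # independence) and keeping the best k -- no minimum-support constraint.
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq = {i: sum(1 for t in transactions if i in t) / n for i in items}
    heap = []  # min-heap holding the current k best (leverage, (ante, cons))
    for x, y in combinations(items, 2):
        joint = sum(1 for t in transactions if x in t and y in t) / n
        for ante, cons in ((x, y), (y, x)):
            leverage = joint - freq[ante] * freq[cons]
            heapq.heappush(heap, (leverage, (ante, cons)))
            if len(heap) > k:
                heapq.heappop(heap)  # discard the currently worst rule
    return sorted(heap, reverse=True)

tx = [{'a', 'b'}, {'a', 'b'}, {'a', 'b'}, {'c'}, {'c'}]
print(top_k_rules(tx, k=2))  # a and b co-occur far above chance
```

Bounding the result to k rules means low-support but high-leverage rules are never discarded up front, which is exactly the limitation of the frequent-itemset approach that the abstract highlights.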

Discarding Insignificant Rules During Impact Rule Discovery in Large, Dense Databases.
Huang, S., & Webb, G. I.
Proceedings of the Fifth SIAM International Conference on Data Mining (SDM’05) [short paper], Philadelphia, PA, pp. 541-545, 2005.
[PDF] [Bibtex] [Abstract]

@InProceedings{HuangWebb05,
Title = {Discarding Insignificant Rules During Impact Rule Discovery in Large, Dense Databases},
Author = {S. Huang and G.I. Webb},
Booktitle = {Proceedings of the Fifth {SIAM} International Conference on Data Mining ({SDM}'05) [short paper]},
Year = {2005},
Address = {Philadelphia, PA},
Editor = {H. Kargupta and C. Kamath and J. Srivastava and A. Goodman},
Pages = {541-545},
Publisher = {Society for Industrial and Applied Mathematics},
Abstract = {Considerable progress has been made on how to reduce the number of spurious exploratory rules with quantitative attributes. However, little has been done for rules with undiscretized quantitative attributes. It is argued that propositional rules cannot effectively describe the interactions between quantitative and qualitative attributes. Aumann and Lindell proposed quantitative association rules to provide a better description of such relationships, together with a rule pruning technique. Since their technique is based on the frequent itemset framework, it is not suitable for rule discovery in large, dense databases. In this paper, an efficient technique for automatically discarding insignificant rules during rule discovery is proposed, based on the OPUS search algorithm. Experiments demonstrate that the algorithm we propose can efficiently remove potentially uninteresting rules even in very large, dense databases.},
Audit-trail = {Shiying travelling to present paper. Requested permission to post pdf 10/2},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
Location = {Newport Beach, CA},
Related = {impact-rules}
}

Pruning Derivative Partial Rules During Impact Rule Discovery.
Huang, S., & Webb, G. I.
Lecture Notes in Computer Science Vol. 3518: Proceedings of the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2005), Berlin/Heidelberg, pp. 71-80, 2005.
[PDF] [Bibtex] [Abstract]

@InProceedings{HuangWebb05a,
Title = {Pruning Derivative Partial Rules During Impact Rule Discovery},
Author = {S. Huang and G.I. Webb},
Booktitle = {Lecture Notes in Computer Science Vol. 3518: Proceedings of the 9th {Pacific}-{Asia} Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2005)},
Year = {2005},
Address = {Berlin/Heidelberg},
Editor = {T.B. Ho and D. Cheung and H. Liu },
Pages = {71-80},
Publisher = {Springer},
Abstract = {Because exploratory rule discovery works with data that is only a sample of the phenomena to be investigated, some resulting rules may appear interesting only by chance. Techniques have been developed for automatically discarding statistically insignificant exploratory rules that cannot survive a hypothesis test with regard to their ancestors. We call such insignificant rules derivative extended rules. In this paper, we argue that there is another type of derivative exploratory rule, which is derivative with regard to its children. We also argue that a considerable number of such derivative partial rules cannot be successfully removed using existing rule pruning techniques. We propose a new technique to address this problem. Experiments are conducted in impact rule discovery to evaluate the effect of this derivative partial rule filter. Results show that the inherent problem of too many resulting rules in exploratory rule discovery is alleviated.},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
Location = {Hanoi, Vietnam},
Related = {impact-rules}
}

Mining Negative Rules using GRD.
Thiruvady, D. R., & Webb, G. I.
Lecture Notes in Computer Science Vol. 3056: Proceedings of the Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 04) [Short Paper], Berlin/Heidelberg, pp. 161-165, 2004.
[PDF] [Bibtex] [Abstract]

@InProceedings{ThiruvadyWebb04,
Title = {Mining Negative Rules using GRD},
Author = {D. R. Thiruvady and G. I. Webb},
Booktitle = {Lecture Notes in Computer Science Vol. 3056: Proceedings of the Eighth {Pacific}-{Asia} Conference on Knowledge Discovery and Data Mining (PAKDD 04) [Short Paper]},
Year = {2004},
Address = {Berlin/Heidelberg},
Editor = {H. Dai and R. Srikant and C. Zhang },
Pages = {161-165},
Publisher = {Springer},
Abstract = {GRD is an algorithm for k-most interesting rule discovery. In contrast to association rule discovery, GRD does not require the use of a minimum support constraint. Rather, the user must specify a measure of interestingness and the number of rules sought (k). This paper reports efficient techniques to extend GRD to support mining of negative rules. We demonstrate that the new approach provides tractable discovery of both negative and positive rules.},
Audit-trail = {PDF posted 23/8},
Keywords = {Association Rule Discovery and OPUS},
Location = {Sydney, Australia},
Related = {filtered-top-k-association-discovery}
}

Preliminary Investigations into Statistically Valid Exploratory Rule Discovery.
Webb, G. I.
Proceedings of the Second Australasian Data Mining Conference (AusDM03), Sydney, pp. 1-9, 2003.
[PDF] [Bibtex] [Abstract]

@InProceedings{Webb03a,
Title = {Preliminary Investigations into Statistically Valid Exploratory Rule Discovery},
Author = {G.I. Webb},
Booktitle = {Proceedings of the Second Australasian Data Mining Conference (AusDM03)},
Year = {2003},
Address = {Sydney},
Editor = {S.J. Simoff and G.J. Williams and M. Hegland},
Pages = {1-9},
Publisher = {University of Technology},
Abstract = {Exploratory rule discovery, as exemplified by association rule discovery, has proven very popular. In this paper I investigate issues surrounding the statistical validity of rules found using this approach and methods that might be employed to deliver statistically sound exploratory rule discovery.},
Audit-trail = {Submitted to AusDM03. No copyright required. Check key words},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Location = {Canberra, Australia},
Related = {statistically-sound-association-discovery}
}

Association Rules.
Webb, G. I.
In Ye, N. (Ed.), The Handbook of Data Mining, Chapter 2 (pp. 25-39). Lawrence Erlbaum Associates, 2003.
[Bibtex]

@InCollection{Webb03,
Title = {Association Rules},
Author = {G. I. Webb},
Booktitle = {The Handbook of Data Mining, Chapter 2},
Publisher = {Lawrence Erlbaum Associates},
Year = {2003},
Editor = {Dr. Nong Ye},
Pages = {25-39},
Audit-trail = {*},
Keywords = {Association Rule Discovery}
}

Identifying Approximate Item-Sets Of Interest In Large Databases.
Zhang, C., Zhang, S., & Webb, G. I.
Applied Intelligence, 18, 91-104, 2003.
[URL] [Bibtex] [Abstract]

@Article{ZhangZhangWebb03,
Title = {Identifying Approximate Item-Sets Of Interest In Large Databases},
Author = {C. Zhang and S. Zhang and G. I. Webb},
Journal = {Applied Intelligence},
Year = {2003},
Pages = {91-104},
Volume = {18},
Abstract = {This paper presents a method for discovering approximate frequent itemsets of interest in large scale databases. This method uses the central limit theorem to increase efficiency, enabling us to reduce the sample size by about half compared to previous approximations. Further efficiency is gained by pruning from the search space uninteresting frequent itemsets. In addition to improving efficiency, this measure also reduces the number of itemsets that the user need consider. The model and algorithm have been implemented and evaluated using both synthetic and real-world databases. Our experimental results demonstrate the efficiency of the approach.},
Address = {Netherlands},
Audit-trail = {Link to paper via Kluwer site. No PDF posted},
Keywords = {Association Rule Discovery},
Publisher = {Springer},
Related = {filtered-top-k-association-discovery},
Url = {http://link.springer.com/content/pdf/10.1023%2FA%3A1020995206763.pdf}
}
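The roughly-halved sample size claimed in this abstract can be seen in a generic worst-case comparison of a normal-approximation (central limit theorem) bound against a distribution-free Hoeffding bound of the kind used by earlier sampling approaches (an illustrative calculation, not the paper's exact derivation; the function names are assumptions).

```python
from math import ceil, log
from statistics import NormalDist

def sample_size_clt(epsilon, delta):
    # Normal-approximation bound: smallest n such that an estimated support
    # deviates from the true support by more than epsilon with probability
    # at most delta, in the worst case p = 0.5.
    z = NormalDist().inv_cdf(1 - delta / 2)
    return ceil(z * z / (4 * epsilon * epsilon))

def sample_size_hoeffding(epsilon, delta):
    # Distribution-free Hoeffding/Chernoff-style bound with the same guarantee.
    return ceil(log(2 / delta) / (2 * epsilon * epsilon))

# To estimate support within 0.01 at 95% confidence, the CLT bound needs
# roughly half as many transactions as the Hoeffding bound.
print(sample_size_clt(0.01, 0.05), sample_size_hoeffding(0.01, 0.05))
```

For delta = 0.05, the ratio between the two bounds is about z²/(2 ln(2/δ)) ≈ 0.52, which is consistent with the abstract's "about half" claim.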