# OPUS Search

OPUS is an efficient search algorithm for exploring the space of conjunctive patterns. It supports extremely fast rule discovery.

### Publications

Boley, M., Teshuva, S., Bodic, P. L., & Webb, G. I.
Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pp. 351-359, 2021.
[Bibtex] [Abstract]  → Access on publisher site

@InProceedings{boley2021better,
author = {Boley, Mario and Teshuva, Simon and Bodic, Pierre Le and Webb, Geoffrey I},
booktitle = {Proceedings of the 2021 SIAM International Conference on Data Mining (SDM)},
title = {Better Short than Greedy: Interpretable Models through Optimal Rule Boosting},
year = {2021},
organization = {SIAM},
pages = {351-359},
abstract = {Rule ensembles are designed to provide a useful trade-off between predictive accuracy and model interpretability. However, the myopic and random search components of current rule ensemble methods can compromise this goal: they often need more rules than necessary to reach a certain accuracy level or can even outright fail to accurately model a distribution that can actually be described well with a few rules. Here, we present a novel approach aiming to fit rule ensembles of maximal predictive power for a given ensemble size (and thus model comprehensibility). In particular, we present an efficient branch-and-bound algorithm that optimally solves the per-rule objective function of the popular second-order gradient boosting framework. Our main insight is that the boosting objective can be tightly bounded in linear time of the number of covered data points. Along with an additional novel pruning technique related to rule redundancy, this leads to a computationally feasible approach for boosting optimal rules that, as we demonstrate on a wide range of common benchmark problems, consistently outperforms the predictive performance of boosting greedy rules.},
keywords = {Rule Learning and OPUS and Association Rule Discovery},
related = {opus-search},
url = {https://arxiv.org/abs/2101.08380},
}
ABSTRACT Rule ensembles are designed to provide a useful trade-off between predictive accuracy and model interpretability. However, the myopic and random search components of current rule ensemble methods can compromise this goal: they often need more rules than necessary to reach a certain accuracy level or can even outright fail to accurately model a distribution that can actually be described well with a few rules. Here, we present a novel approach aiming to fit rule ensembles of maximal predictive power for a given ensemble size (and thus model comprehensibility). In particular, we present an efficient branch-and-bound algorithm that optimally solves the per-rule objective function of the popular second-order gradient boosting framework. Our main insight is that the boosting objective can be tightly bounded in linear time of the number of covered data points. Along with an additional novel pruning technique related to rule redundancy, this leads to a computationally feasible approach for boosting optimal rules that, as we demonstrate on a wide range of common benchmark problems, consistently outperforms the predictive performance of boosting greedy rules.

Hamalainen, W., & Webb, G. I.
Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 309-317, 2017.

@InProceedings{HamalainenWebb17,
Title = {Specious rules: an efficient and effective unifying method for removing misleading and uninformative patterns in association rule mining},
Author = {Hamalainen, Wilhelmiina and Webb, Geoffrey I},
Booktitle = {Proceedings of the 2017 SIAM International Conference on Data Mining},
Year = {2017},
Organization = {SIAM},
Pages = {309-317},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Related = {statistically-sound-association-discovery}
}
ABSTRACT 

Petitjean, F., Li, T., Tatti, N., & Webb, G. I.
Data Mining and Knowledge Discovery, 30(5), 1086-1111, 2016.

@Article{PetitjeanEtAl16b,
Title = {Skopus: Mining top-k sequential patterns under leverage},
Author = {Petitjean, Francois and Li, Tao and Tatti, Nikolaj and Webb, Geoffrey I.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2016},
Number = {5},
Pages = {1086-1111},
Volume = {30},
Abstract = {This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern---a concept on which most interestingness measures directly rely---with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.},
Doi = {10.1007/s10618-016-0467-9},
ISSN = {1573-756X},
Keywords = {OPUS and Association Rule Discovery and statistically sound discovery},
Related = {statistically-sound-association-discovery},
Url = {http://rdcu.be/tsDo}
}
ABSTRACT This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern–-a concept on which most interestingness measures directly rely–-with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.

Webb, G. I., & Vreeken, J.
ACM Transactions on Knowledge Discovery from Data, 8(3), Art. no. 15, 2014.
[Bibtex] [Abstract]  → Access on publisher site

@Article{WebbVreeken13,
author = {Webb, G. I. and Vreeken, J.},
journal = {{ACM} Transactions on Knowledge Discovery from Data},
title = {Efficient Discovery of the Most Interesting Associations},
year = {2014},
number = {3},
volume = {8},
abstract = {Self-sufficient itemsets have been proposed as an effective approach to summarizing the key associations
in data. However, their computation appears highly demanding, as assessing whether an itemset is selfsufficient
requires consideration of all pairwise partitions of the itemset into pairs of subsets as well as
consideration of all supersets. This paper presents the first published algorithm for efficiently discovering
self-sufficient itemsets. This branch-and-bound algorithm deploys two powerful pruning mechanisms
based on upper-bounds on itemset value and statistical significance level. It demonstrates that finding top-k
productive and non-redundant itemsets, with post processing to identify those that are not independently
productive, can efficiently identify small sets of key associations. We present extensive evaluation of the
strengths and limitations of the technique including comparisons with alternative approaches to finding the
most interesting associations.},
articlenumber = {15},
doi = {10.1145/2601433},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
publisher = {ACM},
related = {filtered-top-k-association-discovery},
url = {http://dl.acm.org/authorize?N80829},
}
ABSTRACT Self-sufficient itemsets have been proposed as an effective approach to summarizing the key associations in data. However, their computation appears highly demanding, as assessing whether an itemset is selfsufficient requires consideration of all pairwise partitions of the itemset into pairs of subsets as well as consideration of all supersets. This paper presents the first published algorithm for efficiently discovering self-sufficient itemsets. This branch-and-bound algorithm deploys two powerful pruning mechanisms based on upper-bounds on itemset value and statistical significance level. It demonstrates that finding top-k productive and non-redundant itemsets, with post processing to identify those that are not independently productive, can efficiently identify small sets of key associations. We present extensive evaluation of the strengths and limitations of the technique including comparisons with alternative approaches to finding the most interesting associations.

Webb, G. I.
ACM Transactions on Knowledge Discovery from Data, 4, Art. no. 3, 2010.
[Bibtex] [Abstract]  → Access on publisher site

@Article{Webb10,
author = {Webb, G. I.},
journal = {{ACM} Transactions on Knowledge Discovery from Data},
title = {Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items},
year = {2010},
volume = {4},
abstract = {Self-sufficient itemsets are those whose frequency cannot explained solely by the frequency of either their subsets or of their
supersets. We argue that itemsets that are not
self-sufficient will often be of little interest to the data
analyst, as their frequency should be expected once that of the
itemsets on which their frequency depends is known. We present
statistical tests for statistically sound discovery of
self-sufficient itemsets, and computational techniques that allow
those tests to be applied as a post-processing step for any itemset
discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.},
articlenumber = {3},
doi = {10.1145/1644873.1644876},
issue = {1},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
publisher = {ACM},
related = {filtered-top-k-association-discovery},
url = {http://dl.acm.org/authorize?270473},
}
ABSTRACT Self-sufficient itemsets are those whose frequency cannot explained solely by the frequency of either their subsets or of their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency should be expected once that of the itemsets on which their frequency depends is known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a post-processing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.

Novak, P., Lavrac, N., & Webb, G. I.
Journal of Machine Learning Research, 10, 377-403, 2009.
[Bibtex] [Abstract]  → Access on publisher site

@Article{NovakLavracWebb09,
author = {Novak, P. and Lavrac, N. and Webb, G. I.},
journal = {Journal of Machine Learning Research},
title = {Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining},
year = {2009},
pages = {377-403},
volume = {10},
abstract = {This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.},
keywords = {Association Rule Discovery and OPUS},
related = {filtered-top-k-association-discovery},
url = {http://www.jmlr.org/papers/volume10/kralj-novak09a/kralj-novak09a.pdf},
}
ABSTRACT This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.

Webb, G. I.
Machine Learning, 71(2-3), 307-323 [Technical Note], 2008.
[Bibtex] [Abstract]  → Access on publisher site

@Article{Webb08,
author = {Webb, G. I.},
journal = {Machine Learning},
title = {Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns},
year = {2008},
number = {2-3},
pages = {307-323 [Technical Note]},
volume = {71},
abstract = {Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.},
audit-trail = {DOI 10.1007/s10994-008-5046-x},
doi = {10.1007/s10994-008-5046-x},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
notes = {Technical Note},
publisher = {Springer},
related = {statistically-sound-association-discovery},
}
ABSTRACT Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.

Webb, G. I.
Machine Learning, 68(1), 1-33, 2007.
[Bibtex] [Abstract]  → Access on publisher site

@Article{Webb07,
author = {Webb, G. I.},
journal = {Machine Learning},
title = {Discovering Significant Patterns},
year = {2007},
number = {1},
pages = {1-33},
volume = {68},
abstract = {Exploratory pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard exploratory pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to statistical evaluation on holdout data. They also reveal that modification of the pattern discovery process to anticipate subsequent statistical evaluation can increase the number of patterns that are accepted by statistical evaluation on holdout data.},
audit-trail = {subject to revisions},
doi = {10.1007/s10994-007-5006-x},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
publisher = {Springer},
related = {statistically-sound-association-discovery},
}
ABSTRACT Exploratory pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard exploratory pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to statistical evaluation on holdout data. They also reveal that modification of the pattern discovery process to anticipate subsequent statistical evaluation can increase the number of patterns that are accepted by statistical evaluation on holdout data.

Huang, S., & Webb, G. I.
LNAI State-of-the-Art Survey series, 'Data Mining: Theory, Methodology, Techniques, and Applications', Berlin/Heidelberg, pp. 64-77, 2006.
[Bibtex] [Abstract]  → Access on publisher site

@InProceedings{HuangWebb05b,
author = {Huang, S. and Webb, G. I.},
booktitle = {LNAI State-of-the-Art Survey series, 'Data Mining: Theory, Methodology, Techniques, and Applications'},
title = {Efficiently Identifying Exploratory Rules' Significance},
year = {2006},
note = {An earlier version of this paper was published in S.J. Simoff and G.J. Williams (Eds.), Proceedings of the Third Australasian Data Mining Conference (AusDM04) Cairns, Australia. Sydney: University of Technology, pages 169-182.},
pages = {64-77},
publisher = {Springer},
abstract = {How to efficiently discard potentially uninteresting rules in exploratory rule discovery is one of the important research foci in data mining. Many researchers have presented algorithms to automatically remove potentially uninteresting rules utilizing background knowledge and user-specified constraints. Identifying the significance of exploratory rules using a significance test is desirable for removing rules that may appear interesting by chance, hence providing the users with a more compact set of resulting rules. However, applying statistical tests to identify significant rules requires considerable computation and data access in order to obtain the necessary statistics. The situation gets worse as the size of the database increases. In this paper, we propose two approaches for improving the efficiency of significant exploratory rule discovery. We also evaluate the experimental effect in impact rule discovery which is suitable for discovering exploratory rules in very large, dense databases.},
doi = {10.1007/11677437_6},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
related = {statistically-sound-association-discovery},
}
ABSTRACT How to efficiently discard potentially uninteresting rules in exploratory rule discovery is one of the important research foci in data mining. Many researchers have presented algorithms to automatically remove potentially uninteresting rules utilizing background knowledge and user-specified constraints. Identifying the significance of exploratory rules using a significance test is desirable for removing rules that may appear interesting by chance, hence providing the users with a more compact set of resulting rules. However, applying statistical tests to identify significant rules requires considerable computation and data access in order to obtain the necessary statistics. The situation gets worse as the size of the database increases. In this paper, we propose two approaches for improving the efficiency of significant exploratory rule discovery. We also evaluate the experimental effect in impact rule discovery which is suitable for discovering exploratory rules in very large, dense databases.

Webb, G. I.
Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2006), New York, pp. 434-443, 2006.

@InProceedings{Webb06a,
author = {Webb, G. I.},
booktitle = {Proceedings of the Twelfth {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2006)},
title = {Discovering Significant Rules},
year = {2006},
editor = {Ungar, L. and Craven, M. and Gunopulos, D. and Eliassi-Rad, T.},
pages = {434 - 443},
publisher = {The Association for Computing Machinery},
abstract = {In many applications, association rules will only be interesting if they represent non-trivial correlations between all constituent items. Numerous techniques have been developed that seek to avoid false discoveries. However, while all provide useful solutions to aspects of this problem, none provides a generic solution that is both flexible enough to accommodate varying definitions of true and false discoveries and powerful enough to provide strict control over the risk of false discoveries. This paper presents generic techniques that allow definitions of true and false discoveries to be specified in terms of arbitrary statistical hypothesis tests and which provide strict control over the experimentwise risk of false discoveries.},
keywords = {OPUS and Association Rule Discovery and statistically sound discovery},
related = {statistically-sound-association-discovery},
url = {http://dl.acm.org/authorize?N00546},
}
ABSTRACT In many applications, association rules will only be interesting if they represent non-trivial correlations between all constituent items. Numerous techniques have been developed that seek to avoid false discoveries. However, while all provide useful solutions to aspects of this problem, none provides a generic solution that is both flexible enough to accommodate varying definitions of true and false discoveries and powerful enough to provide strict control over the risk of false discoveries. This paper presents generic techniques that allow definitions of true and false discoveries to be specified in terms of arbitrary statistical hypothesis tests and which provide strict control over the experimentwise risk of false discoveries.

Webb, G. I., & Zhang, S.
Data Mining and Knowledge Discovery, 10(1), 39-79, 2005.

@Article{WebbZhang05,
Title = {k-Optimal-Rule-Discovery},
Author = {Webb, G. I. and Zhang, S.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2005},
Number = {1},
Pages = {39-79},
Volume = {10},
Abstract = {K-most-interesting rule discovery finds the k rules that optimize a user-specified measure of interestingness with respect to a set of sample data and user-specified constraints. This approach avoids many limitations of the frequent itemset approach of association rule discovery. This paper presents a scalable algorithm applicable to a wide range of k-most-interesting rule discovery tasks and demonstrates its efficiency.},
Doi = {10.1007/s10618-005-0255-4},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {Springer},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT K-most-interesting rule discovery finds the k rules that optimize a user-specified measure of interestingness with respect to a set of sample data and user-specified constraints. This approach avoids many limitations of the frequent itemset approach of association rule discovery. This paper presents a scalable algorithm applicable to a wide range of k-most-interesting rule discovery tasks and demonstrates its efficiency.

Huang, S., & Webb, G. I.
Lecture Notes in Computer Science Vol. 3518: Proceedings of the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2005), Berlin/Heidelberg, pp. 71-80, 2005.

@InProceedings{HuangWebb05a,
author = {Huang, S. and Webb, G. I.},
booktitle = {Lecture Notes in Computer Science Vol. 3518: Proceedings of the 9th {Pacific}-{Asia} Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2005)},
title = {Pruning Derivative Partial Rules During Impact Rule Discovery},
year = {2005},
editor = {Ho, T.B. and Cheung, D. and Liu, H.},
pages = {71-80},
publisher = {Springer},
abstract = {Because exploratory rule discovery works with data that is only a sample of the phenomena to be investigated, some resulting rules may appear interesting only by chance. Techniques are developed for automatically discarding statistically insignificant exploratory rules that cannot survive a hypothesis with regard to its ancestors. We call such insignificant rules derivative extended rules. In this paper, we argue that there is another type of derivative exploratory rules, which is derivative with regard to their children. We also argue that considerable amount of such derivative partial rules can not be successfully removed using existing rule pruning techniques. We propose a new technique to address this problem. Experiments are done in impact rule discovery to evaluate the effect of this derivative partial rule filter. Results show that the inherent problem of too many resulting rules in exploratory rule discovery is alleviated.},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
location = {Hanoi, Vietnam},
related = {impact-rules},
}
ABSTRACT Because exploratory rule discovery works with data that is only a sample of the phenomena to be investigated, some resulting rules may appear interesting only by chance. Techniques are developed for automatically discarding statistically insignificant exploratory rules that cannot survive a hypothesis with regard to its ancestors. We call such insignificant rules derivative extended rules. In this paper, we argue that there is another type of derivative exploratory rules, which is derivative with regard to their children. We also argue that considerable amount of such derivative partial rules can not be successfully removed using existing rule pruning techniques. We propose a new technique to address this problem. Experiments are done in impact rule discovery to evaluate the effect of this derivative partial rule filter. Results show that the inherent problem of too many resulting rules in exploratory rule discovery is alleviated.

Huang, S., & Webb, G. I.
Proceedings of the Fifth SIAM International Conference on Data Mining (SDM'05) [short paper], Philadelphia, PA, pp. 541-545, 2005.

@InProceedings{HuangWebb05,
author = {Huang, S. and Webb, G. I.},
booktitle = {Proceedings of the Fifth {SIAM} International Conference on Data Mining ({SDM}'05) [short paper]},
title = {Discarding Insignificant Rules During Impact Rule Discovery in Large, Dense Databases},
year = {2005},
editor = {Kargupta, H. and Kamath, C. and Srivastava, J. and Goodman, A.},
pages = {541-545},
publisher = {Society for Industrial and Applied Mathematics},
abstract = {Considerable progress has been made on how to reduce the number of spurious exploratory rules with quantitative attributes. However, little has been done for rules with undiscretized quantitative attributes. It is argued that propositional rules can not effectively describe the interactions between quantitative and qualitative attributes. Aumann and Lindell proposed quantitative association rules to provide a better description of such relationship, together with a rule pruning techniques . Since their technique is based on the frequent itemset framework, it is not suitable for rule discovery in large, dense databases. In this paper, an efficient technique for automatically discarding insignificant rules during rule discovery is proposed, based on the OPUS search algorithm. Experiments demonstrate that the algorithm we propose can efficiently remove potentially uninteresting rules even in very large, dense databases.},
audit-trail = {Shiying travelling to present paper. Requested permission to post pdf 10/2},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
location = {Newport Beach, CA},
related = {impact-rules},
}
ABSTRACT Considerable progress has been made on how to reduce the number of spurious exploratory rules with quantitative attributes. However, little has been done for rules with undiscretized quantitative attributes. It is argued that propositional rules can not effectively describe the interactions between quantitative and qualitative attributes. Aumann and Lindell proposed quantitative association rules to provide a better description of such relationship, together with a rule pruning techniques . Since their technique is based on the frequent itemset framework, it is not suitable for rule discovery in large, dense databases. In this paper, an efficient technique for automatically discarding insignificant rules during rule discovery is proposed, based on the OPUS search algorithm. Experiments demonstrate that the algorithm we propose can efficiently remove potentially uninteresting rules even in very large, dense databases.

Thiruvady, D. R., & Webb, G. I.
Lecture Notes in Computer Science Vol. 3056: Proceedings of the Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 04) [Short Paper], Berlin/Heidelberg, pp. 161-165, 2004.

@InProceedings{ThiruvadyWebb04,
Title = {Mining Negative Rules using GRD},
Author = {Thiruvady, D. R. and Webb, G. I.},
Booktitle = {Lecture Notes in Computer Science Vol. 3056: Proceedings of the Eighth {Pacific}-{Asia} Conference on Knowledge Discovery and Data Mining (PAKDD 04) [Short Paper]},
Year = {2004},
Editor = {Dai, H. and Srikant, R. and Zhang, C.},
Pages = {161-165},
Publisher = {Springer},
Abstract = {GRD is an algorithm for k-most interesting rule discovery. In contrast to association rule discovery, GRD does not require the use of a minimum support constraint. Rather, the user must specify a measure of interestingness and the number of rules sought (k). This paper reports efficient techniques to extend GRD to support mining of negative rules. We demonstrate that the new approach provides tractable discovery of both negative and positive rules.},
Audit-trail = {PDF posted 23/8},
Keywords = {association rule discovery and OPUS},
Location = {Sydney, Australia},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT GRD is an algorithm for k-most interesting rule discovery. In contrast to association rule discovery, GRD does not require the use of a minimum support constraint. Rather, the user must specify a measure of interestingness and the number of rules sought (k). This paper reports efficient techniques to extend GRD to support mining of negative rules. We demonstrate that the new approach provides tractable discovery of both negative and positive rules.

Webb, G. I.
Proceedings of the Second Australasian Data Mining Conference (AusDM03), Sydney, pp. 1-9, 2003.

@InProceedings{Webb03a,
author = {Webb, G. I.},
booktitle = {Proceedings of the Second Australasian Data Mining Conference (AusDM03)},
title = {Preliminary Investigations into Statistically Valid Exploratory Rule Discovery},
year = {2003},
editor = {Simoff, S.J. and Williams, G.J. and Hegland, M.},
pages = {1-9},
publisher = {University of Technology},
abstract = {Exploratory rule discovery, as exemplified by association rule discovery, has proven very popular. In this paper I investigate issues surrounding the statistical validity of rules found using this approach and methods that might be employed to deliver statistically sound exploratory rule discovery.},
audit-trail = {Submitted to AusDM03. No copyright required. Check key words},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
location = {Canberra, Australia},
related = {statistically-sound-association-discovery},
}
ABSTRACT Exploratory rule discovery, as exemplified by association rule discovery, has proven very popular. In this paper I investigate issues surrounding the statistical validity of rules found using this approach and methods that might be employed to deliver statistically sound exploratory rule discovery.

Webb, G. I., Butler, S., & Newlands, D.
Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), New York, pp. 256-265, 2003.

@InProceedings{WebbButlerNewlands03,
Title = {On Detecting Differences Between Groups},
Author = {Webb, G. I. and Butler, S. and Newlands, D.},
Booktitle = {Proceedings of The Ninth {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2003)},
Year = {2003},
Editor = {Domingos, P. and Faloutsos, C. and Senator, T. and Kargupta, H. and Getoor, L.},
Pages = {256-265},
Publisher = {The Association for Computing Machinery},
Abstract = {Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special purpose data mining technique, {\em contrast-set mining}. We undertook a study with a retail collaborator to compare contrast-set mining with existing rule-discovery techniques. To our surprise we observed that straightforward application of an existing commercial rule-discovery system, Magnum Opus, could successfully perform the contrast-set-mining task. This led to the realization that contrast-set mining is a special case of the more general rule-discovery task. We present the results of our study together with a proof of this conclusion},
Audit-trail = {PDF with ACM copyright posted in accordance with conditions of copyright},
Keywords = {OPUS and Association Rule Discovery},
Location = {Washington, DC},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special purpose data mining technique, {\em contrast-set mining}. We undertook a study with a retail collaborator to compare contrast-set mining with existing rule-discovery techniques. To our surprise we observed that straightforward application of an existing commercial rule-discovery system, Magnum Opus, could successfully perform the contrast-set-mining task. This led to the realization that contrast-set mining is a special case of the more general rule-discovery task. We present the results of our study together with a proof of this conclusion

Webb, G. I., & Zhang, S.
Proceedings of the First International NAISO Congress on Autonomous Intelligent Systems (ICAIS 2002), Canada/The Netherlands, 2002.

@InProceedings{WebbZhang02,
Title = {Removing Trivial Associations in Association Rule Discovery},
Author = {Webb, G. I. and Zhang, S.},
Booktitle = {Proceedings of the First International NAISO Congress on Autonomous Intelligent Systems (ICAIS 2002)},
Year = {2002},
Abstract = {Association rule discovery has become one of the most widely applied data mining strategies. Techniques for association rule discovery have been dominated by the frequent itemset strategy as exemplified by the Apriori algorithm. One limitation of this approach is that it provides little opportunity to detect and remove association rules on the basis of relationships between rules. As a result, the association rules discovered are frequently swamped with large numbers of spurious rules that are of little interest to the user. This paper presents association rule discovery techniques that can detect and discard one form of spurious association rule: trivial associations.},
Audit-trail = {Pre-publication PDF posted},
Keywords = {OPUS and Association Rule Discovery},
Location = {Geelong, Australia},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT Association rule discovery has become one of the most widely applied data mining strategies. Techniques for association rule discovery have been dominated by the frequent itemset strategy as exemplified by the Apriori algorithm. One limitation of this approach is that it provides little opportunity to detect and remove association rules on the basis of relationships between rules. As a result, the association rules discovered are frequently swamped with large numbers of spurious rules that are of little interest to the user. This paper presents association rule discovery techniques that can detect and discard one form of spurious association rule: trivial associations.

Webb, G. I.
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001)[short paper], New York, pp. 383-388, 2001.

@InProceedings{Webb01a,
author = {Webb, G. I.},
booktitle = {Proceedings of the Seventh {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2001)[short paper]},
title = {Discovering Associations with Numeric Variables},
year = {2001},
editor = {Provost, F. and Srikant, R.},
pages = {383-388},
publisher = {The Association for Computing Machinery},
abstract = {This paper further develops Aumann and Lindell's [3] proposal for a variant of association rules for which the consequent is a numeric variable. It is argued that these rules can discover useful interactions with numeric data that cannot be discovered directly using traditional association rules with discretization. Alternative measures for identifying interesting rules are proposed. Efficient algorithms are presented that enable these rules to be discovered for dense data sets for which application of Auman and Lindell's algorithm is infeasible.},
audit-trail = {*},
keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
location = {San Francisco, CA},
related = {impact-rules},
url = {http://dl.acm.org/authorize?19861},
}
ABSTRACT This paper further develops Aumann and Lindell's [3] proposal for a variant of association rules for which the consequent is a numeric variable. It is argued that these rules can discover useful interactions with numeric data that cannot be discovered directly using traditional association rules with discretization. Alternative measures for identifying interesting rules are proposed. Efficient algorithms are presented that enable these rules to be discovered for dense data sets for which application of Auman and Lindell's algorithm is infeasible.

Webb, G. I., & Zhang, S.
Lecture Notes in Computer Science Vol. 2256: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI'01), Berlin, pp. 605-618, 2001.

@InProceedings{WebbZhang01,
author = {Webb, G. I. and Zhang, S.},
booktitle = {Lecture Notes in Computer Science Vol. 2256: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI'01)},
title = {Further Pruning for Efficient Association Rule Discovery},
year = {2001},
editor = {Stumptner, M. and Corbett, D. and Brooks, M.J.},
pages = {605-618},
publisher = {Springer},
abstract = {The Apriori algorithm's frequent itemset approach has become the standard approach to discovering association rules. However, the computation requirements of the frequent itemset approach are infeasible for dense data and the approach is unable to discover infrequent associations. OPUS\_AR is an efficient algorithm for rule discovery that does not utilize frequent itemsets and hence avoids these problems. It can reduce search time by using additional constraints on the search space as well as constraints on itemset frequency. However, the effectiveness of the pruning rules used during search will determine the efficiency of its search. This paper presents and analyzes pruning rules for use with OPUS\_AR. We demonstrate that application of OPUS\_AR is feasible for a number of datasets for which application of the frequent itemset approach is infeasible and that the new pruning rules can reduce compute time by more than 40%.},
audit-trail = {*},
keywords = {OPUS and Association Rule Discovery},
related = {filtered-top-k-association-discovery},
}
ABSTRACT The Apriori algorithm's frequent itemset approach has become the standard approach to discovering association rules. However, the computation requirements of the frequent itemset approach are infeasible for dense data and the approach is unable to discover infrequent associations. OPUS_AR is an efficient algorithm for rule discovery that does not utilize frequent itemsets and hence avoids these problems. It can reduce search time by using additional constraints on the search space as well as constraints on itemset frequency. However, the effectiveness of the pruning rules used during search will determine the efficiency of its search. This paper presents and analyzes pruning rules for use with OPUS_AR. We demonstrate that application of OPUS_AR is feasible for a number of datasets for which application of the frequent itemset approach is infeasible and that the new pruning rules can reduce compute time by more than 40%.

Webb, G. I.
Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), New York, pp. 99-107, 2000.

@InProceedings{Webb00b,
Title = {Efficient Search for Association Rules},
Author = {Webb, G. I.},
Booktitle = {Proceedings of the Sixth {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2000)},
Year = {2000},
Editor = {Ramakrishnan, R. and Stolfo, S.},
Pages = {99-107},
Publisher = {The Association for Computing Machinery},
Abstract = {This paper argues that for some applications direct search for association rules can be more efficient than the two stage process of the Apriori algorithm which first finds large item sets which are then used to identify associations. In particular, it is argued, Apriori can impose large computational overheads when the number of frequent itemsets is very large. This will often be the case when association rule analysis is performed on domains other than basket analysis or when it is performed for basket analysis with basket information augmented by other customer information. An algorithm is presented that is computationally efficient for association rule analysis during which the number of rules to be found can be constrained and all data can be maintained in memory.},
Audit-trail = {*},
Keywords = {Search and OPUS and Association Rule Discovery},
Location = {Boston, MA},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT This paper argues that for some applications direct search for association rules can be more efficient than the two stage process of the Apriori algorithm which first finds large item sets which are then used to identify associations. In particular, it is argued, Apriori can impose large computational overheads when the number of frequent itemsets is very large. This will often be the case when association rule analysis is performed on domains other than basket analysis or when it is performed for basket analysis with basket information augmented by other customer information. An algorithm is presented that is computationally efficient for association rule analysis during which the number of rules to be found can be constrained and all data can be maintained in memory.

Webb, G. I.
Australian Computer Science Communications Vol. 18 (1): Proceedings of the Nineteenth Australasian Computer Science Conference (ACSC'96), Melbourne, pp. 1-10, 1996.

@InProceedings{Webb96e,
Title = {Inclusive Pruning: A New Class of Pruning Rule for Unordered Search and its Application to Classification Learning},
Author = {Webb, G. I.},
Booktitle = {Australian Computer Science Communications Vol. 18 (1): Proceedings of the Nineteenth Australasian Computer Science Conference (ACSC'96)},
Year = {1996},
Editor = {Ramamohanarao, K.},
Pages = {1-10},
Publisher = {ACS},
Abstract = {This paper presents a new class of pruning rule for unordered search. Previous pruning rules for unordered search identify operators that should not be applied in order to prune nodes reached via those operators. In contrast, the new pruning rules identify operators that should be applied and prune nodes that are not reached via those operators. Specific pruning rules employing both these approaches are identified for classification learning. Experimental results demonstrate that application of the new pruning rules can reduce by more than 60% the number of states from the search space that are considered during classification learning.},
Audit-trail = {*},
Keywords = {Search and Rule Learning and OPUS and Association Rule Discovery},
Location = {Royal Melbourne Insitute of Technology, Australia},
Related = {opus-search}
}
ABSTRACT This paper presents a new class of pruning rule for unordered search. Previous pruning rules for unordered search identify operators that should not be applied in order to prune nodes reached via those operators. In contrast, the new pruning rules identify operators that should be applied and prune nodes that are not reached via those operators. Specific pruning rules employing both these approaches are identified for classification learning. Experimental results demonstrate that application of the new pruning rules can reduce by more than 60% the number of states from the search space that are considered during classification learning.

Webb, G. I.
Journal of Artificial Intelligence Research, 3, 431-465, 1995.
[Bibtex] [Abstract]  → Access on publisher site

@Article{Webb95,
Title = {OPUS: An Efficient Admissible Algorithm For Unordered Search},
Author = {Webb, G. I.},
Journal = {Journal of Artificial Intelligence Research},
Year = {1995},
Pages = {431-465},
Volume = {3},
Abstract = {OPUS is a branch and bound search algorithm that enables efficient admissible search through spaces for which the order of search operator application is not significant. The algorithm's search efficiency is demonstrated with respect to very large machine learning search spaces. The use of admissible search is of potential value to the machine learning community as it means that the exact learning biases to be employed for complex learning tasks can be precisely specified and manipulated. OPUS also has potential for application in other areas of artificial intelligence, notably, truth maintenance.},
Audit-trail = {Link to paper via JAIR website},
Keywords = {Search and Rule Learning and OPUS and Association Rule Discovery},
Publisher = {AAAI Press},
Related = {opus-search},
Url = {http://dx.doi.org/10.1613/jair.227}
}
ABSTRACT OPUS is a branch and bound search algorithm that enables efficient admissible search through spaces for which the order of search operator application is not significant. The algorithm's search efficiency is demonstrated with respect to very large machine learning search spaces. The use of admissible search is of potential value to the machine learning community as it means that the exact learning biases to be employed for complex learning tasks can be precisely specified and manipulated. OPUS also has potential for application in other areas of artificial intelligence, notably, truth maintenance.

Webb, G. I.
Proceedings of the Sixth Australian Joint Conference on Artificial Intelligence (AI'93), Singapore, pp. 342-347, 1993.

@InProceedings{Webb93a,
Title = {Systematic Search for Categorical Attribute-Value Data-Driven Machine Learning},
Author = {Webb, G. I.},
Booktitle = {Proceedings of the Sixth Australian Joint Conference on Artificial Intelligence (AI'93)},
Year = {1993},
}
ABSTRACT Optimal Pruning for Unordered Search is a search algorithm that enables complete search through the space of possible disjuncts at the inner level of a covering algorithm. This algorithm takes as inputs an evaluation function, e, a training set, t, and a set of specialisation operators, o. It outputs a set of operators from o that creates a classifier that maximises e with respect to t. While OPUS has exponential worst case time complexity, the algorithm is demonstrated to reach solutions for complex real world domains within reasonable time frames. Indeed, for some domains, the algorithm exhibits greater computational efficiency than common heuristic search algorithms.