Data Scientist


OPUS Search

OPUS is an efficient search algorithm for exploring the space of conjunctive patterns. It supports extremely fast rule discovery.

The OPUSMiner pattern discovery software can be downloaded here.

Publications

Specious rules: an efficient and effective unifying method for removing misleading and uninformative patterns in association rule mining.
Hamalainen, W., & Webb, G. I.
Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 309-317, 2017.
[PDF] [Bibtex]

@InProceedings{HamalainenWebb17,
Title = {Specious rules: an efficient and effective unifying method for removing misleading and uninformative patterns in association rule mining},
Author = {Hamalainen, Wilhelmiina and Webb, Geoffrey I},
Booktitle = {Proceedings of the 2017 SIAM International Conference on Data Mining},
Year = {2017},
Organization = {SIAM},
Pages = {309-317},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Related = {statistically-sound-association-discovery}
}
ABSTRACT 

Skopus: Mining top-k sequential patterns under leverage.
Petitjean, F., Li, T., Tatti, N., & Webb, G. I.
Data Mining and Knowledge Discovery, 30(5), 1086-1111, 2016.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{PetitjeanEtAl16b,
Title = {Skopus: Mining top-k sequential patterns under leverage},
Author = {Petitjean, Francois
and Li, Tao
and Tatti, Nikolaj
and Webb, Geoffrey I.},
Journal = {Data Mining and Knowledge Discovery},
Year = {2016},
Number = {5},
Pages = {1086-1111},
Volume = {30},
Abstract = {This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern---a concept on which most interestingness measures directly rely---with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.},
Doi = {10.1007/s10618-016-0467-9},
ISSN = {1573-756X},
Keywords = {OPUS and Association Rule Discovery and statistically sound discovery},
Related = {statistically-sound-association-discovery},
Url = {http://rdcu.be/tsDo}
}
ABSTRACT This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern---a concept on which most interestingness measures directly rely---with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.

Efficient Discovery of the Most Interesting Associations.
Webb, G. I., & Vreeken, J.
ACM Transactions on Knowledge Discovery from Data, 8(3), Art. no. 15, 2014.
[URL] [Bibtex] [Abstract]

@Article{WebbVreeken13,
Title = {Efficient Discovery of the Most Interesting Associations},
Author = {G.I. Webb and J. Vreeken},
Journal = {{ACM} Transactions on Knowledge Discovery from Data},
Year = {2014},
Number = {3},
Volume = {8},
Abstract = {Self-sufficient itemsets have been proposed as an effective approach to summarizing the key associations
in data. However, their computation appears highly demanding, as assessing whether an itemset is selfsufficient
requires consideration of all pairwise partitions of the itemset into pairs of subsets as well as
consideration of all supersets. This paper presents the first published algorithm for efficiently discovering
self-sufficient itemsets. This branch-and-bound algorithm deploys two powerful pruning mechanisms
based on upper-bounds on itemset value and statistical significance level. It demonstrates that finding top-k
productive and non-redundant itemsets, with post processing to identify those that are not independently
productive, can efficiently identify small sets of key associations. We present extensive evaluation of the
strengths and limitations of the technique including comparisons with alternative approaches to finding the
most interesting associations.},
Articlenumber = {15},
Direct-url = {http://dx.doi.org/10.1145/2601433},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {ACM},
Related = {filtered-top-k-association-discovery},
Url = {http://dl.acm.org/authorize?N80829}
}
ABSTRACT Self-sufficient itemsets have been proposed as an effective approach to summarizing the key associations in data. However, their computation appears highly demanding, as assessing whether an itemset is selfsufficient requires consideration of all pairwise partitions of the itemset into pairs of subsets as well as consideration of all supersets. This paper presents the first published algorithm for efficiently discovering self-sufficient itemsets. This branch-and-bound algorithm deploys two powerful pruning mechanisms based on upper-bounds on itemset value and statistical significance level. It demonstrates that finding top-k productive and non-redundant itemsets, with post processing to identify those that are not independently productive, can efficiently identify small sets of key associations. We present extensive evaluation of the strengths and limitations of the technique including comparisons with alternative approaches to finding the most interesting associations.

Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items.
Webb, G. I.
ACM Transactions on Knowledge Discovery from Data, 4, Art. no. 3, 2010.
[URL] [Bibtex] [Abstract]

@Article{Webb10,
Title = {Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations Between Items},
Author = {G.I. Webb},
Journal = {{ACM} Transactions on Knowledge Discovery from Data},
Year = {2010},
Volume = {4},
Abstract = {Self-sufficient itemsets are those whose frequency cannot explained solely by the frequency of either their subsets or of their
supersets. We argue that itemsets that are not
self-sufficient will often be of little interest to the data
analyst, as their frequency should be expected once that of the
itemsets on which their frequency depends is known. We present
statistical tests for statistically sound discovery of
self-sufficient itemsets, and computational techniques that allow
those tests to be applied as a post-processing step for any itemset
discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.},
Articlenumber = {3},
Issue = {1},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {ACM},
Related = {filtered-top-k-association-discovery},
Url = {http://dl.acm.org/authorize?270473}
}
ABSTRACT Self-sufficient itemsets are those whose frequency cannot explained solely by the frequency of either their subsets or of their supersets. We argue that itemsets that are not self-sufficient will often be of little interest to the data analyst, as their frequency should be expected once that of the itemsets on which their frequency depends is known. We present statistical tests for statistically sound discovery of self-sufficient itemsets, and computational techniques that allow those tests to be applied as a post-processing step for any itemset discovery algorithm. We also present a measure for assessing the degree of potential interest in an itemset that complements these statistical measures.

Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining.
Novak, P., Lavrac, N., & Webb, G. I.
Journal of Machine Learning Research, 10, 377-403, 2009.
[URL] [Bibtex] [Abstract]

@Article{NovakLavracWebb09,
Title = {Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining},
Author = {P. Novak and N. Lavrac and G.I. Webb},
Journal = {Journal of Machine Learning Research},
Year = {2009},
Pages = {377-403},
Volume = {10},
Abstract = {This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.},
Keywords = {Association Rule Discovery and OPUS},
Related = {filtered-top-k-association-discovery},
Url = {http://www.jmlr.org/papers/volume10/kralj-novak09a/kralj-novak09a.pdf}
}
ABSTRACT This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches. It also shows that various rule learning heuristics used in CSM, EPM and SD algorithms all aim at optimizing a trade off between rule coverage and precision. The commonalities (and differences) between the approaches are showcased on a selection of best known variants of CSM, EPM and SD algorithms. The paper also provides a critical survey of existing supervised descriptive rule discovery visualization methods.

Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns.
Webb, G. I.
Machine Learning, 71(2-3), 307-323 [Technical Note], 2008.
[DOI] [Bibtex] [Abstract]

@Article{Webb08,
Title = {Layered Critical Values: A Powerful Direct-Adjustment Approach to Discovering Significant Patterns},
Author = {G.I. Webb},
Journal = {Machine Learning},
Year = {2008},
Number = {2-3},
Pages = {307-323 [Technical Note]},
Volume = {71},
Abstract = {Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.},
Address = {Netherlands},
Audit-trail = {DOI 10.1007/s10994-008-5046-x},
Doi = {10.1007/s10994-008-5046-x},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Notes = {Technical Note},
Publisher = {Springer},
Related = {statistically-sound-association-discovery}
}
ABSTRACT Standard pattern discovery techniques, such as association rules, suffer an extreme risk of finding very large numbers of spurious patterns for many knowledge discovery tasks. The direct-adjustment approach to controlling this risk applies a statistical test during the discovery process, using a critical value adjusted to take account of the size of the search space. However, a problem with the direct-adjustment strategy is that it may discard numerous true patterns. This paper investigates the assignment of different critical values to different areas of the search space as an approach to alleviating this problem, using a variant of a technique originally developed for other purposes. This approach is shown to be effective at increasing the number of discoveries while still maintaining strict control over the risk of false discoveries.

Discovering Significant Patterns.
Webb, G. I.
Machine Learning, 68(1), 1-33, 2007.
[DOI] [Bibtex] [Abstract]

@Article{Webb07,
Title = {Discovering Significant Patterns},
Author = {G.I. Webb},
Journal = {Machine Learning},
Year = {2007},
Number = {1},
Pages = {1-33},
Volume = {68},
Abstract = {Exploratory pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard exploratory pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to statistical evaluation on holdout data. They also reveal that modification of the pattern discovery process to anticipate subsequent statistical evaluation can increase the number of patterns that are accepted by statistical evaluation on holdout data.},
Address = {Netherlands},
Audit-trail = {subject to revisions},
Doi = {10.1007/s10994-007-5006-x},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {Springer},
Related = {statistically-sound-association-discovery}
}
ABSTRACT Exploratory pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard exploratory pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to statistical evaluation on holdout data. They also reveal that modification of the pattern discovery process to anticipate subsequent statistical evaluation can increase the number of patterns that are accepted by statistical evaluation on holdout data.

Discovering Significant Rules.
Webb, G. I.
Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2006), New York, pp. 434-443, 2006.
[PDF] [URL] [Bibtex] [Abstract]

@InProceedings{Webb06a,
Title = {Discovering Significant Rules},
Author = {G.I. Webb},
Booktitle = {Proceedings of the Twelfth {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2006)},
Year = {2006},
Address = {New York},
Editor = {L. Ungar and M. Craven and D. Gunopulos and T. Eliassi-Rad},
Pages = {434 - 443},
Publisher = {The Association for Computing Machinery},
Abstract = {In many applications, association rules will only be interesting if they represent non-trivial correlations between all constituent items. Numerous techniques have been developed that seek to avoid false discoveries. However, while all provide useful solutions to aspects of this problem, none provides a generic solution that is both flexible enough to accommodate varying definitions of true and false discoveries and powerful enough to provide strict control over the risk of false discoveries. This paper presents generic techniques that allow definitions of true and false discoveries to be specified in terms of arbitrary statistical hypothesis tests and which provide strict control over the experimentwise risk of false discoveries.},
Keywords = {OPUS and Association Rule Discovery and statistically sound discovery},
Location = {Philadelphia, PA},
Related = {statistically-sound-association-discovery},
Url = {http://dl.acm.org/authorize?N00546}
}
ABSTRACT In many applications, association rules will only be interesting if they represent non-trivial correlations between all constituent items. Numerous techniques have been developed that seek to avoid false discoveries. However, while all provide useful solutions to aspects of this problem, none provides a generic solution that is both flexible enough to accommodate varying definitions of true and false discoveries and powerful enough to provide strict control over the risk of false discoveries. This paper presents generic techniques that allow definitions of true and false discoveries to be specified in terms of arbitrary statistical hypothesis tests and which provide strict control over the experimentwise risk of false discoveries.

Efficiently Identifying Exploratory Rules’ Significance.
Huang, S., & Webb, G. I.
LNAI State-of-the-Art Survey series, ‘Data Mining: Theory, Methodology, Techniques, and Applications’, Berlin/Heidelberg, pp. 64-77, 2006.
[DOI] [Bibtex] [Abstract]

@InProceedings{HuangWebb05b,
Title = {Efficiently Identifying Exploratory Rules' Significance},
Author = {S. Huang and G.I. Webb},
Booktitle = {LNAI State-of-the-Art Survey series, 'Data Mining: Theory, Methodology, Techniques, and Applications'},
Year = {2006},
Address = {Berlin/Heidelberg},
Note = {An earlier version of this paper was published in S.J. Simoff and G.J. Williams (Eds.), Proceedings of the Third Australasian Data Mining Conference (AusDM04) Cairns, Australia. Sydney: University of Technology, pages 169-182.},
Pages = {64-77},
Publisher = {Springer},
Abstract = {How to efficiently discard potentially uninteresting rules in exploratory rule discovery is one of the important research foci in data mining. Many researchers have presented algorithms to automatically remove potentially uninteresting rules utilizing background knowledge and user-specified constraints. Identifying the significance of exploratory rules using a significance test is desirable for removing rules that may appear interesting by chance, hence providing the users with a more compact set of resulting rules. However, applying statistical tests to identify significant rules requires considerable computation and data access in order to obtain the necessary statistics. The situation gets worse as the size of the database increases. In this paper, we propose two approaches for improving the efficiency of significant exploratory rule discovery. We also evaluate the experimental effect in impact rule discovery which is suitable for discovering exploratory rules in very large, dense databases.},
Doi = {10.1007/11677437_6},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
Related = {statistically-sound-association-discovery}
}
ABSTRACT How to efficiently discard potentially uninteresting rules in exploratory rule discovery is one of the important research foci in data mining. Many researchers have presented algorithms to automatically remove potentially uninteresting rules utilizing background knowledge and user-specified constraints. Identifying the significance of exploratory rules using a significance test is desirable for removing rules that may appear interesting by chance, hence providing the users with a more compact set of resulting rules. However, applying statistical tests to identify significant rules requires considerable computation and data access in order to obtain the necessary statistics. The situation gets worse as the size of the database increases. In this paper, we propose two approaches for improving the efficiency of significant exploratory rule discovery. We also evaluate the experimental effect in impact rule discovery which is suitable for discovering exploratory rules in very large, dense databases.

Pruning Derivative Partial Rules During Impact Rule Discovery.
Huang, S., & Webb, G. I.
Lecture Notes in Computer Science Vol. 3518: Proceedings of the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2005), Berlin/Heidelberg, pp. 71-80, 2005.
[PDF] [Bibtex] [Abstract]

@InProceedings{HuangWebb05a,
Title = {Pruning Derivative Partial Rules During Impact Rule Discovery},
Author = {S. Huang and G.I. Webb},
Booktitle = {Lecture Notes in Computer Science Vol. 3518: Proceedings of the 9th {Pacific}-{Asia} Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2005)},
Year = {2005},
Address = {Berlin/Heidelberg},
Editor = {T.B. Ho and D. Cheung and H. Liu },
Pages = {71-80},
Publisher = {Springer},
Abstract = {Because exploratory rule discovery works with data that is only a sample of the phenomena to be investigated, some resulting rules may appear interesting only by chance. Techniques are developed for automatically discarding statistically insignificant exploratory rules that cannot survive a hypothesis with regard to its ancestors. We call such insignificant rules derivative extended rules. In this paper, we argue that there is another type of derivative exploratory rules, which is derivative with regard to their children. We also argue that considerable amount of such derivative partial rules can not be successfully removed using existing rule pruning techniques. We propose a new technique to address this problem. Experiments are done in impact rule discovery to evaluate the effect of this derivative partial rule filter. Results show that the inherent problem of too many resulting rules in exploratory rule discovery is alleviated.},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
Location = {Hanoi, Vietnam},
Related = {impact-rules}
}
ABSTRACT Because exploratory rule discovery works with data that is only a sample of the phenomena to be investigated, some resulting rules may appear interesting only by chance. Techniques are developed for automatically discarding statistically insignificant exploratory rules that cannot survive a hypothesis with regard to its ancestors. We call such insignificant rules derivative extended rules. In this paper, we argue that there is another type of derivative exploratory rules, which is derivative with regard to their children. We also argue that considerable amount of such derivative partial rules can not be successfully removed using existing rule pruning techniques. We propose a new technique to address this problem. Experiments are done in impact rule discovery to evaluate the effect of this derivative partial rule filter. Results show that the inherent problem of too many resulting rules in exploratory rule discovery is alleviated.

Discarding Insignificant Rules During Impact Rule Discovery in Large, Dense Databases.
Huang, S., & Webb, G. I.
Proceedings of the Fifth SIAM International Conference on Data Mining (SDM’05) [short paper], Philadelphia, PA, pp. 541-545, 2005.
[PDF] [Bibtex] [Abstract]

@InProceedings{HuangWebb05,
Title = {Discarding Insignificant Rules During Impact Rule Discovery in Large, Dense Databases},
Author = {S. Huang and G.I. Webb},
Booktitle = {Proceedings of the Fifth {SIAM} International Conference on Data Mining ({SDM}'05) [short paper]},
Year = {2005},
Address = {Philadelphia, PA},
Editor = {H. Kargupta and C. Kamath and J. Srivastava and A. Goodman},
Pages = {541-545},
Publisher = {Society for Industrial and Applied Mathematics},
Abstract = {Considerable progress has been made on how to reduce the number of spurious exploratory rules with quantitative attributes. However, little has been done for rules with undiscretized quantitative attributes. It is argued that propositional rules can not effectively describe the interactions between quantitative and qualitative attributes. Aumann and Lindell proposed quantitative association rules to provide a better description of such relationship, together with a rule pruning techniques . Since their technique is based on the frequent itemset framework, it is not suitable for rule discovery in large, dense databases. In this paper, an efficient technique for automatically discarding insignificant rules during rule discovery is proposed, based on the OPUS search algorithm. Experiments demonstrate that the algorithm we propose can efficiently remove potentially uninteresting rules even in very large, dense databases.},
Audit-trail = {Shiying travelling to present paper. Requested permission to post pdf 10/2},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS and Impact Rules},
Location = {Newport Beach, CA},
Related = {impact-rules}
}
ABSTRACT Considerable progress has been made on how to reduce the number of spurious exploratory rules with quantitative attributes. However, little has been done for rules with undiscretized quantitative attributes. It is argued that propositional rules can not effectively describe the interactions between quantitative and qualitative attributes. Aumann and Lindell proposed quantitative association rules to provide a better description of such relationship, together with a rule pruning techniques . Since their technique is based on the frequent itemset framework, it is not suitable for rule discovery in large, dense databases. In this paper, an efficient technique for automatically discarding insignificant rules during rule discovery is proposed, based on the OPUS search algorithm. Experiments demonstrate that the algorithm we propose can efficiently remove potentially uninteresting rules even in very large, dense databases.

k-Optimal-Rule-Discovery.
Webb, G. I., & Zhang, S.
Data Mining and Knowledge Discovery, 10(1), 39-79, 2005.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{WebbZhang05,
Title = {k-Optimal-Rule-Discovery},
Author = {G. I. Webb and S. Zhang},
Journal = {Data Mining and Knowledge Discovery},
Year = {2005},
Number = {1},
Pages = {39-79},
Volume = {10},
Abstract = {K-most-interesting rule discovery finds the k rules that optimize a user-specified measure of interestingness with respect to a set of sample data and user-specified constraints. This approach avoids many limitations of the frequent itemset approach of association rule discovery. This paper presents a scalable algorithm applicable to a wide range of k-most-interesting rule discovery tasks and demonstrates its efficiency.},
Address = {Netherlands},
Doi = {10.1007/s10618-005-0255-4},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Publisher = {Springer},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT K-most-interesting rule discovery finds the k rules that optimize a user-specified measure of interestingness with respect to a set of sample data and user-specified constraints. This approach avoids many limitations of the frequent itemset approach of association rule discovery. This paper presents a scalable algorithm applicable to a wide range of k-most-interesting rule discovery tasks and demonstrates its efficiency.

Preliminary Investigations into Statistically Valid Exploratory Rule Discovery.
Webb, G. I.
Proceedings of the Second Australasian Data Mining Conference (AusDM03), Sydney, pp. 1-9, 2003.
[PDF] [Bibtex] [Abstract]

@InProceedings{Webb03a,
Title = {Preliminary Investigations into Statistically Valid Exploratory Rule Discovery},
Author = {G.I. Webb},
Booktitle = {Proceedings of the Second Australasian Data Mining Conference (AusDM03)},
Year = {2003},
Address = {Sydney},
Editor = {S.J. Simoff and G.J. Williams and M. Hegland},
Pages = {1-9},
Publisher = {University of Technology},
Abstract = {Exploratory rule discovery, as exemplified by association rule discovery, has proven very popular. In this paper I investigate issues surrounding the statistical validity of rules found using this approach and methods that might be employed to deliver statistically sound exploratory rule discovery.},
Audit-trail = {Submitted to AusDM03. No copyright required. Check key words},
Keywords = {Association Rule Discovery and statistically sound discovery and OPUS},
Location = {Canberra, Australia},
Related = {statistically-sound-association-discovery}
}
ABSTRACT Exploratory rule discovery, as exemplified by association rule discovery, has proven very popular. In this paper I investigate issues surrounding the statistical validity of rules found using this approach and methods that might be employed to deliver statistically sound exploratory rule discovery.

On Detecting Differences Between Groups.
Webb, G. I., Butler, S., & Newlands, D.
Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), New York, pp. 256-265, 2003.
[PDF] [Bibtex] [Abstract]

@InProceedings{WebbButlerNewlands03,
Title = {On Detecting Differences Between Groups},
Author = {G. I. Webb and S. Butler and D. Newlands},
Booktitle = {Proceedings of The Ninth {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2003)},
Year = {2003},
Address = {New York},
Editor = {P. Domingos and C. Faloutsos and T. Senator and H. Kargupta and L. Getoor},
Pages = {256-265},
Publisher = {The Association for Computing Machinery},
Abstract = {Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special purpose data mining technique, {\em contrast-set mining}. We undertook a study with a retail collaborator to compare contrast-set mining with existing rule-discovery techniques. To our surprise we observed that straightforward application of an existing commercial rule-discovery system, Magnum Opus, could successfully perform the contrast-set-mining task. This led to the realization that contrast-set mining is a special case of the more general rule-discovery task. We present the results of our study together with a proof of this conclusion},
Audit-trail = {PDF with ACM copyright posted in accordance with conditions of copyright},
Keywords = {OPUS and Association Rule Discovery},
Location = {Washington, DC},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT Understanding the differences between contrasting groups is a fundamental task in data analysis. This realization has led to the development of a new special purpose data mining technique, {\em contrast-set mining}. We undertook a study with a retail collaborator to compare contrast-set mining with existing rule-discovery techniques. To our surprise we observed that straightforward application of an existing commercial rule-discovery system, Magnum Opus, could successfully perform the contrast-set-mining task. This led to the realization that contrast-set mining is a special case of the more general rule-discovery task. We present the results of our study together with a proof of this conclusion

Removing Trivial Associations in Association Rule Discovery.
Webb, G. I., & Zhang, S.
Proceedings of the First International NAISO Congress on Autonomous Intelligent Systems (ICAIS 2002), Canada/The Netherlands, 2002.
[PDF] [Bibtex] [Abstract]

@InProceedings{WebbZhang02,
Title = {Removing Trivial Associations in Association Rule Discovery},
Author = {G. I. Webb and S. Zhang},
Booktitle = {Proceedings of the First International NAISO Congress on Autonomous Intelligent Systems (ICAIS 2002)},
Year = {2002},
Address = {Canada/The Netherlands},
Publisher = {NAISO Academic Press},
Abstract = {Association rule discovery has become one of the most widely applied data mining strategies. Techniques for association rule discovery have been dominated by the frequent itemset strategy as exemplified by the Apriori algorithm. One limitation of this approach is that it provides little opportunity to detect and remove association rules on the basis of relationships between rules. As a result, the association rules discovered are frequently swamped with large numbers of spurious rules that are of little interest to the user. This paper presents association rule discovery techniques that can detect and discard one form of spurious association rule: trivial associations.},
Audit-trail = {Pre-publication PDF posted},
Keywords = {OPUS and Association Rule Discovery},
Location = {Geelong, Australia},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT Association rule discovery has become one of the most widely applied data mining strategies. Techniques for association rule discovery have been dominated by the frequent itemset strategy as exemplified by the Apriori algorithm. One limitation of this approach is that it provides little opportunity to detect and remove association rules on the basis of relationships between rules. As a result, the association rules discovered are frequently swamped with large numbers of spurious rules that are of little interest to the user. This paper presents association rule discovery techniques that can detect and discard one form of spurious association rule: trivial associations.

Further Pruning for Efficient Association Rule Discovery.
Webb, G. I., & Zhang, S.
Lecture Notes in Computer Science Vol. 2256: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI’01), Berlin, pp. 605-618, 2001.
[PDF] [Bibtex] [Abstract]

@InProceedings{WebbZhang01,
Title = {Further Pruning for Efficient Association Rule Discovery},
Author = {G.I. Webb and S. Zhang},
Booktitle = {Lecture Notes in Computer Science Vol. 2256: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI'01)},
Year = {2001},
Address = {Berlin},
Editor = {M. Stumptner and D. Corbett and M.J. Brooks},
Pages = {605-618},
Publisher = {Springer},
Abstract = {The Apriori algorithm's frequent itemset approach has become the standard approach to discovering association rules. However, the computation requirements of the frequent itemset approach are infeasible for dense data and the approach is unable to discover infrequent associations. OPUS\_AR is an efficient algorithm for rule discovery that does not utilize frequent itemsets and hence avoids these problems. It can reduce search time by using additional constraints on the search space as well as constraints on itemset frequency. However, the effectiveness of the pruning rules used during search will determine the efficiency of its search. This paper presents and analyzes pruning rules for use with OPUS\_AR. We demonstrate that application of OPUS\_AR is feasible for a number of datasets for which application of the frequent itemset approach is infeasible and that the new pruning rules can reduce compute time by more than 40%.},
Audit-trail = {*},
Keywords = {OPUS and Association Rule Discovery},
Location = {Adelaide, Australia},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT The Apriori algorithm's frequent itemset approach has become the standard approach to discovering association rules. However, the computation requirements of the frequent itemset approach are infeasible for dense data and the approach is unable to discover infrequent associations. OPUS_AR is an efficient algorithm for rule discovery that does not utilize frequent itemsets and hence avoids these problems. It can reduce search time by using additional constraints on the search space as well as constraints on itemset frequency. However, the effectiveness of the pruning rules used during search will determine the efficiency of its search. This paper presents and analyzes pruning rules for use with OPUS_AR. We demonstrate that application of OPUS_AR is feasible for a number of datasets for which application of the frequent itemset approach is infeasible and that the new pruning rules can reduce compute time by more than 40%.

Discovering Associations with Numeric Variables.
Webb, G. I.
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001)[short paper], New York, pp. 383-388, 2001.
[PDF] [URL] [Bibtex] [Abstract]

@InProceedings{Webb01a,
Title = {Discovering Associations with Numeric Variables},
Author = {G. I. Webb},
Booktitle = {Proceedings of the Seventh {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2001)[short paper]},
Year = {2001},
Address = {New York},
Editor = {F. Provost and R. Srikant},
Pages = {383-388},
Publisher = {The Association for Computing Machinery},
Abstract = {This paper further develops Aumann and Lindell's [3] proposal for a variant of association rules for which the consequent is a numeric variable. It is argued that these rules can discover useful interactions with numeric data that cannot be discovered directly using traditional association rules with discretization. Alternative measures for identifying interesting rules are proposed. Efficient algorithms are presented that enable these rules to be discovered for dense data sets for which application of Auman and Lindell's algorithm is infeasible.},
Audit-trail = {*},
Keywords = {Impact Rules and OPUS and Association Rule Discovery},
Location = {San Francisco, CA},
Related = {impact-rules},
Url = {http://dl.acm.org/authorize?19861}
}
ABSTRACT This paper further develops Aumann and Lindell's [3] proposal for a variant of association rules for which the consequent is a numeric variable. It is argued that these rules can discover useful interactions with numeric data that cannot be discovered directly using traditional association rules with discretization. Alternative measures for identifying interesting rules are proposed. Efficient algorithms are presented that enable these rules to be discovered for dense data sets for which application of Auman and Lindell's algorithm is infeasible.

Efficient Search for Association Rules.
Webb, G. I.
Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), New York, pp. 99-107, 2000.
[PDF] [Bibtex] [Abstract]

@InProceedings{Webb00b,
Title = {Efficient Search for Association Rules},
Author = {G. I. Webb},
Booktitle = {Proceedings of the Sixth {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining (KDD-2000)},
Year = {2000},
Address = {New York},
Editor = {R. Ramakrishnan and S. Stolfo},
Pages = {99-107},
Publisher = {The Association for Computing Machinery},
Abstract = {This paper argues that for some applications direct search for association rules can be more efficient than the two stage process of the Apriori algorithm which first finds large item sets which are then used to identify associations. In particular, it is argued, Apriori can impose large computational overheads when the number of frequent itemsets is very large. This will often be the case when association rule analysis is performed on domains other than basket analysis or when it is performed for basket analysis with basket information augmented by other customer information. An algorithm is presented that is computationally efficient for association rule analysis during which the number of rules to be found can be constrained and all data can be maintained in memory.},
Audit-trail = {*},
Keywords = {Search and OPUS and Association Rule Discovery},
Location = {Boston, MA},
Related = {filtered-top-k-association-discovery}
}
ABSTRACT This paper argues that for some applications direct search for association rules can be more efficient than the two stage process of the Apriori algorithm which first finds large item sets which are then used to identify associations. In particular, it is argued, Apriori can impose large computational overheads when the number of frequent itemsets is very large. This will often be the case when association rule analysis is performed on domains other than basket analysis or when it is performed for basket analysis with basket information augmented by other customer information. An algorithm is presented that is computationally efficient for association rule analysis during which the number of rules to be found can be constrained and all data can be maintained in memory.

Inclusive Pruning: A New Class of Pruning Rule for Unordered Search and its Application to Classification Learning.
Webb, G. I.
Australian Computer Science Communications Vol. 18 (1): Proceedings of the Nineteenth Australasian Computer Science Conference (ACSC’96), Melbourne, pp. 1-10, 1996.
[PDF] [Bibtex] [Abstract]

@InProceedings{Webb96e,
Title = {Inclusive Pruning: A New Class of Pruning Rule for Unordered Search and its Application to Classification Learning},
Author = {G. I. Webb},
Booktitle = {Australian Computer Science Communications Vol. 18 (1): Proceedings of the Nineteenth Australasian Computer Science Conference (ACSC'96)},
Year = {1996},
Address = {Melbourne},
Editor = {K. Ramamohanarao},
Pages = {1-10},
Publisher = {ACS},
Abstract = {This paper presents a new class of pruning rule for unordered search. Previous pruning rules for unordered search identify operators that should not be applied in order to prune nodes reached via those operators. In contrast, the new pruning rules identify operators that should be applied and prune nodes that are not reached via those operators. Specific pruning rules employing both these approaches are identified for classification learning. Experimental results demonstrate that application of the new pruning rules can reduce by more than 60% the number of states from the search space that are considered during classification learning.},
Audit-trail = {*},
Keywords = {Search and Rule Learning and OPUS and Association Rule Discovery},
Location = {Royal Melbourne Insitute of Technology, Australia},
Related = {opus-search}
}
ABSTRACT This paper presents a new class of pruning rule for unordered search. Previous pruning rules for unordered search identify operators that should not be applied in order to prune nodes reached via those operators. In contrast, the new pruning rules identify operators that should be applied and prune nodes that are not reached via those operators. Specific pruning rules employing both these approaches are identified for classification learning. Experimental results demonstrate that application of the new pruning rules can reduce by more than 60% the number of states from the search space that are considered during classification learning.

OPUS: An Efficient Admissible Algorithm For Unordered Search.
Webb, G. I.
Journal of Artificial Intelligence Research, 3, 431-465, 1995.
[URL] [Bibtex] [Abstract]

@Article{Webb95,
Title = {OPUS: An Efficient Admissible Algorithm For Unordered Search},
Author = {G. I. Webb},
Journal = {Journal of Artificial Intelligence Research},
Year = {1995},
Pages = {431-465},
Volume = {3},
Abstract = {OPUS is a branch and bound search algorithm that enables efficient admissible search through spaces for which the order of search operator application is not significant. The algorithm's search efficiency is demonstrated with respect to very large machine learning search spaces. The use of admissible search is of potential value to the machine learning community as it means that the exact learning biases to be employed for complex learning tasks can be precisely specified and manipulated. OPUS also has potential for application in other areas of artificial intelligence, notably, truth maintenance.},
Address = {Menlo Park, CA},
Audit-trail = {Link to paper via JAIR website},
Keywords = {Search and Rule Learning and OPUS and Association Rule Discovery},
Publisher = {AAAI Press},
Related = {opus-search},
Url = {http://dx.doi.org/10.1613/jair.227}
}
ABSTRACT OPUS is a branch and bound search algorithm that enables efficient admissible search through spaces for which the order of search operator application is not significant. The algorithm's search efficiency is demonstrated with respect to very large machine learning search spaces. The use of admissible search is of potential value to the machine learning community as it means that the exact learning biases to be employed for complex learning tasks can be precisely specified and manipulated. OPUS also has potential for application in other areas of artificial intelligence, notably, truth maintenance.

Systematic Search for Categorical Attribute-Value Data-Driven Machine Learning.
Webb, G. I.
Proceedings of the Sixth Australian Joint Conference on Artificial Intelligence (AI’93), Singapore, pp. 342-347, 1993.
[PDF] [Bibtex] [Abstract]

@InProceedings{Webb93a,
Title = {Systematic Search for Categorical Attribute-Value Data-Driven Machine Learning},
Author = {G. I. Webb},
Booktitle = {Proceedings of the Sixth Australian Joint Conference on Artificial Intelligence (AI'93)},
Year = {1993},
Address = {Singapore},
Editor = {C. Rowles and H. Liu and N. Foo},
Pages = {342-347},
Publisher = {World Scientific},
Abstract = {Optimal Pruning for Unordered Search is a search algorithm that enables complete search through the space of possible disjuncts at the inner level of a covering algorithm. This algorithm takes as inputs an evaluation function, e, a training set, t, and a set of specialisation operators, o. It outputs a set of operators from o that creates a classifier that maximises e with respect to t. While OPUS has exponential worst case time complexity, the algorithm is demonstrated to reach solutions for complex real world domains within reasonable time frames. Indeed, for some domains, the algorithm exhibits greater computational efficiency than common heuristic search algorithms.},
Audit-trail = {*},
Keywords = {Search and Rule Learning and OPUS},
Location = {Melbourne, Australia},
Related = {opus-search}
}
ABSTRACT Optimal Pruning for Unordered Search is a search algorithm that enables complete search through the space of possible disjuncts at the inner level of a covering algorithm. This algorithm takes as inputs an evaluation function, e, a training set, t, and a set of specialisation operators, o. It outputs a set of operators from o that creates a classifier that maximises e with respect to t. While OPUS has exponential worst case time complexity, the algorithm is demonstrated to reach solutions for complex real world domains within reasonable time frames. Indeed, for some domains, the algorithm exhibits greater computational efficiency than common heuristic search algorithms.