Decision tree grafting

Grafting is a post-process that adds tests and nodes to an existing decision tree, achieving bagging-like variance reduction while still delivering a single interpretable decision tree.  It was developed as part of my exploration of the use of Occam's Razor in machine learning.

The first grafting algorithm added new branches that did not change the tree's classification of any training data [Webb, 1996].  This constraint was imposed to demonstrate as clearly as possible that it is not true that, all other things being equal, a less complex classifier should be expected to be more accurate than a more complex one.  Subsequent algorithms [Webb, 1997, 1998, 1999] were intended as practical approaches to machine learning and allowed new branches to reclassify training examples that the original tree had misclassified.
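To make the 1996 constraint concrete, here is a minimal sketch in Java (not the original C4.5 implementation; the class and method names and the single-numeric-attribute setting are illustrative). A test can be grafted below a leaf only where it carves off a sub-region of the leaf containing no training cases, so every training case keeps its original classification:

import java.util.Arrays;
import java.util.OptionalDouble;

/**
 * Toy sketch of the constraint in the first grafting algorithm: a test
 * "a <= cut" may be grafted below a leaf only if the relabelled side of
 * the cut contains no training cases reaching the leaf, so no training
 * classification changes.
 */
public final class GraftSketch {

    // Returns a cut that carves an empty sub-region off the lower end of
    // the leaf's region on attribute a, if such a cut exists.
    static OptionalDouble lowerGraftCut(double[] leafValues, double regionMin) {
        double lo = Arrays.stream(leafValues).min().orElse(Double.NaN);
        // Any threshold between the region's lower boundary and the smallest
        // observed value isolates a region with no training cases; a branch
        // grafted there can carry a different class label without
        // reclassifying any training case at this leaf.
        return regionMin < lo ? OptionalDouble.of((regionMin + lo) / 2.0)
                              : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        // The leaf's region spans [0, 10] on attribute a; its cases lie in [4, 9].
        double[] casesAtLeaf = {4.0, 5.5, 9.0};
        lowerGraftCut(casesAtLeaf, 0.0).ifPresent(cut ->
            System.out.println("graft test 'a <= " + cut + "'"));  // cut = 2.0
    }
}

In the later, practical algorithms this empty-region requirement is relaxed: a grafted branch may also claim cases that the original tree misclassified.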

The original decision tree grafting software was implemented as an extension to C4.5.  It can be downloaded here (requires C4.5 release 6).  Decision tree grafting is also implemented in the J48Graft component of the Weka machine learning workbench.
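For example, in a Weka release that includes grafting (the classifier appears as weka.classifiers.trees.J48graft in Weka 3.6; treat the class name and the dataset path below as assumptions for other releases), a grafted tree is built like any other Weka classifier:

import weka.classifiers.trees.J48graft;                 // grafted C4.5 (Weka 3.6)
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GraftDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (path is illustrative) and use the last attribute as the class.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Grow a pruned C4.5 tree, then graft additional branches onto it.
        J48graft tree = new J48graft();
        tree.buildClassifier(data);

        // The result is still a single, printable decision tree.
        System.out.println(tree);
    }
}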

Publications

Generality is Predictive of Prediction Accuracy.
Webb, G. I., & Brain, D.
Proceedings of the 2002 Pacific Rim Knowledge Acquisition Workshop (PKAW'02), Tokyo, pp. 117-130, 2002.

@InProceedings{WebbBrain02,
Title = {Generality is Predictive of Prediction Accuracy},
Author = {G. I. Webb and D. Brain},
Booktitle = {Proceedings of the 2002 {Pacific} Rim Knowledge Acquisition Workshop (PKAW'02)},
Year = {2002},
Address = {Tokyo},
Editor = {T. Yamaguchi and A. Hoffmann and H. Motoda and P. Compton},
Pages = {117-130},
Publisher = {Japanese Society for Artificial Intelligence},
Abstract = {There has been a dearth of research into the relative impacts of alternative high level learning biases. This paper presents two hypotheses about the expected impact of selecting between classification rules of differing levels of generality in the absence of other evidence about their likely relative performance on unseen data. It is argued that the accuracy on unseen data of the more general rule will tend to be closer to that of a default rule for the class than will that of the more specific rule. It is also argued that the accuracy on unseen cases of the more specific rule will tend to be closer to the accuracy obtained on training data than will the accuracy of the more general rule. Experimental evidence is provided in support of these hypotheses. We argue that these hypotheses can be of use in selecting appropriate learning biases to achieve specific learning objectives.},
Audit-trail = {*},
Keywords = {Occams Razor and Generality},
Location = {Tokyo, Japan},
Related = {generality-is-predictive-of-prediction-accuracy}
}

Decision Tree Grafting From The All Tests But One Partition.
Webb, G. I.
Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI 99), San Francisco, pp. 702-707, 1999.

@InProceedings{Webb99,
Title = {Decision Tree Grafting From The All Tests But One Partition},
Author = {G. I. Webb},
Booktitle = {Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence ({IJCAI} 99)},
Year = {1999},
Address = {San Francisco},
Editor = {T. Dean},
Pages = {702-707},
Publisher = {Morgan Kaufmann},
Abstract = {Decision tree grafting adds nodes to an existing decision tree with the objective of reducing prediction error. A new grafting algorithm is presented that considers one set of training data only for each leaf of the initial decision tree, the set of cases that fail at most one test on the path to the leaf. This new technique is demonstrated to retain the error reduction power of the original grafting algorithm while dramatically reducing compute time and the complexity of the inferred tree. Bias/variance analysis reveals that the original grafting technique operated primarily by variance reduction while the new technique reduces both bias and variance.},
Audit-trail = {PDF posted with the permission of {IJCAI} Inc},
Keywords = {Decision Tree Learning and Decision Tree Grafting and Occams Razor},
Location = {Stockholm, Sweden},
Related = {decision-tree-grafting}
}

The Problem of Missing Values in Decision Tree Grafting.
Webb, G. I.
Lecture Notes in Computer Science Vol. 1502: Advanced Topics in Artificial Intelligence, Selected Papers from the Eleventh Australian Joint Conference on Artificial Intelligence (AI '98), Berlin, pp. 273-283, 1998.

@InProceedings{Webb98,
Title = {The Problem of Missing Values in Decision Tree Grafting},
Author = {G. I. Webb},
Booktitle = {Lecture Notes in Computer Science Vol. 1502: Advanced Topics in Artificial Intelligence, Selected Papers from the Eleventh Australian Joint Conference on Artificial Intelligence (AI '98)},
Year = {1998},
Address = {Berlin},
Editor = {G. Antoniou and J.K. Slaney},
Pages = {273-283},
Publisher = {Springer-Verlag},
Abstract = {Decision tree grafting adds nodes to inferred decision trees. Previous research has demonstrated that appropriate grafting techniques can improve predictive accuracy across a wide cross-section of domains. However, previous decision tree grafting systems are demonstrated to have a serious deficiency for some data sets containing missing values. This problem arises due to the method for handling missing values employed by C4.5, in which the grafting systems have been embedded. This paper provides an explanation of and solution to the problem. Experimental evidence is presented of the efficacy of this solution.},
Audit-trail = {*},
Keywords = {Decision Tree Learning and Decision Tree Grafting and Occams Razor},
Location = {Brisbane, Australia},
Related = {decision-tree-grafting}
}

Decision Tree Grafting.
Webb, G. I.
Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI 97), San Francisco, pp. 846-851, 1997.

@InProceedings{Webb97,
Title = {Decision Tree Grafting},
Author = {G. I. Webb},
Booktitle = {Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence ({IJCAI} 97)},
Year = {1997},
Address = {San Francisco},
Pages = {846-851},
Publisher = {Morgan Kaufmann},
Abstract = {This paper extends recent work on decision tree grafting. Grafting is an inductive process that adds nodes to inferred decision trees. This process is demonstrated to frequently improve predictive accuracy. Superficial analysis might suggest that decision tree grafting is the direct reverse of pruning. To the contrary, it is argued that the two processes are complementary. This is because, like standard tree growing techniques, pruning uses only local information, whereas grafting uses non-local information. The use of both pruning and grafting in conjunction is demonstrated to provide the best general predictive accuracy over a representative selection of learning tasks.},
Audit-trail = {PDF posted with the permission of IJCAI Inc},
Keywords = {Decision Trees and Decision Tree Grafting and Occams Razor},
Location = {Nagoya, Japan},
Related = {decision-tree-grafting}
}

Further Experimental Evidence Against The Utility Of Occam's Razor.
Webb, G. I.
Journal of Artificial Intelligence Research, 4, 397-417, 1996.

@Article{Webb96b,
Title = {Further Experimental Evidence Against The Utility Of {Occam}'s Razor},
Author = {G. I. Webb},
Journal = {Journal of Artificial Intelligence Research},
Year = {1996},
Pages = {397-417},
Volume = {4},
Abstract = {This paper presents new experimental evidence against the utility of Occam's razor. A systematic procedure is presented for post-processing decision trees produced by C4.5. This procedure was derived by rejecting Occam's razor and instead attending to the assumption that similar objects are likely to belong to the same class. It increases a decision tree's complexity without altering the performance of that tree on the training data from which it is inferred. The resulting more complex decision trees are demonstrated to have, on average, for a variety of common learning tasks, higher predictive accuracy than the less complex original decision trees. This result raises considerable doubt about the utility of Occam's razor as it is commonly applied in modern machine learning.},
Address = {Menlo Park, CA},
Audit-trail = {Link to paper via JAIR website},
Doi = {10.1613/jair.228},
Keywords = {Decision Trees and Decision Tree Grafting and Occams Razor},
Publisher = {AAAI Press},
Related = {occams-razor-in-machine-learning}
}

Generality Is More Significant Than Complexity: Toward An Alternative To Occam's Razor.
Webb, G. I.
Artificial Intelligence: Sowing the Seeds for the Future, Proceedings of the Seventh Australian Joint Conference on Artificial Intelligence (AI'94), Singapore, pp. 60-67, 1994.

@InProceedings{Webb94b,
Title = {Generality Is More Significant Than Complexity: Toward An Alternative To {Occam}'s Razor},
Author = {G. I. Webb},
Booktitle = {Artificial Intelligence: Sowing the Seeds for the Future, Proceedings of the Seventh Australian Joint Conference on Artificial Intelligence (AI'94)},
Year = {1994},
Address = {Singapore},
Editor = {C. Zhang and J. Debenham and D. Lukose},
Pages = {60-67},
Publisher = {World Scientific},
Abstract = {Occam's Razor is widely employed in machine learning to select between classifiers with equal empirical support. This paper presents the theorem of decreasing inductive power: that, all other things being equal, if two classifiers a and b cover identical cases from the training set and a is a generalisation of b, a has higher probability than b of misclassifying a previously unsighted case. This theorem suggests that, to the contrary of Occam's Razor, generality, not complexity, should be used to select between classifiers with equal empirical support. Two studies are presented. The first study demonstrates that the theorem of decreasing inductive power holds for a number of commonly studied learning problems and for a number of different means of manipulating classifier generality. The second study demonstrates that generality provides a more consistent indicator of predictive accuracy in the context of a default rule than does complexity. These results suggest that the theorem of decreasing predictive power provides a suitable theoretical framework for the development of learning biases for use in selecting between classifiers with identical empirical support.},
Audit-trail = {*},
Keywords = {Occams Razor and Rule Learning and Generality},
Location = {Armidale, NSW, Australia},
Related = {occams-razor-in-machine-learning}
}