
Bias and variance

Bias and variance provide a powerful conceptual tool for analyzing machine learning performance. My research revealed that as sample size increases, variance tends to decrease relative to bias. As has since become widely accepted, this implies that low-variance learning algorithms, such as linear models, should be most effective with small quantities of data, while low-bias learning algorithms, such as deep learning, should be most effective with large quantities of data.

Previous approaches to bias-variance experiments have provided little control over the type of data distribution from which bias and variance are estimated. I have developed new techniques for bias-variance analysis that provide greater control over the data distribution. Experiments show that the type of distribution used in bias-variance experiments can greatly affect the results obtained.
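
In practice, a bias-variance experiment repeatedly resamples training sets from a pool of data, trains the learner on each sample, and splits its test error into a bias-like term (the error of the per-instance majority prediction) and a variance-like term (the average disagreement with that majority prediction). The sketch below is a minimal, self-contained illustration of this style of experiment; the synthetic data, the two learners compared, and the simple majority-vote decomposition are assumptions chosen for brevity, not the procedures or definitions used in the papers listed below.

# A minimal bias-variance experiment for 0-1 loss (illustrative only).
import numpy as np
from sklearn.datasets import make_classification          # synthetic stand-in for a data distribution
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def estimate_bias_variance(learner, X_pool, y_pool, X_test, y_test,
                           n_trials=50, train_size=200, seed=0):
    """Resample training sets, refit the learner, and decompose its test error."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_trials, len(y_test)), dtype=int)
    for t in range(n_trials):
        idx = rng.choice(len(y_pool), size=train_size, replace=False)
        preds[t] = learner.fit(X_pool[idx], y_pool[idx]).predict(X_test)
    # "Central tendency" prediction: per-instance majority vote over trials.
    main = np.apply_along_axis(lambda p: np.bincount(p).argmax(), 0, preds)
    bias = np.mean(main != y_test)              # error of the central-tendency prediction
    variance = np.mean(preds != main[None, :])  # mean disagreement with the central tendency
    return bias, variance

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=5000, random_state=0)
for name, clf in [("linear model", LogisticRegression(max_iter=1000)),
                  ("decision tree", DecisionTreeClassifier())]:
    b, v = estimate_bias_variance(clf, X_pool, y_pool, X_test, y_test)
    print(f"{name}: bias ~ {b:.3f}, variance ~ {v:.3f}")

Swapping in different learners, training-set sizes, or data-generating distributions in a harness like this makes it easy to see how strongly the estimated bias and variance depend on the distribution from which training sets are drawn.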

Publications

Feature-subspace aggregating: Ensembles for stable and unstable learners.
Ting, K. M., Wells, J., Tan, S., Teng, S., & Webb, G. I.
Machine Learning, 82(3), 375-397, 2011.
[DOI] [Bibtex] [Abstract]

@Article{TingEtAl11,
Title = {Feature-subspace aggregating: Ensembles for stable and unstable learners},
Author = {K.M. Ting and J. Wells and S. Tan and S. Teng and G.I. Webb},
Journal = {Machine Learning},
Year = {2011},
Number = {3},
Pages = {375-397},
Volume = {82},
Abstract = {This paper introduces a new ensemble approach, Feature-Subspace Aggregating (Feating), which builds local models instead of global models. Feating is a generic ensemble approach that can enhance the predictive performance of both stable and unstable learners. In contrast, most existing ensemble approaches can improve the predictive performance of unstable learners only. Our analysis shows that the new approach reduces the execution time to generate a model in an ensemble through an increased level of localisation in Feating. Our empirical evaluation shows that Feating performs significantly better than Boosting, Random Subspace and Bagging in terms of predictive accuracy, when a stable learner SVM is used as the base learner. The speed up achieved by Feating makes feasible SVM ensembles that would otherwise be infeasible for large data sets. When SVM is the preferred base learner, we show that Feating SVM performs better than Boosting decision trees and Random Forests. We further demonstrate that Feating also substantially reduces the error of another stable learner, k-nearest neighbour, and an unstable learner, decision tree.},
Address = {Netherlands},
Doi = {10.1007/s10994-010-5224-5},
ISSN = {0885-6125},
Keywords = {Feating and Multiboosting and Boosting and Bias-variance},
Publisher = {Springer},
Related = {feating},
Urltext = {Link to paper via SpringerLink}
}
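
To make the contrast between local and global models concrete, the sketch below illustrates the general idea in a heavily simplified form: for each of several small, randomly chosen feature subspaces, it partitions the data by the binned values of those features, fits a separate base model (here an SVM) in each cell, and votes across subspaces. The class name, the quantile binning, and the majority-class fallback are assumptions made for illustration; this is not the Feating algorithm as published.

# A simplified illustration of building local models over feature subspaces.
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

class SubspacePartitionEnsemble:
    def __init__(self, base, n_subspaces=10, subspace_size=2, n_bins=3, seed=0):
        self.base, self.n_subspaces = base, n_subspaces
        self.subspace_size, self.n_bins, self.seed = subspace_size, n_bins, seed

    def _cells(self, Xs, edges):
        # Map each row to a tuple of bin indices: one cell of the partition.
        return [tuple(int(np.searchsorted(edges[j], Xs[i, j])) for j in range(Xs.shape[1]))
                for i in range(Xs.shape[0])]

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.members_ = []
        for _ in range(self.n_subspaces):
            feats = rng.choice(X.shape[1], self.subspace_size, replace=False)
            edges = [np.quantile(X[:, f], np.linspace(0, 1, self.n_bins + 1)[1:-1]) for f in feats]
            cells = self._cells(X[:, feats], edges)
            local = {}
            for c in set(cells):
                mask = np.array([cc == c for cc in cells])
                if len(np.unique(y[mask])) > 1:       # a local model needs more than one class
                    local[c] = clone(self.base).fit(X[mask], y[mask])
            self.members_.append((feats, edges, local))
        self.default_ = Counter(y).most_common(1)[0][0]   # fallback for cells with no local model
        return self

    def predict(self, X):
        votes = []
        for feats, edges, local in self.members_:
            cells = self._cells(X[:, feats], edges)
            votes.append([local[c].predict(X[i:i + 1])[0] if c in local else self.default_
                          for i, c in enumerate(cells)])
        votes = np.array(votes)
        return np.array([Counter(votes[:, i]).most_common(1)[0][0] for i in range(X.shape[0])])

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)
ens = SubspacePartitionEnsemble(SVC(kernel="rbf")).fit(Xtr, ytr)
print("accuracy:", np.mean(ens.predict(Xte) == yte))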

The Need for Low Bias Algorithms in Classification Learning From Large Data Sets.
Brain, D., & Webb, G. I.
Lecture Notes in Computer Science 2431: Principles of Data Mining and Knowledge Discovery: Proceedings of the Sixth European Conference (PKDD 2002), Berlin/Heidelberg, pp. 62-73, 2002.
[PDF] [Bibtex] [Abstract]

@InProceedings{BrainWebb02,
Title = {The Need for Low Bias Algorithms in Classification Learning From Large Data Sets},
Author = { D. Brain and G.I. Webb},
Booktitle = {Lecture Notes in Computer Science 2431: Principles of Data Mining and Knowledge Discovery: Proceedings of the Sixth European Conference (PKDD 2002)},
Year = {2002},
Address = {Berlin/Heidelberg},
Pages = {62-73},
Publisher = {Springer-Verlag},
Abstract = {This paper reviews the appropriateness for application to large data sets of standard machine learning algorithms, which were mainly developed in the context of small data sets. Sampling and parallelization have proved useful means for reducing computation time when learning from large data sets. However, such methods assume that algorithms that were designed for use with what are now considered small data sets are also fundamentally suitable for large data sets. It is plausible that optimal learning from large data sets requires a different type of algorithm to optimal learning from small data sets. This paper investigates one respect in which data set size may affect the requirements of a learning algorithm - the bias plus variance decomposition of classification error. Experiments show that learning from large data sets may be more effective when using an algorithm that places greater emphasis on bias management, rather than variance management.},
Audit-trail = {http://link.springer.de/link/service/series/0558/bibs/2431/24310062.htm},
Keywords = {Learning from large datasets and Bias-Variance},
Location = {Helsinki, Finland},
Related = {learning-from-large-datasets}
}
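
For context, one widely used decomposition of expected 0-1 classification error into noise, bias and variance terms is that of Kohavi and Wolpert (1996). It is reproduced below purely as an illustration of the kind of decomposition the paper analyses, and is not necessarily the exact definition used in its experiments. Writing P(y | x) for the true class distribution and \hat{P}(y | x) for the probability, over training sets of a given size, that the learner predicts class y:

\[
  E[\mathrm{error}] \;=\; \sum_{x} P(x)\,\bigl(\sigma_x^2 + \mathrm{bias}_x^2 + \mathrm{variance}_x\bigr),
\]
\[
  \sigma_x^2 = \tfrac{1}{2}\Bigl(1 - \sum_{y} P(y \mid x)^2\Bigr), \qquad
  \mathrm{bias}_x^2 = \tfrac{1}{2}\sum_{y}\bigl(P(y \mid x) - \hat{P}(y \mid x)\bigr)^2, \qquad
  \mathrm{variance}_x = \tfrac{1}{2}\Bigl(1 - \sum_{y}\hat{P}(y \mid x)^2\Bigr).
\]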

MultiBoosting: A Technique for Combining Boosting and Wagging.
Webb, G. I.
Machine Learning, 40(2), 159-196, 2000.
[DOI] [Bibtex] [Abstract]

@Article{Webb00a,
Title = {MultiBoosting: A Technique for Combining Boosting and Wagging},
Author = {G. I. Webb},
Journal = {Machine Learning},
Year = {2000},
Number = {2},
Pages = {159-196},
Volume = {40},
Abstract = {MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees. MultiBoosting can be viewed as combining AdaBoost with wagging. It is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction. Using C4.5 as the base learning algorithm, Multi-boosting is demonstrated to produce decision committees with lower error than either AdaBoost or wagging significantly more often than the reverse over a large representative cross-section of UCI data sets. It offers the further advantage over AdaBoost of suiting parallel execution.},
Address = {Netherlands},
Audit-trail = {27/10/03 requested permission to post pp pdf. 28/10/03 Permission granted by Kluwer. PDF posted 30/10/03},
Doi = {10.1023/A:1007659514849},
Keywords = {MultiBoosting and Boosting and Bias-Variance},
Publisher = {Springer},
Related = {multiboosting-and-multi-strategy-ensemble-learning}
}
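
The combination described in the abstract (sub-committees formed by boosting, wrapped in an outer layer of wagging-style random re-weighting) can be sketched roughly as follows. The sketch uses scikit-learn's AdaBoostClassifier with its default base learner in place of C4.5, and exponential instance weights as a simple stand-in for wagging's weighting scheme; it illustrates the idea only and is not the published MultiBoosting procedure.

# Boosted sub-committees on randomly re-weighted data, combined by voting.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def fit_boosted_subcommittees(X, y, n_subcommittees=5, members_each=10, seed=0):
    rng = np.random.default_rng(seed)
    committees = []
    for _ in range(n_subcommittees):
        w = rng.exponential(1.0, size=len(y))              # random continuous instance weights
        clf = AdaBoostClassifier(n_estimators=members_each,
                                 random_state=int(rng.integers(1 << 31)))
        clf.fit(X, y, sample_weight=w)                     # boost on the re-weighted data
        committees.append(clf)
    return committees

def predict_vote(committees, X):
    # Majority vote across the sub-committees' (already aggregated) predictions.
    votes = np.stack([c.predict(X) for c in committees])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
print("accuracy:", np.mean(predict_vote(fit_boosted_subcommittees(Xtr, ytr), Xte) == yte))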

On The Effect of Data Set Size on Bias And Variance in Classification Learning.
Brain, D., & Webb, G. I.
Proceedings of the Fourth Australian Knowledge Acquisition Workshop (AKAW-99), Sydney, pp. 117-128, 1999.
[PDF] [Bibtex] [Abstract]

@InProceedings{BrainWebb99,
Title = {On The Effect of Data Set Size on Bias And Variance in Classification Learning},
Author = {D. Brain and G. I. Webb},
Booktitle = {Proceedings of the Fourth {Australian} Knowledge Acquisition Workshop ({AKAW}-99)},
Year = {1999},
Address = {Sydney},
Editor = {D. Richards and G. Beydoun and A. Hoffmann and P. Compton },
Pages = {117-128},
Publisher = {The University of New South Wales},
Abstract = {With the advent of data mining, machine learning has come of age and is now a critical technology in many businesses. However, machine learning evolved in a different research context to that in which it now finds itself employed. A particularly important problem in the data mining world is working effectively with large data sets. However, most machine learning research has been conducted in the context of learning from very small data sets. To date most approaches to scaling up machine learning to large data sets have attempted to modify existing algorithms to deal with large data sets in a more computationally efficient and effective manner. But is this necessarily the best method? This paper explores the possibility of designing algorithms specifically for large data sets. Specifically, the paper looks at how increasing data set size affects bias and variance error decompositions for classification algorithms. Preliminary results of experiments to determine these effects are presented, showing that, as hypothesized variance can be expected to decrease as training set size increases. No clear effect of training set size on bias was observed. These results have profound implications for data mining from large data sets, indicating that developing effective learning algorithms for large data sets is not simply a matter of finding computationally efficient variants of existing learning algorithms.},
Audit-trail = {*},
Keywords = {Learning from large datasets and Bias-Variance},
Location = {Sydney, Australia},
Related = {learning-from-large-datasets}
}
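
The kind of experiment described here, charting how bias and variance change as the training set grows, can be sketched briefly. The sketch below reuses the same simple majority-vote decomposition as the earlier bias-variance sketch, with a single decision-tree learner; the synthetic data and the particular training-set sizes are assumptions for illustration, not the paper's experimental protocol.

# Bias and variance of a decision tree as training-set size increases (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=60000, n_features=20, random_state=2)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=10000, random_state=2)
rng = np.random.default_rng(2)

for n in (100, 1000, 10000):
    preds = np.empty((30, len(y_test)), dtype=int)
    for t in range(30):
        idx = rng.choice(len(y_pool), size=n, replace=False)
        preds[t] = DecisionTreeClassifier().fit(X_pool[idx], y_pool[idx]).predict(X_test)
    main = np.apply_along_axis(lambda p: np.bincount(p).argmax(), 0, preds)   # majority prediction
    print(f"n={n:>6}  bias ~ {np.mean(main != y_test):.3f}  "
          f"variance ~ {np.mean(preds != main[None, :]):.3f}")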