Big Models for Big Data

We develop machine learning methods that learn massive models that capture the fine detail inherent in large quantities of data. Whereas big models will overfit small data, as data quantity increases ever larger models can effectively capture increased detail delivering greater accuracy and precision.

Big Bayesian Network Classifier Models.

These probabilistic models are efficient to learn. Notably, they do not require frequent access to the data and hence can be learned with a small number of passes through the data without any need for raw data to be retained in active memory. This allows all available data to be utilised, rather than the samples that are often required to learn from big data with conventional techniques.

The Averaged N Dependence Estimators (ANDE) algorithms average over all of a large class of alternative Bayesian Network Classifiers [1, 2]. The models include all combinations of n attributes and their relationship with the class attribute. The models can be learned in a single pass through the data. These generative models have demonstrated accuracy competitive with state-of-the-art discriminative algorithms. More detail and links to software can be found here.

Selective KDB is an extension to the KDB algorithm [3]. KDB can learn a large Bayesian Network Classifier in just two passes through the training data. However, the size of these models is determined by parameter k. As data quantity increases, k should increase to allow ever more detail to be captured. But there is a trade-off between being able to capture more detail and the risk of overfitting. The exact trade-off varies depending on very subtle details of the distribution that has generated the data. Selective KDB performs a single extra pass through the training data which selects the best value of k as well as an effective subset of attributes. This very efficient algorithm is extremely effective with large data. The software can be downloaded here.

Using Generative Parameters to Scale Up Discriminative Learning of Big Models

Simple linear models are very effective with relatively small quantities of data. However, they are unable to capture fine-grained interactions between variables that can be revealed by big data. One way of capturing this detail is to create new features that represent sets of features. However, this can lead to an exponential blow-out in the number of parameters that must be learned, which quickly leads to learning becoming infeasible for big data.

We have shown how efficiently learned generative parameters can be used to greatly speed-up learning of linear models [4, 5]. ALR uses this trick to speed up learning of higher-order models that create new features from combinations of primary features [6]. The resulting big models can achieve very low error on big data. The ALR software can be downloaded here.

Grafting: Big Decision Tree Models

Most learning systems ignore parts of the instance space that are not occupied by any training examples. Typical decision tree learners stop learning as soon as the tree adequately handles all the training examples. However, this is likely to leave unconsidered large areas of the instance space for which the training set does not happen to have included any examples. Decision tree grafting takes a decision tree created by a traditional learning algorithm and then systematically grafts additional branches onto it by considering additional branches using evidence from outside the node being split [7, 8, 9]. A Weka implementation is available here and C source is available here.

Theory underpinning Big Models for Big Data

This research is underpinned by the low bias hypothesis [10]. Error can be decomposed into two terms bias and variance. Bias is caused by a mismatch between the learner's assumptions about the form of distribution from which the data are drawn and the true form. Variance is caused by the learner encoding features of the training data that do not represent the distribution form which they are drawn, which results form the sample being unrepresentative of the distribution. The low bias hypothesis holds that variance will be greatest for small quantities of data and will decrease as data quantity increases. From this it follows that for small data it is most important to minimise variance, while for big data it is most important to minimise bias.

One major determinant of bias is representation bias. A learner can only describe distributions that conform to the form of model that the learner creates. Hence, any mismatch between the form of model and the true distribution will result in bias. The greater the class of distributions that the learner can model, the lower this bias will be. Hence, as data quantity increases error can be minimised by learning models that capture ever more fine grained detail.

References

[1] Not So Naive Bayes: Aggregating One-Dependence Estimators.
Webb, G. I., Boughton, J., & Wang, Z.
Machine Learning, 58(1), 5-24, 2005.
[Bibtex] [Abstract]

@Article{WebbBoughtonWang05,
Title = {Not So Naive {Bayes}: Aggregating One-Dependence Estimators},
Author = {Webb, G. I. and Boughton, J. and Wang, Z.},
Journal = {Machine Learning},
Year = {2005},
Number = {1},
Pages = {5-24},
Volume = {58},
Abstract = {Of numerous proposals to improve the accuracy of naive Bayes by weakening its attribute independence assumption, both LBR and TAN have demonstrated remarkable error performance. However, both techniques obtain this outcome at a considerable computational cost. We present a new approach to weakening the attribute independence assumption by averaging all of a constrained class of classifiers. In extensive experiments this technique delivers comparable prediction accuracy to LBR and TAN with substantially improved computational efficiency.},
Address = {Netherlands},
Audit-trail = {3/5/04 Pre-print posted},
Doi = {10.1007/s10994-005-4258-6},
Keywords = {Conditional Probability Estimation and AODE},
Publisher = {Springer},
Related = {learning-complex-conditional-probabilities-from-data}
}

ABSTRACT Of numerous proposals to improve the accuracy of naive Bayes by weakening its attribute independence assumption, both LBR and TAN have demonstrated remarkable error performance. However, both techniques obtain this outcome at a considerable computational cost. We present a new approach to weakening the attribute independence assumption by averaging all of a constrained class of classifiers. In extensive experiments this technique delivers comparable prediction accuracy to LBR and TAN with substantially improved computational efficiency.

[2] Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly naive Bayesian classification.
Webb, G. I., Boughton, J., Zheng, F., Ting, K. M., & Salem, H.
Machine Learning, 86(2), 233-272, 2012.
[Bibtex] [Abstract]

@Article{WebbEtAl12,
author = {Webb, G. I. and Boughton, J. and Zheng, F. and Ting, K. M. and Salem, H.},
journal = {Machine Learning},
title = {Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly naive {Bayesian} classification},
year = {2012},
issn = {0885-6125},
number = {2},
pages = {233-272},
volume = {86},
abstract = {Averaged n-Dependence Estimators (AnDE) is an approach to probabilistic classification learning that learns by extrapolation from marginal
to full-multivariate probability distributions. It utilizes a single parameter that transforms the approach between a low-variance high-bias learner
(Naive Bayes) and a high-variance low-bias learner with Bayes optimal
asymptotic error. It extends the underlying strategy of Averaged One-Dependence Estimators (AODE), which relaxes the Naive Bayes independence assumption while retaining many of Naive Bayes' desirable computational and theoretical properties. AnDE further relaxes the independence assumption by generalizing AODE to higher-levels of dependence.
Extensive experimental evaluation shows that the bias-variance trade-off
for Averaged 2-Dependence Estimators results in strong predictive accuracy over a wide range of data sets. It has training time linear with
respect to the number of examples, supports incremental learning, handles directly missing values, and is robust in the face of noise. Beyond
the practical utility of its lower-dimensional variants, AnDE is of interest
in that it demonstrates that it is possible to create low-bias high-variance
generative learners and suggests strategies for developing even more powerful classifiers.},
address = {Netherlands},
keywords = {Conditional Probability Estimation and AODE},
publisher = {Springer},
related = {learning-complex-conditional-probabilities-from-data},
url = {http://dx.doi.org/10.1007/s10994-011-5263-6},
urltext = {Link to paper via SpringerLink},
}

ABSTRACT Averaged n-Dependence Estimators (AnDE) is an approach to probabilistic classification learning that learns by extrapolation from marginal to full-multivariate probability distributions. It utilizes a single parameter that transforms the approach between a low-variance high-bias learner (Naive Bayes) and a high-variance low-bias learner with Bayes optimal asymptotic error. It extends the underlying strategy of Averaged One-Dependence Estimators (AODE), which relaxes the Naive Bayes independence assumption while retaining many of Naive Bayes' desirable computational and theoretical properties. AnDE further relaxes the independence assumption by generalizing AODE to higher-levels of dependence. Extensive experimental evaluation shows that the bias-variance trade-off for Averaged 2-Dependence Estimators results in strong predictive accuracy over a wide range of data sets. It has training time linear with respect to the number of examples, supports incremental learning, handles directly missing values, and is robust in the face of noise. Beyond the practical utility of its lower-dimensional variants, AnDE is of interest in that it demonstrates that it is possible to create low-bias high-variance generative learners and suggests strategies for developing even more powerful classifiers.

[3] Scalable Learning of Bayesian Network Classifiers.
Martinez, A. M., Webb, G. I., Chen, S., & Zaidi, N. A.
Journal of Machine Learning Research, 17(44), 1-35, 2016.
[Bibtex] [Abstract]

@Article{MartinezEtAl16,
author = {Martinez, Ana M. and Webb, Geoffrey I. and Chen, Shenglei and Zaidi, Nayyar A.},
journal = {Journal of Machine Learning Research},
title = {Scalable Learning of {Bayesian} Network Classifiers},
year = {2016},
number = {44},
pages = {1-35},
volume = {17},
abstract = {Ever increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, capacity to learn very complex multivariate probability distributions) is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub- model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.},
keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
related = {learning-complex-conditional-probabilities-from-data},
url = {http://jmlr.org/papers/v17/martinez16a.html},
}

ABSTRACT Ever increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, capacity to learn very complex multivariate probability distributions) is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub- model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.

[4] Naive-Bayes Inspired Effective Pre-Conditioner for Speeding-up Logistic Regression.
Zaidi, N., Carman, M., Cerquides, J., & Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 1097-1102, 2014.
[Bibtex] [Abstract]

@InProceedings{ZaidiEtAl14,
author = {Zaidi, N. and Carman, M. and Cerquides, J. and Webb, G. I.},
booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
title = {Naive-{Bayes} Inspired Effective Pre-Conditioner for Speeding-up Logistic Regression},
year = {2014},
pages = {1097-1102},
abstract = {We propose an alternative parameterization of
Logistic Regression (LR) for the categorical data, multi-class
setting. LR optimizes the conditional log-likelihood over the
training data and is based on an iterative optimization procedure
to tune this objective function. The optimization procedure
employed may be sensitive to scale and hence an effective
pre-conditioning method is recommended. Many problems in
machine learning involve arbitrary scales or categorical data
(where simple standardization of features is not applicable).
The problem can be alleviated by using optimization routines
that are invariant to scale such as (second-order) Newton
methods. However, computing and inverting the Hessian is a
costly procedure and not feasible for big data. Thus one must
often rely on first-order methods such as gradient descent (GD),
stochastic gradient descent (SGD) or approximate secondorder
such as quasi-Newton (QN) routines, which are not
invariant to scale. This paper proposes a simple yet effective
pre-conditioner for speeding-up LR based on naive Bayes
conditional probability estimates. The idea is to scale each
attribute by the log of the conditional probability of that
attribute given the class. This formulation substantially speeds up
LR's convergence. It also provides a weighted naive Bayes
formulation which yields an effective framework for hybrid
generative-discriminative classification.},
doi = {10.1109/ICDM.2014.53},
keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
related = {combining-generative-and-discriminative-learning},
}

ABSTRACT We propose an alternative parameterization of Logistic Regression (LR) for the categorical data, multi-class setting. LR optimizes the conditional log-likelihood over the training data and is based on an iterative optimization procedure to tune this objective function. The optimization procedure employed may be sensitive to scale and hence an effective pre-conditioning method is recommended. Many problems in machine learning involve arbitrary scales or categorical data (where simple standardization of features is not applicable). The problem can be alleviated by using optimization routines that are invariant to scale such as (second-order) Newton methods. However, computing and inverting the Hessian is a costly procedure and not feasible for big data. Thus one must often rely on first-order methods such as gradient descent (GD), stochastic gradient descent (SGD) or approximate secondorder such as quasi-Newton (QN) routines, which are not invariant to scale. This paper proposes a simple yet effective pre-conditioner for speeding-up LR based on naive Bayes conditional probability estimates. The idea is to scale each attribute by the log of the conditional probability of that attribute given the class. This formulation substantially speeds up LR's convergence. It also provides a weighted naive Bayes formulation which yields an effective framework for hybrid generative-discriminative classification.

[5] Preconditioning an Artificial Neural Network Using Naive Bayes.
Zaidi, N. A., Petitjean, F., & Webb, G. I.
Proceedings of the 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2016, pp. 341-353, 2016.
[Bibtex] [Abstract]

@InProceedings{ZaidiEtAl16,
Title = {Preconditioning an Artificial Neural Network Using Naive {Bayes}},
Author = {Zaidi, Nayyar A. and Petitjean, Francois and Webb, Geoffrey I.},
Booktitle = {Proceedings of the 20th {Pacific-Asia} Conference on Advances in Knowledge Discovery and Data Mining, {PAKDD} 2016},
Year = {2016},
Editor = {Bailey, James and Khan, Latifur and Washio, Takashi and Dobbie, Gill and Huang, Zhexue Joshua and Wang, Ruili},
Pages = {341-353},
Publisher = {Springer International Publishing},
Abstract = {Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.},
Doi = {10.1007/978-3-319-31753-3_28},
ISBN = {978-3-319-31753-3},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://dx.doi.org/10.1007/978-3-319-31753-3_28}
}

ABSTRACT Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.

[6] ALRn: Accelerated higher-order logistic regression.
Zaidi, N. A., Webb, G. I., Carman, M. J., Petitjean, F., & Cerquides, J.
Machine Learning, 104(2), 151-194, 2016.
[Bibtex] [Abstract]

@Article{ZaidiEtAl16b,
author = {Zaidi, Nayyar A. and Webb, Geoffrey I. and Carman, Mark J. and Petitjean, Francois and Cerquides, Jesus},
journal = {Machine Learning},
title = {{ALRn}: Accelerated higher-order logistic regression},
year = {2016},
issn = {1573-0565},
number = {2},
pages = {151-194},
volume = {104},
abstract = {This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.},
doi = {10.1007/s10994-016-5574-8},
keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
related = {combining-generative-and-discriminative-learning},
url = {http://rdcu.be/unVb},
}

ABSTRACT This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.

[7] Further Experimental Evidence Against The Utility Of Occam's Razor.
Webb, G. I.
Journal of Artificial Intelligence Research, 4, 397-417, 1996.
[Bibtex] [Abstract]

@Article{Webb96b,
Title = {Further Experimental Evidence Against The Utility Of Occam's Razor},
Author = {Webb, G. I.},
Journal = {Journal of Artificial Intelligence Research},
Year = {1996},
Pages = {397-417},
Volume = {4},
Abstract = {This paper presents new experimental evidence against the utility of Occam's razor. A systematic procedure is presented for post-processing decision trees produced by C4.5. This procedure was derived by rejecting Occam's razor and instead attending to the assumption that similar objects are likely to belong to the same class. It increases a decision tree's complexity without altering the performance of that tree on the training data from which it is inferred. The resulting more complex decision trees are demonstrated to have, on average, for a variety of common learning tasks, higher predictive accuracy than the less complex original decision trees. This result raises considerable doubt about the utility of Occam's razor as it is commonly applied in modern machine learning.},
Address = {Menlo Park, CA},
Doi = {10.1613/jair.228},
Keywords = {Decision Trees and Decision Tree Grafting and Occams Razor},
Publisher = {AAAI Press},
Related = {occams-razor-in-machine-learning}
}

ABSTRACT This paper presents new experimental evidence against the utility of Occam's razor. A systematic procedure is presented for post-processing decision trees produced by C4.5. This procedure was derived by rejecting Occam's razor and instead attending to the assumption that similar objects are likely to belong to the same class. It increases a decision tree's complexity without altering the performance of that tree on the training data from which it is inferred. The resulting more complex decision trees are demonstrated to have, on average, for a variety of common learning tasks, higher predictive accuracy than the less complex original decision trees. This result raises considerable doubt about the utility of Occam's razor as it is commonly applied in modern machine learning.

[8] Decision Tree Grafting.
Webb, G. I.
Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI 97), San Francisco, pp. 846-851, 1997.
[Bibtex] [Abstract]

@InProceedings{Webb97,
Title = {Decision Tree Grafting},
Author = {Webb, G. I.},
Booktitle = {Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence ({IJCAI} 97)},
Year = {1997},
Address = {San Francisco},
Pages = {846-851},
Publisher = {Morgan Kaufmann},
Abstract = {This paper extends recent work on decision tree grafting. Grafting is an inductive process that adds nodes to inferred decision trees. This process is demonstrated to frequently improve predictive accuracy. Superficial analysis might suggest that decision tree grafting is the direct reverse of pruning. To the contrary, it is argued that the two processes are complementary. This is because, like standard tree growing techniques, pruning uses only local information, whereas grafting uses non-local information. The use of both pruning and grafting in conjunction is demonstrated to provide the best general predictive accuracy over a representative selection of learning tasks.},
Audit-trail = {PDF posted with the permission of IJCAI Inc},
Keywords = {Decision Trees and Decision Tree Grafting and Occams Razor},
Location = {Nagoya, Japan},
Related = {decision-tree-grafting}
}

ABSTRACT This paper extends recent work on decision tree grafting. Grafting is an inductive process that adds nodes to inferred decision trees. This process is demonstrated to frequently improve predictive accuracy. Superficial analysis might suggest that decision tree grafting is the direct reverse of pruning. To the contrary, it is argued that the two processes are complementary. This is because, like standard tree growing techniques, pruning uses only local information, whereas grafting uses non-local information. The use of both pruning and grafting in conjunction is demonstrated to provide the best general predictive accuracy over a representative selection of learning tasks.

[9] Decision Tree Grafting From The All Tests But One Partition.
Webb, G. I.
Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI 99), San Francisco, pp. 702-707, 1999.
[Bibtex] [Abstract]

@InProceedings{Webb99,
Title = {Decision Tree Grafting From The All Tests But One Partition},
Author = {Webb, G. I.},
Booktitle = {Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence ({IJCAI} 99)},
Year = {1999},
Address = {San Francisco},
Editor = {Dean, T.},
Pages = {702-707},
Publisher = {Morgan Kaufmann},
Abstract = {Decision tree grafting adds nodes to an existing decision tree with the objective of reducing prediction error. A new grafting algorithm is presented that considers one set of training data only for each leaf of the initial decision tree, the set of cases that fail at most one test on the path to the leaf. This new technique is demonstrated to retain the error reduction power of the original grafting algorithm while dramatically reducing compute time and the complexity of the inferred tree. Bias/variance analysis reveal that the original grafting technique operated primarily by variance reduction while the new technique reduces both bias and variance.},
Audit-trail = {PDF posted with the permission of {IJCAI} Inc},
Keywords = {Decision Tree Learning and Decision Tree Grafting and Occams Razor},
Location = {Stockholm, Sweden},
Related = {decision-tree-grafting}
}

ABSTRACT Decision tree grafting adds nodes to an existing decision tree with the objective of reducing prediction error. A new grafting algorithm is presented that considers one set of training data only for each leaf of the initial decision tree, the set of cases that fail at most one test on the path to the leaf. This new technique is demonstrated to retain the error reduction power of the original grafting algorithm while dramatically reducing compute time and the complexity of the inferred tree. Bias/variance analysis reveal that the original grafting technique operated primarily by variance reduction while the new technique reduces both bias and variance.

[10] On The Effect of Data Set Size on Bias And Variance in Classification Learning.
Brain, D., & Webb, G. I.
Proceedings of the Fourth Australian Knowledge Acquisition Workshop (AKAW-99), Sydney, pp. 117-128, 1999.
[Bibtex] [Abstract]

@InProceedings{BrainWebb99,
Title = {On The Effect of Data Set Size on Bias And Variance in Classification Learning},
Author = {Brain, D. and Webb, G. I.},
Booktitle = {Proceedings of the Fourth {Australian} Knowledge Acquisition Workshop ({AKAW}-99)},
Year = {1999},
Address = {Sydney},
Editor = {Richards, D. and Beydoun, G. and Hoffmann, A. and Compton, P.},
Pages = {117-128},
Publisher = {The University of New South Wales},
Abstract = {With the advent of data mining, machine learning has come of age and is now a critical technology in many businesses. However, machine learning evolved in a different research context to that in which it now finds itself employed. A particularly important problem in the data mining world is working effectively with large data sets. However, most machine learning research has been conducted in the context of learning from very small data sets. To date most approaches to scaling up machine learning to large data sets have attempted to modify existing algorithms to deal with large data sets in a more computationally efficient and effective manner. But is this necessarily the best method? This paper explores the possibility of designing algorithms specifically for large data sets. Specifically, the paper looks at how increasing data set size affects bias and variance error decompositions for classification algorithms. Preliminary results of experiments to determine these effects are presented, showing that, as hypothesized variance can be expected to decrease as training set size increases. No clear effect of training set size on bias was observed. These results have profound implications for data mining from large data sets, indicating that developing effective learning algorithms for large data sets is not simply a matter of finding computationally efficient variants of existing learning algorithms.},
Audit-trail = {*},
Keywords = {Learning from large datasets and Bias-Variance},
Location = {Sydney, Australia},
Related = {learning-from-large-datasets}
}

ABSTRACT With the advent of data mining, machine learning has come of age and is now a critical technology in many businesses. However, machine learning evolved in a different research context to that in which it now finds itself employed. A particularly important problem in the data mining world is working effectively with large data sets. However, most machine learning research has been conducted in the context of learning from very small data sets. To date most approaches to scaling up machine learning to large data sets have attempted to modify existing algorithms to deal with large data sets in a more computationally efficient and effective manner. But is this necessarily the best method? This paper explores the possibility of designing algorithms specifically for large data sets. Specifically, the paper looks at how increasing data set size affects bias and variance error decompositions for classification algorithms. Preliminary results of experiments to determine these effects are presented, showing that, as hypothesized variance can be expected to decrease as training set size increases. No clear effect of training set size on bias was observed. These results have profound implications for data mining from large data sets, indicating that developing effective learning algorithms for large data sets is not simply a matter of finding computationally efficient variants of existing learning algorithms.