Naive Bayes is popular due to its efficiency, direct theoretical foundation and strong classification performance.Â Our techniques improve its accuracy by overcoming the deficiencies of its attribute independence assumption. The results are powerful yet computationally efficient Bayesian Network Classifiers.

The AnDE algorithms achieve this without resorting to model selection or search by averaging over all of a class of models that directly map lower-dimensional probabilities to the desired high-dimensional probability. Averaged One Dependence Estimators (AODE) is the best known of the AnDE algorithms. It provides particularly high prediction accuracy with relatively modest computational overheads. The AnDE algorithms learn in a single pass through the data, thus supporting both incremental learning and learning from data that is too large to fit in memory. They have complexity that is linear with respect to data quantity. These properties, together with their capacity to accurately model high-dimensional probability distributions, make them an extremely attractive option for large data.

An alternative approach is provided by Lazy Bayesian Rules (LBR) which does perform model selection.Â It provides very high prediction accuracy for large training sets, and is computationally efficient when few objects are to be classified for each training set.

Selective AnDE refines AnDE models requiring only a single additional pass through the data.

Selective KDB learns highly accurate models from large quantities of data in just three passes through the data.

AnDE, Selective AnDE and Selective KDB can all operate out-of-core, and thus support learning from very large data.

WANBIA and its variants WANBIA-C and ALR add discriminately learned weights to generatively learned linear models, greatly improving accuracy relative to pure generative learning and learning time relative to pure disciminative learning.

There have been many scientific applications of AODE.

### Software

An AnDE package for Weka is available here.

An AnDE package for R is available here.

The following are standard components in Weka:

- AODE, which is AnDE with n=1.
- AODEsr, which is AODE with subsumption resolution.
- LBR, Lazy Bayesian Rules.

An open source C++ implementation of Selective KDB can be downloaded here. A refinement that uses Hierarchical Dirichlet Processes to obtain exceptional predictive accuracy can be downloaded here.

An open source C++ implementation of Sample-base Selective Attribute ANDE can be downloaded here.

### Publications

A novel selective naive Bayes algorithm.

Chen, S., Webb, G. I., Liu, L., & Ma, X.

Knowledge-Based Systems, 192, Art. no. 105361, 2020.

[Bibtex] [Abstract]

```
@Article{CHEN2019105361,
author = {Shenglei Chen and Geoffrey I. Webb and Linyuan Liu and Xin Ma},
journal = {Knowledge-Based Systems},
title = {A novel selective naive Bayes algorithm},
year = {2020},
volume = {192},
abstract = {Naive Bayes is one of the most popular data mining algorithms. Its efficiency comes from the assumption of attribute independence, although this might be violated in many real-world data sets. Many efforts have been done to mitigate the assumption, among which attribute selection is an important approach. However, conventional efforts to perform attribute selection in naive Bayes suffer from heavy computational overhead. This paper proposes an efficient selective naive Bayes algorithm, which adopts only some of the attributes to construct selective naive Bayes models. These models are built in such a way that each one is a trivial extension of another. The most predictive selective naive Bayes model can be selected by the measures of incremental leave-one-out cross validation. As a result, attributes can be selected by efficient model selection. Empirical results demonstrate that the selective naive Bayes shows superior classification accuracy, yet at the same time maintains the simplicity and efficiency.},
articlenumber = {105361},
doi = {10.1016/j.knosys.2019.105361},
keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** Naive Bayes is one of the most popular data mining algorithms. Its efficiency comes from the assumption of attribute independence, although this might be violated in many real-world data sets. Many efforts have been done to mitigate the assumption, among which attribute selection is an important approach. However, conventional efforts to perform attribute selection in naive Bayes suffer from heavy computational overhead. This paper proposes an efficient selective naive Bayes algorithm, which adopts only some of the attributes to construct selective naive Bayes models. These models are built in such a way that each one is a trivial extension of another. The most predictive selective naive Bayes model can be selected by the measures of incremental leave-one-out cross validation. As a result, attributes can be selected by efficient model selection. Empirical results demonstrate that the selective naive Bayes shows superior classification accuracy, yet at the same time maintains the simplicity and efficiency.

Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes.

Petitjean, F., Buntine, W., Webb, G. I., & Zaidi, N.

Machine Learning, 107(8-10), 1303-1331, 2018.

[Bibtex] [Abstract]

```
@Article{Petitjean2018,
Title = {Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes},
Author = {Petitjean, Francois and Buntine, Wray and Webb, Geoffrey I. and Zaidi, Nayyar},
Journal = {Machine Learning},
Year = {2018},
Number = {8-10},
Pages = {1303-1331},
Volume = {107},
Abstract = {This paper introduces a novel parameter estimation method for the probability tables of Bayesian network classifiers (BNCs), using hierarchical Dirichlet processes (HDPs). The main result of this paper is to show that improved parameter estimation allows BNCs to outperform leading learning methods such as random forest for both 0--1 loss and RMSE, albeit just on categorical datasets. As data assets become larger, entering the hyped world of ``big'', efficient accurate classification requires three main elements: (1) classifiers with low-bias that can capture the fine-detail of large datasets (2) out-of-core learners that can learn from data without having to hold it all in main memory and (3) models that can classify new data very efficiently. The latest BNCs satisfy these requirements. Their bias can be controlled easily by increasing the number of parents of the nodes in the graph. Their structure can be learned out of core with a limited number of passes over the data. However, as the bias is made lower to accurately model classification tasks, so is the accuracy of their parameters' estimates, as each parameter is estimated from ever decreasing quantities of data. In this paper, we introduce the use of HDPs for accurate BNC parameter estimation even with lower bias. We conduct an extensive set of experiments on 68 standard datasets and demonstrate that our resulting classifiers perform very competitively with random forest in terms of prediction, while keeping the out-of-core capability and superior classification time.},
Doi = {10.1007/s10994-018-5718-0},
ISSN = {1573-0565},
Keywords = {Conditional Probability Estimation and Bayesian Learning},
Related = {learning-complex-conditional-probabilities-from-data},
Url = {https://rdcu.be/OX3g}
}
```

**ABSTRACT** This paper introduces a novel parameter estimation method for the probability tables of Bayesian network classifiers (BNCs), using hierarchical Dirichlet processes (HDPs). The main result of this paper is to show that improved parameter estimation allows BNCs to outperform leading learning methods such as random forest for both 0â€“1 loss and RMSE, albeit just on categorical datasets. As data assets become larger, entering the hyped world of ``big'', efficient accurate classification requires three main elements: (1) classifiers with low-bias that can capture the fine-detail of large datasets (2) out-of-core learners that can learn from data without having to hold it all in main memory and (3) models that can classify new data very efficiently. The latest BNCs satisfy these requirements. Their bias can be controlled easily by increasing the number of parents of the nodes in the graph. Their structure can be learned out of core with a limited number of passes over the data. However, as the bias is made lower to accurately model classification tasks, so is the accuracy of their parameters' estimates, as each parameter is estimated from ever decreasing quantities of data. In this paper, we introduce the use of HDPs for accurate BNC parameter estimation even with lower bias. We conduct an extensive set of experiments on 68 standard datasets and demonstrate that our resulting classifiers perform very competitively with random forest in terms of prediction, while keeping the out-of-core capability and superior classification time.

Efficient and Effective Accelerated Hierarchical Higher-Order Logistic Regression for Large Data Quantities.

Zaidi, N. A., Petitjean, F., & Webb, G. I.

Proceedings of the 2018 SIAM International Conference on Data Mining, pp. 459-467, 2018.

[Bibtex]

```
@InProceedings{Zaidi2018,
Title = {Efficient and Effective Accelerated Hierarchical Higher-Order Logistic Regression for Large Data Quantities},
Author = {Zaidi, Nayyar A. and Petitjean, Francois and Webb, Geoffrey I.},
Booktitle = {Proceedings of the 2018 SIAM International Conference on Data Mining},
Year = {2018},
Pages = {459-467},
Doi = {10.1137/1.9781611975321.52},
Keywords = {Conditional Probability Estimation and WANBIA},
Related = {combining-generative-and-discriminative-learning}
}
```

**ABSTRACT**

A Fast Trust-Region Newton Method for Softmax Logistic Regression.

Zaidi, N. A., & Webb, G. I.

Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 705-713, 2017.

[Bibtex] [Abstract]

```
@InProceedings{ZaidiWebb17,
Title = {A Fast Trust-Region Newton Method for Softmax Logistic Regression},
Author = {Zaidi, Nayyar A. and Webb, Geoffrey I.},
Booktitle = {Proceedings of the 2017 SIAM International Conference on Data Mining},
Year = {2017},
Organization = {SIAM},
Pages = {705-713},
Abstract = {With the emergence of big data, there has been a growing
interest in optimization routines that lead to faster convergence of Logistic Regression (LR). Among many optimization methods such as Gradient Descent, Quasi-Newton, Conjugate Gradient, etc., the Trust-region based truncated Newton method (TRON) algorithm has been shown to converge
the fastest. The TRON algorithm also forms an important
component of the highly efficient and widely used liblinear
package. It has been shown that the WANBIA-C trick of
scaling with the log of the naive Bayes conditional probabilities can greatly accelerate the convergence of LR trained using (first-order) Gradient Descent and (approximate second-order) Quasi-Newton optimization. In this work we study
the applicability of the WANBIA-C trick to TRON. We first
devise a TRON algorithm optimizing the softmax objective
function and then demonstrate that WANBIA-C style preconditioning can be beneficial for TRON, leading to an ex-
tremely fast (batch) LR algorithm. Second, we present a
comparative analysis of one-vs-all LR and softmax LR in
terms of the 0-1 Loss, Bias, Variance, RMSE, Log-Loss,
Training and Classication time, and show that softmax LR
leads to significantly better RMSE and Log-Loss. We evaluate our proposed approach on 51 benchmark datasets.},
Doi = {10.1137/1.9781611974973.79},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning}
}
```

**ABSTRACT** With the emergence of big data, there has been a growing interest in optimization routines that lead to faster convergence of Logistic Regression (LR). Among many optimization methods such as Gradient Descent, Quasi-Newton, Conjugate Gradient, etc., the Trust-region based truncated Newton method (TRON) algorithm has been shown to converge the fastest. The TRON algorithm also forms an important component of the highly efficient and widely used liblinear package. It has been shown that the WANBIA-C trick of scaling with the log of the naive Bayes conditional probabilities can greatly accelerate the convergence of LR trained using (first-order) Gradient Descent and (approximate second-order) Quasi-Newton optimization. In this work we study the applicability of the WANBIA-C trick to TRON. We first devise a TRON algorithm optimizing the softmax objective function and then demonstrate that WANBIA-C style preconditioning can be beneficial for TRON, leading to an ex- tremely fast (batch) LR algorithm. Second, we present a comparative analysis of one-vs-all LR and softmax LR in terms of the 0-1 Loss, Bias, Variance, RMSE, Log-Loss, Training and Classication time, and show that softmax LR leads to significantly better RMSE and Log-Loss. We evaluate our proposed approach on 51 benchmark datasets.

Selective AnDE for large data learning: a low-bias memory constrained approach.

Chen, S., Martinez, A. M., Webb, G. I., & Wang, L.

Knowledge and Information Systems, 50(2), 475-503, 2017.

[Bibtex] [Abstract]

```
@Article{ChenEtAl16,
Title = {Selective AnDE for large data learning: a low-bias memory constrained approach},
Author = {Chen, Shenglei and Martinez, Ana M. and Webb, Geoffrey I. and Wang, Limin},
Journal = {Knowledge and Information Systems},
Year = {2017},
Number = {2},
Pages = {475-503},
Volume = {50},
Abstract = {Learning from data that are too big to fit into memory poses great challenges to currently available learning approaches. Averaged n-Dependence Estimators (AnDE) allows for a flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence, AnDE is especially appropriate for learning from large quantities of data. Memory requirement in AnDE, however, increases combinatorially with the number of attributes and the parameter n. In large data learning, number of attributes is often large and we also expect high n to achieve low-bias classification. In order to achieve the lower bias of AnDE with higher n but with less memory requirement, we propose a memory constrained selective AnDE algorithm, in which two passes of learning through training examples are involved. The first pass performs attribute selection on super parents according to available memory, whereas the second one learns an AnDE model with parents only on the selected attributes. Extensive experiments show that the new selective AnDE has considerably lower bias and prediction error relative to A \$\$n'\$\$ n {\textasciiacutex} DE, where \$\$n' = n-1\$\$ n {\textasciiacutex} = n - 1 , while maintaining the same space complexity and similar time complexity. The proposed algorithm works well on categorical data. Numerical data sets need to be discretized first.},
Doi = {10.1007/s10115-016-0937-9},
ISSN = {0219-3116},
Keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
Related = {learning-complex-conditional-probabilities-from-data}
}
```

**ABSTRACT** Learning from data that are too big to fit into memory poses great challenges to currently available learning approaches. Averaged n-Dependence Estimators (AnDE) allows for a flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence, AnDE is especially appropriate for learning from large quantities of data. Memory requirement in AnDE, however, increases combinatorially with the number of attributes and the parameter n. In large data learning, number of attributes is often large and we also expect high n to achieve low-bias classification. In order to achieve the lower bias of AnDE with higher n but with less memory requirement, we propose a memory constrained selective AnDE algorithm, in which two passes of learning through training examples are involved. The first pass performs attribute selection on super parents according to available memory, whereas the second one learns an AnDE model with parents only on the selected attributes. Extensive experiments show that the new selective AnDE has considerably lower bias and prediction error relative to A \$\$n'\$\$ n {\textasciiacutex} DE, where \$\$n' = n-1\$\$ n {\textasciiacutex} = n - 1 , while maintaining the same space complexity and similar time complexity. The proposed algorithm works well on categorical data. Numerical data sets need to be discretized first.

Efficient Parameter Learning of Bayesian Network Classifiers.

Zaidi, N., Webb, G. I., Carman, M., Petitjean, F., Buntine, W., Hynes, H., & De Sterck, H.

Machine Learning, 106(9-10), 1289-1329, 2017.

[Bibtex] [Abstract]

```
@Article{ZaidiEtAl17,
Title = {Efficient Parameter Learning of Bayesian Network Classifiers},
Author = {Zaidi, N. and Webb, Geoffrey I. and Carman, M. and Petitjean, F. and Buntine, W. and Hynes, H. and De Sterck, H.},
Journal = {Machine Learning},
Year = {2017},
Number = {9-10},
Pages = {1289-1329},
Volume = {106},
Abstract = {Recent advances have demonstrated substantial benefits from learning with both generative and discriminative parameters. On the one hand, generative approaches address the estimation of the parameters of the joint distribution—P(y,x), which for most network types is very computationally efficient (a notable exception to this are Markov networks) and on the other hand, discriminative approaches address the estimation of the parameters of the posterior distribution—and, are more effective for classification, since they fit P(y|x) directly. However, discriminative approaches are less computationally efficient as the normalization factor in the conditional log-likelihood precludes the derivation of closed-form estimation of parameters. This paper introduces a new discriminative parameter learning method for Bayesian network classifiers that combines in an elegant fashion parameters learned using both generative and discriminative methods. The proposed method is discriminative in nature, but uses estimates of generative probabilities to speed-up the optimization process. A second contribution is to propose a simple framework to characterize the parameter learning task for Bayesian network classifiers. We conduct an extensive set of experiments on 72 standard datasets and demonstrate that our proposed discriminative parameterization provides an efficient alternative to other state-of-the-art parameterizations.},
Doi = {10.1007/s10994-016-5619-z},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://rdcu.be/oP1t}
}
```

**ABSTRACT** Recent advances have demonstrated substantial benefits from learning with both generative and discriminative parameters. On the one hand, generative approaches address the estimation of the parameters of the joint distribution—P(y,x), which for most network types is very computationally efficient (a notable exception to this are Markov networks) and on the other hand, discriminative approaches address the estimation of the parameters of the posterior distribution—and, are more effective for classification, since they fit P(y|x) directly. However, discriminative approaches are less computationally efficient as the normalization factor in the conditional log-likelihood precludes the derivation of closed-form estimation of parameters. This paper introduces a new discriminative parameter learning method for Bayesian network classifiers that combines in an elegant fashion parameters learned using both generative and discriminative methods. The proposed method is discriminative in nature, but uses estimates of generative probabilities to speed-up the optimization process. A second contribution is to propose a simple framework to characterize the parameter learning task for Bayesian network classifiers. We conduct an extensive set of experiments on 72 standard datasets and demonstrate that our proposed discriminative parameterization provides an efficient alternative to other state-of-the-art parameterizations.

Sample-based Attribute Selective AnDE for Large Data.

Chen, S., Martinez, A., Webb, G., & Wang, L.

IEEE Transactions on Knowledge and Data Engineering, 29(1), 172-185, 2017.

[Bibtex] [Abstract]

```
@Article{ChenEtAl16b,
Title = {Sample-based Attribute Selective AnDE for Large Data},
Author = {Chen, S. and Martinez, A. and Webb, G. and Wang, L.},
Journal = {{IEEE} Transactions on Knowledge and Data Engineering},
Year = {2017},
Number = {1},
Pages = {172 - 185},
Volume = {29},
Abstract = {More and more applications come with large data sets in the past decade. However, existing algorithms cannot guarantee to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence AnDE is especially appropriate for large data learning. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE’s error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.},
Doi = {10.1109/TKDE.2016.2608881},
ISSN = {1041-4347},
Keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
Related = {learning-complex-conditional-probabilities-from-data}
}
```

**ABSTRACT** More and more applications come with large data sets in the past decade. However, existing algorithms cannot guarantee to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence AnDE is especially appropriate for large data learning. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE’s error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.

Scalable Learning of Bayesian Network Classifiers.

Martinez, A. M., Webb, G. I., Chen, S., & Zaidi, N. A.

Journal of Machine Learning Research, 17(44), 1-35, 2016.

[Bibtex] [Abstract]

```
@Article{MartinezEtAl16,
Title = {Scalable Learning of {Bayesian} Network Classifiers},
Author = {Martinez, Ana M. and Webb, Geoffrey I. and Chen, Shenglei and Zaidi, Nayyar A.},
Journal = {Journal of Machine Learning Research},
Year = {2016},
Number = {44},
Pages = {1-35},
Volume = {17},
Abstract = {Ever increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, capacity to learn very complex multivariate probability distributions) is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub- model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.},
Keywords = {Conditional Probability Estimation and AODE and Learning from large datasets and DP140100087},
Related = {learning-complex-conditional-probabilities-from-data},
Url = {http://jmlr.org/papers/v17/martinez16a.html}
}
```

**ABSTRACT** Ever increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, capacity to learn very complex multivariate probability distributions) is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub- model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.

ALRn: Accelerated higher-order logistic regression.

Zaidi, N. A., Webb, G. I., Carman, M. J., Petitjean, F., & Cerquides, J.

Machine Learning, 104(2), 151-194, 2016.

[Bibtex] [Abstract]

```
@Article{ZaidiEtAl16b,
Title = {{ALRn}: Accelerated higher-order logistic regression},
Author = {Zaidi, Nayyar A. and Webb, Geoffrey I. and Carman, Mark J. and Petitjean, Francois and Cerquides, Jesus},
Journal = {Machine Learning},
Year = {2016},
Number = {2},
Pages = {151-194},
Volume = {104},
Abstract = {This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.},
Doi = {10.1007/s10994-016-5574-8},
ISSN = {1573-0565},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://dx.doi.org/10.1007/s10994-016-5574-8}
}
```

**ABSTRACT** This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.

Preconditioning an Artificial Neural Network Using Naive Bayes.

Zaidi, N. A., Petitjean, F., & Webb, G. I.

Proceedings of the 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2016, pp. 341-353, 2016.

[Bibtex] [Abstract]

```
@InProceedings{ZaidiEtAl16,
Title = {Preconditioning an Artificial Neural Network Using Naive {Bayes}},
Author = {Zaidi, Nayyar A. and Petitjean, Francois and Webb, Geoffrey I.},
Booktitle = {Proceedings of the 20th {Pacific-Asia} Conference on Advances in Knowledge Discovery and Data Mining, {PAKDD} 2016},
Year = {2016},
Editor = {Bailey, James and Khan, Latifur and Washio, Takashi and Dobbie, Gill and Huang, Zhexue Joshua and Wang, Ruili},
Pages = {341-353},
Publisher = {Springer International Publishing},
Abstract = {Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.},
Doi = {10.1007/978-3-319-31753-3_28},
ISBN = {978-3-319-31753-3},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://dx.doi.org/10.1007/978-3-319-31753-3_28}
}
```

**ABSTRACT** Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.

Naive-Bayes Inspired Effective Pre-Conditioner for Speeding-up Logistic Regression.

Zaidi, N., Carman, M., Cerquides, J., & Webb, G. I.

Proceedings of the 14th IEEE International Conference on Data Mining, pp. 1097-1102, 2014.

[Bibtex] [Abstract]

```
@InProceedings{ZaidiEtAl14,
author = {Zaidi, N. and Carman, M. and Cerquides, J. and Webb, G. I.},
booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
title = {Naive-{Bayes} Inspired Effective Pre-Conditioner for Speeding-up Logistic Regression},
year = {2014},
pages = {1097-1102},
abstract = {We propose an alternative parameterization of
Logistic Regression (LR) for the categorical data, multi-class
setting. LR optimizes the conditional log-likelihood over the
training data and is based on an iterative optimization procedure
to tune this objective function. The optimization procedure
employed may be sensitive to scale and hence an effective
pre-conditioning method is recommended. Many problems in
machine learning involve arbitrary scales or categorical data
(where simple standardization of features is not applicable).
The problem can be alleviated by using optimization routines
that are invariant to scale such as (second-order) Newton
methods. However, computing and inverting the Hessian is a
costly procedure and not feasible for big data. Thus one must
often rely on first-order methods such as gradient descent (GD),
stochastic gradient descent (SGD) or approximate secondorder
such as quasi-Newton (QN) routines, which are not
invariant to scale. This paper proposes a simple yet effective
pre-conditioner for speeding-up LR based on naive Bayes
conditional probability estimates. The idea is to scale each
attribute by the log of the conditional probability of that
attribute given the class. This formulation substantially speeds up
LR's convergence. It also provides a weighted naive Bayes
formulation which yields an effective framework for hybrid
generative-discriminative classification.},
doi = {10.1109/ICDM.2014.53},
keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
related = {combining-generative-and-discriminative-learning},
}
```

**ABSTRACT** We propose an alternative parameterization of Logistic Regression (LR) for the categorical data, multi-class setting. LR optimizes the conditional log-likelihood over the training data and is based on an iterative optimization procedure to tune this objective function. The optimization procedure employed may be sensitive to scale and hence an effective pre-conditioning method is recommended. Many problems in machine learning involve arbitrary scales or categorical data (where simple standardization of features is not applicable). The problem can be alleviated by using optimization routines that are invariant to scale such as (second-order) Newton methods. However, computing and inverting the Hessian is a costly procedure and not feasible for big data. Thus one must often rely on first-order methods such as gradient descent (GD), stochastic gradient descent (SGD) or approximate secondorder such as quasi-Newton (QN) routines, which are not invariant to scale. This paper proposes a simple yet effective pre-conditioner for speeding-up LR based on naive Bayes conditional probability estimates. The idea is to scale each attribute by the log of the conditional probability of that attribute given the class. This formulation substantially speeds up LR's convergence. It also provides a weighted naive Bayes formulation which yields an effective framework for hybrid generative-discriminative classification.

Highly Scalable Attribute Selection for AODE.

Chen, S., Martinez, A., & Webb, G. I.

Proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 86-97, 2014.

[Bibtex] [Abstract]

```
@InProceedings{ChenEtAl14,
author = {Chen, S. and Martinez, A. and Webb, G. I.},
booktitle = {Proceedings of the 18th {Pacific}-{Asia} Conference on Knowledge Discovery and Data Mining},
title = {Highly Scalable Attribute Selection for AODE},
year = {2014},
pages = {86-97},
abstract = {Averaged One-Dependence Estimators (AODE) is a popular
and effective approach to Bayesian learning. In this paper, a new
attribute selection approach is proposed for AODE. It can search in a
large model space, while it requires only a single extra pass through the
training data, resulting in a computationally efficient two-pass learning
algorithm. The experimental results indicate that the new technique significantly
reduces AODE.s bias at the cost of a modest increase in training
time. Its low bias and computational efficiency make it an attractive
algorithm for learning from big data.},
doi = {10.1007/978-3-319-06605-9_8},
keywords = {Conditional Probability Estimation and AODE and DP140100087},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** Averaged One-Dependence Estimators (AODE) is a popular and effective approach to Bayesian learning. In this paper, a new attribute selection approach is proposed for AODE. It can search in a large model space, while it requires only a single extra pass through the training data, resulting in a computationally efficient two-pass learning algorithm. The experimental results indicate that the new technique significantly reduces AODE.s bias at the cost of a modest increase in training time. Its low bias and computational efficiency make it an attractive algorithm for learning from big data.

Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting.

Zaidi, N. A., Cerquides, J., Carman, M. J., & Webb, G. I.

Journal of Machine Learning Research, 14, 1947-1988, 2013.

[Bibtex] [Abstract]

```
@Article{Zaidi2013,
Title = {Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting},
Author = {Zaidi, Nayyar A. and Cerquides, Jesus and Carman, Mark J. and Webb, Geoffrey I.},
Journal = {Journal of Machine Learning Research},
Year = {2013},
Pages = {1947-1988},
Volume = {14},
Abstract = {Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE.},
Keywords = {Conditional Probability Estimation and WANBIA},
Related = {combining-generative-and-discriminative-learning},
Url = {http://jmlr.org/papers/volume14/zaidi13a/zaidi13a.pdf},
Urltext = {Link to paper on JMLR site}
}
```

**ABSTRACT** Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE.

Fast and Effective Single Pass Bayesian Learning.

Zaidi, N., & Webb, G. I.

Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 149-160, 2013.

[Bibtex]

```
@InProceedings{ZaidiWebb13,
Title = {Fast and Effective Single Pass Bayesian Learning},
Author = {Zaidi, N. and Webb, G. I.},
Booktitle = {Proceedings of the 17th {Pacific}-{Asia} Conference on Knowledge Discovery and Data Mining},
Year = {2013},
Pages = {149-160},
Doi = {10.1007/978-3-642-37453-1_13},
Keywords = {Conditional Probability Estimation and AODE},
Related = {learning-complex-conditional-probabilities-from-data}
}
```

**ABSTRACT**

Non-Disjoint Discretization for Aggregating One-Dependence Estimator Classifiers.

Martinez, A., Webb, G. I., Flores, M., & Gamez, J.

Proceedings of the 7th International Conference on Hybrid Artificial Intelligent Systems, Berlin / Heidelberg, pp. 151-162, 2012.

[Bibtex]

```
@InProceedings{MartinezEtAl12,
Title = {Non-Disjoint Discretization for Aggregating One-Dependence Estimator Classifiers},
Author = {Martinez, A. and Webb, G. I. and Flores, M. and Gamez, J.},
Booktitle = {Proceedings of the 7th International Conference on Hybrid Artificial Intelligent Systems},
Year = {2012},
Address = {Berlin / Heidelberg},
Pages = {151-162},
Publisher = {Springer},
ISBN = {978-3-642-28930-9},
Keywords = {Conditional Probability Estimation and AODE and discretization for naive bayes},
Related = {discretization-for-naive-bayes}
}
```

**ABSTRACT**

Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly naive Bayesian classification.

Webb, G. I., Boughton, J., Zheng, F., Ting, K. M., & Salem, H.

Machine Learning, 86(2), 233-272, 2012.

[Bibtex] [Abstract]

```
@Article{WebbEtAl12,
author = {Webb, G. I. and Boughton, J. and Zheng, F. and Ting, K. M. and Salem, H.},
journal = {Machine Learning},
title = {Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly naive {Bayesian} classification},
year = {2012},
issn = {0885-6125},
number = {2},
pages = {233-272},
volume = {86},
abstract = {Averaged n-Dependence Estimators (AnDE) is an approach to probabilistic classification learning that learns by extrapolation from marginal
to full-multivariate probability distributions. It utilizes a single parameter that transforms the approach between a low-variance high-bias learner
(Naive Bayes) and a high-variance low-bias learner with Bayes optimal
asymptotic error. It extends the underlying strategy of Averaged One-Dependence Estimators (AODE), which relaxes the Naive Bayes independence assumption while retaining many of Naive Bayes' desirable computational and theoretical properties. AnDE further relaxes the independence assumption by generalizing AODE to higher-levels of dependence.
Extensive experimental evaluation shows that the bias-variance trade-off
for Averaged 2-Dependence Estimators results in strong predictive accuracy over a wide range of data sets. It has training time linear with
respect to the number of examples, supports incremental learning, handles directly missing values, and is robust in the face of noise. Beyond
the practical utility of its lower-dimensional variants, AnDE is of interest
in that it demonstrates that it is possible to create low-bias high-variance
generative learners and suggests strategies for developing even more powerful classifiers.},
address = {Netherlands},
keywords = {Conditional Probability Estimation and AODE},
publisher = {Springer},
related = {learning-complex-conditional-probabilities-from-data},
url = {http://dx.doi.org/10.1007/s10994-011-5263-6},
urltext = {Link to paper via SpringerLink},
}
```

**ABSTRACT** Averaged n-Dependence Estimators (AnDE) is an approach to probabilistic classification learning that learns by extrapolation from marginal to full-multivariate probability distributions. It utilizes a single parameter that transforms the approach between a low-variance high-bias learner (Naive Bayes) and a high-variance low-bias learner with Bayes optimal asymptotic error. It extends the underlying strategy of Averaged One-Dependence Estimators (AODE), which relaxes the Naive Bayes independence assumption while retaining many of Naive Bayes' desirable computational and theoretical properties. AnDE further relaxes the independence assumption by generalizing AODE to higher-levels of dependence. Extensive experimental evaluation shows that the bias-variance trade-off for Averaged 2-Dependence Estimators results in strong predictive accuracy over a wide range of data sets. It has training time linear with respect to the number of examples, supports incremental learning, handles directly missing values, and is robust in the face of noise. Beyond the practical utility of its lower-dimensional variants, AnDE is of interest in that it demonstrates that it is possible to create low-bias high-variance generative learners and suggests strategies for developing even more powerful classifiers.

Techniques for Efficient Learning without Search.

Salem, H., Suraweera, P., Webb, G. I., & Boughton, J. R.

Proceedings of the 16th Pacific-Asia Conference, PAKDD 2012, Berlin/Heidelberg, pp. 50-61, 2012.

[Bibtex]

```
@InProceedings{SalemEtAl12,
author = {Salem, H. and Suraweera, P. and Webb, G. I. and Boughton, J. R.},
booktitle = {Proceedings of the 16th {Pacific}-{Asia} Conference, PAKDD 2012},
title = {Techniques for Efficient Learning without Search},
year = {2012},
address = {Berlin/Heidelberg},
pages = {50-61},
publisher = {Springer},
keywords = {Conditional Probability Estimation and AODE},
location = {Kuala Lumpur, Malaysia},
related = {learning-complex-conditional-probabilities-from-data},
url = {http://link.springer.com/chapter/10.1007%2F978-3-642-30217-6_5},
}
```

**ABSTRACT**

Subsumption Resolution: An Efficient and Effective Technique for Semi-Naive Bayesian Learning.

Zheng, F., Webb, G. I., Suraweera, P., & Zhu, L.

Machine Learning, 87(1), 93-125, 2012.

[Bibtex] [Abstract]

```
@Article{ZhengEtAl12,
author = {Zheng, F. and Webb, G. I. and Suraweera, P. and Zhu, L.},
journal = {Machine Learning},
title = {Subsumption Resolution: An Efficient and Effective Technique for Semi-Naive Bayesian Learning},
year = {2012},
issn = {0885-6125},
number = {1},
pages = {93-125},
volume = {87},
abstract = {Semi-naive Bayesian techniques seek to improve the accuracy of naive
Bayes (NB) by relaxing the attribute independence assumption. We present a new
type of semi-naive Bayesian operation, Subsumption Resolution (SR), which efficiently identifies occurrences of the specialization-generalization relationship and
eliminates generalizations at classification time.We extend SR to Near-Subsumption
Resolution (NSR) to delete near.generalizations in addition to generalizations. We
develop two versions of SR: one that performs SR during training, called eager SR
(ESR), and another that performs SR during testing, called lazy SR (LSR).We inves-
tigate the effect of ESR, LSR, NSR and conventional attribute elimination (BSE) on
NB and Averaged One-Dependence Estimators (AODE), a powerful alternative to
NB. BSE imposes very high training time overheads on NB and AODE accompanied
by varying decreases in classification time overheads. ESR, LSR and NSR impose
high training time and test time overheads on NB. However, LSR imposes no extra
training time overheads and only modest test time overheads on AODE, while ESR
and NSR impose modest training and test time overheads on AODE. Our extensive
experimental comparison on sixty UCI data sets shows that applying BSE, LSR or
NSR to NB significantly improves both zero-one loss and RMSE, while applying
BSE, ESR or NSR to AODE significantly improves zero-one loss and RMSE and
applying LSR to AODE significantly improves zero-one loss. The Friedman test and
Nemenyi test show that AODE with ESR or NSR have a significant zero-one loss and
RMSE advantage over Logistic Regression and a zero-one loss advantage overWeka.s
LibSVM implementation with a grid parameter search on categorical data. AODE
with LSR has a zero-one loss advantage over Logistic Regression and comparable
zero-one loss with LibSVM. Finally, we examine the circumstances under which the
elimination of near-generalizations proves beneficial.},
address = {Netherlands},
doi = {10.1007/s10994-011-5275-2},
keywords = {Conditional Probability Estimation and AODE},
publisher = {Springer},
related = {learning-complex-conditional-probabilities-from-data},
urltext = {Link to paper via SpringerLink},
}
```

**ABSTRACT** Semi-naive Bayesian techniques seek to improve the accuracy of naive Bayes (NB) by relaxing the attribute independence assumption. We present a new type of semi-naive Bayesian operation, Subsumption Resolution (SR), which efficiently identifies occurrences of the specialization-generalization relationship and eliminates generalizations at classification time.We extend SR to Near-Subsumption Resolution (NSR) to delete near.generalizations in addition to generalizations. We develop two versions of SR: one that performs SR during training, called eager SR (ESR), and another that performs SR during testing, called lazy SR (LSR).We inves- tigate the effect of ESR, LSR, NSR and conventional attribute elimination (BSE) on NB and Averaged One-Dependence Estimators (AODE), a powerful alternative to NB. BSE imposes very high training time overheads on NB and AODE accompanied by varying decreases in classification time overheads. ESR, LSR and NSR impose high training time and test time overheads on NB. However, LSR imposes no extra training time overheads and only modest test time overheads on AODE, while ESR and NSR impose modest training and test time overheads on AODE. Our extensive experimental comparison on sixty UCI data sets shows that applying BSE, LSR or NSR to NB significantly improves both zero-one loss and RMSE, while applying BSE, ESR or NSR to AODE significantly improves zero-one loss and RMSE and applying LSR to AODE significantly improves zero-one loss. The Friedman test and Nemenyi test show that AODE with ESR or NSR have a significant zero-one loss and RMSE advantage over Logistic Regression and a zero-one loss advantage overWeka.s LibSVM implementation with a grid parameter search on categorical data. AODE with LSR has a zero-one loss advantage over Logistic Regression and comparable zero-one loss with LibSVM. Finally, we examine the circumstances under which the elimination of near-generalizations proves beneficial.

A Comparative Study of Bandwidth Choice in Kernel Density Estimation for Naive Bayesian Classification.

Liu, B., Yang, Y., Webb, G. I., & Boughton, J.

Proceedings of the 13th Pacific-Asia Conference, PAKDD 2009, Berlin/Heidelberg, pp. 302-313, 2009.

[Bibtex]

```
@InProceedings{LiuYangWebbBoughton09,
author = {Liu, B. and Yang, Y. and Webb, G. I. and Boughton, J.},
booktitle = {Proceedings of the 13th {Pacific}-{Asia} Conference, PAKDD 2009},
title = {A Comparative Study of Bandwidth Choice in Kernel Density Estimation for Naive Bayesian Classification},
year = {2009},
address = {Berlin/Heidelberg},
pages = {302-313},
publisher = {Springer},
keywords = {Conditional Probability Estimation and AODE and Discretization for Naive Bayes},
location = {Bangkok, Thailand},
related = {discretization-for-naive-bayes},
url = {http://link.springer.com/chapter/10.1007%2F978-3-642-01307-2_29},
}
```

**ABSTRACT**

Discretization for Naive-Bayes Learning: Managing Discretization Bias and Variance.

Yang, Y., & Webb, G. I.

Machine Learning, 74(1), 39-74, 2009.

[Bibtex] [Abstract]

```
@Article{YangWebb09,
author = {Yang, Y. and Webb, G. I.},
journal = {Machine Learning},
title = {Discretization for Naive-Bayes Learning: Managing Discretization Bias and Variance},
year = {2009},
number = {1},
pages = {39-74},
volume = {74},
abstract = {Quantitative attributes are usually discretized in Naive-Bayes learning. We establish simple conditions under which discretization is equivalent to use of the true probability density function during naive-Bayes learning. The use of different discretization techniques can be expected to affect the classification bias and variance of generated naive-Bayes classifiers, effects we name discretization bias and variance. We argue that by properly managing discretization bias and variance, we can effectively reduce naive-Bayes classification error. In particular, we supply insights into managing discretization bias and variance by adjusting the number of intervals and the number of training instances contained in each interval. We accordingly propose proportional discretization and fixed frequency discretization, two efficient unsupervised discretization methods that are able to effectively manage discretization bias and variance. We evaluate our new techniques against four key discretization methods for naive-Bayes classifiers. The experimental results support our theoretical analyses by showing that with statistically significant frequency, naive-Bayes classifiers trained on data discretized by our new methods are able to achieve lower classification error than those trained on data discretized by current established discretization methods.},
address = {Netherlands},
audit-trail = {DOI 10.1007/s10994-008-5083-5},
doi = {10.1007/s10994-008-5083-5},
keywords = {Discretization for Naive Bayes and Conditional Probability Estimation and AODE},
publisher = {Springer},
related = {discretization-for-naive-bayes},
}
```

**ABSTRACT** Quantitative attributes are usually discretized in Naive-Bayes learning. We establish simple conditions under which discretization is equivalent to use of the true probability density function during naive-Bayes learning. The use of different discretization techniques can be expected to affect the classification bias and variance of generated naive-Bayes classifiers, effects we name discretization bias and variance. We argue that by properly managing discretization bias and variance, we can effectively reduce naive-Bayes classification error. In particular, we supply insights into managing discretization bias and variance by adjusting the number of intervals and the number of training instances contained in each interval. We accordingly propose proportional discretization and fixed frequency discretization, two efficient unsupervised discretization methods that are able to effectively manage discretization bias and variance. We evaluate our new techniques against four key discretization methods for naive-Bayes classifiers. The experimental results support our theoretical analyses by showing that with statistically significant frequency, naive-Bayes classifiers trained on data discretized by our new methods are able to achieve lower classification error than those trained on data discretized by current established discretization methods.

Anytime Classification for a Pool of Instances.

Hui, B., Yang, Y., & Webb, G. I.

Machine Learning, 77(1), 61-102, 2009.

[Bibtex] [Abstract]

```
@Article{HuiYangWebb09,
author = {Hui, B. and Yang, Y. and Webb, G. I.},
journal = {Machine Learning},
title = {Anytime Classification for a Pool of Instances},
year = {2009},
number = {1},
pages = {61-102},
volume = {77},
abstract = {In many real-world applications of classification learning, such as
credit card transaction vetting or classification embedded in sensor
nodes, multiple instances simultaneously require classification under
computational resource constraints such as limited time or limited
battery capacity. In such a situation, available computational
resources should be allocated across the instances in order to
optimize the overall classification efficacy and efficiency. We
propose a novel anytime classification framework, Scheduling Anytime
Averaged Probabilistic Estimators (SAAPE), which is capable of
classifying a pool of instances, delivering accurate results whenever
interrupted and optimizing the collective classification
performance. Following the practice of our previous anytime
classification system AAPE, SAAPE runs a sequence of very efficient
Bayesian probabilistic classifiers to classify each single
instance. Furthermore, SAAPE implements seven alternative scheduling
schemes to decide which instance gets available computational
resources next such that a new classifier can be applied to refine its
classification. We formally present each scheduling scheme's
definition, rationale and time complexity. We conduct large-scale
experiments using 60 benchmark data sets and diversified statistical
tests to evaluate SAAPE's performance on zero-one loss classification
as well as on probability estimation. We analyze each scheduling
scheme's advantage and disadvantage according to both theoretical
understandings and empirical observations. Consequently we identify
effective scheduling schemes that enable SAAPE to accomplish accurate
anytime classification for a pool of instances.},
address = {Netherlands},
audit-trail = {http://dx.doi.org/10.1007/s10994-009-5118-6},
doi = {10.1007/s10994-009-5118-6},
keywords = {Conditional Probability Estimation and AODE},
publisher = {Springer},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** In many real-world applications of classification learning, such as credit card transaction vetting or classification embedded in sensor nodes, multiple instances simultaneously require classification under computational resource constraints such as limited time or limited battery capacity. In such a situation, available computational resources should be allocated across the instances in order to optimize the overall classification efficacy and efficiency. We propose a novel anytime classification framework, Scheduling Anytime Averaged Probabilistic Estimators (SAAPE), which is capable of classifying a pool of instances, delivering accurate results whenever interrupted and optimizing the collective classification performance. Following the practice of our previous anytime classification system AAPE, SAAPE runs a sequence of very efficient Bayesian probabilistic classifiers to classify each single instance. Furthermore, SAAPE implements seven alternative scheduling schemes to decide which instance gets available computational resources next such that a new classifier can be applied to refine its classification. We formally present each scheduling scheme's definition, rationale and time complexity. We conduct large-scale experiments using 60 benchmark data sets and diversified statistical tests to evaluate SAAPE's performance on zero-one loss classification as well as on probability estimation. We analyze each scheduling scheme's advantage and disadvantage according to both theoretical understandings and empirical observations. Consequently we identify effective scheduling schemes that enable SAAPE to accomplish accurate anytime classification for a pool of instances.

To Select or To Weigh: A Comparative Study of Linear Combination Schemes for SuperParent-One-Dependence Estimators.

Yang, Y., Webb, G. I., Cerquides, J., Korb, K., Boughton, J., & Ting, K-M.

IEEE Transactions on Knowledge and Data Engineering, 19(12), 1652-1665, 2007.

[Bibtex] [Abstract]

```
@Article{YangWebbCerquideszKorbBoughtonTing07,
author = {Yang, Y. and Webb, G. I. and Cerquides, J. and Korb, K. and Boughton, J. and Ting, K-M.},
journal = {{IEEE} Transactions on Knowledge and Data Engineering},
title = {To Select or To Weigh: A Comparative Study of Linear Combination Schemes for SuperParent-One-Dependence Estimators},
year = {2007},
number = {12},
pages = {1652-1665},
volume = {19},
abstract = {We conduct a large-scale comparative study on linearly combining superparent-one-dependence estimators (SPODEs), a popular family of semi-naive Bayesian classifiers. Altogether 16 model selection and weighing schemes, 58 benchmark data sets, as well as various statistical tests are employed. This paper’s main contributions are three-fold. First, it formally presents each scheme’s definition, rationale and time complexity; and hence can serve as a comprehensive reference for researchers interested in ensemble learning. Second, it offers bias-variance analysis for each scheme’s classification error performance. Third, it identifies effective schemes that meet various needs in practice. This leads to accurate and fast classification algorithms with immediate and significant impact on real-world applications. Another important feature of our study is using a variety of statistical tests to evaluate multiple learning methods across multiple data sets.},
address = {Los Alamitos, CA},
doi = {10.1109/TKDE.2007.190650},
keywords = {Conditional Probability Estimation and AODE},
publisher = {{IEEE} Computer Society},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** We conduct a large-scale comparative study on linearly combining superparent-one-dependence estimators (SPODEs), a popular family of semi-naive Bayesian classifiers. Altogether 16 model selection and weighing schemes, 58 benchmark data sets, as well as various statistical tests are employed. This paper’s main contributions are three-fold. First, it formally presents each scheme’s definition, rationale and time complexity; and hence can serve as a comprehensive reference for researchers interested in ensemble learning. Second, it offers bias-variance analysis for each scheme’s classification error performance. Third, it identifies effective schemes that meet various needs in practice. This leads to accurate and fast classification algorithms with immediate and significant impact on real-world applications. Another important feature of our study is using a variety of statistical tests to evaluate multiple learning methods across multiple data sets.

Classifying under Computational Resource Constraints: Anytime Classification Using Probabilistic Estimators.

Yang, Y., Webb, G. I., Korb, K., & Ting, K-M.

Machine Learning, 69(1), 35-53, 2007.

[Bibtex] [Abstract]

```
@Article{YangWebbKorbTing07,
author = {Yang, Y. and Webb, G. I. and Korb, K. and Ting, K-M.},
journal = {Machine Learning},
title = {Classifying under Computational Resource Constraints: Anytime Classification Using Probabilistic Estimators},
year = {2007},
number = {1},
pages = {35-53},
volume = {69},
abstract = {In many online applications of machine learning, the computational resources available for classification will vary from time to time. Most techniques are designed to operate within the constraints of the minimum expected resources and fail to utilize further resources when they are available. We propose a novel anytime classification algorithm, anytime averaged probabilistic estimators (AAPE), which is capable of delivering strong prediction accuracy with little CPU time and utilizing additional CPU time to increase classification accuracy. The idea is to run an ordered sequence of very efficient Bayesian probabilistic estimators (single improvement steps) until classification time runs out. Theoretical studies and empirical validations reveal that by properly identifying, ordering, invoking and ensembling single improvement steps, AAPE is able to accomplish accurate classification whenever it is interrupted. It is also able to output class probability estimates beyond simple 0/1-loss classifications, as well as adeptly handle incremental learning.},
address = {Netherlands},
audit-trail = {DOI 10.1007/s10994-007-5020-z},
doi = {10.1007/s10994-007-5020-z},
keywords = {Conditional Probability Estimation and AODE},
publisher = {Springer},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** In many online applications of machine learning, the computational resources available for classification will vary from time to time. Most techniques are designed to operate within the constraints of the minimum expected resources and fail to utilize further resources when they are available. We propose a novel anytime classification algorithm, anytime averaged probabilistic estimators (AAPE), which is capable of delivering strong prediction accuracy with little CPU time and utilizing additional CPU time to increase classification accuracy. The idea is to run an ordered sequence of very efficient Bayesian probabilistic estimators (single improvement steps) until classification time runs out. Theoretical studies and empirical validations reveal that by properly identifying, ordering, invoking and ensembling single improvement steps, AAPE is able to accomplish accurate classification whenever it is interrupted. It is also able to output class probability estimates beyond simple 0/1-loss classifications, as well as adeptly handle incremental learning.

Finding the Right Family: Parent and Child Selection for Averaged One-Dependence Estimators.

Zheng, F., & Webb, G. I.

Lecture Notes in Artificial Intelligence 4710: Proceedings of the 18th European Conference on Machine Learning (ECML'07), Berlin/Heidelberg, pp. 490-501, 2007.

[Bibtex] [Abstract]

```
@InProceedings{ZhengWebb07,
author = {Zheng, F. and Webb, G. I.},
booktitle = {Lecture Notes in Artificial Intelligence 4710: Proceedings of the 18th European Conference on Machine Learning (ECML'07)},
title = {Finding the Right Family: Parent and Child Selection for Averaged One-Dependence Estimators},
year = {2007},
address = {Berlin/Heidelberg},
pages = {490-501},
publisher = {Springer-Verlag},
abstract = {Averaged One-Dependence Estimators (AODE) classifies by uniformly aggregating all qualified one-dependence estimators (ODEs). Its capacity to significantly improve naive Bayes' accuracy without undue time complexity has attracted substantial interest. Forward Sequential Selection and Backwards Sequential Elimination are effective wrapper techniques to identify and repair harmful interdependencies which have been profitably applied to naive Bayes. However, their straightforward application to AODE has previously proved ineffective. We investigate novel variants of these strategies. Our extensive experiments show that elimination of child attributes from within the constituent ODEs results in a significant improvement in probability estimate and reductions in bias and error relative to unmodified AODE. In contrast, elimination of complete constituent ODEs and the four types of attribute addition are found to be less effective and do not demonstrate any strong advantage over AODE. These surprising results lead to effective techniques for improving AODE's prediction accuracy.},
keywords = {Conditional Probability Estimation and AODE},
location = {Warsaw, Poland},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** Averaged One-Dependence Estimators (AODE) classifies by uniformly aggregating all qualified one-dependence estimators (ODEs). Its capacity to significantly improve naive Bayes' accuracy without undue time complexity has attracted substantial interest. Forward Sequential Selection and Backwards Sequential Elimination are effective wrapper techniques to identify and repair harmful interdependencies which have been profitably applied to naive Bayes. However, their straightforward application to AODE has previously proved ineffective. We investigate novel variants of these strategies. Our extensive experiments show that elimination of child attributes from within the constituent ODEs results in a significant improvement in probability estimate and reductions in bias and error relative to unmodified AODE. In contrast, elimination of complete constituent ODEs and the four types of attribute addition are found to be less effective and do not demonstrate any strong advantage over AODE. These surprising results lead to effective techniques for improving AODE's prediction accuracy.

Incremental Discretization for Naive-Bayes Classifier.

Lu, J., Yang, Y., & Webb, G. I.

Lecture Notes in Computer Science 4093: Proceedings of the Second International Conference on Advanced Data Mining and Applications (ADMA 2006), Berlin, pp. 223-238, 2006.

[Bibtex] [Abstract]

```
@InProceedings{LuYangWebb06,
author = {Lu, J. and Yang, Y. and Webb, G. I.},
booktitle = {Lecture Notes in Computer Science 4093: Proceedings of the Second International Conference on Advanced Data Mining and Applications (ADMA 2006)},
title = {Incremental Discretization for Naive-Bayes Classifier},
year = {2006},
address = {Berlin},
editor = {Li, Xue and Zaiane, Osmar R. and Li, Zhanhuai},
pages = {223-238},
publisher = {Springer},
abstract = {Naive-Bayes classifiers (NB) support incremental learning. However, the lack of effective incremental discretization methods has been hindering NB's incremental learning in face of quantitative data. This problem is further compounded by the fact that quantitative data are everywhere, from temperature readings to share prices. In this paper, we present a novel incremental discretization method for NB, incremental flexible frequency discretization (IFFD). IFFD discretizes values of a quantitative attribute into a sequence of intervals of flexible sizes. It allows online insertion and splitting operation on intervals. Theoretical analysis and experimental test are conducted to compare IFFD with alternative methods. Empirical evidence suggests that IFFD is efficient and effective. NB coupled with IFFD achieves a rapport between high learning efficiency and high classification accuracy in the context of incremental learning.},
keywords = {Conditional Probability Estimation and Discretization for Naive Bayes and Incremental Learning and Stream Mining},
location = {XiÆan, China},
related = {discretization-for-naive-bayes},
}
```

**ABSTRACT** Naive-Bayes classifiers (NB) support incremental learning. However, the lack of effective incremental discretization methods has been hindering NB's incremental learning in face of quantitative data. This problem is further compounded by the fact that quantitative data are everywhere, from temperature readings to share prices. In this paper, we present a novel incremental discretization method for NB, incremental flexible frequency discretization (IFFD). IFFD discretizes values of a quantitative attribute into a sequence of intervals of flexible sizes. It allows online insertion and splitting operation on intervals. Theoretical analysis and experimental test are conducted to compare IFFD with alternative methods. Empirical evidence suggests that IFFD is efficient and effective. NB coupled with IFFD achieves a rapport between high learning efficiency and high classification accuracy in the context of incremental learning.

To Select or To Weigh: A Comparative Study of Model Selection and Model Weighing for SPODE Ensembles.

Yang, Y., Webb, G. I., Cerquides, J., Korb, K., Boughton, J., & Ting, K-M.

Lecture Notes in Computer Science 4212: Proceedings of the 17th European Conference on Machine Learning (ECML'06), Berlin/Heidelberg, pp. 533-544, 2006.

[Bibtex] [Abstract]

```
@InProceedings{YangWebbCerquideKorbBoughtonTing06,
author = {Yang, Y. and Webb, G. I. and Cerquides, J. and Korb, K. and Boughton, J. and Ting, K-M.},
booktitle = {Lecture Notes in Computer Science 4212: Proceedings of the 17th European Conference on Machine Learning (ECML'06)},
title = {To Select or To Weigh: A Comparative Study of Model Selection and Model Weighing for SPODE Ensembles},
year = {2006},
address = {Berlin/Heidelberg},
editor = {Furkranz, J. and Scheffer, T. and Spiliopoulou, M.},
pages = {533-544},
publisher = {Springer-Verlag},
abstract = {An ensemble of Super-Parent-One-Dependence Estimators (SPODEs) offers a powerful yet simple alternative to naive Bayes classifiers, achieving significantly higher classification accuracy at a moderate cost in classification efficiency. Currently there exist two families of methodologies that ensemble candidate SPODEs for classification. One is to select only helpful SPODEs and uniformly average their probability estimates, a type of model selection. Another is to assign a weight to each SPODE and linearly combine their probability estimates, a methodology named model weighing. This paper presents a theoretical and empirical study comparing model selection and model weighing for ensembling SPODEs. The focus is on maximizing the ensemble's classification accuracy while minimizing its computational time. A number of representative selection and weighing schemes are studied, providing a comprehensive research on this topic and identifying effective schemes that provide alternative trades-offs between speed and expected error},
keywords = {Conditional Probability Estimation and AODE},
location = {Berlin, Germany},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** An ensemble of Super-Parent-One-Dependence Estimators (SPODEs) offers a powerful yet simple alternative to naive Bayes classifiers, achieving significantly higher classification accuracy at a moderate cost in classification efficiency. Currently there exist two families of methodologies that ensemble candidate SPODEs for classification. One is to select only helpful SPODEs and uniformly average their probability estimates, a type of model selection. Another is to assign a weight to each SPODE and linearly combine their probability estimates, a methodology named model weighing. This paper presents a theoretical and empirical study comparing model selection and model weighing for ensembling SPODEs. The focus is on maximizing the ensemble's classification accuracy while minimizing its computational time. A number of representative selection and weighing schemes are studied, providing a comprehensive research on this topic and identifying effective schemes that provide alternative trades-offs between speed and expected error

Efficient Lazy Elimination for Averaged One-Dependence Estimators.

Zheng, F., & Webb, G. I.

ACM International Conference Proceeding Series, Vol. 148: The Proceedings of the Twenty-third International Conference on Machine Learning (ICML'06), New York, NY, pp. 1113-1120, 2006.

[Bibtex] [Abstract]

```
@InProceedings{ZhengWebb06,
author = {Zheng, F. and Webb, G. I.},
booktitle = {ACM International Conference Proceeding Series, Vol. 148: The Proceedings of the Twenty-third International Conference on Machine Learning (ICML'06)},
title = {Efficient Lazy Elimination for Averaged One-Dependence Estimators},
year = {2006},
address = {New York, NY},
editor = {W. Cohen and A. Moore},
pages = {1113 - 1120},
publisher = {ACM Press},
abstract = {Semi-naive Bayesian classifiers seek to retain the numerous strengths of naive Bayes while reducing error by weakening the attribute independence assumption. Backwards Sequential Elimination (BSE) is a wrapper technique for attribute elimination that has proved effective at this task. We explore a new efficient technique, Lazy Elimination (LE), which eliminates highly related attribute-values at classification time without the computational overheads inherent in wrapper techniques. We analyze the effect of LE and BSE on Averaged One-Dependence Estimators (AODE), a state-of-the-art semi-naive Bayesian algorithm. Our extensive experiments show that LE significantly reduces bias and error without undue additional computation, while BSE significantly reduces bias but not error, with high training time complexity. In the context of AODE, LE has a significant advantage over BSE in both computational efficiency and error.},
audit-trail = {ISBN:1-59593-383-2, DOI http://doi.acm.org/10.1145/1143844.1143984},
keywords = {Conditional Probability Estimation and AODE},
location = {Pittsburgh, Pennsylvania},
related = {learning-complex-conditional-probabilities-from-data},
url = {http://dl.acm.org/authorize?N00547},
}
```

**ABSTRACT** Semi-naive Bayesian classifiers seek to retain the numerous strengths of naive Bayes while reducing error by weakening the attribute independence assumption. Backwards Sequential Elimination (BSE) is a wrapper technique for attribute elimination that has proved effective at this task. We explore a new efficient technique, Lazy Elimination (LE), which eliminates highly related attribute-values at classification time without the computational overheads inherent in wrapper techniques. We analyze the effect of LE and BSE on Averaged One-Dependence Estimators (AODE), a state-of-the-art semi-naive Bayesian algorithm. Our extensive experiments show that LE significantly reduces bias and error without undue additional computation, while BSE significantly reduces bias but not error, with high training time complexity. In the context of AODE, LE has a significant advantage over BSE in both computational efficiency and error.

A Comparative Study of Semi-naive Bayes Methods in Classification Learning.

Zheng, F., & I., W. G.

Proceedings of the Fourth Australasian Data Mining Conference (AusDM05), Sydney, pp. 141-156, 2005.

[Bibtex] [Abstract]

```
@InProceedings{ZhengWebb05,
author = {Zheng, F. and Webb. G. I.},
booktitle = {Proceedings of the Fourth Australasian Data Mining Conference (AusDM05)},
title = {A Comparative Study of Semi-naive Bayes Methods in Classification Learning},
year = {2005},
address = {Sydney},
editor = {Simoff, S.J. and Williams, G.J. and Galloway, J. and Kolyshkina, I.},
pages = {141-156},
publisher = {University of Technology},
abstract = {Numerous techniques have sought to improve the accuracy of Naive Bayes (NB) by alleviating the attribute interdependence problem. This paper summarizes these semi-naive Bayesian methods into two groups: those that apply conventional NB with a new attribute set, and those that alter NB by allowing inter-dependencies between attributes. We review eight typical semi-naive Bayesian learning algorithms and perform error analysis using the bias-variance decomposition on thirty-six natural domains from the UCI Machine Learning Repository. In analysing the results of these experiments we provide general recommendations for selection between methods.},
keywords = {AODE and Conditional Probability Estimation},
location = {Sydney, Australia},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** Numerous techniques have sought to improve the accuracy of Naive Bayes (NB) by alleviating the attribute interdependence problem. This paper summarizes these semi-naive Bayesian methods into two groups: those that apply conventional NB with a new attribute set, and those that alter NB by allowing inter-dependencies between attributes. We review eight typical semi-naive Bayesian learning algorithms and perform error analysis using the bias-variance decomposition on thirty-six natural domains from the UCI Machine Learning Repository. In analysing the results of these experiments we provide general recommendations for selection between methods.

Not So Naive Bayes: Aggregating One-Dependence Estimators.

Webb, G. I., Boughton, J., & Wang, Z.

Machine Learning, 58(1), 5-24, 2005.

[Bibtex] [Abstract]

```
@Article{WebbBoughtonWang05,
Title = {Not So Naive {Bayes}: Aggregating One-Dependence Estimators},
Author = {Webb, G. I. and Boughton, J. and Wang, Z.},
Journal = {Machine Learning},
Year = {2005},
Number = {1},
Pages = {5-24},
Volume = {58},
Abstract = {Of numerous proposals to improve the accuracy of naive Bayes by weakening its attribute independence assumption, both LBR and TAN have demonstrated remarkable error performance. However, both techniques obtain this outcome at a considerable computational cost. We present a new approach to weakening the attribute independence assumption by averaging all of a constrained class of classifiers. In extensive experiments this technique delivers comparable prediction accuracy to LBR and TAN with substantially improved computational efficiency.},
Address = {Netherlands},
Audit-trail = {3/5/04 Pre-print posted},
Doi = {10.1007/s10994-005-4258-6},
Keywords = {Conditional Probability Estimation and AODE},
Publisher = {Springer},
Related = {learning-complex-conditional-probabilities-from-data}
}
```

**ABSTRACT** Of numerous proposals to improve the accuracy of naive Bayes by weakening its attribute independence assumption, both LBR and TAN have demonstrated remarkable error performance. However, both techniques obtain this outcome at a considerable computational cost. We present a new approach to weakening the attribute independence assumption by averaging all of a constrained class of classifiers. In extensive experiments this technique delivers comparable prediction accuracy to LBR and TAN with substantially improved computational efficiency.

Selective Augmented Bayesian Network Classifiers Based on Rough Set Theory.

Wang, Z., Webb, G. I., & Zheng, F.

Lecture Notes in Computer Science Vol. 3056: Proceedings of the Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 04), Berlin/Heidelberg, pp. 319-328, 2004.

[Bibtex] [Abstract]

```
@InProceedings{WangWebbZheng04,
author = {Wang, Z. and Webb, G. I. and Zheng, F.},
booktitle = {Lecture Notes in Computer Science Vol. 3056: Proceedings of the Eighth {Pacific}-{Asia} Conference on Knowledge Discovery and Data Mining (PAKDD 04)},
title = {Selective Augmented Bayesian Network Classifiers Based on Rough Set Theory},
year = {2004},
address = {Berlin/Heidelberg},
editor = {Dai, H. and Srikant, R. and Zhang, C.},
pages = {319-328},
publisher = {Springer},
abstract = {The naive Bayes classifier is widely used in interactive applications due to its computational efficiency, direct theoretical base, and competitive accuracy. However, its attribute independence assumption can result in sub-optimal accuracy. A number of techniques have explored simple relaxations of the attribute independence assumption in order to increase accuracy. TAN, Tree-Augmented Naive Bayes, is a state-of-the-art extension of naive Bayes, that can express limited forms of inter-dependence among attributes. Rough sets theory provides tools for expressing inexact or partial dependencies within dataset. In this paper, we present a variant of TAN and compare their tree classifier structures, which can be thought of as a selective restricted trees Bayesian classifier. It delivers lower error than both pre-existing state-of-the-art TAN-based classifiers, with substantially less computation than is required by the SuperParent approach.},
audit-trail = {PDF posted 23/8},
keywords = {Conditional Probability Estimation and AODE and Learning from large datasets},
location = {Sydney, Australia},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** The naive Bayes classifier is widely used in interactive applications due to its computational efficiency, direct theoretical base, and competitive accuracy. However, its attribute independence assumption can result in sub-optimal accuracy. A number of techniques have explored simple relaxations of the attribute independence assumption in order to increase accuracy. TAN, Tree-Augmented Naive Bayes, is a state-of-the-art extension of naive Bayes, that can express limited forms of inter-dependence among attributes. Rough sets theory provides tools for expressing inexact or partial dependencies within dataset. In this paper, we present a variant of TAN and compare their tree classifier structures, which can be thought of as a selective restricted trees Bayesian classifier. It delivers lower error than both pre-existing state-of-the-art TAN-based classifiers, with substantially less computation than is required by the SuperParent approach.

A New Restricted Bayesian Network Classifier.

Shi, H., Wang, Z., Webb, G. I., & Huang, H.

Lecture Notes in Artificial Intelligence Vol. 2637: Proceedings of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'03), Berlin/Heidelberg, pp. 265-270, 2003.

[Bibtex] [Abstract]

```
@InProceedings{ShiWangWebbHuang03,
author = {Shi, H. and Wang, Z. and Webb, G. I. and Huang, H.},
booktitle = {Lecture Notes in Artificial Intelligence Vol. 2637: Proceedings of the Seventh {Pacific}-{Asia} Conference on Knowledge Discovery and Data Mining (PAKDD'03)},
title = {A New Restricted Bayesian Network Classifier},
year = {2003},
address = {Berlin/Heidelberg},
editor = {Whang, K-Y. and Jeon, J. and Shim, K. and Srivastava, J.},
pages = {265-270},
publisher = {Springer-Verlag},
abstract = {On the basis of examining the existing restricted Bayesian network classifiers, a new Bayes-theorem-based and more strictly restricted Bayesian-network-based classification model DLBAN is proposed, which can be viewed as a double-level Bayesian network augmented naive Bayes classification. The experimental results show that the DLBAN classifier is better than the TAN classifier in the most cases.},
audit-trail = {*},
keywords = {Conditional Probability Estimation},
location = {Seoul, Korea},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** On the basis of examining the existing restricted Bayesian network classifiers, a new Bayes-theorem-based and more strictly restricted Bayesian-network-based classification model DLBAN is proposed, which can be viewed as a double-level Bayesian network augmented naive Bayes classification. The experimental results show that the DLBAN classifier is better than the TAN classifier in the most cases.

Adjusting Dependence Relations for Semi-Lazy TAN Classifiers.

Wang, Z., Webb, G. I., & Zheng, F.

Lecture Notes in Artificial Intelligence Vol. 2903: Proceedings of the 16th Australian Conference on Artificial Intelligence (AI 03), Berlin/Heidelberg, pp. 453-465, 2003.

[Bibtex] [Abstract]

```
@InProceedings{WangWebbZheng03,
author = {Wang, Z. and Webb, G. I. and Zheng, F.},
booktitle = {Lecture Notes in Artificial Intelligence Vol. 2903: Proceedings of the 16th Australian Conference on Artificial Intelligence ({AI} 03)},
title = {Adjusting Dependence Relations for Semi-Lazy TAN Classifiers},
year = {2003},
address = {Berlin/Heidelberg},
editor = {Gedeon, T.D. and Fung, L.C.C.},
pages = {453-465},
publisher = {Springer},
abstract = {The naive Bayesian classifier is a simple and effective classification method, which assumes a Bayesian network in which each attribute has the class label as its only one parent. But this assumption is not obviously hold in many real world domains. Tree-Augmented Na?ve Bayes (TAN) is a state-of-the-art extension of the naive Bayes, which can express partial dependence relations among attributes. In this paper, we analyze the implementations of two different TAN classifiers and their tree structures. Experiments show how different dependence relations impact on accuracy of TAN classifiers. We present a kind of semi-lazy TAN classifier, which builds a TAN identical to the original TAN at training time, but adjusts the dependence relations for a new test instance at classification time. Our extensive experimental results show that this kind of semi-lazy classifier delivers lower error than the original TAN and is more efficient than SuperParent TAN.},
keywords = {Conditional Probability Estimation},
location = {Perth, Australia},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** The naive Bayesian classifier is a simple and effective classification method, which assumes a Bayesian network in which each attribute has the class label as its only one parent. But this assumption is not obviously hold in many real world domains. Tree-Augmented Na?ve Bayes (TAN) is a state-of-the-art extension of the naive Bayes, which can express partial dependence relations among attributes. In this paper, we analyze the implementations of two different TAN classifiers and their tree structures. Experiments show how different dependence relations impact on accuracy of TAN classifiers. We present a kind of semi-lazy TAN classifier, which builds a TAN identical to the original TAN at training time, but adjusts the dependence relations for a new test instance at classification time. Our extensive experimental results show that this kind of semi-lazy classifier delivers lower error than the original TAN and is more efficient than SuperParent TAN.

Averaged One-Dependence Estimators: Preliminary Results.

Webb, G. I., Boughton, J., & Wang, Z.

Proceedings of the First Australasian Data Mining Workshop (AusDM02), Sydney, pp. 65-73, 2002.

[Bibtex] [Abstract]

```
@InProceedings{WebbBoughtonWang02,
Title = {Averaged One-Dependence Estimators: Preliminary Results},
Author = {Webb, G. I. and Boughton, J. and Wang, Z.},
Booktitle = {Proceedings of the First Australasian Data Mining Workshop (AusDM02)},
Year = {2002},
Address = {Sydney},
Editor = {Simoff, S.J. and Williams, G.J. and Hegland, M.},
Pages = {65-73},
Publisher = {University of Technology},
Abstract = {Naive Bayes is a simple, computationally efficient and remarkably accurate approach to classification learning. These properties have led to its wide deployment in many online applications. However, it is based on an assumption that all attributes are conditionally independent given the class. This assumption leads to decreased accuracy in some applications. AODE overcomes the attribute independence assumption of naive Bayes by averaging over all models in which all attributes depend upon the class and a single other attribute. The resulting classification learning algorithm for nominal data is computationally efficient and achieves very low error rates.},
Audit-trail = {*},
Keywords = {Conditional Probability Estimation and AODE},
Location = {Canberra, Australia},
Related = {learning-complex-conditional-probabilities-from-data}
}
```

**ABSTRACT** Naive Bayes is a simple, computationally efficient and remarkably accurate approach to classification learning. These properties have led to its wide deployment in many online applications. However, it is based on an assumption that all attributes are conditionally independent given the class. This assumption leads to decreased accuracy in some applications. AODE overcomes the attribute independence assumption of naive Bayes by averaging over all models in which all attributes depend upon the class and a single other attribute. The resulting classification learning algorithm for nominal data is computationally efficient and achieves very low error rates.

Comparison of Lazy Bayesian Rule Learning and Tree-Augmented Bayesian Learning.

Wang, Z., & Webb, G. I.

Proceedings of the IEEE International Conference on Data Mining (ICDM-2002), Los Alamitos, CA, pp. 775-778, 2002.

[Bibtex] [Abstract]

```
@InProceedings{WangWebb02,
author = {Wang, Z. and Webb, G. I.},
booktitle = {Proceedings of the {IEEE} International Conference on Data Mining (ICDM-2002)},
title = {Comparison of Lazy Bayesian Rule Learning and Tree-Augmented Bayesian Learning},
year = {2002},
address = {Los Alamitos, CA},
pages = {775-778},
publisher = {{IEEE} Computer Society},
abstract = {The naive Bayes classifier is widely used in interactive applications due to its computational efficiency, direct theoretical base, and competitive accuracy. However, its attribute independence assumption can result in sub-optimal accuracy. A number of techniques have explored simple relaxations of the attribute independence assumption in order to increase accuracy. Among these, Lazy Bayesian Rules (LBR) and Tree-Augmented Na?ve-Bayes (TAN) have demonstrated strong prediction accuracy. However, their relative performance has never been evaluated. This paper compares and contrasts these two techniques, finding that they have comparable accuracy and hence should be selected according to computational profile. LBR is desirable when small numbers of objects are to be classified while TAN is desirable when large numbers of objects are to be classified},
audit-trail = {http://csdl.computer.org/comp/proceedings/icdm/2002/1754/00/1754toc.htm},
keywords = {Conditional Probability Estimation},
location = {Maebashi City, Japan},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** The naive Bayes classifier is widely used in interactive applications due to its computational efficiency, direct theoretical base, and competitive accuracy. However, its attribute independence assumption can result in sub-optimal accuracy. A number of techniques have explored simple relaxations of the attribute independence assumption in order to increase accuracy. Among these, Lazy Bayesian Rules (LBR) and Tree-Augmented Na?ve-Bayes (TAN) have demonstrated strong prediction accuracy. However, their relative performance has never been evaluated. This paper compares and contrasts these two techniques, finding that they have comparable accuracy and hence should be selected according to computational profile. LBR is desirable when small numbers of objects are to be classified while TAN is desirable when large numbers of objects are to be classified

A Heuristic Lazy Bayesian Rules Algorithm.

Wang, Z., & Webb, G. I.

Proceedings of the First Australasian Data Mining Workshop (AusDM02), Sydney, pp. 57-63, 2002.

[Bibtex] [Abstract]

```
@InProceedings{WangWebb02b,
author = {Wang, Z. and Webb, G. I.},
booktitle = {Proceedings of the First Australasian Data Mining Workshop (AusDM02)},
title = {A Heuristic Lazy Bayesian Rules Algorithm},
year = {2002},
address = {Sydney},
editor = {Simoff, S. J and Williams, G. J and Hegland, M.},
pages = {57-63},
publisher = {University of Technology},
abstract = {Lazy Bayesian rule has demonstrated outstanding classification accuracy. However, it has high computational overheads when large numbers of instances are classified from a single training set. We compare lazy Bayesian rule and the tree-augmented Bayesian classifier, and present a new heuristic lazy Bayesian rule classifier that combines elements of the two. It requires less computation than lazy Bayesian rule, but demonstrates similar prediction accuracy.},
audit-trail = {*},
keywords = {Conditional Probability Estimation},
location = {Canberra, Australia},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** Lazy Bayesian rule has demonstrated outstanding classification accuracy. However, it has high computational overheads when large numbers of instances are classified from a single training set. We compare lazy Bayesian rule and the tree-augmented Bayesian classifier, and present a new heuristic lazy Bayesian rule classifier that combines elements of the two. It requires less computation than lazy Bayesian rule, but demonstrates similar prediction accuracy.

Implementation of Lazy Bayesian Rules in the Weka System.

Wang, Z., Webb, G. I., & Dai, H.

Software Technology Catering for 21st Century: Proceedings of the International Symposium on Future Software Technology (ISFST2001), Tokyo, pp. 204-208, 2001.

[Bibtex] [Abstract]

```
@InProceedings{WangWebbDai01,
Title = {Implementation of Lazy Bayesian Rules in the Weka System},
Author = {Wang, Z. and Webb, G. I. and Dai, H.},
Booktitle = {Software Technology Catering for 21st Century: Proceedings of the International Symposium on Future Software Technology (ISFST2001)},
Year = {2001},
Address = {Tokyo},
Pages = {204-208},
Publisher = {Software Engineers Association},
Abstract = {The na?ve Bayesian classification algorithms were shown to be computationally efficient and surprisingly accurate when the conditional independence assumption on which they are based is violated. The lazy Bayesian rule is the application of lazy learning techniques to Bayesian tree induction, which supports a weaker conditional attribute independence assumption. The Weka system is a full, industrial-strength implementation of essentially almost the state-of-the-art machine learning techniques, and it contains a framework, in the form of a Java class library, which supports applications that use embedded machine learning and even the implementation of new learning schemes. In this paper, we mainly discuss the implementation of the algorithm of lazy Bayesian rule in Weka System, and introduce all the methods to be used in the Java class. This is the first lazy learning scheme implemented in Weka System.},
Audit-trail = {*},
Keywords = {Conditional Probability Estimation},
Location = {Zheng Zhou, China},
Related = {learning-complex-conditional-probabilities-from-data}
}
```

**ABSTRACT** The na?ve Bayesian classification algorithms were shown to be computationally efficient and surprisingly accurate when the conditional independence assumption on which they are based is violated. The lazy Bayesian rule is the application of lazy learning techniques to Bayesian tree induction, which supports a weaker conditional attribute independence assumption. The Weka system is a full, industrial-strength implementation of essentially almost the state-of-the-art machine learning techniques, and it contains a framework, in the form of a Java class library, which supports applications that use embedded machine learning and even the implementation of new learning schemes. In this paper, we mainly discuss the implementation of the algorithm of lazy Bayesian rule in Weka System, and introduce all the methods to be used in the Java class. This is the first lazy learning scheme implemented in Weka System.

Candidate Elimination Criteria for Lazy Bayesian Rules.

Webb, G. I.

Lecture Notes in Computer Science Vol. 2256: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI'01), Berlin/Heidelberg, pp. 545-556, 2001.

[Bibtex] [Abstract]

```
@InProceedings{Webb01b,
author = {Webb, G. I.},
booktitle = {Lecture Notes in Computer Science Vol. 2256: Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI'01)},
title = {Candidate Elimination Criteria for Lazy Bayesian Rules},
year = {2001},
address = {Berlin/Heidelberg},
editor = {Stumptner, M. and Corbett, D. and Brooks, M.J.},
pages = {545-556},
publisher = {Springer},
abstract = {Lazy Bayesian Rules modify naive Bayesian classification to undo elements of the harmful attribute independence assumption. It has been shown to provide classification error comparable to boosting decision trees. This paper explores alternatives to the candidate elimination criterion employed within Lazy Bayesian Rules. Improvements over naive Bayes are consistent so long as the candidate elimination criteria ensures there is sufficient data for accurate probability estimation. However, the original candidate elimination criterion is demonstrated to provide better overall error reduction than the use of a minimum data subset size criterion.},
audit-trail = {*},
doi = {10.1007%2F3-540-45656-2_47},
keywords = {Conditional Probability Estimation and Bayesian Learning and Lazy Bayesian Rules and Lazy Learning},
location = {Adelaide, Australia},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** Lazy Bayesian Rules modify naive Bayesian classification to undo elements of the harmful attribute independence assumption. It has been shown to provide classification error comparable to boosting decision trees. This paper explores alternatives to the candidate elimination criterion employed within Lazy Bayesian Rules. Improvements over naive Bayes are consistent so long as the candidate elimination criteria ensures there is sufficient data for accurate probability estimation. However, the original candidate elimination criterion is demonstrated to provide better overall error reduction than the use of a minimum data subset size criterion.

Lazy Learning of Bayesian Rules.

Zheng, Z., & Webb, G. I.

Machine Learning, 41(1), 53-84, 2000.

[Bibtex] [Abstract]

```
@Article{ZhengWebb00,
author = {Zheng, Z. and Webb, G. I.},
journal = {Machine Learning},
title = {Lazy Learning of Bayesian Rules},
year = {2000},
number = {1},
pages = {53-84},
volume = {41},
abstract = {The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. A number of approaches have sought to alleviate this problem. A Bayesian tree learning algorithm builds a decision tree, and generates a local naive Bayesian classifier at each leaf. The tests leading to a leaf can alleviate attribute integra¡dependencies for the local naive Bayesian classifier. However, Bayesian tree learning still suffers from the small disjunct problem of tree learning. While inferred Bayesian trees demonstrate low average prediction error rates, there is reason to believe that error rates will be higher for those leaves with few training examples. This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR. For each test example, it builds a most appropriate rule with a local naive Bayesian classifier as its consequent. It is demonstrated that the computational requirements of LBR are reasonable in a wide cross¡selection of natural domains. Experiments with these domains show that, on average, this new algorithm obtains lower error rates significantly more often than the reverse in comparison to a naive Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes using Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm. It also outperforms, although the result is not statistically significant, a selective naive Bayesian classifier.},
address = {Netherlands},
audit-trail = {27/10/03 requested permission to post pp pdf. 28/10/03 Permission granted by Kluwer. PDF posted 30/10/03},
doi = {10.1023/A:1007613203719},
keywords = {Conditional Probability Estimation and Bayesian Learning and Lazy Bayesian Rules and Lazy Learning},
publisher = {Springer},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. A number of approaches have sought to alleviate this problem. A Bayesian tree learning algorithm builds a decision tree, and generates a local naive Bayesian classifier at each leaf. The tests leading to a leaf can alleviate attribute integra¡dependencies for the local naive Bayesian classifier. However, Bayesian tree learning still suffers from the small disjunct problem of tree learning. While inferred Bayesian trees demonstrate low average prediction error rates, there is reason to believe that error rates will be higher for those leaves with few training examples. This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR. For each test example, it builds a most appropriate rule with a local naive Bayesian classifier as its consequent. It is demonstrated that the computational requirements of LBR are reasonable in a wide cross¡selection of natural domains. Experiments with these domains show that, on average, this new algorithm obtains lower error rates significantly more often than the reverse in comparison to a naive Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes using Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm. It also outperforms, although the result is not statistically significant, a selective naive Bayesian classifier.

Lazy Bayesian Rules: A Lazy Semi-Naive Bayesian Learning Technique Competitive to Boosting Decision Trees.

Zheng, Z., Webb, G. I., & Ting, K. M.

Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), San Francisco, pp. 493-502, 1999.

[Bibtex] [Abstract]

```
@InProceedings{ZhengWebbTing99,
author = {Zheng, Z. and Webb, G. I. and Ting, K. M.},
booktitle = {Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99)},
title = {Lazy Bayesian Rules: A Lazy Semi-Naive Bayesian Learning Technique Competitive to Boosting Decision Trees},
year = {1999},
address = {San Francisco},
editor = {Bratko, I. and Dzeroski, S.},
pages = {493-502},
publisher = {Morgan Kaufmann},
abstract = {LBR is a lazy semi-naive Bayesian classifier learning technique, designed to alleviate the attribute interdependence problem of naive Bayesian classification. To classify a test example, it creates a conjunctive rule that selects a most appropriate subset of training examples and induces a local naive Bayesian classifier using this subset. LBR can significantly improve the performance of the naive Bayesian classifier. A bias and variance analysis of LBR reveals that it significantly reduces the bias of naive Bayesian classification at a cost of a slight increase in variance. It is interesting to compare this lazy technique with boosting and bagging, two well-known state-of-the-art non-lazy learning techniques. Empirical comparison of LBR with boosting decision trees on discrete valued data shows that LBR has, on average, significantly lower variance and higher bias. As a result of the interaction of these effects, the average prediction error of LBR over a range of learning tasks is at a level directly comparable to boosting. LBR provides a very competitive discrete valued learning technique where error minimization is the primary concern. It is very efficient when a single classifier is to be applied to classify few cases, such as in a typical incremental learning scenario.},
audit-trail = {Link via Citeseer},
keywords = {Conditional Probability Estimation and Bayesian Learning and Lazy Bayesian Rules and Lazy Learning},
location = {Bled, Slovenia},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** LBR is a lazy semi-naive Bayesian classifier learning technique, designed to alleviate the attribute interdependence problem of naive Bayesian classification. To classify a test example, it creates a conjunctive rule that selects a most appropriate subset of training examples and induces a local naive Bayesian classifier using this subset. LBR can significantly improve the performance of the naive Bayesian classifier. A bias and variance analysis of LBR reveals that it significantly reduces the bias of naive Bayesian classification at a cost of a slight increase in variance. It is interesting to compare this lazy technique with boosting and bagging, two well-known state-of-the-art non-lazy learning techniques. Empirical comparison of LBR with boosting decision trees on discrete valued data shows that LBR has, on average, significantly lower variance and higher bias. As a result of the interaction of these effects, the average prediction error of LBR over a range of learning tasks is at a level directly comparable to boosting. LBR provides a very competitive discrete valued learning technique where error minimization is the primary concern. It is very efficient when a single classifier is to be applied to classify few cases, such as in a typical incremental learning scenario.

Learning Lazy Rules to Improve the Performance of Classifiers.

Ting, K. M., Zheng, Z., & Webb, G. I.

Proceedings of the Nineteenth SGES International Conference on Knowledge Based Systems and Applied Artificial Intelligence (ES'99), New York, pp. 122-131, 1999.

[Bibtex] [Abstract]

```
@InProceedings{TingZhengWebb99,
author = {Ting, K.M. and Zheng, Z. and Webb, G. I.},
booktitle = {Proceedings of the Nineteenth SGES International Conference on Knowledge Based Systems and Applied Artificial Intelligence (ES'99)},
title = {Learning Lazy Rules to Improve the Performance of Classifiers},
year = {1999},
address = {New York},
editor = {Coenen, F. and Macintosh, A.},
pages = {122-131},
publisher = {Springer},
abstract = {Based on an earlier study on lazy Bayesian rule learning, this paper introduces a general lazy learning framework, called LAZYRULE, that begins to learn a rule only when classifying a test case. The objective of the framework is to improve the performance of a base learning algorithm. It has the potential to be used for different types of base learning algorithms. LAZYRULE performs attribute elimination and training case selection using cross-validation to generate the most appropriate rule for each test case. At the consequent of the rule, it applies the base learning algorithm on the selected training subset and the remaining attributes to construct a classifier to make a prediction. This combined action seeks to build a better performing classifier for each test case than the classifier trained using all attributes and all training cases. We show empirically that LAZYRULE improves the performances of naive Bayesian classifiers and majority vote.},
audit-trail = {*},
keywords = {Conditional Probability Estimation},
location = {Peterhouse College, Cambridge, UK},
related = {learning-complex-conditional-probabilities-from-data},
}
```

**ABSTRACT** Based on an earlier study on lazy Bayesian rule learning, this paper introduces a general lazy learning framework, called LAZYRULE, that begins to learn a rule only when classifying a test case. The objective of the framework is to improve the performance of a base learning algorithm. It has the potential to be used for different types of base learning algorithms. LAZYRULE performs attribute elimination and training case selection using cross-validation to generate the most appropriate rule for each test case. At the consequent of the rule, it applies the base learning algorithm on the selected training subset and the remaining attributes to construct a classifier to make a prediction. This combined action seeks to build a better performing classifier for each test case than the classifier trained using all attributes and all training cases. We show empirically that LAZYRULE improves the performances of naive Bayesian classifiers and majority vote.

Adjusted Probability Naive Bayesian Induction.

Webb, G. I., & Pazzani, M.

Lecture Notes in Computer Science Vol. 1502: Advanced Topics in Artificial Intelligence, Selected Papers from the Eleventh Australian Joint Conference on Artificial Intelligence (AI '98), Berlin, pp. 285-295, 1998.

[Bibtex] [Abstract]

```
@InProceedings{WebbPazzani98,
Title = {Adjusted Probability Naive Bayesian Induction},
Author = {Webb, G. I. and Pazzani, M.},
Booktitle = {Lecture Notes in Computer Science Vol. 1502: Advanced Topics in Artificial Intelligence, Selected Papers from the Eleventh Australian Joint Conference on Artificial Intelligence (AI '98)},
Year = {1998},
Address = {Berlin},
Editor = {Antoniou, G. and Slaney, J.K.},
Pages = {285-295},
Publisher = {Springer-Verlag},
Abstract = {Naive Bayesian classifiers utilise a simple mathematical model for induction. While it is known that the assumptions on which this model is based are frequently violated, the predictive accuracy obtained in discriminate classification tasks is surprisingly competitive in comparison to more complex induction techniques. Adjusted probability naive Bayesian induction adds a simple extension to the naive Bayesian classifier. A numeric weight is inferred for each class. During discriminate classification, the naive Bayesian probability of a class is multiplied by its weight to obtain an adjusted value. The use of this adjusted value in place of the naive Bayesian probability is shown to significantly improve predictive accuracy.},
Audit-trail = {*},
Keywords = {Conditional Probability Estimation and Bayesian Learning},
Location = {Brisbane, Australia},
Related = {learning-complex-conditional-probabilities-from-data}
}
```

**ABSTRACT** Naive Bayesian classifiers utilise a simple mathematical model for induction. While it is known that the assumptions on which this model is based are frequently violated, the predictive accuracy obtained in discriminate classification tasks is surprisingly competitive in comparison to more complex induction techniques. Adjusted probability naive Bayesian induction adds a simple extension to the naive Bayesian classifier. A numeric weight is inferred for each class. During discriminate classification, the naive Bayesian probability of a class is multiplied by its weight to obtain an adjusted value. The use of this adjusted value in place of the naive Bayesian probability is shown to significantly improve predictive accuracy.