Combining Generative and Discriminative Learning

Generative learning is often more efficient than discriminative learning, but has substantially higher bias. We are developing techniques for exploiting the information efficiently learned by generative learning in order to assist the process of discriminative learning.

A WANBIA component for Weka can be downloaded here.

EBNC implements our algorithms for Efficient Parameter Learning of Bayesian Networks.

Publications

A Fast Trust-Region Newton Method for Softmax Logistic Regression.
Zaidi, N. A., & Webb, G. I.
Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 705-713, 2017.
[Bibtex] [Abstract]

@InProceedings{ZaidiWebb17,
Title = {A Fast Trust-Region Newton Method for Softmax Logistic Regression},
Author = {Zaidi, Nayyar A and Webb, Geoffrey I},
Booktitle = {Proceedings of the 2017 SIAM International Conference on Data Mining},
Year = {2017},
Organization = {SIAM},
Pages = {705-713},
Abstract = {With the emergence of big data, there has been a growing
interest in optimization routines that lead to faster convergence of Logistic Regression (LR). Among many optimization methods such as Gradient Descent, Quasi-Newton, Conjugate Gradient, etc., the Trust-region based truncated Newton method (TRON) algorithm has been shown to converge
the fastest. The TRON algorithm also forms an important
component of the highly efficient and widely used liblinear
package. It has been shown that the WANBIA-C trick of
scaling with the log of the naive Bayes conditional probabilities can greatly accelerate the convergence of LR trained using (first-order) Gradient Descent and (approximate second-order) Quasi-Newton optimization. In this work we study
the applicability of the WANBIA-C trick to TRON. We first
devise a TRON algorithm optimizing the softmax objective
function and then demonstrate that WANBIA-C style preconditioning can be beneficial for TRON, leading to an ex-
tremely fast (batch) LR algorithm. Second, we present a
comparative analysis of one-vs-all LR and softmax LR in
terms of the 0-1 Loss, Bias, Variance, RMSE, Log-Loss,
Training and Classication time, and show that softmax LR
leads to significantly better RMSE and Log-Loss. We evaluate our proposed approach on 51 benchmark datasets.},
Doi = {10.1137/1.9781611974973.79},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning}
}

ABSTRACT With the emergence of big data, there has been a growing interest in optimization routines that lead to faster convergence of Logistic Regression (LR). Among many optimization methods such as Gradient Descent, Quasi-Newton, Conjugate Gradient, etc., the Trust-region based truncated Newton method (TRON) algorithm has been shown to converge the fastest. The TRON algorithm also forms an important component of the highly efficient and widely used liblinear package. It has been shown that the WANBIA-C trick of scaling with the log of the naive Bayes conditional probabilities can greatly accelerate the convergence of LR trained using (first-order) Gradient Descent and (approximate second-order) Quasi-Newton optimization. In this work we study the applicability of the WANBIA-C trick to TRON. We first devise a TRON algorithm optimizing the softmax objective function and then demonstrate that WANBIA-C style preconditioning can be beneficial for TRON, leading to an ex- tremely fast (batch) LR algorithm. Second, we present a comparative analysis of one-vs-all LR and softmax LR in terms of the 0-1 Loss, Bias, Variance, RMSE, Log-Loss, Training and Classication time, and show that softmax LR leads to significantly better RMSE and Log-Loss. We evaluate our proposed approach on 51 benchmark datasets.

Efficient Parameter Learning of Bayesian Network Classifiers.
Zaidi, N., Webb, G. I., Carman, M., Petitjean, F., Buntine, W., Hynes, H., & De Sterck, H.
Machine Learning, 106(9-10), 1289-1329, 2017.
[Bibtex] [Abstract]

@Article{ZaidiEtAl17,
Title = {Efficient Parameter Learning of Bayesian Network Classifiers},
Author = {Zaidi, N. and Webb, Geoffrey I and Carman, M. and Petitjean, F. and Buntine, W. and Hynes, H. and De Sterck, H.},
Journal = {Machine Learning},
Year = {2017},
Number = {9-10},
Pages = {1289-1329},
Volume = {106},
Abstract = {Recent advances have demonstrated substantial benefits from learning with both generative and discriminative parameters. On the one hand, generative approaches address the estimation of the parameters of the joint distribution�P(y,x), which for most network types is very computationally efficient (a notable exception to this are Markov networks) and on the other hand, discriminative approaches address the estimation of the parameters of the posterior distribution�and, are more effective for classification, since they fit P(y|x) directly. However, discriminative approaches are less computationally efficient as the normalization factor in the conditional log-likelihood precludes the derivation of closed-form estimation of parameters. This paper introduces a new discriminative parameter learning method for Bayesian network classifiers that combines in an elegant fashion parameters learned using both generative and discriminative methods. The proposed method is discriminative in nature, but uses estimates of generative probabilities to speed-up the optimization process. A second contribution is to propose a simple framework to characterize the parameter learning task for Bayesian network classifiers. We conduct an extensive set of experiments on 72 standard datasets and demonstrate that our proposed discriminative parameterization provides an efficient alternative to other state-of-the-art parameterizations.},
Doi = {10.1007/s10994-016-5619-z},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://rdcu.be/oP1t}
}

ABSTRACT Recent advances have demonstrated substantial benefits from learning with both generative and discriminative parameters. On the one hand, generative approaches address the estimation of the parameters of the joint distribution�P(y,x), which for most network types is very computationally efficient (a notable exception to this are Markov networks) and on the other hand, discriminative approaches address the estimation of the parameters of the posterior distribution�and, are more effective for classification, since they fit P(y|x) directly. However, discriminative approaches are less computationally efficient as the normalization factor in the conditional log-likelihood precludes the derivation of closed-form estimation of parameters. This paper introduces a new discriminative parameter learning method for Bayesian network classifiers that combines in an elegant fashion parameters learned using both generative and discriminative methods. The proposed method is discriminative in nature, but uses estimates of generative probabilities to speed-up the optimization process. A second contribution is to propose a simple framework to characterize the parameter learning task for Bayesian network classifiers. We conduct an extensive set of experiments on 72 standard datasets and demonstrate that our proposed discriminative parameterization provides an efficient alternative to other state-of-the-art parameterizations.

Preconditioning an Artificial Neural Network Using Naive Bayes.
Zaidi, N. A., Petitjean, F., & Webb, G. I.
Proceedings of the 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2016, pp. 341-353, 2016.
[Bibtex] [Abstract]

@InProceedings{ZaidiEtAl16,
Title = {Preconditioning an Artificial Neural Network Using Naive {Bayes}},
Author = {Zaidi, Nayyar A.
and Petitjean, Fran{\c{c}}ois
and Webb, Geoffrey I.},
Booktitle = {Proceedings of the 20th {Pacific-Asia} Conference on Advances in Knowledge Discovery and Data Mining, {PAKDD} 2016},
Year = {2016},
Editor = {Bailey, James
and Khan, Latifur
and Washio, Takashi
and Dobbie, Gill
and Huang, Zhexue Joshua
and Wang, Ruili},
Pages = {341-353},
Publisher = {Springer International Publishing},
Abstract = {Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.},
Doi = {10.1007/978-3-319-31753-3_28},
ISBN = {978-3-319-31753-3},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://dx.doi.org/10.1007/978-3-319-31753-3_28}
}

ABSTRACT Logistic Regression (LR) is a workhorse of the statistics community and a state-of-the-art machine learning classifier. It learns a linear model from inputs to outputs trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that preconditioning LR using a Naive Bayes (NB) model speeds up LR learning many-fold. One can, however, train a linear model by optimizing the mean-square-error (MSE) instead of CLL. This leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of NB preconditioning on such an ANN classifier. Optimizing MSE instead of CLL may lead to a lower bias classifier and hence result in better performance on big datasets. We show that this NB preconditioning can speed-up convergence significantly. We also show that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL. We also compare the performance to state-of-the-art classifier Random Forest.

ALRn: Accelerated higher-order logistic regression.
Zaidi, N. A., Webb, G. I., Carman, M. J., Petitjean, F., & Cerquides, J.
Machine Learning, 104(2), 151-194, 2016.
[Bibtex] [Abstract]

@Article{ZaidiEtAl16b,
Title = {{ALRn}: Accelerated higher-order logistic regression},
Author = {Zaidi, Nayyar A.
and Webb, Geoffrey I.
and Carman, Mark J.
and Petitjean, Fran{\c{c}}ois
and Cerquides, Jes{\'u}s},
Journal = {Machine Learning},
Year = {2016},
Number = {2},
Pages = {151-194},
Volume = {104},
Abstract = {This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.},
Doi = {10.1007/s10994-016-5574-8},
ISSN = {1573-0565},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning},
Url = {http://dx.doi.org/10.1007/s10994-016-5574-8}
}

ABSTRACT This paper introduces Accelerated Logistic Regression: a hybrid generative-discriminative approach to training Logistic Regression with high-order features. We present two main results: (1) that our combined generative-discriminative approach significantly improves the efficiency of Logistic Regression and (2) that incorporating higher order features (i.e. features that are the Cartesian products of the original features) reduces the bias of Logistic Regression, which in turn significantly reduces its error on large datasets. We assess the efficacy of Accelerated Logistic Regression by conducting an extensive set of experiments on 75 standard datasets. We demonstrate its competitiveness, particularly on large datasets, by comparing against state-of-the-art classifiers including Random Forest and Averaged n-Dependence Estimators.

Naive-Bayes Inspired Effective Pre-Conditioner for Speeding-up Logistic Regression.
Zaidi, N., Carman, M., Cerquides, J., & Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 1097-1102, 2014.
[Bibtex] [Abstract]

@InProceedings{ZaidiEtAl14,
Title = {Naive-{Bayes} Inspired Effective Pre-Conditioner for Speeding-up Logistic Regression},
Author = {N. Zaidi and M. Carman and J. Cerquides and G.I. Webb},
Booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
Year = {2014},
Pages = {1097-1102},
Abstract = {We propose an alternative parameterization of
Logistic Regression (LR) for the categorical data, multi-class
setting. LR optimizes the conditional log-likelihood over the
training data and is based on an iterative optimization procedure
to tune this objective function. The optimization procedure
employed may be sensitive to scale and hence an effective
pre-conditioning method is recommended. Many problems in
machine learning involve arbitrary scales or categorical data
(where simple standardization of features is not applicable).
The problem can be alleviated by using optimization routines
that are invariant to scale such as (second-order) Newton
methods. However, computing and inverting the Hessian is a
costly procedure and not feasible for big data. Thus one must
often rely on first-order methods such as gradient descent (GD),
stochastic gradient descent (SGD) or approximate secondorder
such as quasi-Newton (QN) routines, which are not
invariant to scale. This paper proposes a simple yet effective
pre-conditioner for speeding-up LR based on naive Bayes
conditional probability estimates. The idea is to scale each
attribute by the log of the conditional probability of that
attribute given the class. This formulation substantially speeds up
LR's convergence. It also provides a weighted naive Bayes
formulation which yields an effective framework for hybrid
generative-discriminative classification.},
Doi = {10.1109/ICDM.2014.53},
Keywords = {Conditional Probability Estimation and WANBIA and DP140100087},
Related = {combining-generative-and-discriminative-learning}
}

ABSTRACT We propose an alternative parameterization of Logistic Regression (LR) for the categorical data, multi-class setting. LR optimizes the conditional log-likelihood over the training data and is based on an iterative optimization procedure to tune this objective function. The optimization procedure employed may be sensitive to scale and hence an effective pre-conditioning method is recommended. Many problems in machine learning involve arbitrary scales or categorical data (where simple standardization of features is not applicable). The problem can be alleviated by using optimization routines that are invariant to scale such as (second-order) Newton methods. However, computing and inverting the Hessian is a costly procedure and not feasible for big data. Thus one must often rely on first-order methods such as gradient descent (GD), stochastic gradient descent (SGD) or approximate secondorder such as quasi-Newton (QN) routines, which are not invariant to scale. This paper proposes a simple yet effective pre-conditioner for speeding-up LR based on naive Bayes conditional probability estimates. The idea is to scale each attribute by the log of the conditional probability of that attribute given the class. This formulation substantially speeds up LR's convergence. It also provides a weighted naive Bayes formulation which yields an effective framework for hybrid generative-discriminative classification.

Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting.
Zaidi, N. A., Cerquides, J., Carman, M. J., & Webb, G. I.
Journal of Machine Learning Research, 14, 1947-1988, 2013.
[Bibtex] [Abstract]

@Article{Zaidi2013,
Title = {Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting},
Author = {Nayyar A. Zaidi and Jesus Cerquides and Mark J. Carman and Geoffrey I. Webb},
Journal = {Journal of Machine Learning Research},
Year = {2013},
Pages = {1947-1988},
Volume = {14},
Abstract = {Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE.},
Keywords = {Conditional Probability Estimation and WANBIA},
Related = {combining-generative-and-discriminative-learning},
Url = {http://jmlr.org/papers/volume14/zaidi13a/zaidi13a.pdf},
Urltext = {Link to paper on JMLR site}
}

ABSTRACT Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE.