Data Scientist


Learning Complex Conditional Probabilities from Data

Naive Bayes is popular due to its efficiency, direct theoretical foundation and strong classification performance. My techniques improve its accuracy by overcoming the deficiencies of its attribute independence assumption.
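
For context, naive Bayes achieves its efficiency by assuming the attributes are independent given the class, so the joint probability of a class y and attribute values x_1, ..., x_a factorises into low-dimensional terms (generic notation, not tied to any particular paper):

\[
\hat{P}(y, x_1, \ldots, x_a) \;=\; \hat{P}(y)\,\prod_{i=1}^{a} \hat{P}(x_i \mid y).
\]

Each factor can be estimated from simple frequency counts, but when attributes are in fact inter-dependent given the class this factorisation biases the estimates; that is the deficiency the techniques below address.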

The AnDE algorithms achieve this without resorting to model selection or search by averaging over all of a class of models that directly map lower-dimensional probabilities to the desired high-dimensional probability. Averaged One Dependence Estimators (AODE) is the best known of the AnDE algorithms. It provides particularly high prediction accuracy with relatively modest computational overheads. The AnDE algorithms learn in a single pass through the data, thus supporting learning from data that is too large to fit in memory. They have complexity that is linear with respect to data quantity. These properties, together with their capacity to accurately model high-dimensional probability distributions, make them an extremely attractive option for large data.
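
To illustrate why AnDE learning is single-pass and linear in data quantity, the following is a minimal Python sketch of AODE (AnDE with n=1). It is illustrative only, not any of the released implementations: the class and method names are mine, discrete attribute values are assumed, and the published minimum-frequency threshold and smoothing scheme are replaced by simple Laplace-style corrections. AODE averages, over each attribute x_i taken as a superparent, the estimate P(y, x_i) * prod_{j != i} P(x_j | y, x_i); the only statistics required are joint frequency counts, which can be accumulated while streaming the data from disk.

    from collections import defaultdict

    class AODE:
        """Minimal illustrative sketch of Averaged One-Dependence Estimators.

        Training is a single pass that accumulates joint frequency counts, so
        the data never needs to be held in memory and the cost of learning is
        linear in the number of instances.
        """

        def __init__(self, n_attributes):
            self.n = n_attributes
            self.total = 0                    # number of instances seen
            self.parent = defaultdict(int)    # counts of (y, i, x_i)
            self.pair = defaultdict(int)      # counts of (y, i, x_i, j, x_j)
            self.classes = set()
            self.values = [set() for _ in range(n_attributes)]

        def update(self, x, y):
            """Process one training instance; callable while streaming from disk."""
            self.total += 1
            self.classes.add(y)
            for i, xi in enumerate(x):
                self.values[i].add(xi)
                self.parent[(y, i, xi)] += 1
                for j, xj in enumerate(x):
                    if j != i:
                        self.pair[(y, i, xi, j, xj)] += 1

        def joint(self, x, y):
            """Average the superparent estimates P(y, x_i) * prod_j P(x_j | y, x_i)."""
            estimates = []
            for i, xi in enumerate(x):
                f = self.parent[(y, i, xi)]
                if f == 0:
                    continue                  # skip unseen superparent values
                # Laplace-style corrections stand in for the published smoothing.
                p = (f + 1.0) / (self.total + len(self.classes) * len(self.values[i]))
                for j, xj in enumerate(x):
                    if j != i:
                        p *= (self.pair[(y, i, xi, j, xj)] + 1.0) / (f + len(self.values[j]))
                estimates.append(p)
            return sum(estimates) / len(estimates) if estimates else 0.0

        def predict(self, x):
            """Return the class with the highest averaged joint probability estimate."""
            return max(self.classes, key=lambda y: self.joint(x, y))

Because prediction consults only the count tables, the same tables can be updated incrementally as further data arrive, and the cost of each update depends only on the number of attributes, not on how much data has already been seen.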

An alternative approach is provided by Lazy Bayesian Rules (LBR), which does perform model selection. It provides very high prediction accuracy for large training sets, and is computationally efficient when few objects are to be classified for each training set.

Selective AnDE refines AnDE models requiring only a single additional pass through the data.

Selective KDB learns highly accurate models from large quantities of data in just three passes through the data.

AnDE, Selective AnDE and Selective KDB can all operate out-of-core, and thus support learning from very large data.

WANBIA and its variants WANBIA-C and ALR add discriminatively learned weights to generatively learned linear models, greatly improving accuracy relative to pure generative learning and learning time relative to pure discriminative learning.
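
The idea, in rough outline (the notation is mine, and the precise objective functions differ between WANBIA, WANBIA-C and ALR): keep the generatively estimated probabilities of naive Bayes, but raise each conditional to a learned power,

\[
\hat{P}(y \mid \mathbf{x}) \;\propto\; \hat{P}(y)\,\prod_{i=1}^{a} \hat{P}(x_i \mid y)^{w_i},
\]

where the probability estimates come from simple frequency counts while the weights w_i are fitted discriminatively, for example by maximising conditional log-likelihood. Because only the comparatively small weight vector is learned by iterative optimisation, training is far cheaper than fully discriminative alternatives such as logistic regression, yet the weights can correct much of the bias introduced by the independence assumption.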

There have been many scientific applications of AODE.

Software

An AnDE package for Weka is available here.

An AnDE package for R is available here.

The following are standard components in Weka:

  • AODE, which is AnDE with n=1.
  • AODEsr, which is AODE with subsumption resolution.
  • LBR, Lazy Bayesian Rules.

An open source C++ implementation of Selective KDB can be downloaded here.

An open source C++ implementation of Sample-based Selective Attribute AnDE can be downloaded here.

Publications