Data Scientist


My data science research investigates how data can best support effective evidence-based decision making and yield useful knowledge and insight. My research spans artificial intelligence, machine learning, data mining, data analytics and big data. This overview page lists some key themes and provides links to papers and software.


Learning From Non-Stationary Distributions. The world is dynamic, in a constant state of flux, while machine learning systems typically learn static models from historical data. Failure to account for the dynamic nature of the world may result in sub-optimal performance when these models of the past are used to predict the present or future. This research investigates the phenomenon of concept drift and how best to address it. more
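As an illustration of the problem (not the specific methods developed in this research), a minimal sliding-window drift detector flags when a model's recent error rate departs from its historical baseline; the `window` and `threshold` parameters here are arbitrary illustrative choices:

```python
from collections import deque

def detect_drift(errors, window=30, threshold=0.15):
    """Return the first index at which the error rate over the most
    recent `window` predictions exceeds the baseline rate (estimated
    from the first `window` predictions) by more than `threshold`,
    or None if no drift is detected."""
    if len(errors) < 2 * window:
        return None
    baseline = sum(errors[:window]) / window
    recent = deque(errors[:window], maxlen=window)
    for i in range(window, len(errors)):
        recent.append(errors[i])
        if i >= 2 * window - 1:          # wait until the window is past the baseline
            rate = sum(recent) / window
            if rate - baseline > threshold:
                return i
    return None
```

A stream whose error rate jumps (for example, a model trained before the distribution shifted) triggers the detector shortly after the change point.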


Big models for big data.  Early machine learning used very small data sets. Data science often involves very large data sets. Work on ‘scaling-up’ to large data sets has concentrated on reducing the computational complexity of existing algorithms. We contend that this may not be appropriate, that learning from large data sets is fundamentally different to learning from small data sets and that different types of algorithm may be most effective.  more


Association Discovery. I have pioneered association discovery techniques that seek the most useful associations, rather than applying the minimum-support constraint more commonly used in the field.  Many of these techniques are included in my Magnum Opus software which is now incorporated in BigML. more

Scalable learning of Graphical Models.  Graphical models are powerful descriptions of joint probability distributions. We have developed techniques for efficiently scaling exact methods to thousands of variables.  more

Combining Generative and Discriminative Learning.  Generative learning is often more efficient than discriminative learning, but has substantially higher bias. We are developing techniques for exploiting the information efficiently learned by generative learning in order to assist the process of discriminative learning. more


Learning Complex Conditional Probabilities from Data.  Naive Bayes is a popular machine learning technique due to its efficiency, direct theoretical foundation and strong classification performance.  My techniques improve its accuracy by overcoming the deficiencies of its attribute independence assumption.  AnDE, AODE, LBR and related papers and software. Scientific applications of AODE
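To make the attribute independence assumption concrete, here is a minimal categorical naive Bayes sketch (a toy illustration, not the AnDE/AODE implementations themselves). The product over per-attribute conditionals is exactly the assumption in question:

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Count class frequencies and per-attribute value frequencies per class."""
    class_counts = Counter(y)
    cond = defaultdict(Counter)          # (attribute index, class) -> value counts
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            cond[(i, c)][v] += 1
    return class_counts, cond

def predict_nb(model, xs):
    """Score each class by P(c) * prod_i P(x_i | c), with Laplace smoothing.
    The product over attributes is the independence assumption."""
    class_counts, cond = model
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n
        for i, v in enumerate(xs):
            counts = cond[(i, c)]
            score *= (counts[v] + 1) / (cc + len(counts) + 1)
        if score > best_score:
            best, best_score = c, score
    return best
```

When attributes are correlated given the class, the product double-counts evidence, which is the deficiency that AODE and related techniques address.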

Scalable learning of time series classifiers.  Time series classification is an important data analysis task.  The largest dataset in the standard set of benchmark time-series classification tasks, the UCR repository, contains approximately 10,000 series. We are working with the French Space Agency on classifying land usage from satellite images.  This task requires learning from many millions of time series and classifying many billions.  The pre-existing state of the art does not scale to these magnitudes.  We are developing new time series classification technologies that will. more
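For context, a classical baseline in this area is one-nearest-neighbour classification under dynamic time warping: accurate, but quadratic in series length per comparison and linear in training set size per query, which is precisely what fails to scale. A minimal sketch (illustrative baseline, not the scalable methods this program develops):

```python
def dtw(a, b):
    """Dynamic time warping distance between two series (squared-error cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def nn_classify(train, query):
    """1-nearest-neighbour under DTW; `train` is a list of (series, label)."""
    return min(train, key=lambda item: dtw(item[0], query))[1]
```

With millions of training series and billions of queries, this per-query linear scan is intractable, motivating fundamentally different algorithms.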


OPUS is an efficient branch and bound search algorithm for exploring the space of conjunctive patterns. It can optimise arbitrary objective functions so long as they provide useful bounds.  It supports extremely fast rule discovery. more
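The idea can be sketched as a generic branch-and-bound over conjunctions. This toy version maximises (positive cover minus negative cover) and prunes using the fact that refining a conjunction can only shrink its cover, so a node's positive count bounds the objective of all its refinements; the real OPUS algorithm adds far more aggressive pruning and search-space reorganisation:

```python
def opus_style_search(records, labels, conditions):
    """Branch-and-bound over conjunctions of (name, predicate) conditions,
    maximising (positive matches - negative matches)."""
    best = {"rule": (), "value": 0}

    def matches(rule, rec):
        return all(cond(rec) for _, cond in rule)

    def expand(rule, remaining):
        covered = [l for r, l in zip(records, labels) if matches(rule, r)]
        pos = sum(1 for l in covered if l)
        value = pos - (len(covered) - pos)
        if value > best["value"]:
            best["rule"], best["value"] = rule, value
        if pos <= best["value"]:        # bound: no refinement can beat the best
            return
        for i, c in enumerate(remaining):
            expand(rule + (c,), remaining[i + 1:])

    expand((), tuple(conditions))
    return best["rule"], best["value"]
```

The objective function is pluggable, as in OPUS, provided it admits such an upper bound.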


Ensemble Learning.  We have shown that error can be substantially reduced through Multi-Strategy Ensemble Learning, which involves combining multiple ensemble learning techniques. MultiBoosting (also known as Boost Bagging) combines boosting and bagging, obtaining most of boosting’s superior bias reduction together with most of bagging’s superior variance reduction.  We have shown that randomization of the ensemble classifiers can further reduce error.  Feating is a new approach to ensemble learning that combines local rather than global models.  To our knowledge it is the only generic ensemble learning technique that can boost the accuracy of stable learners such as SVM and NB. More on MultiBoosting and Multi-Strategy Ensemble Learning. More on Feating
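A structural sketch of the MultiBoosting idea: boost within resampled training sets, then pool every weighted member into one voting committee, layering bagging-style variance reduction around boosting-style bias reduction. This is illustrative only (it uses plain bootstrap sampling and one-dimensional decision stumps for brevity, not the published algorithm):

```python
import math
import random

def best_stump(xs, ys, ws):
    """Lowest weighted-error single-threshold classifier on 1-D data."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if (sign if x >= t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, lambda x: sign if x >= t else -sign

def adaboost(xs, ys, rounds=5):
    """Plain AdaBoost over decision stumps; labels are +1/-1."""
    ws = [1.0 / len(xs)] * len(xs)
    committee = []
    for _ in range(rounds):
        err, h = best_stump(xs, ys, ws)
        err = max(err, 1e-9)            # avoid log(0) on a perfect stump
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        committee.append((alpha, h))
        ws = [w * math.exp(-alpha * y * h(x)) for w, x, y in zip(ws, xs, ys)]
        z = sum(ws)
        ws = [w / z for w in ws]
    return committee

def multiboost(xs, ys, bags=3, rounds=5, seed=0):
    """Boost within each bootstrap sample, then pool all weighted members."""
    rng = random.Random(seed)
    committee = []
    for _ in range(bags):
        idx = [rng.randrange(len(xs)) for _ in xs]
        committee += adaboost([xs[i] for i in idx], [ys[i] for i in idx], rounds)
    return lambda x: 1 if sum(a * h(x) for a, h in committee) >= 0 else -1
```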


The Knowledge Factory is an interactive machine learning environment that provides tight integration between machine learning and knowledge acquisition from experts. This pioneering research program included one of the few detailed experimental demonstrations that human-in-the-loop machine learning can outperform both autonomous machine learning and an expert specifying rules without the assistance of machine learning. more


Many machine-learning researchers have utilized Occam’s razor (also frequently spelt as Ockham’s razor), preferring less complex classifiers in the belief that doing so is likely to reduce prediction error. I believe that this is misguided and provide philosophical and experimental support for this opinion. more


Decision tree grafting is a postprocess that adds tests and nodes to existing decision trees in order to improve accuracy.   It achieves this through bagging-like variance reduction, while providing a single directly interpretable decision tree. more


Generality is predictive of prediction accuracy.  I argue that manipulation of generality, through appropriate generalization and specialization, can modify the performance of a classifier in predictable and useful ways. more


Prepending is a generic approach to decision list learning that has lower computational cost and develops shorter decision lists than the classical approach, without any general increase in prediction error. more


Discretization for Naive Bayes.  Due to its attribute independence assumption, naive Bayes has distinct requirements of a discretization strategy.  This work provides theoretical analysis of those requirements and new techniques that improve classification accuracy. more
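For illustration, equal-frequency binning is one simple discretization strategy; the techniques developed in this work (such as proportional k-interval discretization) instead tie the number of intervals to the amount of training data:

```python
def equal_frequency_bins(values, n_bins):
    """Cut points placing roughly equal numbers of values in each bin."""
    s = sorted(values)
    return [s[len(s) * i // n_bins] for i in range(1, n_bins)]

def discretize(v, cuts):
    """Map a numeric value to the index of its interval."""
    return sum(v >= c for c in cuts)
```

Because naive Bayes estimates each attribute's conditional distribution independently, the bias-variance trade-off of the interval count differs from that of other learners, which is what the theoretical analysis characterises.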


Impact Rules [also known as quantitative association rules] provide analysis similar to association rules except that the target is a distribution on a numeric value.  Impact Rules support data segmentation for optimisation of a numeric outcome. more
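One simple impact measure (an illustrative choice, not necessarily the one used in the software) weights a segment's lift in mean outcome by its coverage:

```python
def impact(records, target, condition):
    """Impact of a segment: coverage * (segment mean - overall mean)
    of the numeric `target` field, for records satisfying `condition`."""
    vals = [r[target] for r in records]
    seg = [r[target] for r in records if condition(r)]
    if not seg:
        return 0.0
    overall = sum(vals) / len(vals)
    return len(seg) * (sum(seg) / len(seg) - overall)
```

Ranking candidate conditions by this score surfaces segments where the numeric outcome is most elevated (or, with negation, most depressed).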


Bias and variance provide a powerful conceptual tool for analyzing classification performance.   Previous approaches to conducting bias-variance experiments have provided little control over the types of data distribution from which bias and variance are estimated.   I have developed new techniques for bias-variance analysis that provide greater control over the data distribution.  Experiments show that the type of distribution used for bias-variance experiments can greatly affect the results obtained. more
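The general experimental recipe can be sketched as a Monte-Carlo estimate in which the caller supplies the training-set sampling procedure, and therefore controls the data distribution; `sample_train` and `learner` here are placeholder callables, and squared-error decomposition stands in for the classification-oriented definitions:

```python
import random
import statistics

def bias_variance(sample_train, learner, x, true_y, trials=200, seed=0):
    """Monte-Carlo estimate of squared bias and variance of a learner's
    prediction at point x.  `sample_train(rng)` draws a training set;
    `learner(data)` returns a predictor."""
    rng = random.Random(seed)
    preds = [learner(sample_train(rng))(x) for _ in range(trials)]
    mean_pred = statistics.fmean(preds)
    bias_sq = (mean_pred - true_y) ** 2
    variance = statistics.fmean((p - mean_pred) ** 2 for p in preds)
    return bias_sq, variance
```

Because `sample_train` is an explicit parameter, the experimenter can vary the distribution from which training sets are drawn and observe how the bias and variance estimates change.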

Feature construction (also known as constructive induction or attribute discovery) is a form of data enrichment that adds derived features to data. Our pioneering research demonstrated that feature construction can allow machine learning systems to construct more accurate models across a wide range of learning tasks. more
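A minimal example of the idea: enriching numeric records with pairwise product features, a simple construction that can make interaction effects learnable by otherwise linear methods:

```python
from itertools import combinations

def add_product_features(rows):
    """Append the product of every pair of attributes to each row."""
    return [list(r) + [a * b for a, b in combinations(r, 2)] for r in rows]
```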


Feature Based Modeling is a generic approach to agent modeling using attribute-value machine learning. Applications in student modeling have demonstrated high prediction accuracy.  This program produced some of the earliest examples of inspectable or glass-box user models and student models and used associations for modeling before association rules were popularised. more


Program Visualisation for Novice Programmers.  Providing novice programmers with a concrete model of an abstract machine and displaying program execution on that machine improves their acquisition of programming skills and knowledge. more


Computational Biology. Along with colleagues in the Monash Faculty of Medicine, Nursing and Health Sciences, I am investigating applications of data science in biology. more


Engineering Applications. Along with colleagues in Engineering at Monash and Deakin Universities, I am investigating applications of data science in process control. more