My research combines fundamental and translational data science in a virtuous cycle. My translational research applies data science to solve impactful problems in other domains. This process often reveals serious limitations in the state of the art. I address these in my fundamental research, which develops application-agnostic methods for deriving knowledge and insight from data. These methods are in turn evaluated in the context of my translational research.
Fundamental data science
My fundamental data science research investigates how to use data to best support effective evidence-based decision making and derive useful knowledge and insight. My research spans artificial intelligence, machine learning, data mining, data analytics and big data. This overview page lists some key themes and provides links to papers and software.
Scalable learning of time series classifiers. Time series classification is an important data analysis task. Our research is motivated by the challenge of monitoring environmental variables from satellite images at global scale. This requires learning from many millions of time series and classifying many billions. The previous state of the art does not scale to these magnitudes. We have revolutionised time series classification by developing new technologies that do. more
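To give a flavour of how classification at this scale becomes feasible, the sketch below transforms each series once with cheap random convolutional kernels and trains a fast linear classifier on the resulting features. This is a simplified illustration in the spirit of this line of work, not a faithful rendition of any published method; the kernel sizes, feature choices and data are all invented for the example.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)

def random_kernel_features(X, n_kernels=100):
    """Map each series to features from random convolutional kernels:
    the maximum response and the proportion of positive responses."""
    n, _ = X.shape
    feats = np.zeros((n, 2 * n_kernels))
    for k in range(n_kernels):
        w = rng.normal(size=rng.choice([7, 9, 11]))
        w -= w.mean()                                  # zero-mean kernel
        b = rng.uniform(-1.0, 1.0)
        for i in range(n):
            conv = np.convolve(X[i], w, mode="valid") + b
            feats[i, 2 * k] = conv.max()               # strongest response
            feats[i, 2 * k + 1] = (conv > 0).mean()    # fraction positive
    return feats

# Toy usage: 200 series of length 150, with a pattern injected into class 0.
X = rng.normal(size=(200, 150))
X[:100] += np.sin(np.linspace(0, 6, 150))
y = np.repeat([0, 1], 100)
clf = RidgeClassifierCV().fit(random_kernel_features(X), y)
```

The key design point is that the expensive transform is applied once per series, after which training and classification reduce to fast linear algebra.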
Learning From Non-Stationary Distributions. The world is dynamic, in a constant state of flux, while machine learning systems typically learn static models from historical data. Failure to account for the dynamic nature of the world may result in sub-optimal performance when these models of the past are used to predict the present or future. This research investigates the phenomenon of concept drift and how it is best addressed. more
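As a concrete, hedged illustration of the problem (not of our published methods), the sketch below monitors a model's error on a stream and retrains on a recent window whenever error rises well above the level observed after the last retrain; the class name, window sizes and threshold are all invented for the example.

```python
import numpy as np
from collections import deque
from sklearn.linear_model import SGDClassifier

class DriftAwareLearner:
    """Illustrative drift handling: retrain on a recent window whenever
    recent error rises well above the post-retrain baseline."""

    def __init__(self, window=500, warmup=50, tolerance=0.1):
        self.buffer = deque(maxlen=window)   # recent labelled examples
        self.errors = deque(maxlen=100)      # recent 0/1 prediction errors
        self.baseline = None                 # error level after last retrain
        self.warmup, self.tolerance = warmup, tolerance
        self.model, self.fitted = SGDClassifier(loss="log_loss"), False

    def observe(self, x, y):
        x = np.asarray(x)
        if self.fitted:
            self.errors.append(int(self.model.predict([x])[0] != y))
            recent = float(np.mean(self.errors))
            if self.baseline is None and len(self.errors) == self.errors.maxlen:
                self.baseline = recent       # settle on a baseline error
            if self.baseline is not None and recent > self.baseline + self.tolerance:
                self._retrain()              # drift signal: error has risen
        self.buffer.append((x, y))
        if not self.fitted and len(self.buffer) >= self.warmup:
            self._retrain()

    def _retrain(self):
        X = np.array([x for x, _ in self.buffer])
        Y = np.array([y for _, y in self.buffer])
        if len(set(Y)) > 1:                  # need both classes to fit
            self.model = SGDClassifier(loss="log_loss").fit(X, Y)
            self.fitted, self.baseline = True, None
            self.errors.clear()
```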
Association Discovery. I have pioneered association discovery techniques that seek the most useful associations, rather than applying the minimum-support constraint more commonly used in the field. Many of these techniques are included in my Magnum Opus software, which is now incorporated in BigML. more
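The contrast with support-based mining can be made concrete. The naive toy search below (a sketch only, not Magnum Opus) scores every small rule by leverage, the difference between observed joint support and the support expected under independence, and keeps only the k best rather than filtering on a support threshold.

```python
from itertools import combinations

def top_k_rules(transactions, k=10, max_lhs=2):
    """Exhaustively score lhs -> rhs rules by leverage; keep the k best."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        s = set(itemset)
        return sum(1 for t in transactions if s <= t) / n

    scored = []
    for size in range(1, max_lhs + 1):
        for lhs in combinations(items, size):
            for rhs in items:
                if rhs in lhs:
                    continue
                # leverage: observed joint support minus independence baseline
                lev = support(lhs + (rhs,)) - support(lhs) * support((rhs,))
                scored.append((lev, lhs, rhs))
    return sorted(scored, reverse=True)[:k]

# Hypothetical usage on a tiny transaction database:
T = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter", "bread"}, {"milk"}]
for lev, lhs, rhs in top_k_rules(T, k=3):
    print(f"{set(lhs)} -> {rhs}  leverage={lev:.3f}")
```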
Scalable learning of Graphical Models. Graphical models are powerful descriptions of joint probability distributions. We have developed techniques for efficiently scaling exact methods to thousands of variables. more
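Our own techniques are more general, but the textbook Chow-Liu algorithm (not our method) gives a feel for how exactness and scale can coexist: restricted to tree-structured models, the exact maximum-likelihood structure is found via a maximum spanning tree over pairwise mutual information.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(X):
    """X: (n_samples, n_vars) discrete data. Returns the edges of the exact
    maximum-likelihood tree-structured model (Chow-Liu)."""
    X = np.asarray(X)
    n, d = X.shape
    mi = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            counts = {}
            for a, b in zip(X[:, i], X[:, j]):
                counts[(a, b)] = counts.get((a, b), 0) + 1
            pi = {a: np.mean(X[:, i] == a) for a in np.unique(X[:, i])}
            pj = {b: np.mean(X[:, j] == b) for b in np.unique(X[:, j])}
            mi[i, j] = sum((c / n) * np.log((c / n) / (pi[a] * pj[b]))
                           for (a, b), c in counts.items())
    # Maximum spanning tree over MI == minimum spanning tree over -MI.
    # (Pairs with exactly zero MI are treated as absent edges; acceptable
    # for a sketch.)
    tree = minimum_spanning_tree(-mi)
    return list(zip(*tree.nonzero()))
```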
Earth Observation Analytics. The Monash Earth Observation Analytics Group is developing advanced technologies to extract more accurate information from satellite observations. more
Combining Generative and Discriminative Learning. Generative learning is often more efficient than discriminative learning, but has substantially higher bias. We are developing techniques for exploiting the information efficiently learned by generative learning in order to assist the process of discriminative learning. more
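One simple instance of the general idea (an illustration only, not the specific technique we have developed): fit a cheap generative model first, then feed its per-class log-likelihoods to a discriminative learner as extra features.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Sketch: augment the inputs of a discriminative model (logistic regression)
# with the per-class log-probabilities of a generative model (naive Bayes).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(Xtr, ytr)
Ztr = np.hstack([Xtr, nb.predict_log_proba(Xtr)])   # add generative scores
Zte = np.hstack([Xte, nb.predict_log_proba(Xte)])

lr = LogisticRegression(max_iter=1000).fit(Ztr, ytr)
print("hybrid accuracy:", lr.score(Zte, yte))
```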
Learning Complex Conditional Probabilities from Data. Naive Bayes is a popular machine learning technique due to its efficiency, direct theoretical foundation and strong classification performance. My techniques improve its accuracy by overcoming the deficiencies of its attribute independence assumption. AnDE, AODE, LBR and related papers and software. Scientific applications of AODE.
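AODE, for example, averages over a family of one-dependence models, each relaxing independence by conditioning every attribute on the class plus one "super-parent" attribute. The sketch below is a minimal rendering with simplified smoothing and no frequency thresholds, so it should not be read as the reference implementation.

```python
import numpy as np

class AODESketch:
    """Minimal AODE: average super-parent one-dependence estimators.
    Smoothing is simplified relative to the published method."""

    def fit(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)
        self.classes = np.unique(self.y)
        self.n, self.d = self.X.shape
        return self

    def predict_one(self, x):
        scores = {}
        for c in self.classes:
            total = 0.0
            for sp in range(self.d):        # each attribute as super-parent
                base = (self.y == c) & (self.X[:, sp] == x[sp])
                # P(y, x_sp), Laplace-smoothed
                p = (base.sum() + 1) / (self.n + len(self.classes))
                for i in range(self.d):     # P(x_i | y, x_sp) for the rest
                    if i != sp:
                        match = base & (self.X[:, i] == x[i])
                        p *= (match.sum() + 1) / (base.sum() + 2)
                total += p
            scores[c] = total / self.d      # average the one-dependence models
        return max(scores, key=scores.get)

model = AODESketch().fit([[0, 1], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1])
print(model.predict_one([0, 1]))   # -> 0
```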
OPUS is an efficient branch and bound search algorithm for exploring the space of conjunctive patterns. It can optimise arbitrary objective functions so long as they provide useful bounds. It supports extremely fast rule discovery. more
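A stripped-down illustration of the branch-and-bound principle (OPUS itself adds careful search-space reorganisation that this sketch omits): with the objective "positives covered minus negatives covered", the positives covered by a conjunction bound the score of every refinement, so whole branches can be pruned.

```python
def opus_like_search(pos, neg, items):
    """pos, neg: lists of sets of items. Returns (score, conjunction)
    maximising |pos covered| - |neg covered| via branch and bound."""
    best = (float("-inf"), ())

    def expand(conj, candidates, pos_cov, neg_cov):
        nonlocal best
        score = len(pos_cov) - len(neg_cov)
        if score > best[0]:
            best = (score, conj)
        for idx, item in enumerate(candidates):
            p = [e for e in pos_cov if item in e]
            if len(p) <= best[0]:              # optimistic bound: prune branch
                continue
            n = [e for e in neg_cov if item in e]
            # Pass only later candidates so each conjunction is visited once.
            expand(conj + (item,), candidates[idx + 1:], p, n)

    expand((), tuple(items), list(pos), list(neg))
    return best

# Hypothetical usage:
pos = [{"a", "b"}, {"a", "c"}, {"a"}]
neg = [{"b"}, {"b", "c"}]
print(opus_like_search(pos, neg, ["a", "b", "c"]))  # -> (3, ('a',))
```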
Big models for big data. Early machine learning used very small data sets. Data science often involves very large data sets. Work on 'scaling-up' to large data sets has concentrated on reducing the computational complexity of existing algorithms. We contend that this may not be appropriate, that learning from large data sets is fundamentally different to learning from small data sets and that different types of algorithm may be most effective. more
Ensemble Learning. We have shown that error can be substantially reduced through Multi-Strategy Ensemble Learning, which involves combining multiple ensemble learning techniques. MultiBoosting (also known as Boost Bagging) combines boosting and bagging, obtaining most of boosting’s superior bias reduction together with most of bagging’s superior variance reduction. We have shown that randomization of the ensemble classifiers can further reduce error. Feating is a new approach to ensemble learning that combines local rather than global models. To our knowledge it is the only generic ensemble learning technique that can boost the accuracy of stable learners such as SVM and NB. More on MultiBoosting and Multi-Strategy Ensemble Learning. More on Feating.
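As a rough, hedged approximation (MultiBoosting proper interleaves the two strategies far more tightly than this), one can get the flavour of combining boosting's bias reduction with bagging's variance reduction by bagging small boosted committees:

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Illustration only: each bagged member is itself a small boosted committee
# of decision stumps, so the ensemble applies both strategies at once.
X, y = make_classification(n_samples=500, random_state=0)
committee = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                               n_estimators=10)
multi = BaggingClassifier(committee, n_estimators=10, random_state=0)
print(cross_val_score(multi, X, y, cv=5).mean())
```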
The Knowledge Factory is an interactive machine learning environment that provides tight integration between machine learning and knowledge acquisition from experts. This pioneering research program included one of the few detailed experimental demonstrations that human-in-the-loop machine learning can outperform both autonomous machine learning and an expert specifying rules without the assistance of machine learning. more
Many machine-learning researchers have invoked Occam’s razor (also frequently spelt Ockham’s razor), preferring less complex classifiers in the belief that doing so is likely to reduce prediction error. I believe that this is misguided and provide philosophical and experimental support for this position. more
Decision tree grafting is a postprocess that adds tests and nodes to existing decision trees in order to improve accuracy. It achieves this through bagging-like variance reduction, while providing a single directly interpretable decision tree. more
Generality is predictive of prediction accuracy. I argue that manipulation of generality, through appropriate generalization and specialization, can modify the performance of a classifier in predictable and useful ways. more
Prepending is a generic approach to decision list learning that has lower computational cost and develops shorter decision lists than the classical approach, without any general increase in prediction error. more
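The published algorithm is more sophisticated, but the greedy sketch below conveys the mechanics: start from a default class, then repeatedly prepend the single-condition rule that corrects the most current mistakes, so later-learned rules take precedence.

```python
def prepend_list(X, y, max_rules=10):
    """X: list of attribute-value vectors, y: list of labels.
    Greedy prepending: the most recently added rule fires first."""
    n, d = len(X), len(X[0])
    labels = sorted(set(y))
    default = max(labels, key=y.count)       # most common class as default
    rules = []                               # rules[0] has highest precedence

    def classify(x):
        for (f, v), label in rules:
            if x[f] == v:
                return label
        return default

    for _ in range(max_rules):
        current = [classify(x) for x in X]
        best, best_gain = None, 0
        for f in range(d):
            for v in {x[f] for x in X}:
                match = [i for i in range(n) if X[i][f] == v]
                for label in labels:
                    fixed = sum(1 for i in match
                                if y[i] == label and current[i] != label)
                    broken = sum(1 for i in match
                                 if y[i] != label and current[i] == y[i])
                    if fixed - broken > best_gain:
                        best, best_gain = ((f, v), label), fixed - broken
        if best is None:
            break                            # no rule improves the list
        rules.insert(0, best)                # prepend: new rule fires first
    return rules, default

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]
print(prepend_list(X, y))   # -> ([((0, 1), 1)], 0)
```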
Discretization for Naive Bayes. Due to its attribute independence assumption, naive Bayes has distinct requirements of a discretization strategy. This work provides theoretical analysis of those requirements and new techniques that improve classification accuracy. more
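The basic pipeline looks like the sketch below: equal-frequency bins feeding a categorical naive Bayes. The research above analyses which strategy best suits naive Bayes (for instance, how bin counts trade off bias and variance); none of that analysis is shown here.

```python
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Equal-frequency (quantile) discretization into 5 ordinal bins per feature.
Xb = KBinsDiscretizer(n_bins=5, encode="ordinal",
                      strategy="quantile").fit_transform(X).astype(int)
nb = CategoricalNB(min_categories=5)   # tolerate bins unseen in a training fold
print("accuracy:", cross_val_score(nb, Xb, y, cv=10).mean())
```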
Impact Rules [also known as quantitative association rules] provide analysis similar to association rules except that the target is a distribution on a numeric value. Impact Rules support data segmentation for optimisation of a numeric outcome. more
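A toy version of the idea (illustrative only, not the published technique): score each single-condition segment by "impact", here taken as coverage times the difference between the segment's mean of the numeric target and the overall mean.

```python
import numpy as np

def impact_rules(records, target, min_cover=5):
    """records: list of dicts; target: a numeric field. Rank single-condition
    segments by impact = coverage * (segment mean - overall mean)."""
    overall = np.mean([r[target] for r in records])
    keys = {k for r in records for k in r if k != target}
    rules = []
    for k in keys:
        for v in {r[k] for r in records if k in r}:
            seg = [r[target] for r in records if r.get(k) == v]
            if len(seg) >= min_cover:
                rules.append((len(seg) * (np.mean(seg) - overall),
                              f"{k}={v}", len(seg), float(np.mean(seg))))
    return sorted(rules, reverse=True)   # highest-impact segments first
```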
Bias and variance provide a powerful conceptual tool for analyzing classification performance. Previous approaches to conducting bias-variance experiments have provided little control over the types of data distribution from which bias and variance are estimated. I have developed new techniques for bias-variance analysis that provide greater control over the data distribution. Experiments show that the type of distribution used for bias-variance experiments can greatly affect the results obtained. more
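To make the experimental design concrete, here is one hedged sketch: training sets are drawn by subsampling a pool (just one of the distributions an experimenter might control), and 0-1 loss is decomposed in the style of Domingos, with bias measured against the modal ("main") prediction.

```python
import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, random_state=0)
pool, Xte = X[:1500], X[1500:]
ypool, yte = y[:1500], y[1500:]

rng = np.random.default_rng(0)
preds = []
for _ in range(50):
    # The training distribution: subsamples of a fixed pool. Varying this
    # (e.g. disjoint samples, bootstrap samples) can change the results.
    idx = rng.choice(len(pool), size=500, replace=False)
    model = DecisionTreeClassifier().fit(pool[idx], ypool[idx])
    preds.append(model.predict(Xte))
preds = np.array(preds)                        # (n_models, n_test)

main = stats.mode(preds, axis=0).mode.ravel()  # modal prediction per test case
bias = np.mean(main != yte)                    # main prediction is wrong
variance = np.mean(preds != main)              # disagreement with main
print(f"bias={bias:.3f} variance={variance:.3f}")
```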
Feature construction (also known as constructive induction or attribute discovery) enriches data by adding derived features. These are useful in a data analysis pipeline if they capture relevant relationships within the data that downstream processes are unable to readily model or exploit. Our pioneering research demonstrated that feature construction can empower machine learning systems to construct more accurate models across a wide range of learning tasks. more
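A minimal demonstration of the principle (not of our systems): on data that a linear model cannot separate, constructing pairwise product features makes the relationship linearly expressible.

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score

# Concentric circles: no linear boundary exists in the raw features, but the
# constructed features x1^2, x1*x2, x2^2 expose the radial structure.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.5, random_state=0)
plain = LogisticRegression()
enriched = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
print("plain:   ", cross_val_score(plain, X, y, cv=5).mean())
print("enriched:", cross_val_score(enriched, X, y, cv=5).mean())
```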
Feature Based Modeling is a generic approach to agent modeling using attribute-value machine learning. Applications in student modeling have demonstrated high prediction accuracy. This program produced some of the earliest examples of inspectable or glass-box user models and student models and used associations for modeling before association rules were popularised. more
Translational research
My translational research applies data science to solve impactful problems in other domains.
Computational Biology. Along with colleagues in the Monash Faculty of Medicine, Nursing and Health Sciences, I am investigating applications of data science in biology. This research has twice led to my being recognised as Australia's leading Bioinformatics and Computational Biology researcher. more
Health. I have investigated a wide range of applications of data science to health, ranging from analysis of medical imaging to diagnostics. My research with the Alfred Hospital led to a revision to Medical Emergency Team protocols that saves $500,000 per annum while improving clinical outcomes. more
Engineering. Along with colleagues in Engineering at Monash and Deakin Universities, I am investigating applications of data science in engineering. more
Program Visualisation for Novice Programmers. Providing novice programmers with a concrete model of an abstract machine and displaying program execution on that machine improves their acquisition of programming skills and knowledge. more