We are looking for a talented postdoc to work with us on a number of research projects on Time Series Classification and Machine Learning from Data Streams. Details here.

# Category: Uncategorized

## PhD on understanding the evolution of Earth from space

With the second satellite of the Sentinel-2 mission launched in 2017, there is an incredible opportunity for the right student to become a world leader in how to analyse and make sense of this vast amount of data. The project is anticipated to make contributions to the theory of machine learning, with applications to the study of vegetation in general and agriculture in particular. The project remains open, however, if the successful candidate has other applications at heart (e.g., landslide or fire prediction). This project is fully funded.

There is a paradigm shift in the way we can observe our planet: new-generation satellites (Sentinel-2, Landsat-8) now image Earth completely, every 5 days, at high resolution, and at no charge to end-users. It is not yet possible to tap the full value of this data, as existing machine learning methods for classifying time series cannot scale to such vast volumes.

Temporal land-cover maps assign unique labels to geographic areas, describing their evolution over time. One of today's key challenges is how to automatically produce these maps from the growing torrent of satellite data, to monitor Earth's highly dynamic systems [a-h]. Presently, state-of-the-art research into time series classification lags behind the demands of the latest space missions, which produce terabytes of data each day. Why? Most research into time series classification has been done with datasets that hold no more than 10 thousand time series [i]. In contrast, the Sentinel-2 mission gathers over 10 *trillion* time series, capturing Earth's land surfaces and coastal waters at resolutions of 10-60 m. Although much research has gone into classifying remote sensing images, few studies have analysed the time series extracted from sequences of satellite images.
This project aims to create the machine learning technologies necessary to analyse series of satellite images, and to produce accurate temporal land-cover maps of our planet. Potential high-value applications for Australia include fire prevention, agricultural planning, and mining site monitoring and rehabilitation.

## Shortlisted for Eureka Prize for Excellence in Data Science

I am honoured to have been shortlisted for the inaugural Eureka Prize for Excellence in Data Science for my ARC funded project "Combining generative and discriminative strategies to facilitate efficient and effective learning from big data."

## Finding real associations with R

Our OPUS Miner package is now available in R. It finds statistically significant complex interactions in data. We would value your feedback. It can be downloaded from https://cran.r-project.org/package=opusminer.

## Elected to ACM SIGKDD Board of Directors

I am honoured to have been elected to the Board of the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). I am looking forward to working with the new Board, led by incoming Chair Jian Pei, to further strengthen our community's leading conference and to support the community around it.

## Encyclopedia of Machine Learning and Data Mining

We are delighted to have the second edition of the highly successful Encyclopedia of Machine Learning go live. The revised and expanded second edition has been re-titled the Encyclopedia of Machine Learning and Data Mining.

## Usama Fayyad's and my keynote addresses at Practical Big Data 2017

## Encyclopedia of Machine Learning still most downloaded Springer Reference

It is good to see that the first edition of our Encyclopedia of Machine Learning is still serving the community.

#SpringerRefCountdown! #1 most downloaded entry last month: https://t.co/hZ2j7FJQmN from our Encyclopedia of #MachineLearning!

— SpringerReference (@SpringerRef) January 31, 2017

Springer is now taking preorders for the next edition, The Encyclopedia of Machine Learning and Data Mining.

## Two awards in one week!

I am honoured to have received the Australian Computer Society's ICT Researcher of the Year Award and the Australasian Artificial Intelligence Distinguished Research Contributions Award.

## Symposium on Data Science for Good Governing, Dec 5

## Fun KDD video

## Statistical testing of hypothesis streams and cascades

Statistical hypothesis testing was developed in an age when calculations were performed by hand and individual hypotheses were considered in isolation.

In the modern information era, it is often desirable to assess large numbers of related hypotheses. Unless explicit steps are taken to control the risk, it becomes a near certainty that, when many of the null hypotheses one seeks to reject are in fact true, many will be rejected in error. Multiple testing corrections control this risk, but previous approaches have required that the hypotheses to be assessed (or at least an upper bound on their number) be known in advance.
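To make the contrast concrete, here is a minimal sketch of a classical correction, the Holm-Bonferroni step-down procedure. Note that it needs the complete set of p-values (and hence the number of tests, m) before any rejection decision can be made, which is exactly what a hypothesis stream does not provide. The function name and example p-values are illustrative, not from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction.

    Controls the familywise error rate at level alpha, but requires the
    full list of p-values (and thus the number of tests) up front.
    Returns a list of booleans: True where the null hypothesis is rejected.
    """
    m = len(p_values)
    reject = [False] * m
    # Consider the hypotheses from smallest p-value upwards.
    for rank, (p, idx) in enumerate(sorted((p, i) for i, p in enumerate(p_values))):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all remaining nulls are retained
    return reject

print(holm_bonferroni([0.001, 0.01, 0.04, 0.30]))  # [True, True, False, False]
```

The third p-value (0.04) would pass an uncorrected test at alpha = 0.05, but the correction retains its null hypothesis because 0.04 exceeds the adjusted threshold 0.05/2 = 0.025.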

Sometimes, however, there is a need to test some hypotheses before other related hypotheses are known. We call this a *hypothesis stream*. In some hypothesis streams, the decision of which hypotheses to reject at one time affects which hypotheses will be considered subsequently. We call this a *hypothesis cascade*.

One context in which hypothesis cascades arise is *model selection*, such as *forward feature selection*. This process starts with an empty model. Then, at each step all models are considered that add a single feature to the current model. A hypothesis test is conducted on every such new model. The model with the best test result (lowest *p*) is then accepted, and the process is repeated considering all single feature additions to this new model. This is repeated until no improvement is statistically significant. The model selected at each round determines the models to be tested subsequently, and it is not possible to determine in advance how many rounds it will take until the process terminates and hence how many statistical tests will be performed.
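The cascade structure of forward selection can be sketched as follows. This is an illustrative toy, not the paper's procedure: it greedily adds the feature whose correlation with the current residual has the lowest p-value, using ordinary uncorrected tests, and stops when no addition is significant. The function name, the test statistic, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

def forward_select(X, y, alpha=0.05):
    """Greedy forward feature selection as a hypothesis cascade.

    At each round, every single-feature extension of the current model is
    tested; the feature with the lowest p-value is added, and the process
    stops when no addition is statistically significant. Which hypotheses
    are tested in round k+1 depends on which was accepted in round k, and
    the total number of tests is unknown in advance.
    """
    n, d = X.shape
    selected = []
    residual = y - y.mean()
    while len(selected) < d:
        candidates = [j for j in range(d) if j not in selected]
        # One hypothesis test per candidate single-feature addition.
        p_vals = {j: stats.pearsonr(X[:, j], residual)[1] for j in candidates}
        best = min(p_vals, key=p_vals.get)
        if p_vals[best] >= alpha:
            break  # no significant improvement: the cascade terminates
        selected.append(best)
        # Refit least squares on the selected features; recompute the residual.
        A = np.column_stack([X[:, selected], np.ones(n)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
print(forward_select(X, y))  # picks up features 0 and 2
```

Because each round performs uncorrected tests whose very existence depends on earlier rejections, naively running this procedure does not control the familywise error rate; that is the gap the correction described below is designed to fill.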

We have developed a statistical testing procedure that strictly controls the risk of familywise error - the risk of incorrectly rejecting any null hypothesis out of the entire stream. It does so without requiring any information other than the history of the analysis so far and the current pool of hypotheses under consideration. Studies of its performance in the context of selecting the edges to include in a graphical model indicate that it achieves this strict control while maintaining high statistical power.

Statistical hypothesis testing provides a powerful and mature toolkit for data analytics. Techniques such as ours help bring these tools to bear in the age of big data.

**Watch the promotional video for our paper:**

http://www.youtube.com/watch?v=FBlhhebFhTI

**Publication**

A multiple test correction for streams and cascades of statistical hypothesis tests.

Webb, G. I., & Petitjean, F.

Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 1255-1264, 2016.

**Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to the ACM TKDD journal KDD-16 special issue**


```
@InProceedings{WebbPetitjean16,
Title = {A multiple test correction for streams and cascades of statistical hypothesis tests},
Author = {Webb, Geoffrey I and Petitjean, Francois},
Booktitle = {Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16},
Year = {2016},
Pages = {1255-1264},
Publisher = {ACM Press},
Abstract = {Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.
To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.},
Comment = {Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to the ACM TKDD journal KDD-16 special issue},
Doi = {10.1145/2939672.2939775},
Keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
Related = {statistically-sound-association-discovery},
Url = {http://dl.acm.org/authorize?N19100}
}
```

**ABSTRACT** Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e., rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance. This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed. To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models. We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations.
Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.