

Learning From Non-Stationary Distributions

The world is dynamic, in a constant state of flux, while machine learning systems typically learn static models from historical data. Failing to account for the dynamic nature of the world can result in sub-optimal performance when these models of the past are used to predict the present or future. This research investigates this phenomenon, known as concept drift, and how it is best addressed.

Our software for generating synthetic data streams with abrupt drift can be downloaded here.

Our system for describing the concept drift present in real-world data can be downloaded here.
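As a toy illustration of the kind of stream the generator above produces (a minimal sketch for intuition only, not our released software), a two-feature stream whose concept switches abruptly at a chosen point can be built in a few lines:

```python
import numpy as np

def abrupt_drift_stream(n_before, n_after, seed=0):
    """Generate a toy data stream with one abrupt concept drift.

    Before the drift point the class is determined by the first
    feature (x0 > 0.5); after it, by the second (x1 > 0.5).
    """
    rng = np.random.default_rng(seed)
    n = n_before + n_after
    X = rng.random((n, 2))
    y = np.empty(n, dtype=int)
    y[:n_before] = (X[:n_before, 0] > 0.5).astype(int)   # old concept
    y[n_before:] = (X[n_before:, 1] > 0.5).astype(int)   # new concept
    return X, y

X, y = abrupt_drift_stream(1000, 1000)
```

A learner trained on the first half of such a stream will perform at chance on the second half, which is exactly the failure mode abrupt drift induces.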

Publications

Understanding Concept Drift.
Webb, G. I., Lee, L. K., Petitjean, F., & Goethals, B.
Technical report arXiv:1704.00362, 2017.
[URL] [Bibtex] [Abstract]

@TechReport{WebbEtAl17,
Title = {Understanding Concept Drift},
Author = {Geoffrey I. Webb and Loong Kuan Lee and Francois Petitjean and Bart Goethals},
Year = {2017},
Number = {arXiv:1704.00362},
Abstract = {Concept drift is a major issue that greatly affects the accuracy and reliability of many real-world applications of machine learning. We argue that to tackle concept drift it is important to develop the capacity to describe and analyze it. We propose tools for this purpose, arguing for the importance of quantitative descriptions of drift in marginal distributions. We present quantitative drift analysis techniques along with methods for communicating their results. We demonstrate their effectiveness by application to three real-world learning tasks.},
Keywords = {Concept Drift},
Url = {https://arxiv.org/abs/1704.00362}
}
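The quantitative descriptions of drift in marginal distributions that the paper argues for can be grounded in a distance between empirical distributions across two time windows; total variation distance is one standard choice. The sketch below is illustrative only (a single discrete attribute, no smoothing), not the paper's tooling:

```python
from collections import Counter

def total_variation(window_a, window_b):
    """Total variation distance between the empirical distributions of a
    discrete attribute in two time windows: 0 = identical, 1 = disjoint."""
    pa, pb = Counter(window_a), Counter(window_b)
    na, nb = len(window_a), len(window_b)
    support = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[v] / na - pb[v] / nb) for v in support)

# Drift magnitude of a categorical feature between two windows:
total_variation(list("aaab"), list("abbb"))  # -> 0.5
```

Tracking such a distance between a reference window and the current window over time yields a drift-magnitude curve, one simple way to communicate when and how strongly a marginal distribution is drifting.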

Characterizing Concept Drift.
Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L., & Petitjean, F.
Data Mining and Knowledge Discovery, 30(4), 964-994, 2016.
[PDF] [DOI] [Bibtex] [Abstract]

@Article{WebbEtAl16,
Title = {Characterizing Concept Drift},
Author = {G.I. Webb and R. Hyde and H. Cao and H.L. Nguyen and F. Petitjean},
Journal = {Data Mining and Knowledge Discovery},
Year = {2016},
Number = {4},
Pages = {964-994},
Volume = {30},
Abstract = {Most machine learning models are static, but the world is dynamic, and increasing online deployment of learned models gives increasing urgency to the development of efficient and effective mechanisms to address learning in the context of non-stationary distributions, or as it is commonly called concept drift. However, the key issue of characterizing the different types of drift that can occur has not previously been subjected to rigorous definition and analysis. In particular, while some qualitative drift categorizations have been proposed, few have been formally defined, and the quantitative descriptions required for precise and objective understanding of learner performance have not existed. We present the first comprehensive framework for quantitative analysis of drift. This supports the development of the first comprehensive set of formal definitions of types of concept drift. The formal definitions clarify ambiguities and identify gaps in previous definitions, giving rise to a new comprehensive taxonomy of concept drift types and a solid foundation for research into mechanisms to detect and address concept drift.},
Doi = {10.1007/s10618-015-0448-4},
Keywords = {Concept Drift},
Related = {learning-from-non-stationary-distributions},
Url = {http://arxiv.org/abs/1511.03816},
Urltext = {Link to prepublication draft}
}

Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data.
Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 1031-1036, 2014.
[PDF] [URL] [Bibtex] [Abstract]

@InProceedings{Webb14,
Title = {Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data},
Author = {G.I. Webb},
Booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
Year = {2014},
Pages = {1031-1036},
Abstract = {Discretization of streaming data has received surprisingly
little attention. This might be because streaming data
require incremental discretization with cutpoints that may vary
over time and this is perceived as undesirable. We argue, to
the contrary, that it can be desirable for a discretization to
evolve in synchronization with an evolving data stream, even
when the learner assumes that attribute values' meanings remain
invariant over time. We examine the issues associated with
discretization in the context of distribution drift and develop
computationally efficient incremental discretization algorithms.
We show that discretization can reduce the error of a classical
incremental learner and that allowing a discretization to drift in
synchronization with distribution drift can further reduce error.},
Keywords = {Concept Drift and Discretization and Incremental Learning and Stream mining},
Related = {learning-from-non-stationary-distributions},
Url = {http://dx.doi.org/10.1109/ICDM.2014.123}
}
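A minimal illustration of a discretization that evolves in synchronization with the stream (a naive window-based sketch for intuition only; re-sorting the window on every update is inefficient, and the paper develops computationally efficient incremental algorithms):

```python
from collections import deque
import bisect

class WindowDiscretizer:
    """Toy incremental discretizer: keeps a sliding window of recent
    values and recomputes equal-frequency cutpoints from it, so the
    bins drift along with the distribution."""

    def __init__(self, n_bins=4, window=1000):
        self.n_bins = n_bins
        self.window = deque(maxlen=window)
        self.cutpoints = []

    def update(self, value):
        """Add a new observation and refresh the cutpoints."""
        self.window.append(value)
        ordered = sorted(self.window)           # naive: O(n log n) per update
        n = len(ordered)
        self.cutpoints = [ordered[(i * n) // self.n_bins]
                          for i in range(1, self.n_bins)]

    def discretize(self, value):
        """Map a value to its current bin index (0 .. n_bins - 1)."""
        return bisect.bisect_right(self.cutpoints, value)

d = WindowDiscretizer(n_bins=4, window=100)
for v in range(100):
    d.update(v)
```

Because the window slides, values that land in one bin today may land in another after the distribution drifts, which is precisely the behavior the paper argues can reduce error under distribution drift.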