Concept Drift: Learning From Non-Stationary Distributions

The world is in a constant state of flux, yet machine learning systems typically learn static models from historical data. Failing to account for this dynamism can lead to suboptimal performance when these models of the past are used to predict the present or future. This research investigates this phenomenon, known as concept drift, and how it is best addressed.

Our software for generating synthetic data streams with abrupt drift can be downloaded here.

Our system for describing the concept drift present in real-world data can be downloaded here.

Our Extremely Fast Decision Tree (EFDT) module for MOA can be downloaded here. A third-party implementation of EFDT is included in scikit-learn: link.

Publications

Computing Divergences between Discrete Decomposable Models.
Lee, L. K., Piatkowski, N., Petitjean, F., & Webb, G. I.
Proceedings of the AAAI Conference on Artificial Intelligence, 37(10), 12243-12251, 2023.
[Bibtex] [Abstract]  → Access on publisher site

@Article{Lee2023,
author = {Lee, Loong Kuan and Piatkowski, Nico and Petitjean, Francois and Webb, Geoffrey I.},
journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
title = {Computing Divergences between Discrete Decomposable Models},
year = {2023},
month = {Jun.},
number = {10},
pages = {12243-12251},
volume = {37},
abstract = {There are many applications that benefit from computing the exact divergence between 2 discrete probability measures, including machine learning. Unfortunately, in the absence of any assumptions on the structure or independencies within these distributions, computing the divergence between them is an intractable problem in high dimensions. We show that we are able to compute a wide family of functionals and divergences, such as the alpha-beta divergence, between two decomposable models, i.e. chordal Markov networks, in time exponential to the treewidth of these models. The alpha-beta divergence is a family of divergences that include popular divergences such as the Kullback-Leibler divergence, the Hellinger distance, and the chi-squared divergence. Thus, we can accurately compute the exact values of any of this broad class of divergences to the extent to which we can accurately model the two distributions using decomposable models.},
doi = {10.1609/aaai.v37i10.26443},
keywords = {Concept Drift, scalable graphical models},
related = {learning-from-non-stationary-distributions},
}
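
For intuition, the divergences named in the abstract are easy to compute by brute force when the joint distribution is small enough to tabulate; the difficulty the paper addresses is computing them exactly when the state space is exponentially large. The sketch below is illustrative only and is not the paper's algorithm (all function names and toy factors are ours): it enumerates two three-variable models and evaluates the Kullback-Leibler, Hellinger and chi-squared divergences between them, at a cost exponential in the number of variables.

# Brute-force divergences between two fully tabulated discrete joints.
# Illustrative only: the enumeration is exponential in the number of
# variables, which is the blow-up the paper's treewidth-based method avoids
# for decomposable (chordal Markov network) models.
import itertools
import numpy as np

def enumerate_joint(factors, n_vars, arity=2):
    """Tabulate and normalise a product of factors over all assignments."""
    table = {}
    for x in itertools.product(range(arity), repeat=n_vars):
        p = 1.0
        for scope, fn in factors:
            p *= fn(*[x[i] for i in scope])
        table[x] = p
    z = sum(table.values())
    return {x: v / z for x, v in table.items()}

def kl(p, q):
    return sum(p[x] * np.log(p[x] / q[x]) for x in p if p[x] > 0)

def hellinger(p, q):
    return np.sqrt(0.5 * sum((np.sqrt(p[x]) - np.sqrt(q[x])) ** 2 for x in p))

def chi_squared(p, q):
    return sum((p[x] - q[x]) ** 2 / q[x] for x in p if q[x] > 0)

# Two toy 3-variable models with different pairwise interactions.
P = enumerate_joint([((0, 1), lambda a, b: 1.5 if a == b else 0.5),
                     ((1, 2), lambda b, c: 2.0 if b == c else 1.0)], n_vars=3)
Q = enumerate_joint([((0, 1), lambda a, b: 1.0),
                     ((1, 2), lambda b, c: 3.0 if b == c else 1.0)], n_vars=3)

print(f"KL(P||Q)    = {kl(P, Q):.4f}")
print(f"Hellinger   = {hellinger(P, Q):.4f}")
print(f"Chi^2(P||Q) = {chi_squared(P, Q):.4f}")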

Computing Marginal and Conditional Divergences between Decomposable Models with Applications.
Lee, L. K., Webb, G. I., Schmidt, D. F., & Piatkowski, N.
Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 239-248, 2023.
[Bibtex] [Abstract]  → Access on publisher site

@InProceedings{Lee2023a,
author = {Lee, Loong Kuan and Webb, Geoffrey I and Schmidt, Daniel F and Piatkowski, Nico},
booktitle = {Proceedings of the IEEE International Conference on Data Mining (ICDM)},
title = {Computing Marginal and Conditional Divergences between Decomposable Models with Applications},
year = {2023},
organization = {IEEE},
pages = {239-248},
abstract = {The ability to compute the exact divergence between two high-dimensional distributions is useful in many applications but doing so naively is intractable. Computing the alpha-beta divergence—a family of divergences that includes the Kullback-Leibler divergence and Hellinger distance—between the joint distribution of two decomposable models, i.e chordal Markov networks, can be done in time exponential in the treewidth of these models. However, reducing the dissimilarity between two high-dimensional objects to a single scalar value can be uninformative. Furthermore, in applications such as supervised learning, the divergence over a conditional distribution might be of more interest. Therefore, we propose an approach to compute the exact alpha-beta divergence between any marginal or conditional distribution of two decomposable models. Doing so tractably is non-trivial as we need to decompose the divergence between these distributions and therefore, require a decomposition over the marginal and conditional distributions of these models. Consequently, we provide such a decomposition and also extend existing work to compute the marginal and conditional alpha-beta divergence between these decompositions. We then show how our method can be used to analyze distributional changes by first applying it to a benchmark image dataset. Finally, based on our framework, we propose a novel way to quantify the error in contemporary superconducting quantum computers. Code for all experiments is available at: https://lklee.dev/pub/2023-icdm/code.},
creationdate = {2024-02-20T12:50:37},
doi = {10.1109/ICDM58522.2023.00033},
keywords = {Concept Drift, scalable graphical models},
related = {learning-from-non-stationary-distributions},
}
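
Again for intuition only: the marginal and conditional divergences the paper computes for decomposable models can be written down directly for a tiny tabulated joint. The sketch below is ours, not the paper's decomposition; it computes the KL divergence between the X-marginals and the conditional KL divergence of Y given X, averaged over P(X).

# Marginal and conditional KL divergence between two tiny tabulated joints,
# computed by brute force. The paper computes such quantities exactly for
# decomposable models without enumerating the joint.
import numpy as np

# Joint distributions over (X, Y) with X, Y binary: rows index X, columns Y.
P = np.array([[0.30, 0.20],
              [0.10, 0.40]])
Q = np.array([[0.25, 0.25],
              [0.25, 0.25]])

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Marginal divergence over X: compare the X-marginals of the two joints.
kl_marginal_x = kl(P.sum(axis=1), Q.sum(axis=1))

# Conditional divergence of Y given X, averaged over P(X):
#   sum_x P(x) * KL( P(Y|x) || Q(Y|x) )
p_x = P.sum(axis=1)
kl_conditional = sum(p_x[x] * kl(P[x] / P[x].sum(), Q[x] / Q[x].sum())
                     for x in range(P.shape[0]))

print(f"KL of X-marginals     : {kl_marginal_x:.4f}")
print(f"Conditional KL (Y | X): {kl_conditional:.4f}")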

Extremely Fast Hoeffding Adaptive Tree.
Manapragada, C., Salehi, M., & Webb, G. I.
Proceedings of the 2022 IEEE International Conference on Data Mining (ICDM), pp. 319-328, 2022.
[Bibtex] [Abstract]  → Download PDF  → Access on publisher site

@InProceedings{ManapragadaEtAl22,
author = {Manapragada, Chaitanya and Salehi, Mahsa and Webb, Geoffrey I.},
booktitle = {Proceedings of the 2022 IEEE International Conference on Data Mining (ICDM)},
title = {Extremely Fast Hoeffding Adaptive Tree},
year = {2022},
pages = {319-328},
publisher = {IEEE},
abstract = {Many real-world data streams are non-stationary. Subject to concept drift, the distributions change over time. To retain accuracy in the face of such drift, online decision tree learners must discard parts of the tree that are no longer accurate and replace them by new subtrees that reflect the new distribution. The longstanding state-of-the-art online decision tree learner for non-stationary streams is Hoeffding Adaptive Tree (HAT), which adds a drift detection and response mechanism to the classic Very Fast Decision Tree (VFDT) online decision tree learner. However, for stationary distributions, VFDT has been superseded by Extremely Fast Decision Tree (EFDT), which uses a statistically more efficient learning mechanism than VFDT. This learning mechanism needs to be coupled with a compensatory revision mechanism that can compensate for circumstances where the learning mechanism is too eager. The current work develops a strategy to combine the best of both these state-of-the-art approaches, exploiting both the statistically efficient learning mechanism from EFDT and the highly effective drift detection and response mechanism of HAT. To do so requires decoupling of the EFDT splitting and revision mechanisms, as the latter incorrectly triggers the HAT drift detection mechanism. The resulting learner, Extremely Fast Hoeffding Adaptive Tree, responds to drift more rapidly and effectively than either HAT or EFDT, and attains a statistically significant advantage in accuracy even on stationary streams.},
comment = {ICDM-2022 Best Paper Runner-up Award},
doi = {10.1109/ICDM54844.2022.00042},
keywords = {Concept Drift, efficient ml},
related = {learning-from-non-stationary-distributions},
}
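
HAT's key ingredient is a change detector that watches the prediction error of each subtree and triggers replacement when the error rate rises significantly (HAT itself uses ADWIN). As a much simpler illustration of the same idea, and emphatically not HAT or the new learner, the sketch below monitors a stream of 0/1 prediction errors with a DDM-style test (in the spirit of Gama et al., 2004) and signals drift when the error rate climbs well above its historical minimum.

# A deliberately simple error-rate drift detector. When it fires, the
# affected model (or subtree) would be replaced with one trained on
# recent data.
import math
import random

class SimpleDDM:
    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed one 0/1 prediction error; return 'drift', 'warning' or None."""
        self.n += 1
        self.errors += error
        if self.n < self.min_samples:
            return None
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3.0 * self.s_min:
            self.reset()               # drift: start tracking the new concept
            return "drift"
        if p + s >= self.p_min + 2.0 * self.s_min:
            return "warning"
        return None

# Usage: errors are rare at first, then the concept changes and errors spike.
random.seed(0)
detector = SimpleDDM()
for t in range(2000):
    err = int(random.random() < (0.10 if t < 1000 else 0.45))
    if detector.update(err) == "drift":
        print(f"drift signalled at instance {t}")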

An eager splitting strategy for online decision trees in ensembles.
Manapragada, C., Gomes, H. M., Salehi, M., Bifet, A., & Webb, G. I.
Data Mining and Knowledge Discovery, 36, 566-619, 2022.
[Bibtex] [Abstract]  → Access on publisher site

@Article{Manapragada_2022,
author = {Chaitanya Manapragada and Heitor M. Gomes and Mahsa Salehi and Albert Bifet and Geoffrey I. Webb},
journal = {Data Mining and Knowledge Discovery},
title = {An eager splitting strategy for online decision trees in ensembles},
year = {2022},
pages = {566-619},
volume = {36},
abstract = {Decision tree ensembles are widely used in practice. In this work, we study in ensemble settings the effectiveness of replacing the split strategy for the state-of-the-art online tree learner, Hoeffding Tree, with a rigorous but more eager splitting strategy that we had previously published as Hoeffding AnyTime Tree. Hoeffding AnyTime Tree (HATT), uses the Hoeffding Test to determine whether the current best candidate split is superior to the current split, with the possibility of revision, while Hoeffding Tree aims to determine whether the top candidate is better than the second best and if a test is selected, fixes it for all posterity. HATT converges to the ideal batch tree while Hoeffding Tree does not. We find that HATT is an efficacious base learner for online bagging and online boosting ensembles. On UCI and synthetic streams, HATT as a base learner outperforms HT at a 0.05 significance level for the majority of tested ensembles on what we believe is the largest and most comprehensive set of testbenches in the online learning literature. Our results indicate that HATT is a superior alternative to Hoeffding Tree in a large number of ensemble settings.},
doi = {10.1007/s10618-021-00816-x},
keywords = {Concept Drift},
publisher = {Springer Science and Business Media {LLC}},
related = {learning-from-non-stationary-distributions},
url = {https://rdcu.be/c1y4Z},
}
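
The two split strategies contrasted in the abstract can be summarised with the Hoeffding bound. The schematic below omits the real learners' handling of ties, grace periods and numeric attributes; the function names are ours. Hoeffding Tree splits only when the best candidate beats the runner-up and never revisits the decision, whereas Hoeffding AnyTime Tree splits, and later revises, whenever the best candidate beats the split currently in place.

# Schematic split decisions. G(.) is the split criterion (e.g. information
# gain) and R its range.
import math

def hoeffding_bound(R, delta, n):
    """With probability 1 - delta, the true mean of n observations of a
    quantity with range R lies within epsilon of the observed mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def ht_should_split(g_best, g_second_best, R, delta, n):
    # Hoeffding Tree: split once the best candidate beats the runner-up,
    # then fix the decision for all posterity.
    return g_best - g_second_best > hoeffding_bound(R, delta, n)

def hatt_should_split(g_best, g_current_split, R, delta, n):
    # Hoeffding AnyTime Tree / EFDT: split (or replace the current split)
    # as soon as the best candidate beats the split currently in place,
    # accepting that the decision may later be revised.
    return g_best - g_current_split > hoeffding_bound(R, delta, n)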

Beyond Adaptation: Understanding Distributional Changes (Dagstuhl Seminar 20372).
Krempl, G., Hofer, V., Webb, G., & Hüllermeier, E.
Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
[Bibtex]  → Access on publisher site

@TechReport{Krempl2021,
author = {Krempl, Georg and Hofer, Vera and Webb, Geoffrey and Hullermeier, Eyke},
institution = {Schloss Dagstuhl - Leibniz-Zentrum fur Informatik},
title = {Beyond Adaptation: Understanding Distributional Changes ({Dagstuhl} Seminar 20372)},
year = {2021},
copyright = {Creative Commons Attribution 3.0 Unported license},
doi = {10.4230/DAGREP.10.4.1},
keywords = {Statistical Machine Learning, Data Streams, Concept Drift, Non-Stationary Non-IID Data, Change Mining, Dagstuhl Seminar, Theory of computation - Machine learning theory, Mathematics of computing - Time series analysis, Computing methodologies - Multi-task learning, Computing methodologies - Learning under covariate shift, Computing methodologies - Lifelong machine learning},
related = {learning-from-non-stationary-distributions},
}

An Incremental Construction of Deep Neuro Fuzzy System for Continual Learning of Non-stationary Data Streams.
Pratama, M., Pedrycz, W., & Webb, G. I.
IEEE Transactions on Fuzzy Systems, 28(7), 1315-1328, 2020.
[Bibtex]  → Access on publisher site

@Article{Pratama19,
author = {Pratama, M. and Pedrycz, W. and Webb, G. I.},
journal = {IEEE Transactions on Fuzzy Systems},
title = {An Incremental Construction of Deep Neuro Fuzzy System for Continual Learning of Non-stationary Data Streams},
year = {2020},
issn = {1063-6706},
number = {7},
pages = {1315-1328},
volume = {28},
doi = {10.1109/TFUZZ.2019.2939993},
keywords = {Concept Drift},
related = {learning-from-non-stationary-distributions},
}

PCA-based drift and shift quantification framework for multidimensional data.
Goldenberg, I., & Webb, G. I.
Knowledge and Information Systems, 62, 2835-2854, 2020.
[Bibtex] [Abstract]  → Access on publisher site

@Article{Goldenberg2020,
author = {Goldenberg, Igor and Webb, Geoffrey I.},
journal = {Knowledge and Information Systems},
title = {PCA-based drift and shift quantification framework for multidimensional data},
year = {2020},
pages = {2835-2854},
volume = {62},
abstract = {Concept drift is a serious problem confronting machine learning systems in a dynamic and ever-changing world. In order to manage concept drift it may be useful to first quantify it by measuring the distance between distributions that generate data before and after a drift. There is a paucity of methods to do so in the case of multidimensional numeric data. This paper provides an in-depth analysis of the PCA-based change detection approach, identifies shortcomings of existing methods and shows how this approach can be used to measure a drift, not merely detect it.},
doi = {10.1007/s10115-020-01438-3},
keywords = {Concept Drift},
related = {learning-from-non-stationary-distributions},
}
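
One plausible instantiation of the general approach, though not the paper's exact measure, is to fit PCA on a reference window, project both windows onto the principal components, and compare the per-component distributions. In the sketch below all names, the Hellinger comparison and the variance weighting are illustrative choices of ours.

# Rough PCA-based drift quantification between a reference window and a
# current window of multidimensional numeric data.
import numpy as np
from sklearn.decomposition import PCA

def pca_drift_score(reference, current, n_components=3, bins=20):
    pca = PCA(n_components=n_components).fit(reference)
    ref_proj = pca.transform(reference)
    cur_proj = pca.transform(current)
    score = 0.0
    for k in range(n_components):
        lo = min(ref_proj[:, k].min(), cur_proj[:, k].min())
        hi = max(ref_proj[:, k].max(), cur_proj[:, k].max())
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(ref_proj[:, k], bins=edges)
        q, _ = np.histogram(cur_proj[:, k], bins=edges)
        p = p / p.sum()
        q = q / q.sum()
        # Hellinger distance between the binned per-component distributions,
        # weighted by the variance explained by that component.
        h = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
        score += pca.explained_variance_ratio_[k] * h
    return score

rng = np.random.default_rng(1)
before = rng.normal(0.0, 1.0, size=(1000, 5))
after = before + rng.normal(0.0, 0.1, size=(1000, 5)) + np.array([1.0, 0, 0, 0, 0])
print(f"drift score (shifted data) : {pca_drift_score(before, after):.3f}")
print(f"drift score (same process) : {pca_drift_score(before, rng.normal(0.0, 1.0, size=(1000, 5))):.3f}")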

Survey of distance measures for quantifying concept drift and shift in numeric data.
Goldenberg, I., & Webb, G. I.
Knowledge and Information Systems, 60(2), 591-615, 2019.
[Bibtex] [Abstract]  → Access on publisher site

@Article{Goldenberg2018,
Title = {Survey of distance measures for quantifying concept drift and shift in numeric data},
Author = {Goldenberg, Igor and Webb, Geoffrey I.},
Journal = {Knowledge and Information Systems},
Year = {2019},
Number = {2},
Pages = {591-615},
Volume = {60},
Abstract = {Deployed machine learning systems are necessarily learned from historical data and are often applied to current data. When the world changes, the learned models can lose fidelity. Such changes to the statistical properties of data over time are known as concept drift. Similarly, models are often learned in one context, but need to be applied in another. This is called concept shift. Quantifying the magnitude of drift or shift, especially in the context of covariate drift or shift, or unsupervised learning, requires use of measures of distance between distributions. In this paper, we survey such distance measures with respect to their suitability for estimating drift and shift magnitude between samples of numeric data.},
Doi = {10.1007/s10115-018-1257-z},
ISSN = {0219-3116},
Keywords = {Concept Drift},
Related = {learning-from-non-stationary-distributions}
}
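
For a flavour of the kind of measures compared, the snippet below computes a few distances between two samples of numeric data; which measures the paper actually recommends, and under what conditions, is the subject of the survey itself.

# A few sample-based distances between two numeric samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_before = rng.normal(loc=0.0, scale=1.0, size=5000)
sample_after = rng.normal(loc=0.3, scale=1.2, size=5000)   # drifted sample

# Kolmogorov-Smirnov statistic: maximum gap between empirical CDFs.
ks = stats.ks_2samp(sample_before, sample_after).statistic

# Wasserstein (earth mover's) distance between the empirical distributions.
w1 = stats.wasserstein_distance(sample_before, sample_after)

# Total variation distance on a shared equal-width binning.
edges = np.histogram_bin_edges(np.concatenate([sample_before, sample_after]), bins=30)
p, _ = np.histogram(sample_before, bins=edges)
q, _ = np.histogram(sample_after, bins=edges)
tv = 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

print(f"KS statistic    : {ks:.3f}")
print(f"Wasserstein-1   : {w1:.3f}")
print(f"Total variation : {tv:.3f}")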

Adaptive Online Extreme Learning Machine by Regulating Forgetting Factor by Concept Drift Map.
Yu, H., & Webb, G. I.
Neurocomputing, 343, 141-153, 2019.
[Bibtex] [Abstract]  → Access on publisher site

@Article{yuetal2019,
Title = {Adaptive Online Extreme Learning Machine by Regulating Forgetting Factor by Concept Drift Map},
Author = {Yu, Hualong and Webb, Geoffrey I},
Journal = {Neurocomputing},
Year = {2019},
Pages = {141-153},
Volume = {343},
Abstract = {In online-learning, the data is incrementally received and the distributions from which it is drawn may keep changing over time. This phenomenon is widely known as concept drift. Such changes may affect the generalization of a learned model to future data. This problem may be exacerbated by the form of the drift itself changing over time. Quantitative measures to describe and analyze the concept drift have been proposed in previous work. A description composed from these measures is called a concept drift map. We believe that these maps could be useful for guiding how much knowledge in the old model should be forgotten. Therefore, this paper presents an adaptive online learning model that uses a concept drift map to regulate the forgetting factor of an extreme learning machine. Specifically, when a batch of new instances are labeled, the distribution of each class on each attribute is firstly estimated, and then it is compared with the distribution estimated in the previous batch to calculate the magnitude of concept drift, which is further used to regulate the forgetting factor and to update the learning model. Therefore, the novelty of this paper lies in that a quantitative distance metric between two distributions constructed on continuous attribute space is presented to construct concept drift map which can be further associated with the forgetting factor to make the learning model adapt the concept drift. Experimental results on several benchmark stream data sets show the proposed model is generally superior to several previous algorithms when classifying a variety of data streams subject to drift, indicating its effectiveness and feasibility.},
Doi = {10.1016/j.neucom.2018.11.098},
Keywords = {Concept Drift},
Publisher = {Elsevier},
Related = {learning-from-non-stationary-distributions}
}
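
The mechanism can be caricatured in a few lines: estimate how far the current batch's distribution has moved from the previous batch's, then forget faster when the movement is larger. The sketch below is an illustrative mapping only; the paper's per-class, per-attribute construction of the drift map and its coupling to an online extreme learning machine are considerably more elaborate.

# Toy drift-regulated forgetting factor.
import numpy as np

def hellinger_per_attribute(batch_prev, batch_curr, bins=15):
    """Mean Hellinger distance between per-attribute histograms of two batches."""
    dists = []
    for j in range(batch_prev.shape[1]):
        lo = min(batch_prev[:, j].min(), batch_curr[:, j].min())
        hi = max(batch_prev[:, j].max(), batch_curr[:, j].max())
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(batch_prev[:, j], bins=edges)
        q, _ = np.histogram(batch_curr[:, j], bins=edges)
        p, q = p / p.sum(), q / q.sum()
        dists.append(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))
    return float(np.mean(dists))

def forgetting_factor(drift_magnitude, min_factor=0.5):
    """Large drift -> forget faster (smaller factor); no drift -> factor near 1."""
    return max(min_factor, 1.0 - drift_magnitude)

rng = np.random.default_rng(42)
prev_batch = rng.normal(0.0, 1.0, size=(500, 4))
curr_batch = rng.normal(0.8, 1.0, size=(500, 4))     # abruptly drifted batch
d = hellinger_per_attribute(prev_batch, curr_batch)
print(f"drift magnitude = {d:.3f}, forgetting factor = {forgetting_factor(d):.3f}")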

Analyzing concept drift and shift from sample data.
Webb, G. I., Lee, L. K., Goethals, B., & Petitjean, F.
Data Mining and Knowledge Discovery, 32(5), 1179-1199, 2018.
[Bibtex] [Abstract]  → Download PDF  → Access on publisher site

@Article{WebbEtAl18,
Title = {Analyzing concept drift and shift from sample data},
Author = {Webb, Geoffrey I and Lee, Loong Kuan and Goethals, Bart and Petitjean, Francois},
Journal = {Data Mining and Knowledge Discovery},
Year = {2018},
Number = {5},
Pages = {1179-1199},
Volume = {32},
Abstract = {Concept drift and shift are major issues that greatly affect the accuracy and reliability of many real-world applications of machine learning. We propose a new data mining task, concept drift mapping - the description and analysis of instances of concept drift or shift. We argue that concept drift mapping is an essential prerequisite for tackling concept drift and shift. We propose tools for this purpose, arguing for the importance of quantitative descriptions of drift and shift in marginal distributions. We present quantitative concept drift mapping techniques, along with methods for visualizing their results. We illustrate their effectiveness for real-world applications across energy-pricing, vegetation monitoring and airline scheduling.},
Doi = {10.1007/s10618-018-0554-1},
Keywords = {Concept Drift},
Related = {learning-from-non-stationary-distributions},
Url = {http://rdcu.be/IUTI}
}
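
A concept drift map in its simplest form assigns each attribute a drift magnitude between two time windows. The toy sketch below uses total variation distance over categorical marginals; the attribute names and windows are invented, and the paper's techniques and visualisations go well beyond this.

# Per-attribute marginal drift between two time windows of records.
from collections import Counter

def marginal_drift(window_a, window_b, attribute):
    """Total variation distance between the attribute's marginals in two windows."""
    ca = Counter(row[attribute] for row in window_a)
    cb = Counter(row[attribute] for row in window_b)
    na, nb = sum(ca.values()), sum(cb.values())
    values = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[v] / na - cb[v] / nb) for v in values)

window_2017 = [{"route": "MEL-SYD", "delayed": "no"}] * 80 + \
              [{"route": "MEL-SYD", "delayed": "yes"}] * 20
window_2018 = [{"route": "MEL-SYD", "delayed": "no"}] * 55 + \
              [{"route": "MEL-SYD", "delayed": "yes"}] * 45

for attr in ("route", "delayed"):
    print(attr, round(marginal_drift(window_2017, window_2018, attr), 3))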

On the Inter-Relationships among Drift Rate, Forgetting Rate, Bias/Variance Profile and Error.
Zaidi, N. A., Webb, G. I., Petitjean, F., & Forestier, G.
arxiv, 1801.09354, 2018.
[Bibtex] [Abstract]  → Access on publisher site

@Article{ZaidiEtAl18b,
Title = {On the Inter-Relationships among Drift Rate, Forgetting Rate, Bias/Variance Profile and Error},
Author = {Zaidi, Nayyar A. and Webb, Geoffrey I. and Petitjean, Francois and Forestier, Germain},
Journal = {arxiv},
Year = {2018},
Pages = {1801.09354},
Abstract = {We propose two general and falsifiable hypotheses about expectations on generalization error when learning in the context of concept drift. One posits that as drift rate increases, the forgetting rate that minimizes generalization error will also increase and vice versa. The other posits that as a learner's forgetting rate increases, the bias/variance profile that minimizes generalization error will have lower variance and vice versa. These hypotheses lead to the concept of the sweet path, a path through the 3-d space of alternative drift rates, forgetting rates and bias/variance profiles on which generalization error will be minimized, such that slow drift is coupled with low forgetting and low bias, while rapid drift is coupled with fast forgetting and low variance. We present experiments that support the existence of such a sweet path. We also demonstrate that simple learners that select appropriate forgetting rates and bias/variance profiles are highly competitive with the state-of-the-art in incremental learners for concept drift on real-world drift problems.},
Keywords = {Concept Drift},
Related = {learning-from-non-stationary-distributions},
Url = {https://arxiv.org/abs/1801.09354}
}

Extremely Fast Decision Tree.
Manapragada, C., Webb, G. I., & Salehi, M.
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 1953–1962, 2018.
[Bibtex] [Abstract]  → Download PDF  → Access on publisher site

@InProceedings{ManapragadaEtAl18,
author = {Manapragada, Chaitanya and Webb, Geoffrey I. and Salehi, Mahsa},
booktitle = {Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
title = {Extremely Fast Decision Tree},
year = {2018},
address = {New York, NY, USA},
pages = {1953--1962},
publisher = {ACM},
series = {KDD '18},
abstract = {We introduce a novel incremental decision tree learning algorithm, Hoeffding Anytime Tree, that is statistically more efficient than the current state-of-the-art, Hoeffding Tree. We demonstrate that an implementation of Hoeffding Anytime Tree---"Extremely Fast Decision Tree", a minor modification to the MOA implementation of Hoeffding Tree---obtains significantly superior prequential accuracy on most of the largest classification datasets from the UCI repository. Hoeffding Anytime Tree produces the asymptotic batch tree in the limit, is naturally resilient to concept drift, and can be used as a higher accuracy replacement for Hoeffding Tree in most scenarios, at a small additional computational cost.},
acmid = {3220005},
doi = {10.1145/3219819.3220005},
isbn = {978-1-4503-5552-0},
keywords = {Concept Drift, efficient ml},
location = {London, United Kingdom},
related = {learning-from-non-stationary-distributions},
}
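
Prequential accuracy, referred to in the abstract, is measured with a test-then-train loop: each arriving instance is first predicted and only then learned from. A minimal version is sketched below; scikit-learn's SGDClassifier is just a stand-in incremental learner, and the synthetic stream is ours (EFDT itself is available in MOA and in third-party stream-learning libraries).

# Minimal prequential (test-then-train) evaluation loop.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(7)

def make_instance(t, drift_at=5000):
    """Synthetic stream with an abrupt drift: the decision boundary flips."""
    x = rng.normal(size=2)
    y = int(x[0] + x[1] > 0) if t < drift_at else int(x[0] - x[1] > 0)
    return x, y

learner = SGDClassifier(learning_rate="constant", eta0=0.1)
classes = np.array([0, 1])
correct, seen = 0, 0

for t in range(10000):
    x, y = make_instance(t)
    if seen > 0:                                   # test first...
        correct += int(learner.predict(x.reshape(1, -1))[0] == y)
    learner.partial_fit(x.reshape(1, -1), [y], classes=classes)   # ...then train
    seen += 1

print(f"prequential accuracy: {correct / (seen - 1):.3f}")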

Characterizing Concept Drift.
Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L., & Petitjean, F.
Data Mining and Knowledge Discovery, 30(4), 964-994, 2016.
[Bibtex] [Abstract]  → Download PDF  → Access on publisher site

@Article{WebbEtAl16,
author = {Webb, G. I. and Hyde, R. and Cao, H. and Nguyen, H. L. and Petitjean, F.},
journal = {Data Mining and Knowledge Discovery},
title = {Characterizing Concept Drift},
year = {2016},
number = {4},
pages = {964-994},
volume = {30},
abstract = {Most machine learning models are static, but the world is dynamic, and increasing online deployment of learned models gives increasing urgency to the development of efficient and effective mechanisms to address learning in the context of non-stationary distributions, or as it is commonly called concept drift. However, the key issue of characterizing the different types of drift that can occur has not previously been subjected to rigorous definition and analysis. In particular, while some qualitative drift categorizations have been proposed, few have been formally defined, and the quantitative descriptions required for precise and objective understanding of learner performance have not existed. We present the first comprehensive framework for quantitative analysis of drift. This supports the development of the first comprehensive set of formal definitions of types of concept drift. The formal definitions clarify ambiguities and identify gaps in previous definitions, giving rise to a new comprehensive taxonomy of concept drift types and a solid foundation for research into mechanisms to detect and address concept drift.},
doi = {10.1007/s10618-015-0448-4},
keywords = {Concept Drift},
related = {learning-from-non-stationary-distributions},
url = {https://rdcu.be/7vMN},
}
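
In simplified notation, and as a paraphrase rather than the paper's exact formulation, the central quantitative notions are as follows: write $P_t(X, Y)$ for the joint distribution over covariates and class at time $t$, and $D$ for a suitable distance between distributions (the paper discusses appropriate choices).

% Simplified paraphrase of the quantitative drift notions.
\text{Drift occurs between times } t \text{ and } u \iff P_t(X, Y) \neq P_u(X, Y)
\qquad
\text{Drift magnitude: } \sigma_{t,u} = D\bigl(P_t(X, Y),\, P_u(X, Y)\bigr)
\qquad
\text{Real (concept) drift: } P_t(Y \mid X) \neq P_u(Y \mid X)
\qquad
\text{Covariate (virtual) drift: } P_t(X) \neq P_u(X)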

Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data.
Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 1031-1036, 2014.
[Bibtex] [Abstract]  → Download PDF  → Access on publisher site

@InProceedings{Webb14,
author = {Webb, G. I.},
booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
title = {Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data},
year = {2014},
pages = {1031-1036},
abstract = {Discretization of streaming data has received surprisingly little attention. This might be because streaming data require incremental discretization with cutpoints that may vary over time and this is perceived as undesirable. We argue, to the contrary, that it can be desirable for a discretization to evolve in synchronization with an evolving data stream, even when the learner assumes that attribute values' meanings remain invariant over time. We examine the issues associated with discretization in the context of distribution drift and develop computationally efficient incremental discretization algorithms. We show that discretization can reduce the error of a classical incremental learner and that allowing a discretization to drift in synchronization with distribution drift can further reduce error.},
doi = {10.1109/ICDM.2014.123},
keywords = {Concept Drift and Discretization and Incremental Learning and Stream mining},
related = {learning-from-non-stationary-distributions},
}
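
The idea of letting a discretization drift with the stream can be illustrated with a toy window-based equal-frequency discretizer whose cutpoints are recomputed as the window slides. This is only a sketch of the general idea, with names of our choosing; it is not one of the sound and efficient incremental algorithms developed in the paper.

# Toy incremental discretizer: equal-frequency cutpoints over a sliding window.
from collections import deque
import numpy as np

class WindowedEqualFrequencyDiscretizer:
    def __init__(self, n_bins=5, window=1000):
        self.n_bins = n_bins
        self.window = deque(maxlen=window)
        self.cutpoints = None

    def update(self, value):
        """Observe one numeric value and refresh the cutpoints."""
        self.window.append(value)
        qs = np.linspace(0, 100, self.n_bins + 1)[1:-1]
        self.cutpoints = np.percentile(list(self.window), qs)

    def discretize(self, value):
        """Map a numeric value to a bin index under the current cutpoints."""
        return int(np.searchsorted(self.cutpoints, value))

rng = np.random.default_rng(3)
disc = WindowedEqualFrequencyDiscretizer()
for t in range(5000):
    x = rng.normal(0.0 if t < 2500 else 3.0, 1.0)   # mean shifts mid-stream
    disc.update(x)
print("cutpoints after drift:", np.round(disc.cutpoints, 2))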