In general, approaches to cope with concept drift can be classified into two categories:
- approaches that adapt a learner at regular intervals without considering whether changes have really occurred.
- approaches that first detect concept changes and afterwards the learner is adapted to these changes.
In the fist approach, drifting concepts are often handled by time windows or weighted examples according to their age or utility.
- Weighted examples are based on the simple idea that the importance often example should decrease with time.
- When a tie window is used, at each time step the learner is induced only from the examples that are included in the window.
- The difficult is how to select the appropriate window’s size.
- On the other end, a large window would produce good and stable learning results in stable phases but can not react quickly to concept changes.
In late approaches with the aim of detecting concept changes, some indicators are monitored over time.
- A typical example is to monitor the evolution of a statistical function between two distributions: from past data in a reference window and in a current window of the most recent data points.
- If during the monitoring process a concept drift is detected, some actions to adapt the learner to these changes can be taken.
- When a time window of adaptive size is used these actions usually lead to adjusting the window’s size according to the extent of concept drift.
Change detection algorithms
- the Statistical Process Control (SPC)
- the ADaptative WINDdowing (ADWIN)
- the Fixed Cumulative Windows Model (FCWM)
- a standard algorithm for change detection, the Page-Hinkley Test (PHT)
Statistical Process Control (SPC)
This drift detection method controls online the trace of the probability of error for streaming observations.
- While monitoring the error, it defines a warning and a drift level.
- When the error exceeds the first (lower) threshold, the system enters in a warning mode and stores the time,
, of the corresponding observation.
- If the error drops below the threshold again, the warning mode is cancelled.
- However, if in a sequence of examples, the error increases reaching the second (higher) threshold at time
, a change in the distribution of the examples is declared.
- The classifier is retained using only the examples since
and the warning and drift levels are reset.
ADaptive WINDowing (ADWIN)
The ADaptative WINDowing method keeps a sliding window (with length
) with the most recently received examples and compares the distribution on two sub-windows of
.
- Whenever two large enough sub-windows,
and
, exhibit distinct enough averages, the older sub-window is dropped and a change in the distribtuion of exapmles is assigned.
- The window cut threshold is computed as follows:
with
, where
and
denote the lengths of
and
.
- A confidence value
is used within the algorithm, which establishes a bound on the false positive rate.
- A confidence value
Fixed Cumulative Windows Model (FCWM)
- To summarize data, it first constructs histograms using the two layer structure of the Partition Incremental Discretization (PiD) algorithm, which as designed to learn histograms from high-speed data streams.
- The change detection problem is addressed by monitoring distributions in two different time windows: a reference window (RW), reflecting the distribution observed in the past; and a current window (CW) which receives the most recent data.
- In order to assess drifts, both distributions are compared using the Kullback-Leibler divergence (KLD), defining a threshold for change detection decision based on the asymmetry of this measure.
Page Hinkley Test (PHT)
The Page-Hinkley test (PHT) is a sequential analysis technique typically used for monitor change detection.
- It allows efficient detection of changes in the normal behaviur of a process which is established by a model.
- The PHT was designed to detect a change in the average of a Gaussian signal.
- This test considers a cumulative variable
defined as the cumulated difference between the observed values and their mean till the current moment:
Where
and
corresponds to the magnitude of changes that are allowed.
- To detect increases, it computes the minimum value of
and monitors the difference between
and
.
- When the difference
is greater than a given threshold
a change in the distribution is assigned.
- The threshold
depends on teh admissible false alarm rate.
- Incresing
will entail fewer flase alarms, but might miss or delay some changes.
- Controlling this detection threshold parameter makes it possible to establish a trade-off between the false alarms and the miss detections.
Experiments
dataset: SEA concepts, a benchmark problem for concept drift.
- When the difference
Reference
[1] Sebastiao, Raquel, and Joao Gama. “A study on change detection methods.” In Progress in Artificial Intelligence, 14th Portuguese Conference on Artificial Intelligence, EPIA, pp. 12-15. 2009.
Reference
[2] Nikovski, Daniel, and Ankur Jain. “Fast adaptive algorithms for abrupt change detection.” Machine learning 79, no. 3 (2010): 283-306.
Note
In this paper, authors are proposing algorihtms based on memory-based machine learning methods for quick estimatin of probability distributions from data.
Their change-detection framework assumes that there is only a single change in the entire length of the stream, and this change persists for the reminder of the data stream. This is typical situation when the change is destruvtive (e.g., a burnout, motor failure, etc.)
Reference
Liu, Song, Makoto Yamada, Nigel Collier, and Masashi Sugiyama. “Change-point detection in time-series data by relative density-ratio estimation.” Neural Networks 43 (2013): 72-83.
Note
Having been studied for decades, some pionner works demonstrated good change point detection performance by comparing the probability distributions of time-series samples over past and paresent intervals.
Another group of methods that have attracted high popularity in recent years is the subspace methods.
- By using a pre-designed time-series model, a subspace is discovered by principal componenet analysis from trajectories in past and present intervals, and their dissimilarity is measured by the distance between the subspaces.
- One of the major approaches is called subspace identification, which compares the subspaces spanned by the columns of an extended observability matrix generated by a state-space model with system noise.
- Recent efforts along this line of research have led to a computationally efficient algorithm based on Krylov subspace learning and a successful application of detecting climate change in south Kenya.
The method explained aboce rely on pre-designed parametric models, such as
- underlying probability distributions,
- auto-regressive models, and
- state-space models,
- for tracking specific statistics such as the mean, the variance, and the spectrum.
As alternatives, non-parametric methods such as kernel density estiamtion. are designed with no particular parametric assumption.
- However, they tend to bel ess accurate in high-dimensional problems because of hte so-called curse of dimensionality.
To overcome this difficulty, a new strategy was introduced recetnly, which estimates the ratio of probability densities directly without going through density estimation.
- The rationale of this density-ratio estimation idea is that knowing the two densities implies knowing the dnesity ratio, but not vice versa;
- knowing the ratio does not necessarily imply knowing the two densities because such decomposition is not unique.
- Following this idea, methods of direct density-ratio estimation have been developed, e.g.:
- _kernel mean matching_,
- the logistic-regression method,and
- the Kullback-Leibler importance estimation procedure (KLIEP)
In the context of change-point detection, KLIEP was reported to outperform other approaches such as the one-class support vector machine and singular-spectrum analysis.
The goal of this paper:
- Apply a recently-proposed density-ratio estimation method called the unconstrained least-squares importance fitting(uLSIF) to change-point detection.
- The basic idea of uLSIF is to directly learn the density-ratio function in the least-squares fitting framework.
- Further improve the uLSIF-based change-point detection method by employing a state-of-the-art extension of uLSIF called relative uLSIF(RuLSIF).