In general, approaches to cope with concept drift can be classified into two categories:

- approaches that adapt a learner at regular intervals without considering whether changes have really occurred.
- approaches that first detect concept changes and afterwards the learner is adapted to these changes.
In the fist approach, drifting concepts are often handled by time windows or weighted examples according to their age or utility.

- Weighted examples are based on the simple idea that the importance often example should decrease with time.
- When a tie window is used, at each time step the learner is induced only from the examples that are included in the window.
- The difficult is how to select the appropriate window’s size.
- On the other end, a large window would produce good and stable learning results in stable phases but can not react quickly to concept changes.

In late approaches with the aim of detecting concept changes, some indicators are monitored over time.

- A typical example is to monitor the evolution of a statistical function between two distributions: from past data in a reference window and in a current window of the most recent data points.
- If during the monitoring process a concept drift is detected, some actions to adapt the learner to these changes can be taken.
- When a time window of adaptive size is used these actions usually lead to adjusting the window’s size according to the extent of concept drift.

## Change detection algorithms

- the Statistical Process Control (SPC)
- the ADaptative WINDdowing (ADWIN)
- the Fixed Cumulative Windows Model (FCWM)
- a standard algorithm for change detection, the Page-Hinkley Test (PHT)

### Statistical Process Control (SPC)

This drift detection method controls online the trace of the probability of error for streaming observations.

- While monitoring the error, it defines a warning and a drift level.
- When the error exceeds the first (lower) threshold, the system enters in a warning mode and stores the time, , of the corresponding observation.
- If the error drops below the threshold again, the warning mode is cancelled.
- However, if in a sequence of examples, the error increases reaching the second (higher) threshold at time , a change in the distribution of the examples is declared.
- The classifier is retained using only the examples since and the warning and drift levels are reset.

### ADaptive WINDowing (ADWIN)

The ADaptative WINDowing method keeps a sliding window (with length ) with the most recently received examples and compares the distribution on two sub-windows of .

- Whenever two
__large enough__sub-windows, and , exhibit__distinct enough__averages, the older sub-window is dropped and a change in the distribtuion of exapmles is assigned. - The window cut threshold is computed as follows:
with , where and denote the lengths of and .

- A confidence value is used within the algorithm, which establishes a bound on the false positive rate.

### Fixed Cumulative Windows Model (FCWM)

- To summarize data, it first constructs histograms using the two layer structure of the Partition Incremental Discretization (PiD) algorithm, which as designed to learn histograms from high-speed data streams.
- The change detection problem is addressed by monitoring distributions in two different time windows: a reference window (RW), reflecting the distribution observed in the past; and a current window (CW) which receives the most recent data.
- In order to assess drifts, both distributions are compared using the Kullback-Leibler divergence (KLD), defining a threshold for change detection decision based on the asymmetry of this measure.

### Page Hinkley Test (PHT)

The Page-Hinkley test (PHT) is a sequential analysis technique typically used for monitor change detection.

- It allows efficient detection of changes in the normal behaviur of a process which is established by a model.
- The PHT was designed to detect a change in the average of a Gaussian signal.
- This test considers a cumulative variable defined as the cumulated difference between the observed values and their mean till the current moment:
Where and corresponds to the magnitude of changes that are allowed.

- To detect increases, it computes the minimum value of and monitors the difference between and .
- When the difference is greater than a given threshold a change in the distribution is assigned.
- The threshold depends on teh admissible false alarm rate.
- Incresing will entail fewer flase alarms, but might miss or delay some changes.
- Controlling this detection threshold parameter makes it possible to establish a trade-off between the false alarms and the miss detections.

## Experiments

dataset: SEA concepts, a benchmark problem for concept drift.

## Reference

[1] Sebastiao, Raquel, and Joao Gama. “A study on change detection methods.” In *Progress in Artificial Intelligence, 14th Portuguese Conference on Artificial Intelligence, EPIA*, pp. 12-15. 2009.

## Reference

[2] Nikovski, Daniel, and Ankur Jain. “Fast adaptive algorithms for abrupt change detection.” *Machine learning* 79, no. 3 (2010): 283-306.

## Note

In this paper, authors are proposing algorihtms based on memory-based machine learning methods for quick estimatin of probability distributions from data.

Their change-detection framework assumes that there is only a __single__ change in the entire length of the stream, and this change persists for the reminder of the data stream. This is typical situation when the change is destruvtive (e.g., a burnout, motor failure, etc.)

## Reference

Liu, Song, Makoto Yamada, Nigel Collier, and Masashi Sugiyama. “Change-point detection in time-series data by relative density-ratio estimation.” *Neural Networks* 43 (2013): 72-83.

## Note

Having been studied for decades, some pionner works demonstrated good change point detection performance by comparing the probability distributions of time-series samples over past and paresent intervals.

Another group of methods that have attracted high popularity in recent years is the __subspace__ methods.

- By using a pre-designed time-series model, a subspace is discovered by principal componenet analysis from trajectories in past and present intervals, and their dissimilarity is measured by the distance between the subspaces.
- One of the major approaches is called
__subspace identification__, which compares the subspaces spanned by the columns of an__extended observability matrix__generated by a state-space model with system noise.- Recent efforts along this line of research have led to a computationally efficient algorithm based on
__Krylov subspace learning__and a successful application of detecting climate change in south Kenya.

- Recent efforts along this line of research have led to a computationally efficient algorithm based on

The method explained aboce rely on pre-designed parametric models, such as

- underlying probability distributions,
- auto-regressive models, and
- state-space models,
- for tracking specific statistics such as the mean, the variance, and the spectrum.

As alternatives, non-parametric methods such as __kernel density estiamtion__. are designed with no particular parametric assumption.

- However, they tend to bel ess accurate in high-dimensional problems because of hte so-called
__curse of dimensionality__.

To overcome this difficulty, a new strategy was introduced recetnly, which estimates the __ratio__ of probability densities directly without going through density estimation.

- The rationale of this density-ratio estimation idea is that knowing the two densities implies knowing the dnesity ratio, but not vice versa;
- knowing the ratio does not necessarily imply knowing the two densities because such decomposition is not unique.
- Following this idea, methods of direct density-ratio estimation have been developed, e.g.:
- _kernel mean matching_,
- the
__logistic-regression method__,and - the
__Kullback-Leibler importance estimation procedure__(KLIEP)

In the context of change-point detection, KLIEP was reported to outperform other approaches such as the __one-class support vector machine__ and __singular-spectrum analysis__.

##### The goal of this paper:

- Apply a recently-proposed density-ratio estimation method called the
__unconstrained least-squares importance fitting__(uLSIF) to change-point detection.- The basic idea of uLSIF is to directly learn the density-ratio function in the least-squares fitting framework.

- Further improve the uLSIF-based change-point detection method by employing a state-of-the-art extension of uLSIF called
__relative uLSIF__(RuLSIF).

￼