Photo by Rei Kim on Unsplash

Incremental Online Learning

Danny Butvinik
Analytics Vidhya
Published in
12 min readAug 25, 2019

--

Abstract

In recent times, incremental and online machine learning receive more and more attention especially in the context of learning from real-time data streams, in contrast with a traditional assumption of complete data availability. Although a variety of different methods are available, it often remains unclear which of them is suitable for a concrete task and how they perform in comparison to each other. This article reviews the eight popular incremental methods representing distinct algorithm classes.

Introduction

In my last article, I was talking about the introduction to online machine learning. In this article I’m going to overview a few online incremental learning algorithms (or instance-based incremental learning), that is, the model is learning each example as it arrives. Classical batch machine learning approaches in which all data is simultaneously accessed do not meet the requirements to handle the sheer volume in the given time, leading to more and more accumulated data unprocessed. Furthermore, they do not continuously integrate new information into already constructed models but instead regularly reconstruct new models from scratch. This is not only very time consuming but also leads to potentially outdated models.

Overcoming this state of affairs requires a paradigm shift to sequential data processing in a streaming scheme. This does not only allow us to use information as soon as it is available leading to all-time up to date models but also reduces the costs for data storage and maintenance.

Incremental and online algorithms fit naturally to this scheme, since they continuously incorporate information into their model, and traditionally aim for minimal processing time and space. Due to their ability of continuous largescale and real-time processing they recently gained more attention particularly in the context of data streams.

Incremental algorithms are also very suitable for learning beyond the production phase which enables devices to adapt to individual customer habits and environments. This particularly interesting for financial institutions where fraud detection is one of the centric purposes in financial crime. NICE Actimize — the world leader in crime monitoring, management, and prevention trying to detect fraud in real-time. This is a time-series environment with streams of data. It is very difficult to tackle such task especially when you have highly imbalanced data and issues with ground truth. An only a tiny percentage of the data is known to be as fraudulent. The rest assumed to be legitimate, while there is no assurance that among legitimate data there is no fraud. That’s why trying to solve the problem with the online incremental approach is game-changing for that domain.

Here the main challenge is not large-scale processing but rather continuous and efficient learning from few data. Even though incremental learning could be replaced in this case by repetitive batch learning in the cloud, this solution has crucial drawbacks. A permanent connection to the cloud is required to provide anytime models, which may not always be feasible. Furthermore, customers may not be willing to provide data about their daily life due to privacy reasons. Hence, learning directly on the device in an efficient way is still very desirable. A lot of ambiguity is involved regarding the definition of incremental and online learning in the literature. Some authors use them interchangeably, while others distinguish them in different ways. Additional terms such as lifelong- or evolutionary learning are also used synonymously. We define an incremental learning algorithm as one that generates on a given stream of training data

a sequence of models

In our case,

and

is a model function solely depending on

and the recent p examples

with p being strictly limited. We specify online learning algorithms as incremental learning algorithms which are additionally bounded in model complexity and run-time, capable of endless/lifelong learning on a device with restricted resources.

Incremental learning algorithms face the following challenges:

  • The model has to adapt gradually i.e. is constructed based on without complete retraining.
  • Preservation of previously acquired knowledge and without the effect of catastrophic forgetting.
  • Only a limited number of p training examples are allowed to be maintained.

We explicitly assume the data to be labeled and do not focus on the, nonetheless, crucial scenario of learning from un- or partially labeled data streams. The setting of supervised incremental learning can be applied in most prediction scenarios. In these, after a system has made a prediction the true label can often be inferred with some delay.

An algorithm has to be chosen according to the preconditions of a given task since there cannot exist a method that optimally performs in every scenario. Different interesting incremental learning algorithms have been published so far with various strengths and weaknesses. However, there are only a few sources providing information about them, since basically no comparative in-depth study, experimentally comparing the most popular methods according to the most relevant criteria, is available. Extensive research in the literature leads usually to the original publications of considered algorithms. In this article, I analyze the core features of some of the online incremental methods.

My focus lies in the classification under supervised learning for online incremental algorithms. I primarily perform an evaluation on stationary datasets (i.e. the assumption is that the stream

is independent identically distributed). However, I briefly evaluate and discuss the methods in the context of concept drift.

Algorithms

Incremental Support Vector Machine (ISVM) is the most popular exact incremental version of the SVM and was introduced in (see my article about ISVM here). Additionally, to the set of support vectors a limited number of examples so-called “candidate vectors”, is maintained. These are examples which could be promoted to support vectors depending on the future examples. The smaller the set of candidate vectors is, the higher is the probability of missing potential support vectors. The ISVM is a lossless algorithm — it generates the same model as the corresponding batch algorithm — if the set of candidate vectors contains all previously seen data.

LASVM is an online approximate SVM solver and was proposed in. In an alternative manner, it checks whether the currently processed example is a support vector and removes then obsolete support vectors. For both steps, it heavily utilizes sequential direction searches as it is also done in the Sequential Minimal Optimization (SMO) algorithm. In contrast to the ISVM, it does not maintain a set of candidate vectors but only considers the current example as possible support-vector. This leads to an approximate solution but significantly reduces the training time.

Online Random Forest (ORF) is an incremental version of the Random Forest algorithm. A predefined number of trees grows continuously by adding splits whenever enough samples are gathered within one leaf. Instead of computing locally optimal splits, a predefined number of random values are tested according to the scheme of Extreme Random Trees. The split value optimizing the Gini index the most is selected. Tree ensembles are very popular, due to their high accuracy, simplicity and parallelization capability. Furthermore, they are insensitive to feature scaling and can be easily applied in practice.

Incremental Learning Vector Quantization (ILVQ) is an adaptation of the static Generalized Learning Vector Quantization (GLVQ) to a dynamically growing model, which inserts new prototypes when necessary. The insertion rate is guided by the number of misclassified samples. We use the version in which introduced a prototype placement strategy minimizing the loss on a sliding window of recent samples. Metric learning, as described in [38, 39], can also be applied to extend the classification abilities further.

Learn++ (LPPCART) processes incoming samples in chunks with a predefined size. For each chunk, an ensemble of base classifiers is trained and combined through weighted majority voting to an “ensemble of ensembles”. Similar to the AdaBoost [41] algorithm, each classifier is trained with a subset of chunk examples drawn according to a distribution, ensuring a higher sample probability for misclassified inputs. LPP is a model-independent algorithm and several different base classifiers such as SVM, Classification and Regression Trees (CART) and Multilayer Perceptron have been successfully applied by the authors. As the original author, we employ the popular CART as base classifiers. Chunk-wise trained models inherently incorporate an adaption delay depending on the chunk size. This algorithm was recently utilized.

Incremental Extreme Learning Machine (IELM) reformulates the batch ELM least-squares solution into a sequential scheme. As the batch version, it drastically reduces the training complexity by randomizing the input weights. The network is static, and the number of hidden neurons has to be predefined. This method is able to process the data one-by-one or in chunks, which significantly reduces the overall processing time. However, a valid initialization of the output weights requires at least as many examples as the number of used hidden neurons.

Naive Bayes (NBGauss) fits one axis-parallel Gaussian distribution per class and uses them as a likelihood estimation in the Naive Bayes algorithm. The sparse model allows very efficient learning in terms of processing time and memory requirements. This algorithm learns efficiently from few training examples and has been successfully applied in real-world situations such as Spam filtering and document classification. The major drawbacks of this lossless algorithm are the independence assumption of the features as well as its inability to handle multimodal distributions.

Stochastic Gradient Descent (SGDLin) is an efficient optimization method for learning a discriminative model by minimizing a loss function such as the Hinge — or Logistic loss. We use SGD to learn a linear model by minimizing the Hinge loss function. Revived recently in the context of large-scale learning, SGD coupled with linear models performs especially well for sparse, high-dimensional data as often encountered in the domain of text classification or natural language processing. However, linear models are a misfit whenever non-linear class boundaries are required, which is particularly often the case for low dimensional data.

Even though new versions of the algorithms are continuously proposed, we argue that the chosen methods reflect the general properties of the respective family. Therefore, the conclusions in this paper are commonly applicable for current and upcoming variations of the corresponding algorithm. This is particularly highlighted by both SVMs which perform very similar to the difference that LASVM is able to process slightly larger datasets due to its ap-proximate nature. However, both share the same drawbacks regarding large or noisy datasets. These drawbacks are also shared by a recent LASVM version proposed in, albeit in a slightly weaker degree since a mechanism to reduce the number of support vectors is introduced. Various extensions for the LPP and the IELM algorithm have been proposed. Most of them are tackling non-stationary environments by introducing forgetting mechanisms. However, the major focus of this article is incremental learning in stationary environments where forgetting is rather harmful and deteriorates the performance.

Furthermore, the basic principle of the algorithms and the corresponding ad-and disadvantages remain. In the case of LPP, it is the flexibility of arbitrary base classifiers on the one hand, and the limited knowledge integration across chunks on the other. Methods for speeding up the convergence of SGD were presented in. However, the results obtained by the SGD algorithm in our experiments are not due to slow convergence of the SGD algorithm, but rather highlight the general benefits and limitations of linear models, such as a low model complexity and linear class boundaries.

Fig 1: Classical scheme of evaluating a batch algorithm on offline mode

The learning objective in supervised classification is to predict a target variable y ∈ {1, . . . , c} given a set of features. We consider two different evaluation settings which allow the inference of different aspects regarding the algorithmic performance and provide together even deeper insight.

Offline setting

In the offline setting, a batch algorithm generates a model h based on a training set

In the subsequent test phase, the model is applied to another set

whose labels are kept hidden. Figure 1 depicts the process. The model predicts a label

for every point in the testing set and the 0–1 loss for every point

is calculated. The average accuracy on the test set enables analysis in terms of the generalization ability to unseen examples.

From Incremental Learning In Online Scenario paper. Figure 2: Testing an incremental algorithm in the off-line setting. Noticeably, only the last constructed model is used for prediction. All data used during the training set is obtained from the D training set.

The evaluation of an incremental algorithm in this setting is different as it is shown in Figure 2. Instead of accessing all training data at once, it is sequentially processed in predefined order. The algorithm generates the sequence of tuples

a corresponding sequence of models

Thereby, a model

is solely based on the previously constructed model and a limited amount of p recent tuples

Only a last model hj is applied to the test set to determine the offline accuracy ξ

Figure 3: The online learning scheme. Data is not split into training and testing set. Instead, each model predicts subsequently one example, which is afterward used for the construction of the next model.

Hence, this setting allows only inference about the generalization ability of the last model and neglects all preceding models. Such an evaluation is useful in data streams scenarios, for example, where a lot of training data is available to continuously construct a model as accurate as possible.

Online setting

Data stream classification is usually evaluated in the on-line setting, which is depicted in Figure 3. A potentially infinite sequence

of tuples

arrives one after another. As t represents the current timestamp, the learning objective is to predict the corresponding label

for a given input

which is supposed to be unknown. The prediction

is done according to the previously learned model

Afterward, the true label is revealed and the loss

determined. The online accuracy for a sequence up to the current time t is given by

The main difference to the previous setting is that all intermediate models are considered for the performance evaluation, but each of them predicts only the following example. Additionally, the data for training and testing is not strictly disjunct, but instead, each instance is initially used for model testing and then for the adaption.

Regarding non-stationary data, a high on-line accuracy does not necessarily imply a high generalization ability of the models. For instance, in case of a strong auto-correlation of the labels, an algorithm simply predicting the previous label achieves accurate results without learning any structure in the data. However, for i.i.d. data the on-line accuracy of an incremental algorithm is in general correlated with the average generalization ability of all constructed models. The on-line accuracy is a reasonable evaluation measure for tasks requiring an immediate prediction even after a few training examples.

The combination of both accuracies, off- and on-line, enables conclusions about the learning curve: In the case of two different models A, B having the same offline accuracy, but A having a higher online accuracy implies that A converges on average faster than B and vice versa.

Since this article was a kind of overview of the most prevalent method on online incremental learning, a further series of articles will have more focus on specifically chosen algorithms.

– — –

Join my thriving community of 25,000+ subscribers in The AI Vanguard newsletter! 🔥 If you’re developing an AI or data-driven product/service, seize the opportunity to sponsor an upcoming issue and showcase your business. 🚀 Reach out to dany.butvinik@gmail.com for sponsorship details.

– — –

Reference

[1] Viktor Losing et al., 2018 “Incremental Online Learning: A Review and Comparison of State of the Art Algorithms”.

[2] R. Ade et. al., 2013, “ Methods for Incremental Learning: A Survey, International.

[3] R. Elwell et. al., 2011, “Incremental Learning of Concept Drift in Nonstationary Environments.

[4] G. Ditzler et al., 2013, “Incremental Learning of Concept Drift from Streaming Imbalanced Data.

[5] Y. Ye et.al., 2013 “Online Sequential Extreme Learning Machine

in Nonstationary Environments.

[6] H. He et al., 2011, “Incremental Learning from Stream Data.”

[7] Stefan Hegselmann1 et al., 2020, “Incremental Learning In Online Scenario.”

--

--