# The Boosting Approach

The origin of the boosting method for designing learnign machines is traced back to the work of Valiant and Kearns, who posed the question of whether a weak learning algorithm, meaning one that does slightly better than random guessing, can be boosted into a strong one with a good performance index.

• At the heart of such techniques lies the base leaner, which is a week one.

Boosting consists of an iterative scheme, where

• at each step the base learn is optimally computed using a different training set;

Boosting consists of an iterative scheme, where at each step the base learner is optimally computed using a different training set;

• the set at the current iteration is generated either:
• according to an iteratively obtained data distribution or,
• usually, via a weighting of the training samples, each time using a different set of weights.

The final learner is obtained via a weighted average of all the hierarchically designed base learners.

It turns out that, given a sufficient number of iterations, one can significantly imporve the (poor) performance of a weak learner. (?)

Our goal is to design a binay classifier:

$f(x) = \text{sng} \{F(x)\}$

where

###### $F(x) := \sum_{k=1}^K a_k \phi(x;\mathbf{\theta}_k),$

where $\phi(x;\theta_k) \in \{-1, 1\}$ is the base classifier at iteration $k$, defiend in terms of a set of paramters, $\theta_k, k = 1,2,\dots, K$, to be estimated.

• The base classifier is selected to be a binary one.
• The set of unknown paramters is obtained in a step wise approach and in a greedy way; that is:
• At each iteration step, $i$, we only optimize with respect to a single pair, $(\alpha_i, \mathbf{\theta}_i)$ , by keeping the parameters, $a_k, \mathbf{\theta_k}, k=1,2,\dots,i-1$., obtained from the previous steps, fixed.

Note that ideally, one should optimize with respect to all the unknown parameters, $a_k, \mathbf{\theta}_k, k = 1, 2, \dots, K,$ simultaneously; however, this would lead to a very computationally demanding optimization task.

Assume that we are currently at the $i$th iteration step; consider the partial sum of terms

$F_i(\cdot) = \sum_{k=1}^i a_k \phi(\cdot; \theta_k)$

Then we can write the following recursion:

$F_i(\cdot) = F_{i-1}(\cdot) + a_i \phi(\cdot;\theta_i), \quad i = 1, 2, \dots, K,$

starting from an intial condition. According to the greedy retionale, $F_{i-1}(\cdot)$ is assumed known and the goal is to optimize with resepct to the set of parameters, $a_i, \mathbf{\theta}_i$.

For optimization, a loss function has to be adopted. No doube different options are avialable, giving different names to the derived algorithm. A popular loss function used for classification is the exponential loss, defined as

$L(y, F(x)) = \exp (-y F(x)): \quad \text{Exponential Loss Function,}$

and it gives rise to the Adaptive Boosting algorithm. The former can be considered a (differentiable) upper bound of the (nondifferentiable) 0-1 loss function.

• Note that the exponential loss weights misclassified ($yF(x) < 0$) points more heavily compared to the correctly identified ones ($y F(x) > 0$).

Employing the exponential loss function, the set $\alpha_i, \mathbf{\theta}_i$ is obtianed via the respective empirical cost function, in the following manner:

$(a_i, \theta_i) = \arg\min_{a, \theta} \sum_{n=1}^N \exp\Bigg(-y_n\Big (F_{i-1}(x_n) + a \phi (x_n; \theta)\Big)\Bigg).$

This optimization is also performed in two steps. First , $a$ is treated fixed and we optimize with resepct to $\theta$,

$\theta_i = \arg\min_\theta \sum_{n=1}^N w_n^{(i)} \exp(-y_n a \phi(x_n; \theta)),$

where

$w_n^{(i)} := \exp(-y_n F_{i-1}(x_n)), n = 1, 2, \dots, N.$

Observe that $w_n^{(i)}$ depends neigher on $a$ nor on $\phi(x_n;\theta)$, hence it can be considered a sweight associated with sample $n$. Moreover, its value depends entirely on the results obtianed from the previous recursions.

The optimization depends on the specific form of the base classifier. However, due to the exponential form of the loss, and the fact that the baseclassifier is a binary one, so that $\phi(x;\theta) \in \{-1, 1\}$, optimizaing (7.79) is readily seen to be equivalent with optimizing the following cost:

$\theta_i = \arg\min_\theta P_i,$

where

$P_i := \sum_{n=1}^N w_n^{(i)} \chi_{(-\infty, 0]}\bigg(y_n \phi\big(x_n;\theta\big)\bigg),$

and $\chi_{(-\infty, 0]}(\cdot)$ is the $0-1$ loss function. Note that $P_i$ is the weighted empirical classification error. Obviously, when the miscalssification is minimized, the cost is also minimized, because the exponential loss weighs the misclassified points heavier.

• To guarantee that $P_i$ remains the $[0,1]$ interval, the weights are normalized to unity by dividing by the respective sum; note that this does not affect the optimization process.

In other words, $\theta_i$ can be computed in order to minize the empirical misclassification error committed by the base classifier. For base classifiers of very simple structure, such minimization is computationally feasible.

Having computed the optimal $\mathbf{\theta}_i$, the following are easily established from the respective definitions,

$\sum_{y_n \phi(x_n; \theta_i) < 0} w_n^{(i)} = P_i.$

and

$\sum_{y_n \phi(x_n; \theta_i) > 0} w_n^{(i)} = 1- P_i.$

$a_i = \arg\min_a \{ \exp (-a) (1 - P_i) + \exp(a) P_i\}.$

Taking the derivative with respect to $a$ and equating to zero results in

$a_i = \frac{1}{2} \ln \frac{1 - P_i}{P_i}.$

Once $\alpha_i$ and $\theta_i$ have been estimated, the weights for the next iteration are readily given by

$w_n^{(i+1)} = \frac{\exp\Big(-y_n F_i(x_n)\Big)}{Z_i} = \frac{w_n^{(i)}\exp\big(-y_n a_i \phi(x_n; \theta_i)\Big)}{Z_i}$

where $Z_i$ is th enormalizing factor

$Z_i := \sum_{n=1}^N w_n^{(i)} \exp\Big(-y_n a_i \phi (x_n; \theta_i)\Big).$

Looking at the way the weights are formed, one can grasp one of the major secrets underlying the AdaBoost algorithm:

• The weight associated with a training sample, $x_n$, is increased (decreased) with respect to its value at the previous iteration, depending on whether the pattern has failed (succeeded) in being classified correctly.
• Moreover, the percentatge of the dececrease (increase) depends on the value of $a_i$, which controls the relative importance in the buildup of the final classifier.
• Hard samples, which keep failing over successive iterations, gain importance in their participation in the weighted empirical error value.
• For the case of the AdaBoost, it can be shown that the training error tends to zero exponentially fast. (?)

##### The Log-Loss Function

In AdaBoost, the exponential loss function was employed. From a theoretical poiont of view, this can be justified by teh following argument:

• Consider the mean value with respect to the binary label, $y$, of the exponential loss function:

$E\big[\exp (-y F(x))\big] = P(y = 1) \exp(-F(x)) + P(y = -1) \exp(F(x))$

Taking the derivative with resepct to $F(x)$ and equating to zero, we readily obtain that minimum of (7.89) occurs at

$F_*(x) = \arg\min_f \mathbb{E}[\exp(-yf)] = \frac{1}{2}\ln \frac{P(y = 1 | x)}{P(y = -1 | x)}$

The logarithm of the ratio on the right-hand side is known as the log-odds ratio. Hence, if one views the minimizing function in (7.88) as the empirical approximation of the man value in (7.89), it fully justifies considering the sign of the fucntion in (7.73) as the classification rule.

A major problem associated with the exponential loss function, as is readily seen in Figure (7.14), is that it weights heavily wrongly classified samples, depending on the value of the respective margin, defined as

$m_x := |y F(x)|.$

Note that the farther the point is from the decision surface $(F(x) =0)$, the larger the value of $|F(x)|$. Thus, points that are located at the wrong side of the decision surface $(yF(x) < 0)$ and far away are weighted with (exponentially) large values, and their influence in the optimization process is large compared to the other points. Thus in the presence of outliers, the exponential loss is not the most appropriate one. As a matter of fact, in such environments, the performance of the AdaBoost can degrade dramatically.

An alternative loss function is the log-loss or binomial deviance, defined as

$L(y, F(x)) := \ln(1 + \exp(-y F(x))) : \quad \text{Log-loss Function,}$

which is also shown in Figure 7.14. Observe that tis increase is almost linear for large negative values. Such a function leads to a more balanced influence of the loss among all the points.

Note that the function that minimizes the mean of the log-loss, with respect to $y$, is the same as the one give in (7.90). However, if one employes the log-loss instead of the exponential, the optimization task gets more involved, and one has to resort to gradient descent or Newton-type schemes for optimization.

[20]

[21]

### The Boosting Trees

In the discussion on experimetnal comparison of various methods before (section 7.9), it was stated the boosted trees are among the most powerful learning schemes for clasification and data minning. Thus, it is worth spending some more time on this special type of boosting techniques.

Tree were introduced in Section 7.8. From the knowledge we have now acquired, it is not difficult to see that the output of a tree can be compactly written as

$T(x; \Theta) = \sum_{j=1}^J \hat{y}_j \chi R_j(x), \quad(7.93)$

where

• $J$ is the number of leaf nodes,
• $R_j$ is the region associated with the $j$-leaf, after the space partition imposed by the tree,
• $\hat{y}_j$ is the respective label associated with $R_j$ (output/prediction value for regression) and
• $\chi$ is our familiar characteristic function.
• The set of parameters, $\Theta$, consists of $(\hat{y}_j, R_j), j = 1,2,\dots, J$, which are estimated during training.
• These can be obtained by selecting an appropriate cost function.
• Also, suboptimal techniques are usually employed in order to build up a tree

In a boosted tree model, the base classifier comprises a tree. In practice, one can employ trees of larger size. Of course, the size must not be bery large, in order to be closer to a weak classifier. In practice, values of $J$ between three and eight are advisable.

The boosted tree model can be written as

$F(x) = \sum_{k=1}^K T(x; \Theta_k), \quad (7.94)$

where

$T(x;\Theta_k) = \sum_{j=1}^J \hat{y}_{kj} \chi R_{kj}$

Equation (7.94) is basically the same as (7.74), with the $a$‘s being equal to one. We have assumed the size of all the trees to be the smae, although this may not be necessarily the case.

Adopting a loss function, $\mathcal{L}$, and the greedy rationale, used for the more general boosting approach, we arrive at the following recursive scheme of optimization:

$\Theta_i = \arg\min_\Theta \sum_{n=1}^N \mathcal{L}\Big(y_n, F_{i-1}(x_n) + T(x_n; \Theta)\Big).$

Optimization with respect to $\Theta$ takes place into two steps:

• one with respect to $\hat{y}_{ij}, j = 1, 2,\dots, J$, given $R_{ij}$, and then
• one with respect to the regions $R_{ij}$.

The latter is a difficult task and only simplifies in very special cases. In practice, a number of approximations can be employed.

• Note that in the case of the exponential loss and the two-class classification task, the above is directly linked to the AdaBoost scheme.

For more general cases, numeric optimizatio nschemes are mobilized; see [22].

The same rationale applies for regression trees, where now loss functions for regression, such as LS or the absolute error value, are used.

• Such schemes are also known as multiple additive regression trees (MARTs). [44]

There are two critical factors concerning boosted trees.

• One is the size of the trees, $J$, and
• The other is the choice of $K$.
• Concerning the size of the trees, usually one tries different size, $4 \leq J \leq 8$, and selects the best one.
• Conserning the number of iterations, for large values, the training error may get close to zero, but the test error can increase due to overfitting. Thus, one has to stop early enough, usually by monitoringi the performance.
• Another way to cope with overfitting is to emply shrinkage methods, whcih tend to be equivalent to regularization.
• For example, In the stage-wise expansion of $F_i(x)$ used in the optimization step (7.95), one can instead adopt the following:

$F_i(\cdot) = F_{i-1}(\cdot) + v T(\cdot; \Theta_i).$

The parameter $v$ take small values and it can be considered as controlling the learning rate of the boosting procedure.

• Values smaller than $v < 0.1$ are advised.
• However, the smaller the value of $v$, the larger the value $K$ should be to guarantee good performance.