Boosting and AdaBoost

Keywords: Boosting, AdaBoost

Boosting[1]

A committee gives every model's prediction the same weight, and it often improves little over a single model. Boosting was developed to address this problem. Boosting is a technique for combining multiple 'base' classifiers to produce a form of committee in which:

  1. the combined model performs better than any single base classifier, and
  2. each base classifier has its own weight factor

AdaBoost

AdaBoost is short for 'adaptive boosting'. It combines several weak classifiers, each only slightly better than random guessing, and achieves better performance than a plain committee. The base classifiers in AdaBoost are trained in sequence on the same training set, but the data points carry different weights for each classifier. Viewed as a training data distribution, every weak classifier therefore faces a different distribution, which is an important reason why AdaBoost improves on the committee. The data weights used by each classifier are generated from the performance of the previous classifier, so the weak classifiers are trained one by one. During prediction, the input is passed to every weak classifier and the final result is a weighted combination of all their outputs.
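To make this flow concrete, here is a compact sketch of that train-then-reweight loop (the helper `fit_weak_classifier` is a hypothetical placeholder; a full implementation with decision stumps as weak classifiers follows later in this post):

```python
import numpy as np

def adaboost_sketch(x, t, M, fit_weak_classifier):
    """Train M weak classifiers in sequence, reweighting the data each round.

    `fit_weak_classifier(x, t, w)` is a placeholder that returns a callable
    classifier fitted under data weights `w`.
    """
    N = len(t)
    w = np.full(N, 1.0 / N)                      # start from uniform data weights
    classifiers, alphas = [], []
    for m in range(M):
        y_m = fit_weak_classifier(x, t, w)       # fit on the current weights
        miss = (y_m(x) != t).astype(float)       # 1 where y_m misclassifies, else 0
        eps = np.sum(w * miss) / np.sum(w)       # weighted error rate
        alpha = np.log((1.0 - eps) / eps)        # weight of this classifier's vote
        w = w * np.exp(alpha * miss)             # boost the misclassified points
        classifiers.append(y_m)
        alphas.append(alpha)

    def predict(x_new):
        # final prediction: sign of the weighted sum of all weak predictions
        votes = sum(a * c(x_new) for a, c in zip(alphas, classifiers))
        return np.sign(votes)

    return predict
```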

Key points of the AdaBoost algorithm are:

  1. data points that are misclassified by the current classifier are given a greater weight for the next classifier, and
  2. once the algorithm is trained, the predictions of all classifiers are combined through a weighted majority voting scheme:

$$Y_M(\boldsymbol{x}) = \mathrm{sign}\left(\sum_{m=1}^{M}\alpha_m y_m(\boldsymbol{x})\right)$$

where $w_n^{(1)}$ is the initial weight of the $n$th input data point for the 1st weak classifier, $y_1(\boldsymbol{x})$ is the prediction of the 1st weak classifier, and $\alpha_m$ is the weight of each prediction (note that $\alpha_m$ belongs to the prediction $y_m(\boldsymbol{x})$, while $w_n^{(1)}$ belongs to the input data of the first classifier). The final output is the sign of the weighted sum of all predictions.
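As a quick numerical illustration (the numbers are made up), suppose $M=3$ classifiers with weights $\alpha_1=0.8$, $\alpha_2=0.4$, $\alpha_3=1.2$ predict $+1$, $-1$, $+1$ for some input $\boldsymbol{x}$. Then

$$Y_3(\boldsymbol{x}) = \mathrm{sign}\big(0.8\cdot 1 + 0.4\cdot(-1) + 1.2\cdot 1\big) = \mathrm{sign}(1.6) = +1,$$

so the dissenting vote of the second classifier is outweighed by the other two.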
The procedure of the algorithm is:

  1. Initialize the data weighting coefficients $\{w_n\}$ by $w_n^{(1)}=\frac{1}{N}$ for $n=1,2,\cdots,N$
  2. For $m=1,\dots,M$:
  • Fit a classifier $y_m(\boldsymbol{x})$ to the training set by minimizing the weighted error function:

$$J_m=\sum_{n=1}^{N}w_n^{(m)}I(y_m(\boldsymbol{x}_n)\neq t_n)$$

where $I(y_m(\boldsymbol{x}_n)\neq t_n)$ is the indicator function, which equals 1 when $y_m(\boldsymbol{x}_n)\neq t_n$ and 0 otherwise

  • Evaluate the quantities:

$$\epsilon_m=\frac{\sum_{n=1}^{N}w_n^{(m)}I(y_m(\boldsymbol{x}_n)\neq t_n)}{\sum_{n=1}^{N}w_n^{(m)}}$$

and then use this to evaluate $\alpha_m=\ln\left\{\frac{1-\epsilon_m}{\epsilon_m}\right\}$

  • Update the data weighting coefficients:

$$w_n^{(m+1)}=w_n^{(m)}\exp\left\{\alpha_m I(y_m(\boldsymbol{x}_n)\neq t_n)\right\}$$

  3. Make predictions using the final model, which is given by:

$$Y_M(\boldsymbol{x}) = \mathrm{sign}\left(\sum_{m=1}^{M}\alpha_m y_m(\boldsymbol{x})\right)$$

This procedure comes from 'Pattern recognition and machine learning'[1].
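As a quick numerical check of one round (the numbers are made up for illustration): if the weights currently sum to 1 and the weak classifier misclassifies points carrying a total weight of 0.25, then

$$\epsilon_m = 0.25,\qquad \alpha_m=\ln\frac{1-0.25}{0.25}=\ln 3\approx 1.10,$$

so each misclassified point has its weight multiplied by $\exp(\alpha_m)=3$ while correctly classified points keep their weights, and the next classifier concentrates on the points this one got wrong.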

Python Code of Adaboost

```python
import numpy as np


# weak classifier
# test each dimension, each value, and each direction to find the
# best threshold and direction ('<' or '>')
class Stump():
    def __init__(self):
        self.feature = 0
        self.threshold = 0
        self.direction = '<'

    def loss(self, y_hat, y, weights):
        """
        :param y_hat: prediction
        :param y: target
        :param weights: weight of each data point
        :return: loss (sum of the weights of misclassified points)
        """
        total = 0
        example_size = y.shape[0]
        for i in range(example_size):
            if y_hat[i] != y[i]:
                total += weights[i]
        return total

    def test_in_traing(self, x, feature, threshold, direction='<'):
        """
        test during training
        :param x: input data
        :param feature: which dimension to classify on
        :param threshold: threshold
        :param direction: '<' or '>' relative to the threshold
        :return: classification result
        """
        example_size = x.shape[0]
        classification_result = -np.ones(example_size)
        for i in range(example_size):
            if direction == '<':
                if x[i][feature] < threshold:
                    classification_result[i] = 1
            else:
                if x[i][feature] > threshold:
                    classification_result[i] = 1
        return classification_result

    def test(self, x):
        """
        test during prediction
        :param x: input
        :return: classification result
        """
        return self.test_in_traing(x, self.feature, self.threshold, self.direction)

    def training(self, x, y, weights):
        """
        main training process
        :param x: input
        :param y: target
        :param weights: weights
        :return: none
        """
        example_size = x.shape[0]
        example_dimension = x.shape[1]
        loss_matrix_less = np.zeros(np.shape(x))
        loss_matrix_more = np.zeros(np.shape(x))
        for i in range(example_dimension):
            for j in range(example_size):
                results_ji_less = self.test_in_traing(x, i, x[j][i], '<')
                results_ji_more = self.test_in_traing(x, i, x[j][i], '>')
                loss_matrix_less[j][i] = self.loss(results_ji_less, y, weights)
                loss_matrix_more[j][i] = self.loss(results_ji_more, y, weights)
        loss_matrix_less_min = np.min(loss_matrix_less)
        loss_matrix_more_min = np.min(loss_matrix_more)
        if loss_matrix_less_min > loss_matrix_more_min:
            minimum_position = np.where(loss_matrix_more == loss_matrix_more_min)
            self.threshold = x[minimum_position[0][0]][minimum_position[1][0]]
            self.feature = minimum_position[1][0]
            self.direction = '>'
        else:
            minimum_position = np.where(loss_matrix_less == loss_matrix_less_min)
            self.threshold = x[minimum_position[0][0]][minimum_position[1][0]]
            self.feature = minimum_position[1][0]
            self.direction = '<'


class Adaboost():
    def __init__(self, maximum_classifier_size):
        self.max_classifier_size = maximum_classifier_size
        self.classifiers = []
        self.alpha = np.ones(self.max_classifier_size)

    def training(self, x, y, classifier_class):
        """
        main training steps of AdaBoost
        :param x: input
        :param y: target
        :param classifier_class: which weak classifier to use; here the Stump above
        :return: none
        """
        example_size = x.shape[0]
        weights = np.ones(example_size) / example_size

        for i in range(self.max_classifier_size):
            classifier = classifier_class()
            classifier.training(x, y, weights)
            test_res = classifier.test(x)
            # indicator[j] = 1 if data point j is misclassified, 0 otherwise
            indicator = np.zeros(len(weights))
            for j in range(len(indicator)):
                if test_res[j] != y[j]:
                    indicator[j] = 1

            cost_function = np.sum(weights * indicator)
            epsilon = cost_function / np.sum(weights)
            self.alpha[i] = np.log((1 - epsilon) / epsilon)
            self.classifiers.append(classifier)
            weights = weights * np.exp(self.alpha[i] * indicator)

    def predictor(self, x):
        """
        prediction
        :param x: input data
        :return: prediction result
        """
        example_size = x.shape[0]
        results = np.zeros(example_size)
        for i in range(example_size):
            y = np.zeros(self.max_classifier_size)
            for j in range(self.max_classifier_size):
                y[j] = self.classifiers[j].test(x[i].reshape(1, -1))
            results[i] = np.sign(np.sum(self.alpha * y))
        return results
```

The entire project can be found at https://github.com/Tony-Tan/ML. And please star me! Thanks!
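Below is a minimal usage sketch of the `Stump` and `Adaboost` classes above (the toy two-cluster dataset is made up for illustration; the targets must be in $\{-1,+1\}$):

```python
import numpy as np

# toy 2-D dataset: class +1 clustered around (2, 2), class -1 around (0, 0)
rng = np.random.default_rng(0)
x_pos = rng.normal(loc=2.0, scale=0.8, size=(50, 2))
x_neg = rng.normal(loc=0.0, scale=0.8, size=(50, 2))
x = np.vstack([x_pos, x_neg])
y = np.hstack([np.ones(50), -np.ones(50)])

# train an AdaBoost model with 40 decision stumps and check its training accuracy
model = Adaboost(40)
model.training(x, y, Stump)
prediction = model.predictor(x)
accuracy = np.mean(prediction == y)
print('training accuracy: {:.2f}'.format(accuracy))
```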

When we use different numbers of classifiers, the results of the algorithm are as follows:

where the blue circles are correctly classified points of class 1 and the red circles are correctly classified points of class 2. The blue crosses belong to class 2 but were classified as class 1, and the red crosses belong to class 1 but were classified as class 2.

An AdaBoost model with 40 classifiers gives a relatively good prediction:

where there is only one misclassified point.

References


  1. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.