AdaBoost Ensemble Model
Adaptive Boosting, or AdaBoost, is a boosting ensemble technique that takes low-variance, high-bias models and combines them additively to reduce bias while keeping variance low.
AdaBoost is often used in computer vision, for example for face detection and image processing. At every stage, AdaBoost adapts to the errors made in the earlier stages by giving more weight to the misclassified points.
2. Three main ideas behind AdaBoost:
Idea 1. AdaBoost combines many weak learners to make predictions, and these weak learners are almost always stumps.
A stump is a tree with one root node and two leaves. A Random Forest, by contrast, combines fully grown trees with no predetermined depth. A stump has high bias (only one feature is used to predict the output) and low variance (such a shallow tree is too simple to overfit the data points it is trained on).
Let’s consider the below dataset for the classification problem:
Weak learners are ML (Machine Learning) models that predict only slightly better than random guessing. A stump is a weak learner because it makes its prediction based on only one feature. In the example below, a single feature, Chest Pain, predicts the output, ‘Heart Disease’:
Idea 2. Some stumps get a larger say (vote) in the final prediction than others, depending on how much error each stump made. In a Random Forest, by contrast, each tree has an equal vote in the final prediction.
Idea 3. Each stump is made by taking the previous stump’s mistakes into account. In a Random Forest, by contrast, each tree is built independently of the others.
3. Steps of AdaBoost:
Step 1. Assign a weight to every sample. Initially each weight is 1/(number of samples), so that all samples are equally important.
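As a minimal sketch of Step 1, the snippet below builds the initial weights for the 8-sample example used in this article. The function name `initial_weights` is a hypothetical helper chosen for illustration.

```python
def initial_weights(n_samples):
    """Give every sample the same starting weight, 1 / n_samples."""
    return [1.0 / n_samples] * n_samples


weights = initial_weights(8)
print(weights[0])    # 0.125
print(sum(weights))  # 1.0
```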
Step 2. Select the stump for this stage: build one candidate stump per feature and select the one with the lowest Gini index (or entropy).
Note: We can easily create a stump from categorical data (yes/no), but what if the data is numerical, like ‘patient weight’? Below are the steps:
a. Sort the rows in increasing order of patient weight.
b. Compute the average of each pair of consecutive weights.
c. Calculate the Gini index for each average computed in the previous step, using that average as the split threshold. For each threshold we calculate the Gini index of the two leaves; the Gini index of the split is the weighted average of the two leaf values. Below is the calculation:
Similarly, we compute the Gini index for each average value of patient weight. Below are the values:
We can see that the Gini index for Patient Weight < 176 is the lowest, so 176 is the best threshold to select.
Now, comparing the Gini index of the 3 features (Chest Pain: 0.47, Blocked Arteries: 0.5, Patient Weight: 0.2), we select the feature with the lowest Gini index as the first stump in the forest. In this example, that is Patient Weight.
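The threshold search for a numerical feature (steps a to c above) can be sketched as follows. The data is a hypothetical toy set, not the article's table, and the helper names `gini` and `best_threshold` are illustrative. All samples count equally here, which matches the first stage where every weight is the same.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (True/False)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def best_threshold(values, labels):
    """Return (threshold, gini) for the best 'value < threshold' split."""
    pairs = sorted(zip(values, labels))            # step a: sort by value
    best = None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                               # no midpoint between equal values
        threshold = (v1 + v2) / 2                  # step b: average of neighbours
        left = [y for v, y in pairs if v < threshold]
        right = [y for v, y in pairs if v >= threshold]
        # step c: weighted average of the two leaf impurities
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: patient weights and whether heart disease is present.
patient_weight = [135, 155, 160, 180, 190, 210]
heart_disease = [False, True, False, True, True, True]

threshold, score = best_threshold(patient_weight, heart_disease)
print(threshold, round(score, 3))  # 170.0 0.222
```

Running the same function on every feature and keeping the lowest score gives the stump for the stage.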
Step 3. Calculate the Total Error and the Amount of Say: Once the stump is selected, we compute its Total Error from the incorrectly classified samples, and from that its Amount of Say.
The amount of say depends on how well a stump predicts the data. The Total Error of a stump is the sum of the weights associated with the incorrectly classified samples. The Total Error is always between 0 (a perfect stump that classifies every sample correctly) and 1 (a stump that classifies every sample incorrectly). ‘Amount of Say’ is defined as:
Amount of Say = (1/2) × ln((1 − Total Error) / Total Error)
When the Total Error of a stump is small, its Amount of Say is large. When the Total Error is large, the Amount of Say is lower. When the Total Error is 0.5, the stump is no better than random classification (like flipping a coin) and its Amount of Say is 0. Above 0.5 the Amount of Say becomes negative, so the stump’s vote counts in the opposite direction.
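The stump Patient Weight < 176 misclassifies one of the 8 samples, so its Total Error is 1/8 and its Amount of Say is (1/2) × ln((1 − 1/8) / (1/8)) = (1/2) × ln(7) ≈ 0.97.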
The Total Error is computed from the sample weights of the incorrectly classified data points, and the Amount of Say is then computed from the Total Error. In other words, the sample weights act as the error that feeds into the Amount of Say.
Note: The Total Error lies between 0 and 1. If it is exactly 0 or 1, the ‘Amount of Say’ formula diverges (to positive or negative infinity). To avoid this, a small error term is added to the formula.
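Here is a minimal sketch of Step 3, assuming the standard AdaBoost formula Amount of Say = (1/2) × ln((1 − Total Error) / Total Error). The example assumes a single misclassified sample out of 8, which is consistent with the weights 0.33 and 0.05 quoted later in the article. Where exactly the small error term `eps` enters is an assumption; the text only says that one is added.

```python
import math


def total_error(weights, misclassified):
    """Sum of the weights of the misclassified samples."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)


def amount_of_say(error, eps=1e-10):
    """0.5 * ln((1 - error) / error); eps keeps error = 0 or 1 finite (assumed form)."""
    return 0.5 * math.log((1.0 - error + eps) / (error + eps))


weights = [1 / 8] * 8
# Assumption: only the fourth sample is misclassified by the stump.
misclassified = [False, False, False, True, False, False, False, False]

err = total_error(weights, misclassified)
say = amount_of_say(err)
print(round(err, 3), round(say, 2))  # 0.125 0.97
```

With this guard, `amount_of_say(0.0)` and `amount_of_say(1.0)` return large finite values instead of raising an error, and `amount_of_say(0.5)` is 0.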
Step 4. Update Weight:
Once we know which samples were incorrectly classified, we increase their sample weights and decrease all other sample weights. The formula for increasing the sample weight is:
New Sample Weight = Sample Weight × e^(Amount of Say)
Below is the graph of e^x where x is the amount of say:
If the amount of say is large → the stump did a great job in making the prediction → e^(amount of say) is large → the new sample weight is much larger than the old sample weight.
If the amount of say is small → the stump didn’t do a good job in making the prediction → e^(amount of say) is only slightly above 1 → the new sample weight is only a little larger than the old sample weight.
For the stump Patient Weight < 176, the new weight of the incorrectly classified sample is calculated below:
The new sample weight of the incorrectly classified sample is 0.33, which is more than the old weight (1/8 = 0.125).
Now we need to decrease the sample weight of all correctly classified samples using the formula below:
New Sample Weight = Sample Weight × e^(−Amount of Say)
Below is the graph of e^-x, where x is the amount of say
If the amount of say is large → we scale the weight by a value close to 0 → the new sample weight becomes very small.
If the amount of say is small → we scale the weight by a value close to 1 → the new sample weight becomes just a little smaller than the old sample weight.
In our example, the new, decreased sample weight is 0.05. The updated weights no longer sum to 1, so each one is divided by their total to normalize them.
Below are the new weights:
The data below is used to create the next stump based on the Gini index:
In the new data (table above), more emphasis is given to the sample with the larger weight, as it was wrongly classified last time.
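The weight update in Step 4 can be sketched as below, again assuming one misclassified sample out of 8. The final normalization (dividing by the total so the weights sum to 1) is the standard AdaBoost step and is included here as an assumption. The function name `update_weights` is illustrative.

```python
import math


def update_weights(weights, misclassified, say):
    """Scale weights up (wrong) or down (right), then normalize to sum to 1."""
    scaled = [
        w * math.exp(say if wrong else -say)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return scaled, [w / total for w in scaled]


weights = [1 / 8] * 8
# Assumption: only the fourth sample is misclassified by the stump.
misclassified = [False, False, False, True, False, False, False, False]
say = 0.5 * math.log((1 - 1 / 8) / (1 / 8))      # about 0.97

scaled, normalized = update_weights(weights, misclassified, say)
print(round(scaled[3], 2), round(scaled[0], 2))          # 0.33 0.05
print(round(normalized[3], 2), round(normalized[0], 2))  # 0.5 0.07
```

After normalization the misclassified sample carries about half of the total weight, which is why the next stump is pushed to classify it correctly.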
Step 5. Create the new dataset for the next stage:
In this step, we select n samples (n is the sample size; in our example it is 8) with replacement from the dataset calculated in step 4, with the probability of picking each sample given by its new sample weight.
The steps are below:
a. Pick a random number (call it ‘s’) between 0 and 1.
b. Select the sample whose range contains s. Each sample owns a slice of the interval from 0 to 1 that is as wide as its weight, and the running total of the weights marks the slice boundaries.
The sample weight of the incorrectly classified data point is larger than the rest, so it is likely to be selected more often, and the next stump will focus more on classifying it correctly. The copies of the misclassified sample in the new dataset (highlighted below) are identical, so they are treated as a block, creating a large penalty for misclassifying them.
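The sampling procedure in steps a and b can be sketched with a running total of the weights and a binary search. The weights below are hypothetical normalized weights for 8 samples where one sample holds half of the total; `resample` is an illustrative helper name.

```python
import bisect
import itertools
import random


def resample(samples, weights, rng):
    """Draw len(samples) items with replacement, proportional to weights."""
    cumulative = list(itertools.accumulate(weights))   # slice boundaries
    total = cumulative[-1]
    drawn = []
    for _ in samples:
        s = rng.random() * total                       # step a: random number
        index = bisect.bisect_right(cumulative, s)     # step b: slice containing s
        index = min(index, len(samples) - 1)           # guard against float rounding
        drawn.append(samples[index])
    return drawn


samples = list(range(8))                               # sample ids 0..7
weights = [1 / 14] * 3 + [0.5] + [1 / 14] * 4          # hypothetical weights
rng = random.Random(0)                                 # seeded for repeatability

new_dataset = resample(samples, weights, rng)
print(new_dataset)
```

The standard library call `random.choices(samples, weights=weights, k=len(samples))` performs the same kind of weighted draw; the explicit version is shown only to mirror the steps in the text.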
Now repeat steps 1–5 on the new dataset until you have the number of stumps you asked for, or until the stumps fit the training data perfectly.
Step 6. How the forest of stumps makes predictions:
After performing the above steps there will be several stumps; some of them predict ‘Yes’ and some predict ‘No’ for Heart Disease. What will the final prediction be? We add up the amounts of say of the stumps that predict ‘Yes’ and, separately, of those that predict ‘No’. The class with the larger total amount of say is the classification of the new point.
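The weighted vote in Step 6 can be sketched as follows. The four stumps, their split rules, and their amounts of say are all hypothetical values chosen for illustration; they are not taken from the article's dataset.

```python
def predict(stumps, sample):
    """stumps: list of (predict_fn, amount_of_say). Returns (label, totals)."""
    say_for = {}
    for predict_fn, say in stumps:
        label = predict_fn(sample)
        say_for[label] = say_for.get(label, 0.0) + say   # add this stump's vote
    return max(say_for, key=say_for.get), say_for


# Hypothetical stumps: each maps a patient record to 'Yes' or 'No'.
stumps = [
    (lambda p: "Yes" if p["weight"] >= 176 else "No", 0.97),
    (lambda p: "Yes" if p["chest_pain"] else "No", 0.32),
    (lambda p: "Yes" if p["blocked_arteries"] else "No", 0.78),
    (lambda p: "Yes" if p["weight"] >= 200 else "No", 0.63),
]

patient = {"weight": 182, "chest_pain": False, "blocked_arteries": True}
label, totals = predict(stumps, patient)
print(label)  # Yes
```

Here two stumps vote ‘Yes’ with a combined say of about 1.75 and two vote ‘No’ with about 0.95, so the prediction is ‘Yes’ even though the raw vote count is tied.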
5. End Notes:
Thanks for reading this article. I hope it is helpful in understanding the AdaBoost model and the mathematics behind it.