1. Ensemble methods
- An ensemble method is an approach that combines many simple building block models in order to obtain a single, potentially very powerful model.
- These simple building block models are sometimes known as weak learners.
- Methods
- Bagging
- Random Forest
- Boosting
- Bayesian additive regression trees
2. Bagging
2.1 Bootstrap methods
- Treat the training set as the population and sample n observations from it with replacement.
- Repeat this to obtain B bootstrap sets.
- Calculate the statistic of interest from each bootstrap set.
- Aggregate the statistics across the B bootstrap sets (e.g., by averaging).
# Population of training set
seq(20)
# Bootstrap set (sampled with replacement)
sort(sample(seq(20), 20, replace=TRUE))
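- To see the aggregation step end-to-end, here is a minimal sketch (the statistic used, the sample mean of the toy population, and B = 1000 are illustrative assumptions):
# Draw B bootstrap sets from the toy population and aggregate a statistic
set.seed(1)
pop <- seq(20)
B <- 1000
boot.stat <- numeric(B)
for (b in 1:B) {
  # One bootstrap set: sample n = 20 values with replacement
  boot.set <- sample(pop, length(pop), replace=TRUE)
  boot.stat[b] <- mean(boot.set)
}
# Aggregated estimate of the mean and its bootstrap standard error
mean(boot.stat)
sd(boot.stat)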
2.2 Bagging Tree methods
- Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
- Repeat the calculation on every bootstrapped sample:
- Take repeated samples from the training set.
- Generate B different bootstrapped training data sets and fit the method on each of them.
- For classification trees: for each test observation, we record the class predicted by each of the B trees and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.
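- As a toy illustration of the majority vote (the five class labels below are made-up values, not results from the Heart data):
# Majority vote over B = 5 hypothetical tree predictions for one test observation
votes <- c("Yes", "No", "Yes", "Yes", "No")
table(votes)
# Overall prediction: the most commonly occurring class
names(which.max(table(votes)))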
3. Building a Bagging Decision Tree
# Classification trees via the tree package; the Heart data is assumed to be
# already loaded (e.g., ISLR's Heart.csv with AHD converted to a factor)
library(tree)
# Separate training and test sets
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)
# Classification error rate for a single tree
heart.tran <- tree(AHD ~ ., data=Heart, subset=train)
heart.pred <- predict(heart.tran, Heart[test, ], type="class")
tree.err <- mean(Heart$AHD[test]!=heart.pred)
tree.err
# Bagging
set.seed(12345)
B <- 500
n <- nrow(Heart)
Vote <- rep(0, length(test))
bag.err <- NULL
for (i in 1:B) {
  # Bootstrap training set (sample the training indices with replacement)
  index <- sample(train, replace=TRUE)
  heart.tran <- tree(AHD ~ ., data=Heart[index, ])
  heart.pred <- predict(heart.tran, Heart[test, ], type="class")
  # Accumulate votes for "Yes" across the trees built so far
  Vote[heart.pred=="Yes"] <- Vote[heart.pred=="Yes"] + 1
  # Majority rule: predict "No" when fewer than half of the i trees voted "Yes"
  preds <- rep("Yes", length(test))
  preds[Vote < i/2] <- "No"
  bag.err[i] <- mean(Heart$AHD[test]!=preds)
}
# Visualize bagging decision tree
plot(bag.err, type="l", xlab="Number of Trees", col=1, ylab="Classification Error Rate")
abline(h=tree.err, lty=2, col=2)
legend("topright", c("Single tree", "Bagging"), col=c(2,1), lty=c(2,1))

- The misclassification rate converges to around 0.23.
4. Out-of-Bag Error Estimation
- The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations.
- One can show that on average, each bagged tree makes use of around two-thirds of the observations.
- The remaining one-third of the observations not used to fit a given bagged tree are referred to as out-of-bag (OOB) observations.
- We can predict the response for the i-th observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the i-th observation on average.
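- A quick simulation of the "two-thirds" claim (this check is an added illustration, not part of the original analysis): the chance that a given observation is excluded from one bootstrap sample is (1 - 1/n)^n, which is approximately exp(-1) = 0.368, so roughly two-thirds of the observations are used per tree.
# Average OOB fraction over repeated bootstrap samples of the Heart data
set.seed(1)
n <- nrow(Heart)
oob.frac <- replicate(1000, {
  index <- sample(1:n, replace=TRUE)
  length(setdiff(1:n, unique(index)))/n
})
# Close to exp(-1), i.e., about one-third of the observations are OOB per tree
mean(oob.frac)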
4.1 [Ex] Average misclassification rate of a single tree
# Average misclassification rate of a single tree
# over 50 replications
set.seed(12345)
K <- 50
Err <- NULL
for (i in 1:K) {
  train <- sample(1:nrow(Heart), nrow(Heart)*2/3)
  test <- setdiff(1:nrow(Heart), train)
  heart.tran <- tree(AHD ~ ., data=Heart, subset=train)
  heart.pred <- predict(heart.tran, Heart[test, ], type="class")
  Err[i] <- mean(Heart$AHD[test]!=heart.pred)
}
summary(Err)
Tree.Err <- mean(Err)
| Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
|---|---|---|---|---|---|
| 0.1616 | 0.2121 | 0.2424 | 0.2473 | 0.2727 | 0.3333 |
- Over 50 replications, the mean misclassification rate is 0.2473 (median 0.2424).
4.2 [Ex] Out-of-bag misclassification rate
# OOB
set.seed(1234)
Valid <- Vote <- Mis <- rep(0, nrow(Heart))
OOB.err <- NULL
for (i in 1:B) {
  # Bootstrap from the full Heart index
  index <- sample(1:nrow(Heart), replace=TRUE)
  # The observations not drawn are the OOB (test) cases for this tree
  test <- setdiff(1:nrow(Heart), unique(index))
  Valid[test] <- Valid[test] + 1
  # Train the tree on the bootstrapped sample
  heart.tran <- tree(AHD ~ ., data=Heart[index, ])
  # Predict the OOB observations and accumulate their "Yes" votes
  heart.pred <- predict(heart.tran, Heart[test, ], type="class")
  Vote[test] <- Vote[test] + (heart.pred=="Yes")
  # Majority rule over the votes each OOB observation has received so far
  preds <- rep("Yes", length(test))
  preds[Vote[test]/Valid[test] < 0.5] <- "No"
  # Mark correctly classified (+1) and misclassified (-1) OOB cases
  wh <- which(Heart$AHD[test]!=preds)
  Mis[test] <- 1
  Mis[test[wh]] <- -1
  # Cumulative OOB misclassification rate over all observations seen so far
  OOB.err[i] <- sum(Mis==-1)/sum(Mis!=0)
}
# Summary statistics of the OOB error
summary(OOB.err)
# Same summary, excluding the first 100 trees where the estimate is still unstable
summary(OOB.err[-c(1:100)])
# Visualize results
plot(OOB.err, type="l", xlab="Number of Trees", col=1,
ylab="Classification Error Rate", ylim=c(0.1,0.4))
abline(h=Tree.Err, lty=2, col=2)
legend("topright", c("Single tree", "OOB"), col=c(2,1), lty=c(2,1))
