1. Ensemble methods
- An ensemble method is an approach that combines many simple building block models in order to obtain a single, potentially very powerful model.
- These simple building block models are sometimes known as weak learners.
- Methods
- Bagging
- Random Forest
- Boosting
- Bayesian additive regression trees
2. Bagging
2.1 Bootstrap methods
- Treat the observed sample \((X_1, ..., X_n)\) as the population and draw \(n\) observations from it with replacement to form one bootstrap set.
- Repeat this \(B\) times to obtain \(B\) bootstrap sets.
- Compute the statistic of interest on each bootstrap set, giving \(\hat{X}_1, ..., \hat{X}_B\).
- Aggregate the statistics across the bootstrap sets (a toy example follows the code below).
# Population of training set
seq(20)
# Bootstrap set: sample 20 indices with replacement (sorted for readability)
sort(sample(seq(20), 20, replace=TRUE))
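- A toy sketch of the full procedure (my own illustrative example, not from the original code): bootstrap the mean of a small sample \(B\) times and aggregate the \(B\) bootstrap means.
# Toy example: bootstrap estimate of the mean and its standard error
set.seed(1)
x <- rnorm(20, mean=5)      # observed sample, treated as the population
B <- 1000                   # number of bootstrap sets
boot.means <- replicate(B, mean(sample(x, replace=TRUE)))
mean(boot.means)            # aggregated bootstrap statistic
sd(boot.means)              # bootstrap estimate of the standard error of the mean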
2.2 Bagging Tree methods
- Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
- The aggregated estimate is the average of the statistic over the \(B\) bootstrap sets: \(\frac{1}{B}\sum_{b=1}^{B} \bar{X}_b\)
- Take repeated samples from the training set to generate \(B\) different bootstrapped training data sets.
- Fit the method on the \(b\)th bootstrapped training set to obtain \(\hat{f}^b(x)\), and average the predictions (see the regression sketch after this list): \(\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^b(x)\)
- For classification trees: for each test observation, record the class predicted by each of the \(B\) trees and take a majority vote; the overall prediction is the most commonly occurring class among the \(B\) predictions.
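- A minimal sketch of the averaging formula for a regression response (my own illustrative example, assuming the Boston data from the MASS package; the classification version with majority voting is built step by step in the next section).
# Bagged regression trees on the Boston data (illustrative sketch)
library(tree)
library(MASS)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
test <- setdiff(1:nrow(Boston), train)
B <- 100
pred.sum <- rep(0, length(test))
for (b in 1:B) {
  # Bootstrap the training indices and fit a regression tree
  index <- sample(train, replace=TRUE)
  fit <- tree(medv ~ ., Boston[index, ])
  pred.sum <- pred.sum + predict(fit, Boston[test, ])
}
pred.bag <- pred.sum / B                  # averaged (bagged) predictions
mean((Boston$medv[test] - pred.bag)^2)    # test MSE of the bagged trees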
3. Building a Bagged Decision Tree
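- The code below assumes the tree package and a cleaned Heart data frame; a minimal setup sketch (assumption: Heart.csv downloaded from the ISLR web site into the working directory).
# Setup (assumption: Heart.csv from the ISLR web site)
library(tree)
Heart <- read.csv("Heart.csv", stringsAsFactors=TRUE)
Heart$X <- NULL          # drop the row-index column, if present
Heart <- na.omit(Heart)  # remove rows with missing values; AHD ("No"/"Yes") is the response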
# Separate training and test sets
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)
# Classification error rate for a single tree
heart.tran <- tree(AHD ~ ., data=Heart, subset=train)
heart.pred <- predict(heart.tran, Heart[test, ], type="class")
tree.err <- mean(Heart$AHD[test] != heart.pred)
tree.err
# Bagging
set.seed(12345)
B <- 500
n <- nrow(Heart)
Vote <- rep(0, length(test))
bag.err <- NULL
for (i in 1:B) {
  # Bootstrap the training indices (sample with replacement)
  index <- sample(train, replace=TRUE)
  heart.tran <- tree(AHD ~ ., Heart[index, ])
  heart.pred <- predict(heart.tran, Heart[test, ], type="class")
  # Add one "Yes" vote wherever the i-th tree predicts "Yes"
  Vote[heart.pred=="Yes"] <- Vote[heart.pred=="Yes"] + 1
  # Majority vote over the i trees grown so far:
  # predict "No" when fewer than half of the votes are "Yes"
  preds <- rep("Yes", length(test))
  preds[Vote < i/2] <- "No"
  bag.err[i] <- mean(Heart$AHD[test] != preds)
}
# Plot the bagging error rate against the number of trees
plot(bag.err, type="l", xlab="Number of Trees", col=1, ylab="Classification Error Rate")
abline(h=tree.err, lty=2, col=2)
legend("topright", c("Single tree", "Bagging"), col=c(2,1), lty=c(2,1))
- The misclassification rate of the bagged trees converges to approximately 0.23.
4. Out-of-Bag Error Estimation
- The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations.
- One can show that on average, each bagged tree makes use of around two-thirds of the observations.
- The remaining one-third of the observations not used to fit a given bagged tree are referred to as out-of-bag (OOB) observations.
- We can predict the response for the \(i\)th observation using each of the trees in which that observation was OOB. This yields around \(\frac{B}{3}\) predictions for the \(i\)th observation on average (a quick numerical check of the two-thirds claim follows this list).
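- A quick check (my own sketch): the expected fraction of distinct observations in a bootstrap sample is \(1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.632\), i.e. roughly two-thirds.
# Empirical check: fraction of distinct observations appearing in a bootstrap sample
set.seed(1)
n <- nrow(Heart)
frac <- replicate(1000, length(unique(sample(n, replace=TRUE))) / n)
mean(frac)   # close to 1 - exp(-1), about 0.632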
4.1 [Ex] Average misclassification rate of a single tree
# Average misclassification rate of a single tree
# over 50 replications
set.seed(12345)
K <- 50
Err <- NULL
for (i in 1:K) {
  # Random split: two-thirds training, one-third test
  train <- sample(1:nrow(Heart), nrow(Heart)*2/3)
  test <- setdiff(1:nrow(Heart), train)
  heart.tran <- tree(AHD ~ ., data=Heart, subset=train)
  heart.pred <- predict(heart.tran, Heart[test, ], type="class")
  Err[i] <- mean(Heart$AHD[test] != heart.pred)
}
summary(Err)
Tree.Err <- mean(Err)
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
---|---|---|---|---|---
0.1616 | 0.2121 | 0.2424 | 0.2473 | 0.2727 | 0.3333
- Over 50 replications, the mean misclassification rate is 0.2473 (median 0.2424); this mean is stored as Tree.Err and used as the single-tree baseline below.
4.2 [Ex] Out-of-bag misclassification rate
# OOB
set.seed(1234)
Valid <- Vote <- Mis <- rep(0, nrow(Heart))
OOB.err <- NULL
for (i in 1:B) {
  # Bootstrap indices from the full Heart data
  index <- sample(1:nrow(Heart), replace=TRUE)
  # OOB observations: those not drawn into the bootstrap sample
  test <- setdiff(1:nrow(Heart), unique(index))
  Valid[test] <- Valid[test] + 1
  # Train a tree on the bootstrap sample
  heart.tran <- tree(AHD ~ ., Heart[index, ])
  # Predict the OOB observations
  heart.pred <- predict(heart.tran, Heart[test, ], type="class")
  Vote[test] <- Vote[test] + (heart.pred=="Yes")
  # Majority vote over all trees for which each observation has been OOB so far
  preds <- rep("Yes", length(test))
  preds[Vote[test]/Valid[test] < 0.5] <- "No"
  # Mark each OOB observation as currently misclassified (-1) or correct (1)
  wh <- which(Heart$AHD[test] != preds)
  Mis[test] <- 1
  Mis[test[wh]] <- -1
  OOB.err[i] <- sum(Mis==-1)/sum(Mis!=0)
}
# Summaries of the OOB error (the second excludes the unstable first 100 trees)
summary(OOB.err)
summary(OOB.err[-c(1:100)])
# Visualize results
plot(OOB.err, type="l", xlab="Number of Trees", col=1,
ylab="Classification Error Rate", ylim=c(0.1,0.4))
abline(h=Tree.Err, lty=2, col=2)
legend("topright", c("Single tree", "OOB"), col=c(2,1), lty=c(2,1))