
[R] Tree-Based Methods : Bagging

1. Ensemble methods

  • An ensemble method is an approach that combines many simple building block models in order to obtain a single, potentially very powerful model.
  • These simple building block models are sometimes known as weak learners.
  • Methods
    • Bagging
    • Random Forest
    • Boosting
    • Bayesian additive regression trees

 

2. Bagging

2.1 Bootstrap methods

  1. Treating \((X_1, ..., X_n)\) as the population, sample \(n\) observations with replacement.
  2. Repeat the sampling \(B\) times to obtain \(B\) bootstrap sets (see the aggregation sketch after the code below):
    1. Calculate the statistic of interest from each sampled set \((\hat{X}_1, ..., \hat{X}_n)\)
    2. Aggregate the statistics across the bootstrap sets

 

# Population of training set 
seq(20) 

# Bootstrap set : sample with replacement from the population
sort(sample(seq(20), 20, replace=TRUE))
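
A minimal sketch of step 2, assuming we bootstrap the mean of the toy population above (the choice of B = 500 replications and of the mean as the statistic is illustrative, not from the original):

# Bootstrap estimate of a statistic and its variability
set.seed(1)
x <- seq(20)      # treat this as the population / training set
B <- 500          # number of bootstrap sets

# Statistic (here the mean) computed on each bootstrap set
boot.stat <- replicate(B, mean(sample(x, length(x), replace=TRUE)))

# Aggregated statistics over the B bootstrap sets
mean(boot.stat)   # bootstrap estimate of the mean
sd(boot.stat)     # bootstrap estimate of its standard error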

 

2.2 Bagging Tree methods

  • Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
  • Repeatedly calculate the statistic on each sampled set and average the results : \(\frac{1}{B}\sum_{i=1}^{B} \bar{X}_i\)
  • Take repeated bootstrap samples from the single training set.
  • Generate B different bootstrapped training sets, fit the method on each, and average the resulting predictions : \(\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^b(x)\) (a minimal regression sketch follows this list)
  • For classification trees : for each test observation, we record the class predicted by each of the B trees and take a majority vote : the overall prediction is the most commonly occurring class among the B predictions.
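
A minimal sketch of the regression-case average \(\hat{f}_{bag}(x)\), assuming regression trees from the tree package fit to a simulated data set (the simulated data, B = 100, and all object names are illustrative assumptions):

# Bagging regression trees on simulated data
library(tree)
set.seed(1)
n <- 200
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- sin(sim$x) + rnorm(n, sd=0.3)

B <- 100
newx <- data.frame(x = seq(0, 10, by=0.1))
pred.mat <- matrix(0, nrow(newx), B)

for (b in 1:B) {
  index <- sample(1:n, replace=TRUE)       # bootstrap sample of row indices
  fit <- tree(y ~ x, data=sim[index, ])    # fit a tree on the bootstrap set
  pred.mat[, b] <- predict(fit, newx)      # predictions of the b-th tree
}

# Bagged prediction : average over the B trees
fhat.bag <- rowMeans(pred.mat)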

 

3. Building Bagging Decision Tree
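
The code below uses the tree package and the Heart data from the ISLR book (Heart.csv). A minimal setup sketch, assuming the file is in the working directory (the file path, the NA handling, and the row-index column are assumptions, not part of the original code):

# Setup (assumed) : load the tree package and the ISLR Heart data
library(tree)
Heart <- read.csv("Heart.csv", stringsAsFactors=TRUE)
Heart$X <- NULL                  # drop the row-index column if present
Heart <- na.omit(Heart)          # drop rows with missing values
Heart$AHD <- factor(Heart$AHD)   # response must be a factor for classification trees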

# Separate training and test sets
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)

# Classification error rate for single tree 
heart.tran <- tree(AHD ~., subset=train, Heart)
heart.pred <- predict(heart.tran, Heart[test, ], type="class")
tree.err <- mean(Heart$AHD[test]!=heart.pred)
tree.err

# Bagging 
set.seed(12345)
B <- 500
n <- nrow(Heart)
Vote <- rep(0, length(test))
bag.err <- NULL 

for (i in 1:B) {
    # Bootstrap training set 
    index <- sample(train, replace=TRUE)
    heart.tran <- tree(AHD ~., Heart[index,])
    heart.pred <- predict(heart.tran, Heart[test, ], type="class")
    Vote[heart.pred=="Yes"] <- Vote[heart.pred=="Yes"] + 1
    preds <- rep("Yes", length(test))
    # Apply the majority rule : predict "No" when fewer than
    # half of the i trees so far voted "Yes"
    preds[Vote < i/2] <- "No"
    bag.err[i] <- mean(Heart$AHD[test]!=preds)
}

# Visualize bagging decision tree 
plot(bag.err, type="l", xlab="Number of Trees", col=1, ylab="Classification Error Rate")
abline(h=tree.err, lty=2, col=2)
legend("topright", c("Single tree", "Bagging"), col=c(2,1), lty=c(2,1))

 

 

  • The misclassification rate converges to roughly 0.23.

 

4. Out-of-Bag Error Estimation

  • The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations.
  • One can show that, on average, each bagged tree makes use of around two-thirds of the observations (a quick simulation below checks this).
  • The remaining one-third of the observations not used to fit a given bagged tree are referred to as out-of-bag (OOB) observations.
  • We can predict the response for the \(i\)th observation using each of the trees in which that observation was OOB. This will yield around \(\frac{B}{3}\) predictions for the \(i\)th observation on average.
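
A quick simulation checking the two-thirds claim (the sample size n = 300 and the 1000 replications are arbitrary choices for illustration):

# Fraction of distinct observations that appear in a bootstrap sample
set.seed(1)
n <- 300
frac <- replicate(1000, length(unique(sample(1:n, replace=TRUE))) / n)
mean(frac)   # close to 1 - 1/e = 0.632, i.e. roughly two-thirds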

 

4.1 [Ex] Average misclassification rate of a single tree

# Average misclassification rate of a single tree
# over 50 replications
set.seed(12345)
K <- 50
Err <- NULL 

for (i in 1:K) {
  train <- sample(1:nrow(Heart), nrow(Heart)*2/3) 
  test <- setdiff(1:nrow(Heart), train) 
  heart.tran <- tree(AHD ~., subset=train, Heart)
  heart.pred <- predict(heart.tran, Heart[test, ], type="class") 
  Err[i] <- mean(Heart$AHD[test]!=heart.pred)
}
summary(Err)
Tree.Err <- mean(Err)

 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1616  0.2121  0.2424  0.2473  0.2727  0.3333 

 

  • Over 50 replications, the mean misclassification rate is 0.2473 (median 0.2424).

 

4.2 [Ex] Out-of-bag misclassification rate

# OOB error estimation
# Valid : number of times each observation has been OOB
# Vote  : number of "Yes" votes received while OOB
# Mis   : -1 if currently misclassified, +1 if correctly classified, 0 if never OOB
set.seed(1234)
Valid <- Vote <- Mis <- rep(0, nrow(Heart)) 
OOB.err <- NULL

for (i in 1:B) {
  # Bootstrap sample of the Heart row indices
  index <- sample(1:nrow(Heart), replace=TRUE)
  # OOB (test) indices : observations not drawn into the bootstrap sample
  test <- setdiff(1:nrow(Heart), unique(index))
  Valid[test] <- Valid[test] + 1
  # Train model with bootstrapped training sample 
  heart.tran <- tree(AHD ~., Heart[index,])
  # Make predictions of test sets 
  heart.pred <- predict(heart.tran, Heart[test,], type="class")
  Vote[test] <- Vote[test] + (heart.pred=="Yes")
  # Apply the majority rule over the OOB votes collected so far
  preds <- rep("Yes", length(test))
  preds[Vote[test]/Valid[test] < 0.5] <- "No"
  # Mark misclassified (-1) and correctly classified (+1) OOB cases
  wh <- which(Heart$AHD[test]!=preds)
  Mis[test] <- 1
  Mis[test[wh]] <- -1
  OOB.err[i] <- sum(Mis==-1)/sum(Mis!=0)
}

# View statistical reports of error 
summary(OOB.err)
summary(OOB.err[-c(1:100)])

# Visualize results 
plot(OOB.err, type="l", xlab="Number of Trees", col=1,
     ylab="Classification Error Rate", ylim=c(0.1,0.4))
abline(h=Tree.Err, lty=2, col=2)
legend("topright", c("Single tree", "OOB"), col=c(2,1), lty=c(2,1))