
[R] Tree-Based Methods : Bagging

1. Ensemble methods

  • An ensemble method is an approach that combines many simple building block models in order to obtain a single, potentially very powerful model.
  • These simple building block models are sometimes known as weak learners.
  • Methods
    • Bagging
    • Random Forest
    • Boosting
    • Bayesian additive regression trees

 

2. Bagging

2.1 Bootstrap methods

  1. Treating $(X_1, \dots, X_n)$ as the population, sample n observations with replacement.
  2. Repeat B times to obtain B bootstrap sets:
    1. Calculate the statistic of interest from each sampled set $(X_1^*, \dots, X_n^*)$.
    2. Aggregate the statistics over the B bootstrap sets.

 

# Population of training set 
seq(20) 

# Bootstrap set (sampling with replacement)
sort(sample(seq(20), 20, replace=TRUE))
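
Putting these steps together, here is a minimal sketch of the full procedure on the same toy population (the number of replicates B and the statistic, the sample mean, are illustrative choices):

# Draw B bootstrap sets from the toy population and compute the mean on each
set.seed(1)
x <- seq(20)
B <- 1000
boot.means <- replicate(B, mean(sample(x, replace=TRUE)))

# Aggregated (bootstrap) estimate and its bootstrap standard error
mean(boot.means)
sd(boot.means)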

 

2.2 Bagging Tree methods

  • Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
  • Recompute the statistic on every sampled set and average the results: $\frac{1}{B}\sum_{i=1}^{B}\bar{X}_i$.
  • Take repeated samples from the training set.
  • Generate B different bootstrapped training data sets and average the B fitted models: $\hat{f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)$ (see the sketch after this list).
  • For classification trees: for each test observation, record the class predicted by each of the B trees and take a majority vote; the overall prediction is the most commonly occurring class among the B predictions.
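
For regression trees, bagging is literally this average of the B fitted trees. A minimal sketch, where dat (a data frame with a numeric response y) and newdata are hypothetical placeholders:

library(tree)

# Bagging for a regression tree: average the predictions of B bootstrapped trees
B <- 100
bag.pred <- 0
for (b in 1:B) {
  idx <- sample(1:nrow(dat), replace=TRUE)      # bootstrap sample of the rows
  fit <- tree(y ~ ., dat[idx, ])                # fit a tree to the bootstrap sample
  bag.pred <- bag.pred + predict(fit, newdata)  # accumulate predictions on new data
}
bag.pred <- bag.pred / B                        # f_bag(x): average over the B trees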

 

3. Building a Bagged Decision Tree

# Assumes the tree package and the Heart data (e.g., Heart.csv from the ISLR website,
# with AHD coded as a factor) have already been loaded
library(tree)

# Separate training and test sets
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)

# Classification error rate for a single tree
heart.tran <- tree(AHD ~ ., Heart, subset=train)
heart.pred <- predict(heart.tran, Heart[test, ], type="class")
tree.err <- mean(Heart$AHD[test]!=heart.pred)
tree.err

# Bagging 
set.seed(12345)
B <- 500
n <- nrow(Heart)
Vote <- rep(0, length(test))
bag.err <- NULL 

for (i in 1:B) {
    # Bootstrap training set 
    index <- sample(train, replace=TRUE)
    heart.tran <- tree(AHD ~., Heart[index,])
    heart.pred <- predict(heart.tran, Heart[test, ], type="class")
    Vote[heart.pred=="Yes"] <- Vote[heart.pred=="Yes"] + 1
    preds <- rep("Yes", length(test))
    # Apply the majority rule: classify as "No" when the "Yes" votes are fewer than i/2
    preds[Vote < i/2] <- "No"
    bag.err[i] <- mean(Heart$AHD[test]!=preds)
}

# Visualize bagging decision tree 
plot(bag.err, type="l", xlab="Number of Trees", col=1, ylab="Classification Error Rate")
abline(h=tree.err, lty=2, col=2)
legend("topright", c("Single tree", "Bagging"), col=c(2,1), lty=c(2,1))

 

 

  • The misclassification rate converges to roughly 0.23.

 

4. Out-of-Bag Error Estimation

  • The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations.
  • One can show that on average, each bagged tree makes use of around two-thirds of the observations.
  • The remaining one-third of the observations not used to fit a given bagged tree are referred to as out-of-bag (OOB) observations.
  • We can predict the response for the $i$th observation using each of the trees in which that observation was OOB. This will yield around $B/3$ predictions for the $i$th observation on average (the two-thirds figure is checked by simulation below).
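
The two-thirds figure comes from $P(\text{observation } i \in \text{bootstrap sample}) = 1-(1-1/n)^n \approx 1-e^{-1} \approx 0.632$, which is easy to verify by simulation on the Heart data:

# Expected fraction of observations appearing in a bootstrap sample (about 2/3)
set.seed(1)
n <- nrow(Heart)
in.bag <- replicate(1000, length(unique(sample(1:n, replace=TRUE))) / n)
mean(in.bag)   # roughly 0.632, so roughly 1/3 of the observations are OOB per tree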

 

4.1 [Ex] Average misclassification rate of a single tree

# Average misclassification rate of a single tree
# over 50 replications
set.seed(12345)
K <- 50
Err <- NULL 

for (i in 1:K) {
  train <- sample(1:nrow(Heart), nrow(Heart)*2/3) 
  test <- setdiff(1:nrow(Heart), train) 
  heart.tran <- tree(AHD ~ ., Heart, subset=train)
  heart.pred <- predict(heart.tran, Heart[test, ], type="class") 
  Err[i] <- mean(Heart$AHD[test]!=heart.pred)
}
summary(Err)
Tree.Err <- mean(Err)

 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1616  0.2121  0.2424  0.2473  0.2727  0.3333 

 

  • Over 50 replications, the mean misclassification rate is 0.2473 (median 0.2424).

 

4.2 [Ex] Out-of-bag misclassification rate

# OOB 
set.seed(1234)
Valid <- Vote <- Mis <- rep(0, nrow(Heart)) 
OOB.err <- NULL

for (i in 1:B) {
  # Bootstrapping from Heart index 
  index <- sample(1:nrow(Heart), replace=TRUE)
  # Extract the OOB (test) indices not drawn in the bootstrap sample
  test <- setdiff(1:nrow(Heart), unique(index))
  Valid[test] <- Valid[test] + 1
  # Train model with bootstrapped training sample 
  heart.tran <- tree(AHD ~., Heart[index,])
  # Make predictions of test sets 
  heart.pred <- predict(heart.tran, Heart[test,], type="class")
  Vote[test] <- Vote[test] + (heart.pred=="Yes")
  # Vote for test sets 
  preds <- rep("Yes", length(test))
  preds[Vote[test]/Valid[test] < 0.5] <- "No"
  # Find indices of misclassified cases
  wh <- which(Heart$AHD[test]!=preds)
  Mis[test[wh]] <- -1
  Mis[test[-wh]] <- 1
  OOB.err[i] <- sum(Mis==-1)/sum(Mis!=0)
}

# View statistical reports of error 
summary(OOB.err)
summary(OOB.err[-c(1:100)])

# Visualize results 
plot(OOB.err, type="l", xlab="Number of Trees", col=1,
     ylab="Classification Error Rate", ylim=c(0.1,0.4))
abline(h=Tree.Err, lty=2, col=2)
legend("topright", c("Single tree", "OOB"), col=c(2,1), lty=c(2,1))