
[R] Tree-Based Methods : Bagging

1. Ensemble methods

  • An ensemble method is an approach that combines many simple building block models in order to obtain a single, potentially very powerful model.
  • These simple building block models are sometimes known as weak learners.
  • Methods
    • Bagging
    • Random Forest
    • Boosting
    • Bayesian additive regression trees

 

2. Bagging

2.1 Bootstrap methods

  1. Treating $(X_1, \dots, X_n)$ as the population, sample n observations with replacement.
  2. Repeat B times to obtain B bootstrap sets:
    1. Calculate the statistic of interest from each sampled set $(X_1^*, \dots, X_n^*)$.
    2. Aggregate the statistics over the B bootstrap sets.

 

# Population of training set 
seq(20) 

# Bootstrap set (sampling with replacement)
sort(sample(seq(20), 20, replace=TRUE))
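
Putting these steps together, here is a minimal sketch of the full procedure on the same toy population (the number of replicates B and the statistic, the sample mean, are illustrative choices):

# Draw B bootstrap sets from the toy population and compute the mean on each
set.seed(1)
x <- seq(20)
B <- 1000
boot.means <- replicate(B, mean(sample(x, replace=TRUE)))

# Aggregated (bootstrap) estimate and its bootstrap standard error
mean(boot.means)
sd(boot.means)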

 

2.2 Bagging Tree methods

  • Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
  • Recompute the statistic on every sampled set and average the results: $\frac{1}{B}\sum_{i=1}^{B}\bar{X}_i$.
  • Take repeated samples from the training set.
  • Generate B different bootstrapped training data sets and average the B fitted models: $\hat{f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)$ (see the sketch after this list).
  • For classification trees: for each test observation, record the class predicted by each of the B trees and take a majority vote; the overall prediction is the most commonly occurring class among the B predictions.
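
For regression trees, bagging is literally this average of the B fitted trees. A minimal sketch, where dat (a data frame with a numeric response y) and newdata are hypothetical placeholders:

library(tree)

# Bagging for a regression tree: average the predictions of B bootstrapped trees
B <- 100
bag.pred <- 0
for (b in 1:B) {
  idx <- sample(1:nrow(dat), replace=TRUE)      # bootstrap sample of the rows
  fit <- tree(y ~ ., dat[idx, ])                # fit a tree to the bootstrap sample
  bag.pred <- bag.pred + predict(fit, newdata)  # accumulate predictions on new data
}
bag.pred <- bag.pred / B                        # f_bag(x): average over the B trees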

 

3. Building a Bagged Decision Tree

# Assumes the tree package and the Heart data (e.g., Heart.csv from the ISLR website,
# with AHD coded as a factor) have already been loaded
library(tree)

# Separate training and test sets
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)

# Classification error rate for a single tree
heart.tran <- tree(AHD ~ ., Heart, subset=train)
heart.pred <- predict(heart.tran, Heart[test, ], type="class")
tree.err <- mean(Heart$AHD[test]!=heart.pred)
tree.err

# Bagging 
set.seed(12345)
B <- 500
n <- nrow(Heart)
Vote <- rep(0, length(test))
bag.err <- NULL 

for (i in 1:B) {
    # Bootstrap training set 
    index <- sample(train, replace=TRUE)
    heart.tran <- tree(AHD ~., Heart[index,])
    heart.pred <- predict(heart.tran, Heart[test, ], type="class")
    Vote[heart.pred=="Yes"] <- Vote[heart.pred=="Yes"] + 1
    preds <- rep("Yes", length(test))
    # Apply the majority rule: classify as "No" when the "Yes" votes are fewer than i/2
    preds[Vote < i/2] <- "No"
    bag.err[i] <- mean(Heart$AHD[test]!=preds)
}

# Visualize bagging decision tree 
plot(bag.err, type="l", xlab="Number of Trees", col=1, ylab="Classification Error Rate")
abline(h=tree.err, lty=2, col=2)
legend("topright", c("Single tree", "Bagging"), col=c(2,1), lty=c(2,1))

 

 

  • The misclassification rate converges to roughly 0.23.

 

4. Out-of-Bag Error Estimation

  • The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations.
  • One can show that on average, each bagged tree makes use of around two-thirds of the observations.
  • The remaining one-third of the observations not used to fit a given bagged tree are referred to as out-of-bag (OOB) observations.
  • We can predict the response for the $i$th observation using each of the trees in which that observation was OOB. This will yield around $B/3$ predictions for the $i$th observation on average (the two-thirds figure is checked by simulation below).
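
The two-thirds figure comes from $P(\text{observation } i \in \text{bootstrap sample}) = 1-(1-1/n)^n \approx 1-e^{-1} \approx 0.632$, which is easy to verify by simulation on the Heart data:

# Expected fraction of observations appearing in a bootstrap sample (about 2/3)
set.seed(1)
n <- nrow(Heart)
in.bag <- replicate(1000, length(unique(sample(1:n, replace=TRUE))) / n)
mean(in.bag)   # roughly 0.632, so roughly 1/3 of the observations are OOB per tree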

 

4.1 [Ex] Average misclassification rate of a single tree

# Average misclassification rate of a single tree
# over 50 replications
set.seed(12345)
K <- 50
Err <- NULL 

for (i in 1:K) {
  train <- sample(1:nrow(Heart), nrow(Heart)*2/3) 
  test <- setdiff(1:nrow(Heart), train) 
  heart.tran <- tree(AHD ~ ., Heart, subset=train)
  heart.pred <- predict(heart.tran, Heart[test, ], type="class") 
  Err[i] <- mean(Heart$AHD[test]!=heart.pred)
}
summary(Err)
Tree.Err <- mean(Err)

 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1616  0.2121  0.2424  0.2473  0.2727  0.3333 

 

  • Over 50 replications, the mean misclassification rate is 0.2473 (median 0.2424).

 

4.2 [Ex] Out-of-bag misclassification rate

# OOB 
set.seed(1234)
Valid <- Vote <- Mis <- rep(0, nrow(Heart)) 
OOB.err <- NULL

for (i in 1:B) {
  # Bootstrapping from Heart index 
  index <- sample(1:nrow(Heart), replace=TRUE)
  # Extract the OOB (test) indices not drawn in the bootstrap sample
  test <- setdiff(1:nrow(Heart), unique(index))
  Valid[test] <- Valid[test] + 1
  # Train model with bootstrapped training sample 
  heart.tran <- tree(AHD ~., Heart[index,])
  # Make predictions of test sets 
  heart.pred <- predict(heart.tran, Heart[test,], type="class")
  Vote[test] <- Vote[test] + (heart.pred=="Yes")
  # Vote for test sets 
  preds <- rep("Yes", length(test))
  preds[Vote[test]/Valid[test] < 0.5] <- "No"
  # Find indices of misclassified cases
  wh <- which(Heart$AHD[test]!=preds)
  Mis[test[wh]] <- -1
  Mis[test[-wh]] <- 1
  OOB.err[i] <- sum(Mis==-1)/sum(Mis!=0)
}

# View statistical reports of error 
summary(OOB.err)
summary(OOB.err[-c(1:100)])

# Visualize results 
plot(OOB.err, type="l", xlab="Number of Trees", col=1,
     ylab="Classification Error Rate", ylim=c(0.1,0.4))
abline(h=Tree.Err, lty=2, col=2)
legend("topright", c("Single tree", "OOB"), col=c(2,1), lty=c(2,1))