

[R] Tree-Based Methods : Boosting

1. What is Boosting?

  • Boosting is a general approach that can be applied to many statistical learning methods for regression or classification.
  • Boosting grows trees sequentially: each tree is fit using information from the previously grown trees.
  • The boosting approach learns slowly.
  • Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals.
  • Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter \(d\) in the algorithm.
  • Popular boosting algorithms include:
    • XGBoost
    • AdaBoost

 

1.1 Boosting algorithm

  1. Set \(\hat{f}(x) = 0\) and \(r_i = y_i\) for all \(i\) in the training set.
  2. For \(b = 1, 2, ..., B\), repeat :
    1. Fit a tree \(\hat{f}^b\) with \(d\) splits to the training data \((X, r)\).
    2. Update \(\hat{f}\) by adding in a shrunken version of the new tree : \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda\hat{f}^b(x)\)
    3. Update the residuals, \(r_i \leftarrow r_i - \lambda\hat{f}^b(x_i)\)
  3. Output the boosted model : \(\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)\).
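The loop above can be written almost verbatim in R. Below is a minimal sketch for a numeric (regression) response using small rpart trees fit to the residuals; the function boost_sketch, the default values of B, lambda and d, and the use of maxdepth as a stand-in for "d splits" are illustrative assumptions, not the gbm() workflow used later in this post.

# Minimal sketch of the boosting loop (illustrative; assumes rpart is installed)
library(rpart)

boost_sketch <- function(X, y, B = 100, lambda = 0.01, d = 1) {
    r <- y                                   # step 1: residuals start as the raw response
    trees <- vector("list", B)
    for (b in 1:B) {
        # step 2a: fit a small tree to the current residuals
        # (maxdepth is used here as a rough stand-in for "d splits")
        fit <- rpart(r ~ ., data = data.frame(X, r = r),
                     control = rpart.control(maxdepth = d, cp = 0))
        trees[[b]] <- fit
        # steps 2b-2c: add a shrunken copy of the tree and update the residuals
        r <- r - lambda * predict(fit, newdata = data.frame(X))
    }
    # step 3: the boosted prediction is the shrunken sum over all B trees
    function(newX) {
        rowSums(sapply(trees, function(t) lambda * predict(t, newdata = data.frame(newX))))
    }
}

# Example usage with made-up data:
# set.seed(1); X <- data.frame(x = runif(200)); y <- sin(2*pi*X$x) + rnorm(200, sd = 0.1)
# f_hat <- boost_sketch(X, y, B = 500, lambda = 0.05, d = 1)
# plot(X$x, y); points(X$x, f_hat(X), col = "red")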

 

1.2 Tuning parameters for boosting 

  • The number of trees \(B\) : Boosting can overfit if \(B\) is too large, although this overfitting tends to occur slowly if at all.
  • The shrinkage parameter \(\lambda\) : Controls the rate at which boosting learns. Typical values are 0.01 or 0.001.
  • The number of splits \(d\) in each tree : Controls the complexity of the boosted ensemble.
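These three parameters map directly onto gbm() arguments: n.trees is \(B\), shrinkage is \(\lambda\), and interaction.depth is \(d\). A rough sketch of choosing \(B\) with gbm's built-in cross-validation (cv.folds and gbm.perf) follows; it assumes the Heart0 data frame and train index constructed in Step 1 of the next section, and the specific values are illustrative, not recommendations.

# Illustrative mapping of the tuning parameters onto gbm() arguments
library(gbm)
boost.cv <- gbm(AHD ~ ., data = Heart0[train, ],
                distribution = "bernoulli",
                n.trees = 3000,           # B : number of trees
                shrinkage = 0.01,         # lambda : shrinkage / learning rate
                interaction.depth = 2,    # d : splits per tree
                cv.folds = 5)             # built-in cross-validation
gbm.perf(boost.cv, method = "cv")         # CV estimate of the best number of trees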

 

2. Building a Boosting Decision Tree for a Classification Problem

Step 1 : Prerequisites

# Prerequisites : import library, load dataset, preprocessing
library(gbm)

# Load the Heart data (assumed to be Heart.csv from the ISLR companion site,
# already placed in the working directory)
Heart <- read.csv("Heart.csv", stringsAsFactors = TRUE)

# Train-Test split
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)

# Create a (0,1) response: gbm with distribution="bernoulli" expects a numeric 0/1 outcome
Heart0 <- Heart
Heart0[,"AHD"] <- as.numeric(Heart$AHD)-1

 

Step 2 : Training models with different tuning parameters

# Boosting (d=1)
boost.d1 <- gbm(AHD~., data=Heart0[train, ], n.trees=1000,
                distribution="bernoulli", interaction.depth=1)

# Relative influence of each predictor
summary(boost.d1)

# Calculate the misclassification error
# predict() for gbm objects requires n.trees to be set explicitly
yhat.d1 <- predict(boost.d1, newdata=Heart0[test, ], type="response", n.trees=1000)
phat.d1 <- rep(0, length(yhat.d1))
phat.d1[yhat.d1 > 0.5] <- 1
mean(phat.d1!=Heart0[test, "AHD"])
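A confusion table (an optional addition, not part of the original output) shows where the d = 1 model's test errors occur:

# Optional: confusion table for the d = 1 model on the test set
table(Predicted = phat.d1, Actual = Heart0[test, "AHD"])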

# Boosting (d=2)
# Training model with interaction.depth = 2
boost.d2 <- gbm(AHD~., data=Heart0[train, ], n.trees=1000,
distribution="bernoulli", interaction.depth=2)

# Make predictions 
yhat.d2 <- predict(boost.d2, newdata=Heart0[test, ],
type="response", n.trees=1000)
phat.d2 <- rep(0, length(yhat.d2))
phat.d2[yhat.d2 > 0.5] <- 1
mean(phat.d2!=Heart0[test, "AHD"])

# Boosting (d=3)
# Training model with interaction.depth = 3
boost.d3 <- gbm(AHD~., data=Heart0[train, ], n.trees=1000,
distribution="bernoulli", interaction.depth=3)

# Make predictions
yhat.d3 <- predict(boost.d3, newdata=Heart0[test, ],
type="response", n.trees=1000)
phat.d3 <- rep(0, length(yhat.d3))
phat.d3[yhat.d3 > 0.5] <- 1
mean(phat.d3!=Heart0[test, "AHD"])

# Boosting (d=4)
# Training model with interaction.depth = 4
boost.d4 <- gbm(AHD~., data=Heart0[train, ], n.trees=1000,
distribution="bernoulli", interaction.depth=4)

# Make predictions 
yhat.d4 <- predict(boost.d4, newdata=Heart0[test, ],
type="response", n.trees=1000)
phat.d4 <- rep(0, length(yhat.d4))
phat.d4[yhat.d4 > 0.5] <- 1
mean(phat.d4!=Heart0[test, "AHD"])

 

 

Test misclassification error (n.trees = 1000):

gbm(interaction.depth=1) : 0.2013423
gbm(interaction.depth=2) : 0.2013423
gbm(interaction.depth=3) : 0.2281879
gbm(interaction.depth=4) : 0.1879105

 

3. Boosting Decision Tree with Hyperparameter Tuning (n.trees, interaction.depth)

# Simulation: Boosting with d=1, 2, 3 and 4
# The number of trees: 1 to 3000

# Error matrix: rows = number of trees (1 to 3000), columns = interaction.depth (1 to 4)
set.seed(1111)
Err <- matrix(0, 3000, 4)

# Fit one model per interaction.depth, then evaluate it at every number of trees
for (k in 1:4) {
    boost <- gbm(AHD~., data=Heart0[train, ], n.trees=3000,
                 distribution="bernoulli", interaction.depth=k)
    for (i in 1:3000) {
        # Predictions using only the first i trees
        yhat <- predict(boost, newdata=Heart0[test, ],
                        type="response", n.trees=i)
        phat <- rep(0, length(yhat))
        phat[yhat > 0.5] <- 1
        Err[i,k] <- mean(phat!=Heart0[test, "AHD"])
    }
}

# Visualize the results
labels <- c("d = 1", "d = 2", "d = 3", "d = 4")
matplot(Err, type="l", xlab="Number of Trees", lty=2, col=1:4,
        ylab="Classification Error Rate")
legend("topright", legend=labels, col=1:4, lty=2)

# Summary statistics of the error rates (all trees, and excluding the first 100 trees)
colnames(Err) <- labels
apply(Err, 2, summary)
apply(Err[-c(1:100),], 2, summary)
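As a side note, predict() for gbm objects accepts a vector of values for n.trees and returns one column of predictions per value, so the slow inner loop can be avoided. A sketch of that shortcut (results will differ slightly from the loop above because the models are refit):

# Faster variant (sketch): one predict() call per depth instead of 3000
for (k in 1:4) {
    boost <- gbm(AHD~., data=Heart0[train, ], n.trees=3000,
                 distribution="bernoulli", interaction.depth=k)
    # prob is an (n_test x 3000) matrix: one column per number of trees
    prob <- predict(boost, newdata=Heart0[test, ], type="response", n.trees=1:3000)
    Err[, k] <- colMeans((prob > 0.5) != Heart0[test, "AHD"])
}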

 

 

  • Because boosting builds the model sequentially, the ensemble is inaccurate when only a few trees have been added; the error rate drops as more trees are fitted and then levels off.
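To read the bottom of each curve directly, the error matrix from the simulation can be queried for the best number of trees per depth (an optional addition):

# Optional: best number of trees and the corresponding error rate for each depth
best.trees <- apply(Err, 2, which.min)
best.err   <- apply(Err, 2, min)
rbind(best.trees, best.err)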