
[R] Tree-Based Methods : Decision Tree

1. Tree-Based Methods

  • Tree-based methods for regression and classification involve stratifying or segmenting the predictor space into a number of simple regions.
  • The set of splitting rules used to segment the predictor space can be summarized in a tree.
  • Tree-based methods are simple and useful for interpretation.
  • Bagging and Random Forests grow multiple trees which are combined to yield a single consensus prediction.
  • Combining a large number of trees can result in dramatic improvements in prediction accuracy, at the expense of some loss of interpretability.

 

2. Decision Tree

  • At a given internal node, the label (of the form \(X_j < t_k\)) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to \(X_j \geq t_k\).
  • The number in each leaf is the mean of the response for the observations that fall there.

 

2.1 [Ex] Decision Trees using tree package

# Import library and dataset 
library(ISLR)
library(tree)
data(Hitters)
str(Hitters)

# Missing data
summary(Hitters$Salary)
miss <- is.na(Hitters$Salary)

# Fit a regression tree
g <- tree(log(Salary) ~ Years + Hits, data=Hitters, subset=!miss)
g
summary(g)

# Draw a tree
plot(g)
text(g, pretty=0)

 

 

tree(formula, data, weights, subset, x, y)

A tree is grown by binary recursive partitioning using the response in the specified formula and choosing splits from the terms of the right-hand side.
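
A fitted tree predicts with its leaf means: for any new combination of Years and Hits, predict() returns the mean log-salary of the region that observation falls into. A minimal sketch, reusing the fitted tree g from above (the example values for Years and Hits are arbitrary illustration points):

# predictions are simply the leaf means of g
new_players <- data.frame(Years=c(3, 10), Hits=c(50, 150))
predict(g, newdata=new_players)   # predicted log(Salary): one leaf mean per row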

 

# Prune a tree
g2 <- prune.tree(g, best=3)
plot(g2)
text(g2, pretty=0)

 

 

prune.tree(tree, k=NULL, best=NULL, newdata)

Determines a nested sequence of subtrees of the supplied tree by recursively "snipping" off the least important splits.

  • tree : fitted model object of class 'tree'.
  • best : integer requesting the size of a specific subtree in the cost-complexity sequence to be returned (the full sequence is sketched below).
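
Called without best (or k), prune.tree() returns the whole cost-complexity sequence rather than a single subtree, which makes it easy to see which subtree sizes are available. A minimal sketch, reusing the fitted tree g from above:

# the full cost-complexity sequence of subtrees of g
seq_g <- prune.tree(g)
seq_g$size   # number of terminal nodes of each subtree
seq_g$k      # corresponding values of the cost-complexity parameter
seq_g$dev    # deviance (here, residual sum of squares) of each subtree
plot(seq_g)  # deviance against subtree size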

 

2.2 [Ex] Decision Tree using rpart package

# Another R package for tree
library(rpart)
library(rpart.plot)

# Fit a regression tree
g3 <- rpart(log(Salary) ~ Years + Hits, data=Hitters, subset=!miss)
g3
summary(g3)

# Draw a fancy tree
prp(g3, branch.lwd=4, branch.col="darkgreen", extra=101)
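
rpart grows and prunes trees in terms of a complexity parameter cp, which plays the same role as \(\alpha\) in the cost-complexity criterion of section 3.2. A minimal sketch of inspecting the cp table and pruning the rpart fit (the cp value below is only an illustration; in practice it is read off the table, e.g. the value minimizing the cross-validated error xerror):

# cross-validated error for each value of the complexity parameter cp
printcp(g3)
plotcp(g3)

# prune back to the subtree corresponding to a chosen cp
g3_pruned <- prune(g3, cp=0.05)   # cp=0.05 is an arbitrary illustrative value
prp(g3_pruned)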

 

3. Details of the Tree-building Process

  • Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.
  • The points along the tree where the predictor space is split are referred to as internal nodes.
  • The predictor space :
    • Set of possible values : \(X_1, ..., X_p\)
    • Non-overlapping regions : \(R_1, ..., R_J\)
  • For every observation that falls into the region \(R_j\), make the same prediction, which is simply the mean of the response values for the training observations in \(R_j\).
  • Choose to divide the predictor space into high-dimensional rectangles.

 

The goal of constructing the tree is to find boxes \(R_1, ..., R_J\) that minimize the RSS, \(\sum_{j=1}^J\sum_{i\in R_j}(y_i - \hat{y}_{R_j})^2\).
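
The fitted values of the tree from section 2.1 are exactly these region means, so the RSS can be recomputed by hand. A minimal sketch, reusing g, Hitters and miss from above (the names y_train and y_hat are my own):

# each observation's fitted value is the mean log(Salary) of its region
y_train <- log(Hitters$Salary[!miss])
y_hat <- predict(g, newdata=Hitters[!miss, ])
sum((y_train - y_hat)^2)                   # total RSS over all regions
tapply((y_train - y_hat)^2, y_hat, sum)    # the inner sum, one entry per leaf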

 

3.1 Recursive Binary Splitting

  • Creating a binary decision tree is actually a process of dividing up the input space.
  • A greedy approach called recursive binary splitting is used to divide the space.
  • It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than picking a split that will lead to a better tree in some future step.

 

  1. Select the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into the regions {\(X|X_j < s\)} and {\(X|X_j \geq s\)} leads to the greatest possible reduction in RSS (see the sketch after this list).
  2. Repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions.
  3. The process continues until a stopping criterion is reached, e.g. no region contains more than five observations.
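
A minimal sketch of step 1 for a single predictor: scan every candidate cutpoint s and keep the one whose two half-spaces give the smallest total RSS (the names dat, best_split and rss_of are my own):

# greedy search for the best cutpoint of one predictor
dat <- na.omit(Hitters[, c("Years", "Hits", "Salary")])
y <- log(dat$Salary)

best_split <- function(x, y) {
  rss_of <- function(s) {                 # RSS of the two half-spaces at cutpoint s
    left <- y[x < s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }
  cuts <- sort(unique(x))[-1]             # candidate cutpoints (skip the minimum)
  rss <- sapply(cuts, rss_of)
  c(cutpoint=cuts[which.min(rss)], rss=min(rss))
}

best_split(dat$Years, y)   # compare with the first split reported by tree()
best_split(dat$Hits, y)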

 

3.2 Pruning a Tree

  • The process with many splits may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance.
  • A smaller tree with fewer splits might lead to lower variance and better interpretation at the cost of a little bias.
  • Grow a very large tree \(T_0\), and then prune it back in order to obtain a subtree.
  • Cost complexity pruning (weakest link pruning) : \(\sum_{m=1}^{|T|}\sum_{i:x_i \in R_m}(y_i - \hat{y}_{R_m})^2 + \alpha|T|\)
  • \(\alpha\) controls a trade-off between the subtree's complexity and its fit to the training data.
  • Select an optimal value \(\hat{\alpha}\) using cross-validation (see the sketch below).
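
A minimal sketch of this selection with cv.tree() from the tree package, reusing the fitted tree g from section 2.1:

# K-fold cross-validation over the cost-complexity sequence of g
set.seed(1)
cv_g <- cv.tree(g, FUN=prune.tree, K=10)
cv_g$size    # subtree sizes |T|
cv_g$k       # corresponding values of alpha
cv_g$dev     # cross-validated deviance (residual sum of squares)

# keep the subtree whose size minimizes the cross-validated error
best_size <- cv_g$size[which.min(cv_g$dev)]
g_cv <- prune.tree(g, best=best_size)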

 

3.3 Summary : Tree Algorithm

  • Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations (stopping criterion).
  • Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\).
    • We can use the prune.tree(tree, best=K) function.
  • Use K-fold cross-validation to choose \(\alpha\). For each \(k = 1, ..., K\) :
    • Repeat steps 1 and 2 on the \(\frac{K - 1}{K}\)th fraction of the training data, excluding the \(k\)th fold.
    • Evaluate the mean squared prediction error on the data in the left-out \(k\)th fold, as a function of \(\alpha\).
    • Average the results, and pick \(\alpha\) to minimize the average error.
  • Return the subtree from step 2 that corresponds to the chosen value of \(\alpha\) (an end-to-end sketch follows below).
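
A minimal end-to-end sketch of the whole algorithm on the Hitters data, with a random train/test split to estimate the test error of the pruned tree (the split and the object names hit, big, final are my own choices):

# 1. grow a large tree on a training half of the complete cases
hit <- na.omit(Hitters[, c("Years", "Hits", "Salary")])
set.seed(2)
train <- sample(nrow(hit), floor(nrow(hit) / 2))
big <- tree(log(Salary) ~ Years + Hits, data=hit, subset=train)

# 2-3. cost complexity pruning, with the subtree size chosen by cross-validation
cv_big <- cv.tree(big, FUN=prune.tree, K=10)
final <- prune.tree(big, best=cv_big$size[which.min(cv_big$dev)])

# 4. mean squared prediction error of the returned subtree on the test half
pred <- predict(final, newdata=hit[-train, ])
mean((log(hit$Salary[-train]) - pred)^2)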