
[R] Best Subset Selection

1. Three Classes of Methods

  • To address the problem that variance grows as the number of features p increases (especially when p approaches or exceeds n), we need methods that effectively reduce the number of predictors. A minimal sketch of each approach follows this list.
  • Subset Selection : Identify a subset of the p predictors that we believe to be related to the response, and fit a model using only that subset.
  • Shrinkage : Fit a model involving all p predictors, but shrink the estimated coefficients towards zero relative to the OLS estimates.
  • Dimension Reduction : Project the p predictors into an M-dimensional subspace (M < p). PCA is an example.
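
As a rough sketch of what each class looks like in R (assuming the ISLR Hitters data used throughout this post; the glmnet and pls packages are not used elsewhere in this post and are only illustrative choices):

# Minimal sketch of the three classes on the Hitters data
# (assumes the leaps, glmnet, and pls packages are installed)
library(ISLR); library(leaps); library(glmnet); library(pls)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]   # predictor matrix without intercept
y <- Hitters$Salary

# 1) Subset selection : search over subsets of predictors
subset.fit <- regsubsets(Salary ~ ., data=Hitters, nvmax=19)

# 2) Shrinkage : keep all predictors, shrink coefficients (ridge, alpha = 0)
ridge.fit <- glmnet(x, y, alpha=0)

# 3) Dimension reduction : principal components regression
pcr.fit <- pcr(Salary ~ ., data=Hitters, scale=TRUE)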

 

2. Best Subset Selection

  • The most direct approach is called all subsets or best subsets regression.
  • Fit a separate model for every possible subset of the predictors, then choose among them using a criterion that balances training error against model size.

 

Let's consider building a best subset model with \(p = 10\) features. First, we fit the null model \(M_0\), which contains no predictors. Next, for \(k = 1, 2, \ldots, p = 10\), repeat the following steps:

 

  1. Fit all \(\binom{p}{k} = \binom{10}{k}\) models that contain exactly \(k\) predictors. (All of these models have the same number of predictors; a manual sketch of this enumeration for a small p is shown after the list.)
  2. Pick the best among these \(\binom{p}{k}\) models and call it \(M_k\). Here "best" means having the smallest \(RSS\), or equivalently the largest \(R^2\). (\(RSS\) and \(R^2\) are only used to compare models of the same size.)
  3. Select a single best model from among \(M_0, M_1, \ldots, M_p\) using CVE (cross-validation error), \(C_p\), \(AIC\), \(BIC\), or adjusted \(R^2\).
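
As a rough illustration (not part of the original code), steps 1-2 can be written out by hand for a tiny set of predictors; the three predictor names below are arbitrary choices from Hitters, and regsubsets() in the next block performs this search efficiently for all predictors.

# Manual sketch of steps 1-2 for a tiny p (illustration only)
library(ISLR)
Hitters <- na.omit(Hitters)
preds <- c("AtBat", "Hits", "Walks")              # arbitrary small predictor set
sapply(seq_along(preds), function(k) {
  combos <- combn(preds, k, simplify=FALSE)       # all C(p, k) subsets of size k
  rss <- sapply(combos, function(v) {
    f <- reformulate(v, response="Salary")        # e.g. Salary ~ AtBat + Hits
    sum(resid(lm(f, data=Hitters))^2)             # RSS of this subset
  })
  min(rss)                                        # RSS of the best size-k model M_k
})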

 

# Importing Library
library(ISLR)
names(Hitters)
dim(Hitters)
sum(is.na(Hitters$Salary))
Hitters <- na.omit(Hitters)
dim(Hitters)
sum(is.na(Hitters))

# Modeling best subset selection (by default, regsubsets() considers models of up to 8 predictors)
library(leaps)
fit <- regsubsets(Salary ~ ., Hitters)

summary(fit)
sg <- summary(fit)
names(sg)

dim(sg$which)
sg$which
plot(fit)
plot(fit, scale="Cp")
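
As a quick follow-up (not in the original code), the coefficients of any of the selected models can be pulled out with coef(); the size 6 below is just an arbitrary example.

# Coefficients of, e.g., the best 6-predictor model found by regsubsets()
coef(fit, id=6)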

 

 

 

3. Choosing the Optimal Model

  • \(RSS\) and \(R^2\) are only used to choose the best model among the \(\binom{p}{k}\) models of a given size.
  • To choose a model with low test error across different sizes, there are two approaches (a small numeric sketch of \(C_p\) follows this list):
    • Indirectly estimate test error : \(C_p\), \(AIC\), \(BIC\), and adjusted \(R^2\)
    • Directly estimate test error : \(CVE\) (cross-validation error)
  • Mallows' \(C_p\) : \(C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)\). It trades off \(RSS\) against \(2d\hat{\sigma}^2\), penalizing the number of predictors \(d\).
    - \(AIC\) : \(AIC = -2\log(L) + 2d\)
    - \(BIC\) : \(BIC = \frac{1}{n}(RSS + \log(n)\,d\hat{\sigma}^2)\). Its penalty grows with the sample size; since \(\log(n) > 2\) for \(n \geq 8\), BIC penalizes model size more heavily than \(C_p\).
    - Adjusted \(R^2\) : replaces \(RSS\) by \(\frac{RSS}{n-d-1}\) in the \(R^2\) formula, where \(d\) is the number of predictors and \(n\) is the sample size.
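
As a rough numeric sketch (assuming, as is common, that \(\hat{\sigma}^2\) is estimated from the full least-squares model; the three-predictor subset below is arbitrary), \(C_p\) can be computed by hand for a single candidate model:

# Hand computation of Mallows' C_p for one candidate subset (illustration only)
full <- lm(Salary ~ ., data=Hitters)
sigma2.hat <- summary(full)$sigma^2              # sigma^2 estimated from the full model
sub <- lm(Salary ~ AtBat + Hits + Walks, data=Hitters)
rss <- sum(resid(sub)^2)
d <- length(coef(sub)) - 1                       # number of predictors (excluding intercept)
n <- nrow(Hitters)
(rss + 2 * d * sigma2.hat) / n                   # C_p = (RSS + 2*d*sigma^2) / n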

 

3.1 Find the best model among models of the same size (RSS, \(R^2\))

# Find the best model of each size (nbest=10 keeps the 10 best models per size)
big <- regsubsets(Salary ~ ., data=Hitters, nvmax=19, nbest=10)
sg <- summary(big)
dim(sg$which)
sg.size <- as.numeric(rownames(sg$which))   # model size of each row of sg$which
table(sg.size)

sg.rss <- tapply(sg$rss, sg.size, min)
w1 <- which.min(sg.rss)
sg.rsq <- tapply(sg$rsq, sg.size, max)
w2 <- which.max(sg.rsq)

# Visualize plot of RSS and R^2 
par(mfrow=c(1,2))
plot(1:19, sg.rss, type="b", xlab="Number of Predictors",
     ylab="Residual Sum of Squares", col=2, pch=19)
points(w1, sg.rss[w1], pch="x", col="blue", cex=2)
plot(1:19, sg.rsq, type="b", xlab="Number of Predictors",
     ylab=expression(R^2), col=2, pch=19)
points(w2, sg.rsq[w2], pch="x", col="blue", cex=2)

 

 

3.2 Find the best model across different model sizes (\(C_p\), BIC, Adjusted \(R^2\))

# Find the best model based on different model size 
sg.cp <- tapply(sg$cp, sg.size, min)
w3 <- which.min(sg.cp)
sg.bic <- tapply(sg$bic, sg.size, min)
w4 <- which.min(sg.bic)
sg.adjr2 <- tapply(sg$adjr2, sg.size, max)
w5 <- which.max(sg.adjr2)

# Visualize results of C_p, BIC, adjusted R^2
par(mfrow=c(1,3))
plot(1:19, sg.cp, type="b", xlab ="Number of Predictors",
     ylab=expression(C[p]), col=2, pch=19)
points(w3, sg.cp[w3], pch="x", col="blue", cex=2)
plot(1:19, sg.bic, type="b", xlab ="Number of Predictors",
     ylab="Bayesian information criterion", col=2, pch=19)
points(w4, sg.bic[w4], pch="x", col="blue", cex=2)
plot(1:19, sg.adjr2, type="b", xlab ="Number of Predictors",
     ylab=expression(paste("Adjusted ", R^2)), col=2, pch=19)
points(w5, sg.adjr2[w5], pch="x", col="blue", cex=2)

 

 

3.3 Find the best model based on Validation Set

  • \(RSS\) and \(R^2\) compare models with the same number of predictors : choosing each \(M_k\)
  • \(C_p\), \(AIC\), \(BIC\), and adjusted \(R^2\) compare models with different numbers of predictors
  • To choose among the \(M_k\) models directly, we can split the observations into a training set and a validation set. (The validation observations must not be used to fit the models.)

 

library(ISLR)
library(leaps)
names(Hitters)
Hitters <- na.omit(Hitters)

# Train-validation split : random TRUE/FALSE assignment (roughly 50% / 50%)
set.seed(1234)
train <- sample(c(TRUE, FALSE), nrow(Hitters), replace=TRUE) 
test <- (!train) 

# Training model 
# Find M_k for each model size using RSS/R^2 on the training data
g1 <- regsubsets(Salary ~ ., data=Hitters[train, ], nvmax=19) 
test.mat <- model.matrix(Salary~., data=Hitters[test, ]) 
val.errors <- rep(NA, 19)

# Calculating validation error 
for (i in 1:19) {
  coefi <- coef(g1, id=i) 
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[test]-pred)^2) 
}
val.errors
w <- which.min(val.errors) 

plot(1:19, val.errors, type="l", col="red",
     xlab="Number of Predictors", ylab="Validation Set Error")
points(1:19, val.errors, pch=19, col="blue")
points(w, val.errors[w], pch="x", col="blue", cex=2)
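
A common follow-up (not in the original code) is to refit best subset selection on the full data set and report the coefficients of the model size w chosen on the validation set:

# Refit on the full data and extract the coefficients of the chosen size w
g.full <- regsubsets(Salary ~ ., data=Hitters, nvmax=19)
coef(g.full, id=w)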

 

3.4 Find the best model based on K-fold Cross Validation

  • We can compute the K-fold CV error in two ways.
  • Option 1 : Build a K x (number of model sizes) matrix of per-fold MSEs, then average each column with apply(mat, 2, mean). (Used in this section.)
  • Option 2 : Build an n x (number of model sizes) matrix of per-observation squared errors, then average each column with apply(mat, 2, mean). (Used in Section 3.5.)

 

# Import dataset 
library(ISLR)
names(Hitters)

# Define a "predict" method for regsubsets objects
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])    # recover the formula used in the regsubsets() call
  mat <- model.matrix(form, newdata)      # design matrix for the new data
  coefi <- coef(object, id=id)            # coefficients of the size-id model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                  # predictions using only the selected variables
}

# K-fold split : assign each observation to one of K = 10 folds
set.seed(1234)
K <- 10 
folds <- sample(rep(1:K, length=nrow(Hitters))) 

# Initialize error matrix of every fold 
cv.errors <- matrix(NA, K, 19, dimnames=list(NULL, paste(1:19))) 

# Repeat the calculation for each of the K folds
for (k in 1:K) { 

  # Fit best subset selection on all folds except fold k
  # (find M_k for each model size using RSS/R^2 on the training folds)
  fit <- regsubsets(Salary ~ ., data=Hitters[folds!=k, ], nvmax=19) 

  # Calculating validation error 
  for (i in 1:19) {
    pred <- predict(fit, Hitters[folds==k, ], id=i) 
    cv.errors[k, i] <- mean((Hitters$Salary[folds==k]-pred)^2) 
  }
} 
K.ERR <- apply(cv.errors, 2, mean)   # average test MSE over the K folds, for each model size
K.ERR
ww <- which.min(K.ERR)

# Visualize the test error
plot(1:19, K.ERR, type="l", col="red",
     xlab="Number of Predictors", ylab="Cross-Validation Error")
points(1:19, K.ERR, pch=19, col="blue")
points(ww, K.ERR[ww], pch="x", col="blue", cex=2)

 

 

3.5 Find the best model based on the One-Standard-Error Rule

  • For each model size, consider the interval \((\bar{X} - \frac{\hat{\sigma}}{\sqrt{n}},\ \bar{X} + \frac{\hat{\sigma}}{\sqrt{n}})\), where \(\bar{X}\) is the mean CV error and \(\frac{\hat{\sigma}}{\sqrt{n}}\) its standard error.
  • The idea is to prefer a simpler model that performs roughly as well as the best one.
  • Choose the smallest model whose mean CV error is below the upper bound (mean plus one standard error) of the model with the lowest CV error.

 

# Assign each observation to one of the K folds (K = 10 from above)
set.seed(111)
n <- nrow(Hitters)
folds <- sample(rep(1:K, length=n))
CVR.1se <- matrix(NA, n, 19)   # per-observation squared errors for each model size

# Fit best subset selection on all folds except fold i
for (i in 1:K) {
  fit <- regsubsets(Salary~., Hitters[folds!=i, ], nvmax=19)
  # Squared errors on the held-out fold i, for each model size j
  for (j in 1:19) {
    pred <- predict(fit, Hitters[folds==i, ], id=j)
    CVR.1se[folds==i, j] <- (Hitters$Salary[folds==i]-pred)^2
  }
}

# Mean CV error and its standard error for each model size
avg <- apply(CVR.1se, 2, mean)
se <- apply(CVR.1se, 2, sd)/sqrt(n)
PE <- cbind(avg - se, avg, avg + se)   # lower bound, mean, upper bound

# Visualize test error 
matplot(1:19, PE, type="b", col=c(1,2,1), lty=c(3,1,3), pch=20,
        xlab="Number of Predictors", ylab="Cross-Validation Error")
points(which.min(avg), PE[which.min(avg),2],
       pch="o",col="blue",cex=2)
# Smallest model whose mean CV error is below the upper bound of the best model
up <- which(PE[,2] < PE[which.min(PE[,2]),3])
points(min(up), PE[min(up),2], pch="x", col="blue", cex=2)
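
As a small follow-up (not in the original code), the two selected model sizes can be printed side by side:

# Model sizes chosen by the minimum-CV rule and by the one-standard-error rule
c(min.cv = which.min(avg), one.se = min(up))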

 


 
