
[R] Best Subset Selection

1. Three Classes of Methods

  • To address the problem that variance grows as the number of features p increases (especially when p approaches or exceeds n), we need methods that effectively reduce the number of predictors. A minimal sketch of each approach follows this list.
  • Subset Selection : Identify a subset of the p predictors that we believe to be related to the response, and fit a model using only that subset.
  • Shrinkage : Fit a model involving all p predictors, but shrink the estimated coefficients towards zero relative to the OLS estimates.
  • Dimension Reduction : Project the p predictors into an M-dimensional subspace (M < p). PCA is an example.
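
As a rough sketch of what each class looks like in R (assuming the ISLR Hitters data used throughout this post; the glmnet and pls packages are not used elsewhere in this post and are only illustrative choices):

# Minimal sketch of the three classes on the Hitters data
# (assumes the leaps, glmnet, and pls packages are installed)
library(ISLR); library(leaps); library(glmnet); library(pls)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]   # predictor matrix without intercept
y <- Hitters$Salary

# 1) Subset selection : search over subsets of predictors
subset.fit <- regsubsets(Salary ~ ., data=Hitters, nvmax=19)

# 2) Shrinkage : keep all predictors, shrink coefficients (ridge, alpha = 0)
ridge.fit <- glmnet(x, y, alpha=0)

# 3) Dimension reduction : principal components regression
pcr.fit <- pcr(Salary ~ ., data=Hitters, scale=TRUE)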

 

2. Best Subset Selection

  • The most direct approach is called all subsets or best subsets regression.
  • Fit a separate model for every possible subset of the predictors, then choose among them using a criterion that balances training error against model size.

 

Let's consider building a best subset model with \(p = 10\) features. First, we fit the null model \(M_0\), which contains no predictors. Next, for \(k = 1, 2, \ldots, p = 10\), repeat the following steps:

 

  1. Fit all \(\binom{p}{k} = \binom{10}{k}\) models that contain exactly \(k\) predictors. (All of these models have the same number of predictors; a manual sketch of this enumeration for a small p is shown after the list.)
  2. Pick the best among these \(\binom{p}{k}\) models and call it \(M_k\). Here "best" means having the smallest \(RSS\), or equivalently the largest \(R^2\). (\(RSS\) and \(R^2\) are only used to compare models of the same size.)
  3. Select a single best model from among \(M_0, M_1, \ldots, M_p\) using CVE (cross-validation error), \(C_p\), \(AIC\), \(BIC\), or adjusted \(R^2\).
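
As a rough illustration (not part of the original code), steps 1-2 can be written out by hand for a tiny set of predictors; the three predictor names below are arbitrary choices from Hitters, and regsubsets() in the next block performs this search efficiently for all predictors.

# Manual sketch of steps 1-2 for a tiny p (illustration only)
library(ISLR)
Hitters <- na.omit(Hitters)
preds <- c("AtBat", "Hits", "Walks")              # arbitrary small predictor set
sapply(seq_along(preds), function(k) {
  combos <- combn(preds, k, simplify=FALSE)       # all C(p, k) subsets of size k
  rss <- sapply(combos, function(v) {
    f <- reformulate(v, response="Salary")        # e.g. Salary ~ AtBat + Hits
    sum(resid(lm(f, data=Hitters))^2)             # RSS of this subset
  })
  min(rss)                                        # RSS of the best size-k model M_k
})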

 

# Importing Library
library(ISLR)
names(Hitters)
dim(Hitters)
sum(is.na(Hitters$Salary))
Hitters <- na.omit(Hitters)
dim(Hitters)
sum(is.na(Hitters))

# Modeling best subset selection (by default, regsubsets() considers models of up to 8 predictors)
library(leaps)
fit <- regsubsets(Salary ~ ., Hitters)

summary(fit)
sg <- summary(fit)
names(sg)

dim(sg$which)
sg$which
plot(fit)
plot(fit, scale="Cp")
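
As a quick follow-up (not in the original code), the coefficients of any of the selected models can be pulled out with coef(); the size 6 below is just an arbitrary example.

# Coefficients of, e.g., the best 6-predictor model found by regsubsets()
coef(fit, id=6)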

 

 

 

3. Choosing the Optimal Model

  • \(RSS\) and \(R^2\) are only used to choose the best model among the \(\binom{p}{k}\) models of a given size.
  • To choose a model with low test error across different sizes, there are two approaches (a small numeric sketch of \(C_p\) follows this list):
    • Indirectly estimate test error : \(C_p\), \(AIC\), \(BIC\), and adjusted \(R^2\)
    • Directly estimate test error : \(CVE\) (cross-validation error)
  • Mallows' \(C_p\) : \(C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)\). It trades off \(RSS\) against \(2d\hat{\sigma}^2\), penalizing the number of predictors \(d\).
    - \(AIC\) : \(AIC = -2\log(L) + 2d\)
    - \(BIC\) : \(BIC = \frac{1}{n}(RSS + \log(n)\,d\hat{\sigma}^2)\). Its penalty grows with the sample size; since \(\log(n) > 2\) for \(n \geq 8\), BIC penalizes model size more heavily than \(C_p\).
    - Adjusted \(R^2\) : replaces \(RSS\) by \(\frac{RSS}{n-d-1}\) in the \(R^2\) formula, where \(d\) is the number of predictors and \(n\) is the sample size.
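
As a rough numeric sketch (assuming, as is common, that \(\hat{\sigma}^2\) is estimated from the full least-squares model; the three-predictor subset below is arbitrary), \(C_p\) can be computed by hand for a single candidate model:

# Hand computation of Mallows' C_p for one candidate subset (illustration only)
full <- lm(Salary ~ ., data=Hitters)
sigma2.hat <- summary(full)$sigma^2              # sigma^2 estimated from the full model
sub <- lm(Salary ~ AtBat + Hits + Walks, data=Hitters)
rss <- sum(resid(sub)^2)
d <- length(coef(sub)) - 1                       # number of predictors (excluding intercept)
n <- nrow(Hitters)
(rss + 2 * d * sigma2.hat) / n                   # C_p = (RSS + 2*d*sigma^2) / n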

 

3.1 Find the best model among models of the same size (RSS, \(R^2\))

# Find the best model of each size (nbest=10 keeps the 10 best models per size)
big <- regsubsets(Salary ~ ., data=Hitters, nvmax=19, nbest=10)
sg <- summary(big)
dim(sg$which)
sg.size <- as.numeric(rownames(sg$which))   # model size of each row of sg$which
table(sg.size)

sg.rss <- tapply(sg$rss, sg.size, min)
w1 <- which.min(sg.rss)
sg.rsq <- tapply(sg$rsq, sg.size, max)
w2 <- which.max(sg.rsq)

# Visualize plot of RSS and R^2 
par(mfrow=c(1,2))
plot(1:19, sg.rss, type="b", xlab="Number of Predictors",
     ylab="Residual Sum of Squares", col=2, pch=19)
points(w1, sg.rss[w1], pch="x", col="blue", cex=2)
plot(1:19, sg.rsq, type="b", xlab="Number of Predictors",
     ylab=expression(R^2), col=2, pch=19)
points(w2, sg.rsq[w2], pch="x", col="blue", cex=2)

 

 

3.2 Find the best model across different model sizes (\(C_p\), BIC, Adjusted \(R^2\))

# Find the best model based on different model size 
sg.cp <- tapply(sg$cp, sg.size, min)
w3 <- which.min(sg.cp)
sg.bic <- tapply(sg$bic, sg.size, min)
w4 <- which.min(sg.bic)
sg.adjr2 <- tapply(sg$adjr2, sg.size, max)
w5 <- which.max(sg.adjr2)

# Visualize results of C_p, BIC, adjusted R^2
par(mfrow=c(1,3))
plot(1:19, sg.cp, type="b", xlab ="Number of Predictors",
     ylab=expression(C[p]), col=2, pch=19)
points(w3, sg.cp[w3], pch="x", col="blue", cex=2)
plot(1:19, sg.bic, type="b", xlab ="Number of Predictors",
     ylab="Bayesian information criterion", col=2, pch=19)
points(w4, sg.bic[w4], pch="x", col="blue", cex=2)
plot(1:19, sg.adjr2, type="b", xlab ="Number of Predictors",
     ylab=expression(paste("Adjusted ", R^2)), col=2, pch=19)
points(w5, sg.adjr2[w5], pch="x", col="blue", cex=2)

 

 

3.3 Find the best model based on Validation Set

  • \(RSS\) and \(R^2\) compare models with the same number of predictors : choosing each \(M_k\)
  • \(C_p\), \(AIC\), \(BIC\), and adjusted \(R^2\) compare models with different numbers of predictors
  • To choose among the \(M_k\) models directly, we can split the observations into a training set and a validation set. (The validation observations must not be used to fit the models.)

 

library(ISLR)
library(leaps)
names(Hitters)
Hitters <- na.omit(Hitters)

# Train-validation split : random TRUE/FALSE assignment (roughly 50% / 50%)
set.seed(1234)
train <- sample(c(TRUE, FALSE), nrow(Hitters), replace=TRUE) 
test <- (!train) 

# Training model 
# Find M_k for each model size using RSS/R^2 on the training data
g1 <- regsubsets(Salary ~ ., data=Hitters[train, ], nvmax=19) 
test.mat <- model.matrix(Salary~., data=Hitters[test, ]) 
val.errors <- rep(NA, 19)

# Calculating validation error 
for (i in 1:19) {
  coefi <- coef(g1, id=i) 
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[test]-pred)^2) 
}
val.errors
w <- which.min(val.errors) 

plot(1:19, val.errors, type="l", col="red",
     xlab="Number of Predictors", ylab="Validation Set Error")
points(1:19, val.errors, pch=19, col="blue")
points(w, val.errors[w], pch="x", col="blue", cex=2)
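
A common follow-up (not in the original code) is to refit best subset selection on the full data set and report the coefficients of the model size w chosen on the validation set:

# Refit on the full data and extract the coefficients of the chosen size w
g.full <- regsubsets(Salary ~ ., data=Hitters, nvmax=19)
coef(g.full, id=w)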

 

3.4 Find the best model based on K-fold Cross Validation

  • We can compute the K-fold CV error in two ways.
  • Option 1 : Build a K x (number of model sizes) matrix of per-fold MSEs, then average each column with apply(mat, 2, mean). (Used in this section.)
  • Option 2 : Build an n x (number of model sizes) matrix of per-observation squared errors, then average each column with apply(mat, 2, mean). (Used in Section 3.5.)

 

# Import dataset 
library(ISLR)
names(Hitters)

# Define a "predict" method for regsubsets objects
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])    # recover the formula used in the regsubsets() call
  mat <- model.matrix(form, newdata)      # design matrix for the new data
  coefi <- coef(object, id=id)            # coefficients of the size-id model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                  # predictions using only the selected variables
}

# K-fold split : assign each observation to one of K = 10 folds
set.seed(1234)
K <- 10 
folds <- sample(rep(1:K, length=nrow(Hitters))) 

# Initialize error matrix of every fold 
cv.errors <- matrix(NA, K, 19, dimnames=list(NULL, paste(1:19))) 

# Repeat the calculation for each of the K folds
for (k in 1:K) { 

  # Fit best subset selection on all folds except fold k
  # (find M_k for each model size using RSS/R^2 on the training folds)
  fit <- regsubsets(Salary ~ ., data=Hitters[folds!=k, ], nvmax=19) 

  # Calculating validation error 
  for (i in 1:19) {
    pred <- predict(fit, Hitters[folds==k, ], id=i) 
    cv.errors[k, i] <- mean((Hitters$Salary[folds==k]-pred)^2) 
  }
} 
K.ERR <- apply(cv.errors, 2, mean)   # average test MSE over the K folds, for each model size
K.ERR
ww <- which.min(K.ERR)

# Visualize the test error
plot(1:19, K.ERR, type="l", col="red",
     xlab="Number of Predictors", ylab="Cross-Validation Error")
points(1:19, K.ERR, pch=19, col="blue")
points(ww, K.ERR[ww], pch="x", col="blue", cex=2)

 

 

3.5 Find the best model based on the One-Standard-Error Rule

  • For each model size, consider the interval \((\bar{X} - \frac{\hat{\sigma}}{\sqrt{n}},\ \bar{X} + \frac{\hat{\sigma}}{\sqrt{n}})\), where \(\bar{X}\) is the mean CV error and \(\frac{\hat{\sigma}}{\sqrt{n}}\) its standard error.
  • The idea is to prefer a simpler model that performs roughly as well as the best one.
  • Choose the smallest model whose mean CV error is below the upper bound (mean plus one standard error) of the model with the lowest CV error.

 

# Assign each observation to one of the K folds (K = 10 from above)
set.seed(111)
n <- nrow(Hitters)
folds <- sample(rep(1:K, length=n))
CVR.1se <- matrix(NA, n, 19)   # per-observation squared errors for each model size

# Fit best subset selection on all folds except fold i
for (i in 1:K) {
  fit <- regsubsets(Salary~., Hitters[folds!=i, ], nvmax=19)
  # Squared errors on the held-out fold i, for each model size j
  for (j in 1:19) {
    pred <- predict(fit, Hitters[folds==i, ], id=j)
    CVR.1se[folds==i, j] <- (Hitters$Salary[folds==i]-pred)^2
  }
}

# Mean CV error and its standard error for each model size
avg <- apply(CVR.1se, 2, mean)
se <- apply(CVR.1se, 2, sd)/sqrt(n)
PE <- cbind(avg - se, avg, avg + se)   # lower bound, mean, upper bound

# Visualize test error 
matplot(1:19, PE, type="b", col=c(1,2,1), lty=c(3,1,3), pch=20,
        xlab="Number of Predictors", ylab="Cross-Validation Error")
points(which.min(avg), PE[which.min(avg),2],
       pch="o",col="blue",cex=2)
# Smallest model whose mean CV error is below the upper bound of the best model
up <- which(PE[,2] < PE[which.min(PE[,2]),3])
points(min(up), PE[min(up),2], pch="x", col="blue", cex=2)
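
As a small follow-up (not in the original code), the two selected model sizes can be printed side by side:

# Model sizes chosen by the minimum-CV rule and by the one-standard-error rule
c(min.cv = which.min(avg), one.se = min(up))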

 


 
