
[R] Cross Validation

1. What is Cross Validation?

  • In the real world, we usually can't obtain separate test data to compute \(MSE_{test}\).
  • So we divide the training data into a training set and a test set.
  • Test-set error estimation
    • Mathematical adjustment: \(C_p\), \(AIC\), \(BIC\), Adjusted \(R^2\) (a small sketch follows this list)
    • Hold out: holding out a subset of the training set
      • Validation set approach 
      • K-fold Cross Validation 
      • LOOCV, LpOCV
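
The adjustment criteria above can be read directly off a fitted lm object in base R. A minimal sketch, assuming the same ISLR Auto data used later in this post:

# Adjustment criteria for a fitted linear model (sketch, ISLR's Auto data)
library(ISLR)
data(Auto)
g <- lm(mpg ~ horsepower, data=Auto)
AIC(g)                      # Akaike Information Criterion
BIC(g)                      # Bayesian Information Criterion
summary(g)$adj.r.squared    # Adjusted R^2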

 

2. Validation Set Approach

  • Divide the training set into two parts: a training set and a test set.
  • Regression problem: MSE
  • Classification problem: misclassification rate (a small sketch follows this list)
  • The validation set shouldn't take part in training the statistical model.
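
For the classification case, the misclassification rate is simply the fraction of held-out observations whose predicted label differs from the true one. A minimal sketch with hypothetical label vectors pred and truth:

# Misclassification rate on a held-out set (pred and truth are hypothetical)
pred  <- c("yes", "no", "yes", "no")    # predicted labels
truth <- c("yes", "yes", "yes", "no")   # true labels
mean(pred != truth)                     # misclassification rate = 0.25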

 

 

# Dataset Preparation 
library(ISLR) 
data(Auto) 
str(Auto) 
summary(Auto) 

# Extract target 
mpg <- Auto$mpg
horsepower <- Auto$horsepower

# Set candidate polynomial degrees and the plotting order
dg <- 1:9
u <- order(horsepower)

# Plot the polynomial fit for each degree
par(mfrow=c(3,3))
for (k in 1:length(dg)) {
    g <- lm(mpg ~ poly(horsepower, dg[k]))
    plot(mpg ~ horsepower, col=2, pch=20, xlab="Horsepower",
         ylab="mpg", main=paste("dg =", dg[k]))
    lines(horsepower[u], g$fit[u], col="darkblue", lwd=3)
}

 

 

# Single Split 
set.seed(1)
n <- nrow(Auto)

# Sample training indices (half of the data)
tran <- sample(n, n/2)
MSE <- NULL
# Fit each degree on the training half, evaluate MSE on the held-out half
for (k in 1:length(dg)) {
    g <- lm(mpg ~ poly(horsepower, dg[k]), subset=tran)
    MSE[k] <- mean((mpg - predict(g, Auto))[-tran]^2)
}

# Visualize the test-set MSE by polynomial degree
plot(dg, MSE, type="b", col=2, xlab="Degree of Polynomial",
     ylab="Mean Squared Error", ylim=c(15,30), lwd=2, pch=19)
abline(v=which.min(MSE), lty=2)

 

3. K-fold Cross Validation 

- K-fold cross-validation divides the data into \(K\) equal-sized parts. We leave out part \(k\), fit the model to the other \(K - 1\) parts, and then obtain predictions for the left-out \(k\)th part. 
- If we evaluate 10 models with 5-fold CV, then we need to consider \(5 \times 10\) cross-validation scores. 
- We compare the average of the \(K\) cross-validation scores across models. 
- \(CVE = \frac{1}{n}\sum_{k=1}^{K} n_k \, MSE_k\)  (a small sketch follows this list)
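
In code, the CVE above is just a weighted mean of the per-fold MSEs. A minimal sketch with hypothetical per-fold results fold_mse (\(MSE_k\)) and fold sizes fold_n (\(n_k\)):

# Weighted CVE from per-fold MSEs (fold_mse and fold_n are hypothetical)
fold_mse <- c(21.1, 19.8, 23.4, 20.5, 22.0)   # MSE_k for a 5-fold split
fold_n   <- c(79, 79, 78, 78, 78)             # n_k, the fold sizes
CVE <- sum(fold_n * fold_mse) / sum(fold_n)   # (1/n) * sum(n_k * MSE_k)
CVE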

 

 

# 10-fold cross validation
K <- 10
# degrees 1:9
MSE <- matrix(0, n, length(dg))
## Alternative: MSE <- matrix(0, K, length(dg))

# Assign each data point to a fold
# e.g. [1, 3, 3, 5, 6, ..., 10] (length n)
set.seed(1234)
u <- sample(rep(seq(K), length.out=n))

# Model training 
""" 
 f1     f2 f3 f4 f5 ... f9 
MSE1
MSE2
...
MSE10 
"""
for (k in 1:K) {
    tran <- which(u != k)   # training indices: every fold except k
    test <- which(u == k)   # test indices: fold k
    for (i in 1:length(dg)) {
        g <- lm(mpg ~ poly(horsepower, i), subset=tran)
        MSE[test, i] <- (mpg - predict(g, Auto))[test]^2   # squared errors for fold k
        ## Alternative: MSE[k, i] <- mean((mpg - predict(g, Auto))[test]^2)
    }
}
CVE <- apply(MSE, 2, mean)   # column means = CV error for each degree

# Visualization
plot(dg, CVE, type="b", col="darkblue",
     xlab="Degree of Polynomial", ylab="Mean Squared Error",
     ylim=c(18,25), lwd=2, pch=19)
abline(v=which.min(CVE), lty=2)

 

  • The best polynomial degree is 2 when we care about inference on the population.
  • This is the elbow point of the CVE plot (a cv.glm cross-check is sketched below).
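
As a cross-check, a comparable 10-fold estimate per degree can be obtained with cv.glm() from the boot package; this is an alternative sketch, not the code used above:

# 10-fold CV per degree with boot::cv.glm (alternative sketch)
library(boot)
set.seed(1234)
cv10 <- rep(0, length(dg))
for (d in dg) {
    g <- glm(mpg ~ poly(horsepower, d), data=Auto)
    cv10[d] <- cv.glm(Auto, g, K=10)$delta[1]   # raw 10-fold CV estimate
}
which.min(cv10)   # degree with the lowest estimated test error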

 

4. LOOCV 

Setting \(K = n\) yields leave-one-out cross validation (LOOCV).
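
The same estimate can also be obtained without writing the double loop by hand: cv.glm() from the boot package defaults to \(K = n\), i.e. LOOCV. A minimal sketch for the degree-2 fit:

# LOOCV with boot::cv.glm (default K = n), alternative to the loop below
library(boot)
glm.fit <- glm(mpg ~ poly(horsepower, 2), data=Auto)
cv.glm(Auto, glm.fit)$delta[1]   # LOOCV estimate of the test MSE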

 

# Auto Data : LOOCV
# Set the polynomial degrees and the result matrix
n <- nrow(Auto)
dg <- 1:9
MSE <- matrix(0, n, length(dg))

# Leave out observation i, fit each degree, and record its squared error
for (i in 1:n) {
    for (k in 1:length(dg)) {
        g <- lm(mpg ~ poly(horsepower, k), subset=(1:n)[-i])
        MSE[i, k] <- mean((mpg - predict(g, Auto))[i]^2)
    }
}
# Calculate CVE 
aMSE <- apply(MSE, 2, mean)

# Visualization
par(mfrow=c(1, 2))
plot(dg, aMSE, type="b", col="darkblue",
     xlab="Degree of Polynomial", ylab="Mean Squared Error",
     ylim=c(18,25), lwd=2, pch=19)
abline(v=which.min(aMSE), lty=2)
