1. Regularization Methods : Elastic-Net
- Lasso penalty function (\(\ell_1\)-norm)
- If \(p > n\), the lasso selects at most \(n\) variables.
- Among highly correlated variables, the lasso tends to arbitrarily pick only one and ignore the others.
- Ridge penalty function (\(\ell_2\)-norm)
- Cannot perform variable selection: coefficients are shrunk toward zero but never set exactly to zero.
- Shrinks the coefficients of correlated features toward each other.
- Elastic-net regularization
- Combines the lasso and ridge penalties
- \(p_{\lambda}(\beta) = \lambda \alpha \sum_{j=1}^p |\beta_j| + \lambda(1 - \alpha)\sum_{j=1}^p \beta_j^2\)
- \(\alpha\) : mixing proportion
- Lasso if \(\alpha = 1\)
- Ridge if \(\alpha = 0\)
- Minimize the penalized negative log-likelihood with the tuning parameters \(\alpha\) and \(\lambda\) held fixed (in practice they are chosen by cross-validation).
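As a quick numerical check of the penalty formula above (the helper function name is illustrative, not part of any package):

```r
# Illustrative helper: the elastic-net penalty p_lambda(beta) defined above
enet_penalty <- function(beta, lambda, alpha) {
  lambda * alpha * sum(abs(beta)) + lambda * (1 - alpha) * sum(beta^2)
}
b <- c(1, -2)
enet_penalty(b, lambda = 0.5, alpha = 1)    # lasso:   0.5 * (|1| + |-2|) = 1.5
enet_penalty(b, lambda = 0.5, alpha = 0)    # ridge:   0.5 * (1 + 4) = 2.5
enet_penalty(b, lambda = 0.5, alpha = 0.5)  # mixture: 0.75 + 1.25 = 2
```

Note that glmnet itself parameterizes the ridge term with an extra factor of 1/2, i.e. \(\lambda[\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2/2]\), so its \(\lambda\) values are not directly comparable to this formula.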
2. [Ex] Golub Data
2.1 Prepare Dataset Golub
# 1. Import packages and the dataset
# Install the Bioconductor package 'golubEsets'
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("golubEsets")
library(golubEsets)
library(glmnet)
data(Golub_Merge)
# Golub data
?Golub_Merge
dim(Golub_Merge)
varLabels(Golub_Merge)
Golub_Merge$ALL.AML
table(Golub_Merge$ALL.AML)
Golub_Merge$Gender
# Expression data
dim(exprs(Golub_Merge))
head(exprs(Golub_Merge))
tail(exprs(Golub_Merge))
# Transpose the expression matrix so the predictor x is n by p
x <- t(as.matrix(exprs(Golub_Merge)))
y <- Golub_Merge$ALL.AML
dim(x)
sum(is.na(x))
2.2 Training models : Ridge and Lasso
# Model1 : Ridge regression
g0 <- glmnet(x, y, alpha=0, family="binomial")
par(mfrow=c(1,2))
plot(g0, "lambda")
plot(g0, "norm")
# Model2 : Lasso regression
g1 <- glmnet(x, y, alpha=1, family="binomial")
par(mfrow=c(1,2))
plot(g1, "lambda")
plot(g1, "norm")
# Lasso with 5-fold cross-validation
set.seed(12345)
gvc <- cv.glmnet(x, y, alpha=1, family="binomial", nfolds=5)
gvc$lambda.min
gvc$lambda.1se
plot(gvc)
fit1 <- coef(gvc, s="lambda.min")
fit2 <- coef(gvc, s="lambda.1se")
c(sum(as.matrix(fit1)!=0), sum(as.matrix(fit2)!=0))
w1 <- which(as.matrix(fit1)!=0)
w2 <- which(as.matrix(fit2)!=0)
data.frame(name=colnames(x)[w1], beta=fit1[w1])
data.frame(name=colnames(x)[w2], beta=fit2[w2])
- In Ridge regression, all variables are retained in the fitted model.
- In Lasso regression, only a subset of the variables is selected.
- With cross-validated Lasso, we can select an optimized number of variables via lambda.min or lambda.1se.
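The contrast above is easy to verify on simulated data, where we know which predictors matter (the setup below is illustrative, not part of the Golub analysis):

```r
library(glmnet)
set.seed(1)
# Simulated data: 100 observations, 10 predictors, only the first 3 informative
X <- matrix(rnorm(100 * 10), 100, 10)
beta <- c(2, -2, 1.5, rep(0, 7))
yy <- rbinom(100, 1, plogis(X %*% beta))
ridge <- glmnet(X, yy, alpha = 0, family = "binomial")
lasso <- glmnet(X, yy, alpha = 1, family = "binomial")
# At a moderate lambda, ridge keeps every coefficient nonzero,
# while the lasso sets many of them exactly to zero
sum(coef(ridge, s = 0.1)[-1] != 0)  # 10 (ridge never zeroes a coefficient)
sum(coef(lasso, s = 0.1)[-1] != 0)  # usually much smaller
```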
2.3 Training model : Elastic-Net
# Elastic-Net regression
ge <- glmnet(x, y, alpha=0.5, family="binomial")
ge$df
ge$lambda
plot(ge, "lambda")
# Elastic-Net regression with K-fold cross-validation
set.seed(111)
gecv <- cv.glmnet(x, y, alpha=0.5, family="binomial", nfolds=5)
plot(gecv)
fit3 <- coef(gecv, s="lambda.min")
fit4 <- coef(gecv, s="lambda.1se")
c(sum(as.matrix(fit3)!=0), sum(as.matrix(fit4)!=0))
w3 <- which(as.matrix(fit3)!=0)
w4 <- which(as.matrix(fit4)!=0)
data.frame(name=colnames(x)[w3], beta=fit3[w3])
data.frame(name=colnames(x)[w4], beta=fit4[w4])
- In Elastic-Net regression, the number of selected variables increases from 27 to 91.
2.4 Calculate Test Error among Models based on Classification Error
# Separate training and test sets
Golub_Merge$Samples
tran <- Golub_Merge$Samples < 39
test <- !tran
c(sum(tran), sum(test))
# Calculate Test error from Lasso
set.seed(123)
# cv.glmnet takes nfolds (not K) and has no subset argument,
# so fit on the training rows directly
g1 <- cv.glmnet(x[tran,], y[tran], alpha=1, family="binomial", nfolds=5)
fit1 <- coef(g1, s="lambda.min")
fit2 <- coef(g1, s="lambda.1se")
c(sum(as.matrix(fit1)!=0), sum(as.matrix(fit2)!=0))
p1 <- predict(g1, x[test,], s="lambda.min", type="class")
p2 <- predict(g1, x[test,], s="lambda.1se", type="class")
# Confusion matrices for the two models
table(p1, y[test])
table(p2, y[test])
# Calculate Test error from Elastic-Net
set.seed(1234)
g2 <- cv.glmnet(x[tran,], y[tran], alpha=0.5, family="binomial", nfolds=5)
fit3 <- coef(g2, s="lambda.min")
fit4 <- coef(g2, s="lambda.1se")
c(sum(as.matrix(fit3)!=0), sum(as.matrix(fit4)!=0))
p3 <- predict(g2, x[test,], s="lambda.min", type="class")
p4 <- predict(g2, x[test,], s="lambda.1se", type="class")
# Confusion matrices for the two models
table(p3, y[test])
table(p4, y[test])
- In model 1 (p1), there are no misclassified predictions.
- In model 2 (p2), there is only one misclassified prediction.
- With Elastic-Net regression, the number of selected variables is greater than with the lasso.
3. 50 Simulation Replications
set.seed(111)
K <- 50
ERR <- DF <- array(0, c(K, 4))
for (i in 1:K) {
# Train-Test Split
tran <- as.logical(rep(0, 72))
tran[sample(1:72, 38)] <- TRUE
test <- !tran
# Training model : lasso vs elastic-net
g1 <- cv.glmnet(x[tran,], y[tran], alpha=1, family="binomial", nfolds=5)
g2 <- cv.glmnet(x[tran,], y[tran], alpha=0.5, family="binomial", nfolds=5)
# Predict class labels on the test set
p1 <- predict(g1, x[test,], s="lambda.min", type="class")
p2 <- predict(g1, x[test,], s="lambda.1se", type="class")
p3 <- predict(g2, x[test,], s="lambda.min", type="class")
p4 <- predict(g2, x[test,], s="lambda.1se", type="class")
# Degrees of freedom (the number of selected variables)
DF[i, 1] <- sum(coef(g1, s="lambda.min")!=0)
DF[i, 2] <- sum(coef(g1, s="lambda.1se")!=0)
DF[i, 3] <- sum(coef(g2, s="lambda.min")!=0)
DF[i, 4] <- sum(coef(g2, s="lambda.1se")!=0)
# Calculate misclassification error
ERR[i, 1] <- sum(p1!=y[test])/sum(test)
ERR[i, 2] <- sum(p2!=y[test])/sum(test)
ERR[i, 3] <- sum(p3!=y[test])/sum(test)
ERR[i, 4] <- sum(p4!=y[test])/sum(test)
}
apply(ERR, 2, mean)
DF
apply(DF, 2, var)
Across the 50 replications, the number of selected variables decreases in the order: elastic-net + lambda.min, elastic-net + lambda.1se, lasso + lambda.min, lasso + lambda.1se.