1. Regularization Methods : Elastic-Net
- Lasso penalty function (\(\ell_1\)-norm)
- If \(p > n\), the lasso selects at most \(n\) variables.
- Among highly correlated variables, the lasso tends to arbitrarily pick only one and ignore the others.
- Ridge penalty function (\(\ell_2\)-norm)
- Cannot perform variable selection: coefficients are shrunk toward zero but never set exactly to zero.
- Shrinks the coefficients of correlated features toward each other.
- Elastic-net regularization
- Combines the lasso and ridge penalties
- \(p_{\lambda}(\beta) = \lambda \alpha \sum_{j=1}^p |\beta_j| + \lambda(1 - \alpha)\sum_{j=1}^p \beta_j^2\)
- \(\alpha\) : mixing proportion
- Lasso if \(\alpha = 1\)
- Ridge if \(\alpha = 0\)
- Minimize the penalized negative log-likelihood with the tuning parameters \(\alpha\) and \(\lambda\) held fixed (in practice they are chosen by cross-validation).
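As a quick numerical check of the penalty formula above (the helper function name is illustrative, not part of any package):

```r
# Illustrative helper: the elastic-net penalty p_lambda(beta) defined above
enet_penalty <- function(beta, lambda, alpha) {
  lambda * alpha * sum(abs(beta)) + lambda * (1 - alpha) * sum(beta^2)
}
b <- c(1, -2)
enet_penalty(b, lambda = 0.5, alpha = 1)    # lasso:   0.5 * (|1| + |-2|) = 1.5
enet_penalty(b, lambda = 0.5, alpha = 0)    # ridge:   0.5 * (1 + 4) = 2.5
enet_penalty(b, lambda = 0.5, alpha = 0.5)  # mixture: 0.75 + 1.25 = 2
```

Note that glmnet itself parameterizes the ridge term with an extra factor of 1/2, i.e. \(\lambda[\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2/2]\), so its \(\lambda\) values are not directly comparable to this formula.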
2. [Ex] Golub Data
2.1 Prepare Dataset Golub
# 1. Import packages and the dataset
# Install the Bioconductor package 'golubEsets'
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("golubEsets")
library(golubEsets)
library(glmnet)
data(Golub_Merge)
# Golub data
?Golub_Merge
dim(Golub_Merge)
varLabels(Golub_Merge)
Golub_Merge$ALL.AML
table(Golub_Merge$ALL.AML)
Golub_Merge$Gender
# Expression data
dim(exprs(Golub_Merge))
head(exprs(Golub_Merge))
tail(exprs(Golub_Merge))
# Transpose the expression matrix so the predictor x is n by p
x <- t(as.matrix(exprs(Golub_Merge)))
y <- Golub_Merge$ALL.AML
dim(x)
sum(is.na(x))
2.2 Training models : Ridge and Lasso
# Model1 : Ridge regression
g0 <- glmnet(x, y, alpha=0, family="binomial")
par(mfrow=c(1,2))
plot(g0, "lambda")
plot(g0, "norm")
# Model2 : Lasso regression
g1 <- glmnet(x, y, alpha=1, family="binomial")
par(mfrow=c(1,2))
plot(g1, "lambda")
plot(g1, "norm")
# Lasso with 5-fold cross-validation
set.seed(12345)
gvc <- cv.glmnet(x, y, alpha=1, family="binomial", nfolds=5)
gvc$lambda.min
gvc$lambda.1se
plot(gvc)
fit1 <- coef(gvc, s="lambda.min")
fit2 <- coef(gvc, s="lambda.1se")
c(sum(as.matrix(fit1)!=0), sum(as.matrix(fit2)!=0))
w1 <- which(as.matrix(fit1)!=0)
w2 <- which(as.matrix(fit2)!=0)
data.frame(name=colnames(x)[w1], beta=fit1[w1])
data.frame(name=colnames(x)[w2], beta=fit2[w2])
- In Ridge regression, all variables are retained in the fitted model.
- In Lasso regression, only a subset of the variables is selected.
- With cross-validated Lasso, we can select an optimized number of variables via lambda.min or lambda.1se.
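The contrast above is easy to verify on simulated data, where we know which predictors matter (the setup below is illustrative, not part of the Golub analysis):

```r
library(glmnet)
set.seed(1)
# Simulated data: 100 observations, 10 predictors, only the first 3 informative
X <- matrix(rnorm(100 * 10), 100, 10)
beta <- c(2, -2, 1.5, rep(0, 7))
yy <- rbinom(100, 1, plogis(X %*% beta))
ridge <- glmnet(X, yy, alpha = 0, family = "binomial")
lasso <- glmnet(X, yy, alpha = 1, family = "binomial")
# At a moderate lambda, ridge keeps every coefficient nonzero,
# while the lasso sets many of them exactly to zero
sum(coef(ridge, s = 0.1)[-1] != 0)  # 10 (ridge never zeroes a coefficient)
sum(coef(lasso, s = 0.1)[-1] != 0)  # usually much smaller
```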
2.3 Training model : Elastic-Net
# Elastic-Net regression
ge <- glmnet(x, y, alpha=0.5, family="binomial")
ge$df
ge$lambda
plot(ge, "lambda")
# Elastic-Net regression with K-fold cross-validation
set.seed(111)
gecv <- cv.glmnet(x, y, alpha=0.5, family="binomial", nfolds=5)
plot(gecv)
fit3 <- coef(gecv, s="lambda.min")
fit4 <- coef(gecv, s="lambda.1se")
c(sum(as.matrix(fit3)!=0), sum(as.matrix(fit4)!=0))
w3 <- which(as.matrix(fit3)!=0)
w4 <- which(as.matrix(fit4)!=0)
data.frame(name=colnames(x)[w3], beta=fit3[w3])
data.frame(name=colnames(x)[w4], beta=fit4[w4])
- In Elastic-Net regression, the number of selected variables increases from 27 to 91.
2.4 Calculate Test Error among Models based on Classification Error
# Separate training and test sets
Golub_Merge$Samples
tran <- Golub_Merge$Samples < 39
test <- !tran
c(sum(tran), sum(test))
# Calculate Test error from Lasso
set.seed(123)
# cv.glmnet takes nfolds (not K) and has no subset argument,
# so fit on the training rows directly
g1 <- cv.glmnet(x[tran,], y[tran], alpha=1, family="binomial", nfolds=5)
fit1 <- coef(g1, s="lambda.min")
fit2 <- coef(g1, s="lambda.1se")
c(sum(as.matrix(fit1)!=0), sum(as.matrix(fit2)!=0))
p1 <- predict(g1, x[test,], s="lambda.min", type="class")
p2 <- predict(g1, x[test,], s="lambda.1se", type="class")
# Confusion matrices for the two models
table(p1, y[test])
table(p2, y[test])
# Calculate Test error from Elastic-Net
set.seed(1234)
g2 <- cv.glmnet(x[tran,], y[tran], alpha=0.5, family="binomial", nfolds=5)
fit3 <- coef(g2, s="lambda.min")
fit4 <- coef(g2, s="lambda.1se")
c(sum(as.matrix(fit3)!=0), sum(as.matrix(fit4)!=0))
p3 <- predict(g2, x[test,], s="lambda.min", type="class")
p4 <- predict(g2, x[test,], s="lambda.1se", type="class")
# Confusion matrices for the two models
table(p3, y[test])
table(p4, y[test])
- In model 1 (p1), there are no misclassified predictions.
- In model 2 (p2), there is only one misclassified prediction.
- With Elastic-Net regression, the number of selected variables is greater than with the lasso.
3. 50 Simulation Replications
set.seed(111)
K <- 50
ERR <- DF <- array(0, c(K, 4))
for (i in 1:K) {
# Train-Test Split
tran <- as.logical(rep(0, 72))
tran[sample(1:72, 38)] <- TRUE
test <- !tran
# Training model : lasso vs elastic-net
g1 <- cv.glmnet(x[tran,], y[tran], alpha=1, family="binomial", nfolds=5)
g2 <- cv.glmnet(x[tran,], y[tran], alpha=0.5, family="binomial", nfolds=5)
# Predict class labels on the test set
p1 <- predict(g1, x[test,], s="lambda.min", type="class")
p2 <- predict(g1, x[test,], s="lambda.1se", type="class")
p3 <- predict(g2, x[test,], s="lambda.min", type="class")
p4 <- predict(g2, x[test,], s="lambda.1se", type="class")
# Degrees of freedom (the number of selected variables)
DF[i, 1] <- sum(coef(g1, s="lambda.min")!=0)
DF[i, 2] <- sum(coef(g1, s="lambda.1se")!=0)
DF[i, 3] <- sum(coef(g2, s="lambda.min")!=0)
DF[i, 4] <- sum(coef(g2, s="lambda.1se")!=0)
# Calculate misclassification error
ERR[i, 1] <- sum(p1!=y[test])/sum(test)
ERR[i, 2] <- sum(p2!=y[test])/sum(test)
ERR[i, 3] <- sum(p3!=y[test])/sum(test)
ERR[i, 4] <- sum(p4!=y[test])/sum(test)
}
apply(ERR, 2, mean)
DF
apply(DF, 2, var)
Across the 50 replications, the number of selected variables decreases in the order: elastic-net + lambda.min, elastic-net + lambda.1se, lasso + lambda.min, lasso + lambda.1se.