
[R] Variable Selection Methods : Lasso

1. Lasso Regression

  • Ridge regression has the disadvantage of keeping all p predictors in the final model.
  • What we want instead is variable selection.
  • The lasso shrinks \(\hat{\beta}\) toward zero and can set some coefficients exactly to zero.
  • It minimizes \(RSS + \lambda\sum_{j=1}^{p}|\beta_j|\) (written out in full below).
  • As \(\lambda\) decreases from \(\lambda_1\) (largest) to \(\lambda_m\) (smallest), the \(l_1\)-norm of \(\hat{\beta}\) grows and the degrees of freedom increases: \(df(\hat{\beta}_{\lambda_1}) = 0 \le \dots \le df(\hat{\beta}_{\lambda_m}) \le p\)
    • \(df\) is the number of variables with nonzero coefficients, i.e. the variables that actually take part in the fitted model.
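
Written out, the lasso solves the following penalized least-squares problem (the standard textbook formulation; the first term is the RSS referenced in the bullets above):

\[
\hat{\beta}^{lasso} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|
\]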

 

library(ISLR)
library(leaps) 
library(glmnet)

data(Hitters) 
Hitters <- na.omit(Hitters)

# model.matrix converts categorical predictors into numeric dummy variables;
# [, -1] drops the intercept column, since glmnet fits its own intercept.
x <- model.matrix(Salary~., Hitters)[, -1] 
y <- Hitters$Salary

# alpha = 1 fits the lasso (alpha = 0 would fit ridge).
lasso.mod <- glmnet(x, y, alpha=1) 
# glmnet tries up to 100 lambda values by default, but stops the path early
# once the fit stops improving; this model ends up with 80 lambdas.
dim(coef(lasso.mod)) 
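
The coefficient matrix has one row per coefficient (the intercept plus 19 predictors) and one column per lambda. As a small sketch not in the original code, a single column can be pulled out with coef() and an explicit s value:

# Sketch: the sparse coefficient vector at one lambda on the fitted path.
coef(lasso.mod, s=lasso.mod$lambda[50])   # many entries are exactly zero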

# Pair each lambda with its degrees of freedom
las <- cbind(lasso.mod$lambda, lasso.mod$df) 
colnames(las) <- c("lambda", "df") 
las

# Find the betas
# In each column, the number of nonzero betas equals df for that lambda.
dim(lasso.mod$beta)
apply(lasso.mod$beta, 2, function(t) sum(t!=0))
apply(lasso.mod$beta, 2, function(t) sum(abs(t)))
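
As a quick consistency check (a sketch added here, not part of the original code), glmnet stores the same counts in the df component of the fit:

# The stored df should equal the per-lambda count of nonzero betas.
all(lasso.mod$df == apply(lasso.mod$beta, 2, function(t) sum(t!=0)))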

# Compute the l1-norm of the coefficient vector at each lambda
l1.norm <- apply(lasso.mod$beta, 2, function(t) sum(abs(t)))
x.axis <- cbind(log(lasso.mod$lambda), l1.norm)
colnames(x.axis) <- c("log.lambda", "L1.norm")
x.axis

# Plot coefficient paths against log(lambda) and against the l1-norm;
# the numbers along the top axis give the df at that point.
par(mfrow=c(1,2))
plot(lasso.mod, "lambda", label=TRUE)
plot(lasso.mod, "norm", label=TRUE)

 

 

2. Comparison between Lasso and Ridge

2.1 The Bias-Variance Tradeoff

  • Reading the plot with the \(\lambda\) axis increasing to the right, each pair below contrasts the small-\(\lambda\) side with the large-\(\lambda\) side (see the sketch after this list):
  • Overfitting vs Underfitting
  • (Low bias + High variance) vs (High bias + Low variance)
  • (\(l_1\)-norm, \(l_2\)-norm increase) vs (\(l_1\)-norm, \(l_2\)-norm decrease)
  • \(\lambda\) decreases vs \(\lambda\) increases
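
A minimal sketch of this tradeoff on the Hitters data (the seed and the 50/50 train/test split here are my own assumptions, not from the post): test error typically falls and then rises again as \(\lambda\) grows.

# Sketch: estimate test MSE across the lambda path.
set.seed(1)                                   # assumed seed, for reproducibility
train <- sample(seq_len(nrow(x)), nrow(x)/2)  # assumed 50/50 split
fit <- glmnet(x[train, ], y[train], alpha=1)
pred <- predict(fit, newx=x[-train, ])        # one column of predictions per lambda
test.mse <- colMeans((y[-train] - pred)^2)
plot(log(fit$lambda), test.mse, type="l", xlab="log.lambda", ylab="test MSE")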

 

2.2 Which is better?

  • If a large number of coefficients are nonzero, ridge tends to perform better.
  • If only a small number of coefficients are nonzero, the lasso tends to perform better.
  • In high-dimensional data where a sparse model is assumed, the lasso performs better, as the sketch below suggests.
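
A small simulation sketch (entirely illustrative; n, p, the signal strength, and the seed are all assumptions) shows the pattern: with a sparse truth, the lasso usually attains a lower cross-validated error than ridge.

# Sketch: sparse truth, 5 real signals out of 50 predictors.
set.seed(42)                                  # assumed seed
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
b <- c(rep(2, 5), rep(0, p - 5))              # only the first 5 coefficients are nonzero
yy <- drop(X %*% b + rnorm(n))
cv.lasso <- cv.glmnet(X, yy, alpha=1)
cv.ridge <- cv.glmnet(X, yy, alpha=0)
min(cv.lasso$cvm)                             # CV mean squared error, lasso
min(cv.ridge$cvm)                             # CV mean squared error, ridge (typically larger here)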
