1. Lasso Regression
- Ridge regression has the disadvantage of keeping all p predictors in the final model.
- What we want instead is variable selection.
- The lasso shrinks \(\hat{\beta}\) towards zero and can set some coefficients exactly to zero.
- The lasso minimizes \(RSS + \lambda\sum_{j=1}^{p}|\beta_j|\), where the penalty is the \(l_1\)-norm of \(\beta\).
- As \(\lambda\) decreases from \(\lambda_1\) to \(\lambda_m\), the degrees of freedom increase: \(df(\hat{\beta}_{\lambda_1}) = 0 < \dots < df(\hat{\beta}_{\lambda_m}) = m\)
- \(df\) is the number of variables with nonzero coefficients, i.e. the variables that actually take part in the fitted model.
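The claim that \(df\) grows as \(\lambda\) shrinks can be checked with a minimal base-R sketch. For an orthonormal design the lasso solution has a closed form, soft-thresholding, which sets any coefficient within \(\lambda\) of zero exactly to zero (the coefficient values below are made up for illustration, not from the Hitters data):

```r
# Lasso solution for an orthonormal design: soft-thresholding.
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

ols <- c(3.0, -1.5, 0.8, 0.2)   # hypothetical least-squares coefficients
lambdas <- c(5, 2, 1, 0.5, 0)
# df = number of nonzero coefficients at each lambda
df <- sapply(lambdas, function(l) sum(soft_threshold(ols, l) != 0))
cbind(lambda = lambdas, df = df)
# df grows from 0 (large lambda) to p = 4 (lambda = 0)
```

This is exactly the pattern the `glmnet` path below reproduces on real data.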
library(ISLR)
library(leaps)
library(glmnet)
data(Hitters)
Hitters <- na.omit(Hitters)
# model.matrix converts categorical predictors into numerical dummy variables
x <- model.matrix(Salary~., Hitters)[, -1]
y <- Hitters$Salary
# alpha = 1 fits the lasso (alpha = 0 fits ridge)
lasso.mod <- glmnet(x, y, alpha=1)
# glmnet uses 100 lambda values by default.
# However, it may stop the path early; here our model uses 80 lambdas.
dim(coef(lasso.mod))
# Build a table of lambda and degrees of freedom along the path
las <- cbind(lasso.mod$lambda, lasso.mod$df)
colnames(las) <- c("lambda", "df")
las
# Inspect the coefficients
# The number of nonzero betas at each lambda equals df.
dim(lasso.mod$beta)
apply(lasso.mod$beta, 2, function(t) sum(t!=0))
apply(lasso.mod$beta, 2, function(t) sum(abs(t)))
# Compute the l1-norm of the coefficients at each lambda
l1.norm <- apply(lasso.mod$beta, 2, function(t) sum(abs(t)))
x.axis <- cbind(log(lasso.mod$lambda), l1.norm)
colnames(x.axis) <- c("log.lambda", "L1.norm")
x.axis
# Plot the coefficient paths against log(lambda) and against the l1-norm
par(mfrow=c(1,2))
plot(lasso.mod, "lambda", label=TRUE)
plot(lasso.mod, "norm", label=TRUE)
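The path above fits every \(\lambda\) but does not pick one. A common next step (not in the original code; the seed and the choice of `lambda.min` are illustrative) is to select \(\lambda\) by cross-validation with `cv.glmnet`:

```r
library(ISLR)
library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

set.seed(1)                            # CV folds are random
cv.out <- cv.glmnet(x, y, alpha = 1)   # 10-fold CV by default
bestlam <- cv.out$lambda.min           # lambda with the smallest CV error
# Coefficients at the chosen lambda; some are exactly zero
lasso.coef <- predict(cv.out, type = "coefficients", s = bestlam)
```

`cv.out$lambda.1se` is a more conservative alternative that gives a sparser model within one standard error of the minimum.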
2. Comparison between Lasso and Ridge
2.1 The Bias-Variance Tradeoff
- As \(\lambda\) increases along the axis to the right, the fit moves from the left side to the right side of each pair below:
- Overfitting vs underfitting
- (Low bias + high variance) vs (high bias + low variance)
- (\(l_1\)-norm, \(l_2\)-norm large) vs (\(l_1\)-norm, \(l_2\)-norm small)
- Small \(\lambda\) vs large \(\lambda\)
2.2 Which is better?
- If the number of nonzero coefficients is large (many predictors matter, with roughly similar sizes), ridge tends to perform better.
- If the number of nonzero coefficients is small (only a few predictors have substantial coefficients), the lasso tends to perform better.
- In high-dimensional data where a sparse model is assumed, the lasso performs better.
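The contrast is easiest to see under an orthonormal design, where both estimators have closed forms: ridge scales every OLS coefficient by \(1/(1+\lambda)\) and never produces exact zeros, while the lasso soft-thresholds and drops the small ones. A base-R sketch (the coefficient values are made up to represent a sparse truth):

```r
ols <- c(5, 4, 0.3, 0.2, 0.1)   # sparse truth: two big, three tiny coefficients
lambda <- 1

ridge <- ols / (1 + lambda)                      # proportional shrinkage
lasso <- sign(ols) * pmax(abs(ols) - lambda, 0)  # soft-thresholding

sum(ridge != 0)  # 5 : ridge keeps every predictor
sum(lasso != 0)  # 2 : lasso selects only the two large ones
```

This is why, when a sparse model is plausible, the lasso both predicts well and yields an interpretable subset of variables.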