본문 바로가기

Data Science/R

[R] Non-Linear Models : Step Functions

1. What is Step functions?

  • Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. Cubic regression uses three variables $X_1$, $X_2$, $X_3$ as predictors. This is a simple way to provide a non-linear fit to the data.
  • Step functions cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piece wise constant function. 

 

Using polynomial functions of the features as predictors imposes a global structure on the non-linear function of X. Instead, we could use step functions to avoid such global structure. Here we break X into bins, and fit a different constant in each bin. Essentially, we create the bins by selecting K cut-points in the range of X, and then construct K+1 new variables, which behave like dummy variables.    
   

  • Cut the variable into distinct regions : \(C_1(X) = I(X < c_1), C_2(X) = I(c_1 \leq X < c_2), ..., C_K(X) = I(X \geq c_K)\)
  • For any value of \(X\), \(C_0(X) + C_1(X) + ... C_k(X) = 1\)
  • Choice of cutpoints or knots can be problematic. 

 

2. [Ex] Step Functions with Linearity

# cut() automatically picked the cut points 
table(cut(age, 4)) 
# (17.9, 33.5] (33.5, 49] (49, 64.5] (64.5, 80.1]

# Linear regression
fit <- lm(wage ~ cut(age, 4), data=Wage) 
# Logistic regression 
fit2 <- glm(I(wage > 250) ~ cut(age, 4), data=Wage, family="binomial") 

# The age < 33.5 category is left out. 
# The first category is recognized as reference. 
coef(summary(fit))

 

 

# Fitted values along with confidence bands. 
# confidence bands of linear regression 
age.grid <- seq(min(age), max(age))
preds <- predict(fit, newdata=list(age=age.grid), se=TRUE)
se.bands <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)

# confidence bands of logistic regression 
preds2 <- predict(fit2, newdata=list(age=age.grid), se=T)
pfit <- exp(preds2$fit)/(1 + exp(preds2$fit))
se.bands.logit <- cbind(preds2$fit + 2*preds2$se.fit, preds2$fit - 2*preds2$se.fit)
se.bands2 <- exp(se.bands.logit)/(1 + exp(se.bands.logit))

# Visualize result 
# Linear regression 
par(mfrow=c(1,2), mar=c(4.5 ,4.5 ,1 ,1), oma=c(0, 0, 4, 0))
plot(age, wage, xlim=range(age), cex=.5, col="darkgrey")
title("Degree-4 Step Functions", outer=T)
lines(age.grid, preds$fit, lwd=3, col="darkgreen")
matlines(age.grid, se.bands, lwd=2, col="darkgreen", lty=2)

# Logistic regression 
plot(age , I(wage > 250), xlim=range(age), type="n", ylim=c(0, .2))
points(jitter(age), I((wage >250)/5), cex=.5, pch="|", col="darkgrey")
lines(age.grid, pfit, lwd=3, col="darkgreen")
matlines(age.grid, se.bands2, lwd=2, col="darkgreen", lty=2)

 

 

3. Piecewise Polynomials

  • Instead of a single polynomials in \(X\) over, we can use different polynomials in regions defined by knots.
  • The value of the continuous variable of \(X\) is separated by to apply the polynomial regression for each continuous value.
  • \(y_i\)
    • \(y_i = \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i\), if \(x_i < c\)
    • \(y_i = \beta_{02} + \beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i\), if \(x_i \geq c\)

 

4. [Ex] Step Functions with Non-Linearity

# 200 obs. are randomly generated from 3000 obs 
set.seed(19) 
ss <- sample(3000, 200)
nWage <- Wage[ss,]

# testing sets 
age.grid <- seq(min(nWage$age), max(nWage$age)) 

# Training model in piecewise polynomials 
g1 <- lm(wage ~ poly(age, 3), data=nWage, subset=(age < 50))
g2 <- lm(wage ~ poly(age, 3), data=nWage, subset=(age > 50)) 

# Predict on testing dataset
pred1 <- predict(g1, newdata=list(age=age.grid[age.grid < 50]))
pred2 <- predict(g2, newdata=list(age=age.grid[age.grid >= 50]))

# Visualize result
par(mfrow = c(1, 2))
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age",
ylab="Wage")
title(main = "Piecewise Cubic")
lines(age.grid[age.grid < 50], pred1, lwd=2, col="darkblue")
lines(age.grid[age.grid >= 50], pred2, lwd=2, col="darkblue")
abline(v=50, lty=2)

 

  • Using piecewise polynomials, discontinuities occur for each point(knots) separated from the continuous value.
  • To solve this problem, continuous piecewise polynomials are calculated by additionally calculating LHS and RHS(the effect of removing one coefficients)

 

# Define the two hockey-stick functions
LHS <- function(x) ifelse(x < 50, 50-x, 0)
RHS <- function(x) ifelse(x < 50, 0, x-50)

# Fit continuous piecewise polynomials
g3 <- lm(wage ~ poly(LHS(age), 3) + poly(RHS(age), 3), nWage)
pred3 <- predict(g3, newdata=list(age=age.grid))
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age",
     ylab="Wage")
title(main="Continuous Piecewise Cubic")
lines(age.grid, pred3, lwd=2, col="darkgreen")
abline(v=50, lty=2)

summary(g1)
summary(g2)
summary(g3)

 

 

 

 


Source from : https://rstudio-pubs-static.s3.amazonaws.com/24589_7552e489485b4c2790ea6634e1afd68d.html