1. What are Step Functions?
- Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, cubic regression uses the three variables \(X\), \(X^2\), and \(X^3\) as predictors. This is a simple way to provide a non-linear fit to the data.
- Step functions cut the range of a variable into \(K\) distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.
Using polynomial functions of the features as predictors imposes a global structure on the non-linear function of \(X\). Instead, we can use step functions to avoid such global structure. Here we break \(X\) into bins and fit a different constant in each bin. Essentially, we create the bins by selecting \(K\) cutpoints in the range of \(X\), and then construct \(K+1\) new variables, which behave like dummy variables.
- Cut the variable into distinct regions: \(C_0(X) = I(X < c_1),\ C_1(X) = I(c_1 \leq X < c_2),\ \dots,\ C_K(X) = I(c_K \leq X)\)
- For any value of \(X\), \(C_0(X) + C_1(X) + \dots + C_K(X) = 1\), since \(X\) must fall in exactly one region.
- The choice of cutpoints (knots) can be problematic: unless there are natural breakpoints in the data, the piecewise constant fit can miss the underlying trend.
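The dummy variables \(C_0(X), \dots, C_K(X)\) can be built directly in R. A minimal sketch with synthetic data (the predictor `x` and cutpoints `c1`, `c2` are made up for illustration), showing that the indicators for each observation sum to 1:

# Minimal sketch (synthetic data): build the dummy variables C_k(X) by hand
set.seed(1)
x  <- runif(10, 0, 10)              # toy predictor
c1 <- 3; c2 <- 7                    # two cutpoints -> three regions
C0 <- as.numeric(x < c1)            # I(X < c1)
C1 <- as.numeric(x >= c1 & x < c2)  # I(c1 <= X < c2)
C2 <- as.numeric(x >= c2)           # I(c2 <= X)
# Each observation falls in exactly one region, so the indicators sum to 1
all(C0 + C1 + C2 == 1)              # TRUE

In practice, cut() (used below) builds an equivalent factor in one step, and lm() expands it into these dummy variables automatically.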
2. [Ex] Step Functions with Linearity
# cut() automatically picked the cut points
table(cut(age, 4))
# (17.9, 33.5] (33.5, 49] (49, 64.5] (64.5, 80.1]
# Linear regression
fit <- lm(wage ~ cut(age, 4), data=Wage)
# Logistic regression
fit2 <- glm(I(wage > 250) ~ cut(age, 4), data=Wage, family="binomial")
# The first category, age < 33.5, is left out; it serves as the reference level.
coef(summary(fit))
# Fitted values along with confidence bands.
# confidence bands of linear regression
age.grid <- seq(min(age), max(age))
preds <- predict(fit, newdata=list(age=age.grid), se=TRUE)
se.bands <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
# confidence bands of logistic regression
preds2 <- predict(fit2, newdata=list(age=age.grid), se=TRUE)
pfit <- exp(preds2$fit)/(1 + exp(preds2$fit))
se.bands.logit <- cbind(preds2$fit + 2*preds2$se.fit, preds2$fit - 2*preds2$se.fit)
se.bands2 <- exp(se.bands.logit)/(1 + exp(se.bands.logit))
# Visualize result
# Linear regression
par(mfrow=c(1, 2), mar=c(4.5, 4.5, 1, 1), oma=c(0, 0, 4, 0))
plot(age, wage, xlim=range(age), cex=.5, col="darkgrey")
title("Step Functions (4 Bins)", outer=TRUE)
lines(age.grid, preds$fit, lwd=3, col="darkgreen")
matlines(age.grid, se.bands, lwd=2, col="darkgreen", lty=2)
# Logistic regression
plot(age, I(wage > 250), xlim=range(age), type="n", ylim=c(0, .2))
points(jitter(age), I((wage > 250)/5), cex=.5, pch="|", col="darkgrey")
lines(age.grid, pfit, lwd=3, col="darkgreen")
matlines(age.grid, se.bands2, lwd=2, col="darkgreen", lty=2)
3. Piecewise Polynomials
- Instead of fitting a single polynomial in \(X\) over its whole range, we can fit different polynomials in regions defined by knots.
- The range of \(X\) is split at a knot \(c\), and a separate polynomial regression is fit to the observations on each side.
- \(y_i = \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i\), if \(x_i < c\)
- \(y_i = \beta_{02} + \beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i\), if \(x_i \geq c\)
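The two equations above amount to fitting two independent cubic regressions, one on each side of the knot. A minimal sketch on synthetic data (the variables `x`, `y` and the knot at 5 are made up for illustration), showing that the two unconstrained fits generally disagree at the knot:

# Minimal sketch (synthetic data): two unconstrained cubics around a knot c = 5
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd=0.2)
fit.left  <- lm(y ~ poly(x, 3), subset=(x < 5))
fit.right <- lm(y ~ poly(x, 3), subset=(x >= 5))
# The two predictions at x = 5 differ -> a discontinuity at the knot
predict(fit.left,  newdata=data.frame(x=5))
predict(fit.right, newdata=data.frame(x=5))

Each piece uses four coefficients, so the unconstrained model spends eight degrees of freedom and nothing ties the pieces together at \(c\); the continuity fix in the example below removes that gap.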
4. [Ex] Step Functions with Non-Linearity
# Randomly sample 200 of the 3000 observations
set.seed(19)
ss <- sample(3000, 200)
nWage <- Wage[ss,]
# Grid of age values for prediction
age.grid <- seq(min(nWage$age), max(nWage$age))
# Training model in piecewise polynomials
g1 <- lm(wage ~ poly(age, 3), data=nWage, subset=(age < 50))
g2 <- lm(wage ~ poly(age, 3), data=nWage, subset=(age >= 50))
# Predict on testing dataset
pred1 <- predict(g1, newdata=list(age=age.grid[age.grid < 50]))
pred2 <- predict(g2, newdata=list(age=age.grid[age.grid >= 50]))
# Visualize result
par(mfrow = c(1, 2))
plot(nWage$age, nWage$wage, col="darkgrey", xlab="Age", ylab="Wage")
title(main = "Piecewise Cubic")
lines(age.grid[age.grid < 50], pred1, lwd=2, col="darkblue")
lines(age.grid[age.grid >= 50], pred2, lwd=2, col="darkblue")
abline(v=50, lty=2)
- With unconstrained piecewise polynomials, the fitted function is generally discontinuous at each knot.
- To solve this problem, we fit a continuous piecewise polynomial using the two hockey-stick functions LHS and RHS defined below. Both vanish at the knot, so the two pieces share a single intercept there; enforcing continuity effectively removes one free coefficient per knot.
# Define the two hockey-stick functions
LHS <- function(x) ifelse(x < 50, 50-x, 0)
RHS <- function(x) ifelse(x < 50, 0, x-50)
# Fit continuous piecewise polynomials
g3 <- lm(wage ~ poly(LHS(age), 3) + poly(RHS(age), 3), data=nWage)
pred3 <- predict(g3, newdata=list(age=age.grid))
plot(nWage$age, nWage$wage, col="darkgrey", xlab="Age", ylab="Wage")
title(main="Continuous Piecewise Cubic")
lines(age.grid, pred3, lwd=2, col="darkgreen")
abline(v=50, lty=2)
summary(g1)
summary(g2)
summary(g3)
Source: https://rstudio-pubs-static.s3.amazonaws.com/24589_7552e489485b4c2790ea6634e1afd68d.html