
[R] Non-Linear Models : Splines

1. What are Splines?

Regression splines are more flexible than polynomials and step functions, and are in fact an extension of the two. They divide the range of X into K distinct regions. Within each region, a polynomial function is fit to the data; the polynomials are constrained so that they join smoothly at the region boundaries, or knots.

 

With regression splines, instead of fitting a single high-degree polynomial over the entire range of X, we fit several low-degree polynomials over different regions of X. Each of these functions can be fit using least squares.
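
To make the idea concrete, here is a minimal sketch (assuming the Wage data from the ISLR package, which the later examples also use): two separate cubic polynomials fit by least squares on either side of age 50. Nothing forces the two pieces to agree at the boundary, so the fitted curve can jump at the knot; splines add exactly the constraints that make the pieces join smoothly.

# Unconstrained piecewise cubic: one least-squares fit below the knot, one above
library(ISLR)
data(Wage)
lo <- lm(wage ~ poly(age, 3), data=Wage, subset=(age < 50))
hi <- lm(wage ~ poly(age, 3), data=Wage, subset=(age >= 50))

plot(Wage$age, Wage$wage, col="darkgrey", xlab="Age", ylab="Wage")
grid.lo <- seq(min(Wage$age), 50)
grid.hi <- seq(50, max(Wage$age))
lines(grid.lo, predict(lo, newdata=list(age=grid.lo)), lwd=2, col="red")
lines(grid.hi, predict(hi, newdata=list(age=grid.hi)), lwd=2, col="red")
abline(v=50, lty=2)  # the two pieces need not meet here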

 

2. Linear Splines

  • A linear spline with knots at \(\zeta_k\), \(k = 1, \ldots, K\), is a piecewise linear polynomial that is continuous at each knot.
  • \(y_i = \beta_0 + \beta_1 b_1(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i\)
    • \(b_1(x_i) = x_i\)
    • \(b_{k+1}(x_i) = (x_i - \zeta_k)_{+}\), \(k = 1, \ldots, K\)
    • \((x_i - \zeta_k)_{+} = x_i - \zeta_k\) if \(x_i > \zeta_k\), and 0 otherwise

 

# Linear spline with a single knot at age 50
# (nWage and age.grid are assumed to carry over from the previous post in this series)
library(splines)   # provides bs() and ns()
g6 <- lm(wage ~ bs(age, knots=50, degree=1), data=nWage)
pred6 <- predict(g6, newdata=list(age=age.grid))

plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age",
     ylab="Wage")
title(main="Linear Spline")
lines(age.grid, pred6, lwd=2, col="darkred")
abline(v=50, lty=2)
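
The truncated power basis from the bullet list can also be built by hand (again assuming nWage as above). Since \(\{1, x, (x - 50)_{+}\}\) spans the same space as the bs() basis, the fitted values agree up to numerical error:

# Manual linear spline basis: b1(x) = x, b2(x) = (x - 50)+
xl <- cbind(nWage$age, pmax(nWage$age - 50, 0))
g6m <- lm(wage ~ xl, data=nWage)
max(abs(fitted(g6m) - fitted(g6)))  # ~ 0: same fit, different basis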

 

 

3. Cubic Splines

  • A cubic spline with knots at \(\zeta_k\), \(k = 1, \ldots, K\), is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot.
  • \(y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i\)
    • \(b_1(x_i) = x_i\)
    • \(b_2(x_i) = x_i^2\)
    • \(b_3(x_i) = x_i^3\)
    • \(b_{k+3}(x_i) = (x_i - \zeta_k)_{+}^3\), \(k = 1, \ldots, K\)
    • \((x_i - \zeta_k)_{+}^3 = (x_i - \zeta_k)^3\) if \(x_i > \zeta_k\), and 0 otherwise

 

# Truncated power basis functions 
d <- 3
knots <- 50 
x1 <- outer(nWage$age, 1:d, "^") # columns x, x^2, x^3
x2 <- outer(nWage$age, knots, ">") * outer(nWage$age, knots, "-")^d # truncated term (x - 50)+^3
x <- cbind(x1, x2) # make form of x, x^2, x^3, (x-50)^3

# Train the model (the basis matrix x is taken from the calling environment)
g4 <- lm(wage ~ x, data=nWage)

# Make testing set and predictions 
nx1 <- outer(age.grid, 1:d, "^")
nx2 <- outer(age.grid, knots, ">") * outer(age.grid, knots, "-")^d
nx <- cbind(nx1, nx2)
pred4 <- predict(g4, newdata=list(x=nx))

# Visualize 
par(mfrow=c(1,2))
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age", ylab="Wage")
title(main = "Cubic Spline")
lines(age.grid, pred4, lwd=2, col="red")
abline(v=50, lty = 2)

# Make automatic splines using bs function 
library(splines)
g5 <- lm(wage ~ bs(age, knots=50), data=nWage)
pred5 <- predict(g5, newdata=list(age=age.grid))
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age", ylab="Wage")
title(main="Cubic Spline")
lines(age.grid, pred5, lwd=2, col="red")
abline(v=50, lty=2)

 

 

4. Natural Cubic Splines

  • Splines can have high variance at the outer range of the predictors.
  • A natural spline is a regression spline with additional boundary constraints.
  • The function is required to be linear at the boundary, i.e. where \(X\) is smaller than the smallest knot or larger than the largest knot.

 

# Cubic spline
fit <- lm(wage ~ bs(age, knots=c(25, 40, 60)), data=nWage)
pred <- predict(fit, newdata=list(age=age.grid), se=T)

# Natural spline: using the ns() function
fit2 <- lm(wage ~ ns(age, knots=c(25, 40, 60)), data=nWage)
pred2 <- predict(fit2, newdata=list(age=age.grid), se=T)
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age", ylab="Wage")
lines(age.grid, pred$fit, lwd=2, col=4)
lines(age.grid, pred$fit + 2*pred$se, lty="dashed", col=4)
lines(age.grid, pred$fit - 2*pred$se, lty="dashed", col=4)
lines(age.grid, pred2$fit, lwd=2, col=2)
lines(age.grid, pred2$fit + 2*pred2$se, lty="dashed", col=2)
lines(age.grid, pred2$fit - 2*pred2$se, lty="dashed", col=2)
abline(v=c(25, 40, 60), lty=2)
legend("topright", c("Natural Cubic Spline", "Cubic Spline"), lty=1, lwd=2, col=c(2, 4))

 

 

4.1 [Ex] Natural Cubic Splines with Degrees of Freedom

  • We can fit natural cubic splines with knots placed at percentiles of the data by specifying the df argument, as in ns(age, df=k).
  • Here we fit a natural cubic spline with three knots, whose locations are chosen automatically as the 25th, 50th, and 75th percentiles of age.

 

# Use a complete Wage data 
age <- Wage$age 
wage <- Wage$wage 

# Make test dataset 
age.grid <- seq(min(age), max(age)) 

# Training model with natural cubic splines 
g1 <- lm(wage ~ ns(age, df=4), data=Wage) 
pred1 <- predict(g1, newdata=list(age=age.grid), se=T) 
se.bands1 <- cbind(pred1$fit + 2*pred1$se.fit, pred1$fit - 2*pred1$se.fit) 

# Training model with natural cubic splines on a binary response 
g2 <- glm(I(wage > 250) ~ ns(age, df=4), data=Wage, family="binomial") 
pred2 <- predict(g2, newdata=list(age=age.grid), se=T) 
pfit <- exp(pred2$fit)/(1 + exp(pred2$fit)) 
se.bands.logit <- cbind(pred2$fit + 2*pred2$se.fit, pred2$fit - 2*pred2$se.fit)
se.bands2 <- exp(se.bands.logit)/(1 + exp(se.bands.logit)) 

# Visualize predicted confidence bands : lm fit 
par(mfrow=c(1,2), mar=c(4.5 ,4.5 ,1 ,1), oma=c(0, 0, 4, 0))
plot(age, wage, cex=.5, col="darkgrey", xlab="Age", ylab="Wage")
title("Natural Cubic Spline", outer=T)
lines(age.grid, pred1$fit, lwd=3, col=2)
matlines(age.grid, se.bands1, lwd=2, col=2, lty=2)
ncs <- ns(age, df=4)
attr(ncs, "knots")
abline(v=attr(ncs, "knots"), lty=3)

# Visualize predicted confidence bands : glm fit 
plot(age, I(wage > 250), type ="n", ylim=c(0, .2), xlab="Age",
     ylab="Pr(Wage>250 | Age)")
points(jitter(age), I((wage >250)/5), cex=.5, pch ="|",
       col="darkgrey")
lines(age.grid, pfit, lwd=3, col=2)
matlines(age.grid, se.bands2, lwd=2, col=2, lty=2)
abline(v=attr(ncs, "knots"), lty=3)

 

  • A cubic spline with \(K\) knots has \(K+4\) parameters or degrees of freedom.
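
This count is easy to confirm with the age variable already loaded: with \(K = 3\) knots, bs() produces \(K + 3 = 6\) basis columns, and the intercept brings the total to \(K + 4 = 7\) parameters. The ns() basis is smaller because the boundary constraints remove parameters.

ncol(bs(age, knots=c(25, 40, 60)))  # 6 columns; +1 intercept = K+4 = 7 parameters
ncol(ns(age, knots=c(25, 40, 60)))  # 4 columns: boundary constraints save parameters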

 

 

4.2 [Ex] Optimized Natural Cubic Splines

  • Place more knots where the function might vary most rapidly.
  • Place fewer knots where it seems more stable.
  • How many knots should we use? Choose the number by cross-validation.

 

set.seed(1111)
CVE <- matrix(0, 20, 10) 

# Repeat 10-fold cross-validation 20 times 
for (k in 1:20) { 
    gr <- sample(rep(seq(10), length=nrow(Wage))) 
    pred <- matrix(NA, nrow(Wage), 10) 
    # 10-fold cross-validation 
    for (i in 1:10) {
        tran <- (gr != i) 
        test <- (gr == i) 
        # Natural cubic splines with j degrees of freedom 
        for (j in 1:10) {
            nsx <- ns(age, df=j) 
            g <- lm(wage ~ nsx, data=Wage, subset=tran) 
            # Supply the basis via newdata so predictions cover all rows
            mse <- (Wage$wage - predict(g, list(nsx=nsx)))^2
            pred[test, j] <- mse[test]
        } 
    }
    }
    CVE[k, ] <- apply(pred, 2, mean)
}
RES <- apply(CVE, 2, mean)
plot(seq(10), RES, type="b", col=2, pch=20, xlab="Degrees of Freedom of Natural Spline", 
     ylab="Mean Squared Error")

 

 

4.3 [Ex] Comparison to Polynomial Regression

  • Regression splines often give superior results to polynomial regression.
  • The extra flexibility in the polynomial produces undesirable results at the boundaries, while the natural cubic spline still provides a reasonable fit to the data.

 

# Training models: natural cubic spline with 15 df and a degree-15 polynomial 
g1 <- lm(wage ~ ns(age, df=15), data=Wage)
g2 <- lm(wage ~ poly(age, 15), data=Wage)

# Make prediction of testing set
pred1 <- predict(g1, newdata=list(age=age.grid), se=T)
pred2 <- predict(g2, newdata=list(age=age.grid), se=T)

# Visualize comparison results 
plot(age, wage, cex=.5, col="darkgrey", xlab="Age", ylab="Wage")
lines(age.grid, pred1$fit, lwd=2, col=2)
lines(age.grid, pred1$fit + 2*pred1$se, lty="dashed", col=2)
lines(age.grid, pred1$fit - 2*pred1$se, lty="dashed", col=2)
lines(age.grid, pred2$fit, lwd=2, col=4)
lines(age.grid, pred2$fit + 2*pred2$se, lty="dashed", col=4)
lines(age.grid, pred2$fit - 2*pred2$se, lty="dashed", col=4)
legend("topleft", c("Natural Cubic Spline", "Polynomial"), lty=1, lwd=2, col=c(2, 4))

 

 

5. Smoothing Splines

  • Find a function \(g(x)\) that makes the \(RSS\) small, but that is also smooth.
  • A function \(g(x)\) that minimizes the criterion below is a smoothing spline.
    • \(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt\)
    • \(\lambda\) is a non-negative tuning parameter.
    • \(\lambda\) controls how wiggly \(g(x)\) is.
    • The smaller \(\lambda\), the more wiggly the function.
    • As \(\lambda \to \infty\), the function \(g(x)\) becomes linear.
  • The solution is a natural cubic spline, with a knot at every unique value of \(x_i\).
    • Smoothing splines avoid the knot-selection issue, leaving a single \(\lambda\) to be chosen.
    • The function smooth.spline() fits a smoothing spline.
    • The leave-one-out cross-validation (LOOCV) error can be computed very efficiently for smoothing splines; see the check after this list.
    • \(LOOCV_{\lambda} = \sum_{i=1}^n \big(y_i - \hat{g}_{\lambda}^{[-i]}(x_i)\big)^2 = \sum_{i=1}^n \left[\frac{y_i - \hat{g}_{\lambda}(x_i)}{1 - \{S_{\lambda}\}_{ii}}\right]^2\)
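
The leverage identity can be checked directly, since smooth.spline() returns the diagonal elements \(\{S_{\lambda}\}_{ii}\) of the smoother matrix as fit$lev. A small sketch on simulated, tie-free data (note that smooth.spline() reports its CV score as a mean rather than a sum):

# LOOCV via the leverage shortcut, on toy data without ties in x
set.seed(1)
x <- sort(runif(100))
y <- sin(2*pi*x) + rnorm(100, sd=0.3)
ss <- smooth.spline(x, y, cv=TRUE)
yhat <- predict(ss, x)$y
loocv.manual <- mean(((y - yhat)/(1 - ss$lev))^2)  # (1/n) x the sum above
c(ss$cv.crit, loocv.manual)  # should agree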

 

# Preparing dataset
library(ISLR)
library(splines)
data(Wage)
age <- Wage$age
wage <- Wage$wage
age.grid <- seq(min(age), max(age)) 

# Training models fitted in smoothing splines 
# Smoothing spline with df fixed at 16
fit <- smooth.spline(age, wage, df=16) 

# Optimized smoothing splines with cross validation
fit2 <- smooth.spline(age, wage, cv=TRUE) 
fit2$df  # effective degrees of freedom chosen by LOOCV (about 6.8)

# Natural cubic splines with df=7
fit3 <- lm(wage ~ ns(age, df=7), data=Wage)
pred3 <- predict(fit3, newdata=list(age=age.grid)) 

# Visualize fits 
plot(age, wage, cex=.5, col="darkgrey") 
title("Smoothing Spline vs Natural Spline") 
lines(fit, col="red", lwd=2)
lines(fit2, col="blue", lwd=2) 
lines(age.grid, pred3, col="green", lwd=2) 
legend("topright", legend=c("SS (DF=16)", "SS (DF=6.8)", "NS (DF=7)"), col=c("red","blue","green"), lty=1, lwd=2)

# Cross Validation Simulation
set.seed(1234)
N <- 10 # Simulation replications
K <- 10 # 10-fold CV
df <- seq(2, 20) ## Degrees of freedom
CVE <- matrix(0, N, length(df))
for (k in 1:N) {
    gr <- sample(rep(seq(K), length=nrow(Wage)))
    pred <- matrix(NA, nrow(Wage), length(df))
    for (i in 1:K) {
        tran <- (gr != i)
        test <- (gr == i)
        for (j in 1:length(df)) {
            fit <- smooth.spline(age[tran], wage[tran], df=df[j])
            mse <- (wage-predict(fit, age)$y)^2
            pred[test, j] <- mse[test]
        }
    }
    CVE[k, ] <- apply(pred, 2, mean)
}
RES <- apply(CVE, 2, mean)

# Visualize Cross Validation Error 
par(mfrow=c(1,2))
matplot(df, t(CVE), type="b", col=2, lty=2, pch=20, ylab="CV errors", xlab="Degrees of freedom")
plot(df, RES, type="b", col=2, pch=20, ylab="averaged CV errors", xlab="Degrees of freedom")
abline(v=df[which.min(RES)], col="grey", lty=4)

 

 

# Smoothing splines vs natural cubic splines 
set.seed(1357)
MSE1 <- matrix(0, 100, 2)
for (i in 1:100) {
    tran <- sample(nrow(Wage), size=floor(nrow(Wage)*2/3))
    test <- setdiff(1:nrow(Wage), tran)
    g1 <- smooth.spline(age[tran], wage[tran], df=7)
    g2 <- lm(wage ~ ns(age, df=7), data=Wage, subset=tran)
    mse1 <- (wage-predict(g1, age)$y)[test]^2
    mse2 <- (wage-predict(g2, Wage))[test]^2
    MSE1[i,] <- c(mean(mse1), mean(mse2))
}
apply(MSE1, 2, mean)  # average test MSE: smoothing spline vs natural spline
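
A side-by-side boxplot of the 100 replications shows the spread behind the two averages (a small optional addition):

# Distribution of test MSE over the 100 random splits
boxplot(MSE1, names=c("Smoothing Spline", "Natural Spline"), ylab="Test MSE", col=c(2, 4))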