1. What are Non-Linear Models?
- Kinds of non-linear models:
- Polynomials
- Step functions
- Splines
- Local regression
- Generalized Additive Models (GAM); a short fitting sketch follows after this list
- Population : \(y_i = \sum_{j=1}^p f_j(x_{ij}) + \epsilon_i\)
- Inference : \(\hat{y}_i = \sum_{j=1}^p \hat{f}_j(x_{ij})\)
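- A GAM of this form can be fit directly in R. The sketch below is a minimal illustration (not part of the original notes), assuming the gam package used in ISLR is available; s() requests a smoothing spline with the given degrees of freedom for each term.
# Fit a GAM on the Wage data (illustrative sketch)
library(ISLR)
library(gam)
gam.fit <- gam(wage ~ s(age, 5) + s(year, 4) + education, data=Wage)
summary(gam.fit)
# Plot each fitted function f_j with pointwise standard-error bands
par(mfrow=c(1, 3))
plot(gam.fit, se=TRUE, col="darkblue")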
2. Polynomial Regression
- Not really interested in the coefficients.
- More interested in fitted function values.
- \(\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0 + \hat{\beta}_2 x_0^2 + \hat{\beta}_3 x_0^3 + \hat{\beta}_4 x_0^4\)
- Compute the fit and a pointwise confidence band from the standard error : \(\hat{f}(x_0) \pm 2 \cdot \mathrm{se}[\hat{f}(x_0)]\).
2.1 [Ex] Polynomial Regression on the Wage dataset
# install.packages('ISLR')
# Import library
library(ISLR)
attach(Wage)
# Orthogonal polynomials:
# Each column is a linear orthogonal combination of
# age, age^2, age^3 and age^4
fit <- lm(wage ~ poly(age, 4), data=Wage)
summary(fit)
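# Added check (not in the original notes): the columns returned by poly(age, 4)
# are mutually orthogonal, so the off-diagonal entries of their cross-product
# are (numerically) zero
round(crossprod(poly(age, 4)), 10)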
# Direct power of age
fit2 <- lm(wage ~ poly(age, 4, raw=T), data=Wage)
summary(fit2)
fit2a <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), Wage)
fit2b <- lm(wage ~ cbind(age, age ^2, age^3, age^4), Wage)
round(data.frame(fit=coef(fit), fit2=coef(fit2),
fit2a=coef(fit2a), fit2b=coef(fit2b)), 5)
# Grid of age values used as test inputs
age.grid <- seq(min(age), max(age))
# se=TRUE also returns the pointwise standard error of the fitted values
preds <- predict(fit, newdata=list(age=age.grid), se=TRUE)
# Calculate the confidence bands (fitted value +/- 2 standard errors)
se.bands <- cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
# Visualize results
plot(age, wage, xlim=range(age), cex=.5, col="darkgrey")
title("Degree-4 Polynomial", outer=T)
lines(age.grid, preds$fit, lwd=2, col="darkblue")
matlines(age.grid, se.bands, lwd=2, col="darkblue", lty=2)
- The narrower the confidence bands, the more stable the predictions.
- Looking at the plot, the bands are narrow where the training data are dense, but they widen toward the edges of the age range, where the data are sparse and predictions for genuinely new values are less reliable (a quick numerical check follows below).
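- A quick numerical check (not in the original notes): the pointwise standard errors stored in preds are typically much larger at the ends of the age grid than near its middle.
# Compare standard errors at the ends of the age grid with one near the middle
idx <- c(1, round(length(age.grid)/2), length(age.grid))
round(data.frame(age=age.grid[idx], se=preds$se.fit[idx]), 3)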
# Orthogonal vs. Non-orthogonal polynomial regression
preds2 <- predict(fit2, newdata=list(age=age.grid), se=TRUE)
data.frame(fit=preds$fit, fit2=preds2$fit)
sum(abs(preds$fit-preds2$fit))
- The difference between the predictions of the orthogonal and the raw (non-orthogonal) polynomial regressions is negligible : the total absolute difference is just 1.103956e-09.
2.2 [Ex] Choosing the optimal polynomial degree : Statistical tests
# Anova test to find the optimal polynomial degree
fit.1 <- lm(wage ~ age, data=Wage)
fit.2 <- lm(wage ~ poly(age, 2), data=Wage)
fit.3 <- lm(wage ~ poly(age, 3), data=Wage)
fit.4 <- lm(wage ~ poly(age, 4), data=Wage)
fit.5 <- lm(wage ~ poly(age, 5), data=Wage)
g <- anova(fit.1, fit.2, fit.3, fit.4, fit.5)
g
- The ANOVA compares each pair of successive nested models with an F-test.
- For Model 2 (quadratic) versus Model 1 (linear), for example,
- \(H_0 : \beta_2 = 0\) : Model 1 is sufficient.
- \(H_1 : \beta_2 \neq 0\) : Model 2 is preferable.
- We can choose an adequate polynomial degree from this sequence of tests (the F-statistic it uses is spelled out below).
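- For reference (standard result, not stated in the original notes): when a smaller model 1 with residual sum of squares \(RSS_1\) and residual degrees of freedom \(d_1\) is nested in a larger model 2 with \(RSS_2\) and \(d_2\), anova() reports \(F = \dfrac{(RSS_1 - RSS_2)/(d_1 - d_2)}{RSS_2 / d_2}\) together with its p-value.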
# Perform T-test
coef(summary(fit.5))
round(coef(summary(fit.5)), 5)
# For orthogonal polynomials, the squared t-statistics equal the ANOVA F-statistics
summary(fit.5)$coef[-c(1, 2), 3]
summary(fit.5)$coef[-c(1, 2), 3]^2
g$F[-1]
- From the t-tests on the fit.5 model we can choose the optimal polynomial degree.
- The cubic term (\(\beta_3\)) is the last clearly significant one, so degree 3 is preferable.
2.3 [Ex] Choosing the optimal polynomial degree : K-fold CV simulation
# 10-fold cross-validation to choose the optimal polynomial
set.seed(1111)
N <- 10 # simulation replications
K <- 10 # 10-fold CV
# CV error matrix: one row per simulation replication, one column per degree
CVE <- matrix(0, N, 10)
# Model training - calculate the cross-validated test error
for (k in 1:N) {
  gr <- sample(rep(seq(K), length=nrow(Wage)))
  # Per-observation squared errors, one column per polynomial degree
  pred <- matrix(NA, nrow(Wage), 10)
  for (i in 1:K) {
    tran <- (gr != i)
    test <- (gr == i)
    for (j in 1:10) {
      g <- lm(wage ~ poly(age, j), data=Wage, subset=tran)
      # Predict on the full data set, then keep only the test-fold squared errors
      yhat <- predict(g, newdata=Wage)
      mse <- (Wage$wage - yhat)^2
      pred[test, j] <- mse[test]
    }
  }
  CVE[k, ] <- apply(pred, 2, mean)
}
RES <- apply(CVE, 2, mean)
RES
# Visualize Result
par(mfrow=c(1,2))
matplot(t(CVE), type="l", xlab="Degree of Polynomial",
        ylab="Mean Squared Error")
plot(seq(10), RES, type="b", col=2, pch=20,
     xlab="Degree of Polynomial", ylab="Mean Squared Error")
- Looking at the plot, the CVE drops sharply when the polynomial degree increases from 1 to 2, and changes little after that.
- When building a model, it is better to use the least complex model that is statistically justified, rather than simply the one with the lowest CVE (a sketch of the one-standard-error rule follows below).
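- One common way to formalize this idea (not in the original notes) is the one-standard-error rule: choose the smallest degree whose mean CV error is within one standard error of the minimum.
# One-standard-error rule: an illustrative sketch using the CVE matrix from above
cve.mean <- apply(CVE, 2, mean)
cve.se <- apply(CVE, 2, sd) / sqrt(nrow(CVE))
best <- which.min(cve.mean)
# Smallest degree whose mean CV error is within one SE of the minimum
chosen <- min(which(cve.mean <= cve.mean[best] + cve.se[best]))
chosen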
2.4 [Ex] Logistic Regression
# Logistic regression using binary response
# 1 for wage > 250 and 0 for wage <= 250
fit <- glm(I(wage > 250) ~ poly(age, 4), data=Wage, family="binomial")
# Predict dataset and calculate confidence bands
preds <- predict(fit, newdata=list(age=age.grid), se=T)
# Inverse logit: transform the predictions from the logit scale back to probabilities
pfit <- exp(preds$fit) / (1 + exp(preds$fit))
# Confidence bands on the logit scale, then mapped back to the probability scale
se.bands.logit <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
se.bands <- exp(se.bands.logit)/(1 + exp(se.bands.logit))
# Alternative: type="response" returns probabilities directly
preds2 <- predict(fit, newdata=list(age=age.grid), type="response", se=T)
cbind(pfit, preds2$fit)
# Visualize result of logistic regression
plot(age , I(wage > 250), xlim=range(age), type="n", ylim=c(0, .2))
points(jitter(age), I((wage>250)/5), cex=.5, pch="|", col="darkgrey")
lines(age.grid, pfit, lwd=2, col="darkblue")
matlines(age.grid, se.bands, lwd=2, col="darkblue", lty=2)
- When wage is converted into a binary response (wage > 250) and the logistic model is fit, the confidence bands are again narrow where the training data are dense, but they become very wide at the upper end of the age range, where few observations (and almost no high earners) are available.