
[R] Classification Problem : LDA(Linear Discriminant Analysis)

1. Discriminant Analysis

  • Bayes Theorem : \(Pr(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}\)
  • The density for \(X\) in class \(k\) : \(f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k}e^{-\frac{(x-\mu_k)^2}{2\sigma_k^2}}\)
  • The prior probability for class \(k\) : \(\pi_k = Pr(Y=k)\)
  • If the distribution of \(X\) within each class is approximately normal, LDA and QDA are more stable.

 

1.1 LDA(Linear Discriminant Analysis)

  • Assume a common standard deviation across classes: \(\sigma_1 = \sigma_2 = \dots = \sigma_K = \sigma\)
  • \(f_k(x)\)
    • p = 1 : \(f_k(x) \sim N(\mu_k, \sigma^2)\)
    • p > 1 : \(f_k(x) \sim N(\mu_k, \Sigma)\)
  •  \(p_k(x) =  \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_k)^2}{2\sigma^2}}}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_l)^2}{2\sigma^2}}}\)
  • To classify at the value \(X=x\), see which of the \(p_k(x)\) is largest.
  • Because the denominator of \(p_k(x)\) is the same for every class, we only need to compare the numerators.
    • Discriminant score :
      • p = 1 : \(\delta_k(x) = x\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)\)
      • p > 1 : \(\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)\)
    • \(\delta_k(x)\) is a linear function of \(x\). 
  • With two classes and equal priors, the decision boundary is at \(x = \frac{\hat{\mu}_1 + \hat{\mu}_2}{2}\)
  • Need to estimate : \((\mu_1, \dots, \mu_K)\), \((\pi_1, \dots, \pi_K)\), and \(\Sigma\)
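The p = 1 rule above can be sketched end to end on simulated data. This is a minimal illustration, not the method used later in the post: the two-class setup, sample sizes, and class means are assumptions made here for the example.

```r
# Minimal sketch of the p = 1 discriminant rule on simulated data
# (the two-class setup, sample sizes, and means are illustrative assumptions)
set.seed(1)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))  # common sd = 1
y <- rep(1:2, each = 50)

# Estimate the class means, pooled variance, and priors
mu   <- tapply(x, y, mean)
sig2 <- sum((x - mu[y])^2) / (length(x) - 2)      # pooled estimate of sigma^2
pi.k <- as.vector(table(y)) / length(y)

# delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
delta <- function(x0, k) x0 * mu[k] / sig2 - mu[k]^2 / (2 * sig2) + log(pi.k[k])

# Classify a new point by the largest score; with equal priors the two
# scores cross exactly at x = (mu_1 + mu_2) / 2
x0 <- 1.8
which.max(c(delta(x0, 1), delta(x0, 2)))
```

Since the two estimated priors are equal here, the two scores are equal exactly at the midpoint of the estimated means, matching the decision-boundary bullet above.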

 

1.2 [Ex] Iris Data

Training model with LDA

 

# open the iris dataset 
data(iris)
str(iris)
summary(iris)
plot(iris[, -5], col=as.numeric(iris$Species) + 1)

# Apply LDA for iris data
library(MASS)
g <- lda(Species ~., data=iris)
plot(g)
plot(g, dimen=1)

 

 

Compute Misclassification Rate for the training set

 

# Compute misclassification error for training sets
pred <- predict(g)
table(pred$class, iris$Species)
mean(pred$class!=iris$Species)

 

 

We can get the inferred posterior probabilities \(\hat{p}_k(x)\) from predict(g, iris)$posterior (indexing the test rows, e.g. posterior[!tran, ], when a training subset tran is used as in the comparison below).
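The posterior matrix has one row per observation and one column per class; a quick sketch on the full-data iris fit from above:

```r
# Inspect the posterior probabilities hat{p}_k(x) from the iris LDA fit
library(MASS)
g <- lda(Species ~ ., data = iris)
post <- predict(g, iris)$posterior

head(round(post, 3))          # one column per species
summary(rowSums(post))        # each row sums to 1
```

The predicted class returned by predict(g)$class is simply the column with the largest posterior in each row.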

 

Performance comparison between LDA vs Multinomial Regression

 

library(nnet)
set.seed(1234)
K <- 100
RES <- array(0, c(K, 2))
for (i in 1:K) {
    tran.num <- sample(nrow(iris), size=floor(nrow(iris)*2/3))
    tran <- as.logical(rep(0, nrow(iris)))
    tran[tran.num] <- TRUE
    g1 <- lda(Species ~., data=iris, subset=tran)
    g2 <- multinom(Species ~., data=iris, subset=tran, trace=FALSE)
    pred1 <- predict(g1, iris[!tran,])$class
    pred2 <- predict(g2, iris[!tran,])
    RES[i, 1] <- mean(pred1!=iris$Species[!tran])
    RES[i, 2] <- mean(pred2!=iris$Species[!tran])
}
apply(RES, 2, mean)

# Average test misclassification rate: LDA 0.0254, multinomial 0.0436

 

1.3 [Ex] Default Dataset

# Importing library 
library(ISLR)
data(Default)
attach(Default)
library(MASS)

# Training model 
g <- lda(default ~ ., data = Default)
pred <- predict(g, Default)
table(pred$class, Default$default)
mean(pred$class != Default$default)

 

 

  • False positive rate : the fraction of negative ("No") examples that are classified as positive (0.22%)
  • False negative rate : the fraction of positive ("Yes") examples that are classified as negative (76.2%)
  • If we always classified to the majority class "No" (the prior), we would make 333/10000 errors, which is only 3.33%.
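Because missed defaulters (false negatives) are usually the costlier error here, a common adjustment is to lower the posterior threshold for predicting "Yes" below the default 0.5. The sketch below assumes the lda fit on Default from above; the 0.2 threshold is an illustrative choice, not a prescription.

```r
# Lower the threshold for "Yes" to reduce the false negative rate
library(ISLR)
library(MASS)
data(Default)

g    <- lda(default ~ ., data = Default)
post <- predict(g, Default)$posterior[, "Yes"]

pred20 <- ifelse(post > 0.2, "Yes", "No")  # threshold 0.2 instead of 0.5
table(pred20, Default$default)
mean(pred20 != Default$default)            # overall error rises a little,
                                           # but fewer defaulters are missed
```

This trades a higher false positive rate for a lower false negative rate, which is often the right trade when the cost of missing a defaulter is high.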