

[R] Assessment of the Performance of a Classifier

1. Two types of misclassification errors

We can trade the two error rates off against each other by changing the threshold from 0.5 to some other value in [0, 1].

  • Classify as Yes if \(\hat{Pr}(Default = Yes \mid Balance, Student) \ge \alpha\).
  • \(\alpha\) is the threshold.
  • If \(\alpha\) ↑, FN increases while FP decreases.
  • If \(\alpha\) ↓, FN decreases while FP increases (see the short sketch below).
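
A minimal sketch of this trade-off, assuming the pred object from the LDA fit on the ISLR Default data used in the examples below (the two threshold values are arbitrary):

# Hypothetical illustration: FP and FN counts at two thresholds
for (a in c(0.2, 0.5)) {
  decision <- ifelse(pred$posterior[, "Yes"] >= a, "Yes", "No")
  cat("alpha =", a,
      " FP =", sum(decision == "Yes" & default == "No"),
      " FN =", sum(decision == "No" & default == "Yes"), "\n")
}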

 

2. [Ex] Changes in errors as the threshold varies

# Prerequisites: LDA fit on the ISLR Default data (same setup as in Section 5)
library(ISLR)
library(MASS)
data(Default)
attach(Default)
g <- lda(default ~ ., data = Default)
pred <- predict(g, Default)

thresholds <- seq(0, 1, 0.01)
res <- matrix(NA, length(thresholds), 3)

# Compute overall error, false positive and false negative rates
for (i in 1:length(thresholds)) {
    decision <- rep("No", length(default))
    decision[pred$posterior[, 2] >= thresholds[i]] <- "Yes"
    res[i, 1] <- mean(decision != default)                  # overall error rate
    res[i, 2] <- mean(decision[default == "No"] == "Yes")   # false positive rate
    res[i, 3] <- mean(decision[default == "Yes"] == "No")   # false negative rate
}

# Plot the three error rates for thresholds in [0, 0.5]
k <- 1:51
matplot(thresholds[k], res[k, ], col=c(1, "orange", 4), lty=c(1, 4, 2), type="l",
        xlab="Threshold", ylab="Error Rate", lwd=2)
legend("top", c("Overall Error", "False Positive", "False Negative"),
       col=c(1, "orange", 4), lty=c(1, 4, 2), cex=1.2)

# Index of the threshold minimizing each error rate
apply(res, 2, which.min)

 

 

  • The overall error seems to decrease at every α, because the false positive count shrinks to only 22 cases.
  • However, it actually increases slightly after the turning point, once the growing false negatives start to outweigh the remaining false positives.
  • By setting an adequate threshold we can adapt the model to a specific situation or problem.
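
For instance, the indices returned by which.min() can be mapped back to the corresponding threshold values (continuing with the thresholds grid and res matrix from the code above):

best <- apply(res, 2, which.min)   # row index of the minimum of each column
thresholds[best[1]]                # threshold that minimizes the overall error rate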

 

3. Confusion Matrix
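
A confusion matrix cross-tabulates the predicted classes against the actual ones. A minimal sketch with table(), assuming the pred object from the LDA fit used in the other examples and the default 0.5 threshold:

decision <- rep("No", length(default))
decision[pred$posterior[, 2] >= 0.5] <- "Yes"
table(Predicted = decision, Actual = default)

The diagonal entries are the correct classifications; the off-diagonal entries are the FP and FN counts.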

 

4. ROC curve

 

  • Class-specific performance measures in medicine and biology: sensitivity (TPR) and specificity (TNR).
  • The ROC (Receiver Operating Characteristic) curve plots the TPR against the FPR (= 1 − TNR) as the threshold \(\alpha\) varies.
  • The lower-left point corresponds to (\(\alpha=1\), TPR=0, TNR=1) and the upper-right point to (\(\alpha=0\), TPR=1, TNR=0).
  • The overall performance of a classifier is summarized by the AUC (the area under the ROC curve).
    • The larger the AUC, the better the classifier.

 

5. [Ex] ROC curve

Way 1: Drawing the ROC curve manually

# Prerequisites
library(ISLR)
data(Default)
attach(Default)
library(MASS)

# Train the LDA model
g <- lda(default ~ ., data = Default)
pred <- predict(g, Default)

# Error grids
thre <- seq(0,1,0.001)
Sen <- Spe <- NULL
RES <- matrix(NA, length(thre), 4)

# Classification metrics 
colnames(RES) <- c("TP", "TN", "FP", "FN")
for (i in 1:length(thre)) {
  decision <- rep("No", length(default))
  decision[pred$posterior[,2] >= thre[i]] <- "Yes"
  Sen[i] <- mean(decision[default=="Yes"] == "Yes")
  Spe[i] <- mean(decision[default=="No"] == "No")
  RES[i,1] <- sum(decision[default=="Yes"] == "Yes")
  RES[i,2] <- sum(decision[default=="No"] == "No")
  RES[i,3] <- sum(decision=="Yes") - RES[i,1]
  RES[i,4] <- sum(default=="Yes") - RES[i,1]
}

# Visualize the ROC curve
plot(1-Spe, Sen, type="b", pch=20, xlab="False positive rate",
     col="darkblue", ylab="True positive rate", main="ROC Curve")
abline(0, 1, lty=3, col="gray")
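
As a quick usage example (the 0.5 cutoff is chosen arbitrarily), the sensitivity and specificity at a particular threshold can be read off the grids computed above:

i <- which.min(abs(thre - 0.5))                 # grid position closest to the 0.5 cutoff
c(Sensitivity = Sen[i], Specificity = Spe[i])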

 

 

Way 2: Drawing the ROC curve from TPR and TNR

# Way 2: calculate TPR and TNR from the confusion counts
TPR <- RES[,1] / (RES[,1] + RES[,4])   # sensitivity = TP / (TP + FN)
TNR <- RES[,2] / (RES[,2] + RES[,3])   # specificity = TN / (TN + FP)

plot(1 - TNR, TPR, type="b", pch=20, xlab="False positive rate",
     col="darkblue", ylab="True positive rate", main="ROC Curve")
abline(0, 1, lty=3, col="gray")

 

Way 3: ROC curve with the ROCR package

library(ROCR)

# Compute ROC curve
label <- factor(default, levels=c("Yes","No"),
                labels=c("TRUE","FALSE"))
preds <- prediction(pred$posterior[,2], label)
perf <- performance(preds, "tpr", "fpr")

# Visualization 
plot(perf, lwd=4, col="darkblue")
abline(a=0, b=1, lty=2)
slotNames(perf)

k <- 1:100
# X-axis values: FPR
list(perf@x.name, perf@x.values[[1]][k])
# Y-axis values: TPR
list(perf@y.name, perf@y.values[[1]][k])
# Alpha cutoffs
list(perf@alpha.name, perf@alpha.values[[1]][k])
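
As a small usage example (the 5% target FPR is arbitrary), these slots can be combined to look up the cutoff whose false positive rate is closest to a desired value:

fpr <- perf@x.values[[1]]
cutoffs <- perf@alpha.values[[1]]
cutoffs[which.min(abs(fpr - 0.05))]   # cutoff giving an FPR closest to 5%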

# Compute AUC
performance(preds, "auc")@y.values
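
As a sanity check on the ROCR value, the AUC can also be approximated with the trapezoidal rule from the TPR and TNR vectors of Way 2 (the rev() calls just sort the curve by increasing FPR):

fpr <- rev(1 - TNR)                                    # FPR in increasing order
tpr <- rev(TPR)
sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)   # trapezoidal approximation of the AUC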