

[R] Support Vector Machine

1. What is Support Vector Machine?

  • Find a plane that separates the classes in feature space.
  • Soften what we mean by "separates", and enrich and enlarge the feature space so that separation becomes possible.
  • Three related methods
    • Maximal Margin Classifier
    • Support Vector Classifier (SVC)
    • Support Vector Machine (SVM)

 

1.1 Hyperplane

  • A hyperplane in p dimensions is a flat affine subspace of dimension p-1.
  • General equation for a hyperplane : \(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p = 0\)
    • If \(X = (X_1, ..., X_p)^T\) satisfies the equation above, then \(X\) lies on the hyperplane.
    • If \(X = (X_1, ..., X_p)^T\) does not satisfy the equation above, then \(X\) lies on one side of the hyperplane.
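  • [Ex] A small worked case (the coefficients below are chosen here only for illustration): in \(p = 2\) dimensions the hyperplane \(1 + 2X_1 + 3X_2 = 0\) is a line.
    • The point \((-1/2, 0)\) gives \(1 + 2(-1/2) + 3(0) = 0\), so it lies on the hyperplane.
    • The point \((1, 1)\) gives \(1 + 2(1) + 3(1) = 6 > 0\), so it lies to one side of the hyperplane.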

 

1.2 Separating Hyperplanes

  • If \(f(x) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p\), then \(f(x) > 0\) for points on one side of the hyperplane, and \(f(x) < 0\) for points on the other.
  • If a separating hyperplane exists, we can use it to construct a very natural classifier.
  • For a test observation \(x^{*}\) : \(f(x^*) = \beta_0 + \beta_1 x_1^* + ... + \beta_p x_p^*\)
    • If \(f(x^*) > 0\), we assign the test observation to class 1. 
    • If \(f(x^*) < 0\), we assign the test observation to class 2.
  • If \(|f(x^*)|\) is relatively large, the class assignment is confident.
  • If \(|f(x^*)|\) is relatively small, the class assignment is less confident.
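
A minimal R sketch of this decision rule (the hyperplane coefficients below are hypothetical, chosen only for illustration):

# Classify test points by the sign of f(x*) = beta0 + beta1*x1 + beta2*x2
# (hypothetical coefficients, for illustration only)
beta0 <- 1; beta1 <- 2; beta2 <- 3
xstar <- rbind(c(1, 1), c(-2, -1))                 # two test observations
f <- beta0 + beta1*xstar[, 1] + beta2*xstar[, 2]
class <- ifelse(f > 0, 1, 2)                       # class 1 if f(x*) > 0, class 2 otherwise
cbind(f, class)                                    # |f(x*)| reflects confidence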

 

2. Maximal Margin Classifier

  • The maximal margin hyperplane is the separating hyperplane that is farthest from the training observations.
  • Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes.
  • Constrained optimization problem :
    • \(\text{maximize}_{\beta_0, \beta_1, ..., \beta_p} \; M\) subject to \(\sum_{j=1}^p \beta_j^2 = 1\)
    • \(y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}) \geq M\) for all \(i = 1, ..., n\)
  • The function svm() in the R package e1071 solves this problem efficiently (see the sketch below).
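
A sketch (not from the original post) of how a maximal margin classifier can be approximated with svm(): on linearly separable data, a very large cost leaves essentially no budget for margin violations.

# Sketch: approximate a maximal margin classifier with svm() from e1071
# by using a very large cost on well-separated simulated data
library(e1071)
set.seed(10)
x <- matrix(rnorm(20*2), ncol=2)
y <- c(rep(-1, 10), rep(1, 10))
x[y==1, ] <- x[y==1, ] + 3        # shift class 1 so the two classes separate
dat <- data.frame(x, y=as.factor(y))
mmfit <- svm(y~., data=dat, kernel="linear", cost=1e5, scale=FALSE)
summary(mmfit)                    # only a few support vectors define the maximal margin
plot(mmfit, dat)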

 

 

2.1 The Non-separable Case

  • The maximal margin classifier is a very natural way to perform classification.
  • However, in many cases no separating hyperplane exists, and so there is no maximal margin classifier.
  • If a separating hyperplane doesn't exist, the optimization problem has no solution with \(M > 0\).
  • However, we can develop a hyperplane that almost separates the classes, using a so-called soft margin.
  • The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier.

 

3. Support Vector Classifier

 

  • We need to consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of
    • Greater robustness to individual observations
    • Better classification of most of the training observations
  • It could be worthwhile to misclassify a few training observations in order to do a better job in classifying the remaining observations.
  • The support vector classifier (soft margin classifier) allows some observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane.
  • The margin is soft because it can be violated by some of the training observations.
  • Optimization of SVC :
    • \(\text{maximize}_{\beta_0, \beta_1, ..., \beta_p, \epsilon_1, ..., \epsilon_n} \; M\) subject to \(\sum_{j=1}^p \beta_j^2 = 1\)
    • \(y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}) \geq M(1-\epsilon_i)\)
    • \(\epsilon_i \geq 0\), and \(\sum_{i=1}^n \epsilon_i \leq C\)
    • \(C\) is a nonnegative tuning parameter: a budget for how much the margin can be violated (see the sketch below).
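
In e1071, the cost argument of svm() plays the role of the tuning parameter and is inversely related to the budget \(C\) above: a small cost allows many margin violations (many support vectors), while a large cost allows few. A minimal sketch on simulated data (values chosen only for illustration):

# Sketch: effect of the cost argument in e1071 (inversely related to the budget C)
library(e1071)
set.seed(2)
x <- matrix(rnorm(40*2), ncol=2)
y <- c(rep(-1, 20), rep(1, 20))
x[y==1, ] <- x[y==1, ] + 1
dat <- data.frame(x, y=as.factor(y))
fit.small <- svm(y~., data=dat, kernel="linear", cost=0.01, scale=FALSE)   # wide, soft margin
fit.large <- svm(y~., data=dat, kernel="linear", cost=100, scale=FALSE)    # narrow margin
c(small.cost=length(fit.small$index), large.cost=length(fit.large$index))  # number of support vectors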

 

3.1 Margins and Slack Variables

  • \(M\) is the width of the margin.
  • \(\epsilon_1, ..., \epsilon_n\) are slack variables.
    • If \(\epsilon_i = 0\), the \(i\)th obs. is on the correct side of the margin.
    • If \(\epsilon_i > 0\), the \(i\)th obs. is on the wrong side of the margin.
    • If \(\epsilon_i > 1\), the \(i\)th obs. is on the wrong side of the hyperplane.

 

3.2 [Ex] Support Vector Classifier

# Simple example (simulate data set)
set.seed(1)
x <- matrix(rnorm(20*2), ncol=2)
y <- c(rep(-1, 10), rep(1, 10))
x[y==1, ] <- x[y==1, ] + 1
plot(x, col=(3-y), pch=19, xlab="X1", ylab="X2")

# Support vector classifier with cost=10
library(e1071)
# y must be coded as a factor so that svm() performs classification (labels here are -1/1)
dat <- data.frame(x, y=as.factor(y))
# Tuning parameter cost (inversely related to the budget C)
svmfit <- svm(y~., data=dat, kernel="linear", cost=10, scale=FALSE)
plot(svmfit, dat)
summary(svmfit)
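
A follow-up sketch (not in the original post): svmfit$index holds the row indices of the support vectors, and the decision values \(f(x_i)\) can be used to mirror the slack-variable cases in 3.1, since svm() scales the margin boundaries to \(f(x) = \pm 1\) (the dashed lines plotted in 3.3).

# Which observations are support vectors?
svmfit$index
# Decision values f(x_i) for the training data
fx <- drop(attributes(predict(svmfit, dat, decision.values=TRUE))$decision.values)
# The sign convention of the decision values depends on the internal class ordering,
# so align them with the numeric labels y before checking margin violations
if (cor(fx, y) < 0) fx <- -fx
sum(y*fx < 1)   # on the wrong side of the margin (epsilon_i > 0)
sum(y*fx < 0)   # on the wrong side of the hyperplane (epsilon_i > 1)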

 

 

 

3.3 [Ex] Support Vector Classifier with different margins

# Support vector classifier with a smaller cost (cost=0.1)
svmfit <- svm(y~., data=dat, kernel="linear", cost=0.1, scale=FALSE)
# Indices of the support vectors
svmfit$index
# Recover the hyperplane coefficients from the fitted model (linear kernel)
beta <- drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 <- svmfit$rho

# Visualize results: decision boundary (solid line) and margins (dashed lines)
plot(x, col=(3-y), pch=19, xlab="X1", ylab="X2")
points(x[svmfit$index, ], pch=5, cex=2)   # mark the support vectors
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 - 1) / beta[2], -beta[1] / beta[2], lty = 2)
abline((beta0 + 1) / beta[2], -beta[1] / beta[2], lty = 2)
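
As a sanity check (a sketch, not in the original post): for a linear kernel the decision function fitted by e1071 is \(f(x) = \beta^T x - \rho\), so the manual decision values computed from beta and beta0 above should agree with the values returned by predict().

# Sanity check: manual decision values x %*% beta - beta0 should agree with
# the decision values returned by predict()
dec.manual <- drop(x %*% beta) - beta0
dec.svm <- drop(attributes(predict(svmfit, dat, decision.values=TRUE))$decision.values)
head(cbind(dec.manual, dec.svm))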

 

 

3.4 [Ex] Support Vector Classifier with hyperparameter tuning

# Cross validation to find the optimal tuning parameter (cost) 
# Training using tune function 
set.seed(1)
tune.out <- tune(svm, y~., data=dat, kernel="linear",
                 ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
bestmod <- tune.out$best.model
summary(bestmod)

# Generate test set 
set.seed(4321)
xtest <- matrix(rnorm(20*2), ncol=2)
ytest <- sample(c(-1,1), 20, rep=TRUE)
xtest[ytest==1, ] <- xtest[ytest==1, ] + 1
testdat <- data.frame(xtest, y=as.factor(ytest))

# Compute the misclassification rate for the optimal model
ypred <- predict(bestmod, testdat)
table(predict=ypred, truth=testdat$y)
mean(ypred!=testdat$y)

 

  • Misclassification error rate : 0.15
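
A short follow-up (a sketch): the tune object also stores the cross-validation error for every candidate cost, which shows why this model was selected as the best one.

# Cross-validation error and dispersion for each candidate cost
tune.out$performances
# Cost value selected by 10-fold cross-validation
tune.out$best.parameters
plot(tune.out)   # CV error as a function of cost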

 

3.5 [Ex] Simulation Study of Support Vector Classifier

# Load required packages
library(mnormt)
library(e1071)

# Function for calculating Misclassification Rate of SVC
SVC.MCR <- function(x.tran, x.test, y.tran, y.test,
                    cost=c(0.01,0.1,1,10,100)) {
  dat <- data.frame(x.tran, y=as.factor(y.tran))
  testdat <- data.frame(x.test, y=as.factor(y.test))
  MCR <- rep(0, length(cost)+1)
  for (i in 1:length(cost)) {
    svmfit <- svm(y~., data=dat, kernel="linear",
                  cost=cost[i])
    MCR[i] <- mean(predict(svmfit, testdat)!=testdat$y)
  }
  tune.out <- tune(svm, y~., data=dat, kernel="linear",
                   ranges=list(cost=cost))
  pred <- predict(tune.out$best.model, testdat)
  MCR[length(cost)+1] <- mean(pred!=testdat$y)
  MCR
}

# Simulation test for 100 replications
set.seed(123)
K <- 100
RES <- matrix(NA, K, 6)
colnames(RES) <- c("0.01", "0.1", "1", "10", "100", "CV")
for (i in 1:K) {
  x.A <- rmnorm(100, rep(0, 2), matrix(c(1,-0.5,-0.5,1),2))
  x.B <- rmnorm(100, rep(1, 2), matrix(c(1,-0.5,-0.5,1),2))
  x.tran <- rbind(x.A[1:50, ], x.B[1:50, ])
  x.test <- rbind(x.A[-c(1:50), ], x.B[-c(1:50), ])
  y.tran <- factor(rep(0:1, each=50))
  y.test <- factor(rep(0:1, each=50))
  RES[i,] <- SVC.MCR(x.tran, x.test, y.tran, y.test)
}
apply(RES, 2, summary)
boxplot(RES, boxwex=0.5, col=2:7,
        names=c("0.01", "0.1", "1", "10", "100", "CV"),
        main="", ylab="Classification Error Rates")

 

 

  • The models with cost = 10 or 100 achieve the lowest misclassification error rates on the test sets.