1. What is Support Vector Machine?
- Find a plane that separates the classes in feature space.
- If that is not possible, we soften what we mean by "separates", and we enrich and enlarge the feature space so that separation becomes possible.
- Three methods for SVM
- Maximum Margin Classifier
- Support Vector Classifier (SVC)
- Support Vector Machine (SVM)
1.1 Hyperplane
- A hyperplane in p dimensions is a flat affine subspace of dimension p-1.
- General equation for a hyperplane : \(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p = 0\)
- If \(X = (X_1, ..., X_p)^T\) satisfies the equation above, then \(X\) lies on the hyperplane.
- If \(X = (X_1, ..., X_p)^T\) does not satisfy it, then \(X\) lies to one side of the hyperplane.
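- For example, with \(p = 2\) the hyperplane is just a line. A minimal sketch in R (the coefficients, points, and the helper hyperplane_value() below are made up for illustration):
# Hypothetical hyperplane in p = 2 dimensions: 1 + 2*X1 - 3*X2 = 0
beta <- c(1, 2, -3)
hyperplane_value <- function(x) beta[1] + sum(beta[-1] * x)
hyperplane_value(c(1, 1))   #  0 -> the point lies on the hyperplane
hyperplane_value(c(2, 1))   #  2 -> the point lies on one side
hyperplane_value(c(0, 1))   # -2 -> the point lies on the other side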
1.2 Separating Hyperplanes
- If \(f(X) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p\), then \(f(X) > 0\) for points on one side of the hyperplane, and \(f(X) < 0\) for points on the other.
- If a separating hyperplane exists, we can use it to construct a very natural classifier.
- For a test observation \(x^{*}\) : \(f(x^*) = \beta_0 + \beta_1 x_1^* + ... + \beta_p x_p^*\)
- If \(f(x^*) > 0\), we assign the test observation to class 1.
- If \(f(x^*) < 0\), we assign the test observation to class 2.
- If \(|f(x^*)|\) is relatively large, the class assignment is confident.
- If \(|f(x^*)|\) is relatively small, the class assignment is less confident.
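- A toy illustration of this rule in R (the coefficients, test points, and the helper classify() are invented for illustration, not taken from a fitted model):
# Toy decision rule: classify by the sign of f(x*), use |f(x*)| as confidence
beta0 <- -1
beta  <- c(2, 3)
f <- function(x) beta0 + sum(beta * x)
classify <- function(x) {
  fx <- f(x)
  c(f = fx,
    class = ifelse(fx > 0, 1, 2),   # sign determines the class
    confidence = abs(fx))           # magnitude reflects confidence
}
classify(c(2, 2))       # f =  9   -> class 1, far from the hyperplane (confident)
classify(c(0.3, 0.2))   # f =  0.2 -> class 1, close to the hyperplane (less confident)
classify(c(-1, -1))     # f = -6   -> class 2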
2. Maximal Margin Classifier
- The maximal margin hyperplane is the separating hyperplane that is farthest from the training observations.
- Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes.
- Constrained optimization problem :
- \(maximize_{\beta_0, \beta_1, ..., \beta_p} M\) subject to \(\sum_{j=1}^p \beta_j^2 = 1\)
- \(y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}) \geq M\) for all \(i = 1, ..., n\)
- The function svm() in the R package e1071 solves this problem efficiently.
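- On data that are perfectly separable, the maximal margin solution can be approximated with svm() by making the cost of margin violations very large, as in the sketch below (the simulated data, the shift of 3, and the cost value are illustrative choices):
# Sketch: approximate the maximal margin classifier with svm() and a huge cost
library(e1071)
set.seed(10)
x <- matrix(rnorm(20*2), ncol=2)
y <- c(rep(-1, 10), rep(1, 10))
x[y==1, ] <- x[y==1, ] + 3          # large shift so the two classes are separable
dat <- data.frame(x, y=as.factor(y))
# A very large cost leaves essentially no room for margin violations,
# so the fitted hyperplane approximates the maximal margin hyperplane
mmfit <- svm(y~., data=dat, kernel="linear", cost=1e5, scale=FALSE)
summary(mmfit)
plot(mmfit, dat)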
2.1 The Non-separable Case
- The maximal margin classifier is a very natural way to perform classification.
- However, in many cases no separating hyperplane exists, and so there is no maximal margin classifier.
- If a separating hyperplane doesn't exist, the optimization problem has no solution with \(M > 0\).
- However, we can develop a hyperplane that almost separates the classes, using a so-called soft margin.
- The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier.
3. Support Vector Classifier
- We need to consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of
- Greater robustness to individual observations
- Better classification of most of the training observations
- It could be worthwhile to misclassify a few training observations in order to do a better job of classifying the remaining observations.
- The support vector classifier (soft margin classifier) allows some observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane.
- The margin is soft because it can be violated by some of the training observations.
- Optimization of SVC :
- \(maximize_{\beta_0, \beta_1, ..., \beta_p, \epsilon_1, ..., \epsilon_n} M\) subject to \(\sum_{j=1}^p \beta_j^2 = 1\)
- \(y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}) \geq M(1-\epsilon_i)\)
- \(\epsilon_i \geq 0\), and \(\sum_{i=1}^n \epsilon_i \leq C\)
- \(C\) is a nonnegative tuning parameter (the budget for the total amount of slack), generally chosen by cross-validation.
3.1 Margins and Slack Variables
- \(M\) is the width of the margin.
- \(\epsilon_1, ..., \epsilon_n\) are slack variables.
- If \(\epsilon_i = 0\), the \(i\)th obs. is on the correct side of the margin.
- If \(\epsilon_i > 0\), the \(i\)th obs. is on the wrong side of the margin.
- If \(\epsilon_i > 1\), the \(i\)th obs. is on the wrong side of the hyperplane (a sketch for recovering the \(\epsilon_i\) from a fitted classifier follows the example in 3.2).
3.2 [Ex] Support Vector Classifier
# Simple example (simulate data set)
set.seed(1)
x <- matrix(rnorm(20*2), ncol=2)
y <- c(rep(-1, 10), rep(1, 10))
# Shift class 1 so that the two classes overlap (not linearly separable)
x[y==1, ] <- x[y==1, ] + 1
plot(x, col=(3-y), pch=19, xlab="X1", ylab="X2")
# Support vector classifier with cost=10
library(e1071)
# Encode the response as a factor so that svm() performs classification
dat <- data.frame(x, y=as.factor(y))
# The cost argument acts inversely to the budget C in the formulation above
svmfit <- svm(y~., data=dat, kernel="linear", cost=10, scale=FALSE)
plot(svmfit, dat)
summary(svmfit)
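- The slack values \(\epsilon_i\) are not stored in the fitted object, but they can be recovered from the decision values, as sketched below. This reuses svmfit and dat from above, reads the positive class off the column name of the decision values, and relies on libsvm's internal scaling, in which the margin corresponds to \(|f(x)| = 1\):
# Sketch: recover the slack values epsilon_i = max(0, 1 - y_i * f(x_i))
# from the cost=10 fit above (libsvm scaling: the margin sits at |f(x)| = 1)
dv  <- attributes(predict(svmfit, dat, decision.values=TRUE))$decision.values
pos <- strsplit(colnames(dv), "/")[[1]][1]   # class treated as positive by libsvm
yi  <- ifelse(dat$y == pos, 1, -1)           # observed labels recoded as +1 / -1
eps <- pmax(0, 1 - yi * drop(dv))            # slack for each training observation
sum(eps > 0)   # observations that violate the margin
sum(eps > 1)   # observations on the wrong side of the hyperplane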
3.3 [Ex] Support Vector Classifier with different margins
# SVM of tuning parameter (cost=0.1)
svmfit <- svm(y~., data=dat, kernel="linear", cost=0.1,
scale=FALSE)
# Indices of the support vectors
svmfit$index
# Recover the linear decision boundary f(x) = beta1*x1 + beta2*x2 - rho
# from the support vectors and their coefficients
beta <- drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 <- svmfit$rho
# Visualize results
plot(x, col=(3-y), pch=19, xlab="X1", ylab="X2")
points(x[svmfit$index, ], pch=5, cex=2)
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 - 1) / beta[2], -beta[1] / beta[2], lty = 2)
abline((beta0 + 1) / beta[2], -beta[1] / beta[2], lty = 2)
3.4 [Ex] Support Vector Classifier with Hyperparameter Tuning
# Cross validation to find the optimal tuning parameter (cost)
# Training using tune function
set.seed(1)
tune.out <- tune(svm, y~., data=dat, kernel="linear",
ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
bestmod <- tune.out$best.model
summary(bestmod)
# Generate test set
set.seed(4321)
xtest <- matrix(rnorm(20*2), ncol=2)
ytest <- sample(c(-1,1), 20, rep=TRUE)
xtest[ytest==1, ] <- xtest[ytest==1, ] + 1
testdat <- data.frame(xtest, y=as.factor(ytest))
# Compute the misclassification rate of the optimal model on the test set
ypred <- predict(bestmod, testdat)
table(predict=ypred, truth=testdat$y)
mean(ypred!=testdat$y)
- Misclassification error rate : 0.15
3.5 [Ex] Simulation Study of Support Vector Classifier
# Prepare package
library(mnormt)
library(e1071)
# Function for calculating Misclassification Rate of SVC
SVC.MCR <- function(x.tran, x.test, y.tran, y.test,
cost=c(0.01,0.1,1,10,100)) {
dat <- data.frame(x.tran, y=as.factor(y.tran))
testdat <- data.frame(x.test, y=as.factor(y.test))
MCR <- rep(0, length(cost)+1)
for (i in 1:length(cost)) {
svmfit <- svm(y~., data=dat, kernel="linear",
cost=cost[i])
MCR[i] <- mean(predict(svmfit, testdat)!=testdat$y)
}
tune.out <- tune(svm, y~., data=dat, kernel="linear",
ranges=list(cost=cost))
pred <- predict(tune.out$best.model, testdat)
MCR[length(cost)+1] <- mean(pred!=testdat$y)
MCR
}
# Simulation test for 100 replications
set.seed(123)
K <- 100
RES <- matrix(NA, K, 6)
colnames(RES) <- c("0.01", "0.1" ,"1" , "10" ,"100", "CV")
for (i in 1:K) {
x.A <- rmnorm(100, rep(0, 2), matrix(c(1,-0.5,-0.5,1),2))
x.B <- rmnorm(100, rep(1, 2), matrix(c(1,-0.5,-0.5,1),2))
x.tran <- rbind(x.A[1:50, ], x.B[1:50, ])
x.test <- rbind(x.A[-c(1:50), ], x.B[-c(1:50), ])
y.tran <- factor(rep(0:1, each=50))
y.test <- factor(rep(0:1, each=50))
RES[i,] <- SVC.MCR(x.tran, x.test, y.tran, y.test)
}
apply(RES, 2, summary)
boxplot(RES, boxwex=0.5, col=2:7,
names=c("0.01", "0.1", "1", "10", "100", "CV"),
main="", ylab="Classification Error Rates")
- The models with cost = 10 and cost = 100 achieve the lowest misclassification error rates on the test sets.