February 1st, 2019

Support vector machines are a popular algorithm for supervised learning. They work well on data that is not too noisy (for noisy data, naive Bayes is often the better choice) and that has a clear margin of separation. For very large datasets, however, support vector machines are not the tool to reach for, since training scales poorly with the number of samples.

Support vector machines work by maximizing the margin between the points separated by a hyperplane. I will not go into detail on how this works; a number of tutorials have explained it in depth (see here and here).

Imagine you want to classify the species of a particular iris flower by the width and length of its sepal. So you have two features and one of three species to predict.

Obviously, you cannot separate these points with a straight line: points of different species overlap, so a non-linear separation is needed.
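To see the overlap, here is a quick ggplot2 sketch of the two sepal features, colored by species (the object name `p` is my own choice):

```r
library(ggplot2)

# Scatter plot of the two features we will use for classification:
# setosa sits apart, while versicolor and virginica overlap, so no
# single straight line can split all three species.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()
p
```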

Well, support vector machines have a trick, the kernel trick. Basically, support vector machines can generate non-linear decision boundaries by mapping a low-dimensional feature space (like our two features) into a high-dimensional one. Support vector machines have a bunch of parameters to tune:

- cost (c): With the cost parameter we adjust the smoothness of the decision boundary. Tuning it trades off a smooth separation against classifying every training point correctly.
- gamma: Gamma controls how far the influence of a single training example reaches. Low values of gamma mean a far reach, so even points far from the decision boundary shape it; high values mean only nearby points matter, which can lead to overfitting.
- kernel: By specifying the kernel we tell the algorithm what type of high-dimensional feature space to produce. There are linear, polynomial, radial basis and other kernels.
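As a rough sketch of how these parameters appear in the e1071 API (the values below are illustrative, not tuned):

```r
library(e1071)

# Linear kernel with a large cost: misclassifications are penalized
# heavily, so the boundary bends toward classifying every point.
fit_linear <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris,
                  kernel = "linear", cost = 10)

# Radial kernel with a low gamma: distant points still influence the
# boundary; raising gamma makes the fit increasingly local.
fit_radial <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris,
                  kernel = "radial", gamma = 0.1, cost = 1)

table(predicted = predict(fit_linear, iris), actual = iris$Species)
```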

Let’s build a support vector machine classifier in R. First, we load our packages, tag each row with an id, and split the data into a training and a testing set:

```
library(tidyverse)  # dplyr, tibble
library(e1071)      # svm(), tune(), naiveBayes()
library(caret)      # confusionMatrix()
set.seed(1)         # make the random split reproducible
(iris_with_id <- iris %>%
  rownames_to_column(var = "id") %>%
  as_tibble())
train <- iris_with_id %>% sample_frac(0.70)
test  <- anti_join(iris_with_id, train, by = "id")
```

```
# A tibble: 150 x 6
   id    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <chr>        <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1 1              5.1         3.5          1.4         0.2 setosa
 2 2              4.9         3            1.4         0.2 setosa
 3 3              4.7         3.2          1.3         0.2 setosa
 4 4              4.6         3.1          1.5         0.2 setosa
 5 5              5           3.6          1.4         0.2 setosa
 6 6              5.4         3.9          1.7         0.4 setosa
 7 7              4.6         3.4          1.4         0.3 setosa
 8 8              5           3.4          1.5         0.2 setosa
 9 9              4.4         2.9          1.4         0.2 setosa
10 10             4.9         3.1          1.5         0.1 setosa
# ... with 140 more rows
```

For this exercise we only use two features to predict the species: Sepal.Length and Sepal.Width. Since support vector machines have a bunch of tuning parameters, let’s use a grid search to find the best hyperparameter values:

```
tune(svm, Species ~ Sepal.Length + Sepal.Width, data = train,
     ranges = list(gamma = seq(0.5, 2, by = 0.5),
                   cost = 2^(2:8),
                   kernel = c("linear", "polynomial")),
     tunecontrol = tune.control(sampling = "fix"))
```

```
Parameter tuning of ‘svm’:

- sampling method: fixed training/validation set

- best parameters:
 gamma cost kernel
   0.5    4 linear

- best performance: 0.1714286
```

Next, we can train our SVM classifier and plot its decision regions. Note that we pick the polynomial kernel here to get a non-linear decision boundary, even though the grid search above rated the linear kernel best on this particular split:

```
svmfit <- svm(Species ~ Sepal.Length + Sepal.Width, data = train,
              gamma = 0.5, cost = 4,
              kernel = "polynomial", scale = FALSE)
plot(svmfit, data = test[, c(2, 3, 6)])
```

As you can see, we no longer split the data points by a straight line; the kernel trick allowed us to separate points that a line cannot. How accurate is our classifier on the testing set?

`confusionMatrix(test$Species, predict(svmfit, newdata = test))`

```
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         12         2
  virginica       0          4        13

Overall Statistics

               Accuracy : 0.8667
                 95% CI : (0.7321, 0.9495)
    No Information Rate : 0.3556
    P-Value [Acc > NIR] : 1.942e-12

                  Kappa : 0.8
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.7500           0.8667
Specificity                 1.0000            0.9310           0.8667
Pos Pred Value              1.0000            0.8571           0.7647
Neg Pred Value              1.0000            0.8710           0.9286
Prevalence                  0.3111            0.3556           0.3333
Detection Rate              0.3111            0.2667           0.2889
Detection Prevalence        0.3111            0.3111           0.3778
Balanced Accuracy           1.0000            0.8405           0.8667
```
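The headline accuracy is just the share of correct test predictions, which we can check by hand. A self-contained sketch (it rebuilds the model with its own random split, so the exact number will differ from the run above):

```r
library(e1071)

# Reproduce a 70/30 split and the polynomial-kernel fit from above.
set.seed(1)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

svmfit <- svm(Species ~ Sepal.Length + Sepal.Width, data = train,
              kernel = "polynomial", gamma = 0.5, cost = 4)

# Accuracy = fraction of test rows whose predicted species matches the
# actual one; the same quantity confusionMatrix() reports.
mean(predict(svmfit, newdata = test) == test$Species)
```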

Not too bad, *86.7%*. Let’s compare our SVM with a naive Bayes classifier:

```
# fit on the training rows only, so the comparison with the SVM is fair
naive_bayes <- naiveBayes(Species ~ Sepal.Length + Sepal.Width, data = train)
confusionMatrix(test$Species, predict(naive_bayes, newdata = test))
```

```
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         11         3
  virginica       0          4        13

Overall Statistics

               Accuracy : 0.8444
                 95% CI : (0.7054, 0.9351)
    No Information Rate : 0.3556
    P-Value [Acc > NIR] : 1.996e-11

                  Kappa : 0.7661
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.7333           0.8125
Specificity                 1.0000            0.9000           0.8621
Pos Pred Value              1.0000            0.7857           0.7647
Neg Pred Value              1.0000            0.8710           0.8929
Prevalence                  0.3111            0.3333           0.3556
Detection Rate              0.3111            0.2444           0.2889
Detection Prevalence        0.3111            0.3111           0.3778
Balanced Accuracy           1.0000            0.8167           0.8373
```

Pretty similar. Naive Bayes is slightly less accurate here, but it also trains faster.
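To back the speed claim with a rough measurement (wall-clock times vary by machine, so treat this as a sketch):

```r
library(e1071)

# naiveBayes() only estimates per-class means and standard deviations,
# while svm() solves a quadratic optimization problem, so the Bayes fit
# is typically the quicker of the two.
time_svm <- system.time(svm(Species ~ ., data = iris))["elapsed"]
time_nb  <- system.time(naiveBayes(Species ~ ., data = iris))["elapsed"]
c(svm = time_svm, naive_bayes = time_nb)
```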