January 13th, 2019
Naive Bayes is a popular supervised machine learning algorithm for classifying documents. It is often used to detect spam emails or for sentiment analysis. Its advantages are that it requires only a small amount of training data to estimate the necessary parameters, and that it is fast compared to other methods.
Naive Bayes uses Bayes’ Theorem to find the probability that words come from a certain class of documents. For example, you could use the algorithm to identify the author of a book. Bayes’ Theorem reads as follows:

P(A | B) = P(B | A) · P(A) / P(B)
For example, suppose you want to find out the probability that the word freedom comes from a certain author. You know that the probability of the word freedom in the corpus is 0.02%. You also know that the probability of the word freedom, given that the author is Mr X, is 0.05%. Finally, the prior probability that a document comes from Mr X is 25%. In other words:

P(Mr X | freedom) = P(freedom | Mr X) · P(Mr X) / P(freedom)
Filling in the numbers, we get the following posterior probability that the author is Mr X given the word freedom:

P(Mr X | freedom) = (0.0005 × 0.25) / 0.0002 = 0.625
Hence, there is a 62.5% chance that the author is Mr X, given the word freedom.
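The arithmetic is easy to check in a couple of lines of R (the values below are the illustrative percentages from the text, converted to decimals):

```r
p_word_given_x <- 0.0005  # P(freedom | Mr X) = 0.05%
p_x            <- 0.25    # P(Mr X)           = 25%
p_word         <- 0.0002  # P(freedom)        = 0.02%

posterior <- p_word_given_x * p_x / p_word
posterior  # 0.625, i.e. 62.5%
```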
Naive Bayes uses this formula to calculate the probability that a document belongs to a certain class.
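To make the document case concrete, here is a minimal sketch of how a document gets scored: the log prior of each class plus the summed log likelihoods of its words, with add-one (Laplace) smoothing. The word counts and class names are made up for illustration and have nothing to do with the data used below; the sketch also assumes every document word is in the toy vocabulary:

```r
# Toy word counts per class (hypothetical numbers).
counts <- list(
  hume  = c(liberty = 30, custom = 50, method = 5),
  locke = c(liberty = 40, custom = 10, method = 20)
)
priors <- c(hume = 0.5, locke = 0.5)

# log P(class) + sum of log P(word | class), with add-one smoothing.
score <- function(words, class) {
  cc <- counts[[class]]
  log(priors[class]) +
    sum(log((cc[words] + 1) / (sum(cc) + length(cc))))
}

doc <- c("liberty", "custom", "custom")
scores <- sapply(names(counts), function(cl) score(doc, cl))
names(which.max(scores))  # "hume"
```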
As an example, we will next use the Naive Bayes algorithm to determine which authors have written which books. As a corpus we use the collected works of the philosophers David Hume, Sir Francis Bacon, René Descartes and John Locke.
```r
library(quanteda)
library(quanteda.corpora)
library(tidyverse)
library(lubridate)
library(gutenbergr)
library(caret)
library(tidytext)
library(stopwords)
```
First, we have to build the corpus. With the help of the gutenbergr package we can simply extract the books of the individual authors from the Project Gutenberg collection and save them in data frames:
```r
hume <- gutenberg_works(author == "Hume, David")
bacon <- gutenberg_works(author == "Bacon, Francis")
descartes <- gutenberg_works(author == "Descartes, René")
locke <- gutenberg_works(author == "Locke, John")

hume_corpus <- gutenberg_download(hume$gutenberg_id, meta_fields = c("author", "title"))
bacon_corpus <- gutenberg_download(bacon$gutenberg_id, meta_fields = c("author", "title"))
descartes_corpus <- gutenberg_download(descartes$gutenberg_id, meta_fields = c("author", "title"))
locke_corpus <- gutenberg_download(locke$gutenberg_id, meta_fields = c("author", "title"))
```
Next, we can put all the data frames together and break them down into their individual words. Usually we are not interested in stopwords like a or the when analyzing texts, so we filter them out of the data set. After removing the stopwords, we reassemble the text to create a corpus:
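On a toy example, this is what the tokenization and stopword filtering do (the sentence is just a stand-in; `unnest_tokens` produces one row per lower-cased word, and `anti_join` removes every word that appears in the stopword list):

```r
library(dplyr)
library(tibble)
library(tidytext)
library(stopwords)

toy <- tibble(text = "Custom is the great guide of human life")

stopword_list <- tibble(
  word = stopwords::stopwords(language = "en", source = "stopwords-iso")
)

toy %>%
  unnest_tokens(word, text) %>%           # one row per lower-cased word
  anti_join(stopword_list, by = "word")   # drops words like "is", "the", "of"
```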
```r
stopwords <- tibble(word = stopwords::stopwords(language = "en", source = "stopwords-iso"))

books_unnested <- bind_rows(hume_corpus, bacon_corpus, locke_corpus, descartes_corpus) %>%
  unnest_tokens(word, text) %>%
  anti_join(stopwords, by = "word")

books_cleaned <- books_unnested %>%
  group_by(author, title) %>%
  nest(word) %>%
  mutate(
    text = map(data, unlist),
    text = map_chr(text, paste, collapse = " ")
  ) %>%
  select(-data)

# Build corpus
corpus <- corpus(books_cleaned)

# Add id variable to corpus
docvars(corpus, "id_numeric") <- 1:ndoc(corpus)
```
As always, we split the data set into a training set and a test set. We do not split at random here, because we want to hold out specific books for testing:
Naive Bayes can only consider features that occur in both the training set and the test set, so we restrict the test set to words that occurred in the training data.
```r
id_test <- c(2, 14, 21, 24)

training_dfm <- corpus_subset(corpus, !id_numeric %in% id_test) %>%
  dfm(stem = TRUE)

testing_dfm <- corpus_subset(corpus, id_numeric %in% id_test) %>%
  dfm(stem = TRUE) %>%
  dfm_select(pattern = training_dfm, selection = "keep")
```
The Naive Bayes classifier is fairly easy to build:
```r
naive_bayes <- textmodel_nb(training_dfm, docvars(training_dfm, "author"))
summary(naive_bayes)
```
```
Call:
textmodel_nb.dfm(x = training_dfm, y = docvars(training_dfm, "author"))

Class Priors:
(showing first 2 elements)
Descartes, René     Hume, David
            0.5             0.5

Estimated Feature Scores:
                illustr frontispiece.jpg portrait   hume titlepage.jpg boadicea harangu  briton histori england  invas julius caesar
Descartes, René  0.5362           0.7762   0.7762 0.1336        0.7762   0.6343  0.4644 0.06941  0.3024 0.01498  0.131 0.5362 0.2782
Hume, David      0.4638           0.2238   0.2238 0.8664        0.2238   0.3657  0.5356 0.93059  0.6976 0.98502  0.869 0.4638 0.7218
                  reign   jame  david    esq   1688 london  virtu   citi   road   ivi   lane   york      26    john street   1860
Descartes, René 0.03771 0.5362 0.3479 0.5811 0.7762 0.1187 0.1068 0.1759 0.5811 0.874 0.7762 0.2106 0.3867 0.05298 0.6343 0.7762
Hume, David     0.96229 0.4638 0.6521 0.4189 0.2238 0.8813 0.8932 0.8241 0.4189 0.126 0.2238 0.7894 0.6133 0.94702 0.3657 0.2238
                philadelphia
Descartes, René       0.7762
Hume, David           0.2238
```
Obviously, the front matter of the books is still in our corpus; features such as frontispiece.jpg and titlepage.jpg show up in the estimated scores. We should have removed it in advance.
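A rough way to mitigate this after the fact, without re-downloading anything, is to drop tokens that are obviously front-matter residue, such as image file names and bare numbers. This is only a heuristic sketch on hypothetical tokens, not a proper cleanup:

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical leftover tokens next to real (stemmed) words.
tokens <- tibble(word = c("frontispiece.jpg", "titlepage.jpg", "1860", "custom", "liberti"))

tokens %>%
  filter(
    !str_detect(word, "\\.(jpg|png)$"),  # image file names
    !str_detect(word, "^[0-9]+$")        # bare numbers (years, page numbers)
  )
```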
```r
actual_class <- docvars(testing_dfm, "author")
predicted_class <- predict(naive_bayes, newdata = testing_dfm)
u <- union(predicted_class, actual_class)
(class_table <- table(factor(actual_class, u), factor(predicted_class, u)))
```
```
                  Hume, David Descartes, René Bacon, Francis Locke, John
  Hume, David              10               0              0           0
  Descartes, René           0               2              0           0
  Bacon, Francis            1               5              0           0
  Locke, John               1               2              0           0
```
Looking at the confusion matrix, we can be quite confident in identifying books by David Hume. The main problem with our approach is that our training and test data sets are too small. For example, not a single book was predicted to have been written by Francis Bacon or John Locke.
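The headline accuracy can be reproduced directly from the confusion matrix above: correct predictions sit on the diagonal, so accuracy is the diagonal sum over the total number of test documents.

```r
cm <- matrix(c(10, 0, 0, 0,
                0, 2, 0, 0,
                1, 5, 0, 0,
                1, 2, 0, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(
               actual    = c("Hume", "Descartes", "Bacon", "Locke"),
               predicted = c("Hume", "Descartes", "Bacon", "Locke")
             ))

sum(diag(cm)) / sum(cm)  # 12 / 21 ≈ 0.5714
```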
```r
confusionMatrix(class_table, mode = "everything")
```
```
Confusion Matrix and Statistics

                  Hume, David Descartes, René Bacon, Francis Locke, John
  Hume, David              10               0              0           0
  Descartes, René           0               2              0           0
  Bacon, Francis            1               5              0           0
  Locke, John               1               2              0           0

Overall Statistics

               Accuracy : 0.5714
                 95% CI : (0.3402, 0.7818)
    No Information Rate : 0.5714
    P-Value [Acc > NIR] : 0.5909

                  Kappa : 0.3762

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: Hume, David Class: Descartes, René Class: Bacon, Francis Class: Locke, John
Sensitivity                      0.8333                0.22222                    NA                 NA
Specificity                      1.0000                1.00000                0.7143             0.8571
Pos Pred Value                   1.0000                1.00000                    NA                 NA
Neg Pred Value                   0.8182                0.63158                    NA                 NA
Precision                        1.0000                1.00000                0.0000             0.0000
Recall                           0.8333                0.22222                    NA                 NA
F1                               0.9091                0.36364                    NA                 NA
Prevalence                       0.5714                0.42857                0.0000             0.0000
Detection Rate                   0.4762                0.09524                0.0000             0.0000
Detection Prevalence             0.4762                0.09524                0.2857             0.1429
Balanced Accuracy                0.9167                0.61111                    NA                 NA
```
Overall, we achieved a mediocre accuracy of 57%, which is still better than chance (25% with four authors), but far from perfect. Since this is my first attempt at applying a Naive Bayes classifier to textual data, I am pretty happy with any accuracy. Next, I need to find ways to improve the results. But not today :)