Naive Bayes is a popular supervised machine learning algorithm for classifying documents. It is often used to detect spam emails or for sentiment analysis. Its advantage is that it requires only a small amount of training data to estimate the necessary parameters, and it is fast compared to other methods.

Naive Bayes uses Bayes’ Theorem to determine the probability that words come from a certain class of documents. For example, you could use the algorithm to identify the author of a book. Bayes’ Theorem works as follows:

P(c|x) = \frac{P(x|c) * P(c)}{P(x)}

For example, suppose you want to find out the probability that the word freedom comes from a certain author. You know that the probability of the word freedom in the corpus as a whole is 0.02%. You also know that the probability of the word freedom, given that the author is Mr. X, is 0.05%. Finally, the prior probability that a document comes from Mr. X is 25%. In other words:

P(Mr. X|freedom) = \frac{P(freedom|Mr. X) * P(Mr. X)}{P(freedom)}

Filling in the numbers, we get the following posterior probability that the author is Mr. X given the word freedom:

P(Mr. X|freedom) = \frac{0.0005 * 0.25}{0.0002} = 0.625

Hence, there is a 62.5% chance that the author is Mr. X, given the word freedom.

Naive Bayes uses this formula to calculate the probability that a document belongs to a certain class. For a whole document, it multiplies the conditional probabilities of all its words, naively assuming that the words occur independently of each other; this independence assumption is what makes the method "naive".
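To make the arithmetic concrete, here is the Mr. X example as a small R snippet (the probabilities are the made-up numbers from above):

# Worked example: posterior probability that the author is Mr. X,
# given that a document contains the word "freedom"
p_freedom_given_x <- 0.0005  # P(freedom | Mr. X) = 0.05%
p_x               <- 0.25    # prior P(Mr. X) = 25%
p_freedom         <- 0.0002  # P(freedom) in the whole corpus = 0.02%

(p_freedom_given_x * p_x) / p_freedom
#> [1] 0.625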

Detecting authors

As an example, we will next use a Naive Bayes algorithm to determine which author wrote which book. As our corpus, we use the collected works of the philosophers David Hume, Sir Francis Bacon, René Descartes, and John Locke.

library(quanteda)
library(quanteda.corpora)
library(tidyverse)
library(lubridate)
library(gutenbergr)
library(caret)
library(tidytext)
library(stopwords)

First we have to build the corpus. With the help of the gutenbergr package we can simply extract the books of the individual authors from Project Gutenberg and save them in data frames:

hume <- gutenberg_works(author == "Hume, David")
bacon <- gutenberg_works(author == "Bacon, Francis")
descartes <- gutenberg_works(author == "Descartes, René")
locke <- gutenberg_works(author == "Locke, John")

hume_corpus <- gutenberg_download(hume$gutenberg_id, 
                                  meta_fields = c("author", "title"))
bacon_corpus <- gutenberg_download(bacon$gutenberg_id, 
                                   meta_fields = c("author", "title"))
descartes_corpus <- gutenberg_download(descartes$gutenberg_id, 
                                   meta_fields = c("author", "title"))
locke_corpus <- gutenberg_download(locke$gutenberg_id, 
                                   meta_fields = c("author", "title"))

Next, we can combine all the data frames and break the text down into individual words. Usually we are not interested in stopwords like a or the when analyzing texts, so we filter them out of the data set. After removing the stopwords, we reassemble the text to create a corpus:

stopwords <- tibble(word = stopwords::stopwords(language = "en", 
                                                source = "stopwords-iso"))

books_unnested <- bind_rows(hume_corpus,
                       bacon_corpus, locke_corpus,
                       descartes_corpus) %>%
  unnest_tokens(word, text) %>%
  anti_join(stopwords, by = "word")

books_cleaned <- books_unnested %>%
  group_by(author, title) %>%
  nest(word) %>%
  mutate(
    text = map(data, unlist),
    text = map_chr(text, paste, collapse = " ")
  ) %>%
  select(-data)

# Build corpus
corpus <- corpus(books_cleaned)

# Add id variable to corpus
docvars(corpus, "id_numeric") <- 1:ndoc(corpus)

As always, we split the dataset into a training set and a test set. We do not split at random here, because we want to hold out specific books for testing:

  • David Hume: Dialogues Concerning Natural Religion
  • Francis Bacon: The Essays or Counsels, Civil and Moral
  • John Locke: An Essay Concerning Humane Understanding, Volume 1 MDCXC, Based on the 2nd Edition, Books 1 and 2
  • René Descartes: A Discourse of a Method for the Well Guiding of Reason

Naive Bayes can only consider features that occur in both the training set and the test set, so we must restrict the test set to words that occurred in the training set.

# IDs of the books we hold out for testing
id_test <- c(2, 14, 21, 24)

# Training dfm: all books except the held-out ones
training_dfm <- corpus_subset(corpus, !id_numeric %in% id_test) %>%
  dfm(stem = TRUE)

# Test dfm: the held-out books, keeping only features
# that also occur in the training dfm
testing_dfm <- corpus_subset(corpus, id_numeric %in% id_test) %>%
  dfm(stem = TRUE) %>%
  dfm_select(pattern = training_dfm, 
             selection = "keep")

The Naive Bayes classifier is fairly easy to build:

naive_bayes <- textmodel_nb(training_dfm, docvars(training_dfm, "author"))
summary(naive_bayes)
Call:
textmodel_nb.dfm(x = training_dfm, y = docvars(training_dfm, 
    "author"))

Class Priors:
(showing first 2 elements)
Descartes, René     Hume, David 
            0.5             0.5 

Estimated Feature Scores:
                illustr frontispiece.jpg portrait   hume titlepage.jpg boadicea harangu  briton histori england invas julius caesar
Descartes, René  0.5362           0.7762   0.7762 0.1336        0.7762   0.6343  0.4644 0.06941  0.3024 0.01498 0.131 0.5362 0.2782
Hume, David      0.4638           0.2238   0.2238 0.8664        0.2238   0.3657  0.5356 0.93059  0.6976 0.98502 0.869 0.4638 0.7218
                  reign   jame  david    esq   1688 london  virtu   citi   road   ivi   lane   york     26    john street   1860
Descartes, René 0.03771 0.5362 0.3479 0.5811 0.7762 0.1187 0.1068 0.1759 0.5811 0.874 0.7762 0.2106 0.3867 0.05298 0.6343 0.7762
Hume, David     0.96229 0.4638 0.6521 0.4189 0.2238 0.8813 0.8932 0.8241 0.4189 0.126 0.2238 0.7894 0.6133 0.94702 0.3657 0.2238
                philadelphia
Descartes, René       0.7762
Hume, David           0.2238

Obviously, the front matter of the books (title pages, frontispieces, and so on) is still in our corpus. We should have removed it in advance.
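One hedged way to mitigate this in a future run: drop the first block of lines of each downloaded book before tokenizing, since title pages and tables of contents sit at the top. A minimal sketch; the cutoff of 100 lines is an arbitrary assumption, not a value verified against these editions:

books_trimmed <- bind_rows(hume_corpus, bacon_corpus,
                           locke_corpus, descartes_corpus) %>%
  group_by(gutenberg_id) %>%  # one group per downloaded book
  slice(-(1:100)) %>%         # drop the first 100 lines (assumed cutoff)
  ungroup()

The trimmed data frame could then replace the bind_rows() call in the unnesting step above.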

actual_class <- docvars(testing_dfm, "author")
predicted_class <- predict(naive_bayes, newdata = testing_dfm)

# Use the union of observed classes so the table has
# identical rows and columns
u <- union(predicted_class, actual_class)
(class_table <- table(factor(actual_class, u), factor(predicted_class, u)))
                  Hume, David Descartes, René Bacon, Francis Locke, John
  Hume, David              10               0              0           0
  Descartes, René           0               2              0           0
  Bacon, Francis            1               5              0           0
  Locke, John               1               2              0           0

If we look at the confusion matrix, we can be quite confident in detecting books by David Hume. The main problem with our approach is that our training and test data sets are too small. For example, not a single book was predicted to have been written by Francis Bacon or John Locke.

confusionMatrix(class_table, mode = "everything")
Confusion Matrix and Statistics

                 
                  Hume, David Descartes, René Bacon, Francis Locke, John
  Hume, David              10               0              0           0
  Descartes, René           0               2              0           0
  Bacon, Francis            1               5              0           0
  Locke, John               1               2              0           0

Overall Statistics
                                          
               Accuracy : 0.5714          
                 95% CI : (0.3402, 0.7818)
    No Information Rate : 0.5714          
    P-Value [Acc > NIR] : 0.5909          
                                          
                  Kappa : 0.3762          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Hume, David Class: Descartes, René Class: Bacon, Francis Class: Locke, John
Sensitivity                      0.8333                0.22222                    NA                 NA
Specificity                      1.0000                1.00000                0.7143             0.8571
Pos Pred Value                   1.0000                1.00000                    NA                 NA
Neg Pred Value                   0.8182                0.63158                    NA                 NA
Precision                        1.0000                1.00000                0.0000             0.0000
Recall                           0.8333                0.22222                    NA                 NA
F1                               0.9091                0.36364                    NA                 NA
Prevalence                       0.5714                0.42857                0.0000             0.0000
Detection Rate                   0.4762                0.09524                0.0000             0.0000
Detection Prevalence             0.4762                0.09524                0.2857             0.1429
Balanced Accuracy                0.9167                0.61111                    NA                 NA

Overall, we achieved a mediocre accuracy of 57%. That is better than randomly guessing one of the four authors (25%), but it only matches the no-information rate of 57%, so the classifier does no better than always predicting the most frequent class. Since this is my first attempt to apply a Naive Bayes classifier to textual data, I am pretty happy with any accuracy. Next, I need to find ways to improve my results. But not today :)
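For whenever that day comes, one simple lever might be to prune very rare features before training, since many of them stem from front matter and proper names. A minimal sketch, assuming the training_dfm from above; the thresholds are illustrative assumptions, not tuned values:

# Sketch: drop features that occur fewer than 5 times overall or in
# fewer than 2 training documents; both thresholds are assumptions
training_dfm_trimmed <- dfm_trim(training_dfm,
                                 min_termfreq = 5,
                                 min_docfreq = 2)
naive_bayes_trimmed <- textmodel_nb(training_dfm_trimmed,
                                    docvars(training_dfm_trimmed, "author"))

Whether this actually helps would have to be checked against the same held-out books.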