Every Christmas my friends and I celebrate the tradition of Secret Santa. Every person comes up with a funny present, which is raffled off at the Christmas party. We are a circle of friends of over 19 people and have had a WhatsApp chat for years. A good time to gain some insights from this data.

Getting started

You can export the data relatively easily from WhatsApp. WhatsApp will send you a txt file in which each post is listed as a line. You can also export the media assets. For this analysis, however, I did not include them and stuck to the plain text format. Unfortunately the txt file is not well formatted, and some rows come without the necessary columns, as the excerpt below the setup code shows:

library(tidyverse)
library(tidytext)
library(hrbrthemes)
library(lubridate)
library(stopwords)
theme_set(theme_ipsum())

stopwords <- tibble(word = stopwords::stopwords(language = "de", source = "stopwords-iso"))
date -  post
04.01.14, 17:32 - Person1: Secret text
04.01.14, 18:53 - Person2: Secret text
04.01.14, 18:54 - Person3: Secret text
Damn that's bad formatting. 
04.01.14, 19:11 - Person4: Secret text
04.01.14, 19:20 - Person5: Secret text
04.01.14, 19:21 - Person4: Secret text

I couldn’t for the life of me figure out how to parse lines like the one without a date, so we have to do without them in the following analysis. Since the export did not include a header, I inserted one manually (the "date - post" line at the top of the excerpt).
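For readers who want to keep such lines, here is one possible approach. It is a minimal sketch and is not used in the rest of this post. It assumes that every real message starts with a timestamp of the form "dd.mm.yy, HH:MM - " and that any other line continues the previous message. The vector raw_lines is made-up example data, and the sketch uses base R only.

```r
# Toy excerpt of an export; the third line continues the second message.
raw_lines <- c(
  "04.01.14, 18:53 - Person2: Secret text",
  "04.01.14, 18:54 - Person3: Secret text",
  "Damn that's bad formatting.",
  "04.01.14, 19:11 - Person4: Secret text"
)

# A line starts a new message if it begins with "dd.mm.yy, HH:MM - ".
starts_message <- grepl(
  "^[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}, [0-9]{2}:[0-9]{2} - ",
  raw_lines
)

# cumsum() gives every line the index of the message it belongs to.
message_id <- cumsum(starts_message)

# Glue the lines of each message back together into one string.
messages <- as.vector(tapply(raw_lines, message_id, paste, collapse = " "))
```

The repaired lines could then be written back to a file with writeLines() and read in the same way as the original export.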

Cleaning the dataset

Before we start analyzing the data, there is some cleaning to do. First, we need to turn the two columns into three: the post variable initially contains both the name of the person and the post itself, and the name should be a column of its own. Since the data is not well formatted, we also keep only rows whose date column starts with two digits. That way we can be fairly confident that we do not keep badly formatted rows. The same problem occurs for the post variable, so we keep only rows whose post starts with a letter.

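whatsapp <- read_delim("whatsapp.txt", delim = "-") %>%
  setNames(c("date", "post")) %>%
  # keep only rows whose date column starts with two digits
  filter(date %>% str_detect("^[:digit:]{2}")) %>%
  mutate(post = str_trim(post)) %>%
  # keep only rows whose post column starts with a letter (the name)
  filter(post %>% str_detect("^[:alpha:]+")) %>%
  # split at the first colon only, so colons inside a post are kept
  separate(post, into = c("person", "post"), sep = ":", extra = "merge") %>%
  mutate(
    date = date %>% parse_date_time("d.m.y, H:M"),
    post = str_trim(post)
  )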

What we will get is a pretty well formatted dataset:

# A tibble: 18,827 x 3
   date                person        post 
   <dttm>              <chr>         <chr> 
 1 2014-01-04 08:52:00 Person1       "Text \ue312\ue30c\ue312\ue011"                           
 2 2014-01-04 09:43:00 Person2       Juhu   
 3 2014-01-04 09:51:00 Person4       "Random Text"
 4 2014-01-04 11:15:00 Person2       More Text
 5 2014-01-04 16:03:00 Person2       Some Text
 6 2014-01-04 17:31:00 Person4       Text Text Text 
 7 2014-01-04 17:32:00 Person2       More Text.
 8 2014-01-04 18:53:00 Person1       Some random words 
 9 2014-01-04 18:54:00 Person1       More words  
10 2014-01-04 19:11:00 Person2       Bunch of text  
# ... with 18,817 more rows
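One caveat about the separate() step: posts can contain colons themselves, for example a time. With the default extra = "warn", separate() drops everything after the second colon and emits a warning. The following minimal sketch uses a made-up tibble called toy to show how extra = "merge" splits at the first colon only:

```r
library(tibble)
library(tidyr)

# Hypothetical posts; the first one contains a second colon in the time.
toy <- tibble(post = c("Person1: See you at 18:30", "Person2: Juhu"))

# extra = "merge" splits at most once, so "18:30" stays intact.
split_posts <- separate(toy, post, into = c("person", "post"),
                        sep = ":", extra = "merge")

# Remove the space that followed the first colon.
split_posts$post <- trimws(split_posts$post)
```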

Now we can start analyzing the data.

Who is posting the most to WhatsApp?

This is the obvious first question, and it is not too difficult to answer:

whatsapp %>% 
  count(person, sort = TRUE) %>%
  slice(1:19) %>%
  ggplot(aes(x = fct_reorder(person, n), y = n)) +
  geom_col() +
  coord_flip() +
  # labels follow the aesthetics, not the flipped display
  labs(
    x = "Person",
    y = "Number of posts"
  )

How did the number of posts change over the years?

whatsapp %>%
  mutate(
    month = date %>% floor_date("month")
  ) %>%
  count(month) %>%
  ggplot(aes(x = month, y = n)) +
  geom_area() +
  labs(
    x = "month",
    y = "whatsapp posts",
    title = "Number of WhatsApp posts over the years"
  )

How did the posting behavior change over the years for each person?

whatsapp %>%
  mutate(
    month = date %>% floor_date("month")
  ) %>%
  count(month, person) %>%
  ggplot(aes(x = month, y = n)) +
  geom_area() +
  facet_wrap(~ person) +
  labs(
    x = "month",
    y = "Number of whatsapp posts",
    title = "Posting behavior over the years"
  )

What are the most common posts?

whatsapp %>% 
  count(post, sort = TRUE) %>% 
  filter(post %>% str_detect("^[:alpha:]")) %>%
  head(40) %>%
  ggplot(aes(x = fct_reorder(post, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    x = "post",
    y = "Number of posts",
    title = "What are the most common posts?"
  )
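Whole posts are a fairly coarse unit. The same question could also be asked at word level, which is presumably what the tidytext package and the stopwords tibble from the setup are meant for. The following is only a sketch on made-up data: the tibbles toy and stop_words_de are hypothetical stand-ins for the whatsapp data and the German stopword list.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical posts standing in for the whatsapp tibble.
toy <- tibble(
  person = c("Person1", "Person2", "Person1"),
  post = c("Das Geschenk ist super", "Super Geschenk", "Wer bringt das Geschenk")
)

# Tiny hand-made stopword list standing in for the German stopwords.
stop_words_de <- tibble(word = c("das", "ist", "wer"))

word_counts <- toy %>%
  unnest_tokens(word, post) %>%            # one lowercased word per row
  anti_join(stop_words_de, by = "word") %>% # drop the stopwords
  count(word, sort = TRUE)                 # most frequent words first
```

On the real data, toy would be replaced by whatsapp and stop_words_de by the stopwords tibble created at the start.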

How does the weekday affect our posting behavior?

whatsapp %>%
  mutate(
    year = year(date),
    wday = wday(date, label = TRUE),
    week = week(date)
  ) %>%
  count(week, wday, year) %>%
  na.omit() %>%
  mutate(
    # German weekday abbreviations; the labels depend on the LC_TIME locale
    wday = fct_relevel(wday, c("So", "Sa", "Fr", "Do", "Mi", "Di", "Mo"))
  ) %>%
  ggplot(aes(x = week, y = wday)) +
  geom_tile(aes(fill = n), color = "white") +
  scale_fill_gradient(low = "#d8e1cf", high = "#438484") +
  labs(title = "When do we post the most?",
       x = "Week", 
       y = "Wday", 
       fill = "Number of posts") +
  facet_wrap(~ year)

At which hour do we post?

whatsapp %>%
  mutate(
    year = year(date),
    wday = wday(date, label = TRUE),
    hour = hour(date)
  ) %>%
  count(hour, wday, year) %>%
  mutate(
    # German weekday abbreviations; the labels depend on the LC_TIME locale
    wday = fct_relevel(wday, c("So", "Sa", "Fr", "Do", "Mi", "Di", "Mo"))
  ) %>%
  ggplot(aes(x = hour, y = wday)) +
  geom_tile(aes(fill = n), color = "white") +
  scale_fill_gradient(low = "#d8e1cf", high = "#438484") +
  labs(title = "At which time do we post?",
       x = "Hour", 
       y = "Wday", 
       fill = "Number of posts") +
  facet_wrap(~ year)

Conclusion

You could do way more analyses with this data. This is just a glimpse of what’s possible. Hope you enjoyed it.