December 3rd, 2018
This week’s TidyTuesday dataset consists of 78,388 stories from Medium. Medium is an online publishing platform where amateurs and professionals can post their own articles. All stories were published between August 1st, 2017 and August 1st, 2018 and hence span a whole year. Since this dataset contains each story title, I made heavy use of the tidytext package to gain some insights into the most common words used by all authors. You can find the data and my analysis here.
As always, we have to load some packages. We will need lubridate to work with dates, tidytext to analyze the title variable, and hrbrthemes for a good-looking theme.
library(tidyverse)
library(lubridate)
library(tidytext)
library(hrbrthemes)

medium <- read_csv("medium_datasci.csv")
data(stop_words)
After half an hour of tinkering with the dataset, I decided to analyze the title variable in particular. A title can tell us a lot about the ideas authors write about. Since all articles in this dataset are in the domain of data science, we would expect to find many words in the titles that are related to data science (e.g., data, machine, learning, ai). First, I filtered the dataset for the 10 most prolific authors. My assumption is that these authors are the trendsetters and influence the others. We also want our data to be tidy, which means each word should be in a separate row. I also created a new date variable.
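# The 10 authors with the most stories (rows without an author are dropped)
top_authors <- medium %>%
  count(author, sort = TRUE) %>%
  drop_na() %>%
  head(n = 10)

# One row per title word, with a proper date column
words_authors <- medium %>%
  select(author, title, year, month, day) %>%
  unite("date", c("year", "month", "day"), sep = "-") %>%
  mutate(
    date = ymd(date)
  ) %>%
  unnest_tokens(word, title) %>%            # split titles into lowercase words
  anti_join(stop_words, by = "word") %>%    # remove stop words
  filter(!str_detect(word, "\\d"))          # remove tokens containing digits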
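To make the tidying step more concrete, here is a minimal sketch on two made-up titles. The `toy` data frame and its contents are hypothetical and not taken from the Medium dataset; only the pipeline steps mirror the code above.

```r
library(dplyr)
library(stringr)
library(tidytext)

data(stop_words)

# Hypothetical toy titles, invented for illustration
toy <- tibble(
  author = c("A", "B"),
  title  = c("The Rise of Deep Learning", "10 Tips for Data Analytics in 2018")
)

toy_words <- toy %>%
  unnest_tokens(word, title) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drops words such as "the", "of", "for", "in"
  filter(!str_detect(word, "\\d"))         # drops "10" and "2018"

toy_words
```

Each remaining row holds exactly one word together with the author it came from, which is the shape the later counts and plots rely on.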
My first goal was to find out what the top contributing authors are writing about:
words_authors %>%
  filter(author %in% top_authors$author) %>%
  count(word, author, sort = TRUE) %>%
  group_by(author) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  ungroup() %>%
  ggplot(aes(word, y = n)) +
  geom_col() +
  facet_wrap(~ author, scales = "free") +
  coord_flip() +
  theme_ipsum() +
  ggtitle("Most frequent words of the top contributing authors")
Interestingly, the author AI Hawk does not use many words. In fact, only two of his words carry any meaning: read and ai. If you have a look at the webpage you will find out why. Other patterns show up in the data as well. Some authors favor specific words: C Gavilanes talks most about telecom and tech, Synced is very interested in ai, and Yves Mulkers talks about data. Deep Aero Dro is more like the drone guy.
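The "top words per author" selection behind this plot can be sketched on a tiny example. The counts below are made up for illustration, and the cut-off is two words instead of ten to keep the output short.

```r
library(dplyr)

# Hypothetical word counts for two made-up authors
toy_counts <- tibble(
  author = c("A", "A", "A", "B", "B", "B"),
  word   = c("data", "ai", "deep", "drone", "tech", "data"),
  n      = c(5, 3, 1, 4, 2, 1)
)

top_two <- toy_counts %>%
  group_by(author) %>%
  arrange(desc(n)) %>%   # sort by count; the grouping itself is kept
  slice(1:2) %>%         # first two rows within each author
  ungroup()

top_two
```

Because `slice()` works within each group, every author keeps their own two most frequent words, regardless of how often other authors use them.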
Looking at the data, I was curious how the top two contributing authors differ in their choice of words. Surely there are some words that one author favors over the other?
words_authors %>%
  filter(author %in% top_authors$author[1:2]) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(
    proportion = n / sum(n)           # share of each word within an author
  ) %>%
  ungroup() %>%
  select(-n) %>%
  filter(proportion < 0.04) %>%
  spread(author, proportion) %>%      # one column per author
  drop_na() %>%                       # keep only words both authors use
  ggplot(aes(x = Synced, y = `Yves Mulkers`)) +
  geom_point(alpha = .8) +
  # vjust and check_overlap are fixed settings, so they belong outside aes()
  geom_text(aes(label = word), vjust = -1.2, check_overlap = TRUE) +
  geom_abline(intercept = 0, slope = 1, lty = 2) +
  xlim(0, .025) +
  ylim(0, .025) +
  theme_ipsum() +
  ggtitle("How do the top two contributing authors differ in their choice of words?")
In a few places, yes. At first glance, Synced appears to have a smaller vocabulary than Yves Mulkers. Mulkers writes more about analytics, artificial, and intelligence, whereas Synced leans slightly more toward deep, google, and smart. Apart from that, there is not much of a difference between the two.
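For readers who want to see how the proportions in the scatter plot come about, here is a small sketch with invented counts. The authors "A" and "B" and their numbers are hypothetical; the steps follow the pipeline used for the plot.

```r
library(dplyr)
library(tidyr)

# Hypothetical word counts for two made-up authors
toy_counts <- tibble(
  author = c("A", "A", "A", "B", "B"),
  word   = c("data", "ai", "deep", "data", "ai"),
  n      = c(2, 1, 1, 3, 1)
)

toy_props <- toy_counts %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%  # share of each word within an author
  ungroup() %>%
  select(-n) %>%
  spread(author, proportion) %>%       # one column per author
  drop_na()                            # "deep" is dropped: only author A uses it

toy_props
```

In this toy example "data" makes up half of A's words but three quarters of B's, so it would sit above the dashed diagonal if A were on the x-axis. Note that `drop_na()` keeps only words that both authors use, so words unique to one author never appear in the plot.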
But what about trends? Some words must gain in popularity while others lose it. Let’s have a look at how the occurrences of words changed during the past year:
top_10_words <- words_authors %>%
  count(word, sort = TRUE) %>%
  slice(1:10) %>%
  pull(word)

words_authors %>%
  filter(word %in% top_10_words) %>%
  select(-author) %>%
  mutate(
    week = floor_date(date, "week")
  ) %>%
  select(-date) %>%
  count(word, week) %>%
  ggplot(aes(x = week, y = n, color = word)) +
  geom_line() +
  theme_ipsum() +
  xlab("Date") +
  ylab("Frequency") +
  labs(color = "Word") +
  ggtitle("Frequency of most popular words in the last year")
Not very much. Data has been the most popular word throughout, closely followed by ai. Interestingly, the words data, ai, and learning did gain in popularity during the past year. I guess deep learning is on the rise and people talk more about it. Another interesting insight is the dip around New Year’s Eve: the authors tend to put their laptops aside and celebrate instead.
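The weekly binning behind this plot can be illustrated with a handful of made-up dates. The `toy_dates` table is hypothetical, and `week_start = 7` (Sunday) is spelled out here only to make the behavior explicit.

```r
library(dplyr)
library(lubridate)

# Hypothetical title words with publication dates around New Year
toy_dates <- tibble(
  word = c("data", "data", "ai", "data"),
  date = ymd(c("2017-12-25", "2017-12-27", "2017-12-28", "2018-01-02"))
)

toy_weekly <- toy_dates %>%
  mutate(week = floor_date(date, "week", week_start = 7)) %>%  # round down to Sunday
  count(word, week)

toy_weekly
```

All dates from the same week collapse onto the Sunday that starts it, so each point on the line chart is a weekly total rather than a daily count.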