December 16th, 2018
This week's TidyTuesday is all about NYC restaurant inspections. The dataset contains 300,000 inspections of restaurants in NYC from recent years. You can find my full analysis here.
library(tidyverse)
library(tidytext)
library(hrbrthemes)
library(lubridate)

theme_set(theme_ipsum())
restaurants <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018-12-11/nyc_restaurants.csv")
When I looked at the dataset, I first wondered what inspectors usually complain about. Some terms may appear more frequently when an inspection results in a critical flag. So I pulled the words from the dataset that were most associated with a critical or a non-critical inspection.
common_critique <- restaurants %>%
  select(camis, dba, cuisine_description, critical_flag, inspection_date,
         action, score, grade, inspection_type, violation_description) %>%
  # one row per word in each violation description
  unnest_tokens(word, violation_description) %>%
  # drop common filler words such as "the" or "of"
  anti_join(stop_words, by = "word") %>%
  count(word, critical_flag, sort = TRUE) %>%
  # rows are sorted by n, so this keeps the 20 most frequent words per flag
  group_by(critical_flag) %>%
  slice(1:20) %>%
  ungroup() %>%
  drop_na(word) %>%
  mutate(word = as.factor(word))

common_critique %>%
  ggplot(aes(x = critical_flag, y = fct_reorder(word, n))) +
  geom_point(aes(size = n)) +
  labs(
    x = "Type of critical flag",
    y = "Common words in violation descriptions",
    title = "Common words to describe violations per critical flag"
  )
It turns out that animals are the most frequent point of criticism: flies turn up very often, and so do mice. Another common critique is the preparation of the food, especially contaminated food. Most of the criticism that does not result in a critical flag is directed at the kitchen equipment and the storage of the food.
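To make the tokenise-and-count step concrete, here is a minimal sketch on three made-up violation descriptions. The toy data frame and its contents are hypothetical and not taken from the dataset; only the column names mirror the ones used above.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini dataset: three invented violation descriptions
toy <- tibble(
  critical_flag = c("Critical", "Critical", "Not Critical"),
  violation_description = c(
    "Evidence of mice in facility",
    "Live mice found near food",
    "Equipment not properly maintained"
  )
)

toy_counts <- toy %>%
  # split each description into lower-cased single words
  unnest_tokens(word, violation_description) %>%
  # remove stop words using the lexicon shipped with tidytext
  anti_join(stop_words, by = "word") %>%
  # count how often each word occurs under each flag
  count(critical_flag, word, sort = TRUE)

toy_counts
```

Because "mice" appears in both critical descriptions, it ends up at the top of the table, which is the same mechanism that pushes "flies" and "mice" to the top in the real data.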
I then wondered whether the number of inspections has increased in recent years. First I had to convert the inspection_date variable into a date variable so that I could round the inspections to the respective month.
restaurants_cleaned_dates <- restaurants %>%
  mutate(
    # parse the month/day/year strings into date-times
    inspection_date = inspection_date %>% parse_date_time("%m/%d/%y"),
    # round_date() snaps each date to the nearest month boundary
    month = inspection_date %>% round_date("month")
  )

restaurants_cleaned_dates %>%
  count(month) %>%
  # month is a date-time, so compare it with a date-time rather than a Date
  filter(month >= as_datetime("2014-01-01")) %>%
  ggplot(aes(month, n)) +
  geom_area(fill = "#93856a") +
  labs(
    x = "Month",
    y = "Number of inspections",
    title = "Number of inspections of NYC restaurants since 2014"
  )
In fact, inspections have increased over the years. This is a good sign for customers.
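A detail of the rounding step is worth a quick check: round_date() snaps to the nearest month boundary, so an inspection late in May is counted under June, whereas floor_date() would keep it in May. A minimal sketch with two made-up dates (not taken from the dataset):

```r
library(lubridate)

# Two hypothetical inspection dates, early and late in the same month
dates <- mdy(c("05/03/2018", "05/20/2018"))

rounded <- round_date(dates, "month")  # nearest month start
floored <- floor_date(dates, "month")  # start of the calendar month

rounded
floored
```

For monthly counts this only moves inspections between neighbouring months, so the overall trend should look much the same with either function.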
As a customer, I would also like to know which cuisine types are the cleanest. I filtered the dataset down to the 10 best and 10 worst cuisine types, using the variable score as the quality criterion. I treated a high score as good and a low score as not so good. Note, however, that the NYC inspection score counts violation points, so officially a lower score is the better result.
restaurants %>%
  group_by(cuisine_description) %>%
  summarise(
    mean_grading = mean(score, na.rm = TRUE),
    n = n()
  ) %>%
  # keep only cuisine types with more than 100 rows
  filter(n > 100) %>%
  arrange(desc(mean_grading)) %>%
  mutate(
    median_split = (mean_grading > median(mean_grading)) %>%
      as.factor() %>%
      fct_rev() %>%
      fct_recode(`Best Scoring` = "TRUE", `Worst Scoring` = "FALSE"),
    cuisine_description = as.factor(cuisine_description)
  ) %>%
  # first 10 and last 10 rows of the sorted table (<= 10, since < 10 keeps only 9)
  filter(row_number() <= 10 | row_number() > nrow(.) - 10) %>%
  ggplot(aes(x = fct_reorder(cuisine_description, mean_grading), y = mean_grading)) +
  geom_col(aes(fill = mean_grading)) +
  coord_flip() +
  facet_wrap(~ median_split, scales = "free") +
  scale_fill_gradient(high = "#1a9850", low = "#d73027") +
  # labels follow the aesthetics, not the flipped display: x is the cuisine type
  labs(
    x = "Cuisine type",
    y = "Mean scoring",
    title = "10 best and 10 worst cuisine types according to inspection scores",
    fill = "Mean scoring"
  ) +
  theme(legend.position = "bottom")
It turns out that Asian restaurants have the highest mean scores, while donut shops, hot dog stands and English restaurants have the lowest.
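The selection of rows from both ends of a sorted table can be sketched on a toy example. The cuisine names, the scores and the helper variable k below are made up for illustration. Note that `<=` is needed to keep exactly k rows from the top; `row_number() < k` would keep only k - 1.

```r
library(dplyr)

# Hypothetical mean scores for six invented cuisine types
toy_scores <- tibble(
  cuisine = c("A", "B", "C", "D", "E", "F"),
  mean_score = c(25, 22, 20, 18, 15, 12)
)

k <- 2  # number of rows to keep from each end (illustrative value)

extremes <- toy_scores %>%
  arrange(desc(mean_score)) %>%
  # keep the first k and the last k rows of the sorted table
  filter(row_number() <= k | row_number() > n() - k)

extremes$cuisine
```

With six rows and k set to 2, this keeps the two highest and the two lowest entries and drops the middle of the table.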