Back to the Basics - A Refresher on Basic Plot Types in R


Recently, I was working through a small analysis for a collaboration & I had to google how to make a box plot in R. I used to do more data viz than I do now. I was a bit bummed to have forgotten some basic plot types so quickly. I thought I would use this post as a quick refresher on some basic plot types in ggplot2.

Let’s get started!

I’m going to work with some #TidyTuesday data here. This data set looks are park availability across US metro areas. The specific cities we’re going to work with are Chicago & Raleigh. We’ve going to be visualizing the amount of money each city spends pre resident over the course of 8 years (2013-2020). A more thorough description of this data set can be found here.

First, we’re going to load the libraries we will be using & read in our data.

library(tidyverse)

parks = read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-22/parks.csv")

head(parks)
## # A tibble: 6 x 28
##    year  rank city         med_park_size_da… med_park_size_poi… park_pct_city_d…
##   <dbl> <dbl> <chr>                    <dbl>              <dbl> <chr>           
## 1  2020     1 Minneapolis                5.7                 26 15%             
## 2  2020     2 Washington,…               1.4                  5 24%             
## 3  2020     3 St. Paul                   3.2                 14 15%             
## 4  2020     4 Arlington, …               2.4                 10 11%             
## 5  2020     5 Cincinnati                 4.4                 20 14%             
## 6  2020     6 Portland                   4.9                 22 18%             
## # … with 22 more variables: park_pct_city_points <dbl>,
## #   pct_near_park_data <chr>, pct_near_park_points <dbl>,
## #   spend_per_resident_data <chr>, spend_per_resident_points <dbl>,
## #   basketball_data <dbl>, basketball_points <dbl>, dogpark_data <dbl>,
## #   dogpark_points <dbl>, playground_data <dbl>, playground_points <dbl>,
## #   rec_sr_data <dbl>, rec_sr_points <dbl>, restroom_data <dbl>,
## #   restroom_points <dbl>, splashground_data <dbl>, splashground_points <dbl>,
## #   amenities_points <dbl>, total_points <dbl>, total_pct <dbl>,
## #   city_dup <chr>, park_benches <dbl>

Next let’s create a data frame to use for the different types of plots. We can filter the data by city and year. Next, we’ll select only the columns we’re interested in (year, spend_per_resident_data, city). Then we’ll take a quick peek at the resulting data frame to make sure everything looks good!

parks_dat = parks %>%
  filter((city == "Chicago" | city == "Raleigh") & year >= 2013) %>% 
  select(year, spend_per_resident_data, city) 

head(parks_dat)
## # A tibble: 6 x 3
##    year spend_per_resident_data city   
##   <dbl> <chr>                   <chr>  
## 1  2020 $179                    Chicago
## 2  2020 $157                    Raleigh
## 3  2019 $171                    Chicago
## 4  2019 $194                    Raleigh
## 5  2018 $172                    Chicago
## 6  2018 $202                    Raleigh

Rather than add a theme() call to each plot in this post, I’m going to set a “global” theme. You can create this global theme by making a list of the different parameters you would like to pass to each plot. Here I’m starting with theme_bw(). I’m setting the color & fill palettes. I’m also moving the legend to the bottom for each plot using legend.position. This can be added to any ggplot2 object by adding + gglayer_theme.

gglayer_theme <- list(
  theme_bw(),
  scale_color_brewer(palette = "Dark2"),
  scale_fill_brewer(palette = "Dark2"),
  theme(legend.position = "bottom")
)

First up, we’ll do a bar plot. Everyone loves a good plot! It’s an easy to read way to share data. Here, I aave year on the x axis & “spend per resident” on the y axis. I have filled the bars based on the city.

A few quirks here: I used factor()to get R to understand that I want to treat the year was a discrete variable & not a continuous variable. I used stat = "identity" to tell R that I want the height of each bar to be the dollar amount spent per resident. Lastly, I used position = "dodge" to get the bars next to each other instead of stacked on top of each other.

ggplot(parks_dat, aes(x = factor(year), y = spend_per_resident_data, fill = city)) +
  geom_bar(aes(group = city), stat = "identity", position = "dodge") +
  gglayer_theme

Next, let’s do a scatter plot! This code looks very similar to the bar plot code. Main changes: using geom_point() instead of geom_bar(). Also, I used color to change the color of the dots based on the city. This ends up being a bit harder to read than the bar plot.

ggplot(parks_dat, aes(x = factor(year), y = spend_per_resident_data, color = city)) +
  geom_point(aes(group = city)) +
  gglayer_theme

Let’s look at a line plot next! This is the exact same code as the scatter plot except we changed geom_point() to geom_line(). This looks better but it’s still tough to quickly see the values for each year.

ggplot(parks_dat, aes(x = factor(year), y = spend_per_resident_data, color = city)) +
  geom_line(aes(group = city)) +
  gglayer_theme

We can combine the scatter plot & line plot to get a better plot! To get this, we use both geom_line() & geom_point(). This is much more readable. Looks good!

ggplot(parks_dat, aes(x = factor(year), y = spend_per_resident_data, color = city)) +
  geom_line(aes(group = city)) +
  geom_point() +
  gglayer_theme

Next up, we’ll do a box plot. Box plots allow us to see the distribution of each group. It also lets us do a quick comparison of the median. Once again, we just swap out the geom call.

ggplot(parks_dat, aes(x = city, y = spend_per_resident_data, color = city)) +
  geom_boxplot(aes(group = city)) +
  gglayer_theme 

Last up, let’s do a quick violin plot. Similar to a box plot, but the shape allows us to see where the distribution is highest. Here we use, geom_violin() to get the violin plot.

ggplot(parks_dat, aes(x = city, y = spend_per_resident_data, fill = city)) +
  geom_violin(aes(group = city)) +
  gglayer_theme 

This wraps up a (very) brief refresher on some basic plot types. Feel free to reach out with any questions/comments/concerns via Twitter or using one of the other methods on my contact page. Thanks for reading!