A Modern Day Comparison of 3 Heat Map Packages - Part 1

I’m going to use some early data from the first week of #TidyTuesday to do a comparison of different methods for making a heat map. The three packages I’m going to compare are ggplot2, pheatmap, & ComplexHeatmap.

The first package I’m going to look at is ggplot2, an integral part of the tidyverse.** You can create a heat map in ggplot2 using geom_tile(). One of the pros of making any type of plot in ggplot2 is that it plays well with other plots made in ggplot2. This allows you to combine many different types of plots into a single figure in a pretty straightforward fashion (for example, a heat map & a bar plot together). All ggplot2 plots follow a similar pattern, so once you get the hang of making one type of plot, you can apply those principles to other types. The only big negative I can think of for making heat maps in ggplot2 is that it’s a bit more difficult to cluster your data. It’s not impossible, but it’s quite a bit of extra work compared to other packages.
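
To give a rough idea of what that extra work looks like, here is a minimal sketch using base R’s dist() & hclust() on a made-up matrix. The toy values and the names toy_mat, row_clust, clustered_rows & toy_rows are my own assumptions for illustration, not part of the tuition data. The idea is to cluster the rows yourself & then use the resulting order as the factor levels for the axis.

```r
# Toy wide matrix (hypothetical values): rows are subjects, columns are measurements
toy_mat <- matrix(c( 1.0,  2.0,  3.0,
                    10.0, 11.0, 12.0,
                     1.5,  2.5,  3.5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("a", "b", "c"), c("t1", "t2", "t3")))

# Hierarchical clustering of the rows on Euclidean distance
row_clust <- hclust(dist(toy_mat))

# Row names in dendrogram order; similar rows end up next to each other
clustered_rows <- rownames(toy_mat)[row_clust$order]

# In a long data frame, the row variable could then be turned into a factor
# with these levels so geom_tile() draws the rows in clustered order
toy_rows <- factor(rownames(toy_mat), levels = clustered_rows)
```

This only reorders the rows. Drawing a dendrogram next to the heat map would take additional work on top of this.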

Let’s start by creating a heat map of our data in ggplot2. In Part 2 & Part 3 of this post, I’ll recreate that plot using pheatmap & ComplexHeatmap (https://jokergoo.github.io/ComplexHeatmap-reference/book/).

** To clarify, calling ggplot2 a heat map package is trivializing the sheer number of things you can do with it. However, in this case, we’re only interested in the heat map capabilities.

A quick description of our data.

This data is the average in-state tuition & fees for one year of full-time study at a public 4-year institution for each state from the 2004-2005 school year to 2015-2016. You can download the data here.

Let’s get started! First, let’s load the libraries we’re going to use.

library(tidyverse)
library(RColorBrewer)

Next, I want to read in the file we’re going to get our data from. I saved it to my GitHub so I’ll have it for future use. Also, this way I can read it in from anywhere. I’m using read_csv() to read the file into an object named df. Then we can print df to see what our data looks like.

df = read_csv("https://raw.githubusercontent.com/sapo83/TidyTuesday/master/2018/TT.20180403/us.avg.tuition.noCR.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   State = col_character(),
##   `2004-05` = col_character(),
##   `2005-06` = col_character(),
##   `2006-07` = col_character(),
##   `2007-08` = col_character(),
##   `2008-09` = col_character(),
##   `2009-10` = col_character(),
##   `2010-11` = col_character(),
##   `2011-12` = col_character(),
##   `2012-13` = col_character(),
##   `2013-14` = col_character(),
##   `2014-15` = col_character(),
##   `2015-16` = col_character()
## )

df

## # A tibble: 50 x 13
##    State   `2004-05` `2005-06` `2006-07` `2007-08` `2008-09` `2009-10` `2010-11`
##    <chr>   <chr>     <chr>     <chr>     <chr>     <chr>     <chr>     <chr>    
##  1 Alabama $5,683    $5,841    $5,753    $6,008    $6,475    $7,189    $8,071   
##  2 Alaska  $4,328    $4,633    $4,919    $5,070    $5,075    $5,455    $5,759   
##  3 Arizona $5,138    $5,416    $5,481    $5,682    $6,058    $7,263    $8,840   
##  4 Arkans… $5,772    $6,082    $6,232    $6,415    $6,417    $6,627    $6,901   
##  5 Califo… $5,286    $5,528    $5,335    $5,672    $5,898    $7,259    $8,194   
##  6 Colora… $4,704    $5,407    $5,596    $6,227    $6,284    $6,948    $7,748   
##  7 Connec… $7,984    $8,249    $8,368    $8,678    $8,721    $9,371    $9,827   
##  8 Delawa… $8,353    $8,611    $8,682    $8,946    $8,995    $9,987    $10,534  
##  9 Florida $3,848    $3,924    $3,888    $3,879    $4,150    $4,783    $5,511   
## 10 Georgia $4,298    $4,492    $4,584    $4,790    $4,831    $5,550    $6,428   
## # … with 40 more rows, and 5 more variables: 2011-12 <chr>, 2012-13 <chr>,
## #   2013-14 <chr>, 2014-15 <chr>, 2015-16 <chr>

So far everything looks good! The first thing I want to do is remove the dollar signs & the commas from each value. This will make it easier to work with.

Let’s break the command down. gsub() is used to substitute a pattern with a new pattern. The general form is gsub("old_pattern", "new_pattern", object). Here I’m looking for either a dollar sign or a comma, "[$,]", & replacing it with nothing, "". I’m using lapply() to apply this function to each column. Here the x represents the argument being passed to the function. In this case the argument is one column of my data frame, passed as a character vector. lapply() returns a list, so I wrap the whole command in as.data.frame() to get a data frame back. Note that as.data.frame() also rewrites column names that are not valid R names, so 2004-05 becomes X2004.05.

df = as.data.frame(lapply(df, function (x) {gsub("[$,]", "", x)}))

head(df)

##        State X2004.05 X2005.06 X2006.07 X2007.08 X2008.09 X2009.10 X2010.11
## 1    Alabama     5683     5841     5753     6008     6475     7189     8071
## 2     Alaska     4328     4633     4919     5070     5075     5455     5759
## 3    Arizona     5138     5416     5481     5682     6058     7263     8840
## 4   Arkansas     5772     6082     6232     6415     6417     6627     6901
## 5 California     5286     5528     5335     5672     5898     7259     8194
## 6   Colorado     4704     5407     5596     6227     6284     6948     7748
##   X2011.12 X2012.13 X2013.14 X2014.15 X2015.16
## 1     8452     9098     9359     9496     9751
## 2     5762     6026     6012     6149     6571
## 3     9967    10134    10296    10414    10646
## 4     7029     7287     7408     7606     7867
## 5     9436     9361     9274     9187     9270
## 6     8316     8793     9293     9299     9748
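
To see each piece of that command on its own, here is a small sketch on made-up values. The names prices, cleaned, toy_list, default_names & kept_names are just for illustration. It also shows where the X in the column names comes from, and that check.names = FALSE is one way to keep the original names if you prefer them.

```r
# Made-up values in the same format as the tuition columns
prices <- c("$5,683", "$10,534")

# [$,] is a character class: it matches a dollar sign or a comma
cleaned <- gsub("[$,]", "", prices)
cleaned               # still character strings, not numbers
as.numeric(cleaned)   # converts them to numbers

# as.data.frame() repairs names that are not valid R names by default
toy_list <- list(`2004-05` = cleaned)
default_names <- names(as.data.frame(toy_list))                       # "X2004.05"
kept_names    <- names(as.data.frame(toy_list, check.names = FALSE))  # "2004-05"
```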

Next, I need this data frame to be in long format instead of wide.

A bit about long versus wide.

Generally, to plot data using ggplot the data needs to be in long format. This means one observation in each row. This also means a single subject with multiple measurements will have multiple rows in the data frame. In our melted data frame, you can see each state (the subject) has multiple rows. Each row corresponds to a single measurement (tuition for a specific academic year).

When a single subject has one row with multiple measurements then the data is in wide format. In our original data frame, each row had one state (the subject) with multiple measurements in each row (each column represents one measurement).

Here I’m going to use melt() from the reshape2 package. Another option is pivot_longer() from the tidyr package. tidyr is loaded as part of the tidyverse library. Here is a vignette to learn more about pivot_longer() and pivot_wider().
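
For comparison, here is a rough sketch of the pivot_longer() route on a made-up two-state subset. The data frame toy_wide & the result toy_long are hypothetical names I’m using for illustration. This is not the code used in the rest of the post.

```r
library(tidyr)

# Toy wide data frame (made-up subset): one row per state, one column per year
toy_wide <- data.frame(State = c("Alabama", "Alaska"),
                       `2004-05` = c(5683, 4328),
                       `2005-06` = c(5841, 4633),
                       check.names = FALSE)

# Every column except State becomes a (Years, Tuition) pair
toy_long <- pivot_longer(toy_wide, cols = -State,
                         names_to = "Years", values_to = "Tuition")

toy_long   # 4 rows: one per state per year
```

The wide version has one row per state. The long version has one row per state per year, which is the shape ggplot wants.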

In the melt, I used variable.name = "Years" to name the variable column Years. I used value.name = "Tuition" to specify that I want the value column named Tuition.

Along with the melt, I also want to correct how the years are written. After the melt, the “Years” column looks like X2005.06. I don’t like that. Again, I’m using gsub() as a find and replace. I nested two gsub() commands to first replace the X with nothing & then replace the . with -20.

Last thing! After the melt, “Tuition” is still a character column, because gsub() returns text rather than numbers. This interferes with plotting our data later. In the mutate() call, I’m going to use as.numeric() to change it to a number.

Now everything looks better!

melted_df = reshape2::melt(df, id.vars = "State", variable.name = "Years",
                           value.name = "Tuition") %>%
  mutate(Years = gsub("[.]", "-20", gsub("X", "", Years)), Tuition = as.numeric(Tuition))

head(melted_df)

##        State     Years Tuition
## 1    Alabama 2004-2005    5683
## 2     Alaska 2004-2005    4328
## 3    Arizona 2004-2005    5138
## 4   Arkansas 2004-2005    5772
## 5 California 2004-2005    5286
## 6   Colorado 2004-2005    4704
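
If the nested gsub() is hard to read, it can help to run the two steps separately on a single value. This is a small sketch of that. The names raw_year, no_x & clean_year are only for illustration.

```r
# A year label as it looks before the cleanup
raw_year <- "X2004.05"

# Inner call: drop the leading X
no_x <- gsub("X", "", raw_year)          # "2004.05"

# Outer call: [.] matches a literal dot (a bare . would match any character)
clean_year <- gsub("[.]", "-20", no_x)   # "2004-2005"
```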

Now that we have our melted data frame, let’s make a heat map!

ggplot(data = melted_df, aes(x = Years, y = State, fill = Tuition)) +
  geom_tile()

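Not bad! One of the first things I noticed is that the y axis is out of order. ggplot2 puts the first state at the bottom of the axis, so the states run alphabetically from bottom to top. I’d like to see them in alphabetical order from top to bottom. We can use fct_rev() to reverse the order.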

ggplot(data = melted_df, aes(x = Years, y = fct_rev(State), fill = Tuition)) +
  geom_tile()
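
Here is a quick sketch of what fct_rev() does, using a made-up subset of state names (toy_states & rev_states are names I picked for illustration). The values stay the same. Only the order of the factor levels changes, and that order is what ggplot uses to lay out the axis.

```r
library(forcats)

# Made-up subset of state names
toy_states <- c("Alabama", "Alaska", "Arizona")

# factor() sorts the levels alphabetically by default
levels(factor(toy_states))   # "Alabama" "Alaska" "Arizona"

# fct_rev() keeps the values but reverses the level order
rev_states <- fct_rev(toy_states)
levels(rev_states)           # "Arizona" "Alaska" "Alabama"
```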

Better! Next I want to rotate the x axis labels so they are more readable. We can do this using a call to theme(). We specify axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1) to rotate the labels & then line the end of each label up with its tick mark.

ggplot(data = melted_df, aes(x = Years, y = fct_rev(State), fill = Tuition)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

Last up, I want to change the color palette & relabel the axes. I really like the “red, yellow, blue” palette from ColorBrewer. I adapted it here to use as a continuous scale. To relabel the axes, I use a call to labs().

cols = colorRampPalette(rev(brewer.pal(11, "RdYlBu")))

ggplot(data = melted_df, aes(x = Years, y = fct_rev(State), fill = Tuition)) +
  geom_tile() +
  scale_fill_gradientn(colours = cols(11)) +
  labs(y = "State",
       x = "Years", 
       fill = "Tuition ($)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
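
A quick note on how the palette piece works, as a small sketch. colorRampPalette() doesn’t return colors directly. It returns a function that you call with the number of colors you want. The name base_cols is mine, added for illustration.

```r
library(RColorBrewer)

# The 11 RdYlBu colors, reversed so the ramp runs from blue (low) to red (high)
base_cols <- rev(brewer.pal(11, "RdYlBu"))

# colorRampPalette() returns a function, not a vector of colors
cols <- colorRampPalette(base_cols)

cols(5)    # 5 colors interpolated along the ramp
cols(25)   # 25 colors along the same ramp
```

scale_fill_gradientn() interpolates between whatever colors it is given, so passing cols(11) is enough to get a smooth continuous scale.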

That looks good! I’m going to call this done! If you have any questions, feel free to reach out to me on Twitter. Head over to Part 2 or Part 3.