A Modern Day Comparison of 3 Heat Map Packages - Part 2


We’re back to look at another heat map package! This is Part 2 of a 3-part series. You can find Part 1 here.

As someone who learned R using primarily the tidyverse, pheatmap had a bit of a learning curve. As a grad student, I was looking for an easier way to make a heat map with clustered data. I was trying to make heat maps for about 50 gene clusters in a programmatic way. ggplot2 wasn’t cutting it for me so I landed here. The end result was this figure & this figure, both published in this paper.

We’re going to use the same data as before. A lot of the data wrangling will look similar to Part 1.

First, we’re going to load the libraries we’ll be using.

library(tidyverse)
library(RColorBrewer)
library(pheatmap)

Next, we’re going to read in the file using read_csv() from readr, which is loaded as part of the tidyverse.

df = read_csv("https://raw.githubusercontent.com/sapo83/TidyTuesday/master/2018/TT.20180403/us.avg.tuition.noCR.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   State = col_character(),
##   `2004-05` = col_character(),
##   `2005-06` = col_character(),
##   `2006-07` = col_character(),
##   `2007-08` = col_character(),
##   `2008-09` = col_character(),
##   `2009-10` = col_character(),
##   `2010-11` = col_character(),
##   `2011-12` = col_character(),
##   `2012-13` = col_character(),
##   `2013-14` = col_character(),
##   `2014-15` = col_character(),
##   `2015-16` = col_character()
## )
df
## # A tibble: 50 x 13
##    State   `2004-05` `2005-06` `2006-07` `2007-08` `2008-09` `2009-10` `2010-11`
##    <chr>   <chr>     <chr>     <chr>     <chr>     <chr>     <chr>     <chr>    
##  1 Alabama $5,683    $5,841    $5,753    $6,008    $6,475    $7,189    $8,071   
##  2 Alaska  $4,328    $4,633    $4,919    $5,070    $5,075    $5,455    $5,759   
##  3 Arizona $5,138    $5,416    $5,481    $5,682    $6,058    $7,263    $8,840   
##  4 Arkans… $5,772    $6,082    $6,232    $6,415    $6,417    $6,627    $6,901   
##  5 Califo… $5,286    $5,528    $5,335    $5,672    $5,898    $7,259    $8,194   
##  6 Colora… $4,704    $5,407    $5,596    $6,227    $6,284    $6,948    $7,748   
##  7 Connec… $7,984    $8,249    $8,368    $8,678    $8,721    $9,371    $9,827   
##  8 Delawa… $8,353    $8,611    $8,682    $8,946    $8,995    $9,987    $10,534  
##  9 Florida $3,848    $3,924    $3,888    $3,879    $4,150    $4,783    $5,511   
## 10 Georgia $4,298    $4,492    $4,584    $4,790    $4,831    $5,550    $6,428   
## # … with 40 more rows, and 5 more variables: 2011-12 <chr>, 2012-13 <chr>,
## #   2013-14 <chr>, 2014-15 <chr>, 2015-16 <chr>

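Next, we’re going to use gsub() to remove all the dollar signs & commas. lapply() allows us to apply this to every column in our data frame. Note that as.data.frame() also rewrites the column names into syntactically valid ones, which is why the years show up as X2004.05, X2005.06, etc. in the output below.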

df = as.data.frame(lapply(df, function (x) {gsub("[$,]", "", x)}))

head(df)
##        State X2004.05 X2005.06 X2006.07 X2007.08 X2008.09 X2009.10 X2010.11
## 1    Alabama     5683     5841     5753     6008     6475     7189     8071
## 2     Alaska     4328     4633     4919     5070     5075     5455     5759
## 3    Arizona     5138     5416     5481     5682     6058     7263     8840
## 4   Arkansas     5772     6082     6232     6415     6417     6627     6901
## 5 California     5286     5528     5335     5672     5898     7259     8194
## 6   Colorado     4704     5407     5596     6227     6284     6948     7748
##   X2011.12 X2012.13 X2013.14 X2014.15 X2015.16
## 1     8452     9098     9359     9496     9751
## 2     5762     6026     6012     6149     6571
## 3     9967    10134    10296    10414    10646
## 4     7029     7287     7408     7606     7867
## 5     9436     9361     9274     9187     9270
## 6     8316     8793     9293     9299     9748

Looks good so far!
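If you’d rather stay inside the tidyverse for this step, here is a minimal sketch of an alternative using parse_number() from readr with across() from dplyr (this assumes dplyr 1.0 or later). It is not what the rest of this post uses; the toy tibble and the names toy & toy_clean are made up for illustration. One difference worth knowing: this route returns numeric columns directly & leaves the original column names untouched.

```r
library(dplyr)
library(readr)

# Hypothetical two-row stand-in for the tuition data
toy = tibble::tibble(
  State     = c("Alabama", "Alaska"),
  `2004-05` = c("$5,683", "$4,328"),
  `2005-06` = c("$5,841", "$4,633")
)

# parse_number() drops the "$" and the "," grouping mark and returns a numeric
toy_clean = toy %>%
  mutate(across(-State, parse_number))

toy_clean
```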

One of the quirks of pheatmap is that it will only accept a matrix with numeric values. This caused me the most trouble when I first started using this package. To handle this, we’ll apply (using sapply()) the as.numeric() function to each column except the first column. We’ll then assign this as a matrix (using as.matrix()) to a new object (df_num). Our row names for our new object are the states. We set the row names for df_num using the first column from our original data frame (df).

df_num = as.matrix(sapply(df[, 2:13], as.numeric))  

rownames(df_num) = df$State

head(df)
##        State X2004.05 X2005.06 X2006.07 X2007.08 X2008.09 X2009.10 X2010.11
## 1    Alabama     5683     5841     5753     6008     6475     7189     8071
## 2     Alaska     4328     4633     4919     5070     5075     5455     5759
## 3    Arizona     5138     5416     5481     5682     6058     7263     8840
## 4   Arkansas     5772     6082     6232     6415     6417     6627     6901
## 5 California     5286     5528     5335     5672     5898     7259     8194
## 6   Colorado     4704     5407     5596     6227     6284     6948     7748
##   X2011.12 X2012.13 X2013.14 X2014.15 X2015.16
## 1     8452     9098     9359     9496     9751
## 2     5762     6026     6012     6149     6571
## 3     9967    10134    10296    10414    10646
## 4     7029     7287     7408     7606     7867
## 5     9436     9361     9274     9187     9270
## 6     8316     8793     9293     9299     9748
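Since pheatmap needs a numeric matrix, it can help to check the converted object before plotting. Below is a small sketch of the same conversion on a made-up data frame (toy & toy_num are hypothetical names), followed by a few sanity checks you could run on df_num in the same way.

```r
# Hypothetical stand-in for the cleaned data frame (values still stored as text)
toy = data.frame(
  State    = c("Alabama", "Alaska"),
  X2004.05 = c("5683", "4328"),
  X2005.06 = c("5841", "4633")
)

# Same pattern as above: convert every column except State, then add row names
toy_num = as.matrix(sapply(toy[, -1], as.numeric))
rownames(toy_num) = toy$State

is.matrix(toy_num)   # should be TRUE
is.numeric(toy_num)  # should be TRUE
anyNA(toy_num)       # TRUE here would mean some value failed to convert
toy_num
```

If anyNA() comes back TRUE, as.numeric() ran into a value it couldn’t parse, usually a leftover symbol from the cleaning step.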

One last thing before we get to plotting our heat map: column names! The column names need to be cleaned up a bit.

We’ll use two nested gsub() commands to remove the “X” from the beginning of each name & replace the “.” with “-20”, so “X2004.05” becomes “2004-2005”.

colnames(df_num) = gsub("[.]", "-20", gsub("X", "", colnames(df_num)))
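To see what the nested gsub() calls do, here is a quick check on two sample names. The inner call runs first & strips the “X”; the outer call then swaps the period for “-20”. This sketch anchors the inner pattern as "^X" so only a leading X is removed, which is a small variation on the line above, not a change in the result for these names.

```r
# Two sample column names in the form produced by as.data.frame()
old_names = c("X2004.05", "X2015.16")

# Inner gsub(): drop the leading "X"; outer gsub(): turn "." into "-20"
new_names = gsub("[.]", "-20", gsub("^X", "", old_names))

new_names
# "2004-2005" "2015-2016"
```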

Now, let’s plot our heat map!

pheatmap(df_num)

First thing I want to fix is the column order. I don’t want them clustered. I want to see them in order by date. We can add cluster_cols = FALSE to take care of that. I’ll leave the clustering on for the rows (states). With this clustering we can see which states have similar patterns in their tuition changes. This may help us see particular trends in certain regions or identify outlier(s) in a specific area.
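If you want to see the row groupings as labels instead of reading them off the dendrogram, here is a rough sketch in base R. It assumes pheatmap’s default settings for row clustering (Euclidean distance & complete linkage) & uses a made-up matrix; toy, row_tree & groups are hypothetical names, & the choice of k = 2 is arbitrary.

```r
# Made-up matrix: three similar "states" and one that is much more expensive
toy = matrix(
  c(5000, 5200, 5400,
    5100, 5300, 5500,
    9000, 9400, 9800,
    5050, 5250, 5450),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"),
                  c("2004-2005", "2005-2006", "2006-2007"))
)

# Hierarchical clustering of the rows, then cut the tree into 2 groups
row_tree = hclust(dist(toy, method = "euclidean"), method = "complete")
groups = cutree(row_tree, k = 2)

groups
```

Here rows A, B & D end up in one group & row C sits on its own, which is the kind of outlier the dendrogram would show as a long, separate branch.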

I also want to add a color scheme to it. I’m going to stick with the color palette from Part 1.

pheatmap(df_num, 
         color = colorRampPalette(rev(brewer.pal(11, "RdYlBu")))(100), 
         cluster_cols = FALSE)
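The color argument is just a vector of colors, so it can be built & inspected on its own. Here is a short breakdown of the expression used above (base_cols, palette_fn & heat_cols are names I made up for this sketch): brewer.pal() gives the 11 RdYlBu colors, rev() flips them so low values are blue & high values are red, & colorRampPalette() returns a function that interpolates them into as many steps as you ask for.

```r
library(RColorBrewer)

# 11 colors from the RdYlBu palette, reversed so the ramp runs blue -> red
base_cols = rev(brewer.pal(11, "RdYlBu"))

# colorRampPalette() returns a function; calling it with 100 gives 100 colors
palette_fn = colorRampPalette(base_cols)
heat_cols = palette_fn(100)

length(heat_cols)
head(heat_cols, 3)
```

Asking for fewer steps, for example palette_fn(10), gives a more banded look, while more steps gives a smoother gradient.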

Next I want to change the cell size & remove the cell border. I’ll use NA to remove the cell border completely. I prefer squares in a heat map as opposed to rectangles so I’ll set the cell width & cell height to match.

pheatmap(df_num, 
         color = colorRampPalette(rev(brewer.pal(11, "RdYlBu")))(100), 
         cluster_cols = FALSE,
         cellwidth = 10, 
         cellheight = 10,
         border_color = NA)

This is fairly similar to the plot in Part 1 so I’m going to stop here. Any questions/comments/concerns, I’d be happy to hear from you. You can reach out to me via Twitter or using one of the other methods on my contact page. Stay tuned for Part 3!