
Version 5 (modified by thiruvil, 4 years ago)


Creating/viewing a heatmap

For clustering a matrix of values, our favorite tool is Cluster 3.0.

Basic steps for Cluster 3.0:

  • File >> Open data file. Navigate to a tab-delimited text file with the first row containing sample names and the first column containing feature (gene) names or IDs. The file should contain a subset of features (up to several thousand); with any more, the uninformative noise obscures the informative signal.
  • Adjust Data [tab] >> Check "Log transform data" (unless your input matrix was already transformed), and check "Center genes", leaving the centering method as "mean", unless you have lots of samples and want to choose "median" to better tolerate outliers. Click the Apply button.
  • Hierarchical [tab] >> Under Genes, check "Cluster" and leave Similarity Metric as "Correlation (Uncentered)", and if you wish to cluster samples ("Arrays"), make the same choices. For Clustering Method, click on the "Average linkage" button.
  • Large matrices may take a while to cluster. Completion is indicated by a small "Done clustering" at the bottom left of the window.
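  • If your matrix is taking too long to cluster, you may want to try clustering on tak, our Linux server, using the command-line version of Cluster 3.0. The sample command below log-transforms the data (-l), centers genes on the mean (-cg a), clusters genes and arrays by uncentered correlation (-g 1 -e 1), and uses average linkage (-m a):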
    cluster -f My.matrix.txt -l -cg a -g 1 -e 1 -m a 
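
As a rough illustration of what the "Log transform data" and "Center genes" options do to each row of the matrix, here is a minimal Python sketch. The function names are hypothetical, and treating non-positive or missing values as missing (None) is an assumption made for this example, not a statement about Cluster 3.0's internals.

```python
import math
from statistics import mean, median

def log2_transform(row):
    """Replace each positive value with log2(value); anything else becomes None."""
    return [math.log2(v) if v is not None and v > 0 else None for v in row]

def center_row(row, method="mean"):
    """Subtract the row mean (or median) from every non-missing value."""
    present = [v for v in row if v is not None]
    if not present:
        return list(row)
    center = mean(present) if method == "mean" else median(present)
    return [v - center if v is not None else None for v in row]

# One gene measured in four samples, the last one missing
row = [1.0, 2.0, 4.0, None]
print(center_row(log2_transform(row)))
```

After centering, each gene's values are spread around 0, which is what gives the heatmap its range of positive and negative values.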
    

For creating a heatmap from clustered data, our favorite tool is Java TreeView, an updated (and still free) program that does what the original Eisen TreeView program did.

Basic steps for Java TreeView:

  • File >> Open. Navigate to the CDT file created by Cluster 3.0. A heatmap should appear.
  • Settings >> "Pixel settings". You can adjust the color scale and the contrast (to make the colors better for color-blind people, for example).
  • Use your mouse to select a region of the heatmap on the left, and the subset will appear as a separate heatmap on the right. You can also subset the samples and/or genes by clicking on the dendrogram(s).

Other notes:

  • For creating/viewing a heatmap, the CDT file created by Cluster 3.0 is typically used as input. This CDT file is essentially a spreadsheet (so it can be viewed in Excel), and it links to the other files created by Cluster 3.0 (the GTR and ATR tree files). As a result, if you are giving your cluster file to another person, be sure to include those other files as well.
  • Excel can be used to create a file to view in Java TreeView. If using this method, the first 2 columns must be non-redundant row labels (typically gene symbols). The rest of the columns should be numeric (or blank). Save the file as tab-delimited text.
  • To optimize the look of the heatmap, go to Settings >> "Pixel settings", where you can modify the colors and the height and width of each rectangle in the heatmap.
  • Once you have the look you like, go to Export to create a high-resolution image (Export to Postscript), a low-resolution image (Export to Image), and/or a color bar.
  • A heatmap can also be created on the command line. See the Java TreeView manual for all options, but a sample command is:
    java -jar TreeView.jar -r my_genes_clustered.cdt -x Dendrogram -- -s 10x10 -a 0 -o my_genes_heatmap.png
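
Before opening a hand-made (Excel-exported) file in Java TreeView, it can help to check the layout described in the notes above. The following is a minimal sketch, not part of either tool: the function name is hypothetical, and reading "non-redundant" as "no label repeated within either of the first 2 columns" is an assumption.

```python
import csv
import io

def check_treeview_table(handle, label_columns=2):
    """Return a list of problems found in a tab-delimited matrix (empty if none)."""
    problems = []
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)  # first row holds column names
    if header is None:
        return ["file is empty"]
    seen = [set() for _ in range(label_columns)]
    for line_no, fields in enumerate(reader, start=2):
        # Label columns: flag any label already used in the same column
        for col, label in enumerate(fields[:label_columns]):
            if label in seen[col]:
                problems.append(f"line {line_no}: duplicate label {label!r} in column {col + 1}")
            seen[col].add(label)
        # Data columns: each cell must be numeric or blank
        for value in fields[label_columns:]:
            if value.strip() == "":
                continue
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value {value!r}")
    return problems

example = "ID\tNAME\tS1\tS2\nA\tA\t1.5\t\nB\tB\t-2\t3\n"
print(check_treeview_table(io.StringIO(example)))
```

For a real file, pass an open file handle, for example `open("my_matrix.txt", newline="")`, where the file name is a placeholder.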

More details about clustering

Questions from a scientist, with answers from BaRC

It's worth reading the Cluster 3.0 manual (http://bonsai.hgc.jp/~mdehoon/software/cluster/cluster3.pdf) to get the authors' input on some of these issues.

1 - What to cluster (RPKMs, fold-change relative to a reference?)

  • We always prefer log2 ratios to levels (like RPKM). Ratios provide more information to the clustering algorithm, and the result (as a heatmap) includes clearer visual cues of changes (since one has a range of positive and negative values, instead of just a range of positive values). With ratios, however, you often have a choice of the reference(s), whether it's another sample or a median of all samples.

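    Note: To create a heatmap using z-scores, log2-transform the expression values (e.g. normalized counts, FPKM, etc.) and then calculate the z-scores for each gene. In R, the scale function can be used; it works on columns, so transpose a genes-by-samples matrix before and after scaling.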
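
The two options above (log2 ratios against a reference, or per-gene z-scores of log2 values) can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the pseudocount of 1 added before taking logs (to avoid log2 of 0) is an assumption, not something this SOP prescribes.

```python
import math
from statistics import mean, median, stdev

def log2_ratios(levels, reference=None, pseudocount=1.0):
    """log2 ratio of each sample to a reference level.

    If no reference is given, the median across all samples is used.
    """
    ref = median(levels) if reference is None else reference
    return [math.log2((v + pseudocount) / (ref + pseudocount)) for v in levels]

def zscores(log_values):
    """Per-gene z-scores: subtract the mean, divide by the sample standard deviation."""
    mu = mean(log_values)
    sd = stdev(log_values)
    if sd == 0:
        return [0.0 for _ in log_values]  # a flat gene has no variation to scale
    return [(x - mu) / sd for x in log_values]

levels = [1.0, 3.0, 7.0]  # one gene, three samples (e.g. FPKM)
print(log2_ratios(levels))
print(zscores([math.log2(v + 1.0) for v in levels]))
```

Like R's scale, the sketch uses the sample standard deviation (n - 1 in the denominator).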

2 - How to normalize or scale the data, if necessary?

  • The same as for other analyses of this sort of data, and you probably want to do this before any filtering (so do it on all genes, rather than on selected genes). Log2 transformation of ratios is almost always a good idea.

3 - What distance measures to use?

  • I usually simply use some sort of correlation (like the Cluster 3.0 default of "Correlation (uncentered)"), along with the "average linkage" clustering method (for hierarchical clustering).
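
For intuition, uncentered correlation amounts to the cosine of the angle between two expression vectors (no mean is subtracted first), and average linkage scores two clusters by the mean of all pairwise distances between their members. The sketch below uses hypothetical function names and is not Cluster 3.0's own code.

```python
import math
from itertools import product
from statistics import mean

def uncentered_correlation(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for an all-zero vector")
    return dot / (norm_x * norm_y)

def uncentered_distance(x, y):
    """Distance used for clustering: 1 minus the uncentered correlation."""
    return 1.0 - uncentered_correlation(x, y)

def average_linkage(cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    return mean(uncentered_distance(a, b) for a, b in product(cluster_a, cluster_b))

# Same shape but shifted upward: Pearson correlation would be 1, uncentered is below 1
print(uncentered_correlation([1, 2, 3], [11, 12, 13]))
```

Because the mean is not subtracted, two profiles with the same shape but different offsets are not treated as identical, which is one reason centering the genes first matters.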

4 - Whether to cluster all genes, or filter for genes which vary across conditions?

  • We always use a subset of genes that seem to be doing something interesting, whether the union of all differentially expressed genes, the top n genes after sorting by coefficient of variation (CV) across all conditions, or some other method. Processing all genes often causes the interesting genes to be hidden amidst the more-numerous boring genes.
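
One of the selection methods mentioned above, keeping the top n genes by CV, might look like this in Python. The function names and the toy data are hypothetical; the sketch assumes the CV (standard deviation divided by mean) is computed on unlogged expression levels.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (0 if the mean is 0)."""
    mu = mean(values)
    if mu == 0:
        return 0.0
    return stdev(values) / mu

def top_n_by_cv(matrix, n):
    """Return the n gene names with the highest CV.

    matrix maps gene name -> list of expression levels across samples.
    """
    ranked = sorted(matrix, key=lambda gene: coefficient_of_variation(matrix[gene]),
                    reverse=True)
    return ranked[:n]

# Toy data: three genes, three samples each
matrix = {
    "flat": [10, 10, 10],
    "mild": [9, 10, 11],
    "wild": [1, 10, 19],
}
print(top_n_by_cv(matrix, 2))
```

CV can be inflated for genes with very low mean expression, so a minimum-expression filter beforehand may be worth considering.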

5 - Type of clustering you'd recommend: hierarchical, k-means, or others?

  • Usually hierarchical, at least to start with. The only times we do k-means or self-organizing maps (SOMs) are if we want to split selected genes into groups (and then it takes some playing around to get just the right number of groups). Hierarchical is also easiest because you don't have to decide how many groups you want.

6 - Evaluating clustering results?

  • This may be the trickiest issue (in some ways similar to evaluating differential expression in general). First, some QC -- without looking at individual genes, is the heatmap as expected? If not, do we need to process/normalize the data in a different way or did the experiment just not work? Then, why are we doing the experiment -- is it to identify genes of interest or patterns of interest? The heatmap may help us answer these questions; on the other hand, even if the clustering looks reasonable and dependable, we probably need some real statistics to get a final answer to our question(s).

7 - What to do with biological replicates (average? keep them separate?)

  • We like doing it both ways. Having a column for each replicate can help convince us that our method of selecting genes (such as from differential expression statistics) worked, and it also gives a feeling for the variability of the system being assayed. On the other hand, especially if you're comparing a lot of samples, including replicates takes away from the major story. In that case it may come down to what you want to convey in the figure. Also, as a summary metric we prefer the median, so that one outlying sample doesn't skew the summary the way it would skew a mean.
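
Collapsing replicates to one column per condition with the median could be sketched as follows. The function name, sample names, and grouping are all hypothetical examples.

```python
from statistics import median

def collapse_replicates(row, groups):
    """Summarize one gene's replicate values by the median within each condition.

    row maps sample name -> value; groups maps condition -> list of sample names.
    """
    return {condition: median(row[sample] for sample in samples)
            for condition, samples in groups.items()}

# One gene; the third control replicate is an outlier
row = {"ctl_1": 1.0, "ctl_2": 1.2, "ctl_3": 9.0, "trt_1": 3.0, "trt_2": 5.0}
groups = {"ctl": ["ctl_1", "ctl_2", "ctl_3"], "trt": ["trt_1", "trt_2"]}
print(collapse_replicates(row, groups))
```

In this toy example the control median is 1.2, whereas the mean would be pulled up to about 3.7 by the single outlying replicate.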