Clustering a matrix

Questions from a scientist, with answers from BaRC

It's worth reading the Cluster 3.0 manual to get the authors' input on some of these issues.

1 - What to cluster (RPKMs, fold-change relative to a reference?)

  • I always prefer log2 ratios to expression levels (like RPKM). Ratios give the clustering algorithm more information (I think I'm correct in saying), and the resulting heatmap shows changes more clearly. With ratios, however, you have to choose the reference(s), whether that's another sample or the median of all samples.
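As a rough illustration of the "median of all samples" reference, here is a minimal Python sketch that turns one gene's expression levels into log2 ratios relative to that gene's median across samples. The function name, the pseudocount of 1 (used to avoid taking the log of zero), and the example RPKM values are all assumptions made for illustration, not part of any particular pipeline.

```python
import math
from statistics import median

def log2_ratios_vs_median(levels, pseudocount=1.0):
    """Convert one gene's levels to log2 ratios relative to the gene's median.

    levels: non-negative expression values (e.g. RPKM), one per sample.
    pseudocount: added to every value so zeros don't give log2(0) (an assumption).
    """
    shifted = [x + pseudocount for x in levels]
    reference = median(shifted)
    return [math.log2(x / reference) for x in shifted]

# Hypothetical RPKM values for two genes across three samples.
rpkm = {
    "geneA": [3.0, 7.0, 15.0],
    "geneB": [99.0, 99.0, 99.0],
}
ratios = {gene: log2_ratios_vs_median(values) for gene, values in rpkm.items()}
```

A gene that doesn't change (geneB) ends up with ratios of 0 in every sample, regardless of how highly it is expressed, which is part of why ratios make changes easier to see in a heatmap.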

2 - How to normalize or scale the data, if necessary?

  • Normalize the same way as for other analyses of this sort of data, and you probably want to do this before any filtering (so do it on all genes, rather than on selected genes). Log2 transformation of ratios is almost always a good idea.

3 - What distance measures to use?

  • I usually simply use some sort of correlation (like the Cluster 3.0 default), along with the "average linkage" clustering method (for hierarchical clustering).
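To make "correlation plus average linkage" concrete, here is a small pure-Python sketch. It assumes Pearson (centered) correlation and uses 1 - r as the distance; that is one reasonable choice and is not necessarily the exact metric Cluster 3.0 uses by default. The function names and the toy profiles are hypothetical, and the brute-force loop is only meant for a handful of genes.

```python
import math

def pearson_distance(a, b):
    """Return 1 - Pearson correlation between two equal-length profiles.

    Assumes neither profile is constant (a constant profile has zero variance).
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def average_linkage(profiles):
    """Agglomerative clustering where the distance between two clusters is
    the mean of all pairwise distances between their members.

    profiles: dict mapping gene name -> list of values.
    Returns the merges in order as (cluster_1, cluster_2, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(g, h) for g in clusters[i] for h in clusters[j]]
                dist = sum(pearson_distance(profiles[g], profiles[h])
                           for g, h in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical log2-ratio profiles across three samples.
profiles = {
    "up1": [-1.0, 0.0, 1.0],
    "up2": [-2.0, 0.1, 2.0],
    "down1": [1.0, 0.0, -1.0],
}
merges = average_linkage(profiles)
```

Note that up1 and up2 merge first even though their magnitudes differ, because correlation compares the shape of the profiles rather than their absolute values.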

4 - Whether to cluster all genes, or filter for genes which vary across conditions?

  • We always use a subset of genes that seem to be doing something interesting, whether the union of all differentially expressed genes, the top n genes after sorting by coefficient of variation (CV) across all conditions, or some other method. Processing all genes often causes the interesting genes to be hidden amidst the more-numerous boring genes.
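The "top n genes by CV" option could look something like the following Python sketch. It computes CV (standard deviation divided by mean) on expression levels rather than on log ratios, since the mean of log ratios can sit near zero and make the CV unstable; that choice, the function names, and the example values are assumptions for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (0.0 if the mean is zero)."""
    m = mean(values)
    if m == 0:
        return 0.0
    return stdev(values) / abs(m)

def top_n_by_cv(levels, n):
    """Return the names of the n genes with the highest CV, most variable first."""
    ranked = sorted(levels,
                    key=lambda gene: coefficient_of_variation(levels[gene]),
                    reverse=True)
    return ranked[:n]

# Hypothetical expression levels across four conditions.
levels = {
    "flat": [10.0, 10.0, 10.0, 10.0],
    "mild": [10.0, 12.0, 9.0, 11.0],
    "strong": [1.0, 20.0, 2.0, 30.0],
}
selected = top_n_by_cv(levels, 2)
```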

5 - Type of clustering you'd recommend: hierarchical, k-means, or others?

  • Usually hierarchical, at least to start with. The only times I do k-means or SOMs are if we want to split selected genes into groups (and then it takes some playing around to get just the right number of groups). Hierarchical is also easiest because you don't have to decide how many groups you want.

6 - Evaluating clustering results?

  • This may be the trickiest issue (in some ways similar to evaluating differential expression in general). First, some QC -- without looking at individual genes, is the heatmap as expected? If not, do we need to process/normalize the data in a different way or did the experiment just not work? Then, why are we doing the experiment -- is it to identify genes of interest or patterns of interest? The heatmap may help us answer these questions; on the other hand, even if the clustering looks reasonable and dependable, we probably need some real statistics to get a final answer to our question(s).

7 - What to do with biological replicates (average? keep them separate?)

  • I like doing it both ways. Having a column for each replicate can help convince me that my method of selecting genes (such as from differential expression statistics) worked, and I can also get a feeling for variability of the system being assayed. On the other hand, especially if you're comparing a lot of samples, including replicates takes away from the major story. In that case it may come down to what you want to convey in the figure. As far as clustering, I haven't compared doing it both ways to see how much of an effect this has. When I do combine replicates, I prefer the median to the mean, so one outlier sample doesn't skew the result.
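Collapsing replicates by median can be sketched in a few lines of Python. The function name, the condition labels, and the values below are hypothetical; the point is only to show how one outlier replicate affects the mean but not the median.

```python
from statistics import mean, median

def collapse_replicates(values, groups):
    """Collapse one gene's replicate columns to a single median per condition.

    values: one value per column (e.g. log2 ratios).
    groups: the condition label for each column, in the same order.
    """
    by_condition = {}
    for value, condition in zip(values, groups):
        by_condition.setdefault(condition, []).append(value)
    return {condition: median(vals) for condition, vals in by_condition.items()}

# Hypothetical gene with three replicates per condition; the last treated
# replicate is an outlier.
groups = ["ctrl", "ctrl", "ctrl", "treated", "treated", "treated"]
gene_values = [0.1, -0.1, 0.0, 2.0, 2.2, -3.0]

collapsed = collapse_replicates(gene_values, groups)
treated_mean = mean(gene_values[3:])  # pulled toward the outlier, for comparison
```

Here the treated median stays at 2.0, while the treated mean is dragged down to about 0.4 by the single outlier.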

Creating/viewing a heatmap

For creating a heatmap from clustered data, our favorite tool is Java TreeView, an updated (and still free) program that does what the original Eisen TreeView program did.

  • The CDT file created by Cluster 3.0 is typically used as input. This CDT file is essentially a spreadsheet (so can be viewed in Excel), and it links to other files created by Cluster 3.0. As a result, if you are giving your cluster file to another person, be sure to include the other files created by Cluster 3.0.
  • Excel can be used to create a file to view in Java TreeView. If using this method, the first 2 columns must be non-redundant row labels (typically gene symbols). The rest of the columns should be numeric (or blank). Save the file as tab-delimited text.
  • To optimize the look of the heatmap, go to Settings => Pixel Settings, where you can modify colors and the height and width of each rectangle in the heatmap.
  • Once you have the look you like, go to Export to create a high-resolution image (Export to Postscript), a low-resolution image (Export to Image), and/or a color bar.
  • A heatmap can also be created on the command line. See the TreeView manual for all options, but a sample command is:
java -jar TreeView.jar -r my_genes_clustered.cdt -x Dendrogram -- -s 10x10 -a 0 -o my_genes_heatmap.png