wiki:SOPs/diff_rnaSeq

Preliminary issues

  • Statistics for all methods require a matrix of counts (positive integer values) for each gene for each sample.
  • Create a tab-delimited matrix of integer counts, with column labels for each sample.
  • Genes with no counts in any sample should generally be removed to permit higher statistical power to identify differential expression.
  • According to Bullard et al., 2010, differential expression analysis is influenced more by the normalization method than by the choice of differential expression statistic.
  • Note that without replication, one cannot make very strong conclusions. High-throughput sequencing, just like every other technology, needs biological replication.
    • One can conclude that certain genes in sample A have a different RNA abundance than in sample B, but the results cannot be generalized.
    • Example, using an extremely precise balance: If Dick weighs more than Sally, we cannot conclude that males weigh more than females because we know nothing about the variability of weights among males and among females. Even if we weighed several individuals together, we'd still be missing information about within-group variability.
  • Sample commands to get raw counts from an alignment file:
    • coverageBed -split -abam accepted_hits.bam -b transcripts.gtf > transcript.coverage.bed (See the bedTools page for details)
    • htseq-count -m intersection-strict -s no accepted_hits.sam -b transcripts.gff > transcript.coverage.txt (See the htseq-count page for details)
    • In our view, htseq-count is better at handling reads that map to a genome region with overlapping genes.

Calculating library size

We can do this the most obvious way using the total number of reads in each sample (column of our counts matrix), which is shown below. Many methods for differential expression analysis use more complex methods to calculate what they refer to as something like "effective library size" (such as DESeq's estimateSizeFactors() command), which appears to work better than the more obvious method.

# Start with RNA_seq_counts, a matrix of counts for each gene (row) in each sample (column)

# Get total count
libsize = apply(RNA_seq_counts, 2, sum)

# By 75th percentile (similar to Bullard J.H., et. al (2010))
libsize = apply(RNA_seq_counts, 2, function(x) { z <- x[x > 0] ; sum(z[z <= quantile(z, 0.75)]) })

Method 1 (using edgeR; only for experiments with replication)

  • See the edgeR home page for official documentation
  • Algorithm comes from lab of main developer of limma, Gordon Smyth.
  • Input for edgeR should be a matrix of counts, not RPKM values.
    • From the edgeR manual: "RPKM values should not be used for assessing differential expression of genes between samples in edgeR. We use the raw counts, because the methods implemented in edgeR are based on the negative binomial distribution, a discrete distribution."
  • See get_DE_genes_with_edgeR.R for a command-line script to use edgeR
  • Major edgeR upgrade (April 2010?) required a change from version 0.1 to 0.2 of above script.
  • Outputs also includes image files: boxplot, MA plot, volcano plot
  • Sample command:
    • bsub "R --vanilla < get_DE_genes_with_edgeR.R edgeR_sample_input.txt 2 5 edgeR_sample_output.txt"
  • In running edgeR, the parameter prior.n should be chosen where the default value is 10. prior.n determines the amount of smoothing tagwise dispersion towards the common dispersion.
    The suggested method for choosing prior.n is to ensure prior.n * df is approx 50 where degrees of freedom (df) = lib. size - number of groups. prion.n should generally be greater than 5. For more details see section 6.4 edgeR Manual

Method 2 (using DESeq; for experiments with or without replication)

  • See the DESeq home page for official documentation
  • Package summary: "Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution"
  • From the DESeq publication: "DESeq owes its basic idea to edgeR, yet differs in several aspects."
  • Input is also a matrix of counts, with gene identifiers used as row names.
  • Sample code to use with replication:

 # Use package
 library(DESeq)
 # Read counts
 countsTable = read.delim("Gene_counts.txt")
 # Describe groups
 groups = c(rep("C",3), rep("T", 3))
 # Make a CountDataSet
 cds = newCountDataSet(countsTable, groups)
 # Estimate effective library size
 cds = estimateSizeFactors(cds)
 # Core assumption of the method: variance is a function of expression level
 # Estimate variance for each gene
 cds = estimateVarianceFunctions(cds)
 # Do stats based on a negative binomial distribution
 results = nbinomTest(cds, "T", "C")
 write.table(results, file="DESeq_output.txt", sep="\t", quote=F)
  • Sample code to use without replication:
# Show that all samples are different
 conds = c("a", "b", "c", "d", "e", "f")
  • Then apply the same code as above except for estimateVarianceFunctions():
    • cds = estimateVarianceFunctions(cds, method='blind')
  • Make desired comparison
    • results = nbinomTest(cds, "a", "b")

Method 3 (Fisher's exact test; for experiments without replication)

  • Fisher's exact test can be used on a 2x2 (or larger) contingency matrix to test dependency between rows and columns
  • For expression counts, the 4 values are
    • number of counts for Gene A in Sample 1
    • number of counts not for Gene A in Sample 1
    • number of counts for Gene A in Sample 2
    • number of counts not for Gene A in Sample 2
  • A test is performed for every gene of interest.
  • This was a popular method in earlier SAGE expression counts and is also tested in Bullard et al., 2010.
  • We don't yet have a command-line script to perform this on 2 samples, followed by FDR correction.
  • As with any method not taking advantage of biological replication, any conclusions must be treated with care (and not taken too seriously).

Normalizing an expression matrix

  • Most RNA-Seq statistical analysis requires raw counts as input, and the statistical models take differences in library sizes into account.
  • To summaries of RNA-Seq experiments, however, library sizes should be taken into account, either through the use of statistical packages or scaling/normalization yourself.
  • Expression counts can be transformed by some combination of the following:
    • Upper-quartile (75th percentile) normalization
    • Quantile normalization
    • Scaling so total number of counts is equal to mean number of mapped reads across all samples
    • Scaling by transcript length (one of the steps in converting raw counts to RPKM)
  • As with microarray normalization, be aware of the assumptions of each method and choose the method(s) which are most valid with your experiment
  • Origin of recommendation for upper-quartile normalization: Bullard et al., 2010
  • Methods for above script: mean, median, quantile, percentile, none
  • Outputs also includes before and after image files: boxplots, scatterpots, violin plots
  • Be careful with interpretation of images, as they appear to show that upper-quartile normalization does more bad than good.
  • Sample commands:
    • # Normalize to the 75th percentile
    • R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt percentile=0.75 DGE_sample_input.percentile.0.75.txt
    • # Normalize to the mean
    • R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt mean DGE_sample_input.mean.txt

Other methods

Note: See TracWiki for help on using the wiki.