=== Preliminary issues === * Statistics for all methods require a matrix of counts (positive integer values) for each gene for each sample. * Create a tab-delimited matrix of integer counts, with column labels for each sample. * Genes with no counts in any sample should generally be removed to permit higher statistical power to identify differential expression. * According to [[http://www.ncbi.nlm.nih.gov/pubmed/20167110|Bullard et al., 2010]], differential expression analysis is influenced more by the normalization method than by the choice of differential expression statistic. * Note that without replication, one cannot make very strong conclusions. High-throughput sequencing, just like every other technology, needs biological replication. * One can conclude that certain genes in sample A have a different RNA abundance than in sample B, but the results cannot be generalized. * Example, using an extremely precise balance: If Dick weighs more than Sally, we cannot conclude that males weigh more than females because we know nothing about the variability of weights among males and among females. Even if we weighed several individuals together, we'd still be missing information about within-group variability. * Sample commands to get raw counts from an alignment file: * ''coverageBed -abam accepted_hits.bam -b transcripts.gtf > transcript.coverage.txt'' (See the [[http://code.google.com/p/bedtools/wiki/Usage#coverageBed|bedTools]] page for details) * ''htseq-count -m intersection-strict -s no accepted_hits.sam -b transcripts.gff > transcript.coverage.txt'' (See the [[http://www-huber.embl.de/users/anders/HTSeq/doc/count.html|htseq-count]] page for details) * In our view, htseq-count is better at handling reads that map to a genome region with overlapping genes. === Calculating library size === We can do this the most obvious way using the total number of reads in each sample (column of our counts matrix), which is shown below. Many methods for differential expression analysis use more complex methods to calculate what they refer to as something like "effective library size" (such as DESeq's estimateSizeFactors() command), which appears to work better than the more obvious method. {{{ # Start with RNA_seq_counts, a matrix of counts for each gene (row) in each sample (column) # Get total count libsize = apply(RNA_seq_counts, 2, sum) # By 75th percentile (similar to Bullard J.H., et. al (2010)) libsize = apply(RNA_seq_counts, 2, function(x) { z <- x[x > 0] ; sum(z[z <= quantile(z, 0.75)]) }) }}} === Method 1 (using edgeR; only for experiments with replication) === * See [[http://www.bioconductor.org/packages/devel/bioc/html/edgeR.html|the edgeR home page]] for official documentation * Algorithm comes from lab of main developer of limma, Gordon Smyth. * Input for edgeR should be a matrix of counts, not RPKM values. * From the edgeR manual: "RPKM values should not be used for assessing differential expression of genes between samples in edgeR. We use the raw counts, because the methods implemented in edgeR are based on the negative binomial distribution, a discrete distribution." * See [[http://jura.wi.mit.edu/bio/scripts/R/get_DE_genes_with_edgeR.R|get_DE_genes_with_edgeR.R]] for a command-line script to use edgeR * Major edgeR upgrade (April 2010?) required a change from version 0.1 to 0.2 of above script. * Outputs also includes image files: boxplot, MA plot, volcano plot * Sample command: * ''bsub "R --vanilla < get_DE_genes_with_edgeR.R edgeR_sample_input.txt 2 5 edgeR_sample_output.txt"'' * In running edgeR, the parameter //prior.n// should be chosen where the default value is 10. //prior.n// determines the amount of smoothing tagwise dispersion towards the common dispersion.\\ The suggested method for choosing //prior.n// is to ensure //prior.n * df// is approx 50 where degrees of freedom (df) = lib. size - number of groups. //prion.n// should generally be greater than 5. For more details see section 6.4 [[http://bioconductor.org/packages/2.7/bioc/vignettes/edgeR/inst/doc/edgeR.pdf|edgeR Manual]] === Method 2 (using DESeq; for experiments with or without replication) === * See [[http://www.bioconductor.org/packages/devel/bioc/html/DESeq.html|the DESeq home page]] for official documentation * Package summary: "Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution" * From the DESeq publication: "DESeq owes its basic idea to edgeR, yet differs in several aspects." * Input is also a matrix of counts, with gene identifiers used as row names. * Sample code to use __with__ replication: {{{ # Use package library(DESeq) # Read counts countsTable = read.delim("Gene_counts.txt") # Describe groups groups = c(rep("C",3), rep("T", 3)) # Make a CountDataSet cds = newCountDataSet(countsTable, groups) # Estimate effective library size cds = estimateSizeFactors(cds) # Core assumption of the method: variance is a function of expression level # Estimate variance for each gene cds = estimateVarianceFunctions(cds) # Do stats based on a negative binomial distribution results = nbinomTest(cds, "T", "C") write.table(results, file="DESeq_output.txt", sep="\t", quote=F) }}} * Sample code to use __without__ replication: {{{ # Show that all samples are different conds = c("a", "b", "c", "d", "e", "f") }}} * Then apply the same code as above except for ''estimateVarianceFunctions()'': * ''cds = estimateVarianceFunctions(cds, method='blind')'' * Make desired comparison * ''results = nbinomTest(cds, "a", "b")'' === Method 3 (Fisher's exact test; for experiments without replication) === * Fisher's exact test can be used on a 2x2 (or larger) contingency matrix to test dependency between rows and columns * For expression counts, the 4 values are * number of counts for Gene A in Sample 1 * number of counts not for Gene A in Sample 1 * number of counts for Gene A in Sample 2 * number of counts not for Gene A in Sample 2 * A test is performed for every gene of interest. * This was a popular method in earlier SAGE expression counts and is also tested in [[http://www.ncbi.nlm.nih.gov/pubmed/20167110|Bullard et al., 2010]]. * We don't yet have a command-line script to perform this on 2 samples, followed by FDR correction. * As with any method not taking advantage of biological replication, any conclusions must be treated with care (and not taken too seriously). === Normalizing an expression matrix === * Most RNA-Seq statistical analysis requires raw counts as input, and the statistical models take differences in library sizes into account. * To summaries of RNA-Seq experiments, however, library sizes should be taken into account, either through the use of statistical packages or scaling/normalization yourself. * Expression counts can be transformed by some combination of the following: * Upper-quartile (75th percentile) normalization * Quantile normalization * Scaling so total number of counts is equal to mean number of mapped reads across all samples * Scaling by transcript length (one of the steps in converting raw counts to RPKM) * As with microarray normalization, be aware of the assumptions of each method and choose the method(s) which are most valid with your experiment * Origin of recommendation for upper-quartile normalization: [[http://www.ncbi.nlm.nih.gov/pubmed/20167110|Bullard et al., 2010]] * See [[http://jura.wi.mit.edu/bio/scripts/R/normalize_DGE_matrix.R|normalize_DGE_matrix.R]] for a command-line script for count normalization * Methods for above script: mean, median, quantile, percentile, none * Outputs also includes before and after image files: boxplots, scatterpots, violin plots * Be careful with interpretation of images, as they appear to show that upper-quartile normalization does more bad than good. * Sample commands: * ''# Normalize to the 75th percentile'' * ''R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt percentile=0.75 DGE_sample_input.percentile.0.75.txt'' * ''# Normalize to the mean'' * ''R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt mean DGE_sample_input.mean.txt'' === Other methods === * Review articles: * [[http://www.ncbi.nlm.nih.gov/pubmed/21106489|A survey of statistical software for analysing RNA-seq data]] - Gao D, Kim J, Kim H, Phang TL, Selby H, Tan AC, Tong T. Hum Genomics. 2010 Oct;5(1):56-60. * [[http://www.ncbi.nlm.nih.gov/pubmed/21176179|From RNA-seq reads to differential expression results]] - Oshlack A, Robinson MD, Young MD. Genome Biol. 2010;11(12):220. Epub 2010 Dec 22. * [[http://www.ncbi.nlm.nih.gov/pubmed/20698981|baySeq: empirical Bayesian methods for identifying differential expression in sequence count data]] - Hardcastle TJ, Kelly KA. BMC Bioinformatics. 2010 Aug 10;11:422. * [[http://www.ncbi.nlm.nih.gov/pubmed/20979621|DESeq ("Differential expression analysis for sequence count data")]] - Anders S, Huber W. Genome Biol. 2010;11(10):R106. Epub 2010 Oct 27. * [[http://bioinfo.cipf.es/noiseq/doku.php?id=tutorial|NOISeq]] * For more practical information, see the third session of [http://jura.wi.mit.edu/bio/education/R2011/ | An introduction to R and Bioconductor: A BaRC Short Course] and the hot topic [http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNAseqDE_Dec2011.pdf | RNA-seq: A practical guide to the analysis of differential gene expression ] * You may also want to check this SOP: [wiki:SOPs/rna-seq-diff-expressions/ Using RNA-Seq to quantify gene levels and assay for differential expression]