Version 25 (modified by 9 years ago) ( diff ) | ,
---|
Using RNA-Seq to quantify gene levels and assay for differential expression
Basic approach
- For each sample, map reads to genome using splice-aware mapper.
- Count reads mapped to each gene (or other set of features).
- Use gene counts to identify differentially expressed genes.
General suggestions
- Preliminary issues
- Statistics for all methods require a matrix of counts (positive integer values) for each gene for each sample.
- Create a tab-delimited matrix of integer counts, with column labels for each sample.
- Genes with no counts in any sample should generally be removed to permit higher statistical power to identify differential expression.
- According to Bullard et al., 2010, differential expression analysis is influenced more by the normalization method than by the choice of differential expression statistic.
- Note that without replication, one cannot make very strong conclusions. High-throughput sequencing, just like every other technology, needs biological replication.
- One can conclude that certain genes in sample A have a different RNA abundance than in sample B, but the results cannot be generalized.
- Example, using an extremely precise balance: If Dick weighs more than Sally, we cannot conclude that males weigh more than females because we know nothing about the variability of weights among males and among females. Even if we weighed several individuals together, we'd still be missing information about within-group variability.
- Sample commands to get raw counts from an alignment file:
- coverageBed -split -abam accepted_hits.bam -b transcripts.gtf > transcript.coverage.bed (See the bedTools page for details)
- htseq-count -m intersection-strict --stranded=no accepted_hits.sam -b transcripts.gff > transcript.coverage.txt (See the htseq-count page for details)
- In our view, htseq-count is better at handling reads that map to a genome region with overlapping genes.
Step by step analysis
- Mapping
- Use TopHat (or another spliced mapper) to map short reads to the genome of choice.
- See the mapping SOP for more details.
- Quantification of raw counts
- Use htseq-count to get counts for each gene
- Include same GTF file describing gene models as was used for mapping -- but think carefully about what genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs)
- Carefully choose the best "mode" to handle reads that don't completely map to exactly one gene
- Counting can be also performed transcript level but this may be too much information
- htseq-count can accept SAM or BAM (-f option), for example
- htseq-count -f bam -m intersection-strict --stranded=no accepted_hits.bam gene_models.gtf >| gene_model.counts.txt
- Note that htseq-count assumes that your reads are strand-specific, so if not, include "--stranded=no" (or half of your reads won't be counted).
- Note the "-" can be used for input from standard input (stdin)
- For paired-end reads the sam file has to be sorted by read name, or coordinate, eg. bsub "samtools sort -n -o accepted_hits.sortedByName.bam -m 5G -O bam -T temp accepted_hits.bam"
- To request a certain amount of memory and a specific node use bsub -R "rusage[mem=50000]" -m NodeName
- Remove the rows at the bottom with descriptions like no_feature, ambiguous, etc.
- For custom analyses, you can add a pseudocount (such as 1 or another small number) to all genes in all samples to prevent log2 ratios that require dividing by 0 and reduce background count noise -- BUT be aware that some statistical methods (like DESeq) require raw input values without any pseudocounts or normalization.
- Use htseq-count to get counts for each gene
- Quantification of FPKM values
- Use cufflinks
- to quantify transcripts and genes in a GTF file (ex: bsub cufflinks -G gene_models.gtf accepted_hits.bam)
- to quantify transcripts, potentially novel, annotated by cufflinks (ex: bsub cufflinks accepted_hits.bam)
- NOTE: Some genes, although present in a GTF annotation file, may not get quantified by cufflinks. This occurs for genes found in very long regions of overlapping genes (which exceed the default value for --max-bundle-length). When this occurs, the standard err output of cufflinks (contained in the long LSF email when cufflinks is run via 'bsub') will contain the message "Warning: Skipping large bundle." To correct this (or prevent it in the first place), add an argument like '--max-bundle-length 10000000' to the cufflinks command. You may want to compare the list of genes in the GTF file to that of the cufflinks output to verify that they match.
- Use cufflinks
awk -F"\t" '{print $9}' genes.gtf | awk '{print $2}' | perl -pe 's/\"//g;s/;//g' | sort -u > gtf_genes.txt cut -f1 cufflinks_output/genes.fpkm_tracking | grep -v tracking_id > cufflinks_genes.txt
- If you only want to quantify genes in your GTF file use the -G option (instead of -g which will give also transcripts found by Cufflinks and will take away counts from transcripts in your gtf file).
- A second option to get fpkm is to use Cuffquant and Cuffnorm as described here http://cole-trapnell-lab.github.io/monocle-release/getting-started/ and here cuffnorm. The default normalization for Cuffnorm is the same than the normalization performed by DEseq.
- Gene filtering
- Remove from the analysis any genes with 0 counts across all samples. Some analysis tools do this themselves.
- Remove counts for any genes we want to classify as contaminants or simply "not interesting" such as ribosomal RNAs (if these are included in the GTF gene annotation file).
- Normalization
- Many RNA-Seq statistics programs require an input of raw values.
- Display of count comparisons, however, should include some type of normalization.
- Expression counts can be transformed by some combination of the following:
- Total counts (sum of counts for all genes/features)
- Total number of mapped reads
- Upper-quartile (75th percentile) normalization (so scaling is to number of reads mapping to 75th percentile feature)
- Scaling by transcript length (one of the steps in converting raw counts to RPKM)
- Scaling so total number of counts is equal to mean number of mapped reads across all samples
- Effective library size (a more complex, but probably more valid method included in programs such as edgeR and DESeq)
- Differential expression statistics packages can output a matrix of normalized counts (typically using the method recommended for the accompanying statistics), so typically no additional normalization needs to be done (unless one want to perform further normalization, such as using gene length).
- As with microarray normalization, be aware of the assumptions of each method and choose the method(s) which are most valid with your experiment.
- Origin of recommendation for upper-quartile normalization: Bullard et al., 2010
- See normalize_DGE_matrix.R for a command-line script for count normalization
- Methods for above script: mean, median, quantile, percentile, none
- Outputs also includes before and after image files: boxplots, scatterpots, violin plots
- Be careful with interpretation of images, as they appear to show that upper-quartile normalization does more bad than good.
- Sample commands:
- Normalize to the 75th percentile
- R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt percentile=0.75 DGE_sample_input.percentile.0.75.txt
- Normalize to the mean
- R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt mean DGE_sample_input.mean.txt
- Normalize to the 75th percentile
- Calculating library size
- We can do this the most obvious way using the total number of reads in each sample (column of our counts matrix), which is shown below. Many methods for differential expression analysis use more complex methods to calculate what they refer to as something like "effective library size" (such as DESeq's estimateSizeFactors() command), which appears to work better than the more obvious method.
# Start with RNA_seq_counts, a matrix of counts for each gene (row) in each sample (column) # Get total count libsize = apply(RNA_seq_counts, 2, sum) # By 75th percentile (similar to Bullard J.H., et. al (2010)) libsize = apply(RNA_seq_counts, 2, function(x) { z <- x[x > 0] ; sum(z[z <= quantile(z, 0.75)]) })
- Statistics for differential expression
- Biologically meaningful statistics are only obtained with experiments that include biological replication.
- Unless otherwise noted, these statistical analyses require raw counts, not RPKM values, as input.
- To prepare the data for analysis, make a counts matrix (as a tab-delimited text file) with genes as rows and samples as columns.
- Some statistical tests or packages can be applied to non-replicated data (but must be interpreted with a high degree of caution):
- cuffDiff (part of the cufflinks package; uses BAM files as input)
- DESeq | More detail about DESeq
- NOISeq | More detail about NOISeq
- Fisher's exact test (with R and/or with BaRC R script) | More detail about Fisher's exact test
- Some statistical tests or packages are designed for experiments with replication:
- DESeq | More detail about DESeq (This is our preferred method.)
- DESeq2 (We haven't compared this with DESeq very thoroughly.)
- edgeR | More detail about EdgeR
- NOISeq | More detail about NOISeq
- In our hands, all of these methods appear to work fine. Note, however, that the scale of the p-values from baySeq are much different from those of DESeq and edgeR. For example, a gene may have a baySeq p-value of 1e-15 but a DESeq p-value of 1e-300.
- Scripts with sample code are in the BaRC code repository at
wi-files1\BaRC_Public\BaRC_code\R (and see the vignettes in the R packages linked above).
- Identifying differentially expressed genes
- Use p-values (corrected for FDR) to rank genes to get those that have the highest confidence of being truly differentially expressed.
- Be wary of actual p-values: some tests output very small p-values that appear to reflect higher confidence than is possible given the amount of replication.
- How many genes do you predict are truly differentially expressed? Use the answer to this to help determine a reasonable FDR threshold.
- Fold change thresholds may be used in addition, if desired, to select genes that are changing an amount that is biologically meaningful.
- Accounting for a batch effect in a differential expression model
- With edgeR, the comparison can be adjusted for differences between batches (such as date of sample preparation) with an addictive model formula using an equation like
design = model.matrix(~Batch+Treatment).
For example, with a targets file (target.txt) like this:
Sample genotype date KO.1 KO old WT.1 WT old KO.2 KO new WT.2 WT new
Sample R code for reducing the date effect (old vs. new) is
target <- read.delim("target.txt", header=T) genotype <- factor(target$genotype, levels=c("WT", "KO")) mydate <- factor(target$date, levels=c("old", "new")) Design <- model.matrix(~mydate+genotype) colnames(Design) # [1] "(Intercept)" "mydatenew" "genotypeKO" Y2 <- calcNormFactors(Y) # Y is the DGEList Y2 <- estimateGLMCommonDisp(Y2, Design, verbose=TRUE) Y2 <- estimateGLMTrendedDisp(Y2, Design) Y2 <- estimateGLMTagwiseDisp(Y2, Design) Fit <- glmFit(Y2, Design) Lrt <- glmLRT(Fit, coef="genotypeKO") # where 'coef' points to the column name in 'Design'
The data structure Lrt$table contains the values for log2FC (log 2 fold change) and p-value.
Detailed information can be found in the edgeR User Guide (after a search for "batch effects").
- Other
- Review articles:
- Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. - Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D. Genome Biol. 2013 Sep 10;14(9):R95.
- A survey of statistical software for analyzing RNA-seq data - Gao D, Kim J, Kim H, Phang TL, Selby H, Tan AC, Tong T. Hum Genomics. 2010 Oct;5(1):56-60.
- From RNA-seq reads to differential expression results - Oshlack A, Robinson MD, Young MD. Genome Biol. 2010;11(12):220. Epub 2010 Dec 22.
- For more practical information, see the third session of | An introduction to R and Bioconductor: A BaRC Short Course and the BaRC Hot Topic | RNA-seq: A practical guide to the analysis of differential gene expression
- Review articles:
Note:
See TracWiki
for help on using the wiki.