wiki:SOPs/rna-seq-diff-expressions

Version 29 (modified by thiruvil, 9 years ago) ( diff )

--

Using RNA-Seq to quantify gene levels and assay for differential expression

Basic approach

  • For each sample, map reads to genome using splice-aware mapper.
  • Count reads mapped to each gene (or other set of features).
  • Use gene counts to identify differentially expressed genes.

General suggestions

  • Preliminary issues
    • Statistics for all methods require a matrix of counts (positive integer values) for each gene for each sample.
    • Create a tab-delimited matrix of integer counts, with column labels for each sample.
    • Genes with no counts in any sample should generally be removed to permit higher statistical power to identify differential expression.
    • According to Bullard et al., 2010, differential expression analysis is influenced more by the normalization method than by the choice of differential expression statistic.
    • Note that without replication, one cannot make very strong conclusions. High-throughput sequencing, just like every other technology, needs biological replication.
      • One can conclude that certain genes in sample A have a different RNA abundance than in sample B, but the results cannot be generalized.
      • Example, using an extremely precise balance: If Dick weighs more than Sally, we cannot conclude that males weigh more than females because we know nothing about the variability of weights among males and among females. Even if we weighed several individuals together, we'd still be missing information about within-group variability.
    • Sample commands to get raw counts from an alignment file:
      • coverageBed -split -abam accepted_hits.bam -b transcripts.gtf > transcript.coverage.bed (See the bedTools page for details)
      • htseq-count -m intersection-strict --stranded=no accepted_hits.sam -b transcripts.gff > transcript.coverage.txt (See the htseq-count page for details)
      • In our view, htseq-count is better at handling reads that map to a genome region with overlapping genes.

Step by step analysis

  • Mapping
    • Use TopHat (or another spliced mapper) to map short reads to the genome of choice.
    • See the mapping SOP for more details.
  • Quantification of raw counts
    • Typically we use htseq-count to get counts for each gene
      • Include same GTF file describing gene models as was used for mapping -- but think carefully about what genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs)
      • Carefully choose the best "mode" to handle reads that don't completely map to exactly one gene
      • Counting can be also performed transcript level but this may be too much information
      • htseq-count can accept SAM or BAM (-f option), for example
        • htseq-count -f bam -m intersection-strict --stranded=no accepted_hits.bam gene_models.gtf >| gene_model.counts.txt
        • Note that htseq-count assumes that your reads are strand-specific, so if not, include "--stranded=no" (or half of your reads won't be counted).
        • Note the "-" can be used for input from standard input (stdin)
        • For paired-end reads the sam file has to be sorted by read name, or coordinate, eg. bsub "samtools sort -n -o accepted_hits.sortedByName.bam -m 5G -O bam -T temp accepted_hits.bam"
        • To request a certain amount of memory and a specific node use bsub -R "rusage[mem=50000]" -m NodeName
        • Remove the rows at the bottom with descriptions like no_feature, ambiguous, etc.
    • Another tool to use featureCounts, part of the Subread package
      • featureCounts is much faster than htseq-count, but the details of its counting method is quite different from that of htseq-count, especially for paired-end reads
      • See Liao et al., 2014 for details of the method (and comparisons with other counting tools)
      • Sample commands:
        • featureCounts -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam (single-end reads)
        • featureCounts -p -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam (paired-end reads)
    • For some analyses (or for visualization), you can add a pseudocount (such as 1 or another small number) to all genes in all samples to prevent log2 ratios that require dividing by 0 and reduce background count noise -- BUT be aware that some statistical methods (like DESeq) require raw input values without any pseudocounts or normalization.
  • Quantification of FPKM values
    • Use cufflinks
      • to quantify transcripts and genes in a GTF file (ex: bsub cufflinks -G gene_models.gtf accepted_hits.bam)
      • to quantify transcripts, potentially novel, annotated by cufflinks (ex: bsub cufflinks accepted_hits.bam)
      • NOTE: Some genes, although present in a GTF annotation file, may not get quantified by cufflinks. This occurs for genes found in very long regions of overlapping genes (which exceed the default value for --max-bundle-length). When this occurs, the standard err output of cufflinks (contained in the long LSF email when cufflinks is run via 'bsub') will contain the message "Warning: Skipping large bundle." To correct this (or prevent it in the first place), add an argument like '--max-bundle-length 10000000' to the cufflinks command. You may want to compare the list of genes in the GTF file to that of the cufflinks output to verify that they match.
            awk -F"\t" '{print $9}' genes.gtf | awk '{print $2}' | perl -pe 's/\"//g;s/;//g' | sort -u > gtf_genes.txt
            cut -f1 cufflinks_output/genes.fpkm_tracking | grep -v tracking_id > cufflinks_genes.txt

  • If you only want to quantify genes in your GTF file use the -G option (instead of -g which will give also transcripts found by Cufflinks and will take away counts from transcripts in your gtf file).
  • Gene filtering
    • Remove from the analysis any genes with 0 counts across all samples. Some analysis tools do this themselves.
    • Remove counts for any genes we want to classify as contaminants or simply "not interesting" such as ribosomal RNAs (if these are included in the GTF gene annotation file).
  • Normalization
    • Many RNA-Seq statistics programs require an input of raw values.
    • Display of count comparisons, however, should include some type of normalization.
    • Expression counts can be transformed by some combination of the following:
      • Total counts (sum of counts for all genes/features)
      • Total number of mapped reads
      • Upper-quartile (75th percentile) normalization (so scaling is to number of reads mapping to 75th percentile feature)
      • Scaling by transcript length (one of the steps in converting raw counts to RPKM)
      • Scaling so total number of counts is equal to mean number of mapped reads across all samples
      • Effective library size (a more complex, but probably more valid method included in programs such as edgeR and DESeq)
    • Differential expression statistics packages can output a matrix of normalized counts (typically using the method recommended for the accompanying statistics), so typically no additional normalization needs to be done (unless one want to perform further normalization, such as using gene length).
    • As with microarray normalization, be aware of the assumptions of each method and choose the method(s) which are most valid with your experiment.
    • Origin of recommendation for upper-quartile normalization: Bullard et al., 2010
    • See normalize_DGE_matrix.R for a command-line script for count normalization
      • Methods for above script: mean, median, quantile, percentile, none
      • Outputs also includes before and after image files: boxplots, scatterpots, violin plots
      • Be careful with interpretation of images, as they appear to show that upper-quartile normalization does more bad than good.
      • Sample commands:
        • Normalize to the 75th percentile
          • R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt percentile=0.75 DGE_sample_input.percentile.0.75.txt
        • Normalize to the mean
          • R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt mean DGE_sample_input.mean.txt
      • Calculating library size
        • We can do this the most obvious way using the total number of reads in each sample (column of our counts matrix), which is shown below. Many methods for differential expression analysis use more complex methods to calculate what they refer to as something like "effective library size" (such as DESeq's estimateSizeFactors() command), which appears to work better than the more obvious method.
               # Start with RNA_seq_counts, a matrix of counts for each gene (row) in each sample (column)
               # Get total count
               libsize = apply(RNA_seq_counts, 2, sum)
               # By 75th percentile (similar to Bullard J.H., et. al (2010))
               libsize = apply(RNA_seq_counts, 2, function(x) { z <- x[x > 0] ; sum(z[z <= quantile(z, 0.75)]) })
  • Statistics for differential expression
    • Biologically meaningful statistics are only obtained with experiments that include biological replication.
    • Unless otherwise noted, these statistical analyses require raw counts, not RPKM values, as input.
    • To prepare the data for analysis, make a counts matrix (as a tab-delimited text file) with genes as rows and samples as columns.
    • Some statistical tests or packages can be applied to non-replicated data (but must be interpreted with a high degree of caution):
    • Some statistical tests or packages are designed for experiments with replication:
    • Scripts with sample code are in the BaRC code repository at
      wi-files1\BaRC_Public\BaRC_code\R (and see the vignettes in the R packages linked above).
  • Identifying differentially expressed genes
    • Use p-values (corrected for FDR) to rank genes to get those that have the highest confidence of being truly differentially expressed.
    • Be wary of actual p-values: some tests output very small p-values that appear to reflect higher confidence than is possible given the amount of replication.
    • How many genes do you predict are truly differentially expressed? Use the answer to this to help determine a reasonable FDR threshold.
    • Fold change thresholds may be used in addition, if desired, to select genes that are changing an amount that is biologically meaningful.
  • Accounting for a batch effect in a differential expression model
    • With edgeR, the comparison can be adjusted for differences between batches (such as date of sample preparation) with an additive model formula using an equation like

design = model.matrix(~Batch+Treatment).

For example, with a targets file (target.txt) like this:

Sample genotype date
KO.1 KO old
WT.1 WT old
KO.2 KO new
WT.2 WT new

Sample R code for reducing the date effect (old vs. new) is

       target   <- read.delim("target.txt", header=T)
       genotype <- factor(target$genotype, levels=c("WT", "KO"))
       mydate   <- factor(target$date, levels=c("old", "new"))
       Design   <- model.matrix(~date+genotype)
       colnames(Design)
       # [1] "(Intercept)" "mydatenew"   "genotypeKO" 

       Y2  <- calcNormFactors(Y) # Y is the DGEList
       Y2  <- estimateGLMCommonDisp(Y2, Design, verbose=TRUE)
       Y2  <- estimateGLMTrendedDisp(Y2, Design)
       Y2  <- estimateGLMTagwiseDisp(Y2, Design)
       Fit <- glmFit(Y2, Design)
       Lrt <- glmLRT(Fit, coef="genotypeKO") # where 'coef' points to the column name in 'Design'

The data structure Lrt$table contains the values for log2FC (log 2 fold change) and p-value.

Detailed information can be found in the edgeR User Guide (after a search for "batch effects").

Note: See TracWiki for help on using the wiki.