wiki:SOPs/rna-seq-diff-expressions

Version 3 (modified by gbell, 12 years ago) ( diff )

--

Using RNA-Seq to quantify gene levels and assay for differential expression

Basic approach

  • For each sample, map reads to genome using splice-aware mapper.
  • Count reads mapped to each gene (or other set of features).
  • Use gene counts to identify differentially expressed genes.

General suggestions

  • Preliminary issues
    • Statistics for all methods require a matrix of counts (positive integer values) for each gene for each sample.
    • Create a tab-delimited matrix of integer counts, with column labels for each sample.
    • Genes with no counts in any sample should generally be removed to permit higher statistical power to identify differential expression.
    • According to Bullard et al., 2010, differential expression analysis is influenced more by the normalization method than by the choice of differential expression statistic.
    • Note that without replication, one cannot make very strong conclusions. High-throughput sequencing, just like every other technology, needs biological replication.
      • One can conclude that certain genes in sample A have a different RNA abundance than in sample B, but the results cannot be generalized.
      • Example, using an extremely precise balance: If Dick weighs more than Sally, we cannot conclude that males weigh more than females because we know nothing about the variability of weights among males and among females. Even if we weighed several individuals together, we'd still be missing information about within-group variability.
    • Sample commands to get raw counts from an alignment file:
      • coverageBed -abam accepted_hits.bam -b transcripts.gtf > transcript.coverage.txt (See the bedTools page for details)
      • htseq-count -m intersection-strict --stranded=no accepted_hits.sam -b transcripts.gff > transcript.coverage.txt (See the htseq-count page for details)
      • In our view, htseq-count is better at handling reads that map to a genome region with overlapping genes.

Step by step analysis

  • Mapping
    • Use TopHat (or another spliced mapper) to map short reads to genome of choice with default settings except
      • Selecting a value for --segment-length equal to half of the read length can greatly help spliced mapping (across an intron) [ex: For 40-nt reads, use '--segment-length 20' ]
      • Supplying a GTF file describing gene models to TopHat can provide better mapping of some spliced reads.
  • Quantification of raw counts
    • Use htseq-count to get counts for each gene
      • Include same GTF file describing gene models as was used for mapping -- but think carefully about what genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs)
      • Carefully choose the best "mode" to handle reads that don't completely map to exactly one gene
      • Counting can be also performed transcript level but this may be too much information
      • htseq-count typically uses a SAM file as input, but a BAM fine can be used in place, for example
        • samtools view accepted_hits.bam | htseq-count -m intersection-strict --stranded=no - gene_models.gtf >| gene_model.counts.txt
        • Note that htseq-count assumes that your reads are strand-specific, so if not, include "--stranded=no" (or half of your reads won't be counted).
        • Note the "-" before the gtf filename; this indicates that the SAM file is coming from standard input (in this case, piped from samtools).
        • For paired-end reads the sam file has to be sorted by read name. You can sort first the BAM file and then run htseq-count like this:
        • bsub "samtools sort -n -m 5000000000 accepted_hits.bam accepted_hitsSortedByname"
        • bsub "samtools view accepted_hitsSortedByname.bam | htseq-count -m intersection-strict --stranded=no - gene_models.gtf >| gene_model.counts.txt"
    • Add a pseudocount (such as 1) to all genes in all samples to
      • prevent log2 ratios that require dividing by 0
      • reduce background count noise
      • reduce problems with statistical methods that don't like 0's
    • Remove the rows at the bottom with descriptions like no_feature, ambiguous, etc.
  • Quantification of FPKM values
    • Use cufflinks
      • to quantify transcripts and genes in a GTF file (ex: bsub cufflinks -G gene_models.gtf accepted_hits.bam)
      • to quantify transcripts, potentially novel, annotated by cufflinks (ex: bsub cufflinks accepted_hits.bam)
      • NOTE: Some genes, although present in a GTF annotation file, may not get quantified by cufflinks. This occurs for genes found in very long regions of overlapping genes (which exceed the default value for --max-bundle-length). When this occurs, the standard err output of cufflinks (contained in the long LSF email when cufflinks is run via 'bsub') will contain the message "Warning: Skipping large bundle." To correct this (or prevent it in the first place), add an argument like '--max-bundle-length 10000000' to the cufflinks command. You may want to compare the list of genes in the GTF file to that of the cufflinks output to verify that they match.
            awk -F"\t" '{print $9}' genes.gtf | awk '{print $2}' | perl -pe 's/\"//g;s/;//g' | sort -u > gtf_genes.txt
            cut -f1 cufflinks_output/genes.fpkm_tracking | grep -v tracking_id > cufflinks_genes.txt
  • Gene filtering
    • Remove from the analysis any genes with 0 counts (or counts = pseudocounts).
    • Remove counts for any genes we want to classify as contaminants such as ribosomal RNAs (if these are included in the GTF gene annotation file).
  • Normalization
    • Many RNA-Seq statistics programs require an input of raw values.
    • Display of count comparisons, however, should include some type of normalization.
    • Expression counts can be transformed by some combination of the following:
      • Total counts (sum of counts for all genes/features)
      • Total number of mapped reads
      • Upper-quartile (75th percentile) normalization (so scaling is to number of reads mapping to 75th percentile feature)
      • Scaling by transcript length (one of the steps in converting raw counts to RPKM)
      • Scaling so total number of counts is equal to mean number of mapped reads across all samples
      • Effective library size (a more complex, but probably more valid method included in programs such as edgeR and DESeq)
    • As with microarray normalization, be aware of the assumptions of each method and choose the method(s) which are most valid with your experiment
    • Origin of recommendation for upper-quartile normalization: Bullard et al., 2010
    • See normalize_DGE_matrix.R for a command-line script for count normalization
      • Methods for above script: mean, median, quantile, percentile, none
      • Outputs also includes before and after image files: boxplots, scatterpots, violin plots
      • Be careful with interpretation of images, as they appear to show that upper-quartile normalization does more bad than good.
      • Sample commands:
        • # Normalize to the 75th percentile
        • R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt percentile=0.75 DGE_sample_input.percentile.0.75.txt
        • # Normalize to the mean
        • R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt mean DGE_sample_input.mean.txt
      • Calculating library size
        • We can do this the most obvious way using the total number of reads in each sample (column of our counts matrix), which is shown below. Many methods for differential expression analysis use more complex methods to calculate what they refer to as something like "effective library size" (such as DESeq's estimateSizeFactors() command), which appears to work better than the more obvious method.
               # Start with RNA_seq_counts, a matrix of counts for each gene (row) in each sample (column)
               # Get total count
               libsize = apply(RNA_seq_counts, 2, sum)
               # By 75th percentile (similar to Bullard J.H., et. al (2010))
               libsize = apply(RNA_seq_counts, 2, function(x) { z <- x[x > 0] ; sum(z[z <= quantile(z, 0.75)]) })
  • Statistics for differential expression
    • Biologically meaningful statistics are only obtained with experiments that include biological replication.
    • Unless otherwise noted, these statistical analyses require raw counts, not RPKM values, as input.
    • To prepare the data for analysis, make a counts matrix (as a tab-delimited text file) with genes as rows and samples as columns.
    • Some statistical tests or packages can be applied to non-replicated data (but must be interpreted with caution):
    • Some statistical tests or packages are designed for experiments with replication:
    • Scripts with sample code are in the BaRC code repository at
      wi-files1\BaRC_Public\BaRC_code\R (and see the vignettes in the R packages linked above)
  • Identifying differentially expressed genes
    • Use p-values (corrected for FDR) to rank genes to get those that have the highest confidence of being truly differentially expressed.
    • Be wary of actual p-values: some tests output very small p-values that appear to reflect higher confidence than is possible given the amount of replication.
    • How many genes do you predict are truly differentially expressed? Use the answer to this to help determine a reasonable FDR threshold.
    • Fold change thresholds may be used in addition, if desired, to select genes that are changing an amount that is biologically meaningful.
Note: See TracWiki for help on using the wiki.