Using RNA-Seq to quantify gene levels and assay for differential expression

Basic approach

  • For each sample, map reads to genome using splice-aware mapper.
  • Count reads mapped to each gene (or other set of features).
  • Use gene counts to identify differentially expressed genes.

General suggestions

  • Preliminary issues
    • Statistics for all methods require a matrix of counts (positive integer values) for each gene for each sample.
    • Create a tab-delimited matrix of integer counts, with column labels for each sample.
    • Genes with no counts in any sample should generally be removed to permit higher statistical power to identify differential expression.
    • According to Bullard et al., 2010, differential expression analysis is influenced more by the normalization method than by the choice of differential expression statistic.
    • Note that without replication, one cannot make very strong conclusions. High-throughput sequencing, just like every other technology, needs biological replication.
      • One can conclude that certain genes in sample A have a different RNA abundance than in sample B, but the results cannot be generalized.
      • Example, using an extremely precise balance: If Dick weighs more than Sally, we cannot conclude that males weigh more than females because we know nothing about the variability of weights among males and among females. Even if we weighed several individuals together, we'd still be missing information about within-group variability.
    • Sample commands to get raw counts from an alignment file:
      • coverageBed -split -abam accepted_hits.bam -b transcripts.gtf > transcript.coverage.bed (See the bedTools coverage page for details)
      • htseq-count -m intersection-strict --stranded=no accepted_hits.sam -b transcripts.gff > transcript.coverage.txt (See the htseq-count page for details)
      • In our view, htseq-count is better at handling reads that map to a genome region with overlapping genes.

Step by step analysis

  • Mapping
    • Use STAR or another spliced mapper to map short reads to the genome of choice.
    • See our mapping SOP for more details.
  • Quantification of raw counts
  • Is your sequencing library stranded or unstranded? This information is needed to help these tools accurately count features. Strandedness of some library prep methods:
    • TruSeq Stranded mRNA Kits ("TruSeqStrandedPolyA") reads are reverse stranded (stranded in the reverse direction relative to the transcript orientation).
    • SMART-Seq v4 Ultra Low Input RNA Kit ("SMARTerUltra-lowPOLYA-V4") reads are unstranded.
    • KAPA RNA HyperPrep Kits ("KAPAHyperPrepmRNA") reads are reverse stranded.
  • See SAMBAMqc (and/or look at mapped reads in a genome browser) to determine or verify strandedness
  • Currently our favorite tool for this is featureCounts, part of the Subread package.
    • featureCounts is much faster than htseq-count, but the details of its counting method is quite different from that of htseq-count, especially for paired-end reads
    • See Liao et al., 2014 for details of the method (and comparisons with other counting tools)
    • featureCounts needs the paired-read BAM file to be sorted by read ID, but if it isn't, it'll do the sorting.
    • Sample commands:
      # single-end reads (unstranded)
      featureCounts -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam
      # single-end reads (forward stranded)
      featureCounts -s 1 -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam
      # single-end reads (reverse stranded)
      featureCounts -s 2 -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam
      # paired-end reads (unstranded)
      featureCounts -p -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam
      # paired-end reads (forward stranded)
      featureCounts -p -s 1 -a gene_annotations.gtf -o MySamples.featureCounts.txt *sortedByName.bam
      # paired-end reads (reverse stranded)
      featureCounts -p -s 2 -a gene_annotations.gtf -o MySamples.featureCounts.txt *sortedByName.bam
  • htseq-count works fine to get counts for each gene, but it's quite slow.
    • Include same GTF file describing gene models as was used for mapping -- but think carefully about what genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs)
    • Carefully choose the best "mode" to handle reads that don't completely map to exactly one gene.
    • Counting can be also performed transcript level but this may be too much information.
    • htseq-count can accept SAM or BAM (-f option), for example
#Note that htseq-count assumes that your reads are strand-specific; default is --stranded=yes
#If your reads are not stranded, use "--stranded=no" (or half of your reads won't be counted).
#If they're stranded and in the opposite direction from the transcript, use "--stranded=reverse".

htseq-count -f bam -m intersection-strict --stranded=reverse MySample.accepted_hits.sortedByName.bam gene_models.gtf > MySample.counts.txt

#Examine to see the last rows at the bottom with descriptions like no_feature, ambiguous, etc. 
#if too many reads are thrown out.  Otherwise, these rows can be removed for downstream analysis.

#For PE reads the bam files needs to be sorted by name (default for htseq-count), eg.
#bsub "samtools sort -n -o accepted_hits.sortedByName.bam -m 5G -O bam -T temp accepted_hits.bam"
#If the bam file is sorted by coordinate you may try htseq-count -r option, eg. -r pos , however, 
#this may not always work (htseq-count throws numerous errors).

#To request a certain amount of memory and a specific node use bsub -R "rusage[mem=50000]" -m NodeName
  • NOTE:
    • Both htseq-count and featureCounts ignore multi-mapped reads (ie. these will not get counted) by default. In featureCounts use -M option to count multi-mapped reads, if needed.
    • Summary metrics reported in both htseq-count and featureCounts is with respect to number of records (ie. lines) in the bam file, to summarize by reads further parsing/processing may be needed: extra information can be obtained from i) htseq-count use -o option and ii) featureCounts use -R option.
  • For some analyses (or for visualization), you can add a pseudocount (such as 1 or another small number) to all genes in all samples to prevent log2 ratios that require dividing by 0 and reduce background count noise -- BUT be aware that some statistical methods (like DESeq2) require raw input values without any pseudocounts or normalization.
  • Quantification by FPKM (Fragments Per Kilobase of transcript per Million mapped reads)
  • Method 1: Use cufflinks
    • This is the traditional method.
    • To quantify transcripts and genes in a GTF file, use a command like
      bsub cufflinks -G gene_models.gtf accepted_hits.bam
    • To quantify transcripts, potentially novel, annotated by cufflinks, use a command like
      bsub cufflinks accepted_hits.bam
    • Gene-level FPKM values are calculated by taking the sum of all transcript FPKMs for a gene. As a result, no "gene length" needs to be calculated.
    • NOTE: Some genes, although present in a GTF annotation file, may not get quantified by cufflinks. This occurs for genes found in very long regions of overlapping genes (which exceed the default value for --max-bundle-length). When this occurs, the standard err output of cufflinks (contained in the long LSF email when cufflinks is run via 'bsub') will contain the message "Warning: Skipping large bundle." To correct this (or prevent it in the first place), add an argument like '--max-bundle-length 10000000' to the cufflinks command. You may want to compare the list of genes in the GTF file to that of the cufflinks output to verify that they match.
    • If you only want to quantify genes in your GTF file use the -G option (instead of -g which will give also transcripts found by Cufflinks and will take away counts from transcripts in your gtf file).
                  awk -F"\t" '{print $9}' genes.gtf | awk '{print $2}' | perl -pe 's/\"//g;s/;//g' | sort -u > gtf_genes.txt
                  cut -f1 cufflinks_output/genes.fpkm_tracking | grep -v tracking_id > cufflinks_genes.txt

  • Method 2: Use the fpkm() function from DESeq2. This method requires raw counts and any measure of gene length (such as mean or median length of transcripts of gene) that one needs to produce independently (or create a GRanges data structure). It can produce (raw) FPKMs [using fpkm(... robust = FALSE)] and normalized FPKMs [using fpkm(... robust = TRUE)].
  • Method 3: Use Cuffquant to get normalized FPKMs, as described with monocle and Cuffnorm. The default normalization for Cuffnorm is the same as the normalization performed by DESeq.
  • Method 4: Use featureCounts to get gene lengths
    • With gene-level counts and gene lengths, one can use a spreadsheet or statistics program to calculate FPKM.
    • "Gene length" is defined by the length of the non-redundant overlap of all exons of all transcripts of that gene. Typically this method overestimates gene length since the length of a gene is by definition at least as long as every transcript defining the gene.
  • Gene filtering
    • Remove from the analysis any genes with 0 counts across all samples. Some analysis tools do this themselves.
    • Remove counts for any genes we want to classify as contaminants or simply "not interesting" such as ribosomal RNAs (if these are included in the GTF gene annotation file).
  • Normalization
    • Many RNA-Seq statistics programs require an input of raw values.
    • Display of count comparisons, however, should include some type of normalization.
    • Expression counts can be transformed by some combination of the following:
      • Total counts (sum of counts for all genes/features)
      • Total number of mapped reads
      • Upper-quartile (75th percentile) normalization (so scaling is to number of reads mapping to 75th percentile feature)
      • Scaling by transcript length (one of the steps in converting raw counts to RPKM)
      • Scaling so total number of counts is equal to mean number of mapped reads across all samples
      • Effective library size (a more complex, but probably more valid method included in programs such as edgeR and DESeq)
    • Differential expression statistics packages can output a matrix of normalized counts (typically using the method recommended for the accompanying statistics), so typically no additional normalization needs to be done (unless one want to perform further normalization, such as using gene length).
    • As with microarray normalization, be aware of the assumptions of each method and choose the method(s) which are most valid with your experiment.
    • Origin of recommendation for upper-quartile normalization: Bullard et al., 2010
    • Use DESeq, DESeq2, or see normalize_DGE_matrix.R for a command-line script for count normalization
      • Methods for above script: mean, median, quantile, percentile, none
      • Outputs also includes before and after image files: boxplots, scatterpots, violin plots
      • Be careful with interpretation of images, as they appear to show that upper-quartile normalization does more bad than good.
      • Sample commands:
        • Normalize to the 75th percentile
          • R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt percentile=0.75 DGE_sample_input.percentile.0.75.txt
        • Normalize to the mean
          • R --vanilla < normalize_DGE_matrix.R DGE_sample_input.txt mean DGE_sample_input.mean.txt
      • Calculating library size
        • We can do this the most obvious way using the total number of reads in each sample (column of our counts matrix), which is shown below. Many methods for differential expression analysis use more complex methods to calculate what they refer to as something like "effective library size" (such as DESeq's estimateSizeFactors() command), which appears to work better than the more obvious method.
               # Start with RNA_seq_counts, a matrix of counts for each gene (row) in each sample (column)
               # Get total count
               libsize = apply(RNA_seq_counts, 2, sum)
               # By 75th percentile (similar to Bullard J.H., et. al (2010))
               libsize = apply(RNA_seq_counts, 2, function(x) { z <- x[x > 0] ; sum(z[z <= quantile(z, 0.75)]) })
  • Statistics for differential expression
    • Biologically meaningful statistics are only obtained with experiments that include biological replication.
    • Unless otherwise noted, these statistical analyses require raw counts, not RPKM values, as input.
    • To prepare the data for analysis, make a counts matrix (as a tab-delimited text file) with genes as rows and samples as columns.
    • Some statistical tests or packages can be applied to non-replicated data (but must be interpreted with a high degree of caution):
    • Some statistical tests or packages are designed for experiments with replication:
    • In our hands, all of these methods appear to work fine. Note, however, that the scale of the p-values from baySeq are much different from those of DESeq and edgeR. For example, a gene may have a baySeq p-value of 1e-15 but a DESeq p-value of 1e-300.
    • Technical replicates can be collapsed (eg. summed or use collapseReplicates()in DESeq2)
  • Scripts with sample code are in the BaRC code repository at
    wi-files1\BaRC_Public\BaRC_code\R (and see the vignettes in the R packages linked above).

  • Identifying differentially expressed genes
    • Use p-values (corrected for FDR) to rank genes to get those that have the highest confidence of being truly differentially expressed.
    • Be wary of actual p-values: some tests output very small p-values that appear to reflect higher confidence than is possible given the amount of replication.
    • How many genes do you predict are truly differentially expressed? Use the answer to this to help determine a reasonable FDR threshold.
    • Fold change thresholds may be used in addition, if desired, to select genes that are changing an amount that is biologically meaningful.
  • Accounting for a batch effect in a differential expression model
    • With DESeq2, the comparison can be adjusted for differences between batches (such as date of sample preparation) with an additive model formula using an equation like dds = DESeqDataSet(se, design = ~ batch + condition).
    • With edgeR, the comparison can be adjusted for differences between batches (such as date of sample preparation) with an additive model formula using an equation like

design = model.matrix(~Batch+Treatment).

For example, with a targets file (target.txt) like this:

Sample genotype date
KO.1 KO old
WT.1 WT old
KO.2 KO new
WT.2 WT new

Sample R code for reducing the date effect (old vs. new) is

    target   <- read.delim("target.txt", header=T)
    genotype <- factor(target$genotype, levels=c("WT", "KO"))
    mydate   <- factor(target$date, levels=c("old", "new"))
    Design   <- model.matrix(~date+genotype)
    # [1] "(Intercept)" "mydatenew"   "genotypeKO" 

    Y2  <- calcNormFactors(Y) # Y is the DGEList
    Y2  <- estimateGLMCommonDisp(Y2, Design, verbose=TRUE)
    Y2  <- estimateGLMTrendedDisp(Y2, Design)
    Y2  <- estimateGLMTagwiseDisp(Y2, Design)
    Fit <- glmFit(Y2, Design)
    Lrt <- glmLRT(Fit, coef="genotypeKO") # where 'coef' points to the column name in 'Design'

The data structure Lrt$table contains the values for log2FC (log 2 fold change) and p-value.

Detailed information can be found in the edgeR User Guide (after a search for "batch effects").

  • To print out batch-effect corrected CPM or other normalized values from DESeq, DESeq2, or edgeR, one can use the limma removeBatchEffect() command
    # Input needs to be log-transformed values
    logCPM <- removeBatchEffect(log2(cpm.out), batch=batch)  #where batch is based on your design

Others: Alignment-free

  • k-mer indexing/counting:
    • Sailfish : ultrafast isoform abundance estimation from RNA-seq reads
    • Salmon : successor of Sailfish
    • Kallisto: quantifying abundances of transcripts from RNA-Seq data
    • Sleuth: assay differential expression after quantification (eg. from Kallisto)