wiki:SOPs/barcBakeOff

Using RNA-Seq to quantify gene levels and assay for differential expression

  • Mapping
    • Use TopHat to map short reads to genome of choice with default settings.
    • Supplying a GTF file describing gene models to TopHat can provide better mapping of some spliced reads.
    • Convert BAM to SAM output: samtools view -h -o accepted_hits.sam accepted_hits.bam (since SAM is required for htseq-count)
  • Quantification
    • Use htseq-count to get counts for each gene
      • Include same GTF file describing gene models as was used for mapping – but think carefully about what genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs)
      • Carefully choose the best “mode” to handle reads that don't completely map to exactly one gene
      • Counting can be also performed transcript level but
        • this may be too much information
        • mapping to correct transcript for multiple-transcript genes may be tricky to do well
    • Add a pseudocount (such as 1) to all genes in all samples to
      • prevent log2 ratios that require dividing by 0
      • reduce background count noise
      • reduce problems with statistical methods that don't like 0's
    • Remove the rows at the bottom with descriptions like no_feature, ambiguous, etc.
  • Gene filtering
    • Remove from the analysis any genes with 0 counts (or counts = pseudocounts).
    • Remove counts for any genes we want to classify as contaminants such as ribosomal RNAs (if these are included in the GTF gene annotation file).
  • Statistics for differential expression
    • Biologically meaningful statistics are only obtained with experiments that include biological replication.
    • Unless otherwise noted, these statistical analyses require raw counts, not RPKM values, as input.
    • To prepare the data for analysis, make a counts matrix (as a tab-delimited text file) with genes as rows and samples as columns.
    • Some statistical tests or packages can be applied to non-replicated data (but must be interpreted with caution):
      • cuffDiff (part of the cufflinks package; uses BAM files as input)
      • DESeq
      • DEGSeq
      • Fisher's exact test (with R and/or with BaRC R script)
    • Some statistical tests or packages are designed for experiments with replication:
      • DESeq
      • edgeR
      • baySeq
      • In our hands, all of these methods appear to work fine. Note, however, that the scale of the p-values from baySeq are much different from those of DESeq and edgeR. For example, a gene may have a baySeq p-value of 1e-15 but a DESeq p-value of 1e-300.
    • Scripts with sample code are in the BaRC code repository at
      wi-files1\BaRC_Public\BaRC_code\R (and see the vignettes in the R packages linked above)
  • Identifying differentially expressed genes
    • Use p-values (corrected for FDR) to rank genes to get those that have the highest confidence of being truly differentially expressed.
    • Be wary of actual p-values: some tests output very small p-values that appear to reflect higher confidence than is possible given the amount of replication.
    • How many genes do you predict are truly differentially expressed? Use the answer to this to help determine a reasonable FDR threshold.
    • Fold change thresholds may be used in addition, if desired, to select genes that are changing an amount that is biologically meaningful.
Note: See TracWiki for help on using the wiki.