== Using RNA-Seq to quantify gene levels and assay for differential expression ==

* Mapping
 * Use [http://tophat.cbcb.umd.edu/manual.html TopHat] to map short reads to genome of choice with default settings.
 * Supplying a GTF file describing gene models to TopHat can provide better mapping of some spliced reads.
 * Convert BAM to SAM output: samtools view -h -o accepted_hits.sam accepted_hits.bam (since SAM is required for htseq-count)


* Quantification
 * Use [http://www-huber.embl.de/users/anders/HTSeq/doc/count.html htseq-count] to get counts for each gene
  * Include same GTF file describing gene models as was used for mapping – but think carefully about what genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs)
  * Carefully choose the best “mode” to handle reads that don't completely map to exactly one gene
  * Counting can be also performed transcript level but
   * this may be too much information
   * mapping to correct transcript for multiple-transcript genes may be tricky to do well
 * Add a pseudocount (such as 1) to all genes in all samples to
  * prevent log2 ratios that require dividing by 0
  * reduce background count noise
  * reduce problems with statistical methods that don't like 0's
 * Remove the rows at the bottom with descriptions like no_feature, ambiguous, etc.


* Gene filtering
 * Remove from the analysis any genes with 0 counts (or counts = pseudocounts).
 * Remove counts for any genes we want to classify as contaminants such as ribosomal RNAs (if these are included in the GTF gene annotation file).


* Statistics for differential expression
 * Biologically meaningful statistics are only obtained with experiments that include biological replication.
 * Unless otherwise noted, these statistical analyses require raw counts, not RPKM values, as input.
 * To prepare the data for analysis, make a counts matrix (as a tab-delimited text file) with genes as rows and samples as columns.
 * Some statistical tests or packages can be applied to non-replicated data (but must be interpreted with caution):
  * [http://cufflinks.cbcb.umd.edu/manual.html#cuffdiff cuffDiff] (part of the cufflinks package; uses BAM files as input)
  * [http://www.bioconductor.org/packages/release/bioc/html/DESeq.html DESeq]
  * [http://www.bioconductor.org/packages/release/bioc/html/DEGseq.html DEGSeq]
  * Fisher's exact test (with R and/or with BaRC R script)
 * Some statistical tests or packages are designed for experiments with replication:
  * [http://www.bioconductor.org/packages/release/bioc/html/DESeq.html DESeq]
  * [http://www.bioconductor.org/packages/release/bioc/html/edgeR.html edgeR]
  * [http://www.bioconductor.org/packages/release/bioc/html/baySeq.html baySeq]
  * In our hands, all of these methods appear to work fine. Note, however, that the scale of the p-values from baySeq are much different from those of DESeq and edgeR. For example, a gene may have a baySeq p-value of 1e-15 but a DESeq p-value of 1e-300.
 * Scripts with sample code are in the BaRC code repository at \\wi-files1\BaRC_Public\BaRC_code\R (and see the vignettes in the R packages linked above)


* Identifying differentially expressed genes
  * Use p-values (corrected for FDR) to rank genes to get those that have the highest confidence of being truly differentially expressed.
  * Be wary of actual p-values: some tests output very small p-values that appear to reflect higher confidence than is possible given the amount of replication.
  * How many genes do you predict are truly differentially expressed? Use the answer to this to help determine a reasonable FDR threshold.
  * Fold change thresholds may be used in addition, if desired, to select genes that are changing an amount that is biologically meaningful.