== Using RNA-Seq to quantify gene levels and assay for differential expression == * Mapping * Use [http://tophat.cbcb.umd.edu/manual.html TopHat] to map short reads to genome of choice with default settings. * Supplying a GTF file describing gene models to TopHat can provide better mapping of some spliced reads. * Convert BAM to SAM output: samtools view -h -o accepted_hits.sam accepted_hits.bam (since SAM is required for htseq-count) * Quantification * Use [http://www-huber.embl.de/users/anders/HTSeq/doc/count.html htseq-count] to get counts for each gene * Include same GTF file describing gene models as was used for mapping – but think carefully about what genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs) * Carefully choose the best “mode” to handle reads that don't completely map to exactly one gene * Counting can be also performed transcript level but * this may be too much information * mapping to correct transcript for multiple-transcript genes may be tricky to do well * Add a pseudocount (such as 1) to all genes in all samples to * prevent log2 ratios that require dividing by 0 * reduce background count noise * reduce problems with statistical methods that don't like 0's * Remove the rows at the bottom with descriptions like no_feature, ambiguous, etc. * Gene filtering * Remove from the analysis any genes with 0 counts (or counts = pseudocounts). * Remove counts for any genes we want to classify as contaminants such as ribosomal RNAs (if these are included in the GTF gene annotation file). * Statistics for differential expression * Biologically meaningful statistics are only obtained with experiments that include biological replication. * Unless otherwise noted, these statistical analyses require raw counts, not RPKM values, as input. * To prepare the data for analysis, make a counts matrix (as a tab-delimited text file) with genes as rows and samples as columns. * Some statistical tests or packages can be applied to non-replicated data (but must be interpreted with caution): * [http://cufflinks.cbcb.umd.edu/manual.html#cuffdiff cuffDiff] (part of the cufflinks package; uses BAM files as input) * [http://www.bioconductor.org/packages/release/bioc/html/DESeq.html DESeq] * [http://www.bioconductor.org/packages/release/bioc/html/DEGseq.html DEGSeq] * Fisher's exact test (with R and/or with BaRC R script) * Some statistical tests or packages are designed for experiments with replication: * [http://www.bioconductor.org/packages/release/bioc/html/DESeq.html DESeq] * [http://www.bioconductor.org/packages/release/bioc/html/edgeR.html edgeR] * [http://www.bioconductor.org/packages/release/bioc/html/baySeq.html baySeq] * In our hands, all of these methods appear to work fine. Note, however, that the scale of the p-values from baySeq are much different from those of DESeq and edgeR. For example, a gene may have a baySeq p-value of 1e-15 but a DESeq p-value of 1e-300. * Scripts with sample code are in the BaRC code repository at \\wi-files1\BaRC_Public\BaRC_code\R (and see the vignettes in the R packages linked above) * Identifying differentially expressed genes * Use p-values (corrected for FDR) to rank genes to get those that have the highest confidence of being truly differentially expressed. * Be wary of actual p-values: some tests output very small p-values that appear to reflect higher confidence than is possible given the amount of replication. * How many genes do you predict are truly differentially expressed? Use the answer to this to help determine a reasonable FDR threshold. * Fold change thresholds may be used in addition, if desired, to select genes that are changing an amount that is biologically meaningful.