Using RNA-Seq to quantify gene levels and assay for differential expression
- Mapping
- Use TopHat to map short reads to genome of choice with default settings.
- Supplying a GTF file describing gene models to TopHat can provide better mapping of some spliced reads.
- Convert BAM to SAM output: samtools view -h -o accepted_hits.sam accepted_hits.bam (since SAM is required for htseq-count)
- Quantification
- Use htseq-count to get counts for each gene
- Include same GTF file describing gene models as was used for mapping – but think carefully about what genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs)
- Carefully choose the best “mode” to handle reads that don't completely map to exactly one gene
- Counting can be also performed transcript level but
- this may be too much information
- mapping to correct transcript for multiple-transcript genes may be tricky to do well
- Add a pseudocount (such as 1) to all genes in all samples to
- prevent log2 ratios that require dividing by 0
- reduce background count noise
- reduce problems with statistical methods that don't like 0's
- Remove the rows at the bottom with descriptions like no_feature, ambiguous, etc.
- Use htseq-count to get counts for each gene
- Gene filtering
- Remove from the analysis any genes with 0 counts (or counts = pseudocounts).
- Remove counts for any genes we want to classify as contaminants such as ribosomal RNAs (if these are included in the GTF gene annotation file).
- Statistics for differential expression
- Biologically meaningful statistics are only obtained with experiments that include biological replication.
- Unless otherwise noted, these statistical analyses require raw counts, not RPKM values, as input.
- To prepare the data for analysis, make a counts matrix (as a tab-delimited text file) with genes as rows and samples as columns.
- Some statistical tests or packages can be applied to non-replicated data (but must be interpreted with caution):
- Some statistical tests or packages are designed for experiments with replication:
- Scripts with sample code are in the BaRC code repository at
wi-files1\BaRC_Public\BaRC_code\R (and see the vignettes in the R packages linked above)
- Identifying differentially expressed genes
- Use p-values (corrected for FDR) to rank genes to get those that have the highest confidence of being truly differentially expressed.
- Be wary of actual p-values: some tests output very small p-values that appear to reflect higher confidence than is possible given the amount of replication.
- How many genes do you predict are truly differentially expressed? Use the answer to this to help determine a reasonable FDR threshold.
- Fold change thresholds may be used in addition, if desired, to select genes that are changing an amount that is biologically meaningful.
Note:
See TracWiki
for help on using the wiki.