SOPs/barcBakeOff – BaRC Wiki

wiki:SOPs/barcBakeOff

Context Navigation

Using RNA-Seq to quantify gene levels and assay for differential expression

Mapping
- Use TopHat to map short reads to genome of choice with default settings.
- Supplying a GTF file describing gene models to TopHat can provide better mapping of some spliced reads.
- Convert BAM to SAM output: samtools view -h -o accepted_hits.sam accepted_hits.bam (since SAM is required for htseq-count)

Quantification
- Use htseq-count to get counts for each gene
  - Include same GTF file describing gene models as was used for mapping – but think carefully about what genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs)
  - Carefully choose the best “mode” to handle reads that don't completely map to exactly one gene
  - Counting can be also performed transcript level but
    - this may be too much information
    - mapping to correct transcript for multiple-transcript genes may be tricky to do well
- Add a pseudocount (such as 1) to all genes in all samples to
  - prevent log2 ratios that require dividing by 0
  - reduce background count noise
  - reduce problems with statistical methods that don't like 0's
- Remove the rows at the bottom with descriptions like no_feature, ambiguous, etc.

Gene filtering
- Remove from the analysis any genes with 0 counts (or counts = pseudocounts).
- Remove counts for any genes we want to classify as contaminants such as ribosomal RNAs (if these are included in the GTF gene annotation file).

Statistics for differential expression
- Biologically meaningful statistics are only obtained with experiments that include biological replication.
- Unless otherwise noted, these statistical analyses require raw counts, not RPKM values, as input.
- To prepare the data for analysis, make a counts matrix (as a tab-delimited text file) with genes as rows and samples as columns.
- Some statistical tests or packages can be applied to non-replicated data (but must be interpreted with caution):
  - cuffDiff (part of the cufflinks package; uses BAM files as input)
  - DESeq
  - DEGSeq
  - Fisher's exact test (with R and/or with BaRC R script)
- Some statistical tests or packages are designed for experiments with replication:
  - DESeq
  - edgeR
  - baySeq
  - In our hands, all of these methods appear to work fine. Note, however, that the scale of the p-values from baySeq are much different from those of DESeq and edgeR. For example, a gene may have a baySeq p-value of 1e-15 but a DESeq p-value of 1e-300.
- Scripts with sample code are in the BaRC code repository at
  wi-files1\BaRC_Public\BaRC_code\R (and see the vignettes in the R packages linked above)

Identifying differentially expressed genes
- Use p-values (corrected for FDR) to rank genes to get those that have the highest confidence of being truly differentially expressed.
- Be wary of actual p-values: some tests output very small p-values that appear to reflect higher confidence than is possible given the amount of replication.
- How many genes do you predict are truly differentially expressed? Use the answer to this to help determine a reasonable FDR threshold.
- Fold change thresholds may be used in addition, if desired, to select genes that are changing an amount that is biologically meaningful.

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text