Changes between Version 52 and Version 53 of SOPs/rna-seq-diff-expressions

Timestamp: 11/02/17 12:37:01
Author: gbell
Comment: --

SOPs/rna-seq-diff-expressions

-              v52
+              v53
   * **Quantification of raw counts**
+    * Typically we use [[http://www-huber.embl.de/users/anders/HTSeq/doc/count.html|htseq-count]] to get counts for each gene
+    * Currently our favorite tool for this is [[http://bioinf.wehi.edu.au/featureCounts/|featureCounts]], part of the [[http://subread.sourceforge.net/|Subread]] package.
+      * featureCounts is much faster than htseq-count, but the details of its counting method are quite different from those of htseq-count, especially for paired-end reads
+      * See [[http://www.ncbi.nlm.nih.gov/pubmed/24227677|Liao et al., 2014]] for details of the method (and comparisons with other counting tools)
+      * featureCounts needs the paired-read BAM file to be sorted by read ID, but if it isn't, it'll do the sorting.
+      * Sample commands:
+{{{
+# single-end reads (unstranded)
+featureCounts -a gene_annotations.gtf -o MySample.featureCounts.txt MySample.bam
+# single-end reads (forward stranded)
+featureCounts -s 1 -a gene_annotations.gtf -o MySample.featureCounts.txt MySample.bam
+# single-end reads (reverse stranded)
+featureCounts -s 2 -a gene_annotations.gtf -o MySample.featureCounts.txt MySample.bam
+# NOTE for the paired-end commands: in Subread 2.0.2 and later, -p only flags the input as paired-end;
+# add --countReadPairs to count fragments instead of individual reads
+# paired-end reads (unstranded)
+featureCounts -p -a gene_annotations.gtf -o MySample.featureCounts.txt MySample.bam
+# paired-end reads (forward stranded)
+featureCounts -p -s 1 -a gene_annotations.gtf -o MySamples.featureCounts.txt *sortedByName.bam
+# paired-end reads (reverse stranded)
+featureCounts -p -s 2 -a gene_annotations.gtf -o MySamples.featureCounts.txt *sortedByName.bam
+}}}
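The table that featureCounts writes with -o can be trimmed down to a plain gene-by-count matrix. The following is a minimal sketch, assuming the usual output layout (a leading "#" comment line, a header row, then the columns Geneid, Chr, Start, End, Strand, Length, followed by one count column per BAM file). The file name `toy.featureCounts.txt`, the helper `extract_counts`, and the toy values are hypothetical and only stand in for real output:

```shell
# Build a toy stand-in for a featureCounts output file (made-up values).
printf '# Program:featureCounts\n' > toy.featureCounts.txt
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tMySample.bam\n' >> toy.featureCounts.txt
printf 'GeneA\tchr1\t100\t900\t+\t801\t25\n' >> toy.featureCounts.txt
printf 'GeneB\tchr1\t2000\t2500\t-\t501\t0\n' >> toy.featureCounts.txt

# Hypothetical helper: keep the gene ID and the per-sample count columns only.
extract_counts() {
  # Skip '#' comment lines; print column 1 (Geneid) and columns 7 onward (counts).
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }
       { line = $1; for (i = 7; i <= NF; i++) line = line OFS $i; print line }' "$1"
}

extract_counts toy.featureCounts.txt > toy.counts.txt
cat toy.counts.txt
```

With several BAM files in one featureCounts run, the same helper keeps every count column, so the result can be passed on as a raw count matrix.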
+    * [[http://www-huber.embl.de/users/anders/HTSeq/doc/count.html|htseq-count]] works fine to get counts for each gene, but it's quite slow.
       * Include the same GTF file describing gene models that was used for mapping -- but think carefully about which genes should be included (such as long non-coding RNAs, microRNAs, or piRNAs)
       * Is your sequencing library stranded or unstranded?  This information is needed to help htseq-count accurately count features.  If the library prep method is "TruSeqStrandedPolyA", for example, the reads will be stranded in the reverse direction (relative to the transcript orientation).
 …
 }}}
+    * Another tool to use is [[http://bioinf.wehi.edu.au/featureCounts/|featureCounts]], part of the [[http://subread.sourceforge.net/|Subread]] package
+      * featureCounts is much faster than htseq-count, but the details of its counting method are quite different from those of htseq-count, especially for paired-end reads
+      * See [[http://www.ncbi.nlm.nih.gov/pubmed/24227677|Liao et al., 2014]] for details of the method (and comparisons with other counting tools)
+      * featureCounts needs the paired-read BAM file to be sorted by read ID, but if it isn't, it'll do the sorting.
+      * Sample commands:
+{{{
+# single-end reads (unstranded)
+featureCounts -a gene_annotations.gtf -o MySample.featureCounts.txt MySample.bam
+# single-end reads (forward stranded)
+featureCounts -s 1 -a gene_annotations.gtf -o MySample.featureCounts.txt MySample.bam
+# single-end reads (reverse stranded)
+featureCounts -s 2 -a gene_annotations.gtf -o MySample.featureCounts.txt MySample.bam
+# paired-end reads (unstranded)
+featureCounts -p -a gene_annotations.gtf -o MySample.featureCounts.txt MySample.bam
+# paired-end reads (forward stranded)
+featureCounts -p -s 1 -a gene_annotations.gtf -o MySamples.featureCounts.txt *sortedByName.bam
+# paired-end reads (reverse stranded)
+featureCounts -p -s 2 -a gene_annotations.gtf -o MySamples.featureCounts.txt *sortedByName.bam
+}}}
+    * For some analyses (or for visualization), you can add a pseudocount (such as 1 or another small number) to all genes in all samples. This avoids log2 ratios that would require dividing by 0 and dampens noise from low background counts -- BUT be aware that some statistical methods (like DESeq) require raw counts as input, without any pseudocounts or normalization.
     * **NOTE:**
       * Both htseq-count and featureCounts ignore multi-mapped reads (i.e., these will not get counted) by default.  In featureCounts, use the -M option to count multi-mapped reads, if needed.
       * Summary metrics reported by both htseq-count and featureCounts are with respect to the number of records (i.e., lines) in the BAM file; to summarize by reads, further parsing/processing may be needed.  Extra information can be obtained with i) the -o option in htseq-count and ii) the -R option in featureCounts.
+    * For some analyses (or for visualization), you can add a pseudocount (such as 1 or another small number) to all genes in all samples. This avoids log2 ratios that would require dividing by 0 and dampens noise from low background counts -- BUT be aware that some statistical methods (like DESeq2) require raw counts as input, without any pseudocounts or normalization.
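The pseudocount step can be sketched with awk. This is an illustration only, assuming a tab-delimited table with a header row, gene IDs in column 1, and normalized values for two samples in columns 2 and 3. The file name `toy.norm.txt`, the helper `log2_ratio`, the sample names, and the values are hypothetical; the pseudocount is passed as an argument so that it is easy to change:

```shell
# Toy table of normalized values (made-up numbers, including zeros).
printf 'Gene\tControl\tTreated\n' > toy.norm.txt
printf 'GeneA\t0\t3\n' >> toy.norm.txt
printf 'GeneB\t7\t7\n' >> toy.norm.txt
printf 'GeneC\t15\t0\n' >> toy.norm.txt

# Hypothetical helper: log2((Treated + pc) / (Control + pc)) for each gene.
# $1 = input table, $2 = pseudocount
log2_ratio() {
  awk -v pc="$2" 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print $1, "log2(" $3 "/" $2 ")"; next }   # header row
    { printf "%s\t%.3f\n", $1, log(($3 + pc) / ($2 + pc)) / log(2) }' "$1"
}

log2_ratio toy.norm.txt 1
```

Without the pseudocount, GeneA would require dividing by 0 and GeneC would require taking the log of 0; with a pseudocount of 1 both get finite log2 ratios. These adjusted values are for ratios and plots only, not for input to count-based statistical methods.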
   * **Quantification by FPKM (Fragments Per Kilobase of transcript per Million mapped reads)**