Context Navigation

Changes between Version 38 and Version 39 of SOPs/rna-seq-diff-expressions

Timestamp:: 05/10/17 10:13:13 (8 years ago)
Author:: thiruvil
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

SOPs/rna-seq-diff-expressions

-              v38
+              v39
       * Counting can be also performed transcript level but this may be too much information.
       * htseq-count can accept SAM or BAM (-f option), for example
+         * ''htseq-count -f bam -m intersection-strict --stranded=reverse MySample.accepted_hits.sortedByName.bam gene_models.gtf >| MySample.counts.txt''
+         * Note that htseq-count assumes that your reads are strand-specific; default is --stranded=yes
+         * If your reads are not stranded, use "--stranded=no" (or half of your reads won't be counted).  If they're stranded and in the opposite direction from the transcript, use "--stranded=reverse".
+         * Note the "-" can be used for input from standard input (stdin)
+         * For paired-end reads the sam file has to be sorted by read name: ''bsub  "samtools sort -n -o accepted_hits.sortedByName.bam -m 5G -O bam -T temp accepted_hits.bam"''
+         * To request a certain amount of memory and a specific node use ''bsub  -R "rusage[mem=50000]" -m NodeName ''
+         * Remove the rows at the bottom with descriptions like no_feature, ambiguous, etc.
+{{{
+#Note that htseq-count assumes that your reads are strand-specific; default is --stranded=yes
+#If your reads are not stranded, use "--stranded=no" (or half of your reads won't be counted).
+#If they're stranded and in the opposite direction from the transcript, use "--stranded=reverse".
+htseq-count -f bam -m intersection-strict --stranded=reverse MySample.accepted_hits.sortedByName.bam gene_models.gtf > MySample.counts.txt
+#Examine to see the last rows at the bottom with descriptions like no_feature, ambiguous, etc.
+#if too many reads are thrown out.  Otherwise, these rows can be removed for downstream analysis.
+#For PE reads the bam files needs to be sorted by name (default for htseq-count), eg.
+#bsub "samtools sort -n -o accepted_hits.sortedByName.bam -m 5G -O bam -T temp accepted_hits.bam"
+#If the bam file is sorted by coordinate you may try htseq-count -r option, eg. -r pos , however,
+#this may not always work (htseq-count throws numerous errors).
+#To request a certain amount of memory and a specific node use bsub -R "rusage[mem=50000]" -m NodeName
+}}}
     * Another tool to use [[http://bioinf.wehi.edu.au/featureCounts/|featureCounts]], part of the [[http://subread.sourceforge.net/|Subread]] package
       * featureCounts is much faster than htseq-count, but the details of its counting method is quite different from that of htseq-count, especially for paired-end reads
       * See [[http://www.ncbi.nlm.nih.gov/pubmed/24227677|Liao et al., 2014]] for details of the method (and comparisons with other counting tools)
       * Sample commands:
+         * ''featureCounts -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam'' (single-end reads)
+         * ''featureCounts -p -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam'' (paired-end reads)
+{{{
+#single-end reads
+featureCounts -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam
+#PE reads
+featureCounts -p -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam
+}}}
     * For some analyses (or for visualization), you can add a pseudocount (such as 1 or another small number) to all genes in all samples to prevent log2 ratios that require dividing by 0 and reduce background count noise -- BUT be aware that some statistical methods (like DESeq) require raw input values without any pseudocounts or normalization.