Changes between Version 23 and Version 24 of SOPs/qc_shortReads

Timestamp:: 04/27/16 11:20:16 (9 years ago)
Author:: ibarrasa
Comment:: --

SOPs/qc_shortReads

-              v23
+              v24
 = Analyzing short read quality (after mapping) =
+== Remove Duplicates ==
+  * Remove duplicate reads (e.g., those introduced by PCR)
+ {{{
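+   # samtools command
+    samtools rmdup [-sS] <input.srt.bam> <output.bam>
+   # -s: remove duplicates for single-end reads; -S: treat paired-end reads as single-end
+   # Omit both options for standard paired-end data (the default mode)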
+}}}
+== Determining the paired-end insert size for DNA samples ==
+If the paired-end insert size (or distance) is unknown or needs to be verified, it can be extracted from a SAM/BAM file after mapping with Bowtie.
+When mapping with bowtie (or another mapper), the insert size can often be included as an input parameter (example for bowtie: -X 500), which can help with mapping.  See the [[http://barcwiki.wi.mit.edu/wiki/SOPs/mapping|mapping SOP]] for mapping details.
+Method 1: Get insert sizes from a SAM or BAM file
+{{{
+   # Using a SAM file (at Unix command prompt)
+   awk -F "\t" '$9 > 0 {print $9}' s_1_bowtie.sam > s_1_insert_sizes.txt
+   # Using a BAM file (at Unix command prompt)
+   samtools view s_1_bowtie.bam | awk -F"\t" '$9 > 0 {print $9}' > s_1_insert_sizes.txt
+   # and then process column of numbers with R (or Excel)
+   # In R Session
+   sizeFile = "s_1_insert_sizes.txt"
+   sample.name = "My paired reads"
+   distance = read.delim(sizeFile, header=FALSE)[,1]
+   pdf(paste(sample.name, "insert.size.histogram.pdf", sep="."), width=11, height=8.5)
+   hist(distance, breaks=200, col="wheat", main=paste("Insert sizes for", sample.name), xlab="length (nt)")
+   dev.off()
+}}}
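Before plotting, it can help to get a quick numeric summary of the insert sizes. The following is a minimal sketch, assuming the one-column file written by the commands above; `insert_size_stats` is a hypothetical helper name, and the standard deviation reported is the population SD.

```shell
#!/bin/sh
# insert_size_stats FILE
# Hypothetical helper: summarize a one-column file of insert sizes
# (such as s_1_insert_sizes.txt) as count, mean, and population SD.
insert_size_stats() {
    awk '
        { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n == 0) { print "n=0"; exit }
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0   # guard against tiny negative rounding error
            printf "n=%d mean=%.1f sd=%.1f\n", n, mean, sqrt(var)
        }
    ' "$1"
}
```

Usage: `insert_size_stats s_1_insert_sizes.txt`. A few very large values (for example from chimeric pairs) can inflate the mean and SD, so the histogram is still worth checking.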
+Method 2: Calculate insert sizes with the CollectInsertSizeMetrics tool from Picard (http://picard.sourceforge.net).  This also gives a good approximation for RNA samples.
+{{{
+   # Arguments:
+   # I=File    Input SAM or BAM file.  (Required)
+   # O=File    File to write the output to.  (Required)
+   # H=File    File to write insert size histogram chart to.  (Required)
+   # Output: CollectInsertSizeMetrics.txt: values for the TopHat options -r and --mate-std-dev can be found in this text file
+   #         CollectInsertSizeMetrics_hist.pdf: insert size histogram (graphic representation)
+bsub java -jar  /usr/local/share/picard-tools/CollectInsertSizeMetrics.jar I=foo.bam O=CollectInsertSizeMetrics.txt H=CollectInsertSizeMetrics_hist.pdf
+}}}
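The mean and standard deviation can be pulled out of the Picard output without opening the file by hand. This is a minimal sketch, assuming the usual Picard metrics layout (a `## METRICS CLASS` line, then a tab-separated header row, then one data row) and the column names `MEAN_INSERT_SIZE` and `STANDARD_DEVIATION`; `picard_metric` is a hypothetical helper name, and the layout should be checked against the output of your Picard version.

```shell
#!/bin/sh
# picard_metric FILE COLUMN
# Hypothetical helper: print the value of COLUMN from the first data row
# that follows the "## METRICS CLASS" line of a Picard metrics file.
picard_metric() {
    awk -F '\t' -v col="$2" '
        /^## METRICS CLASS/ { want_header = 1; next }
        want_header {
            # Locate the requested column by name in the header row
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            want_header = 0; want_row = 1; next
        }
        want_row { if (idx) print $idx; exit }
    ' "$1"
}
```

Usage: `picard_metric CollectInsertSizeMetrics.txt MEAN_INSERT_SIZE` and `picard_metric CollectInsertSizeMetrics.txt STANDARD_DEVIATION`. Looking columns up by name rather than by position keeps the helper working if the column order differs between Picard versions.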
+== [RNA-seq only] Get global coverage profile across transcripts ==
+Do reads come from across the length of a typical transcript, or is there 3' or 5' bias (where most reads come from one end of a typical transcript)?
+One way to look at this is with Picard's CollectRnaSeqMetrics tool:
+{{{
+   # Usage:
+   java -jar /usr/local/share/picard-tools/CollectRnaSeqMetrics.jar INPUT=bamFile REF_FLAT=refFlatFile STRAND_SPECIFICITY=NONE OUTPUT=outputFile REFERENCE_SEQUENCE=/path/to/genome.fa CHART_OUTPUT=output.pdf VALIDATION_STRINGENCY=SILENT
+   # Example command
+   java -jar /usr/local/share/picard-tools/CollectRnaSeqMetrics.jar INPUT=WT.bam REF_FLAT=/nfs/genomes/mouse_mm10_dec_11_no_random/anno/refFlat.txt STRAND_SPECIFICITY=NONE OUTPUT=QC_metrics/WT.RnaSeqMetrics.txt REFERENCE_SEQUENCE=/nfs/genomes/mouse_mm10_dec_11_no_random/fasta_whole_genome/mm10.fa CHART_OUTPUT=QC_metrics/WT.RnaSeqMetrics.pdf VALIDATION_STRINGENCY=SILENT
+}}}
+The VALIDATION_STRINGENCY=SILENT option keeps the program from crashing if it finds something unexpected in the input.  The default is VALIDATION_STRINGENCY=STRICT.
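Besides the chart, the metrics text file summarizes end bias numerically. The following is a minimal sketch, assuming the metrics row follows a `## METRICS CLASS` line and a tab-separated header, and that the bias columns have `BIAS` in their names (for example `MEDIAN_5PRIME_TO_3PRIME_BIAS`); `rnaseq_bias_metrics` is a hypothetical helper name.

```shell
#!/bin/sh
# rnaseq_bias_metrics FILE
# Hypothetical helper: print every column whose name contains "BIAS"
# from a CollectRnaSeqMetrics output file, one NAME=value pair per line.
rnaseq_bias_metrics() {
    awk '
        /^## METRICS CLASS/ {
            getline; n = split($0, name, "\t")   # header row
            getline; split($0, value, "\t")      # first data row
            for (i = 1; i <= n; i++)
                if (name[i] ~ /BIAS/) print name[i] "=" value[i]
            exit
        }
    ' "$1"
}
```

Usage: `rnaseq_bias_metrics QC_metrics/WT.RnaSeqMetrics.txt`. A 5' to 3' bias ratio far from 1 suggests that reads are concentrated at one end of a typical transcript.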
+= Interpreting quality control issues =
+See [[https://sequencing.qcfail.com/|QC Fail Sequencing]] from the Babraham Institute
+See [[http://barcwiki.wi.mit.edu/wiki/SOPs/SAMBAMqc|SAM/BAM quality control]]