245 | | == Remove Duplicates == |
246 | | * Remove duplicates, for eg. from PCR |
247 | | |
248 | | {{{ |
249 | | #samtools command |
250 | | samtools rmdup [-sS] <input.srt.bam> <output.bam> |
251 | | -s or -S depending on PE data or not |
252 | | }}} |
253 | | |
254 | | == Determining the paired-end insert size for DNA samples == |
255 | | |
256 | | If paired-end insert size or distance is unknown or need to be verified, it can be extracted from a BAM/SAM file after running Bowtie. |
257 | | |
258 | | When mapping with bowtie (or another mapper), the insert size can often be included as an input parameter (example for bowtie: -X 500), which can help with mapping. See the [[http://barcwiki.wi.mit.edu/wiki/SOPs/mapping|mapping SOP]] for mapping details. |
259 | | |
260 | | |
261 | | Method 1: Get insert sizes from BAM file |
262 | | {{{ |
263 | | # Using a SAM file (at Unix command prompt) |
264 | | awk -F "\t" '$9 > 0 {print $9}' s_1_bowtie.sam > s_1_insert_sizes.txt |
265 | | # Using a BAM file (at Unix command prompt) |
266 | | samtools view s_1_bowtie.bam | awk -F"\t" '$9 > 0 {print $9}' > s_1_insert_sizes.txt |
267 | | |
268 | | # and then process column of numbers with R (or Excel) |
269 | | # In R Session |
270 | | sizeFile = "s_1_insert_sizes.txt" |
271 | | sample.name = "My paired reads" |
272 | | distance = read.delim(sizeFile, h=F)[,1] |
273 | | pdf(paste(sample.name, "insert.size.histogram.pdf", sep="."), w=11, h=8.5) |
274 | | hist(distance, breaks=200, col="wheat", main=paste("Insert sizes for", sample.name), xlab="length (nt)") |
275 | | dev.off() |
276 | | }}} |
277 | | |
278 | | Method 2: Calculate insert sizes with CollectInsertSizeMetrics function from picard (http://picard.sourceforge.net). This is also a good approximation for RNA samples. |
279 | | {{{ |
280 | | # |
281 | | # I=File Input SAM or BAM file. (Required) |
282 | | # O=File File to write the output to. (Required) |
283 | | # H=File File to write insert size histogram chart to. (Required) |
284 | | # output: CollectInsertSizeMetrics.txt: values for -r and --mate-std-dev can be found in this text file |
285 | | # CollectInsertSizeMetrics_hist.pdf: insert size histogram (graphic representation) |
286 | | bsub java -jar /usr/local/share/picard-tools/CollectInsertSizeMetrics.jar I=foo.bam O=CollectInsertSizeMetrics.txt H=CollectInsertSizeMetrics_hist.pdf |
287 | | }}} |
288 | | |
289 | | == [RNA-seq only] Get global coverage profile across transcripts == |
290 | | |
291 | | Do reads come from across the length of a typical transcript, or is there 3' or 5' bias (where most reads come from one end of a typical transcript)? |
292 | | |
293 | | One way to look at this is with Picard's CollectRnaSeqMetrics tool |
294 | | {{{ |
295 | | # Usage: |
296 | | java -jar /usr/local/share/picard-tools/CollectRnaSeqMetrics.jar INPUT=bamFile REF_FLAT=refFlatFile STRAND_SPECIFICITY=NONE OUTPUT=outputFile REFERENCE_SEQUENCE=/path/to/genome.fa CHART_OUTPUT=output.pdf VALIDATION_STRINGENCY=SILENT |
297 | | # Example command |
298 | | java -jar /usr/local/share/picard-tools/CollectRnaSeqMetrics.jar INPUT=WT.bam REF_FLAT=/nfs/genomes/mouse_mm10_dec_11_no_random/anno/refFlat.txt STRAND_SPECIFICITY=NONE OUTPUT=QC_metrics/WT.RnaSeqMetrics.txt REFERENCE_SEQUENCE=/nfs/genomes/mouse_mm10_dec_11_no_random/fasta_whole_genome/mm10.fa CHART_OUTPUT=QC_metrics/WT.RnaSeqMetrics.pdf VALIDATION_STRINGENCY=SILENT |
299 | | |
300 | | }}} |
301 | | The VALIDATION_STRINGENCY=SILENT option will keep the program from crashing if it finds something unexpected. The default: VALIDATION_STRINGENCY=STRICT |
302 | | |
303 | | |
304 | | = Interpreting quality control issues = |
305 | | |
306 | | See [[https://sequencing.qcfail.com/|QC Fail Sequencing]] from the Babraham Institute |
| 245 | See [[http://barcwiki.wi.mit.edu/wiki/SOPs/SAMBAMqc|SAM/BAM quality control]] |
| 246 | |
| 247 | |