| 245 | | == Remove Duplicates == |
| 246 | | * Remove duplicates, for eg. from PCR |
| 247 | | |
| 248 | | {{{ |
| 249 | | #samtools command |
| 250 | | samtools rmdup [-sS] <input.srt.bam> <output.bam> |
| 251 | | -s or -S depending on PE data or not |
| 252 | | }}} |
| 253 | | |
| 254 | | == Determining the paired-end insert size for DNA samples == |
| 255 | | |
| 256 | | If paired-end insert size or distance is unknown or need to be verified, it can be extracted from a BAM/SAM file after running Bowtie. |
| 257 | | |
| 258 | | When mapping with bowtie (or another mapper), the insert size can often be included as an input parameter (example for bowtie: -X 500), which can help with mapping. See the [[http://barcwiki.wi.mit.edu/wiki/SOPs/mapping|mapping SOP]] for mapping details. |
| 259 | | |
| 260 | | |
| 261 | | Method 1: Get insert sizes from BAM file |
| 262 | | {{{ |
| 263 | | # Using a SAM file (at Unix command prompt) |
| 264 | | awk -F "\t" '$9 > 0 {print $9}' s_1_bowtie.sam > s_1_insert_sizes.txt |
| 265 | | # Using a BAM file (at Unix command prompt) |
| 266 | | samtools view s_1_bowtie.bam | awk -F"\t" '$9 > 0 {print $9}' > s_1_insert_sizes.txt |
| 267 | | |
| 268 | | # and then process column of numbers with R (or Excel) |
| 269 | | # In R Session |
| 270 | | sizeFile = "s_1_insert_sizes.txt" |
| 271 | | sample.name = "My paired reads" |
| 272 | | distance = read.delim(sizeFile, h=F)[,1] |
| 273 | | pdf(paste(sample.name, "insert.size.histogram.pdf", sep="."), w=11, h=8.5) |
| 274 | | hist(distance, breaks=200, col="wheat", main=paste("Insert sizes for", sample.name), xlab="length (nt)") |
| 275 | | dev.off() |
| 276 | | }}} |
| 277 | | |
| 278 | | Method 2: Calculate insert sizes with CollectInsertSizeMetrics function from picard (http://picard.sourceforge.net). This is also a good approximation for RNA samples. |
| 279 | | {{{ |
| 280 | | # |
| 281 | | # I=File Input SAM or BAM file. (Required) |
| 282 | | # O=File File to write the output to. (Required) |
| 283 | | # H=File File to write insert size histogram chart to. (Required) |
| 284 | | # output: CollectInsertSizeMetrics.txt: values for -r and --mate-std-dev can be found in this text file |
| 285 | | # CollectInsertSizeMetrics_hist.pdf: insert size histogram (graphic representation) |
| 286 | | bsub java -jar /usr/local/share/picard-tools/CollectInsertSizeMetrics.jar I=foo.bam O=CollectInsertSizeMetrics.txt H=CollectInsertSizeMetrics_hist.pdf |
| 287 | | }}} |
| 288 | | |
| 289 | | == [RNA-seq only] Get global coverage profile across transcripts == |
| 290 | | |
| 291 | | Do reads come from across the length of a typical transcript, or is there 3' or 5' bias (where most reads come from one end of a typical transcript)? |
| 292 | | |
| 293 | | One way to look at this is with Picard's CollectRnaSeqMetrics tool |
| 294 | | {{{ |
| 295 | | # Usage: |
| 296 | | java -jar /usr/local/share/picard-tools/CollectRnaSeqMetrics.jar INPUT=bamFile REF_FLAT=refFlatFile STRAND_SPECIFICITY=NONE OUTPUT=outputFile REFERENCE_SEQUENCE=/path/to/genome.fa CHART_OUTPUT=output.pdf VALIDATION_STRINGENCY=SILENT |
| 297 | | # Example command |
| 298 | | java -jar /usr/local/share/picard-tools/CollectRnaSeqMetrics.jar INPUT=WT.bam REF_FLAT=/nfs/genomes/mouse_mm10_dec_11_no_random/anno/refFlat.txt STRAND_SPECIFICITY=NONE OUTPUT=QC_metrics/WT.RnaSeqMetrics.txt REFERENCE_SEQUENCE=/nfs/genomes/mouse_mm10_dec_11_no_random/fasta_whole_genome/mm10.fa CHART_OUTPUT=QC_metrics/WT.RnaSeqMetrics.pdf VALIDATION_STRINGENCY=SILENT |
| 299 | | |
| 300 | | }}} |
| 301 | | The VALIDATION_STRINGENCY=SILENT option will keep the program from crashing if it finds something unexpected. The default: VALIDATION_STRINGENCY=STRICT |
| 302 | | |
| 303 | | |
| 304 | | = Interpreting quality control issues = |
| 305 | | |
| 306 | | See [[https://sequencing.qcfail.com/|QC Fail Sequencing]] from the Babraham Institute |
| | 245 | See [[http://barcwiki.wi.mit.edu/wiki/SOPs/SAMBAMqc|SAM/BAM quality control]] |
| | 246 | |
| | 247 | |