SOPs/miningSAMBAM – BaRC Wiki

wiki:SOPs/miningSAMBAM

Context Navigation

SAM/BAM summarizing and processing

Many of these involve samtools

Get the official SAM/BAM file format description

All fields in a SAM/BAM file are explained in the Sequence Alignment/Map Format Specification.

Differences between SAM and BAM files

A BAM file is a binary version of a SAM file.
Both contain identical information about reads and their mapping.
A BAM file requires a header but a SAM file may not have one. (Use 'samtools view -h reads.bam' to print the header with the mapped reads.)
Many operations (such as sorting and indexing) work only on BAM files.
For almost any application that requires SAM input, this can be created on the fly from a BAM file (using 'samtools view reads.bam |').
BAM files take up much less space than SAM files.
For archiving purposes, keep only the BAM file. The SAM file can easily be regenerated (if ever needed).

Convert, sort, and/or index

# Convert SAM to BAM:
samtools view -bS -o foo.bam foo.sam

# Convert BAM to SAM:
samtools view -h -o foo.sam foo.bam

# Sort BAM file (where ".bam" is added to "foo.sorted")
samtools sort foo.bam -T foo > foo.sorted.bam

# Index a sorted BAM file (which creates foo.sorted.bam.bai):
samtools index foo.sorted.bam
    
# Both foo.sorted.bam and foo.sorted.bam.bai are needed for visualization.

All three steps (SAM=>BAM, sorting, and indexing) can be merged into one command. See

/nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl
# Or on a folder of SAM files
for samFile in `/bin/ls *.sam`; do sbatch --partition=20 --job-name=sam2bam --mem=32G --wrap "/nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl $samFile " ; done

Modify a BAM file into another BAM file

In many cases, there's no need to create an intermediate SAM file.

#to extract selected (mapped to chrM) reads:
samtools index accepted_hits.bam   # index file required if you want to select a genome region (like chrM)
samtools view -h accepted_hits.bam chrM | samtools view -bS - > accepted_hits.chrM_only.bam
#We need to keep the header to convert back to BAM (hence the '-h' with 'samtools view').

#extract multiple regions (e.g., chromosomes) into a new bam file
samtools view -bh -L chromInfo.bed alignment.bam > alignment_chr1_3.bam
#where chromInfo.bed is a bed file, e.g.,
chr1    1       195471971
chr2    1       182113224
chr3    1       160039680

#rename header, e.g., use only chr1 to chr3 like above
samtools reheader newHeader.txt alignment_chr1_3.bam> alignment_chr1_3.newHeader.bam
#where newheader.txt file is,
@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:195471971
@SQ     SN:chr2 LN:182113224
@SQ     SN:chr3 LN:160039680
@PG     ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa ...

Count the number of mapped reads

samtools flagstat mapped_unmapped.bam > mapped_unmapped.flagstat.txt

Count the number of mapped reads by chromosome

# Method 1 (all chromosomes)
# 1 - Index the BAM file: 
samtools index mapped_reads.bam
# 2 - Get index statistics (including the number of mapped reads in the third column: 
samtools idxstats mapped_reads.bam

# Method 2 (one chromosome at a time, for example, chr2)
samtools view mapped_reads.bam chr2 | cut -f 1 | sort -u | wc -l

Remove unmapped reads

samtools view -h -F 4 mapped_unmapped.bam | samtools view -bS > mapped_only.bam

Extract unmapped reads

samtools view -h -f 4 mapped_unmapped.bam | samtools view -bS > unmapped_only.bam

Keep only properly paired reads

samtools view -h -f 2 mapped_unmapped.bam | samtools view -bS > mapped.properly_paired.bam

Extract spliced reads

samtools view -h All_reads.bam | awk '$0 ~ /^@/ || $6 ~ /N/' | samtools view -bS > Spliced_reads.bam

How many multiple/uniquely mapped reads are in a bam/sam file?

bam_stat.py -i mapped_reads.bam >& bam_stat.out.txt

Get only uniquely mapped reads from sam/bam files

# For sam/bam files generated by bowtie2:
grep -v "XS:i" | grep "AS:i" All.sam >| Uniquely_mapped.sam
samtools view -h All.bam | grep -E -v "XS:i" | grep -E "@|AS:i" | samtools view -b - >| Uniquely_mapped.bam
# For any bam files
samtools view -b -q 10 All_reads.bam > Uniquely_mapped.bam

Remove or mark duplicate reads in bam files

# Using Picard tools, mark duplicates only, this is the default
/pathtojava/java -jar /path_to_picard-tools/picard.jar  MarkDuplicates  I=input.bam O=ouput.marked_duplicates.bam \
      M=ouputFileWithStats_metrics.txt  REMOVE_DUPLICATES=FALSE


# Using Picard tools, remove duplicates
/pathtojava/java -jar /path_to_picard-tools/picard.jar  MarkDuplicates  I=input.bam O=ouput.marked_duplicates.bam \
      M=ouputFileWithStats_metrics.txt  REMOVE_DUPLICATES=TRUE

View alignment with samtools

# -e: change identical bases to '='
samtools view -b accepted_hits.bam | samtools fillmd -e - /nfs/genomes/mouse_mm10_dec_11_no_random/fasta_whole_genome/mm10.fa | more

Get a list of multi-mapped reads, including the number of times each one was mapped

Tophat/bowtie mappers create the tag NH:i:XXX where XXX is the number of times the read has mapped.

sbatch --partition=20 --job-name=samtools --mem=32G --wrap "samtools view accepted_hits.bam | grep -v NH:i:1 | perl -pe 's/AS.+(NH:i:\d+)/\$1/' | cut -f1,10,12 | perl -pe 's/NH:i://' | sort -u -k3,3nr > Multi-mapped.sorted.txt"
# Output format:
# read_ID<tab>read<tab>number times mapped

SAM flag explanation

An explanation of all flags is found in the Sequence Alignment/Map Format Specification.
Converting a flag into its components may be easiest with the Picard Decoding SAM flags tool.

Split by strand by matched strand

# input: 	accepted_hits.bam
# output:	accepted_hits_negStrand.bam: mapped to negative strand
#		accepted_hits_posStrand.bam: mapped to positive strand

sbatch --partition=20 --job-name=samtools --mem=32G --wrap "samtools view -f 16 -b accepted_hits.bam >| accepted_hits_negStrand.bam"
sbatch --partition=20 --job-name=samtools --mem=32G --wrap "samtools view -F 16 -b accepted_hits.bam >| accepted_hits_posStrand.bam"

Split reads by pair

# input:	accepted_hits.bam
# output:	1st pair: accepted_hits_1stPair.bam
#		2nd pair: accepted_hits_2ndPair.bam
sbatch --partition=20 --job-name=samtools --mem=32G --wrap "samtools view -b -f 0x0040 accepted_hits.bam > accepted_hits_1stPair.bam"
sbatch --partition=20 --job-name=samtools --mem=32G --wrap "samtools view -b -F 0x0040 accepted_hits.bam > accepted_hits_2ndPair.bam"

Print consensus sequence of mapped reads

samtools mpileup -uf ref_genome.fa Mapped_reads.bam >| Mapped_reads.variants.bcf
bcftools call -O v -c Mapped_reads.variants.bcf | vcfutils.pl vcf2fq -d 2 >| Mapped_reads.consensus.fq

Convert BAM to BED format

This can be helpful for genome browser viewing, as features in a BAM file are only visible when the browser is zoomed in enough, whereas BED features are visible at any scale.

# Use bam2bed from the bedtools suite
# Run 'bam2bed -h' to get all options
bam2bed < Mapped_reads.bam > Mapped_reads.bed

Convert BAM to bigwig format

This can also be helpful for genome browser viewing. Bigwig files are smaller than BED files.

# Use bamCoverage from the deepTools suite
bamCoverage -b Mapped_reads.bam -o Mapped_reads.bw

Convert BAM back to fastq format

bamToFastq is part of the bedtools suite.

# Single-end reads
bamToFastq -i Reads.bam -fq Reads.fq
# Paired-end reads
bamToFastq -i Read_pairs.bam -fq Reads.R1.fq -fq2 Reads.R2.fq

Get genome coverage of DNA reads

One way to do this is with a BAM QC analysis using the QualiMap package.

qualimap bamqc -bam Mapped_DNA_reads.bam -outdir qualimap_output -outformat PDF:HTML
# Command-line execution may require disabling X-windows
sbatch --partition=20 --job-name=bamqc --mem=32G --wrap "unset DISPLAY; qualimap bamqc -bam Mapped_DNA_reads.bam -outdir qualimap_output -outformat PDF:HTML"

For genome coverage, search for "mean coverageData" in the output file: qualimap_output/genome_results.txt This file includes lots of other details too.

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text