SOPs/miningSAMBAM – BaRC Wiki

Context Navigation

Version 24 (modified by byuan, 11 years ago) ( diff )
--

SAM/BAM summarizing, processing and quality control(QC)

Many of these involve samtools

Convert, sort, and/or index

# Convert SAM to BAM:
samtools view -bS -o foo.bam foo.sam

# Convert BAM to SAM:
samtools view -h -o foo.sam foo.bam

# Sort BAM file (where ".bam" is added to "foo.sorted")
samtools sort foo.bam foo.sorted

# Index a sorted BAM file (which creates foo.sorted.bam.bai):
samtools index foo.sorted.bam
    
# Both foo.sorted.bam and foo.sorted.bam.bai are needed for visualization.

All three steps (SAM=>BAM, sorting, and indexing) can be merged into one command. See

/nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl

Differences between SAM and BAM files

A BAM file is a binary version of a SAM file.
Both contain identical information about reads and their mapping.
A BAM file requires a header but a SAM file may not have one. (Use 'samtools view -h reads.bam' to print the header with the mapped reads.)
Many operations (such as sorting and indexing) work only on BAM files.
For almost any application that requires SAM input, this can be created on the fly from a BAM file (using 'samtools view reads.bam |').
BAM files take up much less space than SAM files.
For archiving purposes, keep only the BAM file. The SAM file can easily be regenerated (if ever needed).

Modify a BAM file into another BAM file

In many cases, there's no need to create an intermediate SAM file. For example, to extract selected (mapped to chrM) reads:

samtools index accepted_hits.bam   # Required if you want to select a genome region (like chrM)
samtools view -h accepted_hits.bam chrM | samtools view -bS - > accepted_hits.chrM_only.bam

We need to keep the header to convert back to BAM (hence the '-h' with 'samtools view' and the '$1 ~ ...' with awk).

Count the number of mapped reads

samtools flagstat mapped_unmapped.bam

Count the number of mapped reads by chromosome

# Method 1 (all chromosomes)
# 1 - Index the BAM file: 
samtools index mapped_reads.bam
# 2 - Get index statistics (including the number of mapped reads in the third column: 
samtools idxstats mapped_reads.bam

# Method 2 (one chromosome at a time, for example, chr2)
# From SAM
awk -F"\t" '$3 == "chr2" {print $1}' mapped_reads.sam | sort -u | wc -l
# From BAM
samtools view mapped_reads.bam chr2 | cut -f 1 | sort -u | wc -l

Remove unmapped reads

samtools view -hS -F 4 mapped_unmapped.sam > mapped_only.sam

How many multiple/uniquely mapped reads are in a bam/sam file?

bam_stat.py -i mapped_reads.bam >& bam_stat.out.txt

View alignment with samtools

# -e: change identical bases to '='
samtools view -b accepted_hits.bam | samtools fillmd -e - /nfs/genomes/mouse_mm10_dec_11_no_random/fasta_whole_genome/mm10.fa | more

Get a list of multi-mapped reads, including the number of times each one was mapped

Tophat/bowtie mappers create the tag NH:i:XXX where XXX is the number of times the read has mapped.

bsub "samtools view accepted_hits.bam | grep -v NH:i:1 | perl -pe 's/AS.+(NH:i:\d+)/\$1/' | cut -f1,10,12 | perl -pe 's/NH:i://' | sort -u -k3,3nr > Multi-mapped.sorted.txt"
# Output format:
# read_ID<tab>read<tab>number times mapped

QC to get a (visual) summary of mapping statistics. For eg. coverage/distribution of mapped reads across the genome or transcriptome.

| RSeQC: RNA-Seq quality control package for getting mapping statistics (eg. unique/multi-mapped reads)

bam_stat.py -i myFile.bam

| Picard: CollectRnaSeqMetrics.jar to find coverage across gene body for 5' or 3' bias

java -jar /usr/local/share/picard-tools/CollectRnaSeqMetrics.jar INPUT=accepted_hits.bam REF_FLAT=refFlat.txt STRAND_SPECIFICITY=NONE OUTPUT=Out_RnaSeqMetrics.txt REFERENCE_SEQUENCE=hg19.fa CHART_OUTPUT=Out_RnaSeqMetrics.pdf

| QualiMap: can be used on DNA or RNA-Seq to get summary of mapping and coverage/distribution

# Graphical interface: enter 'qualimap' on the command line
# Command line:
unset DISPLAY  #needed for submitting to cluster
bsub "qualimap bamqc -bam myFile.bam -outdir output_qualimap"

#rnaseq qc
bsub "qualimap rnaseq -bam myFile.bam -gtf Homo_sapiens.GRCh37.72.canonical.gtf -outdir output_qualimap_rnaseq -protocol non-strand-specific"

#counts qc (after using htseq-count or similar program to generate a matrix of counts)
qualimap counts -d countsqc_input.txt -c -s HUMAN -outdir counts_qc
#Format of countsqc_input.txt (below), totalCounts.txt is a matrix of counts; header lines must be commented "#" and species is human or mouse only.
#Sample	Condition	Path	Column
HMLE1	HMLE	totalCounts.txt	2
HMLE2	HMLE	totalCounts.txt	3
HMLE3	HMLE	totalCounts.txt	4
N81	N8	totalCounts.txt	5
N82	N8	totalCounts.txt	6
N83	N8	totalCounts.txt	7

infer_experiment.py from RseQC package: can be used to check if the RNA-seq reads are stranded.

# Command line:

bsub infer_experiment.py -i accepted_hits.bam -r hs.bed

-i INPUT_FILE in SAM or BAM format
-r Reference gene model in bed fomat.

# sample output on strand-specific PE reads:
This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0193
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9807
Fraction of reads explained by other combinations: 0.0000

# sample output on non-stranded PE reads:
This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5103
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4897
Fraction of reads explained by other combinations: 0.0000

For pair-end RNA-seq, there are two different ways to strand reads:
  i) 1++,1--,2+-,2-+
     read1 mapped to '+' strand indicates parental gene on '+' strand
     read1 mapped to '-' strand indicates parental gene on '-' strand
     read2 mapped to '+' strand indicates parental gene on '-' strand
     read2 mapped to '-' strand indicates parental gene on '+' strand
  ii) 1+-,1-+,2++,2--
     read1 mapped to '+' strand indicates parental gene on '-' strand
     read1 mapped to '-' strand indicates parental gene on '+' strand
     read2 mapped to '+' strand indicates parental gene on '+' strand
     read2 mapped to '-' strand indicates parental gene on '-' strand

For single-end RNA-seq, there are two different ways to strand reads:
  i) ++,--
     read mapped to '+' strand indicates parental gene on '+' strand
     read mapped to '-' strand indicates parental gene on '-' strand
  ii) +-,-+
     read mapped to '+' strand indicates parental gene on '-' strand
     read mapped to '-' strand indicates parental gene on '+' strand

Split by strand by matched strand

# input: 	accepted_hits.bam
# output:	accepted_hits_negStrand.bam: mapped to negative strand
#		accepted_hits_posStrand.bam: mapped to positive strand

bsub "samtools view -f 16 -b accepted_hits.bam >| accepted_hits_negStrand.bam"
bsub "samtools view -F 16 -b accepted_hits.bam >| accepted_hits_posStrand.bam"

Split reads by pair

# input:	accepted_hits_posStrand.bam or accepted_hits_negStrand.bam
# output:	1st pair: *_1stPair.bam
#		2nd pair: *_2ndPair.bam
bsub "samtools view -b -f 0x0040 accepted_hits_posStrand.bam > accepted_hits_posStrand_1stPair.bam"
bsub "samtools view -b -F 0x0040 accepted_hits_posStrand.bam > accepted_hits_posStrand_2ndPair.bam"
bsub "samtools view -b -f 0x0040 accepted_hits_negStrand.bam > accepted_hits_negStrand_1stPair.bam"
bsub "samtools view -b -F 0x0040 accepted_hits_negStrand.bam > accepted_hits_negStrand_2ndPair.bam"

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text