wiki:SOPs/variant_calling

Version 19 (modified by gbell, 8 years ago) ( diff )

--

Variant calling and analysis

The main steps comprising variant calling and analysis are

  • mapping short reads
  • calling raw variants
  • adding filters (really more like 'tagging' to identify raw variants that are really variants and not technical errors) and some annotations to variants
  • annotating effect(s) of variants on genes (like if they change protein sequence)

Which mapper and variant caller works best? No simple answer (of course), but see

Map short reads

After quality control and/or filtering of reads...

Single reads with bowtie2:

bowtie2 -x /nfs/genomes/sgd_2010/bowtie/sacCer3 A_reads.fq -S A_reads.bt2.sam

Single reads with bwa:

bwa aln /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.fq > A_reads.sai
bwa samse /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.sai  A_reads.fq > A_reads.bwa.sam

Paired-end reads with bowtie2:

bowtie2 -x /nfs/genomes/sgd_2010/bowtie/sacCer3 -1 A_reads.1.fq -2 A_reads.2.fq -S A_reads.1+2.bt2.sam

Paired-end reads with bwa (for reads of at least 70 nt):

bwa mem /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.1.fq A_reads.2.fq > A_reads.1+2.bwa.sam

Convert to BAM, sort, and index [with a custom BaRC script that uses samtools]:

/nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl A_reads.bwa.sam

Get uniquely mapping reads (Is this recommended or not?)

samtools view -h A_reads.bt2.bam | grep -v XS:i: | samtools view -bS - > A_reads.bt2.sorted_unique.bam

Remove duplicated reads assumed to be PCR artifacts. This should be done even if duplication rate is low.

# paired-end reads
samtools rmdup A_reads.bt2.sorted.bam A_reads.bt2.sorted.noDups.bam
# single-end reads
samtools rmdup -s A_reads.bt2.sorted.bam A_reads.bt2.sorted.noDups.bam

Call raw variants with mpileup+bcftools

Call variants (one sample vs. reference) with samtools' mpileup+bcftools (see the samtools' variant calling page for more details). In our experience, "-B" (disable BAQ) or "-E" (recalculate BAQ) works better than the default method, which can remove some obvious variants.

samtools mpileup -d100000 -t DP,AD,ADF,ADR,INFO/AD,SP -Buf /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.bt2.sorted_unique.bam | bcftools call -O b -v -c - >| A_reads.bt2.sorted_unique.raw.bcf
#in bcftools call: -m for multi-allelic, -c for bi-alleleic

Call variants (multiple sample vs. reference) using a set of BAM files

samtools mpileup -d100000 -t DP,AD,ADF,ADR,INFO/AD,SP -Buf /nfs/genomes/sgd_2010/bwa/sacCer3.fa *_reads.bt2.sorted_unique.bam | bcftools call -O b -v -c - >| ALL_reads.bt2.sorted_unique.raw.bcf

Add filters and annotations to raw variants

This step uses vcf-annotate from the VCFtools suite

Annotate variants by adding tags ("filters" but all variants are kept) to each variant, using all default filters.

bcftools view -vcg A_reads.bt2.sorted_unique.raw.bcf | vcf-annotate -f + > A_reads.bt2.sorted_unique.withTags.vcf

Prepare file of known SNPs for use with vcf-annotate. Start with tab-delimited file (ex: SNP137.bed) that looks like

chr1 1360 1361 rs000000001

bgzip SNP137.bed
tabix -p bed SNP137.bed.gz

Annotate variants by adding tags, more analysis, and any SNPdb overlaps

bcftools view -vcg A_reads.bt2.sorted_unique.raw.bcf | vcf-annotate -f +/d=10 --fill-HWE --fill-type -n -a SNP137.bed.gz -c CHROM,FROM,TO,INFO/SNP_ID -d key=INFO,ID=SNP_ID,Number=1,Type=Integer,Description='SNP137 sites' > A_reads.bt2.sorted_unique.filtered.vcf

In the output VCF file, any tags will appear in the FILTER field, and SNPdb overlaps will appear in the INFO field.

Overlap with annotations can also be identified with intersectBed (where annotation will appear in new fields of the output VCF file):

intersectBed -wao -split -a A_reads.bt2.sorted_unique.raw.vcf -b SNP137.bed > A_reads.bt2.sorted_unique.annotated.vcf

Annotate effect(s) of variants on genes

Use snpEff (assuming snpEff has gene+protein annotations for your genome):

snpEff -c /usr/local/share/snpEff/snpEff.config -s A_snpEff.html SacCer_Apr2011.18 A_reads.bt2.sorted_unique.filtered.vcf > A_reads.bt2.sorted_unique.filtered.snpEff.vcf 

Get information only for variants that overlap protein-coding exons:

snpEff -c /usr/local/share/snpEff/snpEff.config -no-downstream -no-intergenic -no-intron -no-upstream -no-utr -s A_snpEff.html SacCer_Apr2011.18 A_reads.bt2.sorted_unique.filtered.vcf > A_reads.bt2.sorted_unique.filtered.snpEff.vcf

See Interpreting VCF files for more information.

Get read counts and statistics at desired sites

Sometimes one may have an interesting set of genome locations that may not turn up in a variant output file (if they are variants in that sample, for example).

If you need variant-type metrics for those sites, you can use a command like

samtools mpileup -uv -t DP,DPR,DV,DP4,INFO/DPR,SP -l Desired_sites.bed -f refGenome.fa Sample.bam | bcftools call -c - > Desired_sites_info.vcf 

If you want counts for all 4 nucleotides (not just reference and alternate alleles), one way is with bam-readcount:

bam-readcount -l Desired_sites.bed -f refGenome.fa Sample.bam > Desired_sites.nt_counts.txt

Original page

Note: See TracWiki for help on using the wiki.