Version 3 (modified by 11 years ago) ( diff ) | ,
---|
Variant calling and analysis
The main steps comprising variant calling and analysis are
- mapping short reads
- calling raw variants
- adding filters (really more like 'tagging') and some annotations to variants
- annotating effect(s) of variants on genes
Map short reads
After quality control and/or filtering of reads...
Single reads with bowtie2:
bowtie2 -x /nfs/genomes/sgd_2010/bowtie/sacCer3 A_reads.fq -S A_reads.bt2.sam
Single reads with bwa:
bwa aln /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.fq > A_reads.sai bwa samse /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.sai A_reads.fq > A_reads.bwa.sam
Paired-end reads with bowtie2:
bowtie2 -x /nfs/genomes/sgd_2010/bowtie/sacCer3 -1 A_reads.1.fq -2 A_reads.2.fq -S A_reads.1+2.bt2.sam
Paired-end reads with bwa (for reads of at least 70 nt):
bwa mem /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.1.fq A_reads.2.fq > A_reads.1+2.bwa.sam
Convert to BAM, sort, and index [with a custom BaRC script that uses samtools]:
/nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl A_reads.bwa.sam
Get uniquely mapping reads
samtools view -h A_reads.bt2.bam | grep -v XS:i: | samtools view -bS - > A_reads.bt2.sorted_unique.bam
Call raw variants with mpileup+bcftools
Call variants (one sample vs. reference) with samtools' mpileup+bcftools (see the samtools' variant calling page for more details)
samtools mpileup -d100000 -uf /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.bt2.sorted_unique.bam | bcftools view -bvcg - >| A_reads.bt2.sorted_unique.raw.bcf
Call variants (multiple sample vs. reference) using a set of BAM files
samtools mpileup -d100000 -uf /nfs/genomes/sgd_2010/bwa/sacCer3.fa *_reads.bt2.sorted_unique.bam | bcftools view -bvcg - >| ALL_reads.bt2.sorted_unique.raw.bcf
Add filters and annotations to raw variants
This step uses vcf-annotate from the VCFtools suite
Annotate variants by adding tags ("filters" but all variants are kept) to each variant, using all default filters
bcftools view -L -vcg A_reads.bt2.sorted_unique.raw.bcf | vcf-annotate -f + > A_reads.bt2.sorted_unique.withTags.bcf
Prepare file of known SNPs for use with vcf-annotate. Start with tab-delimited file (ex: SNP137.bed) that looks like
chr1 1360 1361 rs000000001
bgzip SNP137.bed tabix -p bed SNP137.bed.gz
Annotate variants by adding tags, more analysis, and any SNPdb overlaps
bcftools view -L -vcg A_reads.bt2.sorted_unique.raw.bcf | vcf-annotate -f +/d=10 --fill-HWE --fill-type -n -a SNP137.bed.gz -c CHROM,FROM,TO,INFO/SNP_ID -d key=INFO,ID=SNP_ID,Number=1,Type=Integer,Description='SNP137 sites' > A_reads.bt2.sorted_unique.filtered.vcf
Any tags will appear in the FILTER field, and SNPdb overlaps will appear in the INFO field of the output VCF file.
Overlap with annotations can also be identified with intersectBed (where annotation will appear in new fields of the output VCF file):
intersectBed -wao -split -a A_reads.bt2.sorted_unique.raw.vcf -b SNP137.bed > A_reads.bt2.sorted_unique.annotated.vcf
Annotate effect(s) of variants on genes
Use snpEff (assuming snpEff has gene+protein annotations for your genome):
snpEff -c /usr/local/share/snpEff/snpEff.config -s A_snpEff.html SacCer_Apr2011.18 A_reads.bt2.sorted_unique.filtered.vcf > A_reads.bt2.sorted_unique.filtered.snpEff.vcf
Get information only for variants that overlap protein-coding exons:
snpEff -c /usr/local/share/snpEff/snpEff.config -no-downstream -no-intergenic -no-intron -no-upstream -no-utr -s A_snpEff.html SacCer_Apr2011.18 A_reads.bt2.sorted_unique.filtered.vcf > A_reads.bt2.sorted_unique.filtered.snpEff.vcf