= Variant calling and analysis = The main steps comprising variant calling and analysis are * mapping short reads * calling raw variants * adding filters (really more like 'tagging' to identify raw variants that are really variants and not technical errors) and some annotations to variants * annotating effect(s) of variants on genes (like if they change protein sequence) Which mapper and variant caller works best? No simple answer (of course), but see * [http://www.ncbi.nlm.nih.gov/pubmed/26639839 Hwang at al., 2015. Systematic comparison of variant calling pipelines using gold standard personal exome variants.] * [http://www.ncbi.nlm.nih.gov/pubmed/25078893 Pirooznia et al., 2014. Validation and assessment of variant calling pipelines for next-generation sequencing.] * [http://www.ncbi.nlm.nih.gov/pubmed/24086590 Liu et al., 2013. Variant callers for next-generation sequencing data: a comparison study.] * Note: GATK is optimized for large human datasets, whereas GATK and samtools may perform similarly with other species and smaller-scale experiments. == Map short reads == After quality control and/or filtering of reads... Single reads with bowtie2: {{{ bowtie2 -x /nfs/genomes/sgd_2010/bowtie/sacCer3 A_reads.fq -S A_reads.bt2.sam }}} Single reads with bwa: {{{ #assuming the index file name is sacCer3.fa: use the name that was used when creating the index bwa aln /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.fq > A_reads.sai bwa samse /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.sai A_reads.fq > A_reads.bwa.sam }}} Paired-end reads with bowtie2: {{{ bowtie2 -x /nfs/genomes/sgd_2010/bowtie/sacCer3 -1 A_reads.1.fq -2 A_reads.2.fq -S A_reads.1+2.bt2.sam }}} Paired-end reads with bwa (for reads of at least 70 nt): {{{ bwa mem /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.1.fq A_reads.2.fq > A_reads.1+2.bwa.sam }}} Convert to BAM, sort, and index [with a custom BaRC script that uses samtools]: {{{ /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl A_reads.bwa.sam }}} Get uniquely mapping reads (Is this recommended or not?) {{{ bsub samtools view -b -q 5 -o A_reads.bt2.unique.bam A_reads.bt2.bam }}} Remove duplicated reads assumed to be PCR artifacts. This should be done even if duplication rate is low. {{{ # paired-end reads samtools rmdup A_reads.bt2.sorted.bam A_reads.bt2.sorted.noDups.bam # single-end reads samtools rmdup -s A_reads.bt2.sorted.bam A_reads.bt2.sorted.noDups.bam }}} == Call raw variants with mpileup+bcftools == Call variants (one sample vs. reference) with samtools' mpileup+bcftools (see the [http://www.htslib.org/workflow/#mapping_to_variant samtools' variant calling workflow] for more details). In our experience, "-B" (disable BAQ) or "-E" (recalculate BAQ) works better than the default method, which can remove some obvious variants. {{{ samtools mpileup -d100000 -t DP,AD,ADF,ADR,SP,DP4,INFO/AD -Buf /nfs/genomes/sgd_2010/bwa/sacCer3.fa A_reads.bt2.sorted_unique.bam | bcftools call -O b -v -c - >| A_reads.bt2.sorted_unique.raw.bcf #in bcftools call: -m for multi-allelic, -c for bi-alleleic }}} Call variants (multiple sample vs. reference) using a set of BAM files {{{ samtools mpileup -d100000 -t DP,AD,ADF,ADR,SP,DP4,INFO/AD -Buf /nfs/genomes/sgd_2010/bwa/sacCer3.fa *_reads.bt2.sorted_unique.bam | bcftools call -O b -v -c - >| ALL_reads.bt2.sorted_unique.raw.bcf }}} == Add filters and annotations to raw variants == This step uses vcf-annotate from the [https://vcftools.github.io/perl_module.html VCFtools suite] Annotate variants by adding tags ("filters" but all variants are kept) to each variant, using all default filters. {{{ bcftools view A_reads.bt2.sorted_unique.raw.bcf | vcf-annotate -f + > A_reads.bt2.sorted_unique.withTags.vcf }}} Generate variant statistics for each sample (with 'bcftools stats') and plot them. {{{ bcftools stats -F -s - > plot-vcfstats -p vcfstats }}} Prepare file of known SNPs for use with vcf-annotate. Start with tab-delimited file (ex: SNP137.bed) that looks like chr1 1360 1361 rs000000001 {{{ bgzip SNP137.bed tabix -p bed SNP137.bed.gz }}} Annotate variants by adding tags, more analysis, and any SNPdb overlaps {{{ bcftools view A_reads.bt2.sorted_unique.raw.bcf | vcf-annotate -f +/d=10 --fill-HWE --fill-type -n -a SNP137.bed.gz -c CHROM,FROM,TO,INFO/SNP_ID -d key=INFO,ID=SNP_ID,Number=1,Type=Integer,Description='SNP137 sites' > A_reads.bt2.sorted_unique.filtered.vcf }}} In the output VCF file, any tags will appear in the FILTER field, and SNPdb overlaps will appear in the INFO field. Overlap with annotations can also be identified with intersectBed (where annotation will appear in new fields of the output VCF file): {{{ intersectBed -wao -split -a A_reads.bt2.sorted_unique.raw.vcf -b SNP137.bed > A_reads.bt2.sorted_unique.annotated.vcf }}} == Annotate effect(s) of variants on genes == * snpEff Use snpEff (assuming snpEff has gene+protein annotations for your genome): {{{ snpEff -c /usr/local/share/snpEff/snpEff.config -s A_snpEff.html SacCer_Apr2011.18 A_reads.bt2.sorted_unique.filtered.vcf > A_reads.bt2.sorted_unique.filtered.snpEff.vcf }}} Get information only for variants that overlap protein-coding exons: {{{ snpEff -c /usr/local/share/snpEff/snpEff.config -no-downstream -no-intergenic -no-intron -no-upstream -no-utr -s A_snpEff.html SacCer_Apr2011.18 A_reads.bt2.sorted_unique.filtered.vcf > A_reads.bt2.sorted_unique.filtered.snpEff.vcf }}} * Ensembl's Variant Effect Predictor (VEP) For large number of variants, download the annotation and use --cache or --offline instead of --database, see[[http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html | documentation]] {{{ #default human, set by --species vep -i chr1_subset.vcf --database }}} See [wiki:SOPs/vcf Interpreting VCF files ] for more information. == Get read counts and statistics at desired sites == Sometimes one may have an interesting set of genome locations that may not turn up in a variant output file (if they are variants in that sample, for example). If you need variant-type metrics for those sites, you can use a command like {{{ samtools mpileup -uv -t DP,DPR,DV,DP4,INFO/DPR,SP -l Desired_sites.bed -f refGenome.fa Sample.bam | bcftools call -c - > Desired_sites_info.vcf }}} If you want counts for all 4 nucleotides (not just reference and alternate alleles), one way is with [https://github.com/genome/bam-readcount bam-readcount]. Note that you'll only get a maximum coverage of ~8000 reads unless you include location (ex: chr19:50-100) as the final argument for 'bam-readcount'. {{{ bam-readcount -l Desired_sites.bed -f refGenome.fa Sample.bam chr1 > Desired_sites.nt_counts_etc.txt # Extract just nt counts (fields: chr, position, reference_base, depth, As, Cs, Gs, Ts) perl -pe 's/:/\t/g' | cut -f1-4,20,34,48,62 Desired_sites.nt_counts_etc.txt > Desired_sites.nt_counts_only.txt }}} == Create a consensus sequence by applying VCF variants to a reference genome file == {{{ bgzip All_samples.variants.withTags.vcf tabix -p vcf All_samples.variants.withTags.vcf.gz samtools faidx Ref_genome.fa chr1:1-1000 | bcftools consensus All_samples.variants.withTags.vcf.gz > chr1_1_1000.consensus.fa }}} == Original page == * [wiki:SOPs/callingSNPs Calling SNPs from Short-Read Sequencing][[br]] [[br]]