= Interpreting VCF files =

The [[https://samtools.github.io/hts-specs/ | VCF (Variant Call Format) specification pages]] describe most of what you need to know. Tags in the FILTER, INFO, and FORMAT fields are described in the VCF header. A Phred score P corresponds to a probability (ranging from 0 to 1) of '''10^-P/10^'''. As a tabular reference, common tags and scores are as follows:

'''QUAL''' field: QUAL = -10*log,,10,,(Probability(call in ALT is wrong))

'''FILTER''' field (typically generated by vcf-annotate):

||'''Tag''' || '''Description''' || '''Default threshold''' ||
||BaseQualBias || Min P-value for baseQ bias || 0 ||
||EndDistBias || Min P-value for end distance bias || 0.0001 ||
||GapWin || Window size for filtering adjacent gaps || 3 ||
||MapQualBias || Min P-value for mapQ bias || 0 ||
||MaxDP || Maximum read depth || 10000000 ||
||MinAB || Minimum number of alternate bases || 2 ||
||MinDP || Minimum read depth || 2 ||
||MinMQ || Minimum RMS mapping quality for SNPs || 10 ||
||Qual || Minimum value of the QUAL field || 10 ||
||RefN || Reference base is N || [] ||
||SnpGap || SNP within INT bp around a gap to be filtered || 10 ||
||StrandBias || Min P-value for strand bias || 0.0001 ||
||VDB || Minimum Variant Distance Bias || 0 ||

'''INFO''' field (typically generated by bcftools and expanded with vcf-annotate):

||'''Tag''' || '''Description''' || '''More details''' ||
||AC || Allele count in genotypes || ||
||AC1 || Max-likelihood estimate of the first ALT allele count (no HWE assumption) || ||
||AF1 || Max-likelihood estimate of the first ALT allele frequency (assuming HWE) || ||
||AN || Total number of alleles in called genotypes || ||
||CGT || The most probable constrained genotype configuration in the trio || ||
||CLR || Log ratio of genotype likelihoods with and without the constraint || ||
||DP || Raw read depth || For multiple-sample VCFs, the sum for all samples ||
||DP4 || Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases || ||
||FQ || Phred probability of all samples being the same || ||
||G3 || ML estimate of genotype frequencies || ||
||HWE || Hardy-Weinberg equilibrium test (PMID:15789306) || ||
||ICF || Inbreeding coefficient F || ||
||INDEL || Indicates that the variant is an INDEL || ||
||IS || Maximum number of reads supporting an indel and fraction of indel reads || ||
||MDV || Maximum number of high-quality nonRef reads in samples || ||
||MQ || Root-mean-square mapping quality of covering reads || ||
||PC2 || Phred probability of the nonRef allele frequency in group1 samples being larger (, smaller) than in group2 || ||
||PCHI2 || Posterior weighted chi2 P-value for testing the association between group1 and group2 samples || ||
||PR || Number of permutations yielding a smaller PCHI2 || ||
||PV4 || P-values for strand bias, baseQ bias, mapQ bias and tail distance bias || ||
||QBD || Quality by Depth: QUAL/#reads || ||
||QCHI2 || Phred-scaled PCHI2 || ||
||RPB || Read Position Bias || ||
||SF || Source File (index to sourceFiles, f when filtered) || ||
||TYPE || Variant type || ||
||UGT || The most probable unconstrained genotype configuration in the trio || ||
||VDB || Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data || ||

'''FORMAT''' field (listing tags of metrics for each sample):

||'''Tag''' || '''Description''' || '''More details''' ||
||GT || Genotype || ||
||GQ || Genotype Quality || -10*log,,10,,(Probability(genotype call is wrong)), so a bigger value means a more confident call ||
||GL || Likelihoods for RR, RA, AA genotypes (R=ref, A=alt) || ||
||DP || Number of high-quality bases || ||
||DV || Number of high-quality non-reference bases || ||
||SP || Phred-scaled strand bias P-value || ||
||PL || List of Phred-scaled genotype likelihoods || Scores for 0/0 (homozygous ref), 0/1 (heterozygous), and 1/1 (homozygous alt) genotypes. For a Phred-scaled likelihood of P, the raw likelihood of that genotype is L = 10^-P/10^ (so the higher the PL value, the less likely it is that your sample has that genotype). The sum of likelihoods is not necessarily 1. ||

== Predicting the effects of a set of variants ==

Especially for genome-wide analyses, the most difficult step is often prioritizing and making biological sense of a potentially very long list of variants. The best approach depends on the goal of the study, but tools such as the following may help:

 * snpEff
  * Given a genome-wide gene annotation file (as a GTF file), each variant can be linked to an exon, intron, and/or intergenic region.
  * For variants within an open reading frame, snpEff will determine the affected amino acid and whether the variant produces a synonymous or non-synonymous change.
  * Example command: java -jar /usr/local/share/snpEff/snpEff.jar -c /usr/local/share/snpEff/snpEff.config GENOME < Variants.vcf > Variants.snpEff.vcf
   * where GENOME is chosen from those listed with the command java -jar /usr/local/share/snpEff/snpEff.jar databases
 * Ensembl's Variant Effect Predictor
  * Go to http://www.ensembl.org/info/docs/tools/vep/index.html and click on "Launch the online VEP tool!".
 * Broad's MutSig ("Mutation Significance")
  * See http://www.broadinstitute.org/cancer/cga/mutsig for details.

== Other Analysis ==

 * Assessing Hardy-Weinberg Equilibrium (HWE) using an exact test: deviation from HWE may indicate selection, population mixing, or non-random mating; see [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1199378 | PMID 15789306]] for more details.
  * Use the [[https://vcftools.github.io/index.html | vcftools]] option "--hardy" to report a p-value for each site to assess deviation from HWE.
{{{
# The VCF file must explicitly report the genotypes of individuals.
# Results are written to myVariantsHWE.hwe
vcftools --vcf myVariants.vcf --hardy --out myVariantsHWE
}}}
  * Alternatively, the R package [[https://cran.r-project.org/web/packages/genetics/index.html | genetics]] can be used with the counts for each genotype:
{{{
library(genetics)
# Build a genotype vector from counts: 1 G/G, 47 G/A and 6450 A/A individuals
g1 <- genotype(c(rep("G/G", 1), rep("G/A", 47), rep("A/A", 6450)))
# Exact test for Hardy-Weinberg equilibrium
HWE.exact(g1)
}}}
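To make the exact test above less of a black box, here is a minimal Python sketch of a two-sided exact HWE test computed directly from the three genotype counts. It is an illustration only, not the code used by vcftools or the genetics package: the function name hwe_exact_pvalue is hypothetical, and the p-value is taken as the summed probability of all heterozygote counts that are no more likely than the observed one, given the observed allele counts.

```python
from math import exp, lgamma, log


def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Two-sided exact HWE p-value from genotype counts (AA, AB, BB).

    Hypothetical helper for illustration; conditions on the observed
    allele counts and enumerates every possible heterozygote count.
    """
    n = n_aa + n_ab + n_bb      # number of individuals
    if n == 0:
        raise ValueError("at least one genotype is required")
    n_a = 2 * n_aa + n_ab       # copies of allele A
    n_b = 2 * n_bb + n_ab       # copies of allele B

    def log_prob(het):
        # log P(het heterozygotes | n individuals, n_a copies of A) under HWE
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2) + lgamma(n + 1)
                - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # The heterozygote count must have the same parity as the allele counts.
    het_counts = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {het: exp(log_prob(het)) for het in het_counts}

    # Sum the outcomes that are at most as probable as the observed one
    # (small relative tolerance guards against floating-point ties).
    p_obs = probs[n_ab]
    p_value = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))
    return min(1.0, p_value)


# Same counts as the R example: 1 G/G, 47 G/A, 6450 A/A
print(hwe_exact_pvalue(1, 47, 6450))
```

Working in log space with lgamma avoids overflow in the factorials for cohorts of thousands of samples. For real analyses, prefer the vcftools or R routes shown above.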
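Returning to the Phred-scaled fields (QUAL, GQ, PL) in the tables at the top of this page, the conversion 10^-P/10^ can be sketched in a few lines of Python. The function names are hypothetical. Rescaling the PL likelihoods so that they sum to 1 is an added assumption (it amounts to a flat prior over genotypes) and is not something the VCF itself reports.

```python
def phred_to_prob(score):
    """Convert a Phred-scaled score P to a probability, 10^(-P/10)."""
    return 10 ** (-score / 10)


def pl_to_likelihoods(pl):
    """Convert Phred-scaled PL values to raw genotype likelihoods."""
    return [phred_to_prob(p) for p in pl]


def normalise(likelihoods):
    """Rescale likelihoods to sum to 1 (assumes a flat genotype prior)."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# Hypothetical PL entry for the genotypes 0/0, 0/1 and 1/1
pl = [30, 0, 45]
raw = pl_to_likelihoods(pl)
print(raw)              # raw likelihoods; these need not sum to 1
print(normalise(raw))   # relative support for each genotype
```

In this made-up example the heterozygous genotype has PL = 0 and therefore the highest likelihood, matching the rule that a higher PL value means a less likely genotype.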