= Interpreting VCF files =

The [https://samtools.github.io/hts-specs/ VCF (Variant Call Format) specification pages] describe most of what you need to know.

Tags in the FILTER, INFO, and FORMAT fields are described in the VCF header.
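As a minimal sketch of how those header descriptions can be looked up programmatically, the snippet below pulls the tag ID and description out of `##FILTER`, `##INFO` and `##FORMAT` lines. The header lines and the `describe_tags` helper are made up for illustration; real headers may carry extra keys or unusual quoting that this simple pattern does not handle.

```python
import re

# Hypothetical header lines, written in the style used by VCF headers.
header_lines = [
    '##fileformat=VCFv4.1',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">',
    '##FILTER=<ID=MinDP,Description="Minimum read depth [2]">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">',
]

# Captures the field name, the tag ID and the quoted description.
pattern = re.compile(r'^##(FILTER|INFO|FORMAT)=<ID=([^,>]+).*Description="([^"]*)"')

def describe_tags(lines):
    """Return {(field, tag): description} for every tag defined in the header."""
    tags = {}
    for line in lines:
        match = pattern.match(line)
        if match:
            field, tag, description = match.groups()
            tags[(field, tag)] = description
    return tags

tags = describe_tags(header_lines)
for (field, tag), description in sorted(tags.items()):
    print(field, tag, description, sep="\t")
```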

The probability (ranging from 0 to 1) that corresponds to a Phred score P is '''10^-P/10^''', so a higher score means a lower probability.
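The conversion can be sketched in a few lines of Python. The function names `phred_to_prob` and `prob_to_phred` are illustrative only and do not come from any VCF tool. A score of 20, for example, corresponds to a probability of 1 in 100.

```python
import math

def phred_to_prob(phred):
    """Probability for a Phred-scaled score P: 10^(-P/10)."""
    return 10 ** (-phred / 10)

def prob_to_phred(prob):
    """Phred-scaled score for a probability: -10*log10(prob)."""
    return -10 * math.log10(prob)

# Print the probability for a few common scores.
for score in (10, 20, 30):
    print(score, phred_to_prob(score))
```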

As a tabular reference, common tags and scores are as follows:

'''QUAL''' field: QUAL = -10*log,,10,,(Probability(call in ALT is wrong))

'''FILTER''' field (typically generated by vcf-annotate):

||'''Tag''' || '''Description''' || '''Default threshold''' ||
||BaseQualBias || Min P-value for baseQ bias || 0 ||
||EndDistBias || Min P-value for end distance bias || 0.0001 ||
||GapWin || Window size for filtering adjacent gaps || 3 ||
||MapQualBias || Min P-value for mapQ bias || 0 ||
||MaxDP || Maximum read depth || 10000000 ||
||MinAB || Minimum number of alternate bases || 2 ||
||MinDP || Minimum read depth || 2 ||
||MinMQ || Minimum RMS mapping quality for SNPs || 10 ||
||Qual || Minimum value of the QUAL field || 10 ||
||RefN || Reference base is N || [] ||
||SnpGap || SNP within INT bp around a gap to be filtered || 10 ||
||StrandBias || Min P-value for strand bias || 0.0001 ||
||VDB || Minimum Variant Distance Bias || 0 ||

'''INFO''' field (typically generated by bcftools and expanded with vcf-annotate):

||'''Tag''' || '''Description''' || '''More details''' ||
||AC || Allele count in genotypes ||  ||
||AC1 || Max-likelihood estimate of the first ALT allele count (no HWE assumption) ||  ||
||AF1 || Max-likelihood estimate of the first ALT allele frequency (assuming HWE) ||  ||
||AN || Total number of alleles in called genotypes ||  ||
||CGT || The most probable constrained genotype configuration in the trio ||  ||
||CLR || Log ratio of genotype likelihoods with and without the constraint ||  ||
||DP || Raw read depth || For multiple-sample VCFs, the sum for all samples ||
||DP4 || Number of high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases ||  ||
||FQ || Phred probability of all samples being the same ||  ||
||G3 || ML estimate of genotype frequencies ||  ||
||HWE || Hardy-Weinberg equilibrium test (PMID:15789306) ||  ||
||ICF || Inbreeding coefficient F ||  ||
||INDEL || Indicates that the variant is an INDEL. ||  ||
||IS || Maximum number of reads supporting an indel and fraction of indel reads ||  ||
||MDV || Maximum number of high-quality nonRef reads in samples ||  ||
||MQ || Root-mean-square mapping quality of covering reads ||  ||
||PC2 || Phred probability of the nonRef allele frequency in group1 samples being larger (first value) or smaller (second value) than in group2. ||  ||
||PCHI2 || Posterior weighted chi2 P-value for testing the association between group1 and group2 samples. ||  ||
||PR || Number of permutations yielding a smaller PCHI2. ||  ||
||PV4 || P-values for strand bias, baseQ bias, mapQ bias and tail distance bias ||  ||
||QBD || Quality by Depth: QUAL/#reads ||  ||
||QCHI2 || Phred scaled PCHI2. ||  ||
||RPB || Read Position Bias ||  ||
||SF || Source File (index to sourceFiles, f when filtered) ||  ||
||TYPE || Variant type ||  ||
||UGT || The most probable unconstrained genotype configuration in the trio ||  ||
||VDB || Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. ||  ||
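To illustrate how these tags appear in a variant line, the sketch below splits a semicolon-separated INFO string into key/value pairs and uses DP4 to compute the fraction of high-quality bases that support the alternate allele. The INFO string and the `parse_info` helper are hypothetical examples, not output from a real run.

```python
def parse_info(info):
    """Split an INFO string into a dict; flags such as INDEL map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

# Hypothetical INFO column from one variant line.
info = parse_info("INDEL;DP=14;DP4=2,3,5,4;MQ=50")

# DP4 = ref-forward, ref-reverse, alt-forward, alt-reverse high-quality bases.
ref_fwd, ref_rev, alt_fwd, alt_rev = (int(n) for n in info["DP4"].split(","))
alt_fraction = (alt_fwd + alt_rev) / (ref_fwd + ref_rev + alt_fwd + alt_rev)
print(info["INDEL"], info["DP"], alt_fraction)
```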

'''FORMAT''' field (listing tags of metrics for each sample):

||'''Tag''' || '''Description''' || '''More details''' ||
|| GT || Genotype || ||
|| GQ || Genotype Quality || -10*log,,10,,prob(genotype call is wrong) [so bigger is more confident] ||
|| GL || Likelihoods for RR,RA,AA genotypes (R=ref,A=alt) || ||
|| DP || Number of high-quality bases || ||
|| DV || Number of high-quality non-reference bases || ||
|| SP || Phred-scaled strand bias P-value || ||
|| PL || List of Phred-scaled genotype likelihoods || Scores for 0/0 (homozygous ref), 0/1 (heterozygous), and 1/1 (homozygous alt) genotypes.  For a phred-scaled likelihood of P, the raw likelihood of that genotype L = 10^-P/10^ (so the higher the number, the less likely it is that your sample is that genotype).  The sum of likelihoods is not necessarily 1. ||
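The PL conversion described in the table can be sketched as follows. The FORMAT and sample strings are invented for a biallelic site, and `parse_sample` and `pl_to_likelihoods` are illustrative helper names. Because the raw likelihoods need not sum to 1, the example rescales them before printing, which is only a convenience for comparing genotypes and not a posterior probability.

```python
def parse_sample(format_field, sample_field):
    """Pair the colon-separated FORMAT keys with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def pl_to_likelihoods(pl):
    """Convert Phred-scaled likelihoods to raw likelihoods: L = 10^(-P/10)."""
    return [10 ** (-p / 10) for p in pl]

# Hypothetical FORMAT and sample columns for a biallelic site.
sample = parse_sample("GT:PL:GQ", "0/1:45,0,60:47")
pl = [int(p) for p in sample["PL"].split(",")]
raw = pl_to_likelihoods(pl)

# Rescale so the three values sum to 1 and can be compared directly.
total = sum(raw)
for genotype, value in zip(("0/0", "0/1", "1/1"), raw):
    print(genotype, value / total)
```

The genotype with the lowest PL (here 0/1, with PL = 0) has the highest relative likelihood.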

== Predicting the effects of a set of variants ==

Especially for genome-wide analyses, the most difficult step is often prioritizing and making biological sense of a potentially very long list of variants.

The best way to do this often depends on the goal of the study, but tools such as the following may help:

* snpEff 
   * Given a genome-wide gene annotation file (as a GTF file), each variant can be linked to an exon, intron, and/or intergenic region.
   * For variants within an open reading frame, snpEff will determine the relevant amino acid and whether the variant will produce a synonymous or non-synonymous change.
   * Example command: java -jar /usr/local/share/snpEff/snpEff.jar -c /usr/local/share/snpEff/snpEff.config GENOME < Variants.vcf > Variants.snpEff.vcf
     * where GENOME is chosen from those listed with the command java -jar /usr/local/share/snpEff/snpEff.jar databases
* Ensembl's Variant Effect Predictor
   * Go to http://www.ensembl.org/info/docs/tools/vep/index.html and click on "Launch the online VEP tool!".

* Broad's MutSig ("Mutation Significance")
   * See http://www.broadinstitute.org/cancer/cga/mutsig for details

== Other Analysis ==

* Assessing Hardy-Weinberg Equilibrium (HWE) using an exact test: deviation from HWE may indicate selection, population mixing or non-random mating; see [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1199378 | PMID 15789306 ]] for more details.
  * Use [[https://vcftools.github.io/index.html | vcftools]] option "--hardy" to report a p-value for each site to assess deviation from HWE.  
{{{
#VCF file must have genotypes of individuals explicitly reported
vcftools --vcf myVariants.vcf --hardy --out myVariantsHWE
}}}
  * Alternatively, the R package [[https://cran.r-project.org/web/packages/genetics/index.html | genetics]] can be used with counts for each genotype:
{{{
library(genetics)
g1 = genotype(c(rep("G/G",1),rep("G/A",47),rep("A/A",6450)))
HWE.exact(g1)
}}}
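To see what such a test compares against, the sketch below computes the genotype counts expected under HWE from the same observed counts as the R example above. It is not an implementation of the exact test: it only derives the allele frequencies and the resulting expectations, and `hwe_expected` is an illustrative helper name.

```python
def hwe_expected(n_gg, n_ga, n_aa):
    """Expected G/G, G/A, A/A counts under HWE, given the observed counts."""
    n = n_gg + n_ga + n_aa
    p = (2 * n_gg + n_ga) / (2 * n)  # frequency of allele G
    q = 1 - p                        # frequency of allele A
    return n * p * p, 2 * n * p * q, n * q * q

# Same genotype counts as the R example above.
observed = (1, 47, 6450)
expected = hwe_expected(*observed)
for label, obs, exp in zip(("G/G", "G/A", "A/A"), observed, expected):
    print(label, obs, round(exp, 2))
```

Large differences between observed and expected counts are what the p-values reported by vcftools or `HWE.exact` quantify.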