Interpreting VCF files
The VCF (Variant Call Format) specification pages describe most of what you need to know.
Tags in the FILTER, INFO, and FORMAT fields are described in the VCF header.
Probability (ranging from 0 to 1) for a Phred score P is defined as 10^(-P/10).
As a tabular reference, common tags and scores are as follows:
QUAL field: QUAL = -10*log10(Probability(call in ALT is wrong))
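The Phred relationship above can be sketched in a few lines of Python. This is only an illustration: the function names `phred_to_prob` and `prob_to_phred` are made up for this example and do not come from any VCF tool.

```python
import math


def phred_to_prob(phred):
    """Return the error probability 10^(-P/10) for a Phred-scaled score P."""
    return 10 ** (-phred / 10)


def prob_to_phred(prob):
    """Return the Phred-scaled score -10*log10(prob) for a probability in (0, 1]."""
    if not 0 < prob <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(prob)


if __name__ == "__main__":
    # Print the error probability implied by a few QUAL values.
    for qual in (10, 20, 30):
        print(qual, phred_to_prob(qual))
```

By this definition a QUAL of 30 corresponds to a 1 in 1000 chance that the ALT call is wrong, and every additional 10 points lowers that chance tenfold.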
FILTER field (typically generated by vcf-annotate):
Tag | Description | Default threshold |
BaseQualBias | Min P-value for baseQ bias | 0 |
EndDistBias | Min P-value for end distance bias | 0.0001 |
GapWin | Window size for filtering adjacent gaps | 3 |
MapQualBias | Min P-value for mapQ bias | 0 |
MaxDP | Maximum read depth | 10000000 |
MinAB | Minimum number of alternate bases | 2 |
MinDP | Minimum read depth | 2 |
MinMQ | Minimum RMS mapping quality for SNPs | 10 |
Qual | Minimum value of the QUAL field | 10 |
RefN | Reference base is N | [] |
SnpGap | SNP within INT bp around a gap to be filtered | 10 |
StrandBias | Min P-value for strand bias | 0.0001 |
VDB | Minimum Variant Distance Bias | 0 |
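To make the threshold logic concrete, here is a simplified Python sketch that checks a single record against four of the defaults from the table above (MinDP, MaxDP, MinMQ, Qual). It is not how vcf-annotate is implemented; the function name, its arguments, and the choice to skip MinMQ for indels are assumptions made for illustration.

```python
# Default thresholds copied from the FILTER table above.
DEFAULTS = {"MinDP": 2, "MaxDP": 10000000, "MinMQ": 10, "Qual": 10}


def failed_filters(qual, depth, rms_mapq, is_indel=False, thresholds=DEFAULTS):
    """Return the FILTER tags a record would fail, or ["PASS"] if none."""
    failed = []
    if depth < thresholds["MinDP"]:
        failed.append("MinDP")
    if depth > thresholds["MaxDP"]:
        failed.append("MaxDP")
    # The table describes MinMQ as applying to SNPs, so indels are
    # skipped here (an assumption of this sketch).
    if not is_indel and rms_mapq < thresholds["MinMQ"]:
        failed.append("MinMQ")
    if qual < thresholds["Qual"]:
        failed.append("Qual")
    return failed or ["PASS"]


if __name__ == "__main__":
    # Hypothetical record values, not taken from a real VCF file.
    print(failed_filters(qual=50, depth=30, rms_mapq=40))
    print(failed_filters(qual=5, depth=1, rms_mapq=8))
```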
INFO field (typically generated by bcftools and expanded with vcf-annotate):
Tag | Description | More details |
AC | Allele count in genotypes | |
AC1 | Max-likelihood estimate of the first ALT allele count (no HWE assumption) | |
AF1 | Max-likelihood estimate of the first ALT allele frequency (assuming HWE) | |
AN | Total number of alleles in called genotypes | |
CGT | The most probable constrained genotype configuration in the trio | |
CLR | Log ratio of genotype likelihoods with and without the constraint | |
DP | Raw read depth | For multiple-sample VCFs, the sum for all samples |
DP4 | Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases | |
FQ | Phred probability of all samples being the same | |
G3 | ML estimate of genotype frequencies | |
HWE | Hardy-Weinberg equilibrium test (PMID:15789306) | |
ICF | Inbreeding coefficient F | |
INDEL | Indicates that the variant is an INDEL. | |
IS | Maximum number of reads supporting an indel and fraction of indel reads | |
MDV | Maximum number of high-quality nonRef reads in samples | |
MQ | Root-mean-square mapping quality of covering reads | |
PC2 | Phred probability of the nonRef allele frequency in group1 samples being larger (, smaller) than in group2. | |
PCHI2 | Posterior weighted chi2 P-value for testing the association between group1 and group2 samples. | |
PR | Number of permutations yielding a smaller PCHI2. | |
PV4 | P-values for strand bias, baseQ bias, mapQ bias and tail distance bias | |
QBD | Quality by Depth: QUAL/#reads | |
QCHI2 | Phred scaled PCHI2. | |
RPB | Read Position Bias | |
SF | Source File (index to sourceFiles, f when filtered) | |
TYPE | Variant type | |
UGT | The most probable unconstrained genotype configuration in the trio | |
VDB | Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. | |
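In the file itself, the INFO column is a semicolon-separated list of `TAG=value` pairs, with flag tags such as INDEL appearing without a value. The sketch below shows one way to split such a string in Python and to summarize DP4. The helper names and the example INFO string are hypothetical and chosen only for illustration.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag tags (no '=') map to True."""
    fields = {}
    if info == ".":  # "." marks an empty INFO column
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def alt_fraction_from_dp4(dp4):
    """Fraction of high-quality bases supporting ALT, from a DP4 value string."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    return (alt_fwd + alt_rev) / total if total else 0.0


if __name__ == "__main__":
    # Made-up INFO string for demonstration only.
    info = parse_info("DP=35;AF1=0.5;DP4=10,8,9,8;INDEL;MQ=40")
    print(info["DP"], info["INDEL"], alt_fraction_from_dp4(info["DP4"]))
```

Values are kept as strings here; which tags are integers, floats, or lists is declared in the VCF header, so a fuller parser would read the types from there.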
FORMAT field (listing tags of metrics for each sample):
Tag | Description | More details |
GT | Genotype | |
GQ | Genotype Quality | -10*log10(Probability(genotype call is wrong)) [so bigger is more confident] |
GL | Likelihoods for RR,RA,AA genotypes (R=ref,A=alt) | |
DP | Number of high-quality bases | |
DV | Number of high-quality non-reference bases | |
SP | Phred-scaled strand bias P-value | |
PL | List of Phred-scaled genotype likelihoods | Scores for 0/0 (homozygous ref), 0/1 (heterozygous), and 1/1 (homozygous alt) genotypes. For a Phred-scaled likelihood of P, the raw likelihood of that genotype is L = 10^(-P/10) (so the higher the number, the less likely it is that your sample is that genotype). The sum of likelihoods is not necessarily 1. |
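Because the raw likelihoods derived from PL do not necessarily sum to 1, it can help to rescale them before comparing genotypes. The following Python sketch does this for a biallelic site. The function names and the example PL values are hypothetical, and treating the rescaled values as probabilities implicitly assumes that all three genotypes are equally likely beforehand.

```python
GENOTYPES = ("0/0", "0/1", "1/1")  # PL order for a biallelic site


def pl_to_relative_likelihoods(pl):
    """Convert Phred-scaled likelihoods to raw likelihoods that sum to 1."""
    raw = [10 ** (-p / 10) for p in pl]
    total = sum(raw)
    return [x / total for x in raw]


def most_likely_genotype(pl):
    """Return the genotype with the lowest PL (the highest likelihood)."""
    return GENOTYPES[pl.index(min(pl))]


if __name__ == "__main__":
    pl = [0, 30, 255]  # made-up PL values for demonstration
    print(most_likely_genotype(pl), pl_to_relative_likelihoods(pl))
```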
Predicting the effects of a set of variants
Especially for genome-wide analyses, the most difficult step is often prioritizing and making biological sense of a potentially very long list of variants.
The best way to do this often depends on the goal of the study, but it may help to use tools such as:
- snpEff
- Given a genome-wide gene annotation file (as a GTF file), each variant can be linked to an exon, intron, and/or intergenic region.
- For variants within an open reading frame, snpEff will determine the relevant amino acid and if the variant will produce a synonymous or non-synonymous change.
- Example command: java -jar /usr/local/share/snpEff/snpEff.jar -c /usr/local/share/snpEff/snpEff.config GENOME < Variants.vcf > Variants.snpEff.vcf
- where GENOME is chosen from those listed with the command java -jar /usr/local/share/snpEff/snpEff.jar databases
- Ensembl's Variant Effect Predictor
- Go to http://www.ensembl.org/info/docs/tools/vep/index.html and click on "Launch the online VEP tool!".
- Broad's MutSig ("Mutation Significance")
- See http://www.broadinstitute.org/cancer/cga/mutsig for details
Other Analysis
- Assessing Hardy-Weinberg Equilibrium (HWE) using an exact test: deviation from HWE may indicate selection, population mixing, or non-random mating. See PMID 7498780 for more details.
- Use vcftools option "--hardy" to report a p-value for each site to assess deviation from HWE.
    # VCF file must have genotypes of individuals explicitly reported
    vcftools --vcf myVariants.vcf --hardy --out myVariantsHWE
- Alternatively, the R package genetics can be used with counts for each genotype:
    library(genetics)
    g1 = genotype(c(rep("G/G",1),rep("G/A",47),rep("A/A",6450)))
    HWE.exact(g1)
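To show what an exact test of this kind computes, here is a rough Python sketch for a biallelic site. It conditions on the observed allele counts and sums the probabilities of all heterozygote counts that are no more likely than the observed one. The function name is hypothetical, and the result is not guaranteed to match vcftools or the R genetics package digit for digit.

```python
from math import exp, lgamma, log


def hwe_exact_pvalue(n_hom1, n_het, n_hom2):
    """Two-sided exact-test p-value for HWE from three genotype counts."""
    n = n_hom1 + n_het + n_hom2   # number of individuals
    n_a = 2 * n_hom1 + n_het      # copies of the first allele
    n_b = 2 * n_hom2 + n_het      # copies of the second allele

    def log_prob(het):
        # Log-probability of seeing `het` heterozygotes given n, n_a and n_b.
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2) + lgamma(n + 1)
                - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    observed = log_prob(n_het)
    total = 0.0
    # Possible heterozygote counts share the parity of the allele counts.
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        lp = log_prob(het)
        if lp <= observed + 1e-9:  # small tolerance for floating-point ties
            total += exp(lp)
    return min(1.0, total)


if __name__ == "__main__":
    # Genotype counts from the R example: 1 G/G, 47 G/A, 6450 A/A.
    print(hwe_exact_pvalue(1, 47, 6450))
```

Working with log-factorials through `lgamma` keeps the calculation stable for sample sizes in the thousands, where the factorials themselves would overflow a float.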