
Interpreting VCF files

The VCF (Variant Call Format) specification pages describe most of what you need to know.

Tags in the FILTER, INFO, and FORMAT fields are described in the VCF header.
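
As a minimal sketch of reading those header descriptions programmatically, the snippet below pulls the ID and Description out of FILTER, INFO, and FORMAT meta-information lines. The function name header_tag_descriptions and the example header lines are made up for illustration; real headers may contain additional keys.

```python
import re

# Matches meta-information lines such as:
# ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
HEADER_RE = re.compile(
    r'^##(FILTER|INFO|FORMAT)=<ID=([^,>]+).*?Description="([^"]*)"'
)


def header_tag_descriptions(lines):
    """Return {(field, tag): description} for FILTER/INFO/FORMAT header lines."""
    tags = {}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information ends at the #CHROM column header line
        match = HEADER_RE.match(line)
        if match:
            field, tag, description = match.groups()
            tags[(field, tag)] = description
    return tags


# Hypothetical header lines, for illustration only
example_header = [
    '##fileformat=VCFv4.1',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">',
    '##FILTER=<ID=MinDP,Description="Minimum read depth [2]">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]

for (field, tag), description in header_tag_descriptions(example_header).items():
    print(field, tag, description, sep="\t")
```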

The probability (ranging from 0 to 1) corresponding to a Phred score P is defined as 10^(-P/10).
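
The conversion can be sketched in a few lines of Python. The helper names phred_to_prob and prob_to_phred are illustrative, not part of any VCF tool. A Phred score of 20, for example, corresponds to a probability of 0.01.

```python
import math


def phred_to_prob(phred):
    """Convert a Phred score P to a probability: 10^(-P/10)."""
    return 10 ** (-phred / 10)


def prob_to_phred(prob):
    """Convert a probability (0 < prob <= 1) to a Phred score: -10*log10(prob)."""
    return -10 * math.log10(prob)


# Print a few Phred scores next to their probabilities
for score in (10, 20, 30, 40):
    print(score, phred_to_prob(score))
```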

As a tabular reference, common tags and scores are as follows:

QUAL field: QUAL = -10*log10(Probability(call in ALT is wrong))

FILTER field (typically generated by vcf-annotate):

Tag | Description | Default threshold
BaseQualBias | Min P-value for baseQ bias | 0
EndDistBias | Min P-value for end distance bias | 0.0001
GapWin | Window size for filtering adjacent gaps | 3
MapQualBias | Min P-value for mapQ bias | 0
MaxDP | Maximum read depth | 10000000
MinAB | Minimum number of alternate bases | 2
MinDP | Minimum read depth | 2
MinMQ | Minimum RMS mapping quality for SNPs | 10
Qual | Minimum value of the QUAL field | 10
RefN | Reference base is N | []
SnpGap | SNP within INT bp around a gap to be filtered | 10
StrandBias | Min P-value for strand bias | 0.0001
VDB | Minimum Variant Distance Bias | 0

INFO field (typically generated by bcftools and expanded with vcf-annotate):

Tag | Description | More details
AC | Allele count in genotypes
AC1 | Max-likelihood estimate of the first ALT allele count (no HWE assumption)
AF1 | Max-likelihood estimate of the first ALT allele frequency (assuming HWE)
AN | Total number of alleles in called genotypes
CGT | The most probable constrained genotype configuration in the trio
CLR | Log ratio of genotype likelihoods with and without the constraint
DP | Raw read depth | For multiple-sample VCFs, the sum for all samples
DP4 | Number of high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases
FQ | Phred probability of all samples being the same
G3 | ML estimate of genotype frequencies
HWE | Hardy-Weinberg equilibrium test (PMID:15789306)
ICF | Inbreeding coefficient F
INDEL | Indicates that the variant is an INDEL.
IS | Maximum number of reads supporting an indel and fraction of indel reads
MDV | Maximum number of high-quality nonRef reads in samples
MQ | Root-mean-square mapping quality of covering reads
PC2 | Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.
PCHI2 | Posterior weighted chi2 P-value for testing the association between group1 and group2 samples.
PR | Number of permutations yielding a smaller PCHI2.
PV4 | P-values for strand bias, baseQ bias, mapQ bias and tail distance bias
QBD | Quality by Depth: QUAL/#reads
QCHI2 | Phred scaled PCHI2.
RPB | Read Position Bias
SF | Source File (index to sourceFiles, f when filtered)
TYPE | Variant type
UGT | The most probable unconstrained genotype configuration in the trio
VDB | Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data.
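
To illustrate how these tags might be used, the sketch below splits an INFO string into a dictionary and derives the fraction of high-quality bases supporting the alternate allele from DP4. The function names and the example INFO string are hypothetical; values are kept as strings because their types are declared in the VCF header rather than in the record itself.

```python
def parse_info(info):
    """Split an INFO string into a dict; flags such as INDEL map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # skip empty entries and the missing-value placeholder
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def alt_fraction_from_dp4(dp4):
    """Fraction of high-quality bases supporting ALT, from a DP4 string.

    DP4 order: ref-forward, ref-reverse, alt-forward, alt-reverse.
    Returns None when no high-quality bases are reported.
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    return (alt_fwd + alt_rev) / total if total else None


# Hypothetical INFO value, for illustration only
info = parse_info("DP=35;AF1=0.5;AC1=1;DP4=10,8,9,7;MQ=48;INDEL")
print(info["DP"], info["INDEL"], alt_fraction_from_dp4(info["DP4"]))
```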

FORMAT field (listing tags of metrics for each sample):

Tag | Description | More details
GT | Genotype
GQ | Genotype Quality | -10*log10(Prob(genotype call is wrong)) [so bigger is more confident]
GL | Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)
DP | Number of high-quality bases
DV | Number of high-quality non-reference bases
SP | Phred-scaled strand bias P-value
PL | List of Phred-scaled genotype likelihoods | Scores for 0/0 (homozygous ref), 0/1 (heterozygous), and 1/1 (homozygous alt) genotypes. For a Phred-scaled likelihood of P, the raw likelihood of that genotype is L = 10^(-P/10) (so the higher the number, the less likely it is that your sample is that genotype). The sum of likelihoods is not necessarily 1.
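
Because the raw likelihoods derived from PL need not sum to 1, it can help to rescale them when comparing genotypes. The sketch below does this for a biallelic site, assuming the PL order 0/0, 0/1, 1/1 described above and, as a simplifying assumption that is not part of the VCF format, treating all three genotypes as equally likely a priori. The function name and the example PL value are illustrative.

```python
def pl_to_probabilities(pl):
    """Convert a PL string to genotype likelihoods rescaled to sum to 1.

    Assumes a flat prior over genotypes (an illustrative simplification).
    """
    phred_values = [int(x) for x in pl.split(",")]
    likelihoods = [10 ** (-p / 10) for p in phred_values]  # L = 10^(-P/10)
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# Hypothetical PL value for one sample: the heterozygous genotype scores 0,
# which makes it the most likely of the three
genotypes = ("0/0", "0/1", "1/1")
probabilities = pl_to_probabilities("30,0,200")
for genotype, probability in zip(genotypes, probabilities):
    print(genotype, probability)
```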

Predicting the effects of a set of variants

Especially for genome-wide analyses, the most difficult step is often prioritizing and making biological sense of a potentially very long list of variants.

The best approach often depends on the goal of the study, but tools such as the following may help:

  • snpEff
    • Given a genome-wide gene annotation file (as a GTF file), each variant can be linked to an exon, intron, and/or intergenic region.
    • For variants within an open reading frame, snpEff will determine the affected amino acid and whether the variant produces a synonymous or non-synonymous change.
    • Example command: java -jar /usr/local/share/snpEff/snpEff.jar -c /usr/local/share/snpEff/snpEff.config GENOME < Variants.vcf > Variants.snpEff.vcf
      • where GENOME is chosen from those listed with the command java -jar /usr/local/share/snpEff/snpEff.jar databases
  • Ensembl's Variant Effect Predictor

Other Analysis

  • Assessing Hardy-Weinberg Equilibrium (HWE) using an exact test: deviation from HWE may indicate selection, population mixing, or non-random mating (see PMID 7498780 for more details).
    • Use vcftools option "--hardy" to report a p-value for each site to assess deviation from HWE.
      #VCF file must have genotypes of individuals explicitly reported
      vcftools --vcf myVariants.vcf --hardy --out myVariantsHWE
      
    • Alternatively, the R package genetics can be used with counts for each genotype:
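      library(genetics)
      # Build a genotype vector from observed counts: 1 G/G, 47 G/A, 6450 A/A
      g1 <- genotype(c(rep("G/G", 1), rep("G/A", 47), rep("A/A", 6450)))
      # Exact test for deviation from Hardy-Weinberg equilibrium
      HWE.exact(g1)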
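
For intuition about what such a test compares, the sketch below computes the genotype counts expected under HWE from the observed counts used in the R example. This is not the exact test performed by vcftools or HWE.exact; it only shows the expected counts that the observed data are measured against. The function name is illustrative.

```python
def hwe_expected_counts(n_hom1, n_het, n_hom2):
    """Expected genotype counts under Hardy-Weinberg equilibrium.

    Returns (hom1, het, hom2) expectations given the observed counts.
    """
    n = n_hom1 + n_het + n_hom2
    p = (2 * n_hom1 + n_het) / (2 * n)  # frequency of the first allele
    q = 1 - p                           # frequency of the second allele
    return (n * p * p, 2 * n * p * q, n * q * q)


# Observed counts from the R example: 1 G/G, 47 G/A, 6450 A/A
observed = (1, 47, 6450)
expected = hwe_expected_counts(*observed)
for label, obs, exp in zip(("G/G", "G/A", "A/A"), observed, expected):
    print(label, obs, round(exp, 3))
```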