Interpreting VCF files

The VCF (Variant Call Format) specification pages describe most of what you need to know.

Tags in the FILTER, INFO, and FORMAT fields are described in the VCF header.
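
As a minimal sketch of how those header descriptions could be collected, the Python below scans the `##` meta-information lines and records the Description of each FILTER, INFO, and FORMAT tag. The function name `header_tag_descriptions` and the example header lines are hypothetical, chosen only for illustration; a real VCF header may contain additional keys.

```python
import re

# Matches structured meta-information lines such as
# ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
META_LINE = re.compile(
    r'^##(FILTER|INFO|FORMAT)=<ID=([^,>]+).*?Description="([^"]*)"'
)

def header_tag_descriptions(lines):
    """Return {(field, tag): description} from the ## header lines of a VCF."""
    descriptions = {}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM column header
        match = META_LINE.match(line)
        if match:
            field, tag, description = match.groups()
            descriptions[(field, tag)] = description
    return descriptions

# Hypothetical header lines, for illustration only
example_header = [
    '##fileformat=VCFv4.1',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">',
    '##FILTER=<ID=MinDP,Description="Minimum read depth [2]">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]

for (field, tag), description in header_tag_descriptions(example_header).items():
    print(field, tag, description, sep="\t")
```

To use it on a file, pass an open file handle instead of the example list, since iterating over the handle yields lines in the same way.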

Probability (ranging from 0 to 1) for a Phred score P is defined as 10^(-P/10).
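
The conversion in both directions can be sketched in a few lines of Python. The function names `phred_to_prob` and `prob_to_phred` are hypothetical helpers, not part of any VCF tool.

```python
import math

def phred_to_prob(phred):
    """Probability for a Phred score P: 10^(-P/10)."""
    return 10 ** (-phred / 10)

def prob_to_phred(prob):
    """Phred score for a probability: -10*log10(prob)."""
    return -10 * math.log10(prob)

# Each step of 10 in Phred score is a tenfold change in probability
for score in (10, 20, 30):
    print(score, phred_to_prob(score))
```

Phred scores of 10, 20, and 30 therefore correspond to probabilities of 0.1, 0.01, and 0.001.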

As a tabular reference, common tags and scores are as follows:

QUAL field: QUAL = -10*log₁₀(Probability(call in ALT is wrong))

FILTER field (typically generated by vcf-annotate):

Tag	Description	Default threshold
BaseQualBias	Min P-value for baseQ bias	0
EndDistBias	Min P-value for end distance bias	0.0001
GapWin	Window size for filtering adjacent gaps	3
MapQualBias	Min P-value for mapQ bias	0
MaxDP	Maximum read depth	10000000
MinAB	Minimum number of alternate bases	2
MinDP	Minimum read depth	2
MinMQ	Minimum RMS mapping quality for SNPs	10
Qual	Minimum value of the QUAL field	10
RefN	Reference base is N	[]
SnpGap	SNP within INT bp around a gap to be filtered	10
StrandBias	Min P-value for strand bias	0.0001
VDB	Minimum Variant Distance Bias	0

INFO field (typically generated by bcftools and expanded with vcf-annotate):

Tag	Description	More details
AC	Allele count in genotypes
AC1	Max-likelihood estimate of the first ALT allele count (no HWE assumption)
AF1	Max-likelihood estimate of the first ALT allele frequency (assuming HWE)
AN	Total number of alleles in called genotypes
CGT	The most probable constrained genotype configuration in the trio
CLR	Log ratio of genotype likelihoods with and without the constraint
DP	Raw read depth	For multiple-sample VCFs, the sum for all samples
DP4	Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
FQ	Phred probability of all samples being the same
G3	ML estimate of genotype frequencies
HWE	Hardy-Weinberg equilibrium test (PMID:15789306)
ICF	Inbreeding coefficient F
INDEL	Indicates that the variant is an INDEL.
IS	Maximum number of reads supporting an indel and fraction of indel reads
MDV	Maximum number of high-quality nonRef reads in samples
MQ	Root-mean-square mapping quality of covering reads
PC2	Phred probability of the nonRef allele frequency in group1 samples being larger (first value) or smaller (second value) than in group2.
PCHI2	Posterior weighted chi2 P-value for testing the association between group1 and group2 samples.
PR	Number of permutations yielding a smaller PCHI2.
PV4	P-values for strand bias, baseQ bias, mapQ bias and tail distance bias
QBD	Quality by Depth: QUAL/#reads
QCHI2	Phred scaled PCHI2.
RPB	Read Position Bias
SF	Source File (index to sourceFiles, f when filtered)
TYPE	Variant type
UGT	The most probable unconstrained genotype configuration in the trio
VDB	Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data.
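
As an illustration of working with these tags, the sketch below splits an INFO string into a dictionary and uses the four DP4 counts to estimate the fraction of high-quality bases supporting the ALT allele. The function names and the example INFO string are hypothetical; the code assumes the usual semicolon-separated `key=value` layout, with flag tags such as INDEL appearing without a value.

```python
def parse_info(info_field):
    """Split a VCF INFO string into a dict; flag tags (no '=') map to True."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue  # skip empty entries and the missing-value placeholder
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info

def alt_fraction_from_dp4(dp4_value):
    """Fraction of high-quality bases supporting ALT, from a DP4 string."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4_value.split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    return (alt_fwd + alt_rev) / total if total else float("nan")

# Hypothetical INFO value, for illustration only
info = parse_info("DP=35;VDB=0.0365;AF1=0.5;AC1=1;DP4=8,7,9,8;MQ=47;INDEL")
print(info["DP"], info["INDEL"], alt_fraction_from_dp4(info["DP4"]))
```

Values are kept as strings, so numeric tags need an explicit `int()` or `float()` conversion before use.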

FORMAT field (listing tags of metrics for each sample):

Tag	Description	More details
GT	Genotype
GQ	Genotype Quality	-10*log₁₀(Probability(genotype call is wrong)) [so bigger is more confident]
GL	Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)
DP	Number of high-quality bases
DV	Number of high-quality non-reference bases
SP	Phred-scaled strand bias P-value
PL	List of Phred-scaled genotype likelihoods	Scores for 0/0 (homozygous ref), 0/1 (heterozygous), and 1/1 (homozygous alt) genotypes. For a Phred-scaled likelihood of P, the raw likelihood of that genotype is L = 10^(-P/10) (so the higher the number, the less likely it is that your sample is that genotype). The sum of likelihoods is not necessarily 1.
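
Because the raw likelihoods from PL need not sum to 1, it can be convenient to rescale them. The sketch below converts a PL triplet to raw likelihoods and normalizes them; treating the normalized values as genotype probabilities assumes a flat prior over the three genotypes, which is an assumption of this example rather than something the VCF states. The function name and the PL value "30,0,50" are hypothetical.

```python
def pl_to_probabilities(pl_values):
    """Convert Phred-scaled genotype likelihoods to values that sum to 1."""
    likelihoods = [10 ** (-pl / 10) for pl in pl_values]  # L = 10^(-P/10)
    total = sum(likelihoods)
    return [likelihood / total for likelihood in likelihoods]

# Hypothetical PL value "30,0,50" for genotypes 0/0, 0/1 and 1/1
pl = [int(x) for x in "30,0,50".split(",")]
probabilities = pl_to_probabilities(pl)
for genotype, prob in zip(("0/0", "0/1", "1/1"), probabilities):
    print(genotype, prob)
```

Here the genotype with PL = 0 (0/1) receives almost all of the normalized weight, matching the rule that a lower PL means a more likely genotype.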

Predicting the effects of a set of variants

Especially for genome-wide analyses, the most difficult step is often prioritizing and making biological sense of a potentially very long list of variants.

The best approach often depends on the goal of the study, but it may help to use tools such as:

snpEff
- Given a genome-wide gene annotation file (as a GTF file), each variant can be linked to an exon, intron, and/or intergenic region.
- For variants within an open reading frame, snpEff will determine the relevant amino acid and whether the variant produces a synonymous or non-synonymous change.
- Example command: java -jar /usr/local/share/snpEff/snpEff.jar -c /usr/local/share/snpEff/snpEff.config GENOME < Variants.vcf > Variants.snpEff.vcf
  - where GENOME is chosen from those listed with the command java -jar /usr/local/share/snpEff/snpEff.jar databases

Ensembl's Variant Effect Predictor
- Go to http://www.ensembl.org/info/docs/tools/vep/index.html and click on "Launch the online VEP tool!".

Broad's MutSig ("Mutation Significance")
- See http://www.broadinstitute.org/cancer/cga/mutsig for details

Other Analysis

Assessing Hardy-Weinberg Equilibrium (HWE) using an exact test: deviation from HWE may indicate selection, population mixing or non-random mating; see PMID 7498780 for more details.
- Use vcftools option "--hardy" to report a p-value for each site to assess deviation from HWE.
```
# The VCF file must explicitly report the genotypes of individuals
vcftools --vcf myVariants.vcf --hardy --out myVariantsHWE
```
- Alternatively, the R package genetics can be used with counts for each genotype:
```
library(genetics)
# Observed genotype counts: 1 G/G, 47 G/A and 6450 A/A
g1 <- genotype(c(rep("G/G",1),rep("G/A",47),rep("A/A",6450)))
# Exact test for deviation from HWE
HWE.exact(g1)
```
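
To get a feel for what such a test compares, the sketch below computes the genotype counts expected under HWE from the observed counts used in the R example. This is only the expectation step, not the exact test itself, and the function name `hwe_expected_counts` is hypothetical.

```python
def hwe_expected_counts(n_hom1, n_het, n_hom2):
    """Expected genotype counts under HWE, given observed genotype counts."""
    n = n_hom1 + n_het + n_hom2
    p = (2 * n_hom1 + n_het) / (2 * n)  # frequency of the first allele
    q = 1 - p                           # frequency of the second allele
    return n * p * p, 2 * n * p * q, n * q * q

# Same counts as the R example: 1 G/G, 47 G/A and 6450 A/A
observed = (1, 47, 6450)
expected = hwe_expected_counts(*observed)
for label, obs, exp in zip(("G/G", "G/A", "A/A"), observed, expected):
    print(label, obs, round(exp, 2))
```

A large gap between observed and expected counts is what the exact test quantifies with a p-value.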
