SOPs/variant_calling_GATK – BaRC Wiki

Which mapper and variant caller works best?

No simple answer (of course), but see

Hwang et al., 2015. Systematic comparison of variant calling pipelines using gold standard personal exome variants.
Pirooznia et al., 2014. Validation and assessment of variant calling pipelines for next-generation sequencing.
Liu et al., 2013. Variant callers for next-generation sequencing data: a comparison study.
Note: GATK is optimized for large human datasets, whereas GATK and samtools may perform similarly with other species and smaller-scale experiments.

Using GATK to call variants from short-read sequencing

This information comes from the Best Practices for Variant Calling with the GATK (sample slides) from the Broad Institute. This page summarizes and formats their detailed documentation. GATK3 (v3 or higher) is recommended.

Note that if you're calling variants from RNA-seq reads, follow the somewhat different commands optimized for this, as described in GATK's Calling variants in RNAseq. RNAseq includes reads mapped across splice junctions and is associated with high variability of coverage, so typical variant calling pipelines (for DNA) can lead to lots of false positives and negatives.

This example pipeline starts with a single-end short-read fastq file (Reads_1.fq).

Note that GATK (versions 3.7 and 3.8) requires Java 1.8 (so you may need to adjust your path to point to that version, if an older version is the default).
For example, this can be added to ~/.bashrc:
export PATH=/usr/local/jre1.8/bin:$PATH
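
To confirm which Java the shell will pick up before launching GATK, a check along these lines may help. This is only a sketch: the helper name java_major is hypothetical, and it does nothing more than parse the version string that "java -version" prints.

```shell
# Extract the major Java version from a version string.
# "1.8.0_191" -> 8 (legacy 1.x numbering), "11.0.2" -> 11 (newer numbering).
java_major() {
    case "$1" in
        1.*) rest="${1#1.}"; echo "${rest%%.*}" ;;
        *)   echo "${1%%[.-]*}" ;;
    esac
}

# Only query Java if it is on the PATH; "java -version" writes to stderr.
if command -v java >/dev/null 2>&1; then
    version=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
    if [ "$(java_major "$version")" = "8" ]; then
        echo "Java $version is suitable for GATK 3.7/3.8"
    else
        echo "Java $version found; GATK 3.7/3.8 expects Java 1.8" >&2
    fi
fi
```

If the second message appears, adjusting PATH as shown above and opening a new shell should change which version is found.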

1 - Index the reference genome. [Need to do just once, with samtools.]

samtools faidx /path/to/genome/genome.fa

2 - Create a genome dictionary. [Need to do just once, with Picard's CreateSequenceDictionary.]

java -jar /usr/local/share/picard-tools/picard.jar CreateSequenceDictionary R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict
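
Steps 1 and 2 leave two companion files next to the FASTA file: genome.fa.fai and genome.dict. Since both are created only once, it can be handy to check whether they already exist. The helper below is a sketch with a hypothetical name (missing_ref_files); it assumes the naming used in the commands above.

```shell
# Print the companion files expected next to a reference FASTA
# that do not exist yet: <genome>.fa.fai and <genome>.dict.
missing_ref_files() {
    fasta="$1"
    fai="${fasta}.fai"          # written by "samtools faidx"
    dict="${fasta%.*}.dict"     # written by CreateSequenceDictionary
    [ -f "$fai" ]  || echo "$fai"
    [ -f "$dict" ] || echo "$dict"
}

# Example call (prints nothing when both files are present):
# missing_ref_files /path/to/genome/genome.fa
```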

3 - Validate the VCF file of known variants (with GATK's ValidateVariants)

java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T ValidateVariants -R /path/to/genome/genome.fa --variant:VCF SNPs_from_NCBI.sorted.vcf

Respond to errors (by correcting or removing problematic variants), run command again, etc., until validation is successful.
Otherwise GATK will not run on any subsequent commands that require this file.

4 - Align reads to genome with bwa

sbatch --job-name=bwa_aln_1 --mem=16G --wrap="bwa aln /path/to/genome/bwa/genome Reads_1.fq > Reads_1.sai"
sbatch --job-name=bwa_samse_1 --mem=16G --wrap="bwa samse /path/to/genome/bwa/genome Reads_1.sai Reads_1.fq > Reads_1.bwa.sam"
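
The two sbatch commands above are submitted independently, so "bwa samse" could start before "bwa aln" has finished writing Reads_1.sai. One way to avoid this is to chain the jobs with Slurm's --dependency option, as sketched below. The function name submit_bwa_chain is hypothetical; --parsable and --dependency=afterok are standard sbatch options.

```shell
# Submit "bwa aln", then "bwa samse" held until the first job succeeds.
# Arguments: bwa index prefix, FASTQ file.
submit_bwa_chain() {
    index="$1"
    fastq="$2"
    base="${fastq%.*}"
    # --parsable makes sbatch print only the job ID (plus ";cluster", if any).
    aln_id=$(sbatch --parsable --job-name=bwa_aln --mem=16G \
        --wrap="bwa aln $index $fastq > $base.sai") || return 1
    aln_id="${aln_id%%;*}"
    sbatch --job-name=bwa_samse --mem=16G --dependency=afterok:"$aln_id" \
        --wrap="bwa samse $index $base.sai $fastq > $base.bwa.sam"
}

# Example call:
# submit_bwa_chain /path/to/genome/bwa/genome Reads_1.fq
```

The same pattern could be applied to later steps whose input is the output of a previous sbatch job.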

5 - Convert SAM to BAM, sort, and index with BaRC's streamlined samtools commands

sbatch --job-name=SAM2BAM --wrap="/nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl Reads_1.bwa.sam"
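
The Perl script above is specific to BaRC's file system, and its internals are not shown here. For readers without access to it, the following sketch prints roughly equivalent plain samtools commands. It assumes samtools 1.3 or newer and an output name (Reads_1.bwa.sorted.bam) matching the input used in step 6; the function name is hypothetical.

```shell
# Print the samtools commands that convert a SAM file to a sorted, indexed BAM.
# Output naming (<base>.sorted.bam) matches the MarkDuplicates input in step 6.
sam_to_sorted_bam_cmds() {
    sam="$1"
    base="${sam%.sam}"
    echo "samtools view -b $sam | samtools sort -o $base.sorted.bam -"
    echo "samtools index $base.sorted.bam"
}

# Review the commands first; append "| sh" to actually run them.
sam_to_sorted_bam_cmds Reads_1.bwa.sam
```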

6 - Mark duplicates (multiple identical reads mapped to the same location)
Run Picard Tools' MarkDuplicates on each sample
You may need "VALIDATION_STRINGENCY=LENIENT" if you get this error:
Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read.

sbatch --job-name=MarkDuplicates --wrap="java -jar /usr/local/share/picard-tools/picard.jar MarkDuplicates I=Reads_1.bwa.sorted.bam O=Reads_1.bwa.dedup.bam M=Reads_1.bwa.dedup.txt VALIDATION_STRINGENCY=LENIENT"
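
The metrics file written by M= (here Reads_1.bwa.dedup.txt) is a tab-delimited table that includes a PERCENT_DUPLICATION column. As a quick sanity check, that value could be pulled out as sketched below. The function name percent_duplication is hypothetical, and the sketch assumes the usual Picard layout of a column header line followed by one row per library.

```shell
# Print the PERCENT_DUPLICATION value from the first data row
# of a Picard MarkDuplicates metrics file.
percent_duplication() {
    awk -F'\t' '
        col == 0 {
            # Still looking for the column header line.
            for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i
            next
        }
        NF >= col { print $col; exit }
    ' "$1"
}

# Example call:
# percent_duplication Reads_1.bwa.dedup.txt
```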

7 - Add Read Group header information to each BAM file (or GATK won't let you continue)
Run Picard Tools' AddOrReplaceReadGroups on each sample.
Specify RGSM (Read Group sample), RGLB (Read Group Library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode])

sbatch --job-name=AddRG --wrap="java -jar /usr/local/share/picard-tools/picard.jar AddOrReplaceReadGroups I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT"
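
Because GATK stops when read group information is missing, it may be worth confirming that the header was written as intended. The sketch below reads a SAM header on standard input and checks for an @RG line carrying the ID tag plus the four tags named above (SM, LB, PL, PU). The function name has_complete_read_group is hypothetical.

```shell
# Read SAM header lines on stdin and succeed only if some @RG line
# carries the ID, SM, LB, PL, and PU tags.
has_complete_read_group() {
    awk -F'\t' '
        /^@RG/ {
            n = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^(ID|SM|LB|PL|PU):./) n++
            if (n == 5) ok = 1
        }
        END { exit ok ? 0 : 1 }
    '
}

# Example call:
# samtools view -H Reads_1.bwa.dedup.good.bam | has_complete_read_group && echo "read group OK"
```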

8 - Index BAM file(s) with samtools (optional; for IGV viewing)

sbatch --job-name=samtools_index --wrap="samtools index Reads_1.bwa.dedup.good.bam"

9 - Run Indel Realignment (with RealignerTargetCreator and IndelRealigner)

Example 1: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals
Example 2: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam

Applied to our sample data:

java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -o Reads_1.realigner.intervals --fix_misencoded_quality_scores
java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores
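
The --fix_misencoded_quality_scores option is meant for reads whose base qualities use an offset of 64 (as produced by some older Illumina pipelines) rather than the standard 33, so it may need to be dropped for data that are already correctly encoded. The sketch below is a common heuristic for guessing the offset from a FASTQ file, not a GATK feature; the function name guess_phred_offset is hypothetical. A file containing only high-quality bases can be misjudged as offset 64, so treat the answer as a hint.

```shell
# Guess the quality offset of a FASTQ stream on stdin (4 lines per read).
# Characters below ';' (ASCII 59) cannot occur in offset-64 data.
guess_phred_offset() {
    awk '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i; min = 127 }
        NR % 4 == 0 {
            # Every fourth line holds the quality string.
            for (i = 1; i <= length($0); i++) {
                ch = substr($0, i, 1)
                if ((ch in ord) && ord[ch] < min) min = ord[ch]
            }
        }
        END {
            if (min == 127) print "unknown"
            else if (min < 59) print 33
            else print 64
        }
    '
}

# Example call:
# guess_phred_offset < Reads_1.fq
```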

10 - Run Base Recalibration (BaseRecalibrator and PrintReads)
Known variants/SNPs in VCF format are required for this step. If none are available, use the data itself to "bootstrap" a set of known SNPs (see GATK's BQSR documentation).

Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table
Example 2: java -jar GenomeAnalysisTK.jar -T PrintReads -R human.fasta -I realigned.bam -BQSR recal.table -o recal.bam

See how things have changed

Example 3: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -BQSR recal.table -o after_recal.table

and then make plots of how they've changed (which requires the gsalib R package).

Example 4: java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human.fasta -before recal.table -after after_recal.table -plots recal_plots.pdf

All applied to our sample data:

sbatch --job-name=GATK_BaseRecal --wrap="java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T BaseRecalibrator -I Reads_1.bwa.dedup.realigned.bam -R /path/to/genome/genome.fa -o Reads_1.bwa.recal_data.txt -knownSites SNPs_from_NCBI.sorted.vcf"
sbatch --job-name=GATK_PrintReads --wrap="java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T PrintReads -I Reads_1.bwa.dedup.realigned.bam -R /path/to/genome/genome.fa -BQSR Reads_1.bwa.recal_data.txt -o Reads_1.bwa.dedup.realigned.recal.bam"
sbatch --job-name=GATK_BaseRecal2 --wrap="java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T BaseRecalibrator -I Reads_1.bwa.dedup.realigned.bam -R /path/to/genome/genome.fa -knownSites SNPs_from_NCBI.sorted.vcf -BQSR Reads_1.bwa.recal_data.txt -o Reads_1.bwa.after_recal.txt"
sbatch --job-name=GATK_AnalyzeCov --wrap="java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T AnalyzeCovariates -R /path/to/genome/genome.fa -before Reads_1.bwa.recal_data.txt -after Reads_1.bwa.after_recal.txt -plots Reads_1.bwa.recal_plots.pdf"

11 - Finally -- Call variants
Run HaplotypeCaller ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper."; HaplotypeCaller is recommended as of GATK Version 3.0)

Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3
java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3

[If needed] Run UnifiedGenotyper, which may be a better choice for non-diploid samples and large numbers of samples.

Example: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10
java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10
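
When both callers have been run, a first rough comparison is the number of variant records in each output file. The sketch below counts the non-header lines of a VCF; the function name count_variants is hypothetical, and the file names are the outputs of the two commands above.

```shell
# Count variant records (lines that do not start with "#") in a VCF file.
count_variants() {
    grep -vc '^#' "$1"
}

# Report counts for whichever caller outputs are present.
for vcf in Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf \
           Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf; do
    if [ -f "$vcf" ]; then
        echo "$vcf: $(count_variants "$vcf")"
    fi
done
```

Record counts say nothing about concordance between the two call sets; for that, see the tools listed in step 15.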

12 - Run Variant Quality Score Recalibration ("VQSR", with VariantRecalibrator and ApplyRecalibration)

13 - Run Genotype Phasing and Refinement

14 - Run Functional Annotation (snpEff and VariantAnnotator [which "parses output from snpEff into a simpler format that is more useful for analysis"])

Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf
Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf

15 - Analyze variant calls (with CombineVariants, SelectVariants, and VariantEval)
