=== Which mapper and variant caller works best? === No simple answer (of course), but see * [http://www.ncbi.nlm.nih.gov/pubmed/26639839 Hwang at al., 2015. Systematic comparison of variant calling pipelines using gold standard personal exome variants.] * [http://www.ncbi.nlm.nih.gov/pubmed/25078893 Pirooznia et al., 2014. Validation and assessment of variant calling pipelines for next-generation sequencing.] * [http://www.ncbi.nlm.nih.gov/pubmed/24086590 Liu et al., 2013. Variant callers for next-generation sequencing data: a comparison study.] == Using GATK to call variants from short-read sequencing == This information comes from the [[http://www.broadinstitute.org/gatk/guide/events?id=3093#materials|slides for "Best Practices for Variant Calling with the GATK"]] from the Broad Institute. This page summarizes and formats their detailed documentation. '''GATK3 (v3 or higher) is recommended.''' \\ \\ Note that if you're calling variants from RNA-seq reads, follow the somewhat different commands optimized for this, as described in GATK's [[https://www.broadinstitute.org/gatk/guide/article?id=3891|Calling variants in RNAseq]]. \\ \\ This example pipeline starts with a single-end short-read fastq file (Reads_1.fq). \\ \\ Note that '''GATK requires Java 1.7''' (so you may need to adjust your path to point to that version, if an older version is the default).\\ For example, this can be added to ~/.bashrc: \\ export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH \\ \\ Also note that GATK is not designed to use RNA-Seq data, in particular reads that are mapped across splice junctions. If one still wants to use GATK for experiments of this type, one needs to do one of the following: * remove reads that map across splice junctions (e.g., those containing 'N' in the CIGAR string * map with a non-splice mapper (like bwa or bowtie, rather than tophat) * run GATK with the '--filter_reads_with_N_cigar' option 1 - '''Index the reference genome.''' [Need to do just once, with [[http://samtools.sourceforge.net/samtools.shtml|samtools]].] * samtools faidx /path/to/genome/genome.fa \\ 2 - '''Create a genome dictionary.''' [Need to do just once, with Picard's [[http://picard.sourceforge.net/command-line-overview.shtml#CreateSequenceDictionary|CreateSequenceDictionary]].] * java -jar /usr/local/share/picard-tools/CreateSequenceDictionary.jar R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict \\ 3 - '''Validate VCF file or known variants''' (with GATK's [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_ValidateVariants.html|ValidateVariants]]) * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T ValidateVariants -R /path/to/genome/genome.fa --variant:VCF SNPs_from_NCBI.sorted.vcf \\ Respond to errors (by correcting or removing problematic variants), run command again, etc., until validation is successful. \\ Otherwise GATK will not run on any subsequent commands that require this file. \\ 4 - '''Align reads to genome with [[http://bio-bwa.sourceforge.net/bwa.shtml|bwa]]''' * bsub "bwa aln /path/to/genome/bwa/genome Reads_1.fq > Reads_1.sai" * bsub "bwa samse /path/to/genome/bwa/genome Reads_1.sai Reads_1.fq > Reads_1.bwa.sam" \\ 5 - '''Convert SAM to BAM, sort, and index''' with BaRC's streamlined [[http://samtools.sourceforge.net/samtools.shtml|samtools]] commands * bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl Reads_1.bwa.sam \\ 6 - '''Mark duplicates''' (multiple identical reads mapped to the same location) \\ Run Picard Tools' [[http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates|MarkDuplicates]] on each sample \\ May Need "VALIDATION_STRINGENCY=LENIENT" if you get \\ Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read. \\ * bsub java -jar /usr/local/share/picard-tools/MarkDuplicates.jar I=Reads_1.bwa.sorted.bam O=Reads_1.bwa.dedup.bam M=Reads_1.bwa.dedup.txt VALIDATION_STRINGENCY=LENIENT \\ 7 - '''Add Read Group header information to each BAM file''' (or GATK won't let you continue) \\ Run Picard Tools' [[http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups|AddOrReplaceReadGroups]] on each sample. \\ Specify RGSM (Read Group sample), RGLB (Read Group Library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode]) * bsub java -jar /usr/local/share/picard-tools/AddOrReplaceReadGroups.jar I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT \\ 8 - '''Index BAM file(s)''' with [[http://samtools.sourceforge.net/samtools.shtml|samtools]] (optional; for IGV viewing) * bsub samtools index Reads_1.bwa.dedup.good.bam \\ 9 - '''Run Indel Realignment''' (with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_RealignerTargetCreator.html|RealignerTargetCreator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_IndelRealigner|IndelRealigner]]) \\ * Example 1: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals \\ * Example 2: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam \\ * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -o Reads_1.realigner.intervals --fix_misencoded_quality_scores * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores \\ 10 - '''Run Base Recalibration''' ([[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html|BaseRecalibrator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_PrintReads.html|PrintReads]]) \\ Known variants/SNPs in VCF format is required for this step. If none is available, then use the data itself to "bootstrap" known SNPs, see [[http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr | BQSR]]. \\ * Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table * Example 2: java -jar GenomeAnalysisTK.jar -T PrintReads -R human.fasta -I realigned.bam -BQSR recal.table -o recal.bam \\ See how things have changed * Example 3: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -BQSR recal.table -o after_recal.table and then make plots of how they've changed (which requires the R [[http://cran.r-project.org/web/packages/gsalib/index.html|'gsalib']] R package). * Example 4: java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human.fasta -before recal.table -after after_recal.table -plots recal_plots.pdf All applied to our sample data: * bsub java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T BaseRecalibrator -I Reads_1.bwa.dedup.realigned.bam -R /path/to/genome/genome.fa -o Reads_1.bwa.recal_data.txt -knownSites SNPs_from_NCBI.sorted.vcf * bsub java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T PrintReads -I Reads_1.bwa.dedup.realigned.bam -R /path/to/genome/genome.fa -BQSR Reads_1.bwa.recal_data.txt -o Reads_1.bwa.dedup.realigned.recal.bam * bsub java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T BaseRecalibrator -I Reads_1.bwa.dedup.realigned.bam -R /path/to/genome/genome.fa -knownSites SNPs_from_NCBI.sorted.vcf -BQSR Reads_1.bwa.recal_data.txt -o Reads_1.bwa.after_recal.txt * bsub java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T AnalyzeCovariates -R /path/to/genome/genome.fa -before Reads_1.bwa.recal_data.txt -after Reads_1.bwa.after_recal.txt -plots Reads_1.bwa.recal_plots.pdf \\ 11 - '''Compress BAM with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_compression_reducereads_ReduceReads.html|ReduceReads]]''' [Optional: this step is not available in GATK Version 3.0 and higher] \\ * Example 1: java -jar GenomeAnalysisTK.jar -T ReduceReads -R human.fasta -I recal.bam -o reduced.bam * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T ReduceReads -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.realigned.recal.bam -o Reads_1.bwa.dedup.realigned.recal.reduced.bam \\ 12 - '''Finally -- Call variants''' \\ Run [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html|HaplotypeCaller]] ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper."; HaplotypeCaller is recommended as of GATK Version 3.0) * Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3 * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3 Pay attention to the INFO lines printed when the command is complete to check that too many reads aren't being unexpectedly filtered out. Using RNA-seq reads mapped by STAR, for example, can result in almost all reads failing the HCMappingQualityFilter or the MappingQualityUnavailableFilter. A quick fix is to modify the mapping quality of reads with some value (like 255) to some other value (like 60) on the fly by adding the arguments "-rf ReassignOneMappingQuality -RMQF 255 -RMQT 60". See the [[https://www.broadinstitute.org/gatk/blog?id=4285|GATK blog]] for more information. [If needed] Run [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html|UnifiedGenotyper]] should be a better choice for nondiploid samples and high sample numbers * Example: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10 \\ 13 - '''Run Variant Quality Score Recalibration''' ("VQSR", with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html|VariantRecalibrator] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_ApplyRecalibration.html|ApplyRecalibration]) \\ \\ 14 - '''Run Genotype Phasing and Refinement''' \\ \\ 15 - '''Run Functional Annotation''' ([[http://snpeff.sourceforge.net/SnpEff.html|snpEff]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_VariantAnnotator.html|VariantAnnotator]] [which "parses output from snpEff into a simpler format that is more useful for analysis"]) * Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf * Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf 16 - '''Analyze variant calls''' (with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_CombineVariants|CombineVariants]], [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html|SelectVariants]], and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_varianteval_VariantEval.html|VariantEval]]) \\ \\