Changes between Version 21 and Version 22 of SOPs/variant_calling_GATK
- Timestamp:
- 07/18/17 08:52:39 (8 years ago)
Legend:
- Unmodified
- Added
- Removed
- Modified
-
SOPs/variant_calling_GATK
v21 v22 9 9 == Using GATK to call variants from short-read sequencing == 10 10 11 This information comes from the [ [http://www.broadinstitute.org/gatk/guide/events?id=3093#materials|slides for "Best Practices for Variant Calling with the GATK"]]from the Broad Institute. This page summarizes and formats their detailed documentation. '''GATK3 (v3 or higher) is recommended.'''11 This information comes from the [https://software.broadinstitute.org/gatk/best-practices/ "Best Practices for Variant Calling with the GATK"] ([https://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-1 sample slides]) from the Broad Institute. This page summarizes and formats their detailed documentation. '''GATK3 (v3 or higher) is recommended.''' 12 12 \\ \\ 13 Note that if you're calling variants from RNA-seq reads, follow the somewhat different commands optimized for this, as described in GATK's [[https://www.broadinstitute.org/gatk/guide/article?id=3891|Calling variants in RNAseq]]. 13 Note that if you're calling variants from RNA-seq reads, follow the somewhat different commands optimized for this, as described in GATK's [[https://www.broadinstitute.org/gatk/guide/article?id=3891|Calling variants in RNAseq]]. RNAseq includes reads mapped across splice junctions and is associated with high variability of coverage, so typical variant calling pipelines (for DNA) can lead to lots of false positives and negatives. 14 14 \\ \\ 15 15 This example pipeline starts with a single-end short-read fastq file (Reads_1.fq). … … 19 19 export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH 20 20 \\ \\ 21 Also note that GATK is not designed to use RNA-Seq data, in particular reads that are mapped across splice junctions. If one still wants to use GATK for experiments of this type, one needs to do one of the following:22 * remove reads that map across splice junctions (e.g., those containing 'N' in the CIGAR string23 * map with a non-splice mapper (like bwa or bowtie, rather than tophat)24 * run GATK with the '--filter_reads_with_N_cigar' option25 21 26 22 1 - '''Index the reference genome.''' [Need to do just once, with [[http://samtools.sourceforge.net/samtools.shtml|samtools]].] 27 23 * samtools faidx /path/to/genome/genome.fa 28 24 \\ 29 2 - '''Create a genome dictionary.''' [Need to do just once, with Picard's [[http ://picard.sourceforge.net/command-line-overview.shtml#CreateSequenceDictionary|CreateSequenceDictionary]].]25 2 - '''Create a genome dictionary.''' [Need to do just once, with Picard's [[https://broadinstitute.github.io/picard/command-line-overview.html#CreateSequenceDictionary|CreateSequenceDictionary]].] 30 26 * java -jar /usr/local/share/picard-tools/picard.jar CreateSequenceDictionary R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict 31 27 \\ 32 3 - '''Validate VCF file or known variants''' (with GATK's [[http ://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_ValidateVariants.html|ValidateVariants]])28 3 - '''Validate VCF file or known variants''' (with GATK's [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php|ValidateVariants]]) 33 29 * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T ValidateVariants -R /path/to/genome/genome.fa --variant:VCF SNPs_from_NCBI.sorted.vcf \\ 34 30 Respond to errors (by correcting or removing problematic variants), run command again, etc., until validation is successful. \\ … … 44 40 \\ 45 41 6 - '''Mark duplicates''' (multiple identical reads mapped to the same location) \\ 46 Run Picard Tools' [[http ://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates|MarkDuplicates]] on each sample \\42 Run Picard Tools' [[https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates|MarkDuplicates]] on each sample \\ 47 43 May Need "VALIDATION_STRINGENCY=LENIENT" if you get \\ 48 44 Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read. \\ … … 50 46 \\ 51 47 7 - '''Add Read Group header information to each BAM file''' (or GATK won't let you continue) \\ 52 Run Picard Tools' [[http ://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups|AddOrReplaceReadGroups]] on each sample. \\48 Run Picard Tools' [[https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups|AddOrReplaceReadGroups]] on each sample. \\ 53 49 Specify RGSM (Read Group sample), RGLB (Read Group Library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode]) 54 50 * bsub java -jar /usr/local/share/picard-tools/picard.jar AddOrReplaceReadGroups I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT … … 57 53 * bsub samtools index Reads_1.bwa.dedup.good.bam 58 54 \\ 59 9 - '''Run Indel Realignment''' (with [[http ://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_RealignerTargetCreator.html|RealignerTargetCreator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_IndelRealigner|IndelRealigner]]) \\55 9 - '''Run Indel Realignment''' (with [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php|RealignerTargetCreator]] and [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php|IndelRealigner]]) \\ 60 56 * Example 1: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals \\ 61 57 * Example 2: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam \\ … … 63 59 * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores 64 60 \\ 65 10 - '''Run Base Recalibration''' ([[http ://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html|BaseRecalibrator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_PrintReads.html|PrintReads]]) \\61 10 - '''Run Base Recalibration''' ([[https://software.broadinstitute.org/gatk/gatkdocs/current/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php|BaseRecalibrator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_PrintReads.html|PrintReads]]) \\ 66 62 Known variants/SNPs in VCF format is required for this step. If none is available, then use the data itself to "bootstrap" known SNPs, see [[http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr | BQSR]]. \\ 67 63 * Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table … … 78 74 79 75 \\ 80 11 - '''Compress BAM with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_compression_reducereads_ReduceReads.html|ReduceReads]]''' [Optional: this step is not available in GATK Version 3.0 and higher] \\ 81 * Example 1: java -jar GenomeAnalysisTK.jar -T ReduceReads -R human.fasta -I recal.bam -o reduced.bam 82 * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T ReduceReads -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.realigned.recal.bam -o Reads_1.bwa.dedup.realigned.recal.reduced.bam 83 \\ 84 12 - '''Finally -- Call variants''' \\ 76 11 - '''Finally -- Call variants''' \\ 85 77 Run [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html|HaplotypeCaller]] ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper."; HaplotypeCaller is recommended as of GATK Version 3.0) 86 78 * Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3 … … 93 85 * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10 94 86 \\ 95 1 3- '''Run Variant Quality Score Recalibration''' ("VQSR", with87 12 - '''Run Variant Quality Score Recalibration''' ("VQSR", with 96 88 [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php|VariantRecalibrator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_ApplyRecalibration.php|ApplyRecalibration]]) \\ \\ 97 1 4- '''Run Genotype Phasing and Refinement''' \\ \\98 1 5- '''Run Functional Annotation''' ([[http://snpeff.sourceforge.net/SnpEff.html|snpEff]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_VariantAnnotator.html|VariantAnnotator]] [which "parses output from snpEff into a simpler format that is more useful for analysis"])89 13 - '''Run Genotype Phasing and Refinement''' \\ \\ 90 14 - '''Run Functional Annotation''' ([[http://snpeff.sourceforge.net/SnpEff.html|snpEff]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_VariantAnnotator.html|VariantAnnotator]] [which "parses output from snpEff into a simpler format that is more useful for analysis"]) 99 91 * Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf 100 92 * Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf 101 93 102 1 6- '''Analyze variant calls''' (with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_CombineVariants|CombineVariants]], [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html|SelectVariants]], and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_varianteval_VariantEval.html|VariantEval]]) \\ \\94 15 - '''Analyze variant calls''' (with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_CombineVariants|CombineVariants]], [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html|SelectVariants]], and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_varianteval_VariantEval.html|VariantEval]]) \\ \\ 103 95