Changes between Version 21 and Version 22 of SOPs/variant_calling_GATK


Ignore:
Timestamp:
07/18/17 08:52:39 (8 years ago)
Author:
gbell
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/variant_calling_GATK

    v21 v22  
    99== Using GATK to call variants from short-read sequencing ==
    1010
    11 This information comes from the [[http://www.broadinstitute.org/gatk/guide/events?id=3093#materials|slides for "Best Practices for Variant Calling with the GATK"]] from the Broad Institute.  This page summarizes and formats their detailed documentation.  '''GATK3 (v3 or higher) is recommended.'''
     11This information comes from the [https://software.broadinstitute.org/gatk/best-practices/ "Best Practices for Variant Calling with the GATK"] ([https://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-1 sample slides]) from the Broad Institute.  This page summarizes and formats their detailed documentation.  '''GATK3 (v3 or higher) is recommended.'''
    1212\\ \\
    13 Note that if you're calling variants from RNA-seq reads, follow the somewhat different commands optimized for this, as described in GATK's [[https://www.broadinstitute.org/gatk/guide/article?id=3891|Calling variants in RNAseq]].
     13Note that if you're calling variants from RNA-seq reads, follow the somewhat different commands optimized for this, as described in GATK's [[https://www.broadinstitute.org/gatk/guide/article?id=3891|Calling variants in RNAseq]].  RNAseq includes reads mapped across splice junctions and is associated with high variability of coverage, so typical variant calling pipelines (for DNA) can lead to lots of false positives and negatives.
    1414\\ \\
    1515This example pipeline starts with a single-end short-read fastq file (Reads_1.fq).
     
    1919export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH
    2020\\ \\
    21 Also note that GATK is not designed to use RNA-Seq data, in particular reads that are mapped across splice junctions.  If one still wants to use GATK for experiments of this type, one needs to do one of the following:
    22   * remove reads that map across splice junctions (e.g., those containing 'N' in the CIGAR string
    23   * map with a non-splice mapper (like bwa or bowtie, rather than tophat)
    24   * run GATK with the '--filter_reads_with_N_cigar' option
    2521
    26221 - '''Index the reference genome.''' [Need to do just once, with [[http://samtools.sourceforge.net/samtools.shtml|samtools]].]
    2723  * samtools faidx /path/to/genome/genome.fa
    2824\\
    29 2 - '''Create a genome dictionary.''' [Need to do just once, with Picard's [[http://picard.sourceforge.net/command-line-overview.shtml#CreateSequenceDictionary|CreateSequenceDictionary]].]
     252 - '''Create a genome dictionary.''' [Need to do just once, with Picard's [[https://broadinstitute.github.io/picard/command-line-overview.html#CreateSequenceDictionary|CreateSequenceDictionary]].]
    3026  * java -jar /usr/local/share/picard-tools/picard.jar CreateSequenceDictionary R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict
    3127\\
    32 3 - '''Validate VCF file or known variants''' (with GATK's [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_ValidateVariants.html|ValidateVariants]])
     283 - '''Validate VCF file or known variants''' (with GATK's [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php|ValidateVariants]])
    3329  * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T ValidateVariants -R /path/to/genome/genome.fa --variant:VCF SNPs_from_NCBI.sorted.vcf \\
    3430Respond to errors (by correcting or removing problematic variants), run command again, etc., until validation is successful. \\
     
    4440\\
    45416 - '''Mark duplicates''' (multiple identical reads mapped to the same location) \\
    46 Run Picard Tools' [[http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates|MarkDuplicates]] on each sample \\
     42Run Picard Tools' [[https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates|MarkDuplicates]] on each sample \\
    4743May Need "VALIDATION_STRINGENCY=LENIENT" if you get  \\
    4844Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read. \\
     
    5046\\
    51477 - '''Add Read Group header information to each BAM file''' (or GATK won't let you continue) \\
    52 Run Picard Tools' [[http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups|AddOrReplaceReadGroups]] on each sample. \\
     48Run Picard Tools' [[https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups|AddOrReplaceReadGroups]] on each sample. \\
    5349Specify RGSM (Read Group sample), RGLB (Read Group Library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode])
    5450  * bsub java -jar /usr/local/share/picard-tools/picard.jar AddOrReplaceReadGroups I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT
     
    5753  * bsub samtools index Reads_1.bwa.dedup.good.bam
    5854\\
    59 9 - '''Run Indel Realignment''' (with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_RealignerTargetCreator.html|RealignerTargetCreator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_indels_IndelRealigner|IndelRealigner]]) \\
     559 - '''Run Indel Realignment''' (with [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php|RealignerTargetCreator]] and [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php|IndelRealigner]]) \\
    6056  * Example 1: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals \\
    6157  * Example 2: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam \\
     
    6359  * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores
    6460\\
    65 10 - '''Run Base Recalibration''' ([[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html|BaseRecalibrator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_PrintReads.html|PrintReads]]) \\
     6110 - '''Run Base Recalibration''' ([[https://software.broadinstitute.org/gatk/gatkdocs/current/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php|BaseRecalibrator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_PrintReads.html|PrintReads]]) \\
    6662Known variants/SNPs in VCF format is required for this step.  If none is available, then use the data itself to "bootstrap" known SNPs, see [[http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr | BQSR]]. \\
    6763  * Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table
     
    7874
    7975\\
    80 11 - '''Compress BAM with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_compression_reducereads_ReduceReads.html|ReduceReads]]''' [Optional: this step is not available in GATK Version 3.0 and higher] \\
    81   * Example 1: java -jar GenomeAnalysisTK.jar -T ReduceReads -R human.fasta -I recal.bam -o reduced.bam
    82   * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T ReduceReads -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.realigned.recal.bam -o Reads_1.bwa.dedup.realigned.recal.reduced.bam
    83 \\
    84 12 - '''Finally -- Call variants''' \\
     7611 - '''Finally -- Call variants''' \\
    8577Run [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html|HaplotypeCaller]] ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper."; HaplotypeCaller is recommended as of GATK Version 3.0)
    8678  * Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3
     
    9385  * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10
    9486\\
    95 13 - '''Run Variant Quality Score Recalibration''' ("VQSR", with
     8712 - '''Run Variant Quality Score Recalibration''' ("VQSR", with
    9688[[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php|VariantRecalibrator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_ApplyRecalibration.php|ApplyRecalibration]]) \\ \\
    97 14 - '''Run Genotype Phasing and Refinement''' \\ \\
    98 15 - '''Run Functional Annotation''' ([[http://snpeff.sourceforge.net/SnpEff.html|snpEff]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_VariantAnnotator.html|VariantAnnotator]] [which "parses output from snpEff into a simpler format that is more useful for analysis"])
     8913 - '''Run Genotype Phasing and Refinement''' \\ \\
     9014 - '''Run Functional Annotation''' ([[http://snpeff.sourceforge.net/SnpEff.html|snpEff]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_VariantAnnotator.html|VariantAnnotator]] [which "parses output from snpEff into a simpler format that is more useful for analysis"])
    9991  * Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf
    10092  * Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf
    10193
    102 16 - '''Analyze variant calls''' (with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_CombineVariants|CombineVariants]], [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html|SelectVariants]], and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_varianteval_VariantEval.html|VariantEval]]) \\ \\
     9415 - '''Analyze variant calls''' (with [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_CombineVariants|CombineVariants]], [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html|SelectVariants]], and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_varianteval_VariantEval.html|VariantEval]]) \\ \\
    10395