Changes between Initial Version and Version 1 of SOPs/variant_calling_GATK


Ignore:
Timestamp:
01/16/14 15:37:29 (11 years ago)
Author:
gbell
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/variant_calling_GATK

    v1 v1  
     1
     2== Using GATK to call variants from short-read sequencing ==
     3
     4This information comes from the [[http://www.broadinstitute.org/gatk/guide/events?id=3093#materials|slides for "Best Practices for Variant Calling with the GATK"]] from the Broad Institute.  This page summarizes and formats their detailed documentation.
     5\\ \\
     6This example pipeline starts with a single-end short-read fastq file (Reads_1.fq).
     7\\ \\
     8Note that '''GATK requires Java 1.7''' (so you may need to adjust your path to point to that version, if an older version is the default).\\
     9For example, this can be added to ~/.bashrc: \\
     10export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH
     11\\ \\
     12'''Index the reference genome.''' [Need to do just once.]
     13  * samtools faidx /path/to/genome/genome.fa
     14\\
     15'''Create a genome dictionary.''' [Need to do just once.]
     16  * java -jar /usr/local/share/picard-tools/CreateSequenceDictionary.jar R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict
     17\\
     18'''Align reads to genome with bwa.'''
     19  * bsub "bwa aln /path/to/genome/bwa/genome Reads_1.fq > Reads_1.sai"
     20  * bsub "bwa samse /path/to/genome/bwa/genome Reads_1.sai  Reads_1.fq > Reads_1.bwa.sam"
     21\\
     22'''Convert SAM to BAM, sort, and index.'''
     23  * bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl Reads_1.bwa.sam
     24\\
     25'''Mark duplicates''' (multiple identical reads mapped to the same location). \\
     26Run Picard Tools' MarkDuplicates on each sample \\
     27May Need "VALIDATION_STRINGENCY=LENIENT" if you get  \\
     28Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read. \\
     29  * bsub java -jar /usr/local/share/picard-tools/MarkDuplicates.jar I=Reads_1.bwa.sorted.bam O=Reads_1.bwa.dedup.bam M=Reads_1.bwa.dedup.txt VALIDATION_STRINGENCY=LENIENT
     30\\
     31'''Add Read Group header information to each BAM file''' (or GATK won't let you continue) \\
     32Run Picard Tools' [[http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups|AddOrReplaceReadGroups] on each sample. \\
     33Specify RGSM (Read Group sample), RGLB (Read Group Library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode])
     34  * bsub java -jar /usr/local/share/picard-tools/AddOrReplaceReadGroups.jar I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT
     35\\
     36'''Index BAM file(s)''' (optional; for IGV viewing)
     37  * bsub samtools index Reads_1.bwa.dedup.good.bam
     38\\
     39'''Run Indel Realignment''' (with RealignerTargetCreator and IndelRealigner) \\
     40  * Example 1: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals \\
     41  * Example 2: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam \\
     42  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -o Reads_1.realigner.intervals  --fix_misencoded_quality_scores
     43  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores
     44\\
     45'''Run Base Recalibration''' (BaseRecalibrator and PrintReads) \\
     46  * Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table
     47  * Example 2: java -jar GenomeAnalysisTK.jar -T PrintReads -R human.fasta -I realigned.bam -BQSR recal.table -o recal.bam \\
     48See how things have changed
     49  * Example 3: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -BQSR recal.table -o after_recal.table
     50and then make plots of how they've changed (which requires the R [[http://cran.r-project.org/web/packages/gsalib/index.html|'gsalib']] R package).
     51  * Example 4: java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human.fasta -before recal.table -after after_recal.table -plots recal_plots.pdf
     52\\
     53'''Compress BAM with ReduceReads''' [Optional] \\
     54  * Example 1: java -jar GenomeAnalysisTK.jar -T ReduceReads -R human.fasta -I recal.bam -o reduced.bam
     55  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T ReduceReads -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.realigned.recal.bam -o Reads_1.bwa.dedup.realigned.recal.reduced.bam
     56\\
     57'''Finally -- Call variants''' \\
     58Run HaplotypeCaller ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper.")
     59  * Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3
     60  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3
     61
     62[If needed] Run UnifiedGenotyper should be a better choice for nondiploid samples and high sample numbers
     63  * Example: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10
     64  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10
     65\\
     66'''Run Variant Quality Score Recalibration''' ("VQSR", with VariantRecalibrator and ApplyRecalibration) \\ \\
     67'''Run Genotype Phasing and Refinement''' \\ \\
     68'''Run Functional Annotation''' (snpEff and VariantAnnotator [which "parses output from snpEff into a simpler format that is more useful for analysis"])
     69  * Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf
     70  * Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf
     71
     72'''Analyze variant calls''' (with CombineVariants, SelectVariants, and VariantEval) \\ \\
     73