Changes between Initial Version and Version 1 of SOPs/variant_calling_GATK

Timestamp:: 01/16/14 15:37:29 (11 years ago)
Author:: gbell
Comment:: --

+== Using GATK to call variants from short-read sequencing ==
+This information comes from the [[http://www.broadinstitute.org/gatk/guide/events?id=3093#materials|slides for "Best Practices for Variant Calling with the GATK"]] from the Broad Institute.  This page summarizes and formats their detailed documentation.
+\\ \\
+This example pipeline starts with a single-end short-read fastq file (Reads_1.fq).
+\\ \\
+Note that '''GATK requires Java 1.7''' (so you may need to adjust your path to point to that version, if an older version is the default).\\
+For example, this can be added to ~/.bashrc: \\
+export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH
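+\\ \\
+To confirm which Java version is first on the path, a check along the following lines may help. This is only a sketch: the helper name `is_java_17` is hypothetical, and it assumes that the first line printed by `java -version` contains the quoted version string (for example "1.7.0_51").

```shell
# Hypothetical helper (not part of GATK): succeeds if a `java -version`
# banner line reports a 1.7.x runtime.
is_java_17() {
  case "$1" in
    *'"1.7.'*) return 0 ;;
    *)         return 1 ;;
  esac
}

# `java -version` writes its banner to stderr, so redirect it to stdout.
# The `|| true` keeps the script going if java is missing.
banner=$(java -version 2>&1 | head -n 1 || true)
if is_java_17 "$banner"; then
  echo "Java 1.7 is first on PATH: $banner"
else
  echo "Java 1.7 is NOT first on PATH (got: $banner)"
fi
```

+If the second message appears, adjust PATH as shown above and open a new shell before running GATK.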
+\\ \\
+'''Index the reference genome.''' [Need to do just once.]
+  * samtools faidx /path/to/genome/genome.fa
+\\
+'''Create a genome dictionary.''' [Need to do just once.]
+  * java -jar /usr/local/share/picard-tools/CreateSequenceDictionary.jar R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict
+\\
+'''Align reads to genome with bwa.'''
+  * bsub "bwa aln /path/to/genome/bwa/genome Reads_1.fq > Reads_1.sai"
+  * bsub "bwa samse /path/to/genome/bwa/genome Reads_1.sai Reads_1.fq > Reads_1.bwa.sam"
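+\\
+The second command reads the Reads_1.sai file written by the first, so the second job should not start until the first has finished. One way to express that, sketched here under assumptions, is to name each job and add a dependency. The helper `submit_line` and the job names are hypothetical; the sketch assumes LSF's bsub options -J (job name) and -w (dependency expression) are available on your cluster, and it only prints the bsub lines so they can be checked before use.

```shell
# Hypothetical helper: print a bsub line that names the job and, if a second
# argument is given, waits for that earlier job to finish successfully.
# Usage: submit_line JOB_NAME WAIT_FOR_JOB_OR_EMPTY "COMMAND"
submit_line() {
  name=$1
  after=$2
  cmd=$3
  if [ -n "$after" ]; then
    printf 'bsub -J %s -w "done(%s)" "%s"\n' "$name" "$after" "$cmd"
  else
    printf 'bsub -J %s "%s"\n' "$name" "$cmd"
  fi
}

# Dry run: print the two alignment submissions for Reads_1.
submit_line aln_Reads_1 "" "bwa aln /path/to/genome/bwa/genome Reads_1.fq > Reads_1.sai"
submit_line samse_Reads_1 aln_Reads_1 "bwa samse /path/to/genome/bwa/genome Reads_1.sai Reads_1.fq > Reads_1.bwa.sam"
```

+The printed lines can be pasted into the shell once they look right. Submitting the two jobs by hand, one after the other has finished, works just as well.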
+\\
+'''Convert SAM to BAM, sort, and index.'''
+  * bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl Reads_1.bwa.sam
+\\
+'''Mark duplicates''' (multiple identical reads mapped to the same location). \\
+Run Picard Tools' MarkDuplicates on each sample. \\
+You may need to add "VALIDATION_STRINGENCY=LENIENT" if you get an error such as: \\
+Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read. \\
+  * bsub java -jar /usr/local/share/picard-tools/MarkDuplicates.jar I=Reads_1.bwa.sorted.bam O=Reads_1.bwa.dedup.bam M=Reads_1.bwa.dedup.txt VALIDATION_STRINGENCY=LENIENT
+\\
+'''Add Read Group header information to each BAM file''' (or GATK won't let you continue). \\
+Run Picard Tools' [[http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups|AddOrReplaceReadGroups]] on each sample. \\
+Specify RGSM (Read Group sample), RGLB (Read Group library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode]).
+  * bsub java -jar /usr/local/share/picard-tools/AddOrReplaceReadGroups.jar I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT
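+\\
+Because GATK stops on BAM files that lack read groups, it can be worth checking the header before moving on. The sketch below counts @RG lines in SAM header text. The helper name `count_read_groups` is hypothetical and the demonstration header is made up; on real data the header would come from `samtools view -H`.

```shell
# Hypothetical helper: read SAM header text on stdin and print the number of
# @RG (read group) lines. grep exits non-zero when the count is 0, so
# `|| true` keeps the helper's exit status clean.
count_read_groups() {
  grep -c '^@RG' || true
}

# Real usage (assumes samtools is on PATH and the BAM file exists):
#   samtools view -H Reads_1.bwa.dedup.good.bam | count_read_groups

# Self-contained demonstration on a made-up header with one read group.
printf '@HD\tVN:1.4\tSO:coordinate\n@RG\tID:1\tSM:My_sample\tLB:My_project\tPL:illumina\tPU:none\n' | count_read_groups
```

+A count of 0 on the real BAM file means the read group step did not take effect and should be rerun.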
+\\
+'''Index BAM file(s)''' (optional; for IGV viewing)
+  * bsub samtools index Reads_1.bwa.dedup.good.bam
+\\
+'''Run Indel Realignment''' (with RealignerTargetCreator and IndelRealigner) \\
+Generic examples, using a VCF file of known indels:
+  * Example 1: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals
+  * Example 2: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam
+Commands for this pipeline, run without known indels. The --fix_misencoded_quality_scores option is meant only for reads whose base qualities use the older Illumina Phred+64 encoding; omit it for Phred+33 data.
+  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -o Reads_1.realigner.intervals --fix_misencoded_quality_scores
+  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores
+\\
+'''Run Base Recalibration''' (BaseRecalibrator and PrintReads) \\
+  * Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table
+  * Example 2: java -jar GenomeAnalysisTK.jar -T PrintReads -R human.fasta -I realigned.bam -BQSR recal.table -o recal.bam \\
+To see how the base qualities have changed, run BaseRecalibrator a second time, supplying the first recalibration table with -BQSR:
+  * Example 3: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -BQSR recal.table -o after_recal.table
+Then plot the before/after comparison (this requires the R package [[http://cran.r-project.org/web/packages/gsalib/index.html|'gsalib']]):
+  * Example 4: java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human.fasta -before recal.table -after after_recal.table -plots recal_plots.pdf
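+\\
+The examples above use generic file names, while the next step expects Reads_1.bwa.dedup.realigned.recal.bam. A sketch of the two recalibration commands with this pipeline's file names follows. It rests on assumptions that are not in the original page: the known-sites file is taken to be the SNPs_from_NCBI.sorted.vcf used in the variant-calling step (it must match the reference genome given with -R), the name of the recalibration table is made up, and the helper `bqsr_commands` is hypothetical. The commands are only printed, not run.

```shell
# Assumed locations and names; adjust to your own setup.
GATK=/usr/local/gatk/GenomeAnalysisTK.jar
REF=/path/to/genome/genome.fa
KNOWN=SNPs_from_NCBI.sorted.vcf   # assumption: known variant sites for this genome

# Hypothetical helper: print the BaseRecalibrator and PrintReads commands
# for one sample prefix (dry run; nothing is executed).
bqsr_commands() {
  prefix=$1
  echo "java -jar $GATK -T BaseRecalibrator -R $REF -I $prefix.bwa.dedup.realigned.bam -knownSites $KNOWN -o $prefix.recal.table"
  echo "java -jar $GATK -T PrintReads -R $REF -I $prefix.bwa.dedup.realigned.bam -BQSR $prefix.recal.table -o $prefix.bwa.dedup.realigned.recal.bam"
}

bqsr_commands Reads_1
```

+The second command can only run after the first has written its table, so run them in order.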
+\\
+'''Compress BAM with ReduceReads''' [Optional] \\
+  * Example 1: java -jar GenomeAnalysisTK.jar -T ReduceReads -R human.fasta -I recal.bam -o reduced.bam
+  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T ReduceReads -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.realigned.recal.bam -o Reads_1.bwa.dedup.realigned.recal.reduced.bam
+\\
+'''Finally -- Call variants''' \\
+Run HaplotypeCaller ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper.")
+  * Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3
+  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3
+[If needed] Run UnifiedGenotyper, which should be a better choice for non-diploid samples and for large numbers of samples.
+  * Example: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10
+  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10
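+\\
+As a quick sanity check after either caller finishes, the number of variant records in the output VCF can be counted by skipping the header lines, which all start with "#". This is a minimal sketch: the helper name `count_variants` is hypothetical and the demonstration VCF is made up.

```shell
# Hypothetical helper: count variant records (non-header lines) in a VCF file.
# grep exits non-zero when the count is 0, so `|| true` keeps the status clean.
count_variants() {
  grep -vc '^#' "$1" || true
}

# Real usage, with an output file name from the commands above:
#   count_variants Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf

# Self-contained demonstration on a tiny made-up VCF with two records.
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t50\t.\t.\n1\t200\t.\tC\tT\t60\t.\t.\n' > "$demo_vcf"
count_variants "$demo_vcf"
rm -f "$demo_vcf"
```

+Comparing the counts from HaplotypeCaller and UnifiedGenotyper on the same BAM file gives a rough first impression of how much the two call sets differ.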
+\\
+'''Run Variant Quality Score Recalibration''' ("VQSR", with VariantRecalibrator and ApplyRecalibration) \\ \\
+'''Run Genotype Phasing and Refinement''' \\ \\
+'''Run Functional Annotation''' (snpEff and VariantAnnotator [which "parses output from snpEff into a simpler format that is more useful for analysis"])
+  * Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf
+  * Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf
+\\
+'''Analyze variant calls''' (with CombineVariants, SelectVariants, and VariantEval) \\ \\