== Using GATK to call variants from short-read sequencing ==

This information comes from the [[http://www.broadinstitute.org/gatk/guide/events?id=3093#materials|slides for "Best Practices for Variant Calling with the GATK"]] from the Broad Institute.  This page summarizes and formats their detailed documentation.
\\ \\
This example pipeline starts with a single-end short-read fastq file (Reads_1.fq).
\\ \\
Note that '''GATK requires Java 1.7''', so you may need to adjust your PATH to point to that version if an older version is the default.\\
For example, this can be added to ~/.bashrc: \\
export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH
\\ \\
'''Index the reference genome.''' [Needs to be done just once.]
  * samtools faidx /path/to/genome/genome.fa
\\ 
'''Create a genome dictionary.''' [Needs to be done just once.]
  * java -jar /usr/local/share/picard-tools/CreateSequenceDictionary.jar R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict 
\\
'''Align reads to genome with bwa.'''
  * bsub "bwa aln /path/to/genome/bwa/genome Reads_1.fq > Reads_1.sai"
  * bsub "bwa samse /path/to/genome/bwa/genome Reads_1.sai  Reads_1.fq > Reads_1.bwa.sam"
\\
'''Convert SAM to BAM, sort, and index.'''
  * bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl Reads_1.bwa.sam
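The wrapper script above is local to this site. For readers without it, the following is a rough stand-in, not the script itself. It assumes samtools 1.0 or later (older 0.1.x releases take an output prefix for "sort" instead of "-o") and assumes the script's output is named Reads_1.bwa.sorted.bam, as used in the next step. The commands are printed rather than run.

```shell
# Hypothetical stand-in for SAM_to_BAM_sort_index.pl (assumes samtools >= 1.0).
sam=Reads_1.bwa.sam
prefix="${sam%.sam}"                 # strip the .sam suffix
bam="${prefix}.bam"
sorted_bam="${prefix}.sorted.bam"    # name expected by the MarkDuplicates step

# Dry run: each command is printed, not executed. Remove "echo" to run it.
echo samtools view -b -o "$bam" "$sam"
echo samtools sort -o "$sorted_bam" "$bam"
echo samtools index "$sorted_bam"
```

Once the "echo" prefixes are removed, the three commands can be submitted with bsub in the same way as the other steps.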
\\
'''Mark duplicates''' (multiple identical reads mapped to the same location). \\
Run Picard Tools' MarkDuplicates on each sample. \\
You may need "VALIDATION_STRINGENCY=LENIENT" if you get the following error: \\
Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read. \\
  * bsub java -jar /usr/local/share/picard-tools/MarkDuplicates.jar I=Reads_1.bwa.sorted.bam O=Reads_1.bwa.dedup.bam M=Reads_1.bwa.dedup.txt VALIDATION_STRINGENCY=LENIENT
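Because this step is run once per sample, it can help to generate the command from the sample name. The sketch below is one way to do that: the helper name markdup_cmd and the second sample Reads_2 are hypothetical, and the file naming simply follows the command above.

```shell
# markdup_cmd is a hypothetical helper: it prints the MarkDuplicates command
# for one sample prefix (e.g. "Reads_1"), using the file names shown above.
markdup_cmd() {
    echo "java -jar /usr/local/share/picard-tools/MarkDuplicates.jar" \
         "I=$1.bwa.sorted.bam O=$1.bwa.dedup.bam M=$1.bwa.dedup.txt" \
         "VALIDATION_STRINGENCY=LENIENT"
}

# Print one command per sample (Reads_2 is an assumed second sample).
for sample in Reads_1 Reads_2; do
    markdup_cmd "$sample"
done
```

Each printed line can then be submitted with bsub, as in the single-sample command.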
\\
'''Add Read Group header information to each BAM file''' (or GATK won't let you continue) \\
Run Picard Tools' [[http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups|AddOrReplaceReadGroups]] on each sample. \\
Specify RGSM (Read Group sample), RGLB (Read Group library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode]).
  * bsub java -jar /usr/local/share/picard-tools/AddOrReplaceReadGroups.jar I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT
\\
'''Index BAM file(s)''' (optional; for IGV viewing)
  * bsub samtools index Reads_1.bwa.dedup.good.bam
\\
'''Run Indel Realignment''' (with RealignerTargetCreator and IndelRealigner) \\
  * Example 1: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals \\
  * Example 2: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam \\
  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -o Reads_1.realigner.intervals  --fix_misencoded_quality_scores
  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores
\\
'''Run Base Recalibration''' (BaseRecalibrator and PrintReads) \\
  * Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table
  * Example 2: java -jar GenomeAnalysisTK.jar -T PrintReads -R human.fasta -I realigned.bam -BQSR recal.table -o recal.bam \\
To see how the base qualities have changed, run BaseRecalibrator a second time, supplying the first recalibration table:
  * Example 3: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -BQSR recal.table -o after_recal.table
Then make plots comparing the two tables (this requires the [[http://cran.r-project.org/web/packages/gsalib/index.html|'gsalib']] R package).
  * Example 4: java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human.fasta -before recal.table -after after_recal.table -plots recal_plots.pdf
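The examples above use generic file names. Applied to this pipeline, the two main commands might look like the sketch below. The file names are chosen so that the output matches the Reads_1.bwa.dedup.realigned.recal.bam input of the ReduceReads step. Using SNPs_from_NCBI.sorted.vcf (the file passed to --dbsnp in the calling step) as the known-sites file is an assumption; substitute whatever known variant set fits your genome. The commands are printed rather than run.

```shell
# Dry-run sketch: builds and prints the two commands instead of running them.
gatk_jar=/usr/local/gatk/GenomeAnalysisTK.jar
genome=/path/to/genome/genome.fa
in_prefix=Reads_1.bwa.dedup.realigned
known=SNPs_from_NCBI.sorted.vcf      # assumption: reused from the calling step

# Step 1: model the base-quality errors and write a recalibration table.
recal_cmd="java -jar ${gatk_jar} -T BaseRecalibrator -R ${genome} -I ${in_prefix}.bam -knownSites ${known} -o ${in_prefix}.recal.table"

# Step 2: apply the table and write the recalibrated BAM.
print_cmd="java -jar ${gatk_jar} -T PrintReads -R ${genome} -I ${in_prefix}.bam -BQSR ${in_prefix}.recal.table -o ${in_prefix}.recal.bam"

echo "$recal_cmd"
echo "$print_cmd"
```

To run the steps for real, paste the printed commands into the shell or into bsub "...".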
\\
'''Compress BAM with ReduceReads''' [Optional] \\
  * Example 1: java -jar GenomeAnalysisTK.jar -T ReduceReads -R human.fasta -I recal.bam -o reduced.bam
  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T ReduceReads -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.realigned.recal.bam -o Reads_1.bwa.dedup.realigned.recal.reduced.bam
\\
'''Finally -- Call variants''' \\
Run HaplotypeCaller ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper.")
  * Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3
  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3

[If needed] Run UnifiedGenotyper, which should be a better choice for non-diploid samples and for large numbers of samples.
  * Example: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 
  * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10
\\
'''Run Variant Quality Score Recalibration''' ("VQSR", with VariantRecalibrator and ApplyRecalibration) \\ \\
'''Run Genotype Phasing and Refinement''' \\ \\
'''Run Functional Annotation''' (snpEff and VariantAnnotator [which "parses output from snpEff into a simpler format that is more useful for analysis"])
  * Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf
  * Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf

'''Analyze variant calls''' (with CombineVariants, SelectVariants, and VariantEval) \\ \\