| 1 | |
| 2 | == Using GATK to call variants from short-read sequencing == |
| 3 | |
| 4 | This information comes from the [[http://www.broadinstitute.org/gatk/guide/events?id=3093#materials|slides for "Best Practices for Variant Calling with the GATK"]] from the Broad Institute. This page summarizes and formats their detailed documentation. |
| 5 | \\ \\ |
| 6 | This example pipeline starts with a single-end short-read fastq file (Reads_1.fq). |
| 7 | \\ \\ |
| 8 | Note that '''GATK requires Java 1.7''' (so you may need to adjust your path to point to that version, if an older version is the default).\\ |
| 9 | For example, this can be added to ~/.bashrc: \\ |
| 10 | export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH |
| 11 | \\ \\ |
| 12 | '''Index the reference genome.''' [Need to do just once.] |
| 13 | * samtools faidx /path/to/genome/genome.fa |
| 14 | \\ |
| 15 | '''Create a genome dictionary.''' [Need to do just once.] |
| 16 | * java -jar /usr/local/share/picard-tools/CreateSequenceDictionary.jar R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict |
| 17 | \\ |
| 18 | '''Align reads to genome with bwa.''' |
| 19 | * bsub "bwa aln /path/to/genome/bwa/genome Reads_1.fq > Reads_1.sai" |
| 20 | * bsub "bwa samse /path/to/genome/bwa/genome Reads_1.sai Reads_1.fq > Reads_1.bwa.sam" |
| 21 | \\ |
| 22 | '''Convert SAM to BAM, sort, and index.''' |
| 23 | * bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl Reads_1.bwa.sam |
| 24 | \\ |
| 25 | '''Mark duplicates''' (multiple identical reads mapped to the same location). \\ |
| 26 | Run Picard Tools' MarkDuplicates on each sample \\ |
| 27 | May Need "VALIDATION_STRINGENCY=LENIENT" if you get \\ |
| 28 | Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read. \\ |
| 29 | * bsub java -jar /usr/local/share/picard-tools/MarkDuplicates.jar I=Reads_1.bwa.sorted.bam O=Reads_1.bwa.dedup.bam M=Reads_1.bwa.dedup.txt VALIDATION_STRINGENCY=LENIENT |
| 30 | \\ |
| 31 | '''Add Read Group header information to each BAM file''' (or GATK won't let you continue) \\ |
| 32 | Run Picard Tools' [[http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups|AddOrReplaceReadGroups] on each sample. \\ |
| 33 | Specify RGSM (Read Group sample), RGLB (Read Group Library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode]) |
| 34 | * bsub java -jar /usr/local/share/picard-tools/AddOrReplaceReadGroups.jar I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT |
| 35 | \\ |
| 36 | '''Index BAM file(s)''' (optional; for IGV viewing) |
| 37 | * bsub samtools index Reads_1.bwa.dedup.good.bam |
| 38 | \\ |
| 39 | '''Run Indel Realignment''' (with RealignerTargetCreator and IndelRealigner) \\ |
| 40 | * Example 1: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals \\ |
| 41 | * Example 2: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam \\ |
| 42 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -o Reads_1.realigner.intervals --fix_misencoded_quality_scores |
| 43 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores |
| 44 | \\ |
| 45 | '''Run Base Recalibration''' (BaseRecalibrator and PrintReads) \\ |
| 46 | * Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table |
| 47 | * Example 2: java -jar GenomeAnalysisTK.jar -T PrintReads -R human.fasta -I realigned.bam -BQSR recal.table -o recal.bam \\ |
| 48 | See how things have changed |
| 49 | * Example 3: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -BQSR recal.table -o after_recal.table |
| 50 | and then make plots of how they've changed (which requires the R [[http://cran.r-project.org/web/packages/gsalib/index.html|'gsalib']] R package). |
| 51 | * Example 4: java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human.fasta -before recal.table -after after_recal.table -plots recal_plots.pdf |
| 52 | \\ |
| 53 | '''Compress BAM with ReduceReads''' [Optional] \\ |
| 54 | * Example 1: java -jar GenomeAnalysisTK.jar -T ReduceReads -R human.fasta -I recal.bam -o reduced.bam |
| 55 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T ReduceReads -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.realigned.recal.bam -o Reads_1.bwa.dedup.realigned.recal.reduced.bam |
| 56 | \\ |
| 57 | '''Finally -- Call variants''' \\ |
| 58 | Run HaplotypeCaller ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper.") |
| 59 | * Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3 |
| 60 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3 |
| 61 | |
| 62 | [If needed] Run UnifiedGenotyper should be a better choice for nondiploid samples and high sample numbers |
| 63 | * Example: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 |
| 64 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10 |
| 65 | \\ |
| 66 | '''Run Variant Quality Score Recalibration''' ("VQSR", with VariantRecalibrator and ApplyRecalibration) \\ \\ |
| 67 | '''Run Genotype Phasing and Refinement''' \\ \\ |
| 68 | '''Run Functional Annotation''' (snpEff and VariantAnnotator [which "parses output from snpEff into a simpler format that is more useful for analysis"]) |
| 69 | * Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf |
| 70 | * Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf |
| 71 | |
| 72 | '''Analyze variant calls''' (with CombineVariants, SelectVariants, and VariantEval) \\ \\ |
| 73 | |