| | 1 | |
| | 2 | == Using GATK to call variants from short-read sequencing == |
| | 3 | |
| | 4 | This information comes from the [[http://www.broadinstitute.org/gatk/guide/events?id=3093#materials|slides for "Best Practices for Variant Calling with the GATK"]] from the Broad Institute. This page summarizes and formats their detailed documentation. |
| | 5 | \\ \\ |
| | 6 | This example pipeline starts with a single-end short-read fastq file (Reads_1.fq). |
| | 7 | \\ \\ |
| | 8 | Note that '''GATK requires Java 1.7''' (so you may need to adjust your path to point to that version, if an older version is the default).\\ |
| | 9 | For example, this can be added to ~/.bashrc: \\ |
| | 10 | export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH |
| | 11 | \\ \\ |
| | 12 | '''Index the reference genome.''' [Need to do just once.] |
| | 13 | * samtools faidx /path/to/genome/genome.fa |
| | 14 | \\ |
| | 15 | '''Create a genome dictionary.''' [Need to do just once.] |
| | 16 | * java -jar /usr/local/share/picard-tools/CreateSequenceDictionary.jar R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict |
| | 17 | \\ |
| | 18 | '''Align reads to genome with bwa.''' |
| | 19 | * bsub "bwa aln /path/to/genome/bwa/genome Reads_1.fq > Reads_1.sai" |
| | 20 | * bsub "bwa samse /path/to/genome/bwa/genome Reads_1.sai Reads_1.fq > Reads_1.bwa.sam" |
| | 21 | \\ |
| | 22 | '''Convert SAM to BAM, sort, and index.''' |
| | 23 | * bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl Reads_1.bwa.sam |
| | 24 | \\ |
| | 25 | '''Mark duplicates''' (multiple identical reads mapped to the same location). \\ |
| | 26 | Run Picard Tools' MarkDuplicates on each sample \\ |
| | 27 | May Need "VALIDATION_STRINGENCY=LENIENT" if you get \\ |
| | 28 | Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read. \\ |
| | 29 | * bsub java -jar /usr/local/share/picard-tools/MarkDuplicates.jar I=Reads_1.bwa.sorted.bam O=Reads_1.bwa.dedup.bam M=Reads_1.bwa.dedup.txt VALIDATION_STRINGENCY=LENIENT |
| | 30 | \\ |
| | 31 | '''Add Read Group header information to each BAM file''' (or GATK won't let you continue) \\ |
| | 32 | Run Picard Tools' [[http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups|AddOrReplaceReadGroups] on each sample. \\ |
| | 33 | Specify RGSM (Read Group sample), RGLB (Read Group Library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode]) |
| | 34 | * bsub java -jar /usr/local/share/picard-tools/AddOrReplaceReadGroups.jar I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT |
| | 35 | \\ |
| | 36 | '''Index BAM file(s)''' (optional; for IGV viewing) |
| | 37 | * bsub samtools index Reads_1.bwa.dedup.good.bam |
| | 38 | \\ |
| | 39 | '''Run Indel Realignment''' (with RealignerTargetCreator and IndelRealigner) \\ |
| | 40 | * Example 1: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals \\ |
| | 41 | * Example 2: java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam \\ |
| | 42 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -o Reads_1.realigner.intervals --fix_misencoded_quality_scores |
| | 43 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores |
| | 44 | \\ |
| | 45 | '''Run Base Recalibration''' (BaseRecalibrator and PrintReads) \\ |
| | 46 | * Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table |
| | 47 | * Example 2: java -jar GenomeAnalysisTK.jar -T PrintReads -R human.fasta -I realigned.bam -BQSR recal.table -o recal.bam \\ |
| | 48 | See how things have changed |
| | 49 | * Example 3: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -BQSR recal.table -o after_recal.table |
| | 50 | and then make plots of how they've changed (which requires the R [[http://cran.r-project.org/web/packages/gsalib/index.html|'gsalib']] R package). |
| | 51 | * Example 4: java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human.fasta -before recal.table -after after_recal.table -plots recal_plots.pdf |
| | 52 | \\ |
| | 53 | '''Compress BAM with ReduceReads''' [Optional] \\ |
| | 54 | * Example 1: java -jar GenomeAnalysisTK.jar -T ReduceReads -R human.fasta -I recal.bam -o reduced.bam |
| | 55 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T ReduceReads -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.realigned.recal.bam -o Reads_1.bwa.dedup.realigned.recal.reduced.bam |
| | 56 | \\ |
| | 57 | '''Finally -- Call variants''' \\ |
| | 58 | Run HaplotypeCaller ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper.") |
| | 59 | * Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3 |
| | 60 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3 |
| | 61 | |
| | 62 | [If needed] Run UnifiedGenotyper should be a better choice for nondiploid samples and high sample numbers |
| | 63 | * Example: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 |
| | 64 | * java -jar /usr/local/gatk/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10 |
| | 65 | \\ |
| | 66 | '''Run Variant Quality Score Recalibration''' ("VQSR", with VariantRecalibrator and ApplyRecalibration) \\ \\ |
| | 67 | '''Run Genotype Phasing and Refinement''' \\ \\ |
| | 68 | '''Run Functional Annotation''' (snpEff and VariantAnnotator [which "parses output from snpEff into a simpler format that is more useful for analysis"]) |
| | 69 | * Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf |
| | 70 | * Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf |
| | 71 | |
| | 72 | '''Analyze variant calls''' (with CombineVariants, SelectVariants, and VariantEval) \\ \\ |
| | 73 | |