Changes between Initial Version and Version 1 of SOP/CallingVariantsRNAseq


Ignore:
Timestamp:
08/14/17 10:53:09 (7 years ago)
Author:
krichard
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOP/CallingVariantsRNAseq

    v1 v1  
     1=== Which mapper and variant caller works best? ===
     2
     3No simple answer (of course), but see
     4* [http://www.ncbi.nlm.nih.gov/pubmed/26639839 Hwang at al., 2015. Systematic comparison of variant calling pipelines using gold standard personal exome variants.]
     5* [http://www.ncbi.nlm.nih.gov/pubmed/25078893 Pirooznia et al., 2014. Validation and assessment of variant calling pipelines for next-generation sequencing.]
     6* [http://www.ncbi.nlm.nih.gov/pubmed/24086590 Liu et al., 2013. Variant callers for next-generation sequencing data: a comparison study.]
     7* Note: GATK is optimized for large human datasets, whereas GATK and samtools may perform similarly with other species and smaller-scale experiments.
     8
     9== Using GATK to call variants from short-read sequencing ==
     10
     11This information comes from the [https://software.broadinstitute.org/gatk/best-practices/ "Best Practices for Variant Calling with the GATK"] ([https://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-1 sample slides]) from the Broad Institute.  This page summarizes and formats their detailed documentation.  '''GATK3 (v3 or higher) is recommended.'''
     12\\ \\
     13Note that if you're calling variants from RNA-seq reads, follow the somewhat different commands optimized for this, as described in GATK's [[https://www.broadinstitute.org/gatk/guide/article?id=3891|Calling variants in RNAseq]].  RNAseq includes reads mapped across splice junctions and is associated with high variability of coverage, so typical variant calling pipelines (for DNA) can lead to lots of false positives and negatives.
     14\\ \\
     15This example pipeline starts with a single-end short-read fastq file (Reads_1.fq).
     16\\ \\
     17Note that '''GATK requires Java 1.7''' (so you may need to adjust your path to point to that version, if an older version is the default).\\
     18For example, this can be added to ~/.bashrc: \\
     19export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH
     20\\ \\
     21
     221 - '''Index the reference genome.''' [Need to do just once, with [[http://samtools.sourceforge.net/samtools.shtml|samtools]].]
     23  * samtools faidx /path/to/genome/genome.fa
     24\\
     252 - '''Create a genome dictionary.''' [Need to do just once, with Picard's [[https://broadinstitute.github.io/picard/command-line-overview.html#CreateSequenceDictionary|CreateSequenceDictionary]].]
     26  * java -jar /usr/local/share/picard-tools/picard.jar CreateSequenceDictionary R=/path/to/genome/genome.fa O=/path/to/genome/genome.dict
     27\\
     283 - '''Validate VCF file or known variants''' (with GATK's [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php|ValidateVariants]])
     29  * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T ValidateVariants -R /path/to/genome/genome.fa --variant:VCF SNPs_from_NCBI.sorted.vcf \\
     30Respond to errors (by correcting or removing problematic variants), run command again, etc., until validation is successful. \\
     31Otherwise GATK will not run on any subsequent commands that require this file.
     32
     33\\
     344 - '''Align reads to genome with [[http://bio-bwa.sourceforge.net/bwa.shtml|bwa]]'''
     35  * bsub "bwa aln /path/to/genome/bwa/genome Reads_1.fq > Reads_1.sai"
     36  * bsub "bwa samse /path/to/genome/bwa/genome Reads_1.sai  Reads_1.fq > Reads_1.bwa.sam"
     37\\
     385 - '''Convert SAM to BAM, sort, and index''' with BaRC's streamlined [[http://samtools.sourceforge.net/samtools.shtml|samtools]] commands
     39  * bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl Reads_1.bwa.sam
     40\\
     416 - '''Mark duplicates''' (multiple identical reads mapped to the same location) \\
     42Run Picard Tools' [[https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates|MarkDuplicates]] on each sample \\
     43May Need "VALIDATION_STRINGENCY=LENIENT" if you get  \\
     44Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: ... MAPQ should be 0 for unmapped read. \\
     45  * bsub java -jar /usr/local/share/picard-tools/picard.jar MarkDuplicates I=Reads_1.bwa.sorted.bam O=Reads_1.bwa.dedup.bam M=Reads_1.bwa.dedup.txt VALIDATION_STRINGENCY=LENIENT
     46\\
     477 - '''Add Read Group header information to each BAM file''' (or GATK won't let you continue) \\
     48Run Picard Tools' [[https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups|AddOrReplaceReadGroups]] on each sample. \\
     49Specify RGSM (Read Group sample), RGLB (Read Group Library), RGPL (Read Group platform), and RGPU (Read Group platform unit [e.g. run barcode])
     50  * bsub java -jar /usr/local/share/picard-tools/picard.jar AddOrReplaceReadGroups I=Reads_1.bwa.dedup.bam O=Reads_1.bwa.dedup.good.bam RGSM=My_sample RGLB=My_project RGPL=illumina RGPU=none VALIDATION_STRINGENCY=LENIENT
     51\\
     528 - '''Index BAM file(s)''' with [[http://samtools.sourceforge.net/samtools.shtml|samtools]] (optional; for IGV viewing)
     53  * bsub samtools index Reads_1.bwa.dedup.good.bam
     54\\
     559 - '''Run Indel Realignment''' (with [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php|RealignerTargetCreator]] and [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php|IndelRealigner]]) \\
     56  * Example 1: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R human.fasta -I original.bam -known indels.vcf -o realigner.intervals \\
     57  * Example 2: java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R human.fasta -I original.bam -known indels.vcf -targetIntervals realigner.intervals -o realigned.bam \\
     58  * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -o Reads_1.realigner.intervals  --fix_misencoded_quality_scores
     59  * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T IndelRealigner -R /path/to/genome/genome.fa -I Reads_1.bwa.dedup.good.bam -targetIntervals Reads_1.realigner.intervals -o Reads_1.bwa.dedup.realigned.bam --fix_misencoded_quality_scores
     60\\
     6110 - '''Run Base Recalibration''' ([[https://software.broadinstitute.org/gatk/gatkdocs/current/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php|BaseRecalibrator]] and [[http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_PrintReads.html|PrintReads]]) \\
     62Known variants/SNPs in VCF format is required for this step.  If none is available, then use the data itself to "bootstrap" known SNPs, see [[http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr | BQSR]]. \\
     63  * Example 1: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -o recal.table
     64  * Example 2: java -jar GenomeAnalysisTK.jar -T PrintReads -R human.fasta -I realigned.bam -BQSR recal.table -o recal.bam \\
     65See how things have changed
     66  * Example 3: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human.fasta -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -BQSR recal.table -o after_recal.table
     67and then make plots of how they've changed (which requires the R [[http://cran.r-project.org/web/packages/gsalib/index.html|'gsalib']] R package).
     68  * Example 4: java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human.fasta -before recal.table -after after_recal.table -plots recal_plots.pdf
     69All applied to our sample data:
     70  * bsub java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T BaseRecalibrator -I Reads_1.bwa.dedup.realigned.bam -R /path/to/genome/genome.fa -o Reads_1.bwa.recal_data.txt -knownSites SNPs_from_NCBI.sorted.vcf
     71  * bsub java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T PrintReads -I Reads_1.bwa.dedup.realigned.bam -R /path/to/genome/genome.fa -BQSR Reads_1.bwa.recal_data.txt -o Reads_1.bwa.dedup.realigned.recal.bam
     72  * bsub java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T BaseRecalibrator -I Reads_1.bwa.dedup.realigned.bam -R /path/to/genome/genome.fa -knownSites SNPs_from_NCBI.sorted.vcf -BQSR Reads_1.bwa.recal_data.txt -o Reads_1.bwa.after_recal.txt
     73  * bsub java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T AnalyzeCovariates -R /path/to/genome/genome.fa -before Reads_1.bwa.recal_data.txt -after Reads_1.bwa.after_recal.txt -plots Reads_1.bwa.recal_plots.pdf
     74
     75\\
     7611 - '''Finally -- Call variants''' \\
     77Run [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php|HaplotypeCaller]] ("The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper."; HaplotypeCaller is recommended as of GATK Version 3.0)
     78  * Example: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3
     79  * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.HaplotypeCaller.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3
     80
     81[If needed] Run [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php|UnifiedGenotyper]] may be a better choice for nondiploid samples and high sample numbers
     82  * Example: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R human.fasta -I input.bam -o output.vcf -stand_call_conf 30 -stand_emit_conf 10
     83  * java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /nfs/genomes/a.thaliana_TAIR_10/fasta_whole_genome/TAIR10.fa -I Reads_1.bwa.dedup.realigned.recal.reduced.bam --dbsnp SNPs_from_NCBI.sorted.vcf -o Reads_1.bwa.raw.snps.indels.UnifiedGenotyper.vcf -stand_call_conf 30 -stand_emit_conf 10
     84\\
     8512 - '''Run Variant Quality Score Recalibration''' ("VQSR", with
     86[[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php|VariantRecalibrator]] and [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantrecalibration_ApplyRecalibration.php|ApplyRecalibration]]) \\ \\
     8713 - '''Run Genotype Phasing and Refinement''' \\ \\
     8814 - '''Run Functional Annotation''' ([[http://snpeff.sourceforge.net/SnpEff.html|snpEff]] and [[https://software.broadinstitute.org/gatk/gatkdocs/current/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php|VariantAnnotator]] [which "parses output from snpEff into a simpler format that is more useful for analysis"])
     89  * Example 1: java -jar snpEff.jar eff -v -onlyCoding true -i vcf -o gatk GRCh37.64 input.vcf > output.vcf
     90  * Example 2: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R human.fasta -A SnpEff --variant original.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf
     91
     9215 - '''Analyze variant calls''' (with [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php|CombineVariants]], [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php|SelectVariants]], and [[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_varianteval_VariantEval.php|VariantEval]]) \\ \\
     93