wiki:SOP/CallingVariantsRNAseq

Version 57 (modified by krichard, 7 years ago) ( diff )

--

Notes on calling variants in RNA-seq data with GATK

  • RNAseq includes reads mapped across splice junctions and is associated with high variability of coverage, so typical variant calling pipelines (for DNA) can lead to lots of false positives and negatives.
  • GATK is currently the gold standard for calling variants in RNA-seq data. See a detailed description of their workflow here: GATK Variant Calling for RNA-seq
  • A main difference between calling variants in RNA vs DNA sequencing reads with GATK, is for RNA-seq data the STAR aligner is used to perform a 2-pass read mapping step, which was shown to have superior SNP sensitivity in a comparison of the most common mapping tools Engström, et al.

Using GATK to call variants from RNA-seq reads

This example pipeline starts with a single-end short-read fastq file (Reads_1.fq).

Note that GATK requires Java 1.7 (so you may need to adjust your path to point to that version, if an older version is the default).
For example, this can be added to ~/.bashrc:
export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH

1 - Run the STAR 2 pass procedure to map reads to the reference genome.

Index the reference genome for First Pass.

Create folder, "FirstPass" before running these commands.

To generate genome index files for STAR:

bsub STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDirFirstPass --genomeFastaFiles /path/to/genome/fasta --sjdbGTFfile /path/to/GTF/FileName.gtf --sjdbOverhang 100 --runThreadN 8 

To map your reads:

Run this command within the FirstPass directory

bsub STAR --genomeDir /path/to/GenomeDirFirstPass --readFilesIn /path/to/read1.fastq  --outFileNamePrefix whateverPrefix --runThreadN 8

Index the reference genome for Second Pass.

Create folder, "SecondPass" before running these commands.

To generate the 2nd pass genome index files for STAR:

bsub STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDirSecondPass --genomeFastaFiles /path/to/genome/fasta --sjdbFileChrStartEnd /path/to/first/pass/directory/SJ.out.tab --sjdbGTFfile /path/to/GTF/FileName.gtf --sjdbOverhang 100 --runThreadN 8 

To map your reads:

Run this command within the SecondPass directory

Input format: fastq ; output format: SAM
bsub STAR --genomeDir /path/to/GenomeDirSecondPass --readFilesIn /path/to/read1.fastq  --outFileNamePrefix whateverPrefix --runThreadN 8

The parameters included in the above sample commands are:

  • --sjdbOverhang Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. For short reads (<50) use readLength - 1, otherwise a generic value of 100 will work as well (see manual for more info).
  • --sjdbGTFfile <GTF_file.gtf> Supplies STAR with a GTF file during the genomeGenerate step. Combined with the --sjdbScore <n> option during mapping, this will bias the alignment toward annotated junctions, and reduces alignment to pseudogenes.
  • --runMode <alignReads, genomeGenerate> "alignReads" does the actual mapping. "genomeGenerate" generates the genomeDir required for mapping (default = alignReads).
  • --genomeDir </path/to/GenomeDir> Specifies the path to the directory used for storing the genome information created in the genomeGenerate step.
  • --genomeFastaFiles <genome FASTA files> Specifies genome FASTA files to be used.
  • --sjdbFileChrStartEnd <output from first pass> path to the file with genomic coordinates for introns
  • --readFilesIn <read1.fastq read2.fastq> Specifies the fastq files containing the reads, can be single-end or paired-end.
  • --runThreadN <n> Specifies the number of threads to use.

2 - Replace ReadGroups, mark duplicate reads , clip intron overhangs and reassign mapping qualities withPicard Tools and GATK

Replace read groups and order, by coordinates, the reads.

Note, if you are combining multiple experiments in this step the RGSM IDs must be the same while the library IDs must be unique.

java -jar /usr/local/share/picard-tools/picard.jar AddOrReplaceReadGroups I=output.sam O=rg_added_sorted.bam SO=coordinate RGID=ID_NAME RGLB=library RGPL=illumina RGPU=identifier RGSM=sample_name

Mark duplicate reads.

java -jar /usr/local/share/picard-tools/picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics

Identify and split Cigar N Reads and reassign quality scores.

java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T SplitNCigarReads -R /path/to/genome/fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS -fixMisencodedQuals

Perform BaseRecalibration.
Calibration files can be found here for hg38 Recalibration Files
NOTE: Calibration Files are only available for a few genomes (Human, Mouse, etc).

java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T BaseRecalibrator -R /path/to/genome/fasta -I dedupped.bam -knownSites /path/to/calibration/files -o recalibration.table

java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T PrintReads -R /path/to/genome/fasta -I dedupped.bam -BQSR recalibration.table -o recalibrated.bam

3 - Call and Filter Variants

java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T HaplotypeCaller -R /path/to/genome/fasta -I recalibrated.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o Variants_called.vcf

java -jar /usr/local/gatk3/GenomeAnalysisTK.jar -T VariantFiltration -R /path/to/genome/fasta -V Variants_called.vcf -window 35 -cluster 3 -filterName Filter -filter "QD < 2.0" -filterName Filter -filter "FS > 30.0" -o Filtered_variants_called.vcf

4 - Predict the effects of called variants

Several tools are available to analyze variants found in your .vcf result files for potential functional consequence. See details here.

Note: See TracWiki for help on using the wiki.