Version 14 (modified by 7 years ago) ( diff ) | ,
---|
Notes on calling variants in RNA-seq data with GATK
- RNAseq includes reads mapped across splice junctions and is associated with high variability of coverage, so typical variant calling pipelines (for DNA) can lead to lots of false positives and negatives.
- GATK is currently the gold standard for calling variants in RNA-seq data. See a detailed description of their workflow here: https://gatkforums.broadinstitute.org/gatk/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail
- A main difference between calling variants in RNA vs DNA sequencing reads with GATK, is for RNA-seq data the STAR aligner is used to perform a 2-pass read mapping step, which was shown to have superior SNP sensitivity in a comparison of the most common mapping tools (https://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2722.html)
Using GATK to call variants from RNA-seq reads
This example pipeline starts with a single-end short-read fastq file (Reads_1.fq).
Note that GATK requires Java 1.7 (so you may need to adjust your path to point to that version, if an older version is the default).
For example, this can be added to ~/.bashrc:
export PATH=/usr/lib/jvm/java-7-openjdk-amd64/bin:$PATH
1 - Index the reference genome for First Pass.
Create folder, "FirstPass" before running these commands.
To generate genome index files for STAR:
bsub STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --sjdbGTFfile /path/to/GTF/FileName.gtf --sjdbOverhang 100 --runThreadN 8
The parameters included in the above sample command are:
- --sjdbOverhang Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. For short reads (<50) use readLength - 1, otherwise a generic value of 100 will work as well (see manual for more info).
- --sjdbGTFfile <GTF_file.gtf> Supplies STAR with a GTF file during the genomeGenerate step. Combined with the --sjdbScore <n> option during mapping, this will bias the alignment toward annotated junctions, and reduces alignment to pseudogenes.
To map:
'''Run this command within the "FirstPass" directory''' Input format: fastq ; output format: SAM bsub STAR --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq --outFileNamePrefix whateverPrefix --runThreadN 8
Note:
See TracWiki
for help on using the wiki.