Mapping short reads
Contents:
Regular mappers
These mapping tools are useful for reads of DNA origin that should map to a continuous stretch of genomic DNA. Some of these tools can tolerate short indels but they're not designed for reads that span a splice junction
BWA
The Burrows-Wheeler Alignment (BWA) tool is a software package containing several related algorithms using the Burrows-Wheeler Transform. It works well even with indels, but not with spliced (RNA) reads.
Sample commands for short (up to 100 nt) reads:
# Align single-end reads sbatch --partition 20 --job-name=bwa --mem=32G --wrap "bwa aln /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Reads.fq > Mapped_reads.sai" sbatch --partition 20 --job-name=bwa --mem=32G --wrap "bwa samse /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Mapped_reads.sai Reads.fq > Reads.bwa.sam" # Align paired-end reads sbatch --partition 20 --job-name=bwa --mem=32G --wrap "bwa aln /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Reads.1.fq > Mapped_reads.1.sai" sbatch --partition 20 --job-name=bwa --mem=32G --wrap "bwa aln /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Reads.2.fq > Mapped_reads.2.sai" sbatch --partition 20 --job-name=bwa --mem=32G --wrap "bwa sampe /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Mapped_reads.1.sai Mapped_reads.2.sai Reads.1.fq Reads.2.fq > Reads.bwa.sam"
Use "-n" option to change the maximum number of mismatches in the alignment.
Sample commands for long (70 nt or longer) reads:
# Align single-end reads sbatch --partition 20 --job-name=bwa --mem=32G --wrap "bwa mem /nfs/genomes/sgd_2010/bwa/sacCer3.fa Reads.fq > Mapped_reads.bwa.sam" # Align paired-end reads sbatch --partition 20 --job-name=bwa --mem=32G --wrap "bwa mem /nfs/genomes/sgd_2010/bwa/sacCer3.fa Reads.1.fq Reads.2.fq > Mapped_reads.bwa.sam"
And then convert the SAM to BAM, sort, and index. This BaRC script can to all three with one command. (Then the original SAM can be deleted.)
/nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl input.sam
For aligning long reads using the bwa mem option, there's no maximum number of mismatches, this is analogous to a local alignment using blat/blast.
One may choose between bowtie version 1 (faster but ignores indels) and bowtie version 2 (slower but performs gapped alignment (i.e., indels)). For a feature comparision, see How is Bowtie 2 different from Bowtie 1?
Bowtie2
Bowtie2 was designed as an improvement to bowtie 1, specifically, it supports gapped alignment. See the first bowtie2 FAQ for how they differ. Bowtie 2 uses a different set of genome index files (*.bt2) than bowtie 1 (*.ebwt). Bowtie 2 works with indels.
Sample command:
sbatch --partition=20 --job-name=bt2 --mem=32G --wrap "bowtie2 --phred64 -L 22 -N 1 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 -U Sample_A.fq -S Sample_A.mm9.L22.N1.sam"
The parameters included in the sample command:
- -L <int> length of seed substrings; must be >3 and <32 (default=22)
- -N <int> max # mismatches in seed alignment; can be 0 or 1 (default=0)
- -S name of SAM output file
Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file). See the FASTQ format page for more details.
- --solexa-quals (for input quality scores from Illumina versions 1.2 and earlier)
- --phred64 (for input quality scores from Illumina versions 1.3-1.7)
- --phred33 (default "Sanger format"; for input quality scores from Illumina versions 1.8 and later)
bowtie2 can also perform local alignments where the unaligned end(s) of a read are clipped (so, for example, remaining adapter won't prevent alignment) by adding the argument --local.
The bowtie2 command can be modified to output mapped reads as BAM, such as
sbatch --partition=20 --job-name=bt2 --mem=32G --wrap "bowtie2 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 -U s_7.txt | samtools view -bS - > s7_mm9.bam"
Bowtie
Bowtie may still have some advantages over bowtie2 for specific use cases.
Sample command:
sbatch --partition=20 --job-name=bt --mem=32G --wrap "bowtie -k 1 -n 2 -l 50 --best --sam --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 Sample_A.fq Sample_A.mm9.k1.n2.l50.best.sam"
Parameters included in the sample command:
- -l/--seedlen <int> seed length for -n (default: 28) -- Set to longest possible length of high-quality bases (but no longer than 40-50, or mapping may become too stringent). Use the FastQC output to determine length of high-quality positions.
- -n/--seedmms <int> max mismatches in seed (can be 0-3, default: -n 2)
- -k <int> report up to <int> good alignments per read (default: 1) -- If you want only uniquely mapped reads, however, also use '-m 1' to ignore multi-mapped reads; use --all to report all alignments (much slower, ie. turn-off -k option)
- --best (in the case of multi-mapped reads, keep only the best hit(s))
- --sam to get SAM output format (which is the best format for downstream analysis)
Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file). See the FASTQ format page for more details.
- --solexa-quals (for input quality scores from Illumina versions 1.2 and earlier)
- --solexa1.3-quals or --phred64-quals (for input quality scores from Illumina versions 1.3-1.7)
- --phred33-quals (default "Sanger format"; for input quality scores from Illumina versions 1.8 and later)
As mentioned in the bowtie manual : "Paired-end alignments where one mate’s alignment is entirely contained within the other’s are considered invalid." In this case, you can trim one base off from 3 prime end of each read, as suggested by trim_galore documentation
To see other parameters log into tak and type bowtie
Other tools Many other regular mapping tools are also available, although they generally require a tool-specific indexed version of the genome.
Splice-aware mappers
These mappers permit the beginning and end of a read to map to (originate from) different places in the genome, which is common for spliced RNA.
Some choices for regular short-read mappers are STAR, Tophat 2, and Tophat.
STAR
STAR (manual) is an ultrafast universal RNA-seq aligner. It maps >60 times faster than Tophat2. To use STAR, a genome index specific for the STAR mapper needs to be generated first. STAR tends to align more reads to pseudogenes compared to Tophat2. However, the pseudogene problem can be significantly minimized by providing an annotation file containing known splice junctions. If no annotation is available for a genome of interest, a 2-pass mapping procedure is recommended. The first pass generates a splice junctions file, which is then used as the annotation file to run the second pass mapping.
Sample command:
To generate genome index files for STAR:
sbatch --partition=20 --job-name=star_idx --cpus-per-task=8 --mem=64G --wrap "STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --sjdbGTFfile /path/to/GTF/FileName.gtf --sjdbOverhang 100 --runThreadN 8 " #generate index without overhang/gtf, specify these at mapping instead sbatch --partition=20 --job-name=star_idx --cpus-per-task=8 --mem=64G --wrap "STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN 8 "
The parameters included in the above sample command are:
- --sjdbOverhang Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. For short reads (<50) use readLength - 1, otherwise a generic value of 100 will work as well for most long reads (see manual for more info and github response with reads longer than 100 bp).
- --sjdbGTFfile <GTF_file.gtf> Supplies STAR with a GTF file during the genomeGenerate step. Combined with the --sjdbScore <n> option during mapping, this will bias the alignment toward annotated junctions, and reduces alignment to pseudogenes.
- --genomeSAindexNbases For small genomes, the parameter --genomeSAindexNbases must to be scaled down, with a typical value of min(14, log2(GenomeLength)/2 - 1).
To map:
# Input format: fastq ; output format: SAM sbatch --partition=20 --job-name=star_map --cpus-per-task=8 --mem=64G --wrap "STAR --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --outFileNamePrefix whateverPrefix --sjdbScore 2 --runThreadN 8" # Input format: fastq.gz ; output format: BAM (sorted by position) sbatch --partition=20 --job-name=star_map --cpus-per-task=8 --mem=64G --wrap "STAR --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fq.gz /path/to/read2.fq.gz --outFileNamePrefix whateverPrefix --sjdbScore 2 --runThreadN 8 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate" #mapping when gtf/overhang was not specified during indexing sbatch --partition=20 --job-name=star_map --cpus-per-task=8 --mem=64G --wrap "STAR --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fq.gz /path/to/read2.fq.gz --outFileNamePrefix whateverPrefix --sjdbScore 2 --runThreadN 8 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --sjdbGTFfile /path/to/GTF/FileName.gtf --sjdbOverhang readLength-1 "
The parameters included in the above sample commands are:
- --runMode <alignReads, genomeGenerate> "alignReads" does the actual mapping. "genomeGenerate" generates the genomeDir required for mapping (default = alignReads).
- --genomeDir </path/to/GenomeDir> Specifies the path to the directory used for storing the genome information created in the genomeGenerate step.
- --genomeFastaFiles <genome FASTA files> Specifies genome FASTA files to be used.
- --readFilesIn <read1.fastq read2.fastq> Specifies the fastq files containing the reads, can be single-end or paired-end.
- --sjdbScore <n> Provides extra alignment score for alignments that cross database junctions (default = 2). If this score is positive, it will bias the alignment toward annotated junctions. This is only used if during the genomeGenerate step a splice junction annotation file is used.
- --runThreadN <n> Specifies the number of threads to use.
- --readFilesCommand <cmd> Specifies the command to uncompress compressed fastq files. For gzipped files (*.gz) use --readFilesCommand zcat.
- --outSAMtype <BAM sortingMode> Specifies the type of BAM file to create. Options: 'BAM Unsorted', 'BAM SortedByCoordinate', 'BAM Unsorted SortedByCoordinate' (to create both unsorted and sorted BAMs)
- Optional
- --alignIntronMax sets the maximum intron size where the default is (2winBinNbits)*winAnchorDistNbins or approx 590K when finding novel splice junctions. Decrease the intron size for small genomes (e.g. yeast). Also see STAR FAQ: Options and Parameters
Genome(annotation) Intron Size (99th Percentile) Human (GRCh38.101) 104052 Mouse (GRCm38.91) 77207
- To turn off soft-clipping, add "--alignEndsType EndToEnd" to your command.
- --outFilterMultimapNmax sets the maximum number of multiple alignments, the default is 10.
STAR 2-Pass alignment method. To improve identification of novel splice junctions, run STAR for a second pass, using junctions found from all samples during the first-pass mapping. This is especially useful if no GTF file (with annotated junctions) was available for the first-pass mapping. Furthermore, this procedure is recommended for improving sensitivity of variant calling with RNA-seq data.
Index the reference genome for First Pass.
Create folder, "FirstPass" before running these commands.
To generate FirstPass genome index files for STAR:
sbatch --partition=20 --job-name=st_1stIdx --cpus-per-task=8 --mem=64G --wrap "STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDirFirstPass --genomeFastaFiles /path/to/genome/fasta --sjdbGTFfile /path/to/GTF/FileName.gtf --sjdbOverhang 100 --runThreadN 8 "
To map your reads:
Run this command within the FirstPass directory
sbatch --partition=20 --job-name=st_1stMap --cpus-per-task=8 --mem=64G --wrap "STAR --genomeDir /path/to/GenomeDirFirstPass --readFilesIn /path/to/Reads_1.fastq --outFileNamePrefix whateverPrefix --runThreadN 8"
re-index the reference genome for Second Pass.
Create folder, "SecondPass" before running these commands.
To improve sensitivity for variant detection it is recommended to run STAR for a second pass, using junctions found from the first-pass mapping. To generate theSecondPass genome index files for STAR:
sbatch --partition=20 --job-name=st_2ndIdx --cpus-per-task=8 --mem=64G --wrap "STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDirSecondPass --genomeFastaFiles /path/to/genome/fasta --sjdbFileChrStartEnd /path/to/first/pass/directory/SJ.out.tab --sjdbGTFfile /path/to/GTF/FileName.gtf --sjdbOverhang 100 --runThreadN 8 "
To map your reads:
Run this command within the SecondPass directory
#Input format: fastq ; output format: SAM sbatch --partition=20 --job-name=st_2ndMap --cpus-per-task=8 --mem=64G --wrap "STAR --genomeDir /path/to/GenomeDirSecondPass --readFilesIn /path/to/Reads_1.fastq --outFileNamePrefix whateverPrefix --runThreadN 8"
TopHat version 2
TopHat version 2 is no longer recommended. The authors of TopHat currently recommend HISAT2. TopHat version 2 uses bowtie2, rather than bowtie, for its mapping. As a result, TopHat 2 uses a different set of genome index files (*.bt2) than TopHat 1 (*.ebwt).
Sample command:
# Single-end reads sbatch --partition=20 --job-name=tophat --mem=32G --wrap "tophat -o s_7_tophat_out --phred64-quals --library-type fr-firststrand --segment-length 20 -I 200000 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf --no-novel-juncs /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt" # Paired-end reads # For PE reads, specifiy expected (mean) inner distance using -r option (default is 50bp). The inner distance, or insert size, does not include length of the reads/mates. For example, PE run with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. sbatch --partition=20 --job-name=tophat --mem=32G --wrap "tophat -o s_7_tophat_out --phred64-quals --library-type fr-firststrand --segment-length 20 -I 200000 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf --no-novel-juncs /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.1.txt s_7.2.txt"
The parameters included in the sample command are:
- -o/--output-dir <word> All output files will be created in this directory (default = tophat_out)
- --segment-length <int> Shortest length of a spliced read that can map to one side of the junction. For reads shorter than ~45 nt, set this to half the read length (so set '--segment-length 20' for 40-nt reads). For longer reads, the default length (25) can be used.
- -I <int> Maximum intron length. If your genome has introns that are all shorter (or many that are longer) than the default value (500000), set this to a more appropriate value.
- -G <GTF file> Supply bowtie with a GTF file of transcript models. This can help bowtie identify functions that may otherwise be missed.
- --no-novel-juncs Only look for spliced reads across junctions in the supplied GTF file. Not used if looking for novel isoforms.
- --library type Take advantage of strandedness of library for mapping (especially across splice junctions); can be fr-unstranded, fr-firststrand, or fr-secondstrand
Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file). See the FASTQ format page for more details.
- --solexa-quals (for input quality scores from Illumina versions 1.2 and earlier)
- --solexa1.3-quals or --phred64-quals (for input quality scores from Illumina versions 1.3-1.7)
- For "Sanger / Illumina 1.8" or "Sanger / Illumina 1.9", bowtie can use the default "phred33" encoding
Choices for controlling alignment (eg. mismatches)
- --read-mismatches/-N Final read alignments having more than these many mismatches are discarded (default is 2).
- --read-gap-length Final read alignments having more than these many total length of gaps are discarded (default is 2).
- --read-edit-dist Final read alignments having more than these many edit distance (ie. mismatches+indels) are discarded (default is 2).
- --segment-mismatches Read segments are mapped independently, allowing up to this many mismatches in each segment alignment (default is 2).
TopHat version 1
TopHat version 1 is no longer recommended. Running TopHat version 1 requires a change to a user's environment on tak and only applies to the specific tak session. First run this command:
export PATH="/usr/local/share/tophat1:$PATH"
and then check that your terminal will use the correct TopHat version:
tophat --version
Sample command:
sbatch --partition=20 --job-name=tophat --cpus-per-task=6 --mem=32G --wrap "tophat -o s_7_tophat_out -p 6 --phred64-quals --segment-length 20 -I 200000 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf --no-novel-juncs /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt"
The parameters included in the sample command are:
- -o/--output-dir <word> All output files will be created in this directory (default = tophat_out)
- --segment-length <int> Shortest length of a spliced read that can map to one side of the junction. For reads shorter than ~45 nt, set this to half the read length (so set '--segment-length 20' for 40-nt reads). For longer reads, the default length (25) can be used.
- -I <int> Maximum intron length. If your genome has introns that are all shorter (or many that are longer) than the default value (500000), set this to a more appropriate value.
- -G <GTF file> Supply bowtie with a GTF file of transcript models. This can help bowtie identify functions that may otherwise be missed.
- --no-novel-juncs Only look for spliced reads across junctions in the supplied GTF file. Not used if looking for novel isoforms.
- -p/--num-threads Use this many threads to align reads (default is 1)
Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file). See the FASTQ format page for more details.
- --solexa-quals (for input quality scores from Illumina versions 1.2 and earlier)
- --solexa1.3-quals or --phred64-quals (for input quality scores from Illumina versions 1.3-1.7)
Others
- HISAT2: successor of TopHat2/HISAT