Changes between Version 66 and Version 67 of SOPs/mapping


Timestamp:
08/25/20 14:28:33 (5 years ago)
Author:
gbell
Comment:

--

  • SOPs/mapping

    v66 v67  
    33
    44
    5 == Regular mappers ==
     5
     6=== Regular mappers ===
     7=== [#splice_mappers Go to splice-aware mappers] ===
     8
     9Some choices for regular short-read mappers are
     10  * [#bwa BWA]
     11  * [#bowtie2 bowtie2]
     12  * [#bowtie bowtie]
     13
    614
    715These mapping tools are useful for reads of DNA origin that should map to a continuous stretch of genomic DNA.  Some of these tools can tolerate short indels, but they're not designed for reads that span a splice junction.
    816
    9 One may choose between bowtie version 1 (faster but ignores indels) and bowtie version 2 (slower but performs gapped alignment (i.e., indels)).  For a feature comparision, see [http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml How is Bowtie 2 different from Bowtie 1?]
    10 
    11 '''[http://bowtie-bio.sourceforge.net/index.shtml bowtie version 1]'''
    12 
    13 Sample command:
    14 {{{
    15 bsub bowtie  -k 1 -n 2 -l 50 --best --sam --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 Sample_A.fq Sample_A.mm9.k1.n2.l50.best.sam
    16 }}}
    17 
    18 Parameters included in the sample command:
    19   * '''-l/--seedlen <int>'''     seed length for -n (default: 28) -- Set to longest possible length of high-quality bases (but no longer than 40-50, or mapping may become too stringent).  Use the FastQC output to determine length of high-quality positions.
    20   * '''-n/--seedmms <int>'''     max mismatches in seed (can be 0-3, default: -n 2)
    21   * '''-k <int>'''               report up to <int> good alignments per read (default: 1) -- If you want only uniquely mapped reads, however, also use '-m 1' to ignore multi-mapped reads; use --all to report all alignments (much slower, ie. turn-off -k option)
    22   * '''--best'''                 (in the case of multi-mapped reads, keep only the best hit(s))   
    23   * '''--sam'''                  to get SAM output format (which is the best format for downstream analysis)
    24 
    25 Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file).  See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details.
    26   * '''--solexa-quals'''         (for input quality scores from Illumina versions 1.2 and earlier)
    27   * '''--solexa1.3-quals''' or '''--phred64-quals'''     (for input quality scores from Illumina versions 1.3-1.7)
    28   * '''--phred33-quals'''         (default "Sanger format"; for input quality scores from Illumina versions 1.8 and later)
    29 
    30 To see other parameters log into tak and type '''bowtie'''
    31 
    32 
    33 '''[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml bowtie version 2]'''
    34 
    35 Bowtie 2 was designed as an improvement to bowtie 1, specifically, it supports gapped alignment.  See the first [http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml bowtie2 FAQ] for how they differ.  Early versions of bowtie 2 had some issues, but these seem to have been fixed.  Bowtie 2 uses a different set of genome index files (*.bt2) than bowtie 1 (*.ebwt).  Bowtie 2 works with indels.
    36 
    37 Sample command:
    38 {{{
    39 bsub bowtie2 --phred64 -L 22 -N 1 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 -U Sample_A.fq -S Sample_A.mm9.L22.N1.sam
    40 }}}
    41 
    42 The parameters included in the sample command:
    43   * '''-L <int>'''     length of seed substrings; must be >3 and <32 (default=22)
    44   * '''-N <int>'''     max # mismatches in seed alignment; can be 0 or 1 (default=0)
    45   * '''-S'''           name of SAM output file
    46 Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file).  See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details.
    47   * '''--solexa-quals'''         (for input quality scores from Illumina versions 1.2 and earlier)
    48   * '''--phred64'''     (for input quality scores from Illumina versions 1.3-1.7)
    49   * '''--phred33'''         (default "Sanger format"; for input quality scores from Illumina versions 1.8 and later)
    50 
    51 bowtie2 can also perform local alignments where the unaligned end(s) of a read are clipped (so, for example, remaining adapter won't prevent alignment) by adding the argument '''--local'''.
    52 
    53 The bowtie2 command can be modified to output mapped reads as BAM, such as
    54 
    55 {{{
    56 bsub "bowtie2 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 -U s_7.txt | samtools view -bS - > s7_mm9.bam"
    57 }}}
    58 
    59 '''[http://bio-bwa.sourceforge.net/ bwa - Burrows-Wheeler Alignment Tool ]'''
    60 
    61 Bwa is a software package containing several related algorithms using the Burrows-Wheeler Transform.  It works well even with indels, but not with spliced (RNA) reads.
    62 
    63 ''Sample commands for short (upto 100 bp) reads:''
     17=== [=#bwa BWA] ===
     18
     19The [[http://bio-bwa.sourceforge.net/ | Burrows-Wheeler Alignment (BWA) tool]] is a software package containing several related algorithms using the Burrows-Wheeler Transform.  It works well even with indels, but not with spliced (RNA) reads.
     20
     21''Sample commands for short (up to 100 bp) reads:''
    6422{{{
    6523# Align single-end reads
     
    8442When aligning long reads with bwa mem, there is no maximum number of mismatches; this is analogous to a local alignment using blat/blast.
    8543
    86 [[BR]]
     44
     45One may choose between bowtie version 1 (faster but ignores indels) and bowtie version 2 (slower but performs gapped alignment, i.e., handles indels).  For a feature comparison, see [http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml How is Bowtie 2 different from Bowtie 1?]
     46
     47'''[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml bowtie version 2]'''
     48
     49=== [=#bowtie2 Bowtie2] ===
     50
     51[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml Bowtie2] was designed as an improvement to bowtie 1; specifically, it supports gapped alignment, so it can align reads that contain indels.  See the first [http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml bowtie2 FAQ] for how they differ.  Bowtie 2 uses a different set of genome index files (*.bt2) than bowtie 1 (*.ebwt).
     52
     53Sample command:
     54{{{
     55bsub bowtie2 --phred64 -L 22 -N 1 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 -U Sample_A.fq -S Sample_A.mm9.L22.N1.sam
     56}}}
     57
     58The parameters included in the sample command:
     59  * '''-L <int>'''     length of seed substrings; must be >3 and <32 (default=22)
     60  * '''-N <int>'''     max # mismatches in seed alignment; can be 0 or 1 (default=0)
     61  * '''-S'''           name of SAM output file
     62Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file).  See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details.
     63  * '''--solexa-quals'''         (for input quality scores from Illumina versions 1.2 and earlier)
     64  * '''--phred64'''     (for input quality scores from Illumina versions 1.3-1.7)
     65  * '''--phred33'''         (default "Sanger format"; for input quality scores from Illumina versions 1.8 and later)
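
As a rough illustration of the encoding choices above, the bowtie2 flag can be picked from the Illumina version shown in the FastQC "Encoding" field. This is only a sketch: the function name `bowtie2_qual_flag` is hypothetical, and it assumes the version has already been extracted as a plain number (for example 1.5).

```shell
# Hypothetical helper (not part of bowtie2): map an Illumina pipeline
# version number to the bowtie2 quality-encoding flag listed above.
bowtie2_qual_flag() {
    awk -v v="$1" 'BEGIN {
        if (v + 0 < 1.3)      print "--solexa-quals"   # Illumina 1.2 and earlier
        else if (v + 0 < 1.8) print "--phred64"        # Illumina 1.3-1.7
        else                  print "--phred33"        # Illumina 1.8 and later
    }'
}
```

For example, `bowtie2_qual_flag 1.5` prints `--phred64`, which matches the flag used in the sample command.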
     66
     67bowtie2 can also perform local alignments where the unaligned end(s) of a read are clipped (so, for example, remaining adapter won't prevent alignment) by adding the argument '''--local'''.
     68
     69The bowtie2 command can be modified to output mapped reads as BAM, such as
     70
     71{{{
     72bsub "bowtie2 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 -U s_7.txt | samtools view -bS - > s7_mm9.bam"
     73}}}
     74
     75=== [=#bowtie Bowtie] ===
     76
     77[http://bowtie-bio.sourceforge.net/index.shtml Bowtie] may still have some advantages over bowtie2 for specific use cases.
     78
     79Sample command:
     80{{{
     81bsub bowtie  -k 1 -n 2 -l 50 --best --sam --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 Sample_A.fq Sample_A.mm9.k1.n2.l50.best.sam
     82}}}
     83
     84Parameters included in the sample command:
     85  * '''-l/--seedlen <int>'''     seed length for -n (default: 28) -- Set to longest possible length of high-quality bases (but no longer than 40-50, or mapping may become too stringent).  Use the FastQC output to determine length of high-quality positions.
     86  * '''-n/--seedmms <int>'''     max mismatches in seed (can be 0-3, default: -n 2)
     87  * '''-k <int>'''               report up to <int> good alignments per read (default: 1) -- If you want only uniquely mapped reads, also use '-m 1' to ignore multi-mapped reads; use --all instead of -k to report all alignments (much slower)
     88  * '''--best'''                 (in the case of multi-mapped reads, keep only the best hit(s))   
     89  * '''--sam'''                  to get SAM output format (which is the best format for downstream analysis)
     90
     91Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file).  See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details.
     92  * '''--solexa-quals'''         (for input quality scores from Illumina versions 1.2 and earlier)
     93  * '''--solexa1.3-quals''' or '''--phred64-quals'''     (for input quality scores from Illumina versions 1.3-1.7)
     94  * '''--phred33-quals'''         (default "Sanger format"; for input quality scores from Illumina versions 1.8 and later)
     95
     96To see other parameters, log into tak and type '''bowtie'''.
     97
    8798
    8899'''Other tools'''
    89100Many other regular mapping tools are also available, although they generally require a tool-specific indexed version of the genome.
    90101
    91 == Splice-aware mappers ==
     102== [=#splice_mappers Splice-aware mappers] ==
    92103
    93104These mappers permit the beginning and end of a read to map to (originate from) different places in the genome, which is common for spliced RNA.
    94105
    95 '''[https://github.com/alexdobin/STAR STAR]'''
    96 
    97 STAR ([https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf manual]) is an ultrafast universal RNA-seq aligner.  It maps >60 times faster than Tophat2. To use STAR, a genome index specific for the STAR mapper needs to be generated first.  STAR tends to align more reads to pseudogenes compared to Tophat2.  However, the pseudogene problem can be significantly minimized by providing an annotation file containing known splice junctions. If no annotation is available for a genome of interest, a 2-pass mapping procedure is recommended. The first pass generates a splice junctions file, which is then used as the annotation file to run the second pass mapping. 
     106Some choices for splice-aware mappers are
     107  * [#STAR STAR]
     108  * [#tophat2 tophat2]
     109  * [#tophat tophat]
     110
     111=== [=#STAR STAR] ===
     112
     113[https://github.com/alexdobin/STAR STAR] ([https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf manual]) is an ultrafast universal RNA-seq aligner.  It maps >60 times faster than Tophat2. To use STAR, a genome index specific for the STAR mapper needs to be generated first.  STAR tends to align more reads to pseudogenes compared to Tophat2.  However, the pseudogene problem can be significantly minimized by providing an annotation file containing known splice junctions. If no annotation is available for a genome of interest, a 2-pass mapping procedure is recommended. The first pass generates a splice junctions file, which is then used as the annotation file to run the second pass mapping. 
    98114
    99115Sample command:
     
    175191}}}
    176192
    177 
    178 '''tophat version 1 (old)'''
     193=== [=#tophat2 TopHat version 2] ===
     194
     195'''[http://ccb.jhu.edu/software/tophat/index.shtml TopHat version 2] is no longer recommended.'''  The authors of TopHat currently recommend [http://ccb.jhu.edu/software/hisat2/index.shtml HISAT2].
     196TopHat version 2 uses bowtie2, rather than bowtie, for its mapping.  As a result, TopHat 2 uses a different set of genome index files (*.bt2) than TopHat 1 (*.ebwt).
     197
     198Sample command:
     199{{{
     200# Single-end reads
     201bsub tophat -o s_7_tophat_out --phred64-quals --library-type fr-firststrand --segment-length 20 -I 200000 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf --no-novel-juncs /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt
     202# Paired-end reads
     203# For PE reads, specify the expected (mean) inner distance using the -r option (default is 50bp).  The inner distance, or insert size, does not include the length of the reads/mates.  For example, for a PE run with fragments selected at 300bp, where each end is 50bp, set -r to 200.
     204bsub tophat -o s_7_tophat_out --phred64-quals --library-type fr-firststrand --segment-length 20 -I 200000 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf --no-novel-juncs /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.1.txt s_7.2.txt
     205}}}
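
The -r arithmetic described in the comment above can be sketched as a small shell function. The name `tophat_inner_distance` is hypothetical, and the sketch assumes both mates have the same length.

```shell
# Hypothetical helper: expected mate inner distance for TopHat's -r option.
# Arguments: selected fragment length (bp), length of each read/mate (bp).
tophat_inner_distance() {
    fragment_len="$1"
    read_len="$2"
    # The inner distance excludes both mates, so subtract two read lengths.
    echo $(( fragment_len - 2 * read_len ))
}
```

With the numbers from the example, `tophat_inner_distance 300 50` prints 200.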
     206
     207The parameters included in the sample command are:
     208  * '''-o/--output-dir <word>'''     All output files will be created in this directory (default = tophat_out)
     209  * '''--segment-length <int>'''  Shortest length of a spliced read that can map to one side of the junction.  For reads shorter than ~45 nt, set this to half the read length (so set '--segment-length 20' for 40-nt reads).  For longer reads, the default length (25) can be used.
     210  * '''-I <int>''' Maximum intron length.  If your genome has introns that are all shorter (or many that are longer) than the default value (500000), set this to a more appropriate value.
     211  * '''-G <GTF file>''' Supply TopHat with a GTF file of transcript models.  This can help TopHat identify junctions that may otherwise be missed.
     212  * '''--no-novel-juncs ''' Only look for spliced reads across junctions in the supplied GTF file.  Not used if looking for novel isoforms.
     213  * '''--library-type''' Take advantage of strandedness of library for mapping (especially across splice junctions); can be fr-unstranded, fr-firststrand, or fr-secondstrand
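
The rule of thumb for '--segment-length' given in the list above can be written out as follows. This is a sketch only: `tophat_segment_length` is a hypothetical name, and the cutoff of 45 nt is taken literally even though the text gives it as approximate.

```shell
# Hypothetical helper: suggest a value for TopHat's --segment-length.
# Argument: read length in nt.
tophat_segment_length() {
    read_len="$1"
    if [ "$read_len" -lt 45 ]; then
        # Short reads: use half the read length (integer division).
        echo $(( read_len / 2 ))
    else
        # Longer reads: the default segment length (25) can be used.
        echo 25
    fi
}
```

For 40-nt reads, `tophat_segment_length 40` prints 20, the value used in the sample command.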
     214
     215Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file).  See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details.
     216  * '''--solexa-quals'''         (for input quality scores from Illumina versions 1.2 and earlier)
     217  * '''--solexa1.3-quals''' or '''--phred64-quals'''     (for input quality scores from Illumina versions 1.3-1.7)
     218  * For "Sanger / Illumina 1.8" or "Sanger / Illumina 1.9", bowtie can use the default "phred33" encoding
     219
     220Choices for controlling alignment (e.g., mismatches)
     221  * '''--read-mismatches/-N''' Final read alignments with more than this many mismatches are discarded (default is 2).
     222  * '''--read-gap-length''' Final read alignments with a total gap length greater than this value are discarded (default is 2).
     223  * '''--read-edit-dist''' Final read alignments with an edit distance (i.e., mismatches+indels) greater than this value are discarded (default is 2).
     224  * '''--segment-mismatches'''  Read segments are mapped independently, allowing up to this many mismatches in each segment alignment (default is 2).
     225
     226=== [=#tophat TopHat version 1] ===
    179227
    180228'''TopHat version 1 is no longer recommended.'''
     
    204252  * '''--solexa1.3-quals''' or '''--phred64-quals'''     (for input quality scores from Illumina versions 1.3-1.7)
    205253
    206 '''[http://ccb.jhu.edu/software/tophat/index.shtml tophat version 2]'''
    207 
    208 '''TopHat version 2 is no longer recommended.'''  The authors of TopHat currently recommend [http://ccb.jhu.edu/software/hisat2/index.shtml HISAT2].
    209 TopHat version 2 uses bowtie2, rather than bowtie, for its mapping.  As a result, TopHat 2 uses a different set of genome index files (*.bt2) than TopHat 1 (*.ebwt).
    210 
    211 Sample command:
    212 {{{
    213 # Single-end reads
    214 bsub tophat -o s_7_tophat_out --phred64-quals --library-type fr-firststrand --segment-length 20 -I 200000 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf --no-novel-juncs /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt
    215 # Paired-end reads
    216 # For PE reads, specifiy expected (mean) inner distance using -r option (default is 50bp).  The inner distance, or insert size, does not include length of the reads/mates.  For example, PE run with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200.
    217 bsub tophat -o s_7_tophat_out --phred64-quals --library-type fr-firststrand --segment-length 20 -I 200000 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf --no-novel-juncs /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.1.txt s_7.2.txt
    218 }}}
    219 
    220 The parameters included in the sample command are:
    221   * '''-o/--output-dir <word>'''     All output files will be created in this directory (default = tophat_out)
    222   * '''--segment-length <int>'''  Shortest length of a spliced read that can map to one side of the junction.  For reads shorter than ~45 nt, set this to half the read length (so set '--segment-length 20' for 40-nt reads).  For longer reads, the default length (25) can be used.
    223   * '''-I <int>''' Maximum intron length.  If your genome has introns that are all shorter (or many that are longer) than the default value (500000), set this to a more appropriate value.
    224   * '''-G <GTF file>''' Supply bowtie with a GTF file of transcript models.  This can help bowtie identify functions that may otherwise be missed.
    225   * '''--no-novel-juncs ''' Only look for spliced reads across junctions in the supplied GTF file.  Not used if looking for novel isoforms.
    226   * '''--library type ''' Take advantage of strandedness of library for mapping (especially across splice junctions); can be fr-unstranded, fr-firststrand, or fr-secondstrand
    227 
    228 Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file).  See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details.
    229   * '''--solexa-quals'''         (for input quality scores from Illumina versions 1.2 and earlier)
    230   * '''--solexa1.3-quals''' or '''--phred64-quals'''     (for input quality scores from Illumina versions 1.3-1.7)
    231   * For "Sanger / Illumina 1.8" or "Sanger / Illumina 1.9", bowtie can use the default "phred33" encoding
    232 
    233 Choices for controlling alignment (eg. mismatches)
    234   * '''--read-mismatches/-N''' Final read alignments having more than these many mismatches are discarded (default is 2).
    235   * '''--read-gap-length''' Final read alignments having more than these many total length of gaps are discarded (default is 2).
    236   * '''--read-edit-dist''' Final read alignments having more than these many edit distance (ie. mismatches+indels) are discarded (default is 2).
    237   * '''--segment-mismatches'''  Read segments are mapped independently, allowing up to this many mismatches in each segment alignment (default is 2).
    238 
    239254== Others ==
    240255