== Mapping short reads ==

== Regular mappers ==

These mapping tools are useful for reads of DNA origin that should map to a continuous stretch of genomic DNA. Some of these tools can tolerate short indels, but they are not designed for reads that span a splice junction.

One may choose between bowtie version 1 (faster, but ignores indels) and bowtie version 2 (slower, but performs gapped alignment, i.e., handles indels). For a feature comparison, see [http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml How is Bowtie 2 different from Bowtie 1?]

'''[http://bowtie-bio.sourceforge.net/index.shtml bowtie version 1]'''

Sample command:
{{{
bsub bowtie -k 1 -n 2 -l 70 --best --sam --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt s7_mm9.k1.n2.l36.best.sam
}}}

The parameters included in the sample command are:
 * '''-l/--seedlen <int>''' seed length for -n (default: 28). Set this to the longest possible length of high-quality bases; use the FastQC output to determine the length of the high-quality positions.
 * '''-n/--seedmms <int>''' max mismatches in seed (can be 0-3, default: -n 2)
 * '''-k <int>''' report up to <int> good alignments per read (default: 1). If you want only uniquely mapped reads, also use '-m 1' to ignore multi-mapped reads.
 * '''--solexa-quals''' (if input quals are from GA Pipeline ver. < 1.3) See the table at the top of the FastQC output to identify the "encoding" scale [[br]]
 * '''--solexa1.3-quals''' or '''--phred64-quals''' (if input quals are from GA Pipeline ver. >= 1.3 and before Illumina 1.8) See the table at the top of the FastQC output to identify the "encoding" scale [[br]]
 * '''--best''' (in the case of multi-mapped reads, keep only the best hit(s))
 * '''--sam''' to get SAM output format (which is the best format for downstream analysis)

To see other parameters, log into tak and type '''bowtie'''

'''[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml bowtie version 2]'''

Sample command:
{{{
bsub bowtie2 --phred64 -L 22 -N 1 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt -S s7_mm9.L22.N1.sam
}}}

The parameters included in the sample command are:
 * '''-L <int>''' length of seed substrings; must be >3 and <32 (default=22)
 * '''-N <int>''' max number of mismatches in seed alignment; can be 0 or 1 (default=0)
 * '''--phred64''' (if input quals are from GA Pipeline ver. >= 1.3 and before Illumina 1.8) See the table at the top of the FastQC output to identify the "encoding" scale [[br]]
 * '''-x''' basename of the bowtie2 genome index
 * '''-S''' name of SAM output file

bowtie2 can also perform local alignments, in which the unaligned end(s) of a read are clipped (so that, for example, remaining adapter sequence won't prevent alignment), by adding the argument '''--local'''.

'''Other tools'''

Many other regular mapping tools are also available, although they generally require a tool-specific indexed version of the genome.

== Splice-aware mappers ==

These mappers permit the beginning and end of a read to map to (originate from) different places in the genome, which is common for reads from spliced RNA.

'''[http://tophat.cbcb.umd.edu/ tophat version 1]'''

Running TopHat version 1 requires a change to the user's environment on tak (the change applies only to the current tak session). First run this command:
{{{
export PATH="/usr/local/share/tophat1:$PATH"
}}}
and then check that your terminal will use the correct TopHat version:
{{{
tophat --version
}}}

Sample command:
{{{
bsub tophat -o s_7_tophat_out --phred64-quals --no-novel-juncs --segment-length 20 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt
}}}

The parameters included in the sample command are:
 * '''-o/--output-dir <string>''' All output files will be created in this directory (default = tophat_out)
 * '''--solexa-quals''' (if input quals are from GA Pipeline ver. < 1.3) See the table at the top of the FastQC output to identify the "encoding" scale [[br]]
 * '''--phred64-quals''' or '''--solexa1.3-quals''' (if input quals are from GA Pipeline ver. >= 1.3 and before Illumina 1.8) See the table at the top of the FastQC output to identify the "encoding" scale [[br]]