== Mapping short reads ==

== Regular mappers ==

These mapping tools are useful for reads of DNA origin that should map to a continuous stretch of genomic DNA. Some of these tools can tolerate short indels, but they are not designed for reads that span a splice junction.

One may choose between bowtie version 1 (faster, but ignores indels) and bowtie version 2 (slower, but performs gapped alignment, i.e., it handles indels). For a feature comparison, see [http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml How is Bowtie 2 different from Bowtie 1?]

'''[http://bowtie-bio.sourceforge.net/index.shtml bowtie version 1]'''

Sample command:
{{{
bsub bowtie -k 1 -n 2 -l 70 --best --sam --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt s7_mm9.k1.n2.l36.best.sam
}}}

Parameters included in the sample command:
 * '''-l/--seedlen <int>''' seed length for -n (default: 28). Set this to the longest possible length of high-quality bases; use the FastQC output to determine the length of high-quality positions.
 * '''-n/--seedmms <int>''' max mismatches in seed (can be 0-3, default: -n 2)
 * '''-k <int>''' report up to <int> good alignments per read (default: 1). If you want only uniquely mapped reads, also use '-m 1' to ignore multi-mapped reads.
 * '''--best''' (in the case of multi-mapped reads, keep only the best hit(s))
 * '''--sam''' to get SAM output format (the format that most downstream analysis tools expect)

Choose the quality-score option that matches the FASTQ encoding of the reads (listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file). See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details.
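When the FastQC report is not at hand, the encoding can often be guessed from the lowest quality character in the file. The following is only a minimal sketch and is not part of bowtie or FastQC: the function name `guess_fastq_encoding` is hypothetical, it assumes a standard four-line-per-record FASTQ file, and it samples just the first 1000 reads. A result of `--phred64-quals` is a guess rather than a certainty, because uniformly high-quality Phred+33 reads can fall in the same character range.

```shell
# Hypothetical helper: guess the bowtie (version 1) quality option from the
# lowest quality character seen in a FASTQ file.
# Assumption: 4 lines per record, quality string on every 4th line.
guess_fastq_encoding() {
    # Sample only the first 4000 lines (1000 reads) to keep this quick.
    head -n 4000 "$1" | LC_ALL=C awk '
        BEGIN {
            # Build a character -> ASCII code lookup for printable characters.
            for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
            min = 127
        }
        NR % 4 == 0 {
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                if ((c in ord) && ord[c] < min) min = ord[c]
            }
        }
        END {
            if (min == 127)    print "unknown"            # no quality lines found
            else if (min < 59) print "--phred33-quals"    # Phred+33 starts at code 33
            else if (min < 64) print "--solexa-quals"     # Solexa+64 starts at code 59
            else               print "--phred64-quals"    # Phred+64 starts at code 64
        }'
}
```

For example, `guess_fastq_encoding s_7.txt` prints the option to try. Whichever way the encoding is determined, it selects one of these bowtie options: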
 * '''--solexa-quals''' (for input quality scores from Illumina versions 1.2 and earlier)
 * '''--solexa1.3-quals''' or '''--phred64-quals''' (for input quality scores from Illumina versions 1.3-1.7)
 * '''--phred33-quals''' (default "Sanger format"; for input quality scores from Illumina versions 1.8 and later)

To see other parameters, log into tak and type '''bowtie'''.

'''[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml bowtie version 2]'''

Sample command:
{{{
bsub bowtie2 --phred64 -L 22 -N 1 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt -S s7_mm9.L22.N1.sam
}}}

The parameters included in the sample command are:
 * '''-L <int>''' length of seed substrings; must be >3 and <32 (default=22)
 * '''-N <int>''' max # mismatches in seed alignment; can be 0 or 1 (default=0)
 * '''--phred64''' (if input quals are from GA Pipeline ver. >= 1.3 and before Illumina 1.8). See the table at the top of the FastQC output to identify the "encoding" scale. [[br]]
 * '''-S''' name of SAM output file

bowtie2 can also perform local alignments, in which the unaligned end(s) of a read are clipped (so that, for example, leftover adapter sequence won't prevent alignment). To do so, add the argument '''--local'''.

'''Other tools'''

Many other regular mapping tools are also available, although they generally require a tool-specific indexed version of the genome.

== Splice-aware mappers ==

These mappers permit the beginning and end of a read to map to (originate from) different places in the genome, which is common for spliced RNA.

'''[http://tophat.cbcb.umd.edu/ tophat version 1]'''

Running TopHat version 1 requires a change to the user's environment on tak (the change applies only to the current tak session).
First run this command:
{{{
export PATH="/usr/local/share/tophat1:$PATH"
}}}
and then check that your terminal will use the correct TopHat version:
{{{
tophat --version
}}}

Sample command:
{{{
bsub tophat -o s_7_tophat_out --phred64-quals --no-novel-juncs --segment-length 20 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt
}}}

The parameters included in the sample command are:
 * '''-o/--output-dir <string>''' All output files will be created in this directory (default = tophat_out)
 * '''--solexa-quals''' (if input quals are from GA Pipeline ver. < 1.3). See the table at the top of the FastQC output to identify the "encoding" scale. [[br]]
 * '''--phred64-quals''' or '''--solexa1.3-quals''' (if input quals are from GA Pipeline ver. >= 1.3 and before Illumina 1.8). See the table at the top of the FastQC output to identify the "encoding" scale. [[br]]