== Preprocessing read files from NCBI SRA == **SRA** (for Sequence Read Archive) is a NCBI binary format for short reads. It's thoroughly described in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]] Processing SRA files requires the [[https://tak.wi.mit.edu/trac/wiki/sra-toolkit|NCBI SRA Toolkit]], which is installed on our systems. The main command is **fastq-dump **, like ''**fastq-dump SRR060751.sra**'' If your reads are paired, by default the #1 and #2 reads will end up concatenated together in the same file. To get them into separate files, instead use a command like ''**fastq-dump --split-files SRR060751.sra**'' See [[http://www.ncbi.nlm.nih.gov/books/NBK47540/#SRA_Download_Guid_B.5_Converting_SRA_for|Converting SRA format data into FASTQ]] for all program options. Note that a fastq file is about 4-5x larger than its corresponding SRA file. == Mapping short reads == == Regular mappers == These mapping tools are useful for reads of DNA origin that should map to a continuous stretch of genomic DNA. Some of these tools can tolerate short indels but they're not designed for reads that span a splice junction One may choose between bowtie version 1 (faster but ignores indels) and bowtie version 2 (slower but performs gapped alignment (i.e., indels)). For a feature comparision, see [http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml How is Bowtie 2 different from Bowtie 1?] '''[http://bowtie-bio.sourceforge.net/index.shtml bowtie version 1]''' Sample command: {{{ bsub bowtie -k 1 -n 2 -l 70 --best --sam --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt s7_mm9.k1.n2.l36.best.sam }}} Parameters included in the sample command: * '''-l/--seedlen ''' seed length for -n (default: 28) -- Set to longest possible length of high-quality bases. Use the FastQC output to determine length of high-quality positions. * '''-n/--seedmms ''' max mismatches in seed (can be 0-3, default: -n 2) * '''-k ''' report up to good alignments per read (default: 1) -- If you want only uniquely mapped reads, however, also use '-m 1' to ignore multi-mapped reads * '''--best''' (in the case of multi-mapped reads, keep only the best hit(s)) * '''--sam''' to get SAM output format (which is the best format for downstream analysis) Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file). See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details. * '''--solexa-quals''' (for input quality scores from Illumina versions 1.2 and earlier) * '''--solexa1.3-quals''' or '''--phred64-quals''' (for input quality scores from Illumina versions 1.3-1.7) * '''--phred33-quals''' (default "Sanger format"; for input quality scores from Illumina versions 1.8 and later) To see other parameters log into tak and type '''bowtie''' '''[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml bowtie version 2]''' Bowtie 2 was designed as an improvement to bowtie 1, specifically, it supports gapped alignment. See the first [http://bowtie-bio.sourceforge.net/bowtie2/faq.shtml bowtie2 FAQ] for how they differ. Early versions of bowtie 2 had some issues, but these seem to have been fixed. Bowtie 2 uses a different set of genome index files (*.bt2) than bowtie 1 (*.ebwt). Bowtie 2 works with indels. Sample command: {{{ bsub bowtie2 --phred64 -L 22 -N 1 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 -U s_7.txt -S s7_mm9.L22.N1.sam }}} The parameters included in the sample command: * '''-L ''' length of seed substrings; must be >3 and <32 (default=22) * '''-N ''' max # mismatches in seed alignment; can be 0 or 1 (default=0) * '''-S''' name of SAM output file Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file). See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details. * '''--solexa-quals''' (for input quality scores from Illumina versions 1.2 and earlier) * '''--phred64''' (for input quality scores from Illumina versions 1.3-1.7) * '''--phred33''' (default "Sanger format"; for input quality scores from Illumina versions 1.8 and later) bowtie2 can also perform local alignments where the unaligned end(s) of a read are clipped (so, for example, remaining adapter won't prevent alignment) by adding the argument '''--local'''. The bowtie2 command can be modified to output mapped reads as BAM, such as {{{ bsub "bowtie2 -x /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 -U s_7.txt | samtools view -bS - > s7_mm9.bam" }}} '''[http://bio-bwa.sourceforge.net/ bwa - Burrows-Wheeler Alignment Tool ]''' Bwa is a software package containing several related algorithms using the Burrows-Wheeler Transform. It works well even with indels, but not with spliced (RNA) reads. ''Sample commands for short (upto 100 bp) reads:'' {{{ # Align single-end reads bsub "bwa aln /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Reads.fq > Mapped_reads.sai" bsub "bwa samse /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Mapped_reads.sai Reads.fq > Reads.bwa.sam" # Align paired-end reads bsub "bwa aln /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Reads.1.fq > Mapped_reads.1.sai" bsub "bwa aln /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Reads.2.fq > Mapped_reads.2.sai" bsub "bwa sampe /nfs/genomes/mouse_gp_jul_07_no_random/bwa/mm9.fa Mapped_reads.1.sai Mapped_reads.2.sai Reads.1.fq Reads.2.fq > Reads.bwa.sam" }}} Use "-n" option to change the maximum number of mismatches in the alignment. ''Sample commands for long (70bp or longer) reads:'' {{{ # Align single-end reads bsub "bwa mem /nfs/genomes/sgd_2010/bwa/sacCer3.fa Reads.fq > Mapped_reads.bwa.sam" # Align paired-end reads bsub "bwa mem /nfs/genomes/sgd_2010/bwa/sacCer3.fa Reads.1.fq Reads.1.fq > Mapped_reads.bwa.sam" }}} For aligning long reads using the bwa mem option, there's no maximum number of mismatches, this is analogous to a local alignment using blat/blast. [[BR]] '''Other tools''' Many other regular mapping tools are also available, although they generally require a tool-specific indexed version of the genome. == Splice-aware mappers == These mappers permit the beginning and end of a read to map to (originate from) different places in the genome, which is common for spliced RNA. '''[http://tophat.cbcb.umd.edu/ tophat version 1]''' Running TopHat version 1 requires a change to a user's environment on tak and only applies to the specific tak session. First run this command: {{{ export PATH="/usr/local/share/tophat1:$PATH" }}} and then check that your terminal will use the correct TopHat version: {{{ tophat --version }}} Sample command: {{{ bsub tophat -o s_7_tophat_out --phred64-quals --segment-length 20 -I 200000 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf --no-novel-juncs /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt }}} The parameters included in the sample command are: * '''-o/--output-dir ''' All output files will be created in this directory (default = tophat_out) * '''--segment-length ''' Shortest length of a spliced read that can map to one side of the junction. For reads shorter than ~45 nt, set this to half the read length (so set '--segment-length 20' for 40-nt reads). For longer reads, the default length (25) can be used. * '''-I ''' Maximum intron length. If your genome has introns that are all shorter (or many that are longer) than the default value (500000), set this to a more appropriate value. * '''-G ''' Supply bowtie with a GTF file of transcript models. This can help bowtie identify functions that may otherwise be missed. * '''--no-novel-juncs ''' Only look for spliced reads across junctions in the supplied GTF file. Not used if looking for novel isoforms. Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file). See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details. * '''--solexa-quals''' (for input quality scores from Illumina versions 1.2 and earlier) * '''--solexa1.3-quals''' or '''--phred64-quals''' (for input quality scores from Illumina versions 1.3-1.7) '''[http://tophat.cbcb.umd.edu/ tophat version 2]''' TopHat version 2 uses bowtie2, rather than bowtie, for its mapping. As a result, TopHat 2 uses a different set of genome index files (*.bt2) than TopHat 1 (*.ebwt). Sample command: {{{ bsub tophat -o s_7_tophat_out --phred64-quals --segment-length 20 -I 200000 -G /nfs/genomes/mouse_gp_jul_07_no_random/gtf/Mus_musculus.NCBIM37.67_noNT.gtf --no-novel-juncs /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_7.txt }}} The parameters included in the sample command are: * '''-o/--output-dir ''' All output files will be created in this directory (default = tophat_out) * '''--segment-length ''' Shortest length of a spliced read that can map to one side of the junction. For reads shorter than ~45 nt, set this to half the read length (so set '--segment-length 20' for 40-nt reads). For longer reads, the default length (25) can be used. * '''-I ''' Maximum intron length. If your genome has introns that are all shorter (or many that are longer) than the default value (500000), set this to a more appropriate value. * '''-G ''' Supply bowtie with a GTF file of transcript models. This can help bowtie identify functions that may otherwise be missed. * '''--no-novel-juncs ''' Only look for spliced reads across junctions in the supplied GTF file. Not used if looking for novel isoforms. Choices for fastq encoding (which is listed as "Encoding" in the top "Basic Statistics" table of the FastQC output file). See the [http://en.wikipedia.org/wiki/FASTQ_format FASTQ format page] for more details. * '''--solexa-quals''' (for input quality scores from Illumina versions 1.2 and earlier) * '''--solexa1.3-quals''' or '''--phred64-quals''' (for input quality scores from Illumina versions 1.3-1.7) * For "Sanger / Illumina 1.8" or "Sanger / Illumina 1.9", bowtie can use the default "phred33" encoding '''[https://github.com/alexdobin/STAR STAR]''' STAR is an ultrafast universal RNA-seq aligner. It maps >60 times faster than Tophat2. To use STAR, a genome directory specific for the STAR mapper needs to be generated first. STAR tends to align more reads to pseudogenes compared to Tophat2. However, the pseudogene problem can be significantly minimized by providing an annotation file containing known splice junctions. If no annotation is available for a genome of interest, a 2-pass mapping procedure is recommended. The first pass generates a splice junctions file, which is then used as the annotation file to run the second pass mapping. Sample command: To generate genome index files for STAR: {{{ bsub STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --sjdbGTFfile /path/to/GTF/FileName.gtf --runThreadN 8 }}} To map: {{{ bsub STAR --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --sjdbScore 2 --outFileNamePrefix whateverPrefix --runThreadN 8 }}} The parameters included in the sample command are: * '''--runMode ''' "alignReads" does the actual mapping. "genomeGenerate" generates the genomeDir required for mapping (default = alignReads). * '''--genomeDir ''' Specifies the path to the directory used for storing the genome information created in the genomeGenerate step. * '''--genomeFastaFiles ''' Specifies genome FASTA files to be used. * '''--sjdbGTFfile ''' Supplies STAR with a GTF file during the genomeGenerate step. Combined with the --sjdbScore option during mapping, this will bias the alignment toward annotated junctions, and reduces alignment to pseudogenes. * '''--readFilesIn ''' Specifies the fastq files containing the reads, can be single-end or paired-end. * '''--sjdbScore ''' Provides extra alignment score for alignments that cross database junctions (default = 2). If this score is positive, it will bias the alignment toward annotated junctions. This is only used if during the genomeGenerate step a splice junction annotation file is used. * '''--runThreadN ''' Specifies the number of threads to use.