Changes between Version 2 and Version 3 of SOPs/qc_shortReads


Ignore:
Timestamp:
09/06/13 14:56:15 (11 years ago)
Author:
gbell
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/qc_shortReads

    v2 v3  
    118118If you get an error like "Invalid quality score value", your fastq file probably has Sanger (offset 33) instead of Illumina (ASCII offset 64) quality scores. 
    119119You'll need to add the option "-Q33" to your FASTX Toolkit arguments.
    120 
    121 == Remove linker (adapter) RNA: ==
    122   * What is the sequence of the linker (adapter) to be removed?
    123     * Biologists generally know which linker (adapter) RNA is used for their sample(s).
    124     * Also or in addition, when you run quality control with shortRead or FASTQC, check out
    125          * repetitive segments in the "over represented sequences" section.
    126          * "Per base sequence content" for any patterns at the beginning of your reads
    127     * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments
    128   * sample command:
    129 
    130 {{{
    131 bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt"
    132 In the above command:
    133    -a CTGTAGGCACCATCAAT is the linker sequence
    134    -i  s2_sequence.txt is input solexa fastq file
    135    -v is Verbose [report number of sequences in output and discarded]
    136    -l 22 is to discard sequences shorter than 22 nucleotides
    137    -o s2_ sequence_noLinker.txt is output file.
    138 }}}
    139 
    140 
    141   * If you get the message "Invalid quality score value..." you have the older range of quality scores.
    142     * Add the argument -Q 33, such as
    143     * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt
    144 
    145   * [[http://code.google.com/p/cutadapt/|cutadapt]] is another tool that is designed to find and remove adapters:
    146     * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches)
    147     * [wiki:SOPs/cutadapt sample usage]
    148 
    149 == Trim reads to a specified length ==
    150    * If we have reads of different lengths (//i.e.// because we clipped out the adapter sequences), we can trim them to have them all be the same length. Use **fastx_trimmer** for that.
    151    * sample command:
    152 
    153  
    154 {{{
    155 bsub "fastx_trimmer -f 1 -l 22  -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt"
    156      
    157 [-i INFILE]  = FASTA/Q input file. default is STDIN.
    158 [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
    159 [-l N] = Last base to keep
    160 [-f N] = First base to keep. Default is 1 (=first base).
    161 
    162 }}}
    163120
    164121== Trim end of reads when quality drops below a threshold ==
     
    182139}}}
    183140
     141\\
     142
     143= Modifying a file of short reads in other ways =
     144
     145\\
     146
     147== Remove linker (adapter) RNA: ==
     148  * What is the sequence of the linker (adapter) to be removed?
     149    * Biologists generally know which linker (adapter) RNA is used for their sample(s).
     150    * Also or in addition, when you run quality control with shortRead or FASTQC, check out
     151         * repetitive segments in the "over represented sequences" section.
     152         * "Per base sequence content" for any patterns at the beginning of your reads
     153    * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments
     154  * sample command:
     155
     156{{{
     157bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt"
     158In the above command:
     159   -a CTGTAGGCACCATCAAT is the linker sequence
     160   -i  s2_sequence.txt is input solexa fastq file
     161   -v is Verbose [report number of sequences in output and discarded]
     162   -l 22 is to discard sequences shorter than 22 nucleotides
     163   -o s2_ sequence_noLinker.txt is output file.
     164}}}
     165
     166
     167  * If you get the message "Invalid quality score value..." you have the older range of quality scores.
     168    * Add the argument -Q 33, such as
     169    * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt
     170
     171  * [[http://code.google.com/p/cutadapt/|cutadapt]] is another tool that is designed to find and remove adapters:
     172    * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches)
     173    * [wiki:SOPs/cutadapt sample usage]
     174
     175== Trim reads to a specified length ==
     176   * If we have reads of different lengths (//i.e.// because we clipped out the adapter sequences), we can trim them to have them all be the same length. Use **fastx_trimmer** for that.
     177   * sample command:
     178
     179 
     180{{{
     181bsub "fastx_trimmer -f 1 -l 22  -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt"
     182     
     183[-i INFILE]  = FASTA/Q input file. default is STDIN.
     184[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
     185[-l N] = Last base to keep
     186[-f N] = First base to keep. Default is 1 (=first base).
     187
     188}}}
    184189
    185190== Remove Duplicates ==
     
    192197}}}
    193198
     199== Select reads that are paired [for paired-end sequencing]  ==
     200
     201During quality control, if low-quality reads have been removed for any reason, some reads may not have a paired end at the other end.  This can cause problems with mapping programs.
     202
     203Sample command:
     204
     205{{{
     206/nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl sequence.1_1.filt.txt sequence.1_2.filt.txt
     207}}}
     208
     209Output files will be
     210  * *unique.out (reads that are only in the "1" or "2" set; 2 files) and
     211  * *common.out (reads that are in both the "1" and "2" set; 2 files).
     212
     213The *common.out reads should be used for paired-read mapping.
     214
    194215== Galaxy ==
    195216