Changes between Version 2 and Version 3 of SOPs/qc_shortReads

Timestamp: 09/06/13 14:56:15 (12 years ago)
Author: gbell
Comment: --

SOPs/qc_shortReads

-              v2
+              v3
 If you get an error like "Invalid quality score value", your fastq file probably has Sanger (ASCII offset 33) instead of Illumina (ASCII offset 64) quality scores.
 You'll need to add the option "-Q33" to your FASTX Toolkit arguments.
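To check which offset a file uses before running the toolkit, one option is to look at the lowest quality character in the file. The sketch below is only a heuristic and is not part of the FASTX Toolkit; the function name `guess_quality_offset` is hypothetical. It assumes that the caller has already extracted the quality lines (every fourth line of the fastq file), and it cannot tell the encodings apart when a Sanger file contains only high scores.

```python
def guess_quality_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from FASTQ quality strings.

    Heuristic only: offset-64 encodings never use characters below ';'
    (ASCII 59), so any lower character implies offset 33. If every
    character is 59 or higher, the file is assumed to be offset 64,
    which can be wrong for a Sanger file with uniformly high scores.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    return 33 if lowest < 59 else 64


if __name__ == "__main__":
    # '#' is ASCII 35, which is only possible with offset 33.
    print(guess_quality_offset(["IIIIIIII#", "IIIIHHGG#"]))
```

If the guess is 33, the "-Q33" option described above is likely needed.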
-== Remove linker (adapter) RNA: ==
-  * What is the sequence of the linker (adapter) to be removed?
-    * Biologists generally know which linker (adapter) RNA is used for their sample(s).
-    * Also or in addition, when you run quality control with shortRead or FASTQC, check out
-         * repetitive segments in the "over represented sequences" section.
-         * "Per base sequence content" for any patterns at the beginning of your reads
-    * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments
-  * sample command:
-{{{
-bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt"
-In the above command:
-   -a CTGTAGGCACCATCAAT is the linker sequence
-   -i  s2_sequence.txt is input solexa fastq file
-   -v is Verbose [report number of sequences in output and discarded]
-   -l 22 is to discard sequences shorter than 22 nucleotides
-   -o s2_ sequence_noLinker.txt is output file.
-}}}
-  * If you get the message "Invalid quality score value..." you have the older range of quality scores.
-    * Add the argument -Q 33, such as
-    * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt
-  * [[http://code.google.com/p/cutadapt/|cutadapt]] is another tool that is designed to find and remove adapters:
-    * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches)
-    * [wiki:SOPs/cutadapt sample usage]
-== Trim reads to a specified length ==
-   * If we have reads of different lengths (//i.e.// because we clipped out the adapter sequences), we can trim them to have them all be the same length. Use **fastx_trimmer** for that.
-   * sample command:
-{{{
-bsub "fastx_trimmer -f 1 -l 22  -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt"
-[-i INFILE]  = FASTA/Q input file. default is STDIN.
-[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
-[-l N] = Last base to keep
-[-f N] = First base to keep. Default is 1 (=first base).
-}}}
 == Trim end of reads when quality drops below a threshold ==
 …
 }}}
+\\
+= Modifying a file of short reads in other ways =
+\\
+== Remove linker (adapter) RNA: ==
+  * What is the sequence of the linker (adapter) to be removed?
+    * Biologists generally know which linker (adapter) RNA is used for their sample(s).
+    * In addition, when you run quality control with ShortRead or FastQC, check:
+         * repeated sequences in the "Overrepresented sequences" section.
+         * "Per base sequence content" for any patterns at the beginning of your reads.
+    * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments
+  * sample command:
+{{{
+bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt"
+In the above command:
+   -a CTGTAGGCACCATCAAT is the linker sequence
+   -i s2_sequence.txt is the input Solexa fastq file
+   -v is Verbose [report number of sequences in output and discarded]
+   -l 22 is to discard sequences shorter than 22 nucleotides
+   -o s2_sequence_noLinker.txt is the output file.
+}}}
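+  * If you get the message "Invalid quality score value...", your file most likely uses Sanger (ASCII offset 33) quality scores rather than the Illumina (ASCII offset 64) scores that the FASTX Toolkit expects by default.
+    * Add the argument -Q 33, for example:
+    * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt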
+  * [[http://code.google.com/p/cutadapt/|cutadapt]] is another tool that is designed to find and remove adapters:
+    * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches)
+    * [wiki:SOPs/cutadapt sample usage]
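To make the effect of the -a and -l arguments concrete, here is a minimal sketch of adapter clipping on a single read. It is an illustration only and does not reproduce how fastx_clipper or cutadapt work internally: it looks for an exact, complete copy of the adapter, so it misses adapters with mismatches or adapters that are cut off at the end of the read. The function name `clip_adapter` is hypothetical.

```python
def clip_adapter(seq, qual, adapter, min_length):
    """Cut a read at the first exact adapter match.

    Returns the clipped (sequence, quality) pair, or None when the
    remaining read is shorter than min_length (compare the -l option).
    Reads without the adapter are returned unchanged.
    """
    pos = seq.find(adapter)
    if pos != -1:
        # Drop the adapter and everything after it, in both strings.
        seq, qual = seq[:pos], qual[:pos]
    if len(seq) < min_length:
        return None
    return seq, qual


if __name__ == "__main__":
    adapter = "CTGTAGGCACCATCAAT"
    read = "ACGTACGTACGTACGTACGTAC" + adapter + "GG"
    quality = "I" * len(read)
    print(clip_adapter(read, quality, adapter, 22))
```

For real data, use fastx_clipper or cutadapt, which handle partial and mismatched adapters.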
+== Trim reads to a specified length ==
+   * If we have reads of different lengths (//e.g.// because we clipped out the adapter sequences), we can trim them all to the same length. Use **fastx_trimmer** for that.
+   * sample command:
+{{{
+bsub "fastx_trimmer -f 1 -l 22  -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt"
+[-i INFILE]  = FASTA/Q input file. default is STDIN.
+[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
+[-l N] = Last base to keep
+[-f N] = First base to keep. Default is 1 (=first base).
+}}}
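The -f and -l arguments are 1-based positions, and both the first and the last base are kept. The sketch below shows the equivalent slice on one read; it is an illustration with a hypothetical function name (`trim_read`), not the fastx_trimmer implementation. Note that a read shorter than the requested last base stays shorter.

```python
def trim_read(seq, qual, first=1, last=22):
    """Keep bases first..last (1-based, inclusive) of a read.

    The sequence and quality strings are sliced identically so they
    stay the same length.
    """
    start = first - 1  # convert the 1-based position to a 0-based index
    return seq[start:last], qual[start:last]


if __name__ == "__main__":
    # A 30-base read trimmed with the defaults (-f 1 -l 22) keeps 22 bases.
    trimmed_seq, trimmed_qual = trim_read("A" * 30, "I" * 30)
    print(len(trimmed_seq), len(trimmed_qual))
```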
 == Remove Duplicates ==
 …
 }}}
+== Select reads that are paired [for paired-end sequencing]  ==
+During quality control, if low-quality reads have been removed for any reason, some reads may no longer have a mate in the other file.  This can cause problems with mapping programs.
+Sample command:
+{{{
+/nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl sequence.1_1.filt.txt sequence.1_2.filt.txt
+}}}
+Output files will be
+  * *unique.out (reads that are only in the "1" or "2" set; 2 files) and
+  * *common.out (reads that are in both the "1" and "2" set; 2 files).
+The *common.out reads should be used for paired-read mapping.
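The split into common and unique reads comes down to comparing read names between the two files. The sketch below illustrates that idea with sets; it is not the cmpfastq.pl code. It assumes that mates share a name and differ only by a trailing "/1" or "/2", which may not match your header format, and the function names `read_key` and `split_paired` are hypothetical.

```python
def read_key(header):
    """Reduce a fastq header to a name shared by both mates.

    Assumption: mates differ only by a trailing '/1' or '/2'.
    """
    name = header[1:] if header.startswith("@") else header
    name = name.split()[0]  # ignore any description after whitespace
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def split_paired(headers_1, headers_2):
    """Return (common, only_in_1, only_in_2) as sets of read names."""
    keys_1 = {read_key(h) for h in headers_1}
    keys_2 = {read_key(h) for h in headers_2}
    common = keys_1 & keys_2
    return common, keys_1 - common, keys_2 - common


if __name__ == "__main__":
    common, only_1, only_2 = split_paired(
        ["@r1/1", "@r2/1", "@r3/1"],
        ["@r1/2", "@r3/2", "@r4/2"],
    )
    print(sorted(common), sorted(only_1), sorted(only_2))
```

Reads whose names fall in the common set correspond to the *common.out files; the other two sets correspond to the *unique.out files.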
 == Galaxy ==