120 | | |
121 | | == Remove linker (adapter) RNA: == |
122 | | * What is the sequence of the linker (adapter) to be removed? |
123 | | * Biologists generally know which linker (adapter) RNA is used for their sample(s). |
124 | | * Also or in addition, when you run quality control with shortRead or FASTQC, check out |
125 | | * repetitive segments in the "over represented sequences" section. |
126 | | * "Per base sequence content" for any patterns at the beginning of your reads |
127 | | * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments |
128 | | * sample command: |
129 | | |
130 | | {{{ |
131 | | bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt" |
132 | | In the above command: |
133 | | -a CTGTAGGCACCATCAAT is the linker sequence |
134 | | -i s2_sequence.txt is input solexa fastq file |
135 | | -v is Verbose [report number of sequences in output and discarded] |
136 | | -l 22 is to discard sequences shorter than 22 nucleotides |
137 | | -o s2_ sequence_noLinker.txt is output file. |
138 | | }}} |
139 | | |
140 | | |
141 | | * If you get the message "Invalid quality score value..." you have the older range of quality scores. |
142 | | * Add the argument -Q 33, such as |
143 | | * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt |
144 | | |
145 | | * [[http://code.google.com/p/cutadapt/|cutadapt]] is another tool that is designed to find and remove adapters: |
146 | | * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches) |
147 | | * [wiki:SOPs/cutadapt sample usage] |
148 | | |
149 | | == Trim reads to a specified length == |
150 | | * If we have reads of different lengths (//i.e.// because we clipped out the adapter sequences), we can trim them to have them all be the same length. Use **fastx_trimmer** for that. |
151 | | * sample command: |
152 | | |
153 | | |
154 | | {{{ |
155 | | bsub "fastx_trimmer -f 1 -l 22 -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt" |
156 | | |
157 | | [-i INFILE] = FASTA/Q input file. default is STDIN. |
158 | | [-o OUTFILE] = FASTA/Q output file. default is STDOUT. |
159 | | [-l N] = Last base to keep |
160 | | [-f N] = First base to keep. Default is 1 (=first base). |
161 | | |
162 | | }}} |
| 141 | \\ |
| 142 | |
| 143 | = Modifying a file of short reads in other ways = |
| 144 | |
| 145 | \\ |
| 146 | |
| 147 | == Remove linker (adapter) RNA: == |
| 148 | * What is the sequence of the linker (adapter) to be removed? |
| 149 | * Biologists generally know which linker (adapter) RNA is used for their sample(s). |
| 150 | * Also or in addition, when you run quality control with shortRead or FASTQC, check out |
| 151 | * repetitive segments in the "over represented sequences" section. |
| 152 | * "Per base sequence content" for any patterns at the beginning of your reads |
| 153 | * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments |
| 154 | * sample command: |
| 155 | |
| 156 | {{{ |
| 157 | bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt" |
| 158 | In the above command: |
| 159 | -a CTGTAGGCACCATCAAT is the linker sequence |
| 160 | -i s2_sequence.txt is input solexa fastq file |
| 161 | -v is Verbose [report number of sequences in output and discarded] |
| 162 | -l 22 is to discard sequences shorter than 22 nucleotides |
| 163 | -o s2_ sequence_noLinker.txt is output file. |
| 164 | }}} |
| 165 | |
| 166 | |
| 167 | * If you get the message "Invalid quality score value..." you have the older range of quality scores. |
| 168 | * Add the argument -Q 33, such as |
| 169 | * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt |
| 170 | |
| 171 | * [[http://code.google.com/p/cutadapt/|cutadapt]] is another tool that is designed to find and remove adapters: |
| 172 | * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches) |
| 173 | * [wiki:SOPs/cutadapt sample usage] |
| 174 | |
| 175 | == Trim reads to a specified length == |
| 176 | * If we have reads of different lengths (//i.e.// because we clipped out the adapter sequences), we can trim them to have them all be the same length. Use **fastx_trimmer** for that. |
| 177 | * sample command: |
| 178 | |
| 179 | |
| 180 | {{{ |
| 181 | bsub "fastx_trimmer -f 1 -l 22 -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt" |
| 182 | |
| 183 | [-i INFILE] = FASTA/Q input file. default is STDIN. |
| 184 | [-o OUTFILE] = FASTA/Q output file. default is STDOUT. |
| 185 | [-l N] = Last base to keep |
| 186 | [-f N] = First base to keep. Default is 1 (=first base). |
| 187 | |
| 188 | }}} |