| 120 | | |
| 121 | | == Remove linker (adapter) RNA: == |
| 122 | | * What is the sequence of the linker (adapter) to be removed? |
| 123 | | * Biologists generally know which linker (adapter) RNA is used for their sample(s). |
| 124 | | * In addition, when you run quality control with ShortRead or FastQC, check: |
| 125 | | * repetitive segments in the "Overrepresented sequences" section. |
| 126 | | * "Per base sequence content" for any patterns at the beginning of your reads |
| 127 | | * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments |
| 128 | | * sample command: |
| 129 | | |
| 130 | | {{{ |
| 131 | | bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt" |
| 132 | | In the above command: |
| 133 | | -a CTGTAGGCACCATCAAT is the linker sequence |
| 134 | | -i s2_sequence.txt is the input Solexa FASTQ file |
| 135 | | -v is Verbose [report number of sequences in output and discarded] |
| 136 | | -l 22 is to discard sequences shorter than 22 nucleotides |
| 137 | | -o s2_sequence_noLinker.txt is the output file. |
| 138 | | }}} |
| 139 | | |
| 140 | | |
| 141 | | * If you get the message "Invalid quality score value...", your quality scores probably use the Phred+33 encoding (Sanger / Illumina 1.8+) rather than the Phred+64 encoding that fastx_clipper assumes by default. |
| 142 | | * Add the argument -Q 33, for example: |
| 143 | | * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt |
| 144 | | |
| 145 | | * [[http://code.google.com/p/cutadapt/|cutadapt]] is another tool that is designed to find and remove adapters: |
| 146 | | * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches) |
| 147 | | * [wiki:SOPs/cutadapt sample usage] |
| 148 | | |
| 149 | | == Trim reads to a specified length == |
| 150 | | * If we have reads of different lengths (//e.g.//, because we clipped out the adapter sequences), we can trim them all to the same length. Use **fastx_trimmer** for that. |
| 151 | | * sample command: |
| 152 | | |
| 153 | | |
| 154 | | {{{ |
| 155 | | bsub "fastx_trimmer -f 1 -l 22 -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt" |
| 156 | | |
| 157 | | [-i INFILE] = FASTA/Q input file. default is STDIN. |
| 158 | | [-o OUTFILE] = FASTA/Q output file. default is STDOUT. |
| 159 | | [-l N] = Last base to keep |
| 160 | | [-f N] = First base to keep. Default is 1 (=first base). |
| 161 | | |
| 162 | | }}} |
| | 141 | \\ |
| | 142 | |
| | 143 | = Modifying a file of short reads in other ways = |
| | 144 | |
| | 145 | \\ |
| | 146 | |
| | 147 | == Remove linker (adapter) RNA: == |
| | 148 | * What is the sequence of the linker (adapter) to be removed? |
| | 149 | * Biologists generally know which linker (adapter) RNA is used for their sample(s). |
| | 150 | * In addition, when you run quality control with ShortRead or FastQC, check: |
| | 151 | * repetitive segments in the "Overrepresented sequences" section. |
| | 152 | * "Per base sequence content" for any patterns at the beginning of your reads |
| | 153 | * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments |
| | 154 | * sample command: |
| | 155 | |
| | 156 | {{{ |
| | 157 | bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt" |
| | 158 | In the above command: |
| | 159 | -a CTGTAGGCACCATCAAT is the linker sequence |
| | 160 | -i s2_sequence.txt is the input Solexa FASTQ file |
| | 161 | -v is Verbose [report number of sequences in output and discarded] |
| | 162 | -l 22 is to discard sequences shorter than 22 nucleotides |
| | 163 | -o s2_sequence_noLinker.txt is the output file. |
| | 164 | }}} |
| | 165 | |
| | 166 | |
| | 167 | * If you get the message "Invalid quality score value...", your quality scores probably use the Phred+33 encoding (Sanger / Illumina 1.8+) rather than the Phred+64 encoding that fastx_clipper assumes by default. |
| | 168 | * Add the argument -Q 33, for example: |
| | 169 | * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt |
| | 170 | |
| | 171 | * [[http://code.google.com/p/cutadapt/|cutadapt]] is another tool that is designed to find and remove adapters: |
| | 172 | * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches) |
| | 173 | * [wiki:SOPs/cutadapt sample usage] |
| | 174 | |
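To decide up front whether -Q 33 is likely to be needed, one can inspect the quality characters in the file. The following is a minimal sketch, not part of the FASTX-Toolkit: the function names are hypothetical, and it assumes plain 4-line FASTQ records. It relies on the fact that characters below ';' (ASCII 59) cannot occur in Phred+64 or Solexa+64 quality strings.

```python
def guess_quality_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) used by FASTQ quality strings.

    Heuristic only: a Phred+33 file whose sampled reads all have very
    high qualities would be misreported as 64.
    """
    lowest = min(
        (ord(ch) for line in quality_lines for ch in line.strip()),
        default=None,
    )
    if lowest is None:
        raise ValueError("no quality characters found")
    # ';' (ASCII 59) is the lowest character possible with a +64 offset
    # (Solexa scores start at -5), so anything lower implies Phred+33.
    return 33 if lowest < 59 else 64


def quality_lines_from_fastq(path, max_reads=10000):
    """Yield the quality line (4th line of each record) of the first reads.

    Assumes each record occupies exactly four lines.
    """
    with open(path) as handle:
        for index, line in enumerate(handle):
            if index >= 4 * max_reads:
                break
            if index % 4 == 3:
                yield line.rstrip("\n")
```

For example, `guess_quality_offset(quality_lines_from_fastq("s2_sequence.txt"))` returning 33 suggests adding -Q 33 to the fastx_clipper command.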
| | 175 | == Trim reads to a specified length == |
| | 176 | * If we have reads of different lengths (//e.g.//, because we clipped out the adapter sequences), we can trim them all to the same length. Use **fastx_trimmer** for that. |
| | 177 | * sample command: |
| | 178 | |
| | 179 | |
| | 180 | {{{ |
| | 181 | bsub "fastx_trimmer -f 1 -l 22 -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt" |
| | 182 | |
| | 183 | [-i INFILE] = FASTA/Q input file. default is STDIN. |
| | 184 | [-o OUTFILE] = FASTA/Q output file. default is STDOUT. |
| | 185 | [-l N] = Last base to keep |
| | 186 | [-f N] = First base to keep. Default is 1 (=first base). |
| | 187 | |
| | 188 | }}} |
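The effect of -f and -l as described above (first and last base to keep, counted from 1) can be illustrated with a short sketch. This is an illustration under stated assumptions, not the fastx_trimmer implementation: the function names are hypothetical and the input is assumed to be plain 4-line FASTQ records.

```python
def trim_read(sequence, quality, first=1, last=None):
    """Keep bases first..last (1-based, inclusive) of a read and its qualities."""
    start = first - 1  # convert the 1-based position to a 0-based slice index
    return sequence[start:last], quality[start:last]


def trim_fastq_records(lines, first=1, last=None):
    """Trim every record in a list of FASTQ lines (four lines per record)."""
    lines = [line.rstrip("\n") for line in lines]
    trimmed = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        sequence, quality = trim_read(sequence, quality, first, last)
        trimmed.extend([header, sequence, separator, quality])
    return trimmed
```

With `first=1, last=22` this mirrors the sample command. In this sketch, reads already shorter than 22 bases are left as they are, so all reads end up the same length only if shorter reads were discarded earlier (as with -l 22 in the clipping step).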