== Preprocessing read files from NCBI SRA ==

**SRA** (Sequence Read Archive) is an NCBI binary format for short reads.

It is thoroughly described in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]].

Processing SRA files requires the [[https://tak.wi.mit.edu/trac/wiki/sra-toolkit|NCBI SRA Toolkit]], which is installed on our systems.

The main command is **fastq-dump <SRA archive file>**, for example:

''**fastq-dump SRR060751.sra**''

If your reads are paired, by default the #1 and #2 reads of each pair are concatenated together in the same file. To write them to separate files, use a command like this instead:

''**fastq-dump --split-files SRR060751.sra**''
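
As a minimal sketch, the paired-end conversion can be wrapped in a small shell function that validates its input before calling fastq-dump. The function name `dump_paired` and the `DRY_RUN` variable are hypothetical helpers introduced here for illustration and are not part of the SRA Toolkit; only `fastq-dump --split-files` comes from the text.

```shell
# Hypothetical wrapper (not part of the SRA Toolkit): convert one paired-end
# SRA archive into separate FASTQ files for the #1 and #2 reads.
dump_paired() {
    sra_file=$1
    if [ -z "$sra_file" ]; then
        echo "usage: dump_paired <SRA archive file>" >&2
        return 2
    fi
    # DRY_RUN=1 (an assumption of this sketch) only prints the command.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "fastq-dump --split-files $sra_file"
        return 0
    fi
    if [ ! -f "$sra_file" ]; then
        echo "error: $sra_file not found" >&2
        return 1
    fi
    fastq-dump --split-files "$sra_file"
}
```

Running `DRY_RUN=1 dump_paired SRR060751.sra` prints the command without executing it, which may be useful for checking a batch of archives before starting a long conversion.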

See [[http://www.ncbi.nlm.nih.gov/books/NBK47540/#SRA_Download_Guid_B.5_Converting_SRA_for|Converting SRA format data into FASTQ]] for all program options.

Note that a FASTQ file is about 4-5x larger than its corresponding SRA file.
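
Given that expansion, it may be worth checking free disk space before converting. The sketch below assumes the upper end (5x) of the range quoted above; `estimate_fastq_bytes` and `check_fastq_space` are hypothetical helper names, not SRA Toolkit commands, and the estimate is only a rough guide.

```shell
# Hypothetical helper: estimate the FASTQ size in bytes for an SRA file,
# assuming a 5x expansion (the upper end of the 4-5x rule of thumb).
estimate_fastq_bytes() {
    if [ ! -f "$1" ]; then
        echo "error: $1 not found" >&2
        return 1
    fi
    sra_bytes=$(wc -c < "$1")
    echo $(( $sra_bytes * 5 ))
}

# Hypothetical helper: succeed only if the target directory (default: the
# current directory) has at least the estimated space available.
check_fastq_space() {
    est_bytes=$(estimate_fastq_bytes "$1") || return 1
    need_kb=$(( est_bytes / 1024 + 1 ))
    # df -P prints a portable layout; column 4 is the available space in KB.
    free_kb=$(df -Pk "${2:-.}" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge "$need_kb" ]
}
```

For example, `check_fastq_space SRR060751.sra . && fastq-dump SRR060751.sra` would run the conversion only when the estimate fits in the current directory.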