== Preprocessing read files from NCBI SRA ==

**SRA** (Sequence Read Archive) is an NCBI binary format for short reads.

It is described in detail in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]].

Processing SRA files requires the [[https://tak.wi.mit.edu/trac/wiki/sra-toolkit|NCBI SRA Toolkit]], which is installed on our systems.

The main command is **fastq-dump <SRA archive file>**, for example:

''**fastq-dump SRR060751.sra**''

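When several archives need converting, the command above can be wrapped in a small loop. The following is only a sketch: the helper names ''fastq_name'' and ''convert_all'' are made up for illustration, and it assumes that fastq-dump names its output after the archive (''SRR060751.sra'' becomes ''SRR060751.fastq'') and writes it to the directory given with ''--outdir''.

```shell
#!/bin/sh
# Hypothetical helper: the FASTQ file name expected for a given .sra file.
fastq_name() {
    base=$(basename "$1" .sra)
    printf '%s.fastq\n' "$base"
}

# Hypothetical helper: convert every .sra file in a directory
# (default: current directory), skipping archives already converted.
convert_all() {
    dir=${1:-.}
    for sra in "$dir"/*.sra; do
        [ -e "$sra" ] || continue      # no .sra files: the glob stays literal
        out="$dir/$(fastq_name "$sra")"
        if [ -e "$out" ]; then
            echo "skipping $sra: $out already exists"
            continue
        fi
        fastq-dump --outdir "$dir" "$sra"
    done
}
```

Calling ''convert_all /path/to/sra_files'' would then convert each archive in turn; the skip check only looks at the file name, so a partially written FASTQ file from an interrupted run should be deleted first.
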
If your reads are paired, by default read #1 and read #2 of each pair are concatenated into a single sequence in one output file. To write them to separate files, use a command like

''**fastq-dump --split-files SRR060751.sra**''

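After splitting, it can be worth confirming that both files hold the same number of reads before passing them to an aligner. The sketch below assumes the split output is named ''<accession>_1.fastq'' and ''<accession>_2.fastq'' and that every record spans exactly four lines; ''count_reads'' and ''check_pairs'' are hypothetical helper names, not part of the SRA Toolkit.

```shell
#!/bin/sh
# Count reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Compare the read counts of <prefix>_1.fastq and <prefix>_2.fastq.
# Prints the pair count on success; returns 1 on a mismatch.
check_pairs() {
    acc=$1
    n1=$(count_reads "${acc}_1.fastq")
    n2=$(count_reads "${acc}_2.fastq")
    if [ "$n1" -eq "$n2" ]; then
        echo "$acc: $n1 read pairs"
    else
        echo "$acc: mismatch ($n1 vs $n2 reads)" >&2
        return 1
    fi
}

# Typical use, after: fastq-dump --split-files SRR060751.sra
# check_pairs SRR060751
```

A mismatch does not say which reads lack a mate; it only flags that the two files should be inspected before use.
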
See [[http://www.ncbi.nlm.nih.gov/books/NBK47540/#SRA_Download_Guid_B.5_Converting_SRA_for|Converting SRA format data into FASTQ]] for all program options.

Note that a FASTQ file is about 4-5 times larger than its corresponding SRA file, so make sure enough disk space is available before converting.
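
The size rule of thumb above can be turned into a quick check before converting. This is a rough sketch that uses the upper end of the range (5x) as a conservative estimate; ''estimate_fastq_bytes'' and ''has_room'' are hypothetical helper names, and the real ratio will vary with read length and quality encoding.

```shell
#!/bin/sh
# Estimate the FASTQ size in bytes for an SRA file,
# assuming the output is 5 times the size of the archive.
estimate_fastq_bytes() {
    sra_bytes=$(wc -c < "$1")
    echo $((sra_bytes * 5))
}

# Succeed if the target directory (default: current directory)
# has at least the estimated amount of free space.
has_room() {
    need_kb=$(( $(estimate_fastq_bytes "$1") / 1024 + 1 ))
    free_kb=$(df -Pk "${2:-.}" | awk 'NR==2 {print $4}')
    [ "$free_kb" -ge "$need_kb" ]
}

# Typical use:
# has_room SRR060751.sra . && fastq-dump SRR060751.sra
```
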