== Preprocessing read files from NCBI SRA ==

**SRA** (Sequence Read Archive) is an NCBI binary format for short reads.

It is described in detail in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]].

Processing SRA files requires the [[https://ncbi.github.io/sra-tools/|NCBI SRA Toolkit]], which is installed on our systems.

The main command is **fastq-dump <SRA archive file>**, for example

''**fastq-dump SRR060751.sra**''

If your reads are paired, by default the #1 and #2 reads end up concatenated in the same output file.
To check whether an SRA sample has paired reads, go to the [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=run_browser|SRA Run browser]], enter the sample ID, and look in the table under "Layout".

To get paired reads into separate files, use a command like

''**fastq-dump --split-files SRR060751.sra**''

You can ask for gzipped output instead of plain FASTQ:

''**fastq-dump --gzip SRR060751.sra**''

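The two options above can be combined. As a minimal sketch, the function below assembles a fastq-dump command line from the read layout. ''build_fastq_dump_cmd'' is a hypothetical helper (not part of the SRA Toolkit), and the layout values PAIRED and SINGLE are assumed to match what the Run browser shows under "Layout". It only prints the command, so nothing is converted until you run the printed line yourself.

```shell
# Sketch: build a fastq-dump command line from the read layout.
# build_fastq_dump_cmd is a hypothetical helper, not part of the SRA Toolkit.
build_fastq_dump_cmd() {
    sra_file="$1"    # e.g. SRR060751.sra
    layout="$2"      # PAIRED or SINGLE (assumed to match the "Layout" column)
    cmd="fastq-dump --gzip"
    if [ "$layout" = "PAIRED" ]; then
        # paired reads: write the #1 and #2 reads to separate files
        cmd="$cmd --split-files"
    fi
    echo "$cmd $sra_file"
}

# Print the command for a paired-end run
build_fastq_dump_cmd SRR060751.sra PAIRED
```

For a paired sample this prints ''fastq-dump --gzip --split-files SRR060751.sra''; for a single-end sample the ''--split-files'' option is left out.
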
See [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|Converting SRA format data into FASTQ]] for all program options.

Note that an uncompressed FASTQ file is about 4-5x larger than its corresponding SRA file.

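To judge whether a conversion will fit on disk, the 4-5x rule of thumb can be turned into a quick estimate. This is a rough sketch only: ''estimate_fastq_bytes'' is a hypothetical helper, and the real ratio depends on the data.

```shell
# Sketch: estimate the uncompressed FASTQ size range (in bytes) from the
# size of an SRA file, using the 4-5x rule of thumb.
# estimate_fastq_bytes is a hypothetical helper.
estimate_fastq_bytes() {
    sra_bytes="$1"
    # prints "<low estimate> <high estimate>"
    echo "$((sra_bytes * 4)) $((sra_bytes * 5))"
}

# Example: a 2 GiB SRA file (2147483648 bytes)
estimate_fastq_bytes 2147483648
```

For an existing file, the size in bytes can be passed in with ''estimate_fastq_bytes "$(wc -c < SRR060751.sra)"'', and the result compared against the free space reported by ''df'' for the output directory.
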
fastq-dump can fetch the SRA file itself, or you can download the SRA file directly (e.g. using wget) and then run fastq-dump on it to get the FASTQ file. Downloading the SRA file directly avoids having to change the cache location in your home directory for large files (see below).

**Note:** As of fastq-dump version 2.8.1, running fastq-dump requires vdb-config to be set up correctly. By default, downloaded/cached files are copied to the user's home directory, which is likely to run out of space. To change the location, run

{{{
vdb-config --restore-defaults
vdb-config -i    # use the interactive interface to enter a different location
}}}

Manually editing the file $HOME/.ncbi/user-settings.mkfg doesn't seem to work. See [[https://ncbi.github.io/sra-tools/install_config.html|NCBI SRA Installation/Config]]. Other alternatives: i) symlink the NCBI directory in your home directory to a location with more storage, or ii) download the SRA file directly (e.g. using wget) before running fastq-dump.

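Alternative i) might look like the sketch below. ''relocate_dir'' is a hypothetical helper, and both paths in the example are assumptions: check where the toolkit actually keeps its files in your home directory (''$HOME/ncbi'' is assumed here), and substitute a real destination with enough space.

```shell
# Sketch: move a directory to a larger volume and leave a symlink behind.
# relocate_dir is a hypothetical helper; the example paths are assumptions.
relocate_dir() {
    src="$1"     # existing directory, e.g. "$HOME/ncbi"
    dest="$2"    # new location on a volume with more space
    if [ -L "$src" ]; then
        echo "$src is already a symlink" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dest")" || return 1
    if [ -d "$src" ]; then
        mv "$src" "$dest" || return 1    # keep any files already cached
    else
        mkdir -p "$dest" || return 1     # nothing cached yet
    fi
    ln -s "$dest" "$src"
}

# Example (commented out; substitute real paths before running):
# relocate_dir "$HOME/ncbi" /path/to/large/storage/ncbi
```

The function refuses to run if the source is already a symlink or the destination already exists, so it will not overwrite an earlier relocation.
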
{{{
# Option 1: download SRR4090409.sra from SRA first (e.g. using wget), then convert it to FASTQ
fastq-dump SRR4090409.sra

# Option 2: let fastq-dump download the SRA file and convert it to FASTQ
# (important: the home directory or vdb-config must be set up correctly)
fastq-dump SRR4090409
}}}