| 66 | | == Preprocessing read files from NCBI SRA == |
| 67 | | |
| 68 | | **SRA** (for Sequence Read Archive) is an NCBI binary format for short reads. |
| 69 | |
| 70 | | It is thoroughly described in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]]. |
| 71 | | |
| 72 | | SRA files can be downloaded as compressed fastq in a web browser using [[https://ewels.github.io/sra-explorer/|SRA Explorer]]. |
| 73 | | |
| 74 | | Processing SRA files requires the [[https://ncbi.github.io/sra-tools/|NCBI SRA Toolkit]], which is installed on our systems. |
| 75 | | |
| 76 | | The main command is **fastq-dump <SRA archive file>**, for example: |
| 77 | | |
| 78 | | ''**fastq-dump SRR060751.sra**'' |
| 79 | | |
| 80 | | If your reads are paired, by default the #1 and #2 reads will end up concatenated together in the same file. |
| 81 | | To check if the SRA sample has paired reads or not, go to the [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=run_browser SRA Run browser], enter the sample ID, and look in the table under "Layout". |
| 82 | | |
| 83 | | To get matched paired reads into separate files, use a command like |
| 84 | | |
| 85 | | ''**fastq-dump --split-3 SRR060751.sra**'' |
| 86 | | |
| 87 | | This works the same as "--split-files", except that "--split-3" puts unpaired reads (if any) into a third file. |
| 88 | |
| 89 | | You can ask for gzip-compressed fastq output instead of plain fastq: |
| 90 | | |
| 91 | | ''**fastq-dump --gzip SRR060751.sra**'' |
| 92 | | |
| 93 | | See [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|Converting SRA format data into FASTQ]] for all program options. |
| 94 | | |
| 95 | | Note that a fastq file is about 4-5x larger than its corresponding SRA file. |
| 96 | | |
| 97 | | fastq-dump can download (fetch) the SRA file itself, or you can download the SRA file directly (e.g., using wget) and then run fastq-dump on it to get the fastq file. Downloading the SRA file directly avoids having to change the cache location in your home directory to make room for large files (see below). |
| 98 | |
| 99 | | '''Note:''' As of fastq-dump version 2.8.1, fastq-dump requires vdb-config to be set up correctly. By default, downloaded/cached files are stored in the user's home directory, which is likely to run out of space. To choose a different location, run: |
| 100 | | |
| 101 | | {{{ |
| 102 | | vdb-config --restore-defaults |
| 103 | | vdb-config -i   # use the interactive interface to enter a different location |
| 104 | | }}} |
| 105 | | |
| 106 | | Manually editing the file $HOME/.ncbi/user-settings.mkfg doesn't seem to work. See [[https://ncbi.github.io/sra-tools/install_config.html | NCBI SRA Installation/Config]]. Other alternatives: i) symlink the NCBI directory in your home directory to a location with more storage, or ii) download the SRA file directly (e.g., using wget) before running fastq-dump. |
| 107 | | |
| 108 | | {{{ |
| 109 | | # download SRR4090409.sra from SRA first (e.g., with wget), then convert it to fastq |
| 110 | | fastq-dump SRR4090409.sra |
| 111 | |
| 112 | | # let fastq-dump download the SRA file (important: the home directory or vdb-config must be set up correctly) and convert it to fastq |
| 113 | | fastq-dump SRR4090409 |
| 114 | | }}} |
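
Alternative (i) above, symlinking the NCBI directory to larger storage, could look roughly like the following sketch. The function name relocate_ncbi_dir and the example paths are hypothetical. It assumes the cache directory is ~/ncbi and that no SRA tool is running while the directory is moved.

```shell
# Move a directory to a larger volume and leave a symlink in its place.
# Hypothetical helper: $1 = current directory (e.g. ~/ncbi), $2 = new location.
relocate_ncbi_dir() {
    src=$1
    dest=$2
    if [ -L "$src" ]; then
        echo "$src is already a symlink; nothing to do" >&2
        return 0
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists; refusing to overwrite it" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dest")" || return 1
    if [ -d "$src" ]; then
        # Existing cache: move it, contents included, to the new location.
        mv "$src" "$dest" || return 1
    else
        # No cache yet: just create the target directory.
        mkdir -p "$dest" || return 1
    fi
    ln -s "$dest" "$src"
}

# Example call (paths are placeholders):
#   relocate_ncbi_dir "$HOME/ncbi" /path/to/big/storage/ncbi
```

The function refuses to overwrite an existing destination, so an earlier relocation is never clobbered by accident.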
| 115 | | |
| 116 | | In order to '''download a list of SRA files''' from NCBI, it is convenient to use prefetch. |
| 117 | | |
| 118 | | As described on the [[https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/|SRA website]], you can download a list of Run accessions from the search results page ([[https://www.ncbi.nlm.nih.gov/sra/?term=cancer|example search]]): select the Runs of interest by clicking on the checkboxes, click on "Send to", choose "File", and select "Accession List" in the drop-down menu. |
| 119 | |
| 120 | | Given a set of SRA Run accessions listed one per line in the text file "SraAccList.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set: |
| 121 | | |
| 122 | | {{{ |
| 123 | | prefetch --option-file SraAccList.txt |
| 124 | | }}} |
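
Once prefetch has fetched the runs, each one still has to be converted to fastq. The following is a minimal sketch, assuming one accession per line in the list file and assuming fastq-dump can locate the prefetched runs by accession. The helper name print_dump_commands is hypothetical, and "--split-3 --gzip" are simply the options described earlier on this page. The function only prints the commands, so they can be reviewed before anything is run.

```shell
# Print one fastq-dump command per accession listed in the file given as $1.
print_dump_commands() {
    # Strip Windows carriage returns, then skip blank lines (NF is 0 for them).
    tr -d '\r' < "$1" | awk 'NF { print "fastq-dump --split-3 --gzip " $1 }'
}

# Preview the commands:
#   print_dump_commands SraAccList.txt
# Run them once they look right:
#   print_dump_commands SraAccList.txt | sh
```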
| 125 | | |
| 126 | | This is current as of prefetch v. 2.9.3 (2.9.3-1). Note that downloaded files are stored by default in your home directory under ~/ncbi/public/sra, so you can quickly run out of space. One solution is to edit your ~/.ncbi/user-settings.mkfg file to include the following line: |
| 127 | | {{{ |
| 128 | | /repository/user/main/public/root = "/destination/for/big/storage/here" |
| 129 | | }}} |
| 130 | | |
| 131 | | Alternatively, point prefetch to the destination folder with the -O option. |
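
The configuration edit described above could be scripted roughly as follows. This is only a sketch: the function name set_sra_root is hypothetical, the storage path is a placeholder, and whether the tools honor a hand-edited settings file may depend on the toolkit version (see the earlier note about vdb-config).

```shell
# Set the public repository root in an NCBI settings file.
# Hypothetical helper: $1 = settings file, $2 = new storage location.
set_sra_root() {
    cfg=$1
    root=$2
    key='/repository/user/main/public/root'
    mkdir -p "$(dirname "$cfg")" || return 1
    touch "$cfg" || return 1
    # Keep every line that does not start with the key ...
    awk -v k="$key" 'index($0, k) != 1' "$cfg" > "$cfg.tmp" || return 1
    # ... then append the new setting, so the key appears exactly once.
    printf '%s = "%s"\n' "$key" "$root" >> "$cfg.tmp" || return 1
    mv "$cfg.tmp" "$cfg"
}

# Example call (destination is a placeholder):
#   set_sra_root "$HOME/.ncbi/user-settings.mkfg" /destination/for/big/storage/here
```

Rewriting the file instead of blindly appending keeps repeated calls from leaving several conflicting lines for the same key.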
| 132 | | |
| 133 | | |
| 134 | | '''Download Metadata:''' |
| 135 | | On your GEO series page, click the SRA link, click "Send to" at the top of the page, select the "File" radio button, and choose "RunInfo" in the pull-down menu. This generates a tabular SraRunInfo.csv file with the metadata available for each Run. |
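
The RunInfo table can also tell you which runs are paired, without looking each one up in the Run browser. The following sketch assumes the file has a header row with columns named "Run" and "LibraryLayout" and that no field contains an embedded comma (a plain comma split would misread quoted fields). The helper name runinfo_layout is hypothetical.

```shell
# Print "<Run accession><TAB><layout>" for every run in a RunInfo CSV file ($1).
runinfo_layout() {
    tr -d '\r' < "$1" | awk -F',' '
        NR == 1 {
            # Locate the two columns by header name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "Run") run = i
                if ($i == "LibraryLayout") layout = i
            }
            next
        }
        # Skip blank lines and any repeated header rows.
        run && layout && $run != "" && $run != "Run" { print $run "\t" $layout }
    '
}

# Example: list the runs and their layout (PAIRED or SINGLE)
#   runinfo_layout SraRunInfo.csv
# Example: build an accession list for prefetch
#   runinfo_layout SraRunInfo.csv | cut -f1 > SraAccList.txt
```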