== Downloading and processing NCBI SRA files ==

'''Contents'''
 * [#one Downloading and processing one NCBI SRA sample at a time]
 * [#multiple Downloading and processing multiple NCBI SRA samples]

**SRA** (for Sequence Read Archive) is an NCBI binary format for short reads. Working with SRA data is thoroughly described in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]]. SRA files can be downloaded as compressed fastq in a web browser using [[https://ewels.github.io/sra-explorer/|SRA Explorer]]. Processing SRA files requires the [[https://ncbi.github.io/sra-tools/|NCBI SRA Toolkit]], which is installed on our systems.

=== [=#one Downloading and processing one NCBI SRA sample at a time] ===

NCBI short-read files can be
 * downloaded and converted to fastq.gz format at the same time (recommended), or
 * downloaded in SRA format for subsequent conversion

To download one SRR ID at a time and get fastq.gz format, use the command fastq-dump, like
{{{
fastq-dump --split-3 --gzip SRR123456
}}}

With the option "--split-3",
 * single-end reads will end up in a single file, named SRR123456.fastq.gz
 * paired-end reads will produce two files (named SRR123456_1.fastq.gz and SRR123456_2.fastq.gz)
 * unpaired reads (if any) will be placed into a third file.

See the [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|fastq-dump documentation]] for all program options.

We recommend always gzipping fastq files because
 * fastq.gz files are much smaller than fastq files
 * our typically-used analysis programs all permit fastq.gz input

'''Note:''' Running fastq-dump places downloaded or cache files into the user's home directory, which is likely to run out of space.
To prevent this, you have at least 3 options:

Option 1: symlink the NCBI directory in your home directory to somewhere else with larger storage, such as with a command like
{{{
ln -s /path/to/large/storage ~/ncbi
}}}

Option 2: edit your ~/.ncbi/user-settings.mkfg file to include the following line:
{{{
/repository/user/default-path = "/path/to/large/storage"
}}}

Option 3: modify your environment with vdb-config
{{{
vdb-config --restore-defaults  # To restore the default settings
vdb-config -i  # To use the GUI to enter a different location with "Set Default Import Path".
}}}

In the event of problems downloading and converting to fastq.gz all at once, SRA files can be downloaded by navigating through the SRA web site to the sample's "Data access" tab, which provides direct links to the file(s). Using this direct path to the SRA file, run
{{{
# Use 'wget' to download reads in SRA format
wget -O SRR123456.sra https://sra-downloadb/path_to_file/SRR123456/SRR123456.1
# Convert SRA format to fastq.gz
fastq-dump --split-3 --gzip ./SRR123456.sra
}}}

If fastq-dump gives you an error like "Failed to call external services." you may need to use the NCBI link to a sralite file to download the sequences in that format first and then convert to fastq. This can be done with commands like
{{{
# Download the sralite file
wget https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos5/sra-pub-zq-11/SRR002/626/SRR1234567/SRR1234567.sralite.1
# Convert sralite to fastq.gz
fastq-dump --split-files --gzip -A SRR1234567 SRR1234567.sralite.1
}}}

=== [=#multiple Downloading and processing multiple NCBI SRA samples] ===

To download a list of SRR files (such as for all of the samples of a data series) from NCBI, use NCBI's 'prefetch'. Given a set of SRA files (by SRR ID) listed in a single column in the text file "SRR_Acc_List.txt" (e.g.
SRR7623010, SRR7623011, etc.), the following command will download the entire set:
{{{
prefetch -O output_directory --option-file SRR_Acc_List.txt
}}}

If you don't specify an output directory, the SRR files will be downloaded to ~/ncbi/ncbi_public/sra (or your configured "Import Path" as described above).

To get this list of SRR IDs, go to the [[https://www.ncbi.nlm.nih.gov/Traces/study/|SRA Run Selector]] and enter a project accession. Once on a [[https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP000002|project page]], go to the "Select" section and click on "Accession List" to get 'SRR_Acc_List.txt' (or if you want a subset of these, click on "Metadata" in the "Select" section to get a comma-separated file, 'SraRunTable.txt', and create your own 'SRR_Acc_List.txt'). Or you can go to the [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_seq_name|Downloading SRA data]] page, enter a list of SRX (experiment accession) IDs, and get a list of SRR (sample) accessions for 'SRR_Acc_List.txt'.

The 'prefetch' command will provide you with a set of SRA files which then need to be converted to fastq.gz. One way to do this on the set of SRA files is
{{{
find . -name \*.sra -exec sbatch --partition=20 --ntasks=1 --wrap "fastq-dump --split-3 --gzip {}" \;
}}}

Once you create the fastq.gz files, the *.sra files can be deleted.
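
Before deleting anything, it may be worth confirming that each .sra file really has its fastq.gz output. The following is a minimal sketch and not part of the SRA Toolkit: the function name list_converted_sra and its two directory arguments are hypothetical, and it assumes the "--split-3" output names described above (SRR123456.fastq.gz, or SRR123456_1.fastq.gz plus SRR123456_2.fastq.gz) with all fastq.gz files written to one directory. It only prints the .sra files that look safe to delete; it removes nothing.

```shell
# Print the .sra files under SRA_DIR whose fastq.gz output exists in FASTQ_DIR
# and passes a gzip integrity check.
# Usage: list_converted_sra SRA_DIR FASTQ_DIR   (hypothetical helper)
list_converted_sra() {
    sra_dir=$1
    fastq_dir=$2
    find "$sra_dir" -name '*.sra' | sort | while IFS= read -r sra; do
        # Accession = file name without directory and without ".sra"
        acc=$(basename "$sra" .sra)
        single="$fastq_dir/$acc.fastq.gz"
        pair1="$fastq_dir/${acc}_1.fastq.gz"
        pair2="$fastq_dir/${acc}_2.fastq.gz"
        if [ -s "$single" ] && gzip -t "$single" 2>/dev/null; then
            # Single-end output is present and is a valid gzip file
            printf '%s\n' "$sra"
        elif [ -s "$pair1" ] && [ -s "$pair2" ] \
            && gzip -t "$pair1" "$pair2" 2>/dev/null; then
            # Both paired-end files are present and are valid gzip files
            printf '%s\n' "$sra"
        fi
    done
}
```

For example, `list_converted_sra output_directory .` would list the converted .sra files under output_directory when the fastq.gz files are in the current directory. Any .sra file missing from that list probably needs its conversion job checked or rerun before cleanup.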