== Downloading and processing NCBI SRA files ==

**SRA** (for Sequence Read Archive) is an NCBI binary format for short reads.

It's thoroughly described in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]].

SRA files can be downloaded as compressed fastq in a web browser using [[https://ewels.github.io/sra-explorer/|SRA Explorer]].

Processing SRA files requires the [[https://ncbi.github.io/sra-tools/|NCBI SRA Toolkit]], which is installed on our systems.

=== Downloading and processing one NCBI SRA sample at a time ===

NCBI short-read files can be either
  * downloaded and converted to fastq.gz format at the same time (recommended), or
  * downloaded in SRA format for subsequent conversion

To download one run (SRR ID) at a time directly in fastq.gz format, use the fastq-dump command, for example:

{{{
fastq-dump --split-3 --gzip SRR123456
}}}

With the option "--split-3",
  * single-end reads will end up in a single file, named SRR123456.fastq.gz
  * paired-end reads will produce two files (named SRR123456_1.fastq.gz and SRR123456_2.fastq.gz)
  * unpaired reads (if any) will be placed into a third file.
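
The file naming described above can be used to check what a conversion produced. The following is a minimal sketch, not part of the SRA Toolkit: the function name describe_layout is hypothetical, and it only looks at which files exist, following the naming in the list above.

```shell
# describe_layout is a hypothetical helper name (not an SRA Toolkit command).
#   $1 = run accession, e.g. SRR123456
#   $2 = directory holding the fastq.gz files (defaults to the current directory)
describe_layout() {
  acc=$1
  dir=${2:-.}
  if [ -f "$dir/${acc}_1.fastq.gz" ] && [ -f "$dir/${acc}_2.fastq.gz" ]; then
    # Both mate files are present, so the run was split as paired-end
    echo "paired"
  elif [ -f "$dir/$acc.fastq.gz" ]; then
    # Only the un-suffixed file is present, so the run is single-end
    echo "single"
  else
    echo "missing"
  fi
}

# Usage: describe_layout SRR123456
```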

See the [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|fastq-dump documentation]] for all program options.

We recommend always gzipping fastq files because
  * fastq.gz files are much smaller than fastq files
  * our typically-used analysis programs all permit fastq.gz input
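
Gzipped files can also be inspected without writing an uncompressed copy to disk. As a small illustration, the sketch below counts the reads in a fastq.gz file; the function name count_reads is hypothetical, and it assumes the usual fastq layout of exactly four lines per read.

```shell
# count_reads is a hypothetical helper name (not an SRA Toolkit command).
# Assumes a standard fastq layout of exactly 4 lines per read.
count_reads() {
  # gzip -dc streams the decompressed text to stdout;
  # the file on disk stays compressed
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Usage: count_reads SRR123456.fastq.gz
```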


'''Note:''' Running fastq-dump places downloaded or cache files into the user's home directory, which is likely to run out of space.  To prevent this, you have at least 3 options:

Option 1: symlink the NCBI directory in your home directory to a location with more storage (if ~/ncbi already exists as a real directory, move it aside first, or the link will be created inside it), with a command like

{{{
ln -s /path/to/large/storage ~/ncbi
}}}
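
If ~/ncbi already exists and holds files, the symlink step needs a little care. The following is a sketch only: the function name relocate_ncbi_dir is hypothetical, both paths are placeholders, and it assumes the existing directory contains no hidden files (if it does, the rmdir step fails and nothing is linked).

```shell
# relocate_ncbi_dir is a hypothetical helper name; both arguments are placeholders.
#   $1 = directory on large storage, e.g. /path/to/large/storage
#   $2 = location of the link, e.g. ~/ncbi
relocate_ncbi_dir() {
  target=$1
  link=$2
  mkdir -p "$target"
  if [ -L "$link" ]; then
    echo "$link is already a symlink; nothing to do" >&2
    return 0
  fi
  if [ -d "$link" ]; then
    # Move existing contents to the new location first; an empty
    # directory makes mv fail harmlessly, hence "|| true"
    mv "$link"/* "$target"/ 2>/dev/null || true
    # rmdir refuses to remove a non-empty directory, so leftover
    # (e.g. hidden) files stop the function here instead of being lost
    rmdir "$link" || return 1
  fi
  ln -s "$target" "$link"
}

# Usage: relocate_ncbi_dir /path/to/large/storage ~/ncbi
```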

Option 2: edit your ~/.ncbi/user-settings.mkfg file to include the following line:
{{{
/repository/user/default-path = "/path/to/large/storage"
}}}

Option 3: Modify your environment with vdb-config
{{{
vdb-config --restore-defaults # Restore the default settings
vdb-config -i # Use the interactive GUI to enter a different location with "Set Default Import Path"
}}}

In the event of problems downloading and converting to fastq.gz all at once, SRA files can be downloaded by navigating through the SRA web site to the sample's "Data access" tab, which provides direct links to the files.  Using this direct path to the SRA file,
{{{
# Use 'wget' for download
wget -O SRR123456.sra https://sra-downloadb/path_to_file/SRR123456/SRR123456.1
# Convert SRA to fastq.gz
fastq-dump --split-3 --gzip ./SRR123456.sra
}}}

=== Downloading and processing multiple NCBI SRA samples ===

To download a list of SRR files (such as for all of the samples of a data series) from NCBI, use NCBI's 'prefetch'.

Given a set of SRA files (by SRR ID) listed in a single column in the text file "SRR_Acc_List.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set:

{{{
prefetch -O output_directory --option-file SRR_Acc_List.txt
}}}
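
An accession list that has been edited by hand can pick up blank lines, Windows line endings, or duplicates. The sketch below tidies such a list before it is passed to prefetch. The function name clean_acc_list is hypothetical, and the pattern assumes run accessions of the form SRR, ERR, or DRR followed by digits.

```shell
# clean_acc_list is a hypothetical helper name (not an SRA Toolkit command).
# Assumes run accessions look like SRR/ERR/DRR followed by digits.
clean_acc_list() {
  tr -d '\r' < "$1" |                 # drop Windows carriage returns
    awk 'NF { print $1 }' |           # skip blank lines, keep the first field
    grep -E '^[SED]RR[0-9]+$' |       # keep only lines that look like run IDs
    sort -u                           # remove duplicates
}

# Usage: clean_acc_list SRR_Acc_List.txt > SRR_Acc_List.clean.txt
```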

If you don't specify an output directory, the SRR files will be downloaded to ~/ncbi/ncbi_public/sra (or your configured "Import Path" as described above).  

To get this list of SRR IDs, go to the [[https://www.ncbi.nlm.nih.gov/Traces/study/|SRA Run Selector]] and enter a project accession.  Once on a [[https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP000002|project page]], go to the "Select" section and click on "Accession List" to get 'SRR_Acc_List.txt' (or, if you want a subset of these, click on "Metadata" in the "Select" section to get a comma-separated file, 'SraRunTable.txt', and create your own 'SRR_Acc_List.txt').
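
One possible way to build your own 'SRR_Acc_List.txt' from 'SraRunTable.txt' is sketched below. The function name subset_runs is hypothetical, and the sketch assumes that the file has a header row and that the run accession is the first column; check your own file before relying on this. The pattern is matched against the whole line, so choose one specific enough to avoid accidental hits in other columns.

```shell
# subset_runs is a hypothetical helper name (not an SRA Toolkit command).
# Assumes a header row and the run accession in the first column.
#   $1 = pattern to match (e.g. an assay type or sample name)
#   $2 = metadata file (e.g. SraRunTable.txt)
subset_runs() {
  pattern=$1
  table=$2
  # tail -n +2 skips the header row; cut keeps only the first
  # comma-separated field, which holds the run accession
  tail -n +2 "$table" | grep -e "$pattern" | cut -d, -f1
}

# Usage: subset_runs 'RNA-Seq' SraRunTable.txt > SRR_Acc_List.txt
```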

The 'prefetch' command will provide you with a set of SRA files which then need to be converted to fastq.gz.  One way to do this on the set of SRA files is

{{{
# Submit one fastq-dump conversion per .sra file as a separate LSF job (bsub)
find . -name '*.sra' -exec bsub fastq-dump --split-3 --gzip {} \;
}}}

Once you create the fastq.gz files, the *.sra files can be deleted.
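
Before deleting, it may be worth confirming that each conversion actually finished. The sketch below removes an .sra file only when a matching, intact fastq.gz exists. The function name remove_converted_sra is hypothetical, and for paired-end runs it checks only the _1 file, so treat it as a starting point rather than a complete check.

```shell
# remove_converted_sra is a hypothetical helper name (not an SRA Toolkit command).
#   $1 = directory containing the .sra files (searched recursively)
#   $2 = directory containing the fastq.gz files
remove_converted_sra() {
  sra_dir=$1
  fastq_dir=$2
  find "$sra_dir" -name '*.sra' | while IFS= read -r sra; do
    acc=$(basename "$sra" .sra)
    # Look for the single-end file, then the first paired-end file
    for fq in "$fastq_dir/$acc.fastq.gz" "$fastq_dir/${acc}_1.fastq.gz"; do
      # -s: file exists and is non-empty; gzip -t: archive is not truncated
      if [ -s "$fq" ] && gzip -t "$fq" 2>/dev/null; then
        rm -- "$sra"
        break
      fi
    done
    if [ -e "$sra" ]; then
      echo "keeping $sra (no valid fastq.gz found)" >&2
    fi
  done
}

# Usage: remove_converted_sra output_directory .
```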