== Downloading and processing NCBI SRA files ==

**SRA** (Sequence Read Archive) is an NCBI binary format for short reads.

It is thoroughly described in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]].

SRA files can be downloaded as compressed fastq in a web browser using [[https://ewels.github.io/sra-explorer/|SRA Explorer]].

Processing SRA files requires the [[https://ncbi.github.io/sra-tools/|NCBI SRA Toolkit]], which is installed on our systems.

=== Downloading and processing one NCBI SRA sample at a time ===

NCBI short-read files can be 
  * downloaded and converted to fastq.gz format at the same time (recommended), or
  * downloaded in SRA format for subsequent conversion

To download one run (SRR ID) at a time directly in fastq.gz format, use the fastq-dump command, for example:

{{{
fastq-dump --split-3 --gzip SRR123456
}}}

With the option "--split-3",
  * single-end reads will end up in a single file, named SRR123456.fastq.gz
  * paired-end reads will produce two files (named SRR123456_1.fastq.gz and SRR123456_2.fastq.gz)
  * unpaired reads (if any) will be placed into a third file.

See the [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|fastq-dump documentation]] for all program options.

We recommend always gzipping fastq files because
  * fastq.gz files are much smaller than fastq files
  * our typically-used analysis programs all permit fastq.gz input
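
A fastq.gz file can also be checked and inspected without unpacking it to disk. The following is a minimal sketch, not part of the SRA Toolkit: the function name fastq_gz_reads is hypothetical, and it assumes the usual 4 lines per FASTQ record.

```shell
# fastq_gz_reads FILE
# Hypothetical helper: verify that FILE is an intact gzip archive, then
# print the number of reads, assuming 4 lines per FASTQ record.
fastq_gz_reads() {
    gzip -t "$1" || return 1
    gzip -cd "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Usage (file name taken from the example above):
#   fastq_gz_reads SRR123456_1.fastq.gz
```

For paired-end data, the counts for the _1 and _2 files should match.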


'''Note:''' Running fastq-dump places downloaded or cache files into the user's home directory, which is likely to run out of space.  To prevent this, you have at least 3 options:

Option 1: replace the ncbi directory in your home directory (~/ncbi) with a symlink to a location with more storage, using a command like

{{{
ln -s /path/to/large/storage ~/ncbi
}}}
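
If ~/ncbi already exists as a real directory, the ln command above creates the link inside it instead of replacing it. A rough sketch of one way to handle that case follows; relocate_dir is a hypothetical helper (not an NCBI tool), and the destination path in the usage line is a placeholder.

```shell
# relocate_dir SRC DEST
# Hypothetical helper: move SRC to DEST if SRC is a real directory,
# then leave a symlink at SRC that points to DEST.
relocate_dir() {
    src=$1
    dest=$2
    if [ -L "$src" ]; then
        echo "$src is already a symlink; nothing to do" >&2
        return 0
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists; refusing to overwrite" >&2
        return 1
    fi
    if [ -d "$src" ]; then
        mv "$src" "$dest"       # keep anything already downloaded
    else
        mkdir -p "$dest"
    fi
    ln -s "$dest" "$src"
}

# Usage (placeholder destination):
#   relocate_dir ~/ncbi /path/to/large/storage/ncbi
```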

Option 2: edit your ~/.ncbi/user-settings.mkfg file to include the following line:
{{{
/repository/user/default-path = "/path/to/large/storage"
}}}

Option 3: modify your environment with vdb-config:
{{{
vdb-config --restore-defaults # To restore your settings
vdb-config -i # To use the GUI to enter a different location with "Set Default Import Path".  
}}}

=== Downloading and processing multiple NCBI SRA samples ===

To download a list of SRR files (such as for all of the samples of a data series) from NCBI, use NCBI's 'prefetch'.

Given a set of SRA files (by SRR ID) listed in a single column in the text file "SRR_Acc_List.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set:

{{{
prefetch -O output_directory --option-file SRR_Acc_List.txt
}}}

If you don't specify an output directory, the SRR files will be downloaded to ~/ncbi/ncbi_public/sra (or your configured "Import Path" as described above).  

To get this list of SRR IDs, go to the [[https://www.ncbi.nlm.nih.gov/Traces/study/|SRA Run Selector]] and enter a project accession.  Once on a [[https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP000002|project page]], go to the "Select" section and click on "Accession List" to get 'SRR_Acc_List.txt'.  If you want only a subset of these runs, click on "Metadata" in the "Select" section to get a comma-separated file, 'SraRunTable.txt', and create your own 'SRR_Acc_List.txt' from it.
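
Building a subset list from 'SraRunTable.txt' can be sketched as follows. This is only an illustration: make_acc_list is a hypothetical helper, it assumes the run accession is the first column of the table, and the pattern is matched against the whole line, so check the result before downloading.

```shell
# make_acc_list TABLE PATTERN
# Hypothetical helper: print the run accessions (assumed to be column 1)
# of the rows in TABLE that match PATTERN.
make_acc_list() {
    grep -e "$2" "$1" |
        cut -d, -f1 |
        grep -E '^[SED]RR[0-9]+$'   # keep only accession-like values
}

# Usage ("liver" is a placeholder pattern):
#   make_acc_list SraRunTable.txt liver > SRR_Acc_List.txt
```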

The 'prefetch' command will provide you with a set of SRA files which then need to be converted to fastq.gz.  One way to do this, submitting one conversion job per SRA file with bsub, is

{{{
find . -name '*.sra' -exec bsub fastq-dump --split-3 --gzip {} \;
}}}
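
Without a job scheduler, the same fastq-dump command can be run serially, one accession at a time, straight from the accession list. A minimal sketch, assuming one accession per line; for_each_accession is a hypothetical helper name.

```shell
# for_each_accession LIST CMD [ARGS...]
# Hypothetical helper: run "CMD ARGS ACCESSION" for every non-empty line
# of LIST. Blank lines and Windows line endings are tolerated.
for_each_accession() {
    list=$1
    shift
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc=$(printf '%s' "$acc" | tr -d '[:space:]')
        [ -n "$acc" ] || continue
        "$@" "$acc" </dev/null      # keep CMD from reading the list
    done < "$list"
}

# Usage:
#   for_each_accession SRR_Acc_List.txt fastq-dump --split-3 --gzip
# Dry run that only prints the commands:
#   for_each_accession SRR_Acc_List.txt echo fastq-dump --split-3 --gzip
```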

Once you create the fastq.gz files, the *.sra files can be deleted.
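
Before deleting, it may be worth confirming that each SRA file has intact fastq.gz output. The sketch below is one rough way to do that; remove_converted_sra is a hypothetical helper, it assumes the file naming produced by "--split-3 --gzip" as described above, and it only tests gzip integrity, not read counts.

```shell
# remove_converted_sra SRA_DIR FASTQ_DIR
# Hypothetical helper: delete each *.sra file under SRA_DIR only if at
# least one matching fastq.gz exists in FASTQ_DIR and every matching
# fastq.gz passes "gzip -t".
remove_converted_sra() {
    find "$1" -name '*.sra' | while IFS= read -r sra; do
        id=$(basename "$sra" .sra)
        found=0
        ok=1
        for fq in "$2/$id.fastq.gz" "$2/${id}_1.fastq.gz" "$2/${id}_2.fastq.gz"; do
            [ -f "$fq" ] || continue
            found=1
            gzip -t "$fq" 2>/dev/null || ok=0
        done
        if [ "$found" -eq 1 ] && [ "$ok" -eq 1 ]; then
            rm -- "$sra"
            echo "removed $sra"
        else
            echo "kept $sra (no intact fastq.gz found)" >&2
        fi
    done
}

# Usage (placeholder directories):
#   remove_converted_sra output_directory fastq_directory
```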