Downloading and processing NCBI SRA files

SRA (for Sequence Read Archive) is an NCBI binary format for short reads.

This task is thoroughly described in the SRA Handbook.

SRA files can be downloaded as compressed fastq in a web browser using SRA Explorer.

Processing SRA files requires the NCBI SRA Toolkit, which is installed on our systems.

Downloading and processing one NCBI SRA sample at a time

NCBI short-read files can be

  • downloaded and converted to fastq.gz format at the same time (recommended), or
  • downloaded in SRA format for subsequent conversion

To download one SRR ID at a time to get fastq.gz format, use the command fastq-dump, like

fastq-dump --split-3 --gzip SRR123456

With the option "--split-3",

  • single-end reads will end up in a single file, named SRR123456.fastq.gz
  • paired-end reads will produce two files (named SRR123456_1.fastq.gz and SRR123456_2.fastq.gz)
  • unpaired reads (if any) will be placed into a third file.
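As a quick check after a conversion, the file-naming rules above can be turned into a small shell function. This is only a sketch: the function name describe_split3_output is made up for illustration, and it assumes that the third (unpaired) file of a paired-end run is named SRR123456.fastq.gz, which is how fastq-dump usually names it.

```shell
# Hypothetical helper: report which files "fastq-dump --split-3 --gzip"
# left behind for one accession in a directory (default: current directory).
describe_split3_output() {
    local acc="$1" dir="${2:-.}"
    if [ -f "$dir/${acc}_1.fastq.gz" ] && [ -f "$dir/${acc}_2.fastq.gz" ]; then
        echo "paired-end"
        # Assumption: with paired data, a bare ACC.fastq.gz holds the
        # reads that have no mate.
        if [ -f "$dir/${acc}.fastq.gz" ]; then
            echo "unpaired reads present"
        fi
    elif [ -f "$dir/${acc}.fastq.gz" ]; then
        echo "single-end"
    else
        echo "no fastq.gz output found"
        return 1
    fi
}
```

For example, "describe_split3_output SRR123456" run in the download directory prints "paired-end" or "single-end" depending on which files exist.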

See the fastq-dump documentation for all program options.

We recommend always gzipping fastq files because

  • fastq.gz files are much smaller than fastq files
  • our typically-used analysis programs all permit fastq.gz input
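Gzipped files can also be inspected directly, without writing an uncompressed copy to disk. The sketch below counts the reads in a fastq.gz file; the function name count_fastq_reads is hypothetical, and it assumes the usual four-lines-per-read fastq layout.

```shell
# Hypothetical helper: count reads in a gzipped fastq file.
# Assumes every read occupies exactly four lines.
count_fastq_reads() {
    local lines
    # Decompress to a pipe only; nothing uncompressed is written to disk.
    lines=$(gzip -cd "$1" | wc -l)
    echo $((lines / 4))
}
```

For paired-end data, running this on the _1 and _2 files should give the same number.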

Note: Running fastq-dump places downloaded or cache files into the user's home directory, which is likely to run out of space. To prevent this, you have at least 3 options:

Option 1: symlink the NCBI directory in your home directory to somewhere else with larger storage, such as with a command like

ln -s /path/to/large/storage ~/ncbi

Option 2: edit your ~/.ncbi/user-settings.mkfg file to include the following line:

/repository/user/default-path = "/path/to/large/storage"

Option 3: Modify your environment with vdb-config

vdb-config --restore-defaults # To reset your settings to the defaults
vdb-config -i # To use the GUI to enter a different location with "Set Default Import Path".
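For Option 1, it may be worth guarding against an existing ~/ncbi directory, since "ln -s" would then create the link inside that directory instead of replacing it. A minimal sketch, with a made-up function name (link_ncbi_cache) and the same placeholder storage path as above:

```shell
# Hypothetical helper: point ~/ncbi (or another link path) at larger
# storage, without touching anything that already exists there.
link_ncbi_cache() {
    local target="$1" link="${2:-$HOME/ncbi}"
    if [ -L "$link" ]; then
        echo "already a symlink: $link"
    elif [ -e "$link" ]; then
        # A real directory is in the way; move or remove it by hand first.
        echo "exists and is not a symlink, leaving untouched: $link" >&2
        return 1
    else
        mkdir -p "$target" && ln -s "$target" "$link"
    fi
}
# Usage (not run here): link_ncbi_cache /path/to/large/storage
```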

In the event of problems downloading and converting to fastq.gz all at once, SRA files can be downloaded by navigating through the SRA web site to the sample's "Data access" tab, which provides direct links to the file(s). Using this direct path to the SRA file, download the file and then convert it with commands like

# Use 'wget' to download reads in SRA format
wget -O SRR123456.sra https://sra-downloadb/path_to_file/SRR123456/SRR123456.1
# Convert SRA format to fastq.gz
fastq-dump --split-3 --gzip ./SRR123456.sra

If fastq-dump gives you an error like "Failed to call external services." you may need to use the NCBI link to a sralite file to download the sequences in that format first and then convert to fastq. This can be done with commands like

# Download the sralite file
wget https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos5/sra-pub-zq-11/SRR002/626/SRR1234567/SRR1234567.sralite.1

# Convert sralite to fastq.gz
fastq-dump --split-files --gzip -A SRR1234567 SRR1234567.sralite.1

Downloading and processing multiple NCBI SRA samples

To download a list of SRR files (such as for all of the samples of a data series) from NCBI, use NCBI's 'prefetch'.

Given a set of SRA files (by SRR ID) listed in a single column in the text file "SRR_Acc_List.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set:

prefetch -O output_directory --option-file SRR_Acc_List.txt

If you don't specify an output directory, the SRR files will be downloaded to ~/ncbi/ncbi_public/sra (or your configured "Import Path" as described above).

To get this list of SRR IDs, go to the SRA Run Selector and enter a project accession. Once on a project page, go to the "Select" section and click on "Accession List" to get 'SRR_Acc_List.txt' (or if you want a subset of these, click on "Metadata" in the "Select" section to get a comma-separated file, 'SraRunTable.txt', and create your own 'SRR_Acc_List.txt'). Or you can go to the Downloading SRA data page, enter a list of SRX (experiment accession) IDs, and get a list of SRR (run) accessions for 'SRR_Acc_List.txt'.
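Creating your own 'SRR_Acc_List.txt' from 'SraRunTable.txt' can be scripted. The sketch below is one possible approach, not an NCBI tool: the function name runs_matching is hypothetical, it assumes the accession column is headed "Run", and it assumes no quoted field before that column contains a comma (plain awk does not parse quoted CSV fields).

```shell
# Hypothetical helper: print the "Run" column of a comma-separated
# SraRunTable.txt for every row that contains a given fixed string.
runs_matching() {
    local table="$1" pattern="$2"
    awk -F',' -v pat="$pattern" '
        # Header row: find which column is named "Run".
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Run") col = i; next }
        # Data rows: keep the accession if the row contains the string.
        col && index($0, pat) { print $col }
    ' "$table"
}
# Usage (not run here):
#   runs_matching SraRunTable.txt "RNA-Seq" > SRR_Acc_List.txt
```

It is worth opening the resulting list to confirm it holds the expected samples before passing it to 'prefetch'.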

The 'prefetch' command will provide you with a set of SRA files which then need to be converted to fastq.gz. One way to do this on the set of SRA files is

find . -name '*.sra' -exec sbatch --partition=20 --ntasks=1 --wrap "fastq-dump --split-3 --gzip {}" \;

Once you create the fastq.gz files, the *.sra files can be deleted.
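Before deleting, it may help to confirm that each *.sra file really has usable fastq.gz output, since an interrupted job can leave a truncated file behind. The following is a sketch under assumptions: the function name sra_safe_to_delete is made up, the fastq.gz files are assumed to sit together in one directory (fastq-dump writes to the working directory by default), and for paired-end data only the _1 file is tested.

```shell
# Hypothetical helper: list .sra files whose fastq.gz output exists,
# is non-empty, and passes a gzip integrity test.
sra_safe_to_delete() {
    local sra_dir="${1:-.}" fastq_dir="${2:-.}" sra acc f
    find "$sra_dir" -name '*.sra' | while IFS= read -r sra; do
        acc=$(basename "$sra" .sra)
        for f in "$fastq_dir/$acc.fastq.gz" "$fastq_dir/${acc}_1.fastq.gz"; do
            # gzip -t fails on a truncated archive, e.g. from a killed job.
            if [ -s "$f" ] && gzip -t "$f" 2>/dev/null; then
                printf '%s\n' "$sra"
                break
            fi
        done
    done
}
# Usage (not run here): sra_safe_to_delete output_directory .
```

The function only prints file names, so the list can be reviewed before anything is removed. Note that "gzip -t" reads each file in full, which takes a while on large data sets.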
