Version 4 (modified by gbell, 4 years ago)

Downloading and processing NCBI SRA files

SRA (Sequence Read Archive) is an NCBI binary format for short reads.

It is thoroughly described in the SRA Handbook.

SRA files can be downloaded as compressed fastq in a web browser using SRA Explorer.

Processing SRA files requires the NCBI SRA Toolkit, which is installed on our systems.

Downloading and processing one NCBI SRA sample at a time

NCBI short-read files can be either

  • downloaded in SRA format for subsequent conversion, or
  • downloaded and converted to fastq.gz format at the same time (recommended)

To download one SRR ID at a time in fastq.gz format, use the fastq-dump command, for example:

fastq-dump --split-3 --gzip SRR123456

With the option "--split-3",

  • single-end reads will end up in a single file, named SRR123456.fastq.gz
  • paired-end reads will produce two files (named SRR123456_1.fastq.gz and SRR123456_2.fastq.gz)
  • unpaired reads (if any) will be placed into a third file.

See the fastq-dump documentation for all program options.
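As a quick check after a run, the output files can be inspected to see which of the cases above applied. The following is a minimal sketch, not part of the SRA Toolkit: the function name describe_split3_output is hypothetical, and it assumes the third (unpaired) file is named like the single-end file, SRR123456.fastq.gz, as the fastq-dump documentation describes for --split-3.

```shell
# Report which "fastq-dump --split-3 --gzip" outputs exist for one accession.
# Usage: describe_split3_output ACCESSION [DIRECTORY]
# (hypothetical helper; the unpaired-file name is an assumption)
describe_split3_output() (
    acc=$1
    dir=${2:-.}
    if [ -f "$dir/${acc}_1.fastq.gz" ] && [ -f "$dir/${acc}_2.fastq.gz" ]; then
        if [ -f "$dir/${acc}.fastq.gz" ]; then
            echo "paired-end with unpaired reads"
        else
            echo "paired-end"
        fi
    elif [ -f "$dir/${acc}.fastq.gz" ]; then
        echo "single-end"
    else
        echo "no output found"
    fi
)
```

For example, after the fastq-dump command above finishes, running describe_split3_output SRR123456 in the same directory reports the layout that was written.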

We recommend always gzipping fastq files because

  • fastq.gz files are much smaller than fastq files
  • our typically-used analysis programs all permit fastq.gz input
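Gzipped files can also be inspected directly, without writing an uncompressed copy to disk. The sketch below counts reads by streaming the file through gzip; the function name count_fastq_gz_reads is hypothetical, and it assumes the usual four lines per fastq record.

```shell
# Count reads in a gzipped fastq file without decompressing it to disk.
# Usage: count_fastq_gz_reads FILE.fastq.gz
# Assumes four lines per record (no wrapped sequence or quality lines).
count_fastq_gz_reads() (
    gzip -dc "$1" | awk 'END { print NR / 4 }'
)
```

For paired-end data, running this on both the _1 and _2 files is a simple way to confirm they contain the same number of reads.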

Note: Running fastq-dump places downloaded or cache files into the user's home directory, which is likely to run out of space. To prevent this, you have at least 3 options:

Option 1: Symlink the ncbi directory in your home directory to a location with more storage, using a command like

ln -s /path/to/large/storage ~/ncbi

Option 2: edit your ~/.ncbi/user-settings.mkfg file to include the following line:

/repository/user/default-path = "/path/to/large/storage"

Option 3: Modify your environment with vdb-config

vdb-config --restore-defaults # Restore the default settings
vdb-config -i # Use the interactive GUI to enter a different location with "Set Default Import Path"
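For Option 1, note that if ~/ncbi already exists as a directory, a plain "ln -s" creates the link inside it rather than replacing it. The following is a minimal sketch of one way to handle that, assuming it is acceptable to move the existing directory aside; the function name link_ncbi_dir and the ".old" suffix are hypothetical choices, not part of the SRA Toolkit.

```shell
# Point the NCBI directory at larger storage via a symlink.
# Usage: link_ncbi_dir TARGET_DIR [LINK_PATH]   (LINK_PATH defaults to ~/ncbi)
link_ncbi_dir() (
    target=$1
    link=${2:-$HOME/ncbi}
    mkdir -p "$target" || exit 1
    if [ -L "$link" ]; then
        echo "$link is already a symlink; leaving it unchanged" >&2
        exit 0
    fi
    if [ -d "$link" ]; then
        # Keep existing downloads by moving the old directory aside.
        backup="$link.old"
        if [ -e "$backup" ]; then
            echo "$backup already exists; not overwriting it" >&2
            exit 1
        fi
        mv "$link" "$backup" || exit 1
    fi
    ln -s "$target" "$link"
)
```

A call such as link_ncbi_dir /path/to/large/storage would then be equivalent to the Option 1 command, with any previous ~/ncbi contents preserved in ~/ncbi.old.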

Downloading and processing multiple NCBI SRA samples

To download a list of SRR files (such as for all of the samples of a data series) from NCBI, use prefetch.

Given a set of SRA files listed in a single column in the text file "SRR_Acc_List.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set:

prefetch -O output_directory --option-file SRR_Acc_List.txt

If you don't specify an output directory, the SRR files will be downloaded to ~/ncbi/ncbi_public/sra (or your configured "Import Path" as described above).
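Once the runs are downloaded, each one still needs to be converted to fastq.gz. The following is a minimal sketch of a batch conversion that reuses the fastq-dump options from the single-sample section. The function names read_accessions and convert_accessions are hypothetical, and the loop assumes fastq-dump can locate each run by accession through your configured paths; if you downloaded to a custom output directory, you may need to pass the path to each .sra file instead.

```shell
# Print the run accessions in a list file, dropping blank or malformed
# lines and any Windows line endings.
# Usage: read_accessions LIST_FILE
read_accessions() (
    tr -d '\r' < "$1" | grep -E '^[SED]RR[0-9]+$'
)

# Convert every accession in the list to gzipped fastq.
# Usage: convert_accessions LIST_FILE [OUTPUT_DIRECTORY]
convert_accessions() (
    list=$1
    outdir=${2:-.}
    read_accessions "$list" | while read -r acc; do
        # </dev/null keeps fastq-dump from consuming the accession list.
        fastq-dump --split-3 --gzip --outdir "$outdir" "$acc" < /dev/null \
            || echo "failed: $acc" >&2
    done
)
```

For example, convert_accessions SRR_Acc_List.txt fastq_directory would process the list one run at a time and report any accession that fails without stopping the loop.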

To get this list of SRR IDs, go to the SRA Run Selector and enter a project accession. Once on a project page, go to the "Select" section and click on "Accession List" to get 'SRR_Acc_List.txt'. If you want only a subset of these, click on "Metadata" in the "Select" section to get a comma-separated file, 'SraRunTable.txt'.
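A subset of runs can be pulled out of the metadata table on the command line. The sketch below is one way to do that with awk, under two assumptions that should be checked against your own file: that the accession column is headed "Run", and that no field contains a quoted comma (this simple split does not handle that case). The function name select_runs and the example column "LibraryLayout" are illustrative only.

```shell
# Print the Run accession for rows where COLUMN equals VALUE.
# Usage: select_runs TABLE COLUMN VALUE
# Assumes a header row with a "Run" column and no commas inside fields.
select_runs() (
    awk -F',' -v col="$2" -v val="$3" '
        { sub(/\r$/, "") }            # tolerate Windows line endings
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Run") run = i
                if ($i == col) want = i
            }
            next
        }
        run && want && $want == val { print $run }
    ' "$1"
)
```

For example, select_runs SraRunTable.txt LibraryLayout PAIRED > subset_list.txt would write only the matching accessions to a new list file for prefetch.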
