
Version 1 (modified by gbell, 4 years ago)


Downloading and processing NCBI SRA files

SRA (Sequence Read Archive) is an NCBI binary format for short reads.

It is thoroughly described in the SRA Handbook.

SRA files can be downloaded as compressed fastq in a web browser using SRA Explorer.

Processing SRA files requires the NCBI SRA Toolkit, which is installed on our systems.

The main command is fastq-dump <SRA archive file>, for example:

fastq-dump SRR060751.sra

If your reads are paired, by default read 1 and read 2 of each pair are concatenated into a single sequence in the same file. To check whether an SRA sample has paired reads, go to the SRA Run browser, enter the sample ID, and look in the table under "Layout".

To get matched paired reads into separate files, use a command like

fastq-dump --split-3 SRR060751.sra

This works like "--split-files", except that "--split-3" also puts unpaired reads (if any) into a third file.

You can also ask for gzipped output instead of plain fastq:

fastq-dump --split-3 --gzip SRR060751.sra
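After a paired run has been split, it can be worth confirming that the two mate files hold the same number of reads. The sketch below is one way to do that; it assumes the usual fastq-dump output names (<run>_1.fastq and <run>_2.fastq, with a .gz suffix when --gzip is used) and four lines per record. The function names count_reads and check_pairs are made up for this example.

```shell
# Count reads in a FASTQ file (plain or gzipped).
# Assumes four lines per record, with no wrapped sequence lines.
count_reads() {
    case "$1" in
        *.gz) echo $(( $(gzip -dc "$1" | wc -l) / 4 )) ;;
        *)    echo $(( $(wc -l < "$1") / 4 )) ;;
    esac
}

# Succeed only if the _1 and _2 files of a run hold the same number of reads.
# $1: run prefix (e.g. SRR060751), $2: extension (default: fastq)
check_pairs() {
    ext=${2:-fastq}
    n1=$(count_reads "${1}_1.$ext")
    n2=$(count_reads "${1}_2.$ext")
    echo "$1: $n1 reads in _1, $n2 reads in _2"
    [ "$n1" -eq "$n2" ]
}
```

For example, check_pairs SRR060751 fastq.gz would check the gzipped output of the command above.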

See Converting SRA format data into FASTQ for all program options.

Note that a fastq file is about 4-5x larger than its corresponding SRA file.
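A rough way to apply this rule of thumb before converting is to multiply the size of the SRA file by 5 and compare the result with the free space on the target disk. The function name estimate_fastq_mb below is hypothetical, and the factor of 5 is simply the upper end of the range quoted above; gzipped output will be smaller.

```shell
# Estimate, in whole megabytes, the space needed for the uncompressed
# FASTQ output of one SRA file, using a 5x expansion factor.
# $1: path to the .sra file
estimate_fastq_mb() {
    echo $(( $(wc -c < "$1") * 5 / 1048576 ))
}
```

Running estimate_fastq_mb SRR060751.sra and then df -h . shows whether the current directory has enough room.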

fastq-dump can fetch the SRA file itself, or you can download the SRA file directly (e.g. using wget) and then run fastq-dump on it to get the fastq file. Downloading the SRA file directly avoids having to move the download location out of your home directory for large files (see below).

Note: As of fastq-dump version 2.8.1, fastq-dump requires vdb-config to be set up correctly. By default, downloaded/cached files are copied to the user's home directory, which is likely to run out of space. To change the location, run:

vdb-config --restore-defaults
vdb-config -i #use the GUI to enter a different location.  

Manually editing the file, $HOME/.ncbi/user-settings.mkfg, doesn't seem to work. See NCBI SRA Installation/Config. Other alternatives: i) simply symlink the NCBI directory in your home directory to somewhere else with larger storage, or ii) download the SRA file directly (eg. using wget) before using fastq-dump.
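The symlink alternative (i) can be sketched as follows. This is only an illustration: the function name relocate_with_symlink and the destination path are hypothetical, and the directory to move is assumed to be the toolkit's download directory in your home directory.

```shell
# Move a directory to larger storage and leave a symlink at the old path.
# $1: existing directory in your home directory
# $2: new location on a larger filesystem (must not exist yet)
relocate_with_symlink() {
    src=$1
    dest=$2
    if [ -L "$src" ]; then
        echo "$src is already a symlink; nothing to do" >&2
        return 0
    fi
    if [ -d "$src" ]; then
        if [ -e "$dest" ]; then
            echo "$dest already exists; refusing to overwrite" >&2
            return 1
        fi
        mkdir -p "$(dirname "$dest")"
        mv "$src" "$dest"
    else
        # nothing downloaded yet: just create the target directory
        mkdir -p "$dest"
    fi
    ln -s "$dest" "$src"
}
```

A call such as relocate_with_symlink "$HOME/ncbi" /path/to/big/storage/ncbi (with a real destination substituted) keeps the path the toolkit expects while the data lives elsewhere.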

#download SRR4090409.sra (e.g. use wget) from SRA and convert to fastq
fastq-dump SRR4090409.sra

#download SRA file via fastq-dump (important: home directory or vdb-config file must be set up correctly), and convert to fastq
fastq-dump SRR4090409

In order to download a list of SRA files from NCBI, it is convenient to use prefetch.

As mentioned on the SRA website, you can download a list of Run accessions from the search results page: select the Runs of interest by clicking the checkboxes, click "Send To", choose "File", and select "Accession List" in the drop-down menu.

Given a set of SRA files listed in a single column in the text file "sraAccList.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set:

prefetch --option-file sraAccList.txt
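Once the set has been downloaded, the same accession list can drive the conversion to fastq. The loop below is a minimal sketch, not a tested pipeline: the function name convert_runs and the DRY_RUN switch are invented for this example, and it assumes fastq-dump can locate each prefetched run by its accession.

```shell
# Convert every run in an accession list to gzipped FASTQ.
# $1: text file with one run accession per line
# Set DRY_RUN=1 to print the commands instead of running them.
convert_runs() {
    while IFS= read -r acc || [ -n "$acc" ]; do
        # drop spaces and Windows carriage returns, skip blank lines
        acc=$(printf '%s' "$acc" | tr -d '\r ')
        [ -z "$acc" ] && continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "fastq-dump --split-3 --gzip $acc"
        else
            # report a failed run but keep going with the rest of the list
            fastq-dump --split-3 --gzip "$acc" < /dev/null || echo "failed: $acc" >&2
        fi
    done < "$1"
}
```

Setting DRY_RUN=1 and running convert_runs sraAccList.txt first lets you review the commands before committing disk space to them.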

This is current as of prefetch v. 2.9.3 (2.9.3-1). Note that the default location for downloaded files is in your home directory under ~/ncbi/ncbi_public/sra. With this default, one can quickly run out of space. One solution to address this problem is to edit your ~/.ncbi/user-settings.mkfg file to include the following line:

/repository/user/main/public/root = "/destination/for/big/storage/here"

Alternatively, point prefetch to the destination folder with -O.

Download Metadata: On your GEO series page, click the SRA link, click "Send to" at the top of the page, check the "File" radio button, and select "RunInfo" in the pull-down menu. This will generate a tabular SraRunInfo.csv file with the metadata available for each Run.
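The RunInfo table can also be used to see which runs are paired without visiting the Run browser. The sketch below prints each run with its layout; it assumes the file has header columns named Run and LibraryLayout and that no field contains a quoted comma (a proper CSV parser would be safer otherwise). The function name runinfo_layout is hypothetical.

```shell
# Print "Run<TAB>LibraryLayout" for every row of a RunInfo CSV.
# Columns are located by header name rather than by position.
# $1: path to SraRunInfo.csv
runinfo_layout() {
    awk -F',' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Run") run = i
                if ($i == "LibraryLayout") layout = i
            }
            if (!run || !layout) {
                print "Run or LibraryLayout column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        # skip blank lines and any repeated header rows
        $run != "" && $run != "Run" { print $run "\t" $layout }
    ' "$1"
}
```

Runs reported as PAIRED are the ones that need --split-3 when converting.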
