Downloading and processing NCBI SRA files
SRA (Sequence Read Archive) is an NCBI binary format for short reads.
It's thoroughly described in the SRA Handbook.
SRA files can be downloaded as compressed fastq in a web browser using SRA Explorer.
Processing SRA files requires the NCBI SRA Toolkit, which is installed on our systems.
Downloading and processing one NCBI SRA sample at a time
NCBI short-read files can be either
- downloaded in SRA format for subsequent conversion, or
- downloaded and converted to fastq.gz format at the same time (recommended)
To download one SRR ID at a time and get fastq.gz format, use the fastq-dump command, for example:
fastq-dump --split-3 --gzip SRR123456
With the option "--split-3",
- single-end reads will end up in a single file, named SRR123456.fastq.gz
- paired-end reads will produce two files (named SRR123456_1.fastq.gz and SRR123456_2.fastq.gz)
- unpaired reads (if any) will be placed into a third file.
See the fastq-dump documentation for all program options.
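As a rough sketch of how one might check which of these cases applied after a run, the shell function below inspects the output directory. The function name classify_split3_output is hypothetical (it is not part of the SRA Toolkit), and it assumes that the third file for unpaired reads is named SRR123456.fastq.gz, which is worth confirming against your toolkit version.

```shell
# Hypothetical helper (not part of the SRA Toolkit): report which files
# "fastq-dump --split-3 --gzip" left behind for one accession.
# Usage: classify_split3_output SRR123456 [output_dir]
classify_split3_output() {
    local acc="$1"
    local dir="${2:-.}"
    local r1="$dir/${acc}_1.fastq.gz"
    local r2="$dir/${acc}_2.fastq.gz"
    local single="$dir/${acc}.fastq.gz"

    if [ -f "$r1" ] && [ -f "$r2" ]; then
        if [ -f "$single" ]; then
            # Assumption: the unsuffixed file holds reads whose mate is missing.
            echo "paired+unpaired"
        else
            echo "paired"
        fi
    elif [ -f "$single" ]; then
        echo "single"
    else
        echo "missing"
        return 1
    fi
}
```

Running classify_split3_output SRR123456 in the directory where fastq-dump was run prints one of single, paired, paired+unpaired, or missing.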
We recommend always gzipping fastq files because
- fastq.gz files are much smaller than fastq files
- our typically-used analysis programs all permit fastq.gz input
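Compressed files can also be inspected without unpacking them to disk. The following is a minimal sketch, assuming standard four-line FASTQ records; the function name count_fastq_gz_reads is made up for illustration.

```shell
# Hypothetical helper: count the reads in a gzipped FASTQ file without
# writing an uncompressed copy. Assumes exactly four lines per read.
# Usage: count_fastq_gz_reads SRR123456.fastq.gz
count_fastq_gz_reads() {
    # Check the gzip stream first so a truncated download is reported
    # as an error instead of producing a misleading count.
    gzip -t "$1" || return 1
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}
```

For a paired-end sample, the _1 and _2 files should normally report the same number of reads.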
Note: Running fastq-dump places downloaded or cache files into the user's home directory, which is likely to run out of space. To prevent this, you have at least 3 options:
Option 1: symlink the NCBI directory in your home directory to a location with more storage, for example:
ln -s /path/to/large/storage ~/ncbi
Option 2: edit your ~/.ncbi/user-settings.mkfg file to include the following line:
/repository/user/default-path = "/path/to/large/storage"
Option 3: Modify your environment with vdb-config
vdb-config --restore-defaults # To restore your settings
vdb-config -i # To use the GUI to enter a different location with "Set Default Import Path".
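For Option 1, it can help to check what already exists at ~/ncbi before linking, because ln -s behaves differently when the link path is already a directory. The sketch below is one cautious way to do that; link_ncbi_dir is a hypothetical name, and the choice to refuse (rather than move) an existing directory is an assumption made for safety.

```shell
# Hypothetical helper: point ~/ncbi (or another link path) at a larger
# storage area, without clobbering anything that is already there.
# Usage: link_ncbi_dir /path/to/large/storage [link_path]
link_ncbi_dir() {
    local target="$1"
    local link_path="${2:-$HOME/ncbi}"

    if [ -L "$link_path" ]; then
        # Already a symlink: report where it points and leave it alone.
        echo "already linked: $link_path -> $(readlink "$link_path")"
        return 0
    fi
    if [ -e "$link_path" ]; then
        # A real file or directory exists; its contents should be moved by hand.
        echo "refusing to replace existing $link_path; move its contents first" >&2
        return 1
    fi
    mkdir -p "$target" || return 1
    ln -s "$target" "$link_path"
}
```

If the function refuses because ~/ncbi already holds data, move that directory's contents to the larger storage, remove the emptied directory, and run it again.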
Downloading and processing multiple NCBI SRA samples
To download a list of SRR files (such as for all of the samples of a data series) from NCBI, use prefetch.
Given a set of SRR IDs listed one per line in the text file "SraAccList.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set:
prefetch -O output_directory --option-file SraAccList.txt
If you don't specify an output directory, the SRR files will be downloaded to ~/ncbi/ncbi_public/sra (or your configured "Import Path" as described above).
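Files fetched this way are still in SRA format, so each run needs converting to fastq.gz afterwards. A minimal sketch of a loop over the accession list follows. The function name dump_accession_list and the FASTQ_DUMP override variable are hypothetical, and the sketch assumes fastq-dump can resolve each accession by ID; if the files were prefetched into a custom output directory, you may need to pass the path to each .sra file instead.

```shell
# Hypothetical helper: run fastq-dump on every accession in a list file.
# Blank lines and lines starting with '#' are skipped.
# Set FASTQ_DUMP to another command (e.g. echo) for a dry run.
# Usage: dump_accession_list SraAccList.txt
dump_accession_list() {
    local list_file="$1"
    local cmd="${FASTQ_DUMP:-fastq-dump}"
    local failed=0
    local acc

    while IFS= read -r acc || [ -n "$acc" ]; do
        # Strip whitespace, including carriage returns from Windows-edited lists.
        acc=$(printf '%s' "$acc" | tr -d '[:space:]')
        case "$acc" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the rest of the list.
        if ! "$cmd" --split-3 --gzip "$acc" < /dev/null; then
            echo "failed: $acc" >&2
            failed=$((failed + 1))
        fi
    done < "$list_file"

    # Succeed only if every accession was processed without error.
    [ "$failed" -eq 0 ]
}
```

Running it once with FASTQ_DUMP=echo prints the commands that would be executed, which is a cheap way to check the list before starting a long download.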
To get this list of SRR IDs, go to the SRA Run Selector and enter a project accession. Once on a project page (e.g. https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP000002), go to the "Select" section and click on "Accession List" (or, if you want a subset of these, click on "Metadata" in the "Select" section to get a comma-separated file, SraRunTable.txt).
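If you start from the metadata file, the run accessions can be pulled out into a plain list. The sketch below assumes the file has a header row containing a column named "Run" and that no quoted field containing a comma appears before that column; runs_from_metadata is a hypothetical name.

```shell
# Hypothetical helper: print the run accessions from a comma-separated
# metadata table, one per line.
# Assumes a header row with a column called "Run" (simple comma splitting,
# so quoted fields with embedded commas before that column would break it).
# Usage: runs_from_metadata SraRunTable.txt > SraAccList.txt
runs_from_metadata() {
    awk -F',' '
        NR == 1 {
            # Locate the Run column in the header row.
            for (i = 1; i <= NF; i++) if ($i == "Run") col = i
            if (!col) { print "no Run column found" > "/dev/stderr"; exit 1 }
            next
        }
        $col != "" { print $col }
    ' "$1"
}
```

To keep only a subset, filter the rows first (for example with grep on a sample attribute) and pass the filtered table, header included, to the function.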