== Downloading and processing NCBI SRA files ==

**SRA** (for Sequence Read Archive) is an NCBI binary format for short reads. It's thoroughly described in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]].

SRA files can be downloaded as compressed fastq in a web browser using [[https://ewels.github.io/sra-explorer/|SRA Explorer]].

Processing SRA files requires the [[https://ncbi.github.io/sra-tools/|NCBI SRA Toolkit]], which is installed on our systems.

=== Downloading and processing one NCBI SRA sample at a time ===

NCBI short-read files can be
 * downloaded and converted to fastq.gz format at the same time (recommended), or
 * downloaded in SRA format for subsequent conversion

To download one SRR ID at a time in fastq.gz format, use the fastq-dump command, like
{{{
fastq-dump --split-3 --gzip SRR123456
}}}

With the option "--split-3",
 * single-end reads will end up in a single file, named SRR123456.fastq.gz
 * paired-end reads will produce two files (named SRR123456_1.fastq.gz and SRR123456_2.fastq.gz)
 * unpaired reads (if any) will be placed into a third file.

See the [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|fastq-dump documentation]] for all program options.

We recommend always gzipping fastq files because
 * fastq.gz files are much smaller than fastq files
 * our typically-used analysis programs all permit fastq.gz input

'''Note:''' Running fastq-dump places downloaded or cache files into the user's home directory, which is likely to run out of space.
To prevent this, you have at least 3 options:

Option 1: Symlink the NCBI directory in your home directory to a location with more storage, with a command like
{{{
ln -s /path/to/large/storage ~/ncbi
}}}

Option 2: Edit your ~/.ncbi/user-settings.mkfg file to include the following line:
{{{
/repository/user/default-path = "/path/to/large/storage"
}}}

Option 3: Modify your environment with vdb-config
{{{
vdb-config --restore-defaults  # To restore the default settings
vdb-config -i                  # To use the GUI to enter a different location with "Set Default Import Path".
}}}

=== Downloading and processing multiple NCBI SRA samples ===

To download a list of SRR files (such as for all of the samples of a data series) from NCBI, use NCBI's 'prefetch'. Given a set of SRA files (by SRR ID) listed in a single column in the text file 'SRR_Acc_List.txt' (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set:
{{{
prefetch -O output_directory --option-file SRR_Acc_List.txt
}}}

If you don't specify an output directory, the SRR files will be downloaded to ~/ncbi/ncbi_public/sra (or your configured "Import Path" as described above).

To get this list of SRR IDs, go to the [[https://www.ncbi.nlm.nih.gov/Traces/study/|SRA Run Selector]] and enter a project accession. Once on a [[https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP000002|project page]], go to the "Select" section and click on "Accession List" to get 'SRR_Acc_List.txt' (or, if you want a subset of these, click on "Metadata" in the "Select" section to get a comma-separated file, 'SraRunTable.txt', and create your own 'SRR_Acc_List.txt').

The 'prefetch' command will provide you with a set of SRA files, which then need to be converted to fastq.gz. One way to do this on the set of SRA files, submitting one job per file, is
{{{
find . -name \*.sra -exec bsub fastq-dump --split-3 --gzip {} \;
}}}

Once you have created the fastq.gz files, the *.sra files can be deleted.
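
As an alternative to submitting one job per file, the conversion and cleanup steps can be combined so that each .sra file is deleted only after its conversion succeeds. The following is a minimal sketch, not a tested site workflow: the function name `convert_sra_dir` and the `DRY_RUN` variable are hypothetical names introduced for illustration, it runs serially rather than through bsub, and it assumes fastq-dump is on your PATH.

```shell
# Sketch only: convert_sra_dir and DRY_RUN are illustrative names, not part
# of the SRA Toolkit. Assumes fastq-dump is on the PATH.
#
# convert_sra_dir DIR
#   Convert every .sra file under DIR (default: current directory) to
#   fastq.gz, writing the output next to the .sra file, and delete the
#   .sra file only if the conversion succeeded.
#   If DRY_RUN is set, print the commands instead of running them.
convert_sra_dir() {
    dir="${1:-.}"
    find "$dir" -name '*.sra' -print | sort | while IFS= read -r sra; do
        out_dir=$(dirname "$sra")
        if [ -n "${DRY_RUN:-}" ]; then
            # Show what would be run without touching any files
            echo "fastq-dump --split-3 --gzip --outdir $out_dir $sra && rm $sra"
        else
            # rm only runs when fastq-dump exits with status 0
            fastq-dump --split-3 --gzip --outdir "$out_dir" "$sra" && rm -- "$sra"
        fi
    done
}
```

Running `DRY_RUN=1 convert_sra_dir output_directory` first prints the commands that would be run, which is a cheap way to check the file list before anything is converted or deleted.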