Changes between Version 2 and Version 3 of SOPs/qc_SRA


Timestamp:
08/25/20 11:36:47 (4 years ago)
Author:
gbell
Comment:

--

  • SOPs/qc_SRA

    v2 v3  
    99Processing SRA files requires the [[https://ncbi.github.io/sra-tools/|NCBI SRA Toolkit]], which is installed on our systems.
    1010
    11 The main command is **fastq-dump <SRA archive file>**, like
     11=== Downloading and processing one NCBI SRA sample at a time ===
    1212
    13 ''**fastq-dump SRR060751.sra**''
     13NCBI short-read files can be downloaded
     14  * in SRA format for subsequent conversion, or
     15  * as fastq.gz, converted during the download (recommended)
    1416
    15 If your reads are paired, by default the #1 and #2 reads will end up concatenated together in the same file. 
    16 To check if the SRA sample has paired reads or not, go to the [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=run_browser SRA Run browser], enter the sample ID, and look in the table under "Layout".
    17 
    18 To get matched paired reads into separate files, use a command like
    19 
    20 ''**fastq-dump --split-3 SRR060751.sra**''
    21 
    22 This works the same as using the "--split-files", but "--split-3" puts unpaired reads (if any) into a third file.
    23 
    24 You can ask also for gzipped output instead of typical fastq:
    25 
    26 ''**fastq-dump --split-3 --gzip SRR060751.sra**''
    27 
    28 See [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|Converting SRA format data into FASTQ]] for all program options.
    29 
    30 Note that a fastq file is about 4-5x larger than its corresponding SRA file.
    31 
    32 fastq-dump can be used to download/fetch the SRA file, or you can download (eg. using wget) the SRA file directly and then run fastq-dump to get the fastq file.  Downloading SRA file directly will avoid changing home dir path for large file (see below).
    33 
    34 '''Note:''' As of fastq-dump version 2.8.1, running fastq-dump will require the vdb-config to be set up correctly.  By default, downloaded/cache file is copied to the user's home directory, which is likely to run out of space.  Run,
     17To download one run (SRR ID) at a time directly in fastq.gz format, use fastq-dump:
    3518
    3619{{{
    37 vdb-config --restore-defaults
    38 vdb-config -i #use the GUI to enter a different location. 
     20fastq-dump --split-3 --gzip SRR123456
    3921}}}
    4022
    41 Manually editing the file, $HOME/.ncbi/user-settings.mkfg, doesn't seem to work.  See [[https://ncbi.github.io/sra-tools/install_config.html | NCBI SRA Installation/Config]].  Other alternatives: i) simply symlink the NCBI directory in your home directory to somewhere else with larger storage, or ii) download the SRA file directly (eg. using wget) before using fastq-dump.
     23With the option "--split-3",
     24  * single-end reads will end up in a single file, named SRR123456.fastq.gz
     25  * paired-end reads will produce two files, named SRR123456_1.fastq.gz and SRR123456_2.fastq.gz
     26  * unpaired reads (if any) will be placed into a third file.
     27
     28See the [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|fastq-dump documentation]] for all program options.
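To confirm which layout a finished run had, one option is to look at the file names that "--split-3 --gzip" left behind. The following is a minimal sketch only: the function name report_layout is hypothetical, and it assumes the output files follow the naming described above.

```shell
# Sketch: report the layout of a finished "fastq-dump --split-3 --gzip" run,
# judging only by the output file names (assumed naming: ACC_1/ACC_2/ACC.fastq.gz).
report_layout() {
    # $1 = run accession, $2 = directory holding the output (default: current dir)
    acc=$1
    dir=${2:-.}
    if [ -e "$dir/${acc}_1.fastq.gz" ] && [ -e "$dir/${acc}_2.fastq.gz" ]; then
        # Both mates present; a third ACC.fastq.gz, if any, holds unpaired reads
        echo "paired"
    elif [ -e "$dir/${acc}.fastq.gz" ]; then
        echo "single"
    else
        echo "missing"
    fi
}

# Example usage (accession is a placeholder):
# report_layout SRR123456
```

Checking the paired files first matters because a paired-end run with leftover unpaired reads also produces the plain SRR123456.fastq.gz file.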
     29
     30We recommend always gzipping fastq files because
     31  * fastq.gz files are much smaller than fastq files
     32  * our typically-used analysis programs all permit fastq.gz input
     33
     34
     35'''Note:''' Running fastq-dump places downloaded or cached files into the user's home directory, which is likely to run out of space.  To prevent this, you have at least three options:
     36
     37Option 1: symlink the NCBI directory in your home directory to a location with more storage, for example:
    4238
    4339{{{
    44 #download SRR4090409.sra (e.g. use wget) from SRA and convert to fastq
    45 fastq-dump SRR4090409.sra
    46 
    47 #download SRA file via fastq-dump (important: home directory or vdb-config file must be set up correctly), and convert to fastq
    48 fastq-dump SRR4090409
     40ln -s /path/to/large/storage ~/ncbi
    4941}}}
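If ~/ncbi already exists, a plain "ln -s" will not replace it (it would create the link inside the existing directory instead). A slightly more defensive sketch of Option 1 follows; the function name link_ncbi_dir and its arguments are hypothetical, and it deliberately refuses to touch an existing directory rather than moving it.

```shell
# Sketch: create the storage directory and symlink it, but only if the
# link location is still free. Nothing is moved or deleted.
link_ncbi_dir() {
    # $1 = directory on large storage, $2 = link location (default: ~/ncbi)
    target=$1
    link=${2:-$HOME/ncbi}
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "refusing: $link already exists" >&2
        return 1
    fi
    mkdir -p "$target" && ln -s "$target" "$link"
}

# Example usage (path is a placeholder):
# link_ncbi_dir /path/to/large/storage
```

If the function refuses, move or remove the existing ~/ncbi by hand first, after checking that nothing in it is still needed.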
    5042
    51 In order to '''download a list of SRA files''' from NCBI, it is convenient to use prefetch.
     43Option 2: edit your ~/.ncbi/user-settings.mkfg file to include the following line:
     44{{{
     45/repository/user/default-path = "/path/to/large/storage"
     46}}}
    5247
    53 As mentioned in [[https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/| SRA website ]], you can download list of Run accessions from search results page ([[https://www.ncbi.nlm.nih.gov/sra/?term=cancer |- Example offsite image]]) - select Runs of interest by clicking on the checkboxes, click on "Send To", "file", and select "Accession List" in the drop-down menu.
     48Option 3: Modify your environment with vdb-config
     49{{{
     50vdb-config --restore-defaults # To restore your settings
     51vdb-config -i # To use the GUI to enter a different location with "Set Default Import Path". 
     52}}}
     53
     54=== Downloading and processing multiple NCBI SRA samples ===
     55
     56To '''download a list of SRR files''' (such as for all of the samples of a data series) from NCBI, use prefetch.
    5457
    5558Given a set of SRA run accessions listed in a single column in the text file "sraAccList.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set:
    5659
    5760{{{
    58 prefetch --option-file sraAccList.txt
     61prefetch -O output_directory --option-file sraAccList.txt
    5962}}}
    6063
    61 This is current as of prefetch v. 2.9.3 (2.9.3-1).  Note that the default location for downloaded files is in your home directory under ~/ncbi/ncbi_public/sra.  With this default, one can quickly run out of space.  One solution to address this problem is to edit your ~/.ncbi/user-settings.mkfg file to include the following line:
    62 {{{
    63 /repository/user/main/public/root = "/destination/for/big/storage/here"
    64 }}}
     64If you don't specify an output directory, the SRR files will be downloaded to ~/ncbi/ncbi_public/sra (or your configured "Import Path" as described above). 
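prefetch only fetches the SRA archives; each run still has to be converted to fastq.gz. One way to do that is to loop over the same accession list, as in the sketch below. The function name convert_acc_list and the FASTQ_DUMP variable are hypothetical (the variable exists only so the loop can be dry-run with "echo"). Whether fastq-dump reuses the prefetched files or downloads again depends on where prefetch stored them and on your configuration.

```shell
# Sketch: run "fastq-dump --split-3 --gzip" on every accession in a list file.
# Set FASTQ_DUMP=echo to print the commands instead of running them.
convert_acc_list() {
    # $1 = text file with one SRR ID per line
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines
        [ -z "$acc" ] && continue
        # Redirect stdin so the command cannot consume the rest of the list
        "${FASTQ_DUMP:-fastq-dump}" --split-3 --gzip "$acc" < /dev/null
    done < "$1"
}

# Example usage:
# convert_acc_list sraAccList.txt
```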
    6565
    66 or point to the destination folder with -O
     66To get this list of SRR IDs, go to the [[https://www.ncbi.nlm.nih.gov/Traces/study/|SRA Run Selector]] and enter a project accession.  Once on a [[https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP000002|project page]], go to the "Select" section and click on "Accession List".  If you want only a subset of the runs, click on "Metadata" in the "Select" section instead to get a comma-separated file, SraRunTable.txt.
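When only a subset of runs is wanted, the metadata file can be filtered into an accession list. The sketch below assumes that the run accession is the first comma-separated column of SraRunTable.txt and that a simple text match is enough to pick rows; the function name run_table_to_acc_list and the search term are hypothetical. Check the header of your own file before relying on it.

```shell
# Sketch: print the first column (assumed to be the run accession) of every
# non-header row of a metadata CSV that contains a given piece of text.
run_table_to_acc_list() {
    # $1 = metadata CSV (e.g. SraRunTable.txt), $2 = text a row must contain
    awk -F',' -v pat="$2" 'NR > 1 && index($0, pat) > 0 { print $1 }' "$1"
}

# Example usage (search term is a placeholder):
# run_table_to_acc_list SraRunTable.txt "RNA-Seq" > sraAccList.txt
```

The match is a plain substring test on the whole row, so pick a term that cannot appear in an unrelated column.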
    6767
    68 
    69 '''Download Metadata:'''
    70 After searching in [[ https://www.ncbi.nlm.nih.gov/Traces/study/?| SRA Run Selector ]] -> click on "Metadata" in the middle of the page in "Select" section -> This will generate a comma separated file SraRunTable.txt.