Changes between Version 54 and Version 55 of SOPs/qc_shortReads


Timestamp:
08/19/20 16:27:50 (4 years ago)
Author:
gbell
Comment:

--

  • SOPs/qc_shortReads

    v54 v55  
    64 64  from [[https://en.wikipedia.org/wiki/FASTQ_format | Wikipedia's FASTQ page]]
    65 65
    66 == Preprocessing read files from NCBI SRA ==
    67 
    68 **SRA** (for Sequence Read Archive) is an NCBI binary format for short reads.
    69 
    70 It's thoroughly described in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]].
    71 
    72 SRA files can be downloaded as compressed fastq in a web browser using [[https://ewels.github.io/sra-explorer/|SRA Explorer]].
    73 
    74 Processing SRA files requires the [[https://ncbi.github.io/sra-tools/|NCBI SRA Toolkit]], which is installed on our systems.
    75 
    76 The main command is **fastq-dump <SRA archive file>**, for example
    77 
    78 ''**fastq-dump SRR060751.sra**''
    79 
    80 If your reads are paired, by default the #1 and #2 reads will end up concatenated together in the same file. 
    81 To check if the SRA sample has paired reads or not, go to the [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=run_browser SRA Run browser], enter the sample ID, and look in the table under "Layout".
    82 
    83 To get matched paired reads into separate files, use a command like
    84 
    85 ''**fastq-dump --split-3 SRR060751.sra**''
    86 
    87 This works the same as "--split-files", except that "--split-3" puts unpaired reads (if any) into a third file.
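
As a quick sanity check after splitting, the two mate files should contain the same number of reads. The following is a minimal sketch, not part of the SRA Toolkit: the helper names count_reads and check_pairs are hypothetical, and it assumes standard 4-line FASTQ records and the usual _1/_2 output naming.

```shell
# Count reads in a FASTQ file (plain or gzipped), assuming 4 lines per read.
count_reads() {
    case "$1" in
        *.gz) echo $(( $(gzip -dc "$1" | wc -l) / 4 )) ;;
        *)    echo $(( $(wc -l < "$1") / 4 )) ;;
    esac
}

# Report both counts and succeed only if the two mate files agree.
check_pairs() {
    _n1=$(count_reads "$1")
    _n2=$(count_reads "$2")
    echo "$1: $_n1 reads; $2: $_n2 reads"
    [ "$_n1" -eq "$_n2" ]
}

# Example (assumed output names for the command above):
# check_pairs SRR060751_1.fastq SRR060751_2.fastq
```

A nonzero exit status from check_pairs suggests an incomplete conversion or a truncated file.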
    88 
    89 You can ask for gzipped output instead of plain fastq:
    90 
    91 ''**fastq-dump --gzip SRR060751.sra**''
    92 
    93 See [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|Converting SRA format data into FASTQ]] for all program options.
    94 
    95 Note that a fastq file is about 4-5x larger than its corresponding SRA file.
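
The 4-5x rule of thumb can be turned into a rough space check before converting. This is only a sketch: the function names estimate_fastq_mb and free_mb are hypothetical, and the multipliers are the approximate figures quoted above, not a guarantee.

```shell
# Print a rough uncompressed FASTQ size range (in MB) for an SRA file,
# using the approximate 4-5x expansion factor.
estimate_fastq_mb() {
    _mb=$(( ($(wc -c < "$1") + 1048575) / 1048576 ))   # size rounded up to whole MB
    echo "$(( _mb * 4 ))-$(( _mb * 5 )) MB"
}

# Print the free space (in MB) on the filesystem holding the given directory.
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Example:
# estimate_fastq_mb SRR060751.sra
# free_mb .
```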
    96 
    97 fastq-dump can download/fetch the SRA file itself, or you can download the SRA file directly (e.g. using wget) and then run fastq-dump on it to get the fastq file.  Downloading the SRA file directly avoids having to change the cache location in your home directory for large files (see below).
    98 
    99 '''Note:''' As of fastq-dump version 2.8.1, fastq-dump requires vdb-config to be set up correctly.  By default, downloaded/cached files are written to the user's home directory, which is likely to run out of space.  To choose a different location, run:
    100 
    101 {{{
    102 vdb-config --restore-defaults
    103 vdb-config -i #use the GUI to enter a different location. 
    104 }}}
    105 
    106 Manually editing the file $HOME/.ncbi/user-settings.mkfg doesn't seem to work.  See [[https://ncbi.github.io/sra-tools/install_config.html | NCBI SRA Installation/Config]].  Other alternatives: i) symlink the NCBI directory in your home directory to a location with more storage, or ii) download the SRA file directly (e.g. using wget) before running fastq-dump.
    107 
    108 {{{
    109 #download SRR4090409.sra (e.g. use wget) from SRA and convert to fastq
    110 fastq-dump SRR4090409.sra
    111 
    112 #download SRA file via fastq-dump (important: home directory or vdb-config file must be set up correctly), and convert to fastq
    113 fastq-dump SRR4090409
    114 }}}
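
The symlink alternative mentioned above could look roughly like the sketch below. The function name relocate_cache is hypothetical, and both paths in the example are assumptions: check which directory the toolkit actually uses on your account and substitute your own storage location.

```shell
# Move a cache directory to larger storage and leave a symlink in its place.
# $1: existing cache directory, $2: destination on a larger volume
relocate_cache() {
    _cache="$1"
    _dest="$2"
    if [ -L "$_cache" ]; then
        echo "$_cache is already a symlink; nothing to do"
        return 0
    fi
    mkdir -p "$_dest" || return 1
    if [ -d "$_cache" ]; then
        # Copy existing contents first; remove the original only if the copy succeeded
        cp -R "$_cache"/. "$_dest"/ && rm -rf "$_cache" || return 1
    fi
    ln -s "$_dest" "$_cache"
}

# Example (both paths are placeholders):
# relocate_cache "$HOME/ncbi" /destination/for/big/storage/here/ncbi
```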
    115 
    116 In order to '''download a list of SRA files''' from NCBI, it is convenient to use prefetch.
    117 
    118 As described on the [[https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/|SRA website]], you can download a list of Run accessions from the search results page ([[https://www.ncbi.nlm.nih.gov/sra/?term=cancer|example search results]]): select the Runs of interest by clicking on the checkboxes, click on "Send to", choose "File", and select "Accession List" in the drop-down menu.
    119 
    120 Given a set of SRA files listed in a single column in the text file "SraAccList.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set:
    121 
    122 {{{
    123 prefetch --option-file SraAccList.txt
    124 }}}
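
Once the runs are prefetched, the same list can drive the conversion to fastq. The sketch below is one possible way to do it, not an official workflow: read_accessions, convert_all, and the DRY_RUN variable are hypothetical names, and the accession pattern (SRR/ERR/DRR followed by digits) is an assumption about what the list contains.

```shell
# Print valid run accessions from a one-column list, ignoring blank lines
# and Windows line endings; anything unrecognized is reported on stderr.
read_accessions() {
    tr -d '\r' < "$1" | while IFS= read -r _acc || [ -n "$_acc" ]; do
        case "$_acc" in
            '') ;;
            [SED]RR*[!0-9]*) echo "skipping unrecognized line: $_acc" >&2 ;;
            [SED]RR[0-9]*)   echo "$_acc" ;;
            *)               echo "skipping unrecognized line: $_acc" >&2 ;;
        esac
    done
}

# Convert every accession in the list; with DRY_RUN=1 only print the commands.
convert_all() {
    read_accessions "$1" | while IFS= read -r _acc; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "fastq-dump --split-3 --gzip $_acc"
        else
            fastq-dump --split-3 --gzip "$_acc" || echo "failed: $_acc" >&2
        fi
    done
}

# Example:
# DRY_RUN=1 convert_all SraAccList.txt   # preview the commands
# convert_all SraAccList.txt             # run them
```

Previewing with DRY_RUN=1 first makes it easy to spot stray lines in the list before any long-running conversion starts.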
    125 
    126 This is current as of prefetch v. 2.9.3 (2.9.3-1).  Note that the default location for downloaded files is in your home directory under ~/ncbi/ncbi_public/sra, so you can quickly run out of space.  One way to address this is to edit your ~/.ncbi/user-settings.mkfg file to include the following line:
    127 {{{
    128 /repository/user/main/public/root = "/destination/for/big/storage/here"
    129 }}}
    130 
    131 or pass the destination folder to prefetch with -O.
    132 
    133 
    134 '''Download Metadata:'''
    135 On your GEO series page, click on the SRA link, click on "Send to" at the top of the page, check the "File" radio button, and select "RunInfo" in the pull-down menu. This generates a tabular SraRunInfo.csv file with the metadata available for each Run.
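
The RunInfo file can also be used to check which runs are paired without visiting the Run browser. This is a rough sketch under stated assumptions: the function name run_layouts is hypothetical, the column names "Run" and "LibraryLayout" are assumed to be present in the header, and fields are split on plain commas, so a quoted field containing a comma would shift the columns.

```shell
# Print "Run<TAB>LibraryLayout" for each data row of a RunInfo CSV.
# Columns are located by header name rather than by position.
run_layouts() {
    awk -F',' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Run") run = i
                if ($i == "LibraryLayout") layout = i
            }
            if (!run || !layout) {
                print "missing Run or LibraryLayout column" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Skip blank rows and any repeated header rows
        $run != "" && $run != "Run" { print $run "\t" $layout }
    ' "$1"
}

# Example:
# run_layouts SraRunInfo.csv
```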
    136 66
    137 67