= Quality Control and preprocessing of short reads = = FASTQ:= == Format == Each entry in a FASTQ file consists of four lines: • Sequence identifier • Sequence • Quality score identifier line (consisting of a +) • Quality score == Naming == * Ensure your fastq headers conforms to the standarad convention listed below, otherwise, downstream analysis (eg. aligners, counting, etc.) may behave differently than intended. FASTQ header/naming scheme as specified by Illumina's CASAVA pipeline, [[http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm | Illumina (Casava 1.8.2):]]\\ @:::::: ::: ||'''Element'''||'''Requirements'''||'''Description'''|| ||@||@||Each sequence identifier line starts with @|| || ||Characters allowed: a-z, A-Z, 0-9 and underscore||Instrument ID (eg. HWI-DXXXX for HiSeq 2500)|| || ||Numerical||Run number on instrument|| || ||Characters allowed: a-z, A-Z, 0-9|| || ||||Numerical||Lane number|| ||||Numerical||Tile number|| ||||Numerical||X coordinate of cluster|| ||||Numerical||Y coordinate of cluster|| ||||Numerical||Read number. 1 can be single read or read 2 of paired-end|| ||||Y or N||Y if the read is filtered, N otherwise|| ||||Numerical||0 when none of the control bits are on, otherwise it is an even number|| ||||ACTG||Index sequence|| \\ Illumina (Casava 1.7)\\ @::::#/ ||'''Element'''||'''Requirements'''||'''Description'''|| ||@||@||Each sequence identifier line starts with @|| || ||Characters allowed: a-z, A-Z, 0-9 and underscore||Instrument ID (eg. HWI-DXXXX for HiSeq 2500)|| ||||Numerical||Lane number|| ||||Numerical||Tile number|| ||||Numerical||X coordinate of cluster|| ||||Numerical||Y coordinate of cluster|| ||#||0 or 1||0 means no index, 1 means indexed sample|| ||/||Numerical||Read number. 1 can be single read or read 2 of paired-end|| \\ = Analyzing short read quality (before mapping) = \\ == Quality scoring introduction == * Quality scores are typically represented using a Phred scoring scheme, where a read quality value = −10 * log10 (error probability) * For example, * Quality = 10 => error rate = 10% => base call has 90% confidence * Quality = 20 => error rate = 1% => base call has 99% confidence * Quality = 30 => error rate = 0.1% => base call has 99.9% confidence * See [[http://en.wikipedia.org/wiki/Phred_quality_score|Phred quality score]] for more details. [[Image(fastq_phread-base.png)]] from [[https://en.wikipedia.org/wiki/FASTQ_format | Wikipedia's FASTQ page]] == Preprocessing read files from NCBI SRA == **SRA** (for Sequence Read Archive) is a NCBI binary format for short reads. It's thoroughly described in the [[http://www.ncbi.nlm.nih.gov/books/NBK47528/|SRA Handbook]] SRA files can be downloaded as compressed fastq in a web browser using [[https://ewels.github.io/sra-explorer/|SRA Explorer]]. Processing SRA files requires the [[https://ncbi.github.io/sra-tools/|NCBI SRA Toolkit]], which is installed on our systems. The main command is **fastq-dump **, like ''**fastq-dump SRR060751.sra**'' If your reads are paired, by default the #1 and #2 reads will end up concatenated together in the same file. To check if the SRA sample has paired reads or not, go to the [https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=run_browser SRA Run browser], enter the sample ID, and look in the table under "Layout". To get matched paired reads into separate files, use a command like ''**fastq-dump --split-3 SRR060751.sra**'' This works the same as using the "--split-files", but "--split-3" puts unpaired reads (if any) into a third file. You can ask for gzipped output instead of typical fastq: ''**fastq-dump --gzip SRR060751.sra**'' See [[https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump|Converting SRA format data into FASTQ]] for all program options. Note that a fastq file is about 4-5x larger than its corresponding SRA file. fastq-dump can be used to download/fetch the SRA file, or you can download (eg. using wget) the SRA file directly and then run fastq-dump to get the fastq file. Downloading SRA file directly will avoid changing home dir path for large file (see below). '''Note:''' As of fastq-dump version 2.8.1, running fastq-dump will require the vdb-config to be set up correctly. By default, downloaded/cache file is copied to the user's home directory, which is likely to run out of space. Run, {{{ vdb-config --restore-defaults vdb-config -i #use the GUI to enter a different location. }}} Manually editing the file, $HOME/.ncbi/user-settings.mkfg, doesn't seem to work. See [[https://ncbi.github.io/sra-tools/install_config.html | NCBI SRA Installation/Config]]. Other alternatives: i) simply symlink the NCBI directory in your home directory to somewhere else with larger storage, or ii) download the SRA file directly (eg. using wget) before using fastq-dump. {{{ #download SRR4090409.sra (e.g. use wget) from SRA and convert to fastq fastq-dump SRR4090409.sra #download SRA file via fastq-dump (important: home directory or vdb-config file must be set up correctly), and convert to fastq fastq-dump SRR4090409 }}} In order to '''download a list of SRA files''' from NCBI, it is convenient to use prefetch. As mentioned in [[https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/| SRA website ]], you can download list of Run accessions from search results page ([[https://www.ncbi.nlm.nih.gov/sra/?term=cancer |- Example offsite image]]) - select Runs of interest by clicking on the checkboxes, click on "Send To", "file", and select "Accession List" in the drop-down menu. Given a set of SRA files listed in a single column in the text file "SraAccList.txt" (e.g. SRR7623010, SRR7623011, etc.), the following command will download the entire set: {{{ prefetch --option-file sraAccList.txt }}} This is current as of prefetch v. 2.9.3 (2.9.3-1). Note that the default location for downloaded files is in your home directory under ~/ncbi/ncbi_public/sra. With this default, one can quickly run out of space. One solution to address this problem is to edit your ~/.ncbi/user-settings.mkfg file to include the following line: {{{ /repository/user/main/public/root = "/destination/for/big/storage/here" }}} or point to the destination folder with -O '''Download Metadata:''' When in your GEO series page, click on SRA link -> click on "Send to" on the top of the page -> check the "File" radiobutton, and select "RunInfo" in pull-down menu. This will generate a tabular SraRunInfo.csv file with metadata available for each Run. == FastQC == [[http://www.bioinformatics.babraham.ac.uk/projects/fastqc/|FastQC]] * quality control analysis with nice graphical output * available for Linux, Windows, and MacOSX * (but no tools for editing reads) It's installed on tak and LSF and can be run from the command line * Sample command 1 (fastq inputs): ''**fastqc s_1_sequence.txt s_2_sequence.txt**'' * Sample command 2 (fastq.gz inputs): ''**fastqc s_1_sequence.txt.gz s_2_sequence.txt.gz**'' or interactively (with X Windows): * ''**fastqc**'' Output is a directory (named "s_1_sequence_fastqc" for input "s_1_sequence.txt") with "fastqc_report.html", a web page including all figures. The "Basic Statistics" section at the top of the FastQC report also shows the Encoding (quality score) information (like "Illumina 1.5"), which may be necessary to specify in subsequent analysis steps. The Encoding scales are described at [http://en.wikipedia.org/wiki/FASTQ_format#Encoding]. == QC for paired-end reads == * QC as above each of the forward and reverse reads separately using QC'ing program (above). * If reads reads are removed, get reads/mates after QC'ing that are //perfect// pairs: {{{ bsub "/nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl s_8_1_sequence.txt s_8_2_sequence.txt" # fastq inputs }}} == ShortRead (R package) == [[http://www.bioconductor.org/packages/release/bioc/html/ShortRead.html| ShortRead]] R package, Linux (Tak), Window, Mac \\ * It takes the fastq files, and creates a website with different ways to check the data, and with instruction on how to interpret results. The output files are stored in dest folder (my_qa in this example). * QC a single file using ShortRead {{{ library("ShortRead") # load the data sr <- readFastq("s1_sequence.txt") # create a qa object from the ShortRead object qa <- qa( sr, lane="character" ) # create an html report in the qa directory report(qa, dest="my_qa") }}} * QC all *.txt fastq files in a directory using ShortRead. {{{ library("ShortRead") qaSummary <- qa(".", pattern="*.txt", type="fastq") #create an html report in the qa directory report(qaSummary, dest="myQC_dir") }}} == Fastx Toolkit == [[http://hannonlab.cshl.edu/fastx_toolkit/|FastX Toolkit]] galaxy integration, Linux(Tak), MacOSX \\ '''Note:''' fastq_quality_filter v 0.0.14 may have a bug where it reports "got empty array at fastq_quality_filter.c:97" and/or segmentation fault. {{{ # Sample commands: # quality_stats: Sample Solexa reads file: s_1_1_sequence.txt or s_1_1_sequence.txt.gz fastx_quality_stats -i s_1_1_sequence.txt -o s_1_1_sequence.stats # fastq input gunzip -c s_1_1_sequence.txt.gz | fastx_quality_stats -o s_1_1_sequence.stats # fastq.gz input # a figure for Nucleotide Distribution: fastx_nucleotide_distribution_graph.sh -i s_1_1_sequence.stats -o s_1_1_sequence.stats.nuc.png -t "s_1_1_sequence.stats Nucleotide Distribution" # boxplot: fastq_quality_boxplot_graph.sh -i s_1_1_sequence.stats -o s_1_1_sequence.stats.quality.png -t "s_1_1_sequence.stats Quality Scores" }}} \\ == PrinSeq == [[http://prinseq.sourceforge.net | PrinSeq]] gives summary statistics and QC info, eg. exact duplicate and other options, includes visual summary on a fastq file. {{{ #Get html summary of output prinseq-lite.pl -fastq myFile.fastq -graph_data file.gd -out_good null -out_bad null prinseq-graphs.pl -i file.gd -html_all -o prinseq_outFile #remove exact duplicates (drep=1) and create a new fastq file (of unique sequences) prinseq-lite.pl -fastq myFile.fastq -derep 1 -out_good myFile.unique.fq -out_bad myFile.dup.fastq }}} = Modifying a file of short reads based on quality considerations = \\ == Remove reads with low quality score: To use FASTX Toolkit to get only reads that are above a quality score along with a certain read length: {{{ fastq_quality_filter -v -q 20 -p 75 -i myFile.fastq -o myFile.fastq.fastx_trim # fastq input and output gunzip -c myFile.fq.gz | fastq_quality_filter -v -q 20 -p 75 -z -o myFile.fastq.fastx_trim.gz # fastq.gz input and output [-h] = This helpful help screen. [-q N] = Minimum quality score to keep. [-p N] = Minimum percent of bases that must have [-q] quality. [-z] = Compress output with GZIP. [-i INFILE] = FASTA/Q input file. default is STDIN. [-o OUTFILE] = FASTA/Q output file. default is STDOUT. [-v] = Verbose - report number of sequences. If [-o] is specified, report will be printed to STDOUT. If [-o] is not specified (and output goes to STDOUT), report will be printed to STDERR. }}} If you get an error like "Invalid quality score value" or segmentation fault, your fastq file probably has Sanger (offset 33) instead of Illumina (ASCII offset 64) quality scores. You'll need to add the option "-Q33" to your FASTX Toolkit arguments. == Trim end of reads when quality drops below a threshold == * sample command: {{{ bsub "fastq_quality_trimmer -v -t 20 -l 25 -i input.fastq -o output.fastq" # fastq input and output bsub "gunzip -c input.fastq.gz | fastq_quality_trimmer -v -t 20 -l 25 -z -o output.fastq.gz" # fastq.gz input and output [-t N] = Quality threshold - nucleotides with lower quality will be trimmed (from the end of the sequence). [-l N] = Minimum length - sequences shorter than this (after trimming) will be discarded. Default = 0 = no minimum length. [-z] = Compress output with GZIP. [-i INFILE] = FASTQ input file. default is STDIN. [-o OUTFILE] = FASTQ output file. default is STDOUT. [-v] = Verbose - report number of sequences. If [-o] is specified, report will be printed to STDOUT. If [-o] is not specified (and output goes to STDOUT), report will be printed to STDERR. }}} \\ = Modifying a file of short reads in other ways = \\ == Remove linker (adapter) RNA: == * What is the sequence of the linker (adapter) to be removed? * Biologists generally know which linker (adapter) RNA is used for their sample(s). * For experiments using Illumina libraries, [[https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html|What sequences do I use for adapter trimming?]] may help you find the correct adapter sequence(s). The above adapters may be used with KAPA kits as well. * Also or in addition, when you run quality control with shortRead or FASTQC, check out * repetitive segments in the "over represented sequences" section. * "Per base sequence content" for any patterns at the beginning of your reads * '''Method 1''': [[https://cutadapt.readthedocs.io/en/stable/|cutadapt]] is a good tool that is designed to find and remove adapters: * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches) * much more conservative than fastx_clipper. * In paired-end mode, if a read is removed from one file, its pair will be removed from the other file * [https://cutadapt.readthedocs.io/en/stable/guide.html|User guide] * a vs b options from [[http://journal.embnet.org/index.php/embnetjournal/article/view/200|EMBnet.journal]] [[Image(cutadapt.2.jpeg,500px)]] * sample commands: {{{ # single-end reads bsub cutadapt -a GATCGGAAGAGCTCGTATGCCGTCTT -o Reads_noAdapter.fastq Reads.fastq In the above command: -a: Sequence of an adapter that was ligated to the 3' end. -o: output file name # paired-end reads bsub cutadapt -a GATCGGAAGAGCTCGTATGCCGTCTT -o Reads_trimmed.1.fq -p Reads_trimmed.2.fq Reads.1.fq Reads.2.fq # use --quality-base=64 option for Phred+64 (Illumina 1.5) }}} * '''Method 2''': [[https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/|Trim Galore!]] combines cutadapt, FastQC, and the ability to guess what your adapters, if -a/-a2 options are not provided: * Sample command {{{ # Usage (single-end reads): trim_galore [options] trim_galore --phred64 --fastqc -o my_trimmed_reads raw_reads/My_reads.fq.gz # Usage (paired-end reads; discard pair if at least one read is too short): trim_galore --paired --phred64 --fastqc -o my_trimmed_reads raw_reads/My_reads_1.fq.gz raw_reads/My_reads_2.fq.gz # Get all options trim_galore --help --phred64 ==> Use Illumina 1.5 encoding quality scores --fastqc ==> Run FastQC after trimming -o ==> Print output files in this directory instead of the current directory }}} * '''Method 3''': [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper]] * Sample command: {{{ bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt" # fastq input and output bsub "gunzip -c s2_sequence.txt | fastx_clipper -a CTGTAGGCACCATCAAT -v -l 22 -z -o s2_sequence_noLinker.txt.gz" # fastq.gz input and output In the above command: -a CTGTAGGCACCATCAAT is the linker sequence -i s2_sequence.txt is input solexa fastq file -v is Verbose [report number of sequences in output and discarded] -l 22 is to discard sequences shorter than 22 nucleotides -o s2_ sequence_noLinker.txt is output file. }}} * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments * If you get the message "Invalid quality score value..." you have the older range of quality scores. * Add the argument -Q 33, such as * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt == Trim reads to a specified length == * If we have reads of different lengths (//i.e.// because we clipped out the adapter sequences), we can trim them to have them all be the same length. Use **fastx_trimmer** for that. * sample command: {{{ bsub "fastx_trimmer -f 1 -l 22 -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt" # fastq input and output bsub "gunzip -c s7_sequence_clipped.txt | fastx_trimmer -f 1 -l 22 -z -o s7_sequence_clipped_trimmed.txt.gz" # fastq.gz input and output [-i INFILE] = FASTA/Q input file. default is STDIN. [-o OUTFILE] = FASTA/Q output file. default is STDOUT. [-l N] = Last base to keep [-f N] = First base to keep. Default is 1 (=first base). }}} == Select reads that are paired [for paired-end sequencing] == During quality control, if low-quality reads have been removed for any reason, some reads may not have a paired end at the other end. This can cause problems with mapping programs. Sample command: {{{ /nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl sequence.1_1.filt.txt sequence.1_2.filt.txt # fastq inputs }}} Output files will be * *unique.out (reads that are only in the "1" or "2" set; 2 files) and * *common.out (reads that are in both the "1" and "2" set; 2 files). The *common.out reads should be used for paired-read mapping. \\ = Analyzing potential species composition/contamination of reads in a fastq file = \\ * To get a quick preview of what genomes/collections (or vectors or other contaminants) of sequences the reads in a fastq file can map to, one can use the FastQ Screen tool. * Information about FastQ Screen can be found at this page: http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/ * FastQ Screen allows you to screen your fastq file against a set of libraries, which can be set up to contain the genomes of interest, potential contaminating genomes, vectors, sequencing adaptors, ribosomal RNAs, or other contaminants commonly seen in sequencing experiments. This allows you to see if composition of the reads matches with what you expect and, where the contamination might have come from, if any. * FastQ Screen uses bowtie or bowtie2 for mapping. The libraries you wish to screen against need to be bowtie-indexed. * Paths to the bowtie indexed libraries need to be specified in the fastq_screen.conf file, which is called either from the same directory where the fastq_screen program is (by default) or from a manually specified location using the -conf /path/to/.conf file option * Sample commands are: {{{ Usage: fastq_screen [OPTIONS] file.fq eg. fastq_screen --aligner bowtie2 myFastQ.txt Commonly used options are: [--aligner] Specify 'bowtie' or bowtie2' to use for the mapping. [--outdir] Specify a directory in which to save output files. [--illumina1_3] Assume that the quality values are in encoded in Illumina v1.3 format. Defaults to Sanger format if this flag is not specified. [--conf] Manually specify a location for the configuration file to be used for this run. If not specified then the file will be taken from the same directory as the fastq_screen program. Note: the config file is already setup on our internal server }}} = Analyzing short read quality (after mapping) = See [[http://barcwiki.wi.mit.edu/wiki/SOPs/SAMBAMqc|SAM/BAM quality control]]