Changes between Initial Version and Version 1 of SOPs/qc_shortReads


Ignore:
Timestamp:
01/23/13 16:49:43 (12 years ago)
Author:
trac
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/qc_shortReads

    v1 v1  
     1= Quality Control and preprocessing of short reads =
     2
     3\\
     4
     5= Analyzing short read quality =
     6
     7\\
     8
     9== FastQC ==
     10
     11http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc
     12  * quality control analysis with nice graphical output
     13  * available for Linux, Windows, and MacOSX
     14  * (but no tools for editing reads)
     15
     16It's installed on tak and LSF and can be run from the command line
     17  * Sample command: ''**fastqc s_1_sequence.txt s_2_sequence.txt**''
     18or interactively (with X Windows):
     19  * ''**fastqc**''
     20Output is a directory (named "s_1_sequence_fastqc" for input "s_1_sequence.txt") with "fastqc_report.html", a web page including all figures. 
     21
     22The "Basic Statistics" section at the top of the FastQC report also shows the Encoding (quality score) information (like "Illumina 1.5"), which may be necessary to specify in subsequent analysis steps.  The Encoding scales are described at [http://en.wikipedia.org/wiki/FASTQ_format#Encoding].
     23
     24
     25== QC Paired-End Reads ==
     26  * QC as above each of the forward and reverse reads separately using QC'ing program (above)
     27  * Get reads/mates after QC'ing that are //perfect// pairs:
     28
     29 {{{
     30    bsub “perl /nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl s_8_1_sequence.txt s_8_2_sequence.txt”
     31}}}
     32
     33  * '''Paired-End Insert Size''': If paired-end insert size or distance is unknown or need to be verified, it can be extracted from a BAM/SAM file after running Bowtie.
     34
     35 {{{
     36    # Run bowtie
     37    bowtie -S /nfs/genomes/mouse_gp_jul_07/bowtie/mm9 -1 s_1_1_sequence.txt -2 s_1_2_sequence.txt -n 1 -p 2 -I 0 -X 500 >| s_1_bowtie.sam
     38}}}
     39
     40{{{
     41   # Using a SAM file (at Unix command prompt)
     42   awk -F "\t" '$9 > 0 {print $9}' s_1_bowtie.sam > s_1_insert_sizes.txt
     43   # Using a BAM file (at Unix command prompt)
     44   samtools view s_1_bowtie.bam | awk -F"\t" '$9 > 0 {print $9}' > s_1_insert_sizes.txt
     45
     46   # and then process column of numbers with R (or Excel)
     47   # In R Session
     48   sizeFile = "s_1_insert_sizes.txt"
     49   sample.name = "My paired reads"
     50   distance = read.delim(sizeFile, h=F)[,1]
     51   pdf(paste(sample.name, "insert.size.histogram.pdf", sep="."), w=11, h=8.5)
     52   hist(distance, breaks=200, col="wheat", main=paste("Insert sizes for", sample.name), xlab="length (nt)")
     53   dev.off()
     54}}}
     55
     56== ShortRead (R package) ==
     57http://www.bioconductor.org/packages/release/bioc/html/ShortRead.html
     58R package, Linux (Tak), Window, Mac \\
     59
     60  * It takes the fastq files, and creates a website with different ways to check the data, and with instruction on how to interpret results. The output files are stored in dest folder (my_qa in this example).
     61  * QC a single file using ShortRead
     62
     63 {{{
     64   library("ShortRead")
     65    # load the data
     66    sr <- readFastq("s1_sequence.txt")
     67    # create a qa object from the ShortRead object
     68    qa <- qa( sr, lane="character" )
     69    # create an html report in the qa directory
     70    report(qa, dest="my_qa")
     71
     72  * QC all *.txt fastq files in a directory using ShortRead
     73
     74    library("ShortRead")
     75    qaSummary <- qa(".", pattern="*.txt", type="fastq")
     76    #create an html report in the qa directory
     77    report(qaSummary, dest="myQC_dir") 
     78}}}
     79
     80== Fastx Toolkit ==
     81http://hannonlab.cshl.edu/fastx_toolkit/
     82galaxy integration, Linux(Tak), MacOSX \\
     83 
     84{{{
     85  # sample commands:
     86   # quality_stats: Sampl Solexa reads file: s_1_1_sequence.txt
     87   fastx_quality_stats -i s_1_1_sequence.txt -o s_1_1_sequence.stats
     88   # a figure for Nucleotide Distribution:
     89   fastx_nucleotide_distribution_graph.sh -i s_1_1_sequence.stats  -o s_1_1_sequence.stats.nuc.png -t "s_1_1_sequence.stats Nucleotide Distribution"
     90   # boxplot:
     91   fastq_quality_boxplot_graph.sh -i s_1_1_sequence.stats -o s_1_1_sequence.stats.quality.png -t "s_1_1_sequence.stats Quality Scores"
     92}}}
     93
     94
     95\\
     96
     97= Modifying a file of short reads based on quality considerations =
     98
     99\\
     100
     101== Remove reads with low quality score: To use FASTX Toolkit to get only reads that are above a quality score along with a certain read length:
     102 
     103{{{
     104  fastq_quality_filter -v -q 20 -p 75 -i myFile.fastq -o myFile.fastq.fastx_trim   
     105           version 0.0.6
     106           [-h]         = This helpful help screen.
     107           [-q N]       = Minimum quality score to keep.
     108           [-p N]       = Minimum percent of bases that must have [-q] quality.
     109           [-z]         = Compress output with GZIP.
     110           [-i INFILE]  = FASTA/Q input file. default is STDIN.
     111           [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
     112           [-v]         = Verbose - report number of sequences.
     113                          If [-o] is specified,  report will be printed to STDOUT.
     114                          If [-o] is not specified (and output goes to STDOUT),
     115                          report will be printed to STDERR.
     116}}}
     117
     118If you get an error like "Invalid quality score value", your fastq file probably has Sanger (offset 33) instead of Illumina (ASCII offset 64) quality scores. 
     119You'll need to add the option "-Q33" to your FASTX Toolkit arguments.
     120
     121== Remove linker (adapter) RNA: ==
     122  * What is the sequence of the linker (adapter) to be removed?
     123    * Biologists generally know which linker (adapter) RNA is used for their sample(s).
     124    * Also or in addition, when you run quality control with shortRead or FASTQC, check out
     125         * repetitive segments in the "over represented sequences" section.
     126         * "Per base sequence content" for any patterns at the beginning of your reads
     127    * See [[http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage|fastx_clipper usage]] (or ''fastx_clipper -h'') for more arguments
     128  * sample command:
     129
     130{{{
     131bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt"
     132In the above command:
     133   -a CTGTAGGCACCATCAAT is the linker sequence
     134   -i  s2_sequence.txt is input solexa fastq file
     135   -v is Verbose [report number of sequences in output and discarded]
     136   -l 22 is to discard sequences shorter than 22 nucleotides
     137   -o s2_ sequence_noLinker.txt is output file.
     138}}}
     139
     140
     141  * If you get the message "Invalid quality score value..." you have the older range of quality scores.
     142    * Add the argument -Q 33, such as
     143    * fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt
     144
     145  * [[http://code.google.com/p/cutadapt/|cutadapt]] is another tool that is designed to find and remove adapters:
     146    * more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches)
     147    * [wiki:SOPs/cutadapt sample usage]
     148
     149== Trim reads to a specified length ==
     150   * If we have reads of different lengths (//i.e.// because we clipped out the adapter sequences), we can trim them to have them all be the same length. Use **fastx_trimmer** for that.
     151   * sample command:
     152
     153 
     154{{{
     155bsub "fastx_trimmer -f 1 -l 22  -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt"
     156     
     157[-i INFILE]  = FASTA/Q input file. default is STDIN.
     158[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
     159[-l N] = Last base to keep
     160[-f N] = First base to keep. Default is 1 (=first base).
     161
     162}}}
     163
     164== Trim end of reads when quality drops below a threshold ==
     165
     166   * sample command:
     167
     168{{{
     169bsub "fastq_quality_trimmer -v -t 20 -l 25 -i input.fastq -o output.fastq"
     170     
     171   [-t N]       = Quality threshold - nucleotides with lower
     172                  quality will be trimmed (from the end of the sequence).
     173   [-l N]       = Minimum length - sequences shorter than this (after trimming)
     174                  will be discarded. Default = 0 = no minimum length.
     175   [-z]         = Compress output with GZIP.
     176   [-i INFILE]  = FASTQ input file. default is STDIN.
     177   [-o OUTFILE] = FASTQ output file. default is STDOUT.
     178   [-v]         = Verbose - report number of sequences.
     179                  If [-o] is specified,  report will be printed to STDOUT.
     180                  If [-o] is not specified (and output goes to STDOUT),
     181                  report will be printed to STDERR.
     182}}}
     183
     184
     185== Remove Duplicates ==
     186  * Remove duplicates, for eg. from PCR
     187
     188 {{{
     189   #samtools command
     190    samtools rmdup [-sS] <input.srt.bam> <output.bam>
     191    -s or -S depending on PE data or not
     192}}}
     193
     194== Galaxy ==
     195
     196http://main.g2.bx.psu.edu/
     197Many functions \\
     198