wiki:SOPs/qc_shortReads

Version 1 (modified by trac, 12 years ago) ( diff )

--

Quality Control and preprocessing of short reads


Analyzing short read quality


FastQC

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc

  • quality control analysis with nice graphical output
  • available for Linux, Windows, and MacOSX
  • (but no tools for editing reads)

It's installed on tak and LSF and can be run from the command line

  • Sample command: fastqc s_1_sequence.txt s_2_sequence.txt

or interactively (with X Windows):

  • fastqc

Output is a directory (named "s_1_sequence_fastqc" for input "s_1_sequence.txt") with "fastqc_report.html", a web page including all figures.

The "Basic Statistics" section at the top of the FastQC report also shows the Encoding (quality score) information (like "Illumina 1.5"), which may be necessary to specify in subsequent analysis steps. The Encoding scales are described at http://en.wikipedia.org/wiki/FASTQ_format#Encoding.

QC Paired-End Reads

  • QC as above each of the forward and reverse reads separately using QC'ing program (above)
  • Get reads/mates after QC'ing that are perfect pairs:
   bsub “perl /nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl s_8_1_sequence.txt s_8_2_sequence.txt”
  • Paired-End Insert Size: If paired-end insert size or distance is unknown or need to be verified, it can be extracted from a BAM/SAM file after running Bowtie.
   # Run bowtie
   bowtie -S /nfs/genomes/mouse_gp_jul_07/bowtie/mm9 -1 s_1_1_sequence.txt -2 s_1_2_sequence.txt -n 1 -p 2 -I 0 -X 500 >| s_1_bowtie.sam
   # Using a SAM file (at Unix command prompt)
   awk -F "\t" '$9 > 0 {print $9}' s_1_bowtie.sam > s_1_insert_sizes.txt
   # Using a BAM file (at Unix command prompt)
   samtools view s_1_bowtie.bam | awk -F"\t" '$9 > 0 {print $9}' > s_1_insert_sizes.txt

   # and then process column of numbers with R (or Excel)
   # In R Session
   sizeFile = "s_1_insert_sizes.txt"
   sample.name = "My paired reads"
   distance = read.delim(sizeFile, h=F)[,1]
   pdf(paste(sample.name, "insert.size.histogram.pdf", sep="."), w=11, h=8.5)
   hist(distance, breaks=200, col="wheat", main=paste("Insert sizes for", sample.name), xlab="length (nt)")
   dev.off()

ShortRead (R package)

http://www.bioconductor.org/packages/release/bioc/html/ShortRead.html R package, Linux (Tak), Window, Mac

  • It takes the fastq files, and creates a website with different ways to check the data, and with instruction on how to interpret results. The output files are stored in dest folder (my_qa in this example).
  • QC a single file using ShortRead
  library("ShortRead")
   # load the data
   sr <- readFastq("s1_sequence.txt")
   # create a qa object from the ShortRead object
   qa <- qa( sr, lane="character" )
   # create an html report in the qa directory
   report(qa, dest="my_qa")

 * QC all *.txt fastq files in a directory using ShortRead

   library("ShortRead")
   qaSummary <- qa(".", pattern="*.txt", type="fastq")
   #create an html report in the qa directory
   report(qaSummary, dest="myQC_dir")  

Fastx Toolkit

http://hannonlab.cshl.edu/fastx_toolkit/ galaxy integration, Linux(Tak), MacOSX

  # sample commands: 
   # quality_stats: Sampl Solexa reads file: s_1_1_sequence.txt
   fastx_quality_stats -i s_1_1_sequence.txt -o s_1_1_sequence.stats
   # a figure for Nucleotide Distribution:
   fastx_nucleotide_distribution_graph.sh -i s_1_1_sequence.stats  -o s_1_1_sequence.stats.nuc.png -t "s_1_1_sequence.stats Nucleotide Distribution"
   # boxplot:
   fastq_quality_boxplot_graph.sh -i s_1_1_sequence.stats -o s_1_1_sequence.stats.quality.png -t "s_1_1_sequence.stats Quality Scores"


Modifying a file of short reads based on quality considerations


Remove reads with low quality score: To use FASTX Toolkit to get only reads that are above a quality score along with a certain read length:

  fastq_quality_filter -v -q 20 -p 75 -i myFile.fastq -o myFile.fastq.fastx_trim   
           version 0.0.6
	   [-h]         = This helpful help screen.
	   [-q N]       = Minimum quality score to keep.
	   [-p N]       = Minimum percent of bases that must have [-q] quality.
	   [-z]         = Compress output with GZIP.
	   [-i INFILE]  = FASTA/Q input file. default is STDIN.
	   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
	   [-v]         = Verbose - report number of sequences.
			  If [-o] is specified,  report will be printed to STDOUT.
			  If [-o] is not specified (and output goes to STDOUT),
			  report will be printed to STDERR.

If you get an error like "Invalid quality score value", your fastq file probably has Sanger (offset 33) instead of Illumina (ASCII offset 64) quality scores. You'll need to add the option "-Q33" to your FASTX Toolkit arguments.

Remove linker (adapter) RNA:

  • What is the sequence of the linker (adapter) to be removed?
    • Biologists generally know which linker (adapter) RNA is used for their sample(s).
    • Also or in addition, when you run quality control with shortRead or FASTQC, check out
      • repetitive segments in the "over represented sequences" section.
      • "Per base sequence content" for any patterns at the beginning of your reads
    • See fastx_clipper usage (or fastx_clipper -h) for more arguments
  • sample command:
bsub "fastx_clipper -a CTGTAGGCACCATCAAT -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt"
In the above command: 
   -a CTGTAGGCACCATCAAT is the linker sequence
   -i  s2_sequence.txt is input solexa fastq file
   -v is Verbose [report number of sequences in output and discarded]
   -l 22 is to discard sequences shorter than 22 nucleotides
   -o s2_ sequence_noLinker.txt is output file.
  • If you get the message "Invalid quality score value..." you have the older range of quality scores.
    • Add the argument -Q 33, such as
    • fastx_clipper -a CTGTAGGCACCATCAAT -Q 33 -i s2_sequence.txt -v -l 22 -o s2_sequence_noLinker.txt
  • cutadapt is another tool that is designed to find and remove adapters:
    • more options than fastx_clipper, such as specifically trimming 5' or 3' adapters and specifying error rate (allowed mismatches)
    • sample usage

Trim reads to a specified length

  • If we have reads of different lengths (i.e. because we clipped out the adapter sequences), we can trim them to have them all be the same length. Use fastx_trimmer for that.
  • sample command:

bsub "fastx_trimmer -f 1 -l 22  -i s7_sequence_clipped.txt -o s7_sequence_clipped_trimmed.txt"
      
[-i INFILE]  = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-l N] = Last base to keep 
[-f N] = First base to keep. Default is 1 (=first base).

Trim end of reads when quality drops below a threshold

  • sample command:
bsub "fastq_quality_trimmer -v -t 20 -l 25 -i input.fastq -o output.fastq"
      
   [-t N]       = Quality threshold - nucleotides with lower
                  quality will be trimmed (from the end of the sequence).
   [-l N]       = Minimum length - sequences shorter than this (after trimming)
                  will be discarded. Default = 0 = no minimum length.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTQ input file. default is STDIN.
   [-o OUTFILE] = FASTQ output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
                  If [-o] is specified,  report will be printed to STDOUT.
                  If [-o] is not specified (and output goes to STDOUT),
                  report will be printed to STDERR.

Remove Duplicates

  • Remove duplicates, for eg. from PCR
  #samtools command
   samtools rmdup [-sS] <input.srt.bam> <output.bam>
   -s or -S depending on PE data or not

Galaxy

http://main.g2.bx.psu.edu/ Many functions

Attachments (3)

Download all attachments as: .zip

Note: See TracWiki for help on using the wiki.