Changes between Version 29 and Version 30 of SOPs/InProgress
- Timestamp: 12/07/15 16:07:16
SOPs/InProgress
== What read length should you use? Should the reads be single or paired-end? ==

 * What is the goal of your experiment?
   * For typical RNA-seq expression level quantification, a read or read pair gets one count, regardless of its length. As a result, shorter reads may provide data that are just as good, as long as they aren't so short that repetitive mapping is a problem.
   * Longer and/or paired reads are surely beneficial if the experimental goal is
     * novel gene discovery: longer reads are much better at identifying novel splice junctions
   * For variant discovery, coverage is key, whether it comes from fewer long reads or more shorter reads (as long as the reads are long enough to map uniquely)

== If you are able to sequence more than one lane, how should the samples be divided? ==

 * The magnitude of a lane effect is typically small but probably non-zero.
 * To balance any lane effect, sequence all of your samples on each of your lanes.
 * Another benefit of barcoding and mixing all samples together is that the samples can be re-sequenced in other lanes in the future (from the same library preparation) without unbalancing the experimental design.

== Calculating the number of DNA or RNA reads needed to obtain the desired coverage ==

 * Some useful references:
   * Sims et al., 2014. [http://www.ncbi.nlm.nih.gov/pubmed/24434847 Sequencing depth and coverage: key considerations in genomic analyses.]
     * Includes methods to estimate the number of reads required for single nucleotide variant calling, and for RNA-seq and ChIP-seq experiments
   * Ajay et al., 2011. [http://www.ncbi.nlm.nih.gov/pubmed/21771779/ Accurate and comprehensive sequencing of personal genomes.]
     * Includes methods to estimate the number of reads required for single nucleotide variant calling

 * ''Example 1'' (genome sequencing): For a genome of 3e+9 nt, to get 35x coverage we would need:
   * For 40-nt reads:
     * 3e+9 * 35 / 40 = 2.625e+09 => ~2.6 billion reads
   * For 100-nt reads:
     * 3e+9 * 35 / 100 = 1.05e+09 => ~1 billion reads

 * ''Example 2'' (RNA-seq experiment):
   * If we have
     * 6 million 35x35-nt paired-end reads
     * a genome with ~7000 genes expressed
     * average gene length = 5741 bp
   * then the total length of the transcriptome is 7000 x 5741 => 40,187,000 nt
   * and the total length of the reads is 6 million x 70 nt [35 + 35] => 420,000,000 nt
   * so the average coverage will be 420,000,000 / 40,187,000 => ~10x
   * but note that coverage will be very irregular due to a wide range of expression levels
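
The arithmetic in both examples can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `reads_needed` and `average_coverage` are hypothetical helpers introduced here for illustration, and the calculation assumes that every read maps and that reads are spread evenly over the target, which real data will not satisfy (especially RNA-seq).

```python
import math


def reads_needed(target_size_nt, coverage, read_length_nt):
    """Reads (or read pairs, if read_length_nt is the pair total) needed
    to reach the requested average coverage of a target of the given size."""
    return math.ceil(target_size_nt * coverage / read_length_nt)


def average_coverage(num_reads, nt_per_read, target_size_nt):
    """Average coverage obtained from num_reads reads (or read pairs)."""
    return num_reads * nt_per_read / target_size_nt


# Example 1: 3e+9 nt genome at 35x coverage
reads_40 = reads_needed(3e9, 35, 40)      # 40-nt reads
reads_100 = reads_needed(3e9, 35, 100)    # 100-nt reads

# Example 2: 6 million 35x35-nt read pairs, ~7000 expressed genes,
# average gene length 5741 bp
transcriptome_nt = 7000 * 5741            # 40,187,000 nt
rna_coverage = average_coverage(6_000_000, 35 + 35, transcriptome_nt)

print(f"40-nt reads needed:  {reads_40:,}")
print(f"100-nt reads needed: {reads_100:,}")
print(f"Transcriptome size:  {transcriptome_nt:,} nt")
print(f"Average RNA-seq coverage: {rna_coverage:.1f}x")
```

For paired-end data, pass the combined length of both mates (for example 35 + 35) as the read length, so that the result is a number of read pairs rather than individual reads.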