| 1 | |
| 2 | = Experimental design of short read sequencing experiments = |
| 3 | |
| 4 | == How long should the reads be? Should they be single or paired-end? == |
| 5 | |
| 6 | * What is the goal of your experiment? |
| 7 | * For typical RNA-seq expression level quantification, a read or read pair gets one count, regardless of its length. As a result, shorter reads may provide equally good data, as long as they aren't so short that mapping to multiple locations becomes a problem. |
| 8 | * Longer and/or paired-end reads are beneficial if the experimental goal is |
| 9 | * novel gene discovery: longer reads are much better at identifying novel splice junctions. |
| 10 | * For variant discovery, coverage is key, whether it's fewer long reads or more shorter reads (as long as the reads are long enough to map uniquely) |
| 11 | * How much of the read length is used for primers, adapters, barcodes, etc.? Make sure that enough of the actual sample sequence is left for effective mapping. |
| 12 | |
| 13 | == If you are able to sequence more than one lane, how should the samples be partitioned? == |
| 14 | |
| 15 | * The magnitude of a lane effect is typically small but non-zero. |
| 16 | * To balance any lane effect, barcode your samples, mix them together, and sequence all of them on each of your lanes. |
| 17 | * Another benefit of barcoding and mixing all samples together is that the samples can be re-sequenced in other lanes in the future (from the same library preparation) without unbalancing the experimental design. |
| 18 | |
| 19 | == How many reads are needed for each sample? == |
| 20 | |
| 21 | == Calculating number of DNA or RNA reads needed to obtain the desired coverage == |
| 22 | |
| 23 | * Some useful references: |
| 24 | * Sims et al., 2014. [http://www.ncbi.nlm.nih.gov/pubmed/24434847 Sequencing depth and coverage: key considerations in genomic analyses.] |
| 25 | * Includes methods to estimate the number of reads required for single nucleotide variant calling, and RNA-seq and ChIP-seq experiments |
| 26 | * Ajay et al., 2011. [http://www.ncbi.nlm.nih.gov/pubmed/21771779/ Accurate and comprehensive sequencing of personal genomes.] |
| 27 | * Includes methods to estimate the number of reads required for single nucleotide variant calling |
| 28 | |
| 29 | * ''Example 1'' (genome sequencing): For a genome of 3e+9 nt, to get 35x coverage we would need: |
| 30 | * For 40-nt reads: |
| 31 | * 3e+9 * 35 / 40 = 2.625e+09 => ~2.6 billion reads |
| 32 | * For 100-nt reads: |
| 33 | * 3e+9 * 35 / 100 = 1.05e+09 => ~1 billion reads |
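
The arithmetic in Example 1 can be sketched as a small Python function. This is a minimal illustration, not a planning tool: the function name `reads_for_coverage` is hypothetical, and it assumes every read maps and that reads are spread evenly over the genome, so real experiments will need more reads than this lower bound.

```python
def reads_for_coverage(genome_size_nt: int, coverage: int, read_length_nt: int) -> int:
    """Reads needed = genome size * coverage / read length, rounded up.

    Assumes all reads map and are uniformly distributed (an idealization).
    """
    total_nt_needed = genome_size_nt * coverage
    # Integer ceiling division, so a partial read counts as one more read.
    return (total_nt_needed + read_length_nt - 1) // read_length_nt


GENOME_SIZE_NT = 3_000_000_000  # 3e+9 nt, as in Example 1
TARGET_COVERAGE = 35            # 35x, as in Example 1

for read_length in (40, 100):
    n_reads = reads_for_coverage(GENOME_SIZE_NT, TARGET_COVERAGE, read_length)
    print(f"{read_length}-nt reads: {n_reads:,} reads needed")
```

For paired-end data, one option is to pass the combined length of both mates as `read_length_nt`, in which case the result counts read pairs rather than individual reads.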
| 34 | |
| 35 | * ''Example 2'' (RNA-seq experiment): |
| 36 | * If we have |
| 37 | * 6 million pairs of 35x35-nt paired-end reads |
| 38 | * a genome with ~7000 genes expressed |
| 39 | * average gene length = 5741 bp |
| 40 | * then the total length of the expressed transcriptome is 7000 x 5741 => 40,187,000 nt |
| 41 | * and the total length of the reads is 6 million x 70 nt [35 + 35] => 420,000,000 nt |
| 42 | * so the average coverage will be 420,000,000 / 40,187,000 => ~10x |
| 43 | * but note that coverage will be very irregular due to the wide range of expression levels |
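
The same calculation for Example 2 can be written as a short Python sketch, using the inputs listed above (6 million read pairs, 35 + 35 nt per pair, ~7000 expressed genes, average gene length 5741 bp). The function name `mean_transcriptome_coverage` is hypothetical, and the result is only an average: it assumes all reads map to expressed genes and says nothing about how unevenly coverage is distributed across genes.

```python
def mean_transcriptome_coverage(
    n_read_pairs: int,
    nt_per_pair: int,
    n_expressed_genes: int,
    mean_gene_length_nt: int,
) -> float:
    """Average coverage = total sequenced nt / total expressed transcriptome nt."""
    total_read_nt = n_read_pairs * nt_per_pair
    transcriptome_nt = n_expressed_genes * mean_gene_length_nt
    return total_read_nt / transcriptome_nt


coverage = mean_transcriptome_coverage(
    n_read_pairs=6_000_000,     # 6 million read pairs
    nt_per_pair=35 + 35,        # both mates of a 35x35-nt pair
    n_expressed_genes=7000,     # ~7000 genes expressed
    mean_gene_length_nt=5741,   # average gene length
)
print(f"Average coverage: ~{coverage:.1f}x")
```

Because expression levels span several orders of magnitude, highly expressed genes will sit far above this average and weakly expressed genes far below it.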
| 44 | |
| 45 | |
| 46 | |