Context Navigation

Experimental design of short read sequencing experiments

What is the goal of your experiment?
- For typical RNA-seq expression level quantification, a read or read pair gets one count, regardless of the length. As a result, shorter reads may provide just as good data, as long as they aren't so short that repetitive mapping is a problem.
- Longer and/or paired reads are surely beneficial if the experimental goal is
  - novel gene discovery: longer reads are much better at identifying novel splice junctions
- According to Chhangawala et al., 2015, 50-bp reads are adequate for identifying differentially expressed genes. Identifying novel splice junctions, however, can benefit from paired and longer reads.
- For variant discovery, coverage is key, whether it's fewer long reads or more shorter reads (as long as the reads are long enough to map uniquely)
How much read length is used for primers, adapters, barcodes, etc.? Of course make sure that enough actual experimental DNA is left for effective mapping.

The magnitude of a lane effect is typically small but typically non-zero.
To balance any lane effect, sequence all of your samples on each of your lanes.
Another benefit of barcoding and mixing all samples together is that the samples can be re-sequenced in other lanes in the future (from the same library preparation) without unbalancing the experimental design.

One resource is the RNASeqPower R package described in Hart et al., 2019
A good resource to help with this analysis is Ching et al.
The main command from RNASeqPower is rnapower(depth, n, cv, effect, alpha, power), where one can enter all values but one, and the algorithm will calculate the missing value
The inputs are
- depth = the number of reads (of a median gene, which according to Ching et al. is abou 16-32 per million mapped reads)
- n = sample size (number of biological replicates per group)
- cv = coefficient of variation (sqrt(disperson)), which can vary between 0.2 (cell lines) and 0.5 (animals or human subjects); see the above references for a discussion of typical values for different types of experiments
- effect = effect size that one wants to identify (such as 2 for a 2-fold difference)
- alpha = test statistic (such as FDR) threshold, such as 0.05
- power = power of the test (the fraction of true positives that will be detected); can be calculated or set to a value like 0.8

Some useful references:
- Sims et al., 2014. Sequencing depth and coverage: key considerations in genomic analyses.
  - Includes methods to estimate the number of reads required for single nucleotide variant calling, and RNA-seq and ChIP-seq experiments
- Ajay et al., 2011. Accurate and comprehensive sequencing of personal genomes.
  - Includes methods to estimate the number of reads required for single nucleotide variant calling

Example 1 (genome sequencing): For a genome of 3e+9 nt, to get 35x coverage we would need:
- For 40-nt reads:
  - 3e+9 * 35 / 40 = 2.625e+09 => ~2.6 billion reads
- For 100-nt reads:
  - 3e+9 * 35 / 100 = 1.05e+09 => ~1 billion reads

Note: See TracWiki for help on using the wiki.