Experimental design of short read sequencing experiments

How long should the reads be? Should they be single or paired-end?

  • What is the goal of your experiment?
    • For typical RNA-seq expression level quantification, a read or read pair gets one count, regardless of the length. As a result, shorter reads may provide just as good data, as long as they aren't so short that repetitive mapping is a problem.
    • Longer and/or paired reads are surely beneficial if the experimental goal is
      • novel gene discovery: longer reads are much better at identifying novel splice junctions
    • According to Chhangawala et al., 2015, 50-bp reads are adequate for identifying differentially expressed genes. Identifying novel splice junctions, however, can benefit from paired and longer reads.
    • For variant discovery, coverage is key, whether it's fewer long reads or more shorter reads (as long as the reads are long enough to map uniquely)
  • How much read length is used for primers, adapters, barcodes, etc.? Of course make sure that enough actual experimental DNA is left for effective mapping.

If you are able to sequence more than one lane, how should the samples be partitioned?

  • The magnitude of a lane effect is typically small but typically non-zero.
  • To balance any lane effect, sequence all of your samples on each of your lanes.
  • Another benefit of barcoding and mixing all samples together is that the samples can be re-sequenced in other lanes in the future (from the same library preparation) without unbalancing the experimental design.

How many reads are needed for each sample?

  • Resources like Scotty can help with a power analysis for RNA-seq experiments.

How many replicates are needed for a RNA-seq experiment?

  • One resource is the RNASeqPower R package described in Hart et al., 2019
  • A good resource to help with this analysis is Ching et al.
  • The main command from RNASeqPower is rnapower(depth, n, cv, effect, alpha, power), where one can enter all values but one, and the algorithm will calculate the missing value
  • The inputs are
    • depth = the number of reads (of a median gene, which according to Ching et al. is abou 16-32 per million mapped reads)
    • n = sample size (number of biological replicates per group)
    • cv = coefficient of variation (sqrt(disperson)), which can vary between 0.2 (cell lines) and 0.5 (animals or human subjects); see the above references for a discussion of typical values for different types of experiments
    • effect = effect size that one wants to identify (such as 2 for a 2-fold difference)
    • alpha = test statistic (such as FDR) threshold, such as 0.05
    • power = power of the test (the fraction of true positives that will be detected); can be calculated or set to a value like 0.8

Calculating number of DNA or RNA reads needed to obtain the desired coverage

  • Example 1 (genome sequencing): For a genome of 3e+9 nt, to get 35x coverage we would need:
    • For 40-nt reads:
      • 3e+9 * 35 / 40 = 2.625e+09 => ~2.6 billion reads
    • For 100-nt reads:
      • 3e+9 * 35 / 100 = 1.05e+09 => ~1 billion reads
  • Example 2 (RNA_seq experiment):
    • If we have
      • 6 million 35x35-nt paired end reads
      • a genome with ~7000 genes expressed
      • average gene length = 5741 bp
    • then the total length of the transcriptome is 7000 x 5741 => 38,297,000 nt
    • and the total length of the reads is 6 million x 70 nt [35 + 35] => 420,000,000 nt
    • so the average coverage will be 420,000,000 / 38,297,000 => ~11x
    • but note that coverage will be very irregular to due a wide range of expression levels